
I have approximately 600 GB of photos collected over 13 years, now stored on a FreeBSD/ZFS server.

The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP, and many others :( ), spread over several deep subdirectories. In short: a TERRIBLE MESS with many duplicates.

So as a first pass I:

  • searched the tree for files of the same size (fast) and computed an MD5 checksum for those
  • collected the duplicated images (same size + same MD5 = duplicate); a minimal sketch of this pass is below
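
A minimal sketch of that first pass (the /pool/photos path is just a placeholder for the real tree):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    # Pass 1: bucket files by size (a cheap stat), then MD5 only the
    # files that collide on size.
    my %by_size;
    find(
        sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f },
        '/pool/photos',
    );

    for my $files ( grep { @$_ > 1 } values %by_size ) {
        my %by_md5;
        for my $path (@$files) {
            open my $fh, '<', $path or next;
            binmode $fh;
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
        }
        # Same size + same MD5 = byte-identical duplicates.
        print join( "\t", @$_ ), "\n" for grep { @$_ > 1 } values %by_md5;
    }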

This helped a lot, but there are still MANY, MANY duplicates:

  • photos that differ only in the EXIF/IPTC data added by some photo management software, but the image itself is the same (or at least "looks the same" and has the same dimensions)
  • or they are only resized versions of the original image
  • or they are "enhanced" versions of the originals, etc.

Now the questions:

  • how do I find duplicates by checksumming only the "pure image bytes" of a JPEG, without the EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in EXIF tags while the image itself is the same (so file checksumming doesn't work, but image checksumming could...). This is (I hope) not very complicated, but I need some direction.
  • what Perl module can extract the "pure" image data from a JPEG file in a form usable for comparison/checksumming?
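
To illustrate the direction I'm considering: I'm assuming ImageMagick's "signature" attribute (a digest of the decoded pixels) would work here, since metadata-only edits shouldn't change the pixels. A sketch:

    use strict;
    use warnings;
    use Image::Magick;

    # Checksum of the decoded pixels only; EXIF/IPTC edits should leave
    # it unchanged. Slower than file-level MD5, because every image has
    # to be decoded.
    sub pixel_signature {
        my ($path) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($path);
        return undef if $err;              # skip unreadable/corrupt files
        return $img->Get('signature');     # ImageMagick's pixel digest
    }

    for my $file (@ARGV) {
        my $sig = pixel_signature($file);
        print "$sig  $file\n" if defined $sig;
    }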

More complex

  • how do I find "similar" images that are only
    • resized versions of the originals
    • "enhanced" versions of the originals (from some photo manipulation program)
  • is there already an algorithm, available as a Unix command or a Perl module (XS?), that I can use to detect these special "duplicates"?
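
To make the second point concrete, the kind of thing I imagine is an "average hash": shrink each image to an 8x8 grayscale thumbnail and record, per pixel, whether it is brighter than the mean. Resized or lightly enhanced copies should then differ in only a few bits. A sketch of my understanding (untested on this collection):

    use strict;
    use warnings;
    use Image::Magick;

    # Average hash ("aHash"): 8x8 grayscale thumbnail, one bit per pixel
    # saying whether it is brighter than the mean. Resized or lightly
    # "enhanced" copies should differ in only a few bits.
    sub ahash {
        my ($path) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($path);
        return undef if $err;
        $img->Resize( geometry => '8x8!' );     # '!' ignores aspect ratio
        $img->Quantize( colorspace => 'Gray' );

        my @lum;
        my $sum = 0;
        for my $y ( 0 .. 7 ) {
            for my $x ( 0 .. 7 ) {
                my @px = $img->GetPixel( x => $x, y => $y );  # 0..1 values
                my $v  = 0;
                $v += $_ for @px;
                $v /= @px;                      # average the channels
                push @lum, $v;
                $sum += $v;
            }
        }
        my $mean = $sum / 64;
        return join '', map { $_ > $mean ? '1' : '0' } @lum;  # 64 "bits"
    }

    # Bits that differ between two hashes; a handful or fewer usually
    # means "the same picture".
    sub hamming {
        my ( $h1, $h2 ) = @_;
        return ( $h1 ^ $h2 ) =~ tr/\0//c;
    }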

I'm able to write complex scripts in bash and I "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and OS X over the network (though working with 600 GB over the LAN is not the fastest way)...

My rough idea:

  • delete images only at the end of the workflow
  • use an Image::ExifTool script to collect candidate duplicates based on image-creation date and camera model (maybe other EXIF data too); see the sketch after this list
  • make a checksum of the pure image data (or extract a histogram; identical images should have identical histograms) - not sure about this
  • use some similarity detection to find duplicates produced by resizing and photo enhancement - no idea how to do this...
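
For the Image::ExifTool step, something like this sketch is what I have in mind (using DateTimeOriginal plus Model as the grouping key is just my guess at a useful choice):

    use strict;
    use warnings;
    use Image::ExifTool;

    # Bucket photos by capture time + camera model; files sharing a
    # bucket become candidates for the expensive pixel comparisons.
    my $et = Image::ExifTool->new;
    my %bucket;
    for my $file (@ARGV) {
        my $info = $et->ImageInfo( $file, 'DateTimeOriginal', 'Model' );
        my $key  = join '|',
            $info->{DateTimeOriginal} // '?',
            $info->{Model}            // '?';
        push @{ $bucket{$key} }, $file;
    }
    for my $key ( sort keys %bucket ) {
        next unless @{ $bucket{$key} } > 1;
        print "$key\n\t", join( "\n\t", @{ $bucket{$key} } ), "\n";
    }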

Any idea, help, or (software/algorithm) hint on how to bring order to the chaos?

PS:

There is a nearly identical question, Finding Duplicate image files, but I have already done what its answer suggests (MD5), and I'm looking for more precise checksumming and image-comparison algorithms.


4 Answers


Have you seen this article by Randal Schwartz? He uses a Perl script with ImageMagick to build resized (4x4 RGB grid) versions of the photos, then compares those to flag "similar" photos.
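
A rough paraphrase of the idea in Perl (my sketch, not Randal's actual script): reduce each photo to a 4x4 grid with ImageMagick, keep the resulting per-pixel channel values, and call two photos similar when every value is within a small tolerance.

    use strict;
    use warnings;
    use Image::Magick;

    # Reduce an image to a 4x4 grid and return the flattened list of
    # per-pixel channel values (normalized to 0..1).
    sub grid {
        my ($path) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($path);
        return undef if $err;
        $img->Resize( geometry => '4x4!' );
        my @vals;
        for my $y ( 0 .. 3 ) {
            for my $x ( 0 .. 3 ) {
                push @vals, $img->GetPixel( x => $x, y => $y );
            }
        }
        return \@vals;
    }

    # Two grids are "similar" if every channel differs by less than a
    # tolerance (0.10 here is an arbitrary starting point).
    sub similar {
        my ( $g1, $g2, $tol ) = @_;
        $tol //= 0.10;
        for my $i ( 0 .. $#$g1 ) {
            return 0 if abs( $g1->[$i] - $g2->[$i] ) > $tol;
        }
        return 1;
    }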

answered 2013-08-28T03:30:29.877