
I have approximately 600 GB of photos collected over 13 years, now stored on a FreeBSD/ZFS server.

The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP, and many others :( ), spread over several deep subdirectories. In short: a TERRIBLE MESS with many duplicates.

So as a first pass I:

  • searched the tree for files of the same size (fast) and computed an MD5 checksum for those
  • collected the duplicated images (same size + same MD5 = duplicate); a minimal sketch of this pass is below
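
A minimal sketch of that first pass (the /pool/photos path is just a placeholder for the real tree):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    # Pass 1: bucket files by size (a cheap stat), then MD5 only the
    # files that collide on size.
    my %by_size;
    find(
        sub { push @{ $by_size{ -s _ } }, $File::Find::name if -f },
        '/pool/photos',
    );

    for my $files ( grep { @$_ > 1 } values %by_size ) {
        my %by_md5;
        for my $path (@$files) {
            open my $fh, '<', $path or next;
            binmode $fh;
            push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
        }
        # Same size + same MD5 = byte-identical duplicates.
        print join( "\t", @$_ ), "\n" for grep { @$_ > 1 } values %by_md5;
    }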

This helped a lot, but there are still MANY, MANY duplicates:

  • photos that differ only in the EXIF/IPTC data added by some photo management software, but the image itself is the same (or at least "looks the same" and has the same dimensions)
  • or they are only resized versions of the original image
  • or they are "enhanced" versions of the originals, etc.

Now the questions:

  • how do I find duplicates by checksumming only the "pure image bytes" of a JPEG, without the EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in EXIF tags while the image itself is the same (so file checksumming doesn't work, but image checksumming could...). This is (I hope) not very complicated, but I need some direction.
  • what Perl module can extract the "pure" image data from a JPEG file in a form usable for comparison/checksumming?
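
To illustrate the direction I'm considering: I'm assuming ImageMagick's "signature" attribute (a digest of the decoded pixels) would work here, since metadata-only edits shouldn't change the pixels. A sketch:

    use strict;
    use warnings;
    use Image::Magick;

    # Checksum of the decoded pixels only; EXIF/IPTC edits should leave
    # it unchanged. Slower than file-level MD5, because every image has
    # to be decoded.
    sub pixel_signature {
        my ($path) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($path);
        return undef if $err;              # skip unreadable/corrupt files
        return $img->Get('signature');     # ImageMagick's pixel digest
    }

    for my $file (@ARGV) {
        my $sig = pixel_signature($file);
        print "$sig  $file\n" if defined $sig;
    }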

More complex

  • how do I find "similar" images that are only
    • resized versions of the originals
    • "enhanced" versions of the originals (from some photo manipulation program)
  • is there already an algorithm, available as a Unix command or a Perl module (XS?), that I can use to detect these special "duplicates"?
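
To make the second point concrete, the kind of thing I imagine is an "average hash": shrink each image to an 8x8 grayscale thumbnail and record, per pixel, whether it is brighter than the mean. Resized or lightly enhanced copies should then differ in only a few bits. A sketch of my understanding (untested on this collection):

    use strict;
    use warnings;
    use Image::Magick;

    # Average hash ("aHash"): 8x8 grayscale thumbnail, one bit per pixel
    # saying whether it is brighter than the mean. Resized or lightly
    # "enhanced" copies should differ in only a few bits.
    sub ahash {
        my ($path) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($path);
        return undef if $err;
        $img->Resize( geometry => '8x8!' );     # '!' ignores aspect ratio
        $img->Quantize( colorspace => 'Gray' );

        my @lum;
        my $sum = 0;
        for my $y ( 0 .. 7 ) {
            for my $x ( 0 .. 7 ) {
                my @px = $img->GetPixel( x => $x, y => $y );  # 0..1 values
                my $v  = 0;
                $v += $_ for @px;
                $v /= @px;                      # average the channels
                push @lum, $v;
                $sum += $v;
            }
        }
        my $mean = $sum / 64;
        return join '', map { $_ > $mean ? '1' : '0' } @lum;  # 64 "bits"
    }

    # Bits that differ between two hashes; a handful or fewer usually
    # means "the same picture".
    sub hamming {
        my ( $h1, $h2 ) = @_;
        return ( $h1 ^ $h2 ) =~ tr/\0//c;
    }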

I'm able to write complex scripts in bash and I "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and OS X over the network (though working with 600 GB over the LAN is not the fastest way)...

My rough idea:

  • delete images only at the end of the workflow
  • use an Image::ExifTool script to collect candidate duplicates based on image-creation date and camera model (maybe other EXIF data too); see the sketch after this list
  • make a checksum of the pure image data (or extract a histogram; identical images should have identical histograms) - not sure about this
  • use some similarity detection to find duplicates produced by resizing and photo enhancement - no idea how to do this...
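
For the Image::ExifTool step, something like this sketch is what I have in mind (using DateTimeOriginal plus Model as the grouping key is just my guess at a useful choice):

    use strict;
    use warnings;
    use Image::ExifTool;

    # Bucket photos by capture time + camera model; files sharing a
    # bucket become candidates for the expensive pixel comparisons.
    my $et = Image::ExifTool->new;
    my %bucket;
    for my $file (@ARGV) {
        my $info = $et->ImageInfo( $file, 'DateTimeOriginal', 'Model' );
        my $key  = join '|',
            $info->{DateTimeOriginal} // '?',
            $info->{Model}            // '?';
        push @{ $bucket{$key} }, $file;
    }
    for my $key ( sort keys %bucket ) {
        next unless @{ $bucket{$key} } > 1;
        print "$key\n\t", join( "\n\t", @{ $bucket{$key} } ), "\n";
    }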

Any idea, help, or (software/algorithm) hint on how to bring order to the chaos?

PS:

There is a nearly identical question, Finding Duplicate image files, but I have already done what its answer suggests (MD5), and I'm looking for more precise checksumming and image-comparison algorithms.


4 Answers


Have you seen this article by Randal Schwartz? He uses a Perl script with ImageMagick to build resized (4x4 RGB grid) versions of the photos, then compares those to flag "similar" photos.
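
A rough paraphrase of the idea in Perl (my sketch, not Randal's actual script): reduce each photo to a 4x4 grid with ImageMagick, keep the resulting per-pixel channel values, and call two photos similar when every value is within a small tolerance.

    use strict;
    use warnings;
    use Image::Magick;

    # Reduce an image to a 4x4 grid and return the flattened list of
    # per-pixel channel values (normalized to 0..1).
    sub grid {
        my ($path) = @_;
        my $img = Image::Magick->new;
        my $err = $img->Read($path);
        return undef if $err;
        $img->Resize( geometry => '4x4!' );
        my @vals;
        for my $y ( 0 .. 3 ) {
            for my $x ( 0 .. 3 ) {
                push @vals, $img->GetPixel( x => $x, y => $y );
            }
        }
        return \@vals;
    }

    # Two grids are "similar" if every channel differs by less than a
    # tolerance (0.10 here is an arbitrary starting point).
    sub similar {
        my ( $g1, $g2, $tol ) = @_;
        $tol //= 0.10;
        for my $i ( 0 .. $#$g1 ) {
            return 0 if abs( $g1->[$i] - $g2->[$i] ) > $tol;
        }
        return 1;
    }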

answered 2013-08-28T03:30:29.877