June 18, 2014
I have several backups of my old mobile phones, with a lot of pictures on them.
When I wanted to categorize them, it turned out that there were duplicated pictures (really? ;)),
so after a little script, I could find all the duplicated files and choose which of them to delete.
This script will not touch your files; it only displays all the duplicated files it finds, both in your terminal and in a CSV file.
Feel free to open it with your preferred spreadsheet program.
FYI: the analysis is based on the size and checksum of the files, not on the date or filename.
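For example, the two copies of IMAG0462.jpg found in the run below live in different folders, yet md5sum reports the same checksum for both (an illustration reconstructed from the result shown further down):

$ md5sum "/data/photos - images/photos classées/Annee_2011/miaou/IMAG0462.jpg" \
         "/data/photos - images/photos classées/Annee_2012/miaou/IMAG0462.jpg"
53b69ddf18c82f222046e8f616955bc6  /data/photos - images/photos classées/Annee_2011/miaou/IMAG0462.jpg
53b69ddf18c82f222046e8f616955bc6  /data/photos - images/photos classées/Annee_2012/miaou/IMAG0462.jpg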
The result looks like this:
$ ./find_dups.sh -dir /data/photos\ -\ images/
find duplicate files size
md5sum duplicated files size
isolate duplicated files found
print the result
53b69ddf18c82f222046e8f616955bc6;;/data/photos - images/photos classées/Annee_2011/miaou/IMAG0462.jpg;/data/photos - images/photos classées/Annee_2012/miaou/IMAG0462.jpg
the result can be found in: '/tmp/find_dups.sh/04-result.csv'
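If you just want the duplicated paths, without the checksums, to review them before deleting anything, a quick one-liner on the result file could look like this (not part of the script; the field numbers assume the hash;;file1;file2 layout shown above):

$ tr -d '\0' < /tmp/find_dups.sh/04-result.csv | awk -F';' '{ for (i=3; i<=NF; i++) print $i }'   # strip the NUL separators, then print every path field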
Download: find_duplicates.sh
#!/bin/bash

v_dir="."
v_output_dir="/tmp/$(basename $0)"
v_sizemin=

# Checks whether the flag is followed by a value.
# If there is no value, move on to the next argument.
function analyze_arg {
    # if $2 is not empty and does not start with a dash
    if [ -n "$2" -a "$(echo $2 |cut -c1)" != "-" ]
    then
        eval $1=\"$2\"
        return 2
    else
        return 1
    fi
}

while [ "$#" -gt "0" ]    # while there are arguments left
do
    case $1 in
        -dir)     analyze_arg v_dir "$2"        ; shift $? ;;
        -outdir)  analyze_arg v_output_dir "$2" ; shift $? ;;
        -sizemin) analyze_arg v_sizemin "$2"    ; shift $? ;;
        *)
            echo "Incorrect argument detected"
            echo "$0 [-dir <path (default: current)>] [-outdir <path (default: /tmp)>] [-sizemin <size>]"
            exit 1
            ;;
    esac
done

[ -e "$v_output_dir" -a ! -d "$v_output_dir" ] && {
    echo "error: output dir already exists and is not a directory"
    exit 1
}

[ -n "$v_sizemin" ] && v_sizemin="-size +${v_sizemin}M"
[ ! -d "$v_output_dir" ] && mkdir -p "$v_output_dir"

# 1) list files sharing the same size (size left-padded to 25 chars, so uniq can compare it)
v_files_size="${v_output_dir}/01-files_size.list"
echo "find duplicate files size"
find "${v_dir}" -type f -not -empty ${v_sizemin} -printf "%-25s%p\n" |sort -n |uniq -D -w25 > "${v_files_size}"

# 2) md5sum only those candidates; md5sum separates hash and path with two spaces, replaced by a NUL
v_files_md5="${v_output_dir}/02-files.md5"
echo "md5sum duplicated files size"
cut -b26- "${v_files_size}" |xargs -d"\n" -n1 md5sum |sed "s/  /\x0/" > "${v_files_md5}"

# 3) keep only entries whose 32-char md5 appears more than once
echo "isolate duplicated files found"
v_files_dup="${v_output_dir}/03-files.dup"
uniq -D -w32 "${v_files_md5}" > "${v_files_dup}"

# 4) group the paths of each duplicate set on one CSV line, prefixed by the md5
echo "print the result"
v_result="${v_output_dir}/04-result.csv"
awk -F"\0" 'BEGIN { l="" }
{
    if ( l != $1 || l == "" ) { printf ("\n%s\0;",$1) }
    printf ("\0;%s",$2)
    l = $1
}
END { printf "\n" }' "${v_files_dup}" |sed "/^$/d" |tee "${v_result}"

echo "the result can be found in: '${v_result}'"
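For the record, the optional flags can be combined. A hypothetical invocation that only looks at files bigger than 2 MB and keeps the work files outside /tmp could be:

$ ./find_dups.sh -dir ~/backups -outdir ~/dup-report -sizemin 2   # ~/backups and ~/dup-report are made-up paths

The -sizemin value is in megabytes: the script simply turns it into a "find -size +2M" test.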
For information, the « fdupes » tool lets you detect duplicate files easily. I haven't fully checked the results, but it seems pretty useful and its output appears to match your script's (I didn't check your work!). Besides, for lazy people like me, a package exists (for Debian at least).
Yeah,
fdupes seems perfect for that.
My script is now outdated 😉
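For reference, a minimal fdupes session on Debian could look like the sketch below (check the man page; I did not re-verify every option):

$ sudo apt-get install fdupes
$ fdupes -r "/data/photos - images/"     # list groups of duplicate files recursively
$ fdupes -rd "/data/photos - images/"    # -d asks which copy to keep in each group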