June 18, 2014

I have several backups of my old mobile phones, containing a lot of pictures.

When I wanted to categorize them, it turned out that there were duplicate pictures (really? ;))

So, after a little script, I could find all the duplicated files and choose which of them to delete.

This script will not touch your files; it only displays all the duplicated files it finds, both in your terminal and in a CSV file.
Feel free to open the CSV with your preferred spreadsheet program.

FYI: the analysis is based on the size and checksum of the files, not on the date or the filename.
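
To see why the size filter matters: a brute-force approach (a rough sketch, assuming GNU findutils/coreutils) would checksum every single file and keep the hashes that repeat. The script below avoids most of that work by checksumming only files that already share their size with another file.

# brute force: checksum everything, then keep the lines whose 32-char md5 hash repeats
find . -type f -not -empty -exec md5sum {} + | sort | uniq -D -w32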

The result looks like this:

$ ./find_dups.sh -dir /data/photos\ -\ images/
find duplicate files size
md5sum duplicated files size
isolate duplicated files found
print the result
53b69ddf18c82f222046e8f616955bc6;;/data/photos - images/photos classées/Annee_2011/miaou/IMAG0462.jpg;/data/photos - images/photos classées/Annee_2012/miaou/IMAG0462.jpg
the result can be found in: '/tmp/find_dups.sh/04-result.csv'

Download: find_duplicates.sh

#!/bin/bash

v_dir="."
v_output_dir="/tmp/$(basename $0)"
v_sizemin=

# Checks whether a value follows the flag in $2; if there is none, move on to the next flag.
# Returns 2 when a value was consumed and 1 otherwise, so the caller can "shift $?".
function analyze_arg {
	# if $2 is not empty and does not start with a dash
	if [ -n "$2" -a "$(echo "$2" |cut -c1)" != "-" ]
	then
		eval $1=\"$2\"
		return 2
	else
		return 1
	fi
}

while [ "$#" -gt "0" ] # while there are still arguments to process
do
	case "$1" in
		-dir) analyze_arg v_dir "$2" ; shift $?  ;;
		-outdir) analyze_arg v_output_dir "$2" ; shift $?  ;;
		-sizemin) analyze_arg v_sizemin "$2" ; shift $?  ;;
		*) echo "Incorrect argument detected"
			echo "$0 [-dir <path(default current)>] [-outdir <path( default /tmp>] [-sizemin <size>]"
			exit 1
			;;
	esac
done

[ -e "$v_output_dir" -a ! -d "$v_output_dir" ] && {
	echo "error: output dir already exists and is not a directory"
	exit 1
}

[ -n "$v_sizemin" ] && v_sizemin="-size +${v_sizemin}M"

[ ! -d "$v_output_dir" ] && mkdir -p "$v_output_dir"

v_files_size="${v_output_dir}/01-files_size.list"
echo "find duplicate files size"
find "${v_dir}" -type f -not -empty ${v_sizemin} -printf "%-25s%p\n" |sort -n |uniq -D -w25 > "${v_files_size}"

v_files_md5="${v_output_dir}/02-files.md5"
echo "md5sum duplicated files size"
cut -b26- "${v_files_size}" |xargs -d"\n" -n1 md5sum |sed "s/  /\x0/" > "${v_files_md5}"

echo "isolate duplicated files found"
v_files_dup="${v_output_dir}/03-files.dup"
# sort first: uniq only merges adjacent lines, so identical files that end up non-adjacent in the list would otherwise be missed
sort "${v_files_md5}" | uniq -D -w32 > "${v_files_dup}"

echo "print the result"
v_result="${v_output_dir}/04-result.csv"
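# Turn the duplicate list into CSV: start a new line whenever the hash changes,
# then append every path sharing that hash; the trailing sed removes the blank
# line created by the leading "\n", and tee prints the result while saving it.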
awk -F"\0" 'BEGIN {
		l=""
	} {
		if ( l != $1 || l == "" ) {
			printf ("\n%s\0;",$1)
		}
		printf ("\0;%s",$2)
		l = $1
	}
	END {
		printf "\n"
	}' "${v_files_dup}" |sed "/^$/d" |tee "${v_result}"

echo "the result can be found in: '${v_result}'"

  2 comments on "find duplicated files efficiently"

  1. For information, the "fdupes" software lets you detect duplicate files easily. I haven't fully checked the results, but it seems pretty useful and appears to match your script's output (I didn't verify your work!). Besides, for lazy people like me, a package exists (for Debian at least).
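
For reference, a minimal fdupes invocation covering the same ground could look like this (just a sketch, using the Debian fdupes package and the directory from the example above); unlike the script, fdupes can also delete interactively with -d, so handle it with care:

# recurse into the directory and list groups of identical files along with their size
fdupes -r -S "/data/photos - images"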
