Repo find files diff

get the md5sum of all files in the first dir (dir1)

find /dir1/ -type f -exec md5sum {} + | sort -k 1 > dir1.txt

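The output contains one line per file, checksum first and path second, like this (the sample hashes are 40 hex digits long, which is sha1sum-style output; md5sum prints 32 hex digits, but the layout is the same)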

01ac660edad41658b5d6ba67f371aa7eee3211fc  ./Deepak/Programm/ocv2/star.jpg
033098d600668b69a8e607899687a9ab33ab54f1  ./Deepak/MIT/Seminar/report.log

sort usage

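  • the argument -k selects the column (field) to sort by
  • in our case we sort by the md5sum field, which is column 1
  • that's why we use -k 1 (strictly, -k 1 uses everything from column 1 to the end of the line as the key; -k 1,1 limits the key to column 1 only)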
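
To make the effect of -k visible, here is a small sketch on made-up checksums and paths (none of them come from a real run). It assumes a sort that supports -s (stable), as GNU and BSD sort do; LC_ALL=C is set only to keep the ordering predictable.

```shell
# Hypothetical sample: two files share the checksum "aaaa".
sample=$(printf '%s\n' 'bbbb  ./z.txt' 'aaaa  ./y.txt' 'aaaa  ./x.txt')

# Key = column 1 to end of line: ties on the checksum are broken by the path.
by_line=$(printf '%s\n' "$sample" | LC_ALL=C sort -k 1)

# Key = column 1 only; -s (stable) keeps the input order among equal checksums.
by_field=$(printf '%s\n' "$sample" | LC_ALL=C sort -s -k 1,1)

printf '%s\n---\n%s\n' "$by_line" "$by_field"
```

Either form groups identical checksums on adjacent lines, which is all the later steps need.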

repeat the same for the other dir, i.e. dir2, writing the output to dir2.txt
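
Since the same command runs once per directory, it can be wrapped in a small function. This is only a sketch: hash_dir is a made-up helper name, and the demo tree under a temporary directory exists just so the example runs anywhere; substitute the real directories.

```shell
# hash_dir is a hypothetical helper. Usage: hash_dir DIRECTORY OUTPUT_FILE
hash_dir() {
    find "$1" -type f -exec md5sum {} + | sort -k 1 > "$2"
}

# Throwaway demo tree (assumption, not part of the original recipe).
work=$(mktemp -d)
mkdir "$work/dir1" "$work/dir2"
printf 'same\n' > "$work/dir1/a.txt"
printf 'same\n' > "$work/dir2/a_copy.txt"
printf 'only here\n' > "$work/dir2/b.txt"

hash_dir "$work/dir1" "$work/dir1.txt"
hash_dir "$work/dir2" "$work/dir2.txt"
cat "$work/dir1.txt" "$work/dir2.txt"
```

The two files with identical content get the same checksum even though their names differ, which is what the later steps rely on.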

merge both the files and sort

cat dir1.txt dir2.txt | sort -k 1 > all_files_md5.txt

run this awk script to get all the files that have duplicates

delete_duplicate_md5_but_keep_line.awk
awk '{if ( $1==old ) { if (cnt == 0 ) {cnt=cnt+1; print "\n\n" oldline "\n" $0 } else { print $0 } } else { cnt=0; }; old=$1; oldline=$0;}' all_files_md5.txt > duplicates.txt
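
To see what the script does, here is the same logic spread over several lines with comments and run on a made-up, already sorted list. The checksums and paths are invented for illustration.

```shell
# Made-up, already sorted checksum list (stand-in for all_files_md5.txt).
sample=$(printf '%s\n' \
    'aaaa  ./dir1/a.txt' \
    'aaaa  ./dir2/a_copy.txt' \
    'bbbb  ./dir2/b.txt' \
    'cccc  ./dir1/c.txt' \
    'cccc  ./dir1/c2.txt' \
    'cccc  ./dir2/c.txt')

dups=$(printf '%s\n' "$sample" | awk '
{
    if ($1 == old) {
        if (cnt == 0) {
            # second line of a group: print a separator, the held-back
            # first line of the group, and the current line
            cnt = cnt + 1
            print "\n\n" oldline "\n" $0
        } else {
            # third and later lines of the same group
            print $0
        }
    } else {
        # checksum changed: a new group starts
        cnt = 0
    }
    old = $1
    oldline = $0
}')
printf '%s\n' "$dups"
```

The aaaa pair and the cccc triple are printed as two groups separated by blank lines; the bbbb line has no partner and is left out.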

diff the duplicates against the original md5 file list

diff -u duplicates.txt all_files_md5.txt | grep '^+[^+]' | sort -k2

explanation

  • we find the diff between the duplicates and the original list
  • the lines that diff marks as added (prefixed with +) are our unique files
  • we sort them by folder path, so we can address the conflicts
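
As a cross-check, the unique files can also be listed without diff: count how often each checksum occurs, then print the lines whose checksum occurs exactly once. This is an alternative sketch, not the method above, and the data is made up.

```shell
# Made-up merged list (stand-in for all_files_md5.txt).
work=$(mktemp -d)
printf '%s\n' \
    'aaaa  ./dir1/a.txt' \
    'aaaa  ./dir2/a_copy.txt' \
    'bbbb  ./dir2/b.txt' \
    'cccc  ./dir1/c.txt' > "$work/all_files_md5.txt"

# The file is read twice. Pass 1 (NR == FNR): count each checksum.
# Pass 2: print only the lines whose checksum was seen once.
unique=$(awk 'NR == FNR { seen[$1]++; next } seen[$1] == 1' \
    "$work/all_files_md5.txt" "$work/all_files_md5.txt" | LC_ALL=C sort -k 2)
printf '%s\n' "$unique"
```

Here the output lists ./dir1/c.txt and ./dir2/b.txt, sorted by path; both a.txt copies are skipped because they share a checksum.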

References

[1] : Comparing the contents of two directories