Hello and Welcome to HiFiVision.com - an online community for home entertainment and tech enthusiasts!


Which is the best tool for removing duplicate files on desktop


petes

Member
Joined
Dec 2, 2010
Messages
274
Points
18
Location
Bangalore
Folks - I have been using my computer for almost 18 years, adding multiple hard disks over that period. I now have over 3TB of data spread across different hard disks (internal/external), which obviously leaves room for duplicate files.

I am looking for a good freeware tool to clean up this mess. I am also fine with a paid version, if it does the job accurately.

Just a rough analysis of the contents:

Photos and Videos 70%
MS-Office 20%
Other files 10%


Help me
 

sdk

Active Member
Joined
Sep 28, 2014
Messages
142
Points
28
Location
Mumbai
If you are planning to consolidate to one hard disk, then try FreeFileSync.

Set up your folders, copy from the different hard disks to the consolidated one, and delete/format the sources.
 

Hiten

Well-Known Member
Joined
Oct 17, 2008
Messages
2,898
Points
113
Location
Kalyan
+1 to CCleaner. It is a useful tool with other options too. Be careful with the registry cleaner though.
Regards
 

Mayank Shah

Well-Known Member
Joined
May 22, 2015
Messages
735
Points
63
Location
Madras
Ashampoo WinOptimizer and Glary Utilities. I have been using both for a long time. They work brilliantly. They also have options to permanently delete files and wipe free space, among many others.
 

sandeepsasi

Active Member
Joined
Dec 14, 2018
Messages
66
Points
33
Location
Bangalore
Hi,

I don't know whether these tools can catch files with different names. I often rename files if I find the names unnecessarily long. In this situation, I usually write simple Python scripts for the job, which internally use the md5sum command to identify files with the same content but different names.
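A minimal sketch of that idea in Python, using the standard hashlib module instead of calling out to md5sum (the file names in the comment are made up):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's contents, read in chunks
    so that even multi-gigabyte files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Two files are duplicates when their digests match, whatever their names:
# md5_of("dir1/IMG_0042.jpg") == md5_of("dir2/beach.jpg")
```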
 

captrajesh

Moderator
Staff member
Joined
Oct 18, 2009
Messages
5,279
Points
113
Location
Hyderabad
Hi,

I don't know whether these tools can catch files with different names. I often rename files if I find the names unnecessarily long. In this situation, I usually write simple Python scripts for the job, which internally use the md5sum command to identify files with the same content but different names.
And how do computer noobs do it!? Could you elaborate?
 

sbg

Active Member
Joined
Dec 28, 2009
Messages
435
Points
43
Location
bangalore
+1 to CCleaner from me too.
Their file recovery tool (Recuva) is also really good.
 

harrsha

Member
Joined
Apr 15, 2014
Messages
33
Points
8
Location
andhra
Hi,

I don't know whether these tools can catch files with different names. I often rename files if I find the names unnecessarily long. In this situation, I usually write simple Python scripts for the job, which internally use the md5sum command to identify files with the same content but different names.
Can you elaborate on the procedure and provide the script? Cheers
 

sandeepsasi

Active Member
Joined
Dec 14, 2018
Messages
66
Points
33
Location
Bangalore
And how do computer noobs do it!? Could you elaborate?
Can you elaborate on the procedure and provide the script? Cheers
Hi,

Thanks for showing interest. Before going into the details of the procedure and the script itself, I'll first elaborate on the kinds of problems I usually run into.

1. Managing a large, unorganized collection of files and directories that has accumulated over time and may contain duplicate files with different names. Here the task is to find only those files that have duplicates.
2. I play music from a hard drive (primary) and have another drive for backup (master). On the primary drive, I often end up renaming files and directories, deleting music that I don't listen to, or removing unwanted text files and Windows Thumbs.db files. This is my crude way of organizing my music collection.

Recently, the primary drive was damaged after I dropped it by accident. I could have rebuilt the primary collection from the master on a new drive, but I didn't want to spend a lot of time organizing the collection all over again. So I approached a data recovery service, and they said they would be able to recover most of the files. I was not very happy when they said "most". I wanted them to be more specific, and then they said 99%. Fine, but was it 99% of every file, or 99% of all the files? Could they tell me exactly which files they were not able to recover 100%? They were not too eager to answer my questions. This data recovery service is rated as one of the best in Bangalore, and I went ahead with the recovery anyway, in spite of my apprehensions.

After the recovery was done, I could see that some of the files were hidden, and they said these were the files that couldn't be recovered fully. But I wanted to be sure that the rest of the files had been recovered correctly, bit for bit. I had the master copy, but how do I locate the same files in the master, given that the names and paths are not exactly the same? And even if I do, how do I verify a bit-for-bit match?

In general, the technique I use can compare collections of files in one or more file systems and find similarities or differences. The files are compared purely by their contents, bit for bit, and not by their names, extensions or paths. I use Python's built-in data structures as a front-end, to build an in-memory look-up table of files, find differences or similarities, and decide what is to be done about them. Unix utilities are used as a back-end, to get the list of files and compute their check-sums - the heavy lifting is done by these utilities.

Python was used just for convenience; I could very well have written a Unix shell script. The core of the technique I am going to describe is finding files and computing their check-sums. What is to be done with the findings is use-case specific: whether to delete all but one of the copies or to move the duplicates to another directory, and in either case, which one(s)? Can deletions be silent, or do you need confirmation? Or is all you need a report of what the duplicates are and where? In use case 2 described above, two separate file systems are compared and the similarities are expected; there, it is the differences that have to be reported.

I am going to describe this technique using an example, with only Unix utilities and no Python. For every command, a small excerpt of the transcript from the Unix terminal is provided. Windows users may install an environment like Cygwin or MinGW and run these commands; they have been tested in the Cygwin terminal on my Windows machine. I apologize in advance for not having a push-button solution.

The example illustrates how to find duplicate files purely based on file contents, not file names, paths or extensions (.jpg vs .jpeg vs .JPG). I created a representative collection of files and directories. The collection contains:

1. Files having different names and different contents, spread across same or different directories (unique files)
2. Files having different names and same contents, spread across same or different directories (duplicates, named differently)
3. Files having same names and different contents, spread across same or different directories (eg. dir1/Album.jpg and dir2/dir3/Album.jpg, or same song ripped from different sources; these have to be treated as different files)
4. Files having same names and same contents, spread across same or different directories (duplicates again)

By contents, I mean binary data. Files containing "Hello World" and "Hello World " have different contents, because of the extra trailing white space in the latter. We are not interested in cases (1) and (3). Case (4) can be identified visually, but that may not be practical if the number of files and directories is large. Given below is a preview of the collection that I created.

$ cd dupefinder/ # This is the directory containing the collection
$ tree . # Commands executed in the terminal
.
├── dir11
│ ├── another.FLAC
│ ├── dir21
│ │ ├── dir31
│ │ └── dir32
│ ├── dir22
│ └── dir23
│ ├── dir31
│ │ └── dir41
...
├── dup_hash.txt
├── hello.flac
└── songs1.flac

39 directories, 22 files


The process of finding the duplicates can be broken down into three steps.

Step 1. Finding all files, computing their check-sums and saving the results in a text file.

$ find . -type f -print0 | xargs -0 -I '{}' md5sum '{}' | sort > ../all_file_data.txt

The Unix find command is used to get the list of all files. Since file names can contain white space, the -print0 switch is used so that the file names in the output of find are separated by a NUL (\0) character rather than a newline. xargs is used to execute the check-sum command (md5sum) once for each file. md5sum can also process all the files and compute each check-sum in one go; I am not using that approach because, if the number of files is huge, there is a risk of exceeding the limit on the length of a command line on certain Unix systems. md5sum generates a 128-bit check-sum based on the binary contents of a file. This check-sum is,

1. Unique for all files containing the same bits, for all practical purposes
2. Even if there is only a bit flip in a gigabyte file, the md5sum output is vastly different. This makes it a good visual aid while working interactively on the command-line.
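That sensitivity is easy to demonstrate with Python's hashlib, using short strings as a stand-in for real file contents:

```python
import hashlib

# Only a trailing space differs between the two inputs, yet the two
# 128-bit digests have nothing visually in common.
a = hashlib.md5(b"Hello World").hexdigest()
b = hashlib.md5(b"Hello World ").hexdigest()
print(a)
print(b)
```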

Whenever large files like OS images are hosted on the internet, it is customary to zip the images and to provide the check-sum computed by md5sum. Users can run the md5sum command at their end, after downloading and unzipping the images, to detect errors as small as a single bit flip in files gigabytes in size.

The results of the above command are saved in a text file, ../all_file_data.txt. A preview of this file is given below:
$ head ../all_file_data.txt
05b22424749581089699fa0297a6c95b *./dir13/dir23/Album.png
07ba5b638edecd54a618a2d0d8df4e06 *./dir15/E.WAV
08e80b736bf3697ea57efd497fed29cc *./dir13/dir23/songs1.flac
08e80b736bf3697ea57efd497fed29cc *./dir13/dir24/test1.flac
08e80b736bf3697ea57efd497fed29cc *./songs1.flac
16fcd469a1ed4c0b1cb3d89c2e43a82e *./dir11/another1.wav
16fcd469a1ed4c0b1cb3d89c2e43a82e *./dir14/TEST 1 2 3 4.WAV
16fcd469a1ed4c0b1cb3d89c2e43a82e *./dir15/dir21/FILE 56.WAV
2cd6ee2c70b0bde53fbe6cac3c8b8bb1 *./dir11/dir23/dir32/C.WAV
2cd6ee2c70b0bde53fbe6cac3c8b8bb1 *./dir15/ff.wav


The first column contains the check-sums and the second column contains the respective file names (paths). The sort command has sorted the output of xargs, so that duplicate files having the same check-sum occur next to each other in the list; sort also helps us in the next step, as explained below. This command took about 15 minutes to run on a 2TB collection of more than 10,000 files. The run time will vary from machine to machine.
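For anyone who prefers to stay in Python, Step 1 can be sketched with os.walk standing in for find and hashlib for md5sum (this is my illustration, not the author's script; the directory name is an assumption):

```python
import hashlib
import os

def checksum_listing(root):
    """Walk `root`, MD5 every regular file and return sorted
    "digest *path" lines - the analogue of find | xargs md5sum | sort."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            lines.append(f"{h.hexdigest()} *{path}")
    return sorted(lines)

# checksum_listing("dupefinder") would play the role of all_file_data.txt
```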

Step 2. Finding files having the same check-sum

As seen above, duplicate files, i.e., files having the same contents bit for bit, have the same check-sum. Now that we have the check-sums of all files, the next step is to find duplicate files by picking only those check-sums that occur more than once in the list.

$ awk '{ print $1 }' ../all_file_data.txt | uniq -c | awk '$1 > 1 { print $2 }' > ../repeating_sums.txt

The first awk command prints only the first column of ../all_file_data.txt, i.e., the check-sums. The output is piped to the uniq -c command. The uniq command removes duplicate lines in its input, if they occur next to each other; the sort command in the previous step ensures that identical check-sums do occur next to each other. The -c switch makes uniq prefix each line of output (check-sum) with a count of how many times it repeated; unique check-sums get a count of 1. The last awk command prints only those check-sums that occur more than once, i.e., the lines in the output of uniq -c having a count greater than 1. The results are saved in a text file, ../repeating_sums.txt. A preview of this file is given below:
$ head ../repeating_sums.txt
08e80b736bf3697ea57efd497fed29cc
16fcd469a1ed4c0b1cb3d89c2e43a82e
2cd6ee2c70b0bde53fbe6cac3c8b8bb1
3a6740fdb31bb82d3d28d08f506a87d2
57b8d745384127342f95660d97e1c9c2
86aa3244429584873af62a070669e9b5
bf072e9119077b4e76437a93986787ef
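Step 2 also maps naturally onto Python's collections.Counter, if you would rather avoid awk and uniq; a sketch with invented input lines:

```python
from collections import Counter

def repeating_sums(lines):
    """Return the check-sums that appear more than once in
    "digest *path" lines - Step 2 of the procedure."""
    counts = Counter(line.split()[0] for line in lines)
    return sorted(s for s, n in counts.items() if n > 1)

lines = [
    "08e8 *./a.flac",   # duplicate contents...
    "08e8 *./b.flac",   # ...under a different name
    "16fc *./c.wav",    # unique
]
# repeating_sums(lines) -> ["08e8"]
```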


Step 3. Printing only duplicates from the list

$ grep -nwf ../repeating_sums.txt ../all_file_data.txt
3:08e80b736bf3697ea57efd497fed29cc *./dir13/dir23/songs1.flac
4:08e80b736bf3697ea57efd497fed29cc *./dir13/dir24/test1.flac
5:08e80b736bf3697ea57efd497fed29cc *./songs1.flac
6:16fcd469a1ed4c0b1cb3d89c2e43a82e *./dir11/another1.wav
7:16fcd469a1ed4c0b1cb3d89c2e43a82e *./dir14/TEST 1 2 3 4.WAV
8:16fcd469a1ed4c0b1cb3d89c2e43a82e *./dir15/dir21/FILE 56.WAV
9:2cd6ee2c70b0bde53fbe6cac3c8b8bb1 *./dir11/dir23/dir32/C.WAV
10:2cd6ee2c70b0bde53fbe6cac3c8b8bb1 *./dir15/ff.wav
12:3a6740fdb31bb82d3d28d08f506a87d2 *./dir14/song1.wav
...


The Unix grep command serves this purpose. grep searches a file and prints the lines containing a pattern. In this case, the file to search is ../all_file_data.txt, and the -f switch reads the patterns to look for from ../repeating_sums.txt, i.e., the repeated check-sums. The -n switch causes grep to prefix each line of its output with the line number in ../all_file_data.txt where it occurs. The -w switch tells grep to match only full words; for example, grep -w and ... prints only lines containing "and", not "abandon", "band" or "android". The results can be saved to another text file for further processing.
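A Python sketch of the same filtering, with a set-membership test in place of grep (the helper name and the sample lines are invented):

```python
def duplicate_lines(all_lines, repeated):
    """Keep only the lines whose check-sum is in `repeated`,
    prefixed with a 1-based line number, like grep -n."""
    wanted = set(repeated)
    return [f"{i}:{line}"
            for i, line in enumerate(all_lines, start=1)
            if line.split()[0] in wanted]

# With all_lines read from all_file_data.txt and repeated from
# repeating_sums.txt, this reproduces the grep -nwf output above.
```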

One can argue that even steps 1 and 2 are specific to our problem. Usually, I prefer to run step 1 from Python via a system call, and write steps 2 and 3 in Python, tailored to my specific requirement. For the same reason, I am not attaching a script; if anyone insists, I am willing to do that. I hope this has thrown more light on what I was trying to say earlier.

Summary of Commands:
$ find . -type f -print0 | xargs -0 -I '{}' md5sum '{}' | sort > ../all_file_data.txt

$ awk '{ print $1 }' ../all_file_data.txt | uniq -c | awk '$1 > 1 { print $2 }' > ../repeating_sums.txt
$ grep -nwf ../repeating_sums.txt ../all_file_data.txt

With regards,
Sandeep Sasi
 