Remove Duplicate Items Using Python Set

I have a large text file with over 255 million items, one item per line.

I first tried doing the job in Node.js, but I kept getting out-of-memory errors, even after making some modifications to allow Node.js to use more memory. The server is an 8-core Dell R710 with 64 GB of RAM, more than enough to bring the 4.5-gigabyte text file into memory for processing.

Rather than keep fighting with Node.js, I decided to give Python a try. I was pleasantly surprised with the result: it took 144 seconds to go through the file and remove all the duplicate items.

Here's the source code I used, including the timer:

import time

start_time = time.time()

# Reading the file through set() keeps one copy of each distinct line.
with open('all.txt') as f:
    x = set(f)

print("--- %s seconds ---" % (time.time() - start_time))
print('count of unique items:')
print(len(x))
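The snippet above only counts the unique lines. As a minimal sketch of the full "remove duplicates" job, the variant below (the filenames and the `dedupe` helper name are my own, not from the post) also writes the de-duplicated lines back out, streaming through the input so only the set of seen lines is held in memory:

```python
def dedupe(src, dst):
    """Copy src to dst, keeping only the first occurrence of each line."""
    seen = set()
    with open(src) as fin, open(dst, 'w') as fout:
        for line in fin:
            if line not in seen:   # O(1) average-case membership test
                seen.add(line)
                fout.write(line)
    return len(seen)

# Example: dedupe('all.txt', 'unique.txt') returns the unique-line count.
```

Unlike `sort -u`, this preserves the original order of first occurrences, at the cost of holding every distinct line in RAM.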

Thanks, I hope you found this useful!
