I have a large text file with over 255 million items, each on its own line (separated by carriage returns).
I first tried doing the job in Node.js, but I kept getting out-of-memory errors, even after making some modifications to allow Node.js to use more memory. The server is an 8-core Dell R710 with 64 GB of RAM, more than enough to bring the 4.5-gigabyte text file into RAM for processing.
Rather than keep fighting with Node.js, I decided to give it a try in Python. I was pleasantly surprised with the result: it took 144 seconds to go through the file and remove all the duplicate items.
Here’s the source code I used, including the timer:
import time

start_time = time.time()

# Build a set of every line in the file; duplicates are discarded automatically
x = set(open('all.txt'))

print("--- %s seconds ---" % (time.time() - start_time))
print('count of array: ')
print(len(x))
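The snippet above only counts the unique lines. If you also want to write the deduplicated result back to disk, a minimal sketch might look like the following (it builds a tiny demo input in place of the real all.txt, and the output name deduped.txt is just an illustrative choice; it still assumes the whole file fits in RAM):

```python
# Demo input: five lines with duplicates (stands in for the real all.txt).
with open('all.txt', 'w') as f:
    f.write('apple\nbanana\napple\ncherry\nbanana\n')

# Read every line into a set; duplicates are discarded automatically.
with open('all.txt') as src:
    unique_lines = set(src)

# Write the unique lines back out. Note that a set does not preserve
# the original line order, so the output order is arbitrary.
with open('deduped.txt', 'w') as dst:
    dst.writelines(unique_lines)

print(len(unique_lines))  # 3 unique lines in the demo input
```

If the original line order matters, `dict.fromkeys()` on Python 3.7+ deduplicates while keeping the first occurrence of each line in place.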