Find duplicate records in large text file
I'm on a Linux machine (Red Hat) and I have an 11GB text file. Each line in the text file contains data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.
I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file, so any solution that requires loading the entire file into memory would not be practical.
Read the file line-by-line, so you don't have to load it all into memory.
For each line (record), create a SHA-256 hash (32 bytes), unless your identifier is already shorter than that, in which case you can just use the identifier directly.
Store the hashes/identifiers in a numpy.array. That is probably the most compact way to store them: 27 million records times 32 bytes/hash is 864 MB, which should fit into the memory of a decent machine these days.
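A rough sketch of that approach; the file name 'bigdata.txt', the identifier length n, and the final sort-and-compare step are assumptions of mine, not anything from your description:

import hashlib
import numpy as np

n = 12  # placeholder: length of the identifier prefix

digests = []
with open('bigdata.txt', 'rb') as datafile:   # read bytes so hashlib accepts the line
    for line in datafile:
        # hashing the whole line; use hashlib.sha256(line[:n]).digest()
        # instead to key on the identifier only
        digests.append(hashlib.sha256(line).digest())

# Each digest is a fixed 32-byte string, so dtype 'S32' packs them tightly.
# (The temporary Python list above uses more memory than the final array.)
arr = np.array(digests, dtype='S32')
del digests

arr.sort()                                # in-place sort of the byte strings
dupes = arr[1:][arr[1:] == arr[:-1]]      # equal neighbours after sorting are duplicates
print(len(dupes), "duplicate hashes found")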
To speed up access you could use the first two bytes of the hash, say, as the key of a collections.defaultdict and put the rest of each hash in a list stored as the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain a list of around 400 entries on average. It would mean faster searching than a numpy array, but it would use more memory.
import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:     # read bytes so hashlib accepts the line
    for line in datafile:
        id = hashlib.sha256(line).digest()      # Or id = line[:n]
        k = id[0:2]                             # bucket key: first two bytes
        v = id[2:]                              # remainder stored in the bucket
        if v in d[k]:
            print("double found:", id)
        else:
            d[k].append(v)
Right tool for the job: put your records into a database. Unless you have a Postgres or MySQL installation handy already, I'd take SQLite.
$ sqlite3 uniqueness.sqlite
create table chk (
  ident char(n),   -- n as in first n characters
  lineno integer   -- for convenience
);
^D
Then I'd insert the unique identifier and line number into that table, possibly using a Python script like this:
import sqlite3   # sqlite3 ships with the Python standard library

n = ...     # how many chars are in the key part
lineno = 0
conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as input:
    for line in input:
        lineno += 1
        ident = line[:n]
        cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno])
conn.commit()   # don't forget to commit, or the inserts are rolled back on close
cur.close()
conn.close()
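If the row-by-row inserts turn out to be too slow on the 80GB file, one variation (not something I benchmarked, just a sketch) is to batch them with executemany and commit once per batch:

import itertools
import sqlite3

n = 12            # placeholder: how many chars are in the key part
BATCH = 100000    # arbitrary batch size

def batches(rows, size):
    """Yield lists of at most `size` rows from the iterable."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as input:
    rows = ((line[:n], lineno) for lineno, line in enumerate(input, start=1))
    for chunk in batches(rows, BATCH):
        cur.executemany("insert into chk(ident, lineno) values(?, ?)", chunk)
        conn.commit()
cur.close()
conn.close()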
After this, you can index the table and use SQL:
$ sqlite3 uniqueness.sqlite
create index x_ident on chk(ident);   -- may take a bit of time

-- quickly find duplicates, if any
select ident, count(ident) as how_many
from chk
group by ident
having count(ident) > 1;

-- find lines of specific violations, if needed
select lineno from chk where ident = ...;   -- insert a duplicate ident
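You can of course also run the duplicate check from Python instead of the sqlite3 shell; a minimal sketch against the same chk table:

import sqlite3

conn = sqlite3.connect("uniqueness.sqlite")
query = """
    select ident, count(ident) as how_many
    from chk
    group by ident
    having count(ident) > 1
"""
for ident, how_many in conn.execute(query):
    print(ident, "appears", how_many, "times")
conn.close()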
Yes, I tried most of this code, it should work :)