Find duplicate records in large text file

I'm on a Linux machine (Red Hat) and I have an 11GB text file. Each line in the text file contains data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file so any solution that requires loading the entire file into memory would not be practical.


Read the file line-by-line, so you don't have to load it all into memory.

For each line (record), compute a SHA-256 hash (32 bytes) — or just use the identifier itself if it is shorter than that.

Store the hashes/identifiers in a numpy array. That is probably the most compact way to store them. 27 million records times 32 bytes/hash is 864 MB. That should fit into the memory of a decent machine these days.
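As a small sketch of that array-based check (the `lines` list here is a made-up stand-in; in reality you would accumulate the digests while reading the big file line by line):

```python
import hashlib
import numpy as np

# Hypothetical stand-in for the big file's records.
lines = [b"rec1 ...\n", b"rec2 ...\n", b"rec1 ...\n"]

# Pack all 32-byte digests into one flat uint8 array, one row per record.
digests = b"".join(hashlib.sha256(l).digest() for l in lines)
arr = np.frombuffer(digests, dtype=np.uint8).reshape(-1, 32)

# np.unique on rows groups identical hashes; counts > 1 means duplicates.
uniq, counts = np.unique(arr, axis=0, return_counts=True)
print("duplicate hashes:", int((counts > 1).sum()))  # -> duplicate hashes: 1
```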

To speed up access you could use, say, the first 2 bytes of the hash as the key of a collections.defaultdict and put the rest of the hashes in a list as the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain a list of around 400 entries on average. It would mean faster searching than a numpy array, but it would use more memory.

import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        rec_id = hashlib.sha256(line).digest()
        # Or, using the identifier directly: rec_id = line[:n]
        k = rec_id[:2]
        v = rec_id[2:]
        if v in d[k]:
            print("double found:", rec_id.hex())
        else:
            d[k].append(v)

Right tool for the job: put your records into a database. Unless you have a Postgres or MySQL installation handy already, I'd take sqlite.

$ sqlite3 uniqueness.sqlite
create table chk (
  ident char(n), -- n as in first n characters
  lineno integer -- for convenience
);

Then I'd insert the unique identifier and line number into that table, possibly using a Python script like this:

import sqlite3 # ships with Python's standard library
n = ... # how many chars are in the key part
lineno = 0

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as infile:
  for line in infile:
    lineno += 1
    ident = line[:n]
    cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno])
conn.commit()

After this, you can index the table and use SQL:

$ sqlite3 uniqueness.sqlite
create index x_ident on chk(ident); -- may take a bit of time

-- quickly find duplicates, if any
select ident, count(ident) as how_many
from chk
group by ident
having count(ident) > 1;

-- find lines of specific violations, if needed
select lineno 
from chk
where ident = ...; -- insert a duplicate ident
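The same duplicate query can also be driven from Python. A minimal self-contained sketch, using an in-memory database and made-up identifiers (the real script would connect to uniqueness.sqlite instead):

```python
import sqlite3

# In-memory stand-in for uniqueness.sqlite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table chk (ident char(8), lineno integer)")

# Hypothetical records; "AAAA0001" appears twice on purpose.
cur.executemany("insert into chk(ident, lineno) values(?, ?)",
                [("AAAA0001", 1), ("BBBB0002", 2), ("AAAA0001", 3)])
cur.execute("create index x_ident on chk(ident)")

cur.execute("""select ident, count(ident) as how_many
               from chk group by ident having count(ident) > 1""")
print(cur.fetchall())  # -> [('AAAA0001', 2)]
```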

Yes, I tried most of this code, it should work :)
