Comparing values of one list inside a gigantic two-dimensional list in Python: fastest way?

I want to check whether a value from one list exists in another list. The lists are huge (50k+ items, from a database).

EDIT:

I also want to mark the duplicated record as duplicated=True and keep it in the table for later reference.

Here is how the lists are built:

n_emails = [(db_id, checksum) for db_id, checksum in search_results]
# I want to check whether a checksum exists elsewhere in the same list
# (or in another list) and retrieve the matching db_id if it does.
# Example: n_emails = [[1,'CAFEBABE010'], [2,'bfeafe3df1ds'], [3,'deadbeef101'], [5,'CAFEBABE010']]
# In this case I want to retrieve ids 1 and 5 because they share the same checksum.

def _getdups(old_lst, em_md5, em_id):
    dups = []
    for old in old_lst:
        # old is (db_id, checksum): match on the checksum, skip the record itself
        if em_md5 == old[1] and old[0] != em_id:
            dups.append(dict(org_id=old[0], md5hash=old[1], dupID=em_id))
    return dups

for m in n_emails:
    dups = _getdups(n_emails, m[1], m[0])
    n_dups = [casesdb.duplicates.insert(**dup) for dup in dups]
    if n_dups:
        print "Dupe Found"
        casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)

But this takes far too long. On larger lists (50k vs. 50k records) it ran for over 5,000 seconds and never finished, like a never-ending loop. The server I am running it on has 4 GB of RAM and 4 cores. Obviously I am doing something wrong.

Please help... thanks a lot!

SOLVED:

Dict index mapping is a lot faster! (When the MySQL table is not indexed; please note I have not tested against an indexed table.)

It's 20 seconds vs. 30 milliseconds: 20 * 1000 / 30 ≈ 666 times faster! LOL

Answers


The fastest way is to use a dict, like this:

n_emails = [[1,'CAFEBABE010'], [2,'bfeafe3df1ds'], [3,'deadbeef101'], [5,'CAFEBABE010']]

d = {}
for id, hash in n_emails:
    if hash not in d:
        d[hash] = [id]
    else:
        d[hash].append(id)

for hash, ids in d.iteritems():
    if len(ids) > 1:
        print hash, ids

This is essentially the algorithm used for a hash join.
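
The same dict trick is the build side of a hash join, so it also covers the question's "other list" case. A minimal sketch, where list_a and list_b are hypothetical lists of (id, checksum) pairs:

index = {}
for id, hash in list_a:               # build phase: index the first list by checksum
    index.setdefault(hash, []).append(id)

for id, hash in list_b:               # probe phase: look up each checksum of the second list
    if hash in index:
        print hash, index[hash], id   # checksum, matching ids in list_a, id in list_b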


for hash, num in (select hash, count(id) as num from emails group by hash having num > 1):
    first = None
    for index, id in enumerate(select id from emails where hash = hash order by id desc):
        if index == 0:
            first = id
            continue
        update emails set duplicate = first where id = id

This would be the SQL/Python pseudocode solution; here I take the duplicate column and use it to store which message this one is thought to be a duplicate of.

The emails table would be at least:

create table emails (id, hash, duplicate default null)
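
A runnable version of the above (a minimal sketch using sqlite3 with an in-memory table; the schema follows this answer, the sample rows come from the question, and everything else is an assumption):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emails (id integer, hash text, duplicate integer default null)")
conn.executemany("insert into emails (id, hash) values (?, ?)",
                 [(1, 'CAFEBABE010'), (2, 'bfeafe3df1ds'), (3, 'deadbeef101'), (5, 'CAFEBABE010')])

# every checksum that occurs more than once
dupes = conn.execute("select hash from emails group by hash having count(id) > 1")
for (hash,) in dupes.fetchall():
    ids = [row[0] for row in
           conn.execute("select id from emails where hash = ? order by id desc", (hash,))]
    first = ids[0]                  # the row we keep
    for dup_id in ids[1:]:          # mark the rest as duplicates of it
        conn.execute("update emails set duplicate = ? where id = ?", (first, dup_id))
conn.commit()

print conn.execute("select * from emails").fetchall()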

What you're doing wrong is:

  • You can probably get the result directly from the database. It's much faster than Python.
  • You are doing a linear search for the checksums, which means that each of the 50k entries gets compared to each of the other 50k entries. That is 50,000 × 50,000 = 2.5 billion comparisons.

What you should do is build an index over the checksums: make a dict that maps checksum -> entry. When you insert the entries, check whether the checksum already exists; if so, the entry is a duplicate.
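
For example (a minimal sketch of such a checksum index, checked at insert time; the helper name is hypothetical):

seen = {}                          # checksum -> id of the first entry seen with it

def is_duplicate(db_id, checksum):
    """Check the index before inserting; returns the original id, or None."""
    if checksum in seen:
        return seen[checksum]
    seen[checksum] = db_id
    return None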

Or you simply use your database; databases love indexing.


You'd be better off looking for duplicates with SQL. For example, see this page describing how to find duplicates.

Pulling all of those results into Python and processing them is never going to be very fast, but if you must, your best bet is to have a dictionary of checksums to IDs:

got_checksums = {}
for id, checksum in emails:
    if checksum in got_checksums:
        print id, got_checksums[checksum]
    else:
        got_checksums[checksum] = id
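
And if you also need to mark the rows as the question's EDIT asks, collect the pairs instead of printing them (a sketch that reuses the casesdb calls from the question; emails is assumed to hold (id, checksum) pairs):

got_checksums = {}
dup_pairs = []                     # (duplicate id, original id, checksum)
for id, checksum in emails:
    if checksum in got_checksums:
        dup_pairs.append((id, got_checksums[checksum], checksum))
    else:
        got_checksums[checksum] = id

for dup_id, org_id, checksum in dup_pairs:
    casesdb.duplicates.insert(org_id=org_id, md5hash=checksum, dupID=dup_id)
    casesdb(casesdb.email_data.id == dup_id).update(duplicated=True)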
