Modern, high performance bloom filter in Python?

I'm looking for a production quality bloom filter implementation in Python to handle fairly large numbers of items (say 100M to 1B items with 0.01% false positive rate).

Pybloom is one option but it seems to be showing its age as it throws DeprecationWarning errors on Python 2.5 on a regular basis. Joe Gregorio also has an implementation.

Requirements are fast lookup performance and stability. I'm also open to creating Python interfaces to particularly good c/c++ implementations, or even to Jython if there's a good Java implementation.

Lacking that, any recommendations on a bit array / bit vector representation that can handle ~16E9 bits?

Answers


I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.

You do make the key observation that a fast bit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random number technique used below that only hashes your key a single time.

In terms of non-Java bit array implementations:

I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing back my patches to Avi. Go to that BitVector link and scroll down to acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.

Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.

from BitVector import BitVector
from random import Random
# get hashes from http://www.partow.net/programming/hashfunctions/index.html
from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash


#
# ryan.a.cox@gmail.com / www.asciiarmor.com
#
# copyright (c) 2008, ryan cox
# all rights reserved 
# BSD license: http://www.opensource.org/licenses/bsd-license.php
#

class BloomFilter(object):
    def __init__(self, n=None, m=None, k=None, p=None, bits=None ):
        self.m = m
        if k > 4 or k < 1:
            raise Exception('Must specify value of k between 1 and 4')
        self.k = k
        if bits:
            self.bits = bits
        else:
            self.bits = BitVector( size=m )
        self.rand = Random()
        self.hashes = []
        self.hashes.append(RSHash)
        self.hashes.append(JSHash)
        self.hashes.append(PJWHash)
        self.hashes.append(DJBHash)

        # switch between hashing techniques
        self._indexes = self._rand_indexes
        #self._indexes = self._hash_indexes

    def __contains__(self, key):
        for i in self._indexes(key): 
            if not self.bits[i]:
                return False    
        return True 

    def add(self, key):
        dupe = True 
        bits = []
        for i in self._indexes(key): 
            if dupe and not self.bits[i]:
                dupe = False
            self.bits[i] = 1
            bits.append(i)
        return dupe

    def __and__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m): 
            raise Exception('Must use bloom filters created with equal k / m paramters for bitwise AND')
        return BloomFilter(m=self.m,k=self.k,bits=(self.bits & filter.bits))

    def __or__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m): 
            raise Exception('Must use bloom filters created with equal k / m paramters for bitwise OR')
        return BloomFilter(m=self.m,k=self.k,bits=(self.bits | filter.bits))

    def _hash_indexes(self,key):
        ret = []
        for i in range(self.k):
            ret.append(self.hashes[i](key) % self.m)
        return ret

    def _rand_indexes(self,key):
        self.rand.seed(hash(key))
        ret = []
        for i in range(self.k):
            ret.append(self.rand.randint(0,self.m-1))
        return ret

if __name__ == '__main__':
    e = BloomFilter(m=100, k=4)
    e.add('one')
    e.add('two')
    e.add('three')
    e.add('four')
    e.add('five')        

    f = BloomFilter(m=100, k=4)
    f.add('three')
    f.add('four')
    f.add('five')
    f.add('six')
    f.add('seven')
    f.add('eight')
    f.add('nine')
    f.add("ten")        

    # test check for dupe on add
    assert not f.add('eleven') 
    assert f.add('eleven') 

    # test membership operations
    assert 'ten' in f 
    assert 'one' in e 
    assert 'ten' not in e 
    assert 'one' not in f         

    # test set based operations
    union = f | e
    intersection = f & e

    assert 'ten' in union
    assert 'one' in union 
    assert 'three' in intersection
    assert 'ten' not in intersection
    assert 'one' not in intersection

Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:

def fast_count_bits( self, v ):
    bits = (
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )

    return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]

Need Your Help

Rails 4 - Column exists but does not update

ruby-on-rails-4 sqlite3

I have recently added a field "tag" to my blog app built in Rails 4. Below you can see the field appearing in the Edit view:

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.