Best way to compare IP addresses quickly

I'm parsing two CSV files which contains IP addresses. The first is a source CSV, and the second is a "Blacklist".

Because of the size of the source file, I'm trying to optimize the speed at which I find IP addresses that match the blacklist.

EDIT: The blacklist consists of IP Address "Blocks". This means that each record in the blacklist has two IP addresses: A Start Block (ex. 216.254.128.0) and an End Block. (Ex. 216.254.223.255)

This means that direct lookups etc, will NOT work.

I'm wondering what's the best way to approach this. The brute strength method would be:

String[] parts = sourceIP.split("\\."); // String array, each element is text between dots

int hi = 255;
int lo = 0;

int mid = (hi - lo) / 2 ;

if (Integer.valueOf(parts[0]) > mid) {
    mid = lo;
}

I could then repeat this for each part to decide whether or not the IP address is in the black list.

This seems pretty aggressive and with 4k+ records, this could take a very, very long time.

It could take 10+ iterations to decide each part and that would then have to be repeated to check the "High" part of the IP blocks in the blacklist. That's 80+ iterations per record.

I'm hoping to get some input here to see the best method for comparing IP addresses.

What are your thoughts?

Would it be possible to use a quick bitwise mask to compare values rapidly by serializing INetAddress?

FILE STRUCTURE CLARIFICATION:

Source IP File:

Contains a list of records from a database. (Aprox 4k). Each record contains names, addresses, emails, and IP Address.

Blacklist:

Contains 4.2k records. Each record is an IP Address "Block". This consists of two IP Addresses. 1. Start and 2. End.

If the record in the Source list has an IP address that's found in the blacklist, I need to save that record and add it to a new file.

Answers


I assume you're talking IPV4 addresses of the form xxx.xxx.xxx.xxx.

You can easily convert an IP address into an integer. Each segment (i.e. xxx) is 8 bits (i.e. one byte). So four of them together makes a 32-bit integer. So, given an IP address like "192.168.100.12", you can split it into its four parts, parse each one to a byte and create an integer. Say, for example, that you created a byte array of the segments:

ipBytes[0] = 192;
ipBytes[1] = 168;
ipBytes[2] = 100;
ipBytes[3] = 12;

You can turn that into an integer:

int ipAddress = ipBytes[0];
ipAddress = (ipAddress << 8) | ipBytes[1];
ipAddress = (ipAddress << 8) | ipBytes[2];
ipAddress = (ipAddress << 8) | ipBytes[3];

There are more efficient ways to do that, but you get the idea. Your language's runtime library might already have something that'll parse an IP address and give you the bytes to make it an integer.

You have a set of IP address ranges that you want to check your source addresses against. Load each of the ranges into a structure like this:

class IPRange
{
    public int startIp;
    public int stopIp;
}

And store those in an array or list. Then sort the list by starting IP address.

For each source IP address, convert it to an integer and do a binary search of the list, searching the starting IP address. The source address itself might not be (probably won't be) found, but when the binary search terminates the mid value will hold the index of the range whose starting IP address is less than or equal to the source address. You then just have to check the source address against that item's ending IP address to see if it's in the range.

Binary search is O(log n). If you're searching a list of 4,300 ranges, it's going to take at most 13 probes to find an address in the array. That should be plenty fast enough, even when doing 4,000 different searches. You're only talking on the order of 50,000 total probes of the range array.

A couple of notes:

First, as I said above, I assume you're talking about IPV4 addresses. If you're talking about IPV6 addresses, the same concepts still apply but you'll need a 64 bit integer. I don't know enough about IPv6 to say how you'd convert the address to 64 bit integer. Probably you should depend on you runtime library to get the address bytes.

Second: I assume that ranges don't overlap. That is, you won't have something like:

start range    end range
192.168.1.1    192.168.2.255
192.168.2.1    192.168.3.255

If you have that, then an IP address could fall within either of those ranges. You could potentially construct overlapping ranges that would allow addresses to fall through the cracks. If you have overlapping ranges, the problem becomes a little bit more complicated.


Put both files in a String. Use split(",") to split the ip's in the first string. Loop through the obtained ips array. For every ip search for it in the second String like blacklist.indexOf("," + ip + ",") But first add a "," at start and end of blacklist string.


Brute force it. Load everything into ram, no reason not to. Split the ips into a 2d array. {0:123,123,123,123} Blacklist into a 3d array. Now you can start searching for integers. When you have a match compare the next section. If source value higher then compare to the END block same section. When you have a match push to a new array and write it to a file at the end. If this takes more time to run then it took me to type this then close the porn you have open because your ram is full and its using your page file.


You could use a data structure called Bloom Filter. Which is rather efficient performance and storage wise. As for an example, there's a question here, Most Efficient way of implementing a BlackList that has an answer that recommends this.

As far as I know, also Google Chrome uses this technique, as also explained rather nicely at Matthials Vallentine's blog post A Garden Variety of Bloom Filters.

Yet more explanation succinctly can be had found at Adobe leaked credentials checker. Some excerpts

The original leak is about 9.3GB uncompressed, of which 3.3GB is email addresses [...] This means the data can fit into 512MB (i.e. 232 bits) of memory and allows us to perform lookups in constant time [...] An optimal bloom filter which is allowed to occupy 840MB would have practically no false positives at all.


It seems like the most direction solution would be to use an interval tree to store the blacklist. Then check if the IP intersects with any of the intervals.

You also might want to consider using a Trie/hashtable to get fast lookups where the interval is the same. IE: 216.254.128.0 to 216.254.223.255 can be merged to 216.254.(128.0, 223.255), where the () is the interval. Thus you'd end up with two hash-table lookups (one for 216 and one for 254) then a search in an interval tree, which is likely to only contain a small number of elements.

You can also merge overlapping intervals into a single interval, which can probably be done as you build the interval tree. Which ends up being more like a binary search tree in that case.


Need Your Help

How to send index to slot with a QMenu

python qt pyside qcombobox qmenu

I am trying to get QComboBox-like behavior using a QPushButton that brings up a QMenu. Namely, when an action is selected, I want to connect to a slot, and for the slot to print out the index of the

why scala doesn't make tail call optimization?

scala continuations tail-call-optimization

Just playing with continuations. The goal is to create function which will receive another function as parameter, and execution amount - and return function which will apply parameter given amount ...

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.