# Parallelize a list consolidation in Python

I have a sorted list of "box definitions" that I would like to consolidate. The list looks something like:

```python
big_list = [
    # ...
    [3, 4, 5, 4, 5, 6, 65],
    [3, 4, 5, 4, 5, 6, 60],
    [3, 4, 5, 4, 5, 6, 55],
    [3, 4, 5, 4, 5, 6, 52],
    [3, 4, 5, 4, 5, 6, 23],
    [3, 4, 5, 4, 5, 6, 17],
    [3, 4, 5, 4, 5, 6, 0],
    [5, 8, 9, 6, 9, 10, 90],
    [5, 8, 9, 6, 9, 10, 84],
    [5, 8, 9, 6, 9, 10, 32],
    [5, 8, 9, 6, 9, 10, 0],
    # ...
    [750, 800, 900, 751, 801, 901, 97],
    [750, 800, 900, 751, 801, 901, 24],
    [750, 800, 900, 751, 801, 901, 17],
    [750, 800, 900, 751, 801, 901, 16],
    [750, 800, 900, 751, 801, 901, 0],
    # ...
]
```

where the box "format" is `[x1, y1, z1, x2, y2, z2, attribute]`, and we can assume dx=1, dy=1, dz=1.

Also, we can assume the list has already been sorted by something like:

```python
big_list = sorted(big_list, key=lambda n: n[6], reverse=True)
big_list = sorted(big_list, key=lambda n: n[2])
big_list = sorted(big_list, key=lambda n: n[1])
big_list = sorted(big_list, key=lambda n: n[0])
```

The list may be several million items long, and I would like to reduce it so that each discrete "box" keeps only its highest "attribute". In this case, something like:

```python
reduced_big_list = [
    [3, 4, 5, 4, 5, 6, 65],
    [5, 8, 9, 6, 9, 10, 90],
    [750, 800, 900, 751, 801, 901, 97],
]
```

The method I am currently using on this list is something like:

```python
i = 0
while i < len(big_list) - 1:
    if big_list[i][0] == big_list[i+1][0] \
            and big_list[i][1] == big_list[i+1][1] \
            and big_list[i][2] == big_list[i+1][2] \
            and big_list[i][6] >= big_list[i+1][6]:
        del big_list[i+1]
    else:
        i = i + 1
```

The problem is that when the list is "long" (10 million+ "boxes"), the process is very, very slow.

Is there a clever way to parallelize this list "decimation" process or perhaps quicken this process?

## Answers

The slowness is the call to `del`, which shifts every item in the *complete* tail of the list by one position. In your case, simply don't use `del`. Instead, build a new list: start from an empty list and append the items you want to keep.
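A minimal sketch of that rewrite, wrapped in a helper function (the name `reduce_boxes` is just for illustration). It assumes the input is sorted as described in the question, so the first row of each coordinate group carries the highest attribute:

```python
def reduce_boxes(big_list):
    """Keep only the first (highest-attribute) row per (x1, y1, z1) triple.

    Assumes big_list is sorted as in the question: by x1, y1, z1,
    with the attribute descending within each group.
    """
    reduced = []
    for box in big_list:
        # The list is grouped by its first three coordinates, so each row
        # only needs comparing against the row most recently kept.
        if reduced and reduced[-1][:3] == box[:3]:
            continue  # same box; this row's attribute is lower or equal
        reduced.append(box)
    return reduced
```

This is a single O(n) pass with no element shifting, so it should be dramatically faster than the `del`-based loop on a 10-million-item list.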

The Boolean `and` evaluates its left expression first, and only evaluates the right-hand expression if the left one is true. Since you've sorted your list, adjacent elements are more likely to have identical 0th elements than identical later elements, so comparing the later elements first lets the test fail sooner. Try:

```python
i = 0
while i < len(big_list) - 1:
    if big_list[i][2] == big_list[i+1][2] \
            and big_list[i][1] == big_list[i+1][1] \
            and big_list[i][0] == big_list[i+1][0] \
            and big_list[i][6] >= big_list[i+1][6]:
        del big_list[i+1]
    else:
        i = i + 1
```

The reason it is slow is that each `del` takes linear time, since every element after the deleted one must be shifted down, making the total process O(n²).

If instead of deleting lines from the original list, you append the lines you want to keep into a new list, it should be much faster.

But there are other, possibly more Pythonic, ways to achieve the same result. For example, using `itertools.groupby` (assuming the list is sorted as you specified):

```python
from itertools import groupby

new_list = [next(group) for val, group in groupby(big_list, key=lambda x: x[:3])]
```

This groups the list items by their first 3 elements and returns a list containing the first item of each group. Since the attribute is sorted in descending order within each group, that first item is the one with the highest attribute.
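As a quick check, here is the same one-liner run on a small slice of the example data (the variable names are just for illustration):

```python
from itertools import groupby

boxes = [
    [3, 4, 5, 4, 5, 6, 65],
    [3, 4, 5, 4, 5, 6, 60],
    [5, 8, 9, 6, 9, 10, 90],
    [5, 8, 9, 6, 9, 10, 32],
]

# groupby yields one (key, group-iterator) pair per run of equal keys;
# next(group) pulls the first item of each run, which is the
# highest-attribute row because of the sort order.
new_list = [next(group) for val, group in groupby(boxes, key=lambda x: x[:3])]
print(new_list)  # [[3, 4, 5, 4, 5, 6, 65], [5, 8, 9, 6, 9, 10, 90]]
```

Note that `groupby` only merges *adjacent* items with equal keys, which is exactly why the pre-sorting step matters.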