Is it faster to loop through a Python set of number or a set of letters?
Is it faster to loop through a Python set of numbers or a Python set of letters given that each set is the exact same length and each item within each set is the same length? Why?
I would think that there would be a difference because letters have more possible characters [a-zA-Z] than numbers [0-9] and therefor would be more 'random' and likely affect the hashing to some extent.
numbers = set([00000,00001,00002,00003,00004,00005, ... 99999]) letters = set(['aaaaa','aaaab','aaaac','aaaad', ... 'aaabZZ']) # this is just an example, it does not actually end here for item in numbers: do_something() for item in letters: do_something()
where len(numbers) == len(letters)
Update: I am interested in Python's specific hashing algorithm and what happens behind the scenes with this implementation.
There might be some particular implementation details of Python that I'm not aware of that mess with my general arguments here, but:
- Creating the set of strings will likely be a little slower than creating the set of integers (all else being equal), since the hash operation on strings takes some (small) time to run while the hash operation on integers is trivial.
- Iterating a set doesn't perform any hash operations, so the time to hash is irrelevant there.
- Iterating a set depends on the number of elements in the set and the number of buckets in the hash table backing the set. So the distribution of the hash function only matters if it causes the hash table to increase the bucket count. For some hash table implementations that's impossible (because bucket count is only increased when the load factor exceeds a threshold, not just because of collisions). Other hash table implementations resize when there are a lot of collisions. I don't know which CPython is.
- Anyway, the particular example you give of a set of integers will generate well-distributed hash values.
- There's one way to find out which is faster in Python, which is to timeit, with a realistic example of the data you care about. Speculation is usually a waste of time.
You can see the results of Python's hashing algorithms like this:
>>> foo = 3 >>> foo.__hash__() 3 >>> foo = 1856348 >>> foo.__hash__() 1856348 >>> foo = "\x00" >>> foo.__hash__() 1 >>> foo = "\x01" >>> foo.__hash__() 128000384 >>> foo = "\x02" >>> foo.__hash__() 256000771
So on my copy of Python, those hash results match these reported Python hash algorithms. As ever with CPython, you can look at the source to confirm the algorithm.