
Finding A Duplicate In An HDF5 PyTable With 500e6 Rows

Problem: I have a large (> 500e6 rows) dataset that I've put into a pytables database. Let's say the first column is an ID and the second column is a counter for each ID. Each ID-counter combination should be unique, and I want to find any duplicates.
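
For concreteness, a table like the one described might look something like this in PyTables (the column names, dtypes, and file name are illustrative assumptions, not taken from the question):

```python
import tables as tb

# Hypothetical schema matching the description: an ID column and a
# per-ID counter column.
class Record(tb.IsDescription):
    id = tb.UInt32Col()       # first column: ID
    counter = tb.UInt32Col()  # second column: counter for each ID

h5 = tb.open_file("data.h5", mode="w")
table = h5.create_table("/", "records", Record, expectedrows=500_000_000)
```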

Solution 1:

Two obvious techniques come to mind: hashing and sorting.

A) define a hash function to combine ID and Counter into a single, compact value.

B) count how often each hash code occurs

C) select from your data all rows whose hash has collisions (this should be a *much* smaller data set)

D) sort this data set to find duplicates.

The hash function in A) needs to be chosen so that its output range fits into main memory while still providing enough selectivity. Maybe use two bitsets of 2^30 bits or so for this. You can afford 5-10% collisions; that should still reduce the data set enough to allow fast in-memory sorting afterwards.

This is essentially a Bloom filter.
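
A minimal sketch of steps A)-C), assuming a two-bitset scheme in Python (the hash function, bitset size, and row format are all illustrative assumptions):

```python
import hashlib

NBITS = 1 << 30                      # ~1e9 buckets; two bitsets use ~256 MiB

seen = bytearray(NBITS // 8)         # bit set once a hash value has been seen
seen_twice = bytearray(NBITS // 8)   # bit set when a hash value repeats

def bucket(id_, counter):
    # Combine ID and counter into one compact hash value; blake2b is an
    # arbitrary choice here, any well-mixing hash works.
    h = hashlib.blake2b(f"{id_},{counter}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little") % NBITS

def mark(id_, counter):
    # Pass 1: flip the 'seen' bit; if it was already set, flag the bucket.
    byte, bit = divmod(bucket(id_, counter), 8)
    if seen[byte] & (1 << bit):
        seen_twice[byte] |= 1 << bit
    else:
        seen[byte] |= 1 << bit

def is_candidate(id_, counter):
    # Pass 2: rows whose bucket was flagged are duplicate candidates.
    byte, bit = divmod(bucket(id_, counter), 8)
    return bool(seen_twice[byte] & (1 << bit))
```

Everything flagged by is_candidate() contains all true duplicates plus a small fraction of false positives from hash collisions; the sort in step D) on that much smaller set then separates the two.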

Solution 2:

The brute force approach that you've taken appears to require you to execute 500e6 queries, one for each row of the table. Although I think that the hashing and sorting approaches suggested in another answer are essentially correct, it's worth noting that pytables is already built for speed, and should already be expected to have these kinds of techniques effectively included "under the hood", so to speak.

I contend that the simple code you have written most likely does not yet take full advantage of the capabilities that pytables already makes available to you.

In the documentation for create_index(), it says that the default settings are optlevel=6 and kind='medium'. It mentions that you can increase the speed of each of your 500e6 queries by decreasing the entropy of the index, and that you can decrease the entropy to its minimum possible value (zero) either by choosing the non-default values optlevel=9 and kind='full', or, equivalently, by generating the index with a call to create_csindex() instead. According to the documentation, you pay a little more upfront in the form of a longer index-creation time, but it pays you back later by saving time on the series of queries that you have to repeat 500e6 times.
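
As a rough sketch of what that looks like in code (the file, node, and column names are assumptions carried over from the earlier example; create_csindex() and create_index(optlevel=9, kind='full') are the documented PyTables calls):

```python
import tables as tb

h5 = tb.open_file("data.h5", mode="a")
table = h5.root.records  # assumed node name

# A completely sorted (zero-entropy) index on each queried column;
# equivalent to create_index(optlevel=9, kind='full').
table.cols.id.create_csindex()
table.cols.counter.create_csindex()

# Indexed queries are then much faster than unindexed ones.
matches = table.read_where("(id == some_id) & (counter == some_counter)",
                           condvars={"some_id": 1234, "some_counter": 5})
h5.close()
```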

If optimizing your pytables column indices fails to speed up your code sufficiently, and you want to simply perform a massive sort on all of the rows and then search for duplicates by looking for matches in adjacent sorted rows, it's possible to perform a merge sort in O(N log(N)) time using relatively modest amounts of memory by sorting the data in chunks and then saving the chunks in temporary files on disk. Examples here and here demonstrate in principle how to do it in Python specifically. But you should really try optimizing your pytables index first, as that's likely to provide a much simpler and more natural solution in your particular case.
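
By way of illustration, a minimal external merge sort in pure Python might look like this (the (id, counter) tuple format and chunk size are assumptions):

```python
import heapq
import pickle
import tempfile

def external_sort(rows, chunk_size=1_000_000):
    """Sort an iterable of (id, counter) pairs too large for memory:
    sort fixed-size chunks in RAM, spill each to a temp file, then
    k-way merge the sorted files with a heap. O(N log N) overall."""
    chunks = []
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) >= chunk_size:
            chunks.append(_spill(sorted(buf)))
            buf = []
    if buf:
        chunks.append(_spill(sorted(buf)))
    return heapq.merge(*map(_reload, chunks))

def _spill(sorted_rows):
    # Write one sorted chunk to a temporary binary file.
    f = tempfile.TemporaryFile()
    for row in sorted_rows:
        pickle.dump(row, f)
    f.seek(0)
    return f

def _reload(f):
    # Stream a spilled chunk back, row by row.
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def find_duplicates(sorted_rows):
    # In sorted order, duplicates sit next to each other.
    prev = object()
    for row in sorted_rows:
        if row == prev:
            yield row
        prev = row
```

Reading the table in slices, feeding the (id, counter) tuples through external_sort(), and scanning the merged output with find_duplicates() keeps peak memory near chunk_size rows.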
