
Pytables Duplicates 2.5 Giga Rows

I currently have a .h5 file with a table in it consisting of three columns: a text column of 64 chars, a UInt32 column relating to the source of the text, and a UInt32 column whi…
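(For reference, the table described here might be declared along these lines in PyTables; the field names below are guesses, with xhashx taken from the answer that follows, since the question text is cut off.)

import tables as tb

# Possible description for the question's table (field names are assumptions):
class TextRow(tb.IsDescription):
    text   = tb.StringCol(64)   # 64-char text column
    source = tb.UInt32Col()     # source of the text
    xhashx = tb.UInt32Col()     # second UInt32 column, presumably a hash of the text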

Solution 1:

Your dataset is 10x larger than the one in the referenced solution (2.5e9 vs 500e6 rows). Have you done any testing to identify where the time is spent? The table.itersorted() method may not scale linearly and could be resource intensive. (I don't have any experience with itersorted().)
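If you want to measure that before committing to a full run, a minimal sketch along these lines (assuming your table handle is named table and the sort column is the xhashx hash field used below) times a partial pass over itersorted() and extrapolates to the full table:

import time

# Rough timing sketch. Assumptions: the table handle is 'table' and the
# 'xhashx' column has a completely sorted index (CSI), which itersorted()
# requires.
N = 1_000_000
t0 = time.perf_counter()
for i, _ in enumerate(table.itersorted('xhashx')):
    if i >= N - 1:
        break
elapsed = time.perf_counter() - t0
print(f'{N} sorted rows in {elapsed:.1f} s; '
      f'projected full pass: {elapsed * table.nrows / N / 3600:.2f} h')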

Here is a process that might be faster:

  1. Extract a NumPy array of the hash field (column xhashx)
  2. Find the unique hash values
  3. Loop through the unique hash values and extract a NumPy array of the rows that match each value
  4. Run your uniqueness tests against the rows in each extracted array
  5. Write the unique rows to your new file

Code for this process is below.
Note: it has not been tested, so it may have small syntax or logic gaps.

import numpy as np

# 'table' is the open source Table; 'new_table' is the destination Table
# (created elsewhere with the same description as 'table').

# Step 1: Get a NumPy array of the 'xhashx' field/column only:
hash_arr = table.read(field='xhashx')
# Step 2: Get a new array with the unique values only:
hash_arr_u = np.unique(hash_arr)

# Alternately, combine the first 2 steps in a single call:
hash_arr_u = np.unique(table.read(field='xhashx'))

# Step 3a: Loop on the unique hash values
for hash_test in hash_arr_u:
    # Step 3b: Get an array with all rows that match this unique hash value
    # (read_where picks up the local variable hash_test automatically)
    match_row_arr = table.read_where('xhashx == hash_test')

    # Step 4: Check the hash row count.
    # If there is only 1 row, a uniqueness test is not required.
    if match_row_arr.shape[0] == 1:
        # only one row, so write it to the new table
        new_table.append(match_row_arr)
    else:
        # check for unique rows, then write them to the new table
        # (np.unique on a structured array compares whole rows)
        new_table.append(np.unique(match_row_arr))

##################################################
# np.unique has an option to also return the hash counts;
# these can be used as the test inside the loop:
hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'),
                                  return_counts=True)

# Loop on the array of unique hash values
for cnt in range(hash_arr_u.shape[0]):
    # Get an array with all rows that match this unique hash value.
    # The condition string cannot index an array, so copy the value
    # to a plain variable first.
    hash_test = hash_arr_u[cnt]
    match_row_arr = table.read_where('xhashx == hash_test')

    # Check the hash count.
    # If there is only 1 row, a uniqueness test is not required.
    if hash_cnts[cnt] == 1:
        # only one row, so write it to the new table
        new_table.append(match_row_arr)
    else:
        # check for unique rows, then write them to the new table
        new_table.append(np.unique(match_row_arr))
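For completeness, here is one way the whole process could be assembled. The file names, the table node at /data, and the use of exact full-row equality as the duplicate test are all assumptions, so treat this as a sketch rather than a drop-in script:

import numpy as np
import tables as tb

with tb.open_file('source.h5', 'r') as h5_in, \
        tb.open_file('deduped.h5', 'w') as h5_out:
    table = h5_in.root.data
    # Reuse the source table's description for the output table
    new_table = h5_out.create_table('/', 'data', description=table.description)

    hash_arr_u, hash_cnts = np.unique(table.read(field='xhashx'),
                                      return_counts=True)

    for cnt in range(hash_arr_u.shape[0]):
        hash_test = hash_arr_u[cnt]
        match_row_arr = table.read_where('xhashx == hash_test')
        if hash_cnts[cnt] == 1:
            new_table.append(match_row_arr)              # already unique
        else:
            # np.unique on a structured array compares whole rows, so this
            # drops exact duplicate rows within the hash group
            new_table.append(np.unique(match_row_arr))

    new_table.flush()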
