
Reading .h5 File Is Extremely Slow

My data is stored in .h5 format. I use a data generator to fit the model, and it is extremely slow. A snippet of my code is provided below. def open_data_file(filename, readwrite='r

Solution 1:

I looked at your code, and you have a lot of calls to read the .h5 data. By my count, the generator makes 6 read calls for every pass over training_list and validation_list. So that's almost 20k calls in ONE training loop. It's not clear (to me) whether the generators are called on every training loop; if they are, multiply by 2268 loops.

Efficiency of HDF5 file reads depends on the number of calls to read the data (not just the amount of data). In other words, it is faster to read 1GB of data in a single call than it is to read the same data with 1000 calls of 1MB each. So the first thing we need to determine is the amount of time spent reading data from the HDF5 file (to be compared to your 7000s).
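To see the effect for yourself, here is a minimal sketch (not part of the original answer) that compares many per-row reads against one bulk read on a small throwaway file; the file path and array shape are arbitrary assumptions for the demo.

```python
import os
import tempfile
import time

import numpy as np
import tables as tb

# Create a small throwaway HDF5 file (path and shape are arbitrary assumptions).
tmp = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with tb.open_file(tmp, 'w') as f:
    f.create_array(f.root, 'data', np.random.rand(2000, 256))

with tb.open_file(tmp, 'r') as f:
    # Many small reads: one call per row.
    t0 = time.time()
    rows = [f.root.data[i] for i in range(f.root.data.shape[0])]
    per_row = time.time() - t0

    # One bulk read: a single call for the whole array.
    t0 = time.time()
    block = f.root.data[:]
    bulk = time.time() - t0

print(f'per-row reads: {per_row:.4f}s, single bulk read: {bulk:.4f}s')
```

Both approaches return identical data; the single bulk read is typically much faster because it pays the HDF5 call overhead once instead of once per row.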

I isolated the PyTables calls that read the data file. From those, I built a simple program that mimics the behavior of your generator function. Currently it makes a single training loop over the entire sample list. Increase the n_train and n_epochs values if you want to run a longer test. (Note: the code syntax is correct. However, without your file I can't verify the logic. I think it's correct, but you may have to fix small errors.)

See code below. It should run standalone (all dependencies are imported). It prints basic timing data. Run it to benchmark your generator.

import tables as tb
import numpy as np
from random import shuffle 
import time

with tb.open_file('../data/data.h5', 'r') as data_file:

    n_train = 1
    n_epochs = 1
    loops = n_train*n_epochs
    
    for e_cnt in range(loops):
        nb_samples = data_file.root.truth.shape[0]
        sample_list = list(range(nb_samples))
        shuffle(sample_list)
        split = 0.80
        n_training = int(len(sample_list) * split)
        training_list = sample_list[:n_training]
        validation_list = sample_list[n_training:]
        
        start = time.time()
        for index_list in [ training_list, validation_list ]:
            shuffle(index_list)
            x_list = list()
            y_list = list()
            
            while len(index_list) > 0:
                index = index_list.pop() 
                
                brain_width = data_file.root.brain_width[index]
                x = np.array([modality_img[index,0,
                                           brain_width[0,0]:brain_width[1,0]+1,
                                           brain_width[0,1]:brain_width[1,1]+1,
                                           brain_width[0,2]:brain_width[1,2]+1] 
                              for modality_img in [data_file.root.t1,
                                                   data_file.root.t1ce,
                                                   data_file.root.flair,
                                                   data_file.root.t2]])
                y = data_file.root.truth[index, 0,
                                         brain_width[0,0]:brain_width[1,0]+1,
                                         brain_width[0,1]:brain_width[1,1]+1,
                                         brain_width[0,2]:brain_width[1,2]+1]    
                
                x_list.append(x)
                y_list.append(y)
    
        print(f'For loop:{e_cnt}')
        print(f'Time to read all data={time.time()-start:.2f}')
