Reading .h5 File Is Extremely Slow
Solution 1:
This is the start of my answer. I looked at your code, and you have a lot of calls that read the .h5 data. By my count, the generator makes 6 read calls for every entry in training_list and validation_list. So, that's almost 20k calls on ONE training loop. It's not clear (to me) whether the generators are called on every training loop. If they are, multiply by 2268 loops.
Efficiency of HDF5 file reads depends on the number of calls made to read the data (not just the amount of data read). In other words, it is faster to read 1 GB of data in a single call than to read the same data with 1000 calls of 1 MB each. So the first thing we need to determine is the amount of time spent reading data from the HDF5 file (to be compared to your 7000s).
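To see this effect in isolation, here is a minimal sketch that times one bulk read against row-by-row reads of the same dataset. It assumes the same file path and truth dataset used in the code further down; adjust both for your file.

import tables as tb
import time

# Minimal timing sketch: one bulk read vs. many small indexed reads
# of the same dataset.
with tb.open_file('../data/data.h5', 'r') as f:
    ds = f.root.truth

    start = time.time()
    _ = ds.read()                    # single call reads the whole dataset
    print(f'1 bulk read: {time.time()-start:.3f}s')

    start = time.time()
    for i in range(ds.shape[0]):
        _ = ds[i]                    # one HDF5 call per row
    print(f'{ds.shape[0]} row reads: {time.time()-start:.3f}s')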
I isolated the PyTables calls that read the data file. From that, I built a simple program that mimics the behavior of your generator function. Currently it makes a single training loop over the entire sample list. Increase the n_train and n_epochs values if you want to run a longer test. (Note: the code syntax is correct. However, without the file I can't verify the logic. I think it's correct, but you may have to fix small errors.)
See code below. It should run standalone (all dependencies are imported). It prints basic timing data. Run it to benchmark your generator.
import tables as tb
import numpy as np
from random import shuffle
import time

with tb.open_file('../data/data.h5', 'r') as data_file:
    n_train = 1
    n_epochs = 1
    loops = n_train * n_epochs
    for e_cnt in range(loops):
        # build and shuffle the sample list, then split 80/20
        nb_samples = data_file.root.truth.shape[0]
        sample_list = list(range(nb_samples))
        shuffle(sample_list)
        split = 0.80
        n_training = int(len(sample_list) * split)
        training_list = sample_list[:n_training]
        validation_list = sample_list[n_training:]

        start = time.time()
        for index_list in [training_list, validation_list]:
            shuffle(index_list)
            x_list = list()
            y_list = list()
            while len(index_list) > 0:
                index = index_list.pop()
                # read call 1: the bounding box for this sample
                brain_width = data_file.root.brain_width[index]
                # read calls 2-5: one per modality image
                x = np.array([modality_img[index, 0,
                                  brain_width[0,0]:brain_width[1,0]+1,
                                  brain_width[0,1]:brain_width[1,1]+1,
                                  brain_width[0,2]:brain_width[1,2]+1]
                              for modality_img in [data_file.root.t1,
                                                   data_file.root.t1ce,
                                                   data_file.root.flair,
                                                   data_file.root.t2]])
                # read call 6: the ground-truth labels
                y = data_file.root.truth[index, 0,
                        brain_width[0,0]:brain_width[1,0]+1,
                        brain_width[0,1]:brain_width[1,1]+1,
                        brain_width[0,2]:brain_width[1,2]+1]
                x_list.append(x)
                y_list.append(y)

        print(f'For loop: {e_cnt}')
        print(f'Time to read all data = {time.time()-start:.2f}')
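If the timing confirms the reads dominate your 7000s, one possible fix (a sketch only, assuming the arrays fit in available RAM) is to read each dataset into memory once with PyTables' .read(), then slice the in-memory NumPy arrays inside the loop. That cuts the HDF5 call count from 6 per sample to 6 per run:

# Sketch: load each dataset once, then index NumPy arrays in the loop.
# Assumes the combined arrays fit in available RAM.
with tb.open_file('../data/data.h5', 'r') as data_file:
    t1          = data_file.root.t1.read()
    t1ce        = data_file.root.t1ce.read()
    flair       = data_file.root.flair.read()
    t2          = data_file.root.t2.read()
    truth       = data_file.root.truth.read()
    brain_width = data_file.root.brain_width.read()
    # ...then run the same generator logic, slicing t1/t1ce/flair/t2/truth
    # (NumPy arrays) instead of the data_file.root.* nodes.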