Fastest Way To Read A Binary File With A Defined Format?
Solution 1:
Use struct. In particular, struct.unpack.
result = struct.unpack("<2i5d...", buffer)
Here buffer holds the given binary data.
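As a minimal, self-contained sketch (the file name and the truncated "<2i5d" format are only placeholders for the question's real record layout):
import struct

# Hypothetical record layout: 2 little-endian ints followed by 5 doubles
fmt = "<2i5d"

with open('input.bin', 'rb') as f:          # 'input.bin' is a placeholder name
    buffer = f.read(struct.calcsize(fmt))   # read exactly one record
    result = struct.unpack(fmt, buffer)     # -> tuple of 2 ints and 5 floats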
Solution 2:
It's not clear from your question whether you're concerned about the actual file reading speed (and building the data structure in memory), or about later data-processing speed.
If you are reading only once, and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with identical format), parse it with struct.unpack and append it to a [double] array:
import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8  # 9 ints + 16 doubles

with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d...", record)
        data.extend(values)
This is under the assumption that you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory size (a 22% increase for the record from your question).
If you are reading the data from the file many times, it could be worthwhile to convert everything to one large array of doubles (like above) and write it back to another file, from which you can later read it with array.fromfile():
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8
    data.fromfile(fin, n)
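For completeness, a minimal sketch of the write-back step itself, assuming data is the array built in the first listing and 'preprocessed' is the file name used above:
# One-time preprocessing: dump the homogeneous array of doubles to disk,
# so later runs can load it quickly with array.fromfile() as shown above.
with open('preprocessed', 'wb') as fout:
    data.tofile(fout)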
Update. Thanks to a nice benchmark by @martineau, we now know for a fact that preprocessing the data and turning it into a homogeneous array of doubles ensures that loading such data from file (with array.fromfile()) is ~20x to ~40x faster than reading it record by record, unpacking, and appending to an array (as shown in the first code listing above).
A faster (and more standard) variation of record-by-record reading in @martineau's answer, which appends to a list and doesn't upcast to double, is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.
Solution 3:
Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.
To determine which method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction, I'm only posting the code tested and the related results. (If there's sufficient interest in the methodology, I'll post the whole script.)
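The harness itself isn't posted here; purely as a hedged sketch (this is an assumption, not the author's actual script), a TESTCASE registry combined with timeit.repeat along these lines could produce the kind of output shown below:
import timeit

test_cases = []    # (title, function, optional setup-function name)

def TESTCASE(title, setup=None):
    """ Hypothetical decorator that registers a function for benchmarking. """
    def decorator(func):
        test_cases.append((title, func, setup))
        return func
    return decorator

def run_benchmarks(executions=10, repetitions=3):
    for title, func, setup in test_cases:
        if setup:                                   # e.g. 'create_preprocessed_file'
            globals()[setup]()
        best = min(timeit.repeat(func, number=executions, repeat=repetitions))
        print('{}: {:.5f} secs'.format(title, best))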
Here are the snippets of code that were compared:
@TESTCASE('Read and construct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures
@TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size
    return structures
@TESTCASE('Convert to array (@randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8  # 9 ints + 16 doubles (standard sizes)
    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)
    return data
@TESTCASE('Read array file (@randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data
def create_preprocessed_file():
    """ Save the array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
And here are the results from running them on my system:
Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (@randomir part 2): 0.06430 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative 6.16x ( 516.36% slower)
Read and construct piecemeal with struct: 0.43283 secs, relative 6.73x ( 573.09% slower)
Convert to array (@randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)
Interestingly, most of the snippets are actually faster in Python 2...
Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (@randomir part 2): 0.03586 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative 7.77x ( 677.17% slower)
Read and construct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
Convert to array (@randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
Solution 4:
Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing
Simplest example:
import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))
Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#
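As a rough sketch of a structured dtype, assuming the record layout discussed in the other answers (9 4-byte ints and 16 8-byte doubles), something like the following could work; the field names are invented, and a real dtype would have to list the fields in the exact interleaved order of the question's format string:
import numpy as np

# Illustrative only: groups the ints and doubles together; a real dtype must
# follow the actual interleaved field order of the record ("<2i5d2idi3d2i3didi3d").
record_dtype = np.dtype([('ints', '<i4', (9,)), ('doubles', '<f8', (16,))])

data = np.fromfile('binary_file', dtype=record_dtype)
print(data['doubles'].shape)   # -> (number_of_records, 16)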
Solution 5:
There are a lot of good and helpful answers here, but I think the best solution needs more explanation. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.
import numpy as np

line_cols = 20                     # For example
line_rows = 40000                  # For example
data_fmt = 15*'f8,' + 5*'f4,'      # For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5            # For example

with open(filename, 'rb') as f:    # filename is the path to the binary file
    data = np.ndarray(shape=(1, line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize))[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows, line_cols)[:, :-1]
Here, we open the file as a binary file using the 'rb' option in open. Then, we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray to a 1D array by taking its zeroth index, where all our data is hiding. Then, we reshape the array using the astype, view and reshape methods. This is because reshape doesn't like having data with mixed dtypes, and I'm okay with having my integers expressed as doubles.
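To see the astype/view/reshape trick in isolation, here is a toy sketch with a made-up two-field record (one 4-byte int, one 8-byte double):
import numpy as np

# Made-up structured records: one 4-byte int and one 8-byte double each
mixed = np.array([(1, 2.5), (3, 4.5), (5, 6.5)], dtype='i4,f8')

# Promote every field to float64, drop the structure, and reshape to rows x cols
homog = mixed.astype('f8,f8')                       # ints become doubles
plain = homog.view('f8').reshape(len(mixed), 2)     # ordinary (3, 2) float array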
This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.
In the future, I may try to read the data even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.