Read Binary Flatfile And Skip Bytes
Solution 1:
Another way to phrase what you are looking for is to say that you want to read uint32 numbers starting at offset 304, with a stride of 400 bytes. np.fromfile does not provide an argument for custom strides (although it probably should). You have a couple of different options going forward.
The simplest is probably to load the entire file and subset the column you want:
data = np.fromfile(filename, dtype=np.uint32)[304 // 4::400 // 4].copy()
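To make the index arithmetic concrete, here is a minimal sketch that builds a small synthetic file and checks the slice against known values. The file name, the record count of 5, and the constants RECORD_SIZE and FIELD_OFFSET are assumptions made for illustration; only the 400-byte record and the offset of 304 come from the question.

```python
import os
import tempfile

import numpy as np

RECORD_SIZE = 400   # bytes per record (from the question)
FIELD_OFFSET = 304  # byte offset of the uint32 field inside each record

# Build a synthetic file: 5 records whose field holds record_index * 10.
n_records = 5
expected = np.arange(n_records, dtype=np.uint32) * 10
raw = np.zeros((n_records, RECORD_SIZE), dtype=np.uint8)
raw[:, FIELD_OFFSET:FIELD_OFFSET + 4] = expected.view(np.uint8).reshape(n_records, 4)

path = os.path.join(tempfile.mkdtemp(), "records.bin")  # hypothetical file name
raw.tofile(path)

# Viewed as uint32, each record spans 400 // 4 = 100 elements and the
# field sits at element 304 // 4 = 76 of every record.
data = np.fromfile(path, dtype=np.uint32)[FIELD_OFFSET // 4::RECORD_SIZE // 4].copy()
```

The slice only works because both 304 and 400 are divisible by 4; otherwise the field would not line up with a uint32 element boundary.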
If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use structured arrays instead:
dt = np.dtype([('_1', 'u1', 304), ('data', 'u4'), ('_2', 'u1', 92)])
data = np.fromfile(filename, dtype=dt)['data'].copy()
Here, _1 and _2 are used to discard the unneeded bytes with 1-byte resolution rather than 4.
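As a rough illustration of the unaligned case, the sketch below uses a made-up layout (10-byte records with the uint32 at byte 3) that the slicing approach could not handle. The layout, the sample values, the explicit little-endian byte order, and the file name are all assumptions for the example.

```python
import os
import struct
import tempfile

import numpy as np

record_size = 10  # hypothetical record size, not a multiple of 4
field_offset = 3  # hypothetical field offset, not a multiple of 4
tail = record_size - field_offset - 4  # padding bytes after the field

dt = np.dtype([
    ("_1", "u1", field_offset),  # bytes to skip before the field
    ("data", "<u4"),             # the little-endian uint32 we want
    ("_2", "u1", tail),          # bytes to skip after the field
])
# The dtype must cover exactly one record, or the fields drift out of sync.
assert dt.itemsize == record_size

values = [7, 70000, 4000000000]
path = os.path.join(tempfile.mkdtemp(), "unaligned.bin")  # hypothetical file name
with open(path, "wb") as f:
    for v in values:
        f.write(b"\x00" * field_offset + struct.pack("<I", v) + b"\xff" * tail)

data = np.fromfile(path, dtype=dt)["data"].copy()
```

Checking dt.itemsize against the record size is a cheap guard against miscounting the padding.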
Loading the entire file is generally going to be much faster than seeking between reads, so these approaches are likely desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.
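One possible shape for a home-grown solution is to read the file in blocks of whole records and keep only the wanted field from each block. This is a sketch, not a tuned implementation: the helper name read_field_chunked, its chunk_records parameter, and the synthetic test file are all hypothetical.

```python
import os
import tempfile

import numpy as np

# Record layout used above: 304 skipped bytes, a uint32, 92 skipped bytes.
dt = np.dtype([("_1", "u1", 304), ("data", "u4"), ("_2", "u1", 92)])


def read_field_chunked(path, record_dtype, field, chunk_records=1024):
    """Hypothetical helper: read one field, loading chunk_records records at a time."""
    parts = []
    with open(path, "rb") as f:
        while True:
            # Passing the open file lets each call continue where the last one stopped.
            block = np.fromfile(f, dtype=record_dtype, count=chunk_records)
            if block.size == 0:
                break
            # Copy so the full-size block can be freed; only the field is kept.
            parts.append(block[field].copy())
    if not parts:
        return np.empty(0, dtype=record_dtype[field])
    return np.concatenate(parts)


# Synthetic 10-record file to exercise the helper.
records = np.zeros(10, dtype=dt)
records["data"] = np.arange(10, dtype=np.uint32) + 100
path = os.path.join(tempfile.mkdtemp(), "records.bin")  # hypothetical file name
records.tofile(path)

data = read_field_chunked(path, dt, "data", chunk_records=4)
```

Peak memory is then bounded by roughly one chunk of full records plus the extracted values, rather than by the size of the whole file.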
Memory maps can be implemented via Python's mmap module and wrapped in an ndarray using the buffer parameter, or you can use the np.memmap class that does it for you:
mm = np.memmap(filename, dtype=np.uint32, mode='r', offset=0, shape=(1000, 400 // 4))
data = np.array(mm[:, 304 // 4])
del mm
Using a raw mmap is arguably more efficient because you can specify strides and an offset that look directly into the map, skipping all the extra data. It is also more flexible, because the offset and strides do not have to be multiples of the size of a np.uint32:
with open(filename, 'rb') as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    data = np.ndarray(buffer=mm, dtype=np.uint32, offset=304, strides=(400,), shape=(1000,)).copy()
The final call to copy is required because the underlying buffer will be invalidated as soon as the memory map is closed, possibly leading to a segfault.
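The raw mmap approach can also derive the number of records from the size of the map instead of hard-coding it. The sketch below does that for a made-up unaligned layout (10-byte records, uint32 at byte 3); the layout, the sample values, and the file name are assumptions for illustration.

```python
import mmap
import os
import struct
import tempfile

import numpy as np

record_size = 10  # hypothetical record size in bytes
field_offset = 3  # hypothetical offset of the uint32 within a record

values = [1, 2, 3000000000]
path = os.path.join(tempfile.mkdtemp(), "mapped.bin")  # hypothetical file name
with open(path, "wb") as f:
    for v in values:
        f.write(b"\x00" * field_offset + struct.pack("<I", v) + b"\x00" * 3)

with open(path, "rb") as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    # Derive the record count from the size of the map.
    n_records = len(mm) // record_size
    view = np.ndarray(
        buffer=mm,
        dtype="<u4",
        offset=field_offset,     # byte offset of the first field
        strides=(record_size,),  # bytes between consecutive fields
        shape=(n_records,),
    )
    data = view.copy()  # copy before the map is closed
    del view            # drop the view so nothing refers to the map afterwards
```

Only the copied values in data outlive the with block; the strided view is discarded before the map is closed.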