
Read Binary Flatfile And Skip Bytes

I have a binary file that has data organized into 400 byte groups. I want to build an array of type np.uint32 from bytes at position 304 to position 308. However, I cannot find a m

Solution 1:

Put slightly differently, you want to read uint32 values starting at byte offset 304, with a stride of 400 bytes. np.fromfile does not provide an argument for custom strides (although it probably should), so you have a few different options.

The simplest is probably to load the entire file as uint32 values and take every 100th element (400 // 4), starting at element 76 (304 // 4):

import numpy as np

# Byte offset 304 and byte stride 400, expressed in 4-byte uint32 elements
data = np.fromfile(filename, dtype=np.uint32)[304 // 4::400 // 4].copy()

If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use structured arrays instead:

dt = np.dtype([('_1', 'u1', 304), ('data', 'u4'), ('_2', 'u1', 92)])
data = np.fromfile(filename, dtype=dt)['data'].copy()

Here, _1 and _2 are used to discard the unneeded bytes with 1-byte resolution rather than 4.
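As a minimal sketch of how the structured-array approach generalizes, the padding sizes can be computed from the record size and field offset instead of being hard-coded. The helper names record_dtype and read_field are hypothetical, introduced only for illustration, and the field is assumed to be stored in native byte order.

```python
import numpy as np


def record_dtype(record_size=400, offset=304):
    """Build a packed structured dtype with one uint32 field at `offset`.

    Hypothetical helper: the defaults match the layout in the question.
    """
    tail = record_size - offset - 4  # bytes left after the 4-byte field
    if offset < 0 or tail < 0:
        raise ValueError("the uint32 field does not fit inside the record")
    fields = []
    if offset:
        fields.append(("_1", "u1", (offset,)))  # leading bytes to skip
    fields.append(("data", "u4"))               # the value we want
    if tail:
        fields.append(("_2", "u1", (tail,)))    # trailing bytes to skip
    dt = np.dtype(fields)
    # Unaligned structured dtypes are packed, so the sizes must add up.
    assert dt.itemsize == record_size
    return dt


def read_field(path, record_size=400, offset=304):
    """Load the whole file and return a contiguous copy of the uint32 column."""
    records = np.fromfile(path, dtype=record_dtype(record_size, offset))
    return records["data"].copy()
```

Calling read_field(filename) with the defaults should behave like the two-line version above, while something like read_field(filename, record_size=10, offset=3) covers layouts that are not multiples of 4.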

Loading the entire file is generally going to be much faster than seeking between reads, so these approaches are likely desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.
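For the home-grown route, one possible sketch is a plain loop that seeks to the field in each record and reads only those 4 bytes. It keeps memory use small, but as noted above it is likely to be much slower than a bulk load. The function name read_field_seeking is hypothetical, and the sketch assumes native byte order and ignores any incomplete record at the end of the file.

```python
import os

import numpy as np


def read_field_seeking(path, record_size=400, offset=304):
    """Read one uint32 per record by seeking, without loading the whole file.

    Hypothetical helper; any trailing partial record is ignored.
    """
    n_records = os.path.getsize(path) // record_size
    out = np.empty(n_records, dtype=np.uint32)
    with open(path, "rb") as f:
        for i in range(n_records):
            # Jump straight to the field inside record i.
            f.seek(i * record_size + offset)
            chunk = f.read(4)
            if len(chunk) != 4:
                raise EOFError(f"record {i} is truncated")
            # Interpret the 4 bytes as a native-endian uint32.
            out[i] = np.frombuffer(chunk, dtype=np.uint32)[0]
    return out
```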

Memory maps can be implemented via Python's mmap module and wrapped in an ndarray using the buffer parameter, or you can use the np.memmap class, which does it for you:

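# shape is (number of records, uint32 values per record); 1000 records here
mm = np.memmap(filename, dtype=np.uint32, mode='r', offset=0, shape=(1000, 400 // 4))
# Copy the wanted column out of the map into a regular array
data = np.array(mm[:, 304 // 4])
del mm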
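The record count of 1000 is hard-coded above. A sketch that derives it from the file size might look like the following; read_field_memmap is a hypothetical name, and the approach assumes both the record size and the field offset are multiples of 4.

```python
import os

import numpy as np


def read_field_memmap(path, record_size=400, offset=304):
    """Copy one uint32 column out of a read-only memory map.

    Hypothetical helper; requires a 4-byte-aligned layout.
    """
    if record_size % 4 or offset % 4:
        raise ValueError("record_size and offset must be multiples of 4")
    n_records = os.path.getsize(path) // record_size
    if n_records == 0:
        # An empty file cannot be memory-mapped, so return early.
        return np.empty(0, dtype=np.uint32)
    mm = np.memmap(path, dtype=np.uint32, mode="r",
                   shape=(n_records, record_size // 4))
    try:
        # np.array copies the column into ordinary memory.
        return np.array(mm[:, offset // 4])
    finally:
        del mm  # drop our reference to the map
```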

Using a raw mmap is arguably more efficient because you can specify strides and an offset that look directly into the map, skipping all the extra data. It is also more flexible, because the offset and strides do not have to be multiples of the size of a np.uint32:

import mmap

with open(filename, 'rb') as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    # offset and strides are in bytes; shape is the number of records
    data = np.ndarray(buffer=mm, dtype=np.uint32, offset=304, strides=(400,), shape=(1000,)).copy()

The final call to copy is required because the underlying buffer is invalidated as soon as the memory map is closed, so accessing an uncopied view afterwards could lead to a segfault.
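To illustrate the point about offsets and strides that are not multiples of 4, here is a sketch of the raw mmap approach with the record count derived from the file size. The name read_field_raw_mmap is hypothetical, and the values are assumed to be stored in native byte order.

```python
import mmap
import os

import numpy as np


def read_field_raw_mmap(path, record_size=400, offset=304):
    """Read one uint32 per record through a strided view over a raw mmap.

    Hypothetical helper; offset and record_size are in bytes and need not
    be multiples of 4.
    """
    if offset < 0 or offset + 4 > record_size:
        raise ValueError("the uint32 field does not fit inside the record")
    n_records = os.path.getsize(path) // record_size
    if n_records == 0:
        # An empty file cannot be memory-mapped, so return early.
        return np.empty(0, dtype=np.uint32)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        view = np.ndarray(shape=(n_records,), dtype=np.uint32, buffer=mm,
                          offset=offset, strides=(record_size,))
        data = view.copy()  # copy while the map is still open
        del view            # drop the view before the map is closed
    return data
```

With a layout such as record_size=10 and offset=3 the view is unaligned, which NumPy can still copy from; the np.memmap version with dtype=np.uint32 cannot express that layout.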
