
Quickly Sampling Large Number Of Rows From Large Dataframes In Python

I have a very large dataframe (about 1.1M rows) and I am trying to sample it. I have a list of indices (about 70,000) that I want to select from the entire dataframe.

Solution 1:

We don't have your data, so here is an example with two options:

  1. after reading: use a pandas Index object to select a subset via the .iloc selection method
  2. while reading: a predicate with the skiprows parameter

Given

A collection of indices and a (large) sample DataFrame written to test.csv:

import pandas as pd
import numpy as np


# row numbers to sample
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

# a large frame of random integers; the RangeIndex is written to the file as column 0
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A    1000000 non-null int32
B    1000000 non-null int32
C    1000000 non-null int32
D    1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB

Code

Option 1 - after reading

Convert a sample list of indices to an Index object and slice the loaded DataFrame:

idxs = pd.Index(indices)   # positional indexer built from the list
subset = df.iloc[idxs, :]  # select the sampled rows by position
print(subset)

The .iat and .at accessors are even faster, but they require scalar indices (one row/column lookup per call).
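
As a minimal sketch of scalar access (reusing the df and indices defined above; the variable names are illustrative):

row = indices[0]                # a single row to look up

by_position = df.iat[row, 0]    # .iat takes integer positions for row and column
by_label = df.at[row, "A"]      # .at takes labels for row and column

print(by_position, by_label)    # identical here, since the RangeIndex labels equal positions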


Option 2 - while reading (Recommended)

We can write a predicate that skips every row not in our list as the file is being read, so only the selected indices are ever parsed (more efficient):

keep = set(indices)             # set membership is O(1) per row
pred = lambda x: x not in keep  # True means: skip this row while parsing
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
print(data)

See also the issue that led to extending skiprows.
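
To check the efficiency claim on your own machine, a rough timing sketch along these lines (assuming the test.csv and indices built above; timeit is from the standard library) compares the two options:

import timeit

keep = set(indices)

# Option 1: parse the whole file, then slice by position
t1 = timeit.timeit(
    lambda: pd.read_csv("test.csv", index_col=0, names=list("ABCD")).iloc[pd.Index(indices), :],
    number=3,
)

# Option 2: skip unwanted rows during parsing
t2 = timeit.timeit(
    lambda: pd.read_csv("test.csv", skiprows=lambda x: x not in keep, index_col=0, names=list("ABCD")),
    number=3,
)

print(f"after reading: {t1 / 3:.2f} s per run")
print(f"while reading: {t2 / 3:.2f} s per run")

Note that the skiprows callable is invoked from Python once per row, so the actual win depends on the file size and on how many rows you keep.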


Results

Both options produce the same output:

        A   B   C   D
1      74  95  28  42
2      87  34  99  43
3      53  54  34  97
10     58  41  48  15
20     86  20  92  11
30     36  59  22  56
67     49  23  86  63
78     98  63  60  75
900    26  11  71  85
2176   12  73  58  91
78776  42  30  97  96
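
To confirm the equivalence programmatically, a quick sanity check (assuming the subset and data frames built above) could be:

# dtypes can differ after the CSV round-trip (e.g. int32 vs. int64),
# so compare values and index rather than using DataFrame.equals
assert (subset.to_numpy() == data.to_numpy()).all()
assert list(subset.index) == list(data.index)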
