
Pandas: df.groupby() Is Too Slow for a Big Data Set. Any Alternative Methods?

I have a pandas.DataFrame with 3.8 million rows and one column, and I'm trying to group it by the index. The index is the customer ID. I want to group the qty_liter column by the index.
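For reference, a minimal sketch of the intended operation, assuming a sum aggregation and hypothetical file/column names (neither is specified in the question):

import pandas as pd

# hypothetical load: qty_liter indexed by customer ID
df = pd.read_csv("orders.csv", index_col="customer_id")  # "orders.csv" and "customer_id" are assumptions
qty_by_customer = df.groupby(df.index)["qty_liter"].sum()
print(qty_by_customer.head())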

Solution 1:

The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first:

df.index = df.index.astype(int)            # customer IDs as integers, not strings
df.qty_liter = df.qty_liter.astype(float)  # quantities as floats, not strings

Then do groupby() again. It should be much faster. If it is, see if you can modify your data loading step to have the proper dtypes from the beginning.
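For example, if the data is loaded from a CSV, the dtypes can be declared up front so that no string columns are created in the first place. A minimal sketch, assuming a hypothetical orders.csv with customer_id and qty_liter columns:

import pandas as pd

# declare numeric dtypes at load time so groupby works on numbers, not strings
df = pd.read_csv(
    "orders.csv",  # hypothetical file name
    dtype={"customer_id": "int64", "qty_liter": "float64"},
    index_col="customer_id",
)
grouped = df.groupby(df.index)["qty_liter"].sum()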

Solution 2:

Your data is split into a very large number of groups, which is the main reason the groupby is so slow. I tried Bodo to see how it would do with this groupby on a large data set: I ran the code with regular sequential Pandas and with parallelized Bodo. It took about 20 seconds with Pandas and only 5 seconds with Bodo. Bodo basically parallelizes your Pandas code automatically and lets you run it on multiple processors, which you cannot do with native pandas. It is free for up to four cores: https://docs.bodo.ai/latest/source/installation_and_setup/install.html

Notes on data generation: I generated a relatively large dataset with 20 million rows and 18 numerical columns. To make the generated data more similar to your dataset, two extra columns named "index" and "qty_liter" are added.

# data generation
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(20000000, 18), columns = list('ABCDEFGHIJKLMNOPQR'))
df['index'] = np.random.randint(2147400000,2147500000,20000000).astype(str)
df['qty_liter'] = np.random.randn(20000000)

df.to_parquet("data.pq")

With Regular Pandas:

import time
import pandas as pd
import numpy as np

start = time.time()
df = pd.read_parquet("data.pq")
grouped = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(grouped.head())

output:
computation time:  19.29292106628418
index
2147400000    29.701094
2147400001    -7.164031
2147400002   -21.104117
2147400003     7.315127
2147400004   -12.661605
Name: qty_liter, dtype: float64

With Bodo:

%%px

import numpy as np
import pandas as pd
import time
import bodo

@bodo.jit(distributed = ['df'])
def group_by():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df = df.groupby(['index'])['qty_liter'].sum()
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df
    
df = group_by()

output:
[stdout:0] 
computation time:  5.12944599299226
index
2147437531     6.975570
2147456463     1.729212
2147447371    26.358158
2147407055    -6.885663
2147454784    -5.721883
Name: qty_liter, dtype: float64

Disclaimer: I am a data scientist advocate working at Bodo.ai

Solution 3:

I do not use strings, but integer values that define the groups. Still, it is very slow: about 3 minutes in pandas vs. a fraction of a second in Stata. The number of observations is about 113k, and the number of groups defined by x, y, z is about 26k.

a = df.groupby(["x", "y", "z"])["b"].describe()[['max']]

x, y, z: integer values

b: real value
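Note that describe() computes count, mean, std, min, the quartiles, and max for every group, only to throw away everything but max. A sketch of computing just the needed statistic directly, assuming the same df, x, y, z, and b as above; this should avoid most of that work:

# aggregate only the statistic that is actually needed
a = df.groupby(["x", "y", "z"])["b"].max().to_frame("max")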
