
Python Re-sampling Time Series Data Which Can Not Be Indexed

I have time series data which cannot be indexed. The purpose of this question is to find out how many trades happened in each second (count) as well as the total volume traded (sum).

Solution 1:

First you need a column at one-second resolution (for example seconds since the epoch), then group by that column, and then aggregate the columns you want.

You want to floor each timestamp down to one-second accuracy and group on that, then apply an aggregation to get the mean/sum/std or whatever you need:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
# casting to second precision floors every timestamp to the whole second
df['dateTime'] = df['dateTime'].astype('datetime64[s]')
groups = df.groupby('dateTime')
groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
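
Since the question also asks for the number of trades per second, you can aggregate the isTrade flag together with tradeVolume; here is a minimal sketch using pandas named aggregation (pandas 0.25+), where the result column names trades and volume are just illustrative:

per_second = df.groupby('dateTime').agg(
    trades=('isTrade', 'sum'),       # rows flagged as trades in that second
    volume=('tradeVolume', 'sum'),   # total volume traded in that second
)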

I modified the data to make sure there are actually different seconds in it:

SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:10.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:10.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0

and the output

In [53]: groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
Out[53]: 
               ask1  tradeVolume
seconds                         
1511164809  12869.0           10
1511164810  12869.0           10
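
If you would rather stick with resampling, as the title suggests, resample can also work on a plain datetime column via the on= keyword, without the data being indexed; a sketch of the same aggregation, assuming dateTime has been parsed with pd.to_datetime:

df['dateTime'] = pd.to_datetime(df['dateTime'])
# '1s' bins the rows into one-second buckets keyed by the dateTime column
per_second = df.resample('1s', on='dateTime').agg({'ask1': 'mean', 'tradeVolume': 'sum'})

Note that, unlike the groupby above, resample also emits bins for seconds in which nothing happened at all.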

Footnote

OP said that the original version (below) was faster, so I ran some timings:

import numpy as np
import pandas as pd


def test1(df):
    """This is the fastest and cleanest."""
    df['dateTime'] = df['dateTime'].astype('datetime64[s]')
    groups = df.groupby('dateTime')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


def test2(df):
    """Totally unnecessary amount of datetime floors."""
    def group_by_second(index_loc):
        return df.loc[index_loc, 'dateTime'].floor('S')

    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    groups = df.groupby(group_by_second)
    result = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


def test3(df):
    """Original version, but the conversion to/from nanoseconds is unnecessary."""
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    df['seconds'] = df['dateTime'].apply(lambda v: v.value // 1e9)
    groups = df.groupby('seconds')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})

if __name__ == '__main__':
    import timeit
    print('22 rows')
    df = pd.read_csv('data_small.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

    print('220 rows')
    df = pd.read_csv('data.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

I tested those on two datasets, the second ten times the size of the first; the results:

22 rows
test1 [0.08138518501073122, 0.07786444900557399, 0.0775048139039427]
test2 [0.2644687460269779, 0.26298125297762454, 0.2618108610622585]
test3 [0.10624988097697496, 0.1028324980288744, 0.10304366517812014]

220 rows
test1 [0.07999306707642972, 0.07842653687112033, 0.07848454895429313]
test2 [1.9794962559826672, 1.966513831866905, 1.9625889619346708]
test3 [0.12691736104898155, 0.12642419710755348, 0.126510804053396]

So it's best to use the .astype('datetime64[s]') version, as that is the fastest and scales the best.
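
If you prefer an explicit floor over the dtype cast, a vectorized .dt.floor on the whole column should be far cheaper than the per-group floor in test2; a rough sketch of that variant (test4 is just an illustrative name and was not part of the timings above):

def test4(df):
    """Floor every timestamp to the second in one vectorized call, then aggregate."""
    df['dateTime'] = pd.to_datetime(df['dateTime'])
    df['second'] = df['dateTime'].dt.floor('s')
    return df.groupby('second').agg({'ask1': np.mean, 'tradeVolume': np.sum})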
