Python Re-sampling Time Series Data Which Can Not Be Indexed
I have time series data which cannot be indexed. The purpose of this question is to find out how many trades happened in each second (a count) as well as the total volume traded in that second (a sum).
Solution 1:
First you need a column holding the seconds (since the epoch), then group by that column, and then run an aggregation on the columns you want.
In other words: floor each timestamp down to one-second accuracy, group on that, and apply an aggregation to get the mean/sum/std or whatever you need:
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
df['dateTime'] = df['dateTime'].astype('datetime64[s]')  # truncate to whole seconds
groups = df.groupby('dateTime')
groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
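The question title mentions re-sampling; an equivalent way to get per-second counts and sums is pandas' resample, shown here as a minimal sketch (my own variant, not part of the answer above; it assumes the isTrade column is a 0/1 flag marking trade rows):

import pandas as pd

# Parse timestamps on read, index by them, and bin into one-second buckets
# ('1s' is the seconds alias; older pandas spells it '1S').
df = pd.read_csv('data.csv', parse_dates=['dateTime'])
per_second = df.set_index('dateTime').resample('1s').agg(
    {'isTrade': 'sum',        # number of trade rows per second (assumes 0/1 flag)
     'tradeVolume': 'sum'})   # total volume traded per second
print(per_second)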
I modified the data to make sure there are actually different seconds in it:
SecurityID,dateTime,ask1,ask1Volume,bid1,bid1Volume,ask2,ask2Volume,bid2,bid2Volume,ask3,ask3Volume,bid3,bid3Volume,tradePrice,tradeVolume,isTrade
2318276,2017-11-20 08:00:09.052240,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,0.0,0,0
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,3,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12861.0,1,1
2318276,2017-11-20 08:00:09.052260,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,2,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052270,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052282,12869.0,1,12868.0,1,12870.0,19,12867.5,2,12872.5,2,12867.0,1,12868.0,1,1
2318276,2017-11-20 08:00:09.052282,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12868.0,1,0
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,2,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052291,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,0
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.5,1,1
2318276,2017-11-20 08:00:09.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12867.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.5,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12865.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12867.5,1,12870.0,19,12867.0,1,12872.5,2,12865.5,1,12864.0,1,1
2318276,2017-11-20 08:00:10.052315,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12864.0,1,0
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,2,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052335,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,0
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.5,1,12870.0,19,12861.0,1,12872.5,2,12860.0,5,12861.5,1,1
2318276,2017-11-20 08:00:10.052348,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.5,1,0
2318276,2017-11-20 08:00:10.052357,12869.0,1,12861.0,1,12870.0,19,12860.0,5,12872.5,2,12859.5,3,12861.0,1,1
2318276,2017-11-20 08:00:10.052357,12869.0,1,12860.0,5,12870.0,19,12859.5,3,12872.5,2,12858.0,1,12861.0,1,0
and the output:
In [53]: groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
Out[53]:
               ask1  tradeVolume
seconds
1511164809  12869.0           10
1511164810  12869.0           10
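The seconds index holds Unix epoch seconds (1511164809 is 2017-11-20 08:00:09 UTC). If you group on epoch seconds like this and want readable timestamps back, a small sketch (assuming the index is numeric epoch seconds, as above):

import pandas as pd

# Map epoch seconds back to timestamps; values taken from the index above.
secs = pd.Index([1511164809, 1511164810], name='seconds')
pd.to_datetime(secs, unit='s')   # -> 2017-11-20 08:00:09, 2017-11-20 08:00:10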
Footnote
The OP said that the original version (below) was faster, so I ran some timings:
import numpy as np
import pandas as pd


def test1(df):
    """This is the fastest and cleanest."""
    df['dateTime'] = df['dateTime'].astype('datetime64[s]')
    groups = df.groupby('dateTime')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


def test2(df):
    """Totally unnecessary amount of datetime floors."""
    def group_by_second(index_loc):
        return df.loc[index_loc, 'dateTime'].floor('S')

    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    groups = df.groupby(group_by_second)
    result = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})


def test3(df):
    """Original version, but the conversion to/from nanoseconds is unnecessary."""
    df['dateTime'] = df['dateTime'].astype('datetime64[ns]')
    df['seconds'] = df['dateTime'].apply(lambda v: v.value // 1e9)
    groups = df.groupby('seconds')
    agg = groups.agg({'ask1': np.mean, 'tradeVolume': np.sum})
if __name__ == '__main__':
    import timeit

    print('22 rows')
    df = pd.read_csv('data_small.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))

    print('220 rows')
    df = pd.read_csv('data.csv')
    print('test1', timeit.repeat("test1(df.copy())", number=50, globals=globals()))
    print('test2', timeit.repeat("test2(df.copy())", number=50, globals=globals()))
    print('test3', timeit.repeat("test3(df.copy())", number=50, globals=globals()))
I tested those on two datasets, the second 10 times the size of the first; the results:
22 rows
test1 [0.08138518501073122, 0.07786444900557399, 0.0775048139039427]
test2 [0.2644687460269779, 0.26298125297762454, 0.2618108610622585]
test3 [0.10624988097697496, 0.1028324980288744, 0.10304366517812014]
220 rows
test1 [0.07999306707642972, 0.07842653687112033, 0.07848454895429313]
test2 [1.9794962559826672, 1.966513831866905, 1.9625889619346708]
test3 [0.12691736104898155, 0.12642419710755348, 0.126510804053396]
So, it is best to use the .astype('datetime64[s]') version, as that is the fastest and scales the best.
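Since the question also asks how many trades happened in each second, here is a minimal sketch built on that fast version (assuming, as with the resample sketch earlier, that isTrade is a 0/1 flag marking trade rows; newer pandas may prefer pd.to_datetime(...).dt.floor('s') over the astype cast):

import pandas as pd

df = pd.read_csv('data.csv')
df['dateTime'] = df['dateTime'].astype('datetime64[s]')   # truncate to whole seconds
per_second = df.groupby('dateTime').agg(
    {'isTrade': 'sum',        # trade count per second (0/1 flag summed)
     'tradeVolume': 'sum'})   # total volume per second
print(per_second)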