Filtering Grouped Df In Dask
This is related to a similar question for pandas: filtering grouped df in pandas.
Action: eliminate groups based on an expression applied to a column other than the groupby column.
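For reference, the pandas operation being asked about can be sketched as below. This is only an illustration under assumptions: the question's own data is not shown here, so the column names `C` and `D` and the condition `C != 4` are borrowed from the answers that follow, and "not all 4" is read as "at least one row in the group has C != 4".

```python
import pandas as pd

# Small illustrative frame (assumed data, reusing columns C and D from the answers).
df = pd.DataFrame({'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0]})

# Group by D, but decide which groups to keep from column C:
# a group survives if at least one of its rows has C != 4.
kept = df.groupby('D').filter(lambda g: (g['C'] != 4).any())
print(kept)
```

The group D == 7 contains only a row with C == 4, so it is the one group dropped; the answers below look for an equivalent in dask.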
Solution 1:
I think you can use groupby + size first, then map on the Series (it works like transform, which is also not implemented in dask), and finally filter by boolean indexing:
import pandas as pd

df = pd.DataFrame({'A': list('aacaaa'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbc')})
print(df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  a  5  8  3  3  a
2  c  4  9  5  6  a
3  a  5  4  7  9  b
4  a  5  2  1  2  b
5  a  4  3  0  4  c
a = df.groupby('F')['A'].size()
print (a)
F
a 3
b 2
c 1
Name: A, dtype: int64
s = df['F'].map(a)
print (s)
0 3
1 3
2 3
3 2
4 2
5 1
Name: F, dtype: int64
df = df[s > 1]
print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  a  5  8  3  3  a
2  c  4  9  5  6  a
3  a  5  4  7  9  b
4  a  5  2  1  2  b
EDIT:
I think groupby is not necessary here:
df_notall4 = df[df.C != 4].drop_duplicates(subset=['A','D'])['D'].compute()
But if you really need it:
def filter_4(x):
    return x[x.C != 4]
df_notall4 = df.groupby('A').apply(filter_4, meta=df).D.unique().compute()
print (df_notall4)
0    1
1    3
2    0
3    5
Name: D, dtype: int64
Solution 2:
Thanks to @jezrael I reviewed my implementation and created the following solution (see my provided example).
import dask.dataframe as dd

# df is the dask DataFrame from the example
df_notall4 = []
for d in list(df[df.C != 4].D.unique().compute()):
    df_notall4.append(df.groupby('D').get_group(d))
df_notall4 = dd.concat(df_notall4, interleave_partitions=True)
Which results in
In [8]: df_notall4.D.unique().compute()
Out[8]:
0    1
1    3
2    5
3    0
Name: D, dtype: object