How To Extract Dataframe By Row Values By Conditions With Other Columns?
I have a dataframe as follows: #values a=['003C', '003P1', '003P1', '003P1', '004C', '004P1', '004P2', '003C', '003P2', '003P1', '003C', '003P1', '003P2', '003C', '003P1', '004C',
Solution 1:
Solution
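# Columns that identify one group; INT (the numeric sample prefix) is created below
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
# Split Sample into its numeric prefix (INT) and its suffix (STR)
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
# m: the suffix is one of C, P1, P2
m = df['STR'].isin(['C', 'P1', 'P2'])
# m1: the row's group contains at least one C
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
# m2: the row's group contains at least two distinct allowed suffixes
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)
df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(['INT', 'STR'], axis=1)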
Explanations
Extract the columns INT and STR from Sample by using str.extract with a regex pattern: (\d+) captures the leading digits and (.*) captures whatever follows them.
>>> df[['INT','STR']]
    INT STR
0   003   C
1   003  P1
2   003  P1
3   003  P1
4   004   C
5   004  P1
6   004  P2
7   003   C
8   003  P2
9   003  P1
10  003   C
11  003  P1
12  003  P2
13  003   C
14  003  P1
15  004   C
16  004  P2
17  001   C
18  001  P1
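As a minimal sketch of the extraction step on its own, the snippet below applies the same pattern to a few sample labels. The labels are hypothetical, chosen only to resemble the ones in the question.

```python
import pandas as pd

# Hypothetical sample labels, shaped like the ones in the question.
samples = pd.Series(['003C', '003P1', '004P2', '001C'], name='Sample')

# (\d+) captures the leading digits, (.*) captures everything after them.
# With two capture groups, str.extract returns a two-column DataFrame.
parts = samples.str.extract(r'(\d+)(.*)')
parts.columns = ['INT', 'STR']

print(parts)
```

Note that INT stays a string column, so the leading zeros in values such as 003 are preserved.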
Create a boolean mask m using isin to check that the extracted column STR contains only the values C, P1 and P2.
>>> m
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
Name: STR, dtype: bool
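In the question's data every suffix is allowed, so m is True everywhere. A small sketch with a couple of made-up suffixes (P3 and X are hypothetical, not from the question) shows what the mask would reject:

```python
import pandas as pd

# Hypothetical suffixes; 'P3' and 'X' are not in the allowed set.
suffix = pd.Series(['C', 'P1', 'P2', 'P3', 'X'], name='STR')

# isin returns True only where the value is one of the listed labels.
allowed = suffix.isin(['C', 'P1', 'P2'])

print(allowed.tolist())
```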
Compare the STR column with C to create a boolean mask, then group this mask on the columns ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using any to create a boolean mask m1. It is True for every row whose group contains at least one C.
>>> m1
0      True
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
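The group-then-transform idea can be sketched on a toy frame. This is an illustration with hypothetical data and a single grouping key (POS) instead of the five keys used in the answer:

```python
import pandas as pd

# Hypothetical data: the group at POS 100 has a C, the group at POS 200 does not.
df = pd.DataFrame({
    'POS': [100, 100, 200, 200],
    'STR': ['C', 'P1', 'P1', 'P2'],
})

# Row-level test: is this row a C?
is_control = df['STR'].eq('C')

# transform('any') broadcasts the per-group result back to every row,
# so the output has the same length and index as the input.
has_control = is_control.groupby(df['POS']).transform('any')

print(has_control.tolist())
```

Because transform keeps the original index, the result can be combined directly with other row-level masks.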
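Mask the values in column STR where the boolean mask m is False (the code uses ~m), then group this masked column by ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform using nunique, then chain with ge(2) to create a boolean mask m2. It is True for every row whose group contains at least two distinct allowed values.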
>>> m2
0     False
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
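A rough sketch of why the masking matters: nunique skips missing values by default, so a disallowed suffix that has been masked to NaN does not count toward the two distinct labels. The data below is hypothetical and, for brevity, grouped on POS only.

```python
import pandas as pd

# Hypothetical data: POS 100 has C and P1, POS 200 has only C,
# POS 300 has C plus a disallowed suffix 'X'.
df = pd.DataFrame({
    'POS': [100, 100, 200, 200, 300, 300],
    'STR': ['C', 'P1', 'C', 'C', 'C', 'X'],
})

allowed = df['STR'].isin(['C', 'P1', 'P2'])

# mask replaces values with NaN where the condition is True,
# so ~allowed blanks out every disallowed suffix.
masked = df['STR'].mask(~allowed)

# nunique ignores NaN by default; ge(2) keeps groups with two or more labels.
n_labels = masked.groupby(df['POS']).transform('nunique')
enough = n_labels.ge(2)

print(n_labels.tolist())
print(enough.tolist())
```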
Now take the logical and of the masks m, m1 and m2, and use this to filter the required rows in the dataframe.
>>> df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(['INT', 'STR'], axis=1)
    Sample  CHROM  POS       REF  ALT
0   003C    chr1125895       T    A
1   003P1   chr1125895       T    A
2   004C    chr11  1163940   C    G
3   004P1   chr11  1163940   C    G
4   004P2   chr11  1163940   C    G
5   004C    chr11  2587895   C    G
6   004P2   chr11  2587895   C    G
7   003C    chr11  5986513   G    A
8   003P2   chr11  5986513   G    A
9   003P1   chr11  5986513   G    A
10  001C    chr9   14587952  T    C
11  001P1   chr9   14587952  T    C
12  003C    chr1248650751    T    A
13  003P1   chr1248650751    T    A
14  003P2   chr1248650751    T    A
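For readers who want to run the whole pipeline, here is a self-contained sketch under stated assumptions: the dataframe is hypothetical (the question's full data is not reproduced here), the grouping keys are passed as a list of columns rather than as [*df[c].values.T], and drop(columns=...) is used in place of the positional axis argument, which recent pandas versions no longer accept.

```python
import pandas as pd

# Hypothetical data: sample 003 has C and P1, sample 004 has C, P1 and P2,
# sample 005 has P1 and P2 but no C.
df = pd.DataFrame({
    'Sample': ['003C', '003P1', '004C', '004P1', '004P2', '005P1', '005P2'],
    'CHROM':  ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3', 'chr3'],
    'POS':    [300, 300, 100, 100, 100, 200, 200],
    'REF':    ['T', 'T', 'C', 'C', 'C', 'G', 'G'],
    'ALT':    ['A', 'A', 'G', 'G', 'G', 'A', 'A'],
})

keys = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT', 'STR']] = df['Sample'].str.extract(r'(\d+)(.*)')

# One Series per key; this groups the same way as [*df[keys].values.T].
groups = [df[k] for k in keys]

m = df['STR'].isin(['C', 'P1', 'P2'])                                # allowed suffix
m1 = df['STR'].eq('C').groupby(groups).transform('any')              # group has a C
m2 = df['STR'].mask(~m).groupby(groups).transform('nunique').ge(2)   # 2+ labels

result = (df[m & m1 & m2]
          .sort_values('POS', ignore_index=True)
          .drop(columns=['INT', 'STR']))

print(result)
```

In this sketch the 005 rows are dropped because their group has no C, while the 003 and 004 groups are kept and ordered by POS.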