How To Extract Dataframe By Row Values By Conditions With Other Columns?

I have a dataframe as follows: #values a=['003C', '003P1', '003P1', '003P1', '004C', '004P1', '004P2', '003C', '003P2', '003P1', '003C', '003P1', '003P2', '003C', '003P1', '004C',

Solution 1:

Solution

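# Grouping keys; 'INT' is created by the extract step below
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
# Split Sample into its numeric prefix (INT) and its label suffix (STR)
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')

# m: label is one of the allowed values
m  = df['STR'].isin(['C', 'P1', 'P2'])
# m1: the row's group contains at least one C row
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
# m2: the row's group has at least two distinct allowed labels
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)

df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])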

Explanations

Extract the columns INT and STR by using str.extract with a regex pattern

>>> df[['INT','STR']]

    INT STR
0   003   C
1   003  P1
2   003  P1
3   003  P1
4   004   C
5   004  P1
6   004  P2
7   003   C
8   003  P2
9   003  P1
10  003   C
11  003  P1
12  003  P2
13  003   C
14  003  P1
15  004   C
16  004  P2
17  001   C
18  001  P1
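
As a minimal sketch of the extraction step on its own, the snippet below runs str.extract on a few sample IDs. The `samples` Series is hypothetical and only modelled on the Sample column above; it is not the question's data.

```python
import pandas as pd

# Hypothetical sample IDs, shaped like the 'Sample' values in the question.
samples = pd.Series(['003C', '003P1', '004P2'], name='Sample')

# Two capture groups give two columns: the leading digits, then the rest.
parts = samples.str.extract(r'(\d+)(.*)')
parts.columns = ['INT', 'STR']
print(parts)
```

Because the digits are captured as strings, the leading zeros in INT are preserved.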

Create a boolean mask m with isin that is True where the extracted STR value is one of C, P1 or P2

>>> m

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
Name: STR, dtype: bool

Compare the STR column with C to get a boolean Series, group it by the columns ['CHROM', 'POS', 'REF', 'ALT', 'INT'], and transform with any. The resulting mask m1 is True for every row whose group contains at least one C row

>>> m1

0      True
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
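
To see how grouping a boolean Series and transforming with any behaves, here is a small sketch on a hypothetical frame (`df_demo` and its values are assumptions for illustration). A single key column stands in for the five grouping columns.

```python
import pandas as pd

# Hypothetical frame: two positions, only the first one has a 'C' row.
df_demo = pd.DataFrame({
    'POS': [100, 100, 200, 200],
    'STR': ['C', 'P1', 'P1', 'P2'],
})

# eq('C') marks the C rows; transform('any') broadcasts "group has a C"
# back to every row of that group, keeping the original index.
has_c = df_demo['STR'].eq('C').groupby(df_demo['POS']).transform('any')
print(has_c.tolist())
```

Both rows at POS 100 come out True, and both rows at POS 200 come out False, because transform returns one value per input row rather than one per group.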

Mask the values in column STR where the boolean mask m is False (they become NaN), then group this masked column by ['CHROM', 'POS', 'REF', 'ALT', 'INT'] and transform with nunique, which ignores NaN. Chaining ge(2) gives the boolean mask m2, which is True for rows whose group has at least two distinct labels

>>> m2

0     False
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
Name: STR, dtype: bool
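
The mask-then-nunique step can be sketched in isolation as follows. The frame `df_demo` and the label 'X' are hypothetical; 'X' is there only to show that a disallowed label does not count towards the number of distinct labels.

```python
import pandas as pd

# Hypothetical frame: POS 100 has two allowed labels,
# POS 200 has one allowed label plus a disallowed 'X'.
df_demo = pd.DataFrame({
    'POS': [100, 100, 200, 200, 200],
    'STR': ['C', 'P1', 'C', 'C', 'X'],
})

allowed = df_demo['STR'].isin(['C', 'P1', 'P2'])

# mask(~allowed) turns disallowed labels into NaN; nunique skips NaN,
# so only allowed labels are counted per group.
n_labels = (df_demo['STR'].mask(~allowed)
            .groupby(df_demo['POS'])
            .transform('nunique'))
enough = n_labels.ge(2)
print(n_labels.tolist(), enough.tolist())
```

Without the mask, the POS 200 group would count 'C' and 'X' as two distinct labels and pass the ge(2) test.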

Now take the logical and of the masks m, m1 and m2, and use this to filter the required rows in the dataframe

>>> df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(columns=['INT', 'STR'])

   Sample       CHROMPOS REF ALT
0    003C     chr1125895   T   A
1   003P1     chr1125895   T   A
2    004C   chr111163940   C   G
3   004P1   chr111163940   C   G
4   004P2   chr111163940   C   G
5    004C   chr112587895   C   G
6   004P2   chr112587895   C   G
7    003C   chr115986513   G   A
8   003P2   chr115986513   G   A
9   003P1   chr115986513   G   A
10   001C   chr914587952   T   C
11  001P1   chr914587952   T   C
12   003C  chr1248650751   T   A
13  003P1  chr1248650751   T   A
14  003P2  chr1248650751   T   A
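
For readers who want to run the whole filter end to end, here is a sketch on a small hypothetical dataset (the rows below are made up, not the question's data). It groups by column names through a temporary frame instead of `[*df[c].values.T]`; that is a stylistic variation, not the answer's exact code, and the helper columns `is_c` and `label` are names chosen for this illustration.

```python
import pandas as pd

# Hypothetical data: three variants, one per CHROM.
df = pd.DataFrame({
    'Sample': ['003C', '003P1', '004P1', '004P2', '005C'],
    'CHROM':  ['chr1', 'chr1', 'chr2', 'chr2', 'chr3'],
    'POS':    [300, 300, 200, 200, 100],
    'REF':    ['T', 'T', 'C', 'C', 'G'],
    'ALT':    ['A', 'A', 'G', 'G', 'A'],
})
keys = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT', 'STR']] = df['Sample'].str.extract(r'(\d+)(.*)')

m = df['STR'].isin(['C', 'P1', 'P2'])
# Helper columns so groupby can take plain column names.
tmp = df.assign(is_c=df['STR'].eq('C'), label=df['STR'].where(m))
m1 = tmp.groupby(keys)['is_c'].transform('any')            # group has a C row
m2 = tmp.groupby(keys)['label'].transform('nunique').ge(2)  # >= 2 distinct labels

result = (df[m & m1 & m2]
          .sort_values('POS', ignore_index=True)
          .drop(columns=['INT', 'STR']))
print(result)
```

Only the chr1 group survives here: the chr2 group has no C row, and the chr3 group has a C row but only one distinct label.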
