Remove Duplicates By Columns A, Keeping The Row With The Highest Value In Column B
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B. So this:
A   B
1  10
1  20
2  30
2  40
3  10
Should tur
Solution 1:
This keeps the last row for each value of A, which is not necessarily the row with the maximum B:
In[10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
   A   B
1  1  20
3  2  40
4  3  10
You can also do something like:
In[12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
   A   B
A
1  1  20
2  2  40
3  3  10
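The same idea can be written without apply by asking groupby for the index label of each group's maximum and selecting those rows. This is only a sketch on the sample data from the question; the variable names winners and result are made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# For each value of A, idxmax gives the index label of the row with the largest B.
winners = df.groupby("A")["B"].idxmax()

# Select those rows by label; the original index labels are preserved.
result = df.loc[winners]
print(result)
```

Because whole rows are selected by label, any other columns in the frame come along unchanged.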
Solution 2:
The top answer does too much work and looks to be very slow on larger data sets. apply is slow and should be avoided if possible. ix is deprecated (it was removed in pandas 1.0) and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
   A   B
1  1  20
3  2  40
4  3  10
Or simply group by all the other columns and take the max of the column you need:
df.groupby('A', as_index=False).max()
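For anyone who wants to try the sort-then-drop approach end to end, here is a small self-contained sketch using the data from the question. The frame is rebuilt by hand, so the index labels are an assumption (a default 0..4 range index).

```python
import pandas as pd

# Sample data from the question, with a default integer index.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

result = (
    df.sort_values("B", ascending=False)  # rows with the largest B come first
      .drop_duplicates("A")               # default keep="first" retains the top row per A
      .sort_index()                       # put the survivors back in original order
)
print(result)
```

The final sort_index is optional; it only restores the original row order.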
Solution 3:
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
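Note that keep='last' on its own only returns the highest B when the rows already happen to be ordered that way. A rough illustration, using a made-up frame where the larger B for A=1 comes first:

```python
import pandas as pd

# Hypothetical data: for A == 1 the larger B (20) appears before the smaller one (10).
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [20, 10, 30, 40, 10]})

# keep="last" keeps whichever row appears last, regardless of B.
naive = df.drop_duplicates("A", keep="last")

# Sorting by B ascending first makes "last" coincide with "largest B".
fixed = df.sort_values("B").drop_duplicates("A", keep="last").sort_index()

print(naive)
print(fixed)
```

So this answer fits the question only if it is combined with a sort on B.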
Solution 4:
I would first sort the dataframe by column B in descending order, then drop duplicates on column A and keep the first:
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
This works without any groupby.
Solution 5:
Try this:
df.groupby(['A']).max()
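One caveat worth checking: max() is computed per column, so with more than two columns the result can combine values from different rows. The sketch below assumes an extra column C that is not in the question, purely to show the difference.

```python
import pandas as pd

# Data from the question plus a hypothetical third column C.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": [10, 20, 30, 40, 10],
    "C": [9, 1, 8, 2, 5],
})

# Column-wise max: B and C are maximised independently within each group.
colmax = df.groupby("A", as_index=False).max()

# Row-wise selection: keep the whole row that holds the largest B.
rowmax = df.loc[df.groupby("A")["B"].idxmax()]

print(colmax)  # for A == 1 this pairs B=20 with C=9, a row that never existed
print(rowmax)  # for A == 1 this keeps the actual row B=20, C=1
```

With only columns A and B, as in the question, the two approaches give the same values.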