Remove Duplicates By Column A, Keeping The Row With The Highest Value In Column B

March 07, 2024

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B. So this:

A   B
1  10
1  20
2  30
2  40
3  10

Should tur

Solution 1:

This keeps the last row for each value of A, which is not necessarily the row with the maximum B:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like this:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A
1  1  20
2  2  40
3  3  10
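The same idxmax idea can also be written without apply. The following is a minimal sketch, assuming the sample data from the question and a unique index; it is one possible variant, not the method used in the answer above.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# For each value of A, idxmax returns the index label of the row with the largest B.
best_rows = df.groupby("A")["B"].idxmax()

# Select those rows from the original frame; every column of the row is preserved.
result = df.loc[best_rows]
print(result)
```

Because the rows are selected by index label, this relies on the index being unique. If it is not, calling reset_index(drop=True) first is one way around that.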

Solution 2:

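The top answer does too much work and looks to be very slow for larger data sets. apply calls a Python function once per group, so it is slow and should be avoided where possible. ix is deprecated (and has been removed from pandas since version 1.0), so it should be avoided as well; use loc instead.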

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
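The two approaches in this solution agree on the two-column example, but they can differ once the frame has more columns. A minimal sketch, assuming a hypothetical extra column C that is not part of the original question:

```python
import pandas as pd

# Column C is a hypothetical extra column added purely for illustration.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3],
    "B": [10, 20, 30, 40, 10],
    "C": ["z", "a", "y", "b", "c"],
})

# Sort-then-drop keeps whole rows, so each C stays paired with its own B.
by_sort = df.sort_values("B", ascending=False).drop_duplicates("A").sort_index()

# groupby().max() takes the maximum of every column independently,
# so B and C in one output row may come from different input rows.
by_max = df.groupby("A", as_index=False).max()

print(by_sort)
print(by_max)
```

Here by_sort pairs B=20 with C="a" for A=1, while by_max reports C="z" for the same group. So the groupby/max form is probably only a safe substitute when B is the only non-key column.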

Solution 3:

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
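Note that keep='last' keeps the last occurrence in the current row order, which is only the highest value if the rows are already ordered by that value. A minimal sketch, assuming hypothetical data in which the largest B is not the last row of each group:

```python
import pandas as pd

# Hypothetical data: within each A, the larger B comes first.
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [20, 10, 40, 30]})

# Keeps the last row per A as the frame is currently ordered.
last_only = df.drop_duplicates("A", keep="last")

# Sorting by B ascending first makes "last" coincide with "largest B".
sorted_first = df.sort_values("B").drop_duplicates("A", keep="last")

print(last_only)
print(sorted_first)
```

On this data last_only keeps B values 10 and 30, while sorted_first keeps 20 and 40.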

Solution 4:

I would first sort the dataframe by column B in descending order, then drop duplicates on column A, keeping the first occurrence:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

This needs no groupby.
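The sort-then-drop steps can be wrapped into a small reusable function. This is only a sketch: keep_max_row and its parameter names are hypothetical, and the trailing sort_index (to restore the original row order) is an addition not present in the answer above.

```python
import pandas as pd


def keep_max_row(frame, keys, value):
    """Hypothetical helper: for each combination of `keys`,
    keep the row with the largest `value`."""
    return (
        frame.sort_values(by=value, ascending=False)   # largest value first
             .drop_duplicates(subset=keys, keep="first")  # first row per key wins
             .sort_index()                                # restore original order
    )


# Sample data taken from the question.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})
result = keep_max_row(df, "A", "B")
print(result)
```

Since drop_duplicates accepts either a single label or a list for subset, keys can also be a list of column names.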

Solution 5:

Try this:

df.groupby(['A']).max()
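One thing to be aware of is that this form moves A into the index of the result. A minimal sketch, assuming the sample data from the question, showing one way to get A back as a regular column:

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# A becomes the index of the grouped result; only B remains as a column.
grouped = df.groupby(["A"]).max()

# reset_index turns the A index back into an ordinary column.
flat = grouped.reset_index()

print(grouped)
print(flat)
```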
