Groupby Pandas, Calculate Multiple Columns Based On Date Difference

February 23, 2024

I have a pandas dataframe shown below:

CID  RefID  Date       Group  MID
100  1      1/01/2021  A
100  2      3/01/2021  A

Solution 1:

You could do something like this:

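import pandas as pd

def days_diff(sdf):
    # One result row per input row; days_diff stays NaT where a new block starts
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    start = sdf.at[sdf.index[0], "Date"]
    # shift(1) moves MID one row down, so the flag describes the preceding row
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day  # start a new block at the current row
    # Every NaT opens a new block, so the running count numbers the blocks
    result.A = result.days_diff.isna().cumsum()
    return result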

# group_keys=False keeps the original row index, so the result aligns with df
df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID", group_keys=False).apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()

Result for df created by

from io import StringIO
data = StringIO(
'''
CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103
''')
df = pd.read_csv(data, sep=r"\s+")  # whitespace-separated columns
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

    CID  RefID        Date  Group    MID  days_diff  A     B
0   100      1  2021-01-01      A    NaN        NaT  1   1.0
1   100      2  2021-01-03      A    NaN          2  1   1.0
2   100      3  2021-01-04      A  101.0          3  1   1.0
3   100      4  2021-01-15      A    NaN        NaT  2   4.0
4   100      5  2021-01-18      A    NaN          3  2   4.0
5   200      6  2021-03-03      B    NaN        NaT  1   6.0
6   200      7  2021-04-04      B    NaN        NaT  2   7.0
7   200      8  2021-04-09      B  102.0          5  2   7.0
8   200      9  2021-04-25      B    NaN        NaT  3   9.0
9   300     10  2021-04-26      C    NaN        NaT  1  10.0
10  300     11  2021-05-27      C    NaN        NaT  2  11.0
11  300     12  2021-05-28      C  103.0          1  2  11.0
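
Row 3 of the result resets (days_diff is NaT) although 2021-01-15 is only 14 days after the first date of its group. A minimal sketch of why, using just the Date and MID values of the CID 100 rows from the example data; the names prev_mid_is_na and diff_row3 are chosen here for illustration and are not part of the solution:

```python
import pandas as pd

# Date and MID of the five CID 100 rows from the example data
sdf = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["1/01/2021", "3/01/2021", "4/01/2021", "15/01/2021", "18/01/2021"],
            format="%d/%m/%Y",
        ),
        "MID": [float("nan"), float("nan"), 101.0, float("nan"), float("nan")],
    }
)

# shift(1) moves each MID one row down, so row i sees the MID of row i - 1
prev_mid_is_na = sdf.MID.shift(1).isna()
print(prev_mid_is_na.tolist())  # [True, True, True, False, True]

# Row 3 is within 30 days of the first date, but the row before it has a MID
diff_row3 = (sdf.Date[3] - sdf.Date[0]).days
print(diff_row3, prev_mid_is_na[3])  # 14 False
```

Because the flag is False for row 3, the condition in days_diff fails there and start moves to 2021-01-15, which is why row 4 shows a difference of 3.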

A few explanations:

The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to each sub-dataframe of df obtained by grouping on column CID.
First step: initialize the result dataframe result (column days_diff filled with NaT, column A with None), and set the starting value start for the day differences to the first day in the group.
Afterwards, loop over the sub-dataframe from the second row onwards, grabbing the index, the value in column Date, and a boolean next_MID_is_na. Despite its name, this flag tells whether the MID value in the previous row is NaN, because .shift(1) moves each value one row down before .isna() is applied.
In every step of the loop:
1. Calculate the difference between the current day and the start day.
2. Check the rules for the days_diff column:
  - If the difference between the current and the start day is <= 30 days and the MID of the previous row is NaN -> day difference.
  - Otherwise -> reset start to the current day.
Once the days_diff column is finished, column A is calculated: result.days_diff.isna() is True (== 1) when days_diff is missing (NaT), False (== 0) otherwise. Therefore the cumulative sum (.cumsum()) gives the required result.
After the groupby-apply has produced the columns days_diff and A, column B is calculated: select the RefID values where A changes (via .where(df.A != df.A.shift(1))), then forward fill the remaining NaNs.
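
The last two steps can be tried in isolation. The following is only a small sketch on hand-typed values matching the CID 100 rows of the result; the Series names days_diff, ref_id, a and b are illustrative stand-ins for the dataframe columns:

```python
import pandas as pd

# days_diff and RefID values of the five CID 100 rows in the result
days_diff = pd.Series([None, 2, 3, None, 3], dtype="float64")
ref_id = pd.Series([1, 2, 3, 4, 5])

# Each missing days_diff marks the start of a new block; cumsum numbers them
a = days_diff.isna().cumsum()
print(a.tolist())  # [1, 1, 1, 2, 2]

# Keep RefID only where A changes, then carry that value forward
b = ref_id.where(a != a.shift(1)).ffill()
print(b.tolist())  # [1.0, 1.0, 1.0, 4.0, 4.0]
```

Note that in the solution the comparison df.A != df.A.shift(1) runs over the whole dataframe, not per CID. It therefore relies on A differing across a CID boundary, which holds for this data but would not hold if a group ended with A equal to 1.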
