Groupby Pandas, Calculate Multiple Columns Based On Date Difference
I have a pandas dataframe shown below:
CID  RefID  Date       Group  MID
100  1      1/01/2021  A
100  2      3/01/2021  A
Solution 1:
You could do something like this:
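import pandas as pd

def days_diff(sdf):
    # Result frame for one CID group: days_diff starts as NaT, A is filled in at the end.
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    # The first date in the group opens the first 30-day window.
    start = sdf.at[sdf.index[0], "Date"]
    # shift(1) moves MID down one row, so the flag describes the previous row's MID.
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            # Start a new window at the current row.
            start = day
    # Every NaT marks the start of a window, so the running count numbers the windows.
    result.A = result.days_diff.isna().cumsum()
    return result

# group_keys=False keeps the original row index so the result aligns with df
# (on pandas 2.0 and later the default would add CID to the index).
df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID", group_keys=False).apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()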
Result for df created by
import pandas as pd
from io import StringIO
data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
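# sep=r"\s+" splits on runs of whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas versions.
df = pd.read_csv(data, sep=r"\s+")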
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")
is
    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0

A few explanations:
- The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df obtained by grouping on column CID.
- First step: initialize the result dataframe, result (column days_diff filled with NaT, column A with None), and set the starting value start for the day differences to the first day in the group.
- Afterwards, loop over the sub-dataframe from the second row onwards, grabbing the index, the value in column Date, and a boolean value next_MID_is_na. Because .shift(1) moves the MID column down by one row before .isna() is applied, this flag says whether the MID value in the preceding row is NaN, so it is False for the row that follows a non-NaN MID.
- In every step of the loop:
  - Calculate the difference between the current day and the start day.
  - Check the rules for the days_diff column:
    - If the difference between the current and the start day is <= 30 days and next_MID_is_na is True -> store the day difference.
    - Otherwise -> reset start to the current day.
- After the days_diff column is finished, calculate column A: result.days_diff.isna() is True (== 1) where days_diff is NaT, False (== 0) otherwise. Therefore the cumulative sum (.cumsum()) gives the required result.
- After the groupby-apply has produced the columns days_diff and A, calculate column B: select the RefID values where the value of A changes (via .where(df.A != df.A.shift(1))), and then forward fill the remaining NaNs.
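The three building blocks described above (the .shift(1).isna() flag, the .isna().cumsum() counter and the .where(...).ffill() fill) can be tried in isolation. The following is only a small sketch on made-up toy values; the names mid, days, ref, block and first_ref are illustrative and do not appear in the answer's code.

```python
import pandas as pd

nan = float("nan")

# Toy MID column (hypothetical values): only the second row has a MID.
mid = pd.Series([nan, 101.0, nan, nan])
# shift(1) moves every value down one row, so row i sees the MID of row i - 1.
prev_mid_is_na = mid.shift(1).isna()

# Toy day differences: a missing value marks the first row of a new window.
days = pd.Series([nan, 2.0, nan, 3.0])
# True counts as 1 and False as 0, so each missing value bumps the block number.
block = days.isna().cumsum()

# Toy RefID values: keep them only where the block number changes,
# then forward-fill so every row carries the RefID of its block's first row.
ref = pd.Series([10, 11, 12, 13])
first_ref = ref.where(block != block.shift(1)).ffill()

print(prev_mid_is_na.tolist())
print(block.tolist())
print(first_ref.tolist())
```

Note that the comparison block != block.shift(1) is also True in the very first row, because the shifted value there is NaN; that is why the first row always starts a block.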