Fill Nan Values From Another Dataframe (with Different Shape)
Solution 1:
you can use Index to speed up the lookup, use combine_first() to fill NaN:
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
print(merged_df[cols])
the result:
day_of_week holiday_flg
0 Tuesday 0.0
1 Wednesday 0.0
2 Thursday 0.0
3 Saturday 1.0
Solution 2:
This is one solution. It should be efficient as there is no explicit merge or apply.
merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date'])
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']
merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))
Result
air_store_idarea_nameday_of_weekgenre_nameholiday_flghpg_store_id\0air_a1TokyoTuesdayJapanese0.0hpg_h11air_a2NaNWednesdayNaN0.0NaN2air_a3NaNThursdayNaN0.0NaN3air_a4NaNSaturdayNaN1.0NaNlatitudelongitudereserve_datetimereserve_visitorsvisit_date\01234.0 5678.0 2017-04-22 11:00:00 25.02017-05-231NaNNaNNaN35.02017-05-242NaNNaNNaN45.02017-05-253NaNNaNNaNNaN2017-05-27visit_datetime02017-05-23 12:00:001NaN2NaN3NaNExplanation
sis apd.Seriesmapping calendar_date to day_of_week fromdate_info_df.- Use
pd.Series.map, which takespd.Seriesas an input, to update missing values, where possible.
Solution 3:
Edit: one can also use merge to solve the problem. 10 times faster than the old approach. (Need to make sure "visit_date" and "calendar_date" are of the same format.)
# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
left_on="visit_date",
right_on="calendar_date",
how="left") # outer should also workThe desired result will be at "day_of_week_y" and "holiday_flg_y" column right now. In this approach and the map approach, we don't use the old "day_of_week" and "holiday_flg" at all. We just need to map the results from data_info_df to merged_df.
merge can also do the job because data_info_df's data entries are unique. (No duplicates will be created.)
You can also try using pandas.Series.map. What it does is
Map values of Series using input correspondence (which can be a dict, Series, or function)
# set"calendar_date"as the index such that
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")
# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])
Note merged_df.visit_date originally was of string type. Thus, we use
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
to make it datetime.
Timingsdate_info_df dataset and merged_df provided by karlphillip.
date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
# merge method I proprose on the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of7 runs, 1loopeach)
# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of7 runs, 1loopeach)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
258 ms ± 69.5 ms per loop (mean ± std. dev. of7 runs, 1loopeach)
One can see that HYRY's method runs 3 times faster if assigning the result back to the merged_df. This is why I thought HARY's method was faster than mine at first glance. I suspect that is because of the nature of combine_first. I guess that the speed of HARY's method will depend on how sparse it is in merged_df. Thus, while assigning the results back, the columns become full; therefore, while rerunning it, it is faster.
The performances of the merge and combine_first methods are nearly equivalent. Perhaps there can be circumstances that one is faster than another. It should be left to each user to do some tests on their datasets.
Another thing to note between the two methods is that the merge method assumed every date in merged_df is contained in data_info_df. If there are some dates that are contained in merged_df but not data_info_df, it should return NaN. And NaN can override some part of merged_df that originally contains values! This is when combine_first method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria
Post a Comment for "Fill Nan Values From Another Dataframe (with Different Shape)"