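I have a table of discount rates that depend on the agent and the time period, and I want to apply it to another table to get the rate that was in effect on each sale date.
This is the rate table (df_r):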
Agentname ProductType OldRate NewRate StartDate EndDate
0 VSFAAL SPORTS 0.0 10.0 2020-11-05 2021-01-18
1 VSFAAL APPAREL 0.0 35.0 2020-11-05 2022-05-03
2 VSFAAL SPORTS 10.0 15.0 2021-01-18 2022-05-03
3 VSFAALJS SPORTS 0.0 10.0 2020-11-07 2022-05-03
4 VSFAALJS APPAREL 0.0 15.0 2020-11-07 2021-11-09
5 VSFAALJS APPAREL 15.0 5.0 2021-11-09 2022-05-03
This is the transaction table (df):
Date Sales Agentname ProductType
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS
1 2022-03-01 08:00:09 99.0 VSFAAL APPAREL
2 2022-03-01 08:00:14 75.0 VSFAAL SPORTS
3 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS
4 2020-05-01 08:00:51 160.0 VSFAALJS APPAREL
5 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL
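To make the thread reproducible, the two tables can be rebuilt as DataFrames. This is only a sketch: the thread never shows the column dtypes, so parsing `Date`, `StartDate` and `EndDate` into datetimes is an assumption.

```python
import pandas as pd

# Rate table: one row per agent / product type / validity window.
rate_rows = [
    ("VSFAAL", "SPORTS", 0.0, 10.0, "2020-11-05", "2021-01-18"),
    ("VSFAAL", "APPAREL", 0.0, 35.0, "2020-11-05", "2022-05-03"),
    ("VSFAAL", "SPORTS", 10.0, 15.0, "2021-01-18", "2022-05-03"),
    ("VSFAALJS", "SPORTS", 0.0, 10.0, "2020-11-07", "2022-05-03"),
    ("VSFAALJS", "APPAREL", 0.0, 15.0, "2020-11-07", "2021-11-09"),
    ("VSFAALJS", "APPAREL", 15.0, 5.0, "2021-11-09", "2022-05-03"),
]
df_r = pd.DataFrame(
    rate_rows,
    columns=["Agentname", "ProductType", "OldRate", "NewRate", "StartDate", "EndDate"],
)

# Transaction table: one row per sale.
sale_rows = [
    ("2020-12-01 08:00:02", 100.0, "VSFAAL", "SPORTS"),
    ("2022-03-01 08:00:09", 99.0, "VSFAAL", "APPAREL"),
    ("2022-03-01 08:00:14", 75.0, "VSFAAL", "SPORTS"),
    ("2021-05-01 08:00:39", 67.0, "VSFAALJS", "SPORTS"),
    ("2020-05-01 08:00:51", 160.0, "VSFAALJS", "APPAREL"),
    ("2021-05-01 08:00:56", 65.0, "VSFAALJS", "APPAREL"),
]
df = pd.DataFrame(sale_rows, columns=["Date", "Sales", "Agentname", "ProductType"])

# Assumption: all date columns are real datetimes rather than strings.
df["Date"] = pd.to_datetime(df["Date"])
df_r["StartDate"] = pd.to_datetime(df_r["StartDate"])
df_r["EndDate"] = pd.to_datetime(df_r["EndDate"])

print(df_r)
print(df)
```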
I would like to get a result like this:
Date Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS 10.0
1 2022-03-01 08:00:09 99.0 VSFAAL APPAREL 35.0
2 2022-03-01 08:00:14 75.0 VSFAAL SPORTS 15.0
3 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS 10.0
4 2020-05-01 08:00:51 160.0 VSFAALJS APPAREL 0
5 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL 15.0
What I am currently doing is looping over the product types, then the agents, and then each rate-row index:
col = 'Agentname'
for product in list(df.ProductType.unique()):
    for uname in list(df[col].unique()):
        # Rate rows for this agent/product that end on or after the earliest sale date
        a = df_r.loc[(df_r['Agentname'] == uname) & (df_r['ProductType'] == product.upper()) &
                     (df_r['EndDate'] >= df['Date'].min())]
        for i in a.index:
            # Assign the rate to every sale that falls inside this rate window
            df.loc[(df['ProductType'].str.upper() == product.upper()) & (df[col] == uname) & (
                    df['Date'] >= a.loc[i]['StartDate']) & (df['Date'] <= a.loc[i]['EndDate']),
                   [f"{col}_rates"]] = a.loc[i]['NewRate']
Is there a more efficient way to do this?
Answer from a uj5u.com user:
Here is one way to do it:
merge the two DataFrames on product type and agent name, then filter by date.
import pandas as pd

# df is the transaction table and df_r the rate table, as named in the question
df3 = df.merge(df_r[['Agentname', 'ProductType', 'StartDate', 'EndDate', 'NewRate']],
               on=['Agentname', 'ProductType'],
               how='left')
# Keep only the rows whose sale date (time of day dropped) lies inside the rate window
df3[pd.to_datetime(df3['Date']).dt.normalize().between(
        pd.to_datetime(df3['StartDate']),
        pd.to_datetime(df3['EndDate']))
    ]
Date Sales Agentname ProductType StartDate EndDate NewRate
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS 2020-11-05 2021-01-18 10.0
2 2022-03-01 08:00:09 99.0 VSFAAL APPAREL 2020-11-05 2022-05-03 35.0
4 2022-03-01 08:00:14 75.0 VSFAAL SPORTS 2021-01-18 2022-05-03 15.0
5 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS 2020-11-07 2022-05-03 10.0
8 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL 2020-11-07 2021-11-09 15.0
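The filtered merge above drops sales that have no matching rate window (row 4 of the transaction table). A minimal sketch of one way to keep those rows as NaN is to carry the original row index through the merge and assign the matches back by index. The dtypes (datetimes) and the choice to keep the first match when two windows overlap are assumptions, not something the thread specifies.

```python
import pandas as pd

df_r = pd.DataFrame(
    [
        ("VSFAAL", "SPORTS", 0.0, 10.0, "2020-11-05", "2021-01-18"),
        ("VSFAAL", "APPAREL", 0.0, 35.0, "2020-11-05", "2022-05-03"),
        ("VSFAAL", "SPORTS", 10.0, 15.0, "2021-01-18", "2022-05-03"),
        ("VSFAALJS", "SPORTS", 0.0, 10.0, "2020-11-07", "2022-05-03"),
        ("VSFAALJS", "APPAREL", 0.0, 15.0, "2020-11-07", "2021-11-09"),
        ("VSFAALJS", "APPAREL", 15.0, 5.0, "2021-11-09", "2022-05-03"),
    ],
    columns=["Agentname", "ProductType", "OldRate", "NewRate", "StartDate", "EndDate"],
)
df = pd.DataFrame(
    [
        ("2020-12-01 08:00:02", 100.0, "VSFAAL", "SPORTS"),
        ("2022-03-01 08:00:09", 99.0, "VSFAAL", "APPAREL"),
        ("2022-03-01 08:00:14", 75.0, "VSFAAL", "SPORTS"),
        ("2021-05-01 08:00:39", 67.0, "VSFAALJS", "SPORTS"),
        ("2020-05-01 08:00:51", 160.0, "VSFAALJS", "APPAREL"),
        ("2021-05-01 08:00:56", 65.0, "VSFAALJS", "APPAREL"),
    ],
    columns=["Date", "Sales", "Agentname", "ProductType"],
)
# Assumption: date columns are parsed as datetimes.
df["Date"] = pd.to_datetime(df["Date"])
df_r["StartDate"] = pd.to_datetime(df_r["StartDate"])
df_r["EndDate"] = pd.to_datetime(df_r["EndDate"])

# reset_index() keeps the original row label in a column named "index".
merged = df.reset_index().merge(df_r, on=["Agentname", "ProductType"], how="left")

# Compare on the calendar day so a sale on the EndDate itself still counts.
in_window = merged["Date"].dt.normalize().between(merged["StartDate"], merged["EndDate"])

# One rate per original row; if two windows overlap, the first match is kept.
matched = merged.loc[in_window].drop_duplicates("index").set_index("index")["NewRate"]

# Index alignment leaves NaN for sales with no valid window.
df["Agentname_rates"] = matched
print(df)
```

Because the assignment aligns on the index, `df` keeps all six rows and their original order.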
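Answer from a uj5u.com user:
You could try creating a separate function that looks up the rate, with the conditions specified inside the function: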
import numpy as np

def check_rates(Date, Agentname, ProductType):
    # Rates for this agent/product whose window strictly contains the sale timestamp
    val = df_r['NewRate'].loc[(df_r['ProductType']==ProductType) & (df_r['Agentname']==Agentname) & (df_r['StartDate']<Date) & (df_r['EndDate']>Date)]
    try:
        return list(val)[0]
    except IndexError:
        return np.nan  # not found

df['Agentname_rates'] = df.apply(lambda x: check_rates(x['Date'], x['Agentname'], x['ProductType']), axis=1)
print(df)
Output:
Date Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS 10.0
1 2022-03-01 08:00:09 99.0 VSFAAL APPAREL 35.0
2 2022-03-01 08:00:14 75.0 VSFAAL SPORTS 15.0
3 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS 10.0
4 2020-05-01 08:00:51 160.0 VSFAALJS APPAREL NaN
5 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL 15.0
For learning purposes, I compared the time taken by these 3 solutions, running each one 1000 times:
import time
import numpy as np
import pandas as pd

ti = time.time()
for _ in range(1000):
    df['Agentname_rates'] = np.nan
    for product in list(df.ProductType.unique()):
        for uname in list(df['Agentname'].unique()):
            a = df_r.loc[(df_r['Agentname'] == uname)
                         & (df_r['ProductType'] == product.upper())
                         & (df_r['EndDate'] >= df['Date'].min())]
            for i in a.index:
                df.loc[(df['ProductType'].str.upper() == product.upper())
                       & (df['Agentname'] == uname)
                       & (df['Date'] >= a.loc[i]['StartDate'])
                       & (df['Date'] <= a.loc[i]['EndDate']),
                       ["Agentname_rates"]] = a.loc[i]['NewRate']
print('Time taken: {:.1f} sec'.format(time.time() - ti))
print(df)
Time taken: 13.2 sec
Date Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS 10.0
1 2022-03-01 08:00:09 99.0 VSFAAL APPAREL 35.0
2 2022-03-01 08:00:14 75.0 VSFAAL SPORTS 15.0
3 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS 10.0
4 2020-05-01 08:00:51 160.0 VSFAALJS APPAREL NaN
5 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL 15.0
ti = time.time()
for _ in range(1000):
    df3 = pd.DataFrame()
    df3 = df.merge(df_r[['StartDate', 'EndDate','NewRate']],
                   left_on=[df['Agentname'], df['ProductType']],
                   right_on=[df_r['Agentname'], df_r['ProductType']],
                   how='left', suffixes=('','_start')
                   ).drop(columns=['key_0', 'key_1' ])
    # The sale date is compared as a 'YYYY-MM-DD' string, so StartDate/EndDate must hold strings in that format
    df3 = df3[df3['Date'].dt.strftime('%Y-%m-%d').between(df3['StartDate'], df3['EndDate'])
              ].drop(columns=['StartDate', 'EndDate']).rename(columns={'NewRate': 'Agentname_rates'})
print('Time taken: {:.1f} sec'.format(time.time() - ti))
print(df3)
Time taken: 4.0 sec
Date Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS 10.0
2 2022-03-01 08:00:09 99.0 VSFAAL APPAREL 35.0
4 2022-03-01 08:00:14 75.0 VSFAAL SPORTS 15.0
5 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS 10.0
8 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL 15.0
def check_rates(Date, Agentname, ProductType):
    # Rates for this agent/product whose window strictly contains the sale timestamp
    val = df_r['NewRate'].loc[(df_r['ProductType']==ProductType) & (df_r['Agentname']==Agentname) & (df_r['StartDate']<Date) & (df_r['EndDate']>Date)]
    try:
        return list(val)[0]
    except IndexError:
        return np.nan  # not found

ti = time.time()
for _ in range(1000):
    df['Agentname_rates'] = np.nan
    df['Agentname_rates'] = df.apply(lambda x: check_rates(x['Date'], x['Agentname'], x['ProductType']), axis=1)
print('Time taken: {:.1f} sec'.format(time.time() - ti))
print(df)
Time taken: 4.1 sec
Date Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02 100.0 VSFAAL SPORTS 10.0
1 2022-03-01 08:00:09 99.0 VSFAAL APPAREL 35.0
2 2022-03-01 08:00:14 75.0 VSFAAL SPORTS 15.0
3 2021-05-01 08:00:39 67.0 VSFAALJS SPORTS 10.0
4 2020-05-01 08:00:51 160.0 VSFAALJS APPAREL NaN
5 2021-05-01 08:00:56 65.0 VSFAALJS APPAREL 15.0
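The hand-rolled `time.time()` loops can also be written with the standard-library `timeit` module, which repeats the measurement and lets you take the best run to reduce noise. This is a sketch only: `apply_rates`, the repeat counts, and the datetime dtypes are assumptions made for illustration, and the lookup is a lightly restated version of `check_rates`.

```python
import timeit

import numpy as np
import pandas as pd

df_r = pd.DataFrame(
    [
        ("VSFAAL", "SPORTS", 10.0, "2020-11-05", "2021-01-18"),
        ("VSFAAL", "APPAREL", 35.0, "2020-11-05", "2022-05-03"),
        ("VSFAAL", "SPORTS", 15.0, "2021-01-18", "2022-05-03"),
        ("VSFAALJS", "SPORTS", 10.0, "2020-11-07", "2022-05-03"),
        ("VSFAALJS", "APPAREL", 15.0, "2020-11-07", "2021-11-09"),
        ("VSFAALJS", "APPAREL", 5.0, "2021-11-09", "2022-05-03"),
    ],
    columns=["Agentname", "ProductType", "NewRate", "StartDate", "EndDate"],
)
df = pd.DataFrame(
    [
        ("2020-12-01 08:00:02", 100.0, "VSFAAL", "SPORTS"),
        ("2022-03-01 08:00:09", 99.0, "VSFAAL", "APPAREL"),
        ("2022-03-01 08:00:14", 75.0, "VSFAAL", "SPORTS"),
        ("2021-05-01 08:00:39", 67.0, "VSFAALJS", "SPORTS"),
        ("2020-05-01 08:00:51", 160.0, "VSFAALJS", "APPAREL"),
        ("2021-05-01 08:00:56", 65.0, "VSFAALJS", "APPAREL"),
    ],
    columns=["Date", "Sales", "Agentname", "ProductType"],
)
# Assumption: date columns are parsed as datetimes.
df["Date"] = pd.to_datetime(df["Date"])
df_r["StartDate"] = pd.to_datetime(df_r["StartDate"])
df_r["EndDate"] = pd.to_datetime(df_r["EndDate"])


def check_rates(date, agent, product):
    """Return the first rate whose window strictly contains `date`, else NaN."""
    mask = (
        (df_r["ProductType"] == product)
        & (df_r["Agentname"] == agent)
        & (df_r["StartDate"] < date)
        & (df_r["EndDate"] > date)
    )
    val = df_r.loc[mask, "NewRate"]
    return val.iloc[0] if len(val) else np.nan


def apply_rates():
    """Hypothetical wrapper so timeit has a zero-argument callable to run."""
    return df.apply(
        lambda x: check_rates(x["Date"], x["Agentname"], x["ProductType"]), axis=1
    )


# Three batches of 20 calls each; the fastest batch is the least disturbed one.
runs = timeit.repeat(apply_rates, number=20, repeat=3)
best_ms = min(runs) / 20 * 1e3
print(f"best time per call: {best_ms:.2f} ms")
```

Taking the minimum over several batches does not remove machine-to-machine differences, but it makes runs on the same machine easier to compare.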
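Summary of time taken (1000 runs each):
@wiziruv (OP)'s code = 13.2 s
@naveed's code = 4.0 s (the output skips the NaN row)
@blackraven's code = 4.1 s
I ran it once more and the times were about the same. I think this depends largely on the machine running the code.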
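Not one of the three approaches timed above, but a further vectorized option worth trying is `pandas.merge_asof`, which matches each sale to the most recent rate row of the same agent and product, after which expired windows are masked out. This is a sketch under assumptions: the date columns are datetimes with the same resolution, and rate windows for one agent/product do not overlap. On a boundary day it picks the newer rate.

```python
import pandas as pd

df_r = pd.DataFrame(
    [
        ("VSFAAL", "SPORTS", 10.0, "2020-11-05", "2021-01-18"),
        ("VSFAAL", "APPAREL", 35.0, "2020-11-05", "2022-05-03"),
        ("VSFAAL", "SPORTS", 15.0, "2021-01-18", "2022-05-03"),
        ("VSFAALJS", "SPORTS", 10.0, "2020-11-07", "2022-05-03"),
        ("VSFAALJS", "APPAREL", 15.0, "2020-11-07", "2021-11-09"),
        ("VSFAALJS", "APPAREL", 5.0, "2021-11-09", "2022-05-03"),
    ],
    columns=["Agentname", "ProductType", "NewRate", "StartDate", "EndDate"],
)
df = pd.DataFrame(
    [
        ("2020-12-01 08:00:02", 100.0, "VSFAAL", "SPORTS"),
        ("2022-03-01 08:00:09", 99.0, "VSFAAL", "APPAREL"),
        ("2022-03-01 08:00:14", 75.0, "VSFAAL", "SPORTS"),
        ("2021-05-01 08:00:39", 67.0, "VSFAALJS", "SPORTS"),
        ("2020-05-01 08:00:51", 160.0, "VSFAALJS", "APPAREL"),
        ("2021-05-01 08:00:56", 65.0, "VSFAALJS", "APPAREL"),
    ],
    columns=["Date", "Sales", "Agentname", "ProductType"],
)
# merge_asof needs both key columns to share one dtype, so force the same unit.
df["Date"] = pd.to_datetime(df["Date"]).astype("datetime64[ns]")
df_r["StartDate"] = pd.to_datetime(df_r["StartDate"]).astype("datetime64[ns]")
df_r["EndDate"] = pd.to_datetime(df_r["EndDate"]).astype("datetime64[ns]")

# merge_asof requires both sides to be sorted by their "on" key.
left = df.reset_index().sort_values("Date")
right = df_r.sort_values("StartDate")

# For each sale, take the latest rate row (same agent and product)
# whose StartDate is on or before the sale timestamp.
out = pd.merge_asof(
    left,
    right,
    left_on="Date",
    right_on="StartDate",
    by=["Agentname", "ProductType"],
    direction="backward",
)

# Discard matches whose window ended before the sale's calendar day.
expired = out["Date"].dt.normalize() > out["EndDate"]
out.loc[expired, "NewRate"] = float("nan")

# Map the rates back to the original row order via the saved index.
df["Agentname_rates"] = out.set_index("index")["NewRate"]
print(df)
```

Sales dated before any rate window (row 4 here) find no match and stay NaN, so no rows are lost.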