Pandas: setting values based on date ranges in another DataFrame

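I have some tables of discount rates that depend on the agent and on a time period. I want to apply them to another table so that each sale gets the rate that was in effect on its sale date.

This is the rates table (df_r):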

Agentname   ProductType     OldRate NewRate StartDate   EndDate
0   VSFAAL      SPORTS       0.0    10.0    2020-11-05  2021-01-18
1   VSFAAL      APPAREL      0.0    35.0    2020-11-05  2022-05-03
2   VSFAAL      SPORTS      10.0    15.0    2021-01-18  2022-05-03
3   VSFAALJS    SPORTS       0.0    10.0    2020-11-07  2022-05-03
4   VSFAALJS    APPAREL      0.0    15.0    2020-11-07  2021-11-09
5   VSFAALJS    APPAREL     15.0     5.0    2021-11-09  2022-05-03

This is the transactions table (df):

                  Date      Sales   Agentname   ProductType     
0 2020-12-01 08:00:02        100.0  VSFAAL      SPORTS       
1 2022-03-01 08:00:09         99.0  VSFAAL      APPAREL      
2 2022-03-01 08:00:14         75.0  VSFAAL      SPORTS       
3 2021-05-01 08:00:39         67.0  VSFAALJS    SPORTS       
4 2020-05-01 08:00:51        160.0  VSFAALJS    APPAREL      
5 2021-05-01 08:00:56         65.0  VSFAALJS    APPAREL
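
For anyone who wants to run the code in this thread, the two tables above could be rebuilt roughly as follows. This is only a sketch: it assumes the date columns hold real datetimes rather than strings, which the question does not state explicitly.

```python
import pandas as pd

# Rates table (df_r), values copied from the listing above.
df_r = pd.DataFrame({
    "Agentname":   ["VSFAAL", "VSFAAL", "VSFAAL", "VSFAALJS", "VSFAALJS", "VSFAALJS"],
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
    "OldRate":     [0.0, 0.0, 10.0, 0.0, 0.0, 15.0],
    "NewRate":     [10.0, 35.0, 15.0, 10.0, 15.0, 5.0],
    "StartDate":   ["2020-11-05", "2020-11-05", "2021-01-18",
                    "2020-11-07", "2020-11-07", "2021-11-09"],
    "EndDate":     ["2021-01-18", "2022-05-03", "2022-05-03",
                    "2022-05-03", "2021-11-09", "2022-05-03"],
})
# Assumption: dates are stored as datetimes so they can be compared with df["Date"].
df_r["StartDate"] = pd.to_datetime(df_r["StartDate"])
df_r["EndDate"] = pd.to_datetime(df_r["EndDate"])

# Transactions table (df), values copied from the listing above.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-12-01 08:00:02", "2022-03-01 08:00:09", "2022-03-01 08:00:14",
        "2021-05-01 08:00:39", "2020-05-01 08:00:51", "2021-05-01 08:00:56",
    ]),
    "Sales":       [100.0, 99.0, 75.0, 67.0, 160.0, 65.0],
    "Agentname":   ["VSFAAL", "VSFAAL", "VSFAAL", "VSFAALJS", "VSFAALJS", "VSFAALJS"],
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
})

print(df_r.dtypes)
print(df.dtypes)
```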

I would like to get a result like this:

                  Date      Sales   Agentname   ProductType     Agentname_rates
0 2020-12-01 08:00:02        100.0  VSFAAL      SPORTS             10.0
1 2022-03-01 08:00:09         99.0  VSFAAL      APPAREL            35.0
2 2022-03-01 08:00:14         75.0  VSFAAL      SPORTS             15.0
3 2021-05-01 08:00:39         67.0  VSFAALJS    SPORTS             10.0
4 2020-05-01 08:00:51        160.0  VSFAALJS    APPAREL               0
5 2021-05-01 08:00:56         65.0  VSFAALJS    APPAREL            15.0

What I am doing at the moment is looping over the product types, then the agents, then each date index:

col = 'Agentname'
for product in list(df.ProductType.unique()):
    for uname in list(df[col].unique()):
        # Rate rows for this agent and product that end on or after the earliest sale
        a = df_r.loc[(df_r['Agentname'] == uname)
                     & (df_r['ProductType'] == product.upper())
                     & (df_r['EndDate'] >= df['Date'].min())]

        for i in a.index:
            # Write the rate onto every sale that falls inside this period
            df.loc[(df['ProductType'].str.upper() == product.upper())
                   & (df[col] == uname)
                   & (df['Date'] >= a.loc[i]['StartDate'])
                   & (df['Date'] <= a.loc[i]['EndDate']),
                   [f"{col}_rates"]] = a.loc[i]['NewRate']

Is there a more efficient way to do this?

A uj5u.com user replied:

Here is one way to do it.

Merge the two DataFrames on product and agent name, then filter by date:

import pandas as pd

# df is the transactions table and df_r the rates table, as in the question.
# Pair every sale with all rate rows for the same agent and product.
df3 = df.merge(
    df_r[['Agentname', 'ProductType', 'StartDate', 'EndDate', 'NewRate']],
    on=['Agentname', 'ProductType'],
    how='left',
)

# Keep only the pairs whose rate period contains the sale date (time of day dropped).
sale_day = pd.to_datetime(df3['Date']).dt.normalize()
df3[sale_day.between(pd.to_datetime(df3['StartDate']),
                     pd.to_datetime(df3['EndDate']))]

                 Date  Sales Agentname ProductType   StartDate     EndDate  NewRate
0 2020-12-01 08:00:02  100.0    VSFAAL      SPORTS  2020-11-05  2021-01-18     10.0
2 2022-03-01 08:00:09   99.0    VSFAAL     APPAREL  2020-11-05  2022-05-03     35.0
4 2022-03-01 08:00:14   75.0    VSFAAL      SPORTS  2021-01-18  2022-05-03     15.0
5 2021-05-01 08:00:39   67.0  VSFAALJS      SPORTS  2020-11-07  2022-05-03     10.0
8 2021-05-01 08:00:56   65.0  VSFAALJS     APPAREL  2020-11-07  2021-11-09     15.0
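
The filtered output above no longer contains the 2020-05-01 sale, which has no applicable rate, while the question expects every sale to stay in the result. One possible way to keep such rows is to carry the original row label through the merge and align the matched rates back onto df. The sketch below rebuilds the sample data so it runs on its own; the names pairs, in_period, rates and result are illustrative and do not come from the thread, and it compares full timestamps the way the question's loop does.

```python
import pandas as pd

df_r = pd.DataFrame({
    "Agentname": ["VSFAAL"] * 3 + ["VSFAALJS"] * 3,
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
    "NewRate": [10.0, 35.0, 15.0, 10.0, 15.0, 5.0],
    "StartDate": pd.to_datetime(["2020-11-05", "2020-11-05", "2021-01-18",
                                 "2020-11-07", "2020-11-07", "2021-11-09"]),
    "EndDate": pd.to_datetime(["2021-01-18", "2022-05-03", "2022-05-03",
                               "2022-05-03", "2021-11-09", "2022-05-03"]),
})
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-01 08:00:02", "2022-03-01 08:00:09",
                            "2022-03-01 08:00:14", "2021-05-01 08:00:39",
                            "2020-05-01 08:00:51", "2021-05-01 08:00:56"]),
    "Sales": [100.0, 99.0, 75.0, 67.0, 160.0, 65.0],
    "Agentname": ["VSFAAL"] * 3 + ["VSFAALJS"] * 3,
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
})

# Candidate pairs: each sale against every rate row of the same agent and product.
# reset_index() keeps the original row label in a column named "index".
pairs = df.reset_index().merge(df_r, on=["Agentname", "ProductType"], how="left")

# Keep the pairs whose rate period contains the sale timestamp (bounds inclusive).
in_period = pairs["Date"].between(pairs["StartDate"], pairs["EndDate"])

# If several periods match one sale, keep the first; then index by the original label.
rates = pairs.loc[in_period].drop_duplicates("index").set_index("index")["NewRate"]

# Assignment aligns on the index, so sales without a matching period get NaN.
result = df.assign(Agentname_rates=rates)
print(result)
```

Aligning on the index rather than filtering means the row count of df never changes, which may be easier to reason about when the rate is just one more column on the sales table.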

A uj5u.com user replied:

You can try writing a separate function that looks up the rate, with the conditions specified inside the function:

import numpy as np

def check_rates(Date, Agentname, ProductType):
    # NewRate of every rate row for this agent and product whose period
    # strictly contains the sale timestamp
    val = df_r['NewRate'].loc[(df_r['ProductType']==ProductType) & (df_r['Agentname']==Agentname) & (df_r['StartDate']<Date) & (df_r['EndDate']>Date)]
    try:
        return list(val)[0]
    except IndexError:
        return np.nan    # not found

# Look up the rate row by row
df['Agentname_rates'] = df.apply(lambda x: check_rates(x['Date'], x['Agentname'], x['ProductType']), axis=1)
print(df)

Output:

                 Date  Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02  100.0    VSFAAL      SPORTS            10.0
1 2022-03-01 08:00:09   99.0    VSFAAL     APPAREL            35.0
2 2022-03-01 08:00:14   75.0    VSFAAL      SPORTS            15.0
3 2021-05-01 08:00:39   67.0  VSFAALJS      SPORTS            10.0
4 2020-05-01 08:00:51  160.0  VSFAALJS     APPAREL             NaN
5 2021-05-01 08:00:56   65.0  VSFAALJS     APPAREL            15.0
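
A row-wise apply calls the lookup once per sale. As a possible vectorised alternative, pandas.merge_asof can pick, for each sale, the most recently started rate row of the same agent and product, after which expired periods are blanked out. This is a sketch under two assumptions that the thread does not confirm: the periods of one agent and product do not overlap, and the date columns are datetimes. The names keys, left, right, matched and result are illustrative.

```python
import pandas as pd

df_r = pd.DataFrame({
    "Agentname": ["VSFAAL"] * 3 + ["VSFAALJS"] * 3,
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
    "NewRate": [10.0, 35.0, 15.0, 10.0, 15.0, 5.0],
    "StartDate": pd.to_datetime(["2020-11-05", "2020-11-05", "2021-01-18",
                                 "2020-11-07", "2020-11-07", "2021-11-09"]),
    "EndDate": pd.to_datetime(["2021-01-18", "2022-05-03", "2022-05-03",
                               "2022-05-03", "2021-11-09", "2022-05-03"]),
})
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-01 08:00:02", "2022-03-01 08:00:09",
                            "2022-03-01 08:00:14", "2021-05-01 08:00:39",
                            "2020-05-01 08:00:51", "2021-05-01 08:00:56"]),
    "Sales": [100.0, 99.0, 75.0, 67.0, 160.0, 65.0],
    "Agentname": ["VSFAAL"] * 3 + ["VSFAALJS"] * 3,
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
})

keys = ["Agentname", "ProductType"]

# merge_asof requires both frames to be sorted by their "on" column.
# reset_index() keeps the original row label in a column named "index".
left = df.reset_index().sort_values("Date")
right = df_r[keys + ["StartDate", "EndDate", "NewRate"]].sort_values("StartDate")

# For each sale, take the latest rate row (same agent and product)
# whose StartDate is on or before the sale timestamp.
matched = pd.merge_asof(left, right, left_on="Date", right_on="StartDate",
                        by=keys, direction="backward")

# A period that ended before the sale does not apply, so blank its rate.
expired = matched["Date"] > matched["EndDate"]
matched.loc[expired, "NewRate"] = float("nan")

# Restore the original row order and column naming.
result = (matched.set_index("index").sort_index()
          .drop(columns=["StartDate", "EndDate"])
          .rename(columns={"NewRate": "Agentname_rates"}))
result.index.name = None
print(result)
```

Because only the latest-started period is considered, this would return NaN if that period has expired while an older, overlapping one is still open; in that situation a merge followed by a date filter is the safer choice.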

For learning purposes, I compared the time taken by these 3 solutions:

import time
import pandas as pd

ti = time.time() 
for _ in range(1000):
    df['Agentname_rates'] = np.nan
    for product in list(df.ProductType.unique()):
        for uname in list(df['Agentname'].unique()):
            a = df_r.loc[(df_r['Agentname'] == uname) 
                         & (df_r['ProductType'] == product.upper()) 
                         & (df_r['EndDate'] >= df['Date'].min())]

            for i in a.index:
                 df.loc[(df['ProductType'].str.upper() == product.upper()) 
                        & (df['Agentname'] == uname) 
                        & (df['Date'] >= a.loc[i]['StartDate']) 
                        & (df['Date'] <= a.loc[i]['EndDate']), 
                        ["Agentname_rates"]] = a.loc[i]['NewRate']
                
print('Time taken: {:.1f} sec'.format(time.time() - ti))
print(df)

Time taken: 13.2 sec
                 Date  Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02  100.0    VSFAAL      SPORTS            10.0
1 2022-03-01 08:00:09   99.0    VSFAAL     APPAREL            35.0
2 2022-03-01 08:00:14   75.0    VSFAAL      SPORTS            15.0
3 2021-05-01 08:00:39   67.0  VSFAALJS      SPORTS            10.0
4 2020-05-01 08:00:51  160.0  VSFAALJS     APPAREL             NaN
5 2021-05-01 08:00:56   65.0  VSFAALJS     APPAREL            15.0

ti = time.time() 
for _ in range(1000):
    df3 = pd.DataFrame()
    df3 = df.merge(df_r[['StartDate', 'EndDate','NewRate']], 
                   left_on=[df['Agentname'], df['ProductType']],
                   right_on=[df_r['Agentname'], df_r['ProductType']],
                   how='left', suffixes=('','_start')
                  ).drop(columns=['key_0', 'key_1' ])

    df3 = df3[df3['Date'].dt.strftime('%Y-%m-%d').between(df3['StartDate'], df3['EndDate'])
             ].drop(columns=['StartDate', 'EndDate']).rename(columns={'NewRate': 'Agentname_rates'})
print('Time taken: {:.1f} sec'.format(time.time() - ti))
print(df3)

Time taken: 4.0 sec
                 Date  Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02  100.0    VSFAAL      SPORTS            10.0
2 2022-03-01 08:00:09   99.0    VSFAAL     APPAREL            35.0
4 2022-03-01 08:00:14   75.0    VSFAAL      SPORTS            15.0
5 2021-05-01 08:00:39   67.0  VSFAALJS      SPORTS            10.0
8 2021-05-01 08:00:56   65.0  VSFAALJS     APPAREL            15.0

def check_rates(Date, Agentname, ProductType):
    val = df_r['NewRate'].loc[(df_r['ProductType']==ProductType) & (df_r['Agentname']==Agentname) & (df_r['StartDate']<Date) & (df_r['EndDate']>Date)]
    try:
        return list(val)[0]
    except IndexError:
        return np.nan    # not found

ti = time.time() 
for _ in range(1000):
    df['Agentname_rates'] = np.nan
    df['Agentname_rates'] = df.apply(lambda x: check_rates(x['Date'], x['Agentname'], x['ProductType']), axis=1)

print('Time taken: {:.1f} sec'.format(time.time() - ti))
print(df)

Time taken: 4.1 sec
                 Date  Sales Agentname ProductType Agentname_rates
0 2020-12-01 08:00:02  100.0    VSFAAL      SPORTS            10.0
1 2022-03-01 08:00:09   99.0    VSFAAL     APPAREL            35.0
2 2022-03-01 08:00:14   75.0    VSFAAL      SPORTS            15.0
3 2021-05-01 08:00:39   67.0  VSFAALJS      SPORTS            10.0
4 2020-05-01 08:00:51  160.0  VSFAALJS     APPAREL             NaN
5 2021-05-01 08:00:56   65.0  VSFAALJS     APPAREL            15.0

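Timing summary (1000 runs each):

@wiziruv (OP)'s code = 13.2s

@naveed's code = 4.0s (the output omits the NaN row)

@blackraven's code = 4.1s

I ran it all once more and the times were about the same. I think this depends largely on the machine running the code.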
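
Since single wall-clock measurements vary with the machine and its load, a somewhat steadier comparison might use the standard library's timeit module and take the best of several repeats. The sketch below is not the benchmark from this thread: with_apply and with_merge are illustrative wrappers around two of the approaches, the repeat counts are arbitrary, and on a six-row sample the numbers mostly reflect fixed pandas overhead rather than how the approaches scale.

```python
import timeit

import numpy as np
import pandas as pd

df_r = pd.DataFrame({
    "Agentname": ["VSFAAL"] * 3 + ["VSFAALJS"] * 3,
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
    "NewRate": [10.0, 35.0, 15.0, 10.0, 15.0, 5.0],
    "StartDate": pd.to_datetime(["2020-11-05", "2020-11-05", "2021-01-18",
                                 "2020-11-07", "2020-11-07", "2021-11-09"]),
    "EndDate": pd.to_datetime(["2021-01-18", "2022-05-03", "2022-05-03",
                               "2022-05-03", "2021-11-09", "2022-05-03"]),
})
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-01 08:00:02", "2022-03-01 08:00:09",
                            "2022-03-01 08:00:14", "2021-05-01 08:00:39",
                            "2020-05-01 08:00:51", "2021-05-01 08:00:56"]),
    "Sales": [100.0, 99.0, 75.0, 67.0, 160.0, 65.0],
    "Agentname": ["VSFAAL"] * 3 + ["VSFAALJS"] * 3,
    "ProductType": ["SPORTS", "APPAREL", "SPORTS", "SPORTS", "APPAREL", "APPAREL"],
})


def check_rates(date, agent, product):
    # First rate whose period strictly contains the sale timestamp, else NaN.
    hit = df_r.loc[(df_r["ProductType"] == product) & (df_r["Agentname"] == agent)
                   & (df_r["StartDate"] < date) & (df_r["EndDate"] > date), "NewRate"]
    return hit.iloc[0] if not hit.empty else np.nan


def with_apply():
    # Row-wise lookup.
    return df.apply(lambda r: check_rates(r["Date"], r["Agentname"], r["ProductType"]),
                    axis=1)


def with_merge():
    # Merge on agent and product, filter by period, align back to df's index.
    pairs = df.reset_index().merge(df_r, on=["Agentname", "ProductType"], how="left")
    ok = pairs["Date"].between(pairs["StartDate"], pairs["EndDate"])
    return pairs.loc[ok].set_index("index")["NewRate"].reindex(df.index)


timings = {}
for func in (with_apply, with_merge):
    # Best of 3 repeats, 20 calls each, to dampen background noise.
    best = min(timeit.repeat(func, number=20, repeat=3))
    timings[func.__name__] = best / 20
    print(f"{func.__name__}: {best / 20 * 1000:.2f} ms per call")
```

Re-running the same harness on a larger, more realistic frame is usually the more informative test, because the relative cost of the row-wise lookup tends to grow with the number of sales.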
