如何在BigQuery中使用SQL查詢來計算特定時間間隔內的最大事件？-有解無憂

我有一個如下表：

timestamp               | customer_id | product_name
2022-04-01 23:49:07 UTC |   a23       |  dog_shampoo
2022-04-01 23:49:07 UTC |   a33       |  dog_shampoo
2022-04-01 23:49:07 UTC |   a45       |  dog_toy
2022-04-01 23:49:09 UTC |   a67       |  dog_shampoo
2022-04-01 23:49:09 UTC |   a66       |  dog_toy
2022-04-01 23:49:09 UTC |   a63       |  dog_toy
2022-04-01 23:50:10 UTC |   a50       |  dog_shampoo
2022-04-01 23:50:11 UTC |   a51       |  dog_shampoo

如果我想獲得每個產品的并發購買數量，可以簡單地由一個組完成，如下所示：

SELECT
   product_name,
   timestamp,
   COUNT(DISTINCT user_id) AS concurrent_purchases
FROM table
GROUP BY 1,2

得到這個輸出：

result1:
timestamp               | product_name  | concurrent_purchase
2022-04-01 23:49:07 UTC | dog_shampoo   |     2
2022-04-01 23:49:07 UTC | dog_toy       |     1
2022-04-01 23:49:09 UTC | dog_toy       |     2
2022-04-01 23:49:09 UTC | dog_shampoo   |     1
2022-04-01 23:50:10 UTC | dog_shampoo   |     1
2022-04-01 23:50:11 UTC | dog_shampoo   |     1

然后最大值將是：

SELECT
     product_name,
     MAX(concurrent_purchases) AS max_concurrent_purchases
FROM result1
GROUP BY 1

最終結果將是

product_name   | max_concurrent_purchases
dog_shampoo    |     2
dog_toy        |     2

但是，如果我將并發購買定義為彼此在 2 秒內，以構建如下所示的輸出：

timestamp                                          | product_name  | concurrent_purchase
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC  | dog_shampoo   |     3
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC  | dog_shampoo   |     0
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC  | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
                 .                                 | dog_shampoo   |     0
2022-04-01 23:50:10 UTC - 2022-04-01 23:50:12 UTC  | dog_shampoo   |     2
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC  | dog_toy       |     3

這將給出這個結果：

product_name   | max_concurrent_purchases
dog_shampoo    |     3
dog_toy        |     3

我有興趣制作一個靈活的解決方案，可以輕松更改間隔值（2 秒 - 10 秒）并針對不同的值運行。

我想我可以獲得每個產品名稱的最小時間戳并開始增加可能 acontrol flow statement但我不太熟悉在 SQL 中設定變數和回圈，并且不確定這樣的解決方案會有多慢。

如何構建一個查詢，為我提供如下所需的輸出，并為間隔提供不同的值？

interval (seconds) | product_name   | concurrent_purchase
        0          | dog_shampoo    |     2
        0          | dog_toy        |     2
        2          | dog_shampoo    |     3
        2          | dog_toy        |     3

uj5u.com熱心網友回復：

考慮以下方法

select product_name, max(concurrent_purchase) as max_concurrent_purchases
from (
  select *, count(*) over win as concurrent_purchase 
  from your_table
  window win as (
    partition by product_name 
    order by unix_seconds(timestamp) 
    range between current row and 2 following
  )
)
group by product_name

如果應用于您問題中的樣本資料 - 輸出是

如何在 BigQuery 中使用 SQL 查詢來計算特定時間間隔內的最大事件？

如果使用range between current row and 1 following- 輸出是

如何在 BigQuery 中使用 SQL 查詢來計算特定時間間隔內的最大事件？

uj5u.com熱心網友回復：

考慮如下：

interval_value您可以通過變數調整間隔。

DECLARE interval_value INT64 DEFAULT 2; 

CREATE TEMP TABLE sample (
    timestamp TIMESTAMP,
    customer_id STRING,
    product_name STRING
);
INSERT INTO sample VALUES
('2022-04-01 23:49:07 UTC', 'a23', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a33', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a45', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a67', 'dog_shampoo'),
('2022-04-01 23:49:09 UTC', 'a66', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a63', 'dog_toy'),
('2022-04-01 23:50:10 UTC', 'a50', 'dog_shampoo'),
('2022-04-01 23:50:11 UTC', 'a51', 'dog_shampoo')
;

WITH intervals AS (
  SELECT evt_grp, product_name, COUNT(1) AS purchases FROM (
    SELECT *, DIV(TIMESTAMP_DIFF(timestamp,  MIN(timestamp) OVER (), SECOND), interval_value   1) evt_grp
      FROM sample
  )
  GROUP BY 1, 2 
)
SELECT product_name, MAX(purchases) AS max_concurrent_purchasesax 
  FROM intervals
 GROUP BY 1
;

如果設定interval_value為 1，則輸出為：

如何在 BigQuery 中使用 SQL 查詢來計算特定時間間隔內的最大事件？

如果interval_value為 2，則輸出為：

如何在 BigQuery 中使用 SQL 查詢來計算特定時間間隔內的最大事件？

uj5u.com熱心網友回復：

我在這里做了一個查詢來解決這個問題，但沒有使用回圈。我探索的是 GENERATE_TIMESTAMP_ARRAY 函式（https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_timestamp_array）和 TIMESTAMP_ADD 函式（https://cloud.google.com/bigquery/docs /reference/standard-sql/timestamp_functions#timestamp_add）使用間隔制作一個陣列，并在爆炸后將其與原始表連接。

這是代碼：

DECLARE window_size INT64;
SET window_size = 2;

WITH timestamp_array AS (
  SELECT GENERATE_TIMESTAMP_ARRAY(min(timestamp), max(timestamp), INTERVAL window_size SECOND) as ts_array
  FROM <your_table>
)
,products AS (
  SELECT *
  FROM UNNEST(['dog_shampoo', 'dog_toy']) AS product
)
,timestamp_ranges AS (
  SELECT flattened_timestamp AS start_time, TIMESTAMP_ADD(flattened_timestamp, INTERVAL window_size SECOND) AS end_time, p.product
  FROM timestamp_array
  CROSS JOIN UNNEST(timestamp_array.ts_array) AS flattened_timestamp
  CROSS JOIN products p
)
,products_sold AS (
  SELECT
    a.start_time
    ,a.end_time
    ,a.product
    ,b.timestamp
    ,b.customer_id
    ,IF (b.product_name IS NOT NULL, 1, 0) AS purchase_count
  FROM timestamp_ranges a
  LEFT JOIN <your_table> b
    ON a.start_time <= b.timestamp
    AND a.end_time > b.timestamp
    AND a.product = b.product_name
)
SELECT
  CONCAT(start_time, ' - ', end_time) AS time_range
  ,product
  ,sum(purchase_count) as concurrent_purchases
FROM products_sold
GROUP BY start_time, end_time, product
ORDER BY product, time_range
;

唯一的附加評論是，當時間戳大于或等于開始時間且小于結束時間時，我認為時間戳“在范圍內”。使用此規則，結果與您的示例不完全相同。但是如果你想改變這種行為，你只需要在這里改變加入規則：

  LEFT JOIN <your_table> b
    ON a.start_time <= b.timestamp
    AND a.end_time > b.timestamp

轉載請註明出處，本文鏈接：https://www.uj5u.com/net/489186.html

標籤：循环 while循环谷歌大查询控制流

上一篇：每10行創建摘要

下一篇：用10個亂數填充一個串列，然后隨機選擇其中的5個