我有一個如下表:
timestamp | customer_id | product_name
2022-04-01 23:49:07 UTC | a23 | dog_shampoo
2022-04-01 23:49:07 UTC | a33 | dog_shampoo
2022-04-01 23:49:07 UTC | a45 | dog_toy
2022-04-01 23:49:09 UTC | a67 | dog_shampoo
2022-04-01 23:49:09 UTC | a66 | dog_toy
2022-04-01 23:49:09 UTC | a63 | dog_toy
2022-04-01 23:50:10 UTC | a50 | dog_shampoo
2022-04-01 23:50:11 UTC | a51 | dog_shampoo
如果我想獲得每個產品的并發購買數量,可以簡單地由一個組完成,如下所示:
SELECT
product_name,
timestamp,
COUNT(DISTINCT user_id) AS concurrent_purchases
FROM table
GROUP BY 1,2
得到這個輸出:
result1:
timestamp | product_name | concurrent_purchase
2022-04-01 23:49:07 UTC | dog_shampoo | 2
2022-04-01 23:49:07 UTC | dog_toy | 1
2022-04-01 23:49:09 UTC | dog_toy | 2
2022-04-01 23:49:09 UTC | dog_shampoo | 1
2022-04-01 23:50:10 UTC | dog_shampoo | 1
2022-04-01 23:50:11 UTC | dog_shampoo | 1
然后最大值將是:
SELECT
product_name,
MAX(concurrent_purchases) AS max_concurrent_purchases
FROM result1
GROUP BY 1
最終結果將是
product_name | max_concurrent_purchases
dog_shampoo | 2
dog_toy | 2
但是,如果我將并發購買定義為彼此在 2 秒內,以構建如下所示的輸出:
timestamp | product_name | concurrent_purchase
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC | dog_shampoo | 3
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC | dog_shampoo | 0
2022-04-01 23:49:09 UTC - 2022-04-01 23:49:11 UTC | dog_shampoo | 0
. | dog_shampoo | 0
. | dog_shampoo | 0
. | dog_shampoo | 0
2022-04-01 23:50:10 UTC - 2022-04-01 23:50:12 UTC | dog_shampoo | 2
2022-04-01 23:49:07 UTC - 2022-04-01 23:49:09 UTC | dog_toy | 3
這將給出這個結果:
product_name | max_concurrent_purchases
dog_shampoo | 3
dog_toy | 3
我有興趣制作一個靈活的解決方案,可以輕松更改間隔值(2 秒 - 10 秒)并針對不同的值運行。
我想我可以獲得每個產品名稱的最小時間戳并開始增加可能 acontrol flow statement
但我不太熟悉在 SQL 中設定變數和回圈,并且不確定這樣的解決方案會有多慢。
如何構建一個查詢,為我提供如下所需的輸出,并為間隔提供不同的值?
interval (seconds) | product_name | concurrent_purchase
0 | dog_shampoo | 2
0 | dog_toy | 2
2 | dog_shampoo | 3
2 | dog_toy | 3
uj5u.com熱心網友回復:
考慮以下方法
select product_name, max(concurrent_purchase) as max_concurrent_purchases
from (
select *, count(*) over win as concurrent_purchase
from your_table
window win as (
partition by product_name
order by unix_seconds(timestamp)
range between current row and 2 following
)
)
group by product_name
如果應用于您問題中的樣本資料 - 輸出是
如果使用range between current row and 1 following
- 輸出是
uj5u.com熱心網友回復:
考慮如下:
interval_value
您可以通過變數調整間隔。
DECLARE interval_value INT64 DEFAULT 2;
CREATE TEMP TABLE sample (
timestamp TIMESTAMP,
customer_id STRING,
product_name STRING
);
INSERT INTO sample VALUES
('2022-04-01 23:49:07 UTC', 'a23', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a33', 'dog_shampoo'),
('2022-04-01 23:49:07 UTC', 'a45', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a67', 'dog_shampoo'),
('2022-04-01 23:49:09 UTC', 'a66', 'dog_toy'),
('2022-04-01 23:49:09 UTC', 'a63', 'dog_toy'),
('2022-04-01 23:50:10 UTC', 'a50', 'dog_shampoo'),
('2022-04-01 23:50:11 UTC', 'a51', 'dog_shampoo')
;
WITH intervals AS (
SELECT evt_grp, product_name, COUNT(1) AS purchases FROM (
SELECT *, DIV(TIMESTAMP_DIFF(timestamp, MIN(timestamp) OVER (), SECOND), interval_value 1) evt_grp
FROM sample
)
GROUP BY 1, 2
)
SELECT product_name, MAX(purchases) AS max_concurrent_purchasesax
FROM intervals
GROUP BY 1
;
如果設定interval_value
為 1,則輸出為:
如果interval_value
為 2,則輸出為:
uj5u.com熱心網友回復:
我在這里做了一個查詢來解決這個問題,但沒有使用回圈。我探索的是 GENERATE_TIMESTAMP_ARRAY 函式(https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_timestamp_array)和 TIMESTAMP_ADD 函式(https://cloud.google.com/bigquery/docs /reference/standard-sql/timestamp_functions#timestamp_add)使用間隔制作一個陣列,并在爆炸后將其與原始表連接。
這是代碼:
DECLARE window_size INT64;
SET window_size = 2;
WITH timestamp_array AS (
SELECT GENERATE_TIMESTAMP_ARRAY(min(timestamp), max(timestamp), INTERVAL window_size SECOND) as ts_array
FROM <your_table>
)
,products AS (
SELECT *
FROM UNNEST(['dog_shampoo', 'dog_toy']) AS product
)
,timestamp_ranges AS (
SELECT flattened_timestamp AS start_time, TIMESTAMP_ADD(flattened_timestamp, INTERVAL window_size SECOND) AS end_time, p.product
FROM timestamp_array
CROSS JOIN UNNEST(timestamp_array.ts_array) AS flattened_timestamp
CROSS JOIN products p
)
,products_sold AS (
SELECT
a.start_time
,a.end_time
,a.product
,b.timestamp
,b.customer_id
,IF (b.product_name IS NOT NULL, 1, 0) AS purchase_count
FROM timestamp_ranges a
LEFT JOIN <your_table> b
ON a.start_time <= b.timestamp
AND a.end_time > b.timestamp
AND a.product = b.product_name
)
SELECT
CONCAT(start_time, ' - ', end_time) AS time_range
,product
,sum(purchase_count) as concurrent_purchases
FROM products_sold
GROUP BY start_time, end_time, product
ORDER BY product, time_range
;
唯一的附加評論是,當時間戳大于或等于開始時間且小于結束時間時,我認為時間戳“在范圍內”。使用此規則,結果與您的示例不完全相同。但是如果你想改變這種行為,你只需要在這里改變加入規則:
LEFT JOIN <your_table> b
ON a.start_time <= b.timestamp
AND a.end_time > b.timestamp
轉載請註明出處,本文鏈接:https://www.uj5u.com/net/489186.html
上一篇:每10行創建摘要