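Advanced grouping aggregation in Hive means aggregating with GROUPING SETS, CUBE, or ROLLUP in the GROUP BY clause.
These constructs appear in many SQL dialects and are not unique to Hive; this article covers only how they behave in Hive.
They simplify SQL statements and, in most cases, also improve query performance.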
1. Using GROUPING SETS
Example:
-- Usage
select a,b,sum(c) from tbl group by a,b grouping sets(a,b)
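A GROUPING SETS clause lets a single GROUP BY statement specify several sets of grouping columns. Any query with a GROUPING SETS clause can be expressed as several GROUP BY queries joined with UNION.
Some common equivalent rewrites:
-- Statement 1
select a,b,sum(c) from tbl group by a,b grouping sets((a,b))
-- is equivalent to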
select a,b,sum(c) from tbl group by a,b
-- Statement 2
select a,b,sum(c) from tbl group by a,b grouping sets((a,b),a)
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
union
select a,null ,sum(c) from tbl group by a
-- Statement 3
select a,b,sum(c) from tbl group by a,b grouping sets(a,b)
-- is equivalent to
select a,null,sum(c) from tbl group by a
union
select null ,b,sum(c) from tbl group by b
-- Statement 4
select a,b,sum(c) from tbl group by a,b grouping sets((a,b),a,b,())
-- is equivalent to
select a,b,sum(c) from tbl group by a,b
union
select a,null,sum(c) from tbl group by a
union
select null,b,sum(c) from tbl group by b
union
select null,null,sum(c) from tbl
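As these rewrites show, the GROUPING SETS form is far more concise than the equivalent UNION queries. Performance is analyzed later in this article.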
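To make the equivalence concrete, here is a minimal Python sketch that emulates Statement 4 by running one plain group-by per grouping set and taking the union of the results. The sample rows and the helper names (`group_by_sum`, `grouping_sets_sum`) are assumptions made up for illustration; this models the semantics only and says nothing about how Hive executes the query.

```python
from collections import defaultdict

# Hypothetical sample rows for a table tbl(a, b, c); not taken from the article.
ROWS = [
    {"a": "x", "b": 1, "c": 10},
    {"a": "x", "b": 2, "c": 20},
    {"a": "y", "b": 1, "c": 30},
]
COLUMNS = ("a", "b")


def group_by_sum(rows, keys):
    """Emulate: select <keys, NULL for other columns>, sum(c) ... group by <keys>."""
    totals = defaultdict(int)
    for row in rows:
        # Columns outside the grouping set are reported as NULL (None).
        key = tuple(row[col] if col in keys else None for col in COLUMNS)
        totals[key] += row["c"]
    return {key + (total,) for key, total in totals.items()}


def grouping_sets_sum(rows, grouping_sets):
    """Union of one plain GROUP BY result per grouping set."""
    result = set()
    for keys in grouping_sets:
        result |= group_by_sum(rows, keys)
    return result


# Statement 4: grouping sets((a,b), a, b, ())
statement4 = grouping_sets_sum(ROWS, [("a", "b"), ("a",), ("b",), ()])
for record in sorted(statement4, key=repr):
    print(record)
```

With a single grouping set `(a,b)`, `grouping_sets_sum` returns exactly what `group_by_sum` returns, which mirrors Statement 1.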
2. Using CUBE and ROLLUP
Example:
-- CUBE example
select a,b,c,count(1) from tbl group by a,b,c with cube
-- ROLLUP example
select a,b,c,count(1) from tbl group by a,b,c with rollup
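Usage notes:
Both of these advanced grouping functions perform multiple grouping aggregations within a single GROUP BY statement, and both can be rewritten as equivalent GROUPING SETS.
- CUBE computes every combination of the GROUP BY columns
-- CUBE statement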
select a,b,c,count(1) from tbl group by a,b,c with cube
-- is equivalent to
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(b,c),(a,c),(a),(b),(c),())
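- ROLLUP aggregates over the GROUP BY columns from left to right, dropping the rightmost column at each level
-- ROLLUP statement (hierarchical, rolling aggregation)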
select a,b,c,count(1) from tbl group by a,b,c with rollup
-- is equivalent to
select a,b,c,count(1) from tbl group by a,b,c
grouping sets((a,b,c),(a,b),(a),())
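The two expansions above can be generated mechanically. The following Python sketch builds the grouping sets for CUBE (all subsets of the columns) and ROLLUP (left-to-right prefixes) and renders them as a GROUPING SETS clause. The function names are hypothetical helpers, not part of Hive.

```python
from itertools import combinations


def cube_sets(columns):
    """All subsets of the GROUP BY columns, largest first (2**n sets)."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]


def rollup_sets(columns):
    """Left-to-right prefixes of the GROUP BY columns (n + 1 sets)."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


def to_clause(sets):
    """Render a list of grouping sets as a GROUPING SETS clause."""
    return "grouping sets(" + ",".join("(" + ",".join(s) + ")" for s in sets) + ")"


print(to_clause(cube_sets(["a", "b", "c"])))
print(to_clause(rollup_sets(["a", "b", "c"])))
```

The CUBE output lists the two-column sets in a different order from the article's example, which does not matter because the order of grouping sets has no effect on the result.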
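3. Performance analysis of advanced grouping aggregation
We can see how an advanced grouping aggregation statement executes by examining its execution plan and comparing the nodes where optimization occurs.
Example 1: a query that uses the GROUPING SETS keyword.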
set hive.map.aggr=true;
explain
-- Average age by gender for people under 30
select gender,avg(age) as avg_age from temp.user_info_all where ymd = '20230505'
and age < 30
group by gender;
-- The same query rewritten with the GROUPING SETS keyword
set hive.map.aggr=true;
explain
select gender,avg(age) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender grouping sets((gender));
Its execution plan:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (age < 30) (type: boolean)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: avg(age)
keys: gender (type: int), 0 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: int)
sort order: ++
Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: struct<count:bigint,sum:double,input:bigint>)
Reduce Operator Tree:
Group By Operator
aggregations: avg(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col2
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: int), _col2 (type: double)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
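Key fields in the plan above:
Map phase:
- Group By Operator: aggregation is performed on the map side (map-side aggregation is enabled)
- aggregations: the aggregate function applied; here avg(age)
- keys: the grouping column plus a constant column 0
- mode: hash
- outputColumnNames: three output columns, _col0, _col1, _col2
- Reduce Output Operator: what the map side emits after map-side aggregation
- key expressions: the keys emitted by the map side; here gender and the constant 0
- sort order: ++ means both key columns are sorted in ascending order
- Map-reduce partition columns: the columns used to partition map output across reducers; here gender and the constant 0
- value expressions: the value emitted by the map side, a struct
Reduce phase:
- Group By Operator: the grouping aggregation on the reduce side
- aggregations: the aggregate function; avg(VALUE._col0) merges the partial results in _col0 of the map-side value expressions into the final average
- keys: the grouping keys, two columns, the same keys the map side emitted
- mode: mergepartial
- outputColumnNames: the output columns; here gender and num
- pruneGroupingSetId: whether to prune the grouping ID from the final output. When true, the last column of keys is dropped, which in this example is the constant 0 column
- Select Operator: column projection
- expressions: the output columns, gender and num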
As the plan shows, the EXPLAIN output for a GROUPING SETS query does not reveal how the grouping sets are actually implemented.
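One plausible reading of the extra key column and the pruneGroupingSetId flag is that each input row is emitted once per grouping set, tagged with an ID that keeps the sets apart, and that the ID is dropped again after the reduce-side merge. The Python sketch below models that idea for a count aggregation. It is a conceptual illustration, not Hive's code: the sample rows, the function names, and the use of a plain list index as the set ID are all assumptions, and Hive's real ID encoding differs between versions.

```python
from collections import defaultdict

COLUMNS = ("gender", "age")

# Hypothetical input rows; not taken from the article's table.
ROWS = [
    {"gender": 1, "age": 20},
    {"gender": 1, "age": 25},
    {"gender": 2, "age": 20},
]
GROUPING_SETS = [("gender",), ("age",)]


def map_side(rows, grouping_sets):
    """Emit every input row once per grouping set, tagged with a set ID."""
    for row in rows:
        for set_id, keys in enumerate(grouping_sets):
            # Columns outside the current grouping set become NULL (None).
            key = tuple(row[col] if col in keys else None for col in COLUMNS)
            # The trailing set_id plays the role of the extra key column in the plan.
            yield key + (set_id,), 1


def reduce_side(pairs, prune_grouping_set_id=True):
    """Merge counts per key, then optionally drop the trailing ID column."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    if prune_grouping_set_id:
        return sorted(((key[:-1], n) for key, n in counts.items()), key=repr)
    return sorted(counts.items(), key=repr)


pairs = list(map_side(ROWS, GROUPING_SETS))
for key, count in reduce_side(pairs):
    print(key, count)
```

Keeping the ID in the key until the merge is finished is what would stop a NULL produced by the grouping from being merged with a genuine NULL in the data.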
Now consider an example that aggregates over more than one column:
Example 2: grouping by gender and by age in a single query.
set hive.map.aggr=true;
explain
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age grouping sets(gender,age);
Note: every column used in GROUPING SETS must also be declared in the preceding GROUP BY.
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (age < 30) (type: boolean)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(0)
keys: gender (type: int), age (type: bigint), 0 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
sort order: +++
Map-reduce partition columns: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: bigint), KEY._col2 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: int), _col1 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
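These two examples show that the Hive execution plan does not reveal how advanced grouping aggregation carries out each grouping; the two plans are nearly identical.
What the plans do show is a single table scan, so the query avoids the repeated scans and data I/O of the UNION form, which saves some compute resources.
Example 3: replacing GROUPING SETS with CUBE.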
set hive.map.aggr=true;
explain
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age with cube;
-- Equivalent statement
select gender,age,count(0) as num from temp.user_info_all
where ymd = '20230505'
and age < 30
group by gender,age grouping sets((gender,age),(gender),(age),());
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: user_info_all
Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (age < 30) (type: boolean)
Statistics: Num rows: 10878098 Data size: 261074352 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(0)
keys: gender (type: int), age (type: bigint), 0 (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 43512392 Data size: 1044297408 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
sort order: +++
Map-reduce partition columns: _col0 (type: int), _col1 (type: bigint), _col2 (type: int)
Statistics: Num rows: 43512392 Data size: 1044297408 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: bigint), KEY._col2 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col3
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: int), _col1 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
Statistics: Num rows: 21756196 Data size: 522148704 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
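The CUBE statement in Example 3 returns entirely different data from Example 2, yet apart from the Statistics estimates its plan is almost identical to Example 2's. Hive's EXPLAIN output evidently does not break advanced grouping aggregation down into its individual groupings.
When using advanced grouping aggregation, make sure map-side aggregation is enabled.
As the examples above show, a single job accomplishes what the UNION form needs several jobs to do.
That reduces the disk and network I/O that multiple jobs would incur, so it is an optimization.
However, overusing advanced grouping aggregation can make the data volume balloon: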
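- With a plain GROUP BY, each row contributes to exactly one aggregation, and each group usually produces one output record.
- With advanced grouping aggregation such as CUBE, each row contributes to several aggregations within one job, and the final output has one record per group for every grouping set.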
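The row estimates in the three plans above are consistent with this expansion: the map-side Group By Operator reports 10878098 rows in Example 1 (one grouping set), 21756196 in Example 2 (two sets), and 43512392 in Example 3 (CUBE over two columns, four sets). The short Python sketch below reproduces that arithmetic. The function names are hypothetical, and the multiplication only mirrors the pattern visible in these plans; it is not Hive's statistics estimator.

```python
# "Num rows" after the Filter Operator, identical in all three plans.
FILTERED_ROWS = 10878098


def map_side_rows(input_rows, grouping_set_count):
    """Upper bound on rows leaving the map-side Group By: one copy per grouping set."""
    return input_rows * grouping_set_count


def cube_set_count(column_count):
    """CUBE over n columns expands to 2**n grouping sets."""
    return 2 ** column_count


def rollup_set_count(column_count):
    """ROLLUP over n columns expands to n + 1 grouping sets."""
    return column_count + 1


examples = [("Example 1", 1), ("Example 2", 2), ("Example 3", cube_set_count(2))]
for label, sets in examples:
    print(label, map_side_rows(FILTERED_ROWS, sets))
```

Because CUBE grows as 2**n, adding columns multiplies the intermediate data quickly: ten columns already mean 1024 grouping sets.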
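Notes:
When the base table behind an advanced grouping aggregation is very large, Map or Reduce tasks can easily fail for lack of hardware resources.
Hive provides the hive.new.job.grouping.set.cardinality setting to handle this case.
If the number of grouping sets in the SQL statement exceeds this value (default 30), Hive launches an additional job.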
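As a rough illustration of that rule, the Python sketch below checks whether a given number of grouping sets exceeds the threshold. The function names are hypothetical, and the comparison encodes only the rule as stated above, not Hive's planner logic.

```python
# Default value of hive.new.job.grouping.set.cardinality, as stated above.
DEFAULT_CARDINALITY_THRESHOLD = 30


def needs_extra_job(grouping_set_count, threshold=DEFAULT_CARDINALITY_THRESHOLD):
    """True when the number of grouping sets exceeds the configured threshold."""
    return grouping_set_count > threshold


def cube_set_count(column_count):
    """CUBE over n columns expands to 2**n grouping sets."""
    return 2 ** column_count


for columns in (2, 4, 5):
    count = cube_set_count(columns)
    print(f"cube over {columns} columns: {count} grouping sets, "
          f"extra job: {needs_extra_job(count)}")
```

Under the default, a CUBE over four columns (16 sets) stays in one job, while a CUBE over five columns (32 sets) crosses the threshold. A session could change the threshold with a statement such as set hive.new.job.grouping.set.cardinality=16; where the value 16 is only an illustration.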
Next in this series: Hive window functions explained, with a performance analysis of SQL that uses them.