Hive執行計劃之hive依賴及權限查詢和常見使用場景-有解無憂

概述
1.explain dependency的查詢與使用
2.借助explain dependency解決一些常見問題
- 2.1.識別看似等價的SQL代碼實際上是不等價的：
- 2.2 通過explain dependency驗證將過濾條件在不同位置的查詢區別
3.查看SQL操作涉及到的相關權限資訊

概述

Hive查看執行計劃的命令中還有兩個不怎么常用但很重要的命令，接下來詳細介紹一下，

有一個問題：如何在hiveSQL執行之前就探查到這段邏輯的血緣依賴關系？

hive血緣是很多生產級數倉必須要提供的功能，大多數解決方案都是使用hive hooks的方法通過SQL執行后決議得到hive表的依賴關系，

這個方案能細粒度到欄位級依賴，屬于很完善的一個解決方案，但有很多場景我們需要在SQL執行之前就得到依賴關系，那么如何解決的呢？

1.explain dependency的查詢與使用

explain dependency 提供了這樣的一個解決方案，它可以查詢一段SQL需要的資料來源，以JSON的形式展現結果資料，里面主要包含兩部分內容：

input_tables：描述一段SQL依賴的資料來源表，里面存盤的是hive表名的串列，格式如下：

{"tablename":"庫名@表名","tabletype":"表的型別（外部表/內部表）"}
input_partitions：描述一段SQL依賴的資料來源表磁區，里面存盤的是磁區名稱的串列，格式如下：

{"partitionName":"庫名@表名@磁區列=磁區列的值"}

如果查詢的表為非磁區表，則顯示為空，

可以通過以下例子來進行比對，其中例1是查詢非磁區普通表SQL的explain dependency，例2是查詢磁區表SQL的explain dependency，

例1 使用explain dependency查看SQL非磁區普通表，

explain dependency
-- 統計年齡小于30歲各個年齡里，昵稱里帶“小”的人數
select age,count(0) as num from temp.user_info_all_no
where age < 30 and nick like '%小%'
group by age;

輸出結果內容：

{"input_tables":[{"tablename":"temp@user_info_all_no","tabletype":"MANAGED_TABLE"}],"input_partitions":[]}

例2 使用explain dependency查看SQL查詢磁區表，

explain dependency
-- 統計年齡小于30歲各個年齡里，昵稱里帶“小”的人數,其中ymd欄位為磁區欄位
select age,count(0) as num from temp.user_info_all where ymd >= '20230501'
and age < 30 and nick like '%小%'
group by age;

輸出結果內容：

{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"}]}

2.借助explain dependency解決一些常見問題

explain dependency的使用場景有以下幾個：

場景一，快速排除，快速排除因為讀不到相應磁區的資料而導致任務資料輸出例外，例如，在一個以天為磁區的任務中，上游任務因為生產程序不可控因素出現例外或者空跑，導致下游任務引發例外，通過這種方式，可以快速查看SQL讀取的磁區是否出現例外，

場景二，理清表的輸入，幫助理解程式的運行，特別是有助于理解有多重子查詢，多表連接的依賴輸入，

場景三，提前通過決議hiveSQL腳本進行血緣依賴決議，用于一些定制化資料平臺工具開發中的血緣構建，

explain dependency的使用能幫助開發者解決哪些問題呢？

2.1.識別看似等價的SQL代碼實際上是不等價的：

對于接觸SQL不久的程式員來說，很多人容易將

select * from a left join b on a.no=b.no and a.f>1 and a.f<3;

這段邏輯等價于 select * from a left join b on a.no=b.no where a.f>1 and a.f<3;

這兩段的邏輯的區別是在多表left join的時候where 后加條件是否等價與on后面加條件，

我們通過實體來看看其中的區別：

例3 使用explain dependency識別看似等價的SQL代碼，

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and a.ymd >= '20230501' and a.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where a.ymd >= '20230501' and a.ymd <= '20230502';

輸出結果內容：

// 代碼1輸出結果
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

// 代碼2輸出結果
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

通過以上輸出結果可以看出，上面例子里的兩段SQL其實并不等價，在left join（left outer join）的連接條件中加入非等值的過濾條件后，這里特指作用于a表，也就是連接的基表，并沒有將左外連接的左右兩個表按照過濾條件進行過濾，左外連接在執行時會讀取所有磁區資料，然后進行關聯資料過濾操作，

left outer join 針對左表非等值條件on和where查詢資料on條件查詢資料大于where條件查詢資料，

下面查看left outer join對右表的過濾條件實體：

例4 使用explain dependency識別left outer join 右表過濾非等值條件區別

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and b.ymd >= '20230501' and b.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where b.ymd >= '20230501' and b.ymd <= '20230502';

輸出結果內容：

// 代碼1輸出結果,on后跟非等值條件
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

// 代碼2輸出結果,where后跟非等值條件
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

可以看到left outer join 針對右表非等值條件on和where查詢資料左表都是全表掃描，右表on條件是條件過濾，where條件是全表掃描，

接下來對inner join，right outer join，full outer join進行測驗，會發現

inner join 的類似針對左右表非等值條件on和where查詢資料是等價的，

right outer join和left join相反，

full outer join都是全表掃描，

那么可以很好的判斷出一下兩段SQL的過濾條件資料讀取范圍是完全不一樣的，就不貼執行結果了，

例5 left outer join下的對左表和右表不等值條件過濾，

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and a.ymd >= '20230501' and a.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and b.ymd >= '20230501' and b.ymd <= '20230502';

以上不同join型別資料查詢范圍不一致主要原因和hive對join和where的謂詞下推支持不同有關，通過explain dependency可以直接驗證hive對join和where進行謂詞下推規則的驗證，

謂詞下推可詳細查看什么是謂詞下推，看這一篇就夠了

2.2 通過explain dependency驗證將過濾條件在不同位置的查詢區別

如果要使用外連接并需要對左右兩個表進行條件過濾，做好的方式是將過濾條件放到就近處，即如果已經知道表資料過濾篩選條件，那么在使用該表前，就先用過濾條件進行過濾，然后進行其他操作，

一些SQL內置優化器會做一些過濾下推優化，但部分條件還是不會進行下推，所以我們在寫SQL時盡量養成先過濾而后進行其他操作（聚合，關聯）的習慣，

可以看如下實體：

例6 left outer join對左表過濾資料的優化對比，

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where a.ymd >= '20230501' and a.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from (
	select uid,ymd from temp.user_info_all
  -- 在子查詢內部進行過濾
  where ymd >= '20230501' and ymd <= '20230502'  
) a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd;

-- 代碼3
explain dependency
select a.uid from (
	select uid,ymd from temp.user_info_all
  -- 在子查詢內部進行過濾
  where ymd >= '20230501' and ymd <= '20230502'  
) a
left outer join (
	select uid,ymd from temp.user_act_info
  where ymd >= '20230501' and ymd <= '20230502'
) b
on a.uid = b.uid and a.ymd = b.ymd;

執行結果：

//代碼1，左右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

//代碼2，右表進行了全表掃描
{"input_tables":[{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

//代碼3，左右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

可以看到left outer join對左表過濾資料的優化中代碼1片段等價于代碼3片段，即兩表都在就近處都過濾，

例7 left outer join對右表過濾資料的優化對比，

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where b.ymd >= '20230501' and b.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from (
	select uid,ymd from temp.user_info_all
  -- 在子查詢內部進行過濾
  where ymd >= '20230501' and ymd <= '20230502'  
) a
left outer join (
	select uid,ymd from temp.user_act_info
  where ymd >= '20230501' and ymd <= '20230502'
) b
on a.uid = b.uid and a.ymd = b.ymd;

-- 代碼3
explain dependency
select a.uid from temp.user_info_all a
left outer join (
	select uid,ymd from temp.user_act_info
  where ymd >= '20230501' and ymd <= '20230502'
) b
on a.uid = b.uid and a.ymd = b.ymd;

執行結果內容：

// 代碼1 ，左右表都進行了全表掃描
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

//代碼2，左右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

//代碼3，右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"}]}

可以看到left outer join對右表過濾資料的優化中代碼2是最優，代碼3次之，代碼1最差，

3.查看SQL操作涉及到的相關權限資訊

通過explain authorization可以知道當前SQL訪問的資料來源（INPUTS）和資料輸出（OUTPUTS），以及當前Hive的訪問用戶（CURRENT_USER）和操作（OPERATION），

可以看以下實體：

例8 使用explain authorization查看權限相關資訊，

explain authorization
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where a.ymd >= '20230501' and a.ymd <= '20230502';

執行結果：

INPUTS: 
  temp@user_info_all
  temp@user_act_info
  temp@user_info_all@ymd=20230501
  temp@user_info_all@ymd=20230502
  temp@user_act_info@ymd=20230501
  temp@user_act_info@ymd=20230502
OUTPUTS: 
  hdfs://nameservice1/tmp/hive/hdfs/a88cc133-c310-4129-bfa0-28011ac23904/hive_2023-06-07_19-42-55_464_2777807904847671424-1/-mr-10000
CURRENT_USER: 
  hdfs
OPERATION: 
  QUERY
AUTHORIZATION_FAILURES: 
  Permission denied: Principal [name=hdfs, type=USER] does not have following privileges for operation QUERY [[SELECT] on Object [type=TABLE_OR_VIEW, name=temp.user_act_info], [SELECT] on Object [type=TABLE_OR_VIEW, name=temp.user_info_all]]

從上面的資訊可知：

上面案例的資料來源是temp資料庫中的 user_info_all表和user_act_info表；

資料的輸出路徑是hdfs://nameservice1/tmp/hive/hdfs/a88cc133-c310-4129-bfa0-28011ac23904/hive_2023-06-07_19-42-55_464_2777807904847671424-1/-mr-10000；

當前的操作用戶是hdfs，操作是查詢（QUERY）；

觀察上面的資訊我們還會看到AUTHORIZATION_FAILURES資訊，提示對當前的輸入沒有查詢權限，但如果運行上面的SQL的話也能夠正常運行，為什么會出現這種情況？Hive在默認不配置權限管理的情況下不進行權限驗證，所有的用戶在Hive里面都是超級管理員，即使不對特定的用戶進行賦權，也能夠正常查詢，

通過上面對explain相關引數的介紹，可以發現explain中有很多值得我們去研究的內容，讀懂 explain 的執行計劃有利于我們優化Hive SQL，同時也能提升我們對SQL的掌控力，

下一期：Hive執行計劃之什么是hiveSQL向量化模式及優化詳解

按例，歡迎點擊此處關注我的個人公眾號，交流更多知識，

后臺回復關鍵字 hive，隨機贈送一本魯邊備注版珍藏大資料書籍，

轉載請註明出處，本文鏈接：https://www.uj5u.com/shujuku/554630.html

標籤：大數據

上一篇：kafka的安裝和基本操作

下一篇：返回列表