A Brief Analysis of Join Query Design and Implementation in Distributed Databases

Compared with queries against a single-instance database, querying distributed data raises many technical challenges.

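This article records the implementation ideas behind Join queries under MySQL sharding (split databases and split tables) and in Elasticsearch, as a way to understand how data processing is designed for distributed scenarios.
It starts with Join under sharding in MySQL, the common relational database, then moves on to the non-relational Elasticsearch and its Join strategy, working step by step into how Join is implemented.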

① MySQL Sharding: Join Query Scenario

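With sharded databases and tables, the questions are how a query statement is dispatched and how the returned data is assembled. Compared with NoSQL databases, MySQL adapts to distributed scenarios fairly easily, as long as it stays within the SQL standard.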

The solution below is based on the sharding-jdbc middleware and is used to walk through the overall design.

sharding-jdbc

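sharding-jdbc proxies the original DataSource and implements the JDBC specification to dispatch statements to shards and assemble the results; the application layer is unaware of it.
Execution flow: SQL parsing => executor optimization => SQL routing => SQL rewriting => SQL execution => result merging (see io.shardingsphere.core.executor.ExecutorEngine#execute).
Parsing the Join statement determines which instance nodes the SQL is dispatched to; this corresponds to the SQL routing step.
SQL rewriting replaces the original (logical) table names with the actual shard table names.
In the worst case, the maximum number of statements a Join query is dispatched as = database instances × shards of table A × shards of table B.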
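
To make the rewrite step and the worst-case count concrete, here is a minimal sketch. TableNameRewriter, rewrite and maxExecutions are hypothetical names introduced for illustration only; they are not sharding-jdbc APIs. The regex-based replacement is an assumption made for brevity, whereas the middleware works from the parsed SQL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of sharding-jdbc: rewrites logical table names to actual shard names. */
final class TableNameRewriter {

    /** Replaces whole-word occurrences of logicTable with actualTable. */
    static String rewrite(String sql, String logicTable, String actualTable) {
        // "_" counts as a word character, so "order" does not match inside "order_item" or "order_id".
        Pattern wholeWord = Pattern.compile("\\b" + Pattern.quote(logicTable) + "\\b");
        return wholeWord.matcher(sql).replaceAll(Matcher.quoteReplacement(actualTable));
    }

    /** Upper bound on dispatched statements for a two-table join when no shard can be pinned down. */
    static int maxExecutions(int dbInstances, int shardsOfA, int shardsOfB) {
        return dbInstances * shardsOfA * shardsOfB;
    }

    public static void main(String[] args) {
        String logical = "select * from order o inner join order_item oi on o.order_id = oi.order_id";
        String actual = rewrite(rewrite(logical, "order_item", "order_item_0"), "order", "order_1");
        System.out.println(actual);
        System.out.println(maxExecutions(2, 2, 2)); // 2 instances x 2 shards x 2 shards
    }
}
```

With two database instances and two shards per table, the bound evaluates to 8 statements.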

Code Insight

Sample project: [email protected]:cluoHeadon/sharding-jdbc-demo.git

/**
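 * Entry point for executing a query SQL; the whole execution flow can be debugged from here.
 * @see ShardingPreparedStatement#execute()
 * @see ParsingSQLRouter#route(String, List, SQLStatement) Which tables a Join query actually touches is determined by matching against the routing rules.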
 */
public boolean execute() throws SQLException {
    try {
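        // Match the actual tables involved, based on the parameters (which decide the shard) and the concrete SQL.
        Collection<PreparedStatementUnit> preparedStatementUnits = route();
        // Dispatch execution and merge the results using a thread pool.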
        return new PreparedStatementExecutor(getConnection().getShardingContext().getExecutorEngine(), routeResult.getSqlStatement().getType(), preparedStatementUnits).execute();
    } finally {
        JDBCShardingRefreshHandler.build(routeResult, connection).execute();
        clearBatch();
    }
}

SQL routing strategies

Enable SQL logging to see the SQL that is actually dispatched and executed.

# The log is printed right after the route() call above has produced the execution units
sharding.jdbc.config.sharding.props.sql.show=true

sharding-jdbc picks a routing strategy depending on the SQL statement. For the Join queries we care about, the two relevant strategies are:

StandardRoutingEngine: binding-tables mode
ComplexRoutingEngine: the most complex case, where the joined tables are combined as a Cartesian product

-- Parameters unknown, so the shard cannot be located
select * from order o inner join order_item oi on o.order_id = oi.order_id 

-- Routing result
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db1 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_1 o inner join order_item_0 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_1 oi on o.order_id = oi.order_id 
-- Actual SQL: db0 ::: select * from order_0 o inner join order_item_0 oi on o.order_id = oi.order_id
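
The eight statements above are the Cartesian combination of two data sources, two order shards and two order_item shards. The sketch below enumerates that combination and, for contrast, the pairing you would expect if the two tables were configured as binding tables (shard i only joins shard i). JoinRouteSketch and its methods are hypothetical names, not sharding-jdbc classes, and the route strings are simplified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of route enumeration; names are illustrative, not sharding-jdbc APIs. */
final class JoinRouteSketch {

    /** Cartesian routing: every shard of A is paired with every shard of B on every data source. */
    static List<String> cartesianRoutes(List<String> dataSources, List<String> shardsA, List<String> shardsB) {
        List<String> routes = new ArrayList<>();
        for (String ds : dataSources) {
            for (String a : shardsA) {
                for (String b : shardsB) {
                    routes.add(ds + " ::: " + a + " JOIN " + b);
                }
            }
        }
        return routes;
    }

    /** Binding-table routing: shard i of A only joins shard i of B (assumes identical sharding rules). */
    static List<String> bindingRoutes(List<String> dataSources, List<String> shardsA, List<String> shardsB) {
        if (shardsA.size() != shardsB.size()) {
            throw new IllegalArgumentException("binding tables must have the same number of shards");
        }
        List<String> routes = new ArrayList<>();
        for (String ds : dataSources) {
            for (int i = 0; i < shardsA.size(); i++) {
                routes.add(ds + " ::: " + shardsA.get(i) + " JOIN " + shardsB.get(i));
            }
        }
        return routes;
    }

    public static void main(String[] args) {
        List<String> dbs = Arrays.asList("db0", "db1");
        List<String> orders = Arrays.asList("order_0", "order_1");
        List<String> items = Arrays.asList("order_item_0", "order_item_1");
        System.out.println(cartesianRoutes(dbs, orders, items).size()); // 8
        System.out.println(bindingRoutes(dbs, orders, items).size());   // 4
    }
}
```

The contrast suggests why supplying a sharding key in the query, or declaring binding tables where the sharding rules allow it, keeps the fan-out small.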

② Elasticsearch Join Query Scenario

First, if a NoSQL database is being asked to run Join queries, it is worth checking whether the use case or the usage itself is the problem.

That said, some scenarios unavoidably need this capability, and implementing Join there comes closer to what a SQL engine does.

The solution below is based on the elasticsearch-sql component and is used to outline the implementation.

elasticsearch-sql

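It is an Elasticsearch plugin that exposes an HTTP service for SQL-like queries; newer versions of Elasticsearch already provide a SQL query feature of their own.
Because Elasticsearch has no Join query capability, implementing SQL Join requires lower-level functionality, which brings in Join algorithms.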

Code Insight

Source code: [email protected]:NLPchina/elasticsearch-sql.git

/**
 * Execute the ActionRequest and returns the REST response using the channel.
 * @see ElasticDefaultRestExecutor#execute
 * @see ESJoinQueryActionFactory#createJoinAction Join algorithm selection
 */
@Override
public void execute(Client client, Map<String, String> params, QueryAction queryAction, RestChannel channel) throws Exception{
    // sql parse
    SqlElasticRequestBuilder requestBuilder = queryAction.explain();

    // Join query
    if(requestBuilder instanceof JoinRequestBuilder){
        // Join algorithm selection, among: HashJoinElasticExecutor, NestedLoopsElasticExecutor
        // If the join condition is an equality (Condition.OPEAR.EQ), HashJoinElasticExecutor is used
        ElasticJoinExecutor executor = ElasticJoinExecutor.createJoinExecutor(client,requestBuilder);
        executor.run();
        executor.sendResponse(channel);
    }
    // Other query types ...
}
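
The selection rule described in the comments above can be sketched as follows. This is a simplified illustration under stated assumptions: JoinAlgorithmChooser, Operator and JoinAlgorithm are hypothetical names, not elasticsearch-sql classes, and falling back to nested loops when there is no join condition at all is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of join-algorithm selection; not the elasticsearch-sql implementation. */
final class JoinAlgorithmChooser {

    enum Operator { EQ, NEQ, GT, GTE, LT, LTE }

    enum JoinAlgorithm { HASH_JOIN, NESTED_LOOPS }

    /** Hash join needs equality keys to hash on; any other comparison falls back to nested loops. */
    static JoinAlgorithm choose(List<Operator> onConditions) {
        if (onConditions.isEmpty()) {
            return JoinAlgorithm.NESTED_LOOPS; // assumption: nothing to hash on without a condition
        }
        for (Operator op : onConditions) {
            if (op != Operator.EQ) {
                return JoinAlgorithm.NESTED_LOOPS;
            }
        }
        return JoinAlgorithm.HASH_JOIN;
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList(Operator.EQ)));              // HASH_JOIN
        System.out.println(choose(Arrays.asList(Operator.EQ, Operator.GT))); // NESTED_LOOPS
        System.out.println(choose(Collections.<Operator>emptyList()));       // NESTED_LOOPS
    }
}
```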

③ More Than Join

Join algorithms

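Three Join algorithms are in common use: Nested Loop Join (NLJ), Hash Join, and Merge Join.
MySQL originally supported only NLJ and its variants; Hash Join is supported from version 8.0.18 onward.
NLJ is equivalent to two nested loops: the first table drives the outer loop and the second table the inner loop. Each outer-loop row is compared against the inner-loop rows, and the rows that satisfy the condition are emitted.
Hash Join has two phases: the build phase and the probe phase.
Use EXPLAIN to see which Join algorithm MySQL chooses; the required syntax keyword is FORMAT=JSON or FORMAT=TREE.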

EXPLAIN FORMAT=JSON  
SELECT * FROM
    sale_line_info u
    JOIN sale_line_manager o ON u.sale_line_code = o.sale_line_code;

{
    "query_block": {
        "select_id": 1,
        // Join algorithm used: nested_loop
        "nested_loop": [
            // Tables involved in the join and their keys; the remaining fields resemble regular EXPLAIN output
            {
                "table": {
                    "table_name": "o",
                    "access_type": "ALL"
                }
            },
            {
                "table": {
                    "table_name": "u",
                    "access_type": "ref"
                }
            }
        ]
    }
}
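
For reference, the two algorithms described above can be sketched in a few lines. This is an in-memory illustration, not how MySQL or elasticsearch-sql implement them; the class name JoinAlgorithms and the row layout (a String[] of {joinKey, payload}) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory joins; each row is modeled as {joinKey, payload}. */
final class JoinAlgorithms {

    /** Nested Loop Join: compare every outer row with every inner row. */
    static List<String> nestedLoopJoin(List<String[]> outer, List<String[]> inner) {
        List<String> result = new ArrayList<>();
        for (String[] o : outer) {
            for (String[] i : inner) {
                if (o[0].equals(i[0])) {
                    result.add(o[1] + "|" + i[1]);
                }
            }
        }
        return result;
    }

    /** Hash Join: build a hash table on one input, then probe it with the other. */
    static List<String> hashJoin(List<String[]> build, List<String[]> probe) {
        // Build phase: group the build-side rows by join key.
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] b : build) {
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b);
        }
        // Probe phase: look up each probe-side row by its join key.
        List<String> result = new ArrayList<>();
        for (String[] p : probe) {
            List<String[]> matches = table.get(p[0]);
            if (matches == null) {
                continue;
            }
            for (String[] b : matches) {
                result.add(b[1] + "|" + p[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> info = Arrays.asList(new String[]{"A", "info-A"}, new String[]{"B", "info-B"});
        List<String[]> manager = Arrays.asList(
                new String[]{"A", "mgr-1"}, new String[]{"A", "mgr-2"}, new String[]{"C", "mgr-3"});
        System.out.println(nestedLoopJoin(info, manager));
        System.out.println(hashJoin(info, manager));
    }
}
```

The nested loop compares every pair of rows, while the hash join reads each input once, which is why an equality condition makes the hash variant attractive when the build side fits in memory.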

Elasticsearch Nested type

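After analyzing the business data and usage scenarios in Elasticsearch, another option is to store the related information directly inside the document. Elasticsearch then serves queries and searches against the complete document, which avoids Join-related techniques altogether.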

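This raises questions such as whether the related data is owned by the parent record or shared across records, how large the related data is, and how often it is updated. All of these must be weighed when using the Nested type.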
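
As a rough illustration of the idea, the sketch below models an order document that embeds its items, so a query filters on the embedded items directly and no join is needed. It also contrasts nested matching, where all conditions must hold on the same embedded item, with matching over a flattened array of field values. The classes and fields (OrderDoc, Item, sku, qty) are hypothetical and are not taken from the business feature mentioned in this article.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory model of a document with embedded (nested) items. */
final class NestedMatchSketch {

    static final class Item {
        final String sku;
        final int qty;
        Item(String sku, int qty) { this.sku = sku; this.qty = qty; }
    }

    static final class OrderDoc {
        final String orderId;
        final List<Item> items; // related data stored inside the parent document
        OrderDoc(String orderId, List<Item> items) { this.orderId = orderId; this.items = items; }
    }

    /** Nested semantics: both conditions must hold on the same embedded item. */
    static boolean nestedMatch(OrderDoc doc, String sku, int minQty) {
        for (Item item : doc.items) {
            if (item.sku.equals(sku) && item.qty >= minQty) {
                return true;
            }
        }
        return false;
    }

    /** Flattened semantics: each condition may be satisfied by a different item. */
    static boolean flattenedMatch(OrderDoc doc, String sku, int minQty) {
        boolean skuHit = false;
        boolean qtyHit = false;
        for (Item item : doc.items) {
            skuHit |= item.sku.equals(sku);
            qtyHit |= item.qty >= minQty;
        }
        return skuHit && qtyHit;
    }

    public static void main(String[] args) {
        OrderDoc doc = new OrderDoc("o-1", Arrays.asList(new Item("A", 1), new Item("B", 5)));
        System.out.println(nestedMatch(doc, "A", 5));    // false: no single item is sku A with qty >= 5
        System.out.println(flattenedMatch(doc, "A", 5)); // true: sku A and qty 5 come from different items
    }
}
```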

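More usage details are available online and in the official documentation, so they are not repeated here.
We currently have a business feature that uses the Nested type, and it resolved a very hard problem during querying and optimization.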

Summary

Analyzing how these components work gives a clear and deeper understanding of their execution flow.

It makes middleware tuning and technology selection more purposeful, and makes us more careful when using these tools.

Explicit filter conditions, a narrower filter range, and a LIMIT on the rows fetched all reduce computation cost and improve performance.


Author: Yang Pan (楊攀), JD Logistics

Source: JD Cloud Developer Community
