Skip to content

william-dan/LingoDB-CSE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LingoDB-CSE banner

LingoDB-CSE

Common Sub-Join Elimination and HashMap Reuse for Multi-Query Optimization

status focus ir join-cse hashmap batch

Overview

This repository contains my undergraduate thesis implementation on top of LingoDB. The work targets repeated Join computation in multi-query scenarios and introduces optimizations at both RelAlg and SubOperator levels.

Core idea:

  • detect equivalent Join sub-expressions and reuse them;
  • adjust build/probe roles to enable better hash sharing;
  • reuse hash multi-map states instead of rebuilding;
  • compile multiple SQL statements into one MLIR module to expose cross-query optimization opportunities.

Thesis Context

  • Thesis title: Multi-Query Optimization via Common Sub-Join Reuse in Relational Compiled Databases
  • Thesis PDF: undergraduate thesis
  • Completion date: May 20, 2024
  • Baseline used in evaluation: LingoDB commit eee8c78847b2377ddc8b84974585182a7bb67700

Case Study

The following four snapshots show how the optimizer rewrites plans in representative scenarios.

Case 1: Common Join Reuse

Repeated join fragments are detected and merged to avoid redundant join work.

Case 1 Optimization

Case 2: Structural Simplification

Equivalent plan structures are normalized so downstream passes can apply stronger rewrites.

Case 2 Optimization

Case 3: HashMap Reuse

Hash build states are shared across compatible contexts, reducing repeated hash construction.

Case 3 Optimization

Case 4: Batch-Aware Optimization

Multiple statements in one module expose larger reuse opportunities than isolated execution.

Case 4 Optimization

Implementation Highlights

Area Contribution Main files
RelAlg Equivalent Join edge matching for CSE lib/RelAlg/Transforms/MatOpt/joinCSE.cpp, include/mlir/Dialect/RelAlg/Transforms/MatOpt/GraphMatcher.h
RelAlg Plan rebuild to realize Join reuse include/mlir/Dialect/RelAlg/Transforms/MatOpt/RebuildPlan.h
RelAlg Alias cleanup and column reference repair lib/RelAlg/Transforms/MatOpt/EraseAlias.cpp
RelAlg Build/probe role adjustment for hash reuse lib/RelAlg/Transforms/MatOpt/SwitchBuildProbe.cpp
SubOperator Hash multi-map reuse pass lib/SubOperator/Transforms/ReuseHashMultiMap.cpp
Pipeline wiring Inject new passes into optimization flow lib/RelAlg/Passes.cpp, lib/execution/Execution.cpp
Multi-query input Parse SQL file into multi-statement module lib/execution/Frontend.cpp

Experimental Results

All numbers below come from Chapter 8 of the thesis.

1) Join CSE (single-query)

xychart-beta
    title "Join CSE Improvement (%)"
    x-axis ["1a", "1b", "1c", "2a", "2b"]
    y-axis "Improvement %" 0 --> 100
    bar [61, 66, 49, 97, 36]
Loading
Case Baseline (s) Join CSE (s) Improvement
1a 3.32 1.31 61%
1b 6.03 2.10 66%
1c 14.68 7.50 49%
2a 0.22 0.007 97%
2b 17.50 11.30 36%

2) HashMap Reuse (on top of Join CSE + aligned build side)

xychart-beta
    title "HashMap Reuse Improvement (%)"
    x-axis ["3a", "3b", "3c"]
    y-axis "Improvement %" 0 --> 100
    bar [58, 91, 89]
Loading
Case Baseline (s) HashMap Reuse (s) Improvement
3a 2.60 1.11 58%
3b 36.96 3.45 91%
3c 38.74 4.49 89%

3) Batch Execution + Join CSE

xychart-beta
    title "Batch Execution Improvement (%)"
    x-axis ["4a", "4b", "4c"]
    y-axis "Improvement %" 0 --> 100
    bar [59, 66, 67]
Loading
Case Baseline (s, sequential) Optimized (s, batch) Improvement
4a 2.25 + 3.32 2.30 59%
4b 2.75 + 6.03 2.97 66%
4c 6.33 + 14.68 6.95 67%

Overall Gain Profile

pie showData
    title Average Improvement by Optimization Type
    "Join CSE" : 60
    "HashMap Reuse" : 79
    "Batch (Join CSE)" : 64
Loading

Why This Matters

  • The optimization is integrated as compiler passes, not a one-off executor hack.
  • Reuse is done on hash structures, which avoids many extra materialization costs.
  • Multi-statement compilation increases optimization visibility and can amplify gains.

Try LingoDB

You can still use LingoDB via:

  1. Hosted SQL web interface
  2. Python package: pip install lingodb
  3. Docker image
  4. Build from source: official docs

Documentation

Official documentation: lingo-db docs
Docs repository: github.com/lingo-db/docs

About

LingoDB-CSE is a research prototype for multi-query optimization via common sub-join elimination and hash multi-map reuse across RelAlg/SubOp, with batch SQL compilation and thesis-backed speedups.

Topics

Resources

License

Stars

Watchers

Forks

Contributors