LingoDB-CSE

Common Sub-Join Elimination and HashMap Reuse for Multi-Query Optimization

Overview

This repository contains my undergraduate thesis implementation on top of LingoDB. The work targets repeated Join computation in multi-query scenarios and introduces optimizations at both RelAlg and SubOperator levels.

Core idea:

detect equivalent Join sub-expressions and reuse them;
adjust build/probe roles to enable better hash sharing;
reuse hash multi-map states instead of rebuilding;
compile multiple SQL statements into one MLIR module to expose cross-query optimization opportunities.

Thesis Context

Thesis title: Multi-Query Optimization via Common Sub-Join Reuse in Relational Compiled Databases
Thesis PDF: undergraduate thesis
Completion date: May 20, 2024
Baseline used in evaluation: LingoDB commit eee8c78847b2377ddc8b84974585182a7bb67700

Case Study

The following four snapshots show how the optimizer rewrites plans in representative scenarios.

Case 1: Common Join Reuse

Repeated join fragments are detected and merged to avoid redundant join work.

Case 2: Structural Simplification

Equivalent plan structures are normalized so downstream passes can apply stronger rewrites.

Case 3: HashMap Reuse

Hash build states are shared across compatible contexts, reducing repeated hash construction.

Case 4: Batch-Aware Optimization

Multiple statements in one module expose larger reuse opportunities than isolated execution.

Implementation Highlights

Area	Contribution	Main files
RelAlg	Equivalent Join edge matching for CSE	`lib/RelAlg/Transforms/MatOpt/joinCSE.cpp`, `include/mlir/Dialect/RelAlg/Transforms/MatOpt/GraphMatcher.h`
RelAlg	Plan rebuild to realize Join reuse	`include/mlir/Dialect/RelAlg/Transforms/MatOpt/RebuildPlan.h`
RelAlg	Alias cleanup and column reference repair	`lib/RelAlg/Transforms/MatOpt/EraseAlias.cpp`
RelAlg	Build/probe role adjustment for hash reuse	`lib/RelAlg/Transforms/MatOpt/SwitchBuildProbe.cpp`
SubOperator	Hash multi-map reuse pass	`lib/SubOperator/Transforms/ReuseHashMultiMap.cpp`
Pipeline wiring	Inject new passes into optimization flow	`lib/RelAlg/Passes.cpp`, `lib/execution/Execution.cpp`
Multi-query input	Parse SQL file into multi-statement module	`lib/execution/Frontend.cpp`

Experimental Results

All numbers below come from Chapter 8 of the thesis.

1) Join CSE (single-query)

xychart-beta
    title "Join CSE Improvement (%)"
    x-axis ["1a", "1b", "1c", "2a", "2b"]
    y-axis "Improvement %" 0 --> 100
    bar [61, 66, 49, 97, 36]

Case	Baseline (s)	Join CSE (s)	Improvement
1a	3.32	1.31	61%
1b	6.03	2.10	66%
1c	14.68	7.50	49%
2a	0.22	0.007	97%
2b	17.50	11.30	36%

2) HashMap Reuse (on top of Join CSE + aligned build side)

xychart-beta
    title "HashMap Reuse Improvement (%)"
    x-axis ["3a", "3b", "3c"]
    y-axis "Improvement %" 0 --> 100
    bar [58, 91, 89]

Case	Baseline (s)	HashMap Reuse (s)	Improvement
3a	2.60	1.11	58%
3b	36.96	3.45	91%
3c	38.74	4.49	89%

3) Batch Execution + Join CSE

xychart-beta
    title "Batch Execution Improvement (%)"
    x-axis ["4a", "4b", "4c"]
    y-axis "Improvement %" 0 --> 100
    bar [59, 66, 67]

Case	Baseline (s, sequential)	Optimized (s, batch)	Improvement
4a	2.25 + 3.32	2.30	59%
4b	2.75 + 6.03	2.97	66%
4c	6.33 + 14.68	6.95	67%

Overall Gain Profile

pie showData
    title Average Improvement by Optimization Type
    "Join CSE" : 60
    "HashMap Reuse" : 79
    "Batch (Join CSE)" : 64

Why This Matters

The optimization is integrated as compiler passes, not a one-off executor hack.
Reuse is done on hash structures, which avoids many extra materialization costs.
Multi-statement compilation increases optimization visibility and can amplify gains.

Try LingoDB

You can still use LingoDB via:

Hosted SQL web interface
Python package: pip install lingodb
Docker image
Build from source: official docs

Documentation

Official documentation: lingo-db docs
Docs repository: github.com/lingo-db/docs

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
images		images
src		src
BSc_Thesis_LingoDB-CSE_William.pdf		BSc_Thesis_LingoDB-CSE_William.pdf
LICENSE		LICENSE
Readme.md		Readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LingoDB-CSE

Overview

Thesis Context

Case Study

Case 1: Common Join Reuse

Case 2: Structural Simplification

Case 3: HashMap Reuse

Case 4: Batch-Aware Optimization

Implementation Highlights

Experimental Results

1) Join CSE (single-query)

2) HashMap Reuse (on top of Join CSE + aligned build side)

3) Batch Execution + Join CSE

Overall Gain Profile

Why This Matters

Try LingoDB

Documentation

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LingoDB-CSE

Overview

Thesis Context

Case Study

Case 1: Common Join Reuse

Case 2: Structural Simplification

Case 3: HashMap Reuse

Case 4: Batch-Aware Optimization

Implementation Highlights

Experimental Results

1) Join CSE (single-query)

2) HashMap Reuse (on top of Join CSE + aligned build side)

3) Batch Execution + Join CSE

Overall Gain Profile

Why This Matters

Try LingoDB

Documentation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages