feat: add Jupyter and Databricks notebook parsing support #69

Open
michael-denyer wants to merge 11 commits into tirth8205:main from michael-denyer:feat/notebook-support

@michael-denyer

Summary

  • Add .ipynb (Jupyter/Databricks) notebook parsing — extracts functions, classes, imports, and calls from code cells across Python, SQL, R, and Scala kernels
  • Add Databricks .py notebook export parsing — detects # Databricks notebook source header and splits on # COMMAND ---------- markers
  • Extract SQL table references (FROM, JOIN, INTO, CREATE TABLE/VIEW) as import edges for cross-language lineage
  • Shared _parse_notebook_cells method handles multi-language cell dispatch with per-cell line offset tracking
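
The SQL table extraction in the summary could look roughly like the sketch below. The pattern and the `extract_sql_tables` name are assumptions made for illustration and are not taken from the PR's code. A keyword regex like this also over-matches in places (for example `EXTRACT(YEAR FROM col)`), so treat it as a starting point only.

```python
import re

# Hypothetical pattern: a keyword (FROM, JOIN, INTO, CREATE TABLE/VIEW),
# then a possibly dotted identifier such as catalog.schema.table.
_TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN|INTO"
    r"|CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)(?:\s+IF\s+NOT\s+EXISTS)?)\s+"
    r"([A-Za-z_][\w$]*(?:\.[A-Za-z_][\w$]*)*)",
    re.IGNORECASE,
)


def extract_sql_tables(sql: str) -> list[str]:
    """Return referenced table names in order of first appearance, deduplicated."""
    seen: dict[str, None] = {}
    for match in _TABLE_REF.finditer(sql):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Each returned name would then become the target of an import edge from the notebook, which is how the SQL cells contribute to cross-language lineage.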

Test plan

  • Jupyter .ipynb parsing with Python kernel cells
  • Databricks multi-language .ipynb with %python, %sql, %r, %scala magic commands
  • Databricks .py export format parsing
  • SQL table regex extraction tests
  • R-kernel notebook cells (xfail pending the R language PR, "feat: add R language parsing support" #43)
  • Edge cases: empty notebooks, non-code cells, malformed JSON

Extract code cells from .ipynb files, filter magic/shell commands,
concatenate with offset tracking, and parse as Python via tree-sitter.

Supports:
- Python kernel detection (phase 1)
- Magic command filtering (%pip, !ls)
- Cell index tracking in node.extra["cell_index"]
- Cross-cell function calls and imports
- Edge cases: empty notebooks, non-Python kernels, malformed JSON

Includes test fixture and 12 tests in TestNotebookParsing.
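
As a rough illustration of the extraction step described in this commit, the sketch below pulls code cells out of .ipynb JSON, blanks magic and shell lines, and records where each cell starts in the concatenated source. `CodeCell` and `extract_code_cells` are hypothetical names, not the PR's; the tree-sitter parse that follows is left out.

```python
import json
from dataclasses import dataclass


@dataclass
class CodeCell:  # hypothetical record, not the PR's CellInfo
    cell_index: int   # position of the cell in the notebook
    line_offset: int  # 0-based line where the cell starts in the combined source
    source: str


def extract_code_cells(notebook_json: str) -> tuple[str, list[CodeCell]]:
    """Return (combined_source, cells); malformed JSON yields an empty result."""
    try:
        nb = json.loads(notebook_json)
    except json.JSONDecodeError:
        return "", []
    if not isinstance(nb, dict):
        return "", []

    cells: list[CodeCell] = []
    chunks: list[str] = []
    offset = 0
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        raw = cell.get("source", "")
        text = "".join(raw) if isinstance(raw, list) else raw
        # Blank out %magic and !shell lines instead of deleting them, so line
        # numbers inside the cell stay stable. (Simplification: a continuation
        # line that begins with "%" would be blanked too.)
        lines = ["" if ln.lstrip().startswith(("%", "!")) else ln
                 for ln in text.splitlines()]
        if not any(ln.strip() for ln in lines):
            continue  # nothing left to parse in this cell
        source = "\n".join(lines)
        cells.append(CodeCell(index, offset, source))
        chunks.append(source)
        offset += len(lines)
    return "\n".join(chunks), cells
```

With offsets recorded this way, a node found at line `n` of the combined source can be mapped back to the last cell whose `line_offset` is at most `n`, which is one way to populate something like `node.extra["cell_index"]`.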

Split _parse_notebook into two methods:
- _parse_notebook: extracts cells from .ipynb JSON, builds list[CellInfo],
  delegates to _parse_notebook_cells
- _parse_notebook_cells: shared method that parses cells grouped by language
  (Python/R via Tree-sitter, SQL via regex)

Also expands supported notebook languages from Python-only to Python and R.
Updates test_non_python_kernel to use an actually unsupported language (Scala)
since R is now supported.
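
The grouping step behind the shared method might be sketched as follows. The PR names `CellInfo` but this page does not show its fields, so the fields here are assumed, and `group_cells_by_language`, `parse_cells`, and the handler mapping are hypothetical stand-ins for the Tree-sitter and regex paths.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class CellInfo:  # field names are assumptions
    cell_index: int
    language: str
    source: str


def group_cells_by_language(cells: list[CellInfo]) -> dict[str, list[CellInfo]]:
    """Bucket cells by language, keeping notebook order within each bucket."""
    groups: dict[str, list[CellInfo]] = defaultdict(list)
    for cell in cells:
        groups[cell.language].append(cell)
    return dict(groups)


def parse_cells(
    cells: list[CellInfo],
    handlers: dict[str, Callable[[list[CellInfo]], list]],
) -> list:
    """Send each language group to its handler; skip unsupported languages."""
    results: list = []
    for language, group in group_cells_by_language(cells).items():
        handler = handlers.get(language)
        if handler is None:
            continue  # e.g. Scala, which this commit treats as unsupported
        results.extend(handler(group))
    return results
```

Keeping the dispatch in one place means both the .ipynb path and any other notebook format only need to produce a list of cells.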

Detect and parse Databricks-exported .py notebooks (identified by the
'# Databricks notebook source' header). Splits on COMMAND delimiters,
classifies cells by MAGIC prefix (%sql, %r, %md, %sh), and delegates
to the existing _parse_notebook_cells shared method. SQL table refs,
Python functions, cross-cell calls, and cell_index tracking all work.
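
A minimal sketch of the export splitting described above, assuming the usual export layout (header line, `# COMMAND ----------` delimiters, `# MAGIC` prefixes). `split_databricks_export` and the language labels are hypothetical. Cells without a MAGIC prefix are assumed to be Python, and line offsets are not tracked here.

```python
import re

HEADER = "# Databricks notebook source"
_DELIMITER = re.compile(r"^# COMMAND -+[ \t]*$", re.MULTILINE)
_MAGIC_PREFIX = "# MAGIC"
# Labels are illustrative; the commit lists %sql, %r, %md, and %sh.
_MAGIC_LANGS = {"%sql": "sql", "%r": "r", "%md": "markdown", "%sh": "shell"}


def _strip_magic(line: str) -> str:
    """Drop a leading '# MAGIC' and the single space that follows it."""
    if line.startswith(_MAGIC_PREFIX):
        return line[len(_MAGIC_PREFIX):].removeprefix(" ")
    return line


def split_databricks_export(text: str) -> list[tuple[str, str]]:
    """Return (language, source) per cell, or [] if the header is missing."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != HEADER:
        return []
    cells: list[tuple[str, str]] = []
    for chunk in _DELIMITER.split("\n".join(lines[1:])):
        chunk_lines = chunk.strip("\n").splitlines()
        if not chunk_lines:
            continue
        if not chunk_lines[0].startswith(_MAGIC_PREFIX):
            cells.append(("python", "\n".join(chunk_lines)))
            continue
        stripped = [_strip_magic(ln) for ln in chunk_lines]
        parts = stripped[0].split(None, 1)
        language = _MAGIC_LANGS.get(parts[0], "unknown") if parts else "unknown"
        rest = parts[1] if len(parts) > 1 else ""
        body = ([rest] if rest else []) + stripped[1:]
        cells.append((language, "\n".join(body)))
    return cells
```

Returning an empty list when the header is absent lets an ordinary .py file fall through to the regular Python parser.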
@michael-denyer force-pushed the feat/notebook-support branch from ace72bb to 2da0a49 on March 27, 2026 10:35