This document is the definitive, code-accurate map of the CellScope system. It is written so another engineer (or thesis author) can reconstruct the architecture, data flow, and control logic without reading the source.
All paths are repository-relative unless noted.
Core Python package:
- cellscope/ast_capture.py: parse notebooks, extract defs/uses, I/O, labels.
- cellscope/containerizer_adapter.py: internal static R parser (no external service).
- cellscope/cross_kernel.py: infer file handoff edges across cells.
- cellscope/serialization.py: convert capture to JSON for API/UI.
- cellscope/rocrate_io.py: build RO-Crate, copy artifacts, GraphML/HTML.
- cellscope/indexer.py: map RO-Crate JSON-LD to SPARQL INSERT DATA.
- cellscope/visualize.py: PyVis graph generation + HTML panel injection.
- cellscope/personalization.py: metadata mapping config (file field predicates).
- cellscope/utils.py: YAML/sidecar helpers.
- cellscope/validate_crate.py: minimal RO-Crate structural validation.
Server extension:
cellscope_server/handlers.py: Jupyter Server endpoints (/cellscope/*).
JupyterLab extension:
- labextension/src/index.ts: analyzer panel, filters, settings, dialogs.
- labextension/style/index.css: UI styling.
CLI:
- cellscope_cli/__main__.py: build, vis, validate subcommands.
Evaluation assets:
- evaluation/: O1/O2/O3 validation material and results.
- exports/: representative RO-Crates generated from evaluation notebooks.
High-level steps:
- Capture notebook cells -> defs/uses/file I/O.
- Infer cross-cell file handoffs (read-after-write).
- Build RO-Crate with PROV + domain hints + config files.
- Generate GraphML + PyVis HTML for visualization.
- Render SPARQL UPDATE (optional POST to endpoint).
Reference entry points:
- CLI: cellscope_cli/__main__.py -> parse_notebook -> build_rocrate -> index_crate.
- Server: cellscope_server/handlers.py -> /cellscope/analyze and /cellscope/export.
- UI: labextension/src/index.ts -> _runAnalysis() + _requestExport().
Minimal CLI flow:
from cellscope.ast_capture import parse_notebook
from cellscope.cross_kernel import infer_cross_kernel_edges
from cellscope.rocrate_io import build_rocrate
from cellscope.indexer import index_crate
capture = parse_notebook("examples/multi_kernel_demo.ipynb", collect_materialized=True)
xedges = infer_cross_kernel_edges(capture)
crate_dir = build_rocrate(capture, "out-lab/demo", xedges, hints={}, sidecars=[], config_files=[])
index_crate(crate_dir, endpoint="http://localhost:3030/cellscope/update")
Defined in cellscope/ast_capture.py.
Fields:
- idx: zero-based index among code cells.
- position: index in full notebook cell list (includes markdown).
- kernel: kernel name (from cell metadata or notebook kernelspec).
- source: raw cell source.
- label: slugified first comment line (auto-deduped).
- funcs: set of function names defined in cell.
- func_calls: set of function names called.
- var_defs: set of variable symbols defined.
- var_uses: set of variable symbols used.
- file_writes, file_reads: sets of file paths detected.
Returned by parse_notebook().
{
"nb_path": "path/to/notebook.ipynb",
"cells": [CellInfo, ...],
"graph": {
"edges": [ (u, v, {"type": "uses", "vars": {"x"}, "via": "ast"}), ... ]
}
}
Produced by cellscope.serialization.capture_to_json().
{
"nb_path": "...",
"cells": [
{
"idx": 0,
"position": 3,
"notebook": "...",
"label": "climate_input",
"name": "climate_input",
"kernel": "python3",
"funcs": ["compute_stats"],
"func_calls": ["read_csv"],
"var_defs": ["df"],
"var_uses": ["threshold"],
"file_writes": ["out/summary.json"],
"file_reads": ["data/input.csv"]
}
],
"edges": [
{"source": 0, "target": 1, "type": "uses", "vars": ["df"], "via": "ast"}
]
}
Produced by the UI review dialog and sent to /cellscope/export.
{
"roles": {
"threshold": "parameter",
"df": "dataset"
},
"domains": {
"climate_readings.csv": {
"encodingFormat": "text/csv",
"keywords": ["climate", "sensor"],
"accessURL": "https://example.org/...",
"etag": "W/\"abc\"",
"retrievedAt": "2025-01-20T10:00:00Z",
"dateModified": "2025-01-15T12:00:00Z"
}
}
}
Stored under cellscope:config in the JupyterLab extension.
{
"endpoint": "http://localhost:3030/cellscope/update",
"token": "...",
"username": "...",
"password": "...",
"retries": 2,
"backoffSeconds": 1.5,
"outputPath": "",
"dataSource": "local" | "sparql",
"configFiles": ["requirements.txt", "pyproject.toml"]
}
Key steps in parse_notebook():
- Reads the notebook with nbformat.read(..., as_version=4).
- Kernel for a cell is cell.metadata.kernel if available, else notebook kernelspec name.
- Labels are derived from the first non-empty comment in the cell; duplicates are disambiguated by suffixing _2, _3, etc.
- Python AST parsing removes magics and shell escapes before ast.parse.
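The labelling and magic-stripping steps above can be sketched as follows. This is a minimal illustration, not the code in cellscope/ast_capture.py: the helper names (strip_magics, first_comment, slugify, make_labels), the exact slug rule, and the cell_<idx> fallback for cells without a comment are all assumptions.

```python
import ast
import re


def strip_magics(source: str) -> str:
    """Replace IPython magics (%, %%) and shell escapes (!) with `pass`."""
    cleaned = []
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(("%", "!")):
            indent = line[: len(line) - len(stripped)]
            # `pass` keeps line numbers and indented blocks valid for ast.parse.
            cleaned.append(indent + "pass")
        else:
            cleaned.append(line)
    return "\n".join(cleaned)


def first_comment(source: str):
    """Return the text of the first non-empty `#` comment line, or None."""
    for line in source.splitlines():
        text = line.strip()
        if text.startswith("#") and text.lstrip("#").strip():
            return text.lstrip("#").strip()
    return None


def slugify(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into underscores (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def make_labels(sources):
    """Derive one label per cell; repeats get _2, _3, ... suffixes."""
    counts, labels = {}, []
    for idx, source in enumerate(sources):
        comment = first_comment(source)
        base = slugify(comment) if comment else ""
        base = base or f"cell_{idx}"  # hypothetical fallback
        counts[base] = counts.get(base, 0) + 1
        labels.append(base if counts[base] == 1 else f"{base}_{counts[base]}")
    return labels
```

A cell whose first comment is "# Climate input" would be labelled climate_input under this rule, and a second cell with the same comment climate_input_2.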
Def/Use heuristics:
- Definitions include assignment targets, augmented assigns, annotated assigns, for and with targets, exception handler names, walrus assignments, and comprehension targets.
- Uses are ast.Name nodes in Load context minus defs in the same cell.
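A compact way to approximate these def/use rules with the standard ast module is shown here. It is a sketch under the assumption that every Name in Store context counts as a definition; the function name defs_and_uses is hypothetical and the real capture code may handle more cases (imports, function parameters, attribute targets).

```python
import ast


def defs_and_uses(source: str):
    """Return (defs, uses) for one cell of Python source."""
    defs, loads = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Store context covers plain, augmented and annotated assignment,
            # for/with targets, walrus targets and comprehension targets.
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loads.add(node.id)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            # `except ValueError as err` binds `err` as a plain string name.
            defs.add(node.name)
    # Uses exclude anything defined in the same cell.
    return defs, loads - defs
```

Note that, in this sketch, called function names and builtins also appear among the uses because they are ordinary Name nodes in Load context.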
File I/O heuristics:
- Maintains a mini env map for literal path assignments in the same cell.
- Resolves paths from literals, simple string concatenation, os.path.join, and Path(...).
- Recognizes reads and writes by common method names (read_csv, to_parquet, open_dataset, open, etc.).
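The path-resolution idea can be illustrated with a small static resolver. This is a simplified sketch, not the project's implementation: READ_CALLS and WRITE_CALLS contain only a few of the method names mentioned above, open is omitted because its direction depends on the mode argument, and the helper names are hypothetical.

```python
import ast
import posixpath

READ_CALLS = {"read_csv", "open_dataset"}  # illustrative subset
WRITE_CALLS = {"to_parquet"}               # illustrative subset


def call_name(func):
    """Return the trailing name of a call target (pd.read_csv -> read_csv)."""
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def resolve(node, env):
    """Return a statically known path string for node, or None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return env.get(node.id)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = resolve(node.left, env), resolve(node.right, env)
        return left + right if left is not None and right is not None else None
    if isinstance(node, ast.Call) and call_name(node.func) in {"join", "Path"}:
        parts = [resolve(arg, env) for arg in node.args]
        if parts and all(part is not None for part in parts):
            return posixpath.join(*parts)
    return None


def file_io(source: str):
    """Return (reads, writes) detected in one cell."""
    env, reads, writes = {}, set(), set()
    for stmt in ast.parse(source).body:
        # Record `name = <resolvable path>` assignments in the mini env map.
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            value = resolve(stmt.value, env)
            if value is not None:
                env[stmt.targets[0].id] = value
        for node in ast.walk(stmt):
            if isinstance(node, ast.Call) and node.args:
                path = resolve(node.args[0], env)
                if path is None:
                    continue
                name = call_name(node.func)
                if name in READ_CALLS:
                    reads.add(path)
                elif name in WRITE_CALLS:
                    writes.add(path)
    return reads, writes
```

Paths that depend on values from other cells or on runtime state resolve to None and are simply skipped, which matches the under-approximation the heuristics accept.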
Alias normalization:
- Optional alias_map (YAML or dict) rewrites variable/function names so equivalent symbols unify in the graph.
Edge creation:
- Uses a last_def mapping. For each use of v, if a prior definition exists, add edge (last_def[v] -> current_cell) with vars={v}.
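The last_def rule can be sketched in a few lines. The example assumes each cell is a plain dict with var_defs and var_uses sets and emits one edge per variable; whether the real code merges several variables into a single edge is not stated here, and build_edges is a hypothetical name.

```python
def build_edges(cells):
    """Link each use of a variable to the most recent earlier cell defining it."""
    last_def = {}  # variable name -> index of the last cell that defined it
    edges = []
    for idx, cell in enumerate(cells):
        # Uses are resolved before this cell's own defs are recorded,
        # so an edge always points from an earlier cell.
        for var in sorted(cell["var_uses"]):
            if var in last_def:
                edges.append(
                    (last_def[var], idx, {"type": "uses", "vars": {var}, "via": "ast"})
                )
        for var in cell["var_defs"]:
            last_def[var] = idx
    return edges
```

Because last_def is overwritten on redefinition, a later use of df links to the most recent cell that assigned it, not to the first one.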
A built-in R parser replaces any external containerizer dependency. It is regex-based and static (no execution):
- Definitions from <-, <<-, =, and right assignment (->, ->>).
- Uses from identifier tokens minus defs, keywords, member access, and package prefix (pkg::fun).
- Function defs: name <- function(...).
- Function calls: name(...) minus defs/keywords, intersected with uses.
- File I/O: common read/write calls (read.csv, readRDS, write.csv, saveRDS, download.file, etc.) and named args like file, path, url.
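To give a feel for the regex-based approach, the definition rules above might be approximated like this. The patterns are illustrative assumptions rather than the ones in cellscope/containerizer_adapter.py: they only catch assignments that start or end a line, and the naive comment stripping would also truncate string literals containing #.

```python
import re

IDENT = r"[A-Za-z.][A-Za-z0-9._]*"
# name <- ..., name <<- ..., name = ... at the start of a line (not ==).
LEFT_ASSIGN = re.compile(rf"^\s*({IDENT})\s*(?:<<-|<-|=)(?!=)", re.MULTILINE)
# ... -> name or ... ->> name at the end of a line.
RIGHT_ASSIGN = re.compile(rf"->>?\s*({IDENT})\s*$", re.MULTILINE)
# name <- function(...)
FUNC_DEF = re.compile(rf"({IDENT})\s*(?:<-|=)\s*function\s*\(")


def r_defs(source: str):
    """Return (defined symbols, defined function names) for one R cell."""
    code = re.sub(r"#.*", "", source)  # drop comments (naive)
    defs = set(LEFT_ASSIGN.findall(code)) | set(RIGHT_ASSIGN.findall(code))
    funcs = set(FUNC_DEF.findall(code))
    return defs, funcs
```

Anchoring the left-assignment pattern to the line start keeps named arguments such as file = "x.csv" inside a call from being mistaken for definitions.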
infer_cross_kernel_edges() links cells when a later cell reads a file that
an earlier cell wrote.
- Edge data: {type: "uses", vars: {basename}, via: "file", file: full_path}.
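The read-after-write rule can be sketched as a single pass over the cells. This assumes cells are dicts with file_reads and file_writes sets and that paths match by exact string; any path normalization done by infer_cross_kernel_edges() is not modelled, and infer_file_edges is a hypothetical name.

```python
import posixpath


def infer_file_edges(cells):
    """Link a cell that reads a file to the latest earlier cell that wrote it."""
    last_writer = {}  # file path -> index of the last cell that wrote it
    edges = []
    for idx, cell in enumerate(cells):
        for path in sorted(cell["file_reads"]):
            writer = last_writer.get(path)
            if writer is not None:
                edges.append((writer, idx, {
                    "type": "uses",
                    "vars": {posixpath.basename(path)},
                    "via": "file",
                    "file": path,
                }))
        # Record writes after reads so a cell never links to itself.
        for path in cell["file_writes"]:
            last_writer[path] = idx
    return edges
```

Since only file paths are compared, the writer and reader can run in different kernels, for example a Python cell writing out/summary.json and an R cell reading it.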
capture_to_json() flattens CellInfo into the UI/API contract, ensuring
set fields are sorted and edges are JSON-safe.
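The flattening step can be pictured with the following sketch. It assumes cells are already plain dicts (the real CellInfo may be a class or dataclass) and uses the hypothetical name to_json_safe; only the set-to-sorted-list conversion and the edge reshaping are shown.

```python
import json


def _plain(value):
    """Turn sets into sorted lists; leave other values untouched."""
    return sorted(value) if isinstance(value, (set, frozenset)) else value


def to_json_safe(capture):
    """Convert a capture dict into the flat, JSON-serializable shape."""
    cells = [{key: _plain(val) for key, val in cell.items()} for cell in capture["cells"]]
    edges = [
        {"source": u, "target": v, **{key: _plain(val) for key, val in data.items()}}
        for u, v, data in capture["graph"]["edges"]
    ]
    return {"nb_path": capture["nb_path"], "cells": cells, "edges": edges}
```

Sorting the set fields makes the output deterministic, so two analyses of the same notebook produce byte-identical JSON.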
Each export builds:
<out_dir>/ro-crate/
  ro-crate-metadata.json
  cell_graph.graphml
  cell_graph.html   # if pyvis is installed
  cells/
    cell_0.py
    cell_1.R
    ...
  files/
    <data artifacts>
  env/
    requirements.txt
    pyproject.toml
  index/
    last_update.sparql
Each code cell is written to cells/cell_<idx>.<ext>:
- Extension rules: .R for R kernels (ir, r-, r), .py for Python, otherwise .txt.
- RO-Crate entity: @type = ["File", "ontoflow:Activity"].
- Properties include: name, kernel, programmingLanguage, position, version, codeSnippet (first CELLSCOPE_SNIPPET_LINES, default 25).
- Optional properties populated from hints: roles, fileHints, funcCalls.
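The extension and snippet rules above might look like this in code. The kernel matching (exact ir or r, or an r- prefix, and any name containing "python") is one reading of the rule and should be treated as an assumption; cell_extension and code_snippet are hypothetical helper names.

```python
import os


def cell_extension(kernel: str) -> str:
    """Pick the file extension for a cell from its kernel name."""
    name = (kernel or "").lower()
    if name in {"ir", "r"} or name.startswith("r-"):
        return ".R"
    if "python" in name:
        return ".py"
    return ".txt"


def code_snippet(source: str, default_lines: int = 25) -> str:
    """Return the first CELLSCOPE_SNIPPET_LINES lines of a cell (default 25)."""
    limit = int(os.environ.get("CELLSCOPE_SNIPPET_LINES", default_lines))
    return "\n".join(source.splitlines()[:limit])
```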
Variables become #var-<name> context entities:
- @type = ontodt:Data for data symbols.
- @type = ontodt:Symbol if the symbol is in the set of function defs.
Edges are added both ways:
- Definitions: Activity -> oflow:hasOutput and Variable -> prov:wasGeneratedBy.
- Uses: Activity -> oflow:hasInput and Activity -> prov:used.
For each file_writes or file_reads entry:
- Resolve to a local path relative to the notebook if possible.
- If the path is a URL, create a File entity with accessURL and optional metadata from HEAD requests.
- If a local file exists, copy into files/ and compute blake2b hash.
- If the local file does not exist, still create a logical File entity and attach the original path via cellscope:localPath.
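The two local-file branches (copy and hash, or logical entity only) can be sketched as below. The entity keys other than cellscope:localPath, in particular the blake2b key and the #file-<name> identifier, are placeholders chosen for this example, and add_file_entity is a hypothetical name; URL handling is left out.

```python
import hashlib
import shutil
from pathlib import Path


def add_file_entity(src, crate_dir):
    """Copy a local file into <crate>/files/ and describe it as a File entity."""
    src = Path(src)
    entity = {"@type": "File", "name": src.name}
    if src.is_file():
        dest = Path(crate_dir) / "files" / src.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        digest = hashlib.blake2b()
        with open(dest, "rb") as handle:
            # Hash in chunks so large artifacts are not loaded into memory.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        entity["@id"] = f"files/{src.name}"
        entity["blake2b"] = digest.hexdigest()  # placeholder property name
    else:
        # Missing file: keep a logical entity that remembers the original path.
        entity["@id"] = f"#file-{src.name}"  # placeholder identifier
        entity["cellscope:localPath"] = str(src)
    return entity
```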
Environment/config files follow the same strategy and are stored under env/.
Remote file support (opt-in):
- CELLSCOPE_FETCH_REMOTE_METADATA=1 enables HEAD requests to fill etag and dateModified (no download).
- CELLSCOPE_FETCH_REMOTE_ARTIFACTS=1 downloads remote artifacts into the crate.
- CELLSCOPE_REMOTE_MAX_BYTES caps download size.
Config files are parsed into softwareRequirements entries:
- requirements.txt / requirements.in (pip format).
- environment.yml / .yaml (conda dependencies).
- pyproject.toml (PEP 621 dependencies).
- Pipfile.lock (JSON lockfile).
Each dependency becomes a SoftwareApplication entity linked to the root dataset.
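For the pip format, the mapping from a requirements line to an entity could be sketched as follows. Only simple name/specifier lines are handled; the #software-<name> identifier and the use of softwareVersion for the raw specifier are assumptions made for this example, and parse_requirements is a hypothetical name.

```python
import re

# Package name, optional [extras], then whatever version specifier follows.
NAME = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?\s*(.*)$")


def parse_requirements(text: str):
    """Turn pip-style requirement lines into SoftwareApplication entities."""
    entities = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):  # skip blanks and options like -r
            continue
        line = line.split(";", 1)[0].strip()  # drop environment markers
        match = NAME.match(line)
        if not match:
            continue
        name, spec = match.group(1), match.group(2).strip()
        entity = {
            "@id": f"#software-{name.lower()}",  # assumed identifier scheme
            "@type": "SoftwareApplication",
            "name": name,
        }
        if spec:
            entity["softwareVersion"] = spec  # assumed property for the specifier
        entities.append(entity)
    return entities
```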
- GraphML is generated with NetworkX; nodes are cells, edges carry label (vars), via, and type.
- If PyVis is available, visualize_rocrate() generates HTML and injects the hover/click panel (_inject_roshow_panel).
- Default graph URI: https://cellscope.local/graph/<slug>?v=<n>.
- <slug> is derived from notebook stem; <n> is counted from sibling crates.
- Indexing drops the graph before re-inserting to avoid duplicates.
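The drop-then-insert pattern corresponds to a two-operation SPARQL 1.1 Update request, which can be rendered as in this sketch. The helper names graph_uri and render_update are hypothetical, the triples are assumed to be already serialized terms, and the use of DROP SILENT (which does not fail when the graph is absent) is an assumption about how the drop is issued.

```python
def graph_uri(slug: str, version: int) -> str:
    """Build the default named-graph URI for a notebook slug and version."""
    return f"https://cellscope.local/graph/{slug}?v={version}"


def render_update(uri: str, triples) -> str:
    """Render a SPARQL Update that replaces the contents of one named graph."""
    lines = [f"    {s} {p} {o} ." for s, p, o in triples]
    return (
        f"DROP SILENT GRAPH <{uri}> ;\n"
        f"INSERT DATA {{\n  GRAPH <{uri}> {{\n"
        + "\n".join(lines)
        + "\n  }\n}\n"
    )
```

Sending both operations in one request keeps re-indexing idempotent: running it twice leaves the same triples in the graph.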
The indexer walks ro-crate-metadata.json and emits:
- rdf:type for each entity type.
- schema:name, schema:version, schema:position, schema:programmingLanguage.
- prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom, prov:wasRevisionOf.
- File metadata: schema:encodingFormat, schema:keywords, schema:identifier (etag), schema:dateModified, prov:generatedAtTime, dcat:accessURL.
- Custom fields from CELLSCOPE_METADATA_CONFIG (file fields only).
- cellscope:localPath, cellscope:fileHints, cellscope:funcCalls.
- Roles: activity schema:roles plus variable schema:roleName when role strings are "var: role".
Indexing supports:
- CELLSCOPE_SPARQL_ENDPOINT
- CELLSCOPE_SPARQL_TOKEN
- CELLSCOPE_SPARQL_USER / CELLSCOPE_SPARQL_PASSWORD
- CELLSCOPE_SPARQL_OUTPUT
- CELLSCOPE_SPARQL_RETRIES, CELLSCOPE_SPARQL_BACKOFF, CELLSCOPE_SPARQL_TIMEOUT
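How the retry, backoff and timeout settings could interact when posting an update is sketched here with the standard library only. Everything beyond the setting names is assumed: post_update is a hypothetical function, the bearer-token header and the linearly growing delay are illustrative choices, and the opener and sleep parameters exist only to make the sketch testable without a live endpoint.

```python
import time
import urllib.error
import urllib.request


def post_update(update, endpoint, token=None, retries=2, backoff=1.5,
                timeout=30.0, opener=urllib.request.urlopen, sleep=time.sleep):
    """POST a SPARQL Update, retrying up to `retries` extra times on failure."""
    headers = {"Content-Type": "application/sparql-update"}
    if token:
        headers["Authorization"] = f"Bearer {token}"  # assumed auth scheme
    request = urllib.request.Request(
        endpoint, data=update.encode("utf-8"), headers=headers, method="POST"
    )
    attempts = 0
    while True:
        attempts += 1
        try:
            with opener(request, timeout=timeout) as response:
                return {"status": response.status, "attempts": attempts}
        except urllib.error.URLError:  # also covers HTTPError
            if attempts > retries:
                raise
            sleep(backoff * attempts)  # assumed linear backoff
```

With retries=2 the request is attempted at most three times, and the returned dict mirrors the status and attempts fields seen in the export response.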
- Uses PyVis with ForceAtlas2 physics.
- Adds a group node for the notebook (dot) and box nodes for cells.
- Each node stores snippet and meta fields used by the panel injection.
- _inject_roshow_panel() appends a floating HTML panel that shows:
  - Code snippet
  - Metadata list
  - Edge relation and via on edge click
The SPARQL graph handler (/cellscope/sparql_graph) uses the same panel injection
so local and SPARQL graph views match.
Endpoints:
- POST /cellscope/analyze
- POST /cellscope/export
- POST /cellscope/export_cached
- POST /cellscope/index
- POST /cellscope/sparql_summary
- POST /cellscope/sparql_graph
Request (POST /cellscope/analyze):
{"notebook": "path/to/notebook.ipynb", "aliases": {"aliases": {"a":"b"}}}
Response:
{"graph": {"nb_path": "...", "cells": [...], "edges": [...]}}
Request (POST /cellscope/export):
{
"notebook": "...",
"out_dir": "out-lab/123",
"hints": {"roles": {...}, "domains": {...}},
"config_files": ["requirements.txt"],
"index": {"endpoint": "http://...", "retries": 2}
}
Response:
{
"crate": "out-lab/123/ro-crate",
"index": {"triples": 123, "status": 200, "attempts": 1}
}
/cellscope/export_cached copies a previously built crate to a new output
folder. The UI uses this when a single-notebook analysis already generated
a crate in out-lab/.analysis-cache.
/cellscope/sparql_summary runs a SPARQL query to list graphs and pull only
a small predicate subset, then rebuilds a graph summary for the UI.
The handler normalizes:
- cell names, kernel, position, version
- defs/uses, file reads/writes, roles
- file metadata tokens (encodingFormat, keywords, accessURL, localPath)
Cross-notebook edges are inferred by shared file basenames.
/cellscope/sparql_graph renders a PyVis HTML graph from the SPARQL summary.
The HTML is written under out-lab/sparql_<ts>/ro-crate/cell_graph.html.
The plugin registers commands:
- cellscope:open-list (analyzer panel)
- cellscope:open-graph (graph view)
The panel lives in the left sidebar and contains:
- header with Analyze / Export / Open Graph
- status and pending banners
- filter + results sections
- export summary
_analyze() -> _promptNotebookSelection() -> _runAnalysis("manual", notebooks).
Notebook selection is two-stage:
- Folder picker: select root or specific folders to scan.
- Notebook picker: choose notebooks from recursive scan.
Notebook scanning uses contents.get(path, {content: true}) and skips:
.git, .venv, node_modules, __pycache__, .ipynb_checkpoints.
- Manual: combines selected notebooks, shows review dialog, and (optionally) writes analysis crates to out-lab/.analysis-cache for export reuse.
- Auto: triggered on save/execution; debounced (400-1000ms) and uses the currently open notebook only.
When dataSource = "sparql", auto analysis pulls from the SPARQL endpoint
instead of local parsing.
The review dialog builds a draft from the combined graph:
- Variables: from var_defs across all cells.
- Files: union of file_reads + file_writes basenames.
Users can edit:
- Variable roles (string labels).
- File metadata fields: encodingFormat, keywords, accessURL, etag, retrievedAt, dateModified.
Hints are stored in localStorage per notebook:
cellscope:hints:<encoded notebook path>.
- Export uses the last analysis + last review; if missing, export is blocked.
- For single-notebook analysis with a cached crate, /export_cached is used.
- For multiple notebooks, each is exported to out-lab/<ts>-<slug>.
- When dataSource is sparql and an endpoint is configured, indexing is enabled.
- Filter state is stored globally: cellscope:filters:global.
- Filter dropdown includes kernel, roles, file metadata tokens, edge via, and read/write toggles.
- Search terms are highlighted in list results; exact object matches are pinned at the top with a summary of defs/uses and file paths.
- Filter changes emit cellscope:filters-changed with the serialized filter state plus filteredCells and filteredEdges counts.
Settings are stored in cellscope:config:
- endpoint, auth token or basic auth, retries, backoff, output path
- data source (local vs sparql)
- env/config files to bundle (requirements.txt, pyproject.toml, etc.)
Commands:
- build <notebook>: parse, build crate, optionally index.
- vis <crate>: generate HTML graph for existing crate.
- validate <crate>: run structural checks on RO-Crate JSON-LD.
Key flags for build:
- --aliases: YAML map of equivalent variable names.
- --hints: YAML file for roles/domains.
- --sidecars: JSON sidecar entities.
- --config-file: env/config file to include (repeatable).
- --no-index: skip SPARQL delta.
Environment variables:
- CELLSCOPE_SPARQL_ENDPOINT, CELLSCOPE_SPARQL_TOKEN, CELLSCOPE_SPARQL_USER, CELLSCOPE_SPARQL_PASSWORD
- CELLSCOPE_SPARQL_OUTPUT, CELLSCOPE_SPARQL_RETRIES, CELLSCOPE_SPARQL_BACKOFF, CELLSCOPE_SPARQL_TIMEOUT
- CELLSCOPE_METADATA_CONFIG (JSON config for file field -> predicate mapping)
- CELLSCOPE_SNIPPET_LINES (code snippet length)
- CELLSCOPE_FETCH_REMOTE_METADATA (HEAD remote artifacts)
- CELLSCOPE_FETCH_REMOTE_ARTIFACTS + CELLSCOPE_REMOTE_MAX_BYTES
- Static analysis only; dynamic runtime behavior is not captured.
- Variable-driven file paths are under-approximated (unless literals or simple joins are used in a single cell).
- R parsing is regex-based and best-effort; unusual constructs may be missed.
- Cross-notebook links in SPARQL mode are inferred by basename, not by full path.
- File metadata hints in the review dialog apply to basenames, not full paths.
For full extension recipes, see PERSONALIZATION.md. Key extension points:
- Capture rules: cellscope/ast_capture.py, cellscope/containerizer_adapter.py.
- Metadata fields: review dialog + cellscope/rocrate_io.py + cellscope/indexer.py.
- SPARQL projection: cellscope/indexer.py.
- UI filters: labextension/src/index.ts.
- Graph styling: cellscope/visualize.py.