Validate language auto-detection accuracy before defaulting to it #61

@ar7casper

Context

detect_language() exists in core/parser_adapter.py and supports a dominance heuristic (count source files by extension, return the language with the most files, after filtering common non-source dirs like node_modules, __pycache__, vendor, etc.).
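
A minimal sketch of the dominance heuristic as described above, not the actual code in core/parser_adapter.py. The extension map, the skip list, and the function names are illustrative assumptions.

```python
from collections import Counter
from pathlib import Path
from typing import Optional

# Hypothetical values; the real mapping and skip list live in the project's config.
EXTENSIONS = {".py": "python", ".js": "javascript", ".ts": "typescript", ".go": "go"}
SKIP_DIRS = {"node_modules", "__pycache__", "vendor", ".git"}


def count_source_files(root: Path) -> Counter:
    """Count source files per language, ignoring files under a skipped directory."""
    counts: Counter = Counter()
    for path in root.rglob("*"):
        parents = path.relative_to(root).parts[:-1]
        if any(part in SKIP_DIRS for part in parents):
            continue
        lang = EXTENSIONS.get(path.suffix)
        if lang and path.is_file():
            counts[lang] += 1
    return counts


def detect_language_sketch(root: Path) -> Optional[str]:
    """Return the language with the most source files, or None if there are none."""
    counts = count_source_files(root)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Note that nothing here looks at file size or directory role, which is where the concerns below come from.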

It was historically callable through the Python API (parse_repository(language="auto")), but the Go CLI required --language explicitly, which kept the heuristic off the default user path. #40 proposes removing that gate and making --language optional in openant init.

The algorithm itself is unchanged from current master — #40 moves the config to a shared config/languages.json (eliminating Go↔Python drift), adds tests for the algorithm and init flow, and drops the .git requirement, but the dominance heuristic is byte-for-byte identical.

The concern

A wrong auto-detect at openant init cascades through every subsequent command. The detected language is written to ~/.openant/projects/<name>/project.json and read by core/parser_adapter.py to dispatch to parsers/<lang>/test_pipeline.py. The user might never notice the wrong parser ran until output looks weird.
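
To make the cascade concrete, here is a sketch of the dispatch step under the path layout described above. The helper name parser_entry_for is hypothetical, and the real dispatch in core/parser_adapter.py may do more than build a path.

```python
import json
from pathlib import Path


def parser_entry_for(project_json: Path, repo_root: Path) -> Path:
    """Map the stored language string to the per-language pipeline script."""
    language = json.loads(project_json.read_text())["language"]
    # Whatever string init wrote is trusted as-is; a wrong value silently
    # selects a different parser rather than failing.
    return repo_root / "parsers" / language / "test_pipeline.py"
```

Because the language string is the only input to this step, a misdetection at init is not re-checked anywhere downstream.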

Reliability of the dominance heuristic on real-world repos isn't quantified today. By construction, the algorithm handles several edge cases poorly:

  • Polyglot repos with auxiliary languages: a Python service with a TypeScript frontend (web/), a Go backend with Python build/migration scripts (scripts/), a Rust project with vendored C bindings.
  • File count ≠ code volume: 100 small .ts declaration files vs. 50 large .py files; the algorithm picks .ts.
  • Near-ties: 50/50 splits resolve based on rglob walk order, which follows filesystem enumeration order and so can differ across platforms and filesystems.
  • No user signal: detection runs silently, no count display, no near-tie warning.
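
The file-count vs. code-volume case can be reproduced without touching the filesystem. The sketch below ranks the same hypothetical listing (100 small .ts declaration files, 50 large .py files; the byte sizes are invented for illustration) first by file count and then by total bytes.

```python
from collections import Counter

# Hypothetical repo listing: (relative path, size in bytes).
FILES = [(f"types/t{i}.d.ts", 200) for i in range(100)] + [
    (f"svc/m{i}.py", 20_000) for i in range(50)
]

EXT = {".ts": "typescript", ".py": "python"}


def rank(files, weight):
    """Rank languages by the sum of weight(size) over their files."""
    totals = Counter()
    for path, size in files:
        ext = "." + path.rsplit(".", 1)[-1]
        lang = EXT.get(ext)
        if lang:
            totals[lang] += weight(size)
    return totals.most_common()


by_count = rank(FILES, lambda size: 1)     # what the dominance heuristic sees
by_bytes = rank(FILES, lambda size: size)  # one possible volume-based alternative

print(by_count[0][0])  # typescript
print(by_bytes[0][0])  # python
```

Byte weighting is only one possible alternative signal and has its own failure modes (generated files, fixtures); it is shown here to illustrate the disagreement, not as a proposed replacement.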

What's tested today (in #40)

  • 11 unit tests on detect_language() with synthetic file trees (Python, JS, TS, Go, mixed root, skip_dirs, empty, non-git directories).
  • 8 integration tests on openant init verifying project.json["language"] matches expected for synthetic fixtures.

What's not tested

  • End-to-end correctness: init (auto-detected) → parse → verify the per-language parser actually ran and produced expected dataset shape. The cascade between "right language string" and "right parser output" is unverified.
  • Polyglot real-world fixtures: tests cover 6-TS-vs-4-Py and 7-Py-vs-3-JS at the same root, but real polyglot shapes (frontend/backend split, build-tooling, vendored deps) aren't represented.
  • Calibration corpus: no quantified accuracy on known OSS repos.

Proposed validation work (3 pieces)

Piece 1 — End-to-end fixture test

For each of the 7 supported languages, build a small but realistic fixture and run the full init → parse flow with auto-detect, asserting the per-language parser ran and produced expected dataset shape (correct unit_type values, language-specific call graph fields, expected output paths).

This is the test that connects "right language string in project.json" to "right parser output." Catches dispatch regressions and parser-side incompatibilities.

Piece 2 — Polyglot regression fixtures

4-6 fixtures matching real shapes:

  • Python service with TypeScript frontend in web/
  • Go monorepo with Python tooling in scripts/
  • TypeScript NestJS backend with migrations/*.py
  • Ruby on Rails app with embedded JS (app/javascript/)
  • C project with Python bindings
  • (etc.)

Pinned expected outputs. Catches the dominance-heuristic edge cases where the file-count-majority language isn't what a human would scan.
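
One way to pin these expectations is a table of fixture specs with the human-chosen language next to each. The sketch below uses invented file counts and a stand-in majority function rather than the real detect_language(); the fixture names and the spec layout are assumptions.

```python
from collections import Counter

# Hypothetical fixture specs: directory -> (language, file count), plus the pinned answer.
FIXTURES = {
    "py_service_ts_frontend": {
        "expected": "python",
        "files": {"app/": ("python", 12), "web/src/": ("typescript", 30)},
    },
    "go_monorepo_py_scripts": {
        "expected": "go",
        "files": {"cmd/": ("go", 25), "scripts/": ("python", 6)},
    },
}


def majority_language(files):
    """Stand-in for the dominance heuristic: the language with the most files wins."""
    counts = Counter()
    for _directory, (lang, n) in files.items():
        counts[lang] += n
    return counts.most_common(1)[0][0]


def mismatches(fixtures):
    """Names of fixtures where the file-count majority differs from the pinned answer."""
    return [
        name
        for name, fx in fixtures.items()
        if majority_language(fx["files"]) != fx["expected"]
    ]


print(mismatches(FIXTURES))  # ['py_service_ts_frontend']
```

In a real test suite each spec would be materialized as a directory tree and passed to detect_language(), with known mismatches either fixed or recorded as expected failures.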

Piece 3 — OSS calibration corpus

10-15 well-known OSS repos (Django, Express, Kubernetes, Rails, Flask, Next.js, Buf, etc.). Run auto-detect against each, compare against the language a human would obviously pick. Gives a quantified accuracy number.

If accuracy is high enough → comfortable defaulting to auto. If not → keep explicit -l required (or expose -l auto as opt-in but warn loudly).
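
The accuracy number could be computed with something as small as the sketch below. The corpus entries are placeholders, not results for any real repository, and detect is any callable that maps a repo identifier to a language string.

```python
from typing import Callable, Dict, List, Optional, Tuple

Miss = Tuple[str, str, Optional[str]]  # (repo, expected, detected)


def calibrate(
    corpus: Dict[str, str], detect: Callable[[str], Optional[str]]
) -> Tuple[float, List[Miss]]:
    """Return (accuracy, misses) for a detector over a labeled corpus."""
    misses: List[Miss] = []
    for repo, expected in corpus.items():
        got = detect(repo)
        if got != expected:
            misses.append((repo, expected, got))
    accuracy = 1 - len(misses) / len(corpus) if corpus else 0.0
    return accuracy, misses


# Placeholder labels and a stub detector, for illustration only.
corpus = {"repo-a": "python", "repo-b": "go", "repo-c": "ruby", "repo-d": "typescript"}
stub_results = {"repo-a": "python", "repo-b": "go", "repo-c": "javascript", "repo-d": "typescript"}

accuracy, misses = calibrate(corpus, stub_results.get)
print(accuracy)  # 0.75
print(misses)    # [('repo-c', 'ruby', 'javascript')]
```

Keeping the list of misses alongside the headline number matters here, since each miss is a candidate for a Piece 2 regression fixture.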

UX improvements worth considering regardless

  • Print the counts ("Detected: python (47 files), javascript (12 files)") so the user can verify before proceeding.
  • Warn on near-ties (e.g., when the runner-up is within 20% of the leader) and require -l explicitly in those cases.
  • Add a --dry-run sanity check that shows what would be detected without writing project.json.
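
A sketch of the first two items, assuming "within 20% of the leader" means the runner-up has at least 80% of the leader's file count. The function name and the margin parameter are hypothetical. Sorting by count and then by name also makes tie ordering independent of walk order.

```python
from collections import Counter
from typing import Tuple


def summarize_detection(counts: Counter, margin: float = 0.20) -> Tuple[str, bool]:
    """Return a display line and whether the top two languages are a near-tie."""
    # Sort by descending count, then by name, so ties are ordered deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    line = "Detected: " + ", ".join(f"{lang} ({n} files)" for lang, n in ranked)
    # Assumed reading of "within 20%": runner-up >= (1 - margin) * leader.
    near_tie = len(ranked) > 1 and ranked[1][1] >= (1 - margin) * ranked[0][1]
    return line, near_tie


line, near_tie = summarize_detection(Counter(python=47, javascript=12))
print(line)      # Detected: python (47 files), javascript (12 files)
print(near_tie)  # False
```

With a near-tie such as 50 Go files and 45 Python files, near_tie is True and the CLI could refuse to proceed without an explicit -l.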

Decision needed

Before flipping --language from required to optional (#40), we want some confidence the dominance heuristic actually picks the right thing on real repos. Options:

  1. Block the CLI default change in #40 (feat: auto-detect language in init) until Pieces 1 + 2 + 3 are done.
  2. Merge #40's other improvements (shared config, non-git path, tests) but keep -l required; expose -l auto as opt-in.
  3. Merge #40 as-is, accept the risk, address validation in this issue as follow-up.

Why this is a separate issue

#40 is a UX improvement that touches a real reliability question. Splitting the discussion lets #40's parser/config/non-git work merge on its own merits while we figure out the right validation bar for default-on auto-detection.

Labels: enhancement
