code-not-text

Cross-Domain Limits of Hand-Crafted CoT-Surface Features:
Strong in Math, Narrow in Science, Weak in Coding

Hand-crafted CoT-surface features transfer well to math, only partly to science, and poorly to coding correctness on unseen problems.

Interactive demo: chi-shan0707.github.io/code-not-text/demo

Core Finding · Five Methods · Results · Repository Structure · Reproduce · Citation

Core Finding

Can cheap summaries of chain-of-thought (CoT) traces predict whether a solution is correct? We test a single feature family—hand-crafted CoT-surface and token-trajectory features—on DeepSeek-R1-0528-Qwen3-8B across three domains: math, science, and coding.

It depends on the domain. In math, the signal is strong and emerges early. In science, the signal is usable but narrower and largely confidence-driven. In coding, the signal is weak and does not transport cleanly to unseen problems.

Domain	AoA (AUC-of-AUROC)	AUROC@100%	Reranking Δ pass@1
Math	0.958 [0.931, 0.980]	0.982	+10.0%
Science	0.799 [0.775, 0.822]	0.841	+8.0%
Coding	0.434 [0.404, 0.464]	0.407	−0.6%

Key insight: These features measure different underlying constructs across domains. In math, they capture "failure to converge" in reasoning. In coding, where correctness requires execution, the same features measure exploratory effort—a mismatch between the measurement instrument and the target variable.

Five Methods, One Ceiling

#	Method	Best coding result	Verdict
1	Interpretable probe (LogReg + SVD)	AoA 0.434 / AUROC@100 0.407	Below baseline
2	Broad feature sweep (83 scalars)	AUROC 0.556 (top-4 combo)	Weak only
3	Nonlinear MLP classifier	AUROC 0.499–0.507	No meaningful gain
4	SSL pre-training (42K unlabeled runs)	0.537 at lf=5%; 0.454 at lf=100%	Low-label regularizer only
5	Token-level de-knotting ablation	AUROC Δ = +0.006	Neutral

Why the convergence matters. Five independent approaches—interpretable probes, feature engineering, nonlinear models, self-supervised learning, and surface-noise removal—all reach the same ceiling for coding. This rules out several plausible failure modes:

Not model capacity: MLP classifiers don't escape the ceiling
Not label scarcity: SSL on 42K unlabeled traces still fails at full labels
Not surface noise: De-knotting doesn't recover signal
Not feature engineering: 83 coding-specific scalars don't close the gap

The evidence supports a measurement explanation: code correctness depends on executable state evolution, which is not captured by token-trajectory summaries.

Results

All quantitative claims reproduce from results/tables/. Key tables:

File	Contents
`results/tables/bon_reranking_domain_pass1_ci.csv`	Best-of-N=64 pass@1 with bootstrap 95% CI
`results/tables/deknot_alldomains_v2.csv`	De-knotting ablation: all 3 domains
`results/tables/glm_knot_findings_v4.md`	Knot annotation summary (math/science/coding)
`results/tables/aoa_bootstrap_ci.csv`	AoA bootstrap CI by domain
`results/tables/coding_feature_family_ablation.csv`	Grouped feature ablation results
`results/tables/cot_run_judge_scores.csv`	CoT-only run judge OOF predictions

Cross-domain feature importance

Permutation importance reveals different semantics across domains:

Math: traj_reflection_count (0.242), traj_continuity (0.092), traj_novelty (0.052) — trajectory-shaped signal
Science: tok_gini_tail (0.038), tok_conf_recency (0.033), tok_conf_prefix (0.033) — confidence-heavy signal
Coding: No strong, stable feature importance pattern

SSL results

Self-supervised pre-training shows domain-specific effects:

Domain	no_svd_lr (lf=5%)	ssl_v2_r24 (lf=5%)	Δ	ssl_v2_r24 (lf=100%)
Math	0.898	0.896	−0.2pp	0.944
Science	0.607	0.674	+6.7pp	0.762
Coding	0.488	0.537	+4.9pp	0.454 ← ceiling persists

SSL helps in low-label regimes but doesn't change the full-label ceiling.

Repository Structure

code-not-text/
├── paper/
│   ├── paper_v13.tex           # Technical report source (cross-domain analysis)
│   └── paper_v13.pdf           # Compiled PDF
├── figures/
│   ├── fig_auroc_by_anchor_v12_5.pdf   # Fig 1: AUROC curves + AoA bar chart
│   ├── fig_ssl_ceiling_v12_5.pdf       # Fig 2: SSL ceiling — coding vs science
│   ├── fig_reranking_v12_5.pdf         # Fig 3: Best-of-N reranking pass@1
│   ├── fig_deknot_alldomains_v12_5.pdf # Fig 4: De-knotting ablation (all 3 domains)
│   └── fig_knot_domain_profiles_v3.pdf  # Legacy: feature gap in break-positive subsets
├── scripts/
│   ├── gen_figures_v12_5.py            # Generate all 4 figures
│   ├── deknot_alldomains_v2.py         # De-knotting experiment v2 (all domains)
│   ├── deknot_coding_experiment.py     # De-knotting experiment v1 (coding only)
│   ├── analyze_glm_knot_v4.py          # Cross-domain knot analysis
│   ├── compare_glm_knot_v4_domains.py  # Domain comparison utilities
│   ├── run_glm_*_knot_labeling*.py     # GLM knot annotation pipelines
│   └── knot_glm_common.py              # Shared utilities
├── results/
│   └── tables/                          # All CSV/MD result tables
├── temp/                               # Old drafts, review notes (gitignored)
├── .gitignore
├── LICENSE                             # Apache 2.0
└── README.md

Reproduce

Figures (requires local data paths)

cd workshop/cotknot
python scripts/gen_figures_v12_5.py
# Outputs: figures/fig_*_v12_5.pdf

De-knotting experiment (requires GLM API key + NAD cache)

# v2: all domains, proper token-level masking
cd /path/to/NAD_Next && source .venv/bin/activate
python /path/to/scripts/deknot_alldomains_v2.py
# Results: results/tables/deknot_alldomains_v2.csv

Compile technical report

cd paper
pdflatex paper_v13.tex && pdflatex paper_v13.tex

Study Summary

Model: DeepSeek-R1-0528-Qwen3-8B Data: AIME24/25 + BRUMO25 + HMMT25 (math), GPQA (science), LiveCodeBench-v5 (coding) Features: tok_conf, tok_gini, tok_logprob, tok_neg_entropy, tok_selfcert (per-token) + traj_continuity, traj_novelty, traj_reflection_count (trajectory) + nc_mean (neuron)

Central claim: Within this single-model study, cheap hand-crafted CoT-surface features are domain-specific measurement tools. They work very well in math, partly in science, and poorly in coding.

What this does NOT claim: This does not rule out raw-text classifiers, code-aware models, execution-aware selectors, hidden-state probes, or LLM judges. A more likely explanation is that coding needs different signals, such as execution-aware ones.

License

Apache 2.0. See LICENSE.

Citation

@misc{chi2026codenottext,
  title   = {Cross-Domain Limits of Hand-Crafted CoT-Surface Features:
             Strong in Math, Narrow in Science, Weak in Coding},
  author  = {Yuhan Chi},
  year    = {2026},
  note    = {Technical report. Code: https://github.com/Chi-Shan0707/code-not-text}
}

Contact

Yuhan Chi · Fudan University · yhchi25@m.fudan.edu.cn

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

code-not-text

Cross-Domain Limits of Hand-Crafted CoT-Surface Features:
Strong in Math, Narrow in Science, Weak in Coding

Core Finding

Five Methods, One Ceiling

Results

Cross-domain feature importance

SSL results

Repository Structure

Reproduce

Figures (requires local data paths)

De-knotting experiment (requires GLM API key + NAD cache)

Compile technical report

Study Summary

License

Citation

Contact

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
demo		demo
figures		figures
project_report		project_report
results		results
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

code-not-text

Cross-Domain Limits of Hand-Crafted CoT-Surface Features:Strong in Math, Narrow in Science, Weak in Coding

Core Finding

Five Methods, One Ceiling

Results

Cross-domain feature importance

SSL results

Repository Structure

Reproduce

Figures (requires local data paths)

De-knotting experiment (requires GLM API key + NAD cache)

Compile technical report

Study Summary

License

Citation

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Cross-Domain Limits of Hand-Crafted CoT-Surface Features:
Strong in Math, Narrow in Science, Weak in Coding

Packages