Generating Literature-Driven Scientific Theories at Scale

This is the repository for Theorizer, from the paper Generating Literature-Driven Scientific Theories at Scale (ACL 2026).

Abstract: Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, examining how generation using literature-grounding versus parametric knowledge, and accuracy-focused versus novelty-focused generation objectives change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better at both matching existing evidence and at predicting future results from 4.6k subsequently-written papers.

Plain Language Overview: Existing work in automated scientific discovery largely focuses on running new experiments, rather than on higher-level scientific activities like theory building. In this work we show that language model agents can be used for theory building, too. In normal usage, you provide a theory query (e.g. build theories about X), and the system uses this to find up to 100 related papers. It reads each of these papers, extracts evidence from them that might be useful for building theories, and then uses this evidence to synthesize about 4-8 theories per theory query. How do you know whether the generated theories are good? Theories have a number of desirable qualities, such as accurately predicting future scientific results and being novel compared to previous theories. We examine several methods of making theories, including using the scientific literature versus only the language model's own knowledge, and asking the model to focus on either accuracy or novelty. We made 100 theory queries broadly across different areas of AI and Natural Language Processing, and used these to synthesize approximately 3,000 theories from reading almost 14,000 papers. We found that different methods of making theories affect their properties (like how accurate or novel they are), with some methods making theories that are (on average) 90% accurate at predicting future scientific results.


Table of Contents

  1. Paper
  2. Quick Start
  3. Installation and Running
  4. Using Theorizer for Theory Generation
  5. Theory Evaluation
  6. Data, Example Output, and Theorizer Representation Formats
  7. Prompts
  8. Citation
  9. License
  10. Contact

1. Paper

Theorizer is described in the following paper: Generating Literature-Driven Scientific Theories at Scale (ACL 2026).

2. Quick Start

2.1. Is Theorizer limited to making theories in Computer Science/AI?

You can use Theorizer to make theories in any discipline indexed by Semantic Scholar, and we have used it internally to generate theories in other domains (e.g. biomedical). The only limitation for a given field is whether its papers are likely to be downloadable by Theorizer as open-access.

2.2. I want to read about Theorizer or generating theories from scientific literature

The Theorizer paper is available here: Section 1. Paper

2.3. I want to examine the theories, evaluations, and other results created by Theorizer

The generated theories, evaluations, and their representation formats are described in: Section 6. Data, Example Output, and Theorizer Representation Formats

2.4. I want to run Theorizer on my local machine

Please see the installation instructions in: Section 3. Installation and Running

2.5. I would like to generate theories on my own theory queries

To use Theorizer on your own theory queries, simply install it on your local machine and submit your queries through the interface. Note that each theory query may take approximately 30-60 minutes, depending on the rate limits of your API access, the number of papers selected, and the speed of the generating model.

2.6. I have a question not answered here.

Please see the documentation below. If your question isn't answered, please open an issue, or send an e-mail: Section 10. Contact

3. Installation and Running

The installation has been tested on Ubuntu Linux. It will likely work with minimal modification on macOS, and with some modification under Windows.

3.1 Installation Instructions

Clone the repository:

git clone https://github.com/allenai/theorizer
cd theorizer

Create a conda environment:

conda create --name theorizer python=3.12
conda activate theorizer

Install the dependencies:

pip install -r requirements.txt

3.1.1. LLM API keys

Create a file called api_keys.donotcommit.json that contains the required API keys for LLM access (the Mistral key is required for PDF -> Markdown conversion):

{
    "openai": "sk-proj-...",
    "anthropic": "sk-ant-...",
    "mistral": "..."
}

3.1.2. Semantic Scholar API key

Create a file called s2_key.donotcommit.txt that contains a single line with your Semantic Scholar API key:

<your key here>

3.1.3. Asta Paper Finder

Generating literature-supported theories with Theorizer requires a local copy of Asta PaperFinder. Installation is quick, and instructions can be found here:

https://github.com/allenai/asta-paper-finder

3.2. Running (Web User Interface)

There are two components that need to run simultaneously -- the back-end server, and the user-facing web server. In two terminals, run:

Back-end server:

python src/TheorizerServer.py

User-facing Server:

python src/TheorizerWebInterface.py

If you point your web browser to localhost:8080, then you should see the Theorizer interface.

3.3. Running (API)

You can also submit theory requests to Theorizer programmatically by starting TheorizerServer.py and sending appropriately formatted requests to localhost:5002. Endpoint examples are in TheorizerWebInterface.py.

TODO: Make stand-alone API example.
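
Until a stand-alone example is added, the minimal sketch below shows the general shape of a programmatic request in Python using the requests library. The endpoint name (/submit_theory_query) and the payload field names are illustrative assumptions that mirror the parameters described in Section 4.2; consult TheorizerWebInterface.py for the actual endpoint paths and request format.

import requests  # HTTP client (pip install requests)

# NOTE: The endpoint name and payload field names below are illustrative
# assumptions mirroring the parameters in Section 4.2 -- see
# TheorizerWebInterface.py for the actual request format.
payload = {
    "theory_query": "build theories about sample efficiency in reinforcement learning",
    "generation_model": "<model for theory generation>",
    "objective": "accuracy",            # or "novelty"
    "input_type": "literature",         # or "parametric"
    "extraction_model": "<model for literature extraction>",
    "num_papers": 100,
    "knowledge_cutoff": "2025-06",
}

response = requests.post("http://localhost:5002/submit_theory_query", json=payload)
response.raise_for_status()
print(response.json())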

4. Using Theorizer for Theory Generation

⚠️Costs⚠️: Using this code for theory generation, evaluation, or other purposes can incur significant costs. It is strongly encouraged to start at a small scale to gauge approximate costs, and to use API keys with hard spending limits to avoid accidental/unexpected cost overruns.

4.1 Main Menu

Opening http://localhost:8080/ in a web browser with both the Server and Web Interface running should show the main user interface menu of Theorizer.

4.2. Submitting a theory query

The main menu has two buttons for submitting theory queries. The top button submits a single query, and the second button submits a batch of theory queries that share the same configuration.

The parameters are:

  1. Theory Query: The request for theories. This can be as specific or general as you'd like, and on any topic for which Semantic Scholar/PaperFinder can reasonably find a broad selection of open-access papers.
  2. Model for Theory Generation: The model that will be used to read the aggregated evidence from the literature and infer/deduce/synthesize the set of theories. It should be a powerful model.
  3. Objective for Theory Generation: This allows selecting between the accuracy-focused and novelty-focused prompts for theory generation.
  4. Input-type for Theory Generation: You can either use papers/literature to collect evidence for theory generation, or use the LLM's own parametric knowledge.

The remaining parameters apply only if you selected literature-supported (rather than parametric) theory generation:

  1. Model for Literature Extraction: This is the model that will be used to extract knowledge from each paper. This will be called a lot, so typically you'll want to use an inexpensive model.
  2. Number of Papers: The maximum number of papers to find/download for generating literature-supported theories.
  3. Knowledge Cutoff (Year/Month): When downloading papers, only include papers that were authored before this Year/Month.

When you click submit, after a few moments you should see a message that says the server successfully accepted the theory request, and is processing it.

4.3. Monitoring active theory generation requests

It can take approximately 30-60 minutes for a literature-supported theory request to finish in the queue. You can monitor the status by examining the server status on the main menu. When a given theory query workflow is completed, its theories will automatically show up in the list of theories.

4.4. ⚠️ Saving/Exporting Theories ⚠️

🚨 It's important to save theories after you've generated them, so that they are retained. If you don't save your theories, they will disappear after you restart the interface. 🚨

Use the Export button on the main menu to export your:

  1. TheoryStore: a JSON object that represents all the theories, extraction schemas, extraction results, and so forth.
  2. PaperStore: a JSON cache of all the papers used to build the theories.

If you'd like to reload your theories into the web interface after exporting them, you'll need to manually enter the filename of your theory export at the bottom of TheorizerServer.py.

Failsafe: By default, the interface does attempt to automatically export a backup copy of the TheoryStore after processing each theory request (these are saved as the theorizer-state-autosave-*.json files in the main execution path). However, you should not rely on this to preserve your theories; manually export them using the Export functionality when you wish to save them. Please note that the auto-save functionality does not export the PaperStore, but the Export button does.
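
If you need to locate the most recent autosave backup, a minimal sketch like the one below can help; it assumes only the theorizer-state-autosave-*.json naming pattern described above and the top-level TheoryStore keys listed in Section 6.1.

import glob
import json
import os

# Find the most recently modified autosave file in the main execution path.
autosaves = sorted(glob.glob("theorizer-state-autosave-*.json"), key=os.path.getmtime)
if autosaves:
    latest = autosaves[-1]
    with open(latest) as f:
        theory_store = json.load(f)
    # "theories" is one of the top-level TheoryStore keys (see Section 6.1).
    print(f"{latest}: {len(theory_store.get('theories', {}))} theories")
else:
    print("No autosave files found.")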

4.5. List of Generated Theories

After theory query workflows have successfully completed, their theories will appear in the Theory List. You can select a theory to see more details.

4.6. Examine Specific Theory

Clicking on the details button of a specific theory will reveal the details of that theory, including its description, laws, generation parameters, etc.

4.7. Examine Specific Evidence used to Build a Theory

You can examine specific evidence extracted from a paper to build a theory by clicking on the evidence links (e.g. [e123.1]). These pages include a header that shows the paper the evidence was extracted from, as well as a list of specific instances of evidence according to the extraction schema.

4.8. Examine Specific Extraction Schema

For a given piece of extracted evidence, you can also click on its schema link to examine the extraction schema that was generated in response to the theory query and applied to extract evidence from each paper.

5. Theory Evaluation

5.1. LLM-as-a-judge evaluation (Table 1)

Code: The LLM-as-a-judge evaluation code is available here: src/EvaluationLLMAsAJudge.py

Example output:

...
LLM-as-a-judge evaluation complete.

Final average evaluation scores (formatted):
Dimension                      Average Score   Counts              
factual_accuracy               6.18            45                  
specificity                    7.36            45                  
novelty                        5.73            45                  
testability                    7.98            45                  
plausibility                   7.24            45                  
empirical_testing              2.69            45                  

Results saved to: ...

5.2. Predictive Accuracy Evaluation (Table 2)

Code: The predictive accuracy evaluation code is available here: src/EvaluationPredictiveAccuracy.py

Example output:

...
Summary of predictive accuracy evaluations:
Precision:
{
    "average_evaluation": {
        "support": 1.0,
        "contradict": 0.0,
        "count": 2
    }
}

Recall:
{
    "count_theories_has_data": 2,
    "count_theories_no_data": 0,
    "proportion_theories_with_data": 1.0,
    "count_elems_has_data": 6,
    "count_elems_no_data": 5,
    "proportion_elems_with_data": 0.5454545454545454
}

Other:
{
    "count_total_papers_evaluated": 20,
    "count_total_papers_with_relevant_info": 9,
    "avg_papers_with_relevant_info": 4.5,
    "count_total_laws_with_relevant_info": 2
}
Wrote summary to: theorystore-example-predictive-evaluation/predictive-accuracy-evaluation/claude_sonnet_4_5_20250929/predictive_accuracy_summary.json
Predictive accuracy evaluation complete.

5.3. Qualified Novelty Evaluation (Table 3)

Code: The qualified novelty evaluation code is available here: src/EvaluationQualifiedNovelty.py

Example output:

...
Found 5 novelty evaluation files in: qualified-novelty-evaluations/literature-supported/
Novelty Evaluation Histogram:
{
    "path_in": "qualified-novelty-evaluations/literature-supported/",
    "histogram_proportions": {
        "phenomenon_effect": {
            "not_novel": 1.0,
            "novel": 0.0
        },
        "explanatory_mechanistic": {
            "not_novel": 0.2,
            "novel": 0.8
        },
        "unification": {
            "not_novel": 0.4,
            "novel": 0.6
        },
        "generalization_scope_expansion": {
            "not_novel": 0.4,
            "novel": 0.6
        },
        "constraint_limitation": {
            "not_novel": 1.0,
            "novel": 0.0
        },
        "conceptual_reframing_abstraction": {
            "not_novel": 0.6,
            "novel": 0.4
        },
        "empirical_synthesis_meta_regulariry": {
            "not_novel": 0.6,
            "novel": 0.4
        }
    },
    "histogram_raw_counts": {
        "phenomenon_effect": {
            "not_novel": 4,
            "novel": 0
        },
        "explanatory_mechanistic": {
            "not_novel": 1,
            "novel": 4
        },
        "unification": {
            "not_novel": 2,
            "novel": 3
        },
        "generalization_scope_expansion": {
            "not_novel": 2,
            "novel": 3
        },
        "constraint_limitation": {
            "not_novel": 5,
            "novel": 0
        },
        "conceptual_reframing_abstraction": {
            "not_novel": 3,
            "novel": 2
        },
        "empirical_synthesis_meta_regulariry": {
            "not_novel": 3,
            "novel": 2
        }
    },
    "total_cost_all_evaluations": 5.84496665,
    "num_evaluations": 5,
    "average_cost_per_evaluation": 1.16899333,
    "average_papers_per_evaluation": 23.0,
    "stddev_papers_per_evaluation": 0.0,
    "total_papers_evaluated": 115.0
}

Total Cost for all evaluations: $5.84
Number of evaluations: 5
Average Cost per evaluation: $1.17
Average Number of papers evaluated per evaluation: 23.00 ± 0.00
Total papers evaluated across all evaluations: 115
Writing overall qualified novelty evaluation summary to: theorystore-example2-literaturesupported-qualified-novelty-evaluation.20260114-121058.json

5.4. Self-assessed Belief/Bayesian Surprise (Table 2)

Code: The self-assessed belief/Bayesian surprise evaluation code is available here: src/EvaluationSurprisal.py

Example output:

Surprisal / Belief Elicitation evaluation complete.
--------------------------------------------------------------------------------

Total laws evaluated: 45
Model used: gpt-4.1-2025-04-14
Number of samples per law: 10
Average (of average) probability that the model believes the laws to be true: 0.809
Standard deviation of the average probabilities: 0.045

Results saved to: theorystore-example-single-laws.surprise_evaluated.20260114-104415.json
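
For intuition, the sketch below illustrates the general idea of this evaluation: each law is presented to the model several times (10 samples per law in the example above), the elicited probabilities are averaged per law, and then averaged across laws. The prompt wording, response parsing, and use of the OpenAI client here are illustrative assumptions only; the actual implementation is in src/EvaluationSurprisal.py.

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def elicit_belief(law_text: str, model: str = "gpt-4.1-2025-04-14", num_samples: int = 10) -> float:
    """Ask the model several times for the probability that a law is true,
    and return the average elicited probability. The prompt and parsing here
    are illustrative; Theorizer's actual prompts are in src/EvaluationSurprisal.py."""
    prompt = (
        "Consider the following scientific statement:\n"
        f"{law_text}\n\n"
        "What is the probability (between 0.0 and 1.0) that this statement is true? "
        "Answer with a single number only."
    )
    probabilities = []
    for _ in range(num_samples):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            probabilities.append(float(response.choices[0].message.content.strip()))
        except (TypeError, ValueError):
            continue  # skip malformed responses
    return sum(probabilities) / len(probabilities) if probabilities else float("nan")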

6. Data, Example Output, and Theorizer Representation Formats

6.1. Representation Formats

The primary data structure saved by Theorizer is the TheoryStore. At the top level, this dictionary contains the following keys:

  • theories: A dictionary of theories, each containing their high-level descriptions, components, and lists of theory laws/statements.
  • extraction_schemas: A dictionary of extraction schemas built for extracting evidence from papers.
  • extraction_results: A dictionary of specific extraction results -- that is, using an extraction schema to extract evidence from a paper.

Theorizer also generates a secondary data structure, the PaperStore, which is a cache of the papers Theorizer has downloaded, with their full text extracted using the OCR converter.
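
As a minimal illustration of working with these formats (assuming only the three top-level TheoryStore keys listed above; the exported filename here is a placeholder), an exported TheoryStore can be inspected directly in Python:

import json

# Placeholder filename -- use whatever name you chose when exporting.
with open("theorystore-export.json") as f:
    theory_store = json.load(f)

# The three top-level dictionaries described above.
theories = theory_store.get("theories", {})
schemas = theory_store.get("extraction_schemas", {})
extractions = theory_store.get("extraction_results", {})

print(f"{len(theories)} theories, {len(schemas)} extraction schemas, "
      f"{len(extractions)} extraction results")

# Each theory entry holds its high-level description, components, and
# list of theory laws/statements.
for theory_id in theories:
    print(theory_id)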

6.2. Small/Toy Theory Dataset

Two small sets of theories, one generated using literature support, and one using parametric knowledge, along with sample evaluations, are provided in: example-theories/toy-data

While the JSON is provided, the easiest way to view the example theories is in HTML format: example-theories/toy-data/html/index.html

Please note that these toy theories are provided primarily as small, easily downloadable format examples. The PaperStores are not included due to copyright, and any paper full text has been sanitized from the evaluation files.

6.3. Real Theory Dataset (from the Theorizer paper)

The theories generated in the paper are available here: example-theories/theorizer-paper-data

7. Prompts

The core prompts used in this work can be found at the following locations in the source:

Schema Generation Prompts:

Prompts for Extracting Evidence from Papers:

Theory Generation Prompts:

Evaluation: LLM-as-a-judge (Table 1):

Evaluation: Predictive Accuracy (Table 2):

Evaluation: Qualified Novelty (Table 3):

Evaluation: Self-assessed Belief/Bayesian Surprise (Table 2):

8. Citation

If you use this work, please reference the following citation:

@misc{jansen2026generatingliteraturedrivenscientifictheories,
      title={Generating Literature-Driven Scientific Theories at Scale}, 
      author={Peter Jansen and Peter Clark and Doug Downey and Daniel S. Weld},
      year={2026},
      eprint={2601.16282},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.16282}, 
}

9. License

Theorizer is released under an Apache 2.0 License. The text of that license is included in this repository.

Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

10. Contact

For any questions, please contact Peter Jansen (peterj@allenai.org). For issues, bugs, or feature requests, please submit a GitHub issue: TODO
