Research Paper Bot is a modular, domain-agnostic research assistant designed to automate the discovery, filtering, and analysis of academic research papers.
The system integrates multiple research sources (Semantic Scholar and arXiv), applies keyword-driven contextual filtering, and extracts structured insights such as methodologies, formulas, and implementation ideas.
It is built to support both targeted research (e.g., crowd evacuation modeling) and general-purpose exploration across multiple domains including artificial intelligence, robotics, healthcare, and systems engineering.
- Reduce time spent manually reading research papers
- Extract only the most relevant information based on user intent
- Provide structured insights that can be directly implemented
- Enable domain-specific research workflows through presets
- Queries Semantic Scholar API
- Falls back to arXiv for reliable PDF access
- Supports a user-defined number of papers
- Displays an indexed list of papers
- Allows selective processing instead of bulk analysis
- Extracts only relevant portions of text based on:
  - preset keyword sets
  - custom user-defined keywords
- Includes a context window for better understanding
Supports 50+ domains including:
- Crowd dynamics (panic, congestion, hazard)
- Machine learning, deep learning
- Robotics and control systems
- Computer vision and NLP
- Distributed systems and networking
- Mathematics and simulation
- Healthcare and bioinformatics
- Finance and economics
Identifies:
- Models and methodologies
- Algorithms and approaches
- Experimental results
- Implementation-related content
- Extracts mathematical expressions and equations
- Useful for simulation and modeling tasks
- Converts insights into:
  - pseudo-code
  - actionable ideas
  - difficulty estimates
- Saves results per paper in structured folders
- Maintains history of downloaded papers
- Supports re-download control
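The download history and re-download control described above could be tracked with a small JSON file. The sketch below is only an illustration: the file name `history.json`, the helper names `load_history`, `should_download`, and `record_download`, and the use of a paper identifier as the key are assumptions, not the actual contents of utils.py or memory.py.

```python
import json
from pathlib import Path


def load_history(path: Path) -> dict:
    """Return the saved history, or an empty dict if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def should_download(history: dict, paper_id: str, force: bool = False) -> bool:
    """Skip papers already in the history unless a re-download is forced."""
    return force or paper_id not in history


def record_download(history: dict, paper_id: str, pdf_path: str, path: Path) -> None:
    """Add one paper to the history and write the file back to disk."""
    history[paper_id] = {"pdf": pdf_path}
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")
```

Keeping the "already downloaded?" check in its own function makes the re-download switch a single `force` flag rather than logic spread across the download code.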
The system follows a modular pipeline:
User Input → Search → Selection → Extraction → Processing → Output
- User Input
  - Topic
  - Number of papers
  - Mode (preset or custom keywords)
- Search Layer
  - Semantic Scholar API
  - arXiv API fallback
  - Deduplication of results
- Selection Layer
  - Indexed paper display
  - Manual user selection
- Processing Layer
  - PDF download
  - Text extraction
  - Keyword filtering with context
  - Insight extraction
  - Formula detection
  - Implementation generation
- Output Layer
  - Structured text files
  - CSV summary
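The Search Layer's fallback and deduplication steps could look roughly like the following. The two search callables are stand-ins for whatever search.py actually does: their signatures, the dict shape of a paper (`title`, `pdf_url`), the rule that primary results without a PDF link are dropped, and the title-based deduplication key are all assumptions made for illustration. No real API is called here.

```python
import re
from typing import Callable, Dict, List, Optional

Paper = Dict[str, Optional[str]]
SearchFn = Callable[[str, int], List[Paper]]


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def search_with_fallback(topic: str, limit: int,
                         primary: SearchFn, fallback: SearchFn) -> List[Paper]:
    """Query the primary source, fill gaps from the fallback, and deduplicate."""
    try:
        results = primary(topic, limit)
    except Exception:
        # Assumption: any primary failure (e.g. rate limiting) triggers the fallback.
        results = []
    # Assumption: primary results without a PDF link are not useful downstream.
    results = [p for p in results if p.get("pdf_url")]
    if len(results) < limit:
        results = results + fallback(topic, limit)

    seen = set()
    unique = []
    for paper in results:
        key = normalize_title(paper["title"] or "")
        if key and key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique[:limit]
```

Passing the two sources in as callables keeps the merge logic testable without network access; the real Semantic Scholar and arXiv clients would be plugged in at the call site.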
research_paper_bot/
│
├── main.py # Entry point and orchestration
├── search.py # API integration and paper retrieval
├── parser.py # PDF parsing and text extraction
├── analyzer.py # Insight extraction and filtering
├── extractor.py # Implementation idea generation
├── scorer.py # Paper scoring logic
├── utils.py # Utility functions (download, save, history)
├── presets.py # Keyword preset definitions
├── memory.py # Processed paper tracking
│
├── requirements.txt
├── README.md
├── LICENSE
├── .gitignore
│
└── output/ # Generated outputs (ignored in Git)
Clone the repository:
git clone https://github.com/sankhya007/research_paper_hunter.git
cd research_paper_hunter
Install dependencies:
pip install -r requirements.txt
Run the program:
python main.py
1. Search new papers
2. View downloaded papers
Example:
Topic: crowd evacuation
How many papers: 20
Mode: panic
[0] Paper A
[1] Paper B
[2] Paper C
...
Enter indices:
1 2 5
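A selection such as `1 2 5` could be parsed as in the sketch below. The function name `parse_indices` and the choice to silently drop out-of-range, repeated, or non-numeric entries are assumptions; main.py may well handle invalid input differently.

```python
from typing import List


def parse_indices(raw: str, paper_count: int) -> List[int]:
    """Turn a string like "1 2 5" into a list of valid, unique indices."""
    selected = []
    # Accept commas as separators too, then split on whitespace.
    for token in raw.replace(",", " ").split():
        if not token.isdecimal():
            continue  # ignore anything that is not a non-negative integer
        index = int(token)
        if index < paper_count and index not in selected:
            selected.append(index)
    return selected
```

With 20 papers listed, `parse_indices("1 2 5", 20)` returns `[1, 2, 5]`.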
For each selected paper:
- PDF is downloaded
- Text is extracted
- Relevant sections are filtered
- Insights and formulas are extracted
- Implementation ideas are generated
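The formula step could, as a first approximation, be a line-level heuristic over the extracted text. The sketch below is an assumption about how such a detector might work, not the logic in analyzer.py: it flags lines that have text on both sides of an equals sign and contain at least one other math-like symbol. The function name `find_formula_lines` and the symbol set are hypothetical.

```python
import re
from typing import List

# Characters that often appear in equations in extracted PDF text (an assumed set).
MATH_SYMBOLS = re.compile(r"[+\-*/^∑∫√≤≥≈∂∇]|\\frac|\\sum")


def find_formula_lines(text: str) -> List[str]:
    """Return stripped lines that look like equations."""
    formulas = []
    for line in text.splitlines():
        stripped = line.strip()
        if "=" not in stripped:
            continue
        left, _, right = stripped.partition("=")
        # Require something on both sides and a math-like symbol somewhere.
        if left.strip() and right.strip() and MATH_SYMBOLS.search(stripped):
            formulas.append(stripped)
    return formulas
```

Text extracted from PDFs often garbles equations, so a heuristic like this will miss some formulas and flag some ordinary lines (for example, a sentence with a hyphenated word and an equals sign).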
Each paper generates:
output/<paper_name>/
└── important.txt
- Important insights
- Extracted formulas
- Implementation ideas with pseudo-code
Additionally:
output/results.csv
Contains a summary of all processed papers.
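The output layout above could be produced along these lines. This is a sketch under assumptions: the function name `save_paper_result`, the section labels written into important.txt, the folder-name sanitizing rule, and the CSV columns (`paper`, `insights`, `formulas`, `ideas`) are invented for illustration and may not match what utils.py writes.

```python
import csv
import re
from pathlib import Path
from typing import List


def save_paper_result(output_dir: Path, title: str, insights: List[str],
                      formulas: List[str], ideas: List[str]) -> Path:
    """Write important.txt for one paper and append a row to results.csv."""
    # Make the title safe to use as a folder name.
    folder_name = re.sub(r"[^A-Za-z0-9_-]+", "_", title).strip("_") or "untitled"
    paper_dir = output_dir / folder_name
    paper_dir.mkdir(parents=True, exist_ok=True)

    sections = [("Important insights", insights),
                ("Extracted formulas", formulas),
                ("Implementation ideas", ideas)]
    lines = []
    for heading, items in sections:
        lines.append(heading)
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    (paper_dir / "important.txt").write_text("\n".join(lines), encoding="utf-8")

    # Append one summary row per paper; write the header only for a new file.
    csv_path = output_dir / "results.csv"
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["paper", "insights", "formulas", "ideas"])
        writer.writerow([title, len(insights), len(formulas), len(ideas)])
    return paper_dir
```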
The system filters text using:
- Keyword matching (case-insensitive)
- Context window expansion (±2 lines)
Example:
If keyword = "panic"
Extracted content:
- Lines containing "panic"
- Surrounding context for better understanding
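The filtering rule described above (case-insensitive match, plus two lines of context on each side) can be sketched as follows. The function name `filter_with_context` and its parameters are assumptions for illustration rather than the actual analyzer.py interface.

```python
from typing import Iterable, List


def filter_with_context(text: str, keywords: Iterable[str], window: int = 2) -> List[str]:
    """Return lines that contain a keyword, plus `window` lines on each side."""
    lines = text.splitlines()
    lowered = [kw.lower() for kw in keywords]
    keep = set()
    for i, line in enumerate(lines):
        # Case-insensitive substring match against every keyword.
        if any(kw in line.lower() for kw in lowered):
            start = max(0, i - window)
            end = min(len(lines), i + window + 1)
            keep.update(range(start, end))
    # Sorting the indices preserves document order and merges overlapping windows.
    return [lines[i] for i in sorted(keep)]
```

Because this is a substring match, the keyword "panic" also matches "panicked"; a word-boundary regular expression would be stricter if that turns out to be too noisy.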
Edit presets.py:
"your_domain": ["keyword1", "keyword2", ...]
Edit analyzer.py:
- Add new patterns
- Improve filtering logic
Edit extractor.py:
- Add domain-specific pseudo-code generation
- Improve difficulty estimation
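For the presets.py step above, a preset table and lookup might be laid out as in the sketch below. The dictionary name `PRESETS`, the helper `get_keywords`, the comma-separated format for custom keywords, and the sample keywords under `panic` are assumptions for illustration; the real presets.py may be organized differently.

```python
from typing import Dict, List

# Hypothetical preset table: maps a mode name to the keywords used for filtering.
PRESETS: Dict[str, List[str]] = {
    "panic": ["panic", "stampede", "crowd pressure"],
    "your_domain": ["keyword1", "keyword2"],
}


def get_keywords(mode: str, custom: str = "") -> List[str]:
    """Return preset keywords for `mode`, or parse a comma-separated custom list."""
    if mode in PRESETS:
        return list(PRESETS[mode])  # copy so callers cannot mutate the preset
    return [kw.strip().lower() for kw in custom.split(",") if kw.strip()]
```

Adding a domain then amounts to adding one key to the table, and the custom mode falls through to whatever the user types.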
- Some papers may not provide downloadable PDFs
- PDF parsing may fail for scanned documents
- Semantic Scholar API may rate-limit requests
- Keyword filtering depends on text quality
- API retry and caching system
- Insight ranking based on importance
- Keyword highlighting in outputs
- GUI-based interface
- Integration with simulation systems (e.g., TRAGIC)
- Automated mapping of insights to executable models
- Research students extracting key insights quickly
- Simulation developers building models from papers
- AI practitioners exploring new methodologies
- Cross-domain literature analysis
Sankhyapriyo Dey
This project is licensed under the MIT License.