PaperTools

This tool is designed for researchers and data analysts to efficiently fetch academic papers from arXiv and perform text analysis on the collected data. It includes features like keyword extraction, TF-IDF calculation, LDA topic modeling, and word cloud generation.

🌍 Language Options

English | Chinese (README_cn.md)

Features

Paper Scraping: Fetch papers from arXiv based on user-defined keywords and date ranges.
Keyword Extraction: Extract frequently occurring keywords from paper titles and abstracts.
TF-IDF Calculation: Compute term frequency-inverse document frequency to identify significant terms.
LDA Topic Modeling: Apply Latent Dirichlet Allocation to uncover underlying topics in the text data.
Word Cloud Visualization: Generate visually appealing word clouds for keywords and topics.
Multi-language Support: Handles both English and Chinese text (Chinese word clouds require a configured font; see Font Setup).
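To make the TF-IDF feature concrete, here is a minimal standard-library sketch of the calculation. It is an illustration only, not the code used in arxiv_paper_scraper.py: the function names tokenize and tf_idf are hypothetical, and it uses the plain unsmoothed formula tf * log(N / df), whereas scikit-learn's TfidfVectorizer applies a smoothed variant by default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def tf_idf(docs, stopwords=frozenset()):
    """Return one {term: score} dict per document (hypothetical helper)."""
    tokenized = [[t for t in tokenize(d) if t not in stopwords] for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values()) or 1  # avoid division by zero on empty docs
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up titles, used only to exercise the function.
docs = [
    "Graph neural networks for molecule property prediction",
    "Neural networks for image classification",
    "Topic models for text corpora",
]
scores = tf_idf(docs, stopwords={"for"})
for title, doc_scores in zip(docs, scores):
    best = max(doc_scores, key=doc_scores.get)
    print(f"{title!r}: highest-scoring term is {best!r}")
```

Terms that appear in every document score zero under this formula, which is why TF-IDF surfaces terms that distinguish one paper from the rest of the collection.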

Installation

Prerequisites

Python 3.8 or higher
Required libraries:
- requests
- matplotlib
- wordcloud
- nltk
- numpy
- scikit-learn
- gensim

Install the required libraries:

pip install requests matplotlib wordcloud nltk numpy scikit-learn gensim

Download the NLTK stopword list once from a Python session:

import nltk
nltk.download('stopwords')

Font Setup (Optional for Chinese Support)

To analyze or visualize Chinese text, install a font that includes Chinese glyphs (e.g., SimHei.ttf) and set the chinese_font_path parameter in the script to the font's location.
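One way to pick a value for chinese_font_path is to probe a few likely locations and use the first file that exists. The sketch below assumes this approach; resolve_font_path is a hypothetical helper, and the candidate paths are examples that will differ per system.

```python
import os


def resolve_font_path(candidates):
    """Return the first candidate that is an existing file, or None (hypothetical helper)."""
    for path in candidates:
        if path and os.path.isfile(path):
            return path
    return None


# Example locations only; adjust them to wherever the font is installed.
chinese_font_path = resolve_font_path([
    "SimHei.ttf",
    "/usr/share/fonts/truetype/simhei/SimHei.ttf",
    "C:/Windows/Fonts/simhei.ttf",
])

if chinese_font_path is None:
    print("No Chinese font found; Chinese characters may not render in word clouds.")
else:
    print(f"Using font: {chinese_font_path}")

# The resolved path would then be handed to the word cloud, for example:
#   WordCloud(font_path=chinese_font_path, ...)
```

The wordcloud library's WordCloud class accepts a font_path argument; its default font does not cover Chinese glyphs, which is why an explicit path is needed.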

Usage

1. Basic Usage

To run the tool, execute the script:

python arxiv_paper_scraper.py

The script will:

Fetch papers based on predefined keywords and date ranges.
Save the collected data as a JSON file (e.g., papers_YYYYMMDD_HHMMSS.json).
Analyze the papers and generate word clouds for keywords and topics.
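The timestamped file name from the second step can be produced as sketched below. This is an assumption about how the output might be written, not the script's actual code; save_papers and its parameters are hypothetical names.

```python
import json
import os
from datetime import datetime


def save_papers(papers, out_dir=".", now=None):
    """Write papers to papers_YYYYMMDD_HHMMSS.json and return the path (hypothetical helper)."""
    now = now or datetime.now()
    filename = f"papers_{now.strftime('%Y%m%d_%H%M%S')}.json"
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese titles and abstracts human-readable.
        json.dump(papers, f, ensure_ascii=False, indent=2)
    return path


# Example call (writes into the current directory):
#   save_papers([{"title": "An example paper", "abstract": "..."}])
```

Including the timestamp in the name means repeated runs do not overwrite earlier results.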

2. Configurable Parameters

Modify these parameters in the main() function:

keywords: List of search keywords.
start_year, end_year: Year range for the paper search.
max_results_per_source: Maximum number of papers to fetch from each source.
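As a rough illustration of how these three parameters could map onto an arXiv API request, the sketch below builds a query URL without sending it. The function build_query_url is hypothetical, and the way keywords are combined (OR across quoted phrases, full calendar years) is an assumption; the script may construct its query differently.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(keywords, start_year, end_year, max_results_per_source):
    """Build an arXiv API URL for the given keywords and year range (hypothetical helper)."""
    # Match any keyword, each treated as a quoted phrase searched across all fields.
    terms = " OR ".join(f'all:"{kw}"' for kw in keywords)
    # arXiv date filters use YYYYMMDDHHMM timestamps in GMT.
    date_range = f"submittedDate:[{start_year}01010000 TO {end_year}12312359]"
    params = {
        "search_query": f"({terms}) AND {date_range}",
        "start": 0,
        "max_results": max_results_per_source,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(
    keywords=["topic modeling", "LDA"],
    start_year=2020,
    end_year=2023,
    max_results_per_source=50,
)
print(url)
```

The resulting URL can be fetched with requests.get(url); the API responds with an Atom XML feed that the scraper would then parse.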

Future Features (Planned)

Support for Additional Sources:
- Integrate other APIs (e.g., IEEE Xplore, Springer, or PubMed).
Advanced Visualization:
- Add network graphs for keyword co-occurrence.
- Enhance topic modeling results with interactive visualizations (e.g., pyLDAvis).
Export Formats:
- Support for exporting results to CSV, Excel, or interactive dashboards.
Sentiment Analysis:
- Analyze sentiment in abstracts or full text for trend identification.
Custom Stopwords:
- Allow user-defined stopword lists for better keyword extraction.
Performance Improvements:
- Parallelize scraping and analysis for faster execution.

Contributing

We welcome contributions to enhance this tool! Feel free to:

Suggest new features.
Report bugs or performance issues.
Submit pull requests for improvements.

License

This project is licensed under the MIT License. You are free to use, modify, and distribute this tool, provided the copyright and license notice is retained.

Happy Researching!
