EasyPaper, a deep research tool for large-scale literature review and chat-with-paper, has been built on top of the crawled papers.
- Paper Crawler for Top CS/AI/ML/NLP Conferences and Journals
- Installation
- Usage
- Adding a Custom Spider (Quick & Lazy Solution)
- Supported Arguments
- Known Issues
- Change Log
This is a Scrapy-based crawler. The crawler scrapes accepted papers from top conferences and journals, including:
* The official sites of these publishers either lack a consistent HTML structure or block spiders. For these venues, the spider queries the title from dblp; the abstract can be fetched once the corresponding paper is open-access and hosted on Arxiv.
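Titles returned by such queries rarely match the crawled title character-for-character, so some score-thresholded fuzzy matching is needed (the README lists fuzzywuzzy as a dependency, and the change log mentions thresholding the match score). A minimal stdlib sketch of the idea — the function names and the 0.9 threshold here are illustrative, not the project's actual code:

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Case-insensitive similarity score in [0, 1] between two titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(query_title, candidates, threshold=0.9):
    """Return the candidate closest to query_title, or None when even the
    best score falls below the threshold (guards against false matches)."""
    scored = [(title_similarity(query_title, c), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None
```

Rejecting low-scoring matches is what prevents an unrelated query result from polluting the citation count and category fields.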
| Conference | Status | Since |
|---|---|---|
| CVPR | β | 2013 |
| ECCV | β | 2018 |
| ICCV | β | 2013 |
| NIPS | β | 2021 |
| ICLR | β | 2017 |
| ICML | β | 2023 |
| AAAI | β | 2017 |
| IJCAI | β | 2017 |
| ACM MM* | β | 1993 |
| KDD* | β | 2015 |
| WWW* | β | 1994 |
| ACL | β | 2013 |
| EMNLP | β | 2013 |
| NAACL | β | 2013 |
| Interspeech | β | 1987 |
| ICASSP | β | 1976 |
| Journal | Status | Since |
|---|---|---|
| NATURE* | β | 2010 |
| TPAMI* | β | 1979 |
| PIEEE* | β | 1975 |
| NMI* | β | 2019 |
| PNAS* | β | 1997 |
| TNNLS* | β | 2012 |
| IOTJ* | β | 2014 |
| TCOM* | β | 1972 |
| CACM* | β | 1958 |
| CSUR* | β | 1969 |
| TOG* | β | 1982 |
| IJCV* | β | 1987 |
| IF* | β | 2014 |
| TIP* | β | 1992 |
| TAFFC* | β | 2010 |
| TSP* | β | 1991 |
The following information is extracted from each paper:
Conference, matched keywords, title, citation count, categories, concepts, code URL, PDF URL, authors, abstract, doi
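As a rough illustration, one row of the output CSV can be thought of as the following record (the field names and defaults here are illustrative, not the exact column headers):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class PaperRecord:
    conference: str
    matched_queries: str
    title: str
    citation_count: int = -1   # -1 when the citation lookup is skipped or rate limited
    categories: list = field(default_factory=list)
    concepts: list = field(default_factory=list)
    code_url: str = ""
    pdf_url: str = ""
    authors: list = field(default_factory=list)
    abstract: str = ""
    doi: str = ""
```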
```shell
pip install scrapy pyparsing feedparser fuzzywuzzy git+https://github.com/sucv/paperCrawler.git
```

First, navigate to the directory where `main.py` is located. During crawling, a CSV file will be generated in the same directory by default unless `-out` is specified.
Get ALL papers from all venues (2024-2026) and save the output to myresearch/all.csv, without downloading any papers:
```shell
python main.py -years 2026,2025,2024 -queries "*" -out "myresearch/all.csv"
```

Disable the citation count lookup to greatly speed things up (useful when you intend to crawl all papers from all venues across multiple years; see Known Issues):
```shell
python main.py -years 2026,2025,2024 -queries "*" -out "myresearch/all.csv" --nocrossref
```

The downside is that the spider will hammer the server and soon get rate limited. In that case, wait for a day, reduce your target venues and years, tweak `crawl_conf/settings.py` for a longer delay, or drop `--nocrossref`, as the citation count API call serves as a throttle.
Query papers whose titles contain emotion recognition, facial expression, or multimodal, and download those whose citation count is at least 50:
```shell
python main.py -confs cvpr,iccv,eccv -years 2021,2022,2023 -queries "(emotion recognition) or (facial expression) or multimodal" -download_pdf 50
```

Note: more examples of queries with `and`, `or`, `()`, and wildcards can be found here.
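The query grammar (and/or/not, parentheses, implicit and, `*` wildcards) can be sketched with a small recursive-descent evaluator. This is a simplified stand-in for the pyparsing-based parser, not the project's actual code:

```python
import re

def matches(query, title):
    """Evaluate a boolean keyword query against a paper title.

    Supports and/or/not, parentheses, implicit and, and a '*' wildcard at
    either end of a term.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    words = re.findall(r"[^\s()]+", title.lower())
    pos = 0

    def term_matches(term):
        # Translate leading/trailing '*' wildcards into a regex.
        pattern = "^" + re.escape(term).replace(r"\*", ".*") + "$"
        return any(re.match(pattern, w) for w in words)

    def parse_atom():
        nonlocal pos
        tok = tokens[pos]
        if tok == "not":
            pos += 1
            return not parse_atom()
        if tok == "(":
            pos += 1
            val = parse_or()
            pos += 1  # skip the closing ')'
            return val
        pos += 1
        return term_matches(tok)

    def parse_and():
        # 'and' binds tighter than 'or'; adjacency is an implicit 'and'.
        nonlocal pos
        val = parse_atom()
        while pos < len(tokens) and tokens[pos] not in (")", "or"):
            if tokens[pos] == "and":
                pos += 1
            val = parse_atom() and val
        return val

    def parse_or():
        nonlocal pos
        val = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            val = parse_and() or val
        return val

    return parse_or()
```

For example, `matches("emo* and (visual or audio or speech)", "Emotion Recognition from Audio")` holds because `emo*` matches "emotion" and "audio" satisfies the parenthesized group.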
```shell
python main.py -confs cvpr,iccv,eccv -years 2021,2022,2023 -queries "emo* and (visual or audio or speech)"
```

dblp provides consistent HTML structures, making it easy to add custom spiders for publishers indexed there. You can quickly create a spider for any conference or journal. DBLP provides useful information such as citation count and paper categories. Although the abstract is not available from DBLP, the spider will try to salvage it by checking whether the paper is available on Arxiv and fetching the abstract if so.
In `spiders.py`, add the following code:
A spider for a single journal or conference, e.g., TPAMI:

```python
class TpamiScrapySpider(DblpScrapySpider):
    name = "tpami"
    start_urls = [
        "https://dblp.org/db/journals/pami/index.html",
    ]
```

For a spider covering multiple venues, refer to DBLP itself or `venues.py`, and manually add the venues you are interested in to `start_urls`:
```python
class ExtraScrapySpider(DblpScrapySpider):
    name = "extra"
    start_urls = [
        "https://dblp.org/db/journals/tog/index.html",
        "https://dblp.org/db/journals/jacm/index.html",
    ]
```

Simply inherit from `DblpScrapySpider`, set `name`, and provide `start_urls` pointing to the DBLP index pages of interest. That's all. To call the new spider, pass its name to the `-confs` argument.
- `confs`: A list of supported conferences and journals (must be lowercase, separated by commas), as defined by `name` in each spider. If not specified, all available publishers will be queried. Available publisher names so far:
  `cvpr,iccv,eccv,aaai,ijcai,nips,iclr,icml,mm,kdd,www,acl,emnlp,naacl,tpami,nmi,pnas,ijcv,if,tip,taffc,interspeech,icassp,tsp,pieee,tnnls,iotj,tcom,cacm,csur,jacm,nature,tog`. Feel free to add more based on the instructions above.
- `years`: A list of four-digit years (separated by commas). If not specified, the recent 10 years (since 2016) are queried.
- `queries`: A case-insensitive query string supporting `()`, `and`, `or`, `not`, and the wildcard `*`, based on pyparsing. See examples here.
- `out`: The output csv path. `.csv` will be appended if the path does not end with ".csv". The pdfs, if downloaded, are saved in the same directory.
- `download_pdf`: The citation count threshold that decides whether to download a paper. Must be an integer. The default is `-1`, which downloads nothing.
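A minimal sketch of how these arguments might be wired up with argparse — the flag spellings and defaults follow the list above, but this is not the project's actual `main.py`:

```python
import argparse

def build_parser():
    """Hypothetical CLI mirroring the documented arguments."""
    parser = argparse.ArgumentParser(description="Paper crawler CLI (sketch)")
    parser.add_argument("-confs", default="",
                        help="comma-separated lowercase spider names; empty = all")
    parser.add_argument("-years", default="",
                        help="comma-separated four-digit years")
    parser.add_argument("-queries", default="*",
                        help="case-insensitive query string")
    parser.add_argument("-out", default="output.csv",
                        help="output csv path")
    parser.add_argument("-download_pdf", type=int, default=-1,
                        help="citation-count threshold; -1 downloads nothing")
    parser.add_argument("--nocrossref", action="store_true",
                        help="skip the citation-count API for speed")
    return parser

# Example invocation, mirroring the commands shown earlier.
args = build_parser().parse_args(
    ["-confs", "cvpr,iccv", "-years", "2024,2025", "-queries", "emo*", "--nocrossref"]
)
confs = args.confs.split(",")
```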
- The citation count uses the free-tier OpenAlex API, which hits its limit after around 1K calls. As a result, the csv output usually shows `citation_count = -1` after about 1K rows. Therefore, if you intend to crawl all papers from all venues across N years, add `--nocrossref`, which disables the OpenAlex API and makes crawling much faster. However, doing so may get you rate limited by the server, so avoid crawling all venues and all years at once, or tweak `crawl_conf/settings.py` for a larger delay and fewer concurrent requests.
- A publisher site may change its HTML or block spiders. When that happens, the corresponding spider raises a 404 error silently. As far as I know, venues like OpenCVF (CVPR, ICCV, and ECCV), OpenReview (ICLR, ICML, and NeurIPS), ACL Anthology (ACL, EMNLP, and NAACL), and DBLP (all custom spiders) are quite consistent, whereas IEEE (all IEEE transactions), ACM (KDD, MM, and WWW), AAAI, and IJCAI change their policy or HTML at a higher frequency.
- 3-APR-2026
  - Fixed a bug in the dblp HTML parsing so that the title string is obtained correctly when it contains `<sup></sup>`.
  - Added a few top-tier journals to the default spiders.
- 29-MAR-2026
  - Fixed multiple venues that were outdated.
- 13-MAR-2024
  - Fixed a bug so that the pdfs can be downloaded to `pdf_dir`.
  - Fixed a bug in which duplicated pdf urls could be saved.
  - Merged `DblpScrapySpider` and `DblpConfScrapySpider` into one class.
  - Added the top CS venues from DBLP.
  - Fixed the Arxiv abstract line-break issue.
- 12-MAR-2024
  - Improved `pipeline.py` so that when the CrossRef API reports a paper as open-access, it not only accumulates all the OA pdf urls but also checks whether any url is from Arxiv. If so, it requests the abstract from the Arxiv API. Since a great number of papers are open-access, this salvages many DBLP records that lack such information.
  - Added `download_pdf` as the citation count threshold for downloading a paper. A paper is downloaded only if its citation count is greater than or equal to the threshold.
  - Removed `--nocrossref` so that the CrossRef API is always called. This fetches useful information such as citation count, concepts, etc.
  - Removed `from_dblp` from each spider class. Records from dblp and from the original publishers now follow the same processing logic.
  - Fixed `code_url`: a trailing period `.` is now removed.
- 10-MAR-2025
  - Fixed the last false-match bug by thresholding the match score.
- 7-FEB-2025
  - Found a bug in which, when the paper title cannot be fetched from the top-5 query results, the citation count / categories / concepts from CrossRef are wrong. I haven't figured out how to fix it without importing extra libraries for sophisticated matching, so I will leave it for now since it only affects a very small percentage (~0.1%) of the results.
- 17-JAN-2025
  - Added spiders for Interspeech, TSP, and ICASSP.
- 15-JAN-2025
  - Added citation count, concepts, and categories for each matched paper based on the Crossref API, with a 1s cooldown per request. For unmatched papers, the cooldown is not triggered.
  - Fixed multiple out-of-date crawlers.
  - Removed some arguments such as `count_citations` and `query_from_abstract`. The crawler now calls the Crossref API for extra information by default, and always queries by title, not abstract.
- 19-JAN-2024
  - Fixed an issue in which years containing a single volume and multiple volumes of a journal on dblp could not be correctly parsed.
- 05-JAN-2024
  - Greatly sped up journal crawling: by default only title and authors are captured directly from dblp. Specify `-count_citations` to get `abstract`, `pdf_url`, and `citation_count`.
- 04-JAN-2024
  - Added support for ACL, EMNLP, and NAACL.
  - Added support for top journals, including TPAMI, NMI (Nature Machine Intelligence), PNAS, IJCV, IF, TIP, and TAFFC, via dblp and the Semantic Scholar API. An example is provided.
  - You may easily add your own spider in `spiders.py` by inheriting from `DblpScrapySpider` as a shortcut for conferences and journals. This way you only get the paper title and authors; since titles already provide initial information, you may manually search for papers of interest later.
- 03-JAN-2024
  - Added the `-out` argument to specify the output path and filename.
  - Fixed urls for NIPS2023.
- 02-JAN-2024
  - Fixed urls that were not working due to target website updates.
  - Added support for ICLR, ICML, KDD, and WWW.
  - Added support for querying with pyparsing:
    - 'and', 'or', and implicit 'and' operators;
    - parentheses;
    - quoted strings;
    - wildcards at the end of a search term (help*);
    - wildcards at the beginning of a search term (*lp).
- 28-OCT-2022
  - Added a feature so that the target conferences can be specified in `main.py`. See Example 4.
- 27-OCT-2022
  - Added support for ACM Multimedia.
- 20-OCT-2022
  - Fixed a bug that falsely located the paper pdf url for NIPS.
- 7-OCT-2022
  - Rewrote `main.py` so that the crawler can run over all the conferences!
- 6-OCT-2022
  - Removed the use of `PorterStemmer()` from `nltk`, as it introduced false negatives when querying.