A Google Scholar crawler for GitHub Pages, decoupled from the AcadHomepage Jekyll theme, with added i10-index and h-index caching and improved usability.
This distribution of the Google Scholar crawler was originally extracted from the AcadHomepage theme and is now maintained by me. It works well with Academic Pages, al-folio, and multi-language-al-folio (personally tested).
My modification to the original version adds separately cached data for the i10-index and h-index, so that one can easily cite these values without digging through gs_data.json.
The benefits of this crawler version include:
- cached data: avoids querying Google Scholar too frequently, which triggers HTTP error 429 "too many requests", slows down local website builds, and stops GitHub Pages auto-deployment.
- optimized access: use a CDN (in `_config.yml`, set `google_scholar_stats_use_cdn` to `true`) for better access to GS data in special Internet environments with censorship and delay. The CDN also avoids the `domain blocked` error from GitHub.com when there are too many refreshes.
- easy deployment: fork, fill in your info, and play.
Your Google Scholar data is automatically fetched at UTC 2:42 every Sunday.
Note: It is pretty normal to be blocked by Google several times a week resulting in a build action failure, even if random proxies are used. A success once a week should be sufficient for personal use. To change the frequency of the scheduled action, please refer to google_scholar_crawler.yaml. This scheduled task can also be run on demand manually, by visiting the Actions page > Get Citation Data > (Re)Run workflow.
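As a rough illustration of the schedule described above (Sunday 02:42 UTC, which corresponds to the cron expression `42 2 * * 0`), the sketch below computes when the next scheduled fetch would fire. The function name `next_weekly_run` is hypothetical and not part of this repo, and GitHub Actions may start scheduled jobs later than the nominal time.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime, weekday: int = 6, hour: int = 2, minute: int = 42) -> datetime:
    """Return the next run strictly after `now` for a weekly UTC schedule.

    `now` must be timezone-aware. `weekday` follows Python's convention
    (Monday=0 ... Sunday=6), so cron's day-of-week 0 (Sunday) maps to 6 here.
    """
    now = now.astimezone(timezone.utc)
    # Start from today's date at the scheduled time, then move to the target weekday.
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment is not in the future, the next run is one week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_weekly_run(datetime.now(timezone.utc)).isoformat())
```

Changing `weekday`, `hour`, or `minute` mirrors what editing the cron fields in the workflow file would do.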
Redundant fetch workflow:
Either one that succeeds will do.
The most recent fetch with free proxy:
The most recent fetch without proxy:
- `repository: "<your-github-user-name>/<repo-name>"`: change `<your-github-user-name>` to your GitHub user name. `<repo-name>` is your GitHub Pages website repo if you choose Option 1 below; `GH-ScholarBot` if you choose Option 2.
- `google_scholar_stats_use_cdn: true`: `true`: use CDN, delay might occur. `false`: use GitHub.com.
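To make the effect of these two settings concrete, here is a minimal sketch of how they could map to the location of a cached stats file. It only mirrors the URL patterns used in the badge snippets of this README; the helper name `stats_file_url` is hypothetical and is not code from this repo.

```python
def stats_file_url(repository: str, filename: str, use_cdn: bool = True,
                   branch: str = "google-scholar-stats") -> str:
    """Build the URL of a cached stats file.

    `repository` is "<your-github-user-name>/<repo-name>" as in _config.yml;
    `use_cdn` plays the role of google_scholar_stats_use_cdn.
    """
    # Host prefixes follow the patterns shown in this README's badge snippets.
    host = "https://cdn.jsdelivr.net/gh" if use_cdn else "https://github.com"
    return f"{host}/{repository}@{branch}/{filename}"


if __name__ == "__main__":
    # "alice" is a placeholder user name.
    print(stats_file_url("alice/GH-ScholarBot", "gs_data_h_index.json", use_cdn=True))
```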
Option 1: You can merge this repo into (inside) your GitHub Pages website:
- download this repo, keep the folder structure, and paste the files into your website root folder;
- set up `_config.yml`: copy the lines in this project and change the contents to be yours;
- in project settings > Actions > General > Workflow permissions, grant Read and write permissions;
- in project settings > Secrets and variables > Actions > Repository secrets, create a key named `GOOGLE_SCHOLAR_ID` whose value is the string after `user=` in your Google Scholar profile URL;
- the crawler will create a branch in the website project named `google-scholar-stats` with 5 JSON files: `gs_data.json` (full data for all your papers), `gs_data_h_index.json`, `gs_data_i10_index.json`, `gs_data_total_citation.json`, and `gs_data_total_publications.json`.
- If the crawler fails to do so, you can manually create a branch named `google-scholar-stats` from `main`. The content in this `google-scholar-stats` branch will be permanently cleared and replaced by the `json` files when the crawler is working.
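If you are unsure which part of the profile URL to store as `GOOGLE_SCHOLAR_ID`, the following sketch extracts the value of the `user=` query parameter. The function name and the example ID are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def scholar_id_from_url(profile_url: str) -> str:
    """Return the value of the `user` query parameter of a Google Scholar profile URL."""
    query = parse_qs(urlparse(profile_url).query)
    values = query.get("user")
    if not values:
        raise ValueError("no 'user=' parameter found in the profile URL")
    return values[0]


if __name__ == "__main__":
    # Placeholder ID; use your own profile URL here.
    print(scholar_id_from_url("https://scholar.google.com/citations?user=AbCdEfGhIjK&hl=en"))
```

Only the ID itself goes into the repository secret, without `user=` and without trailing parameters such as `&hl=en`.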
To use it in your .md file for your website pages:
In the following code, replace <your-github-user-name> and GOOGLE_SCHOLAR_ID with your own values.
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_total_citation.json&labelColor=f6f6f6&color=9cf&style=flat&label=citations"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_total_citation.json&labelColor=f6f6f6&color=9cf&style=flat&label=citations"></a>
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_total_publications.json&labelColor=f6f6f6&color=9cf&style=flat&label=publications"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_total_publications.json&labelColor=f6f6f6&color=9cf&style=flat&label=publications"></a>
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_h_index.json&labelColor=f6f6f6&color=9cf&style=flat&label=h-index"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_h_index.json&labelColor=f6f6f6&color=9cf&style=flat&label=h-index"></a>
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_i10index.json&labelColor=f6f6f6&color=9cf&style=flat&label=i10-index"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2F<your-github-user-name>.github.io@google-scholar-stats%2Fgs_data_i10index.json&labelColor=f6f6f6&color=9cf&style=flat&label=i10-index"></a>
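The snippets above all follow one pattern: the URL of a cached JSON file is percent-encoded and passed to the shields.io endpoint badge as the `url` parameter. As a sketch of that pattern (the helper `badge_html` is hypothetical, not part of this repo), the markup can be generated instead of edited by hand:

```python
from urllib.parse import quote


def badge_html(scholar_id: str, data_url: str, label: str) -> str:
    """Return badge markup in the same shape as the snippets in this README."""
    # Encode ':' and '/' but keep '@', matching the hand-written snippets.
    encoded = quote(data_url, safe="@")
    img_src = (
        "https://img.shields.io/endpoint?logo=Google%20Scholar"
        f"&url={encoded}&labelColor=f6f6f6&color=9cf&style=flat&label={quote(label)}"
    )
    profile = f"https://scholar.google.com/citations?user={scholar_id}"
    return f"<a href='{profile}'><img src=\"{img_src}\"></a>"


if __name__ == "__main__":
    # "alice" is a placeholder user name.
    print(badge_html(
        "GOOGLE_SCHOLAR_ID",
        "https://cdn.jsdelivr.net/gh/alice/alice.github.io@google-scholar-stats/gs_data_h_index.json",
        "h-index",
    ))
```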
Option 2: You can fork this repo into your own GitHub account, for example github.com/<your-github-user-name>/GH-ScholarBot/
- set up `_config.yml`: change the contents to be yours;
- in project settings > Actions > General > Workflow permissions, grant Read and write permissions;
- in project settings > Secrets and variables > Actions > Repository secrets, create a key named `GOOGLE_SCHOLAR_ID` whose value is the string after `user=` in your Google Scholar profile URL;
- the crawler will create a branch in the crawler project named `google-scholar-stats` with 4 JSON files: `gs_data.json` (full data for all your papers), `gs_data_h_index.json`, `gs_data_i10_index.json`, and `gs_data_total_citation.json`.
- If the crawler fails to do so, you can manually create a branch named `google-scholar-stats` from `main`. The content in this `google-scholar-stats` branch will be permanently cleared and replaced by the `json` files when the crawler is working.
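The small per-metric files are meant to be consumed by the shields.io endpoint badge, which expects a JSON object with `schemaVersion`, `label`, and `message`. The sketch below shows what such a payload could look like. It is an assumption about the file contents, not a copy of what the crawler writes, and `shields_payload` is a hypothetical name.

```python
import json


def shields_payload(label: str, value: int) -> str:
    """Serialize one metric in the shields.io endpoint badge schema."""
    return json.dumps({
        "schemaVersion": 1,     # required by the endpoint badge
        "label": label,         # left-hand text of the badge
        "message": str(value),  # right-hand text; must be a string
    })


if __name__ == "__main__":
    # Made-up value, for illustration only.
    print(shields_payload("h-index", 12))
```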
To use it in your .md file for your website pages:
In the following code, replace <your-github-user-name> and GOOGLE_SCHOLAR_ID with your own values.
Note: the code below differs from Option 1. It uses data under github.com/<your-github-user-name>/GH-ScholarBot/ rather than github.com/<your-github-user-name>/<your-github-user-name>.github.io/.
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2FGH-ScholarBot@google-scholar-stats%2Fgs_data_total_citation.json&labelColor=f6f6f6&color=9cf&style=flat&label=citations"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2FGH-ScholarBot@google-scholar-stats%2Fgs_data_total_citation.json&labelColor=f6f6f6&color=9cf&style=flat&label=citations"></a>
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2FGH-ScholarBot@google-scholar-stats%2Fgs_data_h_index.json&labelColor=f6f6f6&color=9cf&style=flat&label=h-index"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2FGH-ScholarBot@google-scholar-stats%2Fgs_data_h_index.json&labelColor=f6f6f6&color=9cf&style=flat&label=h-index"></a>
Use CDN for GitHub (delays in data-refresh might exist):
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2F<your-github-user-name>%2FGH-ScholarBot@google-scholar-stats%2Fgs_data_i10index.json&labelColor=f6f6f6&color=9cf&style=flat&label=i10-index"></a>
Use GitHub.com:
<a href='https://scholar.google.com/citations?user=GOOGLE_SCHOLAR_ID'><img src="https://img.shields.io/endpoint?logo=Google%20Scholar&url=https%3A%2F%2Fgithub.com%2F<your-github-user-name>%2FGH-ScholarBot@google-scholar-stats%2Fgs_data_i10index.json&labelColor=f6f6f6&color=9cf&style=flat&label=i10-index"></a>
Available in gs_data.json. You can be creative and do whatever you want with it!
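As one example of what can be done with the full data, the sketch below lists the most-cited papers. The field names (`publications`, `bib`, `title`, `num_citations`) are an assumption based on the author record produced by the scholarly package; please check them against your own gs_data.json before relying on them. The sample record is made up.

```python
def top_cited(data: dict, n: int = 3) -> list:
    """Return up to `n` (title, citations) pairs, most cited first.

    Assumes each publication has bib.title and num_citations, and that
    `publications` is either a dict keyed by publication id or a plain list.
    """
    pubs = data.get("publications", {})
    items = pubs.values() if isinstance(pubs, dict) else pubs
    ranked = sorted(
        ((p.get("bib", {}).get("title", "?"), int(p.get("num_citations", 0))) for p in items),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


if __name__ == "__main__":
    # Made-up sample in the assumed layout; load your own file with json.load instead.
    sample = {
        "publications": {
            "id1": {"bib": {"title": "Paper A"}, "num_citations": 5},
            "id2": {"bib": {"title": "Paper B"}, "num_citations": 40},
        }
    }
    for title, cites in top_cited(sample):
        print(f"{cites:>5}  {title}")
```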
The current script will skip the free proxies and use direct access when it encounters a connection issue. Therefore, if the run is stuck, the most likely cause is that the current GitHub runner is temporarily blocked by Google. Here are some solutions for you to try:
- A direct but non-free solution is to subscribe to a paid proxy. Please refer to scholarly-python-package.
- It looks like the scheduled workflow runner of GitHub is more prone to being detected and blocked by Google, and manually rerunning a failed job (several times until it succeeds) has a greater success rate (it seems that the manual jobs are on a different runner; I might be wrong, but it does work). Usually, you don't need to run this repo so frequently. For me, once a week should be sufficient.
- I prepared two automatic workflows, one with proxy by default and one without. Both can also be triggered manually in GitHub Actions.
- For automatic updates (workflow), please take a look at the workflow file and play with `schedule: - cron: '42 2 * * 0'`, trying a different frequency and time. I don't know the optimal combination yet.
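The fallback idea described in this section (try the proxy route, fall back to direct access, and retry a few times until one succeeds) can be sketched generically as follows. This is not the crawler's actual code: `fetch_with_fallback` and the fetcher callables are hypothetical stand-ins for whatever performs the real request.

```python
import time
from typing import Callable, Sequence


def fetch_with_fallback(fetchers: Sequence[Callable[[], dict]],
                        attempts: int = 3, delay: float = 0.0) -> dict:
    """Try each fetcher in order; repeat the whole round up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        for fetch in fetchers:
            try:
                return fetch()  # the first success wins
            except Exception as exc:  # e.g. a connection issue or a block
                last_error = exc
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next round
    raise RuntimeError("all fetch attempts failed") from last_error


if __name__ == "__main__":
    def via_proxy() -> dict:
        raise ConnectionError("proxy unreachable")  # simulated failure

    def direct() -> dict:
        return {"hindex": 1}  # made-up result

    print(fetch_with_fallback([via_proxy, direct]))
```

Ordering the fetchers as proxy first, direct second matches the behaviour described above; a real setup would likely use a non-zero `delay` between rounds.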