Commit 052cce8

add rate limits and cache of repo listing and enriched repos
1 parent 5613fb3 commit 052cce8

3 files changed

Lines changed: 534 additions & 107 deletions

README.md

Lines changed: 78 additions & 5 deletions
@@ -1,13 +1,29 @@
1-
2-
# GitHub metadata importer
1+
# GitHub metadata importer
32

43
This tool imports GitHub metadata from repositories into the Software Observatory database. It identifies the GitHub repositories listed in the database entries, retrieves metadata for each repository using the [GitHub metadata API](https://github.com/inab/github-metadata-api), and stores the retrieved metadata back in the database.
54

6-
If you are looking for a tool to import metadata from a GitHub repository, you can directly use the [GitHub metadata importer](https://github.com/inab/github-metadata-api) tool. More specifically, use [this endpoint](https://observatory.openebench.bsc.es/github-metadata-api/api-docs/#/Metadata%20Extractor%20for%20FAIRsoft/post_metadata_user).
5+
If you only need to import metadata from a single GitHub repository, you can use the [GitHub metadata API](https://github.com/inab/github-metadata-api) directly. This importer relies on [this endpoint](https://observatory.openebench.bsc.es/github-metadata-api/api-docs/#/Metadata%20Extractor%20for%20FAIRsoft/post_metadata_user) of that API.
6+
7+
## Features
8+
9+
The importer includes several safeguards to make long runs more robust:
10+
11+
- `--resume` support to continue interrupted runs
12+
- local cache of successfully imported repositories
13+
- local cache of failed repositories
14+
- repository listing cache to avoid rebuilding the input list on every run
15+
- retry support for previously failed repositories
16+
- delays with jitter between requests
17+
- exponential backoff for rate limiting and transient server errors
18+
- JSONL run log for debugging and auditing
19+
20+
These mechanisms reduce the impact of `429 Too Many Requests` errors and avoid repeating work after an interruption.
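
The delay, jitter, and exponential backoff features listed above could work roughly as in the following sketch. It is not taken from the importer's source: the function names `backoff_delay` and `call_with_retries`, the set of retryable status codes, and the 60-second cap are all assumptions made for illustration. Only the defaults of 1.5 seconds and 6 attempts come from the documented `--delay` and `--max-retries` options.

```python
import random
import time

# Assumed set of statuses worth retrying: rate limiting plus transient server errors.
RETRYABLE = {429, 500, 502, 503, 504}


def backoff_delay(attempt, base_delay=1.5, cap=60.0):
    """Seconds to wait after failed attempt number `attempt` (0-based).

    The wait doubles with each attempt up to `cap`, and a random jitter of
    up to `base_delay` seconds is added so retries do not line up.
    """
    exponential = min(cap, base_delay * (2 ** attempt))
    return exponential + random.uniform(0, base_delay)


def call_with_retries(request, max_retries=6, base_delay=1.5, sleep=time.sleep):
    """Call `request()` until it returns a non-retryable status.

    `request` is a hypothetical callable returning a (status, body) tuple.
    At most `max_retries` attempts are made; the last result is returned
    even if it is still a retryable error.
    """
    status, body = None, None
    for attempt in range(max_retries):
        status, body = request()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_retries - 1:
            sleep(backoff_delay(attempt, base_delay))
    return status, body
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a real run would use the default `time.sleep`.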
721

822
## Installation
923

10-
The tool is written in Python 3.12 and requires the packages in the file `requirements.txt`. You can install the required packages using the following command:
24+
The tool is written in Python 3.12 and requires the packages listed in `requirements.txt`.
25+
26+
Install dependencies with:
1127

1228
```bash
1329
pip install -r requirements.txt
@@ -34,4 +50,61 @@ To run the tool, execute the following command:
3450

3551
```bash
3652
python3 main.py
37-
```
53+
```
54+
55+
To **resume an interrupted run**:
56+
57+
```bash
58+
python3 main.py --resume
59+
```
60+
This skips repositories already marked as completed in the local import cache.
61+
62+
63+
To **resume and retry previous failures**:
64+
65+
```bash
66+
python3 main.py --resume --retry-failed
67+
```
68+
This retries repositories that failed in earlier runs while still skipping successful ones.
69+
70+
To **refresh the cached repository listing**:
71+
72+
```bash
73+
python3 main.py --refresh-listing-cache
74+
```
75+
This rebuilds the list of repositories from PRETOOLS instead of using the local listing cache.
76+
77+
To **limit the number of repositories processed**:
78+
79+
```bash
80+
python3 main.py --limit 20
81+
```
82+
83+
For a **slower, safer execution**:
84+
85+
```bash
86+
python3 main.py --resume --delay 3 --max-retries 8
87+
```
88+
89+
### Command-line options
90+
91+
The importer supports the following options:
92+
* `--resume`: skip repositories already completed in the import cache.
93+
* `--retry-failed`: when used with `--resume`, include repositories that failed in previous runs.
94+
* `--cache-file`: path to the import cache file. Default: `github_import_cache.json`.
95+
* `--listing-cache-file`: path to the repository listing cache file. Default: `repos_to_import.json`.
96+
* `--refresh-listing-cache`: rebuild the repository list from PRETOOLS instead of using the local listing cache.
97+
* `--run-log-file`: path to the JSONL run log file. Default: `import_run.jsonl`.
98+
* `--delay`: base delay in seconds between requests. Default: 1.5.
99+
* `--max-retries`: maximum number of attempts per repository request. Default: 6.
100+
* `--limit`: maximum number of repositories to process in the current run.
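
For reference, the options above could be declared with `argparse` along these lines. This is a sketch reconstructed from the option list only, not the actual contents of `main.py`; the function name `build_parser` and the `None` default for `--limit` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="GitHub metadata importer")
    parser.add_argument("--resume", action="store_true",
                        help="skip repositories already completed in the import cache")
    parser.add_argument("--retry-failed", action="store_true",
                        help="with --resume, include repositories that failed earlier")
    parser.add_argument("--cache-file", default="github_import_cache.json",
                        help="path to the import cache file")
    parser.add_argument("--listing-cache-file", default="repos_to_import.json",
                        help="path to the repository listing cache file")
    parser.add_argument("--refresh-listing-cache", action="store_true",
                        help="rebuild the repository list from PRETOOLS")
    parser.add_argument("--run-log-file", default="import_run.jsonl",
                        help="path to the JSONL run log file")
    parser.add_argument("--delay", type=float, default=1.5,
                        help="base delay in seconds between requests")
    parser.add_argument("--max-retries", type=int, default=6,
                        help="maximum number of attempts per repository request")
    # Assumption: no limit when the option is omitted.
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of repositories to process")
    return parser
```

With this declaration, `argparse` exposes `--retry-failed` as `args.retry_failed` and `--max-retries` as `args.max_retries`.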
101+
102+
## Local cache files
103+
104+
During execution, the importer creates and updates a few local files:
105+
106+
* `repos_to_import.json`: cached list of repository URLs to process.
107+
* `github_import_cache.json`: cache of completed and failed repositories.
108+
* `import_run.jsonl`: append-only run log with one JSON record per processed repository.
109+
110+
These files allow the importer to resume work safely and avoid repeating already completed imports.
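
To make the interaction between these files and the `--resume`, `--retry-failed`, and `--limit` options more concrete, here is a minimal sketch. The on-disk layout of `github_import_cache.json` is not documented here, so the `{"completed": [...], "failed": [...]}` structure, the log record fields, and the helper names `load_cache`, `select_repos`, and `append_log` are all assumptions. The sketch also assumes that `--resume` alone skips previously failed repositories, which is how the option descriptions read.

```python
import json
from pathlib import Path


def load_cache(path):
    """Load the import cache, or return an empty one if the file is missing.

    Assumed layout: {"completed": [urls], "failed": [urls]}.
    """
    cache_path = Path(path)
    if not cache_path.exists():
        return {"completed": [], "failed": []}
    return json.loads(cache_path.read_text(encoding="utf-8"))


def select_repos(repos, cache, resume=False, retry_failed=False, limit=None):
    """Pick the repositories to process in this run, preserving input order."""
    if resume:
        skip = set(cache["completed"])
        if not retry_failed:
            # Without --retry-failed, earlier failures are skipped as well.
            skip |= set(cache["failed"])
        repos = [repo for repo in repos if repo not in skip]
    return repos if limit is None else repos[:limit]


def append_log(path, record):
    """Append one JSON record as a single line to the JSONL run log."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Because the log is opened in append mode and each record occupies one line, an interrupted run leaves earlier records intact.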

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
0.0.3
1+
0.1.0
