README.md: 78 additions & 5 deletions
@@ -1,13 +1,29 @@
1
-
2
-
# GitHub metadata importer
1
+
# GitHub metadata importer
3
2
4
3
This tool imports GitHub metadata from repositories into the Software Observatory database. It identifies the GitHub repositories listed in the database entries, retrieves metadata for each repository using the [GitHub metadata API](https://github.com/inab/github-metadata-api), and stores the retrieved metadata back in the database.
5
4
6
-
If you are looking for a tool to import metadata from a GitHub repository, you can directly use the [GitHub metadata importer](https://github.com/inab/github-metadata-api) tool. More specifically, use [this endpoint](https://observatory.openebench.bsc.es/github-metadata-api/api-docs/#/Metadata%20Extractor%20for%20FAIRsoft/post_metadata_user).
5
+
If you are looking for a tool to import metadata from a single GitHub repository directly, you can use the [GitHub metadata API](https://github.com/inab/github-metadata-api) itself. In particular, this importer relies on [this endpoint](https://observatory.openebench.bsc.es/github-metadata-api/api-docs/#/Metadata%20Extractor%20for%20FAIRsoft/post_metadata_user).
6
+
7
+
## Features
8
+
9
+
The importer includes several safeguards to make long runs more robust:
10
+
11
+
- `--resume` support to continue interrupted runs
12
+
- local cache of successfully imported repositories
13
+
- local cache of failed repositories
14
+
- repository listing cache to avoid rebuilding the input list on every run
15
+
- retry support for previously failed repositories
16
+
- delays with jitter between requests
17
+
- exponential backoff for rate limiting and transient server errors
18
+
- JSONL run log for debugging and auditing
19
+
20
+
These mechanisms are especially useful for reducing the impact of `429 Too Many Requests` errors and for avoiding repeated work after interruptions.
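As an illustration of the backoff-with-jitter idea listed above, here is a minimal sketch. It is not taken from the importer's code: `backoff_delay`, `call_with_backoff`, the set of retryable status codes, and the default `base`, `cap`, and `max_attempts` values are all assumptions chosen for the example.

```python
import random
import time

# Assumed set of statuses worth retrying: rate limiting plus transient server errors.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Return a randomized delay in seconds for a 0-based retry attempt.

    The upper bound doubles with each attempt (exponential backoff) and is
    capped at `cap`; the actual delay is drawn uniformly below that bound
    ("full jitter") so that parallel clients do not retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_backoff(request, max_attempts=5, sleep=time.sleep):
    """Call `request()` until it returns a non-retryable status.

    `request` is a hypothetical zero-argument callable returning a
    (status_code, body) tuple. The last response is returned if every
    attempt hits a retryable status.
    """
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status, body
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting; a real importer might also honor a `Retry-After` header when the server sends one.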
7
21
8
22
## Installation
9
23
10
-
The tool is written in Python 3.12 and requires the packages in the file `requirements.txt`. You can install the required packages using the following command:
24
+
The tool is written in Python 3.12 and requires the packages listed in `requirements.txt`.
25
+
26
+
Install dependencies with:
11
27
12
28
```bash
13
29
pip install -r requirements.txt
@@ -34,4 +50,61 @@ To run the tool, execute the following command:
34
50
35
51
```bash
36
52
python3 main.py
37
-
```
53
+
```
54
+
55
+
To **resume an interrupted run**:
56
+
57
+
```bash
58
+
python3 main.py --resume
59
+
```
60
+
This skips repositories already marked as completed in the local import cache.
61
+
62
+
63
+
To **resume and retry previous failures**:
64
+
65
+
```bash
66
+
python3 main.py --resume --retry-failed
67
+
```
68
+
This retries repositories that failed in earlier runs while still skipping successful ones.
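The interaction between `--resume` and `--retry-failed` can be sketched as a simple filter over the repository listing. This is only an illustration: `select_repositories` and its arguments are hypothetical names, and the sketch assumes that `--resume` on its own also skips repositories recorded in the failure cache.

```python
def select_repositories(listing, completed, failed, resume=False, retry_failed=False):
    """Return the repositories to process in this run, in listing order.

    `completed` and `failed` stand in for the local import and failure
    caches (assumed here to be sets of repository identifiers).
    """
    if not resume:
        # Without --resume, every listed repository is processed again.
        return list(listing)

    selected = []
    for repo in listing:
        if repo in completed:
            continue  # already imported successfully in an earlier run
        if repo in failed and not retry_failed:
            continue  # assumption: failures are only retried on request
        selected.append(repo)
    return selected
```

With a listing of three repositories where one succeeded and one failed earlier, `resume=True` would leave only the untouched repository, while adding `retry_failed=True` would also bring back the failed one.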
69
+
70
+
To **refresh the cached repository listing**:
71
+
72
+
```bash
73
+
python3 main.py --refresh-listing-cache
74
+
```
75
+
This rebuilds the list of repositories from PRETOOLS instead of using the local listing cache.
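A listing cache of this kind can be sketched as "load from disk unless asked to refresh". The function name `load_listing`, the JSON file format, and the `build_listing` callable (standing in for whatever queries PRETOOLS) are assumptions for illustration, not the importer's actual implementation.

```python
import json
from pathlib import Path


def load_listing(cache_path, build_listing, refresh=False):
    """Return the repository listing, using a JSON cache file when possible.

    `build_listing` is a hypothetical zero-argument callable that performs
    the expensive rebuild; it runs only when no cache file exists yet or
    when `refresh` is True (mirroring --refresh-listing-cache).
    """
    path = Path(cache_path)
    if path.exists() and not refresh:
        return json.loads(path.read_text(encoding="utf-8"))

    listing = build_listing()
    path.write_text(json.dumps(listing), encoding="utf-8")
    return listing
```

Rewriting the cache file on every rebuild keeps later runs consistent with the most recent listing.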
76
+
77
+
To **limit the number of repositories processed**: