docs/source/cookbook/misc/global_pypi_scan.rst (34 additions, 1 deletion)
@@ -17,6 +17,7 @@ The PyPI scan itself is done on a separate high-performance worker node. While i

The worker node has a fast SSD dedicated to caching the packages for the scan, which are prefetched from the offline PyPI mirror right before the scan starts. Once the prefetch completes, all packages are scanned by running Aura scans in parallel. All scripts used on the worker node are available under the ``files/`` directory at the root of the Aura repository.

The full list of published PyPI datasets is available at: https://cdn.sourcecode.ai/pypi_datasets/index/datasets.html
OS Windows 10 with Aura running inside WSL2 Ubuntu 18.04
OS Arch Linux (fully updated prior to scan)
===== =====

Description of the dataset
--------------------------

Data produced from global scans is distributed via magnet (torrent) links, with metadata hosted on the SourceCode.AI CDN. The dataset contains the following files:

- **dataset.zst** - Single-file dataset compressed using `ZSTD <https://facebook.github.io/zstd/>`_. Each line contains one compact JSON document for a single scanned PyPI package
- **joblog.txt** - Joblog file from GNU Parallel
- **input_packages.txt** - List of PyPI packages passed as input for the global PyPI scan
- **package_list.txt** - List of PyPI packages actually processed by Aura during the scan; every package listed in this file has an entry in ``dataset.zst``
- **checksums.md5.txt** - List of MD5 checksums for all files contained within the dataset
- **README.txt** - License and a copy of this description
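
A downloaded dataset can be checked against ``checksums.md5.txt`` before use. The following Python sketch assumes that the checksum file follows the common ``md5sum`` layout (one ``<hex digest>  <file name>`` pair per line); that layout is an assumption, not something confirmed by this description, and the helper names are purely illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse md5sum-style lines into a {file name: hex digest} mapping.

    Assumed layout: "<digest>  <name>"; a leading "*" on the name is dropped.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_dataset(directory, checksum_file="checksums.md5.txt"):
    """Return the names of files that are missing or whose digest differs."""
    directory = Path(directory)
    expected = parse_checksums((directory / checksum_file).read_text())
    failed = []
    for name, digest in expected.items():
        target = directory / name
        if not target.is_file() or md5_of_file(target) != digest:
            failed.append(name)
    return failed
```

Calling ``verify_dataset()`` with the directory that holds the downloaded files returns an empty list when every listed file is present and matches its checksum.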

You may have noticed that ``input_packages.txt`` and ``package_list.txt`` differ. The input file is generally larger because it lists all packages contained in our offline PyPI mirror at the start of a global scan. Some of these packages may not have any releases published, so Aura skips them during the actual scan. Other packages are missing from the output because the package archive is corrupted, the scan timeout was reached, or Aura crashed while scanning the package. This is why the input package list is generally larger than the list produced by Aura during and after the scan.
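
The packages that were passed as input but never processed can be listed with a simple set difference. This is a minimal sketch that assumes both files contain one package name per line and that names are spelled identically in both; the function names are hypothetical.

```python
def read_package_names(path):
    """Read one package name per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as fd:
        return {line.strip() for line in fd if line.strip()}


def skipped_packages(input_path="input_packages.txt",
                     processed_path="package_list.txt"):
    """Return the sorted names found in the input list but not in the processed list."""
    return sorted(read_package_names(input_path) - read_package_names(processed_path))
```

The result only says which packages were skipped, not why; the cause (no releases, corrupted archive, timeout, or crash) would have to be looked up elsewhere, for example in ``joblog.txt``.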

To quickly process or glance at the data, we highly recommend using the `jq data processor <https://stedolan.github.io/jq/>`_.
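
For processing that goes beyond a quick look, the line-per-package layout can also be consumed from Python. The sketch below makes no assumption about the JSON schema (which is not described here) and only counts top-level keys; it expects the already decompressed data as an iterable of text lines, for example standard input fed by ``zstd -dc dataset.zst``. The function names and the script name mentioned afterwards are illustrative.

```python
import json
import sys
from collections import Counter


def iter_records(lines):
    """Yield one parsed JSON document per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def key_frequencies(records):
    """Count how often each top-level key occurs across the records."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


def main(stream=sys.stdin):
    """Print each top-level key with the number of records that contain it."""
    for key, count in key_frequencies(iter_records(stream)).most_common():
        print(f"{key}\t{count}")
```

With a call to ``main()`` added at the end of a script saved as, say, ``inspect_dataset.py``, the dataset can be streamed without unpacking it to disk: ``zstd -dc dataset.zst | python inspect_dataset.py``.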

The dataset is released under the `CC BY-NC 4.0 license <https://creativecommons.org/licenses/by-nc/4.0/>`_.
Use the following citation to give attribution to the original research paper:

::

    @misc{Carnogursky2019thesis,
        AUTHOR = "CARNOGURSKY, Martin",
        TITLE = "Attacks on package managers [online]",
        YEAR = "2019 [cit. 2020-11-02]",
        TYPE = "Bachelor Thesis",
        SCHOOL = "Masaryk University, Faculty of Informatics, Brno",
        SUPERVISOR = "Vit Bukac",
        URL = "Available at WWW <https://is.muni.cz/th/y41ft/>",
    }