Commit dd6ea9f

add documentation describing the global pypi aura scans
1 parent d0617e3

File tree

4 files changed, +32 -0 lines changed

docs/source/architecture.vsdx

62.5 KB
Binary file not shown.

docs/source/cookbook/index.rst

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ Misc
 ----
 
 .. toctree::
+   misc/global_pypi_scan.rst
    misc/datasets.rst
    python_2.rst
    misc/visitors.rst

docs/source/cookbook/misc/datasets.rst

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _datasets:
+
 Datasets
 ========
 
docs/source/cookbook/misc/global_pypi_scan.rst

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

Global PyPI Scan
================

The Aura team conducts periodic scans of the PyPI repository on a best-effort basis, scanning the latest version of every published package. The diagram below depicts an overview of the architecture used to conduct these scans:
.. image:: /_static/imgs/architecture.png


The central piece of this setup is a Synology server that hosts an offline copy of the PyPI repository. This offline repository is synced with the official PyPI repository every hour using `bandersnatch <https://pypi.org/project/bandersnatch/>`_ via custom `dataset update scripts <https://gitlab.com/SourceCode.AI/aura-dataset-update>`_. The synchronization is invoked by scheduled Gitlab CI pipelines running on a runner hosted on the server. After the PyPI synchronization finishes, new :ref:`datasets` are re-generated where applicable and uploaded to our CDN for public access.
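
For illustration, here is a minimal sketch of what one hourly synchronization round boils down to; the config path is hypothetical, and the real logic lives in the dataset update scripts linked above:

.. code-block:: python

   #!/usr/bin/env python3
   """Sketch of one hourly PyPI mirror synchronization round."""
   import subprocess
   import sys

   # Hypothetical location of the bandersnatch config on the NAS.
   CONFIG = "/volume1/pypi/bandersnatch.conf"


   def sync_mirror() -> None:
       # ``bandersnatch mirror`` fetches only the packages that changed
       # since the previous run, which keeps an hourly schedule cheap
       # after the initial full synchronization.
       subprocess.run(["bandersnatch", "--config", CONFIG, "mirror"], check=True)


   if __name__ == "__main__":
       try:
           sync_mirror()
       except subprocess.CalledProcessError as exc:
           sys.exit(exc.returncode)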
.. sidebar:: SSD cache

   We scan only the latest package releases (of all distribution types such as wheels, sdists, bdists, etc.) available at the time of the scan, so the disk space requirements of the SSD cache are lower than those of the main offline PyPI repository.
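
As a hedged sketch of how the files of each package's latest release could be selected for prefetching, assuming JSON metadata mirroring is enabled in bandersnatch (the mirror mount point is hypothetical):

.. code-block:: python

   #!/usr/bin/env python3
   """Sketch: list the files of a package's latest release for prefetch."""
   import json
   from pathlib import Path

   # Hypothetical mount point of the offline PyPI mirror on the NAS.
   MIRROR = Path("/mnt/pypi/web")


   def latest_release_files(package: str) -> list[str]:
       # bandersnatch mirrors the PyPI JSON API under ``web/json/<package>``.
       meta = json.loads((MIRROR / "json" / package).read_text())
       version = meta["info"]["version"]  # latest released version
       # Every distribution type (wheel, sdist, ...) of that version.
       return [f["url"] for f in meta["releases"][version]]


   if __name__ == "__main__":
       print("\n".join(latest_release_files("requests")))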
The PyPI scan itself is done on a separate high-performance worker node. While it is possible to scan the offline PyPI repository directly, we opted for a few changes to increase performance and avoid problems such as network outages: we observed that network latency and the time needed to transfer PyPI packages from the NAS to the worker node severely impacted the performance and total run time of the scan.
The worker node has a fast SSD disk dedicated to caching the packages for the scan; these are prefetched from the offline PyPI mirror right before the scan starts. After the prefetch is completed, a full scan of all packages is conducted by running parallel Aura scans, sketched below. All scripts used on the worker node are available under the ``files/`` directory at the root of the Aura repository.
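
A minimal sketch of the parallel scan step over the prefetched cache; the cache path, worker count, and output handling are illustrative, and the actual scripts live under ``files/``:

.. code-block:: python

   #!/usr/bin/env python3
   """Sketch of the parallel scan step over the prefetched SSD cache."""
   import subprocess
   from concurrent.futures import ThreadPoolExecutor
   from pathlib import Path

   # Hypothetical mount point of the local SSD package cache.
   CACHE_DIR = Path("/mnt/cache/packages")
   # For example, one scan process per logical core of the 12-core CPU.
   WORKERS = 24


   def scan_package(archive: Path) -> int:
       # Each package archive is scanned by an independent ``aura scan``
       # process, writing one result file per package.
       out = archive.with_name(archive.name + ".aura-result")
       with out.open("wb") as fh:
           return subprocess.run(["aura", "scan", str(archive)], stdout=fh).returncode


   if __name__ == "__main__":
       archives = [p for p in CACHE_DIR.rglob("*") if p.is_file()]
       with ThreadPoolExecutor(max_workers=WORKERS) as pool:
           failed = sum(rc != 0 for rc in pool.map(scan_package, archives))
       print(f"scanned {len(archives)} packages, {failed} failed")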
Technical specification of the worker node:

===== =====
CPU   AMD Ryzen 9 3900X 12-Core Processor
RAM   HyperX 32GB Kit DDR4 3200MHz CL16 XMP
GPU   SAPPHIRE NITRO+ Radeon RX 580 OC 8G
Disk  2x Intel 660p M.2 2TB SSD NVMe
OS    Windows 10 with Aura running inside WSL2 Ubuntu 18.04
===== =====
