Commit 76a2571

add more shields to README.rst
add new docs page explaining available datasets
1 parent 4dc7d03 commit 76a2571

3 files changed, +47 -8 lines changed

README.rst

Lines changed: 7 additions & 8 deletions
@@ -3,8 +3,15 @@
 
 ======
 
+.. image:: https://img.shields.io/badge/Homepage-WIP-blue
+.. image:: https://img.shields.io/badge/-Documentation-blue
+   :target: https://docs.aura.sourcecode.ai/
+.. image:: https://img.shields.io/badge/docker-SourceCodeAI/aura-blue
+   :target: https://hub.docker.com/r/sourcecodeai/aura
+.. image:: https://img.shields.io/github/license/SourceCode-AI/aura?color=blue
 .. image:: https://travis-ci.com/SourceCode-AI/aura.svg?branch=dev
 
+
 Security auditing and static code analysis
 =================================================
 
@@ -18,14 +25,6 @@ Project goals:
 * allow researchers to scan code repositories on a large scale, create datasets and perform analysis to further advance research in the area of vulnerable and malicious code dependencies
 
 
-============= ======
-License       GPLv3
-Documentation https://docs.aura.sourcecode.ai/
-Homepage      WIP
-Docker        https://hub.docker.com/r/sourcecodeai/aura
-============= ======
-
-
 Why Aura?
 ---------
 
docs/source/cookbook/index.rst

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ Misc
 ----
 
 .. toctree::
+   misc/datasets.rst
    python_2.rst
    misc/visitors.rst
    misc/plugin_templates.rst
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
Datasets
========

The SourceCode.AI team regularly publishes several datasets. They are not required to run Aura, but using them and keeping them up to date can produce much more accurate results. Below is an overview of the available datasets.


PyPI download stats
-------------------

This dataset contains aggregated statistics of package downloads from the official PyPI repository for the last 30 days. Each entry holds the name of a package and how many times it was downloaded in that 30-day period. The statistics are produced by aggregating the network logs that are `published as an open dataset on Google BigQuery <https://packaging.python.org/guides/analyzing-pypi-package-downloads/>`_. The main use is to calculate the popularity of a given package, which is used in several places, such as computing the aura score or typosquatting protection, where it is suspicious for a package with a very low number of downloads to have a name very similar to that of a package with a very high number of downloads. Google BigQuery offers a free tier that is priced by the amount of data analyzed; as a result, the current refresh period for this dataset is around 3 days.
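
The typosquatting heuristic described above can be sketched roughly as follows. This is only an illustration, not Aura's actual implementation: the download counts, the ``similar_popular_packages`` helper, the similarity measure (``difflib.SequenceMatcher``) and all thresholds are assumptions made for the example, and the real file format of the dataset is not covered here.

```python
from difflib import SequenceMatcher

# Hypothetical download counts; in practice these would be loaded from the dataset.
DOWNLOADS = {
    "requests": 50_000_000,
    "numpy": 40_000_000,
    "flask": 9_000_000,
    "reqeusts": 12,
}


def similar_popular_packages(name, downloads, min_ratio=0.85, low=1_000, high=1_000_000):
    """Return popular packages whose names closely resemble a rarely downloaded one.

    The thresholds (min_ratio, low, high) are illustrative assumptions.
    """
    if downloads.get(name, 0) >= low:
        return []  # the package itself is downloaded often enough; not suspicious

    matches = []
    for other, count in downloads.items():
        if other == name or count < high:
            continue  # only compare against very popular packages
        ratio = SequenceMatcher(None, name, other).ratio()
        if ratio >= min_ratio:
            matches.append((other, ratio))

    # Most similar names first
    return sorted(matches, key=lambda item: item[1], reverse=True)


suspects = similar_popular_packages("reqeusts", DOWNLOADS)
```

A non-empty result means the package name is close to at least one very popular package while the package itself is rarely downloaded, which is the suspicious pattern the paragraph describes.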


PyPI Package list
-----------------

This dataset contains a list of all packages present in the offline PyPI mirror that the Aura team uses to conduct global PyPI scans. It is updated every hour, when a mirror synchronization is triggered. Aura does not currently use this dataset directly.


PyPI dependency list
--------------------

This is an aggregation of the package JSON metadata files, from which a list of dependencies on other packages is extracted. This dataset is generated every hour, when a mirror synchronization is triggered. Aura does not use it directly.


PyPI reverse dependencies list
------------------------------

This is an aggregation and normalization of the previous PyPI dependency list that reverses the direction of the dependencies, i.e. for each package it lists the other packages that have that package in their dependencies. Aura uses this dataset to compute the score and importance of a package: the more packages include it in their dependencies, the higher the importance and `aura score` of the package.
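
As a rough sketch of what reversing the dependency direction means, the snippet below inverts a small in-memory mapping. The input structure, the ``reverse_dependencies`` helper and the sample packages are hypothetical, and "normalization" is assumed here to mean PEP 503 style name normalization; the actual dataset format and pipeline may differ.

```python
import re
from collections import defaultdict


def normalize(name):
    """Normalize a package name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


def reverse_dependencies(dependencies):
    """Invert {package: [dependency, ...]} into {dependency: {dependent packages}}."""
    reverse = defaultdict(set)
    for package, deps in dependencies.items():
        for dep in deps:
            reverse[normalize(dep)].add(normalize(package))
    return dict(reverse)


# Hypothetical sample of a dependency list
sample = {
    "Flask": ["Werkzeug", "Jinja2"],
    "requests_toolbelt": ["requests"],
    "httpie": ["requests", "Pygments"],
}

reverse = reverse_dependencies(sample)

# A simple importance signal: how many packages depend on each package
importance = {name: len(dependents) for name, dependents in reverse.items()}
```

Counting the dependents per package, as in ``importance``, is one simple way to express the idea that a package included by more packages is more important.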

========================= ========================================================= ============= ============
Dataset name              URL                                                       Update period Note
========================= ========================================================= ============= ============
MD5 checksums             https://cdn.sourcecode.ai/aura/md5_checksums.txt          ~1 hour       Contains MD5 checksums of all published datasets
PyPI package list         https://cdn.sourcecode.ai/aura/pypi_package_list.gz       ~1 hour
PyPI download stats       https://cdn.sourcecode.ai/aura/pypi_download_stats.gz     ~3 days
PyPI dependency list      https://cdn.sourcecode.ai/aura/dependency_list.gz         ~1 hour
PyPI reverse dependencies http://cdn.sourcecode.ai/aura/reverse_dependencies.gz     ~1 hour
Aura update dataset       https://cdn.sourcecode.ai/aura/aura_dataset.tgz           ~1 hour       Contains all the datasets required by aura in a single archive
========================= ========================================================= ============= ============
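
Since an MD5 checksum file is published alongside the datasets, a download can be checked against it. The following is a minimal sketch under stated assumptions: the layout of ``md5_checksums.txt`` is assumed to follow the common ``md5sum`` format (``<hex digest>  <file name>`` per line), which is not confirmed by this page, and the helper names are made up for the example.

```python
import hashlib
import urllib.request

BASE_URL = "https://cdn.sourcecode.ai/aura/"


def parse_checksums(text):
    """Parse lines of the assumed form '<md5 hex digest>  <file name>'."""
    checksums = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name
            checksums[name.lstrip("*")] = digest.lower()
    return checksums


def md5_matches(data, expected_hex):
    """Check whether the MD5 digest of the given bytes equals the expected hex digest."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()


def fetch_verified(name, base_url=BASE_URL):
    """Download a dataset and raise ValueError if its checksum is missing or wrong."""
    with urllib.request.urlopen(base_url + "md5_checksums.txt") as response:
        checksums = parse_checksums(response.read().decode("utf-8"))
    with urllib.request.urlopen(base_url + name) as response:
        data = response.read()
    if name not in checksums or not md5_matches(data, checksums[name]):
        raise ValueError(f"missing or mismatched checksum for {name}")
    return data
```

For example, ``fetch_verified("pypi_package_list.gz")`` would return the raw bytes of the package list only if the digest matches. MD5 is suitable here for detecting corrupted or incomplete downloads, not as protection against deliberate tampering.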
