A repository for the paper Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies, which is also accompanied by its dedicated website.
In this repository, we demonstrate preprocessing, injection, feature generation, and learning using two datasets as illustrative examples. One dataset is MovieLens, a bipartite link stream, while the other is UCI Messages, a unipartite link stream.
Reference datasets can be downloaded from: http://konect.cc/networks/ and http://snap.stanford.edu/jodie/.
Large scale datasets and their generated features are hosted on the ComplexNetworks team (LIP6, Sorbonne University) lab server at: http://data.complexnetworks.fr/TGF/data.
Preprocess the link stream to have it in the format of (
Folder "1-PreprocessingOfDatasets" contains the notebooks to preprocess unipartite and bipartite networks.
If ground truth is unavailable, anomalous links are injected.
Formally, injected links are created by randomly sampling a timestamp from the link sequence
Folder "2-Injection" contains the scripts to inject anomalies in unipartite and bipartite networks.
Suppose we have a unipartite link stream named "data.gz". To inject 10% anomalies in it, the following scripts should be executed in sequence:
./build_injectable.sh data 20
./inject.sh data 10
./check.sh data 10
Suppose we have a bipartite link stream named "data.gz". To inject 10% anomalies in it, the following scripts should be executed in sequence:
./bip_build_injectable.sh data 20
./inject.sh data 10
./bip_check.sh data 10
Note: We set 20% at the start to have a larger sample of the sets $T$ , $U$ , and $V$ , but any value could be set as long as it is greater than 10%.
Given a link stream (
Folder "3-FeatureGeneration" contains the code to generate the features given a link stream. Following is the command to be used:
zcat input.gz | python3 main.py [-H s] [-G d] [-bip] [-int] [-check N] | gzip -c > output.json.gz
-H s or -G d: Either must be chosen to set the type of the history graph and the size (
-bip: Should be set if the network is bipartite
-int: A switch indicating if node labels are integers
-check N: Enforces a verification of data structures and computations every N lines (costly)
Example on unipartite data_10_injected.gz: zcat data_10_injected.gz | python3 main.py -H 1000 | gzip -c > data_10_injected_H1000.gz
$\rightarrow$ Generates the$O(1)$ features of the link stream data_10_injected.gz with$H$ -type history graph of size$s$ = 1000
Example on bipartite data_10_injected.gz: zcat data_10_injected.gz | python3 main.py -H 1000 -bip | gzip -c > data_10_injected_H1000.gz
$\rightarrow$ Generates the$O(1)$ features of the link stream data_10_injected.gz with$H$ -type history graph of size$s$ = 1000
Example on unipartite data_10_injected.gz: zcat data_10_injected.gz | python3 main.py -G 50 | gzip -c > data_10_injected_G50.gz
$\rightarrow$ Generates the$O(1)$ features of the link stream data_10_injected.gz with$G$ -type history graph of duration$d$ = 50 (suppose$t$ is in seconds, thus 50 represents 50 seconds in the past)
Example on bipartite data_10_injected.gz: zcat data_10_injected.gz | python3 main.py -G 50 -bip | gzip -c > data_10_injected_G50.gz
$\rightarrow$ Generates the$O(1)$ features of the link stream data_10_injected.gz with$G$ -type history graph of duration$d$ = 50 (suppose$t$ is in seconds, thus 50 represents 50 seconds in the past)
Note: zcat is not mandatory, the command to generate features could also be executed as follows: cat input.txt | python3 main.py [-H s] [-G d] [-bip] [-int] [-check N] > output.json
Given a link stream (
Folder "4-LearningAndTesting" contains the notebooks to conduct the learning (classical and with sliding windows) and testing process of the trained model in a unipartite (UCI Messages) and a bipartite network (MovieLens). In this folder, there are 5 main subfolders:
- HTypeHistoryGraphs: Learning is performed on multiple instances of
$H$ -type history graphs of varying sizes. - GTypeHistoryGraphs: Learning is performed on multiple instances of
$G$ -type history graphs of varying durations. - CombiningHTypeHistoryGraphs: Learning is performed on multiple instances of
$H$ -type history graphs of varying sizes combined together. - CombiningGTypeHistoryGraphs: Learning is performed on multiple instances of
$G$ -type history graphs of varying durations combined together. - CombiningHandGTypeHistoryGraphs: Learning is performed on multiple instances of
$H$ -type and$G$ -type history graphs of varying sizes and durations combined together.
Note: If learning is to be done on large dynamic networks, refer to the folder "4-LearningAndTesting-LargeNetworks", where a sampling technique is applied initially so the features are not entirely loaded into the memory and a chunking technique is used for testing. Similarly is the case for TGF with sliding windows.
If you find this repository useful in your research, please cite our work and consider giving the repository a star!
@inproceedings{latapy2025trivial,
title={Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies},
author={Latapy, Matthieu and Rajeh, Stephany},
booktitle={2025 IEEE International Conference on Data Mining (ICDM)},
pages={1330--1339},
year={2025},
doi={10.1109/ICDM65498.2025.00142}
}If you have any questions, please do not hesitate to reach out to us at stephany.rajeh@efrei.fr and matthieu.latapy@lip6.fr