TGF

Repository for the paper "Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies", which also has a dedicated website.

In this repository, we demonstrate preprocessing, injection, feature generation, and learning using two datasets as illustrative examples. One dataset is MovieLens, a bipartite link stream, while the other is UCI Messages, a unipartite link stream.

Reference datasets can be downloaded from: http://konect.cc/networks/ and http://snap.stanford.edu/jodie/.

Large-scale datasets and their generated features are hosted on the lab server of the ComplexNetworks team (LIP6, Sorbonne University) at: http://data.complexnetworks.fr/TGF/data.

Phase 1: Preprocessing

Description

Preprocess the link stream so that it takes the form ($t_i, u_i, v_i$) with $t_i < t_{i+1}$, and remove loops, directions, and duplicates.
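The cleaning steps above can be sketched in a few lines of Python. This is only an illustration for the unipartite case, not the code of the notebooks: the function name `preprocess` is hypothetical, a duplicate is assumed to be a repeated ($t, u, v$) triplet once direction is ignored, and links sharing a timestamp are kept in their input order (the sketch does not enforce a strict inequality between timestamps).

```python
def preprocess(links):
    """Return a cleaned, time-sorted list of undirected (t, u, v) triplets.

    Assumption: a duplicate is an identical (t, u, v) triplet after the
    direction has been dropped.
    """
    seen = set()
    cleaned = []
    for t, u, v in links:
        if u == v:
            continue  # remove loops
        a, b = (u, v) if u <= v else (v, u)  # remove direction: canonical order
        key = (t, a, b)
        if key in seen:
            continue  # remove duplicates
        seen.add(key)
        cleaned.append(key)
    cleaned.sort(key=lambda link: link[0])  # stable sort by timestamp
    return cleaned


raw = [(3, "b", "a"), (1, "a", "a"), (2, "c", "b"), (3, "a", "b"), (1, "a", "c")]
clean = preprocess(raw)
print(clean)
```

For a bipartite stream the two endpoints should not be swapped, since the first and second positions belong to different node sets.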

Usage

Folder "1-PreprocessingOfDatasets" contains the notebooks to preprocess unipartite and bipartite networks.

Phase 2: Injecting anomalies in the link stream

Description

If ground truth is unavailable, anomalous links are injected. Formally, each injected link is created by randomly sampling a timestamp from the link sequence $S = (t_1,u_1,v_1), (t_2,u_2,v_2), \dots, (t_\ell,u_\ell,v_\ell)$ and two distinct nodes from the sets $U$ and $V$ of first- and second-position nodes, ensuring that the sampled link does not exist in $S$. For bipartite datasets, $u$ is sampled from $U$ and $v$ from $V$.
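The sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of the repository's shell scripts: the function `inject`, the 0/1 labels, and the `max_attempts` guard are hypothetical, the percentage is taken relative to the length of the original stream, and "does not exist in $S$" is read as "the triplet ($t, u, v$) is not in $S$, in either orientation".

```python
import random


def inject(stream, fraction, seed=0, max_attempts=100_000):
    """Return (t, u, v, label) rows sorted by time; label 1 marks injected links."""
    rng = random.Random(seed)
    times = [t for t, _, _ in stream]              # timestamps of S
    first = sorted({u for _, u, _ in stream})      # set U (first-position nodes)
    second = sorted({v for _, _, v in stream})     # set V (second-position nodes)
    # Triplets that may not be sampled, stored in both orientations.
    taken = set(stream) | {(t, v, u) for t, u, v in stream}
    target = round(len(stream) * fraction)
    injected = []
    # If max_attempts is exhausted, fewer than `target` links are injected.
    for _ in range(max_attempts):
        if len(injected) == target:
            break
        t, u, v = rng.choice(times), rng.choice(first), rng.choice(second)
        if u == v or (t, u, v) in taken:
            continue  # nodes must be distinct and the link must be new
        taken.update({(t, u, v), (t, v, u)})
        injected.append((t, u, v))
    rows = [(t, u, v, 0) for t, u, v in stream]
    rows += [(t, u, v, 1) for t, u, v in injected]
    return sorted(rows, key=lambda row: row[0])


stream = [(i, f"u{i % 3}", f"v{i % 4}") for i in range(1, 11)]
labelled = inject(stream, 0.10)
print([row for row in labelled if row[3] == 1])
```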

Usage

Folder "2-Injection" contains the scripts to inject anomalies in unipartite and bipartite networks.

Suppose we have a unipartite link stream named "data.gz". To inject 10% anomalies in it, the following scripts should be executed in sequence:

./build_injectable.sh data 20
./inject.sh data 10
./check.sh data 10

$\rightarrow$ The file "data_10_injected.gz" is the file to be used in the next phase.

Suppose we have a bipartite link stream named "data.gz". To inject 10% anomalies in it, the following scripts should be executed in sequence:

./bip_build_injectable.sh data 20
./inject.sh data 10
./bip_check.sh data 10

$\rightarrow$ The file "data_10_injected.gz" is the file to be used in the next phase.

Note: The first script is run with 20% to obtain a larger sample of the sets $T$, $U$, and $V$; any value greater than 10% can be used instead.

Phase 3: Generating features

Description

Given a link stream ($t_i, u_i, v_i$) with its labels, aggregate the stream into either $H$-type (by size) or $G$-type (by duration) history graphs and compute the features of each interaction ($u_i,v_i$).
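To make the two kinds of history concrete, here is a small sketch that keeps either the last $s$ links (size-based) or the links of the last $d$ time units (duration-based) and reads off a few counts for each incoming link. It is not the implementation in main.py: the class `History`, the helper `compute`, and the three features (`weight`, `links_u`, `links_v`) are hypothetical examples. The sketch also assumes that a link is not part of its own history, that a link at time $t - d$ is still inside the window, and that node pairs are already in a canonical order.

```python
from collections import Counter, deque


class History:
    """Sliding history of links, bounded by size (H-like) or duration (G-like)."""

    def __init__(self, size=None, duration=None):
        if (size is None) == (duration is None):
            raise ValueError("give exactly one of size or duration")
        self.size, self.duration = size, duration
        self.links = deque()     # links currently in the history, oldest first
        self.weight = Counter()  # (u, v) -> occurrences in the history
        self.touch = Counter()   # node -> history links involving that node

    def _remove_oldest(self):
        _, u, v = self.links.popleft()
        self.weight[(u, v)] -= 1
        self.touch[u] -= 1
        self.touch[v] -= 1

    def features(self, t, u, v):
        """Counts for (u, v) computed on the history preceding time t."""
        if self.duration is not None:
            while self.links and self.links[0][0] < t - self.duration:
                self._remove_oldest()  # drop links older than t - d
        return {"weight": self.weight[(u, v)],
                "links_u": self.touch[u],
                "links_v": self.touch[v]}

    def add(self, t, u, v):
        self.links.append((t, u, v))
        self.weight[(u, v)] += 1
        self.touch[u] += 1
        self.touch[v] += 1
        if self.size is not None and len(self.links) > self.size:
            self._remove_oldest()  # keep at most s links


def compute(stream, **kwargs):
    history, rows = History(**kwargs), []
    for t, u, v in stream:
        rows.append(history.features(t, u, v))
        history.add(t, u, v)
    return rows


stream = [(1, "a", "b"), (2, "a", "c"), (3, "a", "b"), (10, "a", "b")]
h_rows = compute(stream, size=2)      # history = last 2 links
g_rows = compute(stream, duration=5)  # history = last 5 time units
print(h_rows[-1], g_rows[-1])
```

On the last link, the size-based history still contains the two previous links, while the duration-based history is empty because the gap since the previous link exceeds 5 time units.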

Usage

Folder "3-FeatureGeneration" contains the code to generate the features of a given link stream. Use the following command:

zcat input.gz | python3 main.py [-H s] [-G d] [-bip] [-int] [-check N] | gzip -c > output.json.gz

-H s or -G d: One of the two must be given; it sets the type of the history graph and its size ($s$) or duration ($d$)
-bip: Should be set if the network is bipartite
-int: A switch indicating if node labels are integers
-check N: Enforces a verification of data structures and computations every N lines (costly)

Example on unipartite data_10_injected.gz:

zcat data_10_injected.gz | python3 main.py -H 1000 | gzip -c > data_10_injected_H1000.gz

$\rightarrow$ Generates the $O(1)$ features of the link stream data_10_injected.gz with $H$-type history graph of size $s$ = 1000

Example on bipartite data_10_injected.gz:

zcat data_10_injected.gz | python3 main.py -H 1000 -bip | gzip -c > data_10_injected_H1000.gz

$\rightarrow$ Generates the $O(1)$ features of the link stream data_10_injected.gz with $H$-type history graph of size $s$ = 1000

Example on unipartite data_10_injected.gz:

zcat data_10_injected.gz | python3 main.py -G 50 | gzip -c > data_10_injected_G50.gz

$\rightarrow$ Generates the $O(1)$ features of the link stream data_10_injected.gz with $G$-type history graph of duration $d$ = 50 (suppose $t$ is in seconds, thus 50 represents 50 seconds in the past)

Example on bipartite data_10_injected.gz:

zcat data_10_injected.gz | python3 main.py -G 50 -bip | gzip -c > data_10_injected_G50.gz

$\rightarrow$ Generates the $O(1)$ features of the link stream data_10_injected.gz with $G$-type history graph of duration $d$ = 50 (suppose $t$ is in seconds, thus 50 represents 50 seconds in the past)

Note: zcat is not mandatory. The command can also be run on an uncompressed file as follows:

cat input.txt | python3 main.py [-H s] [-G d] [-bip] [-int] [-check N] > output.json
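When many history sizes or durations have to be generated, the command line can be assembled programmatically. The helper below is a hypothetical convenience, not part of the repository; it only arranges the flags documented above (-H, -G, -bip, -int, -check) and leaves the input and output redirections to the caller.

```python
import shlex


def build_main_args(history, value, bipartite=False, int_labels=False,
                    check_every=None):
    """Return the argument list for main.py using the documented flags."""
    if history not in ("H", "G"):
        raise ValueError("history must be 'H' (size) or 'G' (duration)")
    args = ["python3", "main.py", f"-{history}", str(value)]
    if bipartite:
        args.append("-bip")
    if int_labels:
        args.append("-int")
    if check_every is not None:
        args += ["-check", str(check_every)]
    return args


# Print one command per history size, ready to be placed in a shell pipeline.
for size in (100, 1000):
    print(shlex.join(build_main_args("H", size, bipartite=True)))
```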

Phase 4: Learning and testing

Description

Given a link stream ($t_i, u_i, v_i$) and the graph features computed from its history graph, train a Random Forest classifier to detect anomalous links.
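As a rough illustration of this step, the sketch below trains scikit-learn's RandomForestClassifier on synthetic data. Everything about the data is an assumption made for the example: the three feature columns are random stand-ins rather than features produced in Phase 3, the 10% anomaly rate mirrors the injection example, and the random stratified split and the AUC metric are choices of this sketch, not necessarily those of the notebooks.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_normal, n_anomalous = 900, 100

# Synthetic stand-in features: anomalous links get smaller counts on average.
X_normal = rng.poisson(lam=[3.0, 8.0, 8.0], size=(n_normal, 3))
X_anomalous = rng.poisson(lam=[0.2, 2.0, 2.0], size=(n_anomalous, 3))
X = np.vstack([X_normal, X_anomalous])
y = np.concatenate([np.zeros(n_normal, dtype=int),
                    np.ones(n_anomalous, dtype=int)])

# Hold out 30% of the links for testing, keeping the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Probability of the anomalous class for each test link.
scores = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
print(f"AUC on the synthetic test set: {auc:.3f}")
```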

Usage

Folder "4-LearningAndTesting" contains the notebooks for learning (classical and with sliding windows) and for testing the trained model on a unipartite network (UCI Messages) and a bipartite network (MovieLens). This folder has 5 main subfolders:

HTypeHistoryGraphs: Learning is performed on multiple instances of $H$-type history graphs of varying sizes.
GTypeHistoryGraphs: Learning is performed on multiple instances of $G$-type history graphs of varying durations.
CombiningHTypeHistoryGraphs: Learning is performed on multiple instances of $H$-type history graphs of varying sizes combined together.
CombiningGTypeHistoryGraphs: Learning is performed on multiple instances of $G$-type history graphs of varying durations combined together.
CombiningHandGTypeHistoryGraphs: Learning is performed on multiple instances of $H$-type and $G$-type history graphs of varying sizes and durations combined together.
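One plausible reading of "combined together" is that the feature columns obtained from several history graphs are concatenated link by link before learning. The sketch below illustrates that reading only; the function `combine`, the history names (`H100`, `G50`), and the feature name `weight` are hypothetical, and the notebooks may combine features differently.

```python
def combine(feature_sets):
    """Merge per-link feature rows from several history graphs.

    feature_sets maps a history name to a list of feature dicts, one per
    link, all lists being in the same link order.
    """
    lengths = {len(rows) for rows in feature_sets.values()}
    if len(lengths) != 1:
        raise ValueError("all feature sets must describe the same links")
    combined = []
    for rows in zip(*feature_sets.values()):
        merged = {}
        for name, row in zip(feature_sets, rows):
            for key, value in row.items():
                merged[f"{name}_{key}"] = value  # prefix avoids name clashes
        combined.append(merged)
    return combined


features = {
    "H100": [{"weight": 1}, {"weight": 0}],
    "G50": [{"weight": 0}, {"weight": 2}],
}
table = combine(features)
print(table)
```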

Note: For learning on large dynamic networks, refer to the folder "4-LearningAndTesting-LargeNetworks". There, a sampling technique is applied first so that the features are never loaded entirely into memory, and a chunking technique is used for testing. The same applies to TGF with sliding windows.
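The idea of sampling for training and chunking for testing can be sketched with two small generators-friendly helpers. The repository does not specify which sampling technique it uses here, so reservoir sampling is a technique chosen for this example, and the names `reservoir_sample` and `chunks` are hypothetical. Both helpers consume their input row by row, so the full feature file never needs to be held in memory.

```python
import random
from itertools import islice


def reservoir_sample(rows, k, seed=0):
    """Uniformly sample k rows from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)     # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = row    # replace with probability k / (i + 1)
    return sample


def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


training_rows = reservoir_sample(range(1_000_000), k=5)
chunk_sizes = [len(chunk) for chunk in chunks(range(10), size=4)]
print(training_rows, chunk_sizes)
```

In practice `rows` would be the lines of a feature file read lazily, with the model trained on the sample and applied to one chunk at a time.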

Citation

If you find this repository useful in your research, please cite our work and consider giving the repository a star!

@inproceedings{latapy2025trivial,
  title={Trivial Graph Features and Classical Learning are Enough to Detect Random Anomalies},
  author={Latapy, Matthieu and Rajeh, Stephany},
  booktitle={2025 IEEE International Conference on Data Mining (ICDM)},
  pages={1330--1339},
  year={2025},
  doi={10.1109/ICDM65498.2025.00142}
}

Contact

If you have any questions, please do not hesitate to reach out to us at stephany.rajeh@efrei.fr and matthieu.latapy@lip6.fr
