- initial commit
- added rb-dual-motifs dataset
- added tadf dataset
- Added module visual_graph_datasets.cli
- Improved installation process. It's now possible to install in non-editable mode
- Added tests
- Added function get_dataset_path which returns the full dataset path given the string name of a dataset.
- Added the dataset movie_reviews which is a natural language classification dataset which was converted into a graph dataset.
- Extended the function visual_graph_datasets.data.load_visual_graph_dataset to be able to load natural language text based graph datasets as well.
Completely refactored the way in which datasets are managed.
- By default the datasets are now stored within a folder called .visual_graph_datasets/datasets within the user's home directory, but the datasets are no longer part of the repository itself. Instead, the datasets have to be downloaded from a remote file share provider first. CLI commands have been added to simplify this process. Assuming the remote provider is correctly configured and accessible, datasets can simply be downloaded by name using the "download" CLI command.
- Added visual_graph_datasets.config which defines a config singleton class. By default this config class only returns default values, but a config file can be created at .visual_graph_datasets/config.yaml by using the "config" CLI command. Inside this config it is possible to change the remote file share provider and the dataset path.
- The CLI command "list" can be used to display all the available datasets in the remote file share.
- Somewhat extended the AbstractFileShare interface to also include a method check_dataset which retrieves the file share's metadata and then checks if the provided dataset name is available from that file share location.
- Added the sub package visual_graph_datasets.generation which will contain all the functionality related to the generation of datasets.
- Added the module visual_graph_datasets.generation.graph and the class GraphGenerator which presents a generic solution for graph generation purposes.
- Added the sub package visual_graph_datasets.visualization which will contain all the functionality related to the visualization of various different kinds of graphs
- Added the module visual_graph_datasets.visualization.base
- Added the module visual_graph_datasets.visualization.colors and functionality to visualize grayscale graphs which contain a single attribute that represents the grayscale value
- Added an "experiments" folder which will contain pycomex experiments
- Added an experiment generate_mock.py which generates a simple mock dataset which will subsequently be used for testing purposes.
- Extended the dependencies
- Added module visual_graph_datasets.visualization.importances which implements the visualization of importances on top of graph visualizations.
- Other small fixes, including a problem with the generation of the mock dataset
- Added imageio to dependencies
- Default config now has the public nextcloud provider url
- Fixed a bug with the "list" command which crashed due to a non-existing terminal color specification
- Finally finished the implementation of the "bundle" command.
- Updated the rb_motifs dataset for the new structure and also recreated all the visualizations with a transparent background.
- Implemented the visualization of colored graphs
- Changed the config file a bit: It is now possible to define as many custom file share providers as needed under the "providers" section. Each new provider, however, needs to have a unique name, which then has to be supplied to the get_file_share function to actually construct the corresponding file share provider object instance.
- Added the package visual_graph_datasets.processing which contains functionality to process source datasets into visual graph datasets.
- Added experiment generate_molecule_dataset_from_csv which can be used to download the source CSV file for a molecule (SMILES based) dataset from the file share and then generate a visual graph dataset based on that.
- Fixed a bug in the "bundle" command
- Added a module visual_graph_datasets.testing with testing utils.
- Renamed TestingConfig to IsolatedConfig due to a warning in pytest test collection
- Fixed a bug in experiments.generate_molecule_dataset_from_csv where faulty node positions were saved for the generated visualizations of the molecules
- Added the experiment experiments.generate_molecule_multitask_dataset_from_csv which generates a molecule based dataset for a multitask regression learning objective using multiple CSVs and merging them together.
- Fixed a bug in experiments.generate_molecule_multitask_dataset_from_csv where invalid molecules were causing problems down the line. These are being filtered now.
- Updated README.md
- Added an "examples" folder
- Initial implementation of the "dataset metadata" feature: The basic idea is that special metadata files can optionally be added to the various dataset folders to provide useful information about them, such as a description, a version string, a changelog, information about the relevant tensor shapes etc. In the future the idea is to allow arbitrary metadata files which begin with a "." character. For now, the central .meta.yaml file has been implemented to hold the bulk of the textual metadata in a machine readable format.
- Added a main logger to the main config singleton, such that this can be used for the command line interface.
- Added the "gather" cli command which can be used to generate/update the metadata information for a single dataset folder. This will create an updated version of the .meta.yaml file within that folder.
- Changed the "bundle" command such that the metadata file is now always updated with the new dataset specific metadata, regardless of whether it exists or not. Additionally, custom fields added to that file which do not interfere with the automatically generated part now persist beyond individual bundle operations.
- Updated the jinja template for the "list" command to be more idiomatic and to not use logic within the template. Additionally extended it with more metadata information that is now available for datasets
- Switched to the new version of pycomex which introduces experiment inheritance.
- Started to implement more specific sub experiments using experiment inheritance.
INTERFACE CHANGES
- The central function load_visual_graph_dataset now has a backward-incompatible signature: The function still returns a tuple of two elements as before, but the first element of that tuple is now the metadata dict of the dataset as it was loaded by load_visual_graph_dataset_metadata
- changed dependencies to fit together with graph_attention_student
- Added experiment to generate aggregators_binary dataset
Implemented the "preprocessing" feature. Currently a big problem with the visual graph datasets in general is that they are essentially limited to the elements which they already contain. There is no easy way to generate more input elements in the same general graph representation / format as the elements already in a dataset. This is a problem if any model trained based on a VGD is supposed to be actually used on new unseen data: It will be difficult to process a new molecule for example into the appropriate input tensor format required to query the model.
The "preprocessing" feature addresses this problem. During the creation of each VGD a python module "process.py" is automatically created from a template and saved into the VGD folder as well. It contains all the code needed to transform a domain specific representation (such as a SMILES code for example) into a new input element of that dataset, including the graph representation as well as the visualization. This module can either be imported to use the functionality directly in python code or be run as a command line application.
- Added the base class processing.base.ProcessingBase. This class encapsulates the previously described pre-processing functionality. Classes inheriting from this automatically act as a command line interface as well.
  - Code for a standalone python module with the same processing functionality can be generated from an instance using the processing.base.create_processing_module function.
- Added the class processing.molecules.MoleculeProcessing. This class provides a standard implementation for processing molecular graphs given as SMILES strings.
- Added unittests for base processing functionality
- Added unittests for molecule processing functionality
- Extended the function typing.assert_graph_dict to do some more in-depth checks for a valid graph dict
- Added module generation.color. It implements utility functions which are needed specifically for the generation of color graph datasets.
- Added the experiment experiment.generate_rb_adv_motifs which generates the synthetic "red-blue adversarial motifs" classification dataset of color graphs.
- Changed the "config" cli command to also be usable without actually opening the editor. This can be used to silently create or overwrite a config file for example.
- Fixed a bug in utils.dynamic_import
- Fixed a bug in data.load_visual_graph_element
- Changed the version dependency for numpy
- Slightly changed the generation process of the "rb_adv_motifs" dataset.
- Added the class of experiments based on experiments.csv_sanchez_lengeling_dataset.py, which convert the datasets from the paper into a single CSV file, which can then be further processed into a visual graph dataset.
- Added utility function util.edge_importances_from_node_importances to derive edge explanations from the node explanations in cases where they are not created.
- Started to move towards the new pycomex Functional API with the experiments
- Added more documentation to typing
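One plausible way to derive edge explanations from node explanations is sketched below. This is only an illustration: the averaging rule, the function name `derive_edge_importances`, and the plain-list data layout are assumptions made for this sketch and are not taken from the actual util.edge_importances_from_node_importances implementation.

```python
from typing import List, Sequence, Tuple


def derive_edge_importances(
    edge_indices: Sequence[Tuple[int, int]],
    node_importances: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Assign each edge the mean importance of its two endpoint nodes.

    The averaging rule is an assumption of this sketch; the library may
    combine the node values differently.
    """
    edge_importances = []
    for i, j in edge_indices:
        # One value per explanation channel: mean over both adjacent nodes.
        edge_importances.append(
            [(a + b) / 2 for a, b in zip(node_importances[i], node_importances[j])]
        )
    return edge_importances


# Hypothetical three-node chain with a single explanation channel.
edges = [(0, 1), (1, 2)]
nodes = [[1.0], [0.0], [0.5]]
edge_importances = derive_edge_importances(edges, nodes)
```

With this rule an edge is only highlighted strongly when both of its endpoints are; a maximum or minimum over the endpoints would be an equally defensible choice.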
Model Interfaces and Mixins
- Added the visual_graph_datasets.models module which will contain all the code which is relevant for models that specifically work with visual graph datasets
- Added the models.PredictGraphMixin class, which is essentially an interface that can be implemented by a model class to signify that it supports the predict_graph method which can be used to query a model prediction directly based on a GraphDict object.
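The mixin idea described above can be illustrated with a small stand-in. The class `PredictGraphSketch`, the toy model, and the graph dict keys used here are hypothetical and only mirror the pattern; they are not the real models.PredictGraphMixin code.

```python
from typing import Any, Dict, List


class PredictGraphSketch:
    """Hypothetical stand-in for an interface such as models.PredictGraphMixin."""

    def predict_graph(self, graph: Dict[str, Any]) -> List[float]:
        raise NotImplementedError("model classes are expected to override this")


class NodeCountModel(PredictGraphSketch):
    """Toy model whose 'prediction' is simply the number of nodes."""

    def predict_graph(self, graph: Dict[str, Any]) -> List[float]:
        return [float(len(graph["node_indices"]))]


# A minimal graph dict; the key names are an assumption of this sketch.
graph = {"node_indices": [0, 1, 2], "edge_indices": [(0, 1), (1, 2)]}
model = NodeCountModel()
prediction = model.predict_graph(graph)

# Calling code can check for the interface instead of a concrete model class.
supports_graphs = isinstance(model, PredictGraphSketch)
```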
Examples
- Added an examples/README.rst
- Added examples/01_explanation_pdf
- Added a section about dataset conversion to the readme file
- Fixed a bug with the create_processing_module function where it did not work if the Processing class was not defined at the top-level indentation.
- Changed some dependency versions
- Moved some more experiment modules to the pycomex functional API
Important
- Made some changes to the BaseProcessing interface, which will be backwards incompatible
  - Mainly made the base interface more specific such as including "output_path" or "value" as concrete positional arguments to the various abstract methods instead of just specifying args and kwargs
- Added the Batched utility iterator class which will greatly simplify working in batches for predictions etc.
- Made some changes to the base molecule processing file
- Started moving more experiment modules to the new pycomex functional api
- Added an experiment module to process QM9 dataset into a visual graph dataset.
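A batching iterator of the kind mentioned above could look roughly like the following. The class name `BatchedSketch` and its constructor arguments are assumptions made for this illustration; the signature of the actual Batched class is not given here.

```python
from typing import Any, Iterator, List, Sequence


class BatchedSketch:
    """Iterate over a sequence in consecutive chunks of at most batch_size items."""

    def __init__(self, data: Sequence[Any], batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.data = data
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[Any]]:
        for start in range(0, len(self.data), self.batch_size):
            yield list(self.data[start:start + self.batch_size])

    def __len__(self) -> int:
        # Number of batches, rounding up so the last partial batch counts.
        return -(-len(self.data) // self.batch_size)


batcher = BatchedSketch(list(range(7)), batch_size=3)
batches = list(batcher)
```

The last batch is allowed to be smaller than the others, which is usually what is wanted when running predictions over a whole dataset.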
Additions to the processing.molecules module. Added various new molecular node/features based on
RDKit computations.
- Partial Gasteiger Charges of atoms
- Crippen LogP contributions of atoms
- EState indices
- TPSA contributions
- LabuteASA contributions
- Changed the default experiment generate_molecule_dataset_from_csv.py to now use these additional atom/node features for the default Processing implementation.
Overhaul of the dataset writing and reading process. The main difference is that I added support for dataset chunking. Previously a dataset would consist of a single folder which would directly contain all the files for the individual dataset elements. For large datasets these folders would become very large and thus inefficient for the filesystem to handle. With dataset chunking, the dataset can be split into multiple sub folders that contain a max. number of elements each thus hopefully increasing the efficiency.
- Added data.DatasetReaderBase class, which contains the base implementation of reading a dataset from the persistent folder representation into the index_data_map. This class now supports the dataset chunking feature.
  - Added data.VisualGraphDatasetReader which implements this for the basic dataset format that represents each element as a JSON and PNG file.
- Added data.DatasetWriterBase class, which contains the base implementation of writing a dataset from a data structure representation into the folder. This class now supports the dataset chunking feature.
  - Added data.VisualGraphDatasetWriter which implements this for the basic dataset format where a metadata dict and a mpl Figure instance are turned into a JSON and PNG file.
- Changed the processing.molecule.MoleculeProcessing class to now also support a DatasetWriter instance as an optional argument to make use of the dataset chunking feature during the dataset creation process.
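The chunking idea described above can be sketched with the standard library alone. The folder naming scheme (`chunk_000`, ...), the helper names, and the one-JSON-file-per-element layout are assumptions made for this illustration and may differ from what the DatasetWriterBase and DatasetReaderBase classes actually do.

```python
import os
import tempfile


def chunk_path(base_path: str, index: int, chunk_size: int) -> str:
    """Return the folder for element `index` when each chunk holds `chunk_size` elements."""
    if chunk_size <= 0:
        return base_path  # chunking disabled: flat layout as before
    return os.path.join(base_path, f"chunk_{index // chunk_size:03d}")


def write_element(base_path: str, index: int, content: str, chunk_size: int) -> str:
    """Write one element file into its chunk folder, creating the folder if needed."""
    folder = chunk_path(base_path, index, chunk_size)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{index}.json")
    with open(path, "w") as file:
        file.write(content)
    return path


def read_all(base_path: str) -> dict:
    """Walk flat and chunked layouts alike and map element index -> file path."""
    index_path_map = {}
    for root, _dirs, files in os.walk(base_path):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext == ".json":
                index_path_map[int(stem)] = os.path.join(root, name)
    return index_path_map


with tempfile.TemporaryDirectory() as base:
    for i in range(5):
        write_element(base, i, "{}", chunk_size=2)
    chunk_names = sorted(os.listdir(base))
    found = sorted(read_all(base))
```

Because the reader walks the folder tree recursively, it handles the old flat layout and the chunked layout with the same code path.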
Introduction of COGILES (Color Graph Input Line Entry System) which is a method of specifying colored graphs with a simple human-readable string syntax, which is strongly inspired by SMILES for molecular graphs.
- Added generate.colors.graph_from_cogiles
- Added generate.colors.graph_to_cogiles
Bugfixes
- I think I finally solved the performance issue in generate_molecule_dataset_from_csv.py. Previously there was an issue where the avg write speed would rapidly decline for a large dataset, causing the process to take way too long. I think the problem was the matplotlib cache in the end
- Also changed visualize_graph_from_mol and made some optimizations there. It no longer relies on the creation of intermediate files or a temp dir, which shaved off a few ms of computational time.
- Added the new module graph.py which will contain all GraphDict related utility functions in the future
  - Added a function to copy graph dicts
  - Added a function to create node adjacency matrices for graph dicts
  - Added a function to add graph edges
  - Added a function to remove graph edges
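The kind of graph dict utilities listed above might look like the following pure-Python sketch. The helper names, the list-based data layout, and the convention of storing an undirected edge as two directed entries are all assumptions made here for illustration; they are not the functions from graph.py.

```python
import copy
from typing import Any, Dict, List

GraphDict = Dict[str, Any]


def adjacency_matrix(graph: GraphDict) -> List[List[int]]:
    """Build a dense node adjacency matrix from the edge index list."""
    n = len(graph["node_indices"])
    matrix = [[0] * n for _ in range(n)]
    for i, j in graph["edge_indices"]:
        matrix[i][j] = 1
    return matrix


def add_edge(graph: GraphDict, i: int, j: int, attributes: List[float]) -> None:
    """Insert the undirected edge i-j as two directed entries, skipping existing ones."""
    for pair in ([i, j], [j, i]):
        if pair not in graph["edge_indices"]:
            graph["edge_indices"].append(pair)
            graph["edge_attributes"].append(list(attributes))


def remove_edge(graph: GraphDict, i: int, j: int) -> None:
    """Drop both directions of the edge i-j together with their attributes."""
    keep = [
        k for k, pair in enumerate(graph["edge_indices"])
        if pair not in ([i, j], [j, i])
    ]
    graph["edge_indices"] = [graph["edge_indices"][k] for k in keep]
    graph["edge_attributes"] = [graph["edge_attributes"][k] for k in keep]


graph = {
    "node_indices": [0, 1, 2],
    "edge_indices": [[0, 1], [1, 0]],
    "edge_attributes": [[1.0], [1.0]],
}
# A deep copy keeps the original graph dict untouched by later edits.
clone = copy.deepcopy(graph)
add_edge(clone, 1, 2, [1.0])
remove_edge(clone, 0, 1)
matrix = adjacency_matrix(clone)
```

Keeping the edge index list and the edge attribute list in lockstep is the main thing such helpers have to guarantee, which is why both are filtered with the same index list in `remove_edge`.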
- Fixed a bug where ColorProcessing.create would not save the name or the domain representation
- Fixed a bug where the COGILES decoding procedure produced graph dicts with "edge_attributes" arrays of the incorrect data type and shape.
- Fixed a bug where the CogilesEncoder duplicated edges in some very weird edge cases!
- Added the experiment profile_molecule_processing.py to profile and plot the runtime of the different process components that create a visual graph dataset element with the aim of identifying the source of the runtime degradation bug.
- Fixed the runtime degradation / memory leak issue in generate_molecule_dataset_from_csv.py. It seems like the problem actually wasn't in the code but in the matplotlib backend! The problem clearly occurs when using the TkAgg backend but does not appear when using the Agg backend.
- Modified the generation of the QM9 dataset in generate_molecule_dataset_from_csv__qm9.py
- Added the new experiment file generate_molecule_dataset_from_csv__qm9sub.py which generates the QM9 sub dataset which is a smaller subset of QM9 with only 22k elements and 9 target columns.
- Added the new experiment generate_molecule_dataset_from_csv__aggregators_binary_protonated which processes the larger version of the aggregators dataset where each individual molecule is replaced by all its protonated variants
- Added the new background flavor of visualizing the attributional graph masks. In this method, a filled light green circle will be painted behind the nodes of the graph.
- Slightly modified the ensure_dataset function
- Updated the readme file
- Updated the documentation of the standard sub experiments for generate_molecule_dataset_from_csv.py
- Added a utility function to count how often a subgraph motif appears in a larger graph
- Added experiment analyze_color_graph_dataset.py to analyze the properties of color graph based datasets
- Fixed a minor issue where the datasets folder was not created during the config initialization, which led to errors when trying to download a dataset.
- Added back in the dictionaries defining the alternative versions for the node and edge importance plotting
- Added some more graph utility functions such as functions to extract sub graphs, add and remove nodes and to identify connected regions of a graph.
- Added documentation for the ColorProcessing class
- Changed the ColorProcessing.visualize_as_figure method to now also accept an external graph dict parameter and an external node_positions array.
- Modified the generate_molecule_dataset_from_csv.py experiment so that it is now possible to optionally define an index blacklist of elements that should be skipped during processing.
- Moved the dependencies to the most recent version of RDKit. This seems to have fixed the issue of the molecule image generation occasionally crashing with a segmentation fault.
- Added the "generic" graph type. This is a graph type that can be used to represent any kind of graph that cannot be associated with any kind of specific domain. Added the GenericProcessing class which can be used to process these generic graphs.
- Modified the colors_layout function such that it is possible to pass a partially defined list of node_positions as an argument, such that the positions of some nodes can be fixed during the layouting.
- The "load" method of the Config instance now returns the instance itself, which is just a small quality of life improvement for the scripts that have to use the config instance.
- Added some additional documentation for the Processing classes
- Added the function ``create_combined_importances_pdf`` to generate a visualization PDF that does not show the explanations as separate figures, but instead draws all the explanation channels into the same figure, encoding the different channels as different colors.
- Changed the version requirement to be compatible with newer python versions
- Fixed the dependency error where utils imported from graph_attention_student caused a circular import error
- Extended the ColorProcessing class to also support 3D graph structures now.
- Added an experiment module to process the COMPAS dataset of polybenzenes molecular property predictions
Modified pyproject.toml
- The command line interface is now installed as the "vgd" command
- Moved from using click for the command line interface to using rich-click, which is a wrapper around click that adds rich text support to the command line interface
- Added the get_num_node_attributes and get_num_edge_attributes functions to the Processing base interface.
- Fixed the COGILES encoder. There was a bug in the cogiles encoder class which resulted in edges being duplicated in some cases. This is fixed now.
- There is a test case now for the COGILES encoder which tests the encoder and decoder for a large number of randomly generated graphs to check if there are any other edge cases where the encoding or decoding fails.
- Added the node_atoms and edge_bonds properties to the MoleculeProcessing class when returning the graph dict representation of a molecule. These properties return the atom and bond types respectively as human-readable strings.
- Added an additional encoder class attribute to the ColorProcessing class which can be used to encode the (r,g,b) color values of the nodes and edges into a human-readable string representation.
Dependencies
- Updated the list of dependencies as well as their version requirements such that the minimum Python version for the visual_graph_datasets package is now 3.8, where it was previously 3.10
Tests
- Added a noxfile.py and implemented the unittests to be automatically executed for all supported python versions using nox.