Explaining Drift in Text Data with Document Embeddings

This repository provides a software pipeline for explaining drift between two sets of documents using document embeddings.
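
To make the idea concrete, here is a minimal, self-contained sketch of one simple way to surface drift between two document sets from their embeddings. It is not this repository's pipeline: the technique (comparing centroids and ranking documents along the difference vector) and all names (`centroid`, `drift_direction`, `rank_by_drift`, the toy data) are assumptions chosen for illustration only.

```python
from math import sqrt


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def drift_direction(reference, shifted):
    """Unit vector pointing from the reference centroid to the shifted centroid."""
    diff = [b - a for a, b in zip(centroid(reference), centroid(shifted))]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("Centroids coincide; no drift direction can be derived.")
    return [x / norm for x in diff]


def rank_by_drift(texts, embeddings, direction):
    """Sort texts by the projection of their embedding onto the drift direction."""
    scores = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
    return sorted(zip(scores, texts), reverse=True)


# Toy 2-D embeddings (hypothetical data, not taken from the Amazon reviews).
reference_embeddings = [[0.0, 0.0], [0.0, 2.0]]
shifted_texts = ["doc a", "doc b"]
shifted_embeddings = [[3.0, 0.0], [5.0, 2.0]]

direction = drift_direction(reference_embeddings, shifted_embeddings)
ranking = rank_by_drift(shifted_texts, shifted_embeddings, direction)
```

Documents at the top of `ranking` lie furthest along the direction in which the second set has moved, so they are candidates for characterizing the drift. With real embeddings the same functions apply unchanged, although a learned model would usually capture more than a centroid shift.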

Documentation

How to configure file storage and the default directory to read data
Amazon movie reviews
- Data overview
- How to read with Amazon Pickle_Reader and access texts, embeddings, metadata
- How to read with Amazon Pickle_Splitter and get equally split items
- Data is currently stored on Google Drive
How to store interim results
How to reduce dimensions
How to create Wordclouds

Goal: Reusable, complete and documented code (good for developers, reviewers, everyone)
If you add new classes, please provide minimal code examples, put them into the doc directory and add a link above.
Directories
- doc: Documentation (e.g. how to read data)
- experiments: Jupyter notebooks (e.g. combine class instances into a process generating explanations)
- transformation: Classes for data transformation (e.g. create embeddings, reduce dimensions)
- access: Classes for data access (e.g. read or split embeddings)
- explanations: Classes for the explanation process (e.g. handle ML models, generate explanations)
- scripts: Small sets of commands (e.g. to synchronize repositories)
How to name your code: PEP 8 - Style Guide for Python Code
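
As a quick reminder of the main PEP 8 naming conventions, the sketch below shows them on a hypothetical module. The class `EmbeddingReader`, the constant `DEFAULT_BATCH_SIZE`, and the method `read_batch` are invented for illustration and are not part of this repository.

```python
"""Hypothetical module embedding_reader.py: module names are short and lowercase."""

DEFAULT_BATCH_SIZE = 32  # constants: UPPER_CASE_WITH_UNDERSCORES


class EmbeddingReader:  # classes: CapWords
    """Illustrative class only; not part of this repository."""

    def __init__(self, batch_size=DEFAULT_BATCH_SIZE):
        self.batch_size = batch_size  # public attributes: lowercase_with_underscores
        self._cache = {}  # non-public attributes: single leading underscore

    def read_batch(self, item_ids):  # functions and methods: lowercase_with_underscores
        """Return at most `batch_size` ids from the given iterable."""
        return list(item_ids)[: self.batch_size]
```

A class written this way can be used as `EmbeddingReader(batch_size=2).read_batch([1, 2, 3])`, which returns the first two ids.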

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.