Home
The Harvard Data Tools project is an extract, transform, and load (ETL) system which takes data from multiple sources, transforms it via a multi-step process, and stores it in an Amazon Redshift data warehouse for querying and further processing by pedagogical research data scientists.
The project contains code to process and store data from a number of different platforms. Each data set is assigned a separate schema in Redshift, allowing multiple data sets to be generated from the same platform (for example, we can have one data set from edx.org, and another from edge.edx.org).
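The schema-per-data-set convention above can be sketched as a simple naming rule. This is a hypothetical illustration, assuming a derived-name scheme; the project's actual schema names and helper functions may differ.

```python
# Hypothetical sketch: derive a distinct Redshift schema name for each data
# set, so two data sets from the same platform (e.g. edx.org and
# edge.edx.org) land in separate schemas. The naming rule is an assumption.
def schema_name(platform: str, data_set: str) -> str:
    """Build a Redshift-safe schema name from a platform and data set."""
    return f"{platform}_{data_set}".lower().replace(".", "_")

print(schema_name("edx", "edx.org"))       # one schema for edx.org
print(schema_name("edx", "edge.edx.org"))  # a separate schema for edge.edx.org
```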
The ETL process is designed to operate over many data sets, each of which is updated and processed (mostly) asynchronously. On every run, the data set is downloaded, normalized, transformed, enhanced, and loaded into Redshift.
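One run over a single data set can be modeled as a sequence of steps applied in order. This is a minimal sketch under assumed step names; in the real project these steps are EMR and loading jobs, not in-process functions.

```python
# Hedged sketch of one ETL run: each step consumes the output of the
# previous one. The step functions here are placeholders for illustration.
def run_etl(data_set, steps):
    """Apply each pipeline step to the data set in order, returning the result."""
    for step in steps:
        data_set = step(data_set)
    return data_set

# Placeholder steps mirroring the sequence described above.
steps = [
    lambda d: {**d, "downloaded": True},
    lambda d: {**d, "normalized": True},
    lambda d: {**d, "transformed": True},
    lambda d: {**d, "enhanced": True},
    lambda d: {**d, "loaded": True},
]
result = run_etl({"name": "edx.org"}, steps)
```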

The data pipeline step runs on Amazon's Elastic Map Reduce service and performs the majority of the data processing. Each pipeline is dynamically generated to account not only for the data set that it is processing, but also for the specific tables present in the data dump. To keep the dependencies in the pipeline manageable, the pipeline is split into several distinct phases.

While the phases are not required for correctness, they make it easier to reason about the various steps that occur during the pipeline run.
- Data Pipeline and Elastic Map Reduce (EMR) Setup
- Phase 1: Identity Management
- Phase 2: Full Text Extraction
- Phase 3: Data Transformation
- Storing to Redshift
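Dynamic pipeline generation over these phases can be sketched as producing one step per (phase, table) pair. This is a simplified illustration; the phase names come from the list above, but the table list and step representation are assumptions.

```python
# Hedged sketch: generate a pipeline whose steps depend on which tables
# are present in a given data dump, grouped by the phases listed above.
PHASES = [
    "emr_setup",
    "identity_management",
    "full_text_extraction",
    "data_transformation",
    "store_to_redshift",
]

def build_pipeline(tables):
    """Emit one (phase, table) step for each table found in the data dump."""
    return [(phase, table) for phase in PHASES for table in tables]

# A dump containing only two tables yields a pipeline sized to match.
pipeline = build_pipeline(["users", "courses"])
```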
The infrastructure is designed to cope with constantly evolving data schemas, as more tables and fields are added to existing data sets. We generate a large amount of the data-set-specific code to avoid masses of copied-and-pasted code. The Java SDKs and certain Hadoop jobs are automatically generated on each run, meaning that most schema changes only require updating a configuration file with the changes, rather than editing Java code.
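The configuration-driven generation described above can be sketched as reading a table definition and emitting the corresponding code. The JSON layout and field names here are invented for illustration and are not the project's actual configuration format.

```python
# Illustrative sketch of config-driven code generation: a table definition
# read from a configuration file is rendered into Java field declarations,
# so a schema change needs only a config edit. The config format is assumed.
import json

config = json.loads("""
{"table": "user_info",
 "fields": [
    {"name": "user_id", "type": "int"},
    {"name": "username", "type": "String"}
 ]}
""")

def generate_java_fields(table_config):
    """Render one Java field declaration per configured column."""
    return [f"private {f['type']} {f['name']};" for f in table_config["fields"]]

lines = generate_java_fields(config)
```

Adding a new column to the configuration file then flows through to the generated code on the next run, with no hand-edited Java.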
There is a set of fixed resources required on AWS. For convenience, we maintain separate development and production instances of many of these resources (such as the Redshift cluster, DynamoDB tables, etc.).
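Keeping parallel development and production resources can be sketched with a simple environment-prefix convention. This naming rule is an assumption made for illustration, not the project's documented scheme.

```python
# Hypothetical sketch: name each fixed AWS resource per environment, so
# development and production instances never collide. The prefix convention
# is invented for this example.
def resource_name(base: str, environment: str) -> str:
    """Return the environment-qualified name for a fixed AWS resource."""
    if environment not in ("dev", "prod"):
        raise ValueError(f"unknown environment: {environment}")
    return f"{environment}-{base}"

print(resource_name("redshift-cluster", "dev"))
print(resource_name("redshift-cluster", "prod"))
```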