Phil McGachey edited this page Jun 13, 2017 · 19 revisions

Harvard Data Tools

The Harvard Data Tools project is an extract, transform and load (ETL) system that takes data from multiple sources, transforms it via a multi-step process, and stores it in an Amazon Redshift data warehouse for querying and further processing by pedagogical research data scientists.

Data Sources

The project contains code to process and store data from a number of different platforms. Each data set is assigned a separate schema in Redshift, allowing multiple data sets to be generated from the same platform (for example, we can have one data set from edx.org, and another from edge.edx.org).
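One way to picture the schema-per-data-set arrangement is a small naming helper that maps each platform instance to its own Redshift schema. This is an illustrative sketch, not the project's actual naming code; the function name and the dot-to-underscore convention are assumptions.

```python
# Hypothetical sketch: derive a distinct Redshift schema name for each data
# set, so that two data sets from the same platform (e.g. edx.org and
# edge.edx.org) land in separate schemas. The naming convention shown here
# is an assumption for illustration only.

def schema_name(instance: str) -> str:
    """Turn a platform instance like 'edge.edx.org' into a valid schema name."""
    # Redshift schema names must be plain identifiers, so replace separators.
    return instance.replace(".", "_").replace("-", "_").lower()

print(schema_name("edx.org"))       # edx_org
print(schema_name("edge.edx.org"))  # edge_edx_org
```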

Workflow

The ETL process is designed to operate over many data sets, each of which is updated and processed (mostly) asynchronously. On every run, each data set is downloaded, normalized, transformed, enhanced, and loaded into Redshift.
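The per-run sequence above can be sketched as a simple stage chain. The stage names come from the text; the implementations below are placeholders standing in for the real download, processing, and Redshift-load logic.

```python
# Minimal sketch of the per-run workflow: each stage takes the output of the
# previous one. The stage bodies are placeholders, not project code.

def download(dataset):
    return {"raw": dataset}

def normalize(data):
    return {**data, "normalized": True}

def transform(data):
    return {**data, "transformed": True}

def enhance(data):
    return {**data, "enhanced": True}

def load(data):
    # Stand-in for the final load into Redshift.
    return {**data, "loaded": True}

def run(dataset):
    """Run one data set through every stage, in order."""
    data = download(dataset)
    for stage in (normalize, transform, enhance, load):
        data = stage(data)
    return data

result = run("edx.org")
```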

[Image: HighLevel — high-level workflow diagram]

Data Pipeline

The data pipeline step runs on Amazon's Elastic MapReduce (EMR) service, and performs the majority of the data processing. Each pipeline is dynamically generated to account not only for the data set that it is processing, but for the specific tables that are present in the data dump. To keep the dependencies in the pipeline manageable, the pipeline is split into several distinct phases.
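A hedged sketch of that dynamic generation: steps are emitted only for the tables that actually appear in the dump, grouped into ordered phases with a barrier between phases so dependencies stay simple. The phase names and the `phase:table` step format here are assumptions, not the project's real pipeline definition.

```python
# Illustrative pipeline generation: one step per (phase, table) pair, built
# from whatever tables are present in this particular data dump. Phase names
# and step identifiers are hypothetical.

PHASES = ["unpack", "verify", "transform", "load"]

def generate_pipeline(tables_in_dump):
    """Emit an ordered list of steps; a 'barrier' closes each phase."""
    steps = []
    for phase in PHASES:
        for table in sorted(tables_in_dump):
            steps.append(f"{phase}:{table}")
        # All steps in a phase must finish before the next phase starts.
        steps.append(f"{phase}:barrier")
    return steps

pipeline = generate_pipeline({"users", "enrollments"})
```

Generating the step list per run means a dump that omits a table simply produces a shorter pipeline, rather than failing on a missing input.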

[Image: PipelineOverview — pipeline phase diagram]

While the phases are not required for correctness, they make it easier to reason about the various steps that occur during the pipeline run.

Code Generation

The infrastructure is designed to cope with constantly-evolving data schemas, as more tables and fields are added to existing data sets. We generate a large amount of the code that is specific to data sets to avoid having to create masses of copied and pasted code. The Java SDKs and certain Hadoop jobs are automatically generated on each run, meaning that for most schema changes it is only necessary to update a configuration file with the changes, rather than editing Java code.
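The config-driven generation described above can be illustrated with a toy generator that turns a table description into Redshift DDL. This is a sketch of the idea only: the real project generates Java SDKs and Hadoop jobs, and the column names and config shape below are hypothetical.

```python
# Toy config-driven code generation: a table described as data produces SQL,
# so a schema change means editing configuration, not handwritten code.
# Schema, table, and column names are made up for illustration.

def generate_ddl(schema, table, columns):
    """Render a CREATE TABLE statement from a (name, type) column list."""
    cols = ",\n  ".join(f"{name} {sqltype}" for name, sqltype in columns)
    return f"CREATE TABLE {schema}.{table} (\n  {cols}\n);"

ddl = generate_ddl(
    "edx_org",
    "users",
    [("id", "BIGINT"), ("username", "VARCHAR(255)")],
)
print(ddl)
```

Adding a field to the data set then amounts to appending one `(name, type)` pair to the configuration, and the next run regenerates everything downstream.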

Configuration

There is a set of fixed resources that are required on AWS. For convenience, we maintain separate development and production instances of many structures (such as the Redshift cluster, DynamoDB tables, etc.).
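One simple way to keep the parallel development and production copies straight is an environment suffix on resource names. The suffix convention and resource names below are assumptions for illustration, not documented project behavior.

```python
# Hypothetical environment-name resolution for the duplicated AWS resources.
# Resource names and the '-dev'/'-prod' suffix scheme are assumptions.

RESOURCES = ["redshift-cluster", "id-map-table", "pipeline-bucket"]

def resolve(resource, env):
    """Return the environment-specific name for a shared resource."""
    if env not in ("dev", "prod"):
        raise ValueError(f"unknown environment: {env}")
    return f"{resource}-{env}"

dev_names = {r: resolve(r, "dev") for r in RESOURCES}
```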

Other Useful Information