Summary
Process course XML contents from Open edX and edX.org course archives to load them into the raw layer of the data lakehouse and create dbt staging models for downstream analysis.
Background
The repository can already extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the extract_edxorg_courserun_metadata multi-asset in src/ol_orchestrate/assets/openedx_course_archives.py. However, the raw XML contents and course structure data are not yet loaded systematically into the raw layer of the data warehouse, so they are not available for comprehensive analysis and transformation through dbt.
Current State
The existing implementation:
- Extracts course XML archives from edX.org and Open edX instances
- Processes specific elements using functions in src/ol_orchestrate/lib/openedx.py:
  - process_course_xml() - extracts course metadata
  - process_video_xml() - extracts video elements
  - process_policy_json() - extracts policy information
- Outputs processed data to S3 as JSON/JSONL files
- Has some staging models in src/ol_dbt/models/staging/edxorg/ that reference raw course structure data
What's Missing
The complete course XML contents, including all course components, blocks, and their relationships, need to be:
- Loaded into the raw layer (ol_warehouse_*_raw schemas)
- Staged through dbt models for consistent transformation and data quality
- Made available for downstream marts and analytics
Requirements
1. Raw Layer Data Loading
Objective: Load course XML contents into raw layer tables
Tasks:
Data Sources:
- edX.org course archives (production and edge)
- Open edX instance course exports
- Archives stored in S3 buckets
Target Schema Pattern:
raw__edxorg__s3__course_xml_<entity>
raw__openedx__s3__course_xml_<entity>
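To make the naming pattern above concrete, here is a minimal Python sketch. The helper names raw_table_name and to_jsonl are hypothetical and do not exist in the repository. The course_id and retrieved_at fields are taken from the staging model pattern in this issue, and JSONL is assumed as the landing format because the existing pipeline already writes JSON/JSONL to S3.

```python
import json
import re
from typing import Iterable

# Source systems named in the target schema pattern.
ALLOWED_SOURCES = {"edxorg", "openedx"}


def raw_table_name(source: str, entity: str) -> str:
    """Build a raw-layer table name such as raw__edxorg__s3__course_xml_blocks."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    if not re.fullmatch(r"[a-z][a-z0-9_]*", entity):
        raise ValueError(f"entity must be a lowercase identifier: {entity!r}")
    return f"raw__{source}__s3__course_xml_{entity}"


def to_jsonl(rows: Iterable[dict], course_id: str, retrieved_at: str) -> str:
    """Serialize rows as JSONL, stamping each with course_id and retrieved_at."""
    lines = []
    for row in rows:
        record = {**row, "course_id": course_id, "retrieved_at": retrieved_at}
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines) + "\n"
```

Validating the entity name up front keeps malformed table names out of the raw schemas. The actual loader may apply different rules.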
2. dbt Staging Models
Objective: Create staging models to transform raw course XML data into clean, typed datasets
Tasks:
Staging Model Pattern:
-- stg__edxorg__s3__course_xml_blocks.sql
with source as (
    select * from {{ source('ol_warehouse_raw_data', 'raw__edxorg__s3__course_xml_blocks') }}
)

, cleaned as (
    select
        course_id as courserun_id
        , block_id as coursestructure_block_id
        , block_type as coursestructure_block_type
        , block_title as coursestructure_block_title
        , {{ cast_timestamp_to_iso8601('retrieved_at') }} as coursestructure_retrieved_at
        , ...
    from source
)

select * from cleaned
3. Integration with Existing Pipeline
Tasks:
Technical Considerations
XML Structure
Course XML archives contain:
- course.xml - root course definition
- course/{run_tag}.xml - course metadata
- chapter/, sequential/, vertical/ - course structure
- video/, problem/, html/ - content components
- policies/ - course policies
- about/ - course about pages
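The layout above could be flattened into block rows roughly as follows. This is a sketch under stated assumptions, not the repository's implementation: iter_blocks and _resolve are hypothetical names, and the code assumes that a childless tag carrying url_name points to <tag>/<url_name>.xml and that display_name holds the block title. Output keys mirror the block_id, block_type, and block_title columns of the staging model pattern. parent_block_id is an added illustrative field.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator, Optional


def _resolve(root_dir: Path, element: ET.Element) -> ET.Element:
    """Follow a pointer tag to its definition file when that file exists."""
    url_name = element.get("url_name")
    if url_name and len(element) == 0:
        candidate = root_dir / element.tag / f"{url_name}.xml"
        if candidate.is_file():
            resolved = ET.parse(candidate).getroot()
            # Definition files may omit url_name, so carry it over.
            resolved.set("url_name", url_name)
            return resolved
    return element


def iter_blocks(
    root_dir: Path,
    element: Optional[ET.Element] = None,
    parent_id: Optional[str] = None,
) -> Iterator[dict]:
    """Yield one row per block, depth first, starting at course.xml."""
    if element is None:
        element = ET.parse(root_dir / "course.xml").getroot()
    element = _resolve(root_dir, element)
    block_id = element.get("url_name")
    yield {
        "block_id": block_id,
        "block_type": element.tag,
        "block_title": element.get("display_name"),
        "parent_block_id": parent_id,
    }
    for child in element:
        # Children without url_name (for example markup inside a problem)
        # are treated as block content rather than blocks and are skipped.
        if child.get("url_name"):
            yield from iter_blocks(root_dir, child, block_id)
```

Emitting the parent identifier with each row preserves block relationships in a flat table, so a staging model can rebuild the hierarchy with a self-join.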
Data Volume
- Multiple course runs per course
- Potentially thousands of blocks per course
- Historical archives spanning multiple years
- Need for incremental processing
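One way to approach incremental processing is to compare each archive's current fingerprint with the one recorded on the last successful run. The sketch below assumes the fingerprint is an S3 ETag and that the processed-state mapping is persisted somewhere by the pipeline. select_archives_to_process is a hypothetical name, and the object keys in the usage example are made up.

```python
from typing import Dict, List


def select_archives_to_process(
    listing: Dict[str, str],
    processed: Dict[str, str],
) -> List[str]:
    """Return archive keys that are new or whose ETag changed since the last run.

    listing maps each archive key currently in the bucket to its ETag.
    processed maps each previously handled key to the ETag seen at that time.
    """
    return sorted(
        key for key, etag in listing.items() if processed.get(key) != etag
    )


# Usage with made-up keys: one unchanged, one changed, one new archive.
current = {"a.tar.gz": "e1", "b.tar.gz": "e2-new", "c.tar.gz": "e3"}
seen = {"a.tar.gz": "e1", "b.tar.gz": "e2-old"}
todo = select_archives_to_process(current, seen)
```

Keying on content fingerprints rather than timestamps means a re-exported archive with unchanged bytes is skipped, while a replaced archive under the same key is picked up again.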
Dependencies
- Requires access to S3 buckets with course archives
- May need coordination with existing archive retrieval jobs
- Should align with existing edxorg tracking log processing
Success Criteria
References
- Existing code:
  - packages/ol-orchestrate-lib/ol_orchestrate/assets/openedx_course_archives.py
  - packages/ol-orchestrate-lib/ol_orchestrate/lib/openedx.py
  - src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_structure.sql
- Utility:
  - bin/dbt-create-staging-models.py
- Documentation:
  - README.md (dbt Staging Model Generation section)
Labels
product:data-platform, enhancement, good-first-issue (for some sub-tasks)