Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models

## Summary

Parse the full XML contents of Open edX and edX.org course archives, load them into the raw layer of the data lakehouse, and create dbt staging models for downstream analysis.

## Background

The repository can already extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the `extract_edxorg_courserun_metadata` multi-asset in `src/ol_orchestrate/assets/openedx_course_archives.py`. However, the raw XML contents and course structure data are not systematically loaded into the raw layer of the data warehouse, so they are not available for comprehensive analysis and transformation through dbt.

### Current State

The existing implementation:
- Extracts course XML archives from edX.org and Open edX instances
- Processes specific elements using functions in `src/ol_orchestrate/lib/openedx.py`:
  - `process_course_xml()` - extracts course metadata
  - `process_video_xml()` - extracts video elements
  - `process_policy_json()` - extracts policy information
- Outputs processed data to S3 as JSON/JSONL files
- Has some staging models in `src/ol_dbt/models/staging/edxorg/` that reference raw course structure data

### What's Missing

The complete course XML contents, including all course components, blocks, and their relationships, need to be:
1. Loaded into the raw layer (`ol_warehouse_*_raw` schemas)
2. Staged through dbt models for consistent transformation and data quality
3. Made available for downstream marts and analytics

## Requirements

### 1. Raw Layer Data Loading

**Objective:** Load course XML contents into raw layer tables

**Tasks:**
- [ ] Design schema for raw course XML data tables
  - Consider table structure for course blocks/components
  - Include metadata fields (`retrieved_at`, `source_system`, `course_id`, etc.)
  - Determine granularity (one row per block, per file, etc.)
- [ ] Create Dagster assets to extract and load course XML data
  - Extend or create new assets in `src/ol_orchestrate/assets/`
  - Parse XML structure comprehensively (not just metadata)
  - Handle both edX.org and Open edX course formats
  - Implement incremental loading strategy
- [ ] Configure data quality checks
  - Validate XML parsing completeness
  - Check for required fields
  - Monitor data freshness
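
As a starting point for the schema discussion, here is a minimal sketch of the "one row per block" granularity together with a required-field check. It assumes each block's XML is already available as a string. The function names and the `block_attributes` and `child_block_ids` fields are hypothetical and do not exist in the repository; the remaining field names follow the ones used elsewhere in this issue.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def block_xml_to_row(xml_text, block_id, course_id, source_system, retrieved_at=None):
    """Turn one block's XML into a flat dict suitable for a JSONL raw table.

    Hypothetical helper: the output shape is an assumption, not an agreed schema.
    """
    element = ET.fromstring(xml_text)
    retrieved_at = retrieved_at or datetime.now(timezone.utc)
    return {
        "course_id": course_id,
        "source_system": source_system,
        "block_id": block_id,
        "block_type": element.tag,
        "block_title": element.get("display_name"),
        # Keep every attribute as JSON so nothing is lost before staging.
        "block_attributes": json.dumps(dict(element.attrib), sort_keys=True),
        # Child pointers preserve the block hierarchy.
        "child_block_ids": [
            child.get("url_name") for child in element if child.get("url_name")
        ],
        "retrieved_at": retrieved_at.isoformat(),
    }


def missing_required_fields(row, required=("course_id", "block_id", "block_type")):
    """Return the names of required fields that are absent or empty."""
    return [name for name in required if not row.get(name)]
```

Storing the attributes as a JSON string keeps the raw table schema stable while block types differ; typed columns can then be extracted in the staging layer.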

**Data Sources:**
- edX.org course archives (production and edge)
- Open edX instance course exports
- Archives stored in S3 buckets
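
One possible incremental loading strategy for these archives is to compare a checksum per archive against what was recorded at the last load. The sketch below assumes the caller has already listed the bucket into a plain mapping of object key to checksum (for example an S3 ETag); the function name and both mappings are hypothetical.

```python
def select_archives_to_process(listed, already_loaded):
    """Return archive keys whose checksum is new or has changed.

    listed: mapping of archive key -> checksum from the current bucket listing.
    already_loaded: mapping of archive key -> checksum recorded at the last load.
    Both mappings are assumptions about how state would be tracked.
    """
    return sorted(
        key
        for key, checksum in listed.items()
        if already_loaded.get(key) != checksum
    )
```

Archives that disappear from the listing are ignored here; whether deletions should propagate to the raw layer is a separate design decision.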

**Target Schema Pattern:**
```
raw__edxorg__s3__course_xml_<entity>
raw__openedx__s3__course_xml_<entity>
```
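
To keep table names consistent across assets, the pattern above could be generated by a small helper. This is only a sketch: `raw_table_name` and `KNOWN_SOURCES` are hypothetical names, and the entity validation rule is an assumption.

```python
import re

# Source systems named in the target schema pattern.
KNOWN_SOURCES = {"edxorg", "openedx"}


def raw_table_name(source_system, entity):
    """Build a raw table name following raw__<source>__s3__course_xml_<entity>."""
    if source_system not in KNOWN_SOURCES:
        raise ValueError(f"unknown source system: {source_system!r}")
    # Assumption: entities are lowercase snake_case identifiers.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", entity):
        raise ValueError(f"invalid entity name: {entity!r}")
    return f"raw__{source_system}__s3__course_xml_{entity}"
```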

### 2. dbt Staging Models

**Objective:** Create staging models to transform raw course XML data into clean, typed datasets

**Tasks:**
- [ ] Generate dbt source definitions
  - Use `bin/dbt-create-staging-models.py` utility to scaffold sources
  - Define source freshness checks
  - Document all source columns
- [ ] Create staging models
  - Build staging models in `src/ol_dbt/models/staging/edxorg/`
  - Apply consistent naming conventions (`stg__edxorg__s3__course_xml_*`)
  - Implement standard transformations:
    - Timestamp standardization (ISO 8601)
    - JSON parsing and flattening where appropriate
    - Deduplication logic
    - Type casting
  - Rename `course_id` and `block_id` to their semantic names (for example, `courserun_id` and `coursestructure_block_id`)
- [ ] Create model documentation
  - Document all columns in YAML
  - Add model descriptions
  - Include examples of use cases
- [ ] Add data quality tests
  - Unique/not null tests for key fields
  - Referential integrity checks
  - Value range validations
  - Freshness tests

**Staging Model Pattern:**
```sql
-- stg__edxorg__s3__course_xml_blocks.sql
with source as (
    select * from {{ source('ol_warehouse_raw_data', 'raw__edxorg__s3__course_xml_blocks') }}
)

, cleaned as (
    select
        course_id as courserun_id
        , block_id as coursestructure_block_id
        , block_type as coursestructure_block_type
        , block_title as coursestructure_block_title
        , {{ cast_timestamp_to_iso8601('retrieved_at') }} as coursestructure_retrieved_at
        -- , ... (remaining columns)
    from source
)

select * from cleaned
```

### 3. Integration with Existing Pipeline

**Tasks:**
- [ ] Update existing assets to output to raw layer
  - Modify `extract_edxorg_courserun_metadata` if needed
  - Ensure consistency with existing course structure processing
- [ ] Align with existing course_structure staging model
  - Review `stg__edxorg__s3__course_structure.sql`
  - Ensure new models complement existing structure
  - Update intermediate/mart models if needed
- [ ] Update orchestration schedules
  - Configure partitioning (by `course_id` and `source_system`)
  - Set appropriate refresh schedules
  - Handle backfilling for historical data
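
To illustrate the partitioning and backfill tasks, the following sketch composes a partition key from `source_system` and `course_id` and lists the partitions that still need to be materialized. It is deliberately framework-agnostic: the `|` delimiter, the function names, and the shape of the inputs are all assumptions, not the Dagster partition definitions used in the repository.

```python
def partition_key(source_system, course_id):
    """Compose a partition key; the "|" delimiter is an assumption."""
    return f"{source_system}|{course_id}"


def pending_partitions(course_ids_by_source, materialized):
    """List partition keys that have not been materialized yet.

    course_ids_by_source: mapping of source system -> iterable of course IDs.
    materialized: set of partition keys that were already processed.
    """
    pending = []
    for source_system in sorted(course_ids_by_source):
        for course_id in sorted(course_ids_by_source[source_system]):
            key = partition_key(source_system, course_id)
            if key not in materialized:
                pending.append(key)
    return pending
```

A backfill for historical data would then iterate over the returned keys, while the regular schedule only sees newly discovered course runs.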

## Technical Considerations

### XML Structure
Course XML archives contain:
- `course.xml` - root course definition
- `course/{run_tag}.xml` - course metadata
- `chapter/`, `sequential/`, `vertical/` - course structure
- `video/`, `problem/`, `html/` - content components
- `policies/` - course policies
- `about/` - course about pages
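
The layout above can be traversed by starting at `course.xml` and following `url_name` pointers into the per-type directories. The sketch below assumes an archive that has already been extracted to a local directory and that each pointer resolves to `<block_type>/<url_name>.xml` when such a file exists. Real exports can deviate from this (for example inline child definitions or draft content), so treat it as an illustration rather than a complete parser. `walk_course` and the row fields are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def walk_course(course_dir):
    """Return one dict per block reachable from course.xml, in document order."""
    course_dir = Path(course_dir)
    root = ET.parse(course_dir / "course.xml").getroot()
    rows = []
    _visit(course_dir, root, None, rows)
    return rows


def _visit(course_dir, element, parent_id, rows):
    block_type = element.tag
    block_id = element.get("url_name")
    # A pointer tag may have its full definition in <block_type>/<url_name>.xml.
    definition = course_dir / block_type / f"{block_id}.xml"
    if block_id and definition.exists():
        element = ET.parse(definition).getroot()
    rows.append(
        {
            "block_type": block_type,
            "block_id": block_id,
            "parent_block_id": parent_id,
            "block_title": element.get("display_name"),
        }
    )
    # Only children that carry a url_name are treated as blocks.
    for child in element:
        if child.get("url_name"):
            _visit(course_dir, child, block_id, rows)
```

Recording `parent_block_id` on each row keeps the block relationships queryable from a flat table without needing the original directory layout.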

### Data Volume
- Multiple course runs per course
- Potentially thousands of blocks per course
- Historical archives spanning multiple years
- Need for incremental processing

### Dependencies
- Requires access to S3 buckets with course archives
- May need coordination with existing archive retrieval jobs
- Should align with existing edxorg tracking log processing

## Success Criteria

- [ ] Course XML contents are loaded into raw layer tables
- [ ] dbt staging models provide clean, typed course structure data
- [ ] Documentation is complete and clear
- [ ] Data quality tests pass consistently
- [ ] Integration with existing pipeline is seamless
- [ ] Historical data can be backfilled
- [ ] New course archives are processed automatically

## References

- Existing code:
  - `packages/ol-orchestrate-lib/ol_orchestrate/assets/openedx_course_archives.py`
  - `packages/ol-orchestrate-lib/ol_orchestrate/lib/openedx.py`
  - `src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_structure.sql`
- Utility: `bin/dbt-create-staging-models.py`
- Documentation: `README.md` (dbt Staging Model Generation section)

## Labels

product:data-platform, enhancement, good-first-issue (for some sub-tasks)

