[FEATURE]: Add machine-readable JSON output for -out=report #2020

Open
x15sr71 wants to merge 4 commits into CCExtractor:master from x15sr71:feat/json-report

Conversation

@x15sr71
Contributor

@x15sr71 x15sr71 commented Jan 14, 2026

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.
  • I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

Summary

This PR implements machine-readable JSON output for the -out=report feature, addressing issue #1399. Users can now generate structured reports that can be parsed with tools like jq, enabling seamless integration with automated workflows.

Background

Currently, CCExtractor’s report output is human-readable text that requires custom parsing for automation. While other media analysis tools such as ffprobe and mediainfo provide JSON output, structured closed-caption reporting is not consistently available across tools or versions. This feature enables CCExtractor to expose its existing report data in a structured JSON format.

Use case: Users running CCExtractor in automated environments (e.g., CI/CD pipelines, media processing workflows) need to programmatically determine if streams contain closed captions without writing custom parsers.

Changes

-out=report Option

ccextractor -out=report input.ts

Existing Text Output (-out=report)

File: ../20251206ch29FullTS.ts
Stream Mode: Transport Stream
Program Count: 5
Program Numbers: 1 2 3 4 5
PID: 49, Program: 1, MPEG-2 video
PID: 52, Program: 1, AC3 audio
PID: 53, Program: 1, AC3 audio
PID: 65, Program: 2, MPEG-2 video
PID: 68, Program: 2, AC3 audio
PID: 81, Program: 3, MPEG-2 video
PID: 84, Program: 3, AC3 audio
PID: 97, Program: 4, MPEG-2 video
PID: 100, Program: 4, AC3 audio
PID: 113, Program: 5, MPEG-2 video
PID: 116, Program: 5, AC3 audio
//////// Program #5: ////////
DVB Subtitles: No
Teletext: No
ATSC Closed Caption: Yes
EIA-608: Yes
XDS: No
CC1: Yes
CC2: No
CC3: No
CC4: No
CEA-708: Yes
Services: 1 2 3 4 5 6 9
Primary Language Present: Yes
Secondary Language Present: Yes
Width: 704
Height: 480
Aspect Ratio: 03 - 16:9
Frame Rate: 04 - 29.97


(More programs omitted for brevity)

JSON Output Structure (v1.0)

The output follows a versioned JSON report structure:

JSON output via --report-format json

ccextractor --report-format json -out=report input.ts
{
  "schema": {
    "name": "ccextractor-report",
    "version": "1.0"
  },
  "input": {
    "source": "file",
    "path": "../20251206ch29FullTS.ts"
  },
  "stream": {
    "mode": "Transport Stream",
    "program_count": 5,
    "program_numbers": [
      1,
      2,
      3,
      4,
      5
    ],
    "pids": [
      {
        "pid": 49,
        "program_number": 1,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 52,
        "program_number": 1,
        "codec": "AC3 audio"
      },
      {
        "pid": 53,
        "program_number": 1,
        "codec": "AC3 audio"
      },
      {
        "pid": 65,
        "program_number": 2,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 68,
        "program_number": 2,
        "codec": "AC3 audio"
      },
      {
        "pid": 81,
        "program_number": 3,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 84,
        "program_number": 3,
        "codec": "AC3 audio"
      },
      {
        "pid": 97,
        "program_number": 4,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 100,
        "program_number": 4,
        "codec": "AC3 audio"
      },
      {
        "pid": 113,
        "program_number": 5,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 116,
        "program_number": 5,
        "codec": "AC3 audio"
      }
    ]
  },
  "programs": [
    {
      "program_number": 1,
      "summary": {
        "has_any_captions": true,
        "has_608": true,
        "has_708": true
      },
      "services": {
        "dvb_subtitles": false,
        "teletext": false,
        "atsc_closed_caption": true
      },
      "captions": {
        "present": true,
        "eia_608": {
          "present": true,
          "xds": false,
          "channels": {
            "cc1": true,
            "cc2": false,
            "cc3": false,
            "cc4": false
          }
        },
        "cea_708": {
          "present": true,
          "services": [
            1,
            2,
            3,
            4,
            5,
            6,
            9
          ]
        }
      },
      "video": {
        "width": 1920,
        "height": 1080,
        "aspect_ratio": "03 - 16:9",
        "frame_rate": "04 - 29.97"
      }
    },

(More programs omitted for brevity)

Schema Notes

  • The JSON schema is intentionally descriptive rather than prescriptive.
  • Field presence and values depend on the input container, stream type, and available metadata.
  • Codec strings reflect CCExtractor's internal stream type descriptions and are container-dependent (e.g., "AC3 audio" vs "AC3").
  • The services object under programs[] indicates which captioning systems are present (DVB, Teletext, ATSC), while captions.cea_708.services[] lists active CEA-708 caption service numbers.

Program Ordering:

  • JSON output: Programs are sorted in ascending order by program number (1, 2, 3, 4, 5) for predictable parsing
  • Text output: Programs are displayed in descending order (5, 4, 3, 2, 1) as they're processed
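
The ascending JSON ordering described above could be produced roughly as follows. This is a minimal sketch, not the code in this PR: `sorted_program_numbers` and `cmp_program_number` are hypothetical names, and the input is assumed to be a plain array of program numbers.

```c
#include <stdlib.h>
#include <string.h>

/* Comparator for qsort: ascending order of unsigned program numbers. */
static int cmp_program_number(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a;
    unsigned y = *(const unsigned *)b;
    return (x > y) - (x < y); /* avoids the overflow a plain subtraction could cause */
}

/*
 * Hypothetical helper: returns a newly allocated, ascending-sorted copy of
 * `programs`, leaving the original (processing) order untouched.
 * Returns NULL when count is 0 or allocation fails. The caller frees the result.
 */
unsigned *sorted_program_numbers(const unsigned *programs, size_t count)
{
    if (count == 0)
        return NULL;
    unsigned *copy = malloc(count * sizeof *copy);
    if (copy == NULL)
        return NULL;
    memcpy(copy, programs, count * sizeof *copy);
    qsort(copy, count, sizeof *copy, cmp_program_number);
    return copy;
}
```

Sorting a temporary copy keeps the text report's order unchanged; the copy must be released with free() once the JSON has been written.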

Text Output Field          | JSON Field
---------------------------+------------------------------------------
File:                      | input.path
Stream Mode                | stream.mode
Program Count              | stream.program_count
Program Numbers            | stream.program_numbers[]
PID: X, Program: Y, Codec  | stream.pids[]
DVB Subtitles              | programs[].services.dvb_subtitles
Teletext                   | programs[].services.teletext
ATSC Closed Caption        | programs[].services.atsc_closed_caption
EIA-608                    | programs[].captions.eia_608.present
XDS                        | programs[].captions.eia_608.xds
CC1..CC4                   | programs[].captions.eia_608.channels.*
CEA-708                    | programs[].captions.cea_708.present
Services:                  | programs[].captions.cea_708.services[]
Primary Language Present   | (not in JSON)
Secondary Language Present | (not in JSON)
Width / Height             | programs[].video.width / height
Aspect Ratio               | programs[].video.aspect_ratio
Frame Rate                 | programs[].video.frame_rate
MPEG-4 Timed Text          | container.mp4.timed_text_tracks
(JSON-only)                | schema.*
(JSON-only)                | programs[].summary.*
(JSON-only)                | programs[].captions.present

Key Features:

  • Structured, machine-readable JSON output for -out=report
  • Versioned schema (v1.0) for future extensibility
  • Backward compatible (existing text report remains the default)
  • Caption presence reporting for:
    • ATSC Closed Captions (EIA-608 / CEA-708)
    • DVB subtitles (presence flag)
    • Teletext (presence flag)
    • Note: the has_any_captions summary field includes all caption types (608/708/DVB/Teletext).
  • Program-level summary fields for fast closed-caption automation checks
  • PID and codec metadata per program (preserving CCExtractor’s existing codec string formats)
  • Guarded video metadata (emitted only when valid)
  • Multi-program stream support with deterministic ordering
  • Container-level metadata when available (e.g., MP4 timed text track count)

Technical Approach

  • JSON generation is implemented in C and reads CCExtractor’s existing internal data structures; caption extraction and decoding logic are not modified.
  • String values are escaped to ensure valid JSON output.
  • Format selection uses case-insensitive comparison (strcasecmp / _stricmp).
  • Memory allocation and cleanup follow existing project patterns.
  • Programs are sorted by program number to provide stable and predictable output.
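
The string-escaping step mentioned above could look roughly like this. It is a sketch under stated assumptions, not the PR's implementation: `json_write_string` is a hypothetical name, the report is assumed to be written through a `FILE *`, and the input is assumed to be valid UTF-8.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes `s` to `out` as a JSON string literal,
 * including the surrounding quotes. Escapes '"', '\\' and all control
 * characters below 0x20, which RFC 8259 requires. Bytes >= 0x80 are
 * passed through unchanged (input assumed to be valid UTF-8).
 */
void json_write_string(FILE *out, const char *s)
{
    fputc('"', out);
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
    {
        switch (*p)
        {
        case '"':  fputs("\\\"", out); break;
        case '\\': fputs("\\\\", out); break;
        case '\n': fputs("\\n", out); break;
        case '\r': fputs("\\r", out); break;
        case '\t': fputs("\\t", out); break;
        default:
            if (*p < 0x20)
                fprintf(out, "\\u%04x", (unsigned)*p); /* other control characters */
            else
                fputc(*p, out);
        }
    }
    fputc('"', out);
}
```

Escaping matters most for input.path, since file names may contain quotes or backslashes (Windows paths in particular).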

Example Testing Commands

# Test JSON output
ccextractor --report-format json -out=report sample.ts | jq .

# Verify caption presence
ccextractor --report-format json -out=report sample.ts | jq '.programs[0].summary.has_any_captions'

# Extract specific caption channels
ccextractor --report-format json -out=report sample.ts | jq '.programs[].captions.eia_608.channels'

# Check which CC channels are active
ccextractor --report-format json -out=report sample.ts | jq '.programs[].captions.eia_608.channels | to_entries | map(select(.value == true)) | .[].key'

# Get video dimensions
ccextractor --report-format json -out=report sample.ts | jq '.programs[].video | select(. != null) | {width, height}'

# Default text format still works
ccextractor -out=report sample.ts

Field Value Formats:

  • String values like aspect_ratio and frame_rate preserve CCExtractor's internal enum formatting (e.g., "03 - 16:9", "04 - 29.97")
  • This design choice maintains transparency and aids debugging
  • Users needing normalized values can post-process with simple string operations:
    jq '.programs[].video.aspect_ratio | split(" - ")[1]'
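
For consumers that read the report from C rather than jq, the same normalization can be sketched as below. `strip_enum_prefix` is a hypothetical helper, and it assumes the "NN - value" shape shown in the examples above.

```c
#include <string.h>

/*
 * Hypothetical helper: given a report string such as "03 - 16:9", returns a
 * pointer to the part after the first " - " separator. If the separator is
 * absent, the original string is returned unchanged. No memory is allocated;
 * the result points into `value`.
 */
const char *strip_enum_prefix(const char *value)
{
    const char *sep = strstr(value, " - ");
    return sep ? sep + 3 : value;
}
```

With the sample values above, "03 - 16:9" yields "16:9" and "04 - 29.97" yields "29.97".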

Benefits

  1. Automation-Friendly: Enables programmatic parsing without regex/custom parsers
  2. Familiar Structure: Uses JSON patterns similar to tools like ffprobe and mediainfo
  3. Extensible: Versioned schema to support future extensions
  4. Backward Compatible: Existing workflows continue to work unchanged
  5. Addresses Real Need: Solves a problem raised by multiple community members (issue #1399, "[PROPOSAL] - Structured data JSON output of ccextractor -out=report", and related discussions)
  6. Quick Caption Detection: Provides a has_any_captions summary field for fast caption-presence checks (608/708/DVB/Teletext)

Notes

  • Platform compatibility: uses strcasecmp on POSIX systems and maps to _stricmp on Windows via platform-specific preprocessor guards.
  • Video and container metadata are emitted conditionally when applicable
  • Temporary allocations used for program ordering are properly released
  • The implementation follows existing CCExtractor coding conventions
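
The platform guard described in the first note can be sketched as follows. The macro name, the enum, and `parse_report_format` are hypothetical, and treating "text" (or a missing argument) as the default format is an assumption; only "json" appears in this PR's examples.

```c
#include <string.h>
#if defined(_WIN32)
#define report_strcasecmp _stricmp /* MSVC declares _stricmp in <string.h> */
#else
#include <strings.h>               /* POSIX header that declares strcasecmp */
#define report_strcasecmp strcasecmp
#endif

/* Hypothetical result type for the --report-format argument. */
enum report_format
{
    REPORT_FORMAT_TEXT,
    REPORT_FORMAT_JSON,
    REPORT_FORMAT_UNKNOWN
};

/*
 * Hypothetical parser: matches the argument case-insensitively.
 * Assumption: a missing argument or "text" selects the default text report.
 */
enum report_format parse_report_format(const char *arg)
{
    if (arg == NULL || report_strcasecmp(arg, "text") == 0)
        return REPORT_FORMAT_TEXT;
    if (report_strcasecmp(arg, "json") == 0)
        return REPORT_FORMAT_JSON;
    return REPORT_FORMAT_UNKNOWN;
}
```

Returning a distinct "unknown" value lets the caller reject a mistyped format instead of silently falling back to text.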

@x15sr71
Contributor Author

x15sr71 commented Jan 16, 2026

I'm reverting the last commit (fix(report): guard JSON report cleanup to prevent test failures). I added it while investigating the Sample Platform failures involving --startcreditstext, but further testing showed that the conditional cleanup itself isn’t correct: freport needs to be reset unconditionally.
In local runs, --startcreditstext is parsed and logged correctly, but the text can still be dropped later, apparently depending on timing constraints and environment differences.

@x15sr71 x15sr71 changed the title from "feat(report): add machine-readable JSON output for -out=report" to "feat(report): Add machine-readable JSON output for -out=report" Jan 16, 2026
@x15sr71 x15sr71 changed the title from "feat(report): Add machine-readable JSON output for -out=report" to "[FEATURE]: Add machine-readable JSON output for -out=report" Jan 16, 2026
@x15sr71
Contributor Author

x15sr71 commented Jan 17, 2026

Follow-up: I’m continuing to investigate the Sample Platform failures separately. At this point, they don’t appear to be directly caused by the changes in this PR, but I’m still digging to be sure. I’ll update here once I have a clearer conclusion.

@cfsmp3
Contributor

cfsmp3 commented Jan 18, 2026

Thanks for this feature! The JSON output format looks well-designed and works correctly.

However, please rebase this PR on master. The branch is missing the fix from #2025 (merged Jan 17), which causes a segfault when using -out=report on files with AVC/H.264 video streams.

After rebasing:

  • The segfault on AVC streams will be fixed
  • The JSON report will work on all file types

Once rebased, this should be ready to merge.

@x15sr71
Contributor Author

x15sr71 commented Jan 18, 2026

Thanks for the review @cfsmp3! I've rebased on master and the AVC segfault fix from #2025 is now included. The JSON report now works correctly across all file types. Ready for final review.

Contributor

@cfsmp3 cfsmp3 left a comment

Deep Review Results - Issues Found

I tested the JSON output feature against 172 media files from our test suite. While the feature works well in many cases (166 files produced valid JSON), I found several issues that should be addressed.

Issue 1: Program Count Mismatch (25 files affected)

The JSON reports fewer programs than actually exist in multi-program streams. The program_count and program_numbers fields don't match what ffprobe reports.

Examples:

File | JSON Reports | FFprobe Shows
-----+--------------+--------------
96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts | 4 programs (0,155,192,193) | 6 programs (155,156,157,158,192,193)
36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg | 1 program (2030) | 10 programs (82,2000,2005,2010,2015,2020,2025,2030,2035,2090)
c6407fb294bf0f97a84e6a70aa2787dc4b13688645d9f2f2db50c754b5855bb6.mpg | 1 program (819) | 8 programs (817,818,819,820,821,830,831,832)
e92a1d4d2aabdca2f1a2cb7854316a6fdc539bc05d26c5a5aae89f21b697c780.mpg | 1 program (1346) | 7 programs (1344,1345,1346,1347,1348,1351,1352)

To reproduce:

./ccextractor 96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts -out=report --report-format json | jq '.stream.program_count, .stream.program_numbers'
# Returns: 4, [0,155,192,193]

ffprobe -v quiet -print_format json -show_programs 96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts | jq '[.programs[].program_num]'
# Returns: [155,156,157,158,192,193]

Suggestion: Either report ALL programs in the stream, or rename the field to caption_program_count to clarify it only includes programs with detected caption streams.
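
When re-testing, the gap between the two program lists can be computed mechanically. The following is only an illustrative sketch: `missing_programs` and `contains` are hypothetical helpers, not part of CCExtractor or ffprobe.

```c
#include <stdbool.h>
#include <stddef.h>

/* Linear search; program lists are short, so this is sufficient. */
static bool contains(const int *list, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == value)
            return true;
    return false;
}

/*
 * Hypothetical helper: copies into `missing` every entry of `expected`
 * (e.g. ffprobe's program numbers) that does not appear in `reported`
 * (the JSON report's program_numbers). `missing` must have room for
 * n_expected entries. Returns the number of entries written.
 */
size_t missing_programs(const int *expected, size_t n_expected,
                        const int *reported, size_t n_reported,
                        int *missing)
{
    size_t count = 0;
    for (size_t i = 0; i < n_expected; i++)
        if (!contains(reported, n_reported, expected[i]))
            missing[count++] = expected[i];
    return count;
}
```

For the first file in the table, comparing the ffprobe list (155,156,157,158,192,193) against the JSON list (0,155,192,193) gives three missing programs: 156, 157 and 158.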


Issue 2: has_any_captions Excludes DVB/Teletext

The field has_any_captions only considers EIA-608/CEA-708, not DVB subtitles or Teletext:

// src/lib_ccx/params_dump.c:464
bool has_any_captions = has_608 || has_708;

This produces confusing output:

{
  "has_any_captions": false,
  "teletext": true,
  "dvb_subtitles": true
}

Files demonstrating this issue:

  • 006fdc391aab432f9e379f6e55fa9fec3dc9b2fad67d4b284fc7f28f3984238f.mpg - has teletext but has_any_captions: false
  • 1020459a866fab62d0adc5c5518e1ffcc7b9f313d3af6a18ecd33d73d2eb8e05.ts - has DVB subtitles but has_any_captions: false
  • 36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg - has BOTH teletext AND DVB but has_any_captions: false

Suggestion: Either:

  1. Rename to has_608_708 to be explicit, OR
  2. Include DVB/Teletext: bool has_any_captions = has_608 || has_708 || has_teletext || has_dvb;
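
Option 2 could be sketched as below. `struct caption_flags` and its field names are hypothetical stand-ins for whatever per-program state params_dump.c actually tracks; only the boolean expression comes from the suggestion above.

```c
#include <stdbool.h>

/* Hypothetical per-program detection flags; names are illustrative only. */
struct caption_flags
{
    bool has_608;
    bool has_708;
    bool has_teletext;
    bool has_dvb;
};

/* True when any caption or subtitle system was detected for the program. */
bool has_any_captions(const struct caption_flags *f)
{
    return f->has_608 || f->has_708 || f->has_teletext || f->has_dvb;
}
```

With this definition, a program that carries only Teletext or only DVB subtitles no longer reports has_any_captions: false.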

Issue 3: Video Dimensions Detection Failure (1 file)

One file reports 0x0 for video dimensions when ffprobe shows 1920x1080:

File: af446fc78afeb80bbf1f329f93f205ca44cbbe635d547061932b3d1431806473.ts

./ccextractor af446fc78afeb80bbf1f329f93f205ca44cbbe635d547061932b3d1431806473.ts -out=report --report-format json | jq '.programs[0].video'
# Returns: {"width": 0, "height": 0, ...}

ffprobe -v quiet -print_format json -show_streams af446fc78afeb80bbf1f329f93f205ca44cbbe635d547061932b3d1431806473.ts | jq '.streams[] | select(.codec_type=="video") | {width, height}'
# Returns: {"width": 1920, "height": 1080}

What Works Well

  • JSON syntax is 100% valid across all 166 files
  • EIA-608/CEA-708 caption detection is accurate
  • Teletext and DVB subtitle stream detection works correctly
  • Stream mode detection (TS, PS, MP4, etc.) is accurate
  • Video codec identification is correct

Please address these issues. Happy to re-test once updates are made.

@x15sr71
Contributor Author

x15sr71 commented Feb 3, 2026

Thanks for the detailed review @cfsmp3!

I’ve addressed Issues 1 and 2 and verified the fixes using the sample files you referenced:

  • Program count/ordering is now based on PAT, so all programs are reported correctly.
    Verification:

    • 96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts: Now reports 6 programs
    • 36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg: Now reports 10 programs
    • c6407fb294bf0f97a84e6a70aa2787dc4b13688645d9f2f2db50c754b5855bb6.mpg: Now reports 8 programs
  • has_any_captions now includes DVB subtitles and Teletext in addition to 608/708.
    Verification:

    • 006fdc391aab432f9e379f6e55fa9fec3dc9b2fad67d4b284fc7f28f3984238f.mpg (program 1152): has_any_captions: true (has Teletext)
    • 36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg (program 2030): has_any_captions: true (has Teletext)

For Issue 3 (video dimensions), this reflects existing CCExtractor behavior: both the text and JSON reports show 0×0 for this file because dimensions aren't populated in the decoder context for certain H.264 packaging. The JSON report exposes the same state as the text report. Fixing this would require integrating parts of the --analyzevideo logic into the report pipeline, which has performance and design implications. I haven’t included that here, but I’d be happy to explore it in a follow-up if you think it’s worthwhile.

Note on DVB detection: For 1020459a...ts, ffprobe detects DVB subtitles, but CCExtractor doesn't associate that stream with cap_info (text report also shows "DVB Subtitles: No"), so services.dvb_subtitles remains false. This appears to be a pre-existing detection issue.

I’m happy to explore these further in a future schema version or follow-up PR if that would be useful. Please let me know if you’d like me to adjust anything further.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 032cd1c...:

Report Name | Tests Passed
------------+-------------
Broken      | 13/13
CEA-708     | 14/14
DVB         | 6/7
DVD         | 3/3
DVR-MS      | 2/2
General     | 27/27
Hardsubx    | 1/1
Hauppage    | 3/3
MP4         | 3/3
NoCC        | 10/10
Options     | 85/86
Teletext    | 21/21
WTV         | 13/13
XDS         | 34/34

Your PR breaks these cases:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --bom c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 032cd1c...:

Report Name | Tests Passed
------------+-------------
Broken      | 13/13
CEA-708     | 14/14
DVB         | 7/7
DVD         | 3/3
DVR-MS      | 2/2
General     | 27/27
Hardsubx    | 1/1
Hauppage    | 3/3
MP4         | 3/3
NoCC        | 10/10
Options     | 85/86
Teletext    | 21/21
WTV         | 13/13
XDS         | 34/34

Your PR breaks these cases:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@x15sr71 x15sr71 requested a review from cfsmp3 February 3, 2026 21:54