[FEATURE]: Add machine-readable JSON output for -out=report by x15sr71 · Pull Request #2020 · CCExtractor/ccextractor

x15sr71 · 2026-01-14T21:58:35Z

In raising this pull request, I confirm the following (please check boxes):

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.
I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

I have never used CCExtractor.
I have used CCExtractor just a couple of times.
I absolutely love CCExtractor, but have not contributed previously.
I am an active contributor to CCExtractor.

Summary

This PR implements machine-readable JSON output for the -out=report feature, addressing issue #1399. Users can now generate structured reports that can be parsed with tools like jq, enabling seamless integration with automated workflows.

Background

Currently, CCExtractor’s report output is human-readable text that requires custom parsing for automation. While other media analysis tools such as ffprobe and mediainfo provide JSON output, structured closed-caption reporting is not consistently available across tools or versions. This feature enables CCExtractor to expose its existing report data in a structured JSON format.

Use case: Users running CCExtractor in automated environments (e.g., CI/CD pipelines, media processing workflows) need to programmatically determine if streams contain closed captions without writing custom parsers.

Changes

`-out=report` Option

ccextractor -out=report input.ts

Existing Text Output (-out=report)

File: ../20251206ch29FullTS.ts
Stream Mode: Transport Stream
Program Count: 5
Program Numbers: 1 2 3 4 5
PID: 49, Program: 1, MPEG-2 video
PID: 52, Program: 1, AC3 audio
PID: 53, Program: 1, AC3 audio
PID: 65, Program: 2, MPEG-2 video
PID: 68, Program: 2, AC3 audio
PID: 81, Program: 3, MPEG-2 video
PID: 84, Program: 3, AC3 audio
PID: 97, Program: 4, MPEG-2 video
PID: 100, Program: 4, AC3 audio
PID: 113, Program: 5, MPEG-2 video
PID: 116, Program: 5, AC3 audio
//////// Program #5: ////////
DVB Subtitles: No
Teletext: No
ATSC Closed Caption: Yes
EIA-608: Yes
XDS: No
CC1: Yes
CC2: No
CC3: No
CC4: No
CEA-708: Yes
Services: 1 2 3 4 5 6 9
Primary Language Present: Yes
Secondary Language Present: Yes
Width: 704
Height: 480
Aspect Ratio: 03 - 16:9
Frame Rate: 04 - 29.97


(More programs omitted for brevity)

JSON Output Structure (v1.0)

The output follows a versioned JSON report structure:

JSON output via `--report-format json`

ccextractor --report-format json -out=report input.ts

{
  "schema": {
    "name": "ccextractor-report",
    "version": "1.0"
  },
  "input": {
    "source": "file",
    "path": "../20251206ch29FullTS.ts"
  },
  "stream": {
    "mode": "Transport Stream",
    "program_count": 5,
    "program_numbers": [
      1,
      2,
      3,
      4,
      5
    ],
    "pids": [
      {
        "pid": 49,
        "program_number": 1,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 52,
        "program_number": 1,
        "codec": "AC3 audio"
      },
      {
        "pid": 53,
        "program_number": 1,
        "codec": "AC3 audio"
      },
      {
        "pid": 65,
        "program_number": 2,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 68,
        "program_number": 2,
        "codec": "AC3 audio"
      },
      {
        "pid": 81,
        "program_number": 3,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 84,
        "program_number": 3,
        "codec": "AC3 audio"
      },
      {
        "pid": 97,
        "program_number": 4,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 100,
        "program_number": 4,
        "codec": "AC3 audio"
      },
      {
        "pid": 113,
        "program_number": 5,
        "codec": "MPEG-2 video"
      },
      {
        "pid": 116,
        "program_number": 5,
        "codec": "AC3 audio"
      }
    ]
  },
  "programs": [
    {
      "program_number": 1,
      "summary": {
        "has_any_captions": true,
        "has_608": true,
        "has_708": true
      },
      "services": {
        "dvb_subtitles": false,
        "teletext": false,
        "atsc_closed_caption": true
      },
      "captions": {
        "present": true,
        "eia_608": {
          "present": true,
          "xds": false,
          "channels": {
            "cc1": true,
            "cc2": false,
            "cc3": false,
            "cc4": false
          }
        },
        "cea_708": {
          "present": true,
          "services": [
            1,
            2,
            3,
            4,
            5,
            6,
            9
          ]
        }
      },
      "video": {
        "width": 1920,
        "height": 1080,
        "aspect_ratio": "03 - 16:9",
        "frame_rate": "04 - 29.97"
      }
    },

(More programs omitted for brevity)

Schema Notes

The JSON schema is intentionally descriptive rather than prescriptive.
Field presence and values depend on the input container, stream type, and available metadata.
Codec strings reflect CCExtractor's internal stream type descriptions and are container-dependent (e.g., "AC3 audio" vs "AC3").
The services object under programs[] indicates which captioning systems are present (DVB, Teletext, ATSC), while captions.cea_708.services[] lists active CEA-708 caption service numbers.

Program Ordering:

JSON output: Programs are sorted in ascending order by program number (1, 2, 3, 4, 5) for predictable parsing
Text output: Programs are displayed in descending order (5, 4, 3, 2, 1) as they're processed
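
The ascending order used for the JSON output can be illustrated with a small sketch. This is not the PR's actual code: `sort_program_numbers` and `compare_program_numbers` are hypothetical names, and the program list is assumed to be a plain array of unsigned integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparator for qsort: orders program numbers ascending.
   Uses two comparisons instead of (a - b), which could wrap around. */
static int compare_program_numbers(const void *lhs, const void *rhs)
{
    unsigned a = *(const unsigned *)lhs;
    unsigned b = *(const unsigned *)rhs;
    return (a > b) - (a < b);
}

/* Sorts a (hypothetical) array of program numbers in place. */
static void sort_program_numbers(unsigned *programs, size_t count)
{
    assert(programs != NULL || count == 0);
    if (count > 1)
        qsort(programs, count, sizeof(programs[0]), compare_program_numbers);
}
```

Sorting before emission means the JSON order no longer depends on the order in which programs happen to be processed, which is what makes indexing such as `.programs[0]` predictable.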

Text Output Field	JSON Field
File:	`input.path`
Stream Mode	`stream.mode`
Program Count	`stream.program_count`
Program Numbers	`stream.program_numbers[]`
PID: X, Program: Y, Codec	`stream.pids[]`
DVB Subtitles	`programs[].services.dvb_subtitles`
Teletext	`programs[].services.teletext`
ATSC Closed Caption	`programs[].services.atsc_closed_caption`
EIA-608	`programs[].captions.eia_608.present`
XDS	`programs[].captions.eia_608.xds`
CC1..CC4	`programs[].captions.eia_608.channels.*`
CEA-708	`programs[].captions.cea_708.present`
Services:	`programs[].captions.cea_708.services[]`
Primary Language Present	(not in JSON)
Secondary Language Present	(not in JSON)
Width / Height	`programs[].video.width / height`
Aspect Ratio	`programs[].video.aspect_ratio`
Frame Rate	`programs[].video.frame_rate`
MPEG-4 Timed Text	`container.mp4.timed_text_tracks`
(JSON-only)	`schema.*`
(JSON-only)	`programs[].summary.*`
(JSON-only)	`programs[].captions.present`

Key Features:

Structured, machine-readable JSON output for -out=report
Versioned schema (v1.0) for future extensibility
Backward compatible (existing text report remains the default)
Caption presence reporting for:
- ATSC Closed Captions (EIA-608 / CEA-708)
- DVB subtitles (presence flag)
- Teletext (presence flag)
- Note: the has_any_captions summary field includes all caption types (608/708/DVB/Teletext).
Program-level summary fields for fast closed-caption automation checks
PID and codec metadata per program (preserving CCExtractor’s existing codec string formats)
Guarded video metadata (emitted only when valid)
Multi-program stream support with deterministic ordering
Container-level metadata when available (e.g., MP4 timed text track count)

Technical Approach

JSON generation is implemented in C on top of CCExtractor’s existing internal data structures, without modifying caption extraction or decoding logic.
String values are escaped to ensure valid JSON output.
Format selection uses case-insensitive comparison (strcasecmp / _stricmp).
Memory allocation and cleanup follow existing project patterns.
Programs are sorted by program number to provide stable and predictable output.
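
To make the escaping point concrete, here is a minimal sketch of JSON string escaping in C. It is an illustration under stated assumptions, not the code from this PR: `json_escape` is a hypothetical helper, input is assumed to be UTF-8 (bytes at or above 0x80 are passed through unchanged), and the caller supplies the output buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes `src` into `dst` as the contents of a JSON string (without the
   surrounding quotes), escaping quotes, backslashes and control characters.
   Returns the number of bytes written (excluding the terminator), or -1 if
   `dst` is too small. Hypothetical helper, for illustration only. */
static int json_escape(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++)
    {
        unsigned char c = (unsigned char)*src;
        char buf[7]; /* longest escape is \u00XX plus the terminator */
        const char *piece = buf;
        size_t len;

        switch (c)
        {
            case '"':  piece = "\\\""; break;
            case '\\': piece = "\\\\"; break;
            case '\n': piece = "\\n";  break;
            case '\r': piece = "\\r";  break;
            case '\t': piece = "\\t";  break;
            default:
                if (c < 0x20)
                    snprintf(buf, sizeof(buf), "\\u%04x", (unsigned)c);
                else
                {
                    buf[0] = (char)c;
                    buf[1] = '\0';
                }
        }

        len = strlen(piece);
        if (out + len + 1 > dst_size)
            return -1; /* not enough room for this piece plus the terminator */
        memcpy(dst + out, piece, len);
        out += len;
    }

    if (out + 1 > dst_size)
        return -1;
    dst[out] = '\0';
    return (int)out;
}
```

Escaping matters most for `input.path`, where a Windows path such as `C:\media\sample.ts` contains backslashes that would otherwise produce invalid JSON.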

Example Testing Commands

# Test JSON output
ccextractor --report-format json -out=report sample.ts | jq .

# Verify caption presence
ccextractor --report-format json -out=report sample.ts | jq '.programs[0].summary.has_any_captions'

# Extract specific caption channels
ccextractor --report-format json -out=report sample.ts | jq '.programs[].captions.eia_608.channels'

# Check which CC channels are active
ccextractor --report-format json -out=report sample.ts | jq '.programs[].captions.eia_608.channels | to_entries | map(select(.value == true)) | .[].key'

# Get video dimensions
ccextractor --report-format json -out=report sample.ts | jq '.programs[].video | select(. != null) | {width, height}'

# Default text format still works
ccextractor -out=report sample.ts

Field Value Formats:

String values like aspect_ratio and frame_rate preserve CCExtractor's internal enum formatting (e.g., "03 - 16:9", "04 - 29.97")
This design choice maintains transparency and aids debugging
Users needing normalized values can post-process with simple string operations:
jq '.programs[].video.aspect_ratio | split(" - ")[1]'
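
For consumers that do not use jq, the same normalization is a single substring search. The sketch below assumes the `"NN - value"` layout shown above; `strip_enum_prefix` is a hypothetical name and is not part of CCExtractor.

```c
#include <assert.h>
#include <string.h>

/* Returns a pointer to the readable part of a value such as "03 - 16:9"
   (here "16:9"), or the whole string if no " - " separator is present.
   The result points into `value`; nothing is copied or modified. */
static const char *strip_enum_prefix(const char *value)
{
    static const char separator[] = " - ";
    const char *found;

    assert(value != NULL);
    found = strstr(value, separator);
    return found != NULL ? found + (sizeof(separator) - 1) : value;
}
```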

Benefits

Automation-Friendly: Enables programmatic parsing without regex/custom parsers
Familiar Structure: Uses JSON patterns similar to tools like ffprobe and mediainfo
Extensible: Versioned schema to support future extensions
Backward Compatible: Existing workflows continue to work unchanged
Addresses Real Need: Solves a problem raised by multiple community members (issue #1399, "[PROPOSAL] - Structured data JSON output of ccextractor -out=report", and related discussions)
Quick Caption Detection: Provides the has_any_captions summary field for fast caption checks across EIA-608 / CEA-708, DVB subtitles, and Teletext

Notes

Platform compatibility: uses strcasecmp on POSIX systems and maps to _stricmp on Windows via platform-specific preprocessor guards.
Video and container metadata are emitted conditionally when applicable
Temporary allocations used for program ordering are properly released
The implementation follows existing CCExtractor coding conventions
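
The platform guard described in the first note usually looks something like the sketch below. The enum, the function name `parse_report_format`, and the accepted value `"text"` are assumptions made for illustration; only `json` is confirmed by this PR description.

```c
#include <assert.h>
#include <string.h>

#ifdef _WIN32
/* MSVC has no strcasecmp; _stricmp (declared in <string.h>) is its equivalent. */
#define strcasecmp _stricmp
#else
#include <strings.h> /* POSIX declares strcasecmp here */
#endif

/* Hypothetical enum for the value passed to --report-format. */
enum report_format
{
    REPORT_FORMAT_TEXT,
    REPORT_FORMAT_JSON,
    REPORT_FORMAT_UNKNOWN
};

/* Maps the option value to a format, ignoring case ("json", "JSON", "Json"). */
static enum report_format parse_report_format(const char *value)
{
    assert(value != NULL);
    if (strcasecmp(value, "json") == 0)
        return REPORT_FORMAT_JSON;
    if (strcasecmp(value, "text") == 0) /* assumed name for the default format */
        return REPORT_FORMAT_TEXT;
    return REPORT_FORMAT_UNKNOWN;
}
```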

x15sr71 · 2026-01-16T03:31:01Z

I'm reverting the last commit (fix(report): guard JSON report cleanup to prevent test failures). I added it while investigating the Sample Platform failures involving --startcreditstext, but further testing showed that the conditional cleanup itself isn’t correct: freport needs to be reset unconditionally.
In local runs, --startcreditstext is parsed and logged correctly, but the text can still be dropped later, apparently depending on timing constraints and environment differences.

x15sr71 · 2026-01-17T19:14:44Z

Follow-up: I’m continuing to investigate the Sample Platform failures separately. At this point, they don’t appear to be directly caused by the changes in this PR, but I’m still digging to be sure. I’ll update here once I have a clearer conclusion.

cfsmp3 · 2026-01-18T02:49:03Z

Thanks for this feature! The JSON output format looks well-designed and works correctly.

However, please rebase this PR on master. The branch is missing the fix from #2025 (merged Jan 17), which causes a segfault when using -out=report on files with AVC/H.264 video streams.

After rebasing:

The segfault on AVC streams will be fixed
The JSON report will work on all file types

Once rebased, this should be ready to merge.

x15sr71 · 2026-01-18T13:39:43Z

Thanks for the review @cfsmp3! I've rebased on master and the AVC segfault fix from #2025 is now included. The JSON report now works correctly across all file types. Ready for final review.

cfsmp3

Deep Review Results - Issues Found

I tested the JSON output feature against 172 media files from our test suite. While the feature works well in many cases (166 files produced valid JSON), I found several issues that should be addressed.

Issue 1: Program Count Mismatch (25 files affected)

The JSON reports fewer programs than actually exist in multi-program streams. The program_count and program_numbers fields don't match what ffprobe reports.

Examples:

File	JSON Reports	FFprobe Shows
`96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts`	4 programs (0,155,192,193)	6 programs (155,156,157,158,192,193)
`36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg`	1 program (2030)	10 programs (82,2000,2005,2010,2015,2020,2025,2030,2035,2090)
`c6407fb294bf0f97a84e6a70aa2787dc4b13688645d9f2f2db50c754b5855bb6.mpg`	1 program (819)	8 programs (817,818,819,820,821,830,831,832)
`e92a1d4d2aabdca2f1a2cb7854316a6fdc539bc05d26c5a5aae89f21b697c780.mpg`	1 program (1346)	7 programs (1344,1345,1346,1347,1348,1351,1352)

To reproduce:

./ccextractor 96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts -out=report --report-format json | jq '.stream.program_count, .stream.program_numbers'
# Returns: 4, [0,155,192,193]

ffprobe -v quiet -print_format json -show_programs 96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts | jq '[.programs[].program_num]'
# Returns: [155,156,157,158,192,193]

Suggestion: Either report ALL programs in the stream, or rename the field to caption_program_count to clarify it only includes programs with detected caption streams.

Issue 2: `has_any_captions` Excludes DVB/Teletext

The field has_any_captions only considers EIA-608/CEA-708, not DVB subtitles or Teletext:

// src/lib_ccx/params_dump.c:464
bool has_any_captions = has_608 || has_708;

This produces confusing output:

{
  "has_any_captions": false,
  "teletext": true,
  "dvb_subtitles": true
}

Files demonstrating this issue:

006fdc391aab432f9e379f6e55fa9fec3dc9b2fad67d4b284fc7f28f3984238f.mpg - has teletext but has_any_captions: false
1020459a866fab62d0adc5c5518e1ffcc7b9f313d3af6a18ecd33d73d2eb8e05.ts - has DVB subtitles but has_any_captions: false
36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg - has BOTH teletext AND DVB but has_any_captions: false

Suggestion: Either:

Rename to has_608_708 to be explicit, OR
Include DVB/Teletext: bool has_any_captions = has_608 || has_708 || has_teletext || has_dvb;

Issue 3: Video Dimensions Detection Failure (1 file)

One file reports 0x0 for video dimensions when ffprobe shows 1920x1080:

File: af446fc78afeb80bbf1f329f93f205ca44cbbe635d547061932b3d1431806473.ts

./ccextractor af446fc78afeb80bbf1f329f93f205ca44cbbe635d547061932b3d1431806473.ts -out=report --report-format json | jq '.programs[0].video'
# Returns: {"width": 0, "height": 0, ...}

ffprobe -v quiet -print_format json -show_streams af446fc78afeb80bbf1f329f93f205ca44cbbe635d547061932b3d1431806473.ts | jq '.streams[] | select(.codec_type=="video") | {width, height}'
# Returns: {"width": 1920, "height": 1080}

What Works Well

JSON syntax is 100% valid across all 166 files
EIA-608/CEA-708 caption detection is accurate
Teletext and DVB subtitle stream detection works correctly
Stream mode detection (TS, PS, MP4, etc.) is accurate
Video codec identification is correct

Please address these issues. Happy to re-test once updates are made.

x15sr71 · 2026-02-03T20:08:15Z

Thanks for the detailed review @cfsmp3!

I’ve addressed Issues 1 and 2 and verified the fixes using the sample files you referenced:

Program count/ordering is now based on PAT, so all programs are reported correctly.
Verification:
- 96efd279cfa1dddcb1d7d38ecc5ebd6d870a661452c6480804c30a9896037336.ts: Now reports 6 programs
- 36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg: Now reports 10 programs
- c6407fb294bf0f97a84e6a70aa2787dc4b13688645d9f2f2db50c754b5855bb6.mpg: Now reports 8 programs
has_any_captions now includes DVB subtitles and Teletext in addition to 608/708.
Verification:
- 006fdc391aab432f9e379f6e55fa9fec3dc9b2fad67d4b284fc7f28f3984238f.mpg (program 1152): has_any_captions: true (has Teletext)
- 36d5eca53c56ac18e727badec449ac0f10096369f8a7eda5f7108f7170c5cc8c.mpg (program 2030): has_any_captions: true (has Teletext)
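
For readers unfamiliar with the PAT-based enumeration mentioned in the first fix, the sketch below lists the program numbers in a single PAT section, following the section layout in ISO/IEC 13818-1. It is not CCExtractor's implementation: `pat_list_programs` is a hypothetical function, it expects one complete section starting at table_id, and it does not verify the CRC. Entries with program_number 0 are skipped because they designate the network PID rather than a program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Collects program numbers from one complete PAT section.
   Returns the number of programs in the section, or -1 if it is malformed.
   Only the first `max_programs` entries are stored in `programs`. */
static int pat_list_programs(const uint8_t *section, size_t size,
                             uint16_t *programs, size_t max_programs)
{
    size_t section_length, end, pos;
    int count = 0;

    assert(section != NULL && programs != NULL);

    /* 8 header bytes + 4 CRC bytes at minimum; the PAT table_id is 0x00. */
    if (size < 12 || section[0] != 0x00)
        return -1;

    /* section_length is the low 12 bits of bytes 1..2 and counts everything
       after those bytes, including the trailing CRC_32. */
    section_length = ((size_t)(section[1] & 0x0F) << 8) | section[2];
    if (section_length < 9 || 3 + section_length > size)
        return -1;

    end = 3 + section_length - 4; /* stop before CRC_32 */

    /* Each entry is 4 bytes: program_number (16 bits), 3 reserved bits,
       then a 13-bit PID. */
    for (pos = 8; pos + 4 <= end; pos += 4)
    {
        uint16_t program_number = (uint16_t)((section[pos] << 8) | section[pos + 1]);

        if (program_number == 0)
            continue; /* network PID entry, not a program */
        if ((size_t)count < max_programs)
            programs[count] = program_number;
        count++;
    }
    return count;
}
```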

For Issue 3 (video dimensions), this reflects existing CCExtractor behavior: both the text and JSON reports show 0×0 for this file because dimensions aren't populated in the decoder context for certain H.264 packaging. The JSON report is exposing the same state as the text report. Fixing this would require integrating parts of the --analyzevideo logic into the report pipeline, which has performance and design implications. I haven’t included that here, but I’d be happy to explore it in a follow-up if you think it’s worthwhile.

Note on DVB detection: For 1020459a...ts, ffprobe detects DVB subtitles, but CCExtractor doesn't associate that stream with cap_info (text report also shows "DVB Subtitles: No"), so services.dvb_subtitles remains false. This appears to be a pre-existing detection issue.

I’m happy to explore these further in a future schema version or follow-up PR if that would be useful. Please let me know if you’d like me to adjust anything further.

ccextractor-bot · 2026-02-03T20:37:42Z

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 032cd1c...:

Report Name	Tests Passed
Broken	13/13
CEA-708	14/14
DVB	6/7
DVD	3/3
DVR-MS	2/2
General	27/27
Hardsubx	1/1
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	85/86
Teletext	21/21
WTV	13/13
XDS	34/34

Your PR breaks these cases:

ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
ccextractor --out=spupng c83f765c66...

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
ccextractor --bom c83f765c66..., Last passed: Never
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

ccextractor-bot · 2026-02-03T21:00:11Z

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 032cd1c...:

Report Name	Tests Passed
Broken	13/13
CEA-708	14/14
DVB	7/7
DVD	3/3
DVR-MS	2/2
General	27/27
Hardsubx	1/1
Hauppage	3/3
MP4	3/3
NoCC	10/10
Options	85/86
Teletext	21/21
WTV	13/13
XDS	34/34

Your PR breaks these cases:

ccextractor --out=spupng c83f765c66...

Congratulations: Merging this PR would fix the following tests:

ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2..., Last passed: Never
ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

x15sr71 force-pushed the feat/json-report branch from ab2cda5 to ca55c86 Compare January 14, 2026 23:46

x15sr71 force-pushed the feat/json-report branch from 1ac3c21 to 64355f0 Compare January 16, 2026 03:33

x15sr71 changed the title ~~feat(report): add machine-readable JSON output for -out=report~~ feat(report): Add machine-readable JSON output for -out=report Jan 16, 2026

x15sr71 changed the title ~~feat(report): Add machine-readable JSON output for -out=report~~ [FEATURE]: Add machine-readable JSON output for -out=report Jan 16, 2026

x15sr71 force-pushed the feat/json-report branch from 64355f0 to b0d6205 Compare January 18, 2026 13:25

cfsmp3 requested changes Jan 18, 2026

cfsmp3 mentioned this pull request Jan 31, 2026

Add JSON output format for file report (-out=report=json) #2070

Closed

10 tasks

x15sr71 added 3 commits February 4, 2026 00:26

feat(report): add machine-readable JSON output for -out=report

1618788

docs(changelog): mention JSON output support for -out=report

556392a

fix(report): address program count and caption summary issues in JSON output

f82c231

x15sr71 force-pushed the feat/json-report branch from b0d6205 to f82c231 Compare February 3, 2026 19:53

style(rust): format code with rustfmt

7b891b9

x15sr71 requested a review from cfsmp3 February 3, 2026 21:54

x15sr71 wants to merge 4 commits into CCExtractor:master from x15sr71:feat/json-report

3 participants
