@PeterStaar-IBM
Member

@PeterStaar-IBM PeterStaar-IBM commented Dec 1, 2025

feat:

  • add the list of unique tokens in the IDocTags
  • add serialization of nested lists in the IDocTags (not yet used)

fixes:

  • remove the content from captions if requested in the doctag serialization
  • remove the content from tables if requested in the doctag serialization
  • remove the content from nested lists if requested in the doctag serialization

nested lists

Currently we have for nested lists:

<ordered_list>
    <list_item>Item 1 in A</list_item>
    <list_item>Item 2 in A</list_item>
    <list_item>Item 3 in A</list_item>
    <list_item>
        <ordered_list>
            <list_item>Item 1 in B</list_item>
            <list_item>Item 2 in B</list_item>
            <list_item>
                <ordered_list>
                    <list_item>Item 1 in C</list_item>
                    <list_item>Item 2 in C</list_item>
                </ordered_list>
            </list_item>
            <list_item>Item 3 in B</list_item>
        </ordered_list>
    </list_item>
    <list_item>Item 4 in A</list_item>
</ordered_list>

Strictly speaking, this is XML/HTML compliant (assuming the mapping to <ul>/<ol>), but it has two problems:

  1. if we have location tokens, then strictly speaking we should also have them for the list item that "owns" the sublist, and that owner is not well defined
  2. we are using DocItem tags for concepts that are better captured by GroupItem
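
To make the first problem concrete, here is a minimal sketch that parses the current form with Python's standard `xml.etree.ElementTree` and counts the `list_item` elements that carry no text of their own and only wrap a sublist. The helper name `is_wrapper` is hypothetical and is not part of docling-core; these wrapper items are the ones for which a location token has no obvious meaning.

```python
import xml.etree.ElementTree as ET

# The "current" nested-list serialization from this PR description.
CURRENT = """
<ordered_list>
    <list_item>Item 1 in A</list_item>
    <list_item>Item 2 in A</list_item>
    <list_item>Item 3 in A</list_item>
    <list_item>
        <ordered_list>
            <list_item>Item 1 in B</list_item>
            <list_item>Item 2 in B</list_item>
            <list_item>
                <ordered_list>
                    <list_item>Item 1 in C</list_item>
                    <list_item>Item 2 in C</list_item>
                </ordered_list>
            </list_item>
            <list_item>Item 3 in B</list_item>
        </ordered_list>
    </list_item>
    <list_item>Item 4 in A</list_item>
</ordered_list>
"""


def is_wrapper(item):
    """Hypothetical helper: a list_item with no own text that only holds children."""
    has_text = bool((item.text or "").strip())
    return item.tag == "list_item" and not has_text and len(item) > 0


root = ET.fromstring(CURRENT.strip())
items = list(root.iter("list_item"))
wrappers = [el for el in items if is_wrapper(el)]
print(len(items), len(wrappers))  # 11 2
```

Two of the eleven `list_item` elements exist only to hold a sublist, which is the ambiguity described in point 1.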

So there are two solutions: either make it truly XML/HTML compliant,

<ordered_list>
    <list_item>Item 1 in A</list_item>
    <list_item>Item 2 in A</list_item>
    <list_item>Item 3 in A
        <ordered_list>
            <list_item>Item 1 in B</list_item>
            <list_item>Item 2 in B
                <ordered_list>
                    <list_item>Item 1 in C</list_item>
                    <list_item>Item 2 in C</list_item>
                </ordered_list>
            </list_item>
            <list_item>Item 3 in B</list_item>
        </ordered_list>
    </list_item>
    <list_item>Item 4 in A</list_item>
</ordered_list>
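
As a rough illustration of this first option, the sketch below rewrites the current form so that each sublist moves into the preceding `list_item`, which then owns it. It uses only the standard library; the function name `merge_wrappers` is hypothetical, and this is not how the serializer in this PR is implemented.

```python
import xml.etree.ElementTree as ET


def merge_wrappers(lst):
    """Move each sublist out of its text-less wrapper into the previous list_item."""
    prev = None
    for child in list(lst):  # iterate over a copy, since lst is mutated below
        for sub in list(child):
            if sub.tag == "ordered_list":
                merge_wrappers(sub)  # fix inner lists first
        text_less = not (child.text or "").strip()
        if child.tag == "list_item" and text_less and len(child) > 0 and prev is not None:
            prev.extend(list(child))  # the sublist becomes a child of the previous item
            lst.remove(child)         # drop the now redundant wrapper
        else:
            prev = child


# Compact input in the current form (no whitespace, to keep the output easy to compare).
src = (
    "<ordered_list>"
    "<list_item>Item 3 in A</list_item>"
    "<list_item><ordered_list><list_item>Item 1 in B</list_item></ordered_list></list_item>"
    "<list_item>Item 4 in A</list_item>"
    "</ordered_list>"
)
root = ET.fromstring(src)
merge_wrappers(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

With this shape, the item that owns a sublist is explicit, so a location token on that item has a clear meaning.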

or allow a list to be a direct child of a list:

<ordered_list>
    <list_item>Item 1 in A</list_item>
    <list_item>Item 2 in A</list_item>
    <list_item>Item 3 in A</list_item>
    <ordered_list>
        <list_item>Item 1 in B</list_item>
        <list_item>Item 2 in B</list_item>
            <ordered_list>
                <list_item>Item 1 in C</list_item>
                <list_item>Item 2 in C</list_item>
            </ordered_list>
        <list_item>Item 3 in B</list_item>
    </ordered_list>
    <list_item>Item 4 in A</list_item>
</ordered_list>

<ordered_list>
    <list_item>Item 1 in A</list_item>
    <list_item>Item 2 in A</list_item>
    <list_item_group>
        <list_item>Item 3 in A</list_item>
        <ordered_list>
            <list_item>Item 1 in B</list_item>
            <list_item>Item 2 in B</list_item>
                <ordered_list>
                    <list_item>Item 1 in C</list_item>
                    <list_item>Item 2 in C</list_item>
                </ordered_list>
            <list_item>Item 3 in B</list_item>
        </ordered_list>
    </list_item_group>
    <list_item>Item 4 in A</list_item>
</ordered_list>
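
For comparison, the list-as-child variant (without the `list_item_group` wrapper) can be derived from the current form by replacing each text-less wrapper `list_item` with the sublist it holds. This is a minimal sketch using the standard library; `hoist_sublists` is a hypothetical name and does not refer to code in this PR.

```python
import xml.etree.ElementTree as ET


def hoist_sublists(lst):
    """Replace each text-less wrapper list_item by the sublist(s) it contains."""
    for child in list(lst):  # iterate over a copy, since lst is mutated below
        text_less = not (child.text or "").strip()
        if child.tag == "list_item" and text_less and len(child) > 0:
            pos = list(lst).index(child)
            lst.remove(child)
            for offset, sub in enumerate(list(child)):
                hoist_sublists(sub)            # handle deeper nesting first
                lst.insert(pos + offset, sub)  # the sublist takes the wrapper's place


# Compact input in the current form (no whitespace, to keep the output easy to compare).
src = (
    "<ordered_list>"
    "<list_item>Item 3 in A</list_item>"
    "<list_item><ordered_list><list_item>Item 1 in B</list_item></ordered_list></list_item>"
    "<list_item>Item 4 in A</list_item>"
    "</ordered_list>"
)
root = ET.fromstring(src)
hoist_sublists(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Here the sublist is a sibling of the items, so no `list_item` exists without text, but ownership of the sublist is only implied by position.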

@dolfim-ibm, @vagenas: happy to get input on this ^^

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
@github-actions
Contributor

github-actions bot commented Dec 1, 2025

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Dec 1, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
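
The title rule above can be tried locally. This is a minimal sketch that assumes the `~=` operator behaves like a regular-expression search (here Python's `re.search`); the helper name `is_conventional` is hypothetical, and the sample titles are illustrative.

```python
import re

# Pattern copied from the merge-protection rule above.
PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title):
    """Hypothetical helper: True if the title starts with a conventional-commit prefix."""
    return PATTERN.search(title) is not None


print(is_conventional("feat: updating the idoctags serializer"))  # True
print(is_conventional("Dev/updating the idoctags serializer"))    # False
print(is_conventional("fix(serializer)!: drop caption content"))  # True
```

The match is case-sensitive, and the type must be followed by an optional scope, an optional `!`, and a colon.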

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@dosubot

dosubot bot commented Dec 1, 2025

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.

@codecov

codecov bot commented Dec 1, 2025

Codecov Report

❌ Patch coverage is 49.60630% with 64 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling_core/experimental/idoctags.py | 43.85% | 64 Missing ⚠️

@PeterStaar-IBM PeterStaar-IBM changed the title from "Dev/updating the idoctags serializer" to "feat: updating the idoctags serializer" Dec 1, 2025
…le list_item, no content in tables and captions

Member

@dolfim-ibm dolfim-ibm left a comment

lgtm

@PeterStaar-IBM PeterStaar-IBM merged commit 9a42a3c into main Dec 10, 2025
12 of 13 checks passed
@PeterStaar-IBM PeterStaar-IBM deleted the dev/updating-the-idoctags-serializer branch December 10, 2025 10:39
4 participants