Skip to content

Conversation

@multimeric
Copy link
Contributor

Adds an XsdImportEngine, with tests.

XSD doesn't map cleanly to LinkML, because:

  • XML has both attributes and child elements. I treat both as slots, but tag each with a keyword that indicates which one it is
  • The root element is a pseudo class, because it may enforce a specific structure. I resolve this by adding a RootElement class where necessary

@sierra-moxon sierra-moxon self-requested a review January 15, 2025 00:46
@sierra-moxon
Copy link
Member

Thank you @multimeric! I think there is a lot of interest in an XML -> LinkML importer.

@cmungall
Copy link
Member

Thank you!!

Re lack of isomorphism. See also https://stackoverflow.com/questions/191536/converting-xml-to-json-using-python which mentions a "standard" (for instance-level). If there is nothing more up to date, I suggest being consistent with this, such that xml --[xmltodict]--> json validates via [xml schema]-->linkml

Or at least an option to do this - for now just marking this in the docs is sufficient

@multimeric
Copy link
Contributor Author

multimeric commented Jan 15, 2025

Hmm that's an interesting suggestion. However the proposed solution is to prepend @ to attribute names such that:

<p id="1">text</p>

Becomes

{
  "p": {
    "@id": 1,
    "$": "text"
  }
}

I think this is a bit ugly, and also perhaps easily confused with JSON-LD, but I will implement it that way if you prefer.

I consider this scenario a bit different from standard XML to JSON conversion because here we have a schema that can describe which slots are attributes directly. So I'm wondering fi there's a good field in the LinkML SlotDefinition that would capture this (current I'm using keywords).

@cmungall
Copy link
Member

And now I reflect this you are right, this would make things quite complicated. Having the @ at the schema level isn't permitted in linkml, so there would need to be a mapping preserved... all quite ugly.

So scratch that, let's keep it in mind for future extensions.

Formally the way to do this in linkml would be to use conforms_to or instantiates

slots:
  id:
    conforms_to: xsd:attribute
  description:
    conforms_to: xsd:attribute

or

slots:
  id:
    instantiates: [xsd:attribute]
  description:
    instantiates: [xsd:attribute]

The former is stronger and could be used for validation if we later implement metaclasses for xml schema (see https://linkml.io/linkml/schemas/annotations.html#validation-of-annotations).

(a to be defined xml schema metamodel):

class:
  XmlAttribute:
    is_a: slot
    class_uri: xsd:attribute
    description: If a slot instantiates this then it should be mapped to an attribute when serializing to XML

However, if you want to keep your method flexible such that you can pass in a profile on the command line (e.g. using keywords) and satisfy your use case first that's valid!

@multimeric
Copy link
Contributor Author

Okay, I've gone with your second suggestion, as I like the idea that instantiates can support multiple different related schemas. Added tests to ensure that it's added correctly.

@multimeric multimeric requested a review from cmungall January 23, 2025 23:09
@cmungall
Copy link
Member

these test failures seem unrelated to the main content of your PR, but my guess is that updating the lock file introduced some upstream library change that modifies behavior

@multimeric
Copy link
Contributor Author

Ah no that's my fault, I used some type syntax that isn't compatible with Python 3.9. It should hopefully be fixed now.

@tfliss tfliss self-requested a review October 29, 2025 20:15
@tfliss tfliss self-assigned this Nov 19, 2025
@tfliss
Copy link

tfliss commented Nov 20, 2025

@multimeric I am starting a review of this PR. Do you have time to merge from linkml/schema-automator main and resolve the poetry.lock conflict?

I was able to do the merge on my local machine, so let me know if you'd prefer me to push that or to assist w/ poetry.lock. So far the tests seem to pass (there are a lot of errors in the logs that appear unrelated to your PR).
Also I was able to run the import on a sample xsd schema.

Thanks for adding this feature.

Copy link

@tfliss tfliss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a partial review. Most of the comments are suggestions, but the range capitalization and merge will be needed changes, so I'm submitting this review while putting together an example XSD to try to get a little more coverage.

One other thing that should be included is a brief documentation file that covers what functionality is included / intended and what isn't. This can be at a relatively high level, but will help call out to the user that it is only a partial implementation that is under development.

from schema_automator.importers import XsdImportEngine
import tempfile

def parse_string(xsd: str) -> SchemaView:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider making these XSD file strings fixture files.
Alternatively add a method to XsdImportEngine that can read an XSD string directly (or io.StringIO) if they're inline in the test on purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to add a convert_string method to the XsdImportEngine. I think I avoided making these XML documents separate files because it's more messy and it ends up being harder to understand the test.

Copy link

@tfliss tfliss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@multimeric Thank you for the quick poetry.lock update and also spending time on the f-strings.

I apologize for the extra round-trip, but please see my comment change/request on TYPE_MAP. I'm pretty sure the ranges need to be lower case. If you make just that change I'll approve and we can get this merged. There is a lot of interest in having this functionality and this is a great base to work from!

from schema_automator.importers.import_engine import ImportEngine

# Mapping between https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/datatypes.html#built-in-primitive-datatypes and https://linkml.io/linkml-model/latest/docs/#types
TYPE_MAP = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the values here need to be lower case. The linkml.io has them uppercase for some reason, but when I run the linter, and comparing to the tutorials, etc. the ranges should be lowercase. I changed this bit as follows locally and the lint errors for the ranges were resolved. More testing would be great, but if this change is made I'll accept the PR. I must have forgot to click 'add; on this comment yesterday, sorry about the delay.

uv run --no-project linkml lint some_schema.linkml.yaml

TYPE_MAP = {
    "anyURI": "uri",
    "boolean": "boolean",
    "date": "bate",
    "dateTime": "datetime",
    "decimal": "decimal",
    "double": "double",
    "float": "float",
    "string": "string",
    "time": "time",
    "int": "integer",
    "integer": "integer",
    # NMTOKEN is basically just a string inside an attribute
    "NMTOKEN": "string",
}

@tfliss tfliss linked an issue Dec 5, 2025 that may be closed by this pull request
@tfliss tfliss moved this to In Progress in Developer Days 2025 Dec 5, 2025
@tfliss tfliss requested a review from amc-corey-cox December 5, 2025 17:42
@tfliss tfliss requested review from tfliss December 12, 2025 22:37
Copy link

@tfliss tfliss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amc-corey-cox @sierra-moxon After the developer days side chat with Cory Levinson, we'd like to merge this PR as-is. I will open a separate issue for the remaining minor change I requested and Cory will start a document to plan additional unit test fixtures and roadmap. Chris M previously approved this PR but I think I unintentionally reset that.

cc @multimeric many thanks for this new feature.

@sierra-moxon sierra-moxon merged commit b4323ab into linkml:main Dec 12, 2025
11 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Developer Days 2025 Dec 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

Write an XML Schema importer

4 participants