155 changes: 155 additions & 0 deletions docs/api/extensions.md
@@ -0,0 +1,155 @@
# Extension Functions

linkml-map ships a curated set of safe built-in functions for use in
[expressions](expressions.md). When you need a function that isn't built in,
you can register your own — without forking linkml-map or wrapping it in a
custom Python harness — by tagging plain Python functions with
`@safe_function` and pointing the CLI at the file.

## Quick example

A user-supplied `my_helpers.py`:

```python
from linkml_map.utils.extensions import safe_function


@safe_function
def normalize_taxon_id(s: str) -> str | None:
    """Strip the 'NCBI:' prefix and pad to 8 digits."""
    if not s:
        return None
    raw = s.removeprefix("NCBI:").strip()
    return f"NCBI:{int(raw):08d}"
```

Then in a trans-spec:

```yaml
# required_extensions: my_helpers.py (convention; see below)
class_derivations:
  Organism:
    populated_from: SourceOrganism
    slot_derivations:
      tax_id:
        expr: "normalize_taxon_id(taxon)"
```

And at the command line:

```bash
linkml-map map-data -s schema.yaml -T transform.yaml \
    --functions ./my_helpers.py \
    data.tsv -o out.jsonl
```

The flag is repeatable: pass `--functions` (or the short form `-F`) once per
extension file.
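
Because an extension is plain Python, the helper can be exercised on its own before it is wired into a spec. The sketch below repeats `normalize_taxon_id` without the decorator (an assumption made so that it runs without linkml-map installed) and shows what it returns for a few illustrative inputs.

```python
from __future__ import annotations


def normalize_taxon_id(s: str) -> str | None:
    """Strip the 'NCBI:' prefix and pad to 8 digits."""
    if not s:
        return None
    raw = s.removeprefix("NCBI:").strip()
    return f"NCBI:{int(raw):08d}"


print(normalize_taxon_id("NCBI:9606"))  # NCBI:00009606
print(normalize_taxon_id(" 9606 "))     # NCBI:00009606
print(normalize_taxon_id(""))           # None

# Non-numeric input is not handled by the helper: int() raises ValueError.
try:
    normalize_taxon_id("NCBI:unknown")
except ValueError as err:
    print(f"ValueError: {err}")
```

Whether a malformed ID should raise or map to `None` is a choice for the extension author; the version in the quick example raises.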

## The `@safe_function` contract

Applying `@safe_function` is a **declaration by the author** that the function
is:

- **Pure** — no I/O, no network calls, no global state mutation
- **Bounded-time** — returns quickly on every input; it runs once per row in a transform
- **Deterministic** — same inputs produce same outputs

linkml-map **does not verify** these properties. The name "safe" reflects what
*you* are declaring about the function, not what linkml-map enforces. This is
the same posture as `typing.final` or `@SafeVarargs` in other ecosystems.

The trust model is identical to `pip install`: anything in a module you import
will run. If you're importing a third-party extension, treat it like any other
dependency.
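
Since nothing is verified for you, a quick smoke test before tagging a function can catch the obvious violations. The sketch below is a hypothetical helper, not part of linkml-map: `check_deterministic` calls a function several times on each sample and reports the samples whose results disagree. It cannot prove purity; it only flags functions that visibly depend on hidden state.

```python
from collections.abc import Callable, Iterable
from typing import Any


def check_deterministic(
    fn: Callable[[Any], Any], samples: Iterable[Any], repeats: int = 3
) -> list[Any]:
    """Return the samples for which repeated calls to fn disagree."""
    unstable = []
    for sample in samples:
        first = fn(sample)
        if any(fn(sample) != first for _ in range(repeats - 1)):
            unstable.append(sample)
    return unstable


def shout(s: str) -> str:
    return s.upper()  # pure: depends only on its argument


_calls = 0


def numbered(s: str) -> str:
    global _calls
    _calls += 1  # mutates global state, so the result changes on every call
    return f"{s}-{_calls}"


print(check_deterministic(shout, ["a", "b"]))     # []
print(check_deterministic(numbered, ["a", "b"]))  # ['a', 'b']
```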

## When NOT to use extensions

Extensions are not an escape hatch for putting transformation logic in Python.
They exist for **named atomic operations** that read cleaner as a name than as
an expression chain — `slugify(name)` instead of
`replace(replace(lower(strip(name)), ' ', '_'), ',', '')`.

If the function you're tempted to write is more than a few lines of pure
data manipulation, ask first whether it belongs in the trans-spec or in the
source/target schema. The declarative spec is the documentation of what the
transformation does; pulling logic out into Python hides it from review.

## Reserved names

A handful of names are injected per-call by the transformer (currently `slot`,
used inside expressions to reference a previously derived target slot). An
extension cannot define a function with one of these names — it would be
silently shadowed at evaluation time. `load_extensions` raises
`ExtensionError` on the attempt so the conflict shows up at load time
rather than as silent wrong behavior.
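
The shadowing follows from how the per-call namespace is built: extension names are merged first and `slot` is added last, so the later key wins. The sketch below models that merge and a load-time guard with stand-in names (`RESERVED_NAMES`, `check_reserved`, `build_namespace` and the local `ExtensionError` are assumptions for illustration, not linkml-map's code).

```python
RESERVED_NAMES = {"slot"}  # assumption: the one reserved name currently documented


class ExtensionError(Exception):
    """Stand-in for linkml_map.utils.extensions.ExtensionError."""


def check_reserved(functions: dict) -> None:
    """Reject extension names that the transformer injects itself."""
    clash = RESERVED_NAMES.intersection(functions)
    if clash:
        raise ExtensionError(f"reserved name(s): {sorted(clash)}")


def build_namespace(extensions: dict, tgt_attrs: dict) -> dict:
    # In a dict merge the later key wins, so the per-call `slot`
    # replaces any extension registered under the same name.
    return {**extensions, "slot": lambda name: tgt_attrs.get(name)}


ext = {"slot": lambda name: "from extension"}

# Without a guard, the extension's `slot` is silently replaced.
ns = build_namespace(ext, {"id": "X1"})
print(ns["slot"]("id"))  # X1

# With the guard, the conflict surfaces before any data is processed.
try:
    check_reserved(ext)
except ExtensionError as err:
    print(err)  # reserved name(s): ['slot']
```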

## Override semantics

A `@safe_function` may shadow a built-in if you explicitly say so:

```python
@safe_function(override=True)
def lower(s: str) -> str:
    return s.casefold()  # Unicode case folding (not locale-aware); replaces the built-in lower
```

- **Without `override=True`**: collision with a built-in raises `ExtensionError`
at load time — protects against accidental shadowing from a typo.
- **With `override=True` but no matching built-in**: logged as a warning (still
loaded) — useful as a typo catcher for the override case.
- **Collision between two extensions**: always an error. Pick one.

There is no CLI flag to enable overrides. The decision lives on the function
declaration, where the author is responsible for it.
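
The three rules above can be read as a small decision procedure. The following is a minimal model of them, assuming a toy `BUILTINS` set and a hypothetical `register` helper; it illustrates the documented behavior and is not the loader's actual implementation.

```python
import logging

logger = logging.getLogger(__name__)

BUILTINS = {"lower", "strip"}  # assumption: tiny stand-in for the built-in set


class ExtensionError(Exception):
    """Stand-in for linkml_map.utils.extensions.ExtensionError."""


def register(registry: dict, name: str, fn, override: bool = False) -> None:
    """Add fn to registry under name, applying the collision rules."""
    if name in registry:
        # Two extensions with the same name: always an error.
        raise ExtensionError(f"{name!r} is defined by two extensions")
    if name in BUILTINS and not override:
        # Accidental shadowing of a built-in.
        raise ExtensionError(f"{name!r} collides with a built-in; use override=True")
    if override and name not in BUILTINS:
        # Probably a typo in the function name: warn but still load.
        logger.warning("%r has override=True but no built-in matches", name)
    registry[name] = fn


registry: dict = {}
register(registry, "lower", str.casefold, override=True)  # intentional shadow
register(registry, "lowr", str.casefold, override=True)   # warning, still loaded

for name, fn in [("strip", str.strip), ("lower", str.lower)]:
    try:
        register(registry, name, fn)
    except ExtensionError as err:
        print(err)

print(sorted(registry))  # ['lower', 'lowr']
```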

## List-style functions

By default, scalar functions distribute over lists and propagate `None`
(`slugify([a, b, None])` → `[slugify(a), slugify(b), None]`). For functions
that legitimately accept a list as their first argument (aggregators, etc.),
opt out:

```python
@safe_function(distributes=False)
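def median(items: list[float]) -> float:
    sorted_items = sorted(items)
    mid = len(sorted_items) // 2
    if len(sorted_items) % 2:
        return sorted_items[mid]
    # Even length: average the two middle values.
    return (sorted_items[mid - 1] + sorted_items[mid]) / 2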
```
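
To make the default concrete, here is a rough sketch of what "distributes over lists and propagates `None`" means, written as a hypothetical `distributing` wrapper. It assumes distribution applies to the first argument only and is not taken from linkml-map's source.

```python
import functools


def distributing(fn):
    """Apply fn element-wise when the first argument is a list; pass None through."""

    @functools.wraps(fn)
    def wrapper(first, *args, **kwargs):
        if first is None:
            return None
        if isinstance(first, list):
            return [None if item is None else fn(item, *args, **kwargs) for item in first]
        return fn(first, *args, **kwargs)

    return wrapper


@distributing
def shout(s: str) -> str:
    return s.upper()


def total(items: list[float]) -> float:
    # Plays the role of distributes=False: it receives the whole list.
    return sum(items)


print(shout("a"))               # A
print(shout(["a", "b", None]))  # ['A', 'B', None]
print(shout(None))              # None
print(total([1, 2, 3]))         # 6
```

An aggregator wrapped this way would be called once per element and fail, which is why functions such as `median` opt out.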

## Required-extension convention

A trans-spec that references an extension function won't run without
`--functions`. The runtime error is clear (`Unknown function 'foo'. (If this
is a custom function, pass it via --functions <path>.)`), but it's still
runtime. Until linkml-map gains a declarative `required_extensions:` key, the
convention is to note the dependency in a header comment on the spec:

```yaml
# required_extensions:
# - my_helpers.py
#
id: https://example.org/my-transform
class_derivations:
...
```
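
Because the header is only a comment, nothing reads it automatically. If you want an earlier failure than the runtime error, a small pre-flight script can compare the header against the files you intend to pass. The sketch below is a hypothetical helper (the function names and the parsing rules are assumptions based on the comment layout shown above), not something linkml-map provides.

```python
from pathlib import Path


def required_extensions(spec_text: str) -> list[str]:
    """Collect file names listed under a leading '# required_extensions:' comment."""
    names: list[str] = []
    in_block = False
    for line in spec_text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line[1:].strip()
        if body.startswith("required_extensions:"):
            in_block = True
            inline = body.removeprefix("required_extensions:").strip()
            if inline:
                names.append(inline.split()[0])  # single-line form
        elif in_block and body.startswith("- "):
            names.append(body[2:].strip())  # list form
        elif in_block and body:
            in_block = False  # some other comment ends the block
    return names


def missing_extensions(spec_text: str, passed: list[str]) -> list[str]:
    """Return required file names that are absent from the passed paths."""
    given = {Path(p).name for p in passed}
    return [name for name in required_extensions(spec_text) if name not in given]


spec = """\
# required_extensions:
#   - my_helpers.py
#
id: https://example.org/my-transform
"""

print(missing_extensions(spec, ["./my_helpers.py"]))  # []
print(missing_extensions(spec, []))                   # ['my_helpers.py']
```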

## Programmatic use

Python callers can skip the CLI and set extensions directly on the transformer:

```python
from linkml_map.transformer.object_transformer import ObjectTransformer
from linkml_map.utils.extensions import load_extensions

ext = load_extensions(["./my_helpers.py"])
tr = ObjectTransformer(extension_functions=ext)
```

`extension_functions` accepts any `dict[str, Callable]`, so you can also bypass
the loader entirely and hand-build the dict if you prefer (skipping the
decorator-tagging step).
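
A hand-built mapping might look like the sketch below. The `initials` function and the sample values are illustrative, and the block stops short of constructing the transformer so that it runs without linkml-map installed.

```python
from __future__ import annotations

from collections.abc import Callable


def normalize_taxon_id(s: str) -> str | None:
    """Strip the 'NCBI:' prefix and pad to 8 digits."""
    if not s:
        return None
    raw = s.removeprefix("NCBI:").strip()
    return f"NCBI:{int(raw):08d}"


def initials(name: str) -> str:
    """Hypothetical second helper: first letter of each word, upper-cased."""
    return "".join(word[0].upper() for word in name.split())


# Keys are the names used inside `expr:` expressions.
extension_functions: dict[str, Callable] = {
    "normalize_taxon_id": normalize_taxon_id,
    "initials": initials,
}

# Passed as in the snippet above:
#   ObjectTransformer(extension_functions=extension_functions)
print(extension_functions["initials"]("ada lovelace"))     # AL
print(extension_functions["normalize_taxon_id"]("9606"))   # NCBI:00009606
```

A dict built this way never goes through `load_extensions`, so the load-time checks that function performs presumably do not run; a key named `slot` would still be replaced by the per-call `slot` function.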

## API reference

::: linkml_map.utils.extensions
7 changes: 7 additions & 0 deletions docs/api/functions.md
@@ -20,6 +20,13 @@ and null propagation.
| `len(items)` | Length of a list |
| `case(pairs...)` | Conditional — first matching `(condition, value)` pair |
| `uuid5(namespace, name)` | Deterministic UUID v5 generation |
| `slugify(s, separator="_")` | ASCII-fold + lowercase + collapse non-alphanumerics; `None` on no extractable content |
| `to_snake(s)` | Convert to `snake_case` |
| `to_camel(s)` | Convert to `camelCase` |
| `to_pascal(s)` | Convert to `PascalCase` |

For functions not in this list, see [Extension Functions](extensions.md) to
register your own.

## Unit Conversion

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -39,6 +39,7 @@ nav:
- Inference: api/inference.md
- Expressions: api/expressions.md
- Functions: api/functions.md
- Extensions: api/extensions.md
# - Subsetter: api/subsetter.md
- FAQ: faq.md
site_url: https://linkml.github.io/linkml-map/
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -16,13 +16,15 @@ dependencies = [
"duckdb>=1.1.0",
"flatten-dict>=0.4.2",
"graphviz>=0.20",
"inflection>=0.5.1",
"jinja2>=3",
"lark>=1",
"linkml>=1.11.0",
"linkml-runtime>=1.11.0",
"more-itertools>=10.0.0",
"pint>=0.20",
"pydantic>=2,<3",
"python-slugify>=8.0.4",
"pyyaml",
"simpleeval>=1.0.3",
"ucumvert>=0.2",
18 changes: 18 additions & 0 deletions src/linkml_map/cli/cli.py
@@ -21,6 +21,7 @@
from linkml_map.transformer.engine import transform_spec
from linkml_map.transformer.errors import TransformationError
from linkml_map.transformer.object_transformer import ObjectTransformer
from linkml_map.utils.extensions import ExtensionError, load_extensions
from linkml_map.writers import (
EXTENSION_FORMAT_MAP,
MultiStreamWriter,
@@ -128,6 +129,16 @@ def main(verbose: int, quiet: bool) -> None:
    default=None,
    help="Write the resolved (merged + filtered) spec to this file path as a side-effect.",
)
@click.option(
    "-F",
    "--functions",
    multiple=True,
    type=click.Path(exists=True, dir_okay=False),
    help=(
        "Python file containing functions tagged with ``@safe_function``. "
        "Their names are merged into the expression eval namespace. Repeatable."
    ),
)
@click.argument("input_data")
def map_data(
    input_data: str,
@@ -174,6 +185,13 @@ def map_data(
"""
logger.info(f"Transforming {input_data} conforming to {schema} using {transformer_specification}")

    function_paths = kwargs.pop("functions", ())
    if function_paths:
        try:
            kwargs["extension_functions"] = load_extensions(function_paths)
        except ExtensionError as err:
            raise click.ClickException(str(err)) from err

input_path = Path(input_data)

# Determine output format
14 changes: 12 additions & 2 deletions src/linkml_map/transformer/object_transformer.py
@@ -4,7 +4,7 @@

import json
import logging
-from collections.abc import Iterator, Mapping
+from collections.abc import Callable, Iterator, Mapping
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any
@@ -271,6 +271,16 @@ class ObjectTransformer(Transformer):
    object_index: ObjectIndex = None
    lookup_index: Any = None  # Optional[LookupIndex] — lazy import to avoid hard duckdb dep

    extension_functions: dict[str, Callable] = field(default_factory=dict)
    """Custom safe functions to merge into the expression eval namespace.

    Set by the CLI loader (from ``--functions`` files) or directly by Python
    callers. Names here are available inside ``expr:`` expressions alongside
    the built-ins. Per-call context functions (currently ``slot()``) take
    precedence; collisions with built-in functions are checked at load time by
    :func:`~linkml_map.utils.extensions.load_extensions`.
    """

    _warned_unbound_names: set[str] = field(default_factory=set, repr=False)
    """Names already warned about in non-strict mode.

@@ -396,7 +406,7 @@ def map_object(
)
tgt_attrs = {}
bindings = Bindings.from_context(self, context)
-expr_functions = {"slot": lambda name: tgt_attrs.get(name)}
+expr_functions = {**self.extension_functions, "slot": lambda name: tgt_attrs.get(name)}
for slot_deriv in class_deriv.slot_derivations.values():
with self._slot_error_context(slot_deriv, context):
tgt_attrs[str(slot_deriv.name)] = self._derive_slot(