Skip to content

add mypy-based validation, drop syntactic check on return type#536

Merged
eb8680 merged 27 commits intomasterfrom
kg-mypy-checks
Feb 6, 2026
Merged

add mypy-based validation, drop syntactic check on return type#536
eb8680 merged 27 commits intomasterfrom
kg-mypy-checks

Conversation

@kiranandcode
Copy link
Contributor

This PR updates the internals of EncodableCallable such that mypy based typechecking is done on the source code generated by the LLM. The code uses the ctx: Mapping[str, Any] to inject appropriate imports and stubs such that the code should type check.

This replaces and removes the syntactic check on the ast on the return type where previously we were just syntactically checking the annotations on the returned function which had no guarantee of correctness.

The core interface to the type checking is through this function in effectful.handlers.llm.type_checking:

def typecheck_source(
    module: ast.AST,
    ctx: typing.Mapping[str, Any],
    expected_params: list[type] | None,
    expected_return: type,
) -> None:
    """Type-check synthesized module code against expected signature and context.
    Builds a full source with prelude (ctx bindings as type stubs), the module body,
    and a postlude that assigns the function to an expected Callable type so mypy
    validates the signature. Raises ValueError with mypy output on type errors.
    """

Closes #535

Copy link
Contributor

@eb8680 eb8680 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, but I'm worried about soundness because bugs here will silently cause code generation to fail for no good reason.

It would be ideal if we could define one singledispatch-extensible function quote_type replacing/subsuming _type_to_annotation_str and _collect and an equation it satisfies together with eval/exec (and with the typechecker API), and have it live in effectful.internals.unification with tons of parameterized tests covering corner cases, similar to the other generic type-munging functions there.

@eb8680
Copy link
Contributor

eb8680 commented Feb 4, 2026

Does this also address #437?

@kiranandcode
Copy link
Contributor Author

Taking a pass through the comments now and working on this!

@kiranandcode
Copy link
Contributor Author

addressed the simple issues, I think as suggested it'd make sense to return an ast.functiondef or an ast instead of constructing strings plus testing a bit more systematically, so will do that now.

@kiranandcode kiranandcode requested a review from eb8680 February 5, 2026 22:12
@eb8680
Copy link
Contributor

eb8680 commented Feb 6, 2026

The notebook tests are still failing with what appears to be a bug in the type generation here. This suggests that our testing strategy for this PR is still flawed - why is this failure mode not showing up in any of the dozens of new unit tests? What can we do to be more confident that this machinery is sound and will not generate false negatives for arbitrary LLM generated code?

@kiranandcode
Copy link
Contributor Author

The notebook tests are still failing with what appears to be a bug in the type generation here. This suggests that our testing strategy for this PR is still flawed - why is this failure mode not showing up in any of the dozens of new unit tests? What can we do to be more confident that this machinery is sound and will not generate false negatives for arbitrary LLM generated code?

That's a fair point. The current implementation works by constructing a typing context prelude for the file from a lexical context by:
a) adding imports for all modules in the lexical context import typing, add imports for any values which can be imported from a module from typing import Any
b) adding variable declarations for any value in the lexical context i.e x: int using nested_type to get the type of the value.
b) adding stubs for any types that are local to the current file or runtime that will not be able to be imported

The reason for the bug above is that nested_type may reference a type that is not imported in step a), in the notebook the error is that a variable is assigned type x: MutableSequence[pathlib.Path], and so mypy complains that pathlib is not imported.

The fix I think is to restructure how the imports are generated - not just from the lexical context, but also from the union of all types of the values in the context (and recursively for parameters). As for why the current unit tests aren't catching them, I need to think of a good way to exhibit this behaviour in tests.

@eb8680
Copy link
Contributor

eb8680 commented Feb 6, 2026

The fix I think is to restructure how the imports are generated - not just from the lexical context, but also from the union of all types of the values in the context (and recursively for parameters).

This sounds pretty complicated. Maybe instead of trying to generate imports from usage we can just look at sys.modules for all the currently loaded modules?

@eb8680 eb8680 linked an issue Feb 6, 2026 that may be closed by this pull request
@kiranandcode
Copy link
Contributor Author

I think including sys.modules should fix this issue but the challenge is that it doesn't account for aliases like import numpy as np which might lead to some confusing issues, esp if the signature the llm is being asked to generate is using the alias.

@kiranandcode
Copy link
Contributor Author

ah but I see your point @eb8680 including sys.modules as well should ensure that all types of values would be present. Though sys.modules is really big.

@kiranandcode
Copy link
Contributor Author

Updated to use sys.modules which should robustly handle these issues, however, type checking is very slow, every file begins with a huge import block that mypy spends time thinking about.

@eb8680
Copy link
Contributor

eb8680 commented Feb 6, 2026

Can we prune the sys.modules list?

@kiranandcode
Copy link
Contributor Author

Hmm, I can't think of any way of pruning better than just only keeping the modules used in types of runtime values, which I guess would be equivalent to the union plan I mentioned initially.

@eb8680
Copy link
Contributor

eb8680 commented Feb 6, 2026

What about pruning automatically with ruff or isort?

@kiranandcode
Copy link
Contributor Author

@eb8680 oh yes! good point, we could totally do that!

@kiranandcode
Copy link
Contributor Author

@eb8680

Does this also address #437?

Yes, actually it does. The error we're running into test llm integration is specifically because the fixtures involve the llm generating a function generate_paragraph which overrides the generate_paragraph already in the context.

@eb8680
Copy link
Contributor

eb8680 commented Feb 6, 2026

The notebook is still broken with autoflake8. It seems like any reference to any value from any part of a module stops autoflake8 from deleting any unused imports from the rest of the module.

@kiranandcode
Copy link
Contributor Author

Tests pass!!!!!!!!

@kiranandcode
Copy link
Contributor Author

@eb8680 tests pass! (hopefully test/build too) re-reading your comments, I was also thinking about making type_to_ast type dispatched. Since the last iteration, I have updated the tests to be a lot more exhaustive on how it should behave. We could add some tests regarding how it should behave w.r.t eval and exec as well, and rename it to quote_type and place it into unification.

It would be ideal if we could define one singledispatch-extensible function quote_type replacing/subsuming _type_to_annotation_str and _collect and an equation it satisfies together with eval/exec (and with the typechecker API), and have it live in effectful.internals.unification with tons of parameterized tests covering corner cases, similar to the other generic type-munging functions there.

Copy link
Contributor

@eb8680 eb8680 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests covering typing.Annotated, which I suspect is not handled correctly

@kiranandcode
Copy link
Contributor Author

ah, will add!

Copy link
Contributor

@eb8680 eb8680 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure we'll find more edge cases as we use this, but I think it's good to merge for now.

@kiranandcode
Copy link
Contributor Author

Awesome, sounds good! Will merge once the last tests pass!

@eb8680 eb8680 merged commit 7b360e6 into master Feb 6, 2026
6 checks passed
@eb8680 eb8680 deleted the kg-mypy-checks branch February 6, 2026 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

2 participants