add mypy-based validation, drop syntactic check on return type by kiranandcode · Pull Request #536 · BasisResearch/effectful

kiranandcode · 2026-02-04T16:42:50Z

This PR updates the internals of EncodableCallable so that mypy-based type checking runs on the source code generated by the LLM. The code uses ctx: Mapping[str, Any] to inject the appropriate imports and stubs so that the generated code can type-check.

This replaces and removes the syntactic check on the return type. Previously we only inspected the annotations on the AST of the returned function, which gave no guarantee of correctness.

The core interface to the type checking is through this function in effectful.handlers.llm.type_checking:

def typecheck_source(
    module: ast.AST,
    ctx: typing.Mapping[str, Any],
    expected_params: list[type] | None,
    expected_return: type,
) -> None:
    """Type-check synthesized module code against expected signature and context.
    Builds a full source with prelude (ctx bindings as type stubs), the module body,
    and a postlude that assigns the function to an expected Callable type so mypy
    validates the signature. Raises ValueError with mypy output on type errors.
    """

Closes #535

eb8680

This is nice, but I'm worried about soundness because bugs here will silently cause code generation to fail for no good reason.

It would be ideal if we could define one singledispatch-extensible function quote_type replacing/subsuming _type_to_annotation_str and _collect and an equation it satisfies together with eval/exec (and with the typechecker API), and have it live in effectful.internals.unification with tons of parameterized tests covering corner cases, similar to the other generic type-munging functions there.
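To make the proposed equation concrete, here is a small sketch of what a singledispatch-based quote_type and its round-trip law with eval could look like. Only the name quote_type comes from the comment above; the handlers, the string return type, and the set of supported types are assumptions chosen for illustration.

```python
import fractions
import functools
import types
import typing


@functools.singledispatch
def quote_type(tp: object) -> str:
    """Return source text that should evaluate back to ``tp`` (sketch only)."""
    raise TypeError(f"don't know how to quote {tp!r}")


def _quote_alias(tp) -> str:
    # Parameterized generics: quote the origin and each argument recursively.
    origin = quote_type(typing.get_origin(tp))
    args = ", ".join(quote_type(arg) for arg in typing.get_args(tp))
    return f"{origin}[{args}]"


@quote_type.register(type)
def _quote_class(tp) -> str:
    # Some Python versions report list[int] as an instance of type, so guard.
    if typing.get_origin(tp) is not None:
        return _quote_alias(tp)
    if tp.__module__ == "builtins":
        return tp.__qualname__
    return f"{tp.__module__}.{tp.__qualname__}"


quote_type.register(types.GenericAlias, _quote_alias)

# The equation to test: evaluating the quoted text, in a namespace that holds
# the needed modules, gives back an equal type.
T = dict[str, list[fractions.Fraction]]
assert eval(quote_type(T), {"fractions": fractions}) == T
```

New kinds of types (unions, typing.Annotated, type aliases) would be added by registering further handlers, and each one gets the same round-trip test.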

effectful/handlers/llm/type_checking.py

eb8680 · 2026-02-04T23:51:39Z

Does this also address #437?

effectful/handlers/llm/type_checking.py

kiranandcode · 2026-02-05T14:31:01Z

Taking a pass through the comments now and working on this!

kiranandcode · 2026-02-05T15:48:14Z

Addressed the simple issues. As suggested, I think it makes sense to return an ast.FunctionDef (or another ast node) instead of constructing strings, and to test a bit more systematically, so I will do that now.

pyproject.toml

eb8680 · 2026-02-06T03:36:54Z

The notebook tests are still failing with what appears to be a bug in the type generation here. This suggests that our testing strategy for this PR is still flawed - why is this failure mode not showing up in any of the dozens of new unit tests? What can we do to be more confident that this machinery is sound and will not generate false negatives for arbitrary LLM generated code?

kiranandcode · 2026-02-06T04:26:51Z

> The notebook tests are still failing with what appears to be a bug in the type generation here. This suggests that our testing strategy for this PR is still flawed - why is this failure mode not showing up in any of the dozens of new unit tests? What can we do to be more confident that this machinery is sound and will not generate false negatives for arbitrary LLM generated code?

That's a fair point. The current implementation works by constructing a typing context prelude for the file from the lexical context by:
a) adding imports for all modules in the lexical context (e.g. import typing) and for any values that can be imported from a module (e.g. from typing import Any)
b) adding variable declarations for any value in the lexical context (e.g. x: int), using nested_type to get the type of the value
c) adding stubs for any types that are local to the current file or runtime and so cannot be imported
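The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: build_prelude and _importable are hypothetical names, plain type() stands in for nested_type, and anything that cannot be named simply is declared as typing.Any.

```python
import sys
import types
import typing


def _importable(value: type) -> bool:
    """True if the class can be fetched back from its defining module by name."""
    module = sys.modules.get(getattr(value, "__module__", None))
    return module is not None and getattr(module, value.__qualname__, None) is value


def build_prelude(ctx: typing.Mapping[str, typing.Any]) -> str:
    imports = ["import typing"]
    stubs: list[str] = []
    declarations: list[str] = []
    for name, value in ctx.items():
        if isinstance(value, types.ModuleType):
            # (a) modules: re-import them, keeping aliases such as "np"
            if value.__name__ == name:
                imports.append(f"import {name}")
            else:
                imports.append(f"import {value.__name__} as {name}")
        elif isinstance(value, type):
            if _importable(value):
                # (a) values that can be imported from a module
                alias = "" if value.__qualname__ == name else f" as {name}"
                imports.append(f"from {value.__module__} import {value.__qualname__}{alias}")
            else:
                # (c) local or runtime-only types: emit a stub
                stubs.append(f"class {name}: ...")
        else:
            # (b) ordinary values: declare a variable with its type
            tp = type(value)
            annotation = tp.__name__ if tp.__module__ == "builtins" else "typing.Any"
            declarations.append(f"{name}: {annotation}")
    return "\n".join(imports + stubs + declarations)
```

A declaration whose annotation names a type from a module that step (a) never imported is exactly the failure described next in the thread.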

The bug above arises because nested_type may reference a type that is not imported in step a). In the notebook, a variable is given the type x: MutableSequence[pathlib.Path], so mypy complains that pathlib is not imported.

The fix I think is to restructure how the imports are generated - not just from the lexical context, but also from the union of all types of the values in the context (and recursively for parameters). As for why the current unit tests aren't catching them, I need to think of a good way to exhibit this behaviour in tests.
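As a rough sketch of the "union of all types, recursively for parameters" idea, the helper below walks a type and its type arguments and collects the modules they come from. The name modules_for_type is hypothetical, and the example uses fractions.Fraction rather than pathlib.Path only to keep the demonstration independent of how a given Python version lays out pathlib internally.

```python
import fractions
import typing


def modules_for_type(tp: typing.Any) -> set[str]:
    """Collect the modules that define a type and, recursively, its parameters."""
    found: set[str] = set()
    origin = typing.get_origin(tp)
    if origin is not None:
        # Parameterized generic: visit the origin and every type argument.
        found |= modules_for_type(origin)
        for arg in typing.get_args(tp):
            found |= modules_for_type(arg)
    else:
        module = getattr(tp, "__module__", None)
        if module and module != "builtins":
            found.add(module)
    return found


needed = modules_for_type(dict[str, list[fractions.Fraction]])
import_lines = [f"import {name}" for name in sorted(needed)]
```

Taking the union of these sets over every value in the context would give the extra imports that the declarations in the prelude rely on.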

eb8680 · 2026-02-06T13:32:05Z

> The fix I think is to restructure how the imports are generated - not just from the lexical context, but also from the union of all types of the values in the context (and recursively for parameters).

This sounds pretty complicated. Maybe instead of trying to generate imports from usage we can just look at sys.modules for all the currently loaded modules?

kiranandcode · 2026-02-06T15:23:15Z

I think including sys.modules should fix this issue but the challenge is that it doesn't account for aliases like import numpy as np which might lead to some confusing issues, esp if the signature the llm is being asked to generate is using the alias.

kiranandcode · 2026-02-06T15:29:12Z

ah but I see your point @eb8680 including sys.modules as well should ensure that all types of values would be present. Though sys.modules is really big.

kiranandcode · 2026-02-06T15:41:20Z

Updated to use sys.modules, which should handle these issues robustly. However, type checking is now very slow: every file begins with a huge import block that mypy spends time processing.

eb8680 · 2026-02-06T15:42:06Z

Can we prune the sys.modules list?

kiranandcode · 2026-02-06T15:44:22Z

Hmm, I can't think of any way of pruning better than just only keeping the modules used in types of runtime values, which I guess would be equivalent to the union plan I mentioned initially.

eb8680 · 2026-02-06T15:46:07Z

What about pruning automatically with ruff or isort?
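To illustrate what pruning the generated import block means, here is a stdlib-only sketch that drops top-level imports whose bound names are never referenced. It is a deliberately simplified stand-in, not how ruff, isort, or autoflake work, and the name prune_unused_imports is hypothetical.

```python
import ast


def prune_unused_imports(source: str) -> str:
    """Remove top-level imports whose bound name is never referenced."""
    tree = ast.parse(source)
    # Every bare name in the module; attribute chains like a.b.c start from one.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    new_body = []
    for stmt in tree.body:
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            # "import a.b" binds "a"; "import a as b" binds "b".
            kept = [
                alias
                for alias in stmt.names
                if (alias.asname or alias.name.split(".")[0]) in used
            ]
            if not kept:
                continue  # nothing from this statement is used, so drop it
            stmt.names = kept
        new_body.append(stmt)
    tree.body = new_body
    return ast.unparse(tree)


bloated = "import os\nimport sys\nimport json as js\nx = js.dumps(sys.argv)\n"
pruned = prune_unused_imports(bloated)
```

This sketch ignores names that appear only inside string annotations and drops star imports, so a real tool is still the safer choice for arbitrary generated code.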

kiranandcode · 2026-02-06T16:34:05Z

@eb8680 oh yes! good point, we could totally do that!

kiranandcode · 2026-02-06T16:36:33Z

@eb8680

> Does this also address #437?

Yes, actually it does. The error we're running into in test llm integration occurs specifically because the fixtures involve the LLM generating a function generate_paragraph that overrides the generate_paragraph already in the context.
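The collision described here can be shown with a small check that lists top-level definitions in generated code that reuse a name already bound in the context. This is only an illustration of the failure mode; the thread does not say the PR performs such a check, and shadowed_names is a hypothetical helper.

```python
import ast
import typing


def shadowed_names(module: ast.Module, ctx: typing.Mapping[str, typing.Any]) -> set[str]:
    """Names defined at the top level of `module` that already exist in `ctx`."""
    defined = {
        stmt.name
        for stmt in module.body
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    return defined & set(ctx)


generated = ast.parse(
    "def generate_paragraph() -> str:\n"
    "    return ''\n"
    "\n"
    "def helper() -> None:\n"
    "    pass\n"
)
clashes = shadowed_names(generated, {"generate_paragraph": object()})
```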

effectful/handlers/llm/evaluation.py

eb8680 · 2026-02-06T18:25:36Z

The notebook is still broken with autoflake8. It seems like any reference to any value from any part of a module stops autoflake8 from deleting any unused imports from the rest of the module.

kiranandcode · 2026-02-06T19:34:14Z

Tests pass!!!!!!!!

kiranandcode · 2026-02-06T19:38:37Z

@eb8680 tests pass! (hopefully test/build too) re-reading your comments, I was also thinking about making type_to_ast type dispatched. Since the last iteration, I have updated the tests to be a lot more exhaustive on how it should behave. We could add some tests regarding how it should behave w.r.t eval and exec as well, and rename it to quote_type and place it into unification.

> It would be ideal if we could define one singledispatch-extensible function quote_type replacing/subsuming _type_to_annotation_str and _collect and an equation it satisfies together with eval/exec (and with the typechecker API), and have it live in effectful.internals.unification with tons of parameterized tests covering corner cases, similar to the other generic type-munging functions there.

eb8680

I don't see any tests covering typing.Annotated, which I suspect is not handled correctly
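For context on why typing.Annotated is a corner case: its type arguments mix a real type with arbitrary metadata objects, so code that recursively treats every argument as a type can go wrong. The sketch below shows one way to peel the metadata off before further processing; strip_annotated is a hypothetical helper and not part of this PR.

```python
import typing


def strip_annotated(tp: typing.Any) -> typing.Any:
    """Return the underlying type of an Annotated[...] form, or tp unchanged."""
    while typing.get_origin(tp) is typing.Annotated:
        # get_args gives (underlying_type, *metadata); keep only the type.
        tp = typing.get_args(tp)[0]
    return tp


Positive = typing.Annotated[int, "must be > 0"]
underlying = strip_annotated(Positive)
```

A parameterized test would pair each such input with the annotation text the type-quoting code is expected to emit for it.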

kiranandcode · 2026-02-06T20:18:16Z

ah, will add!

eb8680

I'm sure we'll find more edge cases as we use this, but I think it's good to merge for now.

kiranandcode · 2026-02-06T21:43:47Z

Awesome, sounds good! Will merge once the last tests pass!

kiranandcode added the module:llm label Feb 4, 2026

eb8680 reviewed Feb 4, 2026

effectful/handlers/llm/type_checking.py

eb8680 reviewed Feb 4, 2026

effectful/handlers/llm/type_checking.py

eb8680 reviewed Feb 4, 2026

effectful/handlers/llm/type_checking.py

eb8680 reviewed Feb 5, 2026

pyproject.toml

kiranandcode requested a review from eb8680 February 5, 2026 22:12

eb8680 linked an issue Feb 6, 2026 that may be closed by this pull request

effectful.handlers.llm.synthesis should type-check generated code #361

Closed

kiranandcode and others added 8 commits February 6, 2026 11:39

add mypy-based validation, drop syntactic check on return type

7c2299a

resurrects tests for test_handlers_llm_encoding.py

88b5eb9

switched to type error instead of value error

511b8dc

neatened up type checking tests

031f740

added mypy to llm dependencies

84c3dca

switched to use typing_extensions TypeAliasType

fd71991

moved type checking to an operation

de6d2ff

refined exception guard and switched to any instead of skipping bindings with un-representable types

9e3b7cc

kiranandcode and others added 5 commits February 6, 2026 11:39

minor bug

bb5a13e

ruff formatting

8f56e4b

updated imports to include sys.modules

a8c0e92

restricted sys.modules in imports and suppressed warnings on importing untyped modules

a13f453

added ignore to ast.FunctionDef

2e86880

kiranandcode force-pushed the kg-mypy-checks branch from df8bd01 to 2e86880 Compare February 6, 2026 16:39

kiranandcode added 2 commits February 6, 2026 11:47

updated to use ruff to clean up generated code and avoid redundant mypy lag

2b04109

updated codeadapt fixture

2a847f2

eb8680 reviewed Feb 6, 2026

effectful/handlers/llm/evaluation.py

kiranandcode added 6 commits February 6, 2026 12:27

added --unsafe-fixes to ruff invocation

3d99205

updated prompt

6999c36

format notebook

4f14c3f

format notebook

3d1d814

switched from ruff to autoflake

918e899

fix for __init__

7db4e03

kiranandcode added 2 commits February 6, 2026 14:03

fixed test with __init__ returning None

b2dcaf1

final fixes

02eba96

eb8680 reviewed Feb 6, 2026

datvo06 mentioned this pull request Feb 6, 2026

Synthesized Callable Name shouldn't matter #542

Closed

added type_checking

1b1501b

eb8680 approved these changes Feb 6, 2026

eb8680 merged commit 7b360e6 into master Feb 6, 2026
6 checks passed

eb8680 deleted the kg-mypy-checks branch February 6, 2026 21:47

This was referenced Feb 6, 2026

Adding mypy check and test #422

Closed

Drop attempts to shadow lexical definitions in LLM generated code #437

Closed
