Add Langfuse tracing for LLM call observability#19
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe changes introduce Langfuse tracing integration to the benchmark system. A new tracing module conditionally enables tracing based on environment variables, while the LLM client and executors integrate this facility to observe calls and flush data at key lifecycle points. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI Command
participant Executor as Executor
participant Client as LLM Client
participant Tracing as Tracing Module
participant Langfuse as Langfuse Service
CLI->>Executor: Start benchmark/quest
Executor->>Client: Initialize client
Client->>Tracing: Check is_enabled()
Tracing->>Tracing: Load LANGFUSE_SECRET_KEY from env
loop During execution
CLI->>Client: Request LLM call
Client->>Langfuse: observe() wraps the call
Langfuse->>Langfuse: Buffer trace data
Client-->>CLI: Return result
end
CLI->>Executor: Complete execution
Executor->>Tracing: Call flush()
alt Tracing enabled
Tracing->>Langfuse: Flush buffered trace data
Langfuse-->>Tracing: Acknowledgement
else Tracing disabled
Tracing->>Tracing: No-op return
end
Executor-->>CLI: Done
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request integrates Langfuse tracing into the benchmark suite, allowing for automatic LLM call tracing when the appropriate environment variables are set. The changes include adding configuration templates, a new tracing utility module, and applying the observe decorator to LLM client completion methods. Additionally, the PR ensures that traces are flushed after benchmark runs. One improvement was suggested regarding the implementation of the tracing enablement check to use functools.lru_cache for better thread-safety and more idiomatic code.
| import os | ||
|
|
||
| _enabled: bool | None = None | ||
|
|
||
|
|
||
| def is_enabled() -> bool: | ||
| global _enabled | ||
| if _enabled is None: | ||
| _enabled = bool(os.getenv("LANGFUSE_SECRET_KEY")) | ||
| return _enabled |
There was a problem hiding this comment.
The current implementation of is_enabled uses a global variable for caching. This is not thread-safe and is a bit verbose.
You can achieve the same result in a more concise and thread-safe way using functools.lru_cache. This is the standard Python way to memoize a function's result.
I suggest replacing the global variable and the function with a cached function.
| import os | |
| _enabled: bool | None = None | |
| def is_enabled() -> bool: | |
| global _enabled | |
| if _enabled is None: | |
| _enabled = bool(os.getenv("LANGFUSE_SECRET_KEY")) | |
| return _enabled | |
| import os | |
| from functools import lru_cache | |
| @lru_cache(maxsize=None) | |
| def is_enabled() -> bool: | |
| """Check if Langfuse tracing is enabled. Caches the result.""" | |
| return bool(os.getenv("LANGFUSE_SECRET_KEY")) | |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
llm_quest_benchmark/executors/benchmark.py (1)
304-309:⚠️ Potential issue | 🟡 MinorFlush won't run if artifact write fails — wrap in
try/finallyfor parity withcommands.py.In
commands.py::run,tracing.flush()runs in afinallyblock so traces are delivered even on failure. Here, if_write_benchmark_artifactsraises (disk full, permissions, bad JSON serialization of a result), the flush is skipped and an entire benchmark's worth of generations may be dropped. A smalltry/finallyaround the artifact + flush block keeps the two call sites consistent.🔧 Proposed fix
- artifact_dir = _write_benchmark_artifacts(config, results) - if artifact_dir: - logger.info("Benchmark artifacts saved to %s", artifact_dir) - - tracing.flush() - return results + try: + artifact_dir = _write_benchmark_artifacts(config, results) + if artifact_dir: + logger.info("Benchmark artifacts saved to %s", artifact_dir) + return results + finally: + tracing.flush()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm_quest_benchmark/executors/benchmark.py` around lines 304 - 309, The artifact write and trace flush need to be paired so traces are delivered even if writing artifacts fails; wrap the calls that invoke _write_benchmark_artifacts (and the subsequent logger.info("Benchmark artifacts saved to %s", artifact_dir)) in a try/finally where tracing.flush() runs in the finally block (mirroring commands.py::run) so flush always executes regardless of errors from _write_benchmark_artifacts.llm_quest_benchmark/llm/client.py (1)
401-422:⚠️ Potential issue | 🟠 MajorAnthropic generation trace won't carry
modelor token usage — test plan requirements won't be met.The PR test plan states: "confirm token counts and model names appear in traces". With
@observe(as_type="generation", name="anthropic-completion")alone, Langfuse v2 captures the function's input args and return value only; it does not auto-populate the generation'smodel,inputmessages, orusagefields for Anthropic (unlike OpenAI/LangChain/LlamaIndex integrations which are auto-captured). These fields must be set explicitly vialangfuse_context.update_current_observation(...)inside the function. As written, Anthropic traces will show empty model/usage in the UI.Additionally, this decorator wraps
_with_retries(_call), so on retry all attempts are folded into a single generation observation — the finalusagerecorded via_record_usagewon't make it to Langfuse at all.🔧 Suggested fix — populate generation fields before returning
+from llm_quest_benchmark.llm.tracing import is_enabled as _langfuse_enabled +if _langfuse_enabled(): + from langfuse.decorators import langfuse_context +else: + langfuse_context = None # type: ignore[assignment] ... `@observe`(as_type="generation", name="anthropic-completion") def get_completion(self, prompt: str) -> str: """Get a completion from the model.""" def _call() -> str: response = self._get_client().messages.create(...) usage = _get_attr_or_key(response, "usage") prompt_tokens = _coerce_int(_get_attr_or_key(usage, "input_tokens", 0) or 0) completion_tokens = _coerce_int(_get_attr_or_key(usage, "output_tokens", 0) or 0) self._record_usage(prompt_tokens, completion_tokens) + if langfuse_context is not None: + langfuse_context.update_current_observation( + model=self.model_id, + input=[{"role": "user", "content": prompt}], + usage={ + "input": prompt_tokens, + "output": completion_tokens, + "total": prompt_tokens + completion_tokens, + }, + ) ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm_quest_benchmark/llm/client.py` around lines 401 - 422, The Anthropic generation observation isn't being populated or updated per attempt; inside get_completion's inner _call (the function passed to _with_retries) call langfuse_context.update_current_observation(...) to set the generation fields (model, input/messages, and usage) before returning so traces contain model name and token counts, and update it after calculating prompt_tokens/completion_tokens (use _coerce_int outputs and the same keys your Langfuse schema expects) so each retry attempt updates the current observation; ensure you reference get_completion, _call, _with_retries, _record_usage and langfuse_context.update_current_observation when applying the change.
🧹 Nitpick comments (5)
pyproject.toml (1)
33-33: Consider makinglangfusean optional extra instead of a hard dep.The PR goal is opt-in tracing with "zero overhead when Langfuse env vars are not set", but this makes Langfuse (and its transitive deps incl. pydantic/httpx pins, opentelemetry in v3+) a mandatory install for every user. A
[project.optional-dependencies]extra (e.g.,tracing = ["langfuse>=2.0.0,<3.0.0"]) with lazy imports insidellm/tracing.pyand the conditional block inclient.pywould be more aligned with that intent.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 33, The dependency declaration makes langfuse a hard requirement; change it to an optional extra by moving "langfuse>=2.0.0" out of the main dependency list in pyproject.toml and into [project.optional-dependencies] (e.g., tracing = ["langfuse>=2.0.0,<3.0.0"]); then update code that references langfuse (lazy import it inside llm/tracing.py and guard usage in client.py—where the conditional block enabling tracing runs) to only import and initialize langfuse when tracing is enabled or env vars are present, keeping runtime behavior unchanged for users who don’t opt into the extra.llm_quest_benchmark/llm/client.py (2)
26-36: No-opobservesignature doesn't fully match real decorator — works here but brittle.
langfuse.decorators.observecan be used both as@observe(bare) and@observe(...). Your fallback only handles the parenthesized form (observe(**_kwargs)). This is fine for the single current call site (@observe(as_type="generation", ...)) but will silently break if someone later writes@observewithout parens. A two-mode fallback avoids the footgun:🧹 Optional robustness tweak
- def observe(**_kwargs): # type: ignore[misc] - def _noop(fn): - return fn - - return _noop + def observe(*args, **_kwargs): # type: ignore[misc] + # Support both `@observe` and `@observe`(...). + if len(args) == 1 and callable(args[0]) and not _kwargs: + return args[0] + + def _noop(fn): + return fn + + return _noop🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm_quest_benchmark/llm/client.py` around lines 26 - 36, The no-op fallback for langfuse's observe decorator only supports the parenthesized form and will break if someone uses `@observe` without parens; update the fallback returned by the else branch (the observe function defined when _langfuse_enabled() is false) to accept being used both as a bare decorator and as a decorator factory: detect if the first argument is a callable (i.e., a function/class) and return it directly, otherwise return a decorator that wraps the target callable — keep the symbol names observe, _noop (or similar), and the surrounding logic with the OpenAI import intact so the behavior mirrors langfuse.decorators.observe for both `@observe` and `@observe`(...).
21-37: Module-levelload_dotenv+ import-time Langfuse probe couples side effects to any importer.
load_dotenvnow runs unconditionally onimport llm_quest_benchmark.llm.client, and the Langfuse decision is frozen at import time. Two consequences:
- Any test / tool that imports this module will mutate
os.environfrom the on-disk.env(even when callers explicitly didn't want it). The CLI already callsload_dotenvincommands.py; duplicating it here makes import order matter.- If a caller sets
LANGFUSE_SECRET_KEYafter import (e.g., tests), theopenai.OpenAIvslangfuse.openai.OpenAIbranch is baked in and can no longer flip.Consider pushing both
load_dotenvand the import-time branching into a lazy initializer invoked from_get_client()/AnthropicClient._get_client(), so the module stays side-effect-free.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm_quest_benchmark/llm/client.py` around lines 21 - 37, The module currently runs load_dotenv and probes _langfuse_enabled at import time, causing side effects and freezing which OpenAI implementation (openai.OpenAI vs langfuse.openai.OpenAI) is used; move the dotenv load and the conditional imports into the lazy initializer used to create clients (i.e., perform load_dotenv(...) and then evaluate _langfuse_enabled() inside _get_client() and/or AnthropicClient._get_client()), and import or define observe/OpenAI there so the choice is made at call time rather than import time, ensuring the module stays side-effect-free and respects environment changes performed after import..env.template (1)
34-36: Nit: reorder Langfuse keys to satisfy dotenv-linter.
dotenv-linterflagsLANGFUSE_*as unordered. Easy fix:🧹 Proposed reorder
-LANGFUSE_SECRET_KEY= -LANGFUSE_PUBLIC_KEY= -LANGFUSE_HOST=https://cloud.langfuse.com +LANGFUSE_HOST=https://cloud.langfuse.com +LANGFUSE_PUBLIC_KEY= +LANGFUSE_SECRET_KEY=🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.env.template around lines 34 - 36, dotenv-linter reports the LANGFUSE_* entries as unordered; reorder the three variables LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY into alphabetical order (LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) in .env.template so the linter passes.llm_quest_benchmark/llm/tracing.py (1)
5-12:_enabledcaching hides late-binding env changes.Caching on first call means that if a test, fixture, or
load_dotenvcall setsLANGFUSE_SECRET_KEYafteris_enabled()has already been invoked, tracing stays off forever in that process. Given thatclient.pycallsis_enabled()at import time, the order of imports in any given entrypoint determines the outcome. Either evaluate once explicitly at a known init point, or just re-checkos.getenv— the cost is negligible.🧹 Proposed simplification
-_enabled: bool | None = None - - def is_enabled() -> bool: - global _enabled - if _enabled is None: - _enabled = bool(os.getenv("LANGFUSE_SECRET_KEY")) - return _enabled + return bool(os.getenv("LANGFUSE_SECRET_KEY"))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm_quest_benchmark/llm/tracing.py` around lines 5 - 12, The current caching via the module-level _enabled and is_enabled() causes late changes to LANGFUSE_SECRET_KEY (e.g. from tests or load_dotenv) to be ignored because client.py calls is_enabled() at import time; remove the stale-cache behavior by changing is_enabled() to directly return bool(os.getenv("LANGFUSE_SECRET_KEY")) (or alternatively add an explicit init function to set _enabled at a deterministic startup point), updating references to _enabled and is_enabled() accordingly so environment changes after import are respected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.env.template:
- Line 36: Replace the deprecated environment variable name LANGFUSE_HOST with
the current standard LANGFUSE_BASE_URL in the .env.template; specifically,
change the line defining LANGFUSE_HOST to LANGFUSE_BASE_URL so the project uses
the v4+ variable name and aligns with Langfuse docs (also search for any other
occurrences of LANGFUSE_HOST to update references to LANGFUSE_BASE_URL).
In `@llm_quest_benchmark/llm/tracing.py`:
- Around line 15-20: The flush() function should guard against SDK errors: wrap
the Langfuse().flush() call (inside flush(), after the is_enabled() check) in a
try/except Exception block so any exceptions from Langfuse().flush() are caught
and logged (include the exception details) instead of being propagated; do not
re-raise the exception so cleanup in finally blocks cannot mask or crash runs.
Use the module logger (or logging.getLogger(__name__)) to emit the error/warning
and keep the function's behaviour otherwise unchanged.
In `@pyproject.toml`:
- Line 33: The dependency pin is too loose: langfuse v3 removed
langfuse.decorators (used by the client via from langfuse.decorators import
observe) which will raise ModuleNotFoundError; update the pyproject dependency
spec for langfuse to pin it below 3 (e.g., "langfuse>=2.0,<3") so installs get a
v2.x that still exports langfuse.decorators, or alternatively migrate code to
the v3 API by replacing imports from langfuse.decorators with the new v3 form
(e.g., from langfuse import observe) and adjust any langfuse.decorators usage
accordingly (refer to the observe import and any usages of langfuse.decorators
and langfuse.openai.OpenAI).
---
Outside diff comments:
In `@llm_quest_benchmark/executors/benchmark.py`:
- Around line 304-309: The artifact write and trace flush need to be paired so
traces are delivered even if writing artifacts fails; wrap the calls that invoke
_write_benchmark_artifacts (and the subsequent logger.info("Benchmark artifacts
saved to %s", artifact_dir)) in a try/finally where tracing.flush() runs in the
finally block (mirroring commands.py::run) so flush always executes regardless
of errors from _write_benchmark_artifacts.
In `@llm_quest_benchmark/llm/client.py`:
- Around line 401-422: The Anthropic generation observation isn't being
populated or updated per attempt; inside get_completion's inner _call (the
function passed to _with_retries) call
langfuse_context.update_current_observation(...) to set the generation fields
(model, input/messages, and usage) before returning so traces contain model name
and token counts, and update it after calculating
prompt_tokens/completion_tokens (use _coerce_int outputs and the same keys your
Langfuse schema expects) so each retry attempt updates the current observation;
ensure you reference get_completion, _call, _with_retries, _record_usage and
langfuse_context.update_current_observation when applying the change.
---
Nitpick comments:
In @.env.template:
- Around line 34-36: dotenv-linter reports the LANGFUSE_* entries as unordered;
reorder the three variables LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and
LANGFUSE_SECRET_KEY into alphabetical order (LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY,
LANGFUSE_SECRET_KEY) in .env.template so the linter passes.
In `@llm_quest_benchmark/llm/client.py`:
- Around line 26-36: The no-op fallback for langfuse's observe decorator only
supports the parenthesized form and will break if someone uses `@observe` without
parens; update the fallback returned by the else branch (the observe function
defined when _langfuse_enabled() is false) to accept being used both as a bare
decorator and as a decorator factory: detect if the first argument is a callable
(i.e., a function/class) and return it directly, otherwise return a decorator
that wraps the target callable — keep the symbol names observe, _noop (or
similar), and the surrounding logic with the OpenAI import intact so the
behavior mirrors langfuse.decorators.observe for both `@observe` and
`@observe`(...).
- Around line 21-37: The module currently runs load_dotenv and probes
_langfuse_enabled at import time, causing side effects and freezing which OpenAI
implementation (openai.OpenAI vs langfuse.openai.OpenAI) is used; move the
dotenv load and the conditional imports into the lazy initializer used to create
clients (i.e., perform load_dotenv(...) and then evaluate _langfuse_enabled()
inside _get_client() and/or AnthropicClient._get_client()), and import or define
observe/OpenAI there so the choice is made at call time rather than import time,
ensuring the module stays side-effect-free and respects environment changes
performed after import.
In `@llm_quest_benchmark/llm/tracing.py`:
- Around line 5-12: The current caching via the module-level _enabled and
is_enabled() causes late changes to LANGFUSE_SECRET_KEY (e.g. from tests or
load_dotenv) to be ignored because client.py calls is_enabled() at import time;
remove the stale-cache behavior by changing is_enabled() to directly return
bool(os.getenv("LANGFUSE_SECRET_KEY")) (or alternatively add an explicit init
function to set _enabled at a deterministic startup point), updating references
to _enabled and is_enabled() accordingly so environment changes after import are
respected.
In `@pyproject.toml`:
- Line 33: The dependency declaration makes langfuse a hard requirement; change
it to an optional extra by moving "langfuse>=2.0.0" out of the main dependency
list in pyproject.toml and into [project.optional-dependencies] (e.g., tracing =
["langfuse>=2.0.0,<3.0.0"]); then update code that references langfuse (lazy
import it inside llm/tracing.py and guard usage in client.py—where the
conditional block enabling tracing runs) to only import and initialize langfuse
when tracing is enabled or env vars are present, keeping runtime behavior
unchanged for users who don’t opt into the extra.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5afe89c6-138b-4edc-86ff-cec32043855e
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
.env.templatellm_quest_benchmark/executors/benchmark.pyllm_quest_benchmark/executors/cli/commands.pyllm_quest_benchmark/llm/client.pyllm_quest_benchmark/llm/tracing.pypyproject.toml
| # Sign up at https://cloud.langfuse.com or self-host. | ||
| LANGFUSE_SECRET_KEY= | ||
| LANGFUSE_PUBLIC_KEY= | ||
| LANGFUSE_HOST=https://cloud.langfuse.com |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
langfuse python SDK LANGFUSE_HOST vs LANGFUSE_BASE_URL environment variable current name
💡 Result:
The current environment variable for configuring the Langfuse Python SDK API endpoint is LANGFUSE_BASE_URL. LANGFUSE_HOST is deprecated and should no longer be used; use LANGFUSE_BASE_URL instead. Both are supported for backward compatibility, with LANGFUSE_BASE_URL taking precedence when both are set. The default value for both is "https://cloud.langfuse.com".
Citations:
- 1: https://github.com/langfuse/langfuse-python/blob/72aa1565/langfuse/_client/environment_variables.py
- 2: https://langfuse.com/docs/observability/sdk/overview
- 3: https://python.reference.langfuse.com/langfuse
- 4: bug(sdk-python): flush() hangs indefinitely in Google Cloud Functions when consumer threads die langfuse/langfuse#11104
🏁 Script executed:
find . -name "pyproject.toml" -type f | head -5Repository: yourconscience/llm_quest_benchmark
Length of output: 95
🏁 Script executed:
cat pyproject.tomlRepository: yourconscience/llm_quest_benchmark
Length of output: 2193
🏁 Script executed:
cat -n .env.template | head -40Repository: yourconscience/llm_quest_benchmark
Length of output: 1592
Update .env.template to use LANGFUSE_BASE_URL instead of LANGFUSE_HOST.
LANGFUSE_HOST is deprecated in favor of LANGFUSE_BASE_URL in newer Langfuse versions (v4+). While both are supported for backward compatibility, the template should use the current standard variable name for forward compatibility and to align with current Langfuse documentation.
🧰 Tools
🪛 dotenv-linter (4.0.0)
[warning] 36-36: [UnorderedKey] The LANGFUSE_HOST key should go before the LANGFUSE_PUBLIC_KEY key
(UnorderedKey)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.env.template at line 36, Replace the deprecated environment variable name
LANGFUSE_HOST with the current standard LANGFUSE_BASE_URL in the .env.template;
specifically, change the line defining LANGFUSE_HOST to LANGFUSE_BASE_URL so the
project uses the v4+ variable name and aligns with Langfuse docs (also search
for any other occurrences of LANGFUSE_HOST to update references to
LANGFUSE_BASE_URL).
| def flush() -> None: | ||
| if not is_enabled(): | ||
| return | ||
| from langfuse import Langfuse | ||
|
|
||
| Langfuse().flush() |
There was a problem hiding this comment.
flush() should be defensive against SDK errors.
Langfuse().flush() can raise if the SDK can't reach the backend or if credentials are misconfigured. Since this runs in finally blocks at the end of quests/benchmarks, an exception here will mask the real outcome or crash an otherwise successful run. Wrap the flush in a try/except and log instead.
🛡️ Proposed hardening
def flush() -> None:
if not is_enabled():
return
- from langfuse import Langfuse
-
- Langfuse().flush()
+ try:
+ from langfuse import Langfuse
+ Langfuse().flush()
+ except Exception:
+ import logging
+ logging.getLogger(__name__).warning("Langfuse flush failed", exc_info=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@llm_quest_benchmark/llm/tracing.py` around lines 15 - 20, The flush()
function should guard against SDK errors: wrap the Langfuse().flush() call
(inside flush(), after the is_enabled() check) in a try/except Exception block
so any exceptions from Langfuse().flush() are caught and logged (include the
exception details) instead of being propagated; do not re-raise the exception so
cleanup in finally blocks cannot mask or crash runs. Use the module logger (or
logging.getLogger(__name__)) to emit the error/warning and keep the function's
behaviour otherwise unchanged.
| "openai>=1.12.0", | ||
| "openrouter>=0.5.0", | ||
| "typer>=0.9.0", | ||
| "langfuse>=2.0.0", |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does langfuse Python v3 still expose langfuse.decorators.observe and langfuse.openai.OpenAI wrapper?
💡 Result:
Yes, Langfuse Python v3 (and later v4) still exposes both langfuse.decorators.observe and langfuse.openai.OpenAI wrapper. The @observe decorator is available directly from the top-level langfuse module: from langfuse import observe. The old v2 path langfuse.decorators.observe is deprecated but the functionality persists in v3/v4 under the new import. Source code shows observe.py in langfuse/_client/ and documentation examples use from langfuse import observe. V2→v3 migration shows continued use of @observe with changed import: from langfuse import observe. The langfuse.openai OpenAI wrapper remains available and actively maintained in the current main branch (post-v4), providing drop-in replacement: from langfuse.openai import openai. It integrates with the new observation model using start_observation internally. Official integration docs and examples confirm compatibility with v3/v4 decorators and context managers.
Citations:
- 1: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v2-to-v3
- 2: https://github.com/langfuse/langfuse-python
- 3: https://github.com/langfuse/langfuse-python/blob/main/langfuse/openai.py
- 4: https://langfuse.com/integrations/model-providers/openai-py
- 5: https://langfuse.com/docs/sdk/python/decorators
- 6: https://langfuse.com/docs/observability/sdk/python/sdk-v3
- 7: https://github.com/langfuse/langfuse-python/blob/72aa1565/langfuse/_client/observe.py
- 8: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4
- 9: langfuse/langfuse-python@v3.14.5...v4.0.0
🏁 Script executed:
# Find files using langfuse imports
fd -type f -name "*.py" | xargs rg "from langfuse" 2>/dev/null | head -20Repository: yourconscience/llm_quest_benchmark
Length of output: 515
🏁 Script executed:
# Check if llm/client.py exists and view its langfuse imports
find . -name "client.py" -path "*/llm/*" 2>/dev/null | head -5Repository: yourconscience/llm_quest_benchmark
Length of output: 114
🏁 Script executed:
# Search for any langfuse imports in the entire codebase
rg "langfuse" --type py -A 2 | head -50Repository: yourconscience/llm_quest_benchmark
Length of output: 758
🏁 Script executed:
# Search for information about langfuse versions and what changed
web_search --query "langfuse-python v3 v4 breaking changes decorators openai module removed"Repository: yourconscience/llm_quest_benchmark
Length of output: 129
🏁 Script executed:
# Look at langfuse documentation for v3 to v4 migration
web_search --query "langfuse Python v3 to v4 migration guide breaking changes"Repository: yourconscience/llm_quest_benchmark
Length of output: 129
🏁 Script executed:
# Check if langfuse.decorators and langfuse.openai still exist in current releases
web_search --query "langfuse-python github langfuse/decorators.py langfuse/openai.py v3 v4 source code"Repository: yourconscience/llm_quest_benchmark
Length of output: 129
🌐 Web query:
Does langfuse v3 still have the langfuse.decorators module with observe decorator?
💡 Result:
No, Langfuse v3 does not have the langfuse.decorators module with the observe decorator. In v2, it was imported as from langfuse.decorators import observe. In v3, the import changed to from langfuse import observe directly, as the decorator was refactored and moved to langfuse/_client/observe.py internally. Attempting to import langfuse.decorators in v3 results in ModuleNotFoundError. The observe decorator still exists and functions in v3 and later versions (including v4, current as of 2026), but the module path changed.
Citations:
- 1: bug: ModuleNotFoundError: No module named 'langfuse.decorators' langfuse/langfuse#7170
- 2: https://get.langfuse.com/docs/observability/sdk/upgrade-path/python-v2-to-v3
- 3: https://github.com/langfuse/langfuse-python/blob/72aa1565/langfuse/_client/observe.py
- 4: https://langfuse.com/docs/observability/sdk/python/decorators
- 5: https://github.com/langfuse/langfuse-python/tree/main
🌐 Web query:
Does langfuse v3 still have langfuse.openai module with OpenAI wrapper?
💡 Result:
Yes, Langfuse Python SDK v3 still includes the langfuse.openai module with the OpenAI wrapper. It provides a drop-in replacement for the OpenAI SDK (import openai from langfuse.openai instead of directly from openai), supporting tracing of prompts, completions, latencies, costs, and errors. This works with OpenAI SDK versions >=0.27.8, including async and streaming for >=1.0.0. The module is actively maintained, with recent fixes in 2026 (e.g., openai.py updates in releases like v3.2.8 and later). Official docs and API reference confirm its availability in v3, compatible with the OTEL-based SDK while preserving the wrapper integration. Use @observe decorator for grouping multiple calls into traces.
Citations:
- 1: https://python.reference.langfuse.com/langfuse/openai
- 2: https://js-sdk-v3.docs-snapshot.langfuse.com/guides/cookbook/integration_openai_sdk/
- 3: https://langfuse.com/integrations/model-providers/openai-py
- 4: https://js-sdk-v3.docs-snapshot.langfuse.com/integrations/model-providers/openai-py/
- 5: https://www.langfuse.com
- 6: https://python.reference.langfuse.com/langfuse
- 7: https://langfuse.com/docs/sdk/python/sdk-v3
- 8: https://github.com/langfuse/langfuse-python/blob/main/langfuse/openai.py
- 9: bug:
langfuse.openai.AsyncOpenAItyped asNonebreaks type narrowing for consumers langfuse/langfuse#11197 - 10: https://langfuse.com/changelog/2025-06-05-python-sdk-v3-generally-available
- 11: https://github.com/langfuse/langfuse-python/releases
- 12: feat!: prepare v4 release langfuse/langfuse-python#1486
- 13: langfuse/langfuse-python@v3.14.5...v4.0.0
- 14: langfuse/langfuse-python@v3.2.7...v3.2.8
🌐 Web query:
Langfuse Python v3 v4 breaking changes removed modules migration
💡 Result:
Langfuse Python SDK v4 (released March 2026) introduces breaking changes from v3, primarily adopting an observation-centric data model for better performance. No entire modules were removed, but significant methods, classes, and API endpoints were deprecated/removed, along with internal restructuring. Key breaking changes: - OpenTelemetry: Default span export now filtered (non-LLM spans like HTTP/DB not exported by default). Use should_export_span=lambda span: True to restore v3 behavior. blocked_instrumentation_scopes deprecated. - Trace updates: update_current_trace and .update_trace removed, replaced by propagate_attributes (context manager for user_id/session_id/metadata/tags), set_current_trace_io (deprecated, for legacy LLM-as-judge), set_current_trace_as_public. - Span/Generation APIs: start_span, start_generation, start_as_current_span/generation → start_observation(as_type="span"/"generation"). - Datasets: DatasetItemClient removed; item.run → dataset.run_experiment. - LangChain CallbackHandler: update_trace parameter removed. - API namespaces: api.observations_v_2 → api.observations; legacy api.observations → api.legacy.observations_v1 (self-hosted: stick to legacy until v2 endpoints available). - Types: langfuse.types (TraceMetadata, ObservationParams, etc.) removed. - Pydantic v1 dropped (now v2 required). - Internal: langfuse/api/resources/ directory deleted; reorganized to top-level modules with lazy loading. Migration checklist (from official docs): 1. Audit non-LLM OTEL spans. 2. Restore export-all if needed. 3. Migrate blocked_instrumentation_scopes. 4. Replace update_current_trace/update_trace calls. 5. Replace start_span/start_generation. 6. Update dataset runs. 7. Remove CallbackHandler(update_trace=...). 8. Update API v2 → non-suffixed; legacy v1 → api.legacy.*_v1. 9. For self-hosted, use legacy endpoints. Full guide: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4. Install: pip install langfuse>=4.0.0 (Python 3.10+).
Citations:
- 1: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4
- 2: https://get.langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4
- 3: feat!: prepare v4 release langfuse/langfuse-python#1486
- 4: https://github.com/langfuse/langfuse-python/releases/tag/v4.0.0
- 5: langfuse/langfuse-python@v3.14.5...v4.0.0
Pin langfuse to <3 — v3 removed langfuse.decorators and the client code will fail on import.
The client code imports from langfuse.decorators import observe, which is no longer available in langfuse v3+. Attempting this import raises ModuleNotFoundError. While langfuse.openai.OpenAI remains available in v3, the decorators module removal is a breaking change that will cause the feature to fail on any fresh install with langfuse>=2.0.0 (which resolves to v3+).
To fix, pin the dependency:
Proposed fix
- "langfuse>=2.0.0",
+ "langfuse>=2.0.0,<3.0.0",Alternative: migrate to v3+ API (from langfuse import observe) and pin accordingly.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "langfuse>=2.0.0", | |
| "langfuse>=2.0.0,<3.0.0", |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 33, The dependency pin is too loose: langfuse v3
removed langfuse.decorators (used by the client via from langfuse.decorators
import observe) which will raise ModuleNotFoundError; update the pyproject
dependency spec for langfuse to pin it below 3 (e.g., "langfuse>=2.0,<3") so
installs get a v2.x that still exports langfuse.decorators, or alternatively
migrate code to the v3 API by replacing imports from langfuse.decorators with
the new v3 form (e.g., from langfuse import observe) and adjust any
langfuse.decorators usage accordingly (refer to the observe import and any
usages of langfuse.decorators and langfuse.openai.OpenAI).
Summary
LANGFUSE_SECRET_KEYis set in envOpenAIwrapper for automatic trace capture@observe(as_type="generation")decoratorlangfuse.flush()called at benchmark end and after single quest runsChanges
llm/tracing.py: thin init/flush module, checksLANGFUSE_SECRET_KEYllm/client.py: conditional import of Langfuse OpenAI wrapper,@observeon Anthropic completionexecutors/benchmark.py: flush at end of benchmarkexecutors/cli/commands.py: flush in finally block after single runspyproject.toml: addlangfuse>=2.0.0.env.template: add Langfuse env varsTest plan
uv run llm-quest run --quest quests/Boat.qm --model random_choicewithout Langfuse vars: should work identically to beforeLANGFUSE_SECRET_KEYandLANGFUSE_PUBLIC_KEY, run a benchmark: traces should appear in Langfuse UISummary by CodeRabbit
New Features
Configuration