Add Langfuse tracing for LLM call observability by yourconscience · Pull Request #19 · yourconscience/llm_quest_benchmark

yourconscience · 2026-04-24T14:50:39Z

Summary

Adds opt-in Langfuse tracing that activates when LANGFUSE_SECRET_KEY is set in env
OpenAI-compatible clients (OpenAI, Google, DeepSeek, OpenRouter) use Langfuse's drop-in OpenAI wrapper for automatic trace capture
Anthropic client uses @observe(as_type="generation") decorator
langfuse.flush() called at benchmark end and after single quest runs
Zero overhead when Langfuse env vars are not set (conditional imports, no-op stubs)

Changes

llm/tracing.py: thin init/flush module, checks LANGFUSE_SECRET_KEY
llm/client.py: conditional import of Langfuse OpenAI wrapper, @observe on Anthropic completion
executors/benchmark.py: flush at end of benchmark
executors/cli/commands.py: flush in finally block after single runs
pyproject.toml: add langfuse>=2.0.0
.env.template: add Langfuse env vars

Test plan

Run uv run llm-quest run --quest quests/Boat.qm --model random_choice without Langfuse vars: should work identically to before
Set LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY, run a benchmark: traces should appear in Langfuse UI
Verify token counts and model names appear in Langfuse traces

Summary by CodeRabbit

New Features
- Added optional LLM call tracing functionality to monitor and trace language model interactions during benchmark runs and CLI operations, with automatic data synchronization at key lifecycle points.
Configuration
- Three new optional environment variables available to enable and configure the tracing feature when needed.

coderabbitai · 2026-04-24T14:50:52Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d71c02f1-22af-4501-8d65-cbc2c8dd04f2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

The changes introduce Langfuse tracing integration to the benchmark system. A new tracing module conditionally enables tracing based on environment variables, while the LLM client and executors integrate this facility to observe calls and flush data at key lifecycle points.

Changes

Cohort / File(s)	Summary
Configuration & Dependencies `.env.template`, `pyproject.toml`	Added Langfuse environment variable declarations (secret key, public key, host) and added `langfuse>=2.0.0` as a project dependency.
Tracing Integration Module `llm_quest_benchmark/llm/tracing.py`	New module providing `is_enabled()` to check for Langfuse configuration and `flush()` to flush tracing data; uses lazy imports and caches enabled state.
LLM Client Integration `llm_quest_benchmark/llm/client.py`	Loads environment variables, conditionally wraps OpenAI with Langfuse when enabled, provides no-op `observe` decorator when disabled, and instruments `AnthropicClient.get_completion` with tracing decorator.
Executor Tracing Hooks `llm_quest_benchmark/executors/benchmark.py`, `llm_quest_benchmark/executors/cli/commands.py`	Integrated `tracing.flush()` calls to ensure pending tracing data is flushed after benchmark execution and quest completion, using a `finally` clause for guaranteed cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Command
    participant Executor as Executor
    participant Client as LLM Client
    participant Tracing as Tracing Module
    participant Langfuse as Langfuse Service

    CLI->>Executor: Start benchmark/quest
    Executor->>Client: Initialize client
    Client->>Tracing: Check is_enabled()
    Tracing->>Tracing: Load LANGFUSE_SECRET_KEY from env
    
    loop During execution
        CLI->>Client: Request LLM call
        Client->>Langfuse: observe() wraps the call
        Langfuse->>Langfuse: Buffer trace data
        Client-->>CLI: Return result
    end
    
    CLI->>Executor: Complete execution
    Executor->>Tracing: Call flush()
    alt Tracing enabled
        Tracing->>Langfuse: Flush buffered trace data
        Langfuse-->>Tracing: Acknowledgement
    else Tracing disabled
        Tracing->>Tracing: No-op return
    end
    Executor-->>CLI: Done

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Traces now flow through our quest so bright,
Langfuse captures every LLM flight,
With env vars set and flushing at the end,
Our benchmarks now with tracing data blend! 📊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add Langfuse tracing for LLM call observability' directly and accurately describes the main purpose of the pull request: adding Langfuse tracing functionality for observing LLM calls. It is clear, concise, and reflects the primary change across all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/langfuse-tracing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request integrates Langfuse tracing into the benchmark suite, allowing for automatic LLM call tracing when the appropriate environment variables are set. The changes include adding configuration templates, a new tracing utility module, and applying the observe decorator to LLM client completion methods. Additionally, the PR ensures that traces are flushed after benchmark runs. One improvement was suggested regarding the implementation of the tracing enablement check to use functools.lru_cache for better thread-safety and more idiomatic code.

gemini-code-assist · 2026-04-24T14:53:15Z

+import os
+
+_enabled: bool | None = None
+
+
+def is_enabled() -> bool:
+    global _enabled
+    if _enabled is None:
+        _enabled = bool(os.getenv("LANGFUSE_SECRET_KEY"))
+    return _enabled


The current implementation of is_enabled uses a global variable for caching. This is not thread-safe and is a bit verbose.

You can achieve the same result in a more concise and thread-safe way using functools.lru_cache. This is the standard Python way to memoize a function's result.

I suggest replacing the global variable and the function with a cached function.

Suggested change

import os

_enabled: bool | None = None

def is_enabled() -> bool:

global _enabled

if _enabled is None:

_enabled = bool(os.getenv("LANGFUSE_SECRET_KEY"))

return _enabled

import os

from functools import lru_cache

@lru_cache(maxsize=None)

def is_enabled() -> bool:

"""Check if Langfuse tracing is enabled. Caches the result."""

return bool(os.getenv("LANGFUSE_SECRET_KEY"))

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

llm_quest_benchmark/executors/benchmark.py (1)

304-309: ⚠️ Potential issue | 🟡 Minor

Flush won't run if artifact write fails — wrap in try/finally for parity with commands.py.

In commands.py::run, tracing.flush() runs in a finally block so traces are delivered even on failure. Here, if _write_benchmark_artifacts raises (disk full, permissions, bad JSON serialization of a result), the flush is skipped and an entire benchmark's worth of generations may be dropped. A small try/finally around the artifact + flush block keeps the two call sites consistent.
🔧 Proposed fix
-    artifact_dir = _write_benchmark_artifacts(config, results)
-    if artifact_dir:
-        logger.info("Benchmark artifacts saved to %s", artifact_dir)
-
-    tracing.flush()
-    return results
+    try:
+        artifact_dir = _write_benchmark_artifacts(config, results)
+        if artifact_dir:
+            logger.info("Benchmark artifacts saved to %s", artifact_dir)
+        return results
+    finally:
+        tracing.flush()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/executors/benchmark.py` around lines 304 - 309, The
artifact write and trace flush need to be paired so traces are delivered even if
writing artifacts fails; wrap the calls that invoke _write_benchmark_artifacts
(and the subsequent logger.info("Benchmark artifacts saved to %s",
artifact_dir)) in a try/finally where tracing.flush() runs in the finally block
(mirroring commands.py::run) so flush always executes regardless of errors from
_write_benchmark_artifacts.

llm_quest_benchmark/llm/client.py (1)

401-422: ⚠️ Potential issue | 🟠 Major

Anthropic generation trace won't carry model or token usage — test plan requirements won't be met.

The PR test plan states: "confirm token counts and model names appear in traces". With @observe(as_type="generation", name="anthropic-completion") alone, Langfuse v2 captures the function's input args and return value only; it does not auto-populate the generation's model, input messages, or usage fields for Anthropic (unlike OpenAI/LangChain/LlamaIndex integrations which are auto-captured). These fields must be set explicitly via langfuse_context.update_current_observation(...) inside the function. As written, Anthropic traces will show empty model/usage in the UI.

Additionally, this decorator wraps _with_retries(_call), so on retry all attempts are folded into a single generation observation — the final usage recorded via _record_usage won't make it to Langfuse at all.

🔧 Suggested fix — populate generation fields before returning

+from llm_quest_benchmark.llm.tracing import is_enabled as _langfuse_enabled
+if _langfuse_enabled():
+    from langfuse.decorators import langfuse_context
+else:
+    langfuse_context = None  # type: ignore[assignment]
...
     `@observe`(as_type="generation", name="anthropic-completion")
     def get_completion(self, prompt: str) -> str:
         """Get a completion from the model."""

         def _call() -> str:
             response = self._get_client().messages.create(...)
             usage = _get_attr_or_key(response, "usage")
             prompt_tokens = _coerce_int(_get_attr_or_key(usage, "input_tokens", 0) or 0)
             completion_tokens = _coerce_int(_get_attr_or_key(usage, "output_tokens", 0) or 0)
             self._record_usage(prompt_tokens, completion_tokens)
+            if langfuse_context is not None:
+                langfuse_context.update_current_observation(
+                    model=self.model_id,
+                    input=[{"role": "user", "content": prompt}],
+                    usage={
+                        "input": prompt_tokens,
+                        "output": completion_tokens,
+                        "total": prompt_tokens + completion_tokens,
+                    },
+                )
             ...

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/client.py` around lines 401 - 422, The Anthropic
generation observation isn't being populated or updated per attempt; inside
get_completion's inner _call (the function passed to _with_retries) call
langfuse_context.update_current_observation(...) to set the generation fields
(model, input/messages, and usage) before returning so traces contain model name
and token counts, and update it after calculating
prompt_tokens/completion_tokens (use _coerce_int outputs and the same keys your
Langfuse schema expects) so each retry attempt updates the current observation;
ensure you reference get_completion, _call, _with_retries, _record_usage and
langfuse_context.update_current_observation when applying the change.

🧹 Nitpick comments (5)

pyproject.toml (1)
33-33: Consider making langfuse an optional extra instead of a hard dep.

The PR goal is opt-in tracing with "zero overhead when Langfuse env vars are not set", but this makes Langfuse (and its transitive deps incl. pydantic/httpx pins, opentelemetry in v3+) a mandatory install for every user. A [project.optional-dependencies] extra (e.g., tracing = ["langfuse>=2.0.0,<3.0.0"]) with lazy imports inside llm/tracing.py and the conditional block in client.py would be more aligned with that intent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 33, The dependency declaration makes langfuse a hard
requirement; change it to an optional extra by moving "langfuse>=2.0.0" out of
the main dependency list in pyproject.toml and into
[project.optional-dependencies] (e.g., tracing = ["langfuse>=2.0.0,<3.0.0"]);
then update code that references langfuse (lazy import it inside llm/tracing.py
and guard usage in client.py—where the conditional block enabling tracing runs)
to only import and initialize langfuse when tracing is enabled or env vars are
present, keeping runtime behavior unchanged for users who don’t opt into the
extra.
llm_quest_benchmark/llm/client.py (2)
26-36: No-op observe signature doesn't fully match real decorator — works here but brittle.

langfuse.decorators.observe can be used both as @observe (bare) and @observe(...). Your fallback only handles the parenthesized form (observe(**_kwargs)). This is fine for the single current call site (@observe(as_type="generation", ...)) but will silently break if someone later writes @observe without parens. A two-mode fallback avoids the footgun:
🧹 Optional robustness tweak
-    def observe(**_kwargs):  # type: ignore[misc]
-        def _noop(fn):
-            return fn
-
-        return _noop
+    def observe(*args, **_kwargs):  # type: ignore[misc]
+        # Support both `@observe` and `@observe`(...).
+        if len(args) == 1 and callable(args[0]) and not _kwargs:
+            return args[0]
+
+        def _noop(fn):
+            return fn
+
+        return _noop
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/client.py` around lines 26 - 36, The no-op fallback
for langfuse's observe decorator only supports the parenthesized form and will
break if someone uses `@observe` without parens; update the fallback returned by
the else branch (the observe function defined when _langfuse_enabled() is false)
to accept being used both as a bare decorator and as a decorator factory: detect
if the first argument is a callable (i.e., a function/class) and return it
directly, otherwise return a decorator that wraps the target callable — keep the
symbol names observe, _noop (or similar), and the surrounding logic with the
OpenAI import intact so the behavior mirrors langfuse.decorators.observe for
both `@observe` and `@observe`(...).
21-37: Module-level load_dotenv + import-time Langfuse probe couples side effects to any importer.

load_dotenv now runs unconditionally on import llm_quest_benchmark.llm.client, and the Langfuse decision is frozen at import time. Two consequences:

Any test / tool that imports this module will mutate os.environ from the on-disk .env (even when callers explicitly didn't want it). The CLI already calls load_dotenv in commands.py; duplicating it here makes import order matter.

If a caller sets LANGFUSE_SECRET_KEY after import (e.g., tests), the openai.OpenAI vs langfuse.openai.OpenAI branch is baked in and can no longer flip.

Consider pushing both load_dotenv and the import-time branching into a lazy initializer invoked from _get_client() / AnthropicClient._get_client(), so the module stays side-effect-free.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/client.py` around lines 21 - 37, The module currently
runs load_dotenv and probes _langfuse_enabled at import time, causing side
effects and freezing which OpenAI implementation (openai.OpenAI vs
langfuse.openai.OpenAI) is used; move the dotenv load and the conditional
imports into the lazy initializer used to create clients (i.e., perform
load_dotenv(...) and then evaluate _langfuse_enabled() inside _get_client()
and/or AnthropicClient._get_client()), and import or define observe/OpenAI there
so the choice is made at call time rather than import time, ensuring the module
stays side-effect-free and respects environment changes performed after import.
.env.template (1)
34-36: Nit: reorder Langfuse keys to satisfy dotenv-linter.

dotenv-linter flags LANGFUSE_* as unordered. Easy fix:
🧹 Proposed reorder
-LANGFUSE_SECRET_KEY=
-LANGFUSE_PUBLIC_KEY=
-LANGFUSE_HOST=https://cloud.langfuse.com
+LANGFUSE_HOST=https://cloud.langfuse.com
+LANGFUSE_PUBLIC_KEY=
+LANGFUSE_SECRET_KEY=
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.env.template around lines 34 - 36, dotenv-linter reports the LANGFUSE_*
entries as unordered; reorder the three variables LANGFUSE_HOST,
LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY into alphabetical order
(LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) in .env.template so
the linter passes.
llm_quest_benchmark/llm/tracing.py (1)
5-12: _enabled caching hides late-binding env changes.

Caching on first call means that if a test, fixture, or load_dotenv call sets LANGFUSE_SECRET_KEY after is_enabled() has already been invoked, tracing stays off forever in that process. Given that client.py calls is_enabled() at import time, the order of imports in any given entrypoint determines the outcome. Either evaluate once explicitly at a known init point, or just re-check os.getenv — the cost is negligible.
🧹 Proposed simplification
-_enabled: bool | None = None
-
-
 def is_enabled() -> bool:
-    global _enabled
-    if _enabled is None:
-        _enabled = bool(os.getenv("LANGFUSE_SECRET_KEY"))
-    return _enabled
+    return bool(os.getenv("LANGFUSE_SECRET_KEY"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/tracing.py` around lines 5 - 12, The current caching
via the module-level _enabled and is_enabled() causes late changes to
LANGFUSE_SECRET_KEY (e.g. from tests or load_dotenv) to be ignored because
client.py calls is_enabled() at import time; remove the stale-cache behavior by
changing is_enabled() to directly return bool(os.getenv("LANGFUSE_SECRET_KEY"))
(or alternatively add an explicit init function to set _enabled at a
deterministic startup point), updating references to _enabled and is_enabled()
accordingly so environment changes after import are respected.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.env.template:
- Line 36: Replace the deprecated environment variable name LANGFUSE_HOST with
the current standard LANGFUSE_BASE_URL in the .env.template; specifically,
change the line defining LANGFUSE_HOST to LANGFUSE_BASE_URL so the project uses
the v4+ variable name and aligns with Langfuse docs (also search for any other
occurrences of LANGFUSE_HOST to update references to LANGFUSE_BASE_URL).

In `@llm_quest_benchmark/llm/tracing.py`:
- Around line 15-20: The flush() function should guard against SDK errors: wrap
the Langfuse().flush() call (inside flush(), after the is_enabled() check) in a
try/except Exception block so any exceptions from Langfuse().flush() are caught
and logged (include the exception details) instead of being propagated; do not
re-raise the exception so cleanup in finally blocks cannot mask or crash runs.
Use the module logger (or logging.getLogger(__name__)) to emit the error/warning
and keep the function's behaviour otherwise unchanged.

In `@pyproject.toml`:
- Line 33: The dependency pin is too loose: langfuse v3 removed
langfuse.decorators (used by the client via from langfuse.decorators import
observe) which will raise ModuleNotFoundError; update the pyproject dependency
spec for langfuse to pin it below 3 (e.g., "langfuse>=2.0,<3") so installs get a
v2.x that still exports langfuse.decorators, or alternatively migrate code to
the v3 API by replacing imports from langfuse.decorators with the new v3 form
(e.g., from langfuse import observe) and adjust any langfuse.decorators usage
accordingly (refer to the observe import and any usages of langfuse.decorators
and langfuse.openai.OpenAI).

---

Outside diff comments:
In `@llm_quest_benchmark/executors/benchmark.py`:
- Around line 304-309: The artifact write and trace flush need to be paired so
traces are delivered even if writing artifacts fails; wrap the calls that invoke
_write_benchmark_artifacts (and the subsequent logger.info("Benchmark artifacts
saved to %s", artifact_dir)) in a try/finally where tracing.flush() runs in the
finally block (mirroring commands.py::run) so flush always executes regardless
of errors from _write_benchmark_artifacts.

In `@llm_quest_benchmark/llm/client.py`:
- Around line 401-422: The Anthropic generation observation isn't being
populated or updated per attempt; inside get_completion's inner _call (the
function passed to _with_retries) call
langfuse_context.update_current_observation(...) to set the generation fields
(model, input/messages, and usage) before returning so traces contain model name
and token counts, and update it after calculating
prompt_tokens/completion_tokens (use _coerce_int outputs and the same keys your
Langfuse schema expects) so each retry attempt updates the current observation;
ensure you reference get_completion, _call, _with_retries, _record_usage and
langfuse_context.update_current_observation when applying the change.

---

Nitpick comments:
In @.env.template:
- Around line 34-36: dotenv-linter reports the LANGFUSE_* entries as unordered;
reorder the three variables LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and
LANGFUSE_SECRET_KEY into alphabetical order (LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY,
LANGFUSE_SECRET_KEY) in .env.template so the linter passes.

In `@llm_quest_benchmark/llm/client.py`:
- Around line 26-36: The no-op fallback for langfuse's observe decorator only
supports the parenthesized form and will break if someone uses `@observe` without
parens; update the fallback returned by the else branch (the observe function
defined when _langfuse_enabled() is false) to accept being used both as a bare
decorator and as a decorator factory: detect if the first argument is a callable
(i.e., a function/class) and return it directly, otherwise return a decorator
that wraps the target callable — keep the symbol names observe, _noop (or
similar), and the surrounding logic with the OpenAI import intact so the
behavior mirrors langfuse.decorators.observe for both `@observe` and
`@observe`(...).
- Around line 21-37: The module currently runs load_dotenv and probes
_langfuse_enabled at import time, causing side effects and freezing which OpenAI
implementation (openai.OpenAI vs langfuse.openai.OpenAI) is used; move the
dotenv load and the conditional imports into the lazy initializer used to create
clients (i.e., perform load_dotenv(...) and then evaluate _langfuse_enabled()
inside _get_client() and/or AnthropicClient._get_client()), and import or define
observe/OpenAI there so the choice is made at call time rather than import time,
ensuring the module stays side-effect-free and respects environment changes
performed after import.

In `@llm_quest_benchmark/llm/tracing.py`:
- Around line 5-12: The current caching via the module-level _enabled and
is_enabled() causes late changes to LANGFUSE_SECRET_KEY (e.g. from tests or
load_dotenv) to be ignored because client.py calls is_enabled() at import time;
remove the stale-cache behavior by changing is_enabled() to directly return
bool(os.getenv("LANGFUSE_SECRET_KEY")) (or alternatively add an explicit init
function to set _enabled at a deterministic startup point), updating references
to _enabled and is_enabled() accordingly so environment changes after import are
respected.

In `@pyproject.toml`:
- Line 33: The dependency declaration makes langfuse a hard requirement; change
it to an optional extra by moving "langfuse>=2.0.0" out of the main dependency
list in pyproject.toml and into [project.optional-dependencies] (e.g., tracing =
["langfuse>=2.0.0,<3.0.0"]); then update code that references langfuse (lazy
import it inside llm/tracing.py and guard usage in client.py—where the
conditional block enabling tracing runs) to only import and initialize langfuse
when tracing is enabled or env vars are present, keeping runtime behavior
unchanged for users who don’t opt into the extra.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5afe89c6-138b-4edc-86ff-cec32043855e

📥 Commits

Reviewing files that changed from the base of the PR and between 129a763 and c1b1eb8.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (6)

.env.template
llm_quest_benchmark/executors/benchmark.py
llm_quest_benchmark/executors/cli/commands.py
llm_quest_benchmark/llm/client.py
llm_quest_benchmark/llm/tracing.py
pyproject.toml

coderabbitai · 2026-04-24T14:55:56Z

+# Sign up at https://cloud.langfuse.com or self-host.
+LANGFUSE_SECRET_KEY=
+LANGFUSE_PUBLIC_KEY=
+LANGFUSE_HOST=https://cloud.langfuse.com


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

langfuse python SDK LANGFUSE_HOST vs LANGFUSE_BASE_URL environment variable current name

💡 Result:

The current environment variable for configuring the Langfuse Python SDK API endpoint is LANGFUSE_BASE_URL. LANGFUSE_HOST is deprecated and should no longer be used; use LANGFUSE_BASE_URL instead. Both are supported for backward compatibility, with LANGFUSE_BASE_URL taking precedence when both are set. The default value for both is "https://cloud.langfuse.com".

Citations:

1: https://github.com/langfuse/langfuse-python/blob/72aa1565/langfuse/_client/environment_variables.py

2: https://langfuse.com/docs/observability/sdk/overview

3: https://python.reference.langfuse.com/langfuse

4: bug(sdk-python): flush() hangs indefinitely in Google Cloud Functions when consumer threads die langfuse/langfuse#11104

🏁 Script executed:

find . -name "pyproject.toml" -type f | head -5

Repository: yourconscience/llm_quest_benchmark

Length of output: 95

🏁 Script executed:

cat pyproject.toml

Repository: yourconscience/llm_quest_benchmark

Length of output: 2193

🏁 Script executed:

cat -n .env.template | head -40

Repository: yourconscience/llm_quest_benchmark

Length of output: 1592

Update .env.template to use LANGFUSE_BASE_URL instead of LANGFUSE_HOST.

LANGFUSE_HOST is deprecated in favor of LANGFUSE_BASE_URL in newer Langfuse versions (v4+). While both are supported for backward compatibility, the template should use the current standard variable name for forward compatibility and to align with current Langfuse documentation.

🧰 Tools

🪛 dotenv-linter (4.0.0)

[warning] 36-36: [UnorderedKey] The LANGFUSE_HOST key should go before the LANGFUSE_PUBLIC_KEY key

(UnorderedKey)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.env.template at line 36, Replace the deprecated environment variable name LANGFUSE_HOST with the current standard LANGFUSE_BASE_URL in the .env.template; specifically, change the line defining LANGFUSE_HOST to LANGFUSE_BASE_URL so the project uses the v4+ variable name and aligns with Langfuse docs (also search for any other occurrences of LANGFUSE_HOST to update references to LANGFUSE_BASE_URL).

coderabbitai · 2026-04-24T14:55:56Z

+def flush() -> None:
+    if not is_enabled():
+        return
+    from langfuse import Langfuse
+
+    Langfuse().flush()


⚠️ Potential issue | 🟡 Minor

flush() should be defensive against SDK errors.

Langfuse().flush() can raise if the SDK can't reach the backend or if credentials are misconfigured. Since this runs in finally blocks at the end of quests/benchmarks, an exception here will mask the real outcome or crash an otherwise successful run. Wrap the flush in a try/except and log instead.

🛡️ Proposed hardening

def flush() -> None: if not is_enabled(): return - from langfuse import Langfuse - - Langfuse().flush() + try: + from langfuse import Langfuse + Langfuse().flush() + except Exception: + import logging + logging.getLogger(__name__).warning("Langfuse flush failed", exc_info=True)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@llm_quest_benchmark/llm/tracing.py` around lines 15 - 20, The flush() function should guard against SDK errors: wrap the Langfuse().flush() call (inside flush(), after the is_enabled() check) in a try/except Exception block so any exceptions from Langfuse().flush() are caught and logged (include the exception details) instead of being propagated; do not re-raise the exception so cleanup in finally blocks cannot mask or crash runs. Use the module logger (or logging.getLogger(__name__)) to emit the error/warning and keep the function's behaviour otherwise unchanged.

coderabbitai · 2026-04-24T14:55:56Z

    "openai>=1.12.0",
    "openrouter>=0.5.0",
    "typer>=0.9.0",
+    "langfuse>=2.0.0",


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does langfuse Python v3 still expose langfuse.decorators.observe and langfuse.openai.OpenAI wrapper?

💡 Result:

Yes, Langfuse Python v3 (and later v4) still exposes both langfuse.decorators.observe and langfuse.openai.OpenAI wrapper. The @observe decorator is available directly from the top-level langfuse module: from langfuse import observe. The old v2 path langfuse.decorators.observe is deprecated but the functionality persists in v3/v4 under the new import. Source code shows observe.py in langfuse/_client/ and documentation examples use from langfuse import observe. V2→v3 migration shows continued use of @observe with changed import: from langfuse import observe. The langfuse.openai OpenAI wrapper remains available and actively maintained in the current main branch (post-v4), providing drop-in replacement: from langfuse.openai import openai. It integrates with the new observation model using start_observation internally. Official integration docs and examples confirm compatibility with v3/v4 decorators and context managers.

Citations:

1: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v2-to-v3

2: https://github.com/langfuse/langfuse-python

3: https://github.com/langfuse/langfuse-python/blob/main/langfuse/openai.py

4: https://langfuse.com/integrations/model-providers/openai-py

5: https://langfuse.com/docs/sdk/python/decorators

6: https://langfuse.com/docs/observability/sdk/python/sdk-v3

7: https://github.com/langfuse/langfuse-python/blob/72aa1565/langfuse/_client/observe.py

8: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4

9: langfuse/langfuse-python@v3.14.5...v4.0.0

🏁 Script executed:

# Find files using langfuse imports fd -type f -name "*.py" | xargs rg "from langfuse" 2>/dev/null | head -20

Repository: yourconscience/llm_quest_benchmark

Length of output: 515

🏁 Script executed:

# Check if llm/client.py exists and view its langfuse imports find . -name "client.py" -path "*/llm/*" 2>/dev/null | head -5

Repository: yourconscience/llm_quest_benchmark

Length of output: 114

🏁 Script executed:

# Search for any langfuse imports in the entire codebase rg "langfuse" --type py -A 2 | head -50

Repository: yourconscience/llm_quest_benchmark

Length of output: 758

🏁 Script executed:

# Search for information about langfuse versions and what changed web_search --query "langfuse-python v3 v4 breaking changes decorators openai module removed"

Repository: yourconscience/llm_quest_benchmark

Length of output: 129

🏁 Script executed:

# Look at langfuse documentation for v3 to v4 migration web_search --query "langfuse Python v3 to v4 migration guide breaking changes"

Repository: yourconscience/llm_quest_benchmark

Length of output: 129

🏁 Script executed:

# Check if langfuse.decorators and langfuse.openai still exist in current releases web_search --query "langfuse-python github langfuse/decorators.py langfuse/openai.py v3 v4 source code"

Repository: yourconscience/llm_quest_benchmark

Length of output: 129

🌐 Web query:

Does langfuse v3 still have the langfuse.decorators module with observe decorator?

💡 Result:

No, Langfuse v3 does not have the langfuse.decorators module with the observe decorator. In v2, it was imported as from langfuse.decorators import observe. In v3, the import changed to from langfuse import observe directly, as the decorator was refactored and moved to langfuse/_client/observe.py internally. Attempting to import langfuse.decorators in v3 results in ModuleNotFoundError. The observe decorator still exists and functions in v3 and later versions (including v4, current as of 2026), but the module path changed.

Citations:

1: bug: ModuleNotFoundError: No module named 'langfuse.decorators' langfuse/langfuse#7170

2: https://get.langfuse.com/docs/observability/sdk/upgrade-path/python-v2-to-v3

3: https://github.com/langfuse/langfuse-python/blob/72aa1565/langfuse/_client/observe.py

4: https://langfuse.com/docs/observability/sdk/python/decorators

5: https://github.com/langfuse/langfuse-python/tree/main

🌐 Web query:

Does langfuse v3 still have langfuse.openai module with OpenAI wrapper?

💡 Result:

Yes, Langfuse Python SDK v3 still includes the langfuse.openai module with the OpenAI wrapper. It provides a drop-in replacement for the OpenAI SDK (import openai from langfuse.openai instead of directly from openai), supporting tracing of prompts, completions, latencies, costs, and errors. This works with OpenAI SDK versions >=0.27.8, including async and streaming for >=1.0.0. The module is actively maintained, with recent fixes in 2026 (e.g., openai.py updates in releases like v3.2.8 and later). Official docs and API reference confirm its availability in v3, compatible with the OTEL-based SDK while preserving the wrapper integration. Use @observe decorator for grouping multiple calls into traces.

Citations:

1: https://python.reference.langfuse.com/langfuse/openai

2: https://js-sdk-v3.docs-snapshot.langfuse.com/guides/cookbook/integration_openai_sdk/

3: https://langfuse.com/integrations/model-providers/openai-py

4: https://js-sdk-v3.docs-snapshot.langfuse.com/integrations/model-providers/openai-py/

5: https://www.langfuse.com

6: https://python.reference.langfuse.com/langfuse

7: https://langfuse.com/docs/sdk/python/sdk-v3

8: https://github.com/langfuse/langfuse-python/blob/main/langfuse/openai.py

9: bug: langfuse.openai.AsyncOpenAI typed as None breaks type narrowing for consumers langfuse/langfuse#11197

10: https://langfuse.com/changelog/2025-06-05-python-sdk-v3-generally-available

11: https://github.com/langfuse/langfuse-python/releases

12: feat!: prepare v4 release langfuse/langfuse-python#1486

13: langfuse/langfuse-python@v3.14.5...v4.0.0

14: langfuse/langfuse-python@v3.2.7...v3.2.8

🌐 Web query:

Langfuse Python v3 v4 breaking changes removed modules migration

💡 Result:

Langfuse Python SDK v4 (released March 2026) introduces breaking changes from v3, primarily adopting an observation-centric data model for better performance. No entire modules were removed, but significant methods, classes, and API endpoints were deprecated/removed, along with internal restructuring. Key breaking changes: - OpenTelemetry: Default span export now filtered (non-LLM spans like HTTP/DB not exported by default). Use should_export_span=lambda span: True to restore v3 behavior. blocked_instrumentation_scopes deprecated. - Trace updates: update_current_trace and .update_trace removed, replaced by propagate_attributes (context manager for user_id/session_id/metadata/tags), set_current_trace_io (deprecated, for legacy LLM-as-judge), set_current_trace_as_public. - Span/Generation APIs: start_span, start_generation, start_as_current_span/generation → start_observation(as_type="span"/"generation"). - Datasets: DatasetItemClient removed; item.run → dataset.run_experiment. - LangChain CallbackHandler: update_trace parameter removed. - API namespaces: api.observations_v_2 → api.observations; legacy api.observations → api.legacy.observations_v1 (self-hosted: stick to legacy until v2 endpoints available). - Types: langfuse.types (TraceMetadata, ObservationParams, etc.) removed. - Pydantic v1 dropped (now v2 required). - Internal: langfuse/api/resources/ directory deleted; reorganized to top-level modules with lazy loading. Migration checklist (from official docs): 1. Audit non-LLM OTEL spans. 2. Restore export-all if needed. 3. Migrate blocked_instrumentation_scopes. 4. Replace update_current_trace/update_trace calls. 5. Replace start_span/start_generation. 6. Update dataset runs. 7. Remove CallbackHandler(update_trace=...). 8. Update API v2 → non-suffixed; legacy v1 → api.legacy.*_v1. 9. For self-hosted, use legacy endpoints. Full guide: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4. Install: pip install langfuse>=4.0.0 (Python 3.10+).

Citations:

1: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4

2: https://get.langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4

3: feat!: prepare v4 release langfuse/langfuse-python#1486

4: https://github.com/langfuse/langfuse-python/releases/tag/v4.0.0

5: langfuse/langfuse-python@v3.14.5...v4.0.0

Pin langfuse to <3 — v3 removed langfuse.decorators and the client code will fail on import.

The client code imports from langfuse.decorators import observe, which is no longer available in langfuse v3+. Attempting this import raises ModuleNotFoundError. While langfuse.openai.OpenAI remains available in v3, the decorators module removal is a breaking change that will cause the feature to fail on any fresh install with langfuse>=2.0.0 (which resolves to v3+).

To fix, pin the dependency:

Proposed fix

- "langfuse>=2.0.0", + "langfuse>=2.0.0,<3.0.0",

Alternative: migrate to v3+ API (from langfuse import observe) and pin accordingly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"langfuse>=2.0.0",

"langfuse>=2.0.0,<3.0.0",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 33, The dependency pin is too loose: langfuse v3 removed langfuse.decorators (used by the client via from langfuse.decorators import observe) which will raise ModuleNotFoundError; update the pyproject dependency spec for langfuse to pin it below 3 (e.g., "langfuse>=2.0,<3") so installs get a v2.x that still exports langfuse.decorators, or alternatively migrate code to the v3 API by replacing imports from langfuse.decorators with the new v3 form (e.g., from langfuse import observe) and adjust any langfuse.decorators usage accordingly (refer to the observe import and any usages of langfuse.decorators and langfuse.openai.OpenAI).

add Langfuse tracing integration

c1b1eb8

gemini-code-assist Bot reviewed Apr 24, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

fix review feedback: v3 imports, defensive flush, LANGFUSE_BASE_URL

9e9c7ce

yourconscience merged commit c2d7b59 into master Apr 24, 2026
2 checks passed

yourconscience deleted the feat/langfuse-tracing branch April 24, 2026 15:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Langfuse tracing for LLM call observability#19

Add Langfuse tracing for LLM call observability#19
yourconscience merged 2 commits into
masterfrom
feat/langfuse-tracing

yourconscience commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 24, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Uh oh!

coderabbitai Bot Apr 24, 2026

Uh oh!

coderabbitai Bot Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yourconscience commented Apr 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yourconscience commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading