Skip to content

Add Langfuse tracing for LLM call observability#19

Merged
yourconscience merged 2 commits into
masterfrom
feat/langfuse-tracing
Apr 24, 2026
Merged

Add Langfuse tracing for LLM call observability#19
yourconscience merged 2 commits into
masterfrom
feat/langfuse-tracing

Conversation

@yourconscience

@yourconscience yourconscience commented Apr 24, 2026

Copy link
Copy Markdown
Owner

Summary

  • Adds opt-in Langfuse tracing that activates when LANGFUSE_SECRET_KEY is set in env
  • OpenAI-compatible clients (OpenAI, Google, DeepSeek, OpenRouter) use Langfuse's drop-in OpenAI wrapper for automatic trace capture
  • Anthropic client uses @observe(as_type="generation") decorator
  • langfuse.flush() called at benchmark end and after single quest runs
  • Zero overhead when Langfuse env vars are not set (conditional imports, no-op stubs)

Changes

  • llm/tracing.py: thin init/flush module, checks LANGFUSE_SECRET_KEY
  • llm/client.py: conditional import of Langfuse OpenAI wrapper, @observe on Anthropic completion
  • executors/benchmark.py: flush at end of benchmark
  • executors/cli/commands.py: flush in finally block after single runs
  • pyproject.toml: add langfuse>=2.0.0
  • .env.template: add Langfuse env vars

Test plan

  • Run uv run llm-quest run --quest quests/Boat.qm --model random_choice without Langfuse vars: should work identically to before
  • Set LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY, run a benchmark: traces should appear in Langfuse UI
  • Verify token counts and model names appear in Langfuse traces

Summary by CodeRabbit

  • New Features

    • Added optional LLM call tracing functionality to monitor and trace language model interactions during benchmark runs and CLI operations, with automatic data synchronization at key lifecycle points.
  • Configuration

    • Three new optional environment variables available to enable and configure the tracing feature when needed.

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d71c02f1-22af-4501-8d65-cbc2c8dd04f2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The changes introduce Langfuse tracing integration to the benchmark system. A new tracing module conditionally enables tracing based on environment variables, while the LLM client and executors integrate this facility to observe calls and flush data at key lifecycle points.

Changes

Cohort / File(s) Summary
Configuration & Dependencies
.env.template, pyproject.toml
Added Langfuse environment variable declarations (secret key, public key, host) and added langfuse>=2.0.0 as a project dependency.
Tracing Integration Module
llm_quest_benchmark/llm/tracing.py
New module providing is_enabled() to check for Langfuse configuration and flush() to flush tracing data; uses lazy imports and caches enabled state.
LLM Client Integration
llm_quest_benchmark/llm/client.py
Loads environment variables, conditionally wraps OpenAI with Langfuse when enabled, provides no-op observe decorator when disabled, and instruments AnthropicClient.get_completion with tracing decorator.
Executor Tracing Hooks
llm_quest_benchmark/executors/benchmark.py, llm_quest_benchmark/executors/cli/commands.py
Integrated tracing.flush() calls to ensure pending tracing data is flushed after benchmark execution and quest completion, using a finally clause for guaranteed cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI Command
    participant Executor as Executor
    participant Client as LLM Client
    participant Tracing as Tracing Module
    participant Langfuse as Langfuse Service

    CLI->>Executor: Start benchmark/quest
    Executor->>Client: Initialize client
    Client->>Tracing: Check is_enabled()
    Tracing->>Tracing: Load LANGFUSE_SECRET_KEY from env
    
    loop During execution
        CLI->>Client: Request LLM call
        Client->>Langfuse: observe() wraps the call
        Langfuse->>Langfuse: Buffer trace data
        Client-->>CLI: Return result
    end
    
    CLI->>Executor: Complete execution
    Executor->>Tracing: Call flush()
    alt Tracing enabled
        Tracing->>Langfuse: Flush buffered trace data
        Langfuse-->>Tracing: Acknowledgement
    else Tracing disabled
        Tracing->>Tracing: No-op return
    end
    Executor-->>CLI: Done
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Traces now flow through our quest so bright,
Langfuse captures every LLM flight,
With env vars set and flushing at the end,
Our benchmarks now with tracing data blend! 📊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add Langfuse tracing for LLM call observability' directly and accurately describes the main purpose of the pull request: adding Langfuse tracing functionality for observing LLM calls. It is clear, concise, and reflects the primary change across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/langfuse-tracing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates Langfuse tracing into the benchmark suite, allowing for automatic LLM call tracing when the appropriate environment variables are set. The changes include adding configuration templates, a new tracing utility module, and applying the observe decorator to LLM client completion methods. Additionally, the PR ensures that traces are flushed after benchmark runs. One improvement was suggested regarding the implementation of the tracing enablement check to use functools.lru_cache for better thread-safety and more idiomatic code.

Comment thread llm_quest_benchmark/llm/tracing.py Outdated
Comment on lines +3 to +12
import os

_enabled: bool | None = None


def is_enabled() -> bool:
global _enabled
if _enabled is None:
_enabled = bool(os.getenv("LANGFUSE_SECRET_KEY"))
return _enabled

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of is_enabled uses a global variable for caching. This is not thread-safe and is a bit verbose.

You can achieve the same result in a more concise and thread-safe way using functools.lru_cache. This is the standard Python way to memoize a function's result.

I suggest replacing the global variable and the function with a cached function.

Suggested change
import os
_enabled: bool | None = None
def is_enabled() -> bool:
global _enabled
if _enabled is None:
_enabled = bool(os.getenv("LANGFUSE_SECRET_KEY"))
return _enabled
import os
from functools import lru_cache
@lru_cache(maxsize=None)
def is_enabled() -> bool:
"""Check if Langfuse tracing is enabled. Caches the result."""
return bool(os.getenv("LANGFUSE_SECRET_KEY"))

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
llm_quest_benchmark/executors/benchmark.py (1)

304-309: ⚠️ Potential issue | 🟡 Minor

Flush won't run if artifact write fails — wrap in try/finally for parity with commands.py.

In commands.py::run, tracing.flush() runs in a finally block so traces are delivered even on failure. Here, if _write_benchmark_artifacts raises (disk full, permissions, bad JSON serialization of a result), the flush is skipped and an entire benchmark's worth of generations may be dropped. A small try/finally around the artifact + flush block keeps the two call sites consistent.

🔧 Proposed fix
-    artifact_dir = _write_benchmark_artifacts(config, results)
-    if artifact_dir:
-        logger.info("Benchmark artifacts saved to %s", artifact_dir)
-
-    tracing.flush()
-    return results
+    try:
+        artifact_dir = _write_benchmark_artifacts(config, results)
+        if artifact_dir:
+            logger.info("Benchmark artifacts saved to %s", artifact_dir)
+        return results
+    finally:
+        tracing.flush()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/executors/benchmark.py` around lines 304 - 309, The
artifact write and trace flush need to be paired so traces are delivered even if
writing artifacts fails; wrap the calls that invoke _write_benchmark_artifacts
(and the subsequent logger.info("Benchmark artifacts saved to %s",
artifact_dir)) in a try/finally where tracing.flush() runs in the finally block
(mirroring commands.py::run) so flush always executes regardless of errors from
_write_benchmark_artifacts.
llm_quest_benchmark/llm/client.py (1)

401-422: ⚠️ Potential issue | 🟠 Major

Anthropic generation trace won't carry model or token usage — test plan requirements won't be met.

The PR test plan states: "confirm token counts and model names appear in traces". With @observe(as_type="generation", name="anthropic-completion") alone, Langfuse v2 captures the function's input args and return value only; it does not auto-populate the generation's model, input messages, or usage fields for Anthropic (unlike OpenAI/LangChain/LlamaIndex integrations which are auto-captured). These fields must be set explicitly via langfuse_context.update_current_observation(...) inside the function. As written, Anthropic traces will show empty model/usage in the UI.

Additionally, this decorator wraps _with_retries(_call), so on retry all attempts are folded into a single generation observation — the final usage recorded via _record_usage won't make it to Langfuse at all.

🔧 Suggested fix — populate generation fields before returning
+from llm_quest_benchmark.llm.tracing import is_enabled as _langfuse_enabled
+if _langfuse_enabled():
+    from langfuse.decorators import langfuse_context
+else:
+    langfuse_context = None  # type: ignore[assignment]
...
     `@observe`(as_type="generation", name="anthropic-completion")
     def get_completion(self, prompt: str) -> str:
         """Get a completion from the model."""

         def _call() -> str:
             response = self._get_client().messages.create(...)
             usage = _get_attr_or_key(response, "usage")
             prompt_tokens = _coerce_int(_get_attr_or_key(usage, "input_tokens", 0) or 0)
             completion_tokens = _coerce_int(_get_attr_or_key(usage, "output_tokens", 0) or 0)
             self._record_usage(prompt_tokens, completion_tokens)
+            if langfuse_context is not None:
+                langfuse_context.update_current_observation(
+                    model=self.model_id,
+                    input=[{"role": "user", "content": prompt}],
+                    usage={
+                        "input": prompt_tokens,
+                        "output": completion_tokens,
+                        "total": prompt_tokens + completion_tokens,
+                    },
+                )
             ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/client.py` around lines 401 - 422, The Anthropic
generation observation isn't being populated or updated per attempt; inside
get_completion's inner _call (the function passed to _with_retries) call
langfuse_context.update_current_observation(...) to set the generation fields
(model, input/messages, and usage) before returning so traces contain model name
and token counts, and update it after calculating
prompt_tokens/completion_tokens (use _coerce_int outputs and the same keys your
Langfuse schema expects) so each retry attempt updates the current observation;
ensure you reference get_completion, _call, _with_retries, _record_usage and
langfuse_context.update_current_observation when applying the change.
🧹 Nitpick comments (5)
pyproject.toml (1)

33-33: Consider making langfuse an optional extra instead of a hard dep.

The PR goal is opt-in tracing with "zero overhead when Langfuse env vars are not set", but this makes Langfuse (and its transitive deps incl. pydantic/httpx pins, opentelemetry in v3+) a mandatory install for every user. A [project.optional-dependencies] extra (e.g., tracing = ["langfuse>=2.0.0,<3.0.0"]) with lazy imports inside llm/tracing.py and the conditional block in client.py would be more aligned with that intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 33, The dependency declaration makes langfuse a hard
requirement; change it to an optional extra by moving "langfuse>=2.0.0" out of
the main dependency list in pyproject.toml and into
[project.optional-dependencies] (e.g., tracing = ["langfuse>=2.0.0,<3.0.0"]);
then update code that references langfuse (lazy import it inside llm/tracing.py
and guard usage in client.py—where the conditional block enabling tracing runs)
to only import and initialize langfuse when tracing is enabled or env vars are
present, keeping runtime behavior unchanged for users who don’t opt into the
extra.
llm_quest_benchmark/llm/client.py (2)

26-36: No-op observe signature doesn't fully match real decorator — works here but brittle.

langfuse.decorators.observe can be used both as @observe (bare) and @observe(...). Your fallback only handles the parenthesized form (observe(**_kwargs)). This is fine for the single current call site (@observe(as_type="generation", ...)) but will silently break if someone later writes @observe without parens. A two-mode fallback avoids the footgun:

🧹 Optional robustness tweak
-    def observe(**_kwargs):  # type: ignore[misc]
-        def _noop(fn):
-            return fn
-
-        return _noop
+    def observe(*args, **_kwargs):  # type: ignore[misc]
+        # Support both `@observe` and `@observe`(...).
+        if len(args) == 1 and callable(args[0]) and not _kwargs:
+            return args[0]
+
+        def _noop(fn):
+            return fn
+
+        return _noop
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/client.py` around lines 26 - 36, The no-op fallback
for langfuse's observe decorator only supports the parenthesized form and will
break if someone uses `@observe` without parens; update the fallback returned by
the else branch (the observe function defined when _langfuse_enabled() is false)
to accept being used both as a bare decorator and as a decorator factory: detect
if the first argument is a callable (i.e., a function/class) and return it
directly, otherwise return a decorator that wraps the target callable — keep the
symbol names observe, _noop (or similar), and the surrounding logic with the
OpenAI import intact so the behavior mirrors langfuse.decorators.observe for
both `@observe` and `@observe`(...).

21-37: Module-level load_dotenv + import-time Langfuse probe couples side effects to any importer.

load_dotenv now runs unconditionally on import llm_quest_benchmark.llm.client, and the Langfuse decision is frozen at import time. Two consequences:

  1. Any test / tool that imports this module will mutate os.environ from the on-disk .env (even when callers explicitly didn't want it). The CLI already calls load_dotenv in commands.py; duplicating it here makes import order matter.
  2. If a caller sets LANGFUSE_SECRET_KEY after import (e.g., tests), the openai.OpenAI vs langfuse.openai.OpenAI branch is baked in and can no longer flip.

Consider pushing both load_dotenv and the import-time branching into a lazy initializer invoked from _get_client() / AnthropicClient._get_client(), so the module stays side-effect-free.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/client.py` around lines 21 - 37, The module currently
runs load_dotenv and probes _langfuse_enabled at import time, causing side
effects and freezing which OpenAI implementation (openai.OpenAI vs
langfuse.openai.OpenAI) is used; move the dotenv load and the conditional
imports into the lazy initializer used to create clients (i.e., perform
load_dotenv(...) and then evaluate _langfuse_enabled() inside _get_client()
and/or AnthropicClient._get_client()), and import or define observe/OpenAI there
so the choice is made at call time rather than import time, ensuring the module
stays side-effect-free and respects environment changes performed after import.
.env.template (1)

34-36: Nit: reorder Langfuse keys to satisfy dotenv-linter.

dotenv-linter flags LANGFUSE_* as unordered. Easy fix:

🧹 Proposed reorder
-LANGFUSE_SECRET_KEY=
-LANGFUSE_PUBLIC_KEY=
-LANGFUSE_HOST=https://cloud.langfuse.com
+LANGFUSE_HOST=https://cloud.langfuse.com
+LANGFUSE_PUBLIC_KEY=
+LANGFUSE_SECRET_KEY=
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.env.template around lines 34 - 36, dotenv-linter reports the LANGFUSE_*
entries as unordered; reorder the three variables LANGFUSE_HOST,
LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY into alphabetical order
(LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY) in .env.template so
the linter passes.
llm_quest_benchmark/llm/tracing.py (1)

5-12: _enabled caching hides late-binding env changes.

Caching on first call means that if a test, fixture, or load_dotenv call sets LANGFUSE_SECRET_KEY after is_enabled() has already been invoked, tracing stays off forever in that process. Given that client.py calls is_enabled() at import time, the order of imports in any given entrypoint determines the outcome. Either evaluate once explicitly at a known init point, or just re-check os.getenv — the cost is negligible.

🧹 Proposed simplification
-_enabled: bool | None = None
-
-
 def is_enabled() -> bool:
-    global _enabled
-    if _enabled is None:
-        _enabled = bool(os.getenv("LANGFUSE_SECRET_KEY"))
-    return _enabled
+    return bool(os.getenv("LANGFUSE_SECRET_KEY"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/tracing.py` around lines 5 - 12, The current caching
via the module-level _enabled and is_enabled() causes late changes to
LANGFUSE_SECRET_KEY (e.g. from tests or load_dotenv) to be ignored because
client.py calls is_enabled() at import time; remove the stale-cache behavior by
changing is_enabled() to directly return bool(os.getenv("LANGFUSE_SECRET_KEY"))
(or alternatively add an explicit init function to set _enabled at a
deterministic startup point), updating references to _enabled and is_enabled()
accordingly so environment changes after import are respected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.env.template:
- Line 36: Replace the deprecated environment variable name LANGFUSE_HOST with
the current standard LANGFUSE_BASE_URL in the .env.template; specifically,
change the line defining LANGFUSE_HOST to LANGFUSE_BASE_URL so the project uses
the v4+ variable name and aligns with Langfuse docs (also search for any other
occurrences of LANGFUSE_HOST to update references to LANGFUSE_BASE_URL).

In `@llm_quest_benchmark/llm/tracing.py`:
- Around line 15-20: The flush() function should guard against SDK errors: wrap
the Langfuse().flush() call (inside flush(), after the is_enabled() check) in a
try/except Exception block so any exceptions from Langfuse().flush() are caught
and logged (include the exception details) instead of being propagated; do not
re-raise the exception so cleanup in finally blocks cannot mask or crash runs.
Use the module logger (or logging.getLogger(__name__)) to emit the error/warning
and keep the function's behaviour otherwise unchanged.

In `@pyproject.toml`:
- Line 33: The dependency pin is too loose: langfuse v3 removed
langfuse.decorators (used by the client via from langfuse.decorators import
observe) which will raise ModuleNotFoundError; update the pyproject dependency
spec for langfuse to pin it below 3 (e.g., "langfuse>=2.0,<3") so installs get a
v2.x that still exports langfuse.decorators, or alternatively migrate code to
the v3 API by replacing imports from langfuse.decorators with the new v3 form
(e.g., from langfuse import observe) and adjust any langfuse.decorators usage
accordingly (refer to the observe import and any usages of langfuse.decorators
and langfuse.openai.OpenAI).

---

Outside diff comments:
In `@llm_quest_benchmark/executors/benchmark.py`:
- Around line 304-309: The artifact write and trace flush need to be paired so
traces are delivered even if writing artifacts fails; wrap the calls that invoke
_write_benchmark_artifacts (and the subsequent logger.info("Benchmark artifacts
saved to %s", artifact_dir)) in a try/finally where tracing.flush() runs in the
finally block (mirroring commands.py::run) so flush always executes regardless
of errors from _write_benchmark_artifacts.

In `@llm_quest_benchmark/llm/client.py`:
- Around line 401-422: The Anthropic generation observation isn't being
populated or updated per attempt; inside get_completion's inner _call (the
function passed to _with_retries) call
langfuse_context.update_current_observation(...) to set the generation fields
(model, input/messages, and usage) before returning so traces contain model name
and token counts, and update it after calculating
prompt_tokens/completion_tokens (use _coerce_int outputs and the same keys your
Langfuse schema expects) so each retry attempt updates the current observation;
ensure you reference get_completion, _call, _with_retries, _record_usage and
langfuse_context.update_current_observation when applying the change.

---

Nitpick comments:
In @.env.template:
- Around line 34-36: dotenv-linter reports the LANGFUSE_* entries as unordered;
reorder the three variables LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and
LANGFUSE_SECRET_KEY into alphabetical order (LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY,
LANGFUSE_SECRET_KEY) in .env.template so the linter passes.

In `@llm_quest_benchmark/llm/client.py`:
- Around line 26-36: The no-op fallback for langfuse's observe decorator only
supports the parenthesized form and will break if someone uses `@observe` without
parens; update the fallback returned by the else branch (the observe function
defined when _langfuse_enabled() is false) to accept being used both as a bare
decorator and as a decorator factory: detect if the first argument is a callable
(i.e., a function/class) and return it directly, otherwise return a decorator
that wraps the target callable — keep the symbol names observe, _noop (or
similar), and the surrounding logic with the OpenAI import intact so the
behavior mirrors langfuse.decorators.observe for both `@observe` and
`@observe`(...).
- Around line 21-37: The module currently runs load_dotenv and probes
_langfuse_enabled at import time, causing side effects and freezing which OpenAI
implementation (openai.OpenAI vs langfuse.openai.OpenAI) is used; move the
dotenv load and the conditional imports into the lazy initializer used to create
clients (i.e., perform load_dotenv(...) and then evaluate _langfuse_enabled()
inside _get_client() and/or AnthropicClient._get_client()), and import or define
observe/OpenAI there so the choice is made at call time rather than import time,
ensuring the module stays side-effect-free and respects environment changes
performed after import.

In `@llm_quest_benchmark/llm/tracing.py`:
- Around line 5-12: The current caching via the module-level _enabled and
is_enabled() causes late changes to LANGFUSE_SECRET_KEY (e.g. from tests or
load_dotenv) to be ignored because client.py calls is_enabled() at import time;
remove the stale-cache behavior by changing is_enabled() to directly return
bool(os.getenv("LANGFUSE_SECRET_KEY")) (or alternatively add an explicit init
function to set _enabled at a deterministic startup point), updating references
to _enabled and is_enabled() accordingly so environment changes after import are
respected.

In `@pyproject.toml`:
- Line 33: The dependency declaration makes langfuse a hard requirement; change
it to an optional extra by moving "langfuse>=2.0.0" out of the main dependency
list in pyproject.toml and into [project.optional-dependencies] (e.g., tracing =
["langfuse>=2.0.0,<3.0.0"]); then update code that references langfuse (lazy
import it inside llm/tracing.py and guard usage in client.py—where the
conditional block enabling tracing runs) to only import and initialize langfuse
when tracing is enabled or env vars are present, keeping runtime behavior
unchanged for users who don’t opt into the extra.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5afe89c6-138b-4edc-86ff-cec32043855e

📥 Commits

Reviewing files that changed from the base of the PR and between 129a763 and c1b1eb8.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • .env.template
  • llm_quest_benchmark/executors/benchmark.py
  • llm_quest_benchmark/executors/cli/commands.py
  • llm_quest_benchmark/llm/client.py
  • llm_quest_benchmark/llm/tracing.py
  • pyproject.toml

Comment thread .env.template Outdated
# Sign up at https://cloud.langfuse.com or self-host.
LANGFUSE_SECRET_KEY=
LANGFUSE_PUBLIC_KEY=
LANGFUSE_HOST=https://cloud.langfuse.com

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

langfuse python SDK LANGFUSE_HOST vs LANGFUSE_BASE_URL environment variable current name

💡 Result:

The current environment variable for configuring the Langfuse Python SDK API endpoint is LANGFUSE_BASE_URL. LANGFUSE_HOST is deprecated and should no longer be used; use LANGFUSE_BASE_URL instead. Both are supported for backward compatibility, with LANGFUSE_BASE_URL taking precedence when both are set. The default value for both is "https://cloud.langfuse.com".

Citations:


🏁 Script executed:

find . -name "pyproject.toml" -type f | head -5

Repository: yourconscience/llm_quest_benchmark

Length of output: 95


🏁 Script executed:

cat pyproject.toml

Repository: yourconscience/llm_quest_benchmark

Length of output: 2193


🏁 Script executed:

cat -n .env.template | head -40

Repository: yourconscience/llm_quest_benchmark

Length of output: 1592


Update .env.template to use LANGFUSE_BASE_URL instead of LANGFUSE_HOST.

LANGFUSE_HOST is deprecated in favor of LANGFUSE_BASE_URL in newer Langfuse versions (v4+). While both are supported for backward compatibility, the template should use the current standard variable name for forward compatibility and to align with current Langfuse documentation.

🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 36-36: [UnorderedKey] The LANGFUSE_HOST key should go before the LANGFUSE_PUBLIC_KEY key

(UnorderedKey)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.env.template at line 36, Replace the deprecated environment variable name
LANGFUSE_HOST with the current standard LANGFUSE_BASE_URL in the .env.template;
specifically, change the line defining LANGFUSE_HOST to LANGFUSE_BASE_URL so the
project uses the v4+ variable name and aligns with Langfuse docs (also search
for any other occurrences of LANGFUSE_HOST to update references to
LANGFUSE_BASE_URL).

Comment thread llm_quest_benchmark/llm/tracing.py Outdated
Comment on lines +15 to +20
def flush() -> None:
if not is_enabled():
return
from langfuse import Langfuse

Langfuse().flush()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

flush() should be defensive against SDK errors.

Langfuse().flush() can raise if the SDK can't reach the backend or if credentials are misconfigured. Since this runs in finally blocks at the end of quests/benchmarks, an exception here will mask the real outcome or crash an otherwise successful run. Wrap the flush in a try/except and log instead.

🛡️ Proposed hardening
 def flush() -> None:
     if not is_enabled():
         return
-    from langfuse import Langfuse
-
-    Langfuse().flush()
+    try:
+        from langfuse import Langfuse
+        Langfuse().flush()
+    except Exception:
+        import logging
+        logging.getLogger(__name__).warning("Langfuse flush failed", exc_info=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm_quest_benchmark/llm/tracing.py` around lines 15 - 20, The flush()
function should guard against SDK errors: wrap the Langfuse().flush() call
(inside flush(), after the is_enabled() check) in a try/except Exception block
so any exceptions from Langfuse().flush() are caught and logged (include the
exception details) instead of being propagated; do not re-raise the exception so
cleanup in finally blocks cannot mask or crash runs. Use the module logger (or
logging.getLogger(__name__)) to emit the error/warning and keep the function's
behaviour otherwise unchanged.

Comment thread pyproject.toml Outdated
"openai>=1.12.0",
"openrouter>=0.5.0",
"typer>=0.9.0",
"langfuse>=2.0.0",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does langfuse Python v3 still expose langfuse.decorators.observe and langfuse.openai.OpenAI wrapper?

💡 Result:

Yes, Langfuse Python v3 (and later v4) still exposes both langfuse.decorators.observe and langfuse.openai.OpenAI wrapper. The @observe decorator is available directly from the top-level langfuse module: from langfuse import observe. The old v2 path langfuse.decorators.observe is deprecated but the functionality persists in v3/v4 under the new import. Source code shows observe.py in langfuse/_client/ and documentation examples use from langfuse import observe. V2→v3 migration shows continued use of @observe with changed import: from langfuse import observe. The langfuse.openai OpenAI wrapper remains available and actively maintained in the current main branch (post-v4), providing drop-in replacement: from langfuse.openai import openai. It integrates with the new observation model using start_observation internally. Official integration docs and examples confirm compatibility with v3/v4 decorators and context managers.

Citations:


🏁 Script executed:

# Find files using langfuse imports
fd -type f -name "*.py" | xargs rg "from langfuse" 2>/dev/null | head -20

Repository: yourconscience/llm_quest_benchmark

Length of output: 515


🏁 Script executed:

# Check if llm/client.py exists and view its langfuse imports
find . -name "client.py" -path "*/llm/*" 2>/dev/null | head -5

Repository: yourconscience/llm_quest_benchmark

Length of output: 114


🏁 Script executed:

# Search for any langfuse imports in the entire codebase
rg "langfuse" --type py -A 2 | head -50

Repository: yourconscience/llm_quest_benchmark

Length of output: 758


🏁 Script executed:

# Search for information about langfuse versions and what changed
web_search --query "langfuse-python v3 v4 breaking changes decorators openai module removed"

Repository: yourconscience/llm_quest_benchmark

Length of output: 129


🏁 Script executed:

# Look at langfuse documentation for v3 to v4 migration
web_search --query "langfuse Python v3 to v4 migration guide breaking changes"

Repository: yourconscience/llm_quest_benchmark

Length of output: 129


🏁 Script executed:

# Check if langfuse.decorators and langfuse.openai still exist in current releases
web_search --query "langfuse-python github langfuse/decorators.py langfuse/openai.py v3 v4 source code"

Repository: yourconscience/llm_quest_benchmark

Length of output: 129


🌐 Web query:

Does langfuse v3 still have the langfuse.decorators module with observe decorator?

💡 Result:

No, Langfuse v3 does not have the langfuse.decorators module with the observe decorator. In v2, it was imported as from langfuse.decorators import observe. In v3, the import changed to from langfuse import observe directly, as the decorator was refactored and moved to langfuse/_client/observe.py internally. Attempting to import langfuse.decorators in v3 results in ModuleNotFoundError. The observe decorator still exists and functions in v3 and later versions (including v4, current as of 2026), but the module path changed.

Citations:


🌐 Web query:

Does langfuse v3 still have langfuse.openai module with OpenAI wrapper?

💡 Result:

Yes, Langfuse Python SDK v3 still includes the langfuse.openai module with the OpenAI wrapper. It provides a drop-in replacement for the OpenAI SDK (import openai from langfuse.openai instead of directly from openai), supporting tracing of prompts, completions, latencies, costs, and errors. This works with OpenAI SDK versions >=0.27.8, including async and streaming for >=1.0.0. The module is actively maintained, with recent fixes in 2026 (e.g., openai.py updates in releases like v3.2.8 and later). Official docs and API reference confirm its availability in v3, compatible with the OTEL-based SDK while preserving the wrapper integration. Use @observe decorator for grouping multiple calls into traces.

Citations:


🌐 Web query:

Langfuse Python v3 v4 breaking changes removed modules migration

💡 Result:

Langfuse Python SDK v4 (released March 2026) introduces breaking changes from v3, primarily adopting an observation-centric data model for better performance. No entire modules were removed, but significant methods, classes, and API endpoints were deprecated/removed, along with internal restructuring. Key breaking changes: - OpenTelemetry: Default span export now filtered (non-LLM spans like HTTP/DB not exported by default). Use should_export_span=lambda span: True to restore v3 behavior. blocked_instrumentation_scopes deprecated. - Trace updates: update_current_trace and .update_trace removed, replaced by propagate_attributes (context manager for user_id/session_id/metadata/tags), set_current_trace_io (deprecated, for legacy LLM-as-judge), set_current_trace_as_public. - Span/Generation APIs: start_span, start_generation, start_as_current_span/generation → start_observation(as_type="span"/"generation"). - Datasets: DatasetItemClient removed; item.run → dataset.run_experiment. - LangChain CallbackHandler: update_trace parameter removed. - API namespaces: api.observations_v_2 → api.observations; legacy api.observations → api.legacy.observations_v1 (self-hosted: stick to legacy until v2 endpoints available). - Types: langfuse.types (TraceMetadata, ObservationParams, etc.) removed. - Pydantic v1 dropped (now v2 required). - Internal: langfuse/api/resources/ directory deleted; reorganized to top-level modules with lazy loading. Migration checklist (from official docs): 1. Audit non-LLM OTEL spans. 2. Restore export-all if needed. 3. Migrate blocked_instrumentation_scopes. 4. Replace update_current_trace/update_trace calls. 5. Replace start_span/start_generation. 6. Update dataset runs. 7. Remove CallbackHandler(update_trace=...). 8. Update API v2 → non-suffixed; legacy v1 → api.legacy.*_v1. 9. For self-hosted, use legacy endpoints. Full guide: https://langfuse.com/docs/observability/sdk/upgrade-path/python-v3-to-v4. Install: pip install langfuse>=4.0.0 (Python 3.10+).

Citations:


Pin langfuse to <3 — v3 removed langfuse.decorators and the client code will fail on import.

The client code imports from langfuse.decorators import observe, which is no longer available in langfuse v3+. Attempting this import raises ModuleNotFoundError. While langfuse.openai.OpenAI remains available in v3, the decorators module removal is a breaking change that will cause the feature to fail on any fresh install with langfuse>=2.0.0 (which resolves to v3+).

To fix, pin the dependency:

Proposed fix
-    "langfuse>=2.0.0",
+    "langfuse>=2.0.0,<3.0.0",

Alternative: migrate to v3+ API (from langfuse import observe) and pin accordingly.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"langfuse>=2.0.0",
"langfuse>=2.0.0,<3.0.0",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 33, The dependency pin is too loose: langfuse v3
removed langfuse.decorators (used by the client via from langfuse.decorators
import observe) which will raise ModuleNotFoundError; update the pyproject
dependency spec for langfuse to pin it below 3 (e.g., "langfuse>=2.0,<3") so
installs get a v2.x that still exports langfuse.decorators, or alternatively
migrate code to the v3 API by replacing imports from langfuse.decorators with
the new v3 form (e.g., from langfuse import observe) and adjust any
langfuse.decorators usage accordingly (refer to the observe import and any
usages of langfuse.decorators and langfuse.openai.OpenAI).

@yourconscience yourconscience merged commit c2d7b59 into master Apr 24, 2026
2 checks passed
@yourconscience yourconscience deleted the feat/langfuse-tracing branch April 24, 2026 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant