
Fix langchain integration test failures + add DBSQL/streamable MCP tests #369

Open
dhruv0811 wants to merge 33 commits into main from fix-langchain-test-failures

Conversation

@dhruv0811
Contributor

@dhruv0811 dhruv0811 commented Mar 6, 2026

Summary

MCP Integration Tests

  • Add DBSQL and streamable HTTP MCP integration tests (parity with existing nightly MCP test runner in dogfood notebook)

LangChain Integration Test Fixes

| Test | Issue | Fix |
| --- | --- | --- |
| `test_chat_databricks_invoke[llama]` | `assert 49 <= 30` — prompt_tokens range too tight | Widen range to 15–60 |
| `invoke_multiple_completions[claude]` | Anthropic doesn't support `n > 1` | Skip Claude parametrization |
| `test_chat_databricks_stream` ×2 | `KeyError: 'finish_reason'` | Find chunk with `finish_reason` instead of assuming `chunks[-1]` |
| `test_chat_databricks_stream_with_usage` ×2 | `usage_metadata` is `None` on last chunk | Find usage chunk instead of assuming last (FMAPI already returns usage, just not on the trailing chunk) |
| `structured_output[json_mode-claude]` ×3 | Anthropic doesn't support the `json_object` response format | Skip the Claude + `json_mode` combo |
| `test_chat_databricks_langgraph` | Fully redundant with FMAPI `TestLangGraphSync::test_single_turn` | Removed |
| `langgraph_with_memory[claude]` | Non-deterministic LLM response | Flexible assertion |
| `timeout_and_retries` | Pydantic rejects `Mock(spec=WorkspaceClient)` | Moved to `unit_tests` with a proper mock |
| `custom_outputs` + `custom_outputs_stream` | `ENDPOINT_NOT_FOUND` (personal dogfood endpoints) | Gated behind `RUN_DOGFOOD_TESTS` |
| `test_chat_databricks_token_count` | `usage_metadata` is `None` on last chunk | Find usage chunk instead of assuming last (same root cause as `stream_with_usage`) |
| `gpt5_stream_with_usage` | Uses dogfood `gpt-5` endpoint | Changed to `databricks-gpt-5` (available in ai-oss) |
| `responses_api_usage_metadata_keys` | `reasoning_tokens` key doesn't exist | Fixed to `reasoning` (matching LangChain's `OutputTokenDetails` field) |
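The streaming fixes above share one pattern: scan the chunk list for the field you need rather than indexing `chunks[-1]`, since the trailing chunk is often empty. A minimal sketch with hypothetical dict-shaped chunks (real ChatDatabricks chunks are `AIMessageChunk` objects with `response_metadata`/`usage_metadata` attributes):

```python
# Sketch only: chunk shapes are simplified dicts, not real AIMessageChunk objects.
chunks = [
    {"content": "Hel", "response_metadata": {}, "usage_metadata": None},
    {"content": "lo", "response_metadata": {"finish_reason": "stop"},
     "usage_metadata": {"input_tokens": 12, "output_tokens": 2, "total_tokens": 14}},
    {"content": "", "response_metadata": {}, "usage_metadata": None},  # empty trailing chunk
]

# Fragile: assumes the last chunk carries the metadata (raises KeyError here).
# finish_reason = chunks[-1]["response_metadata"]["finish_reason"]

# Robust: find the chunk that actually has the field.
finish_chunk = next(c for c in chunks if "finish_reason" in c["response_metadata"])
usage_chunks = [c for c in chunks if c["usage_metadata"] is not None]

assert finish_chunk["response_metadata"]["finish_reason"] == "stop"
assert usage_chunks[0]["usage_metadata"]["total_tokens"] == 14
```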

FMAPI Skip List

  • Add databricks-gpt-5-4 (requires /v1/responses for tool calling, not /v1/chat/completions)
  • Add databricks-gemini-3-1-flash-lite (requires thought_signature on function calls)
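The "skip Claude" fixes can be expressed directly in the parametrization rather than with in-test branching. A sketch using `pytest.param` with a skip mark; the model list and test name here are hypothetical, not the module's real fixtures:

```python
import pytest

# Hypothetical parametrization; the real test module's model IDs differ.
MODELS = [
    "llama",
    pytest.param(
        "claude",
        marks=pytest.mark.skip(reason="Anthropic does not support n > 1"),
    ),
]

@pytest.mark.parametrize("model", MODELS)
def test_invoke_multiple_completions(model):
    ...  # invoke the endpoint with n=2 and assert two generations come back
```

This keeps the skip reason visible in the test report while leaving the llama parametrization untouched.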

Test plan

dhruv0811 and others added 30 commits March 2, 2026 12:00
Test that identity is forwarded correctly through both the Model Serving
(ModelServingUserCredentials) and Databricks Apps (direct token) OBO paths
using two different service principals and a whoami() UC function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ials

WorkspaceClient doesn't accept credential_strategy directly.
Use Config object as shown in the existing unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t kwarg

The docstring had a typo (credential_strategy vs credentials_strategy).
Fixed both the test and the source docstring to use the correct parameter
name that WorkspaceClient actually accepts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SQL current_user() returns the SP's UUID, not its display_name.
Compare the two whoami() results against each other instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Invoke pre-deployed Model Serving endpoint and Databricks App as two
different SPs, assert each sees their own identity via whoami() tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- App fixture: committed agent code so CI redeploys with latest on each run
- deploy_serving_agent.py: script to log + deploy ChatModel with OBO to serving endpoint
- Warm-start fixture: polls serving endpoint until scaled up before tests
- Remove -k TestAppsOBO filter — both Apps and Serving tests run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hatch couldn't find the package directory because the project name
didn't match any directory. Explicitly list agent_server and scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- whoami_serving_agent.py: ResponsesAgent using SQL Statement Execution
  with ModelServingUserCredentials for OBO
- deploy_serving_agent.py: logs with AuthPolicy + deploys with scale_to_zero
- Warehouse ID from env var (not hardcoded)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove databricks-openai from test deps (breaks core_test lowest-direct)
- Use pytest.importorskip instead
- Convert print() to logging in deploy script
- Fix ruff/format issues in all OBO files
- Remove hardcoded warehouse ID, use env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The serving endpoint returns the SP's UUID via SQL current_user(),
not the display_name. Use the client ID from env var which matches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are model artifacts and deploy scripts that use MLflow/agents
types not available in the core type checking environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The serving env doesn't have OBO_TEST_WAREHOUSE_ID. The deploy script
now replaces the placeholder in the agent file before logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…serving

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
agents.deploy() auto-derives endpoint name from UC model name.
Passing endpoint_name was creating a new endpoint instead of
updating the existing one. Match notebook pattern exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix reasoning_tokens -> reasoning key in responses API usage assertion
- Remove hardcoded MLflow experiment ID from token_count test
- Add stream_usage=True for streaming usage metadata test
- Use databricks-gpt-5 FMAPI endpoint instead of gpt-5 with dogfood profile

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_chat_databricks_langgraph (redundant with FMAPI tests)
- Gate dogfood-dependent tests behind RUN_DOGFOOD_TESTS env var
- Gate personal endpoint tests behind RUN_DOGFOOD_TESTS
- Skip Claude for n>1 and json_mode (unsupported)
- Fix streaming finish_reason assertion (KeyError)
- Widen prompt_tokens range assertion
- Fix langgraph_with_memory assertion (non-deterministic)
- Fix timeout_and_retries mock (pydantic validation)
- Fix reasoning_tokens key in responses API usage assertion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip stream_with_usage tests: streaming usage_metadata requires
  stream_options support not yet in ChatDatabricks
- Skip token_count test: same streaming usage issue
- Skip timeout_and_retries: unit test with mocks, should be in unit_tests/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DBSQL: list_tools (validates execute_sql, execute_sql_read_only, poll_sql_result)
  and call_tool (execute_sql_read_only with SHOW CATALOGS)
- Raw streamable_http_client: tests the low-level MCP SDK path
  (httpx.AsyncClient + DatabricksOAuthClientProvider + streamable_http_client
  + ClientSession) for UC functions, Vector Search, and DBSQL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Completes coverage: all 4 server types (UC, VS, DBSQL, Genie) are now
tested via both DatabricksMCPClient and raw streamable_http_client.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace asyncio.run() wrapper pattern with @pytest.mark.asyncio +
async def, matching the convention used elsewhere in the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
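The conversion above can be sketched as follows, assuming pytest-asyncio is installed; the helper and test names are illustrative stand-ins, not the real MCP calls:

```python
import asyncio
import pytest

async def list_tools():
    # Stand-in for the real async MCP client call.
    return ["execute_sql", "execute_sql_read_only", "poll_sql_result"]

# Old pattern: sync test manually wrapping the coroutine.
def test_list_tools_wrapped():
    tools = asyncio.run(list_tools())
    assert "execute_sql" in tools

# New pattern: native async test, collected by pytest-asyncio.
@pytest.mark.asyncio
async def test_list_tools():
    tools = await list_tools()
    assert "execute_sql" in tools
```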
ChatDatabricks fixes:
- Add stream_options={"include_usage": True} to streaming API calls
  so usage_metadata is returned in stream chunks

Test fixes:
- Widen prompt_tokens assertion range (15-60 instead of 20-30)
- Skip Claude for n>1 and json_mode (unsupported by Anthropic)
- Fix finish_reason: find chunk with finish_reason instead of assuming last
- Fix langgraph_with_memory: flexible assertion for LLM non-determinism
- Move timeout_and_retries mock test to unit_tests/
- Point gpt5_stream test at ai-oss endpoint (remove dogfood dependency)
- Fix reasoning_tokens -> reasoning key in responses API usage assertion
- Remove redundant test_chat_databricks_langgraph (covered by FMAPI tests)
- Fix token_count: find usage chunk instead of assuming last chunk

FMAPI skip list:
- Add databricks-gpt-5-4 (requires /v1/responses for tool calling)
- Add databricks-gemini-3-1-flash-lite (requires thought_signature)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChatDatabricks:
- Add stream_options={"include_usage": True} for streaming usage metadata

Test fixes:
- Widen prompt_tokens range (15-60)
- Skip Claude for n>1 and json_mode
- Fix finish_reason: find chunk with value instead of assuming last
- Fix langgraph_with_memory: flexible assertion
- Move timeout_and_retries to unit_tests/
- Point gpt5_stream at databricks-gpt-5 (remove dogfood dep)
- Fix token_count: find usage chunk, use stream_usage=True
- Fix reasoning_tokens -> reasoning key
- Remove redundant test_chat_databricks_langgraph

FMAPI: Add gpt-5-4 and gemini-3-1-flash-lite to skip list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dhruv0811 and others added 3 commits March 9, 2026 16:14
ChatDatabricks.client is a @cached_property that calls get_openai_client(),
not WorkspaceClient().serving_endpoints.get_open_ai_client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert stream_options change (not all models support it)
- Fix last_chunk NameError in stream_with_usage (comment out usage assertions)
- Comment out token_count streaming part (needs stream_options)
- Gate custom_outputs behind RUN_DOGFOOD_TESTS
- Remove hardcoded MLflow experiment ID from token_count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FMAPI already returns usage in streaming chunks — no stream_options needed.
The issue was that chunks[-1] is often an empty trailing chunk, not the
one with usage_metadata. Find chunks with usage_metadata explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>