
Fix langchain integration test failures + add DBSQL/streamable MCP tests #369

Open
dhruv0811 wants to merge 33 commits into main from fix-langchain-test-failures

Conversation

@dhruv0811
Contributor

@dhruv0811 dhruv0811 commented Mar 6, 2026

Summary

MCP Integration Tests

  • Add DBSQL and streamable HTTP MCP integration tests (parity with existing nightly MCP test runner in dogfood notebook)

LangChain Integration Test Fixes

| Test | Issue | Fix |
| --- | --- | --- |
| `test_chat_databricks_invoke[llama]` | `assert 49 <= 30` — prompt_tokens range too tight | Widen range to 15–60 |
| `invoke_multiple_completions[claude]` | Anthropic doesn't support `n > 1` | Skip Claude parametrization |
| `test_chat_databricks_stream` ×2 | `KeyError: 'finish_reason'` | Find chunk with `finish_reason` instead of assuming `chunks[-1]` |
| `test_chat_databricks_stream_with_usage` ×2 | `usage_metadata` is `None` on last chunk | Find usage chunk instead of assuming last (FMAPI already returns usage, just not on the trailing chunk) |
| `structured_output[json_mode-claude]` ×3 | Anthropic doesn't support the `json_object` response format | Skip the Claude + `json_mode` combo |
| `test_chat_databricks_langgraph` | Fully redundant with FMAPI `TestLangGraphSync::test_single_turn` | Removed |
| `langgraph_with_memory[claude]` | Non-deterministic LLM response | Flexible assertion |
| `timeout_and_retries` | Pydantic rejects `Mock(spec=WorkspaceClient)` | Moved to `unit_tests` with a proper mock |
| `custom_outputs` + `custom_outputs_stream` | `ENDPOINT_NOT_FOUND` (personal dogfood endpoints) | Gated behind `RUN_DOGFOOD_TESTS` |
| `test_chat_databricks_token_count` | `usage_metadata` is `None` on last chunk | Find usage chunk instead of assuming last (same root cause as `stream_with_usage`) |
| `gpt5_stream_with_usage` | Uses dogfood `gpt-5` endpoint | Changed to `databricks-gpt-5` (available in ai-oss) |
| `responses_api_usage_metadata_keys` | `reasoning_tokens` key doesn't exist | Fixed to `reasoning` (matching LangChain's `OutputTokenDetails` field) |
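The streaming fixes above share one pattern: scan the chunk list for the field you need rather than indexing `chunks[-1]`, since the trailing chunk is often empty. A minimal sketch with hypothetical dict-shaped chunks (real ChatDatabricks chunks are `AIMessageChunk` objects with `response_metadata`/`usage_metadata` attributes):

```python
# Sketch only: chunk shapes are simplified dicts, not real AIMessageChunk objects.
chunks = [
    {"content": "Hel", "response_metadata": {}, "usage_metadata": None},
    {"content": "lo", "response_metadata": {"finish_reason": "stop"},
     "usage_metadata": {"input_tokens": 12, "output_tokens": 2, "total_tokens": 14}},
    {"content": "", "response_metadata": {}, "usage_metadata": None},  # empty trailing chunk
]

# Fragile: assumes the last chunk carries the metadata (raises KeyError here).
# finish_reason = chunks[-1]["response_metadata"]["finish_reason"]

# Robust: find the chunk that actually has the field.
finish_chunk = next(c for c in chunks if "finish_reason" in c["response_metadata"])
usage_chunks = [c for c in chunks if c["usage_metadata"] is not None]

assert finish_chunk["response_metadata"]["finish_reason"] == "stop"
assert usage_chunks[0]["usage_metadata"]["total_tokens"] == 14
```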

FMAPI Skip List

  • Add databricks-gpt-5-4 (requires /v1/responses for tool calling, not /v1/chat/completions)
  • Add databricks-gemini-3-1-flash-lite (requires thought_signature on function calls)
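The "skip Claude" fixes can be expressed directly in the parametrization rather than with in-test branching. A sketch using `pytest.param` with a skip mark; the model list and test name here are hypothetical, not the module's real fixtures:

```python
import pytest

# Hypothetical parametrization; the real test module's model IDs differ.
MODELS = [
    "llama",
    pytest.param(
        "claude",
        marks=pytest.mark.skip(reason="Anthropic does not support n > 1"),
    ),
]

@pytest.mark.parametrize("model", MODELS)
def test_invoke_multiple_completions(model):
    ...  # invoke the endpoint with n=2 and assert two generations come back
```

This keeps the skip reason visible in the test report while leaving the llama parametrization untouched.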

Test plan

dhruv0811 and others added 30 commits March 2, 2026 12:00
Test that identity is forwarded correctly through both the Model Serving
(ModelServingUserCredentials) and Databricks Apps (direct token) OBO paths
using two different service principals and a whoami() UC function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ials

WorkspaceClient doesn't accept credential_strategy directly.
Use Config object as shown in the existing unit tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t kwarg

The docstring had a typo (credential_strategy vs credentials_strategy).
Fixed both the test and the source docstring to use the correct parameter
name that WorkspaceClient actually accepts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SQL current_user() returns the SP's UUID, not its display_name.
Compare the two whoami() results against each other instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Invoke pre-deployed Model Serving endpoint and Databricks App as two
different SPs, assert each sees their own identity via whoami() tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- App fixture: committed agent code so CI redeploys with latest on each run
- deploy_serving_agent.py: script to log + deploy ChatModel with OBO to serving endpoint
- Warm-start fixture: polls serving endpoint until scaled up before tests
- Remove -k TestAppsOBO filter — both Apps and Serving tests run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hatch couldn't find the package directory because the project name
didn't match any directory. Explicitly list agent_server and scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- whoami_serving_agent.py: ResponsesAgent using SQL Statement Execution
  with ModelServingUserCredentials for OBO
- deploy_serving_agent.py: logs with AuthPolicy + deploys with scale_to_zero
- Warehouse ID from env var (not hardcoded)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove databricks-openai from test deps (breaks core_test lowest-direct)
- Use pytest.importorskip instead
- Convert print() to logging in deploy script
- Fix ruff/format issues in all OBO files
- Remove hardcoded warehouse ID, use env var

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The serving endpoint returns the SP's UUID via SQL current_user(),
not the display_name. Use the client ID from env var which matches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are model artifacts and deploy scripts that use MLflow/agents
types not available in the core type checking environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The serving env doesn't have OBO_TEST_WAREHOUSE_ID. The deploy script
now replaces the placeholder in the agent file before logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…serving

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
agents.deploy() auto-derives endpoint name from UC model name.
Passing endpoint_name was creating a new endpoint instead of
updating the existing one. Match notebook pattern exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix reasoning_tokens -> reasoning key in responses API usage assertion
- Remove hardcoded MLflow experiment ID from token_count test
- Add stream_usage=True for streaming usage metadata test
- Use databricks-gpt-5 FMAPI endpoint instead of gpt-5 with dogfood profile

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_chat_databricks_langgraph (redundant with FMAPI tests)
- Gate dogfood-dependent tests behind RUN_DOGFOOD_TESTS env var
- Gate personal endpoint tests behind RUN_DOGFOOD_TESTS
- Skip Claude for n>1 and json_mode (unsupported)
- Fix streaming finish_reason assertion (KeyError)
- Widen prompt_tokens range assertion
- Fix langgraph_with_memory assertion (non-deterministic)
- Fix timeout_and_retries mock (pydantic validation)
- Fix reasoning_tokens key in responses API usage assertion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip stream_with_usage tests: streaming usage_metadata requires
  stream_options support not yet in ChatDatabricks
- Skip token_count test: same streaming usage issue
- Skip timeout_and_retries: unit test with mocks, should be in unit_tests/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- DBSQL: list_tools (validates execute_sql, execute_sql_read_only, poll_sql_result)
  and call_tool (execute_sql_read_only with SHOW CATALOGS)
- Raw streamable_http_client: tests the low-level MCP SDK path
  (httpx.AsyncClient + DatabricksOAuthClientProvider + streamable_http_client
  + ClientSession) for UC functions, Vector Search, and DBSQL

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Completes coverage: all 4 server types (UC, VS, DBSQL, Genie) are now
tested via both DatabricksMCPClient and raw streamable_http_client.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace asyncio.run() wrapper pattern with @pytest.mark.asyncio +
async def, matching the convention used elsewhere in the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
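The conversion above can be sketched as follows, assuming pytest-asyncio is installed; the helper and test names are illustrative stand-ins, not the real MCP calls:

```python
import asyncio
import pytest

async def list_tools():
    # Stand-in for the real async MCP client call.
    return ["execute_sql", "execute_sql_read_only", "poll_sql_result"]

# Old pattern: sync test manually wrapping the coroutine.
def test_list_tools_wrapped():
    tools = asyncio.run(list_tools())
    assert "execute_sql" in tools

# New pattern: native async test, collected by pytest-asyncio.
@pytest.mark.asyncio
async def test_list_tools():
    tools = await list_tools()
    assert "execute_sql" in tools
```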
ChatDatabricks fixes:
- Add stream_options={"include_usage": True} to streaming API calls
  so usage_metadata is returned in stream chunks

Test fixes:
- Widen prompt_tokens assertion range (15-60 instead of 20-30)
- Skip Claude for n>1 and json_mode (unsupported by Anthropic)
- Fix finish_reason: find chunk with finish_reason instead of assuming last
- Fix langgraph_with_memory: flexible assertion for LLM non-determinism
- Move timeout_and_retries mock test to unit_tests/
- Point gpt5_stream test at ai-oss endpoint (remove dogfood dependency)
- Fix reasoning_tokens -> reasoning key in responses API usage assertion
- Remove redundant test_chat_databricks_langgraph (covered by FMAPI tests)
- Fix token_count: find usage chunk instead of assuming last chunk

FMAPI skip list:
- Add databricks-gpt-5-4 (requires /v1/responses for tool calling)
- Add databricks-gemini-3-1-flash-lite (requires thought_signature)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ChatDatabricks:
- Add stream_options={"include_usage": True} for streaming usage metadata

Test fixes:
- Widen prompt_tokens range (15-60)
- Skip Claude for n>1 and json_mode
- Fix finish_reason: find chunk with value instead of assuming last
- Fix langgraph_with_memory: flexible assertion
- Move timeout_and_retries to unit_tests/
- Point gpt5_stream at databricks-gpt-5 (remove dogfood dep)
- Fix token_count: find usage chunk, use stream_usage=True
- Fix reasoning_tokens -> reasoning key
- Remove redundant test_chat_databricks_langgraph

FMAPI: Add gpt-5-4 and gemini-3-1-flash-lite to skip list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dhruv0811 and others added 3 commits March 9, 2026 16:14
ChatDatabricks.client is a @cached_property that calls get_openai_client(),
not WorkspaceClient().serving_endpoints.get_open_ai_client().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert stream_options change (not all models support it)
- Fix last_chunk NameError in stream_with_usage (comment out usage assertions)
- Comment out token_count streaming part (needs stream_options)
- Gate custom_outputs behind RUN_DOGFOOD_TESTS
- Remove hardcoded MLflow experiment ID from token_count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FMAPI already returns usage in streaming chunks — no stream_options needed.
The issue was that chunks[-1] is often an empty trailing chunk, not the
one with usage_metadata. Find chunks with usage_metadata explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>