Having automated tests for tool metadata quality (tool names and parameter names/descriptions), while not foolproof, would make it much easier to change existing servers confidently without worrying about regressing existing use cases. Such tests could catch issues like a new tool whose description collides with another tool's and confuses LLMs. We should be able to write some example tests using, e.g., Mosaic AI Agent Evaluation or open-source eval frameworks; a sketch of one such test follows.
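As a minimal sketch of what a description-collision test might look like, the snippet below uses plain pytest and stdlib string similarity as a cheap stand-in for an LLM judge or embedding-based check. The tool metadata shape (a list of dicts with `name` and `description` keys), the example tools, and the similarity threshold are all assumptions for illustration; a real test would load metadata from the server and could swap in an eval framework's judge instead.

```python
# Hedged sketch: flag pairs of tool descriptions that are nearly identical,
# one common way descriptions collide and confuse an LLM's tool choice.
from difflib import SequenceMatcher
from itertools import combinations

import pytest

# Hypothetical tool metadata; in practice, load this from the actual server.
TOOLS = [
    {"name": "run_query", "description": "Execute a SQL query and return the result rows."},
    {"name": "list_tables", "description": "List the tables available in a catalog."},
]

# Arbitrary cutoff; tune against known-good servers before enforcing in CI.
SIMILARITY_THRESHOLD = 0.9


def description_similarity(a: str, b: str) -> float:
    """Return a rough [0, 1] similarity score between two descriptions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@pytest.mark.parametrize("tool_a,tool_b", list(combinations(TOOLS, 2)))
def test_descriptions_do_not_collide(tool_a, tool_b):
    score = description_similarity(tool_a["description"], tool_b["description"])
    assert score < SIMILARITY_THRESHOLD, (
        f"Descriptions for {tool_a['name']!r} and {tool_b['name']!r} are "
        f"{score:.2f} similar; an LLM may struggle to pick between them."
    )
```

The same pairwise structure would carry over to an LLM-as-judge version: replace `description_similarity` with a call into whichever eval framework we pick, and keep the test parametrization as the harness.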