AgentScope-Java is an open-source project. To involve a broader community, we recommend asking your questions in English.
**Is your feature request related to a problem? Please describe.**
Currently, AgentScope Java provides built-in support for text-to-speech (TTS) via DashScope and multi-modal content understanding (image/video/audio as input), but there is no built-in model or tool support for image generation and video generation. Many real-world AI agent scenarios require the ability to create visual content — for example, marketing agents generating promotional images, design assistants producing concept art, or social media agents creating short videos. The [Jimeng](https://jimeng.jianying.com/) (即梦) and [Seedance](https://www.doubao.com/seedance) APIs from ByteDance/Douyin offer high-quality text-to-image and text-to-video generation capabilities, but there is currently no native integration in the framework.
**Describe the solution you'd like**
I'd like to see native support for image and video generation models, similar to how the framework already supports multiple chat models. Specifically:
- **`ImageModel` interface** — a new model interface (similar to `Model`/`TTSModel`) with methods like `generate(prompt, options)` returning generated image data or URLs.
- **`VideoModel` interface** — a similar interface for video generation.
- **Built-in implementations:**
  - `JimengImageModel` — for the ByteDance Jimeng text-to-image API
  - `SeedanceVideoModel` — for the Seedance text-to-video API
  - (optionally) `OpenAIImageModel` for DALL-E as a reference
- **Tool auto-registration** — when an `ImageModel` or `VideoModel` is attached to a `ReActAgent`, the framework could optionally register built-in tools like `generate_image(prompt)` and `generate_video(prompt)` that the agent can call during reasoning.
- **`ContentBlock` extension** — optionally extend the `ContentBlock` system to natively represent generated images/videos in agent responses, making it easy to return and display generated media.
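To make the proposal concrete, here is a minimal sketch of what the two interfaces might look like. All names and signatures (`GeneratedMedia`, the `options` map, the `JimengImageModel` constructor) are illustrative assumptions, not part of the current AgentScope-Java API; the dummy implementation returns a placeholder URL instead of calling the real Jimeng endpoint.

```java
import java.util.List;
import java.util.Map;

public class ImageModelSketch {

    /** Hypothetical result type: remote URLs of the generated media. */
    record GeneratedMedia(List<String> urls) {}

    /** Proposed text-to-image interface, analogous to the existing TTSModel. */
    interface ImageModel {
        GeneratedMedia generate(String prompt, Map<String, Object> options);
    }

    /** Proposed text-to-video interface with the same shape. */
    interface VideoModel {
        GeneratedMedia generate(String prompt, Map<String, Object> options);
    }

    /** Dummy implementation showing how a JimengImageModel could be shaped. */
    static class JimengImageModel implements ImageModel {
        private final String apiKey;

        JimengImageModel(String apiKey) {
            this.apiKey = apiKey;
        }

        @Override
        public GeneratedMedia generate(String prompt, Map<String, Object> options) {
            // A real implementation would call the Jimeng REST endpoint here;
            // we return a fixed placeholder URL so the sketch is runnable.
            return new GeneratedMedia(List.of("https://example.com/generated.png"));
        }
    }

    public static void main(String[] args) {
        ImageModel model = new JimengImageModel("sk-...");
        GeneratedMedia media =
                model.generate("a red fox in watercolor", Map.of("size", "1024x1024"));
        System.out.println(media.urls().get(0));
    }
}
```

Keeping `generate(prompt, options)` identical across both interfaces would let the tool auto-registration layer treat image and video models uniformly.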
**Describe alternatives you've considered**
- **Custom `@Tool` approach** — I can wrap the Jimeng/Seedance REST APIs as `@Tool` methods myself. This works for immediate needs but lacks standardization and doesn't benefit other users.
- **Using MCP servers** — there may be third-party MCP servers that expose image/video generation, but this adds an external dependency and doesn't provide first-class framework integration.

Both alternatives require boilerplate for each project and don't leverage the framework's model abstraction layer (formatter, retry, tracing, etc.).
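The boilerplate of the custom-tool workaround can be sketched as follows. The endpoint URL, JSON body, and method name are placeholders (the real Jimeng API is not reproduced here), and the `@Tool` registration is shown only as a comment to keep the sketch self-contained and runnable; a real implementation would send the request with `HttpClient` and parse the JSON response.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class JimengToolSketch {

    // In AgentScope-Java this method would carry a tool annotation so a
    // ReActAgent can invoke it; each project has to rewrite this wrapper,
    // which is the duplication the feature request wants to eliminate.
    static String generateImage(String prompt, String apiKey) {
        HttpRequest request = HttpRequest.newBuilder()
                // Placeholder endpoint, not the real Jimeng API.
                .uri(URI.create("https://example.com/jimeng/v1/images"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"prompt\":\"" + prompt + "\"}"))
                .build();
        // A real wrapper would do HttpClient.newHttpClient().send(request, ...)
        // and return the image URL from the response; we return a summary of
        // the built request so the sketch runs without network access.
        return request.method() + " " + request.uri();
    }

    public static void main(String[] args) {
        System.out.println(generateImage("sunset over mountains", "sk-test"));
    }
}
```

Note what the wrapper does not get for free: retries, tracing, and formatter integration, all of which a first-class `ImageModel` built on the framework's model abstraction would inherit.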
**Additional context**
The framework already has a clean model abstraction (`Model` → `ChatModelBase` → concrete models) with formatter, retry, tracing, and streaming support. Extending this pattern to image/video generation would make AgentScope a more complete multi-modal agent framework and align with industry trends (OpenAI's GPT-4o image editing, Google's Veo, etc.).