[Feature]: Will the Jimeng (即梦) and Seedance APIs be supported for generating images or videos? #1383

@zhangyiyifan

Description

AgentScope-Java is an open-source project. To involve a broader community, we recommend asking your questions in English.

Is your feature request related to a problem? Please describe.
Currently, AgentScope Java provides built-in support for text-to-speech (TTS) via DashScope and multi-modal content understanding (image/video/audio as input), but there is no built-in model or tool support for image generation and video generation. Many real-world AI agent scenarios require the ability to create visual content — for example, marketing agents generating promotional images, design assistants producing concept art, or social media agents creating short videos. The [Jimeng](https://jimeng.jianying.com/) (即梦) and [Seedance](https://www.doubao.com/seedance) APIs from ByteDance/Douyin offer high-quality text-to-image and text-to-video generation capabilities, but there is currently no native integration in the framework.

Describe the solution you'd like
I'd like to see native support for image and video generation models, similar to how the framework already supports multiple chat models. Specifically:

  1. ImageModel interface — a new model interface (similar to Model/TTSModel) with methods like generate(prompt, options) returning generated image data or URLs.

  2. VideoModel interface — a similar interface for video generation.

  3. Built-in implementations:

    • JimengImageModel — for ByteDance Jimeng text-to-image API
    • SeedanceVideoModel — for Seedance text-to-video API
    • (optionally) OpenAIImageModel for DALL-E as a reference
  4. Tool auto-registration — when an ImageModel or VideoModel is attached to a ReActAgent, the framework could optionally register built-in tools like generate_image(prompt) and generate_video(prompt) that the agent can call during reasoning.

  5. ContentBlock extension — optionally extend the ContentBlock system to natively represent generated images/videos in agent responses, making it easy to return and display generated media.
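To make the request concrete, here is a minimal Java sketch of what items 1 and 3 might look like. Only the names ImageModel and JimengImageModel come from the proposal above; every signature, field, and the placeholder return value are assumptions for illustration, not existing AgentScope Java API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical interface mirroring the Model/TTSModel pattern described above.
interface ImageModel {
    /** Generates images from a text prompt; returns image URLs or data URIs. */
    List<String> generate(String prompt, Map<String, Object> options);
}

// Stub showing the intended shape of a built-in JimengImageModel implementation.
class JimengImageModel implements ImageModel {
    private final String apiKey;

    JimengImageModel(String apiKey) {
        this.apiKey = apiKey;
    }

    @Override
    public List<String> generate(String prompt, Map<String, Object> options) {
        // A real implementation would call the Jimeng REST API here;
        // this stub only returns a placeholder URL for illustration.
        return List.of("https://example.com/generated/" + prompt.hashCode() + ".png");
    }
}
```

A VideoModel interface would follow the same shape, returning video URLs or data instead of images.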

Describe alternatives you've considered

  • Custom Tool approach — I can wrap the Jimeng/Seedance REST APIs as @Tool methods myself. This works for immediate needs but lacks standardization and doesn't benefit other users.
  • Using MCP servers — there may be third-party MCP servers that expose image/video generation, but this adds external dependency and doesn't provide first-class framework integration.

Both alternatives require boilerplate for each project and don't leverage the framework's model abstraction layer (formatter, retry, tracing, etc.).
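For comparison, the "custom Tool" alternative might be sketched as below. The nested @Tool annotation is a stand-in for whatever annotation the framework actually provides, and the endpoint URL and request body are hypothetical placeholders, not the real Jimeng API; the point is only to show the per-project boilerplate involved.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ImageTools {

    // Stand-in for the framework's tool annotation, for illustration only.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Tool { String description(); }

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Builds the JSON request body, escaping quotes in the prompt.
    static String requestBody(String prompt) {
        return "{\"prompt\": \"" + prompt.replace("\"", "\\\"") + "\"}";
    }

    @Tool(description = "Generate an image from a text prompt via a text-to-image API")
    public static String generateImage(String prompt) throws Exception {
        // Hypothetical endpoint; the real Jimeng API URL and auth differ.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/jimeng/images"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(prompt)))
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();  // caller parses the image URL out of the JSON
    }
}
```

Every project adopting this approach has to re-implement the HTTP plumbing, retries, and tracing that the framework's model layer would otherwise provide.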

Additional context
The framework already has a clean model abstraction (Model → ChatModelBase → concrete models) with formatter, retry, tracing, and streaming support. Extending this pattern to image/video generation would make AgentScope a more complete multi-modal agent framework and align with industry trends (OpenAI's GPT-4o image editing, Google's Veo, etc.).

Metadata

Assignees: no one assigned
Labels: enhancement (New feature or request)
Status: Backlog
Milestone: no milestone