Multimodal support for Gemini #80

@samrat

Currently, it's only possible to send text messages using the Gemini adapter:

{system_instructions, [%{role: "user", parts: [%{text: content}]} | history]}

The Gemini API supports image, video, and audio inputs. Unlike the OpenAI API, where you send the file contents base64-encoded, you need to upload the file separately.

Would you be open to a PR that adds support for uploading files, or would you say that is out of scope for this project?

If it's out of scope, I can create a smaller PR that allows media URLs (with the upload happening outside the library):

Instructor.chat_completion(
  mode: :json_schema,
  model: "gemini-1.5-flash",
  response_model: VideoDesc,
  messages: [
    %{
      role: "user",
      content: [
        %{
          type: "video_url",
          video_url: %{
            url: "https://generativelanguage.googleapis.com/v1beta/files/..."
          }
        },
        %{
          type: "text",
          text: "What's going on in this video?"
        }
      ]
    }
  ]
)
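
For illustration, here is a rough sketch of how the adapter might translate that message shape into Gemini `parts`. The module name `GeminiPartsSketch` and the function `to_gemini_parts/1` are hypothetical and not part of the library. The optional `mime_type` key and its `"video/mp4"` default are also assumptions, since the proposed message shape carries no MIME type. The `file_data` part with `mime_type` and `file_uri` follows the shape the Gemini REST API uses for previously uploaded files.

```elixir
defmodule GeminiPartsSketch do
  @moduledoc false

  # Hypothetical helper: converts OpenAI-style message content into a list
  # of Gemini "parts". Plain strings keep today's text-only behaviour.
  def to_gemini_parts(content) when is_binary(content), do: [%{text: content}]

  def to_gemini_parts(content) when is_list(content) do
    Enum.map(content, &to_gemini_part/1)
  end

  # Text parts map directly onto Gemini's text part.
  defp to_gemini_part(%{type: "text", text: text}), do: %{text: text}

  # A video URL is assumed to point at a file that was already uploaded
  # outside the library, so it is passed through as a file reference.
  defp to_gemini_part(%{type: "video_url", video_url: %{url: url} = video}) do
    # Assumption: callers may pass :mime_type; otherwise default to video/mp4.
    mime_type = Map.get(video, :mime_type, "video/mp4")
    %{file_data: %{mime_type: mime_type, file_uri: url}}
  end
end
```

With something like this in place, the existing text-only line could become `%{role: "user", parts: GeminiPartsSketch.to_gemini_parts(content)}`, and image or audio parts could be added as further clauses.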
