| 1 | +--- |
| 2 | +layout: posts |
| 3 | +classes: wide |
| 4 | +title: "SmolVLM2 Captioner (v0.3)" |
| 5 | +date: 2026-01-28T15:06:18+00:00 |
| 6 | +--- |
| 7 | +## About this version |
| 8 | + |
| 9 | +- Submitter: [kelleyl](https://github.com/kelleyl) |
| 10 | +- Submission Time: 2026-01-28T15:06:18+00:00 |
| 11 | +- Prebuilt Container Image: [ghcr.io/clamsproject/app-smolvlm2-captioner:v0.3](https://github.com/clamsproject/app-smolvlm2-captioner/pkgs/container/app-smolvlm2-captioner/v0.3) |
| 12 | +- Release Notes |
| 13 | + |
| 14 | + (no notes provided by the developer) |
| 15 | + |
| 16 | +## About this app (See raw [metadata.json](metadata.json)) |
| 17 | + |
| 18 | +**Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.** |
| 19 | + |
| 20 | +- App ID: [http://apps.clams.ai/smolvlm2-captioner/v0.3](http://apps.clams.ai/smolvlm2-captioner/v0.3) |
| 21 | +- App License: Apache 2.0 |
| 22 | +- Source Repository: [https://github.com/clamsproject/app-smolvlm2-captioner](https://github.com/clamsproject/app-smolvlm2-captioner) ([source tree of the submitted version](https://github.com/clamsproject/app-smolvlm2-captioner/tree/v0.3)) |
| 23 | + |
| 24 | + |
| 25 | +#### Inputs |
| 26 | +(**Note**: "*" as a property value means that the property is required but can be any value.) |
| 27 | + |
| 28 | +- [http://mmif.clams.ai/vocabulary/VideoDocument/v1](http://mmif.clams.ai/vocabulary/VideoDocument/v1) (required) |
| 29 | +(of any properties) |
| 30 | + |
| 31 | +- [http://mmif.clams.ai/vocabulary/ImageDocument/v1](http://mmif.clams.ai/vocabulary/ImageDocument/v1) (required) |
| 32 | +(of any properties) |
| 33 | + |
| 34 | +- [http://mmif.clams.ai/vocabulary/TimeFrame/v6](http://mmif.clams.ai/vocabulary/TimeFrame/v6) (required) |
| 35 | +(of any properties) |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | +#### Configurable Parameters |
| 40 | +(**Note**: _Multivalued_ means the parameter can have one or more values.) |
| 41 | + |
| 42 | +- `frameInterval`: optional, defaults to `30` |
| 43 | + |
| 44 | + - Type: integer |
| 45 | + - Multivalued: False |
| 46 | + |
| 47 | + |
| 48 | + > The interval at which to extract frames from the video if there are no timeframe annotations. Default is every 30 frames. |
| 49 | +- `defaultPrompt`: optional, defaults to `Describe what is shown in this video frame. Analyze the purpose of this frame in the context of a news video. Transcribe any text present.` |
| 50 | + |
| 51 | + - Type: string |
| 52 | + - Multivalued: False |
| 53 | + |
| 54 | + |
| 55 | + > Default prompt to use for timeframes whose label is not specified in the `promptMap`. If set to `-`, timeframes not specified in the `promptMap` will be skipped. |
| 56 | +- `promptMap`: optional, defaults to `[]` |
| 57 | + |
| 58 | + - Type: map |
| 59 | + - Multivalued: True |
| 60 | + |
| 61 | + |
| 62 | + > Mapping of labels of the input timeframe annotations to new prompts. Must be formatted as "IN_LABEL:PROMPT" (with a colon). To pass multiple mappings, use this parameter multiple times. By default, any timeframe label not mapped to a prompt is processed with the `defaultPrompt`. To skip timeframes with a particular label, pass `-` as the prompt value. To skip all timeframes not specified in the `promptMap`, set the `defaultPrompt` parameter to `-`. |
| 63 | +- `defaultSystemPrompt`: optional, defaults to `""` |
| 64 | + |
| 65 | + - Type: string |
| 66 | + - Multivalued: False |
| 67 | + |
| 68 | + |
| 69 | + > Default system prompt to use for all timeframes. System prompts are passed to the model using the messages format with role="system", providing context or instructions that guide the model's behavior. The processor will format this properly using its chat template. |
| 70 | +- `systemPromptMap`: optional, defaults to `[]` |
| 71 | + |
| 72 | + - Type: map |
| 73 | + - Multivalued: True |
| 74 | + |
| 75 | + |
| 76 | + > Mapping of labels of the input timeframe annotations to system prompts. Must be formatted as "IN_LABEL:SYSTEM_PROMPT" (with a colon). To pass multiple mappings, use this parameter multiple times. System prompts are passed to the model using the messages format with role="system", providing context or instructions that guide the model's behavior. |
| 77 | +- `config`: optional, defaults to `config/default.yaml` |
| 78 | + |
| 79 | + - Type: string |
| 80 | + - Multivalued: False |
| 81 | + |
| 82 | + |
| 83 | + > Name of the config file to use. |
| 84 | +- `num_beams`: optional, defaults to `1` |
| 85 | + |
| 86 | + - Type: integer |
| 87 | + - Multivalued: False |
| 88 | + |
| 89 | + |
| 90 | + > Number of beams for beam search during text generation. Default is 1. Higher values may improve quality but increase generation time. |
| 91 | +- `batchSize`: optional, defaults to `12` |
| 92 | + |
| 93 | + - Type: integer |
| 94 | + - Multivalued: False |
| 95 | + |
| 96 | + |
| 97 | + > Number of images to process in each batch. Default is 12. Higher values may improve throughput but require more memory. |
| 98 | +- `allRepresentatives`: optional, defaults to `false` |
| 99 | + |
| 100 | + - Type: boolean |
| 101 | + - Multivalued: False |
| 102 | + - Choices: **_`false`_**, `true` |
| 103 | + |
| 104 | + |
| 105 | + > Default setting for processing all representative TimePoints in each TimeFrame. When true, all representatives are processed instead of just the first one. This can be overridden per-label in the config file using the all_representatives mapping (e.g., all_representatives: {slate: true, chyron: false}). Default is false (only the first representative is processed). |
| 106 | +- `pretty`: optional, defaults to `false` |
| 107 | + |
| 108 | + - Type: boolean |
| 109 | + - Multivalued: False |
| 110 | + - Choices: **_`false`_**, `true` |
| 111 | + |
| 112 | + |
| 113 | + > The JSON body of the HTTP response will be re-formatted with 2-space indentation |
| 114 | +- `runningTime`: optional, defaults to `false` |
| 115 | + |
| 116 | + - Type: boolean |
| 117 | + - Multivalued: False |
| 118 | + - Choices: **_`false`_**, `true` |
| 119 | + |
| 120 | + |
| 121 | + > The running time of the app will be recorded in the view metadata |
| 122 | +- `hwFetch`: optional, defaults to `false` |
| 123 | + |
| 124 | + - Type: boolean |
| 125 | + - Multivalued: False |
| 126 | + - Choices: **_`false`_**, `true` |
| 127 | + |
| 128 | + |
| 129 | + > The hardware information (architecture, GPU and vRAM) will be recorded in the view metadata |
| 130 | + |
| 131 | + |
| 132 | +#### Outputs |
| 133 | +(**Note**: "*" as a property value means that the property is required but can be any value.) |
| 134 | + |
| 135 | +(**Note**: Not all output annotations are always generated.) |
| 136 | + |
| 137 | +- [http://mmif.clams.ai/vocabulary/Alignment/v1](http://mmif.clams.ai/vocabulary/Alignment/v1) |
| 138 | +(of any properties) |
| 139 | + |
| 140 | +- [http://mmif.clams.ai/vocabulary/TextDocument/v1](http://mmif.clams.ai/vocabulary/TextDocument/v1) |
| 141 | +(of any properties) |
| 142 | + |
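
The interaction between `promptMap` and `defaultPrompt` described above can be sketched in a few lines of Python. This is only an illustration of the documented rules, not the app's actual code: the helper names `parse_prompt_map` and `resolve_prompt` are hypothetical, and splitting each entry on the first colon only (so that prompts may themselves contain colons) is an assumption.

```python
from typing import Dict, List, Optional

SKIP = "-"  # sentinel value the parameter descriptions use to mean "skip"


def parse_prompt_map(entries: List[str]) -> Dict[str, str]:
    """Turn repeated "IN_LABEL:PROMPT" values into a label -> prompt dict.

    Assumption: only the first colon separates label from prompt.
    """
    mapping: Dict[str, str] = {}
    for entry in entries:
        label, sep, prompt = entry.partition(":")
        if not sep:
            raise ValueError(f"expected IN_LABEL:PROMPT, got {entry!r}")
        mapping[label.strip()] = prompt.strip()
    return mapping


def resolve_prompt(label: str, prompt_map: Dict[str, str],
                   default_prompt: str) -> Optional[str]:
    """Return the prompt for a timeframe label, or None if it is skipped."""
    # Labels missing from the map fall back to the default prompt.
    prompt = prompt_map.get(label, default_prompt)
    # "-" skips the timeframe, whether it came from the map or the default.
    return None if prompt == SKIP else prompt


prompt_map = parse_prompt_map([
    "slate:Transcribe the slate: title and date.",
    "credits:-",
])
default = "Describe what is shown in this video frame."

mapped = resolve_prompt("slate", prompt_map, default)
skipped = resolve_prompt("credits", prompt_map, default)
fallback = resolve_prompt("chyron", prompt_map, default)
skip_unmapped = resolve_prompt("chyron", prompt_map, SKIP)
```

Here `mapped` keeps the colon inside the prompt text, `skipped` is `None` because the label maps to `-`, `fallback` uses the default prompt, and `skip_unmapped` is `None` because the default itself is `-`.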