Skip to content

Commit 16e36b0

Browse files
authored
Merge pull request #234 from clamsproject/register/0-smolvlm2-captioner.v0.3
App Submitted - smolvlm2-captioner.v0.3
2 parents 3a43128 + d939676 commit 16e36b0

6 files changed

Lines changed: 271 additions & 2 deletions

File tree

docs/_apps/smolvlm2-captioner/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,5 +5,6 @@ title: smolvlm2-captioner
55
date: 1970-01-01T00:00:00+00:00
66
---
77
Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.
8+
- [v0.3](v0.3) ([`@kelleyl`](https://github.com/kelleyl))
89
- [v0.2](v0.2) ([`@kelleyl`](https://github.com/kelleyl))
910
- [v0.1](v0.1) ([`@kelleyl`](https://github.com/kelleyl))
Lines changed: 142 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,142 @@
1+
---
2+
layout: posts
3+
classes: wide
4+
title: "SmolVLM2 Captioner (v0.3)"
5+
date: 2026-01-28T15:06:18+00:00
6+
---
7+
## About this version
8+
9+
- Submitter: [kelleyl](https://github.com/kelleyl)
10+
- Submission Time: 2026-01-28T15:06:18+00:00
11+
- Prebuilt Container Image: [ghcr.io/clamsproject/app-smolvlm2-captioner:v0.3](https://github.com/clamsproject/app-smolvlm2-captioner/pkgs/container/app-smolvlm2-captioner/v0.3)
12+
- Release Notes
13+
14+
(no notes provided by the developer)
15+
16+
## About this app (See raw [metadata.json](metadata.json))
17+
18+
**Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.**
19+
20+
- App ID: [http://apps.clams.ai/smolvlm2-captioner/v0.3](http://apps.clams.ai/smolvlm2-captioner/v0.3)
21+
- App License: Apache 2.0
22+
- Source Repository: [https://github.com/clamsproject/app-smolvlm2-captioner](https://github.com/clamsproject/app-smolvlm2-captioner) ([source tree of the submitted version](https://github.com/clamsproject/app-smolvlm2-captioner/tree/v0.3))
23+
24+
25+
#### Inputs
26+
(**Note**: "*" as a property value means that the property is required but can be any value.)
27+
28+
- [http://mmif.clams.ai/vocabulary/VideoDocument/v1](http://mmif.clams.ai/vocabulary/VideoDocument/v1) (required)
29+
(of any properties)
30+
31+
- [http://mmif.clams.ai/vocabulary/ImageDocument/v1](http://mmif.clams.ai/vocabulary/ImageDocument/v1) (required)
32+
(of any properties)
33+
34+
- [http://mmif.clams.ai/vocabulary/TimeFrame/v6](http://mmif.clams.ai/vocabulary/TimeFrame/v6) (required)
35+
(of any properties)
36+
37+
38+
39+
#### Configurable Parameters
40+
(**Note**: _Multivalued_ means the parameter can have one or more values.)
41+
42+
- `frameInterval`: optional, defaults to `30`
43+
44+
- Type: integer
45+
- Multivalued: False
46+
47+
48+
> The interval at which to extract frames from the video if there are no timeframe annotations. Default is every 30 frames.
49+
- `defaultPrompt`: optional, defaults to `Describe what is shown in this video frame. Analyze the purpose of this frame in the context of a news video. Transcribe any text present.`
50+
51+
- Type: string
52+
- Multivalued: False
53+
54+
55+
> default prompt to use for timeframes not specified in the promptMap. If set to `-`, timeframes not specified in the promptMap will be skipped.
56+
- `promptMap`: optional, defaults to `[]`
57+
58+
- Type: map
59+
- Multivalued: True
60+
61+
62+
> mapping of labels of the input timeframe annotations to new prompts. Must be formatted as "IN_LABEL:PROMPT" (with a colon). To pass multiple mappings, use this parameter multiple times. By default, any timeframe labels not mapped to a prompt will be used with the defaultprompt. In order to skip timeframes with a particular label, pass `-` as the prompt value.in order to skip all timeframes not specified in the promptMap, set the defaultPromptparameter to `-`
63+
- `defaultSystemPrompt`: optional, defaults to `""`
64+
65+
- Type: string
66+
- Multivalued: False
67+
68+
69+
> default system prompt to use for all timeframes. System prompts are passed to the model using the messages format with role="system", providing context or instructions that guide the model's behavior. The processor will format this properly using its chat template.
70+
- `systemPromptMap`: optional, defaults to `[]`
71+
72+
- Type: map
73+
- Multivalued: True
74+
75+
76+
> mapping of labels of the input timeframe annotations to system prompts. Must be formatted as "IN_LABEL:SYSTEM_PROMPT" (with a colon). To pass multiple mappings, use this parameter multiple times. System prompts are passed to the model using the messages format with role="system", providing context or instructions that guide the model's behavior.
77+
- `config`: optional, defaults to `config/default.yaml`
78+
79+
- Type: string
80+
- Multivalued: False
81+
82+
83+
> Name of the config file to use.
84+
- `num_beams`: optional, defaults to `1`
85+
86+
- Type: integer
87+
- Multivalued: False
88+
89+
90+
> Number of beams for beam search during text generation. Default is 1. Higher values may improve quality but increase generation time.
91+
- `batchSize`: optional, defaults to `12`
92+
93+
- Type: integer
94+
- Multivalued: False
95+
96+
97+
> Number of images to process in each batch. Default is 12. Higher values may improve throughput but require more memory.
98+
- `allRepresentatives`: optional, defaults to `false`
99+
100+
- Type: boolean
101+
- Multivalued: False
102+
- Choices: **_`false`_**, `true`
103+
104+
105+
> Default setting for processing all representative TimePoints in each TimeFrame. When true, all representatives are processed instead of just the first one. This can be overridden per-label in the config file using the all_representatives mapping (e.g., all_representatives: {slate: true, chyron: false}). Default is false (only the first representative is processed).
106+
- `pretty`: optional, defaults to `false`
107+
108+
- Type: boolean
109+
- Multivalued: False
110+
- Choices: **_`false`_**, `true`
111+
112+
113+
> The JSON body of the HTTP response will be re-formatted with 2-space indentation
114+
- `runningTime`: optional, defaults to `false`
115+
116+
- Type: boolean
117+
- Multivalued: False
118+
- Choices: **_`false`_**, `true`
119+
120+
121+
> The running time of the app will be recorded in the view metadata
122+
- `hwFetch`: optional, defaults to `false`
123+
124+
- Type: boolean
125+
- Multivalued: False
126+
- Choices: **_`false`_**, `true`
127+
128+
129+
> The hardware information (architecture, GPU and vRAM) will be recorded in the view metadata
130+
131+
132+
#### Outputs
133+
(**Note**: "*" as a property value means that the property is required but can be any value.)
134+
135+
(**Note**: Not all output annotations are always generated.)
136+
137+
- [http://mmif.clams.ai/vocabulary/Alignment/v1](http://mmif.clams.ai/vocabulary/Alignment/v1)
138+
(of any properties)
139+
140+
- [http://mmif.clams.ai/vocabulary/TextDocument/v1](http://mmif.clams.ai/vocabulary/TextDocument/v1)
141+
(of any properties)
142+
Lines changed: 117 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,117 @@
1+
{
2+
"name": "SmolVLM2 Captioner",
3+
"description": "Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.",
4+
"app_version": "v0.3",
5+
"mmif_version": "1.1.0",
6+
"app_license": "Apache 2.0",
7+
"identifier": "http://apps.clams.ai/smolvlm2-captioner/v0.3",
8+
"url": "https://github.com/clamsproject/app-smolvlm2-captioner",
9+
"input": [
10+
{
11+
"@type": "http://mmif.clams.ai/vocabulary/VideoDocument/v1",
12+
"required": true
13+
},
14+
{
15+
"@type": "http://mmif.clams.ai/vocabulary/ImageDocument/v1",
16+
"required": true
17+
},
18+
{
19+
"@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v6",
20+
"required": true
21+
}
22+
],
23+
"output": [
24+
{
25+
"@type": "http://mmif.clams.ai/vocabulary/Alignment/v1"
26+
},
27+
{
28+
"@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1"
29+
}
30+
],
31+
"parameters": [
32+
{
33+
"name": "frameInterval",
34+
"description": "The interval at which to extract frames from the video if there are no timeframe annotations. Default is every 30 frames.",
35+
"type": "integer",
36+
"default": 30,
37+
"multivalued": false
38+
},
39+
{
40+
"name": "defaultPrompt",
41+
"description": "default prompt to use for timeframes not specified in the promptMap. If set to `-`, timeframes not specified in the promptMap will be skipped.",
42+
"type": "string",
43+
"default": "Describe what is shown in this video frame. Analyze the purpose of this frame in the context of a news video. Transcribe any text present.",
44+
"multivalued": false
45+
},
46+
{
47+
"name": "promptMap",
48+
"description": "mapping of labels of the input timeframe annotations to new prompts. Must be formatted as \"IN_LABEL:PROMPT\" (with a colon). To pass multiple mappings, use this parameter multiple times. By default, any timeframe labels not mapped to a prompt will be used with the defaultprompt. In order to skip timeframes with a particular label, pass `-` as the prompt value.in order to skip all timeframes not specified in the promptMap, set the defaultPromptparameter to `-`",
49+
"type": "map",
50+
"default": [],
51+
"multivalued": true
52+
},
53+
{
54+
"name": "defaultSystemPrompt",
55+
"description": "default system prompt to use for all timeframes. System prompts are passed to the model using the messages format with role=\"system\", providing context or instructions that guide the model's behavior. The processor will format this properly using its chat template.",
56+
"type": "string",
57+
"default": "",
58+
"multivalued": false
59+
},
60+
{
61+
"name": "systemPromptMap",
62+
"description": "mapping of labels of the input timeframe annotations to system prompts. Must be formatted as \"IN_LABEL:SYSTEM_PROMPT\" (with a colon). To pass multiple mappings, use this parameter multiple times. System prompts are passed to the model using the messages format with role=\"system\", providing context or instructions that guide the model's behavior.",
63+
"type": "map",
64+
"default": [],
65+
"multivalued": true
66+
},
67+
{
68+
"name": "config",
69+
"description": "Name of the config file to use.",
70+
"type": "string",
71+
"default": "config/default.yaml",
72+
"multivalued": false
73+
},
74+
{
75+
"name": "num_beams",
76+
"description": "Number of beams for beam search during text generation. Default is 1. Higher values may improve quality but increase generation time.",
77+
"type": "integer",
78+
"default": 1,
79+
"multivalued": false
80+
},
81+
{
82+
"name": "batchSize",
83+
"description": "Number of images to process in each batch. Default is 12. Higher values may improve throughput but require more memory.",
84+
"type": "integer",
85+
"default": 12,
86+
"multivalued": false
87+
},
88+
{
89+
"name": "allRepresentatives",
90+
"description": "Default setting for processing all representative TimePoints in each TimeFrame. When true, all representatives are processed instead of just the first one. This can be overridden per-label in the config file using the all_representatives mapping (e.g., all_representatives: {slate: true, chyron: false}). Default is false (only the first representative is processed).",
91+
"type": "boolean",
92+
"default": false,
93+
"multivalued": false
94+
},
95+
{
96+
"name": "pretty",
97+
"description": "The JSON body of the HTTP response will be re-formatted with 2-space indentation",
98+
"type": "boolean",
99+
"default": false,
100+
"multivalued": false
101+
},
102+
{
103+
"name": "runningTime",
104+
"description": "The running time of the app will be recorded in the view metadata",
105+
"type": "boolean",
106+
"default": false,
107+
"multivalued": false
108+
},
109+
{
110+
"name": "hwFetch",
111+
"description": "The hardware information (architecture, GPU and vRAM) will be recorded in the view metadata",
112+
"type": "boolean",
113+
"default": false,
114+
"multivalued": false
115+
}
116+
]
117+
}
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
{
2+
"time": "2026-01-28T15:06:18+00:00",
3+
"submitter": "kelleyl",
4+
"image": "ghcr.io/clamsproject/app-smolvlm2-captioner:v0.3"
5+
}

docs/_data/app-index.json

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,12 @@
11
{
22
"http://apps.clams.ai/smolvlm2-captioner": {
33
"description": "Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.",
4-
"latest_update": "2026-01-28T03:14:45+00:00",
4+
"latest_update": "2026-01-28T15:06:18+00:00",
55
"versions": [
6+
[
7+
"v0.3",
8+
"kelleyl"
9+
],
610
[
711
"v0.2",
812
"kelleyl"

docs/_data/apps.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

0 commit comments

Comments
 (0)