Merge pull request #234 from clamsproject/register/0-smolvlm2-captioner.v0.3

keighrim · web-flow · commit 16e36b0fdfed · 2026-01-28T10:11:35.000-05:00
App Submitted - smolvlm2-captioner.v0.3
diff --git a/docs/_apps/smolvlm2-captioner/index.md b/docs/_apps/smolvlm2-captioner/index.md
@@ -5,5 +5,6 @@ title: smolvlm2-captioner
 date: 1970-01-01T00:00:00+00:00
 ---
 Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.
+- [v0.3](v0.3) ([`@kelleyl`](https://github.com/kelleyl))
 - [v0.2](v0.2) ([`@kelleyl`](https://github.com/kelleyl))
 - [v0.1](v0.1) ([`@kelleyl`](https://github.com/kelleyl))
diff --git a/docs/_apps/smolvlm2-captioner/v0.3/index.md b/docs/_apps/smolvlm2-captioner/v0.3/index.md
@@ -0,0 +1,142 @@
+---
+layout: posts
+classes: wide
+title: "SmolVLM2 Captioner (v0.3)"
+date: 2026-01-28T15:06:18+00:00
+---
+## About this version
+
+- Submitter: [kelleyl](https://github.com/kelleyl)
+- Submission Time: 2026-01-28T15:06:18+00:00
+- Prebuilt Container Image: [ghcr.io/clamsproject/app-smolvlm2-captioner:v0.3](https://github.com/clamsproject/app-smolvlm2-captioner/pkgs/container/app-smolvlm2-captioner/v0.3)
+- Release Notes
+
+    (no notes provided by the developer)
+
+## About this app (See raw [metadata.json](metadata.json))
+
+**Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.**
+
+- App ID: [http://apps.clams.ai/smolvlm2-captioner/v0.3](http://apps.clams.ai/smolvlm2-captioner/v0.3)
+- App License: Apache 2.0
+- Source Repository: [https://github.com/clamsproject/app-smolvlm2-captioner](https://github.com/clamsproject/app-smolvlm2-captioner) ([source tree of the submitted version](https://github.com/clamsproject/app-smolvlm2-captioner/tree/v0.3))
+
+
+#### Inputs
+(**Note**: "*" as a property value means that the property is required but can be any value.)
+
+- [http://mmif.clams.ai/vocabulary/VideoDocument/v1](http://mmif.clams.ai/vocabulary/VideoDocument/v1) (required)
+(of any properties)
+
+- [http://mmif.clams.ai/vocabulary/ImageDocument/v1](http://mmif.clams.ai/vocabulary/ImageDocument/v1) (required)
+(of any properties)
+
+- [http://mmif.clams.ai/vocabulary/TimeFrame/v6](http://mmif.clams.ai/vocabulary/TimeFrame/v6) (required)
+(of any properties)
+
+
+
+#### Configurable Parameters
+(**Note**: _Multivalued_ means the parameter can have one or more values.)
+
+- `frameInterval`: optional, defaults to `30`
+
+    - Type: integer
+    - Multivalued: False
+
+
+    > The interval at which to extract frames from the video if there are no timeframe annotations. Default is every 30 frames.
+- `defaultPrompt`: optional, defaults to `Describe what is shown in this video frame. Analyze the purpose of this frame in the context of a news video. Transcribe any text present.`
+
+    - Type: string
+    - Multivalued: False
+
+
+    > default prompt to use for timeframes not specified in the promptMap. If set to `-`, timeframes not specified in the promptMap will be skipped.
+- `promptMap`: optional, defaults to `[]`
+
+    - Type: map
+    - Multivalued: True
+
+
+    > mapping of labels of the input timeframe annotations to new prompts. Must be formatted as "IN_LABEL:PROMPT" (with a colon). To pass multiple mappings, use this parameter multiple times. By default, any timeframe labels not mapped to a prompt will be used with the defaultPrompt. In order to skip timeframes with a particular label, pass `-` as the prompt value. In order to skip all timeframes not specified in the promptMap, set the defaultPrompt parameter to `-`.
+- `defaultSystemPrompt`: optional, defaults to `""`
+
+    - Type: string
+    - Multivalued: False
+
+
+    > default system prompt to use for all timeframes. System prompts are passed to the model using the messages format with role="system", providing context or instructions that guide the model's behavior. The processor will format this properly using its chat template.
+- `systemPromptMap`: optional, defaults to `[]`
+
+    - Type: map
+    - Multivalued: True
+
+
+    > mapping of labels of the input timeframe annotations to system prompts. Must be formatted as "IN_LABEL:SYSTEM_PROMPT" (with a colon). To pass multiple mappings, use this parameter multiple times. System prompts are passed to the model using the messages format with role="system", providing context or instructions that guide the model's behavior.
+- `config`: optional, defaults to `config/default.yaml`
+
+    - Type: string
+    - Multivalued: False
+
+
+    > Name of the config file to use.
+- `num_beams`: optional, defaults to `1`
+
+    - Type: integer
+    - Multivalued: False
+
+
+    > Number of beams for beam search during text generation. Default is 1. Higher values may improve quality but increase generation time.
+- `batchSize`: optional, defaults to `12`
+
+    - Type: integer
+    - Multivalued: False
+
+
+    > Number of images to process in each batch. Default is 12. Higher values may improve throughput but require more memory.
+- `allRepresentatives`: optional, defaults to `false`
+
+    - Type: boolean
+    - Multivalued: False
+    - Choices: **_`false`_**, `true`
+
+
+    > Default setting for processing all representative TimePoints in each TimeFrame. When true, all representatives are processed instead of just the first one. This can be overridden per-label in the config file using the all_representatives mapping (e.g., all_representatives: {slate: true, chyron: false}). Default is false (only the first representative is processed).
+- `pretty`: optional, defaults to `false`
+
+    - Type: boolean
+    - Multivalued: False
+    - Choices: **_`false`_**, `true`
+
+
+    > The JSON body of the HTTP response will be re-formatted with 2-space indentation
+- `runningTime`: optional, defaults to `false`
+
+    - Type: boolean
+    - Multivalued: False
+    - Choices: **_`false`_**, `true`
+
+
+    > The running time of the app will be recorded in the view metadata
+- `hwFetch`: optional, defaults to `false`
+
+    - Type: boolean
+    - Multivalued: False
+    - Choices: **_`false`_**, `true`
+
+
+    > The hardware information (architecture, GPU and vRAM) will be recorded in the view metadata
+
+
+#### Outputs
+(**Note**: "*" as a property value means that the property is required but can be any value.)
+
+(**Note**: Not all output annotations are always generated.)
+
+- [http://mmif.clams.ai/vocabulary/Alignment/v1](http://mmif.clams.ai/vocabulary/Alignment/v1)
+(of any properties)
+
+- [http://mmif.clams.ai/vocabulary/TextDocument/v1](http://mmif.clams.ai/vocabulary/TextDocument/v1)
+(of any properties)
+
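The `promptMap` and `defaultPrompt` descriptions above imply a simple lookup: use the mapped prompt for a label when there is one, fall back to the default otherwise, and treat `-` as "skip". The following is a minimal sketch of that lookup, not the app's actual implementation. The function names `parse_prompt_map` and `resolve_prompt` are hypothetical, and splitting on the first colon only is an assumption that the description does not confirm.

```python
from typing import Dict, List, Optional

SKIP = "-"  # per the parameter descriptions, "-" means "skip this timeframe"


def parse_prompt_map(entries: List[str]) -> Dict[str, str]:
    """Turn ["IN_LABEL:PROMPT", ...] into {label: prompt} (hypothetical helper)."""
    mapping: Dict[str, str] = {}
    for entry in entries:
        # Assumption: split on the first colon only, so prompts may contain colons.
        label, sep, prompt = entry.partition(":")
        if not sep:
            raise ValueError(f"expected 'IN_LABEL:PROMPT', got {entry!r}")
        mapping[label.strip()] = prompt.strip()
    return mapping


def resolve_prompt(label: str, prompt_map: Dict[str, str],
                   default_prompt: str) -> Optional[str]:
    """Return the prompt to use for a timeframe label, or None to skip it."""
    prompt = prompt_map.get(label, default_prompt)
    return None if prompt == SKIP else prompt


prompt_map = parse_prompt_map(["slate:Transcribe the slate text.", "bars:-"])
print(resolve_prompt("slate", prompt_map, "Describe this frame."))   # mapped prompt
print(resolve_prompt("bars", prompt_map, "Describe this frame."))    # skipped label
print(resolve_prompt("chyron", prompt_map, "-"))                     # default is "-"
```

Returning `None` for skipped labels keeps the skip decision in one place, so a caller can filter timeframes before any frame is sent to the model.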
diff --git a/docs/_apps/smolvlm2-captioner/v0.3/metadata.json b/docs/_apps/smolvlm2-captioner/v0.3/metadata.json
@@ -0,0 +1,117 @@
+{
+  "name": "SmolVLM2 Captioner",
+  "description": "Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.",
+  "app_version": "v0.3",
+  "mmif_version": "1.1.0",
+  "app_license": "Apache 2.0",
+  "identifier": "http://apps.clams.ai/smolvlm2-captioner/v0.3",
+  "url": "https://github.com/clamsproject/app-smolvlm2-captioner",
+  "input": [
+    {
+      "@type": "http://mmif.clams.ai/vocabulary/VideoDocument/v1",
+      "required": true
+    },
+    {
+      "@type": "http://mmif.clams.ai/vocabulary/ImageDocument/v1",
+      "required": true
+    },
+    {
+      "@type": "http://mmif.clams.ai/vocabulary/TimeFrame/v6",
+      "required": true
+    }
+  ],
+  "output": [
+    {
+      "@type": "http://mmif.clams.ai/vocabulary/Alignment/v1"
+    },
+    {
+      "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1"
+    }
+  ],
+  "parameters": [
+    {
+      "name": "frameInterval",
+      "description": "The interval at which to extract frames from the video if there are no timeframe annotations. Default is every 30 frames.",
+      "type": "integer",
+      "default": 30,
+      "multivalued": false
+    },
+    {
+      "name": "defaultPrompt",
+      "description": "default prompt to use for timeframes not specified in the promptMap. If set to `-`, timeframes not specified in the promptMap will be skipped.",
+      "type": "string",
+      "default": "Describe what is shown in this video frame. Analyze the purpose of this frame in the context of a news video. Transcribe any text present.",
+      "multivalued": false
+    },
+    {
+      "name": "promptMap",
+      "description": "mapping of labels of the input timeframe annotations to new prompts. Must be formatted as \"IN_LABEL:PROMPT\" (with a colon). To pass multiple mappings, use this parameter multiple times. By default, any timeframe labels not mapped to a prompt will be used with the defaultPrompt. In order to skip timeframes with a particular label, pass `-` as the prompt value. In order to skip all timeframes not specified in the promptMap, set the defaultPrompt parameter to `-`.",
+      "type": "map",
+      "default": [],
+      "multivalued": true
+    },
+    {
+      "name": "defaultSystemPrompt",
+      "description": "default system prompt to use for all timeframes. System prompts are passed to the model using the messages format with role=\"system\", providing context or instructions that guide the model's behavior. The processor will format this properly using its chat template.",
+      "type": "string",
+      "default": "",
+      "multivalued": false
+    },
+    {
+      "name": "systemPromptMap",
+      "description": "mapping of labels of the input timeframe annotations to system prompts. Must be formatted as \"IN_LABEL:SYSTEM_PROMPT\" (with a colon). To pass multiple mappings, use this parameter multiple times. System prompts are passed to the model using the messages format with role=\"system\", providing context or instructions that guide the model's behavior.",
+      "type": "map",
+      "default": [],
+      "multivalued": true
+    },
+    {
+      "name": "config",
+      "description": "Name of the config file to use.",
+      "type": "string",
+      "default": "config/default.yaml",
+      "multivalued": false
+    },
+    {
+      "name": "num_beams",
+      "description": "Number of beams for beam search during text generation. Default is 1. Higher values may improve quality but increase generation time.",
+      "type": "integer",
+      "default": 1,
+      "multivalued": false
+    },
+    {
+      "name": "batchSize",
+      "description": "Number of images to process in each batch. Default is 12. Higher values may improve throughput but require more memory.",
+      "type": "integer",
+      "default": 12,
+      "multivalued": false
+    },
+    {
+      "name": "allRepresentatives",
+      "description": "Default setting for processing all representative TimePoints in each TimeFrame. When true, all representatives are processed instead of just the first one. This can be overridden per-label in the config file using the all_representatives mapping (e.g., all_representatives: {slate: true, chyron: false}). Default is false (only the first representative is processed).",
+      "type": "boolean",
+      "default": false,
+      "multivalued": false
+    },
+    {
+      "name": "pretty",
+      "description": "The JSON body of the HTTP response will be re-formatted with 2-space indentation",
+      "type": "boolean",
+      "default": false,
+      "multivalued": false
+    },
+    {
+      "name": "runningTime",
+      "description": "The running time of the app will be recorded in the view metadata",
+      "type": "boolean",
+      "default": false,
+      "multivalued": false
+    },
+    {
+      "name": "hwFetch",
+      "description": "The hardware information (architecture, GPU and vRAM) will be recorded in the view metadata",
+      "type": "boolean",
+      "default": false,
+      "multivalued": false
+    }
+  ]
+}
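Each entry in the `parameters` array above carries a `name`, `type`, `default`, and `multivalued` flag, which is enough to resolve raw string values (for example, from a parsed query string) into typed settings. The sketch below illustrates that idea on a trimmed, three-entry copy of the array. It is not the CLAMS SDK's own logic: the helpers `_cast` and `fill_defaults` are hypothetical, and the accepted boolean spellings and the "last value wins" rule for single-valued parameters are assumptions.

```python
import json
from typing import Any, Dict, List

# Trimmed copy of three entries from the "parameters" array shown above.
PARAMETERS = json.loads("""
[
  {"name": "frameInterval", "type": "integer", "default": 30, "multivalued": false},
  {"name": "promptMap", "type": "map", "default": [], "multivalued": true},
  {"name": "allRepresentatives", "type": "boolean", "default": false, "multivalued": false}
]
""")


def _cast(value: str, type_name: str) -> Any:
    """Convert one raw string according to the declared parameter type."""
    if type_name == "integer":
        return int(value)
    if type_name == "boolean":
        # Assumption: only these spellings count as true.
        return value.lower() in ("true", "1")
    return value  # "string" and "map" entries are kept as raw strings here


def fill_defaults(specs: List[Dict[str, Any]],
                  raw: Dict[str, List[str]]) -> Dict[str, Any]:
    """Resolve raw {name: [values]} input against the parameter specs."""
    resolved: Dict[str, Any] = {}
    for spec in specs:
        values = raw.get(spec["name"])
        if not values:
            resolved[spec["name"]] = spec["default"]
        elif spec["multivalued"]:
            resolved[spec["name"]] = [_cast(v, spec["type"]) for v in values]
        else:
            # Assumption: for single-valued parameters the last value wins.
            resolved[spec["name"]] = _cast(values[-1], spec["type"])
    return resolved


print(fill_defaults(PARAMETERS, {"frameInterval": ["15"],
                                 "promptMap": ["slate:Read it", "bars:-"]}))
```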
diff --git a/docs/_apps/smolvlm2-captioner/v0.3/submission.json b/docs/_apps/smolvlm2-captioner/v0.3/submission.json
@@ -0,0 +1,5 @@
+{
+  "time": "2026-01-28T15:06:18+00:00",
+  "submitter": "kelleyl",
+  "image": "ghcr.io/clamsproject/app-smolvlm2-captioner:v0.3"
+}
diff --git a/docs/_data/app-index.json b/docs/_data/app-index.json
@@ -1,8 +1,12 @@
 {
   "http://apps.clams.ai/smolvlm2-captioner": {
     "description": "Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.",
-    "latest_update": "2026-01-28T03:14:45+00:00",
+    "latest_update": "2026-01-28T15:06:18+00:00",
     "versions": [
+      [
+        "v0.3",
+        "kelleyl"
+      ],
       [
         "v0.2",
         "kelleyl"
diff --git a/docs/_data/apps.json b/docs/_data/apps.json

@@ -1,8 +1,12 @@
 {
   "http://apps.clams.ai/smolvlm2-captioner": {
     "description": "Applies SmolVLM2-2.2B-Instruct multimodal model to video frames for image captioning.",
-    "latest_update": "2026-01-28T03:14:45+00:00",
+    "latest_update": "2026-01-28T15:06:18+00:00",
     "versions": [
+      [
+        "v0.3",
+        "kelleyl"
+      ],
       [
         "v0.2",
         "kelleyl"