6 changes: 6 additions & 0 deletions actions/snowflake-document-ai/CHANGELOG.md
@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/)
and this project adheres to [Semantic Versioning](https://semver.org/).

## [0.0.2] - 2025-10-11

### Changed
- Limited support only to PDF, PPTX, and DOCX files
- Reduced logging

## [0.0.1] - 2025-10-01

### Changed
12 changes: 6 additions & 6 deletions actions/snowflake-document-ai/README.md
@@ -1,13 +1,13 @@
# Snowflake Document AI

This action package enables you to upload PDF documents to Snowflake and extract their content using [Snowflake Document AI](https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview). It uses one-shot parsing with no training required, making it easy to extract structured data from PDFs and image files.
This action package enables you to upload PDF, DOCX, or PPTX documents to Snowflake and extract their content using [Snowflake Document AI](https://docs.snowflake.com/en/user-guide/snowflake-cortex/document-ai/overview). It uses one-shot parsing with no training required, making it easy to extract structured data from digital documents.

## What it does

This action package provides two main capabilities:

1. **List Stage Files**: List all files currently stored in a Snowflake stage
2. **Parse Document**: Upload PDF or image file files and extract their content using AI-powered parsing
2. **Parse Document**: Upload PDF, DOCX, or PPTX files and extract their content using AI-powered parsing

The parsing uses Snowflake's `AI_PARSE_DOCUMENT` function which can extract text, tables, and other structured content from PDFs in layout-aware mode.

@@ -16,7 +16,7 @@ The parsing uses Snowflake's `AI_PARSE_DOCUMENT` function which can extract text
Before using this action, you need:

1. **Snowflake Account**: An active Snowflake account with Document AI enabled
2. **Database, Schema, and Stage**: A Snowflake stage configured to store your PDF files
2. **Database, Schema, and Stage**: A Snowflake stage configured to store your document files
3. **Authentication**: One of the following authentication methods:
- Username and password
- Key pair authentication
@@ -39,7 +39,7 @@ Lists the most recently modified files in a specific Snowflake stage.

### 2. Parse Document

Uploads a PDF file from chat and parses its content using Snowflake Document AI.
Uploads a PDF, DOCX, or PPTX file from chat and parses its content using Snowflake Document AI.

**Parameters:**
- `filename` (string): The name of the file to upload from chat
@@ -91,9 +91,9 @@ Uploads a PDF file from chat and parses its content using Snowflake Document AI.

Details up to date [here](https://docs.snowflake.com/en/user-guide/snowflake-cortex/parse-document#input-requirements).

- File types: PDF, PPTX, DOCX, JPEG, JPG, PNG, TIFF, TIF, HTML, TXT
- File types: PDF, PPTX, DOCX (images/HTML/TXT are not supported by this action)
- Maximum file size: 100 MB
- Maximum number of pages: 500
- Stage encryption: Server-side encryption
- Requires Snowflake Document AI to be enabled and set upin your account
- Requires Snowflake Document AI to be enabled and set up in your account
- Processing time depends on document complexity and size - timeouts are possible with larger documents
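
The limits listed above could be checked locally before anything is uploaded. The following is a minimal sketch, not part of this PR: `check_upload_limits`, `MAX_FILE_SIZE_BYTES`, and `SUPPORTED_EXTENSIONS` are hypothetical names, and reading "100 MB" as 100 * 1024 * 1024 bytes is an assumption. The page-count limit is not checked because that would need a format-specific parser.

```python
import os

# Assumed limits, taken from the README list; interpreting "100 MB" as
# 100 * 1024 * 1024 bytes is an assumption, not something the source states.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def check_upload_limits(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []

    # Compare the extension case-insensitively, as the action does.
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type '{ext}'")

    # Reject files above the assumed size limit before any network traffic happens.
    size = os.path.getsize(path)
    if size > MAX_FILE_SIZE_BYTES:
        problems.append(
            f"file is {size} bytes, above the {MAX_FILE_SIZE_BYTES}-byte limit"
        )

    return problems
```

Returning a list rather than raising lets a caller report every problem in a single error message.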
52 changes: 21 additions & 31 deletions actions/snowflake-document-ai/actions.py
@@ -70,7 +70,7 @@ def parse_document(
stage_path: str = ""
) -> Response[dict]:
"""
Uploads a file (PDF, PPTX, DOCX, JPEG, JPG, PNG, TIFF, TIF, HTML, TXT) from the chat to a specified Snowflake stage and parses it's content using Snowflake Document AI.
Uploads a file (PDF, PPTX, DOCX) from the chat to a specified Snowflake stage and parses its content using Snowflake Document AI.

Args:
filename: The name of the file to upload from chat
@@ -83,45 +83,46 @@ def parse_document(
Details of the uploaded file, and the json string of the document ai processing results.
"""
try:
print(f"Starting process_document for file: {filename}")
print("Starting document processing")

# Get the file from chat
print("Getting file from chat...")
chat_file = chat.get_file(filename)
temp_file_path = str(chat_file)
print(f"Got file: {temp_file_path}")

# Extract just the original filename without path
original_filename = os.path.basename(filename)
print(f"Original filename: {original_filename}")

# Create a filename with timestamp to ensure uniqueness
file_base, file_ext = os.path.splitext(original_filename)
file_ext_lower = file_ext.lower()

# Validate supported file types (Snowflake AI_PARSE_DOCUMENT supports .pdf, .docx, .pptx)
supported_extensions = {'.pdf', '.docx', '.pptx'}
if file_ext_lower not in supported_extensions:
unsupported_msg = (
f"Unsupported file type '{file_ext}'. Only PDF, DOCX, and PPTX are supported "
f"by AI_PARSE_DOCUMENT for this action. See docs: https://docs.snowflake.com/en/user-guide/snowflake-cortex/parse-document"
)
return Response(error=unsupported_msg)
Comment on lines +99 to +106
Contributor

This is not quite bulletproof.
Typically these kinds of checks should be done on the MIME type.
But I guess it's good enough for now.
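
One way to act on this comment without extra dependencies is to inspect the file content rather than its name. The sketch below is an illustration only, not code from this PR: `sniff_document_type` and `_OOXML_MARKERS` are hypothetical names. It assumes a PDF starts with the bytes `%PDF-`, and that DOCX and PPTX files are ZIP containers holding the conventional parts `word/document.xml` and `ppt/presentation.xml`. Files that deviate from those conventions would be reported as unknown.

```python
import zipfile

# Conventional part names inside an Office Open XML ZIP container (assumed;
# unusual producers may name the main part differently).
_OOXML_MARKERS = {
    "word/document.xml": ".docx",
    "ppt/presentation.xml": ".pptx",
}


def sniff_document_type(path: str):
    """Guess the type from file content; return '.pdf', '.docx', '.pptx', or None."""
    # A PDF is expected to begin with the "%PDF-" signature.
    with open(path, "rb") as fh:
        head = fh.read(5)
    if head == b"%PDF-":
        return ".pdf"

    # DOCX and PPTX are ZIP archives; look for their main document part.
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            names = set(archive.namelist())
        for marker, ext in _OOXML_MARKERS.items():
            if marker in names:
                return ext

    return None
```

A caller could compare the sniffed type with `file_ext_lower` and reject the upload when the two disagree, for example when a renamed `.txt` file carries a `.pdf` extension.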

timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
unique_filename = f"{file_base}_{timestamp}{file_ext}"
print(f"Using unique filename: {unique_filename}")

# Create a temporary directory to hold our renamed file
print("Creating temporary directory...")
with tempfile.TemporaryDirectory() as temp_dir:
# Create a path for our correctly named file
correct_name_path = os.path.join(temp_dir, unique_filename)
print(f"Correct name path: {correct_name_path}")

# Copy the temporary file to our new path with the correct name
print("Copying file...")
shutil.copy2(temp_file_path, correct_name_path)

print("Establishing Snowflake connection...")
print("Connecting to Snowflake...")
with get_snowflake_connection() as conn:
cursor = conn.cursor()
print("Connected to Snowflake")

# Access Secret values
db_name = database_name.value.upper()
schema = schema_name.value.upper()
stage = stage_name.value.upper()
print(f"Using stage: {db_name}.{schema}.{stage}")

# Construct fully qualified stage name
fully_qualified_stage = f'@"{db_name}"."{schema}"."{stage}"'
@@ -133,27 +134,24 @@ def parse_document(
clean_path = stage_path.strip('/')
if clean_path:
stage_location = f'{fully_qualified_stage}/{clean_path}'
print(f"Stage location: {stage_location}")

# Record the upload time to use for polling
upload_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
print(f"Upload time: {upload_time}")

# Execute PUT command with the correctly named file
put_command = f"PUT 'file://{correct_name_path}' '{stage_location}' OVERWRITE=TRUE AUTO_COMPRESS=FALSE SOURCE_COMPRESSION=NONE"
print(f"[{datetime.datetime.now().strftime('%H:%M:%S.%f')}] Executing PUT command to upload file...")
print("Uploading file to stage...")
Contributor

Reason for removing timestamp?
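
If timestamps are still wanted with the shorter messages, one option is to let Python's standard `logging` module add them instead of formatting them in every `print` call. This is a sketch under assumptions, not what the PR does: `build_logger` and the logger name `snowflake_document_ai` are hypothetical, and the source does not say whether the action runtime captures `logging` output the same way it captures `print`.

```python
import logging


def build_logger(name: str = "snowflake_document_ai", stream=None) -> logging.Logger:
    """Create a logger whose records are prefixed with an HH:MM:SS.mmm timestamp."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger

    # stream=None makes StreamHandler write to sys.stderr.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s.%(msecs)03d] %(levelname)s %(message)s",
            datefmt="%H:%M:%S",
        )
    )
    logger.addHandler(handler)
    return logger
```

With this in place, a call such as `build_logger().info("Uploading file to stage...")` keeps the message short while the formatter supplies the time.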

try:
cursor.execute(put_command)
print(f"[{datetime.datetime.now().strftime('%H:%M:%S.%f')}] PUT command executed successfully")
print("Upload completed")
except Exception as put_error:
print(f"[{datetime.datetime.now().strftime('%H:%M:%S.%f')}] Error executing PUT command: {str(put_error)}")
print("Upload failed")
raise

# Get results of the upload
result_rows = cursor.fetchall()
status = cursor.sfqid
print(f"[{datetime.datetime.now().strftime('%H:%M:%S.%f')}] Upload completed with status ID: {status}")
print(f"[{datetime.datetime.now().strftime('%H:%M:%S.%f')}] File upload status: {result_rows}")
print("Upload acknowledged by Snowflake")

# Create file paths for the response
if stage_path:
@@ -164,7 +162,6 @@ def parse_document(

# Create the fully qualified path including the file (like @DB.SCHEMA.STAGE/path/file.ext)
fully_qualified_path = f"@{db_name}.{schema}.{stage}/{stage_file_path}"
print(f"[{datetime.datetime.now().strftime('%H:%M:%S.%f')}] File uploaded to: {fully_qualified_path}")

# Prepare the upload result
upload_result = {
@@ -177,7 +174,6 @@ def parse_document(
"upload_time": upload_time,
"query_id": status
}
print("Upload completed successfully")

# TODO: handle image (OCR) files differently than others - page splitting does not work for them

@@ -187,37 +183,31 @@ def parse_document(
TO_FILE('{fully_qualified_path}'),
{{'mode': 'LAYOUT' , 'page_split': true}}) AS parsed_document;
"""
print(f"Executing query: {query}")
print("Parsing document via AI_PARSE_DOCUMENT...")
cursor.execute(query)

print("Fetching results...")
rows = cursor.fetchall()
print(f"Got {len(rows) if rows else 0} results")

processing_result = None
if rows and len(rows) > 0:
# Convert row to dict
columns = [desc[0] for desc in cursor.description]
processing_result = dict(zip(columns, rows[0]))
print(f"Available columns: {columns}")
print(f"Found processing result with {len(processing_result)} fields")
else:
print("No processing results found")

print("Query completed")
print("Parsing completed")

# Combine upload result with parsed result
combined_result = {
"upload": upload_result,
"parsed": processing_result
}
print("Returning combined result")
print("Returning result")

return Response(result=combined_result)

except Exception as e:
error_msg = f"Error processing document {filename}: {str(e)}"
print(f"ERROR: {error_msg}")
print("ERROR during document processing")
import traceback
print(f"Traceback: {traceback.format_exc()}")
return Response(error=error_msg)
4 changes: 2 additions & 2 deletions actions/snowflake-document-ai/package.yaml
@@ -4,10 +4,10 @@ spec-version: v2
name: Snowflake Document AI

# Required: A description of what's in the action package.
description: Upload PDF files to Snowflake and read their content with one-shot parse. No training required.
description: Upload PDF, PPTX, DOCX files to Snowflake and read their content with one-shot parse. No training required.

# Package version number, recommend using semver.org
version: 0.0.1
version: 0.0.2

dependencies:
conda-forge:
1 change: 1 addition & 0 deletions bin/publisher/action_packages_whitelist.json
@@ -67,6 +67,7 @@
"snowflake-cortex-analyst",
"snowflake-cortex-search",
"snowflake-data",
"snowflake-document-ai",
"serper",
"perplexity",
"robocorp"