diff --git a/integrations/amazon-textract.md b/integrations/amazon-textract.md
new file mode 100644
index 00000000..8ef5b48b
--- /dev/null
+++ b/integrations/amazon-textract.md
@@ -0,0 +1,127 @@
+---
+layout: integration
+name: Amazon Textract
+description: Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents
+authors:
+  - name: deepset
+    socials:
+      github: deepset-ai
+      twitter: deepset_ai
+      linkedin: https://www.linkedin.com/company/deepset-ai/
+pypi: https://pypi.org/project/amazon-textract-haystack
+repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract
+type: Data Ingestion
+report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
+logo: /logos/aws.png
+version: Haystack 2.0
+toc: true
+---
+
+### **Table of Contents**
+- [Overview](#overview)
+- [Installation](#installation)
+- [Usage](#usage)
+
+## Overview
+
+[`AmazonTextractConverter`](https://docs.haystack.deepset.ai/docs/amazontextractconverter) provides an integration of [Amazon Textract](https://aws.amazon.com/textract/) with Haystack.
+
+This component uses Amazon Textract's synchronous API to convert images and single-page PDFs into Haystack `Document` objects using OCR. It supports plain text extraction, structural analysis for tables and forms, and natural-language queries on documents.
+
+**Supported file formats**: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
+
+**Key features**:
+- Plain text extraction with `DetectDocumentText`
+- Table, form, signature, and layout detection with `AnalyzeDocument`
+- Natural-language queries to extract specific answers from documents
+- Access to the raw Textract response for downstream processing
+
+## Installation
+
+Install the Amazon Textract integration:
+
+```bash
+pip install amazon-textract-haystack
+```
+
+## Usage
+
+The component uses the standard boto3 credential chain.
+You can set AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) as environment variables, configure them via `~/.aws/credentials` and `~/.aws/config`, rely on an IAM role when running on AWS infrastructure, or pass them explicitly as [Secret](https://docs.haystack.deepset.ai/docs/secret-management) arguments.
+
+The Textract API is selected automatically based on how you configure the component: `DetectDocumentText` is used by default for plain text extraction, while `AnalyzeDocument` is used whenever you set `feature_types` or pass `queries` at runtime.
+
+### Basic text extraction
+
+Extract plain text from a document with the default configuration, which calls `DetectDocumentText`:
+
+```python
+from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
+
+converter = AmazonTextractConverter()
+results = converter.run(sources=["document.png"])
+documents = results["documents"]
+
+print(documents[0].content)
+```
+
+### Table and form analysis
+
+Use `AnalyzeDocument` to detect tables and forms by setting `feature_types`:
+
+```python
+from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
+
+converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
+results = converter.run(sources=["invoice.png"])
+
+documents = results["documents"]
+raw_responses = results["raw_textract_response"]
+```
+
+Valid `feature_types` values: `"TABLES"`, `"FORMS"`, `"SIGNATURES"`, `"LAYOUT"`.
+
+### Natural-language queries
+
+Ask questions about a document and get extracted answers.
+The `QUERIES` feature type is enabled automatically when you pass the `queries` parameter at runtime:
+
+```python
+from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
+
+converter = AmazonTextractConverter()
+results = converter.run(
+    sources=["medical_form.png"],
+    queries=["What is the patient name?", "What is the date of birth?"],
+)
+
+documents = results["documents"]
+raw_responses = results["raw_textract_response"]
+```
+
+Queries can be combined with `feature_types` for both structural and question-based extraction:
+
+```python
+converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
+results = converter.run(
+    sources=["invoice.png"],
+    queries=["What is the total amount due?"],
+)
+```
+
+### Explicit credentials
+
+```python
+from haystack.utils import Secret
+from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
+
+converter = AmazonTextractConverter(
+    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
+    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
+    aws_region_name=Secret.from_token("us-east-1"),
+)
+```
+
+For more details on Amazon Textract capabilities and setup, refer to the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/what-is.html).
+
+### License
+
+`amazon-textract-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
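
As a supplement to the format and size limits listed in the Overview, the sketch below shows one possible pre-flight check to run on local files before passing them to the converter. It is not part of the integration: `split_sources`, `SUPPORTED_SUFFIXES`, and `MAX_BYTES` are hypothetical names. The suffix list is an assumed mapping from the listed formats to file extensions, and the 10 MB limit is assumed to mean 10 MiB and to apply to every format. The check does not inspect PDF page counts.

```python
from pathlib import Path

# Assumed mapping from the formats named in the Overview to file extensions.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}

# Assumption: "10 MB" is read as 10 MiB and applied to every format.
MAX_BYTES = 10 * 1024 * 1024


def split_sources(paths):
    """Split paths into (accepted, rejected) by existence, suffix, and size."""
    accepted, rejected = [], []
    for raw in paths:
        path = Path(raw)
        ok = (
            path.is_file()  # missing files and directories are rejected
            and path.suffix.lower() in SUPPORTED_SUFFIXES
            and path.stat().st_size <= MAX_BYTES
        )
        (accepted if ok else rejected).append(str(path))
    return accepted, rejected
```

The `accepted` list can then be passed as `sources` to `converter.run`, while `rejected` can be logged or routed to a different converter.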