12 changes: 8 additions & 4 deletions docs/source/audio/generate_redacted_audio.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
Generate redacted audio files
=============================
Textual can also generated a redacted audio file, where PII are replaced with 'beeps'. This can be accomplished via our :meth:`redact_audio_file<tonic_textual.audio_api.TextualAudio.redact_audio_file>` method.
Textual can generate a redacted audio file, where sensitive content is replaced with 'beeps'.

To do this, use the :meth:`redact_audio_file<tonic_textual.audio_api.TextualAudio.redact_audio_file>` method.

.. code-block:: python

@@ -15,8 +17,10 @@ Textual can also generated a redacted audio file, where PII are replaced with 'b
textual.redact_audio('input.mp3','output.mp3', generator_config=gc, generator_default='Off')


.. rubric:: Additional Remarks
.. rubric:: Additional remarks

Before you call this method, in addition to the ``tonic_textual`` library, you must install pydub.

Calling this method requires that pydub be installed in addition to the tonic_textual library.
When you use Textual Cloud (https://textual.tonic.ai), file uploads are limited to 25MB or less.

When using the Textual Cloud (https://textual.tonic.ai) file uploads are limited to 25MB or less. Supported file types are m4a, mp3, webm, mpga, wav.
Textual supports the following audio file types: m4a, mp3, webm, mpga, wav.
22 changes: 16 additions & 6 deletions docs/source/audio/generate_transcript.rst
@@ -1,7 +1,8 @@
Generate transcript
===================
Textual can also generated a transcript from an audio file. This can be accomplished via our :meth:`get_audio_transcript<tonic_textual.audio_api.TextualAudio.get_audio_transcript>` method:
To generate a transcript.
Textual can generate a transcript from an audio file. To do this, use the :meth:`get_audio_transcript<tonic_textual.audio_api.TextualAudio.get_audio_transcript>` method.

To generate a transcript:

.. code-block:: python

@@ -11,15 +12,24 @@ To generate a transcript.

transcription = textual.get_audio_transcript('path_to_file.mp3')

This will generate a :class:`transcription_result<tonic_textual.classes.audio.redact_audio_responses.TranscriptionResult>`. It will contain the full text of the transcription, the detected language, and a list of audio segments. Each segment will be some portion of the transcription with start and end times in milliseconds.
This generates a :class:`transcription_result<tonic_textual.classes.audio.redact_audio_responses.TranscriptionResult>`.

It contains:

It'll look something like this:
* The full text of the transcription.
* The detected language.
* A list of audio segments. Each segment is some portion of the transcription with start and end times in milliseconds.

It looks something like this:

.. literalinclude:: transcription_result.json
:language: JSON
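
As a rough illustration of working with the segment timings, the sketch below formats each segment's millisecond offsets as readable timestamps. It operates on a plain dictionary, and the key names (``text``, ``language``, ``segments``, ``start``, ``end``) are assumptions made for this example, not confirmed field names. Check the ``TranscriptionResult`` class for the actual attributes.

```python
def format_ms(ms: int) -> str:
    """Format a millisecond offset as MM:SS.mmm."""
    minutes, remainder = divmod(int(ms), 60_000)
    seconds, millis = divmod(remainder, 1_000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def describe_segments(transcription: dict) -> list[str]:
    """Return one line per segment in the form '[start - end] text'."""
    lines = []
    for segment in transcription["segments"]:
        start = format_ms(segment["start"])
        end = format_ms(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return lines


# Hypothetical data shaped like the description above (not real output).
example = {
    "text": "Hello there. My name is Alex.",
    "language": "en",
    "segments": [
        {"start": 0, "end": 1250, "text": "Hello there."},
        {"start": 1250, "end": 63500, "text": " My name is Alex."},
    ],
}

for line in describe_segments(example):
    print(line)
```

Because the times are in milliseconds, integer division is enough and no floating-point rounding is involved.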


.. rubric:: Additional remarks

When you use Textual Cloud (https://textual.tonic.ai), file uploads are limited to 25MB or less.

.. rubric:: Additional Remarks
Textual supports the following file types: m4a, mp3, webm, mpga, wav.

When using the Textual Cloud (https://textual.tonic.ai) file uploads are limited to 25MB or less. Supported file types are m4a, mp3, webm, mpga, wav. For file types like m4a you'll need to make sure your build of ffmpeg has the necessary libraries.
For file types such as m4a, make sure that your build of ffmpeg has the necessary libraries.
12 changes: 9 additions & 3 deletions docs/source/audio/index.rst
@@ -4,12 +4,18 @@ Audio
The Textual audio functionality allows you to process audio files in different ways. With this module you can:

- Generate a transcript
- Sanitize the transcript by synthesizing/redacting it
- Synthesize or redact sensitive values in the transcript
- Generate a redacted (beeped-out) audio file from the original recording

Before you can use these functions, read the :doc:`Getting started </index>` guide and create an API key.

Textual audio processing supports m4a, mp3, webm, mpga, wav files. For file types like m4a you'll need to make sure your build of ffmpeg has the necessary libraries. If you are using the Textual cloud or you are self-hosting but using the Azure AI Whisper integration then you'll have to limit your file sizes to 25MB or less. If you are self-hosting Textual's ASR containers then there are no file size limitations.
Textual audio processing supports the following audio file types: m4a, mp3, webm, mpga, wav.

For file types such as m4a, make sure that your build of ffmpeg has the necessary libraries.

If you use Textual Cloud, or you self-host using the Azure AI Whisper integration, then file sizes are limited to 25MB or smaller.

If you self-host using Textual's Automatic Speech Recognition (ASR) containers, then there are no limitations on file size.
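
One way to catch these limits early is to check a file before you send it. The following is a minimal sketch that uses only the Python standard library. ``check_audio_file`` and both constants are hypothetical names for this example, not part of the SDK, and reading "25MB" as 25 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# File types listed above.
SUPPORTED_AUDIO_EXTENSIONS = {".m4a", ".mp3", ".webm", ".mpga", ".wav"}

# Assumption: "25MB" is read here as 25 * 1024 * 1024 bytes.
MAX_CLOUD_UPLOAD_BYTES = 25 * 1024 * 1024


def check_audio_file(path, enforce_size_limit=True):
    """Return a list of problems found with the file (empty if none)."""
    file_path = Path(path)
    problems = []
    if not file_path.is_file():
        return [f"{file_path} does not exist or is not a file"]
    if file_path.suffix.lower() not in SUPPORTED_AUDIO_EXTENSIONS:
        problems.append(f"unsupported file type: {file_path.suffix or '(none)'}")
    if enforce_size_limit and file_path.stat().st_size > MAX_CLOUD_UPLOAD_BYTES:
        problems.append("file is larger than 25MB")
    return problems
```

If you self-host with the ASR containers, pass ``enforce_size_limit=False`` to skip the size check.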

.. toctree::
:hidden:
@@ -18,4 +24,4 @@ Textual audio processing supports m4a, mp3, webm, mpga, wav files. For file type
generate_transcript
redact_transcript
generate_redacted_audio
api
21 changes: 16 additions & 5 deletions docs/source/audio/redact_transcript.rst
@@ -1,8 +1,10 @@
Redacting a transcript
----------------------
To redact a transcript you'll first need to generate a transcription result, which you can do via the :meth:`get_audio_transcript<tonic_textual.audio_api.TextualAudio.get_audio_transcript>` method (see :doc:`here for an example <generate_transcript>`).
Before you can redact a transcript, you must first generate a transcription result. To do this, use the :meth:`get_audio_transcript<tonic_textual.audio_api.TextualAudio.get_audio_transcript>` method. For an example, go to :doc:`Generate transcript <generate_transcript>`.

Once you have a transcript you can call :meth:`redact_audio_transcript<tonic_textual.audio_api.TextualAudio.redact_audio_transcript>`. Here is an example:
Once you have a transcript, call :meth:`redact_audio_transcript<tonic_textual.audio_api.TextualAudio.redact_audio_transcript>`.

For example:

.. code-block:: python

@@ -18,8 +20,17 @@ Once you have a transcript you can call :meth:`redact_audio_transcript<tonic_tex

redacted_transcript = textual.redact_audio_transcript(transcript, generator_config=gc, generator_default='Off')

The :py:func:`redact_audio_transcript` will return a :class:`redacted_transcript_result<tonic_textual.classes.audio.redacted_transcription_result.RedactedTranscriptionResult>` which will include the original transcription, the redacted/synthesized text of the transcription, a list of redacted_segments, and the usage.
The :py:func:`redact_audio_transcript` returns a :class:`redacted_transcript_result<tonic_textual.classes.audio.redacted_transcription_result.RedactedTranscriptionResult>`, which includes:

* The original transcription.
* The redacted or synthesized text of the transcription.
* A list of redacted_segments.
* The usage.

.. rubric:: Additional remarks

When you use Textual Cloud (https://textual.tonic.ai), file uploads are limited to 25MB or smaller.

.. rubric:: Additional Remarks
Textual supports the following audio file types: m4a, mp3, webm, mpga, wav.

When using the Textual Cloud (https://textual.tonic.ai) file uploads are limited to 25MB or less. Supported file types are m4a, mp3, webm, mpga, wav. For file types like m4a you'll need to make sure your build of ffmpeg has the necessary libraries.
For file types such as m4a, make sure that your build of ffmpeg has the necessary libraries.
8 changes: 6 additions & 2 deletions docs/source/datasets/downloading_files.rst
@@ -1,7 +1,11 @@
Downloading a redacted dataset file
=====================================

To download the redacted or synthesized version of the file, get the specific file from the dataset, then call the **download** function.
To download the redacted or synthesized version of the file:

1. Get the specific file from the dataset.

2. Call the **download** function.

For example:

@@ -20,4 +24,4 @@ To download a specific file in a dataset that you fetch by name:
file = txt_file = list(filter(lambda x: x.name=='<file to download>', dataset.files))[0]
file_bytes = file.download()
with open('<file name>', 'wb') as f:
    f.write(file_bytes)
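
To save every file in a dataset instead of a single one, the same calls can be wrapped in a loop. This is a sketch under stated assumptions: it relies only on ``dataset.files``, ``name``, and ``download()`` as used in the snippet above, it assumes that ``download()`` returns bytes, and ``download_all`` and ``output_dir`` are hypothetical names introduced for illustration.

```python
from pathlib import Path


def download_all(dataset, output_dir):
    """Save every file in the dataset to output_dir; return the written paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for dataset_file in dataset.files:
        # Path(...).name drops any directory part of the stored name.
        target = out / Path(dataset_file.name).name
        target.write_bytes(dataset_file.download())
        written.append(target)
    return written
```

Files that share the same name overwrite each other in this sketch, so add a suffix or a subdirectory if your dataset can contain duplicates.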
8 changes: 5 additions & 3 deletions docs/source/datasets/index.rst
@@ -1,9 +1,11 @@
Datasets
=========================

A dataset is a collection of files that are all redacted and synthesized in the same way. Datasets are a helpful organization tool to ensure that you can easily track a collections of files and how sensitive data is removed from those files.
A dataset is a collection of files that are all redacted and synthesized in the same way. Datasets are a helpful organization tool to ensure that you can easily track a collection of files and how sensitive data is removed from those files.

Typically, you configure datasets from the Textual application, but for ease of use, the SDK supports many dataset operations. However, some operations can only be performed from the Textual application.
Typically, you configure datasets from the Textual application, but for ease of use, the SDK supports many dataset operations.

However, some operations can only be performed from the Textual application.



@@ -19,4 +21,4 @@ Typically, you configure datasets from the Textual application, but for ease of
viewing_files
downloading_files
viewing_config
api
9 changes: 7 additions & 2 deletions docs/source/datasets/uploading_files.rst
@@ -1,8 +1,13 @@
Uploading files to a dataset
=============================

You can upload files to your dataset from the SDK. Provide the complete path to the file, and the complete name of the file as you want it to appear in Textual.
You can upload files to your dataset from the SDK.

When you upload a file, you provide:

* The complete path to the file.
* The complete name of the file as it should appear in Textual.

.. code-block:: python

dataset.add_file('<path to file>','<file name>')
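
To upload several files at once, the call above can be repeated for each file in a local directory. The following sketch assumes only the two-argument ``add_file`` call shown above. ``add_directory`` and its ``pattern`` parameter are hypothetical names for this example, not SDK functions.

```python
from pathlib import Path


def add_directory(dataset, directory, pattern="*"):
    """Upload each matching file in directory; return the uploaded names."""
    uploaded = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            # Same positional call as above: path first, then display name.
            dataset.add_file(str(path), path.name)
            uploaded.append(path.name)
    return uploaded
```

For example, ``add_directory(dataset, 'reports', '*.pdf')`` would upload only the PDF files directly inside ``reports``. Subdirectories are skipped.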
53 changes: 41 additions & 12 deletions docs/source/datasets/viewing_config.rst
@@ -1,7 +1,12 @@
Viewing the PII information for a dataset
-----------------------------------------
Viewing detected entities for a dataset
=======================================

You can also retrieve a list of entities found in the files of a dataset. You can retrieve all entities found or just specific entity types. The below will retrieve information on ALL entities.
You can retrieve a list of entities that were detected in the dataset files.

Retrieving all entities for a dataset
-------------------------------------

To retrieve the complete list of entities for a dataset:

.. code-block:: python

@@ -13,26 +18,43 @@ You can also retrieve a list of entities found in the files of a dataset. You c
for file in files:
    entities = file.get_entities()

It will return a response a dictionary whose key is the type of PII and whose value is a list of found entities. The returned entity includes the original text value of the entity as well as the few words preceding and following the entity, e.g.
It returns a response in the form of a dictionary where:

* The key is the entity type.
* The value is the list of detected entities of that type.

For each entity, the response includes:

* The original text value of the entity.
* To provide context, a few words that precede and follow the entity.

For example:

.. literalinclude:: pii_occurence_response.json
:language: JSON
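
Because the response is keyed by entity type, a per-type count takes only a few lines. The sketch below works on any dictionary of that shape. ``summarize_entities`` is a hypothetical helper, and the inner ``text`` field in the sample data is a placeholder rather than a confirmed field name.

```python
def summarize_entities(entities_by_type):
    """Return {entity type: number of detected entities}, largest first."""
    counts = {
        entity_type: len(found)
        for entity_type, found in entities_by_type.items()
    }
    # Sort by descending count, then by type name for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    return dict(ordered)


# Hypothetical response; the inner field names are placeholders.
example = {
    "NAME_FAMILY": [{"text": "Smith"}],
    "NAME_GIVEN": [{"text": "Alex"}, {"text": "Sam"}],
}
print(summarize_entities(example))
```

A summary like this can help you decide which entity types to request when you pass a list to ``get_entities()``.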

Retrieving specific types of entities for a dataset
---------------------------------------------------

The call to get_entities() can also take an optional list of entities. For example, you could pass in a hard coded list as:
The call to ``get_entities()`` can take an optional list of entity types.

For example, you could pass in a hard-coded list of entity types:

.. code-block:: python

file.get_entities(['NAME_GIVEN','NAME_FAMILY'])

Or do the same using the PiiType enum
Or you could use the ``PiiType`` enum:

.. code-block:: python

from tonic_textual.enums.pii_type import PiiType
file.get_entities([PiiType.NAME_GIVEN, PiiType.NAME_FAMILY])

Or you could even just pass in the current set of entities enabled by the dataset configuration, e.g.
Retrieving the entities for the enabled entity types for a dataset
------------------------------------------------------------------

To pass in the current set of entities that are enabled by the dataset configuration:

.. code-block:: python

@@ -44,12 +66,10 @@ Or you could even just pass in the current set of entities enabled by the datase

file.get_entities(entities)

Viewing redaction and synthesis mappings for a dataset
Viewing entity mappings for a dataset
------------------------------------------------------

You can retrieve the original, redacted, synthetic, and final output values for
entities in a dataset after the current generator configuration is applied. The
response is grouped by file.
You can retrieve mappings for each detected entity in a dataset.

.. code-block:: python

@@ -58,4 +78,13 @@ response is grouped by file.

for file in mappings.files:
    for entity in file.entities:
        print(file.file_name, entity.text, entity.output_text)

The response is grouped by file.

Each entity mapping includes:

* The original entity value.
* The redacted version of the entity value.
* The synthesized version of the entity value.
* The final output value based on the current dataset configuration.
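
If you need to look up the output value for a given original value, the mappings can be flattened into a dictionary per file. This sketch uses only the attributes that appear in the loop above (``files``, ``file_name``, ``entities``, ``text``, ``output_text``). ``build_replacement_lookup`` is a hypothetical helper, and the ``SimpleNamespace`` objects only stand in for a real response.

```python
from types import SimpleNamespace


def build_replacement_lookup(mappings):
    """Return {file name: {original text: output text}}."""
    lookup = {}
    for file in mappings.files:
        per_file = lookup.setdefault(file.file_name, {})
        for entity in file.entities:
            # If the same original text occurs more than once, the last wins.
            per_file[entity.text] = entity.output_text
    return lookup


# Stand-in object for illustration only; real mappings come from the SDK.
fake_mappings = SimpleNamespace(files=[
    SimpleNamespace(file_name="notes.txt", entities=[
        SimpleNamespace(text="Alex", output_text="[NAME_GIVEN]"),
    ]),
])
print(build_replacement_lookup(fake_mappings))
```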
16 changes: 8 additions & 8 deletions docs/source/index.rst
@@ -26,9 +26,9 @@ Before you get started, you must install the Textual Python SDK:

Set up a Textual API key
------------------------
To authenticate with Tonic Textual, you must set up an API key. After |signup_link|, to obtain an API key, go to the **User API Keys** page.
To authenticate with Tonic Textual, you must set up an API key. After |signup_link|, to obtain an API key, go to the **User API Keys** section of the **User Profile** page.

After, you obtain the key, you can optionally set it as an environment variable:
After you obtain the key, you can optionally set it as an environment variable:

.. code-block:: bash

@@ -40,7 +40,7 @@ You can can also pass the API key as a parameter when you create your Textual cl
Creating a Textual client
--------------------------

To redact text or files, use our TextualNer client. To parse files, which is useful for extracting information from files such as PDF and DOCX, use our TextualParse client.
To redact text or files, use the TextualNer client. To parse files, which is useful for extracting information from files such as PDF and DOCX, use the TextualParse client.

.. code-block:: python

@@ -50,14 +50,14 @@ To redact text or files, use our TextualNer client. To parse files, which is use
textual = TextualNer()
textual = TextualParse()

Both client support several optional arguments:
Both clients support the following optional arguments:

* **base_url** - The URL of the server that hosts Tonic Textual. Defaults to https://textual.tonic.ai
* ``base_url`` - The URL of the server that hosts Tonic Textual. Default: ``https://textual.tonic.ai``

* **api_key** - Your API key. If not specified, you must set TONIC_TEXTUAL_API_KEY in your environment.
* ``api_key`` - Your API key. If not specified, you must set ``TONIC_TEXTUAL_API_KEY`` in your environment.

* **verify** - Whether to verify SSL certification. Default is true.
* ``verify`` - Whether to verify SSL certificates. Default: ``true``
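
In your own code, you may want to fail early with a clear message when no API key is available. The helper below is a minimal sketch of that check, mirroring the rule described for ``api_key``. ``resolve_api_key`` is a hypothetical function written for this example. It is not part of the SDK, and the SDK's own lookup may behave differently.

```python
import os


def resolve_api_key(api_key=None):
    """Return the explicit key if given, else TONIC_TEXTUAL_API_KEY."""
    key = api_key or os.environ.get("TONIC_TEXTUAL_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found. Pass api_key or set TONIC_TEXTUAL_API_KEY."
        )
    return key
```

You could then pass the result as the ``api_key`` argument when you create a client.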

.. |signup_link| raw:: html

<a href="https://textual.tonic.ai/signup" target="_blank">creating your account</a>
<a href="https://textual.tonic.ai/signup" target="_blank">you create your account</a>
14 changes: 8 additions & 6 deletions docs/source/parse/parsing_files.rst
@@ -1,7 +1,7 @@
Parsing files
=================

When Textual parses files, it convert unstructured files, such as PDF and DOCX, into a more structured JSON form. Textual uses the same JSON schema for all of its supported file types.
When Textual parses files, it converts unstructured files, such as PDF and DOCX, into a more structured JSON form. Textual uses the same JSON schema for all of its supported file types.

To parse a single file, call the **parse_file** function. The function is synchronous. It only returns when the file parsing is complete. For very large files, such as PDFs that are several hundred pages long, this process can take a few minutes.

@@ -22,9 +22,9 @@ To parse a single file from a local file system, start with the following snippe

To read the files, use the 'rb' access mode, which opens the file for reading in binary format.

In the **parse_file** command, you can set an optional timeout. The timeout indicates the number of seconds after which to stop waiting for the parsed result.
In the ``parse_file`` command, you can set an optional timeout. The timeout indicates the number of seconds after which to stop waiting for the parsed result.

To set a timeout for for all parse requests from the SDK, set the environment variable TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS.
To set a timeout for all parse requests from the SDK, set the environment variable ``TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS``.

Parsing a file from Amazon S3
-----------------------------
Expand All @@ -40,10 +40,12 @@ Because this uses the boto3 library to fetch the file from Amazon S3, you must f
Understanding the parsed result
-------------------------------

The parsed result is a :class:`FileParseResult<tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult>`. It is a wrapper around the JSON that is generated during processing.
The parsed result is a :class:`FileParseResult<tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult>`.

To learn more about the structure of the parsed result, go to |parsed_structure_external_link| in the Textual documentation.
It is a wrapper around the JSON that is generated during processing.

To learn more about the structure of the parsed result, go to the |parsed_structure_external_link| in the Textual documentation.

.. |parsed_structure_external_link| raw:: html

<a href="https://docs.tonic.ai/textual/datasets-preview-output/dataset-output-json-structure" target="_blank">Parsed JSON structure</a>
<a href="https://docs.tonic.ai/textual/datasets-preview-output/dataset-output-json-structure" target="_blank">JSON output structure information</a>
10 changes: 5 additions & 5 deletions docs/source/parse/working_with_parsed_output.rst
@@ -5,16 +5,16 @@ After a file is parsed, either directly or as part of a dataset, you can begin t

Typically, users build pipelines to feed vector databases for RAG applications, or to prepare datasets to fine-tune or build an LLM.

The parsed result is documented in the Textual documentation in |parsed_structure_external_link|. This topic describes the JSON schema that is used to store the parsed result.
In the Textual documentation, the |parsed_structure_external_link| topic describes the JSON schema that is used to store the parsed result.

The SDK provides access to the raw JSON in the form of a Python dictionary. It also provides helper methods and utilities to perform common actions.

Examples of actions that the SDK supports include:

- Get the content of the file in Markdown or plain text
- Redact or synthesize the file content
- Get the content of the file in Markdown or plain text.
- Redact or synthesize the file content.
- Chunk the file. You can redact or synthesize the chunks and also enrich them with additional entity metadata.
- List all of the identified tables and key-value pairs that were found in a document
- List all of the identified tables and key-value pairs that were found in a document.

The following snippet includes most of these supported actions.

@@ -42,4 +42,4 @@ For a list of all of the available operations, go to the :class:`FileParseResult

.. |parsed_structure_external_link| raw:: html

<a href="https://docs.tonic.ai/textual/datasets-preview-output/dataset-output-json-structure" target="_blank">Parsed JSON structure</a>
<a href="https://docs.tonic.ai/textual/datasets-preview-output/dataset-output-json-structure" target="_blank">JSON output structure</a>