F-Instruct

A structured financial agent system that produces compliant structured financial products and can integrate with open banking systems.

API

Data Converter

ingest_data(input_path, output_dir, model=None)

Converts PDF documents into structured Markdown for downstream processing workflows, with optional LLM-assisted extraction.

Parameters:

  • input_path (str) - Path to PDF file or directory containing PDF files
  • output_dir (str) - Output directory for processed documents
  • model (str, optional) - Ollama model to use for enhanced extraction (e.g., 'phi4'). If this argument is not provided, LLM enhancement is disabled.

Returns:

  • bool - True if ingestion successful, False otherwise

Example:

from f_instruct.data import ingest_data

# Process documents from ingress to processed (LLM enhancement disabled)
ingest_data("./data/ingress", "./data/processed")

# Process single document with Ollama LLM enhancement using the 'phi4' model
ingest_data("./data/ingress/report.pdf", "./data/processed", model="phi4")

CLI:

# Basic usage (LLM enhancement disabled)
uv run f_instruct/data/converter.py -i ./data/ingress -o ./data/processed

# With LLM enhancement using the 'phi4' model
uv run f_instruct/data/converter.py -i ./data/ingress -o ./data/processed -m phi4

Data Preprocessor

preprocess_data(input_dir, output_dir, model)

Processes Markdown files into structured Parquet datasets, using spaCy for advanced paragraph detection and an Ollama LLM for text enhancement and named entity formatting.

  1. data.parquet: This file contains individual text chunks extracted from the source documents, along with their associated metadata. Its columns are:

    • chunk_id (int) - Unique integer identifier for the text chunk (primary key).
    • document_id (str) - Identifier of the source document this chunk belongs to.
    • chunk_index (int) - Sequential index of the chunk within its source document.
    • text (str) - The enhanced text content of the chunk with marked-up entities.
    • title (str) - Title of the source document.
    • source_name (str) - Original filename of the source document.
    • classifier_code (str) - Classifier code assigned to the source document.
    • paragraph_id (str) - Identifier of the paragraph this chunk belongs to.
    • position_start (int) - Start character position of this chunk in the original document.
    • position_end (int) - End character position of this chunk in the original document.
    • previous_chunk_id (int) - ID of the previous chunk in sequence (-1 if none).
    • next_chunk_id (int) - ID of the next chunk in sequence (-1 if none).
    • is_first_chunk (bool) - Flag indicating if this is the first chunk in a document.
    • is_last_chunk (bool) - Flag indicating if this is the last chunk in a document.
    • relative_position (float) - Relative position of the chunk within the document (0.0-1.0).

    Note: The preprocessor now uses spaCy for improved paragraph detection and boundary analysis, and leverages an Ollama LLM to enhance readability and mark up named entities with bold formatting.

Parameters:

  • input_dir (str) - Path to directory containing Markdown files.
  • output_dir (str) - Output directory for the processed Parquet data file(s).
  • model (str) - Ollama model to use for text enhancement and named entity recognition (e.g., 'phi4').

Returns:

  • dict[str, pandas.DataFrame] - A dictionary containing the processed data DataFrame.

Example:

from f_instruct.data import preprocess_data

# Process Markdown documents with enhanced NLP features
preprocess_data("./data/processed", "./data/structured", model="phi4")

CLI:

# Preprocessor requires specifying a model
uv run f_instruct/data/preprocessor.py -i ./data/processed -o ./data/structured -m phi4
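
Since the schema links chunks through previous_chunk_id and next_chunk_id, a document can be re-read in order straight from the Parquet output. A minimal sketch with pandas (the file path assumes the output directory used in the examples above):

import pandas as pd

# Load the chunk table written by the preprocessor
df = pd.read_parquet("./data/structured/data.parquet").set_index("chunk_id")

# Walk one document's chunks in order by following the next_chunk_id links
chunk_id = df[df["is_first_chunk"]].index[0]
while chunk_id != -1:
    row = df.loc[chunk_id]
    print(row["chunk_index"], row["text"][:60])
    chunk_id = row["next_chunk_id"]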

Financial Data Processor

process_financial_data(input_dir, output_dir, model="phi4")

Generates diverse training examples from financial regulatory data, creating question-answer pairs, instruction examples, financial product definitions, API interactions, and regulatory compliance scenarios.

The processor uses an LLM to generate high-quality training data in multiple formats and saves it as JSON, JSONL, and Parquet files.

Parameters:

  • input_dir (str) - Directory containing preprocessed Parquet files
  • output_dir (str) - Output directory for generated training examples
  • model (str, optional) - Ollama model to use for generation (default: 'phi4')

Returns:

  • str - Path to the generated output file

Example:

from f_instruct.data import process_financial_data

# Generate training data from preprocessed financial documents
output_path = process_financial_data(
    "./data/structured", 
    "./data/training", 
    model="phi4"
)
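
Continuing the example, output_path can be used for a quick sanity check. A sketch that assumes the returned path points at the JSONL variant (the processor also writes JSON and Parquet, so adjust accordingly):

import json

# Count the generated training examples, one JSON object per line
with open(output_path) as f:
    examples = [json.loads(line) for line in f]
print(f"Generated {len(examples)} training examples")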

CLI:

# Generate training data using default phi4 model
uv run f_instruct/data/processor.py -i ./data/structured -o ./data/training

# Use a different model for generation
uv run f_instruct/data/processor.py -i ./data/structured -o ./data/training -m llama3
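
The three stages compose into a single pipeline. An end-to-end sketch using only the calls documented above (paths follow the earlier examples):

from f_instruct.data import ingest_data, preprocess_data, process_financial_data

# PDF ingress -> Markdown -> chunked Parquet -> training examples
if ingest_data("./data/ingress", "./data/processed", model="phi4"):
    preprocess_data("./data/processed", "./data/structured", model="phi4")
    output_path = process_financial_data("./data/structured", "./data/training", model="phi4")
    print(f"Training data written to {output_path}")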

Copyright 2025 Team Tonic
