Releases: NASA-IMPACT/larch
v0.2.1-alpha
What's Changed
Full Changelog: v0.2.0-alpha...v0.2.1-alpha
v0.2.0-alpha
What's Changed
- Improve fuzzymatch based whitelist validator by tracking best candidates by @NISH1001 in #40
- Add better whitelist validator using different matching algorithm by @NISH1001 in #41
- Bump up the version to 0.2.0-alpha by @NISH1001 in #42
Full Changelog: v0.1.2-alpha...v0.2.0-alpha
Usages
Usage can be seen in the docstrings of the components, but here are the tentative ones:
```python
from openai import OpenAI

from larch.metadata import InstructorBasedOpenAIMetadataExtractor
from larch.metadata.validators import WhitelistBasedMetadataValidatorWithMatcher
from larch.processors import (
    CombinedMatcher,
    ExactMatcher,
    FuzzyMatcher,
    LLMMatcher,
)
from larch.utils import load_whitelist

# exact matching
matcher = ExactMatcher()
match = matcher(
    text="paradox",
    values=["para", "Paradox"],
)  # Output: [('Paradox', 100.0)]

# fuzzy matching
matcher = FuzzyMatcher()
match = matcher(
    text="parado",
    values=["para", "Paradox"],
)  # Output: [('Paradox', 92.3076923076923), ('para', 90.0)]

# LLM-based matching
matcher = LLMMatcher(
    InstructorBasedOpenAIMetadataExtractor(
        schema=None,
        openai_client=OpenAI(),
        model="gpt-3.5-turbo",
        debug=False,
    ),
    debug=True,
)
match = matcher(
    text="prdx",
    values=["para", "Paradox"],
)  # Output: [('Paradox', 100.0)]
```
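`CombinedMatcher` only appears inside the validator example below, so here is a hypothetical standalone sketch; the assumption (from the name and its usage below) is that it combines its child matchers, e.g. exact match first with fuzzy match as a fallback:

```python
# hypothetical sketch; see the CombinedMatcher docstring for the
# authoritative behavior
matcher = CombinedMatcher(ExactMatcher(), FuzzyMatcher())
match = matcher(
    text="parado",
    values=["para", "Paradox"],
)
```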
```python
# load the whitelist from an excel file...
whitelist_map = load_whitelist(<path_to_excel_file>)
# ...or define it inline
whitelist_map = {"address": {"Huntsville": ["Huntsville", "hunsville", "huntsvil"]}}

metadata = dict(address="hunsvllle")
metadata_validated = WhitelistBasedMetadataValidatorWithMatcher(
    whitelists=whitelist_map,
    field_matcher=CombinedMatcher(ExactMatcher(), FuzzyMatcher()),
    fallback_matcher=LLMMatcher(
        InstructorBasedOpenAIMetadataExtractor(
            schema=None,
            openai_client=OpenAI(),
            model="gpt-3.5-turbo",
            debug=True,
        ),
        debug=True,
    ),
    unmatched_value=None,  # if set to `None`, unmatched keys will be removed
)(metadata)  # Output: {"address": "Huntsville"}
```

v0.1.2-alpha
v0.1.1-alpha
What's Changed
- Deprecate older SinequaDocumentIndexer in favor of new SinequaDocumen… by @NISH1001 in #33
- Bump up the version to 0.1.1 by @NISH1001 in #34
Full Changelog: v0.1.0-alpha...v0.1.1-alpha
v0.1.0-alpha
What's Changed
- Add top_k param to retriever by @NISH1001 in #27
- Larch to s3 by @Caden-Helbling in #28
- Upgrade Langchain by @NISH1001 in #31
- Bump up the version to 0.1.0-alpha by @NISH1001 in #32
- Addition of `larch.retrievers` module
New Contributors
- @Caden-Helbling made their first contribution in #28
Full Changelog: v0.0.3-alpha...v0.1.0-alpha
v0.0.3-alpha
This release adds a bunch of improvements and new features for SQL template matching
Major
- `larch.search.template_matcher.SQLTemplateMatcher` component is added to do template matching for SQL queries (pr)
- `larch.search.template_matcher.FuzzySQLTemplateMatcher` uses fuzzy-search based matching
- `larch` now supports additional/optional dependencies:
  - `larch[paper-qa]` for paperqa
  - `larch[extras]` for additional requirements like pandas, spacy, etc.
  - `larch[url-loaders]` for `unstructured`-related dependencies for loading URLs
Minor
- Switch to `pyproject.toml` configuration for pip installation
- Bugfix in `larch.indexing.DocumentIndexer.as_langchain_retriever(...)` (pr)
- Relevance scores are added to all the returns of `DocumentIndexer.query_top_k(...)`
- Several bugfixes related to `MultiRetrieverSearchEngine`
v0.0.2-alpha
Changelog
- Restructuring of `larch.indexing` (pr)
v0.0.1-alpha
This is the initial alpha release for larch, which consists of different components for RAG, metadata extraction, connecting to Sinequa, etc. (See README.md for more details.)
The tool tentatively has the following components to create any downstream LLM-based search engine.
Components
1. DocumentIndexer
`larch.indexing.DocumentIndexer` allows querying (`.query(..., top_k=...)`) and fetching the top-k documents (`.query_top_k(...)`) that are indexed in the document store. A short sketch of this interface follows the list.
- `larch.indexing.SinequaDocumentIndexer` is used to directly connect to Sinequa to fetch relevant documents (no indexing method is allowed here as indexing is left to Sinequa itself)
- `larch.indexing.PaperQADocumentIndexer` uses paperqa to index documents (`index_documents(<paths>)`)
- `larch.indexing.LangchainDocumentIndexer` allows switching between any vector store (FAISS, pgvector, etc.) for indexing (`index_documents(<paths>)`)
- `larch.indexing.DocumentMetadataIndexer` allows extracting/indexing/dumping metadata from files, which is then used in the downstream SQL agent. This takes in a `larch.metadata.MetadataExtractor` and a pydantic schema applied to each document.
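A minimal, hedged sketch of this interface (construction details elided; see the Document Indexing section below for a real construction):

```python
from larch.indexing import LangchainDocumentIndexer

# construction arguments elided; see "Document Indexing" below
document_indexer = LangchainDocumentIndexer(...)

# index a set of files
document_indexer.index_documents(paths=["a.pdf", "b.pdf"])

# full query-to-answer over the indexed documents
result = document_indexer.query("What is the address?", top_k=5)

# or fetch only the top-k relevant documents
docs = document_indexer.query_top_k("What is the address?", top_k=5)
```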
2. MetadataExtractor
`larch.metadata._base.AbstractMetadataExtractor` allows for implementing any downstream metadata extractor. A short sketch follows the list.
- `larch.metadata.extractors_openai.InstructorBasedOpenAIMetadataExtractor` is the standard extractor that is recommended to use. This uses function-calling.
- `larch.metadata.extractors.LangchainBasedMetadataExtractor` uses vanilla langchain and prompting to extract metadata.
- `larch.metadata.extractors.LegacyMetadataExtractor` is a refactored older algorithm from IMPACT.
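A minimal sketch, assuming a hypothetical pydantic schema (`DocumentMetadata` is an illustrative name, not part of larch); the extractor is called directly on document text, as in the Usage section below:

```python
from pydantic import BaseModel

from larch.metadata import InstructorBasedOpenAIMetadataExtractor


# hypothetical schema describing the metadata fields to extract
class DocumentMetadata(BaseModel):
    title: str
    address: str


metadata_extractor = InstructorBasedOpenAIMetadataExtractor(
    model="gpt-4",
    schema=DocumentMetadata,
    debug=True,
)
metadata = metadata_extractor("<document text>")
```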
3. SearchEngine
The `larch.search.engines.AbstractSearchEngine` component is used to abstract the query-to-response process that generates an answer for a given user query. All downstream search engines have to implement the `query(query=<str>, top_k=<top_k>)` method (see the sketch after this list).
- `larch.search.engines.SimpleRAG` simply uses any `DocumentIndexer` (especially its `query(...)` method) to wrap a RAG pipeline. (Alternatively, one can always use the `.query(...)` method from the document indexer directly.)
- `larch.search.engines.InMemoryDocumentQAEngine` takes in N documents on top of which QA can be done. It can be used as a standalone engine as well as with `DocumentStoreRAG`.
- `larch.search.engines.DocumentStoreRAG` uses the `InMemoryDocumentQAEngine` and a `DocumentIndexer` for QA. `DocumentIndexer.query_top_k(...)` is used to fetch the top-k relevant documents, which are then fed to the QA engine.
- `larch.search.engines.SQLAgentSearchEngine` connects to a given database (and a set of tables in that database), generates a SQL query for a given user query, and generates the response by fetching relevant rows from the database. It is used only for more complex tasks like aggregation and analysis, and is recommended only if the `MetadataExtractor` performs accurately.
- `larch.search.engines.MultiRetrieverSearchEngine` takes in an arbitrary number of `AbstractSearchEngine` instances (retrievers/sources), generates individual responses for a given query from each retriever, and finally consolidates the responses. This is the recommended way to ensemble multiple engines/retrievers in larch. (Note: each retriever is run in parallel.)
- `larch.search.engines.EnsembleAugmentedSearchEngine` is a very naive engine that takes in multiple engines, runs through each of them sequentially, puts all the responses into a single context prompt, and uses an LLM to do the QA. Not recommended for now.
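A minimal sketch of that contract, assuming `AbstractSearchEngine` can be subclassed by implementing only `query(...)`; everything beyond the method signature is an illustrative assumption:

```python
from larch.search.engines import AbstractSearchEngine


# hypothetical engine; only the query(...) contract comes from the
# description above, the rest is an illustrative assumption
class EchoSearchEngine(AbstractSearchEngine):
    def query(self, query: str, top_k: int = 5, **kwargs):
        return f"You asked: {query}"


engine = EchoSearchEngine()
response = engine.query(query="what is larch?", top_k=3)
```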
4. MetadataValidator
`larch.metadata.MetadataValidator` is used to post-process the extracted metadata (see the sketch after this list).
- `larch.metadata.validators.SimpleInTextMetadataValidator` checks if the extracted value of a field in the metadata lies in the text. If it doesn't, that field is removed. (Not recommended to use.)
- `larch.metadata.validators.WhitelistBasedMetadataValidator` uses a whitelist to standardize the extracted value in a field. Each field value can have a set of alternate values. Fuzzy matching is used to decide whether to standardize or not.
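The whitelist shape and call pattern, mirroring the Usages example above (the `whitelists` and `fuzzy_threshold` keywords also appear in the Usage section below):

```python
from larch.metadata.validators import WhitelistBasedMetadataValidator

# each field maps a canonical value to its accepted alternate spellings
whitelists = {"address": {"Huntsville": ["Huntsville", "hunsville", "huntsvil"]}}

validator = WhitelistBasedMetadataValidator(whitelists=whitelists, fuzzy_threshold=0.95)
metadata = validator({"address": "hunsville"})  # -> {"address": "Huntsville"}
```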
5. MetadataEvaluator
`larch.metadata.MetadataEvaluator` is used to evaluate (numerically) the extraction of metadata (see the sketch after this list).
- `larch.metadata.evaluators.JaccardEvaluator` computes the ratio of tokens shared between prediction and reference (doesn't account for word ordering)
- `larch.metadata.evaluators.FlattenedExactMatcher` computes the score by flattening the prediction and reference metadata and comparing the values (better than `JaccardEvaluator`)
- `larch.metadata.evaluators.RecursiveFuzzyMatcher` is the recommended evaluator that performs weighted scoring for each node. (See documentation for more.)
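A hedged sketch of the call pattern; the assumption (by analogy with the other callable larch components) is that an evaluator instance is called with the predicted and reference metadata:

```python
from larch.metadata.evaluators import JaccardEvaluator

prediction = {"address": "Huntsville, AL"}
reference = {"address": "Huntsville"}

# assumed call signature, by analogy with other larch callables
score = JaccardEvaluator()(prediction, reference)
```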
6. TextProcessor
`larch.processors.TextProcessor` allows for processing text. Every text processor takes in text and emits processed text (see the sketch after this list).
- `larch.processors.PIIRemover` uses spacy to identify Personal Identification Information (name, email, phone number) and mask it out.
- `larch.processors.NonAlphaNumericRemover` removes non-alphanumeric characters from the text.
- `larch.processors.TextProcessingPipeline` is a container that holds all the text processors and runs them sequentially.
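A small pipeline sketch mirroring the Usage section below (the no-argument constructors are an assumption):

```python
from larch.processors import (
    NonAlphaNumericRemover,
    PIIRemover,
    TextProcessingPipeline,
)

# processors run sequentially: text in, processed text out
pipeline = TextProcessingPipeline(
    NonAlphaNumericRemover(),  # assumed no-arg constructor
    PIIRemover(),
)
clean_text = pipeline("Contact John Doe at john@doe.com!")
```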
Usage
We can do:
- metadata extraction
- index documents into vector store
- json dump metadata in bulk
- create RAG pipeline
- etc
Metadata Extraction
Extract from single document text
```python
import re

from larch.metadata import InstructorBasedOpenAIMetadataExtractor
from larch.metadata.validators import WhitelistBasedMetadataValidator
from larch.processors import PIIRemover, TextProcessingPipeline
from larch.utils import load_whitelist

# text cleanup + PII masking, run sequentially
text_processor = TextProcessingPipeline(
    lambda x: re.sub(r"\$(?=\w|\n|\()", " ", x).strip(),
    lambda x: re.sub(r"\)(?=\w|\n|\()", " ", x).strip(),
    lambda x: re.sub(r"\#(?=\w|\n|\()", " ", x).strip(),
    lambda x: x.replace("\t", " ").replace("!", " ").strip(),
    PIIRemover(),
)

schema = <pydantic schema>
whitelists = load_whitelist(<path_to_excel>)

metadata_extractor = InstructorBasedOpenAIMetadataExtractor(
    model="gpt-4",
    schema=schema,
    preprocessor=text_processor,
    debug=True,
)
validator = WhitelistBasedMetadataValidator(whitelists=whitelists, fuzzy_threshold=0.95, ...)

text = <document text>
metadata = metadata_extractor(text)
metadata = validator(metadata)
```

Extract in bulk and json dump
```python
from larch.indexing import DocumentMetadataIndexer

metadata_indexer = DocumentMetadataIndexer(
    schema,
    metadata_extractor=metadata_extractor,
    skip_errors=True,
    text_preprocessor=text_processor,
    debug=True,
)

file_paths = <paths>

# start indexing
metadata_indexer.index_documents(paths=file_paths, save_path=<path_to_json_file>)

# load existing indices
metadata_indexer = metadata_indexer.load_index(<path_to_json_file>)

# access the metadata store dict
metadata_indexer.metadata_store
```

Document Indexing
```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import TokenTextSplitter
from langchain.vectorstores import PGVector

from larch.indexing import PaperQADocumentIndexer, LangchainDocumentIndexer

model = "gpt-3.5-turbo-0613"
embedder = OpenAIEmbeddings()
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)

vector_store = PGVector(
    collection_name="test_collection",
    connection_string="postgresql://...",
    embedding_function=embedder,
)

# if vector_store is None, FAISS is used by default
document_indexer = LangchainDocumentIndexer(
    llm=ChatOpenAI(model=model, temperature=0.0),
    text_preprocessor=text_processor,
    vector_store=vector_store,
    # vector_store=FAISS.load_local("../tmp/vectorstore", embeddings=embedder, index_name="test_index"),
    text_splitter=text_splitter,
    debug=True,
)

# get number of chunks in the store
print(document_indexer.num_chunks)
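
# Hedged sketch (assumption: LangchainDocumentIndexer follows the generic
# DocumentIndexer interface described in the Components section above):
document_indexer.index_documents(paths=<paths>)
result = document_indexer.query(<query>, top_k=5)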

# or use paperqa
document_indexer = PaperQADocumentIndexer(
    llm=ChatOpenAI(model=model, temperature=0.0),
    text_preprocessor=text_processor,
    debug=True,
    name="test",
)  # .load_index(<path_to_pickle>)

# get files that are indexed
print(document_indexer.docs)
```

Search Engine
```python
import os

from langchain.chat_models import ChatOpenAI

from larch.indexing import SinequaDocumentIndexer, LangchainDocumentIndexer
from larch.search.engines import InMemoryDocumentQAEngine, SQLAgentSearchEngine, MultiRetrieverSearchEngine

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.0,
)
top_k = 5

engines = [
    SinequaDocumentIndexer(
        base_url=os.environ.get("SINEQUA_BASE_URL"),
        auth_token=os.environ.get("SINEQUA_ACCESS_TOKEN"),
    ),
    LangchainDocumentIndexer(...),
    SQLAgentSearchEngine(
        llm=llm,
        db_uri=<db_uri>,
        tables=None,  # or provide a list of table names
        debug=True,
        prompt_prefix=False,
        query_augmentation_prompt=<prompt_suffix>,
        sql_fuzzy_threshold=0.75,
        railguard_response=True,
    ),
]
# create multi-retriever engine
search_engine = MultiRetrieverSearchEngine(*engines, llm=llm)
query = <query_text>
response = search_engine(query, top_k=top_k)
# we can also use individual engine which has same interface
search_engine = engines[1]
response = search_engine(query, top_k=top_k)
```