RFX Accelerator

Production-style GCP document processing pipeline for RFX and RFP analysis.

Repo root: C:\Download\rfx-accelerator

Layout

  • src/ingestion: source document acquisition
  • src/processing: PDF splitting, OCR parsing, orchestration
  • src/extraction: requirement extraction
  • src/matching: placeholder for future capability matching
  • src/utils: shared helper and logging modules
  • data/raw: downloaded source documents and local samples
  • data/intermediate: chunked PDFs and transient artifacts
  • data/outputs: generated JSON, Markdown, and logs
  • secrets: local-only service account credentials (ignored by git)
  • capabilities: reusable response snippets
  • scripts/run_pipeline.py: top-level CLI wrapper

Credentials

This project now relies on Application Default Credentials (ADC) rather than loading a local service account JSON file in code.

For local development, authenticate once with ADC:

gcloud auth application-default login

For Cloud Run, use the service's runtime service account and grant it the roles required for Vertex AI, Document AI, and Cloud Storage.
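
Before running anything, you can verify that ADC resolves from Python; a minimal sanity check using google.auth (the printed project is whatever ADC happens to be bound to):

# Confirms Application Default Credentials are available; raises
# DefaultCredentialsError if `gcloud auth application-default login` was skipped.
import google.auth

credentials, project_id = google.auth.default()
print(f"ADC resolved for project: {project_id}")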

Run

python .\scripts\run_pipeline.py `
  --gcs-uri gs://rfx-raw-docs/sample-rfp.pdf `
  --project-id rfx-accelerator-parth `
  --processor-id 1accdf647f93691a `
  --bucket-name rfx-raw-docs `
  --location asia-south1 `
  --chunk-size 15 `
  --max-workers 3 `
  --errors-log data/outputs/errors.log `
  --run-extractor `
  --run-matching `
  --cleanup
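
The splitting and fan-out behaviour implied by --chunk-size and --max-workers can be pictured as follows; this is an illustrative sketch assuming pypdf, not the actual code in src/processing:

# Illustrative only: split a PDF into 15-page chunks and process them with 3 workers.
from concurrent.futures import ThreadPoolExecutor
from pypdf import PdfReader, PdfWriter

def split_pdf(path, chunk_size=15):
    # Yield (first_page_index, PdfWriter) for each chunk of at most chunk_size pages.
    reader = PdfReader(path)
    for start in range(0, len(reader.pages), chunk_size):
        writer = PdfWriter()
        for page in reader.pages[start:start + chunk_size]:
            writer.add_page(page)
        yield start, writer

def ocr_chunk(chunk):
    start, writer = chunk
    # A real implementation would serialize the chunk and send it to Document AI here.
    return start

with ThreadPoolExecutor(max_workers=3) as pool:
    done = list(pool.map(ocr_chunk, split_pdf("data/raw/sample-rfp.pdf")))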

AI Matching Pipeline

The Gemini stage now follows this sequence:

  1. Requirement extraction
  2. Requirement classification
  3. Capability embeddings
  4. Hybrid retrieval (embedding + keyword overlap; see the scoring sketch after this list)
  5. Multi-step Gemini response generation
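
A minimal sketch of the blended score implied by the --semantic-weight and --keyword-weight flags below; the Jaccard-over-tokens keyword measure is an assumption, not necessarily what src/matching implements:

import numpy as np

def hybrid_score(query_vec, cap_vec, query_text, cap_text,
                 semantic_weight=0.7, keyword_weight=0.3):
    # Semantic part: cosine similarity between requirement and capability embeddings.
    semantic = float(np.dot(query_vec, cap_vec)
                     / (np.linalg.norm(query_vec) * np.linalg.norm(cap_vec)))
    # Keyword part: Jaccard overlap of lowercased whitespace tokens (assumed metric).
    q, c = set(query_text.lower().split()), set(cap_text.lower().split())
    keyword = len(q & c) / max(len(q | c), 1)
    return semantic_weight * semantic + keyword_weight * keyword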

You can also run the retrieval and response stages directly against an existing requirements.json file:

python .\scripts\run_matching_pipeline.py `
  --requirements-input data/outputs/requirements.json `
  --classification-output data/outputs/requirements_classified.json `
  --capability-dir capabilities `
  --capability-index data/outputs/capabilities.jsonl `
  --output data/outputs/requirement_responses.json `
  --markdown-output data/outputs/requirement_responses.md `
  --project-id rfx-accelerator-parth `
  --location asia-south1 `
  --retrieval-backend local `
  --top-k 5 `
  --final-top-k 3 `
  --coverage-threshold 0.75 `
  --semantic-weight 0.7 `
  --keyword-weight 0.3 `
  --rebuild-capability-index

To use a deployed Vertex AI Vector Search endpoint instead of local retrieval:

python .\scripts\run_matching_pipeline.py `
  --requirements-input data/outputs/requirements.json `
  --capability-index data/outputs/capabilities.jsonl `
  --project-id rfx-accelerator-parth `
  --location asia-south1 `
  --retrieval-backend vertex `
  --top-k 5 `
  --final-top-k 3 `
  --coverage-threshold 0.75 `
  --vertex-endpoint-id 7863856695435329536 `
  --vertex-deployed-index-id rfx-capability-deployment
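
For reference, the vertex backend roughly corresponds to a find_neighbors query against the deployed index; a hedged sketch with the google-cloud-aiplatform SDK, reusing the IDs from the command above (the 768-dimension placeholder is an assumption and must match the index):

from google.cloud import aiplatform

aiplatform.init(project="rfx-accelerator-parth", location="asia-south1")

endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="7863856695435329536",
)
query_embedding = [0.0] * 768  # placeholder; use the requirement's real embedding
response = endpoint.find_neighbors(
    deployed_index_id="rfx-capability-deployment",
    queries=[query_embedding],
    num_neighbors=5,
)
for neighbor in response[0]:
    print(neighbor.id, neighbor.distance)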

To rebuild only the capability embedding library:

python .\scripts\embed_capabilities.py `
  --capability-dir capabilities `
  --output data/outputs/capabilities.jsonl `
  --vertex-output data/outputs/capabilities.vertex.json `
  --project-id rfx-accelerator-parth `
  --location asia-south1

Use data/outputs/capabilities.jsonl for the local RAG pipeline and data/outputs/capabilities.vertex.json for Vertex Vector Search import.
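
The two outputs differ mainly in record shape; the lines below are illustrative only (the local field names are assumptions, while the Vertex file follows the documented id/embedding JSON import format):

One record from capabilities.jsonl (assumed fields):
{"id": "cap-001", "text": "Illustrative capability snippet ...", "embedding": [0.0123, -0.0456]}

One record from capabilities.vertex.json:
{"id": "cap-001", "embedding": [0.0123, -0.0456]}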

About

End-to-end RFX response automation system leveraging Document AI for OCR, Vertex AI Vector Search for hybrid retrieval, and Gemini for classification, reranking, and response generation, deployed via Streamlit and Cloud Run.
