Production-style GCP document processing pipeline for RFX and RFP analysis.
Repo root: C:\Download\rfx-accelerator
Repository layout:

- src/ingestion: source document acquisition
- src/processing: PDF splitting, OCR parsing, orchestration
- src/extraction: requirement extraction
- src/matching: placeholder for future capability matching
- src/utils: shared helper and logging modules
- data/raw: downloaded source documents and local samples
- data/intermediate: chunked PDFs and transient artifacts
- data/outputs: generated JSON, Markdown, and logs
- secrets: local-only service account credentials (ignored by git)
- capabilities: reusable response snippets
- scripts/run_pipeline.py: top-level CLI wrapper
This project now relies on Application Default Credentials instead of loading a local service account JSON from code.
For local development, authenticate once with ADC:
gcloud auth application-default login

For Cloud Run, use the service's attached IAM identity and grant it the required roles for Vertex AI, Document AI, and Cloud Storage.
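Because credentials come from ADC, the Python code never loads a key file. As a minimal sketch (the helper name is illustrative; the bucket is the project's raw-documents bucket), the Google Cloud client libraries pick the credentials up automatically:

```python
# Minimal sketch: with ADC configured, client libraries resolve credentials
# on their own -- locally from `gcloud auth application-default login`,
# on Cloud Run from the attached service identity.
from google.cloud import storage

def list_raw_documents(bucket_name: str = "rfx-raw-docs") -> list[str]:
    client = storage.Client()  # no service account JSON passed in code
    return [blob.name for blob in client.list_blobs(bucket_name)]
```

With ADC in place, the end-to-end pipeline runs from the top-level CLI wrapper: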
python .\scripts\run_pipeline.py `
--gcs-uri gs://rfx-raw-docs/sample-rfp.pdf `
--project-id rfx-accelerator-parth `
--processor-id 1accdf647f93691a `
--bucket-name rfx-raw-docs `
--location asia-south1 `
--chunk-size 15 `
--max-workers 3 `
--errors-log data/outputs/errors.log `
--run-extractor `
--run-matching `
--cleanup
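The splitting and OCR stage is driven by the --chunk-size and --max-workers flags above. A rough sketch of how pre-split PDF chunks might be sent to Document AI in parallel (function names and the chunking step are illustrative, not the pipeline's actual module layout):

```python
# Illustrative sketch only: OCR pre-split PDF chunks with Document AI using a
# small worker pool, mirroring the --chunk-size / --max-workers flags.
from concurrent.futures import ThreadPoolExecutor
from google.cloud import documentai

def ocr_chunk(pdf_bytes: bytes, project_id: str, location: str, processor_id: str) -> str:
    client = documentai.DocumentProcessorServiceClient(
        client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
    )
    name = client.processor_path(project_id, location, processor_id)
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=pdf_bytes, mime_type="application/pdf"),
    )
    return client.process_document(request=request).document.text

def ocr_chunks(chunks: list[bytes], project_id: str, location: str,
               processor_id: str, max_workers: int = 3) -> list[str]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda c: ocr_chunk(c, project_id, location, processor_id), chunks))
```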

The Gemini stage now follows this sequence:

- Requirement extraction
- Requirement classification
- Capability embeddings
- Hybrid retrieval (embedding + keyword overlap); see the scoring sketch after this list
- Multi-step Gemini response generation
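Hybrid retrieval blends embedding similarity with simple keyword overlap. A minimal sketch of the scoring, assuming cosine similarity on the semantic side and token-set overlap on the keyword side (function names are illustrative; the default weights mirror the --semantic-weight and --keyword-weight flags shown below):

```python
# Illustrative hybrid scoring: weighted blend of cosine similarity between
# embeddings and token-overlap between the requirement and capability text.
import numpy as np

def keyword_overlap(query: str, text: str) -> float:
    q_tokens, t_tokens = set(query.lower().split()), set(text.lower().split())
    return len(q_tokens & t_tokens) / max(len(q_tokens), 1)

def hybrid_score(query_vec: np.ndarray, cap_vec: np.ndarray,
                 query_text: str, cap_text: str,
                 semantic_weight: float = 0.7, keyword_weight: float = 0.3) -> float:
    semantic = float(np.dot(query_vec, cap_vec) /
                     (np.linalg.norm(query_vec) * np.linalg.norm(cap_vec)))
    return semantic_weight * semantic + keyword_weight * keyword_overlap(query_text, cap_text)
```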
You can also run the retrieval and response stages directly against an existing
requirements.json file:
python .\scripts\run_matching_pipeline.py `
--requirements-input data/outputs/requirements.json `
--classification-output data/outputs/requirements_classified.json `
--capability-dir capabilities `
--capability-index data/outputs/capabilities.jsonl `
--output data/outputs/requirement_responses.json `
--markdown-output data/outputs/requirement_responses.md `
--project-id rfx-accelerator-parth `
--location asia-south1 `
--retrieval-backend local `
--top-k 5 `
--final-top-k 3 `
--coverage-threshold 0.75 `
--semantic-weight 0.7 `
--keyword-weight 0.3 `
--rebuild-capability-index
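The response stage then feeds the top-ranked capability snippets to Gemini alongside each requirement. A rough sketch of a single generation step using the Vertex AI SDK (the model name and prompt wording are assumptions, not the pipeline's actual prompt):

```python
# Illustrative sketch: draft a response for one requirement from its
# top-ranked capability snippets. Model name and prompt are assumptions.
import vertexai
from vertexai.generative_models import GenerativeModel

def draft_response(requirement: str, capabilities: list[str],
                   project_id: str, location: str) -> str:
    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-1.5-pro")
    prompt = (
        "Requirement:\n" + requirement + "\n\n"
        "Relevant capability snippets:\n" + "\n---\n".join(capabilities) + "\n\n"
        "Draft a concise response describing how these capabilities address the requirement."
    )
    return model.generate_content(prompt).text
```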

To use a deployed Vertex Vector Search endpoint instead of local retrieval:

python .\scripts\run_matching_pipeline.py `
--requirements-input data/outputs/requirements.json `
--capability-index data/outputs/capabilities.jsonl `
--project-id rfx-accelerator-parth `
--location asia-south1 `
--retrieval-backend vertex `
--top-k 5 `
--final-top-k 3 `
--coverage-threshold 0.75 `
--vertex-endpoint-id 7863856695435329536 `
--vertex-deployed-index-id rfx-capability-deployment
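With the vertex backend, nearest-neighbour lookups presumably go through the deployed index endpoint rather than the local JSONL index. A minimal sketch using the Vertex AI SDK (the endpoint and deployed-index IDs echo the flags above; everything else is illustrative):

```python
# Illustrative sketch: query a deployed Vertex AI Vector Search index for the
# nearest capability embeddings to one requirement embedding.
from google.cloud import aiplatform

def find_capability_neighbors(query_embedding: list[float],
                              project_id: str, location: str,
                              endpoint_id: str = "7863856695435329536",
                              deployed_index_id: str = "rfx-capability-deployment",
                              top_k: int = 5):
    aiplatform.init(project=project_id, location=location)
    endpoint = aiplatform.MatchingEngineIndexEndpoint(index_endpoint_name=endpoint_id)
    # Returns one list of neighbours per query vector; we pass a single query.
    return endpoint.find_neighbors(
        deployed_index_id=deployed_index_id,
        queries=[query_embedding],
        num_neighbors=top_k,
    )[0]
```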

To rebuild only the capability embedding library:

python .\scripts\embed_capabilities.py `
--capability-dir capabilities `
--output data/outputs/capabilities.jsonl `
--vertex-output data/outputs/capabilities.vertex.json `
--project-id rfx-accelerator-parth `
--location asia-south1

Use data/outputs/capabilities.jsonl for the local RAG pipeline and data/outputs/capabilities.vertex.json for Vertex Vector Search import.
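As a rough sketch of what this stage produces, assuming the capability snippets are Markdown files and that the local index keeps the snippet text alongside its vector (the file layout, field names, and embedding model are assumptions; Vertex Vector Search import records need at least an id and an embedding per line):

```python
# Illustrative sketch: embed capability snippets and write both index formats.
# Local JSONL field names are assumptions; the Vertex records carry only the
# "id" and "embedding" fields needed for index import.
import json
from pathlib import Path

import vertexai
from vertexai.language_models import TextEmbeddingModel

def embed_capabilities(capability_dir: str, local_out: str, vertex_out: str,
                       project_id: str, location: str) -> None:
    vertexai.init(project=project_id, location=location)
    model = TextEmbeddingModel.from_pretrained("text-embedding-004")  # assumed model
    local_records, vertex_records = [], []
    for path in sorted(Path(capability_dir).glob("*.md")):  # assumed snippet format
        text = path.read_text(encoding="utf-8")
        vector = model.get_embeddings([text])[0].values
        local_records.append({"id": path.stem, "text": text, "embedding": vector})
        vertex_records.append({"id": path.stem, "embedding": vector})
    Path(local_out).write_text(
        "\n".join(json.dumps(r) for r in local_records), encoding="utf-8")
    Path(vertex_out).write_text(
        "\n".join(json.dumps(r) for r in vertex_records), encoding="utf-8")
```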