This project implements a Named Entity Recognition (NER) API that extracts biological entities from PDFs. The API accepts a PDF file as input, extracts its text, and sends that text to Vertex AI's Gemini model, which identifies the biological entities.
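The flow described above (PDF in, text out, entities back) can be sketched as a small pipeline. This is an illustration only, not the project's actual code: `run_pipeline`, `extract_text`, and `recognize_entities` are hypothetical names, and the two steps are passed in as callables so the sketch does not assume any particular PDF library or Vertex AI call. Returning an empty list for a PDF with no text is also an assumption.

```python
from typing import Callable, Dict, List

Entity = Dict[str, object]


def run_pipeline(
    pdf_bytes: bytes,
    extract_text: Callable[[bytes], str],
    recognize_entities: Callable[[str], List[Entity]],
) -> Dict[str, List[Entity]]:
    """Hypothetical outline of the request flow: PDF bytes -> text -> entities."""
    # Step 1: turn the uploaded PDF into plain text.
    text = extract_text(pdf_bytes)
    # Assumption: a PDF without extractable text yields no entities
    # rather than an error.
    if not text.strip():
        return {"entities": []}
    # Step 2: hand the text to the NER step (Gemini on Vertex AI in this project).
    return {"entities": recognize_entities(text)}
```

With stand-in callables, `run_pipeline(b"...", lambda b: "ERK1 rose", lambda t: [{"entity": "ERK1"}])` returns a dictionary with a single-item `entities` list.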
Prerequisites:
- Python 3.8+
- Vertex AI account and credentials

Setup:
- Clone the repository:

      git clone https://github.com/amyford/pdf-reader.git
      cd pdf-reader

- Install dependencies:

      python3 -m venv venv
      source venv/bin/activate
      pip install -r requirements.txt

- Set up your Google Cloud project for Vertex AI and authenticate:

      gcloud auth application-default login
      export GOOGLE_CLOUD_PROJECT=your-project-id

For more details, see the Google Cloud website. Make sure the Vertex AI API is enabled for your project so that the service can access it.
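A missing project ID is an easy setup mistake to make, so a startup check can fail early with a clear message. The following is a minimal sketch; `require_project_id` is a hypothetical helper, not part of this repository, and it only inspects the `GOOGLE_CLOUD_PROJECT` variable set in the step above.

```python
import os
from typing import Mapping


def require_project_id(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured project ID or raise with a helpful hint."""
    project = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project:
        raise RuntimeError(
            "GOOGLE_CLOUD_PROJECT is not set; run "
            "'export GOOGLE_CLOUD_PROJECT=your-project-id' first."
        )
    return project
```

Calling `require_project_id()` before starting the server would surface the problem at launch instead of on the first request.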
Run the API locally:

    python3 main.py

Deploy your service to Cloud Run:

    gcloud run deploy --source .

Then use the service URL printed by the deploy command in place of http://127.0.0.1:5000.

Example request:

    curl -X POST -F "file=@path/to/file.pdf" http://127.0.0.1:5000/api/v1/extract

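The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: one multipart form field named `file`, posted to the local endpoint. `build_multipart` and `extract_entities` are hypothetical helper names, and the `application/pdf` part type is an assumption about what the server accepts.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field_name, filename, payload, boundary=None):
    """Encode a single file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_entities(pdf_path, url="http://127.0.0.1:5000/api/v1/extract"):
    """POST a PDF to the extract endpoint and return the decoded JSON reply."""
    path = Path(pdf_path)
    body, content_type = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `extract_entities("path/to/file.pdf")` should return the parsed JSON response as a dictionary.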
Example response:

    {
      "entities": [
        {
          "entity": "COVID-19",
          "context": "... was observed in patients with COVID-19",
          "start": 30,
          "end": 45
        },
        {
          "entity": "ERK1",
          "context": "... elevated levels of ERK1 were seen",
          "start": 10,
          "end": 15
        }
      ]
    }

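A client will usually want to reorganize this response. As a rough sketch, assuming the response always has the shape shown above (an `entities` list whose items carry `entity` and `context`), the following groups every context snippet under its entity name. `group_contexts` is a hypothetical helper, not part of the API.

```python
import json
from collections import defaultdict


def group_contexts(response_text):
    """Map each entity name to the list of context snippets it appeared in."""
    payload = json.loads(response_text)
    grouped = defaultdict(list)
    for item in payload.get("entities", []):
        grouped[item["entity"]].append(item["context"])
    return dict(grouped)
```

An entity mentioned several times in one PDF then appears once as a key, with all of its contexts collected in order.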
Run the unit tests:

    python -m unittest test_main.py
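For readers adding tests of their own, here is a small `unittest` sketch that checks a payload against the response shape documented above. It does not reflect the contents of `test_main.py`; `is_valid_response` and `ResponseShapeTest` are hypothetical names, and the required keys are taken from the example response.

```python
import unittest

# Keys every entity carries in the documented example response.
REQUIRED_KEYS = {"entity", "context", "start", "end"}


def is_valid_response(payload):
    """Return True if payload looks like {"entities": [ {...}, ... ]}."""
    entities = payload.get("entities") if isinstance(payload, dict) else None
    if not isinstance(entities, list):
        return False
    return all(
        isinstance(item, dict) and REQUIRED_KEYS <= item.keys()
        for item in entities
    )


class ResponseShapeTest(unittest.TestCase):
    def test_documented_example_is_valid(self):
        payload = {
            "entities": [
                {
                    "entity": "ERK1",
                    "context": "... elevated levels of ERK1 were seen",
                    "start": 10,
                    "end": 15,
                }
            ]
        }
        self.assertTrue(is_valid_response(payload))

    def test_missing_field_is_rejected(self):
        payload = {"entities": [{"entity": "ERK1"}]}
        self.assertFalse(is_valid_response(payload))
```

Saved in its own file, the test case can be run with the same `python -m unittest` command, passing that file's name instead.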