Investigate and establish the machine learning framework for the classification and extraction models. This includes evaluating the suitability of PyTorch and scikit-learn for different components of the pipeline and setting up a reproducible training environment.
Goals
- Determine which framework (or combination) best supports the two core tasks:
  - Document classification (relevance of PDFs to predator diet data).
  - Data field extraction (counts, species, location, year, etc.).
- Set up a reproducible development environment.
- Document dependencies and installation steps in requirements.txt and README.md.
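
For the reproducibility goal, one piece that is easy to overlook is seeding the random number generators, so that repeated training runs are comparable. The following is a minimal sketch, not a settled design: the helper name `set_seed` and the default seed of 42 are hypothetical, and it assumes NumPy and PyTorch may or may not be installed at the time it is called.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators that are available (hypothetical helper)."""
    # Recorded for child processes; this does not change string hashing
    # in the interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed)

    # Python's built-in generator.
    random.seed(seed)

    # NumPy's global generator, which scikit-learn falls back to when an
    # estimator is created without an explicit random_state.
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass

    # PyTorch's default generator, only if PyTorch ends up being chosen.
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass
```

Calling `set_seed(...)` once at the start of each training script would be one way to use it. Passing an explicit `random_state` to scikit-learn estimators and data splitters is generally more robust than relying on the global NumPy seed. Seeding alone does not guarantee bit-identical results on GPU, so pinning exact package versions in requirements.txt remains part of the setup.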