This Python project contains utilities to help with Minnesota Ornithologists' Union (MOU) data retrieval. The MOU website is rather dated and has no API, but its unsophisticated security lets us pull data using web automation. Because the data is pulled using Puppeteer, the project wraps a Node.js application. You don't have to install Node yourself, since it is bundled in the project as a dependency.
To run:
- Execute `source ./setup.sh` to set up the virtual environment and install dependencies.
  - The script must be sourced as shown!
  - This will install dependencies and run the build for the Node.js portion of the project.
  - The script will copy the `sample.env` file to `.env`.
  - It also sets up githooks to update dependencies and rebuild the Node.js portion when code is pulled from the repo.
- Edit the `.env` file with your MOU login credentials.
- Then you can run a script using `python mou_data_python/<script_name> [args]`.
  - For example, run `python mou_data_python/rqd_data.py -h` to see the available options for the RQD data retrieval script.
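
To make the role of the `.env` file concrete, here is a minimal sketch of how a Python script could read `KEY=VALUE` credentials from it. The function names and the variable names `MOU_USERNAME` and `MOU_PASSWORD` are hypothetical; check `sample.env` for the real keys, and note that the project's own loader may work differently.

```python
from pathlib import Path
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Read and parse an env file from disk."""
    return parse_env_text(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Hypothetical key name; only reports whether it is present.
    settings = load_env_file()
    print("username set:", "MOU_USERNAME" in settings)
```
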
See the `mou_data_node` directory for more details on the Node.js portion of the project.
You can run the Node.js scripts directly if you prefer - the README in that directory has more information.
For those who are more interested in the technical approach, this project illustrates:
- Using Puppeteer to log into a website and retrieve data.
- Wrapping a Node.js application to be used in a Python project.
- Leveraging Node.js transform streams and async generators to handle a potentially large amount of data while avoiding performance and memory issues.
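
The wrapping and streaming ideas above can be sketched from the Python side as follows. This is an illustration under stated assumptions, not the project's actual wrapper: it assumes the child process writes one JSON object per line to stdout, and `stream_json_lines` is a hypothetical helper name. Because the real Node.js entry point is not named here, the demo uses a small Python child process as a stand-in; in practice the command would start with `node` followed by the built script.

```python
import json
import subprocess
import sys
from typing import Iterator, Sequence


def stream_json_lines(command: Sequence[str]) -> Iterator[dict]:
    """Run a child process and lazily yield one parsed JSON object per stdout line.

    Records are consumed as they arrive, so the full output never has to
    be held in memory at once.
    """
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)
    # Leaving the `with` block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"{command[0]} exited with status {proc.returncode}")


if __name__ == "__main__":
    # Stand-in child process that emits three JSON records, one per line.
    child_code = "import json\nfor i in range(3):\n    print(json.dumps({'n': i}))"
    for record in stream_json_lines([sys.executable, "-c", child_code]):
        print(record)
```

A generator keeps the Python side in step with the Node.js side: each record is handled as soon as the child emits it, which mirrors what transform streams and async generators do inside the Node.js application.
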