A robust Python crawler to extract, normalize, and export the list of U.S. embassies and consulates worldwide from the U.S. Department of State website. The project provides detailed embassy/consulate information in CSV, JSON, and YAML formats, with features for caching, progress tracking, and continent detection.
- Crawls the official U.S. State Department embassy/consulate list
- Extracts detailed information: country, city, code, continent, full name, address, telephone, fax, email, website, cancel/reschedule info, Google Maps link
- Robust HTML parsing with caching for efficiency
- Auto-detects continent (supports English country/city names)
- Exports data to CSV, JSON, and YAML
- Progress bar and logging for user feedback
- Deduplication and navigation link filtering
- Modular, maintainable codebase using Python best practices
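The multi-format export feature can be pictured with a small stdlib-only sketch (the `export_all` helper and the field names are illustrative, not the project's actual API; YAML output is omitted here to avoid the third-party `PyYAML` dependency):

```python
import csv
import json

# Illustrative records with a subset of the exported fields (names assumed).
records = [
    {"country": "France", "city": "Paris", "continent": "Europe"},
    {"country": "Japan", "city": "Tokyo", "continent": "Asia"},
]

def export_all(rows, stem="us_embassies_consulates"):
    """Write the same rows to <stem>.csv and <stem>.json."""
    with open(f"{stem}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    with open(f"{stem}.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    export_all(records)
```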
- Clone the repository:

  ```shell
  git clone https://github.com/BaseMax/us-embassies-consulates.git
  cd us-embassies-consulates
  ```

- Install dependencies:

  ```shell
  pip install .
  ```

  Or, for development:

  ```shell
  pip install -e .
  ```

  This project uses PEP 621 and `pyproject.toml` for dependency management. No `requirements.txt` is needed.

- Run the crawler:

  ```shell
  python app.py
  ```

- Output files:

  - `us_embassies_consulates.csv`
  - `us_embassies_consulates.json`
  - `us_embassies_consulates.yml`
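Once `python app.py` has produced these files, downstream scripts can consume them; a minimal sketch (the `continent` field name is assumed from the feature list above, and the helper names are hypothetical):

```python
import json
import pathlib

def load_posts(path="us_embassies_consulates.json"):
    """Load exported records from the JSON output, or [] if it is missing."""
    p = pathlib.Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else []

def by_continent(records, continent):
    """Filter records by their detected continent."""
    return [r for r in records if r.get("continent") == continent]
```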
- `app.py` — Main crawler and exporter script
- `cache/` — Cached HTML pages for efficiency
- `us_embassies_consulates.csv` — Exported embassy/consulate data (CSV)
- `us_embassies_consulates.json` — Exported data (JSON)
- `us_embassies_consulates.yml` — Exported data (YAML)
- Continent Mapping:
  - The script auto-detects the continent from country/city (supports English names)
- Caching:
  - HTML pages are cached in `.cache/` to minimize repeated requests
- Logging & Progress:
  - Uses Python `logging` and `tqdm` for progress bars
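The caching behaviour described above can be sketched as a URL-keyed file cache (a simplified stand-in for whatever `app.py` actually does; the hashing scheme and file naming are assumptions):

```python
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path(".cache")  # cache directory named in the notes above

def fetch_cached(url: str) -> str:
    """Return the HTML for `url`, reusing a cached copy when one exists."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.html"
    if cache_file.exists():  # cache hit: no network request
        return cache_file.read_text(encoding="utf-8")
    with urllib.request.urlopen(url) as resp:  # cache miss: fetch and store
        body = resp.read().decode("utf-8", errors="replace")
    cache_file.write_text(body, encoding="utf-8")
    return body
```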
MIT License
© 2025 Seyyed Ali Mohammadiyeh (MAX BASE)
See LICENSE for details.