LogoParser

A Python script that automatically downloads organization logos from Wikipedia infoboxes and Wikimedia Commons. The script searches both sources, filters candidates by file type and filename, and saves the matches to a local directory.

Features

Dual Source Strategy:
- Primary: Extracts logos directly from Wikipedia article infoboxes
- Fallback: Searches Wikimedia Commons for matching logo files
Smart Filtering: Only downloads .svg or .png files whose filenames contain both the organization name and the word "logo"
Automatic Organization: Saves logos with descriptive filenames in an organized output directory
Date Information: Displays each file's timestamp and uses it to sort results (newest first)
Batch Processing: Processes multiple organizations from a simple text file

Requirements

Python 3.7 or higher
Internet connection
Required Python packages (see Installation)

Installation

Clone or download this repository
Install required dependencies:
```
pip install -r requirements.txt
```
This will install:
- requests - For HTTP requests to Wikipedia/Wikimedia APIs
- beautifulsoup4 - For parsing HTML from Wikipedia pages

Usage

Basic Usage

Prepare your organization list: Edit orglist.txt and add one organization name per line:
```
Google
Samsung
NASA
Adidas
```
Run the script:
```
python3 app.py
```
Find your logos: Downloaded logos will be saved in the output/ directory

Organization List Format

The orglist.txt file should contain organization names, one per line:

Empty lines are ignored
Lines starting with # are treated as comments
Organization names are matched case-insensitively
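
The format rules above can be sketched as a small parser. This is an illustration only: `parse_org_list` is a hypothetical name, and the actual code in app.py may read the file differently.

```python
from typing import List


def parse_org_list(text: str) -> List[str]:
    """Return organization names, skipping blank lines and '#' comments."""
    names = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Empty lines are ignored; lines starting with '#' are comments.
        if not line or line.startswith("#"):
            continue
        names.append(line)
    return names


sample = "# Tech companies\nGoogle\n\nSamsung\n  NASA  \n"
print(parse_org_list(sample))  # → ['Google', 'Samsung', 'NASA']
```

To read a real file, the same function could be called as `parse_org_list(Path("orglist.txt").read_text(encoding="utf-8"))`.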

How It Works

Primary Method: Wikipedia Infobox

Searches for the organization's Wikipedia article
Parses the HTML to find the infobox table
Extracts the first image from the infobox <tbody>
Validates that the image filename contains "logo"
If valid, downloads the logo
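
The parsing and validation steps above might look roughly like the sketch below. It is not the code from app.py: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `InfoboxImageFinder` and `infobox_logo_src` are hypothetical. It also assumes the infobox is a `<table>` whose `class` attribute includes `infobox`.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import unquote


class InfoboxImageFinder(HTMLParser):
    """Records the src of the first <img> inside a table with class 'infobox'."""

    def __init__(self) -> None:
        super().__init__()
        self.table_depth = 0  # 0 means "not inside an infobox table"
        self.first_image_src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "table":
            classes = (attr_map.get("class") or "").split()
            if self.table_depth > 0:
                self.table_depth += 1  # nested table inside the infobox
            elif "infobox" in classes:
                self.table_depth = 1
        elif tag == "img" and self.table_depth > 0 and self.first_image_src is None:
            self.first_image_src = attr_map.get("src")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def infobox_logo_src(html: str) -> Optional[str]:
    """Return the first infobox image src if its filename contains 'logo'."""
    finder = InfoboxImageFinder()
    finder.feed(html)
    src = finder.first_image_src
    if src and "logo" in unquote(src).rsplit("/", 1)[-1].lower():
        return src
    return None
```

The function returns the image `src` exactly as it appears in the page. Resolving that value to a downloadable file URL is left out of the sketch.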

Fallback Method: Wikimedia Commons Search

If the infobox method doesn't return a logo (or the result doesn't contain "logo" in the filename), the script:

Searches Wikimedia Commons using multiple search terms:
- "[Org] logo"
- "Logo of [Org]"
- "[Org] emblem"
- "[Org] wordmark"
Filters results to only include:
- Files with .svg or .png extensions
- Filenames containing the organization name
- Filenames containing the word "logo"
Sorts the filtered results by upload date and keeps the 5 most recent files
Downloads every file that remains after filtering
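
The search terms and filter rules above can be illustrated with a short sketch. The function names, the constants, and the shape of each candidate (a dict with `title` and `timestamp` keys, with ISO 8601 timestamps) are assumptions made for the example, not taken from app.py.

```python
from typing import Dict, List

ALLOWED_EXTENSIONS = (".svg", ".png")
MAX_RESULTS = 5  # the README states a limit of 5 files


def build_search_terms(org: str) -> List[str]:
    """Return the four search phrases listed above for one organization."""
    return [f"{org} logo", f"Logo of {org}", f"{org} emblem", f"{org} wordmark"]


def filter_candidates(org: str, candidates: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep .svg/.png files naming the organization and 'logo', newest first."""
    org_lower = org.lower()
    matches = []
    for item in candidates:
        title = item["title"].lower()
        if (title.endswith(ALLOWED_EXTENSIONS)
                and org_lower in title
                and "logo" in title):
            matches.append(item)
    # ISO 8601 timestamps in the same timezone sort correctly as plain strings.
    matches.sort(key=lambda item: item["timestamp"], reverse=True)
    return matches[:MAX_RESULTS]


candidates = [
    {"title": "File:Samsung old logo before year 2015.svg", "timestamp": "2022-11-28T09:00:00Z"},
    {"title": "File:Samsung Knox logo.svg", "timestamp": "2022-12-05T12:00:00Z"},
    {"title": "File:Samsung headquarters.jpg", "timestamp": "2023-01-01T00:00:00Z"},
    {"title": "File:Apple logo black.svg", "timestamp": "2023-02-01T00:00:00Z"},
]
for item in filter_candidates("Samsung", candidates):
    print(item["title"], item["timestamp"][:10])
```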

Result Combination

Results from both methods are combined
Duplicate files are automatically removed
All results are sorted by date (newest first)
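
A minimal sketch of the combination step, assuming each result is a dict with `title` and `timestamp` keys and that two results are duplicates when their titles match after normalizing underscores and case. Both the data shape and the duplicate rule are assumptions; app.py may compare URLs or other fields instead.

```python
from typing import Dict, List


def combine_results(infobox_results: List[Dict[str, str]],
                    commons_results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge both result lists, drop duplicate titles, and sort newest first."""
    seen = set()
    combined = []
    for item in list(infobox_results) + list(commons_results):
        key = item["title"].replace("_", " ").lower()
        if key in seen:
            continue  # the same file was already found by the other method
        seen.add(key)
        combined.append(item)
    combined.sort(key=lambda item: item["timestamp"], reverse=True)
    return combined
```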

Output

Directory Structure

WikimediaLogos/
├── app.py
├── orglist.txt
├── requirements.txt
└── output/
    ├── Google_Google_2015_logo.svg
    ├── Samsung_Samsung_Knox_logo.svg
    ├── NASA_NASA_Worm_logo.svg
    └── Adidas_Adidas_2022_logo.svg

Filename Format

Downloaded files are named using the pattern:

{Organization}_{Original_Filename}.{ext}

If multiple files are found, they are numbered:

{Organization}_{index}_{Original_Filename}.{ext}
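
The two patterns can be sketched as a single helper. The name `build_filename` is hypothetical, and the details (dropping the `File:` prefix, replacing spaces with underscores, leaving the first file unnumbered) are inferred from the example filenames in this README rather than from app.py.

```python
def build_filename(org: str, title: str, index: int = 0) -> str:
    """Build an output filename from an organization name and a file title.

    An index of 0 produces the unnumbered pattern; any other index is
    inserted between the organization name and the original filename.
    """
    name = title[len("File:"):] if title.startswith("File:") else title
    name = name.replace(" ", "_")
    org_part = org.replace(" ", "_")
    if index:
        return f"{org_part}_{index}_{name}"
    return f"{org_part}_{name}"


print(build_filename("Google", "File:Google 2015 logo.svg"))
# → Google_Google_2015_logo.svg
print(build_filename("Samsung", "File:Samsung old logo before year 2015.svg", 1))
# → Samsung_1_Samsung_old_logo_before_year_2015.svg
```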

Console Output

The script provides detailed output showing:

Source of each logo (Wikipedia infobox or Wikimedia Commons)
Number of files found
File titles and dates
Download status for each file

Example output:

Google: Found 1 matching file(s) from Wikipedia infobox
  [1] File:Google 2015 logo.svg (2016-02-13) → Google_Google_2015_logo.svg

Samsung: Found 5 matching file(s) from Wikipedia infobox + Wikimedia Commons
  [1] File:Samsung Knox logo.svg (2022-12-05) → Samsung_Samsung_Knox_logo.svg
  [2] File:Samsung old logo before year 2015.svg (2022-11-28) → Samsung_1_Samsung_old_logo_before_year_2015.svg
  ...

Configuration

Custom Paths

You can modify the paths in app.py:

from pathlib import Path

ORG_LIST_PATH = Path(__file__).with_name("orglist.txt")  # Organization list file
OUTPUT_DIR = Path(__file__).with_name("output")          # Output directory

User-Agent

The script includes a User-Agent header as required by Wikimedia APIs. You can customize it in app.py:

headers = {
    "User-Agent": "WikimediaLogos/1.0 (https://example.com/contact)"
}
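
As a rough illustration of where such a header goes, the sketch below builds (but does not send) a Wikimedia Commons search request. It uses the standard library's `urllib` rather than requests so that it runs without dependencies. The function name and the exact choice of query parameters are assumptions for the example and may not match what app.py sends.

```python
from urllib.parse import urlencode
from urllib.request import Request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "WikimediaLogos/1.0 (https://example.com/contact)"}


def build_commons_search_request(search_term: str, limit: int = 20) -> Request:
    """Build a MediaWiki API request that searches the File: namespace."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": search_term,
        "gsrnamespace": 6,          # namespace 6 holds "File:" pages
        "gsrlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|timestamp",  # ask for the file URL and upload time
    }
    return Request(f"{COMMONS_API}?{urlencode(params)}", headers=HEADERS)


request = build_commons_search_request("Google logo")
print(request.full_url)
print(request.get_header("User-agent"))
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call. With requests, the same `params` and `headers` dictionaries could be passed to `requests.get`.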

Error Handling

The script handles the following error conditions:

Network connection issues
Missing Wikipedia articles
Invalid API responses
File download failures
Missing or invalid organization names

Errors are logged to the console but don't stop the batch processing of other organizations.
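
The log-and-continue behavior described above could be structured like the sketch below. `process_all` and the `fetch_logos` callback are hypothetical names, and catching `Exception` broadly is a simplification; the real script may catch narrower exception types.

```python
from typing import Callable, Dict, List


def process_all(orgs: List[str], fetch_logos: Callable[[str], int]) -> Dict[str, str]:
    """Run fetch_logos for every organization, logging failures and moving on."""
    status = {}
    for org in orgs:
        try:
            count = fetch_logos(org)
            status[org] = f"ok ({count} file(s))"
        except Exception as exc:  # one failure must not abort the whole batch
            print(f"{org}: error: {exc}")
            status[org] = f"failed: {exc}"
    return status


def fake_fetch(org: str) -> int:
    """Stand-in for the real download step, used only for this demonstration."""
    if org == "Nowhere Inc":
        raise LookupError("no Wikipedia article found")
    return 1


print(process_all(["Google", "Nowhere Inc", "NASA"], fake_fetch))
```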

Limitations

Requires an internet connection to access Wikipedia and Wikimedia Commons
Some organizations may not have Wikipedia articles or infobox logos
Logo availability depends on what's uploaded to Wikimedia Commons
File filtering is based on filename patterns, which may miss some valid logos

License

This project is provided as-is for educational and personal use.

Contributing

Feel free to submit issues or pull requests for improvements!
