abhijit-me/LogoParser

LogoParser

A Python script that automatically downloads organization logos from Wikipedia infoboxes and Wikimedia Commons. The script intelligently searches for logos, filters them by file type and content, and saves them to a local directory.

Features

  • Dual Source Strategy:
    • Primary: Extracts logos directly from Wikipedia article infoboxes
    • Fallback: Searches Wikimedia Commons for matching logo files
  • Smart Filtering: Only downloads .svg or .png files whose filenames contain both the organization name and the word "logo"
  • Automatic Organization: Saves logos with descriptive filenames in an organized output directory
  • Date Information: Displays file timestamps and uses them to sort results (newest first)
  • Batch Processing: Processes multiple organizations from a simple text file

Requirements

  • Python 3.7 or higher
  • Internet connection
  • Required Python packages (see Installation)

Installation

  1. Clone or download this repository

  2. Install required dependencies:

    pip install -r requirements.txt

    This will install:

    • requests - For HTTP requests to Wikipedia/Wikimedia APIs
    • beautifulsoup4 - For parsing HTML from Wikipedia pages

Usage

Basic Usage

  1. Prepare your organization list: Edit orglist.txt and add one organization name per line:

    Google
    Samsung
    NASA
    Adidas
    
  2. Run the script:

    python3 app.py

  3. Find your logos: Downloaded logos will be saved in the output/ directory

Organization List Format

The orglist.txt file should contain organization names, one per line:

  • Empty lines are ignored
  • Lines starting with # are treated as comments
  • Organization names are matched case-insensitively
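
The rules above can be sketched as a small parser. This is a minimal illustration, not the code in app.py; the function names `parse_org_lines` and `load_org_list` are hypothetical.

```python
from pathlib import Path
from typing import List


def parse_org_lines(lines: List[str]) -> List[str]:
    """Return organization names, skipping blank lines and # comments."""
    orgs = []
    for raw in lines:
        name = raw.strip()
        if not name or name.startswith("#"):
            continue  # empty line or comment
        orgs.append(name)
    return orgs


def load_org_list(path: Path) -> List[str]:
    """Read the list file (assumed UTF-8) and parse it."""
    return parse_org_lines(path.read_text(encoding="utf-8").splitlines())
```

For example, `parse_org_lines(["Google", "", "# comment", "  NASA  "])` keeps only `Google` and `NASA`.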

How It Works

Primary Method: Wikipedia Infobox

  1. Searches for the organization's Wikipedia article
  2. Parses the HTML to find the infobox table
  3. Extracts the first image from the infobox <tbody>
  4. Validates that the image filename contains "logo"
  5. If valid, downloads the logo
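
Steps 2 to 4 can be sketched as follows. The script itself uses beautifulsoup4; to keep the example dependency-free, this sketch swaps in the standard library's `html.parser` instead. The class and function names are hypothetical, and it assumes the infobox is a `<table>` whose class list includes `infobox`.

```python
from html.parser import HTMLParser
from typing import Optional


class InfoboxImageFinder(HTMLParser):
    """Record the src of the first <img> inside a table with class 'infobox'."""

    def __init__(self) -> None:
        super().__init__()
        self.table_depth = 0  # nesting depth of <table> tags, counted from the infobox
        self.first_src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "table":
            classes = (attr.get("class") or "").split()
            # Start counting at the infobox table; also count tables nested in it.
            if self.table_depth or "infobox" in classes:
                self.table_depth += 1
        elif tag == "img" and self.table_depth and self.first_src is None:
            self.first_src = attr.get("src")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def infobox_logo_src(html: str) -> Optional[str]:
    """Return the first infobox image src if its filename contains 'logo'."""
    finder = InfoboxImageFinder()
    finder.feed(html)
    src = finder.first_src
    if src and "logo" in src.rsplit("/", 1)[-1].lower():
        return src
    return None
```

A `None` result is the signal to try the fallback method.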

Fallback Method: Wikimedia Commons Search

If the infobox method doesn't return a logo (or the result doesn't contain "logo" in the filename), the script:

  1. Searches Wikimedia Commons using multiple search terms:
    • "[Org] logo"
    • "Logo of [Org]"
    • "[Org] emblem"
    • "[Org] wordmark"
  2. Filters results to only include:
    • Files with .svg or .png extensions
    • Filenames containing the organization name
    • Filenames containing the word "logo"
  3. Sorts the filtered results by upload date and keeps the 5 most recent files
  4. Downloads every file that remains
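
The search terms and filter rules above can be sketched like this. It is an illustration only: the record shape (a dict with `title` and `timestamp` keys) and the function names are assumptions, not taken from app.py.

```python
from typing import Dict, List

ALLOWED_EXTENSIONS = (".svg", ".png")
MAX_RESULTS = 5


def search_terms(org: str) -> List[str]:
    """Build the four search phrases listed above."""
    return [f"{org} logo", f"Logo of {org}", f"{org} emblem", f"{org} wordmark"]


def filter_candidates(org: str, candidates: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep .svg/.png files naming the org and 'logo'; newest first, at most 5."""
    org_lower = org.lower()
    kept = []
    for item in candidates:
        title = item["title"].lower()
        if (title.endswith(ALLOWED_EXTENSIONS)
                and org_lower in title
                and "logo" in title):
            kept.append(item)
    # Assumes ISO 8601 timestamps in one format, which sort correctly as strings.
    kept.sort(key=lambda item: item["timestamp"], reverse=True)
    return kept[:MAX_RESULTS]
```

Because the filter is case-insensitive, `samsung` in orglist.txt matches `File:Samsung Knox logo.svg`.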

Result Combination

  • Results from both methods are combined
  • Duplicate files are automatically removed
  • All results are sorted by date (newest first)
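
A minimal sketch of the combination step, assuming each result is a dict with `title` and `timestamp` keys and that two results are duplicates when their titles match after normalizing underscores and case. Both the record shape and the duplicate rule are assumptions; app.py may compare files differently.

```python
from typing import Dict, List


def combine_results(infobox: List[Dict[str, str]],
                    commons: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge both result lists, drop duplicate titles, sort newest first."""
    seen = set()
    merged = []
    for item in infobox + commons:
        # "File:Google_2015_logo.svg" and "File:Google 2015 logo.svg" are one file.
        key = item["title"].replace("_", " ").lower()
        if key in seen:
            continue
        seen.add(key)
        merged.append(item)
    merged.sort(key=lambda item: item["timestamp"], reverse=True)
    return merged
```

Infobox results come first in the loop, so when both sources return the same file the infobox entry is the one kept.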

Output

Directory Structure

WikimediaLogos/
├── app.py
├── orglist.txt
├── requirements.txt
└── output/
    ├── Google_Google_2015_logo.svg
    ├── Samsung_Samsung_Knox_logo.svg
    ├── NASA_NASA_Worm_logo.svg
    └── Adidas_Adidas_2022_logo.svg

Filename Format

Downloaded files are named using the pattern:

  • {Organization}_{Original_Filename}.{ext}

If multiple files are found, the additional files are numbered (in the example output, the first file keeps the unnumbered name):

  • {Organization}_{index}_{Original_Filename}.{ext}
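
The naming pattern can be sketched as below. The function name `output_filename` is hypothetical, and the choice to leave index 0 unnumbered is inferred from the example console output rather than confirmed from app.py.

```python
import re


def output_filename(org: str, title: str, index: int = 0) -> str:
    """Build '{Organization}_{Original_Filename}' with an optional index."""
    # Drop the "File:" namespace prefix that Wikimedia titles carry.
    name = title[len("File:"):] if title.startswith("File:") else title
    safe_name = re.sub(r"\s+", "_", name.strip())
    safe_org = re.sub(r"\s+", "_", org.strip())
    if index:
        return f"{safe_org}_{index}_{safe_name}"
    return f"{safe_org}_{safe_name}"
```

With these rules, `File:Google 2015 logo.svg` for `Google` becomes `Google_Google_2015_logo.svg`, which matches the directory listing above.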

Console Output

The script provides detailed output showing:

  • Source of each logo (Wikipedia infobox or Wikimedia Commons)
  • Number of files found
  • File titles and dates
  • Download status for each file

Example output:

Google: Found 1 matching file(s) from Wikipedia infobox
  [1] File:Google 2015 logo.svg (2016-02-13) → Google_Google_2015_logo.svg

Samsung: Found 5 matching file(s) from Wikipedia infobox + Wikimedia Commons
  [1] File:Samsung Knox logo.svg (2022-12-05) → Samsung_Samsung_Knox_logo.svg
  [2] File:Samsung old logo before year 2015.svg (2022-11-28) → Samsung_1_Samsung_old_logo_before_year_2015.svg
  ...

Configuration

Custom Paths

You can modify the paths in app.py:

ORG_LIST_PATH = Path(__file__).with_name("orglist.txt")  # Organization list file
OUTPUT_DIR = Path(__file__).with_name("output")          # Output directory

User-Agent

The script includes a User-Agent header as required by Wikimedia APIs. You can customize it in app.py:

headers = {
    "User-Agent": "WikimediaLogos/1.0 (https://example.com/contact)"
}
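
As a sketch of where the header ends up, the snippet below attaches it to a Wikimedia Commons search request. It uses the standard library's `urllib` instead of requests so that it runs without extra packages. The endpoint and query parameters are standard MediaWiki API ones, but whether app.py issues exactly this query is an assumption, and `build_search_request` is a hypothetical name.

```python
from urllib.parse import urlencode
from urllib.request import Request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "WikimediaLogos/1.0 (https://example.com/contact)"}


def build_search_request(term: str, limit: int = 20) -> Request:
    """Prepare (but do not send) a file search request carrying the User-Agent."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srnamespace": 6,  # namespace 6 is File:
        "srlimit": limit,
        "format": "json",
    }
    return Request(f"{COMMONS_API}?{urlencode(params)}", headers=HEADERS)
```

With requests, the same dictionary is passed as `requests.get(url, params=params, headers=headers)`. Replace the example.com contact URL with your own.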

Error Handling

The script handles the following error cases:

  • Network connection issues
  • Missing Wikipedia articles
  • Invalid API responses
  • File download failures
  • Missing or invalid organization names

Errors are logged to the console and do not stop batch processing of the remaining organizations.
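
The "log and continue" behaviour can be sketched as a loop that isolates each organization in its own try block. Here `process_all` and the `fetch` callback are hypothetical stand-ins for whatever app.py does per organization.

```python
from typing import Callable, Dict, List


def process_all(orgs: List[str], fetch: Callable[[str], int]) -> Dict[str, str]:
    """Run fetch for each organization; record failures without aborting."""
    status = {}
    for org in orgs:
        try:
            count = fetch(org)
            status[org] = f"ok ({count} file(s))"
        except Exception as exc:  # broad on purpose: one failure must not stop the batch
            print(f"{org}: error: {exc}")
            status[org] = f"failed: {exc}"
    return status
```

A real implementation would likely catch narrower exceptions (for example network errors from the HTTP client) before falling back to a broad handler.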

Limitations

  • Requires an internet connection to access Wikipedia and Wikimedia Commons
  • Some organizations may not have Wikipedia articles or infobox logos
  • Logo availability depends on what's uploaded to Wikimedia Commons
  • File filtering is based on filename patterns, which may miss some valid logos

License

This project is provided as-is for educational and personal use.

Contributing

Feel free to submit issues or pull requests for improvements!
