Shehrbano Ali - Wikimedia Task 2 Submission

Contributor: Shehrbano Ali
Email: shehrbanoali2230@gmail.com
Github: Shehrbaano-Ali
Program: Outreachy 2026 Cohort
Project: Addressing the lusophone technological wishlist proposals
Date: April 6, 2026
Task: Task 2 - Create a Python script to audit URL status codes

🔗 LINKS

🌐 Click here to view the LIVE Production Prototype

(Note: This Link is running in 'PythonAnywhere')

📸 Visual Overview

1. Operational Logic & Documentation(Updated Version)

This now includes interactive briefings that explain the technical methodology, specifically highlighting how the system handles Header Skipping and Diagnostic Error Reporting.

2. Live Audit Terminal(Updated Version)

Here I prioritized Actionable Verbosity. A generic FAILED message doesn't help a developer fix a link. I implemented a robust try-except block that extracts the specific Exception Name and HTTP Reason Phrase.

Refined Error Catching: Instead of a silent failure, the tool identifies the type of crash (e.g., ReadTimeout, ProxyError).
Human-Readable Status: Every status code is paired with its official reason phrase (e.g., 200 OK, 403 Forbidden, 404 NOT FOUND).

3. Wikidata Quality Analysis

This section visualizes the Wikiscorer prototype, utilizing dynamic visual analytics to provide a real-time quality gauge for Wikidata entities.

High-Quality Entities: Items with a score $> 5$ are highlighted in Signal Green (★ HIGH QUALITY).
Needs Improvement: Items with lower scores receive a Safety Yellow warning, automatically flagging them for metadata enhancement.

The core requirement involved developing a Python script to parse a .csv dataset of Wikimedia-related URLs and retrieve their live HTTP status codes. To demonstrate the practical application of this logic, I extended the script into a web-based dashboard that automates the audit of these links, serving as a functional prototype for improving metadata reliability.

Objectives

Parse CSV Data: Implement Python logic to extract URLs from Task 2 - Intern.csv.
Automated Auditing: Utilize the requests library to fetch real-time HTTP status codes.
Format Output: Print results in the mentor-specified format: (STATUS CODE) URL.
Cloud Integration: Deploy the operational script to PythonAnywhere for community access.
Document Findings: Provide a technical breakdown of the audit process.

Implementation Details

This project provides two ways to audit the Wikimedia dataset:

CLI Script (script.py): A lightweight Python script that reads directly from Task 2 - Intern.csv and prints status codes to the terminal as requested.
Web Dashboard (app.py): A Flask-based prototype that visualizes the audit process and calculates "Wikiscores" for Wikidata entities.

Verification Logic & Error Handling

During development, I focused on ensuring the script remains stable even when encountering broken or restricted links.

The Challenge: Many URLs in the dataset may point to dead servers or have restricted access, which can cause a standard script to crash.
The Solution: I implemented a robust try-except block within the audit loop. This ensures that even if a URL fails to connect (due to DNS or SSL issues), the script catches the error and reports it as FAILED instead of stopping the entire execution.

# Enhanced implementation of diagnostic auditing
try:
    res = requests.get(url, headers=HEADERS, timeout=5)
    # Returns the code + the reason (e.g., "404 Not Found")
    return jsonify({'status': f"{res.status_code} {res.reason}"})
except requests.exceptions.RequestException as e:
    error_name = type(e).__name__
    return jsonify({'status': f'FAILED ({error_name})'})

Beyond the Task: Project Prototype

While Task 2 focuses on a backend status audit, the Wikiscorer feature integrated into this application serves as a functional prototype for Wishlist Proposal #8:

Proposal #8: Ferramenta de pontuação para edições no Wikidata (Scoring tool for Wikidata edits).
The Implementation: By tapping into the Wikidata API (wbgetentities), the prototype calculates a real-time quality score for each entity. It assigns points based on the presence of Labels, Descriptions, and the depth of Claims (Statements).
The Impact: This tool makes it easy to find Wikidata pages that are missing important information. By giving each page a score, we can quickly see which items need more details, helping us reach the community's goal of making all data complete.

Iterative Improvements(Post-Feedback)

After initial development, I refined the system based on mentor feedback:

Header Robustness: Implemented next(reader) in the Python script and a .filter() method in the JavaScript UI to prevent the urls header from being treated as a link.
Enhanced Diagnostic Reporting: Replaced generic error flags with detailed exception logging. The terminal now identifies specific failure types (e.g., ProxyError, ReadTimeout) and includes official HTTP Reason Phrases (e.g., 404 Not Found, 200 OK, 403 Forbidden).
Data Integrity: The tool was engineered to handle dirty data (such as the unquoted comma in the Yahoo URL) through string manipulation (join and strip) rather than modifying the original dataset.

Key Findings

Resilience: Handling network exceptions is as important as the core logic when dealing with large, diverse datasets of external URLs.
Server-Side Versatility: Python provides a robust environment for network tasks that can eventually be integrated into larger Wikimedia tools like Pywikibot.
Environmental Awareness: Deploying on PythonAnywhere highlighted the importance of Whitelisting. The tool successfully identifies ProxyError as a hosting restriction rather than a code failure, proving the value of detailed exception logging.

Repository Structure

Outreachy-Wikimedia-Task-2-Python-Script/
│
├── templates/
│   └── index.html        # UI for the Web Terminal prototype
├── 01-briefing-model.png # Screenshot: Prototype UI
├── 02-audit-results.png  # Screenshot: Audit results
├── 03-wikiscore-analysis.png # Screenshot: Wikidata Quality Scoring logic
├── LICENSE               # MIT License
├── README.md             # Analytical documentation
├── Task 2 - Intern.csv   # The source dataset
├── app.py                # Flask-based Web Terminal logic
├── requirements.txt      # Dependencies (Flask, Requests)
└── script.py             # Official CSV parsing script (Task Requirement)

AI Usage

I utilized ChatGPT for:

Documentation Structure: Organizing this README to reflect systematic open-source analysis.
Infrastructure Troubleshooting: Assisting in the migration from Render to PythonAnywhere during deployment.

All code and logic were manually verified and implemented by me. The AI served as a guide to accelerate understanding of cloud deployment configurations.

Here is link to my Blog
This work is submitted for the Outreachy 2026 internship application for the Wikimedia Project.
~Shehrbano Ali

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Shehrbano Ali - Wikimedia Task 2 Submission

🔗 LINKS

🌐 Click here to view the LIVE Production Prototype

📸 Visual Overview

1. Operational Logic & Documentation(Updated Version)

2. Live Audit Terminal(Updated Version)

3. Wikidata Quality Analysis

Table of Contents

Introduction

Objectives

Implementation Details

Verification Logic & Error Handling

Beyond the Task: Project Prototype

Iterative Improvements(Post-Feedback)

Key Findings

Repository Structure

AI Usage

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
templates		templates
01-briefing-model.png		01-briefing-model.png
02-audit-results.png		02-audit-results.png
03-wikiscore-analysis.png		03-wikiscore-analysis.png
LICENSE		LICENSE
README.md		README.md
Task 2 - Intern.csv		Task 2 - Intern.csv
app.py		app.py
requirements.txt		requirements.txt
script.py		script.py

Folders and files

Latest commit

History

Repository files navigation

Shehrbano Ali - Wikimedia Task 2 Submission

🔗 LINKS

🌐 Click here to view the LIVE Production Prototype

📸 Visual Overview

1. Operational Logic & Documentation(Updated Version)

2. Live Audit Terminal(Updated Version)

3. Wikidata Quality Analysis

Table of Contents

Introduction

Objectives

Implementation Details

Verification Logic & Error Handling

Beyond the Task: Project Prototype

Iterative Improvements(Post-Feedback)

Key Findings

Repository Structure

AI Usage

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages