
Implementation Plan: Reddit Crawler

1. Project Setup & Core Dependencies

  • Task: Initialize project structure (main.py, pyproject.toml, .gitignore, .python-version).
  • Task: Install core dependencies: typer, praw, python-dotenv, joblib, tqdm, pydantic, sqlmodel.
  • Task: Set up a basic virtual environment.

2. Configuration Management (Credentials & Settings)

  • Task: Define environment variables needed for PRAW authentication in a .env.example file (e.g., REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_USER_AGENT, REDDIT_USERNAME, REDDIT_PASSWORD).
  • Task: Implement loading of these variables using python-dotenv at the application's entry point.
  • Task: Create a Pydantic model for application settings if configuration beyond the credentials is anticipated.
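
A minimal sketch of the credential loading described above; RedditSettings and load_settings are illustrative names, not part of the plan:

```python
# Illustrative settings loader; variable names follow the .env.example entries above.
import os

from dotenv import load_dotenv
from pydantic import BaseModel


class RedditSettings(BaseModel):
    client_id: str
    client_secret: str
    user_agent: str
    username: str
    password: str


def load_settings() -> RedditSettings:
    load_dotenv()  # read .env from the working directory, if present
    return RedditSettings(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ["REDDIT_USER_AGENT"],
        username=os.environ["REDDIT_USERNAME"],
        password=os.environ["REDDIT_PASSWORD"],
    )
```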

3. PRAW Integration & API Call Caching

  • Task: Create a module/class to encapsulate PRAW initialization and interaction.
  • Task: Authenticate PRAW using the "Password Flow" as described in the PRAW documentation, using credentials loaded from .env. (Reference: PRAW Authentication)
  • Task: Initialize joblib.Memory with a specified cache directory (e.g., ./cache/praw_cache). (Reference: Joblib Memory Basic Usage)
  • Task: Identify PRAW API calls that fetch data (e.g., subreddit.hot(), subreddit.new(), submission.comments) and wrap them with the joblib.Memory instance to cache their results.
    • Ensure that cache keys are appropriately generated to reflect the parameters of the API calls (e.g., subreddit name, limit, sort order).
  • Task: Create helper functions that use the cached PRAW calls (e.g., get_subreddit_posts(subreddit_name, limit=10)).
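
A possible sketch of the cached helper described above; the make_reddit factory and the extra reddit parameter on get_subreddit_posts (excluded from the cache key via ignore) are implementation assumptions:

```python
import praw
from joblib import Memory

memory = Memory("./cache/praw_cache", verbose=0)


def make_reddit(settings) -> praw.Reddit:
    # Password flow: authenticate as the script's own Reddit user.
    return praw.Reddit(
        client_id=settings.client_id,
        client_secret=settings.client_secret,
        user_agent=settings.user_agent,
        username=settings.username,
        password=settings.password,
    )


@memory.cache(ignore=["reddit"])
def get_subreddit_posts(reddit, subreddit_name: str, limit: int = 10, sort_by: str = "hot"):
    # The Reddit instance is excluded from the cache key, so results are keyed
    # only on subreddit_name / limit / sort_by. Plain dicts are returned because
    # PRAW objects are not picklable and therefore not cacheable by joblib.
    listing = getattr(reddit.subreddit(subreddit_name), sort_by)(limit=limit)
    return [
        {"id": s.id, "title": s.title, "score": s.score, "created_utc": s.created_utc}
        for s in listing
    ]
```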

4. CLI Interface (Typer)

  • Task: Define the main Typer application in main.py.
  • Task: Create CLI commands. A primary command might be crawl-subreddit.
  • Task: Define CLI options/arguments for crawl-subreddit:
    • subreddit_name: The name of the subreddit to crawl (required).
    • limit: Number of posts to fetch (optional, with a default).
    • sort_by: How to sort posts (e.g., 'hot', 'new', 'top') (optional, with a default). (Reference: PRAW Subreddit Model for available sorting methods like hot(), new(), top())
    • time_filter: Time filter for 'top' sort (e.g., 'all', 'day', 'week') (optional, relevant if sort_by is 'top').
    • output_db: Path to the SQLite database file (optional, with a default like reddit_data.db).
  • Task: Implement basic input validation using Typer's capabilities or Pydantic models for arguments.
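
A sketch of the Typer command described above; the defaults shown (10 posts, 'hot', reddit_data.db) are illustrative:

```python
import typer

app = typer.Typer()


@app.command()
def crawl_subreddit(
    subreddit_name: str = typer.Argument(..., help="Name of the subreddit to crawl"),
    limit: int = typer.Option(10, help="Number of posts to fetch"),
    sort_by: str = typer.Option("hot", help="One of 'hot', 'new', 'top'"),
    time_filter: str = typer.Option("all", help="Only relevant when sort_by='top'"),
    output_db: str = typer.Option("reddit_data.db", help="Path to the SQLite database"),
):
    """Crawl a subreddit and store its posts in the SQLite database."""
    if sort_by not in {"hot", "new", "top"}:
        raise typer.BadParameter("sort_by must be one of: hot, new, top")
    typer.echo(f"Crawling r/{subreddit_name} ({sort_by}, limit={limit}) -> {output_db}")
    # ...hand off to the crawling logic from step 7...


if __name__ == "__main__":
    app()
```

Typer derives the crawl-subreddit command name from the crawl_subreddit function name automatically.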

5. Data Modeling (Pydantic & SQLModel)

  • Task: Define Pydantic models for:
    • SubredditData: Fields like id (string, from PRAW), name (string, PRAW's display_name), title (string, PRAW's title), description (string, PRAW's public_description), subscribers (int), created_utc (datetime).
    • PostData: Fields like id (string, from PRAW), title (string), author_name (string), score (int), upvote_ratio (float), num_comments (int), created_utc (datetime), url (string), selftext (string), permalink (string), subreddit_id (string, foreign key to SubredditData.id).
    • CrawlInfo: Fields like id (auto-increment int), subreddit_name_crawled (string), timestamp (datetime), post_count (int), parameters_used (e.g., a JSON string or Pydantic model serialized to JSON, capturing limit, sort_by, time_filter).
  • Task: Define SQLModel table models corresponding to the Pydantic models:
    • Subreddit(SQLModel, table=True): Primary key id (string).
    • Post(SQLModel, table=True): Primary key id (string). Add subreddit_id: Optional[str] = Field(default=None, foreign_key="subreddit.id") (Optional so that default=None type-checks).
    • Crawl(SQLModel, table=True): Primary key id (integer, auto-increment).
  • Task: Implement relationships in SQLModel if direct ORM-style access is needed (e.g., posts: List["Post"] = Relationship(back_populates="subreddit") in Subreddit model, and subreddit: Optional["Subreddit"] = Relationship(back_populates="posts") in Post model).
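
A condensed sketch of the table models (fields abbreviated relative to the lists above):

```python
from datetime import datetime
from typing import List, Optional

from sqlmodel import Field, Relationship, SQLModel


class Subreddit(SQLModel, table=True):
    id: str = Field(primary_key=True)  # PRAW subreddit id
    name: str
    title: str
    description: str = ""
    subscribers: int = 0
    created_utc: datetime

    posts: List["Post"] = Relationship(back_populates="subreddit")


class Post(SQLModel, table=True):
    id: str = Field(primary_key=True)  # PRAW submission id
    title: str
    author_name: Optional[str] = None
    score: int = 0
    created_utc: datetime
    subreddit_id: Optional[str] = Field(default=None, foreign_key="subreddit.id")

    subreddit: Optional[Subreddit] = Relationship(back_populates="posts")


class Crawl(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)  # auto-increment
    subreddit_name_crawled: str
    timestamp: datetime
    post_count: int
    parameters_used: str = "{}"  # JSON-serialized limit / sort_by / time_filter
```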

6. Database Setup & Interaction (SQLModel)

  • Task: Create a function to initialize the SQLite database and create tables if they don't exist, using SQLModel.metadata.create_all(engine). The engine will be created from the output_db path provided via CLI.
  • Task: Implement functions using SQLModel sessions to:
    • Add or update a Subreddit record (upsert logic: if subreddit id exists, update; else, insert).
    • Add Post records. Check if post id already exists before inserting to avoid duplicates from overlapping crawls (or decide if updates are needed). Link them to the correct Subreddit using subreddit_id.
    • Add a Crawl record after a crawling session.
  • Task: Ensure database sessions are properly managed (e.g., using with Session(engine) as session:).
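
A sketch of the database helpers; database.py and the helper names are assumptions consistent with the optional module layout in step 10:

```python
from sqlmodel import Session, SQLModel, create_engine

from models import Post, Subreddit  # assumed module from the data-modeling step


def init_db(output_db: str):
    engine = create_engine(f"sqlite:///{output_db}")
    SQLModel.metadata.create_all(engine)  # creates tables only if they don't exist
    return engine


def upsert_subreddit(session: Session, subreddit: Subreddit) -> Subreddit:
    existing = session.get(Subreddit, subreddit.id)
    if existing is None:
        session.add(subreddit)
        return subreddit
    existing.subscribers = subreddit.subscribers  # update mutable fields
    existing.description = subreddit.description
    session.add(existing)
    return existing


def add_post_if_new(session: Session, post: Post) -> bool:
    # Skip duplicates from overlapping crawls instead of updating them.
    if session.get(Post, post.id) is not None:
        return False
    session.add(post)
    return True
```

Callers would wrap these in with Session(engine) as session: and commit once per crawl.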

7. Crawling Logic

  • Task: In the crawl-subreddit Typer command function:
    • Initialize PRAW (using the cached PRAW interaction module).
    • Initialize the database engine and create tables.
    • Fetch subreddit information using reddit.subreddit(subreddit_name). Access attributes like display_name, title, public_description, subscribers, created_utc.
    • Transform this into SubredditData and store/update it in the database.
    • Select the appropriate PRAW method based on sort_by (e.g., subreddit.hot(limit=limit), subreddit.new(limit=limit), subreddit.top(time_filter=time_filter, limit=limit)).
    • Iterate through the fetched PRAW Submission objects using tqdm for progress indication.
      • For each Submission:
        • Extract relevant attributes (e.g., id, title, author.name if author exists, score, upvote_ratio, num_comments, created_utc, url, selftext, permalink).
        • Transform into the PostData Pydantic model.
        • Store the PostData in the database, ensuring subreddit_id is set to the ID of the crawled subreddit.
    • After processing all posts, create and store a CrawlInfo record.
  • Task: Handle potential PRAW exceptions (e.g., prawcore.exceptions.Redirect for non-existent subreddits, prawcore.exceptions.NotFound, authentication errors).
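
A condensed sketch of the command body, reusing the helpers and models assumed in the earlier sketches (module names follow the optional layout in step 10):

```python
from datetime import datetime, timezone

from prawcore.exceptions import NotFound, Redirect
from sqlmodel import Session
from tqdm import tqdm

from database import add_post_if_new, init_db, upsert_subreddit  # assumed modules
from models import Crawl, Post, Subreddit
from reddit_client import make_reddit


def run_crawl(settings, subreddit_name, limit, sort_by, time_filter, output_db):
    reddit = make_reddit(settings)
    engine = init_db(output_db)
    try:
        sub = reddit.subreddit(subreddit_name)
        listing = (
            sub.top(time_filter=time_filter, limit=limit)
            if sort_by == "top"
            else getattr(sub, sort_by)(limit=limit)
        )
        with Session(engine) as session:
            db_sub = upsert_subreddit(session, Subreddit(
                id=sub.id, name=sub.display_name, title=sub.title,
                description=sub.public_description, subscribers=sub.subscribers,
                created_utc=datetime.fromtimestamp(sub.created_utc, tz=timezone.utc),
            ))
            saved = 0
            for s in tqdm(listing, total=limit, desc=f"r/{subreddit_name}"):
                saved += add_post_if_new(session, Post(
                    id=s.id, title=s.title,
                    author_name=s.author.name if s.author else None,
                    score=s.score,
                    created_utc=datetime.fromtimestamp(s.created_utc, tz=timezone.utc),
                    subreddit_id=db_sub.id,
                ))
            session.add(Crawl(
                subreddit_name_crawled=subreddit_name,
                timestamp=datetime.now(timezone.utc),
                post_count=saved,
                parameters_used=f'{{"limit": {limit}, "sort_by": "{sort_by}"}}',
            ))
            session.commit()
    except (Redirect, NotFound):
        raise SystemExit(f"Subreddit r/{subreddit_name} does not exist or is unavailable.")
```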

8. Progress Indication & Logging

  • Task: Wrap the loop that processes PRAW Submission objects with tqdm to show a progress bar for post fetching and processing.
  • Task: Implement basic logging using the logging module:
    • Log the start and end of a crawl session.
    • Log the number of posts fetched/saved.
    • Log cache hits/misses, or other relevant joblib information, where possible.
    • Log any errors encountered during the process.
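
A minimal logging setup along these lines (logger name and format are arbitrary choices); joblib's own cache diagnostics can be raised via Memory(..., verbose=1) if wanted:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("reddit_crawler")


def log_crawl_start(subreddit_name: str, limit: int, sort_by: str) -> None:
    logger.info("Starting crawl of r/%s (limit=%d, sort_by=%s)", subreddit_name, limit, sort_by)


def log_crawl_end(saved: int, output_db: str) -> None:
    logger.info("Saved %d new posts to %s", saved, output_db)
```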

9. Refinements & Error Handling

  • Task: Add comprehensive error handling for API issues (PRAW exceptions), database transaction errors, and invalid user inputs (Typer helps here).
  • Task: Implement graceful shutdown on keyboard interrupt (Ctrl+C), perhaps saving any partially completed work or logging the interruption.
  • Task: Review Reddit's API rate limits and PRAW's handling of them. joblib caching helps significantly for repeated identical calls, but initial crawls or crawls with varying parameters will still hit the API. PRAW handles standard rate limiting well; avoid overly aggressive request patterns regardless.
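
One way to handle Ctrl+C around the crawl call, assuming the load_settings, run_crawl, and logger helpers from the earlier sketches; committing partial work (or not) is a policy choice the plan leaves open:

```python
def main() -> None:
    settings = load_settings()
    try:
        run_crawl(settings, "python", limit=100, sort_by="hot",
                  time_filter="all", output_db="reddit_data.db")
    except KeyboardInterrupt:
        logger.warning("Crawl interrupted by user; partial results may already be committed.")
        raise SystemExit(130)  # conventional exit status for SIGINT
```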

10. Documentation & Testing (Best Practices)

  • Task: Create/Update README.md with:
    • Project description.
    • Setup instructions (virtual environment, installing the dependencies declared in pyproject.toml).
    • Configuration details (how to create and populate .env from .env.example).
    • CLI usage examples for crawl-subreddit with different options.
    • Brief overview of the database schema.
  • Task (Optional but Recommended): Write unit tests for:
    • PRAW interaction module (mocking PRAW client and its methods).
    • Data transformation logic (PRAW object to Pydantic model).
    • Database interaction functions (using an in-memory SQLite database for tests).
    • CLI command argument parsing and basic command flow (mocking the actual crawl logic).
  • Task (Optional): Consider structuring the code into logical modules (e.g., cli.py, reddit_client.py, database.py, models.py).
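
An illustrative pytest-style test for the database helpers, using an in-memory SQLite engine as suggested above (module and helper names follow the earlier sketches):

```python
from datetime import datetime

from sqlmodel import Session, SQLModel, create_engine

from database import upsert_subreddit  # assumed module names from this plan
from models import Subreddit


def test_upsert_subreddit_inserts_then_updates():
    engine = create_engine("sqlite://")  # in-memory database, discarded after the test
    SQLModel.metadata.create_all(engine)
    with Session(engine) as session:
        first = Subreddit(id="abc", name="python", title="Python",
                          subscribers=1, created_utc=datetime(2020, 1, 1))
        upsert_subreddit(session, first)
        session.commit()

        second = Subreddit(id="abc", name="python", title="Python",
                           subscribers=2, created_utc=datetime(2020, 1, 1))
        upsert_subreddit(session, second)
        session.commit()

        assert session.get(Subreddit, "abc").subscribers == 2
```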

This plan outlines the major components and tasks for building the Reddit crawler as specified.