- Task: Initialize project structure (`main.py`, `pyproject.toml`, `.gitignore`, `.python-version`).
- Task: Install core dependencies: `typer`, `praw`, `python-dotenv`, `joblib`, `tqdm`, `pydantic`, `sqlmodel`.
- Task: Set up a basic virtual environment.
- Task: Define the environment variables needed for PRAW authentication in a `.env.example` file (e.g., `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`, `REDDIT_USERNAME`, `REDDIT_PASSWORD`).
- Task: Implement loading of these variables using `python-dotenv` at the application's entry point.
- Task: Create a Pydantic model for application settings if more complex configuration is anticipated beyond credentials.
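A minimal sketch of this configuration step, assuming the variable names from `.env.example` above; the `Settings` class and `load_settings` helper are illustrative names, not prescribed by this plan.

```python
# settings.py — illustrative sketch; names are assumptions, not part of the plan.
import os

from dotenv import load_dotenv
from pydantic import BaseModel


class Settings(BaseModel):
    """Reddit credentials read from the environment (see .env.example)."""
    reddit_client_id: str
    reddit_client_secret: str
    reddit_user_agent: str
    reddit_username: str
    reddit_password: str


def load_settings() -> Settings:
    # Load variables from .env into the process environment, then validate them.
    load_dotenv()
    return Settings(
        reddit_client_id=os.environ["REDDIT_CLIENT_ID"],
        reddit_client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        reddit_user_agent=os.environ["REDDIT_USER_AGENT"],
        reddit_username=os.environ["REDDIT_USERNAME"],
        reddit_password=os.environ["REDDIT_PASSWORD"],
    )
```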
- Task: Create a module/class to encapsulate PRAW initialization and interaction.
- Task: Authenticate PRAW using the "Password Flow" as described in the PRAW documentation, using credentials loaded from `.env`. (Reference: PRAW Authentication)
- Task: Initialize `joblib.Memory` with a specified cache directory (e.g., `./cache/praw_cache`). (Reference: Joblib Memory Basic Usage)
- Task: Identify PRAW API calls that fetch data (e.g., `subreddit.hot()`, `subreddit.new()`, `submission.comments`) and wrap them with the `joblib.Memory` instance to cache their results.
  - Ensure that cache keys are appropriately generated to reflect the parameters of the API calls (e.g., subreddit name, limit, sort order).
- Task: Create helper functions that use the cached PRAW calls (e.g., `get_subreddit_posts(subreddit_name, limit=10)`), as sketched below.
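A sketch of the cached PRAW layer under the assumptions above. The extra `reddit` and `sort_by` parameters and the dict conversion are additions for illustration: plain dicts pickle cleanly for `joblib`, whereas live PRAW objects hold a network session, and ignoring the `reddit` argument keeps the cache key limited to the call parameters.

```python
# reddit_client.py — a sketch, not the definitive module layout.
import praw
from joblib import Memory

# Cache directory from the plan; adjust as needed.
memory = Memory("./cache/praw_cache", verbose=0)


def make_reddit(settings) -> praw.Reddit:
    """Password-flow authentication with credentials loaded from .env."""
    return praw.Reddit(
        client_id=settings.reddit_client_id,
        client_secret=settings.reddit_client_secret,
        user_agent=settings.reddit_user_agent,
        username=settings.reddit_username,
        password=settings.reddit_password,
    )


@memory.cache(ignore=["reddit"])
def get_subreddit_posts(reddit, subreddit_name: str, limit: int = 10, sort_by: str = "hot"):
    """Fetch posts and return plain dicts so the result is picklable by joblib.

    The cache key is built from subreddit_name, limit, and sort_by; the
    `reddit` argument is excluded via `ignore`.
    """
    subreddit = reddit.subreddit(subreddit_name)
    listing = getattr(subreddit, sort_by)(limit=limit)
    return [
        {"id": s.id, "title": s.title, "score": s.score, "num_comments": s.num_comments}
        for s in listing
    ]
```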
- Task: Define the main Typer application in `main.py`.
- Task: Create CLI commands. A primary command might be `crawl-subreddit`.
- Task: Define CLI options/arguments for `crawl-subreddit` (see the sketch after this list):
  - `subreddit_name`: The name of the subreddit to crawl (required).
  - `limit`: Number of posts to fetch (optional, with a default).
  - `sort_by`: How to sort posts (e.g., 'hot', 'new', 'top') (optional, with a default). (Reference: PRAW Subreddit Model for available sorting methods like `hot()`, `new()`, `top()`.)
  - `time_filter`: Time filter for the 'top' sort (e.g., 'all', 'day', 'week') (optional, relevant if `sort_by` is 'top').
  - `output_db`: Path to the SQLite database file (optional, with a default like `reddit_data.db`).
- Task: Implement basic input validation using Typer's capabilities or Pydantic models for arguments.
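A possible shape for the `crawl-subreddit` command in Typer; the option names mirror the list above, while the defaults and help strings are placeholders.

```python
# main.py — command skeleton only; the body delegates to the other modules.
import typer

app = typer.Typer()


@app.command("crawl-subreddit")
def crawl_subreddit(
    subreddit_name: str = typer.Argument(..., help="Name of the subreddit to crawl."),
    limit: int = typer.Option(10, help="Number of posts to fetch."),
    sort_by: str = typer.Option("hot", help="One of: hot, new, top."),
    time_filter: str = typer.Option("all", help="Used only when sort_by is 'top'."),
    output_db: str = typer.Option("reddit_data.db", help="Path to the SQLite database file."),
) -> None:
    """Crawl a subreddit and store posts in SQLite."""
    typer.echo(f"Crawling r/{subreddit_name} ({sort_by}, limit={limit}) into {output_db}")
    # ... initialize PRAW, fetch posts, and write to the database here.


if __name__ == "__main__":
    app()
```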
- Task: Define Pydantic models for:
  - `SubredditData`: Fields like `id` (string, from PRAW), `name` (string, PRAW's `display_name`), `title` (string, PRAW's `title`), `description` (string, PRAW's `public_description`), `subscribers` (int), `created_utc` (datetime).
  - `PostData`: Fields like `id` (string, from PRAW), `title` (string), `author_name` (string), `score` (int), `upvote_ratio` (float), `num_comments` (int), `created_utc` (datetime), `url` (string), `selftext` (string), `permalink` (string), `subreddit_id` (string, foreign key to `SubredditData.id`).
  - `CrawlInfo`: Fields like `id` (auto-increment int), `subreddit_name_crawled` (string), `timestamp` (datetime), `post_count` (int), `parameters_used` (e.g., a JSON string or a Pydantic model serialized to JSON, capturing `limit`, `sort_by`, `time_filter`).
- Task: Define SQLModel table models corresponding to the Pydantic models:
  - `Subreddit(SQLModel, table=True)`: Primary key `id` (string).
  - `Post(SQLModel, table=True)`: Primary key `id` (string). Add `subreddit_id: str = Field(default=None, foreign_key="subreddit.id")`.
  - `Crawl(SQLModel, table=True)`: Primary key `id` (integer, auto-increment).
- Task: Implement relationships in SQLModel if direct ORM-style access is needed (e.g., `posts: List["Post"] = Relationship(back_populates="subreddit")` in the `Subreddit` model, and `subreddit: Optional["Subreddit"] = Relationship(back_populates="posts")` in the `Post` model). A sketch follows.
- Task: Create a function to initialize the SQLite database and create tables if they don't exist, using `SQLModel.metadata.create_all(engine)`. The engine will be created from the `output_db` path provided via the CLI.
- Task: Implement functions using SQLModel sessions to:
  - Add or update a `Subreddit` record (upsert logic: if the subreddit `id` exists, update; else, insert).
  - Add `Post` records. Check whether a post `id` already exists before inserting to avoid duplicates from overlapping crawls (or decide whether updates are needed). Link them to the correct `Subreddit` using `subreddit_id`.
  - Add a `Crawl` record after each crawling session.
- Task: Ensure database sessions are properly managed (e.g., using `with Session(engine) as session:`). A sketch of these helpers follows.
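A minimal sketch of the database helpers under the assumptions above; `init_db`, `upsert_subreddit`, and `add_posts` are hypothetical names, and the upsert uses `session.merge` rather than a dialect-specific upsert statement.

```python
# database.py — sketch of the persistence helpers; function names are assumptions.
from typing import List

from sqlmodel import Session, SQLModel, create_engine

from models import Post, Subreddit  # the table models sketched earlier


def init_db(output_db: str):
    """Create the engine from the CLI-provided path and create missing tables."""
    engine = create_engine(f"sqlite:///{output_db}")
    SQLModel.metadata.create_all(engine)
    return engine


def upsert_subreddit(session: Session, subreddit: Subreddit) -> None:
    """Insert the subreddit, or update the existing row with the same primary key."""
    session.merge(subreddit)
    session.commit()


def add_posts(session: Session, posts: List[Post]) -> int:
    """Insert posts that are not already present; return how many were added."""
    added = 0
    for post in posts:
        if session.get(Post, post.id) is None:
            session.add(post)
            added += 1
    session.commit()
    return added
```

With `with Session(engine) as session:` the caller controls the session's lifetime; the helpers above commit eagerly only to keep the sketch self-contained.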
- Task: In the `crawl-subreddit` Typer command function:
  - Initialize PRAW (using the cached PRAW interaction module).
  - Initialize the database engine and create tables.
  - Fetch subreddit information using `reddit.subreddit(subreddit_name)`. Access attributes like `display_name`, `title`, `public_description`, `subscribers`, `created_utc`.
  - Transform this into `SubredditData` and store/update it in the database.
  - Select the appropriate PRAW method based on `sort_by` (e.g., `subreddit.hot(limit=limit)`, `subreddit.new(limit=limit)`, `subreddit.top(time_filter=time_filter, limit=limit)`).
  - Iterate through the fetched PRAW `Submission` objects using `tqdm` for progress indication.
    - For each `Submission`:
      - Extract relevant attributes (e.g., `id`, `title`, `author.name` if the author exists, `score`, `upvote_ratio`, `num_comments`, `created_utc`, `url`, `selftext`, `permalink`).
      - Transform into the `PostData` Pydantic model.
      - Store the `PostData` in the database, ensuring `subreddit_id` is set to the ID of the crawled subreddit.
  - After processing all posts, create and store a `CrawlInfo` record.
- Task: Handle potential PRAW exceptions (e.g., `prawcore.exceptions.Redirect` for non-existent subreddits, `prawcore.exceptions.NotFound`, authentication errors). A sketch of the fetch-and-handle pattern follows.
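A sketch of the fetch step wrapped in PRAW exception handling; the surrounding command body, the `fetch_and_store` helper name, and the early attribute access are assumptions layered on the flow described above.

```python
# Called from the crawl-subreddit command body — exception-handling sketch only.
import prawcore
import typer
from tqdm import tqdm


def fetch_and_store(reddit, subreddit_name: str, sort_by: str, limit: int, time_filter: str):
    try:
        subreddit = reddit.subreddit(subreddit_name)
        # Touch an attribute so the lazy object loads now and fails early
        # for non-existent subreddits.
        _ = subreddit.display_name
        if sort_by == "top":
            listing = subreddit.top(time_filter=time_filter, limit=limit)
        else:
            listing = getattr(subreddit, sort_by)(limit=limit)
        for submission in tqdm(listing, total=limit, desc=f"r/{subreddit_name}"):
            ...  # transform into PostData and persist via the database helpers
    except prawcore.exceptions.Redirect:
        typer.echo(f"Subreddit r/{subreddit_name} does not exist.", err=True)
        raise typer.Exit(code=1)
    except prawcore.exceptions.NotFound:
        typer.echo(f"r/{subreddit_name} could not be found.", err=True)
        raise typer.Exit(code=1)
    except prawcore.exceptions.ResponseException as exc:
        typer.echo(f"Reddit API error (check credentials?): {exc}", err=True)
        raise typer.Exit(code=1)
```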
- Task: Wrap the loop that processes PRAW `Submission` objects with `tqdm` to show a progress bar for post fetching and processing.
- Task: Implement basic logging using the `logging` module (see the sketch after this list):
  - Log the start and end of a crawl session.
  - Log the number of posts fetched/saved.
  - Log cache hits/misses if possible, or relevant information from `joblib`.
  - Log any errors encountered during the process.
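A minimal logging setup sketch; the logger name and format are placeholders. For cache activity, raising the `verbose` level on `joblib.Memory` makes joblib print information about its caching behavior, which can complement the application's own logs.

```python
# logging_setup.py — illustrative; logger name and format are assumptions.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("reddit_crawler")

# Example usage inside the crawl command (values are placeholders):
logger.info("Starting crawl of r/%s (sort=%s, limit=%d)", "python", "hot", 10)
logger.info("Saved %d new posts", 42)
logger.error("Failed to fetch subreddit: %s", "example error")
```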
- Task: Add comprehensive error handling for API issues (PRAW exceptions), database transaction errors, and invalid user inputs (Typer helps here).
- Task: Implement graceful shutdown on keyboard interrupt (`Ctrl+C`), perhaps saving any partially completed work or logging the interruption (see the sketch below).
- Task: Review Reddit's API rate limits and PRAW's handling of them. While `joblib` caching helps significantly for repeated identical calls, initial crawls or crawls with varying parameters will still hit the API. PRAW handles standard rate limiting well, but be mindful of not being overly aggressive.
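One way to approach the graceful-shutdown task is to catch `KeyboardInterrupt` around the processing loop and commit whatever has been collected so far; a sketch under that assumption, with hypothetical helper names:

```python
# Graceful-shutdown sketch around the processing loop; names are assumptions.
import logging

logger = logging.getLogger("reddit_crawler")


def process_submissions(session, submissions):
    saved = 0
    try:
        for submission in submissions:
            ...  # transform into PostData and session.add(...) here
            saved += 1
    except KeyboardInterrupt:
        logger.warning("Interrupted by user after %d posts; committing partial work.", saved)
    finally:
        # Commit whatever was added before the interrupt (or on normal completion).
        session.commit()
    return saved
```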
- Task: Create/update `README.md` with:
  - Project description.
  - Setup instructions (virtual environment, `pip install -r requirements.txt`).
  - Configuration details (how to create and populate `.env` from `.env.example`).
  - CLI usage examples for `crawl-subreddit` with different options.
  - Brief overview of the database schema.
- Task (Optional but Recommended): Write unit tests for:
  - The PRAW interaction module (mocking the PRAW client and its methods).
  - Data transformation logic (PRAW object to Pydantic model).
  - Database interaction functions (using an in-memory SQLite database for tests; see the sketch after this list).
  - CLI command argument parsing and basic command flow (mocking the actual crawl logic).
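A small pytest sketch for the database-layer tests using an in-memory SQLite engine; the fixture and the imported helpers are assumptions carried over from the earlier sketches.

```python
# test_database.py — pytest sketch; assumes the models and helpers sketched above.
from datetime import datetime, timezone

import pytest
from sqlmodel import Session, SQLModel, create_engine

from database import add_posts, upsert_subreddit
from models import Post, Subreddit


@pytest.fixture
def session():
    # Fresh in-memory database per test, so tests stay isolated and fast.
    engine = create_engine("sqlite:///:memory:")
    SQLModel.metadata.create_all(engine)
    with Session(engine) as s:
        yield s


def test_duplicate_posts_are_not_inserted_twice(session):
    now = datetime.now(timezone.utc)
    upsert_subreddit(session, Subreddit(id="t5_abc", name="python", created_utc=now))
    post = Post(id="t3_xyz", title="Hello", created_utc=now, subreddit_id="t5_abc")
    assert add_posts(session, [post]) == 1
    assert add_posts(session, [post]) == 0  # second insert is skipped
```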
- Task (Optional): Consider structuring the code into logical modules (e.g., `cli.py`, `reddit_client.py`, `database.py`, `models.py`).
This plan outlines the major components and tasks for building the Reddit crawler as specified.