Add Sentiment Github Dataset Documentation Notebooks#3

Open: splimon wants to merge 8 commits into main from 1-sentiment-github-dataset-notebook-documentation

@splimon (Collaborator) commented Apr 4, 2026

Adds 3 notebooks that build a pipeline to contextualize the GitHub Gold Standard sentiment dataset (7,122 comments) with GHTorrent project context and re-download comment data via Kaiaulu:

  • Notebook 1: Load the GitHub Gold Standard sentiment CSV into a GHTorrent MySQL database
  • Notebook 2: Explore GHTorrent tables to map sentiment comments to main project repos
  • Notebook 3: Auto-generate Kaiaulu .yml config files for 82 main project repos
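
As a rough illustration of the first step (loading the sentiment CSV into a SQL table), here is a minimal sketch. It is not the notebook's code: it uses an in-memory SQLite database as a stand-in for the GHTorrent MySQL database, and the table name `sentiment_gold`, the `comment_id` column, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the gold standard CSV; the real
# column names and contents may differ.
SAMPLE_CSV = """comment_id,polarity,text
101,positive,Nice fix
102,negative,This breaks the build
103,neutral,Renamed the variable
"""


def load_sentiment_csv(conn, csv_file, table="sentiment_gold"):
    """Create the table if needed, insert every CSV row, return the row count."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "comment_id INTEGER PRIMARY KEY, polarity TEXT, text TEXT)"
    )
    rows = [
        (int(r["comment_id"]), r["polarity"], r["text"])
        for r in csv.DictReader(csv_file)
    ]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# SQLite in memory is an assumption made so the sketch runs anywhere;
# the PR targets a MySQL database.
conn = sqlite3.connect(":memory:")
loaded = load_sentiment_csv(conn, io.StringIO(SAMPLE_CSV))
print(loaded)
```

With a real MySQL connection the same shape should carry over, although the placeholder syntax and driver would differ.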

splimon added 5 commits April 3, 2026 14:59
…t and download via Kaiaulu

Adds five notebooks that build a pipeline to contextualize the GitHub Gold Standard
sentiment dataset (7,122 comments) with GHTorrent project context and
re-download comment data via Kaiaulu:

- Notebook 1: Load the GitHub Gold Standard sentiment CSV into a GHTorrent MySQL database
- Notebook 2: Explore GHTorrent tables to map sentiment comments to main project repos
- Notebook 3: Auto-generate Kaiaulu .yml config files for 82 main project repos
- Notebook 4: Download and parse commit comments via Kaiaulu
- Notebook 5: Download and parse PR inline comments via Kaiaulu

Revise 3 notebooks for sentiment dataset documentation:
- Notebook 1: Load the GitHub Gold Standard sentiment CSV into a GHTorrent MySQL database
- Notebook 2: Explore GHTorrent tables to map sentiment comments to canonical project repos
- Notebook 3: Auto-generate Kaiaulu .yml config files for 82 canonical project repos
- Rewrites Notebook 4 to query sentiment labels directly from MySQL and INNER JOIN them against Kaiaulu-downloaded comment data, writing the output back into Kaiaulu's `rawdata` directory.
- Updates Notebooks 2 and 3 to align with the revised pipeline.
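
The inner join described for Notebook 4 can be sketched as follows. This is an illustration under assumptions, not the notebook's code: SQLite stands in for MySQL, the join key `comment_id`, the table name `sentiment_gold`, and the `author` field are hypothetical, and only `polarity` and `text` are column names mentioned in this PR.

```python
import sqlite3

# Stand-in for the MySQL database holding the sentiment labels.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentiment_gold (comment_id INTEGER, polarity TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO sentiment_gold VALUES (?, ?, ?)",
    [(1, "positive", "Nice fix"), (2, "negative", "Breaks CI")],
)

# Hypothetical rows standing in for comment data downloaded via Kaiaulu.
kaiaulu_rows = [
    {"comment_id": 2, "author": "alice"},
    {"comment_id": 3, "author": "bob"},
]


def inner_join_labels(conn, downloaded):
    """Keep only downloaded rows that have a label, and attach that label."""
    labels = {
        cid: {"polarity": polarity, "text": text}
        for cid, polarity, text in conn.execute(
            "SELECT comment_id, polarity, text FROM sentiment_gold"
        )
    }
    return [
        {**row, **labels[row["comment_id"]]}
        for row in downloaded
        if row["comment_id"] in labels
    ]


joined = inner_join_labels(conn, kaiaulu_rows)
print(joined)
```

Because this is an inner join, comment 1 (labelled but not downloaded) and comment 3 (downloaded but not labelled) both drop out of the result.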

@carlosparadis (Member) left a comment

Hi,

Can you let me know how big these files are, on average and in total? If this is several GBs we will likely piss someone off at GitHub HQ :^)
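
One way to answer the size question is a short script that walks the data directory and reports the count, total, and average size of the CSV files. The sketch below is only an illustration: the function name is hypothetical, and the demo runs against a temporary directory rather than the repository's actual data folder.

```python
import os
import tempfile


def csv_size_summary(root):
    """Return (file_count, total_bytes, average_bytes) for .csv files under root."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".csv"):
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
    total = sum(sizes)
    average = total / len(sizes) if sizes else 0.0
    return len(sizes), total, average


# Demo on throwaway files; point csv_size_summary at the real data
# directory to measure the files in this PR.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.csv", b"x" * 100), ("b.csv", b"y" * 300), ("notes.txt", b"z" * 50)]
    for name, payload in samples:
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(payload)
    summary = csv_size_summary(tmp)

print(summary)  # (2, 400, 200.0)
```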

@carlosparadis (Member) left a comment

@splimon In addition to the two comments below, I suggest you add the following files:

  • data/combined_project_labels/commit_comments.csv
  • data/combined_project_labels/commit_pr_inline_comments.csv
  • data/combined_project_labels/commit_and_pr_inline_comments.csv

You can then address sailuh/kaiaulu#347 (comment) with the url to data/combined_project_labels/commit_and_pr_inline_comments.csv

Member

Why does every file say "commit_comments_joined.cs" but akka does not?

Member

This should be removed from the PR.

splimon added 2 commits May 10, 2026 15:44
Add notebooks/5_combine_all_projects.ipynb to merge per-project CSVs into three combined files: commit comments only, PR inline comments only, and all comments combined.
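
The merge step can be sketched roughly as below. This is a hedged illustration rather than the contents of the notebook: the helper names, the input file names, and the output file names are placeholders, and the only columns assumed are `polarity` and `text`.

```python
import csv
import os
import tempfile

FIELDS = ["polarity", "text"]


def read_rows(path):
    """Read one per-project CSV into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def write_rows(path, rows):
    """Write rows with a fixed header, ignoring any extra columns."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def combine(commit_paths, pr_paths, out_dir):
    """Write three combined CSVs and return the row count of each."""
    commit_rows = [row for path in commit_paths for row in read_rows(path)]
    pr_rows = [row for path in pr_paths for row in read_rows(path)]
    outputs = {
        "commit_only.csv": commit_rows,          # placeholder file names
        "pr_inline_only.csv": pr_rows,
        "all_comments.csv": commit_rows + pr_rows,
    }
    for name, rows in outputs.items():
        write_rows(os.path.join(out_dir, name), rows)
    return {name: len(rows) for name, rows in outputs.items()}


# Demo with throwaway per-project files (hypothetical project names).
with tempfile.TemporaryDirectory() as tmp:
    proj_a = os.path.join(tmp, "a_commit.csv")
    proj_b = os.path.join(tmp, "b_commit.csv")
    proj_a_pr = os.path.join(tmp, "a_pr.csv")
    write_rows(proj_a, [{"polarity": "positive", "text": "Nice"}])
    write_rows(proj_b, [{"polarity": "negative", "text": "Bad"},
                        {"polarity": "neutral", "text": "Ok"}])
    write_rows(proj_a_pr, [{"polarity": "neutral", "text": "Nit"}])
    counts = combine([proj_a, proj_b], [proj_a_pr], tmp)

print(counts)
```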

Replace akka_commit_comments.csv (raw Kaiaulu download, no polarity/text) with akka_sentiment_commit_comments_joined.csv.

Fix cakephp PR inline CSV column names to polarity/text for consistency with all other project CSVs.

Drop _kaiaulu or _gold suffixed columns from combined CSVs in Notebook 5. These extra columns were created when Notebook 4 joined two tables that had columns with the same name. Notebook 5 was including these extras in the combined output. Fixed by dropping any columns ending in _kaiaulu or _gold before saving.
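
The suffix-based cleanup can be sketched as below. The `_kaiaulu` and `_gold` suffixes come from the commit message; the function name and the sample row are hypothetical, and the notebook itself may use different tooling.

```python
def drop_suffixed_columns(rows, suffixes=("_kaiaulu", "_gold")):
    """Return copies of the rows without columns whose names end in a suffix."""
    return [
        {key: value for key, value in row.items() if not key.endswith(suffixes)}
        for row in rows
    ]


# Hypothetical row showing the duplicate columns left over from a join.
sample = [
    {
        "polarity": "positive",
        "text": "Nice fix",
        "body_kaiaulu": "Nice fix",
        "body_gold": "Nice fix",
    }
]
cleaned = drop_suffixed_columns(sample)
print(cleaned)  # [{'polarity': 'positive', 'text': 'Nice fix'}]
```

Matching on the suffix rather than on a fixed list of column names means any future name collision from the join is dropped as well, which may or may not be the desired behavior.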