Add Sentiment GitHub Dataset Documentation Notebooks #3
Open
splimon wants to merge 8 commits into
…t and download via Kaiaulu

Adds five notebooks that build a pipeline to contextualize the GitHub Gold Standard sentiment dataset (7,122 comments) with GHTorrent project context and re-download comment data via Kaiaulu:
- Notebook 1: Load the GitHub Gold Standard sentiment CSV into a GHTorrent MySQL database
- Notebook 2: Explore GHTorrent tables to map sentiment comments to main project repos
- Notebook 3: Auto-generate Kaiaulu .yml config files for 82 main project repos
- Notebook 4: Download and parse commit comments via Kaiaulu
- Notebook 5: Download and parse PR inline comments via Kaiaulu
Revise 3 notebooks for sentiment dataset documentation:
- Notebook 1: Load the GitHub Gold Standard sentiment CSV into a GHTorrent MySQL database
- Notebook 2: Explore GHTorrent tables to map sentiment comments to canonical project repos
- Notebook 3: Auto-generate Kaiaulu .yml config files for 82 canonical project repos
- Rewrites Notebook 4 to query sentiment labels directly from MySQL and INNER JOIN them against Kaiaulu-downloaded comment data. Writes the output back into Kaiaulu's `rawdata` directory.
- Updates Notebooks 2 and 3 to align with the revised pipeline.
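The join described for Notebook 4 can be sketched roughly as follows. This is only an illustration using the Python standard library, not the notebook's actual code: the key column `comment_id` and the sample rows are hypothetical, while `polarity` and `text` are the column names mentioned elsewhere in this PR. In the real pipeline the labels would come from a MySQL query and the comments from Kaiaulu's parsed output.

```python
def inner_join(labels, comments, key="comment_id"):
    """Keep only comments that have a matching label row, merged into one dict."""
    # Index the label rows by key for constant-time lookup.
    labels_by_key = {row[key]: row for row in labels}
    joined = []
    for comment in comments:
        label = labels_by_key.get(comment[key])
        if label is None:
            continue  # no gold label for this comment, so an inner join drops it
        # On a column-name clash the label value wins (a simplifying assumption).
        joined.append({**comment, **label})
    return joined


# Hypothetical sample data standing in for the MySQL labels and Kaiaulu comments.
labels = [
    {"comment_id": "1", "polarity": "positive"},
    {"comment_id": "2", "polarity": "negative"},
]
comments = [
    {"comment_id": "2", "text": "this breaks the build"},
    {"comment_id": "3", "text": "unlabeled comment"},
]
joined = inner_join(labels, comments)
```

Only the comment with `comment_id` 2 survives, since it is the only one present on both sides.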
7 tasks
Member
carlosparadis
left a comment
Hi,
Can you let me know how big these files are, on average and in total? If this is several GBs we will likely piss someone off at GitHub HQ :^)
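One way to answer the size question is a short script like the one below. It is a sketch under assumptions: the `*.csv` pattern and the demo files are hypothetical, and the directory to scan would be wherever the PR's data files live.

```python
import tempfile
from pathlib import Path


def summarize_sizes(directory, pattern="*.csv"):
    """Return (file count, total bytes, mean bytes) for files matching pattern."""
    sizes = [p.stat().st_size for p in Path(directory).rglob(pattern) if p.is_file()]
    total = sum(sizes)
    mean = total / len(sizes) if sizes else 0.0
    return len(sizes), total, mean


# Demo on throwaway files; point summarize_sizes at the real data directory instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.csv").write_bytes(b"x" * 100)
    (Path(tmp) / "b.csv").write_bytes(b"x" * 300)
    count, total_bytes, mean_bytes = summarize_sizes(tmp)
```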
carlosparadis
requested changes
May 10, 2026
Member
carlosparadis
left a comment
@splimon In addition to the two comments below, I suggest you add a
- data/combined_project_labels/commit_comments.csv
- data/combined_project_labels/commit_pr_inline_comments.csv
- data/combined_project_labels/commit_and_pr_inline_comments.csv
You can then address sailuh/kaiaulu#347 (comment) with the url to data/combined_project_labels/commit_and_pr_inline_comments.csv
Member
Why does every file say "commit_comments_joined.cs" but akka does not?
Member
this should be removed from PR
Add notebooks/5_combine_all_projects.ipynb to merge per-project CSVs into three combined files: commit comments only, PR inline comments only, and all comments combined.
Replace akka_commit_comments.csv (raw Kaiaulu download, no polarity/text) with akka_sentiment_commit_comments_joined.csv.
Fix cakephp PR inline CSV column names to polarity/text for consistency with all other project CSVs.
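The merge step could look something like the sketch below. It is not the notebook's code: the glob patterns, the per-project file names, and the `combined` output directory are assumptions for illustration, and only the `polarity`/`text` columns come from this PR.

```python
import csv
import tempfile
from pathlib import Path


def combine_csvs(input_paths, output_path):
    """Concatenate CSV files into one, using the union of their columns."""
    rows, fieldnames = [], []
    for path in input_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)  # preserve first-seen column order
            rows.extend(reader)
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given project file did not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo with two hypothetical per-project files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a_commit_comments.csv").write_text(
        "polarity,text\npositive,nice fix\n", encoding="utf-8")
    (tmp / "b_pr_inline_comments.csv").write_text(
        "polarity,text\nnegative,this is wrong\n", encoding="utf-8")
    commit_files = sorted(tmp.glob("*_commit_comments.csv"))
    pr_files = sorted(tmp.glob("*_pr_inline_comments.csv"))
    out = tmp / "combined"
    out.mkdir()
    n_commit = combine_csvs(commit_files, out / "commit_comments.csv")
    n_pr = combine_csvs(pr_files, out / "pr_inline_comments.csv")
    n_all = combine_csvs(commit_files + pr_files, out / "all_comments.csv")
    with open(out / "all_comments.csv", newline="", encoding="utf-8") as handle:
        all_rows = list(csv.DictReader(handle))
```

Taking the union of columns keeps the merge tolerant of small schema differences between projects, which is also why consistent `polarity`/`text` names matter.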
Drop _kaiaulu or _gold suffixed columns from combined CSVs in Notebook 5. These extra columns were created when Notebook 4 joined two tables that had columns with the same name. Notebook 5 was including these extras in the combined output. Fixed by dropping any columns ending in _kaiaulu or _gold before saving.
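A minimal sketch of that column filter, assuming rows are held as dictionaries; the sample column names `body_kaiaulu` and `body_gold` are hypothetical, and the notebook itself may do this differently.

```python
def drop_suffixed_columns(rows, suffixes=("_kaiaulu", "_gold")):
    """Return copies of rows without any column whose name ends in a given suffix."""
    return [
        # str.endswith accepts a tuple, so both suffixes are checked in one call.
        {name: value for name, value in row.items() if not name.endswith(suffixes)}
        for row in rows
    ]


# Hypothetical row carrying leftover join columns.
rows = [
    {"polarity": "positive", "text": "nice", "body_kaiaulu": "nice", "body_gold": "nice"},
]
cleaned = drop_suffixed_columns(rows)
```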
Adds 3 notebooks that build a pipeline to contextualize the GitHub Gold Standard sentiment dataset (7,122 comments) with GHTorrent project context and re-download comment data via Kaiaulu: