Binary file added .DS_Store
Binary file not shown.
158 changes: 158 additions & 0 deletions claude/skills/notebook-cleanup/SKILL.md
@@ -0,0 +1,158 @@
---
name: notebook-cleanup
description: Clears all JupyterLab notebook cell outputs, lints code and structure, and generates HTML snapshots for review. HTML snapshots are committed to the repo in a directory mirroring the notebook's location.

---

# Notebook Cleanup Skill

Clean up and lint JupyterLab notebooks for commit by clearing all cell outputs, validating code quality and notebook structure, and generating HTML snapshots for reviewer convenience.

## Behavior

When invoked, this skill will:

1. **Discover notebooks** — Find all `.ipynb` files that have been modified (staged or unstaged) in the current Git repository by comparing the current branch against the main branch. Ask the user to confirm the list of notebooks before making any changes. If no modified notebooks are found, scan the entire repo and prompt the user to select which notebooks to clean.

2. **Detect notebook language** — Read the notebook's kernel metadata to determine the language (Python, R, etc.) and apply language-appropriate linting rules.

3. **Lint code cells** — Analyze all code cells for quality issues appropriate to the detected language:
- Unused imports or library loads
- Undefined or shadowed variables
- Style violations (naming conventions, line length, whitespace)
- Invalid syntax
- Hardcoded credentials
- Empty cells
Report findings with cell numbers and suggested fixes.

4. **Validate standards compliance** — Check adherence to PR standards:
- No cell outputs present in the committed `.ipynb` file
- No data files staged for commit unless publicly available
- Jira ticket referenced in the notebook or PR description

5. **Generate HTML snapshots** — For each notebook, export an HTML snapshot before clearing outputs. Store the HTML file in a `notebook-snapshots/` directory that mirrors the notebook's relative path in the repo. For example:
- `src/analysis/exploration.ipynb` → `notebook-snapshots/src/analysis/exploration.html`

6. **Ensure provenance and copyright boilerplate** — Check that the notebook ends with the required boilerplate cells. If missing, append them. For Python notebooks, the provenance section includes:

**Markdown cell:**
Provenance

Generate information about this notebook environment and the packages installed.

**Code cell:** `!date`

**Markdown cell:** `Conda and pip installed packages:`

**Code cell:** `!conda env export`

**Markdown cell:** `JupyterLab extensions:`

**Code cell:** `!jupyter labextension list`

**Markdown cell:** `Number of cores:`

**Code cell:** `!grep ^processor /proc/cpuinfo | wc -l`

**Markdown cell:** `Memory:`

**Code cell:** `!grep "^MemTotal:" /proc/meminfo`

**Markdown cell (copyright):**
--- Copyright Verily Life Sciences LLC

Use of this source code is governed by a BSD-style
license that can be found in the LICENSE file or at
https://developers.google.com/open-source/licenses/bsd

For non-Python notebooks, only the copyright cell is appended (provenance cells are skipped).

7. **Generate HTML snapshots** — For each notebook, export an HTML snapshot **before** clearing outputs. Store the HTML file in a `notebook-snapshots/` directory that mirrors the notebook's relative path in the repo. For example:
`src/analysis/exploration.ipynb` → `notebook-snapshots/src/analysis/exploration.html`

8. **Clear cell outputs** — Strip all cell outputs and execution counts from each `.ipynb` file in place, preserving the notebook structure and source code.

9. **Stage snapshot files** — Add the generated HTML snapshots to the Git staging area alongside the cleaned notebooks so they are included in the commit.

10. **Report results** — Summarize linting findings, what was cleaned, and where HTML snapshots were saved.
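
The language-detection, snapshot-path, and output-clearing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's actual implementation: the function names (`detect_language`, `snapshot_path`, `clear_outputs`) are hypothetical, and it assumes nbformat 4 notebooks, where cells live under a top-level `cells` key and the language is recorded in `metadata.kernelspec.language` or `metadata.language_info.name`.

```python
import json
from pathlib import PurePosixPath


def detect_language(nb: dict) -> str:
    """Return the notebook language from kernel metadata, or 'unknown'."""
    meta = nb.get("metadata", {})
    lang = (meta.get("kernelspec", {}).get("language")
            or meta.get("language_info", {}).get("name"))
    return (lang or "unknown").lower()


def snapshot_path(notebook_path: str, root: str = "notebook-snapshots") -> str:
    """Mirror the notebook's repo-relative path under the snapshot directory."""
    html_name = PurePosixPath(notebook_path).with_suffix(".html")
    return str(PurePosixPath(root) / html_name)


def clear_outputs(nb: dict) -> dict:
    """Strip outputs and execution counts from code cells, in place."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


def clean_file(path: str) -> None:
    """Load a notebook, clear its outputs, and write it back to the same path."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    clear_outputs(nb)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1, ensure_ascii=False)
        f.write("\n")
```

The snapshot must be exported before `clean_file` runs, otherwise the HTML would contain no outputs. Where nbconvert is available, `jupyter nbconvert --to html` and `jupyter nbconvert --clear-output --inplace` are likely a better fit than hand-rolled JSON handling; the sketch only shows what those steps amount to.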

## Usage

Invoke this skill when:
- Preparing notebooks for a pull request
- The user asks to clean, lint, or prepare notebooks for commit

## Requirements

- The working directory must be within a Git repository
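
One way to check this requirement and discover modified notebooks is sketched below. It is an illustration under stated assumptions rather than the skill's own logic: the helper names are hypothetical, the base branch is assumed to be called `main`, and untracked notebooks are not picked up because `git diff` only reports tracked files.

```python
import subprocess


def filter_notebooks(paths):
    """Keep unique .ipynb paths, preserving their original order."""
    seen, result = set(), []
    for p in paths:
        p = p.strip()
        if p.endswith(".ipynb") and p not in seen:
            seen.add(p)
            result.append(p)
    return result


def git_lines(*args):
    """Run a git command and return its stdout lines; raises if git fails."""
    out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    return out.stdout.splitlines()


def modified_notebooks(base="main"):
    """Notebooks that differ from `base`, including staged and unstaged edits."""
    # Outside a repository this git call exits non-zero, so check=True raises.
    if git_lines("rev-parse", "--is-inside-work-tree") != ["true"]:
        raise RuntimeError("not inside a Git work tree")
    # Compare the working tree against the base branch; skip deleted files.
    changed = git_lines("diff", "--name-only", "--diff-filter=d", base)
    return filter_notebooks(changed)
```

Calling `modified_notebooks()` from anywhere inside the repository returns paths relative to the repository root, which is the form the snapshot directory mirrors.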

## Example

User calls: /notebook-cleanup

**Output:**
Linting 3 notebooks...

src/analysis/exploration.ipynb:
⚠ Cell 3: unused import 'pandas as pd' (never referenced)
✓ Structure: title present, markdown sections found

src/analysis/validation.ipynb:
✓ Code: no issues found
✓ Structure: title present, markdown sections found

src/models/training_run.ipynb:
✓ Code: no issues found
⚠ Structure: no markdown heading in first cell
⚠ Structure: empty cell at position 5

Standards compliance:
✓ No data files staged
⚠ No Jira ticket reference found — ensure it is included in the PR description

Cleaned 3 notebooks (outputs cleared):
- src/analysis/exploration.ipynb
- src/analysis/validation.ipynb
- src/models/training_run.ipynb

HTML snapshots saved to notebook-snapshots/:
- notebook-snapshots/src/analysis/exploration.html
- notebook-snapshots/src/analysis/validation.html
- notebook-snapshots/src/models/training_run.html

All files staged for commit. Confirm with the user whether they would like the linting fixes applied.
74 changes: 31 additions & 43 deletions first_hour_on_vwb/creating_a_data_collection.ipynb
@@ -182,6 +182,7 @@
" self.input_name = wu.TextInputWidget(\"<WORKSPACE_NAME>\",\"Workspace Name:\").get()\n",
" self.input_description = wu.TextInputWidget(\"<DESCRIPTION>\",\"Description:\").get()\n",
" self.input_workspace_id = wu.TextInputWidget(\"<WORKSPACE_ID>\",\"Workspace ID:\").get()\n",
" self.input_pod = wu.TextInputWidget(\"<POD_ID>\",\"Pod ID:\").get()\n",
" self.output_workspace_id = widgets.Text()\n",
" self.output_workspace_id.value = self.input_workspace_id.value\n",
" self.button = wu.StyledButton('Create workspace','Click to create a new workspace','plus').get()\n",
@@ -190,7 +191,7 @@
" self.vb = widgets.VBox(\n",
" children = [self.label, self.warning,\n",
" self.input_name, self.input_description,\n",
" self.input_workspace_id,\n",
" self.input_workspace_id, self.input_pod,\n",
" self.button, self.output],\n",
" layout = wu.vbox_layout)\n",
" \n",
@@ -204,6 +205,7 @@
" f\"--id={self.input_workspace_id.value.strip()}\",\n",
" f\"--description={self.input_description.value.strip()}\",\n",
" f\"--name={self.input_name.value.strip()}\",\n",
" f\"--pod={self.input_pod.value.strip()}\",\n",
" ]\n",
" print('Running command to create workspace...')\n",
" print('\\n'.join(createWorkspaceCommandList))\n",
@@ -222,19 +224,20 @@
"tags": []
},
"source": [
"### Convert new workspace to data collection\n",
"<a id=\"convert-to-dc\"></a>\n",
"### Create a new data collection\n",
"<a id=\"create-dc\"></a>\n",
"\n",
"Now you'll convert your newly created workspace, to which you have added resources, into a data collection which can be shared with others and added to other workspaces. \n",
"Run the cell below to create a widget, then populate the widget's input fields and click the button to convert the workspace to a data collection. Please note that until you <a href=\"#publish-version\">publish a version</a> in the next section, your data collection will not appear in the data catalog.\n",
"Now you'll create a new data collection which can be shared with others and added to other workspaces. \n",
"Run the cell below to create a widget, then populate the widget's input fields and click the button to create a data collection. Please note that until you <a href=\"#publish-version\">publish a version</a> in the next section, your data collection will not appear in the data catalog.\n",
"\n",
"Widget input parameters include:\n",
"- `Workspace ID`: Automatically populated with the workspace ID of the workspace created in the previous step.\n",
"- `Workspace Name`: Must be a string. This value is displayed in the Data Collection modal once the workspace is converted to a data collection, so the value should communicate the intended purpose (e.g. `<STUDY_NAME> Data Collection`).<br> While the Workbench UI and this widget require a workspace name to be provided, the CLI does not; if no workspace name is provided to the CLI, a UUID is generated instead.\n",
"- `Workspace ID`: Must be unique and consist only of lowercase letters, numbers and underscores. Provide a workspace ID that suggests something about the contents of the data collection you'd like to create and include the date of its creation, such as `<STUDY_NAME>_<YYMMDD>_dc_ws`. *You cannot change the workspace ID after workspace creation.* \n",
"- `Short Description`: Must be a string. This description will be visible in the Add a Data Collection modal and should summarize the purpose and/or contents of your data collection.\n",
"\n",
"The output should resemble:\n",
"```\n",
"Workspace properties successfully updated.\n",
"Workspace successfully created.\n",
"ID: <STUDY_NAME>_<YYMMDD>_dc_ws\n",
"Name: <STUDY_NAME>-Data-<YYMMDD>\n",
"Description: <DESCRIPTION>\n",
@@ -247,7 +250,6 @@
"Created: YYYY-MM-DD\n",
"Last updated: YYYY-MM-DD\n",
"# Resources: <NUMBER_OF_RESOURCES>\n",
"Workspace properties successfully updated.\n",
"```"
]
},
@@ -259,53 +261,46 @@
},
"outputs": [],
"source": [
"class ConvertToDataCollectionWidget(object):\n",
"class CreateDataCollectionWidget(object):\n",
" def __init__(self,prev_widget):\n",
" self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')\n",
" self.workspace_ids = self.get_workspace_ids()\n",
" self.new_ws_id = prev_widget.get_workspace_id();\n",
" self.input_workspace_id = wu.DropdownInputWidget([self.new_ws_id],self.new_ws_id,\"Workspace ID:\").get()\n",
" self.input_name = wu.TextInputWidget(\"<DATA_COLLECTION_NAME>\",\"Data Collection Name:\").get()\n",
" self.input_data_collection_id = wu.TextInputWidget(\"<DATA_COLLECTION_ID>\",\"Data Collection ID:\").get()\n",
" self.input_pod = wu.TextInputWidget(\"<POD_ID>\",\"Pod ID:\").get()\n",
" self.input_short_description = wu.TextInputWidget(\"<SHORT_DESCRIPTION>\",\"Short Description:\").get()\n",
" self.button = wu.StyledButton('Convert to data collection','Click to convert to data collection','check').get()\n",
" self.button.on_click(self.convert_to_data_collection)\n",
" self.button = wu.StyledButton('Create data collection','Click to create data collection','check').get()\n",
" self.button.on_click(self.create_data_collection)\n",
" self.output = widgets.Output()\n",
" self.vb = widgets.VBox([\n",
" self.label,\n",
" self.input_workspace_id,\n",
" self.input_data_collection_id,\n",
" self.input_short_description,\n",
" self.input_pod,\n",
" self.input_name,\n",
" self.button,\n",
" self.output\n",
" ], layout=wu.vbox_layout)\n",
" \n",
"\n",
"\n",
" def get_workspace_id(self):\n",
" return self.input_workspace_id.value.strip()\n",
"\n",
" def get_workspace_ids(self):\n",
" result = subprocess.run([\"wb\",\"workspace\",\"list\",\"--format=JSON\"],capture_output=True,text=True)\n",
" ids_list = wu.list_workspace_ids(result.stdout)\n",
" # Insert empty string to display as value of dropdown until changed by user.\n",
" ids_list.insert(0, \" \")\n",
" return ids_list\n",
" \n",
" def convert_to_data_collection(self,b):\n",
" workspace_id = self.input_workspace_id.value\n",
" \n",
" def create_data_collection(self,b):\n",
" short_desc = self.input_short_description.value\n",
" with self.output:\n",
" prettyConvertToDataCollectionCommand = f\"\"\"wb workspace set-property \\\\\n",
" --workspace={workspace_id} \\\\\n",
" --properties=\\\"terra-type=data-collection,terra-workspace-short-description={short_desc}\\\"\n",
" \"\"\"\n",
" print(\"Running command to convert workspace to data collection...\")\n",
" print(prettyConvertToDataCollectionCommand)\n",
" print(\"Running command to create data collection...\")\n",
" print(\"Your data collection will be ready in less than one minute...\")\n",
" result = subprocess.run([\"wb\",\"workspace\",\"set-property\",\n",
" f\"--workspace={workspace_id}\",\n",
" result = subprocess.run([\"wb\",\"workspace\",\"create\",\n",
" f\"--id={self.input_data_collection_id.value.strip()}\",\n",
" f\"--name={self.input_name.value.strip()}\",\n",
" f\"--pod={self.input_pod.value.strip()}\",\n",
" f\"--properties=terra-type=data-collection,terra-workspace-short-description={short_desc}\"],\n",
" capture_output=True,text=True)\n",
" print(result.stderr) if not result.stdout else print(result.stdout)\n",
"\n",
"convert_to_dc_widget = ConvertToDataCollectionWidget(create_ws_widget)\n",
"display(convert_to_dc_widget.vb)"
"create_dc_widget = CreateDataCollectionWidget(create_ws_widget)\n",
"display(create_dc_widget.vb)"
]
},
{
@@ -429,7 +424,7 @@
" publishVersionResult = subprocess.run(publishVersionCommand, shell = True, capture_output = True, text = True, check = True)\n",
" print(publishVersionResult.stderr) if not publishVersionResult.stdout else print(publishVersionResult.stdout)\n",
"\n",
"publish_version_widget = PublishVersionWidget(convert_to_dc_widget)\n",
"publish_version_widget = PublishVersionWidget(create_dc_widget)\n",
"display(publish_version_widget.vb)"
]
},
@@ -683,13 +678,6 @@
"license that can be found in the LICENSE file or at \n",
"https://developers.google.com/open-source/licenses/bsd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
56 changes: 17 additions & 39 deletions first_hour_on_vwb/working_with_bq_resources.ipynb
@@ -73,43 +73,7 @@
"tags": []
},
"outputs": [],
"source": [
"from IPython.display import display, HTML\n",
"import ipywidgets as widgets\n",
"import json\n",
"import pandas as pd\n",
"import pandas_gbq\n",
"import os\n",
"import subprocess\n",
"import widget_utils as wu\n",
"\n",
"'''\n",
"Resolves bucket URL from bucket reference in workspace.\n",
"'''\n",
"def get_bucket_url_from_reference(resource_id):\n",
" BUCKET_CMD_OUTPUT = !wb resolve --name={bucket_reference}\n",
" BUCKET = BUCKET_CMD_OUTPUT[0]\n",
" return BUCKET\n",
"\n",
"'''\n",
"Resolves BigQuery dataset from a reference in workspace.\n",
"'''\n",
"def get_bq_dataset_from_reference(resource_id):\n",
" BQ_CMD_OUTPUT = !wb resolve --id={resource_id}\n",
" BQ_DATASET = BQ_CMD_OUTPUT[0]\n",
" return BQ_DATASET\n",
"\n",
"'''\n",
"Resolves current workspace ID from workspace description.\n",
"'''\n",
"def get_current_workspace_id():\n",
" WORKSPACE_CMD_OUTPUT = !wb workspace describe --format=json | jq --raw-output \".id\"\n",
" WORKSPACE_ID = WORKSPACE_CMD_OUTPUT[0]\n",
" return WORKSPACE_ID\n",
"\n",
"CURRENT_WORKSPACE_ID = get_current_workspace_id()\n",
"print(f'Workspace ID: {CURRENT_WORKSPACE_ID}')"
]
"source": "import pandas as pd\nimport pandas_gbq\n\n'''\nResolves bucket URL from bucket reference in workspace.\n'''\ndef get_bucket_url_from_reference(bucket_reference):\n BUCKET_CMD_OUTPUT = !wb resolve --name={bucket_reference}\n BUCKET = BUCKET_CMD_OUTPUT[0]\n return BUCKET\n\n'''\nResolves current workspace ID from workspace description.\n'''\ndef get_current_workspace_id():\n WORKSPACE_CMD_OUTPUT = !wb workspace describe --format=json | jq --raw-output \".id\"\n WORKSPACE_ID = WORKSPACE_CMD_OUTPUT[0]\n return WORKSPACE_ID\n\nCURRENT_WORKSPACE_ID = get_current_workspace_id()\nprint(f'Workspace ID: {CURRENT_WORKSPACE_ID}')"
},
{
"cell_type": "markdown",
@@ -219,7 +183,12 @@
},
"outputs": [],
"source": [
"%bigquery_stats bigquery-public-data.human_genome_variants.1000_genomes_pedigree"
"from google.cloud import bigquery\n",
"client = bigquery.Client()\n",
"table = client.get_table(\"bigquery-public-data.human_genome_variants.1000_genomes_pedigree\")\n",
"print(f\"Rows: {table.num_rows}, Size: {table.num_bytes} bytes\")\n",
"for field in table.schema:\n",
" print(f\" {field.name}: {field.field_type}\")"
]
},
{
@@ -233,6 +202,15 @@
"Run the cell below to total the number of distinct families represented in the 1000 Genomes dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext google.cloud.bigquery"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -474,4 +452,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}