verily-src · kvo3 · May 11, 2026
diff --git a/.DS_Store b/.DS_Store
diff --git a/claude/skills/notebook-cleanup/SKILL.md b/claude/skills/notebook-cleanup/SKILL.md
@@ -0,0 +1,158 @@
+---
+  name: notebook-cleanup
+  description: Clears all JupyterLab notebook cell outputs, lints code and structure, and generates HTML snapshots for review. HTML snapshots are committed to the repo in a directory mirroring the notebook's location.
+
+  ---
+
+  # Notebook Cleanup Skill
+
+  Clean up and lint JupyterLab notebooks for commit by clearing all cell outputs, validating code quality and notebook structure, and generating HTML snapshots for reviewer convenience.
+
+  ## Behavior
+
+  When invoked, this skill will:
+
+  1. **Discover notebooks** — Find all `.ipynb` files that have been modified (staged or unstaged) in the current Git repository, comparing the changes on the existing branch to the main branch. You should ask the user to confirm the notebooks before proceeding with any changes. If no modified notebooks are found, scan the entire repo and prompt the user to select which notebooks to clean.
+
+  2. **Detect notebook language** — Read the notebook's kernel metadata to determine the language (Python, R, etc.) and apply language-appropriate linting rules.
+
+  3. **Lint code cells** — Analyze all code cells for quality issues appropriate to the detected language:
+     - Unused imports or library loads
+     - Undefined or shadowed variables
+     - Style violations (naming conventions, line length, whitespace)
+     - Invalid syntax
+     - Hardcoded credentials
+     - Empty cells
+     Report findings with cell numbers and suggested fixes.
+
+  4. **Validate standards compliance** — Check adherence to PR standards:
+     - No cell outputs present in the committed `.ipynb` file
+     - No data files staged for commit unless publicly available
+     - Jira ticket referenced in the notebook or PR description
+
+  5. **Generate HTML snapshots** — For each notebook, export an HTML snapshot before clearing outputs. Store the HTML file in a `notebook-snapshots` directory that mirrors the notebook's relative path in the repo. For example:
+     - `src/analysis/exploration.ipynb` → `notebook-snapshots/src/analysis/exploration.html`
+
+ 6. **Ensure provenance and copyright boilerplate** — Check that the notebook ends with the required boilerplate cells. If missing, append them. For Python notebooks, the provenance section includes:
+
+    **Markdown cell:**
+      Provenance
+
+      Generate information about this notebook environment and the packages installed.
+
+    **Code cell:** `!date`
+
+    **Markdown cell:** `Conda and pip installed packages:`
+
+    **Code cell:** `!conda env export`
+
+    **Markdown cell:** `JupyterLab extensions:`
+
+    **Code cell:** `!jupyter labextension list`
+
+    **Markdown cell:** `Number of cores:`
+
+    **Code cell:** `!grep ^processor /proc/cpuinfo | wc -l`
+
+    **Markdown cell:** `Memory:`
+
+    **Code cell:** `!grep "^MemTotal:" /proc/meminfo`
+
+    **Markdown cell (copyright):**
+    ---   Copyright  Verily Life Sciences LLC
+
+      Use of this source code is governed by a BSD-style
+      license that can be found in the LICENSE file or at
+      https://developers.google.com/open-source/licenses/bsd
+
+    For non-Python notebooks, only the copyright cell is appended (provenance cells are skipped).
+
+  7. **Generate HTML snapshots** — For each notebook, export an HTML snapshot **before** clearing outputs. Store the HTML file in a `notebook-snapshots/` directory that mirrors the notebook's relative path in the repo. For example:
+      `src/analysis/exploration.ipynb` → `notebook-snapshots/src/analysis/exploration.html`
+
+  8. **Clear cell outputs** — Strip all cell outputs and execution counts from each `.ipynb` file in place, preserving the notebook structure and source code.
+
+  9. **Stage snapshot files** — Add the generated HTML snapshots to the Git staging area alongside the cleaned notebooks so they are included in the commit.
+
+  10. **Report results** — Summarize linting findings, what was cleaned, and where HTML snapshots were saved.
+
+  ## Usage
+
+  Invoke this skill when:
+  - Preparing notebooks for a pull request
+  - The user asks to clean, lint, or prepare notebooks for commit
+
+  ## Requirements
+
+  - The working directory must be within a Git repository
+
+  ## Example
+
+  User calls: /notebook-cleanup
+
+  **Output:**
+  Linting 3 notebooks...
+
+  src/analysis/exploration.ipynb:
+    ⚠ Cell 3: unused import 'pandas as pd' (never referenced)
+    ✓ Structure: title present, markdown sections found
+
+  src/analysis/validation.ipynb:
+    ✓ Code: no issues found
+    ✓ Structure: title present, markdown sections found
+
+  src/models/training_run.ipynb:
+    ✓ Code: no issues found
+    ⚠ Structure: no markdown heading in first cell
+    ⚠ Structure: empty cell at position 5
+
+  Standards compliance:
+    ✓ No data files staged
+    ⚠ No Jira ticket reference found — ensure it is included in the PR description
+
+  Cleaned 3 notebooks (outputs cleared):
+  - src/analysis/exploration.ipynb
+  - src/analysis/validation.ipynb
+  - src/models/training_run.ipynb
+
+  HTML snapshots saved to notebook-snapshots/:
+  - notebook-snapshots/src/analysis/exploration.html
+  - notebook-snapshots/src/analysis/validation.html
+  - notebook-snapshots/src/models/training_run.html
+
+  All files staged for commit. Confirm with user if they would like the linting fixes to be made.
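
Steps 7 and 8 of the skill (mirror the notebook path under `notebook-snapshots/`, then strip outputs and execution counts) could be sketched roughly as follows. This is a minimal illustration using only the standard library and assuming nbformat 4 notebooks; the helper names `snapshot_path`, `clear_outputs`, and `clear_outputs_in_place` are hypothetical and not part of the skill. The HTML export itself is not shown; `jupyter nbconvert --to html` is one common way to produce it and must run before outputs are cleared.

```python
import json
from pathlib import Path, PurePosixPath


def snapshot_path(notebook_path, snapshot_root="notebook-snapshots"):
    """Mirror a repo-relative notebook path under the snapshot directory."""
    html_name = PurePosixPath(notebook_path).with_suffix(".html")
    return str(PurePosixPath(snapshot_root) / html_name)


def clear_outputs(notebook):
    """Strip outputs and execution counts from the code cells of a parsed notebook dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_outputs_in_place(path):
    """Rewrite an .ipynb file on disk with its outputs cleared (hypothetical helper)."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    clear_outputs(notebook)
    nb_path.write_text(
        json.dumps(notebook, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

Keeping the path mapping and the cell cleanup as pure functions makes the ordering constraint (snapshot first, clear second) easy to check separately from any file I/O.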
diff --git a/first_hour_on_vwb/creating_a_data_collection.ipynb b/first_hour_on_vwb/creating_a_data_collection.ipynb
@@ -182,6 +182,7 @@
     "        self.input_name = wu.TextInputWidget(\"<WORKSPACE_NAME>\",\"Workspace Name:\").get()\n",
     "        self.input_description = wu.TextInputWidget(\"<DESCRIPTION>\",\"Description:\").get()\n",
     "        self.input_workspace_id = wu.TextInputWidget(\"<WORKSPACE_ID>\",\"Workspace ID:\").get()\n",
+    "        self.input_pod = wu.TextInputWidget(\"<POD_ID>\",\"Pod ID:\").get()\n",
     "        self.output_workspace_id = widgets.Text()\n",
     "        self.output_workspace_id.value = self.input_workspace_id.value\n",
     "        self.button = wu.StyledButton('Create workspace','Click to create a new workspace','plus').get()\n",
@@ -190,7 +191,7 @@
     "        self.vb = widgets.VBox(\n",
     "            children = [self.label, self.warning,\n",
     "                        self.input_name, self.input_description,\n",
-    "                        self.input_workspace_id,\n",
+    "                        self.input_workspace_id, self.input_pod,\n",
     "                        self.button, self.output],\n",
     "            layout = wu.vbox_layout)\n",
     "        \n",
@@ -204,6 +205,7 @@
     "                f\"--id={self.input_workspace_id.value.strip()}\",\n",
     "                f\"--description={self.input_description.value.strip()}\",\n",
     "                f\"--name={self.input_name.value.strip()}\",\n",
+    "                f\"--pod={self.input_pod.value.strip()}\",\n",
     "            ]\n",
     "            print('Running command to create workspace...')\n",
     "            print('\\n'.join(createWorkspaceCommandList))\n",
@@ -222,19 +224,20 @@
     "tags": []
    },
    "source": [
-    "### Convert new workspace to data collection\n",
-    "<a id=\"convert-to-dc\"></a>\n",
+    "### Create a new data collection\n",
+    "<a id=\"create-dc\"></a>\n",
     "\n",
-    "Now you'll convert your newly created workspace, to which you have added resources, into a data collection which can be shared with others and added to other workspaces. \n",
-    "Run the cell below to create a widget, then populate the widget's input fields and click the button to convert the workspace to a data collection. Please note that until you <a href=\"#publish-version\">publish a version</a> in the next section, your data collection will not appear in the data catalog.\n",
+    "Now you'll create a new data collection which can be shared with others and added to other workspaces. \n",
+    "Run the cell below to create a widget, then populate the widget's input fields and click the button to create a data collection. Please note that until you <a href=\"#publish-version\">publish a version</a> in the next section, your data collection will not appear in the data catalog.\n",
     "\n",
     "Widget input parameters include:\n",
-    "- `Workspace ID`: Automatically populated with the workspace ID of the workspace created in the previous step.\n",
+    "- `Data Collection Name`: Must be a string. This value is displayed in the Data Collection modal, so it should communicate the intended purpose (e.g. `<STUDY_NAME> Data Collection`).<br> While the Workbench UI and this widget require a name to be provided, the CLI does not; if no name is provided to the CLI, a UUID is generated instead.\n",
+    "- `Data Collection ID`: Must be unique and consist only of lowercase letters, numbers and underscores. Provide an ID that suggests something about the contents of the data collection you'd like to create and include the date of its creation, such as `<STUDY_NAME>_<YYMMDD>_dc_ws`. *You cannot change the ID after creation.* \n",
+    "- `Pod ID`: The pod passed to `wb workspace create` through the `--pod` flag.\n",
     "- `Short Description`: Must be a string. This description will be visible in the Add a Data Collection modal and should summarize the purpose and/or contents of your data collection.\n",
     "\n",
     "The output should resemble:\n",
     "```\n",
-    "Workspace properties successfully updated.\n",
+    "Workspace successfully created.\n",
     "ID:                <STUDY_NAME>_<YYMMDD>_dc_ws\n",
     "Name:               <STUDY_NAME>-Data-<YYMMDD>\n",
     "Description:       <DESCRIPTION>\n",
@@ -247,7 +250,6 @@
     "Created:           YYYY-MM-DD\n",
     "Last updated:      YYYY-MM-DD\n",
     "# Resources:       <NUMBER_OF_RESOURCES>\n",
-    "Workspace properties successfully updated.\n",
     "```"
    ]
   },
@@ -259,53 +261,46 @@
    },
    "outputs": [],
    "source": [
-    "class ConvertToDataCollectionWidget(object):\n",
+    "class CreateDataCollectionWidget(object):\n",
     "    def __init__(self,prev_widget):\n",
     "        self.label = widgets.Label(value = 'Please provide appropriate values in the input boxes.')\n",
-    "        self.workspace_ids = self.get_workspace_ids()\n",
-    "        self.new_ws_id = prev_widget.get_workspace_id();\n",
-    "        self.input_workspace_id = wu.DropdownInputWidget([self.new_ws_id],self.new_ws_id,\"Workspace ID:\").get()\n",
+    "        self.input_name = wu.TextInputWidget(\"<DATA_COLLECTION_NAME>\",\"Data Collection Name:\").get()\n",
+    "        self.input_data_collection_id = wu.TextInputWidget(\"<DATA_COLLECTION_ID>\",\"Data Collection ID:\").get()\n",
+    "        self.input_pod = wu.TextInputWidget(\"<POD_ID>\",\"Pod ID:\").get()\n",
     "        self.input_short_description = wu.TextInputWidget(\"<SHORT_DESCRIPTION>\",\"Short Description:\").get()\n",
-    "        self.button = wu.StyledButton('Convert to data collection','Click to convert to data collection','check').get()\n",
-    "        self.button.on_click(self.convert_to_data_collection)\n",
+    "        self.button = wu.StyledButton('Create data collection','Click to create data collection','check').get()\n",
+    "        self.button.on_click(self.create_data_collection)\n",
     "        self.output = widgets.Output()\n",
     "        self.vb = widgets.VBox([\n",
     "            self.label,\n",
-    "            self.input_workspace_id,\n",
+    "            self.input_data_collection_id,\n",
     "            self.input_short_description,\n",
+    "            self.input_pod,\n",
+    "            self.input_name,\n",
     "            self.button,\n",
     "            self.output\n",
     "        ], layout=wu.vbox_layout)\n",
-    "    \n",
+    "\n",
+    "\n",
     "    def get_workspace_id(self):\n",
-    "        return self.input_workspace_id.value.strip()\n",
+    "        return self.input_data_collection_id.value.strip()\n",
-    "\n",
-    "    def get_workspace_ids(self):\n",
-    "        result = subprocess.run([\"wb\",\"workspace\",\"list\",\"--format=JSON\"],capture_output=True,text=True)\n",
-    "        ids_list = wu.list_workspace_ids(result.stdout)\n",
-    "        # Insert empty string to display as value of dropdown until changed by user.\n",
-    "        ids_list.insert(0, \" \")\n",
-    "        return ids_list\n",
     "    \n",
-    "    def convert_to_data_collection(self,b):\n",
-    "        workspace_id = self.input_workspace_id.value\n",
+    "    \n",
+    "    def create_data_collection(self,b):\n",
     "        short_desc = self.input_short_description.value\n",
     "        with self.output:\n",
-    "            prettyConvertToDataCollectionCommand = f\"\"\"wb workspace set-property \\\\\n",
-    "            --workspace={workspace_id} \\\\\n",
-    "            --properties=\\\"terra-type=data-collection,terra-workspace-short-description={short_desc}\\\"\n",
-    "            \"\"\"\n",
-    "            print(\"Running command to convert workspace to data collection...\")\n",
-    "            print(prettyConvertToDataCollectionCommand)\n",
+    "            print(\"Running command to create data collection...\")\n",
     "            print(\"Your data collection will be ready in less than one minute...\")\n",
-    "            result = subprocess.run([\"wb\",\"workspace\",\"set-property\",\n",
-    "                                     f\"--workspace={workspace_id}\",\n",
+    "            result = subprocess.run([\"wb\",\"workspace\",\"create\",\n",
+    "                                     f\"--id={self.input_data_collection_id.value.strip()}\",\n",
+    "                                     f\"--name={self.input_name.value.strip()}\",\n",
+    "                                     f\"--pod={self.input_pod.value.strip()}\",\n",
     "                                     f\"--properties=terra-type=data-collection,terra-workspace-short-description={short_desc}\"],\n",
     "                                    capture_output=True,text=True)\n",
     "            print(result.stderr) if not result.stdout else print(result.stdout)\n",
     "\n",
-    "convert_to_dc_widget = ConvertToDataCollectionWidget(create_ws_widget)\n",
-    "display(convert_to_dc_widget.vb)"
+    "create_dc_widget = CreateDataCollectionWidget(create_ws_widget)\n",
+    "display(create_dc_widget.vb)"
    ]
   },
   {
@@ -429,7 +424,7 @@
     "            publishVersionResult = subprocess.run(publishVersionCommand, shell = True, capture_output = True, text = True, check = True)\n",
     "            print(publishVersionResult.stderr) if not publishVersionResult.stdout else print(publishVersionResult.stdout)\n",
     "\n",
-    "publish_version_widget = PublishVersionWidget(convert_to_dc_widget)\n",
+    "publish_version_widget = PublishVersionWidget(create_dc_widget)\n",
     "display(publish_version_widget.vb)"
    ]
   },
@@ -683,13 +678,6 @@
     "license that can be found in the LICENSE file or at   \n",
     "https://developers.google.com/open-source/licenses/bsd"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {

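The `create_data_collection` handler above passes widget values straight to `wb workspace create`. A small sketch of how the ID rule stated in the notebook text (lowercase letters, numbers, and underscores only) could be checked before the command is built is shown below. The function name `build_create_command` and the regular expression are assumptions for illustration; the flags are the ones used in this diff, and the real CLI may enforce additional rules.

```python
import re

# Assumed pattern, derived from the notebook text: lowercase letters, digits, underscores.
ID_PATTERN = re.compile(r"[a-z0-9_]+")


def build_create_command(dc_id, name, pod, short_desc):
    """Validate the ID and return the argument list for subprocess.run (hypothetical helper)."""
    dc_id = dc_id.strip()
    if not ID_PATTERN.fullmatch(dc_id):
        raise ValueError(
            f"Invalid ID {dc_id!r}: use only lowercase letters, numbers and underscores."
        )
    properties = (
        "terra-type=data-collection,"
        f"terra-workspace-short-description={short_desc.strip()}"
    )
    return [
        "wb", "workspace", "create",
        f"--id={dc_id}",
        f"--name={name.strip()}",
        f"--pod={pod.strip()}",
        f"--properties={properties}",
    ]
```

Building the list in one place also makes it straightforward to print the exact command for the user before running it, as the workspace-creation widget does.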
diff --git a/first_hour_on_vwb/working_with_bq_resources.ipynb b/first_hour_on_vwb/working_with_bq_resources.ipynb
@@ -73,43 +73,7 @@
     "tags": []
    },
    "outputs": [],
-   "source": [
-    "from IPython.display import display, HTML\n",
-    "import ipywidgets as widgets\n",
-    "import json\n",
-    "import pandas as pd\n",
-    "import pandas_gbq\n",
-    "import os\n",
-    "import subprocess\n",
-    "import widget_utils as wu\n",
-    "\n",
-    "'''\n",
-    "Resolves bucket URL from bucket reference in workspace.\n",
-    "'''\n",
-    "def get_bucket_url_from_reference(resource_id):\n",
-    "    BUCKET_CMD_OUTPUT = !wb resolve --name={bucket_reference}\n",
-    "    BUCKET = BUCKET_CMD_OUTPUT[0]\n",
-    "    return BUCKET\n",
-    "\n",
-    "'''\n",
-    "Resolves BigQuery dataset from a reference in workspace.\n",
-    "'''\n",
-    "def get_bq_dataset_from_reference(resource_id):\n",
-    "    BQ_CMD_OUTPUT = !wb resolve --id={resource_id}\n",
-    "    BQ_DATASET = BQ_CMD_OUTPUT[0]\n",
-    "    return BQ_DATASET\n",
-    "\n",
-    "'''\n",
-    "Resolves current workspace ID from workspace description.\n",
-    "'''\n",
-    "def get_current_workspace_id():\n",
-    "    WORKSPACE_CMD_OUTPUT = !wb workspace describe --format=json | jq --raw-output \".id\"\n",
-    "    WORKSPACE_ID = WORKSPACE_CMD_OUTPUT[0]\n",
-    "    return WORKSPACE_ID\n",
-    "\n",
-    "CURRENT_WORKSPACE_ID = get_current_workspace_id()\n",
-    "print(f'Workspace ID: {CURRENT_WORKSPACE_ID}')"
-   ]
+   "source": "import pandas as pd\nimport pandas_gbq\n\n'''\nResolves bucket URL from bucket reference in workspace.\n'''\ndef get_bucket_url_from_reference(bucket_reference):\n    BUCKET_CMD_OUTPUT = !wb resolve --name={bucket_reference}\n    BUCKET = BUCKET_CMD_OUTPUT[0]\n    return BUCKET\n\n'''\nResolves current workspace ID from workspace description.\n'''\ndef get_current_workspace_id():\n    WORKSPACE_CMD_OUTPUT = !wb workspace describe --format=json | jq --raw-output \".id\"\n    WORKSPACE_ID = WORKSPACE_CMD_OUTPUT[0]\n    return WORKSPACE_ID\n\nCURRENT_WORKSPACE_ID = get_current_workspace_id()\nprint(f'Workspace ID: {CURRENT_WORKSPACE_ID}')"
   },
   {
    "cell_type": "markdown",
@@ -219,7 +183,12 @@
    },
    "outputs": [],
    "source": [
-    "%bigquery_stats bigquery-public-data.human_genome_variants.1000_genomes_pedigree"
+    "from google.cloud import bigquery\n",
+    "client = bigquery.Client()\n",
+    "table = client.get_table(\"bigquery-public-data.human_genome_variants.1000_genomes_pedigree\")\n",
+    "print(f\"Rows: {table.num_rows}, Size: {table.num_bytes} bytes\")\n",
+    "for field in table.schema:\n",
+    "    print(f\"  {field.name}: {field.field_type}\")"
    ]
   },
   {
@@ -233,6 +202,15 @@
     "Run the cell below to total the number of distinct families represented in the 1000 Genomes dataset."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext google.cloud.bigquery"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -474,4 +452,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
+}
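
The cell that replaces `%bigquery_stats` prints the row count, size, and schema of a table object. The same formatting can be pulled into a small function and exercised without a BigQuery connection, as in the sketch below. `summarize_table` is a hypothetical name, and the stand-in object with its values is made up for illustration; it only mimics the `num_rows`, `num_bytes`, and `schema` attributes that the notebook cell reads from `client.get_table(...)`.

```python
from types import SimpleNamespace


def summarize_table(table):
    """Format row count, size, and schema for any object shaped like a BigQuery table."""
    lines = [f"Rows: {table.num_rows}, Size: {table.num_bytes} bytes"]
    lines += [f"  {field.name}: {field.field_type}" for field in table.schema]
    return "\n".join(lines)


# Stand-in for a real table object; these values are illustrative only.
fake_table = SimpleNamespace(
    num_rows=2,
    num_bytes=128,
    schema=[SimpleNamespace(name="example_column", field_type="STRING")],
)
print(summarize_table(fake_table))
```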