diff --git a/docs/as1/api.md b/docs/as1/api.md new file mode 100644 index 0000000..7c510ad --- /dev/null +++ b/docs/as1/api.md @@ -0,0 +1,3 @@ +# As1 API + +For documentation on functions in the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package for accessing *Anopheles stephensi* data, please visit the [As1 API docs page](https://malariagen.github.io/malariagen-data-python/latest/As1.html). diff --git a/docs/as1/as1.ipynb b/docs/as1/as1.ipynb new file mode 100644 index 0000000..deaa976 --- /dev/null +++ b/docs/as1/as1.ipynb @@ -0,0 +1,990 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "LBNBl2exUYWu" + }, + "source": [ + "# As1\n", + "\n", + "The **[As1](as1): _Anopheles stephensi_ data resource** contains single nucleotide polymorphism (SNP) calls from whole-genome sequencing of 645 mosquitoes.\n", + "\n", + "More information about this release can be found on the [data resource website](https://www.malariagen.net/data_package/as1-anopheles-stephensi-data-resource/). \n", + "\n", + "This page provides an introduction to the open data resources released as part of `As1`. \n", + "\n", + "If you have any questions about this guide or how to use the data, please [start a new discussion](https://github.com/malariagen/vector-public-data/discussions/new) on the malariagen/vector-open-data repo on GitHub. If you find any bugs, please [raise an issue](https://github.com/malariagen/vector-public-data/issues/new/choose)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kJqs4cXppk8j" + }, + "source": [ + "## Terms of use\n", + "\n", + "Data from this project will be made publicly available before journal publication, subject to the following publication embargo: unless otherwise stated, analyses of project data are ongoing and publications are in preparation by project partners, and it is not permitted to use project data for publication (including any type of communication with the general public) without prior permission from the originating partner studies. The publication embargo will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.\n", + "\n", + "Although malaria is generally an endemic rather than an epidemic disease, and the focus of this project is on surveillance of disease vectors rather than pathogens, our data terms of use build on MalariaGEN's approach to data sharing, and adopt norms which have been established for rapid sharing of pathogen genomic data during disease outbreaks. The primary rationale for this approach is that malaria remains a public health emergency, where ethically appropriate and rapid sharing of genomic surveillance data can help to detect and respond to biological threats such as new forms of insecticide resistance, and to adapt malaria vector control strategies to different settings and changing circumstances.\n", + "\n", + "The publication embargo for all data on this release will expire on the **5th of April 2028**. 
\n", + "\n", + "If you have any questions about the terms of use, please email [support@malariagen.net](mailto:support@malariagen.net)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iNSicUCtpk8j" + }, + "source": [ + "\n", + "## Partner studies\n", + "\n", + "All of the samples were contributed and sequenced as part of the [Controlling Emergent Anopheles stephensi in Sudan and Ethiopia (CEASE) project](https://wellcome.org/research-funding/funding-portfolio/funded-grants/controlling-emergent-anopheles-stephensi-ethiopia).\n", + "\n", + "The samples were contributed by partner institutions from various countries. The name and primary institution of the lead principal investigator(s) contributing samples to each study, and the sample country of origin, are detailed below. \n", + "\n", + "Enquiries about the samples and studies may be directed in the first instance to David Weetman (david.weetman@lstmed.ac.uk) or Martin Donnelly (martin.donnelly@lstmed.ac.uk).\n", + "\n", + "### 1363-VO-ET-GADISA-VMF00316 (Ethiopia)\n", + "\n", + "* Endalamaw Gadisa, Armauer Hansen Research Institute, Ethiopia.\n", + "\n", + "### 1364-VO-SD-KAFY-VMF00317 (Sudan)\n", + "\n", + "* Hmooda Toto Kafy, University of Khartoum, Sudan.\n", + "* Elfatih Malik, University of Khartoum, Sudan.\n", + "\n", + "### 1365-VO-DJ-ADBI-VMF00318 (Djibouti)\n", + "\n", + "* Bouh Abdi Khaireh, Association Mutualis, Djibouti.\n", + "\n", + "### 1366-VO-YE-ALLAN-VMF00319 (Yemen)\n", + "\n", + "* Richard Allan, MENTOR Initiative, United Kingdom.\n", + "\n", + "### 1367-VO-AF-DONNELLY-VMF00320 (Afghanistan)\n", + "\n", + "* Martin Donnelly, Liverpool School of Tropical Medicine, United Kingdom.\n", + "\n", + "### 1368-VO-PK-DONNELLY-VMF00321 (Pakistan)\n", + "\n", + "* Martin Donnelly, Liverpool School of Tropical Medicine, United Kingdom.\n", + "\n", + "### 1369-VO-SA-AL-NAZAWI-VMF00322 (Saudi Arabia)\n", + "\n", + "* Ashwaq Al-Nazawi, Jazan University, Saudi Arabia. 
\n", + "\n", + "### 1370-VO-IR-ENAYATI-VMF00323 (Iran)\n", + "\n", + "* Ahmadali Enayati, Mazandaran University of Medical Sciences, Iran.\n", + "\n", + "### 1385-VO-DJ-WEETMAN-VMF00338 (United Kingdom)\n", + "\n", + "* David Weetman, Liverpool School of Tropical Medicine, United Kingdom.\n", + "* N.B. These are colony mosquitoes derived from wild-collected samples in Djibouti.\n", + "\n", + "### 1386-VO-KE-OCHOMO-VMF00339 (Kenya)\n", + "\n", + "* Eric Ochomo, Kenya Medical Research Institute (KEMRI), Kenya.\n", + "\n", + "### 1458-VO-ET-YEWHALAW-VMF00340 (Ethiopia)\n", + "\n", + "* Delenasaw Yewhalaw, Jimma University, Ethiopia.\n", + "\n", + "### 1459-VO-SD-AHMED-VMF00342 (Sudan)\n", + "\n", + "* Ayman Ahmed, University of Khartoum, Sudan.\n", + " \n", + "### thakare-2022\n", + "\n", + "* Previously published data from [Thakare _et al._, 2022](https://www.nature.com/articles/s41598-022-07462-3).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5RHbe7N6pk8k" + }, + "source": [ + "## Whole-genome sequencing and variant calling\n", + "\n", + "All samples in `As1` have been sequenced individually to high coverage using Illumina technology by Novogene Ltd. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs). After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. Both the raw sequence data and the curated variant calls are openly available for download and analysis. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9Hfchko2pk8l" + }, + "source": [ + "## Data hosting\n", + "\n", + "Data from `As1` are hosted by several different services. 
\n", + "\n", + "The SNP data have also been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as [Google Colab](https://colab.research.google.com/). Further information about analysing these data in the cloud is provided in the [cloud data access guide](cloud)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lTJ_EnvOpk8l" + }, + "source": [ + "## Sample sets\n", + "\n", + "The samples included in `As1` have been organised into 13 sample sets. \n", + "\n", + "Each sample set corresponds to a set of mosquito specimens from a contributing study. Study details can be found in the partner studies section above." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:05:09.844381Z", + "iopub.status.busy": "2026-04-05T04:05:09.844101Z", + "iopub.status.idle": "2026-04-05T04:05:11.705969Z", + "shell.execute_reply": "2026-04-05T04:05:11.704899Z", + "shell.execute_reply.started": "2026-04-05T04:05:09.844351Z" + }, + "id": "hGA4d7Yrpk8m", + "outputId": "c29827c1-0361-4926-c227-8f6e76c2a497", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -qq malariagen_data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-05T04:05:11.706973Z", + "iopub.status.busy": "2026-04-05T04:05:11.706697Z", + "iopub.status.idle": "2026-04-05T04:05:17.371545Z", + "shell.execute_reply": "2026-04-05T04:05:17.370432Z", + "shell.execute_reply.started": "2026-04-05T04:05:11.706939Z" + }, + "id": "AnmzLmEgpk8n", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + 
"data": { + "application/javascript": [ + "'use strict';\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " const force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "const JS_MIME_TYPE = 'application/javascript';\n", + " const HTML_MIME_TYPE = 'text/html';\n", + " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", + " const CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " const script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " function drop(id) {\n", + " const view = Bokeh.index.get_by_id(id)\n", + " if (view != null) {\n", + " view.model.document.clear()\n", + " Bokeh.index.delete(view)\n", + " }\n", + " }\n", + "\n", + " const cell = handle.cell;\n", + "\n", + " const id = cell.output_area._bokeh_element_id;\n", + " const server_id = cell.output_area._bokeh_server_id;\n", + "\n", + " // Clean up Bokeh references\n", + " if (id != null) {\n", + " drop(id)\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd_clean, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " const id = msg.content.text.trim()\n", + " drop(id)\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", 
+ " cell.notebook.kernel.execute(cmd_destroy);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " const output_area = handle.output_area;\n", + " const output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " return\n", + " }\n", + "\n", + " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " const bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " const script_attrs = bk_div.children[0].attributes;\n", + " for (let i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " const toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " const 
props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + " * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? */\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " const events = require('base/js/events');\n", + " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " const NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded(error = null) {\n", + " const el = document.getElementById(null);\n", + " if (el != null) {\n", + " const html = (() => {\n", + " if (typeof root.Bokeh === \"undefined\") {\n", + " if (error == null) {\n", + " return \"BokehJS is loading ...\";\n", + " } else {\n", + " return \"BokehJS failed to load.\";\n", + " }\n", + " } else {\n", + " const prefix = `BokehJS ${root.Bokeh.version}`;\n", + " if (error == null) {\n", + " return `${prefix} successfully loaded.`;\n", + " } else {\n", + " return `${prefix} encountered errors while loading and may not function as expected.`;\n", + " }\n", + " }\n", + " })();\n", + " el.innerHTML = html;\n", + "\n", + " if (error != null) {\n", + " const wrapper = document.createElement(\"div\");\n", + " wrapper.style.overflow = \"auto\";\n", + " wrapper.style.height = \"5em\";\n", + " wrapper.style.resize = \"vertical\";\n", + " const content = document.createElement(\"div\");\n", + " content.style.fontFamily = \"monospace\";\n", + " content.style.whiteSpace = \"pre-wrap\";\n", + " content.style.backgroundColor = \"rgb(255, 221, 221)\";\n", + " content.textContent = error.stack ?? 
error.toString();\n", + " wrapper.append(content);\n", + " el.append(wrapper);\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(() => display_loaded(error), 100);\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " 
document.body.appendChild(element);\n", + " }\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n", + " const css_urls = [];\n", + "\n", + " const inline_js = [ function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + "function(Bokeh) {\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " if (root.Bokeh !== undefined || force === true) {\n", + " try {\n", + " for (let i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + "\n", + " } catch (error) {throw error;\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + " }\n", + "\n", + " 
if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import malariagen_data\n", + "as1 = malariagen_data.As1()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 927 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:05:29.540570Z", + "iopub.status.busy": "2026-04-05T04:05:29.540132Z", + "iopub.status.idle": "2026-04-05T04:05:29.640314Z", + "shell.execute_reply": "2026-04-05T04:05:29.639173Z", + "shell.execute_reply.started": "2026-04-05T04:05:29.540540Z" + }, + "id": "qsElasBepk8n", + "outputId": "4bf80a06-c2e8-4d2d-b4a6-99c8c66da7db", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>sample_set</th>\n", + "      <th>sample_count</th>\n", + "    </tr>\n", + "    <tr>\n", + "      <th>study_id</th>\n", + "      <th></th>\n", + "      <th></th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n",
+ "    <tr>\n", + "      <th>1363-VO-ET-GADISA</th>\n", + "      <td>1363-VO-ET-GADISA-VMF00316</td>\n", + "      <td>111</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1364-VO-SD-KAFY</th>\n", + "      <td>1364-VO-SD-KAFY-VMF00317</td>\n", + "      <td>226</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1365-VO-DJ-ADBI</th>\n", + "      <td>1365-VO-DJ-ADBI-VMF00318</td>\n", + "      <td>21</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1366-VO-YE-ALLAN</th>\n", + "      <td>1366-VO-YE-ALLAN-VMF00319</td>\n", + "      <td>22</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1367-VO-AF-DONNELLY</th>\n", + "      <td>1367-VO-AF-DONNELLY-VMF00320</td>\n", + "      <td>24</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1368-VO-PK-DONNELLY</th>\n", + "      <td>1368-VO-PK-DONNELLY-VMF00321</td>\n", + "      <td>15</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1369-VO-SA-AL-NAZAWI</th>\n", + "      <td>1369-VO-SA-AL-NAZAWI-VMF00322</td>\n", + "      <td>42</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1370-VO-IR-ENAYATI</th>\n", + "      <td>1370-VO-IR-ENAYATI-VMF00323</td>\n", + "      <td>72</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1385-VO-DJ-WEETMAN</th>\n", + "      <td>1385-VO-DJ-WEETMAN-VMF00338</td>\n", + "      <td>14</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1386-VO-KE-OCHOMO</th>\n", + "      <td>1386-VO-KE-OCHOMO-VMF00339</td>\n", + "      <td>29</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1458-VO-ET-YEWHALAW</th>\n", + "      <td>1458-VO-ET-YEWHALAW-VMF00340</td>\n", + "      <td>23</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>1459-VO-SD-AHMED</th>\n", + "      <td>1459-VO-SD-AHMED-VMF00342</td>\n", + "      <td>25</td>\n", + "    </tr>\n",
+ "    <tr>\n", + "      <th>thakare-2022</th>\n", + "      <td>thakare-2022</td>\n", + "      <td>15</td>\n", + "    </tr>\n",
+ "  </tbody>\n", + "</table>
\n", + "
" + ], + "text/plain": [ + " sample_set sample_count\n", + "study_id \n", + "1363-VO-ET-GADISA 1363-VO-ET-GADISA-VMF00316 111\n", + "1364-VO-SD-KAFY 1364-VO-SD-KAFY-VMF00317 226\n", + "1365-VO-DJ-ADBI 1365-VO-DJ-ADBI-VMF00318 21\n", + "1366-VO-YE-ALLAN 1366-VO-YE-ALLAN-VMF00319 22\n", + "1367-VO-AF-DONNELLY 1367-VO-AF-DONNELLY-VMF00320 24\n", + "1368-VO-PK-DONNELLY 1368-VO-PK-DONNELLY-VMF00321 15\n", + "1369-VO-SA-AL-NAZAWI 1369-VO-SA-AL-NAZAWI-VMF00322 42\n", + "1370-VO-IR-ENAYATI 1370-VO-IR-ENAYATI-VMF00323 72\n", + "1385-VO-DJ-WEETMAN 1385-VO-DJ-WEETMAN-VMF00338 14\n", + "1386-VO-KE-OCHOMO 1386-VO-KE-OCHOMO-VMF00339 29\n", + "1458-VO-ET-YEWHALAW 1458-VO-ET-YEWHALAW-VMF00340 23\n", + "1459-VO-SD-AHMED 1459-VO-SD-AHMED-VMF00342 25\n", + "thakare-2022 thakare-2022 15" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sample_sets = as1.sample_sets(release=\"1.0\")\n", + "df_sample_sets[['study_id','sample_set', 'sample_count']].set_index('study_id')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJ16OQ0Hpk8o" + }, + "source": [ + "Here is a more detailed breakdown of the samples contained within this sample set, summarised by country, year of collection, and species. The warning is a result of the surveillance flags not being set. This will be implemented in future versions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:05:34.859189Z", + "iopub.status.busy": "2026-04-05T04:05:34.858770Z", + "iopub.status.idle": "2026-04-05T04:05:35.892325Z", + "shell.execute_reply": "2026-04-05T04:05:35.890422Z", + "shell.execute_reply.started": "2026-04-05T04:05:34.859156Z" + }, + "id": "a1OMvuTxUWpJ", + "outputId": "9f872334-fd50-4649-990a-df60ea71c12c", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Load sample metadata: ⠏ (0:00:00.76) " + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1363-VO-ET-GADISA-VMF00316\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1364-VO-SD-KAFY-VMF00317\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1365-VO-DJ-ADBI-VMF00318\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1366-VO-YE-ALLAN-VMF00319\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1367-VO-AF-DONNELLY-VMF00320\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample 
set 1368-VO-PK-DONNELLY-VMF00321\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1370-VO-IR-ENAYATI-VMF00323\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1385-VO-DJ-WEETMAN-VMF00338\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1386-VO-KE-OCHOMO-VMF00339\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1458-VO-ET-YEWHALAW-VMF00340\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1459-VO-SD-AHMED-VMF00342\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set thakare-2022\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
                                                                   taxon stephensi
study_id              sample_set                     country       year
1363-VO-ET-GADISA     1363-VO-ET-GADISA-VMF00316     Ethiopia      2022         10
                                                                   2023         74
                                                                   2024         27
1364-VO-SD-KAFY       1364-VO-SD-KAFY-VMF00317       Sudan         2022        189
                                                                   2023         37
1365-VO-DJ-ADBI       1365-VO-DJ-ADBI-VMF00318       Djibouti      2023         21
1366-VO-YE-ALLAN      1366-VO-YE-ALLAN-VMF00319      Yemen         2021          6
                                                                   2023         16
1367-VO-AF-DONNELLY   1367-VO-AF-DONNELLY-VMF00320   Afghanistan   2017         24
1368-VO-PK-DONNELLY   1368-VO-PK-DONNELLY-VMF00321   Pakistan      2005         15
1369-VO-SA-AL-NAZAWI  1369-VO-SA-AL-NAZAWI-VMF00322  Saudi Arabia  2023         42
1370-VO-IR-ENAYATI    1370-VO-IR-ENAYATI-VMF00323    Iran          2023         72
1385-VO-DJ-WEETMAN    1385-VO-DJ-WEETMAN-VMF00338    Colony        2025         14
1386-VO-KE-OCHOMO     1386-VO-KE-OCHOMO-VMF00339     Kenya         2022          1
                                                                   2024         28
1458-VO-ET-YEWHALAW   1458-VO-ET-YEWHALAW-VMF00340   Ethiopia      2023         23
1459-VO-SD-AHMED      1459-VO-SD-AHMED-VMF00342      Sudan         2018         25
thakare-2022          thakare-2022                   India         2021         15
\n", + "
" + ], + "text/plain": [ + "taxon stephensi\n", + "study_id sample_set country year \n", + "1363-VO-ET-GADISA 1363-VO-ET-GADISA-VMF00316 Ethiopia 2022 10\n", + " 2023 74\n", + " 2024 27\n", + "1364-VO-SD-KAFY 1364-VO-SD-KAFY-VMF00317 Sudan 2022 189\n", + " 2023 37\n", + "1365-VO-DJ-ADBI 1365-VO-DJ-ADBI-VMF00318 Djibouti 2023 21\n", + "1366-VO-YE-ALLAN 1366-VO-YE-ALLAN-VMF00319 Yemen 2021 6\n", + " 2023 16\n", + "1367-VO-AF-DONNELLY 1367-VO-AF-DONNELLY-VMF00320 Afghanistan 2017 24\n", + "1368-VO-PK-DONNELLY 1368-VO-PK-DONNELLY-VMF00321 Pakistan 2005 15\n", + "1369-VO-SA-AL-NAZAWI 1369-VO-SA-AL-NAZAWI-VMF00322 Saudi Arabia 2023 42\n", + "1370-VO-IR-ENAYATI 1370-VO-IR-ENAYATI-VMF00323 Iran 2023 72\n", + "1385-VO-DJ-WEETMAN 1385-VO-DJ-WEETMAN-VMF00338 Colony 2025 14\n", + "1386-VO-KE-OCHOMO 1386-VO-KE-OCHOMO-VMF00339 Kenya 2022 1\n", + " 2024 28\n", + "1458-VO-ET-YEWHALAW 1458-VO-ET-YEWHALAW-VMF00340 Ethiopia 2023 23\n", + "1459-VO-SD-AHMED 1459-VO-SD-AHMED-VMF00342 Sudan 2018 25\n", + "thakare-2022 thakare-2022 India 2021 15" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = as1.sample_metadata(sample_sets=\"1.0\")\n", + "df_summary = df_samples.pivot_table(\n", + " index=[\"study_id\",\"sample_set\", \"country\", \"year\"], \n", + " columns=[\"taxon\"],\n", + " values=\"sample_id\", \n", + " aggfunc=len,\n", + " fill_value=0)\n", + "df_summary" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dLiU0ulIpk8p" + }, + "source": [ + "Note that there can be multiple sampling sites represented within the same sample set." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OToX5vhfpk8p" + }, + "source": [ + "## Further reading\n", + "\n", + "We hope this page has provided a useful introduction to the `As1` data resource. 
If you would like to start working with these data, please visit the [cloud data access guide](cloud) or the [data download guide](download), or continue browsing the other documentation on this site.\n", + "\n", + "If you have any questions about the data and how to use them, please do get in touch by [starting a new discussion](https://github.com/malariagen/vector-data/discussions/new) on the malariagen/vector-data repository on GitHub." + ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "name": "Ag3.0-intro.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "malariagen-dev-as1", + "name": "workbench-notebooks.m138", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m138" + }, + "kernelspec": { + "display_name": "malariagen-dev-as1 (Local)", + "language": "python", + "name": "malariagen-dev-as1" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/as1/cloud.ipynb b/docs/as1/cloud.ipynb new file mode 100644 index 0000000..a846d02 --- /dev/null +++ b/docs/as1/cloud.ipynb @@ -0,0 +1,5385 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "DZw8vyUJ0y5k" + }, + "source": [ + "# As1 cloud data access\n", + "\n", + "This notebook provides information about how to access genomic data from the [Controlling Emergent Anopheles stephensi in Sudan and Ethiopia (CEASE) project](https://wellcome.org/research-funding/funding-portfolio/funded-grants/controlling-emergent-anopheles-stephensi-ethiopia), hosted via Google Cloud in collaboration with the MalariaGEN Vector Observatory. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. 
\n", + "\n", + "This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. It can be run from any computer, but will work best when run from a compute node within Google Cloud, because the node will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via [Google Colab](https://colab.research.google.com/), which is a free interactive computing service running in the cloud.\n", + "\n", + "To launch this notebook in the cloud and run it for yourself, click the launch icon at the top of the page and select one of the cloud computing services available.\n", + "\n", + "## Data hosting\n", + "\n", + "All data required for this notebook are hosted on Google Cloud Storage (GCS). Data are hosted in the `vo_aste_release_master_us_central1` bucket, which is a single-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access. 
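
Because the bucket is public, an object in it can in principle be fetched over plain HTTPS with no credentials. The sketch below is illustrative only: it assumes the standard `https://storage.googleapis.com/<bucket>/<object>` URL form for public Cloud Storage objects, and the object path in the usage line is a hypothetical placeholder, not a path confirmed by this guide. The helper names `public_gcs_url` and `fetch_bytes` are also made up for this example.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Bucket name taken from the text above.
BUCKET = 'vo_aste_release_master_us_central1'


def public_gcs_url(bucket, object_path):
    # Build the HTTPS URL for an object in a public Cloud Storage bucket.
    # Path separators are kept; other special characters are percent-encoded.
    return 'https://storage.googleapis.com/' + bucket + '/' + quote(object_path.lstrip('/'))


def fetch_bytes(url, max_bytes=1024):
    # Read at most max_bytes from the URL without sending any credentials.
    with urlopen(url, timeout=30) as response:
        return response.read(max_bytes)


# Hypothetical object path, for illustration only; replace with a real one.
example_url = public_gcs_url(BUCKET, 'path/to/object.txt')
print(example_url)
```

Calling `fetch_bytes(example_url)` would only succeed for an object that actually exists. For routine work, the `malariagen_data` package introduced in the Setup section is the more convenient route, since it resolves paths within the bucket for you.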
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_-HkLIQH_0" + }, + "source": [ + "## Setup\n", + "\n", + "Running this notebook requires some Python packages to be installed:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:21.460464Z", + "iopub.status.busy": "2026-04-05T04:01:21.460209Z", + "iopub.status.idle": "2026-04-05T04:01:24.015357Z", + "shell.execute_reply": "2026-04-05T04:01:24.014335Z", + "shell.execute_reply.started": "2026-04-05T04:01:21.460437Z" + }, + "id": "wqHBq442QH_1", + "outputId": "1c1306a2-d6f1-46a2-ee4d-30b13dad9148", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -q malariagen_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make accessing these data more convenient, we've created the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the [As1 API docs](https://malariagen.github.io/malariagen-data-python/latest/As1.html) for documentation of all functions available from this package. \n", + "\n", + "Import other packages we'll need to use here." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-05T04:01:24.022130Z", + "iopub.status.busy": "2026-04-05T04:01:24.021867Z", + "iopub.status.idle": "2026-04-05T04:01:29.659489Z", + "shell.execute_reply": "2026-04-05T04:01:29.658324Z", + "shell.execute_reply.started": "2026-04-05T04:01:24.022095Z" + }, + "id": "970klnG1eu8N", + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import dask\n", + "import dask.array as da\n", + "from dask.diagnostics.progress import ProgressBar\n", + "# silence some warnings\n", + "dask.config.set(**{'array.slicing.split_large_chunks': False})\n", + "import allel\n", + "import malariagen_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jPqZ-LFPQH_2" + }, + "source": [ + "`As1` data access from Google Cloud is set up with the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:29.663766Z", + "iopub.status.busy": "2026-04-05T04:01:29.663173Z", + "iopub.status.idle": "2026-04-05T04:01:30.198388Z", + "shell.execute_reply": "2026-04-05T04:01:30.197301Z", + "shell.execute_reply.started": "2026-04-05T04:01:29.663731Z" + }, + "id": "mIsSaTuOQH_2", + "outputId": "4facd5a9-6e43-460a-811c-30293568918e", + "tags": [] + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "'use strict';\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " const force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "const JS_MIME_TYPE = 'application/javascript';\n", + " const HTML_MIME_TYPE = 'text/html';\n", + " const EXEC_MIME_TYPE = 
'application/vnd.bokehjs_exec.v0+json';\n", + " const CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " const script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " function drop(id) {\n", + " const view = Bokeh.index.get_by_id(id)\n", + " if (view != null) {\n", + " view.model.document.clear()\n", + " Bokeh.index.delete(view)\n", + " }\n", + " }\n", + "\n", + " const cell = handle.cell;\n", + "\n", + " const id = cell.output_area._bokeh_element_id;\n", + " const server_id = cell.output_area._bokeh_server_id;\n", + "\n", + " // Clean up Bokeh references\n", + " if (id != null) {\n", + " drop(id)\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd_clean, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " const id = msg.content.text.trim()\n", + " drop(id)\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + " cell.notebook.kernel.execute(cmd_destroy);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " const output_area = handle.output_area;\n", + " const output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " 
return\n", + " }\n", + "\n", + " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " const bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " const script_attrs = bk_div.children[0].attributes;\n", + " for (let i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " const toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + 
" * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? */\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " const events = require('base/js/events');\n", + " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " const NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded(error = null) {\n", + " const el = document.getElementById(null);\n", + " if (el != null) {\n", + " const html = (() => {\n", + " if (typeof root.Bokeh === \"undefined\") {\n", + " if (error == null) {\n", + " return \"BokehJS is loading ...\";\n", + " } else {\n", + " return \"BokehJS failed to load.\";\n", + " }\n", + " } else {\n", + " const prefix = `BokehJS ${root.Bokeh.version}`;\n", + " if (error == null) {\n", + " return `${prefix} successfully loaded.`;\n", + " } else {\n", + " return `${prefix} encountered errors while loading and may not function as expected.`;\n", + " }\n", + " }\n", + " })();\n", + " el.innerHTML = html;\n", + "\n", + " if (error != null) {\n", + " const wrapper = document.createElement(\"div\");\n", + " wrapper.style.overflow = \"auto\";\n", + " wrapper.style.height = \"5em\";\n", + " wrapper.style.resize = \"vertical\";\n", + " const content = document.createElement(\"div\");\n", + " content.style.fontFamily = \"monospace\";\n", + " content.style.whiteSpace = \"pre-wrap\";\n", + " content.style.backgroundColor = \"rgb(255, 221, 221)\";\n", + " content.textContent = error.stack ?? 
error.toString();\n", + " wrapper.append(content);\n", + " el.append(wrapper);\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(() => display_loaded(error), 100);\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " 
document.body.appendChild(element);\n", + " }\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n", + " const css_urls = [];\n", + "\n", + " const inline_js = [ function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + "function(Bokeh) {\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " if (root.Bokeh !== undefined || force === true) {\n", + " try {\n", + " for (let i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + "\n", + " } catch (error) {throw error;\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + " }\n", + "\n", + " 
if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
MalariaGEN As1 API client
\n", + " Please note that data are subject to terms of use,\n", + " for more information see \n", + " the MalariaGEN website or contact support@malariagen.net.\n", + " See also the As1 API docs.\n", + "
\n", + " Storage URL\n", + " gs://vo_aste_release_master_us_central1
\n", + " Data releases available\n", + " 1.0
\n", + " Results cache\n", + " None
\n", + " Cohorts analysis\n", + " 20260402
\n", + " Site filters analysis\n", + " sc_20260401
\n", + " Software version\n", + " malariagen_data 0.0.0
\n", + " Client location\n", + " Iowa, United States (Google Cloud us-central1)
\n", + " Data filtered for unrestricted use only\n", + " False
\n", + " Data filtered for surveillance use only\n", + " False
\n", + " Relevant data releases\n", + " 1.0
\n", + " " + ], + "text/plain": [ + "\n", + "Storage URL : gs://vo_aste_release_master_us_central1\n", + "Data releases available : 1.0\n", + "Results cache : None\n", + "Cohorts analysis : 20260402\n", + "Site filters analysis : sc_20260401\n", + "Software version : malariagen_data 0.0.0\n", + "Client location : Iowa, United States (Google Cloud us-central1)\n", + "Data filtered to unrestricted use only: False\n", + "Data filtered to surveillance use only: False\n", + "Relevant data releases : 1.0\n", + "---\n", + "Please note that data are subject to terms of use,\n", + "for more information see https://www.malariagen.net/data\n", + "or contact support@malariagen.net. For API documentation see \n", + "https://malariagen.github.io/malariagen-data-python/v0.0.0/As1.html" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "as1 = malariagen_data.As1()\n", + "as1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITy4zIVoQH_2" + }, + "source": [ + "## Sample sets\n", + "\n", + "Data are organised into different releases. As an example, data in As1 are organised into 13 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. 
Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.\n", + "\n", + "To see which sample sets are available, load the sample set manifest into a pandas dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 927 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:30.202912Z", + "iopub.status.busy": "2026-04-05T04:01:30.202397Z", + "iopub.status.idle": "2026-04-05T04:01:30.309584Z", + "shell.execute_reply": "2026-04-05T04:01:30.307209Z", + "shell.execute_reply.started": "2026-04-05T04:01:30.202885Z" + }, + "id": "b4ADQTOfQH_2", + "outputId": "f7c6d68b-053f-4698-8b6f-29720287c423" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    sample_set                     sample_count  study_id              study_url                                          terms_of_use_expiry_date  terms_of_use_url  release  unrestricted_use
0   1363-VO-ET-GADISA-VMF00316     111           1363-VO-ET-GADISA     https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
1   1364-VO-SD-KAFY-VMF00317       226           1364-VO-SD-KAFY       https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
2   1365-VO-DJ-ADBI-VMF00318       21            1365-VO-DJ-ADBI       https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
3   1366-VO-YE-ALLAN-VMF00319      22            1366-VO-YE-ALLAN      https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
4   1367-VO-AF-DONNELLY-VMF00320   24            1367-VO-AF-DONNELLY   https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
5   1368-VO-PK-DONNELLY-VMF00321   15            1368-VO-PK-DONNELLY   https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
6   1369-VO-SA-AL-NAZAWI-VMF00322  42            1369-VO-SA-AL-NAZAWI  https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
7   1370-VO-IR-ENAYATI-VMF00323    72            1370-VO-IR-ENAYATI    https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
8   1385-VO-DJ-WEETMAN-VMF00338    14            1385-VO-DJ-WEETMAN    https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
9   1386-VO-KE-OCHOMO-VMF00339     29            1386-VO-KE-OCHOMO     https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
10  1458-VO-ET-YEWHALAW-VMF00340   23            1458-VO-ET-YEWHALAW   https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
11  1459-VO-SD-AHMED-VMF00342      25            1459-VO-SD-AHMED      https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
12  thakare-2022                   15            thakare-2022          https://www.malariagen.net/network/where-we-wo...  2099-12-31                NaN               1.0      False
" + ], + "text/plain": [ + " sample_set sample_count study_id \\\n", + "0 1363-VO-ET-GADISA-VMF00316 111 1363-VO-ET-GADISA \n", + "1 1364-VO-SD-KAFY-VMF00317 226 1364-VO-SD-KAFY \n", + "2 1365-VO-DJ-ADBI-VMF00318 21 1365-VO-DJ-ADBI \n", + "3 1366-VO-YE-ALLAN-VMF00319 22 1366-VO-YE-ALLAN \n", + "4 1367-VO-AF-DONNELLY-VMF00320 24 1367-VO-AF-DONNELLY \n", + "5 1368-VO-PK-DONNELLY-VMF00321 15 1368-VO-PK-DONNELLY \n", + "6 1369-VO-SA-AL-NAZAWI-VMF00322 42 1369-VO-SA-AL-NAZAWI \n", + "7 1370-VO-IR-ENAYATI-VMF00323 72 1370-VO-IR-ENAYATI \n", + "8 1385-VO-DJ-WEETMAN-VMF00338 14 1385-VO-DJ-WEETMAN \n", + "9 1386-VO-KE-OCHOMO-VMF00339 29 1386-VO-KE-OCHOMO \n", + "10 1458-VO-ET-YEWHALAW-VMF00340 23 1458-VO-ET-YEWHALAW \n", + "11 1459-VO-SD-AHMED-VMF00342 25 1459-VO-SD-AHMED \n", + "12 thakare-2022 15 thakare-2022 \n", + "\n", + " study_url \\\n", + "0 https://www.malariagen.net/network/where-we-wo... \n", + "1 https://www.malariagen.net/network/where-we-wo... \n", + "2 https://www.malariagen.net/network/where-we-wo... \n", + "3 https://www.malariagen.net/network/where-we-wo... \n", + "4 https://www.malariagen.net/network/where-we-wo... \n", + "5 https://www.malariagen.net/network/where-we-wo... \n", + "6 https://www.malariagen.net/network/where-we-wo... \n", + "7 https://www.malariagen.net/network/where-we-wo... \n", + "8 https://www.malariagen.net/network/where-we-wo... \n", + "9 https://www.malariagen.net/network/where-we-wo... \n", + "10 https://www.malariagen.net/network/where-we-wo... \n", + "11 https://www.malariagen.net/network/where-we-wo... \n", + "12 https://www.malariagen.net/network/where-we-wo... 
\n", + "\n", + " terms_of_use_expiry_date terms_of_use_url release unrestricted_use \n", + "0 2099-12-31 NaN 1.0 False \n", + "1 2099-12-31 NaN 1.0 False \n", + "2 2099-12-31 NaN 1.0 False \n", + "3 2099-12-31 NaN 1.0 False \n", + "4 2099-12-31 NaN 1.0 False \n", + "5 2099-12-31 NaN 1.0 False \n", + "6 2099-12-31 NaN 1.0 False \n", + "7 2099-12-31 NaN 1.0 False \n", + "8 2099-12-31 NaN 1.0 False \n", + "9 2099-12-31 NaN 1.0 False \n", + "10 2099-12-31 NaN 1.0 False \n", + "11 2099-12-31 NaN 1.0 False \n", + "12 2099-12-31 NaN 1.0 False " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sample_sets = as1.sample_sets(release=\"1.0\")\n", + "df_sample_sets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J0SHf6vaQH_3" + }, + "source": [ + "For more information about each sample set, follow the URL given in its `study_url` field." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "78L85pli9HdO" + }, + "source": [ + "## Sample metadata\n", + "\n", + "Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the sex of the specimen, and our call regarding the species of the specimen. 
These are organised by sample set.\n", + "\n", + "E.g., load sample metadata for all samples in the As1.0 release into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe):" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 661 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:30.313497Z", + "iopub.status.busy": "2026-04-05T04:01:30.310326Z", + "iopub.status.idle": "2026-04-05T04:01:31.529234Z", + "shell.execute_reply": "2026-04-05T04:01:31.528366Z", + "shell.execute_reply.started": "2026-04-05T04:01:30.313468Z" + }, + "id": "-V8nLGSaQH_4", + "outputId": "98a12919-fd6a-4fd5-8155-d90f05d877d7", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Load sample metadata: ⠋ (0:00:00.85) " + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1363-VO-ET-GADISA-VMF00316\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1364-VO-SD-KAFY-VMF00317\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1365-VO-DJ-ADBI-VMF00318\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1366-VO-YE-ALLAN-VMF00319\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 
1367-VO-AF-DONNELLY-VMF00320\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1368-VO-PK-DONNELLY-VMF00321\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1370-VO-IR-ENAYATI-VMF00323\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1385-VO-DJ-WEETMAN-VMF00338\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1386-VO-KE-OCHOMO-VMF00339\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1458-VO-ET-YEWHALAW-VMF00340\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1459-VO-SD-AHMED-VMF00342\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set thakare-2022\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sample_idpartner_sample_idcontributorcountrylocationyearmonthlatitudelongitudesex_call...admin1_nameadmin1_isoadmin2_nametaxoncohort_admin1_yearcohort_admin1_monthcohort_admin1_quartercohort_admin2_yearcohort_admin2_monthcohort_admin2_quarter
0VMF00316-0001A01Endalamaw GadisaEthiopiaAwash2024118.99540.159F...AfarET-AFZone 3stephensiET-AF_step_2024ET-AF_step_2024_11ET-AF_step_2024_Q4ET-AF_Zone-3_step_2024ET-AF_Zone-3_step_2024_11ET-AF_Zone-3_step_2024_Q4
1VMF00316-0002A02Endalamaw GadisaEthiopiaAwash2024118.99540.159F...AfarET-AFZone 3stephensiET-AF_step_2024ET-AF_step_2024_11ET-AF_step_2024_Q4ET-AF_Zone-3_step_2024ET-AF_Zone-3_step_2024_11ET-AF_Zone-3_step_2024_Q4
2VMF00316-0003A03Endalamaw GadisaEthiopiaAwash2024118.99540.159F...AfarET-AFZone 3stephensiET-AF_step_2024ET-AF_step_2024_11ET-AF_step_2024_Q4ET-AF_Zone-3_step_2024ET-AF_Zone-3_step_2024_11ET-AF_Zone-3_step_2024_Q4
3VMF00316-0004A04Endalamaw GadisaEthiopiaAwash2024118.99540.159F...AfarET-AFZone 3stephensiET-AF_step_2024ET-AF_step_2024_11ET-AF_step_2024_Q4ET-AF_Zone-3_step_2024ET-AF_Zone-3_step_2024_11ET-AF_Zone-3_step_2024_Q4
4VMF00316-0005A05Endalamaw GadisaEthiopiaAwash2024118.99540.159F...AfarET-AFZone 3stephensiET-AF_step_2024ET-AF_step_2024_11ET-AF_step_2024_Q4ET-AF_Zone-3_step_2024ET-AF_Zone-3_step_2024_11ET-AF_Zone-3_step_2024_Q4
..................................................................
634SRR15293888SRR15293888Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...IndiaMangaluru2021-112.87974.847M...KarnātakaIN-KADakshina KannadastephensiIN-KA_step_2021IN-KA_step_2021IN-KA_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021
635SRR15293889SRR15293889Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...IndiaMangaluru2021-112.87974.847M...KarnātakaIN-KADakshina KannadastephensiIN-KA_step_2021IN-KA_step_2021IN-KA_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021
636SRR15293892SRR15293892Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...IndiaMangaluru2021-112.87974.847F...KarnātakaIN-KADakshina KannadastephensiIN-KA_step_2021IN-KA_step_2021IN-KA_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021
637SRR15293893SRR15293893Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...IndiaMangaluru2021-112.87974.847M...KarnātakaIN-KADakshina KannadastephensiIN-KA_step_2021IN-KA_step_2021IN-KA_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021
638SRR15293894SRR15293894Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...IndiaMangaluru2021-112.87974.847F...KarnātakaIN-KADakshina KannadastephensiIN-KA_step_2021IN-KA_step_2021IN-KA_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021IN-KA_Dakshina-Kannada_step_2021
\n", + "

639 rows × 44 columns

\n", + "
" + ], + "text/plain": [ + " sample_id partner_sample_id \\\n", + "0 VMF00316-0001 A01 \n", + "1 VMF00316-0002 A02 \n", + "2 VMF00316-0003 A03 \n", + "3 VMF00316-0004 A04 \n", + "4 VMF00316-0005 A05 \n", + ".. ... ... \n", + "634 SRR15293888 SRR15293888 \n", + "635 SRR15293889 SRR15293889 \n", + "636 SRR15293892 SRR15293892 \n", + "637 SRR15293893 SRR15293893 \n", + "638 SRR15293894 SRR15293894 \n", + "\n", + " contributor country location \\\n", + "0 Endalamaw Gadisa Ethiopia Awash \n", + "1 Endalamaw Gadisa Ethiopia Awash \n", + "2 Endalamaw Gadisa Ethiopia Awash \n", + "3 Endalamaw Gadisa Ethiopia Awash \n", + "4 Endalamaw Gadisa Ethiopia Awash \n", + ".. ... ... ... \n", + "634 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "635 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "636 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "637 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "638 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "\n", + " year month latitude longitude sex_call ... admin1_name admin1_iso \\\n", + "0 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "1 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "2 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "3 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "4 2024 11 8.995 40.159 F ... Afar ET-AF \n", + ".. ... ... ... ... ... ... ... ... \n", + "634 2021 -1 12.879 74.847 M ... Karnātaka IN-KA \n", + "635 2021 -1 12.879 74.847 M ... Karnātaka IN-KA \n", + "636 2021 -1 12.879 74.847 F ... Karnātaka IN-KA \n", + "637 2021 -1 12.879 74.847 M ... Karnātaka IN-KA \n", + "638 2021 -1 12.879 74.847 F ... 
Karnātaka IN-KA \n", + "\n", + " admin2_name taxon cohort_admin1_year cohort_admin1_month \\\n", + "0 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "1 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "2 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "3 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "4 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + ".. ... ... ... ... \n", + "634 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "635 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "636 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "637 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "638 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "\n", + " cohort_admin1_quarter cohort_admin2_year \\\n", + "0 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "1 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "2 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "3 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "4 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + ".. ... ... \n", + "634 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "635 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "636 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "637 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "638 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "\n", + " cohort_admin2_month cohort_admin2_quarter \n", + "0 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "1 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "2 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "3 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "4 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + ".. ... ... 
\n", + "634 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "635 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "636 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "637 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "638 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "\n", + "[639 rows x 44 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = as1.sample_metadata(sample_sets=\"1.0\")\n", + "df_samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ssCdOykfQH_4" + }, + "source": [ + "The `sample_id` column gives the sample identifier used throughout all As1 analyses.\n", + "\n", + "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", + "\n", + "The `year` and `month` columns give the approximate date when the specimen was collected.\n", + "\n", + "The `sex_call` column gives the sex of the specimen as called from the sequence data.\n", + "\n", + "Note that the warnings shown above are raised because surveillance flags are missing for these sample sets. Surveillance flags will be added in future data releases." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9APw05D5gAQ9" + }, + "source": [ + "[Pandas](https://pandas.pydata.org/) can be used to explore and query the sample metadata in various ways. 
E.g., here is a summary of the numbers of samples by species:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:18.865363Z", + "iopub.status.busy": "2026-04-05T04:02:18.865006Z", + "iopub.status.idle": "2026-04-05T04:02:18.876141Z", + "shell.execute_reply": "2026-04-05T04:02:18.872642Z", + "shell.execute_reply.started": "2026-04-05T04:02:18.865334Z" + }, + "id": "PpsTgviZQH_4", + "outputId": "ddbc9515-25dc-454f-9f02-9427f1261b06", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "taxon\n", + "stephensi 639\n", + "dtype: int64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples.groupby(\"taxon\").size()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C4EPodCJjg0a" + }, + "source": [ + "## SNP calls\n", + "\n", + "Data on SNP calls, including the SNP positions, alleles, site filters, and genotypes, can be accessed as an [xarray Dataset](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset).\n", + "\n", + "E.g., access SNP calls for chromosome 2RL for all samples in `As1.0`." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 430 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:21.993178Z", + "iopub.status.busy": "2026-04-05T04:02:21.992783Z", + "iopub.status.idle": "2026-04-05T04:02:24.320013Z", + "shell.execute_reply": "2026-04-05T04:02:24.317119Z", + "shell.execute_reply.started": "2026-04-05T04:02:21.993144Z" + }, + "id": "433PD7k8jlNj", + "outputId": "bc5e1b8d-f1f4-4008-df56-f577a9080561", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 1TB\n",
+       "Dimensions:                        (variants: 93702023, alleles: 4,\n",
+       "                                    samples: 639, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position               (variants) int32 375MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                 (variants) uint8 94MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                      (samples) <U36 92kB dask.array<chunksize=(111,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele                 (variants, alleles) |S1 375MB dask.array<chunksize=(524288, 4), meta=np.ndarray>\n",
+       "    variant_filter_pass_stephensi  (variants) bool 94MB dask.array<chunksize=(300000,), meta=np.ndarray>\n",
+       "    call_genotype                  (variants, samples, ploidy) int8 120GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "    call_GQ                        (variants, samples) int8 60GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_MQ                        (variants, samples) float32 240GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_AD                        (variants, samples, alleles) int16 479GB dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask             (variants, samples, ploidy) bool 120GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2RL', '3RL', 'X')
" + ], + "text/plain": [ + " Size: 1TB\n", + "Dimensions: (variants: 93702023, alleles: 4,\n", + " samples: 639, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 375MB dask.array\n", + " variant_contig (variants) uint8 94MB dask.array\n", + " sample_id (samples) \n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 375MB dask.array\n", + " variant_filter_pass_stephensi (variants) bool 94MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 120GB dask.array\n", + " call_GQ (variants, samples) int8 60GB dask.array\n", + " call_MQ (variants, samples) float32 240GB dask.array\n", + " call_AD (variants, samples, alleles) int16 479GB dask.array\n", + " call_genotype_mask (variants, samples, ploidy) bool 120GB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_snps = as1.snp_calls(region=\"2RL\", sample_sets=\"1.0\")\n", + "ds_snps" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fx9ufvbCnPGn" + }, + "source": [ + "The arrays within this dataset are backed by [Dask arrays](https://docs.dask.org/en/latest/array.html), and can be accessed as shown below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lvv-lFHJ-Um2" + }, + "source": [ + "### SNP sites and alleles\n", + "\n", + "We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not \"N\". Data on this set of genomic positions and alleles for a given chromosome (e.g., 2RL) can be accessed as [Dask arrays](https://docs.dask.org/en/latest/array.html) as follows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:25.466031Z", + "iopub.status.busy": "2026-04-05T04:02:25.465727Z", + "iopub.status.idle": "2026-04-05T04:02:25.471287Z", + "shell.execute_reply": "2026-04-05T04:02:25.470136Z", + "shell.execute_reply.started": "2026-04-05T04:02:25.466009Z" + }, + "id": "GO5Os0epQH_5", + "outputId": "7c970e20-4811-46a1-8944-4bd7f6e8359f", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 357.44 MiB 2.00 MiB
Shape (93702023,) (524288,)
Dask graph 179 chunks in 1 graph layer
Data type int32 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 93702023\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pos = ds_snps[\"variant_position\"].data\n", + "pos" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:27.901724Z", + "iopub.status.busy": "2026-04-05T04:02:27.900987Z", + "iopub.status.idle": "2026-04-05T04:02:27.907415Z", + "shell.execute_reply": "2026-04-05T04:02:27.906357Z", + "shell.execute_reply.started": "2026-04-05T04:02:27.901679Z" + }, + "id": "eD5Gtb-xQH_5", + "outputId": "60a9f964-0335-4084-b359-7902d138bec3", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 357.44 MiB 2.00 MiB
Shape (93702023, 4) (524288, 4)
Dask graph 179 chunks in 5 graph layers
Data type |S1 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 4\n", + " 93702023\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "alleles = ds_snps[\"variant_allele\"].data\n", + "alleles" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k6i3W7y1QH_5" + }, + "source": [ + "Data can be loaded into memory as a [NumPy array](https://numpy.org/doc/stable/user/absolute_beginners.html) as shown in the following examples." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:29.276555Z", + "iopub.status.busy": "2026-04-05T04:02:29.276219Z", + "iopub.status.idle": "2026-04-05T04:02:29.392264Z", + "shell.execute_reply": "2026-04-05T04:02:29.391556Z", + "shell.execute_reply.started": "2026-04-05T04:02:29.276528Z" + }, + "id": "3_1qTYtiQH_5", + "outputId": "c260b22a-cc89-4a3c-9371-21fde9ec189e", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=int32)" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read first 10 SNP positions into a numpy array\n", + "p = pos[:10].compute()\n", + "p" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:30.443325Z", + "iopub.status.busy": "2026-04-05T04:02:30.442967Z", + "iopub.status.idle": "2026-04-05T04:02:30.714207Z", + "shell.execute_reply": "2026-04-05T04:02:30.713184Z", + "shell.execute_reply.started": "2026-04-05T04:02:30.443295Z" + }, + "id": "UjeBeyOXQH_6", + "outputId": "4ef2a2e1-789a-4ec0-fff6-53e83f4951d1", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[b'A', b'G', b'C', b'T'],\n", + " [b'A', b'G', b'C', b'T'],\n", + " [b'A', b'G', b'C', b'T'],\n", + " 
[b'A', b'G', b'C', b'T'],\n", + " [b'A', b'G', b'C', b'T'],\n", + " [b'A', b'G', b'C', b'T'],\n", + " [b'T', b'G', b'C', b'A'],\n", + " [b'A', b'G', b'C', b'T'],\n", + " [b'A', b'G', b'C', b'T'],\n", + " [b'T', b'G', b'C', b'A']], dtype='|S1')" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read first 10 SNP alleles into a numpy array\n", + "a = alleles[:10].compute()\n", + "a" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XoHkXz0Cbk_p" + }, + "source": [ + "Here the first column contains the reference alleles, and the remaining columns contain the alternate alleles.\n", + "\n", + "Note that a byte string data type is used here for efficiency. E.g., the Python code `b'T'` represents a byte string containing the letter \"T\", which here stands for the nucleotide thymine.\n", + "\n", + "Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the `As1` samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BGVj0OiyAQuX" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters so that low-quality SNPs can be excluded. \n", + "\n", + "Each set of site filters provides a \"filter_pass\" Boolean mask for each chromosome, where True indicates that the site passed the filter and is accessible to high-quality SNP calling.\n", + "\n", + "The site filters data can be accessed as dask arrays as shown in the examples below. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:59.018233Z", + "iopub.status.busy": "2026-04-05T04:02:59.017852Z", + "iopub.status.idle": "2026-04-05T04:02:59.027225Z", + "shell.execute_reply": "2026-04-05T04:02:59.024455Z", + "shell.execute_reply.started": "2026-04-05T04:02:59.018199Z" + }, + "id": "wh1AaMJ_QH_6", + "outputId": "e9b544fc-2db0-4f83-e23b-30258598d552", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 89.36 MiB 292.97 kiB
Shape (93702023,) (300000,)
Dask graph 313 chunks in 1 graph layer
Data type bool numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 93702023\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access site filters for chromosome 2RL as a dask array\n", + "filter_pass = ds_snps['variant_filter_pass_stephensi'].data\n", + "filter_pass" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:00.649281Z", + "iopub.status.busy": "2026-04-05T04:03:00.648983Z", + "iopub.status.idle": "2026-04-05T04:03:00.795700Z", + "shell.execute_reply": "2026-04-05T04:03:00.794683Z", + "shell.execute_reply.started": "2026-04-05T04:03:00.649260Z" + }, + "id": "klokhPxwQH_6", + "outputId": "28c6cbfd-b6cc-46f0-9554-c027c4c57cae", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([False, False, False, False, False, False, False, False, False,\n", + " False])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read filter values for first 10 SNPs (True means the site passes filters)\n", + "f = filter_pass[:10].compute()\n", + "f" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sMnfrmCNBzW8" + }, + "source": [ + "### SNP genotypes\n", + "\n", + "SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.\n", + "\n", + "SNP genotypes can be accessed as dask arrays as shown below." 
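To make the integer coding concrete, here is a minimal sketch using a small made-up genotype array in plain NumPy. The values and the names `g`, `n_missing`, `ac` and `is_seg` are assumptions for illustration, not part of the `malariagen_data` API. It counts missing allele calls, counts each allele per site, and flags sites where more than one allele is observed (segregating sites).

```python
import numpy as np

# Toy genotypes: 3 sites x 2 samples x ploidy 2 (made-up values).
# -1 = missing, 0 = reference allele, 1..3 = alternate alleles.
g = np.array(
    [
        [[0, 0], [0, 0]],
        [[0, 1], [-1, -1]],
        [[2, 2], [0, 3]],
    ],
    dtype=np.int8,
)

# Number of missing allele calls across the whole array.
n_missing = np.count_nonzero(g < 0)

# Allele counts per site: one column per allele (0..3), missing calls ignored.
ac = np.stack([(g == a).sum(axis=(1, 2)) for a in range(4)], axis=1)

# A site is segregating if more than one distinct allele is observed.
is_seg = (ac > 0).sum(axis=1) > 1

print(n_missing)
print(ac)
print(is_seg)
```

For real data, scikit-allel provides allele counting on its genotype array classes, which is likely to be more convenient than the hand-rolled counting sketched here.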
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 173 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:09.889015Z", + "iopub.status.busy": "2026-04-05T04:03:09.888672Z", + "iopub.status.idle": "2026-04-05T04:03:09.897834Z", + "shell.execute_reply": "2026-04-05T04:03:09.896615Z", + "shell.execute_reply.started": "2026-04-05T04:03:09.888987Z" + }, + "id": "QPViDmX_QH_7", + "outputId": "125ba0b7-4e6d-4c61-f325-39e9eb9522e7", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 111.53 GiB 28.61 MiB
Shape (93702023, 639, 2) (300000, 50, 2)
Dask graph 6260 chunks in 14 graph layers
Data type int8 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 2\n", + " 639\n", + " 93702023\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gt = ds_snps[\"call_genotype\"].data\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lcG-QFZRRTwx" + }, + "source": [ + "Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:12.707558Z", + "iopub.status.busy": "2026-04-05T04:03:12.707223Z", + "iopub.status.idle": "2026-04-05T04:03:12.718523Z", + "shell.execute_reply": "2026-04-05T04:03:12.717575Z", + "shell.execute_reply.started": "2026-04-05T04:03:12.707527Z" + }, + "id": "H0pR2bOCRcLI", + "outputId": "b3283a90-3202-45e9-9482-a926594945df", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = as1.sample_metadata(sample_sets=\"1.0\")\n", + "gt = ds_snps[\"call_genotype\"].data\n", + "len(df_samples) == gt.shape[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xr_FJ-xARgyS" + }, + "source": [ + "You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. 
E.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:35.190330Z", + "iopub.status.busy": "2026-04-05T04:03:35.190000Z", + "iopub.status.idle": "2026-04-05T04:03:35.222056Z", + "shell.execute_reply": "2026-04-05T04:03:35.221086Z", + "shell.execute_reply.started": "2026-04-05T04:03:35.190300Z" + }, + "id": "WqyNsEwLRo0q", + "outputId": "77a966bd-5ab3-416f-fb16-8cc38f46bac2", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "found 639 stephensi samples\n" + ] + } + ], + "source": [ + "loc_stephensi = df_samples.eval(\"taxon == 'stephensi'\").values\n", + "print(f\"found {np.count_nonzero(loc_stephensi)} stephensi samples\")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 430 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:49.094928Z", + "iopub.status.busy": "2026-04-05T04:03:49.094569Z", + "iopub.status.idle": "2026-04-05T04:03:49.183694Z", + "shell.execute_reply": "2026-04-05T04:03:49.182647Z", + "shell.execute_reply.started": "2026-04-05T04:03:49.094897Z" + }, + "id": "auvV_O0Dx1GT", + "outputId": "e3991a1a-1289-4e3d-f3f3-1539d7d336d0", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 1TB\n",
+       "Dimensions:                        (variants: 93702023, alleles: 4,\n",
+       "                                    samples: 639, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position               (variants) int32 375MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                 (variants) uint8 94MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                      (samples) <U36 92kB dask.array<chunksize=(111,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele                 (variants, alleles) |S1 375MB dask.array<chunksize=(524288, 4), meta=np.ndarray>\n",
+       "    variant_filter_pass_stephensi  (variants) bool 94MB dask.array<chunksize=(300000,), meta=np.ndarray>\n",
+       "    call_genotype                  (variants, samples, ploidy) int8 120GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "    call_GQ                        (variants, samples) int8 60GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_MQ                        (variants, samples) float32 240GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_AD                        (variants, samples, alleles) int16 479GB dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask             (variants, samples, ploidy) bool 120GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2RL', '3RL', 'X')
" + ], + "text/plain": [ + " Size: 1TB\n", + "Dimensions: (variants: 93702023, alleles: 4,\n", + " samples: 639, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 375MB dask.array\n", + " variant_contig (variants) uint8 94MB dask.array\n", + " sample_id (samples) \n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 375MB dask.array\n", + " variant_filter_pass_stephensi (variants) bool 94MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 120GB dask.array\n", + " call_GQ (variants, samples) int8 60GB dask.array\n", + " call_MQ (variants, samples) float32 240GB dask.array\n", + " call_AD (variants, samples, alleles) int16 479GB dask.array\n", + " call_genotype_mask (variants, samples, ploidy) bool 120GB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_snps_stephensi = ds_snps.isel(samples=loc_stephensi)\n", + "ds_snps_stephensi" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xAreXD3ySw_e" + }, + "source": [ + "Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:51.648684Z", + "iopub.status.busy": "2026-04-05T04:03:51.648364Z", + "iopub.status.idle": "2026-04-05T04:03:51.937778Z", + "shell.execute_reply": "2026-04-05T04:03:51.936767Z", + "shell.execute_reply.started": "2026-04-05T04:03:51.648658Z" + }, + "id": "AEH-iHpYQH_7", + "outputId": "04e075b3-5f18-4e6f-882e-898335312d71", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[ 0, 0],\n", + " [ 0, 0],\n", + " [-1, -1]],\n", + "\n", + " [[ 0, 0],\n", + " [ 0, 
0],\n", + " [-1, -1]],\n", + "\n", + " [[ 0, 0],\n", + " [ 0, 0],\n", + " [-1, -1]],\n", + "\n", + " [[ 0, 0],\n", + " [ 0, 0],\n", + " [-1, -1]],\n", + "\n", + " [[ 0, 0],\n", + " [ 0, 0],\n", + " [-1, -1]]], dtype=int8)" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g = gt[:5, :3, :].compute()\n", + "g" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vcMEGuGsCSig" + }, + "source": [ + "If you want to work with the genotype calls, you may find it convenient to use [scikit-allel](http://scikit-allel.readthedocs.org/).\n", + "E.g., the code below sets up a genotype array." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 207 + }, + "execution": { + "iopub.execute_input": "2026-04-05T04:03:54.469043Z", + "iopub.status.busy": "2026-04-05T04:03:54.468689Z", + "iopub.status.idle": "2026-04-05T04:03:57.802912Z", + "shell.execute_reply": "2026-04-05T04:03:57.801799Z", + "shell.execute_reply.started": "2026-04-05T04:03:54.469013Z" + }, + "id": "TBuf01BdbJ6z", + "outputId": "bec96465-4d21-4647-ced0-c687674dad40", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<GenotypeDaskArray shape=(93702023, 639, 2) dtype=int8>
01234...634635636637638
00/00/0./../../....0/0./../.0/0./.
10/00/0./../../....0/0./../.0/0./.
20/00/0./../../....0/0./../.0/0./.
......
93702020./../../../../...../../../../../.
93702021./../../../../...../../../../../.
93702022./../../../../...../../../../../.
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# use the scikit-allel wrapper class for genotype calls\n", + "gt = allel.GenotypeDaskArray(ds_snps[\"call_genotype\"].data)\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OS4U1IwZgARB" + }, + "source": [ + "## Running larger computations\n", + "\n", + "Please note that free cloud computing services such as Google Colab and MyBinder provide only limited computing resources. Thus, although these services are able to efficiently read `As1` data stored on Google Cloud, you may find that you run out of memory, or that computations take a long time running on a single core. If you would like any suggestions regarding how to set up more powerful compute resources in the cloud, please feel free to get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4n73mSO-heAF" + }, + "source": [ + "## Feedback and suggestions\n", + "\n", + "If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)." 
+ ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "collapsed_sections": [], + "name": "Ag3.0 cloud data access 2022-03-14.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "malariagen-dev-as1", + "name": "workbench-notebooks.m138", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m138" + }, + "kernelspec": { + "display_name": "malariagen-dev-as1 (Local)", + "language": "python", + "name": "malariagen-dev-as1" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/as1/download.ipynb b/docs/as1/download.ipynb new file mode 100644 index 0000000..1ad79a0 --- /dev/null +++ b/docs/as1/download.ipynb @@ -0,0 +1,544 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "p0VbAgTdnvpP" + }, + "source": [ + "# As1 data downloads\n", + "\n", + "This notebook provides information about how to download data from the [Controlling Emergent Anopheles stephensi in Sudan and Ethiopia (CEASE) project](https://wellcome.org/research-funding/funding-portfolio/funded-grants/controlling-emergent-anopheles-stephensi-ethiopia), released in collaboration with the MalariaGEN Vector Observatory.\n", + "\n", + "This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. Data from other releases can be accessed by changing the release in the examples from `v1` to the specific As release, e.g. `v1.0`.\n", + "\n", + "Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). 
If you are running these commands directly from a terminal, remove the exclamation mark.\n", + "\n", + "Examples in this notebook assume you are downloading data to a local folder within your home directory at the path `~/vo_aste_release_master_us_central1/`. Change this if you want to download to a different folder on the local file system.\n", + "\n", + "## Data hosting\n", + "\n", + "`As1` data are hosted by several different services.\n", + "\n", + "Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the `wget` command line tool, but please note that there are several other options for downloading data; see the [ENA documentation on how to download data files](https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html) for more information. \n", + "\n", + "SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage at the Sanger Institute. This guide provides examples of downloading these data using `wget`.\n", + "\n", + "Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the `vo_aste_release_master_us_central1` bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but do require an authentication step; please see details on the [Vector Observatory Data Access page](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).\n", + "\n", + "The guide below provides examples of downloading data from GCS to a local computer using the `wget` and `gsutil` command line tools. For more information about `gsutil`, see the [gsutil tool documentation](https://cloud.google.com/storage/docs/gsutil)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t1wyCDH5nvpS" + }, + "source": [ + "## Sample sets\n", + "\n", + "Data in these releases are organised into sample sets. 
Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. For convenience there is a tab-delimited manifest file listing all sample sets in the release, this can be downloaded via `gsutil` to a directory on the local file system, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T02:07:57.529966Z", + "iopub.status.busy": "2026-04-05T02:07:57.529676Z", + "iopub.status.idle": "2026-04-05T02:07:59.511000Z", + "shell.execute_reply": "2026-04-05T02:07:59.509915Z", + "shell.execute_reply.started": "2026-04-05T02:07:57.529937Z" + }, + "id": "rsX4TP6UnvpS", + "outputId": "a9afc995-80b7-4f62-ad0b-b4d95822cf38", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Copying gs://vo_aste_release_master_us_central1/v1.0/manifest.tsv...\n", + "/ [1 files][ 1.7 KiB/ 1.7 KiB] \n", + "Operation completed over 1 objects/1.7 KiB. 
\n" + ] + } + ], + "source": [ + "!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/\n", + "!gsutil cp gs://vo_aste_release_master_us_central1/v1.0/manifest.tsv ~/vo_aste_release_master_us_central1/v1.0/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hWOAFxIDnvpT" + }, + "source": [ + "Here are the file contents:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T02:08:04.087706Z", + "iopub.status.busy": "2026-04-05T02:08:04.087268Z", + "iopub.status.idle": "2026-04-05T02:08:04.216086Z", + "shell.execute_reply": "2026-04-05T02:08:04.215168Z", + "shell.execute_reply.started": "2026-04-05T02:08:04.087660Z" + }, + "id": "vC4ACrTEnvpT", + "outputId": "c7cfe64a-9a78-42ea-dbd9-9cc82410372d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample_set\tsample_count\tstudy_id\tstudy_url\tterms_of_use_expiry_date\tterms_of_use_url\n", + "1363-VO-ET-GADISA-VMF00316\t111\t1363-VO-ET-GADISA\thttps://www.malariagen.net/network/where-we-work/1363-VO-ET-GADISA\t2099-12-31\t\n", + "1364-VO-SD-KAFY-VMF00317\t226\t1364-VO-SD-KAFY\thttps://www.malariagen.net/network/where-we-work/1364-VO-SD-KAFY\t2099-12-31\t\n", + "1365-VO-DJ-ADBI-VMF00318\t21\t1365-VO-DJ-ADBI\thttps://www.malariagen.net/network/where-we-work/1365-VO-DJ-ADBI\t2099-12-31\t\n", + "1366-VO-YE-ALLAN-VMF00319\t22\t1366-VO-YE-ALLAN\thttps://www.malariagen.net/network/where-we-work/1366-VO-YE-ALLAN\t2099-12-31\t\n", + "1367-VO-AF-DONNELLY-VMF00320\t24\t1367-VO-AF-DONNELLY\thttps://www.malariagen.net/network/where-we-work/1367-VO-AF-DONNELLY\t2099-12-31\t\n", + "1368-VO-PK-DONNELLY-VMF00321\t15\t1368-VO-PK-DONNELLY\thttps://www.malariagen.net/network/where-we-work/1368-VO-PK-DONNELLY\t2099-12-31\t\n", + "1369-VO-SA-AL-NAZAWI-VMF00322\t42\t1369-VO-SA-AL-NAZAWI\thttps://www.malariagen.net/network/where-we-work/1369-VO-SA-AL-NAZAWI\t2099-12-31\t\n", +
"1370-VO-IR-ENAYATI-VMF00323\t72\t1370-VO-IR-ENAYATI\thttps://www.malariagen.net/network/where-we-work/1370-VO-IR-ENAYATI\t2099-12-31\t\n", + "1385-VO-DJ-WEETMAN-VMF00338\t14\t1385-VO-DJ-WEETMAN\thttps://www.malariagen.net/network/where-we-work/1385-VO-DJ-WEETMAN\t2099-12-31\t\n", + "1386-VO-KE-OCHOMO-VMF00339\t29\t1386-VO-KE-OCHOMO\thttps://www.malariagen.net/network/where-we-work/1386-VO-KE-OCHOMO\t2099-12-31\t\n", + "1458-VO-ET-YEWHALAW-VMF00340\t23\t1458-VO-ET-YEWHALAW\thttps://www.malariagen.net/network/where-we-work/1458-VO-ET-YEWHALAW\t2099-12-31\t\n", + "1459-VO-SD-AHMED-VMF00342\t25\t1459-VO-SD-AHMED\thttps://www.malariagen.net/network/where-we-work/1459-VO-SD-AHMED\t2099-12-31\t\n", + "thakare-2022\t15\tthakare-2022\thttps://www.malariagen.net/network/where-we-work/thakare-2022\t2099-12-31\t\n" + ] + } + ], + "source": [ + "!cat ~/vo_aste_release_master_us_central1/v1.0/manifest.tsv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5hXT_c0pnvpU" + }, + "source": [ + "For more information about these sample sets, you can explore the [Af1.0 data user guide](https://malariagen.github.io/vector-data/af1/af1.0.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D0m-HL43nvpU" + }, + "source": [ + "## Sample metadata\n", + "\n", + "Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen.\n", + "\n", + "### Specimen collection metadata\n", + "\n", + "Specimen collection metadata can be downloaded from GCS. E.g., sample metadata for all sample sets can be downloaded using `gsutil`. If you only want the sample metadata for a single sample set, these can be accessed by including the sample set name on the link below, e.g. 
to access the metadata for `1363-VO-ET-GADISA-VMF00316`, you would use: `gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1363-VO-ET-GADISA-VMF00316/samples.meta.csv`:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T02:08:15.869083Z", + "iopub.status.busy": "2026-04-05T02:08:15.868269Z", + "iopub.status.idle": "2026-04-05T02:08:19.196440Z", + "shell.execute_reply": "2026-04-05T02:08:19.195587Z", + "shell.execute_reply.started": "2026-04-05T02:08:15.869036Z" + }, + "id": "CsQVgCl7nvpV", + "outputId": "e0409bcb-5eca-4b1b-e703-e968508f3aec", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Building synchronization state...\n", + "Starting synchronization...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1363-VO-ET-GADISA-VMF00316/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1365-VO-DJ-ADBI-VMF00318/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1367-VO-AF-DONNELLY-VMF00320/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1364-VO-SD-KAFY-VMF00317/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1364-VO-SD-KAFY-VMF00317/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1368-VO-PK-DONNELLY-VMF00321/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1366-VO-YE-ALLAN-VMF00319/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1367-VO-AF-DONNELLY-VMF00320/wgs_snp_data.csv...\n", + "Copying 
gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1363-VO-ET-GADISA-VMF00316/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1366-VO-YE-ALLAN-VMF00319/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1365-VO-DJ-ADBI-VMF00318/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1368-VO-PK-DONNELLY-VMF00321/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1369-VO-SA-AL-NAZAWI-VMF00322/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1370-VO-IR-ENAYATI-VMF00323/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1369-VO-SA-AL-NAZAWI-VMF00322/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1370-VO-IR-ENAYATI-VMF00323/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1385-VO-DJ-WEETMAN-VMF00338/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1386-VO-KE-OCHOMO-VMF00339/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1458-VO-ET-YEWHALAW-VMF00340/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1385-VO-DJ-WEETMAN-VMF00338/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1386-VO-KE-OCHOMO-VMF00339/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1458-VO-ET-YEWHALAW-VMF00340/wgs_snp_data.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1459-VO-SD-AHMED-VMF00342/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/1459-VO-SD-AHMED-VMF00342/wgs_snp_data.csv...\n", + "Copying 
gs://vo_aste_release_master_us_central1/v1.0/metadata/general/README.md...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/thakare-2022/samples.meta.csv...\n", + "Copying gs://vo_aste_release_master_us_central1/v1.0/metadata/general/thakare-2022/wgs_snp_data.csv...\n", + "\\ [27/27 files][282.1 KiB/282.1 KiB] 100% Done \n", + "Operation completed over 27 objects/282.1 KiB. \n" + ] + } + ], + "source": [ + "!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/metadata/\n", + "!gsutil -m rsync -r gs://vo_aste_release_master_us_central1/v1.0/metadata/general/ ~/vo_aste_release_master_us_central1/v1.0/metadata/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R7GeyShRnvpV" + }, + "source": [ + "Here are the first few rows of the sample metadata for sample set `1363-VO-ET-GADISA-VMF00316`:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "execution": { + "iopub.execute_input": "2026-04-05T02:08:35.940732Z", + "iopub.status.busy": "2026-04-05T02:08:35.940280Z", + "iopub.status.idle": "2026-04-05T02:08:36.060825Z", + "shell.execute_reply": "2026-04-05T02:08:36.059836Z", + "shell.execute_reply.started": "2026-04-05T02:08:35.940698Z" + }, + "id": "dhKjnl6knvpW", + "outputId": "6345e845-5288-41a1-e877-5417559b8c6c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call\n", + "VMF00316-0001,A01,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0002,A02,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0003,A03,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0004,A04,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0005,A05,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0006,A06,Endalamaw 
Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0007,A07,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0008,A08,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n", + "VMF00316-0009,A10,Endalamaw Gadisa,Ethiopia,Awash,2024,11,8.995,40.159,F\n" + ] + } + ], + "source": [ + "!head ~/vo_aste_release_master_us_central1/v1.0/metadata/1363-VO-ET-GADISA-VMF00316/samples.meta.csv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VKki7qHunvpW" + }, + "source": [ + "The `sample_id` column gives the sample identifier used throughout all analyses.\n", + "\n", + "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", + "\n", + "The `year` and `month` columns give the approximate date when the specimen was collected.\n", + "\n", + "The `sex_call` column gives the sex as determined from the sequence data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EpMi0q3dnvpZ" + }, + "source": [ + "## SNP calls (VCF format)\n", + "\n", + "### SNP genotypes\n", + "\n", + "SNP genotypes for individual mosquitoes in VCF format are available for download from Sanger S3-compatible object storage. A VCF file is available for each individual sample. To download a VCF file for a given sample, you will need the sample identifier and the sample set to which the sample belongs. Then inspect the data catalog in the metadata. 
E.g., for sample set `1363-VO-ET-GADISA-VMF00316`:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-05T02:08:48.928165Z", + "iopub.status.busy": "2026-04-05T02:08:48.927675Z", + "iopub.status.idle": "2026-04-05T02:08:49.051213Z", + "shell.execute_reply": "2026-04-05T02:08:49.049962Z", + "shell.execute_reply.started": "2026-04-05T02:08:48.928116Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample_id,snp_genotypes_vcf\n", + "VMF00316-0001,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0001.vcf.gz\n", + "VMF00316-0002,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0002.vcf.gz\n", + "VMF00316-0003,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0003.vcf.gz\n", + "VMF00316-0004,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0004.vcf.gz\n", + "VMF00316-0005,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0005.vcf.gz\n", + "VMF00316-0006,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0006.vcf.gz\n", + "VMF00316-0007,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0007.vcf.gz\n", + "VMF00316-0008,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0008.vcf.gz\n", + "VMF00316-0009,https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0009.vcf.gz\n" + ] + } + ], + "source": [ + "!head ~/vo_aste_release_master_us_central1/v1.0/metadata/1363-VO-ET-GADISA-VMF00316/wgs_snp_data.csv | cut -f1,4 -d," + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A VCF file and associated tabix index can be downloaded via `wget`, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget --no-clobber https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0001.vcf.gz\n", + "!wget --no-clobber 
https://1363-vo-et-gadisa-vmf00316-aste1.cog.sanger.ac.uk/VMF00316-0001.vcf.gz.tbi" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rd1icA5Snvpa" + }, + "source": [ + "Note that each of these VCF files is around 3 Gb, so downloading may take some time, and sufficient local storage will be needed.\n", + "\n", + "Each of these VCF files is an \"all sites\" VCF file, meaning that genotypes have been called at all genomic positions where the reference nucleotide is not \"N\", regardless of whether variation is observed in the given sample. This means that VCFs from multiple samples can be merged easily to create a multi-sample VCF, which may be required for certain analyses. For example, the code below merges VCFs for two samples for the first 1 Mbp of chromosome 3RL: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RcWJS9XJnvpa", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!bcftools merge --output-type z --regions 3RL:1-1000000 --output merged.vcf.gz VMF00316-0002.vcf.gz VMF00316-0003.vcf.gz" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "un-0qdeEnvpa" + }, + "source": [ + "If you are just interested in analysing variants within a given set of samples, you might like to filter the merged VCF to remove non-variant sites and alleles, e.g., using [bcftools view](http://samtools.github.io/bcftools/bcftools.html#view):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tQ7ZQEQznvpa" + }, + "outputs": [], + "source": [ + "!bcftools view --output-type z --output-file merged_variant.vcf.gz --min-ac 1:nonmajor --trim-alt-alleles merged.vcf.gz" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZgpIO8Oknvpa" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters to allow excluding low-quality SNPs. 
The site filters were created using static cutoffs on site summary statistics. These data are available as Zarr datastores, one per chromosome.\n", + "\n", + "These can be downloaded using `gsutil`, e.g.:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XQjL7R3bnvpa", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/vcf/funestus/ \\\n", + " ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/site_filters/sc_20220908/vcf/funestus/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/site_filters/sc_20220908/vcf/funestus/ \\\n", + " ~/vo_afun_release/v1.0/site_filters/sc_20220908/vcf/funestus/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OBXGXzj9nvpb" + }, + "source": [ + "## SNP calls (Zarr format)\n", + "\n", + "SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system; see the [Af1 cloud data access guide](https://malariagen.github.io/vector-data/af1/cloud.html) for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.\n", + "\n", + "The data are organised into several Zarr hierarchies. 
\n", + "\n", + "### SNP sites and alleles\n", + "\n", + "Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hM4noAz3nvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/sites/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/sites/ \\\n", + " ~/vo_aste_release_master_us_central1/v1.0/snp_genotypes/all/sites/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vKfArxCFnvpb" + }, + "source": [ + "### SNP genotypes\n", + "\n", + "SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for a sample set (substituting the release version and sample set identifier for `{release}` and `{sample_set}`), excluding some data you probably won't need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tWu4ajAbnvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "# N.B., large data download\n", + "!mkdir -pv ~/vo_aste_release_master_us_central1/v{release}/snp_genotypes/all/{sample_set}/\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/calldata/(AD|GQ|MQ)/.*' \\\n", + " gs://vo_aste_release_master_us_central1/v{release}/snp_genotypes/all/{sample_set}/ \\\n", + " ~/vo_aste_release_master_us_central1/v{release}/snp_genotypes/all/{sample_set}/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8ABQPPgAnvph" + }, + "source": [ + "## Feedback and suggestions\n", + "\n", + "If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)."
+ ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "collapsed_sections": [ + "8ABQPPgAnvph" + ], + "name": "Ag3.0-data-downloads.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "mgenv-e82ac9c", + "name": "workbench-notebooks.m138", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m138" + }, + "kernelspec": { + "display_name": "Python (mgenv-e82ac9c) (Local)", + "language": "python", + "name": "mgenv-e82ac9c" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}