From 14fbf86948178da2e75a66ab1482e56b8ee960ee Mon Sep 17 00:00:00 2001 From: tristanpwdennis Date: Tue, 31 Mar 2026 23:12:12 +0000 Subject: [PATCH 1/3] WIP, API md and as1.ipynb templates --- docs/as1/api.md | 3 + docs/as1/as1.ipynb | 727 +++ docs/as1/cloud.ipynb | 9952 +++++++++++++++++++++++++++++++++++++++ docs/as1/download.ipynb | 822 ++++ 4 files changed, 11504 insertions(+) create mode 100644 docs/as1/api.md create mode 100644 docs/as1/as1.ipynb create mode 100644 docs/as1/cloud.ipynb create mode 100644 docs/as1/download.ipynb diff --git a/docs/as1/api.md b/docs/as1/api.md new file mode 100644 index 0000000..7c510ad --- /dev/null +++ b/docs/as1/api.md @@ -0,0 +1,3 @@ +# As1 API + +For documentation on functions in the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package for accessing *Anopheles stephensi* data, please visit the [As1 API docs page](https://malariagen.github.io/malariagen-data-python/latest/As1.html). diff --git a/docs/as1/as1.ipynb b/docs/as1/as1.ipynb new file mode 100644 index 0000000..ac9fe32 --- /dev/null +++ b/docs/as1/as1.ipynb @@ -0,0 +1,727 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "LBNBl2exUYWu" + }, + "source": [ + "# As1\n", + "\n", + "The **[As1](as1): _Anopheles stephensi_ data resource** contains single nucleotide polymorphism (SNP) calls from whole-genome sequencing of 645 mosquitoes.\n", + "\n", + "More information about this release can be found on the [data resource website](https://www.malariagen.net/data_package/af14-anopheles-stephensi-data-resource/). \n", + "\n", + "This page provides an introduction to open data resources released as part of `As1`. \n", + "\n", + "If you have any questions about this guide or how to use the data, please [start a new discussion](https://github.com/malariagen/vector-public-data/discussions/new) on the malariagen/vector-open-data repo on GitHub. 
If you find any bugs, please [raise an issue](https://github.com/malariagen/vector-public-data/issues/new/choose)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kJqs4cXppk8j" + }, + "source": [ + "## Terms of use\n", + "\n", + "Data from this project will be made publicly available before journal publication, subject to the following publication embargo: unless otherwise stated, analyses of project data are ongoing and publications are in preparation by project partners, and it is not permitted to use project data for publication (including any type of communication with the general public) without prior permission from the originating partner studies. The publication embargo will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.\n", + "\n", + "Although malaria is generally an endemic rather than an epidemic disease, and the focus of this project is on surveillance of disease vectors rather than pathogens, our data terms of use build on MalariaGEN's approach to data sharing, and adopt norms which have been established for rapid sharing of pathogen genomic data during disease outbreaks. The primary rationale for this approach is that malaria remains a public health emergency, where ethically appropriate and rapid sharing of genomic surveillance data can help to detect and respond to biological threats such as new forms of insecticide resistance, and to adapt malaria vector control strategies to different settings and changing circumstances.\n", + "\n", + "The publication embargo for all data on this release will expire on the **16th of August 2026**. 
\n", + "\n", + "If you have any questions about the terms of use, please email [support@malariagen.net](mailto:support@malariagen.net)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iNSicUCtpk8j" + }, + "source": [ + "## Partner studies\n", + "\n", + "- [1363-VO-ET-GADISA](https://www.malariagen.net/network/where-we-work/1363-VO-ET-GADISA) - _Anopheles stephensi_ vector surveillance in Ethiopia.\n", + "\n", + "- [1364-VO-SD-KAFY](https://www.malariagen.net/network/where-we-work/1364-VO-SD-KAFY) - _Anopheles stephensi_ vector surveillance in Sudan.\n", + "\n", + "- [1365-VO-DJ-ADBI](https://www.malariagen.net/network/where-we-work/1365-VO-DJ-ADBI) - _Anopheles stephensi_ vector surveillance in Djibouti.\n", + "\n", + "- [1366-VO-YE-ALLAN](https://www.malariagen.net/network/where-we-work/1366-VO-YE-ALLAN) - _Anopheles stephensi_ vector surveillance in Yemen.\n", + "\n", + "- [1367-VO-AF-DONNELLY](https://www.malariagen.net/network/where-we-work/1367-VO-AF-DONNELLY) - _Anopheles stephensi_ vector surveillance in Afghanistan.\n", + "\n", + "- [1368-VO-PK-DONNELLY](https://www.malariagen.net/network/where-we-work/1368-VO-PK-DONNELLY) - _Anopheles stephensi_ vector surveillance in Pakistan.\n", + "\n", + "- [1369-VO-SA-AL-NAZAWI](https://www.malariagen.net/network/where-we-work/1369-VO-SA-AL-NAZAWI) - _Anopheles stephensi_ vector surveillance in Saudi Arabia.\n", + "\n", + "- [1370-VO-IR-ENAYATI](https://www.malariagen.net/network/where-we-work/1370-VO-IR-ENAYATI) - _Anopheles stephensi_ vector surveillance in Iran.\n", + "\n", + "- [1385-VO-DJ-WEETMAN](https://www.malariagen.net/network/where-we-work/1385-VO-DJ-WEETMAN) - _Anopheles stephensi_ colony samples derived from wild-caught mosquitoes in Djibouti.\n", + "\n", + "- [1386-VO-KE-OCHOMO](https://www.malariagen.net/network/where-we-work/1386-VO-KE-OCHOMO) - _Anopheles stephensi_ vector surveillance in Kenya.\n", + "\n", + "- 
[1458-VO-ET-YEWHALAW](https://www.malariagen.net/network/where-we-work/1458-VO-ET-YEWHALAW) - _Anopheles stephensi_ vector surveillance in Ethiopia.\n", + "\n", + "- [1459-VO-SD-AHMED](https://www.malariagen.net/network/where-we-work/1459-VO-SD-AHMED) - _Anopheles stephensi_ vector surveillance in Sudan.\n", + "\n", + "- [thakare-2022](https://www.malariagen.net/network/where-we-work/thakare-2022) - Previously published Indian _Anopheles stephensi_ mosquitoes from [Thakare _et al_, 2022](https://www.nature.com/articles/s41598-022-07462-3).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5RHbe7N6pk8k" + }, + "source": [ + "## Whole-genome sequencing and variant calling\n", + "\n", + "All samples in `As1` have been sequenced individually to high coverage using Illumina technology by Novogene Ltd. These sequence data have then been analysed to identify genetic variants such as single nucleotide polymorphisms (SNPs). After variant calling, both the samples and the variants have been through a range of quality control analyses, to ensure the data are of high quality. Both the raw sequence data and the curated variant calls are openly available for download and analysis. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9Hfchko2pk8l" + }, + "source": [ + "## Data hosting\n", + "\n", + "Data from `As1` are hosted by several different services. \n", + "\n", + "The SNP data have also been uploaded to Google Cloud, and can be analysed directly within the cloud without having to download or copy any data, including via free interactive computing services such as [Google Colab](https://colab.research.google.com/). Further information about analysing these data in the cloud is provided in the [cloud data access guide](cloud)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lTJ_EnvOpk8l" + }, + "source": [ + "## Sample sets\n", + "\n", + "The samples included in `As1` have been organised into 3 sample sets. 
\n", + "\n", + "Each sample set corresponds to a set of mosquito specimens from a contributing study. Study details can be found in the partner studies webpages listed above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "hGA4d7Yrpk8m", + "outputId": "c29827c1-0361-4926-c227-8f6e76c2a497", + "tags": [ + "remove-input" + ] + }, + "outputs": [], + "source": [ + "%pip install -qq malariagen_data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "AnmzLmEgpk8n", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "'use strict';\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " const force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "const JS_MIME_TYPE = 'application/javascript';\n", + " const HTML_MIME_TYPE = 'text/html';\n", + " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", + " const CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " const script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " function drop(id) {\n", + " const view = Bokeh.index.get_by_id(id)\n", + " if (view != null) {\n", + " view.model.document.clear()\n", + " Bokeh.index.delete(view)\n", + " }\n", + " }\n", + "\n", + " const cell = handle.cell;\n", + "\n", + " const id = cell.output_area._bokeh_element_id;\n", + " const server_id = cell.output_area._bokeh_server_id;\n", + "\n", + " // Clean up 
Bokeh references\n", + " if (id != null) {\n", + " drop(id)\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd_clean, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " const id = msg.content.text.trim()\n", + " drop(id)\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + " cell.notebook.kernel.execute(cmd_destroy);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " const output_area = handle.output_area;\n", + " const output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " return\n", + " }\n", + "\n", + " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " const bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " const script_attrs = bk_div.children[0].attributes;\n", + " for (let i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, 
script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " const toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + " * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? 
*/\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " const events = require('base/js/events');\n", + " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " const NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded(error = null) {\n", + " const el = document.getElementById(null);\n", + " if (el != null) {\n", + " const html = (() => {\n", + " if (typeof root.Bokeh === \"undefined\") {\n", + " if (error == null) {\n", + " return \"BokehJS is loading ...\";\n", + " } else {\n", + " return \"BokehJS failed to load.\";\n", + " }\n", + " } else {\n", + " const prefix = `BokehJS ${root.Bokeh.version}`;\n", + " if (error == null) {\n", + " return `${prefix} successfully loaded.`;\n", + " } else {\n", + " return `${prefix} encountered errors while loading and may not function as expected.`;\n", + " }\n", + " }\n", + " })();\n", + " el.innerHTML = html;\n", + "\n", + " if (error != null) {\n", + " const wrapper = document.createElement(\"div\");\n", + " wrapper.style.overflow = \"auto\";\n", + " wrapper.style.height = \"5em\";\n", + " wrapper.style.resize = \"vertical\";\n", + " const content = document.createElement(\"div\");\n", + " content.style.fontFamily = \"monospace\";\n", + " content.style.whiteSpace = \"pre-wrap\";\n", + " content.style.backgroundColor = \"rgb(255, 221, 221)\";\n", + " content.textContent = error.stack ?? 
error.toString();\n", + " wrapper.append(content);\n", + " el.append(wrapper);\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(() => display_loaded(error), 100);\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " 
document.body.appendChild(element);\n", + " }\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.5.2.min.js\"];\n", + " const css_urls = [];\n", + "\n", + " const inline_js = [ function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + "function(Bokeh) {\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " if (root.Bokeh !== undefined || force === true) {\n", + " try {\n", + " for (let i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + "\n", + " } catch (error) {throw error;\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + " }\n", + "\n", + " 
if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.5.2.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import malariagen_data\n", + "as1 = malariagen_data.As1()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 927 + }, + "id": "qsElasBepk8n", + "outputId": "4bf80a06-c2e8-4d2d-b4a6-99c8c66da7db", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sample_setsample_count
study_id
1188-VO-SN-NIANG1188-VO-NIANG-NIEL-SN-2304-VMF0025971
1330-VO-GN-LAMA1330-VO-GN-LAMA-VMF00250196
1354-VO-KE-DONNELLY1354-VO-KE-DONNELLY-VMF00281466
\n", + "
" + ], + "text/plain": [ + " sample_set sample_count\n", + "study_id \n", + "1188-VO-SN-NIANG 1188-VO-NIANG-NIEL-SN-2304-VMF00259 71\n", + "1330-VO-GN-LAMA 1330-VO-GN-LAMA-VMF00250 196\n", + "1354-VO-KE-DONNELLY 1354-VO-KE-DONNELLY-VMF00281 466" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sample_sets = as1.sample_sets(release=\"1\")\n", + "df_sample_sets[['study_id','sample_set', 'sample_count']].set_index('study_id')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yJ16OQ0Hpk8o" + }, + "source": [ + "Here is a more detailed breakdown of the samples contained within this sample set, summarised by country, year of collection, and species:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "a1OMvuTxUWpJ", + "outputId": "9f872334-fd50-4649-990a-df60ea71c12c", + "tags": [ + "remove-input" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
taxonfunestus
study_idsample_setcountryyear
1188-VO-SN-NIANG1188-VO-NIANG-NIEL-SN-2304-VMF00259Senegal202011
202116
202244
1330-VO-GN-LAMA1330-VO-GN-LAMA-VMF00250Guinea2022196
1354-VO-KE-DONNELLY1354-VO-KE-DONNELLY-VMF00281Kenya2023466
\n", + "
" + ], + "text/plain": [ + "taxon funestus\n", + "study_id sample_set country year \n", + "1188-VO-SN-NIANG 1188-VO-NIANG-NIEL-SN-2304-VMF00259 Senegal 2020 11\n", + " 2021 16\n", + " 2022 44\n", + "1330-VO-GN-LAMA 1330-VO-GN-LAMA-VMF00250 Guinea 2022 196\n", + "1354-VO-KE-DONNELLY 1354-VO-KE-DONNELLY-VMF00281 Kenya 2023 466" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = as1.sample_metadata(sample_sets=\"1.4\")\n", + "df_summary = df_samples.pivot_table(\n", + " index=[\"study_id\",\"sample_set\", \"country\", \"year\"], \n", + " columns=[\"taxon\"],\n", + " values=\"sample_id\", \n", + " aggfunc=len,\n", + " fill_value=0)\n", + "df_summary" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dLiU0ulIpk8p" + }, + "source": [ + "Note that there can be multiple sampling sites represented within the same sample set." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OToX5vhfpk8p" + }, + "source": [ + "## Further reading\n", + "\n", + "We hope this page has provided a useful introduction to the `As1` data resource. If you would like to start working with these data, please visit the [cloud data access guide](cloud) or the [data download guide](download) or continue browsing the other documentation on this site.\n", + "\n", + "If you have any questions about the data and how to use them, please do get in touch by [starting a new discussion](https://github.com/malariagen/vector-data/discussions/new) on the malariagen/vector-data repository on GitHub." 
+ ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "name": "Ag3.0-intro.ipynb", + "provenance": [] + }, + "environment": { + "kernel": "mgenv-e82ac9c", + "name": "workbench-notebooks.m138", + "type": "gcloud", + "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m138" + }, + "kernelspec": { + "display_name": "Python (mgenv-e82ac9c) (Local)", + "language": "python", + "name": "mgenv-e82ac9c" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/as1/cloud.ipynb b/docs/as1/cloud.ipynb new file mode 100644 index 0000000..2ee86d3 --- /dev/null +++ b/docs/as1/cloud.ipynb @@ -0,0 +1,9952 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "DZw8vyUJ0y5k" + }, + "source": [ + "# Af1 cloud data access\n", + "\n", + "This notebook provides information about how to download data from the [MalariaGEN Vector Observatory Anopheles funestus Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-funestus-genomic-surveillance-project) via Google Cloud. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. \n", + "\n", + "This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. 
For example, this notebook can be run via [Google Colab](https://colab.research.google.com/), which is a free interactive computing service running in the cloud.\n", + "\n", + "To launch this notebook in the cloud and run it for yourself, click the launch icon () at the top of the page and select one of the cloud computing services available.\n", + "\n", + "## Data hosting\n", + "\n", + "All data required for this notebook are hosted on Google Cloud Storage (GCS). Data are hosted in the `vo_afun_release_master_us_central1` bucket, which is a single-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zn_-HkLIQH_0" + }, + "source": [ + "## Setup\n", + "\n", + "Running this notebook requires some Python packages to be installed:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wqHBq442QH_1", + "outputId": "1c1306a2-d6f1-46a2-ee4d-30b13dad9148", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -q malariagen_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make accessing these data more convenient, we've created the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the [Af1 API docs](https://malariagen.github.io/malariagen-data-python/latest/Af1.html) for documentation of all functions available from this package. \n", + "\n", + "Import other packages we'll need to use here." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "970klnG1eu8N", + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import dask\n", + "import dask.array as da\n", + "from dask.diagnostics.progress import ProgressBar\n", + "# silence some warnings\n", + "dask.config.set(**{'array.slicing.split_large_chunks': False})\n", + "import allel\n", + "import malariagen_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jPqZ-LFPQH_2" + }, + "source": [ + "`Af1` data access from Google Cloud is set up with the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "id": "mIsSaTuOQH_2", + "outputId": "4facd5a9-6e43-460a-811c-30293568918e", + "tags": [] + }, + "outputs": [ + { + "data": { + "application/javascript": [ + "'use strict';\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " const force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + "const JS_MIME_TYPE = 'application/javascript';\n", + " const HTML_MIME_TYPE = 'text/html';\n", + " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", + " const CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " const script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " function drop(id) {\n", + " const view = Bokeh.index.get_by_id(id)\n", + " if (view != null) {\n", + " view.model.document.clear()\n", + " 
Bokeh.index.delete(view)\n", + " }\n", + " }\n", + "\n", + " const cell = handle.cell;\n", + "\n", + " const id = cell.output_area._bokeh_element_id;\n", + " const server_id = cell.output_area._bokeh_server_id;\n", + "\n", + " // Clean up Bokeh references\n", + " if (id != null) {\n", + " drop(id)\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd_clean, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " const id = msg.content.text.trim()\n", + " drop(id)\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + " cell.notebook.kernel.execute(cmd_destroy);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " const output_area = handle.output_area;\n", + " const output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " return\n", + " }\n", + "\n", + " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " const bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = 
output.data[HTML_MIME_TYPE];\n", + " const script_attrs = bk_div.children[0].attributes;\n", + " for (let i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " const toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + " * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? 
*/\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " const events = require('base/js/events');\n", + " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " const NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded(error = null) {\n", + " const el = document.getElementById(null);\n", + " if (el != null) {\n", + " const html = (() => {\n", + " if (typeof root.Bokeh === \"undefined\") {\n", + " if (error == null) {\n", + " return \"BokehJS is loading ...\";\n", + " } else {\n", + " return \"BokehJS failed to load.\";\n", + " }\n", + " } else {\n", + " const prefix = `BokehJS ${root.Bokeh.version}`;\n", + " if (error == null) {\n", + " return `${prefix} successfully loaded.`;\n", + " } else {\n", + " return `${prefix} encountered errors while loading and may not function as expected.`;\n", + " }\n", + " }\n", + " })();\n", + " el.innerHTML = html;\n", + "\n", + " if (error != null) {\n", + " const wrapper = document.createElement(\"div\");\n", + " wrapper.style.overflow = \"auto\";\n", + " wrapper.style.height = \"5em\";\n", + " wrapper.style.resize = \"vertical\";\n", + " const content = document.createElement(\"div\");\n", + " content.style.fontFamily = \"monospace\";\n", + " content.style.whiteSpace = \"pre-wrap\";\n", + " content.style.backgroundColor = \"rgb(255, 221, 221)\";\n", + " content.textContent = error.stack ?? 
error.toString();\n", + " wrapper.append(content);\n", + " el.append(wrapper);\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(() => display_loaded(error), 100);\n", + " }\n", + " }\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " 
document.body.appendChild(element);\n", + " }\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.4.1.min.js\"];\n", + " const css_urls = [];\n", + "\n", + " const inline_js = [ function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + "function(Bokeh) {\n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " if (root.Bokeh !== undefined || force === true) {\n", + " try {\n", + " for (let i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + "\n", + " } catch (error) {throw error;\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + " }\n", + "\n", + " 
if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "" + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
MalariaGEN Af1 API client
\n", + " Please note that data are subject to terms of use,\n", + " for more information see \n", + " the MalariaGEN website or contact support@malariagen.net.\n", + " See also the Af1 API docs.\n", + "
\n", + " Storage URL\n", + " gs://vo_afun_release_master_us_central1
\n", + " Data releases available\n", + " 1.0
\n", + " Results cache\n", + " None
\n", + " Cohorts analysis\n", + " 20231215
\n", + " Site filters analysis\n", + " dt_20200416
\n", + " Software version\n", + " malariagen_data 10.0.0
\n", + " Client location\n", + " Iowa, United States (Google Cloud us-central1)
\n", + " " + ], + "text/plain": [ + "\n", + "Storage URL : gs://vo_afun_release_master_us_central1\n", + "Data releases available : 1.0\n", + "Results cache : None\n", + "Cohorts analysis : 20231215\n", + "Site filters analysis : dt_20200416\n", + "Software version : malariagen_data 10.0.0\n", + "Client location : Iowa, United States (Google Cloud us-central1)\n", + "---\n", + "Please note that data are subject to terms of use,\n", + "for more information see https://www.malariagen.net/data\n", + "or contact support@malariagen.net. For API documentation see \n", + "https://malariagen.github.io/malariagen-data-python/v10.0.0/Af1.html" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "af1 = malariagen_data.Af1()\n", + "af1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note:** To access the `Af1.1`, `Af1.2` and `Af1.3` releases, you need to use the `pre=True` flag in the code above. \n", + "\n", + "This flag is used for releases to which more data will be added. In the case of `Af1.1`, `Af1.2` and `Af1.3`, CNV data for the sample sets in these releases will be included at a future date." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITy4zIVoQH_2" + }, + "source": [ + "## Sample sets\n", + "\n", + "Data are organised into different releases. As an example, data in Af1.0 are organised into 8 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. 
Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.\n", + "\n", + "To see which sample sets are available, load the sample set manifest into a pandas dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 927 + }, + "id": "b4ADQTOfQH_2", + "outputId": "f7c6d68b-053f-4698-8b6f-29720287c423" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sample_setsample_countstudy_idstudy_urlrelease
01229-VO-GH-DADZIE-VMF00095361229-VO-GH-DADZIEhttps://www.malariagen.net/network/where-we-wo...1.0
11230-VO-GA-CF-AYALA-VMF00045501230-VO-MULTI-AYALAhttps://www.malariagen.net/network/where-we-wo...1.0
21231-VO-MULTI-WONDJI-VMF000433201231-VO-MULTI-WONDJIhttps://www.malariagen.net/network/where-we-wo...1.0
31232-VO-KE-OCHOMO-VMF00044811232-VO-KE-OCHOMOhttps://www.malariagen.net/network/where-we-wo...1.0
41235-VO-MZ-PAAIJMANS-VMF00094761235-VO-MZ-PAAIJMANShttps://www.malariagen.net/network/where-we-wo...1.0
51236-VO-TZ-OKUMU-VMF00090101236-VO-TZ-OKUMUhttps://www.malariagen.net/network/where-we-wo...1.0
61240-VO-CD-KOEKEMOER-VMF00099431240-VO-MULTI-KOEKEMOERhttps://www.malariagen.net/network/where-we-wo...1.0
71240-VO-MZ-KOEKEMOER-VMF00101401240-VO-MULTI-KOEKEMOERhttps://www.malariagen.net/network/where-we-wo...1.0
\n", + "
" + ], + "text/plain": [ + " sample_set sample_count study_id \\\n", + "0 1229-VO-GH-DADZIE-VMF00095 36 1229-VO-GH-DADZIE \n", + "1 1230-VO-GA-CF-AYALA-VMF00045 50 1230-VO-MULTI-AYALA \n", + "2 1231-VO-MULTI-WONDJI-VMF00043 320 1231-VO-MULTI-WONDJI \n", + "3 1232-VO-KE-OCHOMO-VMF00044 81 1232-VO-KE-OCHOMO \n", + "4 1235-VO-MZ-PAAIJMANS-VMF00094 76 1235-VO-MZ-PAAIJMANS \n", + "5 1236-VO-TZ-OKUMU-VMF00090 10 1236-VO-TZ-OKUMU \n", + "6 1240-VO-CD-KOEKEMOER-VMF00099 43 1240-VO-MULTI-KOEKEMOER \n", + "7 1240-VO-MZ-KOEKEMOER-VMF00101 40 1240-VO-MULTI-KOEKEMOER \n", + "\n", + " study_url release \n", + "0 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "1 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "2 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "3 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "4 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "5 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "6 https://www.malariagen.net/network/where-we-wo... 1.0 \n", + "7 https://www.malariagen.net/network/where-we-wo... 1.0 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_sample_sets = af1.sample_sets(release=\"1.0\")\n", + "df_sample_sets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J0SHf6vaQH_3" + }, + "source": [ + "For more information about these sample sets, you can read about each sample set from the URLs under the field `study_url`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "78L85pli9HdO" + }, + "source": [ + "## Sample metadata\n", + "\n", + "Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen. 
These are organised by sample set.\n", + "\n", + "E.g., load sample metadata for all samples in the Af1.0 release into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe):" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 661 + }, + "id": "-V8nLGSaQH_4", + "outputId": "98a12919-fd6a-4fd5-8155-d90f05d877d7", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sample_idpartner_sample_idcontributorcountrylocationyearmonthlatitudelongitudesex_call...admin1_nameadmin1_isoadmin2_nametaxoncohort_admin1_yearcohort_admin1_monthcohort_admin1_quartercohort_admin2_yearcohort_admin2_monthcohort_admin2_quarter
0VBS241951229-GH-A-GH01Samuel DadzieGhanaDimabi201789.420-1.083F...Northern RegionGH-NPTolonfunestusGH-NP_fune_2017GH-NP_fune_2017_08GH-NP_fune_2017_Q3GH-NP_Tolon_fune_2017GH-NP_Tolon_fune_2017_08GH-NP_Tolon_fune_2017_Q3
1VBS241961229-GH-A-GH02Samuel DadzieGhanaGbullung201779.488-1.009F...Northern RegionGH-NPKumbungufunestusGH-NP_fune_2017GH-NP_fune_2017_07GH-NP_fune_2017_Q3GH-NP_Kumbungu_fune_2017GH-NP_Kumbungu_fune_2017_07GH-NP_Kumbungu_fune_2017_Q3
2VBS241971229-GH-A-GH03Samuel DadzieGhanaDimabi201779.420-1.083F...Northern RegionGH-NPTolonfunestusGH-NP_fune_2017GH-NP_fune_2017_07GH-NP_fune_2017_Q3GH-NP_Tolon_fune_2017GH-NP_Tolon_fune_2017_07GH-NP_Tolon_fune_2017_Q3
3VBS241981229-GH-A-GH04Samuel DadzieGhanaDimabi201789.420-1.083F...Northern RegionGH-NPTolonfunestusGH-NP_fune_2017GH-NP_fune_2017_08GH-NP_fune_2017_Q3GH-NP_Tolon_fune_2017GH-NP_Tolon_fune_2017_08GH-NP_Tolon_fune_2017_Q3
4VBS241991229-GH-A-GH05Samuel DadzieGhanaGupanarigu201789.497-0.952F...Northern RegionGH-NPKumbungufunestusGH-NP_fune_2017GH-NP_fune_2017_08GH-NP_fune_2017_Q3GH-NP_Kumbungu_fune_2017GH-NP_Kumbungu_fune_2017_08GH-NP_Kumbungu_fune_2017_Q3
..................................................................
651VBS245341240-MZ-A-MozF_1314Lizette KoekemoerMozambiqueMotinho20158-10.85140.594F...Cabo DelgadoMZ-PPalmafunestusMZ-P_fune_2015MZ-P_fune_2015_08MZ-P_fune_2015_Q3MZ-P_Palma_fune_2015MZ-P_Palma_fune_2015_08MZ-P_Palma_fune_2015_Q3
652VBS245351240-MZ-A-MozF_1315Lizette KoekemoerMozambiqueMotinho20158-10.85140.594F...Cabo DelgadoMZ-PPalmafunestusMZ-P_fune_2015MZ-P_fune_2015_08MZ-P_fune_2015_Q3MZ-P_Palma_fune_2015MZ-P_Palma_fune_2015_08MZ-P_Palma_fune_2015_Q3
653VBS245361240-MZ-A-MozF_1317Lizette KoekemoerMozambiqueMotinho20158-10.85140.594F...Cabo DelgadoMZ-PPalmafunestusMZ-P_fune_2015MZ-P_fune_2015_08MZ-P_fune_2015_Q3MZ-P_Palma_fune_2015MZ-P_Palma_fune_2015_08MZ-P_Palma_fune_2015_Q3
654VBS245371240-MZ-A-MozF_1319Lizette KoekemoerMozambiqueMotinho20158-10.85140.594F...Cabo DelgadoMZ-PPalmafunestusMZ-P_fune_2015MZ-P_fune_2015_08MZ-P_fune_2015_Q3MZ-P_Palma_fune_2015MZ-P_Palma_fune_2015_08MZ-P_Palma_fune_2015_Q3
655VBS245391240-MZ-A-MozF_1323Lizette KoekemoerMozambiqueMotinho20158-10.85140.594F...Cabo DelgadoMZ-PPalmafunestusMZ-P_fune_2015MZ-P_fune_2015_08MZ-P_fune_2015_Q3MZ-P_Palma_fune_2015MZ-P_Palma_fune_2015_08MZ-P_Palma_fune_2015_Q3
\n", + "

656 rows × 26 columns

\n", + "
" + ], + "text/plain": [ + " sample_id partner_sample_id contributor country location \\\n", + "0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi \n", + "1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung \n", + "2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi \n", + "3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi \n", + "4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu \n", + ".. ... ... ... ... ... \n", + "651 VBS24534 1240-MZ-A-MozF_1314 Lizette Koekemoer Mozambique Motinho \n", + "652 VBS24535 1240-MZ-A-MozF_1315 Lizette Koekemoer Mozambique Motinho \n", + "653 VBS24536 1240-MZ-A-MozF_1317 Lizette Koekemoer Mozambique Motinho \n", + "654 VBS24537 1240-MZ-A-MozF_1319 Lizette Koekemoer Mozambique Motinho \n", + "655 VBS24539 1240-MZ-A-MozF_1323 Lizette Koekemoer Mozambique Motinho \n", + "\n", + " year month latitude longitude sex_call ... admin1_name \\\n", + "0 2017 8 9.420 -1.083 F ... Northern Region \n", + "1 2017 7 9.488 -1.009 F ... Northern Region \n", + "2 2017 7 9.420 -1.083 F ... Northern Region \n", + "3 2017 8 9.420 -1.083 F ... Northern Region \n", + "4 2017 8 9.497 -0.952 F ... Northern Region \n", + ".. ... ... ... ... ... ... ... \n", + "651 2015 8 -10.851 40.594 F ... Cabo Delgado \n", + "652 2015 8 -10.851 40.594 F ... Cabo Delgado \n", + "653 2015 8 -10.851 40.594 F ... Cabo Delgado \n", + "654 2015 8 -10.851 40.594 F ... Cabo Delgado \n", + "655 2015 8 -10.851 40.594 F ... Cabo Delgado \n", + "\n", + " admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month \\\n", + "0 GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 \n", + "1 GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_07 \n", + "2 GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_07 \n", + "3 GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 \n", + "4 GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_08 \n", + ".. ... ... ... ... ... 
\n", + "651 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", + "652 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", + "653 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", + "654 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", + "655 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", + "\n", + " cohort_admin1_quarter cohort_admin2_year \\\n", + "0 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 \n", + "1 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 \n", + "2 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 \n", + "3 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 \n", + "4 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 \n", + ".. ... ... \n", + "651 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", + "652 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", + "653 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", + "654 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", + "655 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", + "\n", + " cohort_admin2_month cohort_admin2_quarter \n", + "0 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3 \n", + "1 GH-NP_Kumbungu_fune_2017_07 GH-NP_Kumbungu_fune_2017_Q3 \n", + "2 GH-NP_Tolon_fune_2017_07 GH-NP_Tolon_fune_2017_Q3 \n", + "3 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3 \n", + "4 GH-NP_Kumbungu_fune_2017_08 GH-NP_Kumbungu_fune_2017_Q3 \n", + ".. ... ... 
\n", + "651 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", + "652 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", + "653 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", + "654 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", + "655 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", + "\n", + "[656 rows x 26 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = af1.sample_metadata(sample_sets=\"1.0\")\n", + "df_samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ssCdOykfQH_4" + }, + "source": [ + "The `sample_id` column gives the sample identifier used throughout all Af1 analyses.\n", + "\n", + "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", + "\n", + "The `year` and `month` columns give the approximate date when the specimen was collected.\n", + "\n", + "The `sex_call` column gives the sex of the specimen as determined from the sequence data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9APw05D5gAQ9" + }, + "source": [ + "[Pandas](https://pandas.pydata.org/) can be used to explore and query the sample metadata in various ways. 
E.g., here is a summary of the numbers of samples by species:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "PpsTgviZQH_4", + "outputId": "ddbc9515-25dc-454f-9f02-9427f1261b06", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "taxon\n", + "funestus 656\n", + "dtype: int64" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples.groupby(\"taxon\").size()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C4EPodCJjg0a" + }, + "source": [ + "## SNP calls\n", + "\n", + "Data on SNP calls, including the SNP positions, alleles, site filters, and genotypes, can be accessed as an [xarray Dataset](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset).\n", + "\n", + "E.g., access SNP calls for chromosome 2RL for all samples in `Af1.0`." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 430 + }, + "id": "433PD7k8jlNj", + "outputId": "bc5e1b8d-f1f4-4008-df56-f577a9080561", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 1TB\n",
+       "Dimensions:                       (variants: 102882611, alleles: 4,\n",
+       "                                   samples: 656, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position              (variants) int32 412MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                (variants) uint8 103MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                     (samples) <U36 94kB dask.array<chunksize=(36,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele                (variants, alleles) |S1 412MB dask.array<chunksize=(524288, 1), meta=np.ndarray>\n",
+       "    variant_filter_pass_funestus  (variants) bool 103MB dask.array<chunksize=(300000,), meta=np.ndarray>\n",
+       "    call_genotype                 (variants, samples, ploidy) int8 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>\n",
+       "    call_GQ                       (variants, samples) int8 67GB dask.array<chunksize=(300000, 36), meta=np.ndarray>\n",
+       "    call_MQ                       (variants, samples) float32 270GB dask.array<chunksize=(300000, 36), meta=np.ndarray>\n",
+       "    call_AD                       (variants, samples, alleles) int16 540GB dask.array<chunksize=(300000, 36, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask            (variants, samples, ploidy) bool 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2RL', '3RL', 'X')
" + ], + "text/plain": [ + " Size: 1TB\n", + "Dimensions: (variants: 102882611, alleles: 4,\n", + " samples: 656, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 412MB dask.array\n", + " variant_contig (variants) uint8 103MB dask.array\n", + " sample_id (samples) \n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 412MB dask.array\n", + " variant_filter_pass_funestus (variants) bool 103MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 135GB dask.array\n", + " call_GQ (variants, samples) int8 67GB dask.array\n", + " call_MQ (variants, samples) float32 270GB dask.array\n", + " call_AD (variants, samples, alleles) int16 540GB dask.array\n", + " call_genotype_mask (variants, samples, ploidy) bool 135GB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_snps = af1.snp_calls(region=\"2RL\", sample_sets=\"1.0\")\n", + "ds_snps" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fx9ufvbCnPGn" + }, + "source": [ + "The arrays within this dataset are backed by [Dask arrays](https://docs.dask.org/en/latest/array.html), and can be accessed as shown below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Lvv-lFHJ-Um2" + }, + "source": [ + "### SNP sites and alleles\n", + "\n", + "We have called SNP genotypes in all samples at all positions in the genome where the reference allele is not \"N\". Data on this set of genomic positions and alleles for a given chromosome (e.g., 2RL) can be accessed as [Dask arrays](https://docs.dask.org/en/latest/array.html) as follows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "id": "GO5Os0epQH_5", + "outputId": "7c970e20-4811-46a1-8944-4bd7f6e8359f", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 392.47 MiB 2.00 MiB
Shape (102882611,) (524288,)
Dask graph 197 chunks in 1 graph layer
Data type int32 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 102882611\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pos = ds_snps[\"variant_position\"].data\n", + "pos" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "id": "eD5Gtb-xQH_5", + "outputId": "60a9f964-0335-4084-b359-7902d138bec3", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 392.47 MiB 1.50 MiB
Shape (102882611, 4) (524288, 3)
Dask graph 394 chunks in 4 graph layers
Data type |S1 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 4\n", + " 102882611\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "alleles = ds_snps[\"variant_allele\"].data\n", + "alleles" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k6i3W7y1QH_5" + }, + "source": [ + "Data can be loaded into memory as a [NumPy array](https://numpy.org/doc/stable/user/absolute_beginners.html) as shown in the following examples." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3_1qTYtiQH_5", + "outputId": "c260b22a-cc89-4a3c-9371-21fde9ec189e", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=int32)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read first 10 SNP positions into a numpy array\n", + "p = pos[:10].compute()\n", + "p" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UjeBeyOXQH_6", + "outputId": "4ef2a2e1-789a-4ec0-fff6-53e83f4951d1", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[b'T', b'A', b'C', b'G'],\n", + " [b'G', b'A', b'C', b'T'],\n", + " [b'G', b'A', b'C', b'T'],\n", + " [b'C', b'A', b'T', b'G'],\n", + " [b'G', b'A', b'C', b'T'],\n", + " [b'T', b'A', b'C', b'G'],\n", + " [b'C', b'A', b'T', b'G'],\n", + " [b'A', b'C', b'T', b'G'],\n", + " [b'C', b'A', b'T', b'G'],\n", + " [b'T', b'A', b'C', b'G']], dtype='|S1')" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read first 10 SNP alleles into a numpy array\n", + "a = alleles[:10].compute()\n", + "a" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XoHkXz0Cbk_p" + }, + "source": [ + "Here the first column contains the reference alleles, and the 
remaining columns contain the alternate alleles.\n", + "\n", + "Note that a byte string data type is used here for efficiency. E.g., the Python code `b'T'` represents a byte string containing the letter \"T\", which here stands for the nucleotide thymine.\n", + "\n", + "Note that we have chosen to genotype all samples at all sites in the genome, assuming all possible SNP alleles. Not all of these alternate alleles will actually have been observed in the `Af1` samples. To determine which sites and alleles are segregating, an allele count can be performed over the samples you are interested in. See the example below. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BGVj0OiyAQuX" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. \n", + "\n", + "Each set of site filters provides a \"filter_pass\" Boolean mask for each chromosome arm, where True indicates that the site passed the filter and is accessible to high quality SNP calling.\n", + "\n", + "The site filters data can be accessed as dask arrays as shown in the examples below. " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "id": "wh1AaMJ_QH_6", + "outputId": "e9b544fc-2db0-4f83-e23b-30258598d552", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 98.12 MiB 292.97 kiB
Shape (102882611,) (300000,)
Dask graph 343 chunks in 1 graph layer
Data type bool numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 102882611\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access funestus site filters for chromosome 2RL as a dask array\n", + "filter_pass = ds_snps['variant_filter_pass_funestus'].data\n", + "filter_pass" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "klokhPxwQH_6", + "outputId": "28c6cbfd-b6cc-46f0-9554-c027c4c57cae", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([False, False, False, False, False, False, False, False, False,\n", + " False])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read filter values for first 10 SNPs (True means the site passes filters)\n", + "f = filter_pass[:10].compute()\n", + "f" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that different sets of site filters are produced by different filter models. The filters in this example were produced by a decision tree, and are the defaults used across functions in the API.\n", + "\n", + "We have also produced a second set of site filters, which are the result of static cutoffs on the site summary statistics. To access these hard filters, instantiate an API client with the `site_filters_analysis` parameter as shown below.\n", + "\n", + "\n", + "```\n", + "af1_sc = malariagen_data.Af1(\n", + " site_filters_analysis=\"sc_20220908\",\n", + ")\n", + "\n", + "```\n", + "\n", + "Now, any function call via this API client will use the hard filters. 
For example, to access the site filters themselves for chromosome 3RL:\n", + "\n", + "```\n", + "x = af1_sc.site_filters(region=\"3RL\", mask=\"funestus\")\n", + "x\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sMnfrmCNBzW8" + }, + "source": [ + "### SNP genotypes\n", + "\n", + "SNP genotypes for individual samples are available. Genotypes are stored as a three-dimensional array, where the first dimension corresponds to genomic positions, the second dimension is samples, and the third dimension is ploidy (2). Values are coded as integers, where -1 represents a missing value, 0 represents the reference allele, and 1, 2, and 3 represent alternate alleles.\n", + "\n", + "SNP genotypes can be accessed as dask arrays as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 173 + }, + "id": "QPViDmX_QH_7", + "outputId": "125ba0b7-4e6d-4c61-f325-39e9eb9522e7", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 125.71 GiB 28.61 MiB
Shape (102882611, 656, 2) (300000, 50, 2)
Dask graph 5488 chunks in 9 graph layers
Data type int8 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 2\n", + " 656\n", + " 102882611\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gt = ds_snps[\"call_genotype\"].data\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lcG-QFZRRTwx" + }, + "source": [ + "Note that the columns of this array (second dimension) match the rows in the sample metadata, if the same sample sets were loaded. I.e.:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "H0pR2bOCRcLI", + "outputId": "b3283a90-3202-45e9-9482-a926594945df", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_samples = af1.sample_metadata(sample_sets=\"1.0\")\n", + "gt = ds_snps[\"call_genotype\"].data\n", + "len(df_samples) == gt.shape[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xr_FJ-xARgyS" + }, + "source": [ + "You can use this correspondence to apply further subsetting operations to the genotypes by querying the sample metadata. 
E.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "WqyNsEwLRo0q", + "outputId": "77a966bd-5ab3-416f-fb16-8cc38f46bac2", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "found 656 funestus samples\n" + ] + } + ], + "source": [ + "loc_funestus = df_samples.eval(\"taxon == 'funestus'\").values\n", + "print(f\"found {np.count_nonzero(loc_funestus)} funestus samples\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 430 + }, + "id": "auvV_O0Dx1GT", + "outputId": "e3991a1a-1289-4e3d-f3f3-1539d7d336d0", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 1TB\n",
+       "Dimensions:                       (variants: 102882611, alleles: 4,\n",
+       "                                   samples: 656, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position              (variants) int32 412MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                (variants) uint8 103MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                     (samples) <U36 94kB dask.array<chunksize=(36,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele                (variants, alleles) |S1 412MB dask.array<chunksize=(524288, 1), meta=np.ndarray>\n",
+       "    variant_filter_pass_funestus  (variants) bool 103MB dask.array<chunksize=(300000,), meta=np.ndarray>\n",
+       "    call_genotype                 (variants, samples, ploidy) int8 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>\n",
+       "    call_GQ                       (variants, samples) int8 67GB dask.array<chunksize=(300000, 36), meta=np.ndarray>\n",
+       "    call_MQ                       (variants, samples) float32 270GB dask.array<chunksize=(300000, 36), meta=np.ndarray>\n",
+       "    call_AD                       (variants, samples, alleles) int16 540GB dask.array<chunksize=(300000, 36, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask            (variants, samples, ploidy) bool 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2RL', '3RL', 'X')
" + ], + "text/plain": [ + " Size: 1TB\n", + "Dimensions: (variants: 102882611, alleles: 4,\n", + " samples: 656, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 412MB dask.array\n", + " variant_contig (variants) uint8 103MB dask.array\n", + " sample_id (samples) \n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 412MB dask.array\n", + " variant_filter_pass_funestus (variants) bool 103MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 135GB dask.array\n", + " call_GQ (variants, samples) int8 67GB dask.array\n", + " call_MQ (variants, samples) float32 270GB dask.array\n", + " call_AD (variants, samples, alleles) int16 540GB dask.array\n", + " call_genotype_mask (variants, samples, ploidy) bool 135GB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_snps_funestus = ds_snps.isel(samples=loc_funestus)\n", + "ds_snps_funestus" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xAreXD3ySw_e" + }, + "source": [ + "Data can be read into memory as numpy arrays, e.g., read genotypes for the first 5 SNPs and the first 3 samples:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "AEH-iHpYQH_7", + "outputId": "04e075b3-5f18-4e6f-882e-898335312d71", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[-1, -1],\n", + " [-1, -1],\n", + " [-1, -1]],\n", + "\n", + " [[-1, -1],\n", + " [-1, -1],\n", + " [-1, -1]],\n", + "\n", + " [[-1, -1],\n", + " [-1, -1],\n", + " [-1, -1]],\n", + "\n", + " [[-1, -1],\n", + " [-1, -1],\n", + " [-1, -1]],\n", + "\n", + " [[-1, -1],\n", + " [-1, -1],\n", + " [-1, -1]]], dtype=int8)" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": 
"execute_result" + } + ], + "source": [ + "g = gt[:5, :3, :].compute()\n", + "g" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vcMEGuGsCSig" + }, + "source": [ + "If you want to work with the genotype calls, you may find it convenient to use [scikit-allel](http://scikit-allel.readthedocs.org/).\n", + "E.g., the code below sets up a genotype array." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 207 + }, + "id": "TBuf01BdbJ6z", + "outputId": "bec96465-4d21-4647-ced0-c687674dad40", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<GenotypeDaskArray shape=(102882611, 656, 2) dtype=int8>
01234...651652653654655
0./../../../../...../../../../../.
1./../../../../...../../../../../.
2./../../../../...../../../../../.
......
102882608./../../../../...../../../../../.
102882609./../../../../...../../../../../.
102882610./../../../../...../../../../../.
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# use the scikit-allel wrapper class for genotype calls\n", + "gt = allel.GenotypeDaskArray(ds_snps[\"call_genotype\"].data)\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D40qZqB5gmJg" + }, + "source": [ + "## Copy number variation (CNV) data\n", + "\n", + "Data on copy number variation within the `Af1` cohort are available as two separate data types:\n", + "\n", + "* **HMM** -- Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.\n", + "* **Coverage calls** -- Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state and then clustering variants across individuals based on breakpoint proximity.\n", + "\n", + "For more information on the methods used to generate these data, see the [variant-calling methods](methods) page.\n", + "\n", + "The `malariagen_data` Python package provides some convenience functions for accessing these data, illustrated below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HwZnhTkSgmJh" + }, + "source": [ + "### CNV HMM\n", + "\n", + "Access HMM data via the `cnv_hmm()` method, which returns an [xarray](http://xarray.pydata.org/en/stable/) dataset. E.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 367 + }, + "id": "OLQTv13egmJh", + "outputId": "cd354704-36fd-496e-8882-081e4cbe319b", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 2GB\n",
+       "Dimensions:                   (variants: 342946, samples: 511)\n",
+       "Coordinates:\n",
+       "    variant_position          (variants) int32 1MB dask.array<chunksize=(65536,), meta=np.ndarray>\n",
+       "    variant_end               (variants) int32 1MB dask.array<chunksize=(65536,), meta=np.ndarray>\n",
+       "    variant_contig            (variants) uint8 343kB dask.array<chunksize=(342946,), meta=np.ndarray>\n",
+       "    sample_id                 (samples) object 4kB dask.array<chunksize=(8,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, samples\n",
+       "Data variables:\n",
+       "    call_CN                   (variants, samples) int8 175MB dask.array<chunksize=(65536, 8), meta=np.ndarray>\n",
+       "    call_RawCov               (variants, samples) int32 701MB dask.array<chunksize=(65536, 8), meta=np.ndarray>\n",
+       "    call_NormCov              (variants, samples) float32 701MB dask.array<chunksize=(65536, 8), meta=np.ndarray>\n",
+       "    sample_coverage_variance  (samples) float32 2kB dask.array<chunksize=(8,), meta=np.ndarray>\n",
+       "    sample_is_high_variance   (samples) bool 511B dask.array<chunksize=(8,), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2RL', '3RL', 'X')
" + ], + "text/plain": [ + " Size: 2GB\n", + "Dimensions: (variants: 342946, samples: 511)\n", + "Coordinates:\n", + " variant_position (variants) int32 1MB dask.array\n", + " variant_end (variants) int32 1MB dask.array\n", + " variant_contig (variants) uint8 343kB dask.array\n", + " sample_id (samples) object 4kB dask.array\n", + "Dimensions without coordinates: variants, samples\n", + "Data variables:\n", + " call_CN (variants, samples) int8 175MB dask.array\n", + " call_RawCov (variants, samples) int32 701MB dask.array\n", + " call_NormCov (variants, samples) float32 701MB dask.array\n", + " sample_coverage_variance (samples) float32 2kB dask.array\n", + " sample_is_high_variance (samples) bool 511B dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_hmm = af1.cnv_hmm(region=\"2RL\", sample_sets=\"1.0\")\n", + "ds_hmm" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EL8QDNS6gmJh" + }, + "source": [ + "Here \"variants\" are the 300 bp windows in which coverage was calculated and the HMM fitted. Window start positions are given by the `variant_position` array and ends are given by `variant_end`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "sfb4Vk0LgmJh", + "outputId": "df4139fd-2e6a-4606-a6a8-857818ce5abd", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 1, 301, 601, ..., 102882901, 102883201,\n", + " 102883501], dtype=int32)" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pos = ds_hmm['variant_position'].values\n", + "pos" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "lGL4D8qYgmJh", + "outputId": "a6bc6f9c-5669-410e-a4a7-f8daf249a1d1", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 300, 600, 900, ..., 102883200, 102883500,\n", + " 102883511], dtype=int32)" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "end = ds_hmm['variant_end'].values\n", + "end" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "15zeXjzFgmJh" + }, + "source": [ + "Copy number state is given by the `call_CN` array, where rows are windows and columns are individual samples.\n", + "\n", + "On the autosomes (2RL, 3RL) normal diploid copy number is 2. Values greater than 2 mean amplification, less than 2 mean deletion.\n", + "\n", + "On the X chromosome, normal copy number is 2 in females and 1 in males.\n", + "\n", + "For all chromosomes, -1 means missing, i.e., the window was not included.\n", + "\n", + "Rows are variants (windows), columns are individual samples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JVvGFxPzgmJi", + "outputId": "26cf53d6-8526-4079-f478-e21e312ccc63", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[-1, -1, -1, ..., -1, -1, -1],\n", + " [-1, -1, -1, ..., -1, -1, -1],\n", + " [-1, -1, -1, ..., -1, -1, -1],\n", + " ...,\n", + " [-1, -1, -1, ..., -1, -1, -1],\n", + " [-1, -1, -1, ..., -1, -1, -1],\n", + " [-1, -1, -1, ..., -1, -1, -1]], dtype=int8)" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cn = ds_hmm['call_CN'].values\n", + "cn" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gj3gGqIIgmJi" + }, + "source": [ + "### CNV coverage calls\n", + "\n", + "Coverage calls can be accessed via the `cnv_coverage_calls()` method, which returns an xarray dataset.\n", + "\n", + "N.B., coverage calls can only be accessed one sample set at a time, because the CNV alleles do not necessarily match between sample sets. E.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 367 + }, + "id": "OyQ3r8jGgmJi", + "outputId": "6e9d64cd-6e43-4fbd-cc2d-7c2c52d963f7", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 1MB\n",
+       "Dimensions:              (variants: 30614, samples: 8)\n",
+       "Coordinates:\n",
+       "    variant_position     (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    variant_end          (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    variant_contig       (variants) uint8 31kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    variant_id           (variants) object 245kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    sample_id            (samples) object 64B dask.array<chunksize=(8,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, samples\n",
+       "Data variables:\n",
+       "    variant_CIPOS        (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    variant_CIEND        (variants) int32 122kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    variant_filter_pass  (variants) bool 31kB dask.array<chunksize=(30614,), meta=np.ndarray>\n",
+       "    call_genotype        (variants, samples) int8 245kB dask.array<chunksize=(30614, 8), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:  ('2RL', '3RL', 'X')
" + ], + "text/plain": [ + " Size: 1MB\n", + "Dimensions: (variants: 30614, samples: 8)\n", + "Coordinates:\n", + " variant_position (variants) int32 122kB dask.array\n", + " variant_end (variants) int32 122kB dask.array\n", + " variant_contig (variants) uint8 31kB dask.array\n", + " variant_id (variants) object 245kB dask.array\n", + " sample_id (samples) object 64B dask.array\n", + "Dimensions without coordinates: variants, samples\n", + "Data variables:\n", + " variant_CIPOS (variants) int32 122kB dask.array\n", + " variant_CIEND (variants) int32 122kB dask.array\n", + " variant_filter_pass (variants) bool 31kB dask.array\n", + " call_genotype (variants, samples) int8 245kB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_cnv = af1.cnv_coverage_calls(\n", + " region='2RL', \n", + " analysis='funestus', \n", + " sample_set='1229-VO-GH-DADZIE-VMF00095'\n", + ")\n", + "ds_cnv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qvUM-luggmJj" + }, + "source": [ + "Here \"variants\" are copy number variants, inferred from the HMM results. CNV start positions are given by the `variant_position` array and ends are given by `variant_end`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6x0P-EyAgmJj", + "outputId": "46b1c201-50bc-43d1-9333-b455df43614d", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 80101, 80101, 83401, ..., 102870901, 102870901,\n", + " 102872101], dtype=int32)" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pos = ds_cnv['variant_position'].values\n", + "pos" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Hd4K_BVVgmJj", + "outputId": "80238f45-8660-4d3e-9be3-8fef36ba01d0", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 82200, 83400, 86400, ..., 102872400, 102873600,\n", + " 102873600], dtype=int32)" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "end = ds_cnv['variant_end'].values\n", + "end" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XcvUG880gmJj" + }, + "source": [ + "CNV genotypes are given by the `call_genotype` array, coded as 0 for absence and 1 for presence of the CNV allele. Rows are CNV alleles and columns are individual samples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "VfXRPYPsgmJj", + "outputId": "5f6722f7-0c71-493f-8499-10d646f12f80", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " ...,\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0],\n", + " [0, 0, 0, ..., 0, 0, 0]], dtype=int8)" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gt = ds_cnv['call_genotype'].values\n", + "gt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TMFEbBH8gmJj" + }, + "source": [ + "Note that not all samples will have been included in the coverage calls, as some are excluded due to high coverage variance. Use the `sample_id` array to access the samples included in a given dataset, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cNz9L5l9gmJk", + "outputId": "70d8d3c3-3891-4110-bed1-3090198c030c", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['VBS24196', 'VBS24197', 'VBS24201', 'VBS24213', 'VBS24216'],\n", + " dtype=object)" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "samples = ds_cnv['sample_id'].values\n", + "samples[:5]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IilAkh1JwRX6" + }, + "source": [ + "## Haplotypes\n", + "\n", + "The `Af1` data resource also includes haplotype reference panels, which were obtained by [phasing](https://en.wikipedia.org/wiki/Haplotype_estimation) the SNP calls. Phasing involves resolving the genotypes within each individual into a pair of haplotypes providing information about the two different DNA sequences inherited, one from each parent. 
Haplotypes provide greater power for a range of population genetic analyses, such as genome-wide scans for signals of recent selection, or analysis of adaptive gene flow between populations.\n", + "\n", + "Data can be accessed in the cloud via the ``haplotypes()`` method. E.g., access haplotypes from the \"funestus\" analysis for all available samples, for chromosome 2RL. This method returns an [xarray Dataset](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset)." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 283 + }, + "id": "SaboHiwByToq", + "outputId": "a0fe48cf-dd93-479d-a0c9-1ba5d10ef59b", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 27GB\n",
+       "Dimensions:           (variants: 20633859, alleles: 2, samples: 656, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position  (variants) int32 83MB dask.array<chunksize=(262144,), meta=np.ndarray>\n",
+       "    variant_contig    (variants) uint8 21MB dask.array<chunksize=(20633859,), meta=np.ndarray>\n",
+       "    sample_id         (samples) object 5kB dask.array<chunksize=(36,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele    (variants, alleles) |S1 41MB dask.array<chunksize=(262144, 1), meta=np.ndarray>\n",
+       "    call_genotype     (variants, samples, ploidy) int8 27GB dask.array<chunksize=(262144, 36, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:   ('2RL', '3RL', 'X')\n",
+       "    analysis:  funestus
" + ], + "text/plain": [ + " Size: 27GB\n", + "Dimensions: (variants: 20633859, alleles: 2, samples: 656, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 83MB dask.array\n", + " variant_contig (variants) uint8 21MB dask.array\n", + " sample_id (samples) object 5kB dask.array\n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 41MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 27GB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')\n", + " analysis: funestus" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds_haps = af1.haplotypes(region=\"2RL\", analysis=\"funestus\", sample_sets=\"1.0\")\n", + "ds_haps" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UuapIdh8yVzF" + }, + "source": [ + "Here we have haplotype data for 656 samples at 20,633,859 SNPs.\n", + "\n", + "The SNP positions and alleles can be accessed as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 132 + }, + "id": "6xtyomWQydHl", + "outputId": "54c07b6c-f36f-4992-9a07-2297602e2e72", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 78.71 MiB 1.00 MiB
Shape (20633859,) (262144,)
Dask graph 79 chunks in 1 graph layer
Data type int32 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 20633859\n", + " 1\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access haplotype SNP positions\n", + "pos = ds_haps[\"variant_position\"].data # dask array\n", + "pos" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "T_5d5JWayfmW", + "outputId": "704b7f0f-0635-4ac9-f54e-f77000192968", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([79113, 79116, 79117, 79120, 79128, 79129, 79130, 79133, 79134,\n", + " 79136], dtype=int32)" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# load positions of first 10 haplotype SNPs\n", + "pos[:10].compute() # numpy array" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "id": "1c5mKn-Gyhfi", + "outputId": "b7873b6d-4ebc-415c-c81c-9309fbfd3d56", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 39.36 MiB 256.00 kiB
Shape (20633859, 2) (262144, 1)
Dask graph 158 chunks in 5 graph layers
Data type |S1 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 2\n", + " 20633859\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access haplotype SNP alleles\n", + "alleles = ds_haps[\"variant_allele\"].data # dask array\n", + "alleles" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "DebkxYr7ylDW", + "outputId": "f5329191-c2cc-4087-c8e5-baeb10b03f36", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[b'T', b'A'],\n", + " [b'A', b'G'],\n", + " [b'T', b'A'],\n", + " [b'T', b'C'],\n", + " [b'A', b'G'],\n", + " [b'A', b'T'],\n", + " [b'A', b'G'],\n", + " [b'G', b'A'],\n", + " [b'A', b'G'],\n", + " [b'A', b'T']], dtype='|S1')" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# load alleles of first 10 haplotype SNPs - note all are biallelic\n", + "alleles[:10].compute() # numpy array" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5GzqrxDHXAud" + }, + "source": [ + "The phased genotypes can be accessed as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 173 + }, + "id": "_W3IP43Zynsy", + "outputId": "491fa5c8-558c-4f83-e588-8fc3db0071c3", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 25.21 GiB 32.00 MiB
Shape (20633859, 656, 2) (262144, 64, 2)
Dask graph 1106 chunks in 9 graph layers
Data type int8 numpy.ndarray
\n", + "
\n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + "\n", + " \n", + " 2\n", + " 656\n", + " 20633859\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access genotypes\n", + "gt = ds_haps[\"call_genotype\"].data # dask array\n", + "gt" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 207 + }, + "id": "qUuzaZ8B3z8T", + "outputId": "b9ace447-d343-4ed0-9166-9c17e97bf86a", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<GenotypeDaskArray shape=(20633859, 656, 2) dtype=int8>
            0    1    2    3    4  ...  651  652  653  654  655
0         0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
1         0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
2         0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
...       ...
20633856  0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
20633857  0/0  0/0  0/0  0/0  0/0  ...  0/0  0/0  0/0  0/0  0/0
20633858  0/1  1/0  1/0  0/1  0/1  ...  0/1  1/0  0/1  0/0  0/0
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# wrap as scikit-allel genotype array\n", + "gt = allel.GenotypeDaskArray(ds_haps[\"call_genotype\"].data)\n", + "gt" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 207 + }, + "id": "AHY9z5Bp31wJ", + "outputId": "4dcec211-eb6d-45c9-ff2a-6e27218aea02", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<HaplotypeDaskArray shape=(20633859, 1312) dtype=int8>
             0     1     2     3     4  ...  1307  1308  1309  1310  1311
0            0     0     0     0     0  ...     0     0     0     0     0
1            0     0     0     0     0  ...     0     0     0     0     0
2            0     0     0     0     0  ...     0     0     0     0     0
...       ...
20633856     0     0     0     0     0  ...     0     0     0     0     0
20633857     0     0     0     0     0  ...     0     0     0     0     0
20633858     0     1     1     0     1  ...     1     0     0     0     0
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# convert to scikit-allel haplotype array - useful for some analyses\n", + "ht = gt.to_haplotypes()\n", + "ht" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lSL0FWMv39Xf" + }, + "source": [ + "Note that in the haplotype array above, each row is a SNP and each column is a haplotype. There were $656$ samples in this analysis, and so we have $1312$ ($2\\times656$) haplotypes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ulR0JYZT4J0l" + }, + "source": [ + "### Using sample metadata to subset haplotypes\n", + "\n", + "For some analyses, you'll want to subset the haplotypes, e.g., by location and species. To do this, you need to obtain the sample metadata and align it with the haplotype data. Alignment ensures that each row in the sample metadata dataframe corresponds to the sample at the same position along the samples dimension of the phased genotypes array. E.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + }, + "id": "v7u8OA7t4ZMl", + "outputId": "d0b057f5-6805-4dd1-eeb2-4ce2744fc780", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   sample_id  partner_sample_id  contributor    country  location    year  month  latitude  longitude  sex_call  ...  admin1_name      admin1_iso  admin2_name  taxon     cohort_admin1_year  cohort_admin1_month  cohort_admin1_quarter  cohort_admin2_year        cohort_admin2_month          cohort_admin2_quarter
0  VBS24195   1229-GH-A-GH01     Samuel Dadzie  Ghana    Dimabi      2017  8      9.420     -1.083     F         ...  Northern Region  GH-NP       Tolon        funestus  GH-NP_fune_2017     GH-NP_fune_2017_08   GH-NP_fune_2017_Q3     GH-NP_Tolon_fune_2017     GH-NP_Tolon_fune_2017_08     GH-NP_Tolon_fune_2017_Q3
1  VBS24196   1229-GH-A-GH02     Samuel Dadzie  Ghana    Gbullung    2017  7      9.488     -1.009     F         ...  Northern Region  GH-NP       Kumbungu     funestus  GH-NP_fune_2017     GH-NP_fune_2017_07   GH-NP_fune_2017_Q3     GH-NP_Kumbungu_fune_2017  GH-NP_Kumbungu_fune_2017_07  GH-NP_Kumbungu_fune_2017_Q3
2  VBS24197   1229-GH-A-GH03     Samuel Dadzie  Ghana    Dimabi      2017  7      9.420     -1.083     F         ...  Northern Region  GH-NP       Tolon        funestus  GH-NP_fune_2017     GH-NP_fune_2017_07   GH-NP_fune_2017_Q3     GH-NP_Tolon_fune_2017     GH-NP_Tolon_fune_2017_07     GH-NP_Tolon_fune_2017_Q3
3  VBS24198   1229-GH-A-GH04     Samuel Dadzie  Ghana    Dimabi      2017  8      9.420     -1.083     F         ...  Northern Region  GH-NP       Tolon        funestus  GH-NP_fune_2017     GH-NP_fune_2017_08   GH-NP_fune_2017_Q3     GH-NP_Tolon_fune_2017     GH-NP_Tolon_fune_2017_08     GH-NP_Tolon_fune_2017_Q3
4  VBS24199   1229-GH-A-GH05     Samuel Dadzie  Ghana    Gupanarigu  2017  8      9.497     -0.952     F         ...  Northern Region  GH-NP       Kumbungu     funestus  GH-NP_fune_2017     GH-NP_fune_2017_08   GH-NP_fune_2017_Q3     GH-NP_Kumbungu_fune_2017  GH-NP_Kumbungu_fune_2017_08  GH-NP_Kumbungu_fune_2017_Q3
\n", + "

5 rows × 26 columns

\n", + "
" + ], + "text/plain": [ + " sample_id partner_sample_id contributor country location year month \\\n", + "0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi 2017 8 \n", + "1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung 2017 7 \n", + "2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi 2017 7 \n", + "3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi 2017 8 \n", + "4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu 2017 8 \n", + "\n", + " latitude longitude sex_call ... admin1_name admin1_iso admin2_name \\\n", + "0 9.420 -1.083 F ... Northern Region GH-NP Tolon \n", + "1 9.488 -1.009 F ... Northern Region GH-NP Kumbungu \n", + "2 9.420 -1.083 F ... Northern Region GH-NP Tolon \n", + "3 9.420 -1.083 F ... Northern Region GH-NP Tolon \n", + "4 9.497 -0.952 F ... Northern Region GH-NP Kumbungu \n", + "\n", + " taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter \\\n", + "0 funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 \n", + "1 funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 \n", + "2 funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 \n", + "3 funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 \n", + "4 funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 \n", + "\n", + " cohort_admin2_year cohort_admin2_month \\\n", + "0 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 \n", + "1 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_07 \n", + "2 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_07 \n", + "3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 \n", + "4 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_08 \n", + "\n", + " cohort_admin2_quarter \n", + "0 GH-NP_Tolon_fune_2017_Q3 \n", + "1 GH-NP_Kumbungu_fune_2017_Q3 \n", + "2 GH-NP_Tolon_fune_2017_Q3 \n", + "3 GH-NP_Tolon_fune_2017_Q3 \n", + "4 GH-NP_Kumbungu_fune_2017_Q3 \n", + "\n", + "[5 rows x 26 columns]" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"# load sample metadata\n", + "df_samples = af1.sample_metadata(sample_sets=\"1.0\")\n", + "df_samples.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "D2tkmZ0b4cQd", + "outputId": "4d758e09-0ef0-4d9f-b6d8-af5246e6b1d3", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['VBS24195', 'VBS24196', 'VBS24197', 'VBS24198', 'VBS24199'],\n", + " dtype=object)" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# load IDs of phased samples\n", + "samples_phased = ds_haps['sample_id'].values\n", + "samples_phased[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 386 + }, + "id": "to11muCk4hJp", + "outputId": "532d54ed-3020-4c65-dab3-86897a8903b5", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   sample_id  partner_sample_id  contributor    country  location    year  month  latitude  longitude  sex_call  ...  admin1_name      admin1_iso  admin2_name  taxon     cohort_admin1_year  cohort_admin1_month  cohort_admin1_quarter  cohort_admin2_year        cohort_admin2_month          cohort_admin2_quarter
0  VBS24195   1229-GH-A-GH01     Samuel Dadzie  Ghana    Dimabi      2017  8      9.420     -1.083     F         ...  Northern Region  GH-NP       Tolon        funestus  GH-NP_fune_2017     GH-NP_fune_2017_08   GH-NP_fune_2017_Q3     GH-NP_Tolon_fune_2017     GH-NP_Tolon_fune_2017_08     GH-NP_Tolon_fune_2017_Q3
1  VBS24196   1229-GH-A-GH02     Samuel Dadzie  Ghana    Gbullung    2017  7      9.488     -1.009     F         ...  Northern Region  GH-NP       Kumbungu     funestus  GH-NP_fune_2017     GH-NP_fune_2017_07   GH-NP_fune_2017_Q3     GH-NP_Kumbungu_fune_2017  GH-NP_Kumbungu_fune_2017_07  GH-NP_Kumbungu_fune_2017_Q3
2  VBS24197   1229-GH-A-GH03     Samuel Dadzie  Ghana    Dimabi      2017  7      9.420     -1.083     F         ...  Northern Region  GH-NP       Tolon        funestus  GH-NP_fune_2017     GH-NP_fune_2017_07   GH-NP_fune_2017_Q3     GH-NP_Tolon_fune_2017     GH-NP_Tolon_fune_2017_07     GH-NP_Tolon_fune_2017_Q3
3  VBS24198   1229-GH-A-GH04     Samuel Dadzie  Ghana    Dimabi      2017  8      9.420     -1.083     F         ...  Northern Region  GH-NP       Tolon        funestus  GH-NP_fune_2017     GH-NP_fune_2017_08   GH-NP_fune_2017_Q3     GH-NP_Tolon_fune_2017     GH-NP_Tolon_fune_2017_08     GH-NP_Tolon_fune_2017_Q3
4  VBS24199   1229-GH-A-GH05     Samuel Dadzie  Ghana    Gupanarigu  2017  8      9.497     -0.952     F         ...  Northern Region  GH-NP       Kumbungu     funestus  GH-NP_fune_2017     GH-NP_fune_2017_08   GH-NP_fune_2017_Q3     GH-NP_Kumbungu_fune_2017  GH-NP_Kumbungu_fune_2017_08  GH-NP_Kumbungu_fune_2017_Q3
\n", + "

5 rows × 26 columns

\n", + "
" + ], + "text/plain": [ + " sample_id partner_sample_id contributor country location year month \\\n", + "0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi 2017 8 \n", + "1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung 2017 7 \n", + "2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi 2017 7 \n", + "3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi 2017 8 \n", + "4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu 2017 8 \n", + "\n", + " latitude longitude sex_call ... admin1_name admin1_iso admin2_name \\\n", + "0 9.420 -1.083 F ... Northern Region GH-NP Tolon \n", + "1 9.488 -1.009 F ... Northern Region GH-NP Kumbungu \n", + "2 9.420 -1.083 F ... Northern Region GH-NP Tolon \n", + "3 9.420 -1.083 F ... Northern Region GH-NP Tolon \n", + "4 9.497 -0.952 F ... Northern Region GH-NP Kumbungu \n", + "\n", + " taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter \\\n", + "0 funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 \n", + "1 funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 \n", + "2 funestus GH-NP_fune_2017 GH-NP_fune_2017_07 GH-NP_fune_2017_Q3 \n", + "3 funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 \n", + "4 funestus GH-NP_fune_2017 GH-NP_fune_2017_08 GH-NP_fune_2017_Q3 \n", + "\n", + " cohort_admin2_year cohort_admin2_month \\\n", + "0 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 \n", + "1 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_07 \n", + "2 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_07 \n", + "3 GH-NP_Tolon_fune_2017 GH-NP_Tolon_fune_2017_08 \n", + "4 GH-NP_Kumbungu_fune_2017 GH-NP_Kumbungu_fune_2017_08 \n", + "\n", + " cohort_admin2_quarter \n", + "0 GH-NP_Tolon_fune_2017_Q3 \n", + "1 GH-NP_Kumbungu_fune_2017_Q3 \n", + "2 GH-NP_Tolon_fune_2017_Q3 \n", + "3 GH-NP_Tolon_fune_2017_Q3 \n", + "4 GH-NP_Kumbungu_fune_2017_Q3 \n", + "\n", + "[5 rows x 26 columns]" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"# align sample metadata to haplotypes\n", + "df_samples_phased = df_samples.set_index(\"sample_id\").loc[samples_phased].reset_index()\n", + "df_samples_phased.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9yPZeWiM4mc5", + "outputId": "6fb945e7-21ca-4f66-92e9-bd10279713ab", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", + " 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,\n", + " 34, 35])" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# now define some cohort of interest and locate samples\n", + "cohort_query = \"country == 'Ghana' and taxon == 'funestus' and year == 2017\"\n", + "loc_cohort_samples = df_samples_phased.query(cohort_query).index.values\n", + "loc_cohort_samples" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 283 + }, + "id": "_0Qql83z5KDF", + "outputId": "6a1b0681-4951-491b-8cb8-2af632b153d4", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 2GB\n",
+       "Dimensions:           (variants: 20633859, alleles: 2, samples: 36, ploidy: 2)\n",
+       "Coordinates:\n",
+       "    variant_position  (variants) int32 83MB dask.array<chunksize=(262144,), meta=np.ndarray>\n",
+       "    variant_contig    (variants) uint8 21MB dask.array<chunksize=(20633859,), meta=np.ndarray>\n",
+       "    sample_id         (samples) object 288B dask.array<chunksize=(36,), meta=np.ndarray>\n",
+       "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
+       "Data variables:\n",
+       "    variant_allele    (variants, alleles) |S1 41MB dask.array<chunksize=(262144, 1), meta=np.ndarray>\n",
+       "    call_genotype     (variants, samples, ploidy) int8 1GB dask.array<chunksize=(262144, 36, 2), meta=np.ndarray>\n",
+       "Attributes:\n",
+       "    contigs:   ('2RL', '3RL', 'X')\n",
+       "    analysis:  funestus
" + ], + "text/plain": [ + " Size: 2GB\n", + "Dimensions: (variants: 20633859, alleles: 2, samples: 36, ploidy: 2)\n", + "Coordinates:\n", + " variant_position (variants) int32 83MB dask.array\n", + " variant_contig (variants) uint8 21MB dask.array\n", + " sample_id (samples) object 288B dask.array\n", + "Dimensions without coordinates: variants, alleles, samples, ploidy\n", + "Data variables:\n", + " variant_allele (variants, alleles) |S1 41MB dask.array\n", + " call_genotype (variants, samples, ploidy) int8 1GB dask.array\n", + "Attributes:\n", + " contigs: ('2RL', '3RL', 'X')\n", + " analysis: funestus" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# subset haplotypes to cohort\n", + "ds_haps_cohort = ds_haps.isel(samples=loc_cohort_samples)\n", + "ds_haps_cohort" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 207 + }, + "id": "3lyENPmm5S-t", + "outputId": "393e70fe-1a01-4256-8ea5-e2b0fad0ece3", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<HaplotypeDaskArray shape=(20633859, 72) dtype=int8>
           0   1   2   3   4  ...  67  68  69  70  71
0          0   0   0   0   0  ...   0   0   0   0   0
1          0   0   0   0   0  ...   0   0   0   0   0
2          0   0   0   0   0  ...   0   0   0   0   0
...       ...
20633856   0   0   0   0   0  ...   0   0   0   0   0
20633857   0   0   0   0   0  ...   0   0   0   0   0
20633858   0   1   1   0   1  ...   0   0   1   0   1
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# now access subsetted haplotypes\n", + "gt_cohort = allel.GenotypeDaskArray(ds_haps_cohort['call_genotype'].data)\n", + "ht_cohort = gt_cohort.to_haplotypes()\n", + "ht_cohort" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q2a0a0KY5DMh" + }, + "source": [ + "Note there are $36$ samples in the cohort and thus $72$ ($2\\times36$) haplotypes. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "arZZ_OcPoSQV" + }, + "source": [ + "## Example computation\n", + "\n", + "Here's an example computation to count the number of segregating SNPs on chromosome 2RL that also pass the funestus site filters. This may take a minute or two, because it scans genotype calls at millions of SNPs across every sample in the chosen sample set." + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mPUEp61aQH_8", + "outputId": "c8eecf02-09d0-4797-f25d-cf56ae1c8bb5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[########################################] | 100% Completed | 30.1s\n" + ] + }, + { + "data": { + "text/plain": [ + "4779799" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# choose chromosome arm\n", + "region = \"2RL\"\n", + "\n", + "# choose site filter mask\n", + "mask = \"funestus\"\n", + "\n", + "# choose sample sets\n", + "sample_sets = [\"1229-VO-GH-DADZIE-VMF00095\"]\n", + "\n", + "# access SNP calls\n", + "ds_snps = af1.snp_calls(region=region, sample_sets=sample_sets)\n", + "\n", + "# locate pass sites\n", + "loc_pass = ds_snps[f\"variant_filter_pass_{mask}\"].values\n", + "\n", + "# perform an allele count over genotypes\n", + "gt = allel.GenotypeDaskArray(ds_snps[\"call_genotype\"].data)\n", + "with ProgressBar():\n", + 
"    ac = gt.count_alleles(max_allele=3).compute()\n", + "\n", + "# locate segregating sites\n", + "loc_seg = ac.is_segregating()\n", + "\n", + "# count segregating and pass sites\n", + "n_pass_seg = np.count_nonzero(loc_pass & loc_seg)\n", + "n_pass_seg" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OS4U1IwZgARB" + }, + "source": [ + "## Running larger computations\n", + "\n", + "Please note that free cloud computing services such as Google Colab and MyBinder provide only limited computing resources. Thus, although these services are able to efficiently read `Af1` data stored on Google Cloud, you may find that you run out of memory, or that computations take a long time when running on a single core. If you would like any suggestions regarding how to set up more powerful compute resources in the cloud, please feel free to get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4n73mSO-heAF" + }, + "source": [ + "## Feedback and suggestions\n", + "\n", + "If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know. Please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)."
+ ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "collapsed_sections": [], + "name": "Ag3.0 cloud data access 2022-03-14.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "global-global-mgenv-6.0.6", + "language": "python", + "name": "conda-env-global-global-mgenv-6.0.6-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/as1/download.ipynb b/docs/as1/download.ipynb new file mode 100644 index 0000000..fa0d82e --- /dev/null +++ b/docs/as1/download.ipynb @@ -0,0 +1,822 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "p0VbAgTdnvpP" + }, + "source": [ + "# Af1 data downloads\n", + "\n", + "This notebook provides information about how to download data from the [MalariaGEN Vector Observatory Anopheles funestus Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-funestus-genomic-surveillance-project). This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. Data from other releases can be accessed by changing the release in the examples from `v1` to the specific Af release, e.g. `v1.0`.\n", + "\n", + "Code examples that are intended to be run via a Linux command line are prefixed with an exclamation mark (!). If you are running these commands directly from a terminal, remove the exclamation mark.\n", + "\n", + "Examples in this notebook assume you are downloading data to a local folder within your home directory at the path `~/vo_afun_release/`. 
Change this if you want to download to a different folder on the local file system.\n", + "\n", + "## Data hosting\n", + "\n", + "`Af1` data are hosted by several different services.\n", + "\n", + "Raw sequence reads in FASTQ format and sequence read alignments in BAM format are hosted by the European Nucleotide Archive (ENA). This guide provides examples of downloading data from ENA via FTP using the `wget` command line tool, but please note that there are several other options for downloading data; see the [ENA documentation on how to download data files](https://ena-docs.readthedocs.io/en/latest/retrieval/file-download.html) for more information. \n", + "\n", + "SNP calls in VCF and Zarr formats are hosted on S3-compatible object storage at the Sanger Institute. This guide provides examples of downloading these data using `wget`.\n", + "\n", + "Sample metadata in CSV format are hosted on Google Cloud Storage (GCS) in the `vo_afun_release_master_us_central1` bucket, which is a multi-region bucket located in the United States. All data hosted on GCS are publicly accessible but do require an authentication step; please see details on the [Vector Observatory Data Access page](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).\n", + "\n", + "The guide below provides examples of downloading data from GCS to a local computer using the `wget` and `gsutil` command line tools. For more information about `gsutil`, see the [gsutil tool documentation](https://cloud.google.com/storage/docs/gsutil)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "t1wyCDH5nvpS" + }, + "source": [ + "## Sample sets\n", + "\n", + "Data in these releases are organised into sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to download data from only specific sample sets, or all sample sets. 
For convenience there is a tab-delimited manifest file listing all sample sets in the release, this can be downloaded via `gsutil` to a directory on the local file system, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "rsX4TP6UnvpS", + "outputId": "a9afc995-80b7-4f62-ad0b-b4d95822cf38", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/Users/ah32/vo_afun_release\n", + "/Users/ah32/vo_afun_release/v1.0\n", + "Copying gs://vo_afun_release/v1.0/manifest.tsv...\n", + "/ [1 files][ 1015 B/ 1015 B] \n", + "Operation completed over 1 objects/1015.0 B. \n" + ] + } + ], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/\n", + "!gsutil cp gs://vo_afun_release_master_us_central1/v1.0/manifest.tsv ~/vo_afun_release/v1.0/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hWOAFxIDnvpT" + }, + "source": [ + "Here are the file contents:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vC4ACrTEnvpT", + "outputId": "c7cfe64a-9a78-42ea-dbd9-9cc82410372d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample_set\tsample_count\tstudy_id\tstudy_url\n", + "1229-VO-GH-DADZIE-VMF00095\t36\t1229-VO-GH-DADZIE\thttps://www.malariagen.net/network/where-we-work/1229-VO-GH-DADZIE\n", + "1230-VO-GA-CF-AYALA-VMF00045\t50\t1230-VO-MULTI-AYALA\thttps://www.malariagen.net/network/where-we-work/1230-VO-MULTI-AYALA\n", + "1231-VO-MULTI-WONDJI-VMF00043\t320\t1231-VO-MULTI-WONDJI\thttps://www.malariagen.net/network/where-we-work/1231-VO-MULTI-WONDJI\n", + "1232-VO-KE-OCHOMO-VMF00044\t81\t1232-VO-KE-OCHOMO\thttps://www.malariagen.net/network/where-we-work/1232-VO-KE-OCHOMO\n", + 
"1235-VO-MZ-PAAIJMANS-VMF00094\t76\t1235-VO-MZ-PAAIJMANS\thttps://www.malariagen.net/network/where-we-work/1235-VO-MZ-PAAIJMANS\n", + "1236-VO-TZ-OKUMU-VMF00090\t10\t1236-VO-TZ-OKUMU\thttps://www.malariagen.net/network/where-we-work/1236-VO-TZ-OKUMU\n", + "1240-VO-CD-KOEKEMOER-VMF00099\t43\t1240-VO-MULTI-KOEKEMOER\thttps://www.malariagen.net/network/where-we-work/1240-VO-MULTI-KOEKEMOER\n", + "1240-VO-MZ-KOEKEMOER-VMF00101\t40\t1240-VO-MULTI-KOEKEMOER\thttps://www.malariagen.net/network/where-we-work/1240-VO-MULTI-KOEKEMOER\n" + ] + } + ], + "source": [ + "!cat ~/vo_afun_release/v1.0/manifest.tsv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5hXT_c0pnvpU" + }, + "source": [ + "For more information about these sample sets, you can explore the [Af1.0 data user guide](https://malariagen.github.io/vector-data/af1/af1.0.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D0m-HL43nvpU" + }, + "source": [ + "## Sample metadata\n", + "\n", + "Data about the samples that were sequenced to generate this data resource are available, including the time and place of collection, the gender of the specimen, and our call regarding the species of the specimen.\n", + "\n", + "### Specimen collection metadata\n", + "\n", + "Specimen collection metadata can be downloaded from GCS. E.g., sample metadata for all sample sets can be downloaded using `gsutil`. If you only want the sample metadata for a single sample set, these can be accessed by including the sample set name on the link below, e.g. 
to access the metadata for `1229-VO-GH-DADZIE-VMF00095`, you would use: `gs://vo_afun_release_master_us_central1/v1.0/metadata/general/1229-VO-GH-DADZIE-VMF00095/samples.meta.csv`:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CsQVgCl7nvpV", + "outputId": "e0409bcb-5eca-4b1b-e703-e968508f3aec", + "tags": [ + "hide-output" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/Users/ah32/vo_afun_release/v1.0/metadata\n", + "Building synchronization state...\n", + "If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o \"GSUtil:parallel_process_count=1\"`. Note that multithreading is still available even if you disable multiprocessing.\n", + "\n", + "Starting synchronization...\n", + "If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o \"GSUtil:parallel_process_count=1\"`. 
Note that multithreading is still available even if you disable multiprocessing.\n", + "\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1229-VO-GH-DADZIE-VMF00095/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1230-VO-GA-CF-AYALA-VMF00045/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1229-VO-GH-DADZIE-VMF00095/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1235-VO-MZ-PAAIJMANS-VMF00094/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1231-VO-MULTI-WONDJI-VMF00043/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1235-VO-MZ-PAAIJMANS-VMF00094/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1232-VO-KE-OCHOMO-VMF00044/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-CD-KOEKEMOER-VMF00099/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1231-VO-MULTI-WONDJI-VMF00043/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1232-VO-KE-OCHOMO-VMF00044/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1236-VO-TZ-OKUMU-VMF00090/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-CD-KOEKEMOER-VMF00099/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-MZ-KOEKEMOER-VMF00101/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1230-VO-GA-CF-AYALA-VMF00045/wgs_snp_data.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1240-VO-MZ-KOEKEMOER-VMF00101/samples.meta.csv...\n", + "Copying gs://vo_afun_release/v1.0/metadata/general/README.md... \n", + "Copying gs://vo_afun_release/v1.0/metadata/general/1236-VO-TZ-OKUMU-VMF00090/samples.meta.csv...\n", + "- [17/17 files][305.0 KiB/305.0 KiB] 100% Done \n", + "Operation completed over 17 objects/305.0 KiB. 
\n" + ] + } + ], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/metadata/\n", + "!gsutil -m rsync -r gs://vo_afun_release_master_us_central1/v1.0/metadata/general/ ~/vo_afun_release/v1.0/metadata/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R7GeyShRnvpV" + }, + "source": [ + "Here are the first few rows of the sample metadata for sample set `1229-VO-GH-DADZIE-VMF00095`:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dhKjnl6knvpW", + "outputId": "6345e845-5288-41a1-e877-5417559b8c6c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call\n", + "VBS24195,1229-GH-A-GH01,Samuel Dadzie,Ghana,Dimabi,2017,8,9.420,-1.083,F\n", + "VBS24196,1229-GH-A-GH02,Samuel Dadzie,Ghana,Gbullung,2017,7,9.488,-1.009,F\n", + "VBS24197,1229-GH-A-GH03,Samuel Dadzie,Ghana,Dimabi,2017,7,9.420,-1.083,F\n", + "VBS24198,1229-GH-A-GH04,Samuel Dadzie,Ghana,Dimabi,2017,8,9.420,-1.083,F\n", + "VBS24199,1229-GH-A-GH05,Samuel Dadzie,Ghana,Gupanarigu,2017,8,9.497,-0.952,F\n", + "VBS24200,1229-GH-A-GH06,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F\n", + "VBS24201,1229-GH-A-GH07,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F\n", + "VBS24202,1229-GH-A-GH08,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F\n", + "VBS24203,1229-GH-A-GH09,Samuel Dadzie,Ghana,Gupanarigu,2017,7,9.497,-0.952,F\n" + ] + } + ], + "source": [ + "!head ~/vo_afun_release/v1.0/metadata/1229-VO-GH-DADZIE-VMF00095/samples.meta.csv" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VKki7qHunvpW" + }, + "source": [ + "The `sample_id` column gives the sample identifier used throughout all analyses.\n", + "\n", + "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", + "\n", + "The `year` and `month` 
columns give the approximate date when the specimen was collected.\n", + "\n", + "The `sex_call` column gives the sex of the specimen as determined from the sequence data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EpMi0q3dnvpZ" + }, + "source": [ + "## SNP calls (VCF format)\n", + "\n", + "### SNP genotypes\n", + "\n", + "SNP genotypes for individual mosquitoes in VCF format are available for download from Sanger S3-compatible object storage. A VCF file is available for each individual sample. To download a VCF file for a given sample, you will need the sample identifier and the sample set to which the sample belongs. The download URL for each sample is listed in the data catalog (`wgs_snp_data.csv`) in the metadata. E.g., for sample set `1229-VO-GH-DADZIE-VMF00095`:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "sample_id,snp_genotypes_vcf\n", + "VBS24195,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24195.vcf.gz\n", + "VBS24196,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24196.vcf.gz\n", + "VBS24197,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24197.vcf.gz\n", + "VBS24198,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24198.vcf.gz\n", + "VBS24199,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24199.vcf.gz\n", + "VBS24200,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24200.vcf.gz\n", + "VBS24201,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24201.vcf.gz\n", + "VBS24202,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24202.vcf.gz\n", + "VBS24203,https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24203.vcf.gz\n" + ] + } + ], + "source": [ + "!head ~/vo_afun_release/v1.0/metadata/1229-VO-GH-DADZIE-VMF00095/wgs_snp_data.csv | cut -f1,4 -d," + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A VCF file and its associated tabix index can be downloaded via `wget`, e.g.:" + ] + }, + { + "cell_type": "code", +
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget --no-clobber https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24195.vcf.gz\n", + "!wget --no-clobber https://1229-vo-gh-dadzie-vmf00095.cog.sanger.ac.uk/VBS24195.vcf.gz.tbi" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rd1icA5Snvpa" + }, + "source": [ + "Note that each of these VCF files is around 3 GB, so downloading may take some time, and sufficient local storage will be needed.\n", + "\n", + "Each of these VCF files is an \"all sites\" VCF file, meaning that genotypes have been called at all genomic positions where the reference nucleotide is not \"N\", regardless of whether variation is observed in the given sample. This means that VCFs from multiple samples can be merged easily to create a multi-sample VCF, which may be required for certain analyses. For example, the code below merges VCFs for two samples for the first 1 Mbp of contig 3RL:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RcWJS9XJnvpa", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!bcftools merge --output-type z --regions 3RL:1-1000000 --output merged.vcf.gz VBS24195.vcf.gz VBS24196.vcf.gz " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "un-0qdeEnvpa" + }, + "source": [ + "If you are just interested in analysing variants within a given set of samples, you might like to filter the merged VCF to remove non-variant sites and alleles, e.g., using [bcftools view](http://samtools.github.io/bcftools/bcftools.html#view):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tQ7ZQEQznvpa" + }, + "outputs": [], + "source": [ + "!bcftools view --output-type z --output-file merged_variant.vcf.gz --min-ac 1:nonmajor --trim-alt-alleles merged.vcf.gz" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZgpIO8Oknvpa" + }, + "source": [ + "### Site filters\n", + "\n", +
"SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. We have created some sites-only VCF files with site filter information in the `FILTER` column. These VCF files are hosted on GCS. \n", + "\n", + "Each filter is available as a set of VCF files, one per chromosome arm. E.g., you can access the site filters on chromosome arms 2RL from:\n", + "\n", + "`gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/vcf/funestus/2RL_sitefilters.vcf.gz`\n", + "\n", + "Alternatively, all site filters VCFs can be downloaded using `gsutil`, e.g.:\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XQjL7R3bnvpa", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/vcf/funestus/ \\\n", + " ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/vcf/funestus/ \\\n", + " ~/vo_afun_release/v1.0/site_filters/dt_20200416/vcf/funestus/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note these filters are the result of different filter models, in this case, a decision-tree is used. These filters are the default ones used across the function.\n", + "\n", + "We have also produced a second set of site filters, which are the result of static cutoffs on the site summary statistics. 
\n", + "These hard-filters can also be downloaded via `gsutil`, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/site_filters/sc_20220908/vcf/funestus/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/site_filters/sc_20220908/vcf/funestus/ \\\n", + " ~/vo_afun_release/v1.0/site_filters/sc_20220908/vcf/funestus/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OBXGXzj9nvpb" + }, + "source": [ + "## SNP calls (Zarr format)\n", + "\n", + "SNP data are also available in Zarr format, which can be convenient and efficient to use for certain types of analysis. These data can be analysed directly in the cloud without downloading to the local system, see the [Af1 cloud data access guide](https://malariagen.github.io/vector-data/af1/cloud.html) for more information. The data can also be downloaded to your own system for local analysis if that is more convenient. Below are examples of how to download the Zarr data to your local system.\n", + "\n", + "The data are organised into several Zarr hierarchies. \n", + "\n", + "### SNP sites and alleles\n", + "\n", + "Data on the genomic positions (sites) and reference and alternate alleles that were genotyped can be downloaded as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hM4noAz3nvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/snp_genotypes/all/sites/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/snp_genotypes/all/sites/ \\\n", + " ~/vo_afun_release/v1.0/snp_genotypes/all/sites/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GRqTjrIhnvpb" + }, + "source": [ + "### Site filters\n", + "\n", + "SNP calling is not always reliable, and we have created some site filters to allow excluding low quality SNPs. 
To download site filters data in Zarr format:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tWu4ajAbnvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/site_filters/dt_20200416/funestus/\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/site_filters/dt_20200416/funestus/ \\\n", + " ~/vo_afun_release/v1.0/site_filters/dt_20200416/funestus/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vKfArxCFnvpb" + }, + "source": [ + "### SNP genotypes\n", + "\n", + "SNP genotypes are available for each sample set separately. E.g., to download SNP genotypes in Zarr format for sample set `1229-VO-GH-DADZIE-VMF00095`, excluding some data you probably won't need:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "umeGFe1jnvpb", + "tags": [ + "hide-output" + ] + }, + "outputs": [], + "source": [ + "!mkdir -pv ~/vo_afun_release/v1.0/snp_genotypes/all/1229-VO-GH-DADZIE-VMF00095/\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/calldata/(AD|GQ|MQ)/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/snp_genotypes/all/1229-VO-GH-DADZIE-VMF00095/ \\\n", + " ~/vo_afun_release/v1.0/snp_genotypes/all/1229-VO-GH-DADZIE-VMF00095/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o3ALEZyZnvpb" + }, + "source": [ + "## Copy number variation (CNV) data\n", + "\n", + "Data on copy number variation within the `Af1` cohort are available as two separate data types:\n", + "\n", + "* **HMM** -- Genome-wide inferences of copy number state within each individual mosquito in 300 bp non-overlapping windows.\n", + "* **Coverage calls** -- Genome-wide copy number variant calls, derived from the HMM outputs by analysing contiguous regions of elevated copy number state and then clustering variants across individuals based on breakpoint proximity.\n", + "\n", + "For more information on the methods used 
to generate these data, see the [variant-calling methods](methods) page.\n", + "\n", + "For each of these data types, data can be downloaded from Google Cloud Storage, and are available in either VCF or Zarr format." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z4vzTYvEnvpc" + }, + "source": [ + "### CNV HMM\n", + "\n", + "The HMM inferences of copy number state are available in VCF, Zarr and text formats, and are organised by sample set. \n", + "\n", + "For example, the VCF file for sample `VBS24195` in sample set `1229-VO-GH-DADZIE-VMF00095` can be downloaded from:\n", + "\n", + "* gs://vo_afun_release_master_us_central1/v1/cnv/1229-VO-GH-DADZIE-VMF00095/hmm/vcf/VBS24195_cnv_hmm.vcf.gz\n", + "\n", + "VCF files for all sample sets can be downloaded via `gsutil` as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bA-PIJaWnvpc" + }, + "outputs": [], + "source": [ + "# create a local directory to hold downloaded CNV data\n", + "!mkdir -pv ~/vo_afun_release/v1.0/cnv/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2EFQKAXHnvpc" + }, + "outputs": [], + "source": [ + "# download the HMM data in VCF format for all sample sets\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/coverage_calls/.*|.*/hmm/zarr/.*|.*/hmm/per_sample/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p4IWcfRJnvpc" + }, + "source": [ + "Zarr files for all sample sets can be downloaded as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jGfEE3y5nvpc" + }, + "outputs": [], + "source": [ + "# download HMM data in Zarr format for all sample sets\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/coverage_calls/.*|.*/hmm/vcf/.*|.*/hmm/per_sample/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "WPgfWx0Wnvpc" + }, + "source": [ + "### CNV coverage calls\n", + "\n", + "Coverage-based CNV calls are available in VCF and Zarr formats, and are organised by sample set. \n", + "Note that some samples were excluded from coverage calling because of high coverage variance.\n", + "\n", + "For example, the VCF file for sample set `1229-VO-GH-DADZIE-VMF00095` can be downloaded from:\n", + "\n", + "* gs://vo_afun_release_master_us_central1/v1.0/cnv/1229-VO-GH-DADZIE-VMF00095/coverage_calls/funestus/vcf/1229-VO-GH-DADZIE-VMF00095_funestus_cnv_coverage_calls.vcf.gz\n", + "\n", + "VCF files for all sample sets can be downloaded with:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uJEdxeTjnvpc" + }, + "outputs": [], + "source": [ + "# download coverage calls in VCF format for all sample sets\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/hmm/.*|.*/coverage_calls/.*/zarr/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F9rHpx_Invpc" + }, + "source": [ + "Zarr files for all sample sets can be downloaded with:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fzdzu6CFnvpc" + }, + "outputs": [], + "source": [ + "# download coverage calls in Zarr format for all sample sets\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/hmm/.*|.*/coverage_calls/.*/vcf/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/cnv/ ~/vo_afun_release/v1.0/cnv/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9hFhrALmnvpd" + }, + "source": [ + "## Haplotypes\n", + "\n", + "The `Af1` data resource also includes haplotype reference panels, which were obtained by [phasing](https://en.wikipedia.org/wiki/Haplotype_estimation) the SNP calls. \n", + "\n", + "Haplotype data can be downloaded in either VCF or Zarr format. 
See the subsections below for further details." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kpa2QiLsnvpd" + }, + "source": [ + "### Haplotype reference panels (VCF format)\n", + "\n", + "These are the VCFs created by the phasing pipeline, containing all samples included in each of the phasing runs. There is one VCF per phasing analysis per chromosome arm. The URL for each file has the following structure:\n", + "\n", + "* `gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/panel/funestus/af1.0_funestus_{contig}_phased.vcf.gz`\n", + "\n", + "...where `{contig}` is one of \"2RL\", \"3RL\", \"X\". \n", + "\n", + "E.g., the panel VCF for the chromosome arm 3RL can be downloaded from:\n", + "\n", + "* gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/panel/funestus/af1.0_funestus_3RL_phased.vcf.gz\n", + "\n", + "Note that these files can be large, up to ~5 GB.\n", + "\n", + "If you'd like to download all of the panel files, you could also use `gsutil`, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dN6QGcHtnvpd" + }, + "outputs": [], + "source": [ + "# create a local directory to store the data\n", + "!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/panel/funestus/\n", + "\n", + "# copy files from cloud to local file system\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/.*zarr.zip' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/panel/funestus/ \\\n", + " ~/vo_afun_release/v1.0/snp_haplotypes/panel/funestus/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jh60c4evnvpd" + }, + "source": [ + "### Sample set haplotypes (VCF format)\n", + "\n", + "These VCFs are subsets of the panel VCFs, containing only samples in a given sample set. There is one VCF per sample set, per phasing analysis, per chromosome arm. 
The URL for each file has the following structure:\n", + "\n", + "* `gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/{sample_set}/funestus/vcf/{sample_set}_funestus_{contig}_phased.vcf.gz`\n", + "\n", + "...where `{contig}` is one of \"2RL\", \"3RL\", \"X\"; and `{sample_set}` is one of the [Af sample sets](https://malariagen.github.io/vector-data/af1/af1.0.html#sample-sets).\n", + "\n", + "E.g., the VCF for sample set `1229-VO-GH-DADZIE-VMF00095`, for chromosome arm 2RL, can be downloaded here:\n", + "\n", + "* gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/vcf/1229-VO-GH-DADZIE-VMF00095_funestus_2RL_phased.vcf.gz \n", + "\n", + "If you'd like to download all of the VCF files for a given sample set, you could also use `gsutil`, e.g.:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "v4nXM1lpnvpd" + }, + "outputs": [], + "source": [ + "# create a local directory to store the data\n", + "!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/\n", + "\n", + "# copy files from cloud to local file system\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/zarr/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/ \\\n", + " ~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "31vzApoKnvpd" + }, + "source": [ + "### Sample set haplotypes (Zarr format)\n", + "\n", + "These contain the haplotype data in Zarr format, with one Zarr hierarchy per sample set. The root Zarr path for a given hierarchy has the following structure:\n", + "\n", + "* `gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/{sample_set}/funestus/zarr`\n", + "\n", + "Data can be downloaded with `gsutil`. E.g., the commands below download the Zarr data for sample set `1229-VO-GH-DADZIE-VMF00095`. 
Note that the sites are stored in a separate hierarchy:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "En9ebyYPnvpd" + }, + "outputs": [], + "source": [ + "# create local directories to store the data\n", + "!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/sites/funestus/\n", + "!mkdir -pv ~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/\n", + "\n", + "# copy haplotype data from cloud to local file system\n", + "!gsutil -m rsync -r \\\n", + " -x '.*/vcf/.*' \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/ \\\n", + " ~/vo_afun_release/v1.0/snp_haplotypes/1229-VO-GH-DADZIE-VMF00095/funestus/\n", + "\n", + "# copy phased sites data from cloud to local file system\n", + "!gsutil -m rsync -r \\\n", + " gs://vo_afun_release_master_us_central1/v1.0/snp_haplotypes/sites/funestus/ \\\n", + " ~/vo_afun_release/v1.0/snp_haplotypes/sites/funestus/" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8ABQPPgAnvph" + }, + "source": [ + "## Feedback and suggestions\n", + "\n", + "If there are particular analyses you would like to run, or if you have other suggestions for useful documentation we could add to this site, we would love to know; please get in touch via the [malariagen/vector-data GitHub discussion board](https://github.com/malariagen/vector-data/discussions)." 
+ ] + } + ], + "metadata": { + "celltoolbar": "Tags", + "colab": { + "collapsed_sections": [ + "8ABQPPgAnvph" + ], + "name": "Ag3.0-data-downloads.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "global-global-mgenv-6.0.6", + "language": "python", + "name": "conda-env-global-global-mgenv-6.0.6-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From afbdc905c0aab599eb1c70c87f9c88b87797cb21 Mon Sep 17 00:00:00 2001 From: tristanpwdennis Date: Sun, 5 Apr 2026 04:06:44 +0000 Subject: [PATCH 2/3] add doc nbs for as1 --- docs/as1/as1.ipynb | 401 ++- docs/as1/cloud.ipynb | 7509 ++++++++------------------------------- docs/as1/download.ipynb | 554 +-- 3 files changed, 1940 insertions(+), 6524 deletions(-) diff --git a/docs/as1/as1.ipynb b/docs/as1/as1.ipynb index ac9fe32..aa2b17a 100644 --- a/docs/as1/as1.ipynb +++ b/docs/as1/as1.ipynb @@ -40,35 +40,68 @@ "id": "iNSicUCtpk8j" }, "source": [ + "\n", "## Partner studies\n", "\n", - "- [1363-VO-ET-GADISA](https://www.malariagen.net/network/where-we-work/1363-VO-ET-GADISA) - _Anopheles stephensi_ vector surveillance in Ethiopia.\n", + "All of the samples were contributed and sequenced as part of the [Controlling Emergent Anopheles stephensi in Sudan and Ethiopia (CEASE) project](https://wellcome.org/research-funding/funding-portfolio/funded-grants/controlling-emergent-anopheles-stephensi-ethiopia).\n", "\n", - "- [1364-VO-SD-KAFY](https://www.malariagen.net/network/where-we-work/1364-VO-SD-KAFY) - _Anopheles stephensi_ vector surveillance in Sudan.\n", + "The samples were contributed by partner institutions from various countries. 
The name and primary institution of the lead principal investigator(s) contributing samples to each study, and the country of origin of the samples, are detailed below. \n", "\n", - "- [1365-VO-DJ-ADBI](https://www.malariagen.net/network/where-we-work/1365-VO-DJ-ADBI) - _Anopheles stephensi_ vector surveillance in Djibouti.\n", + "Enquiries about the samples and studies may be directed in the first instance to David Weetman (david.weetman@lstmed.ac.uk) or Martin Donnelly (martin.donnelly@lstmed.ac.uk).\n", "\n", - "- [1366-VO-YE-ALLAN](https://www.malariagen.net/network/where-we-work/1366-VO-YE-ALLAN) - _Anopheles stephensi_ vector surveillance in Yemen.\n", + "### 1363-VO-ET-GADISA-VMF00316 (Ethiopia)\n", "\n", - "- [1367-VO-AF-DONNELLY](https://www.malariagen.net/network/where-we-work/1367-VO-AF-DONNELLY) - _Anopheles stephensi_ vector surveillance in Afghanistan.\n", + "* Endalamaw Gadisa, Armauer Hansen Research Institute, Ethiopia.\n", "\n", - "- [1368-VO-PK-DONNELLY](https://www.malariagen.net/network/where-we-work/1368-VO-PK-DONNELLY) - _Anopheles stephensi_ vector surveillance in Pakistan.\n", + "### 1364-VO-SD-KAFY-VMF00317 (Sudan)\n", "\n", - "- [1369-VO-SA-AL-NAZAWI](https://www.malariagen.net/network/where-we-work/1369-VO-SA-AL-NAZAWI) - _Anopheles stephensi_ vector surveillance in Saudi Arabia.\n", + "* Hmooda Toto Kafy, University of Khartoum, Sudan.\n", + "* Elfatih Malik, University of Khartoum, Sudan.\n", "\n", - "- [1370-VO-IR-ENAYATI](https://www.malariagen.net/network/where-we-work/1370-VO-IR-ENAYATI) - _Anopheles stephensi_ vector surveillance in Iran.\n", + "### 1365-VO-DJ-ADBI-VMF00318 (Djibouti)\n", "\n", - "- [1385-VO-DJ-WEETMAN](https://www.malariagen.net/network/where-we-work/1385-VO-DJ-WEETMAN) - _Anopheles stephensi_ colony samples derived from wild-caught mosquitoes in Djibouti.\n", + "* Bouh Abdi Khaireh, Association Mutualis, Djibouti.\n", "\n", - "- [1386-VO-KE-OCHOMO](https://www.malariagen.net/network/where-we-work/1386-VO-KE-OCHOMO) -
_Anopheles stephensi_ vector surveillance in Kenya.\n", + "### 1366-VO-YE-ALLAN-VMF00319 (Yemen)\n", "\n", - "- [1458-VO-ET-YEWHALAW](https://www.malariagen.net/network/where-we-work/1458-VO-ET-YEWHALAW) - _Anopheles stephensi_ vector surveillance in Ethiopia.\n", + "* Richard Allan, MENTOR Initiative, United Kingdom.\n", "\n", - "- [1459-VO-SD-AHMED](https://www.malariagen.net/network/where-we-work/1459-VO-SD-AHMED) - _Anopheles stephensi_ vector surveillance in Sudan.\n", + "### 1367-VO-AF-DONNELLY-VMF00320 (Afghanistan)\n", "\n", - "- [thakare-2022](https://www.malariagen.net/network/where-we-work/thakare-2022) - Previously published Indian _Anopheles stephensi_ mosquitoes from [Thakare _et al_, 2022](https://www.nature.com/articles/s41598-022-07462-3).\n", + "* Martin Donnelly, Liverpool School of Tropical Medicine, United Kingdom.\n", "\n", - "\n" + "### 1368-VO-PK-DONNELLY-VMF00321 (Pakistan)\n", + "\n", + "* Martin Donnelly, Liverpool School of Tropical Medicine, United Kingdom.\n", + "\n", + "### 1369-VO-SA-AL-NAZAWI-VMF00322 (Saudi Arabia)\n", + "\n", + "* Ashwaq Al-Nazawi, Jazan University, Saudi Arabia.\n", + "\n", + "### 1370-VO-IR-ENAYATI-VMF00323 (Iran)\n", + "\n", + "* Ahmadali Enayati, Mazandaran University of Medical Sciences, Iran.\n", + "\n", + "### 1385-VO-DJ-WEETMAN-VMF00338 (United Kingdom)\n", + "\n", + "* David Weetman, Liverpool School of Tropical Medicine, United Kingdom.\n", + "* N.B. 
These are colony mosquitoes derived from wild-collected samples in Djibouti.\n", + "\n", + "### 1386-VO-KE-OCHOMO-VMF00339 (Kenya)\n", + "\n", + "* Eric Ochomo, Kenya Medical Research Institute (KEMRI), Kenya.\n", + "\n", + "### 1458-VO-ET-YEWHALAW-VMF00340 (Ethiopia)\n", + "\n", + "* Delenasaw Yewhalaw, Jimma University, Ethiopia.\n", + "\n", + "### 1459-VO-SD-AHMED-VMF00342 (Sudan)\n", + "\n", + "* Ayman Ahmed, University of Khartoum, Sudan.\n", + "\n", + "### thakare-2022\n", + "\n", + "* Previously published data from [Thakare _et al_, 2022](https://www.nature.com/articles/s41598-022-07462-3).\n" ] }, { @@ -110,26 +143,48 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, + "execution": { + "iopub.execute_input": "2026-04-05T04:05:09.844381Z", + "iopub.status.busy": "2026-04-05T04:05:09.844101Z", + "iopub.status.idle": "2026-04-05T04:05:11.705969Z", + "shell.execute_reply": "2026-04-05T04:05:11.704899Z", + "shell.execute_reply.started": "2026-04-05T04:05:09.844351Z" + }, "id": "hGA4d7Yrpk8m", "outputId": "c29827c1-0361-4926-c227-8f6e76c2a497", "tags": [ "remove-input" ] }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ "%pip install -qq malariagen_data" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "metadata": { + "execution": { + "iopub.execute_input": "2026-04-05T04:05:11.706973Z", + "iopub.status.busy": "2026-04-05T04:05:11.706697Z", + "iopub.status.idle": "2026-04-05T04:05:17.371545Z", + "shell.execute_reply": "2026-04-05T04:05:17.370432Z", + "shell.execute_reply.started": "2026-04-05T04:05:11.706939Z" + }, "id": "AnmzLmEgpk8n", "tags": [ "remove-input" @@ -410,7 +465,7 @@ " document.body.appendChild(element);\n", " }\n", "\n", - " const js_urls = 
[\"https://cdn.bokeh.org/bokeh/release/bokeh-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.5.2.min.js\"];\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n", " const css_urls = [];\n", "\n", " const inline_js = [ function(Bokeh) {\n", @@ -450,7 +505,7 @@ " }\n", "}(window));" ], - "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"
    \\n\"+\n \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n \"
  • use INLINE resources instead, as so:
  • \\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.5.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.5.2.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"
    \\n\"+\n \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n \"
  • use INLINE resources instead, as so:
  • \\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" }, "metadata": {}, "output_type": "display_data" @@ -458,17 +513,24 @@ ], "source": [ "import malariagen_data\n", - "af1 = malariagen_data.As1()" + "as1 = malariagen_data.As1()" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 927 }, + "execution": { + "iopub.execute_input": "2026-04-05T04:05:29.540570Z", + "iopub.status.busy": "2026-04-05T04:05:29.540132Z", + "iopub.status.idle": "2026-04-05T04:05:29.640314Z", + "shell.execute_reply": "2026-04-05T04:05:29.639173Z", + "shell.execute_reply.started": "2026-04-05T04:05:29.540540Z" + }, "id": "qsElasBepk8n", "outputId": "4bf80a06-c2e8-4d2d-b4a6-99c8c66da7db", "tags": [ @@ -508,39 +570,99 @@ " \n", " \n", " \n", - " 1188-VO-SN-NIANG\n", - " 1188-VO-NIANG-NIEL-SN-2304-VMF00259\n", - " 71\n", + " 1363-VO-ET-GADISA\n", + " 1363-VO-ET-GADISA-VMF00316\n", + " 111\n", + " \n", + " \n", + " 1364-VO-SD-KAFY\n", + " 1364-VO-SD-KAFY-VMF00317\n", + " 226\n", + " \n", + " \n", + " 1365-VO-DJ-ADBI\n", + " 1365-VO-DJ-ADBI-VMF00318\n", + " 21\n", + " \n", + " \n", + " 1366-VO-YE-ALLAN\n", + " 1366-VO-YE-ALLAN-VMF00319\n", + " 22\n", + " \n", + " \n", + " 1367-VO-AF-DONNELLY\n", + " 1367-VO-AF-DONNELLY-VMF00320\n", + " 24\n", + " \n", + " \n", + " 1368-VO-PK-DONNELLY\n", + " 1368-VO-PK-DONNELLY-VMF00321\n", + " 15\n", + " \n", + " \n", 
+ " 1369-VO-SA-AL-NAZAWI\n", + " 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " 42\n", + " \n", + " \n", + " 1370-VO-IR-ENAYATI\n", + " 1370-VO-IR-ENAYATI-VMF00323\n", + " 72\n", + " \n", + " \n", + " 1385-VO-DJ-WEETMAN\n", + " 1385-VO-DJ-WEETMAN-VMF00338\n", + " 14\n", + " \n", + " \n", + " 1386-VO-KE-OCHOMO\n", + " 1386-VO-KE-OCHOMO-VMF00339\n", + " 29\n", + " \n", + " \n", + " 1458-VO-ET-YEWHALAW\n", + " 1458-VO-ET-YEWHALAW-VMF00340\n", + " 23\n", " \n", " \n", - " 1330-VO-GN-LAMA\n", - " 1330-VO-GN-LAMA-VMF00250\n", - " 196\n", + " 1459-VO-SD-AHMED\n", + " 1459-VO-SD-AHMED-VMF00342\n", + " 25\n", " \n", " \n", - " 1354-VO-KE-DONNELLY\n", - " 1354-VO-KE-DONNELLY-VMF00281\n", - " 466\n", + " thakare-2022\n", + " thakare-2022\n", + " 15\n", " \n", " \n", "\n", "" ], "text/plain": [ - " sample_set sample_count\n", - "study_id \n", - "1188-VO-SN-NIANG 1188-VO-NIANG-NIEL-SN-2304-VMF00259 71\n", - "1330-VO-GN-LAMA 1330-VO-GN-LAMA-VMF00250 196\n", - "1354-VO-KE-DONNELLY 1354-VO-KE-DONNELLY-VMF00281 466" + " sample_set sample_count\n", + "study_id \n", + "1363-VO-ET-GADISA 1363-VO-ET-GADISA-VMF00316 111\n", + "1364-VO-SD-KAFY 1364-VO-SD-KAFY-VMF00317 226\n", + "1365-VO-DJ-ADBI 1365-VO-DJ-ADBI-VMF00318 21\n", + "1366-VO-YE-ALLAN 1366-VO-YE-ALLAN-VMF00319 22\n", + "1367-VO-AF-DONNELLY 1367-VO-AF-DONNELLY-VMF00320 24\n", + "1368-VO-PK-DONNELLY 1368-VO-PK-DONNELLY-VMF00321 15\n", + "1369-VO-SA-AL-NAZAWI 1369-VO-SA-AL-NAZAWI-VMF00322 42\n", + "1370-VO-IR-ENAYATI 1370-VO-IR-ENAYATI-VMF00323 72\n", + "1385-VO-DJ-WEETMAN 1385-VO-DJ-WEETMAN-VMF00338 14\n", + "1386-VO-KE-OCHOMO 1386-VO-KE-OCHOMO-VMF00339 29\n", + "1458-VO-ET-YEWHALAW 1458-VO-ET-YEWHALAW-VMF00340 23\n", + "1459-VO-SD-AHMED 1459-VO-SD-AHMED-VMF00342 25\n", + "thakare-2022 thakare-2022 15" ] }, - "execution_count": 2, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df_sample_sets = as1.sample_sets(release=\"1\")\n", + "df_sample_sets = as1.sample_sets(release=\"1.0\")\n", 
"df_sample_sets[['study_id','sample_set', 'sample_count']].set_index('study_id')" ] }, @@ -550,17 +672,24 @@ "id": "yJ16OQ0Hpk8o" }, "source": [ - "Here is a more detailed breakdown of the samples contained within this sample set, summarised by country, year of collection, and species:" + "Here is a more detailed breakdown of the samples contained within this release, summarised by country, year of collection, and species. The warnings in the output are raised because the surveillance flags have not yet been set for these sample sets; this will be implemented in a future version." ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, + "execution": { + "iopub.execute_input": "2026-04-05T04:05:34.859189Z", + "iopub.status.busy": "2026-04-05T04:05:34.858770Z", + "iopub.status.idle": "2026-04-05T04:05:35.892325Z", + "shell.execute_reply": "2026-04-05T04:05:35.890422Z", + "shell.execute_reply.started": "2026-04-05T04:05:34.859156Z" + }, "id": "a1OMvuTxUWpJ", "outputId": "9f872334-fd50-4649-990a-df60ea71c12c", "tags": [ @@ -572,7 +701,46 @@ "name": "stdout", "output_type": "stream", "text": [ - " \r" + "Load sample metadata: ⠏ (0:00:00.76) " ] }, { "name": "stderr", "output_type": "stream", "text": [ + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1363-VO-ET-GADISA-VMF00316\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1364-VO-SD-KAFY-VMF00317\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1365-VO-DJ-ADBI-VMF00318\n", + " warnings.warn(\n", + 
"/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1366-VO-YE-ALLAN-VMF00319\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1367-VO-AF-DONNELLY-VMF00320\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1368-VO-PK-DONNELLY-VMF00321\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1370-VO-IR-ENAYATI-VMF00323\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1385-VO-DJ-WEETMAN-VMF00338\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1386-VO-KE-OCHOMO-VMF00339\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1458-VO-ET-YEWHALAW-VMF00340\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1459-VO-SD-AHMED-VMF00342\n", + " warnings.warn(\n", + 
"/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set thakare-2022\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " " ] }, { @@ -599,7 +767,7 @@ " \n", " \n", " taxon\n", - " funestus\n", + " stephensi\n", " \n", " \n", " study_id\n", @@ -611,55 +779,150 @@ " \n", " \n", " \n", - " 1188-VO-SN-NIANG\n", - " 1188-VO-NIANG-NIEL-SN-2304-VMF00259\n", - " Senegal\n", - " 2020\n", - " 11\n", + " 1363-VO-ET-GADISA\n", + " 1363-VO-ET-GADISA-VMF00316\n", + " Ethiopia\n", + " 2022\n", + " 10\n", + " \n", + " \n", + " 2023\n", + " 74\n", + " \n", + " \n", + " 2024\n", + " 27\n", + " \n", + " \n", + " 1364-VO-SD-KAFY\n", + " 1364-VO-SD-KAFY-VMF00317\n", + " Sudan\n", + " 2022\n", + " 189\n", + " \n", + " \n", + " 2023\n", + " 37\n", + " \n", + " \n", + " 1365-VO-DJ-ADBI\n", + " 1365-VO-DJ-ADBI-VMF00318\n", + " Djibouti\n", + " 2023\n", + " 21\n", " \n", " \n", + " 1366-VO-YE-ALLAN\n", + " 1366-VO-YE-ALLAN-VMF00319\n", + " Yemen\n", " 2021\n", + " 6\n", + " \n", + " \n", + " 2023\n", " 16\n", " \n", " \n", - " 2022\n", - " 44\n", + " 1367-VO-AF-DONNELLY\n", + " 1367-VO-AF-DONNELLY-VMF00320\n", + " Afghanistan\n", + " 2017\n", + " 24\n", + " \n", + " \n", + " 1368-VO-PK-DONNELLY\n", + " 1368-VO-PK-DONNELLY-VMF00321\n", + " Pakistan\n", + " 2005\n", + " 15\n", + " \n", + " \n", + " 1369-VO-SA-AL-NAZAWI\n", + " 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " Saudi Arabia\n", + " 2023\n", + " 42\n", " \n", " \n", - " 1330-VO-GN-LAMA\n", - " 1330-VO-GN-LAMA-VMF00250\n", - " Guinea\n", + " 1370-VO-IR-ENAYATI\n", + " 1370-VO-IR-ENAYATI-VMF00323\n", + " Iran\n", + " 2023\n", + " 72\n", + " \n", + " \n", + " 1385-VO-DJ-WEETMAN\n", + " 1385-VO-DJ-WEETMAN-VMF00338\n", + " Colony\n", + " 2025\n", + " 14\n", + " \n", + " \n", + " 1386-VO-KE-OCHOMO\n", + " 1386-VO-KE-OCHOMO-VMF00339\n", + " Kenya\n", " 2022\n", - " 196\n", + " 1\n", + " \n", + " 
\n", + " 2024\n", + " 28\n", " \n", " \n", - " 1354-VO-KE-DONNELLY\n", - " 1354-VO-KE-DONNELLY-VMF00281\n", - " Kenya\n", + " 1458-VO-ET-YEWHALAW\n", + " 1458-VO-ET-YEWHALAW-VMF00340\n", + " Ethiopia\n", " 2023\n", - " 466\n", + " 23\n", + " \n", + " \n", + " 1459-VO-SD-AHMED\n", + " 1459-VO-SD-AHMED-VMF00342\n", + " Sudan\n", + " 2018\n", + " 25\n", + " \n", + " \n", + " thakare-2022\n", + " thakare-2022\n", + " India\n", + " 2021\n", + " 15\n", " \n", " \n", "\n", "" ], "text/plain": [ - "taxon funestus\n", - "study_id sample_set country year \n", - "1188-VO-SN-NIANG 1188-VO-NIANG-NIEL-SN-2304-VMF00259 Senegal 2020 11\n", - " 2021 16\n", - " 2022 44\n", - "1330-VO-GN-LAMA 1330-VO-GN-LAMA-VMF00250 Guinea 2022 196\n", - "1354-VO-KE-DONNELLY 1354-VO-KE-DONNELLY-VMF00281 Kenya 2023 466" + "taxon stephensi\n", + "study_id sample_set country year \n", + "1363-VO-ET-GADISA 1363-VO-ET-GADISA-VMF00316 Ethiopia 2022 10\n", + " 2023 74\n", + " 2024 27\n", + "1364-VO-SD-KAFY 1364-VO-SD-KAFY-VMF00317 Sudan 2022 189\n", + " 2023 37\n", + "1365-VO-DJ-ADBI 1365-VO-DJ-ADBI-VMF00318 Djibouti 2023 21\n", + "1366-VO-YE-ALLAN 1366-VO-YE-ALLAN-VMF00319 Yemen 2021 6\n", + " 2023 16\n", + "1367-VO-AF-DONNELLY 1367-VO-AF-DONNELLY-VMF00320 Afghanistan 2017 24\n", + "1368-VO-PK-DONNELLY 1368-VO-PK-DONNELLY-VMF00321 Pakistan 2005 15\n", + "1369-VO-SA-AL-NAZAWI 1369-VO-SA-AL-NAZAWI-VMF00322 Saudi Arabia 2023 42\n", + "1370-VO-IR-ENAYATI 1370-VO-IR-ENAYATI-VMF00323 Iran 2023 72\n", + "1385-VO-DJ-WEETMAN 1385-VO-DJ-WEETMAN-VMF00338 Colony 2025 14\n", + "1386-VO-KE-OCHOMO 1386-VO-KE-OCHOMO-VMF00339 Kenya 2022 1\n", + " 2024 28\n", + "1458-VO-ET-YEWHALAW 1458-VO-ET-YEWHALAW-VMF00340 Ethiopia 2023 23\n", + "1459-VO-SD-AHMED 1459-VO-SD-AHMED-VMF00342 Sudan 2018 25\n", + "thakare-2022 thakare-2022 India 2021 15" ] }, - "execution_count": 3, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df_samples = as1.sample_metadata(sample_sets=\"1.4\")\n", + 
"df_samples = as1.sample_metadata(sample_sets=\"1.0\")\n", "df_summary = df_samples.pivot_table(\n", " index=[\"study_id\",\"sample_set\", \"country\", \"year\"], \n", " columns=[\"taxon\"],\n", @@ -699,15 +962,15 @@ "provenance": [] }, "environment": { - "kernel": "mgenv-e82ac9c", + "kernel": "malariagen-dev-as1", "name": "workbench-notebooks.m138", "type": "gcloud", "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m138" }, "kernelspec": { - "display_name": "Python (mgenv-e82ac9c) (Local)", + "display_name": "malariagen-dev-as1 (Local)", "language": "python", - "name": "mgenv-e82ac9c" + "name": "malariagen-dev-as1" }, "language_info": { "codemirror_mode": { @@ -719,7 +982,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.11" + "version": "3.12.13" } }, "nbformat": 4, diff --git a/docs/as1/cloud.ipynb b/docs/as1/cloud.ipynb index 2ee86d3..396319e 100644 --- a/docs/as1/cloud.ipynb +++ b/docs/as1/cloud.ipynb @@ -6,9 +6,9 @@ "id": "DZw8vyUJ0y5k" }, "source": [ - "# Af1 cloud data access\n", + "# As1 cloud data access\n", "\n", - "This notebook provides information about how to download data from the [MalariaGEN Vector Observatory Anopheles funestus Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-funestus-genomic-surveillance-project) via Google Cloud. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. \n", + "This notebook provides information about how to download data from the [MalariaGEN Vector Observatory Anopheles stephensi Genomic Surveillance Project](https://www.malariagen.net/project/anopheles-stephensi-genomic-surveillance-project) via Google Cloud. This includes sample metadata, raw sequence reads, sequence read alignments, and single nucleotide polymorphism (SNP) calls. 
\n", "\n", "This notebook illustrates how to read data directly from the cloud, without having to first download any data locally. This notebook can be run from any computer, but will work best when run from a compute node within Google Cloud, because it will be physically closer to the data and so data transfer is faster. For example, this notebook can be run via [Google Colab](https://colab.research.google.com/) which are free interactive computing service running in the cloud.\n", "\n", @@ -16,7 +16,7 @@ "\n", "## Data hosting\n", "\n", - "All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the `vo_afun_release_master_us_central1` bucket, which is a single-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access. " + "All data required for this notebook is hosted on Google Cloud Storage (GCS). Data are hosted in the `vo_aste_release_master_us_central1` bucket, which is a single-region bucket located in the United States. All data hosted in GCS are publicly accessible and do not require any authentication to access. " ] }, { @@ -37,6 +37,13 @@ "colab": { "base_uri": "https://localhost:8080/" }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:21.460464Z", + "iopub.status.busy": "2026-04-05T04:01:21.460209Z", + "iopub.status.idle": "2026-04-05T04:01:24.015357Z", + "shell.execute_reply": "2026-04-05T04:01:24.014335Z", + "shell.execute_reply.started": "2026-04-05T04:01:21.460437Z" + }, "id": "wqHBq442QH_1", "outputId": "1c1306a2-d6f1-46a2-ee4d-30b13dad9148", "tags": [ @@ -60,7 +67,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To make accessing these data more convenient, we've created the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package. This is experimental so please let us know if you find any bugs or have any suggestions. 
See the [Af1 API docs](https://malariagen.github.io/malariagen-data-python/latest/Af1.html) for documentation of all functions available from this package. \n", + "To make accessing these data more convenient, we've created the [malariagen_data](https://github.com/malariagen/malariagen-data-python) Python package. This is experimental so please let us know if you find any bugs or have any suggestions. See the [As1 API docs](https://malariagen.github.io/malariagen-data-python/latest/As1.html) for documentation of all functions available from this package. \n", "\n", "Import other packages we'll need to use here." ] @@ -69,6 +76,13 @@ "cell_type": "code", "execution_count": 2, "metadata": { + "execution": { + "iopub.execute_input": "2026-04-05T04:01:24.022130Z", + "iopub.status.busy": "2026-04-05T04:01:24.021867Z", + "iopub.status.idle": "2026-04-05T04:01:29.659489Z", + "shell.execute_reply": "2026-04-05T04:01:29.658324Z", + "shell.execute_reply.started": "2026-04-05T04:01:24.022095Z" + }, "id": "970klnG1eu8N", "tags": [] }, @@ -90,7 +104,7 @@ "id": "jPqZ-LFPQH_2" }, "source": [ - "`Af1` data access from Google Cloud is set up with the following code:" + "`As1` data access from Google Cloud is set up with the following code:" ] }, { @@ -101,6 +115,13 @@ "base_uri": "https://localhost:8080/", "height": 190 }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:29.663766Z", + "iopub.status.busy": "2026-04-05T04:01:29.663173Z", + "iopub.status.idle": "2026-04-05T04:01:30.198388Z", + "shell.execute_reply": "2026-04-05T04:01:30.197301Z", + "shell.execute_reply.started": "2026-04-05T04:01:29.663731Z" + }, "id": "mIsSaTuOQH_2", "outputId": "4facd5a9-6e43-460a-811c-30293568918e", "tags": [] @@ -380,7 +401,7 @@ " document.body.appendChild(element);\n", " }\n", "\n", - " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.4.1.min.js\", 
\"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.4.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.4.1.min.js\"];\n", + " const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n", " const css_urls = [];\n", "\n", " const inline_js = [ function(Bokeh) {\n", @@ -420,7 +441,7 @@ " }\n", "}(window));" ], - "application/vnd.bokehjs_load.v0+json": "" + "application/vnd.bokehjs_load.v0+json": "'use strict';\n(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"
    \\n\"+\n \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n \"
  • use INLINE resources instead, as so:
  • \\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded(error = null) {\n const el = document.getElementById(null);\n if (el != null) {\n const html = (() => {\n if (typeof root.Bokeh === \"undefined\") {\n if (error == null) {\n return \"BokehJS is loading ...\";\n } else {\n return \"BokehJS failed to load.\";\n }\n } else {\n const prefix = `BokehJS ${root.Bokeh.version}`;\n if (error == null) {\n return `${prefix} successfully loaded.`;\n } else {\n return `${prefix} encountered errors while loading and may not function as expected.`;\n }\n }\n })();\n el.innerHTML = html;\n\n if (error != null) {\n const wrapper = document.createElement(\"div\");\n wrapper.style.overflow = \"auto\";\n wrapper.style.height = \"5em\";\n wrapper.style.resize = \"vertical\";\n const content = document.createElement(\"div\");\n content.style.fontFamily = \"monospace\";\n content.style.whiteSpace = \"pre-wrap\";\n content.style.backgroundColor = \"rgb(255, 221, 221)\";\n content.textContent = error.stack ?? error.toString();\n wrapper.append(content);\n el.append(wrapper);\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(() => display_loaded(error), 100);\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n 
root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.8.2.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.8.2.min.js\"];\n const css_urls = [];\n\n const inline_js = [ function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\nfunction(Bokeh) {\n }\n ];\n\n function run_inline_js() {\n if (root.Bokeh !== undefined || force === true) {\n try {\n for (let i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n\n } catch (error) {throw error;\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if 
(!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n const cell = $(document.getElementById(null)).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" }, "metadata": {}, "output_type": "display_data" @@ -429,16 +450,16 @@ "data": { "text/html": [ "\n", - " \n", + "
\n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", " \n", @@ -446,7 +467,7 @@ " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", - " \n", + " \n", " \n", " \n", " \n", " \n", " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", " \n", "
MalariaGEN Af1 API clientMalariaGEN As1 API client
\n", " Please note that data are subject to terms of use,\n", " for more information see \n", " the MalariaGEN website or contact support@malariagen.net.\n", - " See also the Af1 API docs.\n", + " See also the As1 API docs.\n", "
\n", " Storage URL\n", " gs://vo_afun_release_master_us_central1gs://vo_aste_release_master_us_central1
\n", @@ -464,19 +485,19 @@ " \n", " Cohorts analysis\n", " 2023121520260402
\n", " Site filters analysis\n", " dt_20200416sc_20260401
\n", " Software version\n", " malariagen_data 10.0.0malariagen_data 0.0.0
\n", @@ -484,24 +505,45 @@ " Iowa, United States (Google Cloud us-central1)
\n", + " Data filtered for unrestricted use only\n", + " False
\n", + " Data filtered for surveillance use only\n", + " False
\n", + " Relevant data releases\n", + " 1.0
\n", " " ], "text/plain": [ - "\n", - "Storage URL : gs://vo_afun_release_master_us_central1\n", - "Data releases available : 1.0\n", - "Results cache : None\n", - "Cohorts analysis : 20231215\n", - "Site filters analysis : dt_20200416\n", - "Software version : malariagen_data 10.0.0\n", - "Client location : Iowa, United States (Google Cloud us-central1)\n", + "\n", + "Storage URL : gs://vo_aste_release_master_us_central1\n", + "Data releases available : 1.0\n", + "Results cache : None\n", + "Cohorts analysis : 20260402\n", + "Site filters analysis : sc_20260401\n", + "Software version : malariagen_data 0.0.0\n", + "Client location : Iowa, United States (Google Cloud us-central1)\n", + "Data filtered to unrestricted use only: False\n", + "Data filtered to surveillance use only: False\n", + "Relevant data releases : 1.0\n", "---\n", "Please note that data are subject to terms of use,\n", "for more information see https://www.malariagen.net/data\n", "or contact support@malariagen.net. For API documentation see \n", - "https://malariagen.github.io/malariagen-data-python/v10.0.0/Af1.html" + "https://malariagen.github.io/malariagen-data-python/v0.0.0/As1.html" ] }, "execution_count": 3, @@ -510,17 +552,8 @@ } ], "source": [ - "af1 = malariagen_data.Af1()\n", - "af1" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note:** To access the `Af1.1`, `Af1.2` & `Af1.3` releases, you need to use the `pre=True` flag in code above. \n", - "\n", - "This flag is used when more data will be added to this release. In the case of `Af1.1`, `Af1.2` & `Af1.3`; CNV data for the sample sets on these releases will be included at a future date." + "as1 = malariagen_data.As1()\n", + "as1" ] }, { @@ -531,19 +564,26 @@ "source": [ "## Sample sets\n", "\n", - "Data are organised into different releases. As an example, data in Af1.0 are organised into 8 sample sets. 
Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.\n", + "Data are organised into different releases. As an example, data in As1 are organised into 13 sample sets. Each of these sample sets corresponds to a set of mosquito specimens contributed by a collaborating study. Depending on your objectives, you may want to access data from only specific sample sets, or all sample sets.\n", "\n", "To see which sample sets are available, load the sample set manifest into a pandas dataframe:" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 927 }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:30.202912Z", + "iopub.status.busy": "2026-04-05T04:01:30.202397Z", + "iopub.status.idle": "2026-04-05T04:01:30.309584Z", + "shell.execute_reply": "2026-04-05T04:01:30.307209Z", + "shell.execute_reply.started": "2026-04-05T04:01:30.202885Z" + }, "id": "b4ADQTOfQH_2", "outputId": "f7c6d68b-053f-4698-8b6f-29720287c423" }, @@ -573,107 +613,214 @@ " sample_count\n", " study_id\n", " study_url\n", + " terms_of_use_expiry_date\n", + " terms_of_use_url\n", " release\n", + " unrestricted_use\n", " \n", " \n", " \n", " \n", " 0\n", - " 1229-VO-GH-DADZIE-VMF00095\n", - " 36\n", - " 1229-VO-GH-DADZIE\n", + " 1363-VO-ET-GADISA-VMF00316\n", + " 111\n", + " 1363-VO-ET-GADISA\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 1\n", - " 1230-VO-GA-CF-AYALA-VMF00045\n", - " 50\n", - " 1230-VO-MULTI-AYALA\n", + " 1364-VO-SD-KAFY-VMF00317\n", + " 226\n", + " 1364-VO-SD-KAFY\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 2\n", - " 1231-VO-MULTI-WONDJI-VMF00043\n", - " 
320\n", - " 1231-VO-MULTI-WONDJI\n", + " 1365-VO-DJ-ADBI-VMF00318\n", + " 21\n", + " 1365-VO-DJ-ADBI\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 3\n", - " 1232-VO-KE-OCHOMO-VMF00044\n", - " 81\n", - " 1232-VO-KE-OCHOMO\n", + " 1366-VO-YE-ALLAN-VMF00319\n", + " 22\n", + " 1366-VO-YE-ALLAN\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 4\n", - " 1235-VO-MZ-PAAIJMANS-VMF00094\n", - " 76\n", - " 1235-VO-MZ-PAAIJMANS\n", + " 1367-VO-AF-DONNELLY-VMF00320\n", + " 24\n", + " 1367-VO-AF-DONNELLY\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 5\n", - " 1236-VO-TZ-OKUMU-VMF00090\n", - " 10\n", - " 1236-VO-TZ-OKUMU\n", + " 1368-VO-PK-DONNELLY-VMF00321\n", + " 15\n", + " 1368-VO-PK-DONNELLY\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 6\n", - " 1240-VO-CD-KOEKEMOER-VMF00099\n", - " 43\n", - " 1240-VO-MULTI-KOEKEMOER\n", + " 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " 42\n", + " 1369-VO-SA-AL-NAZAWI\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", " 7\n", - " 1240-VO-MZ-KOEKEMOER-VMF00101\n", - " 40\n", - " 1240-VO-MULTI-KOEKEMOER\n", + " 1370-VO-IR-ENAYATI-VMF00323\n", + " 72\n", + " 1370-VO-IR-ENAYATI\n", + " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", + " 1.0\n", + " False\n", + " \n", + " \n", + " 8\n", + " 1385-VO-DJ-WEETMAN-VMF00338\n", + " 14\n", + " 1385-VO-DJ-WEETMAN\n", + " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", + " 1.0\n", + " False\n", + " \n", + " \n", + " 9\n", + " 1386-VO-KE-OCHOMO-VMF00339\n", + " 29\n", + " 1386-VO-KE-OCHOMO\n", + " 
https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", + " 1.0\n", + " False\n", + " \n", + " \n", + " 10\n", + " 1458-VO-ET-YEWHALAW-VMF00340\n", + " 23\n", + " 1458-VO-ET-YEWHALAW\n", + " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", + " 1.0\n", + " False\n", + " \n", + " \n", + " 11\n", + " 1459-VO-SD-AHMED-VMF00342\n", + " 25\n", + " 1459-VO-SD-AHMED\n", + " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", + " 1.0\n", + " False\n", + " \n", + " \n", + " 12\n", + " thakare-2022\n", + " 15\n", + " thakare-2022\n", " https://www.malariagen.net/network/where-we-wo...\n", + " 2099-12-31\n", + " NaN\n", " 1.0\n", + " False\n", " \n", " \n", "\n", "" ], "text/plain": [ - " sample_set sample_count study_id \\\n", - "0 1229-VO-GH-DADZIE-VMF00095 36 1229-VO-GH-DADZIE \n", - "1 1230-VO-GA-CF-AYALA-VMF00045 50 1230-VO-MULTI-AYALA \n", - "2 1231-VO-MULTI-WONDJI-VMF00043 320 1231-VO-MULTI-WONDJI \n", - "3 1232-VO-KE-OCHOMO-VMF00044 81 1232-VO-KE-OCHOMO \n", - "4 1235-VO-MZ-PAAIJMANS-VMF00094 76 1235-VO-MZ-PAAIJMANS \n", - "5 1236-VO-TZ-OKUMU-VMF00090 10 1236-VO-TZ-OKUMU \n", - "6 1240-VO-CD-KOEKEMOER-VMF00099 43 1240-VO-MULTI-KOEKEMOER \n", - "7 1240-VO-MZ-KOEKEMOER-VMF00101 40 1240-VO-MULTI-KOEKEMOER \n", - "\n", - " study_url release \n", - "0 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "1 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "2 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "3 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "4 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "5 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "6 https://www.malariagen.net/network/where-we-wo... 1.0 \n", - "7 https://www.malariagen.net/network/where-we-wo... 
1.0 " + " sample_set sample_count study_id \\\n", + "0 1363-VO-ET-GADISA-VMF00316 111 1363-VO-ET-GADISA \n", + "1 1364-VO-SD-KAFY-VMF00317 226 1364-VO-SD-KAFY \n", + "2 1365-VO-DJ-ADBI-VMF00318 21 1365-VO-DJ-ADBI \n", + "3 1366-VO-YE-ALLAN-VMF00319 22 1366-VO-YE-ALLAN \n", + "4 1367-VO-AF-DONNELLY-VMF00320 24 1367-VO-AF-DONNELLY \n", + "5 1368-VO-PK-DONNELLY-VMF00321 15 1368-VO-PK-DONNELLY \n", + "6 1369-VO-SA-AL-NAZAWI-VMF00322 42 1369-VO-SA-AL-NAZAWI \n", + "7 1370-VO-IR-ENAYATI-VMF00323 72 1370-VO-IR-ENAYATI \n", + "8 1385-VO-DJ-WEETMAN-VMF00338 14 1385-VO-DJ-WEETMAN \n", + "9 1386-VO-KE-OCHOMO-VMF00339 29 1386-VO-KE-OCHOMO \n", + "10 1458-VO-ET-YEWHALAW-VMF00340 23 1458-VO-ET-YEWHALAW \n", + "11 1459-VO-SD-AHMED-VMF00342 25 1459-VO-SD-AHMED \n", + "12 thakare-2022 15 thakare-2022 \n", + "\n", + " study_url \\\n", + "0 https://www.malariagen.net/network/where-we-wo... \n", + "1 https://www.malariagen.net/network/where-we-wo... \n", + "2 https://www.malariagen.net/network/where-we-wo... \n", + "3 https://www.malariagen.net/network/where-we-wo... \n", + "4 https://www.malariagen.net/network/where-we-wo... \n", + "5 https://www.malariagen.net/network/where-we-wo... \n", + "6 https://www.malariagen.net/network/where-we-wo... \n", + "7 https://www.malariagen.net/network/where-we-wo... \n", + "8 https://www.malariagen.net/network/where-we-wo... \n", + "9 https://www.malariagen.net/network/where-we-wo... \n", + "10 https://www.malariagen.net/network/where-we-wo... \n", + "11 https://www.malariagen.net/network/where-we-wo... \n", + "12 https://www.malariagen.net/network/where-we-wo... 
\n", + "\n", + " terms_of_use_expiry_date terms_of_use_url release unrestricted_use \n", + "0 2099-12-31 NaN 1.0 False \n", + "1 2099-12-31 NaN 1.0 False \n", + "2 2099-12-31 NaN 1.0 False \n", + "3 2099-12-31 NaN 1.0 False \n", + "4 2099-12-31 NaN 1.0 False \n", + "5 2099-12-31 NaN 1.0 False \n", + "6 2099-12-31 NaN 1.0 False \n", + "7 2099-12-31 NaN 1.0 False \n", + "8 2099-12-31 NaN 1.0 False \n", + "9 2099-12-31 NaN 1.0 False \n", + "10 2099-12-31 NaN 1.0 False \n", + "11 2099-12-31 NaN 1.0 False \n", + "12 2099-12-31 NaN 1.0 False " ] }, - "execution_count": 3, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df_sample_sets = af1.sample_sets(release=\"1.0\")\n", + "df_sample_sets = as1.sample_sets(release=\"1.0\")\n", "df_sample_sets" ] }, @@ -701,12 +848,19 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 661 }, + "execution": { + "iopub.execute_input": "2026-04-05T04:01:30.313497Z", + "iopub.status.busy": "2026-04-05T04:01:30.310326Z", + "iopub.status.idle": "2026-04-05T04:01:31.529234Z", + "shell.execute_reply": "2026-04-05T04:01:31.528366Z", + "shell.execute_reply.started": "2026-04-05T04:01:30.313468Z" + }, "id": "-V8nLGSaQH_4", "outputId": "98a12919-fd6a-4fd5-8155-d90f05d877d7", "tags": [] @@ -716,7 +870,46 @@ "name": "stdout", "output_type": "stream", "text": [ - " \r" + "Load sample metadata: ⠋ (0:00:00.85) " + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1363-VO-ET-GADISA-VMF00316\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1364-VO-SD-KAFY-VMF00317\n", + " warnings.warn(\n", + 
"/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1365-VO-DJ-ADBI-VMF00318\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1366-VO-YE-ALLAN-VMF00319\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1367-VO-AF-DONNELLY-VMF00320\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1368-VO-PK-DONNELLY-VMF00321\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1369-VO-SA-AL-NAZAWI-VMF00322\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1370-VO-IR-ENAYATI-VMF00323\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1385-VO-DJ-WEETMAN-VMF00338\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1386-VO-KE-OCHOMO-VMF00339\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1458-VO-ET-YEWHALAW-VMF00340\n", + " warnings.warn(\n", + 
"/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set 1459-VO-SD-AHMED-VMF00342\n", + " warnings.warn(\n", + "/home/jupyter/malariagen-data-python/malariagen_data/anoph/sample_metadata.py:417: UserWarning: WARNING: The surveillance flags data is missing for sample set thakare-2022\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " " ] }, { @@ -766,123 +959,123 @@ " \n", " \n", " 0\n", - " VBS24195\n", - " 1229-GH-A-GH01\n", - " Samuel Dadzie\n", - " Ghana\n", - " Dimabi\n", - " 2017\n", - " 8\n", - " 9.420\n", - " -1.083\n", + " VMF00316-0001\n", + " A01\n", + " Endalamaw Gadisa\n", + " Ethiopia\n", + " Awash\n", + " 2024\n", + " 11\n", + " 8.995\n", + " 40.159\n", " F\n", " ...\n", - " Northern Region\n", - " GH-NP\n", - " Tolon\n", - " funestus\n", - " GH-NP_fune_2017\n", - " GH-NP_fune_2017_08\n", - " GH-NP_fune_2017_Q3\n", - " GH-NP_Tolon_fune_2017\n", - " GH-NP_Tolon_fune_2017_08\n", - " GH-NP_Tolon_fune_2017_Q3\n", + " Afar\n", + " ET-AF\n", + " Zone 3\n", + " stephensi\n", + " ET-AF_step_2024\n", + " ET-AF_step_2024_11\n", + " ET-AF_step_2024_Q4\n", + " ET-AF_Zone-3_step_2024\n", + " ET-AF_Zone-3_step_2024_11\n", + " ET-AF_Zone-3_step_2024_Q4\n", " \n", " \n", " 1\n", - " VBS24196\n", - " 1229-GH-A-GH02\n", - " Samuel Dadzie\n", - " Ghana\n", - " Gbullung\n", - " 2017\n", - " 7\n", - " 9.488\n", - " -1.009\n", + " VMF00316-0002\n", + " A02\n", + " Endalamaw Gadisa\n", + " Ethiopia\n", + " Awash\n", + " 2024\n", + " 11\n", + " 8.995\n", + " 40.159\n", " F\n", " ...\n", - " Northern Region\n", - " GH-NP\n", - " Kumbungu\n", - " funestus\n", - " GH-NP_fune_2017\n", - " GH-NP_fune_2017_07\n", - " GH-NP_fune_2017_Q3\n", - " GH-NP_Kumbungu_fune_2017\n", - " GH-NP_Kumbungu_fune_2017_07\n", - " GH-NP_Kumbungu_fune_2017_Q3\n", + " Afar\n", + " ET-AF\n", + " Zone 3\n", + " stephensi\n", + " ET-AF_step_2024\n", + " 
ET-AF_step_2024_11\n", + " ET-AF_step_2024_Q4\n", + " ET-AF_Zone-3_step_2024\n", + " ET-AF_Zone-3_step_2024_11\n", + " ET-AF_Zone-3_step_2024_Q4\n", " \n", " \n", " 2\n", - " VBS24197\n", - " 1229-GH-A-GH03\n", - " Samuel Dadzie\n", - " Ghana\n", - " Dimabi\n", - " 2017\n", - " 7\n", - " 9.420\n", - " -1.083\n", + " VMF00316-0003\n", + " A03\n", + " Endalamaw Gadisa\n", + " Ethiopia\n", + " Awash\n", + " 2024\n", + " 11\n", + " 8.995\n", + " 40.159\n", " F\n", " ...\n", - " Northern Region\n", - " GH-NP\n", - " Tolon\n", - " funestus\n", - " GH-NP_fune_2017\n", - " GH-NP_fune_2017_07\n", - " GH-NP_fune_2017_Q3\n", - " GH-NP_Tolon_fune_2017\n", - " GH-NP_Tolon_fune_2017_07\n", - " GH-NP_Tolon_fune_2017_Q3\n", + " Afar\n", + " ET-AF\n", + " Zone 3\n", + " stephensi\n", + " ET-AF_step_2024\n", + " ET-AF_step_2024_11\n", + " ET-AF_step_2024_Q4\n", + " ET-AF_Zone-3_step_2024\n", + " ET-AF_Zone-3_step_2024_11\n", + " ET-AF_Zone-3_step_2024_Q4\n", " \n", " \n", " 3\n", - " VBS24198\n", - " 1229-GH-A-GH04\n", - " Samuel Dadzie\n", - " Ghana\n", - " Dimabi\n", - " 2017\n", - " 8\n", - " 9.420\n", - " -1.083\n", + " VMF00316-0004\n", + " A04\n", + " Endalamaw Gadisa\n", + " Ethiopia\n", + " Awash\n", + " 2024\n", + " 11\n", + " 8.995\n", + " 40.159\n", " F\n", " ...\n", - " Northern Region\n", - " GH-NP\n", - " Tolon\n", - " funestus\n", - " GH-NP_fune_2017\n", - " GH-NP_fune_2017_08\n", - " GH-NP_fune_2017_Q3\n", - " GH-NP_Tolon_fune_2017\n", - " GH-NP_Tolon_fune_2017_08\n", - " GH-NP_Tolon_fune_2017_Q3\n", + " Afar\n", + " ET-AF\n", + " Zone 3\n", + " stephensi\n", + " ET-AF_step_2024\n", + " ET-AF_step_2024_11\n", + " ET-AF_step_2024_Q4\n", + " ET-AF_Zone-3_step_2024\n", + " ET-AF_Zone-3_step_2024_11\n", + " ET-AF_Zone-3_step_2024_Q4\n", " \n", " \n", " 4\n", - " VBS24199\n", - " 1229-GH-A-GH05\n", - " Samuel Dadzie\n", - " Ghana\n", - " Gupanarigu\n", - " 2017\n", - " 8\n", - " 9.497\n", - " -0.952\n", + " VMF00316-0005\n", + " A05\n", + " Endalamaw Gadisa\n", + " 
Ethiopia\n", + " Awash\n", + " 2024\n", + " 11\n", + " 8.995\n", + " 40.159\n", " F\n", " ...\n", - " Northern Region\n", - " GH-NP\n", - " Kumbungu\n", - " funestus\n", - " GH-NP_fune_2017\n", - " GH-NP_fune_2017_08\n", - " GH-NP_fune_2017_Q3\n", - " GH-NP_Kumbungu_fune_2017\n", - " GH-NP_Kumbungu_fune_2017_08\n", - " GH-NP_Kumbungu_fune_2017_Q3\n", + " Afar\n", + " ET-AF\n", + " Zone 3\n", + " stephensi\n", + " ET-AF_step_2024\n", + " ET-AF_step_2024_11\n", + " ET-AF_step_2024_Q4\n", + " ET-AF_Zone-3_step_2024\n", + " ET-AF_Zone-3_step_2024_11\n", + " ET-AF_Zone-3_step_2024_Q4\n", " \n", " \n", " ...\n", @@ -909,206 +1102,219 @@ " ...\n", " \n", " \n", - " 651\n", - " VBS24534\n", - " 1240-MZ-A-MozF_1314\n", - " Lizette Koekemoer\n", - " Mozambique\n", - " Motinho\n", - " 2015\n", - " 8\n", - " -10.851\n", - " 40.594\n", - " F\n", + " 634\n", + " SRR15293888\n", + " SRR15293888\n", + " Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...\n", + " India\n", + " Mangaluru\n", + " 2021\n", + " -1\n", + " 12.879\n", + " 74.847\n", + " M\n", " ...\n", - " Cabo Delgado\n", - " MZ-P\n", - " Palma\n", - " funestus\n", - " MZ-P_fune_2015\n", - " MZ-P_fune_2015_08\n", - " MZ-P_fune_2015_Q3\n", - " MZ-P_Palma_fune_2015\n", - " MZ-P_Palma_fune_2015_08\n", - " MZ-P_Palma_fune_2015_Q3\n", + " Karnātaka\n", + " IN-KA\n", + " Dakshina Kannada\n", + " stephensi\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", " \n", " \n", - " 652\n", - " VBS24535\n", - " 1240-MZ-A-MozF_1315\n", - " Lizette Koekemoer\n", - " Mozambique\n", - " Motinho\n", - " 2015\n", - " 8\n", - " -10.851\n", - " 40.594\n", - " F\n", + " 635\n", + " SRR15293889\n", + " SRR15293889\n", + " Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...\n", + " India\n", + " Mangaluru\n", + " 2021\n", + " -1\n", + " 12.879\n", + " 74.847\n", + " M\n", " ...\n", - " Cabo Delgado\n", - 
" MZ-P\n", - " Palma\n", - " funestus\n", - " MZ-P_fune_2015\n", - " MZ-P_fune_2015_08\n", - " MZ-P_fune_2015_Q3\n", - " MZ-P_Palma_fune_2015\n", - " MZ-P_Palma_fune_2015_08\n", - " MZ-P_Palma_fune_2015_Q3\n", + " Karnātaka\n", + " IN-KA\n", + " Dakshina Kannada\n", + " stephensi\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", " \n", " \n", - " 653\n", - " VBS24536\n", - " 1240-MZ-A-MozF_1317\n", - " Lizette Koekemoer\n", - " Mozambique\n", - " Motinho\n", - " 2015\n", - " 8\n", - " -10.851\n", - " 40.594\n", + " 636\n", + " SRR15293892\n", + " SRR15293892\n", + " Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...\n", + " India\n", + " Mangaluru\n", + " 2021\n", + " -1\n", + " 12.879\n", + " 74.847\n", " F\n", " ...\n", - " Cabo Delgado\n", - " MZ-P\n", - " Palma\n", - " funestus\n", - " MZ-P_fune_2015\n", - " MZ-P_fune_2015_08\n", - " MZ-P_fune_2015_Q3\n", - " MZ-P_Palma_fune_2015\n", - " MZ-P_Palma_fune_2015_08\n", - " MZ-P_Palma_fune_2015_Q3\n", + " Karnātaka\n", + " IN-KA\n", + " Dakshina Kannada\n", + " stephensi\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", " \n", " \n", - " 654\n", - " VBS24537\n", - " 1240-MZ-A-MozF_1319\n", - " Lizette Koekemoer\n", - " Mozambique\n", - " Motinho\n", - " 2015\n", - " 8\n", - " -10.851\n", - " 40.594\n", - " F\n", + " 637\n", + " SRR15293893\n", + " SRR15293893\n", + " Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...\n", + " India\n", + " Mangaluru\n", + " 2021\n", + " -1\n", + " 12.879\n", + " 74.847\n", + " M\n", " ...\n", - " Cabo Delgado\n", - " MZ-P\n", - " Palma\n", - " funestus\n", - " MZ-P_fune_2015\n", - " MZ-P_fune_2015_08\n", - " MZ-P_fune_2015_Q3\n", - " MZ-P_Palma_fune_2015\n", - " MZ-P_Palma_fune_2015_08\n", - " 
MZ-P_Palma_fune_2015_Q3\n", + " Karnātaka\n", + " IN-KA\n", + " Dakshina Kannada\n", + " stephensi\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", " \n", " \n", - " 655\n", - " VBS24539\n", - " 1240-MZ-A-MozF_1323\n", - " Lizette Koekemoer\n", - " Mozambique\n", - " Motinho\n", - " 2015\n", - " 8\n", - " -10.851\n", - " 40.594\n", + " 638\n", + " SRR15293894\n", + " SRR15293894\n", + " Aditi Thakare, Chaitali Ghosh, Tejashwini Alal...\n", + " India\n", + " Mangaluru\n", + " 2021\n", + " -1\n", + " 12.879\n", + " 74.847\n", " F\n", " ...\n", - " Cabo Delgado\n", - " MZ-P\n", - " Palma\n", - " funestus\n", - " MZ-P_fune_2015\n", - " MZ-P_fune_2015_08\n", - " MZ-P_fune_2015_Q3\n", - " MZ-P_Palma_fune_2015\n", - " MZ-P_Palma_fune_2015_08\n", - " MZ-P_Palma_fune_2015_Q3\n", + " Karnātaka\n", + " IN-KA\n", + " Dakshina Kannada\n", + " stephensi\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", + " IN-KA_Dakshina-Kannada_step_2021\n", " \n", " \n", "\n", - "

656 rows × 26 columns

\n", + "

639 rows × 44 columns

\n", "" ], "text/plain": [ - " sample_id partner_sample_id contributor country location \\\n", - "0 VBS24195 1229-GH-A-GH01 Samuel Dadzie Ghana Dimabi \n", - "1 VBS24196 1229-GH-A-GH02 Samuel Dadzie Ghana Gbullung \n", - "2 VBS24197 1229-GH-A-GH03 Samuel Dadzie Ghana Dimabi \n", - "3 VBS24198 1229-GH-A-GH04 Samuel Dadzie Ghana Dimabi \n", - "4 VBS24199 1229-GH-A-GH05 Samuel Dadzie Ghana Gupanarigu \n", - ".. ... ... ... ... ... \n", - "651 VBS24534 1240-MZ-A-MozF_1314 Lizette Koekemoer Mozambique Motinho \n", - "652 VBS24535 1240-MZ-A-MozF_1315 Lizette Koekemoer Mozambique Motinho \n", - "653 VBS24536 1240-MZ-A-MozF_1317 Lizette Koekemoer Mozambique Motinho \n", - "654 VBS24537 1240-MZ-A-MozF_1319 Lizette Koekemoer Mozambique Motinho \n", - "655 VBS24539 1240-MZ-A-MozF_1323 Lizette Koekemoer Mozambique Motinho \n", - "\n", - " year month latitude longitude sex_call ... admin1_name \\\n", - "0 2017 8 9.420 -1.083 F ... Northern Region \n", - "1 2017 7 9.488 -1.009 F ... Northern Region \n", - "2 2017 7 9.420 -1.083 F ... Northern Region \n", - "3 2017 8 9.420 -1.083 F ... Northern Region \n", - "4 2017 8 9.497 -0.952 F ... Northern Region \n", - ".. ... ... ... ... ... ... ... \n", - "651 2015 8 -10.851 40.594 F ... Cabo Delgado \n", - "652 2015 8 -10.851 40.594 F ... Cabo Delgado \n", - "653 2015 8 -10.851 40.594 F ... Cabo Delgado \n", - "654 2015 8 -10.851 40.594 F ... Cabo Delgado \n", - "655 2015 8 -10.851 40.594 F ... Cabo Delgado \n", - "\n", - " admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month \\\n", - "0 GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 \n", - "1 GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_07 \n", - "2 GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_07 \n", - "3 GH-NP Tolon funestus GH-NP_fune_2017 GH-NP_fune_2017_08 \n", - "4 GH-NP Kumbungu funestus GH-NP_fune_2017 GH-NP_fune_2017_08 \n", - ".. ... ... ... ... ... 
\n", - "651 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", - "652 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", - "653 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", - "654 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", - "655 MZ-P Palma funestus MZ-P_fune_2015 MZ-P_fune_2015_08 \n", - "\n", - " cohort_admin1_quarter cohort_admin2_year \\\n", - "0 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 \n", - "1 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 \n", - "2 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 \n", - "3 GH-NP_fune_2017_Q3 GH-NP_Tolon_fune_2017 \n", - "4 GH-NP_fune_2017_Q3 GH-NP_Kumbungu_fune_2017 \n", - ".. ... ... \n", - "651 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", - "652 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", - "653 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", - "654 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", - "655 MZ-P_fune_2015_Q3 MZ-P_Palma_fune_2015 \n", - "\n", - " cohort_admin2_month cohort_admin2_quarter \n", - "0 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3 \n", - "1 GH-NP_Kumbungu_fune_2017_07 GH-NP_Kumbungu_fune_2017_Q3 \n", - "2 GH-NP_Tolon_fune_2017_07 GH-NP_Tolon_fune_2017_Q3 \n", - "3 GH-NP_Tolon_fune_2017_08 GH-NP_Tolon_fune_2017_Q3 \n", - "4 GH-NP_Kumbungu_fune_2017_08 GH-NP_Kumbungu_fune_2017_Q3 \n", - ".. ... ... \n", - "651 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", - "652 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", - "653 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", - "654 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", - "655 MZ-P_Palma_fune_2015_08 MZ-P_Palma_fune_2015_Q3 \n", - "\n", - "[656 rows x 26 columns]" + " sample_id partner_sample_id \\\n", + "0 VMF00316-0001 A01 \n", + "1 VMF00316-0002 A02 \n", + "2 VMF00316-0003 A03 \n", + "3 VMF00316-0004 A04 \n", + "4 VMF00316-0005 A05 \n", + ".. ... ... 
\n", + "634 SRR15293888 SRR15293888 \n", + "635 SRR15293889 SRR15293889 \n", + "636 SRR15293892 SRR15293892 \n", + "637 SRR15293893 SRR15293893 \n", + "638 SRR15293894 SRR15293894 \n", + "\n", + " contributor country location \\\n", + "0 Endalamaw Gadisa Ethiopia Awash \n", + "1 Endalamaw Gadisa Ethiopia Awash \n", + "2 Endalamaw Gadisa Ethiopia Awash \n", + "3 Endalamaw Gadisa Ethiopia Awash \n", + "4 Endalamaw Gadisa Ethiopia Awash \n", + ".. ... ... ... \n", + "634 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "635 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "636 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "637 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "638 Aditi Thakare, Chaitali Ghosh, Tejashwini Alal... India Mangaluru \n", + "\n", + " year month latitude longitude sex_call ... admin1_name admin1_iso \\\n", + "0 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "1 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "2 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "3 2024 11 8.995 40.159 F ... Afar ET-AF \n", + "4 2024 11 8.995 40.159 F ... Afar ET-AF \n", + ".. ... ... ... ... ... ... ... ... \n", + "634 2021 -1 12.879 74.847 M ... Karnātaka IN-KA \n", + "635 2021 -1 12.879 74.847 M ... Karnātaka IN-KA \n", + "636 2021 -1 12.879 74.847 F ... Karnātaka IN-KA \n", + "637 2021 -1 12.879 74.847 M ... Karnātaka IN-KA \n", + "638 2021 -1 12.879 74.847 F ... Karnātaka IN-KA \n", + "\n", + " admin2_name taxon cohort_admin1_year cohort_admin1_month \\\n", + "0 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "1 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "2 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "3 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + "4 Zone 3 stephensi ET-AF_step_2024 ET-AF_step_2024_11 \n", + ".. ... ... ... ... 
\n", + "634 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "635 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "636 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "637 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "638 Dakshina Kannada stephensi IN-KA_step_2021 IN-KA_step_2021 \n", + "\n", + " cohort_admin1_quarter cohort_admin2_year \\\n", + "0 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "1 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "2 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "3 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + "4 ET-AF_step_2024_Q4 ET-AF_Zone-3_step_2024 \n", + ".. ... ... \n", + "634 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "635 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "636 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "637 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "638 IN-KA_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "\n", + " cohort_admin2_month cohort_admin2_quarter \n", + "0 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "1 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "2 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "3 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + "4 ET-AF_Zone-3_step_2024_11 ET-AF_Zone-3_step_2024_Q4 \n", + ".. ... ... 
\n", + "634 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "635 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "636 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "637 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "638 IN-KA_Dakshina-Kannada_step_2021 IN-KA_Dakshina-Kannada_step_2021 \n", + "\n", + "[639 rows x 44 columns]" ] }, - "execution_count": 4, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "df_samples = af1.sample_metadata(sample_sets=\"1.0\")\n", + "df_samples = as1.sample_metadata(sample_sets=\"1.0\")\n", "df_samples" ] }, @@ -1118,13 +1324,15 @@ "id": "ssCdOykfQH_4" }, "source": [ - "The `sample_id` column gives the sample identifier used throughout all Af1 analyses.\n", + "The `sample_id` column gives the sample identifier used throughout all As1 analyses.\n", "\n", "The `country`, `location`, `latitude` and `longitude` columns give the location where the specimen was collected.\n", "\n", "The `year` and `month` columns give the approximate date when the specimen was collected.\n", "\n", - "The `sex_call` column gives the gender as determined from the sequence data." + "The `sex_call` column gives the sex of the specimen as determined from the sequence data.\n", + "\n", + "Note that the warnings are raised because surveillance flags are missing from this release. Surveillance flags will be added in future data releases." 
] }, { @@ -1138,11 +1346,18 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:18.865363Z", + "iopub.status.busy": "2026-04-05T04:02:18.865006Z", + "iopub.status.idle": "2026-04-05T04:02:18.876141Z", + "shell.execute_reply": "2026-04-05T04:02:18.872642Z", + "shell.execute_reply.started": "2026-04-05T04:02:18.865334Z" + }, "id": "PpsTgviZQH_4", "outputId": "ddbc9515-25dc-454f-9f02-9427f1261b06", "tags": [] @@ -1152,11 +1367,11 @@ "data": { "text/plain": [ "taxon\n", - "funestus 656\n", + "stephensi 639\n", "dtype: int64" ] }, - "execution_count": 5, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -1180,12 +1395,19 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 430 }, + "execution": { + "iopub.execute_input": "2026-04-05T04:02:21.993178Z", + "iopub.status.busy": "2026-04-05T04:02:21.992783Z", + "iopub.status.idle": "2026-04-05T04:02:24.320013Z", + "shell.execute_reply": "2026-04-05T04:02:24.317119Z", + "shell.execute_reply.started": "2026-04-05T04:02:21.993144Z" + }, "id": "433PD7k8jlNj", "outputId": "bc5e1b8d-f1f4-4008-df56-f577a9080561", "tags": [] @@ -1195,7 +1417,7 @@ "name": "stdout", "output_type": "stream", "text": [ - " \r" + " " ] }, { @@ -1221,27 +1443,76 @@ " */\n", "\n", ":root {\n", - " --xr-font-color0: var(--jp-content-font-color0, rgba(0, 0, 0, 1));\n", - " --xr-font-color2: var(--jp-content-font-color2, rgba(0, 0, 0, 0.54));\n", - " --xr-font-color3: var(--jp-content-font-color3, rgba(0, 0, 0, 0.38));\n", - " --xr-border-color: var(--jp-border-color2, #e0e0e0);\n", - " --xr-disabled-color: var(--jp-layout-color3, #bdbdbd);\n", - " --xr-background-color: var(--jp-layout-color0, white);\n", - " --xr-background-color-row-even: var(--jp-layout-color1, white);\n", - " 
--xr-background-color-row-odd: var(--jp-layout-color2, #eeeeee);\n", - "}\n", - "\n", - "html[theme=dark],\n", - "body[data-theme=dark],\n", + " --xr-font-color0: var(\n", + " --jp-content-font-color0,\n", + " var(--pst-color-text-base rgba(0, 0, 0, 1))\n", + " );\n", + " --xr-font-color2: var(\n", + " --jp-content-font-color2,\n", + " var(--pst-color-text-base, rgba(0, 0, 0, 0.54))\n", + " );\n", + " --xr-font-color3: var(\n", + " --jp-content-font-color3,\n", + " var(--pst-color-text-base, rgba(0, 0, 0, 0.38))\n", + " );\n", + " --xr-border-color: var(\n", + " --jp-border-color2,\n", + " hsl(from var(--pst-color-on-background, white) h s calc(l - 10))\n", + " );\n", + " --xr-disabled-color: var(\n", + " --jp-layout-color3,\n", + " hsl(from var(--pst-color-on-background, white) h s calc(l - 40))\n", + " );\n", + " --xr-background-color: var(\n", + " --jp-layout-color0,\n", + " var(--pst-color-on-background, white)\n", + " );\n", + " --xr-background-color-row-even: var(\n", + " --jp-layout-color1,\n", + " hsl(from var(--pst-color-on-background, white) h s calc(l - 5))\n", + " );\n", + " --xr-background-color-row-odd: var(\n", + " --jp-layout-color2,\n", + " hsl(from var(--pst-color-on-background, white) h s calc(l - 15))\n", + " );\n", + "}\n", + "\n", + "html[theme=\"dark\"],\n", + "html[data-theme=\"dark\"],\n", + "body[data-theme=\"dark\"],\n", "body.vscode-dark {\n", - " --xr-font-color0: rgba(255, 255, 255, 1);\n", - " --xr-font-color2: rgba(255, 255, 255, 0.54);\n", - " --xr-font-color3: rgba(255, 255, 255, 0.38);\n", - " --xr-border-color: #1F1F1F;\n", - " --xr-disabled-color: #515151;\n", - " --xr-background-color: #111111;\n", - " --xr-background-color-row-even: #111111;\n", - " --xr-background-color-row-odd: #313131;\n", + " --xr-font-color0: var(\n", + " --jp-content-font-color0,\n", + " var(--pst-color-text-base, rgba(255, 255, 255, 1))\n", + " );\n", + " --xr-font-color2: var(\n", + " --jp-content-font-color2,\n", + " var(--pst-color-text-base, 
rgba(255, 255, 255, 0.54))\n", + " );\n", + " --xr-font-color3: var(\n", + " --jp-content-font-color3,\n", + " var(--pst-color-text-base, rgba(255, 255, 255, 0.38))\n", + " );\n", + " --xr-border-color: var(\n", + " --jp-border-color2,\n", + " hsl(from var(--pst-color-on-background, #111111) h s calc(l + 10))\n", + " );\n", + " --xr-disabled-color: var(\n", + " --jp-layout-color3,\n", + " hsl(from var(--pst-color-on-background, #111111) h s calc(l + 40))\n", + " );\n", + " --xr-background-color: var(\n", + " --jp-layout-color0,\n", + " var(--pst-color-on-background, #111111)\n", + " );\n", + " --xr-background-color-row-even: var(\n", + " --jp-layout-color1,\n", + " hsl(from var(--pst-color-on-background, #111111) h s calc(l + 5))\n", + " );\n", + " --xr-background-color-row-odd: var(\n", + " --jp-layout-color2,\n", + " hsl(from var(--pst-color-on-background, #111111) h s calc(l + 15))\n", + " );\n", "}\n", "\n", ".xr-wrap {\n", @@ -1282,7 +1553,7 @@ ".xr-sections {\n", " padding-left: 0 !important;\n", " display: grid;\n", - " grid-template-columns: 150px auto auto 1fr 20px 20px;\n", + " grid-template-columns: 150px auto auto 1fr 0 20px 0 20px;\n", "}\n", "\n", ".xr-section-item {\n", @@ -1290,11 +1561,14 @@ "}\n", "\n", ".xr-section-item input {\n", - " display: none;\n", + " display: inline-block;\n", + " opacity: 0;\n", + " height: 0;\n", "}\n", "\n", ".xr-section-item input + label {\n", " color: var(--xr-disabled-color);\n", + " border: 2px solid transparent !important;\n", "}\n", "\n", ".xr-section-item input:enabled + label {\n", @@ -1302,6 +1576,10 @@ " color: var(--xr-font-color2);\n", "}\n", "\n", + ".xr-section-item input:focus + label {\n", + " border: 2px solid var(--xr-font-color0) !important;\n", + "}\n", + "\n", ".xr-section-item input:enabled + label:hover {\n", " color: var(--xr-font-color0);\n", "}\n", @@ -1323,7 +1601,7 @@ "\n", ".xr-section-summary-in + label:before {\n", " display: inline-block;\n", - " content: '►';\n", + " content: 
\"►\";\n", " font-size: 11px;\n", " width: 15px;\n", " text-align: center;\n", @@ -1334,7 +1612,7 @@ "}\n", "\n", ".xr-section-summary-in:checked + label:before {\n", - " content: '▼';\n", + " content: \"▼\";\n", "}\n", "\n", ".xr-section-summary-in:checked + label > span {\n", @@ -1406,15 +1684,15 @@ "}\n", "\n", ".xr-dim-list:before {\n", - " content: '(';\n", + " content: \"(\";\n", "}\n", "\n", ".xr-dim-list:after {\n", - " content: ')';\n", + " content: \")\";\n", "}\n", "\n", ".xr-dim-list li:not(:last-child):after {\n", - " content: ',';\n", + " content: \",\";\n", " padding-right: 5px;\n", "}\n", "\n", @@ -1431,7 +1709,9 @@ ".xr-var-item label,\n", ".xr-var-item > .xr-var-name span {\n", " background-color: var(--xr-background-color-row-even);\n", + " border-color: var(--xr-background-color-row-odd);\n", " margin-bottom: 0;\n", + " padding-top: 2px;\n", "}\n", "\n", ".xr-var-item > .xr-var-name:hover span {\n", @@ -1442,6 +1722,7 @@ ".xr-var-list > li:nth-child(odd) > label,\n", ".xr-var-list > li:nth-child(odd) > .xr-var-name span {\n", " background-color: var(--xr-background-color-row-odd);\n", + " border-color: var(--xr-background-color-row-even);\n", "}\n", "\n", ".xr-var-name {\n", @@ -1491,8 +1772,15 @@ ".xr-var-data,\n", ".xr-index-data {\n", " display: none;\n", - " background-color: var(--xr-background-color) !important;\n", - " padding-bottom: 5px !important;\n", + " border-top: 2px dotted var(--xr-background-color);\n", + " padding-bottom: 20px !important;\n", + " padding-top: 10px !important;\n", + "}\n", + "\n", + ".xr-var-attrs-in + label,\n", + ".xr-var-data-in + label,\n", + ".xr-index-data-in + label {\n", + " padding: 0 1px;\n", "}\n", "\n", ".xr-var-attrs-in:checked ~ .xr-var-attrs,\n", @@ -1505,6 +1793,12 @@ " float: right;\n", "}\n", "\n", + ".xr-var-data > pre,\n", + ".xr-index-data > pre,\n", + ".xr-var-data > table > tbody > tr {\n", + " background-color: transparent !important;\n", + "}\n", + "\n", ".xr-var-name span,\n", 
".xr-var-data,\n", ".xr-index-name div,\n", @@ -1564,24 +1858,32 @@ " stroke: currentColor;\n", " fill: currentColor;\n", "}\n", + "\n", + ".xr-var-attrs-in:checked + label > .xr-icon-file-text2,\n", + ".xr-var-data-in:checked + label > .xr-icon-database,\n", + ".xr-index-data-in:checked + label > .xr-icon-database {\n", + " color: var(--xr-font-color0);\n", + " filter: drop-shadow(1px 1px 5px var(--xr-font-color2));\n", + " stroke-width: 0.8px;\n", + "}\n", "
<xarray.Dataset> Size: 1TB\n",
-       "Dimensions:                       (variants: 102882611, alleles: 4,\n",
-       "                                   samples: 656, ploidy: 2)\n",
+       "Dimensions:                        (variants: 93702023, alleles: 4,\n",
+       "                                    samples: 639, ploidy: 2)\n",
        "Coordinates:\n",
-       "    variant_position              (variants) int32 412MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
-       "    variant_contig                (variants) uint8 103MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
-       "    sample_id                     (samples) <U36 94kB dask.array<chunksize=(36,), meta=np.ndarray>\n",
+       "    variant_position               (variants) int32 375MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    variant_contig                 (variants) uint8 94MB dask.array<chunksize=(524288,), meta=np.ndarray>\n",
+       "    sample_id                      (samples) <U36 92kB dask.array<chunksize=(111,), meta=np.ndarray>\n",
        "Dimensions without coordinates: variants, alleles, samples, ploidy\n",
        "Data variables:\n",
-       "    variant_allele                (variants, alleles) |S1 412MB dask.array<chunksize=(524288, 1), meta=np.ndarray>\n",
-       "    variant_filter_pass_funestus  (variants) bool 103MB dask.array<chunksize=(300000,), meta=np.ndarray>\n",
-       "    call_genotype                 (variants, samples, ploidy) int8 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>\n",
-       "    call_GQ                       (variants, samples) int8 67GB dask.array<chunksize=(300000, 36), meta=np.ndarray>\n",
-       "    call_MQ                       (variants, samples) float32 270GB dask.array<chunksize=(300000, 36), meta=np.ndarray>\n",
-       "    call_AD                       (variants, samples, alleles) int16 540GB dask.array<chunksize=(300000, 36, 4), meta=np.ndarray>\n",
-       "    call_genotype_mask            (variants, samples, ploidy) bool 135GB dask.array<chunksize=(300000, 36, 2), meta=np.ndarray>\n",
+       "    variant_allele                 (variants, alleles) |S1 375MB dask.array<chunksize=(524288, 4), meta=np.ndarray>\n",
+       "    variant_filter_pass_stephensi  (variants) bool 94MB dask.array<chunksize=(300000,), meta=np.ndarray>\n",
+       "    call_genotype                  (variants, samples, ploidy) int8 120GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
+       "    call_GQ                        (variants, samples) int8 60GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_MQ                        (variants, samples) float32 240GB dask.array<chunksize=(300000, 50), meta=np.ndarray>\n",
+       "    call_AD                        (variants, samples, alleles) int16 479GB dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>\n",
+       "    call_genotype_mask             (variants, samples, ploidy) bool 120GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>\n",
        "Attributes:\n",
-       "    contigs:  ('2RL', '3RL', 'X')