From 9df9752df528f909cb88081419fabefc1f7f1001 Mon Sep 17 00:00:00 2001 From: "Thomas Neep (Advanced Research Computing)" Date: Thu, 14 Aug 2025 16:21:23 +0100 Subject: [PATCH 1/6] Update analysis page. Remove any specific data which may not be public. Make a tabbed interface for the CLI and Python API. --- docs/analyse.md | 401 ++++++++++++++++++++++++++---------------------- 1 file changed, 214 insertions(+), 187 deletions(-) diff --git a/docs/analyse.md b/docs/analyse.md index b3238cb..c11e425 100644 --- a/docs/analyse.md +++ b/docs/analyse.md @@ -14,7 +14,7 @@ so that once installed, the Onyx client will automatically be configured. ## Onyx client basics First, let's install the Onyx client, which is available through the -[conda-forge package](https://anaconda.org/conda-forge/climb-onyx-client) +[conda-forge package](https://anaconda.org/conda-forge/climb-onyx-client) `climb-onyx-client` and can thus be installed with `conda`. As advised in the [CLIMB docs on installing software](https://docs.climb.ac.uk/notebook-servers/installing-software-with-conda/), @@ -28,205 +28,232 @@ Let's activate this environment. jovyan:~$ conda activate onyx ``` On Bryn's Notebook Servers, the client will automatically be configured. -Try running the command-line client with -``` -(onyx) jovyan:~$ onyx -``` -This should show you some options and commands that are available. -Have a look at your own profile with -``` -(onyx) jovyan:~$ onyx profile -``` -and which projects you have access to with -``` -(onyx) jovyan:~$ onyx projects -``` -You should see `mscape` listed. +We will now have access to both the Python API and a command-line client. +Let's walk through some of the commands available to us. +In each case you can choose between the Python API or the command-line interface (CLI). + +### Initial setup + +=== "CLI" + No additional setup is required if you are running the CLI in a CLIMB + notebook. 
You can try running the command-line client with + + ```console + (onyx) jovyan:~$ onyx + ``` + to see some of the options and commands available to you. + +=== "Python" + If you are using onyx in Python, then you need to import the required modules and configure a client. + ```python + import os + from onyx import OnyxConfig, OnyxEnv, OnyxClient + + config = OnyxConfig( + domain=os.environ[OnyxEnv.DOMAIN], + token=os.environ[OnyxEnv.TOKEN], + ) + + client = OnyxClient(config=config) + ``` + + !!! note + + In all the Python API examples, arguments will be + explicitly passed as keyword arguments e.g. `arg=value`, + however, in all cases shown on this page, the argument names + can be omitted. + +### Profile + +You can view information about your profile (username, site, and email) with + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx profile + ``` + +=== "Python" + + ```python + client.profile() + ``` + +### Projects + +You can view the projects you have access to with + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx projects + ``` + +=== "Python" + + ```python + client.projects() + ``` ## Querying data As an example task, we'll see if we can find any sequencing data performed -for ZymoBIOMICS sources. These are designed with +for ZymoBIOMICS sources. These are designed with [a particular specification](https://files.zymoresearch.com/protocols/_d6300_zymobiomics_microbial_community_standard.pdf) -of DNA from eight bacteria and two yeasts. We can use these to see if our protocol -correctly recovers the DNA fractions. I.e. if our protocol is biased. +of DNA from eight bacteria and two yeasts. +We will search the `mscape` project, but bear in mind you may not +have access to that particular project. -From the command line, the main route to querying Onyx is via the `filter` command. -On its own, this queries the database with *no* filters. 
The command -``` -(onyx) jovyan:~$ onyx filter mscape -``` -will produce tens of thousands of lines of JSON, so let's not -do that just yet. To first see which fields are available in the database, -we can use -``` -(onyx) jovyan:~$ onyx fields mscape -... -├────────────────────────────────┼──────────┼───────────────────┼──────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤ -│ extraction_enrichment_protocol │ optional │ text │ Details of nucleic acid extraction and optional enrichment steps. │ │ -├────────────────────────────────┼──────────┼───────────────────┼──────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤ -... -``` -Let's search the database for entries with `zymo` (case-insensitive) in this field. -``` -(onyx) jovyan:~$ onyx filter mscape --field extraction_enrichment_protocol.icontains=zymo -... -``` -That should return JSON data for a few entries. You may wish to format the -data as CSV or TSV with `--format csv` or `--format tsv`, respectively. +To see every entry in the entire database for a particular project we can do -## Inspecting some pipeline output on the command line +=== "CLI" -When data is ingested into Onyx, a taxonomic classification is automatically run. -The last part of the JSON data is usually some of this, in JSON format. -The complete reports can be found in the S3 buckets given in the -`'taxon_report'` field. You can find this in the output you've already produced -or modify the `filter` command to only request them using the `--include` flag. e.g. 
-``` -(onyx) jovyan:~$ onyx filter mscape --field extraction_enrichment_protocol.icontains=zymo --include=taxon_reports -[ - { - "taxon_reports": "s3://mscape-published-taxon-reports/C-FDE50853AD/" - }, - { - "taxon_reports": "s3://mscape-published-taxon-reports/C-04F4495068/" - } -] -``` -Multiple fields can be requested with the `--include` flag e.g. -``` -(onyx) jovyan:~$ onyx filter mscape --field extraction_enrichment_protocol.icontains=zymo --include climb_id,taxon_reports -[ - { - "climb_id": "C-FDE50853AD", - "taxon_reports": "s3://mscape-published-taxon-reports/C-FDE50853AD/" - }, - { - "climb_id": "C-04F4495068", - "taxon_reports": "s3://mscape-published-taxon-reports/C-04F4495068/" - } -] -``` -You can conversely exclude individual fields using the `--exclude` -flag in the same way. + ```console + (onyx) jovyan:~$ onyx filter mscape + ``` -Either way, you now have the location of the taxonomy reports. Let's have a look -with `s3cmd`. -``` -(onyx) jovyan:~$ s3cmd ls s3://mscape-published-taxon-reports/C-FDE50853AD/ -2023-11-10 12:56 146K s3://mscape-published-taxon-reports/C-FDE50853AD/PlusPF.kraken.json -2023-11-10 12:56 2G s3://mscape-published-taxon-reports/C-FDE50853AD/PlusPF.kraken_assignments.tsv -2023-11-10 12:56 193K s3://mscape-published-taxon-reports/C-FDE50853AD/PlusPF.kraken_report.txt -``` -The plain text report is what we're after, so let's download that with `s3cmd`: -``` -(onyx) jovyan:~$ s3cmd get s3://mscape-published-taxon-reports/C-FDE50853AD/PlusPF.kraken_report.txt -download: 's3://mscape-published-taxon-reports/C-FDE50853AD/PlusPF.kraken_report.txt' -> './PlusPF.kraken_report.txt' [1 of 1] - 197750 of 197750 100% in 0s 3.79 MB/s done -``` +=== "Python" -If you've never seen one of these reports before, it's worth having a -quick look with a tool like `less` or by opening it using the -JupyterLab file browser. 
For reference, it's worth showing the header -``` -(onyx) jovyan:~$ head -n 1 PlusPF.kraken_report.txt -% of Seqs Clades Taxonomies Rank Taxonomy ID Scientific Name -``` -The Zymo sample is prepared with 12% *Bacillus subtilis*. Let's see how much -was actually reported in the results: -``` -(onyx) jovyan:~$ grep "Bacillus subtilis" PlusPF.kraken_report.txt - 20.30 435278 1452 G1 653685 Bacillus subtilis group - 0.12 2624 1952 S 1423 Bacillus subtilis - 0.03 565 242 S1 135461 Bacillus subtilis subsp. subtilis - 0.01 108 108 S2 1404258 Bacillus subtilis subsp. subtilis str. OH 131.1 - ... -``` -Looks like 20.3%, though classified under *Bacillus subtilis* "subgroup", -rather than *Bacillus subtilis*, which reportedly only comprises 0.12% of the sample. -Most of that 20.3% is under *Bacillus spizizenii*. - -An important detail here is that the fraction reported in this output -is not calculated in the same way as what's used in the reference values (12% for bacteria; 2% for yeasts). -Let's make a fairer comparison using the JSON taxonomic data. - -## Working with database output in Python - -To fairly compare the taxonomic data with the reference values in the -Zymo community, we need to know the proportions of gDNA, so we need to -compute the number of base pairs that were assigned to each taxon. -Let's make this comparison in Python using the Onyx client's Python -API. - -Let's first run the same query for the Zymo data. We'll follow the -examples in the Onyx documentation and run the query in a context -manager. -```py -import os -from onyx import OnyxConfig, OnyxEnv, OnyxClient - -config = OnyxConfig( - domain=os.environ[OnyxEnv.DOMAIN], - token=os.environ[OnyxEnv.TOKEN], -) - -with OnyxClient(config) as client: - records = list(client.filter( - "mscape", - fields={ - "extraction_enrichment_protocol__icontains": "zymo", - }, - )) -``` -We've wrapped the `filter` call in a `list` because otherwise -we get a generator. 
- -If you want to inspect the data, it's a bit easier to read if formatted with -indentation, which can be done using the standard `json.dumps` function: -```py -import json -print(json.dumps(records[0], indent=2)) # show first record -``` -In each record, the `'taxa_files'` key gives us a list of dictionaries -that each has a number of reads and a mean length, the product of -which is the total number of base pairs that were read for that -taxon. A simple first step is to convert the taxonomic data (for the first record) -into a Pandas DataFrame with -```py -import pandas as pd - -df = pd.DataFrame(records[0]['taxa_files']) -``` -We also need to drop a few lower-level taxa that are already -accounted for in higher ones. e.g. the reads for *Bacillus spizizenii TU-B-10* are -among the reads counted for *Bacillus spizizenii*. A quick way of doing this -is by selecting the rows that have only two words in their names. -```py -df = df.loc[df['human_readable'].apply(lambda name: len(name.split()) == 2)] + ```python + # client.filter returns a generator that we can iterate over + entires = client.filter(project="mscape") + ``` + +On its own, this command queries the database with *no* filters, and +could return thousands of entries. + +### Fields + +We can see what fields exist in a particular database with + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx fields mscape + ``` + +=== "Python" + + ```python + client.fields(project="mscape") + ``` + +### Filtering + +We can filter the returned records to just select the entries in the +database that we are interested in. For this example we'll see if we +can find any sequencing data performed for ZymoBIOMICS sources. These +are designed with [a particular +specification](https://files.zymoresearch.com/protocols/_d6300_zymobiomics_microbial_community_standard.pdf) +of DNA from eight bacteria and two yeasts. + +To select these samples, we can ask that the `control_type_details` +equals `zymo-mc_D6300`. 
+ +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 + ``` + +=== "Python" + + ```python + # client.filter returns a generator that we can iterate over + entries = client.filter(project="mscape", fields={"control_type_details": "zymo-mc_D6300"}) + ``` + +This returns a small number of entries that we can more easily work +with. Note that this returns every field for each record that is +found, which can be much more information than we need. We can select +specific fields to include using e.g. + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 --include climb_id,biosample_id,taxon_reports + ``` + +=== "Python" + + ```python + query = {"control_type_details": "zymo-mc_D6300"} + fields_to_include = ["climb_id", "biosample_id" , "taxon_reports"] + # client.filter returns a generator that we can iterate over + entries = client.filter("mscape", fields=query, include=fields_to_include) + ``` + + +### Taxonomic information + +By default, the filter command will not return taxonomic +information. To access that information for an individual record use the `get` command. + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx get mscape + ``` + +=== "Python" + + ```python + record = client.get(project="mscape", climb_id=) + ``` +where `` is replaced with the CLIMB ID of the record you +want to retrieve. +This will you give you all the information about a particular record +including binned reads and all classifier calls. + +## Tips + +### `jq` + +If you are using the CLI, you may find [`jq`](https://jqlang.org) +useful. `jq` can be installed into your conda environment + +```console +(onyx) jovyan:~$ conda install jq ``` -Now, let's add columns for the total number of base pairs associated with -each taxon and what proportion that is of the total. 
-```py -df['gDNA'] = df['n_reads']*df['mean_len'] -df['proportion'] = df['gDNA']/df['gDNA'].sum() +You can then pipe the output of your onyx queries +e.g. `onyx filter ...` into `jq` using the pipe operator `|`. +This will colourise the output and may make reading the data easier. +```console +(onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 | jq ``` -Finally, let's make a rough plot with a black dashed line at 12%. -```py -import matplotlib.pyplot as plt +`jq` has many powerful features, including filtering, selecting, and formatting data. -plt.plot(df['human_readable'], df['proportion']*100, 'o') -plt.axhline(12, c='k', ls='--'); -plt.xticks(rotation=22.5, ha='right'); -``` -![Measured gDNA proportions of a Zymo community](./zymo-comparison.png) +### Python context manager + +If you are using the Python client, and performing more than one query to +the onyx database in a single code block e.g. in a `for` loop. Then we +recommend you use the `OnyxClient` as a context manager. -There are some clear discrepancies—*Pseudomonas aeruginosa* is -underreported and *Bacillus spizizenii* is overreported—but this -matches results by e.g. [Nicholls et -al. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6520541/). +```python +# ... +# Setup omitted +# ... +client = OnyxClient(config=config) + +# Perform several onyx operations in this block +with client: + # Get the first entry in the database for the mscape project + first_entry = next(client.filter(project="mscape")) + + # Get the CLIMB ID of the entry + climb_id = first_entry["climb_id"] + + # Get the full record for this CLIMB ID using the `get` method + full_record = client.get(project="mscape", climb_id=climb_id) + + # Count the number of taxa_files + n_taxa_files = len(full_record["taxa_files"]) + print(f"CLIMB_ID: {climb_id} has {n_taxa_files} taxa files") +``` -This short example is intended as a basic demonstration of what's -possible in CLIMB-TRE. 
We're always interested to hear more examples -of research questions that CLIMB-TRE can answer, so let us know if you -have an example that could be included as a guide for others. +This is more efficient that not using the context manager as the +client will re-use the same session for all requests, rather than +creating a new session for each request. For more information, see: + From 7b870267f8d9626d1a1fabf374bd06bcc10f6b50 Mon Sep 17 00:00:00 2001 From: "Thomas Neep (Advanced Research Computing)" Date: Wed, 20 Aug 2025 14:55:58 +0100 Subject: [PATCH 2/6] Load required extensions --- mkdocs.yml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mkdocs.yml b/mkdocs.yml index 3a3f2f2..a3c9e4e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -30,6 +30,7 @@ theme: # - navigation.top - toc.integrate - content.code.copy + - content.tabs.link plugins: - search @@ -53,6 +54,7 @@ plugins: verbose: true markdown_extensions: + - admonition - attr_list - pymdownx.highlight: anchor_linenums: true @@ -61,6 +63,8 @@ markdown_extensions: - pymdownx.inlinehilite - pymdownx.snippets - pymdownx.superfences + - pymdownx.tabbed: + alternate_style: true - pymdownx.magiclink - toc: permalink: true From c9be60460c97c535a74723c70d27ec4c05a9edd0 Mon Sep 17 00:00:00 2001 From: "Thomas Neep (Advanced Research Computing)" Date: Thu, 21 Aug 2025 11:44:22 +0100 Subject: [PATCH 3/6] Add next steps --- docs/analyse.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/analyse.md b/docs/analyse.md index c11e425..3a58813 100644 --- a/docs/analyse.md +++ b/docs/analyse.md @@ -184,7 +184,6 @@ specific fields to include using e.g. 
entries = client.filter("mscape", fields=query, include=fields_to_include) ``` - ### Taxonomic information By default, the filter command will not return taxonomic @@ -257,3 +256,8 @@ This is more efficient that not using the context manager as the client will re-use the same session for all requests, rather than creating a new session for each request. For more information, see: + +## Next steps + +Complete documentation of Onyx for both the CLI and Python API can be +found [here](https://CLIMB-TRE.github.io/onyx-client/). From 252e884132542b63a69013d00ab16f533d17406f Mon Sep 17 00:00:00 2001 From: "Thomas Neep (Advanced Research Computing)" Date: Wed, 10 Sep 2025 16:37:09 +0100 Subject: [PATCH 4/6] WIP comments from Tom B --- docs/analyse.md | 239 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 164 insertions(+), 75 deletions(-) diff --git a/docs/analyse.md b/docs/analyse.md index 3a58813..20669a3 100644 --- a/docs/analyse.md +++ b/docs/analyse.md @@ -4,8 +4,8 @@ Once data and metadata have been ingested into the Onyx database, you can query it using the Onyx client, which provides a command line interface (CLI) -and Python API. This short example -demonstrates a few principal functions. More are described in the +and Python API. This tutorial is intended as a basic demonstration of what is +possible. All capabilities of the Onyx client can be found in the [`onyx-client` documentation](https://climb-tre.github.io/onyx-client/). This guide also assumes that you're using a Notebook Server on CLIMB, @@ -35,34 +35,34 @@ In each case you can choose between the Python API or the command-line interface ### Initial setup === "CLI" - No additional setup is required if you are running the CLI in a CLIMB - notebook. You can try running the command-line client with + No additional setup is required if you are running the CLI in a CLIMB + notebook. 
You can try running the command-line client with - ```console - (onyx) jovyan:~$ onyx - ``` - to see some of the options and commands available to you. + ```console + (onyx) jovyan:~$ onyx + ``` + to see some of the options and commands available to you. === "Python" - If you are using onyx in Python, then you need to import the required modules and configure a client. - ```python - import os - from onyx import OnyxConfig, OnyxEnv, OnyxClient + If you are using onyx in Python, then you need to import the required modules and configure a client. + ```python + import os + from onyx import OnyxConfig, OnyxEnv, OnyxClient - config = OnyxConfig( - domain=os.environ[OnyxEnv.DOMAIN], - token=os.environ[OnyxEnv.TOKEN], - ) + config = OnyxConfig( + domain=os.environ[OnyxEnv.DOMAIN], + token=os.environ[OnyxEnv.TOKEN], + ) - client = OnyxClient(config=config) - ``` + client = OnyxClient(config=config) + ``` - !!! note + !!! note - In all the Python API examples, arguments will be - explicitly passed as keyword arguments e.g. `arg=value`, - however, in all cases shown on this page, the argument names - can be omitted. + In all the Python API examples, arguments will be + explicitly passed as keyword arguments e.g. `arg=value`, + however, in all cases shown on this page, the argument names + can be omitted. 
### Profile @@ -70,15 +70,15 @@ You can view information about your profile (username, site, and email) with === "CLI" - ```console - (onyx) jovyan:~$ onyx profile - ``` + ```console + (onyx) jovyan:~$ onyx profile + ``` === "Python" - ```python - client.profile() - ``` + ```python + client.profile() + ``` ### Projects @@ -86,15 +86,15 @@ You can view the projects you have access to with === "CLI" - ```console - (onyx) jovyan:~$ onyx projects - ``` + ```console + (onyx) jovyan:~$ onyx projects + ``` === "Python" - ```python - client.projects() - ``` + ```python + client.projects() + ``` ## Querying data @@ -109,16 +109,16 @@ To see every entry in the entire database for a particular project we can do === "CLI" - ```console - (onyx) jovyan:~$ onyx filter mscape - ``` + ```console + (onyx) jovyan:~$ onyx filter mscape + ``` === "Python" - ```python - # client.filter returns a generator that we can iterate over - entires = client.filter(project="mscape") - ``` + ```python + # client.filter returns a generator that we can iterate over + entries = client.filter(project="mscape") + ``` On its own, this command queries the database with *no* filters, and could return thousands of entries. @@ -129,15 +129,15 @@ We can see what fields exist in a particular database with === "CLI" - ```console - (onyx) jovyan:~$ onyx fields mscape - ``` + ```console + (onyx) jovyan:~$ onyx fields mscape + ``` === "Python" - ```python - client.fields(project="mscape") - ``` + ```python + client.fields(project="mscape") + ``` ### Filtering @@ -153,16 +153,16 @@ equals `zymo-mc_D6300`. 
=== "CLI" - ```console - (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 - ``` + ```console + (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 + ``` === "Python" - ```python - # client.filter returns a generator that we can iterate over + ```python + # client.filter returns a generator that we can iterate over entries = client.filter(project="mscape", fields={"control_type_details": "zymo-mc_D6300"}) - ``` + ``` This returns a small number of entries that we can more easily work with. Note that this returns every field for each record that is @@ -171,18 +171,35 @@ specific fields to include using e.g. === "CLI" - ```console - (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 --include climb_id,biosample_id,taxon_reports - ``` + ```console + (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 --include climb_id,biosample_id,taxon_reports + ``` === "Python" - ```python - query = {"control_type_details": "zymo-mc_D6300"} - fields_to_include = ["climb_id", "biosample_id" , "taxon_reports"] - # client.filter returns a generator that we can iterate over + ```python + query = {"control_type_details": "zymo-mc_D6300"} + fields_to_include = ["climb_id", "biosample_id" , "taxon_reports"] + # client.filter returns a generator that we can iterate over entries = client.filter("mscape", fields=query, include=fields_to_include) - ``` + ``` + +Likewise, should we want to *exclude* certain fields, that is also possible + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 --exclude batch_id,study_id + ``` + +=== "Python" + + ```python + query = {"control_type_details": "zymo-mc_D6300"} + fields_to_exclude = ["batch_id", "study_id"] + # client.filter returns a generator that we can iterate over + entries = client.filter("mscape", fields=query, exclude=fields_to_exclude) + ``` ### Taxonomic information @@ -191,20 
+208,92 @@ information. To access that information for an individual record use the `get` c === "CLI" - ```console - (onyx) jovyan:~$ onyx get mscape - ``` + ```console + (onyx) jovyan:~$ onyx get mscape + ``` === "Python" - ```python - record = client.get(project="mscape", climb_id=) - ``` + ```python + record = client.get(project="mscape", climb_id=) + ``` where `` is replaced with the CLIMB ID of the record you want to retrieve. This will you give you all the information about a particular record including binned reads and all classifier calls. +### Accessing data from s3 buckets + +You can also use the Onyx client to find the `s3` path where the taxon +reports are stored. These can then be directly downloaded for further analysis. + +=== "CLI" + + ```console + (onyx) jovyan:~$ onyx filter mscape --field control_type_details=zymo-mc_D6300 --include "taxon_reports" + [ + { + "taxon_reports": "s3://mscape-published-taxon-reports/CLIMB_ID_1/" + }, + { + "taxon_reports": "s3://mscape-published-taxon-reports/CLIMB_ID_2/" + }, + { + "taxon_reports": "s3://mscape-published-taxon-reports/CLIMB_ID_3/" + } + ] + ``` + where `CLIMB_ID_i` will be CLIMB ID of the sample. + These can be inspect and downloaded using either of the `s3cmd` or `aws s3` commands. 
+ For example + ```console + (onyx) jovyan:~$ s3cmd ls s3://mscape-published-taxon-reports/CLIMB_ID_1/ + 2024-04-26 14:04 163K s3://mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken.json + 2024-04-26 14:04 28M s3://mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_assignments.tsv + 2024-04-26 14:04 457K s3://mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.json + 2024-04-26 14:04 133K s3://mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.txt + (onyx) jovyan:~$ s3cmd get s3://mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.txt + download: 's3://mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.txt' -> './CLIMB_ID_1_PlusPF.kraken_report.txt' [1 of 1] + 136562 of 136562 100% in 0s 988.65 KB/s done + ``` + +=== "Python" + + ```python + for i in client.filter("mscape", fields={"control_type_details": "zymo-mc_D6300"}, include=["taxon_reports"]): + print(i) + ``` + will give something like + ``` + {'taxon_reports': 's3://mscape-published-taxon-reports/CLIMB_ID_1/'} + {'taxon_reports': 's3://mscape-published-taxon-reports/CLIMB_ID_2/'} + {'taxon_reports': 's3://mscape-published-taxon-reports/CLIMB_ID_3/'} + ``` + Which can either be downloaded using the `s3cmd` or `aws s3` commands shown + in the CLI tab of this block, or using a python library capable of reading + from s3, such as [`s3fs`](https://s3fs.readthedocs.io). + ```python + import s3fs # Install into conda environment first! 
+ s3 = s3fs.S3FileSystem() + s3.ls("s3://mscape-published-taxon-reports/CLIMB_ID_1/") + ``` + which will show the files in that s3 path + ``` + ['mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken.json', + 'mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_assignments.tsv', + 'mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.json', + 'mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.txt'] + ``` + which you can then download using + ```python + s3.get_file("mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.txt", ".") + ``` + or read directly as if it were any other file on your system + ```python + with s3.open("mscape-published-taxon-reports/CLIMB_ID_1/CLIMB_ID_1_PlusPF.kraken_report.txt", "r") as f: + contents = f.read() # do something with the file contents + ``` + ## Tips ### `jq` @@ -238,16 +327,16 @@ client = OnyxClient(config=config) # Perform several onyx operations in this block with client: - # Get the first entry in the database for the mscape project + # Get the first entry in the database for the mscape project first_entry = next(client.filter(project="mscape")) - - # Get the CLIMB ID of the entry + + # Get the CLIMB ID of the entry climb_id = first_entry["climb_id"] - - # Get the full record for this CLIMB ID using the `get` method + + # Get the full record for this CLIMB ID using the `get` method full_record = client.get(project="mscape", climb_id=climb_id) - - # Count the number of taxa_files + + # Count the number of taxa_files n_taxa_files = len(full_record["taxa_files"]) print(f"CLIMB_ID: {climb_id} has {n_taxa_files} taxa files") ``` From bb2c2ee81890989cb94be07b0c4494ca4e554e2f Mon Sep 17 00:00:00 2001 From: "Thomas Neep (Advanced Research Computing)" Date: Thu, 11 Sep 2025 10:48:11 +0100 Subject: [PATCH 5/6] Add information about output formats --- docs/analyse.md | 49 +++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 39 insertions(+), 10 deletions(-) diff 
--git a/docs/analyse.md b/docs/analyse.md index 20669a3..82a5133 100644 --- a/docs/analyse.md +++ b/docs/analyse.md @@ -4,7 +4,7 @@ Once data and metadata have been ingested into the Onyx database, you can query it using the Onyx client, which provides a command line interface (CLI) -and Python API. This tutorial is intended as a basic demonstration of what is +and Python API. This tutorial is intended as a basic demonstration of what is possible. All capabilities of the Onyx client can be found in the [`onyx-client` documentation](https://climb-tre.github.io/onyx-client/). @@ -61,7 +61,7 @@ In each case you can choose between the Python API or the command-line interface In all the Python API examples, arguments will be explicitly passed as keyword arguments e.g. `arg=value`, - however, in all cases shown on this page, the argument names + however, in all cases shown on this page, the argument names can be omitted. ### Profile @@ -123,6 +123,35 @@ To see every entry in the entire database for a particular project we can do On its own, this command queries the database with *no* filters, and could return thousands of entries. +### Output formats + +The default behaviour of Onyx is to return data as JSON. If you prefer +your data to be in a different format then that is possible. + +=== "CLI" + + To get data in `csv` or `tsv` format, simply add the `--format ` + option to your filter command. For example, to get the data in csv format + rather than JSON, you can do + + ```console + (onyx) jovyan:~$ onyx filter mscape --format csv + ``` + +=== "Python" + + The Python client has [a method to write your data to a csv file](https://climb-tre.github.io/onyx-client/api/documentation/client/#onyx.OnyxClient.to_csv). + It can often be convenient to use a library like + [`pandas`](https://pandas.pydata.org) to perform analysis. 
You can easily create a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) like so + ```python + import pandas as pd # Install into conda environment first! + df = pd.DataFrame(client.filter(project="mscape")) + ``` + You can then write your data to + [any of the output formats](https://pandas.pydata.org/docs/user_guide/io.html) + supported by `pandas`. + ### Fields We can see what fields exist in a particular database with @@ -198,7 +227,7 @@ Likewise, should we want to *exclude* certain fields, that is also possible query = {"control_type_details": "zymo-mc_D6300"} fields_to_exclude = ["batch_id", "study_id"] # client.filter returns a generator that we can iterate over - entries = client.filter("mscape", fields=query, exclude=fields_to_exclude) + entries = client.filter(project="mscape", fields=query, exclude=fields_to_exclude) ``` ### Taxonomic information @@ -243,7 +272,7 @@ reports are stored. These can then be directly downloaded for further analysis. } ] ``` - where `CLIMB_ID_i` will be CLIMB ID of the sample. + where `CLIMB_ID_i` will be CLIMB ID of the sample. These can be inspect and downloaded using either of the `s3cmd` or `aws s3` commands. For example ```console @@ -260,7 +289,7 @@ reports are stored. These can then be directly downloaded for further analysis. === "Python" ```python - for i in client.filter("mscape", fields={"control_type_details": "zymo-mc_D6300"}, include=["taxon_reports"]): + for i in client.filter(project="mscape", fields={"control_type_details": "zymo-mc_D6300"}, include=["taxon_reports"]): print(i) ``` will give something like @@ -269,8 +298,8 @@ reports are stored. These can then be directly downloaded for further analysis. 
{'taxon_reports': 's3://mscape-published-taxon-reports/CLIMB_ID_2/'} {'taxon_reports': 's3://mscape-published-taxon-reports/CLIMB_ID_3/'} ``` - Which can either be downloaded using the `s3cmd` or `aws s3` commands shown - in the CLI tab of this block, or using a python library capable of reading + Which can either be downloaded using the `s3cmd` or `aws s3` commands shown + in the CLI tab of this block, or using a python library capable of reading from s3, such as [`s3fs`](https://s3fs.readthedocs.io). ```python import s3fs # Install into conda environment first! @@ -329,13 +358,13 @@ client = OnyxClient(config=config) with client: # Get the first entry in the database for the mscape project first_entry = next(client.filter(project="mscape")) - + # Get the CLIMB ID of the entry climb_id = first_entry["climb_id"] - + # Get the full record for this CLIMB ID using the `get` method full_record = client.get(project="mscape", climb_id=climb_id) - + # Count the number of taxa_files n_taxa_files = len(full_record["taxa_files"]) print(f"CLIMB_ID: {climb_id} has {n_taxa_files} taxa files") From 27de0d53e7eba1c58b8a62b726759ba9d99de1da Mon Sep 17 00:00:00 2001 From: "Thomas Neep (Advanced Research Computing)" Date: Thu, 11 Sep 2025 12:04:21 +0100 Subject: [PATCH 6/6] Use Tom B's suggested example --- docs/analyse.md | 33 +++++++++++++++++++++------------ 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/docs/analyse.md b/docs/analyse.md index 82a5133..79b156d 100644 --- a/docs/analyse.md +++ b/docs/analyse.md @@ -349,6 +349,7 @@ the onyx database in a single code block e.g. in a `for` loop. Then we recommend you use the `OnyxClient` as a context manager. ```python +from onyx.exceptions import OnyxHTTPError # ... # Setup omitted # ... 
@@ -356,18 +357,26 @@ client = OnyxClient(config=config) # Perform several onyx operations in this block with client: - # Get the first entry in the database for the mscape project - first_entry = next(client.filter(project="mscape")) - - # Get the CLIMB ID of the entry - climb_id = first_entry["climb_id"] - - # Get the full record for this CLIMB ID using the `get` method - full_record = client.get(project="mscape", climb_id=climb_id) - - # Count the number of taxa_files - n_taxa_files = len(full_record["taxa_files"]) - print(f"CLIMB_ID: {climb_id} has {n_taxa_files} taxa files") + try: + records = client.filter( + project="mscape", + fields={ + "control_type_details": "zymo-mc_D6300", + "published_date__range": ["2025-01-01", "2025-05-01"], + }, + include=["climb_id", "published_date", "taxon_reports"], + ) + + for record in records: + climb_id = record["climb_id"] + + full_record = client.get(project="mscape", climb_id=climb_id) + + n_taxa_files = len(full_record["taxa_files"]) + print(f"CLIMB_ID: {climb_id} has {n_taxa_files} taxa_files entries") + + except OnyxHTTPError as e: + print(e.response.json()) ``` This is more efficient that not using the context manager as the