#############
Web Interface
#############

This section guides you through the process of using Workflomics, from accessing the platform to generating and benchmarking workflows. Workflomics offers an intuitive web interface that enables users to efficiently create, compare, and optimize computational workflows for bioinformatics research.

.. note::

   A detailed video tutorial is available on YouTube:
   `Workflomics Web Interface Walkthrough <https://www.youtube.com/watch?v=9BdQCJl_6gc>`_

Accessing Workflomics
*********************

Choose the Domain
=================

A :term:`domain` in Workflomics represents a specific application area within bioinformatics, such as proteomics, imaging, or genomics, and determines which tools, data formats, and operations are available for workflow generation.

When you arrive at the domain selection screen, you’ll see a list of available domains along with key details for each one:

.. image:: screenshots/domain-table.png
   :alt: Domain selection interface with executable and non-executable domains
   :align: center
   :scale: 45%

**How to choose a domain**

- Select **Proteomics** if your research involves protein sequences, mass spectrometry data (e.g., `mzML`, `FASTA`), and GO term enrichment.
- Use **Proteomics (non-executable)** if you are experimenting or the executable domain returns no results. This domain includes more tools and operations, but its workflows cannot be executed or benchmarked.
- Select **MSI** if your input data involves mass spectrometry imaging.

The first step in generating workflows is to select the domain of your research. By specifying the domain, you focus the workflow generation process on the most relevant tools and methods for your research area.

.. figure:: ./screenshots/domain.png
   :align: center
Choose Workflow Inputs and Outputs
===================================

Before generating workflows, you must specify the desired inputs and outputs. This initial step is crucial, as it defines the scope and objectives of the computational task. Each *input* and *output* is defined as a pair of data **type** and **format**, using :term:`EDAM Ontology` terms (see the `EDAM Ontology browser <https://edamontology.github.io/edam-browser/#data_0006>`_). The data types and formats provided in Workflomics are tailored to the proteomics domain and the available tools; therefore, you can only select types and formats that are supported by the tools in that domain.

.. figure:: ./screenshots/inputs.png
   :align: center
   :alt: Workflow Inputs

   Web interface for specifying the available workflow inputs and desired workflow outputs.

To load the example inputs and outputs, click on the "Load Example" button. The goal of this example is to create workflows designed to detect overrepresented biochemical pathways, molecular functions, and cellular components from mass spectrometry data on proteins or their digested peptides, as defined in the Gene Ontology (GO). To do this, you need two types of files: a *mass spectrometry* data file in the HUPO-PSI standard `mzML` format and a `FASTA` file that includes protein sequences. The `FASTA` file not only provides the amino acid sequences for matching to mass spectra but also includes gene and protein names necessary for fetching GO annotations. The aim is to produce *over-representation* data in a structured text format, such as *JSON*.

Based on the problem description, the inputs specified are **Mass spectrum** in **mzML** format, and **Protein sequence** in **FASTA** format, while the output is **Over-representation data** in a **JSON** format.
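As a sketch, the example specification above can be written out as plain data, with each input and output given as a (type, format) pair. This is an illustrative representation only, not the platform's internal schema:

```python
# Illustrative (type, format) pairs for the example. The labels are the
# EDAM terms as shown in the interface, not formal EDAM identifiers.
example_spec = {
    "inputs": [
        {"type": "Mass spectrum", "format": "mzML"},
        {"type": "Protein sequence", "format": "FASTA"},
    ],
    "outputs": [
        {"type": "Over-representation data", "format": "JSON"},
    ],
}

# Every entry must carry both a data type and a format.
for entries in example_spec.values():
    for pair in entries:
        assert {"type", "format"} <= set(pair)
```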


Web interface for visualising design-time benchmarks of the candidate workflows.

Each workflow is accompanied by :term:`design-time filtering` benchmarks that provide information about the quality of the tools used in the workflow. These benchmarks are obtained from the bio.tools and OpenEBench APIs and include the following:

- **OS Compatibility**: Understanding tool compatibility with different operating systems (Linux, macOS, MS Windows) is crucial for users who require their pipelines to run directly on designated machines with accessible tools. While containerized environments can mitigate compatibility issues, direct compatibility remains essential for certain scenarios due to performance or specific use-case requirements. The OS compatibility, obtained from bio.tools, is provided on the tool level and aggregated by the count of tools that support each operating system.
- **License**: The openness of the software license is a crucial factor in selecting tools for workflows. Open-source tools are generally preferred due to their transparency, allowing users to inspect and verify code for security and integrity, customizability, and community support. Licenses can be OSI-approved, open, closed, or unknown. License information, provided on the tool level and aggregated by the count of open licenses, is sourced from OpenEBench.
Scientific benchmarks
---------------------

The scientific benchmarks are domain- and operation-specific. For instance, in workflows involving *protein identification*, we provide benchmarks such as the number of proteins identified (see column Proteins). Similarly, for workflows that perform *enrichment analysis*, we measure the number of GO terms identified (see column GO-Terms). Unlike run-time benchmarks, scientific benchmarks are tailored to specific tools and their unique functions within the workflow. The provided figures do not include scientific benchmarks; however, they are available in the live demo.
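As a sketch, counting such metrics from identification results might look like the following. The records are hypothetical; the real values are computed by the Workflomics benchmarker, not by this snippet:

```python
# Hypothetical identification records: one entry per identification,
# each with a protein accession and its associated GO terms.
results = [
    {"protein": "P12345", "go_terms": ["GO:0008150", "GO:0003674"]},
    {"protein": "P67890", "go_terms": ["GO:0008150"]},
    {"protein": "P12345", "go_terms": ["GO:0005575"]},
]

# "Proteins" column: number of distinct proteins identified.
n_proteins = len({r["protein"] for r in results})

# "GO-Terms" column: number of distinct GO terms found.
n_go_terms = len({t for r in results for t in r["go_terms"]})
```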

Troubleshooting: No Workflows Found
===================================

Sometimes, no candidate workflows are returned after specifying your inputs, outputs, and constraints. This can happen for a variety of reasons, and it does not necessarily mean something is wrong with your setup.

.. image:: screenshots/no-workflows.png
   :alt: Screenshot showing empty workflow generation result
   :align: center
   :scale: 40%

Below are some suggestions to help resolve the issue:

**Common Causes**

- Too many constraints have been applied (e.g., requiring multiple specific operations)
- An input/output combination is not supported by the tools in the selected domain
- The workflow length or timeout is too restrictive
- The selected tools are incompatible based on data formats or EDAM terms

**What You Can Try**

- **Start simple**: Try with fewer constraints and only the essential input/output specifications.
- **Use “Load Example”**: This will preload a well-tested input/output combination and give you a working baseline.
- **Switch to the non-executable domain**: This domain includes more tools and may generate workflows even if some steps cannot be executed.
- **Adjust generation parameters**: Increase the maximum workflow length or timeout to allow the generator more room to find solutions.

.. note::

   If you're consistently getting no workflows, start by testing the example setup provided. Once you see a valid workflow, gradually add your desired constraints one at a time.
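The constraint-by-constraint strategy above can be sketched as a loop: keep a constraint only if the generator still returns candidates with it applied. Here ``generate_workflows`` is a hypothetical stand-in for re-running generation in the web interface:

```python
def narrow_down(base_spec, desired_constraints, generate_workflows):
    """Keep only the constraints that still yield at least one workflow.

    `generate_workflows` is a hypothetical callable standing in for a
    run of the Workflomics generator.
    """
    accepted = []
    for constraint in desired_constraints:
        candidates = generate_workflows(base_spec, accepted + [constraint])
        if candidates:
            accepted.append(constraint)  # compatible with the rest, keep it
        else:
            # This constraint empties the result set; report and drop it.
            print(f"Constraint eliminates all workflows: {constraint}")
    return accepted
```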


.. _try-it-yourself:

Try It Yourself: Building & Evaluating a Protein Identification Workflow
========================================================================

In this hands-on section, you'll walk through a real example of building, refining, and evaluating automated workflows using **Workflomics**. You'll begin with a minimal setup to identify proteins from spectral data and iteratively build up to more complex, biologically meaningful workflows — such as those that validate proteins and perform functional enrichment.

The goal is to familiarize yourself with how **inputs, outputs, constraints, and tool selection** affect the workflow design, and how scientific and runtime metrics can guide tool choice.

**Aim:**
Start with the basic task of identifying peptides from mass spectrometry data and build toward a workflow that outputs validated proteins with biological interpretation (e.g., GO term enrichment).

.. note::

   This example uses the **Proteomics (non-executable)** domain, which offers a broader set of tools for exploration.

Step 1: Start with a Minimal Workflow
-------------------------------------

1. **Go to the Workflomics demo site**: https://workflomics.soft.cs.uni-potsdam.de/
2. Click **Explore**.
3. Choose the **Proteomics (non-executable)** domain.
4. Under **Inputs and Outputs**, specify:

   - Input 1: ``Mass spectrum`` (``mzML``)
   - Input 2: ``Protein sequence`` (``FASTA``)
   - Output: ``Peptide identification`` (``protXML``)

5. Leave constraints empty and click **Generate Workflows**.

This will produce basic peptide identification workflows using a combination of tools such as **Comet**, **SearchGUI**, **Tandem2XML**, or **FireProt ASR**, often followed by validation or protein-level inference using tools like **ProteinProphet**.

Step 2: Add Constraints to Increase Complexity
----------------------------------------------

Begin layering meaningful constraints:

- **Use operation in the solution** → ``Validation``

  Ensures that peptide and protein results are quality-checked using tools like ``PeptideProphet``, ``ProteinProphet``, or ``MFPaq``.

- **Do not use operation in the solution** → ``Protein quantification``

  Keeps the workflow focused on identification and interpretation, avoiding abundance estimation tools like ``StPeter``.

- **Use operation in the solution** → ``Protein identification``

  Forces the pipeline to go beyond peptide-level analysis and include tools such as ``ProteinProphet`` or ``XTandemPipeline`` for protein inference.

.. note::

   These constraints do not enforce order — they simply ensure the required operations are **included somewhere** in the workflow. If you want to enforce specific ordering (e.g., validation before enrichment), you can use *"Use operations sequentially"* instead.
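The three constraints above can be thought of as (template, operation) pairs; the template labels mirror the web interface and are not a formal API:

```python
# The three constraints from this step, encoded as illustrative
# (template, EDAM operation) pairs.
constraints = [
    ("Use operation in the solution", "Validation"),
    ("Do not use operation in the solution", "Protein quantification"),
    ("Use operation in the solution", "Protein identification"),
]

# Inclusion constraints are unordered: they only require the operation
# to appear somewhere in the workflow.
required = [op for tpl, op in constraints if tpl == "Use operation in the solution"]
```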

Update the configuration to allow room for richer workflows:

- Min # of steps: 3
- Max # of steps: 15
- Timeout: 120 seconds
- Number of workflows (max): 4

Then click **Generate Workflows** again.
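Gathered as a single (hypothetical) configuration object, with the sanity check the interface itself enforces, the parameters above look like:

```python
# Hypothetical field names; the interface exposes these as form fields,
# not as a configuration file.
generation_config = {
    "min_steps": 3,
    "max_steps": 15,
    "timeout_seconds": 120,
    "max_workflows": 4,
}

# The minimum workflow length may not exceed the maximum.
assert generation_config["min_steps"] <= generation_config["max_steps"]
```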

Step 3: Explore Design-Time Metrics
-----------------------------------

Click the **Benchmarks** tab to inspect:

- Number of tools per OS (Linux, macOS, Windows)
- License types (open vs. closed)
- Median number of citations per workflow

Use these metrics to compare and select workflows for benchmarking.
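One possible comparison policy, sketched with made-up metric values (the names and numbers are illustrative, not Workflomics output): prefer workflows with more open licenses, breaking ties by median citations.

```python
# Hypothetical design-time summaries for two candidate workflows.
candidates = [
    {"name": "Workflow 1", "open_licenses": 5, "median_citations": 120},
    {"name": "Workflow 2", "open_licenses": 6, "median_citations": 80},
]

# Rank by open licenses first, citations second.
best = max(candidates, key=lambda w: (w["open_licenses"], w["median_citations"]))
```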

Step 4: Execute and Benchmark
-----------------------------

1. Download selected workflows in **CWL format**.
2. Follow the `Benchmarker Guide <https://workflomics.readthedocs.io/en/latest/workflomics-benchmarker/benchmarker-overview.html>`_:

   - Run them locally using ``cwltool``
   - Collect runtime metrics (execution time, memory, logs)
   - Extract scientific metrics (e.g., number of proteins or GO terms)

3. Upload your ``benchmarks.json`` file to the **Upload Benchmark Results** page.

.. note::

   You can toggle between runtime and scientific metrics in the interface to evaluate workflow quality and correctness.

Step 5: Go Even Deeper (Bonus Scenarios)
----------------------------------------

Try expanding your workflow further by exploring:

- **Output transformation**:

  - Output peptide identification in other formats (e.g., ``mzIdentML``)

- **Biological interpretation**:

  - Add ``Enrichment analysis`` to interpret proteins via GO terms or pathways
  - Adjust the output to ``Over-representation data (JSON)`` to support enrichment tools like ``gProfiler`` or ``GOEnrichment``

- **Tool preference**:

  - Require a specific search engine (e.g., ``Comet`` or ``MSFragger``)

- **Constraint order**:

  - Use ``Validation → Enrichment analysis`` **sequentially**, so enrichment occurs only after validated protein ID

- **Compare domains**:

  - Switch to the **executable** domain and evaluate whether tool availability or execution readiness differs

Step 6: Reflection
------------------

After running and evaluating your workflows, reflect on the following:

- How did changing constraints affect tool composition and workflow length?
- Did validation and enrichment tools make the workflow longer, slower, or more informative?
- Were your final decisions driven more by design-time metrics (like citations and licensing) or runtime/scientific results (like # of GO terms)?


############
Glossary
############

.. glossary::

   Workflow
      A sequence of computational steps or tools used to process data for a specific research task.

   Workflomics
      A framework for generating, benchmarking, and selecting optimal data analysis workflows.

   Domain
      A specific application area within bioinformatics (e.g., proteomics, imaging, genomics) that determines which tools, data formats, and operations are available for workflow generation.

   Benchmarking
      Evaluation of workflows based on quality, performance, and scientific output.

   EDAM Ontology
      A bioinformatics ontology that classifies operations, data types, and formats used in tools.

   APE (Automated Pipeline Explorer)
      A tool that generates workflows based on formal problem definitions and domain annotations.

   CWL (Common Workflow Language)
      A standard format for describing computational workflows, ensuring interoperability.

   Design-time Filtering
      Pre-execution filtering of workflows using static metadata such as OS compatibility, license, and citations.

   Runtime Evaluation
      Assessment of workflows during execution, tracking metrics such as time, memory use, and errors.

   Scientific Benchmarking
      Domain-specific validation of workflow outputs using benchmarks (e.g., proteins identified, GO terms).

   bio.tools
      A curated registry of bioinformatics tools annotated with EDAM terms.

   Proteomics
      The large-scale study of proteins, especially their structures and functions.

   GO Terms (Gene Ontology Terms)
      Standardized descriptors of gene product attributes, including biological processes and molecular functions.