ProLance

Part of the OpenProteo stack for proteomics raw-file access. ProLance is the columnar storage layer; vendor readers OpenWRaw, OpenTFRaw, and OpenTimsTDF feed it through openproteo-io.

A fast, columnar, memory-mapped mass spectrometry data store powered by Lance.

ProLance is a storage format, not an analysis framework. Its purpose is to serve as a drop-in replacement for mzML in existing proteomics workflows - one that is faster to read, cheaper to seek, and can hold many runs in a single directory. Researchers should be able to swap run.mzML for a ProLance store, retrieve spectra using familiar Python or Rust objects, and feed them directly into existing libraries without rewriting any analysis code.

Vision

                       ingest
   .raw (Thermo)  ─────────────┐
   .d/  (Bruker)  ─────────────┤
   .raw (Waters)  ─────────────┼──►  ProLance store  ──►  analysis libraries
   .mzML          ─────────────┘         │               (spectrum_utils,
                                         │                matchms, pyOpenMS,
                                    export │                ms2ml, ...)
                                         ▼
                                    .mzML (roundtrip)

There are three usage modes:

Ingest from vendor format. Use the open reader crates (OpenTFRaw, OpenTimsTDF, OpenWRaw) to parse proprietary binary files and write a ProLance store. No vendor SDK, no Wine, no COM interop.
Ingest from mzML. Parse existing mzML files and write a ProLance store. This is the zero-friction on-ramp for labs that already have mzML pipelines.
Export to mzML. Reconstruct a spec-valid mzML file from a ProLance store. The roundtrip mzML -> ProLance -> mzML must yield an effectively identical file: same spectra, same CV terms, same metadata, same ordering. This invariant is what makes ProLance safe to insert into existing pipelines without data loss.

Why Not mzML?

mzML is the HUPO-PSI community standard and it is well-supported, but it has structural properties that make it expensive at scale:

It is XML, which means sequential parsing. Seeking to scan 50,000 out of 200,000 requires either re-parsing from the start or maintaining an index file (mzML.index / mzMLb).
Every peak array is Base64-encoded inline in the XML, which inflates size and requires a decode step before any numerical work.
Cross-run queries (e.g., "all MS2 spectra with precursor 445.22 +/- 10 ppm across 40 LC-MS runs") require loading and parsing each file independently.
Profile-mode Orbitrap runs routinely exceed 1 GB. TimsTOF PASEF runs can reach 10-20 GB per acquisition.

Lance stores data as Apache Arrow columns in memory-mapped files. Filtering by retention time, MS level, or precursor m/z touches only the relevant columns, not the peak arrays. A region scan that would require parsing an entire mzML file can be answered by reading a fraction of the bytes.

Architecture

The workspace will follow the same three-crate pattern as BioLance:

prolance-core/     Arrow schema definitions, LanceDB table management,
                   shared types (Spectrum, Chromatogram, Run).

prolance-ms/       Ingest adapters.
                     - mzML reader (noodles or quick-xml + mzml CV table)
                     - OpenTFRaw adapter (Thermo .raw)
                     - OpenTimsTDF adapter (Bruker timsTOF .d)
                     - OpenWRaw adapter (Waters MassLynx .raw)
                   Each adapter normalises its source into the common
                   prolance_core::Spectrum type.

prolance-cli/      `prolance` binary.
                     ingest   - write a ProLance store from any source
                     query    - filter spectra by RT / MS level / precursor
                     xic      - extracted ion chromatogram across runs
                     compare  - cross-run TIC / XIC overlays
                     export   - reconstruct mzML from a ProLance store

A Python package (prolance-py, built with PyO3 + maturin) will expose the store via a thin API that returns objects compatible with common analysis libraries.

Schema

`runs` table

One row per ingested file.

column	Arrow type	notes
run_id	Utf8	SHA-256 of the source file path + mtime
source_path	Utf8
source_format	Utf8	"mzml", "thermo", "bruker", "waters"
instrument	Utf8	instrument model name from CV / metadata
start_time	Timestamp
spectrum_count	UInt32
ms1_count	UInt32
ms2_count	UInt32
run_metadata	Utf8 (JSON)	full mzML run-level CV params + attrs

`spectra` table

One row per spectrum. Peak data lives in list-typed columns so each spectrum is self-contained and can be retrieved atomically.

column	Arrow type	notes
run_id	Utf8
scan_num	UInt32	1-based, matches source
native_id	Utf8	mzML nativeID string
ms_level	UInt8	1, 2, ...
rt	Float64	seconds
tic	Float64	total ion current
base_peak_mz	Float64
base_peak_intensity	Float64
polarity	Int8	+1 / -1
centroided	Boolean	centroid vs. profile mode
precursor_mz	Float64	null for MS1
precursor_charge	Int8	null for MS1
precursor_intensity	Float64	null for MS1
isolation_window_lower	Float64	null for MS1
isolation_window_upper	Float64	null for MS1
activation	Utf8	HCD / CID / ECD / ETD / ...
collision_energy	Float32
inv_mobility	Float64	Bruker 1/K0; null for other vendors
mz_precision	UInt8	32 or 64 - original storage width
intensity_precision	UInt8	32 or 64 - original storage width
mz	LargeList	peak m/z values
intensity	LargeList	peak intensity values
cv_params	Utf8 (JSON)	all spectrum-level CV params
scan_window_lower	Float64
scan_window_upper	Float64

`chromatograms` table

mzML documents routinely include chromatogram traces (TIC, BPC, SRM/MRM transitions). These are structurally separate from spectra in the mzML schema and must be preserved for a lossless roundtrip.

column	Arrow type	notes
run_id	Utf8
chrom_id	Utf8	mzML id attribute
chrom_type	Utf8	TIC / BPC / SRM / ...
precursor_mz	Float64	null for TIC/BPC
product_mz	Float64	null for TIC/BPC
time	LargeList	time axis (seconds)
intensity	LargeList	intensity axis
cv_params	Utf8 (JSON)	all chromatogram-level CV params

Challenges and How to Address Them

1. CV term completeness

mzML is governed by the PSI-MS OBO ontology, which defines over 1,000 controlled vocabulary terms covering instrument components, scan types, data transformations, activation methods, and more. Any schema that hard-codes only the common terms will silently drop the rest on ingest, making a lossless roundtrip impossible.

Approach: Use a hybrid strategy. Map the ~30 terms that researchers routinely query (ms level, RT, precursor m/z, charge, activation, polarity, centroid flag) to typed Arrow columns so they are fast to filter. Store every other CV param for the spectrum as a JSON string in the cv_params column. On mzML export, the typed columns are re-emitted as CV params with their correct accession numbers, and the cv_params blob is written out verbatim. Nothing is lost even for obscure instrument-specific terms.

The same pattern applies at the run level: the full <fileDescription>, <sampleList>, <softwareList>, <scanSettingsList>, <instrumentConfigurationList>, and <dataProcessingList> blocks are stored as JSON in runs.run_metadata and written back character-for-character on export.

2. Binary precision preservation

mzML binary arrays can be stored as 32-bit or 64-bit IEEE 754 floats, and some files use MS-Numpress lossless or lossy integer encodings. Internally, ProLance stores all m/z values as Float64 (sufficient for any precision), but the mz_precision and intensity_precision columns record the original bit width. On export, values are downcast to the original width before Base64-encoding, preserving the roundtrip value to within the original precision without storing redundant 32-bit copies.

MS-Numpress re-encoding is deterministic for the lossless variants (Linear and Pic), so re-applying the same algorithm on export will yield identical encoded bytes. If the source file used Numpress, the cv_params blob will record that fact and the exporter will apply the matching encoder.

3. Ion mobility (4D data from Bruker TIMS)

Bruker timsTOF PASEF acquisitions add a full ion mobility dimension. Each "frame" in the TDF format is a collection of scans across the TIMS ramp, each scan being a TOF spectrum at a specific inverse-mobility (1/K0) value. This does not fit the flat mzML spectrum model cleanly.

mzML 1.1.3+ defines a convention for ion mobility (CV term MS:1002476 for mean 1/K0, MS:1003008 for scan-level IM arrays), but support across readers is inconsistent. There are two representation options:

Collapsed (recommended for most workflows): Sum or select peaks across the mobility dimension at ingest, producing conventional MS1/MS2 spectra with an inv_mobility centroid value. Matches what most downstream tools expect.
Expanded (full-fidelity): Store one row per TIMS scan, with an inv_mobility value per row. The scan_num + run_id + inv_mobility tuple uniquely identifies each record. This preserves the full 4D dataset at the cost of row count.

ProLance should support both modes as an ingest flag (--tims-mode collapsed|expanded). The mzML exporter will use the expanded-mode representation when exporting TIMS data.

4. Chromatograms

mzML <chromatogramList> is routinely populated by instruments with TIC, BPC, and SRM/MRM transition traces. It is separate from the spectrum list and cannot be mapped into the spectra table without information loss.

The chromatograms table above addresses this directly. On mzML ingest, chromatograms are parsed and written to that table. On export, they are reconstructed and written as a <chromatogramList> block. Any tool that expects chromatograms in its mzML output will still find them.

5. Data processing lineage

mzML documents record what software transformations have been applied to the data (centroiding, deisotoping, denoising, calibration) in a <dataProcessingList>. This block affects how consumers interpret the data (e.g., whether to apply their own centroiding) and must be preserved.

The run_metadata JSON column stores the full <dataProcessingList> XML fragment. The exporter writes it back verbatim. If ProLance itself transforms data during ingest (e.g., vendor-to-mzML conversion adds a processing step), a new <processingMethod> entry is appended recording the conversion.

6. Vendor metadata beyond mzML

The three vendor readers expose metadata that does not have a direct mzML CV term mapping: Thermo filter strings (FTMS + p ESI Full ms [300.00-1500.00]), Waters function types, Bruker accumulation times. This information is useful for debugging and for re-creating accurate mzML output.

Each vendor adapter will serialize its source-specific scan metadata into the cv_params JSON column using a vendor: namespace prefix for fields with no PSI-MS accession. On mzML export, fields with a real accession are emitted as <cvParam> elements; vendor-namespaced fields are emitted as <userParam> elements. mzML allows <userParam> anywhere CV params appear, so no information is lost and the file remains schema-valid.

7. Library interoperability (Python)

The primary consumer surface is Python. The most widely-used spectrum libraries - spectrum_utils, matchms, ms2ml, pyOpenMS, pyteomics - each define their own spectrum object. None of them read Lance natively.

The prolance-py Python package will expose:

import prolance

store = prolance.open("runs/")

# Iterator of spectrum_utils.spectrum.MsmsSpectrum objects
for spec in store.ms2_spectra(run="run01", rt_min=10.0, rt_max=20.0):
    ...

# matchms.Spectrum objects
for spec in store.as_matchms(run="run01"):
    ...

# Direct numpy arrays for custom code
mz, intensity = store.peaks(run="run01", scan=5432)

The adapters are thin wrappers that copy the numpy arrays from the Arrow columns directly into the target library's data structures, with zero additional parsing.

8. Large file streaming

A single Orbitrap run can be 1-2 GB of mzML; a full timsTOF PASEF experiment can exceed 20 GB. Loading everything into RAM before writing is not feasible.

Ingest will stream spectra in batches (default 10,000 spectra per batch) and write each batch as a Lance fragment. Lance's columnar chunking means each fragment is independently readable; no seek across the whole store is needed for a range query. The CLI will report progress via an indicatif progress bar, matching the BioLance pattern.

9. Roundtrip correctness as a first-class invariant

The mzML -> ProLance -> mzML roundtrip is not an afterthought - it is the primary correctness guarantee that makes ProLance safe to deploy as an infrastructure swap. It should be tested with a suite of real mzML files spanning:

Thermo Orbitrap (DDA and DIA)
Bruker timsTOF PASEF
Waters HDMS (ion mobility)
SRM/MRM files with chromatogram lists
Files with both 32-bit and 64-bit arrays
Files with Numpress-encoded arrays
Files with unusual or instrument-vendor-specific CV terms

The test suite will parse the exported mzML with an independent parser (e.g., mzdata or psims) and assert equality of spectrum count, CV params, retention times, precursor values, and peak arrays to within floating-point precision of the original storage width.

Dependency Map

prolance-core
  lancedb, arrow-array, arrow-schema, serde_json, anyhow, thiserror

prolance-ms
  prolance-core
  opentfraw   (Thermo .raw)
  OpenTimsTDF     (Bruker timsTOF .d)
  openwraw    (Waters .raw)
  quick-xml   (mzML ingest)
  base64      (binary array decoding)

prolance-cli
  prolance-core, prolance-ms
  clap, tokio, rayon, indicatif

prolance-py  (separate maturin crate)
  prolance-core, prolance-ms
  pyo3, numpy

No vendor SDK, no Wine, no COM server, no proprietary runtime dependency of any kind.

Relation to Existing Projects

project	role
OpenTFRaw	Thermo .raw parser; ingest adapter source
OpenTimsTDF	Bruker timsTOF parser; ingest adapter source
OpenWRaw	Waters MassLynx parser; ingest adapter source
BioLance	Conceptual template (VCF -> Lance); sibling project
ProLance	Storage layer; delegates parsing to the above

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github/workflows		.github/workflows
crates		crates
docs		docs
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ProLance

Vision

Why Not mzML?

Architecture

Schema

`runs` table

`spectra` table

`chromatograms` table

Challenges and How to Address Them

1. CV term completeness

2. Binary precision preservation

3. Ion mobility (4D data from Bruker TIMS)

4. Chromatograms

5. Data processing lineage

6. Vendor metadata beyond mzML

7. Library interoperability (Python)

8. Large file streaming

9. Roundtrip correctness as a first-class invariant

Dependency Map

Relation to Existing Projects

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ProLance

Vision

Why Not mzML?

Architecture

Schema

runs table

spectra table

chromatograms table

Challenges and How to Address Them

1. CV term completeness

2. Binary precision preservation

3. Ion mobility (4D data from Bruker TIMS)

4. Chromatograms

5. Data processing lineage

6. Vendor metadata beyond mzML

7. Library interoperability (Python)

8. Large file streaming

9. Roundtrip correctness as a first-class invariant

Dependency Map

Relation to Existing Projects

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`runs` table

`spectra` table

`chromatograms` table

Packages