feature: add proforma.modify_with #196

Open

mobiusklein wants to merge 13 commits into levitsky:master from mobiusklein:feature/more_combinators

Conversation

@mobiusklein
Contributor

part of #194

@levitsky
Owner

levitsky commented Mar 19, 2026

Thank you! I noticed a couple of places that seemed to need fixing (I hope this makes sense; otherwise feel free to flag or roll back), and then I wanted to add some tests.
Before we get to implementing the full combinatorial proteoform generation, I thought we already had the ability to specify the number of occurrences of an unlocalized modification via ^. While testing it, I noticed something strange: it seems to produce repeated peptidoforms.

>>> for pf in proforma.ProForma.parse('[Oxidation|Position:M]^2?PEMPTIMDE').peptidoforms():
...     print(pf)
...
PEM[Oxidation|Position:M]PTIM[Oxidation|Position:M]DE
PEM[Oxidation|Position:M]PTIM[Oxidation|Position:M]DE

This shows why a test I added is currently failing. I think the reason is simply that the ProForma parser supports the unlocalized modification count by adding multiple copies of the modification tag to the properties, and the combinator then treats those copies as distinct.

Do you see a way to fix this? As you can see, this is not a problem with the new function, but rather with the interplay between ProteoformCombinator and the parser.

However, what I was initially trying to do was to see whether we can support the count in the peptidoforms (modify_with) function in some way.
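The duplication described above can be reproduced with a small stdlib-only sketch (this is not the pyteomics internals; the site indices and the `"#1"`/`"#2"` copy labels are illustrative assumptions): if the two copies of the tag are distinct objects, the combinator produces one placement per ordering, and both orderings collapse to the same peptidoform.

```python
from itertools import permutations

# Hypothetical sketch of the suspected bug: the parser stores two copies of
# the tag for [Oxidation]^2?, and the combinator assigns each *copy* to a
# site, so swapping the two identical copies yields the same peptidoform twice.
sites = [2, 6]                            # the two M positions in PEMPTIMDE
copies = ["Oxidation#1", "Oxidation#2"]   # distinct objects, same modification

placements = [
    tuple(sorted((site, name.split("#")[0]) for site, name in zip(sites, perm)))
    for perm in permutations(copies)
]
print(placements[0] == placements[1])  # True: both orderings collapse

# Deduplicating on the canonical placement removes the repeat:
print(len(placements), len(set(placements)))  # 2 1
```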

@mobiusklein
Contributor Author

mobiusklein commented Mar 28, 2026

Well, that's a reminder to me to refresh my fork more often. Keeping proper track of repeated configurations was simple enough. The memory burden of tracking seen configurations should be small outside of some high-complexity cases, like trying to place dozens of different modifications on a full-length protein sequence, but that's a bad case either way.

I think the tests are failing due to a network timeout unrelated to ProForma.

@levitsky
Owner

levitsky commented Mar 28, 2026

Thanks! This is indeed a simple solution (I was thinking along the lines of building a Counter-like object of variable mods and then handling the counts). For both solutions to work, though, it appears that we need to make PositionModifierTag hashable, which it is not yet.

So with this change the semantics of include_unmodified seem to expand to actually mean "include any subset of modifications", is that right? This means that combinatorial application is already all but implemented; we just need to supply a maximum count equal to the number of eligible sites.

In [6]: seq = "[Phospho|Position:S|Position:T]^3?EMEVTSESPEK"

In [7]: pf = ProForma.parse(seq)

In [8]: for ppf in pf.proteoforms(include_unmodified=True):
   ...:     print(ppf)
   ...: 
EMEVTSESPEK
EMEVT[Phospho|Position:S|Position:T]SESPEK
EMEVTS[Phospho|Position:S|Position:T]ESPEK
EMEVTSES[Phospho|Position:S|Position:T]PEK
EMEVT[Phospho|Position:S|Position:T]S[Phospho|Position:S|Position:T]ESPEK
EMEVT[Phospho|Position:S|Position:T]SES[Phospho|Position:S|Position:T]PEK
EMEVTS[Phospho|Position:S|Position:T]ES[Phospho|Position:S|Position:T]PEK
EMEVT[Phospho|Position:S|Position:T]S[Phospho|Position:S|Position:T]ES[Phospho|Position:S|Position:T]PEK

This is great and perhaps include_unmodified should be renamed accordingly.
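The "any subset of eligible sites" reading can be checked with a stdlib sketch (a simplified `[Phospho]` tag stands in for the full `[Phospho|Position:S|Position:T]` tag; this is not the pyteomics implementation): with the maximum count equal to the number of sites, the enumeration reduces to taking every subset of the eligible S/T positions.

```python
from itertools import combinations

seq = "EMEVTSESPEK"
eligible = [i for i, aa in enumerate(seq) if aa in "ST"]   # positions 4, 5, 7

forms = []
for k in range(len(eligible) + 1):                         # 0..3 phosphos
    for chosen in combinations(eligible, k):
        forms.append("".join(
            aa + ("[Phospho]" if i in chosen else "")
            for i, aa in enumerate(seq)))

print(len(forms))  # 8, matching the eight proteoforms above
```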

One thing that puzzles me is that this behavior does not seem to extend to CoMUP modifications:

In [16]: seq = "[Phospho|Position:S|Position:T|comup|Limit:2]^2?EMEVTESPEK"
    ...: pf = ProForma.parse(seq)

In [17]: for ppf in pf.proteoforms(include_unmodified=False):
    ...:     print(ppf)
    ...: 
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2][Phospho|Position:S|Position:T|CoMUP|Limit:2]ESPEK
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2]ES[Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK
EMEVTES[Phospho|Position:S|Position:T|CoMUP|Limit:2][Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK

In [18]: for ppf in pf.proteoforms(include_unmodified=True):
    ...:     print(ppf)
    ...: 
EMEVTESPEK
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2]ESPEK
EMEVTES[Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2]ES[Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK

In this case including the unmodified peptidoform also excludes the two colocalized peptidoforms.
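For comparison, what a full multiset enumeration would look like can be sketched with the stdlib (the 0-based site indices for T and S in EMEVTESPEK are my own; this is not the pyteomics code path): placing co-localizable copies is a combinations-with-replacement choice capped by the per-site limit, and including every smaller count should yield six forms rather than the four shown above.

```python
from itertools import combinations_with_replacement

sites, limit = [4, 6], 2   # T and S in EMEVTESPEK; CoMUP with Limit:2

def placements(n_copies):
    return [c for c in combinations_with_replacement(sites, n_copies)
            if all(c.count(s) <= limit for s in sites)]

print(placements(2))  # [(4, 4), (4, 6), (6, 6)] -> the three forms above

# The "any subset" reading of include_unmodified would add the 0- and
# 1-copy placements as well:
total = sum(len(placements(k)) for k in range(3))
print(total)  # 6, versus the 4 forms produced above
```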

@mobiusklein
Contributor Author

Thank you for adding those extra test cases. I'll think about how I can improve the API for your first point about a flexible upper bound on the number of times a rule is applied.

To your point about a Counter, yes, that absolutely is needed to make co-localization work. In the seven (or more?) other combinatorial modification expanders I've written, I explicitly refused to include colocalization because it makes the search space intractable. It has its uses though.
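The hashable-state idea behind this can be sketched in a few lines of stdlib Python (the `state_key` helper is purely illustrative; in the library the tag objects themselves, such as PositionModifierTag, would need to be hashable for this to work): a co-localized placement is a multiset of (position, tag) pairs, and since a Counter is itself unhashable, a frozenset of its items serves as the seen-set key.

```python
from collections import Counter

# A placement with co-localization is a multiset of (position, tag) pairs;
# Counter is unhashable, so the dedup key is frozenset(counter.items()).
def state_key(placement):
    return frozenset(Counter(placement).items())

seen = set()
a = [(4, "Phospho"), (4, "Phospho")]   # two copies stacked on site 4
b = [(4, "Phospho"), (4, "Phospho")]   # same multiset, built separately
seen.add(state_key(a))
print(state_key(b) in seen)  # True: the duplicate state is recognized
```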

@mobiusklein
Contributor Author

The refactor to use a Counter to track previous states restores the CoMUP behavior that was previously lost.

I added a new expand_rules mode that can efficiently repeat modification rules on all positions, but it isn't compatible with co-localizing new variable modifications. There should be a way to get this working with CoMUP, but it's going to be complicated. If that is desirable, I can do it though.
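Why the expand_rules mode precludes co-localization can be seen in a small stdlib sketch of the idea (a simplification, not the actual implementation): the rule is effectively copied once per eligible site and each copy lands on a distinct site, so no site is ever modified more than once.

```python
from itertools import combinations

# Sketch of an expand_rules-style enumeration: each copy of the rule
# occupies a distinct site, so stacking (co-localization) never arises.
def expanded_placements(sites):
    for k in range(len(sites) + 1):
        yield from combinations(sites, k)

placed = list(expanded_placements([4, 5, 7]))
print(len(placed))                   # 8 placements over three sites
print(max(len(p) for p in placed))   # 3: at most one modification per site
```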

@levitsky
Owner

levitsky commented Mar 29, 2026

Thank you! I am totally fine leaving the co-localization case for later as long as this limitation is documented.

I am still getting results contrary to my intuition when playing around with the free proteoforms function that converts rules from a dict, for example in the test_from_simple_dict test case. With include_unmodified=False, shouldn't we get 3 forms (Phospho on each of 3 sites), and with include_unmodified=True, an extra one for the unmodified form? Instead of 3 and 4, it produces just 2 and then 6.

So two questions about this are:

  1. It looks like conversion from {"Phospho": ['S', 'T']} to ["Phospho|Position:S", "Phospho|Position:T"] is not really equivalent, because we could get Phospho on two S sites, but we are not getting that form.

  2. If we are getting the non-modified, singly phosphorylated and some forms with 2 phospho with include_unmodified, then why not the triply phosphorylated form as well? I'm having trouble distinguishing include_unmodified and expand_rules in this scenario.

Perhaps with the dict input we should just assume expand_rules? Or can we tweak the rules built from the dict input to make it produce some less surprising output?
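The non-equivalence in point 1 can be illustrated with a stdlib sketch (the site indices assume a sequence with S at two positions and T at one, which may not match the actual test_from_simple_dict sequence): under per-rule semantics each converted rule fires at most once, so the double-S form is unreachable, while per-site semantics allow any subset of the eligible sites.

```python
from itertools import combinations

s_sites, t_sites = [5, 7], [4]   # assumed: two S sites, one T site

# Per-rule semantics: "Phospho|Position:S" and "Phospho|Position:T"
# each fire at most once, so at most one S site is ever chosen.
per_rule = {frozenset(x for x in (s, t) if x is not None)
            for s in (None, *s_sites) for t in (None, *t_sites)}

# Per-site semantics: any subset of all three eligible sites.
all_sites = sorted(s_sites + t_sites)
per_site = {frozenset(c) for k in range(len(all_sites) + 1)
            for c in combinations(all_sites, k)}

print(len(per_rule), len(per_site))    # 6 8
print(frozenset({5, 7}) in per_rule)   # False: no double-S form
print(frozenset({5, 7}) in per_site)   # True
```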

P.S. The current assertions in the test don't have to stay this way, that was just me trying to reason about this.

@mobiusklein
Contributor Author

I see your point. {"Phospho": ["S", "T"]} doesn't behave the same way that ["Phospho|Position:S|Position:T"] does, and even if I force it to, that's not the behavior you are looking for. Additionally, if I move the logic for making it do so further into the guts of ProteoformCombinator, it won't be able to easily do its original job, which is to generate all the proteoforms defined by the ProForma sequence's own rules.

I'll implement what you suggested originally and see how it goes.

@levitsky
Owner

levitsky commented Mar 30, 2026

Moving expand_rules to just the function makes a lot of sense. I really like how the function works with string rule specifiers. I want to verify in tests that it works correctly with multiple modifications, and then I think we are good. What do you think?

Strange that the PROXI test failure is still unresolved; I can try to look into that, but it should not deter us.

@mobiusklein
Contributor Author

I agree. When I tried to push that logic into the ProteoformCombinator, I was trying to make the system do something it wasn't meant to.

I'm still not completely happy with how expand_rules interacts with CoMUP and Limit, since it only copies the rule enough times to assign it to each position at most once. Thus, if you have two sites and CoMUP|Limit:2 is applied, it can stack two modifications on one site and leave the second site unoccupied, but it cannot reach the full occupancy grid (0|0, 1|0, 0|1, 1|1, 0|2, 2|0, 2|1, 1|2, 2|2). Then again, that is an excellent way to set off a complexity explosion. I think it would be trivial to make it work (multiply the number of copies by rule.limit), but I'm out of time before I need to get back to that Zstd PR and other work.
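The full occupancy grid mentioned above is just a Cartesian product over per-site counts, which a stdlib one-liner can enumerate (a sketch of the combinatorics only, not of how expand_rules would implement it):

```python
from itertools import product

# With Limit:2 and two sites, each site's occupancy independently ranges
# over 0..2, giving the nine combinations listed in the comment above.
limit, n_sites = 2, 2
occupancies = list(product(range(limit + 1), repeat=n_sites))
print(len(occupancies))       # 9
print((2, 2) in occupancies)  # True: reachable once copies scale with limit
```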
