feature: add proforma.modify_with #196

Open

mobiusklein wants to merge 13 commits into levitsky:master from mobiusklein:feature/more_combinators

Conversation

@mobiusklein
Contributor

part of #194

@levitsky
Owner

levitsky commented Mar 19, 2026

Thank you! I noticed a couple of places that seemed to need fixing (I hope this makes sense; otherwise feel free to flag or roll back), and then I wanted to add some tests.
Before we get to implementing the full combinatorial proteoform generation, I thought we already had the ability to specify the number of occurrences of an unlocalized modification via ^. While testing it, I noticed something strange: it seems to produce repeated peptidoforms.

>>> for pf in proforma.ProForma.parse('[Oxidation|Position:M]^2?PEMPTIMDE').peptidoforms():
...     print(pf)
...
PEM[Oxidation|Position:M]PTIM[Oxidation|Position:M]DE
PEM[Oxidation|Position:M]PTIM[Oxidation|Position:M]DE

This shows why a test I added is currently failing. I think the reason is simply that the ProForma parser supports the unlocalized modification count by adding multiple copies of the modification tag to the properties, and the combinator then treats those copies as distinct.

Do you see a way to fix this? As you can see, this is not a problem with the new function, but rather with the interplay between ProteoformCombinator and the parser.

However, what I was initially trying to do was to see whether we can support the count in the peptidoforms (modify_with) function in some way.
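The duplication described above can be reproduced with a small stdlib-only sketch (this is not the pyteomics internals; the site indices and the `"#1"`/`"#2"` copy labels are illustrative assumptions): if the two copies of the tag are distinct objects, the combinator produces one placement per ordering, and both orderings collapse to the same peptidoform.

```python
from itertools import permutations

# Hypothetical sketch of the suspected bug: the parser stores two copies of
# the tag for [Oxidation]^2?, and the combinator assigns each *copy* to a
# site, so swapping the two identical copies yields the same peptidoform twice.
sites = [2, 6]                            # the two M positions in PEMPTIMDE
copies = ["Oxidation#1", "Oxidation#2"]   # distinct objects, same modification

placements = [
    tuple(sorted((site, name.split("#")[0]) for site, name in zip(sites, perm)))
    for perm in permutations(copies)
]
print(placements[0] == placements[1])  # True: both orderings collapse

# Deduplicating on the canonical placement removes the repeat:
print(len(placements), len(set(placements)))  # 2 1
```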

@mobiusklein
Contributor Author

mobiusklein commented Mar 28, 2026

Well, that's a reminder to me to refresh my fork more often. Keeping proper track of repeated configurations was simple enough. The memory burden of tracking seen configurations should be small outside of some high-complexity cases, like trying to place dozens of different modifications on a full-length protein sequence, but that's a bad case either way.

I think the tests are failing due to a network timeout unrelated to ProForma.

@levitsky
Owner

levitsky commented Mar 28, 2026

Thanks! This is indeed a simple solution (I was thinking along the lines of building a Counter-like object of variable mods and then handling the counts). For both solutions to work, though, it appears that we need to make PositionModifierTag hashable, which it is not yet.

So with this change the semantics of include_unmodified seem to expand to actually mean "include any subset of modifications", is that right? This means that combinatorial application is already all but implemented; we just need to supply a maximum count equal to the number of eligible sites.

In [6]: seq = "[Phospho|Position:S|Position:T]^3?EMEVTSESPEK"

In [7]: pf = ProForma.parse(seq)

In [8]: for ppf in pf.proteoforms(include_unmodified=True):
   ...:     print(ppf)
   ...: 
EMEVTSESPEK
EMEVT[Phospho|Position:S|Position:T]SESPEK
EMEVTS[Phospho|Position:S|Position:T]ESPEK
EMEVTSES[Phospho|Position:S|Position:T]PEK
EMEVT[Phospho|Position:S|Position:T]S[Phospho|Position:S|Position:T]ESPEK
EMEVT[Phospho|Position:S|Position:T]SES[Phospho|Position:S|Position:T]PEK
EMEVTS[Phospho|Position:S|Position:T]ES[Phospho|Position:S|Position:T]PEK
EMEVT[Phospho|Position:S|Position:T]S[Phospho|Position:S|Position:T]ES[Phospho|Position:S|Position:T]PEK

This is great and perhaps include_unmodified should be renamed accordingly.
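The "any subset of eligible sites" reading can be checked with a stdlib sketch (a simplified `[Phospho]` tag stands in for the full `[Phospho|Position:S|Position:T]` tag; this is not the pyteomics implementation): with the maximum count equal to the number of sites, the enumeration reduces to taking every subset of the eligible S/T positions.

```python
from itertools import combinations

seq = "EMEVTSESPEK"
eligible = [i for i, aa in enumerate(seq) if aa in "ST"]   # positions 4, 5, 7

forms = []
for k in range(len(eligible) + 1):                         # 0..3 phosphos
    for chosen in combinations(eligible, k):
        forms.append("".join(
            aa + ("[Phospho]" if i in chosen else "")
            for i, aa in enumerate(seq)))

print(len(forms))  # 8, matching the eight proteoforms above
```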

One thing that puzzles me is that this behavior does not seem to extend to CoMUP modifications:

In [16]: seq = "[Phospho|Position:S|Position:T|comup|Limit:2]^2?EMEVTESPEK"
    ...: pf = ProForma.parse(seq)

In [17]: for ppf in pf.proteoforms(include_unmodified=False):
    ...:     print(ppf)
    ...: 
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2][Phospho|Position:S|Position:T|CoMUP|Limit:2]ESPEK
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2]ES[Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK
EMEVTES[Phospho|Position:S|Position:T|CoMUP|Limit:2][Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK

In [18]: for ppf in pf.proteoforms(include_unmodified=True):
    ...:     print(ppf)
    ...: 
EMEVTESPEK
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2]ESPEK
EMEVTES[Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK
EMEVT[Phospho|Position:S|Position:T|CoMUP|Limit:2]ES[Phospho|Position:S|Position:T|CoMUP|Limit:2]PEK

In this case including the unmodified peptidoform also excludes the two colocalized peptidoforms.
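For comparison, what a full multiset enumeration would look like can be sketched with the stdlib (the 0-based site indices for T and S in EMEVTESPEK are my own; this is not the pyteomics code path): placing co-localizable copies is a combinations-with-replacement choice capped by the per-site limit, and including every smaller count should yield six forms rather than the four shown above.

```python
from itertools import combinations_with_replacement

sites, limit = [4, 6], 2   # T and S in EMEVTESPEK; CoMUP with Limit:2

def placements(n_copies):
    return [c for c in combinations_with_replacement(sites, n_copies)
            if all(c.count(s) <= limit for s in sites)]

print(placements(2))  # [(4, 4), (4, 6), (6, 6)] -> the three forms above

# The "any subset" reading of include_unmodified would add the 0- and
# 1-copy placements as well:
total = sum(len(placements(k)) for k in range(3))
print(total)  # 6, versus the 4 forms produced above
```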

@mobiusklein
Contributor Author

Thank you for adding those extra test cases. I'll think about how I can improve the API for your first point about a flexible upper bound on the number of times a rule is applied.

To your point about a Counter, yes, that absolutely is needed to make co-localization work. In the seven (or more?) other combinatorial modification expanders I've written, I explicitly refused to include colocalization because it makes the search space intractable. It has its uses though.
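The hashable-state idea behind this can be sketched in a few lines of stdlib Python (the `state_key` helper is purely illustrative; in the library the tag objects themselves, such as PositionModifierTag, would need to be hashable for this to work): a co-localized placement is a multiset of (position, tag) pairs, and since a Counter is itself unhashable, a frozenset of its items serves as the seen-set key.

```python
from collections import Counter

# A placement with co-localization is a multiset of (position, tag) pairs;
# Counter is unhashable, so the dedup key is frozenset(counter.items()).
def state_key(placement):
    return frozenset(Counter(placement).items())

seen = set()
a = [(4, "Phospho"), (4, "Phospho")]   # two copies stacked on site 4
b = [(4, "Phospho"), (4, "Phospho")]   # same multiset, built separately
seen.add(state_key(a))
print(state_key(b) in seen)  # True: the duplicate state is recognized
```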

@mobiusklein
Contributor Author

The refactor to use a Counter to track previous states restores the CoMUP behavior that was previously lost.

I added a new expand_rules mode that can efficiently repeat modification rules on all positions, but it isn't compatible with co-localizing new variable modifications. There should be a way to get this working with CoMUP, but it's going to be complicated. If that is desirable, I can do it though.
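Why the expand_rules mode precludes co-localization can be seen in a small stdlib sketch of the idea (a simplification, not the actual implementation): the rule is effectively copied once per eligible site and each copy lands on a distinct site, so no site is ever modified more than once.

```python
from itertools import combinations

# Sketch of an expand_rules-style enumeration: each copy of the rule
# occupies a distinct site, so stacking (co-localization) never arises.
def expanded_placements(sites):
    for k in range(len(sites) + 1):
        yield from combinations(sites, k)

placed = list(expanded_placements([4, 5, 7]))
print(len(placed))                   # 8 placements over three sites
print(max(len(p) for p in placed))   # 3: at most one modification per site
```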

@levitsky
Owner

levitsky commented Mar 29, 2026

Thank you! I am totally fine leaving the co-localization case for later as long as this limitation is documented.

I am still getting results contrary to my intuition when playing around with the free proteoforms function that converts rules from a dict, for example in the test_from_simple_dict test case. With include_unmodified=False, shouldn't we get 3 forms (Phospho on each of 3 sites), and with include_unmodified=True, an extra one for the unmodified form? Instead of 3 and 4, it produces just 2 and then 6.

So two questions about this are:

  1. It looks like conversion from {"Phospho": ['S', 'T']} to ["Phospho|Position:S", "Phospho|Position:T"] is not really equivalent, because we could get Phospho on two S sites, but we are not getting that form.

  2. If we are getting the non-modified, singly phosphorylated and some forms with 2 phospho with include_unmodified, then why not the triply phosphorylated form as well? I'm having trouble distinguishing include_unmodified and expand_rules in this scenario.

Perhaps with the dict input we should just assume expand_rules? Or can we tweak the rules built from the dict input to make it produce some less surprising output?
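The non-equivalence in point 1 can be illustrated with a stdlib sketch (the site indices assume a sequence with S at two positions and T at one, which may not match the actual test_from_simple_dict sequence): under per-rule semantics each converted rule fires at most once, so the double-S form is unreachable, while per-site semantics allow any subset of the eligible sites.

```python
from itertools import combinations

s_sites, t_sites = [5, 7], [4]   # assumed: two S sites, one T site

# Per-rule semantics: "Phospho|Position:S" and "Phospho|Position:T"
# each fire at most once, so at most one S site is ever chosen.
per_rule = {frozenset(x for x in (s, t) if x is not None)
            for s in (None, *s_sites) for t in (None, *t_sites)}

# Per-site semantics: any subset of all three eligible sites.
all_sites = sorted(s_sites + t_sites)
per_site = {frozenset(c) for k in range(len(all_sites) + 1)
            for c in combinations(all_sites, k)}

print(len(per_rule), len(per_site))    # 6 8
print(frozenset({5, 7}) in per_rule)   # False: no double-S form
print(frozenset({5, 7}) in per_site)   # True
```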

P.S. The current assertions in the test don't have to stay this way, that was just me trying to reason about this.

@mobiusklein
Contributor Author

I see your point. {"Phospho": ["S", "T"]} doesn't behave the same way that ["Phospho|Position:S|Position:T"] does, and even if I force it to, that's not the behavior you are looking for. Additionally, if I move the logic for making it do so further into the guts of ProteoformCombinator, it won't be able to easily do its original job, which is to generate all the proteoforms defined by the ProForma sequence's own rules.

I'll implement what you suggested originally and see how it goes.

@levitsky
Owner

levitsky commented Mar 30, 2026

Moving expand_rules to just the function makes a lot of sense. I really like how the function works with string rule specifiers. I want to verify in tests that it works correctly with multiple modifications, and then I think we are good. What do you think?

Strange that the PROXI test failure is still unresolved; I can try to look into that, but it should not deter us.

@mobiusklein
Contributor Author

I agree. When I tried to push that logic into the ProteoformCombinator, I was trying to make the system do something it wasn't meant to.

I'm still not completely happy with how expand_rules interacts with CoMUP and Limit, since it only copies the rule enough times to assign it to each position at most once. Thus, if you have two sites and CoMUP|Limit:2 is applied, it can stack two modifications on one site and leave the second site unoccupied, but it cannot reach the full occupancy grid (0|0, 1|0, 0|1, 1|1, 0|2, 2|0, 2|1, 1|2, 2|2). Then again, that is an excellent way to set off a complexity explosion. I think it would be trivial to make it work (multiply the number of copies by rule.limit), but I'm out of time before I need to get back to that Zstd PR and other work.
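The full occupancy grid mentioned above is just a Cartesian product over per-site counts, which a stdlib one-liner can enumerate (a sketch of the combinatorics only, not of how expand_rules would implement it):

```python
from itertools import product

# With Limit:2 and two sites, each site's occupancy independently ranges
# over 0..2, giving the nine combinations listed in the comment above.
limit, n_sites = 2, 2
occupancies = list(product(range(limit + 1), repeat=n_sites))
print(len(occupancies))       # 9
print((2, 2) in occupancies)  # True: reachable once copies scale with limit
```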
