When the HMM set passed to pynteny search contains paralogous models (e.g. nifD/nifK, nifE/nifN, bchL/nifH, V-/Mo-/Fe-only nitrogenase families), the same peptide commonly scores above threshold under more than one HMM. Pynteny keeps every hit and emits the warning
WARNING: At least two different HMMs produced identical sequence hits
…but the rolling-window pattern matcher in SyntenyHMMfilter does not deduplicate them. The duplicate rows are interleaved (by (contig, gene_pos) sort) between the rows of true syntenic neighbours, breaking the position-diff check inside the window. The result is a silent false negative: a genuine syntenic operon is dropped from synteny_matched.tsv with no error and no exit code.
This is the most common gotcha when users curate paralog-aware HMM panels — which is exactly the use case pynteny is best suited for.
Reproducer
Public dataset: 20 TARA Oceans MAGs from PRJEB97970 (or any marine MAG panel) + the PGAP-derived TIGRFAM HMM set restricted to {TIGR01287.1 (nifH), TIGR01282.1 (nifD), TIGR01286.1 (nifK)}.
pynteny build --data mags/ --outfile peptides.faa --prepend-filename
pynteny search --synteny_struc '>TIGR01287.1 0 >TIGR01282.1 0 >TIGR01286.1' \
--data peptides.faa --hmm_dir hmm_profiles/ --outdir out/
The MAG TARA_ION_45_MAG_00019 carries a textbook syntenic nifH→nifD→nifK at positions 33→34→35 of a single contig, all positive strand, bitscores 452.7 / 856.5 / 861.6. It does not appear in out/synteny_matched.tsv. The warning fires.
Mechanism
After get_all_HMM_hits(), the combined hits dataframe for that contig contains (rows abbreviated):
| row |
pos |
HMM (code) |
source |
| 1 |
33 |
nifH (0) |
TIGR01287.1 hit |
| 2 |
34 |
nifD (1) |
TIGR01282.1 hit |
| 3 |
34 |
nifK (2) |
TIGR01286.1 cross-hit on the nifD peptide |
| 4 |
35 |
nifD (1) |
TIGR01282.1 cross-hit on the nifK peptide |
| 5 |
35 |
nifK (2) |
TIGR01286.1 hit |
| 6 |
36 |
nifD (1) |
TIGR01282.1 cross-hit on the nifE peptide |
| 7 |
36 |
nifK (2) |
TIGR01286.1 cross-hit on the nifE peptide |
| 8 |
37 |
nifK (2) |
TIGR01286.1 cross-hit on the nifN peptide |
The window-of-3 scan in filter.py:316-327 looks for code-tuple (0, 1, 2) AND gene-position diffs of exactly (1, 1).
- Window (rows 1,2,3) → codes
(0,1,2) ✓ but position diffs (33→34=1, 34→34=0) ✗
- Window (rows 1,2,4) doesn't exist; the scan is contiguous over the sorted dataframe.
- Every subsequent window starts inside the duplicate region and has the wrong code tuple.
So (nifH@33, nifD@34, nifK@35) is never visited as a window even though the bitscore-best assignment at each position is exactly the canonical operon. The warning at filter.py:294-296 detects the duplicate condition but is informational only.
Proposed change
1. Opt-in per-peptide deduplication (--best-hmm-wins)
Add a single deduplication pass inside SyntenyHMMfilter.get_all_HMM_hits(), immediately after the pd.concat of per-HMM frames and before the rolling-window logic:
if self._best_hmm_wins:
all_hit_labels = (
all_hit_labels
.sort_values("bitscore", ascending=False)
.drop_duplicates(subset=["full_label"], keep="first")
.sort_values(["contig", "gene_pos"])
.reset_index(drop=True)
)
This requires propagating bitscore from HMMER.parse_HMM_search_output() (already collected, just not currently surfaced into get_all_HMM_hits) and threading a --best-hmm-wins flag from CLI → synteny_search → filter_FASTA_by_synteny_structure → SyntenyHMMfilter.__init__.
Default for 1.x: off (preserves existing behaviour for users with curated panels that intentionally rely on multi-HMM membership via |-groups). Default for 2.0: flip to on with a sibling --keep-all-hmm-hits escape hatch, since this is what most users actually want.
2. Upgrade the warning
Replace the current line in filter.py:
logger.warning("At least two different HMMs produced identical sequence hits")
with something actionable:
n_dup = all_hit_labels.full_label.duplicated().sum()
affected_pairs = (
all_hit_labels[all_hit_labels.full_label.duplicated(keep=False)]
.groupby("full_label").hmm.apply(lambda s: tuple(sorted(set(s))))
.value_counts().head(5).to_dict()
)
logger.warning(
f"{n_dup} peptides hit by >=2 HMMs (paralog cross-hits). "
f"Top HMM pairs: {affected_pairs}. "
"Syntenic patterns adjacent to these peptides may be silently dropped. "
"Re-run with --best-hmm-wins to deduplicate by bitscore."
)
3. Docs
Add a short note to the user guide and to pynteny search --help:
Pynteny's synteny matcher does not disambiguate paralogous HMMs by default. If your panel contains close paralogs (e.g. nitrogenase α/β, MoFe-co biosynthesis nifE/N, light-independent protochlorophyllide reductase bchLNB), prefer either (a) a disjoint orthogonal HMM set, (b) |-grouped HMMs in the synteny structure, or (c) --best-hmm-wins to assign each peptide to its highest-scoring HMM.
What I would not change
- The rolling-window matcher itself is correct on deduplicated input — no logic change needed there.
- The "all peptides duplicated" hard
sys.exit(1) at filter.py:297-304 is the right failsafe for the degenerate case and should stay.
- The default behaviour, until a 2.0 release. Silent dedup would change the output for users who curate paralog panels intentionally.
Acceptance
- A user running
pynteny search --best-hmm-wins ... on the reproducer above gets TARA_ION_45_MAG_00019 in synteny_matched.tsv.
- The current default behaviour is unchanged.
- The updated warning prints the count and at least one affected HMM pair.
- The user guide carries the paralog note.
- A regression test exercises the duplicate-row case on a minimal synthetic dataset.
Related
When the HMM set passed to
pynteny searchcontains paralogous models (e.g.nifD/nifK,nifE/nifN,bchL/nifH, V-/Mo-/Fe-only nitrogenase families), the same peptide commonly scores above threshold under more than one HMM. Pynteny keeps every hit and emits the warning…but the rolling-window pattern matcher in
SyntenyHMMfilterdoes not deduplicate them. The duplicate rows are interleaved (by(contig, gene_pos)sort) between the rows of true syntenic neighbours, breaking the position-diff check inside the window. The result is a silent false negative: a genuine syntenic operon is dropped fromsynteny_matched.tsvwith no error and no exit code.This is the most common gotcha when users curate paralog-aware HMM panels — which is exactly the use case pynteny is best suited for.
Reproducer
Public dataset: 20 TARA Oceans MAGs from PRJEB97970 (or any marine MAG panel) + the PGAP-derived TIGRFAM HMM set restricted to
{TIGR01287.1 (nifH), TIGR01282.1 (nifD), TIGR01286.1 (nifK)}.pynteny build --data mags/ --outfile peptides.faa --prepend-filename pynteny search --synteny_struc '>TIGR01287.1 0 >TIGR01282.1 0 >TIGR01286.1' \ --data peptides.faa --hmm_dir hmm_profiles/ --outdir out/The MAG
TARA_ION_45_MAG_00019carries a textbook syntenic nifH→nifD→nifK at positions 33→34→35 of a single contig, all positive strand, bitscores 452.7 / 856.5 / 861.6. It does not appear inout/synteny_matched.tsv. The warning fires.Mechanism
After
get_all_HMM_hits(), the combined hits dataframe for that contig contains (rows abbreviated):The window-of-3 scan in
filter.py:316-327looks for code-tuple(0, 1, 2)AND gene-position diffs of exactly(1, 1).(0,1,2)✓ but position diffs(33→34=1, 34→34=0)✗So
(nifH@33, nifD@34, nifK@35)is never visited as a window even though the bitscore-best assignment at each position is exactly the canonical operon. The warning atfilter.py:294-296detects the duplicate condition but is informational only.Proposed change
1. Opt-in per-peptide deduplication (
--best-hmm-wins)Add a single deduplication pass inside
SyntenyHMMfilter.get_all_HMM_hits(), immediately after thepd.concatof per-HMM frames and before the rolling-window logic:This requires propagating
bitscorefromHMMER.parse_HMM_search_output()(already collected, just not currently surfaced intoget_all_HMM_hits) and threading a--best-hmm-winsflag from CLI →synteny_search→filter_FASTA_by_synteny_structure→SyntenyHMMfilter.__init__.Default for 1.x: off (preserves existing behaviour for users with curated panels that intentionally rely on multi-HMM membership via
|-groups). Default for 2.0: flip to on with a sibling--keep-all-hmm-hitsescape hatch, since this is what most users actually want.2. Upgrade the warning
Replace the current line in
filter.py:with something actionable:
3. Docs
Add a short note to the user guide and to
pynteny search --help:What I would not change
sys.exit(1)atfilter.py:297-304is the right failsafe for the degenerate case and should stay.Acceptance
pynteny search --best-hmm-wins ...on the reproducer above getsTARA_ION_45_MAG_00019insynteny_matched.tsv.Related
src/pynteny/filter.py:293-304src/pynteny/filter.py:316-327|-group merge:SyntenyHMMfilter._merge_hits_by_HMM_group— covers the case where the user declares the HMM group in the synteny structure, but not the case where cross-hits are incidental.