Add Sort Priority System for Gene Quality #91

@tmushayahama

Related to #89

Add a sort_priority field to rank genes by annotation quality, prioritizing well-annotated genes with known GO terms over those with unknown terms or unresolved symbols.

Problem

Currently, genes are not sorted by annotation quality, which causes two problems:

  • Well-annotated genes are hard to reach quickly
  • The first pages of results are dominated by genes with unknown terms

Solution

Implement a priority system based on:

  1. Number of unknown GO terms (UNKNOWN:CC, UNKNOWN:BP, UNKNOWN:MF)
  2. Whether the gene has a resolved symbol

Priority Levels

| Priority | Condition | Description |
|----------|-----------|-------------|
| 1 | Default | Genes with known GO terms |
| 10 | 1 unknown term | Contains 1 unknown GO term |
| 20 | 2 unknown terms | Contains 2 unknown GO terms |
| 30 | 3 unknown terms | Contains 3 unknown GO terms |
| 50 | Unnamed gene | named_gene: false |
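
The priority levels above could be computed roughly as follows. This is only a sketch: the `go_term_ids` key and the `compute_sort_priority` name are hypothetical, and it assumes that an unnamed gene always gets priority 50 regardless of how many unknown terms it has, which the issue does not state explicitly.

```python
from typing import Any, Mapping

# Placeholder term IDs named in the issue.
UNKNOWN_TERMS = ("UNKNOWN:CC", "UNKNOWN:BP", "UNKNOWN:MF")

# Priority per number of unknown terms, taken from the table.
PRIORITY_BY_UNKNOWN_COUNT = {0: 1, 1: 10, 2: 20, 3: 30}
UNNAMED_PRIORITY = 50


def compute_sort_priority(gene: Mapping[str, Any]) -> int:
    """Return the sort priority for one gene record (lower is better).

    Assumes a hypothetical record shape: `named_gene` is a bool and
    `go_term_ids` is an iterable of GO term ID strings.
    """
    # Assumption: an unresolved symbol outranks the unknown-term count.
    if not gene.get("named_gene", True):
        return UNNAMED_PRIORITY

    term_ids = set(gene.get("go_term_ids", []))
    unknown_count = sum(1 for term in UNKNOWN_TERMS if term in term_ids)
    return PRIORITY_BY_UNKNOWN_COUNT[unknown_count]
```

If unnamed genes with unknown terms should instead rank below unnamed genes with known terms, the two conditions would need to be combined rather than short-circuited.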

Sorting Order

Genes are sorted by the following keys, in order:

  1. sort_priority (ascending) - Best quality first
  2. coordinates_chr_num (ascending) - Chromosome number
  3. gene_symbol (ascending) - Alphabetical for deterministic ordering

Elasticsearch Query:

sort=[
    {"sort_priority": {"order": "asc"}},
    {"coordinates_chr_num.keyword": {"order": "asc"}},
    {"gene_symbol.keyword": {"order": "asc"}}
]
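
For context, the sort clause above might sit inside a full search request body like the one sketched here. The `build_gene_search_body` helper, the `match_all` query, and the paging parameters are assumptions for illustration; only the `sort` entries come from this issue.

```python
from typing import Any, Dict


def build_gene_search_body(page: int = 0, page_size: int = 50) -> Dict[str, Any]:
    """Build an Elasticsearch search body with the proposed sort order.

    Hypothetical helper: the query and paging are placeholders.
    """
    return {
        "query": {"match_all": {}},
        "sort": [
            {"sort_priority": {"order": "asc"}},
            {"coordinates_chr_num.keyword": {"order": "asc"}},
            {"gene_symbol.keyword": {"order": "asc"}},
        ],
        # Zero-based page index converted to an offset.
        "from": page * page_size,
        "size": page_size,
    }
```

One thing to check: a keyword sub-field is compared as a string, so if `coordinates_chr_num.keyword` holds values like "2" and "10", "10" would sort before "2". Whether that matters depends on how the field is populated.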

Benefits

  • Highlights high-quality, well-annotated genes
  • Consistent quality ranking across entire dataset
  • Deterministic ordering via chromosome and gene symbol

Discussion

Should genes with lower priority (10, 20, 30, 50) be sorted:

  1. At the bottom of the entire list (current implementation) - All priority 1 genes first (sorted by chromosome then gene_symbol), then all priority 10+ genes (sorted by chromosome then gene_symbol)
  2. At the bottom of each chromosome - Within each chromosome, priority 1 genes first (sorted by gene_symbol), then priority 10+ genes for that chromosome (sorted by gene_symbol)
  3. At the bottom of each gene_symbol letter - Within each letter group (A*, B*, C*, etc.), priority 1 genes first, then priority 10+ genes for that letter
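
To make the three options concrete, each one can be expressed as a different sort key over the same records. This is an in-memory sketch with made-up gene symbols, not the Elasticsearch implementation. For option 3 it assumes chromosome is not part of the key, since the issue does not say.

```python
from typing import Any, Dict, List, Tuple

Gene = Dict[str, Any]


def key_global(g: Gene) -> Tuple:
    # Option 1: priority first, then chromosome, then symbol.
    return (g["sort_priority"], g["coordinates_chr_num"], g["gene_symbol"])


def key_per_chromosome(g: Gene) -> Tuple:
    # Option 2: chromosome first, priority within each chromosome.
    return (g["coordinates_chr_num"], g["sort_priority"], g["gene_symbol"])


def key_per_letter(g: Gene) -> Tuple:
    # Option 3: first letter of the symbol, priority within each letter.
    return (g["gene_symbol"][:1].upper(), g["sort_priority"], g["gene_symbol"])


# Made-up records; chromosome is a string to mirror the keyword field.
genes: List[Gene] = [
    {"gene_symbol": "AAA1", "coordinates_chr_num": "1", "sort_priority": 10},
    {"gene_symbol": "AAB2", "coordinates_chr_num": "2", "sort_priority": 1},
    {"gene_symbol": "BBA1", "coordinates_chr_num": "1", "sort_priority": 1},
    {"gene_symbol": "BBB2", "coordinates_chr_num": "2", "sort_priority": 50},
]


def symbols(sorted_genes: List[Gene]) -> List[str]:
    return [g["gene_symbol"] for g in sorted_genes]


order_global = symbols(sorted(genes, key=key_global))
order_per_chromosome = symbols(sorted(genes, key=key_per_chromosome))
order_per_letter = symbols(sorted(genes, key=key_per_letter))
```

In Elasticsearch terms, options 1 and 2 differ only in the order of the entries in the `sort` array. Option 3 would likely need an extra indexed field for the first letter.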

Testing

  • Verify priority calculation for each unknown term count
  • Confirm unnamed genes get priority 50
  • Check sorting order: priority → chromosome → gene_symbol
  • Validate output JSON includes sort_priority field
  • Test Elasticsearch query returns genes in correct order
