# Intro

- Problem statement:
  * NCBI records vary in quality
  * not available for download as a single data set
  * annotation not consistent or difficult to piece together
- Previous 16S data sets
  * RDP
  * GreenGenes
  * NCBI bioproject?
  * Silva
  * 16sitgdb - https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2022.905489/full
  * GSR-DB - https://journals.asm.org/doi/10.1128/msystems.00950-23
- Summarize ya16sdb features
  * annotation
  * outlier detection (includes plotly website)
  * sequence subsets by confidence

# Methods

- ...

# Results/Discussion

- Record counts in each category (16S genes, whole genomes, taxcheck pass vs fail, refseq, reference sequences)
- Outlier detection and taxcheck outcomes for each subset
- Discrepancies between taxcheck and outlier detection
- Maybe: are there any predictors of outliers (eg, by year, source, etc)

# TODOs

- [x] start a group zotero (YM)
- [ ] gather literature (group)
- [ ] Chris: begin methods in README or elsewhere in repo
- [ ] Create OneDrive doc for MS (NH)
- [ ] Start authoring problem statement (NH)