Estimating marker prevalence in pooled data

Caitlin Cherryh edited this page Oct 23, 2024 · 17 revisions

In this tutorial, we will use PoolTools to estimate the prevalence of a marker within a region, and how that prevalence changes over time. You will be given a mock dataset of positive/negative results on pooled samples and will use the different models available to estimate prevalence.

🕵️ Scenario

You are analysing molecular xenomonitoring survey data for a lymphatic filariasis (LF) surveillance programme. You have been tasked with estimating the prevalence of filarial DNA within the main mosquito vector over time. You have been provided with survey results based on tests performed on pooled samples of mosquitoes, where each pool is tested for the presence or absence of parasite DNA.

You are required to estimate the prevalence of LF in the population, and provide the following:

  • The prevalence of LF for each Region
  • The prevalence of LF per year

📄 The dataset

  • Download the SimpleExampleData.csv dataset here.

Click on "Download raw file" to save the data to your computer.

  • Inspect the spreadsheet (e.g. in Excel). What columns and information does it contain?

Each row in the spreadsheet contains the result for one pool that was tested.

  • Result: indicates whether parasite DNA was detected in the pool (1 = present; 0 = absent)
  • NumInPool: indicates how many units, i.e. mosquitoes, were included in the pool

The Site, Village, Year and Region are metadata columns that uniquely describe the pool. For more information about the values within these columns, see the explanation on Hierarchical sampling structure.

Important

Each distinct location (here, each Site) must have its own unique identifier. See How to prepare your data for analysis for more details.
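If you would like to check this before uploading, the sketch below looks for Site labels that are reused under more than one Village. It assumes that "unique" means a Site label should not appear in two different villages; the toy rows and the helper names `reused_site_labels` and `with_unique_sites` are made up for illustration and are not part of PoolTools or SimpleExampleData.csv.

```python
import csv
import io
from collections import defaultdict

# Toy rows invented for this example; only the column names come from the tutorial.
TOY_CSV = """Region,Village,Site,Year,NumInPool,Result
A,A-1,1,2018,25,0
A,A-1,2,2018,25,1
A,A-2,1,2018,25,0
"""

def reused_site_labels(rows):
    """Return {site label: [villages]} for labels used in more than one village."""
    villages_by_site = defaultdict(set)
    for row in rows:
        villages_by_site[row["Site"]].add(row["Village"])
    return {site: sorted(v) for site, v in villages_by_site.items() if len(v) > 1}

def with_unique_sites(rows):
    """Prefix each Site label with its Village so labels no longer collide."""
    return [{**row, "Site": f'{row["Village"]}_{row["Site"]}'} for row in rows]

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
print(reused_site_labels(rows))                     # {'1': ['A-1', 'A-2']}
print(reused_site_labels(with_unique_sites(rows)))  # {}
```

To run the same check on your own file, replace `io.StringIO(TOY_CSV)` with an open file handle for your CSV.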

📊 Analysis

To get started:

  • Navigate to the Analyse tab
  • Upload the SimpleExampleData.csv file

Next, we need to specify two columns in the uploaded data:

  • Test results: Result
  • Number of specimens per pool: NumInPool
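These two columns are all that a basic pooled-prevalence estimate needs. For intuition only, here is a minimal sketch of a maximum-likelihood estimate from pool results. It is not PoolTools' own implementation: it assumes a perfect test (no false positives or negatives) and independent pools, so that a pool of size s is negative with probability (1 - p)^s. The function name `pooled_prevalence_mle` is hypothetical.

```python
def pooled_prevalence_mle(results, pool_sizes, iterations=200):
    """Maximum-likelihood prevalence from pooled 0/1 results (perfect-test assumption)."""
    if len(results) != len(pool_sizes):
        raise ValueError("results and pool_sizes must have the same length")
    if not any(results):
        return 0.0  # no positive pools: the likelihood is maximised at p = 0
    if all(results):
        return 1.0  # every pool positive: the likelihood is maximised at p = 1

    def score(p):
        """Derivative of the log-likelihood with respect to p."""
        total = 0.0
        for r, s in zip(results, pool_sizes):
            prob_negative = (1.0 - p) ** s
            if r == 1:
                total += s * (1.0 - p) ** (s - 1) / (1.0 - prob_negative)
            else:
                total -= s / (1.0 - p)
        return total

    # The score is positive below the MLE and negative above it, so bisect on its sign.
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Made-up example: 10 pools of 5 mosquitoes each, 3 of them positive.
example_results = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
example_sizes = [5] * 10
print(pooled_prevalence_mle(example_results, example_sizes))
```

When all pools have the same size s and x of n pools are positive, the estimate has the closed form 1 - (1 - x/n)^(1/s), which is a handy sanity check for the numerical result.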

📅 Estimating disease prevalence by place and time

We want to estimate the prevalence of LF separately for each region and year.

Under Stratify data by: select the checkboxes for:

  • Region
  • Year

Hit Estimate prevalence!

Show results

We find that the prevalence differs across Regions. Interestingly, the marker prevalence is lower across all regions in 2019, compared to 2018!

You can further explore the data using the on-screen buttons, such as:

  • Showing more results per page
  • Sorting columns
  • Viewing the next page

Using one of the approaches above, find the row for 2020 and Region D. What is the estimated prevalence reported?

Show answer
0.0786

🪜 Obtaining more accurate estimates of uncertainty by accounting for hierarchical (cluster) sampling

So far, we have ignored the hierarchical (cluster) sampling structure in the survey data (i.e. villages within regions, and sites within villages). If we don't account for clustering, our estimates of prevalence will be over-confident (our confidence intervals will be too narrow). This can make a practical difference if we are deciding whether to start or stop an intervention based on whether the upper bound of the confidence interval is above or below a threshold.
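To see why ignoring clustering leads to over-confidence, the following simulation sketch may help. It is a simplified illustration, not the model PoolTools fits: it tests mosquitoes individually rather than in pools, and it assumes each site has its own prevalence drawn from a Beta(1, 19) distribution (mean 0.05). All numbers and the function name `simulate_survey` are made up for the example.

```python
import math
import random
import statistics

N_SITES = 20    # sites per simulated survey (assumed)
PER_SITE = 50   # mosquitoes tested per site (assumed)

def simulate_survey(rng):
    """Simulate one clustered survey and return the overall proportion positive."""
    positives = 0
    for _ in range(N_SITES):
        site_prevalence = rng.betavariate(1.0, 19.0)  # prevalence varies between sites
        positives += sum(rng.random() < site_prevalence for _ in range(PER_SITE))
    return positives / (N_SITES * PER_SITE)

rng = random.Random(42)
estimates = [simulate_survey(rng) for _ in range(500)]

# How much the estimate really varies from survey to survey.
actual_sd = statistics.stdev(estimates)

# What a naive analysis would report if it treated all mosquitoes as independent.
mean_prevalence = statistics.mean(estimates)
naive_se = math.sqrt(mean_prevalence * (1 - mean_prevalence) / (N_SITES * PER_SITE))

print(f"actual spread of estimates: {actual_sd:.4f}")
print(f"naive standard error:       {naive_se:.4f}")
```

Under these assumptions the actual spread is noticeably larger than the naive standard error, which is the sense in which intervals that ignore clustering are too narrow.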

We will repeat the previous analysis, but apply a hierarchical model.

  • Select the checkbox for 'Cluster/hierarchical sampling'
  • Drag Village and Site into the 'Hierarchical variables' bucket

Important

Hierarchical variables must be ordered from the largest sampling area to the smallest (top-down in the bucket). In this example, Village should be placed first, above Site.

Hit Estimate prevalence! This analysis will take 1-2 minutes to complete.

Show results

Compare some of the prevalence estimates between the two analyses. A few observations to note are:

  • The credible intervals for the hierarchical analysis are wider than the confidence intervals from the previous analysis
  • The point estimates of prevalence are slightly higher in the hierarchical model than in the non-hierarchical model

Note

The results will differ slightly each time a Bayesian analysis is run in PoolTools, because random numbers are used for part of the calculation. The hierarchical model uses a Bayesian analysis, so the numbers you get may vary slightly from those shown above.
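As a generic illustration of this run-to-run variation (not PoolTools' actual sampler), the sketch below estimates the mean of a distribution from random draws. The Beta(4, 98) distribution and the function name `monte_carlo_mean` are arbitrary choices made for the example.

```python
import random

def monte_carlo_mean(seed, draws=2000):
    """Estimate the mean of a Beta(4, 98) distribution from random draws."""
    rng = random.Random(seed)
    samples = [rng.betavariate(4, 98) for _ in range(draws)]
    return sum(samples) / draws

run_a = monte_carlo_mean(seed=1)
run_b = monte_carlo_mean(seed=2)
run_a_again = monte_carlo_mean(seed=1)

print(run_a, run_b)          # close to each other, but not identical
print(run_a == run_a_again)  # True: the same seed reproduces the same draws
```

Both runs land close to the exact mean of 4/102, and the small gap between them is the same kind of variation you may see between repeated hierarchical analyses.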

The underlying differences between these models go beyond the scope of this tutorial. A detailed explanation can be found in Hierarchical sampling structure.

🚀 Where to go from here?
