A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

Report generated on 2024-05-24, 21:09 EDT, based on data in:
/Users/uran/Desktop/Biomedical Data Analysis/HW2/ogdens_data/fastqc

General Statistics

| Sample Name | % Dups | % GC | M Seqs |
|---|---|---|---|
| brother_short | 71.1% | 51% | 1.0M |
| grandmother_short | 72.5% | 51% | 1.2M |
| mother_short | 72.0% | 52% | 1.2M |
| proband_29 | 69.8% | 51% | 0.8M |
| proband_short | 9.7% | 47% | 1.0M |
| uncle_short | 71.9% | 51% | 1.1M |
FastQC
Version: 0.11.9

FastQC is a quality control tool for high throughput sequence data, written by Simon Andrews at the Babraham Institute in Cambridge.

Sequence Counts
Sequence counts for each sample. Duplicate read counts are an estimate only.
This plot shows the total number of reads, broken down into unique and duplicate if possible (only more recent versions of FastQC give duplicate info).
You can read more about duplicate calculation in the FastQC documentation. A small part has been copied here for convenience:
Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.
The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.
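
The tracking rule quoted above can be sketched in a few lines of Python. This is a simplified illustration, not FastQC's actual implementation: the function name `estimate_duplication` and the `track_limit` parameter are hypothetical, and FastQC's own percentage calculation may differ in detail.

```python
from collections import Counter

def estimate_duplication(reads, track_limit=100_000):
    """Estimate the percentage of duplicate reads (simplified sketch)."""
    counts = Counter()
    for index, seq in enumerate(reads):
        # Reads longer than 75 bp are truncated to 50 bp before comparison.
        key = seq[:50] if len(seq) > 75 else seq
        if key in counts:
            # Already tracked: keep counting it to the end of the file.
            counts[key] += 1
        elif index < track_limit:
            # Only sequences first seen within the first `track_limit` reads are tracked.
            counts[key] = 1
    tracked_total = sum(counts.values())
    if tracked_total == 0:
        return 0.0
    # Each tracked sequence contributes one unique copy; the rest are duplicates.
    return 100.0 * (tracked_total - len(counts)) / tracked_total

reads = ["ACGT", "ACGT", "TTGA", "ACGT", "GGCC"]
dup_pct = estimate_duplication(reads)
print(f"{dup_pct:.1f}% duplicates")
```

Lowering `track_limit` shows why the result is only an estimate: sequences that first appear after the limit are never counted.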

Sequence Quality Histograms
The mean quality value across each base position in the read.
To enable multiple samples to be plotted on the same graph, only the mean quality scores are plotted (unlike the box plots seen in FastQC reports).
Taken from the FastQC help:
The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.
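
As a rough illustration of how a per-position mean quality could be derived from raw FASTQ quality strings, consider the sketch below. It assumes Phred+33 encoding (the `offset` parameter), which is not stated in this report, and the function name is hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each base position across reads."""
    totals, counts = [], []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos == len(totals):
                # First read that reaches this position: grow the accumulators.
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset  # ASCII character to Phred score
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]

# 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
means = mean_quality_per_position(["II5", "I5"])
print(means)
```

Reads of different lengths are handled by averaging only over the reads that actually cover a given position.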

Per Sequence Quality Scores
The number of reads at each average quality score. Shows whether a subset of reads has poor quality.
From the FastQC help:
The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, however these should represent only a small percentage of the total sequences.
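
The per-read view can be sketched the same way: compute one mean score per read and count reads per score. Phred+33 encoding and truncation of the mean to an integer bin are assumptions made here for illustration. The function name is hypothetical.

```python
from collections import Counter

def per_sequence_quality_counts(quality_strings, offset=33):
    """Count reads by their mean Phred score, truncated to an integer bin."""
    histogram = Counter()
    for qual in quality_strings:
        if not qual:
            continue  # skip empty reads rather than divide by zero
        mean_score = sum(ord(c) - offset for c in qual) / len(qual)
        histogram[int(mean_score)] += 1
    return dict(sorted(histogram.items()))

hist = per_sequence_quality_counts(["IIII", "II55", "5555", "IIII"])
print(hist)
```

A small secondary peak at low scores in such a histogram would correspond to the "subset of sequences with universally poor quality" described above.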

Per Base Sequence Content
The proportion of each base position for which each of the four normal DNA bases has been called.
To enable multiple samples to be shown in a single plot, the base composition data is shown as a heatmap. The colours represent the balance between the four bases: an even distribution should give an even muddy brown colour. Hover over the plot to see the percentage of the four bases under the cursor.
To see the data as a line plot, as in the original FastQC graph, click on a sample track.
From the FastQC help:
Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.
In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.
It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis.
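
To make the plotted quantity concrete, here is a minimal sketch that computes the percentage of each base at each read position. The function name is hypothetical, and excluding N calls from the denominator is an assumption of this sketch.

```python
def base_content_per_position(reads):
    """Return, for each position, the percentage of A, C, G and T calls."""
    length = max((len(read) for read in reads), default=0)
    result = []
    for pos in range(length):
        column = [read[pos] for read in reads if pos < len(read)]
        called = [base for base in column if base in "ACGT"]  # ignore N calls
        total = len(called)
        result.append({
            base: (100.0 * called.count(base) / total if total else 0.0)
            for base in "ACGT"
        })
    return result

content = base_content_per_position(["ACGT", "AGGT", "ACTT", "TCGA"])
print(content[0])
```

In an unbiased library the four percentages at each position would stay close to the overall base composition, which is what the "muddy brown" heatmap colour represents.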

Per Sequence GC Content
The average GC content of reads. Normal random libraries typically have a roughly normal distribution of GC content.
From the FastQC help:
This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.
In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don't know the GC content of the genome, the modal GC content is calculated from the observed data and used to build a reference distribution.
An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be.
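
The observed distribution and its mode, as described above, might be computed along these lines. This sketch covers only the observed side: it does not build the modelled reference distribution, and the function names and rounding to whole percentages are assumptions.

```python
from collections import Counter

def gc_percent(seq):
    """Percentage of G and C among the called (A, C, G, T) bases of one read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_distribution(reads):
    """Count reads per GC percentage, rounded to the nearest integer."""
    return Counter(round(gc_percent(read)) for read in reads)

dist = gc_distribution(["GGCC", "ATAT", "ACGT", "AGCT"])
modal_gc = dist.most_common(1)[0][0]  # the peak used to centre a reference curve
print(dist, modal_gc)
```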

Per Base N Content
The percentage of base calls at each position for which an N was called.
From the FastQC help:
If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This graph shows the percentage of base calls at each position for which an N was called.
It's not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence. However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.
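
A minimal sketch of the per-position N percentage follows. The function name is hypothetical and the example reads are made up for illustration.

```python
def n_content_per_position(reads):
    """Return the percentage of N calls at each read position."""
    length = max((len(read) for read in reads), default=0)
    percentages = []
    for pos in range(length):
        # Only reads long enough to cover this position are counted.
        column = [read[pos] for read in reads if pos < len(read)]
        percentages.append(100.0 * column.count("N") / len(column))
    return percentages

n_pct = n_content_per_position(["ACGN", "ACNN", "ACGT", "ACGT"])
print(n_pct)
```

In this toy input the N percentage rises towards the end of the reads, which mirrors the pattern the help text describes as common at low levels.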

Sequence Length Distribution

Sequence Duplication Levels
The relative level of duplication found for every sequence.
From the FastQC Help:
In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification). This graph shows the degree of duplication for every sequence in a library: the relative number of sequences with different degrees of duplication.
Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.
The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.
In a properly diverse library most sequences should fall into the far left of the plot in both the red and blue lines. A general level of enrichment, indicating broad oversequencing in the library, will tend to flatten the lines, lowering the low end and generally raising other categories. More specific enrichments of subsets, or the presence of low complexity contaminants, will tend to produce spikes towards the right of the plot.
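
One way to picture the "degree of duplication" is to group sequences by how many times they occur and express each group both as a share of all reads and as a share of distinct sequences. The sketch below does that without the binning, 100,000-read tracking, or truncation described above. It is an illustrative simplification, and the function name is hypothetical.

```python
from collections import Counter

def duplication_levels(reads):
    """Return {level: (pct_of_all_reads, pct_of_distinct_sequences)}."""
    seq_counts = Counter(reads)
    total_reads = sum(seq_counts.values())
    distinct = len(seq_counts)
    reads_at_level = Counter()
    distinct_at_level = Counter()
    for count in seq_counts.values():
        reads_at_level[count] += count      # reads belonging to this level
        distinct_at_level[count] += 1       # distinct sequences at this level
    return {
        level: (
            100.0 * reads_at_level[level] / total_reads,
            100.0 * distinct_at_level[level] / distinct,
        )
        for level in sorted(reads_at_level)
    }

levels = duplication_levels(["A", "A", "A", "B", "C"])
print(levels)
```

A diverse library would put almost all of both percentages at level 1, the far left of the plot.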

Overrepresented sequences by sample
The total amount of overrepresented sequences found in each library.
FastQC calculates and lists overrepresented sequences in FastQ files. It would not be possible to show this for all samples in a MultiQC report, so instead this plot shows the number of sequences categorized as overrepresented.
Sometimes, a single sequence may account for a large number of reads in a dataset. To show this, the bars are split into two: the first shows the overrepresented reads that come from the single most common sequence. The second shows the total count from all remaining overrepresented sequences.
From the FastQC Help:
A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up more than a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected.
FastQC lists all the sequences which make up more than 0.1% of the total. To conserve memory only sequences which appear in the first 100,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.
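
The 0.1% threshold and the memory-saving tracking rule can be sketched as follows. The function and parameter names are hypothetical, and this is a simplified illustration, not FastQC's code.

```python
from collections import Counter
from itertools import product

def overrepresented_sequences(reads, threshold_pct=0.1, track_limit=100_000):
    """Return {sequence: pct_of_total} for sequences above the threshold."""
    counts = Counter()
    total = 0
    for index, seq in enumerate(reads):
        total += 1
        if seq in counts:
            counts[seq] += 1
        elif index < track_limit:
            # Sequences first seen after `track_limit` reads are never tracked.
            counts[seq] = 1
    hits = {}
    for seq, n in counts.items():
        pct = 100.0 * n / total
        if pct > threshold_pct:
            hits[seq] = pct
    return hits

# Toy input: every possible 6-mer once, plus ten extra copies of one of them.
reads = ["".join(p) for p in product("ACGT", repeat=6)] + ["AAAAAA"] * 10
hits = overrepresented_sequences(reads)
print(hits)
```

Calling the function with a very small `track_limit` demonstrates the caveat in the help text: a sequence that first appears late in the file is missed even if it is abundant.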

Top overrepresented sequences
Top overrepresented sequences across all samples. The table shows the 20 most overrepresented sequences across all samples, ranked by the number of samples they occur in.

| Overrepresented sequence | Samples | Occurrences | % of all reads |
|---|---|---|---|
| GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT | 1 | 1166 | 0.0186% |
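
As a rough sanity check on the "% of all reads" column, the occurrence count can be divided by the total read count from the General Statistics table. The read counts there are rounded to 0.1M, so this only approximates the reported figure. Treating "all reads" as the sum over the six samples is an assumption.

```python
# Read counts in millions, copied from the General Statistics table (rounded values).
m_seqs = {
    "brother_short": 1.0,
    "grandmother_short": 1.2,
    "mother_short": 1.2,
    "proband_29": 0.8,
    "proband_short": 1.0,
    "uncle_short": 1.1,
}
occurrences = 1166  # from the table above

total_reads = sum(m_seqs.values()) * 1_000_000
pct_of_all_reads = 100.0 * occurrences / total_reads
print(f"{pct_of_all_reads:.4f}%")
```

The result comes out at roughly 0.0185%, in line with the 0.0186% in the table given the rounding of the input counts.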

Adapter Content
The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.
Note that only samples with ≥ 0.1% adapter contamination are shown.
There may be several lines per sample, as one is shown for each adapter detected in the file.
From the FastQC Help:
The plot shows a cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position. Once a sequence has been seen in a read it is counted as being present right through to the end of the read so the percentages you see will only increase as the read length goes on.
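
The cumulative behaviour described above can be illustrated with a short sketch. The adapter string `GATC` and the reads are made up for the example, the function name is hypothetical, and exact substring matching is a simplifying assumption.

```python
def adapter_content(reads, adapter):
    """Return the cumulative percentage of reads that have seen `adapter` by each position."""
    length = max((len(read) for read in reads), default=0)
    first_seen = [0] * length
    for read in reads:
        pos = read.find(adapter)
        if pos != -1:
            first_seen[pos] += 1  # position where the adapter first starts
    percentages = []
    running = 0
    for pos in range(length):
        # Once seen, a read stays counted through to the end of the read.
        running += first_seen[pos]
        percentages.append(100.0 * running / len(reads))
    return percentages

curve = adapter_content(["AAGATC", "GATCAA", "AAAAAA", "AAAAGA"], "GATC")
print(curve)
```

Because the running total never decreases, the curve can only stay flat or rise along the read, as the help text notes.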

Status Checks
Status for each FastQC section showing whether results seem entirely normal (green), slightly abnormal (orange) or very unusual (red).
FastQC assigns a status for each section of the report. These give a quick evaluation of whether the results of the analysis seem entirely normal (green), slightly abnormal (orange) or very unusual (red).
It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should therefore treat the summary evaluations as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.
Specific guidance on how to interpret the output of each module can be found in the relevant report section, or in the FastQC help.
Here, we summarise all of these in a single heatmap for a quick overview. Note that not all FastQC sections have plots in MultiQC reports, but all status checks are shown in this heatmap.
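
A heatmap like this could be assembled from the per-sample status lines that FastQC writes to `summary.txt` (tab-separated status, module name and file name). The sketch below assumes that layout and uses invented example content, not data from this report. The function names are hypothetical.

```python
# Colour mapping as described in the text above.
STATUS_COLOUR = {"PASS": "green", "WARN": "orange", "FAIL": "red"}

def parse_summary(text):
    """Parse summary.txt-style content into {module: status}."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses

def heatmap_rows(summaries):
    """Map {sample: summary text} to {sample: {module: colour}}."""
    return {
        sample: {module: STATUS_COLOUR[status]
                 for module, status in parse_summary(text).items()}
        for sample, text in summaries.items()
    }

# Illustrative input only; sample and file names are made up.
example = {
    "sample_a": (
        "PASS\tBasic Statistics\tsample_a.fastq\n"
        "WARN\tPer base sequence content\tsample_a.fastq\n"
        "FAIL\tSequence Duplication Levels\tsample_a.fastq\n"
    )
}
rows = heatmap_rows(example)
print(rows["sample_a"])
```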

Software Versions
Software Versions lists versions of software tools extracted from file contents.

| Software | Version |
|---|---|
| FastQC | 0.11.9 |