232 commits
70a441d
Added notes on Git repository structure
Jan 25, 2013
27d90f8
Initial commit of src directory
Jan 25, 2013
6986df8
Changed permissions to make executable
Jan 25, 2013
ae317bc
Initial commit
Jan 25, 2013
d23ae68
Added Makefile
Jan 28, 2013
14d0cc4
Extensive refactoring, additional help text, added config file
Jan 29, 2013
1aaea67
Renamed calibration.py
Jan 29, 2013
f739be3
Renamed to calibrationFromEGT.py
Jan 29, 2013
8e7b68b
Validate arguments from argparse module and run on multiple zscores
Jan 29, 2013
ffc73d4
Added 'config' command-line option
Jan 29, 2013
8cf9340
Initial commit
Jan 29, 2013
2a60177
Install etc directory and .ini files
Jan 29, 2013
67a7a64
Exit if no PREFIX given for installation
Jan 29, 2013
9b2659a
Minor edits to git section
Jan 29, 2013
0300002
force recompiling of any .pyc files on install
Jan 29, 2013
93e4531
Initial commit of new classes to evaluate concordance
Jan 29, 2013
3cc8878
Renamed and updated
Jan 30, 2013
428fb59
Change verbose output, and skip existing threshold files
Jan 30, 2013
ad50190
Extensive refactoring for concordance evaluation
Jan 30, 2013
f6f7e7e
Initial commit
Jan 31, 2013
11d7916
Added headers to output file, --force option, reflect argument change…
Feb 4, 2013
23d9e81
bug fix in concordance counts; return inclusion rates only once
Feb 4, 2013
9c815a5
Add --force option to ensure overwrite
Feb 4, 2013
e28d7ee
Added gain statistic
Feb 5, 2013
64de34f
Edited comments
Feb 5, 2013
065e2f5
Added 'best z' calculation; minor refactoring
Feb 5, 2013
f1d19d6
Added pydoc strings
Feb 5, 2013
494acdd
Removed findConcordance, merged functionality into findMultipleConcor…
Feb 13, 2013
4891c70
fixed email address
Feb 13, 2013
c3e179d
Moved to new 'legacy' directory
Feb 13, 2013
ecefe5a
Moved to new 'legacy' directory
Feb 13, 2013
91849c6
Collect calibration/evaluation classes into a combined module
Feb 13, 2013
8347de0
Moved classes into calibration.py, leaving script front-end
Feb 13, 2013
967edc1
renamed as writeThresholdFiles.py
Feb 13, 2013
8c8eeaf
Edited comments
Feb 15, 2013
96850a2
Change to .json input of GTC paths for calibration
Feb 15, 2013
045e478
Added getTotalSNPs method
Feb 18, 2013
e0ff77a
Renamed
Feb 19, 2013
6190d9b
Renamed to evaluate.py
Feb 19, 2013
f994c27
Initial commit
Feb 19, 2013
23710ec
Modified to use ThresholdContainer
Feb 19, 2013
66c14ed
Extensive rewrite to support pipeline workflows
Feb 20, 2013
85b9443
Corrected initial shebang line
Feb 20, 2013
4549285
Changed to use ThresholdContainer class
Feb 20, 2013
990368d
Modified to use SampleEvaluator class
Feb 20, 2013
e0c99c7
Initial commit
Feb 20, 2013
99764a3
Renamed/deleted files
Feb 20, 2013
232af8f
Renamed
Feb 22, 2013
1c8ba19
Renamed and added ThresholdContainer class
Feb 22, 2013
d5e2c64
Deleted (contents moved into utilities.py)
Feb 22, 2013
011628c
Renamed
Feb 22, 2013
0e88d42
Update description and import statements
Feb 26, 2013
d75f37f
Edited comments
Feb 26, 2013
d62afc4
Edited argument description & updated imports
Feb 26, 2013
3822e61
MetricEvaluator now has 1 output file instead of 2; edited comments
Feb 26, 2013
f177e69
Initial commit
Feb 26, 2013
22b2dda
Removed unnecessary callsToStrings method
Feb 26, 2013
de5a1a6
Renamed
Feb 26, 2013
4f19ed2
Added getInputPath method
Feb 26, 2013
463b3e6
Moved constants/functions into SharedBase class and modified SampleEv…
Feb 26, 2013
0b54be7
Changed argument description
Feb 26, 2013
68f06e8
Write threshold .json index; modified import statements
Feb 26, 2013
4444bdb
Added main() method and bugfix in callsToBinary
Feb 26, 2013
3e30ae0
Added SharedBase class
Feb 26, 2013
05e9245
Renamed scripts directory to zcall
Feb 27, 2013
31a9ed1
Removed .py suffix to prevent module import by pydoc
Feb 27, 2013
f5a6a1a
Renamed findThresholds and findMeanSD; remove unnecessary time import
Feb 27, 2013
5bcb705
Initial commit
Feb 27, 2013
4060eb8
Added html to ignore list
Feb 27, 2013
f12e719
Initial commit
Feb 27, 2013
69db6a9
Initial commit of test data
Feb 28, 2013
b1a4270
Changed permissions
Feb 28, 2013
c56dbf2
Initial commit
Feb 28, 2013
fe5131a
Added getMD5 function and modified test data locations
Feb 28, 2013
c178663
Fixed bug in merged results comparison
Mar 1, 2013
0e5a93f
Revised ignore list
Mar 1, 2013
69801d9
Removed excessively large test data files
Mar 1, 2013
694abee
Initial commit
Mar 1, 2013
020d2a6
Added bigdata directory path
Mar 1, 2013
811bcef
Refactored and made more robust; added bigdata directory
Mar 1, 2013
735af2c
Added option to round off threshold outputs in findThresholds
Mar 1, 2013
1b7cb09
Added comment
Mar 1, 2013
39a670e
Updated paths
Mar 1, 2013
a50f4c4
Ignore test output directories
Mar 1, 2013
e16b004
Edited comments and added README
Mar 1, 2013
57e8f7a
Find modules dynamically instead of from hard-coded list
Mar 4, 2013
5f0289b
Generate html pydoc and install
Mar 4, 2013
69f642e
Add findMeanSD and findThresholds to install list
Mar 4, 2013
140d7db
Renamed
Mar 4, 2013
ca440fd
Enable start/end indices for list of GTC files
Mar 6, 2013
fad8807
Print duratio
Mar 6, 2013
e6186de
Test start/end indices for evaluation, rename calling script
Mar 6, 2013
da8e218
Use JSON instead of .txt for input to evaluation merge
Mar 12, 2013
34a610e
Updated input to merge test
Mar 12, 2013
af78713
Modify sys.path to import BPM module
Mar 14, 2013
d939e91
Added getChromosomes, getPositions methods
Mar 14, 2013
4f27c62
Corrected bug in bytestring reversal; added method to sort calls in g…
Mar 14, 2013
d844d28
Initial commit
Mar 18, 2013
d49fa20
Moved Plink functionality into new PlinkHandler class
Mar 18, 2013
9c81c45
Added Plink reading functions
Mar 19, 2013
e5924c6
Initial commit
Mar 19, 2013
6910779
Minor rewording
Mar 20, 2013
d0fbf29
Changed .bed file checksum
Mar 20, 2013
68598f7
Edited comment
Mar 20, 2013
57c150f
Changed outputs
Mar 20, 2013
b1402a6
Added destination directory argument
Mar 20, 2013
be38266
Initial commit
Mar 20, 2013
485fc89
Fixed minor output bug
Mar 20, 2013
58c1411
Output plink .bim and .fam
Mar 20, 2013
19cf95b
Test additional Plink outputs
Mar 20, 2013
894a3c7
Added basic logfile of zcall activity
Mar 20, 2013
f06b320
initial commit of rough draft
Mar 20, 2013
1a6b5c9
Specify output directory and plink prefix as separate command line op…
Mar 20, 2013
ec4090b
Moved loop to find multiple thresholds into ThresholdFinder class.
Mar 21, 2013
74fee6d
Improved parsing of start/end arguments
Mar 21, 2013
dbb601c
Reading of .json metric list moved from MetricEvaluator into mergeEva…
Mar 21, 2013
46b9d8f
Fixed handling of gtc start/end indices; added functionality to suppo…
Mar 21, 2013
21ce318
Added functionality; first working version
Mar 21, 2013
d1cbe09
Changed permissions
Mar 21, 2013
f7c641d
Added test for zCallComplete.py, and validatePlink method
Mar 21, 2013
dec59db
Fixed bug in findBestZ (string vs. integer sort) and added verbose mode
Mar 21, 2013
9f8b74e
Modified checksums and z range for zCallComplete test
Mar 21, 2013
cfc55a9
Print duration in verbose mode; add verbosity support to finding best Z
Mar 21, 2013
76ba59f
Moved to main zcall module
Mar 22, 2013
7698f23
Modifed documentation command
Mar 22, 2013
7c53de6
Prevent duplication of zcall module docs
Mar 22, 2013
19dd366
Initial commit of convenience script to generate .ped files
Mar 22, 2013
74a7c02
Moved to legacy
Mar 25, 2013
a33cbae
New and simplified README
Mar 25, 2013
b51e636
Fixed typos
Mar 25, 2013
6ee780b
Tidied up README
Mar 27, 2013
caf54f0
Deleted metrics.txt
Mar 27, 2013
6678759
Use main() method
Mar 27, 2013
f3c6827
Moved into zcall module
Mar 27, 2013
961eaad
Remove obsolete code in src/legacy directory
Mar 27, 2013
bdad2c3
Remove obsolete GTC json paths
Mar 27, 2013
5aa14b7
Additional tidying
Mar 27, 2013
627899f
Removed unnecessary test gtc .json files
Mar 27, 2013
cfa37a1
Modify help text
Mar 27, 2013
d1b0d55
Modified comments
Mar 27, 2013
6224aaa
Changed to a short overview, most content moved to README_prototypes
Mar 27, 2013
0b9adec
Initial commit, material was previously in README
Mar 27, 2013
8252f1c
Corrected section numbering
Mar 27, 2013
a3a34b2
Moved to README_extended
Mar 28, 2013
300c4d2
Added choice between binary and non-binary Plink output
Apr 2, 2013
fc8b408
Edited and added line breaks
Apr 2, 2013
be9b981
Added line breaks
Apr 2, 2013
c7c6f1c
Initial commit
Apr 3, 2013
7f33f8a
Initial commit
Apr 3, 2013
623b607
Ignore intermediate LaTeX output
Apr 3, 2013
e70f1b3
Note on excluding QC failures
Apr 3, 2013
3665202
Initial entries for exended zcall
Apr 3, 2013
74123eb
Changed log path default
Apr 30, 2013
6ae732d
Merge pull request #1 from iainrb/devel
Apr 30, 2013
77e859e
Merge pull request #2 from wtsi-npg/devel
Apr 30, 2013
c88800f
Gracefully handle ZeroDivisionError in findMAF
May 2, 2013
461bca2
Merge pull request #3 from iainrb/devel
May 7, 2013
1f251cd
Merge branch 'devel'
May 7, 2013
26eb40a
If gender code not in .json, default to -9 instead of crashing
May 14, 2013
b3b665f
Initial commit
May 14, 2013
6e8cea0
Note that gender codes are optional
May 14, 2013
44bfc36
Added option to write mean metric text
May 14, 2013
31608c1
Change help text; bugfix in uri uniqueness check
May 15, 2013
fe37378
Output sample count for evaluation in verbose mode
May 15, 2013
69f2479
Support text output of mean concordance/gain
May 15, 2013
027cd32
Added copyright text
May 17, 2013
428bcac
Revise mean/sd sanity checks and heuristic for choosing z score
May 21, 2013
ca356a5
Initial commit of GPLv3 license
May 22, 2013
dc46afc
Merge pull request #4 from iainrb/devel
May 22, 2013
452333a
Merge branch 'devel'
May 22, 2013
4916992
Added --profile option to run Python profiler
May 29, 2013
b5e4156
Add --profile option
May 30, 2013
3a386df
Convert findMeanSD script to a method in calibration.py
Jun 3, 2013
41c5c97
Replaced findThresholds script with method in calibration.py
Jun 3, 2013
2a10743
Split findMeanSD into smaller and more manageable methods
Jun 3, 2013
23f6143
Edited comments
Jun 3, 2013
1e8af88
Bugfix in arguments to runMultiple
Jun 4, 2013
40e7636
Write profile output to file
Jun 4, 2013
374c29d
Removed obsolete scripts
Jun 4, 2013
8ff31b4
Write pstats to named temporary file (pasted from cython branch)
Jun 5, 2013
3b68172
Added option to activate profiling by default in config.ini
Jun 7, 2013
aa44b42
Removed cython imports, previously added in error
Jun 12, 2013
d4fcbb7
Replace 'for range' with 'while' and list append with assignment, for…
Jun 17, 2013
27335f5
Replace 'for range' with 'while', list append with assignment, and fi…
Jun 17, 2013
96f8012
Bugfix for output array length in callsToBinary
Jun 18, 2013
f1d16ff
Fix minor bug in handling --text argument with profiler
Jun 19, 2013
0f2d22d
Changed copyright notice for multiple authors
Jun 19, 2013
e36da3a
Use class in utilities.py for generic argument validation
Jun 20, 2013
3a92f90
Merge pull request #5 from iainrb/devel
Jun 20, 2013
bd243aa
Merge branch 'devel'
Jun 20, 2013
524b179
Bugfix for profile args
Jun 21, 2013
f4c6df4
Print additional information on JSON parse error
Jun 21, 2013
6ed2446
Merge pull request #6 from iainrb/devel
Jun 21, 2013
8897600
Merge pull request #7 from wtsi-npg/devel
Jun 21, 2013
d4fd8d7
Moved some plink functionality to the plinktools module
Jun 25, 2013
0c00987
Merge pull request #8 from iainrb/devel
Jul 2, 2013
4299fdf
Merge branch 'devel'
Jul 2, 2013
862561a
Initial commit
Sep 23, 2013
0f34a1e
Updated Rscript path to 0.3.0; works on both farm2 and farm3
Sep 23, 2013
0b698cd
Updatd prerequisites
Sep 23, 2013
8f0c33e
Check for NaN/INF values and assign no-call if found
Sep 23, 2013
3b4c536
Change ImportError message
Sep 23, 2013
7f37407
Catch ZeroDivisionError in normalization
Oct 2, 2013
80a5e3d
Changed error message
Oct 2, 2013
d40b479
Merge pull request #9 from iainrb/devel
Oct 2, 2013
9076921
Initial commit
Nov 6, 2013
feac2aa
Implement normalization to Illumina TOP strand
Nov 8, 2013
6ad27ca
Added markdown for headings
Jan 6, 2014
a971efc
Renamed
Jan 6, 2014
73f844e
Added markdown
Jan 6, 2014
fa29b1b
Renamed
Jan 6, 2014
892686f
Add tearDown() to delete test output; update checksums for non-normal…
Jan 6, 2014
a491d25
Remove manifest normalization
Jan 6, 2014
5f42e34
Deleted
Jan 6, 2014
ee232d1
Add subsection on strand normalization
Jan 6, 2014
8b4c3bb
Edited comments
Jan 7, 2014
45f61cf
Minor edits to README docs
Jan 7, 2014
4b9bee5
fixed typos
Jan 7, 2014
49e3dfb
Fixed typos and list of scripts
Jan 7, 2014
3dba83e
Merge pull request #10 from iainrb/devel
Jan 7, 2014
3bf9556
Announce deletion of output directory to stdout
Jan 8, 2014
b1ed935
Bugfix; correct number of no-calls for padding to integer number of b…
Jan 30, 2014
4c3bd19
Updated locations of test data
Jan 30, 2014
a6dfcf9
Added test of zCallComplete.py with alternate manifest
Jan 30, 2014
8c6d7c3
Initial commit
Jan 30, 2014
9876197
Merge pull request #11 from iainrb/devel
Jan 30, 2014
24ea91b
Merge pull request #13 from wtsi-npg/release-0.4.4
Jan 31, 2014
65eda2e
Updated test data and changed data location
Mar 25, 2014
41b33ae
Change placeholder value from -9 to 0 in Plink .fam output
Mar 27, 2014
f4166eb
Allow user to configure representation of missing data in Plink .fam …
Mar 28, 2014
19fa128
Merge pull request #14 from iainrb/devel
Mar 28, 2014
3bba1bb
Merge pull request #16 from wtsi-npg/release-0.4.5
Mar 28, 2014
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
*.png -crlf -diff
*.pdf -crlf -diff
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
# patterns for git version control to ignore
*~
*.pyc
*.html
*.aux
*.dvi
*.log
674 changes: 674 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions Makefile
@@ -0,0 +1,19 @@
# makefile for zCall by Iain Bancarz, ib5@sanger.ac.uk

PREFIX = PREFIX_DIRECTORY # dummy value as default
DEST = $(PREFIX)/zCall
SCRIPTS = src/zcall
ETC = src/etc

usage:
	@echo -e "Usage: make install PREFIX=<destination directory>\nWill install to the zCall subdirectory of PREFIX.\nPREFIX must exist, zCall subdirectory will be created if necessary."

install: $(PREFIX)
	@echo -e "Installing scripts..."
	install -d $(DEST) $(DEST)/zcall $(DEST)/etc $(DEST)/doc
	@rm -f $(DEST)/zcall/*.pyc # force recompiling of any .pyc files
	install $(SCRIPTS)/*.py $(SCRIPTS)/*.r $(DEST)/zcall
	install $(ETC)/*.ini $(DEST)/etc
	@echo -e "Writing documentation..."
	$(SCRIPTS)/createDocs.py --out $(DEST)/doc
	@echo -e "Installation complete. See $(DEST)/doc/zcall.html for class documentation."
59 changes: 59 additions & 0 deletions README.md
@@ -0,0 +1,59 @@

zCall: A Rare Variant Caller for Array-Based Genotyping
=======================================================

1. Overview
-----------

zCall is a variant caller specifically designed for calling rare single
nucleotide polymorphisms (SNPs) from array-based technology. This caller is
implemented as a post-processing step after a default calling algorithm has
been applied such as Illumina's GenCall algorithm. zCall uses the intensity
profile of the common allele homozygote cluster to define the location of the
other two genotype clusters.
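
As a rough illustration of the idea, and not the zCall implementation itself, the sketch below assumes that each threshold lies z standard deviations above a cluster mean on one intensity axis, and that a point is called according to which thresholds it exceeds. The function names `threshold` and `call_genotype`, the `"NC"` no-call label, and the example values are hypothetical.

```python
def threshold(mean, sd, z=7):
    """Return an intensity cutoff z standard deviations above a cluster mean.

    Assumption: the threshold has the simple form mean + z * sd.
    """
    return mean + z * sd


def call_genotype(x, y, t_x, t_y):
    """Assign a genotype from normalized intensities (x, y) and two thresholds.

    Assumption: X intensity tracks allele A and Y intensity tracks allele B.
    """
    if x > t_x and y > t_y:
        return "AB"  # signal on both axes: heterozygote
    if x > t_x:
        return "AA"  # signal on the X axis only
    if y > t_y:
        return "BB"  # signal on the Y axis only
    return "NC"      # below both thresholds: remains a no-call


# Hypothetical cluster statistics for one SNP
t_x = threshold(mean=0.05, sd=0.01)
t_y = threshold(mean=0.04, sd=0.01)
print(call_genotype(0.9, 0.03, t_x, t_y))
```

A larger z moves both thresholds away from the homozygote clusters, so fewer points are re-called.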

The zCall code includes three prototype versions, and an extended version;
these are documented respectively in README_prototypes and README_extended.md.
The prototypes are run in a series of steps with some manual intervention;
extended zCall can be run with a single command and has other added
capabilities.

2. Publication and downloads
-----------------------------

The paper describing zCall is:
Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M,
O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia
Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, Purcell S,
McCarroll SA, Sullivan PF, Daly MJ, Neale BM. zCall: a rare variant caller
for array-based genotyping: Genetics and population analysis. Bioinformatics.
2012 Oct 1;28(19):2543-2545. Epub 2012 Jul 27. PubMed PMID: 22843986.

zCall is hosted on Github (https://github.com/jigold/zCall). Prototype versions
are available as .zip files, while the extended version is in the src
directory. The entire zCall repository can be cloned using Git or downloaded
from Github as a .zip file (approximately 0.8 MB). Extended zCall can be
installed from a download of the full repository, using the included Makefile.

3. Disclaimer
---------------

The prototype and extended editions of zCall include code provided by Illumina.
The Illumina provided Code was provided as-is and with no warranty as to
performance and no warranty against it infringing any other party's
intellectual property rights.

4. Contacts
------------

The original zCall method and prototype implementations were developed by
Jackie Goldstein et al. For questions about prototype zCall or reporting
problems with the code, please send an email to Jackie Goldstein
(jigold@broadinstitute.org). For all other inquiries, please send an email to
both Ben Neale (bneale@broadinstitute.org) and Jackie Goldstein
(jigold@broadinstitute.org).

Any queries concerning the extended version of zCall in the src directory
should be directed to Iain Bancarz (ib5@sanger.ac.uk).

This document was written by Iain Bancarz.
132 changes: 132 additions & 0 deletions README_extended.md
@@ -0,0 +1,132 @@

zCall: Extended Version
=======================

1. Overview
-----------

This document covers the extended version of zCall, located in the `src`
directory on github. For other implementations, see Section 9 (History).

2. Installation
---------------

Use GNU Make with the target `install` and the `PREFIX` variable. For example,
`make install PREFIX=/home/jsmith/foo` will create a directory
`/home/jsmith/foo/zCall` and install the relevant scripts and documentation.

The etc/config.ini file should be edited to reflect the user's local
computing environment.

Software prerequisites are:
* Python 2.7.x (or PyPy >= 2.1)
* R 2.x
* Plink should be in the PATH environment variable
See: http://pngu.mgh.harvard.edu/~purcell/plink/
* Plinktools should be in the PYTHONPATH environment variable
See: https://github.com/wtsi-npg/plinktools

3. The zCall method
-------------------

Applying zCall consists of four steps:
1. Generate candidate zscore threshold files
2. Evaluate metrics on the input data for each candidate threshold
3. Merge evaluation results and choose an optimal threshold for calling
4. Apply zCall using the chosen threshold to re-call any 'no-calls' in
the input data.

Steps 2 and 3 may be omitted in favour of simply using a default threshold;
JI Goldstein et al. suggest a default of z=7.
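
Steps 2 and 3 can be pictured with the following sketch. It is only an illustration under stated assumptions: the metrics are taken to be a (concordance, gain) pair per candidate z, keys may be strings because they were read from JSON, and the selection rule (highest concordance, then highest gain, then smallest z) is hypothetical rather than the heuristic zCall actually uses. The name `find_best_z` is likewise illustrative.

```python
def find_best_z(metrics):
    """Pick a z score from a mapping of z -> (concordance, gain).

    Keys are sorted numerically, because string keys would otherwise
    order as "10" < "6" < "7".
    """
    best_z, best_score = None, None
    for z in sorted(metrics, key=int):
        score = tuple(metrics[z])
        # Strict comparison: on a tie, the smaller z seen earlier is kept.
        if best_score is None or score > best_score:
            best_z, best_score = int(z), score
    return best_z


# Hypothetical merged evaluation results
merged = {
    "10": (0.98, 1.1),
    "6": (0.97, 1.3),
    "7": (0.99, 1.2),
    "8": (0.99, 1.1),
}
print(find_best_z(merged))
```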

4. Usage
--------

The user's PATH variable must include the path to the 'zcall' subdirectory of
the installation (or alternatively to src/zcall in the source code).

Two approaches are supported:
1. A self-contained script to run zCall from start to finish as a single
process: zCallComplete.py
2. Scripts to run each step independently, enabling parallelization. For
steps 1-4 respectively: prepareThresholds.py, evaluateThresholds.py,
mergeEvaluation.py, runZCall.py

Any of these scripts can be run with -h or --help for detailed help and
usage information. Note that if calling with a single default threshold is
desired, this can be achieved by simply using zCallComplete.py with a
threshold "range" of only one value.

The original authors of zCall recommend that samples which fail quality
control should be excluded from zCall calibration and re-calling.

5. Other scripts and modules
----------------------------

* Additional command-line scripts, run with -h or --help for usage:
appendPEDline.py, createDocs.py, textToSampleJson.py
* Other Python files are modules which contain relevant classes but are not intended to be run as scripts.
* R script used to find thresholds: findBetas.r

6. Data formats
---------------

Input data consists of binary .gtc files, one for each sample. In addition,
zCall requires an appropriate .egt cluster file and .bpm.csv manifest file.
The .egt and .bpm.csv files are proprietary Illumina formats and can be
downloaded from: http://support.illumina.com/downloads.ilmn

Final genotype output is in Plink format (see
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml). Intermediate
metadata files are written in JSON, a simple text-based format for storing
data structures (see http://www.json.org/).
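
For orientation, a Plink .fam file has one line per sample with six whitespace-separated columns: family ID, individual ID, paternal ID, maternal ID, sex code (1 = male, 2 = female, 0 = unknown) and phenotype. The sketch below builds such a line; the function name `fam_line` and its parameters are hypothetical and do not come from the zCall code.

```python
def fam_line(sample_id, sex=0, missing_phenotype="-9"):
    """Build one line of a Plink .fam file for an unrelated sample.

    Assumptions: the sample ID doubles as the family ID, both parents
    are unknown ("0"), and the phenotype is missing.
    """
    fields = [sample_id, sample_id, "0", "0", str(sex), missing_phenotype]
    return " ".join(fields)


print(fam_line("7197037167_R01C01", sex=2))
```

Making the missing-value placeholder a parameter lets the caller choose between the `-9` and `0` conventions without changing the function.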

7. Further reading
------------------

The document src/doc/zCallExtended.tex describes how the extended version
expands on the prototypes, including automated evaluation of candidate zscore
thresholds. HTML documentation for Python modules can be generated by running
src/zcall/createDocs.py. The zCall method is discussed in the paper by JI
Goldstein et al. (see Section 9, History).

8. Strand normalization
-----------------------

Normalization of the beadpool manifest to the Illumina TOP strand is not
implemented by zCall, in either the prototype or extended software.

* If normalization is required, the simtools package developed at the
Wellcome Trust Sanger Institute can be used: https://github.com/wtsi-npg/simtools

* For details of the normalization procedure, see: http://res.illumina.com/documents/products/technotes/technote_topbot.pdf

9. History
----------

The zCall method was originally developed by Jackie Goldstein
(jigold@broadinstitute.org) et al. and published in the following paper:
Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M,
O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia
Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, Purcell S,
McCarroll SA, Sullivan PF, Daly MJ, Neale BM. zCall: a rare variant caller
for array-based genotyping: Genetics and population analysis. Bioinformatics.
2012 Oct 1;28(19):2543-2545. Epub 2012 Jul 27. PubMed PMID: 22843986.

zCall has been substantially extended by Iain Bancarz (ib5@sanger.ac.uk), to
allow incorporation into the WTSI Genotyping Pipeline (WTSI-GP, see
https://github.com/wtsi-npg/genotyping). The extension includes metrics for
evaluation of the 'zscore' threshold parameter, automated implementation of
the complete zcall workflow, and support for Plink binary output. WTSI-GP will
support zCall as part of a fully automated pipeline workflow, with parallel
processing on LSF.

The prototype implementation of zCall consisted of several self-contained
versions for different input formats. These versions were committed to Github
as .zip files. Extended zCall is based on 'zCall_Version1.3_AutoCall.zip'.
Individual files in extended zCall are under version control on Github, and
are located in the src directory.

Illumina provided code was provided as-is and with no warranty as to
performance and no warranty against it infringing any other party's
intellectual property rights.
20 changes: 13 additions & 7 deletions README → README_prototypes
@@ -1,7 +1,7 @@
--------------------------------------------------------------
zCall: A Rare Variant Caller for Array-based Genotyping

For questions about implementing zCall or reporting problems with the code, please send an email to Jackie Goldstein (jigold@broadinstitute.org). For all other inquiries, please send an email to both Ben Neale (bneale@broadinstitute.org) and Jackie Goldstein (jigold@broadinstitute.org).
For questions about implementing zCall or reporting problems with the code, please send an email to Jackie Goldstein (jigold@broadinstitute.org). For all other inquiries, please send an email to both Ben Neale (bneale@broadinstitute.org) and Jackie Goldstein (jigold@broadinstitute.org).

*** The Illumina provided Code was provided as-is and with no warranty as to performance and no warranty against it infringing any other party's intellectual property rights.

@@ -11,15 +11,21 @@ Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, O'Dushlaine C,


I. Overview
zCall is a variant caller specifically designed for calling rare single nucleotide polymorphisms (SNPs) from array-based technology. This caller is implemented as a post-processing step after a default calling algorithm has been applied such as Illumina's GenCall algorithm. zCall uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. The outputs of zCall are either a PLINK PED and MAP file or a TPED and TFAM file depending on what the input files to zCall are. See http://pngu.mgh.harvard.edu/~purcell/plink/ for more details about PLINK.

zCall is a variant caller specifically designed for calling rare single nucleotide polymorphisms (SNPs) from array-based technology. This caller is implemented as a post-processing step after a default calling algorithm has been applied such as Illumina's GenCall algorithm. zCall uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. The outputs of zCall are either a PLINK PED and MAP file or a TPED and TFAM file depending on what the input files to zCall are. See http://pngu.mgh.harvard.edu/~purcell/plink/ for more details about PLINK.

Three different versions of zCall have been provided depending on whether AutoCall or GenomeStudio has been run. Each version includes a set of Python scripts and an R script as well as line-by-line instructions on how to run zCall, located in the README file. Scripts are available at https://github.com/jigold/zCall. Examples of input file format are available at https://github.com/jigold/zCall; however, these files are subsets of actual inputs and are not usable for testing out the code.


II. Implementation
zCall is implemented as a set of Python scripts and one R script that are run on the command line. The scripts have been written to work with Illumina's software output files such as an EGT file, GTC files, a .bpm.csv file, and a GenomeStudio Report. Examples of input files are located in the examples directory. We have written three versions of zCall that work with different input files based on whether AutoCall or GenomeStudio has been used for the initial calling:Version 1 (AutoCall):
Input files are Illumina's binary GTC files, a .bpm.csv file, and a binary EGT file and the output files are a PLINK .ped and .map file. This version should be used after AutoCall has been run. Because the inputs are GTC files, calling of samples can be parallelized and the need to upload IDAT files into GenomeStudio can be avoided. In addition, this version avoids making large intermediate files with genotypes and intensity data.Version 2 (GenomeStudio - Thresholds derived from EGT file):
Input files are Illumina's binary EGT file, an Illumina probe manifest file, and an Illumina GenomeStudio final report with the following fields: genotype, normalized X intensity, and normalized Y intensity where the normalized intensities are between 0 and 2. The output files are PLINK .tped and .tfam files where the alleles are A/B. PLINK can later be used to transform the A/B alleles into A,T,G,C using the --update-alleles flag.

zCall is implemented as a set of Python scripts and one R script that are run on the command line. The scripts have been written to work with Illumina's software output files such as an EGT file, GTC files, a .bpm.csv file, and a GenomeStudio Report. Examples of input files are located in the examples directory. We have written three versions of zCall that work with different input files based on whether AutoCall or GenomeStudio has been used for the initial calling:

Version 1 (AutoCall):
Input files are Illumina's binary GTC files, a .bpm.csv file, and a binary EGT file and the output files are a PLINK .ped and .map file. This version should be used after AutoCall has been run. Because the inputs are GTC files, calling of samples can be parallelized and the need to upload IDAT files into GenomeStudio can be avoided. In addition, this version avoids making large intermediate files with genotypes and intensity data.

Version 2 (GenomeStudio - Thresholds derived from EGT file):
Input files are Illumina's binary EGT file, an Illumina probe manifest file, and an Illumina GenomeStudio final report with the following fields: genotype, normalized X intensity, and normalized Y intensity where the normalized intensities are between 0 and 2. The output files are PLINK .tped and .tfam files where the alleles are A/B. PLINK can later be used to transform the A/B alleles into A,T,G,C using the --update-alleles flag.

This version requires using GenomeStudio to generate a final report, which can be on the order of 40 GB for 5000 samples. However, it makes use of the fact that an EGT file already has the calculated means and standard deviations of each cluster when the initial clustering was done. Therefore, if a cluster file has already been generated with a large number of samples, then using the existing cluster file to generate the thresholds is much faster and more accurate than calculating the mean and standard deviation of each cluster from the GenomeStudio report.

@@ -48,11 +54,11 @@ zCall works by using the intensity profile of the common allele homozygote clust

5. Genotypes are assigned based on where points are located with respect to the two thresholds.

6. Analysts should only consider sites and samples that meet QC criteria from the original calls.
6. Analysts should only consider sites and samples that meet QC criteria from the original calls.


IV. Miscellaneous

The run time and memory requirements are dependent on the number of probes and samples. For the Illumina HumanExome BeadChip, calling one sample from a GTC file takes 15 seconds and requires about 3 MB of memory (Max Memory Swap = 39 MB). For 971 samples, calling from GTC files and generating one large PED file requires 145 MB of memory (Max Memory Swap = 438 MB). For 384 samples from an Illumina GenomeStudio report, it takes about 13 minutes and 129 MB of memory (Max Memory Swap = 289 MB) to generate a new TPED and TFAM file with predetermined thresholds. It is assumed that the number, order, and names of sites match between the EGT, GTC, GenomeStudio report, and .bpm.csv files.

For best results, we recommend using at least 1000 samples to derive the thresholds. This can either be an EGT file clustered with 1000 samples or from a GenomeStudio report with 1000 samples. We found that we got the best results using a z-score threshold equal to 7; however we recommend using the calibration step to verify that 7 will work well for your data. When using GenCall to generate the original genotype calls, we got the best results when using a smaller cluster file (90 samples) rather than a larger cluster file (9,479 samples). Please see Goldstein,J.I. et al. (2012) zCall: A Rare Variant Caller for Array-based Genotyping. Bioinformatics: 28(19):2543-2545 for more information.
For best results, we recommend using at least 1000 samples to derive the thresholds. This can either be an EGT file clustered with 1000 samples or from a GenomeStudio report with 1000 samples. We found that we got the best results using a z-score threshold equal to 7; however we recommend using the calibration step to verify that 7 will work well for your data. When using GenCall to generate the original genotype calls, we got the best results when using a smaller cluster file (90 samples) rather than a larger cluster file (9,479 samples). Please see Goldstein,J.I. et al. (2012) zCall: A Rare Variant Caller for Array-based Genotyping. Bioinformatics: 28(19):2543-2545 for more information.
2 changes: 2 additions & 0 deletions src/data/.gitattributes
@@ -0,0 +1,2 @@
# do not diff the .csv manifest (treat as binary data)
*.bpm.csv -crlf -diff
6 changes: 6 additions & 0 deletions src/data/.gitignore
@@ -0,0 +1,6 @@
# ignore output from tests
thresholds_*.txt
*.egt
*.bpm.csv
thresholds.json
tmp*
26 changes: 26 additions & 0 deletions src/data/gtc.json
@@ -0,0 +1,26 @@
[
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R01C01.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R01C02.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R02C01.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R02C02.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R03C01.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R03C02.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R04C01.gtc"
},
{
"result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R04C02.gtc"
}
]
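
The layout of gtc.json above (a list of objects, each with a `result` key holding a GTC path) could be read along the following lines. This is a sketch, not the parser used by zCall: the function name `read_gtc_paths` is hypothetical, and the `start`/`end` arguments are assumed to follow Python slice semantics (zero-based, `end` exclusive), which may differ from the start/end options of the zCall scripts.

```python
import json


def read_gtc_paths(json_path, start=0, end=None):
    """Return GTC file paths from a JSON list of {"result": path} objects.

    start/end select a sub-range of samples, for example to split the
    work across parallel jobs.
    """
    with open(json_path) as handle:
        entries = json.load(handle)
    return [entry["result"] for entry in entries[start:end]]
```

For example, `read_gtc_paths("src/data/gtc.json", 0, 4)` would return the first four of the eight paths listed.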