diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..842e1cf --- /dev/null +++ b/.gitattributes @@ -0,0 +1,2 @@ +*.png -crlf -diff +*.pdf -crlf -diff \ No newline at end of file diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..290bc82 --- /dev/null +++ b/.gitignore @@ -0,0 +1,7 @@ +# patterns for git version control to ignore +*~ +*.pyc +*.html +*.aux +*.dvi +*.log diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..94a9ed0 --- /dev/null +++ b/LICENSE @@ -0,0 +1,674 @@ + GNU GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. + + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. + + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. + + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + Copyright (C) + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +. diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..0db12eb --- /dev/null +++ b/Makefile @@ -0,0 +1,19 @@ +# makefile for zCall by Iain Bancarz, ib5@sanger.ac.uk + +PREFIX = PREFIX_DIRECTORY # dummy value as default +DEST = $(PREFIX)/zCall +SCRIPTS = src/zcall +ETC = src/etc + +usage: + @echo -e "Usage: make install PREFIX=\nWill install to the zCall subdirectory of PREFIX.\nPREFIX must exist, zCall subdirectory will be created if necessary." + +install: $(PREFIX) + @echo -e "Installing scripts..." + install -d $(DEST) $(DEST)/zcall $(DEST)/etc $(DEST)/doc + @rm -f $(DEST)/zcall/*.pyc # force recompiling of any .pyc files + install $(SCRIPTS)/*.py $(SCRIPTS)/*.r $(DEST)/zcall + install $(ETC)/*.ini $(DEST)/etc + @echo -e "Writing documentation..." + $(SCRIPTS)/createDocs.py --out $(DEST)/doc + @echo -e "Installation complete. See $(DEST)/doc/zcall.html for class documentation." diff --git a/README.md b/README.md new file mode 100644 index 0000000..bbf4e14 --- /dev/null +++ b/README.md @@ -0,0 +1,59 @@ + +zCall: A Rare Variant Caller for Array-Based Genotyping +======================================================= + +1. Overview +----------- + +zCall is a variant caller specifically designed for calling rare single +nucleotide polymorphisms (SNPs) from array-based technology. This caller is +implemented as a post-processing step after a default calling algorithm has +been applied such as Illumina's GenCall algorithm. zCall uses the intensity +profile of the common allele homozygote cluster to define the location of the +other two genotype clusters. + +The zCall code includes three prototype versions, and an extended version; +these are documented respectively in README_prototypes and README_extended.md. +The prototypes are run in a series of steps with some manual intervention; +extended zCall can be run with a single command and has other added +capabilities. + +2. Publication and downloads +----------------------------- + +The paper describing zCall is: +Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, Purcell S, +McCarroll SA, Sullivan PF, Daly MJ, Neale BM. zCall: a rare variant caller +for array-based genotyping: Genetics and population analysis. Bioinformatics. +2012 Oct 1;28(19):2543-2545. Epub 2012 Jul 27. PubMed PMID: 22843986. + +zCall is hosted on Github (https://github.com/jigold/zCall). Prototype versions +are available as .zip files, while the extended version is in the src +directory. The entire zCall repository can be cloned using Git or downloaded +from Github as a .zip file (approximately 0.8 MB). Extended zCall can be +installed from a download of the full repository, using the included Makefile. + +3. Disclaimer +--------------- + +The prototype and extended editions of zCall include code provided by Illumina. +The Illumina provided Code was provided as-is and with no warranty as to +performance and no warranty against it infringing any other party's +intellectual property rights. + +4. Contacts +------------ + +The original zCall method and prototype implementations were developed by +Jackie Goldstein et al. For questions about prototype zCall or reporting +problems with the code, please send an email to Jackie Goldstein +(jigold@broadinstitute.org). For all other inquiries, please send an email to +both Ben Neale (bneale@broadinstitute.org) and Jackie Goldstein +(jigold@broadinstitute.org). + +Any queries concerning the extended version of zCall in the src directory +should be directed to Iain Bancarz (ib5@sanger.ac.uk). + +This document was written by Iain Bancarz. \ No newline at end of file diff --git a/README_extended.md b/README_extended.md new file mode 100644 index 0000000..6294980 --- /dev/null +++ b/README_extended.md @@ -0,0 +1,132 @@ + +zCall: Extended Version +======================= + +1. Overview +----------- + +This document covers the extended version of zCall, located in the `src` +directory on github. For other implementations, see Section 9 (History). + +2. Installation +--------------- + +Use GNU Make, with the target `install`. For example, `make install +/home/jsmith/foo` will create a directory `/home/jsmith/foo/zCall` and +install the relevant scripts and documentation. + +The etc/config.ini file should be edited to reflect the user's local +computing environment. + +Software prerequisites are: + * Python 2.7.x (or PyPy >= 2.1) + * R 2.x + * Plink should be in the PATH environment variable + See: http://pngu.mgh.harvard.edu/~purcell/plink/ + * Plinktools should be in the PYTHONPATH environment variable + See: https://github.com/wtsi-npg/plinktools + +3. The zCall method +------------------- + +Applying zCall consists of four steps: + 1. Generate candidate zscore threshold files + 2. Evaluate metrics on the input data for each candidate threshold + 3. Merge evaluation results and choose an optimal threshold for calling + 4. Apply zCall using the chosen threshold to re-call any 'no-calls' in +the input data. + +Steps 2 and 3 may be omitted in favour of simply using a default threshold; +JI Goldstein et al. suggest a default of z=7. + +4. Usage +-------- + +The user's PATH variable must include the path to the 'zcall' subdirectory of +the installation (or alternatively to src/zcall in the source code). + +Two approaches are supported: + 1. A self-contained script to run zCall from start to finish as a single +process: zCallComplete.py + 2. Scripts to run each step independently, enabling parallelization. For +steps 1-4 respectively: prepareThresholds.py, evaluateThresholds.py, +mergeEvaluation.py, runZCall.py + +Any of these scripts can be run with -h or --help for detailed help and +usage information. Note that if calling with a single default threshold is +desired, this can be achieved by simply using zCallComplete.py with a +threshold "range" of only one value. + +The original authors of zCall recommend that samples which fail quality +control should be excluded from zCall calibration and re-calling. + +5. Other scripts and modules +---------------------------- + + * Additional command-line scripts, run with -h or --help for usage: +appendPEDline.py, createDocs.py, textToSampleJson.py + * Other Python files are modules which contain relevant classes but are not intended to be run as scripts. + * R script used to find thresholds: findBetas.r + +6. Data formats +--------------- + +Input data consists of binary .gtc files, one for each sample. In addition, +zCall requires an appropriate .egt cluster file and .bpm.csv manifest file. +The .egt and .bpm.csv files are proprietary Illumina formats and can be +downloaded from: http://support.illumina.com/downloads.ilmn + +Final genotype output is in Plink format (see +http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml). Intermediate +metadata files are written in JSON, a simple text-based format for storing +data structures (see http://www.json.org/). + +7. Further reading +------------------ + +The document src/doc/zCallExtended.tex describes how the extended version +expands on the prototypes, including automated evaluation of candidate zscore +thresholds. HTML documentation for Python modules can be generated by running +src/zcall/createDocs.py. The zCall method is discussed in the paper by JI +Goldstein et al. (see Section 8 below). + +8. Strand normalization +----------------------- + +Normalization of the beadpool manifest to the Illumina TOP strand is not +implemented by zCall, in either the prototype or extended software. + +* If normalization is required, the simtools package developed at the +Wellcome Trust Sanger Institute can be used: https://github.com/wtsi-npg/simtools + +* For details of the normalization procedure, see: http://res.illumina.com/documents/products/technotes/technote_topbot.pdf‎ + +9. History +---------- + +The zCall method was originally developed by Jackie Goldstein +(jigold@broadinstitute.org) et al. and published in the following paper: +Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, Purcell S, +McCarroll SA, Sullivan PF, Daly MJ, Neale BM. zCall: a rare variant caller +for array-based genotyping: Genetics and population analysis. Bioinformatics. +2012 Oct 1;28(19):2543-2545. Epub 2012 Jul 27. PubMed PMID: 22843986. + +zCall has been substantially extended by Iain Bancarz (ib5@sanger.ac.uk), to +allow incorporation into the WTSI Genotyping Pipeline (WTSI-GP, see +https://github.com/wtsi-npg/genotyping). The extension includes metrics for +evaluation of the 'zscore' threshold parameter, automated implementation of +the complete zcall workflow, and support for Plink binary output. WTSI-GP will +support zCall as part of a fully automated pipeline workflow, with parallel +processing on LSF. + +The prototype implementation of zCall consisted of several self-contained +versions for different input formats. These versions were committed to Github +as .zip files. Extended zCall is based on 'zCall_Version1.3_AutoCall.zip'. +Individual files in extended zCall are under version control on Github, and +are located in the src directory. + +Illumina provided code was provided as-is and with no warranty as to +performance and no warranty against it infringing any other party's +intellectual property rights. diff --git a/README b/README_prototypes similarity index 83% rename from README rename to README_prototypes index 4cf5e60..fd6918d 100644 --- a/README +++ b/README_prototypes @@ -1,7 +1,7 @@ -------------------------------------------------------------- zCall: A Rare Variant Caller for Array-based Genotyping -For questions about implementing zCall or reporting problems with the code, please send an email to Jackie Goldstein (jigold@broadinstitute.org). For all other inquiries, please send an email to both Ben Neale (bneale@broadinstitute.org) and Jackie Goldstein (jigold@broadinstitute.org). +For questions about implementing zCall or reporting problems with the code, please send an email to Jackie Goldstein (jigold@broadinstitute.org). For all other inquiries, please send an email to both Ben Neale (bneale@broadinstitute.org) and Jackie Goldstein (jigold@broadinstitute.org). *** The Illumina provided Code was provided as-is and with no warranty as to performance and no warranty against it infringing any other party's intellectual property rights. @@ -11,15 +11,21 @@ Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, O'Dushlaine C, I. Overview - zCall is a variant caller specifically designed for calling rare single nucleotide polymorphisms (SNPs) from array-based technology. This caller is implemented as a post-processing step after a default calling algorithm has been applied such as IlluminaÕs GenCall algorithm. zCall uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. The outputs of zCall are either a PLINK PED and MAP file or a TPED and TFAM file depending on what the input files to zCall are. See http://pngu.mgh.harvard.edu/~purcell/plink/ for more details about PLINK. + +zCall is a variant caller specifically designed for calling rare single nucleotide polymorphisms (SNPs) from array-based technology. This caller is implemented as a post-processing step after a default calling algorithm has been applied such as IlluminaÕs GenCall algorithm. zCall uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. The outputs of zCall are either a PLINK PED and MAP file or a TPED and TFAM file depending on what the input files to zCall are. See http://pngu.mgh.harvard.edu/~purcell/plink/ for more details about PLINK. Three different versions of zCall have been provided depending on whether AutoCall or GenomeStudio has been run. Each version includes a set of Python scripts and an R script as well as line-by line instructions on how to run zCall located in the README file. Scripts are available at https://github.com/jigold/zCall. Examples of input file format are available at https://github.com/jigold/zCall; however, these files are subsets of actual inputs and are not usable for testing out the code. II. Implementation - zCall is implemented as a set of Python scripts and one R script that are run on the command line. The scripts have been written to work with IlluminaÕs software output files such as an EGT file, GTC files, a .bpm.csv file, and a GenomeStudio Report. Examples of input files are located in the examples directory. We have written three versions of zCall that work with different input files based on whether AutoCall or GenomeStudio has been used for the initial calling: Version 1 (AutoCall): -Input files are IlluminaÕs binary GTC files, a .bpm.csv file, and a binary EGT file and the output files are a PLINK .ped and .map file. This version should be used after AutoCall has been run. Because the inputs are GTC files, calling of samples can be parallelized and the need to upload IDAT files into GenomeStudio can be avoided. In addition, this version avoids making large intermediate files with genotypes and intensity data. Version 2 (GenomeStudio - Thresholds derived from EGT file): -Input files are IlluminaÕs binary EGT file, an Illumina probe manifest file, and an Illumina GenomeStudio final report with the following fields: genotype, normalized X intensity, and normalized Y intensity where the normalized intensities are between 0 and 2. The output files are PLINK .tped and .tfam files where the alleles are A/B. PLINK can later be used to transform the A/B alleles into A,T,G,C using the --update-alleles flag. + +zCall is implemented as a set of Python scripts and one R script that are run on the command line. The scripts have been written to work with IlluminaÕs software output files such as an EGT file, GTC files, a .bpm.csv file, and a GenomeStudio Report. Examples of input files are located in the examples directory. We have written three versions of zCall that work with different input files based on whether AutoCall or GenomeStudio has been used for the initial calling: + +Version 1 (AutoCall): +Input files are IlluminaÕs binary GTC files, a .bpm.csv file, and a binary EGT file and the output files are a PLINK .ped and .map file. This version should be used after AutoCall has been run. Because the inputs are GTC files, calling of samples can be parallelized and the need to upload IDAT files into GenomeStudio can be avoided. In addition, this version avoids making large intermediate files with genotypes and intensity data. + +Version 2 (GenomeStudio - Thresholds derived from EGT file): +Input files are IlluminaÕs binary EGT file, an Illumina probe manifest file, and an Illumina GenomeStudio final report with the following fields: genotype, normalized X intensity, and normalized Y intensity where the normalized intensities are between 0 and 2. The output files are PLINK .tped and .tfam files where the alleles are A/B. PLINK can later be used to transform the A/B alleles into A,T,G,C using the --update-alleles flag. This version requires using GenomeStudio to generate a final report, which can be on the order of 40 GB for 5000 samples. However, it makes use of the fact that an EGT file already has the calculated means and standard deviations of each cluster when the initial clustering was done. Therefore, if a cluster file has already been generated with a large number of samples, then using the existing cluster file to generate the thresholds is much faster and more accurate than calculating the mean and standard deviation of each cluster from the GenomeStudio report. @@ -48,11 +54,11 @@ zCall works by using the intensity profile of the common allele homozygote clust 5. Genotypes are assigned based on where points are located with respect to the two thresholds. -6. Analysts should only consider sites and samples that meet QC criteria from the original calls. +6. Analysts should only consider sites and samples that meet QC criteria from the original calls. IV. Miscellaneous The run time and memory requirements are dependent on the number of probes and samples. For the Illumina HumanExome BeadChip, the run time to call one sample from a GTC file is 15 seconds and requires about 3 MB of memory (Max Memory Swap = 39 MB). For 971 samples calling from GTC files and generating one large PED file is 145 MB of memory (Max Memory Swap = 438 MB). For 384 samples from an Illumina Genome Studio report, it takes about 13 minutes to generate a new TPED and TFAM file with predetermined thresholds and 129 MB of memory (Max Memory Swap = 289 MB). It is assumed that the number, order, and names of sites match between the EGT, GTC, GenomeStudio report, and .bpm.csv files. -For best results, we recommend using at least 1000 samples to derive the thresholds. This can either be an EGT file clustered with 1000 samples or from a GenomeStudio report with 1000 samples. We found that we got the best results using a z-score threshold equal to 7; however we recommend using the calibration step to verify that 7 will work well for your data. When using GenCall to generate the original genotype calls, we got the best results when using a smaller cluster file (90 samples) rather than a larger cluster file (9,479 samples). Please see Goldstein,J.I. et al. (2012) zCall: A Rare Variant Caller for Array-based Genotyping. Bioinformatics: 28(19):2543-2545 for more information. \ No newline at end of file +For best results, we recommend using at least 1000 samples to derive the thresholds. This can either be an EGT file clustered with 1000 samples or from a GenomeStudio report with 1000 samples. We found that we got the best results using a z-score threshold equal to 7; however we recommend using the calibration step to verify that 7 will work well for your data. When using GenCall to generate the original genotype calls, we got the best results when using a smaller cluster file (90 samples) rather than a larger cluster file (9,479 samples). Please see Goldstein,J.I. et al. (2012) zCall: A Rare Variant Caller for Array-based Genotyping. Bioinformatics: 28(19):2543-2545 for more information. diff --git a/src/data/.gitattributes b/src/data/.gitattributes new file mode 100644 index 0000000..e75bbb0 --- /dev/null +++ b/src/data/.gitattributes @@ -0,0 +1,2 @@ +# do not diff the .csv manifest (treat as binary data) +*.bpm.csv -crlf -diff \ No newline at end of file diff --git a/src/data/.gitignore b/src/data/.gitignore new file mode 100644 index 0000000..d305b6c --- /dev/null +++ b/src/data/.gitignore @@ -0,0 +1,6 @@ +# ignore output from tests +thresholds_*.txt +*.egt +*.bpm.csv +thresholds.json +tmp* \ No newline at end of file diff --git a/src/data/gtc.json b/src/data/gtc.json new file mode 100644 index 0000000..af192fd --- /dev/null +++ b/src/data/gtc.json @@ -0,0 +1,26 @@ +[ + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R01C01.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R01C02.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R02C01.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R02C02.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R03C01.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R03C02.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R04C01.gtc" + }, + { + "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R04C02.gtc" + } +] diff --git a/src/data/metrics00.json b/src/data/metrics00.json new file mode 100644 index 0000000..3d3e3d0 --- /dev/null +++ b/src/data/metrics00.json @@ -0,0 +1 @@ +[["7197037167_R01C01.gtc", "6", 0.9857857577032777, 0.9850746268656716, {"1:0": 0, "1:1": 9624, "1:2": 140, "1:3": 0, "0:1": 17, "0:0": 1, "0:3": 2, "0:2": 47, "2:3": 1, "2:2": 8372, "2:1": 35, "2:0": 5, "3:2": 122, "3:3": 3087, "3:0": 1, "3:1": 0}], ["7197037167_R01C01.gtc", "7", 0.9895731051573385, 0.9850746268656716, {"1:0": 0, "1:1": 9702, "1:2": 62, "1:3": 0, "0:1": 23, "0:0": 1, "0:3": 3, "0:2": 40, "2:3": 3, "2:2": 8349, "2:1": 50, "2:0": 11, "3:2": 96, "3:3": 3113, "3:0": 1, "3:1": 0}], ["7197037167_R01C01.gtc", "8", 0.9910225838125964, 0.9850746268656716, {"1:0": 1, "1:1": 9743, "1:2": 20, "1:3": 0, "0:1": 27, "0:0": 1, "0:3": 3, "0:2": 36, "2:3": 12, "2:2": 8328, "2:1": 57, "2:0": 16, "3:2": 84, "3:3": 3124, "3:0": 2, "3:1": 0}], ["7197037167_R01C02.gtc", "6", 0.9887377914855834, 0.9818181818181818, {"1:0": 1, "1:1": 9775, "1:2": 73, "1:3": 0, "0:1": 17, "0:0": 1, "0:3": 2, "0:2": 35, "2:3": 1, "2:2": 8186, "2:1": 31, "2:0": 4, "3:2": 129, "3:3": 3197, "3:0": 2, "3:1": 0}], ["7197037167_R01C02.gtc", "7", 0.9915883919809337, 0.9636363636363636, {"1:0": 1, "1:1": 9822, "1:2": 26, "1:3": 0, "0:1": 19, "0:0": 2, "0:3": 2, "0:2": 32, "2:3": 4, "2:2": 8169, "2:1": 44, "2:0": 5, "3:2": 97, "3:3": 3228, "3:0": 3, "3:1": 0}], ["7197037167_R01C02.gtc", "8", 0.9921958970045329, 0.9636363636363636, {"1:0": 1, "1:1": 9838, "1:2": 10, "1:3": 0, "0:1": 23, "0:0": 2, "0:3": 2, "0:2": 28, "2:3": 7, "2:2": 8150, "2:1": 56, "2:0": 9, "3:2": 81, "3:3": 3244, "3:0": 3, "3:1": 0}], ["7197037167_R02C01.gtc", "6", 0.9892543449822463, 0.98, {"1:0": 0, "1:1": 9965, "1:2": 31, "1:3": 0, "0:1": 11, "0:0": 1, "0:3": 0, "0:2": 38, "2:3": 2, "2:2": 8087, "2:1": 29, "2:0": 5, "3:2": 161, "3:3": 3122, "3:0": 2, "3:1": 0}], ["7197037167_R02C01.gtc", "7", 0.9907961128760979, 0.98, {"1:0": 1, "1:1": 9981, "1:2": 14, "1:3": 0, "0:1": 15, "0:0": 1, "0:3": 1, "0:2": 33, "2:3": 4, "2:2": 8071, "2:1": 42, "2:0": 6, "3:2": 128, "3:3": 3155, "3:0": 2, "3:1": 0}], ["7197037167_R02C01.gtc", "8", 0.9907493926368903, 0.98, {"1:0": 1, "1:1": 9988, "1:2": 7, "1:3": 0, "0:1": 16, "0:0": 1, "0:3": 1, "0:2": 32, "2:3": 10, "2:2": 8053, "2:1": 50, "2:0": 10, "3:2": 116, "3:3": 3165, "3:0": 4, "3:1": 0}], ["7197037167_R02C02.gtc", "6", 0.9354292781357808, 0.9998358233459202, {"1:0": 0, "1:1": 1730, "1:2": 911, "1:3": 0, "0:1": 113, "0:0": 1, "0:3": 19, "0:2": 5958, "2:3": 5, "2:2": 12610, "2:1": 38, "2:0": 2, "3:2": 36, "3:3": 31, "3:0": 0, "3:1": 0}], ["7197037167_R02C02.gtc", "7", 0.943826075636269, 0.9998358233459202, {"1:0": 1, "1:1": 1880, "1:2": 760, "1:3": 0, "0:1": 144, "0:0": 1, "0:3": 28, "0:2": 5918, "2:3": 9, "2:2": 12582, "2:1": 61, "2:0": 3, "3:2": 29, "3:3": 38, "3:0": 0, "3:1": 0}], ["7197037167_R02C02.gtc", "8", 0.9526785133112022, 0.9998358233459202, {"1:0": 1, "1:1": 2046, "1:2": 594, "1:3": 0, "0:1": 200, "0:0": 1, "0:3": 43, "0:2": 5847, "2:3": 17, "2:2": 12544, "2:1": 87, "2:0": 7, "3:2": 21, "3:3": 46, "3:0": 0, "3:1": 0}]] \ No newline at end of file diff --git a/src/data/metrics01.json b/src/data/metrics01.json new file mode 100644 index 0000000..8ce1c49 --- /dev/null +++ b/src/data/metrics01.json @@ -0,0 +1 @@ +[["7197037167_R03C01.gtc", "6", 0.9780348445110528, 0.9705882352941176, {"1:0": 0, "1:1": 9706, "1:2": 218, "1:3": 0, "0:1": 27, "0:0": 3, "0:3": 9, "0:2": 63, "2:3": 1, "2:2": 7895, "2:1": 31, "2:0": 1, "3:2": 214, "3:3": 3282, "3:0": 4, "3:1": 0}], ["7197037167_R03C01.gtc", "7", 0.9855282877482203, 0.9705882352941176, {"1:0": 0, "1:1": 9824, "1:2": 100, "1:3": 0, "0:1": 37, "0:0": 3, "0:3": 9, "0:2": 53, "2:3": 3, "2:2": 7882, "2:1": 41, "2:0": 2, "3:2": 158, "3:3": 3337, "3:0": 5, "3:1": 0}], ["7197037167_R03C01.gtc", "8", 0.9886193330835519, 0.9607843137254902, {"1:0": 0, "1:1": 9875, "1:2": 49, "1:3": 0, "0:1": 42, "0:0": 4, "0:3": 13, "0:2": 43, "2:3": 5, "2:2": 7866, "2:1": 52, "2:0": 5, "3:2": 126, "3:3": 3368, "3:0": 6, "3:1": 0}], ["7197037167_R03C02.gtc", "6", 0.9849420127197905, 0.9714285714285714, {"1:0": 0, "1:1": 9843, "1:2": 161, "1:3": 0, "0:1": 15, "0:0": 2, "0:3": 4, "0:2": 49, "2:3": 2, "2:2": 8111, "2:1": 25, "2:0": 2, "3:2": 131, "3:3": 3108, "3:0": 1, "3:1": 0}], ["7197037167_R03C02.gtc", "7", 0.9884493078937523, 0.9714285714285714, {"1:0": 1, "1:1": 9920, "1:2": 83, "1:3": 0, "0:1": 18, "0:0": 2, "0:3": 5, "0:2": 45, "2:3": 6, "2:2": 8088, "2:1": 42, "2:0": 4, "3:2": 110, "3:3": 3129, "3:0": 1, "3:1": 0}], ["7197037167_R03C02.gtc", "8", 0.9907875046763935, 0.9714285714285714, {"1:0": 2, "1:1": 9967, "1:2": 35, "1:3": 0, "0:1": 23, "0:0": 2, "0:3": 7, "0:2": 38, "2:3": 6, "2:2": 8073, "2:1": 54, "2:0": 7, "3:2": 90, "3:3": 3147, "3:0": 3, "3:1": 0}], ["7197037167_R04C01.gtc", "6", 0.9806225134565879, 0.9887640449438202, {"1:0": 0, "1:1": 9632, "1:2": 235, "1:3": 0, "0:1": 13, "0:0": 1, "0:3": 11, "0:2": 64, "2:3": 2, "2:2": 8151, "2:1": 29, "2:0": 3, "3:2": 141, "3:3": 3168, "3:0": 4, "3:1": 0}], ["7197037167_R04C01.gtc", "7", 0.9870816756377252, 0.9887640449438202, {"1:0": 1, "1:1": 9756, "1:2": 110, "1:3": 0, "0:1": 18, "0:0": 1, "0:3": 12, "0:2": 58, "2:3": 4, "2:2": 8135, "2:1": 43, "2:0": 3, "3:2": 108, "3:3": 3198, "3:0": 7, "3:1": 0}], ["7197037167_R04C01.gtc", "8", 0.9894687573133629, 0.9887640449438202, {"1:0": 1, "1:1": 9815, "1:2": 51, "1:3": 0, "0:1": 28, "0:0": 1, "0:3": 14, "0:2": 46, "2:3": 9, "2:2": 8112, "2:1": 57, "2:0": 7, "3:2": 91, "3:3": 3213, "3:0": 9, "3:1": 0}], ["7197037167_R04C02.gtc", "6", 0.984046783625731, 0.9873417721518988, {"1:0": 0, "1:1": 9992, "1:2": 164, "1:3": 0, "0:1": 21, "0:0": 1, "0:3": 5, "0:2": 52, "2:3": 2, "2:2": 7951, "2:1": 36, "2:0": 5, "3:2": 133, "3:3": 3091, "3:0": 1, "3:1": 0}], ["7197037167_R04C02.gtc", "7", 0.9883976608187135, 0.9873417721518988, {"1:0": 0, "1:1": 10069, "1:2": 87, "1:3": 0, "0:1": 28, "0:0": 1, "0:3": 6, "0:2": 44, "2:3": 5, "2:2": 7932, "2:1": 46, "2:0": 11, "3:2": 96, "3:3": 3126, "3:0": 3, "3:1": 0}], ["7197037167_R04C02.gtc", "8", 0.9905029239766082, 0.9873417721518988, {"1:0": 0, "1:1": 10115, "1:2": 41, "1:3": 0, "0:1": 33, "0:0": 1, "0:3": 6, "0:2": 39, "2:3": 8, "2:2": 7916, "2:1": 57, "2:0": 13, "3:2": 80, "3:3": 3141, "3:0": 4, "3:1": 0}]] \ No newline at end of file diff --git a/src/data/test1_sample.json b/src/data/test1_sample.json new file mode 100644 index 0000000..6e82fd9 --- /dev/null +++ b/src/data/test1_sample.json @@ -0,0 +1 @@ +[{"gender_code": -1, "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R01C01.gtc", "uri": "urn:wtsi:test_sample000"}, {"gender_code": -1, "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R01C02.gtc", "uri": "urn:wtsi:test_sample001"}, {"gender_code": -1, "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R02C01.gtc", "uri": "urn:wtsi:test_sample002"}, {"gender_code": -1, "result": "/nfs/gapi/data/genotype/zcall_test/7197037167_R02C02.gtc", "uri": "urn:wtsi:test_sample003"}] \ No newline at end of file diff --git a/src/data/test2_sample.json b/src/data/test2_sample.json new file mode 100644 index 0000000..9844f65 --- /dev/null +++ b/src/data/test2_sample.json @@ -0,0 +1,66 @@ +[ + { + "gender": "Female", + "gender_code": 2, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R01C01.gtc", + "sanger_sample_id": "hhtmri5617332", + "uri": "urn:wtsi:310735_A01_hhtmri5617332" + }, + { + "gender": "Male", + "gender_code": 1, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R02C01.gtc", + "sanger_sample_id": "hhtmri5617334", + "uri": "urn:wtsi:310735_B01_hhtmri5617334" + }, + { + "gender": "Female", + "gender_code": 2, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R03C01.gtc", + "sanger_sample_id": "hhtmri5617335", + "uri": "urn:wtsi:310735_C01_hhtmri5617335" + }, + { + "gender": "Female", + "gender_code": 2, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R04C01.gtc", + "sanger_sample_id": "hhtmri5617336", + "uri": "urn:wtsi:310735_D01_hhtmri5617336" + }, + { + "gender": "Female", + "gender_code": 2, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R05C01.gtc", + "sanger_sample_id": "hhtmri5617337", + "uri": "urn:wtsi:310735_E01_hhtmri5617337" + }, + { + "gender": "Male", + "gender_code": 1, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R06C01.gtc", + "sanger_sample_id": "hhtmri5617338", + "uri": "urn:wtsi:310735_F01_hhtmri5617338" + }, + { + "gender": "Female", + "gender_code": 2, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R01C02.gtc", + "sanger_sample_id": "hhtmri5617339", + "uri": "urn:wtsi:310735_G01_hhtmri5617339" + }, + { + "gender": "Female", + "gender_code": 2, + "gender_method": "Supplied", + "result": "/nfs/gapi/data/genotype/zcall_test/9422798202_R02C02.gtc", + "sanger_sample_id": "hhtmri5617340", + "uri": "urn:wtsi:310735_H01_hhtmri5617340" + } +] \ No newline at end of file diff --git a/src/data/zEvaluation.json b/src/data/zEvaluation.json new file mode 100644 index 0000000..3f9ed85 --- /dev/null +++ b/src/data/zEvaluation.json @@ -0,0 +1 @@ +{"BEST_THRESHOLDS": "/nfs/gapi/data/genotype/zcall_test/thresholds_HumanExome-12v1_z08.txt", "SAMPLE_METRICS": [["7197037167_R01C01.gtc", "6", 0.9857857577032777, 0.9850746268656716, {"1:0": 0, "1:1": 9624, "1:2": 140, "1:3": 0, "0:1": 17, "0:0": 1, "0:3": 2, "0:2": 47, "2:3": 1, "2:2": 8372, "2:1": 35, "2:0": 5, "3:2": 122, "3:3": 3087, "3:0": 1, "3:1": 0}], ["7197037167_R01C01.gtc", "7", 0.9895731051573385, 0.9850746268656716, {"1:0": 0, "1:1": 9702, "1:2": 62, "1:3": 0, "0:1": 23, "0:0": 1, "0:3": 3, "0:2": 40, "2:3": 3, "2:2": 8349, "2:1": 50, "2:0": 11, "3:2": 96, "3:3": 3113, "3:0": 1, "3:1": 0}], ["7197037167_R01C01.gtc", "8", 0.9910225838125964, 0.9850746268656716, {"1:0": 1, "1:1": 9743, "1:2": 20, "1:3": 0, "0:1": 27, "0:0": 1, "0:3": 3, "0:2": 36, "2:3": 12, "2:2": 8328, "2:1": 57, "2:0": 16, "3:2": 84, "3:3": 3124, "3:0": 2, "3:1": 0}], ["7197037167_R01C02.gtc", "6", 0.9887377914855834, 0.9818181818181818, {"1:0": 1, "1:1": 9775, "1:2": 73, "1:3": 0, "0:1": 17, "0:0": 1, "0:3": 2, "0:2": 35, "2:3": 1, "2:2": 8186, "2:1": 31, "2:0": 4, "3:2": 129, "3:3": 3197, "3:0": 2, "3:1": 0}], ["7197037167_R01C02.gtc", "7", 0.9915883919809337, 0.9636363636363636, {"1:0": 1, "1:1": 9822, "1:2": 26, "1:3": 0, "0:1": 19, "0:0": 2, "0:3": 2, "0:2": 32, "2:3": 4, "2:2": 8169, "2:1": 44, "2:0": 5, "3:2": 97, "3:3": 3228, "3:0": 3, "3:1": 0}], ["7197037167_R01C02.gtc", "8", 0.9921958970045329, 0.9636363636363636, {"1:0": 1, "1:1": 9838, "1:2": 10, "1:3": 0, "0:1": 23, "0:0": 2, "0:3": 2, "0:2": 28, "2:3": 7, "2:2": 8150, "2:1": 56, "2:0": 9, "3:2": 81, "3:3": 3244, "3:0": 3, "3:1": 0}], ["7197037167_R02C01.gtc", "6", 0.9892543449822463, 0.98, {"1:0": 0, "1:1": 9965, "1:2": 31, "1:3": 0, "0:1": 11, "0:0": 1, "0:3": 0, "0:2": 38, "2:3": 2, "2:2": 8087, "2:1": 29, "2:0": 5, "3:2": 161, "3:3": 3122, "3:0": 2, "3:1": 0}], ["7197037167_R02C01.gtc", "7", 0.9907961128760979, 0.98, {"1:0": 1, "1:1": 9981, "1:2": 14, "1:3": 0, "0:1": 15, "0:0": 1, "0:3": 1, "0:2": 33, "2:3": 4, "2:2": 8071, "2:1": 42, "2:0": 6, "3:2": 128, "3:3": 3155, "3:0": 2, "3:1": 0}], ["7197037167_R02C01.gtc", "8", 0.9907493926368903, 0.98, {"1:0": 1, "1:1": 9988, "1:2": 7, "1:3": 0, "0:1": 16, "0:0": 1, "0:3": 1, "0:2": 32, "2:3": 10, "2:2": 8053, "2:1": 50, "2:0": 10, "3:2": 116, "3:3": 3165, "3:0": 4, "3:1": 0}], ["7197037167_R02C02.gtc", "6", 0.9354292781357808, 0.9998358233459202, {"1:0": 0, "1:1": 1730, "1:2": 911, "1:3": 0, "0:1": 113, "0:0": 1, "0:3": 19, "0:2": 5958, "2:3": 5, "2:2": 12610, "2:1": 38, "2:0": 2, "3:2": 36, "3:3": 31, "3:0": 0, "3:1": 0}], ["7197037167_R02C02.gtc", "7", 0.943826075636269, 0.9998358233459202, {"1:0": 1, "1:1": 1880, "1:2": 760, "1:3": 0, "0:1": 144, "0:0": 1, "0:3": 28, "0:2": 5918, "2:3": 9, "2:2": 12582, "2:1": 61, "2:0": 3, "3:2": 29, "3:3": 38, "3:0": 0, "3:1": 0}], ["7197037167_R02C02.gtc", "8", 0.9526785133112022, 0.9998358233459202, {"1:0": 1, "1:1": 2046, "1:2": 594, "1:3": 0, "0:1": 200, "0:0": 1, "0:3": 43, "0:2": 5847, "2:3": 17, "2:2": 12544, "2:1": 87, "2:0": 7, "3:2": 21, "3:3": 46, "3:0": 0, "3:1": 0}], ["7197037167_R03C01.gtc", "6", 0.9780348445110528, 0.9705882352941176, {"1:0": 0, "1:1": 9706, "1:2": 218, "1:3": 0, "0:1": 27, "0:0": 3, "0:3": 9, "0:2": 63, "2:3": 1, "2:2": 7895, "2:1": 31, "2:0": 1, "3:2": 214, "3:3": 3282, "3:0": 4, "3:1": 0}], ["7197037167_R03C01.gtc", "7", 0.9855282877482203, 0.9705882352941176, {"1:0": 0, "1:1": 9824, "1:2": 100, "1:3": 0, "0:1": 37, "0:0": 3, "0:3": 9, "0:2": 53, "2:3": 3, "2:2": 7882, "2:1": 41, "2:0": 2, "3:2": 158, "3:3": 3337, "3:0": 5, "3:1": 0}], ["7197037167_R03C01.gtc", "8", 0.9886193330835519, 0.9607843137254902, {"1:0": 0, "1:1": 9875, "1:2": 49, "1:3": 0, "0:1": 42, "0:0": 4, "0:3": 13, "0:2": 43, "2:3": 5, "2:2": 7866, "2:1": 52, "2:0": 5, "3:2": 126, "3:3": 3368, "3:0": 6, "3:1": 0}], ["7197037167_R03C02.gtc", "6", 0.9849420127197905, 0.9714285714285714, {"1:0": 0, "1:1": 9843, "1:2": 161, "1:3": 0, "0:1": 15, "0:0": 2, "0:3": 4, "0:2": 49, "2:3": 2, "2:2": 8111, "2:1": 25, "2:0": 2, "3:2": 131, "3:3": 3108, "3:0": 1, "3:1": 0}], ["7197037167_R03C02.gtc", "7", 0.9884493078937523, 0.9714285714285714, {"1:0": 1, "1:1": 9920, "1:2": 83, "1:3": 0, "0:1": 18, "0:0": 2, "0:3": 5, "0:2": 45, "2:3": 6, "2:2": 8088, "2:1": 42, "2:0": 4, "3:2": 110, "3:3": 3129, "3:0": 1, "3:1": 0}], ["7197037167_R03C02.gtc", "8", 0.9907875046763935, 0.9714285714285714, {"1:0": 2, "1:1": 9967, "1:2": 35, "1:3": 0, "0:1": 23, "0:0": 2, "0:3": 7, "0:2": 38, "2:3": 6, "2:2": 8073, "2:1": 54, "2:0": 7, "3:2": 90, "3:3": 3147, "3:0": 3, "3:1": 0}], ["7197037167_R04C01.gtc", "6", 0.9806225134565879, 0.9887640449438202, {"1:0": 0, "1:1": 9632, "1:2": 235, "1:3": 0, "0:1": 13, "0:0": 1, "0:3": 11, "0:2": 64, "2:3": 2, "2:2": 8151, "2:1": 29, "2:0": 3, "3:2": 141, "3:3": 3168, "3:0": 4, "3:1": 0}], ["7197037167_R04C01.gtc", "7", 0.9870816756377252, 0.9887640449438202, {"1:0": 1, "1:1": 9756, "1:2": 110, "1:3": 0, "0:1": 18, "0:0": 1, "0:3": 12, "0:2": 58, "2:3": 4, "2:2": 8135, "2:1": 43, "2:0": 3, "3:2": 108, "3:3": 3198, "3:0": 7, "3:1": 0}], ["7197037167_R04C01.gtc", "8", 0.9894687573133629, 0.9887640449438202, {"1:0": 1, "1:1": 9815, "1:2": 51, "1:3": 0, "0:1": 28, "0:0": 1, "0:3": 14, "0:2": 46, "2:3": 9, "2:2": 8112, "2:1": 57, "2:0": 7, "3:2": 91, "3:3": 3213, "3:0": 9, "3:1": 0}], ["7197037167_R04C02.gtc", "6", 0.984046783625731, 0.9873417721518988, {"1:0": 0, "1:1": 9992, "1:2": 164, "1:3": 0, "0:1": 21, "0:0": 1, "0:3": 5, "0:2": 52, "2:3": 2, "2:2": 7951, "2:1": 36, "2:0": 5, "3:2": 133, "3:3": 3091, "3:0": 1, "3:1": 0}], ["7197037167_R04C02.gtc", "7", 0.9883976608187135, 0.9873417721518988, {"1:0": 0, "1:1": 10069, "1:2": 87, "1:3": 0, "0:1": 28, "0:0": 1, "0:3": 6, "0:2": 44, "2:3": 5, "2:2": 7932, "2:1": 46, "2:0": 11, "3:2": 96, "3:3": 3126, "3:0": 3, "3:1": 0}], ["7197037167_R04C02.gtc", "8", 0.9905029239766082, 0.9873417721518988, {"1:0": 0, "1:1": 10115, "1:2": 41, "1:3": 0, "0:1": 33, "0:0": 1, "0:3": 6, "0:2": 39, "2:3": 8, "2:2": 7916, "2:1": 57, "2:0": 13, "3:2": 80, "3:3": 3141, "3:0": 4, "3:1": 0}]], "BEST_Z": "8"} \ No newline at end of file diff --git a/src/doc/concordance_gain_means.png b/src/doc/concordance_gain_means.png new file mode 100644 index 0000000..9aafd1d Binary files /dev/null and b/src/doc/concordance_gain_means.png differ diff --git a/src/doc/zCallExtended.pdf b/src/doc/zCallExtended.pdf new file mode 100644 index 0000000..140d601 Binary files /dev/null and b/src/doc/zCallExtended.pdf differ diff --git a/src/doc/zCallExtended.tex b/src/doc/zCallExtended.tex new file mode 100644 index 0000000..ef884ed --- /dev/null +++ b/src/doc/zCallExtended.tex @@ -0,0 +1,113 @@ +\documentclass{article} + +\title{Extended zCall} +\author{Iain Bancarz\\ + Wellcome Trust Sanger Institute\\ + \texttt{ib5@sanger.ac.uk}} +\date{3 April 2013} + +\usepackage{graphicx} +\usepackage{placeins} +\usepackage{url} + +\begin{document} + +\maketitle + +\section{Introduction} + +zCall is a rare variant caller for array-based genotyping, developed by JI Goldstein et al.\ \cite{gold}. This document describes a major extension of zCall, intended to support inclusion in the Wellcome Trust Sanger Institute Genotyping Pipeline \cite{wtsi-gp, wtsi-zcall}. + +\section{Summary of new features} + +\begin{itemize} +\item \textbf{Calibration:} Definition of metrics and a heuristic for evaluating candidate zscore values; see Section \ref{sec:calib}. +\item \textbf{Automation:} Automated implementation of zCall, invoked by either a single command-line script, or four separate scripts to enable parallelization. +\item \textbf{Logging:} Intermediate metadata, such as results of threshold evaluation, is saved in .json format files. Results of applying zCall to previously uncalled genotypes are summarized in a file \texttt{zcall\_log.json}. +\item \textbf{Plink output:} Output in the Plink genotyping data format, in either binary or non-binary form \cite{plink, plink-soft}. +\item \textbf{Pydoc:} The \texttt{createDocs.py} script uses Python's Pydoc module to generate HTML documentation for zCall code and its dependencies. +\item \textbf{Unit tests:} Using the standard Python unit test suite, in \texttt{src/test/test.py}. +\end{itemize} + +\section{Calibration} +\label{sec:calib} + +The zCall method requires the choice of a threshold known as the `zscore'. This threshold is the number of standard deviations away from the mean of genotyping intensity clusters, to be used as a boundary between genotype calls. Goldstein et al.\ recommended a zscore of 7 as a suitable default value \cite{gold}. A threshold of 7 standard deviations by definition will affect very few calls, but this is appropriate, as zCall is intended as a post-processing step applied to a small subset of `badly behaved' data points. + +Goldstein et al.\ note that the choice of zscore can be informed by comparing the outputs of zCall and another caller. We now present a systematic procedure for evaluating candidate zscores on a given test data set. + +\subsection{Metrics} + +In production, zCall is applied to data points which failed to be called by a `default caller'. The default caller is typically Illumina's GenCall software, but may also be a third-party package such as Illuminus or GenoSNP \cite{genosnp, illuminus}. + +For an input data point $x$, let $d(x)$ be the call under a default caller and $Z(x,k)$ be the call made by zCall with a zscore of $k$. Let $\phi$ represent a no-call. We define two metrics which may be used to assess a given zscore on a set of test data: +\begin{itemize} +\item \textbf{Concordance:} The proportion of points $x$ such that $d(x) \neq \phi$ and $d(x) = Z(x,k)$. These points are called by both callers, and are given the same call by each. +\item \textbf{Gain:} The proportion of points $x$ such that $d(x) = \phi$ and $Z(x,k) \neq \phi$. These points are called by zCall, but not by the default caller. +\end{itemize} + +\subsection{Heuristic for choosing a zscore} + +Intuitively, we would like to maximize both concordance and gain. Assuming that our default caller is reasonably good at finding the true genotype, we would like zCall to agree with it. Conversely, we would like as many no-calls as possible to be `rescued' by zCall. + +The gain is maximized where $z=0$ and will decrease as $z$ increases; the higher the value of $z$, the greater the volume of intensity space which zCall will designate for no-calls. Concordance is likely to be low for very low or very high values of $z$, and maximized for some intermediate value. + +Suppose that we have evaluated the concordance and gain for a range of candidate zscores. We define the following heuristic for choosing a candidate zscore $z$: + +\begin{quote}Choose the smallest value of $z$ such that concordance is greater than gain; or if no such $z$ is in the set of candidates, the value of $z$ with the greatest concordance. +\end{quote} + +The above heuristic is simple to automate, and has been implemented in the zCall extended software. It can be thought of as re-calling all data points with multiple zscores as a first pass, after which zcall is applied with the chosen zscore to re-call any no-calls. + +Evaluating candidate thresholds with this heuristic requires much more computation than the actual application of zcall. However, it is straightforward to parallelize by evaluating different zscores and subsets of samples on different processes; this will be implemented in the WTSI Genotyping Pipeline \cite{wtsi-gp}. + +In addition, zCall does not require highly intensive computation, and a single processor is easily sufficient to calibrate and call for medium-sized test datasets such as the one in Section \ref{sec:evaluation}. For larger datasets, one may use parallel processing, or simply evaluate candidate zscores on a representative subset of the available samples. + +\subsection{Evaluation on test data} +\label{sec:evaluation} + +The heuristic has been found to achieve good results on test data. In assessment on a set of 94 samples genotyped with the Illumina HumanExome-12v1 beadchip, the criterion was met at $z=7$, with concordance and gain of 96.9\% and 96.3\% respectively. Concordance and gain for a range of zscores on this dataset are shown in Figure \ref{fig:zscores}. + +\begin{figure}[t] +\includegraphics[width=1.0\textwidth]{concordance_gain_means.png} +\caption{Concordance and gain metrics on a test dataset.} +\label{fig:zscores} +\end{figure} + +In this case, the chosen zscore happened to equal the default suggested by Goldstein et al. In other choices of test dataset, the zscore chosen by the heuristic was typically between 6 and 8. The range of zscores to evaluate for production will depend on the preferences and computational resources of the user, but a range of 4 to 10 may be appropriate. + +\section{Conclusion and further reading} + +This document has summarized changes introduced in extended zcall, and presented metrics and a heuristic for choosing a zscore threshold. + +See the \texttt{README}, \texttt{README\_prototype}, and \texttt{README\_extended} files, in the top-level directory of the zCall repository, for additional details. \texttt{README\_extended} has instructions for installation and usage of extended zCall. The \texttt{updates.txt} file contains a history of changes. Tests are documented in \texttt{src/test/README}. + +\FloatBarrier + +\begin{thebibliography}{9} + +\bibitem{genosnp} A Variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Giannoulatou E, Yau C, Colella S, Ragoussis J, Holmes CC. Bioinformatics. 2008 Oct 1;24(19):2209-14 + +\bibitem{gold} +Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +zCall: a rare variant caller for array-based genotyping. +Genetics and population analysis. Bioinformatics 2012 Oct 1;28(19):2543-2545. Epub 2012 Jul 27. PubMed PMID: 22843986. + +\bibitem{plink} +Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ \& Sham PC (2007). +PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81. + +\bibitem{plink-soft} +Purcell S. PLINK software v1.07. \url{http://pngu.mgh.harvard.edu/purcell/plink/} + +\bibitem{illuminus} A genotype calling algorithm for the Illumina BeadArray platform. Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP and Clark TG. Bioinformatics 2007;23;20;2741-6 + +\bibitem{wtsi-gp} +Wellcome Trust Sanger Institute Genotyping Pipeline. \url{https://github.com/wtsi-npg/genotyping} + +\bibitem{wtsi-zcall} +Wellcome Trust Sanger Institute fork of the zCall repository. \url{https://github.com/wtsi-npg/zcall} + +\end{thebibliography} + +\end{document} diff --git a/src/etc/config.ini b/src/etc/config.ini new file mode 100644 index 0000000..e6cd863 --- /dev/null +++ b/src/etc/config.ini @@ -0,0 +1,12 @@ +; configuration file for zcall +; paths are local to Wellcome Trust Sanger Institute and may require editing + +[zcall] +rscript=/software/R-3.0.0/bin/Rscript +digits=6 +min_concord=0.98 +;profile= ; comment out this line to disable default profiling + +[test] +bigdata=/nfs/gapi/data/genotype/zcall_test + diff --git a/src/test/README b/src/test/README new file mode 100644 index 0000000..42a1586 --- /dev/null +++ b/src/test/README @@ -0,0 +1,15 @@ +Unit tests for zCall/src + +Author: Iain Bancarz, ib5@sanger.ac.uk + +Tests should be run from the src subdirectory of the zCall source code, as cloned from Github. They are run by executing test/run_tests.sh. This runs the Python unit tests and cleans up test output (in a way which avoids errors on NFS). + +The tests require a correctly populated 'bigdata' directory, specified in etc/config.ini. Bigdata contains test files which are too large to upload to github and/or contain confidental data: +- GTC files with intensities and genotype calls. These contain confidential human data. (Eventually these will be replaced by synthetic data without confidentality concerns.) +- EGT files: HumanOmniExpress-12v1-1_b.egt and HumanExome-12v1.egt, available from http://support.illumina.com/downloads.ilmn. +- BPM file: A .bpm.csv file corresponding to the EGT file and manifest for the GTC files. +- gtc.json file: A .json file containing paths to the GTC files. + Threshold inputs: The file thresholds.json containing paths to 3 threshold.txt files, which were generated from the EGT file for z scores (6,7,8). +- Merge inputs: The file evaluation_metrics.json contains paths to additional .json files metrics00.json, metrics01.json, which serve as input to the 'merge evaluation' test. + +The Python tests examine the MD5 checksums of output files generated from the above input. diff --git a/src/test/run_tests.sh b/src/test/run_tests.sh new file mode 100755 index 0000000..2adc6f9 --- /dev/null +++ b/src/test/run_tests.sh @@ -0,0 +1,18 @@ +#! /usr/bin/env bash + +# workaround for annoying NFS error in deleting directories within Python tests + +PYTHON_TEST='test/test.py' +DATA_DIR='data' + +if [ ! -d $DATA_DIR ]; then + echo "Output path '$DATA_DIR' does not exist or is not a directory" + exit 1 +elif [ ! -f $PYTHON_TEST ]; then + echo "$PYTHON_TEST does not exist; should run from src directory of zCall" + exit 1 +else + python $PYTHON_TEST + echo "Finished running $PYTHON_TEST"; + rm -Rf $DATA_DIR/output_test_* +fi \ No newline at end of file diff --git a/src/test/test.py b/src/test/test.py new file mode 100644 index 0000000..4a840b2 --- /dev/null +++ b/src/test/test.py @@ -0,0 +1,296 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + +"""Unit tests for zcall module + +Test cases: +- prepareThresholds.py creates correct thresholds.txt files +- evaluateThresholds.py creates correct evaluation in .json format +- mergeEvaluation.py correctly merges results and identifies best Z score +- call.py produces Plink binary data + +Required input data: +- Example BPM, EGT files +- Example GTC files -- may have privacy issues +- (TODO Test on public GTC data, with corresponding BPM/EGT) + +Author: Iain Bancarz, ib5@sanger.ac.uk +""" + +import json, os, sys, unittest +from ConfigParser import ConfigParser +from hashlib import md5 +from tempfile import mkdtemp + +class TestScripts(unittest.TestCase): + + """Test command-line python scripts used in WTSI genotyping pipeline""" + + def call(self, binary): + """Re-call GTC files using zCall and choice of output""" + outStem = os.path.join(self.outDir, self.prefix) + tPath = os.path.join(self.bigData, 'thresholds_HumanExome-12v1_z07.txt') + logPath = os.path.join(self.outDir, 'zcall_log.json') + args = ['zcall/runZCall.py', + '--thresholds', tPath, + '--bpm', self.bpmPath, + '--egt', self.egtPath, + '--samples', self.sampleJson, + '--out', self.outDir, + '--plink', self.prefix, + '--log', logPath, + ] + if binary: args.append('--binary') + self.assertEqual(os.system(' '.join(args)), 0) # run script + if binary: + suffixes = ['.bed', '.bim', '.fam'] + expected = ['55fa3cfd960d43366cb506ab004ac300', + '19dd8929cd63e3ee906e43b9bb59cd02', + 'b836bd45459de6a9bc4b8d92e8c9e298'] + else: + suffixes = ['.ped', '.map', '.fam'] + expected = ['731e0199b1433228abf80ec8126089d1', + 'df64da4ef76b2df1040df375a8933f45', + 'b836bd45459de6a9bc4b8d92e8c9e298'] + for i in range(len(suffixes)): + self.assertTrue(os.path.exists(outStem+suffixes[i])) + checksum = self.getMD5hex(outStem+suffixes[i]) + self.assertEqual(checksum, expected[i]) + self.assertTrue(os.path.exists(logPath)) + self.validatePlink(self.prefix, binary) + + + def getMD5hex(self, inPath): + """Get MD5 checksum for contents of given file, in hex format""" + m = md5() + m.update(open(inPath).read()) + checksum = m.hexdigest() + return checksum + + def readConfig(self, configPath=None): + """Read local params from config file + + - bigdata: Directory for test files too big to upload to github""" + if configPath==None: + configPath = os.path.abspath('etc/config.ini') + if not os.access(configPath, os.R_OK): + raise IOError("Cannot read config path '"+configPath+"'") + config = ConfigParser() + config.readfp(open(configPath)) + bigData = config.get('test', 'bigdata') + return bigData + + def validateThresholds(self, jsonPath): + """Check that thresholds.txt files are correct + + Use for test_prepareThresholds, and to validate input for other tests + """ + index = json.loads(open(jsonPath).read()) + for z in index.keys(): + self.assertEqual(self.getMD5hex(index[z]), self.expectedT[z]) + + def validatePlink(self, prefix, binary): + """Check that Plink can parse dataset without crashing""" + startDir = os.getcwd() + os.chdir(self.outDir) + if binary: opt = '--bfile' + else: opt = '--file' + cmd = 'plink '+opt+' '+prefix+' > /dev/null' + try: + self.assertEqual(0, os.system(cmd)) + except AssertionError: + os.chdir(startDir) + raise + os.chdir(startDir) + + def setUp(self): + """Check for valid input/output files and directories""" + self.prefix = 'test' + self.dataDir = 'data' + self.bigData = self.readConfig() + for d in (self.dataDir, self.bigData): + if not os.path.exists(d) or not os.path.isdir(d): + msg = "Invalid test directory: \""+d+"\"\n" + sys.stderr.write(msg) + sys.exit(1) + self.expectedT = {"6":"1a53e8cbba6750d43d5ff607cf616beb", + "7":"a8d8b62be728b62fc986230da13f4ef7", + "8":"1f14419d0053841cfa8ab3fb994de1c1"} + self.outDir = mkdtemp(prefix="output_test_", dir=self.dataDir) + print "Created output directory", self.outDir + self.bpmPath = os.path.join(self.bigData, 'HumanExome-12v1_A.bpm.csv') + self.egtPath = os.path.join(self.bigData, 'HumanExome-12v1.egt') + # TODO automatically generate threshold.txt files, if not present + self.gtcPath = os.path.join(self.dataDir, 'gtc.json') + self.metricIndex = os.path.join(self.bigData, + 'evaluation_metric_index.json') + self.thresholdJsonName = 'thresholds.json' + self.thresholdJson = os.path.join(self.bigData, self.thresholdJsonName) + if os.path.exists(self.thresholdJson): + self.validateThresholds(self.thresholdJson) + else: + sys.stderr.write("WARNING: Missing thresholds, see test/README.") + self.sampleJson = os.path.join(self.dataDir, 'test1_sample.json') + + def tearDown(self): + # cannot delete temporary directories from within Python unit test + # not compatible with NFS behaviour; must delete tempdirs manually + pass + + def test_prepareThresholds(self): + """Prepare thresholds.txt files + + Run as part of normal test suite + Can also be used to generate input thresholds for other tests + Checksums should ensure that generated thresholds are correct + """ + zstart = 6 + ztotal = 3 + outPaths = [] + for i in range(zstart, zstart+ztotal): + name = 'thresholds_HumanExome-12v1_z0'+str(i)+'.txt' + outPaths.append(os.path.join(self.outDir, name)) + args = ['zcall/prepareThresholds.py', + '--egt', self.egtPath, + '--out', self.outDir, + '--config etc/config.ini', + '--zstart', str(zstart), + '--ztotal', str(ztotal), + '--index_name', self.thresholdJsonName] + self.assertEqual(os.system(' '.join(args)), 0) # run script + jsonOut = os.path.join(self.outDir, self.thresholdJsonName) + self.assertTrue(os.access(jsonOut, os.R_OK)) + self.validateThresholds(jsonOut) + + def test_evaluateThresholds(self): + """Evaluate thresholds for collections of sample GTC files""" + argsBase = ('zcall/evaluateThresholds.py', + '--egt', self.egtPath, + '--bpm', self.bpmPath, + '--thresholds', self.thresholdJson) + ranges = ((0,4), (4,8)) + for i in range(len(ranges)): + (start, end) = ranges[i] + args = list(argsBase) + name = 'metrics0'+str(i)+'.json' + outPath = os.path.join(self.outDir, name) + args.extend(['--gtc', self.gtcPath, '--start', str(start), + '--end', str(end), '--out', outPath ]) + self.assertEqual(os.system(' '.join(args)), 0) # run script + metricsNew = json.loads(open(outPath).read()) + oldPath = os.path.join(self.dataDir, name) + metricsOld = json.loads(open(oldPath).read()) + self.assertEqual(metricsOld, metricsNew) + + def test_mergeEvaluation(self): + """Merge evaluation results and find best Z score""" + outPath = os.path.join(self.outDir, 'merged_evaluation.json') + args = ['zcall/mergeEvaluation.py', + '--metrics', self.metricIndex, + '--thresholds', self.thresholdJson, + '--out', outPath, + '--text', os.path.join(self.outDir, 'metric_summary.txt'), ] + self.assertEqual(os.system(' '.join(args)), 0) # run script + resultsNew = json.loads(open(outPath).read()) + oldPath = os.path.join(self.dataDir, 'zEvaluation.json') + resultsOld = json.loads(open(oldPath).read()) + for key in ('BEST_Z', 'SAMPLE_METRICS'): + self.assertEqual(resultsOld[key], resultsNew[key]) + newT = resultsNew['BEST_THRESHOLDS'] # threshold.txt path + self.assertEqual(self.expectedT["8"], self.getMD5hex(newT)) + + def test_call_binary(self): + """Re-call GTC files using zCall with binary output""" + self.call(True) + + def test_call_text(self): + """Re-call GTC files using zCall with text output""" + self.call(False) + + def test_complete(self): + """Test self-contained zcall script""" + zstart = 6 + ztotal = 5 + args = ['zcall/zCallComplete.py', + '--bpm', self.bpmPath, + '--egt', self.egtPath, + '--out', self.outDir, + '--zstart', str(zstart), + '--ztotal', str(ztotal), + '--samples', self.sampleJson, + '--text', os.path.join(self.outDir, 'metric_summary.txt'), + '--plink', self.prefix, + '--binary' + ] + self.assertEqual(os.system(' '.join(args)), 0) + outStem = os.path.join(self.outDir, self.prefix) + suffixes = ['.bed', '.bim', '.fam'] + expected = ['8e222b46b0760cba5de1d2bded337c76', + '19dd8929cd63e3ee906e43b9bb59cd02', + 'b836bd45459de6a9bc4b8d92e8c9e298'] + for i in range(len(suffixes)): + self.assertTrue(os.path.exists(outStem+suffixes[i])) + checksum = self.getMD5hex(outStem+suffixes[i]) + self.assertEqual(checksum, expected[i]) + self.validatePlink(self.prefix, True) + + def test_complete_alternate(self): + """As for test_complete, but with alternate dataset + + Manifest has 719665 SNPs, so this checks the null padding operation""" + zstart = 7 + ztotal = 1 + self.bpmPath = os.path.join(self.bigData, 'HumanOmniExpress-12v1-1_A.bpm.csv') + self.egtPath = os.path.join(self.bigData, 'HumanOmniExpress-12v1-1_b.egt') + self.sampleJson = os.path.join(self.dataDir, 'test2_sample.json') + args = ['zcall/zCallComplete.py', + '--bpm', self.bpmPath, + '--egt', self.egtPath, + '--out', self.outDir, + '--zstart', str(zstart), + '--ztotal', str(ztotal), + '--samples', self.sampleJson, + '--text', os.path.join(self.outDir, 'metric_summary.txt'), + '--plink', self.prefix, + '--binary' + ] + self.assertEqual(os.system(' '.join(args)), 0) + outStem = os.path.join(self.outDir, self.prefix) + suffixes = ['.bed', '.bim', '.fam'] + expected = ['863eb70d672d5ce71cead57b767a4688', + 'cccb69c7bb690ee15cb6397b7d20db41', + '14e7a7ffaf112132ffa12e1cbf8c8892'] + for i in range(len(suffixes)): + self.assertTrue(os.path.exists(outStem+suffixes[i])) + checksum = self.getMD5hex(outStem+suffixes[i]) + self.assertEqual(checksum, expected[i]) + self.validatePlink(self.prefix, True) + + +if __name__ == "__main__": + unittest.main(verbosity=2) diff --git a/src/zcall/BPM.py b/src/zcall/BPM.py new file mode 100644 index 0000000..39148e7 --- /dev/null +++ b/src/zcall/BPM.py @@ -0,0 +1,78 @@ +#! /usr/bin/python + +# Original code supplied by Illumina, Inc. subsequently modified by Broad +# Institute and Genome Research Ltd. The Illumina provided code was provided +# as-is and with no warranty as to performance and no warranty against it +# infringing any other party's intellectual property rights. All contributions +# are copyright their respective authors. See https://github.com/wtsi-npg/zCall +# for revision history. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +class BPM: + ''' Python class to parse a .bpm.csv file ''' + def __init__(self, bpmFile): + self.names = [] + self.chr = [] + self.pos = [] + self.normID = [] + self.A = [] + self.B = [] + self.ilmnStrand = [] # illumina strand designation + self.numSNPs = 0 + self.readFile(bpmFile) + + def readFile(self, bpmFile): + """ read input from .bpm.csv path and update instance variables """ + for line in open(bpmFile, 'r'): + line = line.replace("\n", "") + line = line.replace("\r", "") + + fields = line.split(",") + + if line.find("Chromosome") != -1: + continue + + else: + self.names.append(fields[1]) + self.chr.append(fields[2]) + self.pos.append(fields[3]) + alleles = fields[5].replace("[", "") + alleles = alleles.replace("]", "") + alleles = alleles.split("/") + self.A.append(alleles[0]) # allele A + self.B.append(alleles[1]) # allele B + self.ilmnStrand.append(fields[6][0]) # one of T,B,M,P + self.normID.append(int(fields[8])) # normalization ID for that snp + self.numSNPs = len(self.names) + + + def getChromosomes(self): + return self.chr + + def getPositions(self): + return self.pos + + def getTotalSNPs(self): + return self.numSNPs diff --git a/src/zcall/EGT.py b/src/zcall/EGT.py new file mode 100644 index 0000000..0d8596d --- /dev/null +++ b/src/zcall/EGT.py @@ -0,0 +1,216 @@ +#! /usr/bin/python + +# Original code supplied by Illumina, Inc. subsequently modified by Broad +# Institute and Genome Research Ltd. The Illumina provided code was provided +# as-is and with no warranty as to performance and no warranty against it +# infringing any other party's intellectual property rights. All contributions +# are copyright their respective authors. See https://github.com/wtsi-npg/zCall +# for revision history. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +import struct +import math + +class EGT: + ''' Class to parse an Illumina .egt file''' + def __init__(self, file): + self.f = open(file, 'rb') + self.readHeaderData() + self.readFileData() + self.numPoints = max(self.nTotal) # estimate the number of samples used for clustering by looking at the maximum number of points used for clustering was + + def readFileData(self): + self.version = self.getInt() + self.opa = self.getString() + self.numCodes = self.getInt() # number of SNPs in file + self.parseClusterPositions() # get mean and standard deviations of each cluster + self.readSNPnames() # get names of each snp + + def readHeaderData(self): + self.FileVersion = self.getInt() + self.GcVersion = self.getString() + self.ClusterVersion = self.getString() + self.CallVersion = self.getString() + self.NormalizationVersion = self.getString() + self.DateCreated = self.getString() + self.Mode = self.getByte() + self.Manifest = self.getString() + + def readSNPnames(self): + self.names = [] + + buffer = self.f.read(13 * self.numCodes) # skip snp quality scores + + for i in range(self.numCodes): # skip genotype scores + gt = self.getString() + + for i in range(self.numCodes): # parse snp names + snp = self.getString() + self.names.append(snp) + + def getTotalSNPs(self): + return self.numCodes + + def parseClusterPositions(self): + ''' + Parse cluster positions for each site and transform from r,theta to x,y + ''' + self.nAA = [] + self.nAB = [] + self.nBB = [] + self.nTotal = [] + + self.meanXAA = [] + self.meanXAB = [] + self.meanXBB = [] + + self.meanYAA = [] + self.meanYAB = [] + self.meanYBB = [] + + self.devXAA = [] + self.devXAB = [] + self.devXBB = [] + + self.devYAA = [] + self.devYAB = [] + self.devYBB = [] + + buffer = self.f.read((30*4)*self.numCodes) + + for i in range(self.numCodes): + nAA = self.parseInt(buffer,(i*30*4) + 0) # number of points in AA cluster + nAB = self.parseInt(buffer,(i*30*4) + 4) # number of points in AB cluster + nBB = self.parseInt(buffer,(i*30*4) + 8) # number of points in BB cluster + + devRAA = self.parseFloat(buffer,(i*30*4) + 12) + devRAB = self.parseFloat(buffer,(i*30*4) + 16) + devRBB = self.parseFloat(buffer,(i*30*4) + 20) + + meanRAA = self.parseFloat(buffer,(i*30*4) + 24) + meanRAB = self.parseFloat(buffer,(i*30*4) + 28) + meanRBB = self.parseFloat(buffer,(i*30*4) + 32) + + devThetaAA = self.parseFloat(buffer,(i*30*4) + 36) + devThetaAB = self.parseFloat(buffer,(i*30*4) + 40) + devThetaBB = self.parseFloat(buffer,(i*30*4) + 44) + + meanThetaAA = self.parseFloat(buffer,(i*30*4) + 48) + meanThetaAB = self.parseFloat(buffer,(i*30*4) + 52) + meanThetaBB = self.parseFloat(buffer,(i*30*4) + 56) + + + meanXAA, meanYAA, devXAA, devYAA = self.polarToEuclidean(meanRAA, devRAA, meanThetaAA, devThetaAA) + meanXAB, meanYAB, devXAB, devYAB = self.polarToEuclidean(meanRAB, devRAB, meanThetaAB, devThetaAB) + meanXBB, meanYBB, devXBB, devYBB = self.polarToEuclidean(meanRBB, devRBB, meanThetaBB, devThetaBB) + + self.nAA.append(nAA) + self.nAB.append(nAB) + self.nBB.append(nBB) + self.nTotal.append(nAA + nAB + nBB) + + self.meanXAA.append(meanXAA) + self.meanXAB.append(meanXAB) + self.meanXBB.append(meanXBB) + + self.meanYAA.append(meanYAA) + self.meanYAB.append(meanYAB) + self.meanYBB.append(meanYBB) + + self.devXAA.append(devXAA) + self.devXAB.append(devXAB) + self.devXBB.append(devXBB) + + self.devYAA.append(devYAA) + self.devYAB.append(devYAB) + self.devYBB.append(devYBB) + + + def polarToEuclidean(self, meanR, devR, meanTheta, devTheta): + ''' + Manhattan distance conversion from r,theta to x,y + ''' + varTheta = devTheta**2 + varR = devR ** 2 + + piFactor = math.pi / float(2) + + A = -1 * (piFactor * meanR) * (1 + math.tan(piFactor * meanTheta))**-2 * (1 / float(math.cos(piFactor * meanTheta)**2)) + B = 1 / float( 1 + math.tan(piFactor * meanTheta)) + varX = (A * A * varTheta) + (B * B * varR) + C = -1 * A + D = float(1 - B) + varY = (C*C * varTheta) + (D*D * varR) + + meanX = (meanR / float( 1 + (math.tan(meanTheta * piFactor)))) + meanY = (meanR - (meanR / float(1 + math.tan(meanTheta * piFactor)))) + devX = varX**(0.5) + devY = varY**(0.5) + + return (meanX, meanY, devX, devY) + + def getByte(self): + ''' + read byte from current place in file + ''' + line = self.f.read(1) + x = struct.unpack_from("b",line)[0] + return x + + def getInt(self): + ''' + read int from current place in file + ''' + line = self.f.read(4) + x = struct.unpack_from("i",line)[0] + return x + + def getString(self): + ''' + get string from current place in file + ''' + line = self.f.read(1) + count = struct.unpack_from("b", line)[0] + line = self.f.read(count) + t = count * "s" + x = "".join(list(struct.unpack(t, line))) + return x + + def parseInt(self,buffer,position): + ''' + parse int from buffer at given position + ''' + y = buffer[position:position + 4] + x = struct.unpack_from("i",y)[0] + return x + + def parseFloat(self,buffer,position): + ''' + parse float from input buffer at given position + ''' + y = buffer[position:position + 4] + x = struct.unpack_from("f",y)[0] + return x + diff --git a/src/zcall/GTC.py b/src/zcall/GTC.py new file mode 100644 index 0000000..f979f2a --- /dev/null +++ b/src/zcall/GTC.py @@ -0,0 +1,263 @@ +#! /usr/bin/python + + +# Original code supplied by Illumina, Inc. subsequently modified by Broad +# Institute and Genome Research Ltd. The Illumina provided code was provided +# as-is and with no warranty as to performance and no warranty against it +# infringing any other party's intellectual property rights. All contributions +# are copyright their respective authors. See https://github.com/wtsi-npg/zCall +# for revision history. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +import struct +import math +from cStringIO import StringIO + +class GTC: + ''' Class for parsing GTC file''' + + def __init__(self, inputPath, bpmNormIDs): + ''' Init function for class. Input is a .gtc file ''' + self.inputPath = inputPath + self.f = StringIO(open(self.inputPath, 'rb').read()) + #self.f = open(self.inputPath, 'rb') # open file handler for binary file + self.BPMnormIDs = bpmNormIDs # list with norm ID for each snp + self.TOC = self.parseTOC() # parse table of contents to get location of other information + self.numSNPs = self.readNumSNPs() + self.sampleName = self.parseString(10) # parse sample name + self.samplePlate = self.parseString(11) # parse sample plate + self.sampleWell = self.parseString(12) # parse sample well + self.clusterFile = self.parseString(100) # parse what cluster file was used + self.snpManifest = self.parseString(101) # parse what snp manifest was used + self.imagingDate = self.parseString(200) # parse imaging date + self.autoCallDate = self.parseString(201) # parse autocall date + self.autoCallVersion = self.parseString(300) # parse autocall version + self.rawXintensities = self.extractIntensities(1000) # parse raw x intensities into python list object + self.rawYintensities = self.extractIntensities(1001) # parse raw y intensities into python list object + self.normalizationTransformations = self.extractNormalizationTransformations(400) # parse normalization transformation arrays into python dictionary where key is the order they appeared in the gtc file and the value is a dictionary with keys offset_x, offset_y,scale_x, scale_y, shear, theta and values are floats + self.genotypes = self.extractGenotypes(1002) # parse genotypes (0,1,2,3) into python list object + self.baseCalls = self.extractBaseCalls(1003) # parse basecalls (AT,TT,AT,--) into python list object + + self.normXintensities, self.normYintensities = self.normalizeIntensities() + + + def getInputPath(self): + """Return original input path""" + return self.inputPath + + def parseTOC(self): + '''Parse Table of Contents of GTC file + No input + Output is a dictionary where the ID for that entry is the key and the value is the offset for that variable in the GTC file + ''' + self.f.seek(4,0) + line = self.f.read(4) + + count = struct.unpack("i",line)[0] + TOC = {} + + i = 0 + while i < count: + line = self.f.read(2) + id = struct.unpack("h",line) + line = self.f.read(4) + offset = struct.unpack("I",line) + TOC[id[0]] = offset[0] + i+=1 + return TOC + + + def parseString(self,id): + ''' + Extract a string variable from GTC file such as SampleName. + Input is ID for that variable in TOC + Output is a string + ''' + offset = self.TOC[id] + self.f.seek(offset,0) + + line = self.f.read(1) + nbytes = struct.unpack("b",line)[0] + line = self.f.read(nbytes) + type = nbytes * "s" + x = "".join(list(struct.unpack(type, line))) + + return x + + def extractIntensities(self, id): + ''' + Extract intensity values (x or y depending on input ID). + Input is ID for variable of interest in TOC + Output is a list with integer intensity values in the order they were parsed + ''' + intensities = [0]*self.numSNPs + offset = self.TOC[id] + + self.f.seek(offset,0) + line = self.f.read(4) + count = struct.unpack("i",line)[0] + if count != self.numSNPs: + msg = "Inconsistent SNP count in GTC file intensity group: "+\ + "Expected %s, found %s" % (self.numSNPs, count) + raise ValueError(msg) + #self.numSNPs = count + i = 0 + while i < count: + line = self.f.read(2) + y = struct.unpack("H",line) + intensities[i] = y[0] + i += 1 + return intensities + + def extractNormalizationTransformations(self, id): + ''' + Extract normalization transformation arrays + Input is ID for Normalization Transformations in TOC. + Output is dictionary where keys are the order xForm array appears in gtc file (ex: 1,2,3...). + The values of the dictionary are another dictionary + where the keys are shear, offset_x, offset_y, theta, scale_x, scale_y and the values are floats + ''' + normTransforms = {} + offset = self.TOC[id] + + self.normIDlist = list(set(self.BPMnormIDs)) # ordered list of unique normIDs + self.normIDlist.sort() + + self.f.seek(offset,0) + line = self.f.read(4) + count = struct.unpack("i",line)[0] + + i = 0 + while i < count: + line = self.f.read(4) + line = self.f.read(48) + x = struct.unpack("<12f", line) + normTransforms[self.normIDlist[i]] = {"offset_x":x[0],"offset_y":x[1],"scale_x":x[2],"scale_y":x[3],"shear":x[4],"theta":x[5]} + i += 1 + return normTransforms + + def extractBaseCalls(self, id): + ''' + Extract base calls. + Input is id for BaseCalls in TOC + Output is a list with one basecall for each SNP (ex: AT, GT,AA...) + ''' + baseCalls = [None]*self.numSNPs + offset = self.TOC[id] + self.f.seek(offset,0) + line = self.f.read(4) + count = struct.unpack("i",line)[0] + if count != self.numSNPs: + msg = "Inconsistent SNP count in GTC file intensity group: "+\ + "Expected %s, found %s" % (self.numSNPs, count) + i = 0 + while i < count: + line = self.f.read(2) + calls = struct.unpack("ss",line) + baseCalls[i] = calls[0] + calls[1] + i += 1 + return baseCalls + + def extractGenotypes(self, id): + ''' + Extract genotypes. + Input is ID for Genotypes in TOC + Output is a list with one genotype per SNP (0,1,2,3) + ''' + genotypes = [None]*self.numSNPs + offset = self.TOC[id] + self.f.seek(offset,0) + line = self.f.read(4) + count = struct.unpack("i",line)[0] + if count != self.numSNPs: + msg = "Inconsistent SNP count in GTC file intensity group: "+\ + "Expected %s, found %s" % (self.numSNPs, count) + i = 0 + while i < count: + line = self.f.read(1) + gt = struct.unpack("b",line) + genotypes[i] = gt[0] + i += 1 + return genotypes + + def getTotalSNPs(self): + return self.numSNPs + + def normalizeIntensities(self): + ''' + Use Normalization transformations to convert raw intensities to normalized intensities + No Input + Outputs are normalized x and y intensities in python lists + ''' + normXIntensities = [0]*self.numSNPs + normYIntensities = [0]*self.numSNPs + + i = 0 + while i < self.numSNPs: + xraw = self.rawXintensities[i] + yraw = self.rawYintensities[i] + normID = self.BPMnormIDs[i] + + offset_x = self.normalizationTransformations[normID]["offset_x"] + offset_y = self.normalizationTransformations[normID]["offset_y"] + scale_x = self.normalizationTransformations[normID]["scale_x"] + scale_y = self.normalizationTransformations[normID]["scale_y"] + theta = self.normalizationTransformations[normID]["theta"] + shear = self.normalizationTransformations[normID]["shear"] + + tempx = xraw - offset_x + tempy = yraw - offset_y + + tempx2 = math.cos(theta) * tempx + math.sin(theta) * tempy + tempy2 = -1 * math.sin(theta) * tempx + math.cos(theta) * tempy + + tempx3 = tempx2 - (shear * tempy2) + tempy3 = tempy2 + + try: xn = tempx3 / float(scale_x) + except ZeroDivisionError: xn = 0.0 + try: yn = tempy3 / float(scale_y) + except ZeroDivisionError: yn = 0.0 + + if xn < 0: + xn = 0.0 + if yn < 0: + yn = 0.0 + + normXIntensities[i] = xn + normYIntensities[i] = yn + i += 1 + + return (normXIntensities, normYIntensities) + + def readNumSNPs(self): + """Read total number of SNPs from first block of intensity data + + Adapted from extractIntensities""" + offset = self.TOC[1000] + self.f.seek(offset,0) + line = self.f.read(4) + count = struct.unpack("i",line)[0] + return count diff --git a/src/zcall/__init__.py b/src/zcall/__init__.py new file mode 100644 index 0000000..57feb68 --- /dev/null +++ b/src/zcall/__init__.py @@ -0,0 +1,30 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + +"""Package to implement the zcall rare variant caller for genotyping""" + diff --git a/src/zcall/appendPEDline.py b/src/zcall/appendPEDline.py new file mode 100644 index 0000000..9841d5a --- /dev/null +++ b/src/zcall/appendPEDline.py @@ -0,0 +1,81 @@ +#! /usr/bin/python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + + +"""Write Plink .ped file from calls in GTC files; use to generate test data + +Produces a single line, representing one sample, and appends to .ped file + +Adapted from legacy zCall.py +Usage: writePEDfile.py [bpm path] [gtc path] [ped path]""" + +import sys +sys.path.append(sys.path[0]+'/../zcall') +from BPM import * +from GTC import * + +def main(): + + bpm = BPM(sys.argv[1]) + gtc = GTC(sys.argv[2], bpm.normID) + outPath = sys.argv[3] + useManifest = False + + ### Parse sample name from input gtc file name + sampleName = sys.argv[2].split("/") + sampleName = sampleName[len(sampleName) - 1] + sampleName = sampleName.split(".")[0] + + out = [sampleName, sampleName, "0", "0", "-9", "-9"] #output holder in python list; have no sample information so use "0" and "-9" for mid, pid, gender, case/control status + for i in range(gtc.getTotalSNPs()): + if useManifest: + alleleA = bpm.A[i] + alleleB = bpm.B[i] + else: + alleleA = "A" + alleleB = "B" + origCall = gtc.genotypes[i] + if origCall == 1: ## AA is original call + out.append(alleleA) + out.append(alleleA) + elif origCall == 2: ## AB is original call + out.append(alleleA) + out.append(alleleB) + elif origCall == 3: ## BB is original call + out.append(alleleB) + out.append(alleleB) + else: ## NC is original call + out.append("0") + out.append("0") + ## Output to std out the new calls in PED format + outFile = open(outPath, 'a') # append to existing file + outFile.write(" ".join(out)+"\n") + outFile.close() + +if __name__ == "__main__": + main() diff --git a/src/zcall/calibration.py b/src/zcall/calibration.py new file mode 100644 index 0000000..a9a2fcf --- /dev/null +++ b/src/zcall/calibration.py @@ -0,0 +1,599 @@ +#! /usr/bin/env python + +# Some code in this file originates from the findMeanSD.py and +# findThresholds.py scripts developed by the Broad Institute. Other code by +# Genome Research Ltd. All contributions are copyright their respective +# authors. See https://github.com/wtsi-npg/zCall for revision history. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +"""Classes to find thresholds and evaluate z scores, as parameters for zCall. + +Threshold finding combines: findMeanSD.py, findBetas.r, findThresholds.py +Evaluation looks for optimal concordance and gain metrics on given dataset. +Import classes to front-end scripts for calibration and calling. + +Contents: +- ThresholdFinder +- MetricEvaluator +- MetricFinder +- SampleEvaluator + +Author: Iain Bancarz, ib5@sanger.ac.uk, January 2013 +""" + +import json, os, re, sys, tempfile +from ConfigParser import ConfigParser +from GTC import * +from BPM import * +from EGT import * +from utilities import CallingBase, ThresholdContainer, SharedBase + +class ThresholdFinder: + """Class to write threshold.txt files for given EGT input and z score(s). + + Threshold finding procedure: + 1. Run findMeanSD on given EGT file. Outputs mean_sd.txt + 2. Run findBetas.r on output from (1). Outputs betas.txt + 3. Run findThresholds on EGT file and betas.txt, with given Z score(s). + Outputs from (1) and (2) are written to a temporary directory, deleted on successful exit. + + Recommended default Z score = 7. Suggested range of alternatives = 3 to 15. +""" + +# TODO Modify findMeanSD and findThresholds so they can be imported, instead of being run in a subshell + + INDEX_NAME = "thresholds.json" + + def __init__(self, egtPath, configPath=None): + self.egtPath = egtPath + self.egt = EGT(egtPath) + if configPath==None: + configPath = os.path.join(sys.path[0], '../etc/config.ini') + configPath = os.path.abspath(configPath) + self.snpTotal = self.egt.getTotalSNPs() + config = ConfigParser() + config.readfp(open(configPath)) + self.rScript = config.get('zcall', 'rscript') + if (not os.path.exists(self.rScript)): + msg = "ERROR: R script executable \""+self.rScript+\ + "\" does not exist. Check config file: \""+configPath+"\"\n" + sys.stderr.write(msg) + sys.exit(1) + self.digits = int(config.get('zcall', 'digits')) + + def findMeanSD(self, outPath): + """Find cluster mean/sd from EGT object + + Based on findMeanSD script in original zcall implementation + By default, apply 'sanity check' filter to SNPs""" + exclude = set() + output = [None]*self.snpTotal + maxExclusion = 0.95 + for i in range(self.snpTotal): + # Get SNP Name + snp = self.egt.names[i] + # Get number of points in each genotype cluster + nAA = self.egt.nAA[i] + nAB = self.egt.nAB[i] + nBB = self.egt.nBB[i] + nTotal = nAA + nAB + nBB + (meanX, meanY, devX, devY) = self.meanDevXY(i) + if nAA >= nBB: + out = [snp, meanX, meanY, devX, devY, nBB, nAA] + elif nBB > nAA: + out = [snp, meanX, meanY, devX, devY, nAA, nBB] + out = [str(o) for o in out] + output[i] = "\t".join(out) + # apply sanity checks and flag for exclusion if appropriate + if not self.snpSanityCheckOK(nAA, nAB, nBB, nTotal): + exclude.add(i) + excludeRate = len(exclude)/float(self.snpTotal) + head = ["SNP", "meanX", "meanY", "sdX", "sdY", + "nMinorHom", "nCommonHom"] + outFile = open(outPath, 'w') + outFile.write("\t".join(head)+"\n") + if excludeRate <= maxExclusion: + for i in range(self.snpTotal): + if i not in exclude: outFile.write(output[i]+"\n") + else: + msg = "Warning: findMeanSD exclusion rate of "+str(excludeRate)+\ + " exceeds maximum of "+str(maxExclusion)+\ + "; omitting SNP filter.\n" + sys.stderr.write(msg) + for i in range(self.snpTotal): + outFile.write(output[i]+"\n") + outFile.close() + + def findThresholds(self, betasPath, zScore, outPath): + """Find thresholds from findBetas.r output and write thresholds.txt + + Replaces findThresholds.py script in original zCall implementation""" + (beta0, beta1) = self.readBetas(betasPath) + output = [None]*(self.snpTotal+1) + output[0] = ["SNP", "Tx", "Ty"] + for i in range(self.snpTotal): + # Get SNP Name + snp = self.egt.names[i] + (Tx, Ty) = self.findThresholdPair(i, zScore, beta0, beta1) + out = [snp, Tx, Ty] + out = [str(o) for o in out] + output[i+1] = out + outFile = open(outPath, 'w') + for line in output: outFile.write("\t".join(line)+"\n") + outFile.close() + + def findThresholdPair(self, i, z, beta0, beta1, minIntensity=0.2): + """Find x,y thresholds for the ith snp""" + nAA = self.egt.nAA[i] + nBB = self.egt.nBB[i] + + # Extract mean and standard deviations from EGT class + meanXAA = self.egt.meanXAA[i] + meanXBB = self.egt.meanXBB[i] + devXAA = self.egt.devXAA[i] + devXBB = self.egt.devXBB[i] + + meanYAA = self.egt.meanYAA[i] + meanYBB = self.egt.meanYBB[i] + devYAA = self.egt.devYAA[i] + devYBB = self.egt.devYBB[i] + + # Calculate Thresholds + if nAA <= 2 and nBB <= 2: + # Not enough points in common allele homozygote cluster + Tx = "NA" + Ty = "NA" + else: + if nAA >= nBB: # AA cluster tags the common allele + if meanXAA < minIntensity: + Tx = "NA" # skips new genotype calls in zCall + Ty = "NA" + else: + Ty = meanYAA + z * devYAA + # Solve for mean, sd of the minor allele hom. cluster + # based on betas and mean of common allele hom. cluster + meanXBB = beta1[1]*meanYAA + beta0[1] + devXBB = beta1[3]*devYAA + beta0[3] + Tx = meanXBB + z * devXBB + else: # BB cluster tags the common allele + if meanYBB < minIntensity: + Tx = "NA" + Ty = "NA" + else: + Tx = meanXBB + z * devXBB + # Solve for mean, sd as above + meanYAA = beta1[0] * meanXBB + beta0[0] + devYAA = beta1[2] * devXBB + beta0[2] + Ty = meanYAA + z * devYAA + if Tx!="NA": Tx = round(Tx, self.digits) + if Ty!="NA": Ty = round(Ty, self.digits) + return (Tx, Ty) + + def meanDevXY(self, i): + """Find mean/sd for x,y for given snp""" + # Extract the mean and sd for each common allele homozygote clusters in the noise dimension + meanXAA = self.egt.meanXAA[i] + meanXBB = self.egt.meanXBB[i] + devXAA = self.egt.devXAA[i] + devXBB = self.egt.devXBB[i] + meanYAA = self.egt.meanYAA[i] + meanYBB = self.egt.meanYBB[i] + devYAA = self.egt.devYAA[i] + devYBB = self.egt.devYBB[i] + if meanXAA >= meanYAA: ## AA is in the lower right quadrant + meanY = meanYAA + devY = devYAA + meanX = meanXBB + devX = devXBB + else: ## AA is in the upper left quadrant; however, this should never be the case by definition X -> A, Y -> B + meanY = meanYBB + devY = devYBB + meanX = meanXAA + devX = devXAA + return (meanX, meanY, devX, devY) + + def snpSanityCheckOK(self, nAA, nAB, nBB, nTotal): + """Apply sanity checks to SNPs for inclusion in mean/SD calculation""" + cr = float(nTotal) / float(self.egt.numPoints) + if cr < 0.99: + return False + # Make sure there are at least 10 points in each homozygote cluster + if nAA < 10 or nBB < 10: + return False + # Calculate and check MAF + if nAA > nBB: + maf = (nAB + 2 * nBB) / float(2 * nTotal) + else: + maf = (nAB + 2 * nAA) / float(2 * nTotal) + if maf < 0.05: + return False + # Hardy-Weinberg Equilibrium (don't use site if p_hwe < 0.00001) + chiCritical = 19.5 # p = 0.00001 for 1 DOF + if nAA > nBB: + p = 1.0 - maf + q = maf + expAA = p**2 * nTotal + expAB = 2 * p * q * nTotal + expBB = q**2 * nTotal + else: + p = 1.0 - maf + q = maf + expAA = q**2 * nTotal + expAB = 2 * p * q * nTotal + expBB = p**2 * nTotal + chiSquare = ((nAA - expAA)**2 / float(expAA)) + \ + ((nAB - expAB)**2 / float(expAB)) + \ + ((nBB - expBB)**2 / float(expBB)) + if chiSquare > chiCritical: + return False + return True + + def thresholdFileName(self, zScore): + egtName = re.split('/', self.egtPath).pop() + items = re.split('\.', egtName) + items.pop() # remove .egt suffix + name = '.'.join(items) + return 'thresholds_'+name+'_z'+str(zScore).zfill(2)+'.txt' + + def readBetas(self, inPath): + """Read output of findBetas.r""" + beta0 = [] # list container for beta intercept + beta1 = [] # list container for beta of slope + lines = open(inPath).readlines() + for i in range(1, len(lines)): + fields = re.split("\t", lines[i].strip()) + beta0.append(float(fields[1])) + beta1.append(float(fields[2])) + return (beta0, beta1) + + def run(self, zScore=7, outDir='/tmp', verbose=True, force=False): + outPath = os.path.join(outDir, self.thresholdFileName(zScore)) + if os.path.exists(outPath) and force==False: + if verbose: print outPath+" already exists; omitting calibration." + return outPath + scriptDir = os.path.abspath(sys.path[0]) + tempDir = tempfile.mkdtemp(prefix='zcall_') + if verbose: + msg = "Calibrating zCall: zscore = "+str(zScore)+"\n"+\ + "Writing temporary files to "+tempDir + print msg + meanSd = tempDir+'/mean_sd.txt' + betas = tempDir+'/betas.txt' + self.findMeanSD(meanSd) + # findBetas.r command uses bash to redirect stderr + cmd = 'bash -c "'+self.rScript+' '+scriptDir+'/findBetas.r '+\ + meanSd+' '+betas+' 1 " &> '+tempDir+'/rscript.log' + status = os.system(cmd) + self.findThresholds(betas, zScore, outPath) + if status==0: + if verbose: print "Cleaning up temporary directory." + os.system('rm -Rf '+tempDir) + elif verbose: + print "Possible error, retaining temporary directory." + if verbose: print "Finished calibration." + return outPath + + def runMultiple(self, zstart, ztotal, outDir, + verbose=True, force=False): + z = zstart + thresholdPaths = {} + for i in range(ztotal): + thresholdPath = self.run(z, outDir, verbose, force) + thresholdPaths[str(z)] = thresholdPath # .json needs string as key + z += 1 + indexPath = os.path.join(outDir, self.INDEX_NAME) + index = open(indexPath, 'w') + index.write(json.dumps(thresholdPaths)) + index.close() + return indexPath + +class MetricEvaluator(SharedBase): + """Class to assess concordance/gain metrics and choose best z score""" + + def __init__(self, configPath=None): + if configPath==None: + configPath = os.path.join(sys.path[0], '../etc/config.ini') + configPath = os.path.abspath(configPath) + config = ConfigParser() + config.readfp(open(configPath)) + self.minConcord = float(config.get('zcall', 'min_concord')) + + def getBestThresholdKey(self): + return self.T_KEY + + def findBestZ(self, concords, gains): + """Find best z score from normalized mean concordance/gain values + + Normalize concordance/gain by dividing each value by the maximum. Optionally, set a minimum normalized concordance. The 'best' is defined as the smallest z s.t. (concordance > gain) AND (concordance > minimum); or if none exists, return z with maximum concordance. Note that z keys in hash are strings (for use in .json output), but need to sort them in integer order.""" + if concords.keys() != gains.keys(): + raise ValueError("Concordance and gain inputs do not match") + cMax = max(concords.values()) + gMax = max(gains.values()) + zScores = [] + for z in concords.keys(): + concords[z] = concords[z]/cMax + gains[z] = gains[z]/gMax + zScores.append(int(z)) + zScores.sort() + best = None + zMax = None + for i in range(len(zScores)): + z = str(zScores[i]) + if concords[z] > gains[z] and concords[z] > self.minConcord: + best = z + break + elif concords[z]==1: + zMax = z + if best == None: best = zMax + return str(best) + + def findMeans(self, inPaths, verbose=False, outPath=None): + """Read JSON result paths, find mean concordance/gain by z score + + Return concatenation of results and concordance/gain metrics + """ + rows = [] + for inPath in inPaths: + rows.extend(json.loads(open(str(inPath)).read())) + zCounts = {} + concords = {} + gains = {} + zIntegers = set() + for row in rows: + [gtc, z, concord, gain, counts] = row + try: + zCounts[z] += 1 + concords[z] += concord + gains[z] += gain + except KeyError: + zCounts[z] = 1 + concords[z] = concord + gains[z] = gain + zIntegers.add(int(z)) + zIntegers = list(zIntegers) + zIntegers.sort() + if verbose: print "z\tconcord\tgain" + for zInt in zIntegers: + z = str(zInt) + concords[z] = concords[z] / float(zCounts[z]) + gains[z] = gains[z] / float(zCounts[z]) + if verbose: print z+"\t"+str(round(concords[z], 4))+"\t"+\ + str(round(gains[z], 4)) + if outPath!=None: + out = open(outPath, 'w') + out.write(json.dumps(rows)) + out.close() + return (rows, concords, gains) + + def writeBest(self, inPaths, thresholdPath, outPath, textPath=None, + verbose=False): + """Find best z score & thresholds.txt, write to file for later use + + Arguments: + - List of SampleEvaluator output paths + - JSON file with hash of thresholds.txt paths by z score + - Output directory + """ + (metrics, concords, gains) = self.findMeans(inPaths, verbose) + best = self.findBestZ(concords, gains) + if verbose: print self.Z_KEY+":"+best + thresholdPaths = json.loads(open(thresholdPath).read()) + results = { self.Z_KEY:best, self.T_KEY:thresholdPaths[best], + self.M_KEY:metrics} + out = open(outPath, 'w') + out.write(json.dumps(results)) + out.close() + if textPath!=None: + self.writeMeanText(concords, gains, textPath) + return results + + def writeMeanText(self, concords, gains, outPath, digits=6): + """Write plain text file with mean concordance/gain by z score""" + zRange = [] + for z in concords.keys(): zRange.append(int(z)) + zRange.sort() + output = [] + for zScore in zRange: + z = str(zScore) + concord = round(concords[z], digits) + gain = round(gains[z], digits) + output.append("%s\t%s\t%s\n" % (z, concord, gain) ) + out = open(outPath, 'w') + out.write(''.join(output)) + out.close() + +class MetricFinder(CallingBase): + """Class to evaluate GTC objects by concordance and gain metrics + + Initialize with egt path, bpm path + Inherits common "calling" functions from CallingBase +""" + + def concordanceRate(self, counts): + """Find concordance rate between original and new calls + + Ignores SNPs where original was a 'no call'""" + [match, total] = [0,0] + for i in range(1,4): + for j in range(0,4): + count = counts[(i,j)] + total += count + if i==j: match += count + concord = float(match)/float(total) + return concord + + def countCallTypes(self, thresholds, gtc): + """Count call types (0, AA, AB, BB) for given GTC and thresholds. + + call codes: 0 - "No Call", 1 - AA, 2 - AB, 3 - BB + based on method in sampleConcordance.py + """ + self.setThresholds(thresholds) + counts = {} + for i in range(4): + for j in range(4): counts[(i,j)] = 0 + includedSNPs = self.findIncludedSNPs(thresholds) + for i in includedSNPs: + nAA = self.egt.nAA[i] + nBB = self.egt.nBB[i] + nAB = self.egt.nAB[i] + origCall = self.normalizeCall(gtc.genotypes[i], nAA, nBB) + newCall = self.normalizeCall(self.call(gtc, i), nAA, nBB) + counts[(origCall, newCall)] += 1 + return counts + + def getMetrics(self, thresholds, gtc): + """Find call types, concordance and gain for given threshold and GTC + + Arguments are ThresholdContainer and GTC objects""" + counts = self.countCallTypes(thresholds, gtc) + concord = self.concordanceRate(counts) + gain = self.gainRate(counts) + return (counts, concord, gain) + + def gainRate(self, counts): + """Find rate of call gain + + Defined as no calls in original which are called by zcall""" + [gain, total] = [0,0] + for i in range(4): + count = counts[(0,i)] # no call in original GTC + total += count + if i!=0: gain += count + gainRate = float(gain)/float(total) + return gainRate + + def includeSNP(self, i, nAA, nBB, nAB, thresholds): + """Should ith SNP be included in concordance calculation? + + Require autosomal SNP with MAF>=5% + Want at least 10 points in each homozygote cluster + Also exclude SNPs without defined zcall thresholds""" + include = True + chrom = self.bpm.chr[i] + maf = self.findMAF(nAA, nBB, nAB) + if maf < 0.05 or chrom == "X" or chrom == "Y" or nAA < 10 or nBB < 10 \ + or thresholds.getX(i)=="NA" or thresholds.getY(i)=="NA": + include = False + return include + + def findIncludedSNPs(self, thresholds): + """ Find set of included SNP indices """ + included = [] + for i in range(self.snpTotal): + nAA = self.egt.nAA[i] + nBB = self.egt.nBB[i] + nAB = self.egt.nAB[i] + if self.includeSNP(i, nAA, nBB, nAB, thresholds): + included.append(i) + return included + +class SampleEvaluator(SharedBase): + """Evaluate z scores and thresholds for one or more GTC files.""" + + def __init__(self, bpmPath, egtPath): + self.bpmPath = bpmPath + self.egtPath = egtPath + self.bpm = BPM(bpmPath) + self.metricFinder = MetricFinder(bpmPath, egtPath) + self.evaluated = 0 # running total of evaluated samples + self.sampleTotal = 0 # total number of samples to evaluate + + def convertCountKeys(self, counts): + """Convert keys in counts dictionary to string; required for JSON output + + Counts are indexed by (original call, final call) + """ + output = {} + for key in counts.keys(): + (i,j) = key + output[str(i)+':'+str(j)] = counts[key] + return output + + def evaluate(self, thresholds, gtc, verbose=False): + """Evaluate z thresholds for given thresholds & sample GTC + + Inputs: + - dictionary of threshold paths indexed by z score + - GTC object + """ + gtcName = os.path.split(gtc.getInputPath())[1] + self.evaluated += 1 + if verbose: + print "Evaluating z scores for sample "+str(self.evaluated)+\ + " of "+str(self.sampleTotal)+": "+gtcName + zList = thresholds.keys() + zList.sort() + results = {} + for z in zList: + if verbose: print "Finding metrics for z score", z + results[z] = self.metricFinder.getMetrics(thresholds[z], gtc) + output = [] + for z in zList: + (counts, concord, gain) = results[z] + converted = self.convertCountKeys(counts) + output.append([gtcName, z, concord, gain, converted]) + return output + + def run(self, thresholdPath, sampleJson, start, end, outPath,verbose=False): + """Evaluate thresholds for given list of sample GTC paths + + Inputs: + - Path to .json file with hash of paths to threshold.txt files + - Path to .json file with paths of sample GTC files + - Start index in GTC .json file (use to split GTC for batch processing) + - End index in GTC .json file + - Output path + + Output: + - JSON file with GTC filename, z score, metrics, and call type counts + + GTC input is in sample JSON format used by genotyping pipeline """ + metrics = [] + if verbose: print "Reading thresholds." + thresholdPaths = json.loads(open(thresholdPath).read()) + thresholds = {} + for z in thresholdPaths.keys(): + thresholds[z] = ThresholdContainer(thresholdPaths[z]) + if verbose: print "Evaluating samples." + output = [] + gtcPaths = self.readSampleJson(sampleJson) + self.evaluated = 0 + if end==-1: end = len(gtcPaths) + if start>=end: + raise ValueError("Must have GTC start index < end index") + elif start < 0: + raise ValueError("Must have GTC start index > 0") + elif end > len(gtcPaths): + raise ValueError("Must have GTC end index <= total GTC paths") + self.sampleTotal = end - start + gtcPaths = gtcPaths[start:end] + for gtcPath in gtcPaths: + gtc = GTC(gtcPath, self.bpm.normID) + output.extend(self.evaluate(thresholds, gtc, verbose)) + # write results + out = open(outPath, 'w') + out.write(json.dumps(output)) + out.close() diff --git a/src/zcall/createDocs.py b/src/zcall/createDocs.py new file mode 100755 index 0000000..fe021c3 --- /dev/null +++ b/src/zcall/createDocs.py @@ -0,0 +1,97 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + +"""Convenience script to generate HTML documentation + +Use instead of command-line pydoc. Allows user to specify Python version 2.7, even if installation of pydoc is compiled with an earlier Python version. + +Run with -h or --help for command-line help. +""" + +import os, pydoc, re, sys +import shared +from importlib import import_module +from modulefinder import ModuleFinder +try: + import argparse +except ImportError: + sys.stderr.write(shared.importErrorMessage) + sys.exit(1) + +def main(): + + description="Convenience script to generate HTML documentation using pydoc" + parser = argparse.ArgumentParser(description=description) + parser.add_argument('--out', required=True, metavar="PATH", + help="Directory in which to write HTML output files.") + parser.add_argument('--recursive', action='store_true', default=False, + help="Recursively import documentation for dependencies. If not recursive, zcall documents will contain broken links to standard modules. Recursive mode generates approximately 180 HTML files comprising 6 MB of data.") + parser.add_argument('--verbose', action='store_true', default=False, + help="Write pydoc information to stdout.") + args = vars(parser.parse_args()) + recursive = args['recursive'] + verbose = args['verbose'] + if not verbose: # suppress stdout chatter from pydoc.writedoc + sys.stdout = open('/dev/null', 'w') + + localDir = os.path.dirname(os.path.realpath(__file__)) + sys.path.append(os.path.abspath(localDir+"/..")) # import from zcall dir + zcallDir = os.path.abspath(localDir+"/../zcall") + outDir = os.path.abspath(args['out']) + if not (os.access(outDir, os.W_OK) and os.path.isdir(outDir)): + msg = "ERROR: Output path "+outDir+" is not a writable directory.\n" + sys.stderr.write(msg) + sys.exit(1) + os.chdir(outDir) + import zcall + pydoc.writedoc(zcall) + modules = set() + zcall = set() + scripts = [] + mf = ModuleFinder() + for script in os.listdir(zcallDir): + if re.search("\.py$", script) and script!="__init__.py": + words = re.split("\.", script) + words.pop() + scriptName = (".".join(words)) # name without .py suffix + modules.add("zcall."+scriptName) + zcall.add(scriptName) + scripts.append(script) + if recursive: + for script in scripts: + mf.run_script(os.path.join(zcallDir, script)) + for name, mod in mf.modules.iteritems(): + if name not in zcall: modules.add(name) + for module in modules: + pydoc.writedoc(import_module(module)) + +# NB findThresholds and findMeanSD files can only be run as scripts, not imported. Omitting the .py extension from these files prevents pydoc from creating broken links in the main zcall page. + + +if __name__ == "__main__": + main() diff --git a/src/zcall/evaluateThresholds.py b/src/zcall/evaluateThresholds.py new file mode 100755 index 0000000..9e48efa --- /dev/null +++ b/src/zcall/evaluateThresholds.py @@ -0,0 +1,94 @@ +#! /usr/bin/env python + + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +"""Evaluate concordance and gain for given GTC files and Z scores. + +Author: Iain Bancarz, ib5@sanger.ac.uk, January 2013 +""" + +import cProfile, os, sys +import shared +from calibration import SampleEvaluator +try: + import argparse # optparse is deprecated, using argparse instead + from tempfile import NamedTemporaryFile + from utilities import ArgParserExtra +except ImportError: + sys.stderr.write(shared.importErrorMessage) + sys.exit(1) + +def main(): + """Method to run as script from command line. Run with --help for usage.""" + description = "Evaluate concordance/gain for given z scores and GTC files." + parser = argparse.ArgumentParser(description=description) + parser.add_argument('--thresholds', required=True, metavar="PATH", + help="Path to .json file containing threshold .txt paths indexed by z score") + parser.add_argument('--bpm', required=True, metavar="PATH", + help="BPM .csv manifest file") + parser.add_argument('--egt', required=True, metavar="PATH", + help="EGT input file") + parser.add_argument('--gtc', required=True, metavar="PATH", + help="Path to .json file containing .gtc input paths") + parser.add_argument('--out', required=True, metavar="PATH", + help="Path for .json output") + parser.add_argument('--start', metavar="INT", default = 0, type=int, + help="Starting index in GTC .json file") + parser.add_argument('--end', metavar="INT", default = -1, type=int, + help="Ending index in GTC .json file") + parser.add_argument('--verbose', action='store_true', default=False, + help="Print status information to standard output") + parser.add_argument('--profile', action='store_true', default=False, + help="Use cProfile to profile runtime operation. Overrides default in config.ini.") + parser.add_argument('--no-profile', action='store_true', default=False, + help="Do not use cProfile to profile runtime operation. Overrides default in config.ini.") + configDefault = os.path.join(sys.path[0], '../etc/config.ini') + configDefault = os.path.abspath(configDefault) + parser.add_argument('--config', metavar="PATH", default=configDefault, + help="Path to config file. Default="+configDefault) + args = vars(parser.parse_args()) + parserExtra = ArgParserExtra(args) + args = parserExtra.validateInputs(['thresholds','bpm','egt','gtc','config']) + args = parserExtra.validateOutputFile() + profile = parserExtra.enableProfile() + if profile: + pstats = NamedTemporaryFile(prefix="evaluateThresholds_", + suffix=".pstats", + dir=os.path.dirname(args['out']), + delete=False).name + cmd0 = "SampleEvaluator('%s', '%s')" % (args['bpm'], args['egt']) + args1 = (args['thresholds'], args['gtc'], args['start'], + args['end'], args['out'], args['verbose']) + cmd1 = ".run('%s', '%s', %s, %s, '%s', %s)" % args1 + cProfile.run(cmd0+cmd1, pstats) + else: + eva = SampleEvaluator(args['bpm'], args['egt']) + eva.run(args['thresholds'], args['gtc'], args['start'], + args['end'], args['out'], args['verbose']) + +if __name__ == "__main__": + main() diff --git a/src/zcall/findBetas.r b/src/zcall/findBetas.r new file mode 100644 index 0000000..98327df --- /dev/null +++ b/src/zcall/findBetas.r @@ -0,0 +1,122 @@ +#!/usr/bin/env Rscript + + +# Copyright (c) 2012 Broad Institute. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +# zCall: A Rare Variant Caller for Array-based Genotyping +# Jackie Goldstein +# jigold@broadinstitute.org +# April 5th, 2012 + + +## Input is means and standard deviations of the homozygote clusters from common sites created from findMeanSD.py + + +args <- commandArgs(TRUE) + +X <- read.table(args[1],header=TRUE,sep="\t") #read in file as a dataframe +outFile <- args[2] # Specify output file +weighted <- args[3] # If 1, use weighted linear regression + +w <- ((1/ X$nMinorHom) + (1 / X$nCommonHom))^-0.5 +print(summary(w)) +#------------------------------------- +# LM of meanY as a function of meanX +if ( weighted == 1 ){ +m <- lm(X$meanY ~ X$meanX, weights = w)} +if ( weighted == 0 ){ +m <- lm(X$meanY ~ X$meanX)} + +a <- as.numeric(summary(m)$coefficients[1]) # beta0 +b <- as.numeric(summary(m)$coefficients[2]) # beta1 +pa <- as.numeric(summary(m)$coefficients[7]) # pvalue for beta0 +pb <- as.numeric(summary(m)$coefficients[8]) # pvalue for beta1 + +### Correlation -- uncomment +#print(summary(m, correlation = TRUE)) + +#------------------------------------- +# LM of meanX as a function of meanY +if ( weighted == 1 ){ +x <- lm(X$meanX ~ X$meanY, weights = w)} +if ( weighted == 0 ){ +x <- lm(X$meanX ~ X$meanY)} + +c <- as.numeric(summary(x)$coefficients[1]) # beta0 +d <- as.numeric(summary(x)$coefficients[2]) # beta1 +pc <- as.numeric(summary(x)$coefficients[7]) # pvalue for beta0 +pd <- as.numeric(summary(x)$coefficients[8]) # pvalue for beta1 + +### Correlation -- uncomment +#print(summary(x, correlation = TRUE)) + +#------------------------------------- +# LM of sdY as a function of sdX +if ( weighted == 1 ){ +l <- lm(X$sdY ~ X$sdX, weights = X$nMinorHom)} +if ( weighted == 0 ){ +l <- lm(X$sdY ~ X$sdX)} + +e <- as.numeric(summary(l)$coefficients[1]) # beta0 +f <- as.numeric(summary(l)$coefficients[2]) # beta1 +pe <- as.numeric(summary(l)$coefficients[7]) # pvalue for beta0 +pf <- as.numeric(summary(l)$coefficients[8]) # pvalue for beta1 + +### Correlation -- uncomment +#print(summary(l, correlation = TRUE)) + +#------------------------------------- +# LM of sdX as a function of sdY +if ( weighted == 1 ){ +y <- lm(X$sdX ~ X$sdY, weights = w)} +if ( weighted == 0 ){ +y <- lm(X$sdX ~ X$sdY)} + +g <- as.numeric(summary(y)$coefficients[1]) # beta0 +h <- as.numeric(summary(y)$coefficients[2]) # beta1 +pg <- as.numeric(summary(y)$coefficients[7]) # pvalue for beta0 +ph <- as.numeric(summary(y)$coefficients[8]) # pvalue for beta1 + +### Correlation -- uncomment +#print(summary(y, correlation = TRUE)) + +#------------------------------------- +## Output to betas.txt + +labels <- c("meanY~meanX","meanX~meanY","sdY~sdX","sdX~sdY") + +beta0 <- c(a,c,e,g) +beta1 <- c(b,d,f,h) +p0 <- c(pa,pc,pe,pg) +p1 <- c(pb,pd,pf,ph) + +head <- c("Model","Beta0","Beta1","pBeta0","pBeta1") + +t <- cbind(labels, beta0, beta1, p0, p1) +r <- rbind(head, t) + +write.table(as.matrix(r), file = paste(outFile,sep=""),sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE) + diff --git a/src/zcall/mergeEvaluation.py b/src/zcall/mergeEvaluation.py new file mode 100755 index 0000000..112854d --- /dev/null +++ b/src/zcall/mergeEvaluation.py @@ -0,0 +1,90 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +"""Combine concordance and gain metric results for multiple samples and zscores. + +Use to find 'best' z score for calling. + +Author: Iain Bancarz, ib5@sanger.ac.uk, January 2013 +""" + +import cProfile, os, sys +try: + import argparse, json + from tempfile import NamedTemporaryFile + from utilities import ArgParserExtra + from calibration import MetricEvaluator +except ImportError: + sys.stderr.write("ERROR: Requires Python 2.7 to run; exiting.\n") + sys.exit(1) + + +def main(): + """Method to run as script from command line. Run with --help for usage.""" + description = "Evaluate concordance/gain results for multiple thresholds and samples; find the 'best' z score for subsequent use of zCall." + configDefault = os.path.join(sys.path[0], '../etc/config.ini') + configDefault = os.path.abspath(configDefault) + parser = argparse.ArgumentParser(description=description) + parser.add_argument('--metrics', required=True, metavar="PATH", + help="Path to .json file containing paths to .json metrics files") + parser.add_argument('--thresholds', required=True, metavar="PATH", + help="Path to .json file containing threshold .txt paths indexed by z score") + parser.add_argument('--out', required=True, metavar="PATH", + help="Path for .json output") + parser.add_argument('--text', required=False, metavar="PATH", default=None, + help="Path for text summary of metrics. Optional.") + parser.add_argument('--config', required=False, metavar="PATH", + help="Path to .ini config file. Optional, defaults to "+configDefault, default=configDefault) + parser.add_argument('--profile', action='store_true', default=False, + help="Use cProfile to profile runtime operation. Overrides default in config.ini.") + parser.add_argument('--no-profile', action='store_true', default=False, + help="Do not use cProfile to profile runtime operation. Overrides default in config.ini.") + args = vars(parser.parse_args()) + parserExtra = ArgParserExtra(args) + args = parserExtra.validateInputs(['metrics', 'thresholds', 'config']) + args = parserExtra.validateOutputFile() + profile = parserExtra.enableProfile() + metricPaths = json.loads(open(args['metrics']).read()) + if profile: + pstats = NamedTemporaryFile(prefix="mergeEvaluation_", + suffix=".pstats", + dir=os.path.dirname(args['out']), + delete=False).name + a = (args['config'], str(metricPaths), args['thresholds'], + args['out'], args['text']) + if args['text']==None: + cmd = "MetricEvaluator('%s').writeBest(%s, '%s', '%s', %s)" % a + else: + cmd = "MetricEvaluator('%s').writeBest(%s, '%s', '%s', '%s')" % a + cProfile.run(cmd, pstats) + else: + MetricEvaluator(args['config']).writeBest(metricPaths, + args['thresholds'], + args['out'], args['text']) + +if __name__ == "__main__": + main() diff --git a/src/zcall/plinkWriter.py b/src/zcall/plinkWriter.py new file mode 100644 index 0000000..f93de33 --- /dev/null +++ b/src/zcall/plinkWriter.py @@ -0,0 +1,242 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + + +"""Process Plink format genotyping data + +See http://pngu.mgh.harvard.edu/~purcell/plink/ + +Requires the plinktools package; see https://github.com/iainrb/plinktools +""" + +import json, struct +from math import ceil +from sys import stderr +from BPM import * +from plink import PlinkHandler # requires plinktools in PYTHONPATH + + +class PlinkWriter(PlinkHandler): + """Class to handle Plink format data""" + + def __init__(self, bpm, famDummy=-9): + """Initialize with a BPM object""" + self.bpm = bpm + self.snpTotal = self.bpm.getTotalSNPs() + self.sortMap = self.snpSortMap() + # famNull = value in .fam for missing paternal/maternal ID or phenotype + self.famDummy = famDummy + + def callsToBinary(self, calls, reorder=True): + """Translate genotype calls for one sample to Plink binary + + 4 genotype calls are packed into one byte of output + Returns a list of struct.pack strings corresponding to output bytes + + Overrides method in parent class, adds reorder functionality""" + if len(calls) != self.snpTotal: + msg = "Number of calls %s is not equal to SNP total %s" % \ + (len(calls), self.snpTotal) + raise ValueError(msg) + if reorder: + sortedCalls = [None]*self.snpTotal + i = 0 + while i < self.snpTotal: + sortedCalls[self.sortMap[i]] = calls[i] + i += 1 + calls = sortedCalls + if self.snpTotal % 4 != 0: + # if not an integer number of bytes, pad with no calls + calls.extend([0]*(4 - self.snpTotal % 4)) + output = [0]*(int(ceil(self.snpTotal/4.0))) + i = 0 + while i < self.snpTotal: + byte = struct.pack('B', self.callsToByte(calls[i:i+4])) + j = i / 4 + try: + output[j] = byte + except IndexError: + stderr.write("Failed to record byte at SNP index "+str(i)+"\n") + raise + i += 4 + return output + + def getSampleFields(self, sample): + """Get 6 sample metadata fields for .fam or .ped file + + Fields are: + - Family ID + - Individual ID + - Paternal ID + - Maternal ID + - Sex (1=male; 2=female; other=unknown) + - Phenotype + Conventionally, set family/individual IDs to sample URI + Set gender code if known, otherwise default to the dummy value + Other values set to the dummy as placeholder""" + fields = [str(self.famDummy)]*6 + fields[0] = sample['uri'] + fields[1] = sample['uri'] + try: fields[4] = str(sample['gender_code']) + except KeyError: pass + return fields + + def numericChromosomes(self, chroms): + """Convert to numeric chromosome IDs used by Plink""" + for i in range(len(chroms)): + if chroms[i]=='X': chroms[i] = 23 + elif chroms[i]=='Y': chroms[i] = 24 + elif chroms[i]=='XY': chroms[i] = 25 + elif chroms[i]=='MT': chroms[i] = 26 + else: chroms[i] = int(chroms[i]) + return chroms + + def snpSortMap(self): + """Sort snps into (chromosome, position) order + + Ensures compatibility with sorted .bim files generated by Plink + Return a map from original position to sorted position""" + chroms = self.numericChromosomes(self.bpm.getChromosomes()) + pos = self.bpm.getPositions() + coords = [None]*self.snpTotal + for i in range(self.snpTotal): + coords[i] = (chroms[i], int(pos[i]), i) + coords.sort() + sortMap = {} + for i in range(self.snpTotal): + [chrom, pos, orig] = coords[i] + sortMap[orig] = i + return sortMap + + def writeBed(self, binaryCalls, outPath, verbose=False): + """Write output for one or more samples in Plink .bed format + + Input: List of call bytes, and output path + Output file: First 2 bytes are Plink magic number + 3rd byte is flag for an individual-major file + Subsequent bytes represent genotype calls + """ + header = [0b01101100, 0b00011011, 0b00000000] + output = [] + for byte in header: output.append(struct.pack('B', byte)) + output.extend(binaryCalls) + out = open(outPath, 'w') + for byte in output: out.write(byte) + out.close() + if verbose: print len(output), "bytes written." + + def writeBim(self, outPath): + """Write a Plink .bim file to accompany .bed output + + Similar to Plink .map format, except: + - 2 additional columns for allele names (use A and B as dummy values) + - Entries are *sorted* into (chromosome, position) order + - Chromosomes are given numeric codes (including for X, Y, etc.) + + Use "while" instead of "for range" for greater efficiency + """ + unsorted = [None]*self.snpTotal + i = 0 + while i < self.snpTotal: + snp = self.bpm.names[i] + chr = self.bpm.chr[i] + pos = self.bpm.pos[i] + alleleA = self.bpm.A[i] + alleleB = self.bpm.B[i] + out = [chr, snp, "0", pos, alleleA, alleleB] + unsorted[i] = out + i += 1 + # sort manifest entries + out = [None]*self.snpTotal + i = 0 + while i < self.snpTotal: + out[self.sortMap[i]] = unsorted[i] + i += 1 + # write to file + outFile = open(outPath, 'w') + i = 0 + while i < self.snpTotal: + words = [] + for item in out[i]: words.append(str(item)) + outFile.write("\t".join(words)+"\n") + i += 1 + outFile.close() + + def writeFam(self, sampleJson, outPath): + """Write a Plink .fam file to accompany .bed output""" + samples = json.loads(open(sampleJson).read()) + outLines = [] + for sample in samples: + outLines.append(' '.join(self.getSampleFields(sample))+"\n") + outFile = open(outPath, 'w') + outFile.write("".join(outLines)) + outFile.close() + + def writeMap(self, outPath): + """Write Plink .map format file""" + outLines = [] + for i in range(self.snpTotal): + snp = self.bpm.names[i] + chr = self.bpm.chr[i] + pos = self.bpm.pos[i] + out = [str(chr), str(snp), "0", str(pos)] + outLines.append("\t".join(out)+"\n") + outFile = open(outPath, 'w') + outFile.write("".join(outLines)) + outFile.close() + + def writePed(self, calls, sampleJson, outPath): + """Write Plink .ped format file. + + Each line represents one sample. + First 6 fields are same as for .fam file, see writeFam() method. + Subsequent fields are allele pairs, eg. 'G C' or 'A B'. + If manifestNames==True then get allele symbols from manifest, + otherwise use A and B respectively for major and minor alleles.""" + samples = json.loads(open(sampleJson).read()) + outLines = [] + if len(calls) % self.snpTotal !=0: + msg = "Number of calls %s is not a multiple of SNP total %s" % \ + (len(calls), self.snpTotal) + raise ValueError(msg) + for i in range(len(samples)): + fields = self.getSampleFields(samples[i]) + start = i * self.snpTotal + for j in range(self.snpTotal): + alleleA = self.bpm.A[j] + alleleB = self.bpm.B[j] + call = calls[start+j] + if call == 1: symbol = alleleA+' '+alleleA + elif call == 2: symbol = alleleA+' '+alleleB + elif call == 3: symbol = alleleB+' '+alleleB + else: symbol = "0 0" + fields.append(symbol) + outLines.append("\t".join(fields)+"\n") + outFile = open(outPath, 'w') + outFile.write("".join(outLines)) + outFile.close() diff --git a/src/zcall/prepareThresholds.py b/src/zcall/prepareThresholds.py new file mode 100755 index 0000000..b7db368 --- /dev/null +++ b/src/zcall/prepareThresholds.py @@ -0,0 +1,117 @@ +#! /usr/bin/env python + + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + + +""" Find thresholds for given .egt file and Z score(s) +Combines findMeanSD.py, findBetas.r, findThresholds.py from original zCall + +Iain Bancarz, ib5@sanger.ac.uk +January 2013 +""" + +import cProfile, os, sys, time +import shared +try: + import argparse, json + from tempfile import NamedTemporaryFile + from calibration import ThresholdFinder + from utilities import ArgParserExtra +except ImportError: + sys.stderr.write(shared.importErrorMessage) + sys.exit(1) + +""" +Calibration procedure: +1. Run findMeanSD.py on given EGT file. Outputs mean_sd.txt +2. Run findBetas.r on output from (1). Outputs betas.txt +3. Run findThresholds.py on EGT file and betas.txt, with given Z score(s). +Outputs from (1) and (2) are written to a temporary directory, deleted on exit. + +Recommended default Z score = 7. Suggested range of alternatives = 3 to 15. +""" + +def main(): + # 'main' method to run script from command line + start = time.time() + args = get_args() + verbose = args['verbose'] + if verbose: print "Starting prepareThresholds.py" + # validate arguments + if args['ztotal']<1 or args['zstart']<1: + raise ValueError("Invalid zstart or ztotal option.") + parserExtra = ArgParserExtra(args) + args = parserExtra.validateInputs(['egt', 'config']) + args = parserExtra.validateOutputDir() + profile = parserExtra.enableProfile() + if profile: + pstats = NamedTemporaryFile(prefix="prepareThresholds_", + suffix=".pstats", + dir=args['out'], delete=False).name + cmd = "ThresholdFinder('"+args['egt']+"', '"+args['config']+\ + "').runMultiple(%s, %s, '%s', %s, %s)" % \ + (args['zstart'], args['ztotal'], args['out'], args['verbose'], + args['force']) + cProfile.run(cmd, pstats) + else: + tf = ThresholdFinder(args['egt'], args['config']) + tf.runMultiple(args['zstart'], args['ztotal'], args['out'], + args['verbose'], args['force']) + duration = time.time() - start + if verbose: print "Finished. Duration:", round(duration, 1), "s" + +def get_args(): + # parse command-line arguments and return dictionary of params + description = "Generates threshold files for use with the zCall genotype caller. Inputs are an .egt file and one or more Z score values. The .egt file is a proprietary Illumina binary file format, containing typical means and standard deviations for intensity clusters. An .egt file is supplied by Illumina for its own genotyping chips, or it may be generated using the GenomeStudio software for custom probe sets." + parser = argparse.ArgumentParser(description=description) + parser.add_argument('--egt', required=True, metavar="PATH", + help="Path to .egt input file.") + parser.add_argument('--out', metavar="DIR", default=".", + help="Directory for output; defaults to current working directory. Filename(s) will be of the form _z_thresholds.txt, for an input file of the form .egt") + configDefault = os.path.join(sys.path[0], '../etc/config.ini') + configDefault = os.path.abspath(configDefault) + parser.add_argument('--config', metavar="PATH", default=configDefault, + help="Path to .ini config file. Default = etc/config.ini") + parser.add_argument('--zstart', metavar="INT", default=7, type=int, + help='Starting z score. Default = %(default)s') + parser.add_argument('--ztotal', metavar="INT", default=1, type=int, + help='Total number of integer z scores to generate. Default = %(default)s') + parser.add_argument('--index_name', metavar="STRING", + default="threshold_index.json", + help='Name for .json index file with paths to thresholds.txt output, written to output directory') + parser.add_argument('--verbose', action='store_true', default=False, + help="Print status information to standard output") + parser.add_argument('--profile', action='store_true', default=False, + help="Use cProfile to profile runtime operation, if not activated by default in config.ini") + parser.add_argument('--no-profile', action='store_true', default=False, + help="Do not use cProfile to profile runtime operation. Overrides default in config.ini.") + parser.add_argument('--force', action='store_true', default=False, + help="Force overwrite of existing threshold files (if any)") + args = vars(parser.parse_args()) + return args + +if __name__ == "__main__": + main() diff --git a/src/zcall/runZCall.py b/src/zcall/runZCall.py new file mode 100755 index 0000000..03882ba --- /dev/null +++ b/src/zcall/runZCall.py @@ -0,0 +1,181 @@ +#! /usr/bin/env python + + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + + +import cProfile, os, struct, sys +import shared +from GTC import * +try: + import argparse, json + from calibration import ThresholdFinder + from utilities import CallingBase, ThresholdContainer, ArgParserExtra + from plinkWriter import PlinkWriter + from tempfile import NamedTemporaryFile +except ImportError: + sys.stderr.write(shared.importErrorMessage) + sys.exit(1) + + +class SampleCaller(CallingBase): + + """Class to run zCall on given GTC files + + initialize with: bpmPath, egtPath, threshPath + Write output in .bed format; use merge_bed in genotyping pipeline workflows + Construct Plink .bim manifest file from BPM object + Optionally, use sample JSON file to construct Plink .fam file + """ + + def makeCalls(self, gtc, null=False): + """Apply zCall to 'no calls' from GTC input + + Return genotypes in numeric format + 0 - "No Call", 1 - AA, 2 - AB, 3 - BB + + If null, output the unchanged original calls (use for testing) + """ + calls = [] + zcalls = 0 + gains = 0 + for i in range(self.snpTotal): + origCall = gtc.genotypes[i] + if null==False \ + and origCall == 0 \ + and self.thresholds.getX(i) != "NA" \ + and self.thresholds.getY(i) != "NA": + zcall = self.call(gtc, i) + calls.append(zcall) + zcalls += 1 + if zcall!=0: gains += 1 + else: + calls.append(origCall) + return (calls, zcalls, gains) + + def run(self, samplesPath, outDir, prefix, logPath=None, binary=False, + verbose=False, dummy=-9, skip=False, reorder=True): + """Apply zCall to GTC files and write Plink .bed output""" + gtcPaths = self.readSampleJson(samplesPath) + calls = [] + pw = PlinkWriter(self.bpm, dummy) + zcallTotal = 0 + gainsTotal = 0 + for gtcPath in gtcPaths: + if verbose: print "Calling GTC file", gtcPath + gtc = GTC(gtcPath, self.bpm.normID) + (sampleCalls, zcalls, gains) = self.makeCalls(gtc, skip) + if binary: calls.extend(pw.callsToBinary(sampleCalls, reorder)) + else: calls.extend(sampleCalls) + zcallTotal += zcalls + gainsTotal += gains + outStem = os.path.join(outDir, prefix) + if binary: + pw.writeBed(calls, outStem+'.bed', verbose) + pw.writeBim(outStem+'.bim') + pw.writeFam(samplesPath, outStem+'.fam') + else: + pw.writePed(calls, samplesPath, outStem+'.ped') + pw.writeMap(outStem+'.map') + pw.writeFam(samplesPath, outStem+'.fam') + logData = { 'total_samples': len(gtcPaths), + 'sample_json': samplesPath, + 'plink_output': outStem, + 'total_snps': self.snpTotal, + 'total_calls': self.snpTotal*len(gtcPaths), + 'zcall_attempts': zcallTotal, + 'zcall_gains': gainsTotal, } + if logPath!=None: + log = open(logPath, 'w') + log.write(json.dumps(logData, sort_keys=True, + indent=4, separators=(',', ': '))+"\n") + log.close() + +def main(): + """Method to run as script from command line""" + description = "Apply zCall to no-calls with given threshold and samples." + parser = argparse.ArgumentParser(description=description) + parser.add_argument('--thresholds', required=True, metavar="PATH", + help="Path to zCall thresholds.txt file") + parser.add_argument('--bpm', required=True, metavar="PATH", + help="BPM .csv manifest file") + parser.add_argument('--egt', required=True, metavar="PATH", + help="EGT input file") + parser.add_argument('--samples', required=True, metavar="PATH", + help="Path to .json file containing sample URI's, genders (optional), and .gtc input paths") + parser.add_argument('--out', required=True, metavar="DIR", + help="Directory for output data") + parser.add_argument('--plink', default='zcall', metavar="STRING", + help="Prefix for Plink output files") + parser.add_argument('--binary', action='store_true', default=False, + help="Write Plink binary output. If this option is not given, output is in Plink text format.") + parser.add_argument('--log', metavar="PATH", default=None, + help="Path for .json log output. Defaults to [plink prefix]_log.json in same directory as Plink output.") + parser.add_argument('--verbose', action='store_true', default=False, + help="Print status information to standard output") + parser.add_argument('--profile', action='store_true', default=False, + help="Use cProfile to profile runtime operation. Overrides default in config.ini.") + parser.add_argument('--no-profile', action='store_true', default=False, + help="Do not use cProfile to profile runtime operation. Overrides default in config.ini.") + configDefault = os.path.join(sys.path[0], '../etc/config.ini') + configDefault = os.path.abspath(configDefault) + parser.add_argument('--config', metavar="PATH", default=configDefault, + help="Path to .ini config file. Default = etc/config.ini") + parser.add_argument('--fam-dummy-value', metavar="INT", type=int, choices=(0,-9), default=-9, help="Value in Plink .fam output for missing paternal ID, maternal ID, or phenotype. Must be equal to 0 or -9; defaults to -9.") + parser.add_argument('--null', action='store_true', default=False, + help="Do not apply zcall. Instead output GTC calls unchanged to an individual-major Plink binary file. Used for testing.") + + args = vars(parser.parse_args()) + parserExtra = ArgParserExtra(args) + inputKeys = ['thresholds', 'bpm', 'egt', 'samples', 'config'] + args = parserExtra.validateInputs(inputKeys) + args = parserExtra.validateOutputDir() + if args['log'] == None: + args['log'] = os.path.join(args['out'], args['plink']+'_log.json') + else: + args['log'] = os.path.abspath(args['log']) + if args['null']: + print "WARNING: Null option in effect, input calls will not be changed" + profile = ArgParserExtra(args).enableProfile() + if profile: + pstats = NamedTemporaryFile(prefix="runZCall_", suffix=".pstats", + dir=args['out'], delete=False).name + arg0 = (args['bpm'], args['egt'], args['thresholds']) + arg1 = (args['samples'], args['out'], args['plink'], args['log'], + args['binary'], args['verbose'], args['missing'], + args['null']) + call = ("SampleCaller('%s', '%s', '%s')." % arg0)+\ + ("run('%s', '%s', '%s', '%s', %s, %s, %s, %s)" % arg1) + cProfile.run(call, pstats) + else: + caller = SampleCaller(args['bpm'], args['egt'], args['thresholds']) + caller.run(args['samples'], args['out'], args['plink'], args['log'], + args['binary'], args['verbose'], args['fam_dummy_value'], + args['null']) + +if __name__ == "__main__": + main() diff --git a/src/zcall/shared.py b/src/zcall/shared.py new file mode 100644 index 0000000..edb7b08 --- /dev/null +++ b/src/zcall/shared.py @@ -0,0 +1,38 @@ +#! /usr/bin/env python + + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +"""Shared constants for zcall code + +Author: Iain Bancarz, ib5@sanger.ac.uk, September 2013 +""" + +importErrorMessage = """ImportError: Cannot run zCall, one or more prerequisites missing. +Prerequisites include: +- Python >= 2.7 or Pypy >= 2.1 +- Plinktools should be in the PYTHONPATH environment variable. + See: https://github.com/wtsi-npg/plinktools +""" diff --git a/src/zcall/textToSampleJson.py b/src/zcall/textToSampleJson.py new file mode 100644 index 0000000..3021821 --- /dev/null +++ b/src/zcall/textToSampleJson.py @@ -0,0 +1,107 @@ +#! /software/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + +"""Standalone script to convert plain text to input for zCall + +Input: Text file + One sample path per line (no spaces in pathnames!) + Optional second column is sample name; default to file name + Optional uri prefix + +Output: JSON file with sample names/uri's in correct format +""" + +import os, re, sys +import shared +try: + import argparse, json +except ImportError: + sys.stderr.write(shared.importErrorMessage) + sys.exit(1) + +class sampleTextParser: + + def __init__(self): + pass + + def readText(self, inPath): + """Read GTC paths and (optional) names from plain text file""" + lines = open(inPath).readlines() + gtcPaths = {} + for line in lines: + if re.match('#', line): continue + words = re.split('\s+', line.strip()) + gtc = words[0] + if len(words) > 1: + name = words[1] + else: + items = re.split('\.', os.path.basename(gtc)) + items.pop() # remove .gtc suffix + name = '.'.join(items) + gtcPaths[name] = gtc + return gtcPaths + + def writeJson(self, inPath, outPath, uriPrefix): + """Read plain text and output sample .json""" + gtcPaths = self.readText(inPath) + output = [] + names = gtcPaths.keys() + names.sort() + uriset = set() + for name in names: + sample = {} + if uriPrefix==None: uri = name + else: uri = uriPrefix+name + if uri in uriset: + raise ValueError("Sample URI \""+uri+"\" is not unique") + else: + uriset.add(uri) + sample['uri'] = uri + sample['result'] = gtcPaths[name] + sample['gender'] = -9 + output.append(sample) + out = open(outPath, 'w') + out.write(json.dumps(output)+"\n") + out.close() + + +def main(): + """Method to run as script from command line. Run with --help for usage.""" + description = "Process a plain text file listing GTC paths and (optional) sample names. Outputs a JSON file for input to zCall scripts. JSON file is in a similar format to the samples file in the WTSI Genotyping Pipeline." + parser = argparse.ArgumentParser(description=description) + parser.add_argument('--input', required=True, metavar="PATH", + help="Path to .txt file with sample GTC paths, one per line. Optional second column contains sample names. Paths must not contain whitespace. Input lines starting with # are ignored.") + parser.add_argument('--output', required=False, metavar="PATH", + help="Path for .json output", default="samples.json") + parser.add_argument('--uri', required=False, metavar="STRING", + help="URI prefix to be prepended to all sample names, eg. uri:my_institution:'", default=None) + args = vars(parser.parse_args()) + sampleTextParser().writeJson(args['input'], args['output'], args['uri']) + +if __name__ == "__main__": + main() diff --git a/src/zcall/utilities.py b/src/zcall/utilities.py new file mode 100644 index 0000000..67b1f40 --- /dev/null +++ b/src/zcall/utilities.py @@ -0,0 +1,239 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + +import os, json, sys +from math import isinf, isnan +from ConfigParser import ConfigParser +from GTC import * +from BPM import * +from EGT import * + +class SharedBase: + """Constants and simple functions shared across multiple classes""" + + Z_KEY = 'BEST_Z' + T_KEY = 'BEST_THRESHOLDS' + M_KEY = 'SAMPLE_METRICS' + + def readSampleJson(self, inPath): + """Read sample GTC paths from .json file used by genotyping pipeline""" + contents = open(inPath).read() + try: + samples = json.loads(contents) + except ValueError: + msg = "JSON parser error! Input to failed parse attempt:\n" + sys.stderr.write(msg) + sys.stderr.write(contents) + raise + gtc = [] + for sample in samples: gtc.append(sample['result']) + return gtc + + +class CallingBase(SharedBase): + """ 'Base' class containing useful methods for zcall subclasses """ + + def __init__(self, bpmPath, egtPath, threshPath=None): + self.bpm = BPM(bpmPath) + self.egt = EGT(egtPath) + self.snpTotal = self.egt.getTotalSNPs() + if self.bpm.getTotalSNPs() != self.snpTotal: + raise ValueError("ERROR: SNP totals in .egt and .bpm inputs differ") + if threshPath!=None: + self.thresholds = ThresholdContainer(threshPath) + else: + self.thresholds = None + + def call(self, gtc, i): + """ re-call ith SNP in GTC file, using zcall thresholds + + call codes: 0 - "No Call", 1 - AA, 2 - AB, 3 - BB + """ + if self.thresholds==None: + raise ValueError("Must specify thresholds before calling!") + normX = gtc.normXintensities[i] + normY = gtc.normYintensities[i] + Tx = self.thresholds.getX(i) + Ty = self.thresholds.getY(i) + call = None + inputSum = normX+normY+Tx+Ty # use sum to check for NaN or infinity + if (isnan(inputSum) or isinf(inputSum)): + ## NaN or infinity present in inputs -- no call + call = 0 + elif normX < Tx and normY < Ty: ## Lower left quadrant + call = 0 + elif normX >= Tx and normY <= Ty: ## Lower right quadrant + call = 1 + elif normX < Tx and normY >= Ty: ## Upper left quadrant + call = 3 + else: ## Upper right quadrant + call = 2 + return call + + def findMAF(self, nAA, nBB, nAB): + """ Find minor allele frequency """ + maf = None + try: + if nAA > nBB: + maf = (nAB + 2 * nBB) / float(2*(nAA + nAB + nBB)) + else: + maf = (nAB + 2 * nAA) / float(2*(nAA + nAB + nBB)) + except ZeroDivisionError: + maf = 0 + return maf + + def normalizeCall(self, call, nAA, nBB): + """Normalization: Flip genotype call so 1 is always the common allele homozygote and 3 is the minor allele homozygote + + Enforces convention that major allele is on X intensity axis + Allele counts are taken from EGT object + """ + if nBB > nAA: + if call == 1: + call = 3 + elif call == 3: + call = 1 + return call + + def setThresholds(self, thresholds): + """Set thresholds to given ThresholdContainer object""" + self.thresholds = thresholds + +class ConfigReader: + """Convenience wrapper for the ConfigParser class to read .ini files""" + + def __init__(self, configPath): + configFile = open(configPath, 'r') + self.cp = ConfigParser() + self.cp.readfp(configFile) + configFile.close() + + def getParser(self): + return self.cp + +class ArgParserExtra: + """Class for additional parsing/validation of command-line arguments""" + def __init__(self, args): + self.args = args + + def enableProfile(self): + """Check if script should be run with Python profiler enabled + + Note that parser converts - to _ in --no-profile option key""" + cp = ConfigReader(self.args['config']).getParser() + if cp.has_option('zcall', 'profile'): default = True + else: default = False + if self.args['profile'] and self.args['no_profile']: + raise ValueError("Cannot specify both --profile and --no-profile!") + elif self.args['profile']: + enable = True + elif self.args['no_profile']: + enable = False + else: + enable = default + return enable + + def validateInputs(self, keys): + """Validate input paths with given keys + + Convert paths to absolute paths, and return revised args""" + for key in keys: + if not os.path.exists(self.args[key]): + raise OSError("Input path '"+self.args[key]+"' does not exist!") + elif not os.access(self.args[key], os.R_OK): + raise OSError("Cannot read input path '"+self.args[key]+"'!") + else: + self.args[key] = os.path.abspath(self.args[key]) + return self.args + + def validateOutputDirPath(self, dirPath): + """Validate output directory path, and convert to absolute path""" + if not os.path.exists(dirPath): + raise OSError("Output '"+dirPath+"' does not exist!") + elif not os.path.isdir(dirPath): + raise OSError("Output '"+dirPath+"' is not a directory!") + elif not os.access(dirPath, os.W_OK): + raise OSError("Cannot write to output '"+dirPath+"'!") + else: + dirPath = os.path.abspath(dirPath) + return dirPath + + def validateOutputDir(self, key="out"): + """Validate output directory argument + + Convert path to absolute path, and return revised args""" + self.args[key] = self.validateOutputDirPath(self.args[key]) + return self.args + + def validateOutputFile(self, key="out"): + """Validate an output file path + + Directory must exist and be writable, file may not be""" + if os.path.isdir(self.args[key]): + raise OSError("Output file '"+self.args[key]+"' is a directory!") + (dirName, fileName) = os.path.split(self.args[key]) + dirName = self.validateOutputDirPath(dirName) + self.args[key] = os.path.join(dirName, fileName) + return self.args + + +class ThresholdContainer: + """Class to contain thresholds, read from thresholds.txt file""" + + def __init__(self, inPath): + (self.x, self.y) = self.readThresholds(inPath) + + def getX(self, i): + """Get x threshold for the ith SNP""" + return self.x[i] + + def getY(self, i): + """Get y threshold for the ith SNP""" + return self.y[i] + + def readThresholds(self, inPath): + """ Read a thresholds.txt file; return lists of x and y thresholds """ + thresholdsX = [] + thresholdsY = [] + for line in open(inPath, 'r'): + line = line.replace("\n", "") + if line.find("Tx") != -1: + continue + else: + fields = line.split("\t") + if fields[1] != "NA": + tx = float(fields[1]) + else: + tx = fields[1] + if fields[2] != "NA": + ty = float(fields[2]) + else: + ty = fields[2] + thresholdsX.append(tx) + thresholdsY.append(ty) + return (thresholdsX, thresholdsY) diff --git a/src/zcall/zCallComplete.py b/src/zcall/zCallComplete.py new file mode 100755 index 0000000..23a4f2f --- /dev/null +++ b/src/zcall/zCallComplete.py @@ -0,0 +1,184 @@ +#! /usr/bin/env python + +# Copyright (c) 2013 Genome Research Ltd. All rights reserved. +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . +# +# Publication of the zCall algorithm: +# Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, +# O'Dushlaine C, Moran JL, Chambert K, Stevens C; Swedish Schizophrenia +# Consortium; ARRA Autism Sequencing Consortium, Sklar P, Hultman CM, +# Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM. +# zCall: a rare variant caller for array-based genotyping: Genetics and +# population analysis. Bioinformatics. 2012 Oct 1;28(19):2543-2545. +# Epub 2012 Jul 27. PubMed PMID: 22843986. + +# Author: Iain Bancarz, ib5@sanger.ac.uk + + +"""Standalone, self-contained script to run the complete zcall process. + +zcall steps: +1. Generate thresholds +2. Evaluate thresholds +3. Merge evaluations to find best threshold +4. Apply zcall with given threshold to no-calls in input data + +This standalone script does not allow parallelization of evaluation or calling, +or reuse of thresholds. It is intended as a convenience script for small +datasets where parallelization is not required, and an illustration of the +zcall method. +""" +import cProfile, os, sys, time +import shared +try: + import argparse, json + from tempfile import NamedTemporaryFile + from utilities import ArgParserExtra +except ImportError: + sys.stderr.write(shared.importErrorMessage) + sys.exit(1) +from calibration import ThresholdFinder, SampleEvaluator, MetricEvaluator +from runZCall import SampleCaller + +class ZCallComplete: + + EVALUATION = 'threshold_evaluation.json' + MERGED = 'merged_evaluation.json' + LOG = 'zcall_log.json' + + def __init__(self, args): + self.args = args + + def call(self, bpm, egt, thresholdPath, sJson, outDir, prefix, binary, + verbose, missing): + if verbose: print "Running zcall with thresholds", thresholdPath + logPath = os.path.join(outDir, self.LOG) + caller = SampleCaller(bpm, egt, thresholdPath) + caller.run(sJson, outDir, prefix, logPath, binary, verbose, missing) + + def evaluate(self, bpm, egt, thresholds, gtc, start, end, outDir, verbose): + """Evaluate threshold.txt files by concordance and gain metrics""" + outPath = os.path.join(outDir, self.EVALUATION) + eva = SampleEvaluator(bpm, egt) + eva.run(thresholds, gtc, start, end, outPath, verbose) + return outPath + + def merge(self, metricPath, thresholdJson, outDir, textPath, config, + verbose): + outPath = os.path.join(outDir, self.MERGED) + eva = MetricEvaluator(config) + results = eva.writeBest([metricPath,], thresholdJson, outPath, + textPath, verbose) + thresholdPath = results[eva.getBestThresholdKey()] + return thresholdPath + + def prepare(self, zstart, ztotal, egtPath, outDir, config, verbose=False): + """ Prepare threshold.txt files for given range of z scores""" + tf = ThresholdFinder(egtPath, config) + return tf.runMultiple(zstart, ztotal, outDir, verbose) + + + def run(self): + tJson = self.prepare(self.args['zstart'], + self.args['ztotal'], + self.args['egt'], + self.args['out'], + self.args['config'], + self.args['verbose']) + mJson = self.evaluate(self.args['bpm'], + self.args['egt'], + tJson, + self.args['samples'], + self.args['gtc_start'], + self.args['gtc_end'], + self.args['out'], + self.args['verbose']) + thresholdPath = self.merge(mJson, + tJson, + self.args['out'], + self.args['text'], + self.args['config'], + self.args['verbose']) + self.call(self.args['bpm'], + self.args['egt'], + thresholdPath, + self.args['samples'], + self.args['out'], + self.args['plink'], + self.args['binary'], + self.args['verbose'], + self.args['fam_dummy_value']) + +def main(): + start = time.time() + args = getArgs() + parserExtra = ArgParserExtra(args) + args = parserExtra.validateInputs(['bpm', 'egt', 'config', 'samples']) + args = parserExtra.validateOutputDir() + if args['text']!=None: args = parserExtra.validateOutputFile('text') + profile = parserExtra.enableProfile() + if profile: + pstats = NamedTemporaryFile(prefix="zCallComplete_", + suffix=".pstats", + dir=args['out'], delete=False).name + cProfile.run('ZCallComplete('+str(args)+').run()', pstats) + else: + ZCallComplete(args).run() + if args['verbose']==True: + duration = time.time() - start + print "zCall finished. Duration:", round(duration, 2), "seconds." + +def getArgs(): + description = "Standalone script to run the complete zcall process: Generate and evaluate thresholds, and apply zcall to no-calls in the input data." + parser = argparse.ArgumentParser(description=description) + configDefault = os.path.join(sys.path[0], '../etc/config.ini') + configDefault = os.path.abspath(configDefault) + parser.add_argument('--bpm', required=True, metavar="PATH", + help="BPM .csv manifest file") + parser.add_argument('--egt', required=True, metavar="PATH", + help="Path to .egt input file.") + parser.add_argument('--config', metavar="PATH", default=configDefault, + help="Path to .ini config file. Default = etc/config.ini") + parser.add_argument('--out', metavar="DIR", default=".", + help="Directory for output; defaults to current working directory.") + parser.add_argument('--plink', default='zcall', metavar="STRING", + help="Prefix for Plink output files") + parser.add_argument('--binary', action='store_true', default=False, + help="Write Plink binary output. If this option is not given, output is in Plink text format.") + parser.add_argument('--fam-dummy-value', metavar="INT", type=int, choices=(0,-9), default=-9, help="Value in Plink .fam output for missing paternal ID, maternal ID, or phenotype. Must be equal to 0 or -9; defaults to -9.") + parser.add_argument('--zstart', metavar="INT", default=7, type=int, + help='Starting z score. Default = %(default)s') + parser.add_argument('--ztotal', metavar="INT", default=1, type=int, + help='Total number of integer z scores to generate. Default = %(default)s') + parser.add_argument('--gtc_start', metavar="INT", default = 0, + help="Starting index in GTC .json file for threshold evaluation") + parser.add_argument('--gtc_end', metavar="INT", default = -1, + help="Ending index in GTC .json file for threshold evaluation") + parser.add_argument('--samples', required=True, metavar="PATH", + help="Path to .json file containing sample URIs (unique identifiers), gender codes (optional), and .gtc data paths") + parser.add_argument('--text', required=False, metavar="PATH", + help="Path for text summary of calibration metrics. Optional.", + default=None) + parser.add_argument('--verbose', action='store_true', default=False, + help="Print status information to standard output") + parser.add_argument('--profile', action='store_true', default=False, + help="Use cProfile to profile runtime operation. Overrides default in config.ini.") + parser.add_argument('--no-profile', action='store_true', default=False, + help="Do not use cProfile to profile runtime operation. Overrides default in config.ini.") + args = vars(parser.parse_args()) + return args + +if __name__ == "__main__": + main() diff --git a/updates.txt b/updates.txt index 58489d1..198f9de 100644 --- a/updates.txt +++ b/updates.txt @@ -2,7 +2,8 @@ zCall: A Rare Variant Caller for Array-Based Genotyping Author: Jackie Goldstein - jigold@broadinstitute.org -Contact: Benjamin Neale - bneale@broadinstitute.org +Contact (prototype): Benjamin Neale - bneale@broadinstitute.org +Contact (extended): Iain Bancarz - ib5@sanger.ac.uk May 8th, 2012 ***The Illumina provided Code was provided as-is and with no warranty as to performance and no warranty against it infringing any other party's intellectual property rights. @@ -41,3 +42,10 @@ October 3rd, 2012: 2. Added a new zip file to each folder with minor updates to README files. Also fixed a small bug in the calibrateZ.py script for Version 3 and the sampleConcordance.py script for Version 1. (The bug caused the script to get killed when there was an "NA" in the thresholds file) You can either download the latest zip file bundle for each version (VersionX.3) or just download the scripts with the bug fixes from the additionalScripts/ folder. +January 25, 2013: + +Start of development on extended zCall by Iain Bancarz: ib5@sanger.ac.uk, https://github.com/iainrb/zCall/ . + +April 3, 2013: + +Fork of https://github.com/iainrb/zCall/ repository to https://github.com/wtsi-npg. Initial merge of extended 'devel' branch into master, with tags for original and extended versions (0.1 and 0.2 respectively).