
Variant Quality Score Recalibration (VQSR)


This document describes what Variant Quality Score Recalibration (VQSR) is designed to do, and outlines how it works under the hood. For command-line examples and recommendations on what specific resource datasets and arguments to use for VQSR, please see this FAQ article.

As a complement to this document, we encourage you to watch the workshop videos available on our Events webpage.

Detailed information about command line options for VariantRecalibrator can be found here.

Detailed information about command line options for ApplyRecalibration can be found here.


Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.

The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided as input (typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array, for humans). This adaptive error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.

The variant recalibrator evaluates variants in a two-step process, each step performed by a distinct tool:

  • VariantRecalibrator
    Create a Gaussian mixture model by looking at the annotation values over a high quality subset of the input call set, then evaluate all input variants against that model. This step produces a recalibration file.

  • ApplyRecalibration
    Apply the model parameters to each variant in the input VCF files, producing a recalibrated VCF file in which each variant is annotated with its VQSLOD value. In addition, this step filters the calls based on the new LOD score, writing the name of the tranche into the FILTER field of variants that don't meet the specified LOD threshold.

Please see the VQSR tutorial for step-by-step instructions on running these tools.

How VariantRecalibrator works in a nutshell

The tool takes the overlap of the training/truth resource sets and of your callset. It models the distribution of these variants relative to the annotations you specified, and attempts to group them into clusters. Then it uses the clustering to assign VQSLOD scores to all variants. Variants that are closer to the heart of a cluster will get a higher score than variants that are outliers.
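For orientation, here is a sketch of what a VariantRecalibrator command looks like (GATK 3 syntax; the file names are placeholders, and the resource set and annotations shown are illustrative -- see the FAQ article mentioned above for the actual recommended resources and arguments):

java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R reference.fasta \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS \
    -mode SNP \
    -recalFile output.recal \
    -tranchesFile output.tranches \
    -rscriptFile output.plots.R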

How ApplyRecalibration works in a nutshell

During the first part of the recalibration process, variants in your callset were given a score called VQSLOD. At the same time, variants in your training sets were also ranked by VQSLOD. When you specify a tranche sensitivity threshold with ApplyRecalibration, expressed as a percentage (e.g. 99.9%), the program determines the VQSLOD value above which 99.9% of the variants in the training callset are included, and uses that VQSLOD value as the threshold for filtering your variants. Variants that are above the threshold pass the filter, so the FILTER field will contain PASS. Variants that are below the threshold will be filtered out; they will be written to the output file, but in the FILTER field they will have the name of the tranche they belonged to. So VQSRTrancheSNP99.90to100.00 means that the variant was in the range of VQSLODs corresponding to the remaining 0.1% of the training set, which are basically considered false positives.
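A sketch of the corresponding ApplyRecalibration command (GATK 3 syntax; file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R reference.fasta \
    -input raw_variants.vcf \
    -recalFile output.recal \
    -tranchesFile output.tranches \
    --ts_filter_level 99.9 \
    -mode SNP \
    -o recalibrated_variants.vcf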


Interpretation of the Gaussian mixture model plots

The variant recalibration step fits a Gaussian mixture model to the contextual annotations given to each variant. By fitting this probability model to the training variants (variants considered to be true-positives), a probability can be assigned to the putative novel variants (some of which will be true-positives, some of which will be false-positives). It is useful for users to see how the probability model was fit to their data. Therefore a modeling report is automatically generated each time VariantRecalibrator is run (if an Rscript output file is specified, the report will appear as path/to/output.plots.R.pdf). For every pair-wise combination of annotations used in modeling, a 2D projection of the Gaussian mixture model is shown.

[Image: example page from the Gaussian mixture model report]

The figure shows one page of an example Gaussian mixture model report that is automatically generated by the VQSR from the example HiSeq call set. This page shows the 2D projection of mapping quality rank sum test versus Haplotype score by marginalizing over the other annotation dimensions in the model.

In each page there are four panels which show different ways of looking at the 2D projection of the model. The upper left panel shows the probability density function that was fit to the data. The 2D projection was created by marginalizing over the other annotation dimensions in the model via random sampling. Green areas show locations in the space that are indicative of being high quality while red areas show the lowest probability areas. In general putative SNPs that fall in the red regions will be filtered out of the recalibrated call set.

The remaining three panels give scatter plots in which each SNP is plotted in the two annotation dimensions as points in a point cloud. The scale for each dimension is in normalized units. The data for the three panels is the same but the points are colored in different ways to highlight different aspects of the data. In the upper right panel SNPs are colored black and red to show which SNPs are retained and filtered, respectively, by applying the VQSR procedure. The red SNPs didn't meet the given truth sensitivity threshold and so are filtered out of the call set. The lower left panel colors SNPs green, grey, and purple to give a sense of the distribution of the variants used to train the model. The green SNPs are those which were found in the training sets passed into the VariantRecalibrator step, while the purple SNPs are those which were found to be furthest away from the learned Gaussians and thus given the lowest probability of being true. Finally, the lower right panel colors each SNP by their known/novel status with blue being the known SNPs and red being the novel SNPs. Here the idea is to see if the annotation dimensions provide a clear separation between the known SNPs (most of which are true) and the novel SNPs (most of which are false).

An example of good clustering for SNP calls from the tutorial dataset is shown to the right. The plot shows that the training data forms a distinct cluster at low values for each of the two statistics shown (haplotype score and mapping quality bias). As the SNPs fall off the distribution in either one or both of the dimensions they are assigned a lower probability (that is, move into the red region of the model's PDF) and are filtered out. This makes sense as not only do higher values of HaplotypeScore indicate a lower chance of the data being explained by only two haplotypes but also higher values for mapping quality bias indicate more evidence of bias between the reference bases and the alternative bases. The model has captured our intuition that this area of the distribution is highly enriched for machine artifacts and putative variants here should be filtered out!


Tranches and the tranche plot

The recalibrated variant quality score provides a continuous estimate of the probability that each variant is true, allowing one to partition the call sets into quality tranches. The main purpose of the tranches is to establish thresholds within your data that correspond to certain levels of sensitivity relative to the truth sets. The idea is that with well calibrated variant quality scores, you can generate call sets in which each variant doesn't have to have a hard answer as to whether it is in or out of the set. If a very high accuracy call set is desired then one can use the highest tranche, but if a larger, more complete call set is a higher priority then one can dip down into lower and lower tranches. These tranches are applied to the output VCF file using the FILTER field. In this way you can choose to use some of the filtered records or only use the PASSing records.

The first tranche (90), which has the lowest value of truth sensitivity but the highest value of novel Ti/Tv, is exceedingly specific but less sensitive. Each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. Downstream applications can select more specific or more sensitive call sets in a principled way, or incorporate the recalibrated quality scores directly, weighting individual variant calls by their probability of being real instead of analyzing only a fixed subset of calls. An example tranche plot, automatically generated by the VariantRecalibrator walker, is shown below.

[Image: example tranches plot]

This is an example of a tranches plot generated for a HiSeq call set. The x-axis gives the number of novel variants called while the y-axis shows two quality metrics -- novel transition to transversion ratio and the overall truth sensitivity.

Note that the tranches plot is not applicable for indels and will not be generated when the tool is run in INDEL mode.


Ti/Tv-free recalibration

We use a Ti/Tv-free approach to variant quality score recalibration. This approach requires an additional truth data set, and cuts the VQSLOD at given sensitivities to the truth set. It has several advantages over the Ti/Tv-targeted approach:

  • The truth sensitivity (TS) approach gives you back the novel Ti/Tv as a QC metric
  • The truth sensitivity (TS) approach is conceptually cleaner than deciding on a novel Ti/Tv target for your dataset
  • The TS approach is easier to explain and defend, as saying "I took called variants until I found 99% of my known variable sites" is easier than "I took variants until I dropped my novel Ti/Tv ratio to 2.07"

We have used HapMap 3.3 sites as the truth set (genotypes_r27_nr.b37_fwd.vcf), but other high-quality sets of sites (~99% truly variable in the population) should work just as well. In our experience with HapMap, 99% is a good threshold, as the remaining 1% of sites often exhibit unusual features, like being close to indels or actually being MNPs, and so receive a low VQSLOD score.
Note that the expected Ti/Tv is still an available argument, but it is only used for display purposes.


Finally, a couple of Frequently Asked Questions

- Can I use the variant quality score recalibrator with my small sequencing experiment?

This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties.

One piece of advice is to turn down the number of Gaussians used during training. This can be accomplished by adding --maxGaussians 4 to your command line.

maxGaussians is the maximum number of different "clusters" (=Gaussians) of variants the program is "allowed" to try to identify. Lowering this number forces the program to group variants into a smaller number of clusters, which means there will be more variants in each cluster -- hopefully enough to satisfy the statistical requirements. Of course, this decreases the level of discrimination that you can achieve between variant profiles/error modes. It's all about trade-offs; and unfortunately if you don't have a lot of variants you can't afford to be very demanding in terms of resolution.

- Why don't all the plots get generated for me?

The most common problem related to this is not having Rscript accessible in your environment path. Rscript is the command line version of R that gets installed right alongside R itself. We also make use of the ggplot2 library, so please be sure to install that package as well. See the Common Problems section of the Guide for more details.


Error on running ReadBasedPhasing tool


I got a "Contig X does not have a length field" error when I ran the phasing tool. Digging in further, it seems to be a user exception raised in the htsjdk code; the relevant lines are below. Has anyone experienced this issue before? Is it a problem with the reference contigs, a bad VCF file, or a bad BAM file?

Thanks for any help,

---- exception raised in the following code ----

public SAMSequenceRecord getSAMSequenceRecord() {
    final String lengthString = this.getGenericFieldValue("length");
    if (lengthString == null) throw new TribbleException("Contig " + this.getID() + " does not have a length field.");

which is called by RMDTrackBuilder.java validateVariantAgainstSequenceDictionary(final String name, final String descriptorName, final AbstractFeatureReader reader, final SAMSequenceDictionary dict )

in vcfContigRecords.add(contig.getSAMSequenceRecord()); -- crashed here

Combining variants from different files into one


Solutions for combining variant callsets depending on purpose

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing a separate VCF (or gVCF) file for each chromosome. For that case, you can use a tool called CatVariants to concatenate the files. There are a few important requirements (e.g. the files should contain all the same samples, and cover distinct intervals) which you can read about on the tool's documentation page.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION mode to call variants on a large cohort, producing many gVCF files. For performance reasons, we recommend combining the output gVCFs in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs. You can do this using CombineGVCFs, which is specific for handling gVCF files.

  3. The third case is when you want to combine variant calls that were produced from the same samples but using different methods, for comparison. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. This produces separate callsets for the same samples, which are then easier to compare if you combine them into a single file. For that purpose, you can use CombineVariants, which is capable of merging VCF records intelligently, treating the same samples as separate or not as desired, and combining annotations as appropriate. This is the case that requires the most preparation and forethought, because there are many options that can be used to adapt the behavior of the tool.

There is also one reason you might want to combine variants from different files into one, that we do not recommend following. That is, if you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.


Merging records across VCFs with CombineVariants

Here we provide some more information and a worked out example to illustrate the third case because it is less straightforward than the other two.

A key point to understand is that CombineVariants will include a record at every site present in any of your input VCF files, and will annotate, in the set attribute of the INFO field, which input callsets the record is present, passing, or filtered in (see below). In effect, CombineVariants always produces a union of the input VCFs. Any part of the Venn diagram of the N merged VCFs can then be extracted specifically by using JEXL expressions on the set attribute with SelectVariants. If you want to extract just the records in common between two VCFs, you would first run CombineVariants on the two files to produce a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the tool documentation page linked above.

Understanding the set attribute

The set property of the INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

  • set=Intersection : occurred in both call sets, not filtered out

  • set=NAME : occurred in the call set NAME only

  • set=NAME1-filteredInNAME2 : occurred in both call sets; it was not filtered in NAME1 but was filtered in NAME2

  • set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

You specify the NAME of a callset by using the following syntax in your command line: -V:omni 1000G_omni2.5.b37.sites.vcf.

Emitting minimal VCF output

You can add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. In that case, the only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only (a general engine capability listed in the CommandLineGATK documentation) where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

Requiring sites to be present in a minimum number of callsets

Sometimes you may want to combine several data sets but only keep sites that are present in at least 2 of them. To do so, simply add the -minN (or --minimumN) argument, followed by an integer N, to only output records present in at least N input files. In our example, you would add -minN 2 to the command line.
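For instance, adapting the CombineVariants command from the example below (a sketch; the -minN 2 argument is the only addition):

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -minN 2 -o union.minN2.vcf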

Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

# combine the data
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf

# select the intersection
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection"' -o intersect.vcf

This results in two vcf files, which look like:

# contents of union.vcf
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

# contents of intersect.vcf
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection

What is a VCF and how should I interpret it?


This document describes "regular" VCF files. For information on the special kind of VCF called gVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see this companion document.


Contents

  1. What is VCF?
  2. Basic structure of a VCF file
  3. Interpreting the VCF file header information
  4. Structure of variant call records
  5. How the genotype and other sample-level information is represented
  6. How to extract information from a VCF in a sane, straightforward way

1. What is VCF?

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and expansion has been taken over by the Global Alliance for Genomics and Health Data Working group file format team. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specs like SAM/BAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants (see the sketch after this list).

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.
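As a sketch of the first point (file names and the interval are placeholders):

# peek at the header and the first few records without loading the whole file
zcat big_callset.vcf.gz | head -n 100

# or subset a genomic region with SelectVariants (GATK 3 syntax)
java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V big_callset.vcf -L 20:1000000-2000000 -o subset.vcf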


2. Basic structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

[Image: basic structure of a VCF file, showing the header section followed by the variant call records]

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
1   873762  .       T   G   5231.78 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   3931.66 PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   71.77   PASS    [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1   974165  rs9442391   T   C   29.84   LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs (also called SNVs), but other variation could be described, such as indels or CNVs. See the VCF specification for details on how the various types of variations are represented. Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, i.e. records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the VCF file header information

The following is a valid VCF header produced by HaplotypeCaller on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself!

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-3-gd1ac142,Date="Mon May 18 17:36:4
.
.
.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr1,length=249250621,assembly=b37>
##reference=file:human_genome_b37.fasta

We're not showing all the lines here, but that's still a lot... so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

  • VCF spec version

The first line:

##fileformat=VCFv4.1

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

  • FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

  • FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation.

  • GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, not just the ones specified explicitly by the user in the command line.

  • Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input bam file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for most organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

[todo: FAQ on genome builds]


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

The first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, i.e. ., to serve as a placeholder).

  • CHROM and POS : The contig and genomic coordinates on which the variant occurs.
    Note that for deletions the position given is actually the base preceding the event.

  • ID: An optional identifier for the variant.
    Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP.

  • REF and ALT: The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated).
    Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

  • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data.
    Because the Phred scale is -10 * log10(1-p), where p is the probability that the site is polymorphic, a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.
    Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

  • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters.
    If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

This next field does not have to be present in the VCF.

  • INFO: Various site-level annotations.
    The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, i.e. =, and pairs are separated by semicolons, i.e. ;, as in this example: MQ=99.00;MQ0=0;QD=17.94.
    They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. How the genotype and other sample-level information is represented

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

1   873762  .       T   G   [CLIPPED] GT:AD:DP:GQ:PL    0/1:173,141:282:99:255,0,255
1   877664  rs3828047   A   G   [CLIPPED] GT:AD:DP:GQ:PL    1/1:0,105:94:99:255,255,0
1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

Looking at that last column, here is what the tags mean:

  • GT : The genotype of this sample at this site.
    For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

    • 0/0 - the sample is homozygous reference
    • 0/1 - the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
    • 1/1 - the sample is homozygous alternate
      In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, G/G, and C/T respectively.
      For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT; for polyploids there will be more, e.g. 4 values for a tetraploid organism.
  • AD and DP : Allele depth and depth of coverage.
    These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
    AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
    DP is the filtered depth, at the sample level. This gives you the total number of filtered reads at the site; you can check the variant caller's documentation to see which filters are applied by default. Only reads that passed the variant caller's filters are included in this number. However, unlike the AD calculation, uninformative reads are included in DP.
    See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage).

  • PL : Normalized Phred-scaled likelihoods of the possible genotypes.
    For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are normalized so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale (meaning its P = 1.0 in regular scale). The other values are scaled relative to this most likely genotype.
    Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "likelihood of the genotype", we mean it is "the probability that the genotype is not correct". That's why the smaller the value, the better it is.
    [todo: PL details doc]

  • GQ : Quality of the assigned genotype.
    The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
    Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
    Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.

With that out of the way, let's interpret the genotype information for NA12878 at 1:899282.

1   899282  rs28548431  C   T   [CLIPPED] GT:AD:DP:GQ:PL    0/1:1,3:4:26:103,0,26

At this site, the called genotype is GT = 0/1, which corresponds to the alleles C/T. The confidence indicated by GQ = 26 isn't very good, largely because there were only a total of 4 reads at this site (DP = 4), 1 of which was REF (=had the reference base) and 3 of which were ALT (=had the alternate base) (indicated by AD=1,3). The lack of certainty is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele, but the next PL is PL(1/1) = 26 (which corresponds to 10^(-2.6), or 0.0025). So although we're pretty sure there's a variant at this site, there's a chance that the genotype assignment is incorrect, and that the subject may in fact not be het (heterozygous) but may instead be hom-var (homozygous with the variant allele). But either way, it's clear that the subject is definitely not hom-ref (homozygous with the reference allele) since PL(0/0) = 103, which corresponds to 10^(-10.3), a very small number.


6. How to extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.
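As a minimal sketch of what that looks like (GATK 3 syntax; the fields to extract depend on your callset):

java -jar GenomeAnalysisTK.jar -T VariantsToTable \
    -R reference.fasta \
    -V input.vcf \
    -F CHROM -F POS -F ID -F QUAL -F AC \
    -GF GT -GF GQ \
    -o results.table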

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal by the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)

variant QC graphs: bimodal QD


Hi,

I'm trying to finalise good hard-filtering parameters. Does anyone know why the quality-by-depth distribution has two peaks? (See attached graphs, last column "QD".)

This happens even after very strict filtering by basic metrics. There seem to be a lot of variants contributing to the two peaks, so I'm guessing it's not due to a particular genomic region. (The graph lines are Drosophila chromosomes. Chr4 in blue is clearly poor. The inbreeding coefficient and allele frequency are expected to look weird due to our breeding design.)

Filter parameters are
(MQ > 61, MQ < 68, FS < 5, AN > 420, InbreedingCoeff > -1, DP < 10000, DP > 1000, ReadPosRankSum > -1, ReadPosRankSum < 1, ClippingRankSum > -.5, ClippingRankSum < .5, BaseQRankSum > -1, BaseQRankSum < 1, MQRankSum > -.5, MQRankSum < .5, EVENTLENGTH < 1, EVENTLENGTH > -1).

Cheers,

Will

Empty VCF file from GATK-Lite UnifiedGenotyper


(There was another question about a similar symptom, but the answer does not appear to apply to what I'm seeing.)

I get an empty VCF file that just contains the header lines. The input BAM file is 1.8Gb, and as far as I can tell the content is OK - it has MAPQ scores, the flags seem reasonable, etc. I've attached a copy of the console output, and the beginning of the input file in SAM format. Let me know if you have any suggestions. Thanks,

Ravi

Change format of AD field to Number=R?


Hi

In GATK version 3.5 I see the following in VCF headers:

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

However, the number of values in the AD field should always be the number of alleles (including the reference), right? The 4.2 VCF spec has a value R to represent this. Therefore, could the header line be changed to the following in future GATK releases?

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

Why is this important? Well, one reason is that bcftools norm uses this when splitting multiallelic variants into multiple biallelics. With the '.' this isn't done correctly, but with the 'R' it is. See the comment from freeseek at https://github.com/samtools/bcftools/issues/40 for further details.
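Until the header changes, one possible workaround (a sketch, not an official recommendation; it assumes the header line matches exactly) is to rewrite the AD definition before running bcftools norm:

zcat calls.vcf.gz \
    | sed 's/##FORMAT=<ID=AD,Number=\./##FORMAT=<ID=AD,Number=R/' \
    | bgzip > calls.ADfixed.vcf.gz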

GT, PL and GQ inconsistent


From the GATK documentation I found that in the PL field, the most likely genotype (assigned in the GT field) has a value of 0. However, from the VCF file called by GATK, I found that this is not always the case. Below are a few examples:

GT:AD:DP:GQ:PL
0/1:0,232:444:99:0,382,4800
1/1:1,0:1:2:1,0,0
1/1:1,0:1:99:0,3,35
1/1:1,0:1:87:0,292,247

Also, GQ is equal to the second smallest PL, unless that PL is greater than 99. However, I also found cases like the following, where the GQ is 1.76:

GT:AD:DP:GQ:PL
0/1:9,14,23:1.76:30,3,0

I am wondering whether these have specific meanings, or whether they are due to a calling error or to low calling quality. Thank you very much.


Tallying different genotype calls per individual


Hi,

Is there a tool in GATK - or in any other program - which counts the number of different genotype calls per individual in a vcf?

Specifically this would be the total number of hom.ref, het, hom.var and no.call genotype calls - for each individual.

This would be fairly straightforward to do on a small-ish and simple genotypes file, with perl, R etc but I'd rather do it on a vcf.

Sincerely,

William Gilks
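One option outside GATK (a sketch, not from the original thread; requires a reasonably recent bcftools) is bcftools stats, whose per-sample counts (PSC) section tallies ref-hom, non-ref hom, het, and missing genotypes for each individual:

bcftools stats -s - calls.vcf > stats.txt
# the PSC lines hold the per-sample genotype counts
grep -E '^# PSC|^PSC' stats.txt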

After sorting my VCF as the website showed, I still have a problem with VariantRecalibrator


I sorted my VCF with the command:

java -jar picard.jar SortVcf \
    I=original.vcf \
    O=sorted.vcf \
    SEQUENCE_DICTIONARY=reference.dict

I called SNPs as shown at "https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php".
But I encounter problems like:
INFO 15:40:14,228 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:40:15,731 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:40:29,350 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:41:30,833 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO 15:41:30,834 HttpMethodDirector - Retrying request
INFO 15:41:31,098 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO 15:41:31,099 HttpMethodDirector - Retrying request
INFO 15:41:31,362 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO 15:41:31,363 HttpMethodDirector - Retrying request
INFO 15:41:31,621 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO 15:41:31,622 HttpMethodDirector - Retrying request
INFO 15:41:31,897 HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO 15:41:31,897 HttpMethodDirector - Retrying request

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in input.
ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
ERROR input contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
ERROR ------------------------------------------------------------------------------------------

When I look at the VCF file, the contig order is the same as the human reference contigs ("1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M"). What's wrong?

VCF output format problem when -ERC BP_RESOLUTION is used


In a discussion about using ERC, you provide some example VCF output lines like:

20  10000442   .   T  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,168,2095
20  10000443   .   A  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,169,2089
20  10000444   .   A  <NON_REF>  .   .   .   GT:AD:CD:DP:GQ:PL  0/0:56,0:56:56:99:0,168,2093

and say:

  1. "For each reference base not part of a variant call we emit a VCF record with the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site,"
  2. and
  3. "Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it."

While the formal Variant Call Format 4.2 Specification says:

  1. "ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends,"
  2. and
  3. "QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified."

The issue is subtle, but introduces problems with downstream processing of HaplotypeCaller generated VCF files containing reference calls. The use of the symbol "<NON_REF>" instead of "." for reference calls is a little confusing, but I also see the logic of that. More seriously: According to the VCFv4.2 specs, QUAL is NOT always a measure of "the probability that the site is polymorphic". Perhaps when a variant is called, but not when a site is called as non-variant. All those QUAL field values should NOT be 0 there. It is about the quality and correctness of the call itself. It is defined as the PHRED score of the probability that the assertion made in ALT is wrong. So if ALT asserts "<NON_REF>" or "." CONFIDENTLY (meaning a low probability that the assertion is wrong), then the QUAL PHRED score should be HIGH, not ZERO. A confident call should have a high QUAL score, whether variant or monomorphic. That is the intention of the specification. If otherwise: filtering out low quality records from a non-compliant VCF output file by removing those with low QUAL scores, for instance, would also filter out all the high quality reference calls.

I have previously been in touch with the writers/keepers of the VCF Specs (now over at samtools) and they confirm this interpretation of QUAL scores as the correct one for VCFv4.2 (and other versions) compliant output. It's unfortunately couched in mathematical double negatives, but look at it carefully and you will see the correctness of this reading. As one who uses tools from many sources, I hope you will address this discrepancy. It complicates interoperability.

-- Fred P.

How to read a bgzipped VCF file with HTSJDK?

 Hello,

How can you read a bgzipped VCF file using Htsjdk? Thanks!

(I hope I'm in the right place: the Htsjdk docs pointed to GATK)

 Best, Oliver
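A minimal sketch of one way to do this with a recent htsjdk (the file name is a placeholder; VCFFileReader detects block compression from the .vcf.gz extension, and requireIndex=false allows streaming without a .tbi index):

import java.io.File;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFFileReader;

public class ReadBgzippedVcf {
    public static void main(final String[] args) {
        // requireIndex=false: iterate the whole file instead of querying by interval
        try (final VCFFileReader reader = new VCFFileReader(new File("calls.vcf.gz"), false)) {
            for (final VariantContext vc : reader) {
                System.out.println(vc.getContig() + ":" + vc.getStart() + " " + vc.getAlleles());
            }
        }
    }
}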

AF calculation in Mutect2


Hi,

I've been calling somatic mutations with Mutect2, and have noted that the AF calculation does not always correlate to DP alt/(DP ref+DP alt). For example, consider this indel:

T057.mutect2.vcf:7 17379247 . CAACAGCAACAGTCCTTGGCTCTGAACTCAAGCTGTATGGTACAGGAACACCTACATCTAG C . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=57.68;TLOD=106.87 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:66,30:0.452:18:12:.:1983,919:32:34 0/0:214,0:0.112:0:0:.:6285,0:98:116

The tumour variant AF should be 30/(30+66) = 0.31, but is reported as 0.452, while the normal variant AF should be 0/214 = 0, but is reported as 0.112.

Am I missing something/misinterpreting these fields?

Any comments and help greatly appreciated!

Thanks,

Stephen

Code for hard-filtering and QC statistics


I thought I'd post some code for hard-filtering a VCF based on current GATK recommendations (as far as I know).

I'd be interested to hear any suggestions for improvement.

The recommended hard-filters are from http://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set

SNPs:
QualByDepth (QD) < 2.0
FisherStrand (FS) > 60.0
RMSMappingQuality (MQ) < 40.0
MappingQualityRankSumTest (MQRankSum) < -12.5
ReadPosRankSumTest (ReadPosRankSum) < -8.0

Indels:
QualByDepth (QD) < 2.0
FisherStrand (FS) > 200.0
ReadPosRankSumTest (ReadPosRankSum) < -20.0

#!/bin/sh
#$ -N z_vcf_filter1
#$ -pe openmp 1
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -q bioinf.q

## load modules
. /etc/profile.d/modules.sh
module load jre/1.7.0_25
module load gatk/3.4-0

## set variables
vcf_dir=../vcf_prefilter/
in_vcf=unfilt.lhm_rg_HC_raw.vcf
out_vcf=filter1.lhm_rg_HC.vcf
touch ${out_vcf}
refseq=local_reference/dm6.fa



## ____ make vcf for SNPs only ____
GenomeAnalysisTK -R ${refseq} \
-T SelectVariants \
-V ${vcf_dir}${in_vcf} \
-selectType SNP \
    -o snp.${in_vcf}

## ____ Mark bad qual SNPs ____
GenomeAnalysisTK -R ${refseq} \
-T VariantFiltration \
-V snp.${in_vcf} \
--filterExpression "QD<2.0||FS>60.000||MQ<40.00||MQRankSum<-12.5||ReadPosRankSum<-8.0" \
--filterName "GATKsnp" \
   -o marked.snp.${in_vcf}

## ____ Remove bad quality SNPs ____
GenomeAnalysisTK -R ${refseq} \
-T SelectVariants \
-V marked.snp.${in_vcf} \
--excludeFiltered \
    -o snp.${out_vcf}
rm snp.${in_vcf}*
rm marked.snp.${in_vcf}*



## ____ make vcf indels only ____
GenomeAnalysisTK -R ${refseq} \
-T SelectVariants \
-V ${vcf_dir}${in_vcf} \
-selectType INDEL \
    -o indel.${in_vcf}

## ____ Mark bad qual indels ____
GenomeAnalysisTK -R ${refseq} \
-T VariantFiltration \
-V indel.${in_vcf} \
--filterExpression "QD<2.0||FS>200.000||MQ<40.00||ReadPosRankSum<-20.0" \
--filterName "GATKindel" \
    -o marked.indel.${in_vcf}

## ____ Remove bad quality indels ____
GenomeAnalysisTK -R ${refseq} \
-T SelectVariants \
-V marked.indel.${in_vcf} \
--excludeFiltered \
   -o indel.${out_vcf}
rm indel.${in_vcf}
rm marked.indel.${in_vcf}



## ____ Combine filtered SNPs and indels ____
GenomeAnalysisTK -R ${refseq} \
-T CombineVariants \
-genotypeMergeOptions UNIQUIFY \
-V snp.${out_vcf} \
-V indel.${out_vcf} \
    -o ${out_vcf}
rm snp.${out_vcf}*
rm indel.${out_vcf}*





## ____ GENERATE SUMMARY TABLES ____
mkdir gatk_sums

##  ____ Evaluate variants by sample ____
GenomeAnalysisTK -R ${refseq} \
-T VariantEval \
-eval ${out_vcf} \
--evalModule CountVariants \
--stratificationModule Sample \
-noEV \
    -o gatk_sums/${out_vcf}.vareval.txt



##  ____ make locus metrics table ____
GenomeAnalysisTK -R ${refseq} \
-T VariantsToTable \
-V ${out_vcf} \
-F CHROM -F POS -F ID -F VRT -F QD -F MQRankSum -F ReadPosRankSum \
-F QUAL -F DP -F MQ -F FS -F AC -F AF -F AN \
-F HOM-REF -F HET -F HOM-VAR -F NO-CALL \
--allowMissingData \
    -o gatk_sums/${out_vcf}.qc.tbl
gzip gatk_sums/${out_vcf}.qc.tbl


## ____ make genotype DEPTH table ____
GenomeAnalysisTK -R ${refseq} \
-T VariantsToTable \
-V ${out_vcf} \
-F CHROM -F POS -F ID -GF DP \
--allowMissingData \
    -o gatk_sums/${out_vcf}.depth.tbl
gzip gatk_sums/${out_vcf}.depth.tbl
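
## ____ optional: mean depth per sample ____
## A rough sketch of my own; it assumes the tab-separated layout produced by
## VariantsToTable above (CHROM, POS, ID, then one <sample>.DP column per
## sample) and that missing genotypes appear as NA.
zcat gatk_sums/${out_vcf}.depth.tbl.gz | awk -F'\t' '
    NR == 1 { for (i = 4; i <= NF; i++) name[i] = $i; next }
    { for (i = 4; i <= NF; i++) if ($i != "NA") { sum[i] += $i; n[i]++ } }
    END { for (i = 4; i <= NF; i++) print name[i], (n[i] ? sum[i] / n[i] : "NA") }
' > gatk_sums/mean_depth_by_sample.txt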



##  ____ ANALYSIS WITH VCF-TOOLS ____
module load vcftools/0.1.12b
mkdir vcftools_sums

vcftools --vcf ${out_vcf} --SNPdensity 100000 --remove-indels --out vcftools_sums/${out_vcf}.snpDens_100kb.txt
vcftools --vcf ${out_vcf} --SNPdensity 100000 --keep-only-indels --out vcftools_sums/${out_vcf}.indelDens_100kb.txt
vcftools --vcf ${out_vcf} --window-pi 100000 --window-pi-step 100000 --out vcftools_sums/${out_vcf}.pi_100kb.txt
vcftools --vcf ${out_vcf} --singletons --out vcftools_sums/${out_vcf}.sing.txt
vcftools --vcf ${out_vcf} --indv-freq-burden --out vcftools_sums/${out_vcf}.ifb.txt
vcftools --vcf ${out_vcf} --het --out vcftools_sums/${out_vcf}.heterozygosity.txt
vcftools --vcf ${out_vcf} --relatedness --out vcftools_sums/${out_vcf}.relate.txt
module unload vcftools/0.1.12b
module unload jre/1.7.0_25
exit
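
As a quick sanity check on the filtering, I compare record counts before and after (header lines excluded); the difference is the number of hard-filtered sites. For example:

    grep -vc '^#' ../vcf_prefilter/unfilt.lhm_rg_HC_raw.vcf   # records before filtering
    grep -vc '^#' filter1.lhm_rg_HC.vcf                       # records after filtering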

How to convert GVF to VCF


I want to convert a GVF file to a VCF file, but I have run into problems with the conversion.
I downloaded the structural variation data from NCBI dbVar, organized by assembly
(http://www.ncbi.nlm.nih.gov/dbvar/content/ftp_manifest/),
and found that dbVar only provides these data in GVF format. However, IndelRealigner does not accept a GVF file as input.

I have tried several methods to convert GVF to VCF.
I found that neither GATK's VariantsToVCF nor VCFtools supports GVF as an input format.
So I tried the script 'gvf2vcf' from the Ensembl Perl API (https://github.com/hxin/DisEnt/blob/master/disnet/common/lib/ensembl-api/ensembl-variation/scripts/misc/gvf2vcf.pl), but that script always connects to the Ensembl database during the conversion, and I want to do the analysis locally.

Can anyone help me? Any help is much appreciated!


DiagnoseTargets: missing genotype field in output vcf


Hi,

I am running the following command:
${java} -jar ${GATK} \
-T DiagnoseTargets \
-R human_g1k_v37.fasta \
-L truseq_interval.bed \
-ip 100 \
-I cleaned.bam \
-o interval_stats.vcf \
-writeFullFormat

It works great except that the output .vcf does not contain the GT field, which causes trouble later when annotating the VCF file.

Any workaround maybe?

Thanks !

Why are there * in my VCF file? What do they mean?


CHROM POS ID REF ALT

chr5 156479568 . C CGTT,*

What does the * mean?

Errors about input files having missing or incompatible contigs


These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.

The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference.


BAM file contigs not matching the reference

A very common case we see looks like this:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.
##### ERROR   reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:

Input files reads and reference have incompatible contigs

It further tells us that the contig length doesn't match for the chrM contig:

Found contigs with the same name but different lengths:
##### ERROR   contig reads = chrM / 16569
##### ERROR   contig reference = chrM / 16571.

This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.

We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.

Note that the error message also lists the contents of the sequence dictionaries it found for each file. Here, some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it were the other way around, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from those extra contigs and would error out (even if you try restricting the analysis using -L) with something like this:

#### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
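
If you want to inspect the dictionaries yourself before running anything, here is a quick sketch (it assumes samtools is available; file names are placeholders):

    # Contigs as recorded in the BAM header:
    samtools view -H my.bam | grep '^@SQ'

    # Contigs as recorded in the reference's sequence dictionary:
    grep '^@SQ' reference.dict

    # For a VCF, the ##contig header lines play the same role:
    grep '^##contig' my.vcf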

Solution

If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.

If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
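
For the record, here is a minimal sketch of that reversion using Picard's RevertSam and SamToFastq (file names are placeholders; check the Picard documentation for the options appropriate to your data):

    # Strip alignment information, producing an unaligned BAM (uBAM):
    java -jar picard.jar RevertSam \
        I=aligned.bam \
        O=unaligned.bam \
        SANITIZE=true

    # Or extract FASTQ files if your workflow starts from those:
    java -jar picard.jar SamToFastq \
        I=aligned.bam \
        FASTQ=reads_1.fastq \
        SECOND_END_FASTQ=reads_2.fastq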

Special case of b37 vs. hg19

The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?

Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig is a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus etc) don't have direct equivalents.

So only try that if you know what you're doing. YMMV.
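
If you do go ahead, the rename itself would look something like this sketch (canonical contigs only; the mitochondrial contig is deliberately left untouched because the b37 and hg19 sequences differ):

    awk 'BEGIN { OFS = "\t" }
         /^##contig=<ID=([0-9]+|X|Y),/ { sub(/ID=/, "ID=chr") }
         !/^#/ && $1 ~ /^([0-9]+|X|Y)$/ { $1 = "chr" $1 }
         { print }' calls.b37.vcf > calls.renamed.vcf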


VCF file contigs not matching the reference

ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chrM / 16569
ERROR contig reference = chrM / 16571.

Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:

Input files known and reference have incompatible contigs

We know (trust me) that this is the output of a RealignerTargetCreator command, so the known file must be the VCF file provided through the known argument. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.

Solution

If you can, simply find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbSNP, you're in luck -- we provide versions of dbSNP and similar resources derived from the major human reference builds in our resource bundle (see FAQs for access details).

location: ftp.broadinstitute.org
username: gsapubftp-anonymous

If that's not an option, then you'll have to perform a "liftover" -- specifically, lift over the mismatching VCF to the reference you need to work with. The best tool for this is Picard's LiftoverVcf.
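
A minimal invocation sketch (file names are placeholders):

    java -jar picard.jar LiftoverVcf \
        I=calls.b37.vcf \
        O=calls.hg19.vcf \
        CHAIN=b37tohg19.chain \
        REJECT=rejected_records.vcf \
        R=hg19.fasta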

GATK used to include some liftover utilities (documented below for the record) but we no longer support them.

Liftover procedure with older versions of GATK

This procedure involves three steps:

  1. Run GATK LiftoverVariants on your VCF file
  2. Run a script to sort the lifted-over file
  3. Filter out records whose REF field does not match the new reference

We provide a script that performs those three steps for you, called liftOverVCF.pl, which is available in our public source repository under the 'perl' directory. Instructions for pulling down our source are available here.

The example below shows how you would run the script:

./liftOverVCF.pl \
    -vcf calls.b36.vcf \                    # input vcf
    -chain b36ToHg19.broad.over.chain \ # chain file
    -out calls.hg19.vcf \                   # output vcf
    -gatk gatk_source \                     # path to source code
    -newRef Homo_sapiens_assembly19 \    # path to new reference base name (without extension)
    -oldRef human_b36_both \            # path to old reference prefix (without extension)
    -tmp /broad/shptmp                  # temp file location (defaults to /tmp)

We provide several chain files to liftover between the major human reference builds, also in our resource bundle (mentioned above) in the Liftover_Chain_Files directory. If you are working with non-human organisms, we can't help you -- but others may have chain files, so ask around in your field.

Note that if you're at the Broad, you can access chain files to liftover from b36/hg18 to hg19 on the humgen server.

/humgen/gsa-hpprojects/GATK/data/Liftover_Chain_Files/

Using dbSNPs for mapping


Hi. I want to take a dbSNP file and highlight SNPs that are found in some BAM files. Instead of filtering these out, I want to search for these SNPs only. I will then use the SNP positions as markers to trace recombination events that have occurred between some strains I am studying. So far I haven't found a good strategy to accomplish this. I was hoping someone here knows how to parse dbSNP and could guide me through it. Alternatively, you could use this idea to generate a new tool! Please let me know if you have thoughts or questions. Thanks! -Keller

Problem running CatVariants


Hi,

I tried to concatenate the individual chromosome vcf files for chicken: vcf_chr_1.vcf etc. I made a reference file from the individual chromosome files: gga_ref_Gallus_gallus-4.0_chr1.mfa etc.

My command is:
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
-R path1/gga_ref_Gallus_gallus-4.0.fa \
-V path2/vcf_chr_1.vcf \
-V path2/vcf_chr_2.vcf \
-out path2/Gallus_gallus-145.vcf \
-assumeSorted

I get the following error output:

ERROR stack trace

java.lang.NullPointerException
at htsjdk.variant.variantcontext.VariantContextComparator.getSequenceNameList(VariantContextComparator.java:25)
at htsjdk.variant.variantcontext.VariantContextComparator.<init>(VariantContextComparator.java:76)
at org.broadinstitute.gatk.tools.CatVariants$PositionComparator.<init>(CatVariants.java:327)
at org.broadinstitute.gatk.tools.CatVariants.execute(CatVariants.java:233)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.tools.CatVariants.main(CatVariants.java:312)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):

Is there an obvious reason for this?

I checked the .fa file and there are no empty lines between the chromosome sequences.
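
In case it is relevant, these are the indexing steps I would expect the concatenated reference to need before GATK can use it (standard samtools and Picard commands; the picard.jar path is a placeholder):

    samtools faidx path1/gga_ref_Gallus_gallus-4.0.fa
    java -jar picard.jar CreateSequenceDictionary \
        R=path1/gga_ref_Gallus_gallus-4.0.fa \
        O=path1/gga_ref_Gallus_gallus-4.0.dict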

Rick
