Hi all,
I'm working on a resequencing dataset that contains 60 individuals from 5 different populations. I called variants with HaplotypeCaller and, after applying hard filters, ended up with all individuals in one big dataset. Now I would like to know: since I have 5 different populations in one file, can I use GATK to apply a minor allele frequency filter to one specific population?
Minor Allele Frequency filter in GATK
How to now if a Combined vcf file is calibrated with GATK or not
Dear GATK using community
Pardon my ignorance, but I am new to NGS pipelines. I have received a combined VCF file of 200 samples. The only information I could glean from the VCF, after opening it on the command line and in the usegalaxy online server, is that the variants were called with GATK's HaplotypeCaller. How can I tell whether the variants were only called with HaplotypeCaller, or whether they were also validated or recalibrated afterwards?
SelectVariants error
I'm calling SNPs with HaplotypeCaller, and I got a VCF file for each sample. After that, I used SelectVariants to select the SNP variants, but I got this error:
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Line 33910: there aren't enough columns for line (we expected 9 tokens, and saw 1 ), for input source: /home/gaotiangang/niuguohao/1204recall/50/BQSRc/seq164.raw.c.vcf
Line 33910 is:
DF384195.1 374 . G A 243.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.100;ClippingRankSum=0.000;DP=95;ExcessHet=3.0103;FS=6.236;MLEAC=1;MLEAF=0.500;MQ=40.68;MQRankSum=-2.037;QD=7.62;ReadPosRankSum=1.274;SOR=2.712 GT:AD:DP:GQ:PL 0/1:18,14:32:99:272,0,411
I don't think there is anything wrong with this line, as the lines before it look the same.
Only some of the samples' VCF files produce this error, and I can't find any difference between them.
Could anyone help me?
Many thanks!
GenotypeGVCFs: Long runtime exclusively with a single sample
I have been having some trouble with long runtime with several of GATK utilities.
However it was manageable.
I could arrive at a g.vcf file (I used HaplotypeCaller instead of UnifiedGenotyper upon a suggestion made on a separate thread).
Now I have two different g.vcf files for two different samples, and for one of them I could get a VCF file using GenotypeGVCFs within 45 minutes or so.
However, with the other sample I am getting **a 40-week estimated runtime.**
The samples are that of Aedes aegypti and Aedes albopictus (this is the one giving trouble).
The walker starts walking instantly with the Aedes aegypti sample and gives me the VCF without any errors. However, with the Aedes albopictus sample, the walker itself only initiates after an hour or so.
The command used is:
java -jar GenomeAnalysisTK-3.7-0-gcfedb6 -T GenotypeGVCFs -nt 12 -R ref-ab/GCA_001444175.2_A.albopictus_v1.1_genomic.fasta --variant output-AB.raw.snps.indels.g.vcf -o genotyped-ab.vcf
It should be noted that this exact command worked for the other sample (with the necessary input files changed, of course).
The log is as follows:
INFO 19:56:34,300 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:56:34,301 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 22:49:04,685 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.4 w 37.4 w
INFO 22:50:04,687 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.7 w 37.6 w
INFO 22:51:04,689 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.9 w 37.9 w
INFO 22:52:04,690 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 38.1 w 38.1 w
INFO 22:53:04,694 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 38.3 w 38.3 w
(the estimated remaining runtime is increasing instead of decreasing)
IMPORTANT NOTES:
1)The genome sizes are:
1.9 G for A.albopictus and 1.4 G for A.aegypti
2) It can't be a resource problem:
I have around 48 usable threads at the moment and enough RAM.
I have tried using different numbers of threads as well; it's not making any difference.
3) I have tried re-running the A. aegypti sample in parallel (to rule out the possibility that the earlier run was only fast due to uncertain variables at that point in time), and it reproduces its behaviour, i.e. it finishes in 45 minutes or so. But the A. albopictus sample still shows the same problem.
VCF - Variant Call Format
This document describes "regular" VCF files produced for GERMLINE short variant (SNP and indel) calls (e.g. by HaplotypeCaller in "normal" mode and by GenotypeGVCFs). For information on the special kind of VCF called GVCF, produced by HaplotypeCaller in -ERC GVCF mode, please see the GVCF entry. For information specific to SOMATIC calls, see the Mutect2 documentation.
Contents
- Overview
- Structure of a VCF file
- Interpreting the header information
- Structure of variant call records
- Interpreting genotype and other sample-level information
- Basic operations: validating, subsetting and exporting from a VCF
- Merging VCF files
1. Overview
VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and further development have been taken over by the Genomic Data Toolkit team of the Global Alliance for Genomics and Health. The full format spec can be found in the samtools/hts-specs repository along with other useful specifications like SAM/BAM/CRAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.
VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.
That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.
Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:
Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.
NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned
Don't write home-brewed VCF parsing scripts. It never ends well.
2. Structure of a VCF file
A valid VCF file is composed of two main parts: the header, and the variant call records.
The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.
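As a rough sketch of that two-part structure, the split between header lines and records can be illustrated in a few lines of Python (using a tiny inline example rather than a real file; this is for illustration only, not a real parser):

```python
# A minimal example VCF, built inline so the sketch is self-contained.
# Real files are read the same way, line by line.
VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.2",
    '##FILTER=<ID=LowQual,Description="Low quality">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878",
    "20\t10001019\t.\tT\tG\t364.77\t.\tDP=34\tGT:DP\t0/1:33",
])

meta_lines = []     # "##" lines: spec version, annotation definitions, command lines
column_line = None  # the single "#CHROM ..." line naming the columns
records = []        # the variant call records themselves

for line in VCF_TEXT.splitlines():
    if line.startswith("##"):
        meta_lines.append(line)
    elif line.startswith("#CHROM"):
        column_line = line
    else:
        records.append(line)

print(len(meta_lines), len(records))  # → 2 1
```

The same three-way split (metadata lines, column header, tab-delimited records) holds for every valid VCF, however large.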
The actual data lines will look something like this:
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
20 10001019 . T G 364.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.699;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=3.064;MLEAC=1;MLEAF=0.500;MQ=42.48;MQRankSum=-3.219e+00;QD=11.05;ReadPosRankSum=-6.450e-01;SOR=0.537 GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480
20 10001298 . T A 884.77 . AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.49;SOR=1.765 GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0
20 10001436 . A AAGGCT 1222.73 . AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.836 GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0
20 10001474 . C T 843.77 . AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=1.302 GT:AD:DP:GQ:PL 1/1:0,27:27:81:872,81,0
20 10001617 . C A 493.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.63;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=1.323;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=12.99;ReadPosRankSum=0.170;SOR=1.179 GT:AD:DP:GQ:PL 0/1:19,19:38:99:522,0,480
After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs and indels, but other variation types could be described (see the VCF specification for details). Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.
You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.
3. Interpreting the header information
The following is a valid VCF header produced by GenotypeGVCFs on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself.
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Fri Jan 20 11:14:15 EST 2017",Epoch=1484928855435,CommandLineOptions="[command-line goes here]">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="[command-line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 5:45:56 PM EST">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=20,length=63025520>
##reference=file:///data/ref/ref.fasta
##source=GenotypeGVCFs
That's a lot of lines, so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.
VCF spec version
The first line:
##fileformat=VCFv4.2
tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.
FILTER lines
The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:
##FILTER=<ID=LowQual,Description="Low quality">
Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).
FORMAT and INFO lines
These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation (at least if you're using a civilized program that writes definition lines to the header).
GATKCommandLine
The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, along with the values that were applied (if you don't pass one, a default is applied), so it's not just the arguments specified explicitly by the user on the command line.
Contig lines and Reference
These contain the contig names, lengths, and which reference assembly was used with the input BAM file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for many organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!
For more information on genome references, see the corresponding Dictionary entry.
4. Structure of variant call records
For each site record, the information is structured into columns (also called fields) as follows:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]
The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.
Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!
Site-level properties and annotations
These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie ., to serve as a placeholder).
CHROM and POS
The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.
ID
An optional identifier for the variant, assigned based on the contig and position of the call and on whether a record exists at this site in a reference database such as dbSNP. A typical identifier is the dbSNP ID, which in human data would look like rs28548431, for example.
REF and ALT
The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). The REF and ALT alleles are the only required elements of a VCF record that tell us whether the variant is a SNP or an indel (or in complex cases, a mixed-type variant). If we look at the following three sites, we see the first is a SNP, the second is an insertion, and the third is a deletion:
20 10001298 . T A 884.77 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0
20 10001436 . A AAGGCT 1222.73 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0
20 10004769 . TAAAACTATGC T 622.73 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,17:35:99:660,0,704
Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.
QUAL
The Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. Because the Phred scale is -10 * log10(1-p), a QUAL value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance (see the Dictionary entry). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.
Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
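To make the scale concrete, the QUAL-to-error-probability conversion is plain arithmetic (this is just an illustration of the Phred formula, not GATK code):

```python
# Phred scale: QUAL = -10 * log10(P(error)), so P(error) = 10 ** (-QUAL / 10).
def error_probability(qual):
    return 10 ** (-qual / 10)

print(error_probability(10))   # → 0.1    (1 in 10 chance the call is wrong)
print(error_probability(100))  # → 1e-10  (1 in 10^10 chance)
```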
FILTER
This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.
INFO
Various site-level annotations. This field is not required to be present in the VCF.
The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semicolons, ie ;, as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.
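Just to illustrate the tag-value layout (and emphatically not as a substitute for a proper VCF library; see the warnings about home-brewed parsers elsewhere in this document), splitting an INFO string looks like this:

```python
# Illustrative only: INFO pairs are separated by semicolons, tag and value
# by an equal sign; flag-type annotations (e.g. DS) carry no value at all.
def split_info(info):
    annotations = {}
    for pair in info.split(";"):
        tag, _, value = pair.partition("=")
        annotations[tag] = value if value else True  # flags become True
    return annotations

info = split_info("MQ=99.00;MQ0=0;QD=17.94")
print(info["QD"])  # → 17.94
```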
Sample-level annotations
At this point you've met all the fields up to INFO in this lineup:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 [other samples...]
All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.
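Since the FORMAT column and each sample column are parallel lists in matching order, pairing them up is a simple zip. A sketch in Python, using the NA12878 record from the examples above (illustrative only):

```python
# FORMAT holds the keys; each sample column holds the values, in the same order.
fmt = "GT:AD:DP:GQ:PL"
na12878 = "0/1:18,15:33:99:393,0,480"

genotype_fields = dict(zip(fmt.split(":"), na12878.split(":")))
print(genotype_fields["GT"], genotype_fields["DP"])  # → 0/1 33
```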
5. Interpreting genotype and other sample-level information
The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it's actually not that hard to interpret once you understand that it's just sets of tags and values.
Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:
20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480
20 10001298 . T A 884.77 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,30:30:89:913,89,0
20 10001436 . A AAGGCT 1222.73 . [CLIPPED] GT:AD:DP:GQ:PL 1/1:0,28:28:84:1260,84,0
Looking at that last column, here is what the tags mean:
GT
The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:
- 0/0 : the sample is homozygous reference
- 0/1 : the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
- 1/1 : the sample is homozygous alternate
In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, A/A and AAGGCT/AAGGCT respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT (e.g. 1); for polyploids there will be more, e.g. 4 values for a tetraploid organism (e.g. 0/0/1/1).
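To make the index-to-allele mapping concrete, here is a small illustrative sketch (it ignores no-calls, i.e. . values, for simplicity) that decodes a GT value against the REF and ALT columns, using the records shown above:

```python
# Allele index 0 is REF, 1 is the first ALT, 2 the second, and so on.
# Phased separators ("|") are treated the same as unphased ("/") here.
def decode_gt(gt, ref, alts):
    alleles = [ref] + alts
    return [alleles[int(i)] for i in gt.replace("|", "/").split("/")]

print(decode_gt("0/1", "T", ["G"]))             # → ['T', 'G']
print(decode_gt("1/1", "A", ["AAGGCT"]))        # → ['AAGGCT', 'AAGGCT']
print(decode_gt("1/2", "A", ["G", "AGGGAGG"]))  # → ['G', 'AGGGAGG']
```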
AD and DP
Allele depth (AD) and depth of coverage (DP). These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.
AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.
DP is the filtered depth, at the sample level. This gives you the number of reads at the site that passed the variant caller's internal quality filters (you can check the variant caller's documentation to see which filters are applied by default). However, unlike the AD calculation, uninformative reads are included in DP.
See the Tool Documentation entries on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.
PL
"Normalized" Phred-scaled likelihoods of the possible genotypes. For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 on the Phred scale. We use "normalized" in quotes because these are not probabilities: we set the PL of the most likely genotype to 0 for ease of reading, and the other values are scaled relative to this most likely genotype.
Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.
GQ
The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.
Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.
Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
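The rule described above (GQ equals the second smallest PL, capped at 99) can be sketched in a couple of lines; the inputs below are PL triplets from the example records in this document:

```python
# GQ = second smallest PL (the most likely genotype always has PL 0),
# capped at 99 because larger values are no more informative.
def gq_from_pls(pls):
    second_best = sorted(pls)[1]
    return min(second_best, 99)

print(gq_from_pls([393, 0, 480]))  # → 99 (393 is capped at 99)
print(gq_from_pls([73, 0, 20]))    # → 20 (low confidence in the genotype)
```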
A few examples
With all the definitions out of the way, let's interpret the genotype information for a few records from our NA12878 callset, starting with at position 10001019 on chromosome 20:
20 10001019 . T G 364.77 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480
At this site, the called genotype is GT = 0/1, which corresponds to a heterozygous genotype with alleles T/G. The confidence indicated by GQ = 99 is very good; there were a total of 33 informative reads at this site (DP=33), 18 of which supported the REF allele (i.e. had the reference base) and 15 of which supported the ALT allele (i.e. had the alternate base), as indicated by AD=18,15. The degree of certainty in our genotype is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0), as is always the case for the assigned genotype; the next PL is PL(0/0) = 393, corresponding to 10^(-39.3), or 5.0118723e-40, which is a very small number indeed; and the next one will be even smaller. The GQ ends up being 99 because of the capping explained above.
Now let's look at a site where our confidence is quite a bit lower:
20 10024300 . C CTT 43.52 . [CLIPPED] GT:AD:DP:GQ:PL 0/1:1,4:6:20:73,0,20
Here we have an indel -- specifically an insertion of TT after the reference C base at position 10024300. The called genotype is GT = 0/1 again, but this time the GQ = 20 indicates that even though this is probably a real variant (the QUAL is not too bad), we're not sure we have the right genotype. Looking at the coverage annotations, we see we only had 6 reads there, of which 1 supported REF and 4 supported ALT (and one read must have been considered uninformative, possibly due to quality issues). With so little coverage, we can't be sure that the genotype shouldn't in fact be homozygous variant.
Finally, let's look at a more complicated example:
20 10009875 . A G,AGGGAGG 1128.77 . [CLIPPED] GT:AD:DP:GQ:PL 1/2:0,11,5:16:99:1157,230,161,487,0,434
This site is a doozy: two credible ALT alleles were observed, but the REF allele was not -- so technically this is a biallelic site in our sample, but it will be considered multiallelic because there are more than two alleles notated in the record. It's also a mixed-type record, since one of the ALTs by itself would make it an A->G SNP, and the other would make it an insertion of GGGAGG after the reference A. The called genotype is GT = 1/2, which means it's a heterozygous genotype composed of two different ALT alleles. The coverage wasn't great, and wasn't all that balanced between the two ALTs (one was supported by 11 reads and the other by 5), but it was sufficient for the program to have high confidence in its call.
6. Basic operations: validating, subsetting and exporting from a VCF
These are a few common things you may want to do with your VCFs that don't deserve their own tutorial. Let us know if there are other operations you think we should cover here.
Validate your VCF
By that I mean checking that the format of the file is correct, follows the specification, and will therefore not break any well-behaved tool you choose to run on it. You can do this very simply with ValidateVariants. Note that ValidateVariants can also be used on GVCFs if you use the --gvcf argument.
Subset records from your VCF
Sometimes you want to subset just one or a few samples from a big cohort. Sometimes you want to subset to just a genomic region. Sometimes you want to do both at the same time! Well, the same tool can do both, and more; it's called SelectVariants and has a lot of options for doing this kind of thing (including operating over intervals in the usual way). There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency; this is done using JEXL expressions. Other VCF files can also be used to modify the selection based on concordance or discordance between different callsets (see the --discordance / --concordance arguments in the Tool Doc).
Important notes about subsetting operations
In the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset.
By default, SelectVariants will keep all ALT alleles, even if they are no longer supported by any samples after subsetting. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. In some cases this will produce monomorphic records, i.e. where no ALT alleles are supported. The tool accepts flags that exclude unsupported alleles and/or monomorphic records from the output.
Extract information from a VCF in a sane, (mostly) straightforward way
Use VariantsToTable.
No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.
Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal according to the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.
(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)
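A minimal VariantsToTable sketch, with placeholder file names and an assumed set of fields to extract:

```shell
# Site-level fields are requested with -F, per-sample (genotype) fields with -GF.
java -jar GenomeAnalysisTK.jar \
    -T VariantsToTable \
    -R reference.fasta \
    -V input.vcf \
    -F CHROM -F POS -F REF -F ALT -F QUAL -F AF \
    -GF GT \
    -o output.table
```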
Merging VCF files
There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.
The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or GVCF files) per-chromosome. For that case, you can use the Picard tool MergeVcfs to merge the files. See the relevant Tool Doc page for usage details.
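A minimal sketch with placeholder file names:

```shell
# Gather per-chromosome VCFs back into one file.
java -jar picard.jar MergeVcfs \
    I=calls.chr1.vcf \
    I=calls.chr2.vcf \
    I=calls.chr3.vcf \
    O=calls.merged.vcf
```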
The second case is when you have been running HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION mode to call variants on a large cohort, producing many GVCF files. You then need to consolidate them before joint-calling variants with GenotypeGVCFs (for performance reasons). This can be done with either the CombineGVCFs or GenomicsDBImport tool, both of which are specifically designed to handle GVCFs in this way. See the relevant Tool Doc pages for usage details and the Best Practices workflow documentation to learn more about the logic of this workflow.
The third case is when you want to compare variant calls that were produced from the same samples but using different methods: for example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. For this case, we recommend a different approach; rather than merging the VCF files (which can have all sorts of complicated consequences), you can use the VariantAnnotator tool to annotate one of the VCFs with the other treated as a resource. See the relevant Tool Doc page for usage details.
There is actually one more reason why you might want to combine variants from different files into one, but we do not recommend doing it: you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.
Errors about input files having missing or incompatible contigs
These errors occur when the names or sizes of contigs don't match between input files. This is a classic problem that typically happens when you get some files from collaborators, you try to use them with your own data, and GATK fails with a big fat error saying that the contigs don't match.
The first thing you need to do is find out which files are mismatched, because that will affect how you can fix the problem. This information is included in the error message, as shown in the examples below. You'll notice that GATK always evaluates everything relative to the reference. For more information about that, see the Dictionary entry on reference genomes.
BAM file contigs not matching the reference
A very common case we see looks like this:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
##### ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
First, the error tells us that the mismatch is between the file containing reads, i.e. our BAM file, and the reference:
Input files reads and reference have incompatible contigs
It further tells us that the contig length doesn't match for the chrM contig:
Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
This can be caused either by using the wrong genome build version entirely, or using a reference that was hacked from a build that's very close but not identical, like b37 vs hg19, as detailed a bit more below.
We sometimes also see cases where people are using a very different reference; this is especially the case for non-model organisms where there is not yet a widely-accepted standard genome reference build.
Note that the error message also lists the content of the sequence dictionaries that it found for each file, and we see that some contigs in our reference dictionary are not listed in the BAM dictionary, but that's not a problem. If it were the opposite, with extra contigs in the BAM (or VCF), then GATK wouldn't know what to do with the reads from those extra contigs and would error out (even if we try restricting the analysis using -L) with something like this:
#### ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with.
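If you want to diagnose a mismatch yourself before rerunning anything, one way is to reduce both sequence dictionaries to name/length pairs and diff them. The sketch below mocks up the @SQ lines; in real life you would take them from your reference's .dict file and from the BAM header (e.g. via samtools view -H your.bam):

```shell
# Mocked-up headers standing in for the reference dictionary and BAM header.
{
  printf '@HD\tVN:1.5\n'
  printf '@SQ\tSN:chrM\tLN:16571\n'
  printf '@SQ\tSN:chr1\tLN:249250621\n'
} > ref_header.txt
{
  printf '@HD\tVN:1.4\tSO:coordinate\n'
  printf '@SQ\tSN:chrM\tLN:16569\n'
  printf '@SQ\tSN:chr1\tLN:249250621\n'
} > bam_header.txt
# Reduce each header to "name length" pairs, then diff; any output is a mismatch.
awk -F'\t' '$1=="@SQ"{sub(/^SN:/,"",$2); sub(/^LN:/,"",$3); print $2, $3}' ref_header.txt > ref.contigs
awk -F'\t' '$1=="@SQ"{sub(/^SN:/,"",$2); sub(/^LN:/,"",$3); print $2, $3}' bam_header.txt > bam.contigs
diff ref.contigs bam.contigs || true
```

Here the diff flags the chrM length discrepancy, matching the error above.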
Solution
If you can, simply switch to the correct reference. Note that file names may be misleading, as people will sometimes rename files willy-nilly. Sometimes you'll need to do some detective work to identify the correct reference if you inherited someone else's sequence data.
If that's not an option because you either can't find the correct reference or you absolutely MUST use a particular reference build, then you will need to redo the alignment altogether. Sadly there is no liftover procedure for reads. If you don't have access to the original unaligned sequence files, you can use Picard tools to revert your BAM file back to an unaligned state (either unaligned BAM or FASTQ depending on the workflow you wish to follow).
Special case of b37 vs. hg19
The b37 and hg19 human genome builds are very similar, and the canonical chromosomes (1 through 22, X and Y) only differ by their names (no prefix vs. chr prefix, respectively). If you only care about those, and don't give a flying fig about the decoys or the mitochondrial genome, you could just rename the contigs throughout your mismatching file and call it done, right?
Well... This can work if you do it carefully and cleanly -- but many things can go wrong during the editing process that can screw up your files even more, and it only applies to the canonical chromosomes. The mitochondrial contig is a slightly different length (see error above) in addition to having a different naming convention, and all the other contigs (decoys, herpes virus etc) don't have direct equivalents.
So only try that if you know what you're doing. YMMV.
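For what it's worth, the renaming step itself is a one-liner; the sketch below, on a mocked-up b37-style VCF, adds the chr prefix to canonical contig names only, which is exactly why it doesn't help with chrM or the decoys:

```shell
# Build a tiny mock b37-style VCF standing in for your real file.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=1,length=249250621>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tT\t50\tPASS\t.\n'
  printf 'X\t200\t.\tG\tC\t50\tPASS\t.\n'
} > b37.vcf
# Prefix canonical contig names (1-22, X, Y) on data lines and ##contig lines only.
sed -E -e 's/^([0-9]+|X|Y)([[:space:]])/chr\1\2/' \
       -e 's/^##contig=<ID=([0-9]+|X|Y),/##contig=<ID=chr\1,/' \
       b37.vcf > hg19ish.vcf
cat hg19ish.vcf
```

Note that after any such edit you'd still need to regenerate the VCF index, and nothing here touches chrM or the non-canonical contigs.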
VCF file contigs not matching the reference
ERROR MESSAGE: Input files known and reference have incompatible contigs: Found contigs with the same name but different lengths:
ERROR contig known = chrM / 16569
ERROR contig reference = chrM / 16571.
Yep, it's just like the error we had with the BAM file above. Looks like we're using the wrong genome build again and a contig length doesn't match. But this time the error tells us that the mismatch is between the file identified as known and the reference:
Input files known and reference have incompatible contigs
We know (trust me) that this is the output of a RealignerTargetCreator command, so the known file must be the VCF file provided through the known argument. Depending on the tool, the way the file is identified may vary, but the logic should be fairly obvious.
Solution
If you can, find a version of the VCF file that is derived from the right reference. If you're working with human data and the VCF in question is just a common resource like dbSNP, you're in luck -- we make sets of suitable resources available for the supported reference builds. If you're working on your own installation of GATK, you can get these from the Resource Bundle. If you're using GATK on FireCloud, our cloud-based analysis platform, the featured GATK workspaces are preloaded with the appropriate resources.
If that's not an option, then you'll have to "liftover" -- specifically, liftover the mismatching VCF to the reference you need to work with. The best tool for liftover is Picard's LiftoverVcf. We provide several chain files for lifting over between the major human reference builds, also in our resource bundle, in the Liftover_Chain_Files directory. If you're working with non-human organisms we can't help you, but others in your field may have chain files, so ask around. (GATK used to include some liftover utilities, but we no longer support them.)
What input files does the GATK accept / require?
Analyses done with the GATK typically involve several (though not necessarily all) of the following inputs:
- Reference genome sequence in FASTA format
- Unmapped sequencing data in uBAM format (alternative to FASTQ)
- Mapped sequencing data in SAM, BAM or CRAM format
- List of intervals
- Variant calls in VCF format or GVCF format (can be gzipped)
- Supplementary resources (e.g. known variants) as documented by the relevant tools
Errors about contigs in BAM or VCF files not being properly ordered or sorted
This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. GATK can be particular about the contig ordering in BAM and VCF files, so it will fail with an error in this case.
So what do you do?
For BAM files
You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:
java -jar picard.jar ReorderSam \
I=original.bam \
O=reordered.bam \
R=reference.fasta \
CREATE_INDEX=TRUE
Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).
Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!
For VCF files
You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:
java -jar picard.jar SortVcf \
I=original.vcf \
O=sorted.vcf \
SEQUENCE_DICTIONARY=reference.dict
Where reference.dict is the sequence dictionary of your genome reference.
Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.
Can HaplotypeCaller be used on drug treated samples?
Hello, I am working on RNA-Seq data consisting of liver samples from donors. It is a case-control study where 12 samples are divided into Normal (control) and Rifampin-treated (case). I want to create a sample-specific VCF file. I was going through the documentation and got a bit confused between HaplotypeCaller and Mutect2. Which one should I use to get my VCF file?
In addition, is there a decent way to add gene name, symbol and other annotations to the INFO field of the VCF file?
Any help is much appreciated.
Regards,
Anurag
Convert CombineVariant output to individual vcf files
HI,
I was curious if it is possible to convert the output of CombineVariants back into individual per-sample VCF files. Based on answered questions in other threads, I am under the impression that you should not try to use VariantFiltration on a combined output if variants were not called on a previously merged file. Is that correct? If so, could I basically reverse CombineVariants by using SelectVariants to analyze samples individually?
Also, I would like to point out that all the links to the "Tools documentation index" are broken and displaying 404. I am not sure if you are aware.
Thank you
what reads from original BAM are listed in mutect2 VCF-output?
Dear GATK team,
I am sorry if my question is naive. After running Mutect2 (version 4.2), which positions (and corresponding reads) from the original BAM file are reported in the VCF file? All that passed filters, or only those that have the potential to be a SNP/mutation?
I am analysing several cancer datasets in parallel and want to choose the SNPs most interesting to me. For this I want to know how many patients had WT at a particular position, and how many had a possible mutation at this position. Can I find this information in the VCF file or do I need to refer to the original BAM?
For example, in the VCF file I have a position chr1:111222 which is encountered in 8 patients with some AD values for normal and tumor samples. Does it mean that the original BAM files contain only these same 8 patients at this position, or may there be more patients, some of whom are clearly wild-type and therefore not reported in the VCF file?
BTW, I am using vcf files (processed with mutect2) downloaded from the GDC data repository.
I hope for your help:)
CalculateMixingFractions Returns all NaN?
Hi there, I'm attempting to use the CalculateMixingFractions tool to estimate the mixture of genomes from a VCF in a pooled BAM file and getting all NaN as output.
The output looks like this:
SAMPLE MIXING_FRACTION
GENO1 NaN
GENO2 NaN
GENO3 NaN
etc...
And the header of my VCF file is the following:
##fileformat=VCFv4.1
##filedate=2017.8.8
##source=Minimac3
##FORMAT=
##FORMAT=
##FORMAT=
##INFO=
##INFO=
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GENO1 GENO2 GENO3
chr10 67193 rs1111111:67193:C:T C T . . . GT:AS 0|0:0,0 0|0:0,0 0|0:4,0
And the header of the SAM file is:
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
...etc
SL-HEL:C66HBACXX150512:C66HBACXX:6:1302:18557:85138 147 chr1 10000 0 25M = 10003 -22 ATAACCCTAACCCTAACCCTAACCC ##BA39A<@CBBCCBBBAAAA<==> MD:Z:25 PG:Z:MarkDuplicates NM:i:0 AS:i:25 XS:i:25
etc...
VariantRecalibration step issue
Hi all, I need some advice about VariantRecalibration, as in: am I doing the right thing?
Downloaded 30 BAMs from 1000 Genomes and realigned them to hg38. Then, using HaplotypeCaller, I generated a gVCF for each BAM. Then I used CombineGVCFs to make a single file named cohort.g.vcf.
Took a sample BAM and aligned it to hg38. Then I used HaplotypeCaller as above and generated a gVCF.
Now I am trying to genotype both the gVCFs using
java -Xmx16g -jar algorithms/gatk3/gatk3.8.jar -T GenotypeGVCFs -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --variant data/HG100/HG100.output.raw.snps.indels.g.vcf --variant references/hg38gatkbundle/exomerefs/vcf/cohort.g.vcf -o data/HG100/output.raw.combined.vcf
This is not working. When I ran ValidateVariants from GATK on my sample VCF, I got a message about adjusting --reference_window_size, as in make it larger than 108, 170 etc. I then reran HaplotypeCaller with --set_window_size 1000, but there is no change; I get the same validation message.
ValidateVariants on the cohort.g.vcf reports a broken line.
I am just not sure why these issues arise.
Is there a simple, straightforward pipeline/tutorial published somewhere that helps get over this issue?
Any help will be highly appreciated.
[INSTALLTEST] Discovery.sh --> VCF is missing
Hi everyone
I tried to launch the install test script (discovery.sh). Everything seems to work: no errors in the logs, none on stderr. But at the end, I get this message:
INFO 10:29:26,082 QCommandLine - Script completed successfully with 2 total jobs
grep: test1.discovery.vcf: No such file or directory
Files - and benchmark/test1.discovery.vcf differ
Error: test results do not match benchmark data
Let me know what information I can provide to help resolve my problem.
Thank you.
Elodie
Combine Variants throwing a bug error
@Geraldine_VdAuwera @Sheila @shlee
I am having this issue while merging several single-sample VCFs into a single multi-sample VCF. CombineVariants is throwing a bug error message:
$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -V MA605_phased.vcf -V MA611_phased.vcf -V MA622_phased.vcf -V MA625_phased.vcf -V MA629_phased.vcf -V Ncm8_phased.vcf -V Sp3_phased.vcf -V Sp21_phased.vcf -V Sp76_phased.vcf -V Sp154_phased.vcf -V Sp164_phased.vcf -V SpNor33_phased.vcf -o RBphased_variants.AllSamples.Final.vcf -genotypeMergeOptions UNSORTED
INFO 18:53:40,486 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:53:40,489 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO 18:53:40,489 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 18:53:40,490 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 18:53:40,490 HelpFormatter - [Tue Mar 06 18:53:40 EST 2018] Executing on Linux 4.13.0-36-generic amd64
INFO 18:53:40,490 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 18:53:40,494 HelpFormatter - Program Args: -T CombineVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -V MA605_phased.vcf -V MA611_phased.vcf -V MA622_phased.vcf -V MA625_phased.vcf -V MA629_phased.vcf -V Ncm8_phased.vcf -V Sp3_phased.vcf -V Sp21_phased.vcf -V Sp76_phased.vcf -V Sp154_phased.vcf -V Sp164_phased.vcf -V SpNor33_phased.vcf -o RBphased_variants.AllSamples.Final.vcf -genotypeMergeOptions UNSORTED
INFO 18:53:40,497 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-36-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 18:53:40,498 HelpFormatter - Date/Time: 2018/03/06 18:53:40
INFO 18:53:40,498 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:53:40,498 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 18:53:40,829 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 18:53:40,831 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 18:53:40,832 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:53:41,263 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:53:42,443 GenomeAnalysisEngine - Preparing for traversal
INFO 18:53:42,451 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:53:42,452 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:53:42,452 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:53:42,452 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
ERROR --
ERROR stack trace
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.valueOf(Integer.java:766)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:724)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:132)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.mergeGenotypes(GATKVariantContextUtils.java:1573)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1223)
at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:361)
at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:143)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:323)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: ""
ERROR ------------------------------------------------------------------------------------------
But I can combine several other samples together if some of the samples are removed:
$java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o RBphased.F1_Samples.merged.vcf
INFO 18:55:20,023 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:55:20,026 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO 18:55:20,027 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 18:55:20,028 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 18:55:20,028 HelpFormatter - [Tue Mar 06 18:55:19 EST 2018] Executing on Linux 4.13.0-36-generic amd64
INFO 18:55:20,028 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 18:55:20,033 HelpFormatter - Program Args: -T CombineVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o RBphased.F1_Samples.merged.vcf
INFO 18:55:20,037 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-36-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 18:55:20,037 HelpFormatter - Date/Time: 2018/03/06 18:55:20
INFO 18:55:20,037 HelpFormatter - ----------------------------------------------------------------------------------
INFO 18:55:20,037 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 18:55:20,258 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 18:55:20,258 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 18:55:20,259 GenomeAnalysisEngine - Strictness is SILENT
INFO 18:55:20,653 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 18:55:21,264 GenomeAnalysisEngine - Preparing for traversal
INFO 18:55:21,267 GenomeAnalysisEngine - Done preparing for traversal
INFO 18:55:21,268 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 18:55:21,268 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 18:55:21,268 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 18:55:23,074 ProgressMeter - done 1934.0 1.0 s 15.6 m 88.9% 1.0 s 0.0 s
INFO 18:55:23,077 ProgressMeter - Total runtime 1.81 secs, 0.03 min, 0.00 hours
Done. There were no warn messages.
But the thing is, all these files came out of the same pipeline (the "phaser" tool, https://github.com/secastel/phaser ) and have the same file structure.
Can you suggest what the bug is referring to, and whether there is a solution to it?
I am attaching the associated files in case they are needed.
Update: To circumvent the merging issue, I did the following, but I suspect it is not entirely sound (it may produce problematic records in some places in the merged VCFs).
#merging all the samples that worked together
$java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -V MA605_phased.vcf -V MA611_phased.vcf -V MA622_phased.vcf -V MA625_phased.vcf -V MA629_phased.vcf -V Sp21_phased.vcf -V Sp76_phased.vcf -V Sp154_phased.vcf -V Sp164_phased.vcf -o RBphased.ms01e_02g_03g_04h.MA605_611_622_625_629.Sp21_76_154_164.merged.vcf
# merge using bcftools - for the samples that didn't work
$bcftools merge Ncm8_phased.vcf.gz Sp3_phased.vcf.gz SpNor33_phased.vcf.gz -O v -o RBphased.Ncm8.Sp3_Nor33.vcf
# Now, merge two files
$java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T CombineVariants -R ${ref_genome} -V RBphased.ms01e_02g_03g_04h.MA605_611_622_625_629.Sp21_76_154_164.merged.vcf -V RBphased.Ncm8.Sp3_Nor33.vcf -o RBphased_variants.AllSamples.Final03.vcf
# But, there is a validation error
$ java -jar -Xmx6g /home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T ValidateVariants -R ${ref_genome} -V RBphased_variants.AllSamples.Final03.vcf
INFO 19:45:21,504 HelpFormatter - ----------------------------------------------------------------------------------
INFO 19:45:21,507 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO 19:45:21,508 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 19:45:21,508 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 19:45:21,508 HelpFormatter - [Tue Mar 06 19:45:21 EST 2018] Executing on Linux 4.13.0-36-generic amd64
INFO 19:45:21,509 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 19:45:21,514 HelpFormatter - Program Args: -T ValidateVariants -R /media/everestial007/SeagateBackup4.0TB2/New_Alignment_Set/RefNindex_lyrata/lyrata_genome.fa -V RBphased_variants.AllSamples.Final03.vcf
INFO 19:45:21,517 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.13.0-36-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 19:45:21,518 HelpFormatter - Date/Time: 2018/03/06 19:45:21
INFO 19:45:21,518 HelpFormatter - ----------------------------------------------------------------------------------
INFO 19:45:21,518 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/home/everestial007/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO 19:45:21,700 GenomeAnalysisEngine - Deflater: IntelDeflater
INFO 19:45:21,700 GenomeAnalysisEngine - Inflater: IntelInflater
INFO 19:45:21,701 GenomeAnalysisEngine - Strictness is SILENT
INFO 19:45:22,156 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 19:45:22,559 GenomeAnalysisEngine - Preparing for traversal
INFO 19:45:22,564 GenomeAnalysisEngine - Done preparing for traversal
INFO 19:45:22,569 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 19:45:22,569 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:45:22,577 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: File /media/everestial007/SeagateBackup4.0TB2/RNAseq_Data_Analyses/phaser_to_ASE_on_DiploidGenome/02_outputs_RBphased_VCF/new_merge/RBphased_variants.AllSamples.Final03.vcf fails strict validation: one or more of the ALT allele(s) for the record at position 7:21087512 are not observed at all in the sample genotypes
ERROR ------------------------------------------------------------------------------------------
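The "ALT allele(s) ... are not observed at all in the sample genotypes" failure typically points at records that kept an unused alternate allele after merging. A small sketch that finds such records (plain Python, not a GATK API), so you can decide whether to re-run SelectVariants with -trimAlternates or relax the allele validation:

```python
def unused_alt_records(vcf_lines):
    """Yield (CHROM, POS) for records where some ALT allele never appears
    in any sample genotype -- the condition ValidateVariants complains about."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")
        if alts == ["."]:
            continue  # no ALT allele at all; nothing to check
        observed = set()
        for sample in fields[9:]:
            gt = sample.split(":")[0]
            for allele in gt.replace("|", "/").split("/"):
                if allele not in (".", ""):
                    observed.add(int(allele))
        # allele index i+1 corresponds to the i-th ALT allele
        if any(i + 1 not in observed for i in range(len(alts))):
            yield fields[0], fields[1]
```

Feeding it the merged VCF should reproduce the position (7:21087512) that strict validation flagged.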
Thanks,
From one sample to multiple samples
I wrote a script in WDL that runs 4 tasks to analyze a sample, starting from FASTQ files and ending with a raw VCF file.
The script can handle only one sample at a time, and I wonder if there is any way to make it read a folder of multiple FASTQ files (2 files per sample, R1 and R2) and run the same workflow for each sample.
Thanks for your help
This is the script:
workflow FromFastqToVCF {
String SAMPLENAME
File FASTQ1
File FASTQ2
File REFFASTA
File REFINDEX
File REFDICT
File HG19AMB
File HG19ANN
File HG19BWT
File HG19PAC
File HG19SA
File DBSNP
File DBSNPINDEX
File BEDFILE
call step1 {
input:
FastqR1=FASTQ1,
FastqR2=FASTQ2,
SampleName=SAMPLENAME
}
call step2 {
input:
SampleName=SAMPLENAME,
uBAM=step1.uBAM
}
call step3 {
input:
SampleName=SAMPLENAME,
mBAM=step2.mBAM,
refFasta=REFFASTA,
refIndex=REFINDEX,
refDict=REFDICT,
hg19amb=HG19AMB,
hg19ann=HG19ANN,
hg19bwt=HG19BWT,
hg19pac=HG19PAC,
hg19sa=HG19SA,
uBAM=step1.uBAM
}
call step4 {
input:
SampleName=SAMPLENAME,
refFasta=REFFASTA,
refIndex=REFINDEX,
refDict=REFDICT,
dbSNP=DBSNP,
dbSNPIndex=DBSNPINDEX,
aBAM=step3.aBAM,
aBAMIndex=step3.aBAMIndex,
bedFile=BEDFILE
}
}
task step1 {
File FastqR1
File FastqR2
String SampleName
command {
gatk FastqToSam \
--FASTQ "${FastqR1}" \
--FASTQ2 "${FastqR2}" \
--OUTPUT "/home/projects/cu_10111/data/Test/${SampleName}_fastqtosam.bam" \
--SAMPLE_NAME "${SampleName}"
}
output {
File uBAM = "/home/projects/cu_10111/data/Test/${SampleName}_fastqtosam.bam"
}
}
task step2 {
File uBAM
String SampleName
command {
gatk MarkIlluminaAdapters \
--INPUT "${uBAM}" \
--METRICS "/home/projects/cu_10111/data/Test/${SampleName}_markilluminaadapters_metrics.txt" \
--OUTPUT "/home/projects/cu_10111/data/Test/${SampleName}_markilluminaadapters.bam"
}
output {
File mBAM = "/home/projects/cu_10111/data/Test/${SampleName}_markilluminaadapters.bam"
}
}
task step3 {
File mBAM
String SampleName
File refFasta
File refIndex
File refDict
File hg19amb
File hg19ann
File hg19bwt
File hg19pac
File hg19sa
File uBAM
command {
gatk SamToFastq \
--INPUT "${mBAM}" \
--FASTQ "/dev/stdout" \
--CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 --INTERLEAVE true --INCLUDE_NON_PF_READS true \
--TMP_DIR "/home/projects/cu_10111/data/Test/temp" \
| \
bwa mem -M -t 31 -p "${refFasta}" "/dev/stdin" \
| \
gatk MergeBamAlignment \
--REFERENCE_SEQUENCE "${refFasta}" \
--UNMAPPED_BAM "${uBAM}" \
--ALIGNED_BAM "/dev/stdin" \
--CREATE_INDEX true --ADD_MATE_CIGAR true --CLIP_ADAPTERS false --CLIP_OVERLAPPING_READS true \
--INCLUDE_SECONDARY_ALIGNMENTS true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
--ATTRIBUTES_TO_RETAIN XS \
--OUTPUT "/home/projects/cu_10111/data/Test/${SampleName}_MergeBamAlignment_piped_plumbed.bam" \
--TMP_DIR "/home/projects/cu_10111/data/Test/temp"
}
output {
File aBAM = "/home/projects/cu_10111/data/Test/${SampleName}_MergeBamAlignment_piped_plumbed.bam"
File aBAMIndex = "/home/projects/cu_10111/data/Test/${SampleName}_MergeBamAlignment_piped_plumbed.bai"
}
}
task step4 {
File aBAM
File aBAMIndex
String SampleName
File refFasta
File refIndex
File refDict
File dbSNP
File dbSNPIndex
File bedFile
command {
gatk HaplotypeCaller \
--reference "${refFasta}" \
--input "${aBAM}" \
--output "/home/projects/cu_10111/data/Test/${SampleName}-raw.indels.snps.vcf" \
--dbsnp "${dbSNP}" \
--intervals "${bedFile}"
}
output {
File rawVCF = "/home/projects/cu_10111/data/Test/${SampleName}-raw.indels.snps.vcf"
}
}
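WDL itself supports iterating over an array of sample inputs with a `scatter` block; another common approach is a small driver script that pairs the R1/R2 FASTQs by sample name and generates one inputs file per sample. A sketch of the pairing step (the filename convention `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` is an assumption; adjust the pattern to your files):

```python
import os
import re

def pair_fastqs(folder):
    """Group FASTQ files into {sample: {'R1': path, 'R2': path}},
    assuming names like <sample>_R1.fastq.gz / <sample>_R2.fastq.gz."""
    pairs = {}
    pattern = re.compile(r"^(?P<sample>.+)_(?P<read>R[12])\.fastq(\.gz)?$")
    for name in sorted(os.listdir(folder)):
        m = pattern.match(name)
        if m:
            pairs.setdefault(m.group("sample"), {})[m.group("read")] = \
                os.path.join(folder, name)
    # keep only samples with both mates present
    return {s: p for s, p in pairs.items() if {"R1", "R2"} <= set(p)}
```

Each (sample, R1, R2) triple can then be written out as a per-sample inputs JSON, or passed to a WDL workflow that scatters over the array.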
Standard practice for VCF filtering for the purpose of fingerprinting via proportion IBS?
Hi all,
I was wondering whether I could get some insight into the standard procedure for "fingerprinting" sequencing data via proportion IBS.
To be more precise: the proportion IBS between the variants called from two samples depends on the relationship between the individuals from which the samples were taken. For example, the proportion IBS between two samples taken from the same person should be higher than that between samples from siblings, which in turn should be higher than that between samples from two unrelated people.
It should therefore be possible to classify a pair of individuals whose relationship is unknown into at least one of the three aforementioned bins (self-self, self-sibling, self-unrelated). My attempt to do this on ~400 samples of targeted amplicon sequencing data yields fairly good separation simply by filtering at a MAF of 0.05, but there is still a lot of overlap (see attached image):
My question is whether there is a battery of filters to apply to the VCFs that is generally accepted as good practice for this use case.
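For reference, proportion IBS between two samples can be computed per shared site as the number of alleles identical by state (0, 1, or 2), averaged over sites. A minimal sketch over diploid genotypes coded as allele-index pairs (a hypothetical helper, not any particular toolkit's API):

```python
def proportion_ibs(gts_a, gts_b):
    """Mean IBS between two samples over shared sites.
    Each genotype is a pair of allele codes, e.g. (0, 1);
    sites missing in either sample (None) are skipped."""
    total, n_sites = 0, 0
    for a, b in zip(gts_a, gts_b):
        if a is None or b is None:
            continue
        # IBS at a site = number of alleles shared between the genotypes (0..2)
        shared = 0
        b_left = list(b)
        for allele in a:
            if allele in b_left:
                shared += 1
                b_left.remove(allele)
        total += shared
        n_sites += 1
    # normalize by the 2 alleles per diploid site
    return total / (2 * n_sites) if n_sites else float("nan")
```

Self-self pairs should sit near 1.0, with sibling and unrelated pairs progressively lower, which is the separation the attached plot is showing.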
FilterMutectCalls change between 4.0.2.1 and 4.0.3.0
I tried running FilterMutectCalls on a VCF generated with Mutect2 for matched tumor-normal samples. With 4.0.2.1, I get some variants labeled PASS, as would be expected, but with 4.0.3.0, I get 0. I could not find any documentation that explains such a big change in behavior. Is this a bug? Or did I have variants erroneously labeled as PASS before?
VCF to fasta format question
I'm trying to take a multi-sample VCF and convert it to an aligned fasta file for phylogenetic analysis, including all indels and invariant sites. Previous questions similar to this suggested isolating individual samples using SelectVariants, and then running FastaAlternateReferenceMaker to make the fasta. I'm having trouble with this because in GATK 3.8, FastaAlternateReferenceMaker doesn't appear to accept '*' (spanning deletion) alternate alleles, and FastaAlternateReferenceMaker doesn't seem to exist in GATK 4.0. Also, always outputting the alternate allele at heterozygous sites doesn't work for my downstream analyses.
To get around these issues, what I've done so far is use GATK 3.8 to generate a VCF with GenotypeGVCFs, including invariant sites. Then I use SelectVariants to pull out a single sample:
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -R /home/gowens/bin/ref/HanXRQr1.0-20151230.fa -T SelectVariants --variant HanXRQChr01_000000001.tmp.vcf.gz -o HanXRQChr01_000000001.subset -sn GO1_GB057 -trimAlternates
I was planning on scripting a parser to convert it to fasta (against the GATK recommendations), but I'm having trouble interpreting some sites.
Example 1:
HanXRQChr01 12430 . T . 16.80 . AN=2;DP=1 GT:AD:DP:RGQ 0/0:1,0:1:3
HanXRQChr01 12431 . GA . 138.70 . AN=2;BaseQRankSum=1.09;ClippingRankSum=0.00;DP=1;ExcessHet=3.0103;FS=0.000;MQ=60.00;MQRankSum=0.00;QD=15.41;ReadPosRankSum=0.020;SOR=1.179 GT:AD:DP:GQ:PL 0/0:1:1:3:0
HanXRQChr01 12432 . A . 138.70 . AN=2;DP=1;ExcessHet=3.0103;FS=0.000;QD=15.41;SOR=1.179 GT:AD:DP:GQ:PL 0/0:1:1:3:0
My interpretation would be that the true state is TGA, but maybe it's TGAA?
Example 2:
HanXRQChr01 17651 . TCGGCAA . 298.45 . AN=2;DP=3;ExcessHet=0.4576;FS=0.000;MQ=56.12;QD=30.57;SOR=0.941 GT:AD:DP:GQ:PL 0/0:3:3:3:0
HanXRQChr01 17652 . C . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=30.65;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
HanXRQChr01 17653 . G . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=35.84;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
HanXRQChr01 17654 . G . 303.51 . AN=0;DP=2;ExcessHet=3.0103;FS=0.000;QD=30.94;SOR=0.941 GT:AD:DP:PL ./.:2:2:0
In this case, there seems to be a difference in confidence depending on which row I interpret. Looking at the first row, it calls the reference allele for a 7-basepair stretch, but the subsequent rows don't have any variant call. Should I interpret this as TCGG, TNNN, or NNNN?
Example 3:
HanXRQChr01 56181 . ATTT A,AT 921.63 . AC=1,1;AF=0.500,0.500;AN=2;BaseQRankSum=0.681;ClippingRankSum=0.00;DP=11;ExcessHet=3.0103;FS=0.000;MQ=58.70;MQRankSum=0.00;QD=25.74;ReadPosRankSum=0.475;SOR=0.321 GT:AD:DP:GQ:PL 1/2:0,9,2:11:43:462,73,43,378,0,
HanXRQChr01 56182 . T * 769.10 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=31.15;SOR=0.321 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56183 . T * 769.10 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=29.40;SOR=0.321 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56184 . T * 724.25 . AC=2;AF=1.00;AN=2;DP=11;ExcessHet=3.0103;FS=0.000;QD=29.97;SOR=0.287 GT:AD:DP:GQ:PL 1/1:0,9:11:30:419,30,0
HanXRQChr01 56185 . T . 12.99 . AN=0;DP=17 GT:AD:DP ./.:17,0:17
HanXRQChr01 56186 . T . 23.07 . AN=2;DP=17 GT:AD:DP:RGQ 0/0:17,0:17:4
HanXRQChr01 56187 . T . 20.94 . AN=2;DP=16 GT:AD:DP:RGQ 0/0:16,0:16:1
HanXRQChr01 56188 . T . 26.81 . AN=2;DP=17 GT:AD:DP:RGQ 0/0:17,0:17:14
Would the two haplotypes be: A-NTTT/ATNTTT?
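To make one set of interpretation choices concrete, here is a deliberately simplified sketch of such a parser (single sample in the 10th column, GT first in FORMAT, records sorted by position; a hypothetical helper, not a validated converter). It emits REF for homozygous-reference calls, N for missing ('./.') and heterozygous genotypes, and skips positions already consumed by an upstream indel as well as '*' spanning-deletion records, which matches one reading of Examples 2 and 3:

```python
def vcf_to_seq(vcf_lines):
    """Build a haploid-style sequence from simple single-sample VCF lines."""
    out = []
    covered_to = 0  # last 1-based position already emitted (indel REF span)
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        pos, ref, alt = int(f[1]), f[3], f[4]
        gt = f[9].split(":")[0].replace("|", "/")
        if pos <= covered_to or "*" in alt.split(","):
            continue  # consumed by an upstream indel / spanning deletion
        alleles = gt.split("/")
        if "." in alleles:
            out.append("N" * len(ref))      # no call
        elif len(set(alleles)) == 1:
            idx = int(alleles[0])
            out.append(ref if idx == 0 else alt.split(",")[idx - 1])
        else:
            out.append("N" * len(ref))      # het: ambiguous for a haploid fasta
        covered_to = pos + len(ref) - 1
    return "".join(out)
```

The het-becomes-N choice sidesteps the "always output the alternate allele" problem mentioned above, at the cost of losing heterozygous information; IUPAC ambiguity codes would be an alternative for SNPs.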