Channel: vcf — GATK-Forum

selecting/intersecting variants

Hello GATK,
I have a WGS experiment of 3 genotypes from my favorite non-model organism for which I called variants based on GATK best practices (a pain for non-models!). I am interested in finding variants that are found in both A and B genotypes as heterozygotes and are not in common with genotype C. I am using SelectVariants but when I check the output files in IGV I don't see the desired variants. I would appreciate any input on modifying my commands:

Get all variants from C
java -jar GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o C.vcf -sn C --excludeNonVariants --excludeFiltered;

Get heterozygous variants from A
java -jar GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o A_het.vcf -select 'vc.getGenotype("A").isHet()' -env -ef;

Get heterozygous variants from B
java -jar GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V filtered.vcf -o B_het.vcf -select 'vc.getGenotype("B").isHet()' -env -ef;

Find het variants common to A and B that are not in C
java -jar GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
-V A_het.vcf \
--concordance B_het.vcf \
--discordance C.vcf \
-o ABh_no_C.vcf
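
A possible alternative (a minimal sketch, assuming GATK 3.x SelectVariants and that a compound JEXL expression is acceptable) is to select sites that are heterozygous in both A and B in a single pass, then subtract anything also called in C with --discordance:

java -jar GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
    -V filtered.vcf \
    -select 'vc.getGenotype("A").isHet() && vc.getGenotype("B").isHet()' \
    -env -ef -o AB_het.vcf

java -jar GenomeAnalysisTK.jar -T SelectVariants -R genomic.fasta \
    -V AB_het.vcf --discordance C.vcf -o ABh_no_C.vcf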

Thanks a lot for any suggestion/correction


What is the best practice for calling/combining variants across multiple RNA-Seq datasets

Hi,
I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample. I would like to now combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq analysis or were no different from the reference allele in some individuals but not others. These sites will be ‘missed’ when HaplotypeCaller analyzes each sample individually and will not be represented in the downstream VCF files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecalibrator and/or HaplotypeCaller simultaneously analyze multiple BAM files?

Is it common practice to combine bam files for discovering sequence variants?
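
For reference, a minimal sketch of the per-sample gVCF plus joint-genotyping workflow (GATK 3.x, hypothetical file names); because each gVCF records reference confidence at non-variant sites, positions that are hom-ref or uncovered in one sample are genotyped (or marked no-call) in the joint VCF rather than silently dropped:

# One gVCF per sample (repeat for each of the 6 samples)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R genome.fasta \
    -I sample1.bam --emitRefConfidence GVCF -o sample1.g.vcf

# Joint genotyping across all samples
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R genome.fasta \
    -V sample1.g.vcf -V sample2.g.vcf -V sample3.g.vcf \
    -V sample4.g.vcf -V sample5.g.vcf -V sample6.g.vcf \
    -o joint.vcf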

Allele frequency and depth in the VCF produced by MuTect2

Hi all,

From my understanding of the VCF output, the AF[format] field (allele fraction of the event in the tumor) equals:
AD[format] / DP[format].
With AD being the depth of coverage of each allele per sample (we use the alt allele when calculating AF),
and DP being the "filtered" depth of coverage for each sample (we use the one computed from the tumor sample when calculating AF).

And with some further reading, I think I figured out that:
AD[format] <=> all sample-reads minus uninformative reads.
AD is computed with GATK DepthPerAlleleBySample.
DP[format] <=> all sample reads minus filtered reads (which is not the same set as the uninformative reads).
DP[info] <=> all site-level reads (T+N samples), minus nothing.
DP is computed with GATK Coverage.

From the GATK doc (http://gatkforums.broadinstitute.org/gatk/discussion/4721/using-depth-of-coverage-metrics-for-variant-evaluation), one can read the following :

The key difference is that the AD metric is based on unfiltered read counts while the sample-level DP is based on filtered read counts (see tool documentation for a list of read filters that are applied by default for each tool). As a result, they should be interpreted differently.

If AF is indeed AD[format]/DP[format], isn't it strange to compute AF by dividing an unfiltered read count by a filtered read depth?

PS: I tried to "verify" the DP[info] depth (computed inside the MuTect2 run) by using GATK DepthOfCoverage with the same input (non-marked_recalibrated T/N BAMs). For a given position, I find a higher depth with GATK DepthOfCoverage (501 vs 434). Is the DP[info] really based on unfiltered reads? Or do GATK Coverage and GATK DepthOfCoverage have some minor differences?
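
For comparison, a minimal sketch (bcftools plus awk, hypothetical file name; the tumor is assumed to be the first genotype column, adjust the field indices otherwise) that recomputes the allele fraction from AD two ways and prints it next to the reported AF:

bcftools query -f '%CHROM\t%POS[\t%AD\t%DP\t%AF]\n' mutect2_calls.vcf | \
    awk -F'\t' '{ split($3, ad, ","); if ($4 > 0 && ad[1] + ad[2] > 0) print $1, $2, "alt/sum(AD)=" ad[2]/(ad[1]+ad[2]), "alt/DP=" ad[2]/$4, "reported AF=" $5 }'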

MuTect2 is calling different variants when changing -minPruning value

Context :
Working with targeted sequencing data (an amplicon gene panel); depth is high, so adjusting the -minPruning value seems relevant.

-minPruning argument : Paths with fewer supporting kmers than the specified threshold will be pruned from the graph.
Default value : 2.

So far, values 2 and 10 have been tested in our pipeline.
I was expecting faster MuTect2 runs with -minPruning=10 (which was the case) and fewer variants compared to the same run with -minPruning=2 (also the case). I was also expecting the -minPruning=10 variants to be "included" in the -minPruning=2 variants.

The thing is, the VCFs contain 3 "kinds" of variants:

  • common to both -minPruning=2 & -minPruning=10.
  • unique to -minPruning=2.
  • unique to -minPruning=10.

So, the only explanation I can come up with is that during the HaplotypeCaller re-assembly of an active region, depending on the -minPruning value, the graph can be "discarded" or "skipped" if it's too messy or doesn't pass the required kmer coverage.

Trying to apply this "explanation" to the 3 "kinds" of variants:

  • common to both -minPruning=2 & -minPruning=10: for this active region, the HC graphs for minPruning=2 and minPruning=10 are similar and allow the same variant to be called in both cases.
  • unique to -minPruning=2: for this particular region, the graph is pruned because its paths have fewer than 10 supporting kmers.
  • unique to -minPruning=10: for this particular region, the graph was too messy with -minPruning=2 and variant calling was skipped.

PS: This was investigated for a small targeted region (chr5:68470404-68471404).

Command lines are given below :

prun=2;
java -jar $GATK -T MuTect2 \
                 -nct 1 \
                 -R ../Hg19_files/ucsc.hg19.fasta \
                 -L chr5:68470404-68471404 \
                 -I:tumor ../Outputs/V2_noDups/3_BQSR/p10_noDups.2.bam \
                 -I:normal ../Outputs/V2_noDups/3_BQSR/p10_nor_noDups.2.bam \
                 --dbsnp ../Hg19_files/dbsnp_138.hg19.vcf \
                 --cosmic ../Hg19_files/cosmic_coding_and_noncoding_chr_M_sorted.vcf \
                 -PON ../Outputs/V2_noDups/5_PoN/pon_noDups.vcf \
                 -contamination $(awk 'NR==2 {print $4/100}' ../Outputs/V2_noDups/4_ContEst/p10_ContEst.txt) \
                 --annotation DepthPerAlleleBySample \
                 --annotation BaseQualitySumPerAlleleBySample \
                 --annotation TandemRepeatAnnotator \
                 --annotation OxoGReadCounts \
                 --annotation StrandAlleleCountsBySample \
                 -minPruning $prun \
                 --bamOutput p10_noDups_minPrun"$prun"_chr5_68470404_68471404_bamout.bam \
                 -o p10_noDups_minPrun"$prun"_chr5_68470404_68471404.vcf;
prun=10;
java -jar $GATK -T MuTect2 \
                 -nct 1 \
                 -R ../Hg19_files/ucsc.hg19.fasta \
                 -L chr5:68470404-68471404 \
                 -I:tumor ../Outputs/V2_noDups/3_BQSR/p10_noDups.2.bam \
                 -I:normal ../Outputs/V2_noDups/3_BQSR/p10_nor_noDups.2.bam \
                 --dbsnp ../Hg19_files/dbsnp_138.hg19.vcf \
                 --cosmic ../Hg19_files/cosmic_coding_and_noncoding_chr_M_sorted.vcf \
                 -PON ../Outputs/V2_noDups/5_PoN/pon_noDups.vcf \
                 -contamination $(awk 'NR==2 {print $4/100}' ../Outputs/V2_noDups/4_ContEst/p10_ContEst.txt) \
                 --annotation DepthPerAlleleBySample \
                 --annotation BaseQualitySumPerAlleleBySample \
                 --annotation TandemRepeatAnnotator \
                 --annotation OxoGReadCounts \
                 --annotation StrandAlleleCountsBySample \
                 -minPruning $prun \
                 --bamOutput p10_noDups_minPrun"$prun"_chr5_68470404_68471404_bamout.bam \
                 -o p10_noDups_minPrun"$prun"_chr5_68470404_68471404.vcf;

The resulting VCFs contain:
For minPruning=2:

chr5 68470680    .   A   AT  .   alt_allele_in_normal;panel_of_normals   ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=76.08;TLOD=43.37   GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC  0/1:796,65:0.069:0:0:.:22889,1241:0:0:431,365,0,65  0/0:465,34:0.063:0:0:.:12741,646:0:0:241,224,0,34
chr5    68470913    .   C   T   .   PASS    ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=146.71;TLOD=9.80   GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC  0/1:943,19:0.014:0:0:.:27476,480:0:0:470,473,8,11   0/0:488,0:0.00:0:0:.:13599,0:0:0:216,272,0,0
chr5    68471144    .   TA  T   .   alt_allele_in_normal;clustered_events;t_lod_fstar   ECNT=2;HCNT=18;MAX_ED=12;MIN_ED=12;NLOD=144.47;RPA=3,2;RU=A;STR;TLOD=6.25   GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC  0/1:977,29:0.022:0:0:.:24369,584:0:0:475,502,2,27   0/0:593,21:0.028:0:0:.:14864,416:0:0:301,292,3,18
chr5    68471156    .   TG  T   .   alt_allele_in_normal;clustered_events;panel_of_normals  ECNT=2;HCNT=4;MAX_ED=12;MIN_ED=12;NLOD=151.73;RPA=3,2;RU=G;STR;TLOD=40.65   GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC  0/1:956,59:0.052:0:0:.:27030,1379:0:0:422,534,55,4  0/0:596,18:0.023:0:0:.:16046,407:0:0:287,309,17,1
For minPruning=10:
chr5 68471156    .   TG  T   .   alt_allele_in_normal;panel_of_normals   ECNT=1;HCNT=8;MAX_ED=.;MIN_ED=.;NLOD=136.55;RPA=3,2;RU=G;STR;TLOD=40.71 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC  0/1:899,54:0.051:0:0:.:25517,1244:0:0:393,506,50,4  0/0:559,17:0.023:0:0:.:15186,378:0:0:271,288,17,0
chr5    68471324    .   CT  C   .   alt_allele_in_normal;panel_of_normals;t_lod_fstar   ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=249.61;RPA=5,4;RU=T;STR;TLOD=5.03  GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC  0/1:1281,51:0.032:0:0:.:36659,1387:0:0:551,730,16,35    0/0:996,35:0.027:0:0:.:26231,889:0:0:511,485,6,29
  • 3 variants are unique to minPruning2 (chr5:68470680:A->AT, chr5:68470913:C->T & chr5:68471144:TA->T)
  • 1 variant is common (chr5:68471156:TG->T)
  • 1 variant is unique to minPruning10 (chr5:68471324:CT->C)

IGV views of corresponding bamouts are attached (track_1 : reference, track_2 : minPruning=2, track_3 : minPruning=10).
IGV shows that when variants are "common", both tracks contain reads.
When variants are unique, there are no reads supporting that particular region in the run with the other value of minPruning.

So, the question is: is this behaviour expected? If so, is it correct to combine the VCF for minPruning=2 with the VCF for minPruning=10, to get the most out of MuTect2?
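
If combining the two call sets turns out to be reasonable, a minimal sketch (GATK 3.x CombineVariants, reusing the file names above; the merged records get a set= annotation indicating which run(s) each call came from):

java -jar $GATK -T CombineVariants -R ../Hg19_files/ucsc.hg19.fasta \
    --variant:minPrun2 p10_noDups_minPrun2_chr5_68470404_68471404.vcf \
    --variant:minPrun10 p10_noDups_minPrun10_chr5_68470404_68471404.vcf \
    -genotypeMergeOptions PRIORITIZE -priority minPrun2,minPrun10 \
    -o p10_noDups_minPrun_union_chr5_68470404_68471404.vcf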

Which DP value should I use?

I used the HaplotypeCaller module to detect mutations, and I noticed that the DP value is different in the INFO column and in the FORMAT column, as in the attached picture.
Why does this happen? Which DP value should I use?
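
For reference, a minimal sketch (bcftools, hypothetical file name) that prints the site-level INFO DP and the per-sample FORMAT DP side by side so the two can be compared position by position:

bcftools query -f '%CHROM\t%POS\t%INFO/DP[\t%DP]\n' sample.vcf | head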

Multisample vs Single sample(Paired sample MUT &WT separate BAM files)

Hello,
I'm currently working with zebrafish mutants, comparing phenotypically wild-type and mutant siblings for mutations. I have 3 different mutants, i.e. 3 pairs of mutant and wild-type siblings: Mut A, WT A; Mut B, WT B; and so on.

I called variants in single-sample mode, i.e. WT and mutant separately (HaplotypeCaller in GVCF mode) for all the pairs (a separate gVCF for each BAM file).

Then I combined all the samples to call genotypes, so that the coverage from the other samples helps to get correct genotypes. So in the end I have a 6-sample VCF file.

Then, just to check, I called variants directly with HaplotypeCaller using only one pair, i.e. Mut (A) and WT (A) together.

But after comparing some positions in the same sample, say A, I'm seeing differences in genotypes between the 6-sample VCF and my 2-sample VCF (of the same mutant+sibling pair). Even the DP is different in some cases.

I want to ask: which one is the better method? I thought multisample variant calling would be better.
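
For reference, a minimal sketch of the joint-genotyping route described above (GATK 3.x, hypothetical file names), which is the workflow generally recommended over running HaplotypeCaller directly on pairs of BAMs:

# One gVCF per BAM (Mut A, WT A, Mut B, WT B, Mut C, WT C)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R zebrafish_ref.fasta \
    -I MutA.bam --emitRefConfidence GVCF -o MutA.g.vcf

# Joint genotyping across all six samples
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R zebrafish_ref.fasta \
    -V MutA.g.vcf -V WTA.g.vcf -V MutB.g.vcf -V WTB.g.vcf \
    -V MutC.g.vcf -V WTC.g.vcf -o all_six_samples.vcf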

thanks

Error in VariantAnnotator

Hi,
Because my reference is at scaffold level and GATK is not designed to analyse such references, my variant calling was done using samtools. Now I want to add QD (quality by depth) to the VCF files using VariantAnnotator, but I get the following error:

ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Annotation QD was not found; please check that you have specified the annotation name correctly

The command I used:

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R /home/mylinux/refdic/DROMREF.fasta -I /home/mylinux/Drom1/realigned_D1.bam -V /home/mylinux/Drom1/variants/SNPSwhitGQ.vcf -L /home/mylinux/Drom1/variants/SNPSwhitGQ.vcf -A QD -o /home/mylinux/Drom1/variants/GATKANN.vcf

How can I add QD to the VCF file?
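
One likely fix (a sketch, not verified against this data): the annotation module is requested by its full name, QualByDepth, rather than by the VCF key QD, e.g.:

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R /home/mylinux/refdic/DROMREF.fasta \
    -I /home/mylinux/Drom1/realigned_D1.bam \
    -V /home/mylinux/Drom1/variants/SNPSwhitGQ.vcf \
    -L /home/mylinux/Drom1/variants/SNPSwhitGQ.vcf \
    -A QualByDepth \
    -o /home/mylinux/Drom1/variants/GATKANN.vcf

Note that QualByDepth may still emit no value if the samtools-generated VCF lacks the genotype depth information the annotation needs.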

Filtration criteria

I want to filter a VCF file produced by GATK. My criteria are QUAL < 30, FS > 60, MQ < 30, DP < 10 and GQ < 30.
Should the filterExpression in my GATK command line be as follows? Is GQ in the right place in the command? Another question: is QD < 2 a good criterion for quality filtering?

--filterExpression "QUAL < 30 || FS > 60.0 || MQ < 30.0 || DP < 10 || GQ < 30"
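
For what it's worth, a minimal sketch (GATK 3.x VariantFiltration, hypothetical file names): since GQ is a per-sample FORMAT field, it usually belongs in a genotype-level filter expression rather than in the site-level one:

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref.fasta -V input.vcf \
    --filterExpression "QUAL < 30.0 || FS > 60.0 || MQ < 30.0 || DP < 10" \
    --filterName "site_filter" \
    --genotypeFilterExpression "GQ < 30" \
    --genotypeFilterName "lowGQ" \
    -o filtered.vcf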

thanks in advance


ASEReadCounter vs AD and DP in the VCF file

Dear GATK developer:

Can you please explain the difference between the allele counts obtained by running ASEReadCounter and the allele read counts provided in the AD/DP fields of the VCF file? It seems that the counts from ASEReadCounter are similar to DP, the filtered depth in the VCF, while the AD calculation includes unfiltered reads?
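
For context, a minimal sketch of an ASEReadCounter run (assuming GATK 3.x argument names and hypothetical file names); the tool applies its own read filters and quality thresholds, which may be one reason its counts do not match either AD or DP exactly:

java -jar GenomeAnalysisTK.jar -T ASEReadCounter -R ref.fasta \
    -I sample.bam \
    -sites het_sites.vcf \
    -U ALLOW_N_CIGAR_READS \
    --minMappingQuality 10 --minBaseQuality 2 \
    -o sample.ase.csv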

Thank you.

Bamout file shows a consistent deletion that is not reported in VCF

Dear GATK team

I ran HaplotypeCaller on a BAM file, which is the alignment of a single bacterial sample to its reference genome. To understand the calling process I wanted to compare the resulting VCF and bamout files, so I selected a small region.

java -jar -Xmx8g $BASE/GenomeAnalysisTK.jar -T HaplotypeCaller -R AE014075.fasta -I dedup_3_S3.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o S3_part.vcf -bamout newbamout389088.bam -L AE014075.1:389000-389190

Then I opened three tracks in IGV: the VCF file (S3_part.vcf) on top, then the original bam file (dedup_3_S3.bam), and lastly the bamout file (newbamout389088.bam). As you can see, the original bam file suggests a deletion at position 389087 and a SNP at 389091, supported by 861 out of ~1000 reads (MAPQ > 30). The bamout file, after re-alignment and re-assembly, shows an insertion at 389088 and a 3-nt deletion at 389090, supported by almost 100% of the reads.

It is important to note that in the bamout track, all artificial haplotypes (pink reads) show the same insertion and deletions. Actually if you scroll down you will see that all reads from the original bam file (blue) contain exactly the same variants too, although not all of them are informative.

Now if you go to the top VCF track, the insertion at 389088 is called while the deletion at 389090 is not. This is confusing: didn't the bamout file just say that all assembled haplotypes contain a deletion at 389090, and that almost all the re-aligned reads from the original bam file also contain this variant?

Also maybe a silly question: since the genome's ploidy is 1 and there's only 1 sample here, shouldn't there be only 1 haplotype by definition? Why does the bamout have ~10 haplotypes (when by default bamWriterType is CALLED_HAPLOTYPES)?
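
A minimal debugging sketch (assuming the GATK 3.x flags named below are available in this version) that writes out all candidate haplotypes for the region rather than only the called ones, and forces the region to be treated as active:

java -jar $BASE/GenomeAnalysisTK.jar -T HaplotypeCaller -R AE014075.fasta -I dedup_3_S3.bam \
    -ploidy 1 -L AE014075.1:389000-389190 \
    -bamout debug_bamout_389000_389190.bam \
    -bamWriterType ALL_POSSIBLE_HAPLOTYPES \
    -forceActive -disableOptimizations \
    -o S3_debug.vcf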

Thanks a lot!

Mutect2 output with samples with different read groups

Hi all,
I am running Mutect2 with a tumour/normal match in RNA-Seq. Here is my command:

gatk -T MuTect2 -R ./genome.fa --read_filter MappingQuality --min_mapping_quality_score 20 --max_alternate_alleles 4 -kmerSize 15 -kmerSize 30 -I:tumor ./9_Recal.reorderedDupl.rg.split.bam -I:normal ./8_Recal.reorderedDupl.rg.split.bam --dbsnp ./dbsnp_144.hg38.vcf.gz -L ./genes_for_variant_calling.bed -o ./9_somatic_snvs_indels.vcf -log ./9_somatic_snvs_indels_log

In the output (I am showing here output opened in R with the VariantAnnotation package) I get the following data for the genotypes:

                                        8_L003       8_L004     8_L007     8_L008        9_L003     9_L004       9_L007    9_L008

chr1:36466762_A/G "0/0" "." "." "." "0/1" "." "." "."
chr1:36468081_A/G "0/0" "." "." "." "0/1" "." "." "."
chr1:36475819_A/G "0/0" "." "." "." "0/1" "." "." "."
chr1:36476164_A/G "0/0" "." "." "." "0/1" "." "." "."
chr1:36476671_C/CT "0/0" "." "." "." "0/1" "." "." "."
chr1:36476745_T/C "0/0" "." "." "." "0/1" "." "." "."

I get the same strange output for many other samples. It looks as if the variants are found only in a subset of reads belonging to one read group, and all the variants are like this. Should I be really worried about my data, or does Mutect2 do some comparisons that I am not aware of?
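
One thing worth checking (a sketch with hypothetical names): whether each lane-level read group in the BAMs carries its own SM tag. GATK treats distinct SM values as distinct samples, which would produce one genotype column per lane exactly like the table above. Picard AddOrReplaceReadGroups can set a single SM per BAM if that is the cause:

# Inspect the read groups and their SM tags
samtools view -H 9_Recal.reorderedDupl.rg.split.bam | grep '^@RG'

# If the lanes carry different SM values, unify them (adjust the RG fields to match your data)
java -jar picard.jar AddOrReplaceReadGroups \
    I=9_Recal.reorderedDupl.rg.split.bam O=9_single_sample.bam \
    RGID=9 RGLB=lib9 RGPL=ILLUMINA RGPU=unit9 RGSM=tumor9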

Thanks for your help,

Anna

Some questions about VQSR.

Can I ask three questions about VQSR:
1. I have two VCF files from two different sources, and each of them contains several samples. The sequencing processes are basically the same (library prep, sequencing kit, etc.) for the two data sets, but the VQSLOD cutoffs for the various tranches differ between the two sets (e.g. for file1, Tranche99.00to99.90 is VQSLOD -100 <= x < -0.8, versus -7 <= x < -1.5 for file2). I'm guessing it's because they used different training data, but I'm not sure my guess is correct; i.e., if the VQSR steps were run in exactly the same way (same training set and parameters), would the two data sets get the same VQSLOD cutoffs for the various tranches?
2. Does the testing data (i.e. the variant calls from a given experiment) have an impact on the VQSR results? For example, for one data set, if I run VQSR again on the filtered variants (keeping only the "PASS" loci) using the same training variants, will I get a different VQSR evaluation for the same variant loci compared to the first run?
3. I read that indels and SNPs should be evaluated separately via VQSR, and I see that some VCF files have TrancheINDEL99.00to99.90 and TrancheSNP99.00to99.90, etc. However, there are also VCF files that only have Tranche99.00to99.90, etc., with no indication of indels or SNPs. Does this mean that indels and SNPs were evaluated together? Is that a good thing?
Sorry for so many questions. I just learned about VQSR and am still a bit confused. Thanks in advance!
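
Regarding question 3, a minimal sketch of the separate SNP/indel recalibration (GATK 3.x, with placeholder resources and annotations) that produces the TrancheSNP.../TrancheINDEL... labels:

# SNP pass (resources and annotations are placeholders)
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fasta -input raw.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an MQ -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP -recalFile snp.recal -tranchesFile snp.tranches

java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R ref.fasta -input raw.vcf \
    -mode SNP --ts_filter_level 99.0 \
    -recalFile snp.recal -tranchesFile snp.tranches -o recal_snps.vcf

# Then repeat both steps with -mode INDEL (using indel resources such as Mills) on recal_snps.vcf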

fastest way of getting total number of variants in VCF via picard?

What is the fastest way of getting the total number of variants in a VCF file? (using picard-tools-1.119, via SplitVcfs.jar.)

So far the fastest way I have found is this:

private static int getNumVariants(final VCFFileReader reader) {
    int totalVariants = 0;
    final CloseableIterator<VariantContext> iterator = reader.iterator();
    while (iterator.hasNext()) {
        iterator.next();
        totalVariants++;
    }
    iterator.close();
    return totalVariants;
}

But this appears to iterate through the entire VCF file, which for large files seems very inefficient...

I am thinking that there must be a faster way. After all, the number of variants is simply:
total number of lines in file - number of lines in header?

Any way to get this?
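
If shelling out is acceptable, a minimal sketch of exactly that line-counting idea (assuming an uncompressed, one-record-per-line VCF; use zgrep for .vcf.gz):

grep -vc '^#' variants.vcf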

Thanks
Martin

Empty vcf when generating PON with MuTect2

Hi,

First, sorry if this is a duplicate question. I have looked for a solution but didn't find one.

I have exome data from 4 patients: 8 samples in total (normal and tumor for each).
I ran MuTect2 (in GATK 3.7) in normal/tumor mode, and it works well.

But now I would like to generate a Panel of Normals.
Here is the log:

INFO  16:48:48,998 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:48:49,000 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  16:48:49,001 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  16:48:49,001 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  16:48:49,001 HelpFormatter - [Mon Apr 03 16:48:48 CEST 2017] Executing on Linux 4.4.0-62-generic amd64 
INFO  16:48:49,001 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13 
INFO  16:48:49,004 HelpFormatter - Program Args: -T MuTect2 -nct 2 -R /media/raid/genomic_DB//hg19/hg19.fasta -I:normal bam/WES_8N.sorted.dedup.recalib.bam -L /media/raid/genomic_DB//hg19/bed/hg19_Gene_table.bed --dbsnp /media/raid/genomic_DB//hg19/GATK_bundle/dbsnp_138.hg19.vcf --cosmic /media/raid/genomic_DB//hg19/GATK_bundle/Cosmic.hg19.vcf --artifact_detection_mode -o var/WES_8_N.vcf 
INFO  16:48:49,007 HelpFormatter - Executing as aatkinson@umr910-pfg-ws2 on Linux 4.4.0-62-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13. 
INFO  16:48:49,007 HelpFormatter - Date/Time: 2017/04/03 16:48:48 
INFO  16:48:49,008 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:48:49,008 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  16:48:49,023 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:48:49,133 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  16:48:49,140 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:48:49,160 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO  16:48:50,273 IntervalUtils - Processing 1325431593 bp from intervals 
INFO  16:48:50,285 MicroScheduler - Running the GATK in parallel mode with 2 total threads, 2 CPU thread(s) for each of 1 data thread(s), of 56 processors available on this machine 
INFO  16:48:50,340 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  16:48:50,465 GenomeAnalysisEngine - Done preparing for traversal 
INFO  16:48:50,466 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  16:48:50,466 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining 
INFO  16:48:50,466 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime 
INFO  16:48:50,532 MuTect2 - Using global mismapping rate of 45 => -4.5 in log10 likelihood units 
INFO  16:48:50,533 PairHMM - Performance profiling for PairHMM is disabled because the program is being run with multiple threads (-nct>1) option
Profiling is enabled only when running in single thread mode

INFO  16:49:20,470 ProgressMeter -   chr1:14958958     5.97786178E8    30.0 s            0.0 s        0.6%    82.1 m      81.6 m 
INFO  16:49:50,473 ProgressMeter -   chr1:26627596    2.516091919E9    60.0 s            0.0 s        1.1%    88.4 m      87.4 m 
INFO  16:50:20,474 ProgressMeter -   chr1:42766671    6.139828701E9    90.0 s            0.0 s        1.7%    85.9 m      84.4 m 
INFO  16:50:50,476 ProgressMeter -   chr1:62963628   1.0743175026E10   120.0 s            0.0 s        2.7%    73.3 m      71.3 m 
INFO  16:51:20,478 ProgressMeter -   chr1:91817153   1.5892962774E10     2.5 m            0.0 s        3.8%    66.3 m      63.8 m 
INFO  16:51:50,480 ProgressMeter -  chr1:117093107   2.5856520721E10     3.0 m            0.0 s        4.6%    65.4 m      62.4 m 
INFO  16:52:20,482 ProgressMeter -  chr1:156181686   4.2774755347E10     3.5 m            0.0 s        5.3%    65.5 m      62.0 m 
INFO  16:52:50,484 ProgressMeter -  chr1:178832743   5.9965019186E10     4.0 m            0.0 s        6.2%    64.3 m      60.3 m 
INFO  16:53:20,486 ProgressMeter -  chr1:205910763   7.6568008961E10     4.5 m            0.0 s        7.1%    63.5 m      59.0 m 
INFO  16:53:50,487 ProgressMeter -  chr1:231612271   9.5039378738E10     5.0 m            0.0 s        8.1%    62.0 m      57.0 m 
INFO  16:54:20,489 ProgressMeter -   chr2:10978103   1.08452450175E11     5.5 m            0.0 s        9.1%    60.2 m      54.7 m 
...
...
*MuTect2 running, without any warnings, so I delete lines to keep message shorter*
...
...
INFO  18:55:50,952 ProgressMeter -   chrX:76856153   6.28621417292E11     2.1 h            0.0 s       97.8%     2.2 h       2.8 m 
INFO  18:56:20,953 ProgressMeter -   chrX:76877471   6.28621417292E11     2.1 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:56:50,954 ProgressMeter -   chrX:76893542   6.28621417292E11     2.1 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:57:20,956 ProgressMeter -   chrX:76906228   6.28621417292E11     2.1 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:57:50,958 ProgressMeter -   chrX:76917364   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:58:20,959 ProgressMeter -   chrX:76927855   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:58:50,961 ProgressMeter -   chrX:76937585   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:59:20,962 ProgressMeter -   chrX:76946451   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  18:59:50,964 ProgressMeter -   chrX:76954953   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  19:00:20,965 ProgressMeter -   chrX:76963009   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  19:00:50,966 ProgressMeter -   chrX:76970682   6.28621417292E11     2.2 h            0.0 s       97.8%     2.2 h       2.9 m 
INFO  19:01:20,967 ProgressMeter -  chrX:118733701   6.34550108037E11     2.2 h            0.0 s       98.9%     2.2 h      92.0 s 
Using AVX accelerated implementation of PairHMM
INFO  19:01:44,249 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file 
INFO  19:01:44,249 VectorLoglessPairHMM - Using vectorized implementation of PairHMM 
INFO  19:01:44,250 MuTect2 - Ran local assembly on 0 active regions 
INFO  19:01:44,252 ProgressMeter -            done   6.48115744256E11     2.2 h            0.0 s      100.0%     2.2 h       0.0 s 
INFO  19:01:44,253 ProgressMeter - Total runtime 7973.79 secs, 132.90 min, 2.21 hours 
INFO  19:01:44,253 MicroScheduler - 386106 reads were filtered out during the traversal out of approximately 105750521 total reads (0.37%) 
INFO  19:01:44,253 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  19:01:44,253 MicroScheduler -   -> 371854 reads (0.35% of total) failing DuplicateReadFilter 
INFO  19:01:44,253 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO  19:01:44,254 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  19:01:44,254 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO  19:01:44,254 MicroScheduler -   -> 14252 reads (0.01% of total) failing NotPrimaryAlignmentFilter 
INFO  19:01:44,254 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

In the end, I have a VCF without any variants; only the header appears.
I have looked for solutions, and it seems I have the same problem as here:
1. http://gatkforums.broadinstitute.org/gatk/discussion/6085/generating-pool-of-normal
2. https://gatkforums.broadinstitute.org/wdl/discussion/6690/mutect2-tumor-only-mode-empty-vcfs

As in those previous posts, I see the same strange line in my log:

INFO  19:01:44,250 MuTect2 - Ran local assembly on 0 active regions

So I tried a nightly build, as recommended in the previous posts:

INFO  16:09:07,202 HelpFormatter - --------------------------------------------------------------------------------------------- 
INFO  16:09:07,204 HelpFormatter - The Genome Analysis Toolkit (GATK) vnightly-2017-04-04-g34bd8a3, Compiled 2017/04/04 00:01:14 
INFO  16:09:07,204 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  16:09:07,204 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  16:09:07,204 HelpFormatter - [Tue Apr 04 16:09:07 CEST 2017] Executing on Linux 4.4.0-62-generic amd64 
INFO  16:09:07,204 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13 
INFO  16:09:07,208 HelpFormatter - Program Args: -T MuTect2 -nct 2 -R /media/raid/genomic_DB//hg19/hg19.fasta -I:normal bam/WES_26N.sorted.dedup.recalib.bam -L /media/raid/genomic_DB//hg19/bed/hg19_Gene_table.bed --dbsnp /media/raid/genomic_DB//hg19/GATK_bundle/dbsnp_138.hg19.vcf --cosmic /media/raid/genomic_DB//hg19/GATK_bundle/Cosmic.hg19.vcf --artifact_detection_mode -o var/WES_26_N.vcf 
INFO  16:09:07,210 HelpFormatter - Executing as aatkinson@umr910-pfg-ws2 on Linux 4.4.0-62-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13. 
INFO  16:09:07,210 HelpFormatter - Date/Time: 2017/04/04 16:09:07 
INFO  16:09:07,210 HelpFormatter - --------------------------------------------------------------------------------------------- 
INFO  16:09:07,210 HelpFormatter - --------------------------------------------------------------------------------------------- 
INFO  16:09:07,225 GenomeAnalysisEngine - Strictness is SILENT 
INFO  16:09:07,340 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  16:09:07,346 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  16:09:07,366 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
INFO  16:09:08,334 IntervalUtils - Processing 1325431593 bp from intervals 
INFO  16:09:08,346 MicroScheduler - Running the GATK in parallel mode with 2 total threads, 2 CPU thread(s) for each of 1 data thread(s), of 56 processors available on this machine 
INFO  16:09:08,399 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
INFO  16:09:08,539 GenomeAnalysisEngine - Done preparing for traversal 
INFO  16:09:08,540 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  16:09:08,540 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining 
INFO  16:09:08,542 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime 
INFO  16:09:08,610 MuTect2 - Using global mismapping rate of 45 => -4.5 in log10 likelihood units 
INFO  16:09:08,611 PairHMM - Performance profiling for PairHMM is disabled because the program is being run with multiple threads (-nct>1) option
Profiling is enabled only when running in single thread mode

INFO  16:09:38,546 ProgressMeter -   chr1:16563659     7.97486694E8    30.0 s            0.0 s        0.7%    71.0 m      70.5 m 
INFO  16:10:08,547 ProgressMeter -   chr1:36384738    4.675477702E9    60.0 s            0.0 s        1.5%    66.7 m      65.7 m 
INFO  16:10:38,549 ProgressMeter -   chr1:65120499   1.1110499512E10    90.0 s            0.0 s        2.8%    53.0 m      51.5 m 
INFO  16:11:08,551 ProgressMeter -  chr1:108509208   1.9618437838E10   120.0 s            0.0 s        4.3%    46.9 m      44.9 m 
...
...
again, I deleted some lines (no warnings) to keep the message shorter
...
...
INFO  18:07:38,969 ProgressMeter -   chrX:76944672   6.28621417292E11   118.5 m            0.0 s       97.8%     2.0 h       2.7 m 
INFO  18:08:08,970 ProgressMeter -   chrX:76953314   6.28621417292E11   119.0 m            0.0 s       97.8%     2.0 h       2.7 m 
INFO  18:08:38,971 ProgressMeter -   chrX:76961411   6.28621417292E11   119.5 m            0.0 s       97.8%     2.0 h       2.7 m 
INFO  18:09:08,973 ProgressMeter -   chrX:76968843   6.28621417292E11   120.0 m            0.0 s       97.8%     2.0 h       2.7 m 
INFO  18:09:38,974 ProgressMeter -  chrX:113099932   6.33515778237E11     2.0 h            0.0 s       98.7%     2.0 h      93.0 s 
Using AVX accelerated implementation of PairHMM
INFO  18:10:00,807 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file 
INFO  18:10:00,808 VectorLoglessPairHMM - Using vectorized implementation of PairHMM 
INFO  18:10:00,809 MuTect2 - Ran local assembly on 0 active regions 
INFO  18:10:00,812 ProgressMeter -            done   6.48115744256E11     2.0 h            0.0 s      100.0%     2.0 h       0.0 s 
INFO  18:10:00,812 ProgressMeter - Total runtime 7252.27 secs, 120.87 min, 2.01 hours 
INFO  18:10:00,813 MicroScheduler - 231969 reads were filtered out during the traversal out of approximately 82616070 total reads (0.28%) 
INFO  18:10:00,813 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  18:10:00,813 MicroScheduler -   -> 215232 reads (0.26% of total) failing DuplicateReadFilter 
INFO  18:10:00,813 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter 
INFO  18:10:00,813 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  18:10:00,813 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter 
INFO  18:10:00,814 MicroScheduler -   -> 16737 reads (0.02% of total) failing NotPrimaryAlignmentFilter 
INFO  18:10:00,814 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter 
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

The nightly build doesn't work any better.
Not shown here, but I have also tried without -L myBedFile.bed, and without -nct, and I get an empty VCF again.
I don't know what else to do or try...
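
One thing worth checking (a sketch based on the documented MuTect2 panel-of-normals workflow, with a hypothetical output name): in --artifact_detection_mode each normal BAM is expected to be supplied as the tumor input (-I:tumor). With only -I:normal given, no tumor sample is present, which would be consistent with the "Ran local assembly on 0 active regions" line:

java -jar GenomeAnalysisTK.jar -T MuTect2 \
    -R /media/raid/genomic_DB/hg19/hg19.fasta \
    -I:tumor bam/WES_8N.sorted.dedup.recalib.bam \
    --dbsnp /media/raid/genomic_DB/hg19/GATK_bundle/dbsnp_138.hg19.vcf \
    --cosmic /media/raid/genomic_DB/hg19/GATK_bundle/Cosmic.hg19.vcf \
    -L /media/raid/genomic_DB/hg19/bed/hg19_Gene_table.bed \
    --artifact_detection_mode \
    -o var/WES_8_N_for_pon.vcf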

Should I provide the exome target list (-L argu) even while calling gVCF file using Haplotypecaller?

Hi,

Recently we performed exome sequencing on the Illumina Nextera platform for three samples (father, mother and son). I downloaded the exome interval list from Illumina's website.

1) Trimmed the raw reads
2) Aligned the trimmed reads against the human reference hg19 as recommended for exome-sequencing
3) Then sorted, deduped, recalibrated the bam file.
4) Then performed variant calling in two steps process for all three samples individually
4.1) Used the GATK Haplotype Caller tool in GVCF mode
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /GATK_bundle/hg19.fa -I sample1.sorted.dedup.recal.bam --emitRefConfidence GVCF --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.raw.g.vcf
4.2) Used GenotypeGVCFs (Joint SNP calling) for all three samples together
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /GATK_bundle/hg19.fa --variant sample1.raw.g.vcf --variant sample2.raw.g.vcf --variant sample3.raw.g.vcf --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.2.3.trio.raw.vcf

In the above commands, I didn't use Illumina's exome interval list that was used to target the exome during sequencing.

As per this link "https://software.broadinstitute.org/gatk/documentation/article.php?id=4669", under the example section of GATK command lines, for exome sequencing the article suggests providing the exome targets using the -L argument.

I have the following queries, as per the aforementioned article:
1) Should I provide the exome target list (-L argument) only when calling the regular VCF file using HaplotypeCaller?
or
2) Should I provide the exome target list (-L argument) even when calling the gVCF file using HaplotypeCaller?
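
For reference, a minimal sketch (hypothetical interval file name; -ip adds padding around each target) of passing the exome targets at both steps:

java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /GATK_bundle/hg19.fa \
    -I sample1.sorted.dedup.recal.bam --emitRefConfidence GVCF \
    -L nextera_exome_targets.interval_list -ip 100 \
    --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.raw.g.vcf

java -Xmx16g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /GATK_bundle/hg19.fa \
    --variant sample1.raw.g.vcf --variant sample2.raw.g.vcf --variant sample3.raw.g.vcf \
    -L nextera_exome_targets.interval_list -ip 100 \
    --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.2.3.trio.raw.vcf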


REGARDING COMBINING VARIANTS IN CLOUDMAP IN GALAXY

Problems interpreting Mutect2 output

Hello,

I am calling variants with Mutect2 (default parameters) from bulk WGS tumor/normal pairs following the somatic SNV Best Practices, and in the VCF outputs I am finding a lot of variants like this (the last sample column is the tumor):

chr1 1037759 . CTT C . PASS ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=4.60;RPA=12,10;RU=T;STR;TLOD=26.17 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:31,0:NaN:0:0:0,0:0:0 0/1:13,21:1.00:8:13:0,611:0:0

The genotype suggested for the tumor is heterozygous, yet the AF is 1.00. I also see that the QSS for the reference allele is 0, but with IGV I checked that the base and mapping qualities at this position are normal for both reference- and alternative-supporting reads, and that they are primary alignments with their mates mapped. Up to 14% of my calls have AF=1.00, which seems very weird to me for this type of analysis.

It doesn't happen for all the deletions, though:

chr1 1128849 . CTT C . PASS ECNT=1;HCNT=2;MAX_ED=.;MIN_ED=.;NLOD=3.77;RPA=11,9;RU=T;STR;TLOD=20.27 GT:AD:AF:ALT_F1R2:ALT_F2R1:QSS:REF_F1R2:REF_F2R1 0/0:24,0:0.00:0:0:60,0:0:2 0/1:20,13:0.867:6:7:60,403:2:0

In this case QSS is not 0, and I feel that it could be related: (13*403)/((20*60)+(13*403)) = 0.814, close to the AF.

Most of the cases are indels, but not all of them:

chr2 28505882 . A G . PASS ECNT=1;HCNT=8;MAX_ED=.;MIN_ED=.;NLOD=4.00;TLOD=23.46 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/0:16,0:0.00:0:0:.:149,0:4:1 0/1:2,15:1.00:1:0:0.00:0,30:0:0

So, in summary, I don't understand the way in which AF is calculated. Am I misunderstanding the AF concept or the way it works for this type of sample? Or am I missing the reason for QSS=0? Should I use AD for my calculations instead?
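
For comparison, a minimal sketch (bcftools plus awk, hypothetical file name; the tumor is assumed to be the second genotype column, as in the records above) that recomputes a simple AD-based allele fraction next to the reported AF:

bcftools query -f '%CHROM\t%POS[\t%AD\t%AF]\n' somatic_calls.vcf | \
    awk -F'\t' '{ split($5, ad, ","); if (ad[1] + ad[2] > 0) print $1, $2, "AD-based AF=" ad[2]/(ad[1]+ad[2]), "reported AF=" $6 }'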

Thank you very much!

GATK version: 3.7-0-gcfedb67
Java version: 1.8.0_31
WGS paired end samples
Bulk Tumor/Normal pairs
Sequenced with HiSeqX using TruSeq Nano DNA (350) library kit

Selectively eliminating a set of chromosomes per sample in a multisample VCF

Hello,
In my multisample VCF file, some samples have aneuploid chromosome numbers. For example, Sample1 might be 2N for chr1, chr2, and chr3. Sample2, however, might be 2N for chr1 and chr3, but 3N for chr2.

Is there a way to selectively exclude all of the sample genotypes for Sample2/chr2, while leaving all other sample genotypes for chr2 (and all other chromosomes for Sample2) intact?

I already tried removing the individual chromosomes in the sample .g.vcf files using vcftools and the "--not-chr" options, then re-running GenotypeGVCFs. However, this still included calls for the excluded chromosomes for all of the specified samples, and I can't figure out where the calls themselves were originating from.

Perhaps there is a way to set sample genotypes to "NoCall" for individual chromosomes?

Thanks,
Mike

VCF - Combining two columns into one column

I have a VCF file with multiple samples (multiple columns). There are replicate samples.

Essentially, I want to combine two columns into one. The GT:AD:DP:GQ:PL data would obviously have to be recalculated. (The two columns representing replicate samples are replaced by a single column that represents them both together). I have tried CombineVariants, but the two columns are not combined into one.

Is there a GATK tool to do this for a VCF? Or do I have to go all the way back and concatenate the replicate sample FASTQ files together before rerunning the GATK pipeline? Is there any other way to combine replicate samples into one at some point in time other than when concatenating their FASTQs?
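
One alternative to re-concatenating the FASTQs (a sketch with hypothetical file names): merge the replicates at the BAM level after giving both read groups the same SM sample name, then re-run calling from the merged BAM. HaplotypeCaller groups reads by the SM tag, so the replicates would then appear as a single sample column:

# Give both replicate BAMs the same sample name (SM)
java -jar picard.jar AddOrReplaceReadGroups I=rep1.bam O=rep1.sm.bam \
    RGID=rep1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=sampleX
java -jar picard.jar AddOrReplaceReadGroups I=rep2.bam O=rep2.sm.bam \
    RGID=rep2 RGLB=lib2 RGPL=ILLUMINA RGPU=unit2 RGSM=sampleX

# Merge, then rerun HaplotypeCaller/GenotypeGVCFs on the merged BAM
java -jar picard.jar MergeSamFiles I=rep1.sm.bam I=rep2.sm.bam O=sampleX.merged.bam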

Problem with hard filtering

Hi All,
I am trying to use GATK for variant filtration (hard filtering) for a non-human species, using the command below after selecting SNPs only:

java -d64 -Xmx48g -jar /home/mbxao2/R-drive/tools/GATK/GenomeAnalysisTK.jar \
-R Ref.fa \
-T VariantFiltration \
-V input.vcf \
-o output.vcf \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL > 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 2 " \
--filterName "LowQD" \
--filterExpression "MQRankSum < -12.5" \
--filterName "default_SNP_filter" \
--filterExpression "ReadPosRankSum < -8.0" \
--filterName "default_SNP_filter" \
--filterExpression "FS > 200 " \
--filterName "StrandBias"

but I get the error below:

INFO 15:24:26,590 GenomeAnalysisEngine - Preparing for traversal
INFO 15:24:26,611 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:24:26,611 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 15:24:26,611 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:24:26,612 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime

ERROR --
ERROR stack trace

java.lang.NumberFormatException: For input string: "4.21"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.commons.jexl2.JexlArithmetic.toLong(JexlArithmetic.java:906)
at org.apache.commons.jexl2.JexlArithmetic.compare(JexlArithmetic.java:718)
at org.apache.commons.jexl2.JexlArithmetic.lessThan(JexlArithmetic.java:774)
at org.apache.commons.jexl2.Interpreter.visit(Interpreter.java:967)
at org.apache.commons.jexl2.parser.ASTLTNode.jjtAccept(ASTLTNode.java:18)
at org.apache.commons.jexl2.Interpreter.interpret(Interpreter.java:232)
at org.apache.commons.jexl2.ExpressionImpl.evaluate(ExpressionImpl.java:65)
at htsjdk.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:178)
at htsjdk.variant.variantcontext.JEXLMap.get(JEXLMap.java:94)
at htsjdk.variant.variantcontext.JEXLMap.get(JEXLMap.java:15)
at htsjdk.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:341)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.matchesFilter(VariantFiltration.java:483)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.buildVCfilters(VariantFiltration.java:474)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.filter(VariantFiltration.java:379)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:318)
at org.broadinstitute.gatk.tools.walkers.filters.VariantFiltration.map(VariantFiltration.java:99)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "4.21"
ERROR ------------------------------------------------------------------------------------------

I couldn't find any information about this error. Is there anything I can do about this issue?
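
One common cause of this error (a sketch, not verified on this data): when a threshold in a JEXL expression is written as an integer literal, the annotation value (here "4.21") can be coerced to a long and fail to parse. Writing every numeric threshold as a decimal usually avoids it, e.g. replacing the corresponding expressions above with:

--filterExpression "MQ0 >= 4.0 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5.0" \
--filterName "LowCoverage" \
--filterExpression "QD < 2.0" \
--filterName "LowQD" \
--filterExpression "FS > 200.0" \
--filterName "StrandBias"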

Thanks in advance,

Ahmed
