Context :
We are working with targeted sequencing data (amplicon gene panel). Depth is high, so adjusting the -minPruning value seems relevant.
-minPruning argument : Paths with fewer supporting kmers than the specified threshold will be pruned from the graph.
Default value : 2.
So far, values 2 & 10 have been tested in our pipeline.
At first glance I was expecting faster MuTect2 runs with -minPruning=10 (this was the case), and fewer variants compared to the same run with -minPruning=2 (this was the case too). I was also expecting that the -minPruning=10 variants would be "included" in the -minPruning=2 variants.
The thing is, the VCFs contain 3 "kinds" of variants :
- common to both -minPruning=2 & -minPruning=10.
- unique to -minPruning=2.
- unique to -minPruning=10.
So, the only explanation I can come up with is that during the HaplotypeCaller re-assembly of an active region, depending on the -minPruning value, the graph can be "discarded" or "skipped" if it is too messy or does not reach the required kmer coverage.
Trying to apply this "explanation" to the 3 "kinds" of variants :
- common to both -minPruning=2 & -minPruning=10 : for this active region, the HC graphs for minPruning=2 & minPruning=10 are similar and allow the same variant to be called in both cases.
- unique to -minPruning=2 : for this particular region, with -minPruning=10 the graph is pruned because its paths have fewer than 10 supporting kmers.
- unique to -minPruning=10 : for this particular region, the graph was too messy with -minPruning=2 and variant calling was skipped.
PS : This was investigated for a small targeted region (chr5:68470404-68471404).
Command lines are given below :
prun=2;
java -jar $GATK -T MuTect2 \
-nct 1 \
-R ../Hg19_files/ucsc.hg19.fasta \
-L chr5:68470404-68471404 \
-I:tumor ../Outputs/V2_noDups/3_BQSR/p10_noDups.2.bam \
-I:normal ../Outputs/V2_noDups/3_BQSR/p10_nor_noDups.2.bam \
--dbsnp ../Hg19_files/dbsnp_138.hg19.vcf \
--cosmic ../Hg19_files/cosmic_coding_and_noncoding_chr_M_sorted.vcf \
-PON ../Outputs/V2_noDups/5_PoN/pon_noDups.vcf \
-contamination $(awk 'NR==2 {print $4/100}' ../Outputs/V2_noDups/4_ContEst/p10_ContEst.txt) \
--annotation DepthPerAlleleBySample \
--annotation BaseQualitySumPerAlleleBySample \
--annotation TandemRepeatAnnotator \
--annotation OxoGReadCounts \
--annotation StrandAlleleCountsBySample \
-minPruning $prun \
--bamOutput p10_noDups_minPrun"$prun"_chr5_68470404_68471404_bamout.bam \
-o p10_noDups_minPrun"$prun"_chr5_68470404_68471404.vcf;
prun=10;
java -jar $GATK -T MuTect2 \
-nct 1 \
-R ../Hg19_files/ucsc.hg19.fasta \
-L chr5:68470404-68471404 \
-I:tumor ../Outputs/V2_noDups/3_BQSR/p10_noDups.2.bam \
-I:normal ../Outputs/V2_noDups/3_BQSR/p10_nor_noDups.2.bam \
--dbsnp ../Hg19_files/dbsnp_138.hg19.vcf \
--cosmic ../Hg19_files/cosmic_coding_and_noncoding_chr_M_sorted.vcf \
-PON ../Outputs/V2_noDups/5_PoN/pon_noDups.vcf \
-contamination $(awk 'NR==2 {print $4/100}' ../Outputs/V2_noDups/4_ContEst/p10_ContEst.txt) \
--annotation DepthPerAlleleBySample \
--annotation BaseQualitySumPerAlleleBySample \
--annotation TandemRepeatAnnotator \
--annotation OxoGReadCounts \
--annotation StrandAlleleCountsBySample \
-minPruning $prun \
--bamOutput p10_noDups_minPrun"$prun"_chr5_68470404_68471404_bamout.bam \
-o p10_noDups_minPrun"$prun"_chr5_68470404_68471404.vcf;
The resulting VCFs contain :
For minPruning=2 :
chr5 68470680 . A AT . alt_allele_in_normal;panel_of_normals ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=76.08;TLOD=43.37 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:796,65:0.069:0:0:.:22889,1241:0:0:431,365,0,65 0/0:465,34:0.063:0:0:.:12741,646:0:0:241,224,0,34
chr5 68470913 . C T . PASS ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=146.71;TLOD=9.80 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:943,19:0.014:0:0:.:27476,480:0:0:470,473,8,11 0/0:488,0:0.00:0:0:.:13599,0:0:0:216,272,0,0
chr5 68471144 . TA T . alt_allele_in_normal;clustered_events;t_lod_fstar ECNT=2;HCNT=18;MAX_ED=12;MIN_ED=12;NLOD=144.47;RPA=3,2;RU=A;STR;TLOD=6.25 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:977,29:0.022:0:0:.:24369,584:0:0:475,502,2,27 0/0:593,21:0.028:0:0:.:14864,416:0:0:301,292,3,18
chr5 68471156 . TG T . alt_allele_in_normal;clustered_events;panel_of_normals ECNT=2;HCNT=4;MAX_ED=12;MIN_ED=12;NLOD=151.73;RPA=3,2;RU=G;STR;TLOD=40.65 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:956,59:0.052:0:0:.:27030,1379:0:0:422,534,55,4 0/0:596,18:0.023:0:0:.:16046,407:0:0:287,309,17,1
For minPruning=10 :
chr5 68471156 . TG T . alt_allele_in_normal;panel_of_normals ECNT=1;HCNT=8;MAX_ED=.;MIN_ED=.;NLOD=136.55;RPA=3,2;RU=G;STR;TLOD=40.71 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:899,54:0.051:0:0:.:25517,1244:0:0:393,506,50,4 0/0:559,17:0.023:0:0:.:15186,378:0:0:271,288,17,0
chr5 68471324 . CT C . alt_allele_in_normal;panel_of_normals;t_lod_fstar ECNT=1;HCNT=1;MAX_ED=.;MIN_ED=.;NLOD=249.61;RPA=5,4;RU=T;STR;TLOD=5.03 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1:SAC 0/1:1281,51:0.032:0:0:.:36659,1387:0:0:551,730,16,35 0/0:996,35:0.027:0:0:.:26231,889:0:0:511,485,6,29
- 3 variants are unique to minPruning=2 (chr5:68470680:A->AT, chr5:68470913:C->T & chr5:68471144:TA->T)
- 1 variant is common (chr5:68471156:TG->T)
- 1 variant is unique to minPruning=10 (chr5:68471324:CT->C)
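For anyone who wants to reproduce this common/unique breakdown, here is a minimal sketch of how the two VCFs could be compared. The helper names vcf_keys and compare_vcfs are hypothetical (they are not part of GATK), and the sketch assumes that two records are the same variant when CHROM, POS, REF and ALT match exactly, without any indel normalization.

```shell
# Hypothetical helpers (not part of GATK): classify calls from two runs.
# Assumption: two records are the same variant when CHROM, POS, REF and ALT
# match exactly; no indel normalization is attempted.

# Reduce a VCF to sorted, unique "CHROM:POS:REF>ALT" keys, skipping headers.
vcf_keys() {
    awk '!/^#/ && NF >= 5 {print $1 ":" $2 ":" $4 ">" $5}' "$1" | LC_ALL=C sort -u
}

# Print one line per variant: a label, a tab, then the variant key.
compare_vcfs() {
    keys_a=$(mktemp)
    keys_b=$(mktemp)
    vcf_keys "$1" > "$keys_a"
    vcf_keys "$2" > "$keys_b"
    # comm needs both inputs sorted in the same locale, hence LC_ALL=C.
    LC_ALL=C comm -12 "$keys_a" "$keys_b" | awk '{print "common\t" $0}'
    LC_ALL=C comm -23 "$keys_a" "$keys_b" | awk '{print "only_first\t" $0}'
    LC_ALL=C comm -13 "$keys_a" "$keys_b" | awk '{print "only_second\t" $0}'
    rm -f "$keys_a" "$keys_b"
}

# Usage with the two VCFs produced by the commands above:
# compare_vcfs p10_noDups_minPrun2_chr5_68470404_68471404.vcf \
#              p10_noDups_minPrun10_chr5_68470404_68471404.vcf
```

Here "only_first" corresponds to calls unique to the minPruning=2 run and "only_second" to calls unique to the minPruning=10 run. Exact key matching is a deliberate simplification: the same indel represented differently in the two runs would be reported as two unique calls.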
IGV views of corresponding bamouts are attached (track_1 : reference, track_2 : minPruning=2, track_3 : minPruning=10).
IGV shows that when variants are "common", both tracks contain reads.
And when a variant is unique to one run, there are no reads in that particular region for the run with the other minPruning value.
So, the question is : Is this behaviour expected ? If so, is it correct to combine the VCF for minPruning=2 with the VCF for minPruning=10, to get the most out of MuTect2 ?