Collection of training resources for Galaxy courses
In this practical you will use several additional features not covered in the previous sessions. This will help you to:
Using the Copy datasets function, copy the following datasets to a new history:
This check can be performed before running the experiment, as it only requires a BED file containing the regions covered by the Exome Capture Kit. These files are freely available from the vendor sites.
Question:
father, mother or proband) for each input file. For more info see the QPLOT website.
Questions:
VCF files stored in Galaxy can be rapidly analyzed with vcf.iobio.io, a variant data inspector that quickly samples VCF files and visualizes a series of metrics. In the Galaxy history, click the display at vcf.iobio.io link.
The aim of this step is to reduce false positive calls by identifying low-quality variants. The best solution is to apply the GATK Variant Quality Score Recalibration. An alternative for small sample sets is to flag low-quality variants using the following criteria, according to GATK Best Practices (aka GATK hard filtering):
Filters for SNPs:
* QualByDepth < 2.0
* RMSMappingQuality < 40.0
* FS > 60.0
* HaplotypeScore > 13.0
* MQRankSum < -12.5
* ReadPosRankSum < -8.0

Filters for INDELs:
* QualByDepth < 2.0
* FS > 200.0
* ReadPosRankSum < -20.0
Note that you need to apply different filters to SNPs and INDELs. Browse the Published workflows section and run GATK Hard Filters. Edit the workflow to inspect its different sections, then execute it. The output now includes variants whose value in the FILTER column is different from PASS: these are considered low-quality variants and are assigned a low priority.
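To make the logic of the hard filters concrete, here is a minimal Python sketch that applies the thresholds listed above to a single variant. It is not the GATK implementation or the Galaxy workflow: the function names are hypothetical, the INFO keys `QD` and `MQ` are assumed to be the abbreviations of QualByDepth and RMSMappingQuality in your VCF, and a variant is treated as a SNP simply when REF and all ALT alleles are one base long.

```python
import operator

# Thresholds copied from the lists above. Assumption: QD and MQ are the
# INFO keys for QualByDepth and RMSMappingQuality in the input VCF.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("HaplotypeScore", operator.gt, 13.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("ReadPosRankSum", operator.lt, -20.0),
]


def parse_info(info):
    """Turn 'QD=1.5;FS=3.2;DB' into {'QD': '1.5', 'FS': '3.2'} (flags are dropped)."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_snp(ref, alt):
    """Simplified test: REF and every ALT allele are a single base."""
    return len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))


def hard_filter(ref, alt, info):
    """Return 'PASS' or the ';'-joined names of the annotations that failed."""
    rules = SNP_FILTERS if is_snp(ref, alt) else INDEL_FILTERS
    values = parse_info(info)
    failed = []
    for key, compare, threshold in rules:
        try:
            number = float(values[key])
        except (KeyError, ValueError):
            continue  # annotation missing or not numeric: rule is not evaluated
        if compare(number, threshold):
            failed.append(key)
    return ";".join(failed) if failed else "PASS"


print(hard_filter("A", "G", "QD=1.5;MQ=60.0;FS=3.0"))  # QD
print(hard_filter("A", "AT", "QD=5.0;FS=100.0"))       # PASS
```

The same FS value of 100.0 passes for an INDEL but would fail for a SNP, which is why the two variant classes must be filtered separately.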
Question:
* :question: How many variants are flagged as PASS?
:point_right: Hint: You can aggregate and count the variants by the flag in column FILTER with Join, Subtract and Group -> Group as follows:
* Select data: VCF output of the workflow
* Group by column: select the value (c1, 1st column; c2, 2nd column; …) corresponding to the column FILTER
* Ignore lines beginning with these characters: # (skip header lines)
* Operations: count on column [value corresponding to the column FILTER], do not round results.
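Outside Galaxy, the same grouping can be sketched in a few lines of Python. This is only an illustration of what the Group tool computes, under the assumption that the input is a standard tab-separated VCF, where FILTER is the 7th column; the function name `count_by_filter` is hypothetical.

```python
from collections import Counter


def count_by_filter(vcf_lines):
    """Count variant records per value of the FILTER column (7th column)."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines (starting with '#') and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 7:
            continue  # malformed record: no FILTER column
        counts[columns[6]] += 1
    return counts


example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr8\t100\t.\tA\tG\t50\tPASS\tQD=10\n",
    "chr8\t200\t.\tC\tT\t20\tLowQual\tQD=1\n",
    "chr8\t300\t.\tG\tA\t60\tPASS\tQD=12\n",
]
print(count_by_filter(example)["PASS"])  # 2
```

To run it on a file downloaded from your history, pass an open file handle instead of the example list, e.g. `count_by_filter(open("filtered.vcf"))`, where `filtered.vcf` is a placeholder name.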
SnpEff Variant effect and annotation is a popular tool for annotating VCF files. It populates the INFO column of your file with the new annotations and adds a short description of them to the VCF header.
Question:
* :question: Can you find the new annotations in your output VCF files?
To annotate your VCF with information extracted from internal resources, e.g. allele frequencies from a reference population, you can run GATK Variant annotator. Briefly, it takes a VCF as input and adds annotations extracted from the INFO column of one or more other VCF files. Let’s assume you want to annotate which variants in your set are present in NCBI ClinVar, a database of variants of clinical relevance. To do that, execute GATK Variant annotator as follows:
Then run GATK Variant annotator with the following parameters:
* Variant file to annotate: your VCF to annotate
* Using reference genome: hg19
* Provide a dbSNP Reference-Ordered Data (ROD) file: don’t set dbSNP (reduces the computation time)
* Binding for reference-ordered resource data:
  * ROD file: your VCF with annotations - clinvar_YYYYMMDD_hg19.vcf
  * ROD name: a shortname for this file to be used in the next step (no spaces allowed) - clinvar
* Expressions: to annotate with the CLNSIG (Variant Clinical Significance, from 0 to 7) and CLNDBN (Variant disease name) parameters from ClinVar, enter the two following expressions:
  * clinvar.CLNSIG
  * clinvar.CLNDBN
* Choose the bed file with target regions: add NexteraRapidCaptureExpandedExome_Target.hg19.chr8.padding200.bed in Advanced GATK options -> Operate on genomic interval
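Conceptually, the parameters above tell the annotator to look up each of your variants in the resource VCF and copy the requested INFO fields, prefixed with the ROD name. The sketch below illustrates that idea in plain Python. It is not how GATK is implemented: the function names are hypothetical, and it assumes that variants are matched on identical CHROM, POS, REF and ALT values and that both files use the same chromosome naming.

```python
def parse_info(info):
    """Turn 'CLNSIG=5;CLNDBN=x;FLAG' into {'CLNSIG': '5', 'CLNDBN': 'x'}."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def load_resource(resource_lines, fields):
    """Index the resource VCF by (CHROM, POS, REF, ALT), keeping only `fields`."""
    table = {}
    for line in resource_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        values = parse_info(cols[7])
        key = (cols[0], cols[1], cols[3], cols[4])
        table[key] = {f: values[f] for f in fields if f in values}
    return table


def annotate(vcf_lines, table, rod_name):
    """Append '<rod_name>.<field>=<value>' to INFO for every matching variant."""
    annotated = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#") or not line.strip():
            annotated.append(line)
            continue
        cols = line.split("\t")
        match = table.get((cols[0], cols[1], cols[3], cols[4]), {})
        extra = [f"{rod_name}.{field}={value}" for field, value in match.items()]
        if extra:
            existing = [] if cols[7] in ("", ".") else [cols[7]]
            cols[7] = ";".join(existing + extra)
        annotated.append("\t".join(cols))
    return annotated


# Made-up records for illustration only; not real ClinVar content.
clinvar = ["chr8\t100\trs1\tA\tG\t.\t.\tCLNSIG=5;CLNDBN=Example_disease"]
sample = ["chr8\t100\t.\tA\tG\t50\tPASS\tQD=10"]
table = load_resource(clinvar, ["CLNSIG", "CLNDBN"])
print(annotate(sample, table, "clinvar")[0].split("\t")[7])
# QD=10;clinvar.CLNSIG=5;clinvar.CLNDBN=Example_disease
```

Variants that are absent from the resource keep their INFO column unchanged, so after annotation you can recognise ClinVar variants by the presence of the `clinvar.` keys.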
If you want to export the final VCF to an Excel-compatible file, run the VCFtoTab-delimited tool.
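If you prefer to do the conversion yourself, the idea can be sketched as follows. This is a simplified illustration, not the VCFtoTab-delimited tool: the function name `vcf_to_tab` is hypothetical, only the first seven fixed VCF columns are exported, and you choose which INFO keys become extra columns.

```python
import csv
import io


def vcf_to_tab(vcf_lines, info_keys, out):
    """Write one tab-separated row per variant, with selected INFO keys as columns."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"] + list(info_keys))
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        cols = line.rstrip("\n").split("\t")
        info = {}
        for item in cols[7].split(";"):
            key, sep, value = item.partition("=")
            if sep:
                info[key] = value
        # Missing INFO keys become empty cells.
        writer.writerow(cols[:7] + [info.get(key, "") for key in info_keys])


buffer = io.StringIO()
vcf_to_tab(
    ["chr8\t100\t.\tA\tG\t50\tPASS\tQD=10;clinvar.CLNSIG=5"],
    ["QD", "clinvar.CLNSIG"],
    buffer,
)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer writes a `.tsv` file that spreadsheet programs can import.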
Identification of Runs of Homozygosity (RoH) is a strategy to limit the search for candidate genes to specific chromosomal regions in consanguineous families. You can identify RoH in your family with the following tools:
10, corresponding to 10 Kb) for Filter by Runs of Homozygosity (ROH).
The software will return only the variants located in a RoH with length greater than this value. In the tabular output the last two columns contain the number of SNPs and length of the RoH.
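To illustrate what a length threshold on RoH means, here is a deliberately naive Python sketch. It is not the algorithm used by the Galaxy tools: it assumes the genotype calls of one sample on one chromosome are already sorted by position, treats any heterozygous or missing call as the end of a run, and uses hypothetical function names. Real RoH callers are usually more tolerant of occasional heterozygous calls.

```python
def is_homozygous(genotype):
    """True for diploid calls such as '0/0', '1/1' or '1|1'; False for '0/1' or './.'."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and alleles[0] == alleles[1] and alleles[0] != "."


def find_roh(calls, min_length_kb):
    """Return (start, end, number_of_snps, length_bp) for each run longer than the threshold.

    `calls` is a list of (position, genotype) tuples sorted by position.
    """
    runs = []
    run_positions = []
    # A trailing heterozygous sentinel closes a run that reaches the last call.
    for position, genotype in list(calls) + [(None, "0/1")]:
        if is_homozygous(genotype):
            run_positions.append(position)
            continue
        if run_positions:
            length_bp = run_positions[-1] - run_positions[0] + 1
            if length_bp > min_length_kb * 1000:
                runs.append((run_positions[0], run_positions[-1], len(run_positions), length_bp))
        run_positions = []
    return runs


calls = [
    (1000, "0/0"), (5000, "1/1"), (20000, "1|1"),  # run of 19,001 bp
    (25000, "0/1"),                                # heterozygous call ends the run
    (30000, "1/1"), (32000, "1/1"),                # run of 2,001 bp, below threshold
]
print(find_roh(calls, 10))  # [(1000, 20000, 3, 19001)]
```

As in the tabular output described above, the last two values of each reported run are the number of SNPs and the length of the RoH.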