Galaxy Training

Collection of training resources for Galaxy courses

Advanced Exome Analysis using Galaxy

Aims

In this practical you will use several additional features not covered in the previous sessions. This will help you to:

Before starting

Using the Copy datasets function, copy the following datasets to a new history:

Check the coverage of your favourite genes

This check can be performed before running the experiment, as it only requires a BED file containing the regions covered by the Exome Capture Kit. These files are freely available from the vendor sites.
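As a rough illustration of what such a check computes, the sketch below measures which fraction of a gene interval is covered by the target regions of a BED file. It assumes BED coordinates are 0-based and half-open; the function names and the example coordinates are made up for illustration and do not come from any vendor file.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def covered_fraction(gene, regions):
    """Return the fraction of the gene interval covered by at least one region."""
    chrom, g_start, g_end = gene
    # Clip every overlapping target region to the gene boundaries.
    pieces = sorted(
        (max(start, g_start), min(end, g_end))
        for c, start, end in regions
        if c == chrom and start < g_end and end > g_start
    )
    covered = 0
    cursor = g_start
    for start, end in pieces:
        start = max(start, cursor)  # do not count overlapping bases twice
        if end > start:
            covered += end - start
            cursor = end
    return covered / (g_end - g_start)


# Hypothetical target regions and gene interval, for illustration only.
targets = parse_bed(["chr8\t100\t200", "chr8\t150\t300", "chr8\t500\t600"])
print(covered_fraction(("chr8", 100, 500), targets))
```

Within Galaxy the same question is answered with interval tools; the sketch only makes the underlying arithmetic explicit.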

Question:

Quality control of aligned reads

Questions:

Quickly examine your VCF file

VCF files stored in Galaxy can be rapidly analyzed with vcf.iobio.io, a variant data inspector tool that quickly samples VCF files and visualizes a series of metrics. In the Galaxy history, click the display at vcf.iobio link to open your file in vcf.iobio.io.

Flag low quality variants

The aim of this step is to reduce the number of false positive calls by identifying low quality variants. The best solution is to apply the GATK Variant Quality Score Recalibration. An alternative for small sample sets is to flag low quality variants using the following criteria, according to GATK Best Practices (aka GATK hard filtering):

Filters for SNPs:

* QualByDepth < 2.0
* RMSMappingQuality < 40.0
* FS > 60.0
* HaplotypeScore > 13.0
* MQRankSum < -12.5
* ReadPosRankSum < -8.0

Filters for INDELs:

* QualByDepth < 2.0
* FS > 200.0
* ReadPosRankSum < -20.0

Note that you need to apply different filters to SNPs and INDELs. Browse the Published workflows section and run GATK Hard Filters. Open the workflow in the editor to inspect its different sections, then execute it. The output now includes variants whose value in column FILTER is different from PASS: these are considered low-quality variants and are assigned a low priority.
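The logic of the hard filters can be sketched in a few lines of Python. This is not the workflow's implementation: the thresholds are copied from the lists above, while the function names, the filter labels and the choice to ignore missing annotations are assumptions made for illustration.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

# (annotation, comparison, threshold): a variant FAILS when the comparison holds.
SNP_FILTERS = [
    ("QualByDepth", "<", 2.0),
    ("RMSMappingQuality", "<", 40.0),
    ("FS", ">", 60.0),
    ("HaplotypeScore", ">", 13.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]
INDEL_FILTERS = [
    ("QualByDepth", "<", 2.0),
    ("FS", ">", 200.0),
    ("ReadPosRankSum", "<", -20.0),
]


def is_snp(ref, alt):
    """Treat a record as a SNP when REF and every ALT allele are one base long."""
    return len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))


def failed_filters(ref, alt, annotations):
    """Return labels of the failed filters; an empty list corresponds to PASS."""
    rules = SNP_FILTERS if is_snp(ref, alt) else INDEL_FILTERS
    failed = []
    for name, op, threshold in rules:
        value = annotations.get(name)
        # Assumption: a missing annotation never triggers a filter.
        if value is not None and OPS[op](value, threshold):
            failed.append(f"{name}{op}{threshold}")
    return failed


# Hypothetical annotation values for one SNP and one INDEL.
print(failed_filters("A", "G", {"QualByDepth": 1.5, "FS": 10.0}))
print(failed_filters("A", "AT", {"FS": 100.0, "ReadPosRankSum": -5.0}))
```

The second call shows why SNPs and INDELs need separate rule sets: an FS of 100.0 would fail the SNP threshold but is acceptable for an INDEL.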

Question:

* :question: How many variants are flagged as PASS?

:point_right: Hint: You can aggregate and count the variants by the flag in column FILTER with Join, Subtract and Group -> Group as follows:

* Select data: VCF output of the workflow
* Group by column: select the value (c1, 1st column; c2, 2nd column; …) corresponding to the column FILTER
* Ignore lines beginning with these characters: # (skip header lines)
* Operations: count on column [value corresponding to the column FILTER], do not round results.
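Outside Galaxy, the same grouping can be sketched in Python. The example relies on the VCF convention that FILTER is the seventh tab-separated column; the function name and the sample records are illustrative only.

```python
from collections import Counter


def count_by_filter(vcf_lines):
    """Count variant records per FILTER value, skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # same effect as "Ignore lines beginning with #"
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        counts[fields[6]] += 1  # FILTER is column 7 (index 6)
    return counts


# Made-up records; the filter label "LowQD" is hypothetical.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr8\t100\t.\tA\tG\t50\tPASS\tDP=10",
    "chr8\t200\t.\tC\tT\t12\tLowQD\tDP=3",
    "chr8\t300\t.\tG\tA\t60\tPASS\tDP=20",
]
for flag, n in count_by_filter(example).items():
    print(flag, n)
```

To run it on a real file, pass an open file handle instead of the list, for example `count_by_filter(open("variants.vcf"))`, where the file name is a placeholder.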

Annotations with SnpEFF

SnpEFF Variant effect and annotation is a popular tool for the annotation of VCF files. It populates the INFO column of your file with the new annotations and adds a short description of each one to the header of the VCF.

Question:

* :question: Can you find the new annotations in your output VCF files?
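One way to look for the new annotations is to list the INFO fields declared in the VCF header before and after annotation. The sketch below assumes the standard `##INFO=<ID=...>` header syntax; the exact field added by SnpEff depends on its version (commonly `ANN`, or `EFF` in older releases), so check the header of your own output.

```python
import re

# Matches header lines such as: ##INFO=<ID=DP,Number=1,Type=Integer,...>
INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def info_ids(vcf_lines):
    """Return the INFO field IDs declared in the VCF header, in order."""
    ids = []
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = INFO_ID.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def new_info_ids(before_lines, after_lines):
    """Return INFO IDs present after annotation but not before."""
    before = set(info_ids(before_lines))
    return [i for i in info_ids(after_lines) if i not in before]
```

Calling `new_info_ids` with the input and output VCF of the annotation step lists the fields the tool introduced.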

Annotate with your internal resources

To annotate your VCF with information extracted from internal resources, e.g. allele frequency from a reference population, you can run GATK Variant annotator. Briefly, it takes a VCF as input and adds the annotations extracted from the INFO column of multiple VCF files. Let’s assume you want to annotate which variants in your set are present in NCBI ClinVar, a database of variants of clinical relevance. To do that, execute GATK Variant annotator as follows:

Then run GATK Variant annotator with the following parameters:

* Variant file to annotate: your VCF to annotate
* Using reference genome: hg19
* Provide a dbSNP Reference-Ordered Data (ROD) file: don’t set dbSNP (reduces the computation time)
* Binding for reference-ordered resource data:
  * ROD file: your VCF with annotations - clinvar_YYYYMMDD_hg19.vcf
  * ROD name: a shortname for this file to be used in the next step (no spaces allowed) - clinvar
* Expressions: to annotate with the CLNSIG (Variant Clinical Significance, from 0 to 7) and CLNDBN (Variant disease name) parameters from ClinVar, enter the two following expressions:
  * clinvar.CLNSIG
  * clinvar.CLNDBN
* Choose the bed file with target regions: add NexteraRapidCaptureExpandedExome_Target.hg19.chr8.padding200.bed in Advanced GATK options -> Operate on genomic interval
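Conceptually, the expressions copy selected INFO fields from the resource VCF into matching records of your VCF. The Python sketch below illustrates that idea only; it is not how GATK is implemented. It assumes records are matched on chromosome, position, REF and ALT, which may be stricter than the tool's own matching, and that the copied keys are prefixed with the ROD name.

```python
def parse_info(info):
    """Turn an INFO string such as 'A=1;B=2;FLAG' into a dict."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # flags carry no value
    return out


def index_resource(vcf_lines):
    """Index resource records by (chrom, pos, ref, alt)."""
    index = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        index[(f[0], f[1], f[3], f[4])] = parse_info(f[7])
    return index


def annotate(vcf_lines, resource, name, fields):
    """Yield VCF lines with '<name>.<field>' entries appended to INFO."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        f = line.split("\t")
        hit = resource.get((f[0], f[1], f[3], f[4]))
        if hit:
            extra = [f"{name}.{k}={hit[k]}" for k in fields if k in hit]
            if extra:
                kept = [] if f[7] in ("", ".") else [f[7]]
                f[7] = ";".join(kept + extra)
        yield "\t".join(f)
```

With a resource indexed from the ClinVar VCF, `annotate(lines, resource, "clinvar", ["CLNSIG", "CLNDBN"])` mirrors the two expressions listed above.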

If you want to export the final VCF to an Excel-compatible file, run the VCFtoTab-delimited tool.
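For a quick export outside Galaxy, a simplified conversion can be sketched as follows. It drops the `##` meta-information lines and keeps the column header and records as tab-separated rows; unlike a dedicated tool, it leaves the INFO column as a single field. The function name is illustrative.

```python
import csv
import io


def vcf_to_tsv(vcf_lines, out):
    """Write the #CHROM header and the records of a VCF as tab-separated rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines are not tabular
        line = line.rstrip("\n")
        if line.startswith("#"):
            line = line[1:]  # "#CHROM" becomes the column name "CHROM"
        writer.writerow(line.split("\t"))


# Made-up records written to an in-memory buffer.
buffer = io.StringIO()
vcf_to_tsv(
    [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr8\t100\t.\tA\tG\t50\tPASS\tDP=10",
    ],
    buffer,
)
print(buffer.getvalue())
```

Replacing the buffer with a file opened in write mode produces a file that spreadsheet programs can import as tab-delimited text.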

Runs of Homozygosity

Identification of Runs of Homozygosity (RoH) is a strategy to limit the search for candidate genes to specific chromosomal regions in consanguineous families. You can identify RoH in your family with the following tools: