What Does Quality Filtering Genomic Data Do

Get-go steps in genomic data analysis

Compiled by Julia M.I. Barth, 21 January 2020

Introduction

The Variant Call Format (VCF)
Difficult quality filtering of variants
- Inspection of quality measurements
- Awarding of hard filtering thresholds
Further quality filtering
- Filter on minimum read depth (DP) and genotype quality (GQ)
- Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels
- Remove individuals with a high amount of missing data
- Remove variants with a loftier corporeality of missing genotypes and filter on minor allele frequency
Performing a principal component assay (PCA) to inspect the data structure

Introduction

Background and Aim
This activity will guide yous though some of the the initial steps in population genomic data analyses – that is, the first steps after the sequencing reads were mapped against a reference genome assembly and variants were called (if you lot're interested in mapping and variant calling, cheque out the "Workshop on Genomics" exercises).
The aim is to get some hands-on experience working with VCF files, exercise some of the considerations in producing a subset of high quality SNPs, and to explore the data structure using a master component analysis (PCA). The aim is likewise to apply and expand on the UNIX and R skills that you learned in the previous activities: UNIX and R.

Learning goals

After this activeness, you should exist able to perform beginning quality checks and filtering steps, preparing a population genomic dataset file to explore e.g., population structure.

Apply the UNIX and R skills yous learned in the previous exercises.
Learn about the VCF format and how to handle and manipulate VCF files.
Plot different parameters to check the overall quality of the information fix.
Perform hard filtering, removing low quality variants.
Perform farther filtering to ready the data file for analyses.
Explore the data structure using a PCA.

We will piece of work with a data prepare containing genomic information of 204 individuals of Atlantic cod (Gadus morhua) sampled at seven sites in the Atlantic, North Sea, and Baltic Sea (Barth JMI et al. (2019) Disentangling structural genomic and behavioural barriers in a sea of connectivity. Mol Ecol 28, 1394–1411).

Sample map and picture of Atlantico cod

The individuals in this data set were sequenced using Illumina HiSeq paired-end sequencing to ~10x coverage, reads were mapped using BWA-mem to the Atlantic cod reference assembly, and variants were called jointly for a total number of about 860 individuals using the GATK Haplotypecaller pipeline. All filtering steps were performed per chromosome (i.eastward., per linkage group) for parallelized processing, followed past merging of high quality variants of all linkage groups for follow-up analyses.

In this activity, we volition perform similar filtering steps, starting with the unfiltered raw VCF, but nosotros will work with a reduced data file that contains only the 204 individuals in the study mentioned above, and only 10% randomly subsampled variants of one linkage group (LG five) to reduce time and computer resources (the original compressed LG 5 VCF file contained 3,281,615 variants and had a file size for xi GB). By randomly subsampling the variants, physically very shut variant sites that may share the same data (they are in linkage disequilibrium) have already been excluded, for which reason this tutorial does non contain a stride on filtering for linked sites. If you are interested in those steps, accept a look at the tutorials from 2016 and 2018.

How to do this activity

Connect to your Amazon instance via the terminal using SSH.
Questions or activities are indicated with Q.
Switch to Rstudio in the browser (http://ec2-Xx-Xxx-Thirty-30.compute-1.amazonaws.com:8787) if y'all come across this: R job .
All text in red on gray background indicates commands or file names that you can blazon or copy to the terminal. Text in black on grayness background refers to terminal outputs.
Endeavor to work independently using the provided manuals and information, discuss questions with your neighbour. If you get stuck, bank check the answer-box:

Show me the answer!

Oh no, simply if y'all get actually stuck!! Starting time try to find the respond yourself!

There are often alternative options and other solutions to the tasks. E.grand., whether y'all prefer true cat over less doesn't matter in near cases – go alee and apply your solution! Don't hesitate to ask the workshop team if you have questions!

ane) The Variant Phone call Format (VCF)

VCF (Variant Call Format) is a standardized text file format that is used to store genetic variation calls such every bit SNPs or insertions/deletions. The full format specifications and valuable information about the dissimilar tags can exist found here.
In the following first function of the exercise, nosotros will explore how the information in a VCF is stored, and how we tin can inspect information technology.

Q1 Open the terminal and navigate to the assay folder including the data file for this activity: /dwelling house/popgen/workshop_materials/21_first_steps_genomics.
What is the file size of cod204.lg05.i.vcf.gz? How many bytes and how many MB does it have?

Prove me the respond!

cd /home/popgen/workshop_materials/21_first_steps_genomics
ls -fifty "long list format" to see a list of the folder content including the file size: the VCF file is 320855142 bytes.
ls -lh to see the content list with the file sizes printed in "human being-readable" format: it's 306 M (MiB, to see the the size in actuall megabyte (MB) use: ls -fifty --block-size=MB.

Tip: for more functions of ls type ls --help or check the transmission with man ls.

The file cod204.lg05.i.vcf.gz is a very small VCF file with reduced file size for the purpose of this activity. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.

Q2 What is written in the first ten lines of the VCF file cod204.lg05.one.vcf.gz?

Show me a hint!

If you used head cod204.lg05.i.vcf.gz you will simply run into gibberish. Why?
Information technology's a compressed file (indicated by the catastrophe ".gz").
Compressing a VCF file with gzip or bgzip and indexing it with tabix is the common style in which huge VCF files are stored.

To see the first ten lines, you could decompress the entire file. All the same, this is not advisable every bit it would take several minutes (with larger files even hours) and information technology would inflate the file size to several gigabyte (GB).

Instead, you can make utilize of a program that allows you to view a role of the decompressed the VCF file without decompressing the whole file. There are several different options and programs to do this – can yous think of one?

Show me the answer!

Y'all could, for example, utilise either:
zcat cod204.lg05.1.vcf.gz | caput (zcat is decompressing the file, but head tells the commands to terminate decompressing any farther later on x lines are printed)
Or you could use:
zless cod204.lg05.1.vcf.gz (similarly, zless simply decompresses the part you are actually displaying; leave zless just like less by pressing q)
Or y'all could use:
bcftools view cod204.lg05.i.vcf.gz | caput (BCFtools ia a software for variant calling and manipulation of VCF files. We will extensively piece of work with BCFtools in the remaining exercise.)

Tip: Check out what these programs are doing with --aid or man.

The start line is already telling you that this is indeed a VCF file:
##fileformat=VCFv4.2

These first lines, starting with ##, are office of the "VCF header", which stores meta-information about the data. If y'all use zless you tin scroll down to see more of the header and also the genotype data, which make up the bulk of the file.

There can exist a lot of data in the header. Unlike meta-data lines, eastward.yard. "INFO", "FILTER", "FORMAT", "ALT" describe types of data that are given for every variant in the VCF file. A few examples:

##INFO<ID=AC,Number=A,Blazon=Integer,Description="Allele count in genotypes, for each ALT allele, in the aforementioned guild as listed">: The number of of often the culling allele is occuring.
##FORMAT<ID=GT,Number=1,Type=String,Description="Genotype">: The genotypes coded every bit allele values separated by / (or | if the data is phased).

The allele values are 0 for the reference allele and 1 for the alternative allele. If there are more alternative alleles the second alternative allele will have the value 2, the third 3, etc. Missing information is denoted past dots ..

##contig<ID=LG01,length=28303952> : Part of a listing of chromosomes, contigs, or scaffolds of the reference assembly.

For more than information about the different tags, accept a look at the VCF file format description.

The very terminal line of the header specifies the column names for the genotype data. The first eight columns are mandatory, follow by "FORMAT" and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AVE1409002 AVE1409003 AVE1409004 [...] TVE130512 TVE130513 TVE130514

You may desire to edit these header lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to come across what they did.

Q3 Find a command in BCFtools that allows you to divide the header from the big body of genotype information. Salvage the header in a separate text file named cod204.lg05.1.header.txt.
To notice the control, have a look at the BCFtools manual, or just type bcftools on the command line to see the available options.

Give me a hint!

Have a await at the bcftools view option. Either type bcftools view on the command line, or have a look the manual.

Show me the reply!

bcftools view -h cod204.lg05.1.vcf.gz > cod204.lg05.one.header.txt
With view -h merely the VCF header is selected (the header is small, therefore the small "h"). By using the capital -H the header is suppressed and only the "large" trunk of genotype information is considered.

Q4 What is the file size of cod204.lg05.1.header.txt, compared to the whole VCF file?

Show me the answer!

ls -50
The header is 7476 bytes (7 KB) small. This is only most 0.002% of the entire VCF, and a size nosotros tin safely open with a text editor.

Q5 Open the header on the command line in a text editor. Add a new line beneath the line ##fileformat=VCFv4.2 stating the date of today: ##fileDate=20200121 .
Save the new header as: cod204.lg05.1.header.ed.txt. Use BCFtools to replace (or "rehead") the VCF file cod204.lg05.1.vcf.gz with the new edited header and save the re-headed VCF as cod204.lg05.i.ed.vcf.gz.
Make sure your changes took effect and your file is nevertheless valid by visually inspecting the file and the file size of cod204.lg05.1.ed.vcf.gz.

Tip: Type bcftools to encounter general options for the program; and so apply bcftools reheader to see the more specific options for the reheader function.

Show me the respond!

Make apply of a text editor: nano cod204.lg05.one.header.txt or vi cod204.lg05.1.header.txt
Add ##fileDate=20200121, save the file (nano: "Ctrl + o", half-dozen: ":w"; don't forget to save under the new name), and close the text editor (nano: Ctrl + 10, vi: ":q")
bcftools reheader -h cod204.lg05.1.header.ed.txt cod204.lg05.one.vcf.gz -o cod204.lg05.1.ed.vcf.gz (with -h the header file is specified, with -o the new output file name)
With zless cod204.lg05.i.ed.vcf.gz you can check the header lines, and by using ls -lh you can come across if the file has still roughly the aforementioned size.

In the genotype part of the VCF file, one line corresponds to one variant site. Each line starts with the chromosome ID "CHROM", the position "POS", and the variant "ID", and is followed past the bases of the two alleles ("REF" for the reference allele, and "ALT" for the alternative allele(s)). The sixth column ("QUAL") is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, "PASS" tells u.s. that the variant passed all filters.

In the "INFO" column, information related to the header info tags for each variant site is given (e.g., Ac is the allele count in genotypes for the "ALT" allele(southward), AF the allele frequency of the "ALT" allele, and AN is the total number of alleles, etc. We volition use some of the "INFO" field specifications afterwards for difficult quality filtering).

The "FORMAT" cavalcade specifies the fields and order of the information that is given with each individual for each genotype and the remaining columns listing each individual with the values corresponding to the "FORMAT" field. The first "FORMAT" field is e'er the genotype (GT), the remaining fields are variable, simply in our data file, GT is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood):

This is how the start variant of the genotype role of the VCF file looks similar:
LG05 338 . A G 93.01 . Air-conditioning=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=one;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:x,0:ten:30:0,30,381 [...] 0/0:seven,0:7:21:0,21,248

Q6 How many variant sites are stored in the VCF?

Show me the answer!

"Give-and-take count" (wc) tin exist used to count words, characters, or lines. Use wc --help to see the options.
We know that the genotype part of the VCF file contains one line per variant site, meaning that we tin use BCFtools to suppress the header and forward (pipe; |) the output to wc -l. Using the flag -50 means "count the lines":

bcftools view -H cod204.lg05.1.vcf.gz | wc -fifty (takes ~1 min)
328253

Q7 Examine the first two variant sites in the VCF file and respond the following questions:

A) What kind of variants do yous see (east.g., SNP, insertion/deletion)?

Show me the answer!

You can inapect the variants using zless cod204.lg05.1.vcf.gz

The first variant site ("338") is a SNP. Columns four and five for "REF" (the reference allele) and "ALT" (the alternative allele) show A and G.
The second variant site ("357") is a deletion. Column four contains multiple bases, while column five shows a Yard.

LG05 338 . A Chiliad
LG05 357 . GTGGGCGCTACTCCAGGCAAAGGAGGACTCTGTACCGGAGGCTTGGATCAGCT G

Sometimes you may likewise meet an asterisk (*), which represents a spanning deletion.

B) What is the genotype (GT) of the beginning individual (AVE1409002) at the first variant site (LG05 338)?

Evidence me the answer!

The first individual has the genotype 0/0, which ways twice the reference allele A (A/A). The allele values in the GT field are 0 for the reference allele (the "REF" column) and 1 for the outset alternative allele ("ALT").
GT:Advert:DP:GQ:PL 0/0:20,0:twenty:57:0,57,855

C) What is the genotype read depth (DP, specified as "FMT/DP" to distinguish information technology from the mean DP "INFO/DP" per variant site) for the first iii individuals (AVE1409002, AVE1409003, AVE1409004) at the outset variant site (LG05 338)?

Show me the answer!

The read depths (DP) are 20, 9, and 10.
GT:Advertising:DP:GQ:PL
0/0:20,0:20:57:0,57,855
0/0:9,0:9:27:0,27,345
0/0:x,0:x:30:0,30,381

ii) Difficult quality filtering of variants

To remove low-quality variants (e.grand., variants with a low read depth or variants only supported by poorly aligned reads) from our data set, we will utilize hard quality filtering. In the terminology of the authors of GATK, "hard filtering" refers to using specific filtering cutoffs for specific annotations. This stands opposed to using the so-called Variant Quality Score Recalibration (VQSR) method, which does not utilise any specific "hard cutoffs" merely combines data from multiple annotations from a known, highly validated variant training fix in a machine learning algorithm. Variants that are non passing the chosen thresholds or the VQSR filtering can either be completely removed or they can exist kept in the VCF, but a flag is then added to them noting that they did non pass the filter. Since frequently such a grooming prepare is not bachelor, we volition apply "difficult quality filtering" hither.
The following part of the activity is adjusted from this "GATK tutorial".

2.ane) Inspection of quality measurements

Earlier we choose the measurements and thresholds for hard quality filtering, allow's start by having a look at some of these measurements and the distributions of their values.

We will consider the following vii measurements, which are given for each variant site in the "INFO" cavalcade of the VCF (more than information about each of these measurements is given farther below):
Fisher Strand (FS), Strand Odds Ratio (SOR), RMS Mapping Quality (MQ), Mapping Quality Rank Sum Exam (MQRankSum), Quality by depth (QD), Read Position Rank Sum Examination (ReadPosRankSum), and the combined depth across samples (INFO/DP).

Below, the measurements of the kickoff variant site are highlighted:
LG05 338 . A G 93.01 . Ac=ane;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=iii.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=i;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=ane.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:xx,0:20:57:0,57,855 0/0:nine,0:ix:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248

Q8 Employ the BCFtools "query" pick to extract the Fisher Strand (FS) values of all variants and salve the output in a separate text file named cod204.lg05.1_FS.txt.

Tip: Expect at the examples given in the manual. The FS values should be saved every bit one column, with one line for the value of each variant.

Show me the reply!

bcftools query cod204.lg05.1.vcf.gz -f'%FS\n' > cod204.lg05.1_FS.txt (takes ~one min)
With the -f selection, you lot tin can define the output format. The %FS indicates that for each variant, the FS value should be printed, and \due north tells the plan that after each FS value there should be a new line.

Q9 Audit the file. How many columns are in the file? How many lines? Does the number stand for to the number of variant sites in the VCF? Practise the outset values represent to the FS values of the starting time variants in the VCF file? Utilise your UNIX skills!

Show me the answer!

Using head cod204.lg05.1_FS.txt will show you the first lines. There is only 1 column.
Typing wc -l cod204.lg05.1_FS.txt will show you the number of lines, which is the same as the number of variant sites nosotros have ( 328253 )
With bcftools view -H cod204.lg05.1.vcf.gz | less y'all can visually check if the first variants in the file and compare them to the extracted FS values.

R task1 Now lets switch to R studio to plot the FS values. Open Rstudio in your internet browser:
http://ec2-XXX-Xxx-Thirty-Xxx.compute-ane.amazonaws.com:8787

Open the provided script "gda_plotting.R" in R and follow the points two) and ii.1) in that R script.

Show me the plot!

Fisher Strand (FS), as well as Strand Odds Ratio (SOR), are measurements for strand bias – that is a sequencing bias, in which the reads of one DNA strand (e.g., the forrad strand) support one genotype (e.thousand., a heterozygous phone call), while the reads for the contrary strand support e.one thousand., a homozygous genotype. In cases of farthermost strand bias, SNPs are non correctly called.

Fisher Strand (FS) is the Phred-scaled probability that in that location is strand bias at that site. It tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. A value shut to 0 indicates picayune or no strand bias. GATK recommends hard filtering of variants with FS greater than lx, only this is a very conservative value that will only filter out cases of extreme bias and get out many false positives in the data set.

Q10 Based on the to a higher place data and the plotted distribution of FS, choose a FS filtering threshold to remove low quality variants.

Prove me the answer!

Nosotros could stick with the GATK recommendation and remove variants with FS > 60. However, since this is a conservative value and our distribution tells united states, that we would not loose many more variants if nosotros lower the threshold to, let's say, FS > 40 (0.64% of variants have FS > 60; 1.11% have FS > xl), we could equally well set the threshold to FS > forty.
Setting such thresholds depends on the quality of the data set up, as well equally the planned downstream analyses. Here, we plan to inspect population structure, so we could hazard removing some false negatives for the sake of overall quality improvement.

Q11 Use the BCFtools "query" option to jointly extract the Fisher Strand (FS), Strand Odds Ratio (SOR), Mapping Quality Rank Sum Test (MQRankSum), Read Position Rank Sum Test (ReadPosRankSum), Quality by depth (QD), RMS Mapping Quality (MQ),and the combined depth beyond samples (INFO/DP) and save the values separated by the tab character in a split up text file named cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt.

Tip: Each measurement should be saved in one cavalcade separated past tabs, with each variant on a single line.

Show me the answer!

bcftools query cod204.lg05.ane.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\northward' > cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt

Again, -f defines the output format. The %FS\t%SOR\t… indicates that for each variant first the FS value should be printed, then a tab \t should exist printed, followed past the SOR value, followed by a tab etc… . At the end, the \n tells the program that after all six measurements, there should exist a new line.

R task2 At present lets switch over again to R studio to plot the distributions of all measurements. Employ the provided script "gda_plotting.R" and follow the point 2.1.one).

Testify me the plot!

Q12 As you lot did for the FS values above, use the distribution plots and the information about each quality measurement given below to come with a meaningful filtering threshold.

Strand Odds Ratio (SOR) is also a measurement for strand bias, but it takes into account the ratios of reads that cover both alleles and is therefore a improve measurement for sites that accept more reads in ane direction than the other. SOR values range from 0 to greater than nine, a value close to 0 indicates piffling or no strand bias. GATK recommends difficult filtering of variants with SOR greater than 3.
RMS Mapping Quality (MQ) and Mapping Quality Rank Sum Test (MQRankSum) are both measures for mapping quality. The Root Mean Square (RMS) mapping quality (MQ) of reads across all samples provides an estimation of the overall mapping quality to back up a variant call. It includes the standard deviation (SD) of the mapping qualities, where a low SD indicates values shut to the mean. Skillful mapping qualities are effectually MQ 60. GATK recommends difficult filtering of variants with MQ less than 40.
The Mapping Quality Rank Sum Test (MQRankSum) compares the mapping qualities of reads supporting the reference and the alternate allele. If all reads map correctly, the expected MQRankSum is 0. Negative values bespeak a lower MQ of the reads supporting the alternate allele, compared to reads supporting the reference allele. MQRankSum values usually range from about -10.5 to 6.5. GATK recommends hard filtering of extreme outlier variants with MQRankSum less than -12.five.
Quality by depth (QD) is the "confidence in a call" (i.e., the QUAL score), normalized past the read depth at that site. GATK recommends hard filtering of variants with QD below 2.
The Read Position Rank Sum Test (ReadPosRankSum) measures the relative position of the reference versus the alternative allele within reads. Eastward.m., SNPs towards the ends of reads may be more likely a sequencing error. A value effectually 0 means there is footling to no deviation in where the alleles are found relative to the ends of reads. GATK recommends hard filtering of variants with ReadPosRankSum less than -8.0.
The combined depth across samples (INFO/DP) is only the sum of the read depth (i.e., the genotype depth (FORMAT/DP) fields) over all samples. Variants that take a high mean depth across all samples are indicative of either repetitive regions or paralogs. A good rule of thumb is to remove variants that show twice the hateful read depth. You can calculate the hateful read depth in R: hateful(cod204.lg05.1_hf.values$DP). Now, don't be shocked about the loftier read depth. This field has not been updated later on reducing the vcf from 860 individuals to 204 individuals, and then the information is given for all 860 individauls. We can use this field withal for excluding variants with read depth larger than twice the mean.

Show me the answer!

Again, choosing filtering thresholds is highly depended on the quality of the data and the planned downstream analyses. Here are some possible suggestions for variants to exclude:

FS (GATK recommendation FS > threescore): FS > 40 (we are more strict hither)
SOR (GATK recommendation SOR > 3): SOR > 3 (this seems just fine)
MQ (GATK recommendation MQ < xl): MQ < 40 (we would loose many variants with a lower threshold, so we stick to the recommended)
MQRankSum (GATK recommendation MQRankSum > -12.5): MQRankSum > -5.0 and MQRankSum < 5.0 (a value around 0 is preferred and we don't loose many variants if nosotros set strict thresholds)
QD (GATK recommendation QD > two): QD > two (we would loose many variants with a lower threshold, so nosotros stick with the recommendation)
ReadPosRankSum (GATK recommendation ReadPosRankSum > -8.0): ReadPosRankSum > -4.0 (not many variants are below -viii, and then we can set information technology more strict)
INFO/DP (no GATK recommendation, dependend on sample number and sequencing depth): INFO/DP > xvi,000 (the hateful DP is 8400.935, we exclude sites that bear witness about twice the mean DP)

2.2) Application of hard filtering thresholds

We will now utilize the filtering thresholds and remove variants that don't lucifer the chosen thresholds.

Q13 Using the BCFtools "filter" option, come up upward with a command that will produce a high-quality data file. Y'all may use the thresholds you chose, or adapt them according to the suggestions in a higher place.
Don't forget to specify the input file cod204.lg05.1.vcf.gz, the output format (we'll utilise -O z since nosotros would like to relieve the data file again as compressed VCF), and the output filename (-o cod204.lg05.1.hf.vcf.gz). Type bcftools filter in the command line to see the options.

Tip: It is possible to either apply -east and the logical operator "or" || to exclude low-quality variants from the file, or apply -i and the logical operator "and" && to include loftier-quality variants.

Running the filtering volition take about five minutes, so time for a break and some coffee!

Show me the code!

A possible solution choosing the "exclude" option:
bcftools filter -e 'FS>40.0 || SOR>3 || MQ<forty || MQRankSum<-5.0 || MQRankSum>v.0 || QD<2.0 || ReadPosRankSum<-4.0 || INFO/DP>16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.ane.vcf.gz

A possible solution choosing the "include" option:
bcftools filter -i 'FS<xl.0 && SOR<3 && MQ>40.0 && MQRankSum>-5.0 && MQRankSum<5 && QD>ii.0 && ReadPosRankSum>-4.0 && INFO/DP<16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.1.vcf.gz

! Be enlightened that the commands will not produce the exact same results. The reason for this is that some measurements will not be available for all sites, because there are, for example, too many missing genotypes to calculate this measurement. In such cases, these sites would non be excluded (because there is no measurement to exclude), so they would remain in the filtered file, while with the include choice, they would not exist included (since there is no measurement based on which they could be included).

Q14 Bank check the filtered file:

What is the file size?
How many variant sites have been removed? Tip: see Q6
Has the FS filter been applied correctly? What is the largest FS value?
Tip: apply bcftools query as you did above (Q8) but instead of saving the output, pipe it directly to sort. Use the correct flags with "sort" (come across the options with sort --help and merely check the first values by pipe the output again to head.

Show me the answer!

! Annotation, the values below may be different depending on the filter thresholds y'all accept called.

ls -lh
213M – the file is about thirty% smaller than the unfiltered one.
bcftools view -H cod204.lg05.ane.hf.vcf.gz | wc -l
231014 – the file also has about 30% fewer variant sites.
bcftools query cod204.lg05.i.hf.vcf.gz -f'%FS\n' | sort -gr | head – we need the sort flag -g since nosotros have floating point numbers, and since we are interested in the largest value, we demand to reverse the sorting using the flag -r. Using sort -g | tail would work but the aforementioned fashion.
39.994 39.989 39.913 39.91
bcftools query cod204.lg05.1.hf.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\north' > cod204.lg05.1.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt

R task3 Extract again all quality measurements (FS, SOR, MQ, MQRS, QD, RPRS, DP) every bit you did above (see Q11), save the output in a new text file named cod204.lg05.one.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt, and plot the distributions using Rstudio and the script "gda_plotting.R" (section 2.1.2).

Prove me the plot!

3) Further quality filtering

After hard quality filtering, we have a data prepare containing only variant sites that we trust with high confidence. However, and so far we have just removed variants that had skewed values across all samples – we haven't nonetheless looked at the individual genotypes at each variant site. For example, we may have kept a variant that has a loftier quality across all 204 individuals, simply there are three individuals where the read depth (FMT/DP) or the genotype quality (GQ) for the individual genotype is very low, so we don't trust the calls for these individuals.

LG05 338 . A Yard 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=three.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=one.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:twenty,0:xx:57:0,57,855 0/0:9,0:nine:27:0,27,345 0/0:ten,0:10:30:0,30,381 0/0:5,0:v:15:0,15,209 [...] 0/0:seven,0:vii:21:0,21,248

Furthermore, nosotros volition exclude SNPs that are in the close proximity of indels. That is, because the proper alignment of sequences around indels can often be problematic and produce false positives.

For example, dependent on the "gap opening penalty", a region can exist aligned to have four indels (‐) and a SNP.
CTTTGAACCAAATTT----ACTGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACT--AGMT-TGTTGCC

Or it tin can be aligned to accept three indels (‐) and four SNPs.
CTTTGAACCAAATTT--TCAGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACTAG1000T-TGTTGCC

In addition, we will also remove monomorphic SNPs (i.e., SNPs that were called because they differ from the reference, but that are not polymorphic in our sample).

Then, since we are only interested in biallelic SNPs for the downstream analyses in this action, we will remove indels and multiallelic SNPs (for data sets were indels are likewise going to be analysed, information technology is recommended to subset indels course SNPs as the very first footstep and conform the above hard filtering thresholds separately).

And terminal but not to the lowest degree, we will remove individuals with high amounts of missing data, too as variants with lots of missing genotypes, and we volition filter on the minor allele frequency (MAF).

3.i) Filter on minimum read depth (DP) and genotype quality (GQ)

We will again use bcftools filter and filter reads that accept a read depth below 3 or a genotype quality beneath 20. The difference is, that instead of the whole variant site, we are at present considering unmarried genotypes (the information in the "FORMAT" fields") using -e 'FMT/DP<3 | FMT/GQ<20' and nosotros will non remove the whole variant site, but simply fix the respective genotypes to missing (./.) past using the bcftools filter -S . command.
Use the already "hard-filtered" data file as input cod204.lg05.1.hf.vcf.gz, specify a compressed VCF every bit the output format -O z, and name the new output file (-o cod204.lg05.1.hf.DP3.GQ20.vcf.gz).

Q15 Puzzle the code together with the help of the to a higher place information, bcftools filter, and the manual. Information technology takes ~3 min to run the command.

Show me the command!

bcftools filter -Southward . -e 'FMT/DP<3 | FMT/GQ<20' -O z -o cod204.lg05.i.hf.DP3.GQ20.vcf.gz cod204.lg05.1.hf.vcf.gz

Hither nosotros utilise the logical operator "|" instead of "||", since using "||" would mean that every genotype at a variant site is fix to missing, even if simply one genotype doesn't fulfill the threshold. More than information about the utilise of these operators in BCFtools can be found hither.

You may wonder whether a minimum read depth of 3 is sufficient to confidently call the variant. In our instance it is, since the GATK Haplotypecaller pipeline calls genotypes across all 860 samples simultaneously and includes information from all reads from all individuals to adjust the genotype quality. A telephone call with DP=3 may thus have a high GQ and can be confidently kept.

Q16 Permit's have a brief visual cheque whether our filter has been applied correctly:

With the following control we extract genotypes of samples that accept a read depth lower than three -i 'FMT/DP<iii' and in the output we would like to see the genotype (GT), followed past the read depth (DP) and the genotype quality (GQ) for that sample -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n'.

Earlier you lot execute the post-obit control, accept a brief thought nearly the expected output regarding the genotypes.

bcftools query -i 'FMT/DP<3' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\north' cod204.lg05.one.hf.DP3.GQ20.vcf.gz | less -S

Show me the answer!

With our filtering control above, we specified to gear up genotypes to missing ./.. Thus, when just extracting genotypes of samples that have a read depth lower than three, nosotros would expect them to be all ready to missing.

GT=./. DP=i GQ=iii GT=./. DP=ii GQ=vi GT=./. DP=1 GQ=iii GT=./. DP=1 GQ=3 GT=./. DP=ane GQ=3 GT=./. DP=i GQ=3 GT=./. DP=1 GQ=iii GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=three GT=./. DP=2 GQ=vi GT=./. DP=one GQ=3 GT=./. DP=one GQ=3 GT=./. DP=2 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=one GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6 GT=./. DP=2 GQ=half-dozen GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=i GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=2 GQ=half-dozen GT=./. DP=0 GQ=. GT=./. DP=1 GQ=iii GT=./. DP=two GQ=6 GT=./. DP=one GQ=iii GT=./. DP=ii GQ=vi [...]

You may too check the genotypes with a read depth of exactly iii. bcftools query -i 'FMT/DP=three' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\due north' cod204.lg05.1.hf.DP3.GQ20.vcf.gz | less -South.
You will see that there are only few genotypes with a high genotype quality (GQ) that are non set to missing.

3.2) Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels

To remove monomorphic SNPs, we will use bcftools filter as before to exclude -eastward all sites at which no alternative alleles are called for any of the samples Ac==0 and all sites at which only alternative alleles are chosen Air conditioning==AN. In add-on, to remove SNPs that are within 10 bp of indels, nosotros add together the flag --SnpGap x.
To reduce the information file to contain but biallelic SNPs, we utilise the bcftools view command using the flags -m2 -M2 -v snps, which set both the minimum -m and maximum -G number of alleles to 2, and keeps only snps via -v snps.

Y'all can execute both commands in one control line by piping | the output of the offset command (bcftools filter …) to the second command (bcftools view …). This style you merely have to specify the input file cod204.lg05.1.hf.DP3.GQ20.vcf.gz with the first control, and the output format -O z and output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz with the second control.

Q17 Apply the command by using the information higher up, bcftools filter and bcftools view, and the transmission. It will have ~2 min to run.

Show me the command!

bcftools filter -due east 'Air-conditioning==0 || AC==AN' --SnpGap 10 cod204.lg05.ane.hf.DP3.GQ20.vcf.gz | bcftools view -m2 -M2 -v snps -O z -o cod204.lg05.i.hf.DP3.GQ20.allele.vcf.gz

Q18 Verify that the filters take been applied by visually inspecting the file (e.g., using zless -S).

Prove me how!

You may use:
zless -S cod204.lg05.1.hf.DP3.GQ20.vcf.gz to see the original file,
and compare it with the output file:
zless -S cod204.lg05.one.hf.DP3.GQ20.allele.vcf.gz

Or: bcftools view -H cod204.lg05.one.hf.DP3.GQ20.allele.vcf.gz | less -S

By using less -Southward we "chop" the long lines and but display the beginning of each line, making information technology easier to compare the reference and culling alleles and the allele numbers. You tin scroll horizontally across the lines using the arrow keys.

Can you identify what has changed?

Q19 How many variants are left in our data file after all the in a higher place filtering steps have been applied?

Show me the answer!

bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | wc -l
82594
Recollect, we had 328,253 variants at the outset and 231,014 variants after the difficult quality filtering steps, now subsequently the last filtering steps we're down to 82,588.

3.3) Remove individuals with a high corporeality of missing data

Sometimes, sequencing doesn't work as well for some individuals equally it does for others. Such individuals will exist marked past depression quality genotypes and low read depth, resulting in loftier amounts of missing information afterward the above filtering steps.

We can bank check the amount of missing data by using the bcftools stats command. Accept a look at the options by typing bcftools stats in the concluding or cheque the manual for what it can do. With -due south - nosotros tin can request stats for all samples.

Run bcftools stats -southward - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz and quickly scroll through the large output.

Find the section PSC, Per-sample counts, which displays some summary statistics per sample. The last cavalcade named [14]nMissing displays the number of missing genotypes per sample.
There is also other valuable information that allows u.s. for example to appraise the effectiveness of the filtering steps higher up (e.g., in clolumn [9]nIndels, also nether PSC the number should exist 0 since we removed all indels).

As well have a quick await at the number of [eleven]nSingletons, listed in column 11. We will go back to that in part iii.four) of the activity.

Q20 Utilise your UNIX (and regex) skills to excerpt only the 3rd and 14th cavalcade of the PSC output to go a listing of sample names and numbers of missing genotypes. Salve the file as: cod204.lg05.1.hf.DP3.GQ20.allele.imiss.
Visually check that the file is as expected.

Bear witness me the code!

bcftools stats -due south - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | grep -Eastward ^PSC | cut -f3,xiv > cod204.lg05.i.hf.DP3.GQ20.allele.imiss
Do the visual cheque e.thousand., using less cod204.lg05.1.hf.DP3.GQ20.allele.imiss

R task4 Lets switch to R studio to plot the missingness per sample. Utilise the provided script "gda_plotting.R" and follow the point 3) of the R script.

Prove me the plot!

Q21 Utilise the bcftools view -south command to remove the three individuals with more lxx% missing information from the data file cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz .
Save the output as compressed VCF by using selection -O z and specify the output file proper name -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz with pick -o. Information technology takes ~1 min to run this command.

Show me the code

bcftools view -s ^ORE1203009,ORE1203008,ORE1203013 -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz

Q22 Count the number of samples in cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz. Accept the three individuals been successfully removed?

Show me the lawmaking

bcftools query -50 cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz | wc -fifty
201
Yes, our data set had 204 individuals earlier.

3.4) Remove variants with a loftier amount of missing genotypes and filter on minor allele frequency

In the missingness plot higher up, we saw that the data set up in general has a lot of missing data. This is considering nosotros gear up genotypes with low quality and low read depth to missing. As the penultimate pace, we will now remove all variants that have more than xx% missing data.

As the terminal footstep, we volition remove the singletons that you saw in a higher place in the bcftools stats command. Singletons – alleles that are observed simply in one case – and other alleles that have a low minor allele frequency (MAF) frequently present false calls, which can be obstructive in downstream analyses past adding noise to the data (note even so, that you may besides need them due east.g., for demographic analyses).
If the data will be used for a genome-wide association study (GWAS), usually a rather high MAF threshold is applied, as it requires very strong statistical power to make meaningful statements almost very rare alleles. When interested in population structure with many samples, alleles present in just 1 individual do not add information and tin can therefore be removed. Thus, the optimal MAF filtering threshold depends on the blazon of information you have, the analysis you are planning to practise, the data quality, and the sample size.

Q23 Knowing that our data gear up contains 201 individuals with a minimum of twenty individuals per sampling site, and that we volition use it to dissect population structure, what could exist an appropriate MAF filtering threshold?

Show me the answer!

The default MAF threshold in some programs is 0.01. Filtering for MAF 0.01 ways that remaining sites in the VCF take a minimum modest allele frequency of one%. Our data set contains 201 individuals (thus maximally 402 alleles per site). If we filter for MAF 0.01, a variant is excluded if its minor allele is found in less than 4.02 genotypes (1% of 402). This would remove alleles where the pocket-sized allele occurs 4 times or less, including the singletons.

A slightly higher MAF threshold of 0.05 would exclude all sites that are called less than 20.i times (5% of 402). Given that we know that nosotros take samples with xx individuals (40 alleles), such number of minor alleles could very well be truthful variants and should rather not be removed.

Note: for the above calculations, I assumed that we have 402 allels per site. However, in BCFtools, MAF is calculated every bit the minor allele count (MAC) divided past the total number of alleles (AN), meaning that missing data is taken into business relationship. Since we filter simultaneously for missing data, AN volition be between 322 and 402.
If missing data should non be taken into account, filtering tin also be done based on MAC.

Taken together, probably something between 0.01 and 0.03 could be a good MAF threshold for our data set.

Q24 Use bcftools filter -e to exclude all variants that accept more than 20% missing genotypes or have a minor allele frequency smaller or equal to 0.02 from the data file cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz. Save the output every bit compressed VCF -O z and use the output file proper noun -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz.

Prove me the code!

bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.02' -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz cod204.lg05.i.hf.DP3.GQ20.allele.missi.vcf.gz

Q25 Oft the alternative allele is too the pocket-size allel, thus nosotros tin can inspect the alternative allele count (Air conditioning) to see whether our MAF filter took outcome.
What would be the lowest culling allele count to wait?

Use once more bcftools query and excerpt the INFO field "Air-conditioning", pipage the output into sort and print only the start lines.

Show me the lawmaking!

We filtered for MAF 0.02, so we would expect a minimum AC of 7 (because 2% of 322 = 6.44).

bcftools query cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz -f'%Air conditioning\n' | sort -g | head
7
7
7
[…]

OPTIONAL R job: Switch to Rstudio, adjust the code of point 3) in the "gda_plotting.R" script and plot the missingness per sample afterwards the above filtering steps. Has it inverse?

Q26 How many variants (in pct) have been removed by all our filtering steps?

Show me the answer!

bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz | wc -l
From 328,253 variants, xi,864 SNPs are left. This ways we removed 96% per centum of the data.

four) Performing a principal component analysis (PCA) to inspect the data structure

The "smartpca" program implemented in the software EIGENSOFT uses a combination of master component assay (PCA; a technique used to identify and illustrate the strongest components of variation in a dataset) and new statistics designed specifically for genomic information. The methods are described in Patterson et al.(2006) PLOS Genetics 2(12): e190.

The EIGENSOFT smartpca needs iv input files: The ".ped" (containing the genotype information) and ".map" (containing the chromosome/position information) files that we will produce using the software PLINK1.nine, a ".pedind" file containing private information, and a '.par' file specifying the analysis settings.

Q27 Catechumen the filterd VCF file cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz to the PLINK ".ped/.map" format by running the following command:
plink --vcf cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz --aec --recode --out cod204.lg05.one.hf.DP3.GQ20.allele.missi.miss20.maf0.02

If you would like to know more nigh this format have a look here.

Q28 The EIGENSOFT smartpca is often not happy if the chromosome IDs in the ".map" file contain anything else than numbers.
Utilize sed to supervene upon the "LG" in the cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map file with "" (nada) and salve the output in a new file called tmp.
Verify that the new file is correct. If so, rename tmp again to cod204.lg05.ane.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map.

Prove me the command!

cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map | sed south'/LG//'g > tmp
The "stream editor" sed will substitute south/ the found expression "LG" with "" (nothing) within the "global" /thousand environment, meaning the whole line/file.

less tmp
mv tmp cod204.lg05.ane.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map

Q29 To generate the ".pedind" file, apply again your UNIX skills and cut the first 5 columns of the ".ped" file and save them in a new file chosen "tmp".

Show me the control!

cat cod204.lg05.i.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cutting -d ' ' -f1-v > tmp
The default delimiter whitespace in cut are tabs, just the .ped file has spaces. We tin ascertain this with -d ' '. With -f we specify columns i to v.

Have a expect at the "tmp" file. It contains the family IDs and the individual IDs in the kickoff two columns (in our instance this is the same), followed past zeros that betoken "no information" for the remaining columns (fatherID, motherID, sex). In addition, we need a column to assign individuals to populations.

Q30 Add a tab-separated sixth cavalcade specifying that AVE1409002-AVE1409024 vest to the population "AVE", BOR1104001-BOR1205019 belong to the population "BOR", etc.. .
To practice this, "cutting" the first three characters (see cutting --help) from cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped and "paste" them to the tmp file. Save the output as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind.
Remove the "tmp" file.

Show me the command!

cat cod204.lg05.one.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cut -c 1-3 | paste tmp - > cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind
"cut -c 1-three" selects the first 3 characters, which are combined via the standard output (stdout, represented by "-) with the saved "tmp" file.
rm -i tmp

Q31 A template parameter file (template.par) and a smartpca manual page (eigensoft_smartpca_readme.txt) can be establish in the analysis folder. Using the text editor of your option, modify the template.par to include the respective input file names, salvage the par file as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par, and run the smartpca with the syntax smartpca -p cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par > pca01.out.

R task5 Lets switch a last time to R studio to plot the PCA. Use the provided script "gda_plotting.R" and follow point iv) in the R script.

Show me the plot!

The starting time principal component centrality explains almost ii% of the total genomic variation and separates the eastern Baltic Sea (night blue) from the remaining samples. The second centrality explains 1% of the total genomic variation and separates the N Sea populations from the fjord-type cod. Although nosotros used just a tiny fraction of the information, the population structure is comparable with Fig. 1b in Barth JMI et al. (2019) Mol Ecol 28, 1394–1411.

Well washed!! Now it is time for socializing. My suggestion for todays pub is the "Zapa bar".