What Does Quality Filtering Genomic Data Do
Get-go steps in genomic data analysis
Compiled by Julia M.I. Barth, 21 January 2020
Table of contents
- Introduction
- The Variant Call Format (VCF)
- Difficult quality filtering of variants
- Inspection of quality measurements
- Awarding of hard filtering thresholds
- Further quality filtering
- Filter on minimum read depth (DP) and genotype quality (GQ)
- Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels
- Remove individuals with a high amount of missing data
- Remove variants with a loftier corporeality of missing genotypes and filter on minor allele frequency
- Performing a principal component assay (PCA) to inspect the data structure
Introduction
Background and Aim
This activity will guide yous though some of the the initial steps in population genomic data analyses – that is, the first steps after the sequencing reads were mapped against a reference genome assembly and variants were called (if you lot're interested in mapping and variant calling, cheque out the "Workshop on Genomics" exercises).
The aim is to get some hands-on experience working with VCF files, exercise some of the considerations in producing a subset of high quality SNPs, and to explore the data structure using a master component analysis (PCA). The aim is likewise to apply and expand on the UNIX and R skills that you learned in the previous activities: UNIX and R.
Learning goals
After this activeness, you should exist able to perform beginning quality checks and filtering steps, preparing a population genomic dataset file to explore e.g., population structure.
- Apply the UNIX and R skills yous learned in the previous exercises.
- Learn about the VCF format and how to handle and manipulate VCF files.
- Plot different parameters to check the overall quality of the information fix.
- Perform hard filtering, removing low quality variants.
- Perform farther filtering to ready the data file for analyses.
- Explore the data structure using a PCA.
We will piece of work with a data prepare containing genomic information of 204 individuals of Atlantic cod (Gadus morhua) sampled at seven sites in the Atlantic, North Sea, and Baltic Sea (Barth JMI et al. (2019) Disentangling structural genomic and behavioural barriers in a sea of connectivity. Mol Ecol 28, 1394–1411).
The individuals in this data set were sequenced using Illumina HiSeq paired-end sequencing to ~10x coverage, reads were mapped using BWA-mem to the Atlantic cod reference assembly, and variants were called jointly for a total number of about 860 individuals using the GATK Haplotypecaller pipeline. All filtering steps were performed per chromosome (i.eastward., per linkage group) for parallelized processing, followed past merging of high quality variants of all linkage groups for follow-up analyses.
In this activity, we volition perform similar filtering steps, starting with the unfiltered raw VCF, but nosotros will work with a reduced data file that contains only the 204 individuals in the study mentioned above, and only 10% randomly subsampled variants of one linkage group (LG five) to reduce time and computer resources (the original compressed LG 5 VCF file contained 3,281,615 variants and had a file size for xi GB). By randomly subsampling the variants, physically very shut variant sites that may share the same data (they are in linkage disequilibrium) have already been excluded, for which reason this tutorial does non contain a stride on filtering for linked sites. If you are interested in those steps, accept a look at the tutorials from 2016 and 2018.
How to do this activity
- Connect to your Amazon instance via the terminal using SSH.
- Questions or activities are indicated with
Q . - Switch to Rstudio in the browser (http://ec2-Xx-Xxx-Thirty-30.compute-1.amazonaws.com:8787) if y'all come across this:
R job . - All text in
red on gray background
indicates commands or file names that you can blazon or copy to the terminal. Text inblack on grayness background
refers to terminal outputs. - Endeavor to work independently using the provided manuals and information, discuss questions with your neighbour. If you get stuck, bank check the answer-box:
Show me the answer!
Oh no, simply if y'all get actually stuck!! Starting time try to find the respond yourself!
There are often alternative options and other solutions to the tasks. E.grand., whether y'all prefer true cat
over less
doesn't matter in near cases – go alee and apply your solution! Don't hesitate to ask the workshop team if you have questions!
ane) The Variant Phone call Format (VCF)
VCF (Variant Call Format) is a standardized text file format that is used to store genetic variation calls such every bit SNPs or insertions/deletions. The full format specifications and valuable information about the dissimilar tags can exist found here.
In the following first function of the exercise, nosotros will explore how the information in a VCF is stored, and how we tin can inspect information technology.
/dwelling house/popgen/workshop_materials/21_first_steps_genomics
.
What is the file size of cod204.lg05.i.vcf.gz
? How many bytes and how many MB does it have?
Prove me the respond!
cd /home/popgen/workshop_materials/21_first_steps_genomics
ls -fifty
"long list format" to see a list of the folder content including the file size: the VCF file is 320855142
bytes.
ls -lh
to see the content list with the file sizes printed in "human being-readable" format: it's 306 M
(MiB, to see the the size in actuall megabyte (MB) use: ls -fifty --block-size=MB
.
Tip: for more functions of ls
type ls --help
or check the transmission with man ls
.
The file cod204.lg05.i.vcf.gz
is a very small VCF file with reduced file size for the purpose of this activity. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.
cod204.lg05.one.vcf.gz
?
Show me a hint!
If you used head cod204.lg05.i.vcf.gz
you will simply run into gibberish. Why?
Information technology's a compressed file (indicated by the catastrophe ".gz").
Compressing a VCF file with gzip or bgzip and indexing it with tabix is the common style in which huge VCF files are stored.
To see the first ten lines, you could decompress the entire file. All the same, this is not advisable every bit it would take several minutes (with larger files even hours) and information technology would inflate the file size to several gigabyte (GB).
Instead, you can make utilize of a program that allows you to view a role of the decompressed the VCF file without decompressing the whole file. There are several different options and programs to do this – can yous think of one?
Show me the answer!
Y'all could, for example, utilise either:
zcat cod204.lg05.1.vcf.gz | caput
(zcat
is decompressing the file, but head
tells the commands to terminate decompressing any farther later on x lines are printed)
Or you could use:
zless cod204.lg05.1.vcf.gz
(similarly, zless
simply decompresses the part you are actually displaying; leave zless just like less by pressing q
)
Or y'all could use:
bcftools view cod204.lg05.i.vcf.gz | caput
(BCFtools ia a software for variant calling and manipulation of VCF files. We will extensively piece of work with BCFtools in the remaining exercise.)
Tip: Check out what these programs are doing with --aid
or man
.
The start line is already telling you that this is indeed a VCF file:
##fileformat=VCFv4.2
These first lines, starting with ##
, are office of the "VCF header", which stores meta-information about the data. If y'all use zless
you tin scroll down to see more of the header and also the genotype data, which make up the bulk of the file.
There can exist a lot of data in the header. Unlike meta-data lines, eastward.yard. "INFO", "FILTER", "FORMAT", "ALT" describe types of data that are given for every variant in the VCF file. A few examples:
-
##INFO<ID=AC,Number=A,Blazon=Integer,Description="Allele count in genotypes, for each ALT allele, in the aforementioned guild as listed">
: The number of of often the culling allele is occuring. -
##FORMAT<ID=GT,Number=1,Type=String,Description="Genotype">
: The genotypes coded every bit allele values separated by/
(or|
if the data is phased). -
##contig<ID=LG01,length=28303952>
: Part of a listing of chromosomes, contigs, or scaffolds of the reference assembly.
The allele values are 0
for the reference allele and 1
for the alternative allele. If there are more alternative alleles the second alternative allele will have the value 2
, the third 3
, etc. Missing information is denoted past dots .
.
For more than information about the different tags, accept a look at the VCF file format description.
The very terminal line of the header specifies the column names for the genotype data. The first eight columns are mandatory, follow by "FORMAT" and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AVE1409002 AVE1409003 AVE1409004 [...] TVE130512 TVE130513 TVE130514
You may desire to edit these header lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to come across what they did.
cod204.lg05.1.header.txt
.
To notice the control, have a look at the BCFtools manual, or just type bcftools
on the command line to see the available options.
Give me a hint!
Have a await at the bcftools view
option. Either type bcftools view
on the command line, or have a look the manual.
Show me the reply!
bcftools view -h cod204.lg05.1.vcf.gz > cod204.lg05.one.header.txt
With view -h
merely the VCF header is selected (the header is small, therefore the small "h"). By using the capital -H
the header is suppressed and only the "large" trunk of genotype information is considered.
cod204.lg05.1.header.txt
, compared to the whole VCF file?
Show me the answer!
ls -50
The header is 7476 bytes (7 KB) small. This is only most 0.002% of the entire VCF, and a size nosotros tin safely open with a text editor.
##fileformat=VCFv4.2
stating the date of today: ##fileDate=20200121
.
Save the new header as: cod204.lg05.1.header.ed.txt
. Use BCFtools to replace (or "rehead") the VCF file cod204.lg05.1.vcf.gz
with the new edited header and save the re-headed VCF as cod204.lg05.i.ed.vcf.gz
.
Make sure your changes took effect and your file is nevertheless valid by visually inspecting the file and the file size of cod204.lg05.1.ed.vcf.gz
.
Tip: Type bcftools
to encounter general options for the program; and so apply bcftools reheader
to see the more specific options for the reheader function.
Show me the respond!
Make apply of a text editor: nano cod204.lg05.one.header.txt
or vi cod204.lg05.1.header.txt
Add ##fileDate=20200121
, save the file (nano: "Ctrl + o", half-dozen: ":w"; don't forget to save under the new name), and close the text editor (nano: Ctrl + 10, vi: ":q")
bcftools reheader -h cod204.lg05.1.header.ed.txt cod204.lg05.one.vcf.gz -o cod204.lg05.1.ed.vcf.gz
(with -h
the header file is specified, with -o
the new output file name)
With zless cod204.lg05.i.ed.vcf.gz
you can check the header lines, and by using ls -lh
you can come across if the file has still roughly the aforementioned size.
In the genotype part of the VCF file, one line corresponds to one variant site. Each line starts with the chromosome ID "CHROM", the position "POS", and the variant "ID", and is followed past the bases of the two alleles ("REF" for the reference allele, and "ALT" for the alternative allele(s)). The sixth column ("QUAL") is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, "PASS" tells u.s. that the variant passed all filters.
In the "INFO" column, information related to the header info tags for each variant site is given (e.g., Ac is the allele count in genotypes for the "ALT" allele(southward), AF the allele frequency of the "ALT" allele, and AN is the total number of alleles, etc. We volition use some of the "INFO" field specifications afterwards for difficult quality filtering).
The "FORMAT" cavalcade specifies the fields and order of the information that is given with each individual for each genotype and the remaining columns listing each individual with the values corresponding to the "FORMAT" field. The first "FORMAT" field is e'er the genotype (GT), the remaining fields are variable, simply in our data file, GT is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood):
This is how the start variant of the genotype role of the VCF file looks similar:
LG05 338 . A G 93.01 . Air-conditioning=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=one;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:x,0:ten:30:0,30,381 [...] 0/0:seven,0:7:21:0,21,248
Show me the answer!
"Give-and-take count" (wc) tin exist used to count words, characters, or lines. Use wc --help
to see the options.
We know that the genotype part of the VCF file contains one line per variant site, meaning that we tin use BCFtools to suppress the header and forward (pipe; |
) the output to wc -l
. Using the flag -50
means "count the lines":
bcftools view -H cod204.lg05.1.vcf.gz | wc -fifty
(takes ~1 min)
328253
A) What kind of variants do yous see (east.g., SNP, insertion/deletion)?
Show me the answer!
You can inapect the variants using zless cod204.lg05.1.vcf.gz
The first variant site ("338") is a SNP. Columns four and five for "REF" (the reference allele) and "ALT" (the alternative allele) show A and G.
The second variant site ("357") is a deletion. Column four contains multiple bases, while column five shows a Yard.
LG05 338 . A Chiliad
LG05 357 . GTGGGCGCTACTCCAGGCAAAGGAGGACTCTGTACCGGAGGCTTGGATCAGCT G
Sometimes you may likewise meet an asterisk (*), which represents a spanning deletion.
B) What is the genotype (GT) of the beginning individual (AVE1409002) at the first variant site (LG05 338)?
Evidence me the answer!
The first individual has the genotype 0/0
, which ways twice the reference allele A (A/A). The allele values in the GT field are 0 for the reference allele (the "REF" column) and 1 for the outset alternative allele ("ALT").
GT:Advert:DP:GQ:PL 0/0:20,0:twenty:57:0,57,855
C) What is the genotype read depth (DP, specified as "FMT/DP" to distinguish information technology from the mean DP "INFO/DP" per variant site) for the first iii individuals (AVE1409002, AVE1409003, AVE1409004) at the outset variant site (LG05 338)?
Show me the answer!
The read depths (DP) are 20, 9, and 10.
GT:Advertising:DP:GQ:PL
0/0:20,0:20:57:0,57,855
0/0:9,0:9:27:0,27,345
0/0:x,0:x:30:0,30,381
ii) Difficult quality filtering of variants
To remove low-quality variants (e.grand., variants with a low read depth or variants only supported by poorly aligned reads) from our data set, we will utilize hard quality filtering. In the terminology of the authors of GATK, "hard filtering" refers to using specific filtering cutoffs for specific annotations. This stands opposed to using the so-called Variant Quality Score Recalibration (VQSR) method, which does not utilise any specific "hard cutoffs" merely combines data from multiple annotations from a known, highly validated variant training fix in a machine learning algorithm. Variants that are non passing the chosen thresholds or the VQSR filtering can either be completely removed or they can exist kept in the VCF, but a flag is then added to them noting that they did non pass the filter. Since frequently such a grooming prepare is not bachelor, we volition apply "difficult quality filtering" hither.
The following part of the activity is adjusted from this "GATK tutorial".
2.ane) Inspection of quality measurements
Earlier we choose the measurements and thresholds for hard quality filtering, allow's start by having a look at some of these measurements and the distributions of their values.
We will consider the following vii measurements, which are given for each variant site in the "INFO" cavalcade of the VCF (more than information about each of these measurements is given farther below):
Fisher Strand (FS), Strand Odds Ratio (SOR), RMS Mapping Quality (MQ), Mapping Quality Rank Sum Exam (MQRankSum), Quality by depth (QD), Read Position Rank Sum Examination (ReadPosRankSum), and the combined depth across samples (INFO/DP).
Below, the measurements of the kickoff variant site are highlighted:
LG05 338 . A G 93.01 . Ac=ane;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=iii.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=i;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=ane.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:xx,0:20:57:0,57,855 0/0:nine,0:ix:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248
cod204.lg05.1_FS.txt
.
Tip: Expect at the examples given in the manual. The FS values should be saved every bit one column, with one line for the value of each variant.
Show me the reply!
bcftools query cod204.lg05.1.vcf.gz -f'%FS\n' > cod204.lg05.1_FS.txt
(takes ~one min)
With the -f
selection, you lot tin can define the output format. The %FS
indicates that for each variant, the FS value should be printed, and \due north
tells the plan that after each FS value there should be a new line.
Show me the answer!
Using head cod204.lg05.1_FS.txt
will show you the first lines. There is only 1 column.
Typing wc -l cod204.lg05.1_FS.txt
will show you the number of lines, which is the same as the number of variant sites nosotros have ( 328253
)
With bcftools view -H cod204.lg05.1.vcf.gz | less
y'all can visually check if the first variants in the file and compare them to the extracted FS values.
http://ec2-XXX-Xxx-Thirty-Xxx.compute-ane.amazonaws.com:8787
Open the provided script "gda_plotting.R" in R and follow the points two) and ii.1) in that R script.
Show me the plot!
Fisher Strand (FS), as well as Strand Odds Ratio (SOR), are measurements for strand bias – that is a sequencing bias, in which the reads of one DNA strand (e.g., the forrad strand) support one genotype (e.thousand., a heterozygous phone call), while the reads for the contrary strand support e.one thousand., a homozygous genotype. In cases of farthermost strand bias, SNPs are non correctly called.
- Fisher Strand (FS) is the Phred-scaled probability that in that location is strand bias at that site. It tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. A value shut to 0 indicates picayune or no strand bias. GATK recommends hard filtering of variants with FS greater than lx, only this is a very conservative value that will only filter out cases of extreme bias and get out many false positives in the data set.
Prove me the answer!
Nosotros could stick with the GATK recommendation and remove variants with FS > 60. However, since this is a conservative value and our distribution tells united states, that we would not loose many more variants if nosotros lower the threshold to, let's say, FS > 40 (0.64% of variants have FS > 60; 1.11% have FS > xl), we could equally well set the threshold to FS > forty.
Setting such thresholds depends on the quality of the data set up, as well equally the planned downstream analyses. Here, we plan to inspect population structure, so we could hazard removing some false negatives for the sake of overall quality improvement.
cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
.
Tip: Each measurement should be saved in one cavalcade separated past tabs, with each variant on a single line.
Show me the answer!
bcftools query cod204.lg05.ane.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\northward' > cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
Again, -f
defines the output format. The %FS\t%SOR\t
… indicates that for each variant first the FS value should be printed, then a tab \t
should exist printed, followed past the SOR value, followed by a tab etc… . At the end, the \n
tells the program that after all six measurements, there should exist a new line.
Testify me the plot!
- Strand Odds Ratio (SOR) is also a measurement for strand bias, but it takes into account the ratios of reads that cover both alleles and is therefore a improve measurement for sites that accept more reads in ane direction than the other. SOR values range from 0 to greater than nine, a value close to 0 indicates piffling or no strand bias. GATK recommends difficult filtering of variants with SOR greater than 3.
- RMS Mapping Quality (MQ) and Mapping Quality Rank Sum Test (MQRankSum) are both measures for mapping quality. The Root Mean Square (RMS) mapping quality (MQ) of reads across all samples provides an estimation of the overall mapping quality to back up a variant call. It includes the standard deviation (SD) of the mapping qualities, where a low SD indicates values shut to the mean. Skillful mapping qualities are effectually MQ 60. GATK recommends difficult filtering of variants with MQ less than 40.
- The Mapping Quality Rank Sum Test (MQRankSum) compares the mapping qualities of reads supporting the reference and the alternate allele. If all reads map correctly, the expected MQRankSum is 0. Negative values bespeak a lower MQ of the reads supporting the alternate allele, compared to reads supporting the reference allele. MQRankSum values usually range from about -10.5 to 6.5. GATK recommends hard filtering of extreme outlier variants with MQRankSum less than -12.five.
- Quality by depth (QD) is the "confidence in a call" (i.e., the QUAL score), normalized past the read depth at that site. GATK recommends hard filtering of variants with QD below 2.
- The Read Position Rank Sum Test (ReadPosRankSum) measures the relative position of the reference versus the alternative allele within reads. Eastward.m., SNPs towards the ends of reads may be more likely a sequencing error. A value effectually 0 means there is footling to no deviation in where the alleles are found relative to the ends of reads. GATK recommends hard filtering of variants with ReadPosRankSum less than -8.0.
- The combined depth across samples (INFO/DP) is only the sum of the read depth (i.e., the genotype depth (FORMAT/DP) fields) over all samples. Variants that take a high mean depth across all samples are indicative of either repetitive regions or paralogs. A good rule of thumb is to remove variants that show twice the hateful read depth. You can calculate the hateful read depth in R:
hateful(cod204.lg05.1_hf.values$DP)
. Now, don't be shocked about the loftier read depth. This field has not been updated later on reducing the vcf from 860 individuals to 204 individuals, and then the information is given for all 860 individauls. We can use this field withal for excluding variants with read depth larger than twice the mean.
Show me the answer!
Again, choosing filtering thresholds is highly depended on the quality of the data and the planned downstream analyses. Here are some possible suggestions for variants to exclude:
- FS (GATK recommendation FS > threescore): FS > 40 (we are more strict hither)
- SOR (GATK recommendation SOR > 3): SOR > 3 (this seems just fine)
- MQ (GATK recommendation MQ < xl): MQ < 40 (we would loose many variants with a lower threshold, so we stick to the recommended)
- MQRankSum (GATK recommendation MQRankSum > -12.5): MQRankSum > -5.0 and MQRankSum < 5.0 (a value around 0 is preferred and we don't loose many variants if nosotros set strict thresholds)
- QD (GATK recommendation QD > two): QD > two (we would loose many variants with a lower threshold, so nosotros stick with the recommendation)
- ReadPosRankSum (GATK recommendation ReadPosRankSum > -8.0): ReadPosRankSum > -4.0 (not many variants are below -viii, and then we can set information technology more strict)
- INFO/DP (no GATK recommendation, dependend on sample number and sequencing depth): INFO/DP > xvi,000 (the hateful DP is 8400.935, we exclude sites that bear witness about twice the mean DP)
2.2) Application of hard filtering thresholds
We will now utilize the filtering thresholds and remove variants that don't lucifer the chosen thresholds.
Don't forget to specify the input file cod204.lg05.1.vcf.gz
, the output format (we'll utilise -O z
since nosotros would like to relieve the data file again as compressed VCF), and the output filename (-o cod204.lg05.1.hf.vcf.gz
). Type bcftools filter
in the command line to see the options.
Tip: It is possible to either apply -east
and the logical operator "or" ||
to exclude low-quality variants from the file, or apply -i
and the logical operator "and" &&
to include loftier-quality variants.
Running the filtering volition take about five minutes, so time for a break and some coffee!
Show me the code!
A possible solution choosing the "exclude" option:
bcftools filter -e 'FS>40.0 || SOR>3 || MQ<forty || MQRankSum<-5.0 || MQRankSum>v.0 || QD<2.0 || ReadPosRankSum<-4.0 || INFO/DP>16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.ane.vcf.gz
A possible solution choosing the "include" option:
bcftools filter -i 'FS<xl.0 && SOR<3 && MQ>40.0 && MQRankSum>-5.0 && MQRankSum<5 && QD>ii.0 && ReadPosRankSum>-4.0 && INFO/DP<16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.1.vcf.gz
! Be enlightened that the commands will not produce the exact same results. The reason for this is that some measurements will not be available for all sites, because there are, for example, too many missing genotypes to calculate this measurement. In such cases, these sites would non be excluded (because there is no measurement to exclude), so they would remain in the filtered file, while with the include choice, they would not exist included (since there is no measurement based on which they could be included).
- What is the file size?
- How many variant sites have been removed? Tip: see Q6
- Has the FS filter been applied correctly? What is the largest FS value?
Tip: applybcftools query
as you did above (Q8) but instead of saving the output, pipe it directly tosort
. Use the correct flags with "sort" (come across the options withsort --help
and merely check the first values by pipe the output again tohead
.
Show me the answer!
! Annotation, the values below may be different depending on the filter thresholds y'all accept called.
-
ls -lh
213M
– the file is about thirty% smaller than the unfiltered one. -
bcftools view -H cod204.lg05.ane.hf.vcf.gz | wc -l
231014
– the file also has about 30% fewer variant sites. -
bcftools query cod204.lg05.i.hf.vcf.gz -f'%FS\n' | sort -gr | head
– we need the sort flag-g
since nosotros have floating point numbers, and since we are interested in the largest value, we demand to reverse the sorting using the flag-r
. Usingsort -g | tail
would work but the aforementioned fashion.
39.994
39.989
39.913
39.91 -
bcftools query cod204.lg05.1.hf.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\north' > cod204.lg05.1.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
cod204.lg05.one.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
, and plot the distributions using Rstudio and the script "gda_plotting.R" (section 2.1.2).
Prove me the plot!
3) Further quality filtering
After hard quality filtering, we have a data prepare containing only variant sites that we trust with high confidence. However, and so far we have just removed variants that had skewed values across all samples – we haven't nonetheless looked at the individual genotypes at each variant site. For example, we may have kept a variant that has a loftier quality across all 204 individuals, simply there are three individuals where the read depth (FMT/DP) or the genotype quality (GQ) for the individual genotype is very low, so we don't trust the calls for these individuals.
LG05 338 . A Yard 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=three.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=one.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:twenty,0:xx:57:0,57,855 0/0:9,0:nine:27:0,27,345 0/0:ten,0:10:30:0,30,381 0/0:5,0:v:15:0,15,209 [...] 0/0:seven,0:vii:21:0,21,248
Furthermore, nosotros volition exclude SNPs that are in the close proximity of indels. That is, because the proper alignment of sequences around indels can often be problematic and produce false positives.
For example, dependent on the "gap opening penalty", a region can exist aligned to have four indels (‐) and a SNP.
CTTTGAACCAAATTT----ACTGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACT--AGMT-TGTTGCC
Or it tin can be aligned to accept three indels (‐) and four SNPs.
CTTTGAACCAAATTT--TCAGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACTAG1000T-TGTTGCC
In addition, we will also remove monomorphic SNPs (i.e., SNPs that were called because they differ from the reference, but that are not polymorphic in our sample).
Then, since we are only interested in biallelic SNPs for the downstream analyses in this action, we will remove indels and multiallelic SNPs (for data sets were indels are likewise going to be analysed, information technology is recommended to subset indels course SNPs as the very first footstep and conform the above hard filtering thresholds separately).
And terminal but not to the lowest degree, we will remove individuals with high amounts of missing data, too as variants with lots of missing genotypes, and we volition filter on the minor allele frequency (MAF).
3.i) Filter on minimum read depth (DP) and genotype quality (GQ)
We will again use bcftools filter
and filter reads that accept a read depth below 3 or a genotype quality beneath 20. The difference is, that instead of the whole variant site, we are at present considering unmarried genotypes (the information in the "FORMAT" fields") using -e 'FMT/DP<3 | FMT/GQ<20'
and nosotros will non remove the whole variant site, but simply fix the respective genotypes to missing (./.
) past using the bcftools filter -S .
command.
Use the already "hard-filtered" data file as input cod204.lg05.1.hf.vcf.gz
, specify a compressed VCF every bit the output format -O z
, and name the new output file (-o cod204.lg05.1.hf.DP3.GQ20.vcf.gz
).
bcftools filter
, and the manual. Information technology takes ~3 min to run the command.
Show me the command!
bcftools filter -Southward . -e 'FMT/DP<3 | FMT/GQ<20' -O z -o cod204.lg05.i.hf.DP3.GQ20.vcf.gz cod204.lg05.1.hf.vcf.gz
Hither nosotros utilise the logical operator "|" instead of "||", since using "||" would mean that every genotype at a variant site is fix to missing, even if simply one genotype doesn't fulfill the threshold. More than information about the utilise of these operators in BCFtools can be found hither.
You may wonder whether a minimum read depth of 3 is sufficient to confidently call the variant. In our instance it is, since the GATK Haplotypecaller pipeline calls genotypes across all 860 samples simultaneously and includes information from all reads from all individuals to adjust the genotype quality. A telephone call with DP=3 may thus have a high GQ and can be confidently kept.
With the following control we extract genotypes of samples that accept a read depth lower than three -i 'FMT/DP<iii'
and in the output we would like to see the genotype (GT), followed past the read depth (DP) and the genotype quality (GQ) for that sample -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n'
.
Earlier you lot execute the post-obit control, accept a brief thought nearly the expected output regarding the genotypes.
bcftools query -i 'FMT/DP<3' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\north' cod204.lg05.one.hf.DP3.GQ20.vcf.gz | less -S
Show me the answer!
With our filtering control above, we specified to gear up genotypes to missing ./.
. Thus, when just extracting genotypes of samples that have a read depth lower than three, nosotros would expect them to be all ready to missing.
GT=./. DP=i GQ=iii GT=./. DP=ii GQ=vi GT=./. DP=1 GQ=iii GT=./. DP=1 GQ=3
GT=./. DP=ane GQ=3 GT=./. DP=i GQ=3 GT=./. DP=1 GQ=iii GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6
GT=./. DP=1 GQ=three GT=./. DP=2 GQ=vi GT=./. DP=one GQ=3 GT=./. DP=one GQ=3
GT=./. DP=2 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=one GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6
GT=./. DP=2 GQ=half-dozen GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=i GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=2 GQ=half-dozen GT=./. DP=0 GQ=. GT=./. DP=1 GQ=iii
GT=./. DP=two GQ=6 GT=./. DP=one GQ=iii GT=./. DP=ii GQ=vi [...]
You may too check the genotypes with a read depth of exactly iii. bcftools query -i 'FMT/DP=three' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\due north' cod204.lg05.1.hf.DP3.GQ20.vcf.gz | less -South
.
You will see that there are only few genotypes with a high genotype quality (GQ) that are non set to missing.
3.2) Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels
- To remove monomorphic SNPs, we will use
bcftools filter
as before to exclude-eastward
all sites at which no alternative alleles are called for any of the samplesAc==0
and all sites at which only alternative alleles are chosenAir conditioning==AN
. In add-on, to remove SNPs that are within 10 bp of indels, nosotros add together the flag--SnpGap x
. - To reduce the information file to contain but biallelic SNPs, we utilise the
bcftools view
command using the flags-m2 -M2 -v snps
, which set both the minimum-m
and maximum-G
number of alleles to 2, and keeps only snps via-v snps
.
Y'all can execute both commands in one control line by piping |
the output of the offset command (bcftools filter …) to the second command (bcftools view …). This style you merely have to specify the input file cod204.lg05.1.hf.DP3.GQ20.vcf.gz
with the first control, and the output format -O z
and output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
with the second control.
bcftools filter
and bcftools view
, and the transmission. It will have ~2 min to run.
Show me the command!
bcftools filter -due east 'Air-conditioning==0 || AC==AN' --SnpGap 10 cod204.lg05.ane.hf.DP3.GQ20.vcf.gz | bcftools view -m2 -M2 -v snps -O z -o cod204.lg05.i.hf.DP3.GQ20.allele.vcf.gz
zless -S
).
Prove me how!
You may use:
zless -S cod204.lg05.1.hf.DP3.GQ20.vcf.gz
to see the original file,
and compare it with the output file:
zless -S cod204.lg05.one.hf.DP3.GQ20.allele.vcf.gz
Or: bcftools view -H cod204.lg05.one.hf.DP3.GQ20.allele.vcf.gz | less -S
By using less -Southward
we "chop" the long lines and but display the beginning of each line, making information technology easier to compare the reference and culling alleles and the allele numbers. You tin scroll horizontally across the lines using the arrow keys.
Can you identify what has changed?
Show me the answer!
bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | wc -l
82594
Recollect, we had 328,253 variants at the outset and 231,014 variants after the difficult quality filtering steps, now subsequently the last filtering steps we're down to 82,588.
3.3) Remove individuals with a high corporeality of missing data
Sometimes, sequencing doesn't work as well for some individuals equally it does for others. Such individuals will exist marked past depression quality genotypes and low read depth, resulting in loftier amounts of missing information afterward the above filtering steps.
We can bank check the amount of missing data by using the bcftools stats
command. Accept a look at the options by typing bcftools stats
in the concluding or cheque the manual for what it can do. With -due south -
nosotros tin can request stats for all samples.
Run bcftools stats -southward - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
and quickly scroll through the large output.
Find the section PSC, Per-sample counts
, which displays some summary statistics per sample. The last cavalcade named [14]nMissing
displays the number of missing genotypes per sample.
There is also other valuable information that allows u.s. for example to appraise the effectiveness of the filtering steps higher up (e.g., in clolumn [9]nIndels
, also nether PSC
the number should exist 0 since we removed all indels).
As well have a quick await at the number of [eleven]nSingletons
, listed in column 11. We will go back to that in part iii.four) of the activity.
cod204.lg05.1.hf.DP3.GQ20.allele.imiss
.
Visually check that the file is as expected.
Bear witness me the code!
bcftools stats -due south - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | grep -Eastward ^PSC | cut -f3,xiv > cod204.lg05.i.hf.DP3.GQ20.allele.imiss
Do the visual cheque e.thousand., using less cod204.lg05.1.hf.DP3.GQ20.allele.imiss
Prove me the plot!
bcftools view -south
command to remove the three individuals with more lxx% missing information from the data file cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
.
Save the output as compressed VCF by using selection -O z
and specify the output file proper name -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
with pick -o
. Information technology takes ~1 min to run this command.
Show me the code
bcftools view -s ^ORE1203009,ORE1203008,ORE1203013 -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
. Accept the three individuals been successfully removed?
Show me the lawmaking
bcftools query -50 cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz | wc -fifty
201
Yes, our data set had 204 individuals earlier.
3.4) Remove variants with a loftier amount of missing genotypes and filter on minor allele frequency
In the missingness plot higher up, we saw that the data set up in general has a lot of missing data. This is considering nosotros gear up genotypes with low quality and low read depth to missing. As the penultimate pace, we will now remove all variants that have more than xx% missing data.
As the terminal footstep, we volition remove the singletons that you saw in a higher place in the bcftools stats
command. Singletons – alleles that are observed simply in one case – and other alleles that have a low minor allele frequency (MAF) frequently present false calls, which can be obstructive in downstream analyses past adding noise to the data (note even so, that you may besides need them due east.g., for demographic analyses).
If the data will be used for a genome-wide association study (GWAS), usually a rather high MAF threshold is applied, as it requires very strong statistical power to make meaningful statements almost very rare alleles. When interested in population structure with many samples, alleles present in just 1 individual do not add information and tin can therefore be removed. Thus, the optimal MAF filtering threshold depends on the blazon of information you have, the analysis you are planning to practise, the data quality, and the sample size.
Show me the answer!
The default MAF threshold in some programs is 0.01. Filtering for MAF 0.01 ways that remaining sites in the VCF take a minimum modest allele frequency of one%. Our data set contains 201 individuals (thus maximally 402 alleles per site). If we filter for MAF 0.01, a variant is excluded if its minor allele is found in less than 4.02 genotypes (1% of 402). This would remove alleles where the pocket-sized allele occurs 4 times or less, including the singletons.
A slightly higher MAF threshold of 0.05 would exclude all sites that are called less than 20.i times (5% of 402). Given that we know that nosotros take samples with xx individuals (40 alleles), such number of minor alleles could very well be truthful variants and should rather not be removed.
Note: for the above calculations, I assumed that we have 402 allels per site. However, in BCFtools, MAF is calculated every bit the minor allele count (MAC) divided past the total number of alleles (AN), meaning that missing data is taken into business relationship. Since we filter simultaneously for missing data, AN volition be between 322 and 402.
If missing data should non be taken into account, filtering tin also be done based on MAC.
Taken together, probably something between 0.01 and 0.03 could be a good MAF threshold for our data set.
bcftools filter -e
to exclude all variants that accept more than 20% missing genotypes or have a minor allele frequency smaller or equal to 0.02 from the data file cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
. Save the output every bit compressed VCF -O z
and use the output file proper noun -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz
.
Prove me the code!
bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.02' -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz cod204.lg05.i.hf.DP3.GQ20.allele.missi.vcf.gz
What would be the lowest culling allele count to wait?
Use once more bcftools query
and excerpt the INFO field "Air-conditioning", pipage the output into sort and print only the start lines.
Show me the lawmaking!
We filtered for MAF 0.02, so we would expect a minimum AC of 7 (because 2% of 322 = 6.44).
bcftools query cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz -f'%Air conditioning\n' | sort -g | head
7
7
7
[…]
OPTIONAL R job: Switch to Rstudio, adjust the code of point 3) in the "gda_plotting.R" script and plot the missingness per sample afterwards the above filtering steps. Has it inverse?
Show me the answer!
bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz | wc -l
From 328,253 variants, xi,864 SNPs are left. This ways we removed 96% per centum of the data.
four) Performing a principal component analysis (PCA) to inspect the data structure
The "smartpca" program implemented in the software EIGENSOFT uses a combination of master component assay (PCA; a technique used to identify and illustrate the strongest components of variation in a dataset) and new statistics designed specifically for genomic information. The methods are described in Patterson et al.(2006) PLOS Genetics 2(12): e190.
The EIGENSOFT smartpca needs iv input files: The ".ped" (containing the genotype information) and ".map" (containing the chromosome/position information) files that we will produce using the software PLINK1.nine, a ".pedind" file containing private information, and a '.par' file specifying the analysis settings.
cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz
to the PLINK ".ped/.map" format by running the following command:
plink --vcf cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz --aec --recode --out cod204.lg05.one.hf.DP3.GQ20.allele.missi.miss20.maf0.02
If you would like to know more nigh this format have a look here.
Utilize sed
to supervene upon the "LG" in the cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map
file with "" (nada) and salve the output in a new file called tmp
.
Verify that the new file is correct. If so, rename tmp
again to cod204.lg05.ane.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map
.
Prove me the command!
cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map | sed south'/LG//'g > tmp
The "stream editor" sed
will substitute south/
the found expression "LG" with "" (nothing) within the "global" /thousand
environment, meaning the whole line/file.
less tmp
mv tmp cod204.lg05.ane.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map
cut
the first 5 columns of the ".ped" file and save them in a new file chosen "tmp".
Show me the control!
cat cod204.lg05.i.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cutting -d ' ' -f1-v > tmp
The default delimiter whitespace in cut
are tabs, just the .ped file has spaces. We tin ascertain this with -d ' '
. With -f
we specify columns i to v.
Have a expect at the "tmp" file. It contains the family IDs and the individual IDs in the kickoff two columns (in our instance this is the same), followed past zeros that betoken "no information" for the remaining columns (fatherID, motherID, sex). In addition, we need a column to assign individuals to populations.
To practice this, "cutting" the first three characters (see cutting --help
) from cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped
and "paste" them to the tmp
file. Save the output as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind
.
Remove the "tmp" file.
Show me the command!
cat cod204.lg05.one.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cut -c 1-3 | paste tmp - > cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind
"cut -c 1-three" selects the first 3 characters, which are combined via the standard output (stdout, represented by "-
) with the saved "tmp" file.
rm -i tmp
template.par
to include the respective input file names, salvage the par file as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par
, and run the smartpca with the syntax smartpca -p cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par > pca01.out
.
Show me the plot!
The starting time principal component centrality explains almost ii% of the total genomic variation and separates the eastern Baltic Sea (night blue) from the remaining samples. The second centrality explains 1% of the total genomic variation and separates the N Sea populations from the fjord-type cod. Although nosotros used just a tiny fraction of the information, the population structure is comparable with Fig. 1b in Barth JMI et al. (2019) Mol Ecol 28, 1394–1411.
Well washed!! Now it is time for socializing. My suggestion for todays pub is the "Zapa bar".
What Does Quality Filtering Genomic Data Do,
Source: https://evomics.org/learning/population-and-speciation-genomics/2020-population-and-speciation-genomics/first-steps-in-genomic-data-analysis/
Posted by: brownthabould.blogspot.com
0 Response to "What Does Quality Filtering Genomic Data Do"
Post a Comment