Genotype Imputation Pipeline
Our genotype imputation pipeline executes the following steps:
Step 1. Parse VCF files
First, we identify the chromosomes present in each file and check whether the file meets the requirements. The files are then processed in parallel.
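As an illustration of the chromosome-identification part of this step, the following minimal sketch collects the distinct CHROM values from the data lines of a VCF file. The helper names `open_vcf` and `chromosomes_in_vcf` are hypothetical and are not part of the pipeline; the file requirements mentioned above are not specified here, so they are not checked.

```python
import gzip


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def chromosomes_in_vcf(lines):
    """Return the distinct CHROM values, in order of first appearance."""
    chromosomes = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and meta/header lines
        chrom = line.split("\t", 1)[0]  # CHROM is the first tab-separated column
        if chrom not in chromosomes:
            chromosomes.append(chrom)
    return chromosomes


# Small in-memory example; a real file would be read via open_vcf(path).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "2\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n",
    "2\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\n",
    "3\t150\t.\tG\tA\t.\tPASS\t.\tGT\t1/1\n",
]
print(chromosomes_in_vcf(example))  # ['2', '3']
```

With a file on disk, the same function can be called as `chromosomes_in_vcf(open_vcf("input.vcf.gz"))`, where the file name is a placeholder.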
Step 2. Quality Control
We split the data into chunks of 20 Mb. For each 20 Mb chunk, we perform the following checks:
exclude sites with alleles other than A, T, C, or G
exclude sites without a called genotype
exclude duplicate sites
Important note: In this step, sites that are absent from the reference panel and monomorphic sites are not excluded.
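The three site-level checks can be sketched as a single filter. This is only an illustration under stated assumptions, not the pipeline's actual code: a site is treated as "without a called genotype" when every sample's GT is missing, a duplicate is a repeated (chromosome, position, REF, ALT) combination of which the first occurrence is kept, and the names `is_missing` and `qc_filter` are hypothetical.

```python
import re

VALID_BASES = set("ACGT")


def is_missing(sample_field):
    """True if the GT part of a sample field has no called allele (e.g. './.')."""
    gt = sample_field.split(":", 1)[0]
    return all(allele == "." for allele in re.split(r"[/|]", gt))


def qc_filter(records):
    """Filter (chrom, pos, ref, alt, samples) tuples with the three checks."""
    kept = []
    seen = set()
    for chrom, pos, ref, alt, samples in records:
        alleles = [ref] + alt.split(",")
        # Check 1: every allele must consist only of A, C, G, T.
        # (Assumption: symbolic or '.' alleles count as invalid here.)
        if any(not a or set(a.upper()) - VALID_BASES for a in alleles):
            continue
        # Check 2: at least one sample must have a called genotype (assumption).
        if all(is_missing(s) for s in samples):
            continue
        # Check 3: drop repeated sites, keeping the first occurrence (assumption).
        key = (chrom, pos, ref, alt)
        if key in seen:
            continue
        seen.add(key)
        kept.append((chrom, pos, ref, alt, samples))
    return kept


records = [
    ("2", 100, "A", "G", ["0/1", "./."]),      # kept
    ("2", 150, "A", "<DEL>", ["0/1", "0/0"]),  # symbolic allele, excluded
    ("2", 200, "C", "T", ["./.", ".|."]),      # no called genotype, excluded
    ("2", 100, "A", "G", ["0/0", "0/1"]),      # duplicate, excluded
    ("2", 400, "G", "A", ["0/0", "0/0"]),      # monomorphic, kept
]
print([r[1] for r in qc_filter(records)])  # [100, 400]
```

Note that the filter never consults the reference panel and does not look at allele frequencies, which matches the note that sites missing from the panel and monomorphic sites stay in at this stage.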
Then, we count the number of variants in the chunk that are included in the reference panel. A chunk is excluded if:
fewer than 3 of its variants are included in the reference panel
more than 50% of its variants are not included in the reference panel
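The chunking and the two exclusion rules can be written down compactly. The sketch below assumes 1-based, inclusive chunk coordinates (consistent with the chr2:1-20000000 example) and that variants are compared to the reference panel by an exact key such as (chromosome, position, REF, ALT); both the matching key and the function names are assumptions made for illustration.

```python
CHUNK_SIZE = 20_000_000  # 20 Mb


def chunk_bounds(pos, chunk_size=CHUNK_SIZE):
    """Return the 1-based inclusive (start, end) of the chunk containing pos."""
    index = (pos - 1) // chunk_size
    start = index * chunk_size + 1
    return start, start + chunk_size - 1


def chunk_is_valid(target_sites, reference_sites,
                   min_in_reference=3, max_missing_fraction=0.5):
    """Apply the two chunk-level exclusion rules.

    target_sites: site keys in the chunk after site-level QC.
    reference_sites: set of site keys present in the reference panel.
    """
    total = len(target_sites)
    in_reference = sum(1 for site in target_sites if site in reference_sites)
    if in_reference < min_in_reference:
        return False  # rule 1: fewer than 3 variants found in the panel
    missing_fraction = (total - in_reference) / total
    return missing_fraction <= max_missing_fraction  # rule 2: >50% missing


print(chunk_bounds(1))           # (1, 20000000)
print(chunk_bounds(20_000_001))  # (20000001, 40000000)

panel = {("2", p, "A", "G") for p in (100, 200, 300)}
chunk = [("2", p, "A", "G") for p in (100, 200, 300, 400)]
print(chunk_is_valid(chunk, panel))  # True: 3 in panel, 25% missing
```

A chunk in which exactly half of the variants are missing from the panel passes in this sketch, because the rule is stated as strictly more than 50%.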
For each valid chunk, phasing is performed with Eagle2 using the following command (chr2:1-20000000 is used as an example):
/path/eagle \
    --chrom 2 \
    --bpStart 1 \
    --bpEnd 20000000 \
    --vcfRef reference_panel.chr2.phased.vcf.gz \
    --vcfTarget chr2.1-20000000.vcf.gz \
    --geneticMapFile genetic_map.hg38.txt \
    --noImpMissing \
    --allowRefAltSwap \
    --vcfOutFormat z \
    --outputUnphased \
    --outPrefix chr2.1-20000000.phased
For each valid chunk, imputation is performed with Minimac4 using the following command (chr2:1-20000000 is used as an example):
/path/minimac4 \
    --chr 2 \
    --start 1 \
    --end 20000000 \
    --minRatio 0.000001 \
    --window 500000 \
    --refhaps reference_panel.chr2.m3vcf.gz \
    --haps chr2.1-20000000.phased.vcf.gz \
    --noPhoneHome \
    --allTypedSites \
    --format GT,DS,GP \
    --prefix chr2.1-20000000.impute
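Because the phasing and imputation commands differ between chunks only in the chromosome and coordinates, they could be generated from a template. The sketch below is one possible way to do that; the function name `chunk_commands` is hypothetical, and generalising the file names from the chr2 example (for instance `reference_panel.chr{chrom}.phased.vcf.gz`) is an assumption. Only the flags shown in the two commands above are used.

```python
import shlex


def chunk_commands(chrom, start, end):
    """Build the Eagle2 and Minimac4 argument lists for one chunk."""
    chunk = f"chr{chrom}.{start}-{end}"  # naming pattern taken from the example
    eagle = [
        "/path/eagle",
        "--chrom", str(chrom),
        "--bpStart", str(start),
        "--bpEnd", str(end),
        "--vcfRef", f"reference_panel.chr{chrom}.phased.vcf.gz",
        "--vcfTarget", f"{chunk}.vcf.gz",
        "--geneticMapFile", "genetic_map.hg38.txt",
        "--noImpMissing",
        "--allowRefAltSwap",
        "--vcfOutFormat", "z",
        "--outputUnphased",
        "--outPrefix", f"{chunk}.phased",
    ]
    minimac4 = [
        "/path/minimac4",
        "--chr", str(chrom),
        "--start", str(start),
        "--end", str(end),
        "--minRatio", "0.000001",
        "--window", "500000",
        "--refhaps", f"reference_panel.chr{chrom}.m3vcf.gz",
        # The imputation input is the phased output written by Eagle2.
        "--haps", f"{chunk}.phased.vcf.gz",
        "--noPhoneHome",
        "--allTypedSites",
        "--format", "GT,DS,GP",
        "--prefix", f"{chunk}.impute",
    ]
    return eagle, minimac4


eagle_cmd, minimac4_cmd = chunk_commands(2, 1, 20_000_000)
print(shlex.join(eagle_cmd))
print(shlex.join(minimac4_cmd))
```

Keeping each command as a list of arguments means it can be passed directly to `subprocess.run` without shell quoting, and `shlex.join` is used here only to print a copy-pasteable form.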
Finally, we merge all chunks of each chromosome into a single vcf.gz file and generate an MD5 checksum of that file.
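The checksum makes it possible to verify a downloaded result file. A minimal sketch of computing an MD5 checksum with Python's standard library is shown below; the function names and the example file name are hypothetical, and the merge itself is not shown because the tool used for it is not stated here.

```python
import hashlib
import io


def md5_of_stream(stream, block_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in blocks."""
    digest = hashlib.md5()
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex MD5 digest of a file, e.g. a merged vcf.gz."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


# The MD5 digest of empty input is a well-known constant.
print(md5_of_stream(io.BytesIO(b"")))  # d41d8cd98f00b204e9800998ecf8427e
```

Reading in blocks keeps memory use constant, which matters for whole-chromosome files. A call such as `md5_of_file("chr2.vcf.gz")` (placeholder name) should give the same value as the published checksum if the file is intact.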