Data Preparation

Requirements of input files

Please ensure that your input files meet the following requirements:

  • The input file is in vcf format, compressed with bgzip.

  • The vcf.gz file is sorted by genomic coordinates.

  • GRCh38/hg38 coordinates are required, and chromosome names should carry the prefix ‘chr’ (e.g. chr2).

  • Create a separate vcf.gz file for each chromosome.

  • Only one file per chromosome can be uploaded in the same task.

  • Files uploaded at the same time must have distinct filenames, and the filenames shouldn't contain any special characters or Chinese characters.

  • The size of a vcf file shouldn't exceed 100MB.

  • The percentage of sites that do not exist in the ChinaMAP reference panel shouldn't exceed 50%. Too many such sites may cause QC failure. The list of ChinaMAP reference panel sites can be downloaded here.
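
The 50% threshold can be estimated locally before upload by comparing site lists. The following is a minimal sketch and not part of the ChinaMAP tools: missing_site_percent is a hypothetical helper, and it assumes that both the input sites and the reference panel sites are available as plain-text files with one tab-separated CHROM, POS, REF, ALT record per line. The actual format of the downloadable reference site list may differ, and the server's own QC may match sites differently.

```shell
# Hypothetical helper: percentage of input sites absent from a reference site list.
# Assumption: both files hold one site per line as CHROM<TAB>POS<TAB>REF<TAB>ALT,
# e.g. written with: bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' my.vcf.gz > my.sites.txt
missing_site_percent() {
  input_sites=$1
  ref_sites=$2
  tmpdir=$(mktemp -d) || return 1

  # comm needs both inputs sorted in the same collation order.
  LC_ALL=C sort -u "$input_sites" > "$tmpdir/in.sorted"
  LC_ALL=C sort -u "$ref_sites" > "$tmpdir/ref.sorted"

  total=$(($(wc -l < "$tmpdir/in.sorted")))
  # comm -23 keeps the lines that occur only in the first file.
  missing=$(($(LC_ALL=C comm -23 "$tmpdir/in.sorted" "$tmpdir/ref.sorted" | wc -l)))
  rm -rf "$tmpdir"

  awk -v m="$missing" -v t="$total" \
    'BEGIN { if (t == 0) print "0.00"; else printf "%.2f\n", 100 * m / t }'
}

# Usage: missing_site_percent my.sites.txt reference.sites.txt
```

A result above 50 suggests filtering the input against the reference panel before uploading.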

We recommend excluding variants that meet the following criteria:

  • monomorphic sites

An example input vcf file can be downloaded here.

Examples of data preparation commands

Bcftools is a helpful tool for data preparation.

  • Compress vcf with bgzip.

bgzip -f my.vcf
  • Remove redundant annotations from the vcf file to reduce its size. The command below drops all INFO fields and every FORMAT field except GT.

bcftools annotate -x INFO,^FORMAT/GT -O z -o my.concise.vcf.gz my.vcf.gz
  • Sort vcf by genomic coordinates.

bcftools sort --output-type z --output my.sort.vcf.gz my.concise.vcf.gz
#create index of vcf.gz
bcftools index --force --tbi --output my.sort.vcf.gz.tbi my.sort.vcf.gz
  • If your loci are on another genome build (e.g. hg19), lift the coordinates over to GRCh38/hg38. You can use the GATK LiftoverVcf tool.

gatk LiftoverVcf \
--INPUT hg19.vcf.gz \
--OUTPUT hg38.vcf.gz \
--REJECT hg38.rejected_variants.vcf.gz \
--REFERENCE_SEQUENCE hg38.fasta \
--CHAIN hg19ToHg38.over.chain.gz
  • Rename the chromosomes with prefix ‘chr’.

#The format of chr.list: old_name\tnew_name
bcftools annotate --rename-chrs chr.list --output-type z --output my.sort.rename.vcf.gz my.sort.vcf.gz
#create index of vcf.gz
bcftools index --force --tbi --output my.sort.rename.vcf.gz.tbi my.sort.rename.vcf.gz
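
The chr.list file used above can be generated rather than typed by hand. This is a minimal sketch, assuming the input names the autosomes 1 to 22 without a prefix; make_chr_list is a hypothetical helper, and sex chromosomes or other contigs would need extra lines.

```shell
# Hypothetical helper: write a rename map with one old_name<TAB>new_name pair per line.
# Assumption: the input vcf names its chromosomes 1..22 with no prefix.
make_chr_list() {
  out=$1
  : > "$out"                       # start from an empty file
  i=1
  while [ "$i" -le 22 ]; do
    printf '%s\tchr%s\n' "$i" "$i" >> "$out"
    i=$((i + 1))
  done
}

# Usage: make_chr_list chr.list
```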
  • You can use the ChinaMAP_filterVCF tool to exclude monomorphic sites and sites that do not exist in the reference panel.

#exclude only sites that do not exist in the reference panel
ChinaMAP_filterVCF.py --input my.vcf.gz \
--reference ChinaMAP.phase1_v1_reference_panel.vcf.gz \
--outputDir ./outdir 

#exclude only monomorphic sites
ChinaMAP_filterVCF.py --input my.vcf.gz \
--exlude_monomorphic \
--outputDir ./outdir

#exclude both sites that do not exist in the reference panel and monomorphic sites
ChinaMAP_filterVCF.py --input my.vcf.gz \
--reference ChinaMAP.phase1_v1_reference_panel.vcf.gz \
--exlude_monomorphic \
--outputDir ./outdir

#create index of vcf.gz 
bcftools index --force --tbi --output my_filtered.vcf.gz.tbi my_filtered.vcf.gz
  • Create a separate vcf.gz file for each chromosome. The input must be indexed, as in the previous step.

for chr in chr{1..22}
do
  bcftools view -r "$chr" --output-type z --output "$chr.vcf.gz" my_filtered.vcf.gz
done
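
Once the per-chromosome files exist, the naming and size requirements can be checked locally. The sketch below is not part of the ChinaMAP tools, and check_upload_file is a hypothetical helper. It assumes that 100MB means 100 * 1024 * 1024 bytes measured on the file as uploaded, and that "special characters" means anything outside letters, digits, dot, underscore, and hyphen. The server may apply different rules.

```shell
# Hypothetical pre-upload check; not part of the ChinaMAP tools.
# Assumptions: the limit is 100 * 1024 * 1024 bytes on the uploaded file, and
# "special characters" means anything outside A-Z a-z 0-9 . _ -
check_upload_file() {
  file=$1
  name=$(basename "$file")
  status=0

  # Require the .vcf.gz suffix (this does not verify the compression itself).
  case $name in
    *.vcf.gz) ;;
    *) echo "$name: expected a .vcf.gz file" >&2; status=1 ;;
  esac

  # Flag any character outside the conservative ASCII set.
  if printf '%s' "$name" | LC_ALL=C grep -q '[^A-Za-z0-9._-]'; then
    echo "$name: filename contains special or non-ASCII characters" >&2
    status=1
  fi

  # Compare the size in bytes with the assumed limit.
  size=$(($(wc -c < "$file")))
  if [ "$size" -gt $((100 * 1024 * 1024)) ]; then
    echo "$name: $size bytes exceeds the assumed 100MB limit" >&2
    status=1
  fi

  return "$status"
}

# Usage: for f in chr*.vcf.gz; do check_upload_file "$f"; done
```

The function returns a non-zero status when any check fails, so it can gate an upload script.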

Tip

Before uploading the input vcf files, we recommend using the ChinaMAP_checkVCF tool to check whether they meet the above requirements. If you see an unexpected error message, please contact us with your file and the error information. We would be grateful for your feedback, which helps us improve the server.