1 Data Quality Control in Genome-wide Genetic Analyses Wei V. Chen July 2, 2008.

1

Data Quality Control in Genome-wide Genetic Analyses

Wei V. Chen, July 2, 2008

2

Why QC?

• Data quality control is essential in mapping complex disease genes: it helps ensure that the disease gene signals detected are unbiased.

• Complete mission: localization of a gene locus followed by attempts to clone the gene, characterize and sequence it, and determine its functions and/or the nature of the defect it causes (borrowed from SAGE notes).

• Reducing false positives and false negatives in the analyses that localize a gene locus can save a great deal of later effort.

• QC can be a very complicated topic. This talk concentrates on SNP data and discusses some practical precautions, based on recent genetic analyses done here:
– Calling algorithms of different genotyping platforms
– Genotyping errors
– Bad markers
– Call rates
– Hardy-Weinberg disequilibrium
– Mendelian inconsistencies
– Linkage disequilibrium
– Wrong relationships
– etc.

3

General QC in SNP genotyping Data

• Allele calling algorithm
– TOP/BOTTOM for Illumina platform
– Confidence Score for Affymetrix platform
– forward/reverse strand orientation
• Call rates
• Minor Allele Frequency (MAF) check-up
• Hardy-Weinberg Equilibrium (HWE) test

4

Illumina Platform

Illumina platform: the data may come as A/B alleles, TOP alleles, etc.
1. A/C or A/G SNP: TOP -> A: A; C/G: B
2. T/C or T/G SNP: BOTTOM -> T: A; C/G: B
3. A/T or C/G SNP: newer Illumina data does not use these ambiguous pairs of SNPs; TOP/BOT designation depends on the neighboring (flanking) sequence:
   A/T: TOP -> A: A; T: B
   A/T: BOT -> T: A; A: B
   C/G: TOP -> C: A; G: B
   C/G: BOT -> G: A; C: B

http://www.illumina.com/downloads/TOPBOT_technote27Jun06.pdf
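The A/B rules on this slide can be sketched in a few lines of Python. This is only an illustration of the table above, not Illumina's software: the function name `illumina_ab` is hypothetical, and for the ambiguous A/T and C/G pairs the TOP/BOT designation has to be supplied by the caller, since this sketch does not inspect the flanking sequence.

```python
def illumina_ab(snp_alleles, strand=None):
    """Return (designation, {nucleotide: 'A' or 'B'}) for a two-allele SNP.

    snp_alleles: the two nucleotides, e.g. "AG" or ("T", "C").
    strand: "TOP" or "BOT"; only needed for the ambiguous A/T and C/G pairs.
    """
    alleles = frozenset(a.upper() for a in snp_alleles)
    if len(alleles) != 2 or not alleles <= set("ACGT"):
        raise ValueError("expected two distinct nucleotides from A, C, G, T")

    if alleles in (frozenset("AT"), frozenset("CG")):
        # Ambiguous pair: the designation cannot be read off the alleles.
        if strand not in ("TOP", "BOT"):
            raise ValueError("A/T and C/G SNPs need an explicit TOP/BOT strand")
        if alleles == frozenset("AT"):
            allele_a = "A" if strand == "TOP" else "T"
        else:
            allele_a = "C" if strand == "TOP" else "G"
        designation = strand
    elif "A" in alleles:
        # A/C or A/G: TOP, and A is allele A.
        designation, allele_a = "TOP", "A"
    else:
        # T/C or T/G: BOT, and T is allele A.
        designation, allele_a = "BOT", "T"

    allele_b = next(iter(alleles - {allele_a}))
    return designation, {allele_a: "A", allele_b: "B"}
```

For example, `illumina_ab("AG")` gives TOP with A as allele A, while `illumina_ab("CG", "BOT")` gives G as allele A.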

5

Affymetrix Platform

• The confidence score is based upon a Wilcoxon signed-rank test of the mapping algorithm output for assignment of probe sets into a call zone.
– eg. Affymetrix may indicate that a confidence score <= 0.25 is used as the threshold for accepting the assignment based upon the best model (this can mean a call of AA, AB, BB or No Call). Probe sets with a confidence score > 0.25 are automatically assigned as No Call.

• They have used a new genotype calling algorithm (called BRLMM) since January 2007; it optimizes the call rate and confidence score and was shown to be better and more precise than the old algorithm.
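The thresholding rule can be illustrated with a short sketch. It assumes the calls are already available as (call, confidence score) pairs; the names `apply_confidence_threshold` and `NO_CALL` are hypothetical and not part of any Affymetrix tool.

```python
NO_CALL = "NoCall"  # hypothetical label for an unassigned genotype


def apply_confidence_threshold(calls, threshold=0.25):
    """Keep a call only when its confidence score is <= threshold.

    calls: iterable of (call, score) pairs, where call is "AA", "AB",
    "BB" or NO_CALL and a lower score means a more confident call.
    """
    return [call if score <= threshold else NO_CALL for call, score in calls]
```

Raising or lowering `threshold` trades call rate against accuracy, which is why the same value should be used for all samples in a study.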

6

Strand Orientation

• Two forward/reverse concepts:
– Classic: forward: 5’->3’; reverse: 3’->5’
  eg. rs4753770 5’...ATTAC[T/C]TTATT...3’
– Relative: forward or reverse relative to the reference genome, eg. HapMap genotypes in imputations.

ID  Name        Chr  Position  SNP    hapmap1  hapmap2  for_rev  ILMN Strand  Customer Strand  Top Genomic Sequence
1   rs3094315   1    792429    [T/C]  A        G        0        Bot          Bot              AAAC[G/A]TTT
2   rs12562034  1    808311    [T/C]  A        G        0        Bot          Top              AGTA[G/A]GTT
3   rs3934834   1    1045729   [A/G]  C        T        1        Top          Bot              GCTC[A/G]CCA
4   rs9442372   1    1058627   [T/C]  A        G        0        Bot          Top              TCCC[A/G]AGA
5   rs3737728   1    1061338   [A/G]  A        G        0        Top          Bot              GAGC[A/G]GAC
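Converting alleles between the two strands amounts to taking the complement of each nucleotide (and reversing the order for a sequence). The helpers below are a minimal sketch with hypothetical names; they also show why A/T and C/G SNPs are troublesome, since flipping them gives back the same pair of alleles.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(alleles):
    """Complement each allele, e.g. ("T", "C") -> ("A", "G")."""
    return tuple(COMPLEMENT[a] for a in alleles)


def reverse_complement(sequence):
    """Reverse complement of a flanking sequence read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def is_strand_ambiguous(alleles):
    """True for A/T and C/G SNPs, whose alleles look the same on both strands."""
    return set(flip_strand(alleles)) == set(alleles)
```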

7

Sample Illumina 317k data

From Dr. Frazier’s data on DID families with multiple cancers:

[Header]
BSGT Version 3.1.12
Processing Date 2/15/2008 4:05 PM
Content HumanHap300v2_A.bpm
Num SNPs 318237
Total SNPs 318237
Num Samples 64
Total Samples 64
[Data]
SNPName SampleID Allele1-Top Allele2-Top GCScore Allele1-Forward Allele2-Forward Allele1-Design Allele2-Design Allele1-AB Allele2-AB Chr GTScore BAlleleFreq LogRRatio Theta
rs10000010 6884-001 A G 0.9369 T C T C A B 4 0.9009 0.4810 0.0047 0.556
rs10000023 6884-001 A C 0.8234 T G T G A B 4 0.8051 0.5063 -0.3746 0.503
rs10000030 6884-001 A G 0.8159 A G T C A B 4 0.8002 0.4882 0.1877 0.760
rs1000007 6884-001 A A 0.8748 A A A A A A 2 0.8423 0.0027 0.1582 0.011
rs10000092 6884-001 A G 0.9265 T C A G A B 4 0.8894 0.5679 0.4354 0.565
rs10000121 6884-001 G G 0.8919 G G C C B B 4 0.8564 1.0000 -0.2516 0.994
rs1000014 6884-001 A G 0.7254 A G T C A B 16 0.7462 0.5249 -0.1187 0.487
rs10000141 6884-001 A G 0.8891 A G A G A B 4 0.8541 0.4859 0.0921 0.653
……
rs9989098 6763-016 A A 0.7683 A A T T A A 13 0.8113 0.0022 -0.2439 0.009
rs9989761 6763-016 A A 0.7768 T T A A A A 2 0.7758 0.0102 0.1003 0.058
rs9990047 6763-016 A G 0.8946 A G T C A B 3 0.8588 0.4937 -0.0845 0.644
rs999147 6763-016 A A 0.8130 A A A A A A 3 0.7983 0.0000 -0.2373 0.019
rs9992259 6763-016 A A 0.8340 A A A A A A 4 0.8123 0.0003 0.2165 0.002
rs9996716 6763-016 A G 0.8716 A G T C A B 4 0.8398 0.5048 -0.4071 0.459
rs670417 6763-016 A A 0.9307 A A A A A A 11 0.8939 0.0073 0.3097 0.033
rs2818058 6763-016 A A 0.8620 T T A A A A 6 0.8324 0.0000 -0.1547 0.010
rs542876 6763-016 A G 0.9010 A G T C A B 8 0.8644 0.4856 -0.1083 0.646
rs2466627 6763-016 A A 0.9242 A A T T A A 11 0.8870 0.0000 -0.1472 0.023
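A report laid out like this one, with a [Header] section followed by a [Data] section, could be read with a small parser such as the sketch below. It assumes the file on disk is tab-delimited (the slide only shows the fields separated by spaces), and the function name `parse_final_report` is hypothetical.

```python
def parse_final_report(text):
    """Split a report into a header dict and a list of per-genotype dicts.

    Assumes tab-delimited fields: "key<TAB>value" lines under [Header],
    then one line of column names and one line per genotype under [Data].
    """
    header, rows = {}, []
    columns = None
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1]  # "Header" or "Data"
            continue
        fields = line.split("\t")
        if section == "Header":
            header[fields[0]] = fields[1] if len(fields) > 1 else ""
        elif section == "Data":
            if columns is None:
                columns = fields  # first line of [Data] names the columns
            else:
                rows.append(dict(zip(columns, fields)))
    return header, rows
```

Each row then exposes fields by name, for example `row["GCScore"]` or `row["Allele1-AB"]`, which is convenient for the call-rate and allele-frequency checks that follow.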

8

Call Rates

Call rates indicate data quality for an individual or a marker.
• Genotype call rates for each typed individual: percentage of called genotypes among all markers typed.
– eg. remove the individual if < 90%
– In NIH’s rare immune-deficiency data, individual 26 in pedigree 1 was dropped due to an abnormally low call rate (64%). The image of the chip showed no evidence of a bubble or evaporation (an accident in the experiment), so the DNA was just of poor quality.
• Marker call rates: percentage of called genotypes among all typed people for each SNP.
– eg. remove the marker if < 95%, or even < 98% as proposed by Dr. Elston
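Both call rates are simple proportions over a samples-by-markers table. The sketch below assumes the genotypes are held as a dict mapping each sample ID to a list of calls in the same marker order, with `None` marking a missing call; the function names are hypothetical.

```python
def sample_call_rates(genotypes):
    """Fraction of markers with a called genotype, per individual."""
    return {
        sample: sum(g is not None for g in calls) / len(calls)
        for sample, calls in genotypes.items()
    }


def marker_call_rates(genotypes):
    """Fraction of individuals with a called genotype, per marker."""
    samples = list(genotypes)
    n_markers = len(genotypes[samples[0]])
    return [
        sum(genotypes[s][m] is not None for s in samples) / len(samples)
        for m in range(n_markers)
    ]


def low_call_rate_samples(genotypes, min_rate=0.90):
    """IDs of individuals below the threshold (90% as in the slide's example)."""
    return [s for s, rate in sample_call_rates(genotypes).items() if rate < min_rate]
```

Whether individuals or markers are filtered first can change the result, so the order is worth recording along with the thresholds.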

9

Minor Allele Frequencies?

• For SNPs (2 alleles at a locus):
  Frequency of minor allele (MAF) = (number of minor alleles)/2N
  N: number of typed persons in a study population
  So MAF <= 0.5
• General practice: remove SNPs of MAF < 0.01, or MAF < 0.05 (stricter threshold)
– Markers with low allele frequencies can have spuriously high values of r^2, which could complicate analyses. They may also cause loss of information and imprecision in analyses due to uncertainty in haplotype frequency estimates, especially in small samples.
– The very rare alleles should contribute very little to analysis anyway.
• Rare disease allele?
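The MAF formula can be written out directly. This sketch assumes each genotype is a two-character string such as "AG", with `None` for a missing call; the function name is hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF = (number of minor alleles) / 2N, with N the number of typed persons."""
    counts = Counter(allele for g in genotypes if g is not None for allele in g)
    total = sum(counts.values())  # this is 2N
    if total == 0:
        raise ValueError("no typed persons for this marker")
    if len(counts) > 2:
        raise ValueError("more than two alleles observed; not a biallelic SNP")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / total
```

With four typed persons carrying AA, AG, GG and AA, there are 3 G alleles out of 8, so the MAF is 0.375.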

10

Hardy-Weinberg Disequilibrium?

• Under random mating (with no selection, mutation or migration), allele and genotype frequencies do not change over generations, and genotype frequencies follow p^2, 2pq, q^2: the locus is in Hardy-Weinberg Equilibrium (HWE).

• Failing the HWE test may be due to:
– bad quality of genotyping
– real association signal
– related to copy number variation influencing disease risk!

• Better to leave HWD SNPs in the analyses and check the genotyping quality if they do give significant signals.
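One common way to test HWE at a biallelic SNP is a 1-degree-of-freedom chi-square goodness-of-fit test on the three genotype counts. The sketch below is an illustration of that test, not necessarily the one used in the analyses described here (an exact test is often preferred when counts are small); the function name is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test from genotype counts; returns (chi2, p_value)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Counts of 25, 50 and 25 fit the expected proportions exactly, whereas 50, 0 and 50 (no heterozygotes at all) deviate strongly and would flag the marker for a closer look at its genotype clusters.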

11

QC in Linkage/Pedigree Data Genotyping Errors

• Mendelian inconsistencies identified by Pedcheck
• Other errors: Algorithm in MERLIN identifies unlikely genotypes based on the inferred double recombination events, when erroneous genotypes can imply excessive and unlikely recombination events between tightly linked markers.
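The simplest kind of Mendelian inconsistency, in a father-mother-child trio at an autosomal marker, can be checked as sketched below. This is a toy illustration and not Pedcheck's algorithm, which also handles larger pedigrees and untyped relatives; genotypes are assumed to be two-character strings, with `None` for a missing call.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child could have inherited one allele from each parent."""
    if child is None or father is None or mother is None:
        return True  # cannot be checked, so not counted as an error
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def count_trio_inconsistencies(child, father, mother):
    """Number of markers at which the trio is inconsistent."""
    return sum(
        not is_mendelian_consistent(c, f, m)
        for c, f, m in zip(child, father, mother)
    )
```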

12

QC in Linkage/Pedigree Data Mendelian Inconsistencies

eg. NIH’s Carney project (Affymetrix 500k):
• Regenerating the genotypes with the BRLMM calling algorithm and a confidence score <= 0.25 as the threshold slightly decreased the number of inconsistencies (from 1300+ to 1000+ for chromosome 22).

• % inconsistencies found using Pedcheck (revised version with max allowed markers increased to 500k) in all pedigrees: CAR071: 87.7%; CAR104: 5.2%; CAR106: 5.9%; CAR615: 1.2%.

• CAR071: mother 01 is not consistent with any of the kids 02, 05, 06 and 07, which caused most of the inconsistencies found.

• Solution: remove genotypes of CAR071-01, and set other sporadic errors to missing.
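When one person accounts for most of the errors, as with the mother in CAR071, a per-pair tally makes this visible. The sketch below is a hypothetical helper, not part of Pedcheck: it counts markers at which a putative parent and child share no allele (opposite homozygotes), which is impossible for a true parent-offspring pair barring genotyping error.

```python
def pair_inconsistency_rate(parent, child):
    """Fraction of jointly typed markers where parent and child share no allele.

    parent, child: lists of two-character genotypes in the same marker
    order, with None for a missing call.
    """
    typed = errors = 0
    for p, c in zip(parent, child):
        if p is None or c is None:
            continue
        typed += 1
        if not set(p) & set(c):
            errors += 1
    return errors / typed if typed else 0.0
```

A rate far above the sporadic error level for every child of the same parent points to a sample or relationship problem with that parent rather than to isolated genotyping errors.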

13

QC in Linkage/Pedigree Data Relationships

eg. Alopecia Areata project (28 pedigrees on 370k combined with 10 new pedigrees on 317k): relationship checking using a modified version of Prest (with increased MAXSTR) identified three pairs of switched samples in two pedigrees, which caused the majority of the Mendelian inconsistencies.

14

(Pedigree figure; labels: EDF-10, EDF-11)

Famid  id1  id2  rel  #mar  EIBDobs  P0      P1      P2      pEIBD    pAIBS    pIBS
7      78   710  10   5262  NA       1.0000  0.0000  0.0000  NA       0.00000  0.00000
7      78   711  6    5167  NA       0.0000  1.0000  0.0000  NA       NA       0.00000
7      71   711  10   5175  NA       0.4257  0.5743  0.0000  NA       0.00000  0.00000
7      71   74   1    5251  0.8866   0.6515  0.3484  0.0000  0.00000  0.00000  0.00000
7      71   710  1    5272  0.9721   0.0028  0.9972  0.0000  0.08161  0.07109  0.06289

EDF 8 and 11 were supposed to be parents of 1, 2, 4, 5, 6, 7, 8, 9 and 10; but 10 is actually mother of 1 and 11.

15


16

Famid  id1      id2      rel  #mar  EIBDobs  P0      P1      P2      pEIBD    pAIBS    pIBS
19     1997251  1997252  6    5255  NA       0.8667  0.1333  0.0000  NA       NA       0.09102
19     1997251  1997253  10   5256  NA       0.0000  1.0000  0.0000  NA       0.00000  0.00000
19     1997251  1997254  10   5247  NA       0.0000  1.0000  0.0000  NA       0.00157  0.00237
19     1997251  1997255  10   5256  NA       0.0000  1.0000  0.0000  NA       0.11563  0.05443
19     1997251  1997256  6    5251  NA       0.8752  0.0998  0.0250  NA       NA       0.00003
19     1997251  1997257  6    5248  NA       0.8525  0.1337  0.0138  NA       NA       0.00000
19     1997251  1997258  6    5249  NA       0.9835  0.0083  0.0082  NA       NA       0.14835
19     1997252  1997253  10   5265  NA       0.7563  0.1818  0.0619  NA       0.00000  0.00000
19     1997252  1997254  10   5257  NA       0.6854  0.2716  0.0430  NA       0.00000  0.00000
19     1997252  1997255  10   5267  NA       0.6698  0.3029  0.0273  NA       0.00000  0.00000
19     1997252  1997256  1    5265  0.9638   0.0026  0.9974  0.0000  0.02389  0.00647  0.01357
19     1997252  1997257  4    5260  0.5361   0.0025  0.9555  0.0420  0.00000  0.00000  0.00000
19     1997252  1997258  4    5263  0.5486   0.0047  0.7093  0.2860  0.00000  0.00000  0.00000
19     1997253  1997254  1    5263  1.1325   0.0000  0.3276  0.6724  0.00000  0.00000  0.00000
19     1997253  1997255  1    5268  1.0559   0.0606  0.5659  0.3734  0.00049  0.00066  0.00059
19     1997253  1997256  4    5262  0.4877   0.6735  0.3265  0.0000  0.00295  0.00012  0.00069
19     1997253  1997257  5    5258  0.2441   0.8895  0.1105  0.0001  0.01599  0.03060  0.14822
19     1997253  1997258  5    5260  0.2846   0.0000  0.9586  0.0414  0.00000  0.00000  0.00000
19     1997254  1997255  1    5262  1.1278   0.0000  0.3432  0.6568  0.00000  0.00000  0.00000
19     1997254  1997256  4    5257  0.4872   0.6855  0.3144  0.0000  0.00196  0.00033  0.00321
19     1997254  1997257  5    5252  0.2502   0.7455  0.2545  0.0000  0.94248  0.74222  0.51948
19     1997254  1997258  5    5254  0.2846   0.0000  0.9344  0.0656  0.00000  0.00000  0.00000
19     1997255  1997256  4    5266  0.4728   0.9039  0.0961  0.0000  0.00000  0.00000  0.00000
19     1997255  1997257  5    5262  0.2573   0.5972  0.3704  0.0324  0.00271  0.00025  0.00001
19     1997255  1997258  5    5263  0.2845   0.0000  0.9452  0.0548  0.00000  0.00000  0.00000
19     1997256  1997257  10   5261  NA       0.0000  0.9101  0.0899  NA       0.00002  0.00000
19     1997256  1997258  10   5260  NA       0.0282  0.9718  0.0000  NA       0.00494  0.08682
19     1997257  1997258  1    5259  0.8987   0.5802  0.4197  0.0000  0.00000  0.00000  0.00000
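Tables like these are easier to scan if each pair's estimated sharing is compared with what common relationships predict. The sketch below assumes the P0, P1, P2 columns are estimated probabilities of sharing 0, 1 or 2 alleles identical by descent, and simply picks the nearest textbook expectation. It is a crude reading aid with hypothetical names, not what Prest computes (Prest reports formal test p-values).

```python
import math

# Expected (P0, P1, P2) for some common relationships.
EXPECTED_IBD = {
    "MZ twins or duplicate": (0.0, 0.0, 1.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "second degree": (0.5, 0.5, 0.0),  # half sibs, grandparent, avuncular
    "unrelated": (1.0, 0.0, 0.0),
}


def closest_relationship(p0, p1, p2):
    """Relationship whose expected IBD sharing is nearest to the estimate."""
    observed = (p0, p1, p2)
    return min(EXPECTED_IBD, key=lambda rel: math.dist(observed, EXPECTED_IBD[rel]))
```

For instance, a pair with P1 close to 1, such as 71 and 710 in family 7, looks like parent and offspring whatever relationship the pedigree file claims.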

17

QC in Linkage/Pedigree Data MZ Twins

Famid  Sib1    Sib2    %shared
1V     1V-176  1V-177  1

These two are MZ twins for sure as checked (same birthday).

Solution: remove one of them from the analyses. Otherwise the excess sharing can produce false positives.
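MZ twins (and duplicated samples) show up as pairs whose genotypes agree at essentially every marker. A minimal concordance check is sketched below; it is a stand-in for the sharing statistic in the table, whose exact definition the slide does not give, and the function name is hypothetical.

```python
def genotype_concordance(genotypes_1, genotypes_2):
    """Fraction of jointly typed markers with identical genotypes.

    Genotypes are two-character strings ("AB" and "BA" count as the same),
    with None for a missing call.
    """
    typed = same = 0
    for g1, g2 in zip(genotypes_1, genotypes_2):
        if g1 is None or g2 is None:
            continue
        typed += 1
        if sorted(g1) == sorted(g2):
            same += 1
    return same / typed if typed else 0.0
```

Values very close to 1 suggest MZ twins or a duplicate; ordinary siblings fall well below that once enough markers are compared.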

18

QC in Linkage/Pedigree Data Linkage Disequilibrium

• eg. Rheumatoid Arthritis project (Illumina 5k): dramatic decreases in LOD scores were noted on a few chromosomes when markers in LD were dropped, for example on chromosome 21.

19

QC in Association Data

The identification of large numbers of single nucleotide polymorphisms (SNPs) across the human genome and the recent development of technologies for massively multiplexed genotyping have made genome-wide association studies (GWAS) feasible.

20

QC in Association Data (borrowed from SAGE notes)

• Causes of association between a marker and a disease
– chance
– stratification, population heterogeneity
– selection
– tight linkage
– ? pleiotropy (several traits due to a single locus)

21

QC in Association Data Genotyping Errors

• Genotyping Error Detection in Samples of Unrelated Individuals. Nianjun Liu*, University of Alabama at Birmingham; Dabao Zhang, Purdue University; Hongyu Zhao, Yale University

22

QC in Association Data General QC Filters (eg. in plink)

• Remove people of low call rates: --mind 0.05
• Remove SNPs of low call rates, or low MAF, (or in HWD?): --geno 0.05 --maf 0.01 (--hwe 0.00001?)
• Identify related samples: --genome --min 0.15 --max 1
  Remove related samples: --remove
• Remove contaminated samples if any: excess relatedness across many samples
• Identify outliers: --read-genome --cluster --neighbour 15
  Remove outliers: --remove (z-score > 4)
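The z-score cut-off for outliers can be sketched as follows. The input is assumed to be one summary value per sample (for instance a nearest-neighbour similarity taken from the clustering output); the function name is hypothetical, and flagging on the absolute z-score is an assumption made for this illustration.

```python
import statistics


def flag_outliers(scores, threshold=4.0):
    """Sample IDs whose value lies more than `threshold` SDs from the mean.

    scores: dict mapping sample ID -> numeric summary value.
    """
    values = list(scores.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [s for s, v in scores.items() if abs(v - mean) / sd > threshold]
```

The flagged IDs would then go into the file passed to --remove.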

23

QC in Association Data Population Substructure

• Population substructure may cause false positive associations due to stratification caused by population heterogeneity, i.e. samples of different ancestry. Genetic variation determines different genetic risks of complex diseases.

• PLINK:
– determine GC-lambda: --assoc
– generate CMH clusters: --read-genome --cluster --ppc 0.05 --cc
  exclude individuals that don’t cluster
• Eigenstrat: principal-component-based correction for population stratification. Effective in improving error rates in association testing of candidate genes and in replication studies of WGA scans.
– Apply general QC filters first before running Eigenstrat!
– Exclude regions of known strong associations first before running smartpca.
– Select top N PC’s which are significant in Anova statistics for population differences.
– Add those regions of known strong associations back in genomic controlled association analysis.
– eg. https://cge.mdanderson.org/~wchen/User/RASNPs/Feb08/Eigenstrat/NARAC/Set2/
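The GC-lambda mentioned above is the genomic control inflation factor: the median of the observed 1-df association chi-square statistics divided by the median of a chi-square distribution with 1 degree of freedom (about 0.455). A minimal sketch with hypothetical function names:

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square with 1 degree of freedom


def genomic_control_lambda(chi2_stats):
    """Inflation factor: median observed statistic over the expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats, lam):
    """Divide each statistic by lambda; lambda below 1 is treated as 1."""
    factor = max(lam, 1.0)
    return [x / factor for x in chi2_stats]
```

A lambda close to 1 suggests little inflation from stratification or other artifacts, while values clearly above 1 suggest the test statistics need correcting or the stratification needs modelling.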

24

Thanks!
