Statistical Issues in Genetic Association Studies Eleanor Feingold, Ph.D. University of Pittsburgh March, 2011.
Underlying Principle of Genetic Mapping
People who have similar traits (phenotypes) should have greater than expected sharing of genetic material near the genes that influence those traits.

Basic study designs for gene mapping
- families: linkage analysis (or family-based association)
- unrelated individuals: association analysis
- semi-related individuals in an inbred population: ?

Association analysis (circa 2000)
Genotype counts by AA / Aa / aa: cases 65133202, controls 1681316

GWAS Study circa 2010
Genotype counts by AA / Aa / aa: cases 65133202, controls 1681316

So what's the BIG DEAL?
Well, not much, until you get into
1) the complexities of array data, and
2) the real science of genetics.

One important genetic subtlety
Even in a GWAS study, we can't test every variant on the genome. So
- at the design phase, we have to pick markers (SNPs) that we hope will cover the genome as well as possible, and
- at the testing phase, we do not expect that the marker we are testing is actually the causal variant. We are usually hoping (at best) that it is correlated with the true causal genetic variable.

(Figure labels: "Gene in here somewhere"; "After many generations ...")

Within a population, genotypes at nearby SNPs are correlated due to population history.
This correlation is called linkage disequilibrium.

Tag SNPs
Find a set of SNPs that captures most information at least cost.
How? Find clusters of SNPs that are highly correlated and then choose one representative from each cluster to genotype.
Easily-available, relatively idiot-proof software (e.g. Tagger).
Caveat 1: You need a database that knows lots of SNPs in your gene and has genotyped them in a fair number of people in the population you are studying (HapMap, SeattleSNPs).
Caveat 2: Beware of overly-aggressive tagging.

Conventional association vs. candidate gene sequencing
GWAS (tag SNP) study
1) Cheaper - more genes and more people, so higher power.
2) Find only common variation.
3) Probably do not find functional variants.
Candidate gene sequencing study
1) Expensive - fewer genes and fewer people, so lower power overall.
2) Find both common and rare variation.
3) Find functional variants.

GWAS Analysis
- Genotype calling
- Data cleaning
- Single-SNP analysis
- Other analyses
- CNVs

Genotype calling
(Figure labels: BB, AB, AA)
Generally done before you see the data.
But plenty of open questions about how to do it.
- best clustering methods?
- salvage data from messy clusters?

Data cleaning
Somewhat dependent on which chip you are using.
- Throw out bad SNPs and bad samples (% of genotypes called for each person and each SNP).
- Hardy-Weinberg testing
- Relationship testing
- Find major chromosomal anomalies
- Look for population stratification
- Look for signs of systematic problems (e.g. allele frequencies differ by sample processing date).

Data cleaning examples
- Plate effect on missing call rate per sample: ANOVA p-value = 6e-48, but no significant association between plate and case status (p = 0.20).
- Gender check
- Chromosomal anomalies

Testing Hardy-Weinberg
- Hardy-Weinberg Equilibrium (HWE) means that your three genotype groups occur in the expected p^2, 2pq, q^2 proportions.
- Departure from HWE most often indicates genotyping problems.
- But it can also indicate an actual genetic effect.
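The p^2, 2pq, q^2 expectation can be checked with a one-degree-of-freedom chi-squared goodness-of-fit test. The sketch below is only an illustration, not the method used in the talk: the function name `hwe_chi_square` and the genotype counts in the usage note are made up, and in practice an exact test is often preferred when a genotype class is rare.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-squared goodness-of-fit test for Hardy-Weinberg proportions.

    Hypothetical helper for illustration. Returns (chi2, p_value),
    using 1 degree of freedom (3 genotype classes, minus 1, minus 1
    for the allele frequency estimated from the data).
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no genotypes observed")

    # Estimate the frequency of allele A by counting alleles.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")

    # Expected genotype counts under HWE: n*p^2, n*2pq, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # For a chi-squared variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With made-up counts of 25 AA, 50 Aa, and 25 aa, the observed counts equal the expected ones, so the statistic is 0 and the p-value is 1. Counts of 50, 0, and 50 (no heterozygotes at all) give a statistic of 100 and a p-value far below the 10^-4 level.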
- When a SNP departs from HWE, check for case-control differences.
- Do your HWE tests by ethnicity, but don't expect admixed groups (Hispanics, African-Americans) to be in HWE.

(Figure panels: HWE 10^-4 < p < 0.5; HWE p < 10^-4)

Population stratification via principal components

Analysis

Case-control association test by allele ...
2 x 2 table (case/control by A/a)
(Fisher's exact test or chi-squared test)

And by genotype ...
2 x 3 table (case/control by AA/Aa/aa)
(Fisher's exact test or chi-squared test or Armitage trend test)

Simple association test at every SNP

Or use logistic regression
Lets you incorporate other predictors (age, sex, diet, whatever).
- G + E (genotype + environment model)
- G + E + GxE (interaction model)

GWAS results
Manhattan plot and QQ plot

What's the best single-SNP association test?
Not as solved a problem as you'd think.
If you knew the true model for the gene effect, you'd just fit that model. But you don't.
So which tests are robust over lots of models?

Chia-Ling Kuo's work
(Figure panels: MIN 2P; MIN 3P; MIN 4P)

Scan with Covariates
Which logistic regression model is best for testing GENETIC EFFECT?
- G: LR(G, NULL) ~ chi-squared, 1 df
- G+E: LR(G+E, E) ~ chi-squared, 1 df
- G+E+GE: LR(G+E+GE, E) ~ chi-squared, 2 df

Results
- Combination statistics (best of several statistics) are most robust, even after correction for multiple comparisons, but the linear trend test is also a good choice.
- To test for genetic effect, the G + E model is almost never advantageous. Just test G, or fit G + E + GxE if you're pretty sure there's an interaction.
- BIG CAVEAT: This assumes G and E are independent. If you are worried about confounding, you DO need to control for E when testing G.

More generally, should you use the same statistics you used for a small-scale study?
Maybe not.

Problem
Need to worry about the statistical properties of the extreme values of the test statistics.
What do I mean? Statisticians develop tests that behave sensibly on average. But in genomic problems, we do 10,000 or 500,000 of the same test and then follow up the top 100 results.
So we need test statistics for which the extreme values are well-behaved, not so much the averages.

Example from expression arrays: 10,000 t-tests
Analysis
- Compute t-statistic for each gene.
- Rank by absolute value of t-statistic.
Problem
- Ranked list is dominated by small-variance genes.
- With a small sample size, the SE estimates are very poor.
- If you estimate an SE poorly 10,000 times, some of the estimates will come out very small.
Solution
- Shrinkage estimator! (Add a fudge factor to the denominator of the t-statistic.)

Back to association studies ...
Whatever statistic you are using (1,000,000 times), you need to know the statistical behavior of the 1st - 50th highest order statistics, not the statistical behavior on average.
This issue has not really been dealt with in the association study literature.

A few other open statistical issues

Multiple testing
The problem
If you do 1,000,000 tests, you will produce a lot of false positives.
The solution
There isn't one!
- Be realistic about hypothesis generating vs. hypothesis testing.
- False discovery rate - controls percent of genes on list that are false.
- Permutation testing - controls for lots of correlated tests.

Imputation at untyped SNPs
The idea
- Use the HapMap database to impute genotypes for your samples at all the SNPs in between the ones you genotyped.
- Do a test at each of those SNPs in addition to the typed ones.
- Should increase overall study power even if multiple comparisons are correctly controlled for.
(Figure caption: blue at typed SNP => blue at untyped one as well)

Imputation at untyped SNPs
The best thing
Allows joint analyses of datasets that were genotyped with different chips!
Limitations
- Only helpful if correlation structure in HapMap is valid for your population.
- Only helpful for SNPs in the database (contrast to haplotype analysis).
Open questions
- Best imputation methods in theory and practice?
- What populations should you base the imputation on?
- Imputed SNPs have different statistical properties (e.g. slightly higher variance). How do we account for that?

Meta-analysis
Typical GWAS papers now combine results from many studies.
What are the best meta-analysis methods for doing this?
- What if same SNPs not typed in all studies?
- What if phenotype not measured the same way?
- What if some SNPs are imputed?

Software for genetic association studies
- PLINK is the primary tool. Bioinformatics is incorporated.
- There are some useful R packages as well.
- Need R for fancier analyses; typically integrate it with PLINK.
- Lots of new stuff constantly under development for large-scale data management and viewing: WGAViewer, LocusZoom.
- Lots of specialty packages for:
  - HWE
  - haplotype analysis
  - family association
  - other stuff
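For readers who want to see the Armitage (linear) trend test for a 2 x 3 genotype table in concrete form, here is a minimal sketch. It assumes additive scores 0, 1, 2 for the three genotype columns, which is a common choice but an assumption here; the function name `armitage_trend_test` and the counts in the usage note are made up for illustration and are not taken from the talk.

```python
import math


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x k case-control table.

    Hypothetical helper for illustration. `cases` and `controls` are
    genotype counts in the same column order as `scores`.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    if not (len(cases) == len(controls) == len(scores)):
        raise ValueError("cases, controls and scores must have equal length")

    col_totals = [r + s for r, s in zip(cases, controls)]
    R = sum(cases)       # total number of cases
    S = sum(controls)    # total number of controls
    N = R + S

    # Score sums: over cases, over everyone, and squared over everyone.
    A = sum(t * r for t, r in zip(scores, cases))
    B = sum(t * n for t, n in zip(scores, col_totals))
    C = sum(t * t * n for t, n in zip(scores, col_totals))

    # Trend statistic T and its variance under the null hypothesis.
    T = N * A - R * B
    var_T = (R * S / N) * (N * C - B * B)
    if var_T == 0:
        raise ValueError("no variation in case status or in genotype score")

    chi2 = T * T / var_T
    # For a chi-squared variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With identical made-up genotype counts in cases and controls, for example (10, 20, 10) in each group, the statistic is 0 and the p-value is 1. In this form the statistic equals N times the squared correlation between case status and genotype score, so the test has the most power when risk changes steadily with the number of copies of one allele.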