# Genome Wide Association Studies

## Transcript Of Genome Wide Association Studies

Genome Wide Association Studies

• What is a Genome Wide Association Study?
• What is a Genome Wide Linkage Study?
• Linkage vs Association (Risch and Merikangas 1996)
• Study Design
• Different methods for detecting association

Linkage Mapping

Basic Idea: Use the pattern of allele segregation in pedigrees (families) to estimate the recombination fraction (θ) between a marker locus and an unobserved trait locus (Sham (1998)).

[Figure: pedigree in which the mother carries haplotypes AB and ab and the father carries ab and ab. Each of the four children receives ab from the father; the haplotypes received from the mother are AB and ab (non-recombinants) and Ab and aB (recombinants).]

Out of 4 informative meioses 2 are recombinants ⇒ $\hat{\theta} = 2/4 = 1/2$, i.e. we can use pedigrees to estimate the “distance” between 2 markers.

What is a Genome Wide Association Study?

Goal: Uncover the genetic basis of a given disease.

Basic Idea: A rather vague idea of a study design that involves genotyping cases and controls at a large number ($10^4$ to $10^6$) of SNP markers spread (in some unspecified way) throughout the genome. Look for associations between the genotypes at each locus and disease status.

[Figure: example genotype data for four controls and four cases. Each string gives one individual's 0/1/2 genotype codes at 20 SNPs.]

01111101021220100011
20111200010110110100
20122012100110100111
12112111101110022202
11210121111212121211
22120100012212121021
01100210021112112010
01100102211112012112

Why might this be a good idea?

Suppose marker B is the causative disease locus but the genotype information at this marker is missing. Instead we have disease status for the individuals in the pedigree. A is a marker we think is close to B.

[Figure: pedigree showing genotypes at marker A for a mother (A a), a father (a a) and their children.]

Given a model of penetrance (i.e. how genotype dictates disease status) then for a given value of θ we can calculate the likelihood of the pedigree by summing over all the missing data conﬁgurations consistent with the data.

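$$L(\theta) = \sum_{G} \underbrace{P(X \mid G)}_{\text{Penetrance}} \times \underbrace{P(G_d \mid G_f; \theta)}_{\text{Transmission}} \times \underbrace{P(G_f)}_{\text{Founders}}$$

Thus we can obtain the MLE, $\hat{\theta}$ ⇒ likelihood ratio test based on $L(\hat{\theta}) / L(1/2)$.

Normally, the “lod score” is reported, where

$$\mathrm{lod}(\hat{\theta}) = \log_{10} \frac{L(\hat{\theta})}{L(1/2)}$$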

When disease status has a high correlation with the genotype at marker A then this suggests the recombination fraction between markers A and B is small i.e. A is close to B.
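The lod score can be illustrated with a minimal sketch in the simplest possible setting. It assumes phase-known, fully informative meioses, so that the likelihood reduces to the binomial form L(θ) = θ^r (1 − θ)^(n − r) for r recombinants among n meioses. That simplification and the function name `lod_score` are illustrative assumptions; real pedigrees need the full sum over missing genotypes described above.

```python
import math

def lod_score(recombinants, meioses):
    """Return (theta_hat, lod) for phase-known, fully informative meioses.

    Assumes L(theta) = theta**r * (1 - theta)**(n - r), which ignores
    missing data and penetrance (an illustrative simplification).
    """
    # The MLE is the observed recombinant fraction, capped at 1/2.
    theta_hat = min(recombinants / meioses, 0.5)

    def log10_likelihood(theta):
        value = 0.0
        # Skip terms with a zero count so that 0 * log(0) is treated as 0.
        if recombinants > 0:
            value += recombinants * math.log10(theta)
        if meioses - recombinants > 0:
            value += (meioses - recombinants) * math.log10(1.0 - theta)
        return value

    lod = log10_likelihood(theta_hat) - log10_likelihood(0.5)
    return theta_hat, lod

theta_hat, lod = lod_score(recombinants=1, meioses=10)
print(f"theta_hat = {theta_hat:.2f}, lod = {lod:.2f}")
```

With 1 recombinant in 10 meioses the estimate is 0.1 and the lod score is about 1.6. With 2 recombinants in 4 meioses the estimate is 1/2 and the lod score is 0, i.e. no evidence of linkage.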

Because recombination events are so rare we can use a sparse set of (Micro Satellite) markers spread throughout the genome to “map” the disease locus (usually between 200-500 markers).

Once areas of the Genome have been implicated Fine Mapping studies in candidate genes can be used to localize the causal locus.

It’s not quite as simple as I’ve made out (Sham (1998))

• Need to specify a penetrance function
• Founder haplotype probabilities
• Allow variable recombination rates in men and women?

It is also non-trivial to sum over all the missing data and considerable effort has been put into efficient calculation of the likelihoods (Abecasis et al. (2002)). [In much the same way that people have worked very hard to calculate likelihoods in population genetics models.]

Extensions

• Multiple loci
• Other covariates e.g. environmental factors
• Continuous traits


Figure from Wiltshire et al. (2001) AJHG 69:553-69 Type II Diabetes (Non-parametric Genome Wide Linkage Study)

Successes and Failures of Linkage Mapping

Linkage Mapping has been successful in identifying the genetic basis of many human diseases in which the disease penetrance resembles a simple Mendelian model

• Huntington’s disease, Cystic Fibrosis, some forms of breast cancer (Risch and Merikangas (1996))

But “the literature is now replete with linkage screens for an array of common ‘complex’ disorders such as

• schizophrenia, manic depression, autism, asthma, type I and type II diabetes, Multiple Sclerosis, Lupus

Although many of these studies have reported signiﬁcant linkage ﬁndings, none has lead to convincing replication”

Risch (2001)

What is a Complex Disease?

Not simply Mendelian! Possible departures from the simple Mendelian model include

Allelic Heterogeneity Different alleles at the same locus increase disease susceptibility i.e. not just one allele. (linkage mapping is robust to this effect)

Locus Heterogeneity Mutations at different loci increase disease susceptibility (linkage mapping is not robust to this effect)

Gene-Gene Interactions multiple loci “interact” to increase disease susceptibility

Environmental Factors e.g. diet, smoking

Gene-Environment Interactions

Non-Parametric Linkage

Design Two Affected Sibs and their parents (Micro Satellite Locus)

Basic Idea Count the number of alleles two affected sibs share Identical By Descent (IBD)

[Figure: three pedigrees, each with parents A1A2 and A3A4 and two affected sibs.]

| Sib 1 | Sib 2 | IBD |
|---|---|---|
| A1A3 | A1A3 | 2 |
| A1A3 | A1A4 | 1 |
| A1A3 | A2A4 | 0 |

If the marker is linked to the disease locus the affected sibs will tend to share the disease allele more often than they would at a marker unlinked to the disease locus.

⇒ There will be a departure from the null IBD distribution of $(1/4, 1/2, 1/4)$ for IBD = 0, 1, 2.

Note No distinction is made between which allele is shared IBD.
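One simple way to check for a departure from the null IBD distribution is a goodness-of-fit statistic on the counts of sib pairs sharing 0, 1 and 2 alleles IBD. The sketch below is only one possible choice of test, the counts are hypothetical, and the function name `asp_ibd_chisq` is made up for illustration.

```python
def asp_ibd_chisq(n_ibd0, n_ibd1, n_ibd2):
    """Pearson goodness-of-fit statistic against the null IBD distribution (1/4, 1/2, 1/4)."""
    observed = (n_ibd0, n_ibd1, n_ibd2)
    total = sum(observed)
    # Expected counts under no linkage.
    expected = (total / 4.0, total / 2.0, total / 4.0)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 100 affected sib pairs (not taken from any study).
statistic = asp_ibd_chisq(20, 50, 30)
print(statistic)
```

Excess sharing shows up as too many pairs with IBD = 2 and too few with IBD = 0. Under the null this statistic is compared with a chi-squared distribution with 2 degrees of freedom.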

Linkage Mapping vs Association Studies

In a widely quoted paper, Risch and Merikangas (1996), The Future of Genetic Studies of Complex Human Diseases, Science 273:1516-17, the authors pointed out that linkage studies have less power than association studies to detect the weak genetic effects exhibited by the loci involved in complex diseases.

Not quite that general. More specifically, the authors compared two specific methods of linkage and association that were popular at that time:

• Non-parametric linkage mapping using an Affected Sib Pairs (ASP) design

• Family based association using the Transmission Disequilibrium/Distortion Test (TDT)

TDT (Transmission Disequilibrium Test)

Design Affected Child and their parents (SNP Locus)

Basic Idea Compare the distribution of the transmitted allele to the distribution of the non-transmitted allele from heterozygous parents.

[Figure: three trios, each with two heterozygous (Aa) parents and one affected child.]

| Child | Score |
|---|---|
| AA | 2 |
| Aa | 1 |
| aa | 0 |

If the marker is linked to the disease locus one of the alleles will tend to be transmitted more often than if the marker was unlinked to the disease locus.

⇒ There will be a departure from the null distribution of $(1/2, 1/2)$.

Note This test does effectively focus on a speciﬁc allele.

Disease Model

Risch and Merikangas assumed that the disease locus was diallelic with the Genotypic Relative Risk (GRR) increasing in a multiplicative fashion

$$P(\text{Disease} \mid Aa) = \gamma \, P(\text{Disease} \mid aa)$$

$$P(\text{Disease} \mid AA) = \gamma^2 \, P(\text{Disease} \mid aa)$$

For both tests they assumed the marker locus is completely linked to the disease locus and calculated the IBD distribution and transmission distribution for given values of γ.

Q. How many families do you need for 80% power?

ASP: They assumed they had 500 MS markers spread throughout the genome ⇒ $\alpha = 10^{-4}$

TDT: They assumed they had 1,000,000 SNP markers spread throughout the genome ⇒ $\alpha = 5 \times 10^{-8}$
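Both significance levels are what one gets by dividing a genome-wide error rate of 0.05 by the number of markers tested. The sketch below assumes that this Bonferroni-style reasoning is how the thresholds arise; the 0.05 figure and the function name `per_test_alpha` are assumptions, not something stated on the slide.

```python
def per_test_alpha(genome_wide_alpha, n_markers):
    """Bonferroni-style per-marker significance level."""
    return genome_wide_alpha / n_markers

asp_alpha = per_test_alpha(0.05, 500)        # 500 microsatellite markers
tdt_alpha = per_test_alpha(0.05, 1_000_000)  # 1,000,000 SNP markers
print(asp_alpha, tdt_alpha)
```

The TDT threshold is 2,000 times more stringent than the ASP threshold, which is the handicap referred to when the sample sizes for the two designs are compared.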

Common variants or rare alleles?

The common disease/common variant hypothesis (CD/CV) holds that alleles at relatively high frequencies (> 1%) represent a signiﬁcant proportion of susceptibility alleles for common disease.

Their high frequency implies that association studies in large population cohorts will be fruitful for identifying risk alleles.

Based on recent empirical evidence some people have suggested that we may be able to characterize the variation in the human genome using a block like structure of common haplotypes.

[Figure: haplotype block structure along the genome of an individual; based on a figure in Paabo (2003).]

If this is true then association studies may proceed by typing just those SNPs (Haplotype tagging SNPs or htSNP’s) that code the common haplotypes.

The ‘HapMap’ project will investigate this issue (Nature Genetics (2001) 29:353-4)

Power of ASP vs TDT

| γ | Allele Frequency | ASP (N) | TDT (N) |
|---|---|---|---|
| 2 | 0.01 | 296,710 | 5,823 |
| 2 | 0.1 | 5,382 | 695 |
| 1.5 | 0.01 | 4,620,807 | 19,320 |
| 1.5 | 0.1 | 67,816 | 2,218 |

Even with the much more stringent Type I error TDT is seen to have much more power to detect an effect.

Note 1 The power of both tests depends on allele frequency. If the disease allele is rare it reduces the number of heterozygous parents with the mutant allele.

Note 2 Deeper families will have more power than the ASP design but there will also be a dependence on allele frequency.

In opposition is the common disease/rare allele hypothesis (CD/RA) which holds that there is no reason to expect that most common genetic diseases result from common alleles.

Simulations from models based on empirical parameter estimates (Pritchard (2001)) suggest that this may be the case and that we should expect extensive allelic heterogeneity.

Many people also expect extensive locus heterogeneity.

If this is the case then Haplotype Maps will be of limited use and family studies in unusual populations (e.g. Iceland) may be the only way to go.

Maybe a mixture of both?

Genome Wide Association Studies in Practice

Risch and Merikangas (1996) say that to detect a disease allele with a frequency of 0.1 and GRR = 1.5 we need to genotype 2,218 families at 1,000,000 SNP loci. This isn’t a solution. There are still lots of questions.

• Is this design practical? Costs? Pooling?
• Is TDT the best association design? Families vs Unrelated? MALD?
• How do both these approaches cope with population stratification?
• What if we don’t type the causative marker but one in LD with the marker?
• How do allelic heterogeneity, locus heterogeneity, gene-gene interactions and gene-environment interactions impact association studies?

The data produced by the Haplotype Map project will have a large inﬂuence on association study design.

Multinomial Likelihood

Data

| Genotype | Case | Control | Total |
|---|---|---|---|
| aa | r0 = 100 | s0 = 130 | n0 |
| aA | r1 = 400 | s1 = 390 | n1 |
| AA | r2 = 500 | s2 = 480 | n2 |
| Total | R | S | N |

We want to test whether Case/Control status is associated with genotype, i.e. is there a difference in the distribution of genotypes between Cases and Controls?

Thus, we can write our null and alternative hypotheses in terms of the conditional probabilities of genotype given case/control status, parameterized by $p = \{p_0, p_1, p_2\}$ and $q = \{q_0, q_1, q_2\}$.

| Genotype | Case | Control |
|---|---|---|
| aa | p0 | q0 |
| aA | p1 | q1 |
| AA | p2 | q2 |

In terms of these probabilities, we can write $H_0 : p = q$ vs $H_1 : p \neq q$.

The likelihood is multinomial

$$L(p, q) = p_0^{r_0} p_1^{r_1} p_2^{r_2} \, q_0^{s_0} q_1^{s_1} q_2^{s_2}$$

Under H0 the MLEs of p and q are both $v = \{n_0/N, n_1/N, n_2/N\}$. Under H1 the MLEs are $\hat{q} = \{s_0/S, s_1/S, s_2/S\}$ and $\hat{p} = \{r_0/R, r_1/R, r_2/R\}$.

[Figure: overview of the methods covered, with the following labels.]

• Linkage (families): Parametric (Pedigree Likelihood Based), Non-Parametric (IBD counting)
• Linkage vs Association: Risch and Merikangas (1996)
• Association (unrelateds): Likelihood Ratio Tests (logistic, log-linear, multinomial), Score Tests, Haplotype Tests
• TDT (unrelated families)
• MALD
• Structure: Structured Association, Genomic Control, Prospective Studies, Matched Case-Controls
• Pooling
• Fine Mapping (see Prof Hein’s lecture)

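We can test H0 vs H1 using the log-likelihood ratio (LLR) statistic

$$\mathrm{LLR} = -2 \log \frac{L(v, v)}{L(\hat{p}, \hat{q})}$$

The LLR statistic can be written as

$$\mathrm{LLR} = 2 \sum_{i=0}^{2} \left[ r_i \log \frac{r_i}{\hat{r}_i} + s_i \log \frac{s_i}{\hat{s}_i} \right]$$

where $\hat{r}_i$ and $\hat{s}_i$ are the fitted frequencies under H0. For example,

$$\hat{r}_0 = N \times \frac{R}{N} \times \frac{n_0}{N}$$

Under H0 this statistic has a $\chi^2_2$ distribution (asymptotically).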

For the example on the previous slide LLR = 4.45899
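The quoted value can be checked directly from the 3 × 2 table of counts. The following is a minimal sketch; the function names are illustrative, and the p-value uses the fact that the upper tail of a chi-squared distribution with 2 degrees of freedom is exp(−x/2).

```python
import math

def multinomial_llr(cases, controls):
    """LLR statistic 2 * sum(observed * log(observed / fitted)) for a genotype-by-status table."""
    R, S = sum(cases), sum(controls)
    N = R + S
    total = 0.0
    for r, s in zip(cases, controls):
        n = r + s              # genotype total n_i
        r_hat = R * n / N      # fitted case count under H0
        s_hat = S * n / N      # fitted control count under H0
        # A zero cell contributes nothing (0 * log 0 is taken as 0).
        if r > 0:
            total += r * math.log(r / r_hat)
        if s > 0:
            total += s * math.log(s / s_hat)
    return 2.0 * total

def chi2_sf_2df(x):
    """Upper tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

llr = multinomial_llr([100, 400, 500], [130, 390, 480])
p_value = chi2_sf_2df(llr)
print(f"LLR = {llr:.5f}, p = {p_value:.4f}")
```

The statistic agrees with the 4.45899 quoted above, and the corresponding p-value is a little over 0.1, so these example data would not reject H0 at conventional levels.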

Logistic Regression

Alternatively, we can ﬁt a logistic regression model to the data.

Each subject in our sample consists of a $(y_i, x_i)$ pair where $y_i$ is case/control status (1/0) and $x_i \in \{0, 1, 2\}$ is the genotype at the typed locus.

The relationship between y and x is modelled using the likelihood

$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{(1 - y_i)}$$

where

$$\eta_i = \log \frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i$$

This statistical model is equivalent to the genetics model in which the odds of disease given genotype increase in a multiplicative fashion.

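$$\text{Odds of genotype } aa = \frac{P(D \mid aa)}{1 - P(D \mid aa)}$$

| Genotype | x | Odds |
|---|---|---|
| aa | 0 | α |
| Aa | 1 | α(1 + θ) |
| AA | 2 | α(1 + θ)² |

With $\pi_A = P(A)$ and p the disease prevalence in the population,

$$p = \frac{\alpha}{1 + \alpha} (1 - \pi_A)^2 + 2 \, \frac{\alpha(1 + \theta)}{1 + \alpha(1 + \theta)} \, \pi_A (1 - \pi_A) + \frac{\alpha(1 + \theta)^2}{1 + \alpha(1 + \theta)^2} \, \pi_A^2$$

The relationship between the models is

$$\beta_0 = \log \alpha, \qquad \beta_1 = \log(1 + \theta)$$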
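To make the mapping between the genetic and statistical parameterizations concrete, here is a short sketch that computes the genotype odds, the implied disease prevalence and the corresponding β0 and β1. The parameter values are hypothetical, the genotype frequencies assume Hardy-Weinberg proportions as in the prevalence formula, and the function name `multiplicative_odds_model` is made up for illustration.

```python
import math

def multiplicative_odds_model(alpha, theta, pi_a):
    """Genotype odds, disease prevalence and logistic coefficients for the multiplicative odds model."""
    # Odds of disease for x = 0, 1, 2 copies of allele A.
    odds = [alpha * (1.0 + theta) ** x for x in (0, 1, 2)]
    # Convert odds to risks P(D | genotype).
    risks = [o / (1.0 + o) for o in odds]
    # Genotype frequencies under Hardy-Weinberg proportions.
    freqs = [(1.0 - pi_a) ** 2, 2.0 * pi_a * (1.0 - pi_a), pi_a ** 2]
    prevalence = sum(f * r for f, r in zip(freqs, risks))
    beta0 = math.log(alpha)
    beta1 = math.log(1.0 + theta)
    return odds, prevalence, beta0, beta1

# Hypothetical values chosen only to illustrate the arithmetic.
odds, prevalence, beta0, beta1 = multiplicative_odds_model(alpha=0.1, theta=0.5, pi_a=0.2)
print(f"prevalence = {prevalence:.4f}, beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
```

On the log scale the odds are linear in the genotype code, log(odds) = β0 + β1 x, which is exactly the linear predictor of the logistic regression model.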

Score Tests

Score tests can be thought of as approximate likelihood ratio tests.

Suppose the likelihood can be written as L(α, β) and $H_0 : \beta = 0$. These tests take the form

$$U^T V^{-1} U$$

where U is the vector of first derivatives of the log-likelihood w.r.t. β evaluated at the MLE of α under H0 and β = 0, and V is the covariance matrix of U under H0. Under H0 this statistic has a χ² distribution (asymptotically), with degrees of freedom equal to the dimension of β; for the multinomial likelihood this is $\chi^2_2$.

Applying this method to the multinomial likelihood results in the statistic

$$\sum_{i=0}^{2} \left[ \frac{(r_i - \hat{r}_i)^2}{\hat{r}_i} + \frac{(s_i - \hat{s}_i)^2}{\hat{s}_i} \right]$$

This is the “standard” Pearson’s Chi-squared test statistic for a 2 × 3 contingency table. For the example used before the Score test = 4.4478
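A minimal sketch of this calculation on the example table is given below. It uses the same fitted counts under H0 as the LLR statistic; the function name `pearson_chisq` is illustrative.

```python
def pearson_chisq(cases, controls):
    """Pearson chi-squared statistic for a genotype-by-status contingency table."""
    R, S = sum(cases), sum(controls)
    N = R + S
    statistic = 0.0
    for r, s in zip(cases, controls):
        n = r + s
        r_hat = R * n / N   # fitted case count under H0
        s_hat = S * n / N   # fitted control count under H0
        statistic += (r - r_hat) ** 2 / r_hat + (s - s_hat) ** 2 / s_hat
    return statistic

score_stat = pearson_chisq([100, 400, 500], [130, 390, 480])
print(f"Pearson chi-squared = {score_stat:.4f}")
```

The result is very close to the LLR value of 4.45899 for the same data, which illustrates the sense in which a score test approximates the likelihood ratio test.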

Applying this method to the logistic regression model with ηi = β0 + β1xi results in

$$\frac{N \{ N (r_1 + 2 r_2) - R (n_1 + 2 n_2) \}^2}{R (N - R) \{ N (n_1 + 4 n_2) - (n_1 + 2 n_2)^2 \}}$$

which is equivalent to Armitage’s Trend test statistic i.e. a popular test statistic used in testing for association at a given SNP marker.

For the example used before the Score test = 2.69179
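The trend statistic can be evaluated directly from the table margins. The sketch below is a plain transcription of the formula; the function names are illustrative, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def trend_test(cases, controls):
    """Armitage trend statistic for genotype counts ordered (aa, aA, AA)."""
    r0, r1, r2 = cases
    n0, n1, n2 = (r + s for r, s in zip(cases, controls))
    R = r0 + r1 + r2
    N = n0 + n1 + n2
    numerator = N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2)) ** 2
    denominator = R * (N - R) * (N * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    return numerator / denominator

def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

trend_stat = trend_test([100, 400, 500], [130, 390, 480])
trend_p = chi2_sf_1df(trend_stat)
print(f"trend statistic = {trend_stat:.5f}, p = {trend_p:.3f}")
```

Because this test spends a single degree of freedom on a trend in the number of A alleles, its value differs from the 2-df Pearson statistic computed on the same table.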

There is no explicit formula for the MLE of β in the logistic regression model. The likelihood is maximized numerically using an algorithm called Iteratively Re-Weighted Least Squares (IRWLS). The algorithm is derived by using a Newton-Raphson algorithm to maximize the likelihood.

Using this framework we can test the hypothesis that there is an association which takes the form of a multiplicative increase in the odds of disease.

Writing $\beta^{(0)}$ as the MLE of β under H0 and $\beta^{(1)}$ as the MLE of β under H1, the LLR statistic is

$$\mathrm{LLR} = -2 \log \frac{L(\beta^{(0)})}{L(\beta^{(1)})}$$

For the example used before LLR = 2.693173
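The numerical maximization can be sketched with a hand-written Newton-Raphson iteration for the two-parameter model, working on counts grouped by genotype. This is a minimal illustration rather than a general IRWLS implementation: the fixed iteration count, the zero starting values and the function names are all assumptions made for this example.

```python
import math

# (genotype code x, number of cases, number of controls) for the example table
GROUPS = [(0, 100, 130), (1, 400, 390), (2, 500, 480)]

def log_likelihood(b0, b1, groups):
    """Binomial log-likelihood of the logistic model on grouped data."""
    ll = 0.0
    for x, r, s in groups:
        eta = b0 + b1 * x
        log1pexp = math.log1p(math.exp(eta))
        # log p = eta - log(1 + e^eta), log(1 - p) = -log(1 + e^eta)
        ll += r * (eta - log1pexp) - s * log1pexp
    return ll

def fit_logistic(groups, n_iter=25):
    """Newton-Raphson for (b0, b1), starting from zero."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, r, s in groups:
            n = r + s
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = r - n * p          # contribution to the score vector
            w = n * p * (1.0 - p)      # contribution to the information matrix
            u0 += resid
            u1 += resid * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} U, with the 2x2 inverse written out.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1

b0_hat, b1_hat = fit_logistic(GROUPS)
total_cases = sum(r for _, r, _ in GROUPS)
total_controls = sum(s for _, _, s in GROUPS)
# Under H0 (b1 = 0) the MLE of b0 is the log of the overall case/control odds.
b0_null = math.log(total_cases / total_controls)
logistic_llr = 2.0 * (log_likelihood(b0_hat, b1_hat, GROUPS)
                      - log_likelihood(b0_null, 0.0, GROUPS))
print(f"b0 = {b0_hat:.4f}, b1 = {b1_hat:.4f}, LLR = {logistic_llr:.4f}")
```

The LLR from this fit should come out close to the value quoted above. The first Newton step from zero also yields the score statistic, which is why the trend test and the logistic LLR are so similar on these data.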

The logistic regression framework allows us more flexibility to include other covariates in the linear predictor $\eta_i$, e.g. genotypes at other loci, environmental covariates, interaction terms etc.

The flexibility implies that the LLR test using the multinomial likelihood is a special case of the logistic regression model when we set the linear predictor to have the form

$$\eta_i = \beta_0 + \beta_1 x_i + \beta_2 y_i$$

where $y_i = I[x_i > 1]$ (here $y_i$ denotes an indicator covariate, not case/control status) and $H_0 : \beta_1 = \beta_2 = 0$.

Multiple Alleles

The tests/models can be extended to the case when we have more than two alleles at a given locus.

Genotype Models

Suppose the locus has K alleles then there will be K(K + 1)/2 possible genotypes at the locus.

Thus we might think that each genotype has a different effect on the odds of disease.

In a logistic regression framework we can model this using a linear predictor of the form

$$\eta_{(i,g)} = \log \frac{p_{(i,g)}}{1 - p_{(i,g)}} = \beta_g$$

where g codes for genotype. This model effectively models interactions between the alleles. We can test for an effect at the locus using a LLR test with K(K + 1)/2 − 1 degrees of freedom. When K is large the test will have low power.

Allelic Models

We can reduce the number of parameters in the model by assuming a speciﬁc form for the interaction between alleles at each locus i.e. they interact to increase the odds of disease in a multiplicative fashion.

$$\eta_{(i,j,k)} = \log \frac{p_{(i,j,k)}}{1 - p_{(i,j,k)}} = \beta_0 + \beta_j + \beta_k$$

where j and k code for the two alleles that make up a specific genotype.

We can test for an effect at the locus using a LLR test with K − 1 degrees of freedom.
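The difference in model size between the genotype and allelic models is easy to tabulate. The sketch below enumerates the unordered genotypes for a hypothetical locus with K = 4 alleles; the allele labels and the function name `model_sizes` are illustrative.

```python
from itertools import combinations_with_replacement

def model_sizes(alleles):
    """Enumerate unordered genotypes and return the degrees of freedom of each model."""
    genotypes = list(combinations_with_replacement(alleles, 2))
    k = len(alleles)
    return {
        "genotypes": genotypes,
        "genotype_model_df": len(genotypes) - 1,  # K(K + 1)/2 - 1
        "allelic_model_df": k - 1,                # K - 1
    }

sizes = model_sizes(["A1", "A2", "A3", "A4"])
print(len(sizes["genotypes"]), sizes["genotype_model_df"], sizes["allelic_model_df"])
```

With 4 alleles there are 10 genotypes, so the genotype model test has 9 degrees of freedom against 3 for the allelic model. The gap grows quadratically with K, which is why the genotype model loses power for highly polymorphic markers.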

The log link form of this model can be generalized (Clayton (2001)).

Alternatively we could carry out K separate 1-df tests, each one focussing on a specific allele and combining all the other alleles.

If we do this we need to correct for the multiple testing involved. This can be achieved through simulation.

Sham and Curtis (1995) search over all possible 1 df comparisons obtainable by collapsing alleles into two groups and construct p-values through simulation.

TDT

The TDT compares the distribution of transmitted and non-transmitted alleles from the parents of affected offspring (Spielman et al. (1993)).

| Transmitted allele | Non-transmitted A1 | Non-transmitted A2 | Total |
|---|---|---|---|
| A1 | n11 | n12 | n1· |
| A2 | n21 | n22 | n2· |
| Total | n·1 | n·2 | 2n |

Example: a trio with parents A1A2 and A1A2 and an affected child A1A1 would contribute 2 to n12.

If the marker is unlinked to the causative locus then we expect n12 = n21. Thus we can test for a distortion in the transmission distribution using the test statistic

$$\frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}$$

which is asymptotically $\chi^2_1$.
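The statistic depends only on the two off-diagonal counts, i.e. on transmissions from heterozygous parents. A minimal sketch with hypothetical counts follows; the function names are illustrative, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def tdt_statistic(n12, n21):
    """TDT statistic from the counts of heterozygous-parent transmissions.

    n12: parents transmitting A1 and not A2; n21: parents transmitting A2 and not A1.
    """
    if n12 + n21 == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (n12 - n21) ** 2 / (n12 + n21)

def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical counts: 100 heterozygous parents, 60 of whom transmit A1.
tdt = tdt_statistic(60, 40)
tdt_p = chi2_sf_1df(tdt)
print(f"TDT = {tdt:.2f}, p = {tdt_p:.4f}")
```

Homozygous parents never enter the calculation, which is the sense in which the design does not use them.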

Various extensions of this method exist, e.g. multi-allelic markers, multiple siblings, missing parental data.

Haplotype Tests

If we don’t observe that causative locus directly we need to pick up the signal of association using a marker or markers in LD with the causative marker. It makes sense to try to combine the information in several SNP markers by considering the haplotype effects at these markers.

In practice, we will probably not observe the haplotypes of each individual at a given set of markers. There has been a lot of recent work in the area of haplotype phase reconstruction so this is not a serious problem (Stephens et al. (2001)).

Approach 1 If we assume we know the haplotypes at a given set of markers one approach we can take is to treat the set of markers as one multi-allelic marker as discussed before.

Approach 2 We might think that haplotypes that are “similar” will have similar effects on disease susceptibility. This idea has led to approaches that use ideas from spatial modelling of disease risk to impose this type of “prior” structure on the estimation of genotype and haplotype risks (Clayton and Jones (1999), Seaman et al. (2002)).

Population Structure

[Figure: genotype (0, 1, 2) distributions in cases and controls, each made up of individuals from Population 1 and Population 2.]

Spurious association is caused by the co-occurrence of 2 factors

• A difference in proportion of individuals from two (or more) subpopulations in cases and controls

• Subpopulations have differing allele frequencies at the locus
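A small worked calculation shows how these two factors combine. In the sketch below the allele frequency within each subpopulation is the same for cases and controls, so there is no true association, yet the pooled frequencies differ. All numbers and the function name `pooled_allele_frequency` are hypothetical.

```python
def pooled_allele_frequency(subpop_fractions, subpop_allele_freqs):
    """Allele frequency in a sample that mixes subpopulations in the given fractions."""
    return sum(w * f for w, f in zip(subpop_fractions, subpop_allele_freqs))

# Hypothetical allele frequencies in subpopulations 1 and 2 (same for cases and controls).
allele_freqs = [0.1, 0.5]
# Hypothetical ancestry make-up of the case and control samples.
case_fractions = [0.8, 0.2]
control_fractions = [0.2, 0.8]

case_freq = pooled_allele_frequency(case_fractions, allele_freqs)
control_freq = pooled_allele_frequency(control_fractions, allele_freqs)
print(f"cases: {case_freq:.2f}, controls: {control_freq:.2f}")
```

If either factor is removed, i.e. equal ancestry fractions in cases and controls or equal allele frequencies across subpopulations, the pooled frequencies coincide and the spurious signal disappears.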

There are 2 general strategies for protecting against structure in case/control studies.

Structured Association

Basic Idea Try to infer (discover) the structure and then condition on the structure when testing for association.

Basic Model

• Assume we have data on N individuals at L (unlinked) loci.
• Assume that there are K underlying populations.
• Each individual has a parameter that indicates population membership, $z_i$.
• Each population has a set of parameters that specifies the allele frequencies at all of the markers, $p_k = \{p_{k1}, \ldots, p_{kL}\}$, $k = 1, \ldots, K$.

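[Figure: example data for the model. Each row is an individual with a population label z (1 or 2) and genotypes at 9 loci, e.g. z = 1: 010012021, 020112110; z = 2: 101211020, 212021120, 212111020. The population allele frequencies are p1 = (0.2, 0.8, 0.01, 0.05, 0.7, 0.9, 0.5, 0.6, 0.1) and p2 = (0.7, 0.4, 0.8, 0.5, 0.9, 0.6, 0.3, 0.99, 0.01).]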

The model speciﬁes P (Data|Z, P, K)

Genomic Control

Basic Idea: Correct the null distribution of the Chi-squared statistic for the effects of structure. If we use the logistic regression score statistic to test association at a given locus, population structure in our data will tend to skew the null distribution.

[Figure: density of the test statistic under the null, comparing no structure with structure.]

In practice we won’t know the number of populations K so we want to calculate P(K|Data), which we can calculate (in principle) in the following way

$$P(K \mid \text{Data}) \propto P(\text{Data} \mid K) = \int\!\!\int P(\text{Data} \mid Z, P, K) \, \pi(Z) \, \pi(P) \, dZ \, dP$$

where π(Z) and π(P) are prior distributions on the parameters. In practice, the integral is approximated using Markov Chain Monte Carlo (MCMC) techniques. The program structure (Pritchard et al. (2000a)) uses a novel method of calculating this integral that seems to work very well on many data sets. The actual model that structure uses allows for admixture between populations. More recently, the model has been extended to include a more realistic model of the correlations that occur between populations (Marchini and Cardon (2002)).

Once the number of populations has been inferred we can obtain an estimate of the structure Z. Pritchard et al. (2000b) propose a likelihood ratio test (STRAT) to test the null hypothesis that the subpopulation allele frequencies are independent of the disease phenotype at a given marker.

Devlin and Roeder (1999) show that the null distribution of this statistic in the presence of structure is $\lambda \chi^2_1$, where the constant λ is a function of the structure present in the data. These authors suggest estimating λ from the empirical distribution of the statistics across all loci tested.

Pritchard and Donnelly (2001) show that

$$\lambda \approx 1 + N F_{ST} \sum_{k=1}^{K} (d_k - c_k)^2$$

where $d_k$ and $c_k$ are the fractions of cases and controls from subpopulation k, N is the number of cases and controls and $F_{ST}$ is a statistic that measures the level of structure between subpopulations. Note: λ increases with sample size.
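The approximation for λ is simple enough to evaluate by hand, and the sketch below does so for hypothetical inputs. Dividing an observed statistic by λ, so that it can be referred to an ordinary chi-squared distribution with 1 degree of freedom, follows from the λχ² null distribution; the numbers and function names are assumptions chosen for illustration.

```python
def gc_inflation(n_total, f_st, case_fractions, control_fractions):
    """Approximate inflation factor: 1 + N * F_ST * sum_k (d_k - c_k)^2."""
    spread = sum((d - c) ** 2 for d, c in zip(case_fractions, control_fractions))
    return 1.0 + n_total * f_st * spread

def gc_corrected(statistic, lam):
    """Rescale a statistic whose null distribution is lambda * chi-squared(1)."""
    return statistic / lam

# Hypothetical study: 2000 subjects, F_ST = 0.01, two subpopulations.
lam = gc_inflation(2000, 0.01, case_fractions=[0.6, 0.4], control_fractions=[0.5, 0.5])
lam_double = gc_inflation(4000, 0.01, case_fractions=[0.6, 0.4], control_fractions=[0.5, 0.5])
print(f"lambda = {lam:.2f}, with doubled sample size = {lam_double:.2f}")
print(f"corrected statistic = {gc_corrected(7.0, lam):.2f}")
```

Doubling the sample size doubles the excess of λ over 1, so a modest imbalance in ancestry between cases and controls matters more in larger studies.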

Accuracy of Genomic Control

The accuracy of GC will depend on 2 factors

• how good the $\lambda \chi^2_1$ approximation is to the true null distribution
• bias and variance in the estimation of λ

We have compared the empirical null distribution of GC to its theoretical distribution using simulations from models fitted to real datasets. We’ve found that the accuracy of GC can go both ways.

• In small samples the correction can be quite conservative.
• In large samples the correction can be quite liberal.

[Figure: two panels plotting theoretical −log10(p-value) against empirical −log10(p-value), with curves labelled GC and No correction. Panel titles: n = 1000, RR = 10 and n = 100, RR = 2.]

Power : STRAT vs GC vs TDT


Table copied from Pritchard and Donnelly (2001)

MALD

Structure can be used to help map disease genes (Collins-Schramm et al. (2002)).

Mapping by Admixture Linkage Disequilibrium (MALD) relies on recent admixture between 2 populations to create LD between markers that differ in allele frequency between the populations.

If the causative locus differs in allele frequency between 2 populations we can map the locus using a set of Ethnic Difference Markers (EDMs) spread throughout the genome.

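[Figure: admixture of Pop 1 and Pop 2 into an admixed population, showing a causative locus together with linked and unlinked marker loci; the figure labels include frequencies of 60% and 30%.]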

Other designs that protect against structure

Matched Controls We can protect against structure by selecting the controls to have similar ethnic and genetic backgrounds. Often ethnicity is self-reported and may not be accurate thus there may still be ‘cryptic’ structure present in the dataset.

Prospective Studies This design samples a large number of individuals and follows them through until disease onset. The long time span of such studies allows the collection of detailed environmental information that can help to elucidate the genetic/environmental basis of the disease. The large number of subjects followed potentially allows careful matching of controls to cases.

Probably the biggest such study is the UK BioBank project (http://www.ukbiobank.ac.uk/).

“Up to half a million participants aged between 45 and 69 years will be involved in the study. They will be asked to contribute a blood sample, lifestyle details and their medical histories to create a national database of unprecedented size. This combination of information from participants will create a powerful resource for biomedical researchers.”

Pros and Cons

| Design | Pros | Cons |
|---|---|---|
| TDT | Protects against structure; needs fewer markers than Case-Control | Needs parents; poor localization; doesn’t use homozygous parents |
| MALD | Better power than TDT; needs fewer markers than Case-Control | Relies on difference in disease between 2 populations; poor localization |
| Unrelated Case-Control | Good localization | Needs lots of markers; need to protect against structure |

Pooling

Genotyping costs can quickly escalate in case-control designs. 2000 individuals at 1,000,000 SNPs will cost at least $20,000,000.

Pooling is a method that can estimate the sample allele frequency at a locus using a pooled sample of genetic material from a group of individuals. This method can be used to estimate allele frequency differences between groups of cases and controls.

Pooling has already been used successfully to map some disease genes (Risch and Teng (1998)).

A disadvantage of this method is that we cannot investigate interaction effects between loci.

• What is a Genome Wide Association Study? • What is a Genome Wide Linkage Study? • Linkage vs Association (Risch and Merikangas 1996) • Study Design • Different methods for detecting association

Linkage Mapping

Basic Idea Use the pattern of allele segregation in pedigrees (families) to estimate recombination fraction (θ) between a marker locus and an unobserved trait locus. (Sham (1998))

Mother AB a b

Father a b a b

a b

a b

AB

a b

a b

a b

Ab

a B

Non−Recombinants Recombinants

From Father From Mother

Out of 4 informative meioses 2 are recombinants ⇒ θˆ = 1/2 i.e. we can use pedigrees to estimate the “distance” between 2 markers.

What is a Genome Wide Association Study?

Goal Uncover the genetic basis of a given disease. Basic Idea A rather vague idea of a study design that involves genotyping cases and controls at a large number (104 → 106) of SNP markers spread (in some unspeciﬁed way) throughout the genome. Look for associations between the genotypes at each locus and disease status.

01111101021220100011 20111200010110110100 20122012100110100111 12112111101110022202

11210121111212121211 22120100012212121021 01100210021112112010 01100102211112012112

Control Control Control Control

Case Case Case Case

Why might this be a good idea?

Suppose marker B is the causative disease locus but the genotype information at this marker is missing. Instead we have disease status for the individuals in the pedigree. A is a marker we think is close to B.

Mother A a

Father a a

A

a

a

a

A

a

a

a

Given a model of penetrance (i.e. how genotype dictates disease status) then for a given value of θ we can calculate the likelihood of the pedigree by summing over all the missing data conﬁgurations consistent with the data.

L(θ) = (Likelihood) =

G P (X|G) × P (Gd|Gf ; θ) × P (Gf ) G (Penetrance) × (Transmission) × (Founders)

Thus we can obtain the MLE, θˆ ⇒ Likelihood ratio test LL(1(θ/ˆ)2)

Normally, the “lod score” is reported where lod(θ) = log10

L(θˆ) L(1/2)

When disease status has a high correlation with the genotype at marker A

then this suggests the recombination fraction between markers A and B is

small i.e. A is close to B.

Because recombination events are so rare we can use a sparse set of (Micro Satellite) markers spread throughout the genome to “map” the disease locus (usually between 200-500 markers).

Once areas of the Genome have been implicated Fine Mapping studies in candidate genes can be used to localize the causal locus.

It’s not quite as simple as I’ve made out (Sham (1998))

• Need to specify a penetrance function • Founder haplotype probabilities • Allow variable recombination rates in men and women?

It is also non-trivial to sum over all the missing data and considerable effort has been put into efﬁcient calculation of the likelihoods (Abecasis et al. (2002)) [In much the same way that people have worked very hard to calculate likelihoods in population genetics models] Extensions

• Multiple loci • Other covariates e.g. environmental factors • Continuous traits

559

Figure from Wiltshire et al. (2001) AJHG 69:553-69 Type II Diabetes (Non-parametric Genome Wide Linkage Study)

Successes and Failures of Linkage Mapping

Linkage Mapping has been successful in identifying the genetic basis of many human diseases in which the disease penetrance resembles a simple Mendelian model

• Huntington’s disease, Cystic Fibrosis, some forms of breast cancer Risch and Merikangas (1996)

But “the literature is now replete with linkage screens for an array of common ‘complex’ disorders such as

• schizophrenia, manic depression, autism, asthma, type I and type II diabetes, Multiple Sclerosis, Lupus

Although many of these studies have reported signiﬁcant linkage ﬁndings, none has lead to convincing replication”

Risch(2001)

What is a Complex Disease?

Not simply Mendelian! Possible departures from the simple Mendelian model include

Allelic Heterogeneity Different alleles at the same locus increase disease susceptibility i.e. not just one allele. (linkage mapping is robust to this effect)

Locus Heterogeneity Mutations at different loci increase disease susceptibility (linkage mapping is not robust to this effect)

Gene-Gene Interactions multiple loci “interact” to increase disease susceptibility

Environmental factors e.g. diet, smoker Gene-Environment Interactions

Non-Parametric Linkage

Design Two Affected Sibs and their parents (Micro Satellite Locus)

Basic Idea Count the number of alleles two affected sibs share Identical By Descent (IBD)

A 1 A2 A3 A4

A 1 A2 A3 A4

A 1 A2 A3 A4

A 1 A 3 A1 A3 IBD = 2

A 1 A3 A1 A4 IBD = 1

A 1 A3 A2 A4 IBD = 0

If the marker is linked to the disease locus the affected sibs will tend to

share the disease allele more often than they would at a marker unlinked to

the disease locus.

⇒ There will be a departure from the null IBD distribution of

14 ,

12 ,

1 4

.

Note No distinction is made between which allele is shared IBD.

Linkage Mapping vs Association Studies

In a widely quoted paper Risch and Merikangas (1996) The Future of Genetic Studies of Complex Human Diseases. Science 273:1516-17 the authors pointed out that linkage studies had less power than association studies to detect weak genetic effects exhibited by the loci involved in complex diseases. Not quite that general. More speciﬁcally, the authors compared two speciﬁc methods of linkage and association that were popular at that time

• Non-parametric linkage mapping using an Affected Sib Pairs (ASP) design

• Family based association using the Transmission Disequilibrium/Distortion Test (TDT)

TDT (Transmission Disequilibrium Test)

Design Affected Child and their parents (SNP Locus)

Basic Idea Compare the distribution of the transmitted allele to the distribution of the non-transmitted allele from heterozygous parents.

Aa Aa

Aa Aa

Aa Aa

A A Score = 2

A a Score = 1

a a Score = 0

If the marker is linked to the disease locus one of the alleles will tend to be

transmitted more often than if the marker was unlinked to the disease locus.

⇒ There will be a departure from the null distribution of

12,

1 2

.

Note This test does effectively focus on a speciﬁc allele.

Disease Model

Risch and Merikangas assumed that the disease locus was diallelic with the Genotypic Relative Risk (GRR) increasing in a multiplicative fashion

P (Disease|Aa) = γ P (Disease|aa)

P (Disease|AA) = γ2 P (Disease|aa)

For both tests they assumed the marker locus is completely linked to the disease locus and calculated the IBD distribution and transmission distribution for a given values of γ.

Q. How many families do you need for 80% power?

ASP They assumed they had 500 MS markers spread throughout the genome ⇒ α = 10−4

TDT They assumed they had 1,000,000 SNP markers spread throughout the genome ⇒ α = 5 × 10−8

Common variants or rare alleles?

The common disease/common variant hypothesis (CD/CV) holds that alleles at relatively high frequencies (> 1%) represent a signiﬁcant proportion of susceptibility alleles for common disease.

Their high frequency implies that association studies in large population cohorts will be fruitful for identifying risk alleles.

Based on recent empirical evidence some people have suggested that we may be able to characterize the variation in the human genome using a block like structure of common haplotypes.

Haplotype Block

Figure based on one in Paabo (2003)

An Individual

If this is true then association studies may proceed by typing just those SNPs (Haplotype tagging SNPs or htSNP’s) that code the common haplotypes.

The ‘HapMap’ project will investigate this issue (Nature Genetics (2001) 29:353-4)

Power of ASP vs TDT

γ Allele Frequency ASP (N) TDT (N)

2

0.01

296,710 5,823

2

0.1

5,382 695

1.5

0.01

4,620,807 19,320

1.5

0.1

67,816 2,218

Even with the much more stringent Type I error TDT is seen to have much more power to detect an effect.

Note 1 The power of both tests depends on allele frequency. If the disease allele is rare it reduces the number of heterozygous parents with the mutant allele.

Note 2 Deeper families will have more power than the ASP design but there will also be a dependence on allele frequency.
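Note 1 can be made concrete by computing how often a parent of an affected child is heterozygous. This is a sketch under assumptions not stated on the slide: Hardy-Weinberg proportions in the parents, random mating, and the multiplicative disease model. The function name `het_parent_prob` is hypothetical.

```python
def het_parent_prob(gamma, p):
    """P(parent is Aa | child affected) under the multiplicative model."""
    q = 1.0 - p
    genotype_prob = {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}
    prob_transmit_A = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}
    joint = {}
    for g, pg in genotype_prob.items():
        risk = 0.0  # relative risk of the child, averaged over transmissions
        for this_is_A, pt in ((1, prob_transmit_A[g]), (0, 1.0 - prob_transmit_A[g])):
            for other_is_A, po in ((1, p), (0, q)):
                risk += pt * po * gamma ** (this_is_A + other_is_A)
        joint[g] = pg * risk
    return joint["Aa"] / sum(joint.values())

for p in (0.01, 0.1):
    print(p, round(het_parent_prob(2.0, p), 4))
```

With γ = 2 the heterozygous fraction is roughly 3% at allele frequency 0.01 and roughly 25% at 0.1, which is the sense in which a rare disease allele leaves few informative parents.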

In opposition is the common disease/rare allele hypothesis (CD/RA) which holds that there is no reason to expect that most common genetic diseases result from common alleles.

Simulations from models based on empirical parameter estimates (Pritchard (2001)) suggest that this may be the case and that we should expect extensive allelic heterogeneity.

Many people also expect extensive locus heterogeneity.

If this is the case then Haplotype Maps will be of limited use, and family studies in unusual populations (e.g. Iceland) may be the only way to go.

Maybe a mixture of both?

Genome Wide Association Studies in Practice

Risch and Merikangas (1996) say that to detect a disease allele with a frequency of 0.1 and GRR = 1.5 we need to genotype 2,218 families at 1,000,000 SNP loci. This isn’t a solution. There are still lots of questions.

• Is this design practical? Costs? Pooling?
• Is TDT the best association design? Families vs Unrelated? MALD?
• How do both these approaches cope with population stratification?
• What if we don’t type the causative marker but one in LD to the marker?
• How do allelic heterogeneity, locus heterogeneity, gene-gene interactions and gene-environment interactions impact association studies?

The data produced by the Haplotype Map project will have a large inﬂuence on association study design.

Multinomial Likelihood

Data     Case         Control      Total
aa       r0 = 100     s0 = 130     n0
aA       r1 = 400     s1 = 390     n1
AA       r2 = 500     s2 = 480     n2
Total    R            S            N

We want to test whether genotype is associated with Case/Control status, i.e. is there a difference in the distribution of genotypes between Cases and Controls?

Thus, we can write our null and alternative hypotheses in terms of the conditional probabilities of genotype given case/control status, parameterized by p = {p0, p1, p2} and q = {q0, q1, q2}

      Case    Control
aa    p0      q0
aA    p1      q1
AA    p2      q2

In terms of these probabilities, we can write H0 : p = q vs H1 : p ≠ q

The likelihood is multinomial

$$L(p, q) = p_0^{r_0} p_1^{r_1} p_2^{r_2} \, q_0^{s_0} q_1^{s_1} q_2^{s_2}$$

Under H0 the MLEs of p and q are both v = {n0/N, n1/N, n2/N}.

Under H1 the MLEs are p̂ = {r0/R, r1/R, r2/R} and q̂ = {s0/S, s1/S, s2/S}.

[Overview diagram. Linkage (families): Parametric (Pedigree Likelihood Based), Non-Parametric (IBD counting). Association: TDT (unrelated families); MALD; unrelateds, with Likelihood Ratio Tests (logistic, log-linear, multinomial), Score Tests and Haplotype Tests. Structure: Structured Association, Genomic Control, Prospective Studies, Matched Case-Controls. Pooling. Fine Mapping (see Prof Hein’s lecture). Linkage vs Association: Risch and Merikangas (1996).]

We can test H0 vs H1 using the log-likelihood ratio (LLR) statistic

$$-2 \log \frac{L(v, v)}{L(\hat{p}, \hat{q})}$$

The LLR statistic can be written as

$$2 \sum_{i=0}^{2} \left[ r_i \log \frac{r_i}{\hat{r}_i} + s_i \log \frac{s_i}{\hat{s}_i} \right]$$

where $\hat{r}_i$ and $\hat{s}_i$ are the fitted frequencies under H0. For example,

$$\hat{r}_0 = N \times \frac{R}{N} \times \frac{n_0}{N}$$

Under H0 this statistic has a $\chi^2_2$ distribution (asymptotically).

For the example on the previous slide LLR = 4.45899
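The value quoted for the example can be reproduced directly from the fitted frequencies. A minimal sketch using only the counts from the example table; the p-value line uses the fact that a chi-squared variable with 2 degrees of freedom has survival function exp(−x/2).

```python
import math

cases = [100, 400, 500]      # r0, r1, r2
controls = [130, 390, 480]   # s0, s1, s2
R, S = sum(cases), sum(controls)
N = R + S
totals = [r + s for r, s in zip(cases, controls)]  # n0, n1, n2

# Fitted frequencies under H0: r_hat_i = R * n_i / N, s_hat_i = S * n_i / N
r_hat = [R * n / N for n in totals]
s_hat = [S * n / N for n in totals]

llr = 2.0 * sum(
    r * math.log(r / rh) + s * math.log(s / sh)
    for r, s, rh, sh in zip(cases, controls, r_hat, s_hat)
)

# Asymptotic p-value from the chi-squared distribution with 2 df.
p_value = math.exp(-llr / 2.0)
print(round(llr, 5), round(p_value, 4))
```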

Logistic Regression

Alternatively, we can ﬁt a logistic regression model to the data.

Each subject in our sample consists of a (yi, xi) pair where yi is case/control status (1/0) and xi ∈ {0, 1, 2} is the genotype at the typed locus.

The relationship between y and x is modelled using the likelihood

$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{(1 - y_i)}$$

where

$$\eta_i = \log \frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i$$

This statistical model is equivalent to the genetics model in which the odds of disease given genotype increase in a multiplicative fashion.

Odds of Genotype aa = P(D|aa) / (1 − P(D|aa))

Genotype    x    Odds
(aa)        0    α
(Aa)        1    α(1 + θ)
(AA)        2    α(1 + θ)²

πA = P(A)
p = Disease prevalence in the population

$$p = \frac{\alpha}{1 + \alpha} (1 - \pi_A)^2 + 2 \, \frac{\alpha (1 + \theta)}{1 + \alpha (1 + \theta)} \, \pi_A (1 - \pi_A) + \frac{\alpha (1 + \theta)^2}{1 + \alpha (1 + \theta)^2} \, \pi_A^2$$

The relationship between the models is

β0 = log α
β1 = log(1 + θ)
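The correspondence between the genetic parameters (α, θ) and the logistic coefficients (β0, β1) can be checked numerically. The sketch below assumes Hardy-Weinberg genotype proportions for the prevalence calculation; the parameter values are hypothetical and chosen only for illustration.

```python
import math

def penetrances(alpha, theta):
    """P(D | genotype) for x = 0, 1, 2 copies of A, from the odds table."""
    odds = [alpha * (1.0 + theta) ** x for x in (0, 1, 2)]
    return [o / (1.0 + o) for o in odds]

def prevalence(alpha, theta, pi_A):
    """Population prevalence, assuming Hardy-Weinberg genotype frequencies."""
    geno = [(1.0 - pi_A) ** 2, 2.0 * pi_A * (1.0 - pi_A), pi_A ** 2]
    return sum(g * f for g, f in zip(geno, penetrances(alpha, theta)))

# Hypothetical values, not taken from the slides.
alpha, theta, pi_A = 0.05, 0.5, 0.1
beta0, beta1 = math.log(alpha), math.log(1.0 + theta)

# The logistic model with (beta0, beta1) gives the same penetrances.
logistic = [1.0 / (1.0 + math.exp(-(beta0 + beta1 * x))) for x in (0, 1, 2)]
print(penetrances(alpha, theta))
print(logistic)
print(prevalence(alpha, theta, pi_A))
```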

Score Tests

Score tests can be thought of as approximate likelihood ratio tests.

Suppose the likelihood can be written as L(α, β) and H0 : β = 0.

These tests take the form

$$U^T V^{-1} U$$

where U is the vector of first derivatives of the log-likelihood w.r.t. β, evaluated at β = 0 and at the MLE of α under H0, and V is the covariance matrix of U under H0.

Under H0 this statistic has (asymptotically) a χ² distribution with degrees of freedom equal to the number of parameters in β; for the multinomial likelihood this is $\chi^2_2$.

Applying this method to the multinomial likelihood results in the statistic

$$\sum_{i=0}^{2} \left[ \frac{(r_i - \hat{r}_i)^2}{\hat{r}_i} + \frac{(s_i - \hat{s}_i)^2}{\hat{s}_i} \right]$$

This is the “standard” Pearson’s Chi-squared test statistic for a 2 × 3 contingency table.

For the example used before the Score test = 4.4478
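The Pearson statistic for the example table can be reproduced as follows. This is a minimal sketch that uses only the example counts and the fitted frequencies under H0.

```python
cases = [100, 400, 500]      # r0, r1, r2
controls = [130, 390, 480]   # s0, s1, s2
R, S = sum(cases), sum(controls)
N = R + S
totals = [r + s for r, s in zip(cases, controls)]

# Fitted frequencies under H0.
r_hat = [R * n / N for n in totals]
s_hat = [S * n / N for n in totals]

pearson = sum(
    (r - rh) ** 2 / rh + (s - sh) ** 2 / sh
    for r, s, rh, sh in zip(cases, controls, r_hat, s_hat)
)
print(round(pearson, 4))
```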

Applying this method to the logistic regression model with ηi = β0 + β1xi results in

$$\frac{N \{N (r_1 + 2 r_2) - R (n_1 + 2 n_2)\}^2}{R (N - R) \{N (n_1 + 4 n_2) - (n_1 + 2 n_2)^2\}}$$

which is equivalent to Armitage’s Trend test statistic, a popular test statistic used in testing for association at a given SNP marker.

For the example used before the Score test = 2.69179
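The trend statistic is simple to evaluate from the genotype counts. A minimal sketch; the function name `trend_statistic` is hypothetical, and the p-value assumes the statistic is referred to a chi-squared distribution with 1 degree of freedom (one parameter, β1, is tested).

```python
import math

def trend_statistic(cases, controls):
    """Score (Armitage trend) statistic for genotype counts coded 0, 1, 2."""
    _, r1, r2 = cases
    n = [r + s for r, s in zip(cases, controls)]
    R, N = sum(cases), sum(n)
    num = N * (N * (r1 + 2 * r2) - R * (n[1] + 2 * n[2])) ** 2
    den = R * (N - R) * (N * (n[1] + 4 * n[2]) - (n[1] + 2 * n[2]) ** 2)
    return num / den

stat = trend_statistic([100, 400, 500], [130, 390, 480])
# Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2.0))
print(round(stat, 5), round(p_value, 4))
```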

There is no explicit formula for the MLE of β. The likelihood is maximized numerically using an algorithm called Iteratively Re-Weighted Least Squares (IRWLS). The algorithm is derived by using a Newton-Raphson algorithm to maximize the likelihood.

Using this framework we can test the hypothesis that there is an association which takes the form of a multiplicative increase in the odds of disease.

Writing $\beta^{(0)}$ as the MLE of β under H0 and $\beta^{(1)}$ as the MLE of β under H1, the LLR statistic is

$$\mathrm{LLR} = -2 \log \frac{L(\beta^{(0)})}{L(\beta^{(1)})}$$

For the example used before LLR = 2.693173
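The numerical maximization and the resulting LLR can be sketched for the example data. The code below runs plain Newton-Raphson on (β0, β1) using grouped counts, which for the logit link is the same update as IRWLS; it is an illustrative sketch, not the implementation used to produce the slide, and the function names are hypothetical.

```python
import math

# Grouped data from the earlier example: genotype x -> (cases, controls).
data = {0: (100, 130), 1: (400, 390), 2: (500, 480)}

def log_lik(b0, b1):
    total = 0.0
    for x, (r, s) in data.items():
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += r * math.log(p) + s * math.log(1.0 - p)
    return total

def fit(n_iter=25):
    """Newton-Raphson for (b0, b1), starting from (0, 0)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, (r, s) in data.items():
            n = r + s
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)       # IRWLS weight for this genotype group
            u0 += r - n * p             # score w.r.t. b0
            u1 += x * (r - n * p)       # score w.r.t. b1
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system (information matrix) * delta = score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1

R = sum(r for r, _ in data.values())
S = sum(s for _, s in data.values())
b0_null = math.log(R / S)               # MLE of b0 when b1 = 0
b0_hat, b1_hat = fit()
llr = -2.0 * (log_lik(b0_null, 0.0) - log_lik(b0_hat, b1_hat))
print(round(b0_hat, 4), round(b1_hat, 4), round(llr, 4))
```

The LLR is close to the score (trend) statistic for the same model, as expected when the score test is viewed as an approximation to the likelihood ratio test.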

The logistic regression framework allows us more flexibility to include other covariates in the linear predictor ηi, e.g. genotypes at other loci, environmental covariates, interaction terms etc.

The flexibility implies that the LLR test using the multinomial likelihood is a special case of the logistic regression model when we set the linear predictor to have the form

ηi = β0 + β1xi + β2wi

where wi = I[xi > 1] and H0 : β1 = β2 = 0. (The indicator is written wi here to avoid a clash with yi, the case/control status.)
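This equivalence can be checked on the example data. With an intercept, a slope and the indicator I[x > 1], the logistic model has three free parameters for three genotype classes, so its fitted P(case | genotype) equals the observed proportion in each class. The sketch below relies on that saturation argument rather than fitting the model iteratively.

```python
import math

cases = [100, 400, 500]
controls = [130, 390, 480]
R, S = sum(cases), sum(controls)
N = R + S
totals = [r + s for r, s in zip(cases, controls)]

# Saturated logistic model: fitted P(case | genotype i) = r_i / n_i.
ll_saturated = sum(
    r * math.log(r / n) + s * math.log(s / n)
    for r, s, n in zip(cases, controls, totals)
)
# Null logistic model (intercept only): fitted P(case) = R / N.
ll_null = R * math.log(R / N) + S * math.log(S / N)
llr_logistic = 2.0 * (ll_saturated - ll_null)

# Multinomial LLR from the fitted frequencies under H0.
llr_multinomial = 2.0 * sum(
    r * math.log(r / (R * n / N)) + s * math.log(s / (S * n / N))
    for r, s, n in zip(cases, controls, totals)
)
print(round(llr_logistic, 5), round(llr_multinomial, 5))
```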

Multiple Alleles

The tests/models can be extended to the case when we have more than two alleles at a given locus.

Genotype Models

Suppose the locus has K alleles then there will be K(K + 1)/2 possible genotypes at the locus.

Thus we might think that each genotype has a different effect on the odds of disease.

In a logistic regression framework we can model this using a linear predictor of the form

$$\eta_{(i,g)} = \log \frac{p_{(i,g)}}{1 - p_{(i,g)}} = \beta_g$$

where g codes for genotype. This model effectively models interactions between the alleles. We can test for an effect at the locus using a LLR test with K(K + 1)/2 − 1 degrees of freedom. When K is large the test will have low power.

Allelic Models

We can reduce the number of parameters in the model by assuming a speciﬁc form for the interaction between alleles at each locus i.e. they interact to increase the odds of disease in a multiplicative fashion.

$$\eta_{(i,j,k)} = \log \frac{p_{(i,j,k)}}{1 - p_{(i,j,k)}} = \beta_0 + \beta_j + \beta_k$$

where j and k code for the two alleles that make up a specific genotype.

We can test for an effect at the locus using a LLR test with K − 1 degrees of freedom.
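The difference in degrees of freedom between the genotype and allelic models grows quickly with the number of alleles. A small sketch that enumerates unordered genotypes to count them; the function names are hypothetical.

```python
from itertools import combinations_with_replacement

def genotype_model_df(k):
    """K(K + 1)/2 - 1: one parameter per unordered genotype, minus one."""
    genotypes = list(combinations_with_replacement(range(k), 2))
    return len(genotypes) - 1

def allelic_model_df(k):
    """K - 1: one effect per allele, with one allele as reference."""
    return k - 1

for k in (2, 3, 5, 10):
    print(k, genotype_model_df(k), allelic_model_df(k))
```

For K = 10 the genotype model test has 54 degrees of freedom against 9 for the allelic model, which illustrates why the genotype model loses power when K is large.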

The log link form of this model can be generalized (Clayton (2001)).

Alternatively we could carry out K separate 1 df tests, each one focussing on a specific allele and combining all the other alleles.

If we do this we need to correct for the multiple testing involved. This can be achieved through simulation.

Sham and Curtis (1995) search over all possible 1 df comparisons obtainable by collapsing alleles into two groups and construct p-values through simulation.
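A simulation-based correction for the K one-allele-versus-rest tests can be sketched with a permutation test on the largest statistic. This is not the Sham and Curtis procedure, which searches over all two-group collapsings; it covers only the simpler K-test case. The allele counts are hypothetical, and each chromosome is treated as an independent observation, which is a simplifying assumption.

```python
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if den == 0 else n * (a * d - b * c) ** 2 / den

def max_one_vs_rest(alleles, status, k):
    """Largest statistic over the K 'allele j vs all others' comparisons."""
    n_case = sum(status)
    n_ctrl = len(status) - n_case
    best = 0.0
    for j in range(k):
        case_j = sum(1 for a, y in zip(alleles, status) if a == j and y == 1)
        ctrl_j = sum(1 for a, y in zip(alleles, status) if a == j and y == 0)
        best = max(best, chi2_2x2(case_j, n_case - case_j, ctrl_j, n_ctrl - ctrl_j))
    return best

def permutation_p_value(alleles, status, k, n_perm=499, seed=1):
    """P-value for the maximum statistic, permuting case/control labels."""
    rng = random.Random(seed)
    observed = max_one_vs_rest(alleles, status, k)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if max_one_vs_rest(alleles, labels, k) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical allele counts at a marker with K = 4 alleles.
case_counts = [40, 30, 20, 10]
control_counts = [25, 30, 25, 20]
alleles, status = [], []
for j, (n_case_j, n_ctrl_j) in enumerate(zip(case_counts, control_counts)):
    alleles += [j] * (n_case_j + n_ctrl_j)
    status += [1] * n_case_j + [0] * n_ctrl_j

p_corrected = permutation_p_value(alleles, status, k=4)
print(p_corrected)
```

Because the maximum is recomputed for every permutation, the resulting p-value already accounts for having looked at all K comparisons.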

TDT

The TDT compares the distribution of transmitted and non-transmitted alleles from parents of affected offspring (Spielman et al. (1993)).

                      Non-transmitted allele
Transmitted Allele    A1      A2      Total
A1                    n11     n12     n1·
A2                    n21     n22     n2·
Total                 n·1     n·2     2n

If the marker is unlinked to the causative locus then we expect n12 = n21.

[Figure: a trio with parents A1A2 and A1A2 and child A1A1. This trio would contribute 2 to n12.]

Thus we can test for a distortion in the transmission distribution using the test statistic

$$\frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}$$

which is asymptotically $\chi^2_1$.
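The statistic uses only the two off-diagonal counts. A minimal sketch with hypothetical counts (60 transmissions of A1 over A2 and 40 of A2 over A1 from heterozygous parents); the function names are illustrative.

```python
import math

def tdt_statistic(n12, n21):
    """TDT statistic from the off-diagonal transmission counts."""
    return (n12 - n21) ** 2 / (n12 + n21)

def chi2_1_sf(x):
    """Survival function of the chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical counts, not taken from the slides.
stat = tdt_statistic(60, 40)
print(stat, round(chi2_1_sf(stat), 4))
```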

Various extensions of this method exist, e.g. for multi-allelic markers, multiple siblings and missing parental data.

Haplotype Tests

If we don’t observe the causative locus directly we need to pick up the signal of association using a marker or markers in LD with the causative marker. It makes sense to try to combine the information in several SNP markers by considering the haplotype effects at these markers.

In practice, we will probably not observe the haplotypes of each individual at a given set of markers. There has been a lot of recent work in the area of haplotype phase reconstruction, so this is not a serious problem (Stephens et al. (2001)).

Approach 1 If we assume we know the haplotypes at a given set of markers one approach we can take is to treat the set of markers as one multi-allelic marker as discussed before.

Approach 2 We might think that haplotypes that are “similar” will have similar effects on disease susceptibility. This idea has led to approaches that use ideas from spatial modelling of disease risk to impose this type of “prior” structure on the estimation of genotype and haplotype risks (Clayton and Jones (1999), Seaman et al. (2002)).

Population Structure

[Figure: genotype (0, 1, 2) distributions for Population 1 and Population 2, shown separately for Cases and Controls]

Spurious association is caused by the co-occurrence of 2 factors

• A difference in proportion of individuals from two (or more) subpopulations in cases and controls

• Subpopulations have differing allele frequencies at the locus
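The two factors above can be illustrated with a small constructed example. The allele counts are hypothetical: within each subpopulation the allele frequency is identical in cases and controls (0.2 in pop1, 0.8 in pop2), but pop1 supplies 80% of the case chromosomes and only 20% of the control chromosomes.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if den == 0 else n * (a * d - b * c) ** 2 / den

# Hypothetical allele counts (A, a) by subpopulation and status.
counts = {
    "pop1": {"case": (160, 640), "control": (40, 160)},
    "pop2": {"case": (160, 40), "control": (640, 160)},
}

# No association inside either subpopulation.
within = {
    pop: chi2_2x2(*tab["case"], *tab["control"]) for pop, tab in counts.items()
}

# Pooling the subpopulations creates a strong spurious signal.
pooled_case = [sum(counts[pop]["case"][i] for pop in counts) for i in (0, 1)]
pooled_control = [sum(counts[pop]["control"][i] for pop in counts) for i in (0, 1)]
pooled = chi2_2x2(*pooled_case, *pooled_control)

print(within, pooled)
```

Removing either factor (equal subpopulation proportions in cases and controls, or equal allele frequencies across subpopulations) makes the pooled statistic vanish as well.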

There are 2 general strategies for protecting against structure in case/control studies.

Structured Association

Basic Idea Try to infer (discover) the structure and then condition on the structure when testing for association.

Basic Model

• Assume we have data on N individuals at L (unlinked) loci.
• Assume that there are K underlying populations.
• Each individual has a parameter that indicates population membership, zi.
• Each population has a set of parameters that specifies the allele frequencies at all of the markers, pk = {pk1, . . . , pkL}, k = 1, . . . , K

z    Data
1    010012021
...
1    020112110        p1 = (0.2, 0.8, 0.01, 0.05, 0.7, 0.9, 0.5, 0.6, 0.1)
2    101211020
2    212021120
...
2    212111020        p2 = (0.7, 0.4, 0.8, 0.5, 0.9, 0.6, 0.3, 0.99, 0.01)

The model speciﬁes P (Data|Z, P, K)
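One simple form that P(Data|Z, P, K) can take is sketched below. It assumes no admixture, Hardy-Weinberg equilibrium within each population and independence between loci, so that each genotype is a Binomial(2, p) count of the allele whose frequency is given; these are assumptions of the sketch, and the genotype coding is likewise assumed. The two genotype strings and the frequency vectors are the ones shown in the illustration.

```python
import math

def log_lik_structure(genotypes, z, freqs):
    """log P(Data | Z, P): sum of Binomial(2, p) log-probabilities."""
    total = 0.0
    for row, zi in zip(genotypes, z):
        for g, p in zip(row, freqs[zi]):
            total += (math.log(math.comb(2, g))
                      + g * math.log(p)
                      + (2 - g) * math.log(1.0 - p))
    return total

freqs = {
    1: (0.2, 0.8, 0.01, 0.05, 0.7, 0.9, 0.5, 0.6, 0.1),
    2: (0.7, 0.4, 0.8, 0.5, 0.9, 0.6, 0.3, 0.99, 0.01),
}
genotypes = [
    [int(c) for c in "010012021"],   # shown with z = 1
    [int(c) for c in "101211020"],   # shown with z = 2
]

ll_as_labelled = log_lik_structure(genotypes, [1, 2], freqs)
ll_swapped = log_lik_structure(genotypes, [2, 1], freqs)
print(round(ll_as_labelled, 3), round(ll_swapped, 3))
```

The labelling shown in the illustration gives a much higher likelihood than the swapped labelling, which is the signal an inference procedure for Z exploits.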

Genomic Control

Basic Idea Correct the null distribution of the Chi-squared statistic for the effects of structure. If we use the logistic regression score statistic to test association at a given locus, population structure in our data will tend to skew the null distribution.

[Figure: density of the test statistic under the null, with structure and with no structure]

In practice we won’t know the number of populations K, so we want to calculate P(K|Data), which we can calculate (in principle) in the following way

$$P(K \mid \mathrm{Data}) \propto P(\mathrm{Data} \mid K) = \int P(\mathrm{Data} \mid Z, P, K) \, \pi(Z) \, \pi(P) \, dZ \, dP$$

where π(Z) and π(P) are prior distributions on the parameters. In practice, the integral is approximated using Markov Chain Monte Carlo (MCMC) techniques.

The program structure (Pritchard et al. (2000a)) uses a novel method of calculating this integral that seems to work very well on many data sets. The actual model that structure uses allows for admixture between populations. More recently, the model has been extended to include a more realistic model of the correlations that occur between populations (Marchini and Cardon (2002)).

Once the number of populations has been inferred we can obtain an estimate of the structure Z. Pritchard et al. (2000b) propose a likelihood ratio test (STRAT) to test the null hypothesis that the subpopulation allele frequencies are independent of the disease phenotype at a given marker.

Devlin and Roeder (1999) show that the null distribution of the logistic regression score statistic in the presence of structure is $\lambda \chi^2_1$, where the constant λ is a function of the structure present in the data. These authors suggest estimating λ from the empirical distribution of the statistics across all loci tested.

Pritchard and Donnelly (2001) show that

$$\lambda \approx 1 + N F_{ST} \sum_{k=1}^{K} (d_k - c_k)^2$$

where $d_k$ and $c_k$ are the fractions of cases and controls from subpopulation k, N is the number of cases and controls and $F_{ST}$ is a statistic that measures the level of structure between subpopulations.

Note λ increases with sample size.
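The dependence of λ on sample size can be seen by evaluating the approximation directly. The values of F_ST and of the case/control fractions below are hypothetical, chosen only to show the trend; the function name is illustrative.

```python
def lambda_approx(n, f_st, case_fracs, control_fracs):
    """lambda ~ 1 + N * F_ST * sum_k (d_k - c_k)^2."""
    return 1.0 + n * f_st * sum(
        (d - c) ** 2 for d, c in zip(case_fracs, control_fracs)
    )

# Hypothetical setting: two subpopulations, modest imbalance, F_ST = 0.01.
d = (0.6, 0.4)   # fractions of cases from each subpopulation
c = (0.5, 0.5)   # fractions of controls from each subpopulation
for n in (100, 1000, 10000):
    print(n, round(lambda_approx(n, 0.01, d, c), 3))
```

With these inputs the inflation is negligible at N = 100 but large at N = 10,000, even though the structure itself is unchanged.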

Accuracy of Genomic Control

The accuracy of GC will depend on 2 factors

• how good the λχ²₁ approximation is to the true null distribution
• bias and variance in the estimation of λ

We have compared the empirical null distribution of GC to its theoretical distribution using simulations from models fitted to real datasets. We’ve found that the accuracy of GC can go both ways.

• In small samples the correction can be quite conservative.
• In large samples the correction can be quite liberal.
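The estimation step for λ can be sketched as follows. The slides do not say which estimator is used; the sketch assumes a median-based estimate, the median of the observed statistics divided by the median of a χ²₁ distribution (about 0.455), with λ floored at 1. Both choices are assumptions here, and the input statistics are hypothetical.

```python
import statistics

CHI2_1_MEDIAN = 0.4549364  # median of the chi-squared distribution with 1 df

def genomic_control(stats):
    """Return (lambda_hat, corrected statistics) for a list of 1-df statistics."""
    lam = max(1.0, statistics.median(stats) / CHI2_1_MEDIAN)
    return lam, [x / lam for x in stats]

# Hypothetical score statistics at seven loci.
observed = [0.2, 0.5, 0.9, 1.1, 1.6, 3.0, 12.5]
lam_hat, corrected = genomic_control(observed)
print(round(lam_hat, 3), [round(x, 3) for x in corrected])
```

Because λ is estimated from a finite set of loci, the estimate itself has sampling error, which is one route by which the correction can be too conservative or too liberal.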

[Figure: two panels plotting theoretical −log10(p-value) against empirical −log10(p-value), each comparing no correction with GC. Panel settings: n = 1000, RR = 10 and n = 100, RR = 2.]

Power : STRAT vs GC vs TDT


Table copied from Pritchard and Donnelly (2001)

MALD

Structure can be used to help map disease genes (Collins-Schramm et al. (2002)).

Mapping by Admixture Linkage Disequilibrium (MALD) relies on recent admixture between 2 populations to create LD between markers that differ in allele frequency between the populations.

If the causative locus differs in allele frequency between the 2 populations we can map the locus using a set of Ethnic Difference Markers (EDMs) spread throughout the genome.

[Figure: Pop 1 (60%) and Pop 2 (30%) combine to form an Admixed Population; a Causative Locus is shown with a Linked and an Unlinked Marker Locus]

Other designs that protect against structure

Matched Controls We can protect against structure by selecting the controls to have similar ethnic and genetic backgrounds to the cases. Often ethnicity is self-reported and may not be accurate, so there may still be ‘cryptic’ structure present in the dataset.

Prospective Studies This design samples a large number of individuals and follows them through until disease onset. The long time span of such studies allows the collection of detailed environmental information that can help to elucidate the genetic/environmental basis of the disease. The large number of subjects followed potentially allows careful matching of controls to cases.

Probably the biggest such study is the UK BioBank project (http://www.ukbiobank.ac.uk/).

“Up to half a million participants aged between 45 and 69 years will be involved in the study. They will be asked to contribute a blood sample, lifestyle details and their medical histories to create a national database of unprecedented size. This combination of information from participants will create a powerful resource for biomedical researchers.”

Pros and Cons

                          Pros                                    Cons
TDT                       Protects against structure              Need parents
                          Need fewer markers than Case-Control    Poor localization
                                                                  Doesn’t use homozygous parents

MALD                      Better power than TDT                   Relies on difference in disease between 2 populations
                          Need fewer markers than Case-Control    Poor localization

Unrelated Case-Control    Good localization                       Need lots of markers
                                                                  Need to protect against structure

Pooling

Genotyping costs can quickly escalate in case-control designs: 2000 individuals at 1,000,000 SNPs will cost at least $20,000,000.

Pooling is a method that can estimate the sample allele frequency at a locus using a pooled sample of genetic material from a group of individuals. This method can be used to estimate allele frequency differences between groups of cases and controls. Pooling has already been used successfully to map some disease genes (Risch and Teng (1998)).

A disadvantage of this method is that we cannot investigate interaction effects between loci.
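The cost figure implies a per-genotype price, and the saving from pooling can be roughed out from it. In the sketch below the pooled design is an assumption made for illustration: one case pool and one control pool, each assayed once per SNP, ignoring replicate pools and any difference in assay cost.

```python
n_individuals = 2000
n_snps = 1_000_000
total_cost = 20_000_000            # dollars, the lower bound quoted above

# Individual genotyping: one assay per individual per SNP.
n_genotypes = n_individuals * n_snps
cost_per_genotype = total_cost / n_genotypes

# Assumed pooled design: two pools (cases, controls), one assay each per SNP.
pooled_assays = 2 * n_snps
reduction_factor = n_genotypes / pooled_assays

print(n_genotypes, cost_per_genotype, pooled_assays, reduction_factor)
```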