Family-Based Tests of Association in the Presence of Linkage


Am. J. Hum. Genet. 67:1515–1525, 2000

Stephen L. Lake,1 Deborah Blacker,2,3 and Nan M. Laird1
Departments of 1Biostatistics and 2Epidemiology, Harvard School of Public Health, and 3Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Harvard University, Boston

Linkage analysis may not provide the necessary resolution for identification of the genes underlying phenotypic variation. This is especially true for gene-mapping studies that focus on complex diseases that do not exhibit Mendelian inheritance patterns. One positional genomic strategy involves application of association methodology to areas of identified linkage. Detection of association in the presence of linkage localizes the gene(s) of interest to more-refined regions in the genome than is possible through linkage analysis alone. This strategy introduces a statistical complexity when family-based association tests are used: the marker genotypes among siblings are correlated in linked regions. Ignoring this correlation will compromise the size of the statistical hypothesis test, thus clouding the interpretation of test results. We present a method for computing the expectation of a wide range of association test statistics under the null hypothesis that there is linkage but no association. To standardize the test statistic, an empirical variance-covariance estimator that is robust to the sibling marker-genotype correlation is used. This method is widely applicable: any type of phenotypic measure or family configuration can be used. For example, we analyze a deletion in the A2M gene at the 5′ splice site of “exon II” of the bait region in Alzheimer disease (AD) discordant sibships. Since the A2M gene lies in a chromosomal region (chromosome 12p) that consistently has been linked to AD, association tests should be conducted under the null hypothesis that there is linkage but no association.

Introduction
Although linkage analysis has been applied successfully to the mapping of genes involved in the pathogenesis of diseases exhibiting Mendelian inheritance, its application in the setting of genetically complex diseases has been less fruitful (Risch and Merikangas 1996). With complex diseases, the resolution from linkage analysis is reduced, and extended segments of the genome containing large numbers of genes may be implicated in disease etiology (Hauser and Boehnke 1997; Roberts et al. 1999). Fine mapping of these linked regions may be accomplished through the use of allelic-association methods that are designed to jointly detect linkage and gametic-phase disequilibrium. Detecting association significantly refines the search for disease susceptibility genes, because linkage disequilibrium between a genetic marker and disease susceptibility polymorphisms is expected to exist only over relatively small genetic distances in most populations. The sequential approach of linkage-based genomic screening followed by dissection
of linked regions with association methodology recently has been used to identify a susceptibility locus for human hypertension (Bray et al. 2000).

Received July 28, 2000; accepted for publication September 21, 2000; electronically published October 31, 2000.
Address for correspondence and reprints: Mr. Stephen Lake, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115. E-mail: [email protected] .edu
© 2001 by The American Society of Human Genetics. All rights reserved. 0002-9297/2000/6706-0017$02.00

Allelic association can be detected through traditional contingency-table analysis using cases and controls (Woolf 1955). Although straightforward to implement, tests based on this approach are sensitive to spurious association caused by population admixture (Ott 1989). Family-based association tests (FBATs) are a class of tests that utilize within- and between-family marker-inheritance patterns to test for association and that are safeguarded, by design, from confounding caused by admixture (Ewens and Spielman 1995). A widely used FBAT is the transmission/disequilibrium test (TDT; Terwilliger and Ott 1992; Spielman et al. 1993), which uses the marker genotypes of an affected child and those of his/her parents to test for association. FBATs have received much attention lately, with numerous extensions and generalizations of the TDT being proposed in the literature. Recently, Rabinowitz and Laird (2000) developed a unified approach to family-based association tests that puts tests of different genetic models, tests of different sampling designs, tests involving different disease phenotypes, tests with missing parents, and tests of different null hypotheses, all in the same framework. Algorithms for calculating the distribution of association test statistics for these many settings are also presented.
A distinction must be made between tests for linkage that use association methods and tests for association in the presence of linkage. Letting $\theta$ be the recombination parameter and $\delta$ be a measure of allelic association, the tests for linkage that use association methods have a composite null hypothesis (type I H0) that can be expressed as H0: $\delta = 0$ or $\theta = 1/2$. The null hypothesis for testing association in the presence of linkage (type II H0) is H0: $\delta = 0$ and $\theta < 1/2$. Both settings have the same alternative hypothesis, Ha: $\delta > 0$ and $\theta < 1/2$. Complications arise in tests addressing the type II H0 setting, because sibling marker genotypes are correlated under H0 (Martin et al. 1997; Lazzeroni and Lange 1998). Ignoring the correlation in the type II H0 setting compromises the $\alpha$ level of the tests. In this article, we show that valid tests for association in the presence of linkage may be performed using the mean of the test statistic computed via the Rabinowitz-Laird (RL) algorithm for the type I H0 setting and an empirical variance-covariance estimator that adjusts for the correlation among sibling marker genotypes. This provides a convenient means for testing allelic association in the presence of linkage that can be used with a wide range of test statistics and any pedigree configuration. For example, the nine strategies for testing the type I H0 advocated by S. Horvath, X. Xu, and N. Laird (unpublished data), which include applications to binary, quantitative, and time-to-onset phenotypes, can all be adapted to the type II H0 setting with the method presented here. We note that in the biallelic setting and with a qualitative trait, the pedigree disequilibrium test (PDT; Martin et al. 2000c) is similar to the approach developed here.
As an illustration, we focus on the reported association between alleles of the A2M gene and late-onset Alzheimer disease. Blacker et al. (1998) reported a strong association between a deletion near the 5′ splice site of exon 18 of the A2M gene (A2M-18i) and AD in a sample of sibships from the National Institute of Mental Health (NIMH) Genetics Initiative (Blacker et al. 1997). During the course of the A2M association study, linkage to a nearby region on chromosome 12 was reported as part of a genome screen (Pericak-Vance et al. 1997). Subsequent linkage analyses revealed linkage peaks at or near the A2M gene (Rimmler et al. 1997; Rogaeva et al. 1998; Wu et al. 1998; Kehoe et al. 1999; Scott et al. 1999). The reported A2M association has been controversial, with further findings both confirmatory and nonconfirmatory (Dow et al. 1999; Rogaeva et al. 1999; Rudrasingham et al. 1999; Romas et al. 2000). In any case, A2M is useful as an illustration of association tests conducted in the presence of linkage. We use the NIMH data set, in which a strong A2M/AD association has been reported (Blacker et al. 1999), to illustrate our method.

FBATs


We assume that there are $N$ nuclear families, with $n_i$ children in each family. Let $m_{ij}$ be the marker genotype for the $j$th child in the $i$th family and $m_i$ be the vector of marker genotypes for the $n_i$ children in the $i$th family. In addition, the vector of parental marker genotypes will be denoted by $M_i$. Let $X(m_{ij})$ be an $h \times 1$ vector that codes for marker genotype. Depending on the coding scheme, $X(m_{ij})$ may be a scalar or a vector (see Schaid 1996; Laird et al. 2000; S. Horvath, X. Xu, and N. Laird [unpublished data]). Last, let $y_{ij}$ be the phenotype of the $j$th child in the $i$th family and $T(y_{ij})$ be some function of the phenotype. In what follows we will often abbreviate $X(m_{ij})$ with $X_{ij}$ and $T(y_{ij})$ with $T_{ij}$ and drop the subscript indicating family when dealing with data from only one family.

Association test statistics are constructed to detect correlation between genotype and phenotype. In this article, we restrict attention to the class of test statistics that can be expressed as

$$S = \sum_{i} S_i = \sum_{i} \sum_{j} T_{ij} X_{ij} , \tag{1}$$

where the summation is over all children in all families and $S_i$ is the contribution from the $i$th nuclear family, $i = 1, \ldots, N$. Test statistics in this general class constitute the majority of family-based association test statistics proposed in the literature, including tests in the multiallelic setting, tests using quantitative phenotypes, and tests that allow missing parental marker information (Laird et al. 2000; Rabinowitz and Laird 2000). For example, with simplex families, letting $T_{ij}$ be an indicator function for child disease status and $X_{ij}$ be the count of a particular marker allele, $S_i$ counts the total number of alleles in the affected child and $S$ is the same test statistic used in the TDT. Other types of test statistics are discussed in S. Horvath, X. Xu, and N. Laird (unpublished data).
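As a rough illustration of equation (1), the following sketch computes $S_i$ and $S$ for the allele-count coding with an affection-status indicator. The family data, the allele label "A", and the function names are hypothetical and are not taken from the article.

```python
# Toy illustration of S = sum_i sum_j T_ij * X_ij (eq. [1]).
# The families below are invented; each child is (genotype, affected).

def allele_count(genotype, allele="A"):
    """X(m_ij): number of copies of `allele` in a genotype such as 'AB'."""
    return genotype.count(allele)

def family_contribution(children, allele="A"):
    """S_i: sum over children of T_ij * X_ij, with T_ij = 1 if affected."""
    return sum(int(affected) * allele_count(g, allele) for g, affected in children)

families = [
    [("AA", True), ("AB", False)],
    [("AB", True), ("AB", True), ("BB", False)],
    [("BB", True)],
]

S_i = [family_contribution(f) for f in families]  # per-family contributions
S = sum(S_i)                                      # overall statistic
print(S_i, S)
```

Other phenotype codings (for example, a quantitative trait centered at an offset) would only change how $T_{ij}$ is computed; the linear form of the statistic stays the same.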
Under the assumption that the $N$ families are unrelated, the distribution of the test statistic $S$ under H0 depends on the distributions of the independent $S_i$, $i = 1, \ldots, N$. For the $i$th family, the general distribution of $S_i$ depends on the joint distribution of the observed children’s marker genotypes, children’s phenotypes, and parental marker genotypes $p(m_i, M_i, y_i)$. Under the type I H0, $p(m_i, M_i, y_i)$ depends on allele frequencies and the genetic model; conditioning on the phenotypes and the parental genotypes eliminates these unknown nuisance parameters and makes the distribution of $S_i$ dependent only on the conditional distribution of the children’s marker genotypes (Lazzeroni and Lange 1998). When parental genotypes are unknown, the nuisance parameters can be eliminated by conditioning on the sufficient statistic for the parental genotypes $S(M)$, which is composed of the observed parental genotypes (when available) $M_{obs}$ and the children’s genotype configuration $C_m$ (Rabinowitz and Laird 2000). The distribution under the type II H0 is discussed in the next section.
Using the conditional distribution of the children’s marker genotypes, we take the approach of standardizing $S$ and using the large-sample normal or $\chi^2$ approximation. In this case, the mean and variance of the $S_i$ are required. For the type I H0, letting $F_I = [S(M), y]$, S. Horvath, X. Xu, and N. Laird (unpublished data) show that $E(S_i \mid F_I)$ can be computed with the univariate conditional distribution of the children’s marker genotype, and $\mathrm{Var}(S_i \mid F_I)$ can be computed with the univariate and bivariate conditional distributions of the children’s marker genotypes, where $\mathrm{Var}(\cdot)$ refers to the variance-covariance matrix. That is, by using just the joint distributions of $(m_{ij}, m_{ik})$ (which, under the type I H0, do not depend on $j$ and $k$), we can compute $\mathrm{Var}(S_i \mid F_I)$. These distributions can be computed using the RL algorithm for the type I H0.
Tests of Association in the Presence of Linkage
As discussed above, association tests performed in areas of known linkage may significantly refine gene-mapping studies. The challenge is that, among siblings, genetic markers that reside within linked regions are correlated even in the absence of association and after conditioning on $F_I = [y, S(M)]$. The dependence exists because siblings with similar phenotypes are more likely to share the putative disease genes, even in the absence of allelic association. Linkage between a marker and the putative disease gene, therefore, induces positive correlation between the genetic markers of siblings with similar phenotypes. The opposite holds for siblings with disparate phenotypes. The correlation makes $p(m \mid F_I)$ dependent on the recombination parameter and the genetic model for the phenotype.
Conditioning on the minimal sufficient statistic for $\theta$ and the phenotypes removes the dependence of the marker genotypes on $\theta$ and $y$ under the type II H0. When the patterns of allele sharing among siblings can be unambiguously determined, they serve as the minimal sufficient statistic for $\theta$ (Rabinowitz and Laird 2000). With incomplete identification of the allele sharing patterns, the outcome space of the children’s marker genotypes given the minimal sufficient statistic under the type II H0 may be computed using the RL algorithm (type II H0 case). Therefore, under the type II H0, the minimal sufficient statistic $F_{II}$ consists of the minimal sufficient statistic for the recombination parameter $S(\theta)$, the minimal sufficient statistic for the parental marker genotypes $S(M)$, and the observed phenotypes $y$.
Since patterns of allele sharing are defined by the joint realization of sibling marker genotypes, the conditional outcome space consists of the various joint outcomes of sibling marker genotypes satisfying the constraints of the minimal sufficient statistic for the type II H0 (Martin et al. 1997; Rabinowitz and Laird 2000). Therefore, after conditioning on $F_{II}$, the convenient expression of $E(S_i \mid F_{II})$ and $\mathrm{Var}(S_i \mid F_{II})$, in terms of the univariate and bivariate conditional distribution of marker genotypes under the type I H0, cannot be paralleled. Rather, under the type II H0, expressions for $E(S_i \mid F_{II})$ and $\mathrm{Var}(S_i \mid F_{II})$ using the RL algorithm can be found with the multinomial distribution.
For a given family, assume that there are $p$ compatible realizations of the sibling marker genotypes, and let $r$ be a $p \times 1$ random vector, with the $k$th element being an indicator function that assumes the value 1, when the realization of the sibling marker genotypes corresponds to the $k$th element of the conditional outcome space, and 0 otherwise. The set of possible outcomes is given in tables 4–7 in Rabinowitz and Laird (2000) for nuclear families. Because, under the type II H0 and conditional on $F_{II}$, all outcomes are equally likely, with probability $1/p$, $r$ follows a multinomial distribution, with mean and variance given by

$$\mu_r = E(r \mid F_{II}) = \frac{1}{p}\, 1_p$$

and

$$\Sigma_r = \mathrm{Var}(r \mid F_{II}) = \frac{1}{p} \left( I_p - \frac{1}{p}\, 1_p 1_p^{\prime} \right) ,$$

where $1_p$ is a $p \times 1$ vector of 1s and $I_p$ is a $p \times p$ dimensional identity matrix.
The moments of $S_i$ can be derived using the moments of $r$. Let $S_{ir}$ be an $h \times p$ matrix with the $k$th column equal to $\sum_j T_{ij} X(m_{ij}^{(k)})$, where $m^{(k)} = (m_{i1}^{(k)}, \ldots, m_{in_i}^{(k)})$ is the vector of sibling marker genotypes corresponding to the $k$th element of the conditional outcome space and $h$ is the length of the marker genotype coding vector $X$. The conditional mean and variance of $S_i$ are

$$\mu_{S_i} = E(S_i \mid F_{II}) = S_{ir}\, \mu_r$$

and

$$\Sigma_{S_i} = \mathrm{Var}(S_i \mid F_{II}) = S_{ir}\, \Sigma_r\, (S_{ir})^{\prime} .$$

Under the type II H0, the approximate distribution of $S - E(S \mid F_{II})$ is $N_h\left(0, \sum_i \Sigma_{S_i}\right)$.
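For a scalar coding ($h = 1$), the conditional moments reduce to the mean and variance of the statistic over $p$ equally likely outcomes. The sketch below works through that special case with exact fractions. The outcome space shown is invented for illustration (it is not read from the RL tables), and the function name is an assumption.

```python
from fractions import Fraction

def conditional_moments(outcomes, T, allele="A"):
    """Mean and variance of S_i when the p outcomes are equally likely.

    outcomes: list of tuples of sibling genotypes (one tuple per outcome).
    T: phenotype codes T_ij, one per sibling, in the same order.
    Scalar coding only (h = 1): X counts copies of `allele`.
    """
    p = len(outcomes)
    # Row vector S_ir: value of the statistic under each outcome.
    s = [sum(t * g.count(allele) for t, g in zip(T, m)) for m in outcomes]
    mean = Fraction(sum(s), p)                         # S_ir * mu_r
    var = Fraction(sum(x * x for x in s), p) - mean ** 2  # S_ir * Sigma_r * S_ir'
    return mean, var

# Hypothetical two-sib family: sib 1 affected, sib 2 unaffected,
# with two compatible genotype assignments.
outcomes = [("AA", "AB"), ("AB", "AA")]
T = [1, 0]
mean, var = conditional_moments(outcomes, T)
print(mean, var)
```

With $\Sigma_r = (1/p)(I_p - (1/p) 1_p 1_p^{\prime})$, the quadratic form $S_{ir} \Sigma_r S_{ir}^{\prime}$ equals the average of the squared outcome values minus the squared mean, which is what the function computes.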


The last column of table 1 indicates which combinations of parental marker genotypes and children marker configurations are potentially informative in the biallelic setting with the RL algorithm applied to the type II H0 setting. When parental data are missing (as is often the case for late-onset diseases), sibships with more than two sibs and $C_m = \{AA, AB\}$ or $C_m = \{BB, AB\}$ are not informative, because allele sharing cannot be discerned. The removal of these types of sibships may cause a substantial loss in the effective sample size, especially when one of the alleles is rare, because homozygotes of the rare allele will be infrequent. An alternative to conditioning on the allele sharing is to take advantage of the linear form of the test statistic (eq. [1]) and to use the RL algorithm for the type I H0 to calculate the expectation, in conjunction with a robust variance-covariance estimator. The development of this approach follows.
Factorization of $p(m \mid F_I)$ under Type II H0
In view of the potentially severe loss of information caused by conditioning on sibling identical-by-descent (IBD) patterns, we here develop a method that employs the type I H0 RL algorithm to compute $\sum_{i=1}^{N} [S_i - E(S_i \mid F_I)]$ and an empirical variance-covariance estimator that is robust to the correlation among the sibling marker genotypes. To show that $\sum_{i=1}^{N} [S_i - E(S_i \mid F_I)]$ is a valid measure of association in the presence of linkage, we derive the marginal conditional distribution for the $k$th sibling marker genotype $p(m_k \mid F_I)$ and show that this marginal distribution is the same under both the type I H0 and the type II H0 and does not depend on the recombination parameter $\theta$ or on the observed phenotypes $y$ for $k = 1, \ldots, n$ (see Appendix). Since the linear form of the test statistic (eq. [1]) permits its expectation to be found using $p(m_k \mid F_I)$, the RL algorithm for the type I H0 can be used to compute $E(S_i \mid F_I)$. Therefore, without specification or estimation of $\theta$ and without parameterization of the phenotype distribution, $S - E(S \mid F_I)$ can be used to construct an unbiased test for association in the presence of linkage. Since family-specific contributions comprise $S - E(S \mid F_I)$, only the variances of these contributions are needed to compute $\mathrm{Var}[S - E(S \mid F_I)]$; the correlation among children need not be addressed when finding $\mathrm{Var}[S_i - E(S_i \mid F_I)]$.

Table 1

Nuclear Family Informativeness for Both Conditioning Approaches

                                                    FAMILY INFORMATIVENESS
PARENTAL GENOTYPES(a)   CHILDREN CONFIGURATION(b)   EV-FBAT   RL Algorithm Type II H0
AA,AA                   NA                          No        No
AA,AB                   NA                          Yes       Yes
AA,BB                   NA                          No        No
AB,AB                   NA                          Yes       Yes
AA,−                    {AA}                        No        No
AA,−                    {AA,AB}                     Yes       Yes
AB,−                    {AA}                        No        No
AB,−                    {AB}                        No        No
AB,−                    {AA,AB}                     Yes       No when n > 2
AB,−                    {AA,BB}                     Yes       Yes
AB,−                    {AA,AB,BB}                  Yes       Yes
−,−                     {AA}                        No        No
−,−                     {AB}                        No        No
−,−                     {AA,AB}                     Yes       No when n > 2
−,−                     {AA,BB}                     Yes       Yes
−,−                     {AA,AB,BB}                  Yes       Yes

(a) − = not genotyped.
(b) NA = not applicable.

The derivation in the Appendix employs an ordered notation similar to that of Thomson (1995), where $m_k^{*}$ is the marker genotype of the $k$th child, expressed in terms of the parental derived haplotypes (see Appendix). In particular, it is shown that under both the type I H0 and the type II H0, the joint conditional probability for a family can be factored into

$$\Pr(m \mid F_I) = \sum_{M_u \in A} \Pr(m_{-k} \mid m_k, M, y) \times \left[ \frac{\sum_{m_k^{*} \in B} \Pr(m_k^{*}, M)}{\Pr[S(M)]} \right] ,$$

where $m_{-k}$ is the vector of sibling marker alleles with the $k$th sibling information omitted, $M_u$ denotes the unobserved parental marker genotypes, $A$ is the set of unobserved parental marker genotypes that coincide with $S(M)$, and $B$ corresponds to the set of paternal and maternal derived markers for parents with marker genotypes $M$ that result in the $k$th sibling’s observed marker genotype $m_k$. Marginalization of $\Pr(m \mid F_I)$ with respect to $m_{-k}$ results in the marginal conditional probability for the $k$th sibling marker genotype with $\Pr(m_k \mid F_I) = \Pr[m_k \mid S(M)]$. In addition, we show that $\Pr[m_k \mid S(M)]$ is not a function of $\theta$ and can be computed using the RL algorithm for the type I H0. Although the factorization can be used to find the correct conditional expectation of the test statistic, it cannot be used to derive expressions for the covariance between sibling marker genotypes, because it marginalizes over the IBD relationships.
Since $S_i - E(S_i \mid F_I)$ are independent mean 0 random vectors with unspecified variance-covariance matrices, we can apply the results of White (1980) to construct a robust variance-covariance estimator of $S - E(S \mid F_I)$. Specifically, White (1980) addresses estimation of the variance-covariance matrix for estimated regression parameters in linear models with heteroscedastic errors. The test statistic $S - E(S \mid F_I)$ can be couched as proportional to a vector of parameter estimates from a linear model and, therefore, the White empirical variance-covariance estimator, given by

$$\hat{\Sigma}_W = \widehat{\mathrm{Var}} \left\{ \sum_{i=1}^{N} [S_i - E(S_i \mid F_I)] \right\} = \sum_{i=1}^{N} [S_i - E(S_i \mid F_I)] [S_i - E(S_i \mid F_I)]^{\prime} , \tag{2}$$

provides a consistent estimate of the variance-covariance matrix of $S - E(S \mid F_I)$. Alternatively, $\hat{\Sigma}_W$ can be derived using the results of Liang and Zeger (1986) on generalized estimating equations. When $S$ is vector-valued, $\hat{\Sigma}_W$ may not be full rank. In this case, the test statistic for the type II H0 is $[S - E(S \mid F_I)]^{\prime}\, \hat{\Sigma}_W^{-}\, [S - E(S \mid F_I)]$, where $\hat{\Sigma}_W^{-}$ is the generalized inverse of $\hat{\Sigma}_W$. It should be noted that the empirical variance-covariance estimator (2) reduces to a simple sum of squares for the biallelic case.
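In the biallelic (scalar) case, the standardized statistic is the squared sum of the family residuals divided by the sum of their squares. The following sketch shows that computation on invented numbers. The observed and expected values are hypothetical, and in practice the expectations $E(S_i \mid F_I)$ would come from the RL algorithm, which is not implemented here.

```python
def ev_fbat_chi2(observed, expected):
    """Scalar EV-FBAT-style statistic from per-family S_i and E(S_i | F_I).

    Returns (chi-square statistic, number of families with nonzero residual).
    Families with S_i equal to its expectation add nothing to either the
    numerator or the empirical variance.
    """
    residuals = [s - e for s, e in zip(observed, expected)]
    informative = [u for u in residuals if u != 0]
    variance = sum(u * u for u in informative)   # eq. (2), scalar case
    if variance == 0:
        raise ValueError("no informative families")
    return sum(informative) ** 2 / variance, len(informative)

# Hypothetical per-family values (not from the NIMH data).
observed = [2, 1, 3, 2, 0, 2]
expected = [1.5, 1.0, 2.0, 1.5, 0.5, 1.0]
chi2, n_informative = ev_fbat_chi2(observed, expected)
print(chi2, n_informative)
```

Because the variance is estimated from the residuals themselves, no model for the within-family genotype correlation is needed, at the cost of relying on a large number of independent families.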
Extensions to more-complex pedigrees are straightforward. Assume that the $i$th pedigree can be split into $q_i$ nuclear families, for $i = 1, \ldots, F$, and let

$$S - E(S \mid F_I) = \sum_{i=1}^{F} \sum_{j=1}^{q_i} [S_{ij} - E(S_{ij} \mid F_I)] ,$$

where $S_{ij}$ is the test-statistic contribution from the $j$th nuclear family in the $i$th pedigree and $E(S_{ij} \mid F_I)$ is computed using formulas by S. Horvath, X. Xu, and N. Laird (unpublished data). Although the contributions from nuclear families in the same pedigree are not independent, we can again appeal to White (1980) to construct a consistent estimate of the variance-covariance matrix of $S - E(S \mid F_I)$:

$$\hat{\Sigma}_W = \sum_{i=1}^{F} \left[ \sum_{j=1}^{q_i} \{ S_{ij} - E(S_{ij} \mid F_I) \} \right] \left[ \sum_{j=1}^{q_i} \{ S_{ij} - E(S_{ij} \mid F_I) \} \right]^{\prime} .$$

The advantage of the empirical variance-covariance approach is that more nuclear-family marker configurations are informative than is the case with the type II conditioning method. Table 1 indicates which nuclear family configurations are informative for the two approaches in the setting of a biallelic marker. In addition, since the conditioning is different for the two approaches, the expected values and variance-covariance terms are also not the same. We will refer to the empirical variance-covariance approach as “EV-FBAT.”


Example: Testing for Association in the A2M Gene

As an example, we tested for association between the A2M-18i deletion and AD in a set of sibships from the National Institute of Mental Health (NIMH) Genetics Initiative AD Sample. The ascertainment and assessment of the AD families collected have been discussed elsewhere (Blacker et al. 1997). The sample we used is composed of 437 individuals in 120 sibships and is identical to the sample analyzed by Blacker et al. (1999); 246 of the siblings met the NINCDS/ADRDA criteria for AD and/or had autopsy confirmation of the diagnosis.
Table 2 contains the results for testing the A2M-18i/AD association. The test statistic used in the applications of the RL algorithm is the sum of the A2M-1 alleles in AD-affected siblings. This corresponds to the following coding schemes:

$$T_{ij} = \begin{cases} 1 & \text{if sibling } j \text{ in the } i\text{th sibship is affected} \\ 0 & \text{otherwise} \end{cases}$$

and

$$X_{ij} = \begin{cases} 2 & \text{if } m_{ij} = \text{A2M-1/A2M-1} \\ 1 & \text{if } m_{ij} = \text{A2M-1/A2M-2} \\ 0 & \text{otherwise.} \end{cases}$$

Implementation of the RL algorithm consists of finding the expected value of $X_{ij}$ conditional on the minimal sufficient statistic corresponding to the null hypothesis. Variance estimation is accomplished through the procedures described above.
Application of the RL algorithm to test for linkage and association (type I H0) results in 51 informative sibships and a significant finding. As discussed above, the type I H0 may not be appropriate in view of the reported linkage evidence in the region spanning the A2M gene. Conditioning on the type II H0 minimal sufficient statistic results in a dramatic decrease in the effective sample size. With only 10 informative sibships, the test statistic is only marginally significant, and its large-sample $\chi^2$ approximation may not be reliable (table 2). With EV-FBAT, 44 sibships were informative, resulting in a highly significant result ($\chi^2 = 8.631$, corresponding to $Z = 2.94$; $P = .0033$).

Table 2

A2M/Alzheimer Disease Association Test Results for Various Methods

Method                   No. of Informative Sibships   χ²      P
Type I RL algorithm      51                            8.599   .0034
Type II RL algorithm     10                            6.125   .0133
EV-FBAT                  44                            8.631   .0033
Siegmund et al. (2000)   51                            6.916   .0085
PDT                      50                            8.387   .0038
SDT                      46                            …       .0016

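The P values in table 2 can be checked against their $\chi^2$ statistics with a short calculation. The sketch below assumes each statistic is referred to a $\chi^2$ distribution with 1 degree of freedom, which is an assumption based on the scalar biallelic coding rather than something stated in the table; the function name is hypothetical. The SDT row is omitted because no $\chi^2$ value is listed for it.

```python
import math

def chi2_1df_pvalue(x):
    """Upper-tail P value for a chi-square statistic with 1 df (assumed)."""
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

# (chi-square, reported P) pairs as listed in table 2.
table2 = {
    "Type I RL algorithm": (8.599, 0.0034),
    "Type II RL algorithm": (6.125, 0.0133),
    "EV-FBAT": (8.631, 0.0033),
    "Siegmund et al. (2000)": (6.916, 0.0085),
    "PDT": (8.387, 0.0038),
}

computed = {name: chi2_1df_pvalue(x) for name, (x, _) in table2.items()}
for name, (x, reported) in table2.items():
    print(f"{name}: chi2 = {x}, computed P = {computed[name]:.4f}, reported P = {reported}")

# The EV-FBAT chi-square corresponds to a normal deviate of about 2.94.
z_ev_fbat = math.sqrt(table2["EV-FBAT"][0])
```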
The discrepancy in the number of informative families is a consequence of the absence of parental genotype data and the distribution of genotypes among the siblings [p(A2M-1/A2M-1) = .732, p(A2M-1/A2M-2) = .231, and p(A2M-2/A2M-2) = .037]. The 34 families that are informative for EV-FBAT but not informative for the type II H0 conditioning approach have more than two siblings and $C_m$ = {A2M-1/A2M-1, A2M-1/A2M-2} or $C_m$ = {A2M-2/A2M-2, A2M-1/A2M-2} as the sibling marker configuration. As indicated by table 1, these sibships are not informative for the type II H0 RL approach because no definite allele sharing can be discerned. Because it does not condition on the allele sharing, the empirical variance approach is not subject to these constraints. The difference between the number of informative families for the type I H0 RL test and for EV-FBAT is a result of the definition of the empirical variance (2). Families with $S_i = E(S_i \mid F_I)$ do not contribute to the test statistic or the empirical variance-covariance estimate.
To justify the EV-FBAT $\chi^2$ approximation with 44 informative sibships, we empirically estimated the significance level under the type II H0 for various numbers of informative sibships. We simulated sibships that were similar to the NIMH sibships in that the size distribution of the sibships was maintained, the biallelic marker had population allele frequencies of 0.20 and 0.80, and the baseline prevalence was fixed at 0.30. Because simulated data with the same number of sibships will have different numbers of informative families, we report the mean number of informative families. For each number of sibships we simulated 10,000 data sets. In figure 1, the circles represent the empirical significance levels for the mean number of informative families. The dashed lines are the pointwise 95% Monte Carlo sampling-error levels (0.0457, 0.0543). Figure 1 shows that the empirical significance level is within Monte Carlo sampling error for a large range of informative sibships. Indeed, the $\chi^2$ approximation appears to hold even for samples with only 20 informative sibships. With <20 informative sibships, the test appears to become conservative.
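The Monte Carlo sampling-error levels quoted above follow from a binomial standard error around the nominal level. The sketch below reproduces them under the assumption of a nominal level of 0.05 and a normal critical value of 1.96; neither value is stated explicitly in the text, and the function name is hypothetical.

```python
import math

def monte_carlo_bounds(alpha=0.05, n_sim=10_000, z=1.96):
    """Pointwise 95% bounds for an empirical rejection rate.

    Treats the number of rejections in n_sim simulated data sets as
    binomial with success probability alpha (assumed nominal level).
    """
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_sim)
    return alpha - half_width, alpha + half_width

lower, upper = monte_carlo_bounds()
print(f"({lower:.4f}, {upper:.4f})")
```

An empirical significance level falling outside these bounds would suggest that the $\chi^2$ approximation is not holding for that number of informative sibships.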
Robust variance-covariance estimation has been implemented in the context of a TDT extension (TRANSMIT; Clayton 1999), conditional logistic regression (Siegmund et al. 2000), and the PDT (Martin et al. 2000c). All three procedures are limited to qualitative traits, whereas the application of Siegmund et al. (2000) is further restricted to discordant sibships. When applied to the A2M data set, the Wald statistic from conditional logistic regression with robust variance estimation produces a test statistic that is not as pronounced as that of EV-FBAT but is still significant (table 2). The PDT produces a test statistic that is essentially equivalent to the test statistic of EV-FBAT in these data.

Figure 1 Empirical significance levels under the type II H0 for average number of informative sibships. The dashed lines are the pointwise 95% Monte Carlo sampling error levels (0.0457, 0.0543).

Another alternative is to use the sibship disequilibrium test (SDT; Horvath and Laird 1998). As shown in table 2, the SDT provides the strongest evidence for linkage disequilibrium. The SDT is well suited to the discordant sibships setting of the NIMH data, but it is restricted to qualitative phenotypes and cannot efficiently handle families with genotype-known parents.
Discussion
One strategy for positional genomic analysis is to focus allelic-association testing on regions that have been identified through linkage analysis as putatively containing a gene or genes influencing phenotypic variation. Supplementing linkage results with association methodology is needed because, with complex diseases, linkage peaks may span regions of ≥10–20 cM that cover a large number of genes and are beyond the reach of positional cloning (Hauser and Boehnke 1997). A significant association finding may greatly refine the search for the underlying trait gene, since linkage disequilibrium will not generally extend over regions >1 cM in outbred populations (Pericak-Vance 1998). Although the utility of association methodology in this setting has been questioned (Terwilliger and Weiss 1998), the use of association methodology in the dissection of a region linked to human hypertension has recently yielded a susceptibility locus (Bray et al. 2000).
Candidates for the association tests within regions identified by linkage may be chosen via database searches using knowledge of biological pathways (Brookes et al. 2000). In addition, as dense maps of single-nucleotide polymorphisms (SNPs) become available and costs of genotyping decline, the dissection of linked regions may be accomplished by saturating the linked regions with SNPs and performing association tests on them. Martin et al. (2000a, 2000b) have used the APOE gene to illustrate the potential for using SNPs in mapping studies of complex traits.
With these strategies in mind, we have presented a method for evaluating the mean and variance-covariance of a wide range of test statistics computed under the null hypothesis that there is linkage but no association (type II H0). The method, EV-FBAT, determines the expected value of an association test statistic by conditioning on the minimal sufficient statistic under the null hypothesis of no linkage and no association (type I H0) and uses an empirical variance-covariance estimator that is consistent even when the sibling marker genotypes are correlated. As discussed above, the expectation of the test statistic is computed via the RL algorithm, and the resulting standardized test statistic is unbiased as a test for association in the presence of linkage. In addition, while retaining the robust properties of family based association tests, EV-FBAT does not suffer from the costly reduction in sample size caused by missing parental data that is inherent with approaches that condition on sibling IBD patterns.
The results of the A2M/AD example strongly suggest that the A2M-18i deletion is in linkage disequilibrium with a polymorphism that contributes to AD development. Whether or not the A2M-18i polymorphism is the polymorphism of interest (in which case the linkage disequilibrium is complete) cannot be deduced by association tests. In light of the evidence for linkage, relying on the type I H0 test alone would leave open the interpretation of the P value. Here, the P values of the type I RL approach and EV-FBAT agree; in general, we expect the type II H0 P values to be larger if H0 is true. Additional work will investigate the power of EV-FBAT and various proposed methods under Ha.
For qualitative traits and biallelic markers, EV-FBAT is similar to the PDT (Martin et al. 2000c). In the PDT, pedigrees are broken into nuclear families and discordant sibships. Let A and B be the two alleles of the marker. The contribution to the test statistic of a particular pedigree consists of weighted sums of the number of A alleles for each affected child minus an “expected” number of A alleles. This expectation is computed from unaffected siblings when the affected child belongs to a discordant sibship and is computed using a pseudocontrol (as defined by Falk and Rubinstein 1987) when the affected child belongs to a nuclear family. If a child belongs to a nuclear family and a discordant sibship, both differences are computed. Under the type II H0, the sum of the pedigree contributions has expectation 0 and is standardized with an empirical estimator of the variance.
In this setting, the difference between the PDT and EV-FBAT is in the derivation of the expected number of A alleles under the type II H0. In using the RL algorithm for the type I H0, EV-FBAT conditions on the minimal sufficient statistic and, by definition, makes the most efficient use of the observed data in constructing the control genotype (see Cox and Hinkley [1974] or Rabinowitz and Laird [2000]). Further, the PDT cannot use concordant sibships with missing parental marker information and is also limited to the dichotomous-phenotype case.
EV-FBAT uses a robust variance-covariance estimation to take into account the correlation among sibling marker genotypes under the type II H0. In addition to the PDT and EV-FBAT, a robust variance-covariance estimation for the qualitative setting has been implemented in the context of a TDT extension (TRANSMIT; Clayton 1999) and conditional logistic regression (Siegmund et al. 2000). The method of Clayton (1999) uses the EM algorithm (Dempster et al. 1977) to impute the likelihood contribution from family trios in which there is missing parental information and/or ambiguous genetic transmissions. Such imputation requires a full specification of the family-trio likelihood that depends on estimates of allele frequencies and population genetic assumptions that are difficult to justify. A score test based on these likelihood contributions is used to test for association with a robust variance-covariance estimator when multiple siblings are allowed.
The merits of association tests based on conditional logistic regression have been discussed (Witte et al. 1998; Kraft and Thomas 2000). Siegmund et al. (2000) recommend generalized estimating equations applied to the conditional logistic likelihood when the type II H0 is used. Unlike EV-FBAT, this method does not make any use of available parental data and is restricted to discordant sibships. As with the PDT, both TRANSMIT and the Siegmund et al. (2000) procedure are limited to qualitative traits.
In summary, EV-FBAT provides a flexible framework for association testing in the presence of linkage because it can be used with any type of phenotype and with any pedigree configuration. Therefore, the researcher is not restricted to particular sampling designs and is free to test for associations with quantitative or time-to-onset traits. Indeed, with EV-FBAT, the approaches to association testing with binary, quantitative, and time-to-onset phenotypes for the type I H0 advocated by S. Horvath, X. Xu, and N. Laird (unpublished data) can all be adapted to the type II H0. Application of EV-FBAT is limited to the class of test statistics that can be expressed in a linear form (eq. [1]), but, as discussed in Laird et al. (2000), a number of family-based association-test statistics are of this form. Furthermore, Clayton and Jones (1999) and Lunetta et al. (2000) have shown that the score statistics from generalized linear models in which the coded marker genotype is the covariate can be expressed in the form of equation (1). The case when the test statistic may depend on unknown nuisance parameters is discussed in Lunetta et al. (2000). The method is also valid as a test of the type I H0 of no linkage and no association, since the empirical variance-covariance estimator is a consistent estimator under both types of null hypotheses.
The empirical variance approach for testing association in the presence of linkage has been implemented in a program called FBAT. It is invoked with the -e (for empirical variance) option for the fbat command. The program and its documentation are available free of charge from our Web site. There are different versions of the program for different operating systems: MAC, Solaris/Sparc, and Windows. If you encounter problems, please e-mail [email protected]
Acknowledgements
We thank Dr. Steve Horvath for valuable conversations and Dr. John Rogus for helpful comments on the manuscript. Support for this research was provided by National Institutes of Health (NIH) grant MH 59532. We are indebted to two anonymous referees for their helpful suggestions. The genotypes of the sibships were generated in the laboratory of Dr. Rudy Tanzi, with support from NIH grant R01 MH60009. Data and biomaterials were collected in three projects that participated in the NIMH Alzheimer Disease Genetics Initiative. From 1991 to 1998, the principal investigators and coinvestigators were: Marilyn S. Albert, Ph.D., and Deborah Blacker, M.D., Sc.D., Massachusetts General Hospital, Boston, grant U01 MH46281; Susan S. Bassett, Ph.D., Gary A. Chase, Ph.D., and Marshal F. Folstein, M.D., Johns Hopkins University, Baltimore, grant U01 MH46290; and Rodney C. P. Go, Ph.D., and Lindy E. Harrell, M.D., University of Alabama, Birmingham, grant U01 MH46373.

Appendix A

Proof

We show that, under the type II H0, the joint conditional distribution of the sibling marker genotypes $m$ given the sufficient statistic for the parental marker genotypes $S(M)$ and the observed phenotypes $y$ can be factored into a form amenable to the approach discussed above. The key point is that the marginal conditional distribution of a child’s marker genotype is not a function of the recombination parameter $\theta$ or of the observed phenotypes $y$. Therefore, under the type II H0, the expectation of the test statistic conditional on the minimal sufficient statistic for the type I H0 can be found using the type I H0 RL algorithm, without modeling the correlation between the children’s marker genotypes.
Since $S(M) = (C_m, M_{obs})$, where $C_m$ is the configuration of sibling marker genotypes and $M_{obs}$ is any observed parental marker genotype, the joint conditional distribution can be expressed as

\[
\begin{aligned}
\Pr[m \mid S(M), y] &= \Pr[S(M), y]^{-1} \Pr[m, S(M), y] \\
&= \Pr[S(M), y]^{-1} \Pr(m, C_m, M_{obs}, y) \\
&= \Pr[S(M), y]^{-1} \Pr(m, M_{obs}, y) \\
&= \Pr[S(M), y]^{-1} \sum_{M_u \in A_{S(M)}} \Pr(m, M, y) ,
\end{aligned}
\]

where $A_{S(M)}$ is the set of possible unobserved parental marker genotypes with elements $M_u$ that correspond to $S(M)$ and where $M = (M_{obs}, M_u)$.

To derive the marginal conditional distribution of a child’s marker genotype we arbitrarily select the $k$th sibling (referred to as the reference sibling) and let $m_{-k}$ be the vector of sibling marker alleles with the $k$th sibling information omitted. For all $k = 1, \ldots, n$ we have that

\[
\sum_{M_u \in A_{S(M)}} \Pr(m, M, y) = \sum_{M_u \in A_{S(M)}} \Pr(m_{-k} \mid m_k, M, y) \Pr(m_k, M, y) .
\]

We next show that $\Pr(m_k, M, y) = \Pr(m_k, M) f(y)$, where $f(y)$ is the joint distribution of the sibling phenotypes. To do this, we adopt a notation similar to the ordered notation of Thomson (1995), which identifies the paternally and maternally derived haplotypes that comprise the marker genotypes of the children. This is accomplished by expanding the parental marker genotypes into specific haplotypes, $M_i^* = [m_{i1}^{(p)}/m_{i2}^{(p)}, m_{i1}^{(m)}/m_{i2}^{(m)}]$, and letting $m_{ij}^*$ be the marker genotype of the $j$th child expressed in terms of the parental-derived haplotypes. That is, $m_{ij}^* = [m_{id_j}^{(p)}/m_{id_j'}^{(m)}]$, where $d_j, d_j' = 1, 2$ indicate inheritance from each parent. Furthermore, let $B_{m_k^*, M}$ (written $B$ in the sums that follow) correspond to the set of paternally and maternally derived markers from parents with marker genotypes $M$ that result in the $k$th sibling’s observed marker genotype $m_k$, and let $G = [g_1^{(p)}/g_2^{(p)}, g_1^{(m)}/g_2^{(m)}]$ be the unobserved disease genotypes for the parents and $g$ be the vector of unobserved disease genotypes for the children. The joint probability, $\Pr(m_k, M, y)$, thus can be expressed as the summation

\[
\begin{aligned}
\Pr(m_k, M, y) &= \sum_{m_k^* \in B} \Pr(m_k^*, M, y) \\
&= \sum_{m_k^* \in B} \sum_{G} \sum_{g \in G} \Pr(y, g, m_k^*, H) ,
\end{aligned}
\tag{A1}
\]

where the additional summations in (A1) are with respect to the set of possible parental disease genotype combinations and the set of siblings’ disease genotypes conditional on parental disease genotypes $G$ and where $H = [m_1^{(p)} g_1^{(p)}/m_2^{(p)} g_2^{(p)}, m_1^{(m)} g_1^{(m)}/m_2^{(m)} g_2^{(m)}]$ describes the parental haplotypes.
Under the assumption that sibling disease genotypes are conditionally independent given parental haplotypes, equation (A1) can be expressed as

\[
\sum_{m_k^* \in B} \sum_{G} \sum_{g \in G} f(y \mid g) \left[ \prod_{i \neq k} \Pr(g_i \mid H) \right] \Pr(g_k \mid m_k^*, H) \Pr(m_k^*, H) .
\tag{A2}
\]

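Under the type II null hypothesis of no association, we have that $\Pr(g_i \mid H) = \frac{1}{4}$ for $i = 1, \ldots, n$; $i \neq k$, and $\Pr(m_k^*, H) = \Pr(m_k^*, M) \Pr(G)$. Therefore, (A2) can be simplified to

\[
\sum_{m_k^* \in B} \Pr(m_k^*, M) \left\{ \sum_{G} \left[ \left( \frac{1}{4} \right)^{n-1} \sum_{g \in G} f(y \mid g) \Pr(g_k \mid m_k^*, H) \right] \Pr(G) \right\} .
\tag{A3}
\]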

Let $F_G$ denote the expression within square brackets in equation (A3). There are $4^n$ terms in $F_G$, corresponding to all the combinations of disease genotypes in the $n$ children. The summation over all combinations of parental disease genotypes makes the terms in $F_G$ with the same parental disease allele sharing patterns equivalent. For example, in the case of two children with the first child being the reference sibling,

\[
\begin{aligned}
&\sum_{G} f\{y \mid g_1 = [g_1^{(p)}, g_1^{(m)}],\, g_2 = [g_1^{(p)}, g_2^{(m)}]\} \Pr(G) \\
&\quad = \sum_{G} f\{y \mid g_1 = [g_2^{(p)}, g_1^{(m)}],\, g_2 = [g_2^{(p)}, g_2^{(m)}]\} \Pr(G) \\
&\quad = \sum_{G} f\{y \mid g_1 = [g_1^{(p)}, g_2^{(m)}],\, g_2 = [g_1^{(p)}, g_1^{(m)}]\} \Pr(G) \\
&\quad = \sum_{G} f\{y \mid g_1 = [g_2^{(p)}, g_2^{(m)}],\, g_2 = [g_2^{(p)}, g_1^{(m)}]\} \Pr(G) .
\end{aligned}
\]

Furthermore, if we assume $m_1^* = (m_1^{(p)}, m_1^{(m)})$, then we have that

\[
\sum_{G} \sum_{g \in G_{IBD=1p}} f(y \mid g_1, g_2) \Pr(g_1 \mid m_1^*, H) \Pr(G) = \sum_{G} f\{y \mid g_1 = [g_1^{(p)}, g_1^{(m)}],\, g_2 = [g_1^{(p)}, g_2^{(m)}]\} \Pr(G) ,
\]

where $G_{IBD=1p}$ is the set of disease allele–sharing patterns, between the two siblings, that result in them sharing the paternally but not the maternally derived disease allele. Because of the ordered notation, $\Pr(g_1 \mid m_1^*, H)$ is a simple function of the recombination parameter $\theta$, which cancels in the summation.
The same logic can be applied to any disease allele–sharing patterns for any number of children, making it straightforward to show that $\sum_{G} F_G \Pr(G) = f(y)$. Therefore, $\Pr(m_k, M, y) = \Pr(m_k, M) f(y)$, where $\Pr(m_k, M)$ is not a function of $\theta$ or of $y$, and we have the following factorization of the joint conditional distribution:


\[
\Pr[m \mid S(M), y] = \Pr[S(M)]^{-1} \sum_{M_u \in A_{S(M)}} \Pr(m_{-k} \mid m_k, M, y) \left[ \sum_{m_k^* \in B} \Pr(m_k^*, M) \right] ,
\]

where we have used the fact that, under the type II H0, $\Pr[S(M), y] = \Pr[S(M)] \Pr(y)$. We can marginalize the joint distribution with respect to $m_{-k}$ to obtain

\[
\Pr[m_k \mid S(M), y] = \Pr[S(M)]^{-1} \sum_{M_u \in A_{S(M)},\, m_k^* \in B} \Pr(m_k^*, M) .
\tag{A4}
\]

The term on the right side of (A4) is the conditional distribution of marker genotypes for the $k$th sibling, $\Pr[m_k \mid S(M)]$, under the null hypothesis of no linkage and no association. It has been tabulated by Rabinowitz and Laird (2000), for arbitrary missing parental marker information, and can be used to derive $E(S_i \mid F_I)$ under the type II H0. In summary, we have shown that $\sum_{i=1}^{N} [S_i - E(S_i \mid F_I)]$ is a valid measure of association in the presence of linkage.
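
The key point of the proof can be checked numerically in a small case. The Python sketch below enumerates one sib pair whose parents' ordered marker haplotypes are known, under linkage but no association. It assumes a biallelic disease locus with allele frequency `p`, disease alleles assigned to parental haplotypes independently of the marker alleles, and a hypothetical penetrance vector. None of these values come from the article, and the function name is illustrative.

```python
from itertools import product

def sibling_marker_distributions(theta, p=0.2,
                                 penetrance=(0.05, 0.4, 0.8), y=(1, 1)):
    """Return (marginal, same) for a sibship with phenotypes y.

    marginal[(a, b)]: Pr(sib 1 inherits paternal marker haplotype a and
                      maternal marker haplotype b | y).
    same: Pr(sibs 1 and 2 inherit identical marker haplotypes | y).
    """
    n = len(y)
    weights = {}  # marker transmission pattern -> Pr(pattern, y)
    # G: disease allele (1 = risk) on paternal haplotypes 1, 2 and
    # maternal haplotypes 1, 2; independent of the marker alleles.
    for G in product((0, 1), repeat=4):
        pr_G = 1.0
        for allele in G:
            pr_G *= p if allele else 1 - p
        for pattern in product(product((0, 1), repeat=2), repeat=n):
            pr = pr_G
            for (a, b), y_j in zip(pattern, y):
                lik = 0.0
                # c, d: parental haplotypes the disease alleles come from;
                # they differ from a, b only if a recombination occurred.
                for c, d in product((0, 1), repeat=2):
                    pr_c = 1 - theta if c == a else theta
                    pr_d = 1 - theta if d == b else theta
                    n_risk = G[c] + G[2 + d]
                    f = penetrance[n_risk]
                    lik += pr_c * pr_d * (f if y_j == 1 else 1 - f)
                pr *= 0.25 * lik  # 1/4 = Mendelian marker transmission
            weights[pattern] = weights.get(pattern, 0.0) + pr
    total = sum(weights.values())
    marginal = {}
    for pattern, w in weights.items():
        marginal[pattern[0]] = marginal.get(pattern[0], 0.0) + w / total
    same = sum(w for pat, w in weights.items() if pat[0] == pat[1]) / total
    return marginal, same

marginal_linked, same_linked = sibling_marker_distributions(theta=0.0)
marginal_unlinked, same_unlinked = sibling_marker_distributions(theta=0.5)
```

In this toy setting each of the four marginal transmission probabilities for the reference sibling equals 1/4 whatever the value of `theta`, while the probability that two affected siblings inherit the same marker haplotypes is 1/4 only when `theta` is 0.5 and exceeds 1/4 under tight linkage. The mean of the statistic is therefore unaffected by linkage, but the sibling correlation has to be handled by the variance estimator.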

Electronic-Database Information
The URL for data in this article is as follows:
FBAT Web page, http://www.biostat.harvard.edu/~fbat/default.html (for free FBAT program and documentation)
References
Blacker D, Haines JL, Rhodes L, Terwedow H, Go RCP, Harrell LE, Perry RT, Bassett SS, Chase G, Meyers D, Albert MS, Tanzi R (1997) ApoE-4 and age at onset of Alzheimer’s disease: the NIMH genetics initiative. Neurology 48:139–147
Blacker D, Wilcox MA, Laird NM, Rhodes L, Horvath SM, Go RCP, Perry R, Watson B, Bassett SS, McInnis MG, Albert MS, Hyman BT, Tanzi RE (1998) Alpha-2 macroglobulin is genetically associated with Alzheimer disease. Nat Genet 19:357–360
Blacker D, Crystal AS, Wilcox MA, Laird NM, Tanzi RE (1999) An alpha-2 macroglobulin insertion-deletion polymorphism in Alzheimer disease-Reply. Nat Genet 22:21–22
Bray MS, Krushkal J, Li L, Ferrell R, Kardia S, Sing CF, Turner ST, Boerwinkle E (2000) Positional genomic analysis identifies the β2-adrenergic receptor gene as a susceptibility locus in human hypertension. Circulation 101:2877–2882
Brookes AJ, Emahazion T, Howell WM, Jobs M, Sawyer S, Fredman D, Siegfried M, Feuk L, Prince JA (2000) Using intra-genic SNPs to study complex disease: tools, systems and practical experience. Paper presented at DNA2000: International Symposium on the State of the Art in Genetic Analysis. Boston, June 1–3
Clayton D (1999) A generalization of the transmission/disequilibrium test for uncertain haplotype transmission. Am J Hum Genet 65:1170–1177
Clayton D, Jones H (1999) Transmission/disequilibrium test for extended marker haplotypes. Am J Hum Genet 65:1161–1169
Cox DR, Hinkley DV (1974) Theoretical statistics. Halsted Press, New York
Dempster A, Laird NM, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–22

Dow DJ, Lindsey N, Cairns NJ, Brayne C, Robinson D, Huppert FA, Paykel ES, Xuereb J, Wilcock G, Whittaker JL, Rubinsztein DC (1999) α-2 macroglobulin polymorphism and Alzheimer disease risk in the UK. Nat Genet 22:16–17
Ewens W, Spielman RS (1995) The transmission/disequilibrium test: history, subdivision and admixture. Am J Hum Genet 57:455–464
Falk CT, Rubinstein P (1987) Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet 51:227–233
Hauser ER, Boehnke M (1997) Confirmation of linkage results in affected-sib-pair linkage analysis for complex genetic traits. Am J Hum Genet Suppl 61:A278
Horvath S, Laird NM (1998) A discordant-sibship test for disequilibrium and linkage: no need for parental data. Am J Hum Genet 63:1886–1897
Kehoe P, Wavrant-De Vrieze F, Crook R, Wu WS, Holmans P, Fenton I, Spurlock G, Norton N, Williams H, Williams N, Lovestone S, Perez-Tur J, Hutton M, Chartier-Harlin MC, Shears S, Roehl K, Booth J, Van Voorst W, Ramic D, Williams J, Goate A, Hardy J, Owen MJ (1999) A full genome scan for late onset Alzheimer’s disease. Hum Mol Genet 8:237–245
Kraft P, Thomas DC (2000) Bias and efficiency in family based gene-characterization studies: conditional, prospective, and joint likelihoods. Am J Hum Genet 66:1119–1131
Laird NM, Horvath S, Xu X (2000) Implementing a unified approach to family based tests of association. Genet Epidemiol Suppl 19:S36–S42
Lazzeroni LC, Lange K (1998) A conditional inference framework for extending the transmission/disequilibrium test. Hum Hered 48:67–81
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized estimating equations. Biometrika 73:13–22
Lunetta KL, Faraone SV, Beidermann J, Laird NM (2000) Family based tests of association that use unaffected sibs, covariates, and interactions. Am J Hum Genet 66:605–614
Martin ER, Gilbert JR, Lai EH, Riley J, Rogala AR, Slotterback BD, Sipe CA, Grubber JM, Warren LL, Conneally PM, Saunders AM, Schmechel DE, Purvis I, Pericak-Vance MA, Roses AD, Vance JM (2000a) Analysis of association at single nucleotide polymorphisms in the APOE region. Genomics 63:7–12