Pedigree data likelihood
Consider data on a pedigree of N individuals. The data consist of individual phenotypes: x1 is the phenotype of person 1, x2 the phenotype of person 2, …, xN the phenotype of person N. Thus, the phenotypic data may be thought of as a vector X = {x1, x2, x3, …, xN}. Each individual in the pedigree also has some genotype g: g1 is the genotype of person 1, g2 the genotype of person 2, …, gN the genotype of person N. Thus, the genotypic data may be thought of as a vector of genotypes G = {g1, g2, g3, …, gN}.
Following the law of total probability, the probability of X is
P(X) = ∑_G P(X|G) P(G)   (1)
where the sum is taken over all possible genotype configurations of the N people.
If we assume that the phenotype of an individual depends only on that individual's own genotype (and not on the genotypes of other people in the pedigree), then expression (1) may be written in more detail:
P(X) = ∑_G [ ∏_{i=1..N} P(xi|gi) ] P(G)   (1)’
Next, consider the probability P(G). It is a product of the genotype probabilities of all N subjects. For “founder” subjects (i.e. those whose parents are not in the pedigree) these probabilities are simply the population probabilities of the genotypes. For those who are offspring, the probability is defined conditional on the parents’ genotypes. If we say that people 1…K are founders and people K+1…N are not, then
P(X) = ∑_G [ ∏_{i=1..N} P(xi|gi) ] [ ∏_{j=1..K} P(gj) ] [ ∏_{l=K+1..N} P(gl|glm, glf) ]   (1)’’
where glm and glf are the genotypes of the mother and the father of person l.
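Thus, to calculate the probability of pedigree data, one needs to specify three probability distributions: the population distribution of genotypes P(g), the penetrance function P(x|g), and the transmission probabilities P(g|gm, gf).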
Example. Consider a Mendelian locus with two alleles, A and a, with the frequency of A being 0.99. The trait of interest is some dominant disease. People with genotypes Aa and aa always have the condition; people with genotype AA never develop it. Let us formalize this model of inheritance in terms of the three distributions described above.
Notation: genotype g may take the values AA, Aa or aa; phenotype x may take the values D (affected) or U (unaffected).
P(g): assuming HWE, the population distribution of genotypes is P(g=AA) = 0.99 × 0.99 = 0.9801, P(g=Aa) = 2 × 0.99 × 0.01 = 0.0198 and P(g=aa) = 0.01 × 0.01 = 0.0001.
P(x|g): the penetrance function is P(x=D|g=AA) = 0, P(x=D|g=Aa) = 1, P(x=D|g=aa) = 1. The probability of being unaffected given the genotype is P(U|g) = 1 – P(D|g).
P(g|gm, gf): transmission probabilities. For the locus considered, P(g=AA|gm=AA, gf=AA) = 1, P(g=Aa|gm=AA, gf=AA) = 0, P(g=aa|gm=AA, gf=AA) = 0, P(g=AA|gm=Aa, gf=AA) = ½, P(g=Aa|gm=Aa, gf=AA) = ½, P(g=aa|gm=Aa, gf=AA) = 0, … P(g=aa|gm=aa, gf=aa) = 1. These follow directly from Mendel's laws: each parent transmits either of its two alleles with probability ½.
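The three distributions of this example can be plugged into expression (1)’’ directly. The following is a minimal Python sketch, not an efficient algorithm: it sums over all 3^N genotype configurations, which is only feasible for very small pedigrees. The function names and the dictionary-based pedigree encoding are illustrative assumptions, not part of the original notes.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")
P_A = 0.99  # frequency of allele A, as in the example

def founder_prob(g, p=P_A):
    """Population genotype probability P(g) under HWE."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}[g]

def penetrance(x, g):
    """P(x|g) for the fully penetrant dominant disease of the example."""
    p_affected = 0.0 if g == "AA" else 1.0
    return p_affected if x == "D" else 1.0 - p_affected

def transmission(g, gm, gf):
    """P(g|gm, gf): each parent passes either of its alleles with probability 1/2."""
    prob = 0.0
    for am in gm:            # allele from the mother
        for af in gf:        # allele from the father
            # 'A' sorts before 'a', so "aA" is normalised to "Aa"
            if "".join(sorted(am + af)) == g:
                prob += 0.25
    return prob

def pedigree_likelihood(phenotypes, parents):
    """P(X) from expression (1)'' by brute-force summation over G.

    phenotypes: dict person -> "D" or "U"
    parents:    dict person -> (mother, father), for non-founders only
    """
    people = list(phenotypes)
    total = 0.0
    for combo in product(GENOTYPES, repeat=len(people)):
        G = dict(zip(people, combo))
        term = 1.0
        for person in people:
            term *= penetrance(phenotypes[person], G[person])
            if person in parents:
                mother, father = parents[person]
                term *= transmission(G[person], G[mother], G[father])
            else:
                term *= founder_prob(G[person])
        total += term
    return total

# Hypothetical trio: unaffected mother, affected father, affected child.
trio = pedigree_likelihood(
    {"mother": "U", "father": "D", "child": "D"},
    {"child": ("mother", "father")},
)
print(trio)
```

For this hypothetical trio the sum reduces to 0.9801 × (0.0198 × ½ + 0.0001 × 1), i.e. about 0.0098, because the unaffected mother must be AA and the affected father is either Aa or aa.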
Penetrance function extended
If we want to incorporate some covariate, C, into our model, this is quite easy to do. However, there are many different ways to do it. One of the most obvious is to rewrite the penetrance function in the form
P(x=D|g) = w_g e^(αC) / (1 + e^(αC))
where w_g is the maximal penetrance for genotype g (reached as C → +∞), C is the value of the covariate and α is the regression coefficient.
Figure: P(x=D|g) = w_g e^(αC) / (1 + e^(αC)), with w_AA = 0.1, w_Aa = 0.8, w_aa = 0.9 and α = 0.3.
It is also easy to incorporate more covariates and to make the regression coefficients depend on the genotype (thus introducing genotype × environment interaction).
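As a small illustration, the logistic penetrance function can be written in Python as below. This is a sketch using the parameter values quoted for the figure; the function and constant names are assumptions made for the example. It uses the algebraically equivalent form 1 / (1 + e^(−αC)).

```python
import math

# Maximal penetrances w_g and regression coefficient from the figure
MAX_PENETRANCE = {"AA": 0.1, "Aa": 0.8, "aa": 0.9}
ALPHA = 0.3

def penetrance(g, c, w=MAX_PENETRANCE, alpha=ALPHA):
    """P(x=D|g) at covariate value c: w_g times a logistic function of alpha * c."""
    return w[g] / (1.0 + math.exp(-alpha * c))

# At c = 0 the logistic factor is 1/2, so each genotype is at half its maximum.
for g in ("AA", "Aa", "aa"):
    print(g, penetrance(g, 0.0))
```

A genotype-specific coefficient (genotype × environment interaction) could be obtained in the same way by passing a dictionary of α values indexed by genotype instead of a single number.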
Genetic risks: general form
When assessing genetic risk, we are usually answering the question: what is the probability that some individual will have a disease, given a known model of inheritance and data on family history (generally, phenotypes and other information on relatives)? Thus, the probability we are interested in is
P(Affected|X) = ∑g P(Affected|g) P(g|X)
The first term in this expression is the penetrance function, and thus we know it from the model. The second term, P(g|X), is the probability that the individual has genotype g given the pedigree data. This may be calculated using Bayes’ theorem:
P(g|X) = P(X, g) / P(X)
P(X) is calculated using expression (1). P(X, g) is also obtained using expression (1), except that the genotype of the individual of interest is held fixed at g instead of being summed over.
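A minimal sketch of this calculation is given below, assuming the two-allele dominant model of the first example (frequency of A equal to 0.99, full penetrance) and a hypothetical family in which only the phenotypes of the two parents are known. P(X, g) is obtained by fixing the child's genotype and summing over the parental genotypes; P(X) is the sum of P(X, g) over g. All function names are illustrative.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

def prior(g, p=0.99):
    """Population genotype probability under HWE (p = frequency of A)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}[g]

def p_affected(g):
    """P(Affected|g) for the fully penetrant dominant model."""
    return 0.0 if g == "AA" else 1.0

def penetrance(x, g):
    return p_affected(g) if x == "D" else 1.0 - p_affected(g)

def transmission(g, gm, gf):
    """P(g|gm, gf) under Mendelian segregation."""
    return sum(0.25 for am in gm for af in gf
               if "".join(sorted(am + af)) == g)

def joint(child_g, x_mother, x_father):
    """P(X, g): parental phenotypes together with a fixed child genotype."""
    total = 0.0
    for gm, gf in product(GENOTYPES, repeat=2):
        total += (prior(gm) * penetrance(x_mother, gm)
                  * prior(gf) * penetrance(x_father, gf)
                  * transmission(child_g, gm, gf))
    return total

def risk(x_mother, x_father):
    """P(Affected|X) = sum over g of P(Affected|g) P(X, g) / P(X)."""
    joints = {g: joint(g, x_mother, x_father) for g in GENOTYPES}
    p_x = sum(joints.values())  # P(X)
    return sum(p_affected(g) * joints[g] / p_x for g in GENOTYPES)

print(risk("U", "D"))  # unaffected mother, affected father
```

With an affected father and an unaffected mother the risk comes out slightly above ½ (0.01/0.0199), because there is a small chance that the father is homozygous aa.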
X-linked recessive disease
Pedigrees A & B. What is the probability that e is a carrier? Person a is affected; thus, d is a carrier. The probability that the mutation is transmitted to e is ½. Next question: what is the risk for a boy born to e and an unaffected father to be affected? Clearly, this is the probability that e is a carrier times the probability that she transmits the mutation: ½ × ½ = ¼. Pedigree B differs from A only by the introduction of an extra generation. Thus, the risk for g to be a carrier is (½)^2 = ¼ and, consequently, the risk for a boy born to g and an unaffected father is 1/8. Generally, if N generations have passed between the obligate carrier and the person of interest (and information on the disease phenotypes of males in previous generations is not available), then the probability of being a carrier is (½)^N.
Pedigree C. Now we run into a real counselling situation. The probability of e being a carrier (and, consequently, the risk for the next child) is calculated in the following manner:
Hypothesis | e is carrier | e is not carrier
Prior, P(g) | ½ | 1 – ½ = ½
Conditional, P(X|g) | ½ × ½ = ¼ | 1
Joint, P(X|g) P(g) | 1/8 | ½
Posterior | 1/8 / (1/8 + ½) = 1/5 | ½ / (1/8 + ½) = 4/5
Risk for next boy | 1/5 × ½ = 1/10 |
More generally, if the number of healthy sons is not two but some M, then the posterior probability of being a carrier is 1/(1 + 2^M). If we combine this result with the result obtained for pedigrees A and B, then the posterior probability is 1/(2^(M+N) − 2^M + 1). However, this formula is correct only if there are no informative males in the upper part of the pedigree. Generally, this is not the case.
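The general formula can be checked against the tabular Bayes calculation. The sketch below is an illustration with hypothetical function names: it builds the prior, conditional and joint probabilities as exact fractions for N generations below the obligate carrier and M healthy sons, and compares the posterior with the closed form.

```python
from fractions import Fraction

def carrier_posterior(n_generations, m_healthy_sons):
    """Posterior carrier probability from the Bayes table, as an exact fraction."""
    half = Fraction(1, 2)
    prior_carrier = half ** n_generations            # (1/2)^N
    joint_carrier = prior_carrier * half ** m_healthy_sons
    joint_noncarrier = (1 - prior_carrier) * 1       # healthy sons are certain
    return joint_carrier / (joint_carrier + joint_noncarrier)

def closed_form(n_generations, m_healthy_sons):
    """1 / (2^(M+N) - 2^M + 1)."""
    n, m = n_generations, m_healthy_sons
    return Fraction(1, 2 ** (m + n) - 2 ** m + 1)

print(carrier_posterior(1, 2))  # pedigree C: N = 1, M = 2
```

For pedigree C (N = 1, M = 2) both routes give 1/5, and with M = 0 the result falls back to the prior (½)^N used for pedigrees A and B.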
Pedigree D. Here, the situation is more general. The simplest way to deal with this pedigree is to first calculate the probability of e being a carrier. Half of this probability is then the prior probability of being a carrier for the woman of interest, j. The probability of e being a carrier is easily calculated using the above formula:
P(e is carrier | data) = 1/(2^(M+N) − 2^M + 1) = 1/(2^(3+1) − 2^3 + 1) = 1/9
Then,
Hypothesis | j is carrier | j is not carrier
Prior, P(g) | ½ × 1/9 = 1/18 | 1 – 1/18 = 17/18
Conditional, P(X|g) | ½ | 1
Joint, P(X|g) P(g) | 1/36 | 17/18
Posterior | 1/36 / (1/36 + 34/36) = 1/35 | 34/35
Risk for next boy | 1/35 × ½ = 1/70 |
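The two-step calculation for pedigree D can be reproduced with a small generic Bayes-table helper. This is a sketch; the helper name is hypothetical, and the conditional probabilities are read off the numbers used in the text (M = 3 and N = 1 for e, and a conditional of ½ for j), since they depend on the pedigree drawing.

```python
from fractions import Fraction

def posterior(priors, conditionals):
    """Normalise joint = prior x conditional over the competing hypotheses."""
    joints = [p * c for p, c in zip(priors, conditionals)]
    total = sum(joints)
    return [j / total for j in joints]

half = Fraction(1, 2)

# Step 1: e is carrier / e is not carrier, with N = 1 and M = 3.
p_e_carrier, _ = posterior([half, half], [half ** 3, Fraction(1)])

# Step 2: j inherits the mutation from e with probability 1/2.
prior_j = half * p_e_carrier
p_j_carrier, _ = posterior([prior_j, 1 - prior_j], [half, Fraction(1)])

risk_next_boy = p_j_carrier * half
print(p_e_carrier, p_j_carrier, risk_next_boy)
```

Working with exact fractions avoids rounding at the intermediate step, so the values 1/9, 1/35 and 1/70 of the table are recovered exactly.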
Dominant disease
Consider a dominant disease with the frequency of the mutant allele being q = 0.01. In this case, we can safely assume that any affected person observed in the population is a heterozygous carrier (the odds of heterozygous vs. homozygous carrier are ~ 0.02/0.0001 = 200). Using this assumption, what are the risks for the next child in pedigrees A-D? Clearly, the genotype of the father is DN and the genotype of the mother is NN. Given these genotypes, the risk for the next child is ½.
If the disease allele is more frequent (say, q = 0.1), we cannot assume that affected people are heterozygous. In pedigree A, we need to estimate the probability of the father being DD (in which case the risk to the child is 1) or DN (the risk to the child is 0.5). Using the Bayesian approach:
P(father is DD | father is affected) =
P(affected | DD) P(DD) / [P(affected | DD) P(DD) + P(affected | DN) P(DN) + P(affected | NN) P(NN)] =
1 × q^2 / [1 × q^2 + 1 × 2q(1−q) + 0 × (1−q)^2] =
0.01 / 0.19 = 0.053
P(father is DN | father is affected) =
P(affected | DN) P(DN) / [P(affected | DD) P(DD) + P(affected | DN) P(DN) + P(affected | NN) P(NN)] =
1 × 2q(1−q) / [1 × q^2 + 1 × 2q(1−q) + 0 × (1−q)^2] =
0.18 / 0.19 = 0.947
Thus, the risk for the child to be affected is
1 × 0.053 + ½ × 0.947 = 0.53
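The same calculation can be sketched for an arbitrary allele frequency q. The code below assumes HWE, full penetrance and an NN mother, as in the text; the function names are illustrative. Simplifying the fractions gives the closed forms P(DD | affected) = q/(2−q) and a child risk of 1/(2−q), which the code uses as a check.

```python
def affected_father_posterior(q):
    """P(genotype | affected) for a fully penetrant dominant disease under HWE."""
    joint_dd = 1.0 * q * q              # P(affected|DD) P(DD)
    joint_dn = 1.0 * 2 * q * (1 - q)    # P(affected|DN) P(DN)
    total = joint_dd + joint_dn         # the NN term is zero
    return {"DD": joint_dd / total, "DN": joint_dn / total}

def child_risk(q):
    """Risk to a child of an affected father and an NN mother."""
    post = affected_father_posterior(q)
    return 1.0 * post["DD"] + 0.5 * post["DN"]

print(affected_father_posterior(0.1), child_risk(0.1))
```

For q = 0.1 this reproduces 0.053, 0.947 and a risk of about 0.53; for q = 0.01 the risk is about 0.5025, which supports the earlier simplification of treating every affected person as heterozygous.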
Now consider pedigree B. Here, information on one previous child, who was affected, is available. How does this information change the posterior probability of the father being DD or DN?
Hypothesis | Father is DD | Father is DN
Prior, P(g) | 0.053 | 0.947
Conditional, P(X|g) | 1 | ½
Joint, P(X|g) P(g) | 0.053 | 0.4735
Posterior | 0.053 / (0.053 + 0.4735) = 0.1 | 0.9
Risk for next child | 0.1 × 1 + 0.9 × ½ = 0.55 |
Thus, the chance that the second offspring will get the disease is increased (from 0.53 to 0.55).
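The update can be sketched for any number of previously affected children. The code below assumes, as in the text, HWE priors for an affected father, full penetrance and an NN mother; the function names are hypothetical. Exact fractions are used so that no rounding enters the table.

```python
from fractions import Fraction

def father_posterior(q, n_affected_children):
    """P(father's genotype | father affected, n affected children)."""
    half = Fraction(1, 2)
    # An affected child has probability 1 if the father is DD and 1/2 if DN.
    joint_dd = q * q * 1
    joint_dn = 2 * q * (1 - q) * half ** n_affected_children
    total = joint_dd + joint_dn
    return {"DD": joint_dd / total, "DN": joint_dn / total}

def next_child_risk(q, n_affected_children):
    post = father_posterior(q, n_affected_children)
    return post["DD"] * 1 + post["DN"] * Fraction(1, 2)

q = Fraction(1, 10)
print(father_posterior(q, 1), next_child_risk(q, 1))
```

With q = 1/10 and one affected child, the exact posterior for DD is 1/10 and the risk is 11/20 = 0.55, matching the rounded values in the table; each further affected child shifts more weight towards DD.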
Incorporating marker data – simple example
Recessive disease
Consider a recessive disease with the frequency of the mutant allele D being q = 0.025. In pedigree A, what is the probability that the child in question will get the disease? The information on a, b and c provides the key, because it determines the genotypes of these individuals.
P(d is DN | d is unaffected) =
P(unaffected | DN) P(DN) / [P(unaffected | DD) P(DD) + P(unaffected | DN) P(DN) + P(unaffected | NN) P(NN)] =
1 × ½ / [0 × ¼ + 1 × ½ + 1 × ¼] = 2/3
Here, the priors for genotypes are the
transmission probabilities given the genotypes of the parents.
What are the chances that e is a heterozygous carrier?
P(e is DN | e is unaffected) =
P(unaffected | DN) P(DN) / [P(unaffected | DD) P(DD) + P(unaffected | DN) P(DN) + P(unaffected | NN) P(NN)] =
1 × 2q(1−q) / [0 × q^2 + 1 × 2q(1−q) + 1 × (1−q)^2] =
0.049 / (0.049 + 0.95) = 0.049
Then, the risk for the child is ¼ × 2/3 × 0.049 = 0.008
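The three factors of this risk can be computed in a few lines. The sketch below assumes full penetrance of the recessive disease, transmission priors (¼, ½, ¼) for d and HWE priors for e, as in the formulas above; the function name is hypothetical.

```python
def carrier_given_unaffected(p_dd, p_dn, p_nn):
    """P(DN | unaffected) for a fully penetrant recessive disease.

    The DD term drops out of the denominator because P(unaffected|DD) = 0.
    """
    return (1.0 * p_dn) / (0.0 * p_dd + 1.0 * p_dn + 1.0 * p_nn)

q = 0.025

# d: priors are the transmission probabilities from two DN parents.
p_d_carrier = carrier_given_unaffected(0.25, 0.5, 0.25)

# e: priors are the HWE population frequencies.
p_e_carrier = carrier_given_unaffected(q * q, 2 * q * (1 - q), (1 - q) ** 2)

# The child is DD with probability 1/4 if both parents are carriers.
risk = 0.25 * p_d_carrier * p_e_carrier
print(p_d_carrier, p_e_carrier, risk)
```

The HWE case simplifies to P(e is DN | unaffected) = 2q/(1+q), so for small q the carrier probability of an unaffected person from the population is close to 2q.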
Consider now pedigree B. For individual h the posterior probability of being a carrier is 2/3, as in the previous example. Here we assume that the frequency of the D allele in the population is very low. This means that the D allele must have come to both d and i (if i is a carrier) from one of the two ancestors a or b. The probability that D is not lost during transmission over two generations is (½)^2 = ¼. Thus, the risk for the child is ¼ × ¼ × 2/3 = 1/24.
More complex case
X-linked recessive:
Dominant disease: evaluate pedigrees C and D,
first analytically (with frequency of D being q), then do calculations for q=0.1.
Recessive disease: evaluate pedigree C, assume that frequency of D is very low.