PAWE Version 2.1
Help File, May 2014
Written by Derek Gordon
1. Overview and Purpose
a. The name of this program is PAWE, which stands for Power of Association With
Errors. Over the years, numerous publications have documented the presence of
genotype misclassification errors and methods to detect and address these errors
in statistical analyses. See Pompanon et al. (2005), Gordon and Finch (2005), and
Miller et al. (2008) for reviews.
b. Genotype misclassification errors are of two types: (i) differential, in
which error rates differ between cases and controls; and (ii) non-differential,
in which error rates are the same in both groups. Differential misclassification
increases the false positive rate, that is, the rate of rejection when the null
hypothesis is true (an effect similar to that of population stratification).
Non-differential misclassification decreases the statistical power to detect
association. The PAWE webtool considers only non-differential genotype
misclassification.
c. The purposes of the PAWE program are two-fold: (i) to perform power and
sample size calculations for genetic case-control association studies in the
presence of genotype misclassification errors; and (ii) to determine
quantitatively how much errors cost the researcher performing genetic
association studies with cases and controls, in terms of either the decrease in
asymptotic power for a fixed sample size or the increase in sample size needed
to maintain constant asymptotic power. Thus, results from the PAWE program will
be either asymptotic power or sample size values.
d. This file explains the values entered into the PAWE program, so that a
meaningful answer is obtained. The program performs asymptotic power and sample
size calculations for genetic case-control studies with a di-allelic locus (for
example, a SNP) in the presence of misclassification. The test statistics
considered are the linear trend test and the chi-square statistic for genotypic
association. In what follows, it will be assumed that there is a di-allelic
trait locus for a discrete trait, with a wild-type or low-risk allele, denoted
by +, and a trait or high-risk allele, denoted by d. Also, it will be assumed
that there is a marker locus with two alleles, denoted by 1 and 2.
2. Parameter Settings
a. Asymptotic Power or Sample Size
i. The researcher who has a fixed sample size and wants to know the value of
asymptotic power for his/her study in the presence of errors should choose the
option "Power for a fixed sample size".
1. Number of cases
In this box, you specify the number of case individuals you have.
These individuals are assumed to be both phenotyped and genotyped. The number
typed in this box must be a positive integer.
2. Number of controls
In this box, you specify the number of control individuals you have. These
individuals are assumed to be both phenotyped and genotyped. The number typed
in this box must be a positive integer.
ii. The researcher who is planning a study, and who wants to know what sample
size is necessary, in the presence of errors, to achieve a given asymptotic
power level should choose "Sample size for fixed power".
1. Asymptotic Power level
In this box, you specify the asymptotic power you would like for your study.
This number must be greater than 0 and less than or equal to 1. It is usually a
number close to 1.
2. Ratio of Controls to Cases
In this box, you specify the ratio of the number of controls to
the number of cases that you expect to have for your study. The number entered
here must be a positive real number. The most common number used is 1 (equal
number of cases and controls).
b. Significance level
Here, you specify the significance level of the test. This value
is the probability of falsely rejecting a true null hypothesis. This number
must be positive and less than 1. Typically, it is chosen to be less than or
equal to 0.05.
c. Genotype Frequency Generation
i. Genetic model free method
You choose this option if you do not know or do not want to specify the genetic
model parameters such as disease prevalence, disease allele frequency, genotype
relative risks, or proportion of linkage disequilibrium for your study. When
choosing the genetic model free method, you specify the frequency distribution
of genotypes 11, 12, and 22 for cases and controls. You may choose whether or
not to specify Hardy Weinberg Equilibrium (HWE) for the case and control
populations.
1. Hardy Weinberg Equilibrium specified
If you specify HWE, then the genotype frequency distribution is a function of
two parameters: $p$ (1 allele frequency in cases) and $q$ (1 allele frequency in
controls). These parameters must be real, positive numbers less than 1. The
genotype frequencies of 11, 12, and 22 in cases (respectively, controls) are
$p^2$, $2p(1-p)$, and $(1-p)^2$ (respectively, $q^2$, $2q(1-q)$, and $(1-q)^2$).
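As a minimal sketch of the HWE calculation just described, the function below maps a 1 allele frequency to the three genotype frequencies. The function name and the example frequencies (0.3 in cases, 0.2 in controls) are hypothetical and are not PAWE defaults.

```python
def hwe_genotype_freqs(allele_1_freq):
    """Return (freq_11, freq_12, freq_22) under Hardy Weinberg Equilibrium."""
    if not 0.0 < allele_1_freq < 1.0:
        raise ValueError("allele frequency must be strictly between 0 and 1")
    allele_2_freq = 1.0 - allele_1_freq
    return (
        allele_1_freq ** 2,                   # 11 homozygote
        2.0 * allele_1_freq * allele_2_freq,  # 12 heterozygote
        allele_2_freq ** 2,                   # 22 homozygote
    )


# Hypothetical inputs: 1 allele frequency of 0.3 in cases and 0.2 in controls.
case_freqs = hwe_genotype_freqs(0.3)
control_freqs = hwe_genotype_freqs(0.2)
```

The three returned frequencies always sum to 1, so only one parameter per group is needed when HWE is specified.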
2. No Specification of Hardy Weinberg Equilibrium
If you do not specify HWE, then the genotype frequencies are a function of four
parameters: $p_{11}$ (11 genotype frequency in cases); $p_{22}$ (22 genotype
frequency in cases); $q_{11}$ (11 genotype frequency in controls); $q_{22}$ (22
genotype frequency in controls). Note that the parameters are all real, positive
numbers subject to the constraints:
$p_{11} + p_{22} \le 1$; $q_{11} + q_{22} \le 1$.
Note also that the heterozygote genotype frequencies in cases and controls are
given by the formulas:
$p_{12} = 1 - p_{11} - p_{22}$; $q_{12} = 1 - q_{11} - q_{22}$.
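When HWE is not specified, the heterozygote frequency is whatever remains after the two homozygote frequencies are subtracted from 1. A short sketch of that bookkeeping follows; the function name and the example values are hypothetical.

```python
def genotype_freqs_without_hwe(freq_11, freq_22):
    """Return (freq_11, freq_12, freq_22) from the two homozygote frequencies."""
    if freq_11 <= 0.0 or freq_22 <= 0.0:
        raise ValueError("homozygote frequencies must be positive")
    if freq_11 + freq_22 > 1.0:
        raise ValueError("homozygote frequencies must sum to at most 1")
    # The heterozygote frequency is the remainder.
    return (freq_11, 1.0 - freq_11 - freq_22, freq_22)


# Hypothetical case frequencies: 11 = 0.5, 22 = 0.2.
case_freqs = genotype_freqs_without_hwe(0.5, 0.2)
```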
ii. Genetic model based method
You choose this option if you have estimates for the following six parameters:
the disease prevalence; the genotype relative risks $R_1$ and $R_2$ (Schaid and
Sommer, 1993); the disease allele frequency; the marker 1-allele frequency; and
a measure of linkage disequilibrium between alleles d and 1. With the exception
of the genotype relative risks, all parameters must be positive real numbers
that are less than one. The genotype relative risks are defined as:
$R_1 = f_1 / f_0$, $R_2 = f_2 / f_0$,
where $f_i = \Pr(\text{affected} \mid i \text{ copies of the d allele})$, $i = 0, 1, 2$.
The values $f_0$, $f_1$, and $f_2$ are commonly referred to as the penetrance
values.
Some common disease modes of inheritance are determined by constraints on the
genotype relative risks. In PAWE, you enter one genotype relative risk and the
mode of inheritance, and the program computes the other. Note that, if no
constraints are placed on $R_1$ and $R_2$ (we use the term “unconstrained”),
then they can be any numbers greater than one. Finally, we mention that a
linkage disequilibrium value of 1 means complete disequilibrium, the scenario
with the highest power, and a value of 0 means no disequilibrium, the null
scenario (no power).
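The sketch below illustrates how a mode of inheritance can tie the two genotype relative risks together, and how penetrances follow from the prevalence. It uses the conventional textbook constraints (dominant, recessive, multiplicative, additive), which may differ from the examples in the original help file. The function names, the choice of the two-copy relative risk as the input, and the assumption of HWE at the disease locus are all assumptions made for illustration.

```python
import math


def r1_from_r2(r2, mode):
    """One-copy relative risk implied by the two-copy relative risk r2."""
    if r2 <= 0.0:
        raise ValueError("relative risk must be positive")
    if mode == "dominant":        # one copy carries the full risk
        return r2
    if mode == "recessive":       # one copy carries no extra risk
        return 1.0
    if mode == "multiplicative":  # r2 = r1 ** 2
        return math.sqrt(r2)
    if mode == "additive":        # r2 = 2 * r1 - 1
        return (1.0 + r2) / 2.0
    raise ValueError("unknown mode of inheritance: " + mode)


def penetrances(prevalence, r1, r2, disease_allele_freq):
    """Return (f0, f1, f2), assuming HWE at the disease locus (an assumption)."""
    p = disease_allele_freq
    q = 1.0 - p
    # prevalence = f0 * [q^2 + 2pq * r1 + p^2 * r2]
    f0 = prevalence / (q * q + 2.0 * p * q * r1 + p * p * r2)
    f1, f2 = r1 * f0, r2 * f0
    if max(f0, f1, f2) > 1.0:
        raise ValueError("parameters imply a penetrance above 1")
    return (f0, f1, f2)


# Hypothetical inputs: prevalence 0.05, r2 = 4, disease allele frequency 0.1.
r1 = r1_from_r2(4.0, "multiplicative")
f0, f1, f2 = penetrances(0.05, r1, 4.0, 0.1)
```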
d. Linear Trend Test Weights
Here, you specify the weights of the trend test statistic,
corresponding to the genotypes 11, 12, and 22. The weights reflect an ordering
of the genotypes. For example, in our work, the risk allele at a SNP locus is
the 1 allele. If we have a dominant mode of inheritance, then we expect that
the 22 genotype does not increase risk, while the 11 and 12 genotypes do. We
may choose the weights (1, 1, 0) for the 11, 12, and 22 genotypes respectively.
Similarly, suppose that the trait acted in a recessive manner. Then we would
choose the weights (1, 0, 0) for the 11, 12, and 22 genotypes respectively. The
most commonly used weights for unknown modes of inheritance are (2, 1, 0). See
Slager and Schaid (2001) for more information.
Very important note: If you choose weights that do not reflect the true risk
genotypes, your sample sizes for a fixed power will increase dramatically.
Similarly, your power for a fixed sample size will drop dramatically. For
example, in the dominant example above, if you choose the weights (0, 1, 1)
instead of (1, 1, 0), you will either lose power or need a larger sample size
(two undesirable effects). When in doubt, we recommend that you use the
(2, 1, 0) weights.
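To show where the weights enter, the sketch below computes the standard Cochran-Armitage linear trend chi-square from observed genotype counts. It illustrates the role of the weights only; it is not PAWE's asymptotic power calculation. The function name and the example counts are hypothetical.

```python
def trend_test_statistic(case_counts, control_counts, weights=(2, 1, 0)):
    """Cochran-Armitage trend chi-square (1 df) for genotypes 11, 12, 22."""
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    n_total = n_cases + n_controls
    genotype_totals = [r + s for r, s in zip(case_counts, control_counts)]

    sum_xr = sum(x * r for x, r in zip(weights, case_counts))
    sum_xn = sum(x * n for x, n in zip(weights, genotype_totals))
    sum_x2n = sum(x * x * n for x, n in zip(weights, genotype_totals))

    numerator = n_total * (n_total * sum_xr - n_cases * sum_xn) ** 2
    denominator = n_cases * n_controls * (n_total * sum_x2n - sum_xn ** 2)
    return numerator / denominator


# Hypothetical counts for genotypes (11, 12, 22).
cases = (30, 50, 20)
controls = (20, 50, 30)
statistic = trend_test_statistic(cases, controls, weights=(2, 1, 0))
```

The statistic is unchanged by a linear rescaling of the weights, so (2, 1, 0) and (0, 1, 2) give the same value. Weights that are not a linear rescaling of each other, such as (1, 1, 0) and (0, 1, 1), encode different genotype orderings and in general give different values.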
e. Genotype Error Model
For this option, you have the choice of selecting one of three error models
that you think best explains your data. The choices are:
i. Gordon Heath Liu Ott (GHLO) error model (Gordon et al., 2001)
The parameter settings for this
error model are:
;
.
Both entries must be positive real numbers less than 1.0. Note
that in this error model, loci that are in HWE before errors are still in HWE
after introduction of errors.
ii. Douglas Skol Boehnke error model (Douglas et al., 2002)
The parameter settings for this error model are:
;
.
Both entries must be positive real numbers less than 1.0. In this error model,
it is not possible for one homozygote (e.g., 11) to be misclassified as the
other (e.g., 22). Also, when a 12 genotype (heterozygote) is misclassified, we
specify that it has an equal probability (0.5) of being incorrectly coded as 11
or 22 (homozygotes). See Gordon et al. (2001) for more details.
iii. Mote and Anderson error model (Mote and Anderson, 1965)
The parameter settings for this model are the six misclassification
probabilities:
Pr(12 genotype observed | 11 true);
Pr(22 genotype observed | 11 true);
Pr(11 genotype observed | 12 true);
Pr(22 genotype observed | 12 true);
Pr(11 genotype observed | 22 true);
Pr(12 genotype observed | 22 true).
The MA model is the most general error model possible in that it can describe
all other error models. As noted above, the MA model is described by 6
parameters. The GHLO and MA error models allow for errors in which one
homozygote is miscoded as the other homozygote. The following constraints are
needed for the MA error model: each parameter must lie between 0 and 1, and the
two misclassification probabilities for a given true genotype must not sum to
more than 1.
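As a sketch of how the six MA parameters fill a 3x3 misclassification matrix, the function below places each parameter in its off-diagonal cell and sets each diagonal entry so that the row sums to 1. The function name, the argument naming scheme (pXX_YY for Pr(XX observed | YY true)), and the example values are assumptions for illustration.

```python
def ma_error_matrix(p12_11, p22_11, p11_12, p22_12, p11_22, p12_22):
    """Rows are the true genotype (11, 12, 22); columns are the observed one."""
    params = (p12_11, p22_11, p11_12, p22_12, p11_22, p12_22)
    if any(not 0.0 <= p <= 1.0 for p in params):
        raise ValueError("each error probability must lie between 0 and 1")
    matrix = [
        [1.0 - p12_11 - p22_11, p12_11, p22_11],  # true 11
        [p11_12, 1.0 - p11_12 - p22_12, p22_12],  # true 12
        [p11_22, p12_22, 1.0 - p11_22 - p12_22],  # true 22
    ]
    # The diagonal holds the probability of a correct call; it cannot be negative.
    if any(matrix[i][i] < 0.0 for i in range(3)):
        raise ValueError("error probabilities for one true genotype exceed 1")
    return matrix


# Hypothetical error rates with no homozygote-to-homozygote miscoding.
matrix = ma_error_matrix(0.02, 0.0, 0.01, 0.01, 0.0, 0.02)
```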
3. Program Output
The PAWE program reports most of
the input parameters that the user enters, as well as the following items:
a. Non-centrality parameter
i. For a given test of association (linear trend or genotypic), this parameter
completely determines either the asymptotic power or sample size calculations.
To see how the asymptotic power calculations are performed, please see Ahn et
al. (2007) and Gordon et al. (2002). The non-centrality parameters for both
errorless data and data with errors are presented.
ii. Asymptotic power for fixed sample size: power loss
Based on the value of the non-centrality parameter for either the
linear trend or genotypic test of association, the asymptotic power of the test
is reported for both errorless data and data assuming the particular error
model. Also reported is the percent loss in power due to errors in the data.
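For the linear trend test, which has 1 degree of freedom, the asymptotic power implied by a non-centrality parameter can be written with the standard normal distribution alone. The sketch below shows that relationship; it does not cover the genotypic test, which has 2 degrees of freedom. The function name and the example value are hypothetical, and this is not PAWE's own code.

```python
from statistics import NormalDist


def trend_test_power(ncp, alpha=0.05):
    """Asymptotic power of a 1-df chi-square test with non-centrality ncp."""
    if ncp < 0.0:
        raise ValueError("non-centrality parameter must be non-negative")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = ncp ** 0.5
    # A non-central chi-square with 1 df is the square of a N(sqrt(ncp), 1)
    # variable, so the rejection probability is the sum of two normal tails.
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical non-centrality parameter at the 0.05 significance level.
power = trend_test_power(7.849, alpha=0.05)
```

With a non-centrality parameter of 0 the function returns the significance level, as expected under the null hypothesis.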
iii. Sample size increase for fixed asymptotic power
Based on the value of the non-centrality parameter for either the linear trend
or genotypic test of association, the minimum sample size of cases and controls
is reported for both errorless data and data given the particular error model.
Also reported is the percent increase in sample size needed to maintain
constant power when errors are present.
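The sketch below shows one way the sample size increase can be related to the two non-centrality parameters. It assumes that the asymptotic non-centrality parameter grows linearly with sample size when the ratio of controls to cases is held fixed; that assumption, the function names, and the example numbers are not taken from this help file.

```python
import math


def percent_sample_size_increase(ncp_errorless, ncp_error):
    """Percent extra sample size needed to recover the errorless ncp.

    Both values must be computed for the same sample size.
    """
    if ncp_errorless <= 0.0 or ncp_error <= 0.0:
        raise ValueError("non-centrality parameters must be positive")
    return 100.0 * (ncp_errorless / ncp_error - 1.0)


def cases_needed(n_cases, ncp_at_n_cases, ncp_required):
    """Scale the number of cases so the ncp reaches the required value."""
    if ncp_at_n_cases <= 0.0:
        raise ValueError("non-centrality parameter must be positive")
    return math.ceil(n_cases * ncp_required / ncp_at_n_cases)


# Hypothetical values: errors reduce the ncp from 10.0 to 8.0.
increase = percent_sample_size_increase(10.0, 8.0)
# Hypothetical values: 500 cases give an ncp of 6.0, and 7.85 is required.
required_cases = cases_needed(500, 6.0, 7.85)
```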
b. Genotype frequencies for errorless data
Based on the parameters entered for genotype frequency generation (Section
2.c), the genotype frequencies in cases and controls are computed for errorless
data.
c. Matrix of Penetrances
The entries of this matrix are the conditional probabilities Pr(observed
genotype i | true genotype j), where i and j each range over the genotypes 11,
12, and 22. These conditional probabilities, also called penetrances, are used
in calculating the genotype frequencies in the presence of errors (see Section
3.d).
d. Genotype frequencies for error data
Using the genotype frequencies in cases and controls for errorless
data (Section 3.b), and the matrix of penetrances (Section 3.c), genotype frequencies
in cases and controls are computed for error data. These values are used to
compute the non-centrality parameters for error data (Section 3.a). For more
details on how this computation is performed, please see Gordon et al. (2002)
and Ahn et al. (2007).
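The combination of errorless genotype frequencies with the matrix of penetrances can be sketched as a weighted sum over true genotypes. Because PAWE considers non-differential errors, the same matrix would be applied to cases and to controls. The function name, the row and column convention, and the example values are assumptions for illustration.

```python
def observed_genotype_freqs(true_freqs, error_matrix):
    """Genotype frequencies after misclassification.

    error_matrix[i][j] = Pr(observed genotype j | true genotype i),
    with genotypes ordered 11, 12, 22.
    """
    return tuple(
        sum(true_freqs[i] * error_matrix[i][j] for i in range(3))
        for j in range(3)
    )


# Hypothetical errorless frequencies and error matrix.
true_freqs = (0.09, 0.42, 0.49)
error_matrix = [
    [0.98, 0.02, 0.00],  # true 11
    [0.01, 0.98, 0.01],  # true 12
    [0.00, 0.02, 0.98],  # true 22
]
error_freqs = observed_genotype_freqs(true_freqs, error_matrix)
```

Since each row of the matrix sums to 1, the frequencies for error data also sum to 1.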
Please cite the references below in bold when using results from the PAWE webtool.
References
Ahn, K., Haynes, C., Kim, W., Fleur, R.S., Gordon, D., and Finch, S.J.
(2007). The effects of SNP genotyping errors on the power of the
Cochran-Armitage linear trend test for case/control association studies. Ann
Hum Genet 71, 249-261.
Douglas, J.A., Skol, A.D., and Boehnke, M. (2002). Probability of
detection of genotyping errors and mutations as inheritance inconsistencies in
nuclear-family data. Am J Hum Genet 70, 487-495.
Gordon, D., and Finch, S.J. (2005). Factors affecting statistical power in
the detection of genetic association. J Clin Invest 115, 1408-1418.
Gordon, D., Finch, S.J., Nothnagel, M., and Ott, J. (2002). Power and sample size
calculations for case-control genetic association tests when errors are
present: application to single nucleotide polymorphisms. Hum Hered 54, 22-33.
Gordon, D., Heath, S.C., Liu, X., and Ott, J. (2001). A
transmission/disequilibrium test that allows for genotyping errors in the
analysis of single-nucleotide polymorphism data. Am J Hum Genet 69, 371-380.
Miller, M.B., Schwander, K., and Rao, D.C. (2008). Genotyping errors and
their impact on genetic analysis. Adv Genet 60, 141-152.
Mote, V.L., and Anderson, R.L. (1965). An Investigation of the Effect of
Misclassification on the Properties of Chi-2-Tests in the Analysis of
Categorical Data. Biometrika 52, 95-109.
Pompanon, F., Bonin, A., Bellemain, E., and Taberlet, P. (2005).
Genotyping errors: causes, consequences and solutions. Nat Rev Genet 6, 847-859.
Schaid, D.J., and Sommer, S.S. (1993). Genotype relative risks: methods
for design and analysis of candidate-gene association studies. Am J Hum Genet 53, 1114-1126.
Slager, S.L., and Schaid, D.J. (2001). Case-control studies of genetic markers:
power and sample size approximations for Armitage's test for trend. Hum Hered 52, 149-153.