Gianluca Baio
University College London, UK
The aim of this chapter is not to provide a comprehensive and detailed introduction to all of statistical theory. Rather, the idea is to go through some of the basic and subtle concepts underpinning statistical modelling, with a view to highlighting the fundamental distinctions that exist among different approaches, as well as the need for HTA modellers to have sound foundations in statistics in order to do their job, in its modern incarnation, well.
The structure of the chapter guides you through the very basics of the philosophy underlying the ideas of sampling and data collection in Section 4.2. Section 4.3 describes several statistical models that are commonly used in the applications we typically encounter, specifically in HTA applications. These include Bernoulli and Binomial models to describe sampling variability in individual or aggregated binary data, Poisson models for counts, Normal distributions for continuous, symmetric phenomena and more specific distributions, which are the basis for many of the procedures described in the rest of this book. Of course, the presentation is far from exhaustive and there are many more models you may be exposed to, in specific applications.
Section 4.4 and Section 4.5 present the central tools of statistical inference — the methods of point and interval estimation. These are presented while highlighting the fundamental distinctions among different approaches, i.e. Bayesian (Section 4.4.1), Likelihood (Section 4.4.2) and Frequentist (Section 4.4.3), which are often confused or conflated (especially the last two) into an integrated theory that essentially does not exist.
Again, the mathematical sophistication is kept to a low level — you do not need to read this chapter to learn all about the technical issues. The point is rather to try and help you understand the basic principles and why things work the way they do, over and above how. Throughout the chapter, there are some parts in which it is unavoidable to use mathematics to make the point. In our view, this is also necessary because HTA modellers are increasingly faced with sophisticated statistical modelling, which acts as the fundamental building block of the economic evaluation procedure.
Broadly speaking, the objective of statistical analysis is to produce “some summary” of the available data. Invariably we rely on information obtained from a sample of units, drawn from the population of interest, for which a measurement is indeed available. We can use these units to make inference about the (theoretical) underlying population.
Figure 4.1 shows a schematic representation of the process moving from a theoretical population (made by \(N=10\) units and shown in Figure 4.1 (a)) to some potential samples of size \(n=5\) (Figure 4.1 (b)). Typically, we indicate the population parameters (e.g., the “true” mean and standard deviation, that we could compute if we could access the whole population) using Greek letters, e.g., \(\mu\) or \(\sigma\). Ideally, we would like to learn about these quantities — but as mentioned above, we really cannot observe the whole population (which may be a purely idealistic and theoretical concept and so might not even exist as such!). Thus, we rely on the sample statistics, which we indicate using Roman letters, e.g., \(\bar{x}\) for the sample mean and \(s_x\) for the sample standard deviation.
In addition, in real life (i.e. when we do Statistics), we cannot access all possible samples that could have been drawn from a given population. For example, in the trivial case above, the true population is not really accessible to us (and thus it is greyed out in Figure 4.2 (a)). We only have one such sample (and, again, all but the one in the bottom-left part of Figure 4.2 (b) are also greyed out).
Incidentally, there are in fact 252 different ways of picking at random 5 units out of the population made by 10 — but again, all but one have not been drawn. We want to use the information in the actually observed sample (e.g. the sample mean and standard deviation) to infer about the underlying population parameters. That is what Statistics is all about.
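We can check this count directly in R, using the built-in function choose, which computes the number of distinct ways of picking 5 units out of 10:
choose(10,5)
[1] 252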
In general, you can think of this process as extracting numbered balls from an urn — a bit like they do when they call numbers in games such as Tombola or Bingo. If you extract the ball with number “14” on it, then you are selecting into your sample the 14-th unit from a complete list of population members.
In reality, we are likely to use a computer to simulate “pseudo-random” numbers. For example, we can use the freely available software R (see Chapter 2 for more details on R and its basic commands) to sample 4 numbers from the set \((1,2,\ldots,20)\), using the following code.
# Simulate 4 numbers from 1 to 20, without replacement
sample(x=1:20,size=4,replace=FALSE)
[1] 11 9 10 16
The resulting values shown here can then be used to select the corresponding units from a list of people or items of interest. In this case, the option replace=FALSE instructs R to sample without replacement. This means that once a unit is selected, it is taken out of the list of units that can be selected at a later stage.
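By contrast, setting replace=TRUE allows the same unit to be selected more than once. A minimal sketch (the seed value is an arbitrary choice, used only to make the result reproducible):
# Simulate 4 numbers from 1 to 20, with replacement (duplicates can occur)
set.seed(1)
sample(x=1:20,size=4,replace=TRUE)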
The objective of this section is to present a brief introduction to the use of some probability distributions to model sampling variability in observed data. For now, we consider a situation very similar to the one discussed in the context of Figure 4.1: we assume that there is a data generating process (DGP), i.e. a way in which data can arise and become available to us. This DGP determines the way in which some units are actually sampled from the theoretical target population.
As mentioned in Section 4.2, there are many different ways in which we can obtain a sample of \(n\) individuals out of the \(N>>n\) that make up the whole population. In a nutshell (and somewhat trivialising an argument that is subtler than it looks), the ideas we explore in this section rest on the assumption that the DGP can be safely characterised by a probability distribution; often, we use the phrase “the variable \(y\) has a XXX distribution”. While this terminology is almost ubiquitous in Statistics and Probability Calculus, it is in fact slightly misleading. What we really mean is a rather handy shortcut for the much more verbose (and correct!) phrasing
“We can associate the observed data with a XXX distribution to describe our level of uncertainty, e.g. on the sampling process that has determined the actual observation we have made (or we will make in the future).”
Of course, it would be very impractical to always use such a mouthful of a sentence — and practically nobody does. But it is important to understand that probability distributions or DGPs are not physical properties of the data — that is why the data cannot have a probability distribution. Rather, they are idealised mathematical concepts that we use to represent complex phenomena in a convenient way.
In general, the DGP will depend on a set of parameters (which in general we indicate with Greek letters, e.g. \(\theta\), \(\mu\), \(\sigma\), \(\lambda\), etc). The distribution is characterised by a mathematical function, something like \[p(y\mid\theta_1,\theta_2,\theta_3)=\theta_1 \frac{\theta_2}{\sqrt{\theta_3}}\exp\left(\frac{(y-\theta_2)^2}{\log(\theta_3)}\right) \] (the actual form of this function is irrelevant here and this specific one is only chosen to make a point!). In this case, the parameters are \(\boldsymbol\theta=(\theta_1,\theta_2,\theta_3)\) and, given a specific value for them, we determine a certain value of the underlying probability distribution for the observable outcomes, \(y\). In other words, we can use a probability distribution to describe the chance that a specific outcome arises, given a set of values for the parameters.
For example, for the fake distribution above, if \(\boldsymbol\theta=(2,1,0.6)\), then \(p(y)=2.582\exp\left( \frac{(y-1)^2}{-0.5108} \right)\). The graph in Figure 4.3 shows a graphical representation of the distribution \(p(y)\) for the set values of \(\boldsymbol\theta\). In this case, we are implying that the probability is distributed around a central part (more or less in correspondence of the value 1 along the \(x-\)axis) and that values increasingly further away are associated with very small probability weight (and thus are deemed unlikely by this model).
Of course, the assumption that we know the value of the parameters is certainly a big one and, in general, we are not in a position of having such certainty. And that is the point made in Figure 4.2 — instead of going from left to right, in reality we go from right to left, i.e. we use the one and only sample that we have actually observed to make assumptions and learn something about the DGP. This is the process of statistical analysis/estimation (which we will consider in Section 4.4).
In the rest of this section we present some of the most important and frequently used probability distributions.
The Binomial distribution can be used to characterise the sample distribution associated with a discrete variable \(R\) describing the total number of “successes” in \(n\) independent binary trials (e.g. the outcome is either 0 or 1, dead or alive, male or female, etc.), each with probability \(\theta\) of success.
You may think of this as a case where an individual randomly selected from the target population is associated with a probability \(\theta\) of experiencing an event of interest (e.g. having cardiovascular disease). Then, if you select \(n\) individuals randomly and you can assume that the event happens more or less independently on the individuals (e.g. if I have cardiovascular disease, this does not modify the chance that you do), then the sampling process can be described by the following equation \[ p(r \mid \theta, n) = \left( \begin{array}{c}n\\r\end{array} \right) \theta^r \; (1-\theta)^{n -r}; \qquad r = 0,1,\dots,n, \tag{4.1}\] where the term \[ \left( \begin{array}{c}n\\r\end{array} \right) = \frac{n!}{r!(n-r)!} = \frac{n(n-1)(n-2)\cdots 1}{[r(r-1)(r-2)\cdots 1][(n-r)(n-r-1)(n-r-2)\cdots 1]}\] is known as the binomial coefficient.
The assumption of independence is most useful from the mathematical point of view. In a nutshell, this comes from the fact that if we have \(n\) observed data points and we can assume that they are independent, then their joint probability distribution (i.e. a function describing their joint variability) can be factorised into the product of the single probability distributions: \[ p(y_1,y_2,\ldots,y_n\mid \boldsymbol\theta) = \prod_{i=1}^n p(y_i \mid \boldsymbol\theta). \tag{4.2}\] The main advantage of this factorisation is that the elements of the product on the right-hand side of Equation 4.2 are of lower dimension than the full joint distribution on the left-hand side. For example, full independence implies that instead of modelling an \(n-\)dimensional probability distribution, we can simply model \(n\) \(1-\)dimensional distributions, which is much simpler, both intuitively and computationally.
Of course, there are instances where the assumption of independence clearly does not hold. For example, you may think of \(n\) observations taking value 1 if the \(i-\)th individual has some infectious disease (e.g. sexually transmitted) and 0 otherwise. Then, if I do have the disease, this may well affect your chance of becoming infected (depending on what the transmission mechanism is).
Or, in a slightly simpler context, it may be that we cannot claim marginal independence (i.e. that, without reference to any other feature, the two variables \(X\) and \(Y\) are independent). But we may be able to claim a conditional version — for instance, given the value of a third variable \(Z\), \(X\) and \(Y\) may reasonably be assumed not to influence each other.
Essentially, the terms \(\theta^r\) and \((1-\theta)^{n -r}\) quantify the fact that, out of the \(n\) individuals, \(r\) experience the event, each with probability \(\theta\). Because we are assuming that they are independent of one another, this happens overall with probability \(\theta\) multiplied by itself \(r\) times, i.e. \(\theta^r\). Conversely, \((n-r)\) do not experience the event which, again assuming independence, happens with probability \((1-\theta)\) multiplied by itself \((n-r)\) times — or \((1-\theta)^{(n-r)}\).
The binomial coefficient accounts for the fact that we do not know which \(r\) of the \(n\) individuals actually experience the event — only that \(r\) do and \((n-r)\) do not. The binomial coefficient quantifies all the possible ways in which we can choose \(r\) individuals out of \(n\). For example, if we consider a population of size 5 and we want to sample 3 people from it, we could enumerate all the possible combinations. In R we can do this using the following command
combn(5,3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 1 1 1 2 2 2 3
[2,] 2 2 2 3 3 4 3 3 4 4
[3,] 3 4 5 4 5 5 4 5 5 5
which returns a matrix where each column represents one of the possible \(\left( \begin{array}{c}5\\3\end{array} \right)=\frac{5\times 4\times 3\times 2\times 1}{[3\times 2\times 1][2\times 1]} = 10\) samples. The units in the population are labelled as 1,2,\(\ldots\),5 and so the first possible sample (the first column of the output) would be made by the first three units, while the tenth (the last column) would be made by units 3, 4 and 5.
A Binomial with \(n=1\) is simply a Bernoulli distribution, denoted \(Y \sim \mbox{Bernoulli}(\theta)\). As is obvious, \(Y\) can only take on the values 0 (if the individual does not experience the event) or 1 (otherwise); in addition, because by definition \(\left( \begin{array}{c}1\\1\end{array} \right) = \left( \begin{array}{c}1\\0\end{array} \right) =1\), the probability distribution for a Bernoulli variable is simply \[ p(y \mid \theta) = \theta^y \; (1-\theta)^{1 -y}; \qquad y = 0,1.\]
The Bernoulli model essentially describes the sampling process for a binary outcome applied to a single individual. So if you have \(n\) individuals and you record whether each has an event or not, you can either describe the sampling process as \(n\) independent Bernoulli variables \(y_1,\ldots,y_n \stackrel{iid}{\sim}\mbox{Bernoulli}(\theta)\), where the notation \(\stackrel{iid}{\sim}\) indicates independent and identically distributed variables; or by considering the total number of people who have had the event \(r=\sum_{i=1}^n y_i=y_1+\ldots+y_n\) and modelling it using a single Binomial variable \(r\sim \mbox{Binomial}(\theta,n)\). When the only information available is about either the individual outcomes (\(y_i\)) or the aggregated summary (\(r\)), the two models are equivalent. If we have access to the individual level data (ILD; see Chapter 5), we can use the \(n\) Bernoulli variables directly. Often, however, we will not be able to access the ILD and will only know the observed value of the summary (\(r\)) — in this case we cannot use the Bernoulli model and need to work with the Binomial sampling distribution.
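As a minimal sketch of this equivalence (the seed and the value \(\theta=0.25\) are arbitrary choices for illustration):
# Simulates n=13 individual Bernoulli(0.25) observations...
set.seed(4)
y=rbinom(n=13,size=1,prob=0.25)
# ...whose sum is a single draw from a Binomial(0.25,13)
r=sum(y)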
Equation 4.1 can be used to compute the probability of observing exactly \(r\) individuals experiencing the event in a sample of \(n\), given that the underlying probability is \(\theta\). For example, if we set \(r=12\), \(n=33\) and \(\theta=0.25\), then \[\begin{eqnarray*} p(r=12\mid \theta=0.25,n=33) & = & \left( \begin{array}{c}33\\12\end{array} \right) 0.25^{12} \; (1-0.25)^{33 -12}\\ & = &354817320 \times 0.00000005960464 \times 0.002378409 \\ & = & 0.0503004. \end{eqnarray*}\]
In R we can simply use the built-in command dbinom to compute Binomial probabilities (see Section 2.6 for more details on the R implementation), for instance
dbinom(x=12,size=33,prob=0.25)
[1] 0.0503004
would return the same output as the manual computation above.
We can also use the built-in command rbinom to simulate values from a given Binomial distribution. For example, the following code can be used to produce the graph in Figure 4.4, which illustrates a histogram from a random sample of 10000 observations from a Binomial(0.25,33) distribution.
tibble(r=rbinom(n=10000,size=33,prob=0.25)) %>%
  ggplot(aes(r)) + geom_histogram(
    breaks=seq(0,33),color="black",fill="grey"
  ) +
  theme_bw() + xlab("Number of successes") + ylab("Absolute frequency") +
  theme(axis.text=element_text(size=8)) +
  scale_x_continuous(breaks = seq(0, 33, 1), lim = c(0, 33))
The R code here is probably unnecessarily complicated, in order to show some of the customisation features for the graph, using the tidyverse and ggplot2 approach (see Chapter 2 for more details on using R). We first define a tibble, a special R object containing data, which we fill with a vector r that contains 10000 simulated values from the relevant Binomial distribution. We then apply ggplot to construct and style the histogram.
The graph in Figure 4.5 shows three examples of Binomial distributions with \(\theta=0.3\), upon varying the sample size. As the sample size increases, the histogram becomes more symmetrical around the mean, \(n\theta\).
Because of the mathematical definition of the Binomial distribution, it can be proved that if \(R\sim\mbox{Binomial}(\theta,n)\), then \[ \mbox{E}[R] = n\theta \qquad \mbox{and} \qquad \mbox{Var}[R] = n\theta(1-\theta). \]
Consequently, for a single Bernoulli variable, the mean is simply \(\theta\) and the variance is simply \(\theta(1-\theta)\). Both the Bernoulli and Binomial distributions were derived by the Swiss mathematician Jacob Bernoulli, in the late 17th century. Bernoulli was part of a large family of academics and mathematicians, who have contributed to much of the early development of probability calculus and statistics. Examples of uses for the Bernoulli or Binomial distributions are presented in Section 4.6.3, Section 6.8.2, Section 7.5.2 and Section 10.1.4.
The Poisson distribution, named after the French mathematician Siméon Denis Poisson, can be used to model the sampling variability for “counts” of events, each with a low chance of occurring. Poisson used this model in the 19th century in his “Research on the Probability of Judgments in Criminal and Civil Matters”, in which he modelled the number of wrongful convictions. Examples of its use are shown in Section 4.6.3 and Section 7.5.2.
If \(y \sim \mbox{Poisson}(\theta)\) then we define \[ p(y\mid\theta)= \frac{ \theta^y \; e^{-\theta}}{y!} \qquad y = 0,1,2,3,\dots \tag{4.3}\]
For a Poisson distribution, the parameter \(\theta\) represents both the mean and the variance: \(\mbox{E}[Y]=\mu=\mbox{Var}[Y]=\sigma^2=\theta\). This may at times be a limitation because often empirical data tend to violate this assumption — they show larger variance than the mean, a phenomenon often referred to as overdispersion. Suitable models can be used to expand the standard Poisson set up to account for this feature. In addition, in many applications, the Poisson sampling distribution will arise as a total number of events occurring in a period of time \(T\), where the events occur at an unknown rate \(\lambda\) per unit of time, in which case the expected value for \(Y\) is \(\theta = \lambda T\).
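As a brief sketch of this rate-times-exposure construction (the values of \(\lambda\) and \(T\) are made up for illustration):
# Events occurring at a rate lambda=1.5 per unit of time, observed for T=2 units
lambda=1.5; T=2
theta=lambda*T
# Probability of observing exactly 0, 1 or 2 events over the whole period
dpois(x=0:2,lambda=theta)
[1] 0.04978707 0.14936121 0.22404181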
Generally speaking, the Poisson distribution is used to model sampling variability in observed counts, e.g. the number of cases of an observed disease in a given area. The examples in Figure 4.6 show that if events happen with a constant rate, observing for longer periods of time leads to smaller relative variability and a tendency towards a symmetrical shape. Comparison of Figure 4.6 with Figure 4.5 shows that, when sample size increases, a Binomial might be approximated by a Poisson with the same mean.
The main distinction between the Poisson and the Binomial models is that in the latter case we consider the number of events out of a fixed and known total number of possible occurrences (the sample size, \(n\)). In the case of the Poisson, we generally consider the overall number of events (counts), without formally restricting the total number of occurrences. For this reason, the Poisson distribution may be used to model the probability of observing a certain number of goals in a football match (which, theoretically is unbounded), while the Binomial distribution can be used to model the number of Gold medals won by Italy at the next Olympic Games (which physically is bounded by the total number of sporting events).
If you consider this, it becomes perhaps more intuitive why the Poisson is the limiting distribution for the Binomial, as \(n\) increases to \(\infty\). Figure 4.7 shows two histograms summarising 1,000,000 simulations from: a) \(y_1\sim\mbox{Binomial}(\theta=0.05, n=200)\); and b) \(y_2\sim\mbox{Poisson}(\mu=n\theta=10)\). As is possible to see, because the underlying probability \(\theta\) is fairly small (i.e. the event of interest is rare), for all intents and purposes, \(n=200\) is effectively as large as \(n\rightarrow\infty\) and the two distributions overlap almost completely. If \(\theta\) is larger, then we need a much bigger sample size \(n\) before the Binomial distribution is fully approximated by a \(\mbox{Poisson}(\lambda=n\theta)\).
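We can get a quick numerical feel for this result by comparing the two distributions directly, using the same values as above (\(\theta=0.05\) and \(n=200\)):
# Binomial probability of exactly 10 events vs its Poisson approximation
dbinom(x=10,size=200,prob=0.05)
dpois(x=10,lambda=200*0.05)
The two commands return very similar values, consistent with the overlap shown in Figure 4.7.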
We can use R to compute probabilities or simulate random numbers from a Poisson distribution in a similar fashion to what is shown above. For instance, if we set \(\theta=2\) we can compute the probability of observing \(y=8\) events using the following command (note that, somewhat confusingly, R calls the parameter we have indicated as \(\theta\) with the name lambda).
dpois(x=8,lambda=2)
[1] 0.0008592716
The answer could be determined also by a simulation approach, as follows.
# Set the pseudo-random number generator (for replicability)
set.seed(10230)
# Vector of number of simulations
n=c(100,1000,10000,100000,1000000,10000000)
# Initialise (an empty) vector of (numeric) probabilities
prob=numeric()
# Simulates n[i] observations from a Poisson(lambda=2) and then
# counts the proportion of simulations with value 8
for (i in 1:length(n)) {
  y=rpois(n=n[i],lambda=2)
  prob[i]=sum(y==8)/n[i]
}
# Formats and shows the output
nlab=format(n,scientific=FALSE,big.mark=",")
names(prob)=nlab
prob
100 1,000 10,000 100,000 1,000,000 10,000,000
0.0000000 0.0000000 0.0015000 0.0008100 0.0007820 0.0008638
If we inspect the output of this process, we see that as we increase the size of the simulation to 10,000,000, then the numerical answer (0.0008638) becomes very close to the analytic one (0.0008593). When we consider a small number of simulations, our numerical estimate of the “true” analytic value is not very precise at all.
This process of numerical approximation of a quantity through simulation is often termed Monte Carlo analysis.
The Normal distribution is fundamental to much of statistical modelling (and it is in fact mentioned almost throughout this book). Often it is referred to as Gaussian, from the name of its inventor, the German mathematician Carl Friedrich Gauss, who in the late 18th century used it to describe “normally distributed errors”.
A continuous variable is associated with a Normal distribution if we can assume that the underlying sampling process has the following properties:
When we assume that a variable \(Y\) is well described by a Normal distribution, we then write \(Y\sim \mbox{Normal}(\mu,\sigma)\), with \[ p(y\mid \mu,\sigma) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2} \right); \qquad -\infty < y < \infty. \tag{4.4}\] Equation 4.4 shows that the Normal distribution is characterised by two parameters: \(\mu\) is the population mean and \(\sigma\) is the population standard deviation (see Section 4.2). Mathematically, the function in Equation 4.4 is a probability density. This is due to the fact that the underlying variable is continuous and, as such, it can take on any real number, e.g. -1.21131765366772, 2.32279253568801, -6.52670986118001, 25.5187887120438.
As mentioned in Section 4.2, for a continuous variable it is a mathematical impossibility that two different observations have the exact same value (technically, there is 0 probability that this happens). Thus, we can think of a probability density as a histogram where each group width is arbitrarily small. Figure 4.8 shows this for a sample of values from a Normal(0,1) distribution. The histogram in panel (a) uses a group width of 0.50 — this means that the base of each rectangle is exactly 0.50 and the height is indicated along the \(y-\)axis of the graph. The histograms in panels (b)–(d) have increasingly small bar widths, approaching a value \(\epsilon\rightarrow 0\). Each graph has a Normal(0,1) density superimposed — as is possible to see, the graph in panel (d) shows essentially no distinction between the histogram approximation and the true density.
An important consequence is that, unlike the case of discrete variables, for which R commands such as dbinom(...) or dpois(...) allow us to calculate the actual probability of observing an exact value, for continuous variables the command dnorm(...) computes the density associated with a small interval around the value.
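To appreciate the distinction, note that the value returned by dnorm(...) is a density and not a probability; for instance, it can even exceed 1:
dnorm(x=0,mean=0,sd=1)
[1] 0.3989423
dnorm(x=0,mean=0,sd=0.1)
[1] 3.989423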
The Normal distribution (and the respective R code) can be used to compute tail area probabilities. For example, the command
qnorm(p=0.975,mean=0,sd=1)
[1] 1.959964
returns the value \(y\) such that for a variable \(Y\sim\mbox{Normal}(\mu=0, \sigma=1)\), we obtain \(\Pr(Y\leq y)=0.975\) — that is the 97.5% quantile of the Normal distribution. Figure 4.9 shows graphically that the area under the Normal(0,1) density between \(-\infty\) and 1.959964 \(\approx\) 1.96 does cover most of the probability mass — and in fact exactly 97.5%, which in turn implies that \(\Pr(Y>1.96)=0.025\).
Tail area probabilities can be used to compute the probability that a Normal variable lies within a given range. For example, we could use the following code
# Computes y1 so that, given Y~Normal(0,1), Pr(Y<=y1)=0.975
y1=qnorm(p=0.975,mean=0,sd=1)
# Computes y2 so that, given Y~Normal(0,1), Pr(Y<=y2)=0.025
y2=qnorm(p=0.025,mean=0,sd=1)
# Now verifies that Pr(Y<=y1) - Pr(Y<=y2) = 0.975 - 0.025 = 0.95
pnorm(q=y1,mean=0,sd=1)-pnorm(q=y2,mean=0,sd=1)
[1] 0.95
to check that the area comprised between the 97.5%-quantile and the 2.5%-quantile (i.e. the interval -1.96 to 1.96) does contain 95% of the probability distribution. Note the use of the built-in R functions qnorm and pnorm, which compute the quantile, given a specified probability, or a probability, given a specified quantile, under a Normal distribution.
The Gamma distribution defines a flexible and mathematically convenient model for continuous quantities constrained to be positive. Its use in HTA is rather prevalent, e.g. to model costs or utilities, as well as in survival modelling and in Bayesian analysis as a possible/suitable prior. It is discussed further in Section 4.6.2.1, Section 5.4.3, Section 6.8.2, Section 7.2.5, Section 9.6.2 and Section 11.4.
The notation \(Y \sim \mbox{Gamma}(a,b)\) indicates a Gamma distribution with properties:
\[
p(y\mid a, b) = \frac{ b^a }{\Gamma(a) } y^{a-1} \; e^{-by}; \qquad y > 0; \; a,b>0
\tag{4.5}\] with \[\begin{eqnarray}
\mbox{E}[Y] = \frac{a}{b} \qquad \mbox{and} \qquad \mbox{Var}[Y] = \frac{a}{b^2} \nonumber
\end{eqnarray}\]
The parameter \(a\) is called the shape, while \(b\) is called the rate of the Gamma distribution. Alternative parameterisations exist in terms of the parameters \((a,c)\), where \(c=1/b\) is called the scale — but note that, in this case, the form of the density in Equation 4.5 needs to be re-written accordingly!
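The built-in R density function dgamma accepts either parameterisation, which gives a quick way of checking the correspondence:
# Shape-rate and shape-scale parameterisations give identical densities
dgamma(x=2,shape=3,rate=0.5)
[1] 0.09196986
dgamma(x=2,shape=3,scale=2)
[1] 0.09196986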
The general form of the Gamma distribution includes as special cases several other important distributions. Figure 4.10 shows a few examples of different Gamma distributions, upon varying the two parameters \(a\) and \(b\). As is possible to see, for specific choices of the parameters, we retrieve other distributions, related to the Gamma.
The Exponential distribution is obtained as a special case of the Gamma distribution, when we set \(a=1\), i.e. \[ Y\sim \mbox{Gamma}(1,b) \equiv \mbox{Exponential}(b). \tag{4.6}\] The Exponential distribution is sometimes used as model for sampling variability in times until an event of interest happens (e.g. time until a patient dies of a given disease). One major limitation of the Exponential model is however that it is only characterised by a single parameter \(b\), which also determines the value of the mean and variance as \[\mbox{E}[Y]=\displaystyle\frac{1}{b} \qquad \mbox{and} \qquad \mbox{Var}[Y]=\displaystyle\frac{1}{b^2}.\]
For this reason, the Exponential model is often too rigid and cannot represent well variations across a wide range of values for the variable \(y\). Some other mentions of the Exponential distribution can be found in Section 7.2.5, Section 9.6.2 and Section 11.4.
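A quick numerical check of the equivalence in Equation 4.6 (the rate b and the grid of values are arbitrary choices):
# Gamma(1,b) and Exponential(b) define exactly the same density
b=0.4
y=seq(0.1,10,by=0.1)
all.equal(dgamma(y,shape=1,rate=b),dexp(y,rate=b))
[1] TRUE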
The Beta distribution is a general model that can be used to describe the sampling variability (and, in general, uncertainty) about quantities defined in the range \([0;1]\). Examples of such quantities include probabilities or proportions (e.g. the probability of a given event occurring), but also measures for quality of life (e.g. QALYs; see Chapter 1), which may be defined in that range. Examples are shown in Section 9.6.2.
If a quantity is modelled as \(Y\sim \mbox{Beta}(a,b)\), then \[ p(y\mid a, b) = \frac{\Gamma(a+ b)}{ \Gamma(a)\Gamma(b) } \,\, y^{a-1} (1-y)^{b-1}. \tag{4.7}\]
For a variable \(Y\sim\mbox{Beta}(a,b)\), the following properties also hold \[ \mbox{E}[Y] = \mu = \frac{a}{a+b} \qquad \mbox{ and } \qquad \mbox{Var}[Y] = \sigma^2 = \frac{ab}{(a+b)^2(a+b+1)}, \] which implies that we can infer the value of the parameters, given set values for mean and variance, as \[ a = \mu\left(\frac{(1-\mu)\mu}{\sigma^2} -1\right) \qquad \mbox{ and } \qquad b = (1-\mu)\left(\frac{(1-\mu)\mu}{\sigma^2} -1\right). \]
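This inversion (an instance of the so-called “method of moments”) is easy to code up. A minimal sketch, where the function name betaPar and the example values are purely illustrative:
# Computes the Beta parameters implied by a given mean and variance
betaPar=function(mu,sigma2) {
  a=mu*((1-mu)*mu/sigma2-1)
  b=(1-mu)*((1-mu)*mu/sigma2-1)
  list(a=a,b=b)
}
# For instance, a mean of 0.4 and a variance of 0.02 imply a Beta(4.4,6.6)
betaPar(mu=0.4,sigma2=0.02)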
One further very nice feature of the Beta distribution is its versatility: upon varying the parameters of the distribution, we can get very different shapes indicating different levels of knowledge or uncertainty about the underlying value of the variable modelled using the Beta distribution, as shown in Figure 4.11.
Probability calculus as taught in Statistical courses often concentrates on a small number of probability distributions, e.g. those mentioned above. But there are of course many more possibilities: which distribution to use to reflect assumptions about the underlying data generating process and sampling variability (or indeed epistemic uncertainty about unobservable quantities) is a matter of substantive knowledge of the problem. And in reality, it becomes as much an art as it is a science, which requires experience as well as expertise.
Additional important and useful distributions include the Uniform, log-Normal, Weibull, Gompertz, Multinomial and Dirichlet. Good compendia of probability distributions are presented in several textbooks, including Spiegelhalter et al. (2004) and Lunn et al. (2013).
The graph in Figure 4.12 shows a graphical representation of the relationships among a large number of probability distributions. As is possible to see, there are many more choices than the simple few showed above. And interestingly, in many cases, the distributions tend to show close mathematical connections (for instance one distribution may be obtained as a special case of another, just like for the Exponential/Gamma). More details are given in Leemis and McQueston (2008).
As mentioned earlier, in a nutshell the problem of statistical inference (and thus the idea of doing Statistics) consists in using the information provided by the one sample we have actually observed to learn about the parameters of the underlying data generating process.
Interestingly (and somewhat confusingly) not even statisticians have agreed on a single way in which this process should be carried out. In fact, there are at least three main philosophical approaches to the problem of statistical inference.
The Bayesian approach is historically the first to have been developed. The original ideas and the basic structure date back to the publication of an essay by Reverend Thomas Bayes (Bayes, 1763), an English non-conformist minister, after whom the whole approach is named.
A Bayesian model specifies a full probability distribution to describe uncertainty. This applies to data, which are subject to sampling variability, as well as to parameters (or hypotheses), which are typically unobservable and thus are subject to epistemic uncertainty (e.g. the experimenter’s imperfect knowledge about their value) and even future, yet unobserved realisations of the observable variables (Gelman et al., 2013).
As a consequence, probability is used in the Bayesian framework to assess any form of imperfect information or knowledge. Thus, before even seeing the data, the experimenter needs to identify a suitable probability distribution to describe the overall uncertainty about the data \(\boldsymbol y\) and the parameters \(\boldsymbol \theta\). We generally indicate this as \(p(\boldsymbol y, \boldsymbol\theta)\). Using the basic rules of probability, it is always possible to factorise a joint distribution as the product of a marginal and a conditional distribution (a relevant example in the HTA case is shown in Section 6.7.1). For instance, we could re-write \(p(\boldsymbol y,\boldsymbol\theta)\) as the product of the marginal distribution for the parameters \(p(\boldsymbol \theta)\) and the conditional distribution for the data, given the parameters \(p(\boldsymbol y\mid \boldsymbol\theta)\). But in exactly the same fashion, one could also re-express the joint distribution as the product of the marginal distribution for the data \(p(\boldsymbol y)\) and the conditional distribution for the parameters given the data \(p(\boldsymbol\theta\mid \boldsymbol y)\).
Consequently, \[ p(\boldsymbol{y},\boldsymbol{\theta})=p(\boldsymbol\theta)p(\boldsymbol{y}\mid\boldsymbol\theta) = p(\boldsymbol y)p(\boldsymbol\theta\mid \boldsymbol y) \] from which Bayes’ Theorem follows in a straightforward way: \[ p(\boldsymbol\theta\mid \boldsymbol y) = \frac{p(\boldsymbol\theta)p(\boldsymbol y\mid\boldsymbol\theta)}{p(\boldsymbol y)}. \tag{4.8}\] While mathematically incontrovertible, Bayes’ Theorem has deeper philosophical implications, which have led to heated debates, within and without the field of Statistics. In fact, the qualitative implications of this construction are that, if we are willing to describe our uncertainty on the parameters before seeing the current data through a probability distribution, then we can update this uncertainty by means of the evidence provided by the data into a posterior probability, the left hand side of Equation 4.8. This allows us to make inference in terms of direct probabilistic statements.
In all but trivial models, Equation 4.8 also presents some computational challenges because it is often very hard or even impossible to compute the ratio on the right hand side analytically. Consequently, until the 1990s the practical implementation of Bayesian models has been hampered by this problem. The widespread availability of cheap computing as well as the development of suitable clever methods for simulations (e.g. Markov Chain Monte Carlo, or MCMC — see for example Section 5.4 and Section 6.7) have helped overcome these problems and make Bayesian analysis very popular in several research areas, including economic evaluation of health care interventions and adaptive clinical trial designs.
Leaving all the technicalities aside, Bayesian inference proceeds by specifying a prior distribution to describe the current uncertainty about the parameters, combining it with the evidence provided by the observed data (through the likelihood) and summarising the resulting posterior distribution.
Suppose we consider a simple case where the data are \(R\sim\mbox{Binomial}(\theta,n)\), with \(n=13\) and we have observed \(r=9\) successes. We want to make inference on the parameter \(\theta\). Figure 4.13 shows an example of Bayesian inference in action (we will dispense with all the difficult technical points here and only concentrate on the interpretation).
Imagine that you are willing to specify the current level of uncertainty about \(\theta\) using the black curve (labelled as “Prior”). This essentially implies that, before seeing any other data, you believe it reasonable to assume that the most likely value for \(\theta\) is around 0.4 and that it will most likely be included in the interval (0.2 – 0.6). Values below 0.2 or above 0.6 are associated with increasingly smaller values of the probability mass as you move away from the mode (0.4) and towards the extremes (0 and 1). Technically, we could use a Beta(9.2,13.8) distribution to encode these assumptions (but, again, this is only a technicality and not important, here).
The red curve is a representation of the contribution brought by the observed data. In fact, this is the likelihood function (that is described in Section 4.4.2). Again, leaving all the details aside, the interpretation of the red curve is that, intuitively, because we have observed \(r=9\) successes over \(n=13\) individuals, the data seem to suggest that the “true” underlying probability of success may be higher than we originally thought (the red curve has a mode around 0.69231).
Finally, Bayesian inference is obtained by inspecting the blue curve, showing the posterior distribution. This is typically a compromise between the prior knowledge and the evidence provided by the data. As mentioned above, technically this can be complex, or even impossible to determine analytically — and in fact, most often we resort to simulation algorithms to obtain a suitable approximation.
In general, once the whole posterior distribution is available, it is then possible to describe it using suitable summaries. For example, the usual point estimate is the mean of the posterior distribution. In the case of Figure 4.13, because the blue curve is reasonably symmetrical, the mean corresponds with the mode (the point where the distribution is highest), which in this case can be computed as 0.5056. So we initially thought that the probability that a random individual would experience the event under study was centred around the prior mean 0.4000 and we have revised this to the posterior summary 0.5056, after observing 9 individuals in a sample of 13 experiencing the event.
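In this specific example, the computation is in fact available in closed form, because the Beta prior is conjugate with the Binomial likelihood: the posterior is again a Beta distribution, with parameters updated by the data. A brief sketch of the calculation (using the prior assumed above):
# Prior Beta(9.2,13.8) combined with r=9 successes out of n=13 trials
a=9.2; b=13.8; r=9; n=13
# The posterior is a Beta(a+r,b+n-r)=Beta(18.2,17.8), whose mean is
(a+r)/(a+b+n)
[1] 0.5055556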
This approach to statistical inference was almost single-handedly developed by Ronald Fisher. The main ideas underlying Fisher’s theory can be somewhat roughly summarised as follows.
Notice that, in regards to the idea that parameters are fixed (albeit unknown) quantities, also from the Bayesian point of view parameters may be fixed quantities: it is possible that the “true” proportion of males in the world is an immutable constant — we just do not (and cannot!) know its true value with absolute certainty. The Bayesian way to deal with this epistemic uncertainty is to consider \(\theta\) as a random variable and describe the current uncertainty with the prior.
As shown in Section 4.3, we can model sampling variability using probability distributions. For example, in the case of the Binomial distribution, this is defined as \[\begin{eqnarray*} p(r \mid \theta, n) = \left( \begin{array}{c}n\\r\end{array} \right) \theta^r \; (1- \theta)^{n -r} \end{eqnarray*}\] — recall Equation 4.1. The expression above is a function of the observed data, given the values of the parameters (as should be clear from Section 4.3).
However, what we really want to do when making inference is not so much fixing the value for \(\theta\) and computing probabilities for possible values of \(r\) — rather we want to use the observed value of \(r\) to learn what the most plausible value of \(\theta\) is in the “true”, underlying DGP. Thus, Fisher’s main idea was to create a different kind of function, which varied with the parameters, but depended on the observed (fixed) \(r\). He called this the likelihood function and defined it as the same equation used for the sampling variability associated with the model — except that the fixed and varying arguments are flipped around. For example, in the Binomial case, the likelihood function is \[ \mathcal{L}(\theta\mid r) = \theta^r(1-\theta)^{(n-r)}. \tag{4.9}\] Equation 4.9 only takes the terms in the Binomial sampling distribution that depend directly on the model parameter. For instance, the Binomial coefficient \(\displaystyle \left( \begin{array}{c}n\\r\end{array} \right)\) is irrelevant because it does not include \(\theta\) and thus it is discarded in forming the likelihood.
Panel (a) in Figure 4.14 shows the Binomial sampling distribution for a fixed value of \(\theta=0.3\). If that was the “true”, underlying value of the probability that a random individual drawn from the population of interest experiences the event of interest, then we would expect 4 successes out of \(n=13\) individuals sampled to be the most probable outcome. Observing 9 successes would be a somewhat unlikely event: we can use R to compute this probability as dbinom(x=9,size=13,prob=0.3)=0.00338.
Panel (b) presents the likelihood function for three possible observed samples. The black curve corresponds to the analysis when \(r=2,n=13\), the red curve is drawn for \(r=4,n=13\) and the blue curve is derived in the case where \(r=9\) and \(n=13\) — notice that in the interest of comparability, we present here the rescaled likelihood functions, i.e. \(\mathcal{L}(\theta\mid r)/\max\left[ \mathcal{L}(\theta\mid r) \right]\), so that the \(y-\)axis ranges from 0 to 1.
The interpretation of the likelihood function is that it describes “how likely” each possible value of the parameter is, given the observed data. For example, in the case of the black curve, the most likely value of the parameter after observing \(r=2\) successes out of \(n=13\) individuals is \(\hat{\theta}=0.154\), which is indicated by the black dashed line. For \(r=4\), then \(\hat\theta=0.308\) (the red dashed line) and for \(r=9\) then \(\hat\theta=0.692\) (the blue dashed line).
Notice that the language we have used here is purposely pedantic and focuses on the concept of “how likely”, as opposed to “how probable”. This is because the likelihood function is not a probability distribution (technically, the reason for this is that the likelihood function does not integrate to 1, i.e. the area under the curve representing the likelihood function is not equal to 1, which is a necessary condition for a mathematical function to represent a probability distribution). Fisher was very clear on this — that is why he coined the phrasing “likelihood function”. But the concepts are sometimes conflated, so it is important to be careful on the distinction.
As seems sensible, the greater the number of observed successes, the higher the most likely value of the underlying true probability of success — if \(\theta\) is indeed very large, then most people will experience the event and so we will tend to observe large values for \(r\) in any given sample. So, by and large, the simplified process presented here describes how inference is performed in this setting: we construct the likelihood function from the observed data and then find the value of the parameter in correspondence of which the function reaches its maximum.
This is used as the best estimate — and it is referred to as the maximum likelihood estimator (MLE). Mathematically, this last step amounts to maximising the likelihood function. Technically this is done by finding the value for which the first derivatives of the function is equal to 0 and then checking that the second derivative of the function is negative to ensure that that point is indeed a maximum.
In order to perform this analytically (i.e. for a general value of the random variable \(R\), rather than the specific observed value \(r\)), it is usually easier to work on the log scale, i.e. trying to maximise the log-likelihood \(\ell(\theta)=\log\mathcal{L}(\theta\mid R)\). Figure 4.15 shows that both functions are maximised by the same point on the \(x-\)axis.
Thus, if we consider the log-likelihood for the Binomial example, we have \[\begin{eqnarray*} \ell(\theta) = \log\mathcal{L}(\theta\mid R) = R\log\theta + (n-R)\log(1-\theta) \end{eqnarray*}\] and so, making use of the properties that: i. the derivative of a sum is the sum of the derivatives; ii. for a variable \(x\), the derivative of \(\log(x)\) is \(\frac{1}{x}\); iii. for a variable \(x\), the derivative of \(\log(1-x)\) is \(-\frac{1}{1-x}\); we can compute \[ \mbox{First derivative of $\ell(\theta)$} = \ell'(\theta)= \frac{R}{\theta} - \frac{n-R}{1-\theta}. \tag{4.10}\] Setting this to 0 and solving for \(\theta\), i.e. finding the value \(\hat\theta\) in correspondence of which the first derivative is 0, gives \[\begin{eqnarray*} \frac{R}{\hat\theta} - \frac{n-R}{1-\hat\theta} = 0 & \Rightarrow & (1-\hat\theta)R -\hat\theta (n-R) = 0 \\ & \Rightarrow & R-\hat\theta R -\hat\theta n + \hat\theta R = 0 \\ & \Rightarrow & \hat\theta = \frac{R}{n} = \sum_{i=1}^n \frac{Y_i}{n} = \bar{Y}, \end{eqnarray*}\] i.e. the sample mean of the \(n\) individual Bernoulli variables (recall Section 4.3.1).
If we also compute the second derivative, i.e. the derivative of Equation 4.10 and check that it is negative (which in this case it is) to ensure that \(\hat\theta\) is indeed a maximum point, then we have obtained a functional form for the MLE. And we can use this to compute its value for the observed sample, which in this case is \(\displaystyle \hat{\theta}=\bar{y}=\frac{r}{n}=\frac{2}{13}=0.154\), which can be used as the point estimate for the parameter of interest.
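For completeness, the second derivative can be obtained from Equation 4.10 with the same differentiation rules: \[ \ell''(\theta) = -\frac{R}{\theta^2} - \frac{n-R}{(1-\theta)^2}, \] which is negative for any \(\theta \in (0,1)\), confirming that \(\hat\theta=R/n\) is indeed a maximum.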
In many cases the derivatives can be computed analytically, which means we can find the MLE as a function of the general random variable \(R\) and then compute the value in correspondence of the observed value \(r\), as in the case above. However, there may be cases where the resulting likelihood function is complex and either difficult or even impossible to differentiate (i.e. compute the derivatives). In these cases, most likely, you will be using a computer to maximise the likelihood function numerically. For example, R has several built-in functions to perform numerical optimisation, for instance optim or optimise, which can be used to that effect. The following code implements this calculation using optimise.
# Defines the likelihood function as an R function
Lik=function(theta,r,n) {
  # The function depends on three arguments:
  # 'theta' is a vector of values specifying the range of the parameter
  # 'r' is the observed number of successes
  # 'n' is the observed sample size
  theta^r*(1-theta)^(n-r)
}
# Use 'optimise' to obtain the MLE for r=2 and n=13 in the interval (0,1),
# ie the range of theta
optimise(Lik,r=2,n=13,interval=c(0,1),maximum=TRUE)
$maximum
[1] 0.1538463
$objective
[1] 0.003768044
Firstly, we code up the likelihood function of Equation 4.9 in the function Lik, which takes as arguments the parameter theta, as well as the observed data r and n. Then we run optimise by passing as arguments the function we want to maximise (Lik), the values for the data (r and n), the interval over which we want to maximise the function (interval=c(0,1), which represents the range of \(\theta\)) and the option maximum=TRUE, which instructs R to compute the maximum (instead of the minimum) of the function. The result is stored in the object maximum, which has a value of 0.1538463, essentially identical to the analytic value 0.1538462 (the differences are due to the approximation in the numerical optimisation procedure performed through optimise).
The third major approach to statistical inference is termed frequentist and it was built on major contributions by Jerzy Neyman and Egon Pearson, in the 1930s.
As in the likelihood approach, parameters are considered to be unknown, but fixed quantities. However, the frequentist school does not attempt to make inference for a specific set of data, but rather it considers and evaluates inference procedures (e.g. the way in which an estimator is defined). Inference consists in the probabilistic assessment of the properties of the procedure, according to suitably defined criteria (more on this later).
In a nutshell, the frequentist approach defines a statistic, i.e. a function \(f(Y)\) of the observed data, based on its “optimality” according to long-run performance. In other words, we want to use as estimator for a given parameter a function of the observed data that, if we were able to repeat the experiment over and over under the same conditions, would guarantee certain properties in the long run.
The “long-run” argument underpins the frequentist approach and all the theory developed by Neyman and Pearson. This was very popular at the time, although not universally advocated. And criticism of the frequentist philosophy was not confined to the field of Statistics. The British economist John Maynard Keynes is quoted to have dismissed it because “… in the long run, we are all dead” (Keynes, 1923).
For instance, let us consider a Bernoulli sample of \(n\) individuals \(\boldsymbol{Y} = (Y_1,\ldots,Y_n)\stackrel{iid}{\sim}\mbox{Bernoulli}(\theta)\) and the two possible estimators \[f_1(\boldsymbol Y) = \sum_{i=1}^n\frac{Y_i}{n} = \bar{Y} \qquad \mbox{and}\qquad f_2(\boldsymbol Y) = \frac{\mbox{Med}(\boldsymbol{Y})}{n}.\] Here, \(f_1(\boldsymbol Y)\) is the sample mean, where \(f_2(\boldsymbol Y)\) is computed as the median of the data \(\boldsymbol Y\) divided by the sample size \(n\).
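To fix ideas, we can evaluate the two statistics on a single, made-up sample of \(n=13\) binary observations:
# A made-up sample with 3 successes out of n=13 observations
y=c(0,1,1,0,0,0,0,0,0,0,0,1,0)
# f1: the sample mean
mean(y)
[1] 0.2307692
# f2: the rescaled median
median(y)/13
[1] 0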
Now, imagine that we knew with no uncertainty that the true value of the underlying probability of success is \(\theta=0.3\) — of course, in reality we cannot know this, and giving our best estimate for its value is in fact the objective of the analysis. But if we did know, then we could imagine a simulation process that aims at mimicking what could happen if we were to repeat, a very large number of times, the experiment in which we collect data on \(n=13\) individuals and record how many experience the event.
In R we could do this by using the following code
# Sets the 'seed' so that we always get the same results
set.seed(12)
# Sets the "true" value of the probability of success (assumed known)
theta=0.3
# Sets the sample size in each repeated experiment
n=13
# Sets the number of simulations
nsim=1000
# Defines a matrix "samples" with nsim rows and n columns,
# initially with "NA" (empty) values
samples=matrix(NA,nsim,n)
# Then creates a loop to fill each row of the matrix "samples" with n
# simulated values from a Binomial(theta, 1) (i.e. we simulate all the
# individual Bernoulli data Y_i)
for (i in 1:nrow(samples)) {
  samples[i,]=rbinom(n,1,theta)
}
head(samples)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 0 1 1 0 0 0 0 0 0 0 0 1 0
[2,] 0 0 0 0 0 0 0 0 1 0 1 0 0
[3,] 0 0 0 0 0 1 1 1 0 1 0 1 0
[4,] 0 0 0 1 0 1 0 0 0 0 1 0 0
[5,] 0 0 0 1 0 0 0 1 1 0 1 0 0
[6,] 0 0 1 0 1 0 1 0 0 0 1 1 1
(the command head(samples) shows the first few rows of the matrix of simulations, where 0 indicates that we have simulated a “failure” and 1 indicates a “success”).
For each simulated dataset (=replicate of the experiment), we can record the observed value of the two statistics \(f_1(\boldsymbol y)\) and \(f_2(\boldsymbol y)\). Figure 4.16 shows the output for the first 20 replicates of the experiments: in each of the 20 rows in the graph, the black dots represent the \(n=13\) simulated values (notice that, because we are simulating from a Bernoulli, then the only possible outcomes are 0=“failure” and 1=“success”); the red diamonds are the observed value of the sample mean \(f_1(\boldsymbol y)\); and the blue squares are the observed values of the rescaled median \(f_2(\boldsymbol y)\).
The frequentist approach makes use of the fact that, because they are functions of the observed data (which are subject to sampling variability), the statistics \(f_1(\boldsymbol Y)\) and \(f_2(\boldsymbol Y)\) also are subject to sampling variability — intuitively, this is expressed by the different values that we could observe for them, if we were able to do the experiment over and over again (as in Figure 4.16). Using the nsim simulations stored in the matrix samples, we could investigate the sampling distributions of \(f_1(\boldsymbol Y)\) and \(f_2(\boldsymbol Y)\), for example using the following commands
tibble(x=samples %>% apply(1,mean)) %>%
ggplot(aes(x)) + geom_histogram(bins=10,fill="grey",col="black") +
theme_bw() + xlab("") + ylab("Density") +
geom_segment(
aes(x=0.3,y=-Inf,xend=0.3,yend=Inf),
linetype="dashed",lwd=0.85
)
tibble(x=samples %>% apply(1,median)/n) %>%
ggplot(aes(x)) + geom_histogram(bins=10,fill="grey",col="black") +
theme_bw() + xlab("") + ylab("Density")
(the built-in function apply(matrix,1,FUN) takes all the rows of the first argument matrix and applies the function FUN to each of them, returning a vector of summaries, e.g. the mean or the median). Figure 4.17 shows histograms for the sampling distributions of \(f_1(\boldsymbol Y)\) and \(f_2(\boldsymbol Y)\), while Table 4.1 reports some summaries, including the mean, standard deviation, 2.5%, 50% (median) and 97.5% percentiles of the sampling distributions.
| | Mean | SD | 2.5% | Median | 97.5% |
|---|---|---|---|---|---|
| $\displaystyle f_1(\boldsymbol Y)=\bar{Y}$ | 0.303 | 0.13159 | 0.07692 | 0.3077 | 0.61538 |
| $\displaystyle f_2(\boldsymbol Y)=\mbox{Med}(\boldsymbol Y)/n$ | 0.005538 | 0.01989 | 0 | 0 | 0.07692 |
As is possible to see, the two estimators have rather different characteristics: the sampling distribution of \(f_1(\boldsymbol Y)\) is centred around the true value \(\theta=0.3\) (its mean is 0.303), a property known as unbiasedness, while \(f_2(\boldsymbol Y)\) is systematically far away from it.
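The summaries in Table 4.1 can be computed directly from the simulated replicates; for instance, for \(f_1(\boldsymbol Y)\), a sketch using the matrix samples defined above is
f1=apply(samples,1,mean)
c(mean(f1),sd(f1),quantile(f1,c(0.025,0.5,0.975)))
which should reproduce (up to rounding) the first row of the table.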
The reason why unbiasedness is so important to a frequentist is that it encapsulates perfectly the concept of long-run optimality: for any given sample, we have no reassurance that the unbiased statistic would give us a value equal, or even close, to the true underlying parameter. In fact, looking at Figure 4.16, none of the observed values for \(f_1(\boldsymbol Y)\) is equal to the true value \(\theta=0.3\). However, because of its unbiasedness, we can say that in the long run, on average, we would “get it right” by using \(f_1(\boldsymbol Y)\) to estimate \(\theta\).
Unbiasedness is only one of the frequentist properties — arguably, the most compelling from a frequentist perspective and possibly one of the easiest to verify empirically (and, often, analytically).
There are however many others, including, for example, consistency (the estimator gets increasingly close to the true parameter value as the sample size grows) and efficiency (the estimator has the smallest possible sampling variability among its competitors).
Interestingly, as it happens, the MLE for a given parameter does generally have all the good frequentist properties. For this reason, we can effectively be frequentist and select the MLE as our optimal estimator, which has contributed to some people presenting and using these two approaches as a combined and unified theory. In fact, Fisher on the one hand and Neyman and Pearson on the other saw them as two irreconcilable schools of thought and have spent many years arguing vehemently with each other — so much so that when Fisher moved to University College London, a new special chair was created for him, to avoid him being in the same department as Pearson.
What the two approaches do have in common, as mentioned above, is the fact that both take a non-Bayesian stance on how to deal with the model parameters. In both cases, parameters are considered as fixed but unknown quantities, which govern the DGP. In order to learn about the true, underlying value of the parameters, we can use suitable statistics: in the case of Fisher’s theory, the MLE because it is based on the likelihood function (which contains all the information that the sample can provide on the parameters); in the case of the frequentist approach, again most often the MLE, because it upholds all the good frequentist properties. But you should be aware of the fundamental distinction in these two approaches.
We have just seen how different approaches to statistical inference produce point estimates for a parameter of interest. This is often a very good starting point — and sometimes it is the only summary reported, e.g. in non-scientific publications (such as in the mainstream media). In reality, the point estimate is only likely to be indicative of what the true value of the underlying parameters could be — because of a combination of the sampling variability (that implies we can only learn so much from the observed data, unless \(n\rightarrow \infty\), which it never is…) and the intrinsic epistemic uncertainty that we have on the parameter.
For this reason, it is a good idea to provide an interval estimate to complement the point estimate for a given (set of) parameter(s).
From a Bayesian point of view, in theory, interval estimation does not pose any additional complication. Once the full posterior probability distribution for the parameter has been computed, we can simply present any summary we want. As seen in Section 4.4.1, the point estimate can be taken as the mean or the mode of the posterior. But we can also compute intervals that directly express probabilistic statements about the values of the parameters.
For example, we can prove that the posterior distribution for the Binomial example shown in Section 4.4.1 is a Beta\((18.2,17.8)\) — again the technical details are not important here. We can use this information to compute analytically quantities such as the value \(\theta_U\), the point in the parameter space (defined in this case as the interval \([0;1]\), because \(\theta\) represents a probability) in correspondence of which \(\Pr(\theta < \theta_U \mid y)=0.95\). With the current specification, this is 0.6408 — that is we can estimate that the posterior probability that \(\theta\) is less than 0.6408 is exactly 95%.
Sometimes we will be able to make this or similar calculations analytically (which technically means we can solve an integral) and we can make the computation in R, for instance using the command

q_U=qbeta(p=0.95,shape1=18.2,shape2=17.8)

which returns the exact value 0.640801.
A Bayesian interval is essentially computed using a similar process as the interval in correspondence of which a certain amount of probability lies: \[ \mbox{Bayesian 95\% interval} = [\theta_L; \theta_U] \mbox{ such that } \Pr(\theta_L \leq \theta \leq \theta_U \mid y) = 0.95. \]
Generally speaking, we can use a simulation approach (which is essentially what MCMC does). For example, we can use the following R code to compute the 95% interval for the example above.
# Simulates 10000 values from the posterior distribution Beta(18.2,17.8)
theta=rbeta(n=10000,shape1=18.2,shape2=17.8)
# Computes the 2.5% quantile (i.e. the point leaving an area of 2.5% to its left)
q_L=quantile(theta,0.025)
# Computes the 97.5% quantile (i.e. the point leaving an area of 97.5% to its left)
q_U=quantile(theta,0.975)
# Displays the resulting interval estimate
c(q_L,q_U)

     2.5%     97.5% 
0.3435259 0.6635252
The result is that 95% of the posterior distribution lies in the interval \([0.344; 0.664]\) or, in other words, the probability that the true parameter is contained between 0.344 and 0.664, given the model assumptions and the observed data, is 95%.
Figure 4.18 shows again the posterior probability distribution; the dark horizontal line at the bottom of the histogram indicates the 95% interval.
As mentioned above, theoretically, summarising the posterior distribution through an interval does not pose any additional problem, once the target distribution is available. The main problem is, of course, that we need to know what the posterior is before we can manipulate it — and this is the main point of doing simulations, e.g. using MCMC algorithms. This is not trivial, though, and thus requires some specific training.
In addition, the procedure shown above to compute a Bayesian interval delivers what is often called a central interval — that is, the one that leaves equal amounts of probability to its right and to its left. It works very well if the underlying distribution is reasonably symmetric, but it may not be the best option when the distribution is skewed or multimodal (i.e. it has several “humps”, like a camel). In those cases, we can still compute suitable Bayesian intervals, based on slightly different theory and computations (these are often called “Highest Posterior Density”, or HPD, intervals). The details are not important here.
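To give a flavour, an HPD interval can be approximated directly from the posterior simulations, as the shortest interval containing the required probability. The following is a minimal sketch (not necessarily the algorithm used by dedicated packages such as coda), re-using the vector theta simulated above.

# A sketch: finds the shortest interval containing 95% of the posterior draws
hpd=function(x,level=0.95){
   sx=sort(x)
   n=length(sx)
   m=ceiling(level*n)
   # Width of every candidate interval spanning m consecutive ordered draws
   widths=sx[m:n]-sx[1:(n-m+1)]
   i=which.min(widths)
   c(sx[i],sx[i+m-1])
}
hpd(theta)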
As mentioned in Section 4.4.2 and Section 4.4.3, the likelihood and frequentist approaches are not, in fact, a unified theory. Nevertheless, because effectively the MLE is often the “best” frequentist estimator, we can present the methodology in a compact way.
The basic idea is that, as mentioned above, the MLE is a statistic, i.e. a function \(f(\boldsymbol Y)\) of the observed data, and, as such, it is associated with a sampling distribution. If we are able to determine what this sampling distribution is, then we can use it to compute tail-area probabilities that can be used to derive interval estimates.
For example, for a generic statistic \(f(\boldsymbol Y)\), we can prove that \[ Z=\frac{f(\boldsymbol Y)-\mbox{E}[f(\boldsymbol Y)]}{\sqrt{\mbox{Var}[f(\boldsymbol Y)]}} \sim \mbox{Normal}(0,1), \tag{4.11}\] at least approximately, as the sample size \(n\rightarrow \infty\) (which in practical terms, simply means that \(n\) is “large enough”. A very general rule of thumb is that \(n>30\) suffices for this result to hold).
We have seen before that the MLE for the probability of success \(\theta\) in the \(n\) independent Bernoulli case is the sample mean \(\displaystyle f_1(\boldsymbol Y)=\bar{Y}=\sum_{i=1}^n\frac{Y_i}{n}=\frac{R}{n}\), where \(Y_i \stackrel{iid}{\sim} \mbox{Bernoulli}(\theta)\). We can prove that the sampling distribution for \(f_1(\boldsymbol Y)\) is approximately a Normal with mean \(\theta\) (which makes \(f_1(\boldsymbol Y)\) an unbiased estimator) and variance \(\sigma^2/n\), where \(\sigma^2=\theta(1-\theta)\). From this we can also derive that the standardised version of \(f_1(\boldsymbol Y)\), obtained by considering \(f_1(\boldsymbol Y)\) minus its expected value and divided by its standard deviation, is \[ \frac{\bar{Y}-\theta}{\sqrt{\theta(1-\theta)/n}} \sim \mbox{Normal}(0,1). \tag{4.12}\] We have seen in Figure 4.9 that for a Normal(0,1), the point 1.96 is the one leaving 97.5% of the distribution to its left. Similarly, we can prove that the point -1.96 leaves 2.5% to its left and thus the interval \([-1.96; 1.96]\) includes 95% of the probability \[\begin{equation*} \Pr\left(-1.96\leq \frac{\bar{Y}-\theta}{\sqrt{\theta(1-\theta)/n}} \leq 1.96 \right) \approx 0.95 \end{equation*}\] (the approximation comes about because: i. the distribution of \(f_1(\boldsymbol Y)\) is only approximately Normal; and ii. we are using the rounded value 1.96 instead of the exact quantile of the Normal distribution). Re-arranging the terms inside the probability statement, we get \[ \Pr\left(\theta -1.96\sqrt{\frac{\theta(1-\theta)}{n}} \leq \bar{Y} \leq \theta + 1.96\sqrt{\frac{\theta(1-\theta)}{n}} \right) \approx 0.95. \tag{4.13}\]
At this point, we need a further layer of approximation: we do not know the true value of the parameter \(\theta\) — only its estimate \(\hat\theta\) (e.g. the MLE) and so we can compute the 95% confidence interval for the original statistic \(\bar{Y}\) as \[ \left[\hat\theta -1.96\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}}; \hat\theta + 1.96\sqrt{\frac{\hat\theta(1-\hat\theta)}{n}}\right]. \tag{4.14}\]
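As a quick numerical illustration, suppose (hypothetically) we had observed \(r=6\) successes out of \(n=20\) Bernoulli trials. Equation 4.14 can then be computed in R as follows; the data values here are purely illustrative.

# Hypothetical data: r successes out of n Bernoulli trials
r=6; n=20
theta.hat=r/n
# Estimated standard error of the sample mean
se=sqrt(theta.hat*(1-theta.hat)/n)
# 95% confidence interval, as per Equation 4.14
c(theta.hat-1.96*se,theta.hat+1.96*se)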
There is a subtle but crucial point in this argument. Equation 4.13 computes a probability. This probability, however, is computed with respect to the sampling distribution of the statistic — not the parameter \(\theta\). From the frequentist/likelihood point of view, this is perfectly fine — in fact, neither Neyman and Pearson nor Fisher would want to compute a probability distribution directly for the model parameters. To them, the parameters are just fixed quantities and so cannot be associated with distributions. \(\theta\) has a “true” value and so \(\Pr(\theta=\mbox{true value})=1\) and \(\Pr(\theta\neq \mbox{true value})=0\) — we just do not know what the true value is.
Again in line with the long-run philosophy, the interpretation of a confidence interval is that if we were able to replicate the experiment over and over again under the same conditions and each time we computed a confidence interval according to the procedure in Equation 4.14, then, in the long run, the resulting interval would cover the true value of the unknown but fixed parameter 95% of the time. Figure 4.19 expands on Figure 4.16, by including the 95% intervals computed using Equation 4.14, depicted as the blue horizontal lines.
For 19 out of the 20 potential replicates of the experiments presented in Figure 4.19, the procedure for computing the confidence interval succeeds in covering the true underlying value of \(\theta=0.3\), shown as the dashed vertical line.
However, there is one potential replicate (experiment number 18) in which the procedure fails to cover the true value of the parameter. This is not entirely surprising: the procedure for the confidence interval “gets it right” 19/20 or 95% of the time. And this is the meaning of the analysis based on the confidence interval: it has nothing to do with the uncertainty about the true value of the parameter — as mentioned above, there is no such thing in the frequentist paradigm. What we can evaluate is the long-run performance of the procedure.
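We can verify this long-run behaviour by simulation. The following sketch (with assumed values \(\theta=0.3\) and \(n=50\)) replicates the experiment many times and records how often the interval in Equation 4.14 covers the true parameter.

# Simulates many replicates of the experiment and checks the coverage of the
# confidence interval procedure (theta.true and n are assumed values)
set.seed(42)
theta.true=0.3; n=50; n.rep=10000
covered=replicate(n.rep,{
   y=rbinom(1,size=n,prob=theta.true)
   th=y/n
   se=sqrt(th*(1-th)/n)
   (theta.true>=th-1.96*se) & (theta.true<=th+1.96*se)
})
# Proportion of intervals covering the true value: close to (though not
# exactly) 0.95, because of the approximations involved
mean(covered)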
In more general terms, building on Equation 4.11, considering a statistic \(\hat\theta=f(\boldsymbol Y)\) used to estimate a parameter \(\theta\) and with sampling distribution described (at least approximately) by a \(\mbox{Normal}(\theta,\sigma^2/n)\), we can derive a form of the 95% confidence interval as \[ \left[\hat\theta -1.96 \frac{\sigma}{\sqrt{n}}; \hat\theta + 1.96 \frac{\sigma}{\sqrt{n}} \right]. \tag{4.15}\]
To be precise, the idea of confidence intervals as presented above is central to the purely frequentist approach — in fact the mathematical derivation is due to the work of Neyman, who wrote a landmark paper on this topic in 1937, drawing on the work he had done with Egon Pearson at UCL.
The mathematical formulation in purely likelihood terms derives the interval estimates in a fairly similar way, by considering the sampling distribution of the statistic \(f(Y)\) (possibly approximated by a Normal distribution) and then computing the interval as \[ \left[\hat\theta -1.96 \frac{1}{\sqrt{-\ell{''}(\hat\theta)}}; \hat\theta + 1.96 \frac{1}{\sqrt{-\ell{''}(\hat\theta)}}\right], \] where \(\ell{''}(\hat\theta)\) is the second derivative of the log-likelihood, evaluated at the MLE \(\hat\theta\). The quantity \(-\ell{''}(\theta)\) is often referred to as the observed Fisher’s Information.
For all intents and purposes, often the two procedures tend to return the same numerical value for the confidence interval.
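For instance, in the Bernoulli example, the log-likelihood is \(\ell(\theta)=r\log\theta + (n-r)\log(1-\theta)\), so that \(-\ell''(\hat\theta)=n/[\hat\theta(1-\hat\theta)]\) and the likelihood-based interval coincides with the one in Equation 4.14. A minimal sketch, re-using the hypothetical values \(r=6\) and \(n=20\) from above:

# Second derivative of the Bernoulli log-likelihood, evaluated at the MLE
r=6; n=20
theta.hat=r/n
ell2=-r/theta.hat^2-(n-r)/(1-theta.hat)^2
# Likelihood-based 95% interval, using the observed Fisher Information
c(theta.hat-1.96/sqrt(-ell2),theta.hat+1.96/sqrt(-ell2))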
Regression analysis is one of the most important tools available to a statistician. Gelman and Hill (2007) provide an excellent introduction to the main techniques associated with regression analysis. Chapter 5, Chapter 6, Chapter 7 and Chapter 13 show several examples of regression models applied to HTA problems.
The main idea of regression is to link a function of some observed outcome (often called the “response” variable), \(y\), to a set of predictors (often referred to as “covariates”), \(\boldsymbol{X}=(X_1,\ldots,X_K)\). Examples include modelling the relationship between some clinical outcome \(y\), e.g. a measurement for blood pressure, and a set of prognostic factors, e.g. sex, age, comorbidities, ethnic background, etc, which are used to predict the outcome.
Sometimes the terminology “dependent” and “independent” variables is used to describe the outcome and the predictors — this is probably not very helpful, though, because it somehow conflicts with the concept of statistical independence: if two variables \(X\) and \(Y\) are statistically independent of one another, then learning something about \(X\) does not tell us anything about \(Y\), which is in fact the opposite of the assumption underlying regression analysis! We therefore do not use this terminology in the rest of this chapter.
Assuming we have observed data for \(n\) individuals \[(\boldsymbol{y,X}) = (y_1,X_{11},\ldots,X_{1K}),\ldots,(y_n,X_{n1},\ldots,X_{nK}),\] in its simplest form, we can indicate a regression model as \[ f(y_i\mid \boldsymbol{X}_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_K X_{iK}. \] Even more specifically, the function \(f(\cdot)\) is almost invariably chosen as \(\mbox{E}[Y\mid \boldsymbol{X}]=\mu\) and so the linear regression model can be written as \[ \mbox{E}[Y_i\mid \boldsymbol{X}_i]=\mu_i=\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_K X_{iK} . \tag{4.16}\]
The expression in Equation 4.16 can be also defined more compactly and equivalently, using matrix algebra as \[\begin{eqnarray*} \mbox{E}[Y\mid \boldsymbol{X}]=\boldsymbol{X} \boldsymbol\beta, \end{eqnarray*}\] where \(\boldsymbol\beta=(\beta_0,\ldots,\beta_K)\) is the vector of model coefficients, each describing the impact (or effect) of the predictors on the outcome — in this case we assume that the matrix of covariates is built as \[\begin{eqnarray*} \left(\begin{array}{ccccc} X_{10} & X_{11} & X_{12} & \cdots & X_{1K} \\ X_{20} & X_{21} & X_{22} & \cdots & X_{2K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ X_{n0} & X_{n1} & X_{n2} & \cdots & X_{nK} \end{array}\right) \end{eqnarray*}\] and the first column of the matrix \(\boldsymbol{X}\) is made by a vector of ones, i.e. \[ \left(\begin{array}{c} X_{10} \\ X_{20} \\ \vdots \\ X_{n0}\end{array}\right) = \left(\begin{array}{c} 1 \\ 1 \\ \vdots \\ 1\end{array}\right). \] This is needed to ensure that when performing the matrix multiplication \[\begin{eqnarray*} \boldsymbol{X}\boldsymbol\beta = \left(\begin{array}{c} X_{10}\beta_0 + X_{11}\beta_1 + \ldots + X_{1K}\beta_K \\ X_{20}\beta_0 + X_{21}\beta_1 + \ldots + X_{2K}\beta_K \\ \vdots \\ X_{n0}\beta_0 + X_{n1}\beta_1 + \ldots + X_{nK}\beta_K \end{array}\right) \end{eqnarray*}\] the first term in each row (i.e. for each of the \(n\) individuals) returns \(\beta_0\), as in Equation 4.16.
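Incidentally, in R the predictors matrix (including the leading column of ones) can be constructed automatically from a formula using the built-in function model.matrix; a small sketch with hypothetical data:

# Hypothetical data frame with two covariates
dat=data.frame(x1=c(1.2,0.7,3.1),x2=c(0,1,1))
# Builds the predictors matrix, with the first column of ones for the intercept
model.matrix(~x1+x2,data=dat)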
The term “regression” was introduced by Francis Galton4. Among many other topics, Galton worked to study hereditary traits. In particular, he collected data on \(n=898\) children from 197 families.
The data comprise the height of each child \(y_i\), as well as the height of the father \(X_{1i}\) and the mother \(X_{2i}\), for \(i=1,\ldots,n\), all measured in inches. An excerpt of the data is presented in Table 4.2.
| Family | Father | Mother | Gender | Height | Kids |
|---|---|---|---|---|---|
| 1 | 78 | 67 | M | 73 | 4 |
| 1 | 78 | 67 | F | 69 | 4 |
| 1 | 78 | 67 | F | 69 | 4 |
| 1 | 78 | 67 | F | 69 | 4 |
| 2 | 76 | 66 | M | 74 | 4 |
| 2 | 76 | 66 | M | 72 | 4 |
In its basic format, Galton’s analysis aimed at measuring the level of correlation between the children and (say, for simplicity) the father’s heights.
There are several nuances in Galton’s data structure; for example, many families are observed to have had several children, which intuitively implies some form of correlation within groups of data — in other words, we may believe that, over and above their parents’ height, siblings may show some level of correlation in their observed characteristics. Or, to put it another way, that children are “clustered” or “nested” within families. There are suitable models to account for this feature in the data (see for instance Gelman and Hill, 2007).
The data can be visualised by drawing a scatterplot, where the predictor (father’s height) is along the \(x-\)axis and the outcome (child’s height) is along the \(y-\)axis. This is presented in Figure 4.20, where each family is labelled by a different number.
Galton’s objective was to find out whether there was some consistent relationship between the outcome and the predictor and, if so, to quantify the strength of this relationship. In developing his original work, he used a technique that was common at the time, called “least squares fitting”, a mathematical tool used to summarise a set of numbers \(y_1,\ldots,y_n\), by using a single value \(a\) chosen to be as close as possible to all the observed data points. Because we are considering a single covariate, the model in Equation 4.16 can be written as \(\mu_i=\beta_0 + \beta_1 X_{i1}\) and Figure 4.21 (a) shows the regression line obtained by plugging in the estimated values \(\hat\beta_0\) and \(\hat\beta_1\), superimposed over the observed data.
Figure 4.21 (a) shows that the coefficient \(\hat\beta_0\) identifies the value of the line in correspondence with a covariate set to the value of 0. In other words, if we considered a father whose height is 0 inches, we would be predicting that his child’s height would be, on average, \(\hat\beta_0=39.11\) inches — i.e. the point along the \(y-\)axis in which the regression line crosses it. This is referred to as the intercept.
As for the second regression coefficient \(\hat\beta_1\), the graph in Figure 4.21 (a) shows that this can be interpreted as the slope of the regression line, that is the inclination of the line, as described by the angle shown in the left-hand side of the graph (the arc below the line). The interpretation of this feature is that \(\hat\beta_1\) is responsible for how much the line “tilts” — if the estimated value is high, then the line becomes steeper, while if it is low, the line becomes shallower. When \(\hat\beta_1=0\), the regression line is parallel to the \(x-\)axis, indicating that, irrespective of the value of the covariate \(X\), the expected value of the outcome \(y\) remains unchanged. This essentially indicates that if \(\hat\beta_1=0\) then \(X\) has no effect on \(y\).
If we compare two individuals whose \(X\) value differs by 1 unit (e.g. two fathers whose heights differ by 1 inch), the slope indicates the increase in the expected outcome (e.g. the expected height of their respective child). This can be easily seen by considering the linear predictor for two different fathers, say \(F_1\) and \(F_2\), for whom the recorded heights are, say, 71 and 72 inches, respectively. The expected heights for the children of \(F_1\) and \(F_2\) are then \[\begin{eqnarray*} \mbox{E}[Y\mid X=71] = \beta_0 + \beta_1\times 71 \qquad \mbox{ and } \qquad \mbox{E}[Y\mid X=72] = \beta_0 + \beta_1\times 72. \end{eqnarray*}\] Thus, the difference between these two expected outcomes (i.e. the effect of a unit change in the father’s height) is \[\begin{eqnarray*} \Delta_X & = & \mbox{E}[Y\mid X=72] - \mbox{E}[Y\mid X=71] \\ & = & \left(\hat\beta_0 + \hat\beta_1\times 72\right) - \left(\hat\beta_0 + \hat\beta_1\times 71\right) \\ & = & \hat\beta_1 (72-71) \\ & = & \hat\beta_1 = 0.399. \end{eqnarray*}\]
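These numerical values can be reproduced with R’s built-in least squares routine; a sketch, assuming the galton data frame used throughout this section has been loaded.

# Least squares fit of the single-covariate model
m0=lm(Height~Father,data=galton)
# Intercept (approximately 39.11) and slope (approximately 0.399),
# as reported in the text
coef(m0)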
Figure 4.21 (b) shows the original data on children’s height on the \(y-\)axis, plotted against the centred version of their fathers’ data on the \(x-\)axis, with the new regression line superimposed. As indicated in the graph, the slope \(\hat\beta_1\) is unchanged, while the intercept does change. That is because the scale of the \(X\) covariate is modified (as can be seen, it now ranges from -7.23 to 9.27). The “effect” of the covariate on the outcome is not affected by this change of scale; however, we are now in a position to provide a better interpretation of the intercept: this is the value of the expected outcome in correspondence of a centred covariate equal to 0. It is easy to see that if \(X^*_i=(X_i-\bar{X})=0\), then the original value is \(X_i=\bar{X}\). So the intercept is the expected outcome for a father with the average height in the observed population — and this is something that may exist and certainly makes sense!
Centring covariates is particularly important within a Bayesian setting, because this has the added benefit of improving convergence of simulation algorithms (e.g. Markov Chain Monte Carlo), which underpin the actual performance of Bayesian modelling (see Baio, 2012).
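In R, centring simply amounts to subtracting the sample mean from the covariate; a sketch, again assuming the galton data frame (the name Father.c is, of course, arbitrary).

# Centres the fathers' heights and re-fits the model
galton$Father.c=galton$Father-mean(galton$Father)
m0c=lm(Height~Father.c,data=galton)
# The slope is unchanged; the intercept now refers to a father of average height
coef(m0c)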
The terminology “regression” comes from Galton’s original conclusion from his analysis — when plotting the original data with both the least squares regression line (in blue in Figure 4.22) and the line of “equality” (the black line) superimposed. The latter is constructed by using a slope equal to 1 and an intercept equal to 0, essentially implying that on average we expect children and their fathers to have the same height.
What Galton noted is that shorter fathers tended to be associated with slightly taller children. This is noticeable because at the left-hand end of the \(x-\)axis, the blue curve (the estimated least squares regression) is higher than the line of equality. Conversely, if a father was taller, then on average his child(ren) tended to be shorter than him (because at the other extreme the black line is increasingly higher than the blue line). With his eugenicist hat on, he found this rather disappointing, because it meant that the species could not be improved (e.g. by selecting only taller parents to breed). For this reason, he gave this phenomenon the rather demeaning name “regression to mediocrity” or “to the mean”.
Galton’s original analysis is not really framed as a full statistical model, because it is based on the idea of mathematical optimisation provided by least squares. In fact, he did not explicitly specify any distributional assumption about the observed data. From the statistical perspective, this is a limiting factor because, for example, it is impossible to go beyond the point estimate of the regression coefficients, unless we are willing to provide some more expanded model, based on probability distributions.
In practice, this extension is fairly easy, as we will show in the following. In its simplest form, we can assume that the sampling variability underlying the observed data can be described by a Normal distribution. This amounts to assuming that \[ y_i \sim \mbox{Normal}(\mu_i,\sigma^2) \tag{4.17}\] \[ \mu_i = \beta_0 + \beta_1 X_{1i} + \ldots + \beta_K X_{Ki}. \tag{4.18}\] for \(i=1,\ldots,n\). This notation highlights the probabilistic nature of the assumed regression relationship between \(y\) and \(\boldsymbol{X}\): we are assuming that the linear predictor of Equation 4.18 is on average the best estimate of the outcome, given a specific “profile”, i.e. a set of values for the observed covariates. But we do not impose determinism on this relationship: there is a variance, which is determined by the sampling variability in the observed data and expressed by the population parameter \(\sigma^2\).
An alternative way of writing the regression model (which is perhaps more common in Econometrics than it is in Statistics) is to consider \[ y_i = \beta_0 + \beta_1 X_{1i} + \ldots + \beta_K X_{Ki} + \varepsilon_i, \quad \mbox{with} \quad \varepsilon_i \sim \mbox{Normal}(0,\sigma^2). \tag{4.19}\] Equation 4.19 explicitly describes the assumption mentioned above. The quantity \(\varepsilon_i\) represents some kind of “white noise”, or, in other words, a random error, which is centred on 0 and whose variance basically depends on the population variability.
The model parameters for the specification in Equation 4.17 and Equation 4.18 are the coefficients \(\boldsymbol\beta\) and the variance \(\sigma^2\). Thus, in order to perform the Bayesian analysis, we need to specify a suitable prior distribution for \(\boldsymbol\theta=(\boldsymbol\beta,\sigma^2)\). Ideally, we would specify a joint, multivariate prior \(p(\boldsymbol\theta)\), which would be used to encode any knowledge on the uncertainty about each parameter, as well as the correlation among them.
In practice, we often assume some form of (conditional) independence, where the \((K+2)\) dimensional distribution \(p(\boldsymbol\theta)\) is factorised into a product of simpler (lower-dimensional) distributions, exploiting some (alleged!) independence conditions among the parameters. For instance, we may model the regression coefficients independently of one another and of the population variance \[\begin{eqnarray*} p(\boldsymbol\theta)=p\left(\boldsymbol{\beta},{\sigma}^2\right) = p\left({\sigma}^2\right) \prod_{k=0}^K p(\beta_k). \end{eqnarray*}\]
This is of course just a convenient specification and care should be taken in encoding the actual prior knowledge into the definition of the prior distribution. Notice however that even if we are assuming some form of independence in the prior distribution, it is possible that in the posterior (i.e. after observing the evidence provided by the data), we have some level of correlation among (some of) the parameters.
There are of course many possible models we can define for the prior, but a convenient (if often relatively unrealistic) choice is to assume vague Normal priors on the coefficients: \((\beta_0,\beta_1,\ldots,\beta_K)\stackrel{iid}{\sim}\mbox{Normal}(0,v)\) with a fixed and large variance \(v\); and a Gamma distribution for the precision \(\tau=1/\sigma^2 \sim \mbox{Gamma}(a,b)\), for some fixed, small values \((a,b)\) — see for example Gelman et al. (2013).
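To make this concrete, the following sketch draws once from priors of this kind, under the assumed (and entirely arbitrary) fixed values \(v=100\) and \(a=b=0.01\).

# One draw from vague priors for a model with K=2 covariates
K=2
beta.prior=rnorm(K+1,mean=0,sd=sqrt(100))   # coefficients (incl. intercept)
tau=rgamma(1,shape=0.01,rate=0.01)          # precision tau=1/sigma^2
sigma=1/sqrt(tau)                           # implied population standard deviation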
As mentioned above, it is convenient, particularly within the Bayesian framework, to consider a centred version of the covariates, where \(X^*_{0i}=X_{0i}\) and \(X^*_{ki}=(X_{ki}-\bar{X}_k)\), for \(k=1,\ldots,K\) (notice that we should not rescale the column of the predictors matrix corresponding to the intercept — in fact we want to keep it as a vector of ones, to ensure that the matrix multiplication returns the correct values). For example, using again Galton’s data, if we wanted to include in the model also the mothers’ heights (indicated as \(X_{2i}\)) and use a centred version of the covariates, then the linear predictor would be \[\begin{eqnarray*} \mu_i & = & \beta_0 X^*_{0i} + \beta_1 X^*_{1i} + \beta_2 X^*_{2i}\\ & = & \boldsymbol{X}_i^* \boldsymbol\beta \end{eqnarray*}\] where \(\boldsymbol\beta=(\beta_0,\beta_1,\beta_2)\) and the predictors matrix \(\boldsymbol{X}^*\) is
\[\begin{equation*} \boldsymbol{X}^* = \left(\begin{array}{ccc} X_{01}^* & X_{11}^* & X_{21}^* \\ X_{02}^* & X_{12}^* & X_{22}^* \\ X_{03}^* & X_{13}^* & X_{23}^* \\ X_{04}^* & X_{14}^* & X_{24}^* \\ \vdots & \vdots & \vdots \\ X_{0n}^* & X_{1n}^* & X_{2n}^* \end{array}\right) = \left(\begin{array}{ccc} 1 & 9.27 & 2.92 \\ 1 & 9.27 & 2.92 \\ 1 & 9.27 & 2.92 \\ 1 & 9.27 & 2.92 \\ \vdots & \vdots & \vdots \\ 1 & -0.733 & 0.92 \end{array}\right). \end{equation*}\]
We may specify a prior for the father’s and mother’s effects that is fairly sceptical, by setting a Normal distribution centred around 0 (indicating that, on average, we are not expecting an impact of these covariates on the predicted value of their child’s height), with a relatively large variance. For instance, we could set \(\beta_1,\beta_2\stackrel{iid}{\sim}\mbox{Normal}(0,\mbox{sd}=10)\). Note that in this case, looking at the context, we may argue that a standard deviation of 10 is already “large enough” to avoid including too much information in the prior. The black curve in Figure 4.23 shows a graphical representation of this prior.
As for the intercept \(\beta_0\), given the centring in our model, this represents the expected height of a child whose mother’s and father’s heights are at the average in the population. We can use some general knowledge about people’s heights in Victorian times and, assuming no particular association between the outcome and the covariates, we may set a Normal prior with mean equal to 65 inches (approximately 165 cm) and standard deviation equal to 20. Again, we are not imposing a particular value for the intercept in the prior, but, while using a reasonable choice, we still maintain some substantial uncertainty before seeing the data. Notice that this prior (correctly!) assigns essentially zero probability to negative heights — in fact, heights below 20 inches are very unlikely under the assumed model.
Unfortunately, the posterior distribution for this model cannot be computed in closed form and so, in order to estimate the posterior distributions for the parameters, we need to resort to a simulation approach (e.g. Markov Chain Monte Carlo, MCMC) — the details are not important here, but more details and applications are presented in Section 5.4 and Section 6.7, as well as in Spiegelhalter and Best (2003), Baio (2012) and Baio et al. (2017), for specific reference to applications in HTA. The output of the model is presented in Table 4.3.
|  | Mean | SD | 2.5\% | 97.5\% |
|---|---|---|---|---|
| $\beta_0$ (intercept) | 66.7622 | 0.11555 | 66.538 | 66.9972 |
| $\beta_1$ (slope for father's height) | 0.3795 | 0.04487 | 0.289 | 0.4691 |
| $\beta_2$ (slope for mother's height) | 0.284 | 0.04954 | 0.1891 | 0.3804 |
| $\sigma$ (population standard deviation) | 3.3899 | 0.07953 | 3.2405 | 3.5462 |
The computer output shows some summary statistics for the estimated posterior distributions of the model parameters. The posterior mean for the intercept is very close to our prior guess, but the uncertainty in the overall distribution is massively reduced by the observed data — the 95% interval estimate is indeed very narrow, ranging from 66.538 to 66.997.
As for the slopes \(\beta_1\) and \(\beta_2\), both have a positive mean and the entire 95% interval estimate is also above 0. This indicates that there seems to be a truly positive relationship between the heights of the parents and those of their offspring. Nevertheless, the magnitude of the effect is not very large, which explains the phenomenon so disconcerting for Galton.
Using the full posterior distributions, we can determine the probability that either \(\beta_1\) or \(\beta_2\) is negative (which would indicate the opposite relationship). These can be obtained numerically, using the output of the MCMC analysis. As is obvious from Figure 4.24, there is essentially no probability of either of the two slopes being negative.
Finally, notice that the coefficient for the father’s height has now changed from the least squares analysis given above. This is possibly due to the influence of the prior distribution in the Bayesian analysis. Nevertheless, because we are now including an additional covariate, it is extremely likely that the effect of the father’s height is indeed modified in comparison to the simpler analysis, simply because of the joint variation brought about by the formal consideration of the mother’s height.
As shown in Section 4.4.2, the likelihood approach proceeds by computing the maximum likelihood estimator for the model parameters. In this case, using relatively simple algebra, we can prove that the MLEs for the coefficients \(\boldsymbol\beta=(\beta_0,\beta_1,\ldots,\beta_K)\) are equivalent to the least squares solutions (as seen in Section 4.4.3, the MLE usually has all the good frequentist properties and thus is often selected as the best estimator in that framework too).
Expanding on the result shown above, in the most general case for a linear regression, where we consider \(K\) predictors, the MLEs are \[ \hat\beta_0 = \bar{y}-\left(\hat\beta_1\bar{X}_1 + \ldots + \hat\beta_K \bar{X}_K \right) = \bar{y}-\sum_{k=1}^K \hat\beta_k \bar{X}_k \tag{4.20}\] \[ \hat\beta_1 = \frac{\mbox{Cov}(y,X_1)}{\mbox{Var}[X_1]} \tag{4.21}\] \[ \vdots \] \[ \hat\beta_K = \frac{\mbox{Cov}(y,X_K)}{\mbox{Var}[X_K]}, \tag{4.22}\] where \(\mbox{Cov}(y,X_k)\) indicates the covariance between the two variables \(y\) and \(X_k\) (strictly speaking, these simple ratio formulae hold when the covariates are uncorrelated with one another; the general expression is the matrix formula in Equation 4.24 below).
In addition, the estimate of the population (sometimes referred to as “residual”) variance \(\sigma^2\) is given by \[ \hat\sigma^2 = \frac{\mbox{RSS}}{(n-K-1)}, \tag{4.23}\] where: \[\begin{eqnarray*} \mbox{RSS} & = &\sum_{i=1}^n \left(y_i - \hat{y}_i\right)^2 \\ & = & \sum_{i=1}^n \left(y_i - \left[\hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \ldots + \hat\beta_K X_{Ki}\right]\right)^2 \end{eqnarray*}\] is the residual sum of squares, i.e. a measure of the “error” we make by estimating the outcome using the fitted values from the regression line (\(\hat{y}_i\)) instead of the actual observed points (\(y_i\)); and the denominator of Equation 4.23 is the degrees of freedom, which in the general case are equal to the number of data points (\(n\)) minus the number of regression coefficients (\(K+1\), in this case). Strictly speaking, the MLE for \(\sigma^2\) divides the RSS by \(n\); the version in Equation 4.23 corrects for the degrees of freedom and is unbiased.
In practical terms, matrix algebra can be used to programme these equations more efficiently and compactly. For example, including again in the model both the fathers’ and the mothers’ heights on the original scale (i.e. without centring the covariates), the linear predictor would be \[\begin{eqnarray*} \mu_i & = & \beta_0 X_{0i} + \beta_1 X_{1i} + \beta_2 X_{2i}\\ & = & \boldsymbol{X}_i \boldsymbol\beta \end{eqnarray*}\] where \(\boldsymbol\beta=(\beta_0,\beta_1,\beta_2)\) and the predictors matrix \(\boldsymbol{X}\) is
\[\begin{equation*} \boldsymbol{X} = \left(\begin{array}{ccc} X_{01} & X_{11} & X_{21} \\ X_{02} & X_{12} & X_{22} \\ X_{03} & X_{13} & X_{23} \\ X_{04} & X_{14} & X_{24} \\ \vdots & \vdots & \vdots \\ X_{0n} & X_{1n} & X_{2n} \end{array}\right) = \left(\begin{array}{ccc} 1 & 78.5 & 67.0 \\ 1 & 78.5 & 67.0 \\ 1 & 78.5 & 67.0 \\ 1 & 78.5 & 67.0 \\ \vdots & \vdots & \vdots \\ 1 & 68.5 & 65.0 \end{array}\right). \end{equation*}\]
Equations 4.20 — 4.22 can be written compactly in matrix algebra using the notation \[ \hat{\boldsymbol\beta} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}. \tag{4.24}\]
The operator \(^\top\) indicates the transpose of a matrix. So if you have a matrix \[\begin{eqnarray*} \boldsymbol{X} = \left(\begin{array}{cc} a & b \\ c & d \end{array} \right), \end{eqnarray*}\] then \[\begin{eqnarray*} \boldsymbol{X}^\top = \left(\begin{array}{cc} a & c \\ b & d \end{array}\right), \end{eqnarray*}\] i.e. the transpose matrix is constructed by flipping around the rows and the columns (the first column becomes the first row, the second column becomes the second row, etc.).
Multiplying the transpose of a matrix by the original matrix is equivalent to summing the cross-products of the values in the matrix \[\begin{eqnarray*} \boldsymbol{X}^\top \boldsymbol{X} & = & \left(\begin{array}{cc} a & c \\ b & d \end{array}\right) \times \left(\begin{array}{cc} a & b\\ c & d \end{array}\right) \\ & = & \left(\begin{array}{cc} \left(a\times a + c\times c\right) &\hspace{5pt} \left(a\times b + c \times d\right) \\ \left(b \times a + d \times c\right) & \hspace{5pt} \left(b \times b + d \times d\right) \end{array}\right) \\ & = & \left(\begin{array}{cc} a^2+c^2 & \hspace{5pt} ab+cd \\ ba+dc & \hspace{5pt} b^2+d^2 \end{array}\right) \end{eqnarray*}\] So, if \(\boldsymbol{X}\) is the matrix of predictors \[\begin{eqnarray*} \boldsymbol{X}^\top\boldsymbol{X} & = & \left(\begin{array}{cccc} X_{01} & X_{02} & \ldots & X_{0n} \\ X_{11} & X_{12} & \ldots & X_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{K1} & X_{K2} & \ldots & X_{Kn} \end{array}\right) \times \left(\begin{array}{cccc} X_{01} & X_{11} & \ldots & X_{K1} \\ X_{02} & X_{12} & \ldots & X_{K2} \\ \vdots & \vdots & \ddots & \vdots \\ X_{0n} & X_{1n} & \ldots & X_{Kn} \end{array}\right) \\ & = & \left(\begin{array}{cccc} \displaystyle\sum_{i=1}^n X_{0i}^2 & \displaystyle\sum_{i=1}^n X_{0i}X_{1i} & \ldots & \displaystyle\sum_{i=1}^n X_{0i}X_{Ki} \\ \displaystyle\sum_{i=1}^n X_{1i}X_{0i} & \displaystyle\sum_{i=1}^n X_{1i}^2 & \ldots & \displaystyle\sum_{i=1}^n X_{1i}X_{Ki} \\ \vdots & \vdots & \ddots & \vdots \\ \displaystyle\sum_{i=1}^n X_{Ki}X_{0i} & \displaystyle\sum_{i=1}^n X_{Ki}X_{1i} & \ldots & \displaystyle\sum_{i=1}^n X_{Ki}^2\end{array}\right), \end{eqnarray*}\] which is equivalent to the computation made for the covariance. Note that, irrespective of the original dimension of a matrix, multiplying a transpose by the original matrix always produces a square matrix (i.e. one with the same number of rows and columns).
The matrix operator \(^{-1}\) is the generalisation of the division operation for numbers. So, much like for a number \(x\) the product \(x x^{-1}=\frac{x}{x}=1\), for matrices the inverse is such that for a square matrix \(\boldsymbol{X}\), pre-multiplying by the inverse matrix produces the identity matrix (i.e. one with ones on the diagonal and zeros everywhere else) \[\begin{eqnarray*} \boldsymbol{X}^{-1} \boldsymbol{X} = \boldsymbol{1} = \left(\begin{array}{cccc} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1\end{array}\right). \end{eqnarray*}\]
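All these operations are available in base R; a tiny sketch with a hypothetical \(2\times 2\) matrix:

# Hypothetical 2x2 matrix (filled by column)
A=matrix(c(1,2,3,4),nrow=2)
t(A)           # transpose
t(A)%*%A       # cross-product: always a square matrix
solve(A)%*%A   # inverse times original: the 2x2 identity (up to rounding)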
Intuitively, Equation 4.24 is made by multiplying the inverse of the sum of squares of the matrix \(\boldsymbol{X}\) (which is proportional to the variance) by the sum of cross-products between \(\boldsymbol{X}\) and \(\boldsymbol{y}\) (which is proportional to the covariance). This is basically the same as constructing the ratio between the covariance of \(\boldsymbol{X}\) and \(\boldsymbol{y}\) and the variance of \(\boldsymbol{X}\).
This matrix algebra can be programmed in R using the following commands.

# Constructs the matrix of predictors, including the first column of ones,
# the second column with the fathers' heights and the third column with
# the mothers' heights, from the original dataset
X=cbind(rep(1,nrow(galton)),galton$Father,galton$Mother)

# Constructs the vector of outcomes with the children's heights
y=galton$Height
# Now computes the MLE for all the regression coefficients
beta=solve(t(X)%*%X) %*% t(X)%*%y
(in R, the built-in command solve(X) is used to compute the inverse of a square matrix X, while the function t(X) is used to transpose its matrix argument; see Section 2.4.3.4 for more details on the R implementation of these operations).
The code above returns the following values for the coefficients.
      MLE estimate
beta0   22.3097055
beta1    0.3798970
beta2    0.2832145
We can also prove that these are unbiased for the underlying regression coefficients \(\beta_0,\ldots,\beta_K\), i.e. \[\begin{eqnarray*} \mbox{E}[\hat\beta_0] = \beta_0, \mbox{E}[\hat\beta_1] = \beta_1, \ldots, \mbox{E}[\hat\beta_K] = \beta_K \end{eqnarray*}\] and, using the theory shown in Section 4.4, that the sampling distribution of each estimator \(\hat\beta_k\) (for \(k=0,\ldots,K\)) is a Normal distribution whose mean is of course the underlying “true” coefficient \(\beta_k\) and whose variance is given by the \(k\)-th diagonal element of the matrix \[ \mbox{Var}[\hat{\boldsymbol\beta}] = \hat\sigma^2 \left(\boldsymbol{X}^\top \boldsymbol{X}\right)^{-1} \] — the intuition behind this formula is that we rescale the estimate of the variance of the error \(\varepsilon_i\) by the variability in the covariates, to obtain the variance of the estimated effects (coefficients).
Using matrix notation to compute \(\mbox{RSS} = (\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol\beta})^\top (\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol\beta})\), in R we can compute these variances using the following commands

# Computes the Residual Sum of Squares
RSS=t(y-X%*%beta)%*%(y-X%*%beta)
# Computes the estimate of the variance sigma^2
# NB: "nrow(X)"=number of rows in the matrix X, while
#     "ncol(X)"=number of columns in the matrix X
sigma2.hat=as.numeric(RSS/(nrow(X)-ncol(X)))

# Now computes the variance of the coefficients using the formula above
sigma2.beta=sigma2.hat*solve(t(X)%*%X)
# Now takes the square root of the elements on the diagonal (i.e. the variances
# of the three coefficients) to obtain the standard deviations
sigma.beta=sqrt(diag(sigma2.beta))
to produce the estimates for the three coefficients in \(\boldsymbol\beta\) as below.
      MLE estimate         sd
beta0   22.3097055 4.30689678
beta1    0.3798970 0.04589120
beta2    0.2832145 0.04913817
At this point, using Equation 4.15, we can substitute the MLE \(\hat\beta_k\) for \(\hat\theta\) and the estimate of the standard deviation of \(\hat\beta_k\) for \(\sigma/\sqrt{n}\) and compute a 95% interval for \(\hat\beta_k\) as \[ \left[\hat\beta_k -1.96 \sqrt{\mbox{Var}[\hat\beta_k]}; \hat\beta_k + 1.96 \sqrt{\mbox{Var}[\hat\beta_k]} \right], \] which in the current case gives \[\begin{eqnarray*} \mbox{95\% interval for } \beta_0 & = & \left[22.31 -1.96\times 4.31; 22.31 + 1.96\times 4.31 \right] = \left[13.87; 30.75 \right] \\ \mbox{95\% interval for } \beta_1 & = & \left[0.38 -1.96\times 0.0459; 0.38 + 1.96\times 0.0459 \right] = \left[0.29; 0.47 \right] \\ \mbox{95\% interval for } \beta_2 & = & \left[0.283 -1.96\times 0.0491; 0.283 + 1.96\times 0.0491 \right] = \left[0.187; 0.38 \right] . \end{eqnarray*}\]
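Equivalently, using the objects beta and sigma.beta computed above, all three intervals can be obtained in one line.

# 95% intervals (lower and upper limits) for all the coefficients
cbind(beta-1.96*sigma.beta,beta+1.96*sigma.beta)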
Of course, in practice, you will never need to make matrix calculations by hand — or even programme them directly. Most likely you will use routines and programmes available in statistical software to do the regression analysis. For example, in R we can compute the regression coefficients using the built-in function lm, as in the following code.
# Runs the function "lm" to fit the model including "Height" as the response
# (that is the variable to the left of the twiddle symbol "~"), while "Father"
# and "Mother" (the variables to the right of the twiddle) are the covariates.
# These are recorded in the dataset called "galton"
m1=lm(formula=Height~Father+Mother,data=galton)
# Now displays the output
summary(m1)
Call:
lm(formula = Height ~ Father + Mother, data = galton)
Residuals:
Min 1Q Median 3Q Max
-9.136 -2.700 -0.181 2.768 11.689
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.30971 4.30690 5.180 2.74e-07 ***
Father 0.37990 0.04589 8.278 4.52e-16 ***
Mother 0.28321 0.04914 5.764 1.13e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.386 on 895 degrees of freedom
Multiple R-squared: 0.1089, Adjusted R-squared: 0.1069
F-statistic: 54.69 on 2 and 895 DF, p-value: < 2.2e-16
The standard output reported by R includes the estimates for the coefficients (the column labelled Estimate), their standard errors based on the suitable sampling distributions (the column labelled Std. Error), the value of the test statistic \(T\) (the column labelled t value) and the “p-value” (the column labelled Pr(>|t|)). These last two quantities are used for statistical testing, which however is not a useful tool in HTA, a point famously made by Claxton (1999).
The analysis based on the MLE shows some slight discrepancies with the Bayesian analysis shown above. By and large, the results are extremely consistent — notice that, because of the different scaling of the covariates, the intercepts \(\beta_0\) are not directly comparable. As for the slopes (which, on the contrary, are directly comparable), the Bayesian analysis estimates the effect of the fathers’ height to be 0.3795, while the frequentist model indicates a value of 0.3799. For the mothers’ height effect, the Bayesian model estimates 0.2840, while the MLE-based analysis indicates 0.2832. More importantly, all the coefficients are estimated with large precision, so that the entire intervals are above 0, in both approaches.
In this case, because we have a relatively large dataset with evidence that consistently points in the same direction, and we have used relatively vague priors, the numerical outputs of the two approaches are highly comparable. But this need not be the case in general, particularly when the sample size is small and the data provide only limited evidence. From the Bayesian point of view, this is perfectly reasonable: if we do not have overwhelming evidence, it is sensible that including prior knowledge does exert some influence on the final decisions.
This feature is also particularly important when we need to complement limited evidence from observed data (e.g. in terms of short follow-up or substantial non-compliance).
One of the main assumptions underlying the linear regression analysis seen above when viewed as a statistical model is that the underlying outcome is suitably modelled using a Normal distribution — this implies that we can assume that \(Y\) is reasonably symmetric, continuous and unbounded, i.e. it can, at least theoretically, take on values in the range \((-\infty; \infty)\).
However, as we have seen in Section 4.3, there are many cases in which variables are not suitably described by a Normal distribution — notably Bernoulli/Binomial or Poisson counts, but also other variables describing skewed phenomena (e.g. costs or times to event). In these instances, we can slightly extend the set up of Equation 4.17 and Equation 4.18 to account for this extra complexity. This can be accomplished using the following structure. \[ \left\{ \begin{array}{ll} y_i \stackrel{iid}{\sim} p(y_i \mid \boldsymbol\theta_i, \boldsymbol{X}_i) & \mbox{is the model to describe sampling variability for the outcome} \\ \boldsymbol\theta_i = (\mu_i,\boldsymbol\alpha) & \mbox{is the vector of model parameters} \\ \mu_i = \mbox{E}[Y_i \mid \boldsymbol{X}_i] & \mbox{is the mean of the outcome given the covariates $\boldsymbol{X}_i$} \\ \boldsymbol\alpha & \mbox{is a vector of other potential model parameters, e.g. variances, etc.} \\ & \mbox{(NB this may or may not exist for a specific model)} \\ g(\mu_i) = \boldsymbol{X}_i \boldsymbol\beta & \mbox{is the linear predictor on the scale defined by $g(\cdot)$}. \end{array} \right. \tag{4.25}\]
A function such that \(g(x)=x\) is referred to as the identity function. We can see that linear regression, presented in Section 4.6.2, is in fact a special case of the wider class of structures described in Equation 4.25, which is often referred to as Generalised Linear Models (GLMs). A GLM in which \(p(y_i\mid \boldsymbol\theta,\boldsymbol{X})=\mbox{Normal}(\mu_i,\sigma^2)\) and \(g(\mu_i)=\mu_i=\mbox{E}[Y\mid \boldsymbol{X}_i]=\boldsymbol{X}_i\boldsymbol\beta\) is the linear regression model of Equation 4.17 and Equation 4.18.
Upon varying the choice of the distribution \(p(\cdot)\) and the transformation function \(g(\cdot)\), we can model several outcomes.
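In R, these choices map directly onto the family (the distribution \(p(\cdot)\)) and link (the function \(g(\cdot)\)) arguments of the built-in glm function; a sketch with hypothetical variable names:

# Hypothetical outcome y and covariates x1, x2 in a data frame "dat"
# Linear regression: Normal sampling model, identity link
#   m=glm(y~x1+x2,family=gaussian(link="identity"),data=dat)
# Logistic regression: Binomial sampling model, logit link
#   m=glm(y~x1+x2,family=binomial(link="logit"),data=dat)
# Poisson regression: Poisson sampling model, log link
#   m=glm(y~x1+x2,family=poisson(link="log"),data=dat)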
When the outcome is a binary or Binomial variable, we know that the mean \(\theta\) is the probability that a random individual in the relevant population experiences the event of interest. Thus, by necessity, \(\theta_i=\mbox{E}[Y_i\mid\boldsymbol{X}_i]\) is bounded by 0 from below and by 1 from above (i.e. it cannot be below 0 or above 1). For this reason, we should not use a linear regression to model this type of outcome (although sometimes this is done, particularly in Econometrics, in the context of “two-stage least squares analysis”, which you may encounter during your studies).
One convenient way to model binary outcomes in a regression context is to use the general structure of Equation 4.25, where the outcome is modelled using \(y_i \stackrel{iid}{\sim} \mbox{Bernoulli}(\theta_i)\) and \[ g(\theta_i) = g\left(\mbox{E}[Y_i\mid\boldsymbol{X}_i]\right) = \mbox{logit}(\theta_i)=\log\left(\frac{\theta_i}{1-\theta_i}\right) = \boldsymbol{X}_i \boldsymbol\beta \tag{4.26}\] (notice that in the Bernoulli model there is only one parameter and so we indicate here \(\boldsymbol\theta=\theta\)).
Equation 4.26 is referred to as logistic regression. Figure 4.25 shows graphically the mapping from the original range of the parameter \(\theta\), on the \(x-\)axis, to the rescaled parameter \(g(\theta)=\mbox{logit}(\theta)\), on the \(y-\)axis. As is possible to see, the original range is mapped onto the whole set of numbers from \(-\infty\) (in correspondence of the value \(\theta=0\)) to \(\infty\) (for \(\theta=1\)).
This is obvious by noting that \[\begin{eqnarray*} \theta=0 & \Rightarrow & \log\left(\frac{\theta}{1-\theta}\right) = \log\left(\frac{0}{1}\right) = \log(0) \rightarrow -\infty \end{eqnarray*}\] and \[\begin{eqnarray*} \theta=1 & \Rightarrow & \log\left(\frac{\theta}{1-\theta}\right) = \log\left(\frac{1}{0}\right) = \log(\infty) \rightarrow \infty. \\ \end{eqnarray*}\]
In addition, the graph in Figure 4.25 shows that the resulting function of \(\theta\) is reasonably symmetric around the central point (\(\theta=0.5\)).
If \(\theta\) represents a probability (e.g. that a given event \(E\) happens), the quantity \[\mbox{O}= \left(\frac{\theta}{1-\theta}\right) = \frac{\Pr(E)}{1-\Pr(E)}\] is called the odds for the event \(E\) and it represents a measure of how much more likely \(E\) is to happen than not. If \(\Pr(E)=\theta=0.5\), then \(\displaystyle\mbox{O}=0.5/0.5=1\). If \(\Pr(E)\) is small, then O is also very small, while if \(E\) is very likely, then O becomes increasingly large. The range in which O is defined is characterised by the extreme values 0 (in correspondence of which \(E\) is impossible, i.e. \(\Pr(E)=0\) and thus \(\displaystyle \mbox{O}=0/1=0\)) and \(\infty\), when \(E\) is certain, i.e. \(\Pr(E)=1\) and thus \(\displaystyle \mbox{O}=1/0=\infty\).
The quantity \(\displaystyle \log\left(\frac{\theta}{1-\theta}\right) = \log\mbox{O}\) is the log-odds for the event \(E\). By applying the log transformation to O, we map its range in the interval \([\log(0)=-\infty; \log(\infty)=\infty]\).
Equation 4.26 clarifies that logistic regression implies that we are modelling the log-odds using a linear predictor or, in other words, that we are assuming a linear relationship between the mean outcome and the covariates, on the log odds scale.
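The logit map and its inverse can be coded directly; a small sketch:

# The logit transformation and its inverse (expit)
logit=function(p) log(p/(1-p))
expit=function(x) exp(x)/(1+exp(x))
logit(c(0.1,0.5,0.9))   # approximately -2.197, 0, 2.197
expit(0)                # returns 0.5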
For example, consider again Galton’s data, but this time our outcome is given by a new variable, which takes value 1 if the original child’s height is above a threshold of 71 inches (approximately 180 cm) and 0 otherwise. We can create the suitable data in R using the following commands.

# Creates a variable "y2" taking value 1 if the original child's height
# exceeds the 71-inch threshold and 0 otherwise
y2=ifelse(galton$Height>71,1,0)
# Summarises the new variable
table(y2)

y2
  0   1 
801  97
As is possible to see, about 10.8% of the sample (i.e. 97/898) “experiences the event”, i.e. is taller than the set threshold.
The interpretation of the regression coefficients in a logistic regression needs a bit of care, given the change in the scale we define for the linear predictor.
The intercept is still related to the expected mean of the outcome for the profile of covariates all set to 0. So, in Galton’s example, centring the covariates for simplicity of interpretation, this would be the expected value for the height of a child of “average” father and mother (in terms of height). However, this time the expected mean of the outcome is the probability of the underlying event of interest, e.g. that a child is taller than the threshold of 71 inches. So, for an individual with centred covariates set to 0 (e.g. for a child whose parents’ centred height is 0), the regression line (on the log-odds scale) is simply \[\begin{eqnarray*} \mbox{logit}(\theta_i) = \log\left(\frac{\theta_i}{1-\theta_i}\right) = \beta_0 \end{eqnarray*}\] and thus, recalling that \(\exp\left(\log\left(x\right)\right)=x\), we can invert the logit function to give \[ \begin{array}{ll} & \exp\left[\log\left(\frac{\theta_i}{1-\theta_i}\right)\right] = \exp\left(\beta_0\right) \Rightarrow \\ & \theta_i = \exp\left(\beta_0\right) (1-\theta_i) \Rightarrow \\ & \theta_i\left[1+\exp\left(\beta_0\right)\right] = \exp\left(\beta_0\right) \Rightarrow \end{array} \] \[ \theta_i = \frac{\exp\left(\beta_0\right)}{1+\exp\left(\beta_0\right)} \tag{4.27}\] (the right-hand side of Equation 4.27 is often referred to as the expit, or inverse logit, transformation). This particular rescaling of the intercept represents the expected mean outcome, i.e. the probability that the event under study occurs for an individual with covariates set to 0 (or to the overall average, if we are centring them).
The “slope” has also a slightly different interpretation in a logistic regression. We have seen before that a generic slope \(\beta_k\) represents the difference in the expected outcome for two individuals whose \(k-\)th covariate varies by one unit, “all else being equal”, i.e. where all the other covariates are set to the same value.
For instance, if in the current example we compared two fathers, one for whom the centred height was 0 (= 69.23 inches) and one for whom the centred height was 1 (= 70.23 inches), then we would have \[ \mbox{logit}(\theta \mid X^*_1=0) = \beta_0 + \beta_1 \times 0 = \beta_0, \] for the first one; and \[ \mbox{logit}(\theta \mid X^*_1=1) = \beta_0 + \beta_1 \times 1, \] for the second one. The difference between the expected outcomes would then be \[\begin{eqnarray*} \Delta_X & = &\mbox{logit}(\theta\mid X^*_1=1) - \mbox{logit}(\theta\mid X^*_1=0) \\ & = & \log\left(\frac{\theta}{1-\theta}\mid X^*_1=1 \right) - \log\left(\frac{\theta}{1-\theta}\mid X^*_1=0 \right) \\ & = & \log\left(\frac{\theta_1}{1-\theta_1}\right) - \log\left(\frac{\theta_0}{1-\theta_0}\right) \\ & = & (\beta_0 + \beta_1) - (\beta_0) \\ & = & \beta_1, \end{eqnarray*}\] where, for simplicity, we indicate with \(\theta_j\) the probability of the event in correspondence of the covariate profile \(X_1^*=j\) (for \(j=0,1\)).
Recalling that for any two positive numbers \((a,b)\) we can show that \(\log(a)-\log(b)=\log\left(a/b\right)\), we can then write \[\begin{eqnarray*} \log\left(\frac{\theta_1}{1-\theta_1} \middle/ \frac{\theta_0}{1-\theta_0}\right) = \beta_1 \end{eqnarray*}\] The quantity \(\displaystyle \left(\frac{\theta_1}{1-\theta_1} \middle/ \frac{\theta_0}{1-\theta_0}\right)\) is the ratio of two odds — this is called the odds ratio (OR) and describes how much more likely an event \(E\) is to happen among those individuals who present the characteristic associated with \(X^*_1=1\) (which happens with probability \(\theta_1\)) than among those who have \(X^*_1=0\) (which is associated with probability \(\theta_0\)). Taking the log of this quantity gives the log-OR, which we can now see is the same as the value of the slope \(\beta_1\).
Theoretically, the log-OR ranges between \(-\infty\) and \(+\infty\), with larger values indicating that the event is more likely to happen when individuals are associated with larger values of the covariate. In this case, the taller the father, the more likely their child is to be taller than 71 inches: for each extra inch in the father’s height, the odds of this event are multiplied by a factor \(\exp(\beta_1)\).
In general, negative values for a log-OR indicate that the covariate is negatively associated with the outcome, while positive log-OR suggest that the covariate is positively associated with the outcome. If we transform this on to the natural scale by exponentiating the log-OR, which means that OR\(\in [0,\infty)\), we can interpret an OR \(>1\) to indicate that the covariate has a positive effect on the (probability of the) outcome, while if OR \(<1\) then the opposite is true. When log-OR \(=0\) (or, equivalently OR \(=1\)), then there is no association between the covariate and the outcome.
In a Bayesian context, theoretically the model of Equation 4.26 does not pose particular problems: we need to specify prior distributions on the coefficients \(\boldsymbol\beta\) and, much as in the case of linear regression, we can use Normal distributions — note that Equation 4.26 is defined on the log-odds scale, which as suggested above, does have an unbounded range, which makes it reasonable to assume a Normal distribution for the coefficients.
Equation 4.27 is helpful when we want to set up a prior distribution for \(\beta_0\) in a meaningful way for the main parameter \(\theta\). For example, imagine we wanted to encode the assumption that, before getting to see any data, we expected only a 5 to 20% chance for a random person in the population to be taller than 71 inches. If we set a Normal(logit(0.105),sd=0.4) for \(\beta_0\), then we can check that what we are actually implying is a prior for the underlying “baseline” probability (i.e. for the child of “average” parents) as the one represented in Figure 4.26.
The dark blue horizontal line just above the \(x-\)axis in panel (b) indicates the 95% prior interval estimate for the parameter \(\theta\), which is derived from the prior depicted in panel (a). As is possible to see, the current choice for the mean and standard deviation of the Normal prior does induce a 95% interval covering approximately the required range. The values for the parameters of the Normal prior distribution for \(\beta_0\) can be found by trial and error, e.g. using the following R code:
# Defines the mean and sd of the prior
b0=log(0.105/(1-0.105))
s0=0.4
# Simulates 10000 values from the prior
beta0=rnorm(10000,b0,s0)
# Rescales the prior to the scale of the main parameter theta
theta=exp(beta0)/(1+exp(beta0))
# Checks the implied 95% interval estimate for theta
cbind(quantile(theta,.025),quantile(theta,.975))

           [,1]      [,2]
2.5% 0.05087575 0.2065268
and changing the imposed values for b0 and s0 until the resulting approximate 95% interval returns values close enough to the required limits (5-20%). This procedure is often referred to as forward sampling. Note that setting a mildly “informative” prior for the intercept may be helpful in stabilising the inference, particularly when the data comprise only a few records (i.e. small sample size) or, even more importantly, when the number of observed “successes” is small.
As for the slopes, we may elicit some informative prior and encode genuine prior knowledge in the model — for example, we may have some strong belief in a given treatment effect and thus we may set up a prior for the corresponding coefficient that is concentrated above 0. This would indicate a large prior probability that the resulting OR is greater than 1 and thus a positive association between the treatment and the outcome. However, in general terms, we may be unwilling to specify too strong a prior on a given treatment effect, because we would like to be a bit more conservative and let the data drive the inference on this particular relationship.
For example we could set up \(\beta_1,\beta_2 \stackrel{iid}{\sim}\mbox{Normal}(0,sd=2)\). This assumes that, before seeing any data, we are not expecting a particular effect of either father’s or mother’s height on the probability of a child being taller than 71 inches (because these distributions are centred around 0 — recall that \(\beta_1\) and \(\beta_2\) represent the log-ORs!). However, we are implying some variance around it to guarantee the possibility that the effect is either positive or negative.
Notice that we are imposing a standard deviation of 2 — you may think that this is rather strict and that we are in fact including some strong prior information in these distributions. However, recall that the slopes are defined on the log-odds scale, and so a standard deviation of 2 is actually pretty large when rescaled back to the original scale. Adapting Equation 4.15, we know that for a Normal distribution with mean 0 and standard deviation 2, 95% of the probability is approximately included in the interval \([-1.96\times 2; 1.96\times 2] = [-3.92; 3.92]\), which, when mapped onto the OR scale by exponentiating, implies a prior 95% interval for the OR of \([0.0198; 50.40]\). That is extremely vague!
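This can be checked directly in R:

# Implied prior 95% interval for the OR, under a Normal(0,sd=2) log-OR prior
exp(qnorm(c(0.025,0.975),mean=0,sd=2))   # approximately [0.0198; 50.40]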
Interestingly, there is no closed form for the posterior distributions of the model parameters in a logistic regression. Thus, we need to resort to simulation, e.g. MCMC. The details of the MCMC model used to run the analysis are not important here, but the results are summarised in Table 4.4.
Table 4.4: Summary of the posterior distributions for the parameters of the logistic regression model.

| | Mean | SD | 2.5% | 97.5% |
|---|---|---|---|---|
| $\beta_0$ (intercept) | -0.06824 | 0.06576 | -0.20033 | 0.05653 |
| $\beta_1$ (log-OR for father’s height) | 0.16726 | 0.02883 | 0.11117 | 0.22664 |
| $\beta_2$ (log-OR for mother’s height) | 0.09789 | 0.03037 | 0.03718 | 0.16128 |
In a logistic regression, typically we do not worry too much about the intercept (given the caveat above and the rescaling necessary to make sense of its value in terms of the underlying probability of “success”). As for the log-ORs, we can see that both covariates are associated with a positive point estimate and that both 95% interval estimates lie entirely above 0, thus indicating a posterior probability of at least 0.975 that each of \(\beta_1\) and \(\beta_2\) is positive.
Of course, once we have the simulations for the log-ORs, we can simply exponentiate each simulated value to obtain a full posterior distribution for the ORs. So, for example, if we had stored the output of the MCMC model in two suitable objects called beta1 and beta2, then we could simply rescale them to compute the relevant probabilities numerically, as in the following R code.
# Constructs the ORs from the original simulations obtained by the model
OR1=exp(beta1)
OR2=exp(beta2)
# Tail-area probability to estimate Pr(OR1<1). This is the proportion of
# simulations for OR1 that are below 1
sum(OR1<1)/length(OR1)
[1] 0
# Tail-area probability to estimate Pr(OR2<1). This is the proportion of
# simulations for OR2 that are below 1
sum(OR2<1)/length(OR2)
[1] 0.0005
Once the objects OR1 and OR2 are available, we could plot histograms of the posterior distributions, as shown in Figure 4.27. In both cases, none or very little of the posterior distribution lies below 1, indicating a “highly significant” result in terms of the positive association between the covariates and the outcome.
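A plot along the lines of Figure 4.27 can be produced with base R graphics; the following is a minimal sketch (the titles and layout are assumptions, not the original figure’s exact specification).

# Histograms of the posterior distributions for the two ORs, with a
# dashed vertical line marking OR=1 (i.e. no association)
par(mfrow=c(1,2))
hist(OR1,main="OR for father's height",xlab="OR"); abline(v=1,lty=2)
hist(OR2,main="OR for mother's height",xlab="OR"); abline(v=1,lty=2)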
If we consider a Likelihood approach, the idea is of course to maximise the likelihood function for the model parameters to determine the MLE, which, again, would be considered a very good candidate for optimality even under a pure Frequentist approach. Unfortunately, for a logistic regression model, it is not possible to obtain the maximum analytically. Thus, we resort to numerical maximisation, usually based on a clever approximation method referred to as the Newton-Raphson algorithm.
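To give a flavour of how the algorithm works, the following toy sketch (with made-up data, unrelated to the height example) applies Newton-Raphson to a one-parameter Binomial log-likelihood on the logit scale.

# Toy Newton-Raphson: maximises the log-likelihood of a Binomial model
# with y=7 successes out of n=20 trials, parameterised on the logit scale
y=7; n=20
score=function(b) y-n*plogis(b)                  # first derivative of the log-likelihood
hessian=function(b) -n*plogis(b)*(1-plogis(b))   # second derivative
b=0                                              # starting value
for (i in 1:10) b=b-score(b)/hessian(b)          # Newton-Raphson updates
plogis(b)                                        # recovers the MLE: y/n = 0.35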
In practice, we do not need to programme this algorithm ourselves; rather, we rely on existing routines or programmes. For example, R has a built-in function glm, which can be used to obtain the MLEs for GLMs.
The following code shows how to perform a GLM analysis for the model shown in Section 4.6.3.1.1; all the assumptions are identical, except of course that we do not specify any prior distribution in this case.
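The call itself does not appear in the output below; a minimal sketch consistent with the Call line, assuming the binary outcome is stored in an object y2 and the covariates in a matrix X (with columns Intercept, Father and Mother), is the following.

# Sketch of the model fit; y2 and X are assumed to exist as described above
formula=y2~X-1    # '-1' because X already contains a column of 1s for the intercept
summary(glm(formula,family="binomial",data=data.frame(y2,X)))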
Call:
glm(formula = formula, family = "binomial", data = data.frame(y2,
X))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
XIntercept -0.008175 0.068537 -0.119 0.90506
XFather 0.166414 0.029209 5.697 0.0000000122 ***
XMother 0.097989 0.030380 3.225 0.00126 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1244.9 on 898 degrees of freedom
Residual deviance: 1197.1 on 895 degrees of freedom
AIC: 1203.1
Number of Fisher Scoring iterations: 4
The computer output is fairly similar to that presented for the outcome of lm in the case of a linear regression. The main parameters are presented in the core of the table in terms of the point estimate (labelled as Estimate), together with the standard error (Std. Error).
In this particular instance, the results are fairly similar to the numerical output of the Bayesian analysis. The intercept shows some slight differences, mainly due to the informative prior used above. However, for the two slopes, given that the prior was centred around 0 and in fact fairly vague, the effects of the covariates are estimated at very similar numerical values.
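As a side note, if the fitted model were stored in an object, say m (a hypothetical name), the MLEs and the associated Wald-type 95% intervals could be rescaled to the OR scale by simple exponentiation.

# Rescales the MLEs and Wald-type 95% intervals to the OR scale
# ('m' is a hypothetical name for the object storing the glm fit)
exp(cbind(OR=coef(m),confint.default(m)))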
Logistic regression is only one of the models embedded in the wider family of GLMs. The general principles are essentially the same in all circumstances: specify a distribution \(p(y_i\mid\boldsymbol{\theta}_i, \boldsymbol{X}_i)\) and a suitable map from the natural scale of the mean to an unbounded range, on which we can assume at least approximate linearity in the covariates.
Another common example is Poisson regression, where we have observed data on the outcome \(y_1,\ldots,y_n\stackrel{ind}{\sim}\mbox{Poisson}(\theta_i)\). Here the parameter \(\theta_i\) represents at once the mean and the variance of the underlying Poisson distribution. Because \(\theta_i\geq 0\) (as it represents a rate, i.e. the intensity with which the counts are observed), a suitable transformation is simply to take the log and model \[\begin{eqnarray} g(\theta_i) = g\left(\mbox{E}[Y_i\mid\boldsymbol{X}_i]\right)=\log(\theta_i) = \boldsymbol{X}_i\boldsymbol\beta. \end{eqnarray}\] In comparison to logistic regression, the interpretation of the coefficients is slightly simpler: using a reasoning similar to that shown for the intercept and slope of a logistic regression, we can show that, in the case of a Poisson GLM, \(\exp(\beta_0)\) represents the baseline rate (i.e. for the profile in which all the covariates are set to 0), while \(\exp(\beta_j)\) represents the rate ratio, i.e. the multiplicative change in the rate associated with a unit increase in the \(j\)-th covariate.
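As a minimal illustration (with simulated data and made-up parameter values), a Poisson GLM can be fitted in R using the same glm machinery, with family="poisson", for which the log link is the default.

# Simulates count data from a log-linear model and fits a Poisson GLM
set.seed(1)
n=500
x=rnorm(n)                      # a single continuous covariate
y=rpois(n,exp(0.5+0.3*x))       # log(theta_i) = 0.5 + 0.3*x_i
m=glm(y~x,family="poisson")     # log link by default
exp(coef(m))                    # estimated baseline rate and rate ratio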
The log-linear model embedded in the Poisson structure can be applied to several other distributions to describe sampling variability in the observed outcome. Other relevant examples include the case of time-to-event outcomes (see Chapter 7), for which suitable models are Weibull or Gamma, among others. Because all of these are defined as positive, continuous variables, their mean is also positive and so it is useful to rescale the linear predictor on the log scale.
Much like Karl Pearson (see Section 4.6.1), Fisher was a very controversial figure. His brilliance as a scientist is undisputed and he made contributions to many disciplines, including Statistics, Biology and Genetics. On the other hand, his views were also close to eugenics; in fact, he took the post of head of the Department of Eugenics at University College London (UCL) in 1933. He was also heavily criticised for his views on the link between smoking and lung cancer, a link he strongly denied in favour of some underlying genetic explanation, despite evidence that was already becoming substantial at the time.↩︎
Jerzy Neyman (a Polish mathematician and statistician) and Egon Pearson (the son of Karl Pearson) developed most of the theory underlying the main Frequentist ideas while the former was visiting the Department of Statistical Science at UCL, where the latter had taken over from his father as head.↩︎
During his early career, Fisher also had heated arguments with Karl Pearson, whom he had started off admiring very much, but with whom he fell out over the rejection of one of Fisher’s papers in Biometrika, the scientific journal edited by Karl Pearson.↩︎
Galton was another very controversial character. He was a polymath who made contributions to Statistics, Psychology, Sociology, Anthropology and many other sciences. He was a half-cousin of Charles Darwin and was inspired by Darwin’s work on the origin of species to study hereditary traits in humans, including height, which he used to essentially invent regression. He also established and then financed the “Galton Laboratory” at University College London, which Karl Pearson went on to lead. Alas, Galton was also a major proponent of eugenics (in fact, he is credited with inventing the term) and has thus left a troubling legacy behind him.↩︎