Introduction to Statistical Thought


             Michael Lavine

             August 3, 2008


﻿


Copyright © 2005 by Michael Lavine


﻿


  CONTENTS


                                    vi

                                    xi


                                    X1

                                    1

                                    6
                   .....14

                   .....14
     ....17
                   .....20
                   .....22
                     .....29
                                   40
                   kJ2..............................51
   ....57
                   .....57
                     .....61
   ....72
                   .....77
....81

                                   93
....93
....94
..... e,95


11


﻿


CONTENTSii


111


L-\ .'K \,,.k


40 40 40 40


w w K'AU{1Ul


. 100
. 113
. 132
. 132
. 139
. 144
. 154
. 154
. 155
. 158
. 164
. 174
. 178
. 192

202
. 202
. 210
. 210
. 221
. 236
. 236
. 245
. 250
. 253

263
. 263
. 264
. 265
. 266
. 266
. 271
. 276

279
. 279
n 9


" " " " " " " " " " " "


﻿


CONTENTS                                                             i


iv


. . . . . . . . . . . . . .
   \ X54 h h L", J
      -\ \__.ti I 4.\.ti -yam y \
    \ y\ h .a4
    l
" " " " " " " " " " " " " "

Istles
rl,,_ , \_."  l

" " " " " " " " " " " " " "


. y w   ll v v; N IS


.0 d \,. l " " " " " " " " " "
cess   4


. 303
. 305
. 313
. 316
. 316
. 320
. 329
. 329
. 335
. 335

343
. 343
. 351
. 370

373
. 373
. 374
. 388
. 388
. 395
. 395
. 396
. 396
. 396
. 396

399
. 399
. 399
. 402
. 404
. 404
. 404
. 404
. 404
an40


NomaIit


0 0 0 0 0 0 0 0
e l"es s
les . . . . . .


﻿


CONTENTSV


                  405
   v..  .  .. .....................405
          ................................405
   ..........................405

      ..................................405

    ....................................410
    .:  , ,414


﻿


            r+ M        \,O O\ r+                L) N        r+ N        N'    r+ zl- 00 N              L) O                    O\ N                      + N

            r+ rl rl rl N N N N M co co                                                           Lo Lo l0 l0 l0 l0 00                           O\ O O


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


  .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .     .      .     .     .     .     .     .          .     .     .


                                                                      .     .     .     .     .    " Z ";  .     .     .     .     .     .          .     .     .


                                                              ; °     .     .     .     .     .    ,,,,    .     .     .     .     .     .          .     .     .


                                                                i                                  trs          '


                                                                                o


  "     "    j       "     "     "     "     "          {{ 'C /.ice ii/, "  "     %     "                  "    i"/, F     {       /     "          "     "     "


                                                                                                                     /iii
                                                                                                                                   s


                                                       , .    f          C,,, /              t rff /..         ci ~.       " f


       '             "     "    ~ice   "     "   ;       ,,, irsi     "    ..         f, i   ....   "f     "               f ../ /-                 "   'i,,,y ,,,i
                                                                                      fs
                                                                                      f fr~                                       /     ,,,f                   rr.
"      f             "     "    ,,,    "     "          .....,        "                " i          .      "   {     /"         / {/>,  f %         "    ,;i   off
rrf {/ ,,.


                                           V/'3


﻿


LIST OF FIGURES


vii


w"


" " " " " " " " " " " "


103
107
110
112
114
118
119
122
123
124
125
128
129


130
134
137
138
141
143
146
149
152
157
160
162
169
171
172
173
176
179
184
185
186
190
191


_: w
y


     .
\
" " " " " "


﻿


LIST OF FIGURES                                                  vi


viii


k.. it tt) "h " " " " " " " " "


203
206
207
209
211
213
215
222
228
230
233
235
238
239
242
247
249
251
256


. . . . . . ................270
. . . . . . ................272


. . . . . . . . . . . . .
    ;...: r)"
      v V


285
289
295
300


" " " " " " " " " " " " " " "  " " " "K


.302


.304
.307
.310
.314
0317


﻿


"   " " " "      " r   " Q/  / "     !.Q Qi " /Q,       / Q Q in      '/rr " %. / rr%/ " "         Q,
                             y     /ri rrrrr    i.       r    /  rrr a/ i"" r
"o"Q " " "  " "\ "% "0 ''R[/   "  "O" /  " /rrr  /


﻿


LIST OF FIGURES                                                   x


        .......,......................385
        ............................387

                         .. . ...\... 4  \  v \.. .. .... .. .... ,, 4... ,..... .... .... .39

                                                                391

               ............................ ............... .......... .-....................40
                                          ........................41
             yob \ w .......................-......................412


﻿


LIST OF TABLES


   .....42
   .....44

   ......150

..   .......215
.........240

.........299


  .... 0345


. . . . . . . . . . . . . . .

,.. , \ , a .h ,.w


xi


﻿


                              PREFACE


This book is intended as an upper level undergraduate or introductory graduate
textbook in statistical thinking with a likelihood emphasis for students with a good
knowledge of calculus and the ability to think abstractly. By "statistical thinking" is
meant a focus on ideas that statisticians care about as opposed to technical details
of how to put those ideas into practice. By "likelihood emphasis" is meant that the
likelihood function and likelihood principle are unifying ideas throughout the text.
Another unusual aspect is the use of statistical software as a pedagogical tool. That
is, instead of viewing the computer merely as a convenient and accurate calculating
device, we use computer calculation and simulation as another way of explaining
and helping readers understand the underlying concepts.
   Our software of choice is R (                        [    ]). R and accom-
panying manuals are available for free download from http://www.r-project.
org. You may wish to download An Introduction to R to keep as a reference. It
is highly recommended that you try all the examples in R. They will help you un-
derstand concepts, give you a little programming experience, and give you facility
with a very flexible statistical software package. And don't just try the examples
as written. Vary them a little; play around with them; experiment. You won't hurt
anything and you'll learn a lot.


xii


﻿


CHAPTER 1


                           PROBABILITY


 1.1 Basic Probability

 Let X be a set and F a collection of subsets of X. A probability measure, or just a
probability, on (X, F) is a function ,u: F - [0, 1]. In other words, to every set in F,
yu assigns a probability between 0 and 1. We call yu a set function because its domain
is a collection of sets. But not just any set function will do. To be a probability pu
must satisfy

   1. ,u(0)= 0 (0 is the empty set.),

   2. ,u(X) = 1, and

   3. if A1 and A2 are disjoint then ,u(A1 U A2) =,u(A1) + ,u(A2).

One can show that property 3 holds for any finite collection of disjoint sets, not just
two; see Exercise 1. It is common practice, which we adopt in this text, to assume
more - that property 3 also holds for any countable collection of disjoint sets.
   When X is a finite or countably infinite set (usually integers) then yu is said to
be a discrete probability. When X is an interval, either finite or infinite, then yu
is said to be a continuous probability. In the discrete case, F usually contains all
possible subsets of X. But in the continuous case, technical complications prohibit
F from containing all possible subsets of X. See C]as.ell ad  ere [2002] or
   Schervish[   ] for details. In this text we deemphasize the role of F and speak
of probability measures on X without mentioning F.
   In practical examples X is the set of outcomes of an "experiment" and yu is deter-
mined by experience, logic or judgement. For example, consider rolling a six-sided
die. The set of outcomes is {1, 2, 3, 4, 5, 6} so we would assign X = {1, 2, 3, 4, 5, 6}.


1


﻿


1.1. BASIC PROBABILITY


2


If we believe the die to be fair then we would also assign p,u({1}) p,u({2}) "- -  
p,u({6}) = 1/6. The laws of probability then imply various other values such as

                                   p({1, 2}) = 1/3
                                 p({2, 4, 6}) = 1/2
                                        etc.

Often we omit the braces and write ,u(2), ,u(5), etc. Setting ,u(i)= 1/6 is not
automatic simply because a die has six faces. We set ,u(i)= 1/6 because we believe
the die to be fair.
   We usually use the word "probability" or the symbol P in place of P. For exam-
ple, we would use the following phrases interchangeably:

   " The probability that the die lands 1

   " P(1)
   " P[the die lands 1]

   * p({1})

We also use the word distribution in place of probability measure.
   The next example illustrates how probabilities of complicated events can be
calculated from probabilities of simple events.

Example 1.1 (The Game of Craps)
Craps is a gambling game played with two dice. Here are the rules, as explained on the
website www  aonline- craps-gamblig. com/raps-rules..c

         For the dice thrower (shooter) the object of the game is to throw a 7 or
      an 11 on the first roll (a win) and avoid throwing a 2, 3 or 12 (a loss). If
      none of these numbers (2, 3, 7, 11 or 12) is thrown on the first throw (the
      Come-out roll) then a Point is established (the point is the number rolled)
      against which the shooter plays. The shooter continues to throw until one
      of two numbers is thrown, the Point number or a Seven. If the shooter rolls
      the Point before rolling a Seven he/she wins, however if the shooter throws
      a Seven before rolling the Point he/she loses.

Ultimately we would like to calculate P(shooter wins). But for now, let's just calculate


P(shooter wins on Come-out roll) = P(7 or 11) = P(7) + P(11).


﻿


1.1. BASIC PROBABILITY


3


Using the language of page , what is X in this case? Let di denote the number showing
on the first die and d2 denote the number showing on the second die. di and d2 are
integers from 1 to 6. So X is the set of ordered pairs (di, d2) or

                      (6, 6) (6, 5) (6, 4) (6, 3) (6, 2) (6, 1)
                      (5, 6) (5, 5) (5, 4) (5, 3) (5, 2) (5,1)
                      (4, 6) (4, 5) (4, 4) (4, 3) (4, 2) (4, 1)
                      (3, 6) (3, 5) (3, 4) (3, 3) (3, 2) (3, 1)
                      (2, 6) (2, 5) (2, 4) (2, 3) (2, 2) (2, 1)
                      (1, 6) (1, 5) (1, 4) (1, 3) (1, 2) (1, 1)
If the dice are fair, then the pairs are all equally likely. Since there are 36 of them, we
assign P(di, d2) = 1/36 for any combination (di, d2). Finally, we can calculate
                    P(7 or 11) = P(6,5) +P(5,6) +P(6, 1) +P(5,2)
                + P(4, 3) + P(3, 4) + P(2, 5) + P(1, 6) = 8/36 = 2/9.
The previous calculation uses desideratum 3 for probability measures. The different pairs
(6, 5), (5, 6), ... , (1, 6) are disjoint, so the probability of their union is the sum of their
probabilities.

   Example       illustrates a common situation. We know the probabilities of some
simple events like the rolls of individual dice, and want to calculate the proba-
bilities of more complicated events like the success of a Come-out roll. Sometimes
those probabilities can be calculated mathematically as in the example. Other times
it is more convenient to calculate them by computer simulation. We frequently use
R to calculate probabilities. To illustrate, Example   uses R to calculate by simu-
lation the same probability we found directly in Example

Example 1.2 (Craps, continued)
To simulate the game of craps, we will have to simulate rolling dice. That's like randomly
sampling an integer from 1 to 6. The sample() command in R can do that. For example,
the following snippet of code generates one roll from a fair, six-sided die and shows R's
response:
     > sample(1:6,1)
     [1] 1


When you start Rf on your computer, you see>, ft's prompt. Then you can type a command
such as sample (1 :6,1) which means "take a sample of size 1 from the numbers 1 through


﻿


1.1. BASIC PROBABILITY


4


6". (It could have been abbreviated sample (6,1).) R responds with [1] 1. The [1]
says how many calculations R has done; you can ignore it. The 1 is R's answer to the
sample command; it selected the number "1". Then it gave another >, showing that it's
ready for another command. Try this several times; you shouldn't get "1" every time.
    Here's a longer snippet that does something more useful.

    > x <- sample ( 6, 10, replace=T ) # take a sample of
                                              # size 10 and call it x
     > x # print the ten values
     [1] 6 4 2 3 4 4 3 6 6 2

     > sum ( x == 3 ) # how many are equal to 3?
     [1] 2


Note

   " # is the comment character. On each line, R ignores all text after #.

   " We have to tell R to take its sample with replacement. Otherwise, when R selects
      "6" the first time, "6" is no longer available to be sampled a second time. In
      replace=T, the T stands for True.

   " <- does assignment. I.e., the result of sample ( 6, 10, replace=T ) is as-
      signed to a variable called x. The assignment symbol is two characters: < followed
      by -.

   " A variable such as x can hold many values simultaneously. When it does, it's called
      a vector. You can refer to individual elements of a vector. For example, x[11 is
      the first element of x. x[11 turned out to be 6; x[2] turned out to be 4; and so
      on.

   " == does comparison. In the snippet above, (x==3) checks, for each element of x,
      whether that element is equal to 3. If you just type x == 3 you will see a string
      of T's and F's (True and False), one for each element of x. Try it.

   " The sum command treats T as 1 and F as 0.

   * R is almost always tolerant of spaces. You can often leave them out or add extras
      where you like.


﻿


1.1. BASIC PROBABILITY


5


On average, we expect 1/6 of the draws to equal 1, another 1/6 to equal 2, and so on.
The following snippet is a quick demonstration. We simulate 6000 rolls of a die and
expect about 1000 l's, 1000 2's, etc. We count how many we actually get. This snippet
also introduces the for loop, which you should try to understand now because it will be
extremely useful in the future.

     > x <- sample(6,6000,replace=T)


     > for ( i in 1:6 ) print ( sum ( x==i ))
     [11 995
     [11 1047
     [11 986
     [11 1033
     [11 975
     [11 964


Each number from 1 through 6 was chosen about 1000 times, plus or minus a little bit
due to chance variation.
    Now let's get back to craps. We want to simulate a large number of games, say 1000.
For each game, we record either 1 or 0, according to whether the shooter wins on the
Come-out roll, or not. We should print out the number of wins at the end. So we start
with a code snippet like this:

     # make a vector of length 1000, filled with 0's
           wins <- rep ( 0, 1000 )
           for ( i in 1:1000 ) {
       simulate a Come-out roll
       if shooter wins on Come-out, wins[i] <- 1
     }

     sum ( wins ) # print the number of wins


Now we have to figure out how to simulate the Come-out roll and decide whether the
shooter wins. Clearly, we begin by simulating the roll of two dice. So our snippet expands
to


﻿


1.2. PROBABILITY DENSITIES


6


     # make a vector of length 1000, filled with 0's
                wins <- rep ( 0, 1000 )
     for ( i in 1:1000 ) {
       d <- sample ( 1:6, 2, replace=T )
               if ( sum(d) == 7 II sum(d) == 11 ) wins[i] <- 1
     }
     sum ( wins ) # print the number of wins


The "I I" stands for "or". So that line of code sets wins [i] <- 1 if the sum of the
rolls is either 7 or 11. When I ran this simulation R printed out 219. The calculation in
Example 1 says we should expect around (2/9) x 1000  222 wins. Our calculation
and simulation agree about as well as can be expected from a simulation. Try it yourself
a few times. You shouldn't always get 219. But you should get around 222 plus or minus
a little bit due to the randomness of the simulation.
   Try out these R commands in the version of R installed on your computer. Make sure
you understand them. If you don't, print out the results. Try variations. Try any tricks
you can think of to help you learn R.


1.2 Probability Densities

So far we have dealt with discrete probabilities, or the probabilities of at most a
countably infinite number of outcomes. For discrete probabilities, X is usually a
set of integers, either finite or infinite. Section L2 deals with the case where X is
an interval, either of finite or infinite length. Some examples are

Medical trials the time until a patient experiences a relapse

Sports the length of a javelin throw

Ecology the lifetime of a tree

Manufacturing the diameter of a ball bearing

Computing the amount of time a Help Line customer spends on hold

Physics the time until a uranium atom decays

Oceanography the temperature of ocean water at a specified latitude, longitude
      and depth


﻿


1.2. PROBABILITY DENSITIES


7


   Probabilities for such outcomes are called continuous. For example, let Y be the
time a Help Line caller spends on hold. The random variable Y is often modelled
with a density similar to that in Figure


                       y


Figure 1.1: pdf for time on hold at Help Line


   The curve in the figure is a probability density function or pdf. The pdf is large
near y = 0 and monotonically decreasing, expressing the idea that smaller values
of y are more likely than larger values. (Reasonable people may disagree about
whether this pdf accurately represents callers' experience.) We typically use the
symbols p, 7 or f for pdf's. We would write p(50), 7(50) or f(50) to denote the
height of the curve at y = 50. For a pdf, probability is the same as area under the
curve. For example, the probability that a caller waits less than 60 minutes is


                            P[Y < 60]   fp(t) dt.

   Every pdf must satisfy two properties.

   1. p(y) > 0 for all y.


2. f   p(y) dy = 1.


﻿


1.2. PROBABILITY DENSITIES                                                   8

The first property holds because, if p(y) < 0 on the interval (a, b) then P[Y E
(a, b)] = f p(y) dy < 0; and we can't have probabilities less than 0. The second
property holds because P[Y E (-oo, Do)] = f_ p(y) dy = 1.
   One peculiar fact about any continuous random variable Y is that P[Y = a] = 0,
for every a c R. That's because

             P[Y = a] = lim P[Y E [a, a + c]] = im]   py (y) dy = 0.

Consequently, for any numbers a < b,

            P[Y E (a, b)] = P[Y E [a, b)] = P[Y E (a, b]] = P[Y E [a, b]].

   The use of "density" in statistics is entirely analagous to its use in physics. In
both fields
                                          mass
                               density =   u(1.1)
                                         volume
In statistics, we interpret density as probability density, mass as probability mass
and volume as length of interval. In both fields, if the density varies from place to
place (In physics it would vary within an object; in statistics it would vary along
the real line.) then the density at a particular location is the limit of Equation
as volume -- 0.
   Probability density functions are derivatives of probabilities. For any fixed num-
ber a
                    d P[X E (a, b]]  db J  fx(x) dx = fx(b).              (1.2)
                    db               db J
Similarly, d/da P[X E (a, b]] = -fx(a).
   Sometimes we can specify pdf's for continuous random variables based on the
logic of the situation, just as we could specify discrete probabilities based on the
logic of dice rolls. For example, let Y be the outcome of a spinner that is marked
from 0 to 1. Then Y will be somewhere in the unit interval, and all parts of the
interval are equally likely. So the pdf py must look like Figure

   Figure     was produced by the following snippet.

     plot ( c(0,1), c(1,1), xlab="y", ylab="p(y)",
            ylim=c(0,1.1), type="1" )


Note:


﻿


1.2. PROBABILITY DENSITIES


9


               Co


                  S  I       I       I      I       I       I
                    0.0    0.2     0.4     0.6     0.8    1.0

                                        y


                  Figure 1.2: py for the outcome of a spinner


   " c (0, 1) collects 0 and 1 and puts them into the vector (0,1). Likewise, c (1, 1)
     creates the vector (1,1).

   " plot(x,y,...) produces a plot. The plot(c(0,1), c(1,1), ...) com-
     mand above plots the points (x [11 , y [1]) = (0,1) and (x [21 , y [21) = (1,1).

   " type="1" says to plot a line instead of individual points.

   " xlab and ylab say how the axes are labelled.

   " ylim=c (0, 1. 1) sets the limits of the y-axis on the plot. If ylim is not specified
     then R sets the limits automatically. Limits on the x-axis can be specified with
     xlim.


   At other times we use probability densities and distributions as models for data,
and estimate the densities and distributions directly from the data. Figure
shows how that works. The upper panel of the figure is a histogram of 112 mea-
surements of ocean temperature at a depth of 1000 meters in the North Atlantic
near 450 North latitude and 200 degrees West longitude. Example  will say
more about the data. Superimposed on the histogram is a pdf f. We think of f as


﻿


1.2. PROBABILITY DENSITIES


10


underlying the data. The idea is that measuring a temperature at that location is
like randomly drawing a value from f. The 112 measurements, which are spread
out over about a century of time, are like 112 independent draws from f. Having
the 112 measurements allows us to make a good estimate of f. If oceanographers
return to that location to make additional measurements, it would be like mak-
ing additional draws from f. Because we can estimate f reasonably well, we can
predict with some degree of assurance what the future draws will be like.
   The bottom panel of Figure     is a histogram of the discoveries data set
that comes with R and which is, as R explains, "The numbers of 'great' inventions
and scientific discoveries in each year from 1860 to 1959." It is overlaid with a
line showing the Poi(3.1) distribution. (Named distributions will be introduced in
Section  .) It seems that the number of great discoveries each year follows the
Poi(3.1) distribution, at least approximately. If we think the future will be like the
past then we should expect future years to follow a similar pattern. Again, we
think of a distribution underlying the data. The number of discoveries in a single
year is like a draw from the underlying distribution. The figure shows 100 years,
which allow us to estimate the underlying distribution reasonably well.

   Figure    was produced by the following snippet.
   par ( mfrow=c(2,1) )

   good <- abs ( med.1000$lon + 20 ) < 1 &
             abs ( med.1000$lat - 45 ) < 1
    hist ( med.1000$temp[good], xlab="temperature", ylab="",
            main="", prob=T, xlim=c(5,11) )
    m <- mean ( med.1000$temp[good] )
    s <- sqrt ( var ( med.1000$temp[good] ) )
    x <- seq ( 5, 11, length=40 )
    lines ( density(med.1000$temp[good]) )

    hist ( discoveries, xlab="discoveries", ylab="", main="",
            prob=T, breaks=seq(-.5,12.5,by=1) )
    lines ( 0:12, dpois(0:12, 3.1), type="b" )


Note:
   * par sets R's graphical parameters. mfrow=c (2, 1) tells R to make an array of
     multiple figures in a 2 by 1 layout.


﻿


1.2. PROBABILITY DENSITIES


11


0


CN
0


0
0


I      I     I     I      I     I
5     6      7     8     9     10


11


temperature


0
CN
0

0
6

0
0
0


0


                      0     2    4     6     8    10    12


                                  discoveries


Figure 1.3: (a): Ocean temperatures at 1000m depth near 45°N latitude, -20
longitude; (b) Numbers of important discoveries each year 1860-1959


﻿


1.2. PROBABILITY DENSITIES


12


" med. 1000 is a data set of North Atlantic ocean temperatures at a depth of
  1000 meters. med. 1000$ln and med. 1000$lat are the longitude and latitude
  of the measurements. med.1000$temp are the actual temperatures.

" abs stands for absolute value.

" good <- ... calls those points good whose longitude is between -19 and -21
  and whose latitude is between 44 and 46.

" hist () makes a histogram. prob=T turns the y-axis into a probability scale
  (area under the histogram is 1) instead of counts.

" mean () calculates the mean. var () calculates the variance. Section de-
  fines the mean and variance of distributions. Section   defines the mean
  and variance of data sets.

" lines() adds lines to an existing plot.

" density() estimates a density from a data set.


It is often necessary to transform one variable into another as, for example, Z


g(X) for some specified function g. We might know px (
which random variable we're talking about.) and want to
consider only monotonic functions g, so there is an inverse

Theorem 1.1. Let X be a random variable with pdf px. L
monotonic, invertible function and define Z = g(X). Then th


                         pz (t) = px (g -1(t))  d t 1 1
                                              Pit

Proof. If q is an increasing function then


The subscript
calculate pz.
X = h(Z).


indicates
Here we


et g be a differentiable,
e pdf of Z is


pz(b) =    P[Z E (a, b]]
        db


   P[X E (g 1(a), g 1(b)]]
db
       d    g-1(b)
       d        px(x)dx
       db J91(a)


dg-1(
  dt


b


X px (g-1(b))


The proof when g is decreasing is left as an exercise.


﻿


1.2. PROBABILITY DENSITIES


13


   To illustrate, suppose that X is a random variable with pdf px (x) = 2x on the
unit interval. Let Z = 1/X. What is pz(z)? The inverse transformation is X = 1/Z.
Its derivative is dz/dz = -z-2. Therefore,


                _ d g-1(z)  2  1
pz(z) =px(g 1(z))  d z z2
                          d z zz


2


And the possible values of Z are from 1 to oc. So pz(z) =
(1, oo). As a partial check, we can verify that the integral is 1.


2/z3 on the interval


2
S z dz


1
21


1.


   Theorem 141 can be explained by Figure  . The figure shows an x, a z, and
the function z = g(x). A little interval is shown around x; call it I. It gets mapped
by g into a little interval around z; call it I2. The density is


            pz~z)P[Z E I] _
                    length (I2)

The approximations in Equation
to 0.


P[X E Im] length(Ix)   Px(,)h'(z)
length(Ix) length(I2)


(1.3)


are exact as the lengths of I and I2 decrease


   If g is not one-to-one, then it is often possible to find subsets of R on which g is
one-to-one, and work separately on each subset.


------------    =9(X)


Figure 1.4: Change of variables


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


14


1.3 Parametric Families of Distributions

Probabilities often depend on one or more unknown numerical constants. Suppose,
for example, that we have a biased coin. Let 0 be the chance that it lands H. Then
P(H) =0; but we might not know 0; it is an unknown numerical constant. In this
case we have a family of probability measures, one for each value of 0, and we don't
know which one is right. When we need to be explicit that probabilities depend
on 0, we use the notation, for example, P(H |0) or P(H |0 = 1/3). The vertical bar
is read "given" or "given that". So P(H |0 = 1/3) is read "the probability of Heads
given that 0 equals 1/3" and P(H B) is read "the probability of Heads given 0."
This notation means

                            P(H 0= 1/3) = 1/3,
                            P(T 0= 1/3) = 2/3,
                            P(T 0B =1/5) = 4/5

and so on. Instead of "given" we also use the word "conditional". So we would say
"the probability of Heads conditional on 0", etc.
   The unknown constant 0 is called a parameter. The set of possible values for 0
is denoted e (upper case 0). For each 0 there is a probability measure po. The set
of all possible probability measures (for the problem at hand),

                                {pe:o 0 8 E},

is called a parametric family of probability measures. The rest of this chapter intro-
duces four of the most useful parametric families of probability measures.

1.3.1 The Binomial Distribution
Statisticians often have to consider observations of the following type.

   " A repeatable event results in either a success or a failure.

   " Many repetitions are observed.

   " Successes and failures are counted.

   " The number of successes helps us learn about the probability of success.


Such observations are called binomial. Some examples are


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


15


Medical trials A new treatment is given to many patients. Each is either cured or
     not.

Toxicity tests Many laboratory animals are exposed to a potential carcinogen.
     Each either develops cancer or not.

Ecology Many seeds are planted. Each either germinates or not.

Quality control Many supposedly identical items are subjected to a test. Each
     either passes or not.
Because binomial experiments are so prevalent there is specialized language to
describe them. Each repetition is called a trial; the number of trials is usually
denoted N; the unknown probability of success is usually denoted either p or 0;
the number of successes is usually denoted X. We write X  Bin(N, p). The
symbol " is read is distributed as; we would say "X is distributed as Binomial N
p" or "X has the Binomial N, p distribution". Some important assumptions about
binomial experiments are that N is fixed in advance, 0 is the same for every trial,
and the outcome of any trial does not influence the outcome of any other trial.
When N = 1 we say X has a Bernoulli(6) distribution and write X   Bern(0); the
individual trials in a binomial experiment are called Bernoulli trials.
   When a binomial experiment is performed, X will turn out to be one of the
integers from 0 to N. We need to know the associated probabilities; i.e. P[X = k |0]
for each value of k from 0 to N. These probabilities are given by EquationL4
whose derivation is given in Section51.


                                              (NNN
                        P(X = k|0] =       6 N)-

The term (N) is called a binomial coefficient and is read "N choose k". (N) k!(Nk
and is equal to the number of subsets of size k that can be formed from a group of
N distinct items. In case k = 0 or k = N, 0! is defined to be 1. Figure L5 shows
binomial probabilities for N E {3, 30, 300} and p E {.1, .5, .9}.

Example 1.3 (Craps, continued)
This example continues the game of craps. See Examples 11and,1.
   What is the probability that at least one of the next four players wins on his Come-out
roll?
   This is a Binomial experiment because
   1. We are looking at repeated trials. Each Come-out roll is a trial. It results in either
     success, or not.


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


16


    N=3       N=3       N=3
    p=O.1     p=O.5     p=O.9


    o     ao1

    0 1 23   01 23     0 12 3

    x          x         x


    N=30      N=30      N=30
    p=O.1     p=O.5     p=O.9


  O O00
  o       T         -0 O0Q

  Q OQ OQ
  0 10 30    0 10 30   0 10 30

     x         x         x


     N =300   N =300    N =300
     p=O.1    p=O.5     p=O.9


CL0       0L 00CL

   0  200    0 200     0 200

     x         x         x


Figure 1.5: Binomial probabilities


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


17


   2. The outcome of one trial does not affect the other trials.

   3. We are counting the number of successes.

Let X be the number of successes. There are four trials, so N = 4. We calculated the
probability of success in Example 1,1; it's p = 2/9. So X  Bin(4, 2/9). The probability
of success in at least one Come-out roll is

  P[success in at least one Come-out roll] = P[X > 1]
                             4              4
                          =     P[X    i]      ()(2/9)2(7/9)4-    0.634 (1.5)
                             i=1           i=1

   A convenient way to re-express Equation  is

                           P[X;> 1] = 1 -P[X =0],

which can be quickly calculated in R. The dbinom () command computes Binomial
probabilities. To compute Equation L5 we would write

  1 - dbinom(0,4,2/9)

The 0 says what value of X we want. The 4 and the 2/9 are the number of trials
and the probability of success. Try it. Learn it.

1.3.2 The Poisson Distribution
Another common type of observation occurs in the following situation.

   " There is a domain of study, usually a block of space or time.

   " Events arise seemingly at random in the domain.

   " There is an underlying rate at which events arise.

Such observations are called Poisson after the 19th century French mathematician
Simeon-Denis Poisson. The number of events in the domain of study helps us learn
about the rate. Some examples are

Ecology Tree seedlings emerge from the forest floor.

Computer programming Bugs occur in computer code.


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


18


Quality control Defects occur along a strand of yarn.

Genetics Mutations occur in a genome.

Traffic flow Cars arrive at an intersection.

Customer service Customers arrive at a service counter.

Neurobiology Neurons fire.

The rate at which events occur is often called A; the number of events that occur in
the domain of study is often called X; we write X  Poi(A). Important assumptions
about Poisson observations are that two events cannot occur at exactly the same
location in space or time, that the occurence of an event at location fi does not
influence whether an event occurs at any other location £2, and the rate at which
events arise does not vary over the domain of study.
   When a Poisson experiment is observed, X will turn out to be a nonnegative
integer. The associated probabilities are given by Equation L6.

                                             Ake e-
                              P[X=k |A] =       !.(1.6)

   One of the main themes of statistics is the quantitative way in which data help
us learn about the phenomenon we are studying. Example;L4 shows how this
works when we want to learn about the rate A of a Poisson distribution.

Example 1.4 (Seedlings in a Forest)
Tree populations move by dispersing their seeds. Seeds become seedlings, seedlings be-
come saplings, and saplings become adults which eventually produce more seeds. Over
time, whole populations may migrate in response to climate change. One instance oc-
curred at the end of the Ice Age when species that had been sequestered in the south
were free to move north. Another instance may be occurring today in response to global
warming. One critical feature of the migration is its speed. Some of the factors deter-
mining the speed are the typical distances of long range seed dispersal, the proportion
of seeds that germinate and emerge from the forest floor to become seedlings, and the
proportion of seedlings that survive each year.
   To learn about emergence and survival, ecologists return annually to forest quadrats
(square meter sites) to count seedlings that have emerged since the previous year. One
such study was reported in Lavine:e l[2002]. A fundamental quantity of interest is
the rate A at which seedlings emerge. Suppose that, in one quadrat, three new seedlings
are observed. What does that say about A?


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


19


    Different values of A yield different values of P[X
of A we see how well each one explains the data X
for different values of A. For example,


3| A]. To compare different values
= 3; i.e., we compare P[X = 3| A]


                                              13e-1
                          P[X =3| A= 1] =      3!     0.06
                                               3!
                                             23e-2
                          P[X=3 A=2] =O0.18
                                               3!
                                             33c-3
                          P[X = 3|A=3] =            ~0.22
                                               3!
                                             43c-4
                          P[X =3|A =4] =            =0.14
                                               3!
In other words, the value A = 3 explains the data almost four times as well as the
value A = 1 and just a little bit better than the values A = 2 and A = 4. Figure16
shows P[X = 3| A] plotted as a function of A. The figure suggests that P[X = 3| A] is
maximized by A = 3. The suggestion can be verified by differentiating Equation 1 with
respect to lambda, equating to 0, and solving. The figure also shows that any value of
A from about 0.5 to about 9 explains the data not too much worse than A = 3.


11
n


O

0


0
0


  0        2       4        6        8       10

                     lambda


Figure 1.6: P[X = 3| A] as a function of A


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


   Figure  was produced by the following snippet.


20


lam <- seq ( 0,


10, length=50 )


y <- dpois


( 3, lam )


    plot ( lam, y, xlab="lambda", ylab="P[x=3]", type="l" )


Note:

   " seq stands for "sequence". seq(0,10, length=50) produces a sequence of 50
     numbers evenly spaced from 0 to 10.

   " dpois calculates probabilities for Poisson distributions the way dbinom does
     for Binomial distributions.

   " plot produces a plot. In the plot (.. .) command above, lam goes on the
     x-axis, y goes on the y-axis, xlab and ylab say how the axes are labelled, and
     type="1" says to plot a line instead of indvidual points.


Making and interpreting plots is a big part of statistics. Figure


is a good


example. Just by looking at the figure we were able to tell which values of A are
plausible and which are not. Most of the figures in this book were produced in R.


1.3.3 The Exponential Distribution

It is often necessary to model a continuous random variable X whose density de-
creases away from 0. Some examples are

Customer service time on hold at a help line

Neurobiology time until the next neuron fires

Seismology time until the next earthquake

Medicine remaining years of life for a cancer patient

Ecology dispersal distance of a seed

   In these examples it is expected that most calls, times or distances will be short
and a few will be long. So the density should be large near x = 0 and decreasing


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


21


as x increases. A useful pdf for such situations is the Exponential density


p(x) =  e


for x > 0.


(1.7)


We say X has an exponential distribution with parameter A and write X ~ Exp(A).
Figure    shows exponential densities for several different values of A.


o-


00 -


11-


.
                             lambda = 2
                        - -- lambda = 1
                        .  ambda = 0.2
                        l - ambda=0.1


     1


N\


0.0


0.5


1.0


1.5


2.0


                 x
Figure 1.7: Exponential densities


Figure    was produced by the following snippet.


x <- seq ( 0,
lam <- c ( 2,


2, length=40 ) # 40 values from 0 to 2


1,


.2, .1 )


# 4 different values of lambda


y <- matrix ( NA,


40, 4 )     # y values for plotting


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


22


1.3.4 The Normal Distribution
It is often necessary to model a continuous random variable Y whose density is
mound-shaped. Some examples are


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS                               23

Biological Anthropology Heights of people

Oceanography Ocean temperatures at a particular location

Quality Control Diameters of ball bearings

Education SAT scores
In each case the random variable is expected to have a central value around which
most of the observations cluster. Fewer and fewer observations are farther and
farther away from the center. So the pdf should be unimodal - large in the center
and decreasing in both directions away from the center. A useful pdf for such
situations is the Normal density
                                      1  1  _ya
                           p Y) =-     e-      .                      (1.8)

We say Y has a Normal distribution with mean y and standard deviation o and write
Y    N(p, or). Figure   shows Normal densities for several different values of
(p, or). As illustrated by the figure, y controls the center of the density; each pdf is
centered over its own value of p. On the other hand, o controls the spread. pdf's
with larger values of o are more spread out; pdf's with smaller o are tighter.

   Figure    was produced by the following snippet.

   x <- seq ( -6, 6, len=100 )
   y <- cbind ( dnorm ( x, -2, 1 ),
                  dnorm ( x, 0, 2 ),
                  dnorm ( x, 0, .5 ),
                  dnorm ( x, 2, .3 ),
                  dnorm ( x, -.5, 3 )
                )
    matplot ( x, y, type="1", col=1 )
    legend ( -6, 1.3, paste( "mu =", c(-2,0,0,2,-.5),
                               "; sigma  =   c(1,2,.5,.3,3)  ),
              lty=1:5, col=1, cex=.75 )


   * dnorm(.. .) computes the Normal pdf. The first argument is the set of x
     values; the second argument is the mean; the third argument is the standard
     deviation.


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS                             2


24


CM


0


Co
0


0


0


    mu = -2;sigma
  ---mu = 0; sigma
  .mu =0; sigma
    mu =2;sigma
-- mu=-0.5;sigr


a=1
- 2
0.5
=0.3


    na 3


-6


1      1

-4    -2


0


2     4     6


              X


Figure 1.8: Normal densities


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


25


   As a further illustration, Figure  shows a histogram of 105 ocean tempera-
tures (0C) recorded in the Atlantic Ocean from about 1938 to 1997 at a depth of
1000 meters, near 45 degrees North latitude and 30 degrees West longitude. The
N(5.87, .72) density is superimposed on the histogram. The Normal density repre-
sents the data moderately well. We will study ocean temperatures in much more
detail in a series of examples beginning with Example


.U)
a)
0


Col
0


0


N\
O


Q
O


I    I     I    I    I    I     I    I


                   4.0  4.5   5.0  5.5  6.0  6.5   7.0  7.5

                                 temperature


Figure 1.9: Ocean temperatures at 45°N, 30°W, 1000m depth. The N(5.87, .72)
density.


   Figure    was produced by

   hist ( y, prob=T, xlab="temperature", ylab="density",
            ylim=c(0,.6), main="" )


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


26


    t <- seq ( 4, 7.5, length=40 )
    lines ( t, dnorm ( t, mean(y), sd(y) ) )


    " The 105 temperatures are in a vector y.

    " hist produces a histogram. The argument prob=T causes the vertical scale to
    be probability density instead of counts.

    " The line t <- ... sets 40 values of t in the interval [4, 7.5] at which to
    evaluate the Normal density for plotting purposes.

    " lines displays the Normal density.

As usual, you should try to understand the R commands.

   The function rnorm(n,mu, sig) generates a random sample from a Normal dis-
tribution. n is the sample size; mu is the mean; and sig is the standard deviation.
To demonstrate, we'll generate a sample of size 100 from the N(5.87, .72) density,
the density in Figure   , and compare the sample histogram to the theoretical
density. Figure  (a) shows the comparison. It shows about how good a fit can
be expected between a histogram and the Normal density, for a sample of size
around 100 in the most ideal case when the sample was actually generated from
the Normal distribution. It is interesting to consider whether the fit in Figure  is
much worse.

   Figure     (a) was produced by

     samp <- rnorm ( 100, 5.87, .72 )
     y.vals <- seq ( 4, 7.5, length=40 )
     hist ( samp, prob=T, main="(a)",
            xlim=c(4,7.5), xlab="degrees C",
            ylim=c(0,.6), ylab="density" )
    lines ( y.vals, dnorm(y.vals,5.87,.72) )


    When working with Normal distributions it is extremely useful to think in terms
of units of standard deviation, or simply standard units. One standard unit equals
one standard deviation. In Figure  (a) the number 6.6 is about 1 standard


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS                          27


                                   (a)


                        4.0 4. 5. 55  60     .5  7.0   7.5
            0


          I
                d)(b)


              3940         41    4253         44    45
                                 degrees F


                                   (c)


              0I             I        I       I       I
                     -2      -1      0       1        2
                                standard units


Figure 1.10: (a): A sample of size 100 from N(5.87, .72) and the N(5.87, .72) density.
(b): A sample of size 100 from N(42.566, 1.296) and the N(42.566, 1.296) density.
(c): A sample of size 100 from N(0, 1) and the N(0, 1) density.


﻿


1.3. PARAMETRIC FAMILIES OF DISTRIBUTIONS


28


unit above the mean, while the number 4.5 is about 2 standard units below the
mean. To see why that's a useful way to think, Figure   (b) takes the sample
from Figure     (a), multiplies by 9/5 and adds 32, to simulate temperatures
measured in OF instead of 0C. The histograms in panels (a) and (b) are slightly
different because R has chosen the bin boundaries differently; but the two Normal
curves have identical shapes. Now consider some temperatures, say 6.50C = 43.7°F
and 4.50C = 40.1°F. Corresponding temperatures occupy corresponding points on
the plots. A vertical line at 6.5 in panel (a) divides the density into two sections
exactly congruent to the two sections created by a vertical line at 43.7 in panel (b).
A similar statement holds for 4.5 and 40.1. The point is that the two density
curves have exactly the same shape. They are identical except for the scale on the
horizontal axis, and that scale is determined by the standard deviation. Standard
units are a scale-free way of thinking about the picture.
   To continue, we converted the temperatures in panels (a) and (b) to standard
units, and plotted them in panel (c). Once again, R made a slightly different choice
for the bin boundaries, but the Normal curves all have the same shape.


   Panels (b) and (c) of Figure   were produced by

   y2samp <- samp * 9/5 + 32
   y2.vals <- y.vals * 9/5 + 32
   hist ( y2samp, prob=T, main="(b)",
            xlim=c(39.2,45.5), xlab="degrees F",
            ylim=c(0,1/3), ylab="density" )
    lines ( y2.vals, dnorm(y2.vals,42.566,1.296) )

    zsamp <- (samp-5.87) / .72
    z.vals <- (y.vals-5.87) / .72
    hist ( zsamp, prob=T, main="(c)",
            xlim=c(-2.6,2.26), xlab="standard units",
            ylim=c (0, .833), ylab="density" )
    lines ( z.vals, dnorm(z.vals,0,1) )


    Let Y ~ N(p, o-) and define a new random variable Z = (Y - p)/o-. Z is in
standard units. It tells how many standard units Y is above or below its mean p.
What is the distribution of Z? The easiest way to find out is to calculate pz, the


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


29


density of Z, and see whether we recognize it. From Theorem
                                               1    12
                       Pz(z) = py(o-z + p)- =     e 2

which we recognize as the N(0, 1) density. I.e., Z ~ N(0, 1). The N(0, 1) distribution
is called the standard Normal distribution.


1.4 Centers, Spreads, Means, and Moments

Recall Figure    (pg.   ). In each panel there is a histogram of a data set along
with an estimate of the underlying pdf or pmf p. In each case we have found a
distribution that matches the data reasonably well, but the distributions we have
drawn are not the only ones that match well. We could make modest changes to
either distribution and still have a reasonably good match. But whatever pdf we
propose for the top panel should be roughly mound shaped with a center around
80 and a spread that ranges from about 60 to about 100. And in the bottom panel
we would want a distribution with a peak around 2 or 3 and a longish right hand
tail.
   In either case, the details of the distribution matter less than these central fea-
tures. So statisticians often need to refer to the center, or location, of a sample
or a distribution and also to its spread. Section gives some of the theoretical
underpinnings for talking about centers and spreads of distributions.

Example 1.5
Physical oceanographers study physical properties such as temperature, salinity, pressure,
oxygen concentration, and potential vorticity of the world's oceans. Data about the
oceans' surface can be collected by satellites' bouncing signals off the surface. But
satellites cannot collect data about deep ocean water. Until as recently as the 1970s, the
main source of data about deep water came from ships that lower instruments to various
depths to record properties of ocean water such as temperature, pressure, salinity, etc.
(Since about the 1970s oceanographers have begun to employ neutrally buoyant floats. A
brief description and history of the floats can be found on the web at www. soc. soton.
ac.uk/JRD/HYDRO/shb/float.history.html.) Figure          shows locations, called
hydrographic stations, off the coast of Europe and Africa where ship-based measurements
were taken between about 1910 and 1990. The outline of the continents is apparent on
the right-hand side of the figure due to the lack of measurements over land.
   Deep ocean currents cannot be seen but can be inferred from physical properties.
Figure     shows temperatures recorded over time at a depth of 1000 meters at nine


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


30


different locations. The upper right panel in Figure 1.12 is the same as the top panel of
Figure 13. Each histogram in Figure 1.1has a black circle indicating the "center" or
"location" of the points that make up the histogram. These centers are good estimates
of the centers of the underlying pdf's. The centers range from a low of about 50 at
latitude 45 and longitude -40 to a high of about 90 at latitude 35 and longitude -20. (By
convention, longitudes to the west of Greenwich, England are negative; longitudes to the
east of Greenwich are positive.) It's apparent from the centers that for each latitude,
temperatures tend to get colder as we move from east to west. For each longitude,
temperatures are warmest at the middle latitude and colder to the north and south.
Data like these allow oceanographers to deduce the presence of a large outpouring of
relatively warm water called the Mediterranean tongue from the Mediterranean Sea into
the Atlantic ocean. The Mediterranean tongue is centered at about 1000 meters depth
and 35°N latitude, flows from east to west, and is warmer than the surrounding Atlantic
waters into which it flows.


   There are many ways of describing the center of a data sample. But by far the
most common is the mean. The mean of a sample, or of any list of numbers, is just
the average.

Definition 1.1 (Mean of a sample). The mean of a sample, or any list of numbers,
x1,..., x is
                          mean of x1, ... , x   -   xi.                     (1.9)
                                               n

   The black circles in Figure'1.12 are means. The mean of x1, ... , xn is often
denoted z. Means are often a good first step in describing data that are unimodal
and roughly symmetric.
   Similarly, means are often useful in describing distributions. For example, the
mean of the pdf in the upper panel of Figure 13 is about 8.1, the same as the mean
of the data in same panel. Similarly, in the bottom panel, the mean of the Poi(3.1)
distribution is 3.1, the same as the mean of the discoveries data. Of course we
chose the distributions to have means that matched the means of the data.
   For some other examples, consider the Bin(n, p) distributions shown in Fig-
ure 1.5. The center of the Bin(30, .5) distribution appears to be around 15, the
center of the Bin(300, .9) distribution appears to be around 270, and so on. The
mean of a distribution, or of a random variable, is also called the expected value or
expectation and is written E(X).


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


31


0


0D


0
IC)-


(1)


-xppJ


0 _
CY)


0_
CV


-40


  I      I      I       I

-30    -20     -10     0


longitude


Figure 1.11: hydrographic stations off the coast of Europe and Africa


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


32


latitude = 45
longitude = -40


latitude = 45
longitude = -30


latitude = 45
longitude = -20


0
(0

0
QZI
C\
N


0


10
T


0


10
T


     4 6 8

     temperature
     n = 213

   latitude = 35
   longitude = -40

0


    temperature
       n =37

   latitude = 25
   longitude = -40


   4 6 8

   temperature
     n = 105

 latitude = 35
longitude = -30

     -I n


co
cc

C\J-
N


     4 6 8

     temperature
     n=112

  latitude = 35
  longitude = -20


0 -J-

     4 6 8

     temperature
     n = 44

  latitude = 25
  longitude = -20


   4 6 8

   temperature
     n = 24

 latitude = 25
longitude = -30


0


0
T


4 6 8

temperature
  n=47


10


10


4 6 8

temperature
  n=35


10


10


4 6 8

temperature
  n=27


Figure 1.12: Water temperatures (°C) at 1000m depth, latitude 25, 35, 45 degrees
North longitude 20, 30, 40 degrees West


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


33


Definition 1.2 (Mean of a random variable). Let X be a random variable with cdf
Fx and pdf px. Then the mean of X (equivalently, the mean of Fx) is

                  IB(X)   ( >jKi P[X = i] if X is discrete         (1.10)
                          lf x px(x) dz if X is continuous
   The logic of the definition is that E(X) is a weighted average of the possible
values of X. Each value is weighted by its importance, or probability. In addition
to E(X), another common notation for the mean of a random variable X is pX.
   Let's look at some of the families of probability distributions that we have al-
ready studied and calculate their expectations.
Binomial If X  Bin(n, p) then

                     E(X)    ZiP[x=i]
                             i=0

                                    (1Z-)"


                                   jp( (1 -p)"
                             1    iJ                                  (1.11)

                          = np      . 1!      . ~lp - (1 - p)"-i


                                  (- n      A !- 1) in i-

                          = np
     The first five equalities are just algebra. The sixth is worth remembering. The
     sum Ej"_"-- is the sum of the probabilities of the Bin(n - 1, p) distribution.
     Therefore the sum is equal to 1. You may wish to compare IE(X) to Figure 1.5.

Poisson If X   Poi(A) then
                                    E(X) =_A.
     The derivation is left as Exercise

Exponential If X  Exp(A) then

                   EB(X)   f   x p(xc) d= A-1 f ze-/A dc - A.


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


34


     Use integration by parts.

Normal If X    N(p, o) then E(X) = p. The derivation is left as Exercise 18.

   Statisticians also need to measure and describe the spread of distributions, ran-
dom variables and samples. In Figure 112, the spread would measure how much
variation there is in ocean temperatures at a single location, which in turn would
tell us something about how heat moves from place to place in the ocean. Spread
could also describe the variation in the annual numbers of "great" discoveries, the
range of typical outcomes for a gambler playing a game repeatedly at a casino or
an investor in the stock market, or the uncertain effect of a change in the Federal
Reserve Bank's monetary policy, or even why different patches of the same forest
have different plants on them.
   By far the most common measures of spread are the variance and its square
root, the standard deviation.

Definition 1.3 (Variance). The variance of a sample Y1, ... , yn is

                       Var(yi,... , y) = n-1  (y - y2

The variance of a random variable Y is

                            Var(Y) = E((Y - py)2)

Definition 1.4 (Standard deviation). The standard deviation of a sample y1, ..., yn
is
                      SD (yi, . . . ,y) =  n-1 (yi - y))2.
The standard deviation of a random variable Y is

                           SD(Y) =/E((Y -py)2).

   The variance (standard deviation) of Y is often denoted o4 (c-y). The variances
of common distributions will be derived later in the book.
   Caution: for reasons which we don't go into here, many books define the vari-
ance of a sample as Var(yi, ..., y) =(n - 1)1  (y - Y)2. For large n there is no
practical difference between the two definitions. And the definition of variance of
a random variable remains unchanged.
   While the definition of the variance of a random variable highlights its inter-
pretation as deviations away from the mean, there is an equivalent formula that is
sometimes easier to compute.


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS

Theorem 1.2. If Y is a random variable, then Var(Y) =E(Y2)


Proof.


35


(gy)2.


Var(Y) = E((Y -
       = IE(Y2= -
       =IE(Y2)
       =IE(Y2)


EY)2)
2YEY + (EY)2)
-2(EY)2 + (EY)2
- (IY)2


D-


   To develop a feel for what the standard deviation measures, Figure repeats
Figure;1.12 and adds arrows showing + 1 standard deviation away from the mean.
Standard deviations have the same units as the original random variable; variances
have squared units. E.g., if Y is measured in degrees, then SD(Y) is in degrees but
Var(Y) is in degrees2. Because of this, SD is easier to interpret graphically. That's
why we were able to depict SD's in Figure114.
   Most "mound-shaped" samples, that is, samples that are unimodal and roughly
symmetric, follow this rule of thumb:


   " about 2/3 of the sample falls within about 1 standard deviation of the mean;


   " about 95% of the sample falls within about 2 standard deviations of the mean.


The rule of thumb has implications for predictive accuracy. If x1, .. .., xn are a sam-
ple from a mound-shaped distribution, then one would predict that future obser-
vations will be around X- with, again, about 2/3 of them within about one SD and
about 95% of them within about two SD's.
   To illustrate further, we'll calculate the SD of a few mound-shaped random
variables and compare the SD's to the pdf's.


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS                        36

Binomial Let Y  Bin(30, .5).

        Var(Y) =E(Y2) - (EY)2

                 =y2 (30l .530 - 152
                 y=0
                 30  (3
                 =        30!     .530 - 152
                    (y - 1)!(30 - y)!
                    29
              = 15Z(v + 1) '(29   ),.52  152                  (1.12)
                  v=0
                        v!(9 -v)

                 -'59 (29V !V               9!     29
                 15(   V +       .-52+.5-)152
                          !(9- v)!.     o v!(29 - v)!

                  1 29+11

                15
                2

     and therefore SD(Y) = 15/2 2.7. (See Exercises 19 and 20.)

Normal Let Y   N(0, 1).

                        Var(Y) =E(Y2)

                                 /       eZdy
                                   00 \/   -
                                   00
                               = 2        e 2 dy              (1.13)

                               ~2fo,   1ezydy


     and therefore SD(Y) = 1. (See Exercises 19 and 20.)

Figure 1i1 shows the comparison. The top panel shows the pdf of the Bin(30, .5)
distribution; the bottom panel shows the N(0, 1) distribution.


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS               37


               6                 000

               O                0  0
               6
               -~0                  0

               O              0      0

                             0 ~0

                 c 00000000 0~000

                   0    5   10   15   20   25   30


                                 y


               0
               co)
               0


               0

                  -3   -2   -1    0   1    2    3


                                 y


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


38


Figure


was produced by the following R code.


par ( mfrow=c(2,1) )


y <- 0:30
sd <- sqrt


( 15 / 2 )


plot ( y, dbinom(y,30,.5), ylab="p(y)" )
arrows ( 15-2*sd, 0, 15+2*sd, 0, angle=60,
         code=3, lwd=2 )


length=. 1,


length=. 1,


text ( 15,
arrows ( 1


.008, "+/- 2 SD's" )


5-sd, .03, 15+sd,


.03, angle=60,


code=3, lwd=2 )


text ( 15,


.04, "+/- 1 SD" )


y <- seq(-3,3,length=60)
plot ( y, dnorm(y,0,1), ylab="p(y)", type="1"


)
code=3, lwd=2 )


arrows ( -2, .02, 2, .02, angle=60,


length=. 1,


text ( 0,


.04, "+/- 2 SD's" )


arrows ( -1, .15,


1, .15, angle=60,


length=.1, code=3,


lwd=2 )


text ( 0,


.17, "+/- 1 SD" )


* arrows(x0, y0, x1, y1, length,


angle, code, ...)


adds arrows to a


     plot. See the documentation for the meaning of the arguments.

   . text adds text to a plot. See the documentation for the meaning of the
     arguments.


Definition 1.5 (Moment). The r'th moment of a sample y1,... , y,, or random vari-
able Y is defined as


         n-- E((y( - )r
IE((Y - py)'))


        (for samples)
(for random variables)


   Variances are second moments. Moments above the second have little applica-
bility.
   R has built-in functions to compute means and variances and can compute other
moments easily. Note that R uses the divisor n - 1 in its definition of variance.


﻿


1.4. CENTERS, SPREADS, MEANS, AND MOMENTS


39


  # Use R to calculate moments of the Bin(100,.5) distribution
  x <- rbinom ( 5000, 100, .5 )
  m <- mean ( x )  #the mean
  v <- var   ( x ) # the variance
  s <- sqrt (v)      # the SD
  mean ( (x-m)3 ) # the third moment
  our.v <- mean ( (x-m)2 ) # our variance
  our.s <- sqrt ( our.v )      # our standard deviation
  print ( c ( v, our.v ) ) # not quite equal
  print ( c ( s, our.s ) ) # not quite equal

  * rbinom(...) generates random draws from the binomial distribution. The
     5000 says how many draws to generate. The 100 and .5 say that the draws
     are to be from the Bin(100, .5) distribution.

   Let h be a function. Then E[h(Y)] = f h(y)p(y) dy (L h(y)p(y) in the discrete
case) is the expected value of h(Y) and is called a generalized moment. There are
sometimes two ways to evaluate E[h(Y)]. One is to evaluate the integral. The other
is to let X = h(Y), find px, and then evaluate IE[X] = f xpx(x) dx. For example,
let Y have pdf fy(y) = 1 for y E (0, 1), and let X = h(Y) = exp(Y).
Method 1
                  E[h(Y)] f exp(y) dy = exp(y)| = e - 1.

Method 2

                       px(x) = py(log(x)) dy/dx= 1/x

                    E[X]=     xpx(x)dz       1dz=e- 1
                            1              1
   If h is a linear function then E[h(Y)] has a particularly appealing form.
Theorem 1.3. If X = a + bY then E[X] = a + bE[Y].
Proof. We prove the continuous case; the discrete case is left as an exercise.

                      EX = f(a + by)fy(y) dy

                          = afy(y) dy + bfyfy(y) dy

                          =a+bIEY


D-


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


40


   There is a corresponding theorem for variance.

Theorem 1.4. If X = a + bY then Var(X) = b2 Var(Y).

Proof. We prove the continuous case; the discrete case is left as an exercise. Let
y   IE[Y].

                       Var(X) = E[(a + bY - (a+ by))2]
                               = E~b2(Y -_2
                               = b2 Var(Y)


1.5 Joint, Marginal and Conditional Probability

Statisticians often have to deal simultaneously with the probabilities of several
events, quantities, or random variables. For example, we may classify voters in a
city according to political party affiliation and support for a school bond referen-
dum. Let A and S be a voter's affiliation and support, respectively.

           A    rD   if Democrat       and     S=Y       if in favor
                 R   if Republican.                  N   if opposed

Suppose a polling organization finds that 80% of Democrats and 35% of Republi-
cans favor the bond referendum. The 80% and 35% are called conditional probabil-
ities because they are conditional on party affiliation. The notation for conditional
probabilities is pS IA. As usual, the subscript indicates which random variables
we're talking about. Specifically,

                  ps A(Y D) = 0.80;   ps A(N D) = 0.20;
                  ps| A(Y R) = 0.35;  psIA(N |R) = 0.65.

We say "the conditional probability that S = N given A = D is 0.20", etc.
   Suppose further that 60% of voters in the city are Democrats. Then 80% of
60% = 48% of the voters are Democrats who favor the referendum. The 48% is
called a joint probability because it is the probability of (A = D, S = Y) jointly.
The notation is paS(D, Y) =.48. Likewise, PAS(D, N)  .12; PAS(R, Y) =-.14;
and pa,s(R, N) =0.26. Table L1summarizes the calculations. The quantities


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


41


latitude = 45
longitude = -40


latitude = 45
longitude = -30


latitude = 45
longitude = -20


0
(0

0

0

0


0
N\


0


0


0
N\


0


0


4 6 8


4 6 8


4 6 8


   temperature
   n = 213

 latitude = 35
longitude = -40


   temperature
   n = 105

 latitude = 35
longitude = -30

-1      n


0


0


0


4 6 8

temperature
  n=37


co
cc


0


    temperature
      n=112

   latitude = 35
 longitude = -20

 _T-

 o_
0-J


     4 6 8

     temperature
     n = 44

   latitude = 25
 longitude = -20


4 6 8


temperature
  n=24


latitude = 25
longitude = -40


0
N\

0


0


4 6 8

temperature
  n=47


  latitude = 25
  longitude = -30


LO -
0


     4 6 8

     temperature
     n=35


10


0


4 6 8

temperature
  n=27


Figure 1.14: Water temperatures (°C) at 1000m depth, latitude 25, 35, 45 degrees
North, longitude 20, 30, 40 degrees West, with standard deviations


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


42


                                     For Against
                       Democrat     48%     12%     60%
                       Republican   14%     26%     40%
                                    62%     38%

              Table 1.1: Party Affiliation and Referendum Support


.60, .40, .62, and .38 are called marginal probabilities. The name derives from
historical reasons, because they were written in the margins of the table. Marginal
probabilities are probabilities for one variable alone, the ordinary probabilities that
we've been talking about all along.
    The event A = D can be partitioned into the two smaller events (A = D, S = Y)
and (A=D,S=N). So

               pA(D) = .60 = .48 + .12= PA,s(DY) + PA,S(D, N).

The event A = R can be partitioned similarly. Too, the event S = Y can be
partitioned into (A = D, S = Y) and (A = R, S = Y). So

               ps(Y) = .62 = .48 + .14= PA,s(D, Y) + PA,s(R, Y).

    These calculations illustrate a general principle: To get a marginal probability
for one variable, add the joint probabilities for all values of the other variable. The
general formulae for working simultaneously with two discrete random variables
X and Y are

               fx,y(x, y) =-fx(x) -'fy x(y Iz) =fy(y) -fx y((z y)  (1.14)
                 fx(x) =    fx,y(x,y)     fy(y)   Zfx,y(x,y)
                          Y                        x
Sometimes we know joint probabilities and need to find marginals and condition-
als; sometimes it's the other way around. And sometimes we know fx and fy x
and need to find fy or fxIy. The following story is an example of the latter. It is a
common problem in drug testing, disease screening, polygraph testing, and many
other fields.
    The participants in an athletic competition are to be randomly tested for steroid
use. The test is 90% accurate in the following sense: for athletes who use steroids,
the test has a 90% chance of returning a positive result; for non-users, the test has
a 10% chance of returning a positive result. Suppose that only 30% of athletes use


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


43


steroids. An athlete is randomly selected. Her test returns a positive result. What
is the probability that she is a steroid user?
   This is a problem of two random variables, U, the steroid use of the athlete and
T, the test result of the athlete. Let U = 1 if the athlete uses steroids; U = 0 if not.
Let T = 1 if the test result is positive; T = 0 if not. We want fu iT(1|1). We can
calculate fu T if we know fu,; and we can calculate fuT because we know fu and
fT u . Pictorially,
                           fu, fT U -   fu,T -   fu T
The calculations are

              fu,T(O, 0) = (.7)(.9) = .63 fu,T(O, 1) = (.7)(.1) = .07
              fu,T(i, 0)= (.3)(.1) = .03  fu,T(i, 1) = (.3)(.9) = .27

so

                fT(0) = .63+.03= .66      fT(i) =.07+.27= .34

and finally

                  fu T(1|1) = fu,T(1, 1)/fT(1) = .27/.34   .80.

In other words, even though the test is 90% accurate, the athlete has only an 80%
chance of using steroids. If that doesn't seem intuitively reasonable, think of a
large number of athletes, say 100. About 30 will be steroid users of whom about
27 will test positive. About 70 will be non-users of whom about 7 will test positive.
So there will be about 34 athletes who test positive, of whom about 27, or 80%
will be users.
   Table   ,2 is another representation of the same problem. It is important to
become familiar with the concepts and notation in terms of marginal, conditional
and joint distributions, and not to rely too heavily on the tabular representation
because in more complicated problems there is no convenient tabular representa-
tion.
   Example L6 is a further illustration of joint, conditional, and marginal distri-
butions.

Example 1.6 (Seedlings)
Example 1_4 introduced an observational experiment to learn about the rate of seedling
production and survival at the Coweeta Long Term Ecological Research station in western
North Carolina. For a particular quadrat in a particular year, let N be the number of


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


44


        T=O     T=1
U = 0     .63     .07    .70
U = 1     .03     .27    .30
          .66     .34


                      Table 1.2: Steroid Use and Test Results


new seedlings that emerge. Suppose that N     Poi(A) for some A > 0. Each seedling
either dies over the winter or survives to become an old seedling the next year. Let 0
be the probability of survival and X be the number of seedlings that survive. Suppose
that the survival of any one seedling is not affected by the survival of any other seedling.
Then X      Bin(N, 0). Figure 1,15 shows the possible values of the pair (N, X). The
probabilities associated with each of the points in Figure 1.15 are denoted fN,x where, as
usual, the subscript indicates which variables we're talking about. For example, fN,x(3, 2)
is the probability that N = 3 and X = 2.


CD -


0 -


                              *
                        * *
                  *     *     *
            *     *     *     *
      *     *     *     *     *        Eu.
*     *     *     *     *     *


0


1


2


3


4


5


6


7
7


N


Figure 1.15: Permissible values of N
number that survive.


and X, the number of new seedlings and the


   The next step is to figure out what the joint probabilities are. Consider, for example,
the event N = 3. That event can be partitioned into the four smaller events (N


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


45


3, X = 0), (N = 3, X = 1), (N = 3, X = 2), and (N = 3, X = 3). So

              fN(3) =_fN,x(3, 0) + fN,X(3, 1) + fN,X(3, 2) + fN,x(3, 3)

The Poisson model for N says fN(3) = P[N = 3] = e-AA3/6. But how is the total
e-AA3/6 divided into the four parts? That's where the Binomial model for X comes in.
The division is made according to the Binomial probabilities


)3


Cll e(1


0)2


(3)2(i -


The e-AA3/6 is a marginal probability like the 60% in the affliation/support problem.
The binomial probabilities above are conditional probabilities like the 80% and 20%; they
are conditional on N = 3. The notation is fx|N(2 3) or P[X = 2 N = 3]. The joint
probabilities are


fN,X(3,)                  3
                e-as
   fN,X(3,2)2


             e-ak3
fN,x(3, 1)


0)2


0)


             e-ad3
fN,x (33)


In general,


fN,x(mv    Xf|(N)fxN  In)=e A n(I)>OX(1


.Q)n-~X


   An ecologist might be interested in fx, the pdf for the number of seedlings that will
be recruited into the population in a particular year. For a particular number x, fX(x)
is like looking in Figure   along the horizontal line corresponding to X = x. To get
fX(x) = P[X = z], we must add up all the probabilities on that line.


fx(x)  ZfN,x(,x)


ri~x


0)-X


    0eA-0)(A(1 - 0))"- e-A(A0)x


e-A(AO)X 00 e-A-0)(A(1 - Q))z
          z=0
e-a (AO)x
    z!


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


46


The last equality follows since E,-z-.-.= 1 because it is the sum of probabilities from
the Poi(A(1 - 0)) distribution. The final result is recognized as a probability from the
Poi(A*) distribution where A* =AO. So X   Poi(A*).
    In the derivation we used the substitution z = n-c. The trick is worth remembering.

    For continuous random variables, conditional and joint densities are written
px Y(ccIy) and px,y(x, y) respectively and, analgously to Equation 14 we have

                 px,y (x,y) = px (x)pyi x(Y |cc) = pY (Y)pxY (x | y)  (1.15)

             px() =       px,y(x, y) dy  py(y) =    px,y(x, y) dx

The logic is the same as for discrete random variables. In order for (X = x, Y = y)
to occur we need either of the following.

   1. First X = x occurs then Y = y occurs. The "probability" of that happening
      is just px(x) x pr x(y cc), the "probability" that X = x occurs times the
      "probability" that Y = y occurs under the condition that X = x has already
      occured. "Probability" is in quotes because, for continuous random variables,
      the probability is 0. But "probability" is a useful way to think intuitively.

   2. First Y = y occurs then X = x occurs. The reasoning is similar to that in
      item 1.

   Just as for single random variables, probabilities are integrals of the density.
If A is a region in the (x, y) plane, P[(X, Y) E A] = fAp(x, y) dx dy, where fA...
indicates a double integral over the region A.
   Just as for discrete random variables, the unconditional density of a random
variable is called its marginal density; px and py are marginal densities. Let B c R
be a set. Since a density is "the function that must be integrated to calculate a
probability", on one hand, P [X E B] = fB px(x) dc. On the other hand,

            P[X E B] = P[(X, Y) E B x R] =   (   px,y (x, y) dy  d

which implies px(x) =f Lpx,y (c, y) dy.
   An example will help illustrate. A customer calls the computer Help Line. Let
X be the amount of time he spends on hold and Y be the total duration of the call.
The amount of time a consultant spends with him after his call is answered is W
Y - X. Suppose the joint density is px,y(cc, y) =e-" in the region 0 <cc < y < o'o.


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


47


1. What is the marginal density of X?


p(x) = fp(x, y) dy


  00
I e-dy
x


00

x


e-x


2. What is the marginal density of Y?


p(y) =   p(x, y) d


  e
0


-y dx = ye-'


3. What is the conditional density of X given Y?


p(c IY)=p(cxy)
          p(y)


1


y


4. What is the conditional density of Y given X?


p(y   ) - p()-
          p(xc)


eXY


5. What is the marginal density of W?


        d               d    o   x/
p(w)    P[W < w] =  f00fX+w
           dwdw JJ                    - dd

     =           -e-        d
       dw /           x
               e-x(1 - e-") dz = e-w


   Figure      illustrates the Help Line calculations. For questions 1 and 2, the
answer comes from using Equations  .The only part deserving comment is the
limits of integration. In question 1, for example, for any particular value X = x,
Y ranges from x to oc, as can be seen from panel (a) of the figure. That's where
the limits of integration come from. In question 2, for any particular y, X E (0, y),
which are the limits of integration. Panel (d) shows the conditional density of X
given Y for three different values of Y. We see that the density of X is uniform on
the interval (0, y). See Section  for discussion of this density. Panel (d) shows
the conditional density of Y given X for three different values of X. It shows,
first, that Y > X and second, that the density of Y decays exponentially. See


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


48


(a)


(b)


   CO)

>+ N\


   "T


a1-


co
O


QZI

Q

O


0   1   2   3   4


0  2  4  6   8 10


x


(d)


(c)


Q

  O


11-
>


O
N\


O


O
O


0  2  4  6   8 10


        y


        (e)


0.0  1.0  2.0   3.0


        x


        (f)


11-


co
O


QZI

Q

O


(0


C\J-

N


0  2  4  6   8 10


        y


0   1   2   3   4


        x


Figure 1.16: (a): the region of Rll2 where (X, Y) live; (b): the marginal density of

X; (c): the marginal density of Y; (d): the conditional density of X given Y for

three values of Y; (e): the conditional density of Y given X for three values of X;

(f): the region W ; w


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


49


Section 1.3.3 for discussion of this density. Panel (f) shows the region of integration
for question 5. Take the time to understand the method being used to answer
question 5.
   When dealing with a random variable X, sometimes its pdf is given to us and
we can calculate its expectation:

                             E(X)    fxp(x) dz.

(The integral is replaced by a sum if X is discrete.) Other times X arises more
naturally as part of a pair (X, Y) and its expectation is

                           E(X) =ff xp(x, y) dzdy.

The two formulae are, of course, equivalent. But when X does arise as part of a
pair, there is still another way to view p(x) and E(X):

               px(x)      (pxy (xy))py(y)dy = E (px y (x |y))      (1.16)

            E(X) =f (fpxy(c Y)dc)pY(y) dy = IE (IE(X | Y)).       (1.17)

The notation deserves some explanation. For any number x, pxY (ccl y) is a func-
tion of y, say g(y). The middle term in Equation116 is f g(y)p(y)dy, which equals
IE(g(y)), which is the right hand term. Similarly, E(X|Y) is a function of Y, say
h(Y). The middle term in Equation 1.7 is f h(y)p(y)dy, which equals IE(h(y)),
which is the right hand term.

Example 1.7 (Seedlings, continued)
Examples1.4 and 1,6 discussed (N, X), the number of new seedlings in a forest quadrat
and the number of those that survived over the winter. Example 1, suggested the
statistical model
                   N    Poi(A)    and    X |N ~Bin(N, 0).
Equation 117 shows that E(X) can be computed as
                 E(X) = E (E(X |IN)) = E(NO) O=BE(N) = BA.
Example 1.8 (Craps, continued)
Examples11,2 and 1: introduced the game of craps. Example    calculates the
chance of winning.

                          Let X =0     if shooter loses
                                   1i if shooter wins


﻿


1.5. JOINT, MARGINAL AND CONDITIONAL PROBABILITY


50


X has a Bernoulli distribution. We are trying to find

                         P[shooter wins] = px(1) = E(X).

(Make sure you see why px(1) = E(X).) Let Y be the outcome of the Come-out roll.
Equation 117 says

          E(X)= E (E(X|Y))
                =    IE(X|Y=2)P[Y=2]+IE(X|Y=3)P[Y=3]
                   +E(X    Y=4)P[Y=4]+IE(X         Y=5)P[Y=5]
                   +E(XY=6)P[Y=6]+E(XY=7)P[Y=7]
                   +IE(X|Y=8)P[Y=8]+IE(X|Y=9)P[Y=9]
                   +IE(X|Y =10)P[Y =10]+IE(X|Y =11)P[Y =11]
                   + E(X| Y =12) P[Y =12]3
                   -   x1     0    2                 3-B(
                =    0 x  -6+ 0 x 3  + E(X|IY= 4)
                         36       36                36
                   + E(X |Y = 5)     +E(X |Y=6)3
                                  36                36
                   +1x 36+IE(X         =8)    +E(X|Y=9)

                   +IE(X|Y = 10)      +1 x     +0 x     .
                                        36 3636
So it only remains to find E(X | Y = y) for y = 4, 5, 6, 8, 9, 10. The calculations are all
similar. We will do one of them to illustrate. Let w = E(X|Y = 5) and let z denote
the next roll of the dice. Once 5 has been established as the point, then a roll of the
dice has three possible outcomes: win (if z = 5), lose (if z = 7), or roll again (if z is
anything else). Therefore

                      w= 1 x 4/36 + 0 x 6/36 + w x 26/36
                                 (10/36)w = 4/36
                                    w = 4/10.

After similar calculations for the other possible points we find
              3 3     4 4     5 5     6    5 5     4 4     3 3    2
     E(X)=936        1036    1136    36    1136    1036    936    36 ~.493.


Craps is a very fair game; the house has only a slight edge.


﻿


1.6. ASSOCIATION, DEPENDENCE, INDEPENDENCE


51


1.6 Association, Dependence, Independence

It is often useful to describe or measure the degree of association between two ran-
dom variables X and Y. The R dataset iris provides a good example. It contains
the lengths and widths of setals and petals of 150 iris plants. The first several lines
of iris are
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  1           5.1          3.5          1.4          0.2 setosa
  2           4.9          3.0          1.4          0.2 setosa
  3           4.7          3.2          1.3          0.2 setosa
Figure     shows each variable plotted against every other variable. It is evident
from the plot that petal length and petal width are very closely associated with
each other, while the relationship between sepal length and sepal width is much
weaker. Statisticians need a way to quantify the strength of such relationships.

   Figure     was produced by the following line of R code.

   pairs ( iris[,1:41 )


   . pairs() produces a pairs plot, a matrix of scatterplots of each pair of vari-
     ables. The names of the variables are shown along the main diagonal of the
     matrix. The (i, j)th plot in the matrix is a plot of variable i versus variable
     j. For example, the upper right plot has sepal length on the vertical axis and
     petal width on the horizontal axis.


   By far the most common measures of association are covariance and correlation.
Definition 1.6. The covariance of X and Y is
                     Cov(X, Y) - E((X - px)(Y - py)
   In R, cov measures the covariance in a sample. Thus, cov(iris [,1:41) pro-
duces the matrix
                Sepal.Length Sepal.Width Petal.Length Petal.Width
  Sepal.Length    0.68569351 -0.04243400      1.2743154   0.5162707
  Sepal.Width    -0.04243400 0.18997942      -0.3296564  -0.1216394
  Petal.Length    1.27431544 -0.32965638   3.1162779   1.2956094
  Petal.Width     0.51627069 -0.12163937      1.2956094   0.5810063


﻿


1.6. ASSOCIATION, DEPENDENCE, INDEPENDENCE


52


Sepal .Length


2.0 3.0 4.0


     0ea.it


10


10

LO

10)

10

10


0.5 1.5 2.5


CO


CO

CJ

LO
N\


0


        OD
        0


Col


CY)


N\


Petal .Length


L0
N\

O


0

'^


Petal.Width


4.5 6.0 7.5


1 35 7


Figure 1.17: Lengths and widths of sepals and petals of 150 iris plants


﻿


1.6. ASSOCIATION, DEPENDENCE, INDEPENDENCE


53


in which the diagonal entries are variances and the off-diagonal entries are covari-
ances.
   The measurements in iris are in centimeters. To change to millimeters we
would multiply each measurement by 10. Here's how that affects the covariances.


> cov ( 10*iris[,1:4] )
              Sepal.Length
Sepal.Length     68.569351
Sepal.Width      -4.243400
Petal.Length    127.431544
Petal.Width      51.627069


Sepal. Width
  -4.243400
  18.997942
  -32.965638
  -12.163937


Petal.Length
   127.43154
   -32.96564
   311.62779
   129.56094


Petal. Width
   51.62707
   -12.16394
   129.56094
   58.10063


Each covariance has been multiplied by 100 because each variable has been multi-
plied by 10. In fact, this rescaling is a special case of the following theorem.

Theorem 1.5. Let X and Y be random variables. Then Cov(aX + b, cY + d)
ac Cov(X, Y).

Proof.


Cov(aX + b,cY + d)


IE((aX +b-


(apux + b))(cY + d
= E(ac(X - px)(Y


(cpy + d)))
py)) = ac Cov(X,Y)


D-


   Theorem L shows that Cov(X, Y) depends on the scales in which X and Y are
measured. A scale-free measure of association would also be useful. Correlation is
the most common such measure.

Definition 1.7. The correlation between X and Y is

                                     _Cov(X, Y)
                         Cor(X, Y) =    o(,Y
                               C X Y SD(X) SD(Y)

   cor measures correlation. The correlations in iris are


> cor(iris[,1:4])
              Sepal.Length
Sepal.Length     1.0000000
Sepal.Width     -0.1175698
Petal.Length     0.8717538
Petal.Width      0.8179411


Sepal. Width
-0.1175698
  1.0000000
  -0.4284401
  -0.3661259


Petal.Length
   0.8717538
   -0.4284401
   1.0000000
   0.9628654


Petal. Width
  0.8179411
  -0.3661259
  0.9628654
  1.0000000


﻿


1.6. ASSOCIATION, DEPENDENCE, INDEPENDENCE


54


which confirms the visually impression that sepal length, petal length, and petal
width are highly associated with each other, but are only loosely associated with
sepal width.
   TheoremL6 tells us that correlation is unaffected by linear changes in mea-
surement scale.

Theorem 1.6. Let X and Y be random variables. Then Cor(aX + b, cY + d)
Cor(X, Y).

Proof. See Exercise 40.

   Correlation doesn't measure all types of association; it only measures clustering
around a straight line. The first two columns of Figure  show data sets that
cluster around a line, but with some scatter above and below the line. These data
sets are all well described by their correlations, which measure the extent of the
clustering; the higher the correlation, the tighter the points cluster around the line
and the less they scatter. Negative values of the correlation correspond to lines
with negative slopes. The last column of the figure shows some other situations.
The first panel of the last column is best described as having two isolated clusters
of points. Despite the correlation of .96, the panel does not look at all like the
last panel of the second column. The second and third panels of the last column
show data sets that follow some nonlinear pattern of association. Again, their
correlations are misleading. Finally, the last panel of the last column shows a data
set in which most of the points are tightly clustered around a line but in which
there are two outliers. The last column demonstrates that correlations are not
good descriptors of nonlinear data sets or data sets with outliers.
   Correlation measures linear association between random variables. But some-
times we want to say whether two random variables have any association at all,
not just linear.

Definition 1.8. Two two random variables, X and Y, are said to be independent if
p(X |Y) = p(X), for all values of Y. If X and Y are not independent then they are
said to be dependent.

   If X and Y are independent then it is also true that p(Y lIX) = p(Y). The
interpretation is that knowing one of the random variables does not change the
probability distribution of the other. If X and Y are independent (dependent) we
write X I Y (X A Y). If X and Y are independent then Cov(X, Y) = Cor(X, Y)
0. The converse is not true. Also, if X  I Y then pv(x, y) =p(x)p(y). This last
equality is usually taken to be the definition of independence.


﻿


1.6. ASSOCIATION, DEPENDENCE, INDEPENDENCE                         5


55


cor = -0.9


cor = 0.3


cor= 0.96


cor = -0.5


cor = 0.5


cor= 0.67


cor = -0.2


cor = 0.8


cor = -0.37


cor =0


cor= 0.95


cor = 0.62


Figure 1.18: correlations


﻿


1.6. ASSOCIATION, DEPENDENCE, INDEPENDENCE


56


   Do not confuse independent with mutually exclusive. Let X denote the outcome
of a die roll and let A = 1 if X E {1, 2, 3} and A = 0 if X E {4, 5, 6}. A is called an
indicator variable because it indicates the occurence of a particular event. There is
a special notation for indicator variables:

                                A= 1{1,2,3}()

1{1,2,3} is an indicator function. 1{1,2,3} (X) is either 1 or 0 according to whether
X is in the subscript. Let B = 1{4,5,6}(X), C = 1{1,3,5}(X), D = 1{2,4,6}(X) and
E    1{1,2,3,4}(X). A and B are dependent because P[A] = .5 but P[A | B] = 0. D
and E are independent because P[D] = P[D I E] = .5. You can also check that
P[E] = P[E I D] = 2/3. Do not confuse dependence with causality. A and B are
dependent, but neither causes the other.
   For an example, recall the Help Line story on page 46. X and Y were the
amount of time on hold and the total length of the call, respectively. The difference
was W = Y - X. We found p(xI y) = y-1. Because p(xI y) depends on y, X A Y.
Similarly, p(y |x) depends on x. Does that make sense? Would knowing something
about X tell us anything about Y?
   What about X and W? Are they independent?

                  d                      d
        p(w X=dP[W <w|X =z]=d =             P[Y<w+zX = x]=e

which does not depend on x. Therefore X I W. Does that make sense? Would
knowing something about X tell us anything about W?

Example 1.9 (Seedlings, continued)
Examples 1., 1-6, and 1, were about new seedlings in forest quadrats. Suppose that
ecologists observe the number of new seedlings in a quadrat for k successive years; call
the observations N1, ... , Nk. If the seedling arrival rate is the same every year, then we
could adopt the model Ni   Poi(A). I.e., A is the same for every year. If A is known,
or if we condition on A, then the number of new seedlings in one year tells us nothing
about the number of new seedlings in another year, we would model the Ni's as being
independent, and their joint density, conditional on A, would be
                               k            k CAmn       -kAn
                pn,.,kA)= Hp(mj A= e11eAA~h


But if A is unknown then we might treat it like a random variable. It would be a random
variable if, for instance, different locations in the forest had different A's we chose a


﻿


1.7. SIMULATION


57


location at random or if we thought of Nature as randomly choosing A for our particular
location. In that case the data from early years, N1, ... , Nm, say, yield information about
A and therefore about likely values of Nm+1,... , Nk, so the Ni's are dependent. In fact,


         p(ni, ... , nk) = p(ni, ... , nk A)p(A) dA =_     p(a) dA

So, whether the Ni's are independent is not a question with a single right answer. Instead,
it depends on our perspective. But in either case, we would say the Ni's are conditionally
independent given A.


1.7 Simulation

We have already seen, in Example 1_2, an example of computer simulation to es-
timate a probability. More broadly, simulation can be helpful in several types of
problems: calculating probabilities, assessing statistical procedures, and evaluat-
ing integrals. These are explained and exemplified in the next several subsections.

1.7.1 Calculating Probabilities
Probabilities can often be estimated by computer simulation. Simulations are es-
pecially useful for events so complicated that their probabilities cannot be easily
calculated by hand, but composed of smaller events that are easily mimicked on
computer. For instance, in Example 142 we wanted to know the probability that
the shooter in a craps game rolls either 7 or 11 on the Come Out roll. Although
it's easy enough to calculate this probability exactly, we did it by simulation in the
Example.
   Expected values can also be estimated by simulations. Let Y be a random
variable and suppose we want to estimate the expected value of some function
pg = E(g(Y)). We can write a computer program to simulate Y many times. To
keep track of the simulations we use the notation y(3) for the j'th simulated value
of Y. Let n be the number of simulations. Then

                               pg = n- 1   g(yi)

is a sensible estimate of pg. In fact, the Law of Large Numbers tells us that


lim n 1     g(yi)) =pg.
nh-oo


﻿


1.7. SIMULATION


58


So as we do a larger and larger simulation we get a more and more accurate
estimate of pg.
   Probabilities can be computed as special cases of expectations. Suppose we
want to calculate P[Y E S] for some set S. Define X =1S (Y). Then P[Y E S]
E(X) and is sensibly estimated by

                   number of occurences of S-_         2
                                             n EX.
                        number of trials
Example1.10 illustrates with the game of Craps.

Example 1.10 (Craps, continued)
Example 1.8 calculated the chance of winning the game of Craps. Here is the R code to
calculate the same probability by simulation.

    makepoint <- function ( point ) {
    determined <- F
    while ( !determined ) { # roll until outcome is determined
       roll <- sum ( sample ( 6, 2, replace=T ) )
       if ( roll == point ) {
         made <- T
         determined <- T
       } else if ( roll == 7 ) {
         made <- F
         determined <- T
       }
    } # end while
    return ( made )
  } # end makepoint

  sim.craps <- function () {
    roll <- sum ( sample ( 6, 2, replace=T ) )
    if ( roll==7 II roll==11 )
       win <- T
       else if ( roll==2 II roll==3 II roll==12 )
       win <- F
       else
       win <- makepoint ( roll )


return ( win )


﻿


1.7. SIMULATION


59


  }

  n.sim <- 1000
  wins <- 0
  for ( i in 1:n.sim )
     wins <- wins + sim.craps()
  print ( wins/n.sim )


  " "!" is R's symbol for not. If determined is T then ! determined is F.

  " while(! determined) begins a loop. The loop will repeat as many times as
     necessary as long as ! determined is T.

Try the example code a few times. See whether you get about 49% as Example L
suggests.
   Along with the estimate itself, it is useful to estimate the accuracy of  9 as an
estimate of pg. If the simulations are independent then Var(Ag) = n-1 Var(g(Y));
if there are many of them then Var(g(Y)) can be well estimated by n-1 E((g(y) -
g)2) and SD(g(Y)) can be well estimated by
n-1/2 E((g(y) -    g)2). Because SD's decrease in proportion to n-1/2 (See the
Central Limit Theorem.), it takes a 100 fold increase in n to get, for example, a 10
fold increase in accuracy.
   Similar reasoning applies to probabilities, but when we are simulating the oc-
curence or nonoccurence of an event, then the simulations are Bernoulli trials, so
we have a more explicit formula for the variance and SD.

Example 1.11 (Craps, continued)
How accurate is the simulation in Example 1A10?
   The simulation keeps track of X, the number of successes in n. sim trials. Let 0
be the true probability of success. (Example found 0   .49, but in most practical
applications we won't know 0.)

                               X    Bin(n. sim,O8)
                           Var(X) = n. sim(0)(1 -08)
                      SD(X/n. sim) =_  (0)(1 - 0)/n.sim

and, by the Central Limit Theorem if n. sim is large,


0 = X/n. sim  N(O, (0(1 - )/n.sim)1/2)


﻿


1.7. SIMULATION


60


What does this mean in practical terms? How accurate is the simulation when n. sim
50, or 200, or 1000, say? To illustrate we did 1000 simulations with n. sim = 50, then
another 1000 with n. sim = 200, and then another 1000 with n. sim = 1000.
   The results are shown as a boxplot in Figure   . In Figure  there are three
boxes, each with whiskers extending vertically. The box for n. sim = 50 shows that the
median of the 1000 0's was just about .50 (the horizontal line through the box), that
50% of the 0's fell between about .45 and .55 (the upper and lower ends of the box), and
that almost all of the 0's fell between about .30 and .68 (the extent of the whiskers). In
comparison, the 1000 8's for n. sim = 200 are spread out about half as much, and the
1000 8's for n. sim = 1000 are spread out about half as much again. The factor of about
a half comes from the n. sim- in the formula for SD(0). When n. sim increases by a
factor of about 4, the SD decreases by a factor of about 2. See the notes for Figure
for a further description of boxplots.


C)
0


I             I          -0-

           -6-

p


50


200


1000


                              n.sim


Figure 1.19: 1000 simulations of 0 for n. sim = 50, 200, 1000


Here is the R code for the simulations and Figure

N <- 1000


n.sim <- c(50, 200, 1000)


﻿


1.7. SIMULATION


61


    theta.hat <- matrix ( NA, N, length(n.sim) )

    for ( i in seq(along=n.sim) ) {
      for ( j in 1:N ) {
         wins <- 0
         for ( k in 1:n.sim[i] )
         wins <- wins + sim.craps()
         theta.hat[j,i] <- wins / n.sim[i]
      }
    }

    boxplot ( theta.hat ~ col(theta.hat), names=n.sim,
               xlab="n.sim" )


   " matrix forms a matrix. The form is matrix(x,nrows,ncols) where x are
     the entries in the matrix and nrows and ncols are the numbers of rows and
     columns.

   " seq(along=n.sim) is the same as 1:length(n.sim) except that it behaves
     more sensibly in case length (n. sim) is 0.

   " A boxplot is one way to display a data set. It produces a rectangle, or box,
     with a line through the middle. The rectangle contains the central 50% of the
     data. The line indicates the median of the data. Extending vertically above
     and below the box are dashed lines called whiskers. The whiskers contain
     most of the outer 50% of the data. A few extreme data points are plotted
     singly. See Example   for another use of boxplots and a fuller explanation.


1.7.2 Evaluating Statistical Procedures

Simulation can sometimes be useful in deciding whether a particular experiment
is worthwhile or in choosing among several possible experiments or statistical pro-
cedures. For a fictitious example, consider ABC College, where, of the 10,000
students, 30% of the students are members of a sorority or fraternity (greeks) and
70% are independents. There is an upcoming election for Head of the student


﻿


1.7. SIMULATION


62


governance organization. Two candidates are running, D and E. Let

                 6G = proportion of greeks supporting D
                 01 = proportion of independents supporting D
                 0 = proportion of all students supporting D

A poll is commisioned to estimate 0 and it is agreed that the pollster will sample
100 students. Three different procedures are proposed.

   1. Randomly sample 100 students. Estimate

                   01 = proportion of polled students who favor D

   2. Randomly sample 100 students. Estimate

                SG = proportion of polled greeks supporting D
                01 = proportion of polled independents supporting D
                02 = .30G + .701

   3. Randomly sample 30 greeks and 70 independents. Estimate

                SG = proportion of polled greeks supporting D
                01 = proportion of polled independents supporting D
                0 = .35G + .701

Which procedure is best? One way to answer the question is by exact calculation,
but another way is by simulation. In the simulation we try each procedure many
times to see how accurate it is, on average. We must choose some "true" values
of 0G, 01 and 0 under which to do the simulation. Here is some R code for the
simulation.

  # choose "true" theta.g and theta.i
  theta.g <- .8
  theta.i <- .4
  prop.g <- .3
  prop.i <- 1 - prop.g
  theta    <- prop.g * theta.g + prop.i * theta.i


﻿


1.7. SIMULATION


63


sampsize <- 100

n.times <- 1000 # should be enough
theta.hat <- matrix ( NA, n.times, 3 )
for ( i in 1:n.times ) {
   theta.hat[i,1] <- sim.1()
   theta.hat[i,2] <- sim.2()
   theta.hat[i,3] <- sim.3()
 }

 print ( apply(theta.hat,2,mean) )
 boxplot ( theta.hat    col(theta.hat) )

 sim.1 <- function() {
   x <- rbinom(1,sampsize,theta)
   return ( x / sampsize )
 }

 sim.2 <- function() {
   n.g <- rbinom ( 1, sampsize, prop.g )
   n.i <- sampsize - n.g
   x.g <- rbinom ( 1, n.g, theta.g )
   x.i <- rbinom ( 1, n.i, theta.i )
   t.hat.g <- x.g / n.g
   t.hat.i <- x.i / n.i
   return ( prop.g * t.hat.g + (1-prop.g) * t.hat.i )
 }

 sim.3 <- function() {
   n.g <- sampsize * prop.g
   n.i <- sampsize * prop.i
   x.g <- rbinom ( 1, n.g, theta.g )
   x.i <- rbinom ( 1, n.i, theta.i )
   t.hat.g <- x.g / n.g
   t.hat.i <- x.i / n.i
   return ( prop.g * t.hat.g + (1-prop.g) * t.hat.i )
}


﻿


1.7. SIMULATION                                                              64

   " apply applies a function to a matrix. In the code above, apply(....) applies
     the function mean to dimension 2 of the matrix theta. hat. That is, it returns
     the mean of of each column of theta. hat

   The boxplot, shown in Figure 1,20 shows little practical difference between the
three procedures.


                0


                   oI                     II
                0


                             1            2            3


Figure 1.20: 1000 simulations of 0 under three possible procedures for conducting
a poll


   The next example shows how simulation was used to evaluate whether an ex-
periment was worth carrying out.

Example 1.12 (FACE)
The amount of carbon dioxide, or C02, in the Earth's atmosphere has been steadily
increasing over the last century or so. You can see the increase yourself in the co2
data set that comes with R. Typing ts.plot (co2) makes a time series plot, reproduced
here as Figure 1,21. Typing help(co2) gives a brief explanation. The data are the
concentrations of CO2 in the atmosphere measured at Mauna Loa each month from
1959 to 1997. The plot shows a steadily increasing trend imposed on a regular annual
cycle. The primary reason for the increase is burning of fossil fuels. CO2 is a greenhouse
gas that traps heat in the atmosphere instead of letting it radiate out, so an increase in
atmospheric CO2 will eventually result in an increase in the Earth's temperature. But


﻿


1.7. SIMULATION


65


what is harder to predict is the effect on the Earth's plants. Carbon is a nutrient needed
by plants. It is possible that an increase in CO2 will cause an increase in plant growth
which in turn will partly absorb some of the extra carbon.
    To learn about plant growth under elevated CO2, ecologists began by conducting
experiments in greenhouses. In a greenhouse, two sets of plants could be grown under
conditions that are identical except for the amount of CO2 in the atmosphere. But the
controlled environment of a greenhouse is quite unlike the uncontrolled natural environ-
ment, so, to gain verisimilitude, experiments soon moved to open-top chambers. An
open-top chamber is a space, typically a few meters in diameter, enclosed by a solid,
usually plastic, wall and open at the top. CO2 can be added to the air inside the cham-
ber. Because the chamber is mostly enclosed, not much CO2 will escape, and more can
be added as needed. Some plants can be grown in chambers with excess CO2 others in
chambers without and their growth compared. But as with greenhouses, open-top cham-
bers are not completely satisfactory. Ecologists wanted to conduct experiments under
even more natural conditions.
    To that end, in the late 1980's the Office of Biological and Environmental Research
in the U.S. Department of Energy (DOE) began supporting research using a technol-
ogy called FACE, or Free Air CO2 Enrichment developed at the Brookhaven National
Laboratory. As the lab's webpage www.fa                           explains

         "FACE provides a technology by which the microclimate around growing
      plants may be modified to simulate climate change conditions. Typically
      C02-enriched air is released from a circle of vertical pipes into plots up to
      30m in diameter, and as tall as 20 m.
         "Fast feedback control and pre-dilution of CO2 provide stable, elevated
      [CO2] simulating climate change conditions.
         "No containment is required with FACE equipment and there is no signifi-
      cant change in natural air-flow. Large FACE plots reduce effects of plot edge
      and capture fully-functioning, integrated ecosystem-scale processes. FACE
      Field data represent plant and ecosystems responses to concentrations of
      atmospheric CO2 expected in the mid-twenty-first century."

See the website for pictures and more information. In a FACE experiment, CO2 is released
into some treatment plots. The level of CO2 inside the plot is continually monitored.
More CO2 is released as needed to keep the amount of CO2 in the atmosphere at some
prespecified level, typically the level that is expected in the mid-21st century. Other plots
are reserved as control plots. Plant growth in the treatment plots is compared to that in
the control plots.


﻿


1.7. SIMULATION


66


(a)


(b)


    0
    (0
    C')


CVJ 0
O   ri
o   C')


    0
    CVJ
    C')


(I)

a)


N~J


0 -


N~


a)

0


C')


0


C')


1960   1980


      Time


      (C)


      o0


   0


      o

      00

    0/0


  2 4 6 8     12


      I ndex


1960   1980


      Time


    0
    (0
    C')


CVJ 0
O   ri
o   C')


(d)


0
N~
C')


1960   1980


      Time


Figure 1.21: Monthly concentrations of CO2 at Mauna Loa


﻿


1.7. SIMULATION


67


    Because a FACE site is not enclosed, CO2 continually drifts out of the site and needs
to be replenished. Keeping enough CO2 in the air is very costly and is, in fact, the major
expense in conducting FACE experiments.
   The first several FACE sites were in Arizona (sorghum, wheat, cotton), Switzerland
(rye grass, clover) and California (native chaparral). All of these contained low-growing
plants. By the early 1990's, ecologists wanted to conduct a FACE experiment in a forest,
and such an experiment was proposed by investigators at Duke University, to take place in
Duke Forest. But before the experiment could be funded the investigators had to convince
the Department of Energy (DOE) that it would be worthwhile. In particular, they wanted
to demonstrate that the experiment would have a good chance of uncovering whatever
growth differences would exist between treatment and control. The demonstration was
carried out by computer simulation. The code for that demonstration, slightly edited for
clarity, is given at the end of this Example and explained below.

   1. The experiment would consist of 6 sites, divided into 3 pairs. One site in each pair
      would receive the CO2 treatment; the other would be a control. The experiment
      was planned to run for 10 years. Investigators had identified 16 potential sites
      in Duke Forest. The above ground biomass of those sites, measured before the
      experiment began, is given in the line b.mass <- c ( . . .

   2. The code simulates 1000 repetitions of the experiment. That's the meaning of
      nreps <- 1000.

   3. The above ground biomass of each site is stored in M. actual. control and
      M. actual. treatment. There must be room to store the biomass of each site
      for every combination of (pair,year,repetition). The array(...) command cre-
      ates a multidimensional matrix, or array, filled with NA's. The dimensions are given
      by c (npairs ,nyears+1,nreps).

   4. A site's actual biomass is not known exactly but is measured with error. The
      simulated measurements are stored in M. observed. control and
      M. observed.treatment.

   5. Each repetition begins by choosing 6 sites from among the 16 available. Their ob-
      served biomass goes into temp. The first three values are assigned to M. observed. control
      and the last three to M.observed.treatment. All this happens in a loop for (i in
      1 :nreps).

   6. Investigators expected that control plots would grow at an average rate of 2% per
      year and treatment plots at an average of something else. Those values are called


﻿


1.7. SIMULATION


68


      betaC and betaT. The simulation was run with betaT = 1.04, 1.06, 1.08 (shown
      below) and 1.10. Each site would have its own growth rate which would be slightly
      difFerent from betaC or betaT. For control sites, those rates are drawn from the
      N(betaC,0.1*(betaC-1)) distribution and stored in beta. control, and similarly
      for the treatment sites.

   7. Measurement errors of biomass were expected to have an SD around 5%. That's
      sigmaE. But at each site in each year the measurement error would be slightly
      difFerent. The measurement errors are drawn from the N(1, sigmaE) distribution
      and stored in errors. control and errors.treatment.

   8. Next we simulate the actual biomass of the sites. For the first year where we
      already have measurements that's

             M.actual.control [ , 1, 1 <- ...
             M.actual.treatment [ , 1, 1 <- ...


      For subsequent years the biomass in year i is the biomass in year i-1 multiplied
      by the growth factor beta. control or beta.treatment. Biomass is simulated
      in the loop for(i in 2: (nyears+1)).

   9. Measured biomass is the actual biomass multiplied by measurement error. It is
      simulated by

             M. observed.control      <-   ...
             M.observed.treatment <-       ...


  10. The simulations for each year were analyzed each year by a two-sample t-test which
      looks at the ratio
                                    biomass in year i
                                    biomass in year 1
      to see whether it is significantly larger for treatment sites than for control sites. See
      Section xyz for details about t-tests. For our purposes here, we have replaced the
      t-test with a plot, Figure 122, which shows a clear separation between treatment
      and control sites after about 5 years.

The DOE did decide to fund the proposal for a FACE experiment in Duke Forest, at least
partly because of the demonstration that such an experiment would have a reasonably
large chance of success.


﻿


1.7. SIMULATION


69


0
NM


                 /

              /
              /
              /


 i/
i/


ci)


0


LO)


0


I,11ZZZZZ11


2


4


6


8


10


                                     year


Figure 1.22: 1000 simulations of a FACE experiment. The x-axis is years. The
y-axis shows the mean growth rate (biomass in year i / biomass in year 1) of
control plants (lower solid line) and treatment plants (upper solid line). Standard
deviations are shown as dashed lines.


﻿


1.7. SIMULATION


70


# A power analysis of the FACE experiment


# Initial measured biomass of potential FACE sites in g/m2:
b.mass <- c ( 17299.1, 17793.1, 23211.7, 23351.8, 24278,
              25335.9, 27001.5, 27113.6, 30184.3, 30625.5,
              33496.2, 33733.76, 35974.3, 38490.8, 40319.6,
              44903 )

npairs <- 3
nyears <- 10
nreps <- 1000

M.observed.control   <- array ( NA, c(npairs,nyears+1,nreps) )
M.actual.control     <- array ( NA, c(npairs,nyears+1,nreps) )
M.observed.treatment <- array ( NA, c(npairs,nyears+1,nreps) )
M.actual.treatment   <- array ( NA, c(npairs,nyears+1,nreps) )

# Specify the initial levels of biomass
for ( i in 1:nreps ) {
  temp <- sample ( b.mass, size=2*npairs )
  M.observed.control    [,1,i] <- temp [1:npairs]
  M.observed.treatment [,1,i] <- temp [(npairs+1):(2*npairs)]
}

# Specify the betas

betaC <- 1.02
betaT <- 1.08

beta.control   <- matrix ( rnorm ( npairs*nreps, betaC,
                                    0.1*(betaC-1) ),
                           npairs, nreps )
beta.treatment <- matrix ( rnorm ( npairs*nreps, betaT,
                                    0.1*(betaT-1) ),
                           npairs, nreps )


﻿


1.7. SIMULATION


71


# measurement errors in biomass
sigmaE <- 0.05

errors.control    <- array ( rnorm ( npairs*(nyears+1)*nreps,
                                     1, sigmaE ),
                             c(npairs,nyears+1,nreps) )
errors.treatment <- array ( rnorm ( npairs*(nyears+1)*nreps,
                                     1, sigmaE ),
                             c(npairs,nyears+1,nreps) )
##############################################################
##############################################################
# Generate 10 years of data. The model for generation is:
# M.actual[i,j,] : above ground biomass in ring i, year j
# M.actual[i,j+1,] = beta[i] * M.actual[i,j,]

# We actually observe M.observed[i,j] = M.actual[i,j] * error
# Start with M.observed[i,1] and generate M.actual[i,1]

M.actual.control [ , 1, ] <-
  M.observed.control [ , 1, 1 / errors.control[ , 1, ]

M.actual.treatment [ , 1, ] <-
  M.observed.treatment [ , 1, 1 / errors.treatment[ , 1, 1

# Now, for the rest of the years, generate M.actual
for ( i in 2:(nyears+1) ) {
  M.actual.control [ , i, ] <-
    M.actual.control [ , i-1, ] * beta.control

  M.actual.treatment [ , i, ] <-
    M.actual.treatment [ , i-1, ] * beta.treatment
}

# The initial observed data, corresponding to the [,1,],
# doesn't need to be recomputed. But the following two
# statements are cleaner without subscripts.
M.observed.control <- M.actual.control * errors.control


﻿


1.8. R


72


M.observed.treatment <- M.actual.treatment * errors.treatment
##############################################################

##############################################################
# two-sample t-test on (M.observed[j]/M.observed[1]) removed
# plot added

ratio.control    <- matrix ( NA, nyears, npairs*nreps )
ratio.treatment <- matrix ( NA, nyears, npairs*nreps )

for ( i in 2:(nyears+1) ) {


  ratio.control
    as.vector (

  ratio. treatmer
    as.vector (

}

mean.control    <
mean.treatment <
sd.control      <
sd.treatment    <


[ i-1, ]    <-
   M.observed.control[,i,]
 / M.observed.control[,1,] )
It [ i-1, ] <-
   M.observed.treatment[,i,]
 / M.observed.treatment[,1,] )


oo
I.. -
oo,
N.
oo,
N.
oo,
N.


apply ( ratio.control,    1, mean )
apply ( ratio.treatment, 1, mean )
sqrt ( apply ( ratio.control,     1, var ) )
sqrt ( apply ( ratio.treatment, 1, var ) )


  plot ( 2:11, mean.control, type="l", ylim=c(.9,2.4),
         xlab="year", ylab="growth rate" )
  lines ( 2:11, mean.treatment )
  lines ( 2:11, mean.control - sd.control, lty=2 )
  lines ( 2:11, mean.control + sd.control, lty=2 )
  lines ( 2:11, mean.treatment - sd.treatment, lty=2 )
  lines ( 2:11, mean.treatment + sd.treatment, lty=2 )


1.8 R


This section introduces a few more of the R commands we will need to work flu-
ently with the software. They are introduced in the context of studying a dataset
on the percent bodyfat of 252 men. You should download the data onto your


﻿


1.8. R


73


own computer and try out the analysis in R to develop your familiarity with what
will prove to be a very useful tool. The data can be found at StatLib, an on-
line repository of statistical data and software. The data were originally con-
tributed by Roger Johnson of the Department of Mathematics and Computer Sci-
ence at the South Dakota School of Mines and Technology. The StatLib website is
lib. st at . cmu. edu. If you go to StatLib and follow the links to dat asets and then
bodyf at you will find a file containing both the data and an explanation. Copy just
the data to a text file named bodyf at . dat on your own computer. The file should
contain just the data; the first few lines should look like this:

1.0708     12.3       23 ...
1.0853      6.1       22 ...
1.0414     25.3       22 ...

The following snippet shows how to read the data into R and save it into bodyf at.

bodyf at <- read.table ( "bodyfat.dat",
  col.names = c ( "density", "percent.fat", "age", "weight",
     "height", "neck. circum", "chest. circum", "abdomen. circum",
     "hip. circum", "thigh. circum", "knee. circum", "ankle. circum",
     "bicep.circum" , "forearm.circum" , "wrist.circum" ) )
dim    ( bodyf at )    # how many rows and columns in the dataset?
names ( bodyf at )     # names of the columns

   " read. table (...) reads data from a text file into a dataframe. A dataframe
     is a flexible way to represent data because it can be treated as either a ma-
     trix or a list. Type help (read. table) to learn more. The first argument,
     "bodyf at .dat ", tells R what file to read. The second argument, col.names =
     c ( "density", ... ), tells R the names of the columns.

   " dim gives the dimension of a matrix, a dataframe, or anything else that has a
     dimension. For a matrix or dataframe, dim tells how many rows and columns.

   " names gives the names of things. names (bodyf at) should tell us the names
     density, percent .fat,.... It's used here to check that the data were read
     the way we intended.

   Individual elements of matrices can be accessed by two-dimensional subscripts
such as bodyf at [1, 11 or bodyf at [3,71 in which the subscripts refer to the row
and column of the matrix. (Try this out to make sure you know how two dimen-
sional subscripts work.) If the columns of the matrix have names, then the second


﻿


1.8. R


74


subscript can be a name, as in bodyf at [1, "density"] or bodyf at [3, "chest . circum"].
Often we need to refer to an entire column at once, which can be done by omit-
ting the first subscript. For example, bodyf at [,2] refers to the entire set of 252
measurements of percent body fat.
   A dataframe is a list of columns. Because bodyf at has 15 columns its length,
length(bodyf at), is 15. Members of a list can be accessed by subscripts with
double brackets, as in bodyf at [[1]]. Each member of bodyf at is a vector of
length 252. Individual measurements can be accessed as in bodyf at [[111 [1] or
bodyf at [[3]] [7]. If the list members have names, then they can be accessed as
in bodyf at$percent . fat. Note the quotation marks used when treating bodyf at
as a matrix and the lack of quotation marks when treating bodyf at as a list. The
name following the "$" can be abbreviated, as long as the abbreviation is unam-
biguous. Thus bodyf at$ab works, but bodyf at$a fails to distinguish between age
and abdomen. cir cum.
   Begin by displaying the data.

par ( mfrow=c(5,3) ) # establish a 5 by 3 array of plots

for ( i in 1:15 ) {
  hist ( bodyfat [ [i]], xlab=" " , main=names (bodyfat) [i] )
}

Although it's not our immediate purpose, it's interesting to see what the relation-
ships are among the variables. Try pairs (bodyf at).
   To illustrate some of R's capabilities and to explore the concepts of marginal,
joint and conditional densities, we'll look more closely at percent fat and its relation
to abdomen circumference. Begin with a histogram of percent fat.

fat <- bodyfat$per # give these two variables short names
abd <- bodyfat$abd # so we can refer to them easily
par ( mfrow=c(1,1) ) # just need one plot now, not 15
hist ( fat )

We'd like to rescale the vertical axis to make the area under the histogram equal
to 1, as for a density. R will do that by drawing the histogram on a "density" scale
instead of a "frequency" scale. While we're at it, we'll also make the labels prettier.
We also want to draw a Normal curve approximation to the histogram, so we'll
need the mean and standard deviation.


hist ( fat, xlab="", main="percent fat", freq=F )


﻿


1.8. R


75


mu <- mean ( fat )
sigma <- sqrt ( var ( fat )) # standard deviation
lo <- mu - 3*sigma
hi <- mu + 3*sigma
x <- seq ( lo, hi, length=50 )
lines ( x, dnorm ( x, mu, sigma ) )

That looks better, but we can do better still by slightly enlarging the axes. Redraw
the picture, but use

hist ( fat, xlab="", main="percent fat", freq=F,
        xlim=c(-10, 60), ylim=c(0,.06) )

The Normal curve fits the data reasonably well. A good summary of the data is
that it is distributed approximately N(19.15, 8.37).
   Now examine the relationship between abdomen circumference and percent
body fat. Try the following command.

plot ( abd, fat, xlab="abdomen circumference",
        ylab="percent body fat" )

The scatter diagram shows a clear relationship between abdomen circumference
and body fat in this group of men. One man doesn't fit the general pattern; he
has a circumference around 148 but a body fat only around 35%, relatively low
for such a large circumference. To quantify the relationship between the variables,
let's divide the men into groups according to circumference and estimate the con-
ditional distribution of fat given circumference. If we divide the men into twelfths
we'll have 21 men per group.

cut.pts <- quantile ( abd, (0:12)/12 )
groups <- cut ( abd, cut.pts, include.lowest=T, labels=1:12 )
boxplot ( fat     groups,
           xlab="quantiles of abdomen circumference",
           ylab="percent body fat" )

Note:

   " A quantile is a generalization of median. For example, the 1/12-th quantile of
     abd is the number q such that 1/12 of all the abd measurements are less than
     q and 11/12 are greater than q. (A more careful definition would say what
     to do in case of ties.) The median is the .5 quantile. We have cut our data
     according to the 1/12, 2/12, ..., 12/12 quantiles of abd.


﻿


1.8. R


76


   " If you don't see what the cut (abd, ...) command does, print out cut .pts
     and groups, then look at them until you figure it out.

   " Boxplots are a convenient way to compare different groups of data. In this
     case there are 12 groups. Each group is represented on the plot by a box with
     whiskers. The box spans the first and third quartiles (.25 and .75 quantiles)
     of fat for that group. The line through the middle of the box is the median
     fat for that group. The whiskers extend to cover most of the rest of the data.
     A few outlying fat values fall outside the whiskers; they are indicated as
     individual points.

   " "fat ~ groups" is R's notation for a formula. It means to treat fat as a func-
     tion of groups. Formulas are extremely useful and will arise repeatedly.

   The medians increase in not quite a regular pattern. The irregularities are
probably due to the vagaries of sampling. We can find the mean, median and
variance of fat for each group with

mu.fat <- tapply ( fat, groups, mean )
me.f at <- tapply ( fat, groups, median )
sd.fat <- sqrt ( tapply ( fat, groups, var ) )
cbind ( mu.fat, me.fat, sd.fat )

   " tapply means "apply to every element of a table." In this case, the table is
     fat, grouped according to groups.

   " cbind means "bind together in columns". There is an analgous command
     rbind.

   Finally, let's make a figure similar to FigureL2.

x <- seq ( 0, 50, by=1 )
par ( mfrow=c(4,3) )
for ( i in 1:12 )
  good <- groups == i
  hist ( fat[good], xlim=c(0,50), ylim=c(0,.1),
          breaks=seq(0,50,by=5), freq=F,
          xlab="percent fat", main="" )
  y <- dnorm ( x, mu.fat[i], sd.fat[i] )
  lines ( x, y )


﻿


1.9. SOME RESULTS FOR LARGE SAMPLES


77


The Normal curves seem to fit well. We saw earlier that the marginal (Marginal
means unconditional.) distribution of percent body fat is well approximated by
N(19.15, 8.37). Here we see that the conditional distribution of percent body fat,
given that abdomen circumference is in between the (i - 1)/12 and i/12 quantiles
is N(mu. fat [i], sd. fat [i]). If we know a man's abdomen circumference even
approximately then (1) we can estimate his percent body fat more accurately and
(2) the typical estimation error is smaller. [add something about estimation
error in the sd section]


1.9 Some Results for Large Samples

It is intuitively obvious that large samples are better than small, that more data
is better than less, and, less obviously, that with enough data one is eventually
led to the right answer. These intuitive observations have precise mathematical
statements in the form of Theorems   1,113 and 1 . We state those theorems
here so we can use them throughout the rest of the book. They are examined in
more detail in Section..

Definition 1.9 (Random Sample). A collection y,. .. , ym of random variables is
said to be a random sample of size n from pdf or pmf f if and only if
   1. y~f for each i = 1, 2, ..., n and

   2. the yi's are mutually independent, i.e. f(yi, ..., yn)  H> "f(ys).
   The collection (yi,..., yn) is called a data set. We write

                              y1, ..., yn ~1.i.d. f
where i.i.d. stands for independent and identically distributed.
   Many introductory statistics texts describe how to collect random samples, many
pitfalls that await, and many pratfalls taken in the attempt. We omit that discussion
here and refer the interested reader to our favorite introductory text on the subject,
          et al[     ] which has an excellent description of random sampling in
general as well as detailed discussion of the US census and the Current Population
Survey.
   Suppose y1, Y2, ...,~ i.i.d. f. Let y = f yf(y) dy and u2= f(y - p)2f(y)dy
be the mean and variance of f. Typically yu and o are unknown and we take the
sample in order to learn about them. We will often use 4= (yi+ ... + y )/n, the
mean of the first n observations, to estimate pu. Some questions to consider are


﻿


1.9. SOME RESULTS FOR LARGE SAMPLES


78


   " For a sample of size n, how accurate is y, as an estimate of u?
   " Does yK get closer to y as n increases?
   " How large must n be in order to achieve a desired level of accuracy?
Theorems 1.12 and     provide answers to these questions. Before stating them
we need some preliminary results about the mean and variance of yn.
Theorem 1.7. Let x1,..., x, be random variables with means p1,... ,up. Then E[x1+
-.-.+ xn] = p1+-.-.+ pn.
Proof. It suffices to prove the case n = 2.

          E[zi + x2] f f(i + x2)f (i, x2) dzidz2

                   = ff xif (i, x2) dzidz2 + iffx2f (zi, x2) dzidz2

                   =1p1 + /p2


Corollary 1.8. Let y1, ... , y, be a random sample from f with mean u. Then E[yn]
1.
Proof. The corollary follows from Theorems L3 and L7.
Theorem 1.9. Let x1,... , x, be independent random variables with means
pi,..., y and SDs a1,... ,n. Then Var[zi +-...-+x] =12+--+...+as.
Proof. It suffices to prove the case n = 2. Using Theorem L,
      Var(Xi + X2) =E((X1 + X2)2) - (1 + /p2)2
                   = E(X )+ 2E(X1X2) + E(X2) -p1 -21p2 -12
                     (E(xf) - p1) + (E(X ) - p1) +2 (E(x1x2) -11112)
                   =0 + ul +2 (E(X1X2) - /1/2).
But if X1 I X2 then

                   E(X1X2) =fzfxc2f(zix2)dzidz2

                             Sc1 x   2f (x2) dz2f (ci) dci

                           =p12f z1f (x1) dc1 =pip2.


So Var(Xl + X2) = 61 + 62.


F-I


﻿


1.9. SOME RESULTS FOR LARGE SAMPLES


79


   Note that Theorem 1. requires independence while Theorem 1.7 does not.

Corollary 1.10. Let y1,..., y be a random sample from f with variance a2. Then
Var(p) = a2/

Proof. The corollary follows from Theorems 1.4 and 1.

Theorem 1.11 (Chebychev's Inequality). Let X be a random variable with mean y
and SD a. Then for any Ec> 0,

                           P[lX -u pl ;>E]<2/E2.


Proof.


           (xfh  - pu)2f (x) dx~ 1+ (cc - p)2f (c) dcc+fcz- (c c
        2        f)2 (c) dc+   f(c) d   )f

        >      (E _iff(cc~dc+E Jf(cc   1)2dccd


        = E2 P[lX -1 p;> E].


   Theorems112 and 1.      are the two main limit theorems of statistics. They
provide answers, at least probabilistically, to the questions on page 77.

Theorem 1.12 (Weak Law of Large Numbers). Let Yi,..., Ym be a random sample
from a distribution with mean yu and variance a2. Then for any E > 0,

                           lim P[ly.n- pl <E] =1.                   (1.18)

Proof. Apply Chebychev's Inequality (Theorem   ) to yn.

       lim P[ly - Pl < 6] = lim 1 - P [ly -  pl;> E] ;> lim 1 - a2  2


D-


Another version of Theorem


is called the Strong Law of Large Numbers.


﻿


1.9. SOME RESULTS FOR LARGE SAMPLES


80


Theorem 1.13 (Strong Law of Large Numbers). Let yi, ..., y, be a random sample
from a distribution with mean ya and variance a2. Then for any E > 0,

                           P[ lim ln - pl1 < E] = 1;
                             h-oo

i.e.,

                              P [lim n = p] = 1.
                                h-oo

   It is beyond the scope of this section to explain the difference between the
WLLN and the SLLN. See Section 8,9.

Theorem 1.14 (Central Limit Theorem). Let y1,..., y, be a random sample from f
with mean ya and variance a2. Let z = (Yn - p)/(av/ ). Then, for any numbers
a < b,
                     lim P[zn E [a, b]] f   1e_2/2dw.
                     n-oo              fa v/2
Le. The limiting distribution of zn is N(0, 1).

   The Law of Large Numbers is what makes simulations work and why large
samples are better than small. It says that as the number of simulation grows or
as the sample size grows, (n - oc), the average of the simulations or the average
of the sample gets closer and closer to the true value (X - p). For instance, in
Example    11, where we used simulation to estimate P[Shooter wins] in Craps, the
estimate became more and more accurate as the number of simulations increased
from 50, to 200, and then to 1000. The Central Limit Theorem helps us look at
those simulations more closely.
    Colloquially, the Central Limit Theorem says that

                               lim Pz,  N(O, 1).
                               nh-oo

Its practical application is that, for large n,

                                Pzs aN(0, 1),

which in turn means
                                pa aj N(p, a/w/ -).
   In Example L11we simulated the game of Craps for n.sim = 50, 200, and
 1000. Those simulations are depicted in Figure L2. The upper panel shows


﻿


1.10. EXERCISES


81


a histogram of 1000 simulations, all using n. sim = 50. For a single simulation
with n. sim = 50 let X1, ..., X50 be the outcomes of those simulations. Each Xi
Bern(.493), so y = .493 and a = v/.493 * .507  .5. Therefore, according to the
Central Limit Theorem, when n. sim = 50

                       XNN(, a/       )   N(.493, .071)

This is the Normal density plotted in the upper panel of Figure 1,23. We see that
the N(.493, .071) is a good approximation to the histogram. And that's because
0   Xo50   N(.493, .071), approximately. The Central Limit Theorem says that the
approximation will be good for "large" n. In this case n = 50 is large enough.
(Section.. will discuss the question of when n is "large enough".)
   Similarly,

                   Xn~N(.493, .035)      when n. sim = 200
                   X, ~N(.493, .016)    when n. sim = 1000.

These densities are plotted in the middle and lower panels of Figure 1,23.
   The Central Limit Theorem makes three statements about the distribution of yn
(zn) in large samples:

   1. E[yn] = p (E[zn] = 0),
   2. SD(pn) = / /   (SD(z) = 1), and

   3. Yn (zn) has, approximately, a Normal distribution.

The first two of these are already known from Theorems'1.7 and 1,9. It's the third
point that is key to the Central Limit Theorem. Another surprising implication from
the Central Limit Theorem is that the distributions of yn and zn in large samples
are determined solely by y and a; no other features of f matter.


1.10 Exercises

   1. Show: if p is a probability measure then for any integer n;> 2, and disjoint
     sets A1, . . . ,A

                               p(UAi) =      p(A2).
                                 i=1      i=1


2. Simulating Dice Rolls


﻿


1.10. EXERCISES


82


C,,
a)
0


(0

LO

C)


0.3         0.4         0.5         0.6         0.7


theta hat
n.sim = 50


7-1


C,,
a)
0


0o
TD

M\


F-
0.3


7


I


0.4


0.5


0.6


0.7


theta hat
n.sim = 200


C,,
a)
0


0-


L


0.3


0.4


0.6


0.7


0.5


                                         theta hat
                                       n.sim = 1000


Figure 1.23: Histograms of craps simulations. Solid curves are Normal approxima-
tions according to the Central Limit Theorem.


﻿


1.10. EXERCISES


83


    (a) simulate 6000 dice rolls. Count the number of 1's, 2's, ..., 6's.
    (b) You expect about 1000 of each number. How close was your result to
       what you expected?
    (c) About how often would you expect to get more than 1030 l's? Run an
       R simulation to estimate the answer.

3. The Game of Risk In the board game Risk players place their armies in dif-
   ferent countries and try eventually to control the whole world by capturing
   countries one at a time from other players. To capture a country, a player
   must attack it from an adjacent country. If player A has A > 2 armies in
   country A, she may attack adjacent country D. Attacks are made with from 1
   to 3 armies. Since at least 1 army must be left behind in the attacking country,
   A may choose to attack with a minimum of 1 and a maximum of min(3, A -1)
   armies. If player D has D > 1 armies in country D, he may defend himself
   against attack using a minimum of 1 and a maximum of min(2, D) armies.
   It is almost always best to attack and defend with the maximum permissible
   number of armies.
   When player A attacks with a armies she rolls a dice. When player D defends
   with d armies he rolls d dice. As highest die is compared to D's highest. If
   both players use at least two dice, then As second highest is also compared
   to D's second highest. For each comparison, if As die is higher than D's then
   A wins and D removes one army from the board; otherwise D wins and A
   removes one army from the board. When there are two comparisons, a total
   of two armies are removed from the board.

     " If A attacks with one army (she has two armies in country A, so may
       only attack with one) and D defends with one army (he has only one
       army in country D) what is the probability that A will win?
     " Suppose that Player 1 has two armies each in countries C1, C2, C3 and
       C4, that Player 2 has one army each in countries B1, B2, B3 and B4, and
       that country CZ attacks country B2. What is the chance that Player 1 will
       be successful in at least one of the four attacks?

4. (a) Justify the last step of Equation 1..
    (b) Justify the last step of the proof of Theorem


(c) Prove Theorem


when g is a decreasing function.


﻿


1.10. EXERCISES


84


5. Y is a random variable. Y E (-1, 1). The pdf is p(y) = ky2 for some constant,
   k.

   (a) Find k.
   (b) Use R to plot the pdf.
   (c) Let Z = -Y. Find the pdf of Z. Plot it.

6. U is a random variable on the interval [0, 1]; p(u) = 1.

    (a) V = U2. On what interval does V live? Plot V as a function of U. Find
        the pdf of V. Plot pv(v) as a function of v.
    (b) W = 2U. On what interval does W live? Plot W as a function of U. Find
        the pdf of W. Plot pw(w) as a function of w.
    (c) X = - log(U). On what interval does X live? Plot X as a function of U.
        Find the pdf of X. Plot px(x) as a function of x.

7. Let X   Exp(A) and let Y = cX for some constant c.

    (a) Write down the density of X.
    (b) Find the density of Y.
    (c) Name the distribution of Y.

8. A teacher randomly selects a student from a Sta 103 class. Let X be the num-
   ber of math courses the student has completed. Let Y = 1 if the student is
   female and Y = 0 if the student is male. Fifty percent of the class is female.
   Among the women, thirty percent have completed one math class, forty per-
   cent have completed two math classes and thirty percent have completed
   three. Among the men, thirty percent have completed one math class, fifty
   percent have completed two math classes and twenty percent have completed
   three.

   (a) True or False: X and Y are independent.
   (b) Find E[X|IY = 1].

9. Sue is studying the Bin(25,.4) distribution. In R she types

       y <- rbinom(50, 25, .4)
       m1 <- mean(y)
       m2 <- sum(y) / 25


﻿


1.10. EXERCISES


85


        m3 <- sum ( (y-m1)^2 ) / 50


     (a) Is y a number, a vector or a matrix?
     (b) What is the approximate value of ml?
     (c) What is the approximate value of m2?
     (d) What was Sue trying to estimate with m3?

10. The random variables X and Y have joint pdf fx,y (x, y) = 1 in the triangle of
    the XY-plane determined by the points (-1,0), (1,0), and (0,1). Hint: Draw
    a picture.

    (a) Find fx(.5).
    (b) Find fy(y).
    (c) Find fyx (y | X = .5).
    (d) Find E[Y | X = .5].
    (e) Find fy(.5).
    (f) Find fx (x).
    (g) Find fx y(x | Y =.5).
    (h) Find E[X | Y = .5].

11. X and Y are uniformly distributed in the unit disk. I.e., the joint density
    p(x, y) is constant on the region of R2 such that x2 + y2 < 1.

    (a) Find p(x, y).
    (b) Are X and Y independent?
    (c) Find the marginal densities p(x) and p(y).
    (d) Find the conditional densities p(z| y) and p(y Icc).
    (e) FindIE[X],IE[X | Y = .5], andlE[X Y = -.5].

12. Verify the claim in Example 14 that argmaxa P[cc= 3| A] = 3. Hint: differen-
    tiate Equation L6.

13. (a) p is the pdf of a continuous random variable w. Find fp(s) ds.
     (b) Find f~ p(s) ds for the pdf in Equation 17


﻿


1.10. EXERCISES


86


14. Page 7 says "Every pdf must satisfy two properties ... " and that one of them
    is "p(y) > 0 for all y." Explain why that's not quite right.
15.
                                        1   _1 2
                                p(y) =     e 29
                                        2wr
    is the pdf of a continuous random variable y. Find f_, p(s) ds.

16. When spun, an unbiased spinner points to some number y E (0, 1]. What is
    p(y)?
17. Some exercises on the densities of tranformed variables. One of them should
    illustrate the need for the absolute value of the Jacobian.

18. (a) Prove: if X   Poi(A) then E(X) = A. Hint: use the same trick we used
         to derive the mean of the Binomial distribution.
     (b) Prove: if X   N(p, o) then E(X) = p. Hint: change variables in the
         integral.

19. (a) Prove: if X   Bin(n, p) then Var(X) = np(1-p). Hint: use Theorem 1.9.
     (b) Prove: if X  Poi(A) then Var(X) = A. Hint: use the same trick we used
         to derive the mean of the Binomial distribution and Theorem 1_2.
     (c) If X  Exp(A), find Var(X). Hint: use Theorem 1:.
     (d) If X  N(p, o), find Var(X). Hint: use Theorem 12.

20. (a) Justify each step of Equation
     (b) Justify each step of Equation 13. (Hint: integrate by parts.)

21. Let X1    Bin(50, .1), X2 Bin(50, .9) and X1 I X2. Define Y= X1 + X2.
    Does Y have the Bin(100, .5) distribution? Why or why not?

22. Let X1    Bin(50, .5), X2 Bin(50, .5) and X1 I X2. Define Y1 = X1 + X2
    and Y2 = 2X1. Which has the bigger mean: Y1 or Y2? Which has the bigger
    variance: Y1 or Y2? Justify your answer.

23. Consider customers arriving at a service counter. Interarrival times often
    have a distribution that is approximately exponential with a parameter A that
    depends on conditions specific to the particular counter. I.e., p(t) = AeL.
    Assume that successive interarrival times are independent of each other. Let
    T1 be the arrival time of the next customer and 2 be the additional time until
    the arrival of the second customer.


﻿


1.10. EXERCISES


87


     (a) What is the joint density of (T1, T2)?
     (b) Let S = T1 + T2, the time until the next two customers arrive. What is
         P[S < 5]; i.e. the probability that at least 2 customers arrive within the
         next 5 minutes?
     (c) What is E(S)?

24. A gambler plays at a roulette table for two hours betting on Red at each spin
    of the wheel. There are 60 spins during the two hour period. What is the
    distribution of

    (a) z, the number of times the gambler wins,
    (b) y, the number of times the gambler loses,
    (c) x, the number of times the gambler is jostled by the person standing
        behind,
     (d) w, the gambler's net gain?

25. If human DNA contains xxx bases, and if each base mutates with probability
    p over the course of a lifetime, what is the average number of mutations per
    person? What is the variance of the number of mutations per person?

26. Isaac is in 5th grade. Each sentence he writes for homework has a 90%
    chance of being grammatically correct. The correctness of one sentence does
    not affect the correctness of any other sentence. He recently wrote a 10
    sentence paragraph for a writing assignment. Write a formula for the chance
    that no more than two sentences are grammatically incorrect.

27. Teams A and B play each other in the World Series of baseball. Team A has
    a 60% chance of winning each game. What is the chance that B wins the
    series? (The winner of the series is the first team to win 4 games.)

28. A basketball player shoots ten free throws in a game. She has a 70% chance
    of making each shot. If she misses the shot, her team has a 30% chance of
    getting the rebound.

    (a) Let m be the number of shots she makes. What is the distribution of m?
        What are its expected value and variance? What is the chance that she
        makes somewhere between 5 and 9 shots, inclusive?


﻿


1.10. EXERCISES


88


     (b) Let r be the number of rebounds her team gets from her free throws.
         What is the distribution of r? What are its expected value and variance?
         What is the chance that r > 1?

29. Let (x, y) have joint density function fx,y. There are two ways to find IEy.
    One way is to evaluate ff yf,(x, y) dxdy. The other is to start with the joint
    density f,, find the marginal density fy, then evaluate f yfy(y) dy. Show
    that these two methods give the same answer.

30. Prove Theorem1. (pg. 39) in the discrete case.

31. Prove Theorem17 (pg. 78) in the continuous case.

32. A researcher randomly selects mother-daughter pairs. Let x and ye be the
    heights of the i'th mother and daughter, respectively. True or False:

    (a) x and xz are independent
    (b) x and y3 are independent
    (c) ye and y3 are independent
    (d) x and ye are independent

33. As part of his math homework Isaac had to roll two dice and record the
    results. Let X1 be the result of the first die and X2 be the result of the second.
    What is the probability that X1=1 given that X1 + X2 = 5?

34. A doctor suspects a patient has the rare medical condition DS, or disstaticu-
    laria, the inability to learn statistics. DS occurs in .01% of the population, or
    one person in 10,000. The doctor orders a diagnostic test. The test is quite
    accurate. Among people who have DS the test yields a positive result 99% of
    the time. Among people who do not have DS the test yields a positive result
    only 5% of the time.
    For the patient in question, the test result is positive. Calculate the probability
    that the patient has DS.

35. For various reasons, researchers often want to know the number of peo-
    ple who have participated in embarassing activities such as illegal drug use,
    cheating on tests, robbing banks, etc. An opinion poll which asks these ques-
    tions directly is likely to elicit many untruthful answers. To get around the
    problem, researchers have devised the method of randomized response. The
    following scenario illustrates the method.


﻿


1.10. EXERCISES


89


    A pollster identifies a respondent and gives the following instructions. "Toss
    a coin, but don't show it to me. If it lands Heads, answer question (a). If
    it lands tails, answer question (b). Just answer 'yes' or 'no'. Do not tell me
    which question you are answering.
    Question (a): Does your telephone number end in an even digit?
    Question (b): Have you ever used cocaine?"
    Because the respondent can answer truthfully without revealing his or her
    cocaine use, the incentive to lie is removed. Researchers hope respondents
    will tell the truth.
    You may assume that respondents are truthful and that telephone numbers
    are equally likely to be odd or even. Let p be the probability that a randomly
    selected person has used cocaine.

    (a) What is the probability that a randomly selected person answers "yes"?
    (b) Suppose we survey 100 people. Let X be the number who answer "yes".
         What is the distribution of X?

36. In a 1991 article (See Utts [191] and discussants.) Jessica Utts reviews some
    of the history of probability and statistics in ESP research. This question
    concerns a particular series of autoganzfeld experiments in which a sender
    looking at a picture tries to convey that picture telepathically to a receiver.
    Utts explains:

         "... 'autoganzfeld' experiments require four participants. The first
         is the Receiver (R), who attempts to identify the target material be-
         ing observed by the Sender (5). The Experimenter (E) prepares R
         for the task, elicits the response from R and supervises R's judging
         of the response against the four potential targets. (Judging is dou-
         ble blind; E does not know which is the correct target.) The fourth
         participant is the lab assistant (LA) whose only task is to instruct
         the computer to randomly select the target. No one involved in the
         experiment knows the identity of the target.
            "Both R and S are sequestered in sound-isolated, electrically
         sheilded rooms. R is prepared as in earlier ganzfeld studies, with
         white noise and a field of red light. In a nonadjacent room, S
         watches the target material on a television and can hear R's tar-
         get description ('mentation') as it is being given. The mentation is
         also tape recorded.


﻿


1.10. EXERCISES


90


            "The judging process takes place immediately after the 30-minute
         sending period. On a TV monitor in the isolated room, R views the
         four choices from the target pack that contains the actual target. R
         is asked to rate each one according to how closely it matches the
         ganzfeld mentation. The ratings are converted to ranks and, if the
         correct target is ranked first, a direct hit is scored. The entire pro-
         cess is automatically recorded by the computer. The computer then
         displays the correct choice to R as feedback."

    In the series of autoganzfeld experiments analyzed by Utts, there were a total
    of 355 trials. Let X be the number of direct hits.

    (a) What are the possible values of X?
    (b) Assuming there is no ESP, and no cheating, what is the distribution of
         X?
     (c) Plot the pmf of the distribution in part (b).
     (d) Find E[X] and SD(X).
     (e) Add a Normal approximation to the plot in part (c).
     (f) Judging from the plot in part (c), approximately what values of X are
         consistent with the "no ESP, no cheating" hypothesis?
     (g) In fact, the total number of hits was x = 122. What do you conclude?

37. This exercise is based on a computer lab that another professor uses to teach
    the Central Limit Theorem. It was originally written in MATLAB but here it's
    translated into R.
    Enter the following R commands:

      u <- matrix ( runif(250000), 1000, 250 )
      y <- apply ( u, 2, mean )


    These create a 1000x250 (a thousand rows and two hundred fifty columns)
    matrix of random draws, called u and a 250-dimensional vector y which con-
    tains the means of each column of U.
    Now enter the command hist (u[,11). This command takes the first column
    of u (a column vector with 1000 entries) and makes a histogram. Print out


﻿


1.10. EXERCISES


91


    this histogram and describe what it looks like. What distribution is the runif
    command drawing from?
    Now enter the command hist (y). This command makes a histogram from
    the vector y. Print out this histogram. Describe what it looks like and how
    it differs from the one above. Based on the histogram, what distribution do
    you think y follows?
    You generated y and u with the same random draws, so how can they have
    different distributions? What's going on here?

38. Suppose that extensive testing has revealed that people in Group A have IQ's
    that are well described by a N(100, 10) distribution while the IQ's of people
    in Group B have a N(105, 10) distribution. What is the probability that a ran-
    domly chosen individual from Group A has a higher IQ than a randomly chosen
    individual from Group B?

    (a) Write a formula to answer the question. You don't need to evaluate the
         formula.
     (b) Write some R code to answer the question.

39. The so-called Monty Hall or Let's Make a Deal problem has caused much con-
    sternation over the years. It is named for an old television program. A con-
    testant is presented with three doors. Behind one door is a fabulous prize;
    behind the other two doors are virtually worthless prizes. The contestant
    chooses a door. The host of the show, Monty Hall, then opens one of the
    remaining two doors, revealing one of the worthless prizes. Because Monty
    is the host, he knows which doors conceal the worthless prizes and always
    chooses one of them to reveal, but never the door chosen by the contestant.
    Then the contestant is offered the choice of keeping what is behind her orig-
    inal door or trading for what is behind the remaining unopened door. What
    should she do?
    There are two popular answers.

       " There are two unopened doors, they are equally likely to conceal the
         fabulous prize, so it doesn't matter which one she chooses.
       " She had a 1/3 probability of choosing the right door initially, a 2/3
         chance of getting the prize if she trades, so she should trade.


(a) Create a simulation in R to discover which answer is correct.


﻿


1.10. EXERCISES                                                         92

      (b) Show using formal arguments of conditional probability which answer
         is correct.

     Make sure your answers to (a) and (b) agree!

 40. Prove Theorem     (pg.  ).


﻿


CHAPTER 2


                 MODES OF INFERENCE


2.1 Data

This chapter takes up the heart of statistics: making inferences, quantitatively, from
data. The data, Y1, ..y. , y are assumed to be a random sample from a population.
   In Chapter;1 we reasoned from f to Y. That is, we made statements like "If
the experiment is like ... ,then f will be ... ,and (yi, . . ., y) will look like ... " or
"E(Y) must be ... ", etc. In Chapter 2 we reason from Y to f. That is, we make
statements such as "Since (Yi,... , yn) turned out to be ... it seems that f is likely to
be ... ", or "f yf(y) dy is likely to be around ... ", etc. This is a basis for knowledge:
learning about the world by observing it. Its importance cannot be overstated. The
field of statistics illuminates the type of thinking that allows us to learn from data
and contains the tools for learning quantitatively.
   Reasoning from Y to f works because samples are usually like the populations
from which they come. For example, if f has a mean around 6 then most reason-
ably large samples from f also have a mean around 6, and if our sample has a
mean around 6 then we infer that f likely has a mean around 6. If our sample has
an SD around 10 then we infer that f likely has an SD around 10, and so on. So
much is obvious. But can we be more precise? If our sample has a mean around
6, then can we infer that f likely has a mean somewhere between, say, 5.5 and
6.5, or can we only infer that f likely has a mean between 4 and 8, or even worse,
between about -100 and 100? When can we say anything quantitative at all about
the mean of f? The answer is not obvious, and that's where statistics comes in.
Statistics provides the quantitative tools for answering such questions.
   This chapter presents several generic modes of statistical analysis.
Data Description Data description can be visual, through graphs, charts, etc., or
     numerical, through calculating sample means, SD's, etc. Displaying a few


93


﻿


2.2. DATA DESCRIPTION


94


     simple features of the data yi,..., y, can allow us to visualize those same
     features of f. Data description requires few a priori assumptions about f.

Likelihood In likelihood inference we assume that f is a member of a parametric
     family of distributions {fo : 8 E 8}. Then inference about f is the same
     as inference about the parameter 0, and different values of 0 are compared
     according to how well fo explains the data.

Estimation The goal of estimation is to estimate various aspects of f, such as
     its mean, median, SD, etc. Along with the estimate, statisticians try to give
     quantitative measures of how accurate the estimates are.

Bayesian Inference Bayesian inference is a way to account not just for the data
     Y, ..., yn, but also for other information we may have about f.

Prediction Sometimes the goal of statistical analysis is not to learn about f per se,
     but to make predictions about y's that we will see in the future. In addition
     to the usual problem of not knowing f, we have the additional problem that
     even if we knew f, we still wouldn't be able to predict future y's exactly.

Hypothesis Testing Sometimes we want to test hypotheses like Head Start is good
     for kids or lower taxes are good for the economy or the new treatment is better
     than the old.

Decision Making Often, decisions have to be made on the basis of what we have
     learned about f. In addition, making good decisions requires accounting for
     the potential gains and losses of each decision.


2.2 Data Description

There are many ways, both graphical and numerical, to describe data sets. Some-
times we're interested in means, sometimes variations, sometimes trends through
time, and there are good ways to describe and display all these aspects and many
more. Simple data description is often enough to shed light on an underlying sci-
entific problem. The subsections of Section 2.2 show some basic ways to describe
various types of data.


﻿


2.2. DATA DESCRIPTION


95


2.2.1 Summary Statistics
One of the simplest ways to describe a data set is by a low dimensional summary.
For instance, in Example   on ocean temperatures there were multiple measure-
ments of temperatures from each of 9 locations. The measurements from each
location were summarized by the sample mean y = n--E y2; comparisons of the
9 sample means helped oceanographers deduce the presence of the Mediterranean
tongue. Similarly, the essential features of many data sets can be captured in a
one-dimensional or low-dimensional summary. Such a summary is called a statis-
tic. The examples below refer to a data set Y1, ..., yn of size n.

Definition 2.1 (Statistic). A statistic is any function, possibly vector valued, of the
data.

   The most important statistics are measures of location and dispersion. Impor-
tant examples of location statistics include

mean The mean of the data is y =n--1 E y. fR can compute means:

          y <- 1:10
          mean (y)


median A median of the data is any number m such that at least half of the yr's are
     less than or equal to m and at least half of the yr's are greater than or equal
     to m. We say "a" median instead of "the" median because a data set with an
     even number of observations has an interval of medians. For example, if y
     <- 1: 10, then every m E [5,6] is a median. When R computes a median it
     computes a single number by taking the midpoint of the interval of medians.
     So median(y) yields 5.5.

quantiles For any p E [0, 1], the p-th quantile of the data should be, roughly speak-
     ing, the number q such that pn of the data points are less than q and (1 - p)n
     of the data points are greater than q.
     Figure 2. illustrates the idea. Panel a shows a sample of 100 points plotted
     as a stripchart (page 108). The black circles on the abcissa are the .05, .5,
     and .9 quantiles; so 5 points (open circles) are to the left of the first vertical
     line, 50 points are on either side of the middle vertical line, and 10 points are
     to the right of the third vertical line. Panel b shows the empirical cdf of the
     sample. The values .05, .5, and .9 are shown as squares on the vertical axis;


﻿


2.2. DATA DESCRIPTION


96


     the quantiles are found by following the horizontal lines from the vertical
     axis to the cdf, then the vertical lines from the cdf to the horizontal axis.
     Panels c and d are similar, but show the distribution from which the sample
     was drawn instead of showing the sample itself. In panel c, 5% of the mass
     is to the left of the first black circle; 50% is on either side of the middle black
     circle; and 10% is to the right of the third black dot. In panel d, the open
     squares are at .05, .5, and .9 on the vertical axis; the quantiles are the circles
     on the horizontal axis.
     Denote the p-th quantile as qp(yi,..., y.), or simply as qp if the data set is
     clear from the context. With only a finite sized sample qp(yi, ... , y.) cannot
     be found exactly. So the algorithm for finding quantiles works as follows.

       1. Sort the yz's in ascending order. Label them y(1), ..., Y(n) so that

                                     Y(1) < ...  _ Y(n)

       2. Set qo - y(1) and q1 - y .
       3. Y(2) through Y(n_1) determine n - 1 subintervals in [y(1), y(n)]. So, for
          i = 1, . .. , n - 2, set q_ - 1- y(2+1).

       4. For p c (n-, i) let qp be any number in the interval (q_ , q   ).

     If p is a "nice" number then qp is often given a special name. For example, q.5
     is the median; (q.25, q.5, q.75), the first, second and third quartiles, is a vector-
     valued statistic of dimension 3; q.1, q.2,... are the deciles; q.78 is the 78'th
     percentile.
     R can compute quantiles. When faced with p c (-- , n) R does linear inter-
     polation. E.g. quantile (y, c (. 25, .75)) yields (3.25, 7.75).
     The vector (y(1), .. . , y()) defined in step  of the algorithm for quantiles is an
     n-dimensional statistic called the order statistic. y(i) by itself is called the i'th
     order statistic.

Figure    was created with the following R code.

            par ( mfrow=c(2,2) )
            quant <- c ( .05, .5, .9 )
            nquant <- length(quant)


﻿


2.2. DATA DESCRIPTION


97


a


b


  00
  0

LL
  0
  0


0 2 4 6 8

       y


       c


0 2 4 6 8


y


d


LO)

0
0
0


00


0
0
6


                       0 2 4 6 8                     0 2 4 6 8

                               y                            y


Figure 2.1: Quantiles. The black circles are the .05, .5, and .9 quantiles. The open
squares are the numbers .05, .5, and .9 on the vertical axis. Panels a and b are for
a sample; panels c and d are for a distribution.


﻿


2.2. DATA DESCRIPTION


98


Dispersion statistics measure how spread out the data are. Since there are many
ways to measure dispersion there are many dispersion statistics. Important disper-
sion statistics include
standard deviation The sample standard deviation or SD of a data set is

                                  S-
                                             12


﻿


2.2. DATA DESCRIPTION


99


      Note: some statisticians prefer


                                     _ E (yi - Y) 2
                                     n3--
                                           n2-1

      for reasons which do not concern us here. If n is large there is little difference
      between the two versions of s.

variance The sample variance is

                                    2   E ______)_
                                            n
      Note: some statisticians prefer

                                    2 _ ___ -_02
                                          n-1

      for reasons which do not concern us here. If n is large there is little difference
      between the two versions of s2.

interquartile range The interquartile range is q.75 - q.25

   Presenting a low dimensional statistic is useful if we believe that the statistic is
representative of the whole population. For instance, in Example   , oceanogra-
phers believe the data they have collected is representative of the long term state
of the ocean. Therefore the sample means at the nine locations in Figure  are
representative of the long term state of the ocean at those locations. More formally,
for each location we can imagine a population of temperatures, one temperature
for each moment in time. That population has an unknown pdf f. Even though are
data are not really a random sample from f (The sampling times were not chosen
randomly, among other problems.) we can think of them that way without making
too serious an error. The histograms in Figure 1:1 are estimates of the f's for the
nine locations. The mean of each f is what oceanographers call a climatological
mean, or an average which, because it is taken over a long period of time, repre-
sents the climate. The nine sample means are estimates of the nine climatological
mean temperatures at those nine locations. Simply presenting the sample means
reveals some interesting structure in the data, and hence an interesting facet of
physical oceanography.
   Often, more than a simple data description or display is necessary; the statisti-
cian has to do a bit of exploring the data set. This activity is called exploratory data


﻿


2.2. DATA DESCRIPTION


100


analysis or simply eda. It is hard to give general rules for eda, although displaying
the data in many different ways is often a good idea. The statistician must decide
what displays and eda are appropriate for each data set and each question that
might be answered by the data set. That is one thing that makes statistics inter-
esting. It cannot be reduced to a set of rules and procedures. A good statistician
must be attuned to the potentially unique aspects of each analysis. We now present
several examples to show just a few of the possible ways to explore data sets by
displaying them graphically. The examples reveal some of the power of graphical
display in illuminating data and teasing out what it has to say.


2.2.2 Displaying Distributions

Instead of reducing a data set to just a few summary statistics, it is often helpful to
display the full data set. But reading a long list of numbers is usually not helpful;
humans are not good at assimilating data in that form. We can learn a lot more
from a graphical representation of the data.

Histograms The next examples use histograms to display the full distribution of
some data sets. Visual comparison of the histograms reveals structure in the data.

Example 2.1 (Tooth Growth)
The R statistical language comes with many data sets. Type data() to see what they
are. This example uses the data set ToothGrowth on the effect of vitamin C on tooth
growth in guinea pigs. You can get a description by typing help(ToothGrowth). You
can load the data set into your R session by typing data(ToothGrowth). ToothGrowth
is a dataframe of three columns. The first few rows look like this:

    len supp dose
1   4.2    VC  0.5
2 11.5     VC  0.5
3   7.3    VC  0.5


Column 1, or len, records the amount of tooth growth. Column 2, supp, records whether
the guinea pig was given vitamin C in ascorbic acid or orange juice. Column 3, dose,
records the dose, either 0.5, 1.0 or 2.0 mg. Thus there are six groups of guinea pigs in
a two by three layout. Each group has ten guinea pigs, for a total of sixty observations.
Figure 22 shows histograms of growth for each of the six groups. From Figure 2 2 it is
clear that dose affects tooth growth.


﻿


2.2. DATA DESCRIPTION


101


VC9 0.5


OJ, 0.5


CO9


10j


   0 5 10   20  30


         OJ, 1


10
N


  05


CO -

N\


o


0 5 10   20  30


VC9 2


OJ, 2


L0


10]


   0 5 10   20  30


  OJO


CO]
N


   0 5 10   20  30


Figure 2.2: Histograms of tooth growth by delivery method (VC or OJ) and dose

(0.5, 1.0 or 2.0).


﻿


oP.
jJ1

I-1 CD


O W~


O rt


   0


   0


   CD

   CD


   CD


   C


N
NQ


bm


y


0       1      2      3       4      5
I       I      I       I      I       I


0

0

No
Cal


]


k


I


K


0
C-


0

0

No
(31


0.0   0.5    1.0   1.5   2.0   2.5   3.0
I      I      I     I     I     I      I


0         1         2        3        4
I          I        I        I         I


C

0


0.0      0.5
1         1


1.0      1.5       2.0
1         1         1


0

O         _ _     _ _


    0.0  0.5  1.0  1.5


0


0

0

No
01l


C
p


2.0   2.5   3.0


I     I     I      I     I     I     I


02

O4

i
01]


0


    0.0       0.5      1.0       1.5      2.0
          II            I         I        I
o


N


I-1

N


﻿


2.2. DATA DESCRIPTION


103


VC, 0.5


VC, 1


VC, 2


CO
IC)


   0 10  25


     OJ, 0.5


   i 10 25


    OJ1 1
COH


   0 10  25


     0J

O


N


O


   0 10  25


     OJ, 2


O


   0 10 25


Figure 2.4: Histograms of tooth growth by delivery method (VC or OJ) and dose
(0.5, 1.0 or 2.0).


﻿


2.2. DATA DESCRIPTION


104


   Figure    was produced by the following R code.

       supp <- unique ( ToothGrowth$supp )
       dose <- unique ( ToothGrowth$dose )
       par ( mfcol=c(3,2) )
       for ( i in 1:2 )
       for ( j in 1:3 ) {
       good <- (      ToothGrowth$supp == supp[i]
                    & ToothGrowth$dose == dose[j] )
         hist ( ToothGrowth$len[good], breaks=seq(0,34,by=2),
                 xlab="", ylab="",
                 main=paste(supp[i] ,       ", dose [j] , sep="") )
       }


   " unique (x) returns the unique values in x. For example, if x <- c (1,1,2) then
     unique (x) would be 1 2.


   Figure  is similar to Figure   but laid out in the other direction. (Notice that it's
easier to compare histograms when they are arranged vertically rather than horizontally.)
The figures suggest that delivery method does have an effect, but not as strong as
the dose effect. Notice also that Figure   is more difficult to read than Figure
because the histograms are too tall and narrow. Figure repeats Figure   but using
less vertical distance; it is therefore easier to read. Part of good statistical practice is
displaying figures in a way that makes them easiest to read and interpret.
   The figures alone have suggested that dose is the most important effect, and delivery
method less so. A further analysis could try to be more quantitative: what is the typical
size of each effect, how sure can we be of the typical size, and how much does the effect
vary from animal to animal. The figures already suggest answers, but a more formal
analysis is deferred to Section

   Figures      ,   , and     are histograms. The abscissa has the same scale as
the data. The data are divided into bins. The ordinate shows the number of data
points in each bin. (hist( ... ,prob=T) plots the ordinate as probability rather
than counts.) Histograms are a powerful way to display data because they give a
strong visual impression of the main features of a data set. However, details of the
histogram can depend on both the number of bins and on the cut points between
bins. For that reason it is sometimes better to use a display that does not depend


﻿


2.2. DATA DESCRIPTION


105


on those features, or at least not so strongly. Example   illustrates.

Density Estimation

Example 2.2 (Hot Dogs)
In June of 1986, Consumer Reports published a study of hot dogs. The data are available
at DASL, the Data and Story Library, a collection of data sets for free use by statistics
students. DASL says the data are

      "Results of a laboratory analysis of calories and sodium content of major hot
      dog brands. Researchers for Consumer Reports analyzed three types of hot
      dog: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry
      meat)."

You can download the data from http : //lib. stat. cmu. edu/DASL/Dataf iles/Hotdogs .
html. The first few lines look like this:

     Type          Calories           Sodium
     Beef          186                495
     Beef          181                477


This example looks at the calorie content of beef hot dogs. (Later examples will compare
the calorie contents of different types of hot dogs.)
    Figure   (a) is a histogram of the calorie contents of beef hot dogs in the study.
From the histogram one might form the impression that there are two major varieties of
beef hot dogs, one with about 130-160 calories or so, another with about 180 calories or
so, and a rare outlier with fewer calories. Figure  (b) is another histogram of the same
data but with a different bin width. It gives a different impression, that calorie content
is evenly distributed, approximately, from about 130 to about 190 with a small number
of lower calorie hot dogs. Figure  (c) gives much the same impression as  (b). It
was made with the same bin width as  (a), but with cut points starting at 105 instead
of 110. These histograms illustrate that one's impression can be influenced by both bin
width and cut points.
    Density estimation is a method of reducing dependence on cut points. Let x1, ... ,
x20 be the calorie contents of beef hot dogs in the study. We think of x1,... , x20 as a
random sample from a density f representing the population of all beef hot dogs. Our
goal is to estimate f. For any fixed number x, how shall we estimate f(x)? The idea is
to use information local to x to estimate f(x). We first describe a basic version, then
add two refinements to get kernel density estimation and the density() function in R.


﻿


2.2. DATA DESCRIPTION


106


    Let n be the sample size (20 for the hot dog data). Begin by choosing a number
h > 0. For any number x the estimate fbasic(x) is defined to be
                  1   nfraction of sample points within h of x
       basic(x   2h=(x-h,x+h) (i)                           2
                      i=1
fbasic has at least two apparently undesirable features.
   1. fb2sic(x) gives equal weight to all data points in the interval (x - h, x + h) and
      has abrupt cutoffs at the ends of the interval. It would be better to give the most
      weight to data points closest to x and have the weights decrease gradually for
      points increasingly further away from x.

   2. fb2sic(x) depends critically on the choice of h.
We deal with these problems by introducing a weight function that depends on distance
from x. Let go be a probability density function. Usually go is chosen to be symmetric
and unimodal, centered at 0. Define

                                       n
Choosing go to be a probability density ensures that f is also a probability density because

                          ff(x) dx i=/          go(x - x) dx= 1                  (2.1)

When go is chosen to be a continuous function it deals nicely with problem   above. In
fact, fbasic comes from taking go to be the uniform density on (-h, h).
    To deal with problem 2 we rescale go. Choose a number h > 0 and define a new
density g(x) = h-1go(x/h). A little thought shows that g differs from go by a rescaling
of the horizontal axis; the factor h-1 compensates to make f g = 1. Now define the
density estimate to be

                  fh(x)          g(x-xi) =           go(     z)/h)

h is called the bandwidth. Of course fh still depends on h. It turns out that dependence
on bandwidth is not really a problem. It is useful to view density estimates for several
difFerent bandwidths. Each reveals features of f at difFerent scales. Figures 2,5(d), (e),
and (f) are examples. Panel (d) was produced by the default bandwidth; panels (e)
and (f) were produced with 1/4 and 1/2 the default bandwidth. Larger bandwidth
makes a smoother estimate of f; smaller bandwidth makes it rougher. None is exactly
right. It is useful to look at several.


﻿


2.2. DATA DESCRIPTION


107


(a)


(b)


LO-


Co -]


       I    I    I     I
       120 140  160  180

            calories


              (c)


        I   I    I    I
        120 140 160 180

            calories


CO


o-1

       I    I    I     I
       120 140  160  180

            calories


              (d)


6       I      I      I
       100    150    200


calories


(e)


(f)


C,)
a)


0
C
0
0

10
0
0

0
0
0
0


C,)
a)


0


0
0


0
0
O


100      140     180


100    140     180


calories


calories


Figure 2.5: (a), (b), (c): histograms of calorie contents of beef hot dogs; (d), (e),

(f): density estimates of calorie contents of beef hot dogs.


﻿


2.2. DATA DESCRIPTION


108


   Figure   was produced with

       hotdogs <- read.table ( "data/hotdogs/data", header=T )
       cal.beef <- hotdogs$Calories [ hotdogs$Type == "Beef" ]
       par ( mfrow=c(3,2) )
       hist ( cal.beef, main="(a)", xlab="calories", ylab="" )
       hist ( cal.beef, breaks=seq(110,190,by=20), main="(b)",
              xlab="calories", ylab="" )
       hist ( cal.beef, breaks=seq(105,195,by=10), main="(c)",
              xlab="calories", ylab="" )

       plot ( density ( cal.beef ), main="(d)", xlab="calories",
              ylab="density" )
       plot ( density ( cal.beef, adjust=1/4 ), main="(e)",
              xlab="calories", ylab="density" )
       plot ( density ( cal.beef, adjust=1/2 ), main="(f)",
              xlab="calories", ylab="density" )


   " In panel (a) R used its default method for choosing histogram bins.

   " In panels (b) and (c) the histogram bins were set by
     hist ( ..., breaks=seq(...)     ).

   " density() produces a kernel density estimate.

   " R uses a Gaussian kernel by default which means that go above is the N(0, 1)
     density.

   " In panel (d) R used its default method for choosing bandwidth.

   " In panels (e) and (f) the bandwidth was set to 1/4 and 1/2 the default by
     density(..., adjust=...).


Stripcharts and Dotplots Figure    uses the ToothGrowth data to illustrate
stripcharts, also called dotplots, an alternative to histograms. Panel (a) has three
rows of points corresponding to the three doses of ascorbic acid. Each point is for
one animal. The abscissa shows the amount of tooth growth; the ordinate shows


﻿


2.2. DATA DESCRIPTION


109


the dose. The panel is slightly misleading because points with identical coordinates
are plotted directly on top of each other. In such situations statisticians often add
a small amount of jitter to the data, to avoid overplotting. The middle panel is a
repeat of the top, but with jitter added. The bottom panel shows tooth growth by
delivery method. Compare Figure       to Figures    and     . Which is a better
display for this particular data set?

   Figure     was produced with the following R code.

   par ( mfrow=c(3,1) )
     stripchart ( ToothGrowth$len ~ ToothGrowth$dose, pch=1,
                   main="(a)", xlab="growth", ylab="dose" )
     stripchart ( ToothGrowth$len ~ ToothGrowth$dose,
                   method="jitter", main="(b)", xlab="growth",
                   ylab="dose", pch=1 )
     stripchart ( ToothGrowth$len ~ ToothGrowth$supp,
                   method="jitter", main="(c)", xlab="growth",
                   ylab="method", pch=1 )


Boxplots An alternative, useful for comparing many distributions simultaneously,
is the boxplot. Example     uses boxplots to compare scores on 24 quizzes in a
statistics course.

Example 2.3 (Quiz Scores)
In the spring semester of 2003, 58 students completed Statistics 103 at Duke University.
Figure    displays their grades.
   There were 24 quizzes during the semester. Each was worth 10 points. The upper
panel of the figure shows the distribution of scores on each quiz. The abscissa is labelled
1 through 24, indicating the quiz number. For each quiz, the figure shows a boxplot. For
each quiz there is a box. The horizontal line through the center of the box is the median
grade for that quiz. We can see that the median score on Quiz 2 is around 7, while the
median score on Quiz 3 is around 4. The upper end of the box is the 75th percentile (3rd
quartile) of scores; the lower end of the box is the 25th percentile (1st quartile). We can
see that about half the students scored between about 5 and 8 on Quiz 2, while about
half the students scored between about 2 and 6 on Quiz 3. Quiz 3 was tough.
   Each box may have whiskers, or dashed lines which extend above and below the box.
The exact definition of the whiskers is not important, but they are meant to include most


﻿


2.2. DATA DESCRIPTION


110


(a)


N\


(1)
0


10O


O\


(1)
0


                     0   0 0000 00 000 0 0 0 0 0


             0 000 00 0 00 0 0 00 000 0


0 0000000 0 00 0 0 00  0

5       10     15     20     25      30     35

                    growth


                    (b)


                    0      00 0 00 8 0 * °00


              08%0o90o0 0 0 o 000


000 o°c   lp 0 00 0      0


5       10     15     20     25      30     35

                    growth


                    (C)


 0 00 0       o o    0 0 Q 0000     0 °00


    0 00     °  O  o0     0         0    0
    o 0                0 0 cob c 0 g ° 0 0


 5      10     15     20     25      30     35


0_


10


O


E


0o


                                     growth


Figure 2.6: (a) Tooth growth by dose, no jittering; (b) Tooth growth by dose with

jittering; (c) Tooth growth by delivery method with jittering


﻿


2.2. DATA DESCRIPTION


111


of the data points that don't fall inside the box. (In R, by default, the whiskers extend to
the most extreme data point which is no more that 1.5 times the interquartile range away
from the median.) Finally, there may be some individual points plotted above or below
each boxplot. These indicate outliers, or scores that are extremely high or low relative to
other scores on that quiz. Many quizzes had low outliers; only Quiz 5 had a high outlier.
   Box plots are extremely useful for comparing many sets of data. We can easily see,
for example, that Quiz 5 was the most difficult (75% of the class scored 3 or less.) while
Quiz 1 was the easiest (over 75% of the class scored 10.)
   There were no exams or graded homeworks. Students' grades were determined by
their best 20 quizzes. To compute grades, each student's scores were sorted, the first 4
were dropped, then the others were averaged. Those averages are displayed in a stripchart
in the bottom panel of the figure. It's easy to see that most of the class had quiz averages
between about 5 and 9 but that 4 averages were much lower.

   Figure    was produced by the following R code.

       ... # read in the data
       colnames(scores) <- paste("Q",1:24,sep="")
       # define column names
       boxplot ( data.frame(scores), main="Individual quizzes" )

       scores[is.na(scores)] <- 0 # replace missing scores
                                     # with 0's
       temp <- apply ( scores, 1, sort ) # sort
       temp <- temp[5:24,]                   # drop the 4 lowest
       scores.ave <- apply ( temp, 2, mean ) # find the average

       stripchart ( scores.ave, "jitter", pch=1, xlab="score",
                     xlim=c (0, 10), main="Student averages" )


QQ plots Sometimes we want to assess whether a data set is well modelled by a
Normal distribution and, if not, how it differs from Normal. One obvious way to
assess Normality is by looking at histograms or density estimates. But the answer
is often not obvious from the figure. A better way to assess Normality is with
QQ plots. Figure     illustrates for the nine histograms of ocean temperatures in
Figure


﻿


2.2. DATA DESCRIPTION                               112


                      Individual quizzes

           °  -T-    TT     T 11111 T

                        --  0     oQ  o


               Qi Q4 Q7   Qi1  Q15 Q19 Q23


                      Student averages
                      1100


                   00           0000
               °O 0o     o    8     °Z O


               I     I    I     I     II
               0     2    4     6    8    10

                           score


Figure 2.7: Quiz scores from Statistics 103


﻿


2.2. DATA DESCRIPTION


113


   Each panel in Figure   was created with the ocean temperatures near a par-
ticular (latitude, longitude) combination. Consider, for example, the upper left
panel which was constructed from the n = 213 points x1, ... x213 taken near (45,
-40). Those points are sorted, from smallest to largest, to create the order statis-
tic (x(1), ..., x(213)). Then they are plotted against E[(z(1), ..., z(213))], the expected
order statistic from a Normal distribution. If the xis are approximately Normal
then the QQ plot will look approximately linear. The slope of the line indicates the
standard deviation.
   In Figure   most of the panels do look approximately linear, indicating that a
Normal model is reasonable. But some of the panels show departures from Normal-
ity. In the upper left and lower left panels, for example, the plots looks roughly lin-
ear except for the upper right corners which show some data points much warmer
than expected if they followed a Normal distribution. In contrast, the coolest tem-
peratures in the lower middle panel are not quite as cool as expected from a Normal
distribution.

   Figure    was produced with

   lats <- c ( 45, 35, 25 )
   lons <- c ( -40, -30, -20 )
   par ( mfrow=c(3,3) )
   for ( i in 1:3 )
   for ( j in 1:3 ) {
      good <- abs ( med.1000$lon - lons[j ) < 1 &
      abs ( med.1000$lat - lats[i] ) < 1
      qqnorm ( med.1000$temp [good], xlab="", ylab="",
        sub=paste("n = ", sum(good), sep=""),
        main = paste ( "latitude =", lats[i], "\n longitude =",
                         lons[ji] ))
    }


2.2.3   Exploring Relationships
Sometimes it is the relationships between several random variables that are of
interest. For example, in discrimination cases the focus is on the relationship be-
tween race or gender on one hand and employment or salary on the other hand.
Subsection      shows several graphical ways to display relationships.


﻿


2.2. DATA DESCRIPTION


114


co


ICI


latitude = 45
longitude = -40


          0

          0
          0


 -3 -1  1  3


    n = 213

 latitude = 35
 longitude = -40


 ~1/


 -2    0   2


     n=37


latitude = 45
longitude = -30


10

10
10
10)


  latitude = 45
  longitude = -20

_


coI


-2  0  2


-2  0  2


LO


LO
(0


LO
IC)


      n = 105

  latitude = 35
  longitude = -30
  100


0
   coo

c6


   -2   0 1 2


      n=112

  latitude = 35
  longitude = -20
0
           0


   -2   0   2


      n = 44

  latitude = 25
  longitude = -20


N        6


o   2   0


   -2   0   2


n=24


  latitude = 25
  longitude = -40


10
c

10

    -2  0Y


    -2  0   2


  latitude = 25
  longitude = -30

0
c   -      0
          6)

0


   -2   0   2


n=47


n=35


n=27


Figure 2.8: QQ plots of water temperatures (°C) at 1000m depth


﻿


2.2. DATA DESCRIPTION


115


   We begin with Example 2.4, an analysis of potential discrimination in admission
to UC Berkeley graduate school.

Example 2.4
In 1973 UC Berkeley investigated its graduate admissions rates for potential sex bias. Ap-
parently women were more likely to be rejected than men. The data set UCBAdmissions
gives the acceptance and rejection data from the six largest graduate departments on
which the study was based. Typing help(UCBAdmissions) tells more about the data.
It tells us, among other things:


Format:

      A 3-dimensional array resulting from cross-tabulating 4526
      observations on 3 variables. The variables and their levels
      are as follows:

        No  Name     Levels
        1 Admit      Admitted, Rejected
        2   Gender   Male, Female
        3   Dept     A, B, C, D, E, F


The major question at issue is whether there is sex bias in admissions. To investigate we
ask whether men and women are admitted at roughly equal rates.
   Typing UCBAdmissions gives the following numerical summary of the data.

   Dept = A

           Gender
Admit       Male Female
  Admitted    512      89
  Rejected    313      19

    Dept = B

           Gender
Admit       Male Female
  Admitted    353      17
  Rejected    207       8


﻿


2.2. DATA DESCRIPTION


116


    Dept = C

            Gender
Admit        Male Female
  Admitted    120     202
  Rejected    205     391

    Dept = D

            Gender
Admit        Male Female
  Admitted    138     131
  Rejected    279     244

    Dept = E

            Gender
Admit        Male Female
  Admitted     53      94
  Rejected    138     299

    Dept = F

            Gender
Admit        Male Female
  Admitted     22      24
  Rejected    351     317

  For each department, the twoway table of admission status versus sex is displayed.
Such a display, called a crosstabulation, simply tabulates the number of entries in each
cell of a multiway table. It's hard to tell from the crosstabulation whether there is a
sex bias and, if so, whether it is systemic or confined to just a few departments. Let's
continue by finding the marginal (aggregated by department as opposed to conditional
given deparment) admissions rates for men and women.

> apply(UCBAdmissions, c(1, 2), sum)
            Gender
Admit        Male Female


﻿


2.2. DATA DESCRIPTION


117


  Admitted 1198        557
  Rejected 1493      1278

The admission rate for men is 1198/(1198 + 1493) = 44.5% while the admission rate for
women is 557/(557 + 1493) = 30.4%, much lower. A mosaic plot, created with

mosaicplot(apply(UCBAdmissions, c(1, 2), sum),
             main = "Student admissions at UC Berkeley")

is a graphical way to display the discrepancy. (A beautiful example of a mosaic plot is on
the cover of CHANCE magazine. ref here.) The left column is for admitted students; the
heights of the rectangles show how many admitted students were male and how many
were female. The right column is for rejected students; the heights of the rectangles
show how many were male and female. If sex and admission status were independent,
i.e., if there were no sex bias, then the proportion of men among admitted students
would equal the proportion of men among rejected students and the heights of the left
rectangles would equal the heights of the right rectangles. The apparent difference in
heights is a visual representation of the discrepancy in sex ratios among admitted and
rejected students. The same data can be viewed as discrepant admission rates for men
and women by transposing the matrix:

mosaicplot(t(apply(UCBAdmissions, c(1, 2), sum)),
             main = "Student admissions at UC Berkeley")

   The existence of discrepant sex ratios for admitted and rejected students is equivalent
to the existence of discrepant admission rates for males and females and to dependence
of sex and admission rates. The lack of discrepant ratios is equivalent to independence
of sex and admission rates.
    Evidently UC Berkeley admitted men and women at different rates. But graduate
admission decisions are not made by a central admissions offce; they are made by the
individual departments to which students apply. So our next step is to look at admission
rates for each department separately. We can look at the crosstabulation on page 115 or
make mosaic plots for each department separately (not shown here) with

  ## Mosaic plots for individual departments
for(i in 1:6)
  mosaicplot(UCBAdmissions[, ,i],
     xlab = "Admit", ylab = "Sex",
     main = paste ("Department", LETTERS [i]))


﻿


2.2. DATA DESCRIPTION


118


Student admissions at UC Berkeley


     Admitted            Rejected


(1)


ai)

ai)


i           i


Admit


a)
E
a)
U-


Figure 2.9: Mosaic plot of UCBAdmiss ions


﻿


2.2. DATA DESCRIPTION


119


Student admissions at UC Berkeley


         Male              Female


a)
E


E
-o


a)
a)


Gender


Figure 2.10: Mosaic plot of UCBAdmissions


﻿


2.2. DATA DESCRIPTION


120


   The plots show that in each department men and women are admitted at roughly
equal rates. The following snippet calculates and prints the rates. It confirms the rough
equality except for department A which admitted women at a higher rate than men.

for ( i in 1:6 ) {                               # for each department
  temp <- UCBAdmissions[,,i]                     # that department's data
  m <- temp[1,1] / (temp[1,1]+temp[2,1]) # Men's admission rate
  w <- temp[1,2] / (temp[1,2]+temp[2,2]) # Women's admission rate
  print ( c ( m,w) )                             # print them
}

Note that departments A and B which had high admission rates also had large numbers
of male applicants while departments C, D, E and F which had low admission rates had
large numbers of female applicants. The generally accepted explanation for the discrepant
marginal admission rates is that men tended to apply to departments that were easy to
get into while women tended to apply to departments that were harder to get into. A
more sinister explanation is that the university gave more resources to departments with
many male applicants, allowing them to admit a greater proportion of their applicants.
The data we've analyzed are consistent with both explanations; the choice between them
must be made on other grounds.
    One lesson here for statisticians is the power of simple data displays and summaries.
Another is the need to consider the unique aspects of each data set. The explanation
of different admissions rates for men and women could only be discovered by someone
familiar with how universities and graduate schools work, not by following some general
rules about how to do statistical analyses.

   The next example is about the duration of eruptions and interval to the next
eruption of the Old Faithful geyser. It explores two kinds of relationships - the re-
lationship between duration and eruption and also the relationship of each variable
with time.

Example 2.5 (Old Faithful)
Old Faithful is a geyser in Yellowstone National Park and a great tourist attraction. As
Denbynd.Prgibn[1987] explain, "From August 1 to August 8, 1978, rangers and
naturalists at Yellowstone National Park recorded the duration of eruption and interval
to the next eruption (both in minutes) for eruptions of Old Faithful between 6 a.m.
and midnight. The intent of the study was to predict the time of the next eruption,
to be posted at the Visitor's Center so that visitors to Yellowstone can usefully budget
their time." The R dataset faithful contains the data. In addition to the references
listed there, the data and analyses can also be found in Weiber [198] and Debyan


﻿


2.2. DATA DESCRIPTION


121


         [    ]. The latter analysis emphasizes graphics, and we shall follow some of
their suggestions here.
    We begin exploring the data with stripcharts and density estimates of durations and
intervals. These are shown in Figure   The figure suggests bimodal distributions. For
duration there seems to be one bunch of data around two minutes and another around
four or five minutes. For interval, the modes are around 50 minutes and 80 minutes. A
plot of interval versus duration, Figure , suggests that the bimodality is present in the
joint distribution of the two variables. Because the data were collected over time, it might
be useful to plot the data in the order of collection. That's Figure  . The horizontal
scale in Figure  is so compressed that it's hard to see what's going on. Figure
repeats Figure     but divides the time interval into two subintervals to make the plots
easier to read. The subintervals overlap slightly. The persistent up-and-down character
of Figure     shows that, for the most part, long and short durations are interwoven,
as are long and short intervals. (Figure   is potentially misleading. The data were
collected over an eight day period. There are eight separate sequences of eruptions with
gaps in between. The faithful data set does not tell us where the gaps are.
         [    ] tell us where the gaps are and use the eight separate days to find errors
in data transcription.) Just this simple analysis, a collection of four figures, has given us
insight into the data that will be very useful in predicting the time of the next eruption.

    Figures    ,           and     were produced with the following R code.

    data(faithful)
    attach (faithful)

    par ( mfcol=c(2,2) )
    stripchart ( eruptions, method="jitter", pch=1, xlim=c(1,6),
                    xlab="duration (min)", main=" (a)" )
     plot ( density ( eruptions ), type="l", xlim=c(1,6),
             xlab="duration (min)", main=" (b)" )
     stripchart ( waiting, method="jitter", pch=1, xlim=c(40,100),
                    xlab="waiting (min)", main=" (c)" )
     plot ( density ( waiting ), type="l", xlim=c(40,100),
             xlab="waiting (min)", main=" (d)" )

     par ( mfrow=c(1,1) )
     plot ( eruptions, waiting, xlab="duration of eruption",
             ylab="time to next eruption" )


﻿


2.2. DATA DESCRIPTION


122


(a)


(c)


I  I   I  I  I  I
1  2  3  4  5  6

  duration (min)


       (b)


D


I   I I  I I  I I
40   60   80   100

   waiting (min)


       (d)


(I)
C)


C
O


0
6


(I)
a)


C')
0

0


0
6

0
0
6


1 2 3 4 5 6


40   60   80   100


duration (min)


waiting (min)


Figure 2.11: Old Faithful data: duration of eruptions and waiting time between
eruptions. Stripcharts: (a) and (c). Density estimates: (b) and (d).


﻿


2.2. DATA DESCRIPTION


123


0


0
O)


0O
00


              0
       O    a
          0
        0
        00 GOD
        o a00OGD
  0       0
  0   0   0 CD
     GD 0  C00
0 00    00 COD00
     0 0 GD0000
   0D0 GamD0(a) 0
     0 00 OODO
   00 0 009D
 0 0 0OXD 0(0 OD
      ® ®0000 0 0
      GDG00DO0000 0
  0 0 0 ()
  0  00OG
         0
  00 00
      000
      0


0
0~


0
0)
E


0  0C


0


0


0


    0 0 0
 0     0
 0     0
 0 0 0 0
 0 0D(®
 00 0 GD
 0(0)
 0 0
   00 00
   't~0 0
   ODGOD 0 0
00000
  GOOD 0
  0   000
  @,OD
  00 0
  00   0
  ®~)0
  ©ID0
    0


    0
    00
    0
0


0


0
0


0 _


. I  I    I   I    I    I   I    I


1.5 2.0 2.5 3.0


3.5  4.0 4.5  5.0


duration of eruption


Figure 2.12: Waiting time versus duration in the Old Faithful dataset


﻿


2.2. DATA DESCRIPTION


124


a


   O L

.2   C)
0~c
  - NIC


I


.. 11 11 11  1


                     I      I      I      I      I      I
                     0     50    100    150    200    250

                                  data number


                                       b


               0
               0)

          E


               0


                    0      50    100    150    200    250

                                  data number


Figure 2.13: (a): duration and (b): waiting time plotted against data number in
the Old Faithful dataset


﻿


2.2. DATA DESCRIPTION


125


al


-i                          a                       I  I


0
cV


LO
CY)
Lf)


V 'YYYVYV.V W WV NVYVY Y VYV' YY  N V I


         0               50              100              150

                              data number


                                 a2


j0 :

         0     20     40     60     80     100    120    140

                              data number


                                 b1


,.z)


5   -


0               50               100             150

                     data number


                         b2


I


                    0     20     40     60     80     100    120    140

                                         data number


Figure 2.14: (al), (a2): duration and (bl), (b2): waiting time plotted against data
number in the Old Faithful dataset


﻿


2.2. DATA DESCRIPTION


126


    par ( mfrow=c(2,1) )
    plot.ts ( eruptions, xlab="data number", ylab="duration",
               main="a" )
    plot.ts ( waiting, xlab="data number", ylab="waiting time",
               main="b" )

    par ( mfrow=c(4,1) )
    plot.ts ( eruptions[1:1501, xlab="data number",
               ylab="duration", main="a1" )
    plot.ts ( eruptions[130:2721, xlab="data number",
               ylab="duration", main="a2" )
    plot.ts ( waiting[1:150], xlab="data number",
               ylab="waiting time", main="b1")
    plot.ts ( waiting[130:272], xlab="data number",
               ylab="waiting time", main="b2")


   Figures     and      introduce coplots, a tool for visualizing the relationship
among three variables. They represent the ocean temperature data from Exam-
ple   . In Figure  there are six panels in which temperature is plotted against
latitude. Each panel is made from the points in a restricted range of longitude. The
upper panel, the one spanning the top of the Figure, shows the six different ranges
of longitude. For example, the first longitude range runs from about -10 to about
-17. Points whose longitude is in the interval (-17, -10) go into the upper right
panel of scatterplots. These are the points very close to the mouth of the Mediter-
ranean Sea. Looking at that panel we see that temperature increases very steeply
from South to North, until about 350, at which point they start to decrease steeply
as we go further North. That's because we're crossing the Mediterranean tongue at
a point very close to its source.
   The other longitude ranges are about (-20, -13), (-25, -16), (-30, -20), (-34, -25)
and (-40, -28). They are used to create the scatterplot panels in the upper cen-
ter, upper left, lower right, lower center, and lower left, respectively. The general
impression is

   " temperatures decrease slightly as we move East to West,

   " the angle in the scatterplot becomes slightly shallower as we move East to
     West, and


﻿


2.2. DATA DESCRIPTION


127


   . there are some points that don't fit the general pattern.

Notice that the longitude ranges are overlapping and not of equal width. The
ranges are chosen by R to have a little bit of overlap and to put roughly equal
numbers of points into each range.
   Figure      reverses the roles of latitude and longitude. The impression is that
temperature increases gradually from West to East. These two figures give a fairly
clear picture of the Mediterranean tongue.

   Figures      and       were produced by

     coplot ( temp ~ lat I lon )
     coplot ( temp ~ lon I lat )


   Example      shows one way to display the relationship between two sequences
of events.

Example 2.6 (Neurobiology)
To learn how the brain works, neurobiologists implant electrodes into animal brains. These
electrodes are fine enough to record the firing times of individual neurons. A sequence
of firing times of a neuron is called a spike train. Figure  shows the spike train from
one neuron in the gustatory cortex of a rat while the rat was in an experiment on taste.
This particular rat was in the experiment for a little over 80 minutes. Those minutes are
marked on the y-axis. The x-axis is marked in seconds. Each dot on the plot shows a
time at which the neuron fired. We can see, for example, that this neuron fired about
nine times in the first five seconds, then was silent for about the next ten seconds. We
can also see, for example, that this neuron undergoes some episodes of very rapid firing
lasting up to about 10 seconds.
   Since this neuron is in the gustatory cortex - the part of the brain responsible for
taste - it is of interest to see how the neuron responds to various tastes. During the
experiment the rat was licking a tube that sometimes delivered a drop of water and
sometimes delivered a drop of water in which a chemical, or tastant, was dissolved. The
55 short vertical lines on the plot show the times at which the rat received a drop of 300
millimolar (.3 M) solution of NaCl. We can examine the plot for relationships between
deliveries of NaCl and activity of the neuron.


﻿


2.2. DATA DESCRIPTION


128


        Given : Ion

-40 -35 -30 -25 -20 -15 -10
  III      I  I  I   I


            SII
         SII
       III
  SII


I   I  I  I   I  II


20 30 40 50   20 30 40 50

                    0


          00
    0     0


0m~
     0    O    O
   O      O


      000
      20 3 0


      DO* *


  J04 6 20 30 40 50


10

C


- 1


a
E
W)


10-


lat


Figure 2.15: Temperature versus latitude for different values of longitude


﻿


2.2. DATA DESCRIPTION


129


Given : lat


20    25    30    35    40    45    50
I      I     I     I     I     I     I


                      SII


 I     I     I     I     I     I     I


-40 -30 -20 -10         -40 -30 -20 -10
       I I I I I I       I   I I I I


O


0

    LO
    T


    0
    T


i


- LO


l0


E
W)


10


10


                          OO


     0 O
 I                II  I


0           0                 o


          -40 -30 -20 -10


Ion


Figure 2.16: Temperature versus longitude for different values of latitude


﻿


2.2. DATA DESCRIPTION


130


6


T
00

D
1

C)
1

LO


LO
LO


LO


M
M

M 4
M


N


N

00

MT
M
T

00

,'4


                       a spike train


        - -*        I I.i.. -- -.-       ...


          Iii I.I ... .._..I-             __ - . --l-I


       ----                       -- -- -- --


0        10      20       30       40       50      60


                                            seconds


Figure 2.17: Spike train from a neuron during a taste experiment. The dots show


the times at which the neuron fired. The solid lines show times at which the rat


received a drop of a .3 M solution of NaCl.


﻿


2.2. DATA DESCRIPTION


131


Figure


was produced by


datadir <- "~/research/neuro/data/stapleton/"


spikes <- list (
  sig002a = scan ( paste ( datadir,
  sig002b = scan ( paste ( datadir,
  sig002c = scan ( paste ( datadir,
  sig003a = scan ( paste ( datadir,
  sig003b = scan ( paste ( datadir,
  sig004a = scan ( paste ( datadir,
  sig008a = scan ( paste ( datadir,
  sig014a = scan ( paste ( datadir,
  sig014b = scan ( paste ( datadir,
  sig017a = scan ( paste ( datadir,
                )
tastants <- list (
  MSG100 = scan ( paste ( datadir,
  MSG300 = scan ( paste ( datadir,
  NaC1100 = scan ( paste ( datadir,
  NaCl300 = scan ( paste ( datadir,
  water   = scan ( paste ( datadir,
                  )


"sig002a.txt", sep="" )
"sig002b.txt", sep="" )
"sig002c.txt", sep="" )
"sig003a.txt", sep="" )
"sig003b.txt", sep="" )
"sig004a.txt", sep="" )
"sig008a.txt", sep="" )
"sig014a.txt", sep="" )
"sig014b.txt", sep="" )
"sig017a.txt", sep="" )


"MSG100.txt", sep="" )
"MSG300.txt", sep="" )
"NaC1100.txt", sep="" )
"NaCl300.txt", sep="" )
"water.txt",   sep="" )


),


),


stripchart ( spikes [[8]1 %% 60 ~ spikes [[8]1 %/% 60, pch=". ",
             main="a spike train", xlab="seconds", ylab="minutes" )
points ( tastants$NaCl300 %% 60, tastants$NaCl300 %/% 60 + 1,
         pch=" " )

 " The line datadir <- ... stores the name of the directory in which I keep
   the neuro data. When used in paste it identifies individual files.

 " The command list () creates a list. The elements of a list can be anything. In
   this case the list named spikes has ten elements whose names are sig002a,
   sig002b, ..., and sig017a. The list named tastants has five elements whose
   names are MSG100, MSG300, NaC1100, NaCl300, and water. Lists are useful for
   keeping related objects together, esepecially when those objects aren't all of
   the same type.

 " Each element of the list is the result of a scano. scan() reads a file and
   stores the result in a vector. So spikes is a list of ten vectors. Each vector


﻿


2.3. LIKELIHOOD


132


     contains the firing times, or a spike train, of one neuron. tastants is a list of
     five vectors. Each vector contains the times at which a particular tastant was
     delivered.

   " There are two ways to refer an element of a list. For example, spikes [[811
     refers to the eighth element of spikes while tastants$NaC1300 refers to the
     element named NaC1300.

   " Lists are useful for keeping related objects together, especially when those
     objects are not the same type. In this example spikes$sig002a is a vector
     whose length is the number of times neuron 002a fired, while the length
     of spikes$sig002b is the number of times neuron 002b fired. Since those
     lengths are not the same, the data don't fit neatly into a matrix, so we use a
     list instead.


2.3 Likelihood

2.3.1 The Likelihood Function
It often happens that we observe data from a distribution that is not known pre-
cisely but whose general form is known. For example, we may know that the data
come from a Poisson distribution, X ~ Poi(A), but we don't know the value of A.
We may know that X ~ Bin(n, 0) but not know 0. Or we may know that the values
of X are densely clustered around some central value and sparser on both sides, so
we decide to model X ~ N(p, o-), but we don't know the values of p and o. In these
cases there is a whole family of probability distributions indexed by either A, 0, or
(p, r). We call A, 0, or (p, o) the unknown parameter; the family of distributions
is called a parametric family. Often, the goal of the statistical analysis is to learn
about the value of the unknown parameter. Of course, learning which value of the
parameter is the true one, or which values of the parameter are plausible in light
of the data, is the same as learning which member of the family is the true one, or
which members of the family are plausible in light of the data.
   The different values of the parameter, or the different members of the family,
represent different theories or hypotheses about nature. A sensible way to discrim-
inate among the theories is according to how well they explain the data. Recall
the Seedlings data (Examples   ,   ,    and    ) in which X was the number
of new seedlings in a forest quadrat, X ~ Poi(A), and different values of A repre-


﻿


2.3. LIKELIHOOD


133


sent different theories or hypotheses about the arrival rate of new seedlings. When
X turned out to be 3, how well a value of A explains the data is measured by
Pr[X = 3 A]. This probability, as a function of A, is called the likelihood function
and denoted {(A). It says how well each value of A explains the datum X = 3.
Figure    (pg.   ) is a plot of the likelihood function.
   In a typical problem we know the data come from a parametric family indexed
by a parameter 0, i.e. X1, ..., X, ~ i.i.d.f(x 0), but we don't know 0. The joint
density of all the data is

                        f (X1, ... , X 0 0) H f (Xi 0).                 (2.2)

Equation , as a function of 0, is the likelihood function. We sometimes write
f(Data 0) instead of indicating each individual datum. To emphasize that we are
thinking of a function of 0 we may also write the likelihood function as {(0) or
(0 Data).
   The interpretation of the likelihood function is always in terms of ratios. If, for
example, {(01)/{(02) > 1, then 01 explains the data better than 02. If  (01)/ (02) = k,
then 01 explains the data k times better than 02. To illustrate, suppose students in a
statistics class conduct a study to estimate the fraction of cars on Campus Drive that
are red. Student A decides to observe the first 10 cars and record X, the number
that are red. Student A observes

                 NR, R, NR, NR, NR, R, NR, NR, NR, R

and records X = 3. She did a Binomial experiment; her statistical model is
X  ~ Bin(10, 0); her likelihood function is  A(O) (3 )3(1 - 0)7. It is plotted in
Figure     . Because only ratios matter, the likelihood function can be rescaled by
any arbitrary positive constant. In Figure it has been rescaled so the maximum
is 1. The interpretation of Figure   is that values of 0 around 0 ~ 0.3 explain
the data best, but that any value of 0 in the interval from about 0.1 to about 0.6
explains the data not too much worse than the best. I.e., 0 ~ 0.3 explains the data
only about 10 times better than 0 ~ 0.1 or 0 ~ 0.6, and a factor of 10 is not really
very much. On the other hand, values of 0 less than about 0.05 or greater than
about 0.7 explain the data much worse than 0 ~ 0.3.

   Figure     was produced by the following snippet.

   theta <- seq ( 0, 1, by=.01 ) # some values of theta
   y <- dbinom ( 3, 10, theta )     # calculate l(theta)


﻿


2.3. LIKELIHOOD


134


O


0

-o
0


0
O

.)


Co
0
O
cD
0


0
o

O
CO
0


0.0     0.2     0.4     0.6     0.8


1.0


                                         0


Figure 2.18: Likelihood function f(O) for the proportion 0 of red cars on Campus
Drive


﻿


2.3. LIKELIHOOD


135


  y <- y / max(y)                   # rescale
  plot ( theta, y, type="l", xlab=expression(theta),
          ylab="likelihood function" )


   * expression is R's way of getting mathematical symbols and formulae into
     plot labels. For more information, type help (plotmath).


   To continue the example, Student B decides to observe cars until the third red
one drives by and record Y, the total number of cars that drive by until the third
red one. Students A and B went to Campus Drive at the same time and observed
the same cars. B records Y = 10. For B the likelihood function is
             B (O) =PL[Y =10 0]
                    P[2 reds among first 9 cars] x P[10'th car is red]

                    (2)02(1 - 0)7 x 0


EB differs from EA by the multiplicative constant (2)/ (3). But since multiplicative
constants don't matter, A and B really have the same likelihood function and hence
exactly the same information about 0. Student B would also use Figure as the
plot of her likelihood function.
   Student C decides to observe every car for a period of 10 minutes and record
Z1, ..., Zk where k is the number of cars that drive by in 10 minutes and each Z
is either 1 or 0 according to whether the i'th car is red. When C went to Campus
Drive with A and B, only 10 cars drove by in the first 10 minutes. Therefore C
recorded exactly the same data as A and B. Her likelihood function is

     Ec(0) = (1 - 0)0(1 - 0)(1 - 0)(1 - 0)0(1 - 0)(1 - 0)(1 - 0)0 = 03(1 - 0)7

 c is proportional to EA and EB and hence contains exactly the same information
 and looks exactly like Figure . So even though the students planned different
 experiments they ended up with the same data, and hence the same information
 about 0.
   The next example follows the Seedling story and shows what happens to the
likelihood function as data accumulates.


﻿


2.3. LIKELIHOOD


136


Example 2.7 (Seedlings, cont.)
Examples             , and    reported data from a single quadrat on the number of
new seedlings to emerge in a given year. In fact, ecologists collected data from multiple
quadrats over multiple years. In the first year there were 60 quadrats and a total of 40
seedlings so the likelihood function was

                               (A) - p(Data A)
                                  p(y1,.. .Y60 A)
                                     60
                                     fJp(y A)
                                     1
                                     60   _

                                     1    Y
                                  oc e-60AA40

Note that H1 y2! is a multiplicative factor that does not depend on A and so is irrelevant
to (A). Note also that  (A) depends only on 1 y2, not on the individual yi's. I.e., we
only need to know 1 y2 = 40; we don't need to know the individual yj's.  (A) is plotted
in Figure       Compare to Figure     (pg.   ). Figure      is much more peaked.
That's because it reflects much more information, 60 quadrats instead of 1. The extra
information pins down the value of A much more accurately.

    Figure     was created with

    lam <- seq ( 0, 2, length=50 )
    lik <- dpois ( 40, 60*lam )
    lik <- lik / max(lik)
    plot ( lam, lik, xlab=expression(lambda),
             ylab="likelihood", type="l" )


   The next example is about a possible cancer cluster in California.

Example 2.8 (Slater School)
This example was reported in        [    ]. See       [    ] for further analysis.
   The Slater school is an elementary school in Fresno, California where teachers and staff
were "concerned about the presence of two high-voltage transmission lines that ran past
the school ...." Their concern centered on the "high incidence of cancer at Slater...."


﻿


2.3. LIKELIHOOD


137


                 0o
           0
           0


                 0

                      0.0        0.5        1.0        1.5       2.0


                 Figure 2.19: .(O) after  y = 40 in 60 quadrats.


To address their concern, Dr. Raymond Neutra of the California Department of Health
Services' Special Epidemiological Studies Program conducted a statistical analysis on the

         "eight cases of invasive cancer, ... , the total years of employment of the
      hundred and forty-five teachers, teachers' aides, and staff members, ... , [and]
      the number of person-years in terms of National Cancer Institute statistics
      showing the annual rate of invasive cancer in American women between the
      ages of forty and forty-four - the age group encompassing the average age
      of the teachers and staff at Slater - [which] enabled him to calculate that
      4.2 cases of cancer could have been expected to occur among the Slater
      teachers and staff members . ..."

    For our purposes we can assume that X, the number of invasive cancer cases at the
Slater School has the Binomial distribution X  Bin(145, 0). We observe x= 8. The
likelihood function
                                 f(O) oc08(1 - 0)137                            (2.3)

is pictured in Figure 2,20. From the Figure it appears that values of 0 around .05 or .06,
explain the data better than values less than .05 or greater than .06, but that values of
0 anywhere from about .02 or .025 up to about .11 explain the data reasonably well.


﻿


2.3. LIKELIHOOD


138


          0


                     I         I         I         I        I
                   0.00      0.05      0.10      0.15     0.20

                                        0


                   Figure 2.20: Likelihood for Slater School


   Figure     was produced by the following R code.

   theta <- seq ( 0, .2, length=100 )
     lik <- dbinom ( 8, 145, theta )
     lik <- lik / max(lik)
     plot ( theta, lik, xlab=expression(theta),
            ylab="likelihood", type="l", yaxt="n" )


The first line of code creates a sequence of 100 values of 0 at which to compute  (0),
the second line does the computation, the third line rescales so the maximum likelihood
is 1, and the fourth line makes the plot.

   Examples      and     show how likelihood functions are used. They reveal
which values of a parameter the data support (equivalently, which values of a pa-
rameter explain the data well) and values they don't support (which values explain
the data poorly). There is no hard line between support and non-support. Rather,
the plot of the likelihood functions shows the smoothly varying levels of support
for different values of the parameter.
   Because likelihood ratios measure the strength of evidence for or against one


﻿


2.3. LIKELIHOOD


139


hypothesis as opposed to another, it is important to ask how large a likelihood
ratio needs to be before it can be considered strong evidence. Or, to put it another
way, how strong is the evidence in a likelihood ratio of 10, or 100, or 1000, or
more? One way to answer the question is to construct a reference experiment,
one in which we have an intuitive understanding of the strength of evidence and
can calculate the likelihood; then we can compare the calculated likelihood to the
known strength of evidence.
   For our reference experiment imagine we have two coins. One is a fair coin, the
other is two-headed. We randomly choose a coin. Then we conduct a sequence of
coin tosses to learn which coin was selected. Suppose the tosses yield n consecutive
Heads. P[n HeadsIfair] = 2-n; P[n HeadsItwo-headed] = 1. So the likelihood
ratio is 24. That's our reference experiment. A likelihood ratio around 8 is like
tossing three consecutive Heads; a likelihood ratio around 1000 is like tossing ten
consecutive Heads.
   In Example 2. argmax f(0)   .055 and £(.025)/t(.055)  .13  1/8, so the
evidence against 0 = .025 as opposed to 0 = .055 is about as strong as the evidence
against the fair coin when three consecutive Heads are tossed. The same can be
said for the evidence against 0 = .1. Similarly, £(.011)/f(.055)  £ (.15)/f(.055)
.001, so the evidence against 0 = .011 or 0 = .15 is about as strong as 10 consecutive
Heads. A fair statement of the evidence is that 0's in the interval from about 0
.025 to about 0 = .1 explain the data not much worse than the maximum of 0
.055. But 0's below about .01 or larger than about .15 explain the data not nearly
as well as 0's around .055.


2.3.2 Likelihoods from the Central Limit Theorem

Sometimes it is not possible to compute the likelihood function exactly, either be-
cause it is too difficult or because we don't know what it is. But we can often
compute an approximate likelihood function using the Central Limit Theorem. The
following example is the simplest case, but typifies the more exotic cases we will
see later on.
   Suppose we sample X1, X2, ... , X, from a probability density f. We don't know
what f is; we don't even know what parametric family it belongs to. Assume that
f has a mean yu and an SD a (I.e, assume that the mean and variance are finite.)
and that we would like to learn about p. If (,u, a) are the only unknown parameters
then the likelihood function is (pu, a) = f(Data ly, )  H ]f(Xi l ya). But we
don't know f and can't calculate 0(p, c).
   However, we can reason as follows.


﻿


2.3. LIKELIHOOD


140


   1. Most of the information in the data for learning about yu is containined in X.
      That is, X tells us a lot about y and the deviations og = Xi - X, i = 1, ..., n
      tell us very little.

   2. If n is large then the Central Limit Theorem tells us

                          X     N(p, o/ /n), approximately

   3. We can estimate o2 from the data by

                                  2     = s2Z(2/n

   4. And therefore the function

                            M (/1) Xc exp (-- ((2.4)
                                            2   &/Vw/6

      is a good approximation to the likelihood function.

In the preceding reasoning we separated the data into two parts - X and {o};
used {S} to estimate o; and used X to find a likelihood function for p. We cannot,
in general, justify such a separation mathematically. We justified it if and when our
main interest is in yu and we believe {5} tell us little about P.
   Function2. is called a marginal likelihood function. I sou anRyl[1995]
show that marginal likelihoods are good approximations to true likelihoods and
can be used to make accurate inferences, at least in cases where the Central Limit
Theorem applies. We shall use marginal likelihoods throughout this book.

Example 2.9 (Slater School, continued)
We redo the Slater School example (Example 2,8) to illustrate the marginal likelihood
and see how it compares to the exact likelihood. In that example the Xi's were l's
and O's indicating which teachers got cancer. There were 8 l's out of 145 teachers, so
X = 8/145     .055. Also, 2 = (8(137/145)2 + 137(8/145)2))/145  .052, so 8  .23.
We get

                                     (1   (p -.055)2)(2.5)
                                        2 (.23/9/45)
Figure.21 shows the marginal and exact likelihood functions. The marginal likelihood
is a reasonably good approximation to the exact likelihood.


﻿


2.3. LIKELIHOOD


141


0
0
0


co
0

Qz
0
0


  I
0.00


  I         I         I
0.05      0.10      0.15


  0
0.20


Figure 2.21: Marginal and exact likelihoods for Slater School


Figure


was produced by the following snippet.


theta <- seq ( 0, .2, length=100 )
lik <- dbeta ( theta, 9, 138 )
lik.mar <- dnorm ( theta, 8/145,
  sqrt((8*(137/145)^2 + 137*(8/145)^2)/145)/sqrt(145) )
lik <- lik/max(lik)
lik.mar <- lik.mar/max(lik.mar)
matplot ( theta, cbind(lik,lik.mar), xlab=expression(mu),
           ylab="likelihood", type="l", lty=c(2,1), col=1 )
legend ( .1, 1, c("marginal", "exact"), lty=c(1,2) )


Example 2.10 (CEO salary)
How much are corporate CEO's paid? Forbes magazine collected data in 1993 that can
begin to answer this question. The data are available on-line at DASL, the Data and
Story Library, a collection of data sets for free use by statistics students. DASL says
     "Forbes magazine published data on the best small firms in 1993. These were
     firms with annual sales of more than five and less than $350 million. Firms


﻿


2.3. LIKELIHOOD


142


were ranked by five-year average return on investment. The data extracted
are the age and annual salary of the chief executive officer for the first 60
ranked firms. In question are the distribution patterns for the ages and the
salaries."


You can download the data from
http: //lib. stat .cmu. edu/DASL/Datafiles/ceodat .html.
like this:


The first few lines look


AGE SAL


53
43
33


145
621
262


In this example we treat the Forbes data as a random sample of size n = 60 of CEO
salaries for small firms. We're interested in the average salary p. Our approach is to
calculate the marginal likelihood function fm( ).
   Figure     (a) shows a stripchart of the data. Evidently, most salaries are in the range
of $200 to $400 thousand dollars, but with a long right-hand tail. Because the right-
hand tail is so much larger than the left, the data are not even approximately Normally
distributed. But the Central Limit Theorem tells us that X is approximately Normally
distributed, so the method of marginal likelihood applies. Figure  (b) displays the
marginal likelihood function E(r(p).


Figure


was produced by the following snippet.


ceo <- read.table ("data/ceo_salaries/data",header=T)

par ( mfrow=c(2,1) )
stripchart ( ceo$SAL, "jitter", pch=1, main="(a)",
               xlab="Salary (thousands of dollars)" )


m <-
s <-
x <-
y <-
y <-
plot


mean ( ceo$SAL, na.rm=T )
sqrt ( var(ceo$SAL,na.rm=T) / (length(ceo$SAL)-1) )
seq ( 340, 470, length=40 )
dnorm ( x, m, s )
y / max(y)
( x, y, type="l", xlab="mean salary",


﻿


2.3. LIKELIHOOD                                              143


                                 (a)


                 o60                     ~0 @ 0


                 I     I     I     I     II
                 o    200   400   600   800  1000

                       Salary (thousands of dollars)


                                 (b)

             0

           o _D


           0


                  I    I    I    I    I    II
                340  360  380  400  420  440  460

                             mean salary


Figure 2.22: Marginal likelihood for mean CEO salary


﻿


2.3. LIKELIHOOD


144


         ylab="likelihood", main="(b)" )


" In s <- sqrt ... the (length(ceo$SAL) -1) is there to account for one missing
  data point.

" y <- y / max(y) doesn't accomplish much and could be omitted.


* The data strongly support the conclusion
  $350 and $450 thousand dollars. That's
  on display in Figure   (a). Why?


that the mean salary is between about
much smaller than the range of salaries


* Is inference about the
  better?


mean salary useful in this data set? If not, what would be


2.3.3 Likelihoods for several parameters


What if there are two unknown parameters? Then the likelihood is a function of
two variables. For example, if the X's are a sample from N(p, o) then the likelihood
is a function of (p, cr). The next example illustrates the point.

Example 2.11 (FACE, continued)
This example continues Example     about a FACE experiment in Duke Forest. There
were six rings; three were treated with excess C02. The dominant canopy tree in the
FACE experiment is pinus taeda, or loblolly pine. Figure  a is a histogram of the final
basal area of each loblolly pine in 1998 divided by its initial basal area in 1996. It shows
that the trees in Ring 1 grew an average of about 30% but with variability that ranged
from close to 0% on the low end to around 50% or 60% on the high end. Because the
data are clustered around a central value and fall off roughly equally on both sides they
can be well approximated by a Normal distribution. But with what mean and SD? What
values of (p, or) might reasonably produce the histogram in Figure  a?


﻿


2.3. LIKELIHOOD


145


   The likelihood function is
                                     n
                           (p, (7) =   f (Xi | p, Or)
                                     1
                                          1     __ X - )
                                     1 2wu(


Figure     b is a contour plot of the likelihood function. The dot in the center, where
(p, o)   (1.27, .098), is where the likelihood function is highest. That is the value of
(p, o) that best explains the data. The next contour line is drawn where the likelihood
is about 1/4 of its maximum; then the next is at 1/16 the maximum, the next at 1/64,
and the last at 1/256 of the maximum. They show values of (p, o) that explain the data
less and less well.
    Ecologists are primarily interested in y because they want to compare the p's from
different rings to see whether the excess CO2 has affected the average growth rate.
(They're also interested in the o's, but that's a secondary concern.) But   is a function
of both p and o, so it's not immediately obvious that the data tell us anything about p
by itself. To investigate further, Figure c shows slices through the likelihood function
at o = .09, .10, and.11, the locations of the dashed lines in Figure  b. The three
curves are almost identical. Therefore, the relative support for different values of p does
not depend very much on the value of o, and therefore we are justified in interpreting
any of the curves in Figure   c as a "likelihood function" for y alone, showing how well
different values of p explain the data. In this case, it looks as though values of p in the
interval (1.25, 1.28) explain the data much better than values outside that interval.

    Figure     was produced with

    par ( mfrow=c(2,2) ) # a 2 by 2 array of plots
    x <- ba98$BA.final / ba96$BA.init
    x <- x[!is.na(x)]
    hist ( x, prob=T, xlab="basal area ratio",
             ylab="", main="(a)" )
     mu <- seq ( 1.2, 1.35, length=50 )
     sd <- seq ( .08, .12, length=50 )
     lik <- matrix ( NA, 50, 50 )
     for ( i in 1:50 )
     for ( j in 1:50 )


﻿


2.3. LIKELIHOOD


146


(a)


(b)


L10-


N~
6

6


N H-


r, -, - I- I- I


  0


tD
  6

  0
  0

  0
  6


I     I    II
1.20      1.30


0-


1.0  1.2 1.4  1.6

  basal area ratio


       (c)


0
O


(0


0


I    I    I    I


1.20


1.30


Figure 2.23:   FACE Experiment, Ring     1.    (a):  (1998 final basal area) ±
(1996 initial basal area); (b): contours of the likelihood function. (c): slices of
the likelihood function.


﻿


2.3. LIKELIHOOD                                               147


﻿


2.3. LIKELIHOOD


148


   * lik.09, lik.10, and lik.11 pick out three columns from the lik matrix. They
      are the three columns for the values of o closest to o = .09, .10, .11. Each column
      is rescaled so its maximum is 1.


Example 2.12 (Quiz Scores, continued)
This example continues Example     about scores in Statistics 103. Figure  shows
that most students scored between about 5 and 10, while 4 students were well below
the rest of the class. In fact, those students did not show up for every quiz so their
averages were quite low. But the remaining students' scores were clustered together in
a way that can be adequately described by a Normal distribution. What do the data say
about (p, o)?
   Figure     shows the likelihood function. The data support values of p from about
7.0 to about 7.6 and values of o from about 0.8 to about 1.2. A good description of the
data is that most of it follows a Normal distribution with (p, o) in the indicated intervals,
except for 4 students who had low scores not fitting the general pattern. Do you think
the instructor should use this analysis to assign letter grades and, if so, how?

   Figure     was produced by

     x <- sort(scores.ave)[5:58]
     mu <- seq ( 6.8, 7.7, length=60 )
     sig <- seq ( .7, 1.3, length=60 )
     lik <- matrix ( NA, 60, 60 )
     for ( i in 1:60 )
     for ( j in 1:60 )
       lik[i,j] <- prod ( dnorm ( x, mu[i], sig[j] ) )
     lik <- lik/max(lik)
     contour ( mu, sig, lik, xlab=expression(mu),
                ylab=expression(sigma) )


   Examples       and      have likelihood contours that are roughly circular, indi-
cating that the likelihood function for one parameter does not depend very strongly
on the value of the other parameter, and so we can get a fairly clear picture of what
the data say about one parameter in isolation. But in other data sets two parame-
ters may be inextricably entwined. Example    illustrates the problem.


﻿


2.3. LIKELIHOOD                                                        149


                                         0.1
                                         0.2

                                         0.4
                                         0.6


                   6.8      7.0     7.2     7.4      7.6


               Figure 2.24: Likelihood function for Quiz Scores


﻿


2.3. LIKELIHOOD


150


Example 2.13 (Seedlings, continued)
Examples 1.4, 1A, and 1-7 introduced an observational study by ecologists to learn
about tree seedling emergence and survival. Some species, Red Maple or acer rubrum for
example, get a mark called a bud scale scar when they lose their leaves over winter. By
looking for bud scale scars ecologists can usually tell whether an acer rubrum seedling is
New (in its first summer), or Old (already survived through at least one winter). When
they make their annual observations they record the numbers of New and Old acer rubrum
seedlings in each quadrat. Every Old seedling in year t must have been either a New or
an Old seedling in year t - 1.
   Table 2.1 shows the 1992-1993 data for quadrat 6. Clearly the data are inconsistent;
where did the Old seedling come from in 1993? When confronted with this paradox the
ecologists explained that some New seedlings emerge from the ground after the date of
the Fall census but before the winter. Thus they are not counted in the census their first
year, but develop a bud scale scar and are counted as Old seedlings in their second year.
One such seedling must have emerged in 1992, accounting for the Old seedling in 1993.


Year No. of New seedlings No. of Old seedlings
1992              0                      0
1993              0                      1


  Table 2.1: Numbers of New and Old seedlings in quadrat 6 in 1992 and 1993.


    How shall we model the data? Let Nf' be the true number of New seedlings in year
i, i.e., including those that emerge after the census; and let N°r be the observed number
of seedlings in year i, i.e., those that are counted in the census. As in Example 1A4we
model Ni~ Poi(A). Furthermore, each seedling has some chance Oy of being found in
the census. (Nominally Oy is the proportion of seedlings that emerge before the census,
but in fact it may also include a component accounting for the failure of ecologists to
find seedlings that have already emerged.) Treating the seedlings as independent and all
having the same Oy leads to the model N°~ Bin(NT,, Os). The data are the NP's; the
NT's are not observed. What do the data tell us about the two parameters (A, Of)?
    Ignore the Old seedlings for now and just look at 1992 data N°992 = 0. Dropping the


﻿


2.3. LIKELIHOOD


151


subscript 1992, the likelihood function is

                  (A, Bf) = P(N0 = 0 A, Of]
                            00
                               P(N0 = 0, NT = n A, Of]
                           n~0

                           n0

                           oo    _n                                             (2.6)
                                 ii! (1 - Of)"
                            n 0
                            00 e-A(l-f) (A(1 - Of))n

                            _eOf n!
                            n=0


    Figure     a plots logio (A, O ). (We plotted logil o instead of E for variety.) The
contour lines are not circular. To see what that means, focus on the curve logio  (A, O )
-1 which runs from about (A, Of) = (2.5, 1) to about (A, Of) = (6, .4). Points (A, Of)
along that curve explain the datum No = 0 about 1/10 as well as the m.l.e.(The m.l.e.
is any pair where either A = 0 or O = 0.) Points below and to the left of that curve
explain the datum better than 1/10 of the maximum.
    The main parameter of ecological interest is A, the rate at which New seedlings tend
to arrive. The figure shows that values of A as large as 6 can have reasonably large
likelihoods and hence explain the data reasonably well, at least if we believe that Of
might be as small as .4. To investigate further, Figure  b is similar to  a but
includes values of A as large as 1000. It shows that even values of A as large as 1000 can
have reasonably large likelihoods if they're accompanied by sufficiently small values of Of.
In fact, arbitrarily large values of A coupled with sufficiently small values of Oy can have
arbitrarily large likelihoods. So from the data alone, there is no way to rule out extremely
large values of A. Of course extremely large values of A don't make ecological sense, both
in their own right and because extremely small values of Of are also not sensible. Scientific
background information of this type is incorporated into statistical analysis often through
Bayesian inference (Section   ). But the point here is that A and Oy are linked, and the
data alone does not tell us much about either parameter individually.

    Figure     (a) was produced with the following snippet.


lam <- seq ( 0, 6, by=.1 )


﻿


2.3. LIKELIHOOD


152


00
0


0ql

0

0


                   (a)


                                      -1

                                  -0.6 -

                                  -0.2
                                      0
I      I      I     I      I     I

0      1     2      3     4      5     6


(b)


00
0


0ql

0

0


0      200     400    600     800     1000


Figure 2.25: Log of the likelihood function for (A, Of) in Example


﻿


2.3. LIKELIHOOD


153


    th <- seq ( 0, 1, by=.02 )
    lik <- matrix ( NA, length(lam), length(th) )
    for ( i in seq(along=lam) )
    for ( j in seq(along=th) )
      lik[i,j] <- exp ( -lam[i]*th[j] )
    contour ( lam, th, log10(lik),
               levels=c(0,-.2,-.6,-1,-1.5,-2),
               xlab=expression(lambda),
               ylab=expression(theta[f]), main="(a)" )


   e log10 computes the base 10 logarithm.

Figure    (b) was produced with the following snippet.

    lam2 <- seq ( 0, 1000, by=1 )
    lik2 <- matrix ( NA, length(lam2), length(th) )
    for ( i in seq(along=lam2) )
    for ( j in seq(along=th) )
      lik2[i,j] <- exp ( -lam2[i]*th[j] )
    contour ( lam2, th, log10(lik2), levels=c(0,-1,-2,-3),
               xlab=expression(lambda),
               ylab=expression(theta[f]), main="(b)" )


   We have now seen two examples (     and     ) in which likelihood contours
are roughly circular and one ( ) in which they're not. By far the most common
and important case is similar to Example  because it applies when the Central
Limit Theorem applies. That is, there are many instances in which we are trying
to make an inference about a parameter 0 and can invoke the Central Limit Theo-
rem saying that for some statistic t, t ~ N(8, o-t) approximately and where we can
estimate o-t. In these cases we can, if necessary, ignore any other parameters in the
problem and make an inference about 0 based on %(0).


﻿


2.4. ESTIMATION


154


2.4 Estimation

Sometimes the purpose of a statistical analysis is to compute a single best guess
at a parameter 0. An informed guess at the value of 0 is called an estimate and
denoted 0. One way to estimate 0 is to find 0 - argmax£(0), the value of 0 for
which £(0) is largest and hence the value of 0 that best explains the data. That's
the subject of Section 2.4,1.

2.4.1 The Maximum Likelihood Estimate
In many statistics problems there is a unique value of 0 that maximizes £(0). This
value is called the maximum likelihood estimate, or m.l.e. of 0 and denoted 0.

                      0   argmaxe p(y |0) = argmaxe f(0).

   For instance, in Example28 and Figure 2420 was the rate of cancer occurence
and we calculated £(0) based on y = 8 cancers in 145 people. Figure22 suggests
that the m.l.e. is about 0 .05.
   When £(0) is differentiable, the m.l.e. can be found by differentiating and equat-
ing to zero. In Example2. the likelihood was £(0) c 08(1 - 8)137. The derivative
is
                     df(O)
                     dfOc 87(1 - 0)137 - 13708(1 - 0)136
                     de                                                 (2.7)
                            07(1 - 0)136 [8(1 - 0) - 1370]

Equating to 0 yields

                             0 =8(1 - 0) - 1370
                                   145= 8
                               0 = 8/145   .055

So 0    .055 is the m.l.e. Of course if the mode is flat, there are multiple modes,
the maximum occurs at an endpoint, or £ is not differentiable, then more care is
needed.
   Equation 2,7 shows more generally the m.l.e. for Binomial data. Simply replace
137 with n - y and 8 with y to get 0 = y/n. In the Exercises you will be asked to
find the m.l.e. for data from other types of distributions.
   There is a trick that is often useful for finding m.l.e.'s. Because log is a monotone
function, argmax f(0) =argmax log(if(0)), so the m.l.e. can be found by maximizing


﻿


2.4. ESTIMATION


155


log L. For i.i.d. data, £(0) =_ 0 p(y, |), log f(0) = E log p(xi |0), and it is often easier
to differentiate the sum than the product. For the Slater example the math would
look like this:

                       log t(8) = 8 logO8 + 137 log(1 - 0)
                            dlogYf(0)   8    137
                               dO       0   1-0
                                   137    8
                                   1-0    0
                                1370 = 8 - 80
                                     0_8
                                       145

   Equation 2.7 shows that if y1, .. . , y  Bern(0) then the m.l.e. of 0 is

                         0 = n-13 y =sample mean

The Exercises ask you to show the following.

  1. If y1, .. . , yn N(,o-a) then the m.l.e. of pu is

                           fy=n1      y = sample mean

  2. If y1, ... , y~ Poi(A) then the m.l.e. of A is

                           A= n-1 3 y= sample mean

  3. If y1, ... , y~ Exp(A) then the m.l.e. of A is

                           A= n-1 3 y2= sample mean


2.4.2 Accuracy of Estimation

Finding the m.l.e. is not enough. Statisticians want to quantify the accuracy of 0
as an estimate of 0. In other words, we want to know what other values of 0, in
addition to 0, have reasonably high likelihood (provide a reasonably good expla-
nation of the data). And what does "reasonable" mean? Section 2:4.2addresses
this question.


﻿


2.4. ESTIMATION


156


   As we saw from the reference experiment in section 2.3, the evidence is not
very strong against any value of 0 such that £(0) > £(8)/10. So when considering
estimation accuracy it is useful to think about sets such as

                            LS.1     0      >.1
                                        £(e)

LS stands for likelihood set. More generally, for any a E (0, 1) we define the likeli-
hood set of level a to be
                                       .f(0)
                             LSa   {0 :     >

   LS, is the set of 0's that explain the data reasonbly well, and therefore the set of
0's best supported by the data, where the quantification of "reasonable" and "best"
are determined by a. The notion is only approximate and meant as a heuristic
reference; in reality there is no strict cutoff between reasonable and unreasonable
values of 0. Also, there is no uniquely best value of a. We frequently use a  .1 for
convenience and custom.
   In many problems the likelihood function £(0) is continuous and unimodal, i.e.
strictly decreasing away from 0, and goes to 0 as 0 -  too, as in Figures 2,19
and 2,20. In these cases, 0 0 -0(0)  £(0). So values of 0 close to 0 explain the
data almost as well as and are about as plausible as 0 and LS, is an interval

                                 LS= [Oi,Oa]

where 01 and 0, are the lower and upper endpoints, respectively, of the interval.
   In Example 2,9 (Slater School) 0 = 8/145, so we can find £(8) on a calculator,
or by using R's built-in function
  dbinom ( 8, 145, 8/145 )
which yields about .144. Then 01 and B, can be found by trial and error. Since
dbinom(8,145, .023)     .013 and dbinom(8,145, .105)  .015, we conclude that
LS.1   [.023, .105] is a rough likelihood interval for 0. Review Figure 2,20 to see
whether this interval makes sense.
   The data in Example2. could pin down 0 to an interval of width about .08.
In general, an experiment will pin down 0 to an extent determined by the amount
of information in the data. As data accumulates so does information and the abil-
ity to determine 0. Typically the likelihood function becomes increasingly more
peaked as n - 00, leading to increasingly accurate inference for 0. We saw that in
Figures L6and 21. Example 2 4illustrates the point further.


﻿


2.4. ESTIMATION


157


Example 2.14 (Craps, continued)
Example 1.0 introduced a computer simulation to learn the probability 0 of winning the
game of craps. In this example we use that simulation to illustrate the effect of gathering
ever increasing amounts of data. We'll start by running the simulation just a few times,
and examining the likelihood function £(O). Then we'll add more and more simulations
and see what happens to £(O).
   The result is in Figure 2,26. The flattest curve is for 3 simulations, and the curves
become increasingly peaked for 9, 27, and 81 simulations. After only 3 simulations
LS.1    [.15, .95] is quite wide, reflecting the small amount of information. But after
9 simulations £(8) has sharpened so that LS.1  [.05, .55] is much smaller. After 27
simulations LS.1 has shrunk further to about [.25, .7], and after 81 it has shrunk even
further to about [.38, .61].


-0
0
0


O

Co

CD
0

0
NM
0
0
0


I         I        I       I        I        I'
0.0      0.2     0.4      0.6      0.8     1.0


                                             0


Figure 2.26: Likelihood function for the probability 6 of winning a game of craps.
The four curves are for 3, 9, 27, and 81 simulations.


﻿


2.4. ESTIMATION


158


   Figure    was produced with the following snippet.

   n.sim <- c ( 3, 9, 27, 81 )
   th <- seq ( 0, 1, length=200 )
   lik <- matrix ( NA, 200, length(n.sim) )

   for ( i in seq(along=n.sim) ) {
       wins <- 0
       for ( j in 1:n.sim[i] )
         wins <- wins + sim.craps()
       lik[,i] <- dbinom ( wins, n.sim[i], th )
       lik [, i] <- lik [, i] / max(lik [,i] )
    }

    matplot ( th, lik, type="1", col=1, lty=1:4,
               xlab=expression(theta), ylab="likelihood" )


   In Figure     the likelihood function looks increasingly like a Normal density
as the number of simulations increases. That is no accident; it is the typical behav-
ior in many statistics problems. Section  explains the reason.


2.4.3   The sampling distribution of an estimator

The estimator 0 is a function of the data y1, ... , yn. If we repeat the experiment and
get new data we also get a new 0. So 0 is a random variable and has a distribution
called the sampling distribution of 0 and denoted Fe. We studied Fe in Example
where we used simulation to estimate the probability 0 of winning a game of craps.
For each sample size of n = 50, 200, 1000 we did 1000 simulations. Each simulation
yielded a different 0. Those 1000 0's are a random sample of size 1000 from Fe.
Figure     showed boxplots of the simulations.
   Now we examine the sampling distribution of 0 in more detail. There are at
least two reasons for doing so. First, Fe is another way, in addition to likelihood
sets, of assessing the accuracy of 0 as an estimator of 0. If Fe is tightly concentrated
around 0 then 0 is highly accurate. Conversely, if Fe is highly dispersed, or not
centered around 0, then 0 is an inaccurate estimator. Second, we may want to
compare two possible estimators. I.e., if there are two potential estimators 01 and


﻿


2.4. ESTIMATION


159


02, we can compare Fe, and Fee and use the estimator whose sampling distribution
is most tightly concentrated around 0.
   To illustrate, let's suppose we sample y1, ... , y., from distribution Fy, and want
to estimate 0 -- E[Y]. We consider two potential estimators, the sample mean
01 = (1/n) 1 y2 and the sample median 02. To see which estimator is better we
do a simulation, as shown in the following snippet. The simulation is done at four
different sample sizes, n = 4, 16, 64, 256, to see whether sample size matters. Here
we'll let Fy be N(0, 1). But the choice between 01 and 02 might depend on what Fy
is, so a more thorough investigation would consider other choices of Fr.
   We do 1000 simulations at each sample size. Figure   shows the result. The
figure suggests that the sampling distributions of both 01 and 02 are centered at the
true value of 0. The distribution of 01 is slightly less variable than that of 02, but
not enough to make much practical difference.

   Figure     was produced by the following snippet.

   sampsize <- c ( 4, 16, 64, 256 )
   n.sim <- 1000

   par ( mfrow=c(2,2) )
   for ( i in seq(along=sampsize) ) {
   y <- matrix ( rnorm ( n. sim*sampsize [il , 0, 1 ),
                   nrow=sampsize[i], ncol=n.sim )
    that.1 <- apply ( y, 2, mean )
    that.2 <- apply ( y, 2, median )
    boxplot ( that.1, that.2, names=c("mean", "median"),
               main=paste (" (",letters [i] , ") ", sep="") )
    abline ( h=0, lty=2 )
  }


  For us, comparing 01 to 02 is only a secondary point of the simulation. The main
point is four-fold.

   1. An estimator 0 is a random variable and has a distribution.

   2. Fe is a guide to estimation accuracy.

   3. Statisticians study conditions under which one estimator is better than an-
     other.


﻿


2.4. ESTIMATION


160


(a)


(b)


0-


N~


  8


  I I
mean median


    (c)


         0


         0
  I      Ia
mean median


0


0


  I      I


mean median


    (d)
  I      I

  0    m
         0

mean median


0 _
0


0-


0  

0
O~
0-


Figure 2.27: Sampling distribution of 01, the sample mean and B2, the sample
median. Four different sample sizes. (a): n=4; (b): n=16; (c): n=64; (d):
n=256


﻿


2.4. ESTIMATION


161


  4. Simulation is useful.

  When the m.l.e. is the sample mean, as it is when Fy is a Bernoulli, Normal,
Poisson or Exponential distribution, the Central Limit Theorem tells us that in large
samples, 0 is approximately Normally distributed. Therefore, in these cases, its
distribution can be well described by its mean and SD. Approximately,

                               0 ~ N(M, or).

where
                                   s = Pr
                                       cy                             (2.8)


both of which can be easily estimated from the sample. So we can use the sample
to compute a good approximation to the sampling distribution of the m.l.e.
   To see that more clearly, let's make 1000 simulations of the m.l.e. in n
5, 10, 25, 100 Bernoulli trials with p = .1. We'll make histograms of those simu-
lations and overlay them with kernel density estimates and Normal densities. The
parameters of the Normal densities will be estimated from the simulations. Results
are shown in Figure

   Figure     was produced by the following snippet.

   sampsize <- c ( 5, 10, 25, 100 )
   n.sim <- 1000
   p.true <- .1
   par(mfrow=c(2,2))
   for ( i in seq(along=sampsize) ) {
   # n.sim Bernoulli samples of sampsize[il
   y <- matrix ( rbinom ( n.sim*sampsize[il, 1, p.true ),
                   nrow=n.sim, ncol=sampsize[il )

    # for each sample, compute the mean
    t.hat <- apply ( y, 1, mean )

    # histogram of theta hat
    hist ( t.hat, prob=T,
            xlim=c(0,.6), xlab=expression(hat(theta)),
            ylim=c(0,14), ylab="density",


﻿


2.4. ESTIMATION


162


(a)


(b)


(I)
C)


(0-


0-


1   '


1 /
   A


(I)
a)


(0 -


0 -


1F


11111


0.0  0.2 0.4


0.6


0.0  0.2 0.4


0.6


0


(c)


0


(d)


N T~


(I)
a)


00

(0


0


   11
   Ill(


   1 1

 1. I "


0.0  0.2 0.4  0.6


(I)
a)


0.0  0.2 0.4  0.6


A
0


A
0


Figure 2.28: Histograms of 0, the sample mean, for samples from Bin(m, .1). Dashed

line: kernel density estimate. Dotted line: Normal approximation. (a): n=4; (b):

n=16; (c): n=64; (d): n=256


﻿


2.4. ESTIMATION


163


            main=paste ( "(", letters [i], ")", sep="" )
    )

    # kernel density estimate of theta hat
    lines ( density ( t.hat ), lty=2 )

    # Normal approximation to density of theta hat,
    # calculated from the first sample
    m <- mean(y[1,])
    sd <- sd(y[1,])/sqrt(sampsize[i])
    t <- seq ( min(t.hat), max(t.hat), length=40 )
    lines ( t, dnorm ( t, m, sd ), lty=3 )
  }


  Notice that the Normal approximation is not very good for small n. That's be-
cause the underlying distribution Fy is highly skewed, nothing at all like a Normal
distribution. In fact, R was unable to compute the Normal approximation for n = 5.
But for large n, the Normal approximation is quite good. That's the Central Limit
Theorem kicking in. For any n, we can use the sample to estimate the parameters
in Equation . For small n, those parameters don't help us much. But for n = 256,
they tell us a lot about the accuracy of 0, and the Normal approximation computed
from the first sample is a good match to the sampling distribution of 0.
   The SD of an estimator is given a special name. It's called the standard error or
SE of the estimator because it measures the typical size of estimation errors 0 - 0l.
When 0 ~ N(p6, og), approximately, then o is the SE. For any Normal distribution,
about 95% of the mass is within ±2 standard deviations of the mean. Therefore,

                           Pr[ - 0 < 2(g] ~ .95

In other words, estimates are accurate to within about two standard errors about
95% of the time, at least when Normal theory applies.
   We have now seen two ways of assessing estimation accuracy - through {(0)
and through Fe. Often these two apparently different approaches almost coincide.
That happens under the following conditions.

   1. When 0 ~ N(0, og), and o ~ o//n, an approximation often justified by the
     Central Limit Theorem, then we can estimate 0 to within about ±2oa, around


﻿


2.5. BAYESIAN INFERENCE


164


     95% of the time. So the interval (0 - 2a, 0 + 2a) is a reasonable estimation
     interval.
  2. When most of the information in the data come from the sample mean, and
     in other cases when a marginal likelihood argument applies, then £(O)
     exp -              (Equation 24) and LS.1    (0- 2u6,y0+2u6). So the two
     intervals are about the same.


2.5 Bayesian Inference

The essence of Bayesian inference is using probability distributions to describe our
state of knowledge of some parameter of interest, 0. We construct p(O), either a
pmf or pdf, to reflect our knowledge by making p(O) large for those values of 0
that seem most likely, and p(O) small for those values of 0 that seem least likely,
according to our state of knowledge. Although p(O) is a probability distribution, it
doesn't necessarily mean that 0 is a random variable. Rather, p(O) encodes our state
of knowledge. And different people can have different states of knowledge, hence
different probability distributions. For example, suppose you toss a fair coin, look
at it, but don't show it to me. The outcome is not random; it has already occured
and you know what it is. But for me, each outcome is equally likely. I would encode
my state of knowledge by assigning P(H) = P(T) = 1/2. You would encode your
state of knowledge by assigning either P(H) = 1 or P(T) = 1 according to whether
the coin was Heads or Tails. After I see the coin I would update my probabilities to
be the same as yours.
   For another common example, consider horse racing. When a bettor places a
bet at 10 to 1, she is paying $1 for a ticket that will pay $10 if the horse wins. Her
expected payoff for that bet is -$1 + P[horse wins] x $10. For that to be a good
deal she must think that P[horse wins] > .1. Of course other bettors may disagree.
   Here are some other examples in which probability distributions must be as-
sessed.
   " In deciding whether to fund Head Start, legislators must assess whether the
     program is likely to be beneficial and, if so, the degree of benefit.
   " When investing in the stock market, investors must assess the future proba-
     bility distributions of stocks they may buy.
   * When making business decisions, firms must assess the future probability
     distributions of outcomes.


﻿


2.5. BAYESIAN INFERENCE


165


   " Weather forecasters assess the probability of rain.


   " Public policy makers must assess whether the observed increase in average
      global temperature is anthropogenic and, if so, to what extent.


   " Doctors and patients must assess and compare the distribution of outcomes
      under several alternative treatments.


   " At the Slater School, Example .8, teachers and administrators must as-
      sess their probability distribution for 0, the chance that a randomly selected
      teacher develops invasive cancer.


   Information of many types goes into assessing probability distributions. But it is
often useful to divide the information into two types: general background knowl-
edge and information specific to the situation at hand. How do those two types
of information combine to form an overall distribution for 0? Often we begin by
summarizing just the background information as p(O), the marginal distribution of
0. The specific information at hand is data which we can model as p(yi,... ,  0y|0),
the conditional distribution of y,..., yn given 0. Next, the marginal and condi-
tional densities are combined to give the joint distribution p(yi, ..., ya, 0). Finally,
the joint distribution yields p(O|yi,..., yn) the conditional distribution of 0 given
Y1, ..., yn. And p(0|y1,..., yn) represents our state of knowledge accounting for
both the background information and the data specific to the problem at hand.
p(O) is called the prior distribution and p(| y,... , yn) is the posterior distribution.
   A common application is in medical screening exams. Consider a patient being
screened for a rare disease, one that affects 1 in 1000 people, say. The disease
rate in the population is background information; the patient's response on the
screening exam is data specific to this particular patient. Define an indicator vari-
able D by D = 1 if the patient has the disease and D = 0 if not. Define a second
random variable T by T = 1 if the test result is positive and T = 0 if the test
result is negative. And suppose the test that is 95% accurate in the sense that
P[T = 1 D = 1] = P[T = 0|D = 0]       .95. Finally, what is the chance that
the patient has the disease given that the test is positve? In other words, what is
P[D= 1| T =1]?
   We have the marginal distribution of D and the conditional distribution of T
given D. The procedure is to find the joint distribution of (D, T), then the condi-


﻿


2.5. BAYESIAN INFERENCE


166


tional distribution of D given T. The math is

   P[D=1|T=1]=P[D= 1 and T           1]
                          P[T = 1]
                               P[D    1 and T =1]
                     P[T=1andD=1]+P[T=1andD=0]
                                  P[D=1]P[T=1|D =1]                      (2.9)
                     P[D = 1] P[T   1| D = 1] +P[D =0] P[T = 1| D =0]
                           (.001) (.95)
                     (.001)(.95) + (.999)(.05)
                         .00095
                     .00095 + .04995 ~ 019.
That is, a patient who tests positive has only about a 2% chance of having the
disease, even though the test is 95% accurate.
   Many people find this a surprising result and suspect a mathematical trick. But a
quick heuristic check says that out of 1000 people we expect 1 to have the disease,
and that person to test positive; we expect 999 people not to have the disease
and 5% of those, or about 50, to test positive; so among the 51 people who test
postive, only 1, or a little less than 2%, has the disease. The math is correct. This
is an example where most people's intuition is at fault and careful attention to
mathematics is required in order not to be led astray.
   What is the likelihood function in this example? There are two possible values
of the parameter, hence only two points in the domain of the likelihood function,
D = 0 and D = 1. So the likelihood function is
                           f(0) = .05; f(1) = .95
Here's another way to look at the medical screening problem, one that highlights
the multiplicative nature of likelihood.

               P[D=0 T=1]P[D=0andT=1]
                                 P[D =1] P[T = 1|D =1]
                                 P[D=0]P[T=1| D=0]
                                 (P[D=1]) (P[T=1D=1])
                                 P[D =0]fP[T = l l D1=(0]f

                                 1 () (:5)
                                 .019


﻿


2.5. BAYESIAN INFERENCE


167


The LHS of this equation is the posterior odds of having the disease. The penul-
timate line shows that the posterior odds is the product of the prior odds and the
likelihood ratio. Specifically, to calculate the posterior, we need only the likelihood
ratio, not the absolute value of the likelihood function. And likelihood ratios are
the means by which prior odds get transformed into posterior odds.
   Let's look more carefully at the mathematics in the case where the distributions
have densities. Let y denote the data, even though in practice it might be y1,.... ,yn.


                                      p(Y)
                                      P(O,Y)(
                                      fp(,y)d
                                      p(O)p(y 0)
                                      fp(O)p(y )dO

Equation 2.10 is the same as Equation 2, only in more general terms. Since we
are treating the data as given and p(O|y) as a function of 0, we are justified in
writing
                                        p(O)e(O)
                               fp(O 1?(O d
                                      f p(0)/(0) de
or
                                    (0)-p(O)e(O)
                                            C
where c = f p()(O) dO is a constant that does not depend on 0. (An integral
with respect to 0 does not depend on 0; after integration it does not contain 6.)
The effect of the constant c is to rescale the function in the numerator so that it
integrates to 1. I.e., f p(O|y) dO = 1. And since c plays this role, the likelihood
function can absorb an arbitrary constant which will ultimately be compensated
for by c. One often sees the expression

                               p(O | y) cp(O)C(O)                        (2.11)

where the unmentioned constant of proportionality is c.
   We can find c either through Equation   or by using Equation 2.11, then
setting c =[fp()(O) d]-1. Example:2. illustrates the second approach.

Example 2.15 (Seedlings, continued)
Recall the Seedlings examples (1                  , and    3) which modelled the
number of New seedling arrivals as Poi(A). Prior to the experiment ecologists knew


﻿


2.5. BAYESIAN INFERENCE


168


quite a bit about regeneration rates of acer rubrum in the vicinity of the experimental
quadrats. They estimated that New seedlings would arise at a rate most likely around .5
to 2 seedlings per quadrat per year and less likely either more or less than that. Their
knowledge could be encoded in the prior density displayed in Figure 2,29 which is p(A)
4A2e-2A. (This is the Gam(3, 1/2) density; see Section  .) Figure 2,2 also displays
the likelihood function p(y | A) c A3e-A found in Example 1.4 and Figure 1,6. Therefore,
according to Equation  .11, the posterior density is p(A l y) X A5e-3A. In Section 5,5
we will see that this is the Gam(6, 1/3) density, up to a constant of proportionality.
Therefore c in this example must be the constant that appears in the Gamma density:
c = 1/[5! x (1/3)6].

   In Figure 2.29 the posterior density is more similar to the prior density than
to the likelihood function. But the analysis deals with only a single data point.
Let's see what happens as data accumulates. If we have observations y, ...., y, the
likelihood function becomes
                       £(A = lp~ IA) =1 - AAYi, eA Y

To see what this means in practical terms, Figure:2:3 shows (a): the same prior
we used in Example 2., (b): £(A) for n = 1, 4, 16, and (c): the posterior for
n = 1, 4,16, always with y = 3.
   1. As n increases the likelihood function becomes increasingly peaked. That's
      because as n increases, the amount of information about A increases, and
      we know A with increasing accurracy. The likelihood function becomes in-
      creasingly peaked around the true value of A and interval estimates become
      increasingly narrow.

   2. As n increases the posterior density becomes increasingly peaked and be-
      comes increasingly like £(A). That's because as n increases, the amount of
      information in the data increases and the likelihood function becomes in-
      creasingly peaked. Meanwhile, the prior density remains as it was. Eventu-
      ally the data contains much more information than the prior, so the likelihood
      function becomes much more peaked than the prior and the likelihood domi-
      nates. So the posterior, the product of prior and likelihood, looks increasingly
      like the likelihood.
      Another way to look at it is through the loglikelihood log £(A) = c+log p(A) +
      E  log p(yi| A). As n - oc there is an increasing number of terms in the sum,
      so the sum eventually becomes much larger and much more important than
      log p(A).


﻿


2.5. BAYESIAN INFERENCE


169


C)
0


NM
0


6


0
0


                      0       1       2       3       4       5

Figure 2.29: Prior, likelihood and posterior  ensities for A in the seedlings example
after the single observation y 3


﻿


2.5. BAYESIAN INFERENCE


170


    In practice, of course, y usually doesn't remain constant as n increases. We saw
in Example L6 that there were 40 new seedlings in 60 quadrats. With this data
the posterior density is
                             p(O |Iy1, ... Y6o) C A42-62(2.12)
which is the Gam(43, 1/62) density. It is pictured in Figure 2.31. Compare to
Figure 2.29.
    Example 2.16 shows Bayesian statistics at work for the Slater School. See
,avine [1999] for further analysis.

Example 2.16 (Slater School, cont.)
At the time of the analysis reported in Brodeur [1992] there were two other lines of
evidence regarding the effect of power lines on cancer. First, there were some epidemi-
ological studies showing that people who live near power lines or who work as power
line repairmen develop cancer at higher rates than the population at large, though only
slightly higher. And second, chemists and physicists who calculate the size of magnetic
fields induced by power lines (the supposed mechanism for inducing cancer) said that the
small amount of energy in the magnetic fields is insufficient to have any appreciable affect
on the large biological molecules that are involved in cancer genesis. These two lines of
evidence are contradictory. How shall we assess a distribution for 0, the probability that
a teacher hired at Slater School develops cancer?
    Recall from page 137 that Neutra, the state epidemiologist, calculated "4.2 cases of
cancer could have been expected to occur" if the cancer rate at Slater were equal to
the national average. Therefore, the national average cancer rate for women of the age
typical of Slater teachers is 4.2/145  .03. Considering the view of the physicists, our
prior distribution should have a fair bit of mass on values of 0  .03. And considering the
epidemiological studies and the likelihood that effects would have been detected before
1992 if they were strong, our prior distribution should put most of its mass below 0  .06.
For the sake of argument let's adopt the prior depicted in Figure 2,32. Its formula is
                                  F (20)F (400) 1         9
                            (0) =   (19( -              )39                     (2.13)
                                     F(420)
which we will see in Section 5,6 is the Be(20, 400) density. The likelihood function
is f(O) c 08(1 - 0)137 (Equation 23, Figure 220). Therefore the posterior density
p(O y) oc 027(1 - 0)536 which we will see in Section 56 is the Be(28, 537) density.
Therefore we can easily write down the constant and get the posterior density
                           p(0 )  F(28)F(537)027(1 - 0)536
                                      F(565)
which is also pictured in Figure23.


﻿


2.5. BAYESIAN INFERENCE


                     a


171


b


0

0~


CJ


0


(0


0
0


0


      I   ~


    /

    I I I I  I

012345


      x


0


I I I I  I I


0 12 34 5


C


0_


0D


/"\


0


0
0~


(0
0


0


0


' I I I I I I


              0 12 34 5


                     k


Figure 2.30: a:Prior, b:likelihood and c:posterior densities for A with nm 1, 4,16


﻿


2.5. BAYESIAN INFERENCE                                              172


                                        ,' ', prior
                                        - --  likelihood
                               I  I
              co --                           posterior
                               I  I


                               I  I

                       -I
          (/)I                        II

          .05                       1 .


                  T               I       I


                  0.0       0.5      1.0      1.5      2.0


Figure 2.31: Prior, likelihood and posterior densities for A with n = 60, E y2 = 40.


﻿


2.5. BAYESIAN INFERENCE


173


0


0
CY)


0
CV


0


0


0.00      0.05     0.10      0.15


0.20


0


Figure 2.32: Prior, likelihood and posterior density for Slater School


﻿


2.6. PREDICTION


174


   Examples 2.15 and 2.16 have the convenient feature that the prior density had
the same form - Aae-bA in one case and Q"(1 - 8)b in the other - as the likelihood
function, which made the posterior density and the constant c particularly easy
to calculate. This was not a coincidence. The investigators knew the form of the
likelihood function and looked for a convenient prior of the same form that ap-
proximately represented their prior beliefs. This convenience, and whether choos-
ing a prior density for this property is legitimate, are topics which deserve serious
thought but which we shall not take up at this point.


2.6 Prediction

Sometimes the goal of statistical analysis is to make predictions for future observa-
tions. Let y1,..., y,, yf be a sample from p(.0). We observe yi, ..., y, but not yf,
and want a prediction for yf. There are three common forms that predictions take.

point predictions A point prediction is a single guess for yf. It might be a pre-
     dictive mean, predictive median, predictive mode, or any other type of point
     prediction that seems sensible.

interval predictions An interval prediction or predictive interval, is an interval of
     plausible values for yf. A predictive interval is accompanied by a probabil-
     ity. For example, we might say that "The interval (0, 5) is a 90% predictive
     interval for Yf" which would mean Pr[yf E (0, 5)]= .90. In a given problem
     there are, for two reasons, many predictive intervals. First, there are 90% in-
     tervals, 95% intervals, 50% intervals, and so on. And second, there are many
     predictive intervals with the same probability. For instance, if (0, 5) is a 90%
     predictive interval, then it's possible that (-1, 4.5) is also a 90% predictive
     interval.

predictive distributions A predictive distribution is a probability distribution for
      yf. From a predictive distribution, different people could compute point pre-
      dictions or interval predictions, each according to their needs.

In the real world, we don't know 0. After all, that's why we collected data y, ... ya.
But for now, to clarify the types of predictions listed above, let's pretend that we
do know 0. Specifically, let's pretend that we know y1,...., Yn, y f~i.i.d. N(-2, 1).
   The main thing to note, since we know 0 (in this case, the mean and SD of the
Normal distribution), is that y1,. .., yin don't help us at all. That is, they contain no


﻿


2.6. PREDICTION


175


information about yf that is not already contained in the knowledge of 0. In other
words, Yi, ..., y, and yf are conditionally independent given 0. In symbols:

                          p(y | 6, y1, .. . , y ) =  p(yy|10).
Therefore, our prediction should be based on the knowledge of 0 alone, not on any
aspect of y1, ..., ,y.
   A sensible point prediction for yf is Yf = -2, because -2 is the mean, median,
and mode of the N(-2, 1) distribution. Some sensible 90% prediction intervals
are (-oc, -0.72), (-3.65, -0.36) and (-3.28, oc). We would choose one or the
other depending on whether we wanted to describe the lowest values that yf might
take, a middle set of values, or the highest values. And, of course, the predictive
distribution of yf is N(-2, 1). It completely describes the extent of our knowledge
and ability to predict yf.
   In real problems, though, we don't know 0. The simplest way to make a predic-
tion consists of two steps. First use y, ..., yn to estimate 0, then make predictions
based on p(y 01). Predictions made by this method are called plug-in predictions.
In the example of the previous paragraph, if y1,.... y, yielded yf= -2 and = 1,
then predictions would be exactly as described above.
   For an example with discrete data, refer to Examples L4.and:L6 in which A is
the arrival rate of new seedlings. We found A = 2/3. The entire plug-in predictive
distribution is displayed in Figure 2.33. Q^f= 0 is a sensible point prediction.
The set {0, 1, 2} is a 97% plug-in prediction interval or prediction set (because
ppois (2, 2/3)   .97); the set {0, 1, 2, 3} is a 99.5% interval.
   There are two sources of uncertainty in making predictions. First, because yf is
random, we couldn't predict it perfectly even if we knew 0. And second, we don't
know 0. In any given problem, either one of the two might be the more important
source of uncertainty. The first type of uncertainty can't be eliminated. But in
theory, the second type can be reduced by collecting an increasingly large sample
Y, ..., yn so that we know 0 with ever more accuracy. Eventually, when we know
O accurately enough, the second type of uncertainty becomes negligible compared
to the first. In that situation, plug-in predictions do capture almost the full extent
of predictive uncertainty.
   But in many practical problems the second type of uncertainty is too large to be
ignored. Plug-in predictive intervals and predictive distributions are too optimistic
because they don't account for the uncertainty involved in estimating 0. A Bayesian
approach to prediction can account for this uncertainty. The prior distribution of
0 and the conditional distribution of y1,..., y, yf given 0 provide the full joint
distribution of y, ... , y, yf, 0, which in turn provides the conditional distribution


﻿


2.6. PREDICTION


176


c0
0
0~


0
CN
0
0
0


    0

        0

 I       I
o       2       4       6        8      10


Figure 2.33: Plug-in predictive distribution y f  Poi(A
example


2/3) for the seedlings


of yf given y1,..., y,. Specifically,


p(yf | y1, ... , Yr)


f p(yf,O| y1,...,yn) d


I

I


p(O  y1,... , y)p(yf |,y1, y . . ) d6

p(| y1 , ... , y)p(yy O6) dO


(2.14)


Equation12. is just the yf marginal density derived from the joint density of
(0, yf), all densities being conditional on the data observed so far. To say it another
way, the predictive density p(yf) is f p(6, yf) dO = f p(0)p(yf 01) dO, but where p(O)
is really the posterior p(O|yi, ..., y,). The role of y1,... , y is to give us the poste-
rior density of 0 instead of the prior.
   The predictive distribution in Equation 2.14 will be somewhat more dispersed
than the plug-in predictive distribution. If we don't know much about 0 then the
posterior will be widely dispersed and Equation     will be much more dispersed
than the plug-in predictive distribution. On the other hand, if we know a lot about
0 then the posterior distribution will be tight and Equation 2:14 will be only slightly


﻿


2.6. PREDICTION


177


more dispersed than the plug-in predictive distribution.

Example 2.17 (Seedlings, cont.)
Refer to Examples"14 and 2.15 about Y, the number of new seedlings emerging each year
in a forest quadrat. Our model is Y Poi(A). The prior (page 167) was p(A) = 4A2e-2A.
Before collecting any data our predictive distribution would be based on that prior. For
any number y we could calculate


py,(Y)   P[Yf = y] =
       -f Aye-a2
              y! F(3-
            23
                   Ay
         y!F(3)
         23(y + 3)
         y ! F ( 3 ) 3 y +3y (
         (y±+2)(


f PYf i(yA)p(A) dA

-A2e-2 2A
3)
+2e-3a dA

f   3Y+3
   I ]7(   Ay+A±2e73A dA
) (y + 3)
  3 1y


(2.15)


(We will see in Chapter.5 that this is a Negative Binomial distribution.) Thus for example,
according to our prior,
                                           (2)3     8
                             Pr[Yf =0] =23
                                             (3    2
                           Pr[Y f = 1] = 3 3 ) i _8
                                                 3   27
                                        etc.


Figure 2,34 displays these probabilities.
   In the first quadrat we found yi=
pg. 16)


3 and the posterior distribution (Example


p(A | y1 - 3)  36 - /3


So, by calculations similar to Equati
y1= 3 is
                pyf iy ly1= 3) =


on


the predictive distribution after observing


J pYf A(y A)PAly1(A Y 1=3)dA

  y+5      3)6(1)y


(2.16)


﻿


2.7. HYPOTHESIS TESTING


178


So, for example,


                             Pr[Y = 0|1y1 = 3] =

                                                      3  1
                           Pr[Yf=   1y1=3] =16Q
                                                   (4)4
                                         etc.

Figure 23displays these probabilities.
    Finally, when we collected data from 60 quadrats, we found

                                              6243
                          p(A | yi, ... , Y60) =42 A42e-62a                    (2.17)

Therefore , by calculations similar to Equation   , the predictive distribution is

                                             y +42     62\6( 1
                 Pr[Yf=y |IY1 ,... , Y601  (   J=/ 63)(63)(2.18)

Figure 2,34 displays these probabilities.
   A priori, and after only n = 1 observation, A is not know very precisely; both types of
uncertainty are important; and the Bayesian predictive distribution is noticably different
from the plug-in predictive distribution. But after n = 60 observations A is known fairly
well; the second type of uncertainty is negligible; and the Bayesian predictive distribution
is very similar to the plug-in predictive distribution.


2.7 Hypothesis Testing

Scientific inquiry often takes the form of hypothesis testing. In each instance there
are two hypotheses - the null hypothesis Ho and the alternative hypothesis Ha.

medicine

         * Ho: the new drug and the old drug are equally effective.
         " Ha: the new drug is better than the old.

public health


* Ho exposure to high voltage electric lines is benign.


﻿


2.7. HYPOTHESIS TESTING


179


C)
0


  -e n=0
-A- n=1
+     n=60
- -x- plug-in


10


A


NC_
0


A


  A

t0'


0
0


I
O0


I
2


4       6       8      10


                                         y


Figure 2.34: Predictive distributions of yf in the seedlings example after samples
of size n = 0, 1, 60, and the plug-in predictive


﻿


2.7. HYPOTHESIS TESTING


180


        * Ha exposure to high voltage electric lines promotes cancer.

public policy

        * Ho: Head Start has no effect.
        " Ha: Head Start is beneficial.

astronomy

        * Ho: The sun revolves around the Earth.
        " Ha: The Earth revolves around the sun.

physics

        * Ho: Newtonian mechanics holds.
        " Ha: Relativity holds.

public trust

        * Ho: Winning lottery numbers are random.
        " Ha: Winning lottery numbers have patterns.

ESP

        * Ho: There is no ESP.
        " Ha: There is ESP.

ecology

        * Ho: Forest fires are irrelevant to forest diversity.
        " Ha: Forest fires enhance forest diversity.

   By tradition Ho is the hypothesis that says nothing interesting is going on or the
current theory is correct, while Ha says that something unexpected is happening or
our current theories need updating. Often the investigator is hoping to disprove
the null hypothesis and to suggest the alternative hypothesis in its place.
   It is worth noting that while the two hypotheses are logically exclusive, they
are not logically exhaustive. For instance, it's logically possible that forest fires
decrease diversity even though that possibility is not included in either hypothesis.
So one could write Ha: Forest fires decrease forest diversity, or even Ha: Forest fires


﻿


2.7. HYPOTHESIS TESTING


181


change forest diversity. Which alternative hypothesis is chosen makes little differ-
ence for the theory of hypothesis testing, though it might make a large difference
to ecologists.
   Statisticians have developed several methods called hypothesis tests. We focus
on just one for the moment, useful when Ho is specific. The fundamental idea is
to see whether the data are "compatible" with the specific Ho. If so, then there
is no reason to doubt Ho; if not, then there is reason to doubt Ho and possibly to
consider Ha in its stead. The meaning of "compatible" can change from problem to
problem but typically there is a four step process.

   1. Formulate a scientific null hypothesis and translate it into statistical terms.

   2. Choose a low dimensional statistic, say w = w(y1, ..., y) such that the distri-
     bution of w is specified under Ho and likely to be different under Ha.

   3. Calculate, or at least approximate, the distribution of w under Ho.

   4. Check whether the observed value of w, calculated from y,... ya, is com-
     patible with its distribution under Ho.

   How would this work in the examples listed at the beginning of the chapter?
What follows is a very brief description of how hypothesis tests might be carried
out in some of those examples. To focus on the key elements of hypothesis testing,
the descriptions have been kept overly simplistic. In practice, we would have to
worry about confounding factors, the difficulties of random sampling, and many
other issues.

public health Sample a large number of people with high exposure to power lines.
     For each person, record X2, a Bernoulli random variable indicating whether
     that person has cancer. Model X1, ... , X ~ i.i.d. Bern(01). Repeat for a sam-
     ple of people with low exposure; getting Y1,. .. , Y,  i.i.d. Bern(02). Estimate
     01 and 02. Let w = 1 - 02. Ho says E[w] = 0. Either the Binomial distribution
     or the Central Limit Theorem tells us the SD's of 01 and 02, and hence the SD
     of w. Ask How many SD's is w away from its expected value of 0. If it's off by
     many SD's, more than about 2 or 3, that's evidence against Ho.

public policy Test a sample children who have been through Head Start. Model
     their test scores as X1, ..., X ~ i.i.d. N(,ui, c1). Do the same for children
     who have not been through Head Start, getting Y1, ... , Y ~i.i.d. N(p2, c2).
     H0 says , =p2. Let w =   - ft2. The parameters piu1,p2,uT,2 can all be
     estimated from the data; therefore w can be calculated and its SD estimated.


﻿


2.7. HYPOTHESIS TESTING


182


     Ask How many SD's is w away from its expected value of 0. If it's off by many
     SD's, more than about 2 or 3, that's evidence against Ho.

ecology We could either do an observational study, beginning with one sample of
     plots that had had frequent forest fires in the past and another sample that
     had had few fires. Or we could do an experimental study, beginning with a
     large collection of plots and subjecting half to a regime of regular burning and
     the other half to a regime of no burning. In either case we would measure
     and compare species diversity in both sets of plots. If diversity is similar
     in both groups, there is no reason to doubt Ho. But if diversity is sufficiently
     different (Sufficient means large compared to what is expected by chance under
     Ho.) that would be evidence against Ho.

   To illustrate in more detail, let's consider testing a new blood pressure medi-
cation. The scientific null hypothesis is that the new medication is not any more
effective than the old. We'll consider two ways a study might be conducted and see
how to test the hypothesis both ways.
   METHOD 1 A large number of patients are enrolled in a study and their blood
pressures are measured. Half are randomly chosen to receive the new medication
(treatment); half receive the old (control). After a prespecified amount of time,
their blood pressure is remeasured. Let Yc,2 be the change in blood pressure from
the beginning to the end of the experiment for the i'th control patient and YT,i be
the change in blood pressure from the beginning to the end of the experiment for
the i'th treatment patient. The model is

          Yo,1, ..-.-Yo,n~ i.i.d. fc; E[YoCi] = pc;   Var(Yo,2) =a2
          Yr,1, ..-.-Yr,n~ i.i.d. fr; E[Y,i] = p'T;  Var(Y,2) =Ta

for some unknown means pC and pT and variances ac and UT. The translation of
the hypotheses into statistical terms is

                                 Ho : p= c
                                 Ha :T    PC

Because we're testing a difference in means, let w = T - Yc. If the sample size n is
reasonably large, then the Central Limit Theorem says approximately w N(0, oaf)
under H0 with au,= (UT + ac)/n. The mean of 0 comes from Ho. The variance
,T comes from adding variances of independent random variables. a, and ,r and
therefore ua, can be estimated from the data. So we can calculate w from the data


﻿


2.7. HYPOTHESIS TESTING


183


and see whether it is within about 2 or 3 SD's of where Ho says it should be. If it
isn't, that's evidence against Ho.
   METHOD 2 A large number of patients are enrolled in a study and their blood
pressure is measured. They are matched together in pairs according to relevant
medical characteristics. The two patients in a pair are chosen to be as similar to
each other as possible. In each pair, one patient is randomly chosen to receive the
new medication (treatment); the other receives the old (control). After a prespeci-
fied amount of time their blood pressures are measured again. Let YT,i and Yc,2 be
the change in blood pressure for the i'th treatment and i'th control patients. The
researcher records
                                   1 if YT,2 > Yo,2
                                   0 otherwise

The model is
                          X1, . .. , Xn i.i.d. Bern(p)
for some unknown probability p. The translation of the hypotheses into statistical
terms is

                                  Ho : p = .5
                                  Ha: p   .5

Let w =>Xi. Under Ho, w   Bin(n, .5). To test H0 we plot the Bin(n, .5) distribu-
tion and see where w falls on the plot . Figure,23 shows the plot for n = 100. If
w turned out to be between about 40 and 60, then there would be little reason to
doubt Ho. But on the other hand, if w turned out to be less than 40 or greater than
60, then we would begin to doubt. The larger |w - 50|, the greater the cause for
doubt.
   This blood pressure example exhibits a feature common to many hypothesis
tests. First, we're testing a difference in means. I.e., Ho and Ha disagree about a
mean, in this case the mean change in blood pressure from the beginning to the
end of the experiment. So we take w to be the difference in sample means. Second,
since the experiment is run on a large number of people, the Central Limit Theorem
says that w will be approximately Normally distributed. Third, we can calculate or
estimate the mean po and SD o-o under Ho. So fourth, we can compare the value of
w from the data to what Ho says its distribution should be.
   In Method 1 above, that's just what we did. In Method 2 above, we didn't
use the Normal approximation; we used the Binomial distribution. But we could
have used the approximation. From facts about the Binomial distribution we know


﻿


2.7. HYPOTHESIS TESTING


184


00
0
0

0ql
0
0
0
0


        0c b
        o    o
      o       o
      o        o
    o           a
    o            a
  o               o
a                  ao


50


60


7
70


30


40


                         w


Figure 2.35: pdf of the Bin(100, .5) distribution


po = n/2 and go =   //2 under Ho. For n = 100, Figure 2.36 compares the exact
Binomial distribution to the Normal approximation.
   In general, when the Normal approximation is valid, we compare w to the
N(,uo, go) density, where yo is calculated according to Ho and go is either calculated
according to H0 or estimated from the data. If t  w - polu/uo is bigger than about
2 or 3, that's evidence against Ho.
   The following example shows hypothesis testing at work.

Example 2.18 (Tooth Growth, continued)
This continues Example 2, (pg. 100). Let's concentrate on a particular dosage, say
dose = 0.5, and test the null hypothesis that, on average, the delivery method (supp)
makes no difference to tooth growth, as opposed to the alternative that it does make
a difference. Those are the scientific hypotheses. The data for testing the hypothesis
are x1,..., xio, the 10 recordings of growth when supp = VC and y1, ..., y1, the 10
recordings of growth when supp O= J. The xi's are 10 independent draws from one
distribution; the yj's are 10 independent draws from another:

                              xi, . . . X0z i .i.d. fyc
                              2y1, .. . , yio~i.i.d. fo i


﻿


2.7. HYPOTHESIS TESTING


185


                0o
                0
                0

                0
                0
                0
                0
                0
                      30         40        50         60        70

                                           w


    Figure 2.36: pdfs of the Bin(100, .5) (dots) and N(50, 5) (line) distributions


Define the two means to be py =E[xi] and pou   =IE[y2]. The scientific hypothesis and
its alternative, translated into statistical terms become


                                   Ho : pyc = paar
                                   Ha : PyC -/0pJ


Those are the hypotheses in statistical terms.
    Because we're testing a difference in means, we choose our one dimensional summary
statistic to be w = |x - yl. Small values of w support Ho; large values support Ha. But
how small is small; how large is large? The Central Limit Theorem says


                                 z~N     y    vc)

                                 y jN (poi J / cmj)


﻿


2.7. HYPOTHESIS TESTING

approximately, so that under Ho,


186


w ~ N 0,


2C+ Oj)


approximately. The statistic w can be calculated, its SD estimated, and its approximate
density plotted as in Figure  . We can see from the Figure, or from the fact that
t/ot ~ 3.2 that the observed value of t is moderately far from its expected value under
Ho. The data provide moderately strong evidence against Ho.


                LO


                       -6    -4    -2    0     2     4     6


Figure 2.37: Approximate density of summary statistic t. The black dot is the value
of t observed in the data.


Figure


was produced with the following R code.


x <- ToothGrowth$len[   ToothGrowth$supp=="VC"
                         & ToothGrowth$dose==0.5
y <- ToothGrowth$len[   ToothGrowth$supp=="0J"
                         & ToothGrowth$dose==0.5
t <- abs ( mean(x) - mean(y) )
sd <- sqrt ( ( var(x) + var(y) )/length(x) )


]

]


﻿


2.7. HYPOTHESIS TESTING


187


     tvals <- seq ( -4*sd, 4*sd, len=80 )
     plot ( tvals, dnorm(tvals,0,sd), type="1",
             xlab="", ylab="", main="" )
     points ( t, 0, pch=16, cex=1.5 )


The points(. . ) adds the observed value of t to the plot.

    In the next example it is difficult to estimate the distribution of w under Ho; so
we use simulation to work it out.

Example 2.19 (Baboons)
Because baboons are promiscuous, when a baby is born it is not obvious, at least to
humans, who the father is. But do the baboons themselves know who the father is?
              [    ] report a study of baboon behavior that attempts to answer that
question. For more information see http://www.princeton.edu/~baboon. Baboons
live in social groups comprised of several adult males, several adult females, and juveniles.
Researchers followed several groups of baboons periodically over a period of several years
to learn about baboon behavior. The particular aspect of behavior that concerns us here
is that adult males sometimes come to the aid of juveniles. If adult males know which
juveniles are their own children, then it's at least possible that they tend to aid their own
children more than other juveniles. The data set baboons (available on the web site)
contains data on all the recorded instances of adult males helping juveniles. The first
four lines of the file look like this.

    Recip Father Maleally Dadpresent Group
    ABB     EDW       EDW           Y      OMO
    ABB     EDW       EDW           Y      OMO
    ABB     EDW       EDW           Y      OMO
    ABB     EDW       POW           Y      OMO


    1. Recip identifies the juvenile who received help. In the four lines shown here, it is
      always ABB.

   2. Father identifes the father of the juvenile. Researchers know the father through
      DNA testing of fecal samples. In the four lines shown here, it is always EDW.


1We have slightly modified the data to avoid some irrelevant complications.


﻿


2.7. HYPOTHESIS TESTING


188


   3. Maleally identifies the adult male who helped the juvenile. In the fourth line we
      see that POW aided ABB who is not his own child.

   4. Dadpresent tells whether the father was present in the group when the juvenile
      was aided. In this data set it is always Y.

   5. Group identifies the social group in which the incident occured. In the four lines
      shown here, it is always OMO.
Let w be the number of cases in which a father helps his own child. The snippet
     dim ( baboons )
     sum ( baboons$Father == baboons$Maleally )

reveals that there are n = 147 cases in the data set, and that w = 87 are cases in which
a father helps his own child. The next step is to work out the distribution of w under
H0: adult male baboons do not know which juveniles are their children.
    Let's examine one group more closely, say the OMG group. Typing
    baboons [baboons$Group == "0MG" , ]

displays the relevant records. There are 13 of them. EDW was the father in 9, POW
was the father in 4. EDW provided the help in 9, POW in 4. The father was the ally
in 9 cases; in 4 he was not. Ho implies that EDW and POW would distribute their help
randomly among the 13 cases. If Ho is true, i.e., if EDW distributes his 9 helps and POW
distributes his 4 helps randomly among the 13 cases, what would be the distribution of
W, the number of times a father helps his own child? We can answer that question
by a simulation in R. (We could also answer it by doing some math or by knowing the
hypergeometric distribution, but that's not covered in this text.)
     dads <- baboons$Father [ baboons$Group == "0MG" ]
     ally <- baboons$Maleally [ baboons$Group == "0MG" ]
     N.sim <- 1000
     w <- rep ( NA, N.sim )
     for ( i in 1:N.sim ) {
       perm <- sample ( dads )
       w[i] <- sum ( perm == ally )
     }
     hist (w)
     table(w)


﻿


2.7. HYPOTHESIS TESTING


189


Try out the simulation for yourself. It shows that the observed number in the data,
w = 9, is not so unusual under Ho.
   What about the other social groups? If we find out how many there are, we can do
a similar simulation for each. Let's write an R function to help.

     g.sim <- function (group, N.sim) {
       dads <- baboons$Father [ baboons$Group == group ]
       ally <- baboons$Maleally [ baboons$Group == group ]
       w <- rep ( NA, N.sim )
       for ( i in 1:N.sim ) {
         perm <- sample ( dads )
         w[i] <- sum ( perm == ally )
       }
       return(w)
     }


Figure shows histograms of g. sim for each group, along with a dot showing the
observed value of w in the data set. For some of the groups the observed value of w,
though a bit on the high side, might be considered consistent with Ho. For others, the
observed value of w falls outside the range of what might be reasonably expected by
chance. In a case like this, where some of the evidence is strongly against Ho and some
is only weakly against Ho, an inexperienced statistician might believe the overall case
against H0 is not very strong. But that's not true. In fact, every one of the groups
contributes a little evidence against Ho, and the total evidence against H0 is very strong.
To see this, we can combine the separate simulations into one. The following snippet of
code does this. Each male's help is randomly reassigned to a juvenile within his group.
The number of times when a father helps his own child is summed over the different
groups. Simulated numbers are shown in the histogram in Figure  . The dot in the
figure is at 84, the actual number of instances in the full data set. Figure  suggests
that it is almost impossible that the 84 instances arose by chance, as H0 would suggest.
We should reject Ho and reach the conclusion that (a) adult male baboons do know who
their own children are, and (b) they give help preferentially to their own children.

    Figure     was produced with the following snippet.

    groups <- unique ( baboons$Group )
    n.groups <- length(groups)
    par ( mfrow=c(3,2) )


﻿


2.7. HYPOTHESIS TESTING


                       OMO


190


vIv


0
0


0
0
N\


0


0
0


0
0
N\


0


0
CO)

0
0
N\

0
0

T


5  6  7  8 9  10 11


        w


        NYA


0
0
CO)

0
0
N\


0
0


0


8   10  12  14   16


        w


        WEA


  I       I

  15 20  25  30


0   1  2  3   4


5


W


W


LIN


0

0

0


0

LO


8 10    14   18


                         W


Figure 2.38: Number of times baboon father helps own child in Example

Histograms are simulated according to Ho. Dots are observed data.


﻿


2.7. HYPOTHESIS TESTING


191


for ( i in 1:n.groups ) {
  good <- baboons$Group == groups[i]
  w.obs <- sum (      baboons$Father[good]
                   == baboons$Maleally[good] )
  w.sim <- g.sim ( groups[i], N.sim )
  hist ( w.sim, xlab="w", ylab="", main=groups[i],
         xlim=range(c(w.obs,w.sim)) )
  points ( w.obs, 0, pch=16, cex=1.5 )
  print ( w.obs )
}


0
0
Nd

0
0
r


0


B


50


60


70


80


                                    w.tot


Figure 2.39: Histogram of simulated values of w.tot. The dot is the value observed
in the baboon data set.


Figure


was produced with the following snippet.


w.obs <- rep ( NA, n.groups )
w.sim <- matrix ( NA, n.groups, N.sim )


﻿


2.8. EXERCISES                                                         192


    for ( i in 1:n.groups ) {
      good <- baboons$Group == groups[i]
      w.obs[i] <- sum (       baboons$Father[good]
                           == baboons$Maleally[good] )
      w.sim[i,] <- g.sim ( groups[i], N.sim )
    }
    w.obs.tot <- sum ( w.obs )
    w.sim.tot <- apply ( w.sim, 2, sum )
    hist ( w.sim.tot, xlab="w.tot", ylab="",
            xlim=range(c(w.obs.tot,w.sim.tot)) )
      points ( w.obs.tot, 0, pch=16, cex=1.5 )
      print ( w.obs.tot )


2.8 Exercises

  1. (a) Justify Equation   on page
      (b) Show that the function g(x) defined just after Equation is a proba-
          bility density. I.e., show that it integrates to 1.

  2. This exercise uses the ToothGrowth data from Examples  and

      (a) Estimate the effect of delivery mode for doses 1.0 and 2.0. Does it seem
          that delivery mode has a different effect at different doses?
      (b) Does it seem as though delivery mode changes the effect of dose?
      (c) For each delivery mode, make a set of three boxplots to compare the
          three doses.

  3. This exercise uses data from 272 eruptions of the Old Faithful geyser in Yel-
     lowstone National Park. The data are in the R dataset faithful. One column
     contains the duration of each eruption; the other contains the waiting time
     to the next eruption.

     (a) Plot eruption versus waiting. Is there a pattern? What is going on?
     (b) Try ts. plot (faithful$eruptions [1: 501 ). Try other sets of eruptions,
          say ts.plot (faithful$eruptions [51:100]). (There is nothing magic


﻿


2.8. EXERCISES


193


        about 50, but if you plot all 272 eruptions then the pattern might be
        harder to see. Choose any convenient number that lets you see what's
        going on.) What is going on?

4. This exercise relies on data from the neurobiology experiment described in
   Example 2.6

   (a) Download the data from the book's website.
   (b) Reproduce Figure17.
   (c) Make a plot similar to Figure 2,17 but for a different neuron and differ-
        ent tastant.
    (d) Write an R function that accepts a neuron and tastant as input and pro-
        duces a plot like Figure217
    (e) Use the function from the previous part to look for neurons that respond
       to particular tastants. Describe your results.

5. This exercise relies on Example 2. about the Slater school. There were 8 can-
   cers among 145 teachers. Figure 2,20 shows the likelihood function. Suppose
   the same incidence rate had been found among more teachers. How would
   that affect £(O)? Make a plot similar to Figure .20, but pretending that there
   had been 80 cancers among 1450 teachers. Compare to Figure  . What is
   the result? Does it make sense? Try other numbers if it helps you see what is
   going on.

6. This exercise continues Exercise 35 in Chapter . Let p be the fraction of the
   population that uses illegal drugs.

   (a) Suppose researchers know that p  .1. Jane and John are given the
        randomized response question. Jane answers "yes"; John answers "no".
        Find the posterior probability that Jane uses cocaine; find the posterior
        probability that John uses cocaine.
    (b) Now suppose that p is not known and the researchers give the random-
        ized response question to 100 people. Let X be the number who answer
        "yes". What is the likelihood function?
    (c) What is the mle of p if X=50, if X=60, if X=70, if X=80, if X=90?

7. This exercise deals with the likelihood function for Poisson distributions.

    (a) Let x1, .. . , ,~i.i.d. Poi(A). Find Yf(A) in terms of zi, ... ,xz.


﻿


2.8. EXERCISES                                                          194

      (b) Show that {(A) depends only on 1 x and not on the specific values of
          the individual x's.
      (c) Let y1, ... , y., be a sample from Poi(A). Show that A =  is the m.l.e.
      (d) Find the m.l.e. in Example

  8. The book Data                       [     ] contains lots of data sets that
     have been used for various purposes in statistics. One famous data set records
     the annual number of deaths by horsekicks in the Prussian Army from 1875-
     1894 for each of 14 corps. Download the data from statlib at http: //lib.
     stat. cmu.edu/datasets/Andrews/T04.1. (It is Table 4.1 in the book.) Let
     Y be the number of deaths in year i, corps j, for i = 1875,..., 1894 and
     j = 1, ... , 14. The Ys are in columns 5-18 of the table.
     (a) What are the first four columns of the table?
     (b) What is the last column of the table?
     (c) What is a good model for the data?
     (d) Suppose you model the data as i.i.d. Poi(A). (Yes, that's a good answer
          to the previous question.)
          i. Plot the likelihood function for A.
          ii. Find A.
          iii. What can you say about the rate of death by horsekick in the Prus-
             sian calvary at the end of the 19th century?
      (e) Is there any evidence that different corps had different death rates? How
          would you investigate that possibility?

  9. Use the data from Example . Find the m.l.e. for 0.

  10. X1,... , X, ~ Normal(p, 1). Multiple choice: The m.l.e. f is found from the
     equation

     (a)      f(Xi, ... . ..p) = 0
     (b) +f (Xi, ... , /'t) = 0
     (c) d/f(Xi, ... . o..p) = 0

 11. This exercise deals with the likelihood function for Normal distributions.


(a) Let y1, ... , y   ~ i.i.d. N(p, 1). Find  (p) in terms of y1, ... , y.-


﻿


2.8. EXERCISES


195


     (b) Show that £(pu) depends only on E ye and not on the specific values of
         the individual yr's.
     (c) Let n = 10 and choose a value for p. Use R to generate a sample of size
         10 from N(p, 1). Plot the likelihood function. How accurately can you
         estimate yu from a sample of size 10?
     (d) Let y1, ... , yio  i.i.d. N(pu, a) where a is known but not necessarily equal
         to 1. Find £ (pu) in terms of y1, ... , yio and a.
     (e) Let y1, ... Y o i.i.d. N(p, a) where pu is known but a is unknown. Find
         £(a) in terms of Y1,..., yio and pu.

12. Let y,. .. , ym be a sample from N(p, 1). Show that ft = y is the m.l.e.

13. Let y1, ..., y, be a sample from N(pu, a) where ,u is known. Show that 892=
    n-1 E(y - p)2 is the m.l.e.

14. Recall the discoveries data from page   on the number of great discoveries
    each year. Let Y be the number of great discoveries in year i and suppose
    Y  ~ Poi(A). Plot the likelihood function £(A). Figure:L3 suggested that
    A   3.1 explained the data reasonably well. How sure can we be about the
    3.1?

15. Justify each step of Equation,2.

16. Page 159 discusses a simulation experiment comparing the sample mean and
    sample median as estimators of a population mean. Figure 2,27 shows the
    results of the simulation experiment. Notice that the vertical scale decreases
    from panel (a) to (b), to (c), to (d). Why? Give a precise mathematical
    formula for the amount by which the vertical scale should decrease. Does the
    actual decrease agree with your formula?

17. In the medical screening example on page  , find the probability that the
    patient has the disease given that the test is negative.

18. A drug testing example

19. Country A suspects country B of having hidden chemical weapons. Based on
    secret information from their intelligence agency they calculate
    P[B has weapons] =.8. But then country B agrees to inspections, so A sends
    inspectors. If there are no weapons then of course the inspectors won't find


﻿


2.8. EXERCISES


196


    any. But if there are weapons then they will be well hidden, with only a 20%
    chance of being found. I.e.,

                      P[finding weapons weapons exist] = .2.      (2.19)

    No weapons are found. Find the probability that B has weapons. I.e., find

                    Pr[B has weapons no weapons are found].

20. Let T be the amount of time a customer spends on Hold when calling the
    computer help line. Assume that T   exp(A) where A is unknown. A sample
    of n calls is randomly selected. Let ti, ... , t, be the times spent on Hold.

    (a) Choose a value of A for doing simulations.
    (b) Use R to simulate a sample of size n = 10.
    (c) Plot £(A) and find A.
    (d) About how accurately can you determine A?
    (e) Show that £(A) depends only on E ti and not on the values of the indi-
         vidual ti's.

21. There are two coins. One is fair; the other is two-headed. You randomly
    choose a coin and toss it.

    (a) What is the probability the coin lands Heads?
    (b) What is the probability the coin is two-headed given that it landed
         Heads?
     (c) What is the probability the coin is two-headed given that it landed Tails?
         Give a formal proof, not intuition.
     (d) You are about to toss the coin a second time. What is the probability
         that the second toss lands Heads given that the first toss landed Heads?

22. There are two coins. For coin A, P[H] = 1/4; for coin B, P[H] = 2/3. You
    randomly choose a coin and toss it.

    (a) What is the probability the coin lands Heads?
    (b) What is the probability the coin is A given that it landed Heads? What
         is the probability the coin is A given that it landed Tails?


﻿


2.8. EXERCISES


197


     (c) You are about to toss the coin a second time. What is the probability the
         second toss lands Heads given that the first toss landed Heads?

23. At Dupont College (apologies to Tom Wolfe) Math SAT scores among math
    majors are distributed N(700, 50) while Math SAT scores among non-math
    majors are distributed N(600, 50). 5% of the students are math majors. A
    randomly chosen student has a math SAT score of 720. Find the probability
    that the student is a math major.

24. The Great Randi is a professed psychic and claims to know the outcome of
    coin flips. This problem concerns a sequence of 20 coin flips that Randi will
    try to guess (or not guess, if his claim is correct).

    (a) Take the prior P[Randi is psychic] = .01.
           i. Before any guesses have been observed, find
             P [first guess is correct] and P [first guess is incorrect].
          ii. After observing 10 consecutive correct guesses, find the updated
             P[Randi is psychic].
         iii. After observing 10 consecutive correct guesses, find
             P [next guess is correct] and P [next guess is incorrect].
          iv. After observing 20 consecutive correct guesses, find
             P [next guess is correct] and P [next guess is incorrect].
     (b) Two statistics students, a skeptic and a believer discuss Randi after class.
         Believer: I believe her, I think she's psychic.
         Skeptic: I doubt it. I think she's a hoax.
         Believer: How could you be convinced? What if Randi guessed 10 in a
         row? What would you say then?
         Skeptic: I would put that down to luck. But if she guessed 20 in a row then
         I would say P [Randi can guess coin flips]  .5.
         Find the skeptic's prior probability that Randi can guess coin flips.
     (c) Suppose that Randi doesn't claim to guess coin tosses perfectly, only that
         she can guess them at better than 50%. 100 trials are conducted. Randi
         gets 60 correct. Write down Ho and Ha appropriate for testing Randi's
         claim. Do the data support the claim? What if 70 were correct? Would
         that support the claim?
     (d) The Great Sandi, a statistician, writes the following R code to calculate
         a probability for Randi.


﻿


2.8. EXERCISES


198


               y <- rbinom ( 500, 100, .5)
               sum ( y == 60 ) / 500


         What is Sandi trying to calculate? Write a formula (Don't evaluate it.)
         for the quantity Sandi is trying to calculate.

25. Let w be the fraction of free throws that Shaquille O'Neal (or any other player
    of your choosing) makes during the next NBA season. Find a density that
    approximately represents your prior opinion for w.

26. Let t be the amount of time between the moment when the sun first touches
    the horizon in the afternoon and the moment when it sinks completely below
    the horizon. Without making any observations, assess your distribution for t.

27. Assess your prior distribution for b, the proportion of M&M's that are brown.
    Buy as many M&M's as you like and count the number of browns. Calculate
    your posterior distribution.

28. (a) Let y   N(O, 1) and let the prior distribution for 0 be 0  N(O, 1).
          i. When y has been observed, what is the posterior density of 0?
          ii. Show that the density in part i. is a Normal density.
          iii. Find its mean and SD.
     (b) Let y    N(O, ay) and let the prior distribution for 0 be 0 N(m, a).
         Suppose that ay, m, and a are known constants.
         i. When y has been observed, what is the posterior density of 0?
         ii. Show that the density in part i. is a Normal density.
         iii. Find its mean and SD.
     (c) Let y1,... , y be a sample of size n from N(0, ay) and let the prior dis-
         tribution for 0 be 0  N(m, a). Suppose that ay, m, and a are known
         constants.
         i. When Y1,..., yn have been observed, what is the posterior density
             of 0?
          ii. Show that the density in part i. is a Normal density.
          iii. Find its mean and SD.
     (d) An example with data.


29. Verify Equations


,and


﻿


2.8. EXERCISES


199


30. Refer to the discussion of predictive intervals on page  5. Justify the claim
    that (-oc, -.72), (-3.65, -0.36), and (-3.28, oc) are 90% prediction inter-
    vals. Find the corresponding 80% prediction intervals.

31. (a) Following Example 2,17 (pg.    ), find Pr[y f= k |y1,. .. y ] for k
         1,2,3,4.
     (b) Using the results from part (a), make a plot analagous to Figure 2.33
         (pg. 76).

32. Suppose you want to test whether the random number generator in R gener-
    ates each of the digits 0, 1,..., 9 with probability 0.1. How could you do it?
    You may consider first testing whether R generates 0 with the right frequency,
    then repeating the analysis for each digit.

33. (a) Repeat the analysis of Example2.18 (pg. 184), but for dose = 1 and
         dose = 2.
     (b) Test the hypothesis that increasing the dose from 1 to 2 makes no differ-
         ence in tooth growth.
     (c) Test the hypothesis that the effect of increasing the dose from 1 to 2 is
         the same for supp = VC as it is for supp O=J.
     (d) Do the answers to parts (a), (b) and (c) agree with your subjective as-
         sessment of Figures 22, 2.3, and 2,6?

34. Continue Exercise 3   from Chapter . The autoganzfeld trials resulted in
    X = 122.

    (a) What is the parameter in this problem?
    (b) Plot the likelihood function.
    (c) Test the "no ESP, no cheating" hypothesis.
    (d) Adopt and plot a reasonable and mathematically tractable prior distri-
         bution for the parameter. Compute and plot the posterior distribution.
     (e) Find the probability of a match on the next trial given X = 122.
     (f) What do you conclude?

35. Three biologists named Asiago, Brie, and Cheshire are studying a mutation
    in morning glories, a species of flowering plant. The mutation causes the
    flowers to be white rather than colored. But it is not known whether the


﻿


2.8. EXERCISES


200


    mutation has any effect on the plants' fitness. To study the question, each
    biologist takes a random sample of morning glories having the mutation,
    counts the seeds that each plant produces, and calculates a likelihood set for
    the average number of seeds produced by mutated morning glories.
    Asiago takes a sample of size nA =100 and calculates a LS.1 set. Brie takes a
    sample of size nB = 400 and calculates a LS.1 set. Cheshire takes a sample of
    size Tno= 100 and calculates a LS.2 set.

    (a) Who will get the longer interval, Asiago or Brie? About how much longer
         will it be? Explain.
     (b) Who will get the longer interval, Asiago or Cheshire? About how much
         longer will it be? Explain.

36. In the 1990's, a committee at MIT wrote A Study on the Status of Women
    Faculty in Science at MIT. In 1994 there were 15 women among the 209
    tenured women in the six departments of the School of Science. They found,
    among other things, that the amount of resources (money, lab space, etc.)
    given to women was, on average, less than the amount given to men. The
    report goes on to pose the question: Given the tiny number of women faculty
    in any department one might ask if it is possible to obtain significant data to
    support a claim of gender differences .....
    What does statistics say about it? Focus on a single resource, say laboratory
    space. The distribution of lab space is likely to be skewed. I.e., there will
    be a few people with lots more space than most others. So let's model the
    distribution of lab space with an Exponential distribution. Let x1, ..., x be
    the amounts of space given to tenured women, so x~ Exp(Aw) for some
    unknown parameter Aw. Let M be the average lab space given to tenured
    men. Assume that M is known to be 100, from the large number of tenured
    men. If there is no discrimination, then Aw = 100. (Aw is E (z).)
    Chris Stats writes the following R code.

         y <- rexp(15,.01)
         m <- mean(y)
         s <- sqrt( var(y ) / 15)
         lo <- m - 2*s
         hi <- m + 2*s


﻿


2.8. EXERCISES


201


    What is y supposed to represent? What is (lo,hi) supposed to represent?
    Now Chris puts the code in a loop.

        n <- 0
        for ( i in 1:1000 ) {
           y <- rexp(15,.01)
           m <- mean(y)
           s <- sqrt ( var(y ))
           lo <- m - 2*s
           hi <- m + 2*s
           if ( lo < 100 & hi > 100 ) n <- n+1
        }
        print (n/1000)


    What is n/1000 supposed to represent? If a sample size of 15 is sufficiently
    large for the Central Limit Theorem to apply, then what, approximately, is the
    value of n/1000?

37. Refer to the R code in Example 2 1 (pg. 100). Why was it necessary to have
    a brace ("{") after the line
    for ( j in 1:3 )
    but not after the line
    for ( i in 1:2 )?


﻿


CHAPTER 3


                           REGRESSION


3.1 Introduction

Regression is the study of how the distribution of one variable, Y, changes accord-
ing to the value of another variable, X. R comes with many data sets that offer
regression examples. Four are shown in Figure 3.1.

   1. The data set attenu contains data on several variables from 182 earthquakes,
     including hypocenter-to-station distance and peak acceleration. Figure 3.1_(a)
     shows acceleration plotted against distance. There is a clear relationship be-
     tween X = distance and the distribution of Y = acceleration. When X is
     small, the distribution of Y has a long right-hand tail. But when X is large,
     Y is always small.

  2. The data set airquality contains data about air quality in New York City.
     Ozone levels Y are plotted against temperature X in Figure 3.1 (b). When X
     is small then the distribution of Y is concentrated on values below about 50
     or so. But when X is large, Y can range up to about 150 or so.

  3. Figure3.1 (c) shows data from mtcars. Weight is on the abcissa and the type
     of transmission (manual=1, automatic=0) is on the ordinate. The distribu-
     tion of weight is clearly different for cars with automatic transmissions than
     for cars with manual transmissions.

  4. The data set faithful contains data about eruptions of the Old Faithful
     geyser in Yellowstone National Park. Figure 3.1 (d) shows Y = time to next eruption
     plotted against X = duration of current eruption. Small values of X tend to
     indicate small values of Y.


202


﻿


3.1. INTRODUCTION


203


(a)


(b)


co _
0


0
+r
a)

0
0


(0
6


0
6


o 100     300


0
10 -


N
0


0
0


10-


0 -


60 70 80 90


Distance


  (C)


temperature


   (d)


0
UI)
E


  0-


o % 00


   0800


       0     0


    Weight


0


LO    ..


0 .     .  .   .


eruptions


Figure 3.1: Four regression examples


﻿


3.1. INTRODUCTION


204


   Figure    was produced by the following R snippet.

   par ( mfrow=c(2,2) )

   data ( attenu )
   plot ( attenu$dist, attenu$accel, xlab="Distance",
          ylab="Acceleration", main=" (a)", pch="." )

  data ( airquality )
  plot ( airquality$Temp, airquality$Ozone, xlab="temperature",
          ylab="ozone", main="(b)", pch="." )

  data ( mtcars )
  stripchart ( mtcars$wt ~ mtcars$am, pch=1, xlab="Weight",
                method="jitter", ylab="Manual Transmission",
                main="(c)" )

  data ( faithful )
  plot ( faithful, pch=".", main="(d)" )


  Both continuous and discrete variables can turn up in regression problems. In
the attenu, airquality and faithful datasets, both X and Y are continuous. In
mtcars, it seems natural to think of how the distribution of Y = weight varies with
X = transmission, in which case X is discrete and Y is continuous. But we could
also consider how the fraction of cars Y with automatic transmissions varies as a
function of X = weight, in which case Y is discrete and X is continuous.
   In many regression problems we just want to display the relationship between
X and Y. Often a scatterplot or stripchart will suffice, as in Figure  . Other
times, we will use a statistical model to describe the relationship. The statistical
model may have unknown parameters which we may wish to estimate or otherwise
make inference for. Examples of parametric models will come later. Our study of
regression begins with data display.
   In many instances a simple plot is enough to show the relationship between X
and Y. But sometimes the relationship is obscured by the scatter of points. Then it
helps to draw a smooth curve through the data. Examples  and  illustrate.


Example 3.1 (1970 Draft Lottery)


﻿


3.1. INTRODUCTION


205


The result of the 1970 draft lottery is available at DASL . The website explains:
         "In 1970, Congress instituted a random selection process for the military
      draft. All 366 possible birth dates were placed in plastic capsules in a rotating
      drum and were selected one by one. The first date drawn from the drum
      received draft number one and eligible men born on that date were drafted
      first. In a truly random lottery there should be no relationship between the
      date and the draft number."
    Figure    shows the data, with X = day of year and Y = draft number. There is
no apparent relationship between X and Y.

    Figure    was produced with the following snippet.

    plot ( draft$Day.of.year, draft$Draft.No,
             xlab="Day of year", ylab="Draft number" )


    More formally, a relationship between X and Y usually means that the expected value
of Y is different for different values of X. (We don't consider changes in SD or other
aspects of the distribution here.) Typically, when X is a continuous variable, changes in
Y are smooth, so we would adopt the model
                                 ElY X] = g(X)                               (3.1)

for some unknown smooth function g.
   R has a variety of built-in functions to estimate g. These functions are called scat-
terplot smoothers, for obvious reasons. Figure shows the draft lottery data with two
scatterplot smoother estimates of g. Both estimates show a clear trend, that birthdays
later in the year were more likely to have low draft numbers. check this: Following dis-
covery of this trend, the procedure for drawing draft numbers was changed in subsequent
years.

    Figure    was produced with the following snippet.

    x <- draft$Day.of.year
    y <- draft$Draft.No
    plot ( x, y, xlab="Day of year", ylab="Draft number" )
    lines ( lowess ( x, y ) )
    lines ( supsmu ( x, y ), lty=2 )


﻿


3.1. INTRODUCTION


206


0
0
CY)


-0


0
0
0


0


0


I         I         I        I


0        100       200


300


Day of year


Figure 3.2: 1970 draft lottery. Draft number vs. day of year


﻿


3.1. INTRODUCTION


207


0
0
CY)


0


0
0
0


0-
O -


o -


I          I         I

0        100        200       300


                                  Day of year


Figure 3.3: 1970 draft lottery. Draft number vs. day of year. Solid curve fit by
lowess; dashed curve fit by supsmu.


﻿


3.1. INTRODUCTION


208


   * lowess (locally weighted scatterplot smoother) and supsmu (super smoother) are
     two of R's scatterplot smoothers. In the figure, the lowess curve is less wiggly
     than the supsmu curve. Each smoother has a tuning parameter that can make the
     curve more or less wiggly. Figure   was made with the default values for both
     smoothers.


Example 3.2 (Seedlings, continued)
As mentioned in Example , the seedlings study was carried out at the Coweeta Long
Term Ecological Research station in western North Carolina. There were five plots at
different elevations on a hillside. Within each plot there was a 60m x1m strip running
along the hillside divided into 60 1m x1m quadrats. It is possible that the arrival rate
of New seedlings and the survival rates of both Old and New seedlings are different in
different plots and different quadrats. Figure  shows the total number of New seedlings
in each of the quadrats in one of the five plots. The lowess curve brings out the spatial
trend: low numbers to the left, a peak around quadrat 40, and a slight falling off by
quadrat 60.

   Figure    was produced by

     plot ( total.new, xlab="quadrat index",
             ylab="total new seedlings" )
     lines ( lowess ( total.new ) )


   In a regression problem the data are pairs (xi, y2) for i = 1, ... , n. For each i, y2
is a random variable whose distribution depends on x. We write

                                 Y2 = g(xj) + eC.                          (3.2)
Equation      expresses y2 as a systematic or explainable part g(xj) and an unex-
plained part e2. g is called the regression function. Often the statistician's goal is to
estimate g. As usual, the most important tool is a simple plot, similar to those in
Figures     through
   Once we have an estimate, y, for the regression function g (either by a scatter-
plot smoother or by some other technique) we can calculate r2 - y2 - y(xi). The
r2's are estimates of the c's and are called residuals. The c's themselves are called
errors. Because the r2's are estimates they are sometimes written with the "hat"
notation:
                             si = r2 = estimate of e6


﻿


3.1. INTRODUCTION


209


U,)
0)


ci)
0i


0 _


107


/1-


0


0


0
0
0


  0

  0
  OD


0


0 0  0
0        00


0


I                 I          ---r-


0     10    20


30    40


I     I


50    60


quadrat index


Figure 3.4: Total number of New seedlings 1993 - 1997, by quadrat.


﻿


3.2. NORMAL LINEAR MODELS


210


Residuals are used to evaluate and assess the fit of models for g, a topic which is
beyond the scope of this book.
   In regression we use one variable to explain or predict the other. It is customary
in statistics to plot the predictor variable on the x-axis and the predicted variable
on the y-axis. The predictor is also called the independent variable, the explanatory
variable, the covariate, or simply x. The predicted variable is called the dependent
variable, or simply y. (In Economics x and y are sometimes called the exogenous
and endogenous variables, respectively.) Predicting or explaining y from x is not
perfect; knowing x does not tell us y exactly. But knowing x does tell us something
about y and allows us to make more accurate predictions than if we didn't know x.
   Regression models are agnostic about causality. In fact, instead of using x to
predict y, we could use y to predict x. So for each pair of variables there are two
possible regressions: using x to predict y and using y to predict x. Sometimes
neither variable causes the other. For example, consider a sample of cities and let
x be the number of churches and y be the number of bars. A scatterplot of x and
y will show a strong relationship between them. But the relationship is caused by
the population of the cities. Large cities have large numbers of bars and churches
and appear near the upper right of the scatterplot. Small cities have small numbers
of bars and churches and appear near the lower left.
   Scatterplot smoothers are a relatively unstructured way to estimate g. Their
output follows the data points more or less closely as the tuning parameter allows
g to be more or less wiggly. Sometimes an unstructured approach is appropriate,
but not always. The rest of Chapter 3 presents more structured ways to estimate g.


3.2 Normal Linear Models

3.2.1 Introduction
In Section L,4 we studied the Normal distribution, useful for continuous popula-
tions having a central tendency with roughly equally sized tails. In Section 3.2 we
generalize to the case where there are many Normal distributions with different
means which depend in a systematic way on another variable. We begin our study
with an example in which there are three distinct distributions.

Example 3.3 (Hot Dogs, continued)
Figure 3° displays calorie data for three types of hot dogs. It appears that poultry hot
dogs have, on average, slightly fewer calories than beef or meat hot dogs. How should
we model these data?


﻿


3.2. NORMAL LINEAR MODELS21


211


0
0~


©000©O   000 (D0O


     0     (MM) 0Jam00 ®0


     00   0GOD 00 ® 0 000DO


100 120


140 160  180


calories


Figure 3.5: Calorie content of hot dogs


﻿


3.2. NORMAL LINEAR MODELS


212


   Figure    was produced with

     stripchart ( hotdogs$Calories       hotdogs$Type, pch=1,
                    xlab="calories" )


   There are 20 Beef, 17 Meat and 17 Poultry hot dogs in the sample. We think of
them as samples from much larger populations. Figure  shows density estimates of
calorie content for the three types. For each type of hot dog, the calorie contents cluster
around a central value and fall off to either side without a particularly long left or right
tail. So it is reasonable, at least as a first attempt, to model the three distributions as
Normal. Since the three distributions have about the same amount of spread we model
them as all having the same SD. We adopt the model

                           B1, ... , B20 ~  i.i.d. N(pB, O)
                           M 1, . .. , M 17 ~ i.i.d. N (put , ()           (3.3)
                           P1, ... , P17  ~  i.i.d. N(pp, o),

where the Bi's, Mi's and Pi's are the calorie contents of the Beef, Meat and Poultry hot
dogs respectively. Figure  suggests

                  pB ~l_ 150; pM ~l_ 160; p  ~ 120; or ~ 30.

An equivalent formulation is

                            B1, ... , B20 ~  i.i.d. N(I, o)
                         M1,... , M17 ~ i.i.d. N(p + M, )                  (3.4)
                         P1,... ,P17 ~ i.i.d. N(p + 6p, u)

   Models     and    are mathematically equivalent. Each has three parameters for the
population means and one for the SD. They describe exactly the same set of distributions
and the parameters of either model can be written in terms of the other. The equivalence
is shown in Table . For the purpose of further exposition we adopt Model
   We will see later how to carry out inferences regarding the parameters. For now we
stop with the model.


Figure    was produced with the following snippet.


﻿


3.2. NORMAL LINEAR MODELS                                                         213


                                           Beef


                   50          100          150         200         250

                                          calories


                                          Meat


                   50          100          150         200         250

                                          calories


                                          Poultry


        50          100         150         200          250

                               calories


Figure 3.6: Density estimates of calorie contents of hot dogs


﻿


3.2. NORMAL LINEAR MODELS


214


    par ( mfrow=c(3,1))
    plot ( density ( hotdogs$C [hotdogs$T=="Beef "] , bw=20 ),
            xlim=c(50,250), yaxt="n", ylab="", xlab="calories",
            main="Beef " )
    plot ( density ( hotdogs$C[hotdogs$T=="Meat"], bw=20 ),
            xlim=c(50,250), yaxt="n", ylab="", xlab="calories",
            main="Meat" )
    plot ( density ( hotdogs$C[hotdogs$T=="Poultry"], bw=20 ),
            xlim=c(50,250), yaxt="n", ylab="", xlab="calories",
            main="Poultry" )


   " hotdogs$C and hotdogs$T illustrate a convenient feature of R, that components
     of a structure can be abbreviated. Instead of typing hotdogs$Calories and
     hotdogs$Type we can use the abbreviations. The same thing applies to arguments
     of functions.

   " density ( ..., bw=20 ) specifies the bandwidth of the density estimate. Larger
     bandwidth gives a smoother estimate; smaller bandwidth gives a more wiggly esti-
     mate. Try different bandwidths to see what they do.


   The PlantGrowth data set in R provides another example. As R explains, the
data previously appeared in       [    ] and are
     "Results from an experiment to compare yields (as measured by dried
     weight of plants) obtained under a control and two different treatment
     conditions."
The first several lines are
  weight group
1   4.17   ctrl
2   5.58   ctrl
3   5.18   ctrl
Figure    shows the whole data set.
   It appears that plants grown under different treatments tend to have different
weights. In particular, plants grown under Treatment 1 appear to be smaller on av-
erage than plants grown under either the Control or Treatment 2. What statistical
model should we adopt?


﻿


3.2. NORMAL LINEAR MODELS


215


Model3.    Model3.4 Interpretation         Approximate value
                      mean calorie content
    I'B               of Beef hot dogs
                      mean calorie content
                      of Meat hot dogs
                      mean calorie content
                      of Poultry hot dogs
                      mean calorie differ-
 pM - pIB      M      ence between    Beef         10
                      and Meat hotdogs
                      mean calorie differ-
  P - pB       oP     ence between    Beef        -30
                      and Poultry hotdogs
                      SD of calorie content
    o          o      within a single type of      30
                      hot dog


Table 3.1: Correspondence between Models


and


NM


o o     000    000             00


        0   Cpo      (MC) .      .


0


3.5


I       I5    5      5
4.0    4.5    5.0   5.5    6.0


weight


Figure 3.7: The PlantGrowth data


﻿


3.2. NORMAL LINEAR MODELS


216


   Figure    was produced with the following snippet.

   stripchart ( PlantGrowth$weight ~ PlantGrowth$group, pch=1,
                 xlab="weight" )


   First, we think of the 10 plants grown under each condition as a sample from a
much larger population of plants that could have been grown. Second, a look at the
data suggests that the weights in each group are clustered around a central value,
approximately symmetrically without an especially long tail in either direction. So
we model the weights as having Normal distributions.
   But we should allow for the possibility that the three populations have different
means. (We do not address the possibility of different SD's here.) Let p be the
population mean of plants grown under the Control condition, 61 and 62 be the
extra weight due to Treatment 1 and Treatment 2 respectively, and o be the SD.
We adopt the model

                         Wo,1, ... , Wo,1o ~ i.i.d. N(p, c)
                      Wi,1, . .. , WrT,1o ~ i.i.d. N(p + b1, or)         (3.5)
                      WT2,1, ... , WT2,1o - i.i.d. N(p + 62, o).

   There is a mathematical structure shared by  ,    and many other statistical
models, and some common statistical notation to describe it. We'll use the hot dog
data to illustrate.

Example 3.4 (Hot Dogs, continued)
Example     continues Example     First, there is the main variable of interest, often
called the response variable and denoted Y. For the hot dog data Y is calorie content.
(Another analysis could be made in which Y is sodium content.)
   The distribution of Y is different under different circumstances. In this example, Y
has a Normal distribution whose mean depends on the type of hot dog. In general, the
distribution of Y will depend on some quantity of interest, called a covariate, regressor,
or explanatory variable. Covariates are often called X.
   The data consists of multiple data points, or cases. We write Y and Xi for the i'th
case. It is usual to represent the data as a matrix with one row for each case. One
column is for Y; the other columns are for explanatory variables. For the hot dog data
the matrix is


Type         Calories


Sodium


﻿


3.2. NORMAL LINEAR MODELS


217


     Beef          186                495
     Beef          181                477

     Meat          140                428
     Meat          138                339
     Poultry       129                430
     Poultry       132                375


(For analysis of calories, the third column is irrelevant.)
    Rewriting the data matrix in a slightly different form reveals some mathematical struc-
ture common to many models. There are 54 cases in the hotdog study. Let (Y1, ... , Y54)
be their calorie contents. For each i from 1 to 54, define two new variables X1,2 and X2,2
by

                             S1   if the i'th hot dog is Meat,
                             0  otherwise

and

                             { 1  if the i'th hot dog is Poultry,
                             0  otherwise.

X1,2 and X2,2 are indicator variables. Two indicator variables suffice because, for the i'th
hot dog, if we know X1,2 and X2,2, then we know what type it is. (More generally, if
there are k populations, then k - 1 indicator variables suffice.) With these new variables,
Model 3.4 can be rewritten as

                           Y=+ oMX1,2 + oPX2,2 + E 2(3.6)

for i = 1,... ,54, where
                             61, ... , E54 ~ i.i.d. N(0, o-).
    Equation 3,6 is actually 54 separate equations, one for each case. We can write them
succinctly using vector and matrix notation. Let

                                     y~yy~t
                                Y   =(Y1, .. . ,Y4),
                                B     (PwoMi, P)f,
                                E   = (61, . . . ,e4*


﻿


3.2. NORMAL LINEAR MODELS


218


(The transpose is there because, by convention, vectors are column vectors.) and


X


1
1


1
1


1
1


1


0
0


0
1


1
0


0


0
0


0
0


0
1


1


X is a 54 x 3 matrix. The first 20 lines are for the Beef hot dogs; the next 17
the Meat hot dogs; and the final 17 are for the Poultry hot dogs. Equation 3,6
written


are for
can be


Y=XB+E


(3.7)


   Equations similar to
the PlantGrowth data (page


and     are common to many statistical models. For
2) let


Y = weight of i'th plant,

     X1,  if i'th plant received treatment 1
       0 otherwise

     X2,  if i'th plant received treatment 2
       0 otherwise
    y~yy~t
Y= (Y1,...,Y30)
B = (p, 61, 62)t
E = (1, . . . , 6so)


﻿


3.2. NORMAL LINEAR MODELS


219


and

                        /1 0 0\
                        1 0 0

                        1 0 0
                        1 1 0

                        1 1 0
                        1 0 1


Then analogously to 3.61 and 3.7 we can write

                        Y = p + oiX1,2 + 62X2,2 + eZ                 (3.8)

and

                              Y = XB + E.                            (3.9)

   Notice that Equation 3.6is nearly identical to Equation 3.8 and Equation 3,7
is identical to Equation 3.9. Their structure is common to many statistical models.
Each Y is written as the sum of two parts. The first part, XB, (p1+oX1,i+pX2,i for
the hot dogs; pu+ iX1,1 + 2X2,1 for PlantGrowth) is called systematic, deterministic,
or signal and represents the explainable differences between populations. The
second part, E, or ei, is random, or noise, and represents the differences between
hot dogs or plants within a single population. The es's are called errors. In statistics,
the word "error" does not indicate a mistake; it simply means the noise part of a
model, or the part left unexplained by covariates. Modelling a response variable as

                          response = signal + noise
is a useful way to think and will recur throughout this book.
   In 3.6; the signal yu + ogX 1,2 + SPX 2,i is a linear function of (,o, o b) In ;18
the signal yu + b1iXl,i + 52X2,1 is a linear function of (,, ,62) . Models in which the
signal is a linear function of the parameters are called linear models.
   In our examples so far, X has been an indicator. For each of a finite number
of X's the~re hias bee~n acorrespndring popuilatinof Y's. As the next eavmple
      ilutats ina odl anas     riewenXi1acninosvaibe


﻿


3.2. NORMAL LINEAR MODELS


220


Example 3.5 (Ice Cream Consumption)
This example comes from DASL, which says

         "Ice cream consumption was measured over 30 four-week periods from
      March 18, 1951 to July 11, 1953. The purpose of the study was to determine
      if ice cream consumption depends on the variables price, income, or temper-
      ature. The variables Lag-temp and Year have been added to the original
      data."

You can download the data from

          http://lib.stat.cmu.edu/DASL/Datafiles/IceCream.html.

The first few lines look like this:

  date    IC       price   income   temp   Lag-temp    Year
     1    .386     .270     78       41        56       0
     2    .374     .282     79       56        63       0
     3    .393     .277     81       63        68       0


The variables are

date Time period (1-30) of the study (from 3/18/51 to 7/11/53)

IC Ice cream consumption in pints per capita

Price Price of ice cream per pint in dollars

Income Weekly family income in dollars

Temp Mean temperature in degrees F

Lag-temp Temp variable lagged by one time period

Year Year within the study (0 = 1951, 1 = 1952, 2 = 1953)

Figure 3.8 is a plot of consumption versus temperature. It looks as though an equation
of the form


consumption =/#0 + #1temperature + error


(3.10)


﻿


3.2. NORMAL LINEAR MODELS


221


would describe the data reasonably well. This is a linear model, not because consumption
is a linear function of temperature, but because it is a linear function of (/0,31). To
write it in matrix form, let

                              Y =  (IC1,...,1C3o)
                              B = (#33#)t
                              E= (61,...,630)

and

                                     1 temps
                                     ltemp2
                              X=

                                     1 temp3o

The model is
                                 Y=XB+E.                                 (3.11)
Equation 3.11 is a linear model, identical to Equations.7 and 3,9.

   Equation 3.7 (equivalently, 3.9 or 3'11) is the basic form of all linear models.
Linear models are extremely useful because they can be applied to so many kinds
of data sets. Section 3.2 2 investigates some of their theoretical properties and R's
functions for fitting them to data.

3.2.2 Inference for Linear Models
Section 32A showed some graphical displays of data that were eventually de-
scribed by linear models. Section32     treats more formal inference for linear
models. We begin by deriving the likelihood function.
   Linear models are described by Equation'3.7 (equivalently,3.9 or 11) which
we repeat here for convenience:

                                 Y = XB + E.                             (3.12)

In general there is an arbitrary number of cases, say n, and an arbitrary number
of covariates, say p. Equation 3,12 is shorthand for the collection of univariate
equations


Y = 00 + OlX li + ... + OPXP)i + 6i


(3.13)


﻿


3.2. NORMAL LINEAR MODELS


222


0
0~
E

0,
0


0


0


IC)


0

C)
0


0
0


                         0


                         0


                         0
                     0
                        0
              0
                        0
                     0
        O0           00
        0o0
   0

                   0 0


0      0


0
0 O        O


       'O


30


40


50


60


70


temperature


Figure 3.8: Ice cream consumption (pints per capita) versus mean temperature
(°F)


﻿


3.2. NORMAL LINEAR MODELS


223


or equivalently,

                                 Y ~N(pi, a)

for i = 1, ... , n where si = 30 + E j3;Xj, and the c's are i.i.d. N(0, a). There are
p + 2 parameters: (,0... , p, a). The likelihood function is

                                n

                                i=1
                                     1   e__(y2 2)2

                                i=1 2wu
                                     n(iv y-(3O+E /X3,i)2
                                =    2   eu
                                i1   2r


        n     Y(yi-(/30+ 3jX,))2
(27ra2)


(3.14)


   Likelihood 3.14 is a function of the p + 2 parameters. To find the m.l.e.'s we
could differentiate 3.14 with respect to each parameter in turn, set the derivatives
equal to 0, and solve. But it is easier to take the log of 3.14 first, then differentiate
and solve.


log f(Oo,..., /3p, a) = C - n log a - 2 2  (Yi


(00+ oji,5J)
   (s~o)Z


for some irrelevant constant C, so we get the system of equations


    1
    2Z (-
    2   (yi


  1
  2     (yi

n    1
+     3    (y,


yi - (0o + EXi,j))


(0o + Z/Xi,))Xi,1


(0o + Z jxij))Xip

                   2
 i - o3 + 1: S X2, ) )


0


0


(3.15)


0


0


i


﻿


3.2. NORMAL LINEAR MODELS


224


Note the hat notation to indicate estimates. The m.l.e.'s (00,...13,A,) are the
values of the parameters that make the derivatives equal to 0 and therefore satisfy
Equations 3,15. The first p + 1 of these equations can be multiplied by o2, yielding
p + 1 linear equations in the p + 1 unknown #3's. Because they're linear, they can be
solved by linear algebra. The solution is

                              $ = (XtX)-1XY,

using the notation of Equation3.12.
   For each i E {1,... , n}, let

                          Yi =0i+x121 + -+xip.

The Qi's are called fitted values. The residuals are

                        fri = - i
                          = Yi/-30 + X1231 + -- + X pip

and are estimates of the errors Ei. Finally, referring to the last line of Equation 31,
the m.l.e. 8 is found from

                    0 = - +          (Yi - (30 +   #jXi,5))2


and

                                 & = (3.16)
                           n         2


   In addition to the m.l.e.'s we often want to look at the likelihood function to
judge, for example, how accurately each /3 can be estimated. The likelihood func-
tion for a single /3i comes from the Central Limit Theorem. We will not work out
the math here but, fortunately, Rt will do all the calculations for us. We illustrate
with the hot dog data.


﻿


3.2. NORMAL LINEAR MODELS


225


Example 3.6 (Hot Dogs, continued)
Estimating the parameters of a model is called fitting a model to data. R has built-in
commands for fitting models. The following snippet fits Model 3,7 to the hot dog data.
The syntax is similar for many model fitting commands in R, so it is worth spending some
time to understand it.

     hotdogs.fit <- im ( hotdogs$Calories       hotdogs$Type )


   " lm stands for linear model.

   " ~ stands for "is a function of". It is used in many of R's modelling commands. y
      x is called a formula and means that y is modelled as a function of x. In the case
      at hand, Calories is modelled as a function of Type.

   " lm specifies the type of model.


" R automatically creates the X matrix in Equation
  ters.


and estimates the parame-


   " The result of fitting the model is stored in a new object called hotdogs.fit. Of
      course we could have called it anything we like.

   " lm can have an argument data, which specifies a dataframe. So instead of

             hotdogs.fit <- lm ( hotdogs$Calories       hotdogs$Type )


      we could have written

             hotdogs.fit <- lm ( Calories        Type, data=hotdogs )


      You may want to try this to see how it works.

To see hotdogs.fit, use R's summary function. It's use and the resulting output are
shown in the following snippet.


﻿


3.2. NORMAL LINEAR MODELS


226


  > summary(hotdogs. fit)

  Call:
  lm(formula = hotdogs$Calories      hotdogs$Type)

  Residuals:
       Min       1Q  Median       3Q      Max
  -51.706 -18.492    -5.278   22.500  36.294

  Coefficients:
                        Estimate Std. Error t value Pr(>ItI)
   (Intercept)           156.850       5.246   29.901   < 2e-16 ***
   hotdogs$TypeMeat        1.856       7.739    0.240     0.811
   hotdogs$TypePoultry   -38.085       7.739   -4.921   9.4e-06 ***

   Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

   Residual standard error: 23.46 on 51 degrees of freedom
   Multiple R-Squared: 0.3866,       Adjusted R-squared: 0.3626
   F-statistic: 16.07 on 2 and 51 DF, p-value: 3.862e-06

The most important part of the output is the table labelled Coefficients:. There is
one row of the table for each coefficient. Their names are on the left. In this table
the names are Intercept, hotdogs$TypeMeat, and hotdogs$TypePoultry. The first
column is labelled Estimate. Those are the m.l.e.'s. R has fit the model

                         Y =/30+31X1,1+0#2X2,1+ce
where X1 and X2 are indicator variables for the type of hot dog. The model implies
                    Y = 3o + Ei   for beef hotdogs
                    Y = ,3o + 31 + Ei  for meat hotdogs
                    Y = ,3o + /32 + Ei for poultry hotdogs
Therefore the names mean

   30 = Intercept            = mean calorie content of beef hot dogs
   /31 = hotdogs$TypeMeat    = mean difFerence between beef and meat hot dogs
   /2 = hotdogs$TypePoultry = mean difFerence between
                                  beef and poultry hot dogs


﻿


3.2. NORMAL LINEAR MODELS


227


From the Coefficients table the estimates are

                                 S0 = 156.850
                                 ,1= 1.856
                                 )2= -38.085

The next column of the table is labelled Std. Error. It contains the SD's of the
estimates. In this case, )o has an SD of about 5.2; 1 has an SD of about 7.7, and )2
also has an SD of about 7.7. The Central Limit Theorem says that approximately, in
large samples


                                 1 ~ N (#1,(70, )
                                 2 ~ N (#2, (702)

The SD's in the table are estimates of the SD's in the Central Limit Theorem.
   Figure    plots the likelihood functions. The interpretation is that # 0 is likely some-
where around 157, plus or minus about 10 or so; 1, is somewhere around 2, plus or
minus about 15 or so; and # 2 is somewhere around -38, plus or minus about 15 or so.
(Compare to Table  .) In particular, there is no strong evidence that Meat hot dogs
have, on average, more or fewer calories than Beef hot dogs; but there is quite strong
evidence that Poultry hot dogs have considerably fewer.

   Figure    was produced with the following snippet.

   m <- c ( 156.85, 1.856, -38.085 )
     s <- c ( 5.246, 7.739, 7.739 )

     par ( mfrow=c(2,2) )

     x <- seq ( m[1]-3*s[1], m[1]+3*s[1], length=40 )
     plot ( x, dnorm(x,m[1],s[1]), type="1",
            xlab=expression(mu), ylab="likelihood", yaxt="n" )

     x <- seq ( m[2]-3*s[2], m[2]+3*s[2], length=40 )
     plot ( x, dnorm(x,m[2],s[2]), type="1",
            xlab=expression(delta[M]),
            ylab="likelihood", yaxt="n" )


﻿


3.2. NORMAL LINEAR MODELS


228


0
0


0
0


140 150 160 170


-20    0 10

       8M


0
0


' I   - 1 1  1
-60   -40   -20


Figure 3.9: Likelihood functions for (,u, oM, 8P) in the Hot Dog example.


﻿


3.2. NORMAL LINEAR MODELS


229


     x <- seq ( m[3]-3*s[3], m[3]+3*s[3], length=40 )
     plot ( x, dnorm(x,m[3],s[3]), type="1",
             xlab=expression(delta[P]),
             ylab="likelihood", yaxt="n" )


   The summary also gives an estimate of o-. The estimate is labelled Residual
standard error. In this case, & ~ 23.46.  So our model says that for each type
of hot dog, the calorie contents have approximately a Normal distribution with SD about
23 or so. Compare to Figure   to see whether the 23.46 makes sense.

   Regression is sometimes used in an exploratory setting, when scientists want
to find out which variables are related to which other variables. Often there is a
response variable Y, (imagine, for example, performance in school) and they want
to know which other variables affect Y (imagine, for example, poverty, amount of
television watching, computer in the home, parental involvment, etc.) Example
illustrates the process.

Example 3.7 (mtcars)
This example uses linear regression to explore the R data set mtcars (See Figure ,
panel (c)) more thoroughly with the goal of modelling mpg (miles per gallon) as a
function of the other variables. As usual, type data(mtcars) to load the data into R
and help(mtcars) for an explanation. As R explains:
         "The data was extracted from the 1974 Motor Trend US magazine, and
      comprises fuel consumption and 10 aspects of automobile design and perfor-
      mance for 32 automobiles (1973-74 models)."
In an exploratory exercise such as this, it often helps to begin by looking at the data.
Accordingly, Figure  is a pairs plot of the data, using just the continuous variables.

    Figure    was produced by

    pairs ( mtcars[,c(1,3:7)] )


    1R, like most statistical software, does not report the m.l.e. but reports instead & - (Z r/(n -
p - 1))1/2. Compare to Equation for the m.l.e. in which the denominator is n. The situation
is similar to the sample SD on page . When n p there is little difference between the two
estimates.


﻿


3.2. NORMAL LINEAR MODELS


230


                    100 400      3.0 4.5       16 22


               0 °
                   °  displ I  D C 0
              0             w II ___D          )     L


                     CO    Lii]    0     0 0 °10C\JQ


            C\J                       II          I0


                    r0       0                LZLJ w


              10 25        50 250       2 4


Figure 3.10: pairs plot of the mtcars data. Type help (mtcars) in Rt for an expla-
nation.


﻿


3.2. NORMAL LINEAR MODELS


231


    Clearly, mpg is related to several of the other variables. Weight is an obvious and
intuitive example. The figure suggests that the linear model

                                 mpg = #o +#13wt + e                            (3.17)

is a good start to modelling the data. Figure 3 11(a) is a plot of mpg vs. weight plus
the fitted line. The estimated coefficients turn out to be #0   37.3 and #1  -5.34.
The interpretation is that mpg decreases by about 5.34 for every 1000 pounds of weight.
Note: this does not mean that if you put a 1000 pound weight in your car your mileage
mpg will decrease by 5.34. It means that if car A weighs about 1000 pounds less than
car B, then we expect car A to get an extra 5.34 miles per gallon. But there are likely
many difFerences between A and B besides weight. The 5.34 accounts for all of those
difFerences, on average.
    We could just as easily have begun by fitting mpg as a function of horsepower with
the model
                                 mpg = 7o+')71hp +                              (3.18)
We use 'y's to distinguish the coefficients in Equation 31 from those in Equation 3,17.
The m.l.e.'s turn out to be to  30.1 and 11    -0.069. Figure ,11(b) shows the
corresponding scatterplot and fitted line. Which model do we prefer? Choosing among
different possible models is a major area of statistical practice with a large literature that
can be highly technical. In this book we show just a few considerations.
    One way to judge models is through residual plots, which are plots of residuals versus
either X variables or fitted values. If models are adequate, then residual plots should
show no obvious patterns. Patterns in residual plots are clues to model inadequacy and
how to improve models. Figure 3.1(c) and (d) are residual plots for mpg.fitl (mpg
vs. wt) and mpg.fit2 (mpg vs. hp). There are no obvious patterns in panel (c). In
panel (d) there is a suggestion of curvature. For fitted values between about 15 and 23,
residuals tend to be low but for fitted values less than about 15 or greater than about
23, residuals tend to be high (The same pattern might have been noted in panel (b).)
suggesting that mpg might be better fit as a nonlinear function of hp. We do not pursue
that suggestion further at the moment, merely noting that there may be a minor flaw in
mpg. f it2 and we therefore slightly prefer mpg. f itl.
    Another thing to note from panels (c) and (d) is the overall size of the residuals.
In (c), they run from about -4 to about +6, while in (d) they run from about -6 to about
+6. That is, the residuals from mpg.fit2 tend to be slightly larger in absolute value
than the residuals from mpg.fit1, suggesting that wt predicts mpg slightly better than
does hp. That impression can be confirmed by getting the summary of both fits and
checking 8. From mpg.fit1 8    3.046 while from mpg.fit2 8   3.863. I.e., from wt


﻿


3.2. NORMAL LINEAR MODELS


232


we can predict mpg to within about 6 or so (two SD's) while from hp we can predict
mpg only to within about 7.7 or so. For this reason too, we slightly prefer mpg.fit1 to
mpg. fit2.
   What about the possibility of using both weight and horsepower to predict mpg?
Consider

    mpg.fit3 <- lm ( mpg ~ wt + hp, data=mtcars )


    " The formula y ~ x1 + x2 means fit y as a function of both x1 and x2. In our
    example that means
                            mpg - 6o + oiwt1 + 62h p2 + E             (3.19)


A residual plot from model  is shown in Figure   (e). The m.l.e. 's are oo  37.2,
1    -3.88, 62 ~ -0.03, and - 2.6. Since the residual plot looks curved, Model
has residuals about as small as Model , and Model  is more parsimonious than
Model     we slightly prefer Model


   Figure     (a) was produced with

   plot ( mtcars$wt, mtcars$mpg, xlab="weight", ylab="mpg" )
   mpg.fitl <- im ( mpg ~ wt, data=mtcars )
     abline ( coef(mpg.fitl) )


   Figure    , panels (c) and (d) were produced with

   # panel c
   plot ( fitted(mpg.fitl), resid(mpg.fitl), main="(c)",
       xlab="fitted values from fiti", ylab="resid" )
    # panel d
    plot ( fitted(mpg.fit2), resid(mpg.fit2),
       xlab="fitted values from fit2", ylab="resid",
       main="(d)" )


﻿


3.2. NORMAL LINEAR MODELS


233


(a)


(b)


    0
    CO)


E   o
    Q~


    0
    CO)


E   o
    Q~


2    3    4    5


50     150   250

       horsepower


weight


(c)


(d)


  (0

IZI
N,


a)
    N\


10   15   20   25

fitted values from fit2


fitted values from fit1


(e)


                        0              0
                        -o0         00

                               c0 0
                               00 0

                    10  15   2     5   3

                      fitted values from fit3


Figure 3.11: t c ar s- (a): mpg vs. wt; (b): mpg vs. hp; (c): residual plot from
mpg~ wt; (d): residual plot from mpg~ hp; (e): residual plot from mpg~ wt+hp


﻿


3.2. NORMAL LINEAR MODELS


234


   In Example    we fit three models for mpg, repeated here with their original
equation numbers.

                        mpg=# O+J1wt+E                              (    )
                        mpg = 'yo + '1hp +E                         (    )
                        mpg = o + olwtl + 62hp + E                  (    )


What is the connection between, say, #1, and 61, or between -71 and 62? #1 is the
average mpg difference between two cars whose weights differ by 1000 pounds.
Since heavier cars tend to be different than lighter cars in many ways, not just in
weight, #1, captures the net effect on mpg of all those differences. On the other
hand, 61 is the average mpg difference between two cars of identical horsepower
but whose weights differ by 1000 pounds. Figure   shows the likelihood func-
tions of these four parameters. The evidence suggests that #1 is probably in the
range of about -7 to about -4, while 61 is in the range of about -6 to -2. It's possible
that 1 ~ 61. On the other hand, -y1 is probably in the interval (-.1, -.04) while
62 is probably in the interval (-.05, 0). It's quite likely that -y1 # 62. Scientists
sometimes ask the question "What is the effect of variable X on variable Y?" That
question does not have an unambiguous answer; the answer depends on which
other variables are accounted for and which are not.

   Figure     was produced with

   par ( mfrow=c(2,2) )
   x <- seq ( -8, -1.5, len=60 )
   plot ( x, dnorm(x,-5.3445,.5591), type="1",
          xlab=expression(beta[1]), ylab="", yaxt="n" )
  x <- seq ( -.1, 0, len=60 )
  plot ( x, dnorm(x,-.06823,.01012), type="1",
          xlab=expression(gamma[1]), ylab="", yaxt="n" )
  x <- seq ( -8, -1.5, len=60 )
  plot ( x, dnorm(x,-3.87783, .63273), type="1",
          xlab=expression(delta[1]), ylab="", yaxt="n" )
  x <- seq ( -.1, 0, len=60 )
  plot ( x, dnorm(x,-.03177,.00903), type="1",
          xlab=expression(delta[21), ylab="", yaxt="n" )


﻿


3.2. NORMAL LINEAR MODELS


235


-8  -6   -4  -2


-0.10   -0.04 0.00


Y1


I  I I I I  I I
-8  -6   -4  -2


-0.10   -0.04 0.00


Figure 3.12: likelihood functions for /1, ,71, 61 and b2 in the mtcars example.


﻿


3.3. GENERALIZED LINEAR MODELS


236


3.3 Generalized Linear Models

3.3.1 Logistic Regression

Look again at panel (c) in Figure  on page     . The dependent variable is binary,
as opposed to the continuous dependent variables in panels (a), (b) and (d). In (a),
(b) and (d) we modelled Y   X as having a Normal distribution; regression was a
model for E[Y  X], the mean of that Normal distribution as a function of X. In (c)
Y  X has a Binomial distribution. We still use the term "regression" for a model of
E [Y X]. When Y is binary, regression is a model for the probability of success 0 as
a function of X.
   Figure      shows two more scatterplots where Y is a binary variable. The data
are described in the next two examples.

Example 3.8 (FACE, continued)
Refer to Examples      and      about the FACE experiment to assess the effects of
excess CO2 on the growth of the forest. To describe the size of trees, ecologists sometimes
use Diameter at Breast Height, or DBH. DBH was recorded every year for each loblolly
pine tree in the FACE experiment. One potential effect of elevated CO2 is for the trees to
reach sexual maturity and hence be able to reproduce earlier than otherwise. If they do
mature earlier, ecologists would like to know whether that's due only to their increased
size, or whether trees will reach maturity not just at younger ages, but also at smaller
sizes. Sexually mature trees can produce pine cones but immature trees cannot. So to
investigate sexual maturity, a graduate student counted the number of pine cones on
each tree. For each tree let X be its DBH and Y be either 1 or 0 according to whether
the tree has pine cones.
   Figure     (a) is a plot of Y versus X for all the trees in Ring 1. It does appear that
larger trees are more likely to have pine cones.

Example 3.9 (0-rings)
"On January 28, 1986 America was shocked by the destruction of the space shuttle
Challenger, and the death of its seven crew members." So begins the website http:
//www. f as. org/spp/51L. html of the Federation of American Scientist's Space Policy
Project.
   Up until 1986 the space shuttle orbiter was lifted into space by a pair of booster rock-
ets, one on each side of the shuttle, that were comprised of four sections stacked vertically
on top of each other. The joints between the sections were sealed by 0-rings. On January
28, 1986 the temperature at launch time was so cold that the 0-rings became brittle and
failed to seal the joints, allowing hot exhaust gas to come into contact with unburned


﻿


3.3. GENERALIZED LINEAR MODELS


237


fuel. The result was the Challenger disaster. An investigation ensued. The website
http: //science.ksc.nasa.gov/shuttle/missions/51-1/docs contains links to
   1. a description of the event,

   2. a report (Kerwin) on the initial attempt to determine the cause,

   3. a report (rogers-commission) of the presidential investigative commission that fi-
      nally did determine the cause, and

   4. a transcript of the operational recorder voice tape.
One of the issues was whether NASA could or should have forseen that cold weather
might diminish performance of the 0-rings.
   After launch the booster rockets detach from the orbiter and fall into the ocean where
they are recovered by NASA, taken apart and analyzed. As part of the analysis NASA
records whether any of the 0-rings were damaged by contact with hot exhaust gas. If
the probability of damage is greater in cold weather then, in principle, NASA might have
forseen the possibility of the accident which occurred during a launch much colder than
any previous launch.
   Figure     (b) plots Y = presence of damage against X   = temperature for the
launches prior to the Challenger accident. The figure does suggest that colder launches
are more likely to have damaged 0-rings. What is wanted is a model for probability of
damage as a function of temperature, and a prediction for probability of damage at 37°F,
the temperature of the Challenger launch.

   Fitting straight lines to Figure  doesn't make sense. In panel (a) what we
need is a curve such that
   1. E[Y X] = P[Y = 1 X] is close to 0 when X is smaller than about 10 or
      12 cm., and

   2. E[Y X] = P[Y = 1 X] is close to 1 when X is larger than about 25 or 30 cm.
In panel (b) we need a curve that goes in the opposite direction.
   The most commonly adopted model in such situations is

                      E[Y X] = P[Y = 1 X] =                               (3.20)
                                                1 + c~o+QiX
Figure      shows the same data as Figure      with some curves added according
to Equation     . The values of #0 and #1 are in Table


﻿


3.3. GENERALIZED LINEAR MODELS


238


(a)


U,)
0~
U,)

0
0
4i)


4i)
U,)
4i)
0~
4i)
0J)
E
-o


00
0


0


0
0


00
0


0


0
0


                  0   o Ho) UasD000


OD namoD  IoM ooansam  oo 0


   5       10      15      20      25


                DBH


                (b)


0    00     0        0     0


                00000 00 00 00 0
  o  oo     o        o     o


                    55     60    65    70    75    80


                               temperature


Figure 3.13: (a): pine cone presence/absence vs. dbh. (b): O-ring damage vs.
launch temperature


﻿


3.3. GENERALIZED LINEAR MODELS


239


(a)


0i
U,)
0i
0~
U,)

0
0
4i)
.5-


00
0


0


0
0


5       10


15       20       25


DBH


(b)


,)
ci)

ci)
0J


0o
0


0


0
0


55     60     65    70     75     80


                                  temperature


Figure 3.14: (a): pine cone presence/absence vs. dbh. (b): O-ring damage vs.
launch temperature, with some logistic regression curves


﻿


3.3. GENERALIZED LINEAR MODELS24


240


              /,0  IL
      solid   -8   .45
  (a) dashed  -7.5 .36
      dotted  -5   .45
      solid   20   -.3
  (b) dashed  15   -.23
      dotted  18   -.3

Table 3.2: , 's for Figure


Figure


was produced by the following snippet.


par ( mfrow=c (2,1) )
plot ( cones$dbh [ringll , mature [ringli , xlab="DBH",
       ylab="pine cones present", main="(a)" )
x <- seq ( 4, 25, length=40 )
bO <- c(-8, -7.5, -5)
bi <- c ( .45, .36, .45 )
for ( i in 1:3 )
  lines ( x, exp (bO [il + bl [i] *x) /(1 + exp (bO [il + bl [i] *x))
          lty=i )

plot ( orings$temp, orings$damage>0, xlab="temperature",
       ylab="damage present", main=" (b)" )
x <- seq ( 50, 82, length=40 )
bO <- c(20, 15, 18)
bi <- c ( -.3, -.23, -.3 )
for ( i in 1:3 )
  lines ( x, exp (bO [il + bl [i] *x) /(1 + exp (bO [il + bl [i] *x) )
          lty=i )


Model


is known as logistic regression. Let the i'th observation have covari-


ate x2 and probability of success 02 = E[Y  x2]. Define


2-log (1 -


﻿


3.3. GENERALIZED LINEAR MODELS


241


i is called the logit of Oi. The inverse transformation is

                                     e**
                                       1 + e~i

The logistic regression model is

                                 #b2 = 0 +#1xi.

This is called a generalized linear model or glm because it is a linear model for #, a
transformation of E(Y l x) rather than for E(Y l x) directly. The quantity ,/ + #31x
is called the linear predictor. If ,31 > 0, then as x - +oc, 6 - 1 and as x --oo,
0 - 0. If #1 < 0 the situation is reversed. ,30 is like an intercept; it controls how far
to the left or right the curve is. 31 is like a slope; it controls how quickly the curve
moves between its two asymptotes.
   Logistic regression and, indeed, all generalized linear models differ from linear
regression in two ways: the regression function is nonlinear and the distribution of
Y l x is not Normal. These differences imply that the methods we used to analyze
linear models are not correct for generalized linear models. We need to derive the
likelihood function and find new calculational algorithms.
   The likelihood function is derived from first principles.

        X  1 - -Y n   1 - - - y ny13o 11)   (oI i, 0 1


                                        1JO6 (1 - o2)lYi


                                        i:y=1  i:yi=0
                                            H  e,3o+1xi           1
                                            1 + e o+1xi .'--- 1 + eo+1xi
                                        i:y =1          z:yi=o

This is a rather complicated function of the two variables (,3o,#,31). However, a
Central Limit Theorem applies to give a likelihood function for ,30 and 31 that is
accurate when n is reasonable large. The theory is beyond the scope of this book,
but R will do the calculations for us. We illustrate with the pine cone data from
Example 3.8. Figure      shows the likelihood function.


﻿


3.3. GENERALIZED LINEAR MODELS


242


O


0


C)
0


0


C)


0

NC
CM


0


  I    I    I     I    I     I    I    I

-11  -10   -9    -8   -7    -6   -5   -4


fP0


Figure 3.15: Likelihood function for the pine cone data


﻿


3.3. GENERALIZED LINEAR MODELS


243


   Figure     was produced by the following snippet.

   mature <- cones$X2000[ringl] > 0
   b0 <- seq ( -11, -4, length=60 )
   b1 <- seq ( .15, .5, length=60 )
   lik <- matrix ( NA, 60, 60 )
   for ( i in 1:60 )
   for ( j in 1:60 ) {
   linpred <- b0[i] + b1[j]*cones$dbh[ringl]
   theta     <- exp(linpred) / (1+exp(linpred))
   lik[i,j] <- prod ( theta-mature * (1-theta)^(1-mature) )
   }
   lik <- lik/max(lik)
   contour ( b0, b1, lik, xlab=expression(beta[0]),
             ylab=expression(beta[1]) )


   " mature is an indicator variable for whether a tree has at least one pine cone.

   " The lines bO <- ... and b1 <- ... set some values of (O, 1) at which to
     evaluate the likelihood. They were chosen after looking at the output from
     fitting the logistic regression model.

   " lik <- ... creates a matrix to hold values of the likelihood function.

   " linpred is the linear predictor. Because cones$dbh [ringl] is a vector,
     linpred is also a vector. Therefore theta is also a vector, as is theta-mature
     * (1-theta) (1-mature). It will help your understanding of R to understand
     what these vectors are.


   One notable feature of Figure   is the diagonal slope of the contour ellipses.
The meaning is that we do not have independent information about #0 and # 1. For
example if we thought, for some reason, that #0 r -9, then we could be fairly
confident that #1 is in the neighborhood of about .4 to about .45. But if we thought
0 r -6, then we would believe that #1 is in the neighborhood of about .25 to
about .3. More generally, if we knew po, then we could estimate #1 to within
a range of about .05. But since we don't know po, we can only say that #1 is
likely to be somewhere between about .2 and .6. The dependent information for
(,# 1) means that our marginal information for #1 is much less precise than our


﻿


3.3. GENERALIZED LINEAR MODELS


244


conditional information for 31 given 30. That imprecise marginal information is
reflected in the output from R, shown in the following snippet which fits the model
and summarizes the result.

  cones <- read.table ( "data/pinecones.dat", header=T )
  ringi <- cones$ring == 1
  mature <- cones$X2000[ringll > 0

  fit <- glm ( mature   cones$dbh[ringl], family=binomial )
  summary ( fit )

    Coefficients:
                     Estimate Std. Error z value Pr(>IzI)
  (Intercept)        -7.46684     1.76004   -4.242 2.21e-05 ***
  cones$dbh[ringl]  0.36151    0.09331   3.874 0.000107 ***


  " cones ... reads in the data. There is one line for each tree. The first few
     lines look like this.

          ring     ID xcoor ycoor spec  dbh X1998 X1999 X2000
          1    1 11003   0.71   0.53 pita 19.4       0     0      0
          2    1 11004   1.26   2.36 pita 14.1       0     0      4
          3    1 11011   1.44   6.16 pita 19.4       0     6      0


     ID is a unique identifying number for each tree; xcoor and ycoor are coordi-
     nates in the plane; spec is the species; pita stands for pinus taeda or loblolly
     pine, X1998, X1999 and X2000 are the numbers of pine cones each year.

   " ringi ... is an indicator variable for trees in Ring 1.

   " mature ... indicates whether the tree had any cones at all in 2000. It is not
     a precise indicator of maturity.

   " fit ... fits the logistic regression. glm fits a generalized linear model. The
     argument f amily=binomial tells R what kind of data we have. In this case
     it's binomial because y is either a success or failure.


﻿


3.3. GENERALIZED LINEAR MODELS


245


   " summary (f it) shows that (/°, #31) ~(-7.5, 0.36). The SD's are about 1.8 and
      .1. These values guided the choice of bO and bi in creating Figure 3,15. It's
      the SD of about .1 that says we can estimate 31 to within an interval of about
      .4, or about +2SD's.

3.3.2 Poisson Regression
Section;3.. dealt with the case where the response variable Y was Bernoulli.
Another common situation is where the response Y is a count. In that case it is
natural to adopt, at least provisionally, a model in which Y has a Poisson distribu-
tion: Y    Poi(A). When there are covariates X, then A may depend on X. It is
common to adopt the regression

                                log A = ,0 +#1-x                         (3.21)

Model 3.2 is another example of a generalized linear model. Example 310 illus-
trates its use.

Example 3.10 (Seedlings, continued)
Several earlier examples have discussed data from the Coweeta LTER on the emergence
and survival of red maple (acer rubrum) seedlings. Example 3,2 showed that the arrival
rate of seedlings seemed to vary by quadrat. Refer especially to Figure 3,4. Example 3,10
follows up that observation more quantitatively.
   Roughly speaking, New seedlings arise in a two-step process. First, a seed falls out
of the sky, then it germinates and emerges from the ground. We may reasonably assume
that the emergence of one seedling does not affect the emergence of another (They're
too small to interfere with each other.) and hence that the number of New seedlings
has a Poi(A) distribution. Let Y be the number of New seedlings observed in quadrat i
and year j. Here are two fits in R, one in which A varies by quadrat and one in which it
doesn't.

     new <- data.frame ( count=count,
                           quadrat=as.factor (quadrat),
                           year=as. factor (year)
                         )
     fitO <- gim ( count ~ 1, f amily=poisson, data=new )
     f iti <- gim ( c ount ~quadr at , f amily=pois son, dat a=new )


   * The command data.frame creates a dataframe. Rt describes dataframes as


﻿


3.3. GENERALIZED LINEAR MODELS


246


              "tightly coupled collections of variables which share many of the prop-
           erties of matrices and of lists, used as the fundamental data structure
           by most of R's modeling software."

      We created a dataframe called new, having three columns called count, quadrat,
      and year. Each row of new contains a count (of New seedlings), a quadrat number
      and a year. There are as many rows as there are observations.

    " The command as.factor turns its argument into a factor. That is, instead
      of treating quadrat and year as numerical variables, we treat them as indicator
      variables. That's because we don't want a quadrat variable running from 1 to
      60 implying that the 60th quadrat has 60 times as much of something as the 1st
      quadrat. We want the quadrat numbers to act as labels, not as numbers.

    " gim stands for generalized linear model. The family=poisson argument says
      what kind of data we're modelling. data=new says the data are to be found in a
      dataframe called new.

    " The formula count     1 says to fit a model with only an intercept, no covariates.

    " The formula count     quadrat says to fit a model in which quadrat is a covariate.
      Of course that's really 59 new covariates, indicator variables for 59 of the 60
      quadrats.

   To examine the two fits and see which we prefer, we plotted actual versus fitted
values and residuals versus fitted values in Figure 3:16. Panels (a) and (b) are from
fitO. Because there may be overplotting, we jittered the points and replotted them
in panels (c) and (d). Panels (e) and (f) are jittered values from fit1. Comparison
of panels (c) to (e) and (d) to (f) shows that fiti predicts more accurately and has
smaller residuals than fitO. That's consistent with our reading of Figure 34. So we
prefer fit1.
    Figure 3.17 continues the story. Panel (a) shows residuals from fit1 plotted against
year. There is a clear difference between years. Years 1, 3, and 5 are high while years 2
and 4 are low. So perhaps we should use year as a predictor. That's done by

     fit2 <- gim ( count ~ quadrat+year, family=poisson,
                      data=new )


Panels (b) and (c) show diagnostic plots for fit2. Compare to similar panels in Fig-
ure 3;1 to see whether using year makes an appreciable difference to the fit.


﻿


3.3. GENERALIZED LINEAR MODELS


247


(a)


a)


0a


10


0


C,,


U)


co


Co
ccJ


0.6  0.8   1.0  1.2

    fitted values


        (c)


           (b)


           0


           0
           0

           0

   0.6   0.8  1.0   1.2

        fitted values


           (d)


               0

      0   00    0
  0           0

        000     0    0
     00c0     0  0
     I I I I I II


0.865 0.8750.885 0.895

        fitted values


a)


Ca


10


0


C,)


U)


co

cc


C\J-


0.865 0.875 0.885 0.895

        fitted values


                (e)


    1O  10
f )

  O~             00 0 0=
    0~ 0              0

0z


       0   1   2  3   4  5

             fitted values


(f)


C)


L')


0   1  2   3   4  5

      fitted values


Figure 3.16: Actual vs. fitted and residuals vs. fitted for the New seedling data. (a)

and (b): fitO. (c) and (d): jittered values from fitO. (e) and (f): jittered values

from f it1.


﻿


3.3. GENERALIZED LINEAR MODELS


248


Figure    was created with the following snippet.

par ( mfrow=c(3,2) )
plot ( fitted(fit0), new$count, xlab="fitted values",
        ylab="actual values", main="(a)" )
 abline ( 0, 1 )
 plot ( fitted(fit0), residuals(fit0), xlab="fitted values",
        ylab="residuals", main="(b)" )


plot ( jitter(fitted(fit0)),
        xlab="fitted values",
        main="(c)" )
 abline ( 0, 1 )
 plot ( jitter(fitted(fit0)),
        xlab="fitted values",
 plot ( jitter(fitted(fit1)),
        xlab="fitted values",
        main="(e)" )
 abline ( 0, 1 )
 plot ( jitter(fitted(fit1)),
        xlab="fitted values",


The following snippet shows how Figu


jitter(new$count),
ylab="actual values",


jitter(residuals(fit0)),
ylab="residuals", main="(d)" )
jitter(new$count),
ylab="actual values",


jitter(residuals(fitl)),
ylab="residuals", main="(f)" )


ire    was made in R.


par ( mfrow=c(2,2) )
plot ( new$year, residuals(fitl),
       xlab="year", ylab="residuals", main="(a)" )
plot ( jitter(fitted(fit2)), jitter(new$count),
       xlab="fitted values", ylab="actual values",
       main="(b)" )
abline ( 0, 1 )
plot ( jitter(fitted(fit2)), jitter(residuals(fit2)),
       xlab="fitted values", ylab="residuals", main="(c)" )


﻿


3.3. GENERALIZED LINEAR MODELS


                         (a)


                         0
                         -r
               C'7         I

               CV - 0
                              I         UI)


                 TI           I
                       I      I>


               |       -I -   I


                         year


                         (c)


249


(b)


10


0


0 2 4 6 8


   fitted values


C)


C\1


(I)

(I
a)


0


C')


                   0 2 4 6 8


                      fitted values


Figure 3.17: New seedling data. (a): residuals from fiti vs. year. (b): actual vs.

fitted from fit2. (c): residuals vs. fitted from fit2.


﻿


3.4. PREDICTIONS FROM REGRESSION


250


3.4 Predictions from Regression

From a regression equation, if we have estimates of the ,'s we can

   1. plug in the values of the x's we have to get fitted values, and

   2. plug in the values of x for new or future cases to get predicted values.

We illustrate with the mtcars data.

Example 3.11 (mtcars, continued)
Example     concluded with a comparison of three models for mpg. Here we continue
that comparison by seeing whether the models make substantially different predictions
for any of the cars in the data set. For each car we know its weight and horsepower and
we have estimates of all the parameters in Equations  ,  , and    , so we can
compute its fitted values from all three models. In symbols,

                             fi = )o + I1wt2                        (from     )
                             fi=   o +1jhp                          (from     )
                             fi = 6o + 61wt2 + b2h p                (from     )
We plot the fitted values against each other to see whether there are any noticable
differences. Figure  displays the result. Figure  shows that the mpg. fitl and
mpg. fit3 produce fitted values substantially similar to each other and agreeing fairly well
with actual values, while mpg.fit2 produces fitted values that differ somewhat from the
others and from the actual values, at least for a few cars. This is another reason to prefer
mpg.fit1 and mpg.fit3 to mpg.fit2. In Example    this lack of fit showed up as a
higher & for mpg.fit2 than for mpg.fit1.

   Figure     was made with the following snippet.

     fitted.mpg <- cbind ( fitted(mpg.fitl), fitted(mpg.fit2),
                              fitted(mpg.fit3), mtcars$mpg )
     pairs ( fitted.mpg, labels = c ( "fitted from wt",
             "fitted from hp", "fitted from both", "actual mpg" ) )


   * fitted(xyz) extracts fitted values. xyz can be any model previously fitted by lm,
      glm, or other R functions to fit models.


﻿


3.4. PREDICTIONS FROM REGRESSION


251


0-


"T-


fitted from wt


   O 00


   cJ0


 0
     0o


   0

   0

      0

   0


     00


       0


   0~


   1 20 3


10   20


   0    o

   0 O§On
   0


   9


fitted from hp


  00O
  0


  0Q
       0©


       0
       0
       0
       0


  08


      8


  0?


  0 6b


  0

0E


  00
  0D

    0

  0


     0


  10 20 3


10  20  30


       0
       0 0


  c0o


  0


    00


    o   0
    ~0


  0
  0~


  0

      0

      00

        0
      00

        0


  00,

  0
  0

  0


  actual mpg


0
CO)


10


10

T


10
M)


10

N


0
CO)


0


0

T


Figure 3.18: Actual mpg and fitted values from three models


﻿


3.4. PREDICTIONS FROM REGRESSION


252


   In Example 3.5 we posited model 3,10

                              Y = #0 + /31 x + e(3.22)
where x was mean temperature during the week and y was ice cream consumption
during the week. Now we want to fit the model to the data and use the fit to predict
consumption. In addition, we want to say how accurate the predictions are. Let xf
be the predicted mean temperature for some future week and yf be consumption.
xf is known; yf is not. Our model says

                               Yf ~N(pf, a)
where

                              pf =0 + O1xf
pif is unknown because 30 and 31 are unknown. But (/, #1) can be estimated from
the data, so we can form an estimate

                              Pf= =o + $1xf
How accurate is ftf as an estimate of pf? The answer depends on how accurate
(,/o, ,31) are as estimates of (/#,#,1). Advanced theory about Normal distributions,
beyond the scope of this book, tells us
                                f ~N(pf,  it)
for some ait which may depend on xf; we have omitted the dependency from the
notation.
   pif is the average ice cream consumption in all weeks whose mean temperature
is xf. So ^f is also an estimator of y . But in any particular week the actual
consumption won't exactly equal p f. Our model says

                                Yf = pf + E
where E   N(0, a). So in any given week yf will differ from Pf by an amount up to
about +2a or so.
   Thus the uncertainty cfit in estimating yf has two components: (1) the uncer-
tainty of p f which comes because we don't know (/o, #1) and (2) the variability a
due to E. We can't say in advance which component will dominate. Sometimes it
will be the first, sometimes the second. What we can say is that as we collect more
and more data, we learn (/, #1) more accurately, so the first component becomes
negligible and the second component dominates. When that happens, we won't go
far wrong by simply ignoring the first component.


﻿


3.5. EXERCISES


253


3.5 Exercises

  1. (a) Use the attenu , airquality and faithful datasets to reproduce Fig-
          ures:3. (a), (b) and (d).
      (b) Add lowess and supsmu fits.
      (c) Figure out how to use the tuning parameters and try out several different
         values. (Use the help or help. start functions.)

  2. With the mtcars dataset, use a scatterplot smoother to plot the relationship
     between weight and displacement. Does it matter which we think of as X
     and which as Y? Is one way more natural than the other?

  3. Download the 1970 draft data from DASL and reproduce Figure 3. Use the
     tuning parameters (f for lowess; span for supsmu) to draw smoother and
     wigglier scatterplot smoothers.

  4. How could you test whether the draft numbers in Example  were gener-
     ated uniformly? What would Ho be? What would be a good test statistic w?
     How would estimate the distribution of w under Ho?

  5. Using the information in Example 3. estimate the mean calorie content of
     meat and poultry hot dogs.

  6. Refer to Examples 22, ., and 3 6

      (a) Formulate statistical hypotheses for testing whether the mean calorie
          content of Poultry hot dogs is equal to the mean calorie content of Beef
          hot dogs.
      (b) What statistic will you use?
      (c) What should that statistic be if Ho is true?
      (d) How many SD's is it off?
      (e) What do you conclude?
      (f) What about Meat hot dogs?

  7. Refer to Examples 22, 3.4, and 3.6. Figure   shows plenty of overlap in
     the calorie contents of Beef and Poultry hot dogs. I.e., there are many Poultry
     hot dogs with more calories than many Beef hot dogs. But Figure 3.9 shows
     very little support for values of op near 0. Can that be right? Explain?


﻿


3.5. EXERCISES


254


8. Examples     ,    , and    analyze the calorie content of Beef, Meat, and
    Poultry hot dogs. Create a similar analysis, but for sodium content. Your
    analysis should cover at least the following steps.

    (a) A stripchart similar to Figure  and density estimates similar to Fig-
        ure
     (b) A model similar to Model , including definitions of the parameters.
     (c) Indicator variables analogous to those in Equation
     (d) A model similar to Model , including definitions of all the terms.
     (e) A fit in R, similar to that in Example
     (f) Parameter estimates and SD's.
     (g) Plots of likelihood functions, analagous to Figure
     (h) Interpretation.

 9. Analyze the PlantGrowth data from page  . State your conclusion about
    whether the treatments are effective. Support you conclusion with analysis.

10. Analyze the Ice Cream data from Example     . Write a model similar to
    Model , including definitions of all the terms. Use R to fit the model.
    Estimate the coefficients and say how accurate your estimates are. If temper-
    ature increases by about 5 OF, about how much would you expect ice cream
    consumption to increase? Make a plot similar to Figure , but add on the
    line implied by Equation    and your estimates of #0 and # 1.

11. Verify the claim that for Equation  to r 30, 1 ~r -.07 and &  3.9.

12. Does a football filled with helium travel further than one filled with air?
    DASL has a data set that attempts to answer the question. Go to DASL, http:
    //lib. stat. cmu. edu/DASL, download the data set Helium football and read
    the story. Use what you know about linear models to analyze the data and
    reach a conclusion. You must decide whether to include data from the first
    several kicks and from kicks that appear to be flubbed. Does your decision
    affect your conclusion?

13. Use the PlantGrowth data from R. Refer to page   and Equation

     (a) Estimate pc, pTi, pT2 and o-.
     (b) Test the hypothesis pTi = [c.


﻿


3.5. EXERCISES


255


     (c) Test the hypothesis pUT1 =pT2.

14. Jack and Jill, two Duke sophomores, have to choose their majors. They both
    love poetry so they might choose to be English majors. Then their futures
    would be full of black clothes, black coffee, low paying jobs, and occasional
    volumes of poetry published by independent, non-commercial presses. On
    the other hand, they both see the value of money, so they could choose to be
    Economics majors. Then their futures would be full of power suits, double
    cappucinos, investment banking and, at least for Jack, membership in the
    Augusta National golf club. But which would make them more happy?
    To investigate, they conduct a survey. Not wanting to embarass their friends
    and themselves, Jack and Jill go up Chapel Hill to interview poets and invest-
    ment bankers. In all of Chapel Hill there are 90 poets but only 10 investment
    bankers. J&J interview them all. From the interviews J&J compute the Hap-
    piness Quotient or HQ of each subject. The HQ's are in Figure 3.19. J&J also
    record two indicator variables for each person: Pi = 1 or 0 (for poets and
    bankers); BZ = 1 or 0 (for bankers and poets).
    Jill and Jack each write a statistical model:
            Jill: HQZ =ao + a1iB+ eZ  Jack: HQZ = #1P+ §2BZ +ceZ

     (a) Say in words what are ao, ai, #1 and #2.
     (b) Express #1l and 32 in terms of ao and a1.
     (c) In their data set J&J find HQ = 43 among poets, HQ = 44 among
        bankers and 2 = 1. (Subjects report disappointment with their favorite
        basketball team as the primary reason for low HQ.) Find sensible nu-
        merical estimates of ao, ai, #1 and #2.

15. Is poverty related to academic performance in school? The file

    schools-poverty

    at this text's website contains relevant data from the Durham, NC school
    system in 2001. The first few lines are

            pf1 eog type
        1    66  65     e
        2    32  73     m
        3    65  65     e


﻿


3.5. EXERCISES


256


0 o     o8 cb 68o


  -                    0 p oo

  39   40  41  42   43  44   45

                HQ


Figure 3.19: Happiness Quotient of bankers and poets


﻿


3.5. EXERCISES


257


    Each school in the Durham public school system is represented by one line
    in the file. The variable pf1 stands for percent free lunch. It records the
    percentage of the school's student population that qualifies for a free lunch
    program. It is an indicator of poverty. The variable eog stands for end of
    grade. It is the school's average score on end of grade tests and is an indicator
    of academic success. Finally, type indicates the type of school - e, m, or h
    for elementary, middle or high school, respectively. You are to investigate
    whether pf 1 is predictive of eog.

    (a) Read the data into R and plot it in a sensible way. Use different plot
         symbols for the three types of schools.
     (b) Does there appear to be a relationship between pf1 and eog? Is the
         relationship the same for the three types of schools? Decide whether the
         rest of your analysis should include all types of schools, or only one or
         two.
     (c) Using the types of schools you think best, remake the plot and add a
         regression line. Say in words what the regression line means.
     (d) During the 2000-2001 school year Duke University, in Durham, NC,
         sponsored a tutoring program in one of the elementary schools. Many
         Duke students served as tutors. From looking at the plot, and assuming
         the program was successful, can you figure out which school it was?

16. Load mtcars into an R session. Use R to find the m.l.e.'s (/o, /31). Confirm
    that they agree with the line drawn in Figure 3.(a). Starting from Equa-
    tion     , derive the m.l.e.'s for #0 and 01.

17. Get more current data similar to mtcars. Carry out a regression analysis
    similar to Example 3. Have relationships among the variables changed over
    time? What are now the most important predictors of mpg?

18. Repeat the logistic regression of am on wt, but use hp instead of wt.

19. A researcher randomly selects cities in the US. For each city she records the
    number of bars y2 and the number of churches z2. In the regression equation
    z2 =3o + 31y do you expect #1 to be positive, negative, or around 0?

20. Jevons' coins?


21. (a) Jane writes the following R code:


﻿


3.5. EXERCISES


258


                x <- runif ( 60, -1, 1 )


         Describe x. Is it a number, a vector, or a matrix? What is in it?
     (b) Now she writes

                y <- x + rnorm ( 60 )
                myfit <- im ( y     x )


         Make an intelligent guess of what she found for 0o and 01.
     (c) Using advanced statistical theory she calculates

                                     SD(3o) =_.13
                                     SD(31) =_.22

         Finally she writes

                inO <- 0
                inl <- 0
                for ( i in 1:100 ) {
                   x <- runif ( 60, -1, 1 )
                   y <- x + rnorm ( 60 )
                   fit <- im ( y ~ x )
                   if ( abs(fit$coef [1]) <= .26 ) inO <- inO + 1
                   if ( abs(fit$coef[2]-1) <= .44 ) inl <- inl + 1
                }


         Make an intelligent guess of inO and in1 after Jane ran this code.

22. The Army is testing a new mortar. They fire a shell up at an angle of 600 and
    track its progress with a laser. Let ti, t2, ..., t100 be equally spaced times from
    ti = (time of firing) to t100= (time when it lands). Let y1,..., Yoo be the
    shell's heights and z1, ..., zioo be the shell's distance from the howitzer (mea-
    sured horizontally along the ground) at times ti, t2, ..., t100. The yr's and zi's
    are measured by the laser. The measurements are not perfect; there is some
    measurement error. In answering the following questions you may assume
    that the shell's horizontal speed remains constant until it falls to ground.


﻿


3.5. EXERCISES


259


     (a) True or False: The equation

                                  yi =0o + #1t2 + E2

         should fit the data well.
     (b) True or False: The equation

                               Yi =/3O+/#1tj+#2t2+E e(3.23)

         should fit the data well.
     (c) True or False: The equation

                                  z2 =3o+13t+E2                       (3.24)

         should fit the data well.
     (d) True or False: The equation

                               z2 =3o+31tj+/32t +Ej                   (3.25)

         should fit the data well.
     (e) True or False: The equation

                                  Yi =13O+/31zj+E                     (3.26)

         should fit the data well.
     (f) True or False: The equation

                               Y2 = 3o+/31zi +#/32z2+E e(3.27)

         should fit the data well.
     (g) Approximately what value did the Army find for 13o in Part (b)?
     (h) Approximately what value did the Army find for 02 in Part (d)?

23. Some nonstatisticians (not readers of this book, we hope) do statistical anal-
    yses based almost solely on numerical calculations and don't use plots. R
    comes with the data set anscombe which demonstrates the value of plots.
    Type data(anscombe) to load the data into your R session. It is an 11 by 8
    dataframe. The variable names are x1, x2, x3, x4, y1, y2, y3, and y4.


﻿


3.5. EXERCISES


260


     (a) Start with x1 and yl. Use im to model yl as a function of x1. Print a
         summary of the regression so you can see ,30, 31, and a.
     (b) Do the same for the other pairs: x2 and y2, x3 and y3, x4 and y4.
     (c) What do you conclude so far?
     (d) Plot yl versus xl. Repeat for each pair. You may want to put all four
        plots on the same page. (It's not necessary, but you should know how to
        draw the regression line on each plot. Do you?)
     (e) What do you conclude?
     (f) Are any of these pairs well described by linear regression? How would
        you describe the others? If the others were not artificially constructed
        data, but were real, how would you analyze them?

24. Here's some R code:

        x <- rnorm ( 1, 2, 3 )
        y <- -2*x + 1 + rnorm ( 1, 0, 1 )


     (a) What is the marginal distribution of x?
     (b) Write down the marginal density of x.
     (c) What is the conditional distribution of y given x?
     (d) Write down the conditional density of y given x.
     (e) Write down the joint density of (x, y).

     Here's more R code:

        N.sim <- 1000
        w <- rep ( NA, N.sim )
        for ( i in 1:N.sim ) {
           x <- rnorm ( 50, 2, 3 )
           y <- -2*x + 1 + rnorm ( 50, 0, 1 )
           fit <- lm ( y ~ x )
           w[i] <- f it$coef [2]
        }
        zi <- mean ( w )
        z2 <- sqrt ( var ( w + 2 ) )


﻿


3.5. EXERCISES


261


    What does zi estimate? What does z2 estimate?

25. A statistician thinks the regression equation y2 = 0 + /1x2 + EZ fits her data
    well. She would like to learn 31. She is able to measure the yr's accurately but
    can measure the xi's only approximately. In fact, she can measure w = z2+ i
    where o5   N(0, .1). So she can fit the regression equation y =30+#*3w2 + E .
    Note that (/3*, #*) might be different than (,Qo, 31) because they're for the wi's,
    not the xi's. So the statistician writes the following R code.

         N.sim <- 1000
         b.0 <- -10:10
         b.1 <- -10:10
         n <- 50
         for ( i in 1:21 )
         for ( j in 1:21 ) {
           val <- rep ( NA, N.sim )
           for ( k in 1:N.sim ) {
             x <- rnorm ( n )
             w <- x + rnorm ( n, 0, sqrt(.1) )
             y <- b.0[i] + x * b.1[j] + rnorm ( n, 3 )
             fit <- im ( y ~ w )
             val[k] <- fit$coef [2]
           }
           m <- mean ( val )
           sd <- sqrt ( var ( val ) )
           print ( c ( m, sd, (m-b.1[j])/sd ) )
         }


    What is she trying to do? The last time through the loop, the print statement
    yields [11 9.086723 0.434638 -2.101237. What does this show?

26. The purpose of this exercise is to familiarize yourself with plotting logistic
    regression curves and getting a feel for the meaning of ,3o and 31.

    (a) Choose some values of x. You will want between about 20 and 100
         evenly spaced values. These will become the abscissa of your plot.
     (b) Choose some values of ,3o and #13. You are trying to see how different
         values of the /'s affect the curve. So you might begin with a single value
         of #1~ and several values of #3o, or vice versa.


﻿


3.5. EXERCISES


262


     (c) For each choice of (/o, #1) calculate the set of 2 = e/3o+31xi/(1 + e3o+31xi )
         and plot O2 versus x. You should get sigmoidal shaped curves. These are
         logistic regression curves.
     (d) You may find that the particular x's and 13's you chose do not yield a
         visually pleasing result. Perhaps all your 0's are too close to 0 or too
         close to 1. In that case, go back and choose different values. You will
         have to play around until you find x's and 13's compatible with each
         other.

27. Carry out a logistic regression analysis of the O-ring data. What does your
    analysis say about the probability of O-ring damage at 36°F, the temperature
    of the Challenger launch. How relevant should such an analysis have been to
    the decision of whether to postpone the launch?

28. This exercise refers to Example 3,10.

     (a) Why are the points lined up vertically in Figure 3., panels (a) and (b)?
     (b) Why do panels (c) and (d) appear to have more points than panels (a)
         and (b)?
     (c) If there were no jittering, how many distinct values would there be on
         the abscissa of panels (c) and (d)?
     (d) Download the seedling data. Fit a model in which year is a predictor but
         quadrat is not. Compare to fit 1. Which do you prefer? Which variable
         is more important: quadrat or year? Or are they both important?


﻿


CHAPTER 4


                    MORE PROBABILITY


4.1 More Probability Density

Section L2 on page 6 introduced probability densities. Section41 discusses them
further and gives a formal definition.
   Let X be a continuous random variable with cdf Fx. Theorem  on page 8
implies that fx(x) =   Fx(b) b-x and therefore that we can define the pdf by
fx(x) = F(x) = j Fx(b) bx. In fact, this definition is a little too restrictive. The
key property of pdf's is that the probability of a set A is given by the integral of the
pdf. I.e.,
                           P[X e A]   ffx(x)dx

But if f* is a function that differs from fx at only countably many points then, for
any set A, fA f* =A fx, so we could just as well have defined

                           P[X e A]=( f*(x) dx

There are infinitely many functions having the same integrals as fx and f*. These
functions differ from each other on "sets of measure zero", terminology beyond our
scope but defined in books on measure theory. For our purposes we can think of
sets of measure zero as sets containing at most countably many points. In effect,
the pdf of X can be arbitrarily changed on sets of measure zero. It does not matter
which of the many equivalent functions we use as the probability density of X.
Thus, we define


263


﻿


4.2. RANDOM VECTORS


264


Definition 4.1. Any function f such that, for all intervals A,

                            P[X e A] f= f(x)dx

is called a probability density function, or pdf, for the random variable X. Any such
function may be denoted fx.

   Definition 4:1 can be used in an alternate proof of Theorem 1,1 on page1.
The central step in the proof is just a change-of-variable in an integral, showing
that Theorem      is, in essence, just a change of variables. For convenience we
restate the theorem before reproving it.
    Theorem L1 Let X be a random variable with pdf px. Let g be a differentiable,
monotonic, invertible function and define Z = g(X). Then the pdf of Z is

                                            dg-1(t)
                         pz (t) = px (g -1(t) d - t


Proof. For any set A, P[Z E g(A)] = P[X E A] =IfApx(x) dx. Let z = g(x) and
change variables in the integral to get

                     P[Z e g(A)] =     px(gd1(z))     dz
                                   g (A)          dx

I.e., P[Z E g(A)] = fg(A) something dz. Therefore something must be pz(z). Hence,
pz(z) = px(g-1(z))|dx/dz|.


4.2 Random Vectors

It is often useful, even essential, to talk about several random variables simultane-
ously. We have seen many examples throughout the text beginning with Section 1.5
on joint, marginal, and conditional probabilities. Section 4.2 reviews the basics and
sets out new probability theory for multiple random variables.
   Let X1, ..., X, be a set of n random variables. The n-dimensional vector
X = (X1,..., Xn) is called a multivariate random variable or random vector. As
explained below, X has a pdf or pmf, a cdf, an expected value, and a covariance
matrix, all analogous to univariate random variables.


﻿


4.2. RANDOM VECTORS


265


4.2.1 Densities of Random Vectors

When X1, ..., X, are continuous then X has a pdf, written

                              pg(x1,. . ., ).

As in the univariate case, the pdf is any function whose integral yields probabilities.
That is, if A is a region in R" then

                P [Z E A] =  - - p   (zi, -.--.,  n) dzi1 . .. don
                             A

   For example, let X1 Exp(1); X2  Exp(1/2); X1 I X2; and X= (X1, X2) and
suppose we want to find P[lX1 - X21 < 1]. Our plan for solving this problem is to
find the joint density pX, then integrate pk over the region A where IX1 - X2| < 1.
Because X1 I X2, the joint density is

                pX (XI, x2) =pxi(xi)px2(x2) = e-l x !e-2/2
                                                 2

To find the region A over which to integrate, it helps to plot the X1-X2 plane.
Making the plot is left as an exercise.


  P[lX1 - X2 1 1] =    p(xi, x2) dz1 dz2
                    A
            1   1 +1                      oo°° i+1
        = -          e-lex2/2 dz2 dzi + -         e-x1 ex2/2 dz2 dzi
            2 0 o21                           i-1
                                                              0.47 (4.1)

   The random variables (X1,... , Xn) are said to be mutually independent or jointly
independent if
                  pX (zi, . .. , zn) = pxi (zi) x -.-.-x pxn (zn)
for all vectors (zi,... , z).
   Mutual independence implies pairwise independence. I.e., if (X1,... , Xn) are
mutually independent, then any pair (Xi, Xj) are also independent. The proof
is left as an exercise. It is curious but true that pairwise independence does not
imply joint independence. For an example, consider the discrete three-dimensional


﻿


4.2. RANDOM VECTORS


266


distribution on X= (X1, X2, X3) with


P[(X1, X2, X3)
P[(X1, X2, X3)
P[(X1, X2, X3)
P[(X1, X2, X3)


(0, 0, 0)]
(1, 0, 1)]
(0, 1, 1)]
(1, 1, 0)]


(4.2)


1/4


It is easily verified that X1 I X2, X1 I X3, and X2
are not mutually independent. See Exercise 6.


I X3 but that X1, X2, and X3


4.2.2 Moments of Random Vectors

When X is a random vector, its expected value is also a vector.

                        E[Z] = (E[X1], . . . ,E[Xn])

When Z    (X1, . . . , Xn) is a random vector, instead of a variance it has a covariance
matrix. The ij'th entry of the covariance matrix is Cov(Xi, X3). The notation is


Cov(X) - EX


2i~l
  (71           1l
912  92         I
              ~2m
   (71  (7 n  '' ' (7


where ig =


Cov(X2, Xj) and a2


Var(XZ). Sometimes a1 is also denoted X22.


4.2.3 Functions of Random Vectors
Section 4,2,3 considers functions of random vectors. If g is an arbitrary function
that maps X to R then

           E[g(Z)] = .--.- g(zi, . .. , z )p  (zi,  . . . , zn) dzi -.-.-dxn

but it's hard to say much in general about the variance of g(Z). When g is a linear
function we can go farther, but first we need a lemma.

Lemma 4.1. Let X1 and X2 be random variables and Y= X1 + X2. Then


1. E[Y]=E[X1] + E[X2]


﻿


4.2. RANDOM VECTORS


267


  2. Var(Y) = Var(X1) + Var(X2) + 2 Cov(X1, X2)

Proof. Left as exercise.

   Now we can deal with linear combinations of random vectors.

Theorem 4.2. Let a= (ai, ... , an) be an n-dimensional vector and define Y = a $
Sa Xi.Then,

   1. E[Y] =E[L aXi  E aE[Xi]


D-


  2. Var(Y) = Lai Var(Xi) + 2E_- E,"_i 1 a a Cov(XX, Xj) =   a--"

Proof. Use Lemma 4.1 and Theorems L  (pg. 39) and L4 (pg. 40). See Exercise 8.
                                                                        D

   The next step is to consider several linear combinations simultaneously. For
some k <n, and for each i = 1, ...,k, let

                   Y = aiX1 + ... inXn =  aijXj =


where the as 's are arbitrary constants and di = (ai,..., ain). Let Y = (Y1,..., Yk).
In matrix notation,
                                 Y=AX
where A is the k x n matrix of elements aij. Covariances of the Y's are given by

            Cov(Y, Y) = Cov(d X, dX)
                          n  n
                               Cov(aikXk, agrXj)
                         k=1 £=1
                         n            n-1 n
                         -  ail/jalp T/~+  3 3 (aag + ajkaif)cakf
                         k=1          k=1 £=k+1


Combining the previous result with Theorem 4.2 yields Theorem 4 .

Theorem 4.3. Let X be a random vector of dimension n with mean E[X] = and
covariance matrix Cov(X) =ZY; let A be a k x n matrix of rank k; and let Y = AX.
Then


﻿


4.2. RANDOM VECTORS


268


   1. E[Y] = Ap, and

   2. Cov(Y) = AEA'

   Finally, we take up the question of multivariate transformations, extending
the univariate version, Theorem .1 (pg.  ). Let X= (X1,..., Xn) be an n-
dimensional continuous random vector with pdf fk. Define a new n-dimensional
random vector Y = (Y1,..., Yn) = (gi(X),..., g (X)) where the g's are differen-
tiable functions and where the the transformation g : Z Y is invertible. What is
fg, the pdf of Y?
   Let J be the so-called Jacobian matrix of partial derivatives.

                                  8x1       ax,
                                  (9Y2      (9Y2
                                _(8X1       8Xn
                            J&x
                                  BYn       BYn
                                  8x1       ax,
and |JI be the absolute value of the determinant of J.

Theorem 4.4.
                           fg -) = f(g-1(y))|JI-1
Proof. The proof follows the alternate proof of Theorem:L1 on page 2. For any
set A, P[Y E g(A)] = P[X E A] = f... fAPi() d1.-- dxn. Let g= g(x) and
change variables in the integral to get

              P[Y Eg(A)] =f- -J-p-(g-(J))|J       1 dy1...- dyn


I.e., P[Y E g(A)] = f --- fg(A) something dy1 ... dyn. Therefore something must be
pg (y). Hence, pp(g) - pX(g-1(g))|JI-1.D

   To illustrate the use of Theorem:4. we solve again an example previously given
on page 26, which we restate here. Let X1  Exp(1); X2  Exp(2); X1 I X2;
and X= (X1, X2) and suppose we want to find P[lX1 - X2|     1]. We solved this
problem previously by finding the joint density of X= (X1, X2), then integrating
over the region where IX1 - X2|     1. Our strategy this time is to define new
variables Y1 = X1 - X2 and Y2, which is essentially arbitrary, find the joint density
of Y =(Y1, Y2), then integrate over the region where YI  1. We define Y1
  X1- X2 because that's the variable we're interested in. We need a Y2 because


﻿


4.2. RANDOM VECTORS                                                     269

Theorem is for full rank transformations from IRan to ]Rn. The precise definition
of Y2 is unimportant, as long as the transformation from X~ to Y is differentiable


and invertible. For convenience, we define Y2


X2. With these definitions,


   / Y1
 J  1
JzzzQy2


DY1
0Y2}


( 1
0


-1)


    Jzz=1
X 1 =zYi+ Y2


and


X2 = Y2


From the solution
e-(Y1+FY2) X iC-Y2/2
          2
grate.


on page
  lC ey1 e


we know pjk (


1 X2) =CeX1 X 1e-X2/2 SO PY(Y1Y2) -
shows the region over which to inte-


-3Y2/2. Figure


PL[X1 -X2 < 1]


PL[Yi <1]


IpY 1 Y2) dy, dy2
A


      e-
1 2 -1


  1   °
=3    1 e
     -


y1


                 1 [
C-3Y2/2 dy2 dyi + - I -Y1
               2 J0


0O


e-3Y2/2 dy2 dyi


-Y1 [- 3Y2/2] 1 dyi + 1
                     3


0


e-Y1 [-c-3Y2/2] 7 dyi


e-Y1 dy,


1


J0
   eY1/2 dyi
-1


1


2 yl/2O °   -1 -y1 1
  3    -1     3     0


r0.47 (4.3)


Figure    was produced by the following snippet.


par ( mar=c(O,O,O,O) )


plot ( c(0,6),


c(0,2), type="n", xlab="", ylab=""


,xaxt="n",


yaxt="n", bty="n" )


﻿


4.2. RANDOM VECTORS27


270


Y-


Eigure
where


4.1: The (XIX2)
X and Y live. The


plane and the (Y1, Y2) plane. The light gray regions are
dark gray regions are where X1 - X2 < 1.


polygon (


c(1,1.9,1.9,1) , c(1,1,1.9,1.9) , col=gray(.8),
border=NA )


polygon ( c(1,1.2,1.9,1.9,1.7,1) , c(1,1,1.7,1.9,1.9,1.2),
         col=gray(.5), border=NA )
segments ( 0, 1, 1.9, 1, lwd=3 ) #* x1 axis
segments ( 1, 0, 1, 1.9, lwd=3 ) #* x2 axis
text ( c(2,1), c(1,2), c(expression(bold(X[11)),
                      expression (bold(X [21 ))) )
polygon ( c(5,5.9,5.9,4), c(1,1,1.9,1.9), col=gray(.8),
         border=NA )
polygon ( c(5,5.2,5.2,4.8,4.8), c(1,1,1.9,1.9,1.2),
         col=gray(.5), border=NA )
segments ( 4, 1, 5.9, 1, lwd=3 ) #* yl axis
segments ( 5, 0, 5, 1.9, lwd=3 ) #* y2 axis
text ( c(6,5), c(1,2), c(expression(bold(Y[11)),
                      expression(bold(Y [21 ))) )


arrows ( 2.5, 1, 3.5, 1, length=.2, lwd=2 )


﻿


4.3. REPRESENTING DISTRIBUTIONS


271


   The point of the example, of course, is the method, not the answer. Functions
of random variables and random vectors are common in statistics and probability.
There are many methods to deal with them. The method of transforming the pdf
is one that is often useful.


4.3 Representing Distributions

We usually describe a random variable Y through py - its pmf if Y is discrete or
its pdf if Y is continuous. But there are at least two alternatives. First, any random
variable Y can be described by its cumulative distribution function, or cdf, Fy which
is defined by

                              S       P[Y = y] if Y is discrete
          Fy c)- P[Y< c]=         =-(4.4)
                       -        '] f p(y) dy    if Y is continuous.

   Equation    defines the cdf in terms of the pmf or pdf. It is also possible to go
the other way. If Y is continuous, then for any number b c R

                       P(Y < b) = F(b) =   p(y) dy

which shows by the Fundamental Theorem of Calculus that p(y) = F'(y). On the
other hand, if Y is discrete, Then P[Y = y] = P[Y  y] - P[Y < y] = Fy(y) -
Fy (y-). (We use the notation Fy(y-) to mean the limit of Fy(z) as z approaches y
from below. It is also written lim1Eo Fy(y - E)). Thus the reverse of Equation  is

                         S(Fy(y) - Fy(y-) if Y is discrete
                         FY (y)           if Y is continuous

Equation     is correct except in one case which seldom arises in practice. It is
possible that Fy(y) is a continuous but nondifferentiable function, in which case Y
is a continuous random variable, but Y does not have a density. In this case there
is a cdf Fy without a corresponding pmf or pdf.
   Figure    shows the pmf and cdf of the Bin(10, .7) distribution and the pdf
and cdf of the Exp(1) distribution.


﻿


4.3. REPRESENTING DISTRIBUTIONS


272


Bin (10, .7)


     0
     0
     0


Bin (10, .7)


   O


Q~ 0
   6

   O

   0-
   6


0


0

0_


0


0


      0-


      0-

    0-
    0-
I I I I I I

0   4  8


    0
000000


0   4   8


  y


Exp(1)


  y


Exp(1)


   00
   6


Q
   6


   0
   60

      0 12 34 5


           y


   00
   6


o  r
   0


   0
   0


0 12 34 5


     y


Figure 4.2: pmf's, pdf's, and cdf's


﻿


4.3. REPRESENTING DISTRIBUTIONS


273


Figure    was produced by the following snippet.


par ( mfrow=c(2,2) )
y <- seq ( -1, 11, by=1 )


plot ( y, dbinom
       main="Bin
plot ( y, pbinom
       ylab=" cdf "
segments ( -1:10,
             0:11,
y <- seq ( 0, 5,
plot ( y, dexp (
plot ( y, pexp (


( y, 10, .7 ), type="p", ylab="p
(10,  .7)" )
( y, 10, .7 ), type="p", pch=16,
  main="Bin (10, .7)" )
  pbinom ( -1:10, 10, .7 ),
  pbinom ( -1:10, 10, .7 ) )
len=50 )
y, 1 ), type="l", ylab="pdf", ma
y, 1 ), type="l", ylab="cdf", ma


pmf ",


in="Exp(1)" )
in="Exp(1)" )


   * segments ( x0, yo, x1, y1) draws line segments. The line segments run
     from (x0,y0) to (x1,y1). The arguments may be vectors.


   The other alternative representation for Y is its moment generating function or
mgf Mr. The moment generating function is defined as


My (t) =  IE[et]


{E   etYpy(y) if Y is discrete
f etYpy(y)    if Y is continuous


(4.6)


My is also known as the Laplace transform of pr.
   Because we define the mgf as a sum or integral there is the question of whether
the sum or integral is finite and hence whether the mgf is well defined. In Equa-
tion   , the mgf is always defined at t = 0. (See Exercise  .) But even if My(t) is
not well defined (the integral or sum is not absolutely convergent) for large t, what
matters for statistical practice is whether My (t) is well defined in a neighborhood
of 0, i.e. whether there exists a 6 > 0 such that My(t) exists for t E (-6, 6). The
moment generating function gets its name from the following theorem.

Theorem 4.5. If Y has mgf My defined in a neighborhood of 0, then


                        E(Y"] =My") (0) --n -My(t)


﻿


4.3. REPRESENTING DISTRIBUTIONS


2 74


Proof. We provide the proof for the case n = 1. The proof for larger values of n is
similar.
                         d            df
                         dMy (t)o=df etupy (y) dyJ

                                         d teta  py  (y) dy

                                   =   yet' py(y)dy

                                   =f ypy (y) dy

                                   = E[Y]


    The second line of the proof has the form


                        d+ff(t, y) dy=     d f (t, y) dy,

an equality which is not necessarily true. It is true for "nice" functions f; but estab-
lishing exactly what "nice" means requires measure theory and is beyond the scope
of this book. We will continue to use the equality without thorough justification.
    One could, if one wished, calculate and plot My (t), though there is usually little
point in doing so. The main purpose of moment generating functions is in proving
theorems and not, as their name might suggest, in deriving moments. And mgf's
are useful in proving theorems mostly because of the following two results.

Theorem 4.6. Let X and Y be two random variables with moment generating func-
tions (assumed to exist) Mx and My. If Mx(t) = My(t) for all t in some neighbor-
hood of 0, then Fx = FY; i.e., X and Y have the same distribution.

Theorem 4.7. Let Y1,... be a sequence of random variables with moment generating
functions (assumed to exist) My,,.... Define M(t) = limn-   ,(t). If the limit
exists for all t in a neighborhood of 0, and if M(t) is a moment generating function,
then there is a unique cdf F such that

   1.
                                 F(y) = lim FY,(y)
                                        fnd - oo


﻿


4.3. REPRESENTING DISTRIBUTIONS


275


  2. M is the mgf of F.

  Theorems4. and 4.7 both assume that the necessary mgf's exist. It is incon-
venient that not all distributions have mgf's. One can avoid the problem by using
characteristic functions (also known as Fourier transforms) instead of moment gen-
erating functions. The characteristic function is defined as

                                Cy (t) = E [eiY]

where i     /1. All distributions have characteristic functions, and the charac-
teristic function completely characterizes the distribution, so characteristic func-
tions are ideal for our purpose. However, dealing with complex numbers presents
its own inconveniences. We shall not pursue this topic further. Proofs of Theo-
rems,4,6 and 4,7 and similar results for characteristic functions are omitted but
may be found in more advanced books.
   Two more useful results are theorems 4,8and .

Theorem 4.8. Let X be a random variable, a, b be constants, and define Y = aX + b.
Then MY (t) = ebtMx (at).

Proof.

                             MY (t) = EB [e(axb)t]
                                   = ebtI g[eatx]
                                   = ebt Mx(at)


Theorem 4.9. Let X and Y be independent random variables. Define Z = X + Y.
Then
                             Mz(t) = Mx(t)My(t)

Proof.

          Mz(t) = E [e(X±Y)t] gE[exteYt] gE[eXtgI[eYt] =Mx(t)My(t)


Corollary 4.10. Let Y1,...,Yn be a collection of i.i.d. random variables each with
mgf My. Define X = Y1 + - -- + Yn. Then


Mx(t) =My t]"


﻿


4.4. EXERCISES


276


4.4 Exercises

   1. Refer to Equation   on page 265.

      (a) To help visualize the joint density pX, make a contour plot. You will
          have to choose some values of x1, some values of x2, and then evaluate
          p (xi, x2) on all pairs (Xi, x2) and save the values in a matrix. Finally,
          pass the values to the contour function. Choose values of x1 and x2 that
          help you visualize pX. You may have to choose values by trial and error.
      (b) Draw a diagram that illustrates how to find the region A and the limits
          of integration in Equation
      (c) Supply the missing steps in Equation 4 1. Make sure you understand
          them. Verify the answer.
      (d) Use R to verify the answer to Equation  by simulation.

  2. Refer to Example 1.6 on page 43 on tree seedlings where N is the number of
     New seedlings that emerge in a given year and X is the number that survive
     to the next year. Find P[X;> 1].

  3. (X1, X2) have a joint distribution that is uniform on the unit circle. Find
     p(x1)x2)-

  4. The random vector (X, Y) has pdf p(xy)(x, y) c ky for some k > 0 and (x, y)
     in the triangular region bounded by the points (0, 0), (-1, 1), and (1, 1).

     (a) Find k.
     (b) Find P[Y < 1/2].
     (c) Find P[X < 0].
     (d) Find P[lX - Y| <1/2].

  5. Prove the assertion on page 265 that mutual independence implies pairwise
     independence.

     (a) Begin with the case of three random variables X= (X1, X2, X3). Prove
          that if X1, X2, X3 are mutually independent, then any two of them are
          independent.
      (b) Generalize to the case X  (X1, . .. ,X,).


﻿


4.4. EXERCISES


277


6. Refer to Equation 42 on page 66. Verify that X1 I X2, X1 I  X3, and
    X2 I X3 but that X1, X2, and X3 are not mutually independent.

 7. Prove Lemma 4.

 8. Fill in the proof of Theorem 42 on page 267.

 9. X and Y are uniformly distributed in the rectangle whose corners are (1, 0),
    (0, 1), (-1, 0), and (0, -1).

    (a) i. Find p(x, y).
          ii. Are X and Y independent?
          iii. Find the marginal densities p(x) and p(y).
          iv. Find the conditional densities p(x y) and p(y lx).
          v. Find IE[X], E[X |Y = .5], and E[X Y = -.5].
     (b) LetU=X+YandV=X-Y.
          i. Find the region where U and V live.
          ii. Find the joint density p(u, v).
          iii. Are U and V independent?
          iv. Find the marginal densities p(u) and p(v).
          v. Find the conditional densities p(u lIv) and p(v lu).
          vi. Find E[U], E[U I V = .5], and E[U I V = -.5].

10. Let the random vector (U, V) be distributed uniformly on the unit square. Let
    X =UV and Y =U/V.

    (a) Draw the region of the X-Y plane where the random vector (X, Y) lives.
    (b) Find the joint density of (X, Y).
    (c) Find the marginal density of X.
    (d) Find the marginal density of Y.
    (e) Find P[Y > 1].
    (f) Find P[X > 1].
    (g) Find P[Y > 1/2].
    (h) Find P[X > 1/2].
    (i) Find P[XY > 1].


﻿


4.4. EXERCISES                                                          278

      (j) Find P[XY > 1/2].

 11. Just below Equation 4.6 is the statement "the mgf is always defined at t = 0."
     For any random variable Y, find My (0).

 12. Provide the proof of Theorem 4.5 for the case n = 2.

 13. Refer to Theorem 4;9. Where in the proof is the assumption X I Y used?


﻿


CHAPTER 5


               SPECIAL DISTRIBUTIONS


Statisticians often make use of standard parametric families of probability distribu-
tions. A parametric family is a collection of probability distributions distinguished
by, or indexed by, a parameter. An example is the Binomial distribution introduced
in Section 1..1. There were N trials. Each had a probability 0 of success. Usually
0 is unknown and could be any number in (0, 1). There is one Bin(N, 0) distribution
for each value of 0; 0 is a parameter; the set of probability distributions

                            {Bin(N,O6) :O0 E (0, 1)}

is a parametric family of distributions.
   We have already seen four parametric families - the
Binomial (Section  1), Poisson (Section L312), Exponential (Section 1.3.3), and
Normal (Section      ) distributions. Chapter  examines these in more detail and
introduces several others.


5.1 The Binomial and Negative Binomial Distribu-
        tions

The Binomial Distribution Statisticians often deal with situations in which there
is a collection of trials performed under identical circumstances; each trial results
in either success or failure. Typical examples are coin flips (Heads or Tails), medical
trials (cure or not), voter polls (Democrat or Republican), basketball free throws
(make or miss). Conditions for the Binomial Distribution are

   1. the number of trials n is fixed in advance,


279


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


280


   2. the probability of success 0 is the same for each trial, and

   3. trials are conditionally independent of each other, given 0.

Let the random variable X be the number of successes in such a collection of trials.
Then X is said to have the Binomial distribution with parameters (n, 0), written
X     Bin(n, 0). The possible values of X are the integers 0, 1, ..., n. Figure
shows examples of Binomial pmf's for several combinations of n and 0. Usually 0
is unknown and the trials are performed in order to learn about 0.
    Obviously, large values of X are evidence that 0 is large and small values of X
are evidence that 0 is small. But to evaluate the evidence quantitatively we must
be able to say more. In particular, once a particular value X = x has been observed
we want to quantify how well it is explained by different possible values of 0. That
is, we want to know p(x l0).

Theorem 5.1. If X    Bin(n, 0) then

                           px () =    )O(1 - 0)"-

for x = 0, 1,...,n.

Proof. When the n trials of a Binomial experiment are carried out there will be
a sequence of successes (1's) and failures (0's) such as 1000110... - 100. Let §
{0, 1}n be the set of such sequences and, for each x E {0, 1, ... , n}, let Sx be the
subset of § consisting of sequences with x 1's and n - x 0's. If s E S§ then Pr(s)
Ox(1 - 8)-2x. In particular, all s's in § have the same probability. Therefore,

                           px(x) = P(X = z) = P(S )
                         = (size of §) - (Ox(1 - 6)"-x)
                                    -(in)o9x(1o)12x


    The special case n = 1 is important enough to have its own name. When
n = 1 then X is said to have a Bernoulli distribution with parameter 0. We write
X    Bern(0). If X  Bern(0) then px(x) = Ox(1 -8)1-x for x E {0, 1}. Experiments
that have two possible outcomes are called Bernoulli trials.
    Suppose X1 ~Bin("ni, 0), X2  Bin(n2, 0) and X1 I X2. Let X3  X1 + X2.
What is the distribution of X3? Logic suggests the answer is X3 ~Bin(ini + in2, 0)


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


281


because (1) there are n11+ n2 trials, (2) the trials all have the same probability of
success 0, (3) the trials are independent of each other (the reason for the X1 I X2
assumption) and (4) X3 is the total number of successes. Theorem5. shows a
formal proof of this proposition. But first we need to know the moment generating
function.

Theorem 5.2. Let X   Bin(n, 0). Then

                          Mx (t) = [Oet + (1 - 8)]n

Proof. Let Y  Bern(0). Then

                       My(t) = E[eLY] Q= et + (1 - 0).

Now let X =21Y where the Y's are i.i.d. Bern(0) and apply Corollary 4,10. D

Theorem 5.3. Suppose X1    Bin(n1, 0); X2  Bin(n1, 0); and X1 I X2. Let X3 =
X1 + X2. Then X3   Bin(n1 + n2,0).

Proof.

                  Mx3(t)   Mx1(t)Mx2 (t)
                         = [Oet+ (1-e)]h1 [Oet+ (1-)]Th2
                         = [Oet + (1- 0)]"21+2

The first equality is by Theorem 4,9; the second is by Theorem 5.2. We recognize
the last expression as the mgf of the Bin(ni + n2, 0) distribution. So the result
follows by Theorem    .                                                  D

   The mean of the Binomial distribution was calculated in Equation 1.11. Theo-
rem 5,4 restates that result and gives the variance and standard deviation.

Theorem 5.4. Let X   Bin(n, 0). Then

  1. E(X] = n   0.

  2. Var(X) = n8(1 - 8).


3. SD(X) =/n0(1 - 0).


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


282


Proof. The proof for E[X] was given earlier. If X  Bin(n, 0), then X =L>' X,
where Xi    Bern(0) and the Xi's are mutually independent. Therefore, by Theo-
rem   9, Var(X) = nVar(Xi). But

                Var(X) =E(X) - E(XZ)2= 0 - 02 0=(1 - 0).

So Var(X) = nO(1 - 0). The result for SD(X) follows immediately.
   Exercise"1 asks you to prove Theorem 5.4 by moment generating functions. D

   R comes with built-in functions for working with Binomial distributions. You can
get the following information by typing help (dbinom), help (pbinom), help (qbinom),
or help (rbinom). There are similar functions for working with other distributions,
but we won't repeat their help pages here.

Usage:

   dbinom(x, size, prob, log = FALSE)
   pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
   qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
   rbinom(n, size, prob)

Arguments:

  x, q: vector of quantiles.

     p: vector of probabilities.

     n: number of observations. If 'length(n) > 1', the length is
         taken to be the number required.

  size: number of trials.

  prob: probability of success on each trial.

log, log.p: logical; if TRUE, probabilities p are given as log(p).

lower.tail: logical; if TRUE (default), probabilities are P[X <= x],
           otherwise, P[X > x].


Details:


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


283


   The binomial distribution with 'size' = n and 'prob' = p has
   density

                p(x) = choose(n,x) p-x (1-p)^(n-x)

   for x = 0, ..., n.

   If an element of 'x' is not integer, the result of 'dbinom' is
   zero, with a warning. p(x) is computed using Loader's algorithm,
   see the reference below.

   The quantile is defined as the smallest value x such that
   F(x) >= p, where F is the distribution function.

Value:

   'dbinom' gives the density, 'pbinom' gives the distribution
   function, 'qbinom' gives the quantile function and 'rbinom'
   generates random deviates.

   If 'size' is not an integer, 'NaN' is returned.

References:

   Catherine Loader (2000). Fast and Accurate Computation of
   Binomial Probabilities; manuscript available from <URL:
   http://cm.bell-labs.com/cm/ms/departments/sia/catherine/dbinom>

See Also:

   'dnbinom' for the negative binomial, and 'dpois' for the Poisson
   distribution.

Examples:

   # Compute P(45 < X < 55) for X Binomial(100,0.5)
   sum(dbinom(46:54, 100, 0.5))


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


284


   ## Using "log = TRUE" for an extended range
   n <- 2000
   k <- seq(0, n, by = 20)
   plot (k, dbinom(k, n, pi/10, log=TRUE), type='l', ylab="log density",
          main = "dbinom(*, log=TRUE) is better than log(dbinom(*))")
   lines(k, log(dbinom(k, n, pi/10)), col='red', lwd=2)
   ## extreme points are omitted since dbinom gives 0.
   mtext("dbinom(k, log=TRUE)", adj=0)
   mtext("extended range", adj=0, line = -1, font=4)
   mtext("log(dbinom(k))", col="red", adj=1)

   Figure    shows the Binomial pmf for several values of x, n, and p. Note that
for a fixed p, as n gets larger the pmf looks increasingly like a Normal pdf. That's
the Central Limit Theorem. Let Y, ... , Y, ~ i.i.d. Bern(p). Then the distribution
of X is the same as the distribution of Z Y and the Central Limit Theorem tells us
that E Y looks increasingly Normal as n -- 0.
   Also, for a fixed n, the pmf looks more Normal when p = .5 than when p = .05.
And that's because convergence under the Central Limit Theorem is faster when
the distribution of each Y is more symmetric.

   Figure    was produced by

   par ( mfrow=c(3,2) )
   n <- 5
   p <- .05
   x <- 0:5
   plot ( x, dbinom(x,n,p), ylab="p(x)", main="n=5, p=.05" )


The Negative Binomial Distribution Rather than fix in advance the number of
trials, experimenters will sometimes continue the sequence of trials until a pre-
specified number of successes r has been achieved. In this case the total number of
failures N is the random variable and is said to have the Negative Binomial distri-
bution with parameters (r, 0), written N ~ NegBin(r, 0). (Warning: some authors
say that the total number of trials, N + r, has the Negative Binomial distribution.)
One example is a gambler who decides to play the daily lottery until she wins.


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


285


        n=5, p=.05

 co
 6  o


    0
x

N      0    0  0

   O I - - I I I

   0   1 2  3  4  5


n=5, p=.5


0   0


10


10 L
a


00


0


10
O
6


X


10            oj

0  1  2  3 4  5


       x


    n=20, p=.5


       0
       0 0


    0     0


   n=20, p=.05


0 0


     0


C

Q


10


10
Q O
LO


0


0


0 1 2 3  4 5


0


0


6  8 10 12 14


X


X


11-


10


0
O


   n=80, p=.05


     0 0

        0
   0

         0

  00
           0
0             0


0  2  4  6  8


       x


X O
Q
N
a
O\


n=80, p=.5


   000
   0  0
   0  0
 0     0
 0      0
0       0


0


0


  0
100 I   I    00

30 35  40 45  50


X


Figure 5.1: The Binomial pmf


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL


286


The prespecified number of successes is r = 1. The number of failures N until she
wins is random. In this case, and whenever r = 1, N is said to have a Geometric
distribution with parameter 0; we write N   Geo(0). Often, 0 is unknown. Large
values of N are evidence that 0 is small; small values of N are evidence that 0 is
large. The probability function is


              PN(k) = P(N = k)
                    = P (,r- 1 successes in the first k + r - 1 trials
                        and k + r'th trial is a success)
                        (k+r-1)r(l       )k
                          r -1


for k = 0,1, ... .

   Let N1 ~ NegBin(ri, 0), ..., Nt, NegBin(rt, 0), and N1, ..., N, be independent
of each other. Then one can imagine a sequence of trials of length L(Ni + r)
having E ri successes. N1 is the number of failures before the ri'th success; ...;
N1 +--- + Nt is the number of failures before the r1 +- --+ rt'th success. It is evident
that N = E Ni is the number of failures before the r =E r2'th success occurs and
therefore that N   NegBin(r, 0).


Theorem 5.5. If Y   NegBin(r, 0) then E[Y] = r(1 - 0)/ and Var(Y) = r(1 - 0)/02.


Proof. It suffices to prove the result for r = 1. Then the result for r > 1 will follow


﻿


5.1. BINOMIAL AND NEGATIVE BINOMIAL                                   287

by the foregoing argument and Theorems L7 and 1.9. For r = 1,


                              00
                              hO
                      E([N]=    n P([N = n]
                             n=0
                             00
                           =    n(1 - 0)"
                             n=1
                                     00
                            = (1 - e) n(1 - 6)"-l
                                     n=1
                              =0 _-6(1 - e)  d (1 -  )"
                                      n=1

                            =-0(1 -e) d   0(1 -0)"n
                                         n=1
                                      d 1 -e
                            =-(1 -) d2


                            1-0
                               0


The trick of writing each term as a derivative, then switching the order of summa-


﻿


5. 1. BINOMIAL AND NEGATIVE BINOMIAL

tion and derivative is occasionally useful. Here it is again.
                00
       IE(N2)   E Yn2 PL[N = n]
                n0o


288


        00
0(1 -0)E
        n~1
        00
0(1 -0)E
        n~1


(n(n - 1) + n) (1 - 0 -


                     n~1


n(n - 1)(1 - 0 -


1-0
  0

1-0
  0
1-0
  0
1-0
  0


+0(1    0d2
        n~1


02


D0
    E10) n
n~1


+0 (1  0)2df21 -0e
          02 0
  +0(1 _ 0)2
  +2


2-30 +02
    02


Therefore,


Var(N) =E[N2] - (E[N] )2


1-0
02


The R functions for working with the negative Binomial distribution are dnbinom,


pnbinom, qnbinom, and rnbinom. Figure
illustrates the use of qnbinom.


displays the Negative Binomial pdf and


Figure    was produced with the following snippet.


r <- c ( 1,


5, 30 )


p <- c ( .1, .5, .8 )


par ( mfrow=c (3, 3)


)


f or ( i in seq(along=r)
f or ( j in seq(along=p)
  lo <- qnbinom (. 01, r[il


)pj]


﻿


    probability

0.002   0.006


         probability

0.000      0.010      0.020


        probability

0.00     0.04     0.08


CD


CD
z
CD


CD

S.
O


   N,
   0

   0
Z


0


0


      N,
      O

Z    o
      O
      O


01

Ca


      0


z    N


      0


0I


Ul


O


b


O


     probability

0.01     0.03     0.05


      probability

0.02 0.06 0.10 0.14


        probability

0.0      0.2      0.4


      01


Z     0


0


0


      0


z    o


01


0


      0

      N
Z

      a)


II

m
II
O
01


     probability

0.02 0.06 0.10


     probability

0.05 0.15 0.25


  probability

0.2 0.4 0.6 0.8


      N,

      a)
Z
      N,


0


0


      0

      N,
Z


01


0


      0


zo

      N,


0


[N
00


﻿


5.2. MULTINOMIAL


290


5.2 The Multinomial Distribution


The multinomial distribution generalizes the binomial distribution in the following
way. The binomial distribution applies when the outcome of a trial has two possible
values; the multinomial distribution applies when the outcome of a trial has more
than two possible outcomes. Some examples are
Clinical Trials In clinical trials, each patient is administered a treatment, usually
     an experimental treatment or a standard, control treatment. Later, each pa-
     tient may be scored as either success, failure, or censored. Censoring occurs
     because patients don't show up for their appointments, move away, or can't
     be found for some other reason.

Craps After the come-out roll, each successive roll is either a win, loss, or neither.

Genetics Each gene comes in several variants. Every person has two copies of the
     gene, one maternal and one paternal. So the person's status can be described
     by a pair like {a, c} meaning that she has one copy of type a and one copy
     of type c. The pair is called the person's genotype. Each person in a sample
     can be considered a trial. Geneticists may count how many people have each
     genotype.
Political Science In an election, each person prefers either the Republican candi-
     date, the Democrat, the Green, or is undecided.
In this case we count the number of outcomes of each type. If there are k possible
outcomes then the result is a vector y, ... , Yk where ye is the number of times that
outcome i occured and Y1 + - - - + Yk = n is the number of trials.


﻿


5.2. MULTINOMIAL


291


   Let p = (pi,...,Pk) be the probabilities of the k categories and n be the number
of trials. We write Y   Mult(n, p). In particular, Y (Y1, ... , Y) is a vector of
length k. Because Y is a vector, so is its expectation

          E(Y] = p = (p   ,. p) = (E(Y1], ... , E(Yk = (np1,... ,npk).

   The i'th coordinate, Y, is a random variable in its own right. Because Y counts
the number of times outcome i occurred in n trials, its distribution is

                     Y~Bin(n, p).      (See Exercise 19.)               (5.1)

Although the Y's are all Binomial, they are not independent. After all, if Y1 = n,
then Y2 =.-     Yk = 0, so the Y's must be dependent. What is their joint pmf?
What is the conditional distribution of, say, Y2,..., Yk given Y1? The next two
theorems provide the answers.

Theorem 5.6. If Y    Mult(n, p) then


                     fy(yi, . . . ,y ) =-      P1 - - -pk

where (Ynk) is the multinomial coefficient

                                  n         n!


Proof. When the n trials of a multinomial experiment are carried out, there will
be a sequence of outcomes such as abkdbg ... f, where the letters indicate the out-
comes of individual trials. One such sequence is

                            a***ab***b ... k...k
                            yitimes y2times yktimes

The probability of this particular sequence is H p I. Every sequence with yi a's, ...,
Yk k's has the same probability. So

    fy(yi,... y) =(number of such sequences) x           (Ypi= i, Yk) Hi
                                                         (Y1 -.-- y


D-


﻿


5.3. POISSON


292


Theorem 5.7. If Y   Mult(n, p) then

                (Y2, .. . ,Yk |Yi  =  )  ~Mult(n -y1, (p*,. . . ,p))

where p2 = p2/(1 - p1) forai = 2, .. . ,k.

Proof See Exercise 18.                                                    D

   R's functions for the multinomial distribution are rmultinom and dmultinom.
rmultinom(m,n,p) draws a sample of size m. p is a vector of probabilities. The
result is a k x m matrix. Each column is one draw, so each column sums to n. The
user does not specify k; it is determined by k = length(p).


5.3 The Poisson Distribution

The Poisson distribution is used to model counts in the following situation.

   " There is a domain of study, usually a block of space or time.

   " Events arise at seemingly random locations in the domain.

   " There is an underlying rate at which events arise.

   " The rate does not vary over the domain.

   " The occurence of an event at any location Li is independent of the occurence
     of an event at any other location £2.

Let y be the total number of events that arise in the domain. Y has a Poisson
distribution with rate parameter A, written Y Poi(A). The pmf is

                              e-AA
                           py~y) =      for y =O,1, .. .

The mean was derived in Chapter 1, Exercise 18a. It is

                                  E[Y]   A.

Theorem 5.8. Let Y  Poi(A). Then

                               My(t) = eA(e -l)


﻿


5.3. POISSON

Proof.


293


M Y (t)  I[eY]   00 ''p ~ ) >3 e~
                y=0          y=0o   Y
         - s A e(A')Y -eA   00$c e et(AL)Y
         0O    Y.      e-a  Z...ow


D-


Theorem 5.9. Let Y Poi(A). Then


                               Var(Y) _A

Proof. Just for fun (!) we will prove the theorem two ways - first directly and
then with moment generating functions.


   Proof 1.


IB[Y2]  0 >3 -~


        00        ean      0-AAY


        ->y(y -1)

        -0°Oe-AAz±2
          ->3+
        z=0   z


So Var(y) =IB[Y2] _ (IB[Y])2= A.


﻿


5.3. POISSON


294


   Proof 2.


                           dt       t=o

                           dt2t=
                           =    tAe e et-1)
                           dtt=
                           -[AeteA(e~1) + A2e2teAe1)]
                         = A +A2

So Var(Y) =E[Y2] - (E[Y])2=A.                                          D

Theorem 5.10. Let Y~ Poi(A ) for i = 1, ... , n and let the Ys be mutually indepen-
dent. Let Y = E Y and A = E  Ai. Then Y  Poi(A).

Proof. Using Theorems 4,9 and5 we have

                  My(t) =7J My(t) =fJei(e l)     eA(et-)

which is the mgf of the Poi(A) distribution.                           D

   Suppose, for i = 1, ...,n, Y is the number of events occurring on a domain
D2; Y~ Poi(A2). Suppose the Di's are disjoint and the Y's are independent. Let
Y = E Y be the number of events arising on D = UD2. The logic of the situation
suggests that Y  Poi(A) where A = E Ai. Theorem 51 assures us that every-
thing works correctly; that Y does indeed have the Poi(A) distribution. Another
way to put it: If Y Poi(A), and if the individual events that Y counts are ran-
domly divided into two types Y1 and Y2 according to a binomial distributuion with
parameter 0, then (1) Y1 Poi(AO) and Y2 Poi(A(1 - 0)) and (2) Y1 I Y2.
   Figure    shows the Poisson pmf for A = 1, 4, 16, 64. As A increases the pmf
looks increasingly Normal. That's a consequence of Theorem 5 10 and the Central
Limit Theorem. When Y     Poi(A) Theorem 5,10 tells us we can think of Y as
Y =     _ Y where each Y~ Poi(1). (A must be an integer for this to be precise.)
Then the Central Limit Theorem tells us that Y will be approximately Normal when
A is large.


﻿


5.3. POISSON


295


k= 1


0


Q~


N~
6


O00


   0


     0

     0 000


Q~


0


10

6


0-

0 -
6


00


0


0


0


0


0


0


o  2  4   6


      Y


    X= 16


  1


Q~


00
0


6t


0


6


0


0


0

0


0


0
0


02468


   k=64


   00
   0 0
   0  0
   0 0
   0 0
   0   0
   00
   o 0
       0


  00


50 60 70 80


    Y


   0
        0
    0    0

  0       0

  0       0
0          0


  10 15 20 25


      y


Q~


0
0 -
6


Figure 5.3: Poisson pmf for A= 1, 4, 16, 64


﻿


5.3. POISSON


296


   Figure     was produced with the following snippet.

   y <- 0:7
   plot ( y, dpois(y,1), xlab="y", ylab=expression(p[Y](y)),
           main=expression(lambda==1) )
  y <- 0:10
  plot ( y, dpois(y,4), xlab="y", ylab=expression(p[Y](y)),
           main=expression(lambda==4) )
  y <- 6:26
  plot ( y, dpois(y,16), xlab="y", ylab=expression(p[Y](y)),
           main=expression(lambda==16) )
  y <- 44:84
  plot ( y, dpois(y,64), xlab="y", ylab=expression(p[Y](y)),
           main=expression(lambda==64) )


   One of the early uses of the Poisson distribution was in The probability variations
in the distribution of c particles by                          . (An c particle is a
Helium nucleus, or two protons and two neutrons.)

Example 5.1 (Rutherford and Geiger)
The phenomenon of radioactivity was beginning to be understood in the early 20th century.
In their 1910 article, Rutherford and Geiger write

      "In counting the c particles emitted from radioactive substances ... [it] is of
      importance to settle whether . . . variations in distribution are in agreement
      with the laws of probability, i.e. whether the distribution of az particles on an
      average is that to be anticipated if the az particles are expelled at random both
      in regard to space and time. It might be conceived, for example, that the
      emission of an az particle might precipitate the disintegration of neighbouring
      atoms, and so lead to a distribution of az particles at variance with the simple
      probability law."

So Rutherford and Geiger are going to do three things in their article. They're going to
count az particle emissions from some radioactive substance; they're going to derive the
distribution of az particle emissions according to theory; and they're going to compare
the actual and theoretical distributions.
    Here they describe their experimental setup.


﻿


5.3. POISSON                                                                       297

      "The source of radiation was a small disk coated with polonium, which was
      placed inside an exhausted tube, closed at one end by a zinc sulphide screen.
      The scintillations were counted in the usual way ... the number of scintilla-
      tions ... corresponding to 1/8 minute intervals were counted ....
      "The following example is an illustration of the result obtained. The numbers,
      given in the horizontal lines, correspond to the number of scintillations for
      successive intervals of 7.5 seconds.

                                                            Total per minute.
          lstminute: 3      7   4  4   2   3   2   0..... 25
          2nd"          5   2   5  4   3  5    4   2 ....   30
          3rd"          5   4   1  3   3   1   5   2 ....   24
          4th"          8   2   2  2   3  4    2   6 ....   31
          5th"          7   4   2  6   4   5  10   4 ....   42

                                Average for 5 minutes ...   30.4
                                True average ........... 31.0

And here they describe their theoretical result.

      "The distribution of a particles according to the law of probability was kindly
      worked out for us by Mr. Bateman. The mathematical theory is appended as
      a note to this paper. Mr. Bateman has shown that if x be the true average
      number of particles for any given interval falling on the screen from a constant
      source, the probability that n a particles are observed in the same interval is
      given by Zce-x. n is here a whole number, which may have all positive values
      from 0 to oo. The value of x is determined by counting a large number of
      scintillations and dividing by the number of intervals involved. The probability
      for n a particles in the given interval can then at once be calculated from
      the theory."

Refer to Bt        [1   ] for his derivation. Table 5.1 shows their data. As Rutherford
and Geiger explain:

      "For convenience the tape was measured up in four parts, the results of which
      are given separately in horizontal columns 1. to IV.
      "For example (see column I.), out of 792 intervals of 1/8 minute, in which
      3179 a particles were counted, the number of intervals 3 a particles was 152.
      Combining the four columns, it is seen that out of 2608 intervals containing


﻿


5.3. POISSON


298


      10,097 particles, the number of times that 3 a particles were observed was
      525. The number calculated from the equation was the same, viz. 525."
Finally, how did Rutherford and Geiger compare their actual and theoretical distributions?
They did it with a plot, which we reproduce as Figure 54. Their conclusion:
      "It will be seen that, on the whole, theory and experiment are in excellent
      accord. ... We may consequently conclude that the distribution of a particles
      in time is in agreement with the laws of probability and that the a particles
      are emitted at random. . . . Apart from their bearing on radioactive problems,
      these results are of interest as an example of a method of testing the laws of
      probability by observing the variations in quantities involved in a spontaneous
      material process."

Example 5.2 (neurobiology)
This example continues Example 2.6. We would like to know whether this neuron re-
sponds differently to difFerent tastants and, if so, how. To that end, we'll see how often
the neuron fires in a short period of time after receiving a tastant and we'll compare
the results for difFerent tastants. Specifically, we'll count the number of spikes in the
150 milliseconds (150 msec = .15 s) immediately following the delivery of each tastant.
(150 msec is about the rate at which rats can lick and is thought by neurobiologists to
be about the right interval of time.) Let Yj be the number of spikes in the 150 msec
following the j'th delivery of tastant i. Because we're counting the number of events in
a fixed period of time we'll adopt a Poisson model:

                                    Yj~Poi(As)

where A, is the average firing rate of this neuron to tastant i.
   We begin by making a list to hold the data. There should be one element for each
tastant. That element should be a vector whose length is the number of times that
tastant was delivered. Here is the R code to do it. (Refer to Example 2,6 for reading in
the data.)
  nspikes <- list(
     MSG100    = rep ( NA, length(tastants$MSG100) ),
     MSG300    = rep ( NA, length(tastants$MSG300) ),
     NaC1100 = rep ( NA, length(tastants$NaC1100) ),
     NaC1300 = rep ( NA, length(tastants$NaCl300) ),
     water     = rep ( NA, length(tastants$water) )
                     )


﻿


CD


0


C/1


Number     0   1   2    3   4   5    6   7  8  9 10 11 12 13 14 Number Number Average
of     a                                                          of    a of inter- number
particles                                                         particles vals
I ........ 15 56 106 152 170 122    88  50 17 12   3  0  0  1 0 3179       792      4.01
II ....... 17 39  88 116 120   98   63  37  4  9   4  1 0   0  0 2334      596      3.92
III ...... 15 56  97 139 118   96   60  26 18  3   3  1 0   0  0 2373      632      3.75
IV ...... 10  52  92 118 124   92   62  26  6  3   0  2  0  0  1 2211      588      3.76
Sum .... 57 203 383 525 532 408 273 139 45 27 10      4  0  1  1 10097     2608     3.87
Theoretica154 210 407 525 508 394 254 140 68 29 11  4  1 4     1
values


N\


﻿


5.3. POISSON


300


c: O


0,
0~
0


0

~0
E

Z


0
0

0C

0
0ql


0
0
CY)


0
0
CV


0
0


0


o


I             I      I           ---r-


I      I      I      I

0      2      4      6


8


10


1

12


                             Number of Particles in Interval


Figure 5.4: Rutherford and Geiger's Figure 1 comparing theoretical (solid line) to

actual (open circles) distribution of a particle counts.


﻿


5.3. POISSON


301


   Now we fill in each element by counting the number of neuron firings in the time
interval.

  for ( i in seq(along=nspikes) )
  for ( j in seq(along=nspikes[[i]]) )
     nspikes[[i]] [j] <- sum (       spikes[[8]] > tastants[[i]] [j]
                                   & spikes[[8]] <= tastants[[i]] [j] + .15
                               )


Now we can see how many times the neuron fired after each delivery of, say, MSG100 by
typing nspikes$MSG100.
   Figure    compares the five tastants graphically. Panel A is a stripchart. It has five
tick marks on the x-axis for the five tastants. Above each tick mark is a collection of
circles. Each circle represents one delivery of the tastant and shows how many times
the neuron fired in the 150 msec following that delivery. Panel B shows much the same
information in a mosaic plot. The heights of the boxes show how often that tastant
produced 0, 1, ... , 5 spikes. The width of each column shows how often that tastant
was delivered. Panel C shows much the same information in yet a different way. It has
one line for each tastant; that line shows how often the neuron responded with 0, 1,
    5 spikes. Panel D compares likelihood functions. The five curves are the likelihood
functions for A , ... , A .
   There does not seem to be much difference in the response of this neuron to different
tastants. Although we can compute the m.l.e. A's with

     lapply ( nspikes, mean )


and find that they range from a low of A3 ~ 0.08 for .1 M NaCl to a high of A1  0.4
for .1M MSG, panel D suggests the plausibility of A1 = ..   ~A .2.

   Figure    was produced with the following snippet.

     spiketable <- matrix ( NA, length(nspikes), 6,
                               dimnames = list ( tastant = 1:5,
                                                    counts = 0:5 )
                        )
     for ( i in seq(along=nspikes) )
       spiketable[i,] <- hist ( nspikes[[i]], seq(-.5,5.5,by=1),
                                    plot=F )$counts


﻿


5.3. POISSON


302


A Neuron's Responses

          to 5 Tastants


    A                                   B


LO -


I)


~0
E


                 0

              0

o

000 OD    0   0

amC 0    0  GIlD D D


0


0-


1


2


3


4


5


tastant


tastant


  C


D


C-
0
U)


00


0


0


0
0


0


0]


v


0.0


0.2


I    I    I   I
0.4 0.6 0.8  1.0


0   1   2    3   4   5


number of firings


lambda


Figure 5.5: Numbers of firings of a neuron in 150 msec after five different tas-
tants. Tastants: 1=MSG .1M; 2=MSG .3M; 3=NaCl .1M; 4=NaCl .3M; 5=water.
Panels: A: A stripchart. Each circle represents one delivery of a tastant. B: A mo-
saic plot. C: Each line represents one tastant. D: Likelihood functions. Each line
represents one tastant.


﻿


5.4. UNIFORM


303


     freqtable <- apply ( spiketable, 1, function(x)x/sum(x) )


   " The line spiketable <- ... creates a matrix to hold the data and illustrates
     the use of dimnames to name the dimensions. Some plotting commands use those
     names for labelling axes.

   " The line spiketable [i,] <- . . . shows an interesting use of the hist com-
      mand. Instead of plotting a histogram it can simply return the counts.

   " The line freqtable <- ... divides each row of the matrix by its sum, turning
      counts into proportions.


   But let's investigate a little further. Do the data really follow a Poisson distribution?
Figure shows the Poi(.2) distribution while the circles show the actual fractions of
firings. There is apparently good agreement. But numbers close to zero can be deceiving.
The R command dpois ( 0:5, .2 ) reveals that the probability of getting 5 spikes is
less than 0.00001, assuming A ~ 0.2. So either the A2's are not all approximately .2,
neuron spiking does not really follow a Poisson distribution, or we have witnessed a very
unusual event.

   Figure    was produced with the following snippet.

   matplot ( 0:5, freqtable, pch=1, col=1,
              xlab="number of firings", ylab="fraction" )
  lines ( 0:5, dpois ( 0:5, 0.2 ) )


5.4 The Uniform Distribution

The Discrete Uniform Distribution The discrete uniform distribution is the dis-
tribution that gives equal weight to each integer 1, ... , n. We write Y  ~ U(1, n).
The pmf is
                                   p(y) = 1/n                              (5.2)
for y = 1, ... , n. The discrete uniform distribution is used to model, for example,
dice rolls, or any other experiment in which the outcomes are deemed equally


﻿


5.4. UNIFORM


304


0


0
0


00
0


OD

0ql


0


CN
0


0
0


8
0


I        I       I       I       I        I

0       1       2        3       4       5


                                        number of firings


Figure 5.6: The line shows Poisson probabilities for A = 0.2; the circles show the
fraction of times the neuron responded with 0, 1, ..., 5 spikes for each of the five
tastants.


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


305


likely. The only parameter is n. It is not an especially useful distribution in practical
work but can be used to illustrate concepts in a simple setting. For an applied
example see Exercise 22.

The Continuous Uniform Distribution The continuous uniform distribution is
the distribution whose pdf is flat over the interval [a, b]. We write Y  U(a, b).
Although the notation might be confused with the discrete uniform, the context
will indicate which is meant. The pdf is

                              p(y) = 1/(b - a)

for y E [a, b]. The mean, variance, and moment generating function are left as
Exercise23.
   Suppose we observe a random sample yi,..., y, from U(a, b). What is the
m.l.e.(&, b)? The joint density is

                               f(b )" ifa < y(1) and b ; Y()
                p( yi, . . . ,yn) \=a
                               0       otherwise

which is maximized, as a function of (a, b), if b - a is as small as possible without
making the joint density 0. Thus, & = y(i) and b = y .


5.5 The Gamma, Exponential, and Chi Square Distri-
       butions

"F" is the upper case Greek letter Gamma. The gamma function is a special mathe-
matical function defined on IR+ as

                            F(a) =     to-e-' dt
                                    0o

Information about the gamma function can be found in mathematics texts and
reference books. For our purposes, the key facts are:

                         F(a + 1) = ar(a) for a > o
                    F(n) = (n - 1)! for positive integers n
                                17(1/2) =wv


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE                                   306

For any positive numbers a and Q, the Gamma(a, ,) distribution has pdf

                    p(y)     1   y-le-/       for y > 0              (5.3)

We write Y ~ Gam(a, Q).
   Figure    shows Gamma densities for four values of a and four values of 3.

   " In each panel of Figure  the curves for different ax's have different shapes.
     Sometimes a is called the shape parameter of the Gamma distribution.

   " The four panels look identical except for the axes. I.e., the four curves with
     a = .5, one from each panel, have the same shape but different scales. The
     different scales correspond to different values of 3. For this reason , is called
     a scale parameter. One can see directly from Equation that , is a scale
     parameter because p(y) depends on y only through the ratio y/, . The idea of
     scale parameter is embodied in Theorem  . See Section  for more on
     scale parameters.


   Figure    was produced by the following snippet.

   par ( mfrow=c(2,2) )

   shape <- c ( .5, 1, 2, 4 )
   scale <- c ( .5, 1, 2, 4 )
   leg   <- expression ( alpha == .5, alpha == 1,
                         alpha == 2, alpha == 4 )

  for ( i in seq(along=scale) ) {
    ymax <- scale [i] *max(shape) + 3*sgrt(max(shape))*scale [i]
    y <- seq ( 0, ymax, length=100 )
    den <- NULL
    for ( sh in shape )
      den <- cbind ( den, dgamma(y,shape=sh,scale=scale[il) )
    matplot ( y, den, type="l", main=letters[il, ylab="p(y)" )
    legend ( ymax*.1, max(den[den!=Inf]), legend = leg )
  }


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


                        p3= 0.5


307


p3=1


0
CY6


IC)


a


0 _


0 _


a


0 _


TC

0O


O

  0 12 34 5


        y


O  '

  0 246 8


        y


p3=2


p3=4


    0

    CD -


a 0

    c' _
    0


    0


    co~
    0


$Q c'


    0


O

  0  5  10 15 20


        y


O

  0 10 20 30 40


        y


Figure 5.7: Gamma densities for various values of ai and , .


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


308


Theorem 5.11. Let X ~ Gam(a,#,Q) and let Y = cX. Then Y Gam(a, c#).
Proof. Use Theorem 1 1.

                         px(X) =    1    -e
                           px  -  (a)#4x
Since Y = cX, x= y/c and dz/dy = 1/c, so

                       Py()=cF()      (y/c)-le Yc/3
                                   1    (y)Yl eY/c1)
                              F(a) (c#)a
which is the Gam(a, c#) density. Also see Exercise 9.
   The mean, mgf, and variance are recorded in the next several theorems.
Theorem 5.12. Let Y Gam(a,#13) Then E(Y] = a#.
Proof.

                E[Y]=      y      Ye -/7dy
                       in F  ( )a
                       (a+1)o3 J°O 1(a +>a1)+1 ae      dy


The last equality follows because (1) ]7(a + 1) c=da(a), and (2) the integrand is a
Gamma density so the integral is 1. Also see Exercise 9.               D
   The last trick in the proof - recognizing an integrand as a density and conclud-
ing that the integral is 1 - is very useful. Here it is again.
Theorem 5.13. Let Y ~ Gam(a, 3). Then the moment generating function is MY (t)
(1 - t#) -Ofor t < 1/#.
Proof.

                My (t) =    etY      y-1e-/fdy
                           oF(a)#a
                       -t i )a f00      ___ y          dy
                          Oa Jo F(a)( 1-t()a
                      = (1 - t3)-


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


309


Theorem 5.14. Let Y ~ Gam(a, , ). Then

                                 Var(Y) = a#2

and
                                SD(Y) =      j.

Proof. See Exercise   .

The Exponential Distribution We often have to deal with situations such as

   " the lifetime of an item

   " the time until a specified event happens

   The most fundamental probability distribution for such situations is the expo-
nential distribution. Let Y be the time until the item dies or the event occurs. If Y
has an exponential distribution then for some A > 0 the pdf of Y is

                        PY (y) = A-le-y/A    for y > 0.

and we write Y    ~ Exp(A). This density is pictured in Figure     (a repeat of
Figure    ) for four values of A. The exponential distribution is the special case
of the Gamma distribution when a = 1. The mean, SD, and mgf are given by
Theorems       -
   Each exponential density has its maximum at y = 0 and decreases monotoni-
cally. The value of A determines the value py (0 A) and the rate of decrease. Usually
A is unknown. Small values of y are evidence for large values of A; large values of
y are evidence for small values of A.
Example 5.3 (Radioactive Decay)
It is well known that some chemical elements are radioactive. Every atom of a radioactive
element will eventually decay into smaller components. E.g., uranium-238 (by far the
most abundant uranium isotope, 238U) decays into thorium-234 and an av particle while
plutonium-239 (the isotope used in nuclear weapons, 239Pu) decays into uranium-235
(235U) and an av particle.
(See http://www.epa.gov/radiation/radionuclides for more information.)
   The time Y at which a particular atom decays is a random variable that has an
exponential distribution. Each radioactive isotope has its own distinctive value of A. A
radioactive isotope is usually characterized by its median lifetime, or half-life, instead of


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


310


0


00-


to -


Q


-                           lambda = 2
-                       - --lambda = 1
-                       -"-"-"-lambda = 0.2
-                       ---lambda = 0.1


.1


NM


0


0.0


0.5


1.0


1.5


2.0


x


Figure 5.8: Exponential densities


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


311


A. The half-life is the value m which satisfies P[Y < m] = P[Y > m] = 0.5. The
half-life m can be found by solving

                                 A-1 e-Y/A dy = 0.5.

The answer is m = Alog 2. You will be asked to verify this claim in Exercise 29.
   Uranium-238 has a half-life of 4.47 billion years. Thus its A is about 6.45 billion.
Plutonium-239 has a half-life of 24,100 years. Thus its A is about 35,000.

   Exponential distributions have an interesting and unique memoryless property.
To demonstrate, we examine the Exp(A) distribution as a model for T, the amount
of time a computer Help line caller spends on hold. Suppose the caller has already
spent t minutes on hold; i.e., T > t. Let S be the remaing time on hold; i.e.,
S = T - t. What is the distribution of S given T > t? For any number r > 0,

                                                  P[T >t +r, T >t]
           P[S>r |T>t]P[T>t+r |T>t]P                  [T
                                                       P(T > t]
                             P[T > t + r]   A-le-(t+r)/A-
                               P[T > t]      A-le-t/A

In other words, S has an Exp(A) distribution (Why?) that does not depend on
the currently elapsed time t (Why?). This is a unique property of the Exponential
distribution; no other continuous distribution has it. Whether it makes sense for
the amount of time on hold is a question that could be verified by looking at data.
If it's not sensible, then Exp(A) is not an accurate model for T.

Example 5.4
Some data here that don't look exponential


The Poisson process There is a close relationship between Exponential, Gamma,
and Poisson distributions. For illustration consider a company's customer call cen-
ter. Suppose that calls arrive according to a rate A such that

   1. in a time interval of length T, the number of calls is a random variable with
     distribution Poi(AT) and

   2. if I1 and i2 are disjoint time intervals then the number of calls in I1 is inde-
     pendent of the number of calls in I2.


﻿


5.5. GAMMA, EXPONENTIAL, CHI SQUARE


312


When calls arrive in this way we say the calls follow a Poisson process.
   Suppose we start monitoring calls at time to. Let T1 be the time of the first
call after to and Y1 = T1 - to, the time until the first call. T1 and Y1 are random
variables. What is the distribution of Y1? For any positive number y,
                  Pr[Y1 > y] = Pr[no calls in [to, to + y]] = e -
where the second equality follows by the Poisson assumption. But
   Pr[Y > y] = e-A  - Pr[Y < y] = 1 - e-A -> py(y) =Ae--Y 1 ~ Exp(1/A)
What about the time to the second call? Let T2 be the time of the second call after
to and Y2 = T2 - to. What is the distribution of Y2? For any y > 0,
              Pr[Y2 > y] = Pr [fewer than 2 calls in [to, y]]
                        = Pr[O calls in [to, y]] + Pr[1 call in [to, y]]
                        = e-/ + yAe7Y
and therefore
                                                     2   _A
                 p Y2 (y)= AeY - Ae7A + yA 2        AyeY
                                                   F(2)
so Y2   Gam(2, 1/A).
   In general, the time Yn until the n'th call has the Gam(n, 1/A) distribution. This
fact is an example of the following theorem.
Theorem 5.15. Let Y1,...,Y, be mutually independent and let Y~ Gam(a,,3).
Then
                           Y =     Y~Gam(a,,3)
where a   L as.
Proof. See Exercise 30.
   In Theorem 5,15 note that the Y's must all have the same 13 even though they
may have different ar's.
   Poisson-Gamma conjugacy F = Gam/Gam

The Chi-squared Distribution The Gamma distribution with /3= 2 and a = p/2
where p is a positive integer is called the chi-squared distribution with p degrees of
freedom. We write Y  4
Theorem 5.16. Let Y1,... , Y, ~i.i.d. N(0, 1). Define X =  Y2. Then X~x.


Proof. This theorem will be proved in Section


F-


﻿


5.6. BETA


313


5.6 The Beta Distribution

For positive numbers a and Q, the Beta(a, 3) distribution is a distribution for a
random variable Y on the unit interval. The density is

                py(y) =   (a+)a-1(1 - y)-1         for y E [0, 1]
                        F(z)F7(3)~

The parameters are (a, /3). We write Y  Be(a, #3). The mean and variance are
given by Theorem.

Theorem 5.17. Let Y  Be(a,/3). Then

                           E[Y]=    a

                         Var(Y) (=2(,3
                                   (aZ + #)2(a + + 1

Proof. See Exercise 6.

   Figure 59 shows some Beta densities. Each panel shows four densities having
the same mean. It is evident from the Figure and the definition that the parameter
a (/3) controls whether the density rises or falls at the left (right). If both a > 1 and
/ > 1 then p(y) is unimodal. The Be(1, 1) is the same as the U(0, 1) distribution.
   The Beta distribution arises as the distribution of order statistics from the U(0, 1)
distribution. Let x1, ... , x~ i.i.d. U(0, 1). What is the distribution of x(l), the first
order statistic? Our strategy is first to find the cdf of x(l), then differentiate to get
the pdf.

                   Fxm)(x) = P[X(1) < zc]
                           = 1 - P[all Xi's are greater than x]
                           S1- (1-z)

Therefore,

                      d_11_ F(n+                        -X1)
           px()  =d d FX1 (x) = n(1 - z)"-1 = F(1)F(n)

which is the Be(1, n) density. For the distribution of the largest order statistic see
Exercise 27.


﻿


5.6. BETA                                                  314


                                    a


                        0.0(b 02                0.(060. ,1 1.0


                 o'                     ..  (a b) _ (3,12)-. .-


                 0.0    0.2     0.4    0.6    0.8    1.0

                                    y


                                    C


                        - (ab) = (.3003
             >~ co     (a.b) (ab) =(,.3
                 N (a,b)=(0,.......)


              0                                 -,--

                 0.0    0.2     0.4    0.6    0.8    1.0

                                    y


         Eiur 59:Bea enites- : et dnstis it man.2 b Btadesiie wt

    men 5 C Bt dnsteswihmen 9


﻿


5.6. BETA                                                           315


   Figure   was produced by the following R snippet.

   par ( mfrow=c(3,1) )
   y <- seq ( 0, 1, length=100 )
   mean <- c ( .2, .5, .9 )
   alpha <- c ( .3, 1, 3, 10 )
   for ( i in 1:3 ) {
      beta <- (alpha - mean[i]*alpha) / mean[i]
      den <- NULL
      for ( j in 1:length(beta) )
        den <- cbind ( den, dbeta(y,alpha[j],beta[j]) )
      matplot ( y, den, type="l", main=letters[i], ylab="p(y)" )
      if ( i == 1 )
        legend ( .6, 8, paste ( "(a,b) = (", round(alpha,2), ",",
                   round(beta,2), ")", sep="" ), lty=1:4 )
      else if ( i == 2 )
        legend ( .1, 4, paste ( "(a,b) = (", round(alpha,2), ",",
                  round(beta,2), ")", sep="" ), lty=1:4 )
      else if ( i == 3 )
        legend ( .1, 10, paste ( "(a,b) = (", round(alpha,2), ",",
                  round(beta,2), ")", sep="" ), lty=1:4 )


   The Beta density is closely related to the Gamma density by the following the-
orem.


Theorem 5.18. Let X1 ~ Gam(ai, , ); X2~ Gam(a2, , ); and X1 I X2. Then

                               X1
                        Y --   ~, Be(ai, a2)
                            X1+X2

Proof. See Exercise .


   Note that Theorem    requires X1 and X2 both to have the same value of Q,
but the result doesn't depend on what that value is.


﻿


5.7. NORMAL


316


5.7 The Normal and Related Distributions

5.7.1 The Univariate Normal Distribution
The histograms in Figure    on page

   " are approximately unimodal,

   " are approximately symmetric,

   " have different means, and

   " have different standard deviations.

Data with these properties is ubiquitous in nature. Statisticians and other scientists
often have to model similar looking data. One common probability density for
modelling such data is the Normal density, also known as the Gaussian density.
The Normal density is also important because of the Central Limit Theorem.
   For some constants p E R and a > 0, the Normal density is

                        p(  | p, (7) =    C e--2   .                   (5 .4)
                                     ( 1    1xP)2(5.4)


Example 5.5 (Ocean temperatures, continued)
To see a Normal density in more detail, Figure  reproduces the top right histogram
from Figure    redrawn with the Normal density overlaid, for the values y ~ 8.08 and
a   0.94. The vertical axis is drawn on the density scale. There are 112 temperature
measurements that go into this histogram; they were recorded between 1949 and 1997;
their latitudes are all between 440 and 460; their longitudes are all between -21 and
-19°.

   Figure    was produced by the following R snippet.

   good <- abs ( med.1000$ln - lons[3] ) < 1 &
             abs ( med.1000$lat - lats[1] ) < 1
    temps <- med.1000$temp[good]
    hist ( temps, xlim=c(4,12), breaks=seq(4,12,by=.5),
            freq=F, xlab="temperature", ylab="density", main = "")
    mu <- mean ( temps )
    sig <- sqrt ( var ( temps ) )
    x    <- seq ( 4, 12, length=60 )


﻿


5.7. NORMAL


317


QZI


' /1


)
0


I


% i -11,-;


0
0


4


6


8


10


12


                                    temperature


Figure 5.10: Water temperatures (°C) at 1000m depth, 44 - 46°N latitude and
19 - 21°W longitude. The dashed curve is the N(8.08,0.94) density.


     lines ( x, dnorm(x,mu,sig), lty=2 )


   Visually, the Normal density appears to fit the data well. Randomly choosing one of
the 112 historical temperature measurements, or making a new measurement near 45°N
and 20°W at a randomly chosen time are like drawing a random variable t from the
N(8.08,0.94) distribution.
   Look at temperatures between 8.50 and 9.0°C. The N(8.08, 0.94) density says the
probability that a randomly drawn temperature t is between 8.50 and 9.0°C is


P [t C (8.5, 9.0]]


     1          1 2t-8.082
 S9.0             0     dt ~ 0.16.
J8.5 2r 0.94


(5.5)


   The integral in Equation  is best done on a computer, not by hand. In R it can be
done with pnorm(9.0,8.08, .94 ) - pnorm(8.5,8.08, .94 ). A fancier way to do it
is diff(pnorm(c(8.5,9),8.08, .94)).

   . When x is a vector, pnorm(x,mean,sd) returns a vector of pnorm's.


﻿


5.7. NORMAL


318


. When x is a vector, diff (x) returns the vector of differences x [2] -x [1] ,


x[3]-x[2],


..., x[n]-x[n-1].


   In fact, 19 of the 112 temperatures fell into that bin, and 19/112 ~ 0.17, so the
N(8.08, 0.94) density seems to fit very well.
   However, the N(8.08, 0.94) density doesn't fit as well for temperatures between 7.50
and 8.0°C.


P [t C (7.5, 8.0]]


h 0 1   1 (t-.0 .2


dt   0.20.


In fact, 15 of the 112 temperatures fell into that bin; and 15/112 ~ 0.13. Even so, the
N(8.08,0.94) density fits the data set very well.
Theorem 5.19. Let Y ~ N(p, (). Then
                                          2
                               my (t) = e 2 + t.


Proof.


My(t) = J

      -J


ety  1 -$22(y-) dy
    277
    1 __-21
      S_2 (y2-(2p-2.2t)y+p2) dy
 2wu


2~


1_
    C
27 (


     /          Q2t)2
2Q (y-l/ + 2t))2+ f +  d'J
                      /


2      4 2
C   2Q2

C 2 + /t


D


The technique used in the proof of Theorem


is worth remembering, so let's


look at it more abstractly. Apart from multiplicative constants, the first integral in
the proof is


ete-21(y- ) dy


I e22 (y-i)2+ty dy


The exponent is quadratic in y and therefore, for some values of a, b, c, d, C, and f
can be written


1
2 (-/_t2 +ty


ay2+by+c


Sy-d )2


﻿


5.7. NORMAL


319


This last expression has the form of a Normal distribution with mean d and SD
e. So the integral can be evaluated by putting it in this form and manipulating
the constants so it becomes the integral of a pdf and therefore equal to 1. It's a
technique that is often useful when working with integrals arising from Normal
distributions.
Theorem 5.20. Let Y  N(p, a). Then

                      E([Y] = y   and     Var(Y)=ua2.

Proof. For the mean,

                   E([Y] = MY (0)= (ta2 + p)e 2 t
                                                  t=0
For the variance,

         [Y2] =M"0) = a2e() 2 +t + (ta2 + p)2e 22 +t  -= 2 + p2
                                                      t=0
So,
                        Var(Y) = E[Y2] - B[Y]2 =2.


   The N(O, 1) distribution is called the standard Normal distribution. As Theo-
rem 5,21 shows, all Normal distributions are just shifted, rescaled versions of the
standard Normal distribution. The mean is a location parameter; the standard
deviation is a scale parameter. See Section 8.6.
Theorem 5.21.
   1. If X ~N(0, 1) and Y = aX + pithenY   N(p, a).

   2. IfYr~N( p, a) and X = (Y-p,u)/athenX~N(0, 1).
Proof.   1. Let X   N(O, 1) and Y = aX + p. By Theorem 4,8
                                                   u2t2
                           MY(t)=e"tMx(at) =-ete 2

   2. Let Y ~ N(p, a) and X = (Y-p,u)/a. Then

                  Mx(t) = e-"'"My(t/a) =e- Le 2±"+la -


D-


﻿


5.7. NORMAL


320


   Section55 introduced the x2 distribution, noting that it is a special case of the
Gamma distribution. The x2 distribution arises in practice as a sum of squares of
standard Normals. Here we restate Theorem  ., then prove it.
Theorem 5.22. Let Y1, ..., Yn ~ i.i.d. N(0, 1). Define X =LY2. Then X 4
Proof. Start with the case n = 1.

                 Mx(t) =B[e 2]  f e'us 1e_ dy

                          1 -      (1_2 )y2

                          -(  -t1/2 f  1 -2t1 (12t)y2 d

                        - (1 -2t
So X    Gam(1/2, 2) =x10
   If n > 1 then by Corollary 410
                     Mx(t) = My2+...+y2(t) = (1 - 2t)-n/2
So X    Gam(n/2,2) =x* .

5.7.2 The Multivariate Normal Distribution

Let X be an n-dimensional random vector with mean p  d and covariance matrix
Eg. We say that X has a multivariate Normal distribution if its joint density is

                      (X)        1                                     (5.6)
                            (27r)n/2
where |El refers to the determinant of the matrix E. We write X  N(p, E). Com-
parison of Equations54 (page 316) and 5k6 shows that the latter is a generaliza-
tion of the former. The multivariate version has the covariance matrix E in place
of the scalar variance a2.
   To become more familiar with the multivariate Normal distribution, we begin
with the case where the covariance matrix is diagonal:

                                 : 0     0   ---
                                 0  a2   0   ---
                              0 0n .:


﻿


5.7. NORMAL


321


In this case the joint density is

                   p( 4)         1e
                           (27r)n/2
                              1        (I\     i  1(',-2
                           1      1)    1(-   2

                        =            e 2(7


the product of n separate one dimensional Normal densities, one for each dimen-
sion. Therefore the Xi's are independent and Normally distributed, with Xi
N(pu, o-). Also see Exercise 3.
   When ai =="   n =1, then E is the n-dimensional identity matrix In. When,
in addition, p1 u= - - = pn= 0, then X  N(0, In) and X is said to have the
standard n-dimensional Normal distribution.
   Note: for two arbitrary random variables X1 and X2, X1 I X2 implies Cov(X1, X2)
0; but Cov(X1, X2) = 0 does not imply X1 I X2. However, if X1 and X2 are jointly
Normally distributed then the implication is true. I.e. if (X1, X2) N(p, E) and
Cov(X1, X2) = 0, then X1 I X2. In fact, something stronger is true, as recorded in
the next theorem.

Theorem 5.23. Let X = (X1,..., Xn)     N(p, E) where Z has the so-called block-
diagonal form
                             E11 012     --"-  01m'
                             021 E22     -" -  02m

                             0mi -     0mm-1 Ymm

where Yi is an n2 x n2 matrix, O1 is an ni x n3 matrix of O's and E  nM in. Partition
X to conform with E and define Yr's: Y1 = (X1,..., Xn1), Y2 = (Xn1+1, ...,Xn1+n2),
... ., m = (Xni+...+nmi+i. . . , Xnm) and vi's: vi = (ji,... ,pn1), 72 = (pni+1,-... ,n1+n2),
.    r. , um = (pni+...+nm-1+1, - - - ,Jpam). Then

   1. The Y's are independent of each other, and


2.Y N(v2,E22)


﻿


5.7. NORMAL


322


Proof. The transformation X -(fi, ... , fm) is just the identity transformation, so

  pi ...Ym(fi, . . . ,m)


  To learn more about the multivariate Normal density, look at the curves on
which p= is constant; i.e., {J pg (?)  c} for some constant c. The density depends
on the zi's through the quadratic form (z - pj)tE-1(z - pi), so pg is constant where
this quadratic form is constant. But when E is diagonal, ( - p)tE--(z -
   D~e- p )2/of so pg()=c is the equation of an ellipsoid centered at y~ and with
eccentricities determined by the ratios o-/oj.
   What does this density look like? It is easiest to answer that question in two
dimensions. Figure shows three bivariate Normal densities. The left-hand col-
umn shows contour plots of the bivariate densities; the right-hand column shows
samples from the joint distributions. In all cases, EX1] =EX2]  0. In the
top row, o-xt =o-x  1; in the second row , ox  1; ox  2; in the third row ,
-xi =  1/2; o-x2 = 2. The standard deviation is a scale parameter, so changing the
SD just changes the scale of the random variable. That's what gives the second
and third rows more vertical spread than the first, and makes the third row more
horizontally squashed than the first and second.

   Figure      was produced with the following R code.

par ( mfrow=c(3,2) ) # a 3 by 2 array of plots

x1 <- seq(-5,5,length=60)
x2 <- seq(-5,5,length=60)
den.1 <- dnorm ( xl, 0, 1 )
den.2 <- dnorm ( x2, 0, 1 )
den.jt <- den.1 %o% den.2
contour ( x1, x2, den.jt, xlim=c(-5,5), ylim=c(-5,5), main="(a)",


﻿


5.7. NORMAL


323


(a)


(b)


  C\J-
N


  C\J-
N


-4 -2  0   2  4


-4 -2  0   2  4

       X1


X1


(c)


(d)


  C\J-
N


  C\J-
N


-4 -2  0   2  4


-4 -2  0   2  4

       X1


X1


(e)


(f)


  C\J-
N


  C\J-
N


-4 -2  0   2  4

       X1


-4 -2  0   2  4

       X1


Figure 5.11: Bivariate Normal density. IB[X1] - IB[X2] =0.

(a), (d): cUx1         2U2 .

(e), (f): LTX1 ='72; 2-2.

(a), (c), (e): contours of the joint density.

(b), (d), (f): samples from the joint density.


﻿


5.7. NORMAL


324


   Now let's see what happens when E is not diagonal. Let Y N(pp, E), so


                           (27r)n/2 y

and let X    N(O, In). X is just a collection of independent N(O, 1) random vari-
ables. Its curves of constant density are just (n- 1)-dimensional spheres centered at


﻿


5.7. NORMAL


325


the origin. Define Z = E1/2Z + p. We will show that pz = pg, therefore that Z and
Y have the same distribution, and therefore that any multivariate Normal random
vector has the same distribution as a linear transformation of a standard multivari-
ate Normal random vector. To show pz = pg we apply Theorem . The Jacobian
of the transformation from i to Z is |E|1/2, the square root of the determinant of
E. Therefore,


pz (#) = pX (y-1/2(Y - -  )) y -1/2


1  e-1/2(-12(g-p))t(-12(g-p)) -1/2
2,


1          1y


The preceding result says that any multivariate Normal random variable, Y in our
notation above, has the same distribution as a linear transformation of a standard
Normal random variable.
   To see what multivariate Normal densities look like it is easiest to look at
2 dimensions. Figure       shows three bivariate Normal densities. The left-
hand column shows contour plots of the bivariate densities; the right-hand column
shows samples from the joint distributions. In all cases, E[X1] = E[X2] = 0 and
r1 = Cr2 = 1. In the top row, (1,2 = 0; in the second row, (1,2 = .5; in the third row,
Qi,2 = -.8.


Figure


was produced with the following R code.


par ( mfrow=c(3,2) ) # a 3 by 2 array of plots
npts <- 60
sampsize <- 300

x1 <- seq(-5,5,length=npts)
x2 <- seq(-5,5,length=npts)

Sigma <- array ( NA, c(3,2,2) )
Sigma[1,,] <- c(1,0,0,1)
Sigma[2,,]  <- c(1,.5,.5,1)
Sigma[3,,] <- c(1,-.8,-.8,1)


﻿


5.7. NORMAL


326


a


b


  C\J-
N


  C\J-
N


-4 -2  0   2  4


-4 -2  0   2  4

       X1


X1


C


d


  C\J-
N


  C\J-
N


-4 -2  0   2  4


-4 -2  0   2  4

       X1


X1


e


f


  C\J-
N


  C\J-
N


-4 -2  0   2  4

       X1


-4 -2  0   2  4

       X1


Figure 5.12: Bivariate Normal density. E[Xi] =E[X2] = 0; 91 = 92 = 1.

(a), (b): oi,2 = 0.

(c), (d): o1,2 = .5.
(e), (f): X1,2 = -.8.
(a), (c), (e): contours of the joint density.
(b), (d), (f): samples from the joint density.


﻿


5.7. NORMAL


327


   We conclude this section with some theorems about Normal random variables
that will prove useful later.


Theorem 5.24. Let X    N(p, E) be an n-dimensional Normal random variable; let
A be a full rank n by n matrix; and let Y = AX. Then Y  N(Au, AZAt).


﻿


5.7. NORMAL


328


Proof By Theorem 4.4 (pg. 268),

             pg (y) =-pX (A-1#)|A-1

                    (27r)n/2 A l l E 2
                             1   e-i(A-ly(-u)E-1(A-1-A))
                             1e

                    (2r)n/2 A l l 2
                             1   e 2-i -A)*(-)*E-1(A-1(--Aus
                             1e

                    (2r)n/2 All l
                             1e1

                    (2r)n/2 AEAt12e

which we recognize as the N(Au, AEAt) density.                       D

Corollary 5.25. Let X N(p, E) be an n-dimensional Normal random variable; let
A be a full rank n by n matrix; let b be a vector of length n; and let Y = AX + b.
Then Y ~N(Au+b,AZAt).

Proof. See Exercise 33.                                              D

Corollary 5.26. Let X1, ... , X,  i.i.d. N(p, o-). Define S2 =L21(X2 - X)2. Then
X I S2.

Proof. Define the random vector Y= (Y1,..., Y)t by

                              Y1 = X1 - X
                              Y2 = X2 - X


                            Yn_1 = Xn_1 - X
                              Y = X

The proof follows these steps.

  1. S2 is a function only of (Y1, ... , Y_1)t; i.e. not a function of Yn.

  2. (Y1,...,Y_1) Y.


﻿


5.8. T AND 3F


329


1. L:2_1(X2 - X) = 0. Therefore, (Xn - X)


Ef_- (XZ - X). And therefore


     n-1
S2 =((Xi
     i=1


       n-1
X)2+    ((X
        i=1i


    2
X)


n-1
iY2
i=1


i~i


is a function of (Y1,... , Yn_1)t.


2.


      1

Y=


  1     1
  n     n
_1   1 _
n       n


1
n
1
n


1
n


... _1\
       n


       1


AX


1
n
1
n


1
n
1
n


   where the matrix A is defined by the preceding equation, so
   Y N(Au, a2AAt). The first n - 1 rows of A are each orthogonal to the last
   row. Therefore
                            AAt _ (En
                                    60t  1/n

   where Ell has dimension (n - 1) x (n - 1) and 0 is the (n - 1)-dimensional
   vector of 0's. Thus, by Theorem 52, (Y1,... , Y1)t I Yn.

3. Follows immediately from 1 and 2.


D-


5.8 The t and F Distributions


5.8.1 The t distribution

The t distribution arises when making inference about the mean of a Normal dis-
tribution.
   Let X1, ..., Xn ~ i.i.d. N(p, a) where both p and a are unknown, and suppose
our goal is to estimate p. ft= X is a sensible estimator. Its sampling distribution is
X ~ N(p, 6/ /n) or, equivalently,

                                    ~ N(0, 1)


﻿


5.8. T AND 3F


330


We would like to use this equation to tell us how accurately we can estimate P.
Apparently we can estimate y to within about +2a//\ most of the time. But that's
not an immediately useful statement because we don't know a. So we estimate a
by   = (n-1 L(XZ - X)2)1/2 and say


                              X -p
                                    ~ N(0, 1),
                              /in

approximately. This section derives the exact distribution of (X - p)/( /\/i) and
assesses how good the Normal approximation is. We already know from Corol-
lary      that X 1 a. Theorem 5.28 gives the distribution of S2=       2
E(X, - X)2. First we need a lemma.

Lemma 5.27. Let V = V1+V2 and W = W1+W2 where V1 I V2 and W1 IW2. If V
and W have the same distribution, and if V1 and W1 have the same distribution, then
V2 and W2 have the same distribution.


Proof. Using moment generating functions,


                          Mv2(t) = Mv(t)/Mv1(t)
                                 = Mw(t)/Mw1(t)
                                 = Mw2(t)


Theorem 5.28. Let X1, ...,X,   i.i.d. N(p, a). Define S2 = E" (X2 - X)2. Then

                                S2     2
                                  2 n-1

Proof. Let

                                 (       I   2


﻿


5.8. T AND 3F


331


Then V ~  and
             " n  X  - X) +(X  -p)

          V=(
              i=1


i=1


i=1
S2
2 +v2


            2
+n  +)2(X

           2
 +X


n
(Xi
2-1


where S2/a2 I V2 and V2  xI. But also,


i21


+   x 2


W1+W2


where 1  1W2, W1 ~_1 and W2 ~x .Now
Lemma 5,2.


the conclusion follows by
                      D


Define


       n-1 X-
T  =/vI


fri(X-/p)/6
V/2/(n - 1)U2


Then by Corollary .6 and Theorem 5.28, T has the distribution of
U/ V/(n - 1) where U    N(0, 1), V x_, and U I V. This distribution is called
the t distribution with n - 1 degrees of freedom. We write T  tn_1. Theorem 5,29
derives its density.


Theorem 5.29. Let U  N(0,1), V   x, and U I V. Then T
density


U/ V/Vp has


        F(p)p  
pT~t)=  2 rp (t2 +p)
         ~)2


p+1    F(P+l)
2         2
     F(2) p-


(1


- p+

p


Proof. Define


      U
T     Vp


and  Y=V


﻿


5.8. T ANDEF                                                   332

We make the transformation (U, V) - (T, Y), find the joint density of (T, Y), and
then the marginal density of T. The inverse transformation is


                       U T--     and   V =Y


The Jacobian is
                      dU dU    y2 Ty 2

                      dT dY     0   1
The joint density of (U, V) is


Y2i

p


pu) v u, V)e


u2  1   p_1 -
   2  pv2  e


2


Therefore the joint density of (T, Y) is

                              1 ty  1    p_1 _y- Y
                PTY (t, Y)    eC        y 2 e 2
                            27    F(2)222p

and the marginal density of T is


PT (t) =fPTYtYdY

                    ply2
        2wF(2)2z


1 - (
e ~dy


      1     r(p-+ -1(2p NJ
        p2-F2) t2 +p)

        1 ~op+
        x (0p(+l (22;p 1Y2

   p(P2l)  /2+p
 T'(2)pp/2(t2+~p< 21

   2_            2
F s) p   1


p+1
2


4e2p/(t2+p) dy


D-


﻿


5.8. T AND F                                                             333

   Figure      shows the t density for 1, 4, 16, and 64 degrees of freedom, and
the N(0, 1) density. The two points to note are

   1. The t densities are unimodal and symmetric about 0, but have less mass in
     the middle and more mass in the tails than the N(0, 1) density.

   2. In the limit, as p -- 0c, the t, density appears to approach the N(0, 1) density.
     (The appearance is correct. See Exercise .)


   Figure      was produced with the following snippet.

     x <- seq ( -5, 5, length=100 )
     dens <- cbind ( dt(x,1), dt(x,4), dt(x,16), dt(x,64),
                      dnorm(x) )
    matplot ( x, dens, type="l", ylab="density", xlab="t",
               lty=c(2:5,1), col=1 )
     legend ( x=-5, y=.4, lty=c(2:5,1),
              legend=c ( paste("df = ", c(1,4,16,64)),
                           "Normal" ) )


   At the beginning of Section      we said the quantity fn(X - p)/& had a
N(0, 1) distribution, approximately. Theorem  derives the density of the related
quantity  n - 1 (X - p) /& which has a t,_1 distribution, exactly. Figure  shows
how similar those distributions are. The t distribution has slightly more spread than
the N(0, 1) distribution, reflecting the fact that o- has to be estimated. But when n is
large, i.e. when o- is well estimated, then the two distributions are nearly identical.
   If T ~ t,, then
                           /o     P(P+l)      t2 - __1
                    E[T] =     t     2     1 + -      dt                (5.7)
                             _  r(2) p7       p

In the limit as t -- 00, the integrand behaves like t-p; hence is integrable if
and only if p > 1. Thus the ti distribution, also known as the Cauchy distribution,
has no mean. When p > 1, E[T] = 0, by symmetry. By a similar argument, the
t  distribution has a variance if and only if p > 2. When p > 2, then Var(T)
p/(p - 2). In general, T has a k-th moment (E[T] < 00) if and only if p > k.


﻿


5.8. T ANDF                                                         334


                          ---0df=   1
                     co   ..... df = 4
                  o       .-- df =16
                          ---df =64
              5C      -        Normal


                          -4      -2      0      2      4

                                          t


Figure 5.13: t densities for four degrees of freedom and the N(0, 1) density


﻿


5.9. EXERCISES


335


5.8.2 The F distribution

5.9 Exercises

  1. Prove Theorem     by moment generating functions.

  2. Refer to Theorem 5.8.

      (a) What was the point of the next to last step?
      (b) Justify the last step.

  3. Assume that all players on a basketball team are 70% free throw shooters
     and that free throws are independent of each other.

     (a) The team takes 40 free throws in a game. Write down a formula for
          the probability that they make exactly 37 of them. You do not need to
          evaluate the formula.
      (b) The team takes 20 free throws the next game. Write down a formula for
          the probability that they make exactly 9 of them.
      (c) Write down a formula for the probability that the team makes exactly
          37 free throws in the first game and exactly 9 in the second game. That
          is, write a formula for the probability that they accomplish both feats.

  4. Write down the distribution you would use to model each of the following
     random variables. Be as specific as you can. I.e., instead of answering "Pois-
     son distribution", answer "Poi(3)" or instead of answering "Binomial", answer
     "Bin(n, p) where n = 13 but p is unknown."

     (a) The temperature measured at a randomly selected point on the surface
          of Mars.
      (b) The number of car accidents in January at the corner of Broad Street
          and Main Street.
      (c) Out of 20 people in a post office, the number who, when exposed to
          anthrax spores, actually develop anthrax.
      (d) Out of 10,000 people given a smallpox vaccine, the number who develop
          smallpox.
      (e) The amount of Mercury in a fish caught in Lake Ontario.


﻿


5.9. EXERCISES


336


5. A student types dpois (3,1.5) into R. R responds with 0.1255107.

    (a) Write down in words what the student just calculated.
    (b) Write down a mathematical formula for what the student just calculated.

6. Name the distribution. Your answers should be of the form Poi(A) or N(3, 22),
   etc. Use numbers when parameters are known, symbols when they're not.
   You spend the evening at the roulette table in a casino. You bet on red 100
   times. Each time the chance of winning is 18/38. If you win, you win $1;
   if you lose, you lose $1. The average amount of time between bets is 90
   seconds; the standard deviation is 5 seconds.

   (a) the number of times you win
   (b) the number of times you lose
   (c) the number of bets until your third win
   (d) the number of bets until your thirtieth loss
   (e) the amount of time to play your first 40 bets
   (f) the additional amount of time to play your next 60 bets
   (g) the total amount of time to play your 100 bets
   (h) your net profit at the end of the evening
   (i) the amount of time until a stranger wearing a red carnation sits down
       next to you
    (j) the number of times you are accidentally jostled by the person standing
       behind you

7. A golfer plays the same golf course daily for a period of many years. You may
   assume that he does not get better or worse, that all holes are equally difficult
   and that the results on one hole do not influence the results on any other
   hole. On any one hole, he has probabilities .05, .5, and .45 of being under
   par, exactly par, and over par, respectively. Write down what distribution best
   models each of the following random variables. Be as specific as you can. I.e.,
   instead of answering "Poisson distribution" answer "Poi(3)" or "Poi(A) where
   A is unknown." For some parts the correct answer might be "I don't know."

   (a) X, the number of holes over par on 17 September, 2002
   (b) W, the number of holes over par in September, 2002


﻿


5.9. EXERCISES


337


     (c) Y, the number of rounds over par in September, 2002
     (d) Z, the number of times he is hit by lightning in this decade
     (e) H, the number of holes-in-one this decade
     (f) T, the time, in years, until his next hole-in-one

 8. During a CAT scan, a source (your brain) emits photons which are counted
    by a detector (the machine). The detector is mounted at the end of a long
    tube, so only photons that head straight down the tube are detected. In
    other words, though the source emits photons in all directions, the only ones
    detected are those that are emitted within the small range of angles that lead
    down the tube to the detector.
    Let X be the number of photons emitted by the source in 5 seconds. Suppose
    the detector captures only 1% of the photons emitted by the source. Let Y be
    the number of photons captured by the detector in those same 5 seconds.

    (a) What is a good model for the distribution of X?
    (b) What is the conditional distribution of Y given X?
    (c) What is the marginal distribution of Y?

    Try to answer these questions from first principles, without doing any calcu-
    lations.

 9. (a) Prove Theorem 5.11 using moment generating functions.
     (b) Prove Theorem 5,12 using moment generating functions.

10. (a) Prove Theorem15,14 by finding E[Y2] using the trick that was used to
        prove Theorem 5.2.
     (b) Prove Theorem       by finding E[Y2] using moment generating func-
        tions.

11. Case Study 4.2.3 in Larsen and Marx [add reference] claims that the number
    of fumbles per team in a football game is well modelled by a Poisson(2.55)
    distribution. For this quiz, assume that claim is correct.

    (a) What is the expected number of fumbles per team in a football game?
    (b) What is the expected total number of fumbles by both teams?
    (c) What is a good model for the total number of fumbles by both teams?


﻿


5.9. EXERCISES


338


     (d) In a game played in 2002, Duke fumbled 3 times and Navy fumbled 4
         times. Write a formula (Don't evaluate it.) for the probability that Duke
         will fumble exactly 3 times in next week's game.
     (e) Write a formula (Don't evaluate it.) for the probability that Duke will
         fumble exactly three times given that they fumble at least once.

12. Clemson University, trying to maintain its superiority over Duke in ACC foot-
    ball, recently added a new practice field by reclaiming a few acres of swamp-
    land surrounding the campus. However, the coaches and players refused
    to practice there in the evenings because of the overwhelming number of
    mosquitos.
    To solve the problem the Athletic Department installed 10 bug zappers around
    the field. Each bug zapper, each hour, zaps a random number of mosquitos
    that has a Poisson(25) distribution.

    (a) What is the exact distribution of the number of mosquitos zapped by 10
         zappers in an hour? What are its expected value and variance?
     (b) What is a good approximation to the distribution of the number of
         mosquitos zapped by 10 zappers during the course of a 4 hour practice?
     (c) Starting from your answer to the previous part, find a random variable
         relevant to this problem that has approximately a N(0,1) distribution.

13. Bob is a high school senior applying to Duke and wants something that will
    make his application stand out from all the others. He figures his best chance
    to impress the admissions office is to enter the Guinness Book of World
    Records for the longest amount of time spent continuously brushing one's
    teeth with an electric toothbrush. (Time out for changing batteries is permis-
    sible.) Batteries for Bob's toothbrush last an average of 100 minutes each,
    with a variance of 100. To prepare for his assault on the world record, Bob
    lays in a supply of 100 batteries.
    The television cameras arrive along with representatives of the Guinness com-
    pany and the American Dental Association and Bob begins the quest that he
    hopes will be the defining moment of his young life. Unfortunately for Bob
    his quest ends in humiliation as his batteries run out before he can reach the
    record which currently stands at 10,200 minutes.
    Justice is well served however because, although Bob did take AP Statistics
    in high school, he was not a very good student. Had he been a good statistics


﻿


5.9. EXERCISES


339


    student he would have calculated in advance the chance that his batteries
    would run out in less than 10,200 minutes.
    Calculate, approximately, that chance for Bob.

14. An article on statistical fraud detection  l  a    [    ]), when
    talking about records in a database, says:
    "One of the difficulties with fraud detection is that typically there are many
    legitimate records for each fraudulent one. A detection method which cor-
    rectly identifies 99% of the legitimate records as legitimate and 99% of the
    fraudulent records as fraudulent might be regarded as a highly effective sys-
    tem. However, if only 1 in 1000 records is fraudulent, then, on average, in
    every 100 that the system flags as fraudulent, only about 9 will in fact be so."
    QUESTION: Can you justify the "about 9"?

15. [credit to FPP here, or change the question.] In 1988 men averaged
    around 500 on the math SAT, the SD was around 100 and the histogram
    followed the normal curve.

    (a) Estimate the percentage of men getting over 600 on this test in 1988.
    (b) One of the men who took the test in 1988 will be picked at random, and
         you have to guess his test score. You will be given a dollar if you guess
         it right to within 50 points.
         i. What should you guess?
         ii. What is your chance of winning?

16. Multiple choice.

     (a) X    Poi(A). Pr[X < 7]
          i. E  __, -  a A/X!
          ii.  7  e-A/x!
          iii. L7 0e-oAA/z!
     (b) X and Y are distributed uniformly on the unit square.
         Pr[X .5|Y<.25]
         i. .5
         ii. .25
         iii. can't tell from the information given.


﻿


5.9. EXERCISES


340


     (c) X   Normal(p, a2). Pr[X > p + -]
          i. is more than .5
          ii. is less than .5
          iii. can't tell from the information given.
     (d) X1,... , Xioo ~N(0, 1). IX=(X1+- - -+Xioo)/100. Y =(X1+-.- -+Xioo).
        Calculate
          i. Pr[-.2 < <X .2]
          ii. Pr[-.2 < Xi < .2]
          iii. Pr[-.2 < Y < .2]
          iv. Pr[-2<c X<c 2]
          v. Pr[-2<c Xi<c 2]
          vi. Pr[-2 < Y < 2]
        vii. Pr[-20 < IX  20]
        viii. Pr[-20  Xi   20]
        ix. Pr[-20 < Y < 20]
     (e) X~ Bin(100, 8). E""0 f (zl6)=
          i. 1
          ii. the question doesn't make sense
          iii. can't tell from the information given.
     (f) X and Y have joint density f(x, y) on the unit square. f(x)
          i. f f(x,y)dx
          ii. f f(x, y) dy
          iii. Joff(x, y)dy
     (g) X1,... , X ~ Gamma(r, A) and are mutually independent.
        f (zi, . . . ,za)=
          i. [A/(r - 1)!](H xi)r-le-AE xi
          ii. [A"'T/((r -l)!)"n](H zi)-le-A 1xI*
          iii. [A"'T/((r - 1)!)"n](H zi)-le-A E**

17. In Figure 5.2, the plots look increasingly Normal as we go down each column.
    Why? Hint: a well-known theorem is involved.


18. Prove Theorem


﻿


5.9. EXERCISES


341


19. Prove a version of Equation 51 on page 29. Let k = 2. Start from the joint
    pmf of Y1 and Y2 (Use Theorem 5 .), derive the marginal pmf of Y1, and
    identify it.

20. Rongelap Island, Poisson distribution

21. seed rain, Poisson distribution

22. (a) Let Y  U(1, n) where the parameter n is an unknown positive integer.
         Suppose we observe Y = 6. Find the m.l.e. n. Hint: Equation 5.2 defines
         the pmf for y E {1, 2, ... , n}. What is p(y) when y 0 {1, 2, ..., n}?
     (b) In World War II, when German tanks came from the factory they had
         serial numbers labelled consecutively from 1. I.e., the numbers were 1,
         2,..... The Allies wanted to estimate T, the total number of German
         tanks and had, as data, the serial numbers of the tanks they had cap-
         tured. Assuming that tanks were captured independently of each other
         and that all tanks were equally likely to be captured find the m.l.e. T.

23. Let Y be a continuous random variable, Y  U(a, b).

     (a) Find E[Y].
     (b) Find Var(Y).
     (c) Find My (t).

24. (a) Is there a discrete distribution that is uniform on the positive integers?
         Why or why not? If there is such a distribution then we might call it
         U(1, oo).
     (b) Is there a continuous distribution that is uniform on the real line? Why
         or why not? If there is, then we might call it U(-oo, oo).

25. Let x    Gam(a,,#) and let y = 1/z. Find the pdf of y. We say that y
    has an inverse Gamma distribution with parameters a and 13 and write y
    invGam(a, 3).

26. Prove Theorem 5:17. Hint: Use the method of Theorem 512.

27. Let x1, ..., x, ~ i.i.d. U(0, 1). Find the distribution of x(n), the largest order
    statistic.

28. In the R code to create Figure  , explain how to use dgamma(...) instead
    of dpois (. . .).


﻿


5.9. EXERCISES


342


29. Prove the claim on page:311 that the half-life of a radioactive isotope is m =
    A log 2.

30. Prove Theorem

31. Prove Theorem   .18.

32. Page32 shows that the n-dimensional Normal density with a diagonal co-
    variance matrix is the product of n separate univariate Normal densities. In
    this problem you are to work in the opposite direction. Let X1, ..., X, be
    independent Normally distributed random variables with means ,u1, ..., yn
    and SD's o-1, ..., o-.

    (a) Write down the density of each Xi.
    (b) Write down the joint density of X1, ..., X,.
    (c) Show that the joint density can be written in the form of Equation 5.6
    (d) Derive the mean vector and covariance matrix of X1, ..., X,.

33. Prove Corollary 5,25.

34. Show, for every x E R,
                                               2w
                              lim ptp,(x) = 1e--2
                              p-oo         v27
    where pt.,(x) is the t density with p degrees of freedom, evaluated at x. Hint:
    use Sterling's formula. (This problem is Exercise 5.18(c) in Statistical Infer-
    ence, 2nd ed. by Casella and Berger.)

35. It turns out that the t distribution with p degrees of freedom can be written as
    a mixture of Normal distributions, a fact that is sometimes useful in statistical
    calculations. Let T   Gam(p/2, 2) and, conditional on T, y   N(0, 1/(pT)).
    Show that the marginal distribution of y is the t distribution with p degrees
    of freedom.


﻿


CHAPTER 6


                 BAYESIAN STATISTICS


6.1 Multidimensional Bayesian Analysis

This chapter takes up Bayesian statistics. Modern Bayesian statistics relies heavily
on computers, computation, programming, and algorithms, so that will be the
major focus of this chapter. We cannot give a complete treatment here, but there
are several good books that cover these topics in more depth. See, for example,
  Gelman et\al."[200],\Liu [2004],\Marnhand Roberty[200], or Ret an-d Cas ella

  Recall the framework of Bayesian inference from Section 2.5.

  " We posit a parametric family of distributions {p(y O8)}.

  " We express our old knowledge of 0 through a prior probability density p(O).

  " The previous two items combine to yield p(y, 0) and, ultimately, p(O|y).

  " The posterior density p(O|y) represents our new state of knowledge about 0.

The posterior density is

                            p(O)p(y|0)
                  p(0 |  - y) = (cp(O)p(y 0).                     (6.1)
                           f p(6)p(y |16)de

   So far, so good. But in many interesting applications, 0 is multi-dimensional
and problems arise when we want to examine the posterior. Equation61 tells us
how to evaluate the posterior at any value of 0, but that's not always sufficient
for getting a sense of which values of 0 are most likely, somewhat likely, unlikely,


343


﻿


6.1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


344


etc. One way to develop a feeling for a multidimensional posterior is to examine a
marginal posterior density, say

                  p(B1I ly) _=J -.-.-J p(81, . .. , 6  y) d02 . . . d64.(6.2)

Unfortunately, the integral in Equation62 is often not analytically tractable and
must be integrated numerically. Standard numerical integration techniques such
as quadrature may work well in low dimensions, but in Bayesian statistics Equa-
tion 6,2 is often sufficiently high dimensional that standard techniques are unreli-
able. Therefore, new numerical integration techniques are needed. The most im-
portant of these is called Markov chain Monte Carlo integration, or MCMC. Other
techniques can be found in the references at the beginning of the chapter. For the
purposes of this book, we investigate MCMC. But first, to get a feel for Bayesian
analysis, we explore posteriors in low dimensional, numerically tractable examples.
   The general situation is that there are multiple parameters 01, ..., Ok, and data
Y1, ..., yn. We may be interested in marginal, conditional, or joint distributions of
the parameters either a priori or a posteriori. Some examples:

   " p(1, ... , 0k), the joint prior

   " p(O1, ... ,k 6ly1, ... , yn), the joint posterior

   " p*(1Ily1, ..., yn) = f ..-- f p (1, . . . ,  l y1, ...,y) d02... dk, the marginal pos-
     terior of 01


         p(62, - -, 6k  | 61, #1,- - -, Yn)  =p(61, -  -,6k  |91, -  -, Yn)p(Bi 1 , - - - , Yn)
                                    OCp(61, -  -,  k  |91, -  -, Yn),

     the conditional joint posterior density of (2,..., k) given 01, where the "c"
     means that we substitute 01 into the numerator and treat the denominator as
     a constant.

The examples in this section illustrate the ideas and shows how to do low-dimensional
integrals in R.

Example 6.1 (Pine Cones)
One possible result of increased CO2 in the atmosphere is that plants will use some of
the excess carbon for reproduction, instead of growth. They may, for example produce


﻿


6.1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


345


         ring    ID     xcoor   ycoor   spec   dbh   1998    1999   2000
           1    11003    0.71    0.53   pita   19.4    0       0      0
           1    11004    1.26    2.36   pita   14.1    0       0      4
           1    11011    1.44    6.16   pita   19.4    0       6      0
           1    11013    3.56    5.84   pita   21.6    0       0      0
           1    11017    3.75    8.08   pita   10.8    0       0      0

           6   68053     0.82   10.73   pita   14.4    0       0      0
           6   68055    -2.24   13.34   pita    11     0       0      0
           6   68057    -0.78   14.21   pita    8      0       0      0
           6   68058     0.76   14.55   pita   10.6    0       0      0
           6   68059     1.48     13    pita   21.2    0       5     10

Table 6.1: The numbers of pine cones on trees in the FACE experiment, 1998-2000.


more seeds, produce bigger seeds, produce seeds earlier in life, or produce seeds when
they, the plants, are smaller. To investigate this possibility in the Duke FACE experiment
(See Example 1,12 and its sequels.) a graduate student went to the FACE site each year
and counted the number of pine cones on pine trees in the control and treatment plots.
citation here and Example 3,8
   The data are in Table .1. The first column is ring. Rings a,b,c were control; x,y,z
were treatment. The next column, ID, identifies each tree uniquely; xcoor and ycoor
give the location of the tree. The next column, spec, gives the species; pita stands for
pinus taeda, or loblolly pine, the dominant canopy tree in the FACE experiment. The
column dbh gives diameter at breast height, a common way for foresters and ecologists
to measure the size of a tree. The final three columns show the number of pine cones in
1998, 1999, and 2000. In this example we'll look at the data for the year 2000. We want
the relationship between dbh and the number of pine cones, and whether that relationship
is the same in the control and treatment plots.
    Figures ., 6.2, and. plot the numbers of pine cones as a function of dbh in the
years 1998-2000. In 1998, very few trees had pine cones and those that did had very
few. But by 1999, many more trees had cones and had them in greater number. There
does not appear to be a substantial difference between 1999 and 2000. As a quick check
of our visual impression we can count the fraction of pine trees having pine cones each
year, by ring. The following R code does the job.
     for ( i in 1:6 ) {
       good <- cones$ring == i


﻿


6.1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


346


       print ( c ( sum ( cones$X1998[goodl > 0 ) / sum(good),
                     sum ( cones$X1999[goodl > 0 ) / sum(good),
                     sum ( cones$X2000[good] > 0 ) / sum(good) ) )
     }
     [1] 0.0000000 0.1562500 0.2083333
     [1] 0.05633803 0.36619718 0.32394366
     [1] 0.01834862 0.21100917 0.27522936
     [1] 0.05982906 0.39316239 0.37606838
     [1] 0.01923077 0.10576923 0.22115385
     [1] 0.04081633 0.19727891 0.18367347


Since there's not much action in 1998 we will ignore the data from that year. The data
show a greater contrast between treatment (rings 2, 3, 4) and control (rings 1, 5, 6) in
1999 than in 2000. So for the purpose of this example we'll use the data from 1999. A
good scientific investigation, though, would use data from all years.
   We're looking for a model with two features: (1) the probability of cones is an
increasing function of dbh and of the treatment and (2) given that a tree has cones,
the number of cones is an increasing function of dbh and treatment. Here we describe
a simple model with these features. The idea is (1) a logistic regression with covariates
dbh and treatment for the probability that a tree is sexually mature and (2) a Poisson
regression with covariates dbh and treatment for the number of cones given that a tree
is sexually mature. Let Y be the number of cones on the i'th tree. Our model is

                           {(1  if the i'th tree had extra CO2
                             0 otherwise

                     0i        if the i'th tree is sexually mature
                          {0 otherwise
                                      exp(o + #1dbhi + #32x)
                   ri= P[O= 1]=
                                    1 + exp(3o + /31dbh2 + ,32x2)
                           #2 = exp(yo + 71dbh + 72x2)
                                  Y     Poi(Oici)

There are six unknown parameters: #0, #31, #2, 7o, 71, 72. We must assign prior dis-
tributions and compute posterior distributions of these parameters. In addition, each
tree has an indicator O and we will be able to calculate the posterior probabilities


﻿


6. 1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


                       ring


347


                      II6


       II5


 5 10 15 20 25            5 10 15 20 25


                                 0
00a en~0  00 O UD 0                 0


   000inmmM oou0Ean6M ®~uomO 00      0

              5 10 15 20 25


0
- o


0D


0\


00


Cv,
Wi
0
0


CD


0\


dbh


Figure 6.1: Numbers of pine cones in 1998 as a function of dbh


﻿


6. 1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


                  ring


348


                 II6


5 10 15 20 25       5 10 15 20 25


          O 0


    0f o 0       0        0o
         00
 0             0ooOOo ° "0,1&


               0


                          0
               0          0 0
               000
                  09       0
        0 PO0             0


          5 10 15 20 25


0
- o


0D


0\


U)
N)
a)
U,


CD


0\


dbh


Figure 6.2: Numbers of pine cones in 1999 as a function of dbh


﻿


6. 1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


                       ring


349


                     II6


5 10 15 20 25            5 10 15 20 25


             0 0                  0


           0 0       0

     0000O


         0


                   000
           0                       0

           0                   00


  ® ®oom.anOuixM ,      00C 0b o


            5  10 15 20 25


0
- o


0D


0\


0

U~)


CD


0\


dbh


Figure 6.3: Numbers of pine cones in 2000 as a function of dbh


﻿


6.1. MULTIDIMENSIONAL BAYESIAN ANALYSIS


350


    We start with the priors 0, /1, /2, 70,71, 72 ~ i.i.d.U(-100, 100). This prior distribu-
tion is not, obviously, based on any substantive prior knowledge. Instead of arguing that
this is a sensible prior, we will later check the robustness of conclusions to specification
of the prior. If the conclusions are robust, then we will argue that almost any sensible
prior would lead to roughly the same conclusions.
    To begin the analysis we write down the joint distribution of parameters and data.
Reformat this equation.

                             P(yi, - - - , n,/O0,/01,02,7o, 71, 72)
    =p(00,0#1,02, 70, 71, 72) X P(yi, - -, yn |00,01,0#2, 70, 71, 72)

                                       61(-100,100) (00) X 1(-100,100) (01) X 1(-100,100)(02) X 1(-100,100) (0)

                                                                      X 1(-100,100)(71) X 1(-100,100)(72)
                 exp(/3o +/31dbh +,32x2)    exp(- exp(o + 71dbh + 72xj)) exp(o + 71dbh2 + 72xi)9
        :y0    1 + exp(/3o + 31dbh2 +32xj)                            y!
                       1                   exp(3o +,31d bhi + #2xj)
 X                                     -                             exp(- exp(7o + 71dbh + 72xi)))
   i :y i 0 ((                                                                                       .
                                                                                                   (6.2

In Equation 63 each term in the product Hiyi>0 is

           P[i'th tree is sexually mature] x p(y  i'th tree is sexually mature)

while each term in H    o is

        P[i'th tree is immature] + P[i'th tree is mature but produces no cones].

The posterior p(3o, 31, /32, 70, 71, 72 |Y1, - -- , yn) is proportional, as a function of
(,3o, /1, /2, 70, 71, 72), to Equation 63. Similarly, conditional posteriors such as
P(/3o |, /32, 7o, 71, 72, Y1, -"-"- , yn) are proportional, as a function of 3o, to Equation 63.
But that doesn't allow for much simplification; it allows us only to ignore the factorials
in the denominator.

    To learn about the posterior in, say, Equation 6 it is easy to write an R function
that accepts (,3o, /31, /32, 70, 71, 72) as input and returns 6,3 as output. But that's quite
a complicated function of (/3o, /31, /32,7, 71,72) and it's not obvious how to use the
function or what it says about any of the four parameters. Therefore, in Section 6,2
we present an algorithm that is very powerful for evaluating the integrals that often
arise in multivariate Bayesian analyses.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


351


6.2 The Metropolis, Metropolis-Hastings, and Gibbs
        Sampling Algorithms

In "Markov chain Monte Carlo", the term "Monte Carlo" refers to evaluating an
integral by using many random draws from a distribution. To fix ideas, suppose we
want to evaluate Equation 6.2 and let 0= (1, ... , 8k). If we could generate many
samples 81,... , Ou of 0 (where e2 = (O,,... ,62,i)) from its posterior distribution
then we could approximate Equation.2 by

   1. discarding 0i,2, ... , 0j,k from each iteration,

   2. retaining  01,1, ... ,  ,1,

   3. using 81,1, ... , (M,1 and standard density estimation techniques (page 105) to
     estimate p(81|y), or

  4. for any set A, using
                               number of O2,1's in A
                                        M
     as an estimate of P[01 E A | y].

That's the idea behind Monte Carlo integration.
   The term "Markov chain" refers to how the samples 01,... , 0O are produced.
In a Markov chain there is a transition density or transition kernel k(02|18_1) which
is a density for generating 82 given 2_O1. We first choose 01 almost arbitrarily, then
generate (02|10), (03|02), and so on, in succession, for as many steps as we like.
Each O2 has a density p = p(2) which depends on 01 and the transition kernel. But,

   1. under some fairly benign conditions (See the references at the beginning of
     the chapter for details.) the sequence pi, p2, ... converges to a limit p, the
     stationary distribution, that does not depend on 81;

  2. the transition density k(02il I_1) can be chosen so that the stationary distribu-
     tion p is equal to p(| y);

  3. we can find an m such that i > m -     p = p(8| y);


4. then 0m±1, ..., BM are, approximately, a sample from p(8|y).


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


352


   The Metropolis-Hastings algorithm [Metropol iea, 1953, H, 197] is
one way to construct an MCMC algorithm whose stationary distribution is p(0|y).
It works according to the following steps.

   1. Choose a proposal density g(0*10).

   2. Choose 01.

   3. For i = 2, 3, . ..

        " Generate a proposal 0* from g(0*|82_1).
        " Set

                                   { p(*|I y)g(e2-i1l*)
                           r=min    1,   * y(-             .)(6.4)
                                      p(Oi-1I y)g(0*|O2-1)

        " Set
                           -   f>*    with probability r,
                           e2 =
                                 Oi_1 with probability 1 - r.

Steps 1-3 define the transition kernel k. In many MCMC chains, the acceptance
probability r may be strictly less than one, so the kernel k is a mixture of two parts:
one that generates a new value of 8i+1 O2 and one that sets i+1 = e2.
   To illustrate MCMC, suppose we want to generate a sample 01,..., Bio,ooo from
the Be(5, 2) distribution. We arbitrarily choose a proposal density g(Q* 0) = U(0 -
.1, 0 + .1) and arbitrarily choose 01 = 0.5. The following R code draws the sample.

  samp <- rep ( NA, 10000 )
  samp[1] <- 0.5
  for ( i in 2:10000 ) {
    prey <- samp [i-1]
    thetastar <- runif ( 1, prey - .1, prey + .1 )
    r <- min ( 1, dbeta(thetastar,5,2) / dbeta(prev,5,2) )
    if ( rbinom ( 1, 1, r ) == 1 )
       new <- thetastar
    else
       new <- prev
    samp [i] <- new
  }


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS                          353


             c


                        0.2      0.4       0.6      0.8      1.0

                                        0


                  I        I        I       I        I        I
                  0      2000     4000     6000     8000    10000

                                      Index


                  0      2000     4000     6000     8000    10000

                                      Index


Figure 6.4: 10,000 MCMC samples of the Be(5, 2) density. Top panel: histogram of
samples from the Metropolis-Hastings algorithm and the Be(5, 2) density. Middle
panel: O" plotted against i. Bottom panel: p(i) plotted against i.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


354


The top panel of Figure  shows the result. The solid curve is the Be(5, 2) density
and the histogram is made from the Metropolis-Hastings samples. They match
closely, showing that the algorithm performed well.

   Figure    was produced by

   par ( mfrow=c(3,1) )
   hist ( samp[-(1:1000)], prob=TRUE, xlab=expression(theta),
              ylab="", main="" )
    x <- seq(0,1,length=100)
    lines ( x, dbeta(x,5,2) )
    plot ( samp, pch=".", ylab=expression(theta) )
    plot ( dbeta(samp,5,2), pch=".", ylab=expression(theta) )


The code samp [-(1:1000)1 discards the first 1000 draws in the hope that the sam-
pler will have converged to its stationary distribution after 1000 iterations.

   Assuming that convergence conditions have been met and that the algorithm is
well-constructed, MCMC chains are guaranteed eventually to converge and deliver
samples from the desired distribution. But the guarantee is asymptotic and in prac-
tice the output from the chain should be checked to diagnose potential problems
that might arise in finite samples.
   The main thing to check is mixing. An MCMC algorithm operates in the space
of 0. At each iteration of the chain, i.e., for each value of i, there is a current
location Oi. At the next iteration the chain moves to a new location Oi. In this
way the chain explores the 0 space. While it is exploring it also evaluates p(Oi). In
theory, the chain should spend many iterations at values of 0 where p(8) is large
- and hence deliver many samples of 0's with large posterior density - and few
iterations at values where p(8) is small. For the chain to do its job it must find the
mode or modes of p(8), it must move around in their vicinity, and it must move
between them. The process of moving from one part of the space to another is
called mixing.
   The middle and bottom panels of Figure  illustrate mixing. The middle panel
plots 0, vs. i. It shows that the chain spends most of its iterations in values of 0
between about 0.6 and 0.9 but makes occasional excursions down to 0.4 or 0.2
or so. After each excursion it comes back to the mode around 0.8. The chain has
taken many excursions, so it has explored the space well. The bottom panel plots
p(Oj) vs. i. It shows that the chain spent most of its time near the mode where


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


355


p(O)   2.4 but made multiple excursions down to places where p(O) is around 0.5,
or even less. This chain mixed well.
   To illustrate poor mixing we'll use the same MCMC algorithm but with different
proposal kernels. First we'll use (0*|0) = U(0 - 100, 0 + 100) and change the
corresponding line of code to
thetastar <- runif ( 1, prey - 100, prey + 100 ). Then we'll use (0*|10)
U(0 - .00001, 0 + .00001) and change the corresponding line of code to
thetastar <- runif ( 1, prey - .00001, prey + .00001 ). Figure    shows
the result. The left-hand side of the figure is for (0*10) = U(0 - 100, 0 + 100). The
top panel shows a very much rougher histogram than Figure .4; the middle and
bottom panels show why. The proposal radius is so large that most proposals are
rejected; therefore, 0i±1 O= for many iterations; therefore we get the flat spots in
the middle and bottom panels. The plots reveal that the sampler explored fewer
than 30 separate values of 0. That's too few; the sampler has not mixed well. In
contrast, the right-hand side of the figure - for (0*10) = U(0-.00001, 0+.00001) -
shows that 0 has drifted steadily downward, but over a very small range. There are
no flat spots, so the sampler is accepting most proposals, but the proposal radius
is so small that the sampler hasn't yet explored most of the space. It too has not
mixed well.
   Plots such as the middle and bottom plots of Figure 6,5 are called trace plots
because they trace the path of the sampler.
   In this problem, good mixing depends on getting the proposal radius not too
large and not too small, but just right. To be sure, if we run the MCMC chain long
enough, all three samplers would yield good samples from Be(5, 2). But the first
sampler mixed well with only 10,000 iterations while the others would require
many more iterations to yield a good sample. In practice, one must examine the
output of one's MCMC chain to diagnose mixing problems. No diagnostics are fool
proof, but not diagnosing is foolhardy.
   Several special cases of the Metropolis-Hastings algorithm deserve separate
mention.

Metropolis algorithm It is often convenient to choose the proposal density g(*|10)
     to be symmetric; i.e., so that g(Q*10) = g(00*). In this case the Metropolis
     ratio p(o*| y)g(8i_1|*)/p(82_1| y)g(*| i-8_1) simplifies to p(o*| y)/p(i_1I y).
     That's what happened in the Be(5, 2) illustration and why the line
     r <- min ( 1, dbeta(thetastar,5,2) / dbeta(prev,5,2) ) doesn't involve
     g.


Independence sampler It may be convenient to choose g(O*|0)


g(9*) not de-


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


356


Co)

N'


0.4990 0.4994 0.4998 0.5002

           0


I      I  I  I
0.2  0.4  0.6 0.8

          0


CA


LO)

Co7


00
0)


N'
0)


I   I  I   I   I  I
0 2000    6000  10000


I   I  I   I   I  I
0 2000    6000  10000


Index


Index


m-


m-


Co

CO
CA


Co7


    0 2000   6000   10000

           Index


9J|             -    -l
  o I  I   I   I  I   I
    0 2000   6000   10000

           Index


Figure 6.5: 10,000 MCMC samples of the Be(5, 2) density. Left column: (0*|0)
U(8 - 100, 0 + 100); Right column: (0*|0) = U(0 - .00001, 0 + .00001). Top: his-
togram of samples from the Metropolis-Hastings algorithm and the Be(5, 2) density.
Middle: O2 plotted against i. Bottom: p(O2) plotted against i.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


357


     pendent on 0. For example, we could have used thetastar <- runif (1) in
     the Be(5, 2) illustration.

Multiple transition kernels We may construct multiple transition kernels, say g1,
       ., gm. Then for each iteration of the MCMC chain we can randomly choose
     j E {1,... , m} and make a proposal according to g3. We would do this either
     for convenience or to improve the convergence rate and mixing properties of
     the chain.

Gibbs sampler [Gema.  an        a, 18] In many practical examples, the so-
     called full conditionals or complete conditionals p(Oj |1(_x), y) are known and
     easy to sample for all j, where (_j) = (1, ..., Q8_1, O8+1, ... , 8k). In this case
     we may sample 6i, from
     p(O |i,1, ... ,  ,j-1,  i-1,j+1, ... , j-i_1,) forj  =  1, ... , k  and  set O  =  (2i,1, ... , Oj,k).
     We would do this for convenience.

   The next example illustrates several MCMC algorithms on the pine cone data of
Example 6.1.

Example 6.2 (Pine Cones, cont)
In this example we try several MCMC algorithms to evaluate and display the posterior
distribution in Equation 63. Throughout this example, we shall, for compactness, refer
to the posterior density as p(8) instead of p(O|yi1, ...,y).
   First we need functions to return the prior density and the likelihood function.

dprior <- function ( params, log=FALSE ) {
  logprior <- (     dunif ( params ["b0"], -100, 100, log=TRUE )
                  + dunif ( params ["b1"], -100, 100, log=TRUE )
                  + dunif ( params ["b2"], -100, 100, log=TRUE )
                  + dunif ( params ["g0"], -100, 100, log=TRUE )
                  + dunif ( params ["g1"], -100, 100, log=TRUE )
                  + dunif ( params ["g2"], -100, 100, log=TRUE )
                )
  if (log) return (logprior)
  else return (exp(logprior))
}

lik <- function ( params, n.cones=cones$X2000, dbh=cones$dbh,
                    trt=cones$trt, log=FALSE ) {
  zero <- n.cones == 0


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


358


tmpl <- params["b0"]
tmp2 <- params["gO"]
etmpl <- exp(tmpl)
etmp2 <- exp(tmp2)


+ params["b1"] * dbh + params["b2"] * trt
+ params["gl"] * dbh + params["g2"] * trt


tmpl[!zero] )
etmp2[!zerol )
n.cones[!zero] * tmp2[!zero] )
log ( 1 + etmpl[zerol * exp ( -etmp2[zerol ) ) )
log ( 1 + etmpl ) )


loglik <- (


  sum (
- sum (
+ sum (
+ sum (
- sum (


           )
if (log) return (loglik)
else return (exp(loglik))


}


Now we write a proposal function. This ones makes (0*|0)  N(8, .1I6), where I6 is the
6 x 6 identity matrix.

g.all <- function ( params ) {
  sig <- c(.1,.1,.1,.1,.1,.1)
  proposed <- mvrnorm ( 1, mu=params, Sigma=diag(sig) )
  return ( list ( proposed=proposed, ratio=1 ) )
}


Finally we write the main part of the code. Try to understand it; you may have to write
something similar. Notice an interesting feature of R: assigning names to the components
of params allows us to refer to the components by name in the lik function.

# initial values
params <- c ( "b0"=0, "b1"=0, "b2"=0, "g0"=0, "g1"=0, "g2"=0 )

# number of iterations
mc <- 10000

# storage for output
mcmc.out <- matrix ( NA, mc, length(params)+1 )


# the main loop


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS

for ( i in 1:mc ) {
  prop <- g.all ( params )
  new <- prop$proposed
  log.accept.ratio <- (  dprior ( new, log=TRUE )
                           - dprior ( params, log=TRUE )
                           + lik ( new, log=TRUE )
                           - lik ( params, log=TRUE )
                           - log ( prop$ratio )
                         )
  accept.ratio <- min ( 1, exp(log.accept.ratio) )


359


if ( as.logical
  params <- new


( rbinom(1,1,accept.ratio) ) )


mcmc.out[i,] <- c ( params, lik ( params, log=TRUE )


)


}


Figure shows trace plots of the output.
move very often; it did not mix well and did


The plots show that the sampler did not
not explore the space effectively.


   Figure    was produced by the following snippet.

       par ( mfrow=c(4,2), mar=c(4,4,1,1)+.1 )
       for ( i in 1:6 )
         plot ( mcmc . out [, i] , ylab=names (params) [i] , pch=". " )
       plot ( mcmc.out[,7], ylab=expression(p(theta)), pch="." )


   When samplers get stuck, sometimes it's because the proposal radius is too large. So
next we try a smaller radius: sig <- rep(.01,6). Figure   shows the result. The
sampler is still not mixing well. The parameter #0 travelled from its starting point of
0 = 0 to about #0 ~ -1.4 or so, then seemed to get stuck; other parameters behaved
similarly. Let's try running the chain for more iterations: mc <- 100000. Figure
shows the result. Again, the sampler does not appear to have mixed well. Parameters #0
and #1, for example, have not yet settled into any sort of steady-state behavior and p(O)
seems to be steadily increasing, indicating that the sampler may not yet have found the
posterior mode.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


360


  N'
  O'
N


T0


N'


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


    N'
  O'
NQ (
    O


C)


LO)


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


N'


O'


N'


o


0  2000    6000    10000


0  2000    6000    10000

         Index


Index


m-


L J


0  2000    6000    10000

         Index


Figure 6.6: Trace plots of MCMC output from the pine cone code on page


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


361


O   6


T0


6


L J
6


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


N'


6


C7


  6


  (0
T 6


0 2000     6000    10000

         Index


I    I   I   I   I   I
0 2000     6000    10000


C)

0


0  2000    6000    10000

         Index


I   I    I   I   I   I
0  2000  6000  10000

         Index


Index


m-


                  0 2000      6000    10000

                           Index


Figure 6.7: Trace plots of MCMC output from the pine cone code with a smaller
proposal radius.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


362


0 0
    Co)


N'

(0


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


N-

  C)
O'
T~C
  6


1


0)


6

T-
6


0 2000     6000    10000

         Index


I    I   I   I   I   I
0 2000     6000    10000


N'


LC)


LC)
C7

LC)
N'


0 2000     6000    10000

         Index


I    I   I   I   I   I
0 2000     6000    10000


Index


Index


  L J


0-

  ,:F


                  0 2000      6000    10000

                           Index


Figure 6.8: Trace plots of MCMC output from the pine cone code with a smaller

proposal radius and 100,000 iterations. The plots show every 10'th iteration.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


363


    It is not always necessary to plot every iteration of an MCMC sampler. Figure
plots every 10'th iteration; plots of every iteration look similar. The figure was produced
by the following snippet.

       par ( mfrow=c(4,2), mar=c(4,4,1,1)+.1 )
       plotem <- seq ( 1, 100000, by=10 )
       for ( i in 1:6 )
          plot ( mcmc . out [plotem, i] , ylab=names (params) [i] , pch=". " )
       plot ( mcmc.out[plotem,7], ylab=expression(p(theta)), pch="." )


    The sampler isn't mixing well. to write a better one we should try to understand why
this one is failing. It could be that proposing a change in all parameters simultaneously is
too dramatic, that once the sampler reaches a location where p(8) is large, changing all
the parameters at once is likely to result in a location where p(O) is small, therefore the
acceptance ratio will be small, and the proposal will likely be rejected. To ameliorate the
problem we'll try proposing a change to only one parameter at a time. The new proposal
function is

g.one <- function ( params ) {
   sig <- c ( "b0"=.1, "b1"=.1, "b2"=.1, "g0"=.1, "g1"=.1, "g2"=.1 )
   which <- sample ( names(params), 1 )
   proposed <- params
   proposed [which] <- rnorm ( 1, mean=params [which] , sd=sig [which] )
   return ( list ( proposed=proposed, ratio=1 ) )
}


which randomly chooses one of the six parameters and proposes to update that parameter
only. Naturally, we edit the main loop to use g. one instead of g. all. Figure  shows
the result. This is starting to look better. Parameters # 2 and 7Y2 are exhibiting steady-
state behavior; so are # o and #1, after iteration 10,000 or so (x = 1000 in the plots).
Still, 'Yo and '}1 do not look like they have converged.
    Figure     illuminates some of the problems. In particular, , o and #1 seem to be
linearly related, as do 'Yo and -y1. This is often the case in regression problems; and we
have seen it before for the pine cones in Figure . In the current setting it means that
p(| y, ... , yn) has ridges: one along a line in the (, #1~Q) plane and another along a
line in the (-yo, -1) plane.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


364


o  o


  N-


0  2000    6000    10000


LC)

N


LC)


0  2000    6000    10000


Index


Index


  N'

cTM


  6


C)


LO)


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


O'
N'


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


LC)


m-


               L J .I                   I

                   0 2000     6000    10000

                           Index


Figure 6.9: Trace plots of MCMC output from the pine cone code with proposal
function g. one and 100,000 iterations. The plots show every 10'th iteration.


﻿


CD


    0.20 0.40 -1.6 -1.0 -0.4 0.16 0.24


      F WO :} I 'I IF .jW I,'


    .L f. I i t 5.I I I I I 1 '.'.


    N)  (C) L


  LJLJI    JQI L

  :4t     I I I I I I , ',":' "J

        I I II  NEW
  0      [~J ~

  176 17 8 .2 .5 .__ _ _. -. -.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


366


    Figure     was produced by the following snippet.

plotem <- seq ( 10000, 100000, by=10 )
pairs ( mcmc.out[plotem,], pch=".",
         labels=c(names(params) ,"density") )


As the figure shows, it took the first 10,000 iterations or so for # 0 and #1 to reach a
roughly steady state and for p(O) to climb to a reasonably large value. If those iterations
were included in Figure   , the points after iteration 10,000 would be squashed together
in a small region. Therefore we made plotem <- seq ( 10,000, 100000, by=10 )
to drop the first 9999 iterations from the plots.

    If our MCMC algorithm proposes a move along the ridge, the proposal is likely to be
accepted. But if the algorithm proposes a move that takes us off the ridge, the proposal
is likely to be rejected because p would be small and therefore the acceptance ratio would
be small off the ridge. But that's not happening here: our MCMC algorithm seems not to
be stuck, so we surmise that it is proposing moves that are small compared to the widths
of the ridges. However, because the proposals are small, the chain does not explore the
space quickly. That's why -yo and '}1 appear not to have reached a steady state. We
could improve the algorithm by proposing moves that are roughly parallel to the ridges.
And we can do that by making multivariate Normal proposals with a covariance matrix
that approximates the posterior covariance of the parameters. We'll do that by finding
the covariance of the samples we've generated and using it as the covariance matrix of
our proposal distribution. The R code is

Sig <- cov ( mcmc.out[10000:100000,-71)
g.group <- function ( params ) {
  proposed <- mvrnorm ( 1, mu=params, Sigma=Sig )
  return ( list ( proposed=proposed, ratio=1 ) )
}


We drop the first 9999 iterations because they seem not to reflect p(O) accurately. Then
we calculate the covariance matrix of the samples from the previous MCMC sampler. That
covariance matrix is used in the proposal function. The results are shown in Figures
and       Figure      shows that the sampler seems to have converged after the first
several thousand iterations. The posterior density has risen to a high level and is hovering
there; all six variables appear to be mixing well. Figure  confirms our earlier impression


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


367


that the posterior density seems to be approxiately Normal - at least, it has Normal-
looking two dimensional marginals - with # 0 and #1, highly correlated with each other,
71i and 7y2 highly correlated with each other, and no other large correlations. The sampler
seems to have found one mode and to be exploring it well.

   Figures     and     were produced with the following snippet.

plotem <- seq ( 1, 100000, by=10 )
par ( mfrow=c(4,2), mar=c(4,4,1,1)+.1 )
for ( i in 1:6 )
  plot ( mcmc . out [plotem, i] , ylab=names (params) [i] , pch=". " )
plot ( mcmc.out[plotem,7], ylab=expression(p(theta)), pch="." )

plotem <- seq ( 1000, 100000, by=10 )
pairs ( mcmc.out[plotem,], pch=".",
         labels=c(names(params) ,"density") )


   Now that we have a good set of samples from the posterior, we can use it to an-
swer substantive questions. For instance, we might want to know whether the extra
atmospheric CO2 has allowed pine trees to reach sexual maturity at an earlier age or to
produce more pine cones. This is a question of whether # 2 and 7Y2 are positive, negative,
or approximately zero. Figure   shows the answer by plotting the posterior densities
of # 2 and 7y2. Both densities put almost all their mass on positive values, indicating
that P[#2 > 0] and P[72 > 0] are both very large, and therefore that pines trees with
excess CO2 mature earlier and produce more cones than pine trees grown under normal
conditions.

   Figure     was produced by the following snippet.

par ( mfrow=c(1,2) )
plot ( density ( mcmc.out[10000:100000,"b2"] ),
        xlab=expression(beta [2]),
        ylab=expression(p(beta[2])), main="" )
plot ( density ( mcmc.out[10000:100000,"g2"] ),
        xlab=expression(gamma[2]),
        ylab=expression(p(gamma[2])), main="" )


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


368


0

N


0


T0


N
6


0  2000    6000    10000

         Index


0  2000    6000    10000

         Index


N
r


0
6


LO)


  0
  N
  6

  -

0   -

  6


0  2000    6000    10000

         Index


I    I   I   I   I   I
0  2000    6000    10000

         Index


N N) O
O0 _
  0


0  2000    6000    10000

         Index


I    I   I   I   I   I
0  2000    6000    10000

         Index


  LO


o     :
  O .


      0 2000     6000    10000


Index


Figure 6.11: Trace plots of MCMC output from the pine cone code with proposal

function g. group and 100,000 iterations. The plots show every 10'th iteration.


﻿


6.2. METROPOLIS, METROPOLIS-HASTINGS, AND GIBBS


369


0.15 0.35


                  b0


              co


              CO   ".


                 -7 -4


Figure 6.12: Pairs plots


posal g.group.


SIZ


   -".


b2


0.0


-2.0


  -2 0


     I'


0.2 0.6


..- -.2  -., .


  tiT1' f
  "t


t!.
    "1
    y1 '
    r",
 N" 1 ±; '


"S"


   ati


   .r'


   g2


    x LO)


    N-


L"i


LiF5


density


      (0
      0-


0.12 0.19


1760


of MCMC output from the pine cones example with pro-


﻿


6.3. EXERCISES


               C
               C

          N C


370


..


C\N


0.0    1.0

    R2


|   |
0.2  0.4  0.6

     72


         Figure 6.13: Posterior density of # 2 and _72 from Example


6.3 Exercises

  1. This exercise asks you to enhance the code for the Be(5, 2) example on page

      (a) How many samples is enough? Instead of 10,000, try different numbers.
          How few samples can you get away with and still have an adequate ap-
          proximation to the Be(5, 2) distribution? You must decide what "ade-
          quate" means; you can use either a firm or fuzzy definition. Illustrate
          your results with figures similar to
      (b) Try an independence sampler in the Be(5, 2) example on page  . Re-
          place the proposal kernel with 0* ~ U(0, 1). Run the sampler, make a
          figure similar to Figure and describe the result.
      (c) Does the proposal distribution matter? Instead of proposing with a ra-
          dius of 0.1, try different numbers. How much does the proposal radius
          matter? Does the proposal radius change your answer to part ? Illus-
          trate your results with figures similar to
      (d) Try a non-symmetric proposal. For example, you might try a proposal
          distribution of Be(5, 1), or a distribution that puts 2/3 of its mass on
          (zi_1 - .1, xi1) and 1/3 of its mass on (xz_1, x_1 + .1). Illustrate your


﻿


6.3. EXERCISES


371


       results with figures similar to 6 4.
    (e) What would happen if your proposal distribution were Be(5, 2)? How
       would the algorithm simplify?

2. (a) Some researchers are interested in 0, the proportion of students who
        ever cheat on college exams. They randomly sample 100 students and
        ask "Have you ever cheated on a college exam?" Naturally, some stu-
        dents lie. Let #1 be the proportion of non-cheaters who lie and #2 be the
        proportion of cheaters who lie. Let X be the number of students who
        answer "yes" and suppose X = 40.
        i. Create a prior distribution for 0, # and #2. Use your knowledge
           guided by experience. Write a formula for your prior and plot the
           marginal prior density of each parameter.
         ii. Write a formula for the likelihood function (O, 01, #2).
         iii. Find the m.l.e..
         iv. Write a formula for the joint posterior density p(O, #1, #2|X = 40).
         v. Write a formula for the marginal posterior density p(O|X = 40).
         vi. Write an MCMC sampler to sample from the joint posterior.
       vii. Use the sampler to find p(O|X = 40). Summarize your results.
            Include information on how you assessed mixing and on what you
            learned about p(O| X = 40).
       viii. Assess the sensitivity of your posterior, p(O|X = 40), to your prior
            for #1 and #2.
    (b) Randomized response This part of the exercise uses ideas from Exer-
        cises  in Chapter_1 and 6 in Chapter . As explained there, researchers
        will sometimes instruct subjects as follows.
           Toss a coin, but don't show it to me. If it lands Heads, answer
           question (a). If it lands tails, answer question (b). Just answer
           'yes' or 'no'. Do not tell me which question you are answering.
           (a) Does your telephone number end in an even digit?
           (b) Have you ever cheated on an exam in college?
        The idea of the randomization is, of course, to reduce the incentive to
        lie. Nonetheless, students may still lie.
        i. If about 40 students answered 'yes' in part (a), about how many do
           you think will answer 'yes' under the conditions of part (b)?


﻿


6.3. EXERCISES


3 72


         ii. Repeat part (a) under the conditions of part (b) and with your best
            guess about what X will be under these conditions.
        iii. Assess whether researchers who are interested in 0 are better off
            using the conditions of part (a) or part (b).

3. Figures      and 6.12 suggest that the MCMC sampler has found one mode
   of the posterior density. Might there be others? Use the lik function and
   R's optim function to find out. Either design or randomly generate some
   starting values (You must decide on good choices for either the design or the
   randomization.) and use optim to find a mode of the likelihood function.
   Summarize and report your results.

4. Example 642 shows that 32 and '72 are very likely positive, and therefore that
   pine trees with extra CO2 mature earlier and produce more cones. But how
   much earlier and how many more?

   (a) Find the posterior means E[32|Yi,... Y2] and E[72|Yi ,. Y,Y2] approxi-
        mately, from the Figures in the text.
    (b) Suppose there are three trees in the control plots that have probabilities
        0.1, 0.5, and 0.9 of being sexually mature. Plugging in E[32 YiY. -.-Y2]
        from the previous question, estimate their probabilities of being mature
        if they had grown with excess CO2.
    (c) Is the plug-in estimate from the previous question correct? I.e., does
        it correctly calculate the probability that those trees would be sexually
        mature? Explain why or why not. If it's not correct, explain how to
        calculate the probabilities correctly.

5. In the context of Example6  we might want to investigate whether the
   coefficient of dbh should be the same for control trees and for treated trees.

   (a) Write down a model enhancing that on page  to allow for the possi-
        bility of different coefficients for different treatments.
    (b) What parts of the R code have to be changes?
    (c) Write the new code.
    (d) Run it.
    (e) Summarize and report results. Report any difficulties with modifying
        and running the code. Say how many iterations you ran and how you
        checked mixing. Also report conclusions: does it look like different treat-
        ments need different coefficients? How can you tell?


﻿


CHAPTER 7


                        MORE MODELS


This chapter takes up a wide variety of statistical models. It is beyond the scope
of this book to give a full treatment of any one of them. But we hope to introduce
each model enough so the reader can see it what situations it might be useful, what
it's primary characteristcs are, and how a simple analysis might be carried out in R.
A more thorough treatment of many of these models can be found inVeal
Ripley.^ [2;0 .


7.1     Hierarchical Models

It is often useful to think of populations as having subpopulations, and those as
having subsubpopulations, and so on. One example comes from [cite Worsley et
al] who describe fMRI (functional magnetic resonance imaging) experiments. A
subject is placed in an MRI machine and subjected to several stimuli while the
machine measures the amount of oxygen flowing to various parts of the brain.
Different stimuli affect different parts of the brain, allowing scientists to build up a
picture of how the brain works. Let the generic parameter 0 be the change in blood
flow to a particular region of the brain under a particular stimulus. 0 is called
an effect. As citation explain, 6 may vary from subject to subject, from session
to session even for the same patient, and from run to run even within the same
session. To describe the situation fully we need three subscripts, so let 0ijk be the
effect in subject i, session j, run k. For a single subject i and session j there will be
an overall average effect; call it pi. The set {6ijk}k will fall around ,u with a bit of
variation for each run k. Assuming Normal distributions, we would write

                        {Oijk}k |pi, Uk i.i.d. N(pg Uk)


373


﻿


7.2. TIME SERIES AND MARKOV CHAINS


374


   Likewise, for a single subject i there will be an overall average effect; call it
pi. The set {,u}3 will fall around pi with a bit of variation for each session j.
Further, each pi is associated with a different subject so they are like draws from
a population with a mean and standard deviation, say yu and cr. Thus the whole
model can be written

                        {Oijk}k |pi, Uk i.i.d. N(pi, Uk)
                        {i}3 |wpi, cg ~i.i.d. N(i,, a)
                          {i} |,p, ai ~i.i.d. N(pu, a )
Figure    is a graphical representation of this model.

                             pretty picture here

       Figure 7.1: Graphical representation of hierarchical model for fMRI


   More examples: metaanalysis, Jackie Mohan's germination records, Chantal's
arabadopsis, CO2 uptake from R and Pinheiro and Bates, FACE growth rates by
tree l ringItreatment
   Is a sample from one population or several? Mixtures of Normals. Extra varia-
tion in Binomials, Poisson, etc. Hierarchical and random effects models. Discrete
populations: medical trials, different species, locations, subjects, treatments.


7.2 Time Series and Markov Chains

Figure 7. shows some data sets that come with R. The following descriptions are
taken from the R help pages.

Beaver The data are a small part of a study of the long-term temperature dynamics
     of beaver Castor canadensis in north-central Wisconsin. Body temperature
     was measured by telemetry every 10 minutes for four females, but data from
     one period of less than a day is shown here.

Mauna Loa Monthly atmospheric concentrations of CO2 are expressed in parts per
     million (ppm) and reported in the preliminary 1997 SIO manometric mole
     fraction scale.

DAX The data are the daily closing prices of Germany's DAX stock index. The data
     are sampled in business time; i.e., weekends and holidays are omitted.


﻿


7.2. TIME SERIES AND MARKOV CHAINS


375


UK Lung Disease The data are monthly deaths from bronchitis, emphysema and
     asthma in the UK, 1974 - 1979.

Canadian Lynx The data are annual numbers of lynx trappings for 1821 - 1934
     in Canada.

Presidents The data are (approximately) quarterly approval rating for the Presi-
     dent of the United states from the first quarter of 1945 to the last quarter of
     1974.

UK drivers The data are monthly totals of car drivers in Great Britain killed or
     seriously injured Jan 1969 to Dec 1984. Compulsory wearing of seat belts
     was introduced on 31 Jan 1983.

Sun Spots The data are monthly numbers of sunspots. They come from the World
     Data Center-C1 For Sunspot Index Royal Observatory of Belgium, Av. Circu-
     laire, 3, B-1180 BRUSSELS http://www.oma.be/KSB-ORB/SIDC/sidctxt.
     html.

What these data sets have in common is that they were all collected sequentially in
time. Such data are known as time series data. Because each data point is related to
the ones before and the ones after, they usually cannot be treated as independent
random variables. Methods for analyzing data of this type are called time series
methods. More formally, a time series is a sequence Y, ... , YT of random variables
indexed by time. The generic element of the series is usually denoted Y.

   Figure    was produced by the following snippet.

   par ( mfrow=c(4,2) )
   plot.ts ( beaverl$temp, main="Beaver", xlab="Time",
               ylab="Temperature" )
    plot.ts ( co2, main="Mauna Loa", ylab="C02 (ppm)" )
    plot.ts ( EuStockMarkets[,1], main="DAX",
               ylab="Closing Price" )
    plot.ts ( ldeaths, main="UK Lung Disease",
               ylab="monthly deaths" )
    plot.ts ( lynx, main="Canadian Lynx", ylab="trappings" )
    plot.ts ( presidents, main="Presidents", ylab="approval" )
    plot.ts ( Seatbelts[,"DriversKilled"], main="UK drivers",
               ylab="deaths" )


﻿


7.2. TIME SERIES AND MARKOV CHAINS


376


Beaver


Mauna Loa


a)

(z
a
E
a)
-


0  20 40 60 80


11-
a


0


1960 1970 1980 1990


Time


Time


a)
0

0)
0


           DAX


0


   1992 1994 1996 1998


C,
(1)


0
E


UK Lung Disease


10


1974  1976  1978 1980


Time


Time


         Canadian Lynx
U/)0
a) o)                        >
  r- C


     1820  1860  1900

              Time


Presidents


1945  1955  1965 1975

        Time


         UK drivers        U           Sun Spots
                           CL,
        C)      ~          U)
O/ O)~~1                             rT


     _0                      0
     1970 1975  1980 1985  E     1750   1850   1950


Time


Time


Figure 7.2: Beaver: Body temperature of a beaver recorded every 10 minutes;

Mauna Loa: Atmospheric concentration of C02; DAX: Daily closing prices of the

DAX stock exchange in Germany; UK Lung Disease: monthly deaths from bronchi-

tis, emphysema and asthma; Canadian Lynx: annual number of trappings; Pres-

idents: quarterly approval ratings; UK drivers: deaths of car drivers; Sun Spots:

monthly sunspot numbers.


﻿


7.2. TIME SERIES AND MARKOV CHAINS


377


    plot.ts ( sunspot.month, main="Sun Spots",
               ylab="number of sunspots" )


   . plot . ts is the command for plotting time series.


   The data sets in Figure exhibit a feature common to many time series: if one
data point is large, the next tends to be large, and if one data point is small, the
next tends to be small; i.e., Y and Yt+1 are dependent. The dependence can be seen
in Figure    which plots Y+1 vs. Y, for the Beaver and President datasets. The
upward trend in each panel shows the dependence. Time series analysts typically
use the term autocorrelation - the prefix auto refers to the fact that the time series
is correlated with itself - even though they mean dependence. R has the built-in
function acf for computing autocorrelations. The following snippet shows how it
works.

> acf ( beaverl$temp, plot=F, lag.max=5 )

  Autocorrelations of series 'beaverl$temp', by lag

       0     1     2      3     4      5
  1.000 0.826 0.686 0.580 0.458 0.342

The six numbers in the bottom line are Cor(Y, Yt), Cor(Yt, Yt+1), ..., Cor(Y, Yt+5)
and are referred to as autocorrelations of lag 0, lag 1, ..., lag 5. Those autocorre-
lations can, as usual, be visualized with plots as in Figure

   Figure    was produced by the following snippet.

   dim ( beaver1 )
   plot ( beaverl$temp[-114], beaverl$temp[-1], main="Beaver",
            xlab=expression(y[t]), ylab=expression(y[t+1]) )
    length ( presidents )
    plot ( presidents [-120] , presidents [-1] , main="Presidents",
            xlab=expression(y[t]), ylab=expression(y[t+1]) )


﻿


7.2. TIME SERIES AND MARKOV CHAINS


378


Beaver


0

CO

CO


          tc


-.    3
36.4 37.0


0


CO


Presidents


30  60


Yt


Yt


Figure 7.3: Yt+1 plotted against Y for the Beaver and Presidents data sets


   Figure    was produced by the following snippet.

   par ( mfrow=c(3,2) )
   temp <- beaverl$temp
   n <- length(temp)
   for ( k in 0:5 ) {
      x <- temp [1: (n-k)]
      y <- temp [(1+k) :n]
      plot ( x, y, xlab=expression(Y[t]),
      ylab=expression(Y[t+k]), main=paste("lag =", k) )
    }


    Because time series data cannot usually be treated as independent, we need
special methods to deal with them. It is beyond the scope of this book to present
the major theoretical developments of time series methods. As Figure shows,
there can be a wide variety of structure in time series data. In particular, the
Beaver, and Presidents data sets have no structure readily apparent to the eye;
DAX has seemingly minor fluctuations imposed on a general increasing trend; UK


﻿


7.2. TIME SERIES AND MARKOV CHAINS


                            lag = 0


3 79


lag =1


CO)


Co-
CO)


Co-
CO)


     O0


36.4  36.8  37.2


O


N-
CO)


Co
CO)


Co
CO)


O 0D

0O


O1


36.4  36.8  37.2


Yt


Yt


lag = 2


lag = 3


N
CO)


Co
CO)


Co
CO)


        0


    0 O      0


8


36.4  36.8  37.2


0


CO)

co


Co-
CO)


0


0


100 "OMI

    100 0
10


36.4  36.8  37.2


Yt


Yt


lag = 4


lag = 5


0


0


CO)


0)
0
00


0


CO


Co
CO)


0


01


0'


Col
(0
CO)


[o ~  OO


36.4  36.8  37.2


     V O


36.4  36.8  37.2


         Yt


Yt


Figure 7.4: Yt+k plotted against Y~ for the Beaver data set and lags k =o0,... ,5


﻿


7.2. TIME SERIES AND MARKOV CHAINS


380


Lung Disease and UK drivers have an annual cycle; Mauna Loa has an annual
cycle imposed and a general increasing trend; and Canadian Lynx and Sun Spots
are cyclic, but for no obvious reason and with no obvious length of the cycle. In
the remainder of this section we will show, by analyzing some of the data sets in
Figure , some of the possibilities.

Beaver Our goal is to develop a more complete picture of the probabilistic struc-
ture of the {Y}'s. To that end, consider the following question. If we're trying to
predict Y+1, and if we already know Y, does it help us also to know Yt_1? I.e., are
Yt_1 and Y+1 conditionally independent given Yt? That question can be answered
visually with a coplot (Figures  and  ). Figure  shows the coplot for the
Beaver data.

   Figure    was produced by the following snippet.

   temp <- beaverl$temp
   n      <- length ( temp )
   coplot ( temp[3:nl ~ temp[1:(n-2)1 I temp[2:(n-1)1,
              xlab=c (expression(Y[t-1]), expression(Y[t])),
              ylab=expression(Y[t+1]) )


   The figure is ambiguous. In the first, second, and sixth panels, Yt+1 and Yt_1
seem to be linearly related given Y, while in the third, fourth, and fifth panels,
Yt+1 and Yt_1 seem to be independent given Y. We can examine the question
numerically with the partial autocorrelation, the conditional correlation of Y+1 and
Yt_1 given Y. The following snippet shows how to compute partial autocorrelations
in R using the function pacf.

> pacf ( temp, lag.max=5, plot=F )

  Partial autocorrelations of series 'temp', by lag

        1      2       3      4       5
   0.826   0.014  0.031 -0.101 -0.063

The numbers in the bottom row are Cor(Y, Yt+k Yt+1, -- , Yt+k-1). Except for the
first, they're small. Figure  and the partial autocorrelations suggest that a model
in which Y+1 -L Yt-1 Y would fit the data well. And the first panel in Figure


﻿


7.2. TIME SERIES AND MARKOV CHAINS


381


Yt


   36.4  36.6 36.8  37.0 37.2  37.4


   36.4 36.8 37.2       36.4 36.8 37.2

       o          o


               06
          o       o


N
CV)


     0

     900


             36.4 36.8 37.2


COJ


CV)


CV)


CV)


+


Yt-1


Figure 7.5: coplot of Y~l as a function of Y_1 given Y~ for the Beaver data set


﻿


7.2. TIME SERIES AND MARKOV CHAINS


382


suggests that a model of the form Yt+1 = 0 + #1Y + Et+i might fit well. Such a
model is called an autoregression. R has a function ar for fitting them. Here's how
it works with the Beaver data.

  > fit <- ar ( beaverl$temp, order.max=1 )
  > fit # see what we've got

  Call:
  ar(x = beaverl$temp, order.max = 1)

  Coefficients:
        1
  0.8258

  Order selected 1 sigma^2 estimated as   0.01201

The 0.8258 means that the fitted model is Yt+1 =  o + 0.8258Y + Et+1. The et's
have an estimated variance of 0.012. f it$x.mean shows that 00 = 36.86. Finally,
qqnorm(f it$resid) (Try it.) shows a nearly linear plot, except for one point, indi-
cating that Yt+1 N(36.86 +.8258Y, .012) is a reasonably good model, except for
one outlier.

Mauna Loa The Mauna Loa data look like an annual cycle superimposed on a
steadily increasing long term trend. Our goal is to estimate both components and
decompose the data as

          Y= long term trend + annual cycle + unexplained variation.

Our strategy, because it seems easiest, is to estimate the long term trend first, then
use deviations from the long term trend to estimate the annual cycle. A sensible
estimate of the long term trend at time t is the average of a year's CO2 readings,
for a year centered at t. Thus, let

                          .5YL-6 + Yt-5 + ... + Yt+5 + .5yt+6
                    g(t) _=12(7.1)
                                         12

where g(t) represents the long term trend at time t. R has the built-in command
filter to compute g. The result is shown in Figure 7.6(a) which also shows how to
use f ilter. Deviations from are co2 - g.hat. See Figure 7(b). The deviations
can be grouped by month, then averaged. The average of the January deviations,


﻿


7.2. TIME SERIES AND MARKOV CHAINS


383


for example, is a good estimate of how much the January CO2 deviates from the
long term trend, and likewise for other months. See Figure  (c). Finally, Fig-
ure    (d) shows the data, y, and the fitted values g + monthly effects. The fit is
good: the fitted values differ very little from the data.


   Figure    was produced by the following snippet.

   filt <- c ( .5, rep(1,11), .5 ) / 12
   g.hat <- filter ( co2, filt )
   par ( mfrow=c(2,2) )
   plot.ts ( co2, main="(a)" )
   lines ( g.hat )
   resids <- co2 - g.hat
   plot.ts ( resids, main="(b)" )
   resids <- matrix ( resids, nrow=12 )
   cycle <- apply ( resids, 1, mean, na.rm=T )
   plot ( cycle, type="b", main="(c)" )
   plot.ts ( co2, type="p", pch=".", main="(d)" )
   lines ( g.hat )
   lines ( g.hat + cycle )


DAX   Y is the closing price of the German stock exchange DAX on day t. Investors
often care about the rate of return Yt* = Yt+1/Y, so we'll have to consider whether
to analyze the Y's directly, or convert them to Yt*'s first. Figure is for the DAX
prices directly. Panel (a) shows the Y's. It seems to show minor fluctuations around
a steadily increasing trend. Panel (b) shows the time series of Y - Yt_1. It seems to
show a series of fluctuations approximately centered around 0, with no apparent
pattern, and with larger fluctuations occuring later in the series. Panel (c) shows
Y versus Yt_1. It shows a strong linear relationship between Y and Yt_1. Two
lines are drawn on the plot: the lines Y = # + #1Y_1 for (O, +) = (0, 1) and
for (O,# 1) set equal to the ordinary regression coefficients found by lm. The two
lines are indistinguishable, suggesting that Y ~ Yt_1 is a good model for the data.
Panel (d) is a Q-Q plot of Y - Y_1. It is not approximately linear, suggesting that
Y ~ N(Y_1, o-) is not a good model for the data.


﻿


7.2. TIME SERIES AND MARKOV CHAINS


384


(a)


    0
    (-
    C')


CVJ 0
O   rI
o   C')


    0
    CJ
    C')


(I)

(I


N~J


0 -


N~


(b)


1960   1980


       Time


       (C)


       0


    /  0


1960   1980


       Time


       (d)


a)

0


C')


0


C')


0


0


    0
    (0
    C')


CVJ 0
O   ri
o   C')


    0
    CVJ
    C')


2 46 8


    Index


12


1960   1980


       Time


Figure 7.6: (a): CO2 and g; (b): residuals; (c): residuals averaged by month; (d):

data, g, and fitted values


﻿


7.2. TIME SERIES AND MARKOV CHAINS


                            (a)


385


(b)


    0
    0
    0
><  LO

  0
    0
    0
    0~


  0

p   0


       1992  1995  1998


1992  1995  1998


Time


Time


(c)


(d)


    0
    0
    0
X LO

  0
    0
    0
    0~


0-  0


E0


2000  4000  6000


     DAXt_1


-3   -1   1   3

Theoretical Quantiles


Figure 7.7: DAX closing prices. (a): the time series of Y's; (b): Yt - Y-1; (c): Yt
versus Yi (d): QQ plot of Yt - Y-1"


﻿


7.2. TIME SERIES AND MARKOV CHAINS


386


   Figure    was produced by the following snippet.

   par ( mfrow=c(2,2) )
   plot.ts ( DAX, main="(a)" )
   plot.ts ( diff(DAX), ylab = expression ( DAX[t] - DAX[t-1] ),
            main="(b)" )
  plot ( DAX[-n], DAX[-1], xlab = expression ( DAX[t-1] ),
         ylab = expression ( DAX[t] ), main="(c)" )
  abline ( 0, 1 )
  abline ( lm ( DAX[-1]  DAX[-n] )$coef, lty=2 )
  qqnorm ( diff (DAX) , main="(d)" )

  " The R command diff is for taking differences, typically of time series.
     diff (y) yields y [2] -y [1] , y [3] -y [2] , ... which could also be accom-
     plished easily enough without using diff: y [-11 - y [-n]. But additional
     arguments, as in diff ( y, lag, differences ) make it much more use-
     ful. For example, diff (y, lag=2) yields y [3] -y [1] , y E4] -y [2] , ... while
     diff ( y, differences=2 ) is the same as diff ( diff (y) ). The latter is
     a construct very useful in time series analysis.


   Figure is for the Y*'s. Panel (a) shows the time series. It shows a seemingly
patternless set of data centered around 1. Panel (b) shows the time series of Y* -
Y*_1, a seemingly patternless set of data centered at 0. Panel (c) shows Yt* versus
Y*_1. It shows no apparent relationship between Y* and Y*1, suggesting that Y*
Y*_1 is a good model for the data. Panel (d) is a Q-Q plot of Y*. It is approximately
linear, suggesting that Y* ~ N(p, o-) is a good model for the data, with a few
outliers on both the high and low ends. The mean and SD of the Y*'s are about
1.000705 and 0.01028; so Y* ~ N(1.0007, 0.01) might be a good model.

   Figure    was produced by the following snippet.
   par ( mfrow=c(2,2) )
   plot.ts ( rate, main="(a)" )
   plot.ts ( diff(rate), ylab = expression ( rate[t] - rate[t-1] ),
            main="(b)" )
  plot ( rate[-n2], rate[-1], xlab = expression ( rate[t-1] ),
         ylab = expression ( rate[t] ), main="(c)" )
  qqnorm ( rate, main="(d)" )


﻿


7.2. TIME SERIES AND MARKOV CHAINS


                             (a)


387


(b)


a)


O
O


N~
O)
O


a)

WY


10
0
6


0


o 500     1500


      Time


      (C)


0 500     1500


      Time


      (d)


aY


0
0


N~
C)
0


--
E
0z


0
0


N~
C)
0


0.92  0.98  1.04


-3  -1   1


3


ratet_1


Theoretical Quantiles


Figure 7.8: DAX returns. (a): the time series of Yt7*'s; (b): Y/t*

YtA1; (d): QQ plotof Yt*.


YtA ; (c): Y* versus


﻿


7.3. CONTINGENCY TABLES


388


   We now have two possible models for the DAX data: Y   Yt_1 with a still
to be determined distribution and Y*  N(1.0007, 0.01) with the Yt*'s mutually
independent. Both seem plausible on statistical grounds. (But see Exercise 5 for
further development.) It is not necessary to choose one or the other. Having
several ways of describing a data set is useful. Each model gives us another way to
view the data. Economists and investors might prefer one or the other at different
times or for different purposes. There might even be other useful models that we
haven't yet considered. Those would be beyond the scope of this book, but could
be covered in texts on time series, financial mathematics, econometrics, or similar
topics.
   population matrix models

Example 7.1 (FACE)
Richter's throughfall data

Example 7.2 (hydrology)
Jagdish's data. Is this example too complicated?


7.3 Contingency Tables

loglinear models? Simpson's paradox? Census tables as examples?


7.4 Survival analysis

In many studies, the random variable is the time at which an event occurs. For
example,

medicine The time until a patient dies.

neurobiology The time until a neuron fires.

quality control The time until a computer crashes.

higher education The time until an Associated Professor is promoted to Full Pro-
     fessor.

Such data are called survival data. For the i'th person, neuron, computer, etc., there
is a random variable
                          ye=time of event on i'th unit.


﻿


7.4. SURVIVAL ANALYSIS


389


We usually call y2 the lifetime, even though the event is not necessarily death. It
is often the case with survival data that some measurements are censored. For
example, if we study a university's records to see how long it takes to get promoted
from Associate to Full Professor, we will find some Associate Professors leave the
university - either through retirement or by taking another job - before they get
promoted while others are still Associate Professors at the time of our study. For
these people we don't know their time of promotion. If either (a) person i left the
university after five years, or (b) person i became Associate Professor five years
prior to our study, then we don't know y2 exactly. All we know is y2 > 5. This
form of censoring is called right censoring. In some data sets there may also be
left censoring or interval censoring. Survival analysis typically requires specialized
statistical techniques. R has a package of functions for this purpose; the name of
the package is survival. The survival package is automatically distributed with
R. To load it into your R session, type library(survival). The package comes
with functions for survival analysis and also with some example data sets. Our
next example uses one of those data sets.

Example 7.3 (Bladder Tumors)
This example comes from a study of bladder tumors, originally published inBar [1980]
and later reanalyzed in Wei et al [1989]. Patients had bladder tumors. The tumors
were removed and the patients were randomly assigned to one of three treatment groups
(placebo, thiotepa, pyridoxine). Then the patients were followed through time to see
whether and when bladder tumors would recur. R's survival package has the data
for the first two treatment groups, placebo and thiotepa. Type bladder to see it.
(Remember to load the survival package first.) The last several lines look like this.

       id rx number size stop event enum
  341 83    2       3    4    54      0     1
  342 83    2       3    4    54      0     2
  343 83    2       3    4    54      0     3
  344 83    2       3    4    54      0     4
  345 84    2       2     1   38      1     1
  346 84    2       2     1   54      0     2
  347 84    2       2     1   54      0     3
  348 84    2       2     1   54      0     4
  349 85    2       1    3    59      0     1
  350 85    2       1    3    59      0     2
  35185     2       1    3    59      0     3
  352 85    2       1    3    59      0     4


﻿


7.4. SURVIVAL ANALYSIS


390


    " id is the patient's id number. Note that each patient has four lines of data. That's
      to record up to four recurrences of tumor.

    " rx is the treatment: 1 for placebo; 2 for thiotepa.

    " number is the number of tumors the patient had at the initial exam, when the
      patient joined the study.

    " size is the size (cm) of the largest initial tumor.

    " stop is the time (months) of the observation.

    " event is 1 if there's a tumor; 0 if not.

    " enum line 1, 2, 3, or 4 for each patient

For example, patient 83 was followed for 54 months and had no tumor recurrences;
patient 85 was followed for 59 months and also had no recurrences. But patient 84, who
was also followed for 54 months, had a tumor recurrence at month 38 and no further
recurrences after that. Our analysis will look at the time until the first recurrence, so we
want bladder [bladder$enum==1,], the last several lines of which are
        id rx number size stop event enum
   329 80   2        3     3    49       0     1
   333 81   2        1     1    50       0     1
   337 82   2        4     1     4       1     1
   341 83   2        3     4    54       0     1
   345 84   2        2     1    38       1     1
   349 85   2        1     3    59       0     1


Patients 80, 81, 83, and 85 had no tumors for as long as they were followed; their data
is right-censored. The data for patients 82 and 84 is not censored; it is observed exactly.
    Figure 79 is a plot of the data. The solid line is for thiotepa; the dashed line for
placebo. The abscissa is in months. The ordinate shows the fraction of patients who have
survived without a recurrence of bladder tumors. The plot shows, for example, that at 30
months, the survival rate without recurrence is about 50% for thiotepa patients compared
to a little under 40% for placebo patients. The circles on the plot show censoring. I.e.,
the four circles on the solid curve between 30 and 40 months represent four placebo
patients whose data was right-censored. There is a circle at every censoring time that is
not also the time of a recurrence (for a different patient).


﻿


7.4. SURVIVAL ANALYSIS


391


0)

0

0

a)
0


0


0


(0


6


0     10     20     30     40    50     60


                  months


Figure 7.9: Survival curve for bladder
for placebo.


cancer. Solid line for thiotepa; dashed line


﻿


7.4. SURVIVAL ANALYSIS


392


   Figure    was produced with the snippet

       event. first <- bladder [, "enum"] == 1
       blad.surv <- Surv ( bladder [event .first, "stop",],
                             bladder [event . first , "event "] )
       blad.f it <- survf it ( blad. surv ~ bladder [event .first , "rx"] )
       plot ( blad.f it, conf.int=FALSE, mark=1, xlab="months",
              ylab="fraction without recurrence", lty=1:2 )


   " Surv is R's function for creating a survival object.   You can type
     print (blad. surv) and summary(blad. surv) to learn more about survival ob-
     jects.

   " survfit computes an estimate of a survival curve.


   In survival analysis we think of y1, ... , y., as a sample from a distribution F,
with density f, of lifetimes. In survival analysis, statisticians often work with the
survivor function S(t) = 1 - F(t) = P[yi > t], the probability that a unit survives
beyond time t. The lines in Figure   are the so-called Kaplan-Meier estimates of
S for patients in the thiotepa and placebo groups, which arise from the following
argument. Partition R+ into intervals (0, t1], (ti1, t2], ..., and let pi = P[y > tily >
t2_1]. Then for each i, Si = Hj pi. The pi's can be estimated from data as pi =
(r(ti) - di)/r(ti) where r(ti) is the number of people at risk (in the study but not
yet dead) at time ti_1 and di is the number of deaths in the interval (t2_1, ti]. Thus,
S(tI)   _ fJ (r(ti) - di)/r(ti). As the partition becomes finer, most terms in the
product are equal to one; only those intervals with a death contribute a term that
is not one. The limit yields the Kaplan-Meier estimate


                                      i~itr(yz)

This estimate is reasonably accurate so long as r(t) is reasonably large; S(t) is more
accurate for small values of t than for large values of t; and there is no information
at all for estimating S(t) for t > max{yi}.
   Survival data is often modelled in terms of the hazard function

       li(t) = liiP [y E [t, t + h) y > t] =.P [y c [t, t + h)] _ f(t)  (7.2)
             h->O         h             h->0) h P[y ;> t]    S(t)


﻿


7.4. SURVIVAL ANALYSIS


393


The interpretation of h(t) is the fraction, among people who have survived to time
t, of those who will die soon thereafter. There are several parametric families
of distributions for lifetimes in use for survival analysis. The most basic is the
exponential - f(y) = (1/A) exp-Y/A - which has hazard function h(y) =1/A, a
constant. A constant hazard function says, for example, that young people are just
as likely to die as old people, or that new air conditioners are just as likely to fail
as old air conditioners. For many applications that assumption is unreasonable, so
statisticians may work with other parametric families for lifetimes, especially the
Weibull, which has h(y) = Aa(Ay)>-1, an increasing function of y if Ay > 1. We
will not dwell further on parametric models; the interested reader should refer to
a more specialized source.
   However, the goal of survival analysis is not usually to estimate S and h, but
to compare the survivor and hazard functions for two groups such as treatment
and placebo or to see how the survivor and hazard functions vary as functions of
some covariates. Therefore it is not usually necessary to estimate S and h well, as
long as we can estimate how S and h differ between groups, or as a function of the
covariates. For this purpose it has become common to adopt a proportional hazards
model:
                              h(y) = ho(y) exp(3'x)                         (7.3)

where ho is the baseline hazard function that is adjusted according to x, a vector
of covariates and 13, a vector of coefficients. Equation 7: is known as the Cox pro-
portional hazards model. The goal is usually to estimate 3. fR's survival package
has a function for fitting Equation 73 to data.

Example 7.4 (Bladder Tumors, cont.)
This continues Example 73. Here we adopt the Cox proportional hazards model and
see how well we can estimate the effect of the treatment thiotepa compared to placebo
in preventing recurrence of bladder tumors. We will also examine the effects of other
potential covariates.
   We'd like to fit the Cox proportional hazards model h(y) = ho(y) exp(/trt - trt)
where trt is an indicator variable that is 1 for patients on thiotepa and 0 for patients
on placebo; but first we check whether such a model looks plausible; i.e. whether the
hazards look proportional. Starting from Equation.2 we can integrate both sides to get
H(y) =fo'h(z)dz      -log S(y). H(y) is called the cumulative hazard function. Thus,
if the two groups have proportional hazards, they also have proportional cumulative hazard
functions and log survivor functions. Figure,7:10 plots the estimated cumulative hazard
and log (cumulative hazard) functions for the bladder tumor data. The log (cumulative
hazard) functions look parallel, so the proportional hazards assumption looks reasonable.


﻿


7.4. SURVIVAL ANALYSIS


394


CN~


N

CEj


E
0


N


E
0
0


_ _-MM
P


CoI


20


50


I    I
   200


0 10      30      50


months


months


Figure 7.10: Cumulative hazard and log(hazard) curves for bladder cancer. Solid
line for thiotepa; dashed line for placebo.


   Figure    was produced with the snippet


plot ( blad.f it, conf.int=FALSE, mark=1, xlab="months",
        ylab="cumulative hazard", lty=1:2, fun="cumhaz" )
plot ( blad.f it, conf.int=FALSE, mark=1, xlab="months",
        ylab="log(cumulative hazard)", lty=1:2, fun="cloglog"


)


* The fun argument allows transformations of the survival curve. fun="cumhaz"
  plots the cumulative hazard function and fun="cloglog" plots the log (cumulative
  hazard) function.


﻿


7.5. THE POISSON PROCESS


395


   Since the proportional hazards model looks reasonable, we fit it:
blad.cox <- coxph ( blad.surv      bladder[event.first,"rx"] ).
Printing blad. cox yields


  Call:
  coxph(formula = blad.surv     bladder[event.first, "rx"])

                                  coef exp(coef) se(coef)       z    p
  bladder[event.first, "rx"] -0.371         0.69     0.303 -1.22 0.22

  Likelihood ratio test=1.54 on 1 df, p=0.215 n= 85


The estimated coefficient is i3tt = -0.371. Thus the hazard function for thiotepa patients
is estimated to be exp(-0.371) = 0.69 times that for placebo patients. The standard
error of #t/t = -0.371 is about 0.3; SO 3trt is accurate to about +0.6 or so.


7.5 The Poisson process


Example 7.5
Earthquakes

Example 7.6
Neurons firing


7.6 Change point models


Example 7.7
Everglades


﻿


7.7. SPATIAL MODELS                                                 396

7.7 Spatial models

7.8 Point Process Models

7.9 Evaluating and enhancing models

residuals, RSS, pred. vs. obs., QQ plots, chi-square, AIC, BIC, DIC, SE's of coeffi-
cients, testing nested models, others?


7.10     Exercises

  1. (a) Make plots analagous to Figures 73 and 7.4, compute autocorrelations,
         and interpret for the other datasets in Figure 7.2.
      (b) Make plots analagous to Figure 7., compute partial autocorrelations,
         and interpret for the other data sets in Figure72.

  2. Create and fit a good model for body temperatures of the second beaver. Use
     the dataset beaver2.

  3. (a) Why does Equation71 average over a year? Why isn't it, for example,

                         t =5Yt-k + Yt-k+1 + ... + Yt+k-1 + .5Yt+k
                                          2k
         for some k / 6?
      (b) Examine , in Equation 7. Use R if necessary. Why are some of the
         entries NA?

  4. The R code for Figure 7.6 contains the lines

         resids <- matrix ( resids, nrow=12 )
         cycle <- apply ( resids, 1, mean, na.rm=T )


     Would the following lines work instead?

         resids <- matrix ( resids, ncol=12 )
         cycle <- apply ( resids, 2, mean, na.rm=T )


﻿


7.10. EXERCISES


397


    Why or why not?

 5. Figure 7.7 and the accompanying text suggest that Y Yt_1 is a good model
    for the DAX data. But that doesn't square with the observation that the Y's
    have a generally increasing trend.

    (a) Find a quantitative way to show the trend.
    (b) Say why the DAX analysis missed the trend.
    (c) Improve the analysis so it's consistent with the trend.

 6. Figures 7.7 and 7.8 and the accompanying text analyze the DAX time series
    as though it has the same structure throughout the entire time. Does that
    make sense? Think of and implement some way of investigating whether the
    structure of the series changes from early to late.

 7. Choose one or more of the other EU Stock Markets that come with the DAX
    data. Investigate whether it has the same structure as the DAX.

 8. (a) Make a plausible analysis of the UK Lung Disease data.
     (b) R has the three data sets ldeaths, fdeaths, and mdeaths which are the
        total deaths, the deaths of females, and the deaths of males. Do the
        deaths of females and males follow similar distributional patterns? Jus-
        tify your answer.

 9. Make a plausible analysis of the Presidents approval ratings.

10. (a) Make a plausible analysis of the UK drivers deaths.
     (b) According to R, "Compulsory wearing of seat belts was introduced on 31
        Jan 1983." Did that effect the number of deaths? Justify your answer.
     (c) Is the number of deaths related to the number of kilometers driven?
         (Use the variable kms in the SeatbeIts data set.) Justify your answer.

11. This question follows up Example 7.4. In the example we analyzed the data to
    learn the effect of thiotepa on the recurrence of bladder tumors. But the data
    set has two other variables that might be important covariates: the number
    of initial tumors and the size of the largest initial tumor.

    (a) Find the distribution of the numbers of initial tumors. How many pa-
         tients had 1 initial tumor, how many had 2, etc?


﻿


7.10. EXERCISES                                                           398

      (b) Divide patients, in a sensible way, into groups according to the number
          of initial tumors. You must decide how many groups there should be
          and what the group boundaries should be.
      (c) Make plots similar to Figures7. and 7 1 to see whether a proportional
          hazard model looks sensible for number of initial tumors.
      (d) Fit a proportional hazard model and report the results.
      (e) Repeat the previous analysis, but for size of largest initial tumor.
      (f) Fit a proportional hazard model with three covariates: treatment, num-
          ber of initial tumors, size of largest initial tumor. Report the results.


﻿


CHAPTER 8


            MATHEMATICAL STATISTICS


8.1 Properties of Statistics

8.1.1 Sufficiency
Consider the following two facts.

   1. Let Y1 ,..., Y  ~ i.i.d. Poi(A). Chapter , Exercise 7 showed that f(A) depends
     only on E Y and not on the specific values of the individual Y's.

  2. Let Y1, ..., Y ~ i.i.d. Exp(A). Chapter , Exercise20 showed that f(A) de-
     pends only on E Y and not on the specific values of the individual Y's.

Further, since f(A) quantifies how strongly the data support each value of A, other
aspects of y are irrelevant. For inference about A it suffices to know f(A), and
therefore, for Poisson and Exponential data, it suffices to know E Y. We don't
need to know the individual Y's. We say that E Y is a sufficient statistic for A.
   Section      examines the general concept of sufficiency. We work in the con-
text of a parametric family. The idea of sufficiency is formalized in Definition 8 .

Definition 8.1. Let {p(-|O)} be a family of probability densities indexed by a pa-
rameter 0. Let y = (yi, ..., yn) be a sample from p(-|0) for some unknown 0. Let
T(y) be a statistic such that the joint distribution factors as

                         f p(y l0)   g(T(y), 0)h(y).

for some functions g and h. Then T is called a sufficient statistic for 0.


399


﻿


8.1. PROPERTIES OF STATISTICS


400


   The idea is that once the data have been observed, h(y) is a constant that does
not depend of 0, so f(O) c ]   0)p(y,1)= g(T, 0)h(y) c g(T, 0). Therefore, in order
to know the likelihood function and make inference about 0, we need only know
T(y), not anything else about y. For our Poisson and Exponential examples we can
take T(y)   E y.
   For a more detailed look at sufficiency, think of generating three Bern(0) trials
y = (yi, y2, g3). y can be generated, obviously, by generating yi, Y2, y3 sequentially.
The possible outcomes and their probabilities are

                               (0, 0, 0) (1-0)3
                               (1,0,0)
                               (0,1, 0) (1-O)2
                               (0, 0,1)
                               (1, 1, 0)
                               (1, 0, 1) 02(1 -0)
                               (0, 1,1)
                               (1,1,1)     03

But y can also be generated by a two-step procedure:

   1. Generate Ey = 0, 1, 2, 3 with probabilities (1- 6)3, 3O(1- )2, 302(1- ), 63,
     respectively.

   2. (a) IfEy2= 0, generate (0, 0, 0)
      (b) If   y2= 1, generate (1, 0, 0), (0, 1, 0), or (0, 0, 1) each with probability
          1/3.
      (c) If E y 2= 2, generate (1, 1, 0), (1, 0, 1), or (0, 1, 1) each with probability
          1/3.
      (d) If E y2 = 3, generate (1, 1, 1)

It is easy to check that the two-step procedure generates each of the 8 possible
outcomes with the same probabilities as the obvious sequential procedure. For
generating y the two procedures are equivalent. But in the two-step procedure,
only the first step depends on 0. So if we want to use the data to learn about 0, we
need only know the outcome of the first step. The second step is irrelevant. I.e.,
we need only know E yi. In other words, E ye is sufficient.


﻿


8.1. PROPERTIES OF STATISTICS


401


   For an example of another type, let y, . .. , y  i.i.d. U(0, 0). What is a sufficient
statistic for 0?

                        p~y1)=(yif y <6fori= 1,...,n
                                0 otherwise
                                1
                              =n 1(0,0) (Y(Th))

shows that Y(n), the maximum of the yr's, is a one dimensional sufficient statistic
for 0.

Example 8.1
In World War II, when German tanks came from the factory they had serial numbers
labelled consecutively from 1. I.e., the numbers were 1, 2,..... The Allies wanted to
estimate T, the total number of German tanks and had, as data, the serial numbers
of captured tanks. See Exercise 22 in Chapter . Assume that tanks were captured
independently of each other and that all tanks were equally likely to be captured. Let
x1,... , xr be the serial numbers of the captured tanks. Then x(n) is a suffcient statistic.
Inference about the total number of German tanks should be based on x(n) and not on
any other aspect of the data.

   If y is a random variable whose values are in a space Y, then y is a random
variable whose values are in y". For any statistic T we can divide Y" into subsets
indexed by T. I.e., for each value t, we define the subset

                           y" = {y E E" : T(y) = t}

Then T is a sufficient statistic if and only if

                                  p(yl y E 7t)

does not depend on 0.
   Sometimes sufficient statistics are higher dimensional. For example, let y,... Y, y~
i.i.d. Gam(a, #0). Then


       H YiIF(a)        3a 1 a-Y2//3F ('Y2 ~'~aieZ~/

so T(y)= (H     Ly, E y) is a two dimensional sufficient statistic.
   Sufficient statistics are not unique. If T =-T(y) is a sufficient statistic, and if f
is a 1-1 function, then f(T) is also sufficient. So in the Poisson, Exponential, and


﻿


8.1. PROPERTIES OF STATISTICS


402


Bernoulli examples where E y2 was sufficient, y =     yi/n is also sufficient. But
the lack of uniqueness is even more severe. The whole data set T(y) = (y) is an
n-dimensional sufficient statistic because

                           f p(yi |0) = g(T(y), )h(y)
where g(T(y), 0) = p(y 10) and h(y) = 1. The order statistic T(y) = (y(1), ..., y(n))
is another n-dimensional sufficient statistic. Also, if T is any sufficient one dimen-
sional statistic then T2 = (y1, T) is a two dimensional sufficient statistic. But it is in-
tuitively clear that these sufficient statistics are higher-dimensional than necessary.
They can be reduced to lower dimensional statistics while retaining sufficiency, that
is, without losing information.
   The key idea in the preceding paragraph is that the high dimensional sufficient
statistics can be transformed into the low dimensional ones, but not vice versa.
E.g., y is a function of (y(),. . . , y(n)) but (y(),... ,y(n)) is not a function of Y. Defi-
nition,8: is for statistics that have been reduced as much as possible without losing
sufficiency.
Definition 8.2. A sufficient statistic T(y) is called minimal sufficient if, for every
other sufficient statistic T2, T(y) is a function of T2(y).
   This book does not delve into methods for finding minimal sufficient statistics.
In most cases the user can recognize whether a statistic is minimal sufficient.
   Does the theory of sufficiency imply that statisticians need look only at sufficient
statistics and not at other aspects of the data? Not quite. Let y,...., y, be binary
random variables and suppose we adopt the model yi, ... , y~ i.i.d. Bern(0). Then
for estimating 0 we need look only at E y2. But suppose (Yi,. . . , yn) turn out to be
                               00 --- 011 --- 1
                                 many O's many 1's
i.e., many O's followed by many 1's. Such a dataset would cast doubt on the as-
sumption that the yr's are independent. Judging from this dataset, it looks much
more likely that the yr's come in streaks. So statisticians should look at all the data,
not just sufficient statistics, because looking at all the data can help us create and
critique models. But once a model has been adopted, then inference should be
based on sufficient statistics.

8.1.2 Consistency, Bias, and Mean-squared Error
Consistency    Heuristically speaking, as we collect ever more data we should be
able to learn the truth ever more accurately. This heuristic is captured formally,


﻿


8.1. PROPERTIES OF STATISTICS


403


at least for parameter estimation, by the notion of consistency. To say whether an
estimator is consistent we have to define it for every sample size. To that end, let
Y1, Y2, -..-. i.i.d. f for some unknown density f having finite mean ,u and SD a.
For each n E N let T :R" - inR. I.e. Tn is a real-valued function of (yi,... Yyn). For
example, if we're trying to estimate pu we might take T m = n-1 E" y.

Definition 8.3. The sequence of estimators T1, T2, ... is said to be consistent for the
parameter 0 if for every 0 and for every Ec> 0.

                             lim P [lITn - 8 | < E] = 1.
                             h-oo
    For example, the Law of Large Numbers, Theorem L12, says the sequence of
sample means {T   = n-1 E   y} is consistent for p. Similarly, let Sn = n-1 Ei(yi -
Tn)2 be the sample variance. Then {S,} is consistent for a2. More generally, m.l.e.'s
are consistent.

Theorem 8.1. Let Y1, Y2, -.-.-. i.i.d. Ypy 0) and let On be the m.l.e. from the sample
(y1, ... , yn). Further, let g be a continuous function of 0. Then, subject to regularity
conditions, {g((9)} is a consistent sequence of estimators for g(0).

Proof. The proof requires regularity conditions relating to differentiability and the
interchange of integral and derivative. It is beyond the scope of this book. Q

    Consistency is a good property; one should be wary of an inconsistent es-
timator. On the other hand, consistency alone does not guarantee that a se-
quence of estimators is optimal, or even sensible. For example, let R (yi , ..., yn)
(Ln/2i|)- (y1+-" --+YL/2j), the mean of the first half of the observations. (Lw] is the
floor of w, the largest integer not greater than w.) The sequence {R,} is consistent
for yu but is not as good as the sequence of sample means.

Bias It seems natural to want the sampling distribution of an estimator to be
centered around the parameter being estimated. This desideratum is captured
formally, at least for centering in the sense of expectation, by the notion of bias.

Definition 8.4. Let 0 = O(yi,.. . , y) be an estimator of a parameter 0. The quantity
E[O] - 0 is called the bias of 0. An estimator whose bias is 0 is called unbiased.

    Here are some examples.

An unbiased estimator Let y,. .. ,yi~ i.i.d. NQp,   ) and consider yt =y as an
      estimate of pu. Because EB[y] =p, yj is an unbiased estimate of pu.


﻿


8.2. TRANSFORMATIONS OF PARAMETERS


404


A biased estimator Let y1,... , y~ i.i.d. N(p, a) and consider


     as an estimate of U2.
         E[n-' (yi - Y)2] 1E1[Z(y2 - p  + p - Y)2]

                         = n-1 {E[Z(y - p)2] + 2E[Z(yi - p)(p - Y)]

                            +         -E[ (p-_Y)2
                         = n- {un2 - 2U2 + U2}
                            2   -1 2
                          nU -12 27

                            n
     Therefore 62 is a biased estimator of 62. It's bias is -2/n. Some statisticians
     prefer to use the unbiased estimator &2  (n _1)-1 L(y2.-

A biased estimator Let x1,... , x~ i.i.d. U(0, 0) and consider 0 = x(n) as an esti-
     mate of 0. (0 is the m.l.e.; see Section 5.4.) But x(n) < 0; therefore IE[xn] < 0;
     therefore x(n) is a biased estimator of 0.

Mean Squared Error

8.1.3 Efficiency

8.1.4 Asymptotic Normality

8.1.5 Robustness

8.2    Transformations of Parameters

Equivalent parameterizations, especially ANOVAs, etc. Invariance of MLEs.


8.3    Information

8.4 More Hypothesis Testing


[a more formal presentation here?]


﻿


8.5. EXPONENTIAL FAMILIES


405


8.4.1 p values

8.4.2 The Likelihood Ratio Test

8.4.3 The Chi Square Test

8.4.4 Power

8.5 Exponential families

8.6 Location and Scale Families

Location/scale families


8.7 Functionals

functionals


8.8 Invariance

Invariance


8.9 Asymptotics

In real life, data sets are finite: (y1,..., y,). Yet we often appeal to the Law of
Large Numbers or the Central Limit Theorem, Theorems    ,    , and
which concern the limit of a sequence of random variables as n - 00. The hope is
that when n is large those theorems will tell us something, at least approximately,
about the distribution of the sample mean. But we're faced with the questions
"How large is large?" and "How close is the approximation?"
   To take an example, we might want to apply the Law of Large Numbers or
the Central Limit Theorem to a sequence Y1, Y2,... of random variables from a
distribution with mean y and SD o. Here are a few instances of the first several
elements of such a sequence.


﻿


8.9. ASYMPTOTICS


406


       0.70   0.29   0.09  -0.23  -0.30  -0.79  -0.72  -0.35  1.79  -
       -0.23 -0.24   0.29  -0.16   0.37  -0.01  -0.48  -0.59  0.39  -
       -1.10 -0.91  -0.34   0.22   1.07  -1.51  -0.41  -0.65  0.07  -


Each sequence occupies one row of the array. The "- - - " indicates that the sequence
continues infinitely. The ":" indicates that there are infinitely many such sequences.
The numbers were generated by

  y <- matrix ( NA, 3, 9 )
  for ( i in 1:3 ) {
    y [i , ] <- rnorm(9)
    print ( round ( y[i,], 2 ) )
  }

  * I chose to generate Y's from the N(0, 1) distribution, so I used rnorm, and so,
     for this example, pu= 0 and a = 1. Those are arbitrary choices. I could have
     used any values of pu and a and any distribution for which I know how to
     generate random variables on the computer.

   * round does rounding. In this case we're printing each number with two deci-
     mal places.

Because there are multiple sequences, each with multiple elements, we need two
subscripts to keep track of things properly. Let Yj be the j'th element of the i'th se-
quence. For the i'th sequence of random variables, we're interested in the sequence
of means Yi, Y2, ... where Yi> = (Yi + - - - + Yin)/n. And we're also interested in
the sequence Zi, Z42,... where Zi, = v/n(Yn - u). For the three instances above,
the Yin's and Zrn's can be printed with

  for ( i in 1:3 ) {
    print ( round ( cumsum(y[i,]) / 1:9, 2) )
    print ( round ( cumsum(y[i,]) / (sqrt(1:9)), 2) )
}

   * cumsum computes a cumulative sum; so cumsum(y [1,]) yields the vector
     y[1,1], y[1,1]+y[1,2], ..., y[1,1]+...+y[1,9]. (Print out
     cumsum(y [1,]) if you're not sure what it is.) Therefore,
     cumsum(y [i ,])/ /: 9 is the sequence of Y>'s.


﻿


8.9. ASYMPTOTICS


407


   " sqrt computes the square root. So the second print statement prints the
     sequence of Z<n's.

The results for the Yin's are


0.70
-0.23
-1.10


0.49
-0.23
-1.01


0.36
-0.06
-0.78


0.96
-0.23
-1.94


0.21
-0.08
-0.53


0.84
-0.31
-1.83


0.11
0.01
-0.21


0.71
-0.15
-1.35


-0.04
0.00
-0.43


0.39
-0.15
-1.97


-0.14
-0.07
-0.43


0.11
-0.33
-2.12


-0.16
-0.13
-0.45


-0.01
-0.54
-2.35


0.05
-0.07
-0.40


0.59
-0.41
-2.33


and for the Zrn's are


0.70
-0.23
-1.10


0.91
-0.40
-1.74


We're interested in the following questions.

   1. Will every sequence of Y's or Zr's converge?
     along each row of the array.


This is a question about the limit


  2. If they converge, do they all have the same limit?

  3. If not every sequence converges, what fraction of them converge; or what is
     the probability that a randomly chosen sequence of Y's or Zr's converges?

  4. For a fixed n, what is the distribution of Yn or Z,. This is a question about
     the distribution along columns of the array.

  5. Does the distribution of Yn or Z, depend on n?

  6. Is there a limiting distribution as n -oo?

  Some simple examples and the Strong Law of Large Numbers, Theorem
answer questions 1, 2, and 3 for the sequences of Y's. The Central Limit Theorem,
Theorem    14, answers question 6 for the sequences of Zr's.

   1. Will every sequence of Y's converge? No. Suppose the sequence of Y's is
     1, 2, 3, .... Then {Y} increases without limit and does not converge.

  2. If they converge, do they have the same limit? No. Here are two sequences
     of Y's.


﻿


8.9. ASYMPTOTICS


408


                                   1   1   1   ---
                                   -1  -1  -1  -""

     The corresponding sequences {Y} converge to different limits.

   3. What is the probability of convergence? The probability of convergence is 1.
     That's the Strong Law of Large Numbers. In particular, the probability of
     randomly getting a sequence like 1, 2, 3, ... that doesn't converge is 0. But
     the Strong Law of Large Numbers says even more. It says

                                 P[ lim Yn = ] = 1.
                                   nh-oo

     So the probability of getting sequences like 1,1,1, ... or -1, -1, -1, ... that
     converges to something other than yu is 0.

   4. What is the distribution of Zn? We cannot say in general. It depends on the
     distribution of the individual Y's.

   5. Does the distribution of Zn depend on n? Yes, except in the special case
     where Y~ N(0, 1) for all i.

   6. Is there a limiting distribution? Yes. That's the Central Limit Theorem. Re-
     gardless of the distribution of the YTg's, as long as Var(Yig) < oo, the limit, as
     n - oc, of the distribution of Z, is N(O, 1).

   The Law of Large Numbers and the Central Limit Theorem are theorems about
the limit as n - oo. When we use those theorems in practice we hope that our
sample size n is large enough that Yi, ~p and Z     N(O, 1), approximately. But
how large should n be before relying on these theorems, and how good is the
approximation? The answer is, "It depends on the distribution of the Yj's". That's
what we look at next.
   To illustrate, we generate sequences of Y's from two distributions, compute
Yin's and Zin's for several values of n, and compare. One distribution is U(0, 1); the
other is a recentered and rescaled version of Be(.39, .01).
   The Be(.39, .01) density, shown in Figure 8.1, was chosen for its asymmetry. It
has a mean of .39/.40 = .975 and a variance of (.39)(.01)/((.40)2(1.40)) .017. It
was recentered and rescaled to have a mean of .5 and variance of 1/12, the same
as the U(0, 1) distribution.
   Densities of the Ys,'s are in Figure 8.. As the sample size increases from nm  10
to nm 270, the iY>'s from both distributions get closer to their expected value of


﻿


8.9. ASYMPTOTICS


409


0.5. That's the Law of Large Numbers at work. The amount by which they're off
their mean goes from about ±.2 to about ±.04. That's Corollary at work. And
finally, as n -- 0c, the densities get more Normal. That's the Central Limit Theorem
at work.
   Note that the density of the Yi,'s derived from the U(0, 1) distribution is close
to Normal even for the smallest sample size, while the density of the Yi,'s derived
from the Be(.39, .01) distribution is way off. That's because U(0, 1) is symmetric
and unimodal, and therefore close to Normal to begin with, while Be(.39, .01) is far
from symmetric and unimodal, and therefore far from Normal, to begin with. So
Be(.39, .01) needs a larger n to make the Central Limit Theorem work; i.e., to be a
good approximation.
   Figure    is for the Zi.'s. It's the same as Figure  except that each density
has been recentered and rescaled to have mean 0 and variance 1. When put on the
same scale we can see that all densities are converging to N(0, 1).


Co


C\l
N


0.0


0.2      0.4       0.6      0.8


1.0


                  Figure 8.1: The Be(.39, .01) density


Figure    was produced by
x <- seq ( .01, .99, length=80 )
plot ( x, dbeta(x, .39, .01), type="l", ylab="", xlab="" )


﻿


8. 10. EXERCISES41


410


   Figure  was generated by the following R code.

   samp.size <- c ( 10, 30, 90, 270 )
   n.reps  <- 500
   Y.1 <- matrix ( NA, n.reps, max(samp.size) )
   Y.2 <- matrix ( NA, n.reps, max(samp.size) )
   for ( i in 1:n.reps ) {
   Y.1[i,1 <- runif ( max(samp.size), 0, 1 )
   Y.2[i,1 <- ( rbeta ( max(samp.size), 0.39, .01 ) - .975 )*
                sqrt( .4-2*1.4 / (.39*.01*12 ) ) + .5
  }
  par ( mfrow=c (2, 2) )
  for ( n in 1:length(samp.size) ) {
    Ybar. 1 <- apply ( Y .1 [, 1:samp .size [nll], 1, mean )
    Ybar. 2 <- apply ( Y .2 [, 1:samp .size [n]l], 1, mean )
    sd <- sqrt ( 1 / ( 12 * samp. size [n] ) )
    x <- seq ( .5-3*sd, .5+3*sd, length=60 )
    y <- dnorm ( x, .5, sd )
    denl <- density ( Ybar.1 )
    den2 <- density ( Ybar.2 )
    ymax <- max ( y, denl$y, den2$y )
    plot ( x, y, ylim=c(O,ymax), type="l", lty=3, ylab="",
          xlab="", main=paste ("n =,samp. size En]) )
    lines ( deni, lty=2 )
    lines ( den2, lty=4 )
  }


  * The manipulations in the line Y. 2 [is,] <- ... are so Y. 2 will have mean
    1/2 and variance 1/12.


8.10 Exercises

  1. Let Y>, ... ,Y, be a sample from N( i, or).

     (a) Suppose u is unknown but o- is known. Find a one dimensional sufficient


﻿


8.10. EXERCISES


411


0
C')


CVJ

0
CVJ


0


0


0
CMY)


CVJ

0
CVJ


0


0


      n =10


          I"
          I"
          I"


          .


0.0   0.4    0.8


      n = 90


        b l


0.0   0.4    0.8


0
C')


CVJ

0
CVJ


0


0


0
CMY)


CVJ

0
CVJ


0


0


      n = 30


         l'


         I '

         I "

         I  " I


0.0   0.4  0.8


Figure 8.2: Densities of Y> for the U(0, 1) (dashed), modified Be(.39, .01) (dash

and dot), and Normal (dotted) distributions.


﻿


8.10. EXERCISES


412


n =10


n = 30


O


co
0

P1
6


0
6


        .1" ,
        .\.


   .^  /   k


-3   -1   1    3


0


co
0


              I'
              1I


N\.       '*/ 'I


   -3   -1    1    3


n = 90


n = 270


0


co
0


(0

  6 3    -


0


co
0


CV.      "*i  '.'


0
  O

    -3  -1    1    3


Figure 8.3: Densities of Z~, for the U(0, 1) (dashed), modified Be(.39, .01) (dash

and dot), and Normal (dotted) distributions.


﻿


8.10. EXERCISES                                                             413

          statistic for p.
      (b) Suppose p is known but a is unknown. Find a one dimensional sufficient
          statistic for a.
      (c) Suppose y and a are both unknown. Find a two dimensional sufficient
          statistic for (,u, a).

  2. Let Y1, ... , Y be a sample from Be(a, #3). Find a two dimensional sufficient
     statistic for (a, ,Q).

  3. Let Y1,... Yn,  i.i.d. U(-8, 0). Find a low dimensional sufficient statistic for
     0.


﻿


                        BIBLIOGRAPHY


Consumer Reports, June:366-367, 1986.

D. F. Andrews and A. M. Herzberg. Data. Springer-Verlag, New York, 1985.

H. Bateman. On the probability distribution of a particles. Philosophical Magazine
  Series 6, 20:704-705, 1910.

Richard J. Bolton and David J. Hand. Statistical fraud detection: A review. Statis-
  tical Science, 17:235-255, 1992.

Paul Brodeur. Annals of radiation, the cancer at Slater school. The New Yorker, Dec.
  7, 1992.

Jason C. Buchan, Susan C. Alberts, Joan B. Silk, and Jeanne Altmann. True paternal
  care in a multi-male primate society. Nature, 425:179-181, 2003.

D. P. Byar. The veterans administration study of chemoprophylaxis for recur-
  rent stage I bladder tumors: Comparisons of placebo, pyridoxine, and topical
  thiotepa. In M. Pavone-Macaluso, P. H. Smith, and F. Edsmyn, editors, Bladder
  Tumors and Other Topics in Urological Oncology, pages 363-370. Plenum, New
  York, 1980.

George Casella and Roger L. Berger. Statistical Inference. Duxbury, Pacific Grove,
  second edition, 2002.

Lorraine Denby and Daryl Pregibon. An example of the use of graphics in regres-
  sion. The American Statistician, 41:33-38, 1987.

A. J. Dobson. An Introduction to Statistical Modelling. Chapman and Hall, London,
  1983.


414


﻿


BIBLIOGRAPHY


415


D. Freedman, R. Pisani, and R. Purves. Statistics. W. W. Norton and Company, New
  York, 1998.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data
  Analysis. Chapman and Hall, Boca Raton, 2nd edition, 2004.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
  Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Ma-
  chine Intelligence, 6:721-741, 1984.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their
  applications. Biometrika, 57:97-109, 1970.

Michael Lavine. What is Bayesian statistics and why everything else is wrong. The
  Journal of Undergraduate Mathematics and Its Applications, 20:165-174, 1999.

Michael Lavine, Brian Beckage, and James S. Clark. Statistical modelling of
  seedling mortality. Journal of Agricultural, Biological and Environmental Statis-
  tics, 7:21-41, 2002.

Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New
  York, 2004.

Jean-Michel Marin and Christian P. Robert. Bayesian Core: A Practical Approach to
  Computational Bayesian Statistics. Springer-Verlag, New York, 2007.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
  Equation of state calculations by fast computing machines. Journal of Chemical
  Physics, 21:1087-1092, 1953.

R Development Core Team. R: A Language and Environment for Statistical Com-
  puting. R Foundation for Statistical Computing, Vienna, Austria, 2006. URL
  http: //www. R-pro ject .org. ISBN 3-900051-07-0.

Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer-
  Verlag, New York, 1997.

E. Rutherford and H. Geiger. The probability variations in the distribution of a
  particles. Philosophical Magazine Series 6, 20:698-704, 1910.

Mark J. Schervish. Theory of Statistics. Springer-Verlag, New York, 1995.


﻿


BIBLIOGRAPHY                                                               416

T.S. Tsou and R.M. Royall. Robust likelihoods. Journal of the American Statistical
  Association, 90:316-320, 1995.

Jessica Utts. Replication and meta-analysis in parapsychology. Statistical Science,
  4:363-403, 1991.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New
  York, fourth edition, 2002.

L. J. Wei, D. Y. Lin, and L. Weissfeld. Regression analysis of multivariate incomplete
  failure time data by modeling marginal distributions. Journal of the American
  Statistical Association, 84:1065-1073, 1989.

Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, New York, second
  edition, 1985.


﻿


INDEX


a particle, 296

autocorrelation, 350
autoregression, 355\

bandwidth, 16
bias, 376

case, 216
cdf, see cumulative distribution func-
       tion
Central Limit Theorem,
change of variables, 12.
characteristic functions, 275
Chebychev's Inequality, 7
chi-squared distribution,;32
consistency,
coplots,
correlation, 53:
covariance, 514
covariance matrix, 266
covariate, 216
cross tabulation, 116.
cumulative distribution function, 271
cumulative hazard function,;366

DASL, see Data and Story Library, see
       Data and Story Library,
Data and Story Library,  1,11
density


    probability, 264
density estimation,:105
dependence,
distribution,
Distributions
    Bernoulli, 280
    Beta, 313
    Binomial,1279
    Cauchy,333
    Exponential,20
    Gamma,30
    inverse Gamma, 341
    Multinomial, 290
    Negative binomial,284
    Normal, 22
    Poisson, 17, 292
    standard multivariate Normal,
    standard Normal, 2
    Uniform,30

errors, 219
estimate, 154
expected value, 30
explanatory variable,.216

fitted values, 224, 250
fitting,2
floor, 376
formula, 225


417


﻿


INDEX


418


gamma function, 30
Gaussian density, 31
generalized moment, 39
genotype, 290

half-life, 309
histogram, 104

independence, 54.
   joint, 26
   mutual, 265
indicator function, 56
indicator variable,

Jacobian, 268

Kaplan-Meier estimate, 365

Laplace transform, 273
Law of Large Numbers, 79, 80
likelihood function, 13
likelihood set,
linear model, 219
linear predictor,
location parameter,319
logistic regression, 240
logit, 241_

marginal likelihood,
Markov chain Monte Carlo, 344
maximum likelihood estimate, 154
mean,
median, 95\
Mediterranean tongue,3
mgf, see moment generating function
minimal sufficient,375"
moment,
moment generating function, 273
mosaic plot,'117
multinomial coefficient, 21


multivariate
    change of variables,

order statistic, 96, 37
outer product,

parameter, 1   3   279
parametric family,9
partial autocorrelation, 353
pdf, see probability density, see proba-
       bility density
physics,
Poisson process,  3,12
predicted values, 2
probability
    continuous,_1, 7
    density,
    discrete, 1,6
proportional hazards model,

QQ plots,:111
quantile, 95

R commands
     !59

     Er]], see subscript
     [], see subscript
     #, 4
     %o%, 324
   ~225
   abline,14
   acf, 35
   apply, 63
   ar,35
   array,6
   arrows,.38
   as.factor, 246
   assignment, 4,


﻿


INDEX


419


boxplot, 6-1,

cbind, 76
contour,'147
coplot,\127
cor, 3
cov, 51
cumsum, 37>
data.frame,
dbinom, 1,
density, 12
dexp,22
diff, 3,
dim, 73
dimnames, ;Z
dmultinom,
dnbinom,
dnorm, 23
dpois,.20
expression,
filter, 3
fitted, 250
for, 5
glm, 244
hist, 12.
if,
is.na, 147
legend, 22
length, 74.
lines, 12:
list,
lm, 225
logl0,
lowess, 208
matplot, 22
matrix, 22,
mean,
median,95


mosaicplot,
names, 73
pacf, 353
pairs,
par,
paste, 22
pbinom,2:
plot, 9, 20
plot.ecdf, 9
plot.ts,
pnbinom,
print,
qbinom,
qnbinom,
qqnorm,
quantile, 96
rbinom, 39,
read.table,
rep, 5\,
rmultinom,
rnbinom, 2
rnorm, 26
round, 379
sample, 3
scan, 131
segments,
seq, 2
sqrt, 380
stripchart, 1
subscript, 7
sum, 4, 5
supsmu,20
Surv, 36
survfit,
tapply, 76
text, 38
unique,104
var, 1


﻿


INDEX


420


    while, 59\
R data sets
    airquality,
    attenu, 202
    beaveri,38,3
    co2,6438
    discoveries, 1
    EuStockMarkets,
    faithful,12 , 202
    iris, 51
    ldeaths,
    lynx,
    mtcars,2,229,
    PlantGrowth, 21
    presidents, 38
    Seatbelts, 34
    sunspot.month,
    Tooth Growth,
    UCBAdmissions,
random vector,264
regression function,
regressor, 216
residual, 20
residual plots, 23
response variable, 21


Weak Law of Large Numbers,


sampling distribution,
scale parameter, 319
scatterplot smoothers, 205
standard deviation,
standard error,1
standard Normal distribution,
standard units, 26
stationary distribution, 344
StatLib, 73
Strong Law of Large Numbers,
sufficient statistic, 372


variance,


﻿


                 INDEX OF EXAMPLES


1970 draft lottery, 204

baboons,_187
bladder cancer,35

CEO salary"11
craps, 2, 3.  ,, 1  5,  ,, 49.., 5,  59..   .\, 157

FACE, 4, 144, 23

hot dogs, 105, 210, 216, 225

Ice Cream Consumption, 220

neurobiology, 12, 2

O-rings, 236
ocean temperatures, 29, 316

quiz scores, 109, 148

Rutherford and Geiger, 29

seedlings, 18,:,495     36, 1483,

Slater school, 136 ,13


421