Lecture Notes

Session #1

Today we distinguished between the ways in which a probabilist and a statistician view a scenario involving the modeling of a political opinion poll via a binomial distribution. Then we distinguished between Bayesian and frequentist interpretations of probability. We applied the Bayesian view to the opinion poll by asking what probability distribution to assign to the uncertain-but-not-random proportion of voters who will vote "yes".

On Friday we will explicitly assign a beta distribution to that uncertain quantity. We will treat the definition and properties of the beta distribution, which is covered in section 5.10 of the textbook, and we will examine reasons for applying it to this problem.

Session #2

Some of you have seen the gamma function, the gamma distribution, and the beta distribution in previous courses such as 18.440. But some have not, and accordingly today we looked at the definition of the gamma function, the identity Gamma(alpha + 1) = alpha * Gamma(alpha), and the definition and some basic properties of the beta distribution. Finally we considered attributing the beta distribution to the proportion of voters who will vote "yes", in the political opinion polling scenario that we began to consider on Wednesday. We will continue with that problem later, and we will begin to treat the bivariate normal distribution. The family of bivariate normal distributions is indexed by *five* parameters: Two expectations, two variances, and a covariance.
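For those who have not seen these objects before, here is a minimal Python sketch (not something we did in class) that checks the gamma identity numerically and confirms that the beta density integrates to 1 with mean a/(a + b). The helper names `beta_pdf` and `midpoint_integral`, and the particular values of alpha, a, and b, are assumptions chosen only for illustration.

```python
import math

def beta_pdf(p, a, b):
    """Density of the beta distribution with parameters a, b > 0 at 0 < p < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def midpoint_integral(f, lo, hi, steps=100_000):
    """Crude midpoint-rule integral, good enough for a sanity check."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

# Gamma(alpha + 1) = alpha * Gamma(alpha), checked at an arbitrary alpha.
alpha = 3.7
lhs = math.gamma(alpha + 1)
rhs = alpha * math.gamma(alpha)
print(lhs, rhs)

# The beta density should integrate to 1 and have mean a / (a + b).
a, b = 2.0, 5.0  # hypothetical parameter values
total = midpoint_integral(lambda p: beta_pdf(p, a, b), 0.0, 1.0)
mean = midpoint_integral(lambda p: p * beta_pdf(p, a, b), 0.0, 1.0)
print(total, mean, a / (a + b))
```

The normalizing constant Gamma(a + b) / (Gamma(a) * Gamma(b)) is where the gamma function enters the beta distribution.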

Session #3

We considered the use of a beta distribution as the *prior* distribution of a proportion of a population. In the course of this we stated and used

  1. the Law of Total Probability, and
  2. Bayes' formula, which says:

posterior probability density
= constant * prior probability density * likelihood function.
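As a rough illustration of Bayes' formula in the polling problem, the following Python sketch multiplies a beta prior by a binomial likelihood on a grid, normalizes, and compares the result with the Beta(a + s, b + n - s) density. The prior parameters and the poll counts are made up, and the function names are hypothetical; the binomial coefficient is left out of the likelihood because it is absorbed into the constant.

```python
import math

def beta_pdf(p, a, b):
    """Density of the beta distribution with parameters a, b > 0."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def grid_posterior(a, b, n, s, steps=20_000):
    """Posterior density on a midpoint grid: constant * prior * likelihood."""
    h = 1.0 / steps
    grid = [(i + 0.5) * h for i in range(steps)]
    unnorm = [beta_pdf(p, a, b) * p ** s * (1 - p) ** (n - s) for p in grid]
    # The constant is 1 / (integral of prior * likelihood over p), which is
    # the Law of Total Probability applied to the probability of the data.
    const = 1.0 / (h * sum(unnorm))
    return grid, [const * u for u in unnorm]

a, b = 2.0, 3.0   # hypothetical prior parameters
n, s = 20, 13     # hypothetical poll: 13 "yes" answers out of 20
grid, post = grid_posterior(a, b, n, s)
closed_form = [beta_pdf(p, a + s, b + n - s) for p in grid]
max_gap = max(abs(u - v) for u, v in zip(post, closed_form))
print(max_gap)
```

If the two densities agree up to numerical error, that is consistent with the posterior again being a beta distribution.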

The Law of Total Probability can be used in doing #4 on the first problem set.

Session #4

Today we dealt with the idea that the variance of a random vector is a square matrix. We used that to see that the bivariate normal density naturally generalizes the univariate normal density. We looked at the formula for the *conditional* expectation of one component of a bivariate-normally distributed random vector, *given* the value of the other component. We touched on the conditional variance. (Although we didn't have time to do it, *one* way of proving the identity for the conditional expectation is to apply Bayes' formula: Multiply the prior density by the likelihood function and then normalize. Possibly we will look at that on Friday.)

Session #5

Today we treated "conjugate priors"; we saw that the family of beta distributions is conjugate to the family of binomial distributions and the family of gamma distributions is conjugate to the family of Poisson distributions. Next time we will see what is conjugate to the family of normal distributions.
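The gamma-Poisson case can be checked numerically in the same spirit. This is a sketch under assumptions: the gamma distribution is parameterized by shape and rate, the data are invented, and the function names are hypothetical. If the posterior really is Gamma(shape + sum of data, rate + n), then posterior / (prior * likelihood) must be the same constant at every value of the Poisson mean.

```python
import math

def gamma_pdf(t, shape, rate):
    """Gamma density with the given shape and rate, for t > 0."""
    return rate ** shape / math.gamma(shape) * t ** (shape - 1) * math.exp(-rate * t)

def poisson_likelihood(t, data):
    """Likelihood of i.i.d. Poisson(t) observations."""
    return math.prod(math.exp(-t) * t ** x / math.factorial(x) for x in data)

def gamma_poisson_update(shape, rate, data):
    """Conjugate update: add the total count to the shape, the sample size to the rate."""
    return shape + sum(data), rate + len(data)

shape, rate = 2.0, 1.0   # hypothetical prior
data = [3, 0, 2, 4]      # hypothetical Poisson counts
post_shape, post_rate = gamma_poisson_update(shape, rate, data)

# Ratio of the claimed posterior to prior * likelihood at several points.
ratios = [
    gamma_pdf(t, post_shape, post_rate)
    / (gamma_pdf(t, shape, rate) * poisson_likelihood(t, data))
    for t in (0.5, 1.0, 2.0, 3.5)
]
print(post_shape, post_rate, ratios)
```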

Session #6

After treating the Bayesian Central Limit Theorem, we considered this scenario:

M ~ N(80, 10^2)  (the prior)
X|[M=m] ~ N(m, 1^2)  (the distribution of the "data")

We sought the conditional distribution of M given X = x:

M|[X=x] ~ ?

We found M|[X=x] ~ N(?, ?), where the two question marks are the posterior expectation E(M | X = x) and the posterior variance var(M | X = x). We concluded:

  • the normal distributions are a family of conjugate priors;
  • the posterior expected value is a weighted average of the prior expected value and the "data";
  • the weights are inversely proportional to the variances 10^2 and 1^2;
  • the posterior variance is much smaller than the prior variance; if this had not been the case, common sense suggests we should have been surprised: The more data, the more information; the more information, the smaller the variance.
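The conclusions above can be sketched in a few lines of Python. The function name `normal_posterior` and the observation x = 90 are assumptions made for illustration; the prior N(80, 10^2) and the data variance 1^2 are the ones from the scenario.

```python
def normal_posterior(prior_mean, prior_var, data_var, x):
    """Posterior mean and variance of M given X = x in the normal-normal model."""
    w_prior = 1.0 / prior_var   # weight on the prior expected value
    w_data = 1.0 / data_var     # weight on the observed data
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * x)
    return post_mean, post_var

# Prior N(80, 10^2), data variance 1^2, hypothetical observation x = 90.
post_mean, post_var = normal_posterior(80.0, 10.0 ** 2, 1.0 ** 2, x=90.0)
print(post_mean, post_var)
```

Here the posterior variance is 100/101, a little under 1, and the posterior mean is (80 + 100 * 90)/101, which sits very close to the observation because the data carry 100 times the weight of the prior.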

Session #7

Today we treated a problem done on "Monday" by a different method. We relied on the fact that if the conditional distribution of Y given X does not depend on X, then that conditional distribution is the same as the "unconditional" (or "marginal") distribution of Y, and X and Y are independent. We also used the identity

var(AY) = A ( var(Y) ) A'

when Y is a random vector and A is a matrix. And we used the formula that we saw earlier for E(X|Y) when (X,Y)' has a bivariate normal distribution.

Then we changed subjects drastically and dealt with conditional independence.

Finally -- another change of subject -- we began to deal with frequentist estimation by the method of maximum likelihood. We distinguished between "estimates" and "estimators"; to the latter we can attribute probability distributions, and along with those, expectations, variances, etc.

Session #8

We began with a fairly typical maximum-likelihood problem, and found that the maximum-likelihood estimator in that case is unbiased (and in passing, we defined the concepts of "biased" and "unbiased"). Then we looked at a slightly less typical maximum-likelihood problem and saw that (1) the maximum did not occur at a point where the derivative is zero, (2) we needed to pay more attention to the boundaries between the pieces in a piecewise definition than to the derivative of the likelihood function (actually, we didn't use the derivative at all), and (3) in this case the maximum-likelihood estimator was biased. Finally, we looked at a maximum-likelihood problem that had some perhaps unexpected features: The likelihood function was a function of two variables, but when we found the maximizing value of the first one, with the second one held fixed, it turned out not to depend on the value of the second one, and that saved us some work; we used a "trick" that amounted to decomposing a vector into two parts orthogonal to each other; we located the dependence of the likelihood function upon one of its arguments in one term that turned out to be easy to deal with, once isolated. To be continued...

Session #9

We finished the maximum likelihood problem that we began last time. We showed that the maximum likelihood estimate of the variance of a normally distributed population, based on a random sample, is biased. "Biased" sounds bad, but we will see that that is to some extent misleading.
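A small simulation makes the bias visible. This is only a sketch: the sample size, the population parameters, the number of repetitions, and the seed are all arbitrary choices, and `mle_variance` is a hypothetical helper name. The average of the maximum likelihood estimates should come out near ((n - 1)/n) * sigma^2 rather than sigma^2.

```python
import random

def mle_variance(sample):
    """Maximum likelihood estimate of the variance: divide by n, not n - 1."""
    n = len(sample)
    xbar = sum(sample) / n
    return sum((x - xbar) ** 2 for x in sample) / n

random.seed(0)  # fixed seed so the run is repeatable
n, mu, sigma, reps = 5, 10.0, 2.0, 40_000
average_mle = sum(
    mle_variance([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
) / reps

# Compare the simulated average with ((n - 1) / n) * sigma^2 and with sigma^2.
print(average_mle, (n - 1) / n * sigma ** 2, sigma ** 2)
```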

On Wednesday we will begin to consider the concept of a sufficient statistic.

Session #10

We dealt very tersely with the fact that if h is a known function then the MLE of h(theta) is h(MLE of theta). This is useful for some parts of the third problem set. We may deal with this topic at greater length later.

We defined and tried to motivate the concept of sufficiency. We looked at some concrete examples. We derived Fisher's factorization of the probability mass function of a discrete random variable in the form f_theta (x) = g_theta (T(x)) * h(x). The first factor depends on the data only through the sufficient statistic T(x); the second factor does not depend on theta. (WARNING: the "h" in this paragraph and that in the previous paragraph are two different things!)

Session #11

We did several concrete examples of Fisher factorizations, whose general form is f_theta (x) = g_theta (T(x)) * h(x).

Then we defined the Rao-Blackwell procedure.

Forewarned is forearmed! Here is something that will be on the test: Explain why the Rao-Blackwell estimator E(delta(X)|T(X)) does not depend on theta, even though the distribution of delta(X) does depend on theta (as it must, if it is to make sense as an estimator of theta).

Session #12

Today we examined several concrete instances of Rao-Blackwell estimators. On Wednesday we will deal with the concept of completeness. (That is when we will need two-sided Laplace transforms at one point; we haven't reached those yet.)
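One simple instance, which may or may not be among those treated in class, takes X_1, ..., X_n i.i.d. Bernoulli(theta), the crude estimator delta(X) = X_1, and the sufficient statistic T = X_1 + ... + X_n. The Python sketch below computes E(X_1 | T = t) by brute-force enumeration for several values of theta; the function name and the values of n and t are assumptions for illustration.

```python
from itertools import product

def conditional_mean_first(n, t, theta):
    """E(X_1 | X_1 + ... + X_n = t) for i.i.d. Bernoulli(theta), by enumeration."""
    num = 0.0
    den = 0.0
    for xs in product((0, 1), repeat=n):
        if sum(xs) != t:
            continue
        # Every outcome with the same total t has the same probability.
        prob = theta ** t * (1 - theta) ** (n - t)
        num += xs[0] * prob
        den += prob
    return num / den

n, t = 6, 2  # hypothetical sample size and observed total
values = [conditional_mean_first(n, t, theta) for theta in (0.1, 0.5, 0.9)]
print(values, t / n)
```

The same value t/n appears for every theta, because the factor involving theta is common to the numerator and the denominator.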

Session #13

Today we dealt with the Rao-Blackwell theorem, which identifies one respect in which Rao-Blackwell estimators are "better" than the "crude" estimators that go into them. Then we examined "completeness" and the Lehmann-Scheffe theorem, which entails, among other things, that a statistic that is complete, sufficient, and unbiased, is the unique "best" unbiased estimator. We looked at several examples of statistics that are not complete, and one that is complete. Another example of one that is complete is treated in #4(e) on the fourth problem set.

The handout had "sin(pi W_1)" where "sin(2 pi W_1)" was needed. That has been corrected in the version of the handout that is on the web, which is otherwise identical to what you got in class.

(Foreshadowing things to come: It is possible that we will see something similar to completeness when we treat linear regression. If so, we will see a lemma that can be stated by saying "least-squares estimators are uncorrelated with every _linear_ unbiased estimator of zero," and we will use that to prove the "Gauss-Markov theorem." That result can be stated by saying "least-squares estimators are best _linear_ unbiased estimators.")

Session #14

We showed that if Z ~ N(0,1) then Z^2 has a certain gamma distribution, and inferred that the sum of squares of n independent copies of Z also has a gamma distribution. Thus a chi-square distribution is a gamma distribution. (DeGroot & Schervish actually *define* a chi-square distribution to be a certain gamma distribution, and show that the aforementioned sum of squares has that distribution.) As a by-product we showed that the value of the gamma function at 1/2 is the square root of pi.
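Both facts can be checked numerically. In the sketch below, `z_squared_pdf` is the density of Z^2 obtained from the change of variables y = z^2, and it is compared with the gamma density with shape 1/2 and rate 1/2. The function names and the test points are assumptions for illustration.

```python
import math

def gamma_pdf(y, shape, rate):
    """Gamma density with the given shape and rate, for y > 0."""
    return rate ** shape / math.gamma(shape) * y ** (shape - 1) * math.exp(-rate * y)

def z_squared_pdf(y):
    """Density of Z^2 for Z ~ N(0,1), from the change of variables y = z^2."""
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y)

# The value of the gamma function at 1/2 versus the square root of pi.
gamma_half = math.gamma(0.5)

# Largest disagreement between the two densities at a few points.
gaps = [abs(z_squared_pdf(y) - gamma_pdf(y, 0.5, 0.5)) for y in (0.1, 1.0, 4.0, 9.0)]
print(gamma_half, math.sqrt(math.pi), max(gaps))
```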

We considered the distribution of the sum, as i goes from 1 to n, of (Z_i - Zbar)^2, where Zbar = the sample mean (Z_1 + ... + Z_n)/n. Earlier we found that the *expectation* of this sum is n - 1. Since the things being squared, Z_i - Zbar, are *not* independent, and they are *not* standard normals (since they have a slightly smaller variance), and there are *not* n - 1 of them, but rather n of them, it is not obvious that this sum should have a chi-square distribution with n - 1 degrees of freedom.

We will use geometry to show that, and also to show that Zbar is independent of (Z_1 - Zbar, ..., Z_n - Zbar).

Coming on Monday: Orthogonal matrices, idempotent matrices, geometry, etc.

Session #15

Let A' denote the transpose of the matrix A. (In "MATLAB®" notation, A' is the transpose of A if the entries in A are _real_, and if the entries are _complex_, then A' is the result of both transposing and conjugating.) Let mu = E(X), where X is a random vector with n scalar components ("scalar" in this context will mean _real_ number; in some other contexts it means _complex_ number, and in yet other contexts it can mean other things).

Today we dealt with some consequences of the definition var(X) = E((X - mu)(X - mu)'). We saw that var(X) is an n x n nonnegative-definite matrix. We went through a quick derivation of the identity var(AX) = A ( var(X) ) A' if A is a constant (i.e., non-random) matrix. We recalled some basic facts about orthogonal matrices. We saw that if var(X) = I = the n x n identity matrix, and G is an n x n orthogonal matrix, then var(GX) = I. We found an n x n matrix Q such that if x is any n x 1 column vector, then Qx is the vector whose ith component, for i = 1, ..., n, is the ith component of x minus the average of all components of x. We saw that Qx = x if x is in a certain (n - 1)-dimensional space, and Qx = 0 if x is in the 1-dimensional orthogonal complement of that (n - 1)-dimensional space. Finally, we considered how to construct an orthogonal matrix G such that GQG' = the n x n matrix with 0s off the main diagonal and with (0, 1, 1, 1, 1, ..., 1) on the main diagonal. (The first entry is 0; all the rest are 1.)

(The "spectral theorem" of linear algebra can be stated by saying that every real symmetric matrix can be "diagonalized by an orthogonal matrix". What we did at the end of the hour today is one special case of that.)
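Here is a pure-Python sketch of the matrix Q and of one possible orthogonal G. The choice of G is an assumption: the code uses a Helmert-type matrix, whose first row is (1, ..., 1)/sqrt(n), which is not necessarily the construction used in class. The helper names and the example vector are also made up.

```python
import math

def centering_matrix(n):
    """Q = I - (1/n) * (all-ones matrix); Qx subtracts the average of x from each entry."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def helmert_matrix(n):
    """An orthogonal matrix (assumed choice) whose first row is (1, ..., 1) / sqrt(n)."""
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(1, n):
        scale = 1.0 / math.sqrt(k * (k + 1))
        # k equal entries, then one entry that makes the row sum to zero, then zeros.
        rows.append([scale] * k + [-k * scale] + [0.0] * (n - k - 1))
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

n = 4
Q = centering_matrix(n)
G = helmert_matrix(n)
GGt = matmul(G, transpose(G))            # should be the identity matrix
D = matmul(matmul(G, Q), transpose(G))   # should be diag(0, 1, 1, 1)
centered = matmul(Q, [[2.0], [4.0], [9.0], [1.0]])  # x minus its average, 4
for row in D:
    print([round(v, 12) for v in row])
```

The first row of G spans the 1-dimensional space that Q sends to 0, and the remaining rows span the (n - 1)-dimensional space that Q leaves fixed, which is why GQG' is diagonal.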

On Wednesday we will quickly recapitulate all of this and then see what this tells us about chi-square distributions and about independence of the sample average and the sample variance.

Session #16

As the syllabus states, information on the web site is required reading. The fifth problem set will be on a problem similar to what we did today; you should rely on what you learn from the pdf file when you do those problems.

"Attachment" to the summary of Session #16 (PDF).

Session #17

Today we distinguished between tuples of random variables that are separately but not jointly normal (an example was exhibited) and those that are jointly normal. I made several assertions that I did not prove:

  1. If each of X and Y is multivariate normal and they have the same expectation and the same variance (the variance being a square matrix) then X and Y have the same distribution (and therefore it makes sense to write X ~ N(mu, V) ).
  2. Random vectors that are jointly normally distributed and uncorrelated are independent.
  3. If X has an n-dimensional normal distribution and A is an m x n matrix, then AX has an m-dimensional normal distribution.
  4. ( cov(X,Y) )' = cov(Y,X). (They are each other's transposes.)
  5. cov(AX,BY) = A( cov(X,Y) )B'.

(3), (4), and (5) are easy to show; (1) and (2) are more difficult.

We used all of this to show that the sample variance is independent of the sample mean, in an i.i.d. sample from a normally distributed population. Finally, we began to examine Student's distribution, also called the t-distribution. We need all of the foregoing in order to think about the t-distribution.

Wednesday's test will cover the material on which the first four problem sets rely (exclusive of #7 on the fourth problem set). Be sure to bring in any questions on that material on Monday.

Session #18

Today we applied Student's distribution, also called the t-distribution, to confidence intervals. We defined the concept of a "pivotal quantity." We looked at the dubiousness of at least one commonplace interpretation of confidence intervals.

Session #19

Those who were not in class today should note the presence of the 6th problem set on the web. This is largely a follow-up to things we did before today. Today we looked at the application of the Central Limit Theorem to confidence intervals for parameters indexing binomial and Poisson distributions. In the course of this we quickly reviewed the "Poisson Limit Theorem" -- the result that says Bin(n, c/n) ----> Poisson(c) as n ----> infinity.
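The Poisson Limit Theorem is easy to watch numerically. The sketch below compares the Bin(n, c/n) and Poisson(c) probability mass functions for increasing n; the value c = 3, the range of k, and the function names are arbitrary choices for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Pr(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, c):
    """Pr(X = k) for X ~ Poisson(c)."""
    return math.exp(-c) * c ** k / math.factorial(k)

def max_gap(n, c, kmax=10):
    """Largest pointwise difference between the two pmfs for k = 0, ..., kmax."""
    return max(abs(binomial_pmf(k, n, c / n) - poisson_pmf(k, c))
               for k in range(kmax + 1))

c = 3.0  # hypothetical value of c
gaps = {n: max_gap(n, c) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(n, gap)
```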

Session #20

In class today we started the problem that appears on the 7th problem set as #3. As I said then, you will supply the details. Then we began to treat hypothesis testing. Finally, we looked at the definition of the "likelihood ratio" --- that will appear on the 8th problem set.

Session #21

Today we showed that the "t-test" is a likelihood ratio test.

Session #22

We defined the ideas of "level", "power function", and "p-value", and we looked at examples.

Session #23

Part (c) of #3 on the 7th problem set is perhaps more involved than what I had in mind originally, since I was thinking of some typical data sets rather than more generally. Therefore I will tell the grader to be fairly generous and give credit for answers that show an understanding of the problem without completely solving it.

Today's summary: We considered a problem involving three symmetric idempotent matrices P, H-P, and I-H. The product of any two of them is 0, and their sum is I. When "X" denotes a certain random vector of interest, (I-H)X has expectation 0, (H-P)X has expectation 0 if and only if the null hypothesis that we considered is true, and these two random vectors are independent. In the last few seconds we looked at a function of these two statistics that has an F distribution if the null hypothesis is true, without *yet* having said very much about the F distribution.

Session #24

The 8th problem set is now on the web!

  • It _looks_ long. To _some_ extent (I hope) that's because it's broken into little bite-sized pieces. But still, it seems long enough to allow somewhat more than a week for it (i.e., until session #27).
  • # 1 - 7 are "theoretical."
  • # 8 - 11 are "concrete" applications of what was done in # 1 - 7.
  • # 12 is "theoretical" and amounts to a proof of the celebrated "Neyman-Pearson Lemma" on hypothesis testing by likelihood-ratio tests.
  • Don't wait until the last day to ask questions about it. As some of you may have observed, I can be fairly generous in answering those questions if you understand where your difficulty arises.

TODAY'S SUMMARY: We continued with the problem we began last time. Ultimately this leads us to some conclusions about the probability distribution of the test statistic in a typical analysis-of-variance problem.

Session #25

We finished deriving the F statistic and its probability distribution (assuming our null hypothesis is true) in the problem we were looking at last time. Along the way we dealt with properties of symmetric idempotent matrices and their geometry.

In case you missed today's brief handout I've decided to make it another "attachment."

"Attachment" to the summary of Session #25 (PDF).

Session #26

In #5 on the 8th problem set, "a sample of size n" should be construed to mean n i.i.d. observations.

Today we constructed an "ANOVA table" and saw that the sums of squares entered in it constitute a partition of the total corrected sum of squares. The total corrected sum of squares measures the variability in the data; the separate rows in the ANOVA table correspond to separate "sources" of variability.

We began to consider a chi-square test of a null hypothesis that a die is "fair." We looked at a rather hand-waving explanation of why the test statistic has approximately a chi-square distribution. We would like this to follow from the fact that a certain random vector has approximately a normal distribution. That will follow from a Central Limit Theorem. Details will follow on Friday.
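For concreteness, here is a sketch of the usual test statistic, the sum over the six faces of (observed - expected)^2 / expected. The counts are invented and the function name is hypothetical.

```python
def chi_square_statistic(observed, expected_probs):
    """Sum over categories of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))

counts = [18, 23, 16, 21, 18, 24]   # hypothetical results of 120 rolls
stat = chi_square_statistic(counts, [1 / 6] * 6)
print(stat)
```

With these counts the statistic is 2.5. Under the null hypothesis it is compared with a chi-square distribution with 6 - 1 = 5 degrees of freedom, whose upper 5% point is about 11.07, so these made-up data would give no reason to doubt that the die is fair.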

Session #27

We dealt with square roots of certain matrices, generalized inverses of those square roots, the multivariate Central Limit Theorem, and how all of that leads us to the probability distribution of the test statistic that we considered at the end of the hour on Wednesday. All of that is in the handout.

Then we briefly considered the concept of "monotone likelihood ratio."

Session #28

On the 9th problem set, the first problem is "theoretical" and applies the Neyman-Pearson Lemma. The next three involve some actual number-crunching. The fifth should be a routine bit of algebra.

Today we treated a chi-square test of independence in categorical data. In the last few minutes we introduced the logit function defined by logit(p) = log{ p/(1 - p) }.

Session #29

We further examined the function

logit(p) = log( p/[1 - p] ).

We saw that a simple special case of Bayes' formula can be stated thus:

logit( posterior probability )
= logit( prior probability ) + log( likelihood ratio ).

The second term depends only on the data and not on the prior.
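A quick numerical check of this identity, for a hypothesis H and its complement, is sketched below. The prior probability and the two likelihoods are made-up numbers, and the variable names are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inverse_logit(t):
    return 1 / (1 + math.exp(-t))

prior = 0.2            # hypothetical Pr(H)
like_h = 0.9           # hypothetical Pr(data | H)
like_not_h = 0.3       # hypothetical Pr(data | not H)

# Bayes' formula in its usual form.
posterior_direct = prior * like_h / (prior * like_h + (1 - prior) * like_not_h)

# The same posterior from the additive logit form.
posterior_logit = inverse_logit(logit(prior) + math.log(like_h / like_not_h))
print(posterior_direct, posterior_logit)
```

Both routes give 0.18/0.42 = 3/7.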

Session #30

We did some algebra related to the logistic regression model

logit(Pr(Y_i = 1)) = alpha_0 + alpha_1 * z_i.

In particular, we found that the pair (SUM_i Y_i, SUM_i Y_i * z_i) is a sufficient statistic for this family of distributions.

Session #31

On Wednesday we considered some issues related to logistic regression, including numerical algorithms, although we didn't finish that topic. We looked for the first time at estimation by the method of moments (another probable topic for the 11th problem set).

Session #32

We reviewed the gamma distribution and did a concrete problem on estimation of its two parameters by the method of moments. We looked somewhat briefly at the Newton-Raphson iterative method for maximizing a function of several variables. That method can be used to maximize a likelihood function, using the method-of-moments estimates as the first approximation to the maximum likelihood estimates.
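The method-of-moments step can be sketched as follows, assuming the shape-and-rate parameterization (mean = shape/rate, variance = shape/rate^2) and matching the first two sample moments. The data and the function name are invented for illustration; the output would then serve as the starting point for Newton-Raphson.

```python
def gamma_method_of_moments(sample):
    """Match the sample mean and (divide-by-n) sample variance to shape/rate and shape/rate^2."""
    n = len(sample)
    m1 = sum(sample) / n                   # first sample moment
    m2 = sum(x * x for x in sample) / n    # second sample moment
    var = m2 - m1 ** 2
    shape = m1 ** 2 / var
    rate = m1 / var
    return shape, rate

data = [1.2, 0.7, 3.1, 2.4, 0.9, 1.8, 2.2, 0.5]   # hypothetical observations
shape, rate = gamma_method_of_moments(data)
sample_mean = sum(data) / len(data)
print(shape, rate, shape / rate, sample_mean)
```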

Session #33

Today we considered an example of _empirical_Bayes_methods_. I have put a summary in pdf format and I am calling it an "attachment." The 11th problem set may refer to that "attachment."

"Attachment" to the summary of Session #33 (PDF).

Session #34

Our main mathematical result today was that if Theta is a continuous random variable that is always positive and  X | Theta ~ Poisson(Theta) then E(Theta | X = x) = (x + 1)Pr(X = x + 1)/Pr(X = x). Bayes' formula is involved because we started out knowing the conditional distribution of X given Theta and we ended up with a conclusion about the distribution of Theta given X.

If Theta is a randomly chosen insurance customer's accident rate and X is that customer's number of accidents during a fixed time period, then E(Theta | X = x) = (x + 1)Pr(X = x + 1)/Pr(X = x) could be used to estimate Theta after observing X, provided the distribution of X is known. The relevant ratio of probabilities can be estimated based on a large sample, so that the estimate of E(Theta | X = x) is

(x + 1) * (number of customers who suffered x + 1 accidents)
/ (number of customers who suffered x accidents).

This last step is why the word "empirical" is used when this is called an empirical Bayes method. It is not a _Bayesian_ method because the "prior" distribution of Theta is a frequency distribution rather than a degree-of-belief distribution.
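In code, the estimate is just a ratio of counts. The sketch below uses an invented portfolio of 100 customers, and `robbins_estimate` is a hypothetical name for the function.

```python
from collections import Counter

def robbins_estimate(accident_counts, x):
    """(x + 1) * (# customers with x + 1 accidents) / (# customers with x accidents)."""
    freq = Counter(accident_counts)
    if freq[x] == 0:
        raise ValueError("no customers with exactly x accidents")
    return (x + 1) * freq[x + 1] / freq[x]

# Hypothetical data: 70 customers with 0 accidents, 20 with 1, 8 with 2, 2 with 3.
counts = [0] * 70 + [1] * 20 + [2] * 8 + [3] * 2
estimates = {x: robbins_estimate(counts, x) for x in (0, 1, 2)}
print(estimates)
```

For example, a customer with one accident gets the estimate 2 * 8 / 20 = 0.8. With counts this small the ratios are noisy, which is why a large sample is needed.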

Session #35

Today we looked at the Kolmogorov-Smirnov test, which relies on the "maximum-discrepancy statistic." It may be surprising initially that the probability distribution of the test statistic, given that the null hypothesis is true, does not depend on which continuous distribution the null hypothesis specifies.
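The following sketch computes the maximum discrepancy between the empirical c.d.f. and a hypothesized continuous c.d.f., and hints at why its distribution is the same for every continuous null hypothesis: transforming a uniform sample into an exponential one leaves the statistic unchanged. The function name, the seed, and the sample size are assumptions for illustration.

```python
import math
import random

def ks_statistic(sample, cdf):
    """Maximum discrepancy between the empirical c.d.f. and the hypothesized c.d.f."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical c.d.f. jumps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

random.seed(1)
uniform_sample = [random.random() for _ in range(50)]
# Transform to Exponential(1) with the inverse c.d.f.; the ordering is preserved.
exponential_sample = [-math.log(1 - u) for u in uniform_sample]

d_uniform = ks_statistic(uniform_sample, lambda x: x)
d_exponential = ks_statistic(exponential_sample, lambda x: 1 - math.exp(-x))
d_small = ks_statistic([0.1, 0.4, 0.7], lambda x: x)
print(d_uniform, d_exponential, d_small)
```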

Session #36

On Friday I will attempt to induce some volunteers (or maybe "volunteers") to present solutions to the 11th problem set. I may also say a bit about what the final will look like --- that may consist mainly of saying what will *not* be on the final, since I will probably not have entirely decided what will appear.

Session #37

Today we saw some solutions to the 11th problem set. The final exam will cover material relied on by the problem sets, and may put slightly more emphasis on the 8th-through-11th ones than on the earlier ones. You may use both sides of an 8 1/2-by-11-inch sheet of paper for notes during the test, and you should bring a calculator. You will *not* need to bring any tables.

MATLAB® is a trademark of The MathWorks, Inc.