Neither great nor bad writing (I wish it had main takeaways at the end of each chapter), but a solid overview and progression from probability to statistics foundations (a great framework for going over the subject matter). Very dense writing with proofs/theory, so I had to skim some whole sections. Granted, it was written more for graduate statistics coursework than for general reading.
Notes/quotes
- 'Possible methods of counting: with or without replacement, ordered vs unordered.' Unordered without replacement is so common it has its own notation: n choose r = n! / (r!(n-r)!)
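A quick sanity check of the formula in Python (my sketch, not from the book); math.comb is the stdlib equivalent:

```python
import math

def n_choose_r(n, r):
    # n! / (r! * (n - r)!)
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

assert n_choose_r(50, 3) == math.comb(50, 3)  # agrees with the built-in
print(n_choose_r(50, 3))  # 19600 ways to pick 3 of 50, unordered, no replacement
```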
- "probabilities should be determined by the sampling mechanism."
- "In many experiments it is easier to deal with a summary variable than with the original probability structure. For example, in an opinion poll, we might decide to ask 50 people whether they agree or disagree with a certain issue. If we record a "1" for agree and "0" for disagree, the sample space for this experiment has 2^50 elements, each an ordered string of 1s and 0s of length 50. We should be able to reduce this to a reasonable size! It may be that the only quantity of interest is the number of people who agree out of 50 and, if we define a variable X = number or 1s recorded out of 50, we have captured the essence of the problem. Not that the sample space for X is the set of integers {0,1,2,...,50} and is much easier to deal with than the original sample space." ... "A random variable is a function from a sample space S into the real numbers." More examples: 'experiment: toss two dice, random variable X: sum of the numbers. Experiment: toss a coin 25 times, random variable X: number of heads in 25 tosses.' "In defining a random variable, we have also defined a new sample space."
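A toy version of the poll example (my sketch, shrunk from 50 people to 3 so the sample space is printable):

```python
import itertools

S = list(itertools.product([0, 1], repeat=3))  # sample space: 2^3 ordered strings

def X(outcome):
    return sum(outcome)  # the random variable: number of 1s (agrees)

print(len(S))                     # 8 outcomes in the original sample space
print(sorted({X(s) for s in S}))  # [0, 1, 2, 3]: the new, smaller sample space
```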
- "Statistical distributions are used to model populations; as such we usually deal with a family of distributions rather than a single distribution. The family is indexed by one or more parameters, which allow us to vary certain characteristics of the distribution wrote staying with one functional form. For example, we may specify that the normal distribution is a reasonable choice to model a particular population, but we cannot precisely specify the mean. Then we deal with a parametric family, a normal distribution with mean mu."
- "Theorem... If X and Y are independent random variables, then Cov(X,Y) = 0 and [correlation of x&y] = 0. Proof: Since X and Y are independent from [an earlier Theorem]... we have EXY= (EX)(EY). Thus Cov(X,Y) = EXY - (EX)(EY) ... = 0. And corr xy = Cov(X,Y)/sigmax sigmay =...0." ... "If X and Y are positively correlated (Cov(X,Y)>0), then the variation in X + Y is great than the sum of the variation in X and Y."
- "A linear combination of independent normal random variables is normally distributed."
- "[CLT:] Starting from virtually no assumptions (other than independence and finite variances, we end up with normality! The point here is that normality comes from sums of 'small (finite variance), independent disturbances. The assumption of finite variances is essentially necessary for convergence to normality."
- "A sufficient statistic for a parameter theta is a statistic that, in a certain sense, captures all the information about theta contained in the sample. Any additional information in the sample, besides the value of the sufficient statistic, does not contain any more information about theta.... [sufficiency principle in data reduction] that is, if x and y are two sample points such that T(x) = T(y), then the inference about theta should be the same where X=x or X=y is observed." ... "Thus, the likelihood principle [in data reduction] states that the same conclusion about mu should be drawn for any two sample points satisfying xbar = ybar." Ie if the sample means are the same, the mean is probably the same expectation (simplified). "Equivalence principles: if Y=g(X) is a change of measurement scale such that the model for Y has the same formal structure as the model for X, then an inference procedure should be both measurement evquivariant and formally equivarient." Ie some data in inches, some in meters, you can convert units (transform the data) and use the same process. "the Equivariance Principle is composed of two distinct types of equivariance. One type, measurement equivalence, is intuitively reasonable.... but the other principle, formal invariable, is quite different. It equates any two problems with the same mathematical structure, regardless of the physical reality they are trying tovexplain. It stays that one inference procedure is appropriate even if the physical realities are quite different, aj assumption that is sometimes difficult to justify."... "All three principles [sufficiency, likelihood, equivariance] prescribe relationships between inferences at different sample points, restricting the set of allowable inferences and, in this way, simplifying the analysis of the problem."
- "An estimator is a function of the sample, while an estimate is the realized value of the estimator (that is, a number) that we obtained when a sample is actually taken." Estimators: "method of moments is, perhaps, the oldest method of finding point estimatord" - eg taking the mean, or calculating the variance. Problem is that sometimes its negative, eg if the variance is higher than the mean. Satterthwaite approximation can help obtain an estimate of that is always positive. Another method/most common used: Maximum Likelihood Estimators (MLE). "The MLE is the parameter point for which the observed sample is the most likely.... the first problem is that of actually finding the global maximum and verifying that... the second problem is.... how sensitive is the estimate to small changes in the data?" Third method is Bayes estimator.
- "In the Bayesian approach theta is considered to be a quantity whose variation can be described by a probability distribution (called the prior distribution). This is a subjective distribution, based on the experiments belief, and is formulated before the data is seen (hence the name prior distribution)." ... "The Bayes estimator is, again, a linear combination of the prior and sample means. Notice also that [tau?]^2, the prior variance, is allowed to tend to infinity, the Bayes estimator tends towards the sample mean. We can interpret this as saying that, as the prior information becomes more vague, the Bayes estimator tends to give more weight to the sample information."
- The fourth and last estimation method is the Expectation-Maximization (EM) algorithm, which is too complicated... "based on the idea of replacing one difficult likelihood maximization with a sequence of easier maximizations whose limit is the answer to the original problem. It is particularly suited to the 'missing data' problem."
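For the flavor, a minimal EM sketch (mine, not the book's): a two-component normal mixture with known unit variances and equal weights, where the "missing data" are the unobserved component labels:

```python
import math
import random

random.seed(0)
data = ([random.gauss(0, 1) for _ in range(300)]
        + [random.gauss(4, 1) for _ in range(300)])

mu1, mu2 = -1.0, 1.0  # crude starting values
for _ in range(50):
    # E-step: posterior probability each point came from component 1
    r = [1 / (1 + math.exp(0.5 * ((x - mu1) ** 2 - (x - mu2) ** 2))) for x in data]
    # M-step: update each mean as a responsibility-weighted average (easy maximization)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)

print(round(mu1, 2), round(mu2, 2))  # close to the true means 0 and 4
```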
- "The general topic of evaluating statistical procedures is part of the branch of statistics known as decision theory."... "we first investigate finite-sample measures of the quality of an estimator, beginning with its mean squares error [MSE]." ... " Thus, MSE incorporates two components, one measuring the variability of the estimation (precision), and the other measuring its bias (accuracy)." ..."Mean squared error is a special case of a function called a loss function. The study of performance, and the optimality, of estimates evaluated through loss functions is a branch of decision theory."
- "[Lehmann-Scheffe theorem] Unbiased estimators based on complete sufficient statistics are unique."
- "After a hypothesis test is done, the conclusions must be reported in some statistically meaningful way. One method of reporting the results of a hypothesis test is to report the size, alpha, of the test used and the decision go reject Ho or accept Ho.... if alpha is small, the decision to reject Ho is fairly convincing, but if alpha is large.. [it] is not very convincing because the test has a large probability of incorrectly making that decision. Another way of reporting the results of a hypothesis test is to report the value of a certain kind of test statistic called a p-value." ... "a p-value reports the results of a test on a more continuous scale, rather than just the dichotomous decision to 'Accept Ho' or 'Reject Ho.'"
- "we have carefully said that the interval [estimator] covers the parameter, not that the parameter is inside the internal.... we wanted to stress that the random quantity is the interval, not the parameter."
- "The use of decision theory in interval estimating problems is not a widespread as in point estimating or hypothesis testing problems."
- Huber (1981) robustness: "any statistical procedure should possess the following desirable features:
(1) It should have a reasonably good (optimal or near optimal) efficiency at the assumed model.
(2) It should be robust in the sense that small deviations from the model assumptions should impair the performance only slightly....
(3) Somewhat larger deviations from the model should not cause a catastrophe."
- "This insensitivity to extreme observations is sometimes considered an asset of the sample median.... the performance of the median improves in distributions with heavy tails."
- "A basic idea of the ANOVA, that of partitioning variation, is a fundamental idea of experimental statistics. The ANOVA belies its name in that it is not concerned with analyzing variances but rather with analyzing variation in means."