I work with applications of statistics and probability to recursive models for time series arising in engineering (where the inference task is called system identification), and I read this book in order to understand methods that germinated in econometrics (like IV) but may be meaningfully translated to other fields. My emphasis on abstraction runs against the grain of this very practical "companion," full of concrete details of real-life studies (many of them conducted by one of the authors). But if I'd wanted math, I'd have read a math book. MHE helped me understand a few new concepts and better follow econometric ways of thinking (in labor econ at the very least).
This book, by its own admission, plays fast and loose with the theoretical foundations of statistics. Asymptotics are generally taken for granted. There is no discussion of the integrability assumptions behind the Central Limit Theorem and the Strong Law of Large Numbers, and Nassim Taleb has very memorably reminded us that even CLT convergence can be painfully slow. Maybe I am a curmudgeon in this respect, but let me give an example. I once witnessed a presentation that approximated a normalized sum of i.i.d. random variables with a t-distribution. But the random variables in question were actually quotients of Normal variates, and as such had no finite moments at all (a ratio of centered Normals is Cauchy).
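To make the point concrete, here is a quick simulation (my own, not the book's): the ratio of two independent centered Normals is Cauchy-distributed, so its running sample mean never settles down, no matter how much data you throw at it.

```python
# Sketch: a quotient of centered Normal variates has no finite moments,
# so the sample mean does not converge (no LLN, no CLT).
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
ratios = rng.standard_normal(n) / rng.standard_normal(n)  # standard Cauchy

# For a CLT-friendly variable this running mean would stabilize;
# here it keeps jumping around.
running_mean = np.cumsum(ratios) / np.arange(1, n + 1)
for k in (10**3, 10**4, 10**5, 10**6):
    print(k, running_mean[k - 1])
```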
Two important assumptions made in the first chapter are
1) models have errors-in-variables until proved otherwise, that is, if y = ax + b, both y and x will be measured with random error. But the authors don't discuss the fact that both the EiV linear least-squares estimator and the maximum likelihood estimator can have heavy tails. There is a passing reference to the problem of regression dilution and a further discussion of how to mitigate dilution with IV (see the sketch after this list).
2) the estimated quantities a and b in such a linear model are not necessarily understood, as they often are in Bayesian or frequentist statistics, as the "truth" of an underlying data-generating process. Rather, statistics derive their validity by converging to a (strong) population limit, which is taken as interesting in itself.
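Here is a minimal sketch of point 1); the data-generating process and numbers are mine, not the book's, but it shows both the attenuation ("dilution") of OLS under measurement error and the IV fix, using a second noisy measurement of the same latent regressor as the instrument.

```python
# Sketch of regression dilution and its IV remedy.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a_true = 2.0

x_star = rng.standard_normal(n)                # latent regressor
y = a_true * x_star + rng.standard_normal(n)   # outcome, observed with noise
x1 = x_star + rng.standard_normal(n)           # noisy measurement used as regressor
x2 = x_star + rng.standard_normal(n)           # second noisy measurement, used as instrument

# OLS of y on x1 is attenuated toward zero by the measurement error in x1.
a_ols = np.cov(y, x1)[0, 1] / np.var(x1)

# IV: the two measurement errors are independent, so the ratio of
# covariances recovers a_true.
a_iv = np.cov(y, x2)[0, 1] / np.cov(x1, x2)[0, 1]

print(f"OLS (diluted): {a_ols:.3f}   IV: {a_iv:.3f}   truth: {a_true}")
```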
My biggest critique is that MHE, despite (or rather pursuant to) its title, doesn't adequately caution against multiple comparisons. In fact, on one occasion it encourages them. Among the practical steps for a 2SLS analysis: "Report the F-statistic on the excluded instruments." So far, so good. "Pick your single best instrument and report just-identified estimates using this one only." Uh-oh. (Econometric "experiments" can never be perfectly repeated.) A frequentist would save the day by including "pick your single best instrument" as one step in a statistic that can be bootstrapped against the data. A Bayesian would save the day by proposing a continuous mixture of models governed by a prior. In either case, the day needs to be saved.
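To illustrate the frequentist rescue, here is a sketch (entirely my own construction, not from the book): the "pick the strongest instrument, then run just-identified IV" pipeline is wrapped into a single function, and that whole function is bootstrapped, so the selection step gets charged for in the reported uncertainty.

```python
# Bootstrapping a statistic that includes instrument selection as one of its steps.
import numpy as np

rng = np.random.default_rng(2)
n, n_instruments = 500, 5
beta_true = 1.0

# Simulated data: endogenous regressor x, several instruments of varying strength.
Z = rng.standard_normal((n, n_instruments))
strength = np.linspace(0.1, 0.5, n_instruments)
u = rng.standard_normal(n)                       # unobserved confounder
x = Z @ strength + u + rng.standard_normal(n)
y = beta_true * x + u + rng.standard_normal(n)

def pick_and_estimate(x, y, Z):
    """Pick the instrument with the strongest first stage, then return
    the just-identified IV estimate that uses only that instrument."""
    first_stage = [abs(np.corrcoef(Z[:, j], x)[0, 1]) for j in range(Z.shape[1])]
    z = Z[:, int(np.argmax(first_stage))]
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

point = pick_and_estimate(x, y, Z)

# Resample the data and rerun the entire pipeline, selection step included.
boot = [pick_and_estimate(x[idx], y[idx], Z[idx])
        for idx in (rng.integers(0, n, n) for _ in range(1000))]

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate {point:.3f}, bootstrap 95% interval [{lo:.3f}, {hi:.3f}]")
```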
Things I learned
In this section, I will write (mostly for myself) short technical expositions of statistical techniques that I hope to remember.
Suppose that our model is the almost-sure law
0 = f(X, W, θ)
where X is a vector of observed random variables, W is a vector of unobserved random variables, and θ is the parameter we are trying to measure. We invert this law for θ using linear regression or MLE or something else. In econometrics, f is often a linear relationship between the "cause" and "effect" coordinates of X, W collects the random effects, and θ is the coefficient vector. Let F be any σ-algebra in this probability space. Let us observe that we may take conditional expectations of both sides of this equation.
0 = E[ f(X, W, θ) | F ]
Why would we do this? One reason would be that F is a σ-algebra that smooths away (confounding) randomness in X and W while leaving the identifiability of θ intact. The generators of F are called instrumental variables.
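A toy numerical example of this, in my own notation rather than the book's: take f(X, W, θ) = y − θx − w, let w confound x, and let F be generated by an instrument z independent of w. Then E[ f | F ] = 0 implies the unconditional moment E[ z(y − θx) ] = 0, whose empirical version pins down θ even though the naive regression of y on x is biased.

```python
# Solving the moment condition E[ z * (y - theta * x) ] = 0 for theta.
import numpy as np

rng = np.random.default_rng(3)
n, theta_true = 100_000, 1.5

z = rng.standard_normal(n)                  # instrument: generates the sigma-algebra F
w = rng.standard_normal(n)                  # unobserved, independent of z
x = 0.8 * z + w + rng.standard_normal(n)    # x is confounded by w
y = theta_true * x + w                      # the a.s. law: 0 = y - theta*x - w

def empirical_moment(theta):
    """Sample analogue of E[ z * (y - theta * x) ]."""
    return np.mean(z * (y - theta * x))

# The moment is linear in theta, so its root has a closed form.
theta_hat = np.mean(z * y) / np.mean(z * x)

naive_ols = np.cov(x, y)[0, 1] / np.var(x)
print(f"naive OLS: {naive_ols:.3f}   moment-condition estimate: {theta_hat:.3f}   "
      f"truth: {theta_true}   moment at theta_hat: {empirical_moment(theta_hat):.1e}")
```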
As an aside, F can also be taken to be generated by a sufficient statistic of X, with a result similar to Rao-Blackwellization.
From this abstract point of view, forbidden regression is the assumption that conditional expectation commutes with nonlinear function composition, which now sounds obviously false.
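A quick numerical check of that claim (my construction): with X = Z + noise and g(x) = x², the averages of g(X) and g(E[X | Z]) differ by exactly the conditional variance, so swapping the order of conditioning and composition is not harmless.

```python
# Conditional expectation does not commute with nonlinear composition:
# E[X**2 | Z] = Z**2 + 1, whereas (E[X | Z])**2 = Z**2.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
z = rng.standard_normal(n)
x = z + rng.standard_normal(n)    # E[X | Z] = Z, Var(X | Z) = 1

lhs = np.mean(x**2)               # estimates E[ E[X**2 | Z] ] = 2
rhs = np.mean(z**2)               # estimates E[ (E[X | Z])**2 ] = 1
print(f"E[g(X)] = {lhs:.3f}   vs   E[g(E[X|Z])] = {rhs:.3f}")
```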
Finally, quantile regression is not as scary as it seems. One may be familiar with squared error minimization (least squares) or absolute error minimization (least absolute deviations). Absolute error minimization corresponds to median regression. What if the convex loss function looked like the absolute value, except it were more steeply sloped on one side of 0 than the other? The result is a quantile regression, where the imbalance between the penalties on positive and negative residuals determines which quantile you are fitting.
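A sketch of that tilted loss in action (data and optimizer choice are mine): the "pinball" loss weights positive residuals by τ and negative residuals by 1 − τ, and minimizing it fits the τ-th conditional quantile; τ = 0.5 reduces to least absolute deviations.

```python
# Quantile regression by minimizing the tilted absolute ("pinball") loss.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 5_000
x = rng.uniform(0, 10, n)
# Heteroskedastic noise, so different quantile lines have different slopes.
y = 1.0 + 2.0 * x + (0.5 + 0.3 * x) * rng.standard_normal(n)

def pinball_loss(params, x, y, tau):
    """Weight tau on positive residuals, 1 - tau on negative ones."""
    a, b = params
    resid = y - (a + b * x)
    return np.mean(np.where(resid >= 0, tau * resid, (tau - 1) * resid))

for tau in (0.1, 0.5, 0.9):
    fit = minimize(pinball_loss, x0=[0.0, 1.0], args=(x, y, tau),
                   method="Nelder-Mead")
    a, b = fit.x
    print(f"tau = {tau}: intercept {a:.2f}, slope {b:.2f}")
```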