I really like the overview/coverage of data science as a whole. Granted, it would help a lot to have experience coding in Python to read along, which makes it feel more like a data science overview for a software engineer. Lots of meh-quality humor, and lots of code and hard-coded links in the book that seem odd in print (maybe better in an ebook), but overall a good refresher.
Quotes
- "variance measures how a single variable deviates from its mean; covariance measures how two variables vary in tandem from their means. ...a 'large' positive covariance means that x tends to be large when y is large and small when y is small. ...this number can be hard to interpret, for a couple of reasons: (1) its units are the product of the inputs' units.... (2) if each user had twice as many friends (but the same number of minutes), the covariance would be twice as large. But in a sense the variables would be just as interrelated. Said differently, it's hard to say what counts as a 'large' covariance. For this reason, it's more common to look at correlation, which divides out the standard deviations of both variables. .... the correlation is unitless and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation)."
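A minimal sketch of these definitions in plain Python (function names and the friends/minutes toy data are my own; the book's implementations may differ):

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    # average product of deviations from the two means (sample version)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def std_dev(xs):
    return sqrt(covariance(xs, xs))

def correlation(xs, ys):
    # divide out both standard deviations: unitless, always in [-1, 1]
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

friends = [1, 2, 3, 4, 5]
minutes = [10, 22, 28, 41, 49]
print(correlation(friends, minutes))                 # close to 1
# doubling one variable doubles the covariance but not the correlation:
print(covariance([2 * x for x in friends], minutes))
print(correlation([2 * x for x in friends], minutes))
```

The last two lines illustrate the quote's point: covariance is sensitive to units and scale, correlation is not.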
- "The key issue [Simpson's paradox] is that correlation is measuring the relationship between your two variables all else being equal.... the only real way to avoid this is by knowing your data and by doing what you can to make sure you've checked for possible confounding factors."
- "x=[-2,-1,0,1,2], y=[2,1,0,1,2] have zero correlation. But they certainly have a relationship - each element of y equals the absolute value of the corresponding element of x. What they don't have is a relationship in which knowing how x_i compares to mean(x) gives us information about how y_i compares to mean(y). That is the sort of relationship that correlation looks for. .... x=[-2,-1,0,1,2], y=[99.98,99.99,100,100.01,100.02] are perfectly correlated, but (depending on how you're measuring it) it's quite possible that this relationship isn't all that interesting."
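Both examples are easy to verify with a few lines of Python (using the stdlib `statistics` module rather than the book's own helpers):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    # sample correlation: covariance divided by both standard deviations
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

x = [-2, -1, 0, 1, 2]
y_abs = [2, 1, 0, 1, 2]                        # y = |x|: a real relationship...
print(correlation(x, y_abs))                   # ...but zero correlation

y_line = [99.98, 99.99, 100, 100.01, 100.02]   # perfectly correlated...
print(correlation(x, y_line))                  # ...but arguably uninteresting
```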
- F. Scott Fitzgerald quote: "to write it, it took three months; to conceive it, three minutes; to collect the data in it, all my life."
- "...data science is mostly turning business problems into data problems and collecting data and understanding data and cleaning data and formatting data, after which machine learning is almost an afterthought. Even so, it's an interesting and essential afterthought..."
- "What is a model? It's simply a specification of a mathematical (or probabilistic) relationship that exists between different variables." .... "we'll use machine learning to refer to creating and using models that are learned from data. In other contexts this might be called predictive modeling or data mining...."
- "supervised models (in which there is a set of data labeled with the correct answers to learn from), and unsupervised models (in which there are no such labels).... semisupervised (in which only some of the data are labeled), and online (in which the model needs to continuously adjust to newly arriving data)."
- "A bigger problem is if you use the test/train split not just to judge a model but also to choose from among many models.... in such a situation, you should split the data into three parts: a training set for building models, a validation set for choosing among trained models, and a test set for judging the final model."
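A sketch of that three-way split (the function name and fractions are my own, not the book's):

```python
import random

def three_way_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    # shuffle a copy, then slice into train / validation / test:
    # train builds models, validation chooses among them, test judges the winner
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))   # 70 15 15
```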
- "A model that predicts 'yes' when it's even a little bit confident will probably have a high recall but a low precision; a model that predicts 'yes' only when it's extremely confident is likely to have a low recall and a high precision. Alternatively, you can think of this as a trade-off between false positives and false negatives. Saying 'yes' too often will give you lots of false positives; saying 'no' too often will give you lots of false negatives."
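The tradeoff is easy to see by sweeping a confidence threshold over the same scores (toy labels and scores are mine, purely illustrative):

```python
def precision_recall(y_true, scores, threshold):
    # predict "yes" whenever the model's confidence clears the threshold
    preds = [s >= threshold for s in scores]
    tp = sum(p and t for p, t in zip(preds, y_true))
    fp = sum(p and not t for p, t in zip(preds, y_true))
    fn = sum((not p) and t for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [True, True, True, False, False, False]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

# eager "yes" (low threshold): perfect recall, mediocre precision
print(precision_recall(y_true, scores, 0.2))   # (0.6, 1.0)
# cautious "yes" (high threshold): perfect precision, poor recall
print(precision_recall(y_true, scores, 0.8))   # (1.0, 0.333...)
```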
- "High bias and low variance typically correspond to underfitting." I.e., a flat-line model is pretty stable but consistently wrong.
- "If your model has high bias (which means it performs poorly even on your training data) then one thing to try is to add more features.... if your model has high variance, then you can similarly remove features. But another solution is to obtain more data (if you can)."
- "Holding model complexity constant, the more data you have, the harder it is to overfit. On the other hand, more data won't help with the bias. If your model doesn't use enough features to capture regularities in the data, throwing more data at it won't help."... "Features are whatever inputs we provide to our model."
- "Naive Bayes classifier... is suited to yes-or-no features... regression models.... require numeric features (which could include dummy variables that are 0s and 1s). And decision trees... can deal with numeric or categorical data."
- "The key to Naive Bayes is making the (big) assumption that the presence (or absence) of each word is independent of the others, conditional on a message being spam or not. Intuitively, this assumption means that knowing whether a certain spam message contains the word 'viagra' gives you no information about whether that same message contains the word 'rolex.'.... this is an extreme assumption.... despite the unrealisticness of this assumption, this model often performs well and is used in actual spam filters."
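A bare-bones sketch of that classification step, assuming per-word probabilities have already been estimated from labeled data (the word probabilities and the 0.01 fallback for unseen words are made up for illustration; the book's actual smoothing differs):

```python
def spam_probability(words, spam_word_probs, ham_word_probs, p_spam=0.5):
    # Naive Bayes: multiply per-word likelihoods as if each word were
    # independent given the class, then apply Bayes' theorem.
    p_if_spam = p_spam
    p_if_ham = 1 - p_spam
    for w in words:
        p_if_spam *= spam_word_probs.get(w, 0.01)  # fallback is my assumption
        p_if_ham *= ham_word_probs.get(w, 0.01)
    return p_if_spam / (p_if_spam + p_if_ham)

# hypothetical P(word | spam) and P(word | ham) from training data
spam_word_probs = {"viagra": 0.8, "rolex": 0.6, "meeting": 0.05}
ham_word_probs = {"viagra": 0.01, "rolex": 0.02, "meeting": 0.4}

print(spam_probability(["viagra", "rolex"], spam_word_probs, ham_word_probs))
print(spam_probability(["meeting"], spam_word_probs, ham_word_probs))
```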
- "In practice, you usually want to avoid multiplying lots of probabilities together to avoid a problem called underflow, in which computers don't deal well with floating-point numbers that are really close to zero." [Logs help here].
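A quick demonstration of the underflow problem and the log trick (numbers chosen by me to force the failure):

```python
import math

probs = [1e-5] * 100          # 100 small probabilities

naive = 1.0
for p in probs:
    naive *= p                # (1e-5)**100 = 1e-500: underflows to 0.0

log_total = sum(math.log(p) for p in probs)   # sum of logs stays finite

print(naive)                              # 0.0
print(log_total)                          # about -1151.3
print(math.exp(log_total / len(probs)))   # per-item scale recovered: 1e-05
```

Since log is monotonic, comparing summed log-probabilities gives the same ranking as comparing products, without ever hitting zero.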
- [multiple regression] "You should think of the coefficients of the model as representing all-else-being-equal estimates of the impacts of each factor."
- [multiple regression] "Keep in mind, however, that adding new variables to a regression will necessarily increase the R-squared.... the regression as a whole may fit our data very well, but if some of the independent variables are correlated (or irrelevant), their coefficients might not mean much." ... "if the goal is to explain some phenomenon, a sparse model with three factors might be more useful than a slightly better model with hundreds.... regularization is an approach in which we add to the error term a penalty that gets larger as beta gets larger. We then minimize the combined error and penalty. The more importance we place on the penalty term, the more we discourage large coefficients. [E.g. ridge regression].... usually you'd want to rescale your data before using this approach. After all, if you changed years of experience to centuries of experience, its least squares coefficient would increase by a factor of 100 and suddenly get penalized much more, even though it's the same model."
- "Whereas the ridge penalty shrank the coefficients overall, the lasso penalty tends to force coefficients to be zero, which makes it good for learning sparse models."
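The two penalties from these quotes can be sketched as functions added onto the squared error (a toy objective of my own, not the book's code; skipping the intercept in the penalty follows the usual convention):

```python
def ridge_penalty(beta, alpha):
    # L2: penalize the sum of squared coefficients (shrinks all of them)
    return alpha * sum(b ** 2 for b in beta[1:])   # beta[0] is the intercept

def lasso_penalty(beta, alpha):
    # L1: penalize absolute values (tends to zero some coefficients out,
    # which is what makes lasso good for learning sparse models)
    return alpha * sum(abs(b) for b in beta[1:])

def penalized_error(beta, xs, ys, penalty, alpha):
    # combined objective: squared prediction error plus the chosen penalty;
    # this is what gradient descent would minimize
    def predict(x):
        return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    sse = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))
    return sse + penalty(beta, alpha)

print(ridge_penalty([1.0, 2.0, 3.0], 0.5))   # 6.5
print(lasso_penalty([1.0, -2.0, 3.0], 0.5))  # 2.5
```

Note how the rescaling caveat from the previous quote shows up here: inflate a coefficient by 100 and either penalty balloons, even though the underlying fit is unchanged.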
- "Partitioning on SSN will produce one-person subsets, each of which necessarily has zero entropy. But a model that relies on SSN is certain not to generalize beyond the training set. For this reason, you should probably try to avoid (or bucket, if appropriate) attributes with large numbers of possible values when creating decision trees."
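The SSN pathology is easy to reproduce with a toy entropy calculation (the hiring data below is invented; the book uses a similar candidate-interview example):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def partition_entropy(pairs, key_index):
    # entropy of a partition: weighted average of the subsets' entropies
    groups = defaultdict(list)
    for attrs, label in pairs:
        groups[attrs[key_index]].append(label)
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * entropy(g) for g in groups.values())

# toy data: (ssn, level) attributes with a hire/no-hire label
data = [(("123-45-6789", "senior"), True),
        (("987-65-4321", "senior"), False),
        (("111-22-3333", "junior"), True),
        (("444-55-6666", "junior"), True)]

print(partition_entropy(data, 0))  # SSN: one-person subsets -> 0.0
print(partition_entropy(data, 1))  # level: an honest, generalizable split
```

Splitting on SSN looks perfect by the entropy criterion (0.0) precisely because every subset has one member, which is why high-cardinality attributes need to be avoided or bucketed.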