A layman's explanation for you:
A good scientist is able to condense and explain things so simply that even my Grandma can understand.
Pre-requisite: a thirst for knowledge & an insane curiosity for questioning.
What do you want?
That depends on who you are.
Engineer: You only need to know how to use it or apply it.
Do you want to build next-generation technology?
Research Scientist/Professor: You need to understand deeply, to create novel methods & contribute.
I may write about the Math & Statistics in my next pieces.
Here, I condense to the core aspects.
What is Data-Science?
Applied Statistics, expressed through programming languages, for making meaning out of data.
Outline of this work:
1. Exploratory Data Analysis
2. Data and Sampling Distribution
3. Statistical Experiment and Significance Testing
4. Regression and Prediction
5. Classification
6. Statistical Machine Learning
7. Unsupervised Learning
So, What is the meat of this Book?
Let's go through this
1. Exploratory Data Analysis
The first chapter gives clues for Data Analysis, from John Tukey's seminal paper. In short, Data Analysis is exploratory, through simple plots and summary statistics.
We have numeric & categorical data.
Numeric: Continuous & Discrete.
Moreover, for two-dimensional data, we have Rectangular Data [Rows & Columns].
In pandas, Google Colab et al., we have the DataFrame.
In addition to the above, we have non-rectangular data structures, such as time-series & spatial data.
Summary Statistics: what we want is a short summary of our data.
For summary statistics, we commonly have the mean, median, deviation, variance, and checks for outliers & anomalies.
These are the basic statistical measures for exploratory data analysis.
For Visualization, we can use: Box-Plot, Histogram, Density Plot, Scatterplot, Contour Plots.
Gist of this chapter: summarizing and visualizing the data.
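The summarize-and-visualize idea can be sketched in plain Python (made-up dosai prices; a real project would use pandas and a plotting library):

```python
import statistics

# Hypothetical dosai prices from a small street survey (made-up numbers)
prices = [30, 35, 40, 40, 45, 50, 200]  # 200 is an outlier

mean_price = statistics.mean(prices)      # pulled upward by the outlier
median_price = statistics.median(prices)  # robust to the outlier
spread = statistics.stdev(prices)         # sample standard deviation
```

Notice how the median resists the single extreme value far better than the mean; that is exactly why exploratory analysis looks at both.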
2. Data and Sampling Distribution
In applied work [industry], you'd probably be more concerned with data quality, scale, et al.
The Author goes into details with the following,
Population, in statistics, describes a defined, large set of data.
A Sample is a subset of data drawn from the larger set.
Random sampling: every member has an equal chance of being chosen.
Data quality consists of completeness, consistency, cleanliness, accuracy & representativeness.
Bias, in statistics, is a systematic difference between actual and predicted values.
Sometimes, an apparent bias might be just random chance.
Sometimes, it might be actual, systematic bias.
Size vs Quality: When does Size Matter?
The author says, surprisingly, smaller is often better.
Consider the best predicted search destination for a query.
For massive amounts of data, the author gives the example of a query in Google Search.
Imagine Google Search for the query "Tamil": which result would come first?
The query has to pass through hundreds of thousands of documents and give you the relevant result.
So, how to mitigate bias?
We do it through random shuffling:
specifying a hypothesis,
then collecting data following randomization.
Random sampling principles guard against bias.
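A minimal random-sampling sketch, assuming a hypothetical population of 100 units:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))        # a hypothetical population of 100 units
sample = random.sample(population, 10)  # every unit has an equal chance of selection
```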
Regression toward the Mean:
Regression toward the mean simply says that, following an extreme random event, the next random event is likely to be less extreme.
Bootstrap is when we take samples with replacement from the observed data-set.
And why do we do this? To assess the variability of a sample statistic.
Bootstrap is also a way to construct confidence intervals.
A Confidence Interval is a way to represent uncertainty; it gives us a range of values.
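The bootstrap idea above can be sketched with the standard library alone (the observations are made up):

```python
import random
import statistics

random.seed(42)
observed = [4, 8, 6, 5, 3, 7, 9, 5, 6, 4]  # hypothetical observations

# Resample with replacement many times, recording the mean each time
boot_means = sorted(
    statistics.mean(random.choices(observed, k=len(observed)))
    for _ in range(1000)
)

# A rough 90% confidence interval for the mean: the 5th and 95th percentiles
lower, upper = boot_means[49], boot_means[949]
```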
Normal Distribution, the most famous distribution, imagine a nicely assorted Indian food, thali.
Contrary to what we believe, most of the data used in Data Science projects follows a long-tailed distribution.
It is not normally distributed.
T-Distribution is shaped like the normal, but a bit thicker and longer in the tails.
Binomial Distribution -- well, you have a discrete set of outcomes from repeated random trials.
Each trial has two possible values, that is why Binomial [Yes/No].
We also have the Chi-Square Distribution; in short, we want to measure the extent of departure from what we expect under the null model.
We represent that expectation as the null hypothesis.
We have the F-Distribution, where we measure the ratio of variability among groups to variability within groups.
Imagine fertilizers applied to groups of plots, and we want to find out the variability of their effectiveness.
Poisson Distribution: when time or space is involved, that is, the average number of events per unit of time or space.
We could ask: how much capacity do we need to be 99% sure of handling the internet traffic that arrives every second?
Exponential Distribution: we want to estimate a failure rate, e.g., aircraft engine failure.
Weibull Distribution: we extend further from the Exponential, where the event rate is allowed to change over time.
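A small sketch of drawing from two of these distributions with Python's random module (all numbers are made up):

```python
import random

random.seed(1)

# Exponential: hypothetical gaps between engine faults, rate 0.5 per hour
gaps = [random.expovariate(0.5) for _ in range(1000)]

# Normal: hypothetical heights, mean 165 cm, standard deviation 7 cm
heights = [random.gauss(165, 7) for _ in range(1000)]
```

Plotting histograms of `gaps` and `heights` would show the long right tail of the exponential against the symmetric bell of the normal.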
3. Statistical Experiment and Significance Testing
Many scientific applications require experimentation:
formulate a hypothesis, experiment, collect data, draw inferences & conclusions.
A popular one is A/B Testing. And why do we do it?
Basically, to find which of the two is better.
Another popular method is the Multi-Arm Bandit Algorithm.
So, why use it? To optimize decision making through many trials.
So, imagine we want to design a policy that maximizes returns.
Next, Hypothesis Tests, which consist of significance testing.
Null Hypothesis: we give chance the blame.
Alternative Hypothesis: the counterpoint to the null.
One-way test -- a hypothesis test that counts chance results in one direction only.
Two-way test -- a hypothesis test that counts chance results in both directions.
T-Test: we want to find if there's a difference between the means of two populations.
Degrees of Freedom: the number of independent values.
ANOVA: Analysis of Variance.
Why use this? When we have more than A/B, say A/B/C/D, with numeric data.
F-Test: the F value is the ratio of variance among group means to variance within groups.
Fisher's Exact Test: a significance test we use for finding association between two categorical variables.
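The significance-testing machinery above can be sketched as a simple permutation test: shuffle the pooled data, re-split it, and see how often chance alone reproduces the observed difference (the A/B numbers are made up):

```python
import random
import statistics

random.seed(7)
a = [23, 25, 28, 30, 22]  # hypothetical metric, variant A
b = [31, 33, 29, 35, 34]  # hypothetical metric, variant B

observed_diff = statistics.mean(b) - statistics.mean(a)

# Shuffle the pooled values and re-split, to see what chance alone produces
pooled = a + b
extreme = 0
trials = 2000
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
    if diff >= observed_diff:
        extreme += 1

p_value = extreme / trials  # one-way (one-sided) p-value
```

A small p-value means chance rarely produces a gap this large, so we'd lean toward rejecting the null hypothesis.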
Multi-Arm Bandit Algorithm:
Basically, we have a hypothetical slot machine, where we make multiple attempts to reach the optimum decision.
4. Regression and Prediction
Linear Regression:
The relationship between the magnitude of one variable and a second variable.
Multiple Linear Regression: here, we use two or more independent variables to predict the outcome of a dependent variable.
Root Mean Squared Error:
We want a performance metric; meaning, we build a model.
And we want this model to predict something.
So, we find the difference between predicted and actual values.
There are many ways to summarize this error, and RMSE is a popular one.
So, we have a square root. Okay, of what? Of the averaged squared errors.
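A minimal RMSE sketch, with made-up actual and predicted values:

```python
import math

actual = [3.0, 5.0, 2.5, 7.0]     # made-up ground truth
predicted = [2.5, 5.0, 4.0, 8.0]  # made-up model output

# Square each error, average, then take the square root
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
```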
Cross-Validation:
What are we trying to do with cross-validation?
We repeatedly hold out different portions of the data, to see how the model does on prediction.
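A bare-bones k-fold split sketch (real work would use scikit-learn's cross-validation helpers):

```python
# Split data into k folds; each fold takes a turn as the held-out test set
def k_fold_splits(data, k):
    fold_size = len(data) // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

folds = list(k_fold_splits(list(range(10)), 5))
# For each (train, test) pair: fit on train, evaluate on test
```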
Weighted Regression:
So - Regression, recall we get a scalar value.
Weighted Regression: we want to use it when our dataset displays heteroscedasticity, meaning non-constant variance; we give observations different weights to account for it.
Multi-Collinearity:
We say two variables are multi-collinear when there is high correlation between them.
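A quick way to spot multicollinearity is the correlation coefficient; here is a hand-rolled Pearson r on two made-up variables that move almost in lockstep:

```python
import math

# Two made-up variables that rise together (strong multicollinearity)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(
    sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
)
# r close to 1.0 signals the two predictors carry almost the same information
```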
5. Classification
Questions like,
a) "Is this customer likely to churn?"
b) "Is this person going to come back and read my review?"
c) "Is this person going to eat Tamil Dosai?"
These are all Classification questions.
They come under supervised learning.
Recall supervised learning:
we have a label, Tamil Dosai, for a food, and then we want to classify whether the next food, say Idly, is Dosai or not.
If the model classified Tamil Idly as Dosai, then it has failed to generalize.
That is not what we want.
Naive Bayes:
So, what is this?
We assume that, between the features, there is no relationship.
They are all independent, and it's a simple application of Bayes' Theorem.
Discriminant Analysis:
So, what do we want?
We want groups, yes, groups.
Imagine a set of South Indian foods [Dosai, Idly, Chutney, Rice, Sambar, Chicken].
Well, we eat them, but we keep them in groups, right?
And what do we want to do with them?
We want a combination of those variables to predict: veg or non-veg?
Yes -- so, we have an assumption called the multivariate normal distribution.
Which means we assume the dataset is normally distributed.
We have independent variables & a dependent variable.
We use the dependent variable for forming the groups and the independent variables as predictors.
We can use it with categorical or continuous predictors.
Logistic Regression:
A popular method; here the outcome is binary.
Mostly, it's simple and fast to use.
Generalized Linear Models:
A family of models, each defined by a probability distribution and a link function.
6. Statistical Machine Learning
K-Nearest Neighbours:
Basically, this is non-parametric, meaning there's no assumption about the dataset's distribution.
And it is supervised learning.
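A tiny 1-nearest-neighbour sketch, with a hypothetical spiciness feature and labels:

```python
# Made-up training data: (spiciness score, label) pairs
train = [(1.0, "mild"), (2.0, "mild"), (8.0, "spicy"), (9.0, "spicy")]

def predict(spiciness):
    # the single closest training point votes for the label (k = 1)
    nearest = min(train, key=lambda point: abs(point[0] - spiciness))
    return nearest[1]
```

With k greater than 1, the k closest points would vote and the majority label would win.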
Distance Metrics:
You'd frequently come across many distance metrics.
Mahalanobis distance:
Mahalanobis is a distance metric that measures the distance between points of a dataset, using the mean and covariance matrix.
One Hot Encoder:
We have "Dosa" and we want to represent it in a way the computer understands.
In simple terms, we translate it into 0s and 1s.
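One-hot encoding by hand (hypothetical food labels; in practice pandas' get_dummies or scikit-learn's OneHotEncoder does this):

```python
# Hypothetical food labels to encode
foods = ["Dosa", "Idly", "Dosa", "Sambar"]
categories = sorted(set(foods))  # ['Dosa', 'Idly', 'Sambar']

# Each label becomes a vector with a single 1 in its category's slot
encoded = [[1 if food == c else 0 for c in categories] for food in foods]
# "Dosa" becomes [1, 0, 0], "Idly" becomes [0, 1, 0], and so on
```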
Normalization:
Again, we have a "Dosa" variable, and we scale the data to bring variables onto a comparable footing and make computation easier.
Z-Scores:
How far is your taste in food from that of regular Tamil people?
We want a normal distribution to represent taste buds.
And then we have the Z-score: if your Z-score is zero, it tells me your taste is identical to the average.
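A Z-score sketch on hypothetical spice-tolerance scores:

```python
import statistics

# Hypothetical spice-tolerance scores for a group of tasters
scores = [55, 60, 65, 70, 75]
mean = statistics.mean(scores)
stdev = statistics.stdev(scores)

your_score = 65
z = (your_score - mean) / stdev  # 0.0 means identical to the group average
```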
Classification and Regression Trees:
CART:
I actually don't know tree names in America.
I'd stick with tree names from Tamil Nadu, India.
Oh, Coconut Trees.
So, imagine classifying each branch of the coconut tree based on some criterion [Dosai/non-Dosai].
So, we split the tree with that criterion.
Recursive partitioning:
We want to create a decision tree by repeatedly splitting the data based on a criterion.
Bagging and Random Forest:
We want to use Bagging alongside Random Forest - why?
To reduce variance.
Ah, recall bootstrapping & aggregation.
Variable Importance:
How much a variable helps the model make accurate predictions.
Hyper-parameter:
Basically, we have parameters that control the learning process of the model.
Ah! We want to tune our model.
Boosting:
We want to reduce errors by fitting models sequentially.
Usually, we apply it where the classifier has high bias.
Ah! Bias, recall: when the difference between actual and predicted values is systematically large.
XGBoost:
XGBoost, a gradient-boosting technique for regression and classification, often used in learning to rank.
Well, we have something called a mix of models; they call it an ensemble.
Basically, we combine a few models to create better performance and improve prediction.
Cross-Validation:
What we want to do: a resampling method, so that we can test and validate on different portions of the data.
7. Unsupervised Learning
Frequently in Machine Learning, we come across different ways of learning.
So, Unsupervised Learning is basically a way to extract meaning without labels.
Clustering:
Recall, we want to group records; we can use this for exploratory data analysis.
We could also use it to reduce dimensions.
Principal Component Analysis:
We want to do PCA when we want to find the combinations of variables that explain the most variance.
Think of it this way: you have 100 columns of data, and you want to find the most important for your question.
So, PCA would give you the 3 or 4 components (weighted combinations of columns) that vary the most.
Issues:
We can't use PCA for categorical values
Correspondence Analysis:
Ah, so we can't use PCA for categorical variables.
What do we do? We use Correspondence Analysis to find association between categories.
We get the output as a bi-plot.
K-Means Clustering:
Clustering, recall: we want groups.
And why do we want them?
Well, because we make groups to do exploratory analysis or something like it.
A cluster holds similar records, and k is the number of clusters we want.
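A tiny one-dimensional k-means sketch with k = 2 on made-up points (scikit-learn's KMeans is the practical choice):

```python
import statistics

# Made-up 1-D data with two obvious clumps
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [1.0, 10.0]  # naive initial guesses for the two centers

for _ in range(10):
    # Assign each point to its nearest center, then recompute the centers
    clusters = {0: [], 1: []}
    for p in points:
        nearest = min((0, 1), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    centers = [statistics.mean(clusters[i]) for i in (0, 1)]
```

The loop settles quickly: each center drifts to the mean of its clump.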
Hierarchical Clustering:
We usually get a tree-shaped cluster map.
Commonly built with an agglomerative (bottom-up) algorithm.
Measure of Dissimilarity:
So, how do we measure dissimilarity?
Complete linkage, Single linkage, Average Linkage, Minimum variance.
Multivariate Normal Distribution:
There are a lot of distributions; the Multivariate Normal is the normal distribution with more dimensions.
Scaling and Categorical Variables:
Basically, we are squashing or expanding the data to bring multiple variables onto the same scale.
Categorical variables: Yes/No.
Gower's Distance: a similarity measure. Ah! We use it for mixed data, including binary or categorical variables.
"Torture the Data long enough, and it will confess"
Deus Vult,
Gottfried