How much maths & statistics knowledge is required to thrive in the Data Science field?

Mohammad Arshad

Grasping the fundamental concepts of mathematics and statistics, and being able to apply them appropriately to Data Science problems, is the toughest part of Data Science, and many aspirants get overwhelmed by the vast range of topics and concepts in statistics.

I can broadly divide all statistical topics into two components:

  • Basic Statistics – This includes all the statistical analysis that we do before modeling, such as variability, hypothesis testing, probability, etc.
  • Statistics in Machine Learning Algorithms – All ML algorithms we use for modeling are based on mathematics and statistics. For the specific ML algorithm you are using, you should be aware of the statistical concepts behind it.

In basic statistics, what we generally do when we first get the data in a project is a quick analysis of it: we check its distribution, central tendency, minimum and maximum, variability, and any outliers.
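As a rough illustration of that first quick look, here is a minimal sketch using only Python's standard library. The `quick_summary` helper, the made-up `ages` sample, and the 1.5 × IQR rule for flagging outliers are assumptions chosen for the example, not a prescribed method.

```python
import statistics

def quick_summary(values):
    """Return central tendency, spread, range, and IQR-based outliers."""
    ordered = sorted(values)
    # Quartile cut points (default "exclusive" method of statistics.quantiles)
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # common rule-of-thumb fences
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "outliers": [v for v in ordered if v < low or v > high],
    }

# Made-up sample: one value is clearly far from the rest
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 120]
summary = quick_summary(ages)
print(summary)
```

In this sample the mean is pulled well above the median by the single extreme value, which is exactly the kind of thing this first pass is meant to surface.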

There are mainly 4 sections in Basic Statistics and Maths:

  1. Probability
  2. Distributions
  3. Estimations
  4. Inferences

The main tasks that require basic statistics pre-modeling, or any analysis, are:

  1. Exploratory data analysis
  2. Chi-Square testing
  3. Proportional test
  4. Feature engineering – We make the independent variables and dependent variables
  5. Data Cleaning – Missing Value and Outlier Treatment
  6. Sampling the data
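To make the Chi-Square testing item in the list above concrete, here is a minimal sketch of a Pearson chi-square test of independence on a 2x2 table, written with the standard library only. The `chi_square_2x2` helper and the counts are made up for illustration, and no continuity correction is applied.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed_count - expected) ** 2 / expected
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so the upper-tail p-value is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up counts: rows = plan A / plan B, columns = churned / stayed
observed = [[30, 70], [10, 90]]
chi2_stat, chi2_p = chi_square_2x2(observed)
print(f"chi-square = {chi2_stat:.2f}, p-value = {chi2_p:.5f}")
```

A small p-value here would suggest that the two categorical variables are related, which can guide which variables are worth carrying into feature engineering.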

Below is a mindmap I've created that explains how we use basic statistics concepts in our day-to-day projects:

We try to understand the business problem we are solving with the data. We make the independent variables and dependent variables, decide how to sample the data, and do some statistical tests to see whether the samples we created are statistically significant. Then we finalize the data for the model.
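One possible reading of the sample check described above is a proportional test comparing the target rate in two samples, for instance a training and a test split. The sketch below is a two-proportion z-test under that assumption. The `two_proportion_z_test` helper and the counts are hypothetical.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up split: 120 positives out of 800 (train), 33 out of 200 (test)
z_stat, z_p = two_proportion_z_test(120, 800, 33, 200)
print(f"z = {z_stat:.3f}, p-value = {z_p:.3f}")
```

With these numbers the p-value is large, so we would have no evidence that the two samples differ in their target rate.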

All the activities here can generally be classified as basic statistics, done before we run the machine learning algorithm.

If you want to self-study, here is the link to Inferential Statistics courses from Massive Open Online Courses (MOOCs).

In Statistics for Machine Learning Algorithms, we should focus on one Machine Learning algorithm for regression problems and one for classification, and know them very well. For the others, you can just know the important points and build on them as you do more projects.

For any Machine Learning algorithm we implement, the minimal requirement is to know the assumptions of that algorithm and to be able to test those assumptions statistically.

For example, for multiple linear regression, we should test normality, autocorrelation, multicollinearity, and homoscedasticity.
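Two of these checks are simple enough to sketch by hand: the Durbin-Watson statistic for autocorrelation of residuals, and the variance inflation factor (VIF) for multicollinearity. The sketch below uses only plain Python. The helper names and all numbers are made up, and the VIF shortcut shown is valid only for the two-predictor case. Normality and homoscedasticity checks are left out here.

```python
import math

def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 suggests little autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def vif_two_predictors(x1, x2):
    """VIF when there are exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Made-up residuals from a fitted regression, in observation order
residuals = [0.5, 0.3, -0.2, -0.4, 0.1, 0.2, -0.3, -0.2]
dw = durbin_watson(residuals)

# Made-up predictor columns
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
print(f"Durbin-Watson = {dw:.3f}, VIF = {vif:.3f}")
```

The Durbin-Watson statistic ranges from 0 to 4, with values well below 2 pointing to positive autocorrelation. A VIF of 1 means the predictors are uncorrelated, and larger values mean stronger multicollinearity.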

Similarly, for logistic regression, we should know the confusion matrix, recall, precision, and AUC.
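To show how these four quantities relate, here is a minimal sketch that computes them from true labels and predicted probabilities in plain Python. The helper names, the 0.5 threshold, and the scores are made up. The AUC is computed by pairwise comparison of positive and negative scores, which is one equivalent way of defining it.

```python
def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the scores."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A tie between a positive and a negative score counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities from a classifier
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, y_score)
precision = tp / (tp + fp)  # of predicted positives, how many are right
recall = tp / (tp + fn)     # of actual positives, how many were found
auc_value = auc(y_true, y_score)
print(tp, fp, tn, fn, precision, recall, auc_value)
```

Precision and recall depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds.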

This can get more advanced depending on the complexity of the algorithm, but you can learn it as you do the project.
