In a supervised machine learning algorithm, the desired output is known. Both regression and classification are part of supervised machine learning. The only difference is that classification is for discrete variables, while regression predicts outputs for continuous variables. (By contrast, clustering is unsupervised: the main difference being that clusters don't have labels.)
She briefly described the different types of regression: linear regression, logistic regression, ridge regression, lasso regression, polynomial regression and Bayesian linear regression. The speaker focused on linear regression and logistic regression, which are basic and understood by the majority of clients. Logistic regression gives a probability, while linear regression gives a value. Both are well-proven algorithms, not black-box algorithms. She next spoke about the forms of linear regression: simple linear regression and multiple linear regression. Logistic regression is actually a linear classifier, since the probability it outputs is driven by a linear function of the inputs.
As we increase the complexity of the model, the computation required increases and control is lost over the parameters that determine how the model behaves.
Linear regression can be used to predict revenue from advertising spend using advertising data; logistic regression can be used for spam detection.
The speaker then focused on two important metrics: sensitivity and specificity. Sensitivity tells us whether one is able to capture all the true positives; specificity tells us whether one is able to capture all the true negatives. It is a risky call if a person has cancer and the model shows a false negative, predicting he won't get cancer even though he is highly likely to get it. In such cases, to increase the sensitivity, all the type 2 errors (false negatives) need to be reduced. The metrics are: accuracy, precision, recall, specificity and F1.
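A minimal R sketch of how these metrics fall out of the four confusion-matrix counts; the counts themselves are made up for illustration:

# Hypothetical confusion-matrix counts (made-up numbers)
TP <- 40; TN <- 45; FP <- 5; FN <- 10

sensitivity <- TP / (TP + FN)                   # recall: share of actual positives captured
specificity <- TN / (TN + FP)                   # share of actual negatives captured
precision   <- TP / (TP + FP)                   # share of predicted positives that are correct
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # overall share of correct predictions
f1          <- 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean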
The speaker ended the discussion with business cases for regression and classification. One of the more interesting case studies was how ensemble regression is used to predict sales for one of the largest technology clients in the USA, factoring in marketing spends on print and digital ad campaigns.
The session ended with intrigued students raising their doubts, which the speaker graciously answered.
Decision trees are an example of supervised machine learning.
Binary classification gives an output of zero or one.
True positive: the prediction is one and the answer was indeed one. The model predicted an admit and the person did get an admit.
True negative: the prediction is zero and the answer was indeed zero. The person did not get an admit and the model also said no admit.
False positive: we predicted yes, but the answer is actually no; also known as a type 1 error. They were not supposed to get an admit but the model predicted that they would get one.
False negative: we predicted no, but the answer is actually yes; also known as a type 2 error. The model predicted that they should not get an admit but they actually got one.
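A small R sketch showing how these four cells fall out of table(); the admit labels below are made up for illustration:

actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)   # made-up true admit outcomes
predicted <- c(1, 0, 0, 1, 1, 0, 1, 0)   # made-up model predictions
table(Predicted = predicted, Actual = actual)
# The diagonal cells are the true negatives and true positives;
# the off-diagonal cells are the false positives (type 1 errors)
# and false negatives (type 2 errors).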
Multiple linear regression on heart data
Testing the four assumptions:
Linearity: The relationship between X and the mean of Y is linear
Using plots or correlation functions to check whether the relationship between heart disease and biking/smoking is linear (see the sketch after this list).
Independence: observations are independent of each other. If the correlation between the two activities is small, we can include both parameters in the model.
Normality: For any fixed value of X, Y is normally distributed.
Example: the fish aquarium. The moment you take a biased sample, you bias the model toward a certain outcome; when you then test it on other data, you won't get that outcome, because the test data won't have the same skewness.
Homoscedasticity (homogeneity of variance): the variance of the residuals is the same for any value of X; the residuals should not show any trend.
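A minimal sketch of these checks in R, assuming a hypothetical heart.data data frame with heart.disease, biking and smoking columns:

# Independence / multicollinearity: correlation between the two predictors
cor(heart.data$biking, heart.data$smoking)

# Fit the model, then inspect the built-in diagnostic plots
model <- lm(heart.disease ~ biking + smoking, data = heart.data)
par(mfrow = c(2, 2))   # show all four plots at once
plot(model)            # residuals vs fitted (trend = heteroscedasticity), Q-Q plot (normality), etc.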
If the estimate is negative, that indicates a negative relationship: people who bike more have less heart disease.
The lower the standard deviation, the more closely the sample data represents the true mean of the population.
The residual standard error is used to assess homoscedasticity.
R-squared is a very important metric which tells how much of the variation is explained by the model. Why are there two values of R-squared? Multiple R-squared and adjusted R-squared: the adjusted value accounts for the number of variables in the model.
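Continuing the same hypothetical heart.data model from the sketch above, both values can be read off summary():

s <- summary(model)
s$r.squared       # Multiple R-squared: share of variation explained
s$adj.r.squared   # Adjusted R-squared: penalised for the number of predictors
coef(s)           # estimates, standard errors, t-values and p-values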
There are two main types of linear regression
- Simple linear regression uses only one independent variable. Example: what is the happiness quotient of an employee with a salary of $XX?
- Multiple linear regression uses two or more independent variables. Example: what is the likelihood that a person might develop heart disease given how much they smoke or bike?
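In R's formula syntax the two types differ only in the number of predictors; the data frames and column names below are assumptions for illustration:

lm(happiness ~ salary, data = survey)                     # simple: one independent variable
lm(heart.disease ~ biking + smoking, data = heart.data)   # multiple: two or more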
Linear classifiers: logistic regression
Observations should be independent of each other
No or little multicollinearity among the independent variables
Assumes linearity of the independent variables and the log odds
Requires a large sample size because the model is trained iteratively on the data.
In the regression equation Y = a + bX:
Y represents the dependent variable
X represents the independent variables
a and b are the coefficients, which are numeric constants.
Linear classifiers in R
While there are many functions to perform logistic regression, the most frequently used is the glm() function. The basic syntax of the glm() function for logistic regression is:
glm(formula, data, family)
Following is a description of the parameters used:
Formula is the symbol presenting the relationship between the variables
Data is the reference name of the data set
Family is an R object used to specify the details of the model; its value is binomial for logistic regression.
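A minimal sketch of a logistic fit, assuming a hypothetical admissions data frame with a 0/1 admit column and a gpa column:

fit <- glm(admit ~ gpa, data = admissions, family = binomial)
summary(fit)                               # coefficients, z-values, deviances
prob <- predict(fit, type = "response")    # predicted probability of admit = 1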
Every problem starts with data wrangling, where the programmer inspects the data. Then the data is divided into training and test sets. Feature selection is performed on the training set.
Business cases for classification and regression
Linear regression to predict the impact of reduced spend on click share and impression share for a high-spending technology giant.
Ridge regression to quantify the magnitude of travel sentiment post the COVID outbreak for the EMEA region.
Logistic regression to classify re-marketing lists and audiences for niche travel partners.
Lasso and OLS modelling to quantify price elasticity for one of Australia's largest supermarket chains.
Tree-based ensemble techniques (AdaBoost and XGBoost) used to predict severe failures for a Canadian multinational mass media and information firm.
A linear classifier (logistic regression) used to predict customers with a high propensity to call the service centre after visiting the self-help website, thereby reducing the call rate by 7 ppt through redirections.
Multilevel classification of B2B customers to improve revenue through prioritization and cross-/upselling (retention, acquisition) for a Fortune 50 client.
Ensemble regression to predict sales for one of the largest technology clients in the USA, factoring in marketing spends on print and digital ad campaigns. It is not a linear problem when multiple marketing and digital spends are involved.
In the end there were questions and answers.
Z value: how far the estimate is from zero, measured in standard errors. P value determines whether the effect is by chance or not. R runs a couple of iterations called Fisher scoring iterations; these built-in iterations are how R arrives at the best-fit model.
The bigger the difference between the null deviance and the residual deviance, the better the model: it means the model has explained away most of the deviance, which is what we want.
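Using the hypothetical fit from the glm() sketch earlier, the drop in deviance can be read off the model object and tested with a chi-squared statistic:

drop <- fit$null.deviance - fit$deviance    # bigger drop = better fit
pchisq(drop, df = fit$df.null - fit$df.residual,
       lower.tail = FALSE)                  # p-value for the improvement over the null model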
Standard error: how accurate the estimate is with respect to the population.
Standard deviation: how far values lie from the mean; it describes the spread around the mean.
T-value: the coefficient estimate divided by its standard error, i.e. the estimate converted into units of standard error.
Continuous output calls for regression, e.g. what will the average temperature be tomorrow?
The method for regression and classification remains the same.
How does regression work? (A minimal end-to-end sketch follows the list below.)
- Training database
- Feature selection: only a few variables need to be selected
- Modelling
- Testing data
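A minimal end-to-end sketch of that workflow in R; the data frame dat and the columns y, x1 and x2 are assumptions for illustration:

set.seed(42)                                          # reproducible split
idx   <- sample(nrow(dat), floor(0.7 * nrow(dat)))    # 70/30 train/test split
train <- dat[idx, ]
test  <- dat[-idx, ]

model <- lm(y ~ x1 + x2, data = train)    # feature selection: keep only a few variables
pred  <- predict(model, newdata = test)   # evaluate on the held-out testing data
rmse  <- sqrt(mean((test$y - pred)^2))    # test error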