Five Algorithms for Data Science

Linear Regression

Description: Establishes a relationship between independent variables (inputs that don't change in response to the others) and a dependent variable (the output that changes in response to the inputs) by fitting a regression line to the data (see image).

A chart depicting linear regression

When to use it: Forecasting trends and effects, and analyzing how strongly the independent variables affect the dependent variable; in other words, how much the dependent variable changes when one or more independent variables are modified.

Examples of Questions: How much will sales decrease with a $5 price increase? How much additional income results from each X amount spent on marketing?

Python: scikit-learn

R: the lm function
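A minimal sketch of the scikit-learn workflow described above, fitting a regression line to invented marketing data (the numbers and the spend/sales framing are illustrative assumptions, not real figures):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: marketing spend (independent) vs. sales (dependent),
# both in thousands of dollars; values are made up for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression()
model.fit(X, y)

# The slope estimates how much sales change per extra unit of spend.
print(model.coef_[0])
# Forecast sales for a spend the model hasn't seen.
print(model.predict(np.array([[6.0]]))[0])
```

The fitted slope directly answers the "how much does the dependent variable change" question from the section above.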

Logistic Regression

Description: Explains the relationship between one binary dependent variable and one or more independent variables, for trend analysis or prediction.

A chart depicting logistic regression

When to use it: Predictive analysis in situations where the dependent variable is binary (e.g. yes or no, on or off).

Examples of Questions: Are sales influenced by customer satisfaction and loyalty? Are the fish dying because of the acidity of the water? Does protein intake have an influence on muscle growth?

Python: scikit-learn

R: the glm function
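A short sketch of the binary-outcome setup described above, using scikit-learn on synthetic data (the score/outcome framing is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a single score vs. a binary outcome (0 = no, 1 = yes).
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Class predictions for new scores, plus the probability of class 1.
print(clf.predict([[2], [7]]))
print(clf.predict_proba([[7]])[0, 1])
```

Unlike linear regression, the model outputs a probability between 0 and 1, which is then thresholded into the binary class.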

Decision Tree

Description: Classifies both categorical and continuous dependent variables. The population or sample is split into two or more homogeneous sets based on the independent variables. The structure can be thought of as a chain of "if…else if…else" statements: if a record doesn't match one classification, it falls through to another.

a decision tree with points as the leaves

When to use it: When the outcome is uncertain and there are various options. In cases where the goal is optimization.

Examples of Questions: Which variables should we target to achieve maximum profit? Which features will result in optimal sales for a new product release?

Python: scikit-learn in combination with numpy and pandas for the necessary data manipulation.

R: ctree function in party package

Random Forest

Description: A group of decision trees, where each tree votes for a class prediction and the class with the most votes becomes the model's prediction. Its strength lies in the fact that the individual trees are largely uncorrelated, which yields higher accuracy than any single tree.

An illustration of a random forest showing how the majority voting leads to the final class

When to use it: You require higher accuracy and have enough training data for random sampling and subsetting of features.

Examples of Questions: Which group of patients will respond best to the medical treatment? Out of three investment options, which is the better choice?

Python: scikit-learn in combination with numpy and pandas for the necessary data manipulation.

R: randomForest package

Support Vector Machine (SVM)

Description: Finds a hyperplane (depicted as a line) with the greatest margin (distance) from the points that separates the data into distinct classifications. The data points are plotted in an n-dimensional space where n is the number of features. Classification is enabled since the value of each feature is tied to a particular coordinate.

An image of SVM showing the margin, the hyperplane and support vectors

When to use it: It can be used for both regression and classification, but it works best for the latter.

Examples of Questions: Is this email spam or not? Which topic does this document belong to? Nowadays SVMs are best known for text classification.

Python: scikit-learn

R: svm function from the package e1071

Miguel Morales
