Vocab List for Analytics

Sometimes it’s hard to keep up with or remember all the terms in analytics, so I made this list that will hopefully help you out. I know…I know, it doesn’t include ALL of them, but the good news is that over time I will be adding more words. It’s also important to mention that the goal of this list is to provide cues: a starting point either to remember what a certain word means, or to get a simpler introduction that eases your understanding before jumping into a larger post.

Additive model: A regression model whose main component is the weighted sum of the independent variables (predictors), which means the effect of one independent variable does not depend on another.

ARIMA: Autoregressive Integrated Moving Average. A forecasting technique that projects future values by combining autoregression (regressing on past values of the series), differencing to remove trend (the “integrated” part), and a moving average of past forecast errors, with weights applied to the past terms.
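
For a feel of how this looks in practice, here is a minimal sketch using Python’s statsmodels library; the sales series and the (1, 1, 1) order are invented for illustration, not a recommendation:

```python
# Minimal ARIMA sketch with statsmodels (assumed installed).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Invented monthly sales figures, just to have something to fit.
sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

# order=(p, d, q): 1 autoregressive term, 1 differencing, 1 moving-average term.
model = ARIMA(sales, order=(1, 1, 1))
fit = model.fit()
print(fit.forecast(steps=3))  # project three periods ahead
```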

AWS: Amazon Web Services. A secure cloud services platform offering compute power, database storage, and content delivery.

Bias: How far, on average, the predicted values are from the actual values.

Classification: Models that predict the class of given data points based on the input values given during training.

Clustering: Models where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.

Continuous: Data that can take any value in a range, such as height, area, or time.

CRISP-DM: Cross-Industry Standard Process for Data Mining. Methodology that consists of Business Understanding->Data Understanding->Data Preparation->Modeling->Evaluation->Deployment.

Chi-squared Test of Independence: Test used to determine whether two categorical variables are independent of each other; the variables are considered dependent (related) if the p-value is less than the significance level (usually 0.05).
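
A quick sketch with SciPy, using an invented contingency table (rows and columns are hypothetical categories):

```python
# Chi-squared test of independence on a made-up 2x2 contingency table.
from scipy.stats import chi2_contingency

# Rows: group A/B; columns: preference yes/no. Counts are invented.
table = [[30, 10],
         [20, 40]]

chi2, p, dof, expected = chi2_contingency(table)
print(p)          # p-value of the test
print(p < 0.05)   # True -> reject independence: the variables are related
```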

Decision trees: Given data with attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data. It can handle both numerical and categorical data.

Dimension Reduction: Reducing the number of variables under consideration. When the raw data has very high-dimensional features and some of them are redundant or irrelevant to the task, reducing them simplifies the problem and helps surface the underlying relationships.

Discrete: Numbers that can only take certain values, like the faces of a die or the number of students in a class.

Dummy variable: An explanatory binary variable that equals 1 if a certain categorical effect is present and 0 if it is absent.
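
A common way to create dummy variables in Python is pandas’ get_dummies; the “city” column here is invented for illustration:

```python
# Turn one categorical column into several indicator (dummy) columns.
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"]})
print(pd.get_dummies(df, columns=["city"]))
# Each city becomes its own indicator column: city_LA, city_NY, city_SF,
# with a truthy value where that city is present and falsy elsewhere.
```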

ETL: Extract, Transform, Load (a small pandas sketch follows the three steps below).

-Extract: Obtain the data from the source; the aim is to convert it into a single format appropriate for transformation processing.

-Transform: Selecting only certain columns to load, encoding free-form values, joining data from multiple sources, pivoting (turning multiple columns into multiple rows), and disaggregating.

-Load: Load the cleansed data into a delimited flat file or a data warehouse.
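
Here is a toy ETL pass with pandas; the file names and column names are hypothetical, chosen only to show the three steps:

```python
# Toy ETL: extract a CSV, transform it, load it to a delimited flat file.
import pandas as pd

raw = pd.read_csv("orders_raw.csv")              # Extract from the source
clean = raw[["order_id", "amount"]].dropna()     # Transform: keep columns, drop nulls
clean["amount"] = clean["amount"].astype(float)  # Transform: normalize types
clean.to_csv("orders_clean.csv", index=False)    # Load to a flat file
```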

Elbow Method: Running a K-Means model for a range of cluster counts and plotting the squared sum of the distances to the center of each cluster against the number of clusters; the inflection point (the “elbow”) suggests the number of clusters that provides a better fit.
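
A sketch of the elbow method with scikit-learn, fitting K-Means for several values of k and plotting the within-cluster sum of squares (exposed as inertia_); the blob data is randomly generated just for illustration:

```python
# Elbow method: inertia (sum of squared distances to centers) vs. k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared distances to cluster centers")
plt.show()  # the bend (elbow) in the curve suggests a good k
```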

Extrapolating: Predicting values that fall outside the range of the observed data.

F-1 Score: The harmonic mean of precision and recall (a weighted average of the two), where an F1 score reaches its best value at 1. See the worked example under Recall below.

Hadoop: An open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters. It has four modules:

-Distributed File System: Allows data to be stored in an easily accessible format across various storage devices. It sits above the file system of the host computer.

-MapReduce: Reads data from the database, putting it into a format suitable for analysis (map), and performs mathematical operations on the data (reduce); a toy Python illustration follows this list.

-Yarn: Manages the resources of the systems storing the data and running the analysis.

-Common: Provides the tools (in Java) needed for the user’s computer system to read data stored under Hadoop’s file system.
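
This is not Hadoop itself, just a toy plain-Python illustration of the map and reduce phases, counting words across made-up “documents”:

```python
# Conceptual MapReduce: word count over a list of documents.
from collections import defaultdict

docs = ["big data big clusters", "data storage"]

# Map phase: emit (word, 1) pairs for every word in every document.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Reduce phase: sum the counts for each key (word).
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'clusters': 1, 'storage': 1}
```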

Hive: Declarative SQL-like language for creating reports; it operates on the server side of the cluster. Data stored in HBase can be accessed through Hive, and one can write data queries that are transformed into MapReduce jobs.

Interpolating: Predicting values that fall inside the interval given for analysis.

K-Means: Model that looks for k clusters in a dataset by grouping similar observations around cluster centers.

KS: Kolmogorov–Smirnov statistic. A measure of the degree of separation between the positive and negative distributions, used to measure the performance of classification models.

Least Squares Method: Finds the best fit for a set of data points by minimizing the sum of the squares of the residuals.
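
A minimal worked example of the least squares line fit in Python, with made-up data points; the closed-form slope and intercept are checked against numpy’s polyfit:

```python
# Least squares fit of a line y = slope*x + intercept, by hand and via numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form solution that minimizes the sum of squared residuals.
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

print(slope, intercept)     # ~1.96, ~0.14
print(np.polyfit(x, y, 1))  # numpy's least squares fit agrees: [slope, intercept]
```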

Linear Regression: Model in which the value of the dependent variable/target depends linearly on the values of the independent variables/predictors. The independent variables explain the value of the dependent variable.

Logistic Regression: Used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
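
A small scikit-learn sketch predicting a binary pass/fail outcome from hours studied; the data is invented for illustration:

```python
# Logistic regression on a single predictor with a binary target.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [2.0], [3.0], [4.0], [5.0]])
passed = np.array([0, 0, 0, 1, 1, 1])  # 0 = fail, 1 = pass

clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[2.5]]))        # predicted class for 2.5 hours of study
print(clf.predict_proba([[2.5]]))  # probability of each class
```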

Multicollinearity: When there is a high correlation between two or more predictor variables; basically, one can be used to predict the other. One way to detect it is to calculate the correlation coefficient r between pairs of predictors (a value close to 1 or -1 signals multicollinearity, and one of the variables should be removed).
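
A quick check of pairwise correlations with pandas; the columns are invented, and the point is that two predictors carrying nearly the same information show |r| close to 1:

```python
# Detecting multicollinearity via the pairwise correlation matrix.
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 170, 180, 190],
    "height_in": [63.0, 66.9, 70.9, 74.8],  # nearly the same information
    "age":       [25, 41, 33, 52],
})

print(df.corr())  # r(height_cm, height_in) ~ 1.0 -> drop one of the two
```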

Neural Networks: Consist of thousands or millions of interconnected processing nodes in which data moves in only one direction: each node multiplies the data it receives by assigned weights, adds the results, and passes them on only if they exceed the established threshold value.

Nonlinear Models: Models in which there isn’t a linear relationship among the parameters. Examples of nonlinear equations are power, Weibull growth, and Fourier. R-squared is invalid for nonlinear models, and p-values for the predictors can’t be calculated with software.

Normal Distribution: A function that shows the possible values of a variable and how often they occur; in a normal distribution the mean, mode, and median are the same and the curve is symmetrical around them.

Pentaho: Business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities.

Precision: Percentage of positive predictions that were correct.

Pig: Multiquery approach, easy to learn if you are familiar with SQL; uses data types such as maps, tuples, and bags. Schema is optional.

PCA: Principal Component Analysis. Reduces dimensionality by transforming a large set of variables into a smaller one that still contains most of the information in the large data set.
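
A scikit-learn sketch projecting the 4-dimensional iris measurements onto 2 principal components; the dataset choice is just for illustration:

```python
# PCA: compress 4 features into 2 components that retain most variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)             # now 150 samples, 2 features

print(X2.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```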

Population: A collection of all the items of interest to our study. Numbers obtained when using a population are parameters.

Quantitative Analysis: Process of collecting and evaluating measurable and verifiable data, such as revenues, market share, and wages, in order to understand the behavior and performance of a business.

Random Forest: A group of decision trees, where each tree votes for a class prediction and the class with the most votes becomes the model’s prediction. The number of estimators (trees) typically used is 10-100.
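
A short scikit-learn sketch; n_estimators sets how many trees get a vote, and the iris dataset is used only as a convenient example:

```python
# Random forest: many decision trees voting on the predicted class.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy of the majority vote
```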

Recall: Percentage of actual positives that were identified as positive.
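
Since precision, recall, and the F-1 score are all built from the same confusion-matrix counts, here is a tiny worked example with invented counts:

```python
# Precision, recall, and F1 from invented confusion-matrix counts.
tp, fp, fn = 40, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # 40/50 = 0.80
recall = tp / (tp + fn)                             # 40/60 ~ 0.67
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.73

print(precision, recall, f1)
```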

Redshift: AWS data warehousing solution that can be used to run complex OLAP queries.

Regression: An equation that can be used to estimate the expected value of the dependent variable from the values of the independent variables.

Reinforcement Learning: The training of a ML model to make a sequence of decisions, to achieve a goal in an uncertain potentially complex environment.

Residuals: The difference between the observation and the fitted line.

Sample: A select number of items taken from a population. Numbers obtained from a sample are statistics.

Scala: Programming language that interoperates seamlessly with Java and JS; it serves as the implementation language for frameworks like Spark. It allows you to combine functional programming with objects and classes, and makes heavy use of recursion, pattern matching, and higher-order functions.

Spark: Uses cluster computing for its analytics power as well as storage; it doesn’t have its own file system but works with MongoDB, Hadoop’s HDFS, and Amazon’s S3 system. It can analyze data in real time.

Supervised Learning: All data is labeled and the algorithms learn to predict the output from the input data. The algorithm iteratively makes predictions, which are corrected, until an acceptable level of performance is reached. Ex.: random forest, SVM, linear regression.

Time Series: A series of data points in time order.

TensorFlow: Open source library for ML where developers use dataflow graphs (structures that describe how the data moves through a graph, i.e., a series of processing nodes).

Unsupervised Learning: All data is unlabeled and the algorithms learn the inherent structure from the input data. Usually used for clustering and association, where inherent groupings and association rules that describe the data (such as “people who buy X also buy Y”) are discovered. Ex.: K-Means.

Variance: Looks at the average degree to which each data point differs from the mean (the average of all data points). Obtained by calculating the mean, subtracting it from each data point, squaring the results, and averaging them.
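
A step-by-step computation on made-up data, checked against numpy’s built-in:

```python
# Variance by hand: mean, deviations, squared deviations, average.
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()                      # 5.0
variance = ((data - mean) ** 2).mean()  # average squared deviation

print(variance, np.var(data))           # both 4.0
```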

Credits and valuable links to find more on these subjects (I may have forgotten to list some sources since this list was made a long time ago, so please let me know if any reference is missing):

https://statisticsbyjim.com/

https://www.scikit-yb.org/

https://www.analyticsvidhya.com/

https://www.listendata.com/

Miguel Morales
