Sometimes it’s hard to keep up with or remember all the terms in analytics, so I made this list in the hope that it helps you out. I know…I know, it doesn’t include ALL of them; the good news is that over time I will be adding more words. It’s also important to mention that the goal of this list is to provide cues or a starting point: a way to remember what a certain word means, or a simpler introduction to ease your understanding before jumping into a larger post.

**Additive model:** A regression model
whose main component is the weighted sum of the independent variables
(predictors), which means the effect of one independent variable does not
depend on another.

**ARIMA: A**uto**r**egressive **I**ntegrated **M**oving **A**verage. Forecasting technique that projects future values by combining autoregression on past values, differencing to make the series stationary, and a moving average of past forecast errors, with weights applied to each of the past terms.

**AWS:** A secure cloud services platform offering compute power, database storage, and content delivery.

**Bias:** How much, on average, the predicted values differ from the actual values.

**Classification**: Models that predict
the class of given data points based on the input values given during training.

**Clustering:** Models where you
want to discover the inherent groupings in the data, such as grouping customers
by purchasing behavior.

**Continuous:** Data that can take any value in a range, such as height, area, or time.

**CRISP-DM:** Cross-Industry Standard Process for Data Mining. Methodology that consists of Business Understanding->Data Understanding->Data Preparation->Modeling->Evaluation->Deployment.

**Chi-squared Test of Independence:** Test used to determine whether two categorical variables are independent of one another; the variables are considered dependent if the p-value is less than the significance level (usually .05).
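As a quick sketch, the chi-squared statistic can be computed by hand from a 2x2 contingency table (the counts below are made up for illustration; in practice `scipy.stats.chi2_contingency` also returns the p-value):

```python
# Hypothetical 2x2 contingency table: two groups vs. a yes/no outcome.
observed = [[30, 10],   # group A: yes / no
            [20, 40]]   # group B: yes / no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Chi-squared: sum over cells of (observed - expected)^2 / expected,
# where "expected" is what we'd see if the variables were independent.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected
```

A large statistic (here about 16.7) translates into a small p-value, suggesting the two variables are dependent.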

**Decision trees**: Given data attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data. It can handle both numerical and categorical data.

**Dimension Reduction:** Reducing the number of variables under consideration. When the raw data has very high-dimensional features and some features are redundant or irrelevant to the task, reducing them helps uncover the underlying relationships.

**Discrete**: Numbers that can only take certain values, like the faces of a die or the number of students in a class.

**Dummy variable**: An explanatory binary variable that equals 1 if a certain categorical effect is present and 0 if it is absent.
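A minimal sketch in plain Python (the region values are invented; `pandas.get_dummies` does this for whole columns at once):

```python
# Hypothetical categorical data.
regions = ["north", "south", "north", "east"]

# Dummy variable: 1 if the "north" effect is present, 0 if it is absent.
is_north = [1 if r == "north" else 0 for r in regions]
# is_north -> [1, 0, 1, 0]
```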

**ETL:** Extract, Transform, Load

**-Extract:** Obtain the data from the source; the aim is to convert it into a single format appropriate for transformation processing.

**-Transform:** Selecting only certain columns to load, encoding free-form values, joining data from multiple sources, pivoting (turning multiple columns into multiple rows), and disaggregating.

**-Load:** Load the cleansed data into a delimited flat file or data warehouse.

**Elbow Method**: Running a K-Means model for a range of cluster counts to see which number of clusters provides a better fit, based on the inflection point in the plot of the squared sum of distances to each cluster center against the number of clusters.
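To make the idea concrete, here is a toy sketch on 1-D data using a tiny hand-rolled K-Means (illustrative only; scikit-learn's `KMeans` exposes the same squared sum via its `inertia_` attribute):

```python
def kmeans_inertia(points, k, iters=20):
    """Run a minimal Lloyd's K-Means and return the squared sum of distances."""
    pts = sorted(points)
    # Spread the initial centers across the sorted data.
    centers = [pts[i * (len(pts) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in pts:
            idx = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[idx].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    # Inertia: squared sum of distances from each point to its center.
    return sum(min((p - c) ** 2 for c in centers) for p in pts)

data = [1, 2, 3, 21, 22, 23, 41, 42, 43]   # three obvious groups
inertias = {k: kmeans_inertia(data, k) for k in (1, 2, 3, 4)}
# The drop in inertia flattens sharply after k=3: that is the elbow.
```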

**Extrapolating:** Predicting values that fall outside the interval of the observed data.

**F-1 Score:** The harmonic mean of precision and recall; an F1 score reaches its best value at 1.
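For instance, with made-up confusion-matrix counts, precision, recall, and F1 work out like this (`sklearn.metrics.f1_score` computes the same thing):

```python
# Hypothetical counts from a classifier's confusion matrix.
tp, fp, fn = 8, 2, 4        # true positives, false positives, false negatives

precision = tp / (tp + fp)  # share of positive predictions that were correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
# precision = 0.8, recall ~ 0.667, f1 ~ 0.727
```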

**Hadoop:** An open-source distributed processing framework that manages data processing and storage for big data applications in scalable clusters. It has four modules:

**-Distributed File System:** Allows data to be stored in an easily accessible format across various storage devices. It sits above the file system of the host computer.

**-MapReduce:** Reads data from the database, putting it into a format suitable for analysis (map) and performing mathematical operations on the data (reduce).

**-Yarn:** Manages the resources of the system storing the data and running the analysis.

**-Common:** Provides the tools (in Java) needed for the user’s computer system to read data stored under Hadoop’s file system.
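The map and reduce steps can be sketched conceptually in plain Python with a word count, the classic MapReduce example (a toy illustration of the idea, not Hadoop's actual API):

```python
from functools import reduce
from collections import defaultdict

lines = ["big data big clusters", "data moves in clusters"]

# Map: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: reduce(lambda a, b: a + b, vals)
          for word, vals in groups.items()}
# counts["big"] == 2, counts["data"] == 2
```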

**Hive:** Declarative SQL-like language for creating reports; operates on the server side of the cluster. Data stored in HBase can be accessed through Hive, and one can write data queries that are transformed into MapReduce jobs.

**Interpolating:** Predicting values that are inside the interval given for analysis.

**K-Means:** Model that looks for a number k of clusters in a dataset by grouping observations.

**KS:** Kolmogorov–Smirnov. A measure of the degree of separation between the positive and negative distributions, used to measure the performance of classification models.

**Least Squares Method:** Finds the best fit for a set of data points by minimizing the sum of squares of the residuals.
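A minimal sketch with made-up points, fitting y = a + b*x with the closed-form least squares solution:

```python
# Hypothetical data points roughly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope that minimizes the sum of squared residuals, then the intercept.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

# Residuals: observed value minus the fitted value.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
```

With an intercept in the model, the residuals always sum to (numerically) zero, which is a quick sanity check on the fit.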

**Linear Regression:** Model in which the value of the dependent variable (target) depends linearly on the values of the independent variables (predictors); the independent variables explain the value of the dependent variable.

**Logistic Regression:** Used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
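As a toy sketch, a one-variable logistic regression can be fit by gradient descent on made-up data (real work would use a library such as scikit-learn or statsmodels):

```python
import math

def sigmoid(z):
    # Maps any real-valued score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: the binary outcome y tends to be 1 for larger x.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0      # coefficient and intercept
lr = 0.1             # learning rate
for _ in range(2000):
    # Gradient of the negative log-likelihood with respect to w and b.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
    grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b

# The fitted model gives a high probability of y=1 for large x.
p_high = sigmoid(w * 2.0 + b)
p_low = sigmoid(w * -2.0 + b)
```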

**Multicollinearity**: When there is a high correlation between two or more predictor variables; essentially, one can be used to predict the other. One way to detect it is to calculate the correlation coefficient r (a value close to 1 or -1 means there is multicollinearity and one of the variables should be removed).

**Neural Networks:** Consist of thousands or millions of interconnected processing nodes in which data moves in only one direction: each node multiplies the data it receives by an assigned weight, adds the results, and passes them on only if they exceed the established threshold value.

**Nonlinear Models:** Models in which there isn’t a linear relationship among the parameters. Examples of nonlinear equations are power, Weibull growth, and Fourier models. R-squared is invalid for nonlinear models, and p-values for the parameters generally can’t be calculated by software.

**Normal Distribution:** A function that shows the possible values of a variable and how often they occur; in a normal distribution the mean, mode, and median are the same and the curve is symmetrical.
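Python’s standard library can illustrate both properties directly via `statistics.NormalDist`:

```python
from statistics import NormalDist

# The standard normal distribution: mean 0, standard deviation 1.
dist = NormalDist(mu=0, sigma=1)

# Mean, median, and mode coincide...
same_center = dist.mean == dist.median == dist.mode          # True

# ...and the curve is symmetric: equal density at equal distances
# on either side of the mean.
symmetric = abs(dist.pdf(-1) - dist.pdf(1)) < 1e-12          # True
```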

**Pentaho:** Business
intelligence (BI) software that provides data integration, OLAP services,
reporting, information dashboards, data mining and extract, transform, load
(ETL) capabilities.

**Precision:** Percentage of positive predictions that were correct.

**Pig:** Multiquery approach; easy to learn if you are familiar with SQL. Uses data types such as maps, tuples, and bags, and the schema is optional.

**PCA:** Principal Component Analysis. Reduces dimensionality by transforming a large set of variables into a smaller one that contains most of the information in the large data set.
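As a rough sketch, for 2-D data the principal components come from the eigendecomposition of the covariance matrix, which has a closed form for the 2x2 case (the data here is simulated; scikit-learn’s `PCA` handles the general case):

```python
import math
import random

random.seed(0)
# Simulated 2-D data stretched along x: most variance lies in one direction.
data = [(random.gauss(0, 3), random.gauss(0, 0.5)) for _ in range(500)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Entries of the 2x2 covariance matrix (population form).
sxx = sum((x - mx) ** 2 for x, _ in data) / n
syy = sum((y - my) ** 2 for _, y in data) / n
sxy = sum((x - mx) * (y - my) for x, y in data) / n

# Eigenvalues of [[sxx, sxy], [sxy, syy]]: the variances along the two
# principal components.
tr = sxx + syy
det = sxx * syy - sxy ** 2
disc = math.sqrt(tr ** 2 - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2    # lam1 >= lam2

# Share of total variance kept if we reduce to the first component alone.
explained = lam1 / (lam1 + lam2)
```

Because the simulated spread along x is much larger than along y, the first component alone keeps well over 90% of the variance.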

**Population**: A collection of all the items of interest to our study. Numbers obtained from a population are called parameters.

**Quantitative Analysis:** Process of collecting and evaluating measurable and verifiable data, such as revenues, market share, and wages, in order to understand the behavior and performance of a business.

**Random Forest:** A group of decision trees, where each tree votes for a class prediction and the class with the most votes becomes the model’s prediction. The number of estimators typically used is 10-100.

**Recall:** Percentage of actual positives that were correctly identified as positive.

**Redshift:** AWS data warehousing
solution that can be used to run complex OLAP queries.

**Regression:** An equation that can be used to estimate the expected value of the dependent variable from the independent variables.

**Reinforcement Learning:** The training of an ML model to make a sequence of decisions in order to achieve a goal in an uncertain, potentially complex environment.

**Residuals:** The difference between the observations and the fitted line.

**Sample:** A select number of items taken from a population. Numbers obtained from a sample are statistics.

**Scala:** Programming language that interoperates seamlessly with Java and JS and serves as the implementation language for frameworks like Spark. It allows combining functional programming with objects and classes, and uses recursion, pattern matching, and higher-order functions.

**Spark:** Uses cluster computing for its analytics power; it doesn’t have its own file system but works with MongoDB, Hadoop’s HDFS, and Amazon’s S3 system. It can analyze data in real time.

**Supervised Learning:** All data is labeled and the algorithms learn to predict the output from the input data. The algorithm makes iterative predictions, which are corrected until an acceptable level of performance is reached. Ex. random forest, SVM, linear regression.

**Time Series:** A series of data points in time order.

**TensorFlow:** Open-source library for ML where developers use dataflow graphs (structures that describe how data moves through a graph, or a series of processing nodes).

**Unsupervised Learning**: All data is unlabeled and the algorithms learn the inherent structure from the input data. Usually used for clustering and association, where inherent groupings and association rules that describe the data (such as people who buy X also buy Y) are discovered. Ex. K-Means

**Variance:** Looks at the average degree to which each data point differs from the mean (the average of all data points). Obtained by calculating the mean, subtracting it from each data point, squaring the results, and averaging them.
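A quick worked example on a small made-up data set, checked against Python’s `statistics.pvariance`:

```python
from statistics import mean, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(data)                                  # mean = 5
squared_diffs = [(x - mu) ** 2 for x in data]    # subtract and square
var = sum(squared_diffs) / len(data)             # average the results -> 4
# var == pvariance(data)
```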

Credits and valuable links to find out more about these subjects (I may have forgotten to list some sources, since this list was made a long time ago, but please let me know if any reference is missing):