Statistical model

A statistical model is a mathematical model that helps us understand how data is created and behaves. It is built on statistical assumptions about how sample data comes from a larger population.

These models are simpler versions of real processes that create data. They make it easier to study and predict patterns.

When a model is described specifically in terms of probabilities, it is also called a probabilistic model. Every statistical hypothesis test and every statistical estimator (a rule for estimating a value from data) is derived from a statistical model. These models are the foundation of statistical inference, which is how scientists draw conclusions from data.

A statistical model usually describes a mathematical relationship between one or more random variables, which are values that vary by chance, and other variables that are not random. It is a formal way to express ideas and theories about how things work. In the words of Herman Adèr, quoting Kenneth Bollen, a statistical model is “a formal representation of a theory”. Statistical models are important tools in many fields, helping us understand complex information.

Introduction

A statistical model is a set of assumptions that helps us understand how data might be created. For example, imagine rolling two six-sided dice. One assumption is that, for each die, every number from 1 to 6 has an equal chance of appearing: 1 out of 6. With this assumption, and treating the two dice as not affecting each other, we can calculate the chance of both dice showing 5, which is 1 out of 6 times 1 out of 6, or 1 out of 36.
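The dice calculation can be checked with a short Python sketch. It uses the equal-chance idea from the text and adds one assumption that the text leaves unstated: the two dice do not influence each other. The variable names are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption taken from the text: every face of each die has probability 1/6.
faces = range(1, 7)
p_face = {face: Fraction(1, 6) for face in faces}

# Extra assumption (not stated in the text): the dice do not influence each
# other, so the probability of a pair is the product of the face probabilities.
p_pair = {(a, b): p_face[a] * p_face[b] for a, b in product(faces, repeat=2)}

p_both_five = p_pair[(5, 5)]

# Because every pair has a probability, other events can be worked out too,
# for example "the two dice add up to 10".
p_sum_is_ten = sum(p for (a, b), p in p_pair.items() if a + b == 10)

print(p_both_five)   # 1/36
print(p_sum_is_ten)  # 1/12
```

Because the sketch assigns a probability to all 36 pairs, it can answer questions about any event, not only the double 5.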

Another idea might say that the number 5 has a different chance, like 1 out of 8, because the dice are weighted. But this idea alone isn’t enough to predict all possible outcomes, since we don’t know the chances for the other numbers. A good statistical model must let us calculate the probability of any event, even if it’s sometimes very hard to do.

Formal definition

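A statistical model is a way to explain, using math, how data might be created. It has two main parts: the set of possible results we might see (called the sample space), and a set of probability distributions on those results. Each distribution in the set is one candidate description of how likely each result is.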
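The two parts can be written down for a single die as a rough sketch. The sample space is the six faces. The set of distributions reuses the weighted-die idea from the introduction: each candidate is labeled by the chance of rolling a 5. The rule that the other five faces share the remaining chance equally is an assumption made only for this illustration, and so is the function name `die_distribution`.

```python
from fractions import Fraction

# Part one: the sample space, here the results of a single die roll.
sample_space = [1, 2, 3, 4, 5, 6]

def die_distribution(p_five):
    """Return one candidate distribution, labeled by the probability of a 5.

    Hypothetical rule for illustration: the other five faces split the
    remaining probability equally.
    """
    p_other = (1 - p_five) / 5
    return {face: (p_five if face == 5 else p_other) for face in sample_space}

# Part two: a collection of candidate distributions on the sample space.
# Only three candidates are listed here to keep the example small.
candidate_parameters = [Fraction(1, 6), Fraction(1, 8), Fraction(1, 4)]
model = {p: die_distribution(p) for p in candidate_parameters}

for p_five, distribution in model.items():
    # Show the label and the probability that candidate gives to face 1.
    print(p_five, distribution[1])
```

The candidate labeled 1/6 is the fair die. The others describe weighted dice.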

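Sometimes, models get more detailed. In Bayesian statistics, the model also includes a probability distribution over its parameters, which are the settings that pick out one candidate description of the data. Statistical models can also be used to check whether a statistical procedure is reliable.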
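The Bayesian idea can be sketched with a weighted die. The three candidate values, the equal starting chances (a uniform prior), and the six observations below are all made up for illustration.

```python
from fractions import Fraction

# Candidate values for the probability that the die shows a 5 (hypothetical).
candidates = [Fraction(1, 8), Fraction(1, 6), Fraction(1, 4)]

# Bayesian extension: a prior probability for each candidate parameter value.
# A uniform prior is assumed here purely for illustration.
prior = {p: Fraction(1, len(candidates)) for p in candidates}

# Made-up observations: 1 means "rolled a 5", 0 means "rolled something else".
observations = [1, 0, 0, 1, 0, 0]

def likelihood(p_five, data):
    """Probability of the observed sequence if a 5 has probability p_five."""
    result = Fraction(1)
    for rolled_five in data:
        result *= p_five if rolled_five else 1 - p_five
    return result

# Bayes' rule: the posterior is proportional to prior times likelihood.
unnormalized = {p: prior[p] * likelihood(p, observations) for p in candidates}
total = sum(unnormalized.values())
posterior = {p: weight / total for p, weight in unnormalized.items()}

for p, weight in posterior.items():
    print(p, round(float(weight), 3))
```

With two 5s in six rolls, the candidate 1/4 ends up with the largest updated chance, because it is the closest of the three to the observed rate.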

An example

Imagine we want to see how the age of children in a group relates to their height. Suppose the children’s ages are spread evenly over some range. Height is related to age, but not exactly: older children tend to be taller, yet age alone does not fix a child’s height. We can use a linear regression model to describe this relationship.

In this model, we might write an equation like height = b₀ + b₁ × age + error. Here, b₀ is a starting point, b₁ tells us how much height changes for each year of age, and the error part explains why some children of the same age might be a little taller or shorter than expected. This helps us make better guesses about children’s heights based on their ages.
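The equation can be tried out on made-up data. The sketch below invents values for b₀, b₁, and the size of the error (they are not real growth data), generates heights from the equation, and then recovers b₀ and b₁ by ordinary least squares. The helper `fit_line` is written for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" values, chosen only for illustration.
TRUE_B0 = 75.0   # starting point, in cm
TRUE_B1 = 6.0    # cm of height gained per year of age
NOISE_SD = 4.0   # spread of the error term, in cm

# Ages spread evenly between 2 and 12 years; error drawn from a Gaussian.
ages = [random.uniform(2, 12) for _ in range(500)]
heights = [TRUE_B0 + TRUE_B1 * age + random.gauss(0, NOISE_SD) for age in ages]

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0, b1 = fit_line(ages, heights)
print(f"estimated b0 = {b0:.1f}, estimated b1 = {b1:.2f}")
print(f"predicted height at age 7: {b0 + b1 * 7:.1f} cm")
```

The estimates land close to the invented values but not exactly on them, which is the error term at work.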

General remarks

A statistical model is a special kind of mathematical model. Unlike other mathematical models, a statistical model includes some uncertainty. This means that some parts of the model are not fixed numbers but follow probability distributions, which makes them stochastic. For example, in the model of children's heights, the error term (often written ε) represents this uncertainty.

Statistical models are used for three main reasons: to make predictions, to find useful information from data, and to describe random patterns. These purposes help scientists understand and work with data better.

Dimension of a model

A statistical model has a dimension. This tells us how many parameters (numbers) we need to pick out one particular distribution from the model.

For example, if we think data comes from a bell-shaped curve (called a Gaussian distribution), we need two numbers. These are the center (mean) and how spread out it is (standard deviation). This means the dimension is 2.
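As a rough sketch, the two numbers can be estimated from data. The values 10 and 2 below are invented for illustration, and the samples are simulated rather than measured.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical parameter values, chosen only for illustration.
TRUE_MEAN = 10.0
TRUE_SD = 2.0

samples = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(2000)]

# One Gaussian distribution is pinned down by these two numbers.
parameters = {
    "mean": statistics.fmean(samples),
    "standard deviation": statistics.stdev(samples),
}

dimension = len(parameters)
print(parameters)
print("dimension:", dimension)  # dimension: 2
```

However many samples are drawn, the description of the bell curve still needs only these two numbers.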

Sometimes, a model might need more numbers. If we think data points follow a straight line with some scatter, we need three numbers. These are where the line starts (intercept), how steep it is (slope), and how much the points scatter around the line (variance). Even though a line looks one-dimensional, the model describing it has a dimension of 3 because of these extra details.

Nested models

Not to be confused with Multilevel models.

Two statistical models are called nested when the first model can be turned into the second by placing constraints on its parameters. For example, the set of all Gaussian distributions contains the set of Gaussian distributions with a mean of zero. We get the zero-mean set by constraining the mean in the full set to be zero.

Another example is a quadratic model, which contains a linear model: setting the coefficient of the squared term to zero leaves a straight line. In these cases, the first model usually has more parameters, and so a higher dimension, than the second.
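The quadratic case can be checked directly. In this sketch the function names and the parameter values are made up for illustration, and only the average response (without the error term) is compared.

```python
def linear_model(x, intercept, slope):
    """Mean response of the straight-line model."""
    return intercept + slope * x

def quadratic_model(x, intercept, slope, curvature):
    """Mean response of the quadratic model."""
    return intercept + slope * x + curvature * x ** 2

# Hypothetical parameter values, chosen only for illustration.
intercept, slope = 2.0, 0.5
xs = [0.0, 1.0, 2.5, 4.0]

# Constraining the extra parameter to zero recovers the linear model exactly.
restricted = [quadratic_model(x, intercept, slope, curvature=0.0) for x in xs]
linear = [linear_model(x, intercept, slope) for x in xs]

print(restricted == linear)  # True
```

The quadratic model has one more parameter than the linear one, which matches the remark that the larger model usually has the higher dimension.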