This is a topic I have discussed heavily in the past, and I constantly see it pop up as a question and a source of confusion in the community: what exactly is the bias-variance tradeoff, why is it called that, and how do you actually define bias and variance in objective terms? I was lucky enough to have a professor in my university days, Shai Ben-David, who co-authored an excellent, theoretically oriented textbook titled Understanding Machine Learning: From Theory to Algorithms. It contains the best definition of the bias-variance tradeoff I have seen to this day, for a topic that I think is heavily misunderstood, though I would love to hear from others who might disagree with me.
Common misconception: For example, I have seen people define the bias-variance tradeoff as referring to how volatile a model's predictions are with respect to small local changes in the input features. However, variance refers to the overfitting error in the bias-variance tradeoff, and this definition would imply that a constant-predictor model f(x) = k, which simply outputs a single learned constant, has zero variance and therefore cannot overfit. Even if it is given a training set with only a single example, it would supposedly not overfit at all? Also, we know that increasing the size of the training dataset should reduce variance, but how would more training samples change the volatility of the predictions with respect to the input feature space? This definition clearly does not make sense. It is common for highly overfit, high-variance models to also have highly volatile predictions under small changes in input features, but that is NOT a requirement, nor is it a good definition.
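To make this concrete, here is a minimal sketch (my own illustration, using a made-up toy distribution and squared loss, not something from the post or the textbook): it fits a constant predictor on training sets of size 1 versus size 1000 and estimates its generalisation error each time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    # Toy target distribution: y = 5x + 2 + Gaussian noise
    x = rng.uniform(-1, 1, n)
    return x, 5 * x + 2 + rng.normal(0, 1, n)

def generalisation_error(constant, n_test=50_000):
    # Squared-error risk of the constant predictor f(x) = constant
    _, y = sample_data(n_test)
    return np.mean((y - constant) ** 2)

for n_train in (1, 1000):
    # The constant model's least-squares fit is just the mean of the training labels
    errors = [generalisation_error(sample_data(n_train)[1].mean()) for _ in range(200)]
    print(f"n_train={n_train}: avg error {np.mean(errors):.2f}, spread across runs {np.std(errors):.2f}")
```

The constant model's output has zero volatility with respect to the input, yet on single-example training sets its generalisation error is both worse and far less stable across training sets than on large ones, which is exactly the overfitting/variance that the decomposition below makes precise.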
Another common misconception: Another definition of bias-variance (or overfitting-underfitting) that I have seen refers to models with a learning curve over the course of training (usually neural networks or ensemble models), where you can watch the training error and validation error change as training proceeds. Typically, the training error keeps improving, while the validation error improves for a while, reaches its 'peak validation performance' iteration, and then starts getting worse even as the training error keeps improving. Sometimes people will refer to the model iterations BEFORE the peak as underfitting or high bias, and the iterations AFTER the peak as overfitting or high variance. I understand why people make this association, but again it doesn't hold up. Why are the iterations before the peak considered underfitting while the iterations after are overfitting? Does this mean that the peak-validation iteration is neither overfitting nor underfitting? How do you even define bias and variance for models that don't have a learning curve? The definition falls apart if you think about it deeply, and it is another case of defining the concept by one of its symptoms.
So then, what is the bias-variance tradeoff?
The bias-variance tradeoff is a useful concept that stems from decomposing the generalisation error. Before we can understand that decomposition, we need to define a few important concepts.
First, we need to frame the training of a model as a mathematical process. What are we doing when we take a linear regression model and feed it a bunch of training data? One way to look at it is that we have a model hypothesis class H = { f(x) = kx + m | for all k, m }, which is the set of all possible simple univariate linear regression models we could learn. Training a model is then running a TrainingAlgorithm that takes the training dataset as input and outputs a linear model f(x) from the hypothesis class H. So H is the set of all possible models, and training is trying to find the model in H that fits our training data best and that we think will generalise best to future unseen data.
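Here is a minimal sketch of that framing (the function name and the least-squares choice of "best fit" are my own illustration):

```python
import numpy as np

def training_algorithm(x_train, y_train):
    """Maps a training dataset to one member of the hypothesis class
    H = { f(x) = k*x + m } by ordinary least squares."""
    k, m = np.polyfit(x_train, y_train, deg=1)  # degree-1 fit picks k and m
    return lambda x: k * x + m                  # a concrete model f in H

# Example: a tiny training set produces one particular f from H
x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([2.1, 6.9, 12.2])
f = training_algorithm(x_train, y_train)
print(f(3.0))  # prediction of the learned model on a new input
```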
With that framing in place, we can now define the key terms needed to bring it all together.
What is generalisation error?
The generalisation error, in this example, is the error that we expect to observe by training a linear regression model on the training dataset that we have and deploying it on future unseen data examples drawn from the target distribution. Basically, it is the true error of the model that we will have. The reason that we have a 'test set' is so we can estimate the generalisation error. Now, thankfully, smarter people than me have gone ahead and done a bunch of mathematical wizardry to decompose the generalisation error into the following definition:
GeneralisationError = TrueError + Bias + Variance

Note that Bias = UnderfittingError and Variance = OverfittingError.
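For reference, here is one way to write that decomposition out formally under the framing above (my own notation: A(S) is the model the training algorithm learns from a training set S of size N, h* is the best model in H, and f_Bayes is the optimal predictor for the distribution D):

```latex
% L_D(h)  = generalisation error (risk) of model h on the target distribution D
% A(S)    = model learned from a training set S of size N
% h^*     = best model in the hypothesis class H
% f_Bayes = optimal predictor for D
\begin{align*}
\underbrace{\mathbb{E}_{S \sim D^N}\big[L_D(A(S))\big]}_{\text{GeneralisationError}}
  &= \underbrace{L_D(f_{\mathrm{Bayes}})}_{\text{TrueError}}
   + \underbrace{L_D(h^*) - L_D(f_{\mathrm{Bayes}})}_{\text{Bias}}
   + \underbrace{\mathbb{E}_{S \sim D^N}\big[L_D(A(S))\big] - L_D(h^*)}_{\text{Variance}}, \\
h^* &= \arg\min_{h \in H} L_D(h).
\end{align*}
```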
What is TrueError?
TrueError is defined as the underlying error of the target distribution. For example, suppose that we are given training data from the target distribution of samples (x, y) where y = 5x + 2. Then we can clearly see that this can be perfectly predicted (if we know the underlying true distribution and can model it), and the TrueError would be 0. However, for another example, maybe the target distribution is y = 5x + 2 + u, where u is a random Normal variable with mean 0 and variance 1. Well clearly, even if we know the true underlying distribution, we won't be able to perfectly predict it because of the underlying error of the target distribution, and the TrueError is equal to the variance of the random variable u.
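A quick simulation of the second example (squared loss; the uniform distribution of x and the sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution: y = 5x + 2 + u, with u ~ Normal(0, 1)
x = rng.uniform(-1, 1, 1_000_000)
y = 5 * x + 2 + rng.normal(0, 1, x.size)

# Even the true underlying model f(x) = 5x + 2 cannot remove the noise u,
# so its squared-error risk approaches Var(u) = 1, which is the TrueError.
true_model_error = np.mean((y - (5 * x + 2)) ** 2)
print(true_model_error)  # ~1.0
```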
What is Bias?
Bias is defined as the difference between the generalisation error of the BEST model from our hypothesis class H and the underlying TrueError. For example, if the true underlying target distribution is y = 5x + 2, then our class of simple linear regression models H contains this exact model! The best model instance is therefore f(x) = 5x + 2, with f in H, and our Bias/UnderfittingError is equal to 0. As you can see, Bias, or the Underfitting Error, is basically the error introduced by choosing our model hypothesis class H. If our hypothesis class contains the true target distribution model, then our bias will be 0; otherwise, the farther the 'best' model in the class is from the true underlying target distribution, the higher our bias will be. Notice that this definition has nothing to do with the volatility of the model's output predictions, and nothing to do with the learning curve. In machine learning, we assume that a good training algorithm with infinite training data should choose the best model from the hypothesis class H, so Bias can be thought of as the expected error of your model trained with infinite training data (minus the TrueError).
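To see a nonzero bias, here is a sketch (again my own toy example) where the hypothesis class is deliberately too small, containing only constant predictors, while the target distribution is still y = 5x + 2 + u:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution: y = 5x + 2 + u, u ~ Normal(0, 1), x ~ Uniform(-1, 1)
x = rng.uniform(-1, 1, 1_000_000)
y = 5 * x + 2 + rng.normal(0, 1, x.size)

true_error = 1.0  # Var(u): the irreducible noise

# Hypothesis class H = { f(x) = c }: under squared loss the best constant is E[y]
best_constant = y.mean()
best_in_class_error = np.mean((y - best_constant) ** 2)

bias = best_in_class_error - true_error
print(bias)  # ~ Var(5x + 2) = 25 * Var(x) = 25/3, roughly 8.33
```

No matter how much training data we feed a constant predictor, it can never do better than this, because the bias comes from the choice of H itself.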
What is Variance?
Variance is defined as the difference between the expected generalisation error of the model that our training algorithm learns from a finite training set of size N and the generalisation error of the BEST model in our hypothesis class H. To continue with the example from above, we know that the best model in our class H is the model instance f(x) = 5x + 2. However, suppose that we are given a training set of size 1, with only a single sample (x, y). Given only one sample, our training algorithm may choose the model instance f(x) = 4x + 1, which is worse than our best model and will have higher generalisation error. The difference between the expected GE of our model trained on the dataset of size 1 and the GE of our 'best model' is the variance, or overfitting error, component of the GE. Notice how the variance depends on the model hypothesis class H, the training algorithm A, and the size of the training dataset. This is why we can reduce overfitting by increasing the size of our training dataset, or by shrinking the model hypothesis class (choosing a simpler model or adding regularisation, which typically reduces the size of the hypothesis class).
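Here is a sketch that estimates the variance term directly by repeatedly drawing training sets of a given size, fitting a linear model, and comparing its average generalisation error to that of the best model in H (the repeat counts and sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    # Target distribution: y = 5x + 2 + u, u ~ Normal(0, 1)
    x = rng.uniform(-1, 1, n)
    return x, 5 * x + 2 + rng.normal(0, 1, n)

def risk(k, m, n_test=20_000):
    # Generalisation error (squared loss) of the model f(x) = k*x + m
    x, y = sample_data(n_test)
    return np.mean((y - (k * x + m)) ** 2)

best_model_error = 1.0  # GE of the best model f(x) = 5x + 2 is just Var(u) = 1 (the TrueError)

for n_train in (5, 50, 1000):
    # Average the learned model's GE over many training sets of size n_train
    errors = []
    for _ in range(200):
        x_tr, y_tr = sample_data(n_train)
        k, m = np.polyfit(x_tr, y_tr, deg=1)  # TrainingAlgorithm: least squares over H
        errors.append(risk(k, m))
    print(f"n_train={n_train}: estimated Variance = {np.mean(errors) - best_model_error:.3f}")
```

The estimated variance shrinks as the training set grows, which is exactly the "more data reduces overfitting" behaviour described above.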
TL;DR: An exact mathematical definition of underfitting/overfitting, a.k.a. the bias-variance tradeoff, is useful to understand even beyond a theoretical context. For a more comprehensive treatment, I recommend reading the textbook I mentioned at the beginning of the post.
EDIT NOTE: A couple of people have been confused by the exact terminology. I should clarify that the bias-variance decomposition is technically different from the approximation-estimation error decomposition, but they are extremely similar, and in most cases they are mathematically equivalent. In fact, it is useful to think of the approximation-estimation decomposition as a sub-case of the bias-variance decomposition, if we make the assumption that our training algorithm is expected to output the best model in its hypothesis class. If that assumption holds, they become mathematically equivalent for most intents and purposes. Most modern machine learning algorithms and hypothesis classes satisfy this assumption, so they are effectively equivalent. I also find the approximation-estimation decomposition practically more useful, because it simplifies things and applies more directly to the real-world problems we typically face as data scientists.