Under the Hood: Linear Regression
Written by Steven Chun
Published on 25 April 2020
It has been twenty-one days since I’ve seen another person. Actually, it’s been more like 4-5 weeks, but every time I opened up this essay I had to update the number, and I lost count. As the world collectively weathers the pandemic, I find myself with an abundance of time and no excuse not to write. One might consider this my Walden moment. On the other hand, you wouldn’t be blamed for thinking, “Steven, you’re watching Bon Appetit in your apartment. You’re not in a remote New England cabin.” Consider, however, that Thoreau’s mother made the 20-minute walk from Concord to Walden Pond to bring him food and do his laundry. I do my own laundry. So, let’s talk about an old friend: linear regression.
This essay is an attempt to answer an ever-escalating series of “but how does that work?”. Regardless of your field, you’ve probably encountered linear regression. As a statistical model, it’s as prevalent as it is useful. But less attention is given to the mechanics of regression. How is it computed? Is it hard? And at what specific point does it melt my 2015 MacBook?
For any concept there’s, like, the most abstract definition, usually math, and then, on the opposite end of the spectrum, there’s the CPU physically flipping logic gates to execute some machine code (you might want to fact check this; I don’t know hardware). Searing a steak on one end, the chemistry of the Maillard reaction on the other, and I guess even quantum interactions if you wanna go further. You get the picture. A reasonable explanation usually goes from the abstract down towards the tangible to the point where you can apply a concept in your work. We’re going to take that explanation and push it just a little farther than is really useful—why? Why am I watching how olive oil is made? Because why the hell not? (What I mean, of course, is that there is no other reason. Learning adds richness to each second of life.)
I’m going to try to build everything from the ground up, but it will be more fun if you have some sense of basic probability and linear algebra (if you know what a matrix is, you’re golden; I picked this topic partly to refresh myself on matrix factorization, because I have a hunch that’s where this is all going).
On the other hand, if you’re already familiar with the matrix derivation of the OLS normal equations, you may want to skip ahead to my next post, which features appearances by: QR factorization! Gradient descent! And the 41-year-old FORTRAN library that underpins basically all scientific computing. But before that, we must do theory.
You will find this essay to have the unmistakable character of someone who was always just barely keeping up with a derivation or proof in lecture. For all of high school and freshman year of college, I was certain that I was not-a-math-person. This, and sleep deprivation brought on by a commitment to overscheduling, meant that I was in a state of actual slumber for 70-80% of my education in fundamental mathematical operations. I want to emphasize: the math in this post never came quickly to me, and if my writing appears pedantic or outright slow, it is because I fear that a rushed assumption or unexplained algebraic manipulation might cause an otherwise talented, curious, maybe sleepy learner to miss the forest for the trees.
Linear Regression is Nice
First, we must appreciate linear regression. We are like Salieri. We see the beauty. We do not understand how it is done. A lot of writing is stealing. In fact, if you see humans as just pattern recognition machines, and that’s not a particularly bad way to see us, all creation is stealing. But I must admit I stole this comparison from Craig Silverstein, who made it first about Sanjay Ghemawat in this wonderful New Yorker piece on Sanjay and Jeff Dean.
I love linear regression. As much as fancy techniques like decision trees or artificial neural networks have captured the present imagination, linear regression continues to save lives and move nations. It’s simple and interpretable; you can use it to make arguments about cause and effect. It lets you know why good or bad things happen, and provides insight on how you might promote more good things and prevent bad things.
For example, you might think, “Classical economic theory (see Equilibrium in Competitive Insurance Markets: An Essay on the Economics of Imperfect Information by Rothschild and Stiglitz, 1976) suggests that making everyone sign up for healthcare makes everyone better off—I wonder if that holds up in the real world?” You could also ask much sillier questions, like, “Do people stay alive longer so they can save money on their estate taxes?” (These guys did.) You might get some data on healthcare coverage, costs, and premiums; you could exploit the fact that Massachusetts actually did make everyone sign up for healthcare; and then you can have a little linear regression to estimate if people were better off. You might look at your regression and conclude, “Hey, people are better off, like $241 better off! We should do this nationally!” And you go and do it nationally (well, actually, you publish a paper—this paper—and then someone important reads it and goes and writes some policy), and it goes, you know, okay.
Of course, economics is a lot more than just specifying and running a regression. Actually, that’s probably the easiest part. But still! You can’t do any of it without linear regression!
That is all to say, linear regression is pretty great. Abhijit Banerjee, Esther Duflo, and Michael Kremer won the Nobel Prize last year for pioneering the use of randomized control trials (RCTs) in alleviating poverty. Why do RCTs work so well? Because when you run regressions on them, you can be really confident in the results. Great stuff.
What Does it Do?
Let’s get semi-technical. If you have one independent variable ($x$) and one dependent variable ($y$), linear regression finds the line through your data that best predicts $y$ from $x$. “Best” is decided by a loss function: a measure of how badly a candidate line misses the truth, which you then try to minimize.
The loss function de jure (French for “loss function with juice”) is Ordinary Least Squares (OLS). (Least Squares makes sense: you want the line that produces the least squared error. Ordinary because this is plain old linear least squares, as opposed to the extra-ordinary non-linear least squares.) It means you measure how far off your prediction is by taking the Euclidean distance between the true value and your predicted value (you know this one: it’s just your regular geometric distance; what’s more fun is that there’s also Manhattan Distance, where you take the distance as if you had to get there via city blocks! New York, baby!) and squaring it. You do that for each example, sum them all, and you get a number that represents the sum of the squared errors—this is your loss, which you want to minimize.
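To make that concrete, here’s a little sketch in Python (NumPy, with data and candidate lines I made up) that computes the sum of squared errors exactly as described above:

```python
import numpy as np

# Made-up data: x is some explanatory variable, y is what we want to predict.
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([9.1, 10.5, 12.0, 12.8, 14.9, 17.2, 16.4, 19.0])

def sum_of_squared_errors(intercept: float, slope: float) -> float:
    """OLS loss for the candidate line y_hat = intercept + slope * x."""
    y_hat = intercept + slope * x   # the line's prediction for every example
    residuals = y - y_hat           # how far off each prediction is
    return float(np.sum(residuals ** 2))

# Try a couple of candidate lines; the regression line is whichever
# intercept/slope pair makes this number as small as possible.
print(sum_of_squared_errors(0.0, 1.0))
print(sum_of_squared_errors(1.0, 0.9))
```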
Now, we have defined our line in the classic way, as

$$y = mx + b$$

But moving forward we’re going to use the more common econometric notation of

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $y_i$ and $x_i$ are the $i$-th observations of the dependent and independent variables, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the error term. What mathematical notation gains in precision it loses in legibility. However, this is basically the same thing: $\beta_0$ plays the part of $b$, $\beta_1$ plays the part of $m$, and $\varepsilon_i$ soaks up whatever the line doesn’t explain.
But how does that work?
It’s not immediately obvious how you actually go about picking the optimal $\beta_0$ and $\beta_1$, and there’s more than one way to get there.
Let’s start with the more precise, more standard method. To do so, we must find a closed-form solution for OLS. (Closed-form is a funny term, but generally it means you can get to a solution by using a predefined set of mathematical operations, e.g. Normal Math™. Things like algebra and calculus will get you to the answers you seek. Of course, what constitutes Normal Math™ is completely arbitrary, but generally things like infinite series and iterative methods don’t count.) This is the dressing down you should get in most introductory econometrics classes.
Solving for Beta
First (if you’re already familiar with the univariate derivation of OLS, feel free to skip ahead to the next section; this section is heavily drawn from this nice worksheet I found from Queen’s University at Kingston’s Introductory Econometrics class), let us formalize the focus of our optimization for OLS, which is often called the Residual Sum of Squares (RSS) or the Sum of Squared Estimates of Errors (SSE):

$$SSE(\beta_0, \beta_1) = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$$
Simply, this says that the SSE, given our model parameters, equals the sum of each residual squared. That is, how far off our prediction is from the truth, squared. We want to minimize this function—which means we want to find the first order condition (FOC), the point at which the derivative of the SSE equals zero. (Of course, a zero first derivative might mean a minimum or a maximum, but as I understand it, SSE is quadratic, which means that a local min or max is the global min or max, and it’s also convex, which means that there is a global min. The proof of convexity is currently beyond me. If you have a simple proof, write me!)
So, we partially differentiate the SSE with respect to each parameter. We use the fact that the derivative of a sum is the sum of the derivatives, which lets us differentiate term by term inside the summation:

$$\frac{\partial SSE}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial}{\partial \beta_j}\left(y_i - \beta_0 - \beta_1 x_i\right)^2$$
Now, we apply the chain rule. Remember, the chain rule applies to functions of the form $f(g(x))$, whose derivative is $f'(g(x)) \cdot g'(x)$. Here the outer function is the square, and the inner function is the residual itself:

$$\frac{\partial SSE}{\partial \beta_j} = \sum_{i=1}^{n} 2\left(y_i - \beta_0 - \beta_1 x_i\right) \cdot \frac{\partial}{\partial \beta_j}\left(y_i - \beta_0 - \beta_1 x_i\right)$$

And to clean things up, we pop the 2 out of the summation:

$$\frac{\partial SSE}{\partial \beta_j} = 2\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) \cdot \frac{\partial}{\partial \beta_j}\left(y_i - \beta_0 - \beta_1 x_i\right)$$
You’ll note, there is still a derivative to be taken. However, here we must specify which parameter we’re differentiating with respect to.
Differentiating with respect to $\beta_0$, the inner derivative is just $-1$:

$$\frac{\partial SSE}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)$$

And with respect to $\beta_1$, the inner derivative is $-x_i$:

$$\frac{\partial SSE}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right)$$
Once we set these to zero we have our FOCs. Since we’re again solving for zero, we can drop the $-2$:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

$$\sum_{i=1}^{n} x_i\left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$

Our first order of First Order Conditions, rephrased to expand the summation across each of its terms:

$$\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1\sum_{i=1}^{n} x_i = 0$$
Sweet, though seeing how you might solve these is a little tricky, especially with that summation mucking things up. So we’re going to transform these into more standard forms which are known as the OLS Normal Equations. This is just a little rearrangement, so don’t worry about it too much.
Starting with the FOC for $\beta_0$ in that expanded form, multiply both sides by $-1$ and move $\sum y_i$ to the right-hand side:

$$n\beta_0 + \beta_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

And then the OLS Normal Equation for $\beta_1$ comes from expanding and rearranging the second FOC in exactly the same way:

$$\beta_0\sum_{i=1}^{n} x_i + \beta_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

So our two OLS Normal Equations are:

$$n\beta_0 + \beta_1\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$\beta_0\sum_{i=1}^{n} x_i + \beta_1\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
Since all the summation terms will add up to a constant, these are just two linear equations with two unknowns! You might recall (don’t take this as a suggestion that you should recall; I don’t care, that’s why I’m telling you again) that a linear system with two unknowns and two constraints will usually have a unique solution. Basically, each equation tells you something about the system, a hint that you can use to constrain your answer. Two unknowns need two hints. You could do linear regression with a pencil and paper. (This, of course, would be silly. That didn’t stop an old professor of mine—a distinguished academic—from asking us to do k-means by hand. It ended up being over 800 manual calculations of mean and distance. Pedagogy did not seem to be his forte.)
Thinking about how you’d actually solve this system: if you divide the first equation by $n$, you get

$$\beta_0 = \bar{y} - \beta_1\bar{x}$$

Where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, that is, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Anyway, solve that bad boy for one of the unknowns and plug it into the other equation. You have done the linear regression.
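If you’d rather let the computer do the plugging in, here’s a sketch (same sort of made-up data as above) that performs exactly that substitution: solve the first normal equation for $\beta_0$, substitute it into the second, and out falls $\beta_1$:

```python
import numpy as np

# Made-up data again.
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([9.1, 10.5, 12.0, 12.8, 14.9, 17.2, 16.4, 19.0])

x_bar, y_bar = x.mean(), y.mean()

# Substituting beta0 = y_bar - beta1 * x_bar into the second normal equation
# and solving for beta1 gives the familiar "covariance over variance" form.
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)   # the intercept and slope that minimize the SSE
```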
You might have noticed that I waffled when saying that this linear system will have a solution. Now this is only for multivariable linear regression, which we’re about to jump into, but it’s an important extension of thinking about OLS in terms of constraints and unknowns. In a special case, it’s possible not to be able to solve OLS at all. We call this condition perfect multicollinearity. If you have multiple explanatory variables (the “multi-”) and one of the explanatory variables is a perfect linear combination of the others (the “-collinearity”), you cannot solve for each beta. For example, if one of the fields in my dataset was age, and then I decided to add another field that was age times two plus one, I break OLS. One of the constraints is just one of the other constraints in a dress and wig and umbrella, so we actually have more unknowns than constraints. Another way to think about this: if we have two explanatory variables that are perfectly collinear, they’re describing the same effect on our dependent variable. As such, it doesn’t make sense to split up one effect into two coefficients. Luckily, this is usually an easy fix. You just remove one of the explanatory variables, since it’s redundant. You have solved the multicollinearity.
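If you want to see that failure in the flesh, here’s a small sketch (made-up numbers, and peeking ahead to the matrix form we build in the next section) where adding the age-times-two-plus-one column leaves the system with more unknowns than independent constraints:

```python
import numpy as np

age = np.array([23.0, 31.0, 45.0, 52.0, 60.0])
outcome = np.array([1.0, 2.1, 3.8, 4.5, 5.9])

# Columns: a constant, age, and the redundant field (age * 2 + 1).
X = np.column_stack([np.ones_like(age), age, age * 2 + 1])

# Three unknowns, but only two independent constraints: the rank gives it away.
print(np.linalg.matrix_rank(X))   # prints 2, not 3

# lstsq will still hand back *a* set of betas, but the rank it reports tells
# us the individual coefficients on age and its copy aren't identified.
betas, _, rank, _ = np.linalg.lstsq(X, outcome, rcond=None)
print(rank)   # 2 again
```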
Now would be a good time to take a break. We did a little work. The good stuff is yet to come. You’re probably a little hungry. A spoon of peanut butter and some milk, definitely some coffee, that’s what I always say.
“A mathematician is a device for turning coffee into theorems.”
How It’s Actually Done
Okay, so the above was a useful theoretical grounding, but computers don’t calculate linear regressions by doing algebraic substitutions. Also, the above was univariate and commonly we want multivariate regression. So, we need a new tool and that tool is linear algebra.
Linear algebra helps us express multivariate OLS succinctly, with the added benefit of framing the operations much in the same way as we actually execute them en machina. (This StackOverflow question does a good job of explaining why matrix operations are fast on computers. Basically, a lot of linear operations can be done in parallel, since they’re independent. Also, computer caches (L1, L2, L3…) work better when memory access is “spatially and temporally coherent”, i.e. it has structure and isn’t jumping all over the place.)
Our multivariate model now looks like this:

$$y = X\beta + \varepsilon$$

But what about our constant intercept? Well, we just make the first column of $X$ a column of ones, so that the first element of $\beta$ plays the role of $\beta_0$. Written out for a single observation, the model is

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i$$

Which says that the residuals (which we called $\varepsilon_i$ in the univariate case) are the gap between each observed $y_i$ and the model’s prediction:

$$\varepsilon = y - X\beta$$

Same! But different. We have $y$, an $n \times 1$ vector of observations; $X$, an $n \times (k+1)$ matrix whose rows are examples and whose columns are explanatory variables (plus that column of ones); $\beta$, a $(k+1) \times 1$ vector of coefficients; and $\varepsilon$, an $n \times 1$ vector of errors. Same stuff as before, but nicely generalized to as many examples and as many betas as we want. (This section is largely sourced from here.)
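To pin down the shapes, here’s a sketch with invented numbers: $y$, $X$ (with its column of ones), and a candidate $\beta$ as NumPy arrays, with the whole residual vector coming out of a single matrix product:

```python
import numpy as np

# Five examples, an intercept column plus two explanatory variables: shape (n, k+1) = (5, 3).
X = np.array([
    [1.0, 2.0, 35.0],
    [1.0, 3.5, 41.0],
    [1.0, 1.0, 29.0],
    [1.0, 4.2, 52.0],
    [1.0, 2.7, 38.0],
])
y = np.array([10.1, 13.0, 8.2, 16.5, 11.7])   # shape (n,)

beta = np.array([1.0, 2.0, 0.1])              # a candidate coefficient vector, shape (k+1,)

residuals = y - X @ beta                      # epsilon = y - X beta, all n at once
print(residuals)
print(float(residuals @ residuals))           # epsilon^T epsilon: the sum of squared residuals
```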
Now, the sum of squared residuals is just

$$\varepsilon^\top \varepsilon = \sum_{i=1}^{n} \varepsilon_i^2$$

Given the definition of the residuals, we can write

$$SSE(\beta) = (y - X\beta)^\top (y - X\beta) = y^\top y - \beta^\top X^\top y - y^\top X\beta + \beta^\top X^\top X\beta$$
If you are forgetting your transpose properties, and it would not be untoward to charge this author with doing so, this last step might take a second or two. First, we must remember that the transpose respects addition,

$$(A + B)^\top = A^\top + B^\top$$

and that it reverses the order of a product,

$$(AB)^\top = B^\top A^\top$$

Proof: First observe that the $(i, j)$ entry of a product is

$$(AB)_{ij} = \sum_k A_{ik} B_{kj}$$

Furthermore, if we transpose a matrix, we switch the rows and the columns. These facts together mean we can write

$$\left((AB)^\top\right)_{ij} = (AB)_{ji} = \sum_k A_{jk} B_{ki}$$

and

$$\left(B^\top A^\top\right)_{ij} = \sum_k (B^\top)_{ik} (A^\top)_{kj} = \sum_k B_{ki} A_{jk}$$

And here we can see that these last two results are the same summation, just with their members flipped around. Q.E.D. (Source.) Upon its consideration, we are forced to again congratulate the transpose on its integrity.
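And if you’d rather have NumPy vouch for the transpose than take my algebra’s word for it, a quick numerical spot-check (random matrices, nothing special about the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(3, 4))

print(np.allclose((A + C).T, A.T + C.T))   # the transpose respects addition
print(np.allclose((A @ B).T, B.T @ A.T))   # the transpose reverses a product
```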
Consolidating our previous work and exploiting the fact that $\beta^\top X^\top y$ is a scalar, and a scalar is its own transpose (so $\beta^\top X^\top y = y^\top X\beta$), we can write the SSE as

$$SSE(\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta$$
As before, we now need to take the derivative with respect to our betas. Matrix calculus is fundamentally the same as scalar calculus, but the interactions with dimensions and the transpose add extra complexity, so don’t worry too much if the next piece isn’t immediately obvious.
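Concretely, two matrix-calculus identities do the work in the next step. I’ll state them here without proof (for a column vector $a$ and a symmetric matrix $A$, differentiating with respect to the vector $\beta$):

$$\frac{\partial}{\partial \beta}\left(a^\top \beta\right) = a \qquad\qquad \frac{\partial}{\partial \beta}\left(\beta^\top A \beta\right) = 2A\beta$$

The first, with $a = X^\top y$, handles the middle term of our SSE; the second, with $A = X^\top X$ (which is conveniently symmetric), handles the quadratic term.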
Taking that derivative and setting it to zero:

$$\frac{\partial SSE}{\partial \beta} = -2X^\top y + 2X^\top X\beta = 0$$

Which gives us our multivariate OLS normal equations in matrix form:

$$X^\top X\beta = X^\top y$$
Nice. The only unknown in this equation is $\beta$, so we multiply both sides (on the left) by $(X^\top X)^{-1}$:

$$(X^\top X)^{-1} X^\top X\beta = (X^\top X)^{-1} X^\top y$$

And discard the $(X^\top X)^{-1} X^\top X$ on the left, since a matrix times its own inverse is just the identity:

$$\beta = (X^\top X)^{-1} X^\top y$$
Aha! We have it! Fully generalized, here at last is the precise definition of what we must do to obtain any regression, for any dataset, for any problem, forever (that is, like we mentioned, as long as there’s no perfect multicollinearity). And the truth is, most applied classes will leave it at that.
Knowing that $\beta = (X^\top X)^{-1} X^\top y$ is, for most practical purposes, knowing linear regression.
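And computing it really is that direct. Here’s a minimal sketch (made-up data; the names are mine) that solves the normal equations with np.linalg.solve rather than explicitly forming the inverse, then checks the answer against NumPy’s own least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 100 examples, an intercept column plus two explanatory variables.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)   # y = X beta + noise

# The normal equations, (X^T X) beta = X^T y, solved directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least-squares solver (spoiler for the next post: it takes a
# smarter route than forming X^T X at all).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))   # True: same betas by a different road
```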
This math is the theoretical essence of the empirical revolution. Clinical trials and fiscal policy in just a few lines of linear algebra. There’s a good reason the applied world often stops here—everything that comes next is an implementation detail. But, Dear Reader, believe me when I say that the next time we ask, “But how does that work?”
That’s where it gets spicy.