Under the Hood: Linear Regression

Written by Steven Chun
Published on 25 April 2020

It has been twenty-one days since I’ve seen another person. Actually, it’s been more like 4-5 weeks, but every time I opened up this essay I had to update the number, and I lost count. As the world collectively weathers the pandemic, I find myself with an abundance of time and no excuse not to write. One might consider this my Walden moment. On the other hand, you wouldn’t be blamed for thinking, “Steven, you’re watching Bon Appetit in your apartment. You’re not in a remote New England cabin.” Consider, however, that Thoreau’s mother made his meals and did his laundry. I do my own laundry. So, let’s talk about an old friend: linear regression.

[Figure: a graph of points with a line of best fit drawn through them.]

This essay is an attempt to answer an ever-escalating series of “but how does that work?”. Regardless of your field, you’ve probably encountered linear regression. As a statistical model, it’s as prevalent as it is useful. But less attention is given to the mechanics of regression. How is it computed? Is it hard? And at what specific point does it melt my 2015 MacBook?

For any concept there’s, like, the most abstract definition, usually math, and then, on the opposite end of the spectrum, there’s the CPU physically flipping logic gates to execute some machine code (you might want to fact-check this; I don’t know hardware). Searing a steak on one end, the chemistry of the Maillard reaction on the other, and I guess even quantum interactions if you wanna go further. You get the picture. A reasonable explanation usually goes from the abstract down towards the tangible to the point where you can apply a concept in your work. We’re going to take that explanation and push it just a little farther than is really useful. Why? Why am I watching Bon Appetit? Because why the hell not? (What I mean, of course, is that there is no other reason. Learning adds richness to each second of life.)

[Figure: a very silly diagram describing an explanation spectrum from the abstract (math) to the concrete (machine code); along the spectrum sit a reasonable explanation and, further down, machine code.]

I’m going to try to build everything from the ground up, but it will be more fun if you have some sense of basic probability and linear algebra (if you know what a matrix is, you’re golden; I picked this topic partly to refresh myself on matrix factorization, because I have a hunch that’s where this is all going).

On the other hand, if you’re already familiar with the matrix derivation of the OLS normal equations, you may want to skip ahead to my next post, which features appearances by: QR factorization! Gradient descent! And the 41-year-old FORTRAN library that underpins basically all scientific computing. But before that, we must do theory.

You will find this essay to have the unmistakable character of someone who was always just barely keeping up with a derivation or proof in lecture. For all of high school and freshman year of college, I was certain that I was not-a-math-person. This, and sleep deprivation brought on by a commitment to overscheduling, meant that I was in a state of actual slumber for 70-80% of my education in fundamental mathematical operations. I want to emphasize: the math in this post never came quickly to me, and if my writing appears pedantic or outright slow, it is because I fear that a rushed assumption or unexplained algebraic manipulation might cause an otherwise talented, curious, maybe sleepy learner to miss the forest for the trees.

Linear Regression is Nice

First, we must appreciate linear regression. We are like Salieri. We see the beauty. We do not understand how it is done. A lot of writing is stealing. In fact, if you see humans as just pattern recognition machines, and that’s not a particularly bad way to see us, all creation is stealing. But I must admit I stole this comparison from Craig Silverstein, who made it first about Sanjay Ghemawat in a profile of him.

I love linear regression. As much as fancy techniques like decision trees or artificial neural networks have captured the present imagination, linear regression continues to save lives and move nations. It’s simple and interpretable; you can use it to make arguments about cause and effect. It lets you know why good or bad things happen, and provides insight on how you might promote more good things and prevent bad things.

For example, you might think, “Classical economic theory (see Equilibrium in Competitive Insurance Markets: An Essay on the Economics of Imperfect Information by Rothschild and Stiglitz, 1976) suggests that making everyone sign up for healthcare makes everyone better off—I wonder if that holds up in the real world?” You could also ask much sillier questions, like, “Do people stay alive longer so they can save money on their estate taxes?” You might get some data on healthcare coverage, costs, and premiums; you could exploit the fact that Massachusetts actually did make everyone sign up for healthcare; and then you can run a little linear regression to estimate if people were better off. You might look at your regression and conclude, “Hey, people are better off, like, better off! We should do this nationally!” And you go and do it nationally (well, actually, you publish a paper, and then someone important reads it and goes and writes some policy), and it goes, you know, the way these things go.

Of course, economics is a lot more than just specifying and running a regression. Actually, that’s probably the easiest part. But still! You can’t do any of it without linear regression!

That is all to say, linear regression is pretty great. Abhijit Banerjee, Esther Duflo, and Michael Kremer won the Nobel Prize last year for pioneering the use of randomized control trials (RCTs) in alleviating poverty. Why do RCTs work so well? Because when you run regressions on them, you can be really confident in the results. Great stuff.

What Does it Do?

Let’s get semi-technical. If you have one independent variable ($x$) and one dependent variable ($y$), linear regression specifies a line of best fit in the form $y = mx + b$, by solving for an optimal $m$ and $b$. (More generally, linear regression defines a hyperplane of best fit. It is a cool word, hyperplane, but it just means a shape that is one dimension less than the space it exists in. In 2 dimensions (charts and graphs and such), that’s a line (1-dimensional); in 3 dimensions, a plane (2-dimensional); and anything higher: hyperplanes (ooh, very cool). In math, once you get past 3 dimensions you slap a sweet name on it and stop counting. Vectors are 1-dimensional, matrices are 2-dimensional, everything after is a tensor. Nice.) The line that “best fits” depends on your definition of good. The standard terminology for this is that fitting the line is an optimization problem and your measure of “good or bad” is an objective function. An objective function that measures good is often called a reward function, and one that measures bad is a loss function. Since, in linear regression, you compare how far (bad) your predicted values (the line) are from your observed values (the dots), we often think of linear regression as using a loss function.

The loss function du jour (French for “loss function with juice”) is Ordinary Least Squares (OLS). (Least Squares makes sense: you want the line that produces the least squared error. Ordinary because this is plain old linear least squares, as opposed to the extra-ordinary non-linear least squares.) It means you measure how far off your prediction is by taking the Euclidean distance between the true value and your predicted value (you know this one: it’s just your regular geometric distance; what’s more fun is that there’s also the Manhattan distance, where you take the distance as if you had to get there via city blocks! New York, baby!) and squaring it. You do that for each example, sum them all, and you get a number that represents the sum of the squared errors—this is your loss, which you want to minimize.
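To make that loss concrete, here is a minimal sketch in Python with NumPy. The five data points are made up purely for illustration; the function just computes the sum of squared errors for a couple of candidate lines.

```python
import numpy as np

# Made-up toy data: five observations of x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(beta0, beta1):
    """Sum of squared errors for the line y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x      # the line's predictions
    residuals = y - y_hat          # how far off we are at each point
    return float(np.sum(residuals ** 2))

print(sse(0.0, 2.0))    # one candidate line
print(sse(0.1, 1.95))   # another; the smaller number is the better fit
```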

Now, we have defined our line in the classic way as

$$ y = mx + b $$

But moving forward we’re going to use the more common econometric notation of

$$ \hat{y}_i = \beta_0 + \beta_1 x_i $$

What mathematical notation gains in precision it loses in legibility. However, this is basically the same thing. $\hat{y}_i$ (pronounced “y-hat”) is again the predicted value at any point; the carrot (properly called a caret) indicates that this is a predicted value, which will differ from $y_i$, the true value. $u_i$ is the residual (the difference between our estimate and the truth, $u_i = y_i - \hat{y}_i$), and the subscript $i$ just indicates that this is the prediction for the $i$-th example in our dataset ($i$ is the canonical notation for index, AKA position; if you have two indices, people denote the second with $j$, which is very stupid: try to tell $i$ from $j$, handwritten on a blackboard, halfway across a lecture hall). $\beta_0$ and $\beta_1$ (pronounced “beta naught,” or beta zero, and “beta one”) are the parameters, or unknowns, in our model, equivalent to $b$ and $m$ in our previous definition. These define the line; these are what we need to solve for.

But how does that work?

It’s not immediately obvious how you actually go about picking the optimal $\beta_0$ and $\beta_1$. In fact, if you ask a computer scientist and an economist, by virtue of their training you may get very different answers (I have the pleasure of being both, in a very loose sense, and it’s this dissonance that got me started on this post).

Let’s start with the more precise, more standard method. To do so, we must find a closed-form solution for OLS. (Closed-form is a funny term, but generally it means you can get to a solution by using a predefined set of mathematical operations, e.g. Normal Math™. Things like algebra and calculus will get you to the answers you seek. Of course, what constitutes Normal Math™ is completely arbitrary, but generally things like infinite series and iterative methods don’t count.) This is the dressing down you should get in most introductory econometrics classes.

Solving for Beta

First (if you’re already familiar with the univariate derivation of OLS, feel free to skip ahead to the next section; this section draws heavily from a nice worksheet I found from Queen’s University at Kingston), let us formalize the focus of our optimization for OLS, which is often called the Residual Sum of Squares (RSS) or the Sum of Squared Estimates of Errors (SSE).

$$ SSE(\beta_0, \beta_1) = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Simply, this says that the SSE, given our model parameters, equals the sum of each residual squared. That is, how far off our prediction is from the truth, squared. We want to minimize this function—which means we want to find the first order condition (FOC), the point at which the derivative of the SSE equals zero. (Of course, a zero derivative might mark a minimum or a maximum, but as I understand it, the SSE is quadratic, which means a local min or max is the global min or max, and it’s also convex, which means there is a global min. The proof of convexity is currently beyond me. If you have a simple proof, write me!)
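If you’d rather trust your eyes than a proof, here is a rough sketch (NumPy again, same made-up data as before) that evaluates the SSE over a grid of candidate $(\beta_0, \beta_1)$ pairs and finds its single lowest point; plot the resulting surface and you’ll see the bowl the quadratic/convex claim describes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Evaluate the SSE on a coarse grid of candidate (beta0, beta1) pairs.
b0_grid = np.linspace(-2.0, 2.0, 81)
b1_grid = np.linspace(0.0, 4.0, 81)
B0, B1 = np.meshgrid(b0_grid, b1_grid)

# Residuals for every grid point at once; shape (81, 81, 5).
residuals = y - (B0[..., None] + B1[..., None] * x)
sse_surface = np.sum(residuals ** 2, axis=-1)

# One bowl, one bottom: the grid point with the smallest SSE.
i, j = np.unravel_index(np.argmin(sse_surface), sse_surface.shape)
print(B0[i, j], B1[i, j], sse_surface[i, j])
```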

So, we partially differentiate the SSE with respect to each parameter.

$$ \frac{\partial SSE}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial}{\partial \beta_j} (y_i - \hat{y}_i)^2 $$

We use $\beta_j$ because this step applies to both $\beta_0$ and $\beta_1$. This says that the SSE differentiated with respect to $\beta_j$ equals the sum of the derivatives of the squared residuals. This works because differentiation is linear; that is, the derivative of the sum of the squared residuals is the same as the sum of the derivative of each squared residual.

Now, we apply the chain rule. Remember, the chain rule applies to functions of the form $f(g(x))$, and $(y_i - \hat{y}_i)^2$ is actually a composite function: the inner function is the calculation of the residual and the outer is the square.

$$ \frac{\partial SSE}{\partial \beta_j} = \sum_{i=1}^{n} 2\,(y_i - \hat{y}_i)\,\frac{\partial}{\partial \beta_j}(y_i - \hat{y}_i) $$

And to clean things up, we pop the 2 out of the summation.

$$ \frac{\partial SSE}{\partial \beta_j} = 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)\,\frac{\partial}{\partial \beta_j}(y_i - \hat{y}_i) $$

You’ll note, there is still a derivative to be taken. However, here, we must specify which parameter we’re differentiating with respect to. Since $\hat{y}_i = \beta_0 + \beta_1 x_i$, differentiating with respect to $\beta_0$ we get:

$$ \frac{\partial SSE}{\partial \beta_0} = 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(-1) = -2 \sum_{i=1}^{n} (y_i - \hat{y}_i) $$

And with respect to $\beta_1$:

$$ \frac{\partial SSE}{\partial \beta_1} = 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(-x_i) = -2 \sum_{i=1}^{n} x_i\,(y_i - \hat{y}_i) $$
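Before setting these to zero, a quick sanity check you can run yourself. The sketch below (NumPy, the same toy data) compares the two partial derivatives we just derived against central finite-difference approximations of the SSE; they should agree to several decimal places, and if they don’t, the algebra above is wrong.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(b0, b1):
    return np.sum((y - b0 - b1 * x) ** 2)

def analytic_gradient(b0, b1):
    """The two partial derivatives derived above."""
    r = y - b0 - b1 * x
    return np.array([-2.0 * np.sum(r), -2.0 * np.sum(x * r)])

# Central finite differences as an independent check of the algebra.
b0, b1, h = 0.3, 1.7, 1e-6
numeric_gradient = np.array([
    (sse(b0 + h, b1) - sse(b0 - h, b1)) / (2 * h),
    (sse(b0, b1 + h) - sse(b0, b1 - h)) / (2 * h),
])
print(analytic_gradient(b0, b1))
print(numeric_gradient)  # should match to several decimal places
```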

Once we set these to zero we have our FOCs. Since we’re again solving for zero, we can drop the $-2$ in front of the summations—it won’t affect the result.

Our first order of First Order Conditions, rephrased to expand $\hat{y}_i$:

$$ \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 $$

$$ \sum_{i=1}^{n} x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0 $$

Sweet, though seeing how you might solve these is a little tricky, especially with that summation mucking things up. So we’re going to transform these into more standard forms which are known as the OLS Normal Equations. This is just a little rearrangement, so don’t worry about it too much.

Starting with the FOC for $\beta_0$, let’s distribute the summation.

$$ \sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0 $$

And multiply both sides by $-1$, moving $\sum y_i$ across, to get a Nice Equation™.

$$ n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i $$

And then the OLS Normal Equation for $\beta_1$ is the same thing, just with each term multiplied by $x_i$ inside the sums, since that’s the only difference between the two FOCs and all we’ve done is move things around.

So our two OLS Normal Equations are:

$$ n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i $$

$$ \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i $$

Since all the summation terms add up to constants, these are just two linear equations with two unknowns! You might recall (don’t take this as a suggestion that you should recall; I don’t care, that’s why I’m telling you again) that a linear system with two unknowns and two constraints will usually have a unique solution. Basically, each equation tells you something about the system, a hint that you can use to constrain your answer. Two unknowns need two hints. You could do linear regression with a pencil and paper. (This, of course, would be silly. That didn’t stop an old professor of mine—a distinguished academic—from asking us to do k-means by hand. It ended up being over 800 manual calculations of mean and distance. Pedagogy did not seem to be his forte.)

Thinking about how you’d actually solve this system, if you divide the first equation by $n$, you get:

$$ \bar{y} = \beta_0 + \beta_1 \bar{x} $$

Where $\bar{x}$ and $\bar{y}$ are the respective means of all the $x_i$ and $y_i$ in your data. This is very neat and the sort of thing you might guess from the start but aren’t totally sure about. “Of course the line of best fit goes through the mean of the data!” you might say, and then someone asks, “Why?” and then you have to write a blog post. Tough.

Anyway, solve that bad boy for one of the unknowns and plug it into the other equation. You have done the linear regression.
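Here is what “solve that bad boy” looks like in code, a small Python sketch on the same toy data as before: it sets up the two normal equations as a 2x2 linear system and, for good measure, does the substitution by hand the way you would on paper.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# The two normal equations as a 2x2 linear system:
#   n*b0      + b1*sum(x)   = sum(y)
#   b0*sum(x) + b1*sum(x^2) = sum(x*y)
A = np.array([[n, x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
beta0, beta1 = np.linalg.solve(A, b)
print(beta0, beta1)

# The same thing by substitution, the pencil-and-paper way:
# plug b0 = y_bar - b1*x_bar into the second equation and solve for b1.
beta1_hand = ((x * y).sum() - n * x.mean() * y.mean()) / ((x ** 2).sum() - n * x.mean() ** 2)
beta0_hand = y.mean() - beta1_hand * x.mean()
print(beta0_hand, beta1_hand)  # identical to the solve() answer
```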

You might have noticed that I waffled when saying that this linear system will have a solution. The following only matters for multivariable linear regression, which we’re about to jump into, but it’s an important extension of thinking about OLS in terms of constraints and unknowns. In a special case, you can’t solve OLS at all. We call this condition perfect multicollinearity. If you have multiple explanatory variables (the “multi-”) and one of the explanatory variables is a perfect linear combination of another (the “-collinearity”), you cannot solve for each beta. For example, if one of the fields in my dataset were age, and then I decided to add another field that was age times two plus one, I’d break OLS. One of the constraints is just one of the other constraints in a dress and wig and umbrella, so we actually have more unknowns than constraints. Another way to think about this: if we have two explanatory variables that are perfectly collinear, they’re describing the same effect on our dependent variable, and it doesn’t make sense to split one effect into two coefficients. Luckily, this is usually an easy fix. You just remove one of the explanatory variables, since it’s redundant. You have solved the multicollinearity.
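To see the breakage rather than take my word for it, here is a sketch on fabricated data (it peeks ahead at the matrix machinery we derive below): adding a column that is exactly age times two plus one makes the design matrix rank-deficient, and dropping the redundant column fixes everything.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
age = rng.uniform(20, 60, n)

# Design matrix: an intercept column, age, and a column that is exactly 2*age + 1.
X = np.column_stack([np.ones(n), age, 2 * age + 1])
y = 3.0 + 0.5 * age + rng.normal(0.0, 1.0, n)

# The third column is a linear combination of the first two, so X has rank 2,
# not 3, and X'X is (numerically) singular: no unique solution exists.
print(np.linalg.matrix_rank(X))   # 2
print(np.linalg.cond(X.T @ X))    # enormous; a healthy design matrix would be modest

# Dropping the redundant column restores a unique, sensible answer.
X_fixed = X[:, :2]
beta = np.linalg.solve(X_fixed.T @ X_fixed, X_fixed.T @ y)
print(beta)                        # roughly [3, 0.5]
```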

Now would be a good time to take a break. We did a little work. The good stuff is yet to come. You’re probably a little hungry. A spoon of peanut butter and some milk, definitely some coffee, that’s what I always say.

“A mathematician is a device for turning coffee into theorems.”

Alfréd Rényi (his friend Paul Erdős would have argued that a mathematician is a device for turning amphetamines into theorems; he wrote a lot of papers)

How It’s Actually Done

Okay, so the above was a useful theoretical grounding, but computers don’t calculate linear regressions by doing algebraic substitutions. Also, the above was univariate and commonly we want multivariate regression. So, we need a new tool and that tool is linear algebra.

Linear algebra helps us express multivariate OLS succinctly, with the added benefit of framing the operations much in the same way as we actually execute them en machina. (Why are matrix operations fast on computers? Basically, a lot of linear operations can be done in parallel, since they’re independent. Also, computer caches (L1, L2, L3…) work better when memory access is “spatially and temporally coherent”; basically, it has structure and isn’t jumping all over the place.)
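A crude illustration of that speed claim, assuming NumPy is installed: the same dot product computed with a plain Python loop and with a vectorized call. The exact timings depend on your machine, but the gap is usually a couple of orders of magnitude.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1_000_000)
b = rng.normal(size=1_000_000)

# The same dot product, twice: once as a Python loop, once vectorized.
t0 = time.perf_counter()
total = 0.0
for i in range(len(a)):
    total += a[i] * b[i]
t1 = time.perf_counter()
vectorized = a @ b
t2 = time.perf_counter()

print(total, vectorized)   # same number (up to rounding)
print(t1 - t0, t2 - t1)    # loop time vs. vectorized time, in seconds
```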

Our multivariate model now looks like this:

$$ y = X\beta + e $$

But what about our constant intercept? Well, we just make the first column of $X$ all $1$s, which does the exact same thing. And like before:

$$ e = y - X\beta $$

Which says that the residuals (which we called $u$ before, but now we’ll use $e$ because that seems more common in these derivations) are, like before, equal to the truth minus our prediction.

Same! But different. We have $n$ examples and $k$ explanatory variables. $y$ is an $n \times 1$ dimensional vector (commonly referred to as a column vector) of ground truth values, $X$ is an $n \times k$ matrix (matrices are denoted number of rows by number of columns; in this case, each row of $X$ is a point in our dataset with $k$ features), $\beta$ is a $k \times 1$ dimensional vector of the true parameters, and $e$ is an $n \times 1$ dimensional vector of residuals. Visually, that looks like this:

$$ \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{n \times 1} = \underbrace{\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}}_{n \times k} \underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}}_{k \times 1} + \underbrace{\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}}_{n \times 1} $$

Same stuff as before, but nicely generalized to as many examples and as many betas as we want. (This section largely follows a standard derivation I found elsewhere.)

Now, the sum of squared residuals is just $e$ transpose (written as $e^T$ or $e'$; the transpose flips the row and column dimensions, so our column vector is now a row vector) times $e$:

$$ SSE = e^T e $$

The product of a $1 \times n$ matrix and an $n \times 1$ matrix is a scalar, so that all checks out.
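If the shapes feel abstract, here is a tiny NumPy check with a made-up residual vector: a $1 \times 3$ row times a $3 \times 1$ column collapses to a single number, and that number is exactly the sum of squared residuals.

```python
import numpy as np

e = np.array([[0.5], [-1.2], [0.3]])   # a 3x1 column vector of residuals
print(e.T @ e)                          # [[1.78]], a 1x1 result, i.e. a scalar
print(np.sum(e ** 2))                   # 1.78, the same sum of squared residuals
```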

Given the definition of the residuals, we can write

$$ SSE = e^T e = (y - X\beta)^T (y - X\beta) = y^T y - \beta^T X^T y - y^T X\beta + \beta^T X^T X \beta $$

If you are forgetting your transpose properties, and it would not be untoward to charge this author with doing so, this last step might take a second or two. First, we must remember that the transpose respects addition: $(A + B)^T = A^T + B^T$. The transpose must be commended for its reasonableness here. But we must also remember that $(AB)^T = B^T A^T$. At first, this seems evidence of the transpose’s deceit and chicanery. The proof of this property, however, is so concise that to spite Fermat (in 1637, in the margin of one of his books, Pierre de Fermat proposed a conjecture, now known as Fermat’s Last Theorem, that would take 358 years to solve; while proposing his theorem, Fermat mentioned that he had already proved it, but that he didn’t have enough space in the margin to write it down. “I have discovered a truly marvelous demonstration of this proposition that this margin is too narrow to contain” is a phrase that haunted nearly 4 centuries of mathematicians), I shall include it in the margin.

Theorem: Let $A$ be an $m \times n$ matrix and $B$ an $n \times p$ matrix. Then $(AB)^T = B^T A^T$.

Proof: First observe that the $(i, j)$ entry of $AB$ can be written as

$$ (AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}. $$

Furthermore, if we transpose a matrix, we switch the rows and the columns. These facts together mean we can write

$$ \left((AB)^T\right)_{ij} = (AB)_{ji} = \sum_{k=1}^{n} A_{jk} B_{ki} $$

and

$$ \left(B^T A^T\right)_{ij} = \sum_{k=1}^{n} (B^T)_{ik} (A^T)_{kj} = \sum_{k=1}^{n} B_{ki} A_{jk}. $$

And here we can see that these last two results are the same summation, just with their members flipped around. Q.E.D.

Upon its consideration, we are forced to again congratulate the transpose on its integrity.
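If you would rather verify the transpose’s integrity numerically than read the margin, here is a throwaway check on random matrices (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(3, 4))

print(np.allclose((A @ B).T, B.T @ A.T))  # True: (AB)^T = B^T A^T
print(np.allclose((A + C).T, A.T + C.T))  # True: the transpose respects addition
```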

Consolidating our previous work and exploiting the fact that $(X\beta)^T = \beta^T X^T$ and that $y^T X\beta$ is simply a scalar and the transpose of a scalar is itself (so $\beta^T X^T y = y^T X\beta$), we get

$$ SSE = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta $$

As before, we now need to take the derivative with respect to our betas. Matrix calculus is fundamentally the same as scalar calculus, but the interactions with dimensions and the transpose add extra complexity, so don’t worry too much if the next piece isn’t immediately obvious.

$$ \frac{\partial SSE}{\partial \beta} = -2X^T y + 2X^T X \beta = 0 $$

Which gives us our multivariate OLS normal equations in matrix form:

$$ X^T X \beta = X^T y $$

Nice. The only unknown in this equation is $\beta$, so let’s put that on one side of the equation. To do so only requires recalling that a matrix times the inverse of itself gives the identity matrix (a matrix with 1s along the diagonal and zeros elsewhere), which does nothing to any matrix applied to it. Multiply both sides by the inverse of $X^T X$:

$$ (X^T X)^{-1} X^T X \beta = (X^T X)^{-1} X^T y $$

And discard the resulting identity matrix because it ipso facto does nothing:

$$ \hat{\beta} = (X^T X)^{-1} X^T y $$

Aha! We have it! Fully generalized, here at last is the precise definition of what we must do to obtain any regression, for any dataset, for any problem, forever (that is, like we mentioned, as long as there’s no perfect multicollinearity). And the truth is, most applied classes will leave it at that. Knowing that $\hat{\beta} = (X^T X)^{-1} X^T y$ allows you to derive a whole bunch of useful things about OLS, like that it’s the Best Linear Unbiased Estimator, and if you make some assumptions about the disturbances, you can get into the wonderful land of standard errors and heteroskedasticity (if you want to go in that direction, fine). But to actually solve this equation, you type “reg” into Stata and you’re done. If you’re like me, that’s unsatisfying.
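For the unsatisfied, here is roughly what typing “reg” boils down to, sketched in Python with NumPy on fabricated data: the textbook closed form $(X^T X)^{-1} X^T y$, alongside the way you’d more sensibly compute it (solving the normal equations, or handing the whole thing to a least-squares routine, which is about where the next post picks up).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Fabricated data: an intercept column of ones plus three explanatory variables.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
true_beta = np.array([1.0, 2.0, -0.5, 0.25])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# The textbook closed form: beta_hat = (X'X)^{-1} X'y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)

# What you'd actually do: solve the normal equations, or hand the whole thing
# to a least-squares routine (which factorizes X instead of inverting anything).
print(np.linalg.solve(X.T @ X, X.T @ y))
print(np.linalg.lstsq(X, y, rcond=None)[0])
```

Forming the explicit inverse is the textbook move, not the numerically wise one; that distinction is exactly the kind of implementation detail the next post is about.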

This math is the theoretical essence of the empirical revolution. Clinical trials and fiscal policy in just a few lines of linear algebra. There’s a good reason the applied world often stops here—everything that comes next is an implementation detail. But, Dear Reader, believe me when I say that the next time we ask, “But how does that work?”

That’s where it gets spicy.