The Idea of Regression

New Regression Tutorial (A Hypothetical Example With Runners)

Andy Grogan-Kaylor

2022-11-14

1 Introduction

Somewhere along the way in a math class in elementary, middle or high school, you may have encountered the idea of graphing lines and thinking about the equations that represent those lines.

We are going to try to illustrate the idea of regression analysis using some hypothetical data on runners.

2 Imagine A Single Runner

Let’s think about this idea using the hypothetical example of several runners.

The orange line represents a runner who runs 6 miles every hour. In 2 hours, this person will have run 12 miles, and if that runner continued for 3 hours of running, they would run 18 miles.

We can write an equation that represents the distance run in several equivalent ways, for example:

\[\text{distance} = 6 * \text{hours}\]

\[\text{miles} = 0 + 6 * \text{hours}\]

In this case, the runner’s speed (6 miles per hour) is what we call the slope of the line. The orange runner is getting 6 miles of distance for every hour spent running. Economists sometimes talk about this idea as the “rate of return”: For every hour of running, the orange runner gets 6 miles of distance.
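To make the idea of slope concrete, here is a minimal Python sketch that recovers the orange runner's speed from two points on the line. The two points come from the text above (12 miles after 2 hours, 18 miles after 3 hours); the function name `slope_between` is just an illustrative choice.

```python
# Two (hours, miles) points on the orange runner's line, taken from the text.
point_a = (2, 12)
point_b = (3, 18)


def slope_between(p, q):
    """Return the change in distance divided by the change in time."""
    (x1, y1), (x2, y2) = p, q
    return (y2 - y1) / (x2 - x1)


orange_speed = slope_between(point_a, point_b)
print(orange_speed)  # 6.0 miles for every hour of running
```

Any two points on the same straight line give the same answer, which is what makes the slope a useful single-number summary of the line.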

3 Imagine Two Additional Runners

Imagine now two other runners, represented by a red line and a blue line.

The red runner starts at the same place as the orange runner, but runs at a slower pace of 3 miles per hour. We can say that the red runner’s line is flatter than the orange runner’s line: its slope is 3 miles per hour, compared with 6.

The blue runner’s situation is somewhat different. The blue runner runs at the same speed as the orange runner. We can say that their lines on the graph have the same slope.

But after two hours of running, the blue runner is further along because the blue runner started 2 miles ahead of the orange runner. We need a new term to describe this idea. In graphing, and in statistics, we say that the blue runner’s line intercepts the y axis at a higher point than the line for the orange runner. Put another way, the blue runner has a higher y-intercept than the orange runner.

These two concepts of the slope and the y-intercept are the foundations of the idea of regression.
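As a rough illustration, the three runners can each be described by just these two numbers. The sketch below assumes the orange and red runners start at 0 miles and the blue runner starts at 2 miles, as described above; the dictionary layout and the function name `distance` are illustrative assumptions.

```python
def distance(hours, intercept, slope):
    """Distance covered: starting place plus speed times hours run."""
    return intercept + slope * hours


# Each runner is summarized by a y-intercept (head start, in miles)
# and a slope (speed, in miles per hour).
runners = {
    "orange": {"intercept": 0, "slope": 6},
    "red": {"intercept": 0, "slope": 3},
    "blue": {"intercept": 2, "slope": 6},
}

for name, line in runners.items():
    print(name, distance(2, **line))
# orange 12
# red 6
# blue 14
```

After 2 hours, the blue runner is 2 miles ahead of the orange runner because of the higher y-intercept, while the red runner is behind because of the flatter slope.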

4 A New Example With More Runners

Let’s stick with the hypothetical example of runners, but now let’s imagine a slightly different situation. Imagine that we have data on how far several different runners have run, and we want to find the average speed of these runners. (You could also think of this as the average rate of change of distance over time.)

I want to draw the line that best fits these data to get a sense of how fast runners run, on average.

My guess is drawn as a blue line.

I could even make a guess about the slope, which represents the average speed: how far does the distance go up for every hour that is run, on average?

I also need to think about where the line crosses, or intercepts, the y-axis.

My best guess about the slope and intercept together is drawn as a purple line. It looks like for every hour run, on average, runners run just under 5 miles, so let’s say 4.9 miles per hour.

It looks like my best guess is that all of the runners started the race at 0. That is to say, none of the runners had a head start, like they did in my first example.

So now, I could think about writing an equation for my line.

I know that the equation for my line is:

\[\text{distance} = \text{starting place} + \text{speed} * \text{hours} + e_i\]

The \(e_i\) is a new concept for us in this tutorial, and represents the error. To a certain degree we are going to make a wrong prediction about every runner in our data. We are trying to understand the distances that runners run, on average.

In this particular case my best guess is…

\[\text{miles} = 0 + 4.9 * \text{hours}\]

Expressed in more intuitive language: “for every 1 hour that a runner runs, on average, runners get 4.9 miles of distance.”
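The sketch below shows what the guessed line predicts and how far off each prediction is. The tutorial's data are not reproduced in this text, so the five `hours` and `miles` values are a hypothetical stand-in, chosen only for illustration; the variable names are assumptions as well.

```python
# Hypothetical data for five runners (an assumption, not the tutorial's data).
hours = [1, 1, 2, 2, 3]
miles = [5, 6, 8, 12, 15]

# The guessed line from the text: miles = 0 + 4.9 * hours
guess_intercept = 0
guess_slope = 4.9

predicted = [guess_intercept + guess_slope * h for h in hours]

# The error for each runner is the observed distance minus the prediction.
errors = [m - p for m, p in zip(miles, predicted)]

for h, m, p, e in zip(hours, miles, predicted, errors):
    print(f"hours={h} miles={m} predicted={p:.1f} error={e:.1f}")
```

No prediction is exactly right for any single runner, but the errors fall on both sides of zero, which is what we would expect from a line that describes the runners on average.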

In general, this is going to be a template for the kind of sentence we use in regression analysis: “For every 1 unit change in the independent variable, what is the change in the dependent variable?”

We need to think about a few more issues.

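In general, we call the starting place by a few different names: the y-intercept, the intercept, or the constant. It is often written \(\beta_0\).

We call the rate of change by a few different names as well: the slope, the regression coefficient, or simply the coefficient. It is often written \(\beta_1\).

So now we can talk about a more general form for our equation. A more general way of writing

\[\text{miles} = 0 + 4.9 * \text{hours}\]

would be to say

\[y_i = \beta_0 + \beta_1 x_i + e_i\]

where \(y_i\) is the dependent variable (miles), \(x_i\) is the independent variable (hours), and \(e_i\) is the error.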

5 Getting Estimates From Computer Software

The last thing that I want to do in this tutorial is to ask the computer to make a best guess about the slope and the y-intercept. We’ll learn something about how those estimates are made in other readings, and in lecture.

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   0.7143     1.969        0.3627    0.7409
hours         4.714      1.01         4.667     0.01857

Fitting linear model: miles ~ hours

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5              1.69                  0.8789    0.8386

As always, when looking at statistical output, there is probably more information than we want, so it makes sense to identify the crucial quantities.

It turns out that our estimate of the slope (4.9) was pretty close to what the computer finds (4.714), while our estimate that the intercept was 0 is a little bit different from the computer’s estimate (0.714). According to the computer, the best guess is that the average runner had a little bit of a head start.
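For readers curious about what the computer is doing, here is a minimal sketch of an ordinary least squares fit. It assumes that least squares is the method behind the output above, which is the standard approach for a linear model of this kind. The data are a hypothetical stand-in: five values chosen so that the fit matches the reported estimates, not necessarily the data used in the tutorial. The function name `least_squares` is illustrative.

```python
# Hypothetical data for five runners (an assumption, not the tutorial's data).
hours = [1, 1, 2, 2, 3]
miles = [5, 6, 8, 12, 15]


def least_squares(x, y):
    """Return (intercept, slope) of the line minimizing squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = least_squares(hours, miles)
print(round(intercept, 4), round(slope, 3))  # 0.7143 4.714
```

The line is chosen so that the sum of the squared errors is as small as possible, which is one precise way of defining a "best guess".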

Lastly, the \(R^2\) value indicates that a proportion of 0.879, or about 88%, of the variation in our dependent variable, miles, is explained by the variation in our independent variable, hours.
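As a sketch of where this number comes from, \(R^2\) can be computed as 1 minus the ratio of the variation left unexplained by the line to the total variation in miles. The intercept and slope below are the estimates reported in the output above; the data are again a hypothetical stand-in, an assumption made only so that the calculation can be run.

```python
# Hypothetical data for five runners (an assumption, not the tutorial's data).
hours = [1, 1, 2, 2, 3]
miles = [5, 6, 8, 12, 15]

# Estimates as reported in the regression output.
intercept = 0.7143
slope = 4.714

mean_miles = sum(miles) / len(miles)
predicted = [intercept + slope * h for h in hours]

# Variation left over after using the line to predict miles.
ss_residual = sum((m - p) ** 2 for m, p in zip(miles, predicted))
# Total variation of miles around its mean.
ss_total = sum((m - mean_miles) ** 2 for m in miles)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))  # 0.879
```

An \(R^2\) close to 1 means the line leaves little variation unexplained, while a value close to 0 means the line does little better than simply predicting the mean for every runner.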

(written by Andy Grogan-Kaylor. Comments and questions welcome and should be directed to agrogan@umich.edu)