
Simple Linear Regression

Although correlation is not concerned with causation in relationships among variables, a related statistical procedure called regression often is. Regression is used to assess the contribution of one or more “causing” variables (called independent variables) to one “caused” (or dependent) variable. It can also be used to predict the value of one variable from the values of others. When there is only one independent variable and when the relationship can be expressed as a straight line, the procedure is called simple linear regression.

Any straight line in two-dimensional space can be represented by this equation:

$$y = a + bx$$

where y is the variable on the vertical axis, x is the variable on the horizontal axis, a is the y-value where the line crosses the vertical axis (often called the intercept), and b is the amount of change in y corresponding to a one-unit increase in x (often called the slope). Figure 1 gives an example.





Figure 1

A straight line.


If no single line can be drawn such that all the points fall on it, what is the “best” line? The best line is the one that comes closest to the data points overall, that is, the one that minimizes the total squared vertical distance from the points to the line.

Regression is an inferential procedure, meaning that it can be used to draw conclusions about populations based on samples randomly drawn from those populations. Suppose that your ten exercise-machine owners were randomly selected to represent the population of all exercise-machine owners. In order to use this sample to make educated guesses about the relationship between the two variables (months of machine ownership and time spent exercising) in the population, you need to rewrite the equation above to reflect the fact that you will be estimating population parameters:

$$y = \beta_0 + \beta_1 x$$

All you have done is replace the intercept (a) with β0 and the slope (b) with β1. The formula to compute the parameter estimate β1 is




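$$\beta_1 = \frac{S_{xy}}{S_{xx}}$$

where

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

and

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$

The formula to compute the parameter estimate β0 is

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

where ȳ and x̄ are the two sample means.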

You have already computed the quantities that you need to substitute into these formulas for the exercise example, except for the mean of x (x̄), which is 64/10 = 6.4, and the mean of y (ȳ), which is 56/10 = 5.6. First, compute the estimate of the slope:

$$\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{285 - \frac{(64)(56)}{10}}{520 - \frac{(64)^2}{10}} = \frac{-73.4}{110.4} = -0.665$$

Now the intercept may be computed:

$$\beta_0 = \bar{y} - \beta_1 \bar{x} = 5.6 - (-0.665)(6.4) = 9.856$$

So the regression equation for the example is y = 9.856 − 0.665x. When you plot this line over the data points, the result looks like that shown in Figure 2.
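The same arithmetic can be checked with a short Python sketch. It is a minimal illustration, assuming the ten data pairs listed in Table 1; the function name `fit_line` and the variable names are chosen here for illustration and do not come from the text.

```python
# Exercise data from Table 1: months of ownership (x) and hours exercised (y).
months = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # S_xy and S_xx as defined in the formulas above.
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


b0, b1 = fit_line(months, hours)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Carried at full precision, the intercept comes out near 9.855 rather than 9.856. The small difference arises because the hand calculation rounds the slope to −0.665 before computing the intercept.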





Figure 2

Illustration of residuals.


The vertical distance from each data point to the regression line is the error, or residual, of the line's accuracy in estimating that point. Some points have positive residuals (they lie above the line); some have negative ones (they lie below it). If all the points fell on the line, there would be no error and no residuals. The mean of the sample residuals is always 0 because the least squares line is positioned so that the positive residuals above it exactly offset the negative residuals below it. The equations that you used to estimate the intercept and slope determine a line of “best fit” by minimizing the sum of the squared residuals. This method of regression is called least squares.
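Both properties of the residuals can be checked numerically. The sketch below assumes the Table 1 data and the rounded estimates 9.856 and −0.665 from the text; the helper name `sum_sq_residuals` is illustrative only.

```python
# Exercise data from Table 1.
months = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]

b0, b1 = 9.856, -0.665  # rounded estimates taken from the text


def sum_sq_residuals(intercept, slope, x, y):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(months, hours)]
best = sum_sq_residuals(b0, b1, months, hours)

print("sum of residuals:", round(sum(residuals), 6))
print("sum of squared residuals:", round(best, 3))

# Nudging either coefficient away from the fitted value makes the fit worse.
shifted = sum_sq_residuals(b0 + 0.5, b1, months, hours)
tilted = sum_sq_residuals(b0, b1 + 0.1, months, hours)
print("shifted line is worse:", shifted > best)
print("tilted line is worse:", tilted > best)
```

The two comparisons at the end are only spot checks, not a proof that the line is optimal, but they illustrate what "minimizing the sum of the squared residuals" means in practice.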

Because regression estimates usually contain some error (that is, all points do not fall on the line), an error term (ɛ, the Greek letter epsilon) is usually added to the end of the equation:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

The estimate of the slope β1 for the exercise example was −0.665. The slope is negative because the line slants down from left to right, as it must for two variables that are negatively correlated, reflecting that one variable decreases as the other increases. When the correlation is positive, β1 is positive, and the line slants up from left to right.

Confidence interval for the slope

Example 1: What if the slope is 0, as in Figure 3? That means that y has no linear dependence on x, or that knowing x does not contribute anything to your ability to predict y.





Figure 3

Example of uncorrelated data, so the slope is zero.


It is often useful to compute a confidence interval for a regression slope. If it contains 0, you would be unable to conclude that x and y are related. The formula to compute a confidence interval for β1 is

$$\beta_1 \pm t_{\alpha/2,\,n-2} \cdot \frac{s}{\sqrt{S_{xx}}}$$

where

$$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$

and

$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$

and where Σ(y − ŷ)² is the sum of the squared residuals, t_{α/2, n−2} is the critical value from the t-table corresponding to half the desired alpha level at n − 2 degrees of freedom, and n is the size of the sample (the number of data pairs). The test for this example will use an alpha of .05. A t distribution critical values table will show that t_{.025, 8} = 2.306.

Compute the quantity Σ(y − ŷ)² by subtracting each predicted y-value (ŷ) from each actual y-value, squaring the difference, and summing the squares (see Table 2). The predicted y-value (ŷ) is the y-value that would be predicted from each given x, using the formula ŷ = 9.856 − 0.665x. (Use Table 1 for reference.)

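TABLE 1 Exercise Data for Ten People

| Person          | 1 | 2  | 3 | 4 | 5 | 6 | 7 | 8 | 9  | 10 |
|-----------------|---|----|---|---|---|---|---|---|----|----|
| Months Owned    | 5 | 10 | 4 | 8 | 2 | 7 | 9 | 6 | 1  | 12 |
| Hours Exercised | 5 | 2  | 8 | 3 | 8 | 5 | 5 | 7 | 10 | 3  |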


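TABLE 2 Determining the Residuals for the Data in Table 1

| x   | y  | ŷ     | residual (y − ŷ) | residual² |
|-----|----|-------|------------------|-----------|
| 5   | 5  | 6.530 | −1.530           | 2.341     |
| 10  | 2  | 3.205 | −1.205           | 1.452     |
| 4   | 8  | 7.195 | 0.805            | 0.648     |
| 8   | 3  | 4.535 | −1.535           | 2.356     |
| 2   | 8  | 8.525 | −0.525           | 0.276     |
| 7   | 5  | 5.200 | −0.200           | 0.040     |
| 9   | 5  | 3.870 | 1.130            | 1.277     |
| 6   | 7  | 5.865 | 1.135            | 1.288     |
| 1   | 10 | 9.190 | 0.810            | 0.656     |
| 12  | 3  | 1.875 | 1.125            | 1.266     |
| Sum |    |       | 0                | 11.600    |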

Now, compute s:

$$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} = \sqrt{\frac{11.600}{8}} = 1.204$$

You have already determined that Sxx = 110.4; you can proceed to the main formula:

$$-0.665 \pm 2.306 \cdot \frac{1.204}{\sqrt{110.4}} = -0.665 \pm 0.264 = (-0.929,\ -0.401)$$

You can be 95 percent certain that the population parameter β1 (the slope) is no lower than −0.929 and no higher than −0.401. Because this interval does not contain 0, you would be able to reject the null hypothesis that β1 = 0 and conclude that these two variables are indeed related in the population.
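The interval can be reproduced in a few lines of Python. This is a sketch under stated assumptions: it uses the Table 1 data, the rounded estimates 9.856 and −0.665, and the critical value 2.306 read from a t-table as in the text, rather than computing the critical value from a statistics library.

```python
import math

# Exercise data from Table 1.
months = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]
n = len(months)

b0, b1 = 9.856, -0.665  # rounded estimates from the text
t_crit = 2.306          # t critical value for alpha/2 = .025 with 8 df (from the text)

# Sum of squared residuals, then the standard error of estimate s.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(months, hours))
s = math.sqrt(sse / (n - 2))

# S_xx measures the spread of the x-values around their mean.
s_xx = sum(x * x for x in months) - sum(months) ** 2 / n

margin = t_crit * s / math.sqrt(s_xx)
lower, upper = b1 - margin, b1 + margin
print(f"s = {s:.3f}")
print(f"95% interval for the slope: ({lower:.3f}, {upper:.3f})")
```

Because both endpoints are negative, the interval excludes 0, which matches the conclusion drawn above.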

Confidence interval for prediction

You have learned that you could predict a y-value from a given x-value. Because there is some error associated with your prediction, however, you might want to produce a confidence interval rather than a simple point estimate. The formula for a prediction interval for y for a given x is

$$\hat{y} \pm t_{\alpha/2,\,n-2} \cdot s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}$$

where

$$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$

and where ŷ is the y-value predicted for x using the regression equation, x̄ is the sample mean of x, Sxx is the same quantity used for the slope, t_{α/2, n−2} is the critical value from the t-table corresponding to half the desired alpha level at n − 2 degrees of freedom, and n is the size of the sample (the number of data pairs).

Example 2: What is a 90 percent confidence interval for the number of hours spent exercising per week if the exercise machine is owned 11 months?

The first step is to use the original regression equation to compute a point estimate for y:

$$\hat{y} = 9.856 - 0.665(11) = 2.541$$

For a 90 percent confidence interval, you need to use t_{.05, 8}, which a t distribution critical values table shows to be 1.860. You have already computed the remaining quantities, so you can proceed with the formula:

$$2.541 \pm 1.860\,(1.204)\sqrt{1 + \frac{1}{10} + \frac{(11 - 6.4)^2}{110.4}} = 2.541 \pm 2.545 = (-0.004,\ 5.086)$$

You can be 90 percent confident that the number of hours spent exercising per week by an owner when x (number of months machine owned) = 11 is between about 0 and 5.
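Example 2 can be retraced in Python as follows. The sketch assumes the Table 1 data, the rounded estimates from the text, and the table value 1.860; the name `x_new` for the months of ownership being predicted is illustrative.

```python
import math

# Exercise data from Table 1.
months = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]
n = len(months)

b0, b1 = 9.856, -0.665  # rounded estimates from the text
t_crit = 1.860          # t critical value for alpha/2 = .05 with 8 df (from the text)
x_new = 11              # months of ownership to predict for

x_bar = sum(months) / n
s_xx = sum(x * x for x in months) - sum(months) ** 2 / n
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(months, hours))
s = math.sqrt(sse / (n - 2))

# Point estimate, then the margin for a single new observation at x_new.
y_hat = b0 + b1 * x_new
margin = t_crit * s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)
lower, upper = y_hat - margin, y_hat + margin
print(f"point estimate: {y_hat:.3f}")
print(f"90% prediction interval: ({lower:.3f}, {upper:.3f})")
```

The lower limit falls slightly below 0, which is not a possible number of hours; in practice you would report the interval as running from 0 to about 5.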

Assumptions and cautions

The use of regression for parametric inference assumes that the errors (ɛ) are (1) independent of each other and (2) normally distributed with the same variance for each level of the independent variable. Figure 4 shows a violation of the second assumption. The errors (residuals) are greater for higher values of x than for lower values.





Figure 4

Data with increasing variance as x increases.


Least squares regression is sensitive to outliers, or data points that fall far from most other points. If you were to add the single data point x = 15, y = 12 to the exercise data, the regression line would change to the dotted line shown in Figure 5. You need to be wary of outliers because they can influence the regression equation greatly.





Figure 5

Least squares regression is sensitive to outliers.
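To see how much a single point can matter, you can refit the line with and without the outlier. The sketch below assumes the Table 1 data plus the extra point (15, 12) mentioned above; `fit_line` is an illustrative helper, not a function from the text.

```python
# Exercise data from Table 1.
months = [5, 10, 4, 8, 2, 7, 9, 6, 1, 12]
hours = [5, 2, 8, 3, 8, 5, 5, 7, 10, 3]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    slope = s_xy / s_xx
    intercept = sum(y) / n - slope * sum(x) / n
    return intercept, slope


b0, b1 = fit_line(months, hours)

# Refit after adding the single outlying point x = 15, y = 12.
b0_out, b1_out = fit_line(months + [15], hours + [12])

print(f"without outlier: y = {b0:.3f} + ({b1:.3f})x")
print(f"with outlier:    y = {b0_out:.3f} + ({b1_out:.3f})x")
```

With the extra point included, the slope shrinks from about −0.665 to roughly −0.13, so one observation out of eleven removes most of the apparent relationship.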


It can be dangerous to extrapolate in regression, that is, to predict values beyond the range of the data set. The regression model assumes that the straight line extends to infinity in both directions, which is often not true. According to the regression equation for the example, people who have owned their exercise machines longer than around 15 months do not exercise at all. It is more likely, however, that “hours of exercise” reaches some minimum threshold and then declines only gradually, if at all (see Figure 6).





Figure 6

Extrapolation beyond the data is dangerous.
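A quick calculation shows where the fitted line stops making sense. The sketch below assumes the rounded regression equation from the text and the observed range of 1 to 12 months from Table 1; the function name `predict_hours` and the chosen test values are illustrative.

```python
b0, b1 = 9.856, -0.665  # rounded regression estimates from the text
observed_range = (1, 12)  # smallest and largest months owned in Table 1


def predict_hours(months_owned):
    """Predicted weekly exercise hours from the fitted straight line."""
    return b0 + b1 * months_owned


# The x-value at which the line predicts zero hours of exercise.
zero_crossing = -b0 / b1
print(f"line reaches zero hours at about {zero_crossing:.1f} months")

for m in (6, 12, 15, 20, 24):
    inside = observed_range[0] <= m <= observed_range[1]
    label = "within data" if inside else "extrapolated"
    print(f"{m:>2} months: {predict_hours(m):6.3f} hours ({label})")
```

Past the zero crossing the line predicts negative hours, which is impossible, so predictions for ownership beyond the observed 12 months should be treated with suspicion.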

