## Correlation

Consider Table 1, which contains measurements on two variables for ten people: the number of months the person has owned an exercise machine and the number of hours the person spent exercising in the past week. If you display these data pairs as points in a scatter plot (see Figure 1), you can see a definite trend. The points appear to form a line that slants from the upper left to the lower right. As you move along that line from left to right, the values on the vertical axis (hours of exercise) get smaller, while the values on the horizontal axis (months owned) get larger. Another way to express this is to say that the two variables are inversely related: The more months the machine was owned, the less the person tended to exercise.

Figure 1. The data in Table 1 is an example of negative correlation.

These two variables are correlated. More than that, they are correlated in a particular direction: negatively. For an example of a positive correlation, suppose that instead of displaying “hours of exercise” on the vertical axis, you put the person's score on a test that measures cardiovascular fitness (see Figure 2). The pattern of these data points suggests a line that slants from lower left to upper right, which is the opposite of the direction of slant in the first example. Figure 2 shows that the longer the person has owned the exercise machine, the better his or her cardiovascular fitness tends to be. This might be true in spite of the fact that time spent exercising decreases the longer the machine has been owned because purchasers of exercise machines might be starting from a point of low fitness, which may improve only gradually.

Figure 2. An example of positive correlation.

If two variables are positively correlated, as the value of one increases, so does the value of the other. If they are negatively (or inversely) correlated, as the value of one increases, the value of the other decreases.

A third possibility remains: that as the value of one variable increases, the value of the other neither increases nor decreases. Figure 3 is a scatter plot of months the exercise machine has been owned (horizontal axis) by the person's height (vertical axis). No line trend can be seen in the plot. New owners of exercise machines may be short or tall, and the same is true of people who have had their machines longer. These two variables appear to be uncorrelated.

Figure 3. An example of uncorrelated data.

You can go even further in expressing the relationship between variables. Compare the two scatter plots in Figure 4. Both plots show a positive correlation because, as the values on one axis increase, so do the values on the other. But the data points in Figure 4(b) are more closely packed than the data points in Figure 4(a), which are more spread out. If a line were drawn through the middle of the trend, the points in Figure 4(b) would be closer to the line than the points in Figure 4(a). In addition to direction (positive or negative), correlations also can have strength, which is a reflection of the closeness of the data points to a perfect line. Figure 4(b) shows a stronger correlation than Figure 4(a).

Figure 4. (a) Weak and (b) strong correlations.

Pearson's product moment coefficient (r), commonly referred to as the correlation coefficient, is a quantitative measure of correlation between two interval‐level variables. The coefficient r can take values from –1.0 to 1.0. The sign of r indicates whether the correlation is positive or negative. The magnitude (absolute value) of r indicates the strength of the correlation, or how close the array of data points is to a straight line.

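Two computing formulas for r are

$$r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{(n - 1)\, s_x s_y}$$

and

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

where $\sum xy$ is the sum of the xy cross‐products (each x multiplied by its paired y), $n$ is the size of the sample (the number of data pairs), $\sum x$ and $\sum y$ are the sums of the x and y values, $s_x$ and $s_y$ are the sample standard deviations of x and y, and $\bar{x}$ and $\bar{y}$ are the means.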
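As a rough illustration, the sketch below computes r two ways: once from the raw sums (Σxy, Σx, Σy) and once from deviations about the means, each divided by (n − 1) times the two sample standard deviations. The data values and the function names (`sample_sd`, `r_from_sums`, `r_from_deviations`) are made up for this example and are not the values from Table 1 or Table 2.

```python
import math


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator), from raw sums."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return math.sqrt((total_sq - total ** 2 / n) / (n - 1))


def r_from_sums(x, y):
    """Pearson's r computed from the sum of cross-products and the raw sums."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = sum_xy - sum(x) * sum(y) / n
    return numerator / ((n - 1) * sample_sd(x) * sample_sd(y))


def r_from_deviations(x, y):
    """Pearson's r computed from deviations about the two means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return numerator / ((n - 1) * sample_sd(x) * sample_sd(y))


# Hypothetical data for illustration only (NOT the values in Table 1).
months_owned = [1, 2, 3, 4, 5, 6]
hours_exercised = [9, 8, 8, 6, 5, 3]

r1 = r_from_sums(months_owned, hours_exercised)
r2 = r_from_deviations(months_owned, hours_exercised)
print(f"r from sums:       {r1:.4f}")
print(f"r from deviations: {r2:.4f}")
```

Both functions should agree up to floating-point rounding, because the two numerators are algebraically the same quantity. The result is negative here because hours fall as months rise in the made-up data.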

Use Table 2 to compute r for the relationship between months of exercise‐machine ownership and hours of exercise per week. The first step is to compute the components required in the main formula. Let x be months of ownership and y be hours of exercise, although you could also do the reverse. Now, compute the sample standard deviations for x and y using the formula:

$$s = \sqrt{\frac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1}}$$

Finally, substituting these components into the formula for r gives r = –0.899. A correlation of r = –0.899 is almost as strong as the maximum negative correlation of –1.0, reflecting the fact that your data points fall relatively close to a straight line.

Finding the significance of r

You might want to know how significant an r of –0.899 is. The formula to test the null hypothesis that R (the population correlation) = 0 is

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

where r is the sample correlation coefficient, and n is the size of the sample (the number of data pairs). For this example, the formula gives t = –5.807. The probability of t may be looked up in Table 3 (in "Statistics Tables") using n − 2 degrees of freedom. The probability of obtaining a t of –5.807 with 8 df (drop the sign when looking up the value of t) is lower than the lowest listed probability of 0.0005. If the correlation between months of exercise‐machine ownership and hours of exercise per week were actually 0, you would expect an r of –0.899 or lower in fewer than one out of a thousand random samples.
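The test statistic can be checked with a few lines of code. This is a minimal sketch that assumes the usual t statistic for a correlation, t = r / sqrt((1 − r²) / (n − 2)), and plugs in the values quoted in the text (r = –0.899, n = 10). The function name `t_for_correlation` is invented for the example.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing the null hypothesis that the population correlation is 0."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = -0.899  # sample correlation quoted in the text
n = 10      # number of data pairs

t = t_for_correlation(r, n)
df = n - 2  # degrees of freedom used for the table lookup

print(f"t = {t:.3f} with {df} degrees of freedom")
```

With r already rounded to three decimals, the result comes out at roughly –5.81. Any small difference from the –5.807 in the text is presumably due to rounding r before it is substituted.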

To evaluate a correlation coefficient, first determine its significance. If the probability that the coefficient resulted from chance is not acceptably low, the analysis should end there; neither the coefficient's sign nor its magnitude may reflect anything other than sampling error. If the coefficient is statistically significant, the sign indicates the direction of the relationship, and the magnitude indicates its strength. Remember, however, that with a large enough n, even a very weak correlation becomes statistically significant.

Even if it's statistically significant, whether a correlation of a given magnitude is substantively or practically significant depends greatly on the phenomenon being studied. Generally, correlations tend to be higher in the physical sciences, where relationships between variables often obey uniform laws, and lower in the social sciences, where relationships may be harder to predict. A correlation of 0.4 between a pair of sociological variables may be more meaningful than a correlation of 0.7 between two variables in physics.

Bear in mind also that the correlation coefficient measures only straight‐line relationships. Not all relationships between variables trace a straight line. Figure 5 shows a curvilinear relationship such that values of y increase along with values of x up to a point, then decrease with higher values of x. The correlation coefficient for this plot is 0, the same as for the plot in Figure 3. This plot, however, shows a relationship that Figure 3 does not.
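A small sketch can make this concrete. The data below are invented for illustration (they are not the points plotted in Figure 5): y rises with x up to the midpoint and then falls symmetrically, so y is completely determined by x, yet the straight-line correlation is zero. The helper `pearson_r` is a hypothetical name. It uses the deviation form of r, written with sums of squares, which is algebraically the same as dividing by (n − 1) times the two sample standard deviations.

```python
import math


def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical curvilinear data: an inverted parabola peaking at x = 4.
x_values = [1, 2, 3, 4, 5, 6, 7]
y_values = [9 - (x - 4) ** 2 for x in x_values]  # 0, 5, 8, 9, 8, 5, 0

r_curved = pearson_r(x_values, y_values)
print(f"r for the curved data: {r_curved:.3f}")
```

The positive cross-products on the rising half cancel the negative ones on the falling half, which is why r fails to register a relationship that is obvious in a plot.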

Correlation does not imply causation. The fact that Variable A and Variable B are correlated does not necessarily mean that A caused B or that B caused A (though either may be true). If you were to examine a database of demographic information, for example, you would find that the number of churches in a city is correlated with the number of violent crimes in the city. The reason is not that church attendance causes crime, but that these two variables both increase as a function of a third variable: population.

Figure 5. Data that can cause trouble with correlation analysis.

Also note that the scales used to measure the two variables have no effect on their correlation. If you had converted hours of exercise per week to minutes per week, and/or months machine owned to days owned, the correlation coefficient would have been the same.
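The scale-invariance claim is easy to check numerically. The sketch below uses made-up data (not the values in Table 1), a hypothetical helper `pearson_r`, and the simplifying assumption of 30 days per month. It converts months to days and hours to minutes, then compares the two coefficients.

```python
import math


def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data for illustration only.
months_owned = [1, 2, 3, 4, 5, 6]
hours_per_week = [9, 8, 8, 6, 5, 3]

# Rescale both variables (assumption: 30 days per month).
days_owned = [m * 30 for m in months_owned]
minutes_per_week = [h * 60 for h in hours_per_week]

r_original = pearson_r(months_owned, hours_per_week)
r_rescaled = pearson_r(days_owned, minutes_per_week)

print(f"r (months, hours):  {r_original:.4f}")
print(f"r (days, minutes):  {r_rescaled:.4f}")
print("same coefficient:", math.isclose(r_original, r_rescaled))
```

Multiplying a variable by a positive constant scales the numerator and the denominator of r by the same factor, so the ratio is unchanged.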

The coefficient of determination

The square of the correlation coefficient r is called the coefficient of determination and can be used as a measure of the proportion of variability that two variables share, or how much one can be “explained” by the other. The coefficient of determination for this example is $(-0.899)^2 = 0.808$. Approximately 81 percent of the variability of each variable is shared with the other variable.
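The arithmetic is small enough to verify directly. The sketch below squares the r quoted in the text; the variable names are chosen for this example only.

```python
r = -0.899  # sample correlation quoted in the text

r_squared = r ** 2                  # coefficient of determination
shared_percent = 100 * r_squared    # proportion of shared variability, as a percentage

print(f"r squared = {r_squared:.3f}")
print(f"shared variability = {shared_percent:.1f} percent")
```

Because squaring discards the sign, a correlation of +0.899 would give the same coefficient of determination. The direction of the relationship has to be read from r itself.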
