Correlation analysis; a statistical test for relationships between two sets of data.

W. Michael Childress

Suppose you find yourself asking a question like one of the following:

* You have lists of heights and weights for a group of ten people. Right off hand you would think that taller people are usually heavier than shorter people, but there are some tall slender people and some short pudgy people in the group. Can you say that for the group (or for any group with height and weight for each person), taller people are heavier and shorter people are lighter?

* In your local tavern, you get into a discussion about football teams. You contend that teams with better defenses (fewer points allowed) tend to win more games, while your buddy contends that offense is the key to a better won-lost record. Who is right, or are you both right, or are you both wrong?

* Two teachers are asked to independently rank students in a class from first to last according to how good a student they perceive each person to be. Are the rankings consistent between the teachers, or do the teachers have different perceptions about who the good students are?

If you have faced such questions and don't know exactly how to approach them, your personal computer can once again come to your assistance. The appropriate approach for these examples is correlation analysis, and small computers are ideal tools for performing all the necessary calculations.

Correlation analysis is a family of statistical tests to determine mathematically whether there are trends or relationships between two or more sets of data from the same list of items or individuals (for example, heights and weights of people). The tests provide a statistical yes or no as to whether a significant relationship or correlation exists between the variables (for example, there is a significant tendency for taller people to be heavier).

A correlation test consists of calculating a correlation coefficient from the two data sets (a data set might be a list of heights, for example) and then comparing this coefficient to an appropriate entry in a table of correlation coefficient criterion numbers.

The entry or criterion is selected according to the number of items or data pairs in the set (10 in the heights and weights example, 28 in the case of NFL teams). If the absolute value of the coefficient is greater than or equal to the selected criterion, then there is a significant correlation or relationship between the two data sets. Details of the correlation tests and the strict meaning of significance are described in the sidebar.
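As a rough illustration of the first step, calculating a coefficient from two data sets, here is a minimal sketch in Python rather than the Applesoft Basic of Listing 1. It uses the usual textbook form of the product-moment coefficient; the function name `pearson_r` and the sample numbers are assumptions made up for this example, not the article's Table 1 data or the author's code.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient for paired numerical data."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three matched data pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a data set with no variation has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative pairs (inches, pounds), not the article's data.
heights = [60, 62, 64, 66, 68]
weights = [110, 120, 135, 140, 160]
r = pearson_r(heights, weights)
print(round(r, 4))
```

The coefficient always falls between -1 and +1; the closer it is to either end, the stronger the straight-line relationship.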

The program presented in Listing 1 will conduct a correlation test on data that you enter by pairs for each item or individual. The program prompts for data input, calculates the appropriate coefficient, compares the coefficient to the appropriate table criterion (stored in an array), and then presents the results as a yes or no answer.

This program considers correlations between exactly two sets of data (heights and weights, or points scored and number of wins). The data in the sets do not have to be strictly numerical data such as counts or measurements, however. Two types of correlation coefficients can be calculated by the program.

The first is used when the two sets of data are both numerical data--number of inches high, number of pounds, number of points scored, and number of wins. The second is used when one or both sets of data are rankings--from first to however many items are on the list. For example, the tallest person in the group would be 1, the next tallest 2, and so on to the shortest person at number 10. Similarly, football teams could be ranked according to won-lost records from 1 to 28.

The use of ranking data requires use of a special coefficient (Spearman's coefficient of correlation), so the program specifically asks what type of data you will enter. If one set of data is ranks, then the second set must also be in ranks. This requires that you convert the second variable data into ranks before entering them into the computer.

Ties in rankings are simply resolved by using the average ranking for all items tied for that rank. For example, if two people are tied for the third rank slot at six feet tall, the rank used in the correlation test would be the average of the two rank slots 3 and 4, or 3.5. It is not necessary that you enter the data for each item in sequence according to rankings, just that the rank number for each item reflect its ranking in the total list of items for that variable or data set.
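The tie rule described above can be sketched as a small Python helper. The name `average_ranks` and the sample heights are hypothetical; Listing 1 expects you to do this conversion by hand before entering the data.

```python
def average_ranks(values, largest_first=True):
    """Rank values 1..n, giving tied values the average of the slots they occupy."""
    # Indices of the items, ordered from first place to last place.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=largest_first)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next item ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # average of rank slots pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

heights_in = [76, 74, 72, 72, 70]   # made-up heights in inches
print(average_ranks(heights_in))    # [1.0, 2.0, 3.5, 3.5, 5.0]
```

The two people tied at 72 inches share slots 3 and 4, so each receives 3.5, as in the example in the text. The ranks are returned in the original entry order, so the data need not be sorted first.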

Two more notes about correlations are important. First, significant correlations can be positive or negative. A positive correlation indicates that when one attribute (height) increases, the other also tends to increase (weight). In the height-weight example, this would indicate that taller people are heavier and shorter people are lighter. A negative correlation indicates that as one attribute (height) increases, the other (weight) decreases; that is, taller people tend to be lighter, and shorter people to be heavier. The program in Listing 1 will determine whether the data entered indicate a significant positive correlation, a significant negative correlation, or no significant correlation.

The other note is that the correlation test in the program specifically tests for straight-line correlations. If the data are actually related in a non-linear relationship (i.e., not y=a+m*x, but something like y=a*x*x or some other curvilinear relationship), then the correlation test will probably not be able to detect it. Testing for non-linear relationships steps up into intermediate statistics and is far beyond the scope of this program. Such tests can be found in most intermediate statistics texts.
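To see why a straight-line test can miss a curved relationship, consider this small Python sketch with made-up data in which y is determined exactly by x (y = x*x), yet the product-moment coefficient comes out as zero. The function `pearson_r` is an illustrative helper using the standard textbook formula, not code from Listing 1.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient (standard textbook form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect parabola, symmetric about x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(r)  # zero: the rising and falling halves cancel each other out
```

Plotting the data before running a test is a cheap way to catch cases like this.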

Correlation Program

The correlation test program is presented in Listing 1. The Basic used is AppleSoft for an Apple II or III. This code should be very easy to convert to other Basic dialects as needed. The listing is divided into segments for clarity and for ease in making enhancements.

The data input segment runs from lines 200 to 400. In this segment, the program asks you to enter the number of items N, the data type (ranks or numerical), and then, using a FOR loop, each data pair. Data are stored in two arrays A and B, which are dimensioned to size N in line 240. When the data are all entered, the program displays a CALCULATING message while all coefficient calculations and tests are proceeding.

The next segment runs from lines 400 to 600, and includes the correlation coefficient calculations. If numerical data are entered, then the product-moment coefficient is calculated in lines 420 to 495. The calculations are derived from the equations described in the sidebar. If the data are rankings, the rank correlation coefficient is calculated in lines 500 to 560. In either case, the program produces a coefficient R.
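For readers without the sidebar at hand, the rank calculation can be sketched in Python using the common shortcut formula for Spearman's coefficient, 1 - 6*sum(d*d) / (n*(n*n - 1)), where d is the difference between the two ranks of each item. This is an assumption about the general method, not a transcription of lines 500 to 560; the function name `spearman_rs` and the sample ranks are made up. Note that the shortcut is exact only when there are no ties and is an approximation when average ranks are used.

```python
def spearman_rs(ranks_a, ranks_b):
    """Spearman rank correlation via the shortcut formula (exact without ties)."""
    n = len(ranks_a)
    if n != len(ranks_b) or n < 2:
        raise ValueError("need two rank lists of equal length, at least two items")
    # Sum of squared differences between each item's two ranks.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Two made-up rankings of the same five items.
teacher_1 = [1, 2, 3, 4, 5]
teacher_2 = [2, 1, 3, 5, 4]
print(spearman_rs(teacher_1, teacher_2))  # about 0.8
```

Identical rankings give +1, and exactly reversed rankings give -1.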

The coefficient must be compared to a criterion number which is appropriate for the number of items in the data lists. A series of criteria for various numbers of items is loaded into array AA in the correlation criterion segment, lines 600 to 770. The program selects the criterion for this set of data in lines 700 to 755, based on the number of items.
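The lookup step might be sketched in Python as follows. The contents of array AA are not reproduced in this text, so the table below is an assumption: it holds the standard 5 percent (two-tailed) criteria for the product-moment coefficient found in most statistics texts, indexed by the number of data pairs. The names `CRITERIA_5PCT` and `criterion_for` are hypothetical. The entry for ten pairs, 0.632, agrees with the criterion quoted for the example run later in this article.

```python
# Standard 5% two-tailed criteria for the product-moment coefficient,
# keyed by number of data pairs N (degrees of freedom = N - 2).
# Assumed table; the article's own array AA is not shown here.
CRITERIA_5PCT = {
    3: 0.997, 4: 0.950, 5: 0.878, 6: 0.811, 7: 0.754,
    8: 0.707, 9: 0.666, 10: 0.632, 11: 0.602, 12: 0.576,
}

def criterion_for(n):
    """Return the criterion for n data pairs, or fail if n is outside the table."""
    try:
        return CRITERIA_5PCT[n]
    except KeyError:
        raise ValueError(f"no criterion stored for {n} data pairs") from None

print(criterion_for(10))  # 0.632
```

Notice that the criterion falls as the number of pairs rises: with more data, a weaker tendency is enough to count as significant.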

The actual comparison of the coefficient with the criterion and the display of test results occurs in the correlation test segment, lines 800 to 970. In line 810, the coefficient is tested for a positive correlation, and for a negative correlation in the next line. If the coefficient does not pass either test, then no significant correlation exists. The final segment is the calculation results display beginning in line 1000. Here the coefficient and criterion are displayed as explanation for the test conclusions.
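The three-way decision described above can be sketched in a few lines of Python. The function name `correlation_verdict` is made up for illustration; the numbers in the usage line are the coefficient and criterion reported for the numerical height/weight run in this article.

```python
def correlation_verdict(r, criterion):
    """Compare a coefficient with its criterion and report the test result."""
    if r >= criterion:
        return "significant positive correlation"
    if r <= -criterion:
        return "significant negative correlation"
    return "no significant correlation"

print(correlation_verdict(0.8059, 0.632))  # significant positive correlation
```

Testing the positive and negative cases separately keeps the sign of the relationship in the answer, which a single comparison on the absolute value would lose.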

Correlation Examples

An example set of data for the correlation tests is given in Table 1. These are heights and weights for the first ten players on a football roster from 1982. The players in this case are the items, and height and weight are the two data sets that will be tested for a correlation. Both the actual height/weight numerical and ranking data are given for these individuals so that correlation tests can be performed for numerical and ranked data sets. Note that ties in height rankings were resolved by using average rankings for the tied items.

The correlation test program was run twice, once for the numerical data and again for the ranking data. Results from these tests are shown in Figures 1 and 2. Figure 1 shows the program run with the numerical height/weight data. The resulting correlation coefficient was about 0.8059, which exceeds the criterion number 0.632. This indicates a significant positive correlation; taller players tend to be heavier and shorter players tend to be lighter. Figure 2 uses the rankings of these data in another program run. The resulting rank correlation coefficient, 0.8363, is close to the other coefficient and also indicates a significant positive correlation. This is expected since the rankings were derived from the numerical data.

These examples indicate that rankings can provide a basis for statistical tests on data that may have very subjective evaluations, such as the "good student" example at the beginning of this article. As long as a numbered rank can be assigned to both attributes, the rank correlation should be applicable for almost any situation.

Program Enhancements

The program as listed is designed to be short and simple. A variety of enhancements can be made to increase its power and utility. Some suggestions are listed below.

* Increase the number of data points that can be considered. This will require additional criteria for larger numbers of items. These are available in most statistics texts for numbers of items up to and over 1000. Fortunately, the criteria do not change much for numbers greater than 1000.

* Save the data on disk after input. To keep this program as general as possible, I have not included any disk commands. However, it would be very easy to store the data on a disk since the data are already organized into arrays. This could be a subroutine option which the program would ask about immediately after all data were entered.

* Use data from a disk. Instead of entering the data by hand for each test, you could easily have the program input the data from a disk into the arrays A and B. This might be another program option, so that you could select from alternative sources of data.

* Display entered data. Once entered or read from a disk, the data could be displayed in a table on the monitor, or could be sent to a printer for a formatted printout. The results of the correlation test could also be sent to the printer for a permanent copy. The sample runs in Figures 1 and 2 were derived by diverting all monitor displays to the printer. On an Apple II, this involves entering PR#1, or whatever slot number your printer card is in, then running the program.

* Convert the program into a subroutine. This program can be easily renumbered by hand or using an appropriate utility, and put into any program for ready access. Two things to be very careful of in this case are using variables that are used elsewhere in the main program and redimensioning arrays if the subroutine is used more than once.

* Include the program in a statistical package. This is something you should consider very seriously. Calculating means and variances is a snap for beginning programmers, and more complex statistical calculations can be obtained both from elementary statistics books and articles like this one in various personal computing magazines. Other good programs to be included would be plotting and graphing routines which would show you how the data looked as well as performing tests. All these could be included on one disk for your own data analyses, especially if you included a menu program that would run the programs as selected.
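As one possible shape for the disk suggestions above, here is a minimal Python sketch that writes the two data lists to a file as one pair per line and reads them back. It is not Applesoft disk code; the function names `save_pairs` and `load_pairs` and the comma-separated file layout are assumptions chosen for illustration.

```python
import csv

def save_pairs(path, a, b):
    """Write the paired data to disk, one 'a,b' pair per line."""
    if len(a) != len(b):
        raise ValueError("both data sets must have the same number of items")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(zip(a, b))

def load_pairs(path):
    """Read paired data back into two lists of numbers."""
    a, b = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:          # skip blank lines
                continue
            a.append(float(row[0]))
            b.append(float(row[1]))
    return a, b
```

Keeping one pair per line means the file can also be checked or corrected by hand in any text editor.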