
Correlation: Descriptive Aspects (Chapters 5 and 14)

- So far, we have had IVs that have two or more levels/groups, and people are put in one of the groups (t-test, ANOVA)

- The general question so far has been: Do the groups differ on the DV? Or, do scores on the DV depend on the level of the IV?

- When both the IV and DV are quantitative, we test the relationship between the two using linear equations.

- The statistical tests used here are correlation and regression.

- These procedures can be used to analyze the relationship between two variables when:

1. Both variables are quantitative and measured at an interval level (with ANOVA we have a qualitative IV or IVs and a quantitative DV).

2. The two variables have been measured on the same individuals.

- Basically, what we’re asking here is whether changes in one variable are associated with changes in the other variable (i.e., is there a relationship between the two variables such that when one changes in a certain direction, the other also changes in a certain direction?).

- We test for correlation using procedures based on the general linear equation:

Y = a + bX

- Y = score on the DV

- X = score on the IV

- b = slope of the line (change in Y for every 1 unit change in X)

- a = y-intercept (score on Y when X = 0)

- If you have this information for any two continuous variables, you can plot the line relating X and Y (see the sketch below).
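As a minimal sketch, the linear equation can be evaluated directly in Python (the intercept and slope values below are made up purely for illustration and do not come from any example in these notes):

    # Evaluate the linear equation Y = a + bX.
    # The intercept (a) and slope (b) are made-up values for illustration.
    a = 2.0   # y-intercept: the value of Y when X = 0
    b = 0.5   # slope: the change in Y for every 1-unit change in X

    def predict_y(x):
        """Return the Y value on the line Y = a + bX for a given X."""
        return a + b * x

    for x in [0, 1, 2, 3]:
        print(f"X = {x}: Y = {predict_y(x)}")

    # With b positive, Y increases as X increases; a negative b would mean
    # Y decreases as X increases.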

Y-Intercept (a)

- The intercept of a linear model is the point at which the line intersects the Y axis

- This is symbolized by the letter a

- Also can be thought of as the value for Y when X = 0

Slope (b)

- Slope is the term we use to refer to how steep a line is on a scatterplot

- Slope can be either positive or negative

- A positive slope indicates that an increase in your X variable is associated with an increase in your Y variable (or a decrease in X with a decrease in Y)

- This is called a positive, or direct relationship

- A negative slope indicates that an increase in your X variable is associated with a decrease in your Y variable (or a decrease in X with an increase in Y)

- This is called a negative, or inverse relationship

- A slope of zero indicates that an increase in your X variable is associated with no change in your Y variable

** In most behavioral science research, the slope is what you are most interested in because it tells you how strongly the IV and DV are related to each other.

 

CORRELATION

** In research, we never have a perfect linear relationship – there is always error, and variables other than the IV (disturbance variables) also influence the DV.

 

** Most of the relationships that we observe only approximate a linear relationship.

** The Pearson correlation coefficient (r) tells us the extent to which the relationship between two variables approximates a linear relationship, and the nature (direction) of that relationship.

Correlations can range from –1.00 to +1.00

There are two important components of r that are used for interpreting what this number means:

(1) Magnitude – the absolute value of the correlation (ignoring the sign) tells us how closely this relationship approximates a linear one.

- The further r is from zero (either positive or negative) the better the approximation

- a correlation of 1.00 or –1.00 indicates a perfect relationship

- a correlation of zero indicates no linear relationship

(2) Sign – the sign of the correlation indicates the direction of the relationship

 

**So, we address our research question by asking:

- Given the values we have for X & Y, "Is there a linear relationship between X and Y?"

- If we know a person’s X score, can we estimate their Y score better than if we had no information?

 

General Process that is used:

(1) Is there a linear relationship between the variables?

- Correlation

(2) If so, what is the equation for the line? (knowing the equation allows us to estimate Y from scores on X).

- Regression

Calculation of the Pearson Correlation Coefficient

Pearson r tells us the degree to which a linear relationship is approximated, i.e., how closely the two variables follow a straight-line relationship. The number that is calculated is called a correlation coefficient.

Example:

- A question that is of interest to many people is how strong the relationship is between GRE scores (verbal and quantitative) and success in graduate school.

- We have a small sample of 8 graduate students from SUNY Albany. We know their GRE scores and their GPA for their first year of grad school. We can use this GPA measure as a measure of success in grad school. The obtained data looks like this:

Individual    GRE Score (X)    Graduate GPA (Y)
1             1200             3.60
2             1250             3.70
3             1300             3.80
4             1400             4.00
5             1450             3.60
6             1450             3.70
7             1475             3.50
8             1550             3.90

- If we just look over these numbers, it doesn’t really look like GPA increases as GRE scores increase. But by calculating the Pearson correlation coefficient, we can see how much these two variables are actually related (i.e., how closely they approximate a linear relationship).

Computational formula for r: (The book also goes through the definitional formula in detail. This can aid in understanding the calculations, but I am going to get right to the computational formula.)

 

r = SP / √(SSx × SSy)

where SP = ΣXY – (ΣX)(ΣY)/N, SSx = ΣX² – (ΣX)²/N, and SSy = ΣY² – (ΣY)²/N

 

- As can be seen in this formula, the denominator contains the SS (sum of squares) for our X variable and the SS for our Y variable.

 

- The numerator contains SP, the sum of the cross products of X and Y.

 

 

 

Indiv    GRE (X)    Grad GPA (Y)    X²          Y²        XY
1        1200       3.60            1440000     12.96     4320
2        1250       3.70            1562500     13.69     4625
3        1300       3.80            1690000     14.44     4940
4        1400       4.00            1960000     16.00     5600
5        1450       3.60            2102500     12.96     5220
6        1450       3.70            2102500     13.69     5365
7        1475       3.50            2175625     12.25     5162.5
8        1550       3.90            2402500     15.21     6045
Sum      11075      29.80           15435625    111.20    41277.5

Plugging these sums (N = 8) into the formula:

SSx = 15435625 – (11075)²/8 = 103671.875

SSy = 111.20 – (29.80)²/8 = 0.195

SP = 41277.5 – (11075)(29.80)/8 = 23.125

r = 23.125 / √(103671.875 × 0.195) ≈ .16

- The sign of r is positive. But what about the magnitude? .16 is pretty close to zero, so we would conclude that there is not much of a linear relationship between these two variables.
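As a quick check on the arithmetic, the same computational formula can be coded up in a few lines of Python (a minimal sketch; the variable names are just illustrative):

    from math import sqrt

    # GRE scores (X) and first-year graduate GPA (Y) for the 8 students
    gre = [1200, 1250, 1300, 1400, 1450, 1450, 1475, 1550]
    gpa = [3.60, 3.70, 3.80, 4.00, 3.60, 3.70, 3.50, 3.90]
    n = len(gre)

    sum_x = sum(gre)
    sum_y = sum(gpa)
    sum_x2 = sum(x * x for x in gre)
    sum_y2 = sum(y * y for y in gpa)
    sum_xy = sum(x * y for x, y in zip(gre, gpa))

    # Sums of squares and the sum of cross products
    ss_x = sum_x2 - sum_x ** 2 / n
    ss_y = sum_y2 - sum_y ** 2 / n
    sp = sum_xy - sum_x * sum_y / n

    r = sp / sqrt(ss_x * ss_y)
    print(round(r, 2))   # prints 0.16, matching the hand calculation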

 

CORRELATION DOES NOT MEAN CAUSATION!!!

Even if you find that two variables are strongly correlated, this does not mean that one variable causes the other. The correlation merely tells you that the two variables are somehow related.

It could be that: (1) X causes Y

(2) Y causes X

(3) Some other variable causes both X and Y to be related (e.g., motivation might influence both GRE scores and graduate GPA)

 

 

 

 

Correlation: Inferential Aspects (Chapter 14)

Inference of a Relationship Using Pearson Correlation

- So far, we have described what the correlation coefficient is, and how to calculate it.

- The most common use of correlation is to use the calculated value to make inferences about a population based on sample data.

- We calculate a correlation coefficient from a sample, and we want to know if we can conclude that a correlation exists in the population.

- The purpose of this hypothesis testing procedure is to see if our observed correlation is due to sampling error, or if there is an actual relationship between our two variables in the population.

Step 1 – Null and alternative hypotheses

** Remember that a correlation of zero means there is no relationship between the two variables.

- In our null and alternative hypotheses, ρ represents the true correlation in the population.

- ρ is a lowercase Greek r, pronounced "rho"

- H0: ρ = 0 -- saying that the population correlation between the two variables is 0.

- H1: ρ ≠ 0 -- saying that the population correlation between the two variables is NOT 0.

Step 2 – Critical values for r

- The book talks about testing the significance with a t-test, then goes on to say that we can also do it the following way (which is much simpler).

- Appendix H (p. 606) gives you the critical values for correlations.

- To get a critical value, you need to know three pieces of information.

1. Degrees of freedom (N – 2)

2. Directional vs. non-directional test

3. Alpha level (.05)

- If the observed value of r is greater than the positive critical value or less than the negative critical value, we reject our null hypothesis. (Otherwise, we fail to reject null.)

 

For our GRE and GPA example:

df = 8 – 2 = 6

Non-directional/two-tailed test

α = .05

- Our critical value = .707

- So, we would not reject our null hypothesis.

- Our correlation coefficient is not statistically significant.

- We cannot conclude that a true relationship exists between these variables in the population; the observed correlation may simply be due to sampling error.
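As an alternative to looking up the critical value in Appendix H, statistical software reports a p-value for r. A minimal Python sketch using scipy.stats.pearsonr (assuming SciPy is available) looks like this:

    from scipy import stats

    gre = [1200, 1250, 1300, 1400, 1450, 1450, 1475, 1550]
    gpa = [3.60, 3.70, 3.80, 4.00, 3.60, 3.70, 3.50, 3.90]

    # pearsonr returns the sample correlation and a two-tailed p-value
    r, p = stats.pearsonr(gre, gpa)
    print(f"r = {r:.2f}, p = {p:.2f}")

    # r (about .16) is well below the critical value of .707 (df = 6, two-tailed,
    # alpha = .05), and p is well above .05, so we fail to reject the null hypothesis.

Either route leads to the same decision: comparing r to the critical value from the table and comparing the p-value to α are equivalent ways of carrying out the test.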

Strength of the Relationship

- r² represents the proportion of variability shared by the two variables.

- r² is formally known as the coefficient of determination.

 

For our example:

- r² = (.16)² ≈ .03 -- the two variables share only about 3% of their variability, which represents a weak relationship between the two variables.

 

Nature of the relationship

- This is found by examining the sign of the observed correlation coefficient

- If the correlation is positive, we say that our two variables have a positive or direct relationship with one another.

- If the correlation is negative, we say that the two variables have an inverse or negative relationship with one another.

- This is not relevant for our example, because we did not reject our null hypothesis. There is no statistically significant relationship to discuss.