Section 3 Categorical Data

Readings	Ott and Longnecker, Chapter 10, pages 469-528.
Instructor Guidance	At this point in the course we are going to take a break from comparing population means for continuous data and look instead at what kinds of tests are appropriate when the underlying response is simply a classification (e.g. a color, class or count, not something on an obvious continuous scale). Such data are called categorical or count data. We begin by developing a one-population test for a population proportion. A response of the form true/false, yes/no or success/failure is said to be a Bernoulli response. Counts of Bernoulli responses (e.g. the number of true responses) can be modeled using the Binomial Distribution, with the probability of a success (true, yes) response being the parameter of interest. This probability is estimated by the observed proportion of successes. In a one-population test, we wish to determine whether our observed proportion could reasonably have come from a population of Bernoulli responses with a specified proportion. If the specified proportion is correct, the count of successes follows a Binomial Distribution with that proportion, and with this distribution it is possible to compute critical values for the test exactly. With larger sample sizes, it is often easier and almost as accurate to use a different test statistic and the Normal (Gaussian) distribution to determine the critical value. This alternative test is presented on page 475. The one-population test can be extended to a two-population test in the same way we extended the one-population t-test. Again, we have to figure out an appropriate pooled variance term (see page 484), but the test itself is fairly straightforward. In the case where we have more than two classes as possible responses (e.g. the color response could be RED, BLUE, GREEN or YELLOW), the extension of the proportions test to multiple proportions no longer follows what was done with continuous responses.
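As an illustration of the proportion tests described above, here is a minimal Python sketch using only the standard library. The function names and the example counts are hypothetical and are not taken from Ott and Longnecker. The formulas are the usual large-sample versions, so their notation or details may differ from what appears on pages 475 and 484.

```python
import math


def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_proportion_z(successes, n, p0):
    """Large-sample z statistic and two-sided p-value for H0: p = p0."""
    p_hat = successes / n                    # observed proportion of successes
    se = math.sqrt(p0 * (1.0 - p0) / n)      # standard error computed under H0
    z = (p_hat - p0) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value


def exact_binomial_upper_tail(successes, n, p0):
    """Exact P(X >= successes) when X ~ Binomial(n, p0)."""
    return sum(
        math.comb(n, k) * p0**k * (1.0 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic and two-sided p-value for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)           # common proportion assumed under H0
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value
```

With made-up data of 60 successes in 100 trials and a specified proportion of 0.5, `one_proportion_z(60, 100, 0.5)` gives a z statistic of about 2.0, and `exact_binomial_upper_tail(60, 100, 0.5)` gives the corresponding exact one-sided tail probability for comparison.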
With more than two classes we are dealing with the Multinomial Distribution, and the appropriate test statistic is related to a Chi Square Distribution. The Chi Square test for equal proportions (Section 10.4) is as powerful a tool for the analysis of categorical data as the Analysis of Variance is for continuous response data. The hardest part here is understanding how the expected value components, E_i, are derived. Figure this out and you have this test licked. The Chi Square test is further applied in two situations. The first deals with testing whether count data follow a specific distribution (e.g. the Poisson Distribution, as given in Section 10.5). The second deals with counts cross-classified by two classification variables. This application of the Chi Square test is used to determine whether the two variables are related to each other. Using the concept of independence developed in the chapter on probability, we can develop a Chi Square test for independence of these factors. Again, in both of these special cases, the key to the test is figuring out how the expected value components, E_i, are derived. The last two sections of this chapter deal with measuring the strength of relationships among categorical variables. These are very interesting concepts and have important uses in the social sciences and medicine. These sections are not part of this course, but I encourage you to read them anyway.
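Since the expected value components E_i are the key step, the following Python sketch shows how they might be computed in two of the cases above: a test of specified (by default equal) proportions, and a test of independence in a two-way table. The function names and the counts are hypothetical, chosen only for illustration. The sketch returns the test statistic and degrees of freedom, and the critical value would still be looked up in a Chi Square table.

```python
def chi_square_gof(observed, probs=None):
    """Chi Square statistic for H0: category proportions equal `probs`.

    With probs=None the hypothesis is equal proportions across categories.
    Returns (statistic, degrees of freedom, expected counts).
    """
    n = sum(observed)
    k = len(observed)
    if probs is None:
        probs = [1.0 / k] * k
    expected = [n * p for p in probs]                    # E_i = n * p_i
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, k - 1, expected


def chi_square_independence(table):
    """Chi Square statistic for independence in an r x c table of counts.

    Returns (statistic, degrees of freedom, expected counts).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, E_ij = (row total * column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected
```

For made-up color counts of 30, 20, 25 and 25, `chi_square_gof([30, 20, 25, 25])` gives every E_i equal to 25 and a statistic of 2.0 on 3 degrees of freedom. When the hypothesized proportions depend on a parameter estimated from the data (such as a Poisson mean), the degrees of freedom are reduced by one for each estimated parameter, which this sketch does not do.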
PPT Lecture	Tests of Proportions (PowerPoint) and (PDF); Chi Square Test for Multiple Proportions (PowerPoint) and (PDF)
Optional Activities	None
Exercises	To check your understanding of the readings and to practice these concepts and methods, go to Unit 3 Section 3 Exercises, do the exercises, then check your answers against the page provided. After that, continue on to the Unit 3 Test.