Lesson 02


Introduction to Statistical Concepts
Lesson 02, Section 2 Demo: Using PROC MEANS to Generate Descriptive Statistics

*Please open stat0_demo1.sas to follow along with this demonstration.

Before we run the MEANS procedure that we've just learned, let's use PROC PRINT to take a look at our raw data. In the PROC PRINT statement, we specify the input data set, indicating with the OBS= option that we wish to view only 10 observations, or rows, of data. The TITLE statement specifies the output heading.

proc print data=statdata.testscores (obs=10);
   title 'Listing of the SAT Data Set';
run;
title;

When we submit the program, we can see the first 10 observations in our data set. The first column lists the observation numbers. The next three columns contain three variables: the gender of the student, the student's SAT score, and the unique ID number.

Now that we've confirmed the basic structure of our data, let's use PROC MEANS to generate descriptive statistics for the variable SATScore.

proc means data=statdata.testscores maxdec=2 fw=10 printalltypes;
   class Gender;
   var SATScore;
   title 'Descriptive Statistics Using PROC MEANS';
run;
title;

When we run the program, we can see how the CLASS statement, together with the PRINTALLTYPES option, produced statistics for all 80 observations, then for just females, and then for just males. It is interesting to see that the mean SATScore for all 80 students is 1190.6. When we look at just females, the mean is 1221, and the mean for males is 1160. Because we didn't specify which statistics we wanted, SAS generated the default statistics for PROC MEANS. The output provides the sample size (that is, the number of non-missing values), the mean, the standard deviation, the minimum, and the maximum of SATScore.

Now let's modify our PROC MEANS statement and request a different set of statistics. Remember that listing any statistic keywords overrides the defaults, so we must name every statistic that we want included in the output. Let's request the sample size (N), the mean, the median, the standard deviation, and the variance, along with the lower quartile (Q1) and upper quartile (Q3). Let's go ahead and submit our program.

proc means data=statdata.testscores maxdec=2 fw=10 printalltypes
           n mean median std var q1 q3; 
   class Gender;
   var SATScore;
   title 'Selected Descriptive Statistics for SAT Scores';
run;
title;

Notice that the output is still broken out with all of the observations and then by gender because we specified Gender in the CLASS statement and PRINTALLTYPES in the PROC MEANS statement. Now we can see the values of the mean compared to the median. So, although the average score for all students was 1190.63, the median or middle value in the data was 1170. We have the values for variance as well as standard deviation. We'll talk more about how these values relate to the data later. This output also includes the values for the 25th and 75th percentiles. These figures tell us that 25% of the data values for all students fell below 1085 and that the top 25% were above 1280.


Lesson 02, Section 3 Demo: Using SAS to Picture Your Data

*Please open stat0_demo2.sas to follow along with this demonstration.

The program in the editor uses PROC UNIVARIATE to produce descriptive statistics, a histogram and a normal probability plot for the variable SATScore. Let's specify the WIDTH= option so that our graphics are sized correctly.

Let's submit the program.

ods graphics on/width=600;
proc univariate data=statdata.testscores;
   var SATScore;
   id idnumber;
   histogram SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis / position=ne;
   probplot SATScore / normal(mu=est sigma=est);
   inset skewness kurtosis;
   title 'Descriptive Statistics Using PROC UNIVARIATE';
run;
title;

The log shows that SAS processed the code without errors. When we look at our results, the tabular output tells us quite a lot about our data.

The sample size (N) is 80. The mean is 1190.6, which is reasonably close to the median of 1170 in the quantiles report.

The standard deviation is approximately 147, which is a measure of the average variability around the mean. The variance is the standard deviation squared.

The skewness statistic is 0.642. It is positive, but close to 0, so the distribution is slightly right-skewed. The kurtosis statistic is 0.424. It is positive, but also close to 0, so the distribution is slightly heavy-tailed.

The coefficient of variation is 12.35. This is the standard deviation expressed as a percentage of the mean. Because it has no units, it is useful when you need to compare the variability of data recorded in different units of measurement, for example, inches and centimeters.

The standard error of the mean, 16.4, measures the variability of the sample mean.
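
The variance, the coefficient of variation, and the standard error described above all follow from the standard deviation, the mean, and the sample size. The DATA step below is a minimal sketch of that arithmetic and is not part of stat0_demo2.sas. It plugs in the rounded values quoted in this demo (a standard deviation of about 147, a mean of 1190.6, and N of 80), and the variable names are our own, so the results should only approximate the PROC UNIVARIATE output.

```sas
/* Sketch only: re-derive the moments from the rounded values in the narration. */
data _null_;
   n    = 80;       /* sample size */
   xbar = 1190.6;   /* sample mean, rounded */
   sd   = 147;      /* standard deviation, rounded */

   variance = sd**2;            /* variance = standard deviation squared */
   cv       = 100 * sd / xbar;  /* coefficient of variation, as a percentage */
   se       = sd / sqrt(n);     /* standard error of the mean */

   put variance= cv= se=;       /* write the results to the log */
run;
```

The log values for cv and se should land close to the reported 12.35 and 16.4. Small differences are expected because the inputs were rounded.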

In the quantiles report, SAS provides various reference points in the data set. We can see that the minimum for SATScore is 890 and the maximum is 1600. We are also given other percentile values, such as the 25th, the median, and the 75th. In the extreme observations output, SAS lists the five lowest values of SATScore and the five highest values, each identified by its IDNumber value. Remember that we specified IDNumber as an identifier in our code.

Next, let's scroll down and review our graphs.

This histogram provides some additional information about the TestScores data. The bin identified with the midpoint of 1100 has approximately 33% of the values. The inset box displays the skewness and kurtosis values. The data looks approximately normal.

Let's review the normal probability plot. The diagonal reference line here represents where the data values would fall if they came from a normal distribution. The circles represent the observed data values. Because the circles closely follow the diagonal reference line in the graph, you can conclude that there does not appear to be a significant departure from normality.

Now we use PROC SGPLOT to create a box plot of the variable SATScore. Let's specify a reference line at 1200. This is the SAT test score goal for magnet schools in the Carver County school district. We also specify the DATALABEL= option in the VBOX statement and ask SAS to label any outliers with their IDNumber values. Let's submit this code.

proc sgplot data=statdata.testscores;
   refline 1200 / axis=y lineattrs=(color=blue); 
   vbox SATScore / datalabel=IDNumber; 
   format IDNumber 8.;
   title "Box Plots of SAT Scores";
run;
title;

The log verifies that the code ran successfully. Now let's look at our output. The top whisker extends to the largest point that lies within 1.5 interquartile ranges of the top of the box. The top line of the box represents the 75th percentile. The horizontal line inside the box represents the median, or 50th percentile. The bottom line of the box represents the 25th percentile. The bottom whisker extends to the smallest point that lies within 1.5 interquartile ranges of the bottom of the box. The diamond represents the mean. The blue horizontal line is the reference line we added where SATScore is 1200. Note that there are two outliers, that is, values more than 1.5 interquartile ranges beyond the box. SAS displays their IDNumber values as we had requested.
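
To see where the outlier cutoffs fall, we can work out the 1.5 interquartile range rule by hand. The DATA step below is a sketch that is not part of stat0_demo2.sas. It assumes that the box plot uses the same quartiles that PROC MEANS reported earlier for all students (Q1 of 1085 and Q3 of 1280), and the variable names are our own.

```sas
/* Sketch only: outlier cutoffs implied by the quartiles quoted earlier. */
data _null_;
   q1  = 1085;                    /* 25th percentile (assumed from PROC MEANS) */
   q3  = 1280;                    /* 75th percentile (assumed from PROC MEANS) */
   iqr = q3 - q1;                 /* interquartile range */

   lower_cutoff = q1 - 1.5 * iqr; /* values below this are flagged as outliers */
   upper_cutoff = q3 + 1.5 * iqr; /* values above this are flagged as outliers */

   put iqr= lower_cutoff= upper_cutoff=;
run;
```

Under that assumption the cutoffs are 792.5 and 1572.5. The maximum score of 1600 lies above the upper cutoff, while the minimum of 890 does not fall below the lower one.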


Lesson 02, Section 4 Demo: Calculating a 95% Confidence Interval

*Please open stat0_demo3.sas to follow along with this demonstration.

Let's use PROC MEANS to calculate a 95% confidence interval for the mean of the variable SATScore in the TestScores data set. The MAXDEC= option rounds the values in the table to four decimal places. We specify that SAS should display the number of observations, the mean, the standard error of the mean, and the confidence limits of the mean at the default 95% confidence level.

proc means data=statdata.testscores maxdec=4
           n mean stderr clm;
   var SATScore;
   title '95% Confidence Interval for SAT';
run;
title;

Let's look at the output. The sample size is 80. The mean of SATScore is 1190.625. The standard error of the mean is 16.4416. The standard error measures the variability of the sample mean or how much variability we expect in the distribution of all means of samples of size n. (n=80 in this example.) The 95% confidence interval of the mean is 1157.9 to 1223.4.
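
The confidence limits in this output can be checked by hand as the mean plus or minus a t critical value times the standard error. The DATA step below is a sketch of that check, not part of stat0_demo3.sas. It reuses the N, mean, and standard error reported above and uses the QUANTILE function to look up the critical value with n - 1 degrees of freedom. The variable names are our own.

```sas
/* Sketch only: rebuild the 95% confidence limits from the reported statistics. */
data _null_;
   n     = 80;         /* sample size */
   xbar  = 1190.625;   /* sample mean */
   se    = 16.4416;    /* standard error of the mean */
   alpha = 0.05;       /* 1 - confidence level */

   /* two-sided critical value from the t distribution with n - 1 df */
   tcrit = quantile('T', 1 - alpha/2, n - 1);

   lower = xbar - tcrit * se;   /* lower confidence limit */
   upper = xbar + tcrit * se;   /* upper confidence limit */

   put tcrit= lower= upper=;
run;
```

The limits written to the log should agree closely with the 1157.9 and 1223.4 shown by the CLM keyword.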

How does this relate to your original question? You want to know whether the average SAT score for the Carver County magnet high schools is different from the standard of 1200 set by the school board. Even though the sample mean of 1190.6 is not exactly 1200, the confidence interval of 1157.9 to 1223.4 contains the value 1200. This is evidence that the value of 1200 is at least a reasonable candidate for the true population mean. Hence, you can conclude that 1190.6 is a reasonably likely sample mean for a random sample drawn from a population centered around 1200.

Of course, this logic applies only if you can assume that the sampling distribution of the mean is approximately normal and that the standard deviation of the sample is a good estimate of the standard deviation of the population. This concept is discussed further in the section on hypothesis testing.
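
The demo relies on the default 95% confidence level. If a different level is needed, PROC MEANS accepts the ALPHA= option. The step below is a sketch of the same request with 90% limits. It is not part of stat0_demo3.sas, and the title text is our own.

```sas
/* Sketch only: same statistics as the demo, with 90% confidence limits. */
proc means data=statdata.testscores maxdec=4 alpha=0.10
           n mean stderr clm;
   var SATScore;
   title '90% Confidence Interval for SAT';
run;
title;
```

Because a lower confidence level uses a smaller critical value, the 90% interval will be narrower than the 95% interval computed from the same data.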


Lesson 02, Section 5 Demo: Using PROC UNIVARIATE to Perform a Hypothesis Test

*Please open stat0_demo4.sas to follow along with this demonstration.

This program uses PROC UNIVARIATE to test the hypothesis that the mean of SATScore is equal to 1200. Your null hypothesis is that the population's mean SAT score for the Carver County magnet high schools is 1200. Your alternative hypothesis is that the population's mean SAT score is not 1200. Let's submit this program and take a look at the output SAS produces.

ods select testsforlocation;
proc univariate data=statdata.testscores mu0=1200;
   var SATScore;
   title 'Testing Whether the Mean of SAT Scores = 1200';
run;
title;

The Tests for Location table provides the t statistic, labeled Student's t, and the corresponding p-value. The p-value is greater than the significance level, or α, of 0.05 that we had set. Note, by the way, that it is a coincidence that the t statistic and p-value have the same numeric value (although the t statistic is negative and the p-value is positive). Because the p-value is greater than alpha, we fail to reject the null hypothesis. Therefore, we conclude that the sample mean of 1190.6 is not statistically different from the hypothesized mean of 1200.

Here's another way to look at it. If the null hypothesis is true, how often would we see a t statistic with an absolute value of 0.5702 or greater? About 57% of the time. A result this common under the null hypothesis gives us no grounds to reject it. To summarize: the original question was whether the mean SAT score for Carver County magnet high school students equals 1200. From this hypothesis test, we conclude that there is not enough evidence to say that the population mean differs from 1200. The sample mean of 1190.625 is consistent with it.
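
The t statistic and p-value in the Tests for Location table can also be reproduced by hand. The DATA step below is a sketch, not part of stat0_demo4.sas. It assumes the N, mean, and standard error that PROC MEANS reported for SATScore in the confidence interval demo, and it uses the PROBT function for the t distribution with n - 1 degrees of freedom. The variable names are our own.

```sas
/* Sketch only: one-sample t test of H0: mu = 1200 from summary statistics. */
data _null_;
   n    = 80;         /* sample size */
   xbar = 1190.625;   /* sample mean */
   se   = 16.4416;    /* standard error of the mean */
   mu0  = 1200;       /* hypothesized population mean */

   t = (xbar - mu0) / se;               /* t statistic */
   p = 2 * (1 - probt(abs(t), n - 1));  /* two-sided p-value */

   put t= p=;
run;
```

The t value comes out negative because the sample mean is below 1200, and the p-value should be close to the 0.57 discussed above.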