
Activities and Practices (with Solutions) for Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression


Course Code: ECST, prepared June 19, 2023
Copyright © SAS Institute Inc., Cary, NC, USA. All rights reserved.
        

Performing Demos and Practices

To perform the demonstrations and practices, you can write and submit code in SAS Studio, SAS Enterprise Guide, or the SAS Windowing environment, or you can use SAS Studio tasks. In addition to Base SAS software, you must have SAS/STAT.

In the course demos, we submit code in the SAS Studio programming environment, but you can view step-by-step task instructions by clicking the Task Version button below the video. The task-generated code and results might differ from those shown in the video, so the generated code is included for verification.

You can also write code or use SAS Studio tasks to complete the practices. Just select the Open Code Version or the Open Task Version button on the Practice page.

All task steps were written for SAS Studio 3.7. If you are using your own software, you can use the Downloads page to upgrade your SAS Studio Single-User Edition to the latest release, including hot fixes. If you don't have your own software, you can use SAS OnDemand for Academics to access SAS Studio free of charge.


Program Files

The course files are divided among four folders. The ECST142 folder contains SAS programs to set up the course environment and data, and three subfolders:

  • data - stores the SAS data sets needed to run the demos and practices.
  • demos - contains the demo program files.
  • solutions - contains the solution program files.

Demo and solution program names are in the form st1XXyZZ.sas, where st1 is the course code, XX is a two-digit lesson number, y is either d for demo or s for solution, and ZZ is a two-digit number that uniquely identifies the file. For simplicity, the demo and solution files are each numbered sequentially within a lesson. For example, st102d03.sas is the third demo in lesson 2, and st104s02.sas is the solution to the second practice in lesson 4. The filename is included in a comment at the top of each program.
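
The naming convention above can be decoded mechanically. Here is a hypothetical sketch (not part of the course files) that pulls the parts out of a program name with the SUBSTR function:

```sas
/* Hypothetical sketch: decoding a program name such as st102d03.sas
   using the st1XXyZZ.sas convention described above. */
data _null_;
   fname="st102d03.sas";
   course=substr(fname,1,3);   /* course code: st1 */
   lesson=substr(fname,4,2);   /* lesson number: 02 */
   type  =substr(fname,6,1);   /* d=demo, s=solution */
   number=substr(fname,7,2);   /* file number: 03 */
   put course= lesson= type= number=;
run;
```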


SAS Syntax

Partial SAS syntax is displayed and explained in the demos. To view detailed syntax from SAS Studio, click the Help button near the top right and select SAS Product Documentation. You can also navigate to support.sas.com/documentation.


Exploring SAS Data Sets in SAS Studio

You can use the table viewer in SAS Studio to explore SAS data sets.

  1. Open SAS Studio. In the navigation pane on the left, double-click Libraries. Expand My Libraries and then expand the SASHELP library.

  2. Open the CARS data set by double-clicking it or by dragging it to the work area on the right. The data set opens in the table viewer. By default, all of the columns and the first 100 rows are displayed. You can use the arrows above the table (top right) to page forward and backward through the rows.

  3. Clear the Select all checkbox in the Columns area of the table viewer. No columns are displayed. Select the Make, Model, and Type checkboxes. The corresponding columns are displayed.

  4. Select Make in the column list. The column properties are displayed below the list.

  5. Close the table tab.

You can also use SAS Studio tasks to explore your data. The List Data task displays the rows of a SAS data set, and the List Table Attributes task displays its metadata, including the number of columns and rows, and the name, type, and length of each column.


Exploring SAS Data Sets Programmatically

You can use the PRINT procedure to display the rows of a SAS data set, and the CONTENTS procedure to display its metadata.

In the PROC PRINT step, you use the DATA= option to name the input data set and the VAR statement to select the columns to display. By default, SAS lists all columns and all rows, but you can use the OBS=n data set option to limit the display to the first n rows. In the PROC CONTENTS step, you use the DATA= option to name the input data set.

Sample Code
proc print data=sashelp.cars(obs=100);
   var Make Model Type MSRP;
run;
proc contents data=sashelp.cars;
run;

Lesson 01

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 01, Section 3 CODE VERSION

Practice: Using PROC TTEST to Perform a One-Sample t Test

The data in stat1.normtemp come from an article in the Journal of Statistics Education by Dr. Allen L. Shoemaker from the Psychology Department at Calvin College. The data are based on an article in a 1992 edition of JAMA (Journal of the American Medical Association). The article questions the notion that the true mean body temperature is 98.6 degrees. The data include 65 males and 65 females. There is also some doubt about whether mean body temperatures are the same for women as for men.

  1. Look at the distribution of the continuous variables in the stat1.normtemp data set. Use PROC UNIVARIATE to produce histograms and insets with means, standard deviations, and sample size.

    Solution:

    /*st101s01.sas*/ /*Part A*/ 
    %let interval=BodyTemp HeartRate; 
    ods graphics; 
    ods select histogram; 
    proc univariate data=stat1.NormTemp noprint; 
       var &interval; 
       histogram &interval / normal kernel; 
       inset n mean std / position=ne; 
       title "Interval Variable Distribution Analysis"; 
    run;

    Here are the results.

  2. What are the means and standard deviations for each continuous variable?

    Solution:

    • The mean BodyTemp is 98.25 with a standard deviation of 0.71.
    • The mean HeartRate is 73.76 with a standard deviation of 7.06.

  3. Perform a one-sample t test to determine whether the mean of body temperatures is 98.6. Produce a confidence interval plot of BodyTemp. Use the value 98.6 as a reference.

    Solution:

    /*st101s01.sas*/ /*Part B*/ 
    proc ttest data=stat1.NormTemp h0=98.6 plots(only shownull)=interval; 
       var BodyTemp; 
       title 'Testing Whether the Mean Body Temperature=98.6'; 
    run; 
    title;

    Here are the results.

  4. What is the value of the t statistic and the corresponding p-value?

    Solution:

    The t value is -5.45, and the p-value is <.0001.

  5. Do you reject or fail to reject the null hypothesis at the 0.05 level that the average temperature is 98.6 degrees?

    Solution:

    You reject the null hypothesis at the 0.05 level.
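
As a rough cross-check, the t statistic can be recomputed from the summary statistics found earlier. This sketch (not part of the course files) uses the rounded mean and standard deviation, so it reproduces the reported t value only approximately:

```sas
/* Sketch: one-sample t statistic from rounded summary statistics.
   PROC TTEST's reported value (-5.45) is based on the unrounded
   mean and standard deviation, so this differs slightly. */
data _null_;
   n=130; mean=98.25; std=0.71; h0=98.6;
   se=std/sqrt(n);               /* standard error of the mean */
   t=(mean-h0)/se;               /* t statistic */
   p=2*(1-probt(abs(t), n-1));   /* two-sided p-value */
   put t= p=;
run;
```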

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 01, Section 3 TASK VERSION

Practice: Using PROC TTEST to Perform a One-Sample t Test

The data in stat1.normtemp come from an article in the Journal of Statistics Education by Dr. Allen L. Shoemaker from the Psychology Department at Calvin College. The data are based on an article in a 1992 edition of JAMA (Journal of the American Medical Association). The article questions the notion that the true mean body temperature is 98.6 degrees. The data include 65 males and 65 females. There is also some doubt about whether mean body temperatures are the same for women as for men.

  1. Look at the distribution of the continuous variables in the stat1.normtemp data set. Use PROC UNIVARIATE to produce histograms and insets with means, standard deviations, and sample size.

    Solution:

    /*st101s01.sas*/ /*Part A*/ 
    %let interval=BodyTemp HeartRate; 
    ods graphics; 
    ods select histogram; 
    proc univariate data=stat1.NormTemp noprint; 
       var &interval; 
       histogram &interval / normal kernel; 
       inset n mean std / position=ne; 
       title "Interval Variable Distribution Analysis"; 
    run;

    Here are the results.

  2. What are the means and standard deviations for each continuous variable?

    Solution:

    • The mean BodyTemp is 98.25 with a standard deviation of 0.71.
    • The mean HeartRate is 73.76 with a standard deviation of 7.06.

  3. Perform a one-sample t test to determine whether the mean of body temperatures is 98.6. Produce a confidence interval plot of BodyTemp. Use the value 98.6 as a reference.

    Solution:

    /*st101s01.sas*/ /*Part B*/ 
    proc ttest data=stat1.NormTemp h0=98.6 plots(only shownull)=interval; 
       var BodyTemp; 
       title 'Testing Whether the Mean Body Temperature=98.6'; 
    run; 
    title;

    Here are the results.

  4. What is the value of the t statistic and the corresponding p-value?

    Solution:

    The t value is -5.45, and the p-value is <.0001.

  5. Do you reject or fail to reject the null hypothesis at the 0.05 level that the average temperature is 98.6 degrees?

    Solution:

    You reject the null hypothesis at the 0.05 level.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 01, Section 4 CODE VERSION

Practice: Using PROC TTEST to Compare Groups

Elli Sagerman, a Master of Education candidate in German Education at the University of North Carolina at Chapel Hill in 2000, collected data for a study. She looked at the effectiveness of a new type of foreign language teaching technique on grammar skills. She selected 30 students to receive tutoring. Fifteen received the new type of training during the tutorials and 15 received standard tutoring. Two students moved from the district before completing the study. Scores on a standardized German grammar test were recorded immediately before the 12-week tutorials and again 12 weeks later at the end of the trial. Sagerman wanted to see the effect of the new technique on grammar skills.

  1. Using PROC TTEST, analyze the stat1.german data set. Assess whether the treatment group improved more than the control group.

    Solution:

    /*st101s02.sas*/ 
    ods graphics; 
    proc ttest data=STAT1.German plots(shownull)=interval; 
       class Group; 
       var Change; 
       title "German Grammar Training, Comparing Treatment to Control"; 
    run;

    Here are the results.


  2. Do the two groups seem to be approximately normally distributed?

    Solution:

    The plots show evidence that supports approximate normality in both groups.

  3. Do the two groups have approximately equal variances?

    Solution:

    Because the p-value for the Equality of Variances test is greater than the alpha level of 0.05, you would not reject the null hypothesis. This conclusion supports the assumption of equal variance (the null hypothesis being tested here).

  4. Does the new teaching technique seem to result in significantly different scores compared with the standard technique?

    Solution:

    The p-value for the Pooled (Equal Variance) test for the difference between the two means shows that the two groups are not statistically significantly different. Therefore, there is not strong enough evidence to say conclusively that the new teaching technique is different from the old. The Difference Interval plot displays these conclusions graphically.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 01, Section 4 TASK VERSION

Practice: Using the t Tests Task to Compare Groups

Elli Sagerman, a Master of Education candidate in German Education at the University of North Carolina at Chapel Hill in 2000, collected data for a study. She looked at the effectiveness of a new type of foreign language teaching technique on grammar skills. She selected 30 students to receive tutoring. Fifteen received the new type of training during the tutorials and 15 received standard tutoring. Two students moved from the district before completing the study. Scores on a standardized German grammar test were recorded immediately before the 12-week tutorials and again 12 weeks later at the end of the trial. Sagerman wanted to see the effect of the new technique on grammar skills.

  1. Using the t Tests task, analyze the stat1.german data set. Assess whether the treatment group improved more than the control group.

    Solution:

    • In the Navigation pane, select Tasks and Utilities.
    • Expand Tasks.
    • Expand Statistics and select the t Tests task.
    • On the DATA tab, do the following:
      • Select the stat1.german table.
      • Select Two-sample test under ROLES.
      • Assign Change as the Analysis variable and Group as the Groups variable.
    • On the OPTIONS tab, do the following:
      • Clear Tests for normality.
      • Under PLOTS, choose Selected plots. Select Histogram and box plot, Normality plot, and Confidence interval plot.
    • Run the task.

    Here are the results.


  2. Do the two groups seem to be approximately normally distributed?

    Solution:

    The plots show evidence that supports approximate normality in both groups.

  3. Do the two groups have approximately equal variances?

    Solution:

    Because the p-value for the Equality of Variances test is greater than the alpha level of 0.05, you would not reject the null hypothesis. This conclusion supports the assumption of equal variance (the null hypothesis being tested here).

  4. Does the new teaching technique seem to result in significantly different scores compared with the standard technique?

    Solution:

    The p-value for the Pooled (Equal Variance) test for the difference between the two means shows that the two groups are not statistically significantly different. Therefore, there is not strong enough evidence to say conclusively that the new teaching technique is different from the old. The Difference Interval plot displays these conclusions graphically.

Lesson 02

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 2 CODE VERSION

Practice: Using PROC GLM to Perform a One-Way ANOVA

Montana Gourmet Garlic is a company that uses organic methods to grow garlic. It specializes in hardneck varieties. Knowing a little about experimental methods, the owners design an experiment to test whether growth of the garlic is affected by the type of fertilizer. They limit the experimentation to a Rocambole variety named Spanish Roja, and test three organic fertilizers and one chemical fertilizer (as a control). They "blind" themselves to the fertilizer by using containers with numbers 1 through 4. (In other words, they design the experiment in such a way that they do not know which fertilizer is in which container.) One acre of farmland is set aside for the experiment. The land is divided into 32 beds, and they randomly assign fertilizers to the beds. At harvest, they calculate the average weight of garlic bulbs in each of the beds. The data are in the stat1.garlic data set.

Consider an experiment to study four types of fertilizer, labeled 1, 2, 3, and 4. One fertilizer is chemical and the rest are organic. You want to see whether the average weights of the garlic bulbs are significantly different for plants in beds that use different fertilizers.
  1. Test the hypothesis that the means are equal. Use PROC MEANS to generate descriptive statistics for the four groups, and use PROC SGPLOT to produce box plots of bulb weight for the four groups. Submit the code and view the results.

    Solution:

    Checking Assumptions
    /*st102s01.sas*/  /*Part A*/
    proc means data=stat1.garlic; 
       var BulbWt;
       class Fertilizer;
       title 'Descriptive Statistics of BulbWt by Fertilizer';
    run;
    
    proc sgplot data=stat1.garlic;
        vbox BulbWt / category=Fertilizer 
                      connect=mean;
        title "Bulb Weight Differences across Fertilizers";
    run;
    
    title;

    Here are the results.


  2. Which fertilizer has the highest mean?

    Solution:

    Fertilizer 3 has the highest mean, 0.2424075, although it is quite close to the means for fertilizers 1 and 2.

  3. Perform a one-way ANOVA using PROC GLM. Be sure to check that the assumptions of the analysis method that you choose are met. Submit the code and view the results.

    Solution:

    ANOVA
    /*st102s01.sas*/  /*Part B*/
    ods graphics;
    
    proc glm data=stat1.garlic plots=diagnostics;
        class Fertilizer;
        model BulbWt=Fertilizer;
        means Fertilizer / hovtest=levene;
        title "One-Way ANOVA with Fertilizer as Predictor";
    run;
    quit;
    
    title;

    Here are the results.


  4. What conclusions can you reach at this point in your analysis?

    Solution:

    The overall F value from the analysis of variance table is associated with a p-value of 0.0013. Presuming that all assumptions of the model are valid, you know that at least one treatment mean is different from one other treatment mean. At this point, you don't know which means are significantly different from one another.

    Both the histogram and Q-Q plot show that the residuals seem relatively normally distributed (one assumption for ANOVA).

    The Levene’s Test for Homogeneity of Variance table shows a p-value greater than alpha. Therefore, do not reject the hypothesis of homogeneity of variances (equal variances across fertilizer types). This assumption for ANOVA is met.
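
Had Levene's test rejected equal variances, one common fallback is Welch's variance-weighted ANOVA, which PROC GLM can produce from the same MEANS statement. A sketch (not needed for this practice, since the assumption is met):

```sas
/* Sketch: requesting Welch's ANOVA alongside Levene's test.
   Not required here because homogeneity of variance holds. */
proc glm data=stat1.garlic;
   class Fertilizer;
   model BulbWt=Fertilizer;
   means Fertilizer / hovtest=levene welch;
run;
quit;
```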

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 2 TASK VERSION

Practice: Using PROC GLM to Perform a One-Way ANOVA

Montana Gourmet Garlic is a company that uses organic methods to grow garlic. It specializes in hardneck varieties. Knowing a little about experimental methods, the owners design an experiment to test whether growth of the garlic is affected by the type of fertilizer. They limit the experimentation to a Rocambole variety named Spanish Roja, and test three organic fertilizers and one chemical fertilizer (as a control). They "blind" themselves to the fertilizer by using containers with numbers 1 through 4. (In other words, they design the experiment in such a way that they do not know which fertilizer is in which container.) One acre of farmland is set aside for the experiment. The land is divided into 32 beds, and they randomly assign fertilizers to the beds. At harvest, they calculate the average weight of garlic bulbs in each of the beds. The data are in the stat1.garlic data set.

Consider an experiment to study four types of fertilizer, labeled 1, 2, 3, and 4. One fertilizer is chemical and the rest are organic. You want to see whether the average weights of the garlic bulbs are significantly different for plants in beds that use different fertilizers.
  1. Test the hypothesis that the means are equal. Use PROC MEANS to generate descriptive statistics for the four groups, and use PROC SGPLOT to produce box plots of bulb weight for the four groups. Submit the code and view the results.

    Solution:

    Checking Assumptions
    /*st102s01.sas*/  /*Part A*/
    proc means data=stat1.garlic; 
       var BulbWt;
       class Fertilizer;
       title 'Descriptive Statistics of BulbWt by Fertilizer';
    run;
    
    proc sgplot data=stat1.garlic;
        vbox BulbWt / category=Fertilizer 
                      connect=mean;
        title "Bulb Weight Differences across Fertilizers";
    run;
    
    title;

    Here are the results.


  2. Which fertilizer has the highest mean?

    Solution:

    Fertilizer 3 has the highest mean, 0.2424075, although it is quite close to the means for fertilizers 1 and 2.

  3. Perform a one-way ANOVA using PROC GLM. Be sure to check that the assumptions of the analysis method that you choose are met. Submit the code and view the results.

    Solution:

    ANOVA
    /*st102s01.sas*/  /*Part B*/
    ods graphics;
    
    proc glm data=stat1.garlic plots=diagnostics;
        class Fertilizer;
        model BulbWt=Fertilizer;
        means Fertilizer / hovtest=levene;
        title "One-Way ANOVA with Fertilizer as Predictor";
    run;
    quit;
    
    title;

    Here are the results.


  4. What conclusions can you reach at this point in your analysis?

    Solution:

    The overall F value from the analysis of variance table is associated with a p-value of 0.0013. Presuming that all assumptions of the model are valid, you know that at least one treatment mean is different from one other treatment mean. At this point, you don't know which means are significantly different from one another.

    Both the histogram and Q-Q plot show that the residuals seem relatively normally distributed (one assumption for ANOVA).

    The Levene’s Test for Homogeneity of Variance table shows a p-value greater than alpha. Therefore, do not reject the hypothesis of homogeneity of variances (equal variances across fertilizer types). This assumption for ANOVA is met.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 3 CODE VERSION

Practice: Using PROC GLM to Perform Post Hoc Pairwise Comparisons

Consider the analysis of the garlic data set. In the previous exercise, you used PROC GLM to perform one-way ANOVA, and found that there was a statistically significant difference among mean garlic bulb weights for the different fertilizers. Now, perform a post hoc test to look at the individual differences among means.

  1. Use PROC GLM to conduct pairwise comparisons with an experimentwise error rate of α=0.05. (Use the Tukey adjustment.) Submit the code and view the results.

    Solution:

    /*st102s02.sas*/
    ods graphics;
    
    ods select lsmeans diff diffplot controlplot;
    proc glm data=STAT1.Garlic 
             plots(only)=(diffplot(center) controlplot);
       class Fertilizer;
       model BulbWt=Fertilizer;
       Tukey: lsmeans Fertilizer / pdiff=all adjust=tukey;
       title "Post-Hoc Analysis of ANOVA - Fertilizer as Predictor";
    run;
    quit;
    
    title;

    Here are the results.


  2. Which types of fertilizer are significantly different?

    Solution:

    The Tukey comparisons show significant differences between fertilizers 3 and 4 (p=0.0020) and 1 and 4 (p=0.0058).

  3. Use level 4 (the chemical fertilizer) as the control group and perform a Dunnett's comparison with the organic fertilizers to see whether they affected the average weights of garlic bulbs differently from the control fertilizer.

    Solution:

    /*st102s02.sas*/
    ods graphics;
    
    ods select lsmeans diff diffplot controlplot;
    proc glm data=STAT1.Garlic 
             plots(only)=(diffplot(center) controlplot);
       class Fertilizer;
       model BulbWt=Fertilizer;
       Dunnett: lsmeans Fertilizer / pdiff=control('4') adjust=dunnett;
       title "Post-Hoc Analysis of ANOVA - Fertilizer as Predictor";
    run;
    quit;
    
    title;

    Here are the results.


  4. Which types of fertilizer are significantly different?

    Solution:

    The Dunnett comparisons show the same pairs as significantly different, but with smaller p-values than with the Tukey comparisons (3 versus 4 p=0.0011, 1 versus 4 p=0.0031). This is because the Tukey adjustment accounts for more pairwise comparisons (six) than the Dunnett adjustment (three, one per comparison against the control).


  5. Challenge: Perform unadjusted tests of all pairwise comparisons to see what would happen if the multiple-comparison adjustments were not made.

    Solution:

    /*st102s02.sas*/
    ods graphics;
    
    ods select lsmeans diff diffplot controlplot;
    proc glm data=STAT1.Garlic 
             plots(only)=(diffplot(center) controlplot);
       class Fertilizer;
       model BulbWt=Fertilizer;
       No_Adjust: lsmeans Fertilizer / pdiff=all adjust=t;
       title "Post-Hoc Analysis of ANOVA - Fertilizer as Predictor";
    run;
    quit;
    
    title;

    Here are the results.


  6. How do the results compare to what you saw in the Tukey adjusted tests?

    Solution:

    The unadjusted (t test) comparisons have smaller p-values than they had with Tukey adjustments. One additional comparison has a p-value below 0.05 (2 versus 3).
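
For comparison, LSMEANS also supports adjustments between the unadjusted and Tukey extremes, such as Bonferroni. A sketch (not part of the course solution files):

```sas
/* Sketch: Bonferroni-adjusted pairwise comparisons, for contrast
   with the Tukey, Dunnett, and unadjusted results above. */
proc glm data=STAT1.Garlic;
   class Fertilizer;
   model BulbWt=Fertilizer;
   Bonferroni: lsmeans Fertilizer / pdiff=all adjust=bon;
run;
quit;
```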

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 3 TASK VERSION

Practice: Using the One-Way ANOVA Task to Perform Post Hoc Pairwise Comparisons

Consider the analysis of the garlic data set. In the previous exercise, you used PROC GLM to perform one-way ANOVA, and found that there was a statistically significant difference among mean garlic bulb weights for the different fertilizers. Now, perform a post hoc test to look at the individual differences among means.

  1. Use the One-Way ANOVA task to conduct pairwise comparisons with an experimentwise error rate of α=0.05. (Use the Tukey adjustment.)

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the One-Way ANOVA task.
    4. Select the stat1.garlic table.
    5. Assign BulbWt to the Dependent variable role and assign Fertilizer to the Categorical variable role.
    6. On the OPTIONS tab, under HOMOGENEITY OF VARIANCE, use the Test drop-down list to select None, and clear the check box for Welch's variance-weighted ANOVA.
    7. Under COMPARISONS, use the drop-down list for Comparisons method and select Tukey for Tukey's HSD.
    8. Under PLOTS, use the Display plots drop-down list to select Selected plots, and then select LS-mean difference plot. Clear all other check boxes.
    9. Run the task.

    Here are the results.


  2. Which types of fertilizer are significantly different?

    Solution:

    The Tukey comparisons show significant differences between fertilizers 3 and 4 (p=0.0020) and 1 and 4 (p=0.0058).

  3. Modify the task to use level 4 (the chemical fertilizer) as the control group and perform a Dunnett's comparison with the organic fertilizers to see whether they affected the average weights of garlic bulbs differently from the control fertilizer.

    Solution:

    To compare the output for the three different comparison methods, rerun the task with different comparison methods.
    1. On the OPTIONS tab, under COMPARISONS, select Dunnett two-tail as the Comparisons method, and select 4 as the Control level.
    2. To include the control plot in the output, under PLOTS, use the Display plots drop-down list and select Default plots.
    3. Click Run.

    Here are the results.


  4. Which types of fertilizer are significantly different?

    Solution:

    The Dunnett comparisons show the same pairs as significantly different, but with smaller p-values than with the Tukey comparisons (3 versus 4 p=0.0011, 1 versus 4 p=0.0031). This is because the Tukey adjustment accounts for more pairwise comparisons (six) than the Dunnett adjustment (three, one per comparison against the control).


  5. Challenge: Perform unadjusted tests of all pairwise comparisons to see what would happen if the multiple-comparison adjustments were not made.

    Solution:

    1. For unadjusted tests, on the OPTIONS tab, under COMPARISONS, choose Least significant difference (LSD) as the Comparisons method.
    2. Click Run.

    Here are the results.


  6. How do the results compare to what you saw in the Tukey adjusted tests?

    Solution:

    The unadjusted (t test) comparisons have smaller p-values than they had with Tukey adjustments. One additional comparison has a p-value below 0.05 (2 versus 3).

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 4 CODE VERSION

Practice: Using PROC CORR to Describe the Relationship between Continuous Variables

The percentage of body fat, age, weight, height, and 10 body circumference measurements (for example, abdomen) were recorded for 252 men by Dr. Roger W. Johnson of Calvin College in Minnesota. The data are in the stat1.bodyfat2 data set. Body fat, one measure of health, has been accurately estimated by a water displacement measurement technique.

  1. Generate scatter plots and correlations for the VAR variables Age, Weight, and Height, and the circumference measures Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist versus the WITH variable, PctBodyFat2.

    **IMPORTANT: For PROC CORR, ODS Graphics will display a maximum of 10 VAR variable plots at a time. This practice analyzes thirteen variables, so it requires two PROC CORR steps to generate all thirteen plots. This limitation applies only to the ODS graphics. The correlation table displays all variables in the VAR statement by default.

    Analyze the relationships:

    • Write a PROC CORR step to analyze all thirteen variables (Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist ). This will generate a correlation table for all of the variables, but it will display plots for only the first ten.
    • Write an ODS statement to limit the graphic output to scatter plots.
    • Write another PROC CORR step to look at only the last three variables, Biceps, Forearm, and Wrist.

    Submit the code. The output should include a correlation table for all thirteen variables followed by plots for the first ten, and then plots for the last three.

    Solution:

    /*st102s03.sas*/  /*Part A*/
    %let interval=Age Weight Height Neck Chest Abdomen Hip 
                  Thigh Knee Ankle Biceps Forearm Wrist;
    
    ods graphics / reset=all imagemap;
    proc corr data=STAT1.BodyFat2
              plots(only)=scatter(nvar=all ellipse=none);
       var &interval;
       with PctBodyFat2;
       id Case;
       title "Correlations and Scatter Plots";
    run;
    
    %let interval=Biceps Forearm Wrist;
    
    ods graphics / reset=all imagemap;
    ods select scatterplot;
    proc corr data=STAT1.BodyFat2
              plots(only)=scatter(nvar=all ellipse=none);
       var &interval;
       with PctBodyFat2;
       id Case;
       title "Correlations and Scatter Plots";
    run;

    Here are the results.


  2. Examine the plots. Can straight lines adequately describe the relationships?

    Solution:

    Yes. Height seems to be the only variable that shows no real linear relationship. Age and Ankle show little linear trend.

  3. Are there any outliers that you should investigate?

    Solution:

    One person has outlying values for several measurements. In addition, there are one or two values that seem to be outliers for Ankle.

  4. Which variable has the highest correlation with PctBodyFat2?

    Solution:

    Abdomen, with r=0.81343, is the variable with the highest correlation with PctBodyFat2.

  5. What is the p-value for the coefficient? Is it statistically significant at the 0.05 level?

    Solution:

    The p-value is <.0001.

  6. Generate correlations among all the variables previously mentioned (Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist) minus PctBodyFat2. Use the OUT= option in the PROC CORR statement to output the correlation table into a data set named pearson. Use the BEST= option to select only the highest five per variable.

    Submit the code and review the results.

    Solution:

    /*st102s03.sas*/  /*Part B*/
    ods graphics off;
    %let interval=Age Weight Height Neck Chest Abdomen Hip Thigh
                  Knee Ankle Biceps Forearm Wrist;
    
    proc corr data=STAT1.BodyFat2 
              nosimple 
              best=5
              out=pearson;
       var &interval;
       title "Correlations of Predictors";
    run;

    Here are the results.


  7. Are there any notable relationships?

    Solution:

    Several relationships appear to have high correlations (such as those among Hip, Thigh, and Knee). Weight seems to correlate highly with all circumference variables.

  8. Challenge: Use the pearson data set to print only the correlations whose absolute values are 0.70 and above, or note them with an asterisk in the full correlation table.

    Submit the code and review the results.

    Solution:

    Potential solution to printing the correlation matrix with asterisks in the full correlation table:
    /*st102s03.sas*/  /*Part B*/
    %let big=0.7;
    proc format;
        picture correlations &big -< 1 = '009.99' (prefix="*")
                             -1 <- -&big = '009.99' (prefix="*")
                             -&big <-< &big = '009.99';
    run;
    
    proc print data=pearson;
        var _NAME_ &interval;
        where _type_="CORR";
        format &interval correlations.;
    run;

    Here are the results.

    Potential solution to printing only the correlations whose absolute values are 0.7 and above:
    /*st102s03.sas*/  /*Part B*/
    %let big=0.7;
    data bigcorr;
        set pearson;
        array vars{*} &interval;
        do i=1 to dim(vars);
            if abs(vars{i})<&big then vars{i}=.;
        end;
        if _type_="CORR";
        drop i _type_;
    run;
    
    proc print data=bigcorr;
        format &interval 5.2;
    run;
    
    title;

    Here are the results.


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 4 TASK VERSION

Practice: Using the Correlation Analysis Task to Describe the Relationship between Continuous Variables

The percentage of body fat, age, weight, height, and 10 body circumference measurements (for example, abdomen) were recorded for 252 men by Dr. Roger W. Johnson of Calvin College in Minnesota. The data are in the bodyfat2 data set. Body fat, one measure of health, has been accurately estimated by a water displacement measurement technique.

  1. Use PROC CORR to generate scatter plots and correlations for the VAR variables Age, Weight, and Height, and the circumference measures versus the WITH variable, PctBodyFat2.

    IMPORTANT: For PROC CORR, ODS Graphics limits you to 10 VAR variables at a time. For this exercise, look at the relationships with Age, Weight, and Height separately from the circumference variables (Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist).

    Note: This limitation exists only for the graphics that are obtained from ODS. The correlation table displays all variables in the VAR statement by default.

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the Correlation Analysis task.
    4. Select the stat1.bodyfat2 data set.
    5. Assign Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist to the Analysis variables role.
    6. Assign PctBodyFat2 to the Correlate with role.
    7. On the OPTIONS tab, under STATISTICS, use the drop-down list for Display statistics and select Selected statistics. Then select Correlations, Display p-values, and Descriptive statistics. (Correlations and Display p-values might already be selected.)
    8. Under PLOTS, use the drop-down list for Type of plot and select Individual scatter plots. Ensure the check box for Include inset statistics is selected, and change the Number of variables to plot value from 5 to 10.
    9. Run the task.

    Here are the results.


  2. Modify the task to generate the scatter plots for the remaining variables, Biceps, Forearm, and Wrist.

    Solution:

    1. On the DATA tab, select all the Analysis variables except Biceps, Forearm, and Wrist and click the Remove column icon.
    2. Run the task.

    Here are the results.


  3. Examine all of the plots. Can straight lines adequately describe the relationships?

    Solution:

    For the most part, yes. Height seems to be the only variable that shows no real linear relationship, and Age and Ankle show little linear trend.

  4. Are there any outliers that you should investigate?

    Solution:

    One person has outlying values for several measurements. In addition, there are one or two values that seem to be outliers for Ankle.

  5. Which variable has the highest correlation with PctBodyFat2?

    Solution:

    Abdomen, with r=0.81343, is the variable with the highest correlation with PctBodyFat2.

  6. What is the p-value for the coefficient? Is it statistically significant at the 0.05 level?

    Solution:

    The p-value is <.0001, which is less than 0.05, so the correlation is statistically significant at the 0.05 level.

  7. Modify the Correlation Analysis task to generate correlations among all the variables previously mentioned (Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist) minus PctBodyFat2. Don't generate descriptive statistics or plots again. Select only the highest five per variable.

    Note: You'll need to edit the generated code to select the highest five.

    Solution:

    1. On the DATA tab, assign Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee and Ankle to the Analysis variables role. Biceps, Forearm, and Wrist should already be assigned to this role.
    2. In the Correlate with box, select PctBodyFat2 and click the Remove column icon.
    3. On the OPTIONS tab, clear the check box for Descriptive statistics.
    4. Use the drop-down list for Type of plot and select None.
    5. To print the highest five correlated variables for each variable, edit the generated code. Click the Edit SAS Code icon in the CODE tab and add best=5 as follows:
      ods noproctitle;
      ods graphics / imagemap=on;
      
      proc corr data=STAT1.BODYFAT2 best=5 pearson nosimple plots=none;
         var Biceps Forearm Wrist Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle;
      run;
    6. Run the code.

    Here are the results. The order of the variables in your results might vary, depending on the order in which you selected the variable names for the Analysis variables role.


  8. Are there any notable relationships?

    Solution:

    Several relationships seem to have high correlations (such as those among Hip, Thigh, and Knee). Weight seems to correlate highly with all circumference variables.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 5 CODE VERSION

Practice: Using PROC REG to Fit a Simple Linear Regression Model

Using the bodyfat2 data set, perform a simple linear regression model.

  1. Perform a simple linear regression model with PctBodyFat2 as the response variable and Weight as the predictor.

    Solution:

    /*st102s04.sas*/
    ods graphics on;
    
    proc reg data=STAT1.BodyFat2;
       model PctBodyFat2=Weight;
       title "Regression of % Body Fat on Weight";
    run;
    quit;
    
    title;

    Here are the results.


  2. What is the value of the F statistic and the associated p-value? How would you interpret this in connection with the null hypothesis?

    Solution:

    The value of the F statistic is 150.03, and the p-value is <.0001. Therefore, you would reject the null hypothesis of no relationship, that is, of a zero slope for Weight.

  3. Write the predicted regression equation.

    Solution:

    The predicted regression equation is PctBodyFat2 = -12.05158 + 0.17439 * Weight.
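
    As a quick check of the equation, you can apply it in a short DATA step. The weight value of 180 pounds used here is purely hypothetical, chosen only to illustrate the calculation.

    /*Apply the fitted equation to a hypothetical weight*/
    data pred;
       Weight=180;                                     /* hypothetical value */
       PctBodyFat2_Pred=-12.05158 + 0.17439*Weight;    /* about 19.34 */
    run;

    proc print data=pred;
    run;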

  4. What is the value of R-square? How would you interpret this?

    Solution:

    The R-square value of 0.3751 can be interpreted to mean that 37.51% of the variability in PctBodyFat2 can be explained by Weight.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 5 TASK VERSION

Practice: Using PROC REG to Fit a Simple Linear Regression Model

Using the bodyfat2 data set, perform a simple linear regression model.

  1. Perform a simple linear regression model with PctBodyFat2 as the response variable and Weight as the predictor.

    Solution:

    /*st102s04.sas*/
    ods graphics on;
    
    proc reg data=STAT1.BodyFat2;
       model PctBodyFat2=Weight;
       title "Regression of % Body Fat on Weight";
    run;
    quit;
    
    title;

    Here are the results.


  2. What is the value of the F statistic and the associated p-value? How would you interpret this in connection with the null hypothesis?

    Solution:

    The value of the F statistic is 150.03, and the p-value is <.0001. Therefore, you would reject the null hypothesis of no relationship, that is, of a zero slope for Weight.

  3. Write the predicted regression equation.

    Solution:

    The predicted regression equation is PctBodyFat2 = -12.05158 + 0.17439 * Weight.

  4. What is the value of R-square? How would you interpret this?

    Solution:

    The R-square value of 0.3751 can be interpreted to mean that 37.51% of the variability in PctBodyFat2 can be explained by Weight.

Lesson 03

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 CODE VERSION

Practice: Performing a Two-Way ANOVA Using PROC GLM

Data were collected to determine whether different dosage levels of a drug have an effect on blood pressure for people with one of three types of heart disease. The data are in the stat1.drug data set.

  1. Examine the data with a vertical line plot. Put BloodP on the Y axis, and DrugDose on the X axis, and then stratify by Disease.

    /*st103s01.sas*/ /*Part A*/ 
    proc sgplot data=STAT1.drug; 
       vline DrugDose / group=Disease stat=mean response=BloodP markers; 
       format DrugDose dosefmt.; 
    run;

    Here are the results.

  2. What information can you obtain by looking at the data?

    It seems that the drug dose affects blood pressure, but the effect is not consistent across diseases. Higher doses result in increased blood pressure for patients with disease B, decreased blood pressure for patients with disease A, and little change in blood pressure for patients with disease C.


  3. Test the hypothesis that the means are equal. Be sure to include an interaction term if the graphical analysis that you performed indicates that would be advisable.

    /*st103s01.sas*/ /*Part B*/
    ods graphics on;
    proc glm data=STAT1.drug plots(only)=intplot;
       class DrugDose Disease;
       model BloodP=DrugDose|Disease;
       lsmeans DrugDose*Disease;
    run;
    quit;

    Here are the results.

  4. What conclusions can you reach at this point?

    The global F test indicates a significant difference among the different groups. Because the interaction is in the model, this is a test of all combinations of DrugDose*Disease against all other combinations. The R-square value implies that approximately 35% of the variation in BloodP can be explained by variations in the explanatory variables. The interaction term is statistically significant, as predicted by the plot of the means.


  5. To investigate the interaction effect between the two factors, include the SLICE option, either by manually editing the generated code or by writing the code directly.

    Modify the LSMEANS statement to include the slice=Disease option preceded by a slash, as shown in the code below.
    /*st103s01.sas*/ /*Part B*/
    ods graphics on;
    proc glm data=STAT1.drug plots(only)=intplot;
       class DrugDose Disease;
       model BloodP=DrugDose|Disease;
       lsmeans DrugDose*Disease / slice=Disease;
    run;
    quit;

    Here are the results.

  6. Is the effect of DrugDose significant?


    The slice table shows the effect of DrugDose at each level of Disease. The effect is significant for every disease except disease C.


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 TASK VERSION

Practice: Performing a Two-Way ANOVA Using PROC GLM

Data were collected to determine whether different dosage levels of a drug have an effect on blood pressure for people with one of three types of heart disease. The data are in the stat1.drug data set.

  1. Examine the data with a vertical line plot. Put BloodP on the Y axis, and DrugDose on the X axis, and then stratify by Disease.

    /*st103s01.sas*/ /*Part A*/ 
    proc sgplot data=STAT1.drug; 
       vline DrugDose / group=Disease stat=mean response=BloodP markers; 
       format DrugDose dosefmt.; 
    run;

    Here are the results.

  2. What information can you obtain by looking at the data?

    It seems that the drug dose affects blood pressure, but the effect is not consistent across diseases. Higher doses result in increased blood pressure for patients with disease B, decreased blood pressure for patients with disease A, and little change in blood pressure for patients with disease C.


  3. Test the hypothesis that the means are equal. Be sure to include an interaction term if the graphical analysis that you performed indicates that would be advisable.

    /*st103s01.sas*/ /*Part B*/
    ods graphics on;
    proc glm data=STAT1.drug plots(only)=intplot;
       class DrugDose Disease;
       model BloodP=DrugDose|Disease;
       lsmeans DrugDose*Disease;
    run;
    quit;

    Here are the results.

  4. What conclusions can you reach at this point?

    The global F test indicates a significant difference among the different groups. Because the interaction is in the model, this is a test of all combinations of DrugDose*Disease against all other combinations. The R-square value implies that approximately 35% of the variation in BloodP can be explained by variations in the explanatory variables. The interaction term is statistically significant, as predicted by the plot of the means.


  5. To investigate the interaction effect between the two factors, include the SLICE option, either by manually editing the generated code or by writing the code directly.

    Modify the LSMEANS statement to include the slice=Disease option preceded by a slash, as shown in the code below.
    /*st103s01.sas*/ /*Part B*/
    ods graphics on;
    proc glm data=STAT1.drug plots(only)=intplot;
       class DrugDose Disease;
       model BloodP=DrugDose|Disease;
       lsmeans DrugDose*Disease / slice=Disease;
    run;
    quit;

    Here are the results.

  6. Is the effect of DrugDose significant?


    The slice table shows the effect of DrugDose at each level of Disease. The effect is significant for every disease except disease C.


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 2 CODE VERSION

Practice: Performing Multiple Regression Using PROC REG

Using the stat1.bodyfat2 table, fit a multiple regression model with multiple predictors, and then modify the model by removing the least significant predictors.
  1. Run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.

    Note: Turn off ODS Graphics.


    Submit the following program:
    /*st103s02.sas*/  /*Part A*/
    ods graphics off;
    proc reg data=STAT1.BodyFat2;
        model PctBodyFat2=Age Weight Height
              Neck Chest Abdomen Hip Thigh
              Knee Ankle Biceps Forearm Wrist;
        title 'Regression of PctBodyFat2 on All '
              'Predictors';
    run;
    quit;

    Here are the results.


  2. Compare the ANOVA table with this one from the model with only Weight. What is different?
    Analysis of Variance
    Source DF Sum of Squares Mean Square F Value Pr > F
    Model 1 6593.01614 6593.01614 150.03 <.0001
    Error 250 10986 43.94389    
    Corrected Total 251 17579      


    There are key differences between the ANOVA table for this model and the one for the simple linear regression model. The degrees of freedom for the model are much higher, 13 versus 1. Also, the model mean square and the F ratio are much smaller.
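
    As a sanity check on how these quantities relate, note that the F value is simply the model mean square divided by the error mean square. The DATA step below is only a sketch that reproduces the F value from the Weight-only ANOVA table above.

    data fcheck;
       ms_model=6593.01614;     /* Mean Square, Model row */
       ms_error=43.94389;       /* Mean Square, Error row */
       f=ms_model/ms_error;     /* about 150.03 */
    run;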


  3. How do the R-Square and the adjusted R-Square compare with these statistics for the Weight regression?
    Root MSE 6.62902 R-Square 0.3751
    Dependent Mean 19.15079 Adj R-Sq 0.3726
    Coeff Var 34.61485    


    Both the R-Square and the adjusted R-Square for the full model are larger than those for the simple linear regression model. The multiple regression model explains almost 75% of the variation in the PctBodyFat2 variable, versus approximately 37.5% that is explained by the simple linear regression model.


  4. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?


    Yes, including the other variables in the model changed both the estimate of the intercept and the slope for Weight. Also, the p-values for both changed dramatically. The slope of Weight is now not significantly different from zero.


  5. To simplify the model, rerun the model from step 1, but eliminate the variable with the highest p-value. Compare the output with the model from step 1.


    Submit the code below. Knee was removed because it has the largest p-value (0.9552).
    /*st103s02.sas*/  /*Part B*/
    ods graphics off;
    proc reg data=STAT1.BodyFat2;
        model PctBodyFat2=Age Weight Height
              Neck Chest Abdomen Hip Thigh
              Ankle Biceps Forearm Wrist;
        title 'Regression of PctBodyFat2 on All '
              'Predictors, Minus Knee';
    run;
    quit;

    Here are the results.


  6. Did the p-value for the model change?


    The p-value for the model did not change to four decimal places.


  7. Did the R-Square and the adjusted R-Square values change?


    The R-Square showed essentially no change. The adjusted R-Square increased from 0.7348 to 0.7359. When the adjusted R-Square increases after a variable is removed from the model, it strongly implies that the removed variable was not necessary.
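
    The adjusted R-Square penalizes the model for each additional parameter, which is why it can increase when an unhelpful variable is dropped. As a sketch of the formula, the DATA step below reproduces the adjusted R-Square of the simple Weight-only model (n=252 observations, p=1 predictor).

    data adjr2;
       n=252;                            /* number of observations      */
       p=1;                              /* number of predictors        */
       r2=0.3751;                        /* R-Square, Weight-only model */
       adj_r2=1-(1-r2)*(n-1)/(n-p-1);    /* about 0.3726 */
    run;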


  8. Did the parameter estimates and their p-values change?


    Some of the parameter estimates and their p-values changed slightly, but none to any large degree.


  9. To simplify the model further, rerun the model from step 5, but eliminate the variable with the highest p-value. How did the output change from the previous model?

    Submit the following code. Chest was removed because it is the variable with the highest p-value in the previous model.
    /*st103s02.sas*/  /*Part C*/
    ods graphics off;
    proc reg data=STAT1.BodyFat2;
        model PctBodyFat2=Age Weight Height
              Neck Abdomen Hip Thigh
              Ankle Biceps Forearm Wrist;
        title 'Regression of PctBodyFat2 on All '
              'Predictors, Minus Knee, Chest';
    run;
    quit;

    Here are the results.

    The ANOVA table did not change significantly. The R-Square remained essentially unchanged. The adjusted R-Square increased again. This confirms that the variable Chest did not contribute to explaining the variation in PctBodyFat2 when the other variables were in the model.


  10. Did the number of parameters with p-values less than 0.05 change?


    The p-value for Weight changed more than any other and is now slightly more than 0.05. The p-values and parameter estimates for the other variables changed much less. Compared with the previous model, the number of variables with p-values below 0.05 did not increase.


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 2 TASK VERSION

Practice: Performing Multiple Regression Using the Linear Regression Task

Using the bodyfat2 table, fit a multiple regression model with multiple predictors, and then modify the model by removing the least significant predictors.
  1. Run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist.

    Note: Turn off ODS Graphics.


    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and select the Linear Regression task.
    4. On the DATA tab, select the stat1.bodyfat2 table.
    5. Assign PctBodyFat2 to the Dependent variable role.
    6. Assign Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist to the Continuous variables role.
    7. On the MODEL tab, click the Edit button to open the Model Effects Builder. Add all the variables as model effects, and click OK to close the Model Builder.
    8. On the OPTIONS tab, clear all check boxes for Diagnostic and Residual Plots and Scatter Plots.
    9. Run the code.

    Here are the results.


  2. Compare the ANOVA table with this one from the model with only Weight. What is different?
    Analysis of Variance
    Source DF Sum of Squares Mean Square F Value Pr > F
    Model 1 6593.01614 6593.01614 150.03 <.0001
    Error 250 10986 43.94389    
    Corrected Total 251 17579      


    There are key differences between the ANOVA table for this model and the one for the simple linear regression model. The degrees of freedom for the model are much higher, 13 versus 1. Also, the model mean square and the F ratio are much smaller.


  3. How do the R-Square and the adjusted R-Square compare with these statistics for the Weight regression?
    Root MSE 6.62902 R-Square 0.3751
    Dependent Mean 19.15079 Adj R-Sq 0.3726
    Coeff Var 34.61485    


    Both the R-Square and the adjusted R-Square for the full model are larger than those for the simple linear regression model. The multiple regression model explains almost 75% of the variation in the PctBodyFat2 variable, versus approximately 37.5% that is explained by the simple linear regression model.


  4. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?


    Yes, including the other variables in the model changed both the estimate of the intercept and the slope for Weight. Also, the p-values for both changed dramatically. The slope of Weight is now not significantly different from zero.


  5. To simplify the model, rerun the model from step 1, but eliminate the variable with the highest p-value. Compare the output with the model from step 1.


    Remove Knee from the list of continuous variables because it has the largest p-value (0.9552). It will be removed from the MODEL statement automatically. Run the regression again.

    Here are the results.


  6. Did the p-value for the model change?


    The p-value for the model did not change out to four decimal places.


  7. Did the R-Square and the adjusted R-Square values change?


    The R-Square showed essentially no change. The adjusted R-Square increased from 0.7348 to 0.7359. When the adjusted R-Square increases after a variable is removed from the model, it strongly implies that the removed variable was not necessary.


  8. Did the parameter estimates and their p-values change?


    Some of the parameter estimates and their p-values changed slightly, but none to any large degree.


  9. To simplify the model further, rerun the model from step 5, but eliminate the variable with the highest p-value. How did the output change from the previous model?

    Remove Chest from the list of continuous variables because it is the variable with the highest p-value in the previous model. Run the regression again.

    Here are the results.

    The ANOVA table did not change significantly. The R-Square remained essentially unchanged. The adjusted R-Square increased again. This confirms that the variable Chest did not contribute to explaining the variation in PctBodyFat2 when the other variables were in the model.


  10. Did the number of parameters with p-values less than 0.05 change?


    The p-value for Weight changed more than any other and is now slightly more than 0.05. The p-values and parameter estimates for the other variables changed much less. Compared with the previous model, the number of variables with p-values below 0.05 did not increase.


Lesson 04

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 1 Activity: Optional Stepwise Selection Method Code

Submit the following code to perform both the forward selection and backward elimination processes.

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
         Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom ;
         
proc glmselect data=stat1.ameshousing3 plots=all;
   FORWARD: model SalePrice=&interval / selection=forward details=steps select=SL slentry=0.05;
   title "Forward Model Selection for SalePrice - SL 0.05";
run;

proc glmselect data=stat1.ameshousing3 plots=all;
   BACKWARD: model SalePrice=&interval / selection=backward details=steps select=SL slstay=0.05;
   title "Backward Model Selection for SalePrice - SL 0.05";
run;
title;

Examine the results.

The final models that are obtained using the SLENTRY=0.05 and SLSTAY=0.05 criteria are displayed for FORWARD, BACKWARD, and STEPWISE. In this instance, all the selected models matched. However, this won't always be the case. When you run stepwise methods on your own data, the methods might select different models.

Also, recall the significance levels that the previous program used for entering the model and staying in the model. If you were to use different significance levels for entering the model and staying in the model, PROC GLMSELECT could produce very different models. The choice of SLENTRY and SLSTAY levels can greatly affect the final models that are selected using stepwise methods.

One last thing to remember is that the stepwise techniques don't take any collinearity in your model into account. Collinearity means that predictor variables in the same model are highly correlated. If collinearity is present in your model, you might want to consider first reducing the collinearity as much as possible and then running stepwise methods on the remaining variables.
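
One way to quantify collinearity before running stepwise methods is to request variance inflation factors with the VIF option in the MODEL statement of PROC REG. The sketch below assumes the &interval macro variable defined in the program above; predictors with large VIF values (a common rule of thumb is values above 10) are candidates for removal or combination.

proc reg data=stat1.ameshousing3;
   model SalePrice=&interval / vif;   /* VIF requests variance inflation factors */
   title "Variance Inflation Factors for the Candidate Predictors";
run;
quit;
title;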




Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 1 CODE VERSION

Exercise: Using PROC GLMSELECT to Perform Stepwise Selection

Use the stat1.bodyfat2 data set to identify a set of "best" models. Use significance-level model selection techniques.

  1. With the SELECTION=STEPWISE option, use SELECT=SL in PROC GLMSELECT to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Use the default values for SLENTRY= and SLSTAY=. Submit the code and view the results.

    Solution:

    /*st104s01.sas*/ /*Part A*/
    ods graphics on;
    proc glmselect data=stat1.bodyfat2 plots=all;
        STEPWISESL: model PctBodyFat2=Age Weight Height
                          Neck Chest Abdomen Hip Thigh
                          Knee Ankle Biceps Forearm Wrist
                          / SELECTION=STEPWISE SELECT=SL;
        title 'SL STEPWISE Selection with PctBodyFat2';
    run;

    Here are the results.

    In the results, notice the following:
    • Selection stopped because the candidate for entry has SLE > 0.15 and the candidate for removal has SLS < 0.15.
    • The stepwise selection process, using significance level, seems to select an eight-effect model (including the intercept).
    • The Coefficient panel shows that the standardized coefficients do not vary greatly when additional effects are added to the model.
    • The Fit panel indicates that the best model, according to AIC, AICC, and adjusted R-square, is the final model viewed during the selection process. SBC shows a minimum at step four.
    • The parameter estimates from the selected model are presented in the Parameter Estimates table.

  2. Modify the code to specify the forward selection process (FORWARD). Submit the code and view the results.

    Solution:

    /*st104s01.sas*/ /*Part B*/
    proc glmselect data=stat1.bodyfat2 plots=all;
        FORWARDSL: model PctBodyFat2=Age Weight Height
                         Neck Chest Abdomen Hip Thigh
                         Knee Ankle Biceps Forearm Wrist
                         / SELECTION=FORWARD SELECT=SL;
        title 'SL FORWARD Selection with PctBodyFat2';
    run;

    Here are the results.

    In the results, notice the following:
    • Selection stopped because the candidate for entry has SLE > 0.50.
    • The forward selection process, using significance level, seems to select an 11-effect model (including the intercept).
    • The Coefficient panel shows that the standardized coefficients do not vary greatly when additional effects are added to the model.
    • The Fit panel indicates that the best models, according to AIC, AICC, adjusted R-square, and SBC, are at various steps in the selection progression.
    • The parameter estimates from the selected model are presented in the Parameter Estimates table.

  3. How many variables would result from a model using forward selection and a significance-level-for-entry criterion of 0.05, instead of the default SLENTRY= value, 0.50? Modify and submit the code, and view the results.

    Solution:

    /*st104s01.sas*/ /*Part C*/
    proc glmselect data=stat1.bodyfat2 plots=all;
        FORWARDSL: model PctBodyFat2=Age Weight Height
                         Neck Chest Abdomen Hip Thigh
                         Knee Ankle Biceps Forearm Wrist
                         / SELECTION=FORWARD SELECT=SL
                         SLENTRY=0.05;
        title 'SL FORWARD (0.05) Selection with PctBodyFat2';
    run;

    Here are the results.

    The results show that, when the value of SLENTRY= is changed from the default to 0.05, the number of effects in the selected model is reduced to five (including the intercept).

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 1 TASK VERSION

Exercise: Using PROC GLMSELECT to Perform Stepwise Selection

Use the stat1.bodyfat2 data set to identify a set of "best" models. Use significance-level model selection techniques.

  1. With the SELECTION=STEPWISE option, use SELECT=SL in PROC GLMSELECT to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Use the default values for SLENTRY= and SLSTAY=. Submit the code and view the results.

    Solution:

    /*st104s01.sas*/ /*Part A*/
    ods graphics on;
    proc glmselect data=stat1.bodyfat2 plots=all;
        STEPWISESL: model PctBodyFat2=Age Weight Height
                          Neck Chest Abdomen Hip Thigh
                          Knee Ankle Biceps Forearm Wrist
                          / SELECTION=STEPWISE SELECT=SL;
        title 'SL STEPWISE Selection with PctBodyFat2';
    run;

    Here are the results.

    In the results, notice the following:
    • Selection stopped because the candidate for entry has SLE > 0.15 and the candidate for removal has SLS < 0.15.
    • The stepwise selection process, using significance level, seems to select an eight-effect model (including the intercept).
    • The Coefficient panel shows that the standardized coefficients do not vary greatly when additional effects are added to the model.
    • The Fit panel indicates that the best model, according to AIC, AICC, and adjusted R-square, is the final model viewed during the selection process. SBC shows a minimum at step four.
    • The parameter estimates from the selected model are presented in the Parameter Estimates table.

  2. Modify the code to specify the forward selection process (FORWARD). Submit the code and view the results.

    Solution:

    /*st104s01.sas*/ /*Part B*/
    proc glmselect data=stat1.bodyfat2 plots=all;
        FORWARDSL: model PctBodyFat2=Age Weight Height
                         Neck Chest Abdomen Hip Thigh
                         Knee Ankle Biceps Forearm Wrist
                         / SELECTION=FORWARD SELECT=SL;
        title 'SL FORWARD Selection with PctBodyFat2';
    run;

    Here are the results.

    In the results, notice the following:
    • Selection stopped because the candidate for entry has SLE > 0.50.
    • The forward selection process, using significance level, seems to select an 11-effect model (including the intercept).
    • The Coefficient panel shows that the standardized coefficients do not vary greatly when additional effects are added to the model.
    • The Fit panel indicates that the best models, according to AIC, AICC, adjusted R-square, and SBC, are at various steps in the selection progression.
    • The parameter estimates from the selected model are presented in the Parameter Estimates table.

  3. How many variables would result from a model using forward selection and a significance-level-for-entry criterion of 0.05, instead of the default SLENTRY= value, 0.50? Modify and submit the code, and view the results.

    Solution:

    /*st104s01.sas*/ /*Part C*/
    proc glmselect data=stat1.bodyfat2 plots=all;
        FORWARDSL: model PctBodyFat2=Age Weight Height
                         Neck Chest Abdomen Hip Thigh
                         Knee Ankle Biceps Forearm Wrist
                         / SELECTION=FORWARD SELECT=SL
                         SLENTRY=0.05;
        title 'SL FORWARD (0.05) Selection with PctBodyFat2';
    run;

    Here are the results.

    The results show that, when the value of SLENTRY= is changed from the default to 0.05, the number of effects in the selected model is reduced to five (including the intercept).
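    To see how the entry criterion controls model size, you could rerun the forward selection at several SLENTRY= values. This is an optional experiment, not part of the course solution; the macro name try_sle is hypothetical, and the syntax follows the MODEL statement options used in the solutions above.

```sas
/* Optional experiment (not part of the solution): rerun forward        */
/* selection at several entry significance levels to see how the        */
/* selected model shrinks as SLENTRY= decreases.                        */
%macro try_sle(sle);
   proc glmselect data=stat1.bodyfat2;
      FORWARDSL: model PctBodyFat2=Age Weight Height
                       Neck Chest Abdomen Hip Thigh
                       Knee Ankle Biceps Forearm Wrist
                       / selection=forward select=sl slentry=&sle;
      title "Forward selection with SLENTRY=&sle";
   run;
%mend try_sle;

%try_sle(0.50)
%try_sle(0.15)
%try_sle(0.05)
```

    Comparing the Selection Summary tables across runs shows the model growing smaller as the entry criterion becomes stricter.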

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 2 CODE VERSION

Exercise: Using PROC GLMSELECT to Perform Other Model Selection Techniques

Use the stat1.bodyfat2 data set to identify a set of "best" models using other model selection techniques.

  1. With the SELECTION=STEPWISE option, use SELECT=SBC in PROC GLMSELECT to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Submit the code and view the results.

    Solution:

    /*st104s02.sas*/ /*Part A*/
    ods graphics on;
    proc glmselect data=STAT1.bodyfat2 plots=all;
       STEPWISESBC: model PctBodyFat2 = Age Weight Height Neck Chest Abdomen
                          Hip Thigh Knee Ankle Biceps Forearm Wrist
                          / SELECTION=STEPWISE SELECT=SBC;
       title 'SBC STEPWISE Selection with PctBodyFat2';
    run;

    Here are the results.

    In the results, notice the following:
    • The stepwise selection process, using SELECT=SBC, seems to select a five-effect model (including the intercept).
    • The Coefficient panel shows that the standardized coefficients do not vary greatly when additional effects are added to the model.
    • The Fit panel indicates that the best model, according to AIC, AICC, adjusted R-square, and SBC, is the final model viewed during the selection process. Remember that this comparison includes only the models that were evaluated during the selection steps.
    • The parameter estimates from the selected model are presented in the Parameter Estimates table.

  2. Modify the code to specify SELECT=AIC. Submit the code and view the results.

    Solution:

    /*st104s02.sas*/ /*Part B*/
    proc glmselect data=stat1.bodyfat2 plots=all;
       STEPWISEAIC: model PctBodyFat2=Age Weight Height
                          Neck Chest Abdomen Hip Thigh
                          Knee Ankle Biceps Forearm Wrist
                          / SELECTION=STEPWISE SELECT=AIC;
       title 'AIC STEPWISE Selection with PctBodyFat2';
    run;
    quit;
    title;

    Here are the results.

    Using SELECT=AIC, the selected model contains nine effects (including the intercept).
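    SBC tends to choose smaller models than AIC because its per-parameter penalty grows with the sample size. As a rough illustration (assuming n=252, the size of the classic body fat study data; this sketch is not part of the course solution), a short DATA step shows how the two penalty terms diverge as effects are added:

```sas
/* Illustrative sketch: for the same SSE, AIC penalizes each parameter  */
/* by 2, while SBC penalizes by log(n). With log(252) = 5.53, SBC       */
/* charges far more per effect, which is why it selects fewer of them.  */
data penalty;
   n=252;   /* assumed number of observations in bodyfat2 */
   do p=1 to 13;
      aic_penalty=2*p;
      sbc_penalty=log(n)*p;   /* natural log, as SAS LOG() computes */
      diff=sbc_penalty-aic_penalty;
      output;
   end;
run;

proc print data=penalty noobs;
run;
```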

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 2 TASK VERSION

Exercise: Using PROC GLMSELECT to Perform Other Model Selection Techniques

Use the stat1.bodyfat2 data set to identify a set of "best" models using other model selection techniques.

  1. With the SELECTION=STEPWISE option, use SELECT=SBC in PROC GLMSELECT to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Submit the code and view the results.

    Solution:

    /*st104s02.sas*/ /*Part A*/
    ods graphics on;
    proc glmselect data=STAT1.bodyfat2 plots=all;
       STEPWISESBC: model PctBodyFat2 = Age Weight Height Neck Chest Abdomen
                          Hip Thigh Knee Ankle Biceps Forearm Wrist
                          / SELECTION=STEPWISE SELECT=SBC;
       title 'SBC STEPWISE Selection with PctBodyFat2';
    run;

    Here are the results.

    In the results, notice the following:
    • The stepwise selection process, using SELECT=SBC, seems to select a five-effect model (including the intercept).
    • The Coefficient panel shows that the standardized coefficients do not vary greatly when additional effects are added to the model.
    • The Fit panel indicates that the best model, according to AIC, AICC, adjusted R-square, and SBC, is the final model viewed during the selection process. Remember that this comparison includes only the models that were evaluated during the selection steps.
    • The parameter estimates from the selected model are presented in the Parameter Estimates table.

  2. Modify the code to specify SELECT=AIC. Submit the code and view the results.

    Solution:

    /*st104s02.sas*/ /*Part B*/
    proc glmselect data=stat1.bodyfat2 plots=all;
       STEPWISEAIC: model PctBodyFat2=Age Weight Height
                          Neck Chest Abdomen Hip Thigh
                          Knee Ankle Biceps Forearm Wrist
                          / SELECTION=STEPWISE SELECT=AIC;
       title 'AIC STEPWISE Selection with PctBodyFat2';
    run;
    quit;
    title;

    Here are the results.

    Using SELECT=AIC, the selected model contains nine effects (including the intercept).

Lesson 05

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 1 CODE VERSION

Practice: Using PROC REG to Examine Residuals

Run a regression on PctBodyFat2 in the stat1.bodyfat2 data set to examine residuals.

  1. Use PROC REG to run a regression model of PctBodyFat2 on Abdomen, Weight, Wrist, and Forearm. Create plots of the residuals by the four regressors and by the predicted values, and a normal Q-Q plot. Submit the code and view the results.

    Solution:

    /*st105s01.sas*/
    ods graphics / imagemap=on;
    
    proc reg data=STAT1.BodyFat2 
             plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
       FORWARD: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
       id Case;
       title 'FORWARD Model - Plots of Diagnostic Statistics';
    run;
    quit;

    Here are the results.


  2. Do the residual plots indicate any problems with the constant variance assumption?

    Solution:

    The data do not appear to violate the assumption of constant variance. The residuals also show random scatter, indicating no problem with the model specification.

  3. Are there any outliers indicated by the evidence in any of the residual plots?

    Solution:

    There are a few outliers for Wrist and Forearm, and one clear outlier in both Abdomen and Weight.

  4. Does the Q-Q plot indicate any problems with the normality assumption?

    Solution:

    The normality assumption seems to be met.
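    If you want a formal check to accompany the visual Q-Q plot assessment (an optional supplement, not part of the solution), you could output the residuals and request normality tests with PROC UNIVARIATE. The data set name resids and variable name residual are chosen here for illustration.

```sas
/* Optional supplement: save the residuals from the same model and      */
/* request formal normality tests alongside a Q-Q plot.                 */
proc reg data=STAT1.BodyFat2 noprint;
   model PctBodyFat2 = Abdomen Weight Wrist Forearm;
   output out=resids r=residual;
run;
quit;

proc univariate data=resids normal;
   var residual;
   qqplot residual;
run;
```

    Keep in mind that with roughly 250 observations, formal tests can flag small departures from normality that have little practical effect on the regression inferences.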

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 1 TASK VERSION

Practice: Using PROC REG to Examine Residuals

Run a regression on PctBodyFat2 in the stat1.bodyfat2 data set to examine residuals.

  1. Use PROC REG to run a regression model of PctBodyFat2 on Abdomen, Weight, Wrist, and Forearm. Create plots of the residuals by the four regressors and by the predicted values, and a normal Q-Q plot. Submit the code and view the results.

    Solution:

    /*st105s01.sas*/
    ods graphics / imagemap=on;
    
    proc reg data=STAT1.BodyFat2 
             plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
       FORWARD: model PctBodyFat2 = Abdomen Weight Wrist Forearm;
       id Case;
       title 'FORWARD Model - Plots of Diagnostic Statistics';
    run;
    quit;

    Here are the results.


  2. Do the residual plots indicate any problems with the constant variance assumption?

    Solution:

    The data do not appear to violate the assumption of constant variance. The residuals also show random scatter, indicating no problem with the model specification.

  3. Are there any outliers indicated by the evidence in any of the residual plots?

    Solution:

    There are a few outliers for Wrist and Forearm, and one clear outlier in both Abdomen and Weight.

  4. Does the Q-Q plot indicate any problems with the normality assumption?

    Solution:

    The normality assumption seems to be met.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 2 CODE VERSION

Practice: Using PROC REG to Generate Potential Outliers

Generate statistics for potential outliers in the stat1.bodyfat2 data set. Write this data to an output data set, and print your results.

  1. Use PROC REG to run a regression model of PctBodyFat2 on Abdomen, Weight, Wrist, and Forearm. Create plots to identify potential influential observations that are based on the suggested cutoff values. Submit the code and view the results.

    Solution:

    /*st105s02.sas*/  /*Part A*/
    ods graphics on;
    ods output RSTUDENTBYPREDICTED=Rstud 
               COOKSDPLOT=Cook
               DFFITSPLOT=Dffits 
               DFBETASPANEL=Dfbs;
    proc reg data=STAT1.BodyFat2 
             plots(only label)=
                  (RSTUDENTBYPREDICTED 
                   COOKSD 
                   DFFITS 
                   DFBETAS);
       FORWARD: model PctBodyFat2
                     = Abdomen Weight Wrist Forearm;
       id Case;
       title 'FORWARD Model - Plots of Diagnostic Statistics';
    run;
    quit;

    Here are the results.

    • In the RStudent by Predicted for PctBodyFat2 scatter plot, only a modest number of observations are further than two standard error units from the mean of 0.
    • In the Cook's D for PctBodyFat2 plot, there are 10 labeled outliers, but observation 39 is clearly the most extreme.
    • In the Influence Diagnostics for PctBodyFat2 plot, the same observations are shown to be influential by the DFFITS statistic.
    • In the panel plot, DFBETAS are particularly high for observation 39 on the parameters for Weight and Forearm circumference.
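    The cutoff lines drawn in these plots follow conventional rules of thumb that depend on the sample size n and the number of parameters p. As a sketch (assuming n=252 observations, as in the classic body fat study, and p=5 parameters for this four-predictor model), you can compute them in a short DATA step:

```sas
/* Illustrative sketch (assumed n and p): conventional cutoffs used     */
/* to label influential observations in the diagnostic plots.           */
data cutoffs;
   n=252;   /* assumed number of observations */
   p=5;     /* 4 predictors + intercept       */
   cooksd_cutoff  = 4/n;           /* Cook's D rule of thumb   */
   dffits_cutoff  = 2*sqrt(p/n);   /* |DFFITS| rule of thumb   */
   dfbetas_cutoff = 2/sqrt(n);     /* |DFBETAS| rule of thumb  */
run;

proc print data=cutoffs noobs;
run;
```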

  2. Write the residuals output to a data set named influential, subset the data to select only observations that are potentially influential outliers, and print your results. Submit the code and view the results.

    Solution:

    /*st105s02.sas*/  /*Part B*/
    data influential;
    /*  Merge datasets from above.*/
        merge Rstud
              Cook 
              Dffits
          Dfbs;
        by observation;
    
    /*  Flag observations that exceeded at least one cutpoint.*/
        if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;
        array dfbetas{*} _dfbetasout: ;
        do i=2 to dim(dfbetas);
            if dfbetas{i} then flag=1;
        end;
    
    /*  Set influence statistics to missing for observations*/
    /*  that did not exceed cutpoints.*/
        if ABS(Rstudent)<=3 then RStudent=.;
        if Cooksdlabel eq ' ' then CooksD=.;
    
    /*  Subset only observations that have been flagged.*/
        if flag=1;
        drop i flag;
    run;
    
    proc print data=influential;
        id observation ID1;
        var Rstudent CooksD Dffitsout _dfbetasout:; 
    run;

    Here are the results.

    The same observations appear in the PROC PRINT report as in the plots.

    Examine the values of observation 39 to see what is causing problems. You might find it interesting.
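    A quick way to examine that observation (an optional step, not part of the solution) is to print its raw measurements; this assumes, as the ID statement above suggests, that Case identifies observations in stat1.bodyfat2.

```sas
/* Optional: display the raw measurements for the most extreme          */
/* influential observation flagged in the diagnostic plots.             */
proc print data=STAT1.BodyFat2;
   where Case=39;
   var Case PctBodyFat2 Abdomen Weight Height Wrist Forearm;
run;
```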

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 2 TASK VERSION

Practice: Using the Linear Regression Task to Generate Potential Outliers

Generate statistics for potential outliers in the bodyfat2 data set, write this data to an output data set, and print your results.

  1. Use the Linear Regression task to run a regression model of PctBodyFat2 on Abdomen, Weight, Wrist, and Forearm. Create plots to identify potential influential observations that are based on the suggested cutoff values.

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the Linear Regression task.
    4. Select the stat1.bodyfat2 table.
    5. Assign PctBodyFat2 to the Dependent variable role.
    6. Assign Abdomen, Weight, Wrist, and Forearm to the Continuous variables role.
    7. On the MODEL tab, click the Edit this model icon, select all variables, and click Add. Then click OK.
    8. On the OPTIONS tab, expand Diagnostic and Residual Plots and clear the check boxes for Diagnostic plots and Residuals for each explanatory variable.
    9. Expand More Diagnostics Plots and select all four check boxes. This will display diagnostic plots with labels for influential observations.
    10. Expand Scatter Plots and clear the check box for Observed values by predicted values.
    11. Modify the code to add the Cook's D influence statistics and to export the RSTUDENT, DFFITS, DFBETAS, and Cook's D statistics.
      1. On the CODE tab, click the Edit SAS code icon.
        1. In the PROC REG step, enter cooksd within the parentheses where the plots are listed.
        2. Add an ODS OUTPUT statement to save the data from the plots.
          • Save RSTUDENTBYPREDICTED in a data set named Rstud.
          • Save COOKSDPLOT in a data set named Cook.
          • Save DFFITSPLOT in a data set named Dffits.
          • Save DFBETASPANEL in a data set named Dfbs.
    12. Click Run.

    Modified Code

    ods noproctitle;
    ods graphics / imagemap=on;
    ods output RSTUDENTBYPREDICTED=Rstud 
               COOKSDPLOT=Cook
               DFFITSPLOT=Dffits 
               DFBETASPANEL=Dfbs;
    
    proc reg data=STAT1.BODYFAT2 alpha=0.05 plots(only label)=(rstudentbypredicted cooksd dffits dfbetas);
       model PctBodyFat2=Weight Abdomen Forearm Wrist /;
    run;
    quit;

    Here are the results.

    • In the RStudent by Predicted for PctBodyFat2 scatter plot, only a modest number of observations are further than two standard error units from the mean of 0.
    • In the Cook's D for PctBodyFat2 plot, there are 10 labeled outliers, but observation 39 is clearly the most extreme.
    • In the Influence Diagnostics for PctBodyFat2 plot, the same observations are shown to be influential by the DFFITS statistic.
    • In the panel plot, DFBETAS are particularly high for observation 39 on the parameters for Weight and Forearm circumference.

  2. Write a DATA step to merge the four output data sets based on the common variable, observation. Add code to subset the data to select only the observations that are potentially influential outliers. Name the output data set influential. Submit the code and use the PRINT procedure or the Table Viewer in SAS Studio to display the influential data set.

    Solution:

    /*st105s02.sas*/  /*Part B*/
    data influential;
    /*  Merge datasets from above.*/
        merge Rstud
              Cook 
              Dffits
          Dfbs;
        by observation;
    
    /*  Flag observations that exceeded at least one cutpoint.*/
        if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;
        array dfbetas{*} _dfbetasout: ;
        do i=2 to dim(dfbetas);
            if dfbetas{i} then flag=1;
        end;
    
    /*  Set influence statistics to missing for observations*/
    /*  that did not exceed cutpoints.*/
        if ABS(Rstudent)<=3 then RStudent=.;
        if Cooksdlabel eq ' ' then CooksD=.;
    
    /*  Subset only observations that have been flagged.*/
        if flag=1;
        drop i flag;
    run;
    
    proc print data=influential;
        id observation ID1;
        var Rstudent CooksD Dffitsout _dfbetasout:; 
    run;

    Here are the results.

    The same observations appear in the PROC PRINT report as in the plots.

    Examine the values of observation 39 to see what is causing problems. You might find it interesting.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 3 CODE VERSION

Practice: Using PROC REG to Assess Collinearity

Run a regression of PctBodyFat2 on all the other numeric variables in the data set stat1.bodyfat2.

  1. Write a PROC REG step to determine whether a collinearity problem exists in your model. Submit the code and view the results.

    Solution:

    /*st105s03.sas*/  /*Part A*/
    ods graphics off;
    proc reg data=STAT1.BodyFat2;
       FULLMODL: model PctBodyFat2 = 
                       Age Weight Height
                       Neck Chest Abdomen Hip Thigh
                       Knee Ankle Biceps Forearm Wrist
                       / vif;
       title 'Collinearity -- Full Model';
    run;
    quit;
    ods graphics on;

    Here are the results.

    There seems to be high collinearity among Weight, Hip, and Abdomen. Chest and Thigh fall below the cutoff but are larger than the remaining variables, whose VIF values do not exceed 5.
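    The VIF for a predictor equals 1/(1 − R²), where R² comes from regressing that predictor on all the other predictors. As an illustrative check (not part of the course solution), you could verify the Weight VIF by hand:

```sas
/* Illustrative check (not part of the solution): regress Weight on     */
/* the other inputs; if the reported R-square is r2, then the VIF for   */
/* Weight in the full model equals 1/(1 - r2).                          */
proc reg data=STAT1.BodyFat2;
   model Weight = Age Height Neck Chest Abdomen Hip Thigh
                  Knee Ankle Biceps Forearm Wrist;
   title 'R-square for computing the Weight VIF by hand';
run;
quit;
```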

  2. If there is a collinearity problem, what would you like to do about it? Will you remove any variables? Why or why not?

    Solution:

    The answer is not so easy. Weight is collinear with some of the other variables, but as you saw before in your model-building process, Weight is a relatively significant predictor in the "best" models. A subject-matter expert should determine the answer. If you want to remove Weight, simply run that model again without that variable.
    /*st105s03.sas*/  /*Part B*/
    ods graphics off;
    proc reg data=STAT1.BodyFat2;
       NOWT: model PctBodyFat2 =
                   Age Height
                   Neck Chest Abdomen Hip Thigh
                   Knee Ankle Biceps Forearm Wrist
                   / vif;
       title 'Collinearity -- No Weight';
    run;
    quit;
    
    ods graphics on;

    Here are the results.


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 3 TASK VERSION

Practice: Using the Linear Regression Task to Assess Collinearity

Run a regression of PctBodyFat2 on all the other numeric variables in the data set bodyfat2.

  1. Use the Linear Regression task to determine whether a collinearity problem exists in your model.

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the Linear Regression task.
    4. Select the stat1.bodyfat2 table.
    5. Assign PctBodyFat2 to the Dependent variable role.
    6. Assign Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist to the Continuous variables role.
    7. On the MODEL tab, use the Model Effect Builder to specify the appropriate model. Click the Edit this model icon, select all variables, and click Add. Then click OK.
    8. On the OPTIONS tab, use the drop-down list for Display statistics and select Default and selected statistics.
    9. Expand Collinearity and select Variance inflation factors.
    10. Suppress all the plots by clearing the check boxes for all the different graphic output options.
    11. Click Run.

    Here are the results.

    There seems to be high collinearity among Weight, Hip, and Abdomen. Chest and Thigh fall below the cutoff but are larger than the remaining variables, whose VIF values do not exceed 5.

  2. If there is a collinearity problem, what would you like to do about it? Will you remove any variables? Why or why not?

    Solution:

    The answer is not so easy. Weight is collinear with some of the other variables, but as you saw before in your model-building process, Weight is a relatively significant predictor in the "best" models. A subject-matter expert should determine the answer. If you want to remove Weight, modify the model using the Model Effects Builder and rerun the task.
    1. On the MODEL tab, click the Edit this model icon.
    2. Select Weight from the Model Effects list, and then click the Delete effect icon.
    3. Click OK.
    4. Click Run.

    Here are the results.


Lesson 06

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 1 CODE VERSION

Practice: Building a Predictive Model Using PROC GLMSELECT

Use the ameshousing3 data set to build a model that predicts the sale prices of homes in Ames, Iowa, that are 1500 square feet or below, based on various home characteristics.

  1. Write a PROC GLMSELECT step that predicts the values of SalePrice. Partition the stat1.ameshousing3 data set into a training data set of approximately 2/3 and a validation data set of approximately 1/3. Specify the seed 8675309. Define the Interval and Categorical macro variables as shown below, and use them to specify the inputs. Use stepwise regression as the selection method, Akaike's information criterion (AIC) to add or remove effects, and average squared error for the validation data to select the best model. Add the REF=FIRST option in the CLASS statement. Submit the code and examine the results.

    %let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
             Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
    %let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces 
             Season_Sold Garage_Type_2 Foundation_2 Heating_QC 
             Masonry_Veneer Lot_Shape_2 Central_Air;


    Solution:

    /*st106s01.sas*/
    
    %let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area 
             Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;
    %let categorical=House_Style2 Overall_Qual2 Overall_Cond2 Fireplaces 
             Season_Sold Garage_Type_2 Foundation_2 Heating_QC 
             Masonry_Veneer Lot_Shape_2 Central_Air;
    
    
    /*In this example, the data set ameshousing3 is divided into */
    /*training and validation using the PARTITION statement, */
    /*along with the SEED= option in the PROC GLMSELECT statement.*/
    proc glmselect data=STAT1.ameshousing3
                   plots=all 
                   seed=8675309;
       class &categorical / param=ref ref=first;
       model SalePrice=&categorical &interval / 
                       selection=stepwise
                       (select=aic 
                       choose=validate) hierarchy=single;
       partition fraction(validate=0.3333);
       title "Selecting the Best Model using Honest Assessment";
    run;

    Here are the results.


  2. Which model did PROC GLMSELECT choose?

    Solution:

    PROC GLMSELECT chose the model at Step 10, which has the following effects: Intercept, Basement_Area, Gr_Liv_Area, Age_Sold, Garage_Area, Overall_Cond2, Fireplaces, Overall_Qual2, House_Style2, Deck_Porch_Area, and Heating_QC.

  3. Resubmit the PROC GLMSELECT step. Do not make any changes to it. Does it produce the same results as before?

    Solution:

    The results are the same. Every time you run a specific PROC GLMSELECT step using the same seed value, the pseudo-random selection process is replicated and you get the same results.

  4. In the PROC GLMSELECT statement, change the value of SEED= and submit the modified code. Does it produce the same results as before?

    Solution:

    Because you used a different seed, the results are almost certainly different from the previous results.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 1 TASK VERSION

Practice: Building a Predictive Model Using the Predictive Regression Models Task

Use the ameshousing3 data set to build a model that predicts the sale prices of homes in Ames, Iowa, that are 1500 square feet or below, based on various home characteristics.

  1. Use the Predictive Regression Models task to partition ameshousing3 into a training data set of approximately 2/3 and a validation data set of approximately 1/3. Use 8675309 as the random seed, and Reference Coding for Parameterization of Effects. Specify SalePrice as the dependent variable, and assign the classification and continuous variables as listed below. Add all variables as model effects.

    Classification Variables    Continuous Variables
    Heating_QC                  Lot_Area
    Central_Air                 Gr_Liv_Area
    Fireplaces                  Bedroom_AbvGr
    Season_Sold                 Garage_Area
    Garage_Type_2               Basement_Area
    Foundation_2                Total_Bathroom
    Masonry_Veneer              Deck_Porch_Area
    Lot_Shape_2                 Age_Sold
    House_Style2
    Overall_Qual2
    Overall_Cond2

    Use stepwise regression as the selection method, Akaike's information criterion (AIC) to add or remove effects, and average squared error for validation data to select the best model. Produce criteria and coefficient plots. Edit the generated code to add the REF=FIRST option after PARAM=REF in the CLASS statement.



    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and select the Predictive Regression Models task.
    4. Select the stat1.ameshousing3 table.
    5. Expand Partition Data and select Validation Data.
    6. In the Identify validation or test data cases drop-down list, select Specify a sample proportion, and enter 0.3333 as the Proportion of validation cases value.
    7. Select the Specify the random seed check box, and enter 8675309 as the Random seed value.
    8. Assign SalePrice to the Dependent variable role.
    9. Assign Heating_QC, Central_Air, Fireplaces, Season_Sold, Garage_Type_2, Foundation_2, Masonry_Veneer, Lot_Shape_2, House_Style2, Overall_Qual2, and Overall_Cond2 to the Classification variables role.
    10. Assign Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold to the Continuous variables role.
    11. Expand Parameterization of Effects. In the Coding drop-down list, select Reference Coding.
    12. On the MODEL tab, select Custom Model and then click Edit to open the Model Effects Builder.
    13. Select all variables, and click Add to add them to the model. Click OK.
    14. On the SELECTION tab under MODEL SELECTION, use the Selection method drop-down list to select Stepwise regression.
    15. Use the Add/remove effects with drop-down list to select Akaike's information criterion (AIC).
    16. Use the Select best model by drop-down list to select Average square error for validation data.
    17. Expand SELECTION PLOTS and select Criteria plots and Coefficient plots.
    18. Click the EDIT button in the CODE window and add the option ref=first after the param=ref option in the CLASS statement.
    19. Run the code.

    Here are the results.

  2. Which model did PROC GLMSELECT choose?

    Solution:

    PROC GLMSELECT chose the model at Step 10, which has the following effects: Intercept, Basement_Area, Gr_Liv_Area, Age_Sold, Garage_Area, Overall_Cond2, Fireplaces, Overall_Qual2, House_Style2, Deck_Porch_Area, and Heating_QC.

  3. Resubmit the modified code. Do not make any changes to it. Does it produce the same results as before?

    Solution:

    The results are the same. Every time you run a specific PROC GLMSELECT step using the same seed value, the pseudo-random selection process is replicated and you get the same results.

  4. Return to the CODE window that contains the edited code. In the PROC GLMSELECT statement, change the value of SEED= and submit the modified code. Does it produce the same results as before?

    Solution:

    Because you used a different seed, the results are almost certainly different from the previous results.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 2 CODE VERSION

Practice: Scoring Using the SCORE Statement in PROC GLMSELECT

You want to re-create the model that was built in the previous practice (based on stat1.ameshousing3), create an item store, and then use the item store to score the new cases in stat1.ameshousing4. You'll score the data in two ways (using PROC GLMSELECT and PROC PLM) and compare the results.

  1. Open the solution program from the previous practice, st106s01.sas. There is no need to examine the results, so make the following changes to the code:
    1. Remove the PLOTS= option.
    2. Add the NOPRINT option to the PROC GLMSELECT statement.
    3. Remove the TITLE statement.

    Here's the modified code:

    proc glmselect data=STAT1.ameshousing3
                   seed=8675309
                   noprint;
       class &categorical / param=ref ref=first;
       model SalePrice=&categorical &interval / 
                   selection=stepwise
                   (select=aic 
                   choose=validate) hierarchy=single;
       partition fraction(validate=0.3333);
    run;

  2. In the PROC GLMSELECT step,
    1. Add a STORE statement to create an item store named store1, and a SCORE statement to score the data in stat1.ameshousing4.
    2. Add a PROC PLM step that uses the item store, store1, to score the data in stat1.ameshousing4.
      Note:
      Be sure to use different names for the two scored data sets.
    3. Add a PROC COMPARE step to compare the scoring results from PROC GLMSELECT and PROC PLM.
    4. Submit the code and examine the results.

    /*st106s02.sas*/
    
    proc glmselect data=STAT1.ameshousing3
                   seed=8675309
                   noprint;
       class &categorical / param=ref ref=first;
       model SalePrice=&categorical &interval / 
                   selection=stepwise
                   (select=aic 
                   choose=validate) hierarchy=single;
       partition fraction(validate=0.3333);
       score data=STAT1.ameshousing4 out=score1;
       store out=store1;
    run;
    
    proc plm restore=store1;
       score data=STAT1.ameshousing4 out=score2;
    run;
    
    proc compare base=score1 compare=score2 criterion=0.0001;
       var P_SalePrice;
       with Predicted;
    run;

    Here are the results.


  3. Does the PROC COMPARE output indicate any differences between the predictions produced by the two scoring methods?

    The two scoring methods produce the same predictions. 

    Note: Depending on the version of SAS and SAS/STAT that you are using, your results might look somewhat different from the output shown here. However, the results should indicate that these data sets do not differ.
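    If you want to verify what the item store contains before scoring with it (an optional check, not part of the solution), PROC PLM's SHOW statement can display the stored model information; this uses the store1 item store created above.

```sas
/* Optional check: display the parameter estimates saved in the item    */
/* store, which is useful when scoring happens long after model fitting.*/
proc plm restore=store1;
   show parameters;
run;
```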

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 2 TASK VERSION

Practice: Using the SCORE Statement in PROC GLMSELECT

Re-create the model that was built in the previous practice with a few changes. Create an item store, and then use the item store to score the new cases in ameshousing4. You'll use code to score the data in two different ways (using PROC GLMSELECT and PROC PLM) and compare the results.

  1. Build the model.
    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and select the Predictive Regression Models task.
    4. On the DATA tab, select the stat1.ameshousing3 table.
    5. Expand Partition Data and select Validation Data.
    6. In the Identify validation or test data cases drop-down list, select Specify a sample proportion, and enter 0.3333 as the Proportion of validation cases value.
    7. Select the Specify the random seed check box, and enter 8675309 as the Random seed value.
    8. Assign SalePrice to the Dependent variable role.
    9. Assign the classification and continuous variables as listed below.

      Classification Variables    Continuous Variables
      Heating_QC                  Lot_Area
      Central_Air                 Gr_Liv_Area
      Fireplaces                  Bedroom_AbvGr
      Season_Sold                 Garage_Area
      Garage_Type_2               Basement_Area
      Foundation_2                Total_Bathroom
      Masonry_Veneer              Deck_Porch_Area
      Lot_Shape_2                 Age_Sold
      House_Style2
      Overall_Qual2
      Overall_Cond2

    10. Expand Parameterization of Effects. In the Coding drop-down list, select Reference Coding.
    11. On the MODEL tab, select Custom Model and then click Edit to open the Model Effects Builder.
    12. Select all the variables, and click Add to add them to the model. Click OK.
    13. On the SELECTION tab under MODEL SELECTION, use the Selection method drop-down list to select Stepwise regression.
    14. Use the Add/remove effects with drop-down list to select Akaike's information criterion (AIC).
    15. In the Select best model by drop-down list, select Average square error for validation data.
    16. Expand SELECTION PLOTS and select Criteria plots and Coefficient plots.
    17. On the SCORING tab, select Save scored data and enter the name score1 for the output data set. It will be created in the Work library.
      Notice the Save scoring code check box. You won't use the scoring code in this practice, but you could select this check box and browse to a location to store the file. This creates a file of scoring code that you can include in a DATA step to score new data.
    18. Click the EDIT button in the CODE window and make the following changes:
      1. Add the option ref=first after the param=ref option in the CLASS statement.
      2. Add data=stat1.ameshousing4 in the SCORE statement.
      3. Add a STORE statement to create an item store named store1 in the Work library.
      4. You don't need to examine the results, so you can remove the ODS statement and add the NOPRINT option in the PROC GLMSELECT statement.
    19. Submit the code. Check the log to verify that the item store, work.store1, was created, and that the data set, work.score1, was created with 300 observations and 34 variables.

    Generated Code

    ods noproctitle;
    
    proc glmselect data=STAT1.AMESHOUSING3 plots=(criterionpanel coefficientpanel) 
    		seed=8675309 noprint;
    	 partition fraction(validate=0.3333);
    	 class Heating_QC Central_Air Fireplaces Season_Sold Garage_Type_2 Foundation_2 
    		Masonry_Veneer Lot_Shape_2 House_Style2 Overall_Qual2 Overall_Cond2 / 
    		param=ref ref=first;
    	 model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area Basement_Area 
    		Total_Bathroom Overall_Qual Deck_Porch_Area Age_Sold Heating_QC Central_Air 
    		Fireplaces Season_Sold Garage_Type_2 Foundation_2 Masonry_Veneer Lot_Shape_2 
    		House_Style2 Overall_Qual2 Overall_Cond2 / selection=stepwise
             (select=aic) hierarchy=single;
    	 score out=WORK.score1 predicted residual data=stat1.ameshousing4;
    	 code;
    	 store out=work.store1;
    run;
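    The PARTITION statement reserves a random subset of rows for validation, and SEED= makes the split reproducible. The Python sketch below only illustrates that idea; it is not SAS's sampling algorithm, and the split function and row count are hypothetical.

```python
import random

# Illustration of PARTITION FRACTION(VALIDATE=0.3333) with SEED=8675309.
# This mimics a seeded random holdout; it is NOT SAS's algorithm, and the
# row count is hypothetical.
def split(n_rows, validate_fraction, seed):
    """Assign each row a 'validate' or 'train' role, reproducibly."""
    rng = random.Random(seed)
    return ["validate" if rng.random() < validate_fraction else "train"
            for _ in range(n_rows)]

roles = split(300, 0.3333, 8675309)
print(roles.count("validate"), "of", len(roles), "rows held out for validation")
```

    Because the generator is seeded, rerunning the split assigns every row the same role, which is why a recorded seed makes selection results repeatable.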

  2. Write a PROC PLM step to process the store1 item store. Score the data in ameshousing4. Create an output data set named score2. Submit the code, and check the log to verify that work.score2 was created with 300 observations and 33 variables.

    Solution:

    proc plm restore=store1;
       score data=STAT1.ameshousing4 out=score2;
    run;

    Here are the results.


  3. Write a PROC COMPARE step to compare the scoring results from PROC GLMSELECT and PROC PLM. Use a criterion of 0.0001 for the comparison.

    Solution:

    proc compare base=score1 compare=score2 criterion=0.0001;
       var P_SalePrice;
       with Predicted;
    run;
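    The CRITERION= option makes PROC COMPARE treat two values as equal when they differ by no more than the criterion. As a rough sketch of the idea (not SAS's exact comparison method; the helper function and the prediction values below are invented):

```python
# Rough sketch of a tolerance comparison in the spirit of PROC COMPARE's
# CRITERION= option (not SAS's exact method). The helper and the prediction
# values are invented for illustration.
def within_criterion(base, compare, criterion=0.0001):
    """True when every paired value agrees within the relative criterion."""
    return all(
        abs(b - c) <= criterion * max(abs(b), abs(c), 1.0)
        for b, c in zip(base, compare)
    )

p_saleprice = [137250.12, 98500.47, 210034.88]   # hypothetical GLMSELECT scores
predicted   = [137250.12, 98500.47, 210034.88]   # hypothetical PLM scores
print(within_criterion(p_saleprice, predicted))  # identical values agree
```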

  4. Submit the PROC COMPARE step and examine the results. Does the PROC COMPARE output indicate any differences between the predictions produced by the two scoring methods?

    Solution:

    Here are the results. As shown in this output, the two scoring methods produce the same predictions. 

    Note: Depending on the version of SAS and SAS/STAT that you are using, your results might look somewhat different from the output shown here. However, the results should indicate that these data sets do not differ.

Lesson 07

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 1 CODE VERSION

Practice: Using PROC FREQ to Examine Distributions

An insurance company wants to relate the safety of vehicles to several other variables. A score was given to each vehicle model, using the frequency of insurance claims as a basis. The stat1.safety data set contains the data about vehicle safety.

  1. Use PROC FREQ to create one-way frequency tables for the categorical variables Unsafe, Type, Region, and Size. Submit the code and view the results.

    Solution:

    /*st107s01.sas*/  /*Part A*/
    ods graphics off;
    proc freq data=STAT1.safety;
       tables Unsafe Type Region Size;
       title "Safety Data Frequencies";
    run;
    ods graphics on;

    Here are the results.


  2. What is the measurement scale of each of the four variables?

    Solution:

    Unsafe - Nominal, Binary
    Type - Nominal
    Region - Nominal
    Size - Ordinal
    Weight - Continuous

  3. Do the variables Unsafe, Type, Region, and Size have any unusual values that warrant further investigation?

    Solution:

    No

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 1 TASK VERSION

Practice: Using the One-Way Frequencies Task to Examine Distributions

An insurance company wants to relate the safety of vehicles to several other variables. A score was given to each vehicle model, using the frequency of insurance claims as a basis. The safety data set contains the data about vehicle safety.

  1. Use the One-Way Frequencies task to create one-way frequency tables for the categorical variables Unsafe, Type, Region, and Size.

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the One-Way Frequencies task.
    4. Select the stat1.safety table.
    5. Assign Unsafe, Type, Region, and Size to the Analysis variables role.
    6. On the OPTIONS tab, expand PLOTS and select the Suppress plots check box.
    7. Click Run.

    Here are the results.


  2. What is the measurement scale of each of the four variables?

    Solution:

    Unsafe - Nominal, Binary
    Type - Nominal
    Region - Nominal
    Size - Ordinal
    Weight - Continuous

  3. Do the variables Unsafe, Type, Region, and Size have any unusual values that warrant further investigation?

    Solution:

    No

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 2 CODE VERSION

Practice: Using PROC FREQ to Perform Tests and Measures of Association

The insurance company wants to determine whether a vehicle's safety score is associated with either the region in which it was manufactured or the vehicle's size. The stat1.safety data set contains the data about vehicle safety.

  1. Use PROC FREQ to create the crosstabulation of the variables Region by Unsafe. Along with the default output, generate the expected frequencies, the chi-square test of association, and the odds ratio. To clearly identify the values of Unsafe, create and apply a temporary format. Submit the code and view the results.

    Solution:

    /*st107s01.sas*/  /*Part B*/
    proc format; 
       value safefmt 0='Average or Above'
                     1='Below Average';
    run;
    
    proc freq data=STAT1.safety;
       tables Region*Unsafe / expected chisq relrisk;
       format Unsafe safefmt.;
       title "Association between Unsafe and Region";
    run;

    Here are the results.


  2. For the cars made in Asia, what percentage had a Below Average safety score?

    Solution:

    Region is a row variable, so look at the Row Pct value in the Below Average cell of the Asia row. Of the cars made in Asia, 42.86% have a Below Average safety score.

  3. For the cars with an Average or Above safety score, what percentage was made in North America?

    Solution:

    Look at the Col Pct value in the Average or Above cell of the N America row. Of the cars with an Average or Above safety score, 69.70% were made in North America.

  4. Do you see a statistically significant (at the 0.05 level) association between Region and Unsafe?

    Solution:

    The association is not statistically significant at the 0.05 alpha level. The p-value is 0.0631.

  5. What does the odds ratio compare? What does this suggest about the difference in odds between Asian and North American cars?

    Solution:

    The odds ratio compares the odds of Below Average safety for North America versus Asia. The odds ratio of 0.4348 means that cars made in North America have 56.52% lower odds of being unsafe than cars made in Asia.

    Note: Recall that the odds ratios in the Estimates of Relative Risk table are calculated by comparing row1/row2 for column1. In this problem, the comparison is Asia to N America, and the outcome is Average or Above safety. The value 0.4348 means that the odds of an Average or Above safety score for a car made in Asia are 0.4348 times the odds for a car made in North America. To compare N America to Asia, still for Average or Above safety, invert the odds ratio: 1/0.4348, or approximately 2.3. That is, cars made in North America have 2.3 times the odds of being safe compared with cars made in Asia. A single inversion also gives the odds ratio for comparing Asia to N America for Below Average safety. To compare N America to Asia for Below Average safety, you invert the odds ratio twice and return to the value 0.4348.

  6. Write another PROC FREQ step to create the crosstabulation of the variables Size and Unsafe. Along with the default output, generate the measures of ordinal association. Format the values of Unsafe. Submit the code and view the results.

    Solution:

    /*st107s01.sas*/  /*Part C*/
    proc freq data=STAT1.safety;
       tables Size*Unsafe / chisq measures cl;
       format Unsafe safefmt.;
       title "Association between Unsafe and Size";
    run;

    Here are the results.


  7. What statistic do you use to detect an ordinal association between Size and Unsafe?

    Solution:

    The Mantel-Haenszel chi-square detects an ordinal association.

  8. Do you reject or fail to reject the null hypothesis at the 0.05 level?

    Solution:

    You reject the null hypothesis at the 0.05 level.

  9. What is the strength of the ordinal association between Size and Unsafe?

    Solution:

    The Spearman Correlation is -0.5425.

  10. What is the 95% confidence interval around the statistic that measures the strength of the ordinal association?

    Solution:

    The confidence interval is (-0.6932, -0.3917).
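    For reference, the Spearman correlation is simply the Pearson correlation computed on ranks, which is why it measures monotonic (ordinal) association. The sketch below uses invented ordinal data, not the safety data; the ranks, pearson, and spearman helpers are illustrative only.

```python
# Spearman rank correlation = Pearson correlation of the ranks.
# The two small samples below are invented to illustrate the computation.
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

size   = [1, 1, 2, 2, 3, 3]   # hypothetical ordinal predictor
unsafe = [1, 1, 1, 0, 0, 0]   # hypothetical binary outcome

print(round(spearman(size, unsafe), 4))  # negative: larger size, safer
```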


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 3 CODE VERSION

Practice: Using PROC LOGISTIC to Perform a Binary Logistic Regression Analysis

The insurance company wants to characterize the relationship between a vehicle's weight and its safety rating. The stat1.safety data set contains the data about vehicle safety.

  1. Use PROC LOGISTIC to fit a simple logistic regression model with Unsafe as the response variable and Weight as the predictor variable. Use the EVENT= option to model the probability of Below Average safety scores. Request profile likelihood confidence limits, an odds ratio plot, and an effect plot. Submit the code and view the results.

    Solution:

    /*st107s02.sas*/
    ods graphics on;
    proc logistic data=STAT1.safety plots(only)=(effect oddsratio);
       model Unsafe(event='1')=Weight / clodds=pl;
       title 'LOGISTIC MODEL (1):Unsafe=Weight';
    run;

    Here are the results.


  2. Do you reject or fail to reject the null hypothesis that all regression coefficients of the model are 0?

    Solution:

    The p-value for the Likelihood Ratio test is <.0001, and therefore, the global null hypothesis is rejected.

  3. Write the logistic regression equation.

    Solution:

    The regression equation is as follows:
    Logit(Unsafe) = 3.5422 + (-1.3901) * Weight
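    The fitted equation is on the log-odds (logit) scale. Inverting the logit gives a probability, and exponentiating the slope reproduces the odds ratio that PROC LOGISTIC reports for Weight. A quick cross-check using the estimates above (the example weight is hypothetical):

```python
import math

# Cross-check of the fitted equation Logit(Unsafe) = 3.5422 - 1.3901*Weight.
# The estimates come from the solution output; the example weight is made up.
intercept, slope = 3.5422, -1.3901

def prob_unsafe(weight):
    """Invert the logit to get the probability of a Below Average rating."""
    logit = intercept + slope * weight
    return 1 / (1 + math.exp(-logit))

print(round(math.exp(slope), 3))   # odds ratio per 1-unit (1,000 lb) increase
print(round(prob_unsafe(2.0), 4))  # probability for a hypothetical 2,000-lb car
```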

  4. Interpret the odds ratio for Weight.

    Solution:

    The odds ratio for Weight (0.249) says that the odds for being unsafe (having a Below Average safety rating) are 75.1% lower for each thousand-pound increase in weight.

    The confidence interval (0.102, 0.517) does not contain 1, which indicates that the odds ratio is statistically significant.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 3 TASK VERSION

Practice: Using the Binary Logistic Regression Task to Perform a Binary Logistic Regression Analysis

The insurance company wants to characterize the relationship between a vehicle's weight and its safety rating. The safety data set contains the data about vehicle safety.

  1. Use the Binary Logistic Regression task to fit a simple logistic regression model with Unsafe as the response variable and Weight as the predictor variable. Use the EVENT= option to model the probability of Below Average safety scores. Request profile likelihood confidence limits, odds ratio plot, and an effect plot.

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the Binary Logistic Regression task.
    4. Select the stat1.safety table.
    5. Assign Unsafe to the Response role, and use the Event of interest drop-down list to specify 1.
    6. Assign Weight to the Continuous variables role.
    7. On the MODEL tab, verify that Main effects model is selected.
    8. On the OPTIONS tab, in the Select statistics to display drop-down list, select Default and additional statistics.
    9. Expand the Parameter Estimates property. In the Confidence intervals for odds ratios drop-down list, select Based on profile likelihood.
    10. Expand PLOTS, and in the Select plots to display drop-down list, select Default and additional plots.
    11. Select Effect plot and Odds ratio plot.
    12. Click Run.

    Here are the results.


  2. Do you reject or fail to reject the null hypothesis that all regression coefficients of the model are 0?

    Solution:

    The p-value for the Likelihood Ratio test is <.0001, and therefore, the global null hypothesis is rejected.

  3. Write the logistic regression equation.

    Solution:

    The regression equation is as follows:
    Logit(Unsafe) = 3.5422 + (-1.3901) * Weight

  4. Interpret the odds ratio for Weight.

    Solution:

    The odds ratio for Weight (0.249) says that the odds for being unsafe (having a Below Average safety rating) are 75.1% lower for each thousand-pound increase in weight.

    The confidence interval (0.102 , 0.517) does not contain 1, which indicates that the odds ratio is statistically significant.

Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 4 CODE VERSION

Practice: Using PROC LOGISTIC to Perform a Multiple Logistic Regression Analysis with Categorical Variables

The insurance company wants to model the relationship between three of a car's characteristics, weight, size, and region of manufacture, and its safety rating. The stat1.safety data set contains the data about vehicle safety.

  1. Use PROC LOGISTIC to fit a multiple logistic regression model with Unsafe as the response variable and Weight, Size, and Region as the predictor variables.
    1. Use the EVENT= option to model the probability of Below Average safety scores.
    2. Specify Region and Size as classification variables and use reference cell coding. Specify Asia as the reference level for Region, and 3 (large cars) as the reference level for Size.
    3. Request profile likelihood confidence limits, an odds ratio plot, and the effect plot.
    4. Submit the code and view the results.


    Solution:

    /*st107s03.sas*/
    ods graphics on;
    proc logistic data=STAT1.safety plots(only)=(effect oddsratio);
       class Region (param=ref ref='Asia')
             Size (param=ref ref='3');
       model Unsafe(event='1')=Weight Region Size / clodds=pl;
       title 'LOGISTIC MODEL (2):Unsafe=Weight Region Size';
    run;

    Here are the results.


  2. Do you reject or fail to reject the null hypothesis that all regression coefficients of the model are 0?

    Solution:

    The p-value for the Likelihood Ratio test is <.0001, and therefore, you reject the null hypothesis.

  3. If you reject the global null hypothesis, then which predictors significantly predict safety outcome?

    Solution:

    Only Size is significantly predictive of Unsafe.

  4. Interpret the odds ratio for significant predictors.

    Solution:

    Only Size is significant. The design variables show that Size=1 (Small or Sports) cars have 14.560 times the odds of having a Below Average safety rating compared to the reference category 3 (Large or Sport/Utility). The 95% confidence interval (3.018, 110.732) does not contain 1, implying that the contrast is statistically significant at the 0.05 level.

    The contrast from the second design variable is 1.931 (Medium versus Sport/Utility), implying a trend toward greater odds of low safety as size decreases. However, the 95% confidence interval (0.343, 15.182) contains 1, and therefore, the contrast is not statistically significant.
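    As a reading aid, an odds ratio OR corresponds to a percent change in odds of 100 * (OR - 1); values below 1 read as a reduction. Checking the numbers quoted in this and the earlier practice:

```python
# Reading an odds ratio as a percent change in odds: 100 * (OR - 1).
# The two ratios below are quoted from the practice solutions.
def pct_change_in_odds(odds_ratio):
    return 100 * (odds_ratio - 1)

print(round(pct_change_in_odds(14.560), 2))  # Size=1 vs Size=3: +1356% odds
print(round(pct_change_in_odds(0.4348), 2))  # N America vs Asia: -56.52% odds
```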


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 CODE VERSION

Practice: Using PROC LOGISTIC to Perform Backward Elimination; Using PROC PLM to Generate Predictions

The insurance company wants to model the relationship between three of a car's characteristics, weight, size, and region of manufacture, and its safety rating. Run PROC LOGISTIC and use backward elimination. Start with a model using only main effects. The stat1.safety data set contains the data about vehicle safety.

  1. Use PROC LOGISTIC to fit a multiple logistic regression model with Unsafe as the response variable and Weight, Size, and Region as the predictor variables.
    1. Use the EVENT= option to model the probability of Below Average safety scores.
    2. Apply the SIZEFMT. format to the variable Size.
    3. Specify Region and Size as classification variables and use reference cell coding. Specify Asia as the reference level for Region, and 1 (small cars) as the reference level for Size.
    4. Add a UNITS statement with -1 as the unit for Weight so that you can see the odds ratio for lighter cars over heavier cars.
    5. Add a STORE statement to save the analysis results as isSafe.
    6. Request any relevant plots.
    7. Submit the code and view the results.

    Solution:

    /*st107s04.sas*/
    ods graphics on;
    proc logistic data=STAT1.safety plots(only)=(effect oddsratio);
       class Region (param=ref ref='Asia')
             Size (param=ref ref='Small');
       model Unsafe(event='1') = Weight Region Size / clodds=pl selection=backward;
       units Weight = -1;
       store isSafe;
       format Size sizefmt.;
       title 'Logistic Model: Backwards Elimination';
    run;

    Notice that the reference level for Size is set to 'Small' in the solution, rather than '1'. When a format is applied to a CLASS statement variable, the reference level option should refer to the formatted value and not the internal value.

    Here are the results.


  2. Which terms appear in the final model?

    Solution:

    Only Size appears in the final model.

  3. If you compare these results with those from the previous practice (a model fit with only one variable, Region), do you think that this is a better model?

    Solution:

    Comparing the model fit statistics, you see that the AIC (92.629) and SC (100.322) of the model chosen by backward elimination are both smaller than those of the Region-only model (119.854 and 124.982, respectively). This indicates that the Size-only model does better than the Region-only model.

    Using the c statistic, you can also see improvement over the Region-only model: 0.818 in this model compared with 0.598 in the previous model.

  4. Using the final model that was chosen by backward elimination, and using the STORE statement, generate predictive probabilities for the cars in the following DATA step:
    data checkSafety;
       length Region $9.;
       input Weight Size Region $ 5-13;
       datalines;
       4 1 N America
       3 1 Asia     
       5 3 Asia     
       5 2 N America
    ;
    run;

    Solution:

    proc plm restore=isSafe;
       score data=checkSafety out=scored_cars / ILINK;
       title 'Safety Predictions using PROC PLM';
    run;
    
    proc print data=scored_cars;
    run;

    Here are the results.
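    The ILINK option tells PROC PLM to report predictions on the probability scale, that is, after applying the inverse link 1/(1 + exp(-eta)) to the linear predictor eta. Conceptually, scoring replays the stored coefficients against the new rows. In the sketch below, both the coefficients and the reference-coded scoring logic are made up for illustration; they are not the fitted isSafe estimates.

```python
import math

# Toy stand-in for scoring from an item store. The coefficients are MADE UP
# for illustration; they are not the fitted isSafe estimates. The model
# mimics a Size-only logistic fit with reference coding, Size=1 ('Small')
# as the reference level, as in the practice.
coef = {"Intercept": 0.5, "Size_2": -1.0, "Size_3": -2.5}

def score(size):
    """Linear predictor from the stored coefficients, then the inverse link."""
    eta = coef["Intercept"]
    if size == 2:
        eta += coef["Size_2"]
    elif size == 3:
        eta += coef["Size_3"]
    return 1 / (1 + math.exp(-eta))   # what ILINK reports: a probability

for size in (1, 2, 3):
    print("Size", size, "Predicted", round(score(size), 4))
```

    Without ILINK, the Predicted values would be eta itself (on the logit scale); with ILINK, they are probabilities.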


Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 TASK VERSION

Practice: Using the Binary Logistic Regression Task to Perform Backward Elimination; Using PROC PLM to Generate Predictions

The insurance company wants to model the relationship between three of a car's characteristics, weight, size, and region of manufacture, and its safety rating. Use the Binary Logistic Regression task and backward elimination. Start with a model using only main effects. The safety data set contains the data about vehicle safety.

  1. Use the Binary Logistic Regression task to fit a multiple logistic regression model with Unsafe as the response variable and Weight, Size, and Region as the predictor variables.
    1. Use the EVENT= option to model the probability of Below Average safety scores.
    2. Specify Size and Region as classification variables and use reference cell coding. Specify 1 (small cars) as the reference level for Size and Asia as the reference level for Region.
    3. Add a UNITS statement with -1 as the unit for Weight so that you can see the odds ratio for lighter cars over heavier cars.
    4. Add a STORE statement to save the analysis results as isSafe.
    5. Request any relevant plots.
    6. Run the task and view the results.

    Solution:

    1. In the Navigation pane, select Tasks and Utilities.
    2. Expand Tasks.
    3. Expand Statistics and open the Binary Logistic Regression task.
    4. Select the stat1.safety table.
    5. Assign Unsafe to the Response role, and use the Event of interest drop-down list to specify 1.
    6. Assign Size and Region to the Classification variables role.
    7. Expand the Parameterization of Effects property and use the Coding drop-down list to select Reference coding.
    8. Assign Weight to the Continuous variables role.
    9. On the MODEL tab, verify that Main effects model is selected.
    10. On the SELECTION tab, use the Selection method drop-down list to choose Backward elimination.
    11. On the OPTIONS tab, in the Select statistics to display drop-down list, select Default and additional statistics.
    12. Expand the Parameter Estimates property. In the Confidence intervals for odds ratios drop-down list, select Based on profile likelihood.
    13. Expand PLOTS, and in the Select plots to display drop-down list, select Default and additional plots.
    14. Select Effect plot and Odds ratio plot.
    15. Modify the code to specify specific levels of each class variable to use as reference levels. On the CODE tab, click the Edit SAS code icon.  
    16. In the CLASS statement, add the options (REF='1') immediately after Size and (REF='Asia') immediately after Region.
    17. Add the statement units Weight= -1; after the MODEL statement.
    18. Add the statement store isSafe; after the UNITS statement.
    19. Click Run.

    Here are the results.


  2. Which terms appear in the final model?

    Solution:

    Only Size appears in the final model.

  3. If you compare these results with those from the previous practice (a model fit with only one variable, Region), do you think that this is a better model?

    Solution:

    Comparing the model fit statistics, you see that the AIC (92.629) and SC (100.322) of the model chosen by backward elimination are both smaller than those of the Region-only model (119.854 and 124.982, respectively). This indicates that the Size-only model does better than the Region-only model.

    Using the c statistic, you can also see improvement over the Region-only model: 0.818 in this model compared with 0.598 in the previous model.

  4. Using the final model that was chosen by backward elimination, and using the STORE statement, generate predictive probabilities for the cars in the following DATA step:
    data checkSafety;
       length Region $9.;
       input Weight Size Region $ 5-13;
       datalines;
       4 1 N America
       3 1 Asia     
       5 3 Asia     
       5 2 N America
    ;
    run;

    Solution:

    proc plm restore=isSafe;
       score data=checkSafety out=scored_cars / ILINK;
       title 'Safety Predictions using PROC PLM';
    run;
    
    proc print data=scored_cars;
    run;

    Here are the results.