Lesson 00, Section 0 Performing Demos and Practices
To perform the demonstrations and practices, you can write and submit code in SAS Studio, SAS Enterprise Guide, or the SAS Windowing environment, or you can use SAS Studio tasks. In addition to Base SAS software, you must have SAS/STAT.
In the course demos, we submit code in the SAS Studio programming environment, but you can view step-by-step task instructions by clicking the Task Version button below the video. The task-generated code and results might differ from those shown in the video, so the generated code is included for verification.
You can also write code or use SAS Studio tasks to complete the practices. Just select the Open Code Version or the Open Task Version button on the Practice page.
All task steps were written for SAS Studio 3.7. If you are using your own software, you can use the Downloads page to upgrade your SAS Studio Single-User Edition to the latest release, including hot fixes. If you don't have your own software, you can use SAS OnDemand for Academics to access SAS Studio free of charge.
Program Files
The course files are divided among four folders. The ECST142 folder contains SAS programs that set up the course environment and data, plus three subfolders:
- data - stores the SAS data sets needed to run the demos and practices.
- demos - contains the demo program files.
- solutions - contains the solution program files.
Demo and solution program names are in the form st1XXyZZ.sas, where st1 is the course code, XX is a two-digit lesson number, y is either d for demo or s for solution, and ZZ is a two-digit number that uniquely identifies the file. For simplicity, the demo and solution files are each numbered sequentially within a lesson. For example, st102d03.sas is the third demo in lesson 2, and st104s02.sas is the solution to the second practice in lesson 4. The filename is included in a comment at the top of each program.
SAS Syntax
Partial SAS syntax is displayed and explained in the demos. To view detailed syntax from SAS Studio, click the Help button near the top right and select SAS Product Documentation. You can also navigate to support.sas.com/documentation.
Exploring SAS Data Sets in SAS Studio
You can use the table viewer in SAS Studio to explore SAS data sets.
- Open SAS Studio. In the navigation pane on the left, double-click Libraries. Expand My Libraries and then expand the SASHELP library.
- Open the CARS data set by double-clicking it or by dragging it to the work area on the right. The data set opens in the table viewer. By default all of the columns and the first 100 rows are displayed. You can use the arrows above the table (top right) to page forward and backward through the rows.
- Clear the Select all checkbox in the Columns area of the table viewer. No columns are displayed. Select the Make, Model, and Type checkboxes. The corresponding columns are displayed.
- Select Make in the column list. The column properties are displayed below the list.
- Close the table tab.
You can also use SAS Studio tasks to explore your data. The List Data task displays the rows of a SAS data set, and the List Table Attributes task displays its metadata including the number of columns and rows, and the name, type and length of each column.
Exploring SAS Data Sets Programmatically
You can use the PRINT procedure to display the rows of a SAS data set, and the CONTENTS procedure to display its metadata.
In the PROC PRINT step, you use the DATA= option to name the input data set and the VAR statement to select the columns to display. By default, SAS lists all columns and all rows, but you can use the OBS=n data set option to limit the display to the first n rows. In the PROC CONTENTS step, you use the DATA= option to name the input data set.
Sample Code

proc print data=sashelp.cars(obs=100);
   var Make Model Type MSRP;
run;

proc contents data=sashelp.cars;
run;
Lesson 00, Section 0 Demo: Exploring Ames Housing Data
Filename: st101d01.sas
In this demonstration, we use the FREQ and UNIVARIATE procedures to explore the Ameshousing3 table, generating graphics and summary statistics to learn more about the data.

PROC FREQ DATA=SAS-data-set;
   TABLES table-request(s) </ options>;
   <additional statements>
RUN;

PROC UNIVARIATE DATA=SAS-data-set </ options>;
   VAR variables;
   HISTOGRAM variables </ options>;
   INSET keywords </ options>;
RUN;
- Select Libraries, My Libraries, and expand the STAT1 library. Double-click AMESHOUSING3, which contains a random sample of 300 homes from the original data. We use this table in most of our analyses, so we'll explore some of the variables.
- Select Column labels in the View field to display more descriptive labels. The categorical variables include Style of dwelling (such as 1Story or 2Story), Original construction year, Number of fireplaces, Foundation type (such as Concrete/Slab or Cinder Block), and whether or not the home has masonry veneer. The continuous variables include Lot size in square feet, Above ground living area in square feet, Sale price in dollars, Basement area in square feet, Number of full bathrooms and half bathrooms, and Age of house when sold, in years.
- Open program st101d01.sas.
/*st101d01.sas*/
/*Part A*/
/*Exploration of all variables that are available for analysis.*/
/*%let statements define macro variables containing lists of */
/*dataset variables*/
%let categorical=House_Style Overall_Qual Overall_Cond Year_Built
     Fireplaces Mo_Sold Yr_Sold Garage_Type_2 Foundation_2
     Heating_QC Masonry_Veneer Lot_Shape_2 Central_Air;
%let interval=SalePrice Log_Price Gr_Liv_Area Basement_Area
     Garage_Area Deck_Porch_Area Lot_Area Age_Sold Bedroom_AbvGr
     Full_Bathroom Half_Bathroom Total_Bathroom;

/*st101d01.sas*/
/*Part B*/
/*PROC FREQ is used with categorical variables*/
ods graphics;
proc freq data=STAT1.ameshousing3;
   tables &categorical / plots=freqplot;
   format House_Style $House_Style.
          Overall_Qual Overall.
          Overall_Cond Overall.
          Heating_QC $Heating_QC.
          Central_Air $NoYes.
          Masonry_Veneer $NoYes.;
   title "Categorical Variable Frequency Analysis";
run;

/*st101d01.sas*/
/*Part C*/
/*PROC UNIVARIATE provides summary statistics and plots for */
/*interval variables. The ODS statement specifies that only */
/*the histogram be displayed. The INSET statement requests */
/*summary statistics without having to print out tables.*/
ods select histogram;
proc univariate data=STAT1.ameshousing3 noprint;
   var &interval;
   histogram &interval / normal kernel;
   inset n mean std / position=ne;
   title "Interval Variable Distribution Analysis";
run;
title;
Part A defines macro variables to help organize the data set variables and make modifying the SAS code easier. The %LET statements are used to name the macro variables and set their values. The first %LET statement creates a macro variable named categorical, and assigns it a space-delimited list of the categorical variables in the table. The next %LET statement creates the macro variable named interval, and assigns it the names of all the interval variables. Now, instead of typing a long list of variable names over and over in your programs, you can simply reference macro variable values by placing an ampersand in front of the macro variable name.
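To see how this works in isolation, here's a minimal sketch that defines a macro variable and references it with an ampersand. It uses the SASHELP.CARS table rather than the course data, and the macro variable name vars is hypothetical:

/* Minimal sketch: define a macro variable and reference it with & */
%let vars=Make Model Type;       /* space-delimited variable list */

proc print data=sashelp.cars(obs=5);
   var &vars;                    /* resolves to: Make Model Type */
run;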
In Part B, the PROC FREQ step uses the ameshousing3 data to generate frequency tables and plots summarizing the categorical variables. In PROC FREQ, you list the analysis variables in the TABLES statement. When you submit the step, the macro variable reference &categorical is replaced with the macro variable's value, the list of categorical variables. The PLOTS= option requests a frequency plot, and the FORMAT statement formats and groups the data before they are analyzed.
- Submit the code in Parts A and B and check the log to verify that there are no error or warning messages.
- Review the output.
The Table House_Style, One-Way Frequencies table indicates that of the 300 homes in our sample, almost 200 are one-story homes. There are few homes with other styles, and only six observations with the house style 2nd level unfinished. That's too few members to analyze, so they'll be merged with the One story and Two story levels in the variable House_Style2.
In the Tables Overall_Qual and Overall_Cond, the One-Way Frequencies and Distribution Plots indicate that the variables representing the overall quality and overall condition of the homes are predominantly average. Both variables have many levels with small frequencies. For example, there's only one home each with an overall quality of 1, the poorest level, and 9, the best level. We'll trichotomize these two variables into Below Average, Average, and Above Average in the variables Overall_Qual2 (overall quality 2) and Overall_Cond2 (overall condition 2).
In the Table Year_Built, the One-Way Frequencies and Distribution Plots indicate that Year_Built ranges from 1875 to 2009 and has more values than is practical to treat as a categorical variable in a statistical model with only 300 observations, so we'll treat it as interval.
In the Table Fireplaces, the One-Way Frequencies and Distribution Plots indicate that 195 homes have no fireplace, 93 have a single fireplace, and 12 homes have two fireplaces. Because the number of fireplaces has a natural ordering, we can treat Fireplaces as an ordinal variable.
In the Table Mo_Sold, the One-Way Frequencies and Distribution Plots indicate that the variable representing month sold shows a clear trend toward sales in the summer months, July and June. Some months have small numbers, so instead of analyzing by month, we created Season_Sold to use in subsequent analyses. Season 1 is from month 12 to month 2; season 2 is from month 3 to month 5; season 3 is from month 6 to month 8; and season 4 is from month 9 to month 11.
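The course data already contain Season_Sold, but as a rough sketch, the recoding from Mo_Sold could be expressed in a DATA step like the following (the data set name work.ames_season and variable Season_Sold2 are hypothetical, used only to illustrate the mapping):

/* Sketch: derive a season from the month sold */
data work.ames_season;
   set STAT1.ameshousing3;
   select (Mo_Sold);
      when (12, 1, 2)  Season_Sold2=1;   /* winter */
      when (3, 4, 5)   Season_Sold2=2;   /* spring */
      when (6, 7, 8)   Season_Sold2=3;   /* summer */
      when (9, 10, 11) Season_Sold2=4;   /* fall   */
      otherwise        Season_Sold2=.;   /* missing month */
   end;
run;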
In the Table Yr_Sold, the One-Way Frequencies and Distribution Plots indicate that Yr_Sold is fairly uniform, meaning there were a similar number of homes sold each year between 2006 and 2010.
In the Table Garage_Type_2, the One-Way Frequencies and Distribution Plots indicate that the Garage_Type_2 variable shows that 159 homes have an attached garage, 109 have a detached garage, and 29 homes do not have a garage (represented by NA). The table also states that there are three homes with missing information.
In the Table Foundation_2, the One-Way Frequencies and Distribution Plots show that most homes are on cinder block foundations, followed by concrete and then brick tile or stone.
In the Table Heating_QC, the One-Way Frequencies and Distribution Plots show that there are four levels of heating quality: excellent, fair, good, and average. Fortunately, most homes have excellent or average heating quality.
In the Table Masonry_Veneer, the One-Way Frequencies and Distribution Plots show that most homes do not have masonry veneer; only 89 of the 300 do.
In the Table Lot_Shape_2, the One-Way Frequencies and Distribution Plots show that most homes have a regular lot shape, and a majority have central air.
- Use PROC UNIVARIATE to explore the continuous, or interval, variables: plot histograms of the data to see the shape and spread, and print the mean and standard deviation summary statistics. The PROC UNIVARIATE step in Part C performs a distribution analysis and plots the distribution of the continuous variables. The NOPRINT option suppresses the other output. We reference the interval macro variable in the VAR and HISTOGRAM statements, and request a normal curve, a kernel density estimate, and an inset box in the top right, or northeast, corner that displays the number of rows, the mean, and the standard deviation.
- Submit this step.
- Review the output.
The first histogram, Distribution of SalePrice, shows that the average sale price of homes in our sample is $137,524. The histogram is bell shaped, suggesting a Gaussian, or normal, distribution, a quality of the data that's important for our analyses in subsequent lessons. The blue line overlaying the plot is a normal density estimate, and the red line is a kernel density estimate, which essentially mimics the histogram. If these two overlaid lines are similar, the data are close to a normal distribution. We'll discuss this more in Lesson 1.
Sometimes researchers use a log transformation on an outcome variable such as SalePrice to provide more bell-shaped or normal-looking data for future analyses. In reviewing the Log_Price histogram, we see that both the original variable and the log transformation provide bell-shaped data.
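Log_Price is already included in ameshousing3, but a log-transformed outcome could be created in a DATA step like this sketch (assuming a natural log; the data set name work.ames_log and variable Log_Price2 are hypothetical):

/* Sketch: create a log-transformed copy of the outcome */
data work.ames_log;
   set STAT1.ameshousing3;
   Log_Price2=log(SalePrice);   /* LOG() is the natural logarithm */
run;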
The Gr_Liv_Area histogram indicates that, on average, homes in Ames, Iowa, have 1,130 square feet of above ground living area. Most homes range from 900 to 1,380 square feet.
The Basement_Area histogram is fairly bell shaped with a mean of 882 square feet. The Garage_Area histogram shows that the average garage area for homes with a garage is 369 square feet. The Deck_Porch_Area histogram is an example of a skewed distribution: the largest share of observations, approximately 40%, has no deck or porch, and larger deck and porch areas become progressively less common. The Lot_Area histogram is fairly normal looking with a mean of 8,294 square feet. The Age_Sold histogram shows that the average age of homes sold in our sample is approximately 46 years, and the ages range from new to about 132 years old.
The Bedroom_AbvGr histogram shows that the number of bedrooms above ground, which could also be analyzed as a categorical variable, is 2.5 on average. Similarly, in the Full_Bathroom, Half_Bathroom, and Total_Bathroom histograms, the number of full, half and total bathrooms could also be analyzed as categorical variables, and the average numbers are 1.68, 0.25, and 1.70, respectively.
After looking at the variables in our course data, you might have some intuition as to which variables could accurately model sale price. For example, we could do an analysis of variance to see whether homes with central air are more likely to sell for higher prices, or whether homes with excellent heating condition are associated with higher priced homes. In addition, we could use regression to see whether the above ground living area, as a proxy for the size of homes, is correlated with SalePrice. Or perhaps we can model the probability of the home selling for more than $175,000 using both the number of fireplaces and basement area jointly. These are questions we'll be able to answer going forward.
Lesson 00, Section 0 Demo: Exploring Ames Housing Data (SAS Studio Task Version)
Use the Table Analysis task to generate plots and tables for the categorical variables in the ameshousing3 data set. Then use the Distribution Analysis task to generate plots and descriptive statistics for the continuous variables.
Generating Plots and Tables for Categorical Variables Using the Table Analysis Task
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the Table Analysis task.
- On the DATA tab, click the Select a table icon and select the stat1.ameshousing3 table.
- Assign the following variables to the Row variables role. Use the Ctrl key to select multiple variables.
- House_Style
- Overall_Qual
- Overall_Cond
- Year_Built
- Heating_QC
- Central_Air
- Fireplaces
- Mo_Sold
- Yr_Sold
- Garage_Type_2
- Foundation_2
- Masonry_Veneer
- Lot_Shape_2
- On the OPTIONS tab, select Cell under Percentages, and select Frequencies and percentages under Cumulative.
- Under STATISTICS, clear the Chi-square statistics check box.
- Click Run.
Generated Code
ods noproctitle;

proc freq data=STAT1.AMESHOUSING3;
   tables (House_Style Overall_Qual Overall_Cond Year_Built Heating_QC
           Central_Air Fireplaces Mo_Sold Yr_Sold Garage_Type_2
           Foundation_2 Masonry_Veneer Lot_Shape_2) / norow nocol
          plots(only)=(freqplot);
run;
Obtaining Descriptive Statistics for Continuous Variables Using the Distribution Analysis Task
- Open the Distribution Analysis task under Statistics. Notice that the stat1.ameshousing3 table is already selected. SAS Studio displays the last data set that was used.
- Assign the following continuous variables to the Analysis variables role. Use the Ctrl key to select multiple variables.
- Lot_Area
- Gr_Liv_Area
- Bedroom_AbvGr
- Garage_Area
- SalePrice
- Basement_Area
- Full_Bathroom
- Half_Bathroom
- Total_Bathroom
- Deck_Porch_Area
- Age_Sold
- Log_Price
- On the OPTIONS tab, select Add normal curve, Add kernel density estimate, and Add inset statistics.
- Expand Inset Statistics and select Mean and Standard deviation in addition to the default, Number of observations.
- Click Run to submit the generated code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

/*** Exploring Data ***/
proc univariate data=STAT1.AMESHOUSING3;
   ods select Histogram;
   var Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area SalePrice
       Basement_Area Full_Bathroom Half_Bathroom Total_Bathroom
       Deck_Porch_Area Age_Sold Log_Price;
   histogram Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area SalePrice
             Basement_Area Full_Bathroom Half_Bathroom Total_Bathroom
             Deck_Porch_Area Age_Sold Log_Price / normal kernel;
   inset n mean std / position=ne;
run;
Lesson 01, Section 3 Demo: Performing a One-Sample t Test Using PROC TTEST
Filename: st101d02.sas
In this demonstration we use the TTEST procedure to determine whether the population mean sale price of homes in Ames, Iowa, is $135,000 given our sample. This procedure performs t tests and computes confidence limits. It can also use ODS Graphics to produce histograms, quantile-quantile plots, box plots, and confidence limit plots.

PROC TTEST DATA=SAS-data-set <options>;
   VAR variables;
RUN;
- Open program st101d02.sas.
- Submit the code.
- Review the output.
The Statistics table provides descriptive statistics of our sample, including sample size, mean, standard deviation, standard error, and minimum and maximum values of SalePrice.
The Confidence Limits table provides confidence limits for μ and σ. The default level is 95%, but you can change it with the ALPHA= option in the PROC TTEST statement. Set alpha equal to 1 minus the confidence level.
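For example, to request 99% confidence limits instead of the default 95%, you could set ALPHA=0.01, as in this sketch based on the demo program:

/* Sketch: 99% confidence limits (alpha = 1 - 0.99) */
proc ttest data=STAT1.ameshousing3 h0=135000 alpha=0.01;
   var SalePrice;
run;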
The T-Tests table provides the t test information, including the degrees of freedom, the t value, and the p-value, 0.2460. Recall that if the t statistic is close to zero and the p-value is greater than α, evidence suggests that the hypothesized population parameter is statistically reasonable, and we fail to reject the null hypothesis. Our t value is 1.16 and the p-value is greater than our α of 0.05, so we conclude that the mean sale price of homes in Ames, Iowa, is not statistically different from $135,000.
The Mean of Sale Price Interval Plot shows the confidence interval around the mean estimate of sale price, and the vertical line references the null hypothesis value. Because the vertical reference line is within the bounds of the confidence interval, we conclude that the mean sale price of homes in Ames, Iowa, is not statistically different from $135,000. Finally, we need to verify the validity of the test by checking that the distribution of the prices of houses is normal. We can use the histogram and Q-Q plot to verify this assumption.
In the Summary Panel, the Distribution of SalePrice histogram appears to be bell shaped, like a normal distribution. The normal and kernel density estimates are nearly overlapping, indicating that the estimated data distribution from our sample is nearly equivalent to a normal distribution.
If the data are normal, a Q-Q plot produces a relatively straight line with some deviations due to random sampling. In the Q-Q Plot of SalePrice, the sorted sale prices are plotted against quantiles from a standard normal distribution. The tail ends seem to be skewed due to possible outliers, but, overall, the plot fails to show departures from normality.
Based on these checks of normality, we can consider the Student's t test valid, and we can conclude that the mean sale price of homes in Ames, Iowa, is not statistically different from $135,000.

/*st101d02.sas*/
ods graphics;

proc ttest data=STAT1.ameshousing3 plots(shownull)=interval
           h0=135000;
   var SalePrice;
   title "One-Sample t-test testing whether mean SalePrice=$135,000";
run;
title;
The PROC TTEST step analyzes the SalePrice variable. The H0= option specifies our null hypothesis value of 135,000. The INTERVAL option requests confidence interval plots for the means, and the SHOWNULL option displays a vertical reference line at the null value of 135,000.
Lesson 01, Section 3 Demo: Performing a One-Sample t Test Using PROC TTEST (SAS Studio Task Version)
Use the t Tests task to test whether the mean sale price is $135,000 in the ameshousing3 data set.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the t Tests task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Analysis variable role.
- On the OPTIONS tab, set the Alternative hypothesis: mu^= to 135000.
- Clear the option to conduct Tests for normality.
- Expand PLOTS and use the drop-down list to choose Selected plots. Select Confidence interval plot, in addition to the Histogram and box plot, and Normality plot defaults.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

/*** t Test ***/
proc ttest data=STAT1.AMESHOUSING3 sides=2 h0=135000
           plots(only showh0)=(summaryPlot intervalPlot qqplot);
   var SalePrice;
run;
Lesson 01, Section 4 Demo: Performing a Two-Sample t Test Using PROC TTEST
Filename: st101d03.sas
In this demonstration, we use PROC TTEST to perform a two-sample t test, and test whether the mean of SalePrice is the same for homes with masonry veneer as for those without.

PROC TTEST DATA=SAS-data-set <options>;
   CLASS variable;
   VAR variables </ options>;
RUN;
- Open program st101d03.sas.
- Submit the code.
- Review the output.
Start by verifying our assumption of normality of the distribution of each group by looking at the histograms and Q-Q plots. In the Summary Panel, the Distribution of SalePrice histograms for the two groups each appear approximately normal, with, of course, different centers of location. Both histograms have a blue normal reference curve superimposed on the plot to help determine whether the distributions are normal.
Consider the Q-Q plots. If the data in a Q-Q plot come from a normal distribution, the points cluster tightly around the reference line. The first plot shows homes without masonry veneer (the No group), and the second shows homes with it (the Yes group). Both Q-Q plots exhibit relatively straight lines. There's slight curvature on both, but nothing too extreme. From these four plots, it's safe to say that both populations are normally distributed.
Next, consider the Equality of Variances table. The F statistic is 1.36 and the p-value is relatively large at 0.1039. Based on this, do we reject or fail to reject the null hypothesis of equal variances? The p-value is greater than alpha, so we do not reject the null hypothesis. We don't have enough evidence to say the variances are unequal.
Based on the results of the F test, we now look in the T-Tests table at the t test for the hypothesis of equal means, the pooled t test. The p-value is less than .0001, which is less than 0.05, so we can reject the null hypothesis that the group means are equal. We can conclude that the sale price of homes with masonry veneer is significantly different from that of homes without it. Also notice that the t statistic values for the Pooled and Satterthwaite tests are nearly equal, -5.38 and -5.72. When the sample variances are equal, the two t statistics are mathematically equivalent. The slight difference here is due to random sampling variation in the estimated variances.
We can make the same conclusion about the means from the Difference Interval Plot. The Mean of SalePrice Difference (No-Yes) interval plot includes both pooled and Satterthwaite 95% confidence intervals. Notice that for the pooled variance method, the confidence interval for the difference in means is between about -$33,000 and -$15,000. It doesn't include 0, which is our hypothesized value. In other words, we have enough evidence to say that the difference of the means is significantly different from 0 at the 95% confidence level.
Now consider the Statistics table. From our sample of 300 homes, 89 homes have masonry veneer and 209 do not. Two homes have missing data, so we'll remove them from the analysis. For the 209 homes without masonry veneer, the sample mean sale price is $130,172, and for the 89 homes with it, the sample mean is $154,705. The mean difference (No minus Yes) is -$24,533. From the sample data, it's clear that homes with masonry veneer tend to sell for significantly higher prices.

/*st101d03.sas*/
ods graphics;

proc ttest data=STAT1.ameshousing3 plots(shownull)=interval;
   class Masonry_Veneer;
   var SalePrice;
   format Masonry_Veneer $NoYes.;
   title "Two-Sample t-test Comparing Masonry Veneer, No vs. Yes";
run;
title;
This PROC TTEST step doesn't use the null hypothesis option because we're testing the equality of means. The CLASS statement selects Masonry_Veneer as the grouping variable. The CLASS statement is required in a two-sample t test. The classification variable can be numeric or character, but must have exactly two levels, because PROC TTEST divides the observations into the two groups using the values of this variable. Classification levels are determined from the formatted values of the CLASS variable, so if necessary, you can apply a format to collapse the data into two levels. The FORMAT statement here applies the $NoYes format to display Yes and No in the output instead of Y and N.
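The $NoYes. format itself is created by the course setup program; a definition along these lines (an assumption about the stored values, not copied from the setup code) would map them to the displayed labels:

/* Assumed sketch of how the $NoYes. format could be defined */
proc format;
   value $NoYes 'N'='No'
                'Y'='Yes';
run;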
Lesson 01, Section 4 Demo: Performing a Two-Sample t Test Using PROC TTEST (SAS Studio Task Version)
Use the t Tests task to test whether the mean sale price is the same for homes with masonry veneer and those without.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the t Tests task.
- Select the stat1.ameshousing3 table.
- On the DATA tab, under ROLES, use the t-test drop-down list to select Two-sample test.
- Assign SalePrice to the Analysis variable role.
- Assign Masonry_Veneer to the Groups variable role.
- On the OPTIONS tab, clear the option to conduct Tests for normality.
- Expand PLOTS and use the drop-down list to choose Selected plots. Select Confidence interval plot, in addition to the Histogram and box plot, and Normality plot defaults.
- Click Run to submit the generated code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

/*** t Test ***/
proc ttest data=STAT1.AMESHOUSING3 sides=2 h0=0
           plots(only showh0)=(summaryPlot intervalPlot qqplot);
   class Masonry_Veneer;
   var SalePrice;
run;
Lesson 02, Section 1 Demo: Exploring Associations Using PROC SGPLOT
Filename: st102d01.sas
In this demo, you use PROC SGPLOT to create box plots, and visually examine the association between the categorical predictor Central_Air and the continuous response SalePrice in the ameshousing3 data. As shown in the syntax, PROC SGPLOT has many graphical options to visually explore data and create a variety of plots.

PROC SGPLOT DATA=SAS-data-set <options>;
   HBAR category-variables </ options>;
   VBAR category-variables </ options>;
   HBOX category-variables </ options>;
   VBOX category-variables </ options>;
RUN;
- Open program st102d01.sas.
/*st102d01.sas*/
/*Part C*/
proc sgplot data=STAT1.ameshousing3;
   vbox SalePrice / category=Central_Air connect=mean;
   title "Sale Price Differences across Central Air";
run;
Part C of the program uses the VBOX statement to create a vertical box plot that shows the distribution of the data. It specifies the variable for the Y axis, SalePrice, followed by a forward slash. The CATEGORY= option specifies the category variable and creates a separate box plot for each distinct value. This means it creates one plot for homes with no central air and one for homes with central air. To help us visually assess the relationship between our variables, we use CONNECT=MEAN to draw a line connecting the means of Y at each value of X.
- Submit this step.
- Review the output.
In the SGPlot Procedure box plot, SalePrice is on the Y axis and Central_Air is on the X axis. The No group is on the left, the Yes group is on the right, and there are some outliers. The line connecting the group means is definitely not horizontal. Clearly, there appears to be an association between Central_Air and SalePrice: homes with central air tend to sell for higher prices than homes without.

Exploring associations with box plots helps prepare you for what you might encounter as you analyze your data. However, don't use these plots exclusively to determine which variables to include in your model. They represent only simple relationships between one predictor variable and the response variable. When you start putting multiple variables in the model, the picture of associations can become very different.
Lesson 02, Section 1 Demo: Exploring Associations Using PROC SGPLOT (SAS Studio Task Version)
Use the Box Plot task to visually examine the association between the categorical predictor Central_Air and the continuous response SalePrice in our ameshousing3 data.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Graph and select the Box Plot task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Analysis variable role.
- Assign Central_Air to the Category role.
- On the APPEARANCE tab, expand the ANALYSIS AXIS property and clear the check box to Show grid lines.
- Click Run.
Generated Code
ods graphics / reset width=6.4in height=4.8in imagemap;

proc sgplot data=STAT1.AMESHOUSING3;
   vbox SalePrice / category=Central_Air;
run;

ods graphics / reset;
Lesson 02, Section 1 Demo: Exploring Associations Using PROC SGSCATTER (SAS Studio Task Version)
Use the Scatter Plot task to examine the association between the response variable SalePrice and predictor variables in our ameshousing3 data. We also want to see the general shape of each association. We’ll start by generating a scatter plot to see whether there's an association between SalePrice and Above Ground Living Area.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Graph and open the Scatter Plot task.
- Select the stat1.ameshousing3 table.
- Assign Gr_Liv_Area to the X axis role and assign SalePrice to the Y axis role.
- On the APPEARANCE tab, expand the FIT CURVES property and select Regression to add a regression fit to the scatter plot.
- Click Run.
Generated Code
ods graphics / reset width=6.4in height=4.8in imagemap;

proc sgplot data=STAT1.AMESHOUSING3;
   reg x=Gr_Liv_Area y=SalePrice / nomarkers;
   scatter x=Gr_Liv_Area y=SalePrice /;
   xaxis grid;
   yaxis grid;
run;

ods graphics / reset;
Multiple Scatter Plots
- Expand Statistics and open the Data Exploration task to plot multiple correlation plots simultaneously.
- The stat1.ameshousing3 table should be selected.
- Select Lot_Area, Gr_Liv_Area, Garage_Area, SalePrice, Basement_Area, and Deck_Porch_Area as the Continuous variables.
- On the PLOTS tab, clear the check box to output a Scatter plot matrix.
- Select the Regression scatter plots option, and select SalePrice as the response variable.
- Click Run.
Generated Code
options validvarname=any;
ods noproctitle;
ods graphics / imagemap=on;

/* Regression scatter plot macro */
%macro regressionScatterplot(xVar=, yVar=, title=, groupVar=);
   proc sgscatter data=STAT1.AMESHOUSING3;
      plot (&yVar)*(&xVar) /
         %if(&groupVar ne %str()) %then %do;
            group=&groupVar legend=(sortorder=ascending)
         %end;
         reg;
      title &title;
   run;
   title;
%mend regressionScatterplot;

%regressionScatterplot(xVar=Lot_Area, yVar=SalePrice, title="SalePrice vs Lot_Area");
%regressionScatterplot(xVar=Gr_Liv_Area, yVar=SalePrice, title="SalePrice vs Gr_Liv_Area");
%regressionScatterplot(xVar=Garage_Area, yVar=SalePrice, title="SalePrice vs Garage_Area");
%regressionScatterplot(xVar=SalePrice, yVar=SalePrice, title="SalePrice vs SalePrice");
%regressionScatterplot(xVar=Basement_Area, yVar=SalePrice, title="SalePrice vs Basement_Area");
%regressionScatterplot(xVar=Deck_Porch_Area, yVar=SalePrice, title="SalePrice vs Deck_Porch_Area");

The SAS Studio Data Exploration task limits the number of continuous variables to six and writes individual scatter plots to the output. To plot more than five variables simultaneously in a panel plot, use PROC SGSCATTER.
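As a sketch of that approach (not generated by the task), a single PROC SGSCATTER step can place all of the regression scatter plots in one panel:

/* Sketch: panel of regression scatter plots with PROC SGSCATTER */
proc sgscatter data=STAT1.ameshousing3;
   plot SalePrice*(Lot_Area Gr_Liv_Area Garage_Area Basement_Area
                   Deck_Porch_Area) / reg;
   title "SalePrice vs Continuous Predictors";
run;
title;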
Lesson 02, Section 2 Demo: Performing a One-Way ANOVA Using PROC GLM (SAS Studio Task Version)
Use the One-Way ANOVA task to run an analysis of variance to test whether the average SalePrice value differs among the houses with different heating qualities. Before we can trust the results from our ANOVA, such as the p-value, standard errors, and confidence intervals, we need to check the assumptions of our model. We’ll use Levene’s test of homogeneity of variances to assess constant variance. We can check normality and independence through residual plots such as histograms, Q-Q plots, residuals versus predicted values, and residuals versus predictors.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the One-Way ANOVA task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign Heating_QC to the Categorical variable role.
- On the OPTIONS tab, under HOMOGENEITY OF VARIANCE, clear the option for Welch's variance-weighted ANOVA.
- Under COMPARISONS, use the Comparisons method drop-down list to select None.
- Under PLOTS, from the Display plots drop-down list, select the Selected plots option, and then select Box plot, if not already selected, and Diagnostics plot. Clear the check boxes for Means plot and LS-mean difference plot.
- Click Run.
Generated Code
Title;
ods noproctitle;
ods graphics / imagemap=on;

proc glm data=STAT1.AMESHOUSING3 plots(only)=(boxplot diagnostics);
   class Heating_QC;
   model SalePrice=Heating_QC;
   means Heating_QC / hovtest=levene plots=none;
run;
quit;
Lesson 02, Section 3 Demo: Performing a Post Hoc Pairwise Comparison Using PROC GLM
Filename: st102d03.sas
Recall that we already determined from a significant overall ANOVA result that at least one heating quality was different. In this demonstration, we use PROC GLM to determine which pairs are significantly different from each other in their mean sale price. In the program, we'll make all pairwise comparisons and apply Tukey's adjustment, as well as comparisons to a control using Dunnett's adjustment. In actual practice, you would determine in advance whether you're interested in all pairwise comparisons or only comparisons to a control, and request the appropriate technique.

PROC GLM DATA=SAS-data-set <options>;
   CLASS variables;
   MODEL dependents=effects </ options>;
   LSMEANS effects </ options>;
RUN;
QUIT;
- Open program st102d03.sas.
/*st102d03.sas*/
ods graphics;
ods select lsmeans diff diffplot controlplot;

proc glm data=STAT1.ameshousing3
         plots(only)=(diffplot(center) controlplot);
   class Heating_QC;
   model SalePrice=Heating_QC;
   lsmeans Heating_QC / pdiff=all adjust=tukey;
   lsmeans Heating_QC / pdiff=control('Average/Typical') adjust=dunnett;
   format Heating_QC $Heating_QC.;
   title "Post-Hoc Analysis of ANOVA - Heating Quality as Predictor";
run;
quit;
title;
As in the previous demonstration, the CLASS statement specifies the classification variable Heating_QC, and the MODEL statement specifies the response, SalePrice, equal to the classification variable, Heating_QC, as indicated in the ANOVA model.
Next, we request all the multiple comparison methods with options in the LSMEANS statements. Multiple LSMEANS statements are permitted, although typically you would only use one type of method for each LSMEANS effect. Two different methods are used for illustration here.
In the first LSMEANS statement, we specify our predictor variable, Heating_QC. The PDIFF= option requests p-values for the differences. PDIFF=ALL, the default, requests comparisons of all pairs of means and automatically produces a diffogram. The ADJUST= option specifies the adjustment method for multiple comparisons. If you don't specify the ADJUST= option, SAS uses the Tukey method by default. Recall that Tukey's adjustment maintains the experimentwise error rate at 0.05 for all six pairwise comparisons.
In the second LSMEANS statement, the PDIFF=CONTROL option requests that each level be compared to a control level. You choose the appropriate control level based on the research goals. The control level is written in quotation marks. We're using Average/Typical as the control for demonstration purposes, which will result in three comparisons, one for each remaining level versus the control. Because we specify the ADJUST=Dunnett option, the GLM procedure produces multiple comparisons using Dunnett's method. This method maintains an experimentwise error rate of 0.05 for all three comparisons and creates a control plot.
The PROC GLM statement includes the PLOTS= options. The DIFFPLOT option modifies the diffogram that's produced by the LSMEANS statement with the PDIFF=ALL option. The CENTER option adds a dot to the intersection of two least squares means for each comparison.
The CONTROLPLOT option requests a display in which least squares means are compared against a reference level. LS-mean control plots are produced only when you specify PDIFF=CONTROL or ADJUST=DUNNETT in the LSMEANS statement. In this case, they're produced by default.
- Submit the program.
- Review the output.
We'll start with the Tukey LSMEANS comparisons. The other tables and results are identical to the previous demonstration.
The LSMeans table shows the means for each group, and each mean is assigned a number to refer to it in the next table. The table shows that the average sale price of homes with Excellent heating quality is the highest, at approximately $154,000. Homes with Fair heating quality have the lowest average price, at approximately $97,000. The other two levels are nearly equivalent at about $130,000.
The second table is a difference matrix that shows the p-values from pairwise comparisons of all possible combinations of means. Notice that row 2, column 4 has the same p-value as row 4, column 2, because the same two means are compared in each case. Both are displayed as a convenience to the user. The diagonal is blank, of course, because it doesn't make sense to compare a mean to itself. The only nonsignificant pairwise difference is between Average/Typical and Good. These p-values are adjusted using the Tukey method and are, therefore, larger than the unadjusted p-values for the same comparisons. However, the experimentwise Type I error rate is held fixed at alpha.
The comparisons of least square means are also shown graphically in the Heating_QC Diffogram. Six comparisons are shown, but because the Average/Typical and Good levels have very close means, two pairs of lines are close together. The blue solid lines denote significant differences between heating quality levels, because these confidence intervals for the difference do not cross the diagonal equivalence line. Red dashed lines indicate a non-significant difference between treatments. Starting at the top, left to right, we can see Excellent is significantly different from Fair, from Good, and from Average/Typical. Then at the middle left, Good heating quality houses are significantly different from Fair, but not from Average/Typical. Finally, Average/Typical is significantly different from Fair heating quality in their mean sales price. The text on the graph tells us that the Tukey adjustments have been applied to these comparisons.
The next Least Squares Means table for Heating_QC displays the Dunnett's LSMEANS comparisons. In this case, all other quality levels are compared to Average/Typical. Once again, Good is the only level that is not significantly different from that control level.
The Heating_QC Control Plot corresponds to the tables that were summarized. The horizontal line is drawn at the least squares mean for Average/Typical, which is $130,574. The other three means are represented by the ends of the vertical lines extending from the horizontal control line. The mean value for Good is so close to Average/Typical that it can't be seen here.
Notice that the blue areas of non-significance vary in size. This is because different comparisons involve different sample sizes. Smaller sample sizes require larger mean differences to reach statistical significance. This control plot shows significant differences between Excellent and Average/Typical, and between Fair and Average/Typical, just like in the table above.
As we've seen, tests for significant differences among treatments can be assessed graphically or through tables of p-values. Some people prefer graphs; others prefer the tables. It's your personal preference which to use.
Lesson 02, Section 3 Demo: Performing a Post Hoc Pairwise Comparison Using PROC GLM (SAS Studio Task Version)
You already determined from a significant overall ANOVA result that at least one heating quality was different. Use the One-Way ANOVA task to produce comparison information to determine which pairs are significantly different from each other in their mean sale prices.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the One-Way ANOVA task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign Heating_QC to the Categorical variable role.
- On the OPTIONS tab, under HOMOGENEITY OF VARIANCE, use the Test drop-down list to select None, and clear the check box for Welch's variance-weighted ANOVA.
- Under COMPARISONS, use the Comparisons method drop-down list to select Tukey, if not already selected.
- Under PLOTS, use the Display plots drop-down list to select the Selected plots option, and then select only the LS-mean difference plot.
- Click Run.
Generated Code
Title;
ods noproctitle;
ods graphics / imagemap=on;

proc glm data=STAT1.AMESHOUSING3 plots(only);
   class Heating_QC;
   model SalePrice=Heating_QC;
   lsmeans Heating_QC / adjust=tukey pdiff alpha=.05 plots=(diffplot);
run;
quit;
One-Way ANOVA Using Dunnett's Method
To produce output for multiple comparison methods, you can run the tasks separately.
- Modify the existing task to use Dunnett's method. On the OPTIONS tab, under COMPARISONS, select Dunnett two-tail as the Comparisons method, and select TA as the Control level.
- Under PLOTS, use the Display plots drop-down list to select Default plots.
- Click Run.
Generated Code
Title;
ods noproctitle;
ods graphics / imagemap=on;

proc glm data=STAT1.AMESHOUSING3;
   class Heating_QC;
   model SalePrice=Heating_QC;
   lsmeans Heating_QC / adjust=dunnett pdiff=control('TA') alpha=.05;
run;
quit;

NOTE: Typically, only one type of multiple comparison method would be used, and SAS Studio conducts one comparison method at a time. You can edit the generated code manually to include multiple comparison statements. In the code window, click Edit to add the code for the second comparison method. The following edited code provides comparison information using both Tukey’s HSD test and Dunnett’s method:
Title;
ods noproctitle;
ods graphics / imagemap=on;

proc glm data=STAT1.AMESHOUSING3 plots(only);
   class Heating_QC;
   model SalePrice=Heating_QC;
   lsmeans Heating_QC / adjust=tukey pdiff alpha=.05 plots=(diffplot);
   lsmeans Heating_QC / adjust=dunnett pdiff=control('TA') alpha=.05
           plots=(controlplot);
run;
quit;
Lesson 02, Section 4 Demo: Producing Correlation Statistics and Scatter Plots Using PROC CORR
Filename: st102d04.sas
In this demonstration, we use PROC CORR to produce correlation statistics and scatter plots for our data. Our goal is to identify, both visually and numerically, which predictors are linearly associated with SalePrice, as well as the strength of the relationship. By default, PROC CORR produces Pearson correlation coefficients and corresponding p-values.

PROC CORR DATA=SAS-data-set <options>;
   VAR variables;
   WITH variables;
   ID variables;
RUN;
- Open program st102d04.sas.
/*st102d04.sas*/
/*Part A*/
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
     Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

ods graphics / reset=all imagemap;
proc corr data=STAT1.AmesHousing3 rank
          plots(only)=scatter(nvar=all ellipse=none);
   var &interval;
   with SalePrice;
   id PID;
   title "Correlations and Scatter Plots with SalePrice";
run;
title;

/*st102d04.sas*/
/*Part B*/
ods graphics off;
proc corr data=STAT1.AmesHousing3 nosimple best=3;
   var &interval;
   title "Correlations and Scatter Plot Matrix of Predictors";
run;
title;
The PROC CORR statement specifies AmesHousing3 as the data set. To rank-order the absolute value of the correlations from highest to lowest, we're using the RANK option. To request individual scatter plots, we specify the PLOTS=SCATTER option. After the keyword SCATTER, there are two more options. NVAR=ALL specifies that all the variables listed in the VAR statement be displayed in the plots, and ELLIPSE=NONE suppresses the drawing of ellipses on scatter plots. The VAR statement specifies the continuous variables that we want correlations for. By default, SAS produces correlations for each pair of variables in the VAR statement, but we'll use the WITH statement to correlate each continuous variable with SalePrice.
The IMAGEMAP option in the ODS GRAPHICS statement enables tooltips in HTML output. Tooltips enable you to identify data points by moving the cursor over observations in a plot. In PROC CORR, the variables used in the tooltips are the X-axis and Y-axis variables, the observation number, and any variable in the ID statement, which in this case is the variable PID.
- Submit Part A of this program.
- Review the output.
By default, the CORR Procedure generates a table of Variable Information that lists the variables that were analyzed. It also displays a Simple Statistics table with descriptive statistics for each variable, including the mean, standard deviation, and minimum and maximum values.
The Pearson Correlations table displays the correlation coefficients and p-values for the correlation of SalePrice with each of the predictor variables. Notice that the table is ranked by the absolute correlation coefficient. Basement_Area has the strongest linear association with the response variable with a correlation coefficient of about 0.69. Therefore, Basement_Area would be the best single predictor of SalePrice in a simple linear regression. The p-value is small, which indicates that the population correlation coefficient, ρ, is likely different from 0. The second largest magnitude correlation coefficient is Above Ground Living Area at about 0.65, and so on.
Next we'll consider the scatter plots. In the SalePrice by Gr_Liv_Area plot, the Above Ground Living Area seems to exhibit a noticeably positive linear association with SalePrice. Of course, the scatter plot of SalePrice by Basement_Area also shows a positive linear relationship. Notice that there are several houses that have basements with a size of zero square feet. These are houses without basements, not missing values. This mixture of data can affect the correlation coefficient. You would need to take this into account if you build a model with Basement_Area as a predictor variable. You can move the cursor over an observation to display the coordinate values, observation number, and ID variable values.
The scatter plots with Deck_Porch_Area and Lot_Area show the variables have weak correlations with SalePrice, because a horizontal line could be an adequate line of best fit.
As expected, SalePrice and the age of the house when sold, Age_Sold, have a negative linear relationship. The older the house, the less the home tends to sell for.
The variables in the scatter plots for the number of bedrooms (Bedroom_AbvGr) and bathrooms (Total_Bathroom) take only a few distinct values and could be analyzed as classification variables. These plots essentially display the distribution of SalePrice at each level, similar to a box plot. However, the scatter plot with total number of bathrooms seems to exhibit a positive linear relationship, as the center of the distributions tends to increase as the number of bathrooms increases. Overall, the correlation and scatter plot analyses indicate that several variables might be good predictors for SalePrice.
When you prepare to conduct a regression analysis, it's always a good practice to examine the correlations among the potential predictor variables. This is because strong correlations among predictors included in the same model can cause a variety of problems, like multicollinearity.
- In Part B of this program, we want to produce a correlation matrix to help us compare the relationships between predictor variables. The correlation matrix shows correlations and p-values for all combinations of the predictor variables. Here we'll limit our attention to the strongest three correlations with each predictor.
In this PROC CORR statement, we're using the NOSIMPLE option to suppress the printing of the simple descriptive statistics for each variable. The BEST= option prints the n highest correlation coefficients for each variable, so in this case, the three strongest correlations.
- Submit this step.
- Review the output.
In the results, notice that the Variables Information table is still listed, but the table of simple statistics is gone.
The Pearson Correlations table indicates that there are moderately strong correlations between Total_Bathroom and Age_Sold, -0.52889, between Total_Bathroom and Basement_Area, 0.48500, and between Bedroom_AbvGr and Gr_Liv_Area, 0.48431.
If some of these potential predictors were highly correlated, we might omit some from the multiple regression models that we'll produce later in the course. Strong correlations among sets of predictors, also known as multicollinearity, can cause a variety of problems for statistical models. Correlation analysis has the potential to reveal multicollinearity problems, but additional methods to detect it are necessary. Bivariate correlations in the range shown above are not causes for concern.
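One common diagnostic beyond pairwise correlations is the variance inflation factor. As a sketch (not part of this demo), PROC REG's VIF option reports it for each predictor:

/* Sketch: variance inflation factors to screen for multicollinearity */
proc reg data=STAT1.ameshousing3;
   model SalePrice=&interval / vif;   /* &interval as defined in Part A */
run;
quit;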
Lesson 02, Section 4 Demo: Producing Correlation Statistics and Scatter Plots Using PROC CORR (SAS Studio Task Version)
Use the Correlation Analysis task to produce correlation statistics and scatter plots for the ameshousing3 data. The goal is to identify, both visually and numerically, which predictors are linearly associated with SalePrice, as well as the strength of the relationship.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Correlation Analysis task.
- Select the stat1.ameshousing3 table.
- Assign the continuous variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) to the Analysis variable role.
- Assign SalePrice as the Correlate with variable.
- On the OPTIONS tab, under STATISTICS, use the Display statistics drop-down list to choose the Selected statistics option, and then select the Correlations and Display p-values check boxes (which might already be selected) as well as Descriptive statistics.
- Under PLOTS, use the Type of plot drop-down list and choose Individual scatter plots. Ensure that the Include inset statistics check box is selected, and change the Number of variables to plot to 8 to generate plots for all eight variables.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc corr data=STAT1.AMESHOUSING3 pearson
          plots=scatter(ellipse=none nvar=8 nwith=8);
   var Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area Basement_Area
       Total_Bathroom Deck_Porch_Area Age_Sold;
   with SalePrice;
run;
Lesson 02, Section 5 Demo: Performing Simple Linear Regression Using PROC REG
Filename: st102d05.sas
When we performed exploratory data analysis, we found a significant Pearson correlation between SalePrice and several continuous variables in the ameshousing3 data set. Let's use PROC REG to build a simple linear regression model using Lot_Area as the predictor variable in order to determine how exactly Lot_Area and SalePrice are linearly related.

PROC REG DATA=SAS-data-set <options>;
   MODEL dependents=<regressors> </ options>;
RUN;
- Open program st102d05.sas.
/*st102d05.sas*/
ods graphics;

proc reg data=STAT1.ameshousing3;
   model SalePrice=Lot_Area;
   title "Simple Regression with Lot Area as Regressor";
run;
quit;
title;
The PROC REG statement specifies the data set ameshousing3, and the MODEL statement specifies the model that we're analyzing, SalePrice=Lot_Area.
- Submit the program.
- Review the output.
In the REG procedure output, the Number of Observations table shows that the number of observations that were read and the number used are the same. This indicates there are no missing values for SalePrice and Lot_Area.
Next, the Analysis of Variance table shows how the total variability in SalePrice can be partitioned to test the null hypothesis that the slope for Lot_Area is equal to 0.
The ANOVA table in regression is equivalent to the ANOVA table from analysis of variance. It provides the model, error, and total sums of squares. It provides the degrees of freedom for each source of variability, and it also calculates the mean squares that are used to compute the F value. Recall that the mean squares are calculated as the sum of squares divided by their corresponding degrees of freedom, and dividing the mean square model by the mean square error computes the F value.
In regression, the model degrees of freedom are the number of model parameters minus 1; in this case, 2 - 1 = 1. The error degrees of freedom are the number of observations used minus the number of model parameters: 300 - 2 = 298. The total degrees of freedom are, as before, n - 1 = 299.
Finally, the ANOVA table reports the p-value to evaluate the null hypothesis, and in this case, it's highly significant. Thus, you can conclude that the simple linear regression model fits the data better than the baseline model. Evidence suggests that there's a significant linear relationship between SalePrice and Lot_Area, because the slope for Lot_Area is significantly different from zero.
The third part of the PROC REG output, the Fit Statistics table, displays summary measures of fit for the model. The root MSE, 36456, is the square root of the mean square error in the Analysis of Variance table and is a measure of the error standard deviation. The dependent mean, 137525, is the overall mean of SalePrice. The coefficient of variation, 26.50882, is the error standard deviation expressed as a percentage of the dependent mean. This statistic is used less often than the R-square and adjusted R-square, and typically in specialized situations. The coefficient of determination, also referred to as the R-square value, is the proportion of variability in the response variable explained by the regression model. In this example, the value is 0.0642, which means that Lot_Area explains about 6% of the total variation in SalePrice.
The R-square is also just the square of the bivariate Pearson correlation coefficient between Lot_Area and SalePrice that we saw in a previous demonstration: 0.25335² ≈ 0.0642. The adjusted R-square is adjusted for the number of parameters in the model and is useful for comparing models with different numbers of predictors.
The Parameter Estimates table specifies the individual pieces of your model equation based on your data, whereas the Analysis of Variance table provides the overall fit for the model. The Parameter Estimates table also provides significance tests for each model parameter.
The parameter estimate for the intercept is 113740, and the parameter estimate for the slope of Lot_Area is 2.86770. So the regression equation is SalePrice = 113740 + 2.86770 * Lot_Area. The model indicates that each additional square foot of lot area is associated with an approximately $2.87 higher sale price. For example, the predicted mean sale price for a 10,000-square-foot lot is 113740 + 2.86770 * 10000, or about $142,417.
The p-value for each parameter estimate tests the null hypothesis that the corresponding parameter equals zero. Typically, we're not interested in the test of intercept=0, because only the slope defines the nature of the linear association between the response and the predictor. The t value is calculated by dividing the parameter estimate by its standard error. Pr > |t| is the p-value associated with the test statistic; it tests whether the parameter is different from 0. For this example, the slope for the predictor variable is statistically different from 0.
Notice that, in simple linear regression, the t value for the slope is the square root of the F value from the ANOVA table, and the p-values are identical. This will not be the case when more predictors are added to the model. Note that extrapolating the model beyond the range of your predictor variables is inappropriate. You can't assume that the relationship holds in regions that were not sampled.
The parameter estimates table also shows that the intercept parameter is not equal to 0. However, the test for the intercept parameter has practical significance only when the range of values for the predictor variable includes 0. In this example, the test can't have practical significance because Lot_Area=0, a house with no lot at all, is not within the range of observed values.
The diagnostics panel and the residuals by Lot_Area graph provide graphics to verify our model assumptions. For normality, the histogram of residuals looks bell-shaped, and the dots on the Q-Q plot essentially fall on a straight line. Both indicate no deviations from normality. Non-constant variance can often be detected in residual plots as a funnel shape, where the residuals start close to zero and fan out to larger magnitudes as the predicted values increase.
The Fit Plot produced by ODS Graphics shows the predicted regression line superimposed over a scatter plot of the data.
To assess the level of precision around the mean estimates of SalePrice, you can produce confidence intervals around the means. This is represented in the shaded area in the plot. A 95% confidence interval for the mean states that you're 95% confident your interval contains the population mean of Y for a particular X. Confidence intervals become wider as you move away from the mean of the independent variable. This reflects the fact that your estimates become more variable as you move away from the means of X and Y.
Suppose that the mean SalePrice at a fixed value of Lot_Area is not the focus. If you're interested in making a prediction for a future single observation, you need a prediction interval. This is represented by the area between the broken lines in the plot. A 95% prediction interval is one that you are 95% confident contains a new observation if you were to actually sample another observation. Prediction intervals are wider than confidence intervals, because single observations have more variability than means.
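If you want the interval endpoints as numbers rather than as shaded regions, one option is the CLM and CLI options in the MODEL statement of PROC REG. A minimal sketch, not part of the course files:

/* Sketch: CLM prints 95% confidence limits for the mean of SalePrice
   at each observed Lot_Area; CLI prints 95% prediction limits for a
   new single observation. */
proc reg data=STAT1.ameshousing3 plots=none;
   model SalePrice=Lot_Area / clm cli;
run;
quit;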
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 02, Section 5 Demo: Performing Simple Linear Regression Using PROC REG (SAS Studio Task Version)
Because there's a significant Pearson correlation between SalePrice and several continuous variables in the ameshousing3 data set, use the Linear Regression task to build a simple linear regression model. Use Lot_Area as the predictor variable to determine exactly how a change in the Lot_Area is associated with a change in the SalePrice.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign Lot_Area to the Continuous variables role.
- On the MODEL tab, click the Edit this model icon to specify the Model effects.
- In the Model Effects Builder window, select Lot_Area and click Add under Single Effects.
- Click OK to close the Model Effects Builder window.
- On the OPTIONS tab, under PLOTS, expand Scatter Plots and clear the check box for Observed values by predicted values.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc reg data=STAT1.AMESHOUSING3 alpha=0.05
         plots(only)=(diagnostics residuals fitplot);
   model SalePrice=Lot_Area /;
run;
quit;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 Demo: Performing a Two-Way ANOVA Using PROC GLM
Filename: st103d01.sas
In the Ames Housing example, we want to consider the effect that heating system quality and season sold have on home sale prices. We'll start by exploring the data using the MEANS and SGPLOT procedures.

PROC MEANS DATA=SAS-data-set <statistic-keyword(s)>;
   CLASS variable(s) < / option(s)>;
   VAR variable(s);
RUN;

PROC SGPLOT DATA=SAS-data-set <option(s)>;
   VLINE category-variable < / option(s)>;
RUN;

PROC GLM DATA=SAS-data-set <options>;
   CLASS variable(s);
   MODEL dependent-variable = independent-effects < / options>;
   LSMEANS effects < / options>;
RUN;
- Open program st103d01.sas.
- Submit Parts A and B to run both steps.
- Review the output.
- In the PROC GLM step, the ORDER=INTERNAL option tells SAS to use the order of the variable values stored internally, rather than the order of the formatted values. The internal values for Season_Sold are 1, 2, 3, and 4, so by including this option, the seasons will appear in the order Winter, Spring, Summer, and Fall, instead of alphabetical order. In the MODEL statement, SalePrice is the dependent variable, and Heating_QC and Season_Sold are the factors or model effects. In PROC GLM, the order of variables in the CLASS statement determines the look of the graph. The first variable labels the X axis, and the second variable is represented by the color-coded lines. The LSMEANS statement requests a Tukey-adjusted analysis of the difference across all seasons.
- Submit the PROC GLM step in Part C.
- Review the output.

/*st103d01.sas*/  /*Part A*/
ods graphics off;

proc means data=STAT1.ameshousing3 mean var std nway;
   class Season_Sold Heating_QC;
   var SalePrice;
   format Season_Sold Season.;
   title 'Selected Descriptive Statistics';
run;

/*st103d01.sas*/  /*Part B*/
proc sgplot data=STAT1.ameshousing3;
   vline Season_Sold / group=Heating_QC stat=mean
                       response=SalePrice markers;
   format Season_Sold season.;
run;

/*st103d01.sas*/  /*Part C*/
ods graphics on;

proc glm data=STAT1.ameshousing3 order=internal;
   class Season_Sold Heating_QC;
   model SalePrice = Heating_QC Season_Sold;
   lsmeans Season_Sold / diff adjust=tukey;
   format Season_Sold season.;
   title "Model with Heating Quality and Season as Predictors";
run;
quit;

title;
In Part A, the PROC MEANS step requests summary statistics for the variables of interest. The analysis variable, SalePrice, is named in the VAR statement, and the classification variables, Season_Sold and Heating_QC, are listed in the CLASS statement. The NWAY option requests the combination of all variables named in the CLASS statement. The FORMAT statement applies the Season format to the Season_Sold variable to display Winter, Spring, Summer, and Fall, instead of the corresponding numeric values, 1, 2, 3, and 4.
We use the SGPLOT procedure in Part B to plot the mean SalePrice by Season_Sold in a vertical line chart, with a separate line for each level of Heating_QC. The MARKERS option adds data point markers to the chart.
In the Summary Statistics generated by the MEANS procedure, the mean sale price is lowest for houses with fair heating systems. The table of means also shows that few houses with fair heating are sold regardless of season. For example, only one fair-heating-quality house was sold in the fall, which is why there is no standard deviation or variance for that mean. We can't be as confident about estimated means that are based on small sample sizes. Looking at the graph produced by the SGPLOT procedure, the season in which a home sold doesn't seem to affect the sale price very much, except where the heating system is fair. For those homes, the mean sale price seems markedly lower in the colder seasons.
How does this exploratory plot help us plan our analysis? Well, we see that the effect of heating quality on sale price seems to depend on the season the house is sold. This indicates a possible interaction effect. We'll use PROC GLM to first test only the main effects of Season_Sold and Heating_QC. Later, we'll incorporate the interaction suggested by our plot.
We're testing to see whether all means are equal for each predictor variable. In the Analysis of Variance table, the model degrees of freedom value is 6, because Season_Sold and Heating_QC each account for three degrees of freedom (the number of levels minus 1 for each variable). The statistically significant p-value indicates that not all means are equal for each predictor variable, but it doesn't indicate which means are significantly different. We can determine which means differ by looking at the table showing tests of the individual factors. In the Fit Statistics table, the R-square value, 0.171954, indicates that approximately 17% of the variability in SalePrice is explained by the two categorical predictors.
Next, we'll consider the Type I and Type III Model ANOVA tables. In the Type I table, each effect is tested sequentially and adjusted for all effects listed before it. In other words, the order of the effects matters, and the model specification determines that order. The test of Heating_QC is an unadjusted test, because no other terms appear above it, whereas the Season_Sold test adjusts for Heating_QC, which appears before it. The test for Season_Sold asks whether season can explain the leftover variation in SalePrice after heating quality has explained as much of the sale price variation as possible. Typically, only the Type III sums of squares are interpreted and reported for ANOVA. Type I sums of squares are more useful in, say, polynomial regression models, where we want to understand how higher-order terms sequentially benefit the model.
Unlike Type I sums of squares, the Type III values are not generally additive, and the values do not necessarily sum to the model sums of squares. In the Type III table, all listed effects are adjusted for all other effects in the table, so order is not important. The Type III sums of squares for a variable, also called the partial sums of squares, is the increase in the model sum of squares due to adding the variable to a model which already contains all the other variables.
Judging from the p-values in the Type III sums of squares table, there seems to be no significant differences across levels of Season_Sold, with a p-value of 0.1768, but there are significant differences across the Heating_QC variable, with a p-value less than .0001. That is, even after you control for the effects of Season_Sold, the heating quality variable still explains significant differences in SalePrice.
The interaction plot for SalePrice differs from the exploratory plot because PROC GLM imposes a main effects model on the data given our model specification. In other words, the effect of each variable is not permitted to differ at different levels of the other variable. That constraint can be relaxed by adding an interaction term, as you'll see in the next demonstration.
Before we move on, this interaction plot illustrates an important point. In this plot, it seems that there is no interaction between heating quality and season sold. These results look completely reasonable, and we might be tempted to stop here. But we know, based on the earlier SGPLOT graphic, that this model is not adequate. We've already seen that the effect of season changes with the level of heating quality. So remember, it's always a good idea to plot your data before you fit a model.
Recall that the LSMEANS statement requested a Tukey-adjusted analysis of the difference across all seasons. There are no significant differences in SalePrice means among the four levels of Season_Sold. The p-values range from 0.176 to 0.987. All are well above the typical significance threshold of 0.05.
Next, we could request comparisons of means for the different heating qualities, but the automatically generated interaction plot and the model that it represents don't match what we learned from our exploratory data analysis. The plot we created before generating the model showed that the effect of heating quality seems to change with the season. To allow for this in our model, we need to add and test an interaction effect. Let's take a moment to discuss interactions, and then come back and add an interaction effect to our model.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 Demo: Performing a Two-Way ANOVA Using PROC GLM (SAS Studio Task Version)
Before conducting an analysis of variance, you should explore the data.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the Summary Statistics task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Analysis variables role.
- Assign Season_Sold and Heating_QC to the Classification variables role.
- Select Season_Sold in the list of Classification variables, and click the Move column up icon to move it to the top of the list.
- On the OPTIONS tab, expand Basic Statistics and select only Mean and Standard Deviation.
- Expand Additional Statistics and select Variance.
- Run the code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc means data=STAT1.AMESHOUSING3 chartype mean std var vardef=df;
   var SalePrice;
   class Season_Sold Heating_QC;
run;
To further explore the numerous treatments, use the Line Chart task to examine the means graphically.
- Expand Graph and select the Line Chart task.
- Select the stat1.ameshousing3 table.
- Assign Season_Sold to the Category role and Heating_QC to the Subcategory role.
- From the Measure drop-down list, select Variable. In the Variable field, select SalePrice.
- Expand Statistics and select Mean.
- Run the code.
Generated Code
ods graphics / reset width=6.4in height=4.8in imagemap;

proc sgplot data=STAT1.AMESHOUSING3;
   vline Season_Sold / response=SalePrice group=Heating_QC stat=mean;
   yaxis grid;
run;

ods graphics / reset;

Note: To add markers to the chart for point values, edit a copy of the generated code and specify the MARKERS option in the VLINE statement as shown below.
vline Season_Sold / response=SalePrice group=Heating_QC stat=mean markers;
You can use the N-Way ANOVA task to discover the effects of both Season_Sold and Heating_QC.
- Expand Statistics and select the N-Way ANOVA task.
- Select the stat1.ameshousing3 table.
- Select SalePrice as the Dependent variable.
- Select Season_Sold and Heating_QC as Factors, in that order. Note: Order is important when selecting factors. The displayed order determines the generated code for the CLASS statement. If you add Heating_QC first and Season_Sold second, a different graph is produced. You can use the up and down arrows to change the order of variables in the Factors field.
- On the MODEL tab, click the Edit button to open the Model Effects Builder. Add Heating_QC and Season_Sold to Model Effects, in that order, and click OK. Note: Order is important when selecting the factors in the Model Builder. If you add Season_Sold first and Heating_QC second, a different report is produced.
- Run the code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc glm data=STAT1.AMESHOUSING3;
   class Season_Sold Heating_QC;
   model SalePrice=Heating_QC Season_Sold / ss1 ss3;
   lsmeans Heating_QC Season_Sold / adjust=tukey pdiff=all
                                    alpha=0.05 cl;
quit;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 Demo: Performing a Two-Way ANOVA With an Interaction Using PROC GLM
Filename: st103d02.sas
In our interaction plot of heating quality and season sold, we identified a possible interaction effect. Let's modify the two-way ANOVA model to include the interaction term Heating_QC crossed with Season_Sold and save the results in an item store. This will enable us to perform post-fitting analyses without refitting the model.

PROC GLM DATA=SAS-data-set <options>;
   CLASS variable(s);
   MODEL dependent-variable = independent-effects;
   LSMEANS effects < / options>;
   STORE <OUT=>item-store-name < / LABEL='label'>;
RUN;
PROC PLM RESTORE=item-store-specification <options>;
   SLICE model-effect < / options>;
   EFFECTPLOT <plot-type <(plot-definition-options)>> < / option(s)>;
RUN;
- Open program st103d02.sas.
- Submit the PROC GLM step in Part A.
- Review the output.

/*st103d02.sas*/  /*Part A*/
ods graphics on;

proc glm data=STAT1.ameshousing3 order=internal plots(only)=intplot;
   class Season_Sold Heating_QC;
   model SalePrice = Heating_QC Season_Sold Heating_QC*Season_Sold;
   lsmeans Heating_QC*Season_Sold / diff slice=Heating_QC;
   format Season_Sold Season.;
   store out=interact;
   title "Model with Heating Quality and Season as Interacting Predictors";
run;
quit;

/*st103d02.sas*/  /*Part B*/
proc plm restore=interact plots=all;
   slice Heating_QC*Season_Sold / sliceby=Heating_QC adjust=tukey;
   effectplot interaction(sliceby=Heating_QC) / clm;
run;

title;
In the PROC GLM step, we want only an interaction plot. In the MODEL statement, we added the interaction effect, Heating_QC crossed with Season_Sold. We used an asterisk to specify the interaction effect, but we could have used a vertical bar between the main effects to specify the full factorial representation. Recall that the DIFF option in the LSMEANS statement computes and compares least squares means of the model effects. Including the interaction term in the LSMEANS statement provides the least squares means of all 16 groups of the crossed factors. We added the SLICE= option to slice the interaction effect by the levels of heating quality. Each slice has one Heating_QC level and shows the Season_Sold effect within that slice. The STORE statement saves the results in an item store named interact so that we can perform further analysis after the model is fit. We used a one-level name, so the item store is written to the temporary library, work. Typically, you'd specify a two-level name for an item store so that it remains available after your current SAS session ends; see the sketch below.
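For example, a minimal sketch of a permanent item store; it assumes that the stat1 library is writable, which might not be the case in your course environment:

/* Sketch: a two-level item store name survives the current session.
   Assumes the stat1 library is writable. */
proc glm data=STAT1.ameshousing3 order=internal;
   class Season_Sold Heating_QC;
   model SalePrice = Heating_QC Season_Sold Heating_QC*Season_Sold;
   store out=stat1.interact / label='Two-way ANOVA with interaction';
run;
quit;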
The Overall ANOVA table shows that the degrees of freedom for the model is now 15. This includes three degrees of freedom for each main effect, and 3 times 3, or nine degrees of freedom for the interaction term. The overall model is statistically significant. In the Fit Statistics table, the R-square, 0.230634, tells us that this model explains approximately 23% of the variability in SalePrice. This is an improvement from the 17% that was explained by the model with only main effects.
What about the interaction term, Season_Sold*Heating_QC? Is it significant, or should it be removed? In both the Type I and Type III ANOVA output, the p-value, 0.0121, indicates that the interaction effect is statistically significant at the .05 alpha level. This means that the effect of Season_Sold differs at different levels of Heating_QC, and vice versa. Given the significance of the interaction, it should stay in the model. To maintain model hierarchy, all effects contained within significant interactions should also remain in the model, regardless of their p-value.
The interaction model reflects the data more accurately than the main effects model. In the previous main effects model, it seemed that Season_Sold wasn't related to SalePrice. Now we can see that it is related, but in a more complex way. Season_Sold is important, but only for some heating qualities, and we see that only through the interaction.
So, what does the significant interaction mean? Let's dissect the Season_Sold crossed with Heating_QC interaction in three ways. First, we'll look at our plot of the Season_Sold crossed with Heating_QC means, then we'll look at pairwise comparisons of least squares means, and finally we'll look at tests of simple effects using the SLICE= option. Let's look at the Interaction Plot for SalePrice, a line plot overlaid with all the observations in the data set. This plot shows that for most categories of heating quality, the season when the property was sold had little effect on the sale price. When the heating quality was fair, as shown by the red line, SalePrice was low in winter, increased until summer, and then decreased again in fall. Depending on the order of the predictor variables in the CLASS statement, the interaction plot can be changed so that each line is a season and heating quality is on the X axis. This provides a different view of the interaction, as the sketch below shows.
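A minimal sketch of that alternative view, not part of the course files, reversing the CLASS order from Part A:

/* Sketch: with Heating_QC listed first in the CLASS statement,
   heating quality appears on the X axis of the interaction plot
   and each season is drawn as a separate line. */
proc glm data=STAT1.ameshousing3 order=internal plots(only)=intplot;
   class Heating_QC Season_Sold;
   model SalePrice = Heating_QC Season_Sold Heating_QC*Season_Sold;
   format Season_Sold Season.;
run;
quit;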
The least squares means table, LSMeans, displays the mean sale price for every combination of Season_Sold and Heating_QC, and the Difference Matrix displays p-values for every comparison between the means. For example, if we want to test the null hypothesis that the mean sale price of homes sold in winter with fair heating is equal to the mean sale price of homes sold in summer with fair heating, we compare mean 2 to mean 10. The means are $58,100 and $128,800, respectively. The p-value for this comparison is 0.0046, which indicates a statistically significant difference. These tables were produced by the DIFF option in the LSMEANS statement.
We can also make sense of the interaction by looking at the tests of simple effects that were requested through the SLICE= option. These tests compare the means for one factor at a particular level of the other factor. Let's focus on the slice analysis of this model, as shown in the sliced ANOVA table. The displayed tests are of Season_Sold within each slice, or level, of Heating_QC. The first p-value tests the homogeneity of means within the Heating_QC group excellent, across all levels of Season_Sold. This p-value shows that there is no significant difference in the sale price means across Season_Sold when heating quality is excellent.
There is a statistically significant Season_Sold effect for houses with fair and good heating systems, but not for typical/average systems. This table supports our finding from visually interpreting the interaction plot. That is, the sale price of homes might be affected by the interaction of heating quality and the season a house is sold. The note below the table reminds you that these p-values are not adjusted for multiple tests. Later we'll use the item store created in this step to adjust for multiple comparison tests.
Let's check the log to verify that the item store was created. We see that the results were saved in the temporary item store, work.interact.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 Demo: Performing a Two-Way ANOVA With an Interaction Using PROC GLM (SAS Studio Task Version)
Perform a two-way ANOVA of SalePrice with Heating_QC and Season_Sold as predictor variables. Include the interaction between the two explanatory variables.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the N-Way ANOVA task.
- Select the stat1.ameshousing3 table.
- Select SalePrice as the Dependent variable.
- Select Season_Sold and Heating_QC (in that order) as Factors. You can use the up and down arrows to change the order of variables in the Factors field.
- On the MODEL tab, click the Edit button to open the Model Effects Builder.
- Select Heating_QC and use the Add button to add it to Model Effects. Then add Season_Sold to Model Effects. Add the interaction term Heating_QC*Season_Sold by selecting both variables in the Variables list and clicking the Cross button. Alternatively, to add all three terms together, select both variables in the Variables list, and click the Full Factorial button. Heating_QC, Season_Sold, and Heating_QC*Season_Sold will be added to Model Effects.
- Click OK to close the Model Effects Builder.
- On the OPTIONS tab under STATISTICS, in the Select statistics to display drop-down list, select Default and additional statistics, and then clear the Perform multiple comparisons check box.
- Modify the generated code to slice the interaction by Heating_QC and to create an item store.
- Click the Edit button in the CODE window to open an editable copy of the code.
- Add an LSMEANS statement and a STORE statement as shown in the modified code below.
- Run the code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc glm data=STAT1.AMESHOUSING3;
   class Heating_QC Season_Sold;
   model SalePrice=Heating_QC Season_Sold Heating_QC*Season_Sold / ss1 ss3;
   lsmeans Heating_QC*Season_Sold / diff slice=Heating_QC;
   store out=interact;
quit;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 1 Demo: Performing Post-Processing Analysis Using PROC PLM (SAS Studio Task Version)
In SAS Studio 3.7, there is currently no task to generate PROC PLM code. Submit the code in Part B of the st103d02.sas file.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 2 Demo: Fitting a Multiple Linear Regression Model Using PROC REG
Filename: st103d03.sas
In this demonstration, we use PROC REG to run a linear regression model with two predictor variables. Then we use PROC GLM to fit the same model again to show a few additional plots that are not available in PROC REG. We'll save the results of our analyses in an item store, and then use PROC PLM to perform additional analysis.

PROC REG DATA=SAS-data-set <options>;
   MODEL dependent-variables = regressors < / options>;
RUN;

PROC GLM DATA=SAS-data-set <options>;
   MODEL dependent-variables = independent-effects;
   STORE <OUT=>item-store-name < / LABEL='label'>;
RUN;

PROC PLM RESTORE=item-store-specification <options>;
   EFFECTPLOT <plot-type <(plot-definition-options)>> < / option(s)>;
RUN;
- Open program st103d03.sas.
/*st103d03.sas*/  /*Part A*/
ods graphics on;

proc reg data=STAT1.ameshousing3;
   model SalePrice=Basement_Area Lot_Area;
   title "Model with Basement Area and Lot Area";
run;
quit;

/*st103d03.sas*/  /*Part B*/
proc glm data=STAT1.ameshousing3 plots(only)=(contourfit);
   model SalePrice=Basement_Area Lot_Area;
   store out=multiple;
   title "Model with Basement Area and Lot Area";
run;
quit;

/*st103d03.sas*/  /*Part C*/
proc plm restore=multiple plots=all;
   effectplot contour (y=Basement_Area x=Lot_Area);
   effectplot slicefit(x=Lot_Area sliceby=Basement_Area=250 to 1000 by 250);
run;

title;
In the PROC REG step, the MODEL statement specifies SalePrice as the response variable, and Basement_Area and Lot_Area as predictors. - Submit the PROC REG step in Part A.
- Review the output.
The Analysis of Variance table shows that this model is statistically significant at the 0.05 alpha level.
In the Fit Statistics table, the R-square of 0.4802 indicates that 48% of the variability in SalePrice can be explained by Basement_Area and Lot_Area together. Recall from a previous model that Lot_Area alone explained only 6.42%. Is the R-square higher because the new model is better, or simply because the model has more predictors? To find out, compare the adjusted R-square values.
The simpler model with only Lot_Area had an adjusted R-square of 0.061. The adjusted R-square for the multiple regression is higher, at 0.4767. The higher adjusted R-square indicates that adding Basement_Area improved the model enough to warrant the additional model complexity.
Let's look at the Parameter Estimates tables. Our earlier analysis showed that the correlation between Lot_Area and SalePrice was statistically significant. With Basement_Area added to the model, the Lot_Area estimate is notably different than it was in the simple linear regression model (2.87 in the simple regression model and 0.80 in this model), and its p-value no longer shows statistical significance.
The reason is that in the two-predictor model, the parameter estimate for each predictor variable is adjusted for the presence of the other variable in the model. Basement_Area is a significant predictor of SalePrice even after controlling for Lot_Area. But Lot_Area is not a significant predictor of SalePrice after controlling for Basement_Area. This means that Lot_Area and Basement_Area are correlated, and Lot_Area does not explain significant variation in SalePrice over and above Basement_Area. So, when the model accounts for the effect of Basement_Area, the effect of Lot_Area no longer shows statistical significance.
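A quick way to see that correlation directly, not shown in the demo:

/* Quick check: the correlation between the two predictors helps
   explain why Lot_Area loses significance once Basement_Area is
   in the model. */
proc corr data=STAT1.ameshousing3;
   var Basement_Area Lot_Area;
run;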
If these were our only predictors, we'd consider removing Lot_Area from the model, but we might decide to add other predictors instead. The additional predictors might change the p-values for Lot_Area and Basement_Area, so it's best to wait to see the full model before discarding non-significant terms.
Now, let's use the Fit Diagnostics graphical output to verify our statistical assumptions. The residuals plotted against predicted values give us a relatively random scatter around 0. They provide evidence that we have constant variance.
In the Q-Q plot, the residuals fall along the diagonal line, and they look approximately normal in the histogram. This indicates that there are no problems with an assumption of normally distributed error.
Next, in the Residual Plots, we see the residuals plotted against the predictor variables. Patterns in these plots are indications of an inadequate model. The residuals show no pattern, although lot size does show a few outliers. - Let's go back to the code. In the second step, we'll run the same model in PROC GLM, requesting a contour plot and an item store named multiple.
- Submit the PROC GLM step in Part B.
- Review the output.
In the ANOVA results, we see that the values in the Fit Statistics table are the same as in PROC REG. PROC GLM doesn't report an adjusted R-square value.
The Solution, or parameter estimates table gives the same results (within rounding error) as in PROC REG.
We can use this Contour Fit Plot with the overlaid scatter plot to see how well the model predicts observed values. The plot shows predicted values of SalePrice as gradations of the background color from blue, representing low values, to red, representing high values. The dots are similarly colored, and represent the actual data. Observations that are perfectly fit would show the same color within the circle as outside the circle. The lines on the graph help you read the actual predictions at even intervals.
For example, this point near the upper right represents an observation with a basement area of approximately 1,500 square feet, a lot size of approximately 17,000 square feet, and a predicted value of more than $180,000 for sale price. However, the dot's color shows that its observed sale price is actually closer to $160,000. - Let's go back to the code. In the last step, we use PROC PLM to process the item store created by PROC GLM and create additional plots. The EFFECTPLOT statement produces a display of the fitted model and provides options for changing and enhancing the displays. The EFFECTPLOT option, CONTOUR, requests a contour plot of predicted values against two continuous predictors. We want Basement_Area plotted on the Y axis, and Lot_Area on the X axis.
The SLICEFIT option displays a curve of predicted values versus a continuous variable, grouped by the levels of another effect. We want to see the Lot_Area effect at values of Basement_Area ranging from 250 to 1,000 in increments of 250. - Submit the PROC PLM step in Part C.
- Review the output.
Notice that the lines in the Contour Fit Plot are oriented differently than the plot from PROC GLM. The item store doesn't contain the original data, so PROC PLM can show only the predicted values, not the individual observed values. Clearly, the PROC GLM contour fit plot is more useful, but if you don't have access to the original data and can run PROC PLM on the item store, this plot gives you an idea of the relationship between the predictor variables and the predicted values.
The last plot, a Sliced Fit Plot, is another way to display the results of a two-predictor regression model. This plot displays SalePrice by Lot_Area, categorized by Basement_Area. The regression lines represent the slices of Basement_Area that we specified in the code. As you can see, you have several options for visualizing and communicating the results of your analyses.
Okay. We've created a multiple regression model with two predictors, so what's next? We can test the remainder of our predictors in a larger multiple regression model. With 11 predictors, there are many possible models that we could explore. As we've seen, the significance and parameter estimates for each predictor can change depending on which other predictors are included in the model.
So how do we decide which model is best to go forward with? Ultimately, it will be decided by our specific research goal and our subject-matter knowledge. There are tools that we can use to limit the possible models to a manageable number of candidates. In the next lesson, we'll see some commonly used approaches to model selection.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 03, Section 2 Demo: Fitting a Multiple Linear Regression Model Using PROC REG (SAS Studio Task Version)
Perform a linear regression of SalePrice with Lot_Area and Basement_Area as predictor variables.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice as the Dependent variable, and Basement_Area and Lot_Area as the Continuous variables.
- On the MODEL tab, click the Edit button to open the Model Effects Builder. Add Basement_Area and Lot_Area to Model Effects, and click OK to close the Model Effects Builder.
- On the OPTIONS tab, expand Scatter Plots and clear the Observed values by predicted values check box.
- Run the code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc reg data=STAT1.AMESHOUSING3 alpha=0.05
         plots(only)=(diagnostics residuals);
   model SalePrice=Basement_Area Lot_Area /;
run;
quit;
Note: Additional plots can be obtained when you submit the code below. It is available in the st103d03.sas file.
/*st103d03.sas*/  /*Part B*/
proc glm data=STAT1.ameshousing3 plots(only)=(contourfit);
   model SalePrice=Basement_Area Lot_Area;
   store out=multiple;
   title "Model with Basement Area and Lot Area";
run;

/*st103d03.sas*/  /*Part C*/
proc plm restore=multiple plots=all;
   effectplot contour (y=Basement_Area x=Lot_Area);
   effectplot slicefit(x=Lot_Area sliceby=Basement_Area=250 to 1000 by 250);
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 1 Demo: Performing Stepwise Regression Using PROC GLMSELECT (SAS Studio Task Version)
Use the Linear Regression task to select a model for predicting SalePrice in the ameshousing3 data set by using the STEPWISE selection method. Use 0.05 as the significance level for entry into and staying in the model.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign the interval variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) to the Continuous variables role.
- On the MODEL tab, use the Model Effect Builder to specify the appropriate model. Click the Edit this model icon, select all variables, and click Add under Single Effects. Then click OK.
- On the OPTIONS tab, clear the check boxes for all diagnostic plots, residual plots, and scatter plots.
- On the SELECTION tab, use the Selection method drop-down list to choose Stepwise selection.
- For the Add/remove effects with value, choose Significance level.
- Expand the DETAILS property and select Details for each step from the drop-down menu.
- To obtain detailed graphical output, modify the generated code. Click the Edit SAS code icon on the CODE tab and change plots=(criterionpanel) to plots=all.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               outdesign(addinputvars)=Work.reg_design plots=(all);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /
         showpvalues selection=stepwise (slentry=0.05 slstay=0.05
         select=sl) details=steps;
run;

proc delete data=Work.reg_design;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 2 Demo: Performing Model Selection Using PROC GLMSELECT
Filename: st104d02.sas
In this demonstration, we use four PROC GLMSELECT steps on the response variable SalePrice, regressing on eight predictor variables in the data set ameshousing3.

PROC GLMSELECT DATA=SAS-data-set <options>;
   <label:> MODEL dependent = regressors < / options>;
RUN;
- Open program st104d02.sas.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st104d02.sas*/
ods graphics on;

proc glmselect data=STAT1.ameshousing3 plots=all;
   STEPWISEAIC: model SalePrice = &interval
                / selection=stepwise details=steps select=AIC;
   title "Stepwise Model Selection for SalePrice - AIC";
run;

proc glmselect data=STAT1.ameshousing3 plots=all;
   STEPWISEBIC: model SalePrice = &interval
                / selection=stepwise details=steps select=BIC;
   title "Stepwise Model Selection for SalePrice - BIC";
run;

proc glmselect data=STAT1.ameshousing3 plots=all;
   STEPWISEAICC: model SalePrice = &interval
                 / selection=stepwise details=steps select=AICC;
   title "Stepwise Model Selection for SalePrice - AICC";
run;

proc glmselect data=STAT1.ameshousing3 plots=all;
   STEPWISESBC: model SalePrice = &interval
                / selection=stepwise details=steps select=SBC;
   title "Stepwise Model Selection for SalePrice - SBC";
run;
In each step, we request all default plots. We use STEPWISE as the selection method in the SELECTION= option, and include DETAILS=steps to obtain step information and the selection summary table. For each run, we specify a different selection criterion and use the SELECT= option: AIC, BIC, AICC, and SBC. Notice the corresponding labels in the MODEL statements. Again, this helps quickly identify what each PROC step is requesting. - Submit the code to compare the selected models.
- Review the output.
The first part of the output is from the run that used AIC as the selection criterion. In Step 0, the intercept-only model, the AIC value was approximately 6624. Recall that with information criteria, a smaller value is better. So in Step 1, Basement_Area is added, because it's the variable whose addition will most improve, or reduce, the AIC. The selection process continues to add or remove variables, making the AIC smaller each time.
Now let's take a look at the summary table. We see the AIC value at each step, and the AIC for the final model is approximately 6141. It's interesting to see that, in addition to the intercept, all eight of the predictor variables were added into the model. Of course, this won't happen in every situation. The selection process stopped because all effects are in the final model and no variable could be removed. The AIC component of the Coefficient Panel shows larger improvements to the AIC across Steps 1 through 3 and moderate improvements across Steps 4 and 5. After about the fifth step, the AIC has roughly leveled off, and is showing small improvements at Steps 6 through 8. You can interpret this plot to mean that the last several variables added little improvement to the AIC, but they did add to the complexity of the model. So, in addition to the model with all eight variables, you might decide to consider some simpler models, for example, those at Steps 5, 6, and 7, as possible candidates.
Next is the Criterion Panel. For AIC, AICC, and the adjusted R-square, the model with all eight variables is best. However, for SBC, the model at the sixth step is the best.
We'll move on to the output from the second PROC GLMSELECT run, the model selection process that used BIC as the selection criterion. The Selection Summary table shows that, again, the final model is the one with all eight variables. Notice that the final BIC value, 5841, is different from the final AIC value that we saw earlier. That's because each information criterion uses a different calculation for the penalty.
In the Coefficient Panel from this run, the Coefficient Progression plot is the same as the one for AIC, because all eight variables were added in the same order. Again, after the fifth step, there are small improvements in model fit.
Because this PROC GLMSELECT run uses BIC as the selection criterion, BIC is added to the plots in the criterion panel.
Now let's look at the results for the third PROC GLMSELECT run, the model selection process that used AICC as the selection criterion. Again, PROC GLMSELECT chooses the same eight-variable model. Finally, we'll look at the results for the last PROC GLMSELECT run, which uses the selection criterion SBC. The Selection Summary table shows us that the selected model includes only six variables. At this point, the next candidate for entry is Lot_Area, but adding it wouldn't improve the SBC. The next candidate for removal is Bedroom_AbvGr, but removing it also wouldn't improve the SBC. Selection stopped at a local minimum of the SBC criterion.
The Coefficient Panel shows similarities to those seen earlier. Larger improvements in SBC can be seen across Steps 1 through 3, but smaller improvements occur over Steps 4 through 6. Like previous images, the standardized coefficients seem to stabilize after Step 4.
The Criterion Panel shows that, accounting only for the models that were viewed, the optimal fit statistics were obtained at Step 6.
Let's recap some of the model-building strategies that we've used on SalePrice. Using AIC, BIC, or AICC as the selection criterion yields a model with all eight effects. Using SBC yields a model with only six effects. And recall that when we used the significance level with stepwise, backward, and forward selection, specifying SLENTRY=0.05 and SLSTAY=0.05, all three methods selected the same model containing seven effects.
By using multiple model-building strategies, you can generate a list of candidate models. So, how do you choose among the models? One option is to use a holdout data set to perform honest assessment on the models. Another option is to consult a subject-matter expert. Sometimes models with different sets of predictors than the one produced by an automated selection method can be more useful to you. For example, particular predictors might be added because they are important based on theoretical grounds. Some predictors might be included so that you can compare your work to previously published research. Other variables might be included specifically to control for effects, even if they were excluded by the model selection method. The cost of collecting data might help determine which model to go forward with. For example, if a five-predictor model explains as much variation as an eight-predictor model, but it requires much less time and money to collect the data, then it makes sense to use the simpler model. Finally, if a model doesn't meet statistical assumptions, it might need to be modified or excluded from the set of candidates.
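For instance, honest assessment with a holdout can be requested directly in PROC GLMSELECT through the PARTITION statement. A minimal sketch, not part of the course files; the SEED= value and the 30% validation fraction are arbitrary choices, and the interval macro variable from the earlier demo is assumed:

/* Sketch: reserve a random validation holdout and choose the model
   with the best validation performance. */
proc glmselect data=STAT1.ameshousing3 seed=27513;
   partition fraction(validate=0.3);
   model SalePrice = &interval /
         selection=stepwise(select=sbc choose=validate) details=steps;
run;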
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 04, Section 2 Demo: Performing Model Selection Using PROC GLMSELECT (SAS Studio Task Version)
Use the Linear Regression task to select a model for predicting SalePrice in the ameshousing3 data set by using the STEPWISE selection method. First, choose AIC as the criterion to add/remove effects, and then rerun the task three times to use BIC, AICC, and SBC, respectively.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign the interval variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) to the Continuous variables role.
- On the MODEL tab, use the Model Effect Builder to specify the appropriate model. Click the Edit this model icon, select all variables, and click Add. Then click OK.
- On the OPTIONS tab, clear the check boxes for all diagnostic plots, residual plots, and scatter plots.
- On the SELECTION tab, use the Selection method drop-down list to choose Stepwise selection.
- For the Add/remove effects with value, choose Akaike's information criterion for AIC as the criterion.
- Expand SELECTION PLOTS and select the check box to display Coefficient plots, in addition to the already selected Criteria plots.
- Expand the DETAILS property and select Details for each step from the drop-down list.
- Click Run.
Generated Code for AIC
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               outdesign(addinputvars)=Work.reg_design
               plots=(criterionpanel coefficientpanel);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /
         showpvalues selection=stepwise (select=aic) details=steps;
run;

proc delete data=Work.reg_design;
run;
Rerun the task and modify the information criterion. Choose Sawa Bayesian information criterion for BIC.
Generated Code for BIC
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               outdesign(addinputvars)=Work.reg_design
               plots=(criterionpanel coefficientpanel);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /
         showpvalues selection=stepwise (select=bic) details=steps;
run;

proc delete data=Work.reg_design;
run;
Rerun the task and modify the information criterion. Choose Akaike's information criterion corrected for small-sample bias for AICC.
Generated Code for AICC
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               outdesign(addinputvars)=Work.reg_design
               plots=(criterionpanel coefficientpanel);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /
         showpvalues selection=stepwise (select=aicc) details=steps;
run;

proc delete data=Work.reg_design;
run;
Rerun the task and modify the information criterion. Choose Schwarz Bayesian information criterion for SBC.
Generated Code for SBC
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               outdesign(addinputvars)=Work.reg_design
               plots=(criterionpanel coefficientpanel);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /
         showpvalues selection=stepwise (select=sbc) details=steps;
run;

proc delete data=Work.reg_design;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 1 Demo: Examining Residual Plots Using PROC REG
Filename: st105d01.sas
In this demonstration, we use PROC REG to create residual plots and other diagnostic plots. We use these plots to check our model assumptions and to check for outliers. First, to assess our model overall, we'll produce the eight default plots for fit diagnostics.

PROC REG DATA=SAS-data-set <options>;
   MODEL dependents = <regressors> < / options>;
RUN;
- Open program st105d01.sas.
- Submit this step.
- Review the output.
- Let's go back to the code. We want to run PROC REG again, but request only specific plots. In Part B, we've added the PLOTS(ONLY)= option and requested QQ to assess the normality of the residual error, RESIDUALBYPREDICTED to request a plot of residuals by predicted values, and RESIDUALS to request a panel of plots of residuals by the predictor variables in the model. The rest of the program is the same. This produces separate full-size plots for the QQ and residual-by-predicted plots. If we wanted the graphs in a panel as separate full-sized plots, we could add the UNPACK suboption, for example, DIAGNOSTICS(UNPACK) for the fit diagnostics panel; a sketch appears at the end of this demo.
- Submit the PROC REG step in Part B.
- Review the output.
The Diagnostic Plots section contains the full-sized versions of the plots that we just saw. Full-size plots are easier to copy and paste into documents and presentations, if needed.
Consider the Q-Q plot. If the residuals are normally distributed, the plot should appear to be a straight, diagonal line. This plot shows little deviation from the expected pattern. Thus, you can conclude that the residuals do not significantly violate the normality assumption. If the residuals did violate the normality assumption, then a transformation of the response variable or a different model might be warranted.

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st105d01.sas*/  /*Part A*/
ods graphics on;

proc reg data=STAT1.ameshousing3;
   CONTINUOUS: model SalePrice = &interval;
   title 'SalePrice Model - Plots of Diagnostic Statistics';
run;
quit;

/*st105d01.sas*/  /*Part B*/
proc reg data=STAT1.ameshousing3
         plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
   CONTINUOUS: model SalePrice = &interval;
   title 'SalePrice Model - Plots of Diagnostic Statistics';
run;
quit;
In Part A, the PROC REG statement specifies the data set ameshousing3, and the MODEL statement specifies SalePrice as the response variable and all the variables in the interval macro variable as the predictor variables. We've added an optional label of CONTINUOUS to the MODEL statement to label the output. Notice that the label must be followed by a colon.
Scroll to the Diagnostic Plots. In the Diagnostic Panel, the first plot is the plot of residuals versus predicted values. Looking at this plot, we're able to verify the equal variance assumption. We can also verify the independence assumption and check the adequacy of the model. Remember that we want to see a random scatter, with no patterns of our residuals above and below the 0 reference line. And the plot shows just that. We can conclude that the errors have constant variance. There's also no indication of correlated residuals, so we've met the independence assumption as well.
The Residuals versus Quantile plot is a normal quantile plot of the residuals. Using this plot, we can verify that the errors are normally distributed. The residuals follow the normal reference line pretty closely.
In the lower left corner, a histogram shows the normality of the residuals. Notice that a normal density curve is overlaid on the residual histogram to help detect departures from normality. Considering both the QQ plot and the histogram, we can conclude that the errors are normally distributed.
The plot of SalePrice versus Predicted Values of SalePrice shows data points spread along the 45-degree reference line, which indicates good model fit. There's a reasonably close match between the actual values and the predictions based on this model.
The last plot in the Diagnostic Panel is called a residual-fit, or RF, plot. It consists of side-by-side quantile plots of the centered fit and the residuals. The Fit (minus) Mean plot on the left shows the predicted, or fitted, values minus the overall mean. Check whether the vertical spread of the residuals in the plot on the right is greater than the vertical spread of the centered fit in the plot on the left. Here, the spread of the residuals seems less than the spread of the centered fit, so the model is fine. In other words, after accounting for the predictors in the model, relatively little residual variation remains.
The three remaining plots in this panel can be used to diagnose possible outliers. We'll discuss Rstudent residuals and Cook's D in the subsequent section.
In the Residual Plots output, the first panel includes plots of the residuals versus each of the interval predictor variables. They show no obvious trends or patterns in the residuals. Recall that independence of the residual errors (no trends) is an assumption for linear regression, as is constant variance across all levels of the predictor variables and across all levels of the predicted values. None of the variables contributes to possible violations of the assumptions.
Notice that when you visually inspect residual plots, the distinction of whether a pattern exists is a matter of discretion. If there's any question about the presence of a pattern, you should further investigate for possible causes of the pattern.
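As mentioned in Part B, a plot panel can be unpacked into separate full-sized graphs. A minimal sketch for the fit diagnostics panel, assuming the interval macro variable from Part A has been defined:

/* Sketch: DIAGNOSTICS(UNPACK) produces each plot in the fit
   diagnostics panel as a separate, full-sized graph. */
proc reg data=STAT1.ameshousing3 plots(only)=(diagnostics(unpack));
   CONTINUOUS: model SalePrice = &interval;
   title 'SalePrice Model - Unpacked Fit Diagnostics';
run;
quit;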
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 1 Demo: Examining Residual Plots Using PROC REG (SAS Studio Task Version)
Use the Linear Regression task to create residual plots and other diagnostic plots. Use these plots to check your model assumptions and to check for outliers.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign the interval variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) to the Continuous variables role.
- On the MODEL tab, use the Model Effect Builder to specify the appropriate model. Click the Edit this model icon, select all variables, and click Add. Then click OK.
- On the OPTIONS tab, expand Scatter Plots and clear the check box to display a scatter plot of Observed values by predicted values.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc reg data=STAT1.AMESHOUSING3 alpha=0.05
         plots(only)=(diagnostics residuals);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /;
run;
quit;

SAS Studio doesn't offer all available procedure plots when you use tasks. If you'd like to display other plots, you must modify the code and specify each plot by using the appropriate plot option. The code below produces the quantile-quantile plot, the residuals versus predicted values plot, and the residuals versus regressor values plots. Individual plots are produced full sized.
/*st105d01.sas*/  /*Part B*/
proc reg data=STAT1.ameshousing3
         plots(only)=(QQ RESIDUALBYPREDICTED RESIDUALS);
   CONTINUOUS: model SalePrice = &interval;
   title 'SalePrice Model - Plots of Diagnostic Statistics';
run;
quit;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 2 Demo: Looking for Influential Observations Using PROC GLMSELECT and PROC REG
Filename: st105d02.sas
In this demonstration, we look for influential observations in the ameshousing3 data set. First, we select a model by stepwise selection using PROC GLMSELECT. We then use PROC REG to generate influence statistics and plots for the selected model, and we save the plot data to temporary output data sets. We'll reference these data sets in the second part of the demonstration. All of the code for these demonstrations must be run in the same SAS session.

PROC GLMSELECT DATA=SAS-data-set <options>;
   CLASS variable(s);
   <label:> MODEL dependent = <effects> < / options>;
RUN;

PROC REG DATA=SAS-data-set <options>;
   MODEL dependents = <regressors> < / options>;
RUN;
- Open program st105d02.sas.
- Submit Part A, beginning with the %LET statement, up to and including the ODS SELECT ALL statement.
- Check the SAS log.
The log shows that the step processed. PROC GLMSELECT automatically saves the list of the chosen model effects in the _GLSIND macro variable. We could see the values of this macro variable in the log by submitting %put &_glsind;, but we'll see the model effects in the PROC REG output. - Let's look at our PROC REG step in Part A. The PLOTS(ONLY LABEL)= option generates only the specified plots and labels extreme observations in the plots. If we were to include an ID statement, SAS would use the value of the ID variable as the label; a hypothetical sketch follows. In this case, the extreme observations will be labeled with their observation numbers. In our MODEL statement, we specify SalePrice as the response variable, and for the predictor variables, we reference the macro variable to specify the list of effects. Notice that we've included a label, SigLimit, which is short for Significance Limit, in front of the MODEL statement. Later, we'll output this model into new data sets, which will include this SigLimit model label. Above the PROC REG step, we include an ODS OUTPUT statement. This statement, along with the PLOTS= option, writes the data from the influence plots into separate output data sets. Notice that some of the plot objects that we reference here have slightly different names than the ones that we use in PROC REG. For example, to reference the data that creates the COOKSD plot, we reference COOKSDPLOT rather than just COOKSD.
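A hypothetical sketch; the identifier variable name PID is invented, so ameshousing3 would need such a variable for this to run:

/* Hypothetical sketch: an ID statement labels extreme points with
   the ID variable's values instead of observation numbers.
   PID is an assumed variable name, not part of ameshousing3. */
proc reg data=STAT1.ameshousing3
         plots(only label)=(RSTUDENTBYPREDICTED COOKSD);
   id PID;
   SigLimit: model SalePrice = &_GLSIND;
run;
quit;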
- Submit the rest of the Part A code, beginning with the ODS GRAPHICS ON statement.
- Review the output.
Here's the final model that was selected by PROC GLMSELECT. This is the same model as the previous stepwise selection demonstration in Lesson 4. Here, we'll focus on only the influence statistics.
In the Diagnostic Plots, the R Student by Predicted plot shows 16 observations beyond two standard errors from the mean of 0, and they're identified with their observation numbers. Remember that RStudent residuals are assumed to be normally distributed and therefore, you expect approximately 5% of values to be beyond two standard errors from the mean. The fact that you have 16 beyond two standard errors is no cause for concern, because 5% of 300 is 15 expected observations. Observation 123 is the largest outlier, and it's well separated from the other points. We might want to recheck this observation.
Next is the Cook's D plot, which is a needle plot. Cook's D is a measure of the simultaneous change in all parameter estimates when an observation is deleted. The horizontal line shows the Cook's D cutoff boundary. The plot labels the 21 influential points that are above the cutoff.
Let's look at the DFFITS plot. Recall that DFFITS measures the impact that an observation has on the predicted value. This plot flags several observations as influential points based on DFFITS. At this point, it might be helpful to see which parameters these observations might influence the most using DFBETAS information. The DFBETAS plot is a panel plot, which contains one plot for each parameter. In this case, SAS created two panels. Each plot labels the points that are potentially influencing the parameter that's associated with each of the predictor variables.
Detection of outliers or influential observations with plots is convenient for relatively small data sets, but for larger data sets, it can be difficult to discern one observation from another. One method for extracting only the influential observations is to write the output of the ODS plots into data sets and then subset the influential observations. We'll do this in the next demonstration.

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st105d02.sas*/  /*Part A*/
ods select none;
proc glmselect data=STAT1.ameshousing3 plots=all;
   STEPWISE: model SalePrice = &interval /
             selection=stepwise details=steps select=SL
             slentry=0.05 slstay=0.05;
   title "Stepwise Model Selection for SalePrice - SL 0.05";
run;
quit;
ods select all;

ods graphics on;
ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook
           DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
proc reg data=STAT1.ameshousing3
         plots(only label)=(RSTUDENTBYPREDICTED COOKSD DFFITS DFBETAS);
   SigLimit: model SalePrice = &_GLSIND;
   title 'SigLimit Model - Plots of Diagnostic Statistics';
run;
quit;

/*st105d02.sas*/  /*Part B*/
title;
proc print data=Rstud;
run;
proc print data=Cook;
run;
proc print data=Dffits;
run;
proc print data=Dfbs;
run;

data Dfbs01;
   set Dfbs (obs=300);
run;
data Dfbs02;
   set Dfbs (firstobs=301);
run;
data Dfbs2;
   update Dfbs01 Dfbs02;
   by Observation;
run;

data influential;
   /* Merge the data sets from above */
   merge Rstud Cook Dffits Dfbs2;
   by observation;

   /* Flag observations that have exceeded at least one cutpoint */
   if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;
   array dfbetas{*} _dfbetasout: ;
   do i=2 to dim(dfbetas);
      if dfbetas{i} then flag=1;
   end;

   /* Set influence statistics to missing for observations
      that have not exceeded the cutpoints */
   if ABS(Rstudent)<=3 then RStudent=.;
   if Cooksdlabel eq ' ' then CooksD=.;

   /* Subset only observations that have been flagged */
   if flag=1;
   drop i flag;
run;

title;
proc print data=influential;
   id observation;
   var Rstudent CooksD Dffitsout _dfbetasout:;
run;
In Part A, the PROC GLMSELECT step uses the stepwise selection method to automatically select a model. The specified selection criterion is significance level, and both the entry and stay criteria are 0.05.
In addition to the output that this step produces, it also automatically creates the macro variable _GLSIND, which stores a list of effects selected by PROC GLMSELECT. You can then reference the list as &_GLSIND in subsequent statements.
In this case, we want to create the list of effects in the _GLSIND macro variable, but we don't need to see the PROC GLMSELECT output. So, before the step, we add the statement ODS SELECT NONE, which suppresses the output, and we add ODS SELECT ALL at the end of the step to make sure that we get the output from the next step we run.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 2 Demo: Looking for Influential Observations Using PROC GLMSELECT and PROC REG (SAS Studio Task Version)
Use the Linear Regression task to look for influential observations in the ameshousing3 data set.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign the interval variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) to the Continuous variables role.
- On the MODEL tab, use the Model Effects Builder to specify the appropriate model. Click the Edit this model icon, select all variables, and click Add. Click OK.
- On the OPTIONS tab, expand Diagnostic and Residual Plots and clear the check boxes for Diagnostic plots and Residuals for each explanatory variable.
- Expand More Diagnostics Plots and select all four check boxes. This will display diagnostic plots with labels for influential observations.
- Expand Scatter Plots and clear the check box for Observed values by predicted values.
- On the SELECTION tab, use the Selection method drop-down list and choose Stepwise selection.
- For the Add/remove effects with value, choose Significance level.
- On the CODE tab, click the Edit SAS code icon.
- In the PROC REG step, enter cooksd within the parentheses where the plots are listed.
- Add COOKSDPLOT to the list in the ODS SELECT statement.
- Add the following code after the ODS SELECT statement to write the data from the influence plots into data sets:
ods output RStudentByPredicted=Rstud COOKSDPLOT=Cook DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               outdesign(addinputvars)=Work.reg_design
               plots=(criterionpanel);
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold /
         showpvalues selection=stepwise (slentry=0.05 slstay=0.05 select=sl);
run;

proc reg data=Work.reg_design alpha=0.05
         plots(only label)=(rstudentbypredicted cooksd dffits dfbetas);
   ods select RStudentByPredicted DFFITSPlot DFBETASPanel COOKSDPLOT;
   ods output RStudentByPredicted=Rstud COOKSDPLOT=Cook
              DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
   model SalePrice=&_GLSMOD /;
run;
quit;

proc delete data=Work.reg_design;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 2 Demo: Examining the Influential Observations Using PROC PRINT
Filename: st105d02.sas
In this demonstration, we use PROC PRINT to take a look at the output data sets that we created with the ODS OUTPUT statement and PROC REG in the previous demonstration. We then combine them into a single data set with only the observations that exceed the suggested cutoffs of the influence statistics. This part of the demonstration uses many programming concepts, including MERGE statements, arrays, and DO loops, which you can learn about in SAS programming courses.
- Open program st105d02.sas.
- Submit the four PROC PRINT steps in Part B.
- Review the output.
In the first PRINT Procedure output, the Rstud data set, notice the column for the SigLimit label that we gave our model in the first part of the demonstration. In the outLevLabel column, the observations that have a value are the ones that were flagged as influential based on the RStudent cutoff. Notice that all RStudent influence statistic values are in the RStudent column.
The next PRINT Procedure output shows the Cook data set. The variable CooksDLabel identifies observations that are deemed influential due to high Cook's D values. These are the observations that have an influence on all the estimated parameters as a group.
Now let's look at the PRINT Procedure output for the Dffits data set. The variable DFFITSOUT identifies observations that are deemed influential due to high DFFITS values. These are the observations that have an influence on the predictions.
Finally, in the PRINT Procedure output for the Dfbs data set, the variables _DFBETASOUT1 through _DFBETASOUT8 identify the observations whose DFBETA values exceed the threshold for influence. _DFBETASOUT1 represents the value for the intercept, and the other seven variables show influential outliers on each of the predictor variables in the MODEL statement. The order of the predictor variables is based on the order of the variables that are listed in the MODEL statement (or in this case, in the _GLSIND macro variable).
Think back to the DFBETAS panel plot from Part A, where we saw that the order of the variables in _GLSIND is Above Ground Living Area, Basement_Area, and so on. There were too many predictor variables to fit into one panel, so SAS produced a second panel plot.
With the multiple panels for DFBETAS, the Dfbs data set is split. The first 300 observations contain the DFBETAS information for the first panel, which includes the first six effects in the model (including the intercept). The information for the second panel, which includes the final two effects, is missing.
Beginning at observation 301, this is reversed. SAS stacked the rows of the first panel on top of the rows of the second panel. Let's find a way to merge the two panels by observation.
The first DATA step in Part B copies the first 300 observations to a new data set, Dfbs01, and the second DATA step copies the remaining observations, beginning at 301, to Dfbs02. The third DATA step uses the UPDATE statement to combine the data sets by Observation and create the data set Dfbs2.
- Submit the first three DATA steps in Part B and take a look at the new data sets. Note: The new data sets will display automatically in SAS Studio's table viewer. If you're not working in SAS Studio, you will need to submit a PROC PRINT step for each data set, Dfbs01, Dfbs02, and Dfbs2.
- Let's return to the code. The next DATA step merges the final Dfbs2 data set with the Rstud, Cook, and Dffits data sets by Observation. These are the four data sets that contain the influence data.
The IF statement identifies observations that exceed the respective influence cutoff values. In this case, it identifies observations with an RStudent value beyond the thresholds of 3 and -3; observations with a nonmissing Cooksdlabel value (in other words, any observations that were flagged as influential based on the Cook's D cutoff); and any observations that have a Dffitsout value (that is, any observations that were flagged based on the DFFITS cutoff). You can change the cutoff thresholds in this statement.
Next we want to flag observations with non-missing values in any _DFBETASOUT columns. We'll use an array, a DO loop, and an IF statement to flag these observations.
Now we need to do a little cleanup. For observations that were flagged as influential based on the cutoffs for Cook's D, DFFITS, and DFBETAS, but not flagged by RStudent, we don't want the RStudent value in the output data set. These IF statements assign a missing value for a statistic if the observation doesn't exceed the cutoff point for that particular statistic. Finally, we use a subsetting IF statement to write only the flagged observations to the output data set. Following the DATA step, we use a PROC PRINT step to display the output data set, influential.
- Submit the last two steps in Part B.
- Review the output.
The PRINT Procedure output displays a summary of only the influence statistics that were outside the cutoff boundaries. The columns for the DFBETAS values still begin with _DFBETASOUT. Notice that, if we wanted to, we could rename these to make it clear which one is for the intercept and which ones are for each of the predictor variables.
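A minimal sketch of such a renaming follows. The mapping is an assumption for illustration: _DFBETASOUT1 is the intercept, and the remaining columns follow the order of effects in &_GLSIND; the new names (dfb_Intercept and so on) are invented here, not part of the course program.

proc datasets lib=work nolist;
   modify influential;
   /* Assumed order: intercept, then the first two model effects */
   rename _DFBETASOUT1=dfb_Intercept
          _DFBETASOUT2=dfb_Gr_Liv_Area
          _DFBETASOUT3=dfb_Basement_Area;
quit;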
Now that we have the flagged observations, we can investigate them further, first to filter out any erroneous data, and then to determine what makes each point influential.

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st105d02.sas*/  /*Part A*/
ods select none;
proc glmselect data=STAT1.ameshousing3 plots=all;
   STEPWISE: model SalePrice = &interval /
             selection=stepwise details=steps select=SL
             slentry=0.05 slstay=0.05;
   title "Stepwise Model Selection for SalePrice - SL 0.05";
run;
quit;
ods select all;

ods graphics on;
ods output RSTUDENTBYPREDICTED=Rstud COOKSDPLOT=Cook
           DFFITSPLOT=Dffits DFBETASPANEL=Dfbs;
proc reg data=STAT1.ameshousing3
         plots(only label)=(RSTUDENTBYPREDICTED COOKSD DFFITS DFBETAS);
   SigLimit: model SalePrice = &_GLSIND;
   title 'SigLimit Model - Plots of Diagnostic Statistics';
run;
quit;

/*st105d02.sas*/  /*Part B*/
title;
proc print data=Rstud;
run;
proc print data=Cook;
run;
proc print data=Dffits;
run;
proc print data=Dfbs;
run;

data Dfbs01;
   set Dfbs (obs=300);
run;
data Dfbs02;
   set Dfbs (firstobs=301);
run;
data Dfbs2;
   update Dfbs01 Dfbs02;
   by Observation;
run;

data influential;
   /* Merge the data sets from above */
   merge Rstud Cook Dffits Dfbs2;
   by observation;

   /* Flag observations that have exceeded at least one cutpoint */
   if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;
   array dfbetas{*} _dfbetasout: ;
   do i=2 to dim(dfbetas);
      if dfbetas{i} then flag=1;
   end;

   /* Set influence statistics to missing for observations
      that have not exceeded the cutpoints */
   if ABS(Rstudent)<=3 then RStudent=.;
   if Cooksdlabel eq ' ' then CooksD=.;

   /* Subset only observations that have been flagged */
   if flag=1;
   drop i flag;
run;

title;
proc print data=influential;
   id observation;
   var Rstudent CooksD Dffitsout _dfbetasout:;
run;
In Part B we use PROC PRINT steps to print each of the output data sets, Rstud, Cook, Dffits, and Dfbs. Examining these data sets might help us later when we combine them into one data set.
The Dfbs01 data set contains the first 300 observations, and the columns for the last two predictor variables contain only missing values.
The Dfbs02 data set contains the last 300 observations, and the columns for the first six predictor variables contain only missing values.
The combined data set, Dfbs2, contains all non-missing values from the previous two data sets.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 2 Demo: Examining the Influential Observations Using PROC PRINT (SAS Studio Task Version)
In SAS Studio 3.7, currently no task generates the code for this demo. Submit the code in Part B of the st105d02.sas file.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 3 Demo: Calculating Collinearity Diagnostics Using PROC REG
Filename: st105d03.sas
In this demonstration, we use PROC CORR to investigate the correlations between the variable score and the other interval variables. Then we'll use PROC REG and the VIF option to assess the magnitude of the collinearity problem.

PROC CORR DATA=SAS-data-set <options>;
   VAR variables;
   WITH variables;
   ID variables;
RUN;
PROC REG DATA=SAS-data-set <options>;
   MODEL dependents = <regressors> < / options>;
RUN;
- Open program st105d03.sas.
- Submit the code in Part A.
- Review the output.
In the Pearson Correlation table, the new variable, score, appears to be significantly correlated with all the interval variables, but focus your attention on the actual correlations in the first row. Recall that a correlation closer to 1 or -1 implies a stronger relationship between two variables. Score is highly correlated with Basement_Area, and moderately correlated with Above Ground Living Area and Total_Bathroom. The correlation with Basement_Area is large enough that the two variables will clearly provide redundant information about sale price and should not both be included in the same model. Let's check for additional sources of collinearity that might not be detected with correlation coefficients.
- Let's go back to the code and look at Part B. We're going to use PROC REG with the VIF option to further assess the collinearity problem and identify the predictors involved in the problem. In the PROC REG statement, we specify the amescombined data set. The MODEL statement specifies SalePrice as the response variable, and all of the variables in the interval macro variable plus the score variable as predictors, followed by a forward slash and then the VIF, or variance inflation factor, option. SAS calculates the VIF for each predictor term in the model. The VIF for the ith predictor is VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-square value when regressing the ith predictor, X_i, on all the other predictors in the model. But the important thing to remember is the approximate cutoff value: if the VIF is greater than 10 for any predictors in the model, those predictors are likely involved in collinearity.
- Submit the first PROC REG step in Part B.
- Review the output.
In the Parameter Estimates table, VIF values are displayed in the Variance Inflation column. The VIFs for Above Ground Living Area, Basement_Area, and score are much larger than 10, so a severe collinearity problem is present. At this point there are many ways to proceed. You might use some subject-matter expertise if available. Another option is to systematically remove variables starting with the highest VIF and re-run the analysis. Much like p-values, VIF values will need to be updated with each successive variable removal.
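To make the VIF definition concrete, here's a minimal sketch that computes one VIF by hand: regress a single predictor on all the other predictors and apply 1/(1 - R-square). This isn't part of the course program; the OUTEST= and RSQUARE features of PROC REG are standard, but the choice of Gr_Liv_Area and the data set name est are arbitrary.

proc reg data=amescombined outest=est rsquare noprint;
   /* Regress one predictor on the remaining predictors */
   model Gr_Liv_Area = Basement_Area Garage_Area Deck_Porch_Area
                       Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom score;
run;
quit;

data _null_;
   /* The RSQUARE option adds _RSQ_ to the OUTEST= data set */
   set est;
   vif = 1 / (1 - _RSQ_);
   put 'VIF for Gr_Liv_Area: ' vif;
run;

The value written to the log should match the Variance Inflation column for Gr_Liv_Area in the PROC REG output.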
We decided to contact the researchers that provided the score variable, and determined that score is a composite variable. The researchers, on the basis of prior literature, created a composite variable, which is a weighted function of the two variables, Above Ground Living Area and Basement_Area. Score is equal to 10,000 minus twice Above Ground Living Area plus 5 times Basement_Area, and rounded.
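Based on that description, a DATA step like this sketch could reproduce the composite; the data set name check_score and the assumption that ROUND is applied to the nearest integer are for illustration only.

data check_score;
   set amescombined;
   /* Stated definition: 10,000 minus twice Gr_Liv_Area
      plus 5 times Basement_Area, rounded */
   check_score = round(10000 - 2*Gr_Liv_Area + 5*Basement_Area);
run;

Comparing check_score against score (for example, with PROC COMPARE) would confirm the researchers' formula.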
This isn't an uncommon occurrence and illustrates an important point. If a composite variable is included in a model along with some or all of its component measures, there's bound to be collinearity. If the composite variable has meaning, it can be used as a substitute measure for both components, and you can remove the variables Above Ground Living Area and Basement_Area from the analysis. However, composite measures have the disadvantage of losing some information about the individual variables. If this is a concern, then you can remove score from the analysis.
We'll remove score from the analysis in order to maintain the information about the two variables, Above Ground Living Area and Basement_Area. Then we'll check the variance inflation factors again to see whether collinearity remains a problem.
- Let's go back to the code. In the last PROC REG step, we've removed score from the MODEL statement and added a label of NOSCORE. Let's run this new model and calculate the VIFs.
- Submit the last PROC REG step.
- Review the output.
Scroll to the Parameter Estimates table. As you can see, all the VIF values are smaller than 2 now. Because collinearity can have a substantial effect on the outcome of a stepwise model selection procedure, it's advisable to deal with collinearity before using any automated model selection tool. The eight variables in question no longer exhibit a high degree of collinearity, and could now be safely passed into a stepwise selection approach.

%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st105d03.sas*/  /*Part A*/
proc sort data=STAT1.ameshousing3 out=STAT1.ames_sorted;
   by PID;
run;
proc sort data=STAT1.amesaltuse;
   by PID;
run;
data amescombined;
   merge STAT1.ames_sorted STAT1.amesaltuse;
   by PID;
run;

title;
proc corr data=amescombined nosimple;
   var &interval;
   with score;
run;

/*st105d03.sas*/  /*Part B*/
proc reg data=amescombined;
   model SalePrice = &interval score / vif;
   title 'Collinearity Diagnostics';
run;
quit;

proc reg data=amescombined;
   NOSCORE: model SalePrice = &interval / vif;
   title2 'Removing Score';
run;
quit;
In Part A, we'll first combine the score data from the other research group with the data we already have. The PROC CORR step produces Pearson correlation statistics and corresponding p-values. We specify the new data set, amescombined, with the nosimple option to suppress the descriptive statistics. In the VAR statement, we specify the continuous variables listed in the interval macro variable. By default, SAS produces correlations for each pair of variables in the VAR statement, but we'll use the WITH statement to correlate each continuous variable with score.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 05, Section 3 Demo: Calculating Collinearity Diagnostics Using PROC REG (SAS Studio Task Version)
Use the Linear Regression task to investigate the correlations between the variable score and the other interval variables. First combine the score data from the other research group with the data that we already have. Then further assess the collinearity problem and identify the predictors that are involved in the problem.
- Run the code below. This combines the data from the other research group with the data that we've been analyzing.
%let interval=Gr_Liv_Area Basement_Area Garage_Area Deck_Porch_Area
              Lot_Area Age_Sold Bedroom_AbvGr Total_Bathroom;

/*st105d03.sas*/  /*Part A*/
proc sort data=STAT1.ameshousing3;
   by PID;
run;
proc sort data=STAT1.amesaltuse;
   by PID;
run;
data amescombined;
   merge STAT1.ameshousing3 STAT1.amesaltuse;
   by PID;
run;
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- To investigate the correlations, expand Statistics and open the Correlation Analysis task.
- Select the work.amescombined table.
- Assign the interval variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) to the Analysis variables role.
- Assign score to the Correlate with role.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc corr data=WORK.AMESCOMBINED pearson nosimple noprob plots=none;
   var Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area Basement_Area
       Total_Bathroom Deck_Porch_Area Age_Sold;
   with score;
run;
- Expand Statistics and open the Linear Regression task.
- Select the stat1.ameshousing3 table.
- Assign SalePrice to the Dependent variable role.
- Assign the interval variables (Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, and Age_Sold) and the variable score to the Continuous variables role.
- On the MODEL tab, click the Edit this model icon, select all variables, and click Add. Then click OK.
- On the OPTIONS tab, under STATISTICS, use the drop-down list for Display statistics and select Default and selected statistics.
- Expand Collinearity and select the option to display Variance inflation factors.
- Suppress all plots by clearing the check boxes under Diagnostics and Residual Plots and Scatter Plots.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc reg data=STAT1.AMESHOUSING3 alpha=0.05 plots=none;
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold score / vif;
run;
quit;

Remove score from the model and rerun the task.
- On the DATA tab, select score from the list of Continuous variables, and click the Remove column icon.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc reg data=STAT1.AMESHOUSING3 alpha=0.05 plots=none;
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold / vif;
run;
quit;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 1 Demo: Building a Predictive Model Using PROC GLMSELECT (SAS Studio Task Version)
Build a predictive regression model of SalePrice from both categorical and interval predictors. Use ameshousing3 as the training data set and ameshousing4 as the validation data set. Use backward elimination with SBC for the training data as the model-building criterion, and choose the model with the smallest average squared error for the validation data set. Create an item store to use in subsequent processing.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and select the Predictive Regression Models task.
- Select the stat1.ameshousing3 table. Note: You'll add the validation data set in a later step.
- Assign SalePrice as the dependent variable.
- Assign the classification and continuous variables as listed below.
Classification variables: Heating_QC, Central_Air, Fireplaces, Season_Sold, Garage_Type_2, Foundation_2, Masonry_Veneer, Lot_Shape_2, House_Style2, Overall_Qual2, Overall_Cond2
Continuous variables: Lot_Area, Gr_Liv_Area, Bedroom_AbvGr, Garage_Area, Basement_Area, Total_Bathroom, Deck_Porch_Area, Age_Sold
- On the DATA tab, expand Parameterization of Effects and verify that GLM coding is selected.
- On the MODEL tab, select Custom Model and then click Edit to open the Model Effects Builder.
- Select all of the variables and click Add.
- Verify that the Intercept check box is selected.
- Click OK.
- On the SELECTION tab under MODEL SELECTION, select Backward elimination in the Selection method drop-down list, and under Add/remove effects with, select Schwarz Bayesian information criterion.
- Expand SELECTION PLOTS and select Criteria plots and Coefficient plots.
- Expand DETAILS and then expand Model Effects Hierarchy. Under Model effects hierarchy, select Do not maintain hierarchy of effects. The default is Maintain hierarchy of effects.
- Click the Edit button in the code window to open the editor, and make the following changes manually:
- Add the valdata= option to the PROC GLMSELECT statement to specify stat1.ameshousing4 as the validation data set. Note: Currently SAS Studio does not include the option to specify a separate validation data set.
- Add choose=validate within the parentheses containing select=sbc in the MODEL statement to use the average squared error of the validation data set as the model selection tool.
- Add the ref=first option to the CLASS statement to treat the first level of each variable in the classification variable list as the reference level.
- Add a STORE statement to save the analysis results in a SAS item store, stat1.amesstore.
- Submit the code.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc glmselect data=STAT1.AMESHOUSING3
               plots=(criterionpanel coefficientpanel)
               valdata=stat1.ameshousing4;
   class Heating_QC Central_Air Fireplaces Season_Sold Garage_Type_2
         Foundation_2 Masonry_Veneer Lot_Shape_2 House_Style2
         Overall_Qual2 Overall_Cond2 / param=glm ref=first;
   model SalePrice=Lot_Area Gr_Liv_Area Bedroom_AbvGr Garage_Area
         Basement_Area Total_Bathroom Deck_Porch_Area Age_Sold
         Heating_QC Central_Air Fireplaces Season_Sold Garage_Type_2
         Foundation_2 Masonry_Veneer Lot_Shape_2 House_Style2
         Overall_Qual2 Overall_Cond2 /
         selection=backward (select=sbc choose=validate) hierarchy=none;
   store out=stat1.amesstore;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 2 Demo: Scoring Data Using PROC PLM
Filename: st106d02.sas
In the previous demonstration, we built a predictive model and created an item store. Now, we'll use the item store to score data with two different methods, and then compare the results to show equivalence between the two methods. For demonstration purposes, we'll score ameshousing4, the validation data set from the previous demonstration. Remember that in a business environment, you would score data that was not used in either training or validation.

PROC PLM RESTORE=item-store <options>;
   SCORE DATA=SAS-data-set <OUT=SAS-data-set>;
   CODE <FILE=file-name>;
RUN;
DATA <data-set-name>;
   SET SAS-data-set <(data-set-options)>;
   %INCLUDE source;
RUN;
PROC COMPARE BASE=SAS-data-set COMPARE=SAS-data-set CRITERION=value;
   VAR variable(s);
   WITH variable(s);
RUN;
- Open program st106d02.sas.
/*st106d02.sas*/  /*Part A*/
proc plm restore=STAT1.amesstore;
   score data=STAT1.ameshousing4 out=scored;
   code file="&homefolder\scoring.sas";
run;

data scored2;
   set STAT1.ameshousing4;
   %include "&homefolder\scoring.sas";
run;

proc compare base=scored compare=scored2 criterion=0.0001;
   var Predicted;
   with P_SalePrice;
run;
In the PROC PLM step, the RESTORE= option specifies amesstore, the item store that contains the model information. The SCORE statement scores the ameshousing4 data and creates a new data set, scored. This data set contains the input data and the scored variable, Predicted. The CODE statement writes the scoring instructions to scoring.sas, the file named in the FILE= option. The scoring instructions are SAS DATA step programming statements that create a scoring variable. By default, the name of this new variable is the original variable name prefixed with P_.
- Submit the PROC PLM step.
- Review the output.
The Store Information table shows the model parameters. The log shows that the data set, WORK.SCORED, and the SAS program, scoring.sas, were created.
- Let's go back to the code. Next, we need a DATA step to read the input data and execute the scoring code. This DATA step reads ameshousing4 and creates a temporary data set, scored2. The %INCLUDE statement copies the scoring code from scoring.sas. Remember that if we made any transformations to the original data set before building the model, we would need to perform those transformations in the DATA step before the %INCLUDE statement.
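For example, if the model had been trained on a transformed input, the scoring DATA step might look like this sketch. The log_Lot_Area transformation is hypothetical and is not part of this course's model:

data scored2;
   set STAT1.ameshousing4;
   /* Hypothetical: re-create any derived inputs the model was
      trained on before the scoring code executes */
   log_Lot_Area = log(Lot_Area);
   %include "&homefolder\scoring.sas";
run;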
- Submit the DATA step.
- Check the SAS log to verify that the output data set, WORK.SCORED2, was successfully created.
- Let's see whether the two methods scored the data the same way. We'll use the COMPARE procedure to compare the values of the scored variable in the two output data sets. There's no need to do any preliminary matching or sorting in this case, because the output data sets are based on the same input data set. They have the same number of variables and observations, in the same order.
In the PROC COMPARE statement, the BASE= option specifies scored, the data set created by the SCORE statement. The COMPARE= option specifies scored2, the data set created by the DATA step. By default, the criterion for judging the equality of the numeric values is 0.00001, but you can use the CRITERION= option to change this. In this example, we'll use 0.0001, which is less stringent than the default. The VAR statement names the scored variable in the BASE= data set, Predicted, and the WITH statement specifies P_SalePrice, the scored variable in the COMPARE= data set.
- Submit the PROC COMPARE step.
- Review the output.
- In the Compare Summary table, let's look at the Values Comparison Summary to see whether the two methods produced similar predictions. All the compared values are judged equal under the specified criterion. Some values aren't exactly equal due to rounding, but as the maximum difference criterion value indicates, the differences are too small to be important. Of course, if we used the more stringent default criterion, the results would likely show more differences. We built a predictive model on training data, chose a best fitting and generalizable model according to validation data, and now we've seen multiple ways to deploy our predictive model. We can now predict new cases after we measure the model inputs by passing the new data to PROC PLM or to a DATA step that uses score code. That is, we can predict home prices after we measure the home attributes that are needed as inputs in our predictive model. After predicting sale prices of homes in Ames, Iowa, we'll have some idea of the future commission for our real estate firm.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 06, Section 2 Demo: Scoring Data Using PROC PLM (SAS Studio Task Version)
In SAS Studio 3.7, currently no tasks are available to perform the steps in this demo. Submit the code in the st106d02.sas file.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 1 Demo: Examining the Distribution of Categorical Variables Using PROC FREQ and PROC UNIVARIATE
Filename: st107d01.sas
Let's examine the distribution of Bonus at each value of the predictors, Fireplaces and Lot_Shape_2. We'll create one-way frequency tables to view the frequency of the levels of each categorical variable. We'll then create two-way frequency tables, also known as crosstabulation tables, to look for a possible association between two categorical variables. The crosstabulation table shows frequency statistics for the combinations of levels of two variables.

PROC FREQ DATA=SAS-data-set;
   TABLES table-request(s) < / options>;
   <additional statements>
RUN;
PROC UNIVARIATE DATA=SAS-data-set <options>;
   VAR variables;
   HISTOGRAM variables < / options>;
   INSET keywords < / options>;
RUN;
- Open program st107d01.sas.
/*st107d01.sas*/
title;
proc format;
   value bonusfmt 1 = "Bonus Eligible"
                  0 = "Not Bonus Eligible";
run;

proc freq data=STAT1.ameshousing3;
   tables Bonus Fireplaces Lot_Shape_2
          Fireplaces*Bonus Lot_Shape_2*Bonus /
          plots(only)=freqplot(scale=percent);
   format Bonus bonusfmt.;
run;

proc univariate data=STAT1.ameshousing3 noprint;
   class Bonus;
   var Basement_Area;
   histogram Basement_Area;
   inset mean std median min max / format=5.2 position=nw;
   format Bonus bonusfmt.;
run;
To start, we use PROC FORMAT to format the values of Bonus. If the value is 1, SAS displays Bonus Eligible, and if the value is 0, SAS displays Not Bonus Eligible.
The PROC FREQ step uses the ameshousing3 data set to generate frequency tables and plots summarizing the categorical variables. The TABLES statement requests individual tables for the three categorical variables Bonus, Fireplaces, and Lot_Shape_2. To request crosstabulation tables, we specify an asterisk between the names of the variables that we want to appear in the table. The first variable represents the rows, and the second variable represents the columns. This TABLES statement requests a crosstabulation of Fireplaces by Bonus, and a table of Lot_Shape_2 by Bonus.
The PLOTS= option requests a frequency plot for each frequency table, and SCALE=PERCENT displays percentages, or relative frequencies.
The FORMAT statement applies the bonusfmt. format to the variable Bonus.
Because we also want to look at the distribution of the continuous variable Basement_Area by Bonus status (eligible or not eligible), we'll use PROC UNIVARIATE to create histograms for each level of Bonus. The CLASS statement indicates our categorical predictor variable. The VAR and HISTOGRAM statements specify Basement_Area, and we use the INSET statement to create a box of summary statistics in the northwest corner of the graph. Again, we'll format the values of Bonus.
- Submit the code.
- Review the output.
The first table is a one-way frequency table for Bonus. By default, four frequency measures are included in the table for each level of the variable: the frequency and percent of each level, as well as the cumulative frequency and cumulative percent. The last row always displays a cumulative percent of 100, because 100% of the observations have either the last value or one of the values listed above it. From this table, you can see that most homes, 85% of the ones in our sample, are not Bonus Eligible, meaning they didn't sell for more than $175,000.
The second table, a one-way frequency table for fireplaces, shows that most homes do not have a fireplace. Only 31% of homes in our sample have a single fireplace, and only 12 homes have 2 fireplaces.
The third one-way frequency table analyzes Lot_Shape_2. Approximately two-thirds of homes in our sample have a regular lot shape, and the other third have an irregular lot shape. Notice there's one missing value for the Lot_Shape_2 variable.
Tables 4 and 5 are the requested crosstabulation tables. By default, a crosstabulation table has four measures in each cell, indicated in the legend. Frequency indicates the number of observations that contain both the row variable value and the column variable value. We'll use Table 4, Fireplaces by Bonus as an example. It shows that 25 homes with 1 fireplace are bonus eligible. Percent indicates the number of observations in each cell as a percentage of the total number of observations. For example, about 3% of homes with 2 fireplaces are not bonus eligible. Row Pct indicates the number of observations in each cell as a percentage of the total number of observations in that row. The total number of observations for each row appears in the Total column for the row. In this table, the first row indicates that, of the 195 homes that do not have a fireplace, about 91% are not bonus eligible. And Col Pct indicates the number of observations in each cell as a percentage of the total number of observations in that column. The total number of observations for each column appears at the bottom. Of the 255 homes that are not bonus eligible, about 27% have one fireplace.
It seems there's some association between the variables Bonus and number of Fireplaces. For example, homes that are not bonus eligible are much more likely to have 0 fireplaces, at about 69%, whereas bonus eligible homes are more likely to have 1 fireplace, at about 55%. With the unequal group sizes, the raw frequencies alone might not clearly show whether Fireplaces is associated with Bonus; the row percentages make the comparison easier.
The cross-tabular frequency plot displays the simple frequencies from the crosstabulations. For example, the largest bar shows that 59% of all homes in the sample have 0 fireplaces and are not bonus eligible.
Now consider Table 5, the Bonus by Lot_Shape_2 crosstabulation table. When you compare the row percentages, there's a much larger probability of the home being not bonus eligible if the lot shape is regular, at about 94%, as opposed to only 67% for irregular lot shapes. The distribution of Bonus changes when the value of Lot_Shape_2 changes. There seems to be an association between the two variables. Later, we'll investigate this further to find whether the association is statistically significant.
The PROC UNIVARIATE histogram plot shows the distribution of the continuous variable, Basement_Area, by Bonus status. The distribution of homes that are not bonus eligible appears to be more variable and has a larger standard deviation. There certainly appears to be an association between Bonus and Basement_Area: the larger the basement area, the more likely the home is to be bonus eligible. The histograms are different and centered in different locations. The median of bonus eligible homes is almost 500 square feet larger than that of homes that are not bonus eligible.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 1 Demo: Examining the Distribution of Categorical Variables Using PROC FREQ and PROC UNIVARIATE (SAS Studio Task Version)
Use the One-Way Frequencies task to create one-way frequency tables for the variables Bonus, Fireplaces, and Lot_Shape_2. Use the Table Analysis task to create two-way frequency tables for the variables Bonus by Fireplaces, and Bonus by Lot_Shape_2. For the continuous variable Basement_Area, use the Summary Statistics task to generate histograms for each level of Bonus.
Generating One-Way Frequency Tables
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the One-Way Frequencies task.
- Select the stat1.ameshousing3 table.
- Assign the variables Bonus, Fireplaces, and Lot_Shape_2 to the Analysis variables role.
- Click Run.
Generated Code
proc freq data=STAT1.AMESHOUSING3;
   tables Fireplaces Lot_Shape_2 Bonus / plots=(freqplot cumfreqplot);
run;
Generating Two-Way Frequency Tables
- Open the Table Analysis task.
- On the DATA tab, verify that stat1.ameshousing3 is already selected.
- Assign Fireplaces and Lot_Shape_2 to the Row variables role, and assign Bonus to the Column variables role.
- On the OPTIONS tab, expand FREQUENCY TABLE.
- Under Percentages, select Cell, Row, and Column.
- Under Statistics, clear the check box for Chi-square statistics.
- Click Run.
Generated Code
ods noproctitle;

proc freq data=STAT1.AMESHOUSING3;
   tables (Fireplaces Lot_Shape_2)*(Bonus) /
          nocum plots(only)=(freqplot mosaicplot);
run;
Generating Histograms
- Open the Summary Statistics task.
- Assign Basement_Area to the Analysis variables role, and assign Bonus to the Classification variables role.
- On the OPTIONS tab, clear the check box for Number of observations.
- Expand PLOTS and select Histogram, and then select Add inset statistics.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc means data=STAT1.AMESHOUSING3 chartype mean std min max vardef=df;
   var Basement_Area;
   class Bonus;
run;

proc univariate data=STAT1.AMESHOUSING3 vardef=df noprint;
   var Basement_Area;
   class Bonus;
   histogram Basement_Area;
   inset mean std min max / position=nw;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 2 Demo: Performing a Pearson Chi-Square Test of Association Using PROC FREQ
Filename: st107d02.sas
We know that there are possible associations between the binary response, Bonus, and the categorical predictors, Lot_Shape_2 and Fireplaces. Now let's run a formal test to determine whether the associations are significant.

PROC FREQ DATA=SAS-data-set;
   TABLES table-request(s) < / options>;
   <additional statements>
RUN;
- Open program st107d02.sas.
/*st107d02.sas*/
ods graphics off;
proc freq data=STAT1.ameshousing3;
   tables (Lot_Shape_2 Fireplaces)*Bonus /
          chisq expected cellchi2 nocol nopercent relrisk;
   format Bonus bonusfmt.;
   title 'Associations with Bonus';
run;
ods graphics on;
This PROC FREQ step requests a crosstabulation table for Lot_Shape_2 by Bonus and one for Fireplaces by Bonus. Notice the grouping syntax used here, with parentheses around the predictor variables. This is just another way to request the crosstabulation tables. After the forward slash, the CHISQ option produces the Pearson chi-square test of association, as well as the measures of association that are based on the chi-square statistic. We also have some additional options related to measures of association. The EXPECTED option prints the expected cell counts, which are the cell counts we would expect under the null hypothesis of no association. CELLCHI2 prints each cell's contribution to the total chi-square statistic. NOCOL suppresses the printing of the column percentages, and NOPERCENT suppresses the printing of the cell percentages. Finally, we add the RELRISK (relative risk) option to print a table that contains risk ratios, or probability ratios, and the odds ratios.
- Submit this program.
- Review the output.
The first cross-tabular frequency table shows the crosstabulation table for Lot_Shape_2 by Bonus. You can see how the options in the TABLES statement changed the statistics that appear in each cell. The actual frequency appears first. It seems that the cell for Lot_Shape_2, Irregular and Bonus, Bonus Eligible contributes the most to the chi-square statistic, with a Cell Chi-Square value of 21.905.
Next is the table that shows the chi-square test and Cramer's V. Because the p-value for the Chi-Square statistic is less than .0001, you reject the null hypothesis at the 0.05 level and conclude that there is evidence of an association between Lot_Shape_2 and Bonus. The Cramer's V value of -0.3531 indicates that the association detected with the chi-square test is relatively weak.
Exact tests are often useful when there are low cell counts. The chi-square test typically requires 20-25 total observations for a 2x2 table, with 80% of the table cells having counts greater than 5. In our case, we've met the requirements for the Bonus by Lot_Shape_2 crosstabulation. The next table, Fisher's Exact Test, appears because PROC FREQ provides it by default when tests of association are requested for 2x2 tables. For larger tables, the exact test must be requested with an EXACT statement.
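A minimal sketch of that request for a larger table follows (Fireplaces by Bonus is 3x2 in these data); note that exact p-values can be computationally expensive for large samples:

proc freq data=STAT1.ameshousing3;
   tables Fireplaces*Bonus;
   exact fisher;   /* requests Fisher's exact test explicitly */
   format Bonus bonusfmt.;
run;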
In the Relative Risk Estimates table, the Odds Ratio and Relative Risk values show a measure of association strength. The Odds Ratio is shown in the first row of the table, along with the 95% confidence limits. To interpret the odds ratio, refer to the crosstabulation table at the beginning of the output. The top row (Irregular) is the numerator of the ratio, while the bottom row (Regular) is the denominator. The interpretation is stated in relation to the left column of the crosstabulation table (Not Bonus Eligible).
The value of 0.1347 says that an irregular lot has about 13.5% of the odds of not being bonus eligible, compared with a regular lot. This is equivalent to saying that a regular lot has about 13.5% of the odds of being bonus eligible, compared with an irregular lot. We can interpret the reciprocal of the odds ratio, 1/0.1347=7.423, similarly: for homes with irregular lot shapes, the odds of being bonus eligible are more than seven times the odds for homes with regular lot shapes.
It's often easier to report odds ratios by first transforming the decimal value to a percent-difference value, using the formula (Odds Ratio - 1) * 100. Here, (0.1347 - 1) * 100 = -86.53, so regular lots have 86.53% lower odds of being bonus eligible compared with irregular lots.
The 95% odds ratio confidence interval goes from 0.0664 to 0.2735, which doesn't include 1. This confirms the statistically significant result of the Pearson chi-square test of association. A confidence interval that includes the value of 1 would indicate equality of odds and would not be a significant result.
Relative risk estimates for each column are interpreted as probability ratios, rather than odds ratios. You have a choice of assessing probabilities of the left column (Column 1) or the right column (Column 2). For example, Relative Risk (Column 1) shows the ratio of the probabilities of irregular lots to regular lots being in the left column (66.67/93.69=0.7116).
The last two tables show the output from the Fireplaces by Bonus analysis. The output from the Chi-Square Tests shows that there is also a statistically significant association between Fireplaces and Bonus, with a chi-square test statistic of 15.41 and a p-value of 0.0004. However, Cramer's V for that association is 0.2267, indicating a relatively weak association.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 2 Demo: Performing a Pearson Chi-Square Test of Association Using PROC FREQ (SAS Studio Task Version)
You know that there are possible associations between the variables Lot_Shape_2 and Bonus, as well as between Fireplaces and Bonus. Now use the Table Analysis task to run a formal test to determine whether the associations are significant.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Table Analysis task.
- Select the stat1.ameshousing3 table.
- Assign Fireplaces and Lot_Shape_2 to the Row variables role, and assign Bonus to the Column variables role.
- On the OPTIONS tab, expand PLOTS and select the Suppress plots check box.
- Expand FREQUENCY TABLE.
- Under Frequencies, select Observed and Expected.
- Under Percentages, select Cell and Row.
- Under Statistics, select Odds ratio and relative risk (for 2X2 tables), in addition to Chi-square statistics, which should already be selected.
- Click Run.
Generated Code
ods noproctitle;

proc freq data=STAT1.AMESHOUSING3;
   tables (Fireplaces Lot_Shape_2)*(Bonus) /
          chisq relrisk expected nocol nocum plots=none;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 2 Demo: Detecting Ordinal Associations Using PROC FREQ
Filename: st107d03.sas
In this demonstration we use PROC FREQ to test whether an ordinal association exists between Bonus and Fireplaces.

PROC FREQ DATA=SAS-data-set;
   TABLES table-request(s) < / options>;
   <additional statements>
RUN;
- Open program st107d03.sas.
/*st107d03.sas*/
ods graphics off;
proc freq data=STAT1.ameshousing3;
   tables Fireplaces*Bonus / chisq measures cl;
   format Bonus bonusfmt.;
   title 'Ordinal Association between FIREPLACES and BONUS?';
run;
ods graphics on;
In this step, the TABLES statement specifies a crosstabulation table for Fireplaces by Bonus, as well as three options that generate various measures of association. The CHISQ option produces the Pearson chi-square, the likelihood-ratio chi-square, and the Mantel-Haenszel chi-square. It also produces measures of association based on chi-square statistics, such as the phi coefficient, the contingency coefficient, and Cramer's V. The MEASURES option produces the Spearman correlation statistic along with a few other measures of association. The CL option produces confidence limits for the statistics that the MEASURES option requests.
- Submit the code.
- Review the output.
The first table is the same Fireplaces by Bonus crosstabulation table that was generated in the previous demonstration.
Let's look at the results of the Mantel-Haenszel chi-square test. Because the p-value is 0.0010, you can conclude at the 0.05 significance level that there is evidence of an ordinal association between Bonus and Fireplaces. There's a significant trend in the likelihood of being bonus eligible as the number of fireplaces increases.
The last table displays a variety of measures of association, including the Spearman correlation statistic and its 95% confidence limits. The Spearman correlation value of 0.2107 indicates that there’s a weak positive ordinal relationship between Fireplaces and Bonus. That is, as Fireplaces levels increase, Bonus tends to increase.
The ASE is the asymptotic standard error, and is only an appropriate measure of the standard error for relatively large samples.
Because the 95% confidence interval for the Spearman correlation statistic does not contain 0, the relationship is significant at the 0.05 significance level. However, the confidence intervals are valid only if your sample size is large. A general guideline is to have a sample size of at least 25 for each degree of freedom in the Pearson chi-square statistic. Because we have a sample size of 300, our confidence intervals are valid.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 2 Demo: Detecting Ordinal Associations Using PROC FREQ (SAS Studio Task Version)
Use the Table Analysis task to test whether an ordinal association exists between Bonus and Fireplaces.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Table Analysis task.
- Select the stat1.ameshousing3 table.
- Assign Fireplaces to the Row variables role, and assign Bonus to the Column variables role.
- On the OPTIONS tab, expand PLOTS and select the Suppress plots check box.
- Expand FREQUENCY TABLE.
- Under Percentages, select Cell, Row, and Column.
- Under Statistics, select Measure of association, in addition to Chi-square statistics, which should already be selected.
- Modify the code to include confidence bounds in the table. On the CODE tab, click the Edit SAS code icon.
- Enter cl in the options list (after the slash) in the TABLES statement.
- Click Run.
Generated Code
ods noproctitle;

proc freq data=STAT1.AMESHOUSING3;
   tables (Fireplaces)*(Bonus) / cl chisq measures nocum plots=none;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 3 Demo: Fitting a Binary Logistic Regression Model Using PROC LOGISTIC
Filename: st107d04.sas
Let's fit a binary logistic regression model in PROC LOGISTIC to characterize the relationship between the continuous variable Basement_Area and our categorical response, Bonus.

PROC LOGISTIC DATA=SAS-data-set <options>;
   MODEL variable <(variable_options)> = <effects> < / options>;
RUN;
- Open program st107d04.sas.
/*st107d04.sas*/
ods graphics on;
proc logistic data=STAT1.ameshousing3 alpha=0.05
              plots(only)=(effect oddsratio);
   model Bonus(event='1')=Basement_Area / clodds=pl;
   title 'LOGISTIC MODEL (1):Bonus=Basement_Area';
run;
The PROC LOGISTIC statement specifies the ameshousing3 data set and has several options. The PLOTS= option requests only the EFFECT and ODDSRATIO plots. The ALPHA= option sets the significance level, here 0.05, for the confidence intervals of the parameter estimates.
The MODEL statement specifies the response variable Bonus, and in parentheses, the variable option EVENT= specifies the event category for the binary response. PROC LOGISTIC then models the probability of the event category you specify. In this example, the event category is the value 1 for Bonus, which indicates a Bonus Eligible home. If you don't include this option, event=0 would be modeled instead, because it's the first level in alphanumeric order. You then specify an equal sign, followed by the predictor variable, Basement_Area.
After the forward slash, you use the CLODDS= option to compute confidence intervals for the odds ratios of all predictor variables. Following the equal sign, you specify a keyword to indicate the type of confidence interval: PL for profile likelihood, WALD, or BOTH. If you don't specify the CLODDS= option, PROC LOGISTIC computes Wald confidence intervals by default. Wald statistics require fewer computations to perform. Profile-likelihood confidence intervals are desirable for small sample sizes. The CLODDS= option also enables the production of the odds ratio plot that's specified in the PLOTS= option.
- Submit this program.
- Review the output.
The Model Information table describes the data set, the response variable, the number of response levels, the type of model, and the algorithm used to obtain the parameter estimates. The Optimization Technique is the iterative numerical technique that PROC LOGISTIC uses to estimate the model parameters. The model is assumed to be binary logit when there are exactly two response levels.
In the Observation Summary, the Number of Observations Used is the count of all nonmissing observations. In this case, there were no missing observations for the variables specified in the MODEL statement.
The Response Profile table shows the response variable values listed according to their ordered values. Because we used the EVENT= option in this example, the model is based on the probability of being bonus eligible (Bonus=1). This table also shows frequencies of response values. In this sample of 300 homes, only 45 sold for more than $175,000, and 255 sold for less.
Next, you should always check that the modeled response level is the one you intended. Otherwise your interpretation of the model will be erroneous.
The Model Convergence Status simply indicates that the convergence criterion was met, and there are many options to control the convergence criterion. The optimization technique doesn't always converge to a maximum likelihood solution. When this is the case, the output after this point cannot be trusted. Always check to see that the convergence criterion is satisfied.
The Model Fit Statistics table reports three goodness-of-fit statistics: AIC, SC (also known as Schwarz's Bayesian Criterion, or SBC), and -2 Log L, which is -2 times the natural log of the likelihood. These statistics measure relative fit and are used only to compare models. They do not measure absolute fit of any single model. Smaller values for all these measures indicate better fit. However, -2 Log L can be reduced by simply adding more regression parameters to the model. Therefore, it's not used to compare the fit of models that use different numbers of parameters, except for comparisons of nested models using likelihood ratio tests. AIC adjusts for the number of predictor variables, and SC adjusts for both the number of predictor variables and the number of observations. SC imposes a bigger penalty for extra variables and therefore favors more parsimonious models.
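For reference, the standard definitions (not displayed in the output) are AIC = -2 Log L + 2p and SC = -2 Log L + p * ln(n), where p is the number of estimated parameters and n is the number of observations. These formulas make the penalty difference explicit: SC penalizes each additional parameter more heavily than AIC whenever ln(n) exceeds 2, that is, for samples of about 8 or more observations.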
The Global Tests table, Testing Global Null Hypothesis: BETA=0, provides three statistics to test the null hypothesis that all regression coefficients of the model are 0. A significant p-value for these tests provides evidence that at least one of the regression coefficients for a predictor variable is significantly different from 0. In this way, they are like the overall F test in linear regression. The Likelihood Ratio Chi-Square value is calculated as the difference between the -2 Log L value of the baseline model (intercept only) and the -2 Log L value of the hypothesized model (intercept and covariates).
The degrees of freedom are equal to the difference in number of parameters between the hypothesized model and the baseline model. In this case, there's only one additional predictor, Basement_Area, compared to the intercept-only model. The Score and Wald tests are also used to test whether all the regression coefficients are 0, and all three tests are asymptotically equivalent and often give very similar values. However, the Likelihood Ratio test is the most reliable, especially for small sample sizes.
The Parameter Estimates table, Analysis of Maximum Likelihood Estimates, lists the estimated model parameters, their standard errors, Wald Chi-Square values, and p-values. The parameter estimates are the estimated coefficients of the fitted logistic regression model. For this example, the logistic regression equation is logit(p-hat) = -9.7854 + (0.00739) * Basement_Area.
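To see what this equation implies on the probability scale, here's a minimal sketch (not one of the course files; the 1,500-square-foot basement area is just an illustrative value):

data _null_;
   basement_area = 1500;                      /* illustrative value      */
   logit = -9.7854 + 0.00739*basement_area;   /* linear predictor: ~1.30 */
   p_hat = 1 / (1 + exp(-logit));             /* inverse logit: ~0.79    */
   put logit= p_hat=;
run;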
The Wald Chi-Square and its associated p-value test whether the parameter estimate is significantly different from 0. The p-value for the variable Basement_Area is significant at the 0.05 alpha level.
The estimated model is displayed on the probability scale in the effect plot. You can see the sigmoidal shape of the estimated probability curve and that the probability of being bonus eligible increases as the basement area increases.
We'll take a closer look at the information in the last two tables and plot after this demonstration.
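As noted above, CLODDS= accepts PL, WALD, or BOTH. If you want to see the two interval types side by side, a minimal sketch (not one of the course programs) is:

proc logistic data=STAT1.ameshousing3;
   /* CLODDS=BOTH prints Wald and profile-likelihood confidence
      intervals for each odds ratio in the same output */
   model Bonus(event='1')=Basement_Area / clodds=both;
run;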
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 3 Demo: Fitting a Binary Logistic Regression Model Using PROC LOGISTIC (SAS Studio Task Version)
Use the Binary Logistic Regression task to fit a binary logistic regression model and characterize the relationship between Basement_Area and the categorical response, Bonus. Model the probability of being bonus eligible, and request profile likelihood confidence intervals for the estimated odds ratio.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Binary Logistic Regression task.
- Select the stat1.ameshousing3 table.
- Assign Bonus to the Response role, and use the Event of interest drop-down list to specify 1.
- Assign Basement_Area to the Continuous variables role.
- On the MODEL tab, verify that Main effects model is selected.
- On the OPTIONS tab, in the Select statistics to display drop-down list, select Default and additional statistics.
- Expand the Parameter Estimates property. In the Confidence intervals for odds ratios drop-down list, select Based on profile likelihood.
- Expand PLOTS, and in the Select plots to display drop-down list, select Default and additional plots.
- Select Effect plot and Odds ratio plot.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc logistic data=STAT1.AMESHOUSING3
      plots=(effect oddsratio(cldisplay=serifarrow));
   model Bonus(event='1')=Basement_Area / link=logit clodds=pl
         alpha=0.05 technique=fisher;
run;
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 4 Demo: Fitting a Multiple Logistic Regression Model with Categorical Predictors Using PROC LOGISTIC (SAS Studio Task Version)
Use the Binary Logistic Regression task to fit a binary logistic regression model and characterize the relationship of Basement_Area, Fireplaces, and Lot_Shape_2 with Bonus. Specify reference cell coding and specify Regular as the reference group for Lot_Shape_2 and 0 as the reference level for Fireplaces. Model the probability of being bonus eligible and request profile likelihood confidence intervals for the estimated odds ratio. Request a report of odds ratios for 100 units for the Basement_Area variable.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Binary Logistic Regression task.
- Select the stat1.ameshousing3 table.
- Assign Bonus to the Response role, and use the Event of interest drop-down list to specify 1.
- Assign Fireplaces and Lot_Shape_2 to the Classification variables role.
- Expand the Parameterization of Effects property and use the Coding drop-down list to select Reference coding.
- Assign Basement_Area to the Continuous variables role.
- On the MODEL tab, verify that Main effects model is selected.
- On the OPTIONS tab, in the Select statistics to display drop-down list, select Default and additional statistics.
- Expand the Parameter Estimates property. In the Confidence intervals for odds ratios drop-down list, select Based on profile likelihood.
- Expand PLOTS, and in the Select plots to display drop-down list, select Default and additional plots.
- Select Effect plot and Odds ratio plot.
- Modify the code to specify specific levels of each class variable to use as reference levels. On the CODE tab, click the Edit SAS code icon.
- In the CLASS statement, add the options (REF='0') immediately after Fireplaces and (REF='Regular') immediately after Lot_Shape_2.
- Add the statement units Basement_Area=100; after the MODEL statement.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc logistic data=STAT1.AMESHOUSING3
      plots=(effect oddsratio(cldisplay=serifarrow));
   class Fireplaces (REF='0') Lot_Shape_2 (REF='Regular') / param=ref;
   model Bonus(event='1')=Fireplaces Lot_Shape_2 Basement_Area /
         link=logit clodds=pl alpha=0.05 technique=fisher;
   units Basement_Area=100;
run;
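The UNITS statement changes the scale of the reported odds ratio: instead of the odds ratio for a 1-square-foot increase in Basement_Area, PROC LOGISTIC reports the odds ratio for a 100-square-foot increase. A minimal sketch of the relationship (the slope value below is hypothetical, not taken from the output):

data _null_;
   beta   = 0.007;          /* hypothetical slope for Basement_Area */
   or_1   = exp(beta);      /* odds ratio per 1 square foot         */
   or_100 = exp(100*beta);  /* odds ratio per 100 square feet       */
   put or_1= or_100=;
run;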
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 Demo: Fitting a Multiple Logistic Regression Model with Interactions Using PROC LOGISTIC (SAS Studio Task Version)
Use the Binary Logistic Regression task to fit a binary logistic regression model and use the backward elimination method. The full model should include all the main effects and two-way interactions.
- In the Navigation pane, select Tasks and Utilities.
- Expand Tasks.
- Expand Statistics and open the Binary Logistic Regression task.
- Select the stat1.ameshousing3 table.
- Assign Bonus to the Response role, and use the Event of interest drop-down list to specify 1.
- Assign Fireplaces and Lot_Shape_2 to the Classification variables role.
- Expand the Parameterization of Effects property and use the Coding drop-down list to select Reference coding.
- Assign Basement_Area to the Continuous variables role.
- On the MODEL tab, select Custom model.
- Click the Edit this model icon under Model Effects to specify the model.
- In the Model Effects Builder window, select all three variables and then select N-way Factorial. Select N=2 to specify a model with all the variables and the associated two-way interaction terms, and then click Add.
- Click OK.
- On the SELECTION tab, use the Selection method drop-down list to choose Backward elimination, and set the significance level for removing an effect from the model to 0.1.
- On the OPTIONS tab, in the Select statistics to display drop-down list, select Default and additional statistics.
- Expand the Parameter Estimates property. In the Confidence intervals for odds ratios drop-down list, select Based on profile likelihood.
- Expand PLOTS, and in the Select plots to display drop-down list, select Default and additional plots.
- Select Effect plot.
- Modify the code to specify specific levels of each class variable to use as reference levels. On the CODE tab, click the Edit SAS code icon.
- In the CLASS statement, add the options (REF='0') immediately after Fireplaces and (REF='Regular') immediately after Lot_Shape_2.
- Add the statement units Basement_Area=100; after the MODEL statement.
- Click Run.
Generated Code
ods noproctitle;
ods graphics / imagemap=on;

proc logistic data=STAT1.AMESHOUSING3 plots=(effect);
   class Fireplaces (REF='0') Lot_Shape_2 (REF='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area Fireplaces Lot_Shape_2
         Basement_Area*Fireplaces Basement_Area*Lot_Shape_2
         Fireplaces*Lot_Shape_2 / link=logit clodds=pl alpha=0.05
         selection=backward slstay=0.1 hierarchy=single technique=fisher;
   units Basement_Area=100;
run;
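Note that the generated code also includes the HIERARCHY=SINGLE option. This is the PROC LOGISTIC default: effects enter or leave the model one at a time, and model hierarchy is maintained, so an interaction can remain in the model only while all the effects it contains remain as well.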
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 Demo: Fitting a Multiple Logistic Regression Model with All Odds Ratios Using PROC LOGISTIC
Filename: st107d06.sas
In this demonstration, we want to refine the multiple logistic regression model that we fit in the last demonstration. Now we want to produce the odds ratios for each value of the variables that are involved in an interaction from the final model.

PROC LOGISTIC DATA=SAS-data-set <options>;
   CLASS variable <(options)> ... < / options>;
   MODEL variable <(variable_options)> = <effects> < / options>;
   UNITS <independent1=list1> ... < / options>;
   ODDSRATIO < 'label' > variable < / options>;
RUN;
- Open program st107d06.sas.
/*st107d06.sas*/
/*Part A*/
proc logistic data=STAT1.ameshousing3 plots(only)=(effect oddsratio);
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Fireplaces|Lot_Shape_2 @2 /
         selection=backward clodds=pl slstay=0.10;
   units Basement_Area=100;
   title 'LOGISTIC MODEL (3): Backward Elimination '
         'Bonus=Basement_Area|Fireplaces|Lot_Shape_2';
run;

/*st107d06.sas*/
/*Part B*/
proc logistic data=STAT1.ameshousing3 plots(only)=oddsratio(range=clip);
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Lot_Shape_2 Fireplaces;
   units Basement_Area=100;
   oddsratio Basement_Area / at (Lot_Shape_2=ALL) cl=pl;
   oddsratio Lot_Shape_2 / at (Basement_Area=1000 1500) cl=pl;
   title 'LOGISTIC MODEL (3.1): Bonus=Basement_Area|Lot_Shape_2 Fireplaces';
run;
In the modified PROC LOGISTIC step in Part B, the PLOTS= option now specifies only an odds ratio plot, but includes the RANGE=CLIP option. This option is helpful when one or more odds ratio confidence intervals are so large that the scale makes it difficult to see the smaller ones.
The MODEL statement specifies all the significant terms that remained in the final model. At the end of the MODEL statement, notice that we've removed the SELECTION= and CLODDS= options that specify the backward elimination method and profile-likelihood confidence limits. The profile-likelihood confidence limits are now specified in the ODDSRATIO statements that we've added.
To produce the odds ratios for each value of a variable that's involved in an interaction, you specify a separate ODDSRATIO statement for each variable. In the first ODDSRATIO statement, we specify Basement_Area followed by a forward slash and the AT option. The AT option specifies fixed levels of one or more interacting variables. For each categorical variable, you can specify a list of one or more formatted levels of the variable, the keyword REF to select the reference level, or the keyword ALL to select all levels of the variable. Here, we're requesting all levels of the variable Lot_Shape_2.
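For example, AT (Lot_Shape_2=REF) would compute the Basement_Area odds ratio only at the reference level of Lot_Shape_2, which is Regular in this model.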
The second ODDSRATIO statement produces odds ratios for Lot_Shape_2 at 1000 and 1500 square feet of Basement_Area.
- Submit the PROC LOGISTIC step in Part B.
- Review the output.
Jump to the Odds Ratios. There are four odds ratios displayed for the interaction effects. The first two show the odds ratios comparing homes that differ by 100 square feet of basement area, holding the lot shape constant. For example, the odds of being bonus eligible for a home with a regular lot shape are almost 3 times greater than for a home with 100 square feet less basement area. The last two compare the odds for irregular lot shapes versus regular, holding the basement area constant. For example, the odds of being bonus eligible are more than 20 times greater for homes with irregular lot shapes versus regular when the basement area is fixed at 1000 square feet.
We can assess the significance of the odds ratios from either the table or the graphic. In the table, an odds ratio is significant when the confidence interval does not include the value 1. In the PL or Profile-Likelihood plot, a vertical reference line at 1 makes it easy to assess significance. Odds ratios whose confidence intervals do not overlap the reference line are statistically significant.
From this plot, it's clear that the lot shape effect differs at different values of basement area. The lot shape effect is highly significant when Basement_Area is set to 1000 square feet, but not when it's set to 1500 square feet, where the confidence interval for the odds ratio includes 1.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 Demo: Fitting a Multiple Logistic Regression Model with All Odds Ratios Using PROC LOGISTIC (SAS Studio Task Version)
To refine the multiple logistic regression model that you fit in the previous demonstration, estimate and plot odds ratios for the simple effects of variables that are involved in an interaction. Include only the significant terms.
In SAS Studio 3.7, currently no task generates the code for this demo. Submit the code in Part B of the st107d06.sas file.
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 Demo: Generating Predictions Using PROC PLM
Filename: st107d07.sas
Let's run the LOGISTIC procedure again with the same effects as before, except that this time we add a STORE statement to save the model information so that we can score new data.

PROC LOGISTIC DATA=SAS-data-set <options>;
   CLASS variable <(options)> ... < / options>;
   MODEL variable <(variable_options)> = <effects> < / options>;
   UNITS <independent1=list1> ... < / options>;
   STORE <OUT=> item-store-name < / LABEL='label'>;
RUN;
PROC PLM RESTORE=item-store-specification <options>;
   SCORE DATA=SAS-data-set <OUT=SAS-data-set> <keyword<=name>> ... < / options>;
RUN;
- Open program st107d07.sas.
/*st107d07.sas*/
ods select none;

proc logistic data=STAT1.ameshousing3;
   class Fireplaces(ref='0') Lot_Shape_2(ref='Regular') / param=ref;
   model Bonus(event='1')=Basement_Area|Lot_Shape_2 Fireplaces;
   units Basement_Area=100;
   store out=isbonus;
run;

ods select all;

data newhouses;
   length Lot_Shape_2 $9;
   input Fireplaces Lot_Shape_2 $ Basement_Area;
   datalines;
0 Regular 1060
2 Regular 775
2 Irregular 1100
1 Irregular 975
1 Regular 800
;
run;

proc plm restore=isbonus;
   score data=newhouses out=scored_houses / ILINK;
   title 'Predictions using PROC PLM';
run;

proc print data=scored_houses;
run;
Before the PROC LOGISTIC step, we add an ODS SELECT NONE statement to suppress the output, and after the step we add ODS SELECT ALL to make sure that we get the output from the next step we run. In the STORE statement, we specify the name of the item store we want to save, isbonus. Next, we'll create a data set named newhouses that contains the new data we want to score.
Finally, we'll use PROC PLM to generate predictions for the newhouses data set. The RESTORE= option specifies that the predictions are based on the analysis results saved in the item store isbonus. The SCORE statement tells SAS to score the data set named newhouses and write the results to a new data set named scored_houses. The ILINK option requests predictions on the probability scale rather than the logit scale, which makes them easier to interpret.
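As a quick check on what ILINK does, here's a minimal sketch (it assumes scored_houses contains the default Predicted column on the probability scale):

data check;
   set scored_houses;
   /* applying the logit to each ILINK prediction should
      recover the linear predictor (the logit-scale value) */
   logit_check = log(Predicted / (1 - Predicted));
run;

proc print data=check;
run;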
We'll close our program with a PRINT procedure so that we can view the scored data.
- Submit the code.
- Review the output.
As expected, the Predictions table produced by PROC PLM shows that the house with the highest predicted probability of being bonus eligible (0.306) has an irregular lot shape, 1 fireplace, and a basement area of 975 square feet. The house with the lowest predicted probability (0.0004) has a regular lot shape, 2 fireplaces, and a basement area of 775 square feet. Again, the predicted values in the last column are probabilities because we used the ILINK option; otherwise, the last column would contain the predicted logit values.
Be sure to generate predictions only for new data records that fall within the range of the training data. Otherwise, the predictions could be invalid due to extrapolation. We assume that the modeled relationship between the predictors and the response holds across the span of the observed data; we should not assume that it holds everywhere.
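One minimal way to check is to compare variable ranges before scoring (a sketch, not part of the course files):

proc means data=STAT1.ameshousing3 min max;
   var Basement_Area;   /* range of the training data */
run;

proc means data=newhouses min max;
   var Basement_Area;   /* range of the data to be scored */
run;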
Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Lesson 07, Section 5 Demo: Generating Predictions Using PROC PLM (SAS Studio Task Version)
Using the model that was selected from backward selection, including main effects and two-way interactions, generate predictions for bonus eligibility for new data.
In SAS Studio 3.7, the Binary Logistic Regression task does not have the option to generate predictions for new data. To save the context and results of the statistical analysis, submit the code in the st107d07.sas file and use PROC PLM for predictions.