Machine Learning Using SAS® Viya®
Lesson 01, Section 1 Demo: Creating a Project and Selecting Data
Note: The following text is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, I'll show you how to create a project in Model Studio and select the data during that process. This is the project that we'll use throughout the course.
Here we see SAS Drive, where we access SAS Viya products. I'll click the Applications menu and select Build Models. The Model Studio Projects page appears. From the Model Studio Projects page, you can view existing projects, create new projects, and access the Exchange. The Exchange is a place where you can save your pipelines, and find pipeline templates created by other users, as well as best practice pipelines that are provided by SAS. You learn more about the Exchange later. The Projects page shown here has a few existing projects. When you do this demonstration, your Projects page might look different.
I'm going to click New Project. I'll provide the name Demo. We want to keep the project type as Data Mining and Machine Learning.
To select the data source, I'll click Browse under Data. The data source that we need, commsdata, is loaded into memory, so it is listed on the Available tab. I'll scroll down on the Available tab to find the commsdata table. Keep in mind that multiple versions of this data set might be available in SAS Viya for Learners. Select the most recent version available and click OK.
Back in the New Project window, notice that the name of the data source is now displayed.
I will leave the project description blank.
And now, let's take a quick look at some of the Advanced Project settings. In the New Project Settings Window, in the left column, are four groups of Advanced Project settings: Advisor Options, Partition Data, Event-Based Sampling, and Node Configuration. Right now, we'll talk about the Advisor Options, which you can access only at this point in creating a project. You learn about the other Advanced Project settings in a later demonstration. Let's look at the three options in the Advisor Options group.
Maximum class level specifies the threshold for rejecting categorical variables. If a categorical input has more levels than the specified maximum number, it is rejected.
Interval cutoff determines whether a numeric input is designated as interval or nominal. If a numeric input has more levels than the interval cutoff value, it is declared interval. Otherwise, it is declared nominal.
Maximum percent missing specifies the threshold for rejecting inputs with missing values. If an input has a higher percentage of missing values than the specified maximum percent, it is rejected. This option is on by default, but we could turn it off if we wanted to.
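Note: Outside Model Studio, the same advisor logic can be sketched in a few lines of code. The Python sketch below is illustrative only: the function name, thresholds, and sample data are invented, and the defaults shown are placeholders rather than Model Studio's actual defaults. It simply shows how the three Advisor Options translate into rules over a table.

```python
import pandas as pd

def advise_roles(df, max_class_levels=20, interval_cutoff=20, max_pct_missing=50.0):
    """Suggest a role and level per column using rules analogous to the Advisor Options."""
    advice = {}
    for col in df.columns:
        pct_missing = df[col].isna().mean() * 100        # percent of missing values
        n_levels = df[col].nunique(dropna=True)          # number of distinct levels

        if pct_missing > max_pct_missing:                 # Maximum percent missing
            advice[col] = "Rejected (too many missing values)"
        elif pd.api.types.is_numeric_dtype(df[col]):      # Interval cutoff
            advice[col] = "Input (interval)" if n_levels > interval_cutoff else "Input (nominal)"
        elif n_levels > max_class_levels:                 # Maximum class level
            advice[col] = "Rejected (too many class levels)"
        else:
            advice[col] = "Input (nominal)"
    return advice

# Tiny example with a numeric column and a categorical column
toy = pd.DataFrame({"spend": [10.5, 20.1, None, 35.0], "state": ["NC", "TX", "NC", "CA"]})
print(advise_roles(toy, max_class_levels=2, interval_cutoff=2, max_pct_missing=30.0))
```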
We won't change the default settings. I'll click Cancel.
We're back at the New Project window. We'll click Save to save the Demo project.
Model Studio automatically opens the Demo project. Across the top are four tabs: Data, Pipelines, Pipeline Comparison, and Insights. The Data tab is selected by default. Notice the message at the top, which tells us that we must assign a role of Target to a variable in order to run a pipeline.
For our project, the target variable is churn. I need to select the variable and then change the role. I'm going to scroll down in the variable table. I'll select the check box next to the variable name churn. Then, in the right pane, I can change some of the metadata or information about the variable. I'm going to assign a role to the variable churn. I'll select the Role menu and change the role to Target. The target is now defined and the warning at the top of the window is gone.
If the target is Binary or Nominal, you can also change the event of interest by clicking Specify the Target Event Level. In the window that appears, you can select a level from the menu. Notice that the frequency count is provided for each level. For our project, the churn rate is about 12%. By default, Model Studio sorts the levels in ascending alphanumeric order, and selects the last level as the event. For our target, the selected level is 1, so we don't need to change this.
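Note: The event-level behavior described above can be mimicked with a small amount of code. The Python sketch below uses a toy churn column (the real commsdata values are not reproduced) to show the level frequencies, the last level in ascending order being taken as the event, and an event rate of about 12%.

```python
import pandas as pd

# Toy stand-in for the churn target: 88 non-events (0) and 12 events (1)
churn = pd.Series([0] * 88 + [1] * 12)

counts = churn.value_counts().sort_index()       # level frequencies in ascending order
event_level = counts.index[-1]                   # last level in ascending order is the event (1)
event_rate = counts[event_level] / counts.sum()  # 0.12, matching the roughly 12% churn rate
print(counts.to_dict(), event_level, event_rate)
```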
Keep in mind that you cannot modify the names or labels of your variables in Model Studio.
Machine Learning Using SAS® Viya®
Lesson 01, Section 1 Demo: Importing Data from a Local Source
Note: You can import a local file into memory in SAS Viya. However, this functionality is not available in SAS Viya for Learners, so there is no practice associated with this demo.
In this demonstration, I load the data into memory in the New Project window. I import the data source as a local file. To select the data source, I'll click Browse under Data. The data source that we need, commsdata, is not loaded into memory, so it does not appear on the Available tab. We need to import the data source as a local file.
On the Import tab, I expand Local files and then select Local file. I navigate to the location of the file on the local drive, select it, and click Open. When I click Import Item, SAS Viya loads the data in memory. After the table is imported, it appears on the Available tab, where it is available to other projects.
On the Import tab, I click OK to return to the New Project window. Notice that the name of the data source is now displayed.
Machine Learning Using SAS® Viya®
Lesson 01, Section 2 Demo: Modifying the Data Partition
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, I'll change the metadata for a few variables, modify the default data partition settings, run the partitioning, and then look at the partitioning log. We perform all of these tasks on the Data tab. Based on business knowledge, we want to reject 11 variables in the commsdata data set. Rejected variables will not be used in our models.
Before I start selecting the variables I want to reject, notice that the variable churn is still selected from the last demonstration. To deselect churn, I click its check box. It is important to understand how to select and deselect variables on the Data tab. Otherwise, you might inadvertently reassign variable roles or change metadata. You learn some tips after this demonstration.
Now I'll select the check boxes for the first three variables to reject: city, city_lat, and city_long. Notice that Model Studio allows me to select multiple variables simultaneously.
Scrolling down, I'll select the remaining eight variables that I want to reject.
In the right panel, I'll click the Role menu, and assign a role of Rejected to the selected variables.
To change the default data partition settings, I'll click the Settings icon in the upper right corner, and then select Project settings.
In the Project Settings window, the Partition Data settings are currently displayed. The Create partition variable check box is selected, which indicates that partitioning is done by default. The default partitioning method is Stratify. By default, Model Studio does a 60-30-10 allocation to training, validation, and test. For our Demo project, we're going to specify 70% of the data for training and 30% for validation. We will not use a test set. So I'm going to change the 60 under Training to 70, and I'll change the 10 under Test to 0.
The partition settings can be edited only if no pipelines in the project have been run. After the first pipeline is run, the partition tables are created for the project, and the partition settings cannot be changed.
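Note: To see the partitioning idea outside Model Studio, the Python sketch below performs a stratified 70/30 training/validation split on toy data. The _PartInd_ column name and its 1/0 coding are assumptions made for illustration; they are not guaranteed to match the partition variable that Model Studio creates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for commsdata; churn is the stratification (target) variable.
df = pd.DataFrame({"x": range(100), "churn": [0] * 88 + [1] * 12})

# 70/30/0 split, stratified on the target so each partition keeps roughly the same churn rate.
train, valid = train_test_split(df, train_size=0.70, stratify=df["churn"], random_state=1)

df["_PartInd_"] = 0                    # 0 = validation (illustrative coding)
df.loc[train.index, "_PartInd_"] = 1   # 1 = training
print(df.groupby("_PartInd_")["churn"].agg(["count", "mean"]))
```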
Let's look at some of the other project settings.
On the left, select Event-Based Sampling. Notice that by default, event-based sampling is turned off. When event-based sampling is turned on, the desired proportion of event and non-event cases can be set after the sampling is done. The default proportion for events and non-events after sampling is 50% each. The sum of the proportions must be 100%. After a pipeline is run in the project, the event-based sampling settings cannot be changed. We'll keep the event-based sampling options at their default settings.
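Note: Event-based sampling can be approximated outside Model Studio by keeping all events and sampling an equal number of non-events, which yields the 50%/50% proportions described above. The Python sketch below is a simplified illustration, not the exact sampling routine that Model Studio uses.

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "churn": [0] * 88 + [1] * 12})

events = df[df["churn"] == 1]
non_events = df[df["churn"] == 0]

# Keep all events and an equal-sized random sample of non-events: 50% / 50% after sampling.
sampled = pd.concat([events, non_events.sample(n=len(events), random_state=1)])
print(sampled["churn"].value_counts(normalize=True))
```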
Select Node Configuration. This setting is useful when we use the Open Source Code node with the language set to Python. When I select the Prepend Python configuration code check box, a code editor appears. In the editor, I could add Python code to prepend to code that I specified in the Open Source Code node. You learn more about the Open Source Code node later in the course. I'll clear this check box here because we will not use Python code for now.
Select Rules. The Rules options can be used to change the selection statistic and partition data set that determine the champion model during Model Comparison. Statistics can be selected for class and interval targets. We will keep the Rules options at their default settings.
Click Save to save the new partition settings.
Click the Pipelines tab. We currently have a single pipeline in our Demo project, which is called Pipeline 1. It currently contains only a Data source node. In order to create the partition indicator, I'm going to run the Data node. I'll right-click the node and then select Run.
The green check mark indicates that the node ran successfully, without an error, and the data has been partitioned. Recall that after the Data node has been run, you cannot change the partitioning, event-based sampling, project metadata, project properties, or the target variable. However, you can change variable metadata with the Manage Variables node or through the Data tab.
Let's take a brief look at the log file that was generated during the partitioning. I'll click Settings in the top right corner. And then I'll select Project logs. From the available logs, I'll select Log for Project Partitioning, and then click Open.
In the Partition Log window, we see the log file that was created during the partitioning. The log file can even be downloaded for record keeping. I'll close the Partition Log window. And then I'll close the Available Logs window to return to the pipeline.
Machine Learning Using SAS® Viya®
Lesson 01, Section 2 Practice the Demo: Modify the Data Partition
In this practice, you change the metadata for multiple variables, modify the default data partition settings, run the partitioning, and then look at the partitioning log.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- If you closed the Demo project, reopen it. Make sure that the Data tab is selected.
- Make sure that the check box for the churn variable is cleared.
Note: It is important to understand how to select and deselect variables on the Data tab. Otherwise, you might inadvertently reassign variable roles or change metadata. For details, see the variable selection tips before this practice.
- Reject 11 variables so that they will not be used in modeling, as follows:
- On the Data tab, select the check boxes for the following variables:
- city
- city_lat
- city_long
- data_usage_amt
- mou_onnet_6m_normal
- mou_roam_6m_normal
- region_lat
- region_long
- state_lat
- state_long
- tweedie_adjusted
- In the right pane, for Role, make sure that Rejected is selected.
Note: Variable metadata includes the role and measurement level of the variable. Common variable roles are Input, Target, Rejected, Text, and ID. Common variable measurement levels are Interval, Binary, Nominal, and Ordinal.
- Click Settings in the upper right corner of the window, and select Project settings from the menu.
Note: If you want to see or modify the partition settings before creating the project, you can do this from the user settings. In the user settings, the Partition tab enables you to specify the method for partitioning as well as specify associated percentages. Any settings at this level are global and are applied to any new project created.
The Project Settings window appears with Partition Data selected on the left by default.
Note: You can edit the data partitioning settings only if no pipelines in the project have been run. After the first pipeline has been run, the partition tables are created for the project, and the partition settings cannot be changed. Remember that, as shown in the last demonstration, you can also access the Partition Data options while the project is being created, under the Advanced settings.
- Notice that the Create partition variable check box is selected, which indicates that partitioning is done by default. The default partitioning method is Stratify.
- By default, Model Studio does a 60-30-10 allocation to training, validation, and test. For the Demo project, make the following changes:
- Change the Training percentage to 70.
- Leave the Validation percentage set to 30.
- Change the Test percentage to 0. Note: You will not use a test data set for this project.
- On the left, select Event-Based Sampling to look at those settings. By default, event-based sampling is turned off. (That is, the Enable event-based sampling check box is not selected.) When event-based sampling is turned on, the desired proportion of event and non-event cases can be set after the sampling is done. In this case, the default proportion for both events and non-events after sampling is 50% each. The sum of the proportions must be 100%.
For the Demo project, keep the Event-Based Sampling options at their default settings. Note: After a pipeline has been run in the project, the Event-Based Sampling settings cannot be changed. Remember that, as shown in the last demonstration, you can also access the Event-Based Sampling options while the project is being created, under the Advanced settings.
- On the left, select Node Configuration. The Prepend Python configuration code setting is useful when you use the Open Source Code node with the language set to Python.
- To explore this setting, do the following:
- Select the Prepend Python configuration code check box. A code editor appears. In the editor, you could add Python code to prepend to code that you specified in the Open Source Code node. You learn more about the Open Source Code node later in the course.
- Clear the Prepend Python configuration code check box because you will not use Python code at this point.
- On the left, select Rules to look at those settings. The Rules options can be used to change the selection statistic and partitioned data set that determine the champion model during model comparison. Statistics can be selected for class and interval targets.
For the Demo project, keep the Rules options at their default settings.
- Click Save to save the new partition settings and return to the Demo project page.
- Click the Pipelines tab. In the Demo project, there is currently a single pipeline named Pipeline 1.
On the Pipelines tab, you can create, modify, and run pipelines. Each pipeline has a unique name and an optional description. In the Demo project, Pipeline 1 currently contains only a Data source node.
- To create the partition indicator, you can run the Data node. Right-click the Data node and select Run.
After the node runs, a green check mark in the node indicates that it ran without errors and the data have been partitioned.
Note: After you run the Data node, you cannot change the partitioning, event-based sampling, project metadata, project properties, or the target variable. However, you can change variable metadata with the Manage Variables node or through the Data tab.
- To look at the log file that was generated during partitioning, click Settings in the upper right corner, and select Project logs from the menu.
- From the Available Logs window, select Log for Project Partitioning, and then click Open. The log that was created during partitioning appears. You can scroll through the log if you want. Note: It is also possible to download a log file by clicking the Download log link at the bottom of the log.
- To return to the pipeline, close the Partition Log window, and then close the Available Logs window.
Machine Learning Using SAS® Viya®
Lesson 01, Section 2 Demo: Building a Pipeline from a Basic Template
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we create a pipeline from a basic template in the Demo project. We use this pipeline to do imputation and build a baseline regression model. We run the pipeline and look at the results. Later in the course, we compare this model with machine learning models.
Remember that Pipeline 1, which has a single Data node, was created automatically with the project. We'll reserve this pipeline for exploring the data, which we will do later in the course.
To create a new pipeline, I click the plus sign next to Pipeline 1.
In the New Pipeline window, under Select a pipeline template, I click Browse. A list of pre-built pipeline templates is shown. These templates are available for both class and interval targets, at basic, intermediate, and advanced levels.
I select the Basic template for class target and click OK.
I'll change the name to Starter Template. We can change the pipeline name later if we want to.
I could provide a description, such as "This is based on the basic template for class target," but I'll leave this blank for now.
Instead of using one of the pre-populated pipelines already configured to create a model, we could select Automatically generate the pipeline. This option uses automated machine learning to dynamically build a pipeline that is based on your data. This option is disabled if the target variable has not been set or if the project data advisor has not finished running. We will not use this option in this course.
I click OK.
As you can see, the basic template has only four nodes: the Data node; one node for data preparation, which is Imputation; one model, Logistic Regression; and Model Comparison. Even when a pipeline has only one model, a Model Comparison node is included by default.
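Note: Conceptually, the basic template's flow (data, imputation, one regression model) resembles a small modeling pipeline in open-source tools. The Python sketch below is only a loose analogy built with scikit-learn on synthetic data; it does not reproduce Model Studio's imputation settings, stepwise selection, or assessment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X[::20, 0] = np.nan                       # introduce a few missing values to impute

# Rough analogue of the basic template: impute missing values, then fit a logistic regression.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 3))         # training accuracy, just to confirm the flow runs
```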
To run the entire pipeline, I'll click Run Pipeline in the upper right corner.
While the pipeline is running, notice that the Run Pipeline button changes to Stop Pipeline so that you can stop the run of the pipeline at any time. After the pipeline runs, the green check marks in the nodes indicate that the pipeline has run successfully.
Let's discuss some of the results. We'll start with the results of the logistic regression model. I can open the results in two ways. I can either right-click the node and select Results, or I can click the three vertical dots on the right side of the node and select Results.
In the results of the Logistic Regression node, notice that there are two tabs in the upper left corner: Node and Assessment. Some of the windows on the Node tab are the t Values by Parameter plot, the Parameter Estimates table, the Selection Summary table, and the Output.
On the Assessment tab, some of the windows are the Lift Reports plots, the ROC Reports plots, and the Fit Statistics table.
You can explore these results as you want. To close the results, I'll click Close in the upper right corner.
Now let's look at the results of the Model Comparison node.
The Model Comparison table appears at the top of the results. We currently have only one model in the pipeline, so information about that model is provided. To maximize the table, you can click the Expand icon in the upper right corner.
Now, other statistics are visible.
In the Model Comparison table, the fit statistic that is used to select a champion model is displayed first. The default fit statistic for selecting a champion model with a class target is KS (Kolmogorov-Smirnov). For the Demo project, we will use KS. However, it is possible to change the selection statistic at the pipeline level or at the project level after returning to the pipeline.
I'll close the Model Comparison table. Then I'll close the results and return to the pipeline.
For future reference, you can change the selection statistic for the current pipeline by changing the class selection statistic in the Model Comparison node's properties.
To change the selection statistic for all pipelines within a project, you change the class selection statistic on the project's Settings menu, which you saw in an earlier demonstration.
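Note: The KS (Kolmogorov-Smirnov) statistic used for champion selection measures the maximum separation between the cumulative score distributions of events and non-events. The Python sketch below computes that separation on made-up predictions; it is an illustration of the definition, not the exact computation or cutoffs used by the Model Comparison node.

```python
import numpy as np

def ks_statistic(y_true, p_event):
    """Maximum separation between event and non-event cumulative score distributions."""
    y_true, p_event = np.asarray(y_true), np.asarray(p_event)
    thresholds = np.unique(p_event)
    cdf_event = np.array([(p_event[y_true == 1] <= t).mean() for t in thresholds])
    cdf_non = np.array([(p_event[y_true == 0] <= t).mean() for t in thresholds])
    return float(np.max(np.abs(cdf_event - cdf_non)))

# Example: scores that separate the two classes fairly well give a higher KS.
print(ks_statistic([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.4, 0.3, 0.7, 0.9]))
```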
Machine Learning Using SAS® Viya®
Lesson 01, Section 2 Practice the Demo: Build a Pipeline from a Basic Template
In this practice, you create a pipeline from a basic template in the Demo project. You use this pipeline to do imputation and build a baseline regression model that you compare with machine learning models in a later demonstration. You run the pipeline and look at the results.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- Make sure that the Demo project is open and the Pipelines tab is selected. Note: Remember that Pipeline 1, which has a single Data node, was created automatically with the project. You'll reserve this pipeline for exploring the data, which you do in a later demonstration.
- Click the plus sign next to the Pipeline 1 tab.
The New Pipeline window appears.
- Under Select a pipeline template, click Browse to access the Browse Templates window. This window displays a list of pre-built pipeline templates, which are available for both class (categorical) and interval targets. These templates are available at basic, intermediate, and advanced levels. The Browse Templates window also displays any pipeline templates that users have created and saved to the Exchange.
- Select Basic template for class target, and click OK.
- In the New Pipeline window, in the Name field, enter Starter Template as the pipeline name.
Note: Specifying a pipeline description is optional.
- Notice the Automatically generate a pipeline option. This option is an alternative to using one of the pre-populated pipelines already configured to create a model. When you select the Automatically generate a pipeline option, Model Studio uses automated machine learning to dynamically build a pipeline that is based on your data. This option is disabled if the target variable has not been set or if the project data advisor has not finished running. We do not use this option in this course.
- Click OK.
A Starter Template pipeline tab appears on the Pipelines tab for the Demo project. The basic template for class target is a simple linear flow that includes the following nodes: the Data node, one node for data preparation (Imputation), one model node (Logistic Regression), and the Model Comparison node. Even when a pipeline has only one model, a Model Comparison node is included by default.
- To run the entire pipeline, click Run pipeline in the upper right corner of the canvas. After the pipeline runs, green check marks in the nodes indicate that the pipeline has run successfully.
Note: While the pipeline is running, notice that the Run Pipeline button changes to Stop Pipeline. To interrupt a running pipeline, you can click this button.
- Right-click the Logistic Regression node and select Results. The Results window appears and contains two tabs: Node and Assessment. The Node tab, which is selected by default, displays the results from the Logistic Regression node.
Note: Alternatively, you can open the node results by clicking More (the three vertical dots) on the right side of the node and selecting Results.
- Explore the results. A subset of the items on this tab are listed below:
- t-Values by Parameter plot
- Parameter Estimates table
- Selection Summary table
- Output
- Click the Assessment tab to see the assessment results from the Logistic Regression node. Explore the results. A subset of the items on this tab are listed below:
- Lift reports plots
- ROC reports plots
- Fit Statistics table
- To close the Results window and return to the pipeline, click Close in the upper right corner.
- To open the results of the Model Comparison node, right-click the Model Comparison node and select Results.
At the top of the results window is the Model Comparison table. This pipeline contains only one model, so the Model Comparison table currently displays information about only that one model.
- In the upper right corner of the Model Comparison table, click Maximize View to maximize the table. The fit statistic that is used to select a champion model is displayed first. The default fit statistic for selecting a champion model with a class target is KS (Kolmogorov-Smirnov).
- Close the Model Comparison table.
- Close the Model Comparison Results window and return to the pipeline.
Note: For future reference, you can change the selection statistic at the pipeline level or at the project level, after you return to the pipeline. To change the selection statistic for all pipelines within a project, you change the class selection statistic on the project's Settings menu (which was shown in an earlier demonstration). However, for the Demo project, continue to use the default selection statistic, KS.
Machine Learning Using SAS® Viya®
Lesson 02, Section 1 Demo: Exploring the Data
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we explore the source data using the Data Exploration node. The Data Exploration node selects a subset of variables to provide a representative snapshot of the data.
To start, I click the Pipeline 1 tab.
In the pipeline, I right-click the Data node and select Add child node > Miscellaneous > Data Exploration. Model Studio adds the Data Exploration node and connects it to the Data node.
Another way to bring a node into a pipeline is to click the Nodes icon in the left panel and drag one of the listed nodes on top of a node that is already in the pipeline. The new node is added below the existing node.
I want to keep the properties of the Data Exploration node at their default values.
The Variable selection criterion specifies whether to display the most important inputs or suspicious variables. We want to see the most important inputs, so we keep the current setting, Importance. By default, a maximum of 50 of the most important variables are selected. To see the most suspicious variables, we would change the setting to Screening.
You can control the selection of suspicious variables by specifying screening criteria, like cutoff for flagging variables with a high percentage of missing values, high-cardinality class variables, class variables with dominant levels, class variables with rare levels, skewed interval variables, peaky interval variables, and interval variables with thick tails.
I'll change the Variable selection criterion back to Importance.
I run the Data Exploration node.
When the run is complete, I open the Data Exploration results.
To start, I expand the Important Inputs bar chart. This chart is available only if Variable selection criterion is set to Importance. The Relative Variable importance metric is based on a decision tree and is a number between zero and 1. I'll close the chart.
Next, I expand the Interval Variable Moments table. This table displays the interval variables with their associated statistics, which include Minimum, Maximum, Mean, Standard Deviation, Skewness, Kurtosis, Relative Variability, and the Mean plus or minus 2 Standard Deviations. Note that some of the input variables have negative minimum values. We'll handle these negative values in an upcoming demonstration. I'll close the table.
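Note: The statistics in the Interval Variable Moments table can be reproduced for any numeric column with standard tools. The Python sketch below computes the same kinds of moments, including relative variability as the coefficient of variation, on a deliberately right-skewed toy variable; it illustrates the definitions only, not the node's output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000))   # deliberately right-skewed

moments = {
    "mean": x.mean(),
    "std": x.std(),
    "skewness": x.skew(),                        # positive for a long right tail
    "kurtosis": x.kurt(),                        # excess kurtosis; large for heavy tails
    "relative_variability": x.std() / x.mean(),  # coefficient of variation
}
print({k: round(v, 3) for k, v in moments.items()})
```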
Then I expand the Interval Variable Summaries scatter plot. This is a scatter plot of skewness against kurtosis for the interval input variables. Notice that we have a few interval input variables in the upper right corner that are suspicious based on high kurtosis and high skewness values. You can place your cursor on these dots to see the variable names.
Using the menu in the upper left corner, I show a bar chart of the relative variability for each interval variable. Now I exit the maximized view.
In the results, I also want to look at information about missing values. I scroll down to the Missing Values bar chart and maximize it. Notice that some of the variables have a higher percentage of missingness than others. I exit the maximized view.
Then I close the results of the Data Exploration node.
I want to rename the first pipeline, so I double-click its tab and change the name from Pipeline 1 to Data Exploration. Another way to rename a pipeline is to click the Options menu for the tab (the three dots) and select Rename.
Machine Learning Using SAS® Viya®
Lesson 02, Section 1 Practice the Demo: Explore the Data
In this practice, you explore the source data (commsdata) using the Data Exploration node in Model Studio. Here you select a subset of variables to provide a representative snapshot of the data. Variables can be selected to show the most important inputs or to indicate suspicious variables (that is, variables with anomalous statistics).
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Pipelines tab. Make sure that Pipeline 1 is selected.
- Right-click the Data node and select Add child node > Miscellaneous > Data Exploration from the pop-up menu. Model Studio automatically adds a Data Exploration node to the pipeline and connects it to the Data node.
Note: Alternatively, you can select a node from one of the sections in the Nodes pane on the left and drag it onto an existing node in the pipeline. The new node is added to the canvas below the existing node and automatically connected to that node.
- Keep the default settings for the Data Exploration node. Notice that Variable selection criterion is set to Importance. In this demo, we want to see the most important inputs, so we keep this setting.
Note: The variable selection criterion specifies whether to display the most important inputs or suspicious variables. By default, a maximum of 50 of the most important variables are selected. To see the most suspicious variables, you would change the setting to Screening. Then you can control the selection of suspicious variables by specifying screening criteria, such as cutoff for flagging variables with a high percentage of missing values, high-cardinality class variables, class variables with dominant levels, class variables with rare levels, skewed interval variables, peaky interval variables, and interval variables with thick tails.
- Right-click the Data Exploration node and select Run from the pop-up menu.
- When the pipeline finishes running, right-click the Data Exploration node and select Results from the pop-up menu.
- Maximize the Important Inputs bar chart and examine the relative importance of the ranked variables. Note: This bar chart is available only if Variable selection criterion is set to Importance.
Note: The Relative Variable importance metric is based on a decision tree and is a number between zero and 1. (You learn more about decision trees later in this course.)
- Minimize the Important Inputs bar chart.
- Maximize the Interval Variable Moments table.
This table displays the interval variables with their associated statistics, which include Minimum, Maximum, Mean, Standard Deviation, Skewness, Kurtosis, Relative Variability, and the Mean plus or minus 2 Standard Deviations. Note that some of the input variables have negative minimum values. You handle these negative values in an upcoming practice.
- Close the Interval Variable Moments table.
- Maximize the Interval Variable Summaries scatter plot. This is a scatter plot of skewness against kurtosis for all the interval input variables. Notice that a few input variables in the upper right corner are suspect based on high kurtosis and high skewness values. You can place your cursor on these dots to see the associated variable names.
- Click the View chart menu in the upper left corner of the window and select Relative Variability. Examine the bar chart of the relative variability for each interval variable.
Note: Relative variability is useful for comparing variables with similar scales, such as several income variables. Relative variability is the coefficient of variation, which measures variability relative to the mean: the standard deviation divided by the mean.
- Close the Interval Variable Summaries scatter plot.
- Scroll down in the Data Exploration Results window and maximize the Missing Values bar chart, which shows the variables that have missing values. Notice that some of the variables have a higher percentage of missingness than others.
- Close the Missing Values bar chart.
- Click Close to close the results.
- Double-click the Pipeline 1 tab and change its name by entering Data Exploration.
Note: Another way to rename a pipeline is to click the options menu for the tab (the three dots) and select Rename.
Machine Learning Using SAS® Viya®
Lesson 02, Section 1 Demo: Replacing Incorrect Values Starting on the Data Tab
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we replace incorrect values starting on the Data tab. This method replaces values in all pipelines in the project. Later, you learn about using the Manage Variables node with the Replacement node to replace values in a single pipeline.
In an earlier demonstration, remember that we explored some of the interval input variables and saw that some had negative minimum values. Using business knowledge, we're going to replace these negative values with zeros for a subset of the interval input variables.
In our Demo project, I return to the Data tab.
To find our subset of variables, I will first sort by the Role column. I'll right-click in the Role column and select Sort > Sort (ascending). Now all the input variables are grouped together after the ID variable and before the Rejected variables.
Now I will add a second sort to the current sort on Role. This will help me group together the input variables that have negative values. I'll scroll to the right until I see the Minimum column. I'll right-click in the Minimum column and select Sort > Add to sort (ascending). Input variables with negative minimum values are now grouped together. Add to sort means that the initial sorting done by Role still holds. So the sort on minimum values takes place within each sorted Role group.
I'll rearrange the columns so that the Minimum column is next to the Variable Name column. I click the Options button in the upper right corner of the data table. Then I select Manage Columns. In the Displayed columns list, I select Minimum and then move it up by clicking the Up arrow. Now, with the Minimum column adjacent to the Variable Name column, I click OK. I scroll to the left so I can see the variable names and the minimum values. To make sure that any previously selected variables are no longer selected, I'll click the first variable's name rather than the check box. Then I'll select the 22 interval input variables. Notice that there's one more variable with a negative minimum value. However, based on business knowledge, I won't select it.
With the variables selected, I'll move to the right pane and enter 0.0 in the Lower limit field. In a pipeline, the Filtering and Replacement nodes use this lower limit to filter out or replace negative values of the selected variables. Recall that this is customer billing data, and negative values often imply that there is a credit applied to the customer's account. So it is realistic that there are negative numbers in these columns. However, with telecom data, there is a general practice to convert negative values to zeros.
Notice that I did not edit any variable values. I only set a metadata property. The Replacement node makes the change to the data.
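Note: To make the effect of the lower limit concrete, the Python sketch below mimics what the Replacement node does when a metadata lower limit of 0 is applied: values below the limit are raised to the limit, and the result is written to a new column with the REP_ prefix. The column names and values are invented for illustration.

```python
import pandas as pd

billing = pd.DataFrame({"tot_mb_data_curr": [12.5, -3.0, 0.0, 41.2],
                        "calls_total": [10, -1, 7, 3]})

# Replace values below the lower limit (0) with the limit itself, writing REP_ versions.
for col in ["tot_mb_data_curr", "calls_total"]:
    billing["REP_" + col] = billing[col].clip(lower=0)

print(billing)
```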
I click the Pipelines tab and then select the Starter Template pipeline. Notice that the green check marks in the nodes have changed to gray circles. This indicates that the nodes need to be rerun to reflect the change in metadata. After the pipeline is rerun, the nodes will show a green check mark again.
I will add a Replacement node to the pipeline, using an alternative method. In the left column, I'll expand Nodes. Then I'll expand Data Mining Preprocessing. I'll click and drag the Replacement node between the Data node and the Imputation node. Then I'll hide the left pane.
The Replacement node can be used to replace outliers and unknown class levels with specified values. This is where you invoke the metadata property of the lower limit that we set before. In the properties pane of the Replacement node, I'll set the Default limits method to Metadata limits, change Alternate limits method to (none), and keep the Replacement value property set to the default (Computed limits).
I'll run the Replacement node.
The Replacement node ran successfully, so I'll open the results.
I'll maximize the Interval Variables table. This table shows which variables now have a lower limit of zero. The original variables will now be rejected. The new versions of the variables, which have REP_ prepended to the name, are now the valid input variables. I'll close the Interval Variables table. And I'll close the results.
Now I run the remainder of the pipeline by clicking the Run Pipeline button.
When the pipeline run is complete, I'll open the Model Comparison results. And I'll expand the Model Comparison table to see the performance of the Logistic Regression model. I'll close the Model Comparison table and the results.
As I mentioned earlier, there is another way to assign metadata properties. You can use the Manage Variables node with the Replacement node to replace values in a single pipeline. Let me show you where to find the Manage Variables node, although we won't use it in our Demo project. In the left pane, expand the nodes. Then expand Data Mining Preprocessing. One of the nodes under Data Mining Preprocessing is the Manage Variables node.
Now I'll hide the Nodes pane.
Machine Learning Using SAS® Viya®
Lesson 02, Section 1 Practice the Demo: Replace Incorrect Values Starting on the Data Tab
In this practice, you replace incorrect values starting on the Data tab. This method replaces values in all pipelines in the project. Note: Later, you learn about using the Manage Variables node with the Replacement node to replace values in a single pipeline.
In an earlier practice, you explored some of the interval input variables and saw that some have negative minimum values. Based on business knowledge, you will replace these negative values with zeros for a subset of the interval input variables. To start, you sort the variables to find the subset that you want to work with.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Data tab.
- Right-click the Role column and select Sort > Sort (ascending). All the input variables are now grouped together after the ID variable and before the Rejected variables.
- To group the input variables that have negative values, you will add a second sort to the current sort on Role. Scroll to the right, right-click the Minimum column, and select Sort > Add to sort (ascending). Variables with negative minimum values are now grouped together.
Note: Add to sort means that the initial sorting done on Role still holds. So the sort on minimum values takes place within each sorted Role group.
- Rearrange columns so that the Minimum column is next to the Variable Name column, as follows:
- Click the Options icon in the upper right corner of the data table, and then select Manage columns. The Manage Columns window appears.
- In the Displayed Columns list, select Minimum. By clicking the up arrow multiple times, move the Minimum column immediately below the Variable Name column.
- Click OK. The Manage Columns window closes.
- On the Data tab, scroll all the way to the left so that you can see the Variable Name column and the Minimum column.
- Select the following 22 interval input variables:
Note: In the practice environment, the variables might be listed in a different order than shown here. To make sure that any previously selected variables are no longer selected, select the first variable's name rather than its check box.
- tot_mb_data_roam_curr
- seconds_of_data_norm
- lifetime_value
- bill_data_usg_m03
- bill_data_usg_m06
- voice_tot_bill_mou_curr
- tot_mb_data_curr
- mb_data_usg_roamm01 through mb_data_usg_roamm03
- mb_data_usg_m01 through mb_data_usg_m03
- calls_total
- call_in_pk
- calls_out_pk
- call_in_offpk
- calls_out_offpk
- mb_data_ndist_mo6m
- data_device_age
- mou_onnet_pct_MOM
- mou_total_pct_MOM
- In the right pane, enter 0.0 in the Lower Limit field. This specifies the lower limit to be used in the Filtering and Replacement nodes with the Metadata limits method. The Filtering and Replacement nodes use this lower limit to respectively filter out or replace negative values of the selected variables.
Note: This is customer billing data, and negative values often imply that there is a credit applied to the customer's account. So it is realistic that there are negative numbers in these columns. However, in telecom data, it is a general practice to convert negative values to zeros. Note that you did not edit any variable values. Instead, you only set a metadata property that can be invoked using the Replacement node.
- Click the Pipelines tab.
- Select the Starter Template pipeline. Notice that, because of the change in metadata, the green check marks in the nodes in the pipeline have changed to gray circles. This indicates that the nodes need to be rerun to reflect the change.
- Add a Replacement node to the pipeline.
Note: The Replacement node can be used to replace outliers and unknown class levels with specified values. It is in this node that you invoke the metadata property of the lower limit that you set earlier.
Note: The following steps show the drag-and-drop method of adding the node. If you prefer, you can use the alternate method of adding a node that was shown in earlier practices.
- Expand the Nodes pane on the left side of the canvas.
- Expand Data Mining Preprocessing.
- Click the Replacement node and drag it between the Data node and the Imputation node.
- Hide the Nodes pane.
- In the properties panel for the Replacement node, specify the following settings in the Interval Variables section:
- Set Default limits method to Metadata limits.
- Change Alternate limits method to (none).
- Leave Replacement value at the default value, Computed limits.
- Run the Replacement node and view the results.
- In the results of the Replacement node, maximize the Interval Variables table. This table shows which variables now have a lower limit of 0.
The original variables will now be rejected. The new versions of the variables, which have REP_ prepended to the name, are now the valid input variables.
- Close the Interval Variables table.
- Close the results window of the Replacement node.
- To update the remainder of the pipeline, click the Run Pipeline button.
- When the run is complete, right-click the Model Comparison node and select Results.
- Maximize the Model Comparison table and view the performance results for the Logistic Regression model.
- Exit the maximized view of the Model Comparison table.
- Select Close to return to the pipeline.
Note: There is one more variable with a negative minimum value. Leave this variable unselected for the Demo project.
Note: Alternatively, you can assign metadata properties by using the Manage Variables node. You can use the Manage Variables node with the Replacement node to replace values in a single pipeline. In the Nodes pane, the Manage Variables node is in the Data Mining Preprocessing section. However, you do not use the Manage Variables node in the Demo project.
Machine Learning Using SAS® Viya®
Lesson 02, Section 2 Demo: Adding Text Mining Features
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we create new features using the Text Mining node. There are currently five text variables in the commsdata data set, but I will use only one of them, which is named verbatims. The verbatims variable represents free-form, unstructured data from a customer survey. Two of the other text variables are already rejected. I need to manually reject the remaining two text variables by using the Data tab.
Next to Pipelines, I click Data.
It might be helpful to sort by variable role. I'll right-click the Role column and select Sort > Sort (ascending).
I'll scroll to the end of the list. Notice that all the unrejected Text variables are listed together. I select the variables issue_level2 and resolution.
Now I'll change the role to Rejected.
To return to the Starter Template, I click the Pipelines tab and then select Starter Template.
I'll use the Nodes pane on the left to drag and drop a Text Mining node between the Imputation node and the Logistic Regression node. The Text Mining node is under Data Mining Preprocessing.
Now I'll hide the Nodes pane.
We'll keep all the properties of the Text Mining node at their default values. I'll right-click and run the Text Mining node.
When the run is complete, I'll open the node results. Several windows are available, including tables of Kept Terms and Dropped Terms. These tables include terms used and ignored respectively during the text analysis.
Notice that several terms include a plus sign. The plus sign indicates stemming. For example, + service represents service, services, serviced, and so on.
I scroll down and expand the Topics table. These 15 topics were created based on groups of terms that occur together in several documents. Each term-document pair is assigned a score for every topic. Thresholds are then used to determine whether the association is strong enough to consider whether the document or term belongs in the topic. Terms and documents can belong to multiple topics.
Because 15 topics were discovered, 15 new columns of inputs are created. The output columns contain SVD, or singular value decomposition, scores that can be used as inputs for the downstream nodes.
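Note: The topic scores are conceptually similar to a truncated singular value decomposition of a weighted term-by-document matrix. The Python sketch below shows that idea on a few made-up survey comments using two components instead of 15; it is not the Text Mining node's actual algorithm or settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "billing issue on my account",
    "dropped calls and poor service",
    "service was slow and the data issue returned",
    "account credit for a billing error",
]

# Weighted term-by-document matrix, then a truncated SVD: each document gets one score per component.
tfidf = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=1)   # the demo produces 15 components; 2 used here
doc_scores = svd.fit_transform(tfidf)                # rows = documents, columns = new interval inputs
print(doc_scores.round(3))
```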
I'll close the Topics table.
At the top of the Results window, I click the Output Data tab. Let's look at the attributes that we created.
I click View Output Data. Here, in the Sample Data window, I could take a sample of the data if I wanted to.
I click View Output Data again to open the output data table. I'll scroll to the right to see the column headings that begin with Score for. These columns are for new variables based on the topics created by the Text Mining node. For each topic, the SVD coefficients (or scores) are shown for each observation in the data set. Notice that the coefficients have an interval measurement level. The Text Mining node converts textual data into numeric variables, specifically interval variables. These columns will be passed along to subsequent nodes.
If you want to rearrange or hide columns, you can use the Manage Columns button.
I'll close the Results window.
Before rerunning the model, let's use the Manage Variables node to take another look at the new interval input columns. I right-click the Text Mining node and select Add child node > Data Mining Preprocessing > Manage Variables.
In the Run Node window, I click Close. Notice that Model Studio splits the pipeline path after the Text Mining node.
Now I run the Manage Variables node. And when the node finishes running, I open the results.
Let's expand the Output window. In the Incoming Variables table are the 15 new columns representing the dimensions of the SVD calculations based on the 15 topics discovered by the Text Mining node. These 15 columns, COL1 through COL15, serve as new interval inputs for subsequent models.
I close the Output window and the results.
Let's run the pipeline. When the pipeline run is complete, we can assess the performance of the model by looking at the results of the Model Comparison node. I'll right-click the Model Comparison node and select Results. Now I'll expand the Model Comparison table.
Adding these text features does not necessarily guarantee that the performance of the model will improve. We'll see whether any of the new attributes made it into the final model by looking at the results of the Logistic Regression node.
I close the Model Comparison table and the results.
I open the results of the Logistic Regression model node.
I scroll down and expand the Output window. Now, I scroll down to look at the table for Selection Summary. Notice that one of the columns created by text mining entered the model during the stepwise selection process.
I close the Output window and the results.
Machine Learning Using SAS® Viya®
Lesson 02, Section 2 Practice the Demo: Add Text Mining Features
In this practice, you create new features using the Text Mining node. You use the text variable verbatims, which is one of five text variables in the commsdata data source. Rejecting the other four text variables (Call_center, issue_level1, issue_level2, and resolution) requires a metadata change on the Data tab. You must make sure their role is set to Rejected.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Data tab.
- Make sure that any previously selected variables are deselected.
- To sort by variable name, right-click the Variable Name column and select Sort > Sort (ascending).
- Select the variables Call_center, issue_level1, issue_level2, and resolution.
- In the right pane, make sure that the role is set to Rejected. Rejecting these other text variables ensures that only the verbatims variable is used as an input for the Text Mining node.
- To return to the Starter Template pipeline, click the Pipelines tab and select Starter Template.
- Add a Text Mining node (which is in the Data Mining Preprocessing group) between the Imputation node and the Logistic Regression node. Note: Keep the default settings of the Text Mining node.
- Run the Text Mining node.
- When the run is finished, open the results of the Text Mining node. Many windows are available, including the Kept Terms table (which shows the terms used in the text analysis) and the Dropped Terms table (which shows the terms ignored in the text analysis).
Note: In the tables, the plus sign next to a word indicates stemming. For example, +service represents service, services, serviced, and so on.
- Maximize the Topics table. This table shows topics that the Text Mining node created based on groups of terms that occur together in several documents. Each term-document pair is assigned a score for every topic. Thresholds are then used to determine whether the association is strong enough to consider whether that document or term belongs in the topic. Terms and documents can belong to multiple topics. Fifteen topics were discovered, so fifteen new columns of inputs are created. The output columns contain SVD (singular value decomposition) scores that can be used as inputs for the downstream nodes.
- Close the Topics table.
- Click the Output Data tab, and then click View Output Data.
- In the Sample Data window, click View Output Data. Note: In the Sample Data window, you can choose to create a sample of the data to view. However, you do not do this for the Demo project.
- Scroll to the right to see the column headings that begin with Score for. These columns are for new variables based on the topics created by the Text Mining node. For each topic, the SVD coefficients (or scores) are shown for each observation in the data set. Notice that the coefficients have an interval measurement level. The Text Mining node converts textual data into numeric variables, specifically interval variables. These columns will be passed along to subsequent nodes.
Note: If you want to rearrange or hide columns, you can use the Manage Columns button.
- Close the Results window.
- Another way to see the 15 new interval input columns that were added to the data is to use the Manage Variables node. To add a Manage Variables node to the pipeline after the Text Mining node, right-click the Text Mining node and select Add child node > Data Mining Preprocessing > Manage Variables.
- When the Run Node message window appears, click Close. Notice that Model Studio splits the pipeline path after the Text Mining node.
- Run the Manage Variables node. When the Run Node message window appears again, click Close.
- When the node finishes running, open the results.
- Maximize the Output window to see the new columns (COL1 through COL15), which represent the dimensions of the SVD calculations based on the 15 topics discovered by the Text Mining node. These new columns serve as new interval inputs for subsequent models.
- Close the Output window and close the results.
- To run the entire pipeline, click the Run pipeline button.
- To assess the performance of the model, open the results of the Model Comparison node. Expand the Model Comparison table. Note: Adding text features does not necessarily improve the model.
- Close the Model Comparison table. Close the results of the Model Comparison node.
- To see whether any of the new variables entered the final model, open the results of the Logistic Regression node.
- Maximize the Output window.
- Scroll down to the Selection Summary table. Notice that one of the columns created by the Text Mining node entered the model during the stepwise selection process.
- Close the Output window and the results.
Machine Learning Using SAS® Viya®
Lesson 02, Section 3 Demo: Transforming Inputs
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we use the Transformations node to apply a numerical transformation to input variables. In an earlier demonstration, we explored inputs and saw that a few had a high measure of skewness. Let's revisit the results of the data exploration.
Notice that I need to rerun the Data Exploration pipeline because metadata properties have been defined.
I'll run the Data Exploration node. I'll view the results.
First, I expand the Interval Variable Moments table. Notice that five variables have a high degree of skewness, and their names start with uppercase MB_Data_Usg.
I close the Interval Variable Moments table and expand the Important Inputs plot.
Notice that the same MB_Data_Usg variables we just saw are also selected as important variables. Behind the scenes, importance is defined by a decision tree using PROC TREESPLIT.
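Note: Tree-based variable importance can be illustrated with any decision tree implementation. The Python sketch below fits a small classification tree to synthetic data and prints normalized importances; it is only an analogy to the PROC TREESPLIT importance that the Data Exploration node reports, not the same computation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=1)

tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X, y)

# Importances sum to 1; larger values indicate inputs the tree relied on more heavily.
for i, imp in enumerate(tree.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```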
I close the Important Inputs plot and the results.
I have identified five variables that I want to transform. I could define metadata transformations using the Manage Variables node, but instead I'll use the Data tab.
It might be helpful to sort by the variable name. I'll right-click the Variable Name column and select Sort > Sort (ascending).
I'm going to scroll down until I find the five MB variables that we identified. Notice that there are actually six variables whose names start with uppercase MB. Although the additional variable was not identified as important in the results that we just looked at, there's a good chance that this is also skewed. I want to transform all six.
To make sure that no other variables are selected, I'll click the name of the first uppercase MB variable. Then I select the check box for the other five.
In the right pane under Transform, I select Log.
To see that the transformation rule has been applied to these variables, I scroll to the right to see the Transform column.
Keep in mind that setting transformation rules does not perform the transformation. It only defines the metadata property. To apply the transformation, we need to use the Transformations node.
I return to the Starter Template pipeline.
I'll add a Transformations node between the Replacement node and the Imputation node. On the left, I'll expand the Nodes pane. I'll expand Data Mining Preprocessing. And I'll drag a Transformations node between the Replacement and Imputation nodes.
I'll hide the Nodes pane. I can zoom into the pipeline by using my mouse wheel.
I will not change the properties of the Transformations node. Although the Default interval inputs method property indicates (none), the metadata rules that I assigned to the variables on the Data tab override this default setting.
The idea is that you first change metadata on the Data tab, or by using the Manage Variables node, to specify what you want to do with the variables. Then you need to add a node to make those changes to the data (in this example, replacement or transformation). The subsequent node actually performs the changes that you encoded in metadata.
I'll run the Transformations node. When the run is complete, let's look at the results.
I'll expand the Transformed Variables Summary table. This table displays information including how the variables were transformed, the corresponding input variable, the formula applied, the variable level, the type, and the variable label. Notice that new variables have been created with the prefix LOG_ before the original variable name. The original versions of these variables are now rejected.
In the Formula column, notice that the formula for the log transformation includes an offset of 1 to avoid the case of log(0).
I close the Transformed Variables Summary window and the results.
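As a quick check of what the Log rule does, the sketch below reproduces the log(x + 1) formula with NumPy. It is illustrative only; the column name is hypothetical, and Model Studio applies the transformation itself when the Transformations node runs.

```python
# Illustrative sketch only: the log transformation with an offset of 1.
import numpy as np
import pandas as pd

df = pd.DataFrame({"MB_Data_Usg_M01": [0.0, 10.0, 250.0, 4000.0]})

# log(x + 1) is defined at x = 0, unlike log(x); NumPy provides log1p for this.
df["LOG_MB_Data_Usg_M01"] = np.log1p(df["MB_Data_Usg_M01"])
print(df)
```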
To assess the performance of the Logistic Regression model, let's run the pipeline.
I'll look at the results of the Model Comparison node and expand the Model Comparison table. Here, I can assess the performance of the logistic regression model.
I close the Model Comparison table and then the results.
Machine Learning Using SAS® Viya®
Lesson 02, Section 3 Practice the Demo: Transform Inputs
In this practice, you use the Transformations node to apply a numerical transformation to input variables. In an earlier practice, you explored interval inputs and saw that a few had a high measure of skewness. Here, you revisit the results of that data exploration.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Data Exploration pipeline tab.
The pipeline requires a rerun because metadata properties have been defined. - Right-click the Data Exploration node and select Run.
- Right-click the Data Exploration node and select Results.
- Expand the Interval Variable Moments table. Notice that five variables have a high degree of skewness and their names begin with MB_Data_Usg.
- Close the Interval Variable Moments table.
- Expand the Important Inputs plot. Notice that the same MB_Data_Usg variables have also been selected as important variables. Behind the scenes, Importance is defined by a decision tree using the TREESPLIT procedure.
- Close the Important Inputs plot.
- Close the Results window.
Now you are ready to define transformation rules in the metadata and apply the changes to the data. First, you change metadata on the Data tab to specify what you want to do with the variables. Note: The Manage Variables node is an alternative means of defining metadata transformations, but it is not used in this practice. - Click the Data tab.
- It might be helpful to sort by variable name. Make sure that all variables are deselected. Right-click the Variable Name column. Select Sort and then Sort (ascending).
- Scroll down until you see six variables whose names begin with (uppercase) MB_Data_Usg. Although only five of these were identified as important in the results that you just saw, there's a good chance that the other one is also skewed. It's a good idea to transform all six of them.
- To make sure that no other variables are selected, click the name of the first of the six MB_Data_Usg variables. Then select the check boxes for the other five variables. Note: Select only the variables whose names begin with uppercase MB.
- In the Multiple Variables window on the right, under Transform, select Log.
- To verify that the transformation rule has been applied to these variables, scroll right to display the Transform column. Notice that Log is displayed for each of the selected variables.
Note: Remember that setting transformation rules doesn't perform the transformation. It only defines the metadata property. You must use the Transformations node to apply the transformation. - To return to the Starter Template pipeline, click Pipelines, and then click the Starter Template tab.
- Add a Transformations node between the Replacement node and the Imputation node. Leave the Transformations node options at their default settings.
Note: Although the Default interval inputs method property indicates (none), the metadata rules that you assigned to the variables on the Data tab override this default setting. - Right-click the Transformations node and select Run.
- When the run is finished, right-click the node and select Results.
- Expand the Transformed Variables Summary table. This table displays information about the transformed variables, including how they were transformed, the corresponding input variable, the formula applied, the variable level, the type, and the variable label.
Notice that new variables have been created with the prefix LOG_ at the beginning of the original variable names. The original versions of these variables are now rejected.
Note: In the Formula column, notice that the formula for the Log transformations includes an offset of 1 to avoid the case of Log(0). - Close the Transformed Variables Summary window.
- Close the results.
- Run the entire pipeline to assess the performance of the logistic regression model.
- Open the results of the Model Comparison node and maximize the Model Comparison table. Here, you can assess the performance of the logistic regression model.
- Close the Model Comparison table and close the results.
Machine Learning Using SAS® Viya®
Lesson 02, Section 4 Demo: Selecting Features
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll make use of the Variable Selection node to reduce the number of inputs for modeling. In the starter template, I will place a Variable Selection node between the Text Mining node and the Logistic Regression node. I'll expand the left pane, expand Data Mining Preprocessing, and drag and drop Variable Selection.
I'll zoom in a bit to see the node names more clearly. I'll make sure the Variable Selection node is selected, and then look at the Properties pane. Varying combinations of criteria can be used to select inputs. Keep Combination criterion at Selected by at least 1. This means that any input selected by at least one of the selection criteria chosen is passed on to subsequent nodes as an input.
The Fast Supervised Selection method is selected by default. The Create Validation from Training property is also selected by default, but its button is initially disabled. I'll turn on two more methods: Unsupervised Selection and Linear Regression Selection. I'll first select Unsupervised Selection by clicking the button slider next to its name.
When I turn on the Unsupervised Selection method, additional options are shown. We'll keep the default settings for these additional properties. I can hide these new options by selecting the down arrow next to the property name. I'll also select the Linear Regression Selection method, and I'll minimize its properties.
Notice the Create Validation from Training property. This property was selected by default, but the slider button did not become active until one of the supervised methods was selected. This property specifies whether a validation sample should be created from the incoming training data.
It is recommended to create this validation set, even if the data have already been partitioned, so that only the training partition is used for variable selection, and the original validation partition can be used for modeling.
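A minimal sketch of that idea, using scikit-learn's train_test_split on a hypothetical training partition (X_train, y_train), is shown below. It is not what the node runs internally; it only illustrates the concept of holding out a selection-time validation sample so that the original validation partition stays untouched.

```python
# Illustrative sketch only: carving a validation sample out of the training
# partition for variable selection. X_train and y_train are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 5))
y_train = rng.integers(0, 2, size=1000)

# Selection methods are tuned on X_sel_train and checked on X_sel_valid;
# the project's original validation partition is reserved for modeling.
X_sel_train, X_sel_valid, y_sel_train, y_sel_valid = train_test_split(
    X_train, y_train, test_size=0.3, random_state=1, stratify=y_train
)
print(X_sel_train.shape, X_sel_valid.shape)
```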
I'll run the Variable Selection node. When the run is complete, I'll open the results. Many windows appear.
I'm going to start by expanding the Variable Selection table. The Variable Selection table shows the output role for the input variables after the node has processed them. On the right, notice the Reason column. A blank Reason field indicates those inputs that the node selected and will pass on.
If we scroll down in the Variable Selection table, we begin to see variables with text in the Reason field. These variables were rejected by the node. The Reason column indicates the reason for the rejection.
Recall that sequential selection (the default) is performed, and any variable rejected by this unsupervised method is not used by the subsequent supervised methods. The variables that are rejected by supervised methods are represented by Combination Criterion in the Reason column. To see which methods were involved in each combination, I'll take a look at the Variable Selection Combination Summary next.
I'll close the Variable Selection table and maximize the Variable Selection Combination Summary. For each variable, this table includes the result (that is, Input or Rejected) for each method that was used, the total count of each result, and the final output role (that is, Input or Rejected).
So, for example, if we look at the first variable, AVG_DAYS_SUSP, we see that in the Input column there's a count of 2, and in the Rejected column, a count of 0. That means that this variable was selected by two of the input criteria: Fast Selection and Linear Regression. The next variable, BILLING_CYCLE, has 0 in the Input column, and 2 in the Rejected column. It was rejected by two criteria: Fast and Linear Regression.
Sometimes, you can see a 1 in each column (Input and Rejected). In that case, the variable was selected by one method and rejected by the other; because the combination criterion is Selected by at least 1, it is still passed on as an input.
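The combination logic itself is simple to express. The sketch below shows the Selected by at least 1 rule applied to hypothetical method verdicts; it is illustrative only and is not the node's implementation.

```python
# Illustrative sketch only: the "Selected by at least 1" combination criterion.
# Variable names and verdicts are hypothetical.
verdicts = {
    "AVG_DAYS_SUSP": {"fast": "Input",    "linreg": "Input"},
    "BILLING_CYCLE": {"fast": "Rejected", "linreg": "Rejected"},
    "DAYS_OPEN_WO":  {"fast": "Rejected", "linreg": "Input"},
}

min_selections = 1  # Combination criterion: Selected by at least 1
for var, by_method in verdicts.items():
    n_input = sum(v == "Input" for v in by_method.values())
    role = "Input" if n_input >= min_selections else "Rejected"
    print(f"{var}: Input count={n_input}, final role={role}")
```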
I will close the Variable Selection Combination Summary table and the results. I will rerun the entire pipeline. The pipeline has run successfully. We could look at the results of the Logistic Regression model from the node, but because there's only one model in the pipeline, we can look at the performance of the model through the Model Comparison node.
I'll right-click the Model Comparison node and look at results. And I will go ahead and maximize the Model Comparison table, so that we can see other statistics describing the performance of the logistic regression model. I'll close that table and the results.
Machine Learning Using SAS® Viya®
Lesson 02, Section 4 Practice the Demo: Select Features
In this practice, you use the Variable Selection node to reduce the number of inputs for modeling.
Note: This is the task shown in the previous demonstration. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Starter Template pipeline.
- Add a Variable Selection node (in the Data Mining Preprocessing node group) between the Text Mining node and the Logistic Regression node.
- With the Variable Selection node selected, review the settings in the node properties panel on the right. In the properties, varying combinations of criteria can be used to select inputs. Notice the following default settings, which you will use:
- Combination Criterion is set to Selected by at least 1. This means that any input selected by at least one of the selection criteria chosen is passed on to subsequent nodes as an input.
- The Fast Supervised Selection method is selected by default.
- The Create Validation from Training property is also selected by default, but its button is initially disabled.
- In the properties panel, turn on the Unsupervised Selection and Linear Regression Selection methods by clicking the button slider next to each property name. When a property is turned on, additional options appear. You can hide the new options by selecting the down arrow next to the property name.
Keep the default settings for all the new options that appear for the Unsupervised Selection and Linear Regression Selection methods.
- Notice that the Create Validation from Training property was initially selected by default, but the slider button did not become active until you selected a supervised method above. This property specifies whether a validation sample should be created from the incoming training data. It is recommended to create this validation set even if the data have already been partitioned so that only the training partition is used for variable selection and the original validation partition can be used for modeling.
- Run the Variable Selection node.
- Right-click the Variable Selection node and select Results.
- Expand the Variable Selection table. This table contains the output role for the input variables after they have gone through the node. These variables have a blank cell in the Reason column, indicating that they have been selected and are passed on from the node.
- Scroll down in the Variable Selection table. For the variables that have been rejected by the node, the Reason column displays the reason for rejection.
Remember that sequential selection (the default) is performed, and any variable rejected by this unsupervised method is not used by the subsequent supervised methods. The variables that are rejected by supervised methods are represented by combination criteria (at least one in this case) in the Reason column. If you want to see whether they were selected or rejected by each method, look at the Variable Selection Combination Summary. - Close the Variable Selection table.
- Expand the Variable Selection Combination Summary table.
For each variable, this table includes the result (Input or Rejected) for each method that was used, the total count of each result, and the final output role (Input or Rejected). For example, for the variable AVG_DAYS_SUSP, the Input column has a count of 2, and the Rejected column has a count of 0. This means that this variable was selected by two of the input criteria: Fast Selection and Linear Regression. The variable BILLING_CYCLE has 0 in the Input column, and 2 in the Rejected column. It was rejected by two criteria: Fast and Linear Regression. The variable with the label Days of Open Work Orders has a count of 1 in the Input column, and 1 in the Rejected column. This means that this input was rejected by the Fast criterion, but it was selected by the Linear Regression criterion. The property Combination criterion is set to Selected by at least 1, so this variable is selected as an input because it was selected by at least one of the properties.
- Close the Variable Selection Combination Summary table.
- Close the results.
- Click the Run Pipeline button to rerun the pipeline.
- Right-click the Model Comparison node and select Results.
Note: As an alternative, you could view the results for the Logistic Regression model through the Logistic Regression node.
- Expand the Model Comparison table and view the statistics for the performance of the Logistic Regression model.
- Close the Model Comparison table and close the results.
Machine Learning Using SAS® Viya®
Lesson 02, Section 4 Demo: Saving a Pipeline to the Exchange
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we save the current pipeline to The Exchange so that it's available to other users. We'll use this pipeline later in the course. Next to the name Starter Template, click the Options menu, and select Save to The Exchange.
Provide a name for the saved pipeline. In this case, we'll call it CPML demo pipeline.
Because others will be able to view this pipeline in The Exchange, it might be helpful to add a description -- for example, maybe including some of the nodes or the models in the pipeline. Here, I'll just enter a brief description (Logistic regression pipeline) and then click Save.
To see the saved pipeline in the Exchange, we need to exit the current project.
To go directly to the Exchange, I click its button in the left panel. In the Exchange, expand Pipelines and then select Data Mining and Machine Learning. A list of pipeline templates appears on the right. Towards the bottom of that list, we see the CPML demo pipeline. The description is Logistic regression pipeline. This is the pipeline that we just saved.
To exit the Exchange and return to our Demo project in Model Studio, click the Projects button in the upper left corner.
Machine Learning Using SAS® Viya®
Lesson 02, Section 4 Practice the Demo: Save a Pipeline to the Exchange
In this practice, you save the Starter Template pipeline to the Exchange. You use this pipeline later in the course. (Remember that the Exchange is a place where users can save pipelines, and find pipeline templates created by other users, as well as best practice pipelines that are provided by SAS. However, in this course, you do not use pipelines created by other users.)
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, next to the Starter Template tab, click the Options menu and select Save to The Exchange.
- Change the name of the pipeline to CPML demo pipeline. For the description, enter Logistic regression pipeline. Click Save.
- To go to the Exchange, click its button in the left panel.
- In the left pane of the Exchange, expand Pipelines and select Data Mining and Machine Learning. The CPML demo pipeline that you just saved appears in the list of pipeline templates.
Note: In SAS Viya for Learners, in the Exchange, you will likely see other users' pipelines. If there are multiple CPML demo pipelines, make sure you select the one that you created. Check the Owner column for your email address and the Last Modified column for the date and time that you created your pipeline. - To exit the Exchange and return to the Demo project in Model Studio, click the Projects button in the upper left corner.
Machine Learning Using SAS® Viya®
Lesson 03, Section 1 Demo: Building a Decision Tree Model Using the Default Settings
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we build a decision tree model using the default settings. I build that model in our Demo project, starting with a pipeline that I can get from the Exchange.
First, I add a new pipeline. In the New Pipeline window, I specify the name Tree Based, because we'll build many tree-based models in this pipeline.
Under Select a pipeline template, I'll click the down arrow. From the list, I select CPML demo pipeline and then click OK. In the New Pipeline window, I click OK. Notice that this is a copy of the pipeline from the Starter Template, but no nodes have been run yet.
I'll add a Decision Tree node after the Variable Selection node. I right-click the Variable Selection node and select Add child node, Supervised Learning, and Decision Tree. Notice that this creates a parallel path in the pipeline with the Logistic Regression node. I'll keep all the properties of the Decision Tree node in their default state.
Now I run the Decision Tree node. When it finishes running, I'll open the results.
When you run this demonstration, don't be surprised if your results look a bit different from what you see here. Remember that SAS Viya is a distributed computing environment. So, for many nodes, the results are likely to vary slightly every time the node runs. For example, this is true of most of the Supervised Learning nodes.
I'll highlight a few of the windows in the results.
To start, I'll expand the tree diagram in the upper left corner. The tree diagram presents the final tree structure for this model, such as the depth of the tree and all end leaves. If you hover over a leaf, a tooltip appears, showing information about that leaf. The leaf that I'm currently pointing at has 469 observations. Of these observations, approximately 94% are event cases and about 6% are nonevent cases. To see the splitting rules, you can hover over the branches. This information is helpful in interpreting the tree. I'll close the tree diagram.
Further down, I'm going to expand the Pruning Error plot. This plot is based on the misclassification rate because we have a binary target. The plot shows the change in misclassification rate on training and validation data as the tree grows or as more leaves are added to the tree. The blue line represents the training data, and the orange line represents the validation data. We see that for the training data, the misclassification rate consistently decreases. In other words, it improves as the size of the tree grows. However, for the validation data, the misclassification rate decreases, for the most part, as the size of the tree grows. But there are a few scenarios where the misclassification rate increases. We can see that the selected subtree contains 51 leaves after we have optimized complexity. Starting at this tree and for the next few trees, notice that the misclassification rate actually increases, which means that it's getting worse. I'll close the Pruning Error plot.
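For readers who want to see the same idea outside Model Studio, the sketch below traces misclassification rate on training and validation data across subtrees of different sizes, using scikit-learn's cost-complexity pruning path on synthetic data. It is only an analogy; the node's pruning is performed by the TREESPLIT procedure.

```python
# Illustrative sketch only: misclassification rate versus tree size on
# synthetic data, in the spirit of the Pruning Error plot.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:           # sample a few subtrees
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=1).fit(X_tr, y_tr)
    mcr_tr = 1 - tree.score(X_tr, y_tr)       # training misclassification rate
    mcr_va = 1 - tree.score(X_va, y_va)       # validation misclassification rate
    print(f"leaves={tree.get_n_leaves():3d}  train={mcr_tr:.3f}  valid={mcr_va:.3f}")
```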
Next, I'll maximize the Variable Importance table. This table shows the final variables selected by the decision tree and their relative importance. The most important input variable has the relative importance 1. All others are measured based on the most important input. In this case, notice that the decision tree selected ever_days_over_plan as the most important variable. The Importance Standard Deviation column shows the dispersion of the importance taken over several partially independent trees. So, for a single tree, this column has all zero values. For forest and gradient boosting, the numbers would be nonzero. I'll close the Variable Importance table.
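The relative scaling in that table is easy to reproduce. The sketch below fits a single scikit-learn tree on synthetic data and rescales its importances so that the top input equals 1; it is illustrative only and does not mirror the node's importance calculation exactly.

```python
# Illustrative sketch only: importances rescaled so the top input equals 1.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=6, random_state=1)
tree = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X, y)

imp = pd.Series(tree.feature_importances_, index=[f"x{i}" for i in range(6)])
relative = (imp / imp.max()).sort_values(ascending=False)  # top input gets 1.0
print(relative.round(3))
# For a single tree there is no spread across trees, so the importance
# standard deviation would be zero, as noted above.
```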
Further down in the results are several code windows, one for each type of code that Model Studio generates. Supervised Learning nodes can generate as many as three types of score code (node score code, path EP score code, and DS2 package score code) as well as training code. You learn more about score code later in the course.
Finally, I'll maximize the Output window. In this window, we can see that the TREESPLIT procedure is the underlying procedure for this node. This window also shows the final decision tree model parameters, the Variable Importance table, and the pruning iterations. I'll close the Output window.
Now I'll click the Assessment tab. Four tables are shown.
First, I'll expand the Lift Reports. A Cumulative Lift plot is shown by default. We can interpret the plot as a comparison of the performance of the model at certain depths of the data ranked by the posterior probability of the event compared to a random model. Ideally, you want to see a lift greater than 1, which means that your model is outperforming a random model. Lift and cumulative lift are discussed in more detail later. Notice the information on the right that helps you interpret the plot. I'll close the Lift Reports plot.
Because our data set has a binary target, the ROC Reports plot is also available. A ROC chart appears by default. I'll expand it. The ROC chart plots sensitivity against 1 minus specificity for varying cutoff values. Sensitivity is defined as the true positive rate. And 1 minus specificity is defined as the false positive rate. Again, the information on the right helps you interpret the plot. You learn more about the ROC chart, along with sensitivity and specificity, later in the course.
I'll close the ROC Reports window and expand the Fit Statistics table. Here, we see the performance of the final model on the training and validation data sets. A useful fit statistic to consider is average squared error. Notice that the average squared error on validation data is 0.0720. As you move forward with modifying your model, you might want to write down the values of the fit statistics that you use to assess performance. That way, you can see whether your model is improving. I'll close the Fit Statistics table.
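If you want to recompute these assessment measures yourself from predicted probabilities, the sketch below shows average squared error, one sensitivity and 1-minus-specificity pair, and cumulative lift at a 10% depth. The arrays y_valid and p_valid are hypothetical; Model Studio computes these statistics for you.

```python
# Illustrative sketch only: common assessment statistics from predicted
# event probabilities. y_valid and p_valid are hypothetical arrays.
import numpy as np

rng = np.random.default_rng(1)
y_valid = rng.integers(0, 2, size=1000)                  # 1 = churn event
p_valid = np.clip(y_valid * 0.6 + rng.normal(0.3, 0.2, 1000), 0, 1)

# Average squared error: mean squared difference between outcome and probability.
ase = np.mean((y_valid - p_valid) ** 2)

# One point on the ROC curve, using a 0.5 cutoff.
pred = (p_valid >= 0.5).astype(int)
sensitivity = np.mean(pred[y_valid == 1] == 1)           # true positive rate
one_minus_spec = np.mean(pred[y_valid == 0] == 1)        # false positive rate

# Cumulative lift at 10% depth: event rate in the top decile vs. overall rate.
top = np.argsort(-p_valid)[: len(p_valid) // 10]
cum_lift_10 = y_valid[top].mean() / y_valid.mean()

print(f"ASE={ase:.4f}  sens={sensitivity:.3f}  1-spec={one_minus_spec:.3f}  lift@10%={cum_lift_10:.2f}")
```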
The fourth window shows the Event Classification chart.
I'll close the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 1 Practice the Demo: Build a Decision Tree Model Using the Default Settings
In this practice, you build a decision tree model, using the default settings, in the Demo project. You build the model in a new pipeline based on a template from the Exchange.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Pipelines tab.
- To add a new pipeline, click the plus sign (+) next to the Starter Template tab.
- In the New Pipeline window, enter Tree Based in the Name field.
Note: Entering a description for the pipeline in the New Pipeline window is optional.
- Under Select a pipeline template, click the down arrow to browse templates.
- In the Browse Templates window, select CPML demo pipeline.
Click OK.
- In the New Pipeline window, click OK.
- In the Tree Based pipeline, notice that the new pipeline is a copy of the pipeline from Starter Template, but no nodes have been run.
- Add a Decision Tree node (from the Supervised Learning group) after the Variable Selection node.
Keep all properties for the Decision Tree node at their default settings.
- Right-click the Decision Tree node and select Run.
- Right-click the Decision Tree node and select Results.
In the results of the Decision Tree node (the Node tab) are several charts and plots to help you evaluate the model's performance.
Explore the windows and plots that are described below:
Note: Remember that your results might vary from the results in the demonstration video, which are described below.
The first plot is the Tree Diagram, which presents the final tree structure for this particular model, such as the depth of the tree and all end leaves. If you place your cursor on a leaf, a tooltip appears, giving you information about that particular leaf, such as the number of observations, the percentage of these that are event cases, and the percentage of nonevent cases. To see a splitting rule, you can place your cursor on a branch. This information is helpful in interpreting the tree.
The Pruning Error plot is based on the misclassification rate because the target is binary. The plot shows the change in misclassification rate on training and validation data as the tree grows, that is, as more leaves are added to the tree. The blue line represents the training data and the orange line represents the validation data. In this plot, for the training data, does the misclassification rate consistently decrease? If so, it improves as the size of the tree grows. For the validation data, you probably see that the misclassification rate decreases for the most part as the size of the tree grows, but there are a few points where it increases. In the demonstration, the selected subtree contains 51 leaves after complexity is optimized; starting at this tree and for the next few trees, the misclassification rate actually increases, which means that it is getting worse.
The Variable Importance table shows the final variables selected by the decision tree and their relative importance. The most important input variable has the relative importance 1. All others are measured based on the most important input. In this case, notice that the decision tree selected ever_days_over_plan as the most important variable. The Importance Standard Deviation column shows the dispersion of the importance taken over several partially independent trees. So, for a single tree, this column has all zero values. For forest and gradient boosting, the numbers would be nonzero.
Farther down in the results are several code windows, one for each type of code that Model Studio generates. Supervised Learning nodes can generate as many as three types of score code (node score code, path EP score code, and DS2 package score code) as well as training code. You learn more about score code later in the course.
The Output window shows that the TREESPLIT procedure is the underlying procedure for the Decision Tree node. It also shows the final decision tree model parameters, the Variable Importance table, and the pruning iterations. - Click the Assessment tab. Explore the windows and plots that are described below:
In the Lift Reports window, the Cumulative Lift plot is shown by default. We can interpret the plot as a comparison of the performance of the model at certain depths of the data ranked by the posterior probability of the event compared to a random model. Ideally, you want to see a lift greater than 1, which means that your model is outperforming a random model. Lift and cumulative lift are discussed in more detail later. Notice the information on the right that helps you interpret the plot.
Because our data set has a binary target, the ROC Reports plot is also available. A ROC chart appears by default. The ROC chart plots sensitivity against 1 minus specificity for varying cutoff values. Sensitivity is defined as the true positive rate. And 1 minus specificity is defined as the false positive rate. Again, the information on the right helps you interpret the plot. You learn more about the ROC chart, along with sensitivity and specificity, later in the course.
- The Fit Statistics table shows the performance of the final model on the training and validation data sets. A useful fit statistic to consider is average squared error. Take note of the average squared error on validation data.
Note: As you move forward with modifying your models in this course, you might want to write down the values of the fit statistics that you use to assess performance. This enables you to see whether your model is improving.
- Notice that the fourth window on the Assessment tab shows the Event Classification chart.
- Close the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 2 Demo: Modifying the Structure Parameters
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll alter the current decision tree model in our pipeline by changing the tree structure parameters.
In the Tree Based pipeline, I'll select the Decision Tree node. In the node's properties pane, I'll expand Splitting Options. I'll make three changes. Scrolling down, the first change I will make is to increase the maximum depth from 10 to 14. This allows for a larger tree to be grown, which could lead to overfitting. I'll also increase the minimum leaf size from 5 to 15. This change might actually help prevent overfitting. And my final change is to increase the number of interval bins to 100.
Farther up, notice the property Maximum number of branches. This setting specifies the maximum number of branches that a splitting rule produces. We will use the default number of splits, which is 2.
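For reference, the sketch below shows roughly analogous settings in scikit-learn's DecisionTreeClassifier. The mapping is approximate (the node runs PROC TREESPLIT, and scikit-learn has no direct equivalent of the interval-bins option); X_train and y_train are hypothetical names for prepared data.

```python
# Illustrative sketch only: roughly analogous structure parameters in scikit-learn.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=14,          # allow a deeper tree (was 10), which could overfit
    min_samples_leaf=15,   # require larger leaves (was 5) to help prevent overfitting
    # There is no direct counterpart to "Number of interval bins"; binning of
    # interval inputs is specific to the Decision Tree node's split search.
)
# tree.fit(X_train, y_train)   # X_train and y_train are hypothetical
print(tree.get_params()["max_depth"], tree.get_params()["min_samples_leaf"])
```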
Now I'll run the Decision Tree node.
The run is complete, so we'll look at results. The results windows are similar to those we saw for the default tree. Let's look at the performance of this decision tree.
I'll click the Assessment tab. We could assess performance by looking at the Lift chart or ROC chart. However, let's maximize and look at the Fit Statistics table. Notice that the average squared error on validation data is now 0.0646, slightly smaller than before. And smaller means that this model is performing better. Keep in mind that modifying a model does not always result in better performance. I'll close the Fit Statistics table and the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 2 Practice the Demo: Modify the Structure Parameters
In this practice, you modify the tree structure parameters of the Decision Tree node that you added earlier in the Tree Based pipeline.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Tree Based pipeline, select the Decision Tree node.
- In the properties panel for the Decision Tree node, expand Splitting Options. Make the following changes:
Note: The property Maximum number of branches specifies the maximum number of branches that a splitting rule produces. Use the default number of splits, which is 2. - Increase Maximum depth from 10 to 14. This allows for a larger tree to be grown, which could lead to overfitting.
- Increase Minimum leaf size from 5 to 15. This change could help prevent overfitting.
- Increase Number of interval bins to 100.
- Right-click the Decision Tree node and select Run.
- Right-click the Decision Tree node and select Results.
- To look at performance of the decision tree, click the Assessment tab.
- In the Fit Statistics table, note the average squared error for the decision tree model on the VALIDATE partition. Is this fit statistic value slightly smaller than for the previous model? If so, this indicates that this model is performing better than the first model using the default settings. Keep in mind that modifying a model does not always result in better performance.
Note: To assess performance, you could also look at the Lift chart or the ROC chart. - Close the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 3 Demo: Modifying the Recursive Partitioning Parameters
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we will further alter the decision tree by modifying the recursive partitioning parameters and then assess its performance.
Let's make sure that the Decision Tree node is selected, so that its properties are shown in the right pane. I'll expand the Splitting Options. I'll expand the Grow Criterion properties. I'll change Class target criterion from Information gain ratio to Gini.
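To see what the criterion change means, the sketch below computes the Gini decrease and the information gain for one hypothetical split. (The gain ratio used by the default criterion additionally divides the information gain by the entropy of the split itself; that step is omitted here for brevity.) This is illustrative only, not the procedure's implementation.

```python
# Illustrative sketch only: Gini impurity versus entropy for a candidate split.
import numpy as np

def gini(p):
    return 1 - np.sum(np.square(p))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Class proportions in a node before and after a hypothetical binary split.
parent = np.array([0.5, 0.5])
left, right = np.array([0.9, 0.1]), np.array([0.2, 0.8])
w_left, w_right = 0.6, 0.4          # fraction of observations in each branch

gini_gain = gini(parent) - (w_left * gini(left) + w_right * gini(right))
info_gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))
print(f"Gini decrease={gini_gain:.3f}  information gain={info_gain:.3f}")
```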
Then we run the Decision Tree node. When the run is complete, we'll open the results and, again, I'll go straight to the average squared error of the final model.
From the Results window, I'll click the Assessment tab. I'll scroll down to the Fit Statistics table and maximize it. Notice that the average squared error on validation data is now 0.0608. In this case, a decrease in average squared error indicates an improved fit in the model.
Close the Fit Statistics table and the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 3 Practice the Demo: Modify the Recursive Partitioning Parameters
In this practice, you change more settings of the Decision Tree node in the Tree Based pipeline. You modify the recursive partitioning parameters and compare this model performance to the models built earlier.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Tree Based pipeline, make sure that the Decision Tree node is selected.
- In the properties panel for the Decision Tree node, under Grow Criterion, change Class target criterion from Information gain ratio to Gini.
- Right-click the Decision Tree node and select Run.
- Right-click the Decision Tree node and select Results.
- Click the Assessment tab.
In the Fit Statistics table, take note of the average squared error for the Decision Tree model on the VALIDATE partition. If there is a decrease in average squared error, this indicates an improved fit in the model based on changing the recursive partitioning parameters. - Close the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 4 Demo: Modifying the Pruning Parameters
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll continue to alter the decision tree -- this time, by modifying the pruning parameters. In the Properties pane, I'll expand Pruning Options. And under Pruning Options, I'll change Subtree method from Cost complexity to Reduced error. Now I'll run the Decision Tree node.
The run is complete, so we'll look at the results. And we'll assess performance by looking at average squared error on validation data.
Click the Assessment tab. I'll scroll down and expand the Fit Statistics table. Average squared error on validation data is 0.0608. Recall that changes to any of these properties do not guarantee improvement of the model. In this case, average squared error stayed the same as in the previous model.
When you perform this demonstration, take a minute to write down the value of the average squared error for this model. Also, write down the value of KS shown in a column further to the right. You might need to scroll. You will use these values later to compare the current model with a model that you create in the next practice.
Now let's run the Model Comparison node and see how the decision tree compares to the logistic regression model also in the pipeline. Close the Fit Statistics table and the results.
I'll rerun the entire pipeline. When the run is complete, I'll open the results of the Model Comparison node. To compare the models, I'll expand the Model Comparison table.
We see that the decision tree is currently selected as the champion model. Recall that, by default, the champion is selected based on the KS statistic value, where larger is better. Even if we compare the average squared error values for the two models, we get a smaller average squared error for the decision tree than the logistic regression model, which indicates that the decision tree performs better on that statistic.
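For reference, the KS statistic can be computed from predicted probabilities as the largest gap between the cumulative distributions for events and nonevents. The sketch below is illustrative only; y_valid and p_valid are hypothetical arrays, and Model Studio reports KS for you.

```python
# Illustrative sketch only: the Kolmogorov-Smirnov (KS) statistic.
import numpy as np

rng = np.random.default_rng(1)
y_valid = rng.integers(0, 2, size=1000)
p_valid = np.clip(y_valid * 0.5 + rng.normal(0.3, 0.2, 1000), 0, 1)

thresholds = np.unique(p_valid)
tpr = np.array([(p_valid[y_valid == 1] >= t).mean() for t in thresholds])
fpr = np.array([(p_valid[y_valid == 0] >= t).mean() for t in thresholds])
ks = np.max(np.abs(tpr - fpr))
print(f"KS = {ks:.3f}")   # larger is better when comparing models
```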
Close the Model Comparison window and the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 4 Practice the Demo: Modify the Pruning Parameters
In this practice, you continue to modify the settings of the Decision Tree node in the Tree Based pipeline. You modify the pruning parameters and compare this model performance to the models built earlier.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Tree Based pipeline, make sure that the Decision Tree node is selected.
- In the properties panel, scroll down and expand Pruning Options.
- Change Subtree method from Cost complexity to Reduced error.
- Right-click the Decision Tree node and select Run.
- Right-click the Decision Tree node and select Results.
- Click the Assessment tab and expand the Fit Statistics table. Is the average squared error for this decision tree model the same as before on the VALIDATE partition? Remember that changing properties does not guarantee improvement in model performance.
- Close the Fit Statistics table.
- Close the Results window.
- Click the Run pipeline button.
- Right-click the Model Comparison node and select Results.
The Model Comparison table shows which model is currently the champion from the Tree Based pipeline. This is based on the default fit statistic KS. Even if you compare the average squared error values for the two models, you likely get a smaller average squared error for the decision tree than the logistic regression model, which indicates that the decision tree performs better on that statistic. - Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 03, Section 5 Demo: Building a Gradient Boosting Model
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll build a gradient boosting model using the default settings. I'll briefly discuss the results, and then change some settings and build a second gradient boosting model.
I'll add a Gradient Boosting node to the pipeline. I'll keep all the properties of the Gradient Boosting node at their default settings and run that node. When the run is complete, I'll open the results.
In the upper left corner, I'll maximize the Error plot. This plot shows the performance of the model as the number of trees increases. By default, the statistic is average squared error. We can see that the trends for average squared error are decreasing -- that is to say, improving -- on both training and validation data sets. I'll close the Error plot.
The results also include a table of variable importance.
We see several windows associated with scoring, and the Output window. The Output window indicates that this node uses the GRADBOOST procedure.
Let's look at the average squared error of the final model. I'll click the Assessment tab. We see Lift reports, ROC reports, the Event Classification chart, and the Fit Statistics table. I'll maximize the Fit Statistics table. The average squared error on validation data for our final gradient boosting model is 0.0570. Close the Fit Statistics table and the results.
Now, I'll change some of the properties of the gradient boosting model. First, I make sure the Gradient Boosting node is selected. In the properties panel, I'll reduce the number of trees from 100 to 50.
Farther down, under Tree-splitting Options, I increase the Maximum depth from 4 to 8. I can change this value by moving the slider or clicking the slider dot and entering the value.
I also increase the Minimum leaf size from 5 to 15, and increase the Number of interval bins from 50 to 100.
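For readers more familiar with open-source tooling, the sketch below maps these settings onto roughly analogous parameters of scikit-learn's GradientBoostingClassifier. The node itself runs PROC GRADBOOST, so this is an analogy only; X_train and y_train are hypothetical.

```python
# Illustrative sketch only: roughly analogous gradient boosting settings in scikit-learn.
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=50,       # number of trees reduced from 100 to 50
    max_depth=8,           # maximum depth increased from 4 to 8
    min_samples_leaf=15,   # minimum leaf size increased from 5 to 15
    # "Number of interval bins" has no direct equivalent in this library.
)
# gb.fit(X_train, y_train)   # X_train and y_train are hypothetical
print(gb.get_params()["n_estimators"], gb.get_params()["max_depth"])
```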
Now I run the Gradient Boosting node.
When the run is complete, I open the results. To assess performance, let's look at the average squared error on training data.
Click the Assessment tab. I'll scroll down and expand the Fit Statistics table. We can see that, compared to the gradient boosting model with the default settings, we get a slight improvement in average squared error. Here, average squared error on validation data is 0.0555.
When you perform this demonstration, take a minute to write down the value of the average squared error for this model. Also, write down the value of KS shown in a column further to the right. You might need to scroll. You will use these values later to compare the current model with a different model that you create.
Close the Fit Statistics table and the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 5 Practice the Demo: Build a Gradient Boosting Model
In this practice, you add a Gradient Boosting node to the Tree Based pipeline. You first run the gradient boosting model with default settings. You then change some settings and compare the model to the other models in the pipeline.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, make sure that the Tree Based pipeline is selected.
- Add a Gradient Boosting node (from the Supervised Learning group) after the Variable Selection node.
- Keep all properties for the Gradient Boosting node at their defaults.
- Run the Gradient Boosting node.
- Open the results for the Gradient Boosting node.
- Maximize the Error plot. This plot shows the performance of the model as the number of trees increases based on average squared error.
In this case, the trends for average squared error are decreasing (improving) on both validation and training data sets. The results also include a table of variable importance. You see several windows associated with scoring, and the Output window.
- Close the Error plot.
- Notice the Variable Importance plot and score code windows.
- The Output window indicates that the underlying procedure for the Gradient Boosting node is PROC GRADBOOST.
- Click the Assessment tab.
- Maximize the Fit Statistics table and note the average squared error on the VALIDATE partition.
- Close the Fit Statistics table.
- Close the Results window.
- With the Gradient Boosting node selected, make the following changes to the node properties:
- Reduce Number of trees from 100 to 50 in the properties panel.
- Under Tree-splitting Options, increase Maximum depth from 4 to 8. To change the value of Maximum depth, you can either move the slider or manually enter a value in the box.
- Increase Minimum leaf size from 5 to 15.
- Increase Number of interval bins from 50 to 100.
- Run the Gradient Boosting node.
- Open the results for the Gradient Boosting node.
- Click the Assessment tab and scroll down to the Fit Statistics table. Note the average squared error for this gradient boosting model on the VALIDATE partition. Is the value of this fit statistic slightly better than for the first gradient boosting model, which was based on the default settings?
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 03, Section 5 Demo: Building a Forest Model
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll add a Forest model to the Tree Based pipeline and build the forest using the default settings. Then I'll change some of the settings in an attempt to improve the model.
I'll start by adding a Forest node under the Variable Selection node. I'll keep the Forest node properties at their default settings and run the node.
Let's look at results. Many of the windows in the results are similar to those from the Gradient Boosting model. Remember that both of these models are ensembles of decision trees.
At the top, we see an Error plot indicating the performance of the model as the number of trees increases. This plot contains three lines that show performance on the training data, the validation data, and the out-of-bag sample, respectively.
We see a table of variable importance and the same code output windows as we saw for gradient boosting. The Output window shows that FOREST is the underlying procedure.
Now I'll click the Assessment tab. Here we see the lift reports and the ROC reports. Scrolling down, we see the Fit Statistics table. Notice that the average squared error on validation data is 0.0574.
Let's see if we can improve on the performance of the model. I'll close the Results window and make sure that the Forest node is selected. In the properties panel, the first thing I'll do is decrease the number of trees from 100 to 50. Under Tree-splitting Options, I'll change the Class target criterion from Information gain ratio to Entropy. I'll decrease the Maximum depth from 20 to 12. I'll increase the Minimum leaf count from 5 to 15. I'll increase the Number of interval bins to 100.
The default number of inputs to consider per split is the square root of the total number of inputs. I'll clear the check box for that option. Then I'll set the value of Number of inputs to consider per split to 7, which is about half the number of inputs that come from the Variable Selection node.
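As a rough open-source analogy, the sketch below shows comparable settings in scikit-learn's RandomForestClassifier, including an explicit number of inputs per split and tracking of the out-of-bag sample. The node itself runs PROC FOREST; X_train and y_train are hypothetical.

```python
# Illustrative sketch only: roughly analogous forest settings in scikit-learn.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=50,       # number of trees reduced from 100 to 50
    criterion="entropy",   # class target criterion changed to Entropy
    max_depth=12,          # maximum depth decreased from 20 to 12
    min_samples_leaf=15,   # minimum leaf count increased from 5 to 15
    max_features=7,        # inputs per split (default is the square root of the inputs)
    oob_score=True,        # track performance on the out-of-bag sample
)
# rf.fit(X_train, y_train)   # X_train and y_train are hypothetical
print(rf.get_params()["max_features"], rf.get_params()["criterion"])
```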
Now I'll run the Forest node.
When the run is complete, we'll look at the results. I'll click the Assessment tab and look at the Fit Statistics table. We see that, on the validation data set, average squared error is 0.0572, which decreased a little bit. Remember that smaller is better for the average squared error, so this model does outperform the prior forest by just a bit.
When you perform this demonstration, take a minute to write down the value of the average squared error for this model. Then maximize the Fit Statistics table and write down the value of KS. You will use these values later to compare against another model that you build.
Let's see how the forest compares to the other models in the pipeline. I'll close the results of the forest. And I'll run the entire pipeline so that we can see the results of model comparison.
When the run is complete, I'll look at the results of model comparison. And we see that forest is the champion model from the pipeline based on KS. I'll close the results.
Machine Learning Using SAS® Viya®
Lesson 03, Section 5 Practice the Demo: Build a Forest Model
In this practice, you add a Forest node to the Tree Based pipeline. You first build a forest model using the default settings. You then change some of the settings and compare the model to the other models in the pipeline.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Tree Based pipeline, add a Forest node (from the Supervised Learning group) after the Variable Selection node.
- Keep all properties for the Forest node at their default setting.
- Right-click the Forest node and select Run.
- Right-click the Forest node and select Results. Note: Many of the items in the Results window for the forest model are similar to items that you saw in the results for the gradient boosting model in a previous practice.
The Error plot shows the performance of the model as the number of trees increases. This plot contains three lines that show performance on the training data, the validation data, and the out-of-bag sample, respectively. You see a table of variable importance and the same code output windows as you saw for gradient boosting.
The Output window shows that the underlying procedure is the FOREST procedure. - Click the Assessment tab.
The Fit Statistics table shows the average squared error on the VALIDATE partition. - Close the Results window.
- Make sure that the Forest node is selected. In the node properties panel on the right, make the following changes:
- Reduce Number of trees from 100 to 50.
- Under Tree-splitting Options, change Class target criterion from Information gain ratio to Entropy.
- Decrease Maximum depth from 20 to 12.
- Increase Minimum leaf count from 5 to 15.
- Increase Number of interval bins to 100.
- The default number of inputs to consider per split is the square root of the total number of inputs. Clear the check box for this option and set Number of inputs to consider per split to 7, about half the number of inputs that come from the Variable Selection node.
- Run the Forest node.
- Open the results for the Forest node.
- Click the Assessment tab and scroll down to the Fit Statistics table. Take note of the average squared error for this forest model on the VALIDATE partition. Did this fit statistic decrease a small amount? If so, this model is a little bit better than the first model, which used the default settings.
- Close the Results window.
- To see how the forest model compares to the other models in the pipeline, click the Run pipeline button.
- Right-click the Model Comparison node and select Results. How does the performance of the forest model compare to the other models in the pipeline?
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 04, Section 1 Demo: Building a Neural Network Using the Default Settings
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll add a new pipeline to our Demo project and build a neural network model using the default settings.
I'll click the plus sign to add a new pipeline. We'll name the pipeline Neural Network, and we'll select the CPML demo pipeline as the starting template.
Under Select a pipeline template, I click the down arrow. Notice that the CPML demo pipeline appears in this menu because we used it in a previous demonstration. I'll select that template and then click OK.
I'll add a Neural Network node beneath the Variable Selection node. When the Neural Network node is selected, we can see its properties. I'll keep all the properties at their default settings.
I'll run the Neural Network node.
When the run is complete, we'll look at the results to evaluate the model's performance.
The first plot in the upper left corner is the network diagram. This plot represents the final neural network structure for the model, including the hidden layer and the hidden units. I'll maximize the plot so that we can take a quick look. And then I'll close it.
The iteration plot shows the model's performance based on the validation error throughout the training process as new iterations are added.
As usual, we see the score code windows.
The Output window shows the result from the NNET procedure: the final neural network model parameters, the iteration history, and the optimization process. I'll maximize the Output window and scroll down briefly. Then I'll close it.
We can further assess the performance of the model from the Assessment tab. In the upper left corner, as usual, we see the cumulative lift plot showing the model's performance ordered by the percentage of the population. In the upper right corner, we see the ROC curve. This curve shows the model's performance by considering the true positive rate and the false positive rate.
Scrolling down, we see the Fit Statistics table. I'll maximize the table. Notice that the average squared error on validation is 0.0737.
I'll close the Results window.
Machine Learning Using SAS® Viya®
Lesson 04, Section 1 Practice the Demo: Build a Neural Network Using the Default Settings
In this practice, you create a new pipeline in the Demo project, using the CPML demo pipeline. You build a neural network model using the default settings.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the plus sign next to the Tree Based pipeline tab to add a new pipeline.
- In the New Pipeline window, enter information about the new pipeline:
- Enter Neural Network in the Name field.
- Under Select a pipeline template, click the down arrow to browse templates. Select the CPML demo pipeline. The CPML demo pipeline appears in this menu because you used it in a previous practice.
- Click Save.
- In the Neural Network pipeline, add a Neural Network node (from the Supervised Learning group) after the Variable Selection node.
- Select the Neural Network node to activate its properties panel. Keep all properties for the Neural Network node at their defaults.
- Right-click the Neural Network node and select Run.
- Right-click the Neural Network node and select Results.
Explore the following charts and plots, which help you evaluate the model's performance:
The Network Diagram presents the final neural network structure for this model, including the hidden layer and the hidden units.
The Iteration plot shows the model's performance based on the validation error throughout the training process as new iterations are added.
As usual, you see the score code windows.
The Output window shows the results from the NNET procedure: the final neural network model parameters, the iteration history, and the optimization process. - Click the Assessment tab, and explore the results. Note the following:
In the Lift Reports window, the Cumulative Lift plot shows the model's performance ordered by the percentage of the population. This plot is very useful for selecting the model based on a particular target of the customer base.
For a binary target, you also have the ROC curve in the ROC Reports window. The ROC curve shows the model's performance considering the true positive rate and the false positive rate.
The Fit Statistics table shows the model's performance based on various assessment measures, such as average squared error. Note the average squared error on validation data. - Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 04, Section 2 Demo: Modifying the Neural Network Architecture
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll change some of the parameters of the Neural Network model, in an attempt to improve its performance. Make sure the Neural Network node is selected.
In the properties panel, I'll start by changing the input standardization from Midrange to Z score. I'll expand the Hidden Layer options and clear the check box for Use the same number of neurons in hidden layers. And under Custom Hidden Layer Options, I'll enter 26 for Hidden layer 1: number of neurons. That's about twice as many neurons as the number of inputs coming from the Variable Selection node.
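For reference, the two standardization methods transform an interval input x in the standard ways:
\[ z\text{-score: } \; \frac{x - \bar{x}}{s}, \qquad \text{midrange: } \; \frac{x - (\max + \min)/2}{(\max - \min)/2}, \]
so z-score standardization centers each input at its mean with unit standard deviation, while midrange standardization rescales the input to roughly the interval [-1, 1].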
Under Target Layer Options, notice the Direct connections property. In the future, if you want to create a skip layer perceptron, select this check box.
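As a side note on what the Direct connections option does: in a skip-layer perceptron, the inputs feed the output layer directly in addition to passing through the hidden layer. Schematically, for a single hidden layer,
\[ \hat{y} = g\Bigl( b + \sum_{k} w_k \, h_k(x) + \sum_{j} v_j \, x_j \Bigr), \]
where the h_k(x) are the hidden-unit outputs and the v_j terms are the extra input-to-output (skip) weights that the option adds.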
Run the Neural Network node.
When the run is complete, I'll open the results. I'll go straight to the performance of the model on validation data as defined by average squared error.
I'll click the Assessment tab and scroll down. In this case, we got quite an improvement in average squared error on validation data. We're down to 0.0694.
I'll close the results.
Machine Learning Using SAS® Viya®
Lesson 04, Section 2 Practice the Demo: Modify the Neural Network Architecture
In this practice, you modify the network architecture parameters of the neural network model with the intent to improve performance.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, select the Neural Network node.
- In the properties panel for the node, make the following changes:
- Change Input standardization from Midrange to Z score.
- Expand the Hidden Layer options. Clear the check box for Use the same number of neurons in hidden layers.
- Under Custom Hidden Layer Options, enter 26 for Hidden layer 1: number of neurons. This is about twice as many as the number of inputs coming from the Variable Selection node.
Note: Under Target Layer Options, notice the Direct connections property. In the future, if you want to create a skip layer perceptron, select this check box.
- Right-click the Neural Network node and select Run.
- Right-click the Neural Network node and select Results.
- Click the Assessment tab.
In the Fit Statistics table, take note of the average squared error for this neural network model on the VALIDATE partition. Is this fit statistic value better than for the first model (which used the default settings)?
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 04, Section 3 Demo: Modifying the Learning and Optimization Parameters
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll change some of the optimization parameters of the neural network in an attempt to improve the model. We'll also run the Model Comparison node and compare the final model to the Logistic Regression model that is already in the pipeline.
The options for optimizing the model are in the properties pane of the Neural Network node. Under Common Optimization Options, notice the options Maximum iterations and Maximum time. These options control early stopping. For this model, I do not want to change these options.
Further down are the options that control weight decay. I'll increase the L1 weight decay from 0 to 0.01. I'll decrease the L2 weight decay from 0.1 to 0.0001.
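Conceptually, the two weight decay settings add penalty terms to the training objective. In standard notation (the exact parameterization used by the NNET procedure may differ),
\[ \min_{w} \; E(w) + \lambda_1 \sum_{j} \lvert w_j \rvert + \lambda_2 \sum_{j} w_j^2, \]
where E(w) is the training error, \lambda_1 is the L1 weight decay, and \lambda_2 is the L2 weight decay. The L1 term pushes small weights toward zero, and the L2 term shrinks large weights.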
I'll run the Neural Network node.
When the run is complete, I'll open the results. Again, we'll measure model performance by looking at the average squared error on validation data.
Click the Assessment tab and scroll down to see the Fit Statistics table. The validation average squared error is now 0.0676.
When you perform this demonstration, take a minute to write down the value of the average squared error and the value of KS for this model. You will use these values later to compare against a different model you build.
I'll close the results.
Let's run the pipeline. When the run is complete, I'll open the results from the Model Comparison node to see which is the champion model from the pipeline. We see that the neural network is the champion model in this pipeline, based on KS.
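For reference, the KS (Kolmogorov-Smirnov) statistic used here is, in its standard form, the maximum separation between the true positive rate and the false positive rate across probability cutoffs:
\[ \text{KS} = \max_{c} \; \bigl| \text{TPR}(c) - \text{FPR}(c) \bigr|. \]
Larger KS values indicate better separation between the events and nonevents.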
I'll close the results.
Machine Learning Using SAS® Viya®
Lesson 04, Section 3 Practice the Demo: Modify the Learning and Optimization Parameters
In this practice, you modify the learning and optimization parameters of the neural network model in the Neural Network pipeline, and compare the model performance to the performance of the logistic regression model already in the pipeline.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Neural Network pipeline, make sure that the Neural Network node is selected.
- In the properties panel for the node, make the following changes:
- Under Common Optimization Options, increase L1 weight decay from 0 to 0.01.
Note: The options Maximum iterations and Maximum time control early stopping. For this model, do not change these options.
- Decrease L2 weight decay from 0.1 to 0.0001.
- Right-click the Neural Network node and select Run.
- Right-click the Neural Network node and select Results.
- Click the Assessment tab and scroll down to the Fit Statistics table. Take note of the average squared error for this neural network model on the VALIDATE partition.
- Close the Results window.
- To identify the champion model in this pipeline, do the following:
- Click the Run pipeline button to run the entire pipeline.
- Right-click the Model Comparison node and select Results.
The neural network model is the champion model of the pipeline, based on the default statistic, KS.
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 05, Section 1 Demo: Building a Support Vector Machine Using the Default Settings
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, I'll add a new pipeline to our Demo project and build a support vector machine using the default settings.
I add a new pipeline. I'll name this pipeline Support Vector Machine. As the template, I'll select CPML Demo Pipeline. And I'll save it.
I'll add a Support Vector Machine node under the Variable Selection node. To make sure that the Support Vector Machine node is active, I select it. Its properties are now visible in the properties pane on the right, but I'll leave the default settings.
Now I'll run the Support Vector Machine node. When the run is complete, I'll open the results. The Results window contains several charts and plots to help evaluate the model's performance. In the upper left corner is the Fit Statistics table. This table shows the support vector machine's performance, based on several assessment measures.
In the upper right corner, I'll maximize the Training Results window. This table shows the parameters of the final support vector machine model, such as the number of support vectors and the bias, which is the offset that defines the support vector machine. I'll close the Training Results window.
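For reference, the support vectors and the bias together define the fitted decision function. In standard notation,
\[ f(x) = \sum_{i \in SV} \alpha_i \, y_i \, K(x_i, x) + b, \]
where the sum runs over the support vectors, K is the kernel function (linear by default in this node), and b is the bias reported in the Training Results table. The sign of f(x) determines the predicted class.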
As with previous models, we see score code windows. And we see the Output window showing the results from the SVMACHINE procedure.
I'll maximize the Output window. We see the support vector machine model parameters, the training results, the iteration history, the misclassification matrix, the fit statistics, and the predicted probability variables. I'll close the Output window.
To see the model performance results, I'll click the Assessment tab. As usual, we see the lift reports, the ROC reports, the Event Classification plot, and the Fit Statistics table. I'll maximize the table. Notice that the average squared error on validation is 0.1127.
I'll close the table and the results.
Machine Learning Using SAS® Viya®
Lesson 05, Section 1 Practice the Demo: Build a Support Vector Machine Using the Default Settings
In this practice, you create a new pipeline based on the CPML demo pipeline, and add a Support Vector Machine (SVM) node to it. You build the support vector machine model using the default settings.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the plus sign next to the Neural Network tab to add a new pipeline.
- In the New Pipeline window, enter the following information:
- In the Name field, enter Support Vector Machine.
- Under Select a pipeline template, select CPML demo pipeline.
- Click Save.
- Add a Support Vector Machine (SVM) node (from the Supervised Learning group) under the Variable Selection node.
- Select the SVM node.
- In the properties panel, keep all properties for the SVM node at their defaults.
- Run the SVM node.
- Open the results for the SVM node. Explore the following charts and plots, which help you evaluate the model's performance:
- The Fit Statistics table presents several assessment measures that indicate the performance of the support vector machine model.
- The Training Results table shows the parameters for the final support vector machine model, such as the number of support vectors and the bias, which is the offset that defines the support vector machine.
- As with previous models, you see score code windows.
- The Output window shows the final support vector machine model parameters, the training results, the iteration history, the misclassification matrix, the fit statistics, the predicted probability variables, and the underlying procedure (the SVMACHINE procedure).
- Click the Assessment tab. As usual, you see the lift reports, the ROC reports, the Event Classification plot, and the Fit Statistics table. In the Fit Statistics table, take note of the average squared error on the VALIDATE partition.
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 05, Section 2 Demo: Modifying the Methods of Solution Parameters
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we modify one of the key methods of solution parameters for the support vector machine, in an attempt to improve the model performance. In the properties of the support vector machine, notice the Penalty property. The Penalty value balances model complexity and training error. A larger Penalty value creates a more robust model at the risk of overfitting the training data.
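Conceptually, the Penalty property corresponds to the penalty parameter (often written C) in the standard soft-margin formulation:
\[ \min_{w, b, \xi} \; \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{i} \xi_i \quad \text{subject to} \quad y_i \bigl( w^{\top} x_i + b \bigr) \ge 1 - \xi_i, \; \xi_i \ge 0. \]
A larger penalty puts more weight on reducing the slack terms (training errors), and a smaller penalty tolerates more margin violations, producing a smoother, more regularized boundary.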
I'll change the penalty from 1 to 0.1. Now I'll run the Support Vector Machine node.
When the run is complete, I'll open the results. To assess the model's performance, we'll look at the average squared error on the validation data set.
I'll select the Assessment tab, and scroll down to the Fit Statistics table. Notice that the average squared error on validation data has decreased to 0.0972.
Close the results.
Machine Learning Using SAS® Viya®
Lesson 05, Section 2 Practice the Demo: Modify the Methods of Solution Parameters
In this practice, you modify one of the key methods of solution parameters for the support vector machine model in an attempt to improve its performance.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Support Vector Machine pipeline, make sure that the SVM node is selected.
- In the properties panel, change Penalty from 1 to 0.1. The Penalty value balances model complexity and training error. A larger Penalty value creates a more robust model at the risk of overfitting the training data.
- Run the SVM node.
- Right-click the SVM node and select Results.
- Click the Assessment tab.
- The Fit Statistics table shows the average squared error on validation data. Has the value increased or decreased?
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 05, Section 3 Demo: Increasing the Flexibility of the Support Vector Machine
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we alter the support vector machine model by modifying three options: the kernel function, the tolerance, and maximum iterations. Then we compare the model's performance to the logistic regression model that is already in the pipeline.
In the properties pane for the Support Vector Machine node, change the Kernel setting from Linear to Polynomial. Selecting Polynomial activates the Polynomial degree property underneath. We leave that set at 2. In Model Studio, only degrees of 2 and 3 are available.
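For reference, a polynomial kernel of degree d commonly has the form
\[ K(u, v) = \bigl( u^{\top} v + 1 \bigr)^{d}, \]
although the exact constants are an implementation detail. With d = 2, the decision boundary can curve quadratically in the original input space, which makes the model more flexible than one with the linear kernel.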
The next option is Tolerance. The Tolerance value balances the number of support vectors and model accuracy. A Tolerance value that is too large creates too few support vectors. And a value that is too small overfits the training data.
I'll increase the Tolerance to 0.6. Then I'll decrease the Maximum iterations property from 25 to 10.
Now I'll run the Support Vector Machine node.
When the run is complete, I'll look at the results. To assess model performance, I'll look at the average squared error on validation data.
I'll click the Assessment tab. In the Fit Statistics table, we see that average squared error on validation data is now 0.0950.
When you perform this demonstration, take a minute to write down the average squared error and KS values for this model. You'll use these values later to compare the current model against a different model that you'll build.
I'll close the results.
Now I'll run the pipeline. When the run is complete, we'll look at the results of the Model Comparison node to determine the champion model from this pipeline. Based on KS, we see that the logistic regression model is the champion from this pipeline.
I'll close the results.
Machine Learning Using SAS® Viya®
Lesson 05, Section 3 Practice the Demo: Increase the Flexibility of the Support Vector Machine
In this practice, you attempt to improve the performance of the support vector machine by modifying three options: the kernel function, the tolerance, and maximum iterations. You then compare the model performance to the logistic regression model in the pipeline.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Support Vector Machine pipeline, make sure that the SVM node is selected.
- In the properties pane for the node, make the following changes:
- Change Kernel from Linear to Polynomial. Leave Polynomial degree as 2.
Note: In Model Studio, only degrees of 2 and 3 are available.
- Increase Tolerance from 0.000001 to 0.6.
Note: The Tolerance value balances the number of support vectors and model accuracy. A Tolerance value that is too large creates too few support vectors. A value that is too small overfits the training data.
- Decrease Maximum iterations from 25 to 10.
- Run the SVM node.
- Open the results for the support vector machine model.
- Click the Assessment tab. Scroll down to the Fit Statistics table and take note of the average squared error on validation data. Is this fit statistic better than for the previous model?
- Close the Results window.
- To determine the champion model from this pipeline, run the Model Comparison node by clicking the Run Pipeline button.
- Look at the results of the Model Comparison node. Based on the KS statistic (the default), which model is the champion from this pipeline?
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 05, Section 3 Demo: Adding Model Interpretability
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we look at the Model Interpretability feature in Model Studio. Although we are working with a support vector machine here, we can use model interpretability for any of the models that we build in this course.
In the Support Vector Machine pipeline, I'll select the Support Vector Machine node. In the properties pane for the support vector machine, under Post-training Properties, I'll expand Model Interpretability and specify several options.
I'll expand Global Interpretability and select both Variable Importance and PD Plots.
I'll expand Local Interpretability and select ICE plots, LIME, and Kernel SHAP.
I'll scroll down and change the maximum number of Kernel SHAP variables to 10. This means that 10 inputs are displayed in the chart, ordered by importance according to the absolute Kernel SHAP values.
Note that the Random observations option provides explanations for five randomly selected observations. If we wanted to, we could select five observations from the data instead. However, we won't do that.
I'm ready to run the Support Vector Machine node. When the run is finished, I open the results. In the Results window, notice that a new tab, Model Interpretability, now appears after the Node and Assessment tabs. I'll click the Model Interpretability tab.
I'll expand the Surrogate Model Variable Importance table. The most important inputs are listed in descending order of their importance. ever_days_over_plan appears to be the most important predictor. Relative importance is based on simple decision trees that are only one level deep. We will see that the inputs to the PD and the ICE plots are the top predictors from this table.
I'll close the table and expand the Partial Dependence plot. The plot shows the marginal effect of a feature (in this case, ever_days_over_plan) on the predicted outcome of the model we just fit.
The prediction function is fixed at a few values of the chosen feature and averaged over the other features.
On the right side of the plot, notice that a concise description of this PD plot appears. Many plots on the Model Interpretability tab provide this type of description.
A PD plot can show whether the relationship between the target and the feature is linear, monotonic, or more complex. Here, there is a positive linear relationship. The longer that a customer's usage exceeds plan limits, the more likely the customer is to churn.
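For reference, a partial dependence value is simply the model's average prediction with the chosen feature pinned at a value. In standard notation, for a chosen feature x_s and the remaining features x_c,
\[ \text{PD}(x_s) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\bigl( x_s, \, x_c^{(i)} \bigr), \]
that is, the prediction function is evaluated at the fixed value x_s combined with each observation's observed values of the other features, and the results are averaged.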
Now let's look at the relationship between the model's predictions and a different variable, using the View chart menu. Why does this menu list only five variables? These are the five most important inputs in the model, based on a one-level decision tree for all inputs used to predict the predicted values from the model.
I'll select the categorical variable handset_age_grp. This PD plot indicates that the highest probability of churn is associated with the middle age group (that is, the middle level of the variable): handsets between 24 and 48 months old. The newest handsets (those less than 24 months old) have the next highest probability of churn. And the oldest handsets have the lowest probability of churn. Does this make business sense? Yes, it does. A new device has a lower probability of churn because the customer hasn't had time to test it out yet. At the other end, if a customer has had a handset for more than four years, that customer probably likes it.
I'll close the PD plot and expand the PD and ICE Overlay plot.
This is a combined plot of the partial dependency results and the individual conditional expectation results overlaid. We see six lines because five are for ICE and one is the PD.
ICE plots can help reveal interesting subgroups and interactions between model variables. For a chosen feature, an ICE plot shows how changes in the feature relate to changes in the prediction for individual observations. This ICE plot shows one line for each of the five randomly chosen observations, as specified in the node properties that we saw earlier.
This ICE plot shows churn probability by ever_days_over_plan. Each line represents the conditional expectation for one customer instance. The plot indicates that for all five instances, there is a consistent increase in the probability of churn as ever_days_over_plan increases, given that other features are constant.
ICE plots help reveal interesting customer subgroups and interactions between model variables. These relationships are not apparent in PD plots because they are averaged out. When evaluating an ICE plot of an interval input, the most useful feature to observe is intersecting slopes. Intersecting slopes indicate that there is an interaction between the plot variable and one or more complementary variables. ever_days_over_plan does not show any interactions.
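For reference, an ICE curve is the un-averaged counterpart of partial dependence. For a single observation i,
\[ \text{ICE}_i(x_s) = \hat{f}\bigl( x_s, \, x_c^{(i)} \bigr), \]
and the PD curve is the average of the ICE curves. That is why roughly parallel ICE lines (no intersecting slopes), as seen here for ever_days_over_plan, suggest little interaction.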
Let's look at the ICE plot for a different variable. From the View chart menu, I'll select handset_age_grp.
When evaluating an ICE plot of a categorical input, it is useful to look among individual observations for different relationships between the groups (or levels) of the categorical variable and the target. Significant differences in these relationships indicate group effects.
Five individuals are represented in this plot, with the average predicted probability of churn calculated separately for each individual, across all levels of handset_age_grp. For this variable, the trend of observing the lowest probability in the oldest handset age group holds true for all five individuals.
I'll close the PD and ICE Overlay plot.
The last two plots that we will examine, LIME and Shapley, are created by explaining individual predictions. In a given feature space, Shapley values help you determine where you are, how you got there, and how influential each variable is at that location. This is in contrast to LIME values, which help you determine how changes in a variable's value affect the model's prediction.
Let's begin by expanding the LIME Explanations plot.
This LIME plot displays the regression coefficient for the inputs selected in a local surrogate linear regression model. This surrogate model fits the predicted probability of the event (1) for the target (churn) for each of the five randomly chosen observations. In the chart, the inputs are ordered by significance, with the most significant input for the local regression model appearing at the bottom of the chart.
The LASSO technique is used to select the most significant effects from the set of inputs that was used to train the model. A positive estimate indicates that the observed value of the input increases the predicted probability of the event. For example, the value of 0 for delinq_indicator decreases the predicted probability of the event (1) for the target (churn) by 0.1516 compared to the individual having a different value for delinq_indicator.
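Conceptually, LIME fits a simple, sparse linear model around one observation so that it mimics the complex model locally. In the generic formulation (the exact sampling and weighting scheme is an implementation detail),
\[ \hat{g} = \arg\min_{g \in G} \; \sum_{k} \pi_x(z_k) \bigl( \hat{f}(z_k) - g(z_k) \bigr)^2 + \Omega(g), \]
where the z_k are perturbed samples near the observation x, \pi_x weights each sample by its closeness to x, and \Omega(g) is a sparsity penalty (here, LASSO selection). The coefficients of \hat{g} are the estimates displayed in the plot.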
I'll close the LIME Explanations plot and expand the Kernel SHAP Values plot.
Unlike LIME coefficients, Shapley values do not come from a local regression model. For each individual observation, an input's Shapley value is the contribution of the observed value of the input to the predicted probability of the event (1) for the target (churn). The Shapley values of all inputs sum to the predicted value of that local instance. The inputs are displayed in the chart, ordered by importance according to the absolute Kernel SHAP values, with the most significant input appearing at the bottom of the chart.
The Kernel SHAP values are the regression coefficients that are obtained by fitting a weighted least squares regression. Note that each nominal input is binary encoded based on whether it matches the individual observation. Interval inputs are binary encoded based on their proximity to the individual observation with a value of 1 if the observation is close to the local instance.
To eliminate the bias of collinearity in regression, Shapley values average across all permutations of the features joining the model. Therefore, Shapley values control for variable interaction.
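For reference, the additivity property described above can be written as
\[ \hat{f}\bigl( x^{(i)} \bigr) = \phi_0 + \sum_{j} \phi_j^{(i)}, \]
where \phi_0 is a baseline value (roughly, the average prediction) and \phi_j^{(i)} is the Shapley value of input j for observation i. Positive Shapley values push the predicted probability of the event above the baseline, and negative values push it below.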
I'll close the Kernel SHAP Values plot and then close the results.
Now I'll run the entire pipeline. Let's look at the results of the model comparison. The SVM is the champion of this pipeline based on KS.
I'll close the results.
Machine Learning Using SAS® Viya®
Lesson 05, Section 3 Practice the Demo: Add Model Interpretability
In this practice, you use the Model Interpretability feature to provide some explanation about the support vector machine model.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, in the Support Vector Machine pipeline, select the SVM node.
- In the properties pane for the SVM node, make the following changes:
- Under Post-training Properties, expand Model Interpretability.
- Expand Global Interpretability and select both Variable importance and PD plots.
- Expand Local Interpretability and select the check boxes for ICE plots, LIME, and Kernel SHAP.
- Under Maximum number of Kernel SHAP variables, move the slider to change the number to 10. This means that 10 inputs are displayed in the chart, ordered by importance according to the absolute Kernel SHAP values.
- Notice that Specify instances to explain is set to Random. This setting provides explanations for five randomly selected observations. Although you will not change this setting now, note that it is possible to select five observations from the data instead.
- Run the SVM node.
- Open the results for the SVM node.
- Notice that there is a new tab in addition to the Node and Assessment tabs: Model Interpretability. Click the Model Interpretability tab.
- Expand the Surrogate Model Variable Importance table. The most important inputs are listed in descending order of their importance. What appears to be the most important predictor? Relative importance is based on simple decision trees that are only one level deep. You will see that inputs to the PD and ICE plots are the top predictors from this table.
- Expand the PD plot. This plot shows the marginal effect of a feature (in this case, ever_days_over_plan) on the predicted outcome of the model that you just fit.
The prediction function is fixed at a few values of the chosen feature and averaged over the other features.
A PD plot can show whether the relationship between the target and the feature is linear, monotonic, or more complex.
On the right side of the plot, notice that a concise description of this PD plot appears. Many plots on the Model Interpretability tab provide this type of description.
- To look at the relationship between the model's predictions and a different variable, use the View chart menu. The menu shows the five (by default) most important inputs in the model, based on a one-level decision tree for all inputs used to predict the predicted values from the model.
Select the categorical variable handset_age_grp. This PD plot indicates that the highest probability of churn is associated with the middle age group (that is, the middle level of the variable): handsets between 24 and 48 months old. The newest handsets (those less than 24 months old) have the next highest probability of churn. And the oldest handsets have the lowest probability of churn. This makes sense from a business standpoint. A new device has a lower probability of churn because the customer hasn't had time to test it out yet. At the other end, if a customer has had a handset for more than four years, they probably like it.
- Close the PD plot.
- Expand the PD and ICE Overlay plot. This is a combined plot of the partial dependency results and the individual conditional expectation results overlaid. There are six lines because five are for ICE and one is the PD.
ICE plots can help reveal interesting subgroups and interactions between model variables. For a chosen feature, an ICE plot shows how changes in the feature relate to changes in the prediction for individual observations. This ICE plot shows one line for each of the five randomly chosen observations, as specified in the node properties that you saw earlier.
This ICE plot shows churn probability by ever_days_over_plan. Each line represents the conditional expectation for one customer instance. The plot indicates that for all five instances, there is a consistent increase in the probability of churn as ever_days_over_plan increases, given that other features are constant. ICE plots help reveal interesting customer subgroups and interactions between model variables. These relationships are not apparent in PD plots because they are averaged out.
When evaluating an ICE plot of an interval input, the most useful feature to observe is intersecting slopes. Intersecting slopes indicate that there is an interaction between the plot variable and one or more complementary variables. ever_days_over_plan does not show any interactions.
- Look at the ICE plot for a different variable. From the View chart menu, select handset_age_grp.
When evaluating an ICE plot of a categorical input, it is useful to look among individual observations for different relationships between the groups (or levels) of the categorical variable and the target. Significant differences in these relationships indicate group effects. Five individuals are represented in this plot, with the average predicted probability of churn calculated separately for each individual, across all levels of handset_age_grp. For this variable, the trend of observing the lowest probability in the oldest handset age group holds true for all five individuals.
- Close the PD and ICE Overlay plot.
- Notice the LIME and Shapley plots. These plots are created by explaining individual predictions. In a given feature space, Shapley values help you determine where you are, how you got there, and how influential each variable is at that location. This is in contrast to LIME values, which help you determine how changes in a variable's value affect the model's prediction.
- Expand the LIME Explanations plot.
This LIME plot displays the regression coefficient for the inputs selected in a local surrogate linear regression model. This surrogate model fits the predicted probability of the event (1) for the target churn for each of the five randomly chosen observations. In the chart, the inputs are ordered by significance, with the most significant input for the local regression model appearing at the bottom of the chart.
The LASSO technique is used to select the most significant effects from the set of inputs that was used to train the model. A positive estimate indicates that the observed value of the input increases the predicted probability of the event. For example, in the demo video, the value of 0 for delinq_indicator decreases the predicted probability of the event (1) for the target churn by 0.1516 compared to the individual having a different value for delinq_indicator. Note: When you perform this demo, your results might differ.
- Close the LIME Explanations plot.
- Expand the Kernel SHAP Values plot.
Unlike LIME coefficients, Shapley values do not come from a local regression model. For each individual observation, an input's Shapley value is the contribution of the observed value of the input to the predicted probability of the event (1) for the target churn. The Shapley values of all inputs sum to the predicted value of that local instance. The inputs are displayed in the chart, ordered by importance according to the absolute Kernel SHAP values, with the most significant input appearing at the bottom of the chart.
The Kernel SHAP values are the regression coefficients that are obtained by fitting a weighted least squares regression. Note that each nominal input is binary encoded based on whether it matches the individual observation. Interval inputs are binary encoded based on their proximity to the individual observation with a value of 1 if the observation is close to the local instance. To eliminate the bias of collinearity in regression, Shapley values average across all permutations of the features joining the model. Therefore, Shapley values control for variable interaction.
- Close the Kernel SHAP Values plot.
- Close the results.
- The Model Comparison node needs to be run because you turned on Model Interpretability in the SVM node above it. Run the entire pipeline and view the results of model comparison. In the demo video, the SVM model is the champion of this pipeline based on KS. Note: When you perform these steps, a different model might be the champion.
- Close the results.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Comparing Models within a Pipeline
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll run and interpret the Model Comparison node in the Tree Based pipeline. We'll use this pipeline because it has more models built than the other pipelines.
Select the Tree Based pipeline. We'll use the default assessment measure for binary targets found in the Project Settings window, in the Rules properties. Note that the default statistic for class selection is the Kolmogorov-Smirnov statistic. Click Cancel.
In the Tree Based pipeline, select the Model Comparison node. Recall that, for a single pipeline, you can change the fit statistic for class targets in the properties for this node.
To be sure that we're looking at the most recent results from the Model Comparison node, let's rerun that node.
When the run is complete, we'll view the results. At the top, we see the Model Comparison table. This table shows the champion model that was selected based on the default statistic -- in this case, KS.
Here, we see that the champion model for this pipeline is the forest. Scrolling down, we see the Properties table. The Properties table shows the criteria used to evaluate the models and select the champion.
Click the Assessment tab and expand Lift Reports. The lift report shows results based on the response percentage. Using the menu in the upper left corner, you can also choose to see the model's performance based on the captured response percentage, cumulative captured response percentage, cumulative response percentage, cumulative lift, gain, and lift. Close the Lift Reports plot.
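For reference, cumulative lift at a given depth compares the response rate among the top-scored cases with the overall response rate:
\[ \text{cumulative lift at depth } d = \frac{\text{response rate in the top } d\% \text{ of scored cases}}{\text{overall response rate}}, \]
so, for example, a cumulative lift of 3 at a depth of 10% means the model finds events three times as often in that top decile as random selection would.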
The second plot shows the ROC report based on accuracy. Maximize this plot. Using the menu in the upper left corner, you can also see the model's performance based on the F1 score and ROC. Close the ROC Reports plot.
Scroll down to see the Fit Statistics table and maximize it. The Fit Statistics table shows how each model in the pipeline performs on the data partitions defined in the project settings for a series of fit statistics, such as area under the ROC curve, average squared error, Gini coefficient, and KS.
Close the Fit Statistics table and the results.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Practice the Demo: Compare Models within a Pipeline
In this practice, you run and interpret the Model Comparison node in the Tree Based pipeline. You compare the models' performances based on different fit statistics.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, select the Tree Based pipeline. Note: You are using this pipeline because it has more models built than the other pipelines.
- Use the default assessment measure for binary targets found in the Project Settings window, in the Rules properties. To view these settings, do the following:
- Click the Settings button in the upper right corner of the project window, and select Project settings.
- In the left pane of the Project Settings window, select Rules. The default statistic for class selection is the Kolmogorov-Smirnov (KS) statistic.
- Click Cancel.
- In the Tree Based pipeline, select the Model Comparison node.
Note: You can also change the fit statistics in the properties for this node.
- To make sure that you're looking at the most recent results from the Model Comparison node, right-click the Model Comparison node and select Run.
- Right-click the Model Comparison node and select Results.
The Model Comparison table shows the champion model based on the default statistic (in this case, KS).
- Scroll down to see the Properties table. The Properties table shows the criteria used to evaluate the models and select the champion.
- Click the Assessment tab and expand the Lift Reports plot.
The lift report shows results based on the response percentage. Using the menu in the upper left corner, you can also choose to see the model's performance based on the captured response percentage, cumulative captured response percentage, cumulative response percentage, cumulative lift, gain, and lift.
- Close the Lift Reports plot.
- Expand the ROC Reports plot.
The ROC Reports plot is based on Accuracy, by default. Using the menu in the upper right corner, you can also see the models' performances based on the F1 Score and ROC.
- Close the ROC Reports plot.
- Expand the Fit Statistics table.
The Fit Statistics table shows how each model in the pipeline performs on the data partitions defined in the project settings (train, validate, and test) for a series of fit statistics, such as Area Under ROC, Average Squared Error, Gini Coefficient, and KS, among others.
- Close the Fit Statistics table.
- Close the Results window.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Comparing Models across Pipelines
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we'll run the pipeline comparison. Pipeline comparison enables you to compare the best models from each pipeline in the project and identify an overall project champion.
I'll click the Pipeline Comparison tab. The table at the top lists the champion model from each pipeline and the model that was selected as the overall champion across pipelines (the champion of champions). The overall champion is selected by default in the list. The champion is indicated by a star in the Champion column.
Below the comparison table at the top, the remaining charts and tables show information about only the selected model. They summarize the performance of the model, show the Variable Importance list of the model, provide training and score code, and show other outcomes.
I can also choose to see additional tables and plots for some or all of the listed models. In the table at the top, I can select the check boxes next to the models I want to compare. To quickly select all models, I'll select the check box in the top left corner.
When multiple models are selected, the Compare button in the upper right corner is activated. I'll click Compare now.
In the Compare results, you can compare assessment statistics and graphics across the models currently selected on the Pipeline Comparison tab. Close the Compare results window.
If you want, you can manually add a challenger model to the pipeline comparison. To do this, return to the pipeline that contains the desired model. I'll go to the Tree Based pipeline. I'll right-click the model that I want to add (the gradient boosting model) and select Add challenger model from the popup menu. The selected model now appears on the Pipeline Comparison tab. Now there is a Challenger column to indicate the challenger model. In this case, I will not run the comparison again.
I want to register the overall champion model in a later demonstration, so I'll go ahead and clear the check boxes for all other models in the table at the top of the Pipeline Comparison tab.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Practice the Demo: Compare Models across Pipelines
In this practice, you run the pipeline comparison. Pipeline comparison enables you to compare the best models from each pipeline created. It also enables you to register the overall champion model and use it in other tools.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Pipeline Comparison tab.
At the top, you see the champion model from each pipeline as well as the model deemed the overall champion in the pipeline comparison, the champion of champions. The overall champion is selected by default and is indicated by a star in the Champion column.
In addition, several charts and tables summarize the performance of the overall champion model (the selected model), show the Variable Importance list of the model, provide training and score codes, and show other outcomes from the selected best model. The default assessment measure for pipeline comparison is Kolmogorov-Smirnov (KS).
All the results shown are for the overall champion model only. You might want to perform a model comparison of each of the models shown.
- Select the check boxes next to all the models shown at the top of the Results page. You can also select the check box next to the word Champion at the top of the table.
- When multiple models are selected, the Compare button in the upper right corner is activated. Click Compare.
The Compare results enable you to compare assessment statistics and graphics across the models currently selected on the Pipeline Comparison tab.
- Close the Compare results window.
- To add a challenger model (a model that was not automatically selected) to the pipeline comparison, perform the following steps:
- Return to the pipeline that contains the desired model (here, the Tree Based pipeline).
- Right-click the node for a model other than the pipeline champion and select Add challenger model from the pop-up menu.
- Click the Pipeline Comparison tab. The selected model now appears in the Pipeline Comparison table at the top, in the Challenger column.
- To prepare to register the overall champion model in a later practice, clear the check boxes for all other models in the table at the top of the Pipeline Comparison tab.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Reviewing a Project Summary Report on the Insights Tab
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we look at a project summary report for the Demo project from the Pipeline Comparison tab.
To see the report, I'll click the Insights tab. This report is available after at least one pipeline in the project has run successfully.
The report has multiple panels with summary information about the project and the models listed on the Pipeline Comparison tab.
At the top left is the Project Summary panel. Here we see the project target variable, the number of pipelines in the project, the project champion model (which is the champion model selected by the Pipeline Comparison tab), the event rate, and other project summary information.
At the top right is the Project Notes panel. Adding project notes here can be helpful, for example, if multiple people are working on the same project.
Most Common Variables Selected Across All Models is a bar chart that shows the number of times that an input variable was determined to be an important variable for any of the models on the Pipeline Comparison tab. Let's maximize this chart. At the top of the plot are the variables that appear in all models in the pipeline comparison. Notice that many variables appear in all five models in our pipeline comparison. At the bottom of the chart, notice that one variable was used in four of the models in the pipeline comparison. Close the plot.
Now I'll maximize the Assessment for All Models plot. This bar chart summarizes model performance for all models listed on the Pipeline Comparison tab. The overall champion (or project champion) is marked by an orange star. Pipeline champions are marked with a blue-green star. If you manually added a challenger model to the Pipeline Comparison tab, as we did, it appears in the plot without a star. For our Demo project, notice that the project champion model is the forest. The KS value for this model is nearly 0.6. Close the plot.
Next, I'll maximize the Most Important Variables for Champion Model plot. This bar chart shows the relative importance of the most important variables for the project champion. Here, notice that the most important variable for the champion model is total_days_over_plan. The relative importance of the variable handset_age_grp is about 91% of the importance for total_days_over_plan. This means that handset_age_grp is 0.91 times as important as total_days_over_plan for this model. Close the plot.
At the bottom, let's maximize the Cumulative Lift for Champion Model plot. This plot displays the cumulative lift for the project champion model for both the training and validation partitions.
Close the plot.
Now that we have gained additional insights into our project and its models, we are ready to prepare for model deployment. To return to the pipeline comparison results, I'll click the Pipeline Comparison tab.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Practice the Demo: Review a Project Summary Report on the Insights Tab
In this practice, you review a project summary report for the Demo project on the Insights tab.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Insights tab.
The Insights tab contains summary information in the form of a report for the project, the champion model, and any challenger models. For the purposes of the Insights tab, a champion model is the overall project champion model, and a challenger model is one that is a pipeline champion, but not the overall project champion.
At the top of the report is a summary of the project and a list of any project notes. Summary information about the project includes the target variable, the champion model, the event rate, and the number of pipelines in the project.
- Maximize the plot for Most Common Variables Selected Across All Models. This plot summarizes common variables used in the project by displaying the number of pipeline champion models that the variables appear in. Only variables that appear in models used in the pipeline comparison are displayed.
The plot shows that many variables were used by all models in the pipeline comparison. These variables are listed at the top of the plot. Variables not used in all models are listed at the bottom of the plot.
- Close the Most Common Variables Selected Across All Models plot.
- Maximize the Assessment for All Models plot. This plot summarizes model performance for the champion model across each pipeline and the overall project champion. The orange star next to the model indicates that it is the project champion.
In the demo video, the champion is the forest. Take note of the KS value for the model that is selected as the champion when you practice these steps.
- Close the Assessment for All Models plot.
- Maximize the Most Important Variables for Champion Model plot. This plot shows the most important variables, as determined by the relative importance calculated using the actual overall champion model.
- Close the Most Important Variables for Champion Model plot.
- At the bottom of the results, notice the Cumulative Lift for Champion Model plot. This plot displays the cumulative lift for the overall project champion model for both the training and validation partitions.
- To prepare for model deployment, return to the pipeline comparison results by clicking the Pipeline Comparison tab.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Registering the Champion Model
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we register the champion model. Registering the model makes it available to other SAS applications.
On the Pipeline Comparison tab, I click the Project Pipeline menu in the upper right corner and select Register models. The Register Models window appears. The Status column indicates that Model Studio is registering the model, and then indicates when the registration process is complete. I'll close this window.
Notice the new column, Registered, indicating that the model has been registered.
After the model is registered, you can view and use it in SAS Model Manager. In SAS Model Manager, you can export the score code in different formats, deploy the model, and manage its performance over time. You'll see this in a later demonstration.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Practice the Demo: Register the Champion Model
In this practice, you register the champion model in the Demo project. Registering the model makes it available to other SAS applications.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, on the Pipeline Comparison tab, make sure that only the champion model is selected.
- On the right side of the window, click the Project pipeline menu (the three vertical dots). Note that the Manage Models option is not available.
- Select Register models to open the Register Models window. Wait until you see the following indications that the registration process is finished:
- The spinning circle next to Registering in the Status column indicates that the selected model is actively being registered.
- The Register Models window is updated to indicate that the registration process has successfully completed.
- Close the Register Models window.
- In the table at the top of the Pipeline Comparison tab, notice the new Registered column. This column indicates that the champion model was registered.
Note: After the model is registered, you can view and use it in SAS Model Manager. In SAS Model Manager, you can export the score code in different formats, deploy the model, and manage its performance over time. You see this in a later practice.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Exploring the Settings for Model Selection
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, I'll highlight some of the settings for model selection that you can change if you do not want to use the default values. This is helpful to know for future projects. However, for our Demo project, we use the default settings.
First, I'll show you how to change the settings when comparing models within a single pipeline -- in this case, the Tree Based pipeline. Make sure that the Model Comparison node is selected. In the properties pane, we can currently change three properties: Class selection statistic, Interval selection statistic, and Selection partition. Notice that all three properties are set to the default value. That is, they use the rule defined in the project settings.
If you change the model selection settings here instead of in the Project Settings window, they will apply only to the current pipeline.
For a class or interval target, you can select a different assessment measure from the menu. Suppose I select Average squared error for the Class selection statistic. Notice that the green check mark on the Model Comparison node disappears, indicating that we would need to run the node again to take advantage of the new setting.
Two properties at the bottom of the panel are currently inactive: Selection depth and ROC-based cutoff.
When you select a response-based measure for a class target, such as Cumulative lift, you can also specify a selection depth other than the default. When you select an ROC-based measure such as ROC separation, you can specify a cutoff other than the default.
Now I'll return to the default setting for Class selection statistic.
For the selection partition, you can choose between Test, Train, and Validate.
Now let's explore the settings for model comparison across pipelines. Select the Settings menu in the upper right corner. Then select Project Settings. In the Project Settings window, select Rules in the left pane. On the right, the first three model comparison properties are the same properties that we saw in the properties pane for the Model Comparison node: Class selection statistic, Interval selection statistic, and Selection partition. I'll close the Project Settings window without changing anything.
Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Practice the Demo: Explore the Settings for Model Selection
In this practice, you explore some of the settings for model selection that you can change if you don't want to use the default values for model comparison. Note: It is helpful to know about these settings for future projects, but you use the default settings for the course project.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in results will vary slightly across runs.
- In the Demo project, click the Tree Based pipeline.
- To explore the settings for comparing models within a single pipeline, perform the following steps:
- Select the Model Comparison node.
- In the properties panel for the node, notice that the three properties (Class selection statistic, Interval selection statistic, and Selection partition) are all set to the default value, Use rule from project settings. Remember that earlier, when you set up the course project, you modified the data partition in the project settings. If you change the model selection settings here instead of in the project settings, those settings apply only to the current pipeline.
- For a class or interval target, you can select a different measure. Click the Class selection statistic drop-down and select Average squared error. The green check mark on the Model Comparison node disappears, indicating that you need to run the node again to take advantage of the new setting.
- Notice that two properties at the bottom of the panel are currently inactive: Selection depth and ROC-based cutoff.
When you select a response-based measure, such as Cumulative lift, you can also specify a selection depth other than the default. When you select an ROC-based measure, such as ROC separation, you can specify a cutoff other than the default.
- Change the Class selection statistic property so that it is back to the default setting, Use rule from project settings.
- Click the Selection partition drop-down. The available options are Test, Train, and Validate. Leave the default value, Use rule from project settings.
- To explore the settings for comparing models across pipelines, perform the following steps:
- Click the Settings icon in the upper right corner, and then select Project settings.
- Select Rules in the left pane of the Edit Project Settings window. On the right, the first three Model Comparison properties are the same properties that you saw in the properties panel for the Model Comparison node: Class selection statistic, Interval selection statistic, and Selection partition.
- Close the Project Settings window.
Machine Learning Using SAS® Viya®
Lesson 06, Section 2 Demo: Viewing the Score Code and Running a Scoring Test
Note: The text below is an exact transcription of what is spoken in the demo video. To perform the demo yourself, follow the steps in the next Practice the Demo page.
In this demonstration, we access SAS Model Manager from Model Studio and run a scoring test using the champion model that we registered earlier. Before you deploy a model, it is often important to run a scoring test in a nonproduction environment to make sure that the score code runs without errors.
This is Andy, and I'm at the Tree Based pipeline. I'll access Pipeline Comparison. Notice that the Pipeline Comparison window now shows a gradient boosting model as the champion model. Your eyes aren't playing tricks on you. This is different from the forest model that you saw Model Studio identify in a previous demonstration. You also saw Jeff register that forest model.
This is a good illustration of how SAS Viya works. I recorded this particular demonstration video as part of a course update, but the earlier demonstration video was not rerecorded. Due to the non-deterministic nature of these models, the Gradient Boosting model is now the champion model, and I registered it behind the scenes. When you do this demonstration, it doesn't matter which model turns out to be the champion. The steps are the same.
In the Pipeline Comparison window, I select the Project Pipeline menu in the upper right corner. The Manage Models option becomes active after at least one model is registered. I select Manage Models to access Model Manager.
When Model Manager opens, by default, we see a list of files that contain various types of code for training and scoring the registered model.
In the left pane, I'll click the Projects icon, the second icon from the top, to display the Projects list. Notice that our Demo project is listed. I'll select the Demo project to open it.
The Models tab, which is selected by default, lists the model that I registered earlier for this project: the Gradient Boosting model from the Tree Based pipeline.
The tabs at the top of the page are used during the entire model management process, which goes beyond model deployment. In this demonstration, we focus on the tabs that we use for the scoring test.
To open the model, I'll click the name of the model (not the check box). The model name appears at the top, with a new set of tabs underneath.
The Files tab is selected by default. On the left is the same list of files related to scoring this model that you saw earlier.
Before I run the scoring test, you might like to see the score code that Model Studio generated for the model and the data preparation nodes in the pipeline. I'll select dmcas_epscorecode.sas to display the score code on the right. I'll scroll down through the code. Notice, for example, that this code replaces negative values with zero for specified variables.
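Note: The generated score code is model-specific, but to give a rough idea of the kind of data preparation logic it can contain, here is a minimal hand-written sketch that replaces negative values with zero for two inputs. The variable names tot_mins and tot_calls are hypothetical, and this is not the actual contents of dmcas_epscorecode.sas.

/* A minimal sketch, not the generated score code: replace negative values
   with zero for selected interval inputs before scoring */
data work.prepared;
   set work.score_commsdata;
   array inputs{*} tot_mins tot_calls;   /* hypothetical interval inputs */
   do i = 1 to dim(inputs);
      if inputs{i} < 0 then inputs{i} = 0;
   end;
   drop i;
run;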
Of course, the score code varies by model. Keep in mind that you don't need to be able to understand the code in order to run a scoring test.
After you test your score code, a likely next step is to export the code from the Files tab. Then you can put the model into production by deploying it in a variety of environments.
I'll click Close in the upper right corner of the window.
We are now at the Models page, which lists all models registered across all projects.
On the left, I'll click the Projects icon to return to the Demo project.
I'll select the check box next to the name of the model. It's time to run the scoring test, so I'll click the Scoring tab.
On the Tests tab, I'll click New Test. In the New Test window, I'll enter CPML_Gradient as the name. A description is not required, but I'll enter one.
In the Model field, I'll click Choose Model.
In the Choose a Model window, I'll select the champion model. Now I'm back at the New Test window, and I'm ready to select the data set to be scored. In the data source (Input table) field, I click the Browse icon to open the Choose Data window.
Note: This demo video shows how to import a data set as a local file, which you cannot do in SAS Viya for Learners. The Practice the Demo page tells you how to select the data set in SAS Viya for Learners.
The data set to be scored is not yet available in memory, so I need to import it. I click the Import tab, and then expand Local Files and select Local File. In the same location where the modeling data is stored, I'll select score_commsdata, and click Open.
I click Import Item. I see that the table is successfully imported. I click OK. The name of the data source that I imported appears in the New Test window. I click Save.
I'm back at the Scoring tab for the Demo project. The scoring test that I just saved is listed, and I select its check box. I click Run.
When the run is finished, the Status column has a green check mark, and a table icon appears in the Results column. This indicates that the test ran successfully.
To open the results, I'll click the table icon in the Results column. In the left pane, under Test Results, I click Output. By default, the score data table shows the new variables created during data preparation; the new variables created during the scoring process, which contain the predictions; and all the original variables.
Notice that the predicted probability for churn appears in the Predicted: churn = 1 column. The values in this column are used to make business decisions.
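Note: As a sketch of how that column might feed a business decision, the following hand-written code flags customers whose predicted probability of churn exceeds a 0.30 cutoff. The table name WORK.SCORE_OUT, the variable name P_CHURN1, and the cutoff value are assumptions for illustration only; the actual column name behind the Predicted: churn = 1 label depends on the model.

/* A minimal sketch: turn the predicted probability into an action list */
data work.retention_list;
   set work.score_out;
   if p_churn1 >= 0.30 then retention_flag = 1;   /* candidate for a retention offer */
   else retention_flag = 0;
run;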
If you want, you can reduce the number of columns or rearrange columns in the output table. Just click the Options icon in the upper right corner and select Manage columns.
I'll close the output table.
From here, you can use the Applications menu to return to either SAS Drive or Model Studio.
Machine Learning Using SAS® Viya®
Lesson 06, Section 2 Practice the Demo: View the Score Code and Run a Scoring Test
In this practice, you access SAS Model Manager from Model Studio and run a scoring test using the champion model that you registered in an earlier practice. Before you deploy a model, it is often important to run a scoring test in a nonproduction environment to make sure that the score code runs without errors.
Note: This is the task shown in the previous demonstration video. However, keep in mind that SAS Visual Data Mining and Machine Learning uses distributed processing, so the values in your results will vary slightly across runs.
- In the Demo project, click the Pipeline Comparison tab.
- On the right, click the Project pipeline menu (three vertical dots). Notice that the Manage Models option is now available because at least one model has been registered. Select Manage Models from the menu.
By default, when Model Manager opens, you see a list of files that contain various types of code for training and scoring the registered model.
- In the left pane of Model Manager, click the Projects icon (the second icon from the top) to display the Projects list. The Demo project appears in this list. The SAS Model Manager project named Demo is based on the Model Studio project of the same name.
Note: In SAS Viya for Learners, you can see projects created by other users, so you might see multiple Demo projects listed. Make sure to select the Demo project that has your email address specified in the Modified by field.
- Click the name of the Demo project to open it. The Models tab, which is selected by default, lists the model that you registered earlier in Model Studio.
Note: The demo video corresponding to this practice was updated more recently than the earlier demo video in which the model was registered, so the two videos show different registered models. When you perform these practices, it doesn't matter which model is selected as the champion and registered. The steps are the same.
- Notice the tabs at the top of the page. These tabs are used during the entire model management process, which goes beyond model deployment. In this practice, you focus on the tabs that are used for the scoring test.
- To open the registered model, click its name. (Do not click the selection check box next to the model name.) Notice the tabs near the top of the page. The Files tab is selected by default. On the left is the same list of files related to scoring this model that you saw earlier.
- To see the score code that Model Studio generated for this model and the data preparation nodes in the pipeline, select dmcas_epscorecode.sas in the left panel. The score code appears on the right. If you want, scroll down through the code. The score code varies by model. You do not need to be able to understand the code in order to run a scoring test.
Note: After you test your score code, a likely next step is to export the code from the Files tab. Then you can put the model into production by deploying it in a variety of environments.
- In the upper right corner of the score code window, click Close.
The Models page appears, which lists all models registered across all projects. Here, you have only one project (the Demo project) and one registered model.
- On the left, click the Projects icon to return to the Demo project. It's time to create and run a scoring test on the selected model.
- Click the name of the model (not the check box).
- Click the Scoring tab.
- On the Tests tab, click New Test.
- In the New Test window, enter the following information:
- In the Name box, enter CPML_Champion.
- In the Description box, enter Demo project champion. (Entering a description is optional.)
- Below Model, click Choose Model. In the Choose a Model window, select the champion model. The Choose a Model window closes and you return to the New Test window.
- To select the data source for the test, perform the following steps:
Note: The demo video shows how to import the data set as a local file, which you cannot do in SAS Viya for Learners. The following steps show you how to select the data set, which has already been loaded into memory in SAS Viya for Learners.
- In the data source (Input table) field, click the Browse icon to open the Choose Data window.
- Select the score_commsdata table on the Available tab and click OK.
- Notice that the name of the data set now appears in the New Test window.
- Click Save.
- Back on the Scoring tab for the Demo project, select the check box next to the name of the scoring test that you just created.
- In the upper right corner of the table, click Run.
When the run is finished, the Status column has a green check mark and a table icon appears in the Results column. This indicates that the test ran successfully.
- To open the test results, click the table icon in the Results column.
- In the left pane, under Test Results, click Output. By default, the score data table shows new variables created during data preparation; the new variables created during the scoring process, which contain the predictions; and all the original variables.
- Scroll to the right until you see the Predicted: churn = 1 column. The predicted values of churn are used to make business decisions.
Note: If you want, you can reduce the number of columns or rearrange columns in the output table. To do this, click the Options icon in the upper right corner and select Manage columns.
- Close the Output Table window. From here, you can use the Applications menu to return to either SAS Drive or Model Studio.