Lesson 01


Machine Learning Using SAS® Viya®
Lesson 01, Section 1 Demo: Creating a Project and Loading Data

In this demonstration, we create a new project in Model Studio. This is the project that is used throughout the course. During project creation, we'll load the commsdata data set into memory from a local drive in the virtual lab environment and define a target variable.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. From the Windows taskbar, launch Google Chrome. When the browser opens, select SAS Drive from the bookmarks bar or from the link on the page.

  2. If the user ID and password are not pre-filled, do the following:
    1. Enter student in the User ID field.
    2. Enter Metadata0 in the Password field.
      Note: Use caution when you enter the user ID and password because values can be case sensitive.

  3. Click Sign In.

  4. In the Assumable Groups window, select Yes.

    The SAS Drive home page appears. From SAS Drive, you can access SAS Viya products, such as Model Studio.

  5. In the upper left corner of SAS Drive, click the Applications menu icon and select Build Models.

    From the Model Studio Projects page, you can view existing projects, create new projects, and access the Exchange. The Exchange is where you can save your own pipelines, find pipeline templates created by other users, and access best practice pipelines provided by SAS. You learn more about the Exchange later.

  6. Click New Project to open the New Project window.

  7. In the Name field, enter Demo as the name of the project. In the Type field, leave the default value, Data Mining and Machine Learning.

    Note: Model Studio projects can be one of three types, depending on the SAS licensing for your site: Data Mining and Machine Learning projects, Forecasting projects, and Text Analytics projects.

  8. Under Data, click Browse.

    The Choose Data window appears with the Available tab selected by default. Notice that the data source for the Demo project, commsdata, does not appear on the Available tab. You need to load it into memory.

  9. Import the SAS data set (a local file) into memory, as follows:
    1. Click the Import tab.
    2. Expand Local Files and select Local file.
    3. In the Open window, navigate to D:\Workshop\winsas\CPML.
    4. Select commsdata.sas7bdat, and click Open.
    5. In the Choose Data window, click Import Item. SAS Viya loads the data into memory. After the data set is imported, it is listed on the Available tab and can be used for other projects.
      Note: Later in this course, you learn more about supported data types and how the data is loaded behind the scenes.
    6. Click OK.

  10. In the New Project window, notice that the name of the imported data set now appears in the Data field.

  11. The Description field is optional. Leave it blank.

  12. To look at some of the advanced project settings, click Advanced.

    The New Project Settings window appears. On the left, four groups of project settings are listed: Advisor Options (selected by default), Partition Data, Event-Based Sampling, and Node Configuration. You cannot see the Advisor Options for a given project after you create it (that is, after you save it), so let's look at those options now. You learn about the other advanced project settings later.

  13. On the right, view the following options in the Advisor Options group (a small sketch illustrating these rules follows the list):
    1. Maximum class level specifies the threshold for rejecting categorical variables. If a categorical input has more levels than the specified maximum number, it is rejected.
    2. Interval cutoff determines whether a numeric input is designated as interval or nominal. If a numeric input has more levels than the interval cutoff value, it is declared interval. Otherwise, it is declared nominal.
    3. Maximum percent missing specifies the threshold for rejecting inputs with missing values. If an input has a higher percentage of missing values than the specified maximum percent, it is rejected. By default, this option is on. (That is, Apply the "maximum percent missing" limit is selected).
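    Note: The following minimal Python sketch (not Model Studio's implementation) illustrates the logic of these three advisor rules on a pandas DataFrame. The threshold values and the advise_roles function name are illustrative assumptions.

      import pandas as pd

      def advise_roles(df, max_class_levels=20, interval_cutoff=20, max_pct_missing=50):
          """Assign a rough role to each column by applying the three rules above.
          The threshold defaults here are illustrative, not Model Studio's defaults."""
          advice = {}
          for col in df.columns:
              pct_missing = df[col].isna().mean() * 100
              n_levels = df[col].nunique(dropna=True)
              if pct_missing > max_pct_missing:
                  advice[col] = "Rejected (exceeds maximum percent missing)"
              elif df[col].dtype.kind in "biufc":   # numeric column
                  advice[col] = "Input (Interval)" if n_levels > interval_cutoff else "Input (Nominal)"
              elif n_levels > max_class_levels:     # categorical variable with too many levels
                  advice[col] = "Rejected (exceeds maximum class level)"
              else:
                  advice[col] = "Input (Nominal)"
          return advice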

  14. Without changing the Advisor Options settings, click Cancel to return to the New Project window.

    Note: After you save a project, the Advisor Log, and other logs, are available from the Settings menu. You see this menu later.

  15. Click Save to save the Demo project.

    Note: After you create a new project, Model Studio opens the project. The Data tab is selected by default. The other three tabs are Pipelines, Pipeline Comparison, and Insights.

  16. Notice the warning message that currently appears at the top of the window. As the message indicates, when a project is created, you must assign a target variable in order to run a pipeline.

  17. In the variables table, do the following:
    1. Select the check box next to the variable name churn.
    2. In the right pane, click the Role menu and select Target. The target is now defined and the warning message at the top of the page disappears.

  18. If the target is binary or nominal, you can also view or change the event of interest. In the right panel, click Specify the Target Event Level. In the Specify the Target Event Level window, click the Target event level menu. Notice that the menu provides the frequency count for each level. For our project, the churn rate is about 12%. By default, Model Studio sorts the levels in ascending alphanumeric order and selects the last level as the event. For this target, the selected level is 1, so we don't need to change it.

  19. Click Cancel to return to the Demo project.

Note: You cannot modify the names or labels of your variables in Model Studio.


Machine Learning Using SAS® Viya®
Lesson 01, Section 2 Demo: Modifying the Data Partition

In this demonstration, we change the metadata for multiple variables, modify the default data partition settings, run the partitioning, and then look at the partitioning log.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. If you closed the Demo project, reopen it. Make sure that the Data tab is selected.

  2. Make sure that the check box for the churn variable is cleared.

    Note: It is important to understand how to select and deselect variables on the Data tab. Otherwise, you might inadvertently reassign variable roles or change metadata. For details, see the Selecting Variables on the Data Tab page after this demonstration.

  3. Based on business knowledge, we want to reject 11 variables in the commsdata data set so that they are not used in our models. Reject these 11 variables as follows:

    1. On the Data tab, select the check boxes for the following variables:
      • city
      • city_lat
      • city_long
      • data_usage_amt
      • mou_onnet_6m_normal
      • mou_roam_6m_normal
      • region_lat
      • region_long
      • state_lat
      • state_long
      • tweedie_adjusted

    2. In the right pane, for Role, select Rejected.

      Note: Variable metadata includes the role and measurement level of the variable. Common variable roles are Input, Target, Rejected, Text, and ID. Common variable measurement levels are Interval, Binary, Nominal, and Ordinal.

  4. Click the Settings icon in the upper right corner of the window, and select Project settings from the menu.

    Note: If you want to see or modify the partition settings before creating the project, you can do this from the user settings. In the user settings, the Partition tab enables you to specify the method for partitioning as well as specify associated percentages. Any settings at this level are global and are applied to any new project created.

    The Project Settings window appears with Partition Data selected on the left by default.

    Note: You can edit the data partitioning settings only if no pipelines in the project have been run. After the first pipeline has been run, the partition tables are created for the project, and the partition settings cannot be changed. Remember that, as shown in the last demonstration, you can also access the Partition Data options while the project is being created, under the Advanced settings.

  5. Notice that the Create partition variable check box is selected, which indicates that partitioning is done by default. The default partitioning method is Stratify.

  6. By default, Model Studio does a 60-30-10 allocation to training, validation, and test. For the Demo project, make the following changes (a sketch of this kind of stratified split follows the list):

    1. Change the Training percentage to 70.
    2. Leave the Validation percentage set to 30.
    3. Change the Test percentage to 0. Note: You will not use a test data set for this project.
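    Note: For reference outside Model Studio, the following Python sketch shows a comparable stratified 70/30 split with scikit-learn. The churn target and the commsdata file come from the demo; everything else is an illustrative assumption, and this is not the code that Model Studio runs.

      import pandas as pd
      from sklearn.model_selection import train_test_split

      df = pd.read_sas("commsdata.sas7bdat")   # the demo data set (path shortened here)

      # Stratify on the target so that each partition keeps roughly the same churn rate.
      train, valid = train_test_split(
          df, train_size=0.70, test_size=0.30, stratify=df["churn"], random_state=1
      )
      print(len(train), len(valid))   # roughly a 70/30 split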

  7. On the left, select Event-Based Sampling to look at those settings. By default, event-based sampling is turned off. (That is, the Enable event-based sampling check box is not selected.) When event-based sampling is turned on, you can specify the desired proportions of event and non-event cases after sampling. In this case, the default proportion after sampling is 50% for events and 50% for non-events. The sum of the proportions must be 100%. (A sketch of this kind of sampling follows this step.)

    For the Demo project, keep the Event-based Sampling options at their default settings.

    Note: After a pipeline has been run in the project, the Event-Based Sampling settings cannot be changed. Remember that, as shown in the last demonstration, you can also access the Event-Based Sampling options while the project is being created, under the Advanced settings.
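    Note: The following minimal Python sketch illustrates the idea of event-based sampling (keep all events and downsample non-events to reach a 50/50 mix). It assumes a binary target named churn with event level 1, and it is not the Model Studio implementation.

      import pandas as pd

      def event_based_sample(df, target="churn", event=1, event_proportion=0.5, seed=1):
          """Keep all events; downsample non-events to reach the requested proportion."""
          events = df[df[target] == event]
          nonevents = df[df[target] != event]
          # Number of non-events needed so that events make up event_proportion of the sample.
          n_nonevents = int(len(events) * (1 - event_proportion) / event_proportion)
          sampled = nonevents.sample(n=min(n_nonevents, len(nonevents)), random_state=seed)
          return pd.concat([events, sampled]).sample(frac=1, random_state=seed)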

  8. On the left, select Node Configuration. The Prepend Python configuration code setting is useful when you use the Open Source Code node with the language set to Python.

  9. To explore this setting, do the following:
    1. Select the Prepend Python configuration code check box. A code editor appears. In the editor, you could add Python code to prepend to code that you specify in the Open Source Code node. You learn more about the Open Source Code node later in the course.
    2. Clear the Prepend Python configuration code check box because you will not use Python code at this point.

  10. On the left, select Rules to look at those settings. The Rules options can be used to change the selection statistic and partitioned data set that determine the champion model during model comparison. Statistics can be selected for class and interval targets.

    For the Demo project, keep the Rules options at their default settings.

  11. Click Save to save the new partition settings and return to the Demo project page.

  12. Click the Pipelines tab. In the Demo project, there is currently a single pipeline named Pipeline 1.

    On the Pipelines tab, you can create, modify, and run pipelines. Each pipeline has a unique name and an optional description. In the Demo project, Pipeline 1 currently contains only a Data node.

  13. To create the partition indicator, you can run the Data node. Right-click the Data node and select Run.

    After the node runs, a green check mark in the node indicates that it ran without errors and the data have been partitioned.

    Note: After you run the Data node, you cannot change the partitioning, event-based sampling, project metadata, project properties, or the target variable. However, you can change variable metadata with the Manage Variables node or through the Data tab.

  14. To look at the log file that was generated during partitioning, click Settings in the upper right corner, and select Project logs from the menu.

  15. From the Available Logs window, select Log for Project Partitioning, and then click Open. The log that was created during partitioning appears. You can scroll through the log if you want.

    Note: It is also possible to download a log file by clicking the Download log link at the bottom of the log.

  16. To return to the pipeline, close the Partition Log window, and then close the Available Logs window.


Machine Learning Using SAS® Viya®
Lesson 01, Section 2 Demo: Building a Pipeline from a Basic Template

In this demonstration, we create a pipeline from a basic template in the Demo project. We use this pipeline to do imputation and build a baseline regression model that we compare with machine learning models in a later demonstration. We run the pipeline and look at the results.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. Make sure that the Demo project is open and the Pipelines tab is selected.

    Note: Remember that Pipeline 1, which has a single Data node, was created automatically with the project. You'll reserve this pipeline for exploring the data, which you do in a later demonstration.

  2. Click the plus sign next to the Pipeline 1 tab.

    The New Pipeline window appears.

  3. Under Select a pipeline template, click Browse to access the Browse Templates window. This window displays a list of pre-built pipeline templates, which are available for both class (categorical) and interval targets. These templates are available at basic, intermediate, and advanced levels. The Browse Templates window also displays any pipeline templates that users have created and saved to the Exchange.

  4. Select Basic template for class target, and click OK.

  5. In the New Pipeline window, in the Name field, enter Starter Template as the pipeline name.

    Note: Specifying a pipeline description is optional.

  6. Notice the Automatically generate a pipeline option. This option is an alternative to using one of the pre-populated pipelines already configured to create a model. When you select the Automatically generate a pipeline option, Model Studio uses automated machine learning to dynamically build a pipeline that is based on your data. This option is disabled if the target variable has not been set or if the project data advisor has not finished running. We do not use this option in this course.

  7. Click Save.

    A Starter Template pipeline tab appears on the Pipelines tab for the Demo project. The basic template for class target is a simple linear flow that includes the following nodes: the Data node, one node for data preparation (Imputation), one model node (Logistic Regression), and the Model Comparison node. Even when a pipeline has only one model, a Model Comparison node is included by default.

  8. To run the entire pipeline, click Run pipeline in the upper right corner of the canvas. After the pipeline runs, green check marks in the nodes indicate that the pipeline has run successfully.

    Note: While the pipeline is running, notice that the Run Pipeline button changes to Stop Pipeline. To interrupt a running pipeline, you can click this button.

  9. Right-click the Logistic Regression node and select Results. The Results window appears and contains two tabs: Node and Assessment. The Node tab, which is selected by default, displays the results from the Logistic Regression node.

    Note: Alternatively, you can open the node results by clicking More (the three vertical dots) on the right side of the node and selecting Results.

  10. Explore the results. A subset of the items on this tab is listed below:
    • t-Values by Parameter plot
    • Parameter Estimates table
    • Selection Summary table
    • Output

    Note: You can specify GLM or Deviation (Effect) as the coding method for class inputs in the Logistic Regression node as well as in the GAM, GLM, Linear Regression and Quantile Regression nodes.

  11. Click the Assessment tab to see the assessment results from the Logistic Regression node. Explore the results. A subset of the items on this tab is listed below:
    • Lift reports plots
    • ROC reports plots
    • Fit Statistics table

    Note: You can copy tables, reports, and graphs from nodes in your pipeline to your clipboard. After you run a node, you can also export the contents from the Results view of that node as a PDF.

  12. To close the Results window and return to the pipeline, click Close in the upper right corner.

  13. To open the results of the Model Comparison node, right-click the Model Comparison node and select Results.

    At the top of the results window is the Model Comparison table. This pipeline contains only one model, so the Model Comparison table currently displays information about only that one model.

  14. In the upper right corner of the Model Comparison table, click Maximize View to maximize the table. The fit statistic that is used to select a champion model is displayed first. The default fit statistic for selecting a champion model with a class target is KS (Kolmogorov-Smirnov).
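    Note: Conceptually, KS is the maximum separation between the cumulative score distributions of events and non-events. The following minimal Python sketch computes it from predicted event probabilities; the arrays and function name are illustrative, and this is shown only for reference, not as Model Studio's internal calculation.

      import numpy as np
      from scipy.stats import ks_2samp

      def ks_statistic(y_true, p_event):
          """Maximum distance between the event and non-event score distributions."""
          y_true = np.asarray(y_true)
          p_event = np.asarray(p_event)
          return ks_2samp(p_event[y_true == 1], p_event[y_true == 0]).statistic

      print(ks_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # toy example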

  15. Close the Model Comparison table.

  16. Close the Model Comparison Results window and return to the pipeline.

Note: For future reference, you can change the selection statistic at the pipeline level or at the project level, after you return to the pipeline. To change the selection statistic for all pipelines within a project, you change the class selection statistic on the project's Settings menu (which was shown in an earlier demonstration). However, for the Demo project, continue to use the default selection statistic, KS.

Lesson 02


Machine Learning Using SAS® Viya®
Lesson 02, Section 1 Demo: Exploring the Data

In this demonstration, we explore the source data (commsdata) using the Data Exploration node in Model Studio. Here we select a subset of variables to provide a representative snapshot of the data. Variables can be selected to show the most important inputs or to indicate suspicious variables (that is, variables with anomalous statistics).

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Pipelines tab. Make sure that Pipeline 1 is selected.

  2. Right-click the Data node and select Add child node > Miscellaneous > Data Exploration from the pop-up menu. Model Studio automatically adds a Data Exploration node to the pipeline and connects it to the Data node.

    Note: Alternatively, we can select a node from one of the sections in the Nodes pane on the left and drag it onto an existing node in the pipeline. The new node is added to the canvas below the existing node and automatically connected to that node.

    Note: You can copy and paste modeling nodes to the same or to a different pipeline. Any changes that you made to the modeling node before copying are maintained in the node when it is pasted. You can copy and paste by using a pop-up menu action or by using the Ctrl+C and Ctrl+V keyboard shortcuts.

  3. Keep the default settings for the Data Exploration node. Notice that Variable selection criterion is set to Importance. In this demo, we want to see the most important inputs, so we keep this setting.

    Note: The variable selection criterion specifies whether to display the most important inputs or suspicious variables. By default, a maximum of 50 of the most important variables are selected. To see the most suspicious variables, we would change the setting to Screening. Then we can control the selection of suspicious variables by specifying screening criteria, such as cutoff for flagging variables with a high percentage of missing values, high-cardinality class variables, class variables with dominant levels, class variables with rare modes, skewed interval variables, peaky interval variables, and interval variables with thick tails.

  4. Right-click the Data Exploration node and select Run from the pop-up menu.

  5. When the pipeline finishes running, right-click the Data Exploration node and select Results from the pop-up menu.

  6. Maximize the Important Inputs bar chart and examine the relative importance of the ranked variables.

    Note: This bar chart is available only if Variable selection criterion is set to Importance.

    Note: The relative variable importance metric is based on a decision tree and is a number between 0 and 1. (We learn more about decision trees later in this course.)

  7. Minimize the Important Inputs bar chart.

  8. Maximize the Interval Variable Moments table.

    This table displays the interval variables with their associated statistics, which include Minimum, Maximum, Mean, Standard Deviation, Skewness, Kurtosis, Relative Variability, and the Mean plus or minus 2 Standard Deviations. Note that some of the input variables have negative minimum values. We'll handle these negative values in an upcoming demonstration.

  9. Close the Interval Variable Moments table.

  10. Maximize the Interval Variable Summaries scatter plot. This is a scatter plot of skewness against kurtosis for all the interval input variables. Notice that a few input variables in the upper right corner are suspect based on high kurtosis and high skewness values. We can place our cursor on these dots to see the associated variable names.

  11. Click the View chart menu in the upper left corner of the window and select Relative Variability. Examine the bar chart of the relative variability for each interval variable.

    Note: Relative variability is useful for comparing variables with similar scales, such as several income variables. Relative variability is the coefficient of variation, which is a measure of variance relative to the mean: CV = σ/μ.
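    Note: The following small Python sketch computes comparable moment statistics (mean, standard deviation, skewness, kurtosis, and relative variability) for the numeric columns of a pandas DataFrame. It is an illustration only, not the Data Exploration node's implementation.

      import pandas as pd

      def interval_summaries(df):
          num = df.select_dtypes("number")
          return pd.DataFrame({
              "mean": num.mean(),
              "std": num.std(),
              "skewness": num.skew(),
              "kurtosis": num.kurt(),                           # excess kurtosis
              "relative_variability": num.std() / num.mean(),   # CV = sigma / mu
          })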

  12. Close the Interval Variable Summaries scatter plot.

  13. Scroll down in the Data Exploration Results window and maximize the Missing Values bar chart, which shows the variables that have missing values. Notice that some of the variables have a higher percentage of missingness than others.

  14. Close the Missing Values bar chart.

  15. Click Close to close the results.

  16. Double-click the Pipeline 1 tab and change its name by entering Data Exploration.

    Note: Another way to rename a pipeline is to click the options menu for the tab (the three dots) and select Rename.

Machine Learning Using SAS® Viya®
Lesson 02, Section 1 Demo: Replacing Incorrect Values Starting on the Data Tab

In this demonstration, we replace incorrect values starting on the Data tab. This method replaces values in all pipelines in the project. Note: Later, you learn about using the Manage Variables node with the Replacement node to replace values in a single pipeline.

In an earlier demonstration, we explored some of the interval input variables and saw that some have negative minimum values. Based on business knowledge, we will replace these negative values with zeros for a subset of the interval input variables. To start, we sort the variables to find the subset that we want to work with.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Data tab.

  2. Right-click the Role column and select Sort > Sort (ascending). All the input variables are now grouped together after the ID variable and before the Rejected variables.

  3. To group the input variables that have negative values, we will add a second sort to the current sort on Role. Scroll to the right, right-click the Minimum column, and select Sort > Add to sort (ascending). Variables with negative minimum values are now grouped together.

    Note: Add to sort means that the initial sorting done on Role still holds. So the sort on minimum values takes place within each sorted Role group.

  4. If required, rearrange columns so that the Minimum column is next to the Variable Name column, as follows:
    1. Click the Options icon in the upper right corner of the data table, and then select Manage columns. The Manage Columns window appears.
    2. In the Displayed Columns list, select Minimum. By clicking the up arrow multiple times, move the Minimum column immediately below the Variable Name column.
    3. Click OK. The Manage Columns window closes.
    4. On the Data tab, scroll all the way to the left so that we can see the Variable Name column and the Minimum column.

  5. Select the following 22 interval input variables:

    Note: In the practice environment, the variables might be listed in a different order than shown here. To make sure that any previously selected variables are no longer selected, select the first variable's name rather than its check box.

    • tot_mb_data_roam_curr
    • seconds_of_data_norm
    • lifetime_value
    • bill_data_usg_m03
    • bill_data_usg_m06
    • voice_tot_bill_mou_curr
    • tot_mb_data_curr
    • mb_data_usg_roamm01 through mb_data_usg_roamm03
    • mb_data_usg_m01 through mb_data_usg_m03
    • calls_total
    • call_in_pk
    • calls_out_pk
    • call_in_offpk
    • calls_out_offpk
    • mb_data_ndist_mo6m
    • data_device_age
    • mou_onnet_pct_MOM
    • mou_total_pct_MOM

    Note: There is one more variable with a negative minimum value. Leave this variable unselected for the Demo project.

  6. In the right pane, enter 0.0 in the Lower Limit field. This specifies the lower limit to be used in the Filtering and Replacement nodes with the Metadata limits method. The Filtering and Replacement nodes use this lower limit to filter out or to replace, respectively, negative values of the selected variables.

    Note: This is customer billing data, and negative values often imply that there is a credit applied to the customer's account. So it is realistic that there are negative numbers in these columns. However, in telecom data, it is a general practice to convert negative values to zeros. Note that we did not edit any variable values. Instead, we only set a metadata property that can be invoked using the Replacement node.
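    Note: Conceptually, the Replacement node's Metadata limits method behaves like clipping at the lower limit and writing the result to a new REP_-prefixed variable. The following minimal Python sketch illustrates that idea; the function name and variable dictionary are illustrative assumptions, not the SAS implementation.

      import pandas as pd

      def apply_lower_limits(df, lower_limits):
          """lower_limits: dict of column name -> lower limit, e.g. {"calls_total": 0.0}."""
          out = df.copy()
          for col, limit in lower_limits.items():
              out["REP_" + col] = out[col].clip(lower=limit)   # values below the limit become the limit
          return out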

  7. Click the Pipelines tab.

  8. Select the Starter Template pipeline. Notice that, because of the change in metadata, the green check marks in the nodes in the pipeline have changed to gray circles. This indicates that the nodes need to be rerun to reflect the change.

  9. Add a Replacement node to the pipeline.
    Note: The Replacement node can be used to replace outliers and unknown class levels with specified values. It is in this node that we invoke the metadata property of the lower limit that we set earlier.
    Note: The following steps show the drag-and-drop method of adding the node. If we prefer, we can use the alternate method of adding a node that was shown in earlier practices.
    1. Expand the Nodes pane on the left side of the canvas.
    2. Expand Data Mining Preprocessing.
    3. Click the Replacement node and drag it between the Data node and the Imputation node.
    4. Hide the Nodes pane.

  10. In the properties panel for the Replacement node, specify the following settings in the Interval Variables section:
    1. Set Default limits method to Metadata limits.
    2. Change Alternate limits method to (none).
    3. Leave Replacement value at the default value, Computed limits.

  11. Run the Replacement node and view the results.

  12. In the results of the Replacement node, maximize the Interval Variables table. This table shows which variables now have a lower limit of 0.
    The original variables will now be rejected. The new versions of the variables, which have REP_ prepended to the name, are now the valid input variables.

  13. Close the Interval Variables table.

  14. Close the results window of the Replacement node.

  15. To update the remainder of the pipeline, click the Run Pipeline button.

  16. When the run is complete, right-click the Model Comparison node and select Results.

  17. Maximize the Model Comparison table and view the performance results for the Logistic Regression model.

  18. Exit the maximized view of the Model Comparison table.

  19. Select Close to return to the pipeline.

Note: Alternatively, we can assign metadata properties by using the Manage Variables node. We can use the Manage Variables node with the Replacement node to replace values in a single pipeline. In the Nodes pane, the Manage Variables node is in the Data Mining Preprocessing section. However, we do not use the Manage Variables node in the Demo project.


Machine Learning Using SAS® Viya®
Lesson 02, Section 2 Demo: Adding Text Mining Features

In this demonstration, we create new features using the Text Mining node. Of the five text variables in the commsdata data source, we use the text variable verbatims. This variable represents free-form, unstructured data from a customer survey. Two of the text variables (Call_center and issue_level1) are already rejected. Rejecting the remaining two variables (issue_level2 and resolution) requires a metadata change on the Data tab. We must change their role to Rejected.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Data tab.

  2. Make sure that any previously selected variables are deselected.

  3. If required, sort by variable name: right-click the Variable Name column and select Sort > Sort (ascending).

  4. Select the variables issue_level2 and resolution.

  5. In the right pane, change the role to Rejected. This ensures that only the verbatims variable is used as an input for the Text Mining node.

  6. To return to the Starter Template pipeline, click the Pipelines tab and select Starter Template.

  7. Add a Text Mining node (which is in the Data Mining Preprocessing group) between the Imputation node and the Logistic Regression node. Note: Keep the default settings of the Text Mining node.

  8. Run the Text Mining node.

  9. When the run is finished, open the results of the Text Mining node. Many windows are available, including the Kept Terms table (which shows the terms used in the text analysis) and the Dropped Terms table (which shows the terms ignored in the text analysis).

    Note: In the tables, the plus sign next to a word indicates stemming. For example, +service represents service, services, serviced, and so on.

    Note: You can use custom lists, such as Start Lists, Stop Lists, and Synonym Lists, in the Text Mining node.

  10. Maximize the Topics table. This table shows topics that the Text Mining node created based on groups of terms that occur together in several documents. Each term-document pair is assigned a score for every topic. Thresholds are then used to determine whether the association is strong enough for that document or term to belong to the topic. Terms and documents can belong to multiple topics. Fifteen topics were discovered, so fifteen new columns of inputs are created. The output columns contain SVD (singular value decomposition) scores that can be used as inputs for the downstream nodes.
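    Note: For a rough open-source analogue of these topic scores, the following Python sketch reduces TF-IDF term weights to a small number of SVD components per document. It is only a conceptual stand-in for the Text Mining node; the toy documents and parameter values are illustrative assumptions.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD

      verbatims = ["bill too high", "dropped calls again", "great service"]   # toy documents

      tfidf = TfidfVectorizer(stop_words="english")
      weights = tfidf.fit_transform(verbatims)

      svd = TruncatedSVD(n_components=2, random_state=1)   # the demo discovers 15 topics
      scores = svd.fit_transform(weights)                  # interval "Score for ..." style columns
      print(scores.shape)                                  # (number of documents, number of components)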

  11. Close the Topics table.

  12. Click the Output Data tab, and then click View Output Data.

  13. In the Sample Data window, click View Output Data. Note: In the Sample Data window, we can choose to create a sample of the data to view. However, we do not do this for the Demo project.

  14. Scroll to the right to see the column headings that begin with Score for. These columns are for new variables based on the topics created by the Text Mining node. For each topic, the SVD coefficients (or scores) are shown for each observation in the data set. Notice that the coefficients have an interval measurement level. The Text Mining node converts textual data into numeric variables, specifically interval variables. These columns will be passed along to subsequent nodes.

    Note: If we want to rearrange or hide columns, we can use the Manage Columns button.

  15. Close the Results window.

  16. Another way to see the 15 new interval input columns that were added to the data is to use the Manage Variables node. To add a Manage Variables node to the pipeline after the Text Mining node, right-click the Text Mining node and select Add child node > Data Mining Preprocessing > Manage Variables.

  17. When the Run Node message window appears, click Close. Notice that Model Studio splits the pipeline path after the Text Mining node.

  18. Run the Manage Variables node. When the Run Node message window appears again, click Close.

  19. When the node finishes running, open the results.

  20. Maximize the Output window to see the new columns (COL1 through COL15), which represent the dimensions of the SVD calculations based on the 15 topics discovered by the Text Mining node. These new columns serve as new interval inputs for subsequent models.

  21. Close the Output window and close the results.

  22. To run the entire pipeline, click the Run pipeline button.

  23. To assess the performance of the model, open the results of the Model Comparison node. Expand the Model Comparison table. Note: Adding text features does not necessarily improve the model.

  24. Close the Model Comparison table. Close the results of the Model Comparison node.

  25. To see whether any of the new variables entered the final model, open the results of the Logistic Regression node.

  26. Maximize the Output window.

  27. Scroll down to the Selection Summary table. Notice that one of the columns created by the Text Mining node entered the model during the stepwise selection process.

  28. Close the Output window and the results.


Machine Learning Using SAS® Viya®
Lesson 02, Section 3 Demo: Transforming Inputs

In this demonstration, we use the Transformations node to apply a numerical transformation to input variables. In an earlier demo, we explored interval inputs and saw that a few had a high measure of skewness. Here, we revisit the results of that data exploration.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Data Exploration pipeline tab.

    The pipeline requires a rerun because metadata properties have been defined.

  2. Right-click Data Exploration and select Run to run the node.

  3. Right-click the Data Exploration node and select Results.

  4. Expand the Interval Variable Moments table. Notice that five variables have a high degree of skewness and their names begin with MB_Data_Usg.

  5. Close the Interval Variable Moments table.

  6. Expand the Important Inputs plot. Notice that the same MB_Data_Usg variables have also been selected as important variables. Behind the scenes, Importance is defined by a decision tree using the TREESPLIT procedure.

  7. Close the Important Inputs plot.

  8. Close the Results window.

    Now we are ready to define transformation rules in the metadata and apply the changes to the data. First, we change metadata on the Data tab to specify what we want to do with the variables. Note: The Manage Variables node is an alternative means of defining metadata transformations, but it is not used in this demonstration.

  9. Click the Data tab.

  10. It might be helpful to sort by variable name. Make sure that all variables are deselected. Right-click the Variable Name column and select Sort > Sort (ascending). Notice that the previous sorting by the Role and Minimum columns has now been removed.

  11. Scroll down until you see six variables whose names begin with (uppercase) MB_Data_Usg. Although only five of these were identified as important in the results that we just saw, there's a good chance that the other one is also skewed. It's a good idea to transform all six of them.

  12. To make sure that no other variables are selected, click the name of the first of the six MB_Data_Usg variables. Then select the check boxes for the other five variables. Note: Select only those variables whose names begin with uppercase MB.

  13. In the Multiple Variables window on the right, under Transform, select Log.

  14. To verify that the transformation rule has been applied to these variables, scroll right to display the Transform column. Notice that Log is displayed for each of the selected variables.

    Note: Remember that setting transformation rules doesn't perform the transformation. It only defines the metadata property. We must use the Transformations node to apply the transformation.

  15. To return to the Starter Template pipeline, click Pipelines, and then click the Starter Template tab.

  16. Add a Transformations node between the Replacement node and the Imputation node. Leave the Transformations node options at their default settings.

    Note: Although the Default interval inputs method property indicates (none), the metadata rules that we assigned to the variables on the Data tab override this default setting.
  17. Right-click the Transformations node and select Run.

  18. When the run is finished, right-click the node and select Results.

  19. Expand the Transformed Variables Summary table. This table displays information about the transformed variables, including how they were transformed, the corresponding input variable, the formula applied, the variable level, the type, and the variable label.

    Notice that new variables have been created with the prefix LOG_ at the beginning of the original variable names. The original versions of these variables are now rejected.

    Note: In the Formula column, notice that the formula for the Log transformations includes an offset of 1 to avoid the case of Log(0).
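    Note: The following minimal Python sketch shows the same kind of transformation: a new LOG_-prefixed column computed as log(x + 1). The function and column names are illustrative; in the demo, the Transformations node performs this work.

      import numpy as np
      import pandas as pd

      def log_transform(df, columns):
          out = df.copy()
          for col in columns:
              out["LOG_" + col] = np.log(out[col] + 1)   # offset of 1 avoids log(0)
          return out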

  20. Close the Transformed Variables Summary window.

  21. Close the results.

  22. Run the entire pipeline to assess the performance of the logistic regression model.

  23. Open the results of the Model Comparison node and maximize the Model Comparison table. Here, we can assess the performance of the logistic regression model.

  24. Close the Model Comparison table and close the results.


Machine Learning Using SAS® Viya®
Lesson 02, Section 4 Demo: Selecting Features

In this demonstration, we use the Variable Selection node to reduce the number of inputs for modeling.

Note: This is the task shown in the demonstration. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Starter Template pipeline.

  2. Add a Variable Selection node (in the Data Mining Preprocessing node group) between the Text Mining node and the Logistic Regression node.

  3. With the Variable Selection node selected, review the settings in the node properties panel on the right. In the properties, varying combinations of criteria can be used to select inputs. Notice the following default settings, which we will use:

    • Combination Criterion is set to Selected by at least 1. This means that any input selected by at least one of the selection criteria chosen is passed on to subsequent nodes as an input.

    • The Fast Supervised Selection method is selected by default.

    • The Create Validation from Training property is also selected by default, but its button is initially disabled.

  4. In the properties panel, turn on the Unsupervised Selection and Linear Regression Selection methods by clicking the slider button next to each property name. When a property is turned on, additional options appear. We can hide the new options by clicking the down arrow next to the property name.

    Keep the default settings for all the new options that appear for the Unsupervised Selection and Linear Regression Selection methods.

  5. Notice that the Create Validation from Training property was initially selected by default, but the slider button did not become active until we selected a supervised method above. This property specifies whether a validation sample should be created from the incoming training data. It is recommended to create this validation set even if the data have already been partitioned so that only the training partition is used for variable selection and the original validation partition can be used for modeling.

  6. Run the Variable Selection node.

  7. Right-click the Variable Selection node and select Results.

  8. Expand the Variable Selection table. This table contains the output role for the input variables after they have gone through the node. Variables that have a blank cell in the Reason column have been selected and are passed on from the node.

  9. Scroll down in the Variable Selection table. For the variables that have been rejected by the node, the Reason column displays the reason for rejection.

    Remember that sequential selection (the default) is performed, and any variable rejected by the unsupervised method is not used by the subsequent supervised methods. Variables that are rejected by the supervised methods show the combination criterion (in this case, selected by at least 1) in the Reason column. To see whether each variable was selected or rejected by each method, look at the Variable Selection Combination Summary table.

  10. Close the Variable Selection table.

  11. Expand the Variable Selection Combination Summary table.

    For each variable, this table includes the result (Input or Rejected) for each method that was used, the total count of each result, and the final output role (Input or Rejected). For example, for the variable AVG_DAYS_SUSP, the Input column has a count of 2, and the Rejected column has a count of 0. This means that this variable was selected by two of the input criteria: Fast Selection and Linear Regression. The variable BILLING_CYCLE has 0 in the Input column, and 2 in the Rejected column. It was rejected by two criteria: Fast and Linear Regression. The variable with the label Days of Open Work Orders has a count of 1 in the Input column, and 1 in the Rejected column. This means that this input was rejected by the Fast criterion, but it was selected by the Linear Regression criterion. The property Combination criterion is set to Selected by at least 1, so this variable is selected as an input because it was selected by at least one of the properties.
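    Note: The following minimal Python sketch illustrates the "Selected by at least 1" combination logic. The method names and variable sets are illustrative assumptions, not values from the node's actual output.

      selections = {
          "fast_supervised":   {"AVG_DAYS_SUSP", "calls_total"},
          "unsupervised":      {"calls_total", "lifetime_value"},
          "linear_regression": {"AVG_DAYS_SUSP"},
      }
      all_variables = {"AVG_DAYS_SUSP", "BILLING_CYCLE", "calls_total", "lifetime_value"}
      min_votes = 1   # "Selected by at least 1"

      kept = {v for v in all_variables
              if sum(v in chosen for chosen in selections.values()) >= min_votes}
      print(sorted(kept), sorted(all_variables - kept))   # selected inputs, rejected inputs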

  12. Close the Variable Selection Combination Summary table.

  13. Close the results.

  14. Click the Run Pipeline button to rerun the pipeline.

  15. Right-click the Model Comparison node and select Results.

    Note: As an alternative, we could view the results for the Logistic Regression model through the Logistic Regression node.

  16. Expand the Model Comparison table and view the statistics for the performance of the Logistic Regression model.

  17. Close the Model Comparison table and close the results.

Machine Learning Using SAS® Viya®
Lesson 02, Section 4 Demo: Saving a Pipeline to the Exchange

In this demonstration, we save the Starter Template pipeline to the Exchange, where it will be available for other users. We use this pipeline later in the course.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, next to the Starter Template tab, click the Options menu and select Save to The Exchange.

  2. A warning message indicates that, when a template is created from the pipeline, not all nodes are duplicated exactly as they were configured. Click OK to dismiss the warning message.

  3. Change the name of the pipeline to CPML demo pipeline. For the description, enter Logistic regression pipeline. Click Save.

  4. To go directly to the Exchange, click its button in the left panel.

  5. In the left pane of the Exchange, expand Pipelines and select Data Mining and Machine Learning. The CPML demo pipeline that we just saved appears in the list of pipeline templates.

  6. To exit the Exchange and return to the Demo project in Model Studio, click the Projects button in the upper left corner.

Lesson 03


Machine Learning Using SAS® Viya®
Lesson 03, Section 1 Demo: Building a Decision Tree Model Using the Default Settings

In this demonstration, we build a decision tree model, using the default settings, in the Demo project. We build the model in a new pipeline based on a template from the Exchange.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Pipelines tab.

  2. To add a new pipeline, click the plus sign (+) next to the Starter Template tab.

  3. In the New Pipeline window, enter Tree Based in the Name field.

    Note: Entering a description for the pipeline in the New Pipeline window is optional.

  4. Under Select a pipeline template, click the down arrow to browse templates.

  5. In the Browse Templates window, select CPML demo pipeline. Click OK.

  6. In the New Pipeline window, click Save.

  7. In the Tree Based pipeline, notice that the new pipeline is a copy of the Starter Template pipeline, but no nodes have been run.

  8. Add a Decision Tree node (from the Supervised Learning group) after the Variable Selection node. Keep all properties for the Decision Tree node at their default settings.

  9. Right-click the Decision Tree node and select Run.

  10. Right-click the Decision Tree node and select Results. The results of the Decision Tree node (on the Node tab) include several charts and plots that help us evaluate the model's performance. Explore the windows and plots that are described below:

    Note: Remember that your results might vary from the results in the demonstration video, which are described below.

    The first plot is the Tree Diagram, which presents the final tree structure for this particular model, such as the depth of the tree and all end leaves. If we place our cursor on a leaf, a tooltip appears, giving us information about that particular leaf, such as the number of observations, the percentage of these that are event cases, and the percentage of nonevent cases. To see a splitting rule, we can place our cursor on a branch. This information is helpful in interpreting the tree.

    The Pruning Error plot is based on the misclassification rate because the target is binary. The plot shows the change in misclassification rate on the training and validation data as the tree grows (that is, as more leaves are added to the tree). The blue line represents the training data, and the orange line represents the validation data. In this plot, for the training data, does the misclassification rate consistently decrease? If so, the model improves as the size of the tree grows. For the validation data, you probably see that the misclassification rate decreases for the most part as the tree grows, but there are a few places where it increases. The selected subtree contains 51 leaves after complexity is optimized. Starting at this tree and for the next few trees, notice that the misclassification rate actually increases, which means that the model is getting worse.

    The Variable Importance table shows the final variables selected by the decision tree and their relative importance. The most important input variable has a relative importance of 1, and all other inputs are measured relative to it. In this case, notice that the decision tree selected ever_days_over_plan as the most important variable. The Importance Standard Deviation column shows the dispersion of the importance taken over several partially independent trees. So, for a single tree, this column has all zero values. For forest and gradient boosting models, the numbers would be nonzero.

    Farther down in the results are several code windows, one for each type of code that Model Studio generates. Supervised Learning nodes can generate as many as three types of score code (node score code, path EP score code, and DS2 package score code) as well as training code. We learn more about score code later in the course.

    The Output window shows that the TREESPLIT procedure is the underlying procedure for the Decision Tree node. It also shows the final decision tree model parameters, the Variable Importance table, and the pruning iterations.

  11. Click the Assessment tab. Explore the windows and plots that are described below:

    In the Lift Reports window, the Cumulative Lift plot is shown by default. We can interpret the plot as a comparison of the model's performance, at various depths of the data ranked by the posterior probability of the event, against a random model. Ideally, we want to see a lift greater than 1, which means that our model is outperforming a random model. Lift and cumulative lift are discussed in more detail later. Notice the information on the right that helps us interpret the plot.

    Because our data set has a binary target, the ROC Reports plot is also available. A ROC chart appears by default. The ROC chart plots sensitivity against 1 minus specificity for varying cutoff values. Sensitivity is defined as the true positive rate. And 1 minus specificity is defined as the false positive rate. Again, the information on the right helps us interpret the plot. We learn more about the ROC chart, along with sensitivity and specificity, later in the course.

  12. The Fit Statistics table shows the performance of the final model on the training and validation data sets. A useful fit statistic to consider is average squared error. Take note of the average squared error on validation data.

    Note: As we move forward with modifying our models in this course, we might want to write down the values of the fit statistics that we use to assess performance. This enables us to see whether our model is improving.
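    Note: The following minimal Python sketch shows how average squared error and cumulative lift can be computed from a 0/1 target and predicted event probabilities. The toy values are illustrative, and this is not the assessment code that Model Studio runs.

      import numpy as np

      def average_squared_error(y_true, p_event):
          """Mean squared difference between the 0/1 target and the predicted probability."""
          y_true, p_event = np.asarray(y_true, float), np.asarray(p_event, float)
          return np.mean((y_true - p_event) ** 2)

      def cumulative_lift(y_true, p_event, depth=0.20):
          """Event rate in the top `depth` fraction (ranked by score) divided by the overall event rate."""
          y_true, p_event = np.asarray(y_true, float), np.asarray(p_event, float)
          n_top = max(1, int(round(depth * len(y_true))))
          top = y_true[np.argsort(-p_event)][:n_top]
          return top.mean() / y_true.mean()

      y = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
      p = [0.1, 0.2, 0.15, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6, 0.4]
      print(average_squared_error(y, p), cumulative_lift(y, p, depth=0.3))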

  13. Notice that the fourth window on the Assessment tab shows the Event Classification chart.

  14. Close the results.

Machine Learning Using SAS® Viya®
Lesson 03, Section 2 Demo: Modifying the Structure Parameters

In this demonstration, we modify the tree structure parameters of the Decision Tree node that we added earlier in the Tree Based pipeline.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Tree Based pipeline, select the Decision Tree node.

  2. In the properties panel for the Decision Tree node, expand Splitting Options. Make the following changes (a rough open-source analogue of these settings follows the list):

    Note: The property Maximum number of branches specifies the maximum number of branches that a splitting rule produces. Use the default number of splits, which is 2.

    1. Increase Maximum depth from 10 to 14. This allows for a larger tree to be grown, which could lead to overfitting.

    2. Increase Minimum leaf size from 5 to 15. This change could help prevent overfitting.

    3. Increase Number of interval bins to 100.
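    Note: As a rough open-source analogue of these structure settings, the following Python sketch configures a scikit-learn decision tree with a comparable maximum depth and minimum leaf size. The Decision Tree node itself runs the TREESPLIT procedure, so this is only a conceptual stand-in; X_train, y_train, X_valid, and y_valid are assumed to be the prepared inputs and target.

      from sklearn.tree import DecisionTreeClassifier

      tree = DecisionTreeClassifier(
          max_depth=14,          # comparable to Maximum depth = 14
          min_samples_leaf=15,   # comparable to Minimum leaf size = 15
          random_state=1,
      )
      # tree.fit(X_train, y_train)
      # ase = ((tree.predict_proba(X_valid)[:, 1] - y_valid) ** 2).mean()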

  3. Right-click the Decision Tree node and select Run.

  4. Right-click the Decision Tree node and select Results.

  5. To look at performance of the decision tree, click the Assessment tab.

  6. In the Fit Statistics table, note the average squared error for the decision tree model on the VALIDATE partition. Is this fit statistic value slightly smaller than for the previous model? If so, this indicates that this model is performing better than the first model using the default settings. Keep in mind that modifying a model does not always result in better performance.

    Note: To assess performance, we could also look at the Lift chart or the ROC chart.

  7. Close the results.

Machine Learning Using SAS® Viya®
Lesson 03, Section 3 Demo: Modifying the Recursive Partitioning Parameters

In this demonstration, we change more settings of the Decision Tree node in the Tree Based pipeline. We modify the recursive partitioning parameters and compare this model performance to the models built earlier.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Tree Based pipeline, make sure that the Decision Tree node is selected.

  2. In the properties panel for the Decision Tree node, under Grow Criterion, change Class target criterion from Information gain ratio to Gini.

  3. Right-click the Decision Tree node and select Run.

  4. Right-click the Decision Tree node and select Results.

  5. Click the Assessment tab.

    In the Fit Statistics table, take note of the average squared error for the Decision Tree model on the VALIDATE partition. If there is a decrease in average squared error, this indicates an improved fit in the model based on changing the recursive partitioning parameters.

  6. Close the results.

Machine Learning Using SAS® Viya®
Lesson 03, Section 4 Demo: Modifying the Pruning Parameters

In this demonstration, we continue to modify the settings of the Decision Tree node in the Tree Based pipeline. We modify the pruning parameters and compare this model performance to the models built earlier.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Tree Based pipeline, make sure that the Decision Tree node is selected.

  2. In the properties panel, scroll down and expand Pruning Options.

  3. Change Subtree method from Cost complexity to Reduced error.

  4. Right-click the Decision Tree node and select Run.

  5. Right-click the Decision Tree node and select Results.

  6. Click the Assessment tab and expand the Fit Statistics table. Is the average squared error for this decision tree model the same as before on the VALIDATE partition? Remember that changing properties does not guarantee improvement in model performance.

  7. Write down the value of the average squared error and the value of KS. (We might need to scroll to the right to view the KS value.) You use these values later to compare the current model with the model that you create in the next optional practice.

  8. Close the Fit Statistics table.

  9. Close the Results window.

  10. Click the Run pipeline button.

  11. Right-click the Model Comparison node and select Results.

    The Model Comparison table shows which model is currently the champion from the Tree Based pipeline. This is based on the default fit statistic, KS. If we also compare the average squared error values for the two models, we likely get a smaller average squared error for the decision tree than for the logistic regression model, which indicates that the decision tree performs better on that statistic.

  12. Close the Results window.

Machine Learning Using SAS® Viya®
Lesson 03, Section 5 Demo: Building a Gradient Boosting Model

In this demonstration, we add a Gradient Boosting node to the Tree Based pipeline. We first run the gradient boosting model with default settings. We then change some settings and compare the model to the other models in the pipeline.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, make sure that the Tree Based pipeline is selected.

  2. Add a Gradient Boosting node (from the Supervised Learning group) after the Variable Selection node.

  3. Keep all properties for the Gradient Boosting node at their defaults.

  4. Run the Gradient Boosting node.

  5. Open the results for the Gradient Boosting node.

  6. Maximize the Error plot. This plot shows the performance of the model, based on average squared error, as the number of trees increases. In this case, the average squared error is decreasing (improving) on both the training and validation data sets. The results also include a Variable Importance table, several windows associated with scoring, and the Output window.

  7. Close the Error plot.

  8. Notice the Variable Importance plot and score code windows.

  9. The Output window indicates that the underlying procedure for the Gradient Boosting node is PROC GRADBOOST.

  10. Click the Assessment tab.

  11. Maximize the Fit Statistics table and note the average squared error on the VALIDATE partition.

  12. Close the Fit Statistics table.

  13. Close the Results window.

  14. With the Gradient Boosting node selected, make the following changes to the node properties (a rough open-source analogue of these settings follows the list):

    1. Reduce Number of trees from 100 to 50 in the properties panel.

    2. Under Tree-splitting Options, increase Maximum depth from 4 to 8. To change the value of Maximum depth, we can either move the slider or manually enter a value in the box.

    3. Increase Minimum leaf size from 5 to 15.

    4. Increase Number of interval bins from 50 to 100.

  15. Run the Gradient Boosting node.

  16. Open the results for the Gradient Boosting node.

  17. Click the Assessment tab and scroll down to the Fit Statistics table. Note the average squared error for this gradient boosting model on the VALIDATE partition. Is the value of this fit statistic slightly better than for the first gradient boosting model, which was based on the default settings?

  18. Write down the value of the average squared error. Maximize the Fit Statistics table and write down the value of KS. You use these values later to compare the current model with the model that you create in the next optional practice.

  19. Exit the maximized view of the Fit Statistics table.

  20. Close the Results window.
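
The Gradient Boosting node runs PROC GRADBOOST, so there is no code to write in this demo. For orientation only, here is a rough open-source analogue of the property changes in step 14, written as a minimal sketch with scikit-learn's HistGradientBoostingClassifier and synthetic stand-in data; it is a different implementation, so its results will not match the node's.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import HistGradientBoostingClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for the selected inputs and the binary churn target.
  X, y = make_classification(n_samples=5000, n_features=15, random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, random_state=1)

  model = HistGradientBoostingClassifier(
      max_iter=50,          # roughly analogous to Number of trees = 50
      max_depth=8,          # Maximum depth = 8
      min_samples_leaf=15,  # Minimum leaf size = 15
      max_bins=100,         # roughly analogous to Number of interval bins = 100
  )
  model.fit(X_train, y_train)

  # Average squared error on the validation partition.
  p_valid = model.predict_proba(X_valid)[:, 1]
  print("Validation ASE:", np.mean((y_valid - p_valid) ** 2))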

Machine Learning Using SAS® Viya®
Lesson 03, Section 5 Demo: Building a Forest Model

In this demonstration, we add a Forest node to the Tree Based pipeline. We first build a forest model using the default settings. We then change some of the settings and compare the model to the other models in the pipeline.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Tree Based pipeline, add a Forest node (from the Supervised Learning group) after the Variable Selection node.

  2. Keep all properties for the Forest node at their default settings.

  3. Right-click the Forest node and select Run.

  4. Right-click the Forest node and select Results. Note: Many of the items in the Results window for the forest model are similar to items that we saw in the results for the gradient boosting model in a previous demo.

    The Error plot shows the performance of the model as the number of trees increases. This plot contains three lines that show performance on the training data, the validation data, and the out-of-bag sample, respectively. We also see a variable importance table and the same score code and output windows that we saw for the gradient boosting model.

    The Output window shows that the underlying procedure is the FOREST procedure.

  5. Click the Assessment tab.

    The Fit Statistics table shows the average squared error on the VALIDATE partition.

  6. Close the Results window.

  7. Make sure that the Forest node is selected. In the node properties panel on the right, make the following changes (a rough open-source analogue of these settings appears in the sketch after these steps):

    1. Reduce Number of trees from 100 to 50.

    2. Under Tree-splitting Options, change Class Target Criterion from Information gain ratio to Entropy.

    3. Decrease Maximum depth from 20 to 12.

    4. Increase Minimum leaf size from 5 to 15.

    5. Increase Number of interval bins to 100.

    6. The default number of inputs to consider per split is the square root of the total number of inputs. Clear the check box for this option and set Number of inputs to consider per split to 7, about half the number of inputs that come from the Variable Selection node.

  8. Run the Forest node.

  9. Open the results for the Forest node.

  10. Click the Assessment tab and scroll down to the Fit Statistics table. Take note of the average squared error for this forest model on the VALIDATE partition. Did this fit statistic decrease slightly? If so, this model performs a little better than the first model, which used the default settings.

  11. Write down the value of the average squared error. Maximize the Fit Statistics table and write down the value of KS. You use these values later to compare the current model with the model that you create in the next optional demo.

  12. Close the Fit Statistics table.

  13. Close the Results window.

  14. To see how the forest model compares to the other models in the pipeline, click the Run pipeline button.

  15. Right-click the Model Comparison node and select Results. How does the performance of the forest model compare to the other models in the pipeline?

  16. Close the Results window.
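
The Forest node runs the FOREST procedure, so again there is no code to write. As a point of reference, here is a rough scikit-learn analogue of the settings from step 7, shown as a minimal sketch on synthetic stand-in data; RandomForestClassifier has no direct equivalent for Number of interval bins, and its results will differ from PROC FOREST.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for the selected inputs and the binary churn target.
  X, y = make_classification(n_samples=5000, n_features=15, random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, random_state=1)

  model = RandomForestClassifier(
      n_estimators=50,      # Number of trees = 50
      criterion="entropy",  # Class target criterion = Entropy
      max_depth=12,         # Maximum depth = 12
      min_samples_leaf=15,  # Minimum leaf size = 15
      max_features=7,       # Number of inputs to consider per split = 7
      oob_score=True,       # forests also report out-of-bag performance
      random_state=1,
  )
  model.fit(X_train, y_train)

  p_valid = model.predict_proba(X_valid)[:, 1]
  print("Validation ASE:", np.mean((y_valid - p_valid) ** 2))
  print("Out-of-bag accuracy:", model.oob_score_)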

Lesson 04


Machine Learning Using SAS® Viya®
Lesson 04, Section 1 Demo: Building a Neural Network Using the Default Settings

In this demonstration, we create a new pipeline in the Demo project, using the CPML demo pipeline. We build a neural network model using the default settings.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the plus sign next to the Tree Based pipeline tab to add a new pipeline.

  2. In the New Pipeline window, enter information about the new pipeline:

    1. Enter Neural Network in the Name field.

    2. Under Select a pipeline template, click the down arrow to browse templates. Select the CPML demo pipeline. The CPML demo pipeline appears in this menu because we used it in a previous demo.

  3. Click Save.

  4. In the Neural Network pipeline, add a Neural Network node (from the Supervised Learning group) after the Variable Selection node.

  5. Select the Neural Network node to activate its properties panel. Keep all properties for the Neural Network node at their defaults.

  6. Right-click the Neural Network node and select Run.

  7. Right-click the Neural Network node and select Results. Explore the following charts and plots, which help us evaluate the model's performance:

    The Network Diagram presents the final neural network structure for this model, including the hidden layer and the hidden units.

    The Iteration plot shows the model's performance, based on validation error, as training iterations are added.

    As usual, we see the score code windows.

    The Output window shows the results from the NNET procedure: the final neural network model parameters, the iteration history, and the optimization process.

  8. Click the Assessment tab, and explore the results. Note the following:

    In the Lift Reports window, the Cumulative Lift plot shows the model's performance ordered by the percentage of the population. This plot is useful for selecting a model when you plan to act on only a particular portion of the customer base. (The sketch after these steps shows how cumulative lift and the ROC curve can be computed from a model's predictions.)

    For a binary target, we also have the ROC curve in the ROC Reports window. The ROC curve plots the true positive rate against the false positive rate across all probability cutoffs.

    The Fit Statistics table shows the model's performance based on various assessment measures, such as average squared error. Note the average squared error on validation data.

  9. Close the Results window.
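
Both assessment reports described in step 8 are derived from the model's predicted event probabilities. The following is a minimal sketch of how cumulative lift at a chosen depth and the ROC curve can be computed; the arrays are hypothetical stand-ins for one model's validation predictions, not output from the Neural Network node.

  import numpy as np
  from sklearn.metrics import roc_auc_score, roc_curve

  # Hypothetical actual 0/1 target values and predicted event probabilities.
  rng = np.random.default_rng(1)
  actual = rng.integers(0, 2, size=1000)
  p_event = np.clip(actual * 0.3 + rng.uniform(0, 0.7, size=1000), 0, 1)

  # Cumulative lift at 20% depth: event rate among the top-scored 20% of
  # observations divided by the overall event rate.
  depth = 0.20
  order = np.argsort(-p_event)
  top = actual[order][: int(depth * len(actual))]
  cumulative_lift = top.mean() / actual.mean()

  # ROC curve: true positive rate versus false positive rate across cutoffs.
  fpr, tpr, _ = roc_curve(actual, p_event)
  auc = roc_auc_score(actual, p_event)

  print(f"Cumulative lift at {depth:.0%} depth = {cumulative_lift:.2f}")
  print(f"Area under the ROC curve = {auc:.3f}")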

Machine Learning Using SAS® Viya®
Lesson 04, Section 2 Demo: Modifying the Neural Network Architecture

In this demonstration, we modify the network architecture parameters of the neural network model with the intent to improve performance.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, select the Neural Network node.

  2. In the properties panel for the node, make the following changes:

    1. Change Input standardization from Midrange to Z score.

    2. Expand the Hidden Layer options. Clear the check box for Use the same number of neurons in hidden layers.

    3. Under Custom Hidden Layer Options, enter 26 for Hidden layer 1: number of neurons. This is about twice the number of inputs coming from the Variable Selection node. (A rough open-source analogue of these two changes appears in the sketch after these steps.)

      Note: Under Target Layer Options, notice the Direct connections property. If you later want to create a skip-layer perceptron, select this check box.

  3. Right-click the Neural Network node and select Run.

  4. Right-click the Neural Network node and select Results.

  5. Click the Assessment tab.

    In the Fit Statistics table, take note of the average squared error for this neural network model on the VALIDATE partition. Is this fit statistic value better than for the first model (which used the default settings)?

  6. Close the Results window.
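
For orientation only, here is a rough open-source analogue of the two architecture changes: z-score standardization of the inputs and a single hidden layer with 26 neurons. It is a minimal sketch on synthetic stand-in data using scikit-learn's MLPClassifier, not PROC NNET, so its numbers will not match the demo.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.neural_network import MLPClassifier
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  # Synthetic stand-in for the inputs that survive variable selection.
  X, y = make_classification(n_samples=5000, n_features=13, random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, random_state=1)

  model = make_pipeline(
      StandardScaler(),                        # Input standardization = Z score
      MLPClassifier(hidden_layer_sizes=(26,),  # one hidden layer, 26 neurons
                    max_iter=500, random_state=1),
  )
  model.fit(X_train, y_train)

  p_valid = model.predict_proba(X_valid)[:, 1]
  print("Validation ASE:", np.mean((y_valid - p_valid) ** 2))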

Machine Learning Using SAS® Viya®
Lesson 04, Section 3 Demo: Modifying the Learning and Optimization Parameters

In this demonstration, we modify the learning and optimization parameters of the neural network model in the Neural Network pipeline, and compare the model performance to the performance of the logistic regression model already in the pipeline.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Neural Network pipeline, make sure that the Neural Network node is selected.

  2. In the properties panel for the node, make the following changes:

    1. Under Common Optimization Options, increase L1 weight decay from 0 to 0.01.

      Note: The options Maximum iterations and Maximum time control early stopping. For this model, do not change these options.

    2. Decrease L2 weight decay from 0.1 to 0.0001. (The sketch after these steps shows how the L1 and L2 penalties enter the training objective.)

  3. Right-click the Neural Network node and select Run.

  4. Right-click the Neural Network node and select Results.

  5. Click the Assessment tab and scroll down to the Fit Statistics table. Take note of the average squared error for this neural network model on the VALIDATE partition.

  6. Write down the value of the average squared error. Maximize the Fit Statistics table and write down the value of KS. You use these values later to compare the current model with the model that you create in the next optional practice.

  7. Close the Fit Statistics table.

  8. Close the Results window.

  9. To identify the champion model in this pipeline, do the following:

    1. Click the Run pipeline button to run the entire pipeline.

    2. Right-click the Model Comparison node and select Results.

      The neural network model is the champion model of the pipeline, based on the default statistic, KS.

  10. Close the Results window.
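
The two weight decay properties add penalty terms to the network's training objective. The sketch below only illustrates how the L1 and L2 terms enter that objective; the loss value and weight vector are hypothetical, and PROC NNET's implementation details are not shown.

  import numpy as np

  def penalized_objective(loss, weights, l1=0.01, l2=0.0001):
      """Training loss plus L1 and L2 weight decay penalties.

      l1 * sum(|w|) pushes small weights toward exactly zero;
      l2 * sum(w**2) shrinks large weights toward zero.
      """
      w = np.asarray(weights, dtype=float)
      return loss + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

  # Example with a hypothetical loss of 0.42 and a few network weights.
  print(penalized_objective(0.42, [1.5, -0.3, 0.0, 2.1]))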

Lesson 05


Machine Learning Using SAS® Viya®
Lesson 05, Section 1 Demo: Building a Support Vector Machine Using the Default Settings

In this demonstration, we create a new pipeline based on the CPML demo pipeline, and add a Support Vector Machine (SVM) node to it. We build the support vector machine model using the default settings.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the plus sign next to the Neural Network tab to add a new pipeline.

  2. In the New Pipeline window, enter the following information:

    1. In the Name field, enter Support Vector Machine.

    2. Under Select a pipeline template, select CPML demo pipeline.

  3. Click Save.

  4. Add a Support Vector Machine (SVM) node (from the Supervised Learning group) under the Variable Selection node.

  5. Select the SVM node.

  6. In the properties panel, keep all properties for the SVM node at their defaults.

  7. Run the SVM node.

  8. Open the results for the SVM node. Explore the following charts and plots, which help us evaluate the model's performance:

    • The Fit Statistics table presents several assessment measures that indicate the performance of the support vector machine model.

    • The Training Results table shows the parameters for the final support vector machine model, such as the number of support vectors and the bias, which is the intercept (offset) term in the support vector machine's decision function.

    • As with previous models, we see score code windows.

    • The Output window shows the final support vector machine model parameters, the training results, the iteration history, the misclassification matrix, the fit statistics, the predicted probability variables, and the underlying procedure (the SVMACHINE procedure).

  9. Click the Assessment tab. As usual, we see the lift reports, the ROC reports, the Event Classification plot, and the Fit Statistics table. In the Fit Statistics table, take note of the average squared error on the VALIDATE partition.

  10. Close the Results window.

Machine Learning Using SAS® Viya®
Lesson 05, Section 2 Demo: Modifying the Methods of Solution Parameters

In this demonstration, we modify one of the key methods of solution parameters for the support vector machine model in an attempt to improve its performance.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Support Vector Machine pipeline, make sure that the SVM node is selected.

  2. In the properties panel, change Penalty from 1 to 0.1. The Penalty value balances model complexity against training error: a larger value fits the training data more closely, at the risk of overfitting, whereas a smaller value yields a simpler, more regularized model. (The sketch after these steps illustrates this trade-off with a rough open-source analogue.)

  3. Run the SVM node.

  4. Right-click the SVM node and select Results.

  5. Click the Assessment tab.

  6. The Fit Statistics table shows the average squared error on validation data. Has the value increased or decreased?

  7. Close the Results window.
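
The trade-off that the Penalty property controls can be illustrated with a rough open-source analogue, where the corresponding scikit-learn parameter is C. This is a minimal sketch on synthetic stand-in data with a linear kernel (the node's current setting); it will not reproduce the SVMACHINE results.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.svm import SVC

  # Synthetic stand-in for the selected inputs and the binary churn target.
  X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, random_state=1)

  for penalty in (1.0, 0.1):
      model = SVC(kernel="linear", C=penalty, probability=True, random_state=1)
      model.fit(X_train, y_train)
      p_valid = model.predict_proba(X_valid)[:, 1]
      ase = np.mean((y_valid - p_valid) ** 2)
      print(f"Penalty (C) = {penalty}: validation ASE = {ase:.4f}")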

Machine Learning Using SAS® Viya®
Lesson 05, Section 3 Demo: Increasing the Flexibility of the Support Vector Machine

In this demonstration, we attempt to improve the performance of the support vector machine by modifying three options: the kernel function, the tolerance, and maximum iterations. We then compare the model performance to the logistic regression model in the pipeline.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Support Vector Machine pipeline, make sure that the SVM node is selected.

  2. In the properties pane for the node, make the following changes (a rough open-source analogue of these settings appears in the sketch after these steps):

    1. Change Kernel from Linear to Polynomial. Leave Polynomial degree as 2.

      Note: In Model Studio, only degrees of 2 and 3 are available.

    2. Increase Tolerance from 0.000001 to 0.6.

      Note: The Tolerance value balances the number of support vectors and model accuracy. A Tolerance value that is too large creates too few support vectors. A value that is too small overfits the training data.

    3. Decrease Maximum iterations from 25 to 10.

  3. Run the SVM node.

  4. Open the results for the support vector machine model.

  5. Click the Assessment tab.

  6. In the Fit Statistics window, take note of the average squared error on validation data. Is this fit statistic better than for the previous model? Write down the value of the average squared error.

  7. Maximize the Fit Statistics table and write down the value of KS. You use these values later to compare the current model with the model that you create in a later optional practice. Close the Fit Statistics table.

  8. Close the Results window.

  9. To determine the champion model from this pipeline, run the Model Comparison node by clicking the Run Pipeline button.

  10. Look at the results of the Model Comparison node. Based on the KS statistic (the default), which model is the champion from this pipeline?

  11. Close the Results window.
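
For orientation, here is a rough scikit-learn analogue of the three changes in step 2 (a degree-2 polynomial kernel, a larger tolerance, and a lower iteration cap), again on synthetic stand-in data. SVMACHINE's Tolerance and Maximum iterations are not defined identically to scikit-learn's tol and max_iter, so treat this only as an illustration of which knobs were moved.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.svm import SVC

  X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, random_state=1)

  model = SVC(
      kernel="poly", degree=2,  # Kernel = Polynomial, Polynomial degree = 2
      C=0.1,                    # Penalty value kept from the previous demo
      tol=0.6,                  # larger tolerance stops optimization sooner
      max_iter=10,              # iteration cap (scikit-learn warns if it stops early)
      probability=True,
      random_state=1,
  )
  model.fit(X_train, y_train)

  p_valid = model.predict_proba(X_valid)[:, 1]
  print("Validation ASE:", np.mean((y_valid - p_valid) ** 2))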

Machine Learning Using SAS® Viya®
Lesson 05, Section 3 Demo: Adding Model Interpretability

In this demonstration, we use the Model Interpretability feature to provide some explanation about the support vector machine model.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, in the Support Vector Machine pipeline, select the SVM node.

  2. In the properties pane for the SVM node, make the following changes:

    1. Under Post-training Properties, expand Model Interpretability.

    2. Expand Global Interpretability and select both Variable importance and PD plots.

    3. Expand Local Interpretability and select the check boxes for ICE plots, LIME, and Kernel SHAP.

    4. Under Maximum number of Kernel SHAP variables, move the slider to change the number to 10. This means that 10 inputs are displayed in the chart, ordered by importance according to the absolute Kernel SHAP values.

    5. Notice that Specify instances to explain is set to Random. This setting provides explanations for five randomly selected observations. Although we will not change this setting now, note that it is possible to select five observations from the data instead.

  3. Run the SVM node.

  4. Open the results for the SVM node.

  5. Notice that there is a new tab in addition to the Node and Assessment tabs: Model Interpretability. Click the Model Interpretability tab.

  6. Expand the Surrogate Model Variable Importance table. The most important inputs are listed in descending order of their importance. What appears to be the most important predictor? Relative importance is based on simple decision trees that are only one level deep. We will see that the inputs used for the PD and ICE plots are the top predictors from this table.

  7. Expand the PD plot. This plot shows the marginal effect of a feature (in this case, ever_days_over_plan) on the predicted outcome of the model that we just fit. To compute it, the chosen feature is fixed at a grid of values, and the model's predictions are averaged over the other features. A PD plot can show whether the relationship between the target and the feature is linear, monotonic, or more complex. (The sketch after these steps shows this computation.)

    On the right side of the plot, notice that a concise description of this PD plot appears. Many plots on the Model Interpretability tab provide this type of description.

  8. To look at the relationship between the model's predictions and a different variable, use the View chart menu. By default, the menu shows the five most important inputs, based on one-level decision trees that use each input to predict the model's predicted values.

    Select the categorical variable handset_age_grp. This PD plot indicates that the highest probability of churn is associated with the middle age group (that is, the middle level of the variable): handsets between 24 and 48 months old. The newest handsets (those less than 24 months old) have the next highest probability of churn. And the oldest handsets have the lowest probability of churn. This makes sense from a business standpoint. A new device has a lower probability of churn because the customer hasn't had time to test it out yet. At the other end, if a customer has had a handset for more than four years, they probably like it.

  9. Close the PD plot.

  10. Expand the PD and ICE Overlay plot. This plot overlays the partial dependence (PD) results and the individual conditional expectation (ICE) results. There are six lines: five ICE lines and one PD line.

    ICE plots can help reveal interesting subgroups and interactions between model variables. For a chosen feature, an ICE plot shows how changes in the feature relate to changes in the prediction for individual observations. This ICE plot shows one line for each of the five randomly chosen observations, as specified in the node properties that we saw earlier.

    This ICE plot shows churn probability by ever_days_over_plan. Each line represents the conditional expectation for one customer instance. The plot indicates that, for all five instances, the probability of churn consistently increases as ever_days_over_plan increases, with the other features held constant. Such relationships are not apparent in PD plots because they are averaged out.

    When evaluating an ICE plot of an interval input, the most useful pattern to look for is intersecting slopes. Intersecting slopes indicate that there is an interaction between the plot variable and one or more complementary variables. In this case, ever_days_over_plan does not show any interactions.

  11. Look at the ICE plot for a different variable. From the View chart menu, select handset_age_grp.

    When evaluating an ICE plot of a categorical input, it is useful to look among individual observations for different relationships between the groups (or levels) of the categorical variable and the target. Significant differences in these relationships indicate group effects. Five individuals are represented in this plot, with the predicted probability of churn calculated separately for each individual at each level of handset_age_grp. For this variable, the trend of observing the lowest probability in the oldest handset age group holds true for all five individuals.

  12. Close the PD and ICE Overlay plot.

  13. Notice the LIME and Kernel SHAP plots. These plots are created by explaining individual predictions. In a given feature space, Shapley values help us determine where we are, how we got there, and how influential each variable is at that location. In contrast, LIME values help us determine how changes in a variable's value affect the model's prediction.

  14. Expand the LIME Explanations plot.

    This LIME plot displays the regression coefficients for the inputs selected by a local surrogate linear regression model. This surrogate model fits the predicted probability of the event (1) for the target churn for each of the five randomly chosen observations. In the chart, the inputs are ordered by significance, with the most significant input for the local regression model appearing at the bottom of the chart.

    The LASSO technique is used to select the most significant effects from the set of inputs that was used to train the model. A positive estimate indicates that the observed value of the input increases the predicted probability of the event. For example, in the demo video, the value of 0 for delinq_indicator decreases the predicted probability of the event (1) for the target churn by 0.1516 compared to the individual having a different value for delinq_indicator. Note: When we perform this demo, our results might differ.

  15. Close the LIME Explanations plot.

  16. Expand the Kernel SHAP Values plot.

    Unlike LIME coefficients, SHAPLEY values do not come from a local regression model. For each individual observation, an input's Shapley value is the contribution of the observed value of the input to the predicted probability of the event (1) for the target churn. The Shapley values of all inputs sum to the predicted value of that local instance. The inputs are displayed in the chart, ordered by importance according to the absolute Kernel SHAP values, with the most significant input appearing at the bottom of the chart.

    The Kernel SHAP values are the regression coefficients that are obtained by fitting a weighted least squares regression. Note that each nominal input is binary encoded based on whether it matches the individual observation. Interval inputs are binary encoded based on their proximity to the individual observation with a value of 1 if the observation is close to the local instance. To eliminate the bias of collinearity in regression, Shapley values average across all permutations of the features joining the model. Therefore, Shapley values control for variable interaction.

  17. Close the Kernel SHAP Values plot.

  18. Close the results.

  19. The Model Comparison node needs to be run because we turned on Model Interpretability in the SVM node above it. Run the entire pipeline and view the results of model comparison. In the demo video, the SVM model is the champion of this pipeline based on KS. Note: When we perform these steps, a different model might be the champion.

  20. Close the results.
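
Partial dependence and ICE are model-agnostic: both are computed by replacing one feature's values over a grid and rescoring the model. The following is a minimal sketch of that computation for any fitted classifier with a predict_proba method; the data, model, and feature index are synthetic stand-ins, and Model Studio's implementation differs in detail (for example, it chooses the plotted features from the surrogate-tree variable importance).

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier

  # Synthetic stand-in data and model; in the demo, the model is the SVM.
  X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
  model = GradientBoostingClassifier(random_state=1).fit(X, y)

  feature = 3  # index of the feature to explain
  grid = np.linspace(X[:, feature].min(), X[:, feature].max(), num=10)

  # ICE: for a few individual observations, vary only the chosen feature
  # and record the predicted event probability at each grid value.
  instances = X[:5].copy()  # five "instances to explain"
  ice = np.empty((len(instances), len(grid)))
  for j, value in enumerate(grid):
      varied = instances.copy()
      varied[:, feature] = value
      ice[:, j] = model.predict_proba(varied)[:, 1]

  # PD: the same idea, averaged over all observations in the data.
  pd_curve = np.empty(len(grid))
  for j, value in enumerate(grid):
      varied = X.copy()
      varied[:, feature] = value
      pd_curve[j] = model.predict_proba(varied)[:, 1].mean()

  print("PD curve:", np.round(pd_curve, 3))
  print("First ICE line:", np.round(ice[0], 3))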

Lesson 06


Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Comparing Models within a Pipeline

In this demonstration, we run and interpret the Model Comparison node in the Tree Based pipeline. We compare the models' performances based on different fit statistics.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, select the Tree Based pipeline. Note: We are using this pipeline because it has more models built than the other pipelines.

  2. Use the default assessment measure for binary targets, which is defined in the Rules properties of the Project Settings window. To view these settings, do the following:

    1. Click the Settings button in the upper right corner of the project window, and select Project settings.

    2. In the left pane of the Project Settings window, select Rules. The default statistic for class selection is the Kolmogorov-Smirnov (KS) statistic.

    3. Click Cancel.

  3. In the Tree Based pipeline, select the Model Comparison node.

    Note: We can also change the fit statistics in the properties for this node.

  4. To make sure that we're looking at the most recent results from the Model Comparison node, right-click the Model Comparison node and select Run.

  5. Right-click the Model Comparison node and select Results.

    The Model Comparison table shows the champion model based on the default statistic (in this case, KS).

  6. Scroll down to see the Properties table. The Properties table shows the criteria used to evaluate the models and select the champion.

  7. Click the Assessment tab and expand the Lift Reports plot.

    The lift report shows results based on the response percentage. Using the menu in the upper left corner, we can also choose to see the models' performance based on the captured response percentage, cumulative captured response percentage, cumulative response percentage, cumulative lift, gain, and lift.

  8. Close the Lift Reports plot.

  9. Expand the ROC Reports plot.

    The ROC Reports plot is based on Accuracy, by default. Using the menu in the upper right corner, we can also see the models' performances based on the F1 Score and ROC.

  10. Close the ROC Reports plot.

  11. Expand the Fit Statistics table.

    The Fit Statistics table shows how each model in the pipeline performs on the data partitions defined in the project settings (train, validate, and test) for a series of fit statistics, such as Area Under ROC, Average Square Error, Gini Coefficient, and KS, among others. (The sketch after these steps shows how several of these statistics can be computed and compared across models.)

  12. Close the Fit Statistics table.

  13. Close the Results window.
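
Most of the statistics in this table are simple functions of the actual target and each model's predicted event probabilities, so competing models can be compared on whichever statistic you choose. Below is a minimal sketch with two sets of hypothetical validation predictions; note that the Gini coefficient follows from the area under the ROC curve as Gini = 2 * AUC - 1.

  import numpy as np
  from sklearn.metrics import roc_auc_score

  rng = np.random.default_rng(1)
  actual = rng.integers(0, 2, size=1000)

  # Hypothetical predicted event probabilities from two competing models.
  models = {
      "decision_tree": np.clip(actual * 0.30 + rng.uniform(0, 0.7, 1000), 0, 1),
      "logistic_reg":  np.clip(actual * 0.25 + rng.uniform(0, 0.7, 1000), 0, 1),
  }

  for name, p in models.items():
      ase = np.mean((actual - p) ** 2)
      auc = roc_auc_score(actual, p)
      gini = 2 * auc - 1
      cuts = np.unique(p)  # KS: maximum separation of event and non-event scores
      tpr = np.array([np.mean(p[actual == 1] >= c) for c in cuts])
      fpr = np.array([np.mean(p[actual == 0] >= c) for c in cuts])
      ks = np.max(np.abs(tpr - fpr))
      print(f"{name}: ASE={ase:.4f} AUC={auc:.3f} Gini={gini:.3f} KS={ks:.3f}")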

Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Comparing Models across Pipelines

In this demonstration, we run the pipeline comparison. Pipeline comparison enables us to compare the best models from each pipeline created. It also enables us to register the overall champion model and use it in other tools.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Pipeline Comparison tab.

    At the top, we see the champion model from each pipeline as well as the model deemed the overall champion in the pipeline comparison, the champion of champions. The overall champion is selected by default and is indicated by a star in the Champion column.

    In addition, several charts and tables summarize the performance of the overall champion model (the selected model), show the Variable Importance list of the model, provide training and score codes, and show other outcomes from the selected best model. The default assessment measure for pipeline comparison is Kolmogorov-Smirnov (KS).

    All the results shown are for the overall champion model only. We might want to perform a model comparison of each of the models shown.

  2. Select the check boxes next to all the models shown at the top of the Results page. We can also select the check box next to the word Champion at the top of the table.

  3. When multiple models are selected, the Compare button in the upper right corner is activated. Click Compare.

    The Compare results enable us to compare assessment statistics and graphics across the models currently selected on the Pipeline Comparison tab.

  4. Close the Compare results window.

  5. To add a challenger model (a model that was not automatically selected) to the pipeline comparison, perform the following steps:

    1. Return to the pipeline that contains the desired model (here, the Tree Based pipeline).

    2. Right-click the node for a model other than the pipeline champion and select Add challenger model from the pop-up menu.

    3. Click the Pipeline Comparison tab. The selected model now appears in the Pipeline Comparison table at the top, in the Challenger column.

  6. To prepare to register the overall champion model in a later demonstration, clear the check boxes for all other models in the table at the top of the Pipeline Comparison tab.

Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Reviewing a Project Summary Report on the Insights Tab

In this demonstration, we look at a project summary report for the Demo project from the Pipeline Comparison tab.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Insights tab.

    The Insights tab contains summary information in the form of a report for the project, the champion model, and any challenger models. For the purposes of the Insights tab, a champion model is the overall project champion model, and a challenger model is one that is a pipeline champion, but not the overall project champion.

    At the top of the report is a summary of the project and a list of any project notes. Summary information about the project includes the target variable, the champion model, the event rate, and the number of pipelines in the project.

  2. Maximize the plot for Most Common Variables Selected Across All Models. This plot summarizes common variables used in the project by displaying the number of pipeline champion models that the variables appear in. Only variables that appear in models used in the pipeline comparison are displayed.

    The plot shows that many variables were used by all models in the pipeline comparison. These variables are listed at the top of the plot. Variables not used in all models are listed at the bottom of the plot.

  3. Close the Most Common Variables Selected Across All Models plot.

  4. Maximize the Assessment for All Models plot. This plot summarizes model performance for the champion model across each pipeline and the overall project champion. The orange star next to the model indicates that it is the project champion.

    In the demo video, the champion is the forest. Take note of the KS value for the model that is selected as the champion when you practice these steps.

  5. Close the Assessment for All Models plot.

  6. Maximize the Most Important Variables for Champion Model plot. This plot shows the most important variables, as determined by the relative importance calculated using the actual overall champion model.

  7. Close the Most Important Variables for Champion Model plot.

  8. At the bottom of the results, notice the Cumulative Lift for Champion Model plot. This plot displays the cumulative lift for the overall project champion model for both the training and validation partitions.

  9. To prepare for model deployment, return to the pipeline comparison results by clicking the Pipeline Comparison tab.

Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Registering the Champion Model

In this demonstration, we register the champion model in the Demo project. Registering the model makes it available to other SAS applications.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, on the Pipeline Comparison tab, make sure that only the champion model is selected.

  2. On the right side of the window, click the Project pipeline menu (the three vertical dots). Note that the Manage Models option is not available.

  3. Select Register models to open the Register Models window. Wait until we see the following indications that the registration process is finished:
    • The spinning circle next to Registering in the Status column indicates that the selected model is actively being registered.
    • The Register Models window is updated to indicate that the registration process has successfully completed.

    Note: You can register a model from a Supervised Learning node when a preceding Open Source Code node contains Python score code.

  4. Close the Register Models window.

  5. In the table at the top of the Pipeline Comparison tab, notice the new Registered column. This column indicates that the champion model was registered.

Note: After the model is registered, we can view and use it in SAS Model Manager. In SAS Model Manager, we can export the score code in different formats, deploy the model, and manage its performance over time. We see this in a later demonstration.


Machine Learning Using SAS® Viya®
Lesson 06, Section 1 Demo: Exploring the Settings for Model Selection

In this demonstration, we explore some of the settings for model selection that we can change if we don't want to use the default values for model comparison. Note: It is helpful to know about these settings for future projects, but we use the default settings for the course project.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Tree Based pipeline.

  2. To explore the settings for comparing models within a single pipeline, perform the following steps:

    1. Select the Model Comparison node.

    2. In the properties panel for the node, notice that the three properties (Class selection statistic, Interval selection statistic, and Selection partition) are all set to the default value, Use rule from project settings. Remember that earlier, when we set up the course project, we modified the data partition in the project settings. If we change the model selection settings here instead of in the project settings, those settings apply only to the current pipeline.

    3. For a class or interval target, we can select a different measure. Click the Class selection statistic drop-down and select Average squared error. The green check mark on the Model Comparison node disappears, indicating that we need to run the node again to take advantage of the new setting.

    4. Notice that two properties at the bottom of the panel are currently inactive: Selection depth and ROC-based cutoff. When we select a response-based measure, such as Cumulative lift, we can also specify a selection depth other than the default. When we select an ROC-based measure, such as ROC separation, we can specify a cutoff other than the default.

    5. Change the Class selection statistic property so that it is back to the default setting, Use rule from project settings.

    6. Click the Selection partition drop-down. The available options are Test, Train, and Validate. Leave the default value, Use rule from project settings.

  3. To explore the settings for comparing models across pipelines, perform the following steps:

    1. Click the Settings icon in the upper right corner, and then select Project settings.

    2. Select Rules in the left pane of the Edit Project Settings window. On the right, the first three Model Comparison properties are the same properties that we saw in the properties pane for the Model Comparison node: Class selection statistic, Interval selection statistic, and Selection partition.

    3. Close the Project Settings window.

Machine Learning Using SAS® Viya®
Lesson 06, Section 2 Demo: Viewing the Score Code and Running a Scoring Test

In this demonstration, we access SAS Model Manager from Model Studio and run a scoring test using the champion model that we registered in an earlier demo. Before we deploy a model, it is often important to run a scoring test in a nonproduction environment to make sure that the score code runs without errors.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, click the Pipeline Comparison tab.

  2. On the right, click the Project pipeline menu (three vertical dots). Notice that the Manage Models option is now available because at least one model has been registered. Select Manage Models from the menu.

    By default, when Model Manager opens, we see a list of files that contain various types of code for training and scoring the registered model.

  3. In the left pane of Model Manager, click the Projects icon (the second icon from the top) to display the Projects list. Notice that the Demo project appears in this list. The SAS Model Manager project named Demo is based on the Model Studio project of the same name.

  4. Click the name of the Demo project to open it. The Models tab, which is selected by default, lists the model that we registered earlier in Model Studio.

    Note: The demo video corresponding to these steps was updated more recently than the earlier demo video in which the model was registered. Notice that the registered model in these two demo videos is different. When we perform these steps, it doesn't matter which model is selected as the champion and registered. The steps are the same.

  5. Notice the tabs at the top of the page. These tabs are used during the entire model management process, which goes beyond model deployment. In this demonstration, we focus on the tabs that are used for the scoring test.

  6. To open the registered model, click its name. (Do not click the selection check box next to the model name.) Notice the tabs near the top of the page. The Files tab is selected by default. On the left is the same list of files related to scoring this model that we saw earlier.
  7. Note: Depending on which version of SAS Viya we are using, the Files tab displays either a list of files organized alphabetically, or files organized by type.

  8. To see the score code that Model Studio generated for this model and the data preparation nodes in the pipeline, select dmcas_epscorecode.sas in the left panel. The score code appears on the right. If we want, scroll down through the code. The score code varies by model. We do not need to be able to understand the code in order to run a scoring test.

    Note: After we test our score code, a likely next step is to export the code from the Files tab. Then we can put the model into production by deploying it in a variety of environments.

  9. In the upper right corner of the score code window, click Close.

    The Models page appears, which lists all models registered across all projects. Here, we have only one project (the Demo project) and one registered model.

  10. On the left, click the Projects icon to return to the Demo project. It's time to create and run a scoring test on the selected model.

  11. Click the name of the model (not the check box).

  12. Click the Scoring tab.

  13. On the Tests tab, click New Test.

  14. In the New Test window, enter the following information:

    1. In the Name box, enter CPML_Champion.

    2. In the Description box, enter Demo project champion. (Entering a description is optional.)

    3. Below Model, click Choose Model. In the Choose a Model window, select the champion model. The Choose a Model window closes and we return to the New Test window.

  15. To select the data source for the test, perform the following steps:

    1. Below Input table, click the Browse button. The Choose Data window appears. The three tabs at the top (Available, Data Sources, and Import) are the same as the tabs that we saw in an earlier demo, when we selected the data source for the Demo project.

    2. The data source to be scored is not yet loaded into memory, so we need to import it. Click the Import tab.

    3. Expand Local files and select Local file.

    4. In the Open window, navigate to D:\Workshop\winsas\CPML.

    5. Select score_commsdata.sas7bdat and click Open.

    6. Back in the Choose Data window, click Import Item.

    7. When we see a message indicating that the table is successfully imported, click OK.

    8. In the New Test window, notice that the name of the imported data set now appears in the Data source field.

    9. Click Save.

  16. Back on the Scoring tab for the Demo project, select the check box next to the name of the scoring test that we just created.

  17. In the upper right corner of the table, click Run.

    When the run is finished, the Status column has a green check mark and a table icon appears in the Results column. This indicates that the test ran successfully.

  18. To open the test results, click the table icon in the Results column.

  19. In the left pane, under Test Results, click Output. By default, the score data table shows new variables created during data preparation; the new variables created during the scoring process, which contain the predictions; and all the original variables.

  20. Scroll to the right until we see the Predicted: churn = 1 column. The predicted values of churn are used to make business decisions.
  21. Note: Depending on which version of SAS Viya we are using, the variable names created during the scoring process may be different.

    Note: If we want, we can reduce the number of columns or rearrange columns in the output table. To do this, click the Options icon in the upper right corner and select Manage columns.

  22. Close the Output Table window. From here, we can use the Applications menu to return to either SAS Drive or Model Studio.

Lesson 07


Machine Learning Using SAS® Viya®
Lesson 07, Section 0 Demo: Adding Open Source Models to a Model Studio Project

In this demonstration, we use the Open Source Code node to execute Python and R code in Model Studio. We create forest models in R and Python and compare them with the models that we built earlier in Model Studio.

Note: This is the task shown in the demonstration video. However, keep in mind that SAS Viya uses distributed processing, so the values in results will vary slightly across runs.

  1. In the Demo project, select the Pipelines tab.

  2. Click the plus sign next to the Support Vector Machine pipeline tab to create a new pipeline.

  3. In the New Pipeline window, perform the following steps:

    1. In the Name field, enter Open Source.

    2. Under Select a pipeline template, ensure that Blank Template is selected.

  4. Click OK.

  5. Add an Imputation node (from the Data Mining Preprocessing group) after the Data node. Leave the settings of the Imputation node at the defaults. Many open-source packages handle missing values differently than SAS does.

    Note: Python and R packages might not support missing values in data. Some open-source packages even return an error if the data include missing values. It is your responsibility to prepare the data as necessary for these packages. It is highly recommended that you add an Imputation node before the Open Source Code node to handle missing values. If the training data do not include missing values but the validation or test data do, consider enabling the Impute non-missing variables property in the Imputation node. However, for this demo, do not select this property.

  6. Add an Open Source Code node (from the Miscellaneous group) after the Imputation node.

  7. Right-click the Open Source Code node and rename the node Python Forest. Notice that no Model Comparison node was automatically added at this point.

  8. In the properties panel of the Open Source Code node (now renamed Python Forest), verify that Language is set to Python.

  9. Expand the Data Sample properties and clear the check box for Include SAS formats. Note: This property controls whether the data that is downloaded to the Python or R software keeps its SAS formats.

  10. Click Open Code Editor and perform the following steps:

    1. Click the Load source code file icon.

    2. In the Open window, navigate to the Python program file (Python_Forest.py) in the course data folder (D:\Workshop\winsas\CPML), and click Open. This Python code fits a random forest classifier model. (A sketch of the kind of code such a file might contain appears after these steps.)

      Note: If you work with a program file that has an extension other than .PY, you can copy and paste the code from a text editor.

    3. Click Save.

    4. Close the code editor.

  11. Run the Python Forest node.

  12. Add another Open Source Code node after the Imputation node. Notice that, once again, no Model Comparison node was added automatically.

  13. Right-click the Open Source Code node and rename it R Forest.

  14. In the properties panel of the Open Source Code node (now renamed R Forest), perform the following steps:

    1. Set Language to R.

    2. Clear the check box for Include SAS formats.

  15. Click Open Code Editor and perform the following steps:

    1. Click the Load source code file icon.

    2. In the Open window, navigate to the R program file (R_Forest.r) in the course data folder (D:\Workshop\winsas\CPML), and click Open. This R code fits a random forest classifier model.

    3. Click Save.

    4. Close the code editor.

  16. Run the R Forest node.

  17. Open the results of the R Forest node.

  18. Maximize the R Code window. Notice that additional code appears above the open-source code that we added earlier. Model Studio adds this precursor code when the node is run. The code varies depending on the selected properties. The original open-source code appears below the USER CODE comment.

  19. Close the R Code window.

  20. Notice that the Results window does not have an Assessment tab. Model Studio does not yet recognize that the Open Source Code node is building a model, because the node is still categorized as a Miscellaneous node. To generate model assessment results, we need to move both Open Source Code nodes to the Supervised Learning group.

  21. Close the Results window.

  22. Right-click the R Forest node and select Move > Supervised Learning. Notice the following changes after the node moves to the Supervised Learning group:
    • The node color changes to the color of Supervised Learning nodes.
    • The node needs to be rerun.
    • A Model Comparison node is automatically added to the pipeline, connected to the R Forest node.

  23. Repeat the previous step to move the Python Forest node to the Supervised Learning group. Notice that the Python Forest node is also connected to the Model Comparison node.

  24. Run the Model Comparison node. This reruns the R Forest and Python Forest nodes as well.

  25. Open the results of the R Forest node. Click the Assessment tab.

  26. Close the results of the R Forest node.

  27. Open the results of the Model Comparison node. The champion model for this pipeline is listed at the top.

  28. Close the results of the Model Comparison node.

  29. To compare the two open-source models with the models that we fit in Model Studio, click the Pipeline Comparison tab. The champion model across pipelines is listed at the top.
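
The Python_Forest.py and R_Forest.r programs ship with the course data, so you do not write them yourself. For orientation only, here is a minimal, self-contained Python sketch of the kind of model such a file fits. Inside an actual Open Source Code node, the training data, input list, target name, and scored output are exchanged through the node's predefined handoff variables rather than the stand-ins used here, so treat the names below (train_df, inputs, P_churn1, and so on) as hypothetical and check them against the shipped program and the node documentation.

  import pandas as pd
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier

  # Synthetic stand-in for the imputed inputs and the binary churn target.
  X, y = make_classification(n_samples=3000, n_features=12, random_state=1)
  inputs = [f"x{i}" for i in range(12)]
  train_df = pd.DataFrame(X, columns=inputs)
  train_df["churn"] = y

  # Fit a random forest classifier, as the shipped programs do.
  rf = RandomForestClassifier(n_estimators=100, random_state=12345)
  rf.fit(train_df[inputs], train_df["churn"])

  # Produce scored output: Model Studio assesses the model from a table of
  # predicted probabilities returned by the node's user code.
  scored = pd.DataFrame({
      "P_churn1": rf.predict_proba(train_df[inputs])[:, 1],
      "P_churn0": rf.predict_proba(train_df[inputs])[:, 0],
  })
  print(scored.head())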