Lesson 02

Machine Learning Using SAS® Viya®
Lesson 02, Section 5 Best Practices for Common Data Preprocessing Challenges

Data preprocessing covers a range of processes that differ for raw, structured, and unstructured data (from one or multiple sources). It focuses on improving the quality and completeness of the data, standardizing how it is defined and structured, collecting and consolidating it, and transforming it to make it useful, particularly for machine learning analysis. Which preparation processes you choose, and the order in which you perform them, can depend on your purpose, your data expertise, how you plan to interact with the data, and what types of questions you want to answer.


The summary below lists some challenges that you might encounter in preparing your data, along with suggestions for how to handle each challenge by using the Data Mining Preprocessing pipeline nodes in Model Studio. A short code sketch after the summary illustrates a few of these practices in simplified form.

Common Data Preprocessing Challenges and Recommendations for Handling Them
Data collection
  Common challenges:
    • Biased data
    • Incomplete data
    • High-dimensional data
    • Sparsity
  Suggested best practices:
    • Take time to understand the business problem and its context
    • Enrich the data
    • Dimension reduction (Feature and Variable Selection nodes)
    • Change representation of data (Transformations node)

"Untidy" data
  Common challenges:
    • Value ranges as columns
    • Multiple variables in the same column
    • Variables in both rows and columns
  Suggested best practice:
    • Transform the data with SAS code (Code node)

Outliers
  Common challenge:
    • Out-of-range numeric values and unknown categorical values in score data
  Suggested best practices:
    • Discretization (Transformations node)
    • Winsorizing (Imputation node)

Sparse target variables
  Common challenges:
    • Low primary event occurrence rate
    • Overwhelming preponderance of zero or missing values in the target
  Suggested best practice:
    • Proportional oversampling

Variables of disparate magnitudes
  Common challenges:
    • Misleading variable importance
    • Distance measure imbalance
    • Gradient dominance
  Suggested best practice:
    • Standardization (Transformations node)

High-cardinality variables
  Common challenges:
    • Overfitting
    • Unknown categorical values in holdout data
  Suggested best practices:
    • Binning (Transformations node)
    • Replacement of values (Replacement node)

Missing data
  Common challenges:
    • Information loss
    • Bias
  Suggested best practices:
    • Binning (Transformations node)
    • Imputation (Imputation node)

Strong multicollinearity
  Common challenge:
    • Unstable parameter estimates
  Suggested best practice:
    • Dimension reduction (Feature Extraction, Variable Clustering, and Variable Selection nodes)
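
Model Studio applies these practices through the pipeline nodes named above. For readers who want to see the underlying ideas in code, the following Python sketch works through several of them (winsorizing, standardization, binning, imputation, and proportional oversampling) on a small synthetic pandas DataFrame. The column names, percentile cutoffs, and oversampling rate are illustrative assumptions, not part of the course data or the Model Studio defaults.

  import numpy as np
  import pandas as pd

  # Hypothetical data standing in for a Model Studio data source.
  rng = np.random.default_rng(0)
  df = pd.DataFrame({
      "income": rng.lognormal(mean=10, sigma=1, size=1000),
      "region": rng.choice(["N", "S", "E", None], size=1000),
      "target": rng.binomial(1, 0.05, size=1000),
  })

  # Winsorizing: cap extreme numeric values at chosen percentiles.
  lo, hi = df["income"].quantile([0.01, 0.99])
  df["income_wins"] = df["income"].clip(lower=lo, upper=hi)

  # Standardization: rescale so that variables of disparate magnitudes
  # contribute comparably to distance measures and gradients.
  df["income_std"] = (df["income_wins"] - df["income_wins"].mean()) / df["income_wins"].std()

  # Binning (discretization): replace raw values with quantile bins.
  df["income_bin"] = pd.qcut(df["income_wins"], q=5, labels=False)

  # Imputation: fill missing categorical values with the most frequent level.
  df["region"] = df["region"].fillna(df["region"].mode()[0])

  # Proportional oversampling: replicate rare-event rows to raise the
  # primary event rate in the training data.
  events = df[df["target"] == 1]
  df_over = pd.concat([df, events.sample(frac=4, replace=True, random_state=0)],
                      ignore_index=True)
  print(df["target"].mean(), df_over["target"].mean())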

Note: Some of these challenges can also be handled later, in the modeling stage. For example, tree-based methods can handle missing data automatically.


Lesson 03

Machine Learning Using SAS® Viya®
Lesson 03, Section 4 (Optional) Practice: Build a Decision Tree Model Using Autotuning

Build a decision tree model using the Autotune feature in Model Studio.

Note: This practice is optional.

Note: SAS Viya uses distributed processing, so the values in your results will vary slightly across runs.

  1. In the Tree Based pipeline that you created in an earlier demo, add a Decision Tree node below the Variable Selection node. Change the name of the new Decision Tree node to Auto Decision Tree. In the properties panel for the Auto Decision Tree node, turn on the Autotune feature. Use the default settings. Run the Auto Decision Tree node.

    Solution:

    1. In the Tree Based pipeline, right-click the Variable Selection node and select Add child node > Supervised Learning > Decision Tree.

    2. Right-click the new Decision Tree node, which should be called Decision Tree (1), and select Rename. Change the name to Auto Decision Tree.

    3. In the properties pane for the Auto Decision Tree node, turn on the Perform Autotuning option. The default settings show the starting values and ranges that are tried for each property in the auto decision tree model.

    4. Right-click the Auto Decision Tree node and select Run. This process might take a few minutes.

  2. In the results, look at the following assessment measures: KS and average squared error. How does the performance of the autotuned tree compare to the performance of the decision tree already in the pipeline (built in an earlier demo)?

    Solution:

    1. When the run is completed, right-click the Auto Decision Tree node and select Results.

    2. In the Results window, click the Assessment tab.

    3. Scroll down to the Fit Statistics table, and maximize it. For the autotuned model on the VALIDATE partition, write down the values of average squared error and KS.

    4. Restore the Fit Statistics table, and close the results.

    Both measures show a slight improvement in model performance for the autotuned tree as compared to the tree already in the pipeline. For the autotuned tree, the average squared error decreased and the KS increased.
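
Conceptually, the Autotune option turned on in this practice searches over candidate hyperparameter settings and evaluates each candidate on held-out data. Model Studio does this automatically with its own search method and default ranges; the Python sketch below is only a simplified analogue that uses a scikit-learn grid search on synthetic data and then computes the two assessment measures from this practice (average squared error and KS) on a validation partition. The data set, parameter grid, and grid-search strategy are illustrative assumptions, not the Autotune defaults.

  from scipy.stats import ks_2samp
  from sklearn.datasets import make_classification
  from sklearn.metrics import mean_squared_error
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.tree import DecisionTreeClassifier

  # Synthetic stand-in for the project data (the course data set is not reproduced here).
  X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=1)

  # Hypothetical ranges standing in for the tuned tree properties (depth, leaf size).
  param_grid = {"max_depth": [4, 6, 8, 10, 12], "min_samples_leaf": [1, 5, 10, 25]}
  search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
  search.fit(X_train, y_train)

  # Assess the tuned tree on the validation partition: average squared error of the
  # predicted probabilities, and KS (maximum separation between the cumulative
  # distributions of event and non-event scores).
  p_valid = search.predict_proba(X_valid)[:, 1]
  ase = mean_squared_error(y_valid, p_valid)
  ks = ks_2samp(p_valid[y_valid == 1], p_valid[y_valid == 0]).statistic
  print(search.best_params_, round(ase, 4), round(ks, 4))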

Machine Learning Using SAS® Viya®
Lesson 03, Section 5 (Optional) Practice: Build a Gradient Boosting Model Using Autotuning

Build a gradient boosting model using the Autotune feature in Model Studio.

Note: This practice is optional.

Note: SAS Viya uses distributed processing, so the values in your results will vary slightly across runs.

  1. In the Demo project, in the Tree Based pipeline that you created in an earlier demo, add a Gradient Boosting node after the Variable Selection node. Change the name of the new Gradient Boosting node to Auto Gradient Boosting. Turn on the Autotune feature. Use the default settings. Run the Auto Gradient Boosting node.

    Solution:

    1. In the Tree Based pipeline, right-click the Variable Selection node and select Add child node > Supervised Learning > Gradient Boosting.

    2. Right-click the new Gradient Boosting node, which should be called Gradient Boosting (1), and select Rename. Change the name to Auto Gradient Boosting.

    3. In the properties pane for the Auto Gradient Boosting node, turn on the Perform Autotuning option. The default settings show the starting values and ranges that are tried for each property in the auto gradient boosting model.

    4. Right-click the Auto Gradient Boosting node and select Run. This process might take a few minutes.

  2. In the results, look at the following assessment measures: KS and average squared error. How does the performance of the autotuned model compare to the performance of the gradient boosting model already in the pipeline (built in an earlier demo)?

    Solution:

    1. When the run is completed, right-click the Auto Gradient Boosting node and select Results.

    2. In the Results window, click the Assessment tab.

    3. Scroll down to the Fit Statistics table, and maximize it. For the autotuned model on the VALIDATE partition, write down the average squared error and the KS.

    4. Restore the Fit Statistics table, and close the results.

    The performance of the autotuned gradient boosting model is nearly the same as that of the gradient boosting model already in the pipeline. The autotuned model performed slightly worse on average squared error but slightly better on KS.
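
As with the decision tree, autotuning a gradient boosting model searches over its hyperparameters, such as the number of trees, the learning rate, and the subsample rate. The sketch below illustrates the idea with a randomized search in scikit-learn on synthetic data; the distributions, iteration count, and search strategy are illustrative assumptions, not the Autotune defaults.

  from scipy.stats import randint, uniform
  from sklearn.datasets import make_classification
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import RandomizedSearchCV, train_test_split

  # Synthetic stand-in for the project data.
  X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=1)

  # Hypothetical distributions for the boosting hyperparameters: number of trees,
  # learning rate, subsample rate, and tree depth.
  param_dist = {
      "n_estimators": randint(50, 300),
      "learning_rate": uniform(0.01, 0.3),
      "subsample": uniform(0.5, 0.5),
      "max_depth": randint(2, 6),
  }
  search = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                              param_dist, n_iter=20, cv=3, random_state=1)
  search.fit(X_train, y_train)
  print(search.best_params_, search.score(X_valid, y_valid))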

Machine Learning Using SAS® Viya®
Lesson 03, Section 5 (Optional) Practice: Build a Forest Model Using Autotuning

Build a forest model using the Autotune feature in Model Studio.

Note: This practice is optional.

Note: SAS Viya uses distributed processing, so the values in your results will vary slightly across runs.

  1. In the Demo project, in the Tree Based pipeline that you created in an earlier demo, add a Forest node below the Variable Selection node. Change the name of the new Forest node to Auto Forest. Turn on the Autotune feature. Use the default settings. Run the Auto Forest node.

    Solution:

    1. In the Tree Based pipeline, right-click the Variable Selection node and select Add child node > Supervised Learning > Forest.

    2. Right-click the new Forest node, which should be called Forest (1), and select Rename. Change the name to Auto Forest.

    3. In the properties pane for the Auto Forest node, turn on the Perform Autotuning option. The default settings show the starting values and ranges that are tried for each property in the auto forest model.

    4. Right-click the Auto Forest node and select Run. This process might take a few minutes.

  2. In the results, look at the following assessment measures: KS and average squared error. How does the performance of the autotuned forest compare to the performance of the forest model already in the pipeline (built in an earlier demo)?

    Solution:

    1. When the run is completed, right-click the Auto Forest node and select Results.

    2. In the Results window, click the Assessment tab.

    3. Scroll down to the Fit Statistics table, and maximize it. For the autotuned model on the VALIDATE partition, write down the average squared error and the KS. Note: Remember that your results might vary from what is shown in the demo that corresponds to this practice.

    4. Restore the Fit Statistics table, and close the results.
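
For a forest, the hyperparameters that autotuning explores include settings such as the number of trees and the number of inputs considered at each split. The sketch below illustrates the same idea with a scikit-learn random forest and a small grid search on synthetic data; the grid values and search strategy are illustrative assumptions, not the Autotune defaults.

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import GridSearchCV, train_test_split

  # Synthetic stand-in for the project data.
  X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=1)

  # Hypothetical grid: number of trees and the fraction of inputs tried at each split.
  param_grid = {"n_estimators": [100, 200, 500], "max_features": ["sqrt", 0.25, 0.5]}
  search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
  search.fit(X_train, y_train)
  print(search.best_params_, search.score(X_valid, y_valid))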

Lesson 04

Machine Learning Using SAS® Viya®
Lesson 04, Section 3 (Optional) Practice: Build a Neural Network Model Using Autotuning

Build a neural network model using the Autotune feature in Model Studio.

Note: This practice is optional.

Note: SAS Viya uses distributed processing, so the values in your results will vary slightly across runs.

  1. In the Demo project, in the Neural Network pipeline that you created in an earlier demo, add a Neural Network node below the Variable Selection node. Change the name of the new Neural Network node to Auto Neural Network. Turn on the Autotune feature. Use the default settings. Run the Auto Neural Network node.

    Solution:

    1. In the Neural Network pipeline, right-click the Variable Selection node and select Add child node > Supervised Learning > Neural Network.

    2. Right-click the new Neural Network node, which should be called Neural Network (1), and select Rename. Change the name to Auto Neural Network.

    3. In the properties pane for the Auto Neural Network node, turn on the Perform Autotuning option. The default settings show the starting values and ranges that are tried for each property in the auto neural network model.

    4. Right-click the Auto Neural Network node and select Run. This process might take a few minutes.

  2. In the results, look at the following assessment measures: KS and average squared error. How does the performance of the autotuned neural network compare to the performance of the neural network model already in the pipeline (built in an earlier demo)?

    Solution:

    1. When the run is completed, right-click the Auto Neural Network node and select Results.

    2. In the Results window, click the Assessment tab.

    3. Scroll down to the Fit Statistics table, and maximize it. For the autotuned model on the VALIDATE partition, write down the average squared error and the KS.

    4. Restore the Fit Statistics table, and close the results.

    Both measures show a slight improvement in model performance for the autotuned neural network compared to the neural network already in the pipeline: average squared error decreased and KS increased.
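
For a neural network, autotuning explores settings such as the number of hidden units and the amount of regularization. Because neural networks are sensitive to the scale of their inputs, the sketch below standardizes the inputs inside the search pipeline before tuning a small multilayer perceptron in scikit-learn; the data, ranges, and search strategy are illustrative assumptions, not the Autotune defaults.

  from sklearn.datasets import make_classification
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.neural_network import MLPClassifier
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  # Synthetic stand-in for the project data.
  X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=1)

  # Standardize inputs, then fit the network; tune hidden units and L2 penalty.
  pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=1))
  param_grid = {
      "mlpclassifier__hidden_layer_sizes": [(10,), (25,), (50,)],
      "mlpclassifier__alpha": [1e-4, 1e-3, 1e-2],
  }
  search = GridSearchCV(pipe, param_grid, cv=3)
  search.fit(X_train, y_train)
  print(search.best_params_, search.score(X_valid, y_valid))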

Lesson 05

Machine Learning Using SAS® Viya®
Lesson 05, Section 3 (Optional) Practice: Build a Support Vector Machine Model Using Autotuning

Build a support vector machine model using the Autotune feature in Model Studio.

Note: This practice is optional.

Note: SAS Viya uses distributed processing, so the values in your results will vary slightly across runs.

  1. In the Demo project, in the Support Vector Machine pipeline that you created in an earlier demo, add a Support Vector Machine (SVM) node below the Variable Selection node. Change the name of the new SVM node to Auto SVM. Turn on the Autotune feature and set the Polynomial Degree maximum range to 2. Run the Auto SVM node.

    Solution:

    1. In the Support Vector Machine pipeline, right-click the Variable Selection node and select Add child node > Supervised Learning > SVM.

    2. Right-click the new SVM node, which should be called SVM (1), and select Rename. Change the name to Auto SVM.

    3. In the properties pane for the Auto SVM node, turn on the Perform Autotuning option. The default settings show the starting values and ranges that are tried for each property in the auto SVM model.

    4. Under Polynomial Degree, change the upper end of the range by changing the To value from 3 to 2.

    5. Right-click the Auto SVM node and select Run. This process might take a few minutes.

  2. In the results, look at the following assessment measures: KS and average squared error. How does the performance of the autotuned support vector machine model compare to the performance of the support vector machine model already in the pipeline (built in an earlier demo)?

    Solution:

    1. When the run is completed, right-click the Auto SVM node and select Results.

    2. In the Results window, click the Assessment tab.

    3. Scroll down to the Fit Statistics table, and maximize it. For the autotuned model on the VALIDATE partition, write down the average squared error and the KS.

    4. Restore the Fit Statistics table, and close the results.

    Both measures show a decrease in model performance for the autotuned SVM compared to the SVM already in the pipeline: average squared error increased considerably and KS decreased.
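
For a support vector machine with a polynomial kernel, the polynomial degree is one of the tuned hyperparameters, and this practice capped its range at 2. The sketch below mirrors that cap in a scikit-learn grid search on synthetic data, with standardized inputs; the penalty values, data, and search strategy are illustrative assumptions, not the Autotune defaults.

  from sklearn.datasets import make_classification
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC

  # Synthetic stand-in for the project data.
  X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9, 0.1], random_state=1)
  X_train, X_valid, y_train, y_valid = train_test_split(
      X, y, test_size=0.3, stratify=y, random_state=1)

  # Polynomial-kernel SVM with the degree capped at 2, as in this practice.
  pipe = make_pipeline(StandardScaler(), SVC(kernel="poly"))
  param_grid = {"svc__degree": [1, 2], "svc__C": [0.1, 1.0, 10.0]}
  search = GridSearchCV(pipe, param_grid, cv=3)
  search.fit(X_train, y_train)
  print(search.best_params_, search.score(X_valid, y_valid))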