## Deep Learning Using SAS® Software

Lesson 01, Section 2 Demo: Loading and Modeling Data with Traditional Neural Network Methods

Note: The paths in the code might differ from what is seen in this video.

In this demonstration, we use conventional neural network modeling methods to model the target marketing data for a financial institution (the **develop** data set). The neural network will have seven hidden layers but will fail to sufficiently discriminate between the target events and the nonevents in the data.

**Note:** Instructions for accessing SAS Studio in the virtual lab and navigating to the location of the course program files are provided in the Course Overview.

- Open the program named **DLUS01D01.sas**. Examine the two LIBNAME statements and the three DATA steps.

```sas
/********************************/
/*Create Local and CAS Libraries*/
/********************************/
libname local "/workshop/winsas/LWDLUS/Data";
libname mycas cas;

/*****************************************************/
/* Load data sets from the local machine into memory.*/
/*****************************************************/
data mycas.train_develop;
   set local.train_develop;
run;

data mycas.valid_develop;
   set local.valid_develop;
run;

data mycas.test_develop;
   set local.test_develop;
run;
```

This program begins by creating a local library and a caslib using two LIBNAME statements. CAS stands for Cloud Analytic Services. The caslib enables us to move data into memory and process data on the CAS server. The two LIBNAME statements are described below:

- The first LIBNAME statement creates the local library by specifying the location of the course data.
- The second LIBNAME statement creates a caslib, just as a library is created in traditional SAS programming. The statement specifies the name of the caslib to be created (in this case, **mycas**) and then the keyword CAS.

In each of the three DATA steps, the library in the DATA statement is named **mycas** and the library in the SET statement is named **local**. This enables us to move data from the **local** directory to the CAS server.

- Run the two LIBNAME statements and the three DATA steps. After the code runs and the data set is loaded into memory, click the **Output Data** tab to view the data.

You can see that the validation data set has 49 columns and 9679 rows.

- Run the PROC FREQ step, which enables us to explore the binary outcome distribution of the target variable, **ins**, in the validation data.

```sas
/*******************************************************************/
/* Examine the target outcome (INS) in the Valid_Develop data      */
/*******************************************************************/
proc freq data=mycas.Valid_develop;
   table ins;
run;
```

In the results, notice that the event rate is 34.63%. This means that 34.63% of customers purchase the variable annuity.

- Examine the PROC NNET step, which is used to train a traditional neural network: a seven hidden-layer, feed-forward neural network.

```sas
/*********************************************************/
/* Run a traditional neural network with 7 hidden layers */
/*********************************************************/
proc nnet data=MYCAS.Train_DEVELOP standardize=std;
   target Ins / level=nominal;
   input AcctAge DDABal CashBk Checks NSFAmt Phone Teller SavBal ATMAmt
         POS POSAmt CDBal IRABal LOCBal ILSBal MMBal MMCred MTGBal
         CCBal CCPurc Income LORes HMVal Age CRScore Dep DepAmt
         InvBal / level=interval;
   input DDA DirDep NSF Sav ATM CD IRA LOC ILS MM MTG CC SDB HMOwn
         Moved InArea Inv / level=nominal;
   hidden 30;
   hidden 20;
   hidden 10;
   hidden 5;
   hidden 10;
   hidden 20;
   hidden 30;
   train outmodel=mycas._Nnet_model_ validation=MYCAS.Valid_DEVELOP
         numtries=1 seed=12345 stagnation=15;
   optimization algorithm=LBFGS regL1=0.003 regL2=0.002 seed=12345
                maxiter=50;
run;
```

The PROC NNET statement specifies DATA=mycas.train_develop (the training data that was loaded into CAS).

The STANDARDIZE= option standardizes the input variables.

The TARGET statement specifies the target variable **ins** and sets the level to nominal because the target is binary.

The two INPUT statements group the input variables by their measurement level using the LEVEL= option. All of the input variables specified in the first INPUT statement are interval input variables, and all of the inputs in the second INPUT statement are nominal.

All the HIDDEN statements are specified after the TARGET and INPUT statements. Each HIDDEN statement specifies the number of mathematical neurons or hidden units that are associated with a particular hidden layer. This neural network has seven hidden layers. The first hidden layer has 30 mathematical neurons, the second hidden layer has 20, the third has 10, and the fourth has 5 neurons. The fifth, sixth, and seventh hidden layers expand back out to 10, 20, and 30 neurons, respectively. This creates an hourglass shape.

The TRAIN and OPTIMIZATION statements provide additional training options for the model. The TRAIN statement uses the options described below:

- The OUTMODEL= option saves the model generated by the optimization of the neural network. We could use this table to score new data in the future.
- The VALIDATION= option specifies the validation data. This enables us to view the misclassification of the model on data not used to train the model.
- The NUMTRIES= option is set to 1, which optimizes the model with only a single starting position for the objective function. Specifying a number larger than 1 can help ensure that we truly minimize the objective function. For simplicity, in this case, it is kept at 1.
- The SEED= option is used to specify a seed in order to reproduce results.
- The STAGNATION= option is set to end the training routine in the event that the objective function fails to change after 15 iterations.
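
The stagnation rule described in the last bullet can be sketched in Python. This is only an illustration of the idea, not the PROC NNET implementation: the function name `stopping_iteration`, the tolerance `tol`, and the objective values are assumptions made for the example.

```python
def stopping_iteration(objective_values, stagnation=15, tol=0.0):
    """Return the 1-based iteration at which training would stop.

    Training stops once the best objective value seen so far has not
    improved by more than `tol` for `stagnation` consecutive iterations.
    If that never happens, the last iteration is returned.
    """
    best = float("inf")
    since_improvement = 0
    for iteration, value in enumerate(objective_values, start=1):
        if value < best - tol:
            best = value
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= stagnation:
            return iteration
    return len(objective_values)


# Hypothetical objective values: three improving iterations, then a plateau.
values = [0.70, 0.66, 0.65] + [0.65] * 47
print(stopping_iteration(values, stagnation=15))   # 18
```

With these made-up values, the last improvement happens at iteration 3, so a stagnation setting of 15 ends training at iteration 18, well before the 50-iteration maximum.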

The OPTIMIZATION statement sets the algorithm to LBFGS, using both L1 and L2 regularization, and trains the model for 50 maximum iterations.

**Note:** The *limited-memory BFGS* (*L-BFGS*) algorithm is used to adjust the network's parameters. Like the original BFGS, L-BFGS uses an estimate of the inverse Hessian to steer the search. But whereas BFGS stores an *n*-by-*n* approximation to the Hessian (where *n* is the number of variables), the L-BFGS variant stores only a few vectors that represent the approximation implicitly.
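
To get a feel for the storage difference, the following Python sketch compares the two approaches. It is a back-of-the-envelope calculation under stated assumptions: *n* is set to 3590 (the number of weight parameters that PROC NNET reports for this network, ignoring biases), and the history length of 10 vector pairs is a hypothetical value, not a documented PROC NNET setting.

```python
def bfgs_storage(n):
    """Number of values in a dense n-by-n Hessian approximation."""
    return n * n


def lbfgs_storage(n, m):
    """Number of values in m stored vector pairs, each vector of length n."""
    return 2 * m * n


n_weights = 3590   # weight parameters reported for this network
history = 10       # assumed number of stored vector pairs (hypothetical)

print(bfgs_storage(n_weights))             # 12888100
print(lbfgs_storage(n_weights, history))   # 71800
```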

- Run the PROC NNET step and look at the results.

The Model Information table shows that this neural network was trained on 19358 observations and has 189 total nodes and 3590 weight parameters.

The Iteration History table contains the objective function and also the validation and fit error values for each iteration in the optimization routine. The validation error represents the validation misclassification rate because the target is binary. Notice that the model stagnated after only 18 iterations and the final validation misclassification is 34.63%. Recall that this is the prior event rate for the validation data. This means that the model is failing to discriminate between 0s and 1s in the data. Essentially, the model predicts that all observations are 0s or simply nonevents. So even though this model has multiple hidden layers and almost 4000 weight parameters, it is failing to learn the data. We'll revisit this data after you learn about the deep learning methodology that you need to better fit the data.
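
The reason a misclassification rate equal to the event rate signals a useless model can be checked with a few lines of Python. This sketch uses a toy target (35 events in 100 observations), not the course data, and the function name is chosen for illustration.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted label differs from the actual one."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Toy target with 35 events (1) and 65 nonevents (0); not the course data.
actual = [1] * 35 + [0] * 65
all_nonevents = [0] * len(actual)   # a "model" that always predicts 0

event_rate = sum(actual) / len(actual)
print(event_rate)                                      # 0.35
print(misclassification_rate(actual, all_nonevents))   # 0.35
```

A model that predicts nonevent for every observation is wrong exactly on the events, so its misclassification rate equals the event rate.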

## Deep Learning Using SAS® Software

Lesson 01, Section 2 Demo: Building and Training a Deep Learning Neural Network Using CASL Code

In the previous demonstration, we built a neural network that had seven hidden layers. However, the neural network failed to discriminate between events and nonevents. In this demonstration, we build a neural network with the same structure, but we incorporate deep learning methods to obtain a better model.

**Note:** If you did not perform the steps of the previous demonstration in your current SAS Studio session, run the program **DLUS01D01.sas** to load the data into memory before you continue.

- Open the program **DLUS01D02.sas** and examine the first four PROC CAS steps.

```sas
/********************************/
/*Load the Deep Learn Action Set*/
/********************************/
proc cas;
   loadactionset 'DeepLearn';
run;

/*******************************/
/*Build the Deep Learning Model*/
/*******************************/
proc cas;
   BuildModel / modeltable={name='DLNN', replace=1} type = 'DNN';
run;

Proc Cas;
   AddLayer / model='DLNN' name='data'
      layer={type='input' STD='STD' dropout=.05};
   AddLayer / model='DLNN' name='HLayer1' srcLayers={'data'}
      layer={type='FULLCONNECT' n=30 act='ELU' init='xavier' dropout=.05};
   AddLayer / model='DLNN' name='HLayer2' srcLayers={'HLayer1'}
      layer={type='FULLCONNECT' n=20 act='RELU' init='MSRA' dropout=.05};
   AddLayer / model='DLNN' name='HLayer3' srcLayers={'HLayer2'}
      layer={type='FULLCONNECT' n=10 act='RELU' init='MSRA' dropout=.05};
   AddLayer / model='DLNN' name='HLayer4' srcLayers={'HLayer3'}
      layer={type='FULLCONNECT' n=5 act='RELU' init='MSRA' dropout=.05};
   AddLayer / model='DLNN' name='HLayer5' srcLayers={'HLayer4'}
      layer={type='FULLCONNECT' n=10 act='RELU' init='MSRA' dropout=.05};
   AddLayer / model='DLNN' name='HLayer6' srcLayers={'HLayer5'}
      layer={type='FULLCONNECT' n=20 act='RELU' init='MSRA' dropout=.05};
   AddLayer / model='DLNN' name='HLayer7' srcLayers={'HLayer6'}
      layer={type='FULLCONNECT' n=30 act='RELU' init='MSRA' dropout=.05};
   AddLayer / model='DLNN' name='outlayer'
      layer={type='output' act='Softmax'} srcLayers={"HLayer7"};
quit;

/*****************************/
/*Fit the Deep Learning Model*/
/*****************************/
proc cas;
   dlTrain / model='DLNN' table='Train_Develop' ValidTable='Valid_Develop'
      modelWeights={name='DeepTrainedWeights_d', replace=1}
      bestweights={name='bestdeepweights', replace=1}
      target='INS'
      inputs={'AcctAge','DDABal','CashBk','Checks','NSFAmt','Phone','Teller',
              'SavBal','ATMAmt','POS','POSAmt','CDBal','IRABal','LOCBal',
              'ILSBal','MMBal','MMCred','MTGBal','CCBal','CCPurc','Income',
              'LORes','HMVal','Age','CRScore','Dep','DepAmt','InvBal','DDA',
              'DirDep','NSF','Sav','ATM','CD','IRA','LOC','ILS','MM','MTG',
              'CC','SDB','HMOwn','Moved','InArea','Inv'}
      nominals={'INS','DDA','DirDep','NSF','Sav','ATM','CD','IRA','LOC','ILS',
                'MM','MTG','CC','SDB','HMOwn','Moved','InArea', 'Inv'}
      optimizer={minibatchsize=60, regL1=0.003, regL2=0.002, maxepochs=50,
                 algorithm={method='ADAM', lrpolicy='Step', gamma=0.5,
                            stepsize=10, beta1=0.9, beta2=0.999,
                            learningrate=.001}}
      seed=12345;
quit;
```

This program uses CASL code, or simply the CAS language. The CAS procedure invokes CASL and ends like any other SAS procedure, with either a RUN or QUIT statement. For example, the first PROC CAS step loads the deepLearn action set using the LOADACTIONSET statement. This effectively loads the deep learning methodology tools (specifically, the deepLearn action set) into memory so that we can use deep learning actions on our data in CAS.

To initialize the deep learning model, the second CAS procedure specifies the buildModel action and all its arguments after the forward slash. The name and replace arguments are nested in the modeltable argument, and the type argument is specified separately:

- The name argument specifies the reference name of the deep learning model that we are creating: in this case, DLNN for deep learning neural network.

- The replace=1 argument tells SAS to replace the model in CAS if it was already created.

- The type argument specifies the type of the deep learning model. The type can be DNN for a deep feed forward neural network, RNN for a recurrent neural network, or CNN for a convolutional neural network. Given the data we are using, the specified type is DNN.

Now that the model shell is built, it is time to populate it with the layers for the neural network structure. The third PROC CAS step has a sequence of addLayer actions used to iteratively add nine layers to the neural network. The layers are described below:

- The first addLayer action specifies the model name so that the layer is added to the appropriate neural network. This first layer is named **data**. The layer argument specifies the options for the layer. In this case, type='INPUT' because this is the input layer. The STD option is used to standardize the inputs. The dropout option applies dropout with a rate of 5%.

- The next addLayer action adds another layer to the same model shell. This is a hidden layer named **HLayer1**, the first of seven hidden layers. In the layer properties, the type is fully connected, the number of neurons is set to 30, and the exponential linear unit (ELU) is the activation function. Notice that ELU is used for the first hidden layer and RELU for the rest of the hidden layers. RELU tends to saturate rather quickly when it is connected to the input layer (communication with L. Lewis, SAS 2017). The Xavier method is used to randomly initialize the weights. A dropout rate of 5% is applied to this first hidden layer. The remaining option is the source layers (srcLayers) argument, which indicates which layer feeds information into this layer. Here, the information comes from the data layer. When you build larger, more complex neural network structures, it is important to keep track of the information flow in the network by accurately specifying the source layers for each hidden layer.

- The process of iteratively adding hidden layers to the network continues. Each subsequent layer is a fully connected layer, with RELU activation, MSRA initialization, and a 5% dropout rate. The name argument is used to specify a unique name for each hidden layer. Note that the hidden layers still have an hourglass shape regarding the number of neurons, just as in the previous demonstration. Also, this is a sequential neural network so the previous hidden layer feeds into the next hidden layer throughout the network.

- The final layer is of the type output and uses softmax activation because the target is nominal (binary, in this case).
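
The activation functions and weight initialization methods named in these layers can be written out in a few lines of Python. This is a sketch of the textbook definitions, not of the deepLearn action set internals: the `alpha=1.0` default for ELU, the standard-deviation formulas for Xavier and MSRA, and the layer widths passed to them are assumptions for illustration.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(values):
    """Convert a list of scores into probabilities that sum to 1."""
    shift = max(values)   # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def xavier_std(fan_in, fan_out):
    """Standard deviation commonly used for Xavier (Glorot) initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def msra_std(fan_in):
    """Standard deviation commonly used for MSRA (He) initialization."""
    return math.sqrt(2.0 / fan_in)


print(relu(-2.0), elu(-2.0))   # ReLU is exactly 0; ELU stays slightly negative
print(softmax([0.0, 0.0]))     # equal scores give [0.5, 0.5]
print(xavier_std(30, 20), msra_std(30))   # hypothetical layer widths
```

For negative inputs, ReLU outputs zero while ELU keeps a small negative value, which is one way to see why the two behave differently in a layer fed directly by standardized inputs.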

After the neural network is constructed, the fourth PROC CAS step trains the model using the dlTrain action. The arguments in this step include the following:

- The model argument specifies the name of the neural network model to be trained: DLNN, in this case. The name is specified in quotation marks.

- The table and ValidTable arguments specify the training and validation data sets, respectively. The table argument is populated with the name of the training data set that was loaded into CAS.

- The modelWeights argument specifies the name of a new CAS table, which will hold the weights of the model derived at the last epoch.

- The bestWeights argument creates a data set that saves the best-fitting weights, regardless of the epoch in which they occur. If a validation data set is included, then the best-performing weights are chosen based on the model's performance on the validation data. Otherwise, the training data are used.

- The target, inputs, and nominal variables are specified in their own respective arguments. The target argument specifies **ins**. Notice that all the inputs are listed in the inputs argument and only the nominals are listed in the nominals argument. The nominals include the target **ins** because it is binary and, therefore, we must tell SAS that it is categorical.

- The optimizer argument specifies the settings for the optimization algorithm. In the optimizer settings for this program, notice the following:

  - The minibatch size is set to 60 and the total number of epochs (in maxEpochs) is set to 50. Therefore, 60 observations are used to update the weights in every iteration, and all the data is used 50 times throughout the optimization routine.

    **Note:** The minibatchSize option specifies the number of observations per thread in a mini-batch. You can use this parameter to control the number of observations (the minibatch size) that the action uses on each worker. This course was written to be delivered on a machine using 16 CPUs. Therefore, each total minibatch is actually the minibatch size multiplied by the 16 CPUs. Sometimes long tails (fewer observations used in later iterations within an epoch) can form, which should be mitigated when possible. To print detailed optimization information in the log, specify loglevel=3 in the optimizer argument.

  - Also included are the L1 and L2 penalties that were specified in the PROC NNET step in the previous demonstration.

  - The algorithm argument specifies the algorithm used to find the model weights. Here, we specify the ADAM method for optimization and set the learning rate policy (that is, the strategy used to reduce the learning rate throughout the training process) to Step. This policy multiplies the learning rate by gamma after a user-specified number of epochs (as specified in the stepsize option). So for this example, the learning rate starts at 0.001, and then every 10 epochs it is multiplied by 0.5. That is, the learning rate is cut in half every 10 epochs.

    **Note:** If no step size is indicated, then the learning rate is reduced every 10 epochs. Also, beta1 and beta2 are set to their recommended values according to the literature for ADAM optimization.

    **Note:** In the algorithm argument, you can specify the following methods: ADAM, LBFGS, MOMENTUM, and VANILLA.

- Last but not least, a seed is set.
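
The optimizer arithmetic described in this list can be worked through in Python. This is a sketch under stated assumptions, not the dlTrain implementation: epochs are counted from 0, the 16 threads come from the course note about the delivery machine, and 19358 is the number of training observations reported by PROC NNET in the first demonstration. The function names are chosen for illustration.

```python
import math


def step_learning_rate(epoch, base_rate=0.001, gamma=0.5, step_size=10):
    """Learning rate under a step policy, with epochs counted from 0."""
    return base_rate * gamma ** (epoch // step_size)


def minibatch_plan(n_obs, per_thread=60, n_threads=16):
    """Return (effective minibatch size, updates per epoch, size of last minibatch)."""
    effective = per_thread * n_threads
    updates = math.ceil(n_obs / effective)
    last = n_obs - (updates - 1) * effective
    return effective, updates, last


# The rate is halved each time another 10 epochs have been completed.
for epoch in (0, 9, 10, 20, 49):
    print(epoch, step_learning_rate(epoch))

# 19358 training observations, 60 observations per thread, 16 threads (assumed).
print(minibatch_plan(19358))   # (960, 21, 158)
```

Under these assumptions, each epoch consists of 20 full minibatches of 960 observations and a final short minibatch of 158, which is the kind of long tail that the note mentions.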

- Run the four PROC CAS steps to build the first deep learning model. Look at the results.

The first few results tables show the customized neural network structure being built layer by layer (that is, descriptive information pertaining to the model shell and the layers included in the model).

The Model Information table shows that the model has nine total layers: one input layer, seven hidden layers, and one output layer. And this model has approximately the same number of parameters as the previous model: 3747.
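
The reported parameter count can be reproduced with a short Python calculation. The sketch rests on an assumption that is not stated in the course notes: each of the 17 nominal inputs is binary and contributes two input units, so the input layer has 28 + 34 = 62 units, and the output layer has one unit per target level. Under that assumption the total matches the 3747 shown in the Model Information table.

```python
def dense_parameters(layer_sizes, include_bias=True):
    """Count weights (and optionally biases) in a fully connected network."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:]) if include_bias else 0
    return weights + biases


# Assumed input width: 28 interval inputs + 17 nominal inputs x 2 levels each.
n_inputs = 28 + 17 * 2
hidden = [30, 20, 10, 5, 10, 20, 30]   # hourglass shape from the addLayer actions
n_outputs = 2                          # one unit per target level (assumption)

print(dense_parameters([n_inputs] + hidden + [n_outputs]))   # 3747
```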

The Optimization History table shows the epoch number, learning rate, and the error for each partition of the data, as well as other information. At the tenth epoch, notice that the learning rate is cut in half due to the learning rate policy that was implemented. Again, the validation error is the misclassification for the model on the validation data. Notice that the model begins much like the previous model in that it initially fails to discriminate between the levels of the target. However, at epoch 5, the model begins to learn the data and starts separating the events from the nonevents. At the end of the table, notice that the model performance is approximately 26% misclassification on the validation data. The model's best performance occurred in the 39th epoch. Simply by applying deep learning methodology to the same model architecture, we were able to improve the model performance.

Note: In the next demonstration, you continue to use this program.

## Deep Learning Using SAS® Software

Lesson 01, Section 2 Demo: Adding Batch Normalization to a Deep Learning Neural Network Using CASL Code

In this demonstration, we build a second deep learning model, and this time use batch normalization. We also remove dropout from some of the layers. However, the general hourglass shape of the neural network structure will be the same as in the model we built in the previous demonstration.

- If you did not perform the steps of the first demonstration in this lesson in your current SAS Studio session, open and run the program **DLUS01D01.sas** to load the data into memory before you continue.

- Open the program that you started using in the previous demonstration, **DLUS01D02.sas**. If you did not perform the steps of the previous demonstration in your current SAS Studio session, run the first four PROC CAS steps to build the first deep learning model before you continue.

- Scroll down to the last two PROC CAS steps in the program. This is the code that we use in this demonstration to build a deep learning model with batch normalization. Examine this code.

```sas
/********************************************************/
/*Build the Deep Learning Model with Batch Normalization*/
/********************************************************/
proc cas;
   BuildModel / modeltable={name='BatchDLNN', replace=1} type = 'CNN';

   /*INPUT Layer*/
   AddLayer / model='BatchDLNN' name='data'
      layer={type='input' STD='STD' dropout=.05};

   /*FIRST HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer1' srcLayers={'data'}
      layer={type='FULLCONNECT' n=30 act='ELU' init='xavier' };

   /*SECOND HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer2' srcLayers={'HLayer1'}
      layer={type='FULLCONNECT' n=20 act='identity' init='xavier' includeBias=False};
   AddLayer / model='BatchDLNN' name='BatchLayer2' srcLayers={'HLayer2'}
      layer={type='BATCHNORM' act='TANH'};

   /*THIRD HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer3' srcLayers={'BatchLayer2'}
      layer={type='FULLCONNECT' n=10 act='identity' init='xavier' includeBias=False };
   AddLayer / model='BatchDLNN' name='BatchLayer3' srcLayers={'HLayer3'}
      layer={type='BATCHNORM' act='TANH'};

   /*FOURTH HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer4' srcLayers={'BatchLayer3'}
      layer={type='FULLCONNECT' n=5 act='identity' init='xavier' includeBias=False };
   AddLayer / model='BatchDLNN' name='BatchLayer4' srcLayers={'HLayer4'}
      layer={type='BATCHNORM' act='TANH'};

   /*FIFTH HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer5' srcLayers={'BatchLayer4'}
      layer={type='FULLCONNECT' n=10 act='identity' init='xavier' includeBias=False };
   AddLayer / model='BatchDLNN' name='BatchLayer5' srcLayers={'HLayer5'}
      layer={type='BATCHNORM' act='TANH'};

   /*SIXTH HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer6' srcLayers={'BatchLayer5'}
      layer={type='FULLCONNECT' n=20 act='identity' init='xavier' includeBias=False};
   AddLayer / model='BatchDLNN' name="BatchLayer6" srcLayers={'HLayer6'}
      layer={type='BATCHNORM' act='TANH'};

   /*SEVENTH HIDDEN LAYER*/
   AddLayer / model='BatchDLNN' name='HLayer7' srcLayers={'BatchLayer6'}
      layer={type='FULLCONNECT' n=30 act='identity' init='xavier' includeBias=False };
   AddLayer / model='BatchDLNN' name="BatchLayer7" srcLayers={'HLayer7'}
      layer={type='BATCHNORM' act='TANH'};

   AddLayer / model='BatchDLNN' name='outlayer' srcLayers={'BatchLayer7'}
      layer={type='output' act='Softmax'};
quit;

/*****************************/
/*Fit the Deep Learning Model*/
/*****************************/
proc cas;
   dlTrain / model='BatchDLNN' table='Train_Develop' ValidTable='Valid_Develop'
      modelWeights={name='DeepTrainedWeights_d', replace=1}
      bestweights={name='bestdeepweights', replace=1}
      target='INS'
      inputs={'AcctAge','DDABal','CashBk','Checks','NSFAmt','Phone','Teller',
              'SavBal','ATMAmt','POS','POSAmt','CDBal','IRABal','LOCBal',
              'ILSBal','MMBal','MMCred','MTGBal','CCBal','CCPurc','Income',
              'LORes','HMVal','Age','CRScore','Dep','DepAmt','InvBal','DDA',
              'DirDep','NSF','Sav','ATM','CD','IRA','LOC','ILS','MM','MTG',
              'CC','SDB','HMOwn','Moved','InArea','Inv'}
      nominals={'INS','DDA','DirDep','NSF','Sav','ATM','CD','IRA','LOC','ILS',
                'MM','MTG','CC','SDB','HMOwn','Moved','InArea', 'Inv'}
      optimizer={minibatchsize=60, regL1=0.003, regL2=0.002, maxepochs=50,
                 algorithm={method='ADAM', lrpolicy='Step', gamma=0.5,
                            stepsize=10, beta1=0.9, beta2=0.999,
                            learningrate=.001}}
      seed=12345;
quit;
```

The first PROC CAS step includes the buildModel action with the other addLayer actions. You can split the model structure into as many PROC CAS steps as you want. This model is named BatchDLNN. The type is now CNN instead of DNN. We are not building a convolutional neural network for this type of data, but in SAS, the CNN type includes batch normalization, whereas the DNN type does not.

Notice the following details about the layers that are specified in the addLayer actions:

- The code to add the input layer to the model includes the options to standardize the inputs and apply dropout in this layer.

- The first hidden layer is fully connected with 30 mathematical neurons, the activation function is the exponential linear unit, and Xavier is the weight initialization method. (This model uses Xavier instead of MSRA as the weight initialization method.) This layer is named **HLayer1** and is connected to the data layer. Batch normalization is not used in the first hidden layer because the information in the input layer is already normalized using the z-score standardization method, so there is no need to use additional computation.

- The second hidden layer is where the model starts to diverge from the previous model. Notice that this hidden layer includes two addLayer actions, as described below:

  - The first addLayer action specifies that the type is fully connected, with 20 mathematical neurons and Xavier weight initialization. Unlike the previous model, the activation function is set to identity, which means that we pass the combination function forward to the next layer and not the transformed combination function. The includeBias option is set to False, which avoids adding a hidden bias parameter into the layer. This is done because this information is fed into a batch normalization layer, which includes a bias parameter. Finally, the source layer is the first hidden layer, and this layer is named **HLayer2**.

  - The second of the two addLayer actions is where batch normalization is applied for this hidden layer. This layer is named **BatchLayer2** and is connected to the previous layer, **HLayer2**. Inside the layer argument, the type is set to batchnorm and the activation function is set to the hyperbolic tangent (tanh). The previous model used ReLU as the activation function.

- The remaining hidden layers in the model architecture show the same pattern. In each hidden layer, batch normalization is added with a second addLayer action. The neural network maintains the hourglass shape regarding the number of neurons in each hidden layer.

- Of course, the last layer is the output layer, which now connects to the last batch normalization in the neural network.
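
What a batch normalization layer with tanh activation does to the values it receives can be sketched in Python for a single neuron. This is the standard textbook computation, not the deepLearn implementation: the learnable scale `gamma`, the learnable shift `beta` (the bias-like parameter), the stabilizing constant `eps`, and the input values are assumptions for illustration.

```python
import math


def batch_norm_tanh(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's values over a minibatch, then scale, shift, and apply tanh."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    return [math.tanh(gamma * z + beta) for z in normalized]


# Hypothetical combination-function values for one neuron across a minibatch of four.
print(batch_norm_tanh([2.0, 4.0, 6.0, 8.0]))
```

Because the preceding fully connected layer uses the identity activation, the values that reach this step are the raw combination-function outputs. The shift `beta` takes over the role of the bias that was switched off with includeBias=False.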

In the second PROC CAS step, the dlTrain action fits the model. The arguments and syntax are identical to those used in the previous demonstration with one exception. This time we specify **BatchDLNN** as the model name.

- Run the two PROC CAS steps to build the neural network and fit the model. Look at the results.

The Model Information table shows that the total number of layers has increased to 15 here (compared to 9 in the previous demonstration), by using 6 batch normalization layers. Also, the number of parameters has increased to 3842 as a consequence of batch normalization including learnable weights.
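
The increase to 3842 parameters can be reproduced in Python. The sketch rests on assumptions that are not stated in the course notes: the input layer has 62 units (28 interval inputs plus two units for each of the 17 nominal inputs), the output layer has two units, and each batch normalization layer has two learnable parameters per neuron (a scale and a shift). Under those assumptions the total matches the reported value.

```python
def batch_norm_model_parameters(n_inputs, hidden, n_outputs):
    """Count parameters when only the first hidden layer and the output layer keep biases."""
    sizes = [n_inputs] + hidden + [n_outputs]
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = hidden[0] + n_outputs      # HLayer1 and the output layer
    batch_norm = 2 * sum(hidden[1:])    # scale and shift per neuron, hidden layers 2-7
    return weights + biases + batch_norm


# Assumed input width: 28 interval inputs + 17 nominal inputs x 2 levels each = 62.
print(batch_norm_model_parameters(62, [30, 20, 10, 5, 10, 20, 30], 2))   # 3842
```

The six fully connected layers that feed batch normalization drop 95 bias parameters in total, and the six batch normalization layers add 190, for a net increase of 95 over the 3747 parameters of the previous model.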

The Optimization History table shows that this model has a slight improvement in performance. The memory cost paid off because the model has improved the validation misclassification rate, further reducing the rate down to approximately 25.9% (observed in the 33rd epoch).

## Deep Learning Using SAS® Software

Lesson 02, Section 1 Demo: Loading and Preparing Image Data

In this demonstration, we load images into memory. We also prepare the image data for modeling, using the SAS Image action set. In the next demonstration, we create a CNN model for these images. **Note:** You use the same SAS program for all demonstrations in this lesson.

- Open the program named **DLUS02D01.sas**. Run the two LIBNAME statements.

```sas
/********************************/
/*Create Local and CAS Libraries*/
/********************************/
libname local "/home/student/LWDLUS/Data";
libname mycas cas;
```

The two LIBNAME statements assume that you are running this program in a new SAS Studio session. These statements create a path to the data locally and a connection to CAS using the library named **mycas**.

- Examine the first PROC CAS step, which loads the images into memory.

```sas
/*******************************************************/
/* Load the image actionset and                        */
/* the LargetrainData and SmalltrainData cifar-10 data */
/*******************************************************/
proc cas ;
   loadactionset 'image';
   loadactionset 'table';

   table.addCaslib / name='imagelib' path='/home/student/LWDLUS/Data'
      subdirectories=true activeOnAdd=False;

   image.loadimages / caslib='imagelib' path='LargetrainData'
      recurse=true labellevels=1
      casout={name='LargetrainData', replace=true};

   image.loadimages / caslib='imagelib' path='SmalltrainData'
      recurse=true labellevels=1
      casout={name='SmalltrainData', replace=true};
quit;
```

The PROC CAS step first loads the image and table action sets. The addCaslib action from the table action set creates a new caslib called **imagelib**. The path argument sets the location of our image data folder. Specifying subdirectories=TRUE gives SAS access to subdirectories in the images folder.

In the first loadImages action, the caslib argument specifies the caslib that was just created, **imagelib**, and the path argument specifies the name of the first folder that contains images, **LargeTrainData**. This loadImages action also specifies two additional options: recurse=TRUE and labelLevels=1. The recurse option tells SAS to pull images from subfolders within the **LargeTrainData** folder. The labelLevels option tells SAS to create a new variable, called **_label_**, which uses the subfolder names as the labels for the images. Therefore, we can assume that each subfolder in **LargeTrainData** contains a specific image label level.
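
The idea of taking each image's label from its subfolder name can be illustrated in Python. This is only a sketch of the concept for a single label level, not the loadImages action. The function name and the file names are hypothetical; only the folder names come from the demonstration.

```python
from pathlib import PurePosixPath


def label_from_path(image_path):
    """Use the name of the folder that directly contains the image as its label."""
    return PurePosixPath(image_path).parent.name


# Hypothetical file names inside the demonstration's folders.
print(label_from_path("LargetrainData/airplane/img_0001.png"))   # airplane
print(label_from_path("LargetrainData/truck/img_0002.png"))      # truck
```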

- On the local machine, open Windows File Explorer and navigate to the **LargeTrainData** folder (D:/Workshop/winsas/DLUS/Data/LargetrainData). The **LargeTrainData** folder contains 10 subfolders. Each subfolder name will be the label for all the images it contains. For example, we can open the **airplane** folder and see that it contains images of airplanes. When SAS loads these images into memory, the label for each image is set to **airplane**. Close File Explorer.

- Return to the program in SAS Studio and run the PROC CAS step to load the images into memory.

In the first loadImages action, the casout argument saves the images in an output data set on the CAS server using the same name, **LargeTrainData**.

The second loadImages action loads the **smallTrainData** data set.

- To view a sample of the images, run the DATA step. Look at the results.

```sas
/***************/
/* View Images */
/***************/
data _null_;
   set mycas.LargetrainData(where=(
          _id_<=8     AND _id_>=1     or
          _id_<=5008  AND _id_>=5001  or
          _id_<=10008 AND _id_>=10001 or
          _id_<=15008 AND _id_>=15001 or
          _id_<=20008 AND _id_>=20001 or
          _id_<=25008 AND _id_>=25001 or
          _id_<=30008 AND _id_>=30001 or
          _id_<=35008 AND _id_>=35001 or
          _id_<=40008 AND _id_>=40001 or
          _id_<=45008 AND _id_>=45001)
       keep=_path_ _id_ _label_) end=eof;
   if _n_=1 then do;
      dcl odsout obj();
      obj.layout_gridded(columns:8);
   end;
   obj.region();
   obj.format_text(text: _label_, just: "c", style_attr: 'font_size=8pt');
   obj.image(file: _path_, width: "112", height: "112");
   if eof then do;
      obj.layout_end();
   end;
run;
```

The DATA step uses a DATA step component object (declared with DCL ODSOUT) to lay out the images in a grid, resizes each image to a width and height of 112, and displays the images that satisfy the conditions specified in the WHERE= data set option.

Each row of the results displays sample images for one of the 10 label levels. If you scroll through the images, you can see airplanes, automobiles, birds, cats, dogs, frogs, horses, ships, and trucks.

- Examine the two PROC PARTITION steps.

/******************************************************************/ /* Partition each folder of images into train and validation data */ /******************************************************************/ proc partition data=mycas.SmalltrainData samppct=80 samppct2=20 seed=12345 partind; by _label_; output out=mycas.smallImageData; run; proc partition data=mycas.LargetrainData samppct=80 samppct2=20 seed=12345 partind; by _label_; output out=mycas.LargeImageData; run;

The PARTITION procedure is used to sample data in SAS Viya. It performs simple random sampling, stratified sampling, oversampling, or *k*-fold partitioning to produce a table that contains a subset of the observations or a table that contains partitioned observations.

These two PROC PARTITION steps are the same except for the data. The first PROC step partitions **smallTrainData** and the second PROC step partitions **largeTrainData**.

Each PROC PARTITION step sets the samppct (sample percentage) argument to 80 and the samppct2 (sample percentage 2) argument to 20. If we wanted a third partition, we would specify percentages that total less than 100, and the remaining observations would form the third partition. For example, if partition 1 was 60% and partition 2 was 30%, then SAS would by default place the remaining 10% of the data in a third partition.

The partind (partition indicator) option tells SAS to create a new indicator variable named **_partind_** in the output data set. This option then populates the **_partind_** variable with values that indicate which partition the observations belong to. In this example, there are only two partitions, so the partition indicator variable that is added to the CAS table is binary: 80% of the observations have a value of 1 (for training), and the remaining 20% have a value of 2 (for validation).

The BY statement is used to conduct stratified random sampling according to the variable that is specified as the stratum variable. Here, the stratum variable is **_label_**, which is the target.

Finally, the OUTPUT statement specifies the output data set names smallImageData and largeImageData.

- Run the two PROC PARTITION steps. Look at the results.

At the top of the results window, the Stratified Sampling Frequency table shows a summary of the partitioned small training data for the first PARTITION procedure. Each label level has 1000 images. Therefore, the requested 80-20 split resulted in 800 images for the first partition and 200 for the second partition for each label level. The Stratified Sampling Frequency table for the second PROC PARTITION step shows that the large training data set has 5000 images for each label level and, therefore, 4000 and 1000 observations for the two partitions.
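The stratified 80-20 split described above can be illustrated outside SAS with a few lines of Python. This is only a sketch of the idea, not how PROC PARTITION is implemented: the function name `stratified_partition`, the synthetic labels `class0` through `class9`, and the rounding rule are assumptions made for illustration, and the seed will not reproduce the SAS partitions.

```python
import random
from collections import Counter

def stratified_partition(labels, pct1=80, seed=12345):
    """Return a partition indicator (1 or 2) for each row, stratified by label."""
    rng = random.Random(seed)
    rows_by_label = {}
    for row, label in enumerate(labels):
        rows_by_label.setdefault(label, []).append(row)

    partind = [2] * len(labels)          # default: second partition (validation)
    for rows in rows_by_label.values():
        rng.shuffle(rows)                # random sampling within each stratum
        n_first = round(len(rows) * pct1 / 100)
        for row in rows[:n_first]:
            partind[row] = 1             # first partition (training)
    return partind

# Hypothetical stand-in for smallTrainData: 10 labels with 1000 rows each.
labels = [f"class{k}" for k in range(10) for _ in range(1000)]
partind = stratified_partition(labels)
counts = Counter(zip(labels, partind))
print(counts[("class0", 1)], counts[("class0", 2)])
```

Because the sampling is done within each label level, every level contributes the same 800 and 200 rows to the two partitions, which is the pattern shown in the Stratified Sampling Frequency table.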

- Examine and then run the two PROC CAS steps to shuffle the data.

/********************/ /* Shuffle the data */ /********************/ proc cas; table.shuffle / table='smallImageData' casout={name='SmallImageDatashuffled', replace=1}; quit; proc cas; table.shuffle / table='LargeImageData' casout={name='LargeImageDatashuffled', replace=1}; quit;

Each of these PROC steps randomly sorts the observations of one of the two input data sets using the shuffle action. The first step creates a new data set named **SmallImageDataShuffled**, and the second creates a new data set named **LargeImageDataShuffled**.

The data are currently ordered by the target values as a result of using the BY statement in PROC PARTITION. For example, the data might begin with pictures of only frogs, followed by pictures of cats, and so on. Ordering by the target can cause a problem for variants of stochastic gradient descent (SGD, momentum, and ADAM). The problem arises because ordering by the outcome can cause the model to become entrenched with a set of parameters that overpredicts a single class. That is, it fails to discriminate between all outcome classes. For example, when the model starts learning the data, it begins learning the airplane images, and then moves to learning automobile images, and so on, until it learns the last target class, **truck**. Because the weights are adjusted each time according to only a single class of images, the model is likely to overpredict the last target level it saw. Therefore, without shuffling, we would expect the model to overpredict the class of **truck**.

To get a more unbiased set of predictions, it is recommended that the data be randomly shuffled. In fact, it is quite common to change the observation order when blending (ensemble through weighted averages of predictions) models together for image classifiers.
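To make the ordering problem concrete, the following Python sketch (not SAS code) compares the mini-batches drawn from label-ordered data with those drawn after a random shuffle. The three class names, the 160 rows per class, and the helper `minibatches` are hypothetical choices for illustration; only the mini-batch size of 80 matches the training step used later in the course.

```python
import random

def minibatches(rows, size):
    """Split a list into consecutive mini-batches of the given size."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

# Hypothetical labels ordered by class, as they are after a BY-group partition.
ordered = [label
           for label in ("airplane", "automobile", "truck")
           for _ in range(160)]

shuffled = ordered[:]
random.Random(12345).shuffle(shuffled)

# Number of distinct classes that each mini-batch of 80 rows contains.
classes_per_batch_ordered = [len(set(b)) for b in minibatches(ordered, 80)]
classes_per_batch_shuffled = [len(set(b)) for b in minibatches(shuffled, 80)]
print(classes_per_batch_ordered)
print(classes_per_batch_shuffled)
```

With the ordered data, every mini-batch contains a single class, so each weight update is driven by one class only. After shuffling, the mini-batches mix the classes.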


Lesson 02, Section 3 Demo: Training a Convolutional Neural Network

In the previous demonstration, we loaded two data sets containing images into memory and prepared them for modeling. In this demonstration, we use the image data to build a convolutional neural network. Convolutional neural networks can take days, weeks, or even months to train in certain situations. Here, we build a simple convolutional network on a limited amount of data to save time.

- Open the program named **DLUS02D01.sas**. **Note:** If you did not perform the steps of the previous demonstration in the current SAS Studio session, run the first part of this program (from the beginning through the two PROC CAS steps, as shown in the previous demonstration) before you continue.

- To see a diagram of the model that we will create, run the DATA step in the section about viewing a picture of the model architecture. Look at the results.

/********************************************/ /* View a picture of the model architecture */ /********************************************/ data _NULL_; dcl odsout obj1(); obj1.image(file:'/home/student/LWDLUS/Data/ModelPic.PNG', width: "850", height: "450"); run;

As the diagram shows, this model has an advanced architecture for a new CNN practitioner. This model incorporates spatial exploration, skip layer connections, stacked pooling, and aspects of cardinality. Split paths provide independent tracks for feature extraction. The program builds this network one layer at a time using the deepLearn action set. It will be helpful to refer back to this image when we examine the code that builds the model shell.

- Following the DATA step, the first PROC CAS step summarizes the training image data. Run this step and look at the results.

/*************************************/ /* Summarize the training image data */ /*************************************/ proc cas; image.summarizeimages / table={name='LargeImageDatashuffled', where='_PartInd_=1'}; quit;

The summarizeImages action summarizes **LargeImageDataShuffled**. In the where option, notice that _partind_=1. This indicates a request to summarize only the 80% partition, or the training data, from the shuffled **largeTrainData**.

In the Results table for the summarizeImages action, look at the average intensity of each of the color channels. The average value of the blue intensity, the green intensity, and the red intensity are shown. We will use these three values as offsets in the input layer.

- Examine the next two PROC CAS steps, which build the neural network architecture.

/********************************/ /*Load the Deep Learn Action Set*/ /********************************/ proc cas; loadactionset 'DeepLearn'; run; /*********************************/ /* Build the Deep Learning Model */ /*********************************/ Proc Cas; /* Build a model shell*/ BuildModel / modeltable={name='ConVNN', replace=1} type = 'CNN'; /* Add an input layer */ AddLayer / model='ConVNN' name='data' layer={type='input' nchannels=3 width=32 height=32 randomFlip='H' randomMutation='Random' offsets={113.852228,123.021097,125.294747}}; /* Add several Convolutional layers */ AddLayer / model='ConVNN' name='ConVLayer1a' layer={type='CONVO' nFilters=12 width=1 height=1 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='ConVNN' name='ConVLayer1b' layer={type='CONVO' nFilters=12 width=3 height=3 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='ConVNN' name='ConVLayer1c' layer={type='CONVO' nFilters=12 width=5 height=5 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='ConVNN' name='ConVLayer1d' layer={type='CONVO' nFilters=12 width=7 height=7 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='ConVNN' name='ConVLayer1e' layer={type='CONVO' nFilters=16 width=4 height=4 stride=2 dropout=.2 act='ELU'} srcLayers={'data'}; AddLayer / model='ConVNN' name='ConVLayer1f' layer={type='CONVO' nFilters=16 width=6 height=6 stride=4 dropout=.25 act='ELU'} srcLayers={'data'}; /* Add a concatenation layer */ AddLayer / model='ConVNN' name='concatlayer1a' layer={type='concat'} srcLayers={'ConVLayer1a','ConVLayer1b','ConVLayer1c','ConVLayer1d'}; /* Add a max pooling layer */ AddLayer / model='ConVNN' name='PoolLayer1max' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'concatlayer1a'}; /* Add a concatenation layer */ AddLayer / model='ConVNN' name='concatlayer2' layer={type='concat'} srcLayers={'PoolLayer1max','ConVLayer1e'}; /* Add a max pooling layer */ AddLayer / model='ConVNN' name='PoolLayer2max' layer={type='POOL' width=2 
height=2 stride=2 pool='max'} srcLayers={'concatlayer2'}; /* Add a concatenation layer */ AddLayer / model='ConVNN' name='concatlayer3' layer={type='concat'} srcLayers={'PoolLayer2max','ConVLayer1f'}; /* Add a max pooling layer */ AddLayer / model='ConVNN' name='PoolLayer3max' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'concatlayer3'}; /* Add a Convolutional layer with 64, 3 by 3 filters, a stride of 2 and batch normalization */ AddLayer / model='ConVNN' name='ConVLayer2a' layer={type='CONVO' nFilters=64 width=3 height=3 stride=2 act='Identity' includeBias=False} srcLayers={'concatlayer3'}; AddLayer / model='ConVNN' name='BatchLayer2a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer2a'}; /* Add a Convolutional layer with 64, 1 by 1 filters, a stride of 1 and batch normalization */ AddLayer / model='ConVNN' name='ConVLayer2b' layer={type='CONVO' nFilters=64 width=1 height=1 stride=1 act='Identity' includeBias=False} srcLayers={'concatlayer3'}; AddLayer / model='ConVNN' name='BatchLayer2b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer2b'}; /* Add a Convolutional layer with 64, 3 by 3 filters, a stride of 1 and batch normalization */ AddLayer / model='ConVNN' name='ConVLayer3a' layer={type='CONVO' nFilters=64 width=3 height=3 stride=1 init='msra' act='Identity' includeBias=False} srcLayers={'BatchLayer2b'}; AddLayer / model='ConVNN' name='BatchLayer3a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3a'}; /* Add a Convolutional layer with Batch Normalization */ AddLayer / model='ConVNN' name='ConVLayer3b' layer={type='CONVO' nFilters=64 width=5 height=5 stride=1 init='msra' act='Identity' includeBias=False} srcLayers={'BatchLayer2b'}; AddLayer / model='ConVNN' name='BatchLayer3b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3b'}; /* Add a concatenation layer */ AddLayer / model='ConVNN' name='concatlayer4' layer={type='concat'} srcLayers={'BatchLayer3a','BatchLayer3b'}; /* Add a Convolutional layer with 
Batch Normalization */ AddLayer / model='ConVNN' name='ConVLayer4' layer={type='CONVO' nFilters=128 width=3 height=3 stride=2 init='msra2' act='Identity' includeBias=False} srcLayers={'concatlayer4'}; AddLayer / model='ConVNN' name='BatchLayer4' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer4'}; /* Add a concatenation layer */ AddLayer / model='ConVNN' name='concatlayer5' layer={type='concat'} srcLayers={'PoolLayer3max','BatchLayer4','BatchLayer2a'}; /* Add a Convolutional layer with Batch Normalization */ AddLayer / model='ConVNN' name='ConVLayerLasta' layer={type='CONVO' nFilters=500 width=1 height=1 stride=1 init='msra2' act='Identity' includeBias=False} srcLayers={'concatlayer5'}; AddLayer / model='ConVNN' name='BatchLayerLasta' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayerLasta'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN' name='FCLayer1' layer={type='FULLCONNECT' n=540 act='Identity' init='msra2' includeBias=False} srcLayers={'BatchLayerLasta'}; AddLayer / model='ConVNN' name='BatchLayerFC1' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer1'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN' name='FCLayer2' layer={type='FULLCONNECT' n=540 act='Identity' init='msra2' dropout=.7 includeBias=False} srcLayers={'BatchLayerFC1'}; AddLayer / model='ConVNN' name='BatchLayerFC2' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer2'}; /* Add an output layer with softmax activation */ AddLayer / model='ConVNN' name='outlayer' layer={type='output' act='SOFTMAX'} srcLayers={'BatchLayerFC2'}; /* View the model architecture */ ModelInfo / model='ConVNN'; quit;

The first PROC CAS step loads the deepLearn action set.

In the second of these PROC CAS steps, the buildModel action initializes the model shell. We name this model **ConVNN** and set the type to CNN. Then the layers are added, as described below:

- The first addLayer action adds the input layer. Details about this layer are described below:
  - We specify the number of channels in these images to be 3, because we have color images.
  - The width and height of each image is 32, so we specify the width and height to be 32.
  - We specify randomFlip=H to randomly horizontally flip images within each mini batch. When a mini batch comes in, the program randomly selects some of those images and horizontally flips each image. The same process occurs for each mini batch.
  - Likewise, we apply a random mutation to randomly selected images within each mini batch.
  - Next, we include the offsets argument, where we specify the average blue intensity value, the average green intensity value, and the average red intensity value. These values are subtracted from the pixel values of each image, which centers the data at a mean of zero. This is a good practice when working with image data.
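The two input-layer ideas above, subtracting the channel offsets and randomly flipping images within a mini batch, can be sketched in plain Python. This is an illustration only, not the deepLearn implementation: the helper names and the 50% flip probability are assumptions, and the random mutation option is not sketched. The offsets are the three values from the program, taken in the order blue, green, red as described above.

```python
import random

# Channel means from the demo program, in the order blue, green, red.
OFFSETS = (113.852228, 123.021097, 125.294747)

def center_pixel(pixel, offsets=OFFSETS):
    """Subtract each channel's offset from the pixel's channel value."""
    return tuple(value - offset for value, offset in zip(pixel, offsets))

def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) from left to right."""
    return [row[::-1] for row in image]

def prepare_batch(batch, rng, flip_prob=0.5):
    """Randomly flip some images in a mini batch, then center every pixel.

    flip_prob is a hypothetical value chosen for this sketch.
    """
    prepared = []
    for image in batch:
        if rng.random() < flip_prob:
            image = flip_horizontal(image)
        prepared.append([[center_pixel(p) for p in row] for row in image])
    return prepared

# A tiny hypothetical 1-by-2 image with (blue, green, red) pixels.
tiny_image = [[(120.0, 130.0, 140.0), (100.0, 110.0, 120.0)]]
batch = prepare_batch([tiny_image] * 4, random.Random(12345))
```

A pixel whose channel values equal the offsets maps to zero in every channel, which is what centering at a mean of zero means here.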

- Next, we specify several addLayer statements to create a convolutional layer. In this convolutional layer, we want to have filters of different sizes. Details about these addLayer statements are described below:
  - In the first addLayer statement, we create 12 filters of size 1 by 1, with a stride value of 1.
  - Next, we create 12 filters that have a width and a height of 3, and a stride value of 1.
  - Then we create 12 filters that are 5 by 5 with a stride of 1.
  - Then we create 12 filters that are 7 by 7 with a stride of 1.
  - Next, we create 16 filters that have a width and height value of 4, with a stride of 2. We also apply a dropout rate of 20% to these filters.
  - The last set of filters that we create in this convolutional layer consists of 16 filters that are 6 by 6, with a stride value of 4 and a dropout rate of 25%.
  - Notice that all these filters use the ELU activation function and use the data layer as their source layer.

- Next, we concatenate the first four convolution layers. To add a concatenation layer, we specify the following:
  - We specify CONCAT in the type argument.
  - In the srcLayers argument, we specify the names of the layers that we intend to join. One caveat when you use a concatenation layer is that the matrices to be concatenated must have the same width and height. Here, ConcatLayer1a concatenates the feature maps for ConVLayer1a, ConVLayer1b, ConVLayer1c, and ConVLayer1d. This creates a tensor that is 32 by 32 by 48, because each of the four convolutional layers produces 12 feature maps that are 32 by 32.

- Next, we add a pooling layer named PoolLayer1max. The details about this layer are described below:
  - We specify type='POOL'.
  - In the width and height, we specify the size of the region summary, in this case 2 by 2. Using a stride value of 2 effectively decreases the information by about 75%. And here we use a MAX pooling operation.
  - The information that's flowing into this pooling layer is, of course, coming from the concatenation layer. The input coming into this pooling layer was a 32-by-32-by-48 tensor.
  - The output of this pooling layer is a 16-by-16-by-48 tensor.
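As a small illustration of what a 2-by-2 max pooling operation with a stride of 2 does to a single feature map, here is a plain-Python sketch. The function `max_pool` and the 4-by-4 example values are hypothetical, and the sketch handles one feature map at a time, whereas the pooling layer applies the same operation to each of the 48 feature maps.

```python
def max_pool(feature_map, size=2, stride=2):
    """Apply max pooling to one feature map (a list of equal-length rows)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc]
                for dr in range(size)
                for dc in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]

# Hypothetical 4-by-4 feature map.
example = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 3],
]
pooled = max_pool(example)
print(pooled)
```

Each 2-by-2 region is summarized by its largest value, so the 16 input values become 4 output values. That is the 75% reduction mentioned above, and it is why a 32-by-32 map becomes a 16-by-16 map.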

- Next, we add a concatenation layer named ConcatLayer2. Remember, when you concatenate information, the width and height of the matrices must be the same. So we concatenate the output of the pooling layer that we just added, which is 16 by 16 by 48, with the output of ConVLayer1e, one of the earlier sets of filters in the first convolutional layer.
  - The input to ConVLayer1e was 32 by 32 with three channels. Because we have a stride of 2 and 16 filters, the output of ConVLayer1e will be 16 by 16 by 16. So, we concatenate the 16 by 16 by 48 output from the pooling layer with the 16 by 16 by 16 output from the earlier convolution layer.
  - The output of this concatenation layer is 16 by 16 by 64.

- We then feed the output of ConcatLayer2 into another pooling layer named PoolLayer2max. This pooling layer also has a width and height of 2, a stride of 2, and it uses the max pooling operation. So each of the summary regions has a size of 2 by 2. Again, we're downsampling the information by 75%. The incoming information is 16 by 16 by 64, and the output of this layer is 8 by 8 by 64.

- We then feed the pooling layer output into another concatenation layer named ConcatLayer3. This layer concatenates the output of the previous pooling layer, PoolLayer2max, with information coming from ConVLayer1f. We can do this because the matrices are the same size. That is, the width and the height of the tensors are the same.

**Note:** You might want to take another look at the network architecture diagram that we generated earlier. Notice that ConcatLayer3 is the point where we combine the output from ConVLayer1f and the second pooling layer. Moving forward in the diagram, notice that we then send this information in three different directions: to a third pooling layer, to another convolutional layer with a 3-by-3 filter, and to a convolutional layer with a 1-by-1 filter.

- In the code, the information from ConcatLayer3 is then fed into our third pooling layer, which we call PoolLayer3max. This pooling layer also has a 2-by-2 filter, with a stride of 2, and MAX pooling.

- ConcatLayer3 is also fed into a 3-by-3 convolutional layer with a stride value of 2, named ConvLayer2a. Furthermore, we apply batch normalization to this convolutional layer.

- We also pass the output of ConcatLayer3 to a third layer, which is another convolutional layer, named ConvLayer2b. This layer has a set of filters that are 1 by 1, with the stride of 1, and we apply batch normalization. We take the output of the batch normalization and pass it to two additional convolutional layers, ConvLayer3a and 3b.

**Note:** In the model architecture diagram that we generated earlier, this is the point where we pass information, or branch it, to two different sets of filters. One set is a 3 by 3, and the other set is a 5 by 5. Both sets of filters have stride values of 1.

- Next, we concatenate the information from these two different sets of filters, the 5 by 5's and the 3 by 3's. We take the output of this concatenation (ConcatLayer4) and feed it into another convolutional layer (ConvLayer4) that contains 128 filters that are 3 by 3, with a stride of 2.
  The information coming into ConvLayer4 has a depth of 128, because the two convolutional layers that are concatenated in ConcatLayer4 each have 64 filters. The output of ConvLayer4, to which we apply batch normalization, is 4 by 4 by 128.

- We then pass the information from the batch normalization into a final concatenation layer, ConcatLayer5. This layer brings together all of the tensors and appends them together in preparation for the fully connected layer.

**Note:** In the model architecture diagram that we generated earlier, this concatenation layer occurs where we bring together information from the max pooling and the two 3-by-3 convolutional layers.

- Before we pass the information to a fully connected layer, we create ConvLayerLasta, a set of 1-by-1 convolutional filters that creates 500 output feature maps. So we're going to have a tensor that's 4 by 4 by 500.

- We then take this information and pass it forward to two fully connected layers, each with 540 mathematical neurons. And each fully connected layer has batch normalization applied to it.

- Finally, we add the output layer with the softmax activation function.
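The tensor sizes quoted in the walk-through can be checked with a short Python sketch that traces only the shapes through the network. It rests on one assumption that the text implies but does not state: each convolution or pooling layer pads its input so that the output width and height equal the input size divided by the stride, rounded up. The helper names are hypothetical, and batch normalization layers are omitted because they do not change the shape.

```python
from math import ceil

def conv_or_pool(shape, stride, n_filters=None):
    """Output shape of a convolution (n_filters given) or pooling layer.

    Assumes padding such that output size = ceil(input size / stride).
    """
    height, width, depth = shape
    out_depth = depth if n_filters is None else n_filters
    return (ceil(height / stride), ceil(width / stride), out_depth)

def concat(*shapes):
    """Concatenate along depth; width and height must match."""
    height, width = shapes[0][:2]
    assert all(s[:2] == (height, width) for s in shapes), "width and height must match"
    return (height, width, sum(s[2] for s in shapes))

data = (32, 32, 3)
conv1_abcd = [conv_or_pool(data, 1, 12) for _ in range(4)]   # ConVLayer1a-1d
conv1e = conv_or_pool(data, 2, 16)
conv1f = conv_or_pool(data, 4, 16)
concat1a = concat(*conv1_abcd)
pool1 = conv_or_pool(concat1a, 2)
concat2 = concat(pool1, conv1e)
pool2 = conv_or_pool(concat2, 2)
concat3 = concat(pool2, conv1f)
pool3 = conv_or_pool(concat3, 2)
conv2a = conv_or_pool(concat3, 2, 64)
conv2b = conv_or_pool(concat3, 1, 64)
conv3a = conv_or_pool(conv2b, 1, 64)
conv3b = conv_or_pool(conv2b, 1, 64)
concat4 = concat(conv3a, conv3b)
conv4 = conv_or_pool(concat4, 2, 128)
concat5 = concat(pool3, conv4, conv2a)
conv_last = conv_or_pool(concat5, 1, 500)

for name in ("concat1a", "pool1", "concat2", "pool2", "concat3",
             "conv4", "concat5", "conv_last"):
    print(name, globals()[name])
```

Under that assumption, the sketch reproduces the sizes given in the text (32 by 32 by 48, 16 by 16 by 64, 8 by 8 by 64, 4 by 4 by 128, and 4 by 4 by 500), and it also shows the depths that the text leaves implicit for ConcatLayer3 and ConcatLayer5.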

- Examine the ODS OUTPUT statement and the next PROC CAS step, which trains the model.

/*******************************/ /* Train the CNN model, ConVNN */ /*******************************/ ods output OptIterHistory=ObjectModeliter; proc cas; dlTrain / model='ConVNN' table={name='LargeImageDatashuffled', where='_PartInd_=1'} ValidTable={name='LargeImageDatashuffled', where='_PartInd_=2'} modelWeights={name='ConVTrainedWeights_d', replace=1} bestweights={name='ConVbestweights', replace=1} inputs='_image_' target='_label_' nominal={'_label_'} GPU=True optimizer={minibatchsize=80, maxepochs=60, algorithm={method='ADAM', lrpolicy='Step', gamma=0.6, stepsize=10, beta1=0.9, beta2=0.999, learningrate=.01}} seed=12345; quit;

Before fitting the model, an ODS OUTPUT statement saves the iteration history table (OptIterHistory) as the data set ObjectModeliter. We use this data set later to plot the training and validation error over the optimization history.

The PROC CAS step with the dlTrain action trains the model. Details about this step are provided below:

- First, we specify the model that we just built, ConVNN.
- The table argument refers to the training data, so we specify LargeImageDataShuffled where the partition indicator is equal to 1 for the 80% partition.
- For validTable, we specify the 20% partition by setting **_partind_** equal to a value of 2.
- We save the model weights, or the weights at the end of the optimization history, as ConVTrainedWeights_d, and the best weights regardless of the epoch number as ConVBestWeights.
- The inputs are set to _image_, and both the target and nominal arguments are set to _label_.
- The target is categorical, so it needs to be specified in the nominal argument.
- This is a fairly large model. In fact, it will probably have more than 5 million parameters. So in this case, it is a good idea to use a GPU. To leverage the GPU infrastructure, all we have to do is set GPU=TRUE.
- In the optimizer argument, each mini batch contains 80 images, and we train for 60 epochs. For the algorithm, we use the ADAM method, just like earlier, and we use a step learning rate policy. We use a gamma value of 0.6 and a step size of 10. That is, starting with a learning rate of 0.01, we multiply the learning rate by the gamma value every 10 epochs.
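The step learning rate policy can be written as a one-line formula, sketched here in Python. The function name is hypothetical, and the sketch assumes that epochs are counted from 0 and that the rate drops exactly at epochs 10, 20, and so on. The exact epoch at which dlTrain applies each drop is not stated in this demonstration.

```python
def step_learning_rate(epoch, base_rate=0.01, gamma=0.6, step_size=10):
    """Learning rate under a step policy: multiply by gamma every step_size epochs."""
    return base_rate * gamma ** (epoch // step_size)

# One rate per epoch for the 60 epochs of training.
schedule = [step_learning_rate(epoch) for epoch in range(60)]

# Print the rate at the start of each 10-epoch block.
for epoch in range(0, 60, 10):
    print(epoch, round(schedule[epoch], 7))
```

Under these assumptions, the rate stays at 0.01 for the first 10 epochs, falls to 0.006 for the next 10, and has been multiplied by 0.6 five times by the final block of epochs.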

- Run the three PROC CAS steps and the ODS OUTPUT statement discussed in the previous two steps. Look at the results.

In the results, scroll down to the Model Information table. Notice that this model has 32 total layers and approximately 5.1 million parameters. The Optimization History table for the deep learning model shows the validation error (that is, the validation misclassification rate). It begins at around 60%, but it seems to decrease rather nicely down to about 20%.
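The figure of approximately 5.1 million parameters can be sanity-checked by counting the convolution and fully connected weights from the layer definitions. The Python sketch below is an approximation under stated assumptions: the input depths of 80, 128, and 272 come from working through the concatenations (they are not printed in the program), layers without includeBias=False are assumed to carry one bias per filter or neuron, the output layer is assumed to have 10 neurons with biases, and batch normalization parameters are left out. The exact total reported by modelInfo may therefore differ slightly.

```python
def conv_weights(width, height, in_channels, n_filters, bias):
    """Weights in a convolution layer, plus one bias per filter if requested."""
    return width * height * in_channels * n_filters + (n_filters if bias else 0)

def fc_weights(n_inputs, n_neurons, bias):
    """Weights in a fully connected layer, plus one bias per neuron if requested."""
    return n_inputs * n_neurons + (n_neurons if bias else 0)

layers = {
    # First-stage filters read the 3-channel input and keep their biases.
    "ConVLayer1a": conv_weights(1, 1, 3, 12, True),
    "ConVLayer1b": conv_weights(3, 3, 3, 12, True),
    "ConVLayer1c": conv_weights(5, 5, 3, 12, True),
    "ConVLayer1d": conv_weights(7, 7, 3, 12, True),
    "ConVLayer1e": conv_weights(4, 4, 3, 16, True),
    "ConVLayer1f": conv_weights(6, 6, 3, 16, True),
    # Later layers specify includeBias=False.
    "ConVLayer2a": conv_weights(3, 3, 80, 64, False),      # input depth 64 + 16
    "ConVLayer2b": conv_weights(1, 1, 80, 64, False),
    "ConVLayer3a": conv_weights(3, 3, 64, 64, False),
    "ConVLayer3b": conv_weights(5, 5, 64, 64, False),
    "ConVLayer4": conv_weights(3, 3, 128, 128, False),     # input depth 64 + 64
    "ConVLayerLasta": conv_weights(1, 1, 272, 500, False), # input depth 80 + 128 + 64
    "FCLayer1": fc_weights(4 * 4 * 500, 540, False),
    "FCLayer2": fc_weights(540, 540, False),
    "outlayer": fc_weights(540, 10, True),                 # assumed: 10 classes with biases
}

total = sum(layers.values())
print(total, round(total / 1e6, 1))
```

Most of the parameters sit in the first fully connected layer, which connects the 4-by-4-by-500 tensor to 540 neurons.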

- Run the remaining code and look at the results.

The two SQL procedure steps create two macro variables that store the lowest misclassification values for the training and validation data, respectively. The SGPLOT procedure plots these errors across the epochs of the optimization.

/******************************************************************/ /* Store minimum training and validation error in macro variables */ /******************************************************************/ proc sql noprint; select min(FitError) into :Train separated by ' ' from ObjectModeliter; quit; proc sql noprint; select min(ValidError) into :Valid separated by ' ' from ObjectModeliter; quit; /********************/ /* Plot Performance */ /********************/ proc sgplot data=ObjectModeliter; yaxis label='Misclassification Rate' MAX=.9 min=0; series x=Epoch y=FitError / CURVELABEL="&Train" CURVELABELPOS=END; series x=Epoch y=ValidError / CURVELABEL="&Valid" CURVELABELPOS=END; run;

In the results, the training error has reduced close to zero and the validation error has plateaued above the training error.


Lesson 02, Section 3 Demo: Scoring (Inferencing) Holdout Images and Assessing Model Performance

In the previous demonstration, we trained a convolutional neural network from image data. In this demonstration, we use that model to score new data.

- Open the program named **DLUS02D01.sas**. **Note:** If you did not perform the steps of the previous demonstration in the current SAS Studio session, run the code from the beginning of the program through the first PROC SGPLOT step before you continue.

- Examine the PROC CAS step that scores the data and the PROC PRINT step that follows it. Then run those two steps. Look at the results.

/**************************************************************/ /* Score data with the DLSCORE action using the trained model */ /**************************************************************/ proc cas; dlScore / model='ConVNN' table={name='SmallImageDatashuffled', where='_PartInd_=1'} initWeights='ConVbestweights' casout={name='ScoredData', replace=1} copyVars='_Label_' ENCODENAME=TRUE gpu=True; quit; proc print data=mycas.ScoredData (obs=20); run;

The PROC CAS step uses the dlScore action to score new data. Notice the following:

- The model name is ConVNN, the one we built in the previous demonstration.
- The initweights argument specifies ConVBestWeights, the best weights from fitting the model that we saved earlier as a CAS table.
- Here, we score **SmallImageDataShuffled**.
- To save these predictions in a table named **ScoredData**, we use the casout argument.
- We also want to copy the target label into the **ScoredData** table so that we can compare the actual label to the predicted label. Setting encodename=true uses the target label in the variable names of the predictions. So this output data set will have ten prediction columns, one for every target class, where the variable names for the predictions correspond to the target label. **Note:** When encodeName=true, dlScore applies SAS Enterprise Miner naming conventions to the prediction variables.
- And just like with training a model, you can use a GPU to score new data.

The PROC PRINT step will print the first 20 observations (or first 20 predictions) of the **ScoredData** CAS table.

In the Score Information table generated by PROC CAS, notice that the model scored 8000 images with a misclassification error of about 22.5% on the holdout data.

In the PROC PRINT table, the **_label_** column contains the actual label, and the **I__label_** column contains the predicted classification. Notice the following:

- The first observation is actually an automobile. Scrolling to the right, we see that our model did, in fact, predict an automobile, which is good.
- The first misclassification occurs at observation 7. The actual image is an airplane, but the model predicted a ship. The predicted probability that this image was a ship is 0.75, so the model was overconfident in classifying this specific image.
  - The next largest predicted probability is that of an automobile, at 0.15.
  - The third largest prediction was the actual image, the airplane, at 0.08. At least the model predicted it was some kind of machinery and not a dog or a cat.

- To build a bar chart of the misclassification counts on the scored data for each target level, run the next three steps: a DATA step, a PROC SQL step, and a PROC SGPLOT step. Look at the results.

/***********************************/ /* Create misclassification counts */ /***********************************/ data work.MISC_Counts; set mycas.ScoredData; if trim(left(_label_)) = trim(left(I__label_)) then Misclassified_count=0; else Misclassified_count=1; run; /****************************************************/ /* Sum misclassification counts at the target level */ /****************************************************/ proc sql; create table work.AssessModel as select distinct _label_, sum(Misclassified_count) as number_MISC from work.MISC_Counts group by _label_; quit; /*****************************************************/ /* Plot each target level's misclassification counts */ /*****************************************************/ proc sgplot data=work.AssessModel; vbar _label_ / response=number_MISC; yaxis display=(nolabel) grid; xaxis display=(nolabel); run;

At the end of the results, a bar chart shows the number of misclassified images for each class. Notice that the model is struggling to classify images of cats and dogs but is doing well with ships and trucks.
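The logic of the DATA step and the PROC SQL step (flag each scored row as misclassified or not, then sum the flags by actual label) can be sketched in Python as follows. The scored rows are hypothetical. The airplane row reuses the probabilities quoted above for observation 7, with the remaining 0.02 assigned to truck purely for illustration, and the predicted label is taken to be the class with the largest predicted probability.

```python
from collections import Counter

def predicted_label(probabilities):
    """Return the class with the largest predicted probability."""
    return max(probabilities, key=probabilities.get)

def misclassified_by_label(scored_rows):
    """Count misclassified rows for each actual label."""
    counts = Counter()
    for actual, probabilities in scored_rows:
        predicted = predicted_label(probabilities)
        # Compare after stripping blanks, like trim(left(...)) in the DATA step.
        counts[actual.strip()] += int(predicted.strip() != actual.strip())
    return counts

# Hypothetical scored rows: (actual label, predicted probabilities by class).
scored = [
    ("automobile", {"automobile": 0.90, "ship": 0.06, "airplane": 0.04}),
    ("airplane", {"ship": 0.75, "automobile": 0.15, "airplane": 0.08, "truck": 0.02}),
    ("cat", {"dog": 0.55, "cat": 0.45}),
    ("cat", {"cat": 0.70, "dog": 0.30}),
]

counts = misclassified_by_label(scored)
error_rate = sum(counts.values()) / len(scored)
print(dict(counts), error_rate)
```

Dividing the total number of flagged rows by the number of scored rows gives the overall misclassification rate, the same kind of quantity that the Score Information table reports.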

- To save your work, run the two PROC CASUTIL steps at the end of the program. Look at the log.

/****************************/ /*Save the Model and Weights*/ /****************************/ proc casutil; save casdata='ConVNN' outcaslib='imagelib' casout='ConVNN.sashdat' replace; quit; proc casutil; save casdata='ConVbestweights' outcaslib='imagelib' casout='ConVbestweights.sashdat' replace; quit;

The first procedure saves the model (ConVNN), and the second saves the best weights from the model training process (ConVbestweights). Both are saved in the **imagelib** library that we connected to at the beginning of the program.

In the log, notice that these two CAS tables were saved as SASHDAT files in the location specified by the **imagelib** library. We will use these two tables in a future demonstration, for supervised transfer learning.


Lesson 03, Section 2 Demo: Exploring Text Data Using Text Mining Actions

In this demonstration and the next, we focus on predictive modeling with text, which uses a target variable to find patterns that emerge when the values of the target variable are analyzed against the text. Using the **CFPB_COMPLAINTS** SAS data set from the Consumer Financial Protection Bureau, we'll use deep learning models to predict whether the consumer will dispute a company's response to the consumer's complaint. Intuitively, the filed complaint might contain terms or language that indicate an inevitable dispute, regardless of the company response. For example, perhaps the complaint uses such angry or contemptuous language that the consumer will inevitably dispute the response. Or perhaps the complaint contains passive-aggressive language that indicates that the consumer is highly unlikely to dispute the response. Either way, a deep learning model can help us uncover the relationship between the target dispute and the input text complaint.

In this demonstration, we apply Text Analytics in SAS Viya to the data. We mine the complaints to learn from the raw text of the complaints.

- Open the program named
**DLUS03D01.sas**. Examine the two LIBNAME statements and the four DATA steps, and then run the steps.

/********************************/ /*Create Local and CAS Libraries*/ /********************************/ libname local "/home/student/LWDLUS/Data"; libname mycas cas; /*****************************/ /*Load Data From Local to CAS*/ /*****************************/ data mycas.cfpb_complaints; set local.cfpb_complaints; run; data mycas.cfpb_complaints_clean; set local.cfpb_complaints_clean; run; data mycas.cfpb_complaints_embed; set local.cfpb_complaints_embed; run; data mycas.stoplist; set local.stop_words; run;

The LIBNAME statements create the necessary libraries. The four DATA steps load the data sets for this demonstration into memory: **cfpb_complaints**, **cfpb_complaints_clean**, **cfpb_complaints_embed**, and **stoplist**. (You learn more about the stoplist later in this demonstration.)

- Run the PROC PRINT step, which prints the first few observations of the raw **CFPB_COMPLAINTS** data set. Look at the results.

/*************/ /*Text Mining*/ /*************/ proc print data=mycas.cfpb_complaints (obs=5); run;

The data contain consumer-submitted complaints and a binary variable named **Dispute** that has a value of 1 if the consumer disputed the company response and 0 otherwise. Towards the end of the demonstration, we'll use the text within the complaint variable to predict if a consumer will dispute the company's response to the complaint.

Notice that each complaint is the raw text filed by a consumer. It contains misspellings, punctuation, and colloquialisms, as you would expect in any language. The only alteration of the text is that some information has been replaced with four Xs to hide private information.

- Examine the DATA step that provides a quick clean of the data, and then run the step. Look at the results.

/*Quick cleaning of raw data*/ data mycas.cfpb_complaints; set mycas.cfpb_complaints; complaint = lowcase(compress(complaint,'ABCDEFGHIJKLMNOPQRSTUVWXYZ.!?1234567890 ', 'ki')); complaint = tranwrd(complaint, ' xxxx', ''); docid + 1; run;

In any text analytics project, it's necessary to clean the data before creating text features for predictive modeling. The level of text cleaning depends on the task at hand. For now, in this exploratory phase of the complaints text, we'll give the data a quick clean using a DATA step. Later, we'll give the data a heavier cleaning when we create deep learning features.

In the DATA step, notice the following:

- The LOWCASE function converts all letters in a string to lowercase. This avoids duplicate words due to casing when we find word counts.

- The COMPRESS function removes specified characters from a string. The first argument is the variable or string from which the specified characters are removed (in this case, the Complaint variable). The second argument lists the characters that the third argument, the modifier, acts on. Here, the second argument lists the characters to keep in the string: all letters, the period, the exclamation point, the question mark, numbers, and the space between words. The modifier consists of K, which requests that we keep the listed characters instead of removing them, and I, which ignores the case of the characters.

- The next statement uses the TRANWRD function to remove all instances of the anonymous 'XXXX' identifiers in the complaints.

- A sum statement adds a sequence variable named **docid** to the table. **docid** simply assigns a unique identifier to each document in the corpus, which will be needed when we parse the documents.
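
The cleaning functions can be tried on a single made-up string before they are run on the full table. The following is a minimal sketch: the table name clean_demo, the variable names raw and cleaned, and the sample text are assumptions for illustration and are not part of the course program.

```sas
/* Hypothetical one-row table used only to illustrate the cleaning logic */
data clean_demo;
   length raw cleaned $80;
   raw = 'I called XXXX Bank about my Loan#123!!';

   /* K keeps only the listed characters; I ignores case when matching them */
   cleaned = lowcase(compress(raw,
             'ABCDEFGHIJKLMNOPQRSTUVWXYZ.!?1234567890 ', 'ki'));

   /* Replace the masked identifier, as in the course program */
   cleaned = tranwrd(cleaned, ' xxxx', '');

   /* Sum statement: docid starts at 0 and increases by 1 for each row */
   docid + 1;
run;

proc print data=clean_demo;
run;
```

In this sketch, the # character is dropped by COMPRESS and the masked identifier no longer appears in the cleaned value. Note that TRANWRD substitutes a single blank when the replacement string has a length of zero, so the mask is replaced by a space rather than deleted outright.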

- Examine the next DATA step and the following PROC PRINT step. Then run these two steps and look at the results.

/*Look for specific terms*/ data mycas.lawyer (drop=newvar); set mycas.cfpb_complaints; newvar = find(complaint,'lawyer','i'); if newvar>0; run; proc print data=mycas.lawyer (obs=5); run;

This DATA step searches for all documents that contain the word *lawyer* and stores those documents in a new table named **lawyer**. The FIND function searches for a specific substring of characters within a character string. The first argument is the string or variable to search. The second is the substring to find. The third argument, I, is a modifier that instructs the function to ignore case. The function returns the position at which the substring first occurs in the input string, or 0 if the substring is not found. The subsetting IF statement therefore keeps a document in the new table only when the newvar variable is greater than zero.

The PROC PRINT step prints the total number of documents that the DATA step found with the word *lawyer* and a few observations from the new table. The results show that, in this data set, we found 427 documents with the word *lawyer*, which is a little more than 4% of our total data. You can read some of these complaints to see how they mention the term *lawyer*.
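
To see what FIND returns, it can help to run it on two made-up strings. This is a small sketch only; the table name find_demo and the sample sentences are assumptions for illustration.

```sas
/* Hypothetical two-row table that shows the value returned by FIND */
data find_demo;
   length complaint $60;

   complaint = 'My Lawyer sent the bank a letter.';
   pos = find(complaint, 'lawyer', 'i');   /* match starts at position 4 */
   output;

   complaint = 'The payment was late.';
   pos = find(complaint, 'lawyer', 'i');   /* no match, so FIND returns 0 */
   output;
run;

proc print data=find_demo;
run;
```

Because the returned position is greater than zero whenever the substring occurs, a subsetting IF on that value keeps exactly the rows that contain the term.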

- Examine the PROC CAS step that parses the data. Then run this step and look at the results.

/*Parse the data*/ proc cas; loadactionset 'textParse'; textParse.tpParse / table = 'cfpb_complaints' docid = 'docid' text = 'complaint' stemming = True nounGroups = False entities = 'none' tagging = False parseConfig = {name='config', replace=True} offset = {name='offset', replace=True}; quit;

The CAS procedure loads the textParse action set to parse the text using natural language processing techniques. In the tpParse action, we configure the parsing algorithm, as follows:

- The table argument specifies the data table.

- The docid argument specifies the sequence variable as the unique document ID (here, also named docid).

- The text argument specifies, in the documents, the character variable that contains the text to be processed (here, the **Complaint** variable in the data set).

- The stemming argument specifies whether stemming is to occur during parsing. When set to TRUE, terms are evaluated to see whether they belong to a common parent form and information is added to the position table. In other words, we reduce terms to their root words so they can be evaluated according to a common term. For example, *jumping*, *jumped*, and *jumps* should all be reduced to the word *jump* in order to consider all terms a single term.

- The nounGroups argument specifies whether noun group extraction is to occur during parsing. When set to TRUE, noun groups become additional rows on the position table. This is also reflected in the terms and parent tables. Here, we set nounGroups equal to FALSE to avoid extracting additional information regarding nouns in the documents. For example, a noun group could be 'the software company' as opposed to treating each of the three words separately.

- The entities argument specifies whether to extract entities in parsing. Here, we set entities equal to NONE to avoid extracting entities information. However, if the value is set to STD, the standard entities are written to the output. An *entity* is any of several types of information that SAS can distinguish from general text. SAS identifies the following standard entities:

- ADDRESS (postal address or number and street name)
- COMPANY (company name)
- CURRENCY (currency or currency expression)
- DATE (date, day, month, or year)
- INTERNET (email address or URL)
- LOCATION (city, country, state, geographical place, or region)
- MEASURE (measurement or measurement expression)
- ORGANIZATION (government, legal, or service agency)
- PERCENT (percentage or percentage expression)
- PERSON (person's name)
- PHONE (phone number)
- PROP_MISC (proper noun with an ambiguous classification)
- SSN (Social Security number)
- TIME (time or time expression)
- TIME_PERIOD (measure of time expressions)
- TITLE (person's title or position)
- VEHICLE (motor vehicle including color, year, make, and model)

- We set tagging equal to FALSE to avoid using parts of speech in the parsing process.

- We save the parse configuration and offset tables. The parseConfig argument specifies the name of the **config** CAS table to contain parsing configuration information. The offset argument specifies the name of the output CAS table to contain the position information about the occurrences of child terms in the document collection. Child terms are associations to parent terms. For example, in restaurant reviews, the term *price* could be a parent term and a related child term could be *fast-food*.
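
If you want to see what the optional extractions add, the same action can be rerun with noun groups and standard entities switched on. The following is a sketch under stated assumptions: the output table names config_std and offset_std are hypothetical and are chosen only so that the tables created by the course program are not overwritten.

```sas
/* Variation on the course step: turn on noun groups and standard entities */
proc cas;
   textParse.tpParse /
      table       = 'cfpb_complaints'
      docid       = 'docid'
      text        = 'complaint'
      stemming    = True
      nounGroups  = True          /* extract noun groups as additional rows */
      entities    = 'std'         /* write the standard entities            */
      tagging     = False
      /* hypothetical output names, so config and offset are kept intact */
      parseConfig = {name='config_std', replace=True}
      offset      = {name='offset_std', replace=True};
quit;
```

This variation is expected to produce a larger position table than the original settings, so it is best tried only when the extra information is needed.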

The results table for textParse.tpParse lists the offset and config output tables, which are described below:

- The config table has only one row and holds only the hyperparameters of the parsing algorithm. For example, it shows that the default language is English and then our decisions for using stemming, tagging, noun groups, and entities in the parsing process.
- The offset table, which has more than 2 million rows, is massive compared to the original 10,000-row table. It holds positional information for all the parsed terms in our input table. Parsing is only a first step though. We can use this offset table to accumulate parsing results in the documents.
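
To look at the two output tables directly, you can print them through the CAS libref. This short sketch assumes that the **mycas** libref points to the caslib in which the action created the config and offset tables, as the later steps of this program assume for the terms table.

```sas
/* The config table has a single row, so print it in full */
proc print data=mycas.config;
run;

/* The offset table is very large, so print only the first few rows */
proc print data=mycas.offset (obs=5);
run;
```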

- Examine the next PROC CAS step, which uses the tpAccumulate action to aggregate the offset information. Then run this step and look at the results.

/*Accumulate terms*/ proc cas; textParse.tpAccumulate / stopList = 'stoplist' stemming = True tagging = False reduce = 1 offset = 'offset' showDroppedTerms = False parent = {name='parent', replace=True} child = {name='child', replace=True} terms = {name='terms', replace=True}; table.fetch / table='terms', to=5; quit;

In this PROC CAS step, notice the following:

- The stopList argument specifies the input CAS table that contains the terms to exclude from the analysis. If it is specified, the table must have the term variable. A role variable is optional. At the beginning of this demonstration program, we loaded a stop word list as 'stoplist' onto the CAS server. We pass that stop list to the action. This list is based on the System for the Mechanical Analysis and Retrieval of Text (or SMART for short) that was developed at Cornell University. There are several stop lists in the literature that you can use, but this is one of the more commonly used for natural language processing. This list gives us 571 common English terms to exclude from the analysis.

- As in the previous action, we set stemming to TRUE and tagging to FALSE.

- The REDUCE argument specifies the minimum number of documents in which a term should occur in order to be retained by the action. So setting REDUCE equal to 1 means that if the term shows up in any document, its information is retained.

- We pass the offset table created previously to the action.

- The showDroppedTerms argument specifies whether to include terms that have a keep status of N in the **OUTTERMS** output table. Here, we set showDroppedTerms to FALSE to exclude dropped terms from the output table.

- We save the parent, child, and terms output tables.

- Using the fetch action within the CAS procedure, we can view the first five terms of the terms table that we're saving.

In the results, the first table shows information about the three CAS tables: terms, parent, and child. The parent and child tables are compressed representations of the term-by-document matrix with raw counts and weighted counts, respectively. The action uses these tables behind the scenes to distinguish and account for terms, as well as conduct singular value decomposition to find topics.

The terms table provides summary information about the terms in the document collection. The second table in the results shows selected rows from the terms table. For example, notice that the terms table shows the frequency of each term in the corpus and the number of documents it appeared in, as well as how the term is mapped to the parent table. The attribute column describes the characters that compose the term. For example, **Alpha** indicates that all the characters are letters, and **Mixed** is a combination of letters, numbers, punctuation, and spaces.

The terms table can contain duplicate terms because terms can be identified as multiple parts of speech. For example, the term *fair* can be either a noun or an adjective as shown in the following sentences: "People go to the county fair in October" versus "The team had a fair referee tonight." SAS identifies the following parts of speech:

- Abbr (abbreviation)
- Adj (adjective)
- Adv (adverb)
- Aux (auxiliary or modal)
- Conj (conjunction)
- Det (determiner)
- Interj (interjection)
- Noun (noun)
- Num (number or numeric expression)
- Part (infinitive marker, negative participle, or possessive marker)
- Pref (prefix)
- Prep (preposition)
- Pron (pronoun)
- Prop (proper noun)
- Punct (punctuation)
- Verb (verb)
- VerbAdj (verb adjective)

- Examine the next four steps: a DATA step, the PROC SQL step, another DATA step, and a PROC SGPLOT step. Then run those steps and look at the results.

/*Find unique terms*/ data terms_unique; set mycas.terms; by _Term_; if last._Term_; run; /*Order unique terms*/ proc sql; CREATE TABLE top_terms AS SELECT _Term_, _Frequency_ FROM terms_unique ORDER BY _Frequency_ DESC; quit; data top_terms; set top_terms (obs=5); run; /*Plot the top 5 most used terms*/ proc sgplot data=top_terms; vbar _term_ / response=_frequency_; run;

The first DATA step subsets the terms table to include only the unique terms. We name the output table **terms_unique** and pull terms from the **terms** table. For each group of rows that share the same term, we keep only the last row, without regard to the part of speech. So for a term that appears with several parts of speech, the row that is kept might be the noun entry rather than the verb entry, or vice versa.

The SQL step queries the **terms_unique** table, sorts the terms according to frequency within the corpus, and saves this information in a table called **top_terms**.

The next DATA step keeps only the top five most-used terms in the corpus.

To go one step further, we use PROC SGPLOT to create a bar chart of these top terms so that we can view the information graphically.

In the results, the bar chart shows that the terms *loan* and *payment* are each used more than 20,000 times in the data set. Rounding out the top five terms, the other three are *mortgage*, *pay*, and *call*.
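
The effect of keeping only the last row for each term can be seen on a small made-up table. In this sketch, the table names (terms_toy, terms_unique_toy, top_terms_toy), the _Role_ values, and the counts are all assumptions for illustration. The explicit PROC SORT is needed here because the toy table is an ordinary SAS data set, and the OUTOBS= option is used as a shortcut for keeping the top rows.

```sas
/* Hypothetical term table: some terms appear with two parts of speech */
data terms_toy;
   length _Term_ $12 _Role_ $4;
   input _Term_ $ _Role_ $ _Frequency_;
   datalines;
loan Noun 120
pay Verb 90
pay Noun 15
call Verb 60
fair Adj 8
fair Noun 3
;
run;

/* BY-group processing on a regular data set requires sorted input */
proc sort data=terms_toy;
   by _Term_;
run;

/* Keep the last row of each term, whatever its part of speech */
data terms_unique_toy;
   set terms_toy;
   by _Term_;
   if last._Term_;
run;

/* Sort by frequency and keep the three most frequent terms */
proc sql outobs=3;
   create table top_terms_toy as
   select _Term_, _Frequency_
   from terms_unique_toy
   order by _Frequency_ desc;
quit;
```

With these made-up counts, the row kept for *pay* is the noun row (15) rather than the verb row (90), so *pay* ranks below *call*. If the total usage of a term matters more than any single part of speech, summing the frequencies within each term would be a reasonable alternative.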

So far, you have seen how to quickly clean the data, find specific terms, parse raw text, accumulate parsing results, find unique terms, and explore the text graphically. Next, we'll start the model building and prediction phase of our demonstration.

## Deep Learning Using SAS® Software

Lesson 03, Section 2 Demo: Supervised Classification Using an RNN

The previous demonstration focused on the first phase of text analytics, exploratory text mining, in order to uncover patterns in the complaints data and summarize the document collection. In this demonstration, we focus on predictive modeling with text. This process uses a target variable to find patterns that emerge when the values of the target variable are analyzed against the text. Using deep learning models, we'll predict whether the consumer disputed the company's response to the complaint. It's reasonable to expect that a filed complaint might contain terms that indicate an inevitable dispute, regardless of the company response. For example, perhaps the complaint verbiage displays such anger or contempt that the consumer will inevitably dispute the response. On the other hand, a complaint might contain passive-aggressive language that indicates the consumer is highly unlikely to dispute the response. Either way, a deep learning model can help us uncover the relationship between the target dispute and the input text complaint.

- Open the program named **DLUS03D01.sas**. **Note:** If you did not perform the steps of the previous demonstration in the current SAS Studio session, run the first part of the code (from the beginning of the program through the PROC SGPLOT step that plots the five most-used terms) before you continue.

- Examine the PROC PRINT step for the **CFPB_COMPLAINTS_EMBED** table, and then run the step. Look at the results.

/*View Word Representations*/ proc print data=mycas.cfpb_complaints_embed (obs=5); run;

Before building our model, we need to convert our text into numeric predictive modeling features. Word embedding is the process of converting text into real-valued vectors. There are several methods for creating these features, such as one-hot encoding or singular value decomposition on the term-by-document matrix. We use the Global Vectors for Word Representation algorithm (or GloVe for short). Remember that GloVe is an unsupervised method for generating word embeddings from word-to-word co-occurrence statistics from a corpus. You can download pretrained word embeddings developed on extremely large collections of documents such as Wikipedia and Twitter data and use their vector representations in your deep learning model to map the text to real-valued vectors. For simplicity, the GloVe algorithm was applied to the complaints data using open-source algorithms to obtain vector representations for only the words present in the corpus. The word embeddings were then saved in the **CFPB_complaints_embed** SAS data set.

**Note:** Another approach is to download pretrained word embeddings from extremely large corpora instead of creating a unique set (https://nlp.stanford.edu/projects/glove/).

This PROC PRINT step prints the word embeddings table, which we use going forward. In the results, you can see that this table is 15930 rows and 101 columns. The first column contains the term for each unique word in the corpus. The other 100 columns contain the multidimensional vector representation of each term. Note that the dimension of the vectors is your choice. A larger dimension will potentially contain additional useful information but comes at a computational cost.

- Examine the DATA step under the comment about word embedding, and then run the step.

/*Word Embeddings*/ data embed_sample; set mycas.cfpb_complaints_embed; if vocab_term in ('credit','tax','loan','debt','default', 'unfair','difficult','conflict','fight','harm'); run;

Although the word embeddings are used as features for our deep learning model, they also enable us to visualize the terms in a multidimensional space. Terms that cluster together share similar co-occurrence patterns. This DATA step subsets the terms credit, tax, loan, debt, default, unfair, difficult, conflict, fight, and harm into a new table named **embed_sample**.

- Examine the PROC SGPLOT step that graphs the embed_sample table, and then run the step. Look at the results.

proc sgplot data=embed_sample; scatter x=x1 y=x2 / datalabel=vocab_term; run;

This PROC SGPLOT step creates a scatter plot for the 10 terms using only the first two dimensions.

In the scatter plot, notice that the terms default and unfair are the closest to each other, which means that consumers tend to use the term unfair when describing a default experience in their complaint. Likewise, the terms difficult and loan are also close to one another in these two dimensions, providing a similar interpretation. On the other hand, the terms debt and conflict are far apart. The term debt is in the first quadrant, and conflict is in the third quadrant of the plot, which means that these terms rarely co-occur. We could create another graphic with the same words but using another dimension to see how they cluster in a different dimension, or we could plot a different set of terms to interpret their vector representations.
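
Plotting the same terms in a different pair of dimensions takes only a small change to the step. This sketch assumes that the remaining embedding columns follow the same naming pattern as x1 and x2 (that is, x3, x4, and so on); check the column names in the PROC PRINT output before you run it.

```sas
/* Same 10 terms, plotted in the (assumed) third and fourth dimensions */
proc sgplot data=embed_sample;
   scatter x=x3 y=x4 / datalabel=vocab_term;
run;
```

Terms that sit close together in one pair of dimensions can be far apart in another, so no single two-dimensional view should be read as the full picture of a 100-dimensional representation.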

- Examine the PROC PARTITION step, and then run the step.

/****************/ /*Model Building*/ /****************/ /*Partition the data*/ proc partition data=mycas.cfpb_complaints_clean samppct=80 samppct2=10 seed=802 partind; output out=mycas.cfpb_complaints_clean; run;

Before building the predictive model, we need to partition the data. For this model, we need three partitions: one for training, one for validation (to tune the model), and one for a final assessment. We'll use 80% of the data for training, and 10% each for validation and testing. The PARTIND option adds a partition indicator variable (**_PartInd_**) to the output table. Because the specified output CAS table name is the same as the input CAS table name, the step simply adds the partition indicator to the existing table.

Notice also that we are now using the data set **CFPB_COMPLAINTS_CLEAN**. This is a cleaned version of the original data set. **CFPB_COMPLAINTS_CLEAN** contains the same **Dispute** variable as before, and for the consumer-submitted complaints, stop words and non-letters were removed, terms were stemmed to their root, and all tokens were changed to lowercase.

- Examine the PROC FREQ step, and then run the step. Look at the results.

proc freq data=mycas.cfpb_complaints_clean; tables dispute _partind_ dispute*_partind_; run;

We use this PROC FREQ step to find the frequency of each level of the target variable, **Dispute**.

In the results, notice the following:

- The first table shows that the levels are almost equal. At a **Dispute** value of 1, there are 4982 observations, which means that almost 5000 people in this 10,000-observation sample disputed the company's response to the complaint.
- The Partition Indicator table shows the frequency of the levels of the partition indicator. As you can see, a value of 1 has 8000 observations, which means that level 1 is the training partition. The other two levels each have 1000 observations, or 10% of the data.
- Finally, the two-way cross tabulation table of the target and the partition indicator shows the number of disputes and non-disputes in each data partition.

- Examine the PROC CAS step that shuffles the data, and then run the step.

/*Shuffle data*/ proc cas; table.shuffle / table = 'cfpb_complaints_clean' casout = {name='cfpb_complaints_clean', replace=True}; quit;

Shuffling the data before building a deep learning model is very important, because if the data are sorted by the target level, the model is likely to overestimate the probability of the target for the first level passed to the algorithm. For example, out of the 8000 training observations, if the first 4000 observations have the target level 0, the deep learning model will overestimate the probability that consumers do not dispute the company's response to the complaint. And vice versa: If the first 4000 observations of the training data have the target level 1, the model will overestimate the probability that consumers will dispute the claim.

In the shuffle action, we need only to specify the table and the same table name in the casOut argument to replace the table with the shuffled rows.

## Building a Recurrent Neural Network

- Run the PROC CAS step that loads the deepLearn action set.

/*Build a RNN*/ proc cas; loadactionset "deeplearn"; quit;

- Examine the PROC CAS step that builds the model, and then run the step. Look at the results.

proc cas; deepLearn.buildModel / model = {name='rnn',replace=True} type = 'RNN'; deepLearn.addLayer / model = 'rnn' layer = {type='input'} replace=True name = 'data'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=30, act='sigmoid', init='xavier', rnnType='rnn', outputType='samelength'} srcLayers = 'data' replace=True name = 'rnn1'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=30, act='sigmoid', init='xavier', rnnType='rnn', outputType='encoding'} srcLayers = 'rnn1' replace=True name = 'rnn2'; deepLearn.addLayer / model = 'rnn' layer = {type='output', act='auto', init='xavier', error='auto'} srcLayers = 'rnn2' replace=True name = 'output'; deepLearn.modelInfo / model='rnn'; quit;

Remember that recurrent neural networks are generally used for modeling sequence data such as text strings or time series because the order of words in a sentence and the temporal structure of time data matter. In this demonstration, we'll build a few different recurrent neural networks on the consumer complaints data to predict the target, **Dispute**. We'll start by building a standard recurrent neural network that has two layers and 30 neurons in each layer.

The buildModel action initializes the network. The model argument specifies a name for the neural network. The type argument specifies the neural network type (in this case, a recurrent neural network, or RNN).

Now we use the addLayer action to iteratively add layers to the model and customize the network architecture. By default, you need an input and output layer. The other layers define the type and complexity of your deep learning neural network.

The first layer is, of course, the input layer. The addLayer action specifies the following arguments:

- The model argument specifies the same name as in the buildModel action.
- In the layer argument, we'll set the type to INPUT.
- As a best practice, I like to use the replace=TRUE option to avoid using duplicate layer names in the network.
- Finally, we'll name this layer **data** so that we know how to connect it to subsequent layers in the model.

The next layer is the first recurrent hidden layer. The following arguments are specified:

- The layer argument contains all the hyperparameters for the hidden layer, as follows:
  - Here n represents the number of neurons in the hidden layer. We'll use only 30 in this example to be computationally efficient for the demonstration.
  - The act argument specifies the activation function for the layer. Instead of using the default hyperbolic tangent function (TANH), we'll use the sigmoid activation function (SIGMOID) to try to avoid neuron saturation.
  - To initialize the weights, we'll use the Xavier distribution to try to initialize them outside regions of saturation.
  - The rnnType argument here is set to RNN, but it could also be GRU (gated recurrent unit) or LSTM (long short-term memory unit), two models that we'll discuss later.
  - The outputType argument for this RNN layer is set to SAMELENGTH, which means that this layer will generate a sequence with the same length as the input. That is, the sequence of inputs is converted into a sequence of hidden layer values.
- The source layer specifies the layer or layers fed into the current layer. In this case, we are using only the previous layer, which is the input layer. And finally, we'll name this layer **rnn1**.

The next hidden layer will be nearly equivalent in its hyperparameters. Notice the following in this addLayer action:

- Unlike the previous hidden layer, the output type here is set to ENCODING. Encoding can be thought of as a many-to-one transformation, in that we are taking the sequence and converting it into a single value in order to predict the output. The output type depends on the problem at hand and how the inputs are used to model the RNN output. For example, we could use a many-to-many mapping to model language translation because the number of words needed to speak a phrase in one language might require a different number of words in another language. In this case, we're converting our sequence into a binary prediction.
- The source layer is the previous recurrent hidden layer.
- The name for this layer is **rnn2**.
- We'll keep this neural network relatively small with only 30 neurons in each of only two hidden layers.

The last layer is the output layer. Notice the following settings in the addLayer action:

- For this model, the output layer connects only to the **rnn2** source layer.
- For simplicity, we can set the output activation and error functions to AUTO. AUTO specifies to set the error function according to the model. For binary classification, the error function will resolve to ENTROPY, and for the activation function, it resolves to SOFTMAX. You can also use the auto feature in the other layers as good starting hyperparameters.

Finally, as a best practice, let's use the modelInfo action to ensure that we've correctly built the desired model.

In the results, the Model Information table shows the structure of this recurrent neural network. It has four total layers: one input layer, two hidden recurrent layers, and one output layer. Again, this is a relatively simple model that we are using to show the functionality and train the model quickly. You can get creative and customize these models as you see fit by adding many hidden layers, adding different types of hidden layers, adding many hidden units, using skip layers, and so on.
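
As one example of such a customization, the sketch below builds the same four-layer structure with gated recurrent units instead of standard recurrent units. It is an illustration only: the model name gru is hypothetical, act='auto' is used in the hidden layers as a starting point, and the deepLearn action set is assumed to be loaded already.

```sas
/* Same architecture as above, with rnnType switched to GRU */
proc cas;
   /* 'gru' is a hypothetical model name chosen for this sketch */
   deepLearn.buildModel / model = {name='gru', replace=True} type = 'RNN';

   deepLearn.addLayer / model = 'gru'
      layer = {type='input'}
      replace = True name = 'data';

   /* First hidden layer returns a sequence of the same length as the input */
   deepLearn.addLayer / model = 'gru'
      layer = {type='recurrent', n=30, act='auto', init='xavier',
               rnnType='gru', outputType='samelength'}
      srcLayers = 'data' replace = True name = 'rnn1';

   /* Second hidden layer encodes the sequence into a single representation */
   deepLearn.addLayer / model = 'gru'
      layer = {type='recurrent', n=30, act='auto', init='xavier',
               rnnType='gru', outputType='encoding'}
      srcLayers = 'rnn1' replace = True name = 'rnn2';

   deepLearn.addLayer / model = 'gru'
      layer = {type='output', act='auto', init='xavier', error='auto'}
      srcLayers = 'rnn2' replace = True name = 'output';

   /* Confirm the structure before training */
   deepLearn.modelInfo / model = 'gru';
quit;
```

Keeping the layer names and the layer sizes the same as in the standard RNN makes it easier to compare the two models later, because only the recurrent unit type differs.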

- Examine the PROC CAS step that trains the model, and then run the step. Look at the results.

proc cas; deepLearn.dlTrain / modelTable = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 1'} validTable = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 2'} modelWeights = {name='rnn_trained_weights', replace=True} target = 'dispute' inputs = 'complaint' texts = 'complaint' nominals = 'dispute' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} seed = '649' optimizer = {miniBatchSize=100, maxEpochs=30, algorithm={method='adam', beta1=0.9, beta2=0.999, learningRate=0.001, gamma=0.5, lrpolicy='step', stepsize=15, clipGradMax=10, clipGradMin=-10}}; quit;

This PROC CAS step fits the model to our data using the dlTrain action. In the dlTrain action, notice the following:

- The table and validTable arguments specify the training data and validation data, respectively, using the partition indicator (**_PartInd_**).

- The target is the binary variable
**Dispute**.

- The texts argument specifies the character variables to treat as raw text. These variables must be specified in the inputs argument. Here, the inputs and texts arguments both specify only the **Complaint** variable.

- The nominals argument specifies only the target,
**Dispute**.

- In the textParms argument, to map the embeddings to the terms in the consumer complaints, the initInputEmbeddings argument specifies the pretrained word embedding data table (**CFPB_COMPLAINTS_EMBED**).

- The seed argument sets a seed.

- The modelTable argument specifies the in-memory table that stores the model definition. Here, the modelTable argument specifies the name used earlier in the buildModel action (rnn).

- The modelWeights argument saves the weights at the last epoch of the optimization as rnn_trained_weights (in the name argument). You can also use the bestWeights argument to save the best weights regardless of the epoch number in the optimization.

- Finally, the optimizer argument specifies the hyperparameters of the optimization routine, as described below:

- The miniBatchSize argument specifies the number of observations to use for updating the weights in the stochastic gradient descent algorithm. You can use more observations to approach the standard batch gradient descent, which uses all observations, or fewer for the original stochastic gradient descent, which uses only a single row of the data. It's ideal to use all the data to update the model, but for large data sets, this isn't computationally feasible.

- To keep computation time down, we specify only 30 epochs in the maxEpochs argument.

- The algorithm argument specifies the following:

- We first specify the optimization method. The Adam method applies adjustments to the step size for each individual model parameter in an adaptive manner by approximating second-order information about the objective function based on the previous minibatch gradients.

- We're specifying a small learning rate to inch slowly toward convergence, but the Adam algorithm does adapt these movements based on our specified hyperparameters.

- The use of Adam also requires us to specify the parameters beta1 and beta2, which are the exponential decay rates for the first and second moment estimates of the adaptive approximation. It's a best practice to use values close to 1 so that past gradients stay active in the moment estimates for longer periods. The beta values themselves are constants, but their powers (beta1^t and beta2^t, where t counts the weight updates) shrink toward zero as training proceeds. As a result, the bias correction that the Adam algorithm applies to the moment estimates fades after many updates.

- I also prefer to use the clipGradMax (maximum gradient value) and clipGradMin (minimum gradient value) arguments to bound the gradient and prevent potentially large weight movements between epochs. All gradients that are greater than the specified clipGradMax value are set to the specified value. Likewise, all gradients that are less than the specified clipGradMin value are set to the specified value.

- Finally, we use the step learning rate policy (lrPolicy='step'), which reduces the learning rate by a fixed factor after every fixed number of epochs.

**Note:** Other values of lrPolicy include the following:
- FIXED specifies a fixed learning rate.
- INV specifies to set the learning rate after each epoch according to the initial learning rate, the value of the gamma parameter, and the value of the power parameter. The rate is calculated as learningRate*(1+gamma*current epoch)^(-power).
- MULTISTEP specifies to set the learning rate after each of the epochs that are specified in the steps parameter. The learning rate is multiplied by the gamma parameter value.
- POLY specifies to set the learning rate after each epoch according to the initial learning rate, the maximum number of epochs, and the value of the power parameter. The rate is calculated as learningRate*(1-current epoch / maxEpochs)^power.

- Here (in learningRate), we have initialized our learning rate to a small value of 0.001.

- For the step learning rate policy, we have set stepSize=15 and the gamma argument to 0.50. This means that the learning rate will be cut by 50% every 15 epochs in the optimization. Because we are using only 30 max epochs in this example, our learning rate of 0.001 will be cut by 50% only once. This allows the optimization to make larger steps during the beginning of the training routine and then smaller steps as it homes in on parameter convergence.
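As a rough illustration of the algorithm settings listed above, the following Python sketch applies the textbook Adam update to a single weight, after first bounding the gradient in the way that clipGradMin and clipGradMax are described. It is not the deepLearn implementation: the function names, the `eps` constant, and the toy quadratic objective are assumptions made for this example.

```python
import math

def clip(g, lo=-10.0, hi=10.0):
    """Bound a gradient to [lo, hi], in the spirit of clipGradMin/clipGradMax."""
    return max(lo, min(hi, g))

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single weight; returns (w, m, v).

    t is the 1-based update count. eps is a small constant (an assumption
    here) that keeps the division stable.
    """
    g = clip(g)
    m = beta1 * m + (1 - beta1) * g        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction, fades as t grows
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
```

On the first update the bias-corrected ratio is close to 1 in magnitude, so the weight moves by roughly the learning rate regardless of how large the gradient is. This is one reason a small learning rate leads to small, steady steps.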


The Optimization History table shows the epochs, learning rate, and also the loss and error for the training data and validation data. In this case, the training error and validation error reduced similarly over the 30 epochs. This indicates that we did not overfit the training data, and we don't need to apply any type of regularization here. With more time, we should try more epochs and alter the optimization hyperparameters to further reduce the error.
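The learning rate column of that table follows the step policy described earlier. Here is a minimal Python sketch of the schedule, assuming epochs are counted from 0 and the rate changes at the start of every stepSize-th epoch (the exact indexing that dlTrain uses is an assumption). The two helper functions for the other formulas quoted in the note on lrPolicy are included for comparison. All function names are hypothetical.

```python
def step_lr(base_lr, epoch, step_size=15, gamma=0.5):
    """STEP policy: multiply the rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def inv_lr(base_lr, epoch, gamma, power):
    """learningRate * (1 + gamma * epoch) ^ (-power), as quoted in the note."""
    return base_lr * (1 + gamma * epoch) ** (-power)

def poly_lr(base_lr, epoch, max_epochs, power):
    """learningRate * (1 - epoch / maxEpochs) ^ power, as quoted in the note."""
    return base_lr * (1 - epoch / max_epochs) ** power

# Settings from the dlTrain call: learningRate=0.001, gamma=0.5,
# stepSize=15, maxEpochs=30 (epochs indexed 0..29 in this sketch).
schedule = [step_lr(0.001, epoch) for epoch in range(30)]
```

Under these assumptions, `schedule` holds 0.001 for the first 15 epochs and 0.0005 for the remaining 15, which matches the single 50% cut described for this run.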

- Examine the PROC CAS step that is used for scoring, and then run the step. Look at the results.

proc cas; deepLearn.dlScore / model = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 0'} initWeights = 'rnn_trained_weights' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} copyVars = 'dispute' casout = {name='rnn_scored', replace=True}; quit;

To get a final assessment of this model, we'll use the dlScore action to score the test data set. The dlScore action specifies the following arguments:

- The table argument specifies the test partition using the partition indicator.
- The model is the same as before.
- The initWeights argument specifies an in-memory table that contains the saved weights from our model. These weights are used to initialize the model.
- The copyVars argument specifies the variables to transfer from the input table to the output table. Here, this saves not only the scored data to the output table but also the target so that we can compare the prediction later.
- The textParms argument specifies the word embedding table used in the dlTrain action.
- The casout argument saves the scored data as **rnn_scored**.

In the results, the Score Information table shows that the model misclassification error is 46%. This means that this model gives very little gain in accuracy compared to a random guess model. However, we can try running the model longer, using more hidden layers and neurons, and changing the hyperparameters of the optimization routine.

Instead of altering this model, let's build a slightly deeper model.

## Building a Deeper Model

- Examine the PROC CAS step for building a deeper model, and then run the step. Look at the results.

/*Build a deeper RNN*/ proc cas; deepLearn.buildModel / model = {name='rnn',replace=True} type = 'RNN'; deepLearn.addLayer / model = 'rnn' layer = {type='input'} replace=True name = 'data'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=25, act='sigmoid', init='xavier', rnnType='rnn', outputType='samelength'} srcLayers = 'data' replace=True name = 'rnn1'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=25, act='sigmoid', init='xavier', rnnType='rnn', outputType='samelength'} srcLayers = 'rnn1' replace=True name = 'rnn2'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=25, act='sigmoid', init='xavier', rnnType='rnn', outputType='encoding'} srcLayers = 'rnn2' replace=True name = 'rnn3'; deepLearn.addLayer / model = 'rnn' layer = {type='output', act='auto', init='xavier', error='auto'} srcLayers = 'rnn3' replace=True name = 'output'; deepLearn.modelInfo / model='rnn'; quit;

The first two actions in this PROC CAS step are equivalent to the first RNN model that we built. However, this deeper model has three hidden layers with 25 neurons in each layer. Also, all three layers have the same type, activation, initialization, and RNN type. However, notice that the first two hidden layers have an output type of SAMELENGTH and the final hidden layer is an encoding. The output layer is the same as before. Again, we use the modelInfo action to print the model architecture.

As expected, the Model Information table shows that this model has five total layers: one input layer, three hidden layers, and an output layer.

- Examine the PROC CAS step for training the model, and then run the step. Look at the results.

proc cas; deepLearn.dlTrain / modelTable = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 1'} validTable = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 2'} modelWeights = {name='rnn_trained_weights', replace=True} target = 'dispute' inputs = 'complaint' texts = 'complaint' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} nominals = 'dispute' seed = '649' optimizer = {miniBatchSize=100, maxEpochs=30, algorithm={method='adam', beta1=0.9, beta2=0.999, learningRate=0.001, gamma=0.5, lrpolicy='step', stepsize=15, clipGradMax=10, clipGradMin=-10}}; quit;

This PROC CAS step trains the model with the same code as before. We haven't changed any optimization hyperparameters: We are still training for 30 epochs using the 'step' learning rate policy.

In the Model Information table, the total number of model parameters is again under 6000, even though we used more neurons in this model. The Optimization History table shows that the training and validation error reduce similarly over the 30 epochs, but the errors are high nonetheless.

- Examine the PROC CAS step that scores the test data, and then run the step. Look at the results.

proc cas; deepLearn.dlScore / model = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 0'} initWeights = 'rnn_trained_weights' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} copyVars = 'dispute' casout = {name='rnn_scored', replace=True}; quit;

In this PROC CAS step, we run the dlScore action again to get the misclassification error percentage on the test data. The error for this deeper model is nearly equivalent to the first RNN we ran earlier, indicating no improvement.

Let's try building one more RNN before moving on to discuss gated recurrent neural networks.

## Building a Bidirectional Model

- Examine the PROC CAS step that builds a bidirectional RNN, and then run the step. Look at the results.

/*Build a bidirectional RNN*/ proc cas; deepLearn.buildModel / model = {name='rnn',replace=True} type = 'RNN'; deepLearn.addLayer / model = 'rnn' layer = {type='input'} replace=True name = 'data'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=25, act='sigmoid', init='xavier', reverse=True, rnnType='rnn', outputType='samelength'} srcLayers = 'data' replace=True name = 'rnn1'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=25, act='sigmoid', init='xavier', rnnType='rnn', outputType='samelength'} srcLayers = 'rnn1' replace=True name = 'rnn2'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=25, act='sigmoid', init='xavier', rnnType='rnn', outputType='encoding'} srcLayers = 'rnn2' replace=True name = 'rnn3'; deepLearn.addLayer / model = 'rnn' layer = {type='output', act='auto', init='xavier', error='auto'} srcLayers = 'rnn3' replace=True name = 'output'; deepLearn.modelInfo / model='rnn'; quit;

Let's extend the deeper recurrent neural network model we just built to a bidirectional model. The code to build this bidirectional model is identical to the previous model, except for the addition of reverse=TRUE in the first hidden layer. Remember that this results in an additional hidden layer in the model, which has the information flow in the reverse direction. I'll run the code.

The Model Information table still indicates that the model has only five total layers.

- Examine the PROC CAS step that fits the bidirectional RNN, and then run the step. Look at the results.

proc cas; deepLearn.dlTrain / modelTable = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 1'} validTable = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 2'} modelWeights = {name='rnn_trained_weights', replace=True} target = 'dispute' inputs = 'complaint' texts = 'complaint' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} nominals = 'dispute' seed = '649' optimizer = {miniBatchSize=100, maxEpochs=30, algorithm={method='adam', beta1=0.9, beta2=0.999, learningRate=0.001, gamma=0.5, lrpolicy='step', stepsize=15, clipGradMax=10, clipGradMin=-10}}; quit;

This step uses the same dlTrain action code as before to fit the model.

In this case, the weights for the forward and backward hidden layers, represented by the first hidden layer with reverse=True, are shared. Thus, in the Model Information table, the Total Number of Model Parameters row indicates that this model has the same number of weights as the previous model. Furthermore, the Optimization History table looks nearly equivalent to the previous model, with similar movement and results for the validation and fit errors across the epochs.

- Examine the PROC CAS step that scores the bidirectional RNN, and then run the step. Look at the results.

proc cas; deepLearn.dlScore / model = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 0'} initWeights = 'rnn_trained_weights' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} copyVars = 'dispute' casout = {name='rnn_scored', replace=True}; quit;

We score the test data with this model, using the same dlScore code as before.

In the Score Information table, the misclassification error percentage is similar to the previous two models.

The recurrent neural network with two hidden layers, the deeper RNN, and the bidirectional RNN have all performed similarly with this data set, given the hyperparameters that we used. In general, these are three good models to have in your tool chest when building a supervised classification model with sequence data. And because models are data dependent, one of these three models might perform significantly better on a different data set. However, for this data set, we'll need to use a gated RNN model to improve performance, which we'll discuss next.

## Deep Learning Using SAS® Software

Lesson 03, Section 3 Demo: Supervised Classification Using a GRU

This is the third and final part of the demonstration on modeling the Consumer Financial Protection Bureau (CFPB) data. Here we'll build our final predictive model using a gated recurrent neural network, specifically a gated recurrent unit (GRU) model.

- Open the program named **DLUS03D01.sas**.

**Note:** If you did not perform the steps of the previous two demonstrations in the current SAS Studio session, run the code for those demonstrations (from the beginning of the program to the scoring of the bidirectional model) before you continue.

- Examine the PROC CAS step that builds a GRU, and then run the step. Look at the results.

/*Build a GRU*/ proc cas; deepLearn.buildModel / model = {name='rnn',replace=True} type = 'RNN'; deepLearn.addLayer / model = 'rnn' layer = {type='input'} replace=True name = 'data'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=15, act='sigmoid', init='xavier', rnnType='gru', reverse=True, outputType='samelength'} srcLayers = 'data' replace=True name = 'rnn1'; deepLearn.addLayer / model = 'rnn' layer = {type='recurrent', n=15, act='sigmoid', init='xavier', rnnType='gru', reverse=True, outputType='encoding'} srcLayers = 'rnn1' replace=True name = 'rnn2'; deepLearn.addLayer / model = 'rnn' layer = {type='output', act='auto', init='xavier', error='auto'} srcLayers = 'rnn2' replace=True name = 'output'; deepLearn.modelInfo / model='rnn'; quit;

In the PROC CAS step, note the following details about the layers:

- The first layer is, of course, the input layer.
- The two hidden layers are the same as before except that the rnnType is changed to GRU and the number of neurons is reduced to 15. The reduction in neurons is because the GRU already has additional complexity and therefore more weights than the standard RNN model. Here we are trying to make the total number of weight parameters somewhat equal to the previous models to see the advantage of using gates.
- The output layer is the same as before.

We'll use the modelInfo action again to view the structure of this model.

Notice that the Model Information table does not distinguish between standard and gated recurrent layers.

- Examine the PROC CAS step that trains the GRU, and then run the step. Look at the results.

proc cas; deepLearn.dlTrain / modelTable = 'rnn' table = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 1'} validTable = {name = 'cfpb_complaints_clean', where = '_PartInd_ = 2'} modelWeights = {name='rnn_trained_weights', replace=True} target = 'dispute' inputs = 'complaint' texts = 'complaint' textParms = {initInputEmbeddings={name='cfpb_complaints_embed'}} nominals = 'dispute' seed = '649' optimizer = {miniBatchSize=100, maxEpochs=30, algorithm={method='adam', beta1=0.9, beta2=0.999, learningRate=0.001, gamma=0.5, lrpolicy='step', stepsize=15, clipGradMax=10, clipGradMin=-10}}; quit;

The dlTrain action is the same as before.

In the Model Information table, notice that the total number of parameters is about 6600. This is close to the first RNN we built even though we used half the number of neurons here. Again, the GRU has inherent built-in complexity to model the flow of information using gates. In the Optimization History table, notice that the error is closer to 40% than before.

- Run the PROC CAS step that scores the GRU on the test data. Look at the results.

This step uses the dlScore action.

In the Score Information table, the misclassification on the test data for this model is closer to 40%. So, in this example, the GRU model performs better than the standard RNN, the deeper RNN, and the bidirectional RNN. The gates appear to better regulate the flow of information for the complaints data. Typically, the GRU model performs better on small sample sizes.

- Examine the PROC FREQ step, and then run the step. Look at the results.

/*Assess the GRU*/ proc freq data=mycas.rnn_scored; tables _dl_predname_*dispute; run;

To further assess our champion GRU model, we use PROC FREQ to create a cross tabulation of the actual target values versus the predicted dispute values.

Ideally, we would see large frequencies in the diagonal cells, where the predicted and actual levels match (0,0 and 1,1). Here, these frequencies are a bit larger than those in the off-diagonal cells, which hold the misclassified observations. Notice that this table breaks down the overall misclassification. If we divide the number of incorrect predictions by the total number of predictions, we get the misclassification error percentage that we saw earlier in the Score Information table. More importantly, this breakdown shows us that the model is not simply predicting only one level of the target.
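The arithmetic behind that breakdown can be sketched in a few lines of Python. The cell counts below are hypothetical placeholders, not the values from the PROC FREQ output. They only show how the off-diagonal cells relate to the misclassification rate.

```python
# Hypothetical cross-tabulation counts keyed by (predicted, actual).
# These numbers are made up for illustration only.
counts = {
    ("0", "0"): 420,  # correctly predicted nonevents
    ("0", "1"): 260,  # events predicted as nonevents
    ("1", "0"): 240,  # nonevents predicted as events
    ("1", "1"): 380,  # correctly predicted events
}

total = sum(counts.values())
wrong = sum(n for (pred, actual), n in counts.items() if pred != actual)
misclassification_rate = wrong / total

# A model that predicted only one level would leave an entire row of the
# table empty, so we also count how many distinct levels were predicted.
predicted_levels = {pred for (pred, _), n in counts.items() if n > 0}
```

With real output, the same division applied to the four PROC FREQ cell frequencies should reproduce the error percentage reported by dlScore.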

- Examine the PROC CAS step, and then run the step. Look at the results.

proc cas; loadactionset "percentile"; percentile.assess / table={name='rnn_scored'} casout={name='pct', replace=True} inputs='_dl_p0_' response='dispute' event='1'; quit;

This PROC CAS step contains the assess action from the percentile action set, which creates model assessment statistics tables. Other details about this PROC CAS step are provided below:

- The table is the scored table from the dlScore action above.
- The input is the predicted probability of a dispute.
- The response is the target dispute.
- The event level is the level we have modeled.
- We'll save this information in the CAS table **pct**, which is short for percentile.

The assess action creates two separate tables. The first is named **pct** (as specified in the casOut option) and the other is named **pct_ROC**. Each table saves different information to assess the model.

- Run the two PROC PRINT steps. Look at the results.

proc print data=mycas.pct (obs=5); run; proc print data=mycas.pct_roc (obs=5); run;

The PROC PRINT steps print the first five observations of the two tables produced by the assess action, so that we can view their contents. The first table is only 20 rows by 21 columns, and it contains lift curve information like depth, lift, and cumulative lift. The second table contains ROC curve information such as the true positives, false positives, false negatives, true negatives, sensitivity, specificity, and accuracy for each of 100 cutoff rates for assessment.

We can use the contents of these assessment tables to create assessment graphics.

- Examine the DATA step and the PROC SGPLOT step, and then run the steps. Look at the results.

data pct_roc; set mycas.pct_roc; run; proc sgplot data=pct_roc; series y=_sensitivity_ x=_fpr_; run;

For simplicity, we'll create only an ROC plot. The DATA step downloads the CAS table to the local Work library. The PROC SGPLOT step creates a series plot with the sensitivity on the Y axis and the false positive rate on the X axis.
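To make the two plotted quantities concrete, the following Python sketch computes sensitivity and false positive rate at a grid of cutoffs, and then a trapezoid-rule area under the resulting curve. The scores and labels are hypothetical toy values, and the function names are assumptions. The assess action's own cutoff grid and tie handling might differ.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, false_positive_rate, sensitivity) for each cutoff.

    An observation is classified as an event when its score >= cutoff.
    """
    points = []
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        sensitivity = tp / (tp + fn)
        fpr = fp / (fp + tn)
        points.append((c, fpr, sensitivity))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under the (fpr, sensitivity) points."""
    xy = sorted((fpr, sens) for _, fpr, sens in points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(xy, xy[1:]))

# Hypothetical predicted event probabilities and actual targets (1 = event).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
cutoffs = [i / 10 for i in range(11)]

points = roc_points(scores, labels, cutoffs)
auc = area_under_curve(points)
```

A curve that hugs the diagonal gives an area near 0.5, which corresponds to a model that barely beats random guessing.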

In the chart, the ROC curve for the GRU model is relatively flat, which is expected given the misclassification error percentage for our model.

We'll end the demonstration here. In a real scenario, however, it would be best to tune the GRU model and alter its optimization hyperparameters to increase model performance and increase the area under the ROC curve.

## Deep Learning Using SAS® Software

Lesson 03, Section 3 Demo: Modeling Simulated Data with an LSTM Model

In the previous demonstration, we used a GRU to increase the accuracy of the predicted customer disputes based solely on the text in the complaints. We saw that the gates alone improved the model as we held the hyperparameters of our optimization routine fixed.

In this demonstration, we'll explore the utility of a long short-term memory model in the context of a time series. The time series data for this example consist of 50,000 observations. The data set (**simts2**) is simulated from an AR1 model with random shocks (pulse events) that repeat at time periods 6 (pulse6) and 13 (pulse13). The idea is to introduce data with known systematic patterns into a deep learning framework, and to see how varying things like the number of layers and the type of hidden unit impact the model's ability to learn short and long memory patterns. LSTMs are advantageous in this type of situation because they can better regulate the flow of information and help avoid neuron saturation over long periods.

- Open the program named
**DLUS03D02.sas**. Examine the two LIBNAME statements, the DATA step, and the code that creates the GOFstats macro. Run this part of the code.

/********************************/ /*Create Local and CAS Libraries*/ /********************************/ libname local "/home/student/LWDLUS/Data"; libname mycas cas; /*****************************/ /*Load Data From Local to CAS*/ /*****************************/ data mycas.simts2; set local.simts2; run; /***********************/ /*Create GOFstats Macro*/ /***********************/ %macro GOFstats(ModelName=,DSName=,OutDS=,NumParms=0, ActualVar=Actual,ForecastVar=Forecast); data &OutDS; attrib Model length=$12 MAPE length=8 NMAPE length=8 MSE length=8 RMSE length=8 NMSE length=8 NumParm length=8; set &DSName end=lastobs; retain MAPE MSE NMAPE NMSE 0 NumParm &NumParms; Residual=&ActualVar-&ForecastVar; /*---- SUM and N functions necessary to handle missing ----*/ MAPE=sum(MAPE,100*abs(Residual)/&ActualVar); NMAPE=NMAPE+N(100*abs(Residual)/&ActualVar); MSE=sum(MSE,Residual**2); NMSE=NMSE+N(Residual); if (lastobs) then do; Model="&ModelName"; MAPE=MAPE/NMAPE; RMSE=sqrt(MSE/NMSE); if (NumParm>0) and (NMSE>NumParm) then RMSE=sqrt(MSE/(NMSE-NumParm)); else RMSE=sqrt(MSE/NMSE); output; end; keep Model MAPE RMSE NumParm; run; %mend GOFstats;

The LIBNAME statements create the necessary libraries. The DATA step loads the **simts2** data set into memory.

The next section of code creates the goodness-of-fit (GOFstats) macro to help compare the models we'll create. The GOFstats macro takes in scored data and produces summary information like the MAPE (mean absolute percentage error) and RMSE (root mean squared error).
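For readers who want to check the macro's arithmetic outside SAS, here is a rough Python equivalent of the two statistics it keeps. This is a sketch that mirrors the macro logic as written in the program (missing values are skipped, and the RMSE denominator is reduced by the number of parameters when one is supplied). The function name and the toy data are hypothetical.

```python
import math

def gof_stats(actual, forecast, num_parms=0):
    """Return (MAPE, RMSE) computed the way the GOFstats macro does.

    None stands in for a SAS missing value. Pairs with a missing actual or
    forecast are skipped, as are percentage errors with a zero actual.
    """
    pct_errors, sq_errors = [], []
    for a, f in zip(actual, forecast):
        if a is None or f is None:
            continue
        residual = a - f
        sq_errors.append(residual ** 2)
        if a != 0:
            pct_errors.append(100 * abs(residual) / a)
    n = len(sq_errors)
    mape = sum(pct_errors) / len(pct_errors)
    # Degrees-of-freedom adjustment, applied only when it is usable.
    denom = n - num_parms if (num_parms > 0 and n > num_parms) else n
    rmse = math.sqrt(sum(sq_errors) / denom)
    return mape, rmse

# Hypothetical values on the original (not log) scale.
actual = [100.0, 200.0, None, 400.0]
forecast = [110.0, 190.0, 300.0, 380.0]
mape, rmse = gof_stats(actual, forecast)
```

Note that the percentage error divides by the actual value itself, so this form of MAPE is meaningful only when the actual values are positive, as they are for a series that can be log transformed.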

- Examine the next section of code, which consists of three DATA steps and a PROC SGPLOT step. Run the steps and look at the results.

**Note:** In the code shown in the demo video, one of the DATA steps occurs in a different place than it does in the program file in the virtual lab. The instructions below assume that you submit all these steps together instead of (as shown in the demo video) in two separate submissions.

/****************/ /*Model Building*/ /****************/ data mycas.widgets_t; set mycas.simts2 (obs=40000); lwidgets=log(widgets); w1 = lag1(lwidgets); w2 = lag2(lwidgets); w3 = lag3(lwidgets); w4 = lag4(lwidgets); w5 = lag5(lwidgets); w6 = lag6(lwidgets); w7 = lag7(lwidgets); w8 = lag8(lwidgets); w9 = lag9(lwidgets); w10 = lag10(lwidgets); w11 = lag11(lwidgets); w12 = lag12(lwidgets); w13 = lag13(lwidgets); if _n_ > 13; keep date lwidgets w1 - w13 ; run; data plotin; set mycas.widgets_t (obs=100); run; proc sgplot data=plotin; series x=date y=lwidgets; run; data mycas.widgets_v; set mycas.simts2 (firstobs=40001 obs=40401); lwidgets=log(widgets); w1 = lag1(lwidgets); w2 = lag2(lwidgets); w3 = lag3(lwidgets); w4 = lag4(lwidgets); w5 = lag5(lwidgets); w6 = lag6(lwidgets); w7 = lag7(lwidgets); w8 = lag8(lwidgets); w9 = lag9(lwidgets); w10 = lag10(lwidgets); w11 = lag11(lwidgets); w12 = lag12(lwidgets); w13 = lag13(lwidgets); if _n_ > 13; keep date lwidgets w1 - w13; run;

The first DATA step subsets the first 40,000 observations of the simulated data from **simts2** into the **widgets_t** data set, where **t** stands for training. Notice that the DATA step converts the target variable **widgets**, which is an arbitrary name, to the log scale and creates 13 lag variables named **w1** to **w13**. The IF statement simply drops the first 13 observations, which have missing lag values. **Note:** For this data, the log transformation improved model fitting.

The second DATA step subsets the first 100 observations of the training data into a data set named **plotin** in the Work directory. The PROC SGPLOT step plots these 100 observations as a series.

The third DATA step subsets the next 400 or so observations (rows 40,001 through 40,401) from **simts2** into the **widgets_v** data set, where **v** stands for validation. In other respects, this DATA step is similar to the one that creates the training data.

In the results, the plot shows the general pattern of the data with log widgets on the Y axis and the date on the X axis. Remember that this data was simulated as an AR1 with random shocks at lags of 6 and 13.
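The log transformation and lag construction performed by the DATA steps can be sketched in Python as follows. This is an illustration only: the function name and the toy series are hypothetical, and the row layout (a dictionary for each observation) is an assumption made for readability.

```python
import math

def make_lagged_rows(series, n_lags=13):
    """Log-transform a series and attach lags 1..n_lags to each row.

    Rows that would have a missing lag (the first n_lags observations)
    are dropped, like the subsetting IF statement in the DATA step.
    """
    logged = [math.log(x) for x in series]
    rows = []
    for t in range(n_lags, len(logged)):
        row = {"lwidgets": logged[t]}
        for k in range(1, n_lags + 1):
            row[f"w{k}"] = logged[t - k]  # w1 is the most recent lag
        rows.append(row)
    return rows

# Hypothetical toy series of 20 positive values.
series = [float(i) for i in range(1, 21)]
rows = make_lagged_rows(series)
```

Each remaining row carries the current log value plus the 13 values that preceded it, which is what lets the network see the pulses at lags 6 and 13.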

- Examine the PROC CAS step that builds the first model (the "plain RNN"), and then run the step.

/*Build Model 0 - plain RNN*/ proc cas; loadactionset 'deeplearn'; quit; proc cas; deepLearn.buildModel / model={name='tsRnn0', replace=1} type = 'RNN'; deepLearn.addLayer / model='tsRnn0' name='data' layer={type='input' std='std'}; deepLearn.addLayer / model='tsRnn0' name='rnn1' layer={type='recurrent' n=5 act='sigmoid' init='msra' rnnType='rnn' outputtype='samelength'} srcLayers={'data'}; deepLearn.addLayer / model='tsRnn0' name='rnn2' layer={type='recurrent' n=5 act='sigmoid' init='msra' rnnType='rnn' outputtype='encoding'} srcLayers={'rnn1'}; deepLearn.addLayer / model='tsRnn0' name='outlayer' layer={type='output' act='identity' error='normal'} srcLayers={'rnn2'}; quit;

This PROC CAS step loads the deepLearn action set and then specifies the model architecture.

We'll build a few different recurrent neural networks and compare performance. This PROC CAS step builds the first of these models: a standard recurrent neural network with two recurrent hidden layers. Each hidden layer has five neurons.

We call this model tsRnn0 and specify the type as RNN.

Layers are added, as described below:

- For the input layer, notice that we use the standardize option for the time series. This simply standardizes the data before it moves through the network, in case there are large values that would otherwise skew model performance.
- The rnnType of each hidden layer is **RNN**. In both RNN hidden layers, we use only five neurons with the SIGMOID activation function and MSRA for weight initialization. Just as in the previous demonstration, notice that the first hidden layer has the output type SAMELENGTH, and the second layer requires encoding.
- Finally, the output layer has the IDENTITY activation, and we use the NORMAL error function for this time series.

- Examine the PROC CAS step that fits the model, and then run the step. Look at the results.

proc cas; deepLearn.dlTrain / table='widgets_t' model='tsRnn0' modelWeights={name='tsTrainedWeights0', replace=1} bestweights={name='bestbaseweights0', replace=1} inputs=${w13 w12 w11 w10 w9 w8 w7 w6 w5 w4 w3 w2 w1} target='lwidgets' optimizer={minibatchsize=5, algorithm={method='adam', lrpolicy='step', gamma=0.5, beta1=0.9, beta2=0.99, learningrate=.001 clipgradmin=-1000 clipgradmax=1000 } maxepochs=30} seed=54321; quit;

This PROC CAS step fits the model with dlTrain. Notice the following details:

- The table here is the training widgets (**widgets_t**) data, and the model is the one we just built, tsRnn0.
- The modelWeights argument saves the trained weights from the final epoch in the optimization routine to the **tsTrainedWeights0** table.
- The bestWeights argument saves the trained weights from the epoch corresponding to the minimum loss to the **bestbaseweights0** table.
- Notice that the inputs here are sequentially ordered from the longest lag onward. This allows the information to pass through the network in chronological order.
- The target, of course, is the log of the widgets (**lwidgets**).
- The optimizer options are similar to those in the previous demonstration. We're using ADAM optimization with our standard beta values and we set the learning rate to 0.001 for the model-fitting process. We also use the step learning rate policy with a gamma value of 0.5. Because we did not specify a step size, it defaults to 10, which means that the learning rate will be reduced by 50% every 10 epochs over the optimization routine. Last but not least, we'll specify a maximum of 30 epochs. We set a seed as well.

- Examine the PROC CAS step that scores the validation data, and then run the step. Look at the results.

proc cas; deepLearn.dlscore / table='widgets_v' model='tsRnn0' initweights={name='bestbaseweights0'} copyvars={'lwidgets' 'date'} casout={name='scoreOut0', replace=1}; quit;

This PROC CAS step scores the validation data with dlScore. In this case, the table is **widgets_v** and the model is the same tsRnn0. We score with the best weights from the previous step, and we copy the target and date into the scored output table named **scoreOut0**.

In the Score Information table, note the mean squared error for this basic RNN, which serves as our baseline metric.

- Examine the DATA step and the macro call, and then run that code.

data scored; set mycas.scoreout0; widgets = exp(lwidgets); forecast = exp(_dl_pred_); run; %GOFstats(ModelName=rnn, DSName=work.scored ,OutDS=work.rnn, NumParms=2,ActualVar=widgets,ForecastVar=Forecast);

Before we can plot our forecast or use the goodness-of-fit macro, we need to convert our target and our forecasts back to the original scale by exponentiating them in the DATA step. The scoring step saved the forecasts as **_dl_pred_** in the **scoreOut0** table. The **%GOFstats** macro then calculates the MAPE (mean absolute percent error) and RMSE (root mean square error) from the residuals of the scored data.

- In the left panel of SAS Studio, select **Libraries > Work > RNN** to see the GOFstats output.

**Note:** Alternatively, you can run the PROC PRINT step that occurs in the virtual lab version of this program but does not appear in the code in the demo video.

proc print data=work.rnn; run;

The output table from the macro provides baseline measures of model accuracy. Notice the MAPE, which is an additional baseline metric.

- Examine the PROC SGPLOT step, and then run it. Look at the results.

proc sgplot data=scored; scatter x=date y=widgets; series x=date y=forecast; run;

This PROC SGPLOT step creates a scatter plot of the original data (actual values) and then overlays a series of the forecasts from the model.

In the plot, it appears that the baseline model has learned the (positive spike) pattern of pulses spaced 6 intervals apart, but it did not do as good a job at capturing the variation associated with the (negative spike) pattern of pulses spaced 13 intervals apart. You can see that the basic RNN with about 100 parameters is unable to accurately forecast the shocks in the data. So what we'll try now is building a long short-term memory model to get better performance.

- Examine the next two PROC CAS steps, which respectively build and fit an LSTM (Model 1). Run these steps and look at the results.

/*Build Model 1 - LSTM*/ proc cas; deepLearn.buildModel / model={name='tsRnn1', replace=1} type = 'RNN'; deepLearn.addLayer / model='tsRnn1' name='data' layer={type='input' std='std' }; deepLearn.addLayer / model='tsRnn1' name='rnn1' layer={type='recurrent' n=5 act='sigmoid' init='msra' rnnType='lstm' outputtype='samelength'} srcLayers={'data'}; deepLearn.addLayer / model='tsRnn1' name='rnn2' layer={type='recurrent' n=5 act='sigmoid' init='msra' rnnType='lstm' outputtype='encoding'} srcLayers={'rnn1'}; deepLearn.addLayer / model='tsRnn1' name='outlayer' layer={type='output' act='identity' error='normal'} srcLayers={'rnn2'}; quit; proc cas; deepLearn.dlTrain / table='widgets_t' model='tsRnn1' modelWeights={name='tsTrainedWeights1', replace=1} bestweights={name='bestbaseweights1', replace=1} inputs=${w1-w13} target='lwidgets' optimizer={minibatchsize=5, algorithm={method='ADAM', lrpolicy='step', gamma=0.5, beta1=0.9, beta2=0.99, learningrate=.001 clipgradmin=-1000 clipgradmax=1000 } maxepochs=30} seed=54321; quit;

In the first PROC CAS step, we have altered the previous model by setting the rnnType argument in the two hidden layers to LSTM instead of RNN. The rest of the network architecture is the same. In the second PROC CAS step, the dlTrain options are the same as for the previous model, apart from the model and weight table names and the inputs, which are written as the range w1-w13 instead of being listed from w13 down to w1.

The dlTrain results indicate that the model fitting process converged. That is, the loss and error functions are minimized. Results on the final twelve epochs are shown. In the Model Information table, you can see that the total number of parameters is now almost 400. That is, the gates have added about 300 more weights to the model. And we can see from the Optimization History table that this results in cutting the fit error in half.
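
As a rough check on these parameter counts, the arithmetic can be sketched outside SAS. The Python snippet below is illustrative only and is not part of the course program. It assumes the textbook parameterization (input weights, recurrent weights, and one bias per neuron, with an LSTM layer holding four such blocks) and one input per time step; the function names are hypothetical, and the counts that SAS reports may be computed slightly differently.

```python
def simple_rnn_params(n_inputs: int, n_units: int) -> int:
    """Input weights + recurrent weights + one bias per neuron."""
    return n_units * (n_inputs + n_units + 1)


def lstm_params(n_inputs: int, n_units: int) -> int:
    """Four blocks (input, forget, and output gates plus the candidate state),
    each the size of a simple recurrent layer."""
    return 4 * simple_rnn_params(n_inputs, n_units)


def output_params(n_inputs: int) -> int:
    """One weight per incoming neuron plus a bias."""
    return n_inputs + 1


# Assumed architecture: 1 input per time step, two hidden layers of 5 neurons.
rnn_total = simple_rnn_params(1, 5) + simple_rnn_params(5, 5) + output_params(5)
lstm_total = lstm_params(1, 5) + lstm_params(5, 5) + output_params(5)

print(rnn_total)               # 96
print(lstm_total)              # 366
print(lstm_total - rnn_total)  # 270
```

Under these assumptions, the totals land close to the roughly 100 and almost 400 parameters described above.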

- Run the next PROC CAS step, with the dlScore action, to score the validation data. Look at the results.

proc cas; deepLearn.dlscore / table='widgets_v' model='tsRnn1' initweights={name='bestbaseweights1'} copyvars={'lwidgets' 'date' } casout={name='scoreOut1', replace=1}; quit;

The Score Information table shows that the results from this model are somewhat improved. The mean squared error has been reduced from the baseline. The model seems to do a better job capturing the longer memory process in the data.

- Run the DATA step and the macro call.

data scored1; set mycas.scoreout1; widgets = exp(lwidgets); forecast = exp(_dl_pred_); run; %GOFstats(ModelName=lstm_shallow,DSName=work.scored1 ,OutDS=work.lstm_shallow, NumParms=2,ActualVar=widgets,ForecastVar=Forecast);

Again, this DATA step exponentiates the forecasts for this model before the goodness-of-fit macro runs.

- In the left panel of SAS Studio, select **Libraries > Work > LSTM_SHALLOW** to see the GOFstats output.

**Note:** Alternatively, you can run the PROC PRINT step, which appears in the virtual lab version of this program but not in the code in the demo video.

proc print data=work.lstm_shallow; run;

In the output table, the MAPE has also been significantly reduced to about 3.5.
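
The back-transformation and the MAPE calculation can be sketched as follows. This Python snippet is only an illustration outside SAS: the GOFstats macro is not shown in this section, so the formula below is the standard definition of mean absolute percentage error and is an assumption about what the macro reports. The data values are hypothetical stand-ins for lwidgets and _dl_pred_.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical log-scale values standing in for lwidgets and _dl_pred_.
log_actual = [math.log(100.0), math.log(200.0), math.log(400.0)]
log_pred = [math.log(110.0), math.log(190.0), math.log(400.0)]

# Exponentiate to return to the original scale, as the DATA step does.
widgets = [math.exp(v) for v in log_actual]
forecast = [math.exp(v) for v in log_pred]

print(round(mape(widgets, forecast), 2))  # 5.0
```

Computing the error on the original scale, rather than on the log scale, keeps the percentage interpretable in units of widgets.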

- Examine the PROC SGPLOT step, and then run it. Look at the results.

proc sgplot data=scored1; scatter x=date y=widgets; series x=date y=forecast; run;

This step plots the forecasts for this model on top of the original data.

In the plot, you instantly notice that the forecasts have much greater flexibility and are better able to forecast the larger time series values.

- Examine the next two PROC CAS steps, which respectively build and train a deeper LSTM model (Model 2). Run these steps and look at the results.

/*Build Model 2 - same as above, but deeper*/ proc cas; deepLearn.buildModel / model={name='tsRnn2', replace=1} type = 'RNN'; deepLearn.addLayer / model='tsRnn2' name='data' layer={type='input' std='std' }; deepLearn.addLayer / model='tsRnn2' name='rnn1' layer={type='recurrent' n=10 act='sigmoid' init='msra' rnnType='lstm' outputtype='samelength'} srcLayers={'data'}; deepLearn.addLayer / model='tsRnn2' name='rnn2' layer={type='recurrent' n=10 act='sigmoid' init='msra' rnnType='lstm' outputtype='samelength'} srcLayers={'rnn1'}; deepLearn.addLayer / model='tsRnn2' name='rnn3' layer={type='recurrent' n=10 act='sigmoid' init='msra' rnnType='lstm' outputtype='samelength'} srcLayers={'rnn2'}; deepLearn.addLayer / model='tsRnn2' name='rnn4' layer={type='recurrent' n=10 act='sigmoid' init='msra' rnnType='lstm' outputtype='samelength'} srcLayers={'rnn3'}; deepLearn.addLayer / model='tsRnn2' name='rnn5' layer={type='recurrent' n=10 act='sigmoid' init='msra' rnnType='lstm' outputtype='encoding'} srcLayers={'rnn4'}; deepLearn.addLayer / model='tsRnn2' name='outlayer' layer={type='output' act='identity' error='normal'} srcLayers={'rnn5'}; quit; proc cas; deepLearn.dlTrain / table='widgets_t' model='tsRnn2' initweights={name='bestbaseweights1', where='_layerid_< 3'} modelWeights={name='tsTrainedWeights2', replace=1} bestweights={name='bestbaseweights2', replace=1} inputs=${w1-w13} target='lwidgets' optimizer={minibatchsize=5, algorithm={method='ADAM', lrpolicy='step', gamma=0.5, beta1=0.9, beta2=0.99, learningrate=.001 clipgradmin=-1000 clipgradmax=1000 } maxepochs=50} seed=54321; quit;

We build a deeper LSTM model to try to get even better performance. Here we've added three additional LSTM hidden layers, for a total of five hidden LSTM layers. Each of these layers now has 10 neurons instead of 5. Because the model has more weights to train, the maxEpochs argument in the dlTrain action has been increased to 50.

The training results indicate that the loss and error functions converged to minimums. The Model Information table shows that this model has 4000 parameters, which is a 10-fold increase from the last model. In the Optimization History table, we can see that the fit error is now under 0.002.

- Examine the remaining code, and then run it. Look at the results.

proc cas; deepLearn.dlscore / table='widgets_v' model='tsRnn2' initweights={name='bestbaseweights2'} copyvars={'lwidgets' 'date' } casout={name='scoreOut2', replace=1}; quit; data scored2; set mycas.scoreout2; widgets = exp(lwidgets); forecast = exp(_dl_pred_); run; %GOFstats(ModelName=lstm_deep,DSName=work.scored2 ,OutDS=work.lstm_deep, NumParms=2,ActualVar=widgets,ForecastVar=Forecast); proc print data=work.lstm_deep; run; proc sgplot data=scored2; scatter x=date y=widgets; series x=date y=forecast; run;

The PROC CAS step with the dlScore action scores the model on the validation data, the DATA step with the macro call finds the MAPE, the PROC PRINT step (not shown in the demo video) prints the output data set, and the PROC SGPLOT step plots the forecasts.

In the Score Information table, as expected, the mean squared error for the validation partition is about 0.0015, the best so far.

- In the left panel of SAS Studio, select **Libraries > Work > LSTM_DEEP** to see the GOFstats output.

**Note:** Alternatively, you can run the PROC PRINT step, which appears in the virtual lab version of this program but not in the code in the demo video.

In the output table, the MAPE is about 3, again the best so far.

- In the results, view the plot of the forecasts.

In the plot, we see visual confirmation that this deeper LSTM model is able to forecast the larger time series values better than the previous models.

In the next demonstration, we'll apply an LSTM to real-world data.

## Deep Learning Using SAS® Software

Lesson 03, Section 3 Demo: Modeling Weather Data with an LSTM Model

The goal of this demonstration is to forecast the maximum hourly temperature in Durham, North Carolina. The data set **durham** contains hourly weather information for this city, from 2008 to 2017. There are approximately 90,000 observations. This data set is made available by the National Centers for Environmental Information (NCEI), which provides public access to environmental data archives. The variables in the **durham** data set are listed below:

Name | Model Role | Measurement Level | Description |
---|---|---|---|
LST_DATE | Date | Date | Date of each observation (YYYYMMDD). Note: The dates are unique. |
LST_TIME | Input | Nominal | Hour of the observation (0-2300 by increments of 100) |
T_MAX | Target | Interval | Maximum air temperature in degrees Celsius during the hour |
P_CALC | Input | Interval | Total amount of precipitation in mm during the hour |

Because the data set is large and more complex than a standard classical time series, we'll use a recurrent neural network to forecast the hourly maximum temperature. A recurrent neural network, and in particular a long short-term memory (LSTM) model, can better regulate the flow of information throughout the time series and provide short-term forecasting flexibility for more accurate predictions.

- Open the program named **DLUS03D03.sas**. Examine the two LIBNAME statements and the DATA step, and then run those steps.

/********************************/ /*Create Local and CAS Libraries*/ /********************************/ libname local "/home/student/LWDLUS/Data"; libname mycas cas; /*****************************/ /*Load Data From Local to CAS*/ /*****************************/ data mycas.durham; set local.durham; run;

This code creates the necessary libraries and loads the data set for this demonstration into memory.

- Run the PROC PRINT step and look at the data in the results.

/****************/ /*Model Building*/ /****************/ proc print data=mycas.durham (obs=5); run;

This step prints the first five observations from the **durham** data set. This data set contains 87,672 rows, which equates to 10 years of hourly observations, and the four variables described earlier. Notice that the date variable is in year, month, day format, which allows for a numerically ordered sequence. The time is in military time, the temperature is in Celsius, and the precipitation is in millimeters.

- Examine the next two DATA steps and the PROC SGPLOT step, and then run these steps. Look at the results.

/*Plot one year of data*/ data sample; set mycas.durham; if lst_date < 20090000 then output sample; run; data sample; set sample; ID+1; run; proc sgplot data=sample; series x=id y=t_max; yaxis min=-20 max=35; run;

To get a feel for the data, let's plot one year of temperatures.

The first DATA step subsets the data and names the output data set **sample**. If the year is before 2009 (or in this case, if the year is 2008), then we output those observations to the sample data table, which equates to 8,788 observations.

The next DATA step adds a sequence variable named **ID** to the data table, which we'll need when we plot the year of temperatures. Let's run this code.

The PROC SGPLOT step creates the series plot of the **T_MAX** variable by the sequence variable **ID** that we just created.

In the plot, the first thing we notice is some extremely low temperatures. These are not real temperatures; they are missing values coded as -99, which is why we set the Y axis bounds to be between -20 and 35. The next thing we notice, as expected with temperature data, is that there is some seasonal behavior. On average, temperatures are lower in the winter months and they peak toward the middle of the graph in the summer months. Presumably, this behavior persists over all 10 years. Although it's hard to see, there is likely daily cyclical behavior as well, because temperatures are generally hottest during midday and cool at night. The one thing we cannot see in this graph is whether there is a general upward trend in temperatures from year to year.

Now let's prepare the data before passing it into a deep learning model.

- Examine the DATA step that creates lags, and then run the step. Look at the results.

/*Create lags*/ data mycas.durham; set mycas.durham; t_max_1 = lag1(t_max); t_max_2 = lag2(t_max); t_max_3 = lag3(t_max); t_max_4 = lag4(t_max); t_max_5 = lag5(t_max); lst_time_1 = lag1(lst_time); lst_time_2 = lag2(lst_time); lst_time_3 = lag3(lst_time); lst_time_4 = lag4(lst_time); lst_time_5 = lag5(lst_time); p_calc_1 = lag1(p_calc); p_calc_2 = lag2(p_calc); p_calc_3 = lag3(p_calc); p_calc_4 = lag4(p_calc); p_calc_5 = lag5(p_calc); run;

This time, we want to create additional inputs and add them to the CAS table. Because we have time series data and the observations are correlated, we'll create lag variables for the target (temperature) and for the inputs (time and precipitation). For simplicity, we'll consider only 5 lags for each, although 24 lags would also be sensible because temperatures exhibit a daily cycle.

You can see that the output CAS table increases from 4 variables to 19 variables, by adding 15 lag inputs, 5 for each of the 3 variables. When the lags are created, they introduce missing data for the beginning observations because you can't have a lag 5 input for the first few observations.
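
To see why the lags introduce missing values at the start of the table, here is a small Python sketch (an illustration outside SAS, using hypothetical temperatures and a hypothetical `lag` helper). It mimics the effect of calling the lag functions on every row: the lag-k value is the value from k rows earlier, and it is missing for the first k rows.

```python
def lag(values, k):
    """Value from k rows earlier; None where no earlier row exists."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


# Hypothetical hourly temperatures.
t_max = [10.1, 10.4, 11.0, 11.8, 12.5, 13.1, 13.0]

lags = {f"t_max_{k}": lag(t_max, k) for k in range(1, 6)}

for i, value in enumerate(t_max):
    row = [lags[f"t_max_{k}"][i] for k in range(1, 6)]
    print(i, value, row)

# Index of the first row for which all five lags are available.
first_complete = next(
    i for i in range(len(t_max))
    if all(lags[f"t_max_{k}"][i] is not None for k in range(1, 6))
)
print(first_complete)  # 5
```

With five lags, the first five rows each have at least one missing lag, so the sixth row is the first one that is complete.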

- Examine the DATA step that removes observations with missing values, and then run the step. Look at the results.

/*Remove missing*/ data mycas.durham mycas.missing; set mycas.durham; if cmiss(of _all_) or t_max<-30 then output mycas.missing; else output mycas.durham; run;

This DATA step removes observations that have missing values, whether the missing values were introduced by the lags or coded as -99 in the temperature. That is, if any of the variables contain missing values, or the **T_MAX** variable is less than -30, then the data are output to a table named **missing**. Otherwise, we output the data to the original **durham** data table.

In the output table, 226 observations are removed. Now the **durham** data has only 87,446 observations.
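
The filtering rule can be sketched in Python as follows. This is an illustration outside SAS with hypothetical rows and a hypothetical `is_unusable` helper; it mirrors the DATA step's condition (any missing value, or a current temperature below -30).

```python
def is_unusable(row, floor=-30.0):
    """True when any field is missing or the current temperature is below the floor."""
    if any(value is None for value in row.values()):
        return True
    return row["t_max"] < floor


# Hypothetical rows with one lag column each.
rows = [
    {"t_max": 12.5, "t_max_1": None},    # lag not yet available
    {"t_max": -99.0, "t_max_1": 12.5},   # -99 code for a missing reading
    {"t_max": 13.1, "t_max_1": -99.0},   # lagged -99 code, current value is fine
    {"t_max": 13.0, "t_max_1": 13.1},
]

kept = [r for r in rows if not is_unusable(r)]
missing = [r for r in rows if is_unusable(r)]
print(len(kept), len(missing))  # 2 2
```

A side effect worth noticing in this sketch: only the current t_max is compared with the floor, as in the DATA step's condition, so a row whose lagged temperature still holds -99 is kept.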

- Examine the DATA step that partitions the data, and then run the step.

/*Partition the data*/ data mycas.train mycas.validate mycas.test; set mycas.durham; if lst_date < 20150000 then output mycas.train; else if lst_date < 20170000 then output mycas.validate; else output mycas.test; run;

Because we have time-dependent data, we'll use another DATA step before our analysis to partition the data. We'll use the first seven years of data, or prior to 2015, as the training data. We'll use 2015 and 2016 as validation data, and the final year will be for testing.

This partitioning equates to approximately 61,000 training observations, 17,000 validation observations, and 9,000 observations for testing.
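
Because the dates are stored as YYYYMMDD integers, the year boundaries reduce to simple numeric comparisons. The following Python sketch (illustrative only; the `partition` function is hypothetical) applies the same thresholds as the DATA step.

```python
def partition(lst_date: int) -> str:
    """Assign a YYYYMMDD date to a partition using the same cutoffs as the DATA step."""
    if lst_date < 20150000:
        return "train"
    if lst_date < 20170000:
        return "validate"
    return "test"


print(partition(20141231))  # train
print(partition(20150101))  # validate
print(partition(20161231))  # validate
print(partition(20170101))  # test
```

Splitting by date, rather than at random, keeps every validation and test observation later in time than the training data.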

- Run the PROC CAS step that loads the deepLearn action set.

/*Build an LSTM model*/ proc cas; loadactionset "deeplearn"; quit;

- Examine the PROC CAS step that builds the LSTM network and displays information about the model. Run the step and look at the results.

proc cas; deepLearn.buildModel / model = {name='lstm', replace=True} type = 'RNN'; deepLearn.addLayer / model = 'lstm' layer = {type='input', std='std'} replace = True name = 'data'; deepLearn.addLayer / model = 'lstm' layer = {type='recurrent', n=15, init='xavier', rnnType='LSTM', outputType='samelength'} srcLayers = 'data' replace = True name = 'rnn1'; deepLearn.addLayer / model = 'lstm' layer = {type='recurrent', n=15, init='xavier', rnnType='LSTM', outputType='encoding'} srcLayers = 'rnn1' replace = True name = 'rnn2'; deepLearn.addLayer / model = 'lstm' layer = {type='output', act='identity', init='normal'} srcLayers = 'rnn2' replace = True name = 'output'; deepLearn.modelInfo / model='lstm'; quit;

We begin by using the buildModel action to initialize the network. In the model argument, we'll name this network LSTM because we are going to use long short-term memory (LSTM) hidden layers. The type argument is set to recurrent neural network (RNN) because LSTM models are a type of recurrent neural network.

Then the addLayer action is used to add layers, as described below:

- The first layer is the input layer, and we'll name it data. In the layer argument, notice that I'm using the standardize (std) argument to standardize the data in order to prevent large observations from degrading the model fit.

- The next layer is the first recurrent hidden layer, and the layer argument contains all the hyperparameters for the layer. Note the following:
  - Here n represents the number of neurons in the hidden layer. We'll use only 15 in this example to be computationally efficient. Remember, for an LSTM model with a small number of neurons, there are still many parameters used to build the gates in the network.
  - To initialize the weights, we'll use Xavier.
  - The rnnType here is LSTM, but we can also use GRU or the standard RNN.
  - Because we have not specified an activation function, the layer defaults to the identity for LSTM hidden layers.
  - The output type for this layer is set to SAMELENGTH, which means that this layer will generate a sequence with the same length as the input. That is, the sequence of inputs is converted into a sequence of hidden layer values.
  - The source layer specifies the layer or layers connected to the current layer. In this case, we are only using the previous layer, which is the input data.
  - Finally, we'll name this layer rnn1.

- The next hidden layer is nearly equivalent in its hyperparameters, except that the output type is changed to ENCODING. Note the following:
  - Encoding can be thought of as a many-to-one transformation, in that we are taking the sequence and converting it to a single value in order to predict the output. The output type depends on the problem at hand and how the inputs are used to model the RNN output. For example, we could use a many-to-many mapping to model language translation because the number of words needed to speak a phrase in one language might require a different number of words in another language. In this case, we are converting our sequence into an interval prediction.
  - The source layer here is the previous recurrent hidden layer.
  - We'll name this layer rnn2.
  - We'll keep this neural network relatively small with only 30 total neurons.

- The last layer is the output layer. Note the following:
  - For this model, the output layer connects only to the rnn2 source layer.
  - The type, of course, is OUTPUT.
  - The activation function is set to the IDENTITY.
  - The initialization is the normal distribution.
  - Because the error function is missing, it defaults to NORMAL for interval data.

The Model Information table provides a description of the LSTM model. The structure of this recurrent neural network has four total layers: one input layer, two hidden recurrent layers, and one output layer. In the future, you can get creative and customize these models as you see fit.
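
The difference between the SAMELENGTH and ENCODING output types can be illustrated with a toy recurrence. The Python sketch below is not an LSTM and is not how SAS implements these layers: it uses a single tanh neuron with hypothetical weights, and it assumes that the "single value" produced by encoding is the hidden state at the final time step.

```python
import math


def recurrent_layer(sequence, output_type, w_in=0.5, w_rec=0.3):
    """Toy one-neuron recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    if output_type == "samelength":
        return states       # one hidden value per time step (many-to-many)
    if output_type == "encoding":
        return states[-1:]  # only the final hidden value (many-to-one)
    raise ValueError(f"unknown output type: {output_type}")


seq = [0.2, -0.1, 0.4, 0.0, 0.3]  # hypothetical input sequence of length 5
same = recurrent_layer(seq, "samelength")
enc = recurrent_layer(seq, "encoding")
print(len(same), len(enc))  # 5 1
```

This is why the first hidden layer uses SAMELENGTH (the second hidden layer still needs a full sequence as input), whereas the last hidden layer uses ENCODING so that the output layer receives one summary per observation.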

- Examine the PROC CAS step that trains the model. Run the step and look at the results.

proc cas; deepLearn.dlTrain / modelTable = 'lstm' modelWeights = {name='trained_weights', replace=True} table = 'train' validTable = 'validate' target = 'T_MAX' inputs = {'t_max_5','lst_time_5','p_calc_5', 't_max_4','lst_time_4','p_calc_4', 't_max_3','lst_time_3','p_calc_3', 't_max_2','lst_time_2','p_calc_2', 't_max_1','lst_time_1','p_calc_1'} sequenceOpts = {timeStep=3} seed = '1234' optimizer = {miniBatchSize=4, maxEpochs=50, algorithm={method='adam', gamma=0.2, learningRate=0.01, clipGradMax=10000, clipGradMin=-10000, stepSize=30, lrPolicy='step'}}; quit;

The dlTrain action fits this model to our data. We first specify the training data, validation data, and the target name in the table, validTable, and target arguments, respectively.

Before we train the model, we need to arrange the inputs so that the features pass through the network correctly. Our goal is simply to pass the inputs into the network in chronological order, in accordance with the sequence data. The first three inputs are the period 5 lag for each feature, and then the period 4 lag, and so on, through the period 1 lag. This enables us to pass the inputs to the model in groups, or vectors, of length 3, corresponding to the number of inputs for each time step in the network. That is, we have a feature sequence length of 5 and vector inputs of length 3. For each time step t, the lag 5 inputs are passed into a lag 5 neuron, and then this neuron feeds into the lag 4 neuron along with the lag 4 inputs and so on, until it reaches the time t position in the sequence. Because we are using a long short-term memory model, the neurons use gates to decide what information to pass forward at each lag in the time step.

The sequence options (sequenceOpts) argument specifies the settings for the sequence data.

The timeStep argument specifies the number of variables that compose one token for text data, or vector for time series data, for each time point in the sequence. In this case, we have a sequence length of 5 because we have five period lags, and a timeStep, or input vectors, of length 3 because we have three variables.
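
The grouping described above can be made concrete with a short Python sketch (illustrative only, outside SAS; the `group_into_steps` helper is hypothetical). It splits the ordered input list into consecutive vectors of length 3, one vector per position in the sequence.

```python
inputs = [
    "t_max_5", "lst_time_5", "p_calc_5",
    "t_max_4", "lst_time_4", "p_calc_4",
    "t_max_3", "lst_time_3", "p_calc_3",
    "t_max_2", "lst_time_2", "p_calc_2",
    "t_max_1", "lst_time_1", "p_calc_1",
]


def group_into_steps(names, size):
    """Split an ordered list of input names into consecutive vectors of equal size."""
    if len(names) % size != 0:
        raise ValueError("the number of inputs must be a multiple of the vector size")
    return [names[i:i + size] for i in range(0, len(names), size)]


steps = group_into_steps(inputs, 3)
for position, step in enumerate(steps, start=1):
    print(position, step)
print(len(steps))  # 5
```

The oldest lag comes first and the most recent lag comes last, which is why the order of the names in the inputs argument matters.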

Next, I'll set a seed.

In the modelTable argument, I specify the network name I used earlier, LSTM.

We'll save the modelWeights as **trained_weights**, which will be the weights at the last epoch of the optimization.

Finally, the optimizer argument specifies the hyperparameters of the optimization routine, as described below:

- The miniBatchSize argument specifies the number of observations to use for updating the weights in the stochastic gradient descent algorithm. You can use more observations to approach the standard batch gradient descent, which uses all observations, or fewer for the original stochastic gradient descent, which uses only a single row of the data. In this case, we'll use only four observations, and we'll use only 50 epochs to optimize the weights.

- In the algorithm argument, we first specify the optimization method, ADAM. Adam also uses the beta1 and beta2 parameters, which are the exponential decay rates for the first and second moment estimates of the gradient. Because they are not specified here, they are set to the default values .9 and .999. Remember, these rates control how quickly the influence of past gradients decays. It's best to set them to values close to 1 so that past gradient information stays active for longer periods.

- The learning rate is set to .01, and the gradient bounds are set to positive and negative 10,000.

- In addition to the Adam method for optimization, we've also included the step approach for the learning rate policy. The step approach is used to reduce the learning rate over the optimization epochs. That is, the learning rate reduces as the network learns the data.

- The stepSize argument is set to 30, and the gamma argument is set to .2. This means that every 30 epochs the learning rate is multiplied by gamma (or .2 in this case), and the new learning rate is used for the next 30 epochs. This process repeats until the maximum number of epochs is reached. Because we are using only 50 total epochs and a step size of 30, the learning rate will be cut by 80% only once during the optimization.

In the output, the Model Information table redisplays the model architecture, as well as the total number of model parameters, which in this case is slightly more than 3000.
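
The parameter total can be reproduced approximately with a short calculation. The Python sketch below is an illustration outside SAS and assumes the textbook LSTM parameterization (four blocks per layer, each with input weights, recurrent weights, and one bias per neuron) with three inputs per time step; the helper name is hypothetical, and the exact count that SAS reports may be derived differently.

```python
def lstm_layer_params(n_inputs: int, n_units: int) -> int:
    """Four blocks (three gates plus the candidate state), each with
    input weights, recurrent weights, and one bias per neuron."""
    return 4 * n_units * (n_inputs + n_units + 1)


rnn1 = lstm_layer_params(3, 15)   # 3 variables per time step feed rnn1
rnn2 = lstm_layer_params(15, 15)  # rnn1's 15 neurons feed rnn2
output = 15 + 1                   # 15 weights plus a bias
total = rnn1 + rnn2 + output

print(rnn1, rnn2, output, total)  # 1140 1860 16 3016
```

Even with only 30 hidden neurons, the gates account for nearly all of the parameters.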

The Optimization History table shows the epochs, learning rate, and also the loss and error for the training and validation data. In this case, the reduction in training error and validation error is similar over the 50 epochs. This indicates that we did not overfit the training data and we don't need to apply any type of regularization here. Notice that the learning rate is .01 for the first 30 epochs and then is reduced to .002 for the remaining epochs due to the learning rate policy we implemented. With more time, we could try using more epochs and we could alter the optimization hyperparameters to further reduce the error.
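
The step policy amounts to multiplying the base learning rate by gamma once for every completed block of stepSize epochs. The Python sketch below illustrates that arithmetic outside SAS; the function name and the zero-based epoch index are assumptions made for the illustration.

```python
def step_learning_rate(epoch, base_rate=0.01, gamma=0.2, step_size=30):
    """Learning rate in effect at a zero-based epoch index under the step policy."""
    return base_rate * gamma ** (epoch // step_size)


schedule = [step_learning_rate(epoch) for epoch in range(50)]

print(round(schedule[0], 6), round(schedule[29], 6))   # 0.01 0.01
print(round(schedule[30], 6), round(schedule[49], 6))  # 0.002 0.002
```

With 50 epochs and a step size of 30, only one reduction occurs, which matches the single drop from .01 to .002.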

- Examine the PROC CAS step that scores the test data, and then run the step. Look at the results.

proc cas; deepLearn.dlScore / table = 'test' model = 'lstm' initWeights = 'trained_weights' copyVars = {'T_MAX','LST_DATE','LST_TIME'} casout = {name='lstm_scored', replace=True}; quit;

To get a final assessment of this model and its generalizability, we'll use the dlScore action to score the test data set. The table argument specifies the test table, the model is the same as before, and the initWeights are the saved weights from our model. We'll use the copyVars argument to save not only the scored data to the output table but also the target, date, and time so that we can evaluate the predictions later. We'll save this scored information as **lstm_scored**.

In the Score Information table, we see that the mean squared error on the test data using the LSTM model is about 0.75. Because this is a squared error, its units are squared degrees Celsius: on average, the squared difference between the actual and predicted maximum temperature is about 0.75 across the one year of test observations.

- Examine the next DATA step and the PROC MEANS step, and then run those steps. Look at the results.

data avg_err (keep=abs_diff); set mycas.lstm_scored; abs_diff = abs(t_max - _dl_pred_); run; proc means data=avg_err mean; run;

Instead of altering this model in an attempt to obtain a more accurate one, let's proceed by finding the average absolute error and plot the actual versus predicted maximum temperatures. First, we use a DATA step to find the absolute differences between the target and the predicted value in the scored table. Then, we use PROC MEANS to take an average of the absolute differences.

The results show that, in this case, our model is off by about half a degree on average over the entire year of test data.
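
The relationship between the mean squared error and the mean absolute error can be sketched in Python (an illustration outside SAS with hypothetical temperatures and a hypothetical `error_summary` helper). The mean absolute error is in degrees, whereas the mean squared error is in squared degrees, so its square root is the comparable quantity.

```python
import math


def error_summary(actual, predicted):
    """Mean squared error, its square root, and mean absolute error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Hypothetical actual and predicted maximum temperatures (degrees Celsius).
actual = [20.0, 21.0, 23.0, 22.0]
predicted = [20.5, 20.0, 23.5, 22.0]

summary = error_summary(actual, predicted)
print(summary["mse"], summary["mae"])  # 0.375 0.5

# Square root of the reported test MSE of about 0.75.
print(round(math.sqrt(0.75), 2))  # 0.87
```

The root mean squared error is never smaller than the mean absolute error, because squaring gives extra weight to the largest misses.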

- Examine the next DATA step and the PROC SGPLOT step, and then run those steps. Look at the results.

data sample; set mycas.lstm_scored (obs=1000); ID+1; run; proc sgplot data=sample; series x=ID y=t_max; series x=ID y=_dl_pred_; run;

Next, we'll plot only the first 1,000 observations of the test data against the predicted values so that we don't clutter the graphic. First, the DATA step subsets 1,000 observations of the scored data to a data set called **sample** and adds a sequence variable named **ID**. Next, we use two SERIES statements in PROC SGPLOT to overlay the predicted time series on the actual time series.

In the plot, we can see that the predictions are markedly similar to the actual maximum temperatures.

## Deep Learning Using SAS® Software

Lesson 04, Section 1 Demo: Autotuning an RNN Model

In a previous demonstration, we trained a model on time series data. In this demonstration, we'll work with a similar, gated recurrent neural network. However, this time, we'll use dlTune to tune some of the hyperparameters of our model.

- Open the program named **DLUS04D01.sas**. If you did not perform the steps for the previous lesson's time series demo in your current SAS Studio session, run the first part of this program (through the end of the code that creates the GOFstats macro) before you continue.

- Examine the first PROC CAS step, which builds the model, and then run the step.

/****************************/ /*Create Deep Learning Model*/ /****************************/ proc cas; loadactionset "deeplearn"; buildModel / model={name='tsRnn_tune', replace=1} type = 'RNN'; AddLayer / model='tsRnn_tune' name='data' layer={type='input' std='std' }; AddLayer / model='tsRnn_tune' name='rnn1' layer={type='recurrent' n=10 act='auto' init='msra' rnnType='GRU' outputtype='samelength'} srcLayers={'data'}; AddLayer / model='tsRnn_tune' name='rnn2' layer={type='recurrent' n=10 act='auto' init='msra' rnnType='GRU' outputtype='samelength'} srcLayers={'rnn1'}; AddLayer / model='tsRnn_tune' name='rnn3' layer={type='recurrent' n=10 act='auto' init='msra' rnnType='GRU' outputtype='samelength'} srcLayers={'rnn2'}; AddLayer / model='tsRnn_tune' name='rnn4' layer={type='recurrent' n=10 act='auto' init='msra' rnnType='GRU' outputtype='samelength'} srcLayers={'rnn3'}; AddLayer / model='tsRnn_tune' name='rnn5' layer={type='recurrent' n=10 act='auto' init='msra' rnnType='GRU' outputtype='encoding'} srcLayers={'rnn4'}; AddLayer / model='tsRnn_tune' name='outlayer' layer={type='output' act='identity' error='normal'} srcLayers={'rnn5'}; quit;

This PROC CAS step uses the buildModel action to build the model shell, which is called tsRnn_tune.

Using addLayer statements, we populate this shell with layers. This is a gated recurrent unit (GRU) model, so we specify rnnType='GRU'. Notice that the specified activation functions in the hidden layers have been changed to AUTO.

- Examine the next PROC CAS step, which tunes the model, and then run the step. Look at the results.

/************************/ /*Fit and Tune the Model*/ /************************/ proc cas; dlTune / model='tsRnn_tune' table = 'widgets_t' validtable='widgets_v' modelWeights = {name='tsTunedweights', replace=1} target = 'lwidgets' inputs = ${w1-w13} sequenceOpts = {timeStep=1} optimizer = {miniBatchSize=5, numTrials=10, tuneIter=5, tuneRetention=0.5, algorithm={method='ADAM', lrpolicy='step', beta1=0.9, beta2=0.99, gamma={lowerBound=0.3 upperBound=0.7}, learningRate={lowerBound=0.0001 upperBound=0.01}, clipGradMax=1000 clipGradMin=-1000} maxepochs=5} seed = 1234; quit;

This PROC CAS step uses the dlTune action to tune the model, replacing the dlTrain action from the previous time series demonstration. dlTune is very similar to dlTrain and dlScore. In the table argument, we specify the name of the data set that we're going to train the model on. We set the model argument to the name of the model that we've created. The validTable argument specifies the name of the validation data set. (The dlTune action requires the specification of a validation table.) The target argument specifies the name of the target. And we list our input variables in the inputs argument.

The remaining settings are the same except for the optimizer properties, which are described below:

- numTrials specifies the number of models that we're going to test. That is, each of these models represents a sample from the hyperparameter space.
- tuneIter specifies the number of times that we pause the training process and assess our models.
- tuneRetention specifies the proportion of models that you would like to retain after assessing the models.
- Notice that the gamma argument has curly braces that contain the lowerBound and upperBound arguments. lowerBound specifies the lower value for the particular option. In this case, 0.3 is the lowerBound for the gamma option. The upper bound for the gamma option is set to 0.7. SAS will take a Latin hypercube sample from within this range.
- We're also searching the learningRate between the bounds of 0.0001 and 0.01.

**Note:** Under the restrictions listed in the syntax, 1 tuning iteration = 5 training epochs.
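
To give a feel for Latin hypercube sampling, here is a minimal Python sketch. It is an illustration outside SAS and not the sampler that dlTune uses: it assumes equal-width strata on a linear scale, and the function name and seed handling are hypothetical. Each range is divided into as many strata as there are trials, one value is drawn from every stratum, and the values are shuffled independently for each hyperparameter.

```python
import random


def latin_hypercube(bounds, n_samples, seed=1234):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one value."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        width = (high - low) / n_samples
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(column)  # decouple the strata across dimensions
        columns[name] = column
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


bounds = {"gamma": (0.3, 0.7), "learningRate": (0.0001, 0.01)}
trials = latin_hypercube(bounds, n_samples=10)

for trial in trials:
    print(trial)
```

Compared with purely random draws, this spreads the 10 trial models across the full range of each hyperparameter, so no region of either range is left unexplored.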

In the results, we see the Model Information table. The next two tables are the dlTune results, which summarize the autotuning process. The first of these tables, Tune History of Deep Learning Model, shows the tune history for our target outcome, from the first iteration to the last iteration. You can see that dlTune has greatly improved the error value for this model (as compared with the performance of the model in the previous time series demo). In the next table (Best Parameters), the first row specifies the optimal learning rate value and the optimal gamma value.

- Examine the remainder of the code: a PROC CAS step, a DATA step, the GOFstats macro, and a PROC SGPLOT step. Run those steps and look at the results.

/******************************/ /* Score and assess the model */ /******************************/ proc cas; dlscore / table='widgets_v' model='tsRnn_tune' initweights={name='tsTunedweights'} copyvars={'lwidgets' 'date'} casout={name='scoreOut0', replace=1}; quit; data scored; set mycas.scoreout0; widgets = exp(lwidgets); forecast = exp(_dl_pred_); run; %GOFstats(ModelName=rnn, DSName=work.scored ,OutDS=work.rnn, NumParms=2,ActualVar=widgets,ForecastVar=Forecast); proc sgplot data=scored; scatter x=date y=widgets; series x=date y=forecast; run;

We use this code to examine the Model Fit Statistics. The PROC CAS step uses the dlScore action to score the champion autotuned model on the validation data. The DATA step exponentiates the widgets, as well as the forecasted widgets. We use the goodness-of-fit statistics macro to assess the model. And then we use PROC SGPLOT to plot our results.

In the results, the plot shows that our model is now doing a better job capturing the shocks in the series.

Let's examine the MAPE in the Work library, in the data set named RNN. We can see that our MAPE has been reduced to about 2.84%. So that's a significant improvement.

## Deep Learning Using SAS® Software

Lesson 05, Section 1 Demo: Building the Models

In this demonstration, we use unsupervised transfer learning to take information from source data, extract that information, and use it to improve the model that we will deploy on target data.

The **MNIST_FASHION** data is used as the target data and the **Cifar-10** data (used in some previous demonstrations) is used as the source data. The **MNIST_FASHION** data contains images of ten different types of clothing. The images are 28x28 grayscale.

- Open the program named
**DLUS05D01a.sas**. Examine the program code.

/*Create Local and CAS Libraries*/ /********************************/ libname local "/home/student/LWDLUS/Data"; libname mycas cas; /****************/ /* Tagret Model */ /****************/ proc cas; /* Load the Deep Learn action set */ loadactionset "deeplearn"; /* Create Model Shell */ BuildModel / modeltable={name='ConVNN2', replace=1} type = 'CNN'; /* Add an input layer with mutations */ AddLayer / model='ConVNN2' name='data' layer={type='input' nchannels=1 width=32 height=32 randomFlip='H' randomMutation='Random'}; /* Add several Convolutional operations */ AddLayer / model='ConVNN2' name='ConVLayer1a' layer={type='CONVO' nFilters=6 width=3 height=3 stride=1 act='ELU' dropout=.05} srcLayers={'data'}; AddLayer / model='ConVNN2' name='ConVLayer1b' layer={type='CONVO' nFilters=6 width=3 height=3 stride=1 act='ELU' dropout=.05} srcLayers={'data'}; /* Add a max pooling layer (two tracks) */ AddLayer / model='ConVNN2' name='PoolLayer2maxa' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'ConVLayer1a'}; AddLayer / model='ConVNN2' name='PoolLayer2maxb' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'ConVLayer1b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN2' name='ConVLayer3a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=false} srcLayers={'PoolLayer2maxa'}; AddLayer / model='ConVNN2' name='BatchLayer3a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3a'}; AddLayer / model='ConVNN2' name='ConVLayer3b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=false} srcLayers={'PoolLayer2maxb'}; AddLayer / model='ConVNN2' name='BatchLayer3b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3b'}; /* Add a max pooling layer (two tracks) */ AddLayer / model='ConVNN2' name='PoolLayer4maxa' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'BatchLayer3a'}; AddLayer / model='ConVNN2' name='PoolLayer4maxb' 
layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'BatchLayer3b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN2' name='ConVLayer5a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=false} srcLayers={'PoolLayer4maxa'}; AddLayer / model='ConVNN2' name='BatchLayer5a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer5a'}; AddLayer / model='ConVNN2' name='ConVLayer5b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=false} srcLayers={'PoolLayer4maxb'}; AddLayer / model='ConVNN2' name='BatchLayer5b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer5b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN2' name='ConVLayer6a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=2 act='Identity' includeBias=false dropout=.1} srcLayers={'BatchLayer5a'}; AddLayer / model='ConVNN2' name='BatchLayer6a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer6a'}; AddLayer / model='ConVNN2' name='ConVLayer6b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=2 act='Identity' includeBias=false dropout=.1} srcLayers={'BatchLayer5b'}; AddLayer / model='ConVNN2' name='BatchLayer6b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer6b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN2' name='ConVLayer7a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=false} srcLayers={'BatchLayer6a'}; AddLayer / model='ConVNN2' name='BatchLayer7a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer7a'}; AddLayer / model='ConVNN2' name='ConVLayer7b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=false} srcLayers={'BatchLayer6b'}; AddLayer / model='ConVNN2' name='BatchLayer7b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer7b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN2' 
name='ConVLayer8a' layer={type='CONVO' nFilters=32 width=3 height=3 stride=2 act='Identity' includeBias=false} srcLayers={'BatchLayer7a'}; AddLayer / model='ConVNN2' name='BatchLayer8a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer8a'}; AddLayer / model='ConVNN2' name='ConVLayer8b' layer={type='CONVO' nFilters=32 width=3 height=3 stride=2 act='Identity' includeBias=false} srcLayers={'BatchLayer7b'}; AddLayer / model='ConVNN2' name='BatchLayer8b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer8b'}; /* Add a concatenation layer*/ AddLayer / model='ConVNN2' name='concatlayer9' layer={type='concat'} srcLayers={'BatchLayer8a','BatchLayer8b'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN2' name='FCLayer10' layer={type='FULLCONNECT' n=25 act='Identity' includeBias=false init='msra2'} srcLayers={'concatlayer9'}; AddLayer / model='ConVNN2' name='BatchLayerFC10' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer10'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN2' name='FCLayer11' layer={type='FULLCONNECT' n=25 act='Identity' includeBias=false init='msra2'} srcLayers={'BatchLayerFC10'}; AddLayer / model='ConVNN2' name='BatchLayerFC11' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer11'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN2' name='FCLayer12' layer={type='FULLCONNECT' n=25 act='Identity' includeBias=false init='msra2'} srcLayers={'BatchLayerFC11'}; AddLayer / model='ConVNN2' name='BatchLayerFC12' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer12'}; /* Add an output layer with softmax activation */ AddLayer / model='ConVNN2' name='outlayer' layer={type='output' act='SOFTMAX' init='msra2'} srcLayers={'BatchLayerFC12'}; quit; /**************************************************************************************************************************/ 
/**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /****************************************/ /* Denoising Convolutional Autoencoder */ /****************************************/ proc cas; /* Create Model Shell */ BuildModel / modeltable={name='SDA', replace=1} type = 'CNN'; /* Add an input layer with dropout */ AddLayer / model='SDA' name='data' layer={type='input' nchannels=1 width=32 height=32 dropout=.3 offsets={92.669099}}; /* Add several Convolutional operations in layer one */ AddLayer / model='SDA' name='ConVLayer1a' layer={type='CONVO' nFilters=8 width=1 height=1 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='SDA' name='ConVLayer1b' layer={type='CONVO' nFilters=8 width=3 height=3 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='SDA' name='ConVLayer1c' layer={type='CONVO' nFilters=8 width=5 height=5 stride=1 act='ELU'} srcLayers={'data'}; AddLayer / model='SDA' name='ConVLayer1d' 
layer={type='CONVO' nFilters=8 width=7 height=7 stride=1 act='ELU'} srcLayers={'data'}; /* Add a concatenation layer */ AddLayer / model='SDA' name='concatlayer1' layer={type='concat'} srcLayers={'ConVLayer1a','ConVLayer1b','ConVLayer1c','ConVLayer1d'}; /* Add a max pooling layer */ AddLayer / model='SDA' name='PoolLayer1max' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'concatlayer1'}; /* Add a Convolutional layer with BN */ AddLayer / model='SDA' name='ConVLayer2a' layer={type='CONVO' nFilters=64 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'PoolLayer1max'}; AddLayer / model='SDA' name='BatchLayer1' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer2a'}; /* Add a max pooling layer */ AddLayer / model='SDA' name='PoolLayer2max' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'BatchLayer1'}; /* Add a one-by-one Convolutional layer with BN */ AddLayer / model='SDA' name='ConVLayer2' layer={type='CONVO' nFilters=1 width=1 height=1 stride=1 act='Identity' includeBias=FALSE} srcLayers={'PoolLayer2max'}; AddLayer / model='SDA' name='BatchLayer' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer2'}; /* Add a Transpose Convolutional layer with BN */ AddLayer / model='SDA' name='ConVLayer3b' layer={type='TRANSCONVO' nFilters=64 width=3 height=3 stride=1 act='ELU' includeBias=False} srcLayers={'BatchLayer'}; AddLayer / model='SDA' name='BatchLayer3b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3b'}; /* Add a Fully-Connected layer with BN */ AddLayer / model='SDA' name='FCLayer4' layer={type='FULLCONNECT' n=20 act='Identity' includeBias=FALSE} srcLayers={'BatchLayer3b'}; AddLayer / model='SDA' name='BatchLayerFC4' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer4'}; /* Add a Fully-Connected layer with BN */ AddLayer / model='SDA' name='FCLayer5' layer={type='FULLCONNECT' n=20 act='Identity' includeBias=FALSE} srcLayers={'BatchLayerFC4'}; AddLayer / model='SDA' 
name='BatchLayerFC5' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer5'}; /* Add an Output layer */ AddLayer / model='SDA' name='outlayer' layer={type='output'} srcLayers={'BatchLayerFC5'}; quit; /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /******************/ /* Transfer model */ /******************/ proc cas; /* Create Model Shell */ BuildModel / modeltable={name='ConVNN3', replace=1} type = 'CNN'; /* Add an input layer with mutations */ AddLayer / model='ConVNN3' name='data1' layer={type='input' nchannels=1 width=32 height=32 }; AddLayer / model='ConVNN3' name='data2' layer={type='input' nchannels=1 width=8 height=8 }; /* Add several Convolutional operations */ AddLayer / model='ConVNN3' name='ConVLayer1a' layer={type='CONVO' nFilters=6 width=3 height=3 stride=1 act='ELU' 
dropout=.05} srcLayers={'data1'}; AddLayer / model='ConVNN3' name='ConVLayer1b' layer={type='CONVO' nFilters=6 width=3 height=3 stride=1 act='ELU' dropout=.05} srcLayers={'data1'}; /* Add a max pooling layer (two tracks) */ AddLayer / model='ConVNN3' name='PoolLayer2maxa' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'ConVLayer1a'}; AddLayer / model='ConVNN3' name='PoolLayer2maxb' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'ConVLayer1b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN3' name='ConVLayer3a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'PoolLayer2maxa'}; AddLayer / model='ConVNN3' name='BatchLayer3a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3a'}; AddLayer / model='ConVNN3' name='ConVLayer3b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'PoolLayer2maxb'}; AddLayer / model='ConVNN3' name='BatchLayer3b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer3b'}; /* Add a max pooling layer (two tracks) */ AddLayer / model='ConVNN3' name='PoolLayer4maxa' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'BatchLayer3a'}; AddLayer / model='ConVNN3' name='PoolLayer4maxb' layer={type='POOL' width=2 height=2 stride=2 pool='max'} srcLayers={'BatchLayer3b'}; /* Add a concatenation layer*/ AddLayer / model='ConVNN3' name='concatfornewdata1' layer={type='concat'} srcLayers={'PoolLayer4maxa','data2'}; AddLayer / model='ConVNN3' name='concatfornewdata2' layer={type='concat'} srcLayers={'PoolLayer4maxb','data2'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN3' name='ConVLayer5a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'concatfornewdata1'}; AddLayer / model='ConVNN3' name='BatchLayer5a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer5a'}; 
AddLayer / model='ConVNN3' name='ConVLayer5b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'concatfornewdata2'}; AddLayer / model='ConVNN3' name='BatchLayer5b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer5b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN3' name='ConVLayer6a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=2 act='Identity' includeBias=False dropout=.1} srcLayers={'BatchLayer5a'}; AddLayer / model='ConVNN3' name='BatchLayer6a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer6a'}; AddLayer / model='ConVNN3' name='ConVLayer6b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=2 act='Identity' includeBias=False dropout=.1} srcLayers={'BatchLayer5b'}; AddLayer / model='ConVNN3' name='BatchLayer6b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer6b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN3' name='ConVLayer7a' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'BatchLayer6a'}; AddLayer / model='ConVNN3' name='BatchLayer7a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer7a'}; AddLayer / model='ConVNN3' name='ConVLayer7b' layer={type='CONVO' nFilters=16 width=3 height=3 stride=1 act='Identity' includeBias=False} srcLayers={'BatchLayer6b'}; AddLayer / model='ConVNN3' name='BatchLayer7b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer7b'}; /* Add several Convolutional operations with BN */ AddLayer / model='ConVNN3' name='ConVLayer8a' layer={type='CONVO' nFilters=32 width=3 height=3 stride=2 act='Identity' includeBias=False} srcLayers={'BatchLayer7a'}; AddLayer / model='ConVNN3' name='BatchLayer8a' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer8a'}; AddLayer / model='ConVNN3' name='ConVLayer8b' layer={type='CONVO' nFilters=32 width=3 height=3 stride=2 act='Identity' includeBias=False} srcLayers={'BatchLayer7b'}; 
AddLayer / model='ConVNN3' name='BatchLayer8b' layer={type='BATCHNORM' act='ELU'} srcLayers={'ConVLayer8b'}; /* Add a concatenation layer*/ AddLayer / model='ConVNN3' name='concatlayer9' layer={type='concat'} srcLayers={'BatchLayer8a','BatchLayer8b'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN3' name='FCLayer10' layer={type='FULLCONNECT' n=25 act='Identity' init='msra2' includeBias=False} srcLayers={'concatlayer9'}; AddLayer / model='ConVNN3' name='BatchLayerFC10' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer10'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN3' name='FCLayer11' layer={type='FULLCONNECT' n=25 act='Identity' init='msra2' includeBias=False} srcLayers={'BatchLayerFC10'}; AddLayer / model='ConVNN3' name='BatchLayerFC11' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer11'}; /* Add a fully-connected layer with Batch Normalization */ AddLayer / model='ConVNN3' name='FCLayer12' layer={type='FULLCONNECT' n=25 act='Identity' init='msra2' includeBias=False} srcLayers={'BatchLayerFC11'}; AddLayer / model='ConVNN3' name='BatchLayerFC12' layer={type='BATCHNORM' act='ELU'} srcLayers={'FCLayer12'}; /* Add an output layer with softmax activation */ AddLayer / model='ConVNN3' name='outlayer' layer={type='output' act='SOFTMAX' init='msra2'} srcLayers={'BatchLayerFC12'}; quit; /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ 
/**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /**************************************************************************************************************************/ /********************************************/ /* View pictures of each model architecture */ /********************************************/ data _NULL_; dcl odsout obj1(); obj1.image(file:'/home/student/LWDLUS/Data/Original_classificationmodel.png', width: "850", height: "400"); run; data _NULL_; dcl odsout obj1(); obj1.image(file:'/home/student/LWDLUS/Data/Denoising_autoencoder.png', width: "750", height: "300"); run; data _NULL_; dcl odsout obj1(); obj1.image(file:'/home/student/LWDLUS/Data/ClassificationModel_w_Encoders.png', width: "850", height: "450"); run;

The first PROC CAS step refreshes the CAS server. Then the two LIBNAME statements re-create the local library and the caslib.

The next three PROC CAS steps build the three models. The three DATA steps print diagrams of the architecture of each model.

- Run the program and look at the results.

At the bottom of the results, look at the three model architecture diagrams: one for each model that we just built. Notice the following details about the diagrams:

- The first picture displays the target classification model. This model will be trained on, and given, only the target data (that is, the **MNIST_FASHION** data).

- The second model is the unsupervised denoising convolutional autoencoder. This autoencoder will be given both the source and target data. The goal is for this model to reconcile differences in the marginal probability distributions between our source data and target data. The denoising convolutional autoencoder will then score the target data. We will also extract the feature maps from the 1-by-1 convolutional layer that resides in the center of the model, and then use them as inputs into the final model.

- The final model is the transfer classification model. This is the same model as our target classification model, with one exception. This last model incorporates the information (that is, the feature maps) produced by the denoising convolutional autoencoder that we previously trained. The information is contained in 8x8 images. Therefore, the information must be concatenated at a point in the model architecture in which the tensor matches an 8x8xn structure.
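The 8x8 requirement in the last item can be checked by tracing the spatial size of the tensor through the layers that precede the concatenation. The sketch below is only an illustration. It assumes that the stride-1 convolutions keep width and height unchanged (size-preserving padding) and that each 2x2, stride-2 pooling layer halves them. The layer names are shortened versions of those in the program.

```sas
/* Trace the spatial size up to the point where the feature maps are added. */
/* Assumption: stride-1 convolutions preserve width and height;             */
/* each 2x2 pooling layer with stride 2 halves them.                        */
data _null_;
   length layer $ 20;
   size = 32;                                    /* input layer: 32 x 32   */
   layer = 'data';           put layer= size=;
   layer = 'ConVLayer1';     put layer= size=;   /* stride 1: unchanged    */
   size = size / 2;
   layer = 'PoolLayer2max';  put layer= size=;   /* 32 -> 16               */
   layer = 'ConVLayer3';     put layer= size=;   /* stride 1: unchanged    */
   size = size / 2;
   layer = 'PoolLayer4max';  put layer= size=;   /* 16 -> 8                */
run;
```

Under the same assumptions, the 1-by-1 convolutional layer in the autoencoder also sits after two pooling layers and has a single filter, so its output is 8x8x1. That matches the second input layer of the transfer model (width=8, height=8, nchannels=1), which is concatenated with the output of the second pooling layer.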

We will work with these models in the next few demonstrations.

## Deep Learning Using SAS® Software

Lesson 05, Section 1 Demo: Training the Classification Model

In the previous demonstration, we constructed the following three models:

- the target classification model, which is trained on only the target data
- the denoising convolutional autoencoder (unsupervised), which is used to extract information from the source data and deploy that information into the third model
- the transfer classification model, which is the same as the first model with the exception that we'll leverage that additional information that we extract from the source data

In this demonstration, we load and prepare the **MNIST_FASHION** data and use it to train the target classification model.

- If you did not perform the steps of the previous (transfer learning) demonstration in your current SAS Studio session, open the program named
**DLUS05D01a.sas** and run it before you continue.

- Open the program named
**DLUS05D01b.sas**. At the beginning of the program, run the LIBNAME statements and the first PROC CAS step.

/********************************/ /*Create Local and CAS Libraries*/ /********************************/ libname local "/home/student/LWDLUS/Data"; libname mycas cas; /*******************/ /* Create a caslib */ /*******************/ proc cas ; loadactionset 'table'; table.addCaslib / name='imagelib' path='/home/student/LWDLUS/Data' subdirectories=true activeOnAdd=False; quit;

- Examine the PROC CAS step that loads the
**MNIST_FASHION** data, and then run the step. Look at the log.

/***********************/ /* Load Fashion Images */ /***********************/ proc cas; loadactionset 'image'; image.loadimages / caslib='imagelib' path='mnist_fashion' recurse=true labellevels=1 casout={name='mnist_fashionloaded', replace=true}; quit;

The loadImages action uploads the data and creates a CAS table named **MNIST_FASHIONLOADED**.

The log shows that the data set contains 15,000 images that are grayscale. The image dimensions are 28 by 28.

- Examine the DATA step, which selects a subset of images for viewing, and displays them larger than actual size. Look at the results.

/***************/ /* View Images */ /***************/ data _null_; set mycas.mnist_fashionloaded(where=( _id_<=4 AND _id_>=1 or _id_<=1504 AND _id_>=1501 or _id_<=3004 AND _id_>=3001 or _id_<=4504 AND _id_>=4501 or _id_<=6004 AND _id_>=6001 or _id_<=7504 AND _id_>=7501 or _id_<=9004 AND _id_>=9001 or _id_<=10504 AND _id_>=10501 or _id_<=12004 AND _id_>=12001 or _id_<=13504 AND _id_>=13501) keep=_path_ _id_ _label_) end=eof; if _n_=1 then do; dcl odsout obj(); obj.layout_gridded(columns:4); end; obj.region(); obj.format_text(text: _label_, just: "c", style_attr: 'font_size=8pt'); obj.image(file: _path_, width: "112", height: "112"); if eof then do; obj.layout_end(); end; run;

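The DATA step uses the ODSOUT component object (declared with DCL ODSOUT) to lay out the images in a four-column grid and to display each image at a width and height of 112 for viewing. The WHERE= data set option selects ten groups of four images (40 images in all).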
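The WHERE= condition in the program lists ten ID ranges explicitly. As an illustration of the pattern behind those ranges, the sketch below checks that the same IDs are picked by testing whether `_id_` falls in the first four positions of each block of 1,500. The block size of 1,500 is inferred from the ranges in the program, and the assumption that `_id_` runs from 1 to 15,000 without gaps is not confirmed by the demo.

```sas
data _null_;
   selected = 0;
   mismatches = 0;
   do _id_ = 1 to 15000;
      /* Explicit form: ten ranges of four IDs starting at 1, 1501, ..., 13501 */
      explicit = 0;
      do start = 1 to 13501 by 1500;
         if start <= _id_ <= start + 3 then explicit = 1;
      end;
      /* Compact form: position within each block of 1,500 is 1 through 4 */
      compact = (mod(_id_ - 1, 1500) < 4);
      selected + compact;
      mismatches + (explicit ne compact);
   end;
   put selected= mismatches=;   /* selected=40 mismatches=0 */
run;
```

If the assumption about `_id_` holds, a condition such as `where=(mod(_id_ - 1, 1500) < 4)` would select the same 40 images.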

Here we can see that we have 10 different types of clothing. And this is our target data. So we're going to try to extract information from our source data in some way that we're going to improve our model on this target data.

- Examine the PROC CAS step that resizes the images in the target data set, and then run the step.

/*****************/ /* Resize Images */ /*****************/ proc cas; processImages / casout={name='fashion_resized', replace=1} imageTable={name='mnist_fashionloaded'} imageFunctions ={{functionOptions={functionType='RESIZE', height=32, width=32}}}; quit;

This PROC CAS step uses the processImages action to do the following:

- resize the 28 by 28 images to 32 by 32
- create a new CAS table named **FASHION_RESIZED** containing the resized images

- Run the PROC PARTITION step that partitions the data and the PROC CAS step that shuffles the data. Look at the results.

/**********************************************/ /* Partition images into train and validation */ /**********************************************/ proc partition data=mycas.fashion_resized samppct=80 samppct2=20 seed=12345 partind; by _label_; output out=mycas.fashion_part; run; /********************/ /* Shuffle the Data */ /********************/ proc cas; table.shuffle / table='fashion_part' casout={name='fashionShuffled', replace=1}; quit;

The PROC PARTITION step partitions the data set using a stratified random sample that is stratified by the **_label_** variable. Eighty percent of the data will be allocated to the training data and 20% of the data will be allocated to the validation data. A new CAS table titled **FASHION_PART** is created.

The PROC CAS step uses the table.shuffle action to shuffle the data, and creates a new CAS table titled **FASHIONSHUFFLED**.
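To see what a stratified 80/20 split does, here is a small sketch on made-up data. It deliberately swaps in PROC SURVEYSELECT (a SAS/STAT procedure) for PROC PARTITION so that it runs without a CAS session. The table `toy_images` and its three labels are hypothetical.

```sas
/* Hypothetical toy table: three labels with 100 rows each, sorted by label */
data toy_images;
   length _label_ $ 8;
   do _label_ = 'bag', 'coat', 'shirt';
      do i = 1 to 100;
         output;
      end;
   end;
run;

/* Draw 80% within each label; OUTALL keeps all rows and adds Selected (1 or 0) */
proc surveyselect data=toy_images out=toy_part
                  method=srs samprate=0.8 seed=12345 outall noprint;
   strata _label_;
run;

/* Each label should show the same 80/20 split */
proc freq data=toy_part;
   tables _label_ * Selected / nocol nopercent;
run;
```

Because the sampling is done within each label, every class keeps the same training and validation proportions. That is the purpose of the BY _label_ statement in the PROC PARTITION step.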

In the PROC PARTITION results, the first table indicates the percentage of data that is allocated to training and validation.

- Examine the remainder of the code, and then run it. Look at the results.

/*******************************************************/ /* First, train a model on MNIST Fashion (target data) */ /*******************************************************/ ods output OptIterHistory=ObjectModeliter; proc cas; dlTrain / model='ConVNN2' table={name='fashionShuffled', where='_PartInd_=1'} ValidTable={name='fashionShuffled', where='_PartInd_=2'} modelWeights={name='ConVTrainedWeights_d', replace=1} bestweights={name='ConVbestweights', replace=1} inputs='_image_' target='_label_' nominal={'_label_'} GPU=True optimizer={minibatchsize=80, maxepochs=80, algorithm={method='ADAM', lrpolicy='Step', gamma=0.6, stepsize=10, beta1=0.9, beta2=0.999, learningrate=.01}} seed=12345; quit; /******************************************************************/ /* Store minimum training and validation error in macro variables */ /******************************************************************/ proc sql noprint; select min(FitError) into :Train separated by ' ' from ObjectModeliter; quit; proc sql noprint; select min(ValidError) into :Valid separated by ' ' from ObjectModeliter; quit; /* Plot Performance */ proc sgplot data=ObjectModeliter; yaxis label='Misclassification Rate' MAX=.9 min=0; series x=Epoch y=FitError / CURVELABEL="&Train" CURVELABELPOS=END; series x=Epoch y=ValidError / CURVELABEL="&Valid" CURVELABELPOS=END; run;

This code does the following:

- trains the target classification model (ConVNN2) for 80 epochs using the shuffled fashion (target) data
- saves the iteration history in a table named **OBJECTMODELITER**
- plots the error values in a series plot

**Note:** In the code shown in the demo video, the maxepochs argument is set to 60 instead of 80.
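The optimizer settings in the dlTrain call combine a starting learning rate of .01 with a Step policy (gamma=0.6, stepsize=10). As a minimal sketch of what such a policy implies, the DATA step below multiplies the rate by gamma after every block of 10 epochs. The exact epoch at which each drop takes effect is an assumption here and might differ by one from what the action does.

```sas
/* Step policy sketch: multiply the learning rate by gamma every stepsize epochs. */
/* Assumption: the first drop takes effect at epoch 11.                           */
data lr_schedule;
   learningrate = 0.01;
   gamma        = 0.6;
   stepsize     = 10;
   do epoch = 1 to 80;
      lr = learningrate * gamma ** floor((epoch - 1) / stepsize);
      if mod(epoch - 1, stepsize) = 0 then output;   /* first epoch of each block */
   end;
   keep epoch lr;
run;

proc print data=lr_schedule noobs;
   format lr 10.6;
run;
```

Under this assumption, the rate falls from 0.01 in the first block to roughly 0.00028 in the last block (epochs 71 through 80).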

In the Model Information table, it appears that our model has about 98,000 parameters. This isn't a large model. Below the Optimization History table, the plot shows the validation error, which is the validation misclassification rate, and the fit error, which is the training misclassification rate. Our model appears to have a validation misclassification rate of about 10.5% and a training misclassification rate of about 5%.

In the next demonstration, let's see if we can improve this target function using transfer learning.

## Deep Learning Using SAS® Software

Lesson 05, Section 1 Demo: Training the Transfer Classification Model

In this demonstration, we train the denoising convolutional autoencoder on a data set that contains both the source and target data. The source data is the Cifar-10 data (consisting of 10,000 images) that we used in an earlier demonstration.

- If you did not perform the steps of the two previous (transfer learning) demonstrations in your current SAS Studio session, open and run the following programs in sequence before you continue:
**DLUS05D01a.sas** and **DLUS05D01b.sas**.

- Open the program named
**DLUS05D01c.sas**. At the beginning of the program, run the first three steps (PROC CAS, PROC PARTITION, and PROC CAS) to load, partition, and shuffle the Cifar-10 source data.

/*****************************************/ /* Load the SmalltrainData cifar-10 data */ /*****************************************/ proc cas ; image.loadimages / caslib='imagelib' path='SmalltrainData' recurse=true labellevels=1 casout={name='SmalltrainData', replace=true}; quit; /********************************************/ /* Partition into train and validation data */ /********************************************/ proc partition data=mycas.SmalltrainData samppct=80 samppct2=20 seed=12345 partind; by _label_; output out=mycas.smallImageData; run; /********************/ /* Shuffle the data */ /********************/ proc cas; table.shuffle / table='smallImageData' casout={name='SmallImageDatashuffled', replace=1}; quit;

- Examine the next two PROC CAS steps, and then run the steps. Look at the results.

/************************************/ /* Convert color to grayscale image */ /************************************/ proc cas; image.processImages / table={name='SmallImageDatashuffled'} casout={name='Grayscale_Cifar10', replace=True} copyVars={'_PartInd_'} imagefunctions={{functionOptions={functionType='CONVERT_COLOR' type="COLOR2GRAY"}}}; quit; /*************************************************/ /* Summarize images to verify they are grayscale */ /*************************************************/ proc cas; image.summarizeimages / table={name='Grayscale_Cifar10', where='_PartInd_=1'}; quit;

The first of these PROC CAS steps uses the processImages action to transform the images in the Cifar-10 data contained in the **SMALLIMAGEDATASHUFFLED** table from color to grayscale. We do this so that the images can be used in combination with the **MNIST_FASHION** data, which are grayscale. Notice that the functionType is set to CONVERT_COLOR and the type argument is set to COLOR2GRAY. A new CAS table is created that contains the grayscale images of **Cifar-10**. This output table is named **Grayscale_Cifar10**.

**Note:** To transform images from gray to color, you set the type argument to GRAY2COLOR.

The second PROC CAS step uses the summarizeImages action. We can use the summary it produces to confirm that the images have been transformed from color to gray scale.

In the results table from the summarizeImages action, notice that the average intensities of the blue, green, and red channels are all the same, which is what we expect for grayscale images.
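As a small illustration of why the channel averages coincide, the sketch below reduces one color pixel to a single gray intensity. The pixel values are made up, and the weights are the common BT.601 luminance weights. Whether COLOR2GRAY uses exactly these weights is an assumption, because the demo does not say.

```sas
data _null_;
   /* Hypothetical pixel with channel values on the 0-255 scale */
   r = 200; g = 100; b = 50;
   /* Assumption: BT.601 luminance weights, not confirmed for this action */
   gray = 0.299*r + 0.587*g + 0.114*b;
   put gray=;
run;
```

Because every pixel is reduced to one intensity, a summary that reports separate B, G, and R averages would be expected to show the same number for each.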

- Examine the next two PROC CAS steps, and then run the steps. Look at the results.

/************************************/ /* Append two image tables together */ /************************************/ proc cas; searchAnalytics.searchJoin / casOut={name="Source_and_target_data", replace=TRUE} joinType="append" leftTable={table={name="Grayscale_Cifar10"}} rightTable={table={name="fashionShuffled"}}; quit; /**************************************************************/ /* Summarize images to find average value to use as an offset */ /**************************************************************/ proc cas; image.summarizeimages / table={name='Source_and_target_data', where='_PartInd_=1'}; quit;

The first of these PROC CAS steps uses the searchJoin action to append the **Cifar-10** data (the source data) to the **MNIST_FASHION** data (the target data). The casOut argument creates an output CAS table named **Source_and_target_data**. The joinType is APPEND. The left table is **Grayscale_Cifar10** and the right table is **fashionShuffled**.

The next PROC CAS step uses the summarizeImages action to create a summary of the training data. We want to extract the average color channel value. This average pixel intensity will be subtracted from the images when the denoising convolutional autoencoder is trained on the data.

In the results table from the summarizeImages action, the **Average intensity of B** column shows that the average intensity of the gray-scale channel is 92.79.
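
One way the average could be subtracted is through the offsets option of the input layer. The sketch below is an assumption about how that might look, not the code from DLUS05D01a.sas. The model name SDA_sketch is hypothetical, and the 32x32 single-channel shape is taken from the image size mentioned later in this demo.

```sas
proc cas;
   loadactionset 'deepLearn';

   /* Hypothetical model table used only for this sketch */
   deepLearn.buildModel /
      modelTable={name='SDA_sketch', replace=1}
      type='CNN';

   /* offsets: value subtracted from every pixel of the single gray channel */
   deepLearn.addLayer /
      model='SDA_sketch'
      name='data'
      layer={type='input', nChannels=1, width=32, height=32, offsets={92.79}};
quit;
```

Centering the inputs around zero in this way is a common preprocessing step before gradient-based training.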

- Examine the next PROC CAS step, which shuffles the data, and then run it.

/********************/ /* Shuffle the data */ /********************/ proc cas; table.shuffle / table='Source_and_target_data' casout={name='Source_and_target_shuffled', replace=1}; quit;

Remember, if you're using gradient-based learning methods like stochastic gradient descent or Adam, it's a good idea to shuffle the data. We've just appended two data sets, Cifar-10 and the MNIST_Fashion data, together. So there's a logical order to our data, which we want to remove.

A new CAS table that contains the shuffled images is created.
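
The effect of shuffling can be reproduced on a small local table. This is a minimal Base SAS sketch of the idea, not a description of how table.shuffle works internally. All data set and variable names are hypothetical.

```sas
/* Toy table: first 5 rows from one source, next 5 from another */
data work.appended;
   length source $8;
   do i = 1 to 10;
      if i <= 5 then source = 'cifar';
      else source = 'fashion';
      output;
   end;
run;

/* Attach a random key to each row */
data work.keyed;
   set work.appended;
   call streaminit(12345);      /* fixed seed so the order is reproducible */
   sort_key = rand('uniform');
run;

/* Sorting by the random key removes the source ordering */
proc sort data=work.keyed out=work.shuffled(drop=sort_key);
   by sort_key;
run;

proc print data=work.shuffled;
run;
```

After the sort, rows from the two sources are interleaved, so a mini-batch is no longer drawn from a single source.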

- Examine the next PROC CAS step, which trains the denoising convolutional autoencoder.

/***********************************/ /* Train the denoising autoencoder */ /***********************************/ proc cas; dlTrain / model='SDA' table={name='Source_and_target_shuffled', where='_PartInd_=1'} ValidTable={name='Source_and_target_shuffled', where='_PartInd_=2'} modelWeights={name='ConVTrainedWeights_d', replace=1} bestweights={name='SDA_W1', replace=1} inputs='_image_' GPU=True optimizer={minibatchsize=80, maxepochs=20, algorithm={method='ADAM', beta1=.9, beta2=.999, learningrate=.01}} seed=12345; quit;

This PROC CAS step uses the dlTrain action. Notice the following:

- The training data is **Source_and_target_shuffled**, where _PartInd_=1.
- The model name is SDA.
- We specify the modelWeights data set and the bestWeights data set for dlTrain to create.
- The inputs argument specifies _image_.
- To train using GPUs, we specify gpu=TRUE.
- The validTable argument specifies the data set that we're using, where _PartInd_=2.
- In the optimizer argument, the miniBatchSize argument is set to 80. The algorithm is ADAM. We will train the model for only 20 epochs.
- There is no target statement included in the dlTrain code. The goal of the denoising convolutional autoencoder is to capture patterns detected in both the source and target data, strengthen these patterns, and increase the breadth of relevant patterns learned.

- Run the PROC CAS step. As the step runs, notice that the following two warnings appear:

- The first warning indicates that the batch normalization layer includes a bias term. That's not a big problem. We are not concerned about including another learnable parameter.

- The second warning, however, is concerning.
That warning says that the source layer ConVLayer2a to batch normalization layer BatchLayer1 does not have an identity activation function. This is a mistake. We should always apply batch normalization before applying a nonlinear transformation. So in the convolutional layer that feeds information into the batch layer, the activation function should be set to Identity. However, in this demo, we will not change the code.

**Note:** If you wish, you can see this problem in **DLUS05D01a**. Find the PROC CAS step that creates the models, and look for the layer named BatchLayer1. In the layer just before BatchLayer1, named ConvLayer2a, the specified activation function is ELU. This is a major problem. The activation function should be set to Identity because we need to normalize the information using batch normalization before we apply our nonlinear transformation.
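
The recommended ordering can be sketched as follows. This is a hedged illustration rather than the corrected course code: the model name BN_sketch, the input shape, and the filter settings are hypothetical, and setting includeBias=FALSE on the convolution is an assumed way to address the bias warning.

```sas
proc cas;
   loadactionset 'deepLearn';

   /* Hypothetical model table used only for this sketch */
   deepLearn.buildModel /
      modelTable={name='BN_sketch', replace=1}
      type='CNN';

   deepLearn.addLayer /
      model='BN_sketch'
      name='data'
      layer={type='input', nChannels=1, width=32, height=32};

   /* The convolution passes its linear output through unchanged (identity) */
   deepLearn.addLayer /
      model='BN_sketch'
      name='ConVLayer2a'
      layer={type='CONVO', nFilters=8, width=3, height=3, stride=1,
             act='IDENTITY', includeBias=FALSE}
      srcLayers={'data'};

   /* Batch normalization normalizes first, then applies the nonlinearity */
   deepLearn.addLayer /
      model='BN_sketch'
      name='BatchLayer1'
      layer={type='BATCHNORM', act='ELU'}
      srcLayers={'ConVLayer2a'};
quit;
```

With this arrangement, the ELU transformation is applied to normalized values instead of being applied before normalization.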

- Look at the results of the PROC CAS step.

The Model Information table shows that this model has 170,000 parameters. In the Optimization History table, the validation error column does not show a misclassification rate. Remember that autoencoders are unsupervised models: the inputs are presented at the input layer and also serve as the targets at the output layer. In this case, we're not predicting an outcome class. Instead, we are predicting pixel intensity values, so we're not using the cross-entropy error function.

- Examine the next PROC CAS step, which scores data, and run the step. Look at the results.

/********************************************/ /* Apply the trained denoising autoencoder */ /* to the target (MNIST Fashion) data */ /* and extract the encoders */ /********************************************/ proc cas; dlScore / model='SDA' table={name='fashionShuffled'} initWeights='SDA_W1' layerOut={name='Encoders', replace=1} layers={'BatchLayer', 'data'} layerImageType='JPG' casout={name='ScoredData', replace=1} copyVars={'_Label_','_PartInd_'} ENCODENAME=TRUE gpu=True; quit;

The dlScore action applies the trained denoising convolutional autoencoder to only the target data (**fashionShuffled**) to generate the encoded images. (Remember that this unsupervised model was trained on both the source and target data.)

Notice the following options:

- The initWeights argument references the optimal set of weights derived from the training process.

- The layerOut argument tells dlScore to create a data set named **Encoders**.

- The layers argument tells dlScore to populate the **Encoders** data set with the output from any layers that are specified in the layers option. In this case, the output of two layers is created, the **batch** layer (the feature map) and the **data** layer (the original data). The batch layer contains the encoded images, which are 8x8 gray scale feature maps because the images were downsampled by the architecture of the autoencoder model. The variable that contains the corresponding images is named **_LayerAct_11_IMG_0_**. The output of the data layer contains the original 32x32 gray scale images, which are housed in the variable named **_LayerAct_0_IMG_0_**. Both will be used in the transfer classification model.

**Note:** In **DLUS05D01a.sas**, you can look at the batch layer in the code for the denoising convolutional autoencoder. That convolutional layer has only one filter. Therefore, we're producing only one feature map, one matrix of values. So we will capture that feature map for every image that's scored.

- The layerImageType argument, here, tells dlScore to output a JPEG image.

In the results, the Output CAS Tables table shows that we've scored 15,000 images. Those are all the images for the MNIST_Fashion data.
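
Before training on the extracted layers, it can help to confirm what dlScore wrote to the layerOut table. The following sketch assumes that the **Encoders** table from the step above is still in memory in the active caslib.

```sas
proc cas;
   /* List the columns, including the _LayerAct_ image columns */
   table.columnInfo / table={name='Encoders'};

   /* Count rows in each partition copied over by copyVars */
   simple.freq / table={name='Encoders'} inputs={'_PartInd_'};
quit;
```

The column listing should include the two image variables named in the discussion of the layers argument, along with the copied _Label_ and _PartInd_ columns.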

- Examine the remainder of the code in the program, and then run it. Look at the results.

/*******************************************/ /* Use both the extracted encoders and the */ /* target features to train the model */ /* on the target task. */ /*******************************************/ ods output OptIterHistory=ObjectModeliter; proc cas; dlTrain / model='ConVNN3' table={name='Encoders', where='_PartInd_=1'} ValidTable={name='Encoders', where='_PartInd_=2'} modelWeights={name='ConVTrainedWeights_d', replace=1} bestweights={name='ConVbestweights', replace=1} dataSpecs={ {data={'_LayerAct_0_IMG_0_'}, layer='data1', type='Image'} {data={'_LayerAct_11_IMG_0_'}, layer='data2', type='Image'} {data='_label_', layer='outlayer', type='numericNominal'}} GPU=True optimizer={minibatchsize=80, maxepochs=80, algorithm={method='ADAM', lrpolicy='Step', gamma=0.6, stepsize=10, beta1=0.9, beta2=0.999, learningrate=.01}} seed=12345; quit; /******************************************************************/ /* Store minimum training and validation error in macro variables */ /******************************************************************/ proc sql noprint; select min(FitError) into :Train separated by ' ' from ObjectModeliter; quit; proc sql noprint; select min(ValidError) into :Valid separated by ' ' from ObjectModeliter; quit;

The PROC CAS step trains the last model (the transfer classification model) on the original images as well as the extracted information (that is, the feature maps that we pulled out of the denoising convolutional autoencoder). To use multiple sources of input information, we use the dataSpecs argument to specify the column name in the data that contains the information we're interested in. Within the dataSpecs argument, notice that each data argument specifies the variable name of the information associated with the layer identified in the layer argument. The type argument specifies the type of information the variable contains (IMAGE, NUMERICNOMINAL, TEXT, or OBJECTDETECTION). Last, we need to specify our output layer in the dataSpecs argument. For our output layer, data equals our target variable, _label_. The layer that we're attaching our target information to is named outlayer. The data type is numericNominal.

**Note:**In the PROC CAS code shown in the demo video, the maxepochs argument is set to 60 instead of 80.

The PROC SGPLOT step plots the results of the misclassification rates.

In the results, in the Optimization History table, the validation error represents the validation misclassification rate. The fit error represents the training misclassification rate. In the iteration plot, you can see that both the training and validation performance are marginally better when transfer learning is used. The classification model could be further regularized to improve validation given the improvement in training performance.

Given that our model now fits the training data better, we can see that extracting information from our source data to use with our target data has helped. This enables us to apply more regularization to improve the model's ability to generalize.

## Deep Learning Using SAS® Software

Lesson 05, Section 1 Demo: Conducting Supervised Transfer Learning

In this demonstration, we conduct supervised transfer learning by applying the model that we built on the Cifar-10 data to the MNIST fashion data.

- If you did not perform the steps of the first two transfer learning demonstrations in your current SAS Studio session, open and run the following programs in sequence before you continue: **DLUS05D01a.sas** and **DLUS05D01b.sas**.

- Open the program named **DLUS05D02.sas**. Examine the first two PROC CASUTIL steps, and then run them.

/****************************/ /*Load the Model and Weights*/ /****************************/ proc casutil; load casdata='ConVNN.sashdat' incaslib='imagelib' casout='ConVNN' replace; quit; proc casutil; load casdata='ConVbestweights.sashdat' incaslib='imagelib' casout='ConVbestweights' replace; quit;

When we built and trained our image classification model on the Cifar-10 data in a previous demonstration, we saved both the model and its weights. These two PROC CASUTIL steps load both the model and its weights as CAS tables. We are loading the **ConVNN.sashdat** file onto the CAS server as **ConVNN** from imagelib. We're also loading **ConVbestweights.sashdat** as the CAS table **ConVbestweights**.

- Examine the PROC CAS step and the DATA step, and run these steps. View the results.

/********************************************/ /* View a picture of the model architecture */ /********************************************/ proc cas; ModelInfo / model='ConVNN'; quit; data _NULL_; dcl odsout obj1(); obj1.image(file:'/home/student/LWDLUS/Data/ModelPic.PNG', width: "850", height: "450"); run;

We use these two steps to view the model architecture. In the PROC CAS step, we specify the modelInfo action to print information about the model. The DATA step displays a saved image of the model architecture diagram to jog our memory of this model.

Remember that the ConVNN model had 32 total layers. In the results, you can see the structure of the model.

We will apply this model to the MNIST fashion data. To conduct supervised transfer learning, we'll freeze all the weights in this model up through the first fully connected layer. So the weights from the first convolutional layer through the first fully connected layer will be used essentially for feature extraction on the new MNIST fashion images. Although this model was originally trained on color images and now we are transitioning to gray-scale images, we can still apply supervised transfer learning. Behind the scenes, SAS automatically duplicates the single gray-scale channel into three identical channels to mimic color images.

- Examine the ODS OUTPUT statement, the PROC CAS step with the dlTrain action, and the next three PROC steps. Then run this code and view the results.

/*****************/ /*Train the Model*/ /*****************/ ods output OptIterHistory=ObjectModeliter; proc cas; dlTrain / model='ConVNN' table={name='fashionShuffled', where='_PartInd_=1'} ValidTable={name='fashionShuffled', where='_PartInd_=2'} modelWeights={name='ConVTrainedWeights_2', replace=1} bestweights={name='ConVbestweights_2', replace=1} initWeights = 'ConVbestweights' inputs='_image_' target='_label_' nominal={'_label_'} GPU=True optimizer={minibatchsize=500, freezeLayersTo='BatchLayerFC1' maxepochs=60 algorithm={method='ADAM', lrpolicy='Step', gamma=0.6, stepsize=10, beta1=0.9, beta2=0.999, learningrate=.01}} seed=12345; quit; /*******************************************************************/ /* Store minimum training and validation error in macro variables */ /*******************************************************************/ proc sql noprint; select min(FitError) into :Train separated by ' ' from ObjectModeliter; quit; proc sql noprint; select min(ValidError) into :Valid separated by ' ' from ObjectModeliter; quit; /* Plot Performance */ proc sgplot data=ObjectModeliter; yaxis label='Misclassification Rate' MAX=.9 min=0; series x=Epoch y=FitError / CURVELABEL="&Train" CURVELABELPOS=END; series x=Epoch y=ValidError / CURVELABEL="&Valid" CURVELABELPOS=END; run;

The PROC CAS step uses the dlTrain action to train the model. Note the following:

- The model is ConVNN, the model we just loaded back into memory.

- The training data and validation data are partitions 1 and 2 of the **fashionShuffled** CAS table, respectively.

- We'll save both the model weights and the best weights for scoring. For the initial weights, we provide the previously trained weights that we also just loaded back into the CAS server. These are the weights that will be used for feature extraction and will be partially frozen.

- As expected, the inputs argument is set to _image_, the target and nominal arguments are set to _label_, and the gpu argument is set to TRUE.

- Most of the optimizer and algorithm options are set to the same values as when we first fit this model on the Cifar-10 data. For example, maxEpochs is set to 60 and we're using ADAM optimization and the step learning rate policy.

- MiniBatchSize is now 500, which is different from the value used earlier.
**Note:** The value of miniBatchSize was changed in the program after the demo video was recorded, so the video shows a different value.

- We now include the freezeLayersTo argument and set it equal to BatchLayerFC1. This tells SAS to freeze all the weights in the initial weights argument, from the beginning through the layer that we named BatchLayerFC1 in the network architecture, which corresponds to the layer before the final, or second, fully connected layer. Effectively, the frozen weights will act as a feature extraction utility, and the fully connected layer that is not frozen will be optimized to learn how to classify the MNIST fashion data.
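
The step learning rate policy mentioned in the list above can be sketched in a DATA step. The rule used here, multiplying the rate by gamma once for every completed block of stepSize epochs, is an assumption about how the STEP policy behaves. The values of learningRate, gamma, stepSize, and maxEpochs are taken from the dlTrain call; the data set name is hypothetical.

```sas
data work.step_lr;
   base_rate = 0.01;   /* learningrate in the dlTrain call */
   gamma     = 0.6;    /* gamma in the dlTrain call */
   stepsize  = 10;     /* stepsize in the dlTrain call */
   do epoch = 0 to 59; /* maxepochs=60 */
      /* Assumed rule: one multiplication by gamma per completed step */
      rate = base_rate * gamma**floor(epoch / stepsize);
      output;
   end;
run;

/* Show the rate at the start of each step */
proc print data=work.step_lr(where=(mod(epoch, 10) = 0)) noobs;
   var epoch rate;
run;
```

Under this assumption, the rate starts at 0.01 and falls to roughly 0.00078 for the last ten epochs.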

The next three PROC steps (two PROC SQL steps and a PROC SGPLOT step) plot the training and validation error across the optimization history.

In the results, in the Model Information table, we see again that this model has 32 total layers and approximately 5.1 million weights. However, notice that 4.8 million weights were frozen, so we actually trained only approximately 300,000 weights. Below the Optimization History table, its corresponding series plot shows that the training error and validation error seem to decrease almost in unison and the final validation error is about 14%.

Based on this validation error, this model does not outperform the previous model that we ran on the MNIST fashion data. However, this model has a few advantages. First, it was easy to apply because we didn't need to design and code a new model. Second, we had to train only about 300,000 weights even though the model used over 5.1 million weights. So if you have a good working model on one data set, supervised transfer learning enables you to easily apply that model to a new data set in order to quickly generate a reasonable baseline.
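
The weight counts quoted above imply that only a small share of the model is actually trained. The following sketch does the arithmetic with the rounded figures from the text, so the result is approximate.

```sas
data _null_;
   total_weights  = 5.1e6;   /* approximate total from the Model Information table */
   frozen_weights = 4.8e6;   /* approximate number of frozen weights */
   trainable = total_weights - frozen_weights;
   share     = trainable / total_weights;
   put 'Trainable weights: ' trainable comma12.;
   put 'Share of all weights: ' share percent8.1;
run;
```

With these rounded inputs, about 300,000 weights, or roughly 6 percent of the model, are updated during training.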

- Examine the remaining code and then run it. View the results.

/**************************************/ /* Score data using the trained model */ /**************************************/ proc cas; dlScore / model='ConVNN' table={name='fashionShuffled', where='_PartInd_=2'} initWeights='ConVbestweights_2' casout={name='ScoredData', replace=1} copyVars='_Label_' ENCODENAME=TRUE gpu=True; quit; proc print data=mycas.ScoredData (obs=20); run; /***********************************/ /* Create misclassification counts */ /***********************************/ data work.MISC_Counts; set mycas.ScoredData; if trim(left(_label_)) = trim(left(I__label_)) then Misclassified_count=0; else Misclassified_count=1; run; /****************************************************/ /* Sum misclassification counts at the target level */ /****************************************************/ proc sql; create table work.AssessModel as select distinct _label_, sum(Misclassified_count) as number_MISC from work.MISC_Counts group by _label_; quit; /*****************************************************/ /* Plot each target level's misclassification counts */ /*****************************************************/ proc sgplot data=work.AssessModel; vbar _label_ / response=number_MISC; yaxis display=(nolabel) grid; xaxis display=(nolabel); run;

The PROC CAS step uses the dlScore action to score the new supervised transfer learning model on new data and generate performance information.

In the results, the scoring output from dlScore shows that we scored 3000 new observations with a misclassification rate of 14.7 percent. At the end of the results, the bar chart shows that this model actually did a really good job of classifying some of the fashion categories even though the model architecture was not built explicitly for this data set.

Although the misclassification error is larger than that of the unsupervised transfer learning model, this methodology gives you an alternative to fitting and optimizing a new model. The supervised transfer learning model saves the modeler time and uses a more efficient model with frozen weights, and the model is competitive with the previous model.
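
In addition to per-class counts, an overall rate and a cross-tabulation of actual by predicted class can be useful. The sketch below uses a small made-up table so that it runs on its own. In the demo, the same steps could be pointed at **work.MISC_Counts**, assuming that it contains _label_, I__label_, and Misclassified_count as created above. The class names are hypothetical.

```sas
/* Hypothetical scored rows; in the demo these come from mycas.ScoredData */
data work.toy_scored;
   length _label_ I__label_ $12;
   input _label_ $ I__label_ $;
   /* 1 when the predicted label differs from the actual label */
   Misclassified_count = (strip(_label_) ne strip(I__label_));
   datalines;
Shirt Shirt
Shirt Coat
Coat Coat
Bag Bag
Sneaker Boot
Sneaker Sneaker
;
run;

/* Overall misclassification rate */
proc sql;
   select mean(Misclassified_count) as misc_rate format=percent8.1
   from work.toy_scored;
quit;

/* Actual class (rows) by predicted class (columns) */
proc freq data=work.toy_scored;
   tables _label_ * I__label_ / norow nocol nopercent;
run;
```

The cross-tabulation shows which categories are confused with one another, which the bar chart of counts alone does not reveal.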

## Deep Learning Using SAS® Software

Lesson 05, Section 2 Demo: Creating a Customized Learning Rate Policy Using FCMP

In this demonstration, we're going to use FCMP to customize our deep learning model's learning rate schedule. That is, we're going to create a customized learning rate policy. FCMP is a powerful tool for building customized learning rate policies, error functions, activation functions, and hidden layers.

Learning rates can have a significant impact on model performance. I recommend using a learning rate schedule to modify the learning rate when you train the model. SAS offers a wide selection of default learning rate policies, including fixed, inverse (INV), multistep, polynomial (POLY), and step. However, some problems might require further customization of the learning rate schedule, and you can do that using FCMP.

- If you did not perform the steps of the three earlier transfer learning demonstrations in your current SAS Studio session, open and run the following programs in sequence before you continue: **DLUS05D01a.sas**, **DLUS05D01b.sas**, and **DLUS05D01c.sas**.

- Open the program named **DLUS05D03.sas**. Examine the code.

/**********************************************/ /* Create the Customized Learning Rate Policy */ /**********************************************/ proc cas; setsessopt / cmplib='imagelib.mylrdef'; fcmpact.addRoutines routineCode={ "function CusLR(iterNum, rate, batch, gamma); rate = rate/gamma; rate = max(rate, 1e-12); if iterNum = 10 then rate = .5; else if iterNum = 20 then rate = .9; else if iterNum = 30 then rate = 1; else if iterNum = 80 then rate = 1.5; else if iterNum = 110 then rate = 1.7; else if iterNum = 130 then rate = 1.9; else; return(rate); endsub;"} package="mypkg" saveTable=True funcTable={name="mylrdef" caslib="imagelib" replace=True}; quit; /**********************************************************/ /* Fit the Model with the Customized Learning Rate Policy */ /**********************************************************/ ods output OptIterHistory=ObjectModeliter; proc cas; dlTrain / model='ConVNN3' table={name='Encoders', where='_PartInd_=1'} ValidTable={name='Encoders', where='_PartInd_=2'} modelWeights={name='ConVTrainedWeights_d', replace=1} bestweights={name='ConVbestweights', replace=1} dataSpecs={ {data={'_LayerAct_0_IMG_0_'}, layer='data1', type='Image'} {data={'_LayerAct_11_IMG_0_'}, layer='data2', type='Image'} {data='_label_', layer='outlayer', type='numericNominal'}} GPU=True optimizer={minibatchsize=80, loglevel=3, maxepochs=160, algorithm={method='Momentum', gamma=1.006, fcmplearningrate='CusLR', learningrate=.01}} seed=12345; quit; /******************************************************************/ /* Store minimum training and validation error in macro variables */ /******************************************************************/ proc sql noprint; select min(FitError) into :Train separated by ' ' from ObjectModeliter; quit; proc sql noprint; select min(ValidError) into :Valid separated by ' ' from ObjectModeliter; quit; /* Plot Performance */ proc sgplot data=ObjectModeliter; yaxis label='Misclassification Rate' MAX=.9 min=0; 
series x=Epoch y=FitError / CURVELABEL="&Train" CURVELABELPOS=END; series x=Epoch y=ValidError / CURVELABEL="&Valid" CURVELABELPOS=END; run;

The first PROC CAS step begins by setting the CMPLIB session option. Then it calls the addRoutines action from the FCMP action set. In the routineCode argument, note the following:

- We begin by specifying the function that we're going to create: CusLR.
- The function declares several arguments: iterNum, rate, batch, and gamma. If you're a SAS programmer, you can think of these arguments as similar to macro variables, in that they resolve to particular values.
- We specify that the rate equals the rate divided by gamma. (The rate and gamma values are determined by the dlTrain code in the next PROC CAS step.) So every iteration, we reduce the learning rate by dividing it by gamma. The presence of the batch argument means that the learning rate is divided by the gamma value of 1.006 each batch. If the batch argument were removed, the learning rate would be reduced every epoch instead. Periodically, the learning rate is reset to a higher value. That is, at epochs 10, 20, 30, 80, 110, and 130, the learning rate is drastically increased (shocked) before being gradually reduced again. The reset values grow from 0.5 to 1.9, so we actually get more aggressive as the model learns the data better.

In the second PROC CAS step, which calls the dlTrain action, note the following:

- The fcmpLearningRate argument specifies our customized learning rate, which is CusLR.
- In the learningRate argument, we still include a learning rate value. This initial learning rate is then adjusted according to the function created by the addRoutines action.
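
To get a feel for the shape of this schedule, the CusLR logic can be replayed in an ordinary DATA step. This is a minimal sketch that assumes the function is called once per value of iterNum from 1 to 160; it does not model the per-batch behavior described above. The starting rate and gamma are taken from the dlTrain call, and the data set name is hypothetical.

```sas
data work.lr_schedule;
   rate  = 0.01;    /* initial learningRate from the dlTrain call */
   gamma = 1.006;   /* gamma from the dlTrain call */
   do iterNum = 1 to 160;
      /* Same statements as in the CusLR routine code */
      rate = rate / gamma;
      rate = max(rate, 1e-12);
      if      iterNum = 10  then rate = 0.5;
      else if iterNum = 20  then rate = 0.9;
      else if iterNum = 30  then rate = 1;
      else if iterNum = 80  then rate = 1.5;
      else if iterNum = 110 then rate = 1.7;
      else if iterNum = 130 then rate = 1.9;
      output;
   end;
run;

/* Plot the replayed schedule */
proc sgplot data=work.lr_schedule;
   series x=iterNum y=rate;
   xaxis label='Iteration (iterNum)';
   yaxis label='Learning rate';
run;
```

The plot shows a sawtooth pattern: a slow decay between resets, with each reset jumping to a higher value than the one before.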

- Run the program and view the results.

In the Results window, review the Model Information table and the optimization history of the deep learning model.

An iteration plot displays the validation and training misclassification rates. Notice that every time we shock the learning rate by increasing its value significantly, the model's training and validation performance converge. Some research papers have shown that cyclical learning rates can perform well. In this instance, however, shocking the learning rate has not helped: the overall performance is not superior to that of the model trained earlier in the lesson, which used the STEP policy provided by SAS. So we might want to choose a different learning rate policy.