SAS® Programming 2: Data Manipulation Techniques
Lesson 01, Section 1 Demo: DATA Step Processing in Action
A great way to learn how data is processed in the execution phase is to watch it happen statement by statement and row by row. In this demo, we use the DATA step debugger in SAS Enterprise Guide to observe the process of execution.
Note: This demo must be performed in Enterprise Guide.
- Open p201d01.sas from the demos folder.
data storm_complete;
    set pg2.storm_summary_small;
    length Ocean $ 8;
    drop EndDate;
    where Name is not missing;
    Basin=upcase(Basin);
    StormLength=EndDate-StartDate;
    if substr(Basin,2,1)="I" then Ocean="Indian";
    else if substr(Basin,2,1)="A" then Ocean="Atlantic";
    else Ocean="Pacific";
run;
- The DATA step markers for debugging (Debug) toolbar button enables debugging in the program. If this option is enabled, you see the same icon and a green bar next to each DATA step in your program.
- Click the Debugger icon next to the DATA statement. The DATA Step Debugger window appears.
- At this point, the compilation phase is complete. The PDV is displayed on the right side of the window. It has three columns: Variable, Value, and Watch. Notice that all columns read from the storm_summary_small table start with a missing value.
- Two additional columns are included in the PDV during execution. _ERROR_ is 0 by default but is set to 1 whenever a data error is encountered, such as a value that cannot be read or calculated. _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times that the DATA step has iterated.
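As a quick sketch outside the debugger (hypothetical code, not part of the demo program), a PUTLOG statement can write these automatic variables to the log on each iteration:

data _null_;
    set pg2.storm_summary_small;
    /* Write the automatic variables to the log for every DATA step iteration */
    putlog _N_= _ERROR_=;
run;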
- Click the Step execution to next line toolbar button to execute the highlighted SET statement and then highlight the next executable statement. Recall that during the compilation phase, the WHERE statement established a rule to read rows into the PDV only where Name is not missing. The first two rows of the input table have missing values for Name, so the third row is read. However, because this is the first iteration of the DATA step, _N_ is still equal to 1. Values for the columns Name, Basin, MaxWind, StartDate, and EndDate are assigned in the PDV.
- The assignment statement for Basin is the next executable statement. LENGTH, DROP, and WHERE are compile-time statements, so those are skipped in the execution phase. Click Step execution to next line twice to execute the Basin and StormLength assignment statements. Notice that Basin was already in uppercase and did not change, but a value of 6 was assigned to StormLength.
- Click Step execution to next line to execute the IF, ELSE IF, and ELSE statements. After line 10, Pacific is assigned to Ocean.
- With the RUN statement highlighted, click Step execution to next line. As the concluding step boundary for the DATA step, the RUN statement triggers an implicit output. The values in the PDV are written as the first row in storm_complete. After the implicit output, the process returns to the top of the DATA step.
- Notice that _N_ is now 2, representing the second iteration of the DATA step. Columns read from the SET table retain their values. However, the new computed columns, Ocean and StormLength, are reset to missing. This action is called reinitializing the PDV.
- Click Step execution to next line to step through the program until line 8. Notice that the value of Basin is SI, so the IF condition is true. Execute the IF statement, and SAS assigns Indian to the Ocean column. The remaining ELSE statements are skipped and RUN is highlighted.
- Execute the RUN statement. _N_ is increased to 3, and the PDV is reinitialized.
- Click Start/continue debugger execution to proceed through the rest of execution. Close the DATA step debugger.
Note: Red text in the Value column represents data values that were updated with the execution of the previously highlighted statement. This makes it easy to see what's changed in the PDV.
Note: While debugging a program, the output table is not created. When the program runs outside of the debugger, the implicit output writes rows to the output table.
Note: With the DATA step markers for debugging (Debug) toolbar button toggled on, the DATA step debugger (the green bug) is available by default for any DATA step in all programs. To suppress the debugger icon in the editor, click the DATA step markers for debugging (Debug) toolbar button to toggle it off.
SAS® Programming 2: Data Manipulation Techniques
Lesson 01, Section 2 Demo: Directing Row Output
In this demo, we create multiple output tables in a single DATA step and use IF-THEN/ELSE logic to designate which rows are written to which tables.
- Open p201d02.sas from the demos folder and find the Demo section.
data storm_complete;
    set pg2.storm_summary;
    length Ocean $ 8;
    Basin=upcase(Basin);
    StormLength=EndDate-StartDate;
    MaxWindKM=MaxWindMPH*1.60934;
    if substr(Basin,2,1)="I" then Ocean="Indian";
    else if substr(Basin,2,1)="A" then Ocean="Atlantic";
    else Ocean="Pacific";
run;
- Modify the DATA statement to create three tables named indian, atlantic, and pacific.
data indian atlantic pacific;
- Modify the IF-THEN/ELSE conditional statements to write output to the appropriate table.
data indian atlantic pacific;
    set pg2.storm_summary;
    length Ocean $ 8;
    Basin=upcase(Basin);
    StormLength=EndDate-StartDate;
    MaxWindKM=MaxWindMPH*1.60934;
    if substr(Basin,2,1)="I" then do;
        Ocean="Indian";
        output indian;
    end;
    else if substr(Basin,2,1)="A" then do;
        Ocean="Atlantic";
        output atlantic;
    end;
    else do;
        Ocean="Pacific";
        output pacific;
    end;
run;
- Run this program and view the log. Verify that the three tables, Indian, Atlantic, and Pacific, were all created. Each table has 10 variables or columns, and a different number of rows.
- Suppose we don't want MaxWindMPH in these tables. Add a DROP statement to remove MaxWindMPH.
data indian atlantic pacific;
    set pg2.storm_summary;
    length Ocean $ 8;
    Basin=upcase(Basin);
    StormLength=EndDate-StartDate;
    MaxWindKM=MaxWindMPH*1.60934;
    if substr(Basin,2,1)="I" then do;
        Ocean="Indian";
        output indian;
    end;
    else if substr(Basin,2,1)="A" then do;
        Ocean="Atlantic";
        output atlantic;
    end;
    else do;
        Ocean="Pacific";
        output pacific;
    end;
    drop MaxWindMPH;
run;
- Highlight the DATA step, run the selected code, and view the log.
Now each table has 9 variables or columns because MaxWindMPH has been dropped from all three tables.
SAS® Programming 2: Data Manipulation Techniques
Lesson 01, Section 2 Demo: Directing Column Output
In this demo, we control which columns are read into the PDV and which are written to the output tables with the DROP= or KEEP= data set options.
Note: This demo must be performed in Enterprise Guide to take advantage of the DATA step debugger to see what's going on behind the scenes.
- Open p201d03.sas from the demos folder. This is the program we finished with in the last demonstration. It creates a separate table for each ocean: indian, atlantic, and pacific. We're also calculating a new column, MaxWindKM, so we have wind measurements in both miles per hour and kilometers per hour.
data indian atlantic pacific;
    set pg2.storm_summary;
    length Ocean $ 8;
    Basin=upcase(Basin);
    StormLength=EndDate-StartDate;
    MaxWindKM=MaxWindMPH*1.60934;
    if substr(Basin,2,1)="I" then do;
        Ocean="Indian";
        output indian;
    end;
    else if substr(Basin,2,1)="A" then do;
        Ocean="Atlantic";
        output atlantic;
    end;
    else do;
        Ocean="Pacific";
        output pacific;
    end;
run;
- Depending on the table, we want to keep one or both of the wind measurement columns. Use the DROP= data set option to drop MaxWindMPH from the indian table and MaxWindKM from the atlantic table. Do not drop any columns from the pacific table.
data indian(drop=MaxWindMPH) atlantic(drop=MaxWindKM) pacific;
- Start the DATA step debugger. Note that MaxWindMPH and MaxWindKM are included in the PDV, so both columns and their values will be available for processing in the DATA step.
- Close the debugger, run the program, and examine the three output tables. MaxWindMPH has been dropped from the indian table, MaxWindKM has been dropped from the atlantic table, and the pacific table has all columns.
- Next, we want to drop MinPressure from all three tables, and we don't need MinPressure for any of the processing of the DATA step. Add a DROP= data set option in the SET statement to drop MinPressure. Start the debugger. Notice that MinPressure is not included in the PDV.
set pg2.storm_summary(drop=MinPressure);
Because the DROP= data set option is on the SET statement, MinPressure is not read into the PDV.
- Close the debugger, run the program, and examine the three output tables. Confirm that MinPressure has been dropped from each table.
Note: The placement of the DROP= and KEEP= data set options controls what data is read into the PDV, as well as what is written out.
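As a minimal sketch of the KEEP= counterpart (a hypothetical step named storm_wind; the column list is only illustrative), KEEP= on the SET statement reads only the listed columns into the PDV:

data storm_wind;
    /* Only the columns listed in KEEP= are read into the PDV */
    set pg2.storm_summary(keep=Name Basin MaxWindMPH);
run;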
SAS® Programming 2: Data Manipulation Techniques
Lesson 02, Section 1 Demo: Creating an Accumulating Column
In this demo, we look at daily rainfall totals in Houston, Texas, during 2017 and create a new column that stores a year-to-date rainfall total.
Note: This demo must be performed in Enterprise Guide.
- Open p202d01.sas from the demos folder.
Notice the assignment statement for YTDRain, or year-to-date rain. We want to take the previous value of YTDRain and add it to the value of DailyRain for each row. Will it work?
data houston_rain;
    set pg2.weather_houston;
    keep Date DailyRain YTDRain;
    YTDRain=YTDRain+DailyRain;
run;
- Run the program and view the output table. Notice that the values for YTDRain are all missing.
- To determine why YTDRain is missing, open the DATA step debugger in Enterprise Guide. Click the Step execution to next line toolbar button to execute the highlighted SET statement. The first row from weather_houston is loaded into the PDV.
- Notice that the YTDRain assignment statement will add a missing value and .01. In SAS, when we build an arithmetic expression like this with the plus operator, a missing value plus a known value is always missing. So that's the first problem. This prevents you from creating the accumulating column.
- What if we initialize YTDRain as 0? The DATA step debugger enables you to edit values in the PDV for the purpose of testing. In the PDV area of the DATA step debugger, double-click the missing value for YTDRain and change it to 0. Click Step execution to next line to execute the YTDRain assignment statement, and notice that the new value is 0.01. So the value of YTDRain is now accurate. 0 plus 0.01 gives us 0.01.
- Now let's look at iteration number 2. Click Step execution to next line to advance past the RUN statement. This results in the implicit output of the contents of the PDV to the output table and an implicit return to the top of the DATA step. Remember, by default, when we come back to the top of the DATA step, all new or computed columns are reset to missing. So YTDRain is reset to a missing value.
- But we need YTDRain to hold the value from the previous iteration of the DATA step. So it would need to be 0.01. Double-click the missing value for YTDRain and enter .01, which is the value from the previous row. Step through execution in the second iteration and notice that YTDRain is 1.3, the accumulation of day 1 and day 2.
- Exit the debugger.
Note: By default, all computed columns are reset to missing each time that the PDV is reinitialized.
If we could solve the two issues programmatically in our DATA step, then we could create the accumulating column we need.
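As a preview of the next demo, a sum statement is one way to handle both issues; this sketch modifies the DATA step from this demo, replacing the assignment statement with the sum statement the next demo uses:

data houston_rain;
    set pg2.weather_houston;
    keep Date DailyRain YTDRain;
    /* Sum statement: YTDRain is initialized to 0, retained across iterations, */
    /* and missing DailyRain values are ignored                                */
    YTDRain+DailyRain;
run;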
SAS® Programming 2: Data Manipulation Techniques
Lesson 02, Section 2 Demo: Identifying the First and Last Row in Each Group
In this demo, we use the DATA step debugger in Enterprise Guide to observe how the FIRST./LAST. variables are assigned values in the PDV during execution.
Note: This demo must be performed in Enterprise Guide.
- Open p202d02.sas from the demos folder.
proc sort data=pg2.storm_2017 out=storm2017_sort(keep=Basin Name);
    by Basin;
run;

data storm2017_max;
    set storm2017_sort;
    by Basin;
    *First_Basin=first.basin;
    *Last_Basin=last.basin;
run;
- Run the PROC SORT step to create a temporary table named storm2017_sort that groups the rows by Basin. View the sorted table, work.storm2017_sort, and note that in the first 20 rows the value of Basin is EP.
- Return to the program. The DATA step reads the sorted table, and groups the processing by Basin.
- Start the DATA step debugger and examine the contents of the PDV. Because there is a BY statement included, the variables First.Basin and Last.Basin are included in the PDV. These variables are temporary, so they are automatically dropped before writing each row to the output table.
- Click Step execution to next line to execute the SET statement. The first row from the input table is loaded into the PDV. Because it is the first row in the Basin=EP group, First.Basin is 1. SAS is able to look ahead to the next sequential row in the input table and determine it is not the last occurrence of EP, so Last.Basin is 0.
- After the SET statement executes, SAS skips to the RUN statement. The BY statement is a compile-time statement that adds the First./Last. variables to the PDV. Click Step execution to next line again to advance past the RUN statement.
- Execute the SET statement for the second iteration. The group value is still EP, so First.Basin and Last.Basin are both 0 because it is not the first or last occurrence of EP in the group.
- Click the Watch check box next to Last.Basin. This sets a watchpoint for Last.Basin, so whenever its value changes, execution pauses and a notification appears in the Debug Console.
- Click the Start/continue debugger execution button to execute the program until the value of Last.Basin changes. Note that the debugger stops when _N_ is 20. This is the last row in the EP group, so First.Basin is 0 and Last.Basin is 1.
- Click the Start/continue debugger execution button to proceed through execution until Last.Basin changes again. Notice that when _N_ is 21, the group value for Basin changes to NA, and First.Basin is 1.
- Exit the debugger.
- Because the First.Basin and Last.Basin variables are temporary, they are not included in the output table. Uncomment the two assignment statements to assign the values to permanent columns to view their values for each row.
- Run the program and view the output table. Examine the values for the First_Basin and Last_Basin columns. At row 21, First_Basin is equal to 1, and at row 37, Last_Basin is 1; each of these values indicates when a group begins or ends.
Note: Typically you don't save the FIRST. and LAST. values permanently, but we did so in this demo as an alternate way to show the values for these temporary variables. They're very useful when you want to perform actions depending on when a group begins or ends.
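For example (a hypothetical step, not part of the demo program), a subsetting IF on First.Basin keeps a single row per group:

data basin_first;
    set storm2017_sort;
    by Basin;
    /* Output only the first row in each Basin group */
    if first.Basin;
run;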
SAS® Programming 2: Data Manipulation Techniques
Lesson 02, Section 2 Demo: Creating an Accumulating Column within Groups
In this demo, we use weather data for Houston and First. and Last. variables to create an accumulating column named MTDRain, or Month-to-Date Rain, and reset it each time a new month begins.
- Open p202d03.sas from the demos folder and find the Demo section.
data houston_monthly;
    set pg2.weather_houston;
    keep Date Month DailyRain YTDRain;
    YTDRain+DailyRain;
run;
This program reads the pg2.weather_houston table and creates a data set that includes a new column called YTDRain. It's an accumulating column that uses a sum statement to add DailyRain for each row, summing the DailyRain values for the year.
- Submit the DATA step and view the output table. Notice that YTDRain is an accumulating column that creates a running total of DailyRain. Also notice that the data is sorted by Month and Date.
- We'll modify the DATA step to create an accumulating column named MTDRain, or Month-to-Date Rain, and use the First. variable to reset it each time a new month begins.
data houston_monthly;
    set pg2.weather_houston;
    keep Date Month DailyRain MTDRain;
    by Month;
    if First.Month=1 then MTDRain=0;
    MTDRain+DailyRain;
run;
- The BY statement processes the rows by groups based on the values of Month.
- We changed the accumulating column YTDRain to MTDRain in the KEEP and sum statements.
- The IF statement resets MTDRain to 0 each time that SAS reaches the first row within a new Month group.
- Submit the DATA step and view the output table.
The accumulating column MTDRain looks great. The values accumulate through the month of January, and at the beginning of February MTDRain is equal to 0.
SAS® Programming 2: Data Manipulation Techniques
Lesson 03, Section 2 Demo: Using Numeric Functions
In this demo, we use numeric functions to create an anonymous student report that includes the top three quiz scores.

RAND('distribution', parameter1, ..., parameterk)
LARGEST(k, value-1 <, value-2 ...>)
ROUND(number <, rounding-unit>)
- Open p203d01.sas from the demos folder and find the Demo section.
data quiz_analysis;
    set pg2.class_quiz;
    drop Quiz1-Quiz5;
    Quiz1st=largest(1, of Quiz1-Quiz5);
run;
In this DATA step, the assignment statement creates a new column, Quiz1st, using the LARGEST function to get the largest or maximum score from the columns Quiz1 through Quiz5. Notice we're using the keyword OF to reference the range of columns Quiz1-Quiz5.
- Copy and paste the Quiz1st assignment statement twice and modify the statements to create columns named Quiz2nd and Quiz3rd and assign the second and third largest quiz scores.
data quiz_analysis;
    set pg2.class_quiz;
    drop Quiz1-Quiz5;
    Quiz1st=largest(1, of Quiz1-Quiz5);
    Quiz2nd=largest(2, of Quiz1-Quiz5);
    Quiz3rd=largest(3, of Quiz1-Quiz5);
run;
- Create a new column named Top3Avg that uses the MEAN function with the top three quiz scores as the arguments.
data quiz_analysis;
    set pg2.class_quiz;
    drop Quiz1-Quiz5;
    Quiz1st=largest(1, of Quiz1-Quiz5);
    Quiz2nd=largest(2, of Quiz1-Quiz5);
    Quiz3rd=largest(3, of Quiz1-Quiz5);
    Top3Avg=mean(Quiz1st, Quiz2nd, Quiz3rd);
run;
- We don't want any identifying information in the report, so modify the DROP statement to drop Name from the output data set.
drop Quiz1-Quiz5 Name;
- We need to give each student a random ID, a four-digit number that they can use to identify their scores. So before the SET statement, we create a new column named StudentID and use the RAND function with 'integer' as the first argument to generate random integers. It generates integers between the values specified in the second and third arguments. To create a four-digit number, we use 1000 as the lower limit and 9999 as the upper limit.
Note: Because the assignment statement is before the SET statement, StudentID will be the first column added to the PDV and the leftmost column in the output table.
data quiz_analysis;
    StudentID=rand('integer',1000,9999);
    set pg2.class_quiz;
    drop Quiz1-Quiz5 Name;
    Quiz1st=largest(1, of Quiz1-Quiz5);
    Quiz2nd=largest(2, of Quiz1-Quiz5);
    Quiz3rd=largest(3, of Quiz1-Quiz5);
    Top3Avg=mean(Quiz1st, Quiz2nd, Quiz3rd);
run;
- Run the code and view the output table. The quiz_analysis table includes the random numbers, the top three quiz scores, and Top3Avg, but some scores have quite a bit of precision.
Note: Because the numbers for StudentID are randomly assigned, your output for the StudentID column might differ.
- Modify the Top3Avg assignment statement to use the ROUND function to round the values returned by the MEAN function to the nearest integer.
Top3Avg=round(mean(Quiz1st, Quiz2nd, Quiz3rd));
- Add a rounding unit as the second argument to the ROUND function. Use .1 to round to the nearest tenth.
data quiz_analysis;
    StudentID=rand('integer',1000,9999);
    set pg2.class_quiz;
    drop Quiz1-Quiz5 Name;
    Quiz1st=largest(1, of Quiz1-Quiz5);
    Quiz2nd=largest(2, of Quiz1-Quiz5);
    Quiz3rd=largest(3, of Quiz1-Quiz5);
    Top3Avg=round(mean(Quiz1st, Quiz2nd, Quiz3rd), .1);
run;
- Run the code and view the output table. The new table looks good. The decimal places are rounded to the nearest tenth.
SAS® Programming 2: Data Manipulation Techniques
Lesson 03, Section 2 Demo: Shifting Date Values Based on an Interval
We want to identify dates before or after major storms to compare property values. In this demo, we create a new column called AssessmentDate using the INTNX function to shift dates based on when the storm began.

INTNX('interval', start, increment <, 'alignment'>)
- Open p203d02.sas from the demos folder and find the Demo section.
data storm_damage2;
    set pg2.storm_damage;
    keep Event Date AssessmentDate;
    AssessmentDate=intnx('month', Date, 0);
    format Date AssessmentDate date9.;
run;
We'll create a new column called AssessmentDate and use the INTNX function to shift dates based on when the storm began. The assignment statement for AssessmentDate calls the INTNX function with three arguments. The first argument is the interval or time period, month. The second argument is the start date, and the third argument is the increment.
- Run the code and view the output table. What value is returned from INTNX? We're passing the Date the storm began, so AssessmentDate has been shifted to the first of the same month.
- To see the impact of the various arguments in the INTNX function, modify the arguments as directed. Run the code and examine the results after each modification.
- Change the increment value to 2.
AssessmentDate=intnx('month', Date, 2);
View the output table. Notice now that the AssessmentDate has shifted forward by two months, but the date within the month is still returned as the first.
- Change the increment value to -1 and add 'end' as the optional fourth argument to specify alignment.
AssessmentDate=intnx('month', Date, -1, 'end');
View the output table. The -1 shifts the date back to the previous month, and 'end' aligns to the end of the interval, so it returns the last day of the previous month.
- Change the alignment argument to 'middle'.
AssessmentDate=intnx('month', Date, -1, 'middle');
View the output table. All of the dates are either the 15th or the 16th, depending on how many days are in the month. These are the AssessmentDate values we want in the output table.
- Write an assignment statement to create a new column named Anniversary that is the date of the 10-year anniversary for each storm. Use the INTNX function with 'year' as the interval, and 10 as the increment to advance the dates by 10 years. Use 'same' as the optional fourth argument to specify alignment. Keep the new column in the output table and use the DATE9. format to display the values.
data storm_damage2;
    set pg2.storm_damage;
    keep Event Date AssessmentDate Anniversary;
    AssessmentDate=intnx('month', Date, -1, 'middle');
    Anniversary=intnx('year', Date, 10, 'same');
    format Date AssessmentDate Anniversary date9.;
run;
- Run the code and view the output table. In the output table, AssessmentDate is the middle of the previous month, and Anniversary is the same day and month as the storm date but 10 years in the future.
SAS® Programming 2: Data Manipulation Techniques
Lesson 03, Section 3 Demo: Using Character Functions to Extract Words from a String
In this demo, we use the SCAN function to extract words from a character column, and then use the PROPCASE function to standardize casing.

SCAN(string, n <, 'delimiters' >)
PROPCASE(string <, 'delimiters'>)
- Open p203d03.sas from the demos folder.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=scan(Location, 1);
    Prefecture=scan(Location, 2);
run;
This DATA step reads the pg2.weather_japan table and calls the COMPBL function to remove extra blanks from the Location column. Then we create two new columns. We use the SCAN function to read the first word from Location and assign it to City, then read the second word from Location and assign it to Prefecture.
- Run the code and view the output table. In the first several rows, City and Prefecture look good. But in row 8 the city name should be MIYAKE-JIMA. However, the hyphen is a default delimiter, so MIYAKE is assigned to City and JIMA is assigned to Prefecture.
- How do we get the entire city name assigned to City, and Tokyo assigned to Prefecture? In both SCAN functions, add ',' as the third argument to specify that the only delimiter is a comma.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=scan(Location, 1, ',');
    Prefecture=scan(Location, 2, ',');
run;
Run the code and view the output table. Now in row 8, City is MIYAKE-JIMA and Tokyo is assigned to Prefecture.
- Next, create an additional column named Country that just has the final country code, or the final word in Location. The additional assignment statement uses the SCAN function to read Location, and uses -1 as the second argument. Remember, negative values read from the right side of the string, so negative 1 will return the last word.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=scan(Location, 1, ',');
    Prefecture=scan(Location, 2, ',');
    Country=scan(Location, -1);
run;
Run the code and view the output table. The country code looks good.
- Next, we want to ensure that City is in proper case. We surround the SCAN function with the PROPCASE function. The PROPCASE function will use a default list of delimiters to determine when one word ends and the next word begins, indicating which letter should be capitalized.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=propcase(scan(Location, 1, ','));
    Prefecture=scan(Location, 2, ',');
    Country=scan(Location, -1);
run;
Run the code and view the output table. Most City values look correct, but in row 8 both Miyake and Jima are capitalized. The proper casing for this city name should be Miyake-jima.
- The hyphen is one of the default delimiters, so we need to specify that a space be the only delimiter. Use ' ' as the optional second argument to PROPCASE to specify that the only delimiter should be a space.
data weather_japan_clean;
    set pg2.weather_japan;
    Location=compbl(Location);
    City=propcase(scan(Location, 1, ','), ' ');
    Prefecture=scan(Location, 2, ',');
    Country=scan(Location, -1);
run;
Run the code and view the output table. The City values are correct.
SAS® Programming 2: Data Manipulation Techniques
Lesson 03, Section 4 Demo: Using the INPUT Function to Convert Column Types
In this demo, we'll convert character columns to numeric columns using the INPUT function. And then we'll convert numeric columns to character columns using the PUT function.
In the input table, weather_atlanta, Date and Precip (for precipitation) are both stored as character values. We want to make dates numeric, stored as the number of days since January 1, 1960, and Precip should be a numeric value as well. In the Precip column, some rows have a value of T. This means that a trace value was recorded: a very small amount of precipitation that resulted in no measurable accumulation. So we need to do a little bit of work with the Date and Precip columns to store them properly as numbers.

DATA output-table;
    SET input-table (RENAME=(current-column=new-column));
    ...
    column1 = INPUT(source, informat);
    column2 = PUT(source, format);
    ...
RUN;
- Open p203d04.sas from the demos folder and find the Demo section of the program.
data atl_precip;
    set pg2.weather_atlanta;
    where AirportCode='ATL';
    drop AirportCode City Temp: ZipCode;
    TotalPrecip+Precip;
run;
This DATA step reads the weather_atlanta table and subsets to include only those rows where AirportCode is equal to ATL. And we want to create an accumulating column for TotalPrecip, which will be a running total of the Precip column. But remember, right now, Precip is character.
- Run the first DATA step and view the log. It contains a note stating that character values have been converted to numeric values at the places given by, in this case, line 77. SAS attempts to convert the character Precip value to a numeric value using the w. informat. SAS is successful when the character value is a legitimate numeric value such as .27. SAS is unsuccessful when the value is a non-numeric value such as T. A value of T is converted to a missing numeric value.
- View the output table. Notice that TotalPrecip was accurately created for each row. The automatic conversion worked perfectly. The sum statement ignores the missing values for the Precip values of T.
- Even though SAS created the correct answer, what can you do differently? Add IF-THEN/ELSE statements to the DATA step to create a new column named PrecipNum. If Precip is not equal to T, then use the INPUT function to assign the numeric equivalent of Precip to the PrecipNum column. Otherwise, assign 0 to PrecipNum. Use PrecipNum in the SUM statement instead of Precip. Drop the Precip column.
data atl_precip;
    set pg2.weather_atlanta;
    where AirportCode='ATL';
    drop AirportCode City Temp: ZipCode Precip;
    if Precip ne 'T' then PrecipNum=input(Precip,6.);
    else PrecipNum=0;
    TotalPrecip+PrecipNum;
run;
- Run the DATA step and view the log. Notice that the SAS log no longer contains a note about character values being converted to numeric values and no longer contains notes about invalid numeric data for Precip='T'.
- Modify the DATA step to create a numeric column named Date from the original character Date column (renamed to CharDate). Also, format the numeric Date and drop the character column.
data atl_precip;
    set pg2.weather_atlanta(rename=(Date=CharDate));
    where AirportCode='ATL';
    drop AirportCode City Temp: ZipCode Precip CharDate;
    if Precip ne 'T' then PrecipNum=input(Precip,6.);
    else PrecipNum=0;
    TotalPrecip+PrecipNum;
    Date=input(CharDate,mmddyy10.);
    format Date date9.;
run;
- Run the DATA step. Confirm that you have a numeric precipitation column and a numeric date column.
SAS® Programming 2: Data Manipulation Techniques
Lesson 03, Section 4 Demo: Using the PUT Function to Convert Column Types
In this demo, we continue manipulating the pg2.weather_atlanta table by converting some numeric columns to character columns using the PUT function.

DATA output-table;
    SET input-table (RENAME=(current-column=new-column));
    ...
    column1 = INPUT(source, informat);
    column2 = PUT(source, format);
    ...
RUN;
- Open p203d04.sas from the demos folder and find the Demo section of the program.
data atl_precip;
    set pg2.weather_atlanta;
    CityStateZip=catx(' ',City,'GA',ZipCode);
run;
In this DATA step we create a column called CityStateZip by concatenating together the City, the string GA for Georgia, and the ZipCode. The first argument in the CATX function, a space, indicates that each of those strings will be concatenated together, leading and trailing blanks will be removed, and a single space will be inserted between them. But recall, ZipCode is a number.
- Submit the step and view the output table. CityStateZip looks great. The CAT functions automatically convert numeric values to character values and remove leading and trailing blanks in the converted value.
View the log. SAS does not write a note to the log when values are converted with the CAT functions.
- The concatenation functions offer flexibility in concatenating both character and numeric values. The assignment statement calls the SUBSTR function to create a character column ZipCodeLast2 that contains the last two digits of the numeric column ZipCode.
The first argument is the column, ZipCode, and we start reading in position 4 and read 2 characters.
data atl_precip;
    set pg2.weather_atlanta;
    CityStateZip=catx(' ',City,'GA',ZipCode);
    ZipCodeLast2=substr(ZipCode, 4, 2);
run;
- Run the code and view the log. It contains a similar note, numeric values have been converted to character values. SAS converts the numeric ZipCode value to a character value.
- View the output table. Notice that ZipCodeLast2 is missing. It is not displaying the last two digits of the ZIP code. When SAS automatically converts a numeric value to a character value, the BEST12. format is used, and the resulting character value is right-aligned. The numeric value of 30320 becomes the character value of seven leading spaces followed by 30320.
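A small sketch (not part of the demo program, using hypothetical helper columns) makes this visible in the log; the ZIP code value is only illustrative:

data _null_;
    length AutoChar $ 12 Last2 $ 2;
    ZipCode=30320;                    /* illustrative ZIP code                   */
    AutoChar=put(ZipCode, best12.);   /* right-aligned: 7 leading spaces + 30320 */
    Last2=substr(AutoChar, 4, 2);     /* positions 4 and 5 are blanks            */
    putlog 'AutoChar=<' AutoChar $char12. '> Last2=<' Last2 $char2. '>';
run;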
- We'll do the conversion explicitly using the PUT function.
data work.weather_atlanta;
    set pg2.weather_atlanta;
    CityStateZip=catx(' ',City,'GA',ZipCode);
    ZipCodeLast2=substr(put(ZipCode,z5.), 4, 2);
run;
In this DATA step, the first argument to the SUBSTR function is a call to the PUT function that converts the numeric ZipCode value to a character string using the Z5. format. The Z format writes a number as a character string and inserts leading zeros, if necessary, to fill 5 total positions.
- Run the DATA step and view the output table.
ZipCodeLast2 now displays the last two digits of ZIPCode.
SAS® Programming 2: Data Manipulation Techniques
Lesson 04, Section 1 Demo: Creating and Using Custom Formats
In this demo, we use the FORMAT procedure to create custom numeric and character formats based on single values as well as ranges of values.

PROC FORMAT;
    VALUE format-name
        value-or-range-1 = 'formatted-value'
        value-or-range-2 = 'formatted-value'
        . . . ;
RUN;
- Open p204d01.sas from the demos folder and find the Demo section of the program.
proc format;
    value stdate
        low - '31DEC1999'd = '1999 and before'
        '01JAN2000'd - '31DEC2009'd = '2000 to 2009'
        '01JAN2010'd - high = '2010 and later'
        . = 'Not Supplied';
    *Add a VALUE statement;
run;

proc freq data=pg2.storm_summary;
    tables Basin*StartDate / norow nocol;
    *Add a FORMAT statement;
run;
The program begins with a PROC FORMAT step. The VALUE statement creates a numeric format named STDATE based on date values. We use the keyword LOW to build a range from whatever the oldest date is up to and including December 31, 1999, and we assign a corresponding label. Then we build two other ranges, and then we provide a label for missing values.
- Add a VALUE statement to the PROC FORMAT step to create the $REGION format with the following labels:
NA              Atlantic
WP, EP, SP      Pacific
NI, SI          Indian
blank           Missing
other           Unknown
proc format;
    value stdate
        low - '31DEC1999'd = '1999 and before'
        '01JAN2000'd - '31DEC2009'd = '2000 to 2009'
        '01JAN2010'd - high = '2010 and later'
        . = 'Not Supplied';
    value $region
        'NA'='Atlantic'
        'WP','EP','SP'='Pacific'
        'NI','SI'='Indian'
        ' '='Missing'
        other='Unknown';
run;
- Highlight and run the PROC FORMAT step. View the log and verify that the formats have been output.
- Add a FORMAT statement in the PROC FREQ step to format Basin with the $REGION format and StartDate with the STDATE format.
proc freq data=pg2.storm_summary;
    tables Basin*StartDate;
    format StartDate stdate. Basin $region.;
run;
- Submit the PROC FREQ step and view the output. Verify that the cross-tabulation report has formatted values for Basin in the rows, as well as the formatted ranges for StartDate in the columns.
Custom formats are a great way not only to display values differently, but also to group raw values.
SAS® Programming 2: Data Manipulation Techniques
Lesson 04, Section 2 Demo: Creating Custom Formats from Tables
In this demo, we build a custom format from a lookup table.

PROC FORMAT CNTLIN=input-table FMTLIB;
    SELECT format-names;
RUN;
- Open p204d02.sas from the demos folder.
data sbdata;
    retain FmtName '$sbfmt';
    set pg2.storm_subbasincodes(rename=(Sub_Basin=Start SubBasin_Name=Label));
    keep Start Label FmtName;
run;

proc format cntlin=sbdata;
run;

/*Complete the steps to create the CATFMT format from the storm_categories table*/
data catdata;
    retain ;
    set pg2.storm_categories ;
    keep FmtName Start End Label;
run;

proc format cntlin=;
run;

title "Frequency of Wind Measurements for Storm Categories by SubBasin";
title2 "2016 Storms";

proc freq data=pg2.storm_detail;
    /*include only Category 1-5 2016 storms with known subbasin*/
    where Wind>=74 and Season=2016 and Sub_basin not in('MM', 'NA');
    tables Sub_basin*Wind / nocol norow nopercent;
    *Add a FORMAT statement;
run;
title;
The first DATA step creates a temporary table called sbdata. We use the pg2.storm_subbasincodes lookup table. In order to use this table to create a custom format, it needs to have the three required columns, FmtName, Start, and Label. We create the column FmtName with a RETAIN statement, and assign a constant value of '$sbfmt'.
We need to rename columns in storm_subbasincodes to define Start and Label. So Sub_Basin, which has the raw coded values, will be Start, and SubBasin_Name, which is the descriptive name, will be Label. We keep only the three required columns. Then we use a PROC FORMAT step with the CNTLIN= option and name the temporary table sbdata.
- Submit the DATA step that creates the sbdata table. View the output table and verify that it contains the three required columns to build a format.
- Submit the PROC FORMAT step that imports the sbdata table. View the log and confirm that the $SBFMT format was created. The format was created, so now it's ready to be used.
- Now we'll build another format using the pg2.storm_categories table.
pg2.storm_categories
Low     High    Category
74      95      Category 1
96      110     Category 2
111     129     Category 3
130     156     Category 4
157     high    Category 5
The table has five rows that define a range of maximum wind speeds (Low and High) for each storm category (Category). So this time, rather than building a format based on a single value, we'll create labels based on ranges.
- Modify the second DATA and PROC FORMAT steps to create a table named catdata that will include the following columns.
Column in pg2.storm_categories    Column in catdata
<none>                            FmtName (assign the value catfmt for each row)
Low                               Start
High                              End
Category                          Label
data catdata; retain FmtName "catfmt"; set pg2.storm_categories (rename=(Low=Start High=End Category=Label)); keep FmtName Start End Label; run; proc format cntlin=catdata; run;
This DATA step creates a temporary table called catdata. The RETAIN statement creates the column FmtName and names the format using the string "catfmt". This is a numeric format, so it doesn't need a dollar sign in front of the name.
Next, following the table name on the SET statement, we use the RENAME= data set option to rename the columns from the original table to create the columns required by PROC FORMAT. The column Low is renamed to Start. The values in this column represent the lower boundary for each range.
Next, the column High is renamed as End, which will be the upper boundary for each range. And finally Category is renamed as Label. The PROC FORMAT step imports the catdata table.
- Highlight the DATA and PROC FORMAT steps and run the selected code.
View the output table and confirm that the column names are correct. Then view the log and confirm that the CATFMT format was created.
- Now we're ready to use these custom formats.
In this PROC FREQ step, we want to analyze Sub_basin by Wind in a two-way cross-tabulation report. But if we run this with the raw data values, it will produce a huge table with a separate column for each unique measure of wind. Instead, we add a FORMAT statement in the PROC FREQ step to format Sub_basin with the $SBFMT format and Wind with the CATFMT format.
title "Frequency of Wind Measurements for Storm Categories by SubBasin"; title2 "2016 Storms"; proc freq data=pg2.storm_detail; /*include only Category 1-5 2016 storms with known subbasin*/ where Wind>=74 and Season=2016 and Sub_basin not in('MM', 'NA'); tables Sub_basin*Wind / nocol norow nopercent; format Sub_basin $sbfmt. Wind catfmt.; run;
Run the TITLE statements and PROC FREQ step and view the output. The columns are now the formatted values of wind, showing Category 1 through 4.
SAS® Programming 2: Data Manipulation Techniques
Lesson 05, Section 1 Demo: Concatenating Tables
In this demo, we use the DATA step and the RENAME= data set option to concatenate tables.

DATA output-table;
    SET input-table1(rename=(current-colname=new-colname))
        input-table2 ...;
RUN;
- Open p205d01.sas from the demos folder and find the Demo section of the program.
data storm_complete;
    *Complete the SET statement;
    set ;
    Basin=upcase(Basin);
run;

proc sort data=storm_complete;
    by descending StartDate;
run;
Recall that the storm_summary table has one row per storm from 1980 through 2016. The pg2.storm_2017 table includes storms from the 2017 storm season. We want to include all rows from both tables in a single data set named storm_complete.
- Modify the SET statement to concatenate pg2.storm_summary and pg2.storm_2017 by listing both table names. Add an assignment statement to correct the case for Basin to ensure that all values are in uppercase. Highlight the DATA and PROC SORT steps and run the selected code.
data storm_complete;
    set pg2.storm_summary pg2.storm_2017;
    Basin=upcase(Basin);
run;

proc sort data=storm_complete;
    by descending StartDate;
run;
View the log and verify that the work.storm_complete table has 3172 rows and 9 columns. View the output table and verify that it is sorted by descending StartDate.
- Notice that for the 2017 storms Year is populated with 2017, Location has values, and Season is missing. Rows from the storm_summary table (starting with row 55) have Season populated and Year and Location are missing.
Year and Season should be a single column in the new table, but they're named differently in the input tables. We need to rename one of the columns.
- Use the RENAME= option on the storm_2017 table to rename the existing column Year as Season. That way, the Year values read from storm_2017 will align with the Season values from the storm_summary table. Drop the Location column.
data storm_complete;
    set pg2.storm_summary
        pg2.storm_2017(rename=(Year=Season) drop=Location);
    Basin=upcase(Basin);
run;

proc sort data=storm_complete;
    by descending StartDate;
run;
Highlight the demo program and run the selected code. View the log and verify that the work.storm_complete table has 3172 rows and 7 columns. View the output table and verify that Location and Year are not included. Now there is a column named Season with 2017 and all prior years.
SAS® Programming 2: Data Manipulation Techniques
Lesson 05, Section 2 Demo: Merging Tables
In this demo, we use a DATA step with a MERGE and BY statement to combine two tables with matching rows.

PROC SORT DATA=input-table OUT=output-table;
    BY BY-column;
RUN;

DATA output-table;
    MERGE input-table1 input-table2 ...;
    BY BY-column(s);
RUN;
- Open p205d02.sas from the demos folder and find the Demo section of the program.
proc sort data=pg2.storm_summary out=storm_sort;
    by Basin;
run;

proc sort data=pg2.storm_basincodes out=basincodes_sort;
    by BasinCode;
run;

data storm_summary2;
    merge ;
run;
The PROC SORT steps prepare the input data sets, storm_summary and storm_basincodes, for merging. Notice that one is sorted by Basin and the other by BasinCode. The MERGE statement in the DATA step is incomplete.
- Highlight and submit the two PROC SORT steps. View the storm_sort table and the basincodes_sort table. Notice that storm_sort has 3118 rows and basincodes_sort has 6 rows. Both tables include a column representing basin codes. However, the column is named Basin in the storm_sort table and BasinCode in the basincodes_sort table.
Note: The storm_basincodes table serves as a lookup table that includes the two-letter basin codes and the corresponding basin name.
- To combine the BasinName column with the columns in the storm_summary table, the tables need to be merged. Complete the MERGE statement. Use the RENAME= data set option to rename the BasinCode column as Basin in the basincodes_sort table. Add a BY statement to combine the sorted tables by Basin.
Note: This will be a one-to-many merge.
data storm_summary2;
    merge storm_sort basincodes_sort(rename=(BasinCode=Basin));
    by Basin;
run;
- Run the program and view the storm_summary2 table. Notice that the BasinName values have been matched with each of the Basin code values.
- Scroll to the end of the storm_summary2 table. Notice that when the value of Basin is lowercase na, the values for BasinName are missing. This is because lowercase na occurs only in the storm_sort table and not in basincodes_sort.
Note: To view the end of the table in SAS Studio, click the Last page toolbar button.
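One way to make those rows match (a sketch, not part of the demo program, using a hypothetical storm_upcase table) is to standardize the case of Basin before sorting, as the earlier demos did with UPCASE:

data storm_upcase;
    set pg2.storm_summary;
    /* Standardize case so lowercase 'na' matches the lookup table's 'NA' */
    Basin=upcase(Basin);
run;

proc sort data=storm_upcase out=storm_sort;
    by Basin;
run;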
SAS® Programming 2: Data Manipulation Techniques
Lesson 05, Section 3 Demo: Merging Tables with Nonmatching Rows
In this demo, we use the DATA step merge in order to combine two tables and identify non-matching rows.

DATA output-table;
    MERGE input-table1(IN=var1) input-table2(IN=var2) ...;
    BY BY-column(s);
RUN;
- We'll start by looking at the pg2.storm_damage table. As a reminder, this table includes 38 rows or all storms with costs exceeding $1 billion. It includes information about the event, the date, a summary description, and cost, but not the basin name, max wind speed, or minimum pressure for these storms. That information is contained in the pg2.storm_final table.
We'll merge these two tables together to include those additional columns found in the Storm_Damage table. But Storm_Final includes many more rows than just the 38 in Storm_Damage. It has over 3,000 rows, so we'll limit the rows in the output table to only those rows that actually match. We want to eliminate the non-matches.
Let's look at the columns we'll use to match the individual storms. Storm_Damage includes the Event column that identifies the type of storm, hurricane or a tropical storm, and its name. So the last word in Event is the storm name, and it's in proper case. The Date column holds the date of the storm, stored as a SAS date.
Storm_Final has the storm name in the Name column, but it's stored in uppercase. We'll need to make sure that these values are consistent in order for them to be matched appropriately. The Season column holds the year in which the storm occurred.
So Season and Name must be derived from the Date and Event columns in the storm_damage table.
- Open p205d03.sas from the demos folder and find the Demo section of the program.
proc sort data=pg2.storm_final out=storm_final_sort;
    by Season Name;
run;

data storm_damage;
    set pg2.storm_damage;
    Season=Year(date);
    Name=upcase(scan(Event, -1));
    format Date date9. Cost dollar16.;
    drop event;
run;

proc sort data=storm_damage;
    by Season Name;
run;

data damage_detail;
    merge ;
    keep Season Name BasinName MaxWindMPH MinPressure Cost;
run;
- The PROC SORT step arranges the rows in Storm_Final by both Season and Name and creates an output table named storm_final_sort. Remember that a name can be used more than once, but Name is unique within each season, so it's important to group or sort by both columns.
Run this PROC SORT step to prepare the storm_final table for merging.
View the log and verify that the step was successful. View the storm_final_sort table. It is arranged by Season and Name. Because some storm names have been used more than once, unique storms are identified by both Season and Name.
Note: Storm names are in uppercase.
- In the DATA step, we read the permanent pg2.storm_damage table and create a temporary Storm_Damage table. We create a column named Season and use the YEAR function to extract the year value from Date and assign it to Season.
Next, we use the SCAN function to extract the last word from Event. This is the storm name. Remember, it was in proper case, so we call UPCASE to convert the case. Now the name values will match the names in the storm_final table. Then we use a format statement to improve the display of Date and Cost, and we drop Event.
Run the DATA step and view the work.storm_damage table. Verify that the Season and Name columns were created correctly. Now we can match the values in these two columns with the values in the Storm_Final table in order to add the max wind and minimum pressure.
- Before we can merge the tables, we need to sort work.storm_damage by Season and Name.
Submit the PROC SORT step and view the output table. Verify that the rows are sorted correctly.
- Finally, we merge the sorted and prepared data and create a table called damage_detail.
data damage_detail;
    merge storm_final_sort storm_damage;
    by Season Name;
    keep Season Name BasinName MaxWindMPH MinPressure Cost;
run;
We list both tables on the MERGE statement with storm_final_sort listed first, followed by storm_damage. The BY statement specifies that we're merging rows by Season and Name.
Run the DATA step and view the output table. It has over 3,000 rows and includes both the matches and the non-matches between the two tables. Most of the values in the Cost column are missing because Cost came from the Storm_Damage table, which only included 38 storms. However, row number 4, storm Allen, was included in both tables, so all columns have a value.
- We want the new table to have only the 38 storms that were in the Storm_Damage table. So, we use the IN= data set option to control which rows are output.
data damage_detail;
    merge storm_final_sort storm_damage(in=inDamage);
    by Season Name;
    keep Season Name BasinName MaxWindMPH MinPressure Cost;
    if inDamage=1;
run;
In the MERGE statement, we use the IN= data set option after the storm_damage table to create a temporary variable named inDamage that flags rows where Season and Name were read from the storm_damage table. The inDamage variable will be included in the PDV. During execution, it will take on a value of either 0 or 1. For a row that is read from the Storm_Damage table, the value will be 1, otherwise, it'll be 0.
The subsetting IF statement checks the value of inDamage. If it is equal to 1, SAS will continue processing the rest of the DATA step, so it outputs that row and returns to the top of the DATA step. If inDamage is not equal to 1, then it simply moves on to the next row.
- Submit the DATA step and view the Damage_Detail table. It has 38 rows. These are the same 38 storms that were in the original Storm_Damage table. It also has the additional columns, BasinName, MaxWindMPH, and MinPressure, that were all read from the Storm_Final table. Using the IN= data set option and the variables created in the PDV enables you to determine what to do with matching or non-matching rows.
SAS® Programming 2: Data Manipulation Techniques
Lesson 06, Section 1 Demo: Executing an Iterative DO Loop
In this demo, we use Enterprise Guide and the DATA step debugger to watch what happens behind the scenes as a DO loop processes.

DATA output-table;
    . . .
    DO index-column = start TO stop <BY increment>;
        . . . repetitive code . . .
    END;
    . . .
RUN;
- Open p206d01a.sas from the demos folder and find the Demo section of the program.
data forecast;
    set sashelp.shoes(rename=(Sales=ProjectedSales));
    do Year = 1 to 3;
        ProjectedSales=ProjectedSales*1.05;
        output;
    end;
    keep Region Product Subsidiary Year ProjectedSales;
    format ProjectedSales dollar10.;
run;
This DATA step reads an existing SAS table, sashelp.shoes. We use a DO loop with an index variable named Year that should execute the loop three times, for Year 1, 2, and 3.
- Run the program and view the output table. Notice that there are three rows (Year 1, 2, and 3) for each combination of Region, Product, and Subsidiary.
- Return to the Program tab and click the DATA step markers for debugging (Debug) toolbar button to enable debugging in the program if it is not already enabled. Click the Debugger icon next to the DATA statement. The DATA Step Debugger window appears.
- Click the Step execution to next line toolbar button to execute the highlighted SET statement. It loads the first row from sashelp.shoes into the PDV. _N_ is 1 because this is the first iteration of the DATA step. The index variable Year is added to the PDV.
- Year is initialized to missing, but when you click the button again to execute the highlighted DO statement, Year is set to the start value, or 1.
- Click the button three times to execute the statements inside the DO loop and the END statement. ProjectedSales increases by 5%, and that row is output to the new table. When you execute the END statement, Year increments to 2 and processing returns to the top of the DO loop.
- Continue to click the button to execute the highlighted statements inside the DO loop. SAS returns to the top of the loop and checks to see if Year is between the start and end value. 2 is within the range, so the loop executes again. ProjectedSales increases by another 5%, and that row is output to the new table. The END statement increments Year to 3. 3 is still within the range, so the loop executes again. ProjectedSales increments and another row is output to the new table. At this point, we have written three rows to our output table, even though we're still within the first iteration of our DATA step.
- At the end of third iteration of the DO loop, Year is incremented to 4. SAS determines that 4 is outside of the range of the index variable, so it exits the loop and jumps to the RUN statement.
- Close the DATA Step Debugger.
Although the value of Year is 4, the loop executed three times. The explicit OUTPUT statement is within the DO loop, so three rows are written for each one row read from sashelp.shoes.
Alternate Steps for SAS Studio
Note: This demo can be performed in any of the SAS programming interfaces.
- Open p206d01b.sas from the demos folder.
data forecast;
    putlog 'Top of DATA Step ' Year= _N_=;
    set sashelp.shoes(obs=2 rename=(Sales=ProjectedSales));
    do Year = 1 to 3;
        ProjectedSales=ProjectedSales*1.05;
        output;
        putlog 'Value of Year written to table: ' Year=;
    end;
    putlog 'Outside of DO Loop: ' Year=;
    keep Region Product Subsidiary Year ProjectedSales;
    format ProjectedSales dollar10.;
run;
Notice the three PUTLOG statements in the DATA step.
- Run the program and view the output table. Notice that there are three rows (Year 1, 2, and 3) for each of the first two input rows.
- Review the PUTLOG text in the SAS log.
SAS® Programming 2: Data Manipulation Techniques
Lesson 06, Section 1 Demo: Using Iterative DO Loops
In this demo, we modify an existing DATA step with variations of iterative DO loops and variations in the placement of the OUTPUT statement to determine when rows are written to an output table.

DATA output-table;
    SET input-table;
    . . .
    DO index-column = start TO stop <BY increment>;
        . . . repetitive code . . .
        <OUTPUT;>
    END;
    . . .
    <OUTPUT;>
RUN;
- Open the pg2.savings table.
pg2.savings
Name      Amount
James     250
Linda     300
Mary      275
Robert    350
Notice that there are four rows representing different people. The Amount value is a monthly savings value.
- Open p206d02.sas from the demos folder and find the Demo section of the program.
data YearSavings;
    set pg2.savings;
    *add an assignment statement;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings comma12.2;
run;
In this DATA step, we look at those four individuals and see how much they have saved after 12 months. Each month they add to their savings based on the amount in the pg2.savings table, and they earn interest each month. It's 2% annual interest, but compounded monthly.
The index variable Month ranges from 1 to 12. There's no explicit OUTPUT statement, so output occurs implicitly at the conclusion of each DATA step iteration.
- Run the program and view the output table.
work.YearSavings
Name      Amount    Month    Savings
James     250       13       3,032.70
Linda     300       13       6,733.15
Mary      275       13       10,205.03
Robert    350       13       14,656.79
The output table has the same four rows with the original monthly savings amount. The value of Month is 13 because the index variable always increments one beyond the stop value, but the statements in the loop executed 12 times. Why are there only four rows? The implicit output occurs outside the loop, after that final increment occurs. So there's one output row for every row read from the input table.
Notice that the Savings value keeps increasing for each row. This is because the value of Savings is automatically retained because it's part of a sum statement. So each time we start a new iteration of the DATA step, reading a new person, the value of Savings is retained, but it represents the savings from the previous person.
- We fix the issue by adding an assignment statement to reset Savings to zero before entering the DO loop for the next individual.
data YearSavings;
    set pg2.savings;
    Savings=0;
    do Month=1 to 12;
        Savings+Amount;
        Savings+(Savings*0.02/12);
    end;
    format Savings comma12.2;
run;
work.YearSavings
Name      Amount    Savings     Month
James     250       3,032.70    13
Linda     300       3,639.24    13
Mary      275       3,335.97    13
Robert    350       4,245.78    13
- Run the program and view the output table. Notice the corrected values for Savings. These savings amounts represent one year of savings. What if we want to extend that for five years?
data YearSavings;
    set pg2.savings;
    Savings=0;
    do Year=1 to 5;
        do Month=1 to 12;
            Savings+Amount;
            Savings+(Savings*0.02/12);
        end;
    end;
    format Savings comma12.2;
run;
Interest is compounded monthly, so we keep the DO loop that calculates the interest and increases savings every month. Outside this DO loop, we add another loop to iterate through the five years. The outer DO loop will iterate five times, once for each year, and within each year, the inner loop for Month will iterate 12 times.
- Run the program and view the output table.
work.YearSavings
Name     Amount    Savings  Year  Month
James       250  15,788.11     6     13
Linda       300  18,945.73     6     13
Mary        275  17,366.92     6     13
Robert      350  22,103.35     6     13
Now there is one row for each person. Each row represents the savings after five years, assuming that savings are added each month. The value of Year is 6 and the value of Month is 13, one increment beyond each stop value.
- To see the interest and savings at the end of each year, we add an OUTPUT statement at the bottom of the outer DO loop. This will execute for each value of Year, after the Month loop iterates 12 times.
data YearSavings; set pg2.savings; Savings=0; do Year=1 to 5; do Month=1 to 12; Savings+Amount; Savings+(Savings*0.02/12); end; output; end; format Savings comma12.2; run;
- Run the program and view the output table. It has five rows for each person (a total of 20 rows). Each row represents the savings at the end of each of the five years. Because the OUTPUT statement is inside the Year loop, Year only goes up to 5 in the output table. Technically, the index variable Year did increment to 6, but only after the final row was output for each person. Month, the index variable for the inner loop, still went up to 13, which ended the inner loop.
- Finally, move the OUTPUT statement to the bottom of the inner DO loop to output a row after each month.
data YearSavings; set pg2.savings; Savings=0; do Year=1 to 5; do Month=1 to 12; Savings+Amount; Savings+(Savings*0.02/12); output; end; end; format Savings comma12.2; run;
- Run the program and view the output table. It has 60 rows for each person, for a total of 240 rows. Each row represents the savings at each Year and Month combination. The Month values go from 1 to 12 because output occurred before Month was incremented to 13.
The explicit OUTPUT statement gives you a lot of flexibility: you can place it inside or outside of a DO loop, depending on which rows you want written to the output table.
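As a quick recap of that flexibility, here is a minimal sketch (separate from the course program) that contrasts the two placements with a trivial loop and no input table.

/* OUTPUT inside the loop: one row per iteration (i = 1, 2, 3).
   An explicit OUTPUT statement suppresses the implicit output at the end of the step. */
data inside;
   do i=1 to 3;
      output;
   end;
run;

/* No explicit OUTPUT: a single implicit output at the end of the DATA step,
   after the index has incremented past the stop value (one row, with i = 4). */
data outside;
   do i=1 to 3;
   end;
run;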
SAS® Programming 2: Data Manipulation Techniques
Lesson 06, Section 2 Demo: Using Conditional DO Loops
In this demo, we modify an existing DATA step with variations of the conditional DO loop.

DATA output-table; SET input-table; . . . DO UNTIL | WHILE (expression); . . . repetitive code . . . <OUTPUT;> END; RUN;
- Open the pg2.savings2 table.
pg2.savings2
Name     Amount  Savings
James       250     1250
Linda       300     3600
Mary        275     2200
Robert      350     1750
This table contains four rows. The Savings column is the current value of each person's savings account. Notice that Linda's account starts at $3,600. The Amount column is the additional amount that the person deposits each month.
- Open p206d03.sas from the demos folder and find the Demo section of the program.
data MonthSavings; set pg2.savings2; do until (Savings=3000); Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; run;
The DATA step reads the pg2.savings2 table and creates a temporary table called MonthSavings. The DO statement indicates that we want the loop to execute until Savings equals $3,000. Remember, Linda's account already exceeds $3,000.
- Run the program. Because Savings is never exactly equal to 3000, the program is in an infinite loop. Stop the infinite DO loop from running.
- In SAS Enterprise Guide, click the Stop toolbar button on the Program tab.
- In SAS Studio, click Cancel in the Running pop-up window.
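To see why the loop never ends, here is a diagnostic sketch (not part of the course program) that logs the running balance for the first person only. Savings jumps from below $3,000 to above it without ever landing on the exact value, so the UNTIL condition is never true. The LEAVE statement is there only to keep this demonstration itself from looping forever.

/* Diagnostic sketch: James starts at 1,250 and crosses 3,000 between two months */
data _null_;
   set pg2.savings2(obs=1);
   do Month=1 to 12;
      Savings+Amount;
      Savings+(Savings*0.02/12);
      putlog Month= Savings= 12.2;
      if Savings>3000 then leave;   /* stop the demonstration once 3,000 is passed */
   end;
run;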
- We need to be sure that the DO UNTIL condition will eventually be true, so instead of checking for Savings equal to 3000, we'll check for Savings greater than 3000.
Make the following modifications to the DATA step:
- Replace the equal sign with a greater than symbol to avoid an infinite loop.
- Add a sum statement inside the DO loop to create a column named Month that increments by 1 for each loop.
- Before the DO loop, add an assignment statement to reset Month to 0 each time that a new row is read from the input table.
data MonthSavings; set pg2.savings2; Month=0; do until (Savings>3000); Month+1; Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; run;
The DATA step will iterate four times, once for each individual. Because the statement that increments Month is a sum statement, Month is automatically retained, so without the reset, it would still hold the value left over from the previous person each time a new row is read. That is why Month is set to 0 before the loop.
- Run the program and view the output table. Notice that for James, Mary, and Robert, the Savings values exceed $3,000, and the Month column shows how many months it took each of them to reach that point. But what about Linda? She started out with $3,600, but because we used DO UNTIL syntax, the condition was not checked until the bottom of the loop. So the loop executed one time for Linda, adding $300 to her savings plus interest.
- We can change the loop to a DO WHILE so that the loop never executes for Linda because she already has more than $3,000. We also need to reverse the logic so that the loop executes while Savings is less than $3,000. Remember, in DO WHILE loops, the condition is checked at the top. If the condition is false, then the loop doesn't execute, not even once.
data MonthSavings; set pg2.savings2; Month=0; do while (Savings<3000); Month+1; Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; run;
- Run the program and view the output table. Verify that Linda's Savings amount is still 3,600.
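The behavioral difference can be reduced to a minimal sketch (separate from the course files): a DO UNTIL loop tests its condition at the bottom, so its body always executes at least once, while a DO WHILE loop tests at the top and can execute zero times.

data _null_;
   /* DO UNTIL tests at the bottom: the body runs once even though 1 is already true */
   x=0;
   do until (1);
      x+1;
   end;
   /* DO WHILE tests at the top: 0 is false, so the body never runs */
   y=0;
   do while (0);
      y+1;
   end;
   putlog x= y=;   /* logs x=1 y=0 */
run;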
SAS® Programming 2: Data Manipulation Techniques
Lesson 06, Section 2 Demo: Combining Iterative and Conditional DO Loops
In this demo, we combine iterative and conditional DO loops to determine the number of times a loop iterates and compare results for DO WHILE and DO UNTIL.

DATA output-table; SET input-table; . . . DO index-column = start TO stop <BY increment> UNTIL | WHILE (expression); . . . repetitive code . . . END; RUN;
- Open p206d04.sas from the demos folder. Each DATA step uses a loop to execute until a person's savings exceeds $5,000 or they reach 12 months, whichever comes first. The intent of both DATA steps is to process the DO loop for each row in the pg2.savings2 table. Each loop represents one month of savings.
data MonthSavingsW; set pg2.savings2; do Month=1 to 12 while (savings<=5000); Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; run; data MonthSavingsU; set pg2.savings2; do Month=1 to 12 until (savings>5000); Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; run; title "DO WHILE Results"; proc print data=MonthSavingsW; run; title "DO UNTIL Results"; proc print data=MonthSavingsU; run;
The first DATA step uses an iterative DO loop with a WHILE condition. It will execute up to 12 times while Savings is less than or equal to $5,000. The condition is evaluated at the top of the loop after Month is incremented.
The second DATA step uses an iterative DO loop with an UNTIL condition. It will execute up to 12 times until Savings is greater than $5,000. The condition is evaluated at the bottom of the loop before Month is incremented.
- Run the program and view the results.
Notice that the values of Savings are the same in the DO WHILE and DO UNTIL reports, indicating that the DO loops executed the same number of times for each person.
In the first row, the value of Month is 13 in both tables, because James never reached $5,000, so the index variable Month incremented one beyond the stop value. But remember, even though Month is 13, the loop only ran 12 times.
What about the other people? Their Savings values match in both tables, but the Month values are different. The value of Month in the DO WHILE results is 1 greater than in the DO UNTIL results. This is because in the DO WHILE loop, the index variable Month increments before the condition is checked. So the Month column in the output table does not accurately represent the number of times that the DO loop iterated. If you want Month to be an accurate count of the iterations for each row, it's best to create a counter variable inside the loop.
- To add a counter, we use a different index variable to control the DO loop and add a sum statement inside the loop to increment Month. The value of Month is retained, so we reset it to 0 each time a new row is read from the input table. Month then represents the number of iterations of the DO loop, so the index variable can be dropped.
To create an accurate counter for the number of iterations of a DO loop, make the following modifications to both DATA steps:
- Add a sum statement inside the loop to create a column named Month and add 1 for each iteration.
- Before the DO loop, add an assignment statement to reset Month to 0 each time that a new row is read from the input table.
- Change the name of the index variable to an arbitrary name, such as i.
- Add a DROP statement to drop i from the output table.
Note: Make the same additions in both DATA steps.
- Run the program and view the results.
Notice that the values of Savings and Month match for the DO WHILE and DO UNTIL reports. Month represents the number of months or iterations it took for each individual to either save $5,000 or reach 12 months, whichever came first.

data MonthSavingsW; set pg2.savings2; Month=0; do i=1 to 12 while (savings<=5000); Month+1; Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; drop i; run; data MonthSavingsU; set pg2.savings2; Month=0; do i=1 to 12 until (savings>5000); Month+1; Savings+Amount; Savings+(Savings*0.02/12); end; format Savings comma12.2; drop i; run; title "DO WHILE Results"; proc print data=MonthSavingsW; run; title "DO UNTIL Results"; proc print data=MonthSavingsU; run;
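As a recap of why the raw index values differed, here is a minimal sketch (not part of the course program). When the condition ends the loop, the WHILE form has already incremented its index before the failing test, while the UNTIL form exits before the increment, so the two indexes end one apart even though the body ran the same number of times.

data _null_;
   Total=0;
   do m=1 to 12 while (Total<3);   /* body runs 3 times; m increments to 4 before the test fails */
      Total+1;
   end;
   putlog 'WHILE: ' m= Total=;     /* m=4 Total=3 */
   Total=0;
   do n=1 to 12 until (Total>=3);  /* body runs 3 times; the test is true before n increments, so n=3 */
      Total+1;
   end;
   putlog 'UNTIL: ' n= Total=;     /* n=3 Total=3 */
run;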
SAS® Programming 2: Data Manipulation Techniques
Lesson 07, Section 1 Demo: Creating a Narrow Table with the DATA Step
In this demo, we use the DATA step debugger in SAS Enterprise Guide to examine a DATA step that creates a narrow table from a wide table.
- Open the p207d01.sas program from the demos folder.
data class_test_narrow; set pg2.class_test_wide; keep Name Subject Score; length Subject $ 7; Subject="Math"; Score=Math; output; Subject="Reading"; Score=Reading; output; run;
class_test_wide
Name      Math  Reading
Alfred      82       79
Alice       71       67
Barbara     96       86
The input table, class_test_wide, has one row per name, or per student, and two columns representing test scores for Math and Reading. We want to create a narrow table that has one row per score. In the new table, each student will have two rows: one for the Math score and one for the Reading score. We'll use a DATA step to accomplish this and walk through each statement with the DATA step debugger.
- Start the DATA step debugger. Notice that the three columns from class_test_wide (Name, Math, and Reading) are included in the PDV. Two additional columns, Subject and Score, are added to the PDV because of the assignment statements. The LENGTH statement establishes the attributes of the Subject column, which will store either Math or Reading.
- Click the Step execution to next line toolbar button to execute the highlighted SET statement. The first row from class_test_wide is loaded into the PDV.
- Execute the two assignment statements. Values are assigned to Subject and Score for the math test. Execute the OUTPUT statement to write Name, Subject, and Score to the output table.
- Execute the next two assignment statements. Values are assigned to Subject and Score for the reading test. Execute the OUTPUT statement to write Name, Subject, and Score to the output table.
- Proceed through execution of the second iteration of the DATA step. Two additional rows are written to the output table for the test scores for Alice.
- Close the DATA step debugger and run the program. Examine the output table and confirm that each student has two rows corresponding to the math and reading test scores.
Note: Recall that at the end of the first iteration of the DATA step, two rows had already been written to the class_test_narrow table.
SAS® Programming 2: Data Manipulation Techniques
Lesson 07, Section 2 Demo: Creating a Wide Table with PROC TRANSPOSE
In this demo, we use PROC TRANSPOSE to transpose data values within groups into rows, creating a wide table. We use different options and statements in PROC TRANSPOSE to customize the output table.

PROC TRANSPOSE DATA=input-table OUT=output-table <PREFIX=column> <NAME=column>; <VAR column(s);> <ID column;> <BY column(s);> RUN;
- Explore the input table, pg2.storm_top4_narrow.
pg2.storm_top4_narrow (partial)
Season  Basin  Name    WindRank  WindMPH
1980    EP     AGATHA         1      100
1980    EP     AGATHA         2       95
1980    EP     AGATHA         3       90
1980    EP     AGATHA         4       85
1980    EP     BLAS           1       50
1980    EP     BLAS           2       50
1980    EP     BLAS           3       50
1980    EP     BLAS           4       45
1980    EP     CELIA          1       65
1980    EP     CELIA          2       65
pg2.storm_top4_narrow is a narrow table with multiple rows per storm. Each storm is uniquely identified by the combination of Season, Basin, and Name. There are four rows for each storm, representing the four highest wind speeds that were measured. We'll transpose this into a wide table with one row per storm and separate columns for each WindRank: Wind1 through Wind4.
- Open p207d02.sas from the demos folder.
proc transpose data=pg2.storm_top4_narrow out=wind_rotate; var WindMPH; id WindRank; run;
In this PROC TRANSPOSE step, DATA= specifies the input table, pg2.storm_top4_narrow, and OUT= creates an output data set named wind_rotate. The VAR statement indicates that we want to transpose WindMPH. We start with an ID statement listing WindRank, which has the values 1 through 4.
- Run the PROC TRANSPOSE step. The step fails, and the error message indicates that the output table, wind_rotate, doesn't have any columns.
- View the log. It indicates that the ID value 1 occurs twice in the input data set. That's true: we have a WindRank value of 1 for every storm, but the values of the column listed on the ID statement must be unique. Remember, each storm is uniquely identified by the combination of Season, Basin, and Name, and within each storm the WindRank values 1 through 4 are unique.
- Add a BY statement to transpose the values within the groups of Season, Basin, and Name.
proc transpose data=pg2.storm_top4_narrow out=wind_rotate; var WindMPH; id WindRank; by Season Basin Name; run;
- Run the program and view the output table. It has one row per storm. Within each storm, the values of WindRank were transposed, and the wind measurements were assigned as the data values. The unique values of WindRank (1, 2, 3, and 4) are assigned as the column names for the transposed values of WindMPH.
These are not the best column names, and they don't follow the recommended SAS naming rules.
Note: Enterprise Guide and SAS Studio set the VALIDVARNAME= system option to ANY, which permits column names that do not follow standard SAS naming rules. If VALIDVARNAME= is set to V7, underscores are added in front of leading numbers or in place of spaces or special symbols in column names. A quick way to check or change this option is sketched below.
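Here is a minimal sketch (not part of the course program) of how you could check or change the option in your own session; V7 is just one possible setting.

/* Write the current VALIDVARNAME= setting to the log */
proc options option=validvarname;
run;

/* Require standard (V7) column naming rules for the rest of the session */
options validvarname=v7;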
- To give the transposed columns standard names, add the PREFIX=Wind option in the PROC TRANSPOSE statement so that the column names become Wind1 through Wind4. The column called _NAME_ identifies the source column for the transposed values; to rename it, add the NAME=WindSource option as well.
proc transpose data=pg2.storm_top4_narrow out=wind_rotate prefix=Wind name=WindSource; var WindMPH; id WindRank; by Season Basin Name; run;
We use the PREFIX= option to specify that the prefix Wind is inserted in front of the WindRank values that become column names. The NAME= option renames the _NAME_ column to WindSource.
- Run the step and view the output table. The table looks much better, but WindSource has the same value in every row, so it probably isn't necessary to keep this column in the output table. If we were transposing more than one measure, this column would identify the different measures (see the hypothetical sketch below), but in this case it isn't needed.
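To illustrate that last point, here is a hypothetical sketch. It assumes the narrow table also contained a second measure named MinPressure (it does not) listed alongside WindMPH on the VAR statement. Each storm would then get two output rows, and the renamed _NAME_ column (Source) would identify which measure each row came from.

/* Hypothetical: MinPressure does not exist in pg2.storm_top4_narrow */
proc transpose data=pg2.storm_top4_narrow out=both_rotate prefix=Rank name=Source;
   var WindMPH MinPressure;
   id WindRank;
   by Season Basin Name;
run;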
- Delete the NAME= option and add the DROP= data set option on the output table to drop the _NAME_ column.
proc transpose data=pg2.storm_top4_narrow out=wind_rotate(drop=_name_) prefix=Wind; var WindMPH; id WindRank; by Season Basin Name; run;
- Run the step and view the output table. Now we have a wide table, one row per storm, with the four wind measurements.
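As an optional sanity check (a sketch, not from the course program), a PROC SQL query can confirm that wind_rotate has exactly one row per Season, Basin, and Name combination, that is, one row per storm.

proc sql;
   /* the two counts should match if the table has one row per storm */
   select count(*) as Rows,
          count(distinct catx('|', Season, Basin, Name)) as Storms
   from wind_rotate;
quit;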