SAS® Programming 3: Advanced Techniques
Lesson 01, Section 1 Demo: Reviewing DATA Step Syntax and Processing
In this demonstration, we'll review DATA step compilation and execution using a program that has two DATA steps, and a PROC SORT step. The first calculates the population growth percentage per country between 2000 and 2017, and the second calculates the population growth percentage per continent between 2000 and 2017.
- Open the p301d01.sas program in the demos folder and find the Demo section and review the DATA step for Scenario 1.
data work.PctGrowth18Yrs; length Continent $ 13 Country $ 18; set pg3.population_top25countries; Country=scan(CountryCodeName,2,'-'); PctGrowth18Yrs=(Pop2017-Pop2000)/Pop2000*100; drop CountryCodeName; format Pop2000 Pop2017 comma16. PctGrowth18Yrs 5.1; run;
- In Enterprise Guide, click the DATA step markers for debugging button on the toolbar to enable debugging in the program.
- Click the Debugger icon next to the Scenario 1 DATA statement. Compilation has already taken place, so the pane on the right lists each column in the PDV with a value of missing.
- The currently executing line is highlighted in yellow. Click the Step execution to next line toolbar button, or F10, to execute the SET statement and read the first row from the input table. Several columns were read into the PDV, and their values are displayed in the color red. The Country and PctGrowth18Yrs columns have missing values. Click the Step button again to assign a value to Country, and again to calculate PctGrowth18Yrs. DROP and FORMAT are compile-time only statements, so now the RUN statement is highlighted. Click the Step button again. SAS writes a row to the output table, returns to the top of the DATA step, increments _N_, and reinitializes the new columns, Country and PctGrowth18Yrs to missing. The value of the columns read from the input table are not reinitialized because they're automatically retained.
The SET statement is about to execute again, reading the second row from the input table. As you continue to step through the statements, you can observe the processing of the DATA step. Close the DATA step debugger window.
- Run the Scenario 1 DATA step a to create the work.PctGrowth18Yrs table.
-
View the new table.
- Notice that there's no Debug button next to the PROC SORT step. This is a DATA step debugger, so you can't step through a PROC step. Highlight and run the PROC SORT step. The output table, Continent_Sorted, is displayed.
- In the second scenario we need to calculate the percent population growth for each continent. To do that, we need to sum the country level information for each continent, so the data must be grouped by continent.
proc sort data=pg3.population_top25countries out=work.continent_sorted; by Continent descending Pop2017; run; data work.PctGrowth18Yrs_Cont; set work.continent_sorted; by Continent; if first.Continent=1 then do; Count=0; Pop2000Total=0; Pop2017Total=0; end; Count+1; Pop2000Total+Pop2000; Pop2017Total+Pop2017; if last.Continent=1 then do; PctGrowth18Yrs_Cont= (Pop2017Total-Pop2000Total)/Pop2000Total*100; output; end; format Pop2000Total Pop2017Total comma16. PctGrowth18Yrs_Cont 5.1; keep Continent Count Pop2000Total Pop2017Total PctGrowth18Yrs_Cont; run;
The BY statement in this DATA step allows SAS to process the data grouped by the values of Continent, and it creates two new columns, First.Continent and Last.Continent to identify the first and last row for each group. This will allow us to create accumulating columns, or running totals, for each continent.
When we read the first row for a continent, First.Continent is equal to 1, so we set each of our accumulating columns (Count, Pop2000total, and Pop2017total) to 0. Then we use the Sum statement to increment the accumulating columns. For example, we increment Count by 1, and Pop2000Total by Pop2000. Recall that the Sum statement is a unique assignment statement that creates, initializes, populates, and retains a new column. It also ignores missing values.
When Last.Continent is equal to 1, we know we've reached the end of the BY group and we have all the information we need to calculate the 18-year growth rate. In the DO group we calculate PctGrowth18Yrs_cont and use an explicit output statement to write the contents of the PDV to the new table.
Click the debug icon next to the Scenario 2 DATA statement. In the PDV notice the columns from the input table, First.Continent, Last.Continent, the new columns being created in this step, and _ERROR_ and _N_. When the SET statement reads the first row, we see that First.Continent is 1, so the statements in the DO group will execute, initializing the accumulating columns to zero. Then the Sum statements execute, populating their respective columns, and showing their new values in the color red. The IF condition is false because Last.Continent is 0, so the RUN statement is the next line to execute. Nothing is written to the output table because the explicit OUTPUT statement overrides the implicit output. Click Step again and notice that only _N_ changes. - We could continue stepping through the code, but it might take many iterations before the value of continent changes. Instead we'll set a breakpoint on the second IF statement. A breakpoint is a line on which execution will stop in each iteration of the step. Click on the line number to the left of the IF statement. The line is highlighted in red with a circle icon. Now instead of stepping to the next line, click Run or F5 to continue execution until the breakpoint is reached. _N_ increments on each iteration, and when it's equal to 6, Continent is Africa and Last.Continent is 1.
- Click the Step button and notice it calculates PctGrowth18Yrs and executes the OUTPUT statement, writing a row for Africa to the new table. Click Step again which executes RUN and returns to the SET statement. When it executes, Continent changes to Asia so First.Continent is set to 1 and Last.Continent is set to 0.
- Close the debugger window and run the DATA step.
-
View the new table.
The new table, work.PctGrowth18Yrs_Cont, has one row for each continent.
SAS® Programming 3: Advanced Techniques
Lesson 02, Section 1 Demo: Retrieving Previous Values with the LAG Function
In this demonstration we use the LAG function to calculate the difference between the current and previous day's average temperature for two cities in China. The data is in the pg3.weather_china_daily2017 table.

LAG<n>(column) |
- Open the p302d01.sas program in the demos folder and find the Demo section. Highlight and run the DATA step to create work.china_temps.
-
View the new table.
Notice that the first 365 rows contain the daily average temperatures for Beijing and the last 365 rows contain the daily average temperatures for Shanghai. - Go back to the code and uncomment the two assignment statements. The first statement calls LAG1 to get the average temperature for the previous day, and the next statement calculates TempIncrease as the difference between the current value and the previous value of TavgC.
TavgCPrevDay=lag1(TavgC); TempIncrease=TavgC-TavgCPrevDay;
- Run the modified DATA step.
-
View the new table.
The output table contains the two new columns, TavgCPrevDay and TempIncrease. In row 366, the first row for Shanghai, the value of TavgCPrevDay contains the December 31st Beijing temperature. - The last Beijing temperature for was incorrectly used as the previous temperature for Shanghai. We need to process the rows for each city separately. Add a BY statement and an IF-THEN statement to set the previous value to missing for the first row in a city, when First.City is equal to 1.
by City; TavgCPrevDay=lag1(TavgC); if first.City=1 then TavgCPrevDay=.; TempIncrease=TavgC-TavgCPrevDay;
- Run the DATA step.
-
View the new table.
The first row of Shanghai data (row 366) contains a missing value for the previous temperature. - Now that the data is correct, submit the ODS statements, the PROC MEANS step, and the PROC SGPLOT step to determine the biggest difference in daily average temperature between consecutive days.
-
Review the output.
In the HTML results, place the cursor on a data point for a tooltip that displays the values of Date and TempIncrease. Notice the following results:- The biggest decrease in temperature (-8.9) occurred in Beijing on 10/2/2017.
- The biggest increase in temperature (7.2) occurred in Beijing on 6/7/2017.
- The biggest decrease in temperature (-10.5) occurred in Shanghai on 2/20/2017.
- The biggest increase in temperature (8.9) occurred in Shanghai on 2/19/2017.
SAS® Programming 3: Advanced Techniques
Lesson 02, Section 1 Demo: Counting and Finding Words with Character Functions
In this demonstration we analyze tornado data to count the total number of words and the number of times the string EF appears in the narrative. We also locate the string EF within each narrative to identify the word that follows EF.

COUNT(string, substring <, modifier(s)>) COUNTC(string, character-list <, modifier(s)>) COUNTW(string <, delimiter(s)> <, modifier(s)>) FIND(string, substring <, modifier(s)> <, start-position>) FINDC(string, character-list <, modifier(s)> <, start-position>) FINDW(string, word, <, delimiter(s)> <, modifier(s)> <, start-position>) |
- Open the p302d02.sas program in the demos folder and find the Demo section. The first DATA step reads pg3.tornado_2017narrative and creates the table, work.narrative. Run the DATA step and the PROC PRINT step to display the new table.
-
Review the output.
We'll analyze the text in the Narrative column. - Uncomment the two assignment statements in the COUNT functions section. The first statement calls COUNT to get the number of times EF occurs in the Narrative column and assigns the return value to NumEF. The second assignment statement calls COUNTW to count the number of words using a space delimiter and assigns the return value to NumWord. Run the DATA and PROC PRINT steps again.
/* COUNT Functions */ NumEF=count(Narrative,'EF'); NumWord=countw(Narrative,' ');
-
Review the output.
The results include the two new columns, NumEF and NumWord. In row 1, the Narrative column has one occurrence of EF and a total of 112 words. - Uncomment the two assignment statements in the FIND Functions section. The first statement calls the FIND function to get the starting position of the first occurrence of the string EF and assigns it to EFStartPos. The second statement calls FINDW to get the starting position of the first occurrence of the word EF using the default delimiters and assigns the result to EFWordNum. Run the DATA and PROC PRINT steps again.
/* FIND Functions */ EFStartPos=find(Narrative,'EF'); EFWordNum=findw(Narrative,'EF');
-
Review the output.
Notice that EFWordNum is equal to EFStartPos when the first occurrence of EF is followed by a hyphen, and EFWordNum is equal to 0 when the first occurrence of EF is followed by a number. This is because a hyphen is included in the default set of word delimiters, but numbers aren't. So EF-3 is seen as a word, but EF1 is not. - Modify the EFWordNum assignment statement to add a third argument that specifies the digits 0 through 5, a hyphen, a period and a comma as delimiters and run the DATA and PROC PRINT steps again.
EFWordNum=findw(Narrative,'EF','012345- .,');
-
Review the output.
Now it recognizes EF-number as a word, so EFWordNum and EFStartPos have the same value. - Modify the EFWordNum assignment statement to add a fourth argument, 'e', to request the number of the word instead of its starting position. Run the DATA and PROC PRINT steps again.
EFWordNum=findw(Narrative,'EF','012345- .,','e');
-
Review the output.
In row 1, EF starts in position 119 and it's the twentieth word in Narrative. - Uncomment the conditional statement, which, if true, scans Narrative for the word after the EF word. For example, in row 1, EFWordNum was 20, so this would return the twenty-first word. Run the DATA and PROC PRINT steps again.
if EFWordNum>0 then AfterEF=scan(Narrative,EFWordNum+1,'012345- .,');
-
Review the output.
In row 1, the value of AfterEF is strength and in row 2 its value is damage. - Run the PROC FREQ and PROC MEANS steps.
-
Review the output.
On average, EF is referenced 0.88 times within a narrative with a range of 0 to 6 times. On average, 101.7 words are written in a narrative with a range of 3 to 676 words. Tornado is the word most likely to follow the EF value.
SAS® Programming 3: Advanced Techniques
Lesson 02, Section 2 Demo: Validating Data with the PRXMATCH Function
The PRXMATCH function searches for a pattern match and returns the position at which the pattern is found. In this demo we use it to find valid US phone numbers in the pg3.phonenumbers_us table.

pattern-ID-number=PRXPARSE(Perl-regular-expression);
PRXMATCH(Perl-regular-expression | pattern-ID-number, source) |
- Open the p302d04.sas program in the demos folder and find the Demo section. In the first DATA step, the incomplete assignment statement calls the PRXMATCH function and assigns the return value to a column named Loc.
Add a Perl regular expression to the first argument of the PRXMATCH function to find valid phone numbers. In this example, a valid US phone number consists of a three-digit area code, a three-digit prefix, and a four-digit line number, with the groups separated by hyphens.
Loc=prxmatch('/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/',Phone);
The first slash is the beginning delimiter. The first group, the area code, must start with a digit 2 through 9, followed by any two digits. The three-digit prefix follows the same rules as the area code, so use the same metacharacters in the second set of parentheses. The four-digit line number can be any 4 digits, and the final slash is the ending delimiter. Run the first DATA and PROC PRINT steps. -
Review the output.
PRXMATCH returns the position at which a pattern match is found, so the rows where Loc has a value greater than 0 contain valid phone numbers. Rows 2, 3, and 7 are considered valid based on the regular expression but notice that Row 3 has a 1 dash prefix and row 7 is followed by US.
We need to tighten up our regular expression to find only rows where the 10-digit phone number is the only text in the Phone column, with no leading or trailing text. - Copy and paste the Loc assignment statement and modify it to create the LocStartEnd column
to find only values that start and end with the 10-digit number, with no leading or trailing text. Call the STRIP function to the second argument to remove any leading and trailing spaces from the Phone column before the value is compared to the matching expression.
LocStartEnd=prxmatch('/^([2-9]\d\d)-([2-9]\d\d)-(\d{4})$/',strip(Phone));
The caret at the beginning of the regular expression indicates that the match must start at the beginning of the string, with no leading text. Similarly, the dollar sign at the end indicates that the match must be at the end of the string, with no trailing text.
- Submit the DATA and PROC PRINT steps again.
-
Review the output.
The new pattern matches only one phone number, the row where LocStartEnd is 1. The last two rows also contain valid phone numbers, with the area code enclosed in parentheses.
- Copy and paste the Loc assignment statement again. Modify the statement to create LocParen and allow parentheses around the area code, possibly followed by a space. Recall that parentheses are used for grouping, but in this case, you want to match them explicitly, so precede each parenthesis with a backslash character. There is no longer a hyphen after the area code, so replace the first hyphen with \s*, allowing for zero or more white space characters.
LocParen=prxmatch('/\(([2-9]\d\d)\)\s*([2-9]\d\d)-(\d{4})/',Phone);
- Submit the DATA and PROC PRINT steps again.
-
Review the output.
Based on the LocParen column, the last two rows match the pattern beginning at position 1 in Phone. - Add a subsetting IF statement to select only the rows where a pattern was matched by any of the PRXMATCH calls.
if Loc ne 0 or LocStartEnd ne 0 or LocParen ne 0;
- Submit the DATA and PROC PRINT steps again.
-
Review the output.
The new report contains the five rows with valid phone numbers. - The last DATA step includes a call to the PRXDEBUG routine which sends debugging output to the SAS log. Instead of passing the regular expression to the PRXMATCH function, it is assigned to a column named Exp, and then passed as the first argument to PRXMATCH. Submit the last DATA step.
-
View the log.
The PRXDEBUG routine wrote to the log on every iteration of the DATA step. The log output includes the iteration number, a message that the regular expression is compiling, and other debugging information. Scroll down and observe that the compiling message appears on each iteration. - Recall that adding the O option at the end of a Perl regular expression causes it to be compiled only once. Add this option and run the DATA step again.
data work.ValidPhoneNumbers; set pg3.phonenumbers_us; putlog 'Iteration: ' _N_=; call prxdebug(1); /* Sends debugging output to the SAS log. */ Exp='/([2-9]\d\d)-([2-9]\d\d)-(\d{4})/o'; Loc=prxmatch(Exp,Phone); run;
-
View the log.
Now the Compiling message is only included on the first iteration. This shows that the O option makes your code more efficient.
SAS® Programming 3: Advanced Techniques
Lesson 02, Section 2 Demo: Standardizing Data with the PRXCHANGE Function
The PRXCHANGE function performs a pattern-matching replacement. In this demonstration, we use it to standardize words in weather station names and to switch the longitude and latitude values for the weather station locations in the pg3.weather_usstationshourly table.

PRXCHANGE(Perl-regular-expression, times, source) |
- Open the p302d05.sas program in the demos folder and find the Demo section. In the first DATA step, the two assignment statements call PRXCHANGE. The first call modifies the value in the Name column and stores the modified string in Name_New. The next statement modifies the value of Name_New, replacing various forms of INT and INTL with the word INTERNATIONAL.
- In the first assignment statement, modify the Perl regular expression to replace the letters AP with the word AIRPORT for all occurrences. Only replace the word AP. Don't replace it if it appears within a larger word. To do this search for space AP space, and replace it with space AIRPORT space. Alternatively, you can use the word boundary metacharacter instead of the leading and trailing space: \bAP\b. The second argument, -1, replaces all occurrences of the pattern.
- Submit the program.
-
Review the output.
The Name_New column contains the standardized value of AIRPORT and INTERNATIONAL. - Next, switch longitude and latitude values for the weather station locations. Uncomment the LatLong assignment statement in the DATA step and the VAR statement in the PROC PRINT step. The assignment statement calls PRXCHANGE to do a find and replace on the value in LongLat. The regular expression begins with the letter s which you can think of as substitution, followed by a slash delimiter, the search pattern, another slash delimiter, the replacement pattern, and a final slash.
Let's walk through the metacharacters that define the search pattern. There are three sets of parentheses, so three groups are created. The first group matches the longitude value. It might start with a hyphen, so we use -? to specify zero or one hyphen. \d+ matches one or more digits, followed by a period that must be matched exactly, and then the \d* to match zero or more digits. It will stop matching digits when it encounters the @ sign, which is saved as the second group. The third group matches the latitude value, using the same metacharacters as the first group.
Each group is saved in a capture buffer, so the first capture buffer, $1, holds the longitude, $2 holds the at sign and $3 holds the latitude value. Complete the replace portion of the argument by entering $3$2$1 to specify latitude@longitude. - Submit the program.
-
Review the output.
The latitude value appears before the longitude value in the LatLong column.

Name_New=prxchange('s/ AP / AIRPORT /',-1,Name); *Name_New=prxchange('s/\bAP\b/AIRPORT/',-1,Name);

LatLong=prxchange('s/(-*\d+\.\d*)(@)(-*\d+\.\d*)/$3$2$1/',-1,LongLat);
SAS® Programming 3: Advanced Techniques
Lesson 03, Section 2 Demo: Processing One-Dimensional Arrays: Part 1
Suppose you're planning a trip to Europe during the first or third quarter of next year and you want to explore the average monthly temperatures and precipitation in Dublin and Madrid. In this demonstration we use arrays to convert monthly Fahrenheit temperatures to Celsius, and to calculate the quarterly percentages of precipitation based on the total yearly precipitation.

ARRAY array-name[number-of-elements | *] array-name[element-number | lower-bound:upper-bound] DIM(array-name) |
- Open the p303d01.sas program in the demos folder and find the Demo section. The first DATA step contains two ARRAY statements. The first array, Farenht, has three elements that reference existing columns: Temp1, Temp2, and Temp3. The Celsius array creates three new numeric columns, TempC1, TempC2, and TempC3.
- Add a DO loop to convert each temperature in the Fahrenheit array to Celsius and assign the new value to the corresponding element in the Celsius array. Name the index variable Month, and set its start value to 1 and its stop value to 3.
- Run the DATA step.
-
View the new table.
The output table, tempQ1, contains the three existing Fahrenheit temperatures Temp1-Temp3 and the three new Celsius temperatures TempC1-TempC3. - Modify the DATA step to create a new table, tempQ3, that contains the Fahrenheit and Celsius values for the 3rd quarter, months 7, 8, and 9.
- Change the table name to tempQ3 and change the names of the Temp columns in the KEEP statement to Temp7-Temp9.
- In both ARRAY statements, set the lower bound to 7 and the upper bound to 9, and update the element list to refer to temperatures for months 7 through 9.
- Change the start and stop values in the DO loop, and the column names in the FORMAT statement and run the DATA step.
*First DATA Step; data work.tempQ3(drop=Month); set pg3.weather_dublinmadrid_monthly2017 (keep=City Temp7-Temp9); array Farenht[7:9] Temp7-Temp9; array Celsius[7:9] TempC7-TempC9; do Month=7 to 9; Celsius[Month]=(Farenht[Month]-32)*5/9; end; format TempC7-TempC9 6.1; run;
- Run the modified DATA step.
-
View the new table.
The output table, tempQ3, contains Fahrenheit and Celsius temperatures for months 7 through 9. - Next, you need to calculate the quarterly percentages of precipitation based on the total yearly precipitation. Open the pg3.weather_dublinmadrid_monthly2017 table. It contains four quarterly precipitation columns in inches (PrecipQ1-PrecipQ4) in addition to the Fahrenheit temperature columns.
The second data step defines two arrays. The P array references the four quarterly precipitation values read from the input table, and the PCT array creates four new columns, PrecipPctQ1 through PrecipPctQ4. - You need to calculate the total precipitation for the year before you can calculate the new percentage columns. Add an assignment statement after the first array statement to calculate PrecipTotal, summing the quarterly precipitation columns. You can use either SUM(of PrecipQ1-PrecipQ4) or SUM(of P[*]). The asterisk references all elements in the P array.
- In the DO loop, add an assignment statement to calculate the quarterly percentages by dividing each quarterly precipitation value by the precipitation total. I'll run the step and examine the output table. The new columns show the percentage of yearly precipitation for each quarter.
- Run the modified DATA step.
-
View the new table.
Based on the results, the 3rd quarter is the wettest quarter in Dublin, and the 1st quarter is the wettest in Madrid. But if you're planning to visit both countries, then the second quarter may be the best option for not getting wet, as these are the lowest percentages for each city.

*First DATA Step; data work.tempQ1(drop=Month); set pg3.weather_dublinmadrid_monthly2017 (keep=City Temp1-Temp3); array Farenht[3] Temp1-Temp3; array Celsius[3] TempC1-TempC3; do Month=1 to 3; Celsius[Month]=(Farenht[Month]-32)*5/9; end; format TempC1-TempC3 6.1; run;

data work.precip(drop=i); set pg3.weather_dublinmadrid_monthly2017 (keep=City PrecipQ1-PrecipQ4); array P[4] PrecipQ1-PrecipQ4; PrecipTotal=sum(of PrecipQ1-PrecipQ4); *PrecipTotal=sum(of P[*]); array Pct[4] PrecipPctQ1-PrecipPctQ4; do i=1 to 4; Pct[i]=P[i]/PrecipTotal; end; format PrecipPctQ1-PrecipPctQ4 percent8.1; run;
SAS® Programming 3: Advanced Techniques
Lesson 03, Section 2 Demo: Processing One-Dimensional Arrays: Part 2
In this demonstration we use arrays to compare the quarterly precipitation values for one year to the average quarterly precipitation values for five years.

ARRAY array-name[number-of-elements | *] |
- Open the p303d02.sas program in the demos folder and find the Demo section. The input table, pg3.weather_dublinmadrid_monthly5yr, contains the average monthly Fahrenheit temperatures and the total quarterly precipitation values in inches for Dublin and Madrid for a five-year period, 2013 through 2017. We subset the rows and columns, selecting only the Dublin data.
- The DO loop will execute four times, once for each quarterly comparison. Add three conditional statements to assign values to the status columns.
If the quarterly precipitation value, P[i], is greater than the 5-year average for that quarter, PAvg[i], the value Above is assigned to Status[i]. If the quarterly value is less than its corresponding 5-year average, the status is set to Below, and if by chance they're exactly equal, the status is set to Same.
data work.DublinPrecipStatus(drop=i); set pg3.weather_dublinmadrid_monthly5yr (keep=City Year PrecipQ1-PrecipQ4); where City='Dublin'; array P[4] PrecipQ1-PrecipQ4; array PAvg[4] PAvgQ1-PAvgQ4 (7.65 , 6.26 , 7.56 , 9.12); array Status[4] $ 5 StatusQ1-StatusQ4; do i=1 to 4; if P[i] > PAvg[i] then Status[i]='Above'; else if P[i] < PAvg[i] then Status[i]='Below'; else if P[i] = PAvg[i] then Status[i]='Same'; end; run;
- Run the DATA step.
-
View the new table.
In row 1 the quarter 1 precipitation value for 2013 is 8.94, higher than the 5-year average for that quarter, 7.65. The value of StatusQ1 is Above, which is correct. The values for the 5-year average columns, PAvgQ1 through PAvgQ4, are the same in every row. This shows that the new status columns were created correctly, but it's redundant. - Add a DROP statement to remove the new status columns from the output.
data work.DublinPrecipStatus(drop=i); set pg3.weather_dublinmadrid_monthly5yr (keep=City Year PrecipQ1-PrecipQ4); where City='Dublin'; array P[4] PrecipQ1-PrecipQ4; array PAvg[4] PAvgQ1-PAvgQ4 (7.65 , 6.26 , 7.56 , 9.12); array Status[4] $ 5 StatusQ1-StatusQ4; do i=1 to 4; if P[i] > PAvg[i] then Status[i]='Above'; else if P[i] < PAvg[i] then Status[i]='Below'; else if P[i] = PAvg[i] then Status[i]='Same'; end; drop PAvgQ1-PAvgQ4; run;
- Run the DATA step again.
-
View the new table.
The table contains the four precipitation columns and the four status columns but not the four average precipitation columns. They were dropped. - As an alternative to the DROP statement, you can define PAvg as a temporary array. Remove the DROP statement and add _TEMPORARY_ to the ARRAY statement just before the initial value list. Placement of this keyword is important. It must go here.
data work.DublinPrecipStatus(drop=i); set pg3.weather_dublinmadrid_monthly5yr (keep=City Year PrecipQ1-PrecipQ4); where City='Dublin'; array P[4] PrecipQ1-PrecipQ4; array PAvg[4] _temporary_ (7.65 , 6.26 , 7.56 , 9.12); array Status[4] $ 5 StatusQ1-StatusQ4; do i=1 to 4; if P[i] > PAvg[i] then Status[i]='Above'; else if P[i] < PAvg[i] then Status[i]='Below'; else if P[i] = PAvg[i] then Status[i]='Same'; end; run;
- Run the DATA step.
-
View the new table.
Looking at the data it's not easy to see a consistent pattern. However, it seems that 2014 and 2015 were the wettest of the five years, as three of the four quarterly comparisons show above-average precipitation.
The DATA step contains three ARRAY statements. The array, P, references existing columns of quarterly precipitation read from the input table. The array, PAvg, creates four numeric columns and initializes them using the quarterly averages across the five years. For example, the 5-year average precipitation for quarter 1 is 7.65 inches. The Status array creates four new character columns, StatusQ1 through StatusQ4, each with a length of 5.
SAS® Programming 3: Advanced Techniques
Lesson 03, Section 3 Demo: Processing Two-Dimensional Arrays: Part 1
In this demonstration we use a DATA step to calculate the difference in the average daily temperature from the average monthly temperature. We use a two-dimensional array as a lookup table.
- Open the p303d04.sas program in the demos folder and find the Demo section. In this DATA step, the ARRAY statement creates the two-dimensional array, Avg, with two rows and three columns. The year values, 2013 and 2014, are used to define the lower and upper bounds for the rows. The three columns correspond to the month values. Six initial values are listed to populate the new columns, Avg1-Avg6.
For simplicity purposes, the WHERE= data set option on the SET statement selects data for the 15th day of the first three months of 2013 and 2014. These six data points are compared to the six averages stored in the Avg array. - In Enterprise Guide, click the DATA step markers for debugging button on the toolbar to enable debugging in the program.
- Click the Debugger icon next to the DATA statement. Compilation has already taken place, so the pane on the right lists each column in the PDV. Notice that Avg1 thru Avg6 are populated with the initial values from the ARRAY statement, and other than _N_ and _ERROR_, the other columns are set to missing.
- Click the Step execution to the next line toolbar button to execute the SET statement, populating Date and TempDailyAvg. The date is 15Jan2013.
- Click the Step button repeatedly to execute each statement, observing how the values change. The YEAR function extracts the year, 2013, from Date and assigns it to Y, and the MONTH function extracts the month number, assigning 1 to M. The next statement locates the corresponding temperature in the Avg array, in this case, Avg[2013,1]. This corresponds to Avg1, so the value 40.9 is assigned to TempMonthlyAvg. The temperature change is calculated and assigned to the Difference column.
- On the second iteration, Date is 15Feb2013, so TempMonthlyAvg is assigned 40.7, the value located in Avg[2013,2], or Avg2. Difference is assigned 1.2999.
- Continue clicking through the six iterations of the DATA step and notice that the values of Y and M are used to access the corresponding values from the Avg2 array. Close the debugger window.
- Run the DATA step.
-
View the new table.
Notice that Avg1 through Avg6 are included in the output table. - The average values are only needed for a lookup table during execution. Add _TEMPORARY_ to the ARRAY statement.
data work.DublinDaily; array Avg[2013:2014,3] _temporary_ (40.9, 40.7, 38.6, 42.5, 42.6, 45.4); set pg3.weather_dublin_daily5yr (where=(day(Date)=15 and month(Date) le 3 and year(Date) in (2013,2014)) keep=Date TempDailyAvg); Y=year(Date);
M=month(Date); TempMonthlyAvg=Avg[year(Date),month(Date)]; Difference=TempDailyAvg-TempMonthlyAvg; run; - Click the Debug icon again. Note that the Avg columns aren't included in the PDV because they were defined as temporary elements, not as DATA step columns. Close the Debugger window.
- Run the DATA step again.
-
View the new table.
Verify that the Avg columns are not in the output table. - This program invokes the YEAR and MONTH functions to populate Y and M, and then uses Y and M as subscripts to access an array element. As an alternative, you can use a SAS expression as an array subscript. Delete the assignment statements for the Y and M columns, and in the TempMonthlyAvg assignment statement, replace Y with the call to the YEAR function, and replace M with a call to the MONTH function. The return values are used as subscripts.
The advantage of this approach is that you're not allocating space in the PDV for columns that are only needed as subscripts.
data work.DublinDaily; array Avg[2013:2014,3] _temporary_ (40.9, 40.7, 38.6, 42.5, 42.6, 45.4); set pg3.weather_dublin_daily5yr (where=(day(Date)=15 and month(Date) le 3 and year(Date) in (2013,2014)) keep=Date TempDailyAvg); TempMonthlyAvg=Avg[year(Date),month(Date)]; Difference=TempDailyAvg-TempMonthlyAvg; run;
- Run the DATA step again.
-
View the new table.
Veirfy that the Y and M columns are no longer included in the output table.
SAS® Programming 3: Advanced Techniques
Lesson 03, Section 3 Demo: Processing Two-Dimensional Arrays: Part 2
In this demonstration we calculate the difference in the average daily temperature from the average monthly temperature over a five-year period. We need to create a lookup table using 12 months of temperatures across 5 years. Instead of typing the initial values, we load them from a SAS table into an array.
- Open the p303d05.sas program in the demos folder and find the Demo section.
In this DATA step we load the average monthly temperatures from the pg3.weather_dublin_monthly5yr table into the two-dimensional array, Avg. The array has 5 rows and 12 columns, creating 60 new columns, Avg1 through Avg60. We need to read the daily values from weather_dublin_daily5yr and compare the daily values to the monthly averages in the Avg array. For simplicity, we only read the daily temperatures for the 15th day of each month.
data work.DublinDaily; array Avg[2013:2017,12]; if _N_=1 then do Yr=2013 to 2017; set pg3.weather_dublin_monthly5yr(keep=Temp1-Temp12); Avg[Yr,1]=Temp1; Avg[Yr,2]=Temp2; Avg[Yr,3]=Temp3; Avg[Yr,4]=Temp4; Avg[Yr,5]=Temp5; Avg[Yr,6]=Temp6; Avg[Yr,7]=Temp7; Avg[Yr,8]=Temp8; Avg[Yr,9]=Temp9; Avg[Yr,10]=Temp10; Avg[Yr,11]=Temp11; Avg[Yr,12]=Temp12; end; set pg3.weather_dublin_daily5yr(where=(day(Date)=15) keep=Date TempDailyAvg); Y=year(Date); M=month(Date); TempMonthlyAvg=avg[Y,M]; Difference=TempDailyAvg-TempMonthlyAvg; keep Date TempDailyAvg TempMonthlyAvg Difference; run;
- In Enterprise Guide, click the DATA step markers for debugging button on the toolbar to enable debugging in the program.
- Click the Debugger icon next to the DATA statement. Compilation has already taken place, so the pane on the right lists each column in the PDV. Notice that there are 60 Avg columns, all with missing values, and 12 Temp columns, that are also set to missing.
- In the first iteration of the DATA step, the DO loop loads the 60-element array. There's no need to populate the array more than one time, and the IF-THEN-DO logic ensures that. Click the Step execution to next line toolbar button, or F10, to execute the IF/THEN statement, the SET statement and the first few assignment statements in the DO loop. Yr is set to 2013 and the first row of monthly data is read, populating Temp1-Temp12. The series of assignment statements assigns the value of a Temp column to a corresponding element in Avg.
- To speed through the loading of the array, set a watch on the Avg60 column by entering the command watch avg60 on the command line at the bottom of the window, or by selecting the checkbox in the Watch column next to Avg60. This is the last column that will be populated in the DO loop.
- With the watch set, click the Start/continue debugger execution toolbar button. This causes the execution to continue until the value of the Avg60 column changes. At this point, Year is 2017 and the entire 60-element array has been loaded.
- Click the Step execution to next line toolbar button, or F10 to read from the daily table, look up the average temperature in the array, and calculate the difference. Watch how the values of the variables change as you step through the code.
- On the second iteration of the DATA step, when _N_ changes to 2, the values of Avg1 through Avg60 are reset to missing because the columns are created in assignment statements. Add _TEMPORARY_ to the ARRAY statement. This eliminates the columns from the PDV and from the output table, and causes the loaded values to be retained.
- Finally, modify the IF-THEN block to eliminate the repetitive assignment statements. Define another array, T, with 12 elements, Temp1 through Temp12, and add a DO loop that increments an index column, Month, from 1 to 12. Inside the loop, assign the value of T[Month] to Avg[Yr,Month].
data work.DublinDaily; array Avg[2013:2017,12] _temporary_; if _N_=1 then do Yr=2013 to 2017; set pg3.weather_dublin_monthly5yr(keep=Temp1-Temp12); array T[12] Temp1-Temp12; do Month=1 to 12; Avg[Yr,Month]=T[Month]; end; end; set pg3.weather_dublin_daily5yr(where=(day(Date)=15) keep=Date TempDailyAvg); Y=year(Date); M=month(Date); TempMonthlyAvg=avg[Y,M]; Difference=TempDailyAvg-TempMonthlyAvg; keep Date TempDailyAvg TempMonthlyAvg Difference; run;
- Run the DATA step.
-
View the new table.
Verify that the output table contains Date, TempDailyAvg, TempMonthlyAvg, and Difference
SAS® Programming 3: Advanced Techniques
Lesson 04, Section 2 Demo: Declaring and Defining a Hash Object
In this demonstration we declare and define a hash object containing US state capital and population data.

DECLARE object object-name(<argument_tag-1: value-1, … > );
object-name.DEFINEKEY('key-1' < , … 'key-n' >); object-name.DEFINEDATA('data-1' < , … 'data-n' >); object-name.DEFINEDONE( ); |
- Open the p304d02.sas program in the demos folder and find the Demo section. The DATA step declares the States hash object and uses the DATASET argument to specify that the data in pg3.population_usstates will be loaded into the hash object. Then it uses hash object methods to define StateName as the key component and Capital and StatePop2017 as the data components. The DEFINEDONE method tells SAS that the definition is complete. During execution, this method loads the table into the hash object.
- Run the program.
-
Review the log.
An error message in the log reports an undeclared key symbol, StateName for the hash object. The error occurs because the key component isn't defined as a column in the code, so it isn't in the PDV. - Add a LENGTH statement after the DATA statement to define StateName as a character column with a length of 20 bytes.
data work.StateCityPopulation; length StateName $ 20; if _N_=1 then do; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); end; run;
- Run the program again.
-
Review the log.
There's another error. The log message references the undeclared data symbol Capital. This error appears because Capital isn't defined in the PDV.
The same is true for StatePop2017. - Add both columns to the LENGTH statement. Capital is a character column with a length of 14 bytes and StatePop2017 is a numeric column with a byte size of 8.
data work.StateCityPopulation; length StateName $ 20 Capital $ 14 StatePop2017 8; if _N_=1 then do; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); end; run;
- Run the program again.
-
Review the log.
There are no errors, but the notes in the log indicate that the three components are uninitialized. This is because the LENGTH statement created the columns in the PDV, but they aren't assigned values in the code. - To eliminate the notes, add a CALL MISSING statement to assign a missing value to each of the three columns. This will eliminate the uninitialized notes because the compiler views this as assigning values to each column. You only need to set the components to missing during the first iteration, so put the call inside the DO block.
data work.StateCityPopulation; length StateName $ 20 Capital $ 14 StatePop2017 8; if _N_=1 then do; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); call missing(StateName,Capital,StatePop2017); end; run;
- Run the program again.
-
Review the log.
The notes indicate that 50 rows were read to load the hash object, and the new table, StateCityPopulation was created with one row and three columns, the three data components. -
View the output table.
The columns contain missing values. Why? Remember that a hash object is an in-memory table that only exists for the duration of the DATA step. Although this DATA step creates a new table, only the descriptor portion was written. You didn't include code to write the hash object components to the data portion of the table.

data work.StateCityPopulation; if _N_=1 then do; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); end; run;
SAS® Programming 3: Advanced Techniques
Lesson 04, Section 3 Demo: Performing a Table Lookup with the FIND Method
In this demonstration we use the FIND method to retrieve data from the States hash object and use it to calculate the percentage of city populations within a state.

object-name.FIND(<KEY: value-1, … KEY:value-n>); |
- Open the p304d03.sas program in the demos folder and find the Demo section. The DATA step declares and defines the States hash object as described in the previous demonstration. The SET statement reads the pg3.population_uscities table and the assignment statement passes StateCode to the STNAMEL function to assign the return value of StateName.
data work.StateCityPopulation; length StateName $ 20 Capital $ 14 StatePop2017 8; if _N_=1 then do; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); call missing(StateName, Capital, StatePop2017); end; set pg3.population_uscities; StateName=stnamel(StateCode); *PctPop=CityPop2017/StatePop2017; *format StatePop2017 comma14. PctPop percent8.1; run;
- Run the program.
-
View the output table (partial).
The table contains 19500 rows. StateName looks good, but the columns that correspond to the hash object data components, Capital and StatePop2017, have missing values. There is no code in the DATA step to retrieve the data components from the hash object. - Add an assignment statement that invokes the FIND method to search for StateName in the States hash object and assign the return code to a new column, RC. If the StateName value is found, then the Capital and StatePop2017 columns will be populated in the PDV, and written to the output table.
RC=States.find(key:StateName);
- Uncomment the PctPop assignment statement and the FORMAT statement.
PctPop=CityPop2017/StatePop2017; format StatePop2017 comma14. PctPop percent8.1;
- Run the program.
-
View the output table (partial).
The value of RC is 0 for all rows except row 20. The StateName value, District of Columbia, wasn't found in the hash object. The values of Capital and StatePop2017 are missing for row 20. This is because the columns were reinitialized at the beginning of the iteration, and no values were retrieved from the hash object. - There's another way to create columns in the PDV and initialize them to missing. Delete the LENGTH and CALL MISSING statements and replace them with a conditional SET statement at the beginning of the DO block. The condition, IF 0, will never be true, so the SET statement won't execute, but it will be used at compile time to create the columns in the PDV.
data work.StateCityPopulation; if _N_=1 then do; if 0 then set pg3.population_usstates; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); end; set pg3.population_uscities; StateName=stnamel(StateCode); RC=States.find(key:StateName); PctPop=CityPop2017/StatePop2017; format StatePop2017 comma14. PctPop percent8.1; run;
- Run the program.
-
View the output table.
Observe that in row 20, the values of Capital and StatePop2017 are incorrect. Variables that are read via a SET statement are automatically retained, so they aren't re-initialized to missing. Therefore, you see the previous values for these two columns. When the FIND method is successful, the previous values are overwritten, but if FIND fails, you need to reinitialize data component columns. - Add an IF-THEN statement to check the value of RC. If it's non-zero, a failure, then invoke CALL MISSING to set Capital and StatePop2017 to missing.
data work.StateCityPopulation; if _N_=1 then do; if 0 then set pg3.population_usstates; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); end; set pg3.population_uscities; StateName=stnamel(StateCode); RC=States.find(key:StateName); if RC ne 0 then call missing(Capital, StatePop2017); PctPop=CityPop2017/StatePop2017; format StatePop2017 comma14. PctPop percent8.1; run;
- Run the program.
-
View the output table (partial).
Capital and StatePop2017 now have missing values. - Modify the DATA step to create a second table, CapitalPopulation that contains population data for only capital cities.
- Add work.CapitalPopulation to the DATA statement.
- After the PctPop calculation, add an OUTPUT statement to write the contents of the PDV to the original table, StateCityPopulation.
- If the row is for a capital city, then write the contents of the PDV to the CapitalPopulation table.
data work.StateCityPopulation work.CapitalPopulation; if _N_=1 then do; if 0 then set pg3.population_usstates; declare hash States(dataset: 'pg3.population_usstates'); States.definekey('StateName'); States.definedata('Capital','StatePop2017'); States.definedone(); end; set pg3.population_uscities; StateName=stnamel(StateCode); RC=States.find(key:StateName); if RC ne 0 then call missing(Capital, StatePop2017); PctPop=CityPop2017/StatePop2017; output work.StateCityPopulation; if Capital=CityName then output work.CapitalPopulation; format StatePop2017 comma14. PctPop percent8.1; run;
- Run the program.
-
View the StateCityPopulation table (partial).
The complete StateCityPopulation table contains 19,500 rows. -
View the CapitalPopulation table.
The CapitalPopulation table contains 50 rows, one for each capital city.