The DATA step in SAS names the SAS Data Set that is being created (the DATA statement), tells SAS where the data may be found (the INFILE statement if the data are in an external file, or the CARDS statement if the data are instream), and describes the layout and format of the raw data (the INPUT statement).
A very simple DATA step (with data in an external file) might look like this:
data first;
infile '~/mydata/raw8.data';
input id 1-3 height 4-5 weight 7-8
age 10-11 sex $ 15;
run;
With instream data, a similar DATA step might look like this:
data second;
input id 1-3 height 4-5 weight 7-9
age 11-12 sex $ 13;
cards;
00158  98 14F
00265 149 15M
00371 198 17M
00445 100 12F
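and so forth, with a line containing only a semicolon (;) after the last data line to mark the end of the instream data.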
Occasionally, users have problems with the record length of raw data files read by SAS. UNIX text files typically have variable-length records: each record ends immediately after the last character in the line. Consider the following four-line file:
14658
2346
33546
45342
Lines 1, 3 and 4 are five characters long, while line 2 is four characters long. We'll call the raw data "raw.data" and ask the following SAS program to try to read it:
data temp; infile 'raw.data';
input a 1 b 2 c 3 d 4 e 5 ;
run;
The value for E in observation 2 should be "." (missing). Instead, when SAS reaches the end of line 2 without finding a column 5, it goes to the next line, where it finds a "3" and assigns that to variable E for observation 2. All values from that point on are incorrect, and the SAS Data Set ends up with the wrong number of observations.
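One way to confirm that a file is being misread like this is to print the data set and check the SAS log. The following is a minimal sketch that assumes the data set TEMP created by the program above still exists in the current session:

```sas
/* Print TEMP to inspect the values that were actually read */
proc print data=temp;
run;
```

When SAS has moved to a new line in the middle of an INPUT statement, the log normally contains the note "SAS went to a new line when INPUT statement reached past the end of a line."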
One solution to this problem is to "pad" each line with blanks out to a fixed column so that all the lines are the same length. In the example above, extending each line to column 6 solves the problem. For other data files, choose a column number greater than the length of the longest existing line. The following SAS program can be run to create a new data file that SAS can read as intended.
data _null_;
infile 'raw.data';
file 'raw.data.fixed';
input;                    /* read the whole record into _infile_ */
put @6 ' ' @1 _infile_;   /* blank at column 6, then the record from column 1 */
run;
This program puts a real blank at column 6 in each line, then puts each line of the original infile beginning at column 1 of the same line. The result is a data file in which every line is 6 characters long, regardless of how many "real" characters are in the line.
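To check the result, the padded file can be read with the same INPUT statement as before. This is only a sketch: it assumes "raw.data.fixed" was written to the current working directory by the program above, and it reuses the data set name TEMP from the earlier example.

```sas
/* Read the padded copy instead of the original file */
data temp;
  infile 'raw.data.fixed';
  input a 1 b 2 c 3 d 4 e 5;
run;

/* Four observations are expected, with E missing (.) in observation 2 */
proc print data=temp;
run;
```

Because every line now extends past column 5, SAS finds a blank in column 5 of line 2 and sets E to missing instead of moving to the next line.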