These sample files are used in the ITS workshop "Dealing With Data", which is offered each semester as part of the ITS workshop series and can also be taught by special request. The class includes a presentation of search strategies and worldwide data sources on the WWWeb. The sample files cover the following types of data:
Raw ASCII -- rectangular, free format (sample contains both
numeric and character values)
Raw ASCII -- rectangular, fixed format (sample contains both
numeric and character values)
EBCDIC -- (fixed format used for this sample)
Hierarchical -- (example taken from U.S. Census PUMS files)
Comma-Delimited
Excel/Access (.xls) files (after downloading, rename database.xlsfile to database.xls)
Tab-Delimited
System Files
SPSS Portable File (after downloading, rename spss.transport to spss.por)
SAS Transport File
Binary Data
Compressed Data (once uncompressed, data files will probably be
one of the types shown in the above examples)
The sample file (rawfree.dat.gz) was compressed with gzip. You can download this sample file to UNIX and uncompress it with the following command:
    gunzip rawfree.dat.gz
You can download this sample file to a PC (Windows) or Mac and uncompress it using Winzip from http://www.winzip.com/
When unzipped, this data file is equivalent to the Raw ASCII (rectangular, free format) file in the first example above, and can be read with the SPSS or SAS programs in that example.
NOTE: Unfortunately, in this age of computer software trying to be too helpful, some (most?) browsers will uncompress the data before downloading, thereby destroying this example. Please complain to your browser's WWWeb site about this.
*.html or *.txt and does not
give you the "All Files (*.*)" choice
Save or OK
A rectangular file consists of straightforward rows (observations) and columns (variables) in which each row (or group of rows) is a complete observation. A hierarchical file is one in which some records contain one type of data (e.g., Household), and other records contain other types of data (e.g., Persons). Hierarchical files require considerably more programming in order to read them, as compared with rectangular files.
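As a rough illustration of the extra programming a hierarchical file requires, the following Python sketch attaches each Person record to the Household record that precedes it. The record layout is hypothetical (a record-type letter in column 1 followed by invented fields) and is not the actual PUMS layout; the function name read_hierarchical is likewise made up for this example.

```python
# Hypothetical layout: column 1 holds the record type
# ('H' = household, 'P' = person); all other fields are invented.
SAMPLE = """\
H0001 CA
P0001 34 F
P0001 36 M
H0002 NY
P0002 51 F
"""

def read_hierarchical(lines):
    """Attach each person record to the most recent household record."""
    households = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rectype = line[0]
        fields = line[1:].split()
        if rectype == "H":
            households.append(
                {"id": fields[0], "state": fields[1], "persons": []}
            )
        elif rectype == "P":
            if not households:
                raise ValueError("person record before any household record")
            households[-1]["persons"].append(
                {"age": int(fields[1]), "sex": fields[2]}
            )
        else:
            raise ValueError(f"unknown record type: {rectype!r}")
    return households

households = read_hierarchical(SAMPLE.splitlines())
for h in households:
    # One line per household: id, state, number of persons
    print(h["id"], h["state"], len(h["persons"]))
```

A rectangular file needs none of this bookkeeping, because every row has the same layout and stands on its own.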
Free format means that each data value is separated by some delimiter, usually a space. These are also referred to as space-delimited files. In comma-delimited files the data values are separated by commas, and in tab-delimited files the values are separated by tabs. Raw data stored without delimiters is referred to as fixed format or column-input data. Fixed-format files require more detailed programming (or more involved point-and-click operations) to read them in, because the column position of each variable must be specified.
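The difference between these layouts can be sketched in Python. The example below reads the same two observations from space-delimited, comma-delimited, tab-delimited, and fixed-format text. The data values, the column positions, and the function names are all invented for illustration and do not describe the workshop's sample files.

```python
import csv
import io

# The same two observations (id, name, score) in four layouts.
free_text = "1 Ann 88\n2 Bob 92\n"
comma_text = "1,Ann,88\n2,Bob,92\n"
tab_text = "1\tAnn\t88\n2\tBob\t92\n"
# Assumed fixed layout: id in cols 1-2, name in cols 3-6, score in cols 7-9
fixed_text = "01Ann 088\n02Bob 092\n"

def parse_free(text):
    """Split each line on whitespace."""
    rows = []
    for line in text.splitlines():
        ident, name, score = line.split()
        rows.append((int(ident), name, int(score)))
    return rows

def parse_delimited(text, delimiter):
    """Split each line on a single delimiter character."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [(int(r[0]), r[1], int(r[2])) for r in reader]

def parse_fixed(text):
    """Slice each line at known column positions (0-based slices)."""
    rows = []
    for line in text.splitlines():
        rows.append((int(line[0:2]), line[2:6].strip(), int(line[6:9])))
    return rows

print(parse_free(free_text))
print(parse_delimited(comma_text, ","))
print(parse_delimited(tab_text, "\t"))
print(parse_fixed(fixed_text))
```

Only parse_fixed needs to know where each variable starts and ends, which is the extra detail a fixed-format file demands of the person reading it.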
All statistics packages require that raw data be "read in", or processed, before analysis is possible. System files contain data that have been processed by a statistics or database package into that package's own proprietary format. SPSS creates SPSS Save Files, SAS creates SAS data sets, and so forth. Most statistics software also provides the convenience of transport files, also known as portable files, which are designed to be moved between different computer platforms.
Data may be coded in numeric or character form, which basically means "numbers" or alphabetical "characters". Some data are also coded in one of several binary formats, usually to conserve space. Numeric data is typically the default; all other types of data require the use of format specifications in order to distinguish their values from true numeric values.
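To make the distinction concrete, here is a small Python sketch that stores one number both as characters and as a binary integer, and encodes one character string in both ASCII and EBCDIC. The 4-byte big-endian integer and the cp037 EBCDIC code page are assumptions chosen for the example; real binary and EBCDIC files may use other sizes, byte orders, or code pages.

```python
import struct

value = 1234567

# Character form: one byte per digit
as_text = str(value).encode("ascii")

# Binary form (assumed here: 4-byte big-endian signed integer)
as_binary = struct.pack(">i", value)

print(len(as_text), "bytes as characters")
print(len(as_binary), "bytes as binary")

# Reading the binary form back requires knowing its format
(decoded,) = struct.unpack(">i", as_binary)
print(decoded)

# The same characters have different byte values in ASCII and EBCDIC
name = "ANN"
ascii_bytes = name.encode("ascii")
ebcdic_bytes = name.encode("cp037")   # cp037 is one common EBCDIC code page
print(ascii_bytes.hex(), ebcdic_bytes.hex())
print(ebcdic_bytes.decode("cp037"))
```

The format string passed to struct plays the same role as a format specification in a statistics package: without it, the bytes cannot be interpreted correctly.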
Most of the types of data mentioned above may be stored in compressed form, as a space-saving measure. Usually (but not always) the data must be uncompressed before reading the files into a statistics package.
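Some tools can read compressed data directly. As a minimal sketch, Python's standard gzip module can open a gzip-compressed text file without a separate uncompress step. The file name example.dat.gz and its contents are invented for this example, which writes its own small file to a temporary directory first so that it is self-contained.

```python
import gzip
import os
import tempfile

raw = "1 Ann 88\n2 Bob 92\n"

# Create a small gzip-compressed file to work with
path = os.path.join(tempfile.mkdtemp(), "example.dat.gz")
with gzip.open(path, "wt", encoding="ascii") as f:
    f.write(raw)

# Read it back as text; decompression happens on the fly
with gzip.open(path, "rt", encoding="ascii") as f:
    rows = [line.split() for line in f]

print(rows)
```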