PROBLEM: In reading hierarchical data (like the PUMS Census files which has Household records followed by a varying number of Person records for each household) I'd like to extract certain person records and then create one record per household, with the person-record information retained in the final household record. BASIC DATA: Hierarchical; primary data (e.g., Household) with multiple secondary entries (e.g., Person[s]). OBJECTIVE: One observation per Household with original data preserved from multiple persons (in separate variables). DESCRIPTION: To do this operation, one must understand the structure of the target (final) data set records. They will have the original household information and for each person variable of interest (in our example these are personal income and race), there will be one variable (column) representing each person for each variable of interest. So, if you are extracting two people from each household (in our example these are Wife and Husband) and you are interested in two variables (income and race), you will end up with four variables in your final data set: Wife's income, Husband's income, Wife's race, and Husband's race. EXAMPLE: Consider these data (original PUMS configuration of H and P record types, but *not* necessarily data that actually would be in the PUMS files -- this is a simplified, conceptual example): rectype hhnum hhinc hharea inc sex race famstat H 413 8500 5 P 413 3500 M 1 H P 413 5000 F 2 W P 413 0 M 1 C H 414 5600 3 P 414 5600 F 3 S H 415 4650 8 P 415 4650 F 2 W P 415 0 M 2 H Here we have hhinc (household income), hharea (location of household), inc (personal income), sex (gender of person), race (race of person), famstat (status in family: Wife, Husband, Child, Single). When you run the PUMS program, you should get one observation for each P (person) record, and the household data (in my example HHNUM, HHAREA, HHINC) distributed across all the Person records for each household. In case the data are not PUMS (and there is no canned program to rectangularize the data), this is how the above data would be processed into 'person' records (with household information retained): SAS Example: data first; retain hhinc hharea; drop rectype; input @1 rectype $ @; if rectype='H' then input hhnum 3-5 hhinc 7-13 hharea 15; else if rectype='P' then do; input hhnum 3-5 inc 17-22 sex $ 24 race 26 famstat $ 28; output; end; cards; H 413 8500 5 P 413 3500 M 1 H P 413 5000 F 2 W P 413 0 M 1 C H 414 5600 3 P 414 5600 F 3 S H 415 4650 8 P 415 4650 F 2 W P 415 0 M 2 H ; SPSS Example: input program. data list file='temp1.data' / rectype 1 (a). do if (rectype eq 'H'). + reread. + data list file='temp1.data' / hhnum 3-5 hhinc 7-13 hharea 15. + leave hhnum hhinc hharea. else if (rectype eq 'P'). + reread. + data list file='temp1.data' / hhnum 3-5 inc 17-22 sex 24 (a) race 26 famstat 28 (a). end case. end if. end input program. save outfile='whole.sav'. The data then would look like: hhnum hhinc hharea inc sex race famstat 413 8500 5 3500 M 1 H 413 8500 5 5000 F 2 W 413 8500 5 0 M 1 C 414 5600 3 5600 F 3 S 415 4650 8 4650 F 2 W 415 4650 8 0 M 2 H and then you would be ready to make one line (observation) per household. This could be done a number of ways, but perhaps the most 'obvious' (i.e., the one where it's most easy to see what's going on) is to make two data sets (one for W and one for H, rename all the variables that apply to the person (not the household), and then merge back together BY HHNUM, resulting in one line per household. Here are programs that would do this for the example data above (assuming the data above are in a SAS data set called WHOLE with the variable names indicated). SAS Example: data husb1 wife1; set whole; if famstat='H' then output husb1; else if famstat='W' then output wife1; data husband; set husb1; drop hhinc hharea famstat sex; rename inc=husbinc race=husbrace; data wife; set wife1; drop famstat sex; rename inc=wifeinc race=wiferace; SPSS Example: get file='whole.sav' / rename inc=husbinc race=husbrace. select if (famstat='H'). save outfile='husband.sav' / drop hhinc hharea famstat sex. get file='whole.sav' / rename inc=wifeinc race=wiferace. select if (famstat='W'). save outfile='wife.sav' / drop famstat sex. You would then have two data sets that look like this: HUSBAND: hhnum husbinc husbrace 413 3500 1 415 0 2 WIFE: hhnum hhinc hharea wifeinc wiferace 413 8500 5 5000 2 415 4650 8 4650 2 Note that the household variables (HHINC HHAREA) were dropped from the HUSBAND data base. This is not necessary, as the same values are found in both H and W records; it's just cleaner and more efficient if they are merged in from one dataset only, rather than asking SAS to overlay them (so to speak). Don't worry about this; just remember that in this situation, where the values for the household variables are the same across all members of the household (because of the way the PUMS data were read in the first place), it doesn't matter if they (HHINC and HHAREA) are in both HUSBAND and WIFE datasets, or just in one of them. We also dropped the FAMSTAT and SEX variables, because these are useless in the final data set in this example. The new variables (HUSBxx and WIFExx) imply the sex and family status of the participants. For example, if we created a HUSBSEX variable, all the values would be M, and for WIFESEX, all values would be F, so this would give us unnecessary information. The final step is to merge these two 'halves' back together for your final analysis data set. This is straightforward: SAS Example: data final; merge wife husband; by hhnum; SPSS Example: match files file='wife.sav' / file='husband.sav' / by hhnum. And then, finally, your data will look like this: hhnum hhinc hharea wifeinc wiferace husbinc husbrace 413 8500 5 5000 2 3500 1 415 4650 8 4650 2 0 2