This letter is at least a wake-up call for all of us, and perhaps a warning that we not take point-and-click software at face value. There is no question that a point-and-click interface can be a wonderful thing, and there are even some statistical applications that seem inconceivable without it. The EQS structural equation modeling software, with the ability to build a path analysis in the form of an interactive onscreen picture, is a brilliant improvement over trying to do the same job with a written program. But for some more conventional statistics, there may be a downside.
In this short letter, I discuss both the teaching and the research use of statistics, and I distinguish the analysis operations of statistics software from its data reading and data manipulation functions.
First, the smallest of history lessons. Once, there was pencil and paper, and Pearson and Scheffé used those to generate their contributions and teach their students. Then computers entered the picture, and a whole new generation of statistics was possible. Programs were needed, of course. They were structured on logical concepts and, once in place, removed the statistician (and student) from the actual calculations, however complex or simple. This was a blessing and a curse, though writing a program without errors was still much easier for those who understood the concepts behind the statistics they were using. Now comes point-and-click. Anyone whose data are simple to input, or already prepared for them, can generate wonderful and complex output without knowing anything more than mouse technique. In many cases, the blessing is almost nil, and the curse is potentially huge.
The net result of this progression is that we collectively have given up thorough understanding for convenience. Instead of spending three hours doing calculations by hand, we could write
PROC REG; MODEL GRE = SAT;
or
REGRESSION / VARIABLES = GRE SAT / DEPENDENT = GRE.
and let the computer crunch 5,000 observations in a few seconds. Now the point-and-click interfaces provide variable lists to choose from and boxes to fill in, requiring almost no understanding of why or how the choices are made and, predictably, leading to less and less comprehension of what the output means.
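As a rough illustration of the computation those one-line programs hide, here is a minimal sketch of a simple least-squares regression worked from sums, the way it would be done by hand. It is written in Python purely for illustration. The function name `simple_regression` and the five pairs of scores are made up for this example and do not come from any real data set.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, computed from sums."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of x, and sum of cross-products of deviations
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation; the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical scores, invented only to exercise the arithmetic
sat = [1000, 1100, 1200, 1300, 1400]
gre = [300, 305, 310, 315, 320]

intercept, slope = simple_regression(sat, gre)
print(f"GRE = {intercept:.2f} + {slope:.4f} * SAT")
```

Every line corresponds to a step a student would once have carried out with pencil and paper: two means, two sums of deviations, one division. Seeing the steps makes it easier to say what the slope in the printed output actually measures.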
Teaching statistics to fresh minds has always been a challenge and a necessity. As we've seen, the history has progressed from hand calculation, through programming, to point-and-click. For a moment, consider the implications for a budding statistician of reversing this sequence. For most, their first exposure is to point-and-click because, as many excellent instructors point out, it is the quickest way to get to the concepts in a class setting. Later in a research career the time arrives when they have to use programming techniques, and those who end up contributing to their respective fields will also have to understand statistics from the computational perspective. Pedagogically, beginning with point-and-click software is an acceptable approach for students whose statistical exposure will end with their undergraduate survey classes, but it is potentially disastrous for candidates for serious research work.
Beyond the teaching/learning setting lies serious research, at least for a subset of those we nurture. While learning, most students use small, rectangular, pretty data sets with few if any missing values and all the variables necessary for all calculations. Instruction must start simple, and it does, and under those conditions everything is point-and-clickable, including data input (or reading 'raw' data from external files), data manipulation, analysis, and even graphing. Data in the real world, however, are not nearly as glamorous. Complex data structures such as Census PUMS data or the General Social Survey, or even rectangular data sets that simply have a lot of variables with differing characteristics, present an unbelievably tedious task to the researcher who insists on using point-and-click techniques. Many in such situations begin to realize that, at least for data manipulations, programming syntax provides a much more efficient way of getting the data ready for analysis.
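To make the efficiency argument concrete, the following is a minimal sketch, again in Python and again only illustrative, of one common preparation step: replacing per-variable missing-value codes with a true missing marker. The variable names, the codes, and the helper `recode_missing` are all hypothetical; they are not taken from the PUMS or General Social Survey codebooks.

```python
def recode_missing(rows, missing_codes):
    """Return a copy of rows with each variable's missing codes set to None.

    rows          -- list of dicts, one dict per observation
    missing_codes -- dict mapping a variable name to the set of values
                     that mean "missing" for that variable
    """
    cleaned = []
    for row in rows:
        cleaned.append({
            name: (None if value in missing_codes.get(name, ()) else value)
            for name, value in row.items()
        })
    return cleaned


# Hypothetical observations and codes, invented for this example
rows = [
    {"age": 34, "income": 52000, "educ": 16},
    {"age": 99, "income": -1, "educ": 12},
    {"age": 51, "income": 38000, "educ": 98},
]
missing_codes = {"age": {99}, "income": {-1}, "educ": {98, 99}}

cleaned = recode_missing(rows, missing_codes)
for row in cleaned:
    print(row)
```

With three variables the same job is a few clicks. With several hundred variables, each with its own codes, the mapping in `missing_codes` grows but the program does not, and the recoding can be rerun and checked whenever the data change.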
It is conceivable that today's students learning statistics might feel shortchanged when later they realize that they also must learn programming techniques to manipulate real-world data or even for some analysis operations. It is even possible that such realizations might discourage otherwise interested scholars from careers as researchers and statisticians.
The humble suggestion that all this leads to is this: keep all three levels of statistical knowledge -- computation, programming, point-and-click processing -- in all discussions and propagations of statistical techniques. Inform students of the wonders as well as the hazards of real-world computation, and give them opportunities to gain the skills they will need before their flexibility is lost to point-and-click convenience.
An overview of various types of data used in research is available through the ITS Workshop "Dealing With Data -- An Overview for Researchers", taught each semester. The syllabus is at:
http://www.usc.edu/its/doc/statistics/help/classes/
http://www.usc.edu/its/doc/statistics/help/multiuse/datasamples/