The Unix Way - what it does right (and wrong)

The strength of the Unix approach is often said to be the ability to chain small tools together to perform larger tasks. The ability to pipe one command's output into another command's input was incorporated into the original UNIX shell in the early 1970s. This can be a wonderful ability: your system can have numerous tools, each intended to do a single job and do it well. As a result, each tool is conceptually simple and easy to understand, and describing a moderately complex task on the command line becomes fairly simple and concise:

find . -name \*.hpp | grep foo | sed -e 's/hpp$/cpp/' | xargs rm -f

The idea is quite elegant, and apart from some syntactic limitations it's very flexible as well. Under this sort of command shell, one's total collection of programs becomes part of the interactive programming environment.

However, in practice some subtle problems creep in. First off, the data sent across these pipes is unstructured. Each program in the chain is entirely responsible for interpreting its input in the intended way. This raises a problem, for instance, when the filenames retrieved by "find" contain spaces. If a program in the chain separates its input into individual chunks by looking for whitespace, then it will divide such input in the wrong places. Furthermore, if it reassembles the information for output, it may not reproduce the whitespace exactly as it found it. These problems complicate the process of chaining individual programs together. The problem becomes worse as the data to be processed becomes more complex: if several sequentially piped tools were to perform meaningful operations on an XML file, for instance, then each tool in the chain would need to be able to parse XML. Likewise, other forms of syntactic encoding, such as escape characters, can be problematic depending on whether the filters in the chain interpret them.
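The whitespace problem with "find" is easy to reproduce. The following is only a sketch: it assumes a "find" and "xargs" that support the -print0 and -0 options (GNU and BSD versions do, but they have long been extensions rather than something every system guarantees), and it assumes the scratch directory path returned by "mktemp" contains no whitespace itself. The variable names are my own, chosen for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
touch "$dir/foo bar.hpp" "$dir/foo.hpp"

# Whitespace-delimited: xargs splits "foo bar.hpp" into two arguments,
# so the inner shell sees more arguments than there are files.
naive_count=$(find "$dir" -name '*.hpp' | xargs sh -c 'echo $#' sh)

# NUL-delimited (assumes -print0 / -0 are available): each filename
# arrives as exactly one argument, spaces and all.
safe_count=$(find "$dir" -name '*.hpp' -print0 | xargs -0 sh -c 'echo $#' sh)

echo "naive: $naive_count arguments, NUL-delimited: $safe_count arguments"

# Clean up the scratch directory.
rm -rf "$dir"
```

With two files on disk, the naive pipeline hands three arguments to the final command, while the NUL-delimited one hands it two. Note that the fix works only because both ends of the pipe agree, out of band, on a different delimiter; the pipe itself still carries nothing but bytes.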

It is my belief that the Unix philosophy of using multiple small tools in combination to perform larger tasks is a good one, but that it ought to be updated. The fact that all these tools operate on roughly the same datatype (a stream of bytes, interpreted as text) complicates matters: every operation in the chain must be carefully composed to ensure that the format of its input is handled and preserved correctly.
