"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question." - Charles Babbage
I love the above quote, because it aptly describes the major source of errors in a program, namely data errors. In the previous article on complexity, I discussed the problems associated with large programs. In this article, I will discuss the negative effects of complexity as it relates to data within a program.
First, though, let me give you a real-world example where data complexity was a big problem in a commercial application. A certain company had a program that had been in development for several months and still did not run correctly, even after three different programmers had worked on it. The company called me to take a look at the program and see if I could get it up and running.
One glance at the source code and I could tell where the program was failing. Nearly every variable in the program was a global variable, and the variables were being updated willy-nilly throughout the code, creating data errors and program failure. It took me a month to clean up the global variables and organize the data structures and code in such a way that it could be debugged successfully. It only took a few days to fix the problem areas and get the program running after the code cleanup.
The other programmers had tried to debug the program without data and code organization, but couldn’t because of the complexity introduced by the global variables. By reducing the number of global variables, the complexity was also reduced, allowing me to debug the problem routines and apply the fixes. Overall, the program wasn’t overly complex; it had been made that way through mismanaged data.
The problems of data complexity do not arise from the type and amount of data we are working with in our program. The number 1 is no more complex than the number 100 and the string “Mary had a little lamb” is no more complex than “Jack and Jill ran up the hill.” A database containing a million rows of data is no more complex than a database containing one row of data. The problems of data complexity arise when we misrepresent our data in a program and when we mishandle our manipulation of that data.
To illustrate the first problem, data representation, suppose the user must enter a number between 1 and 9. Let’s say that we use an integer for the input. What happens if the user enters an A? The program will fail. Is it the fault of the user? No, it is our fault because we misrepresented the range of possible values that could be input into our program.
To illustrate the second problem, mishandling data, suppose we use a string instead of an integer to collect the above input, and then convert that string to a number. The user again enters an A, and in a language whose conversion function returns 0 for non-numeric text, the result is 0. When we try to use this number, the program will fail, because 0 is outside our range of acceptable numbers. Again, the problem is ours because we failed to check for the 0 case when we tried to use the data.
These examples are trivial, but the problems are the same in any size program. Both of these errors occur when we don’t fully understand a program’s data space.
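The two failure modes can be sketched in a few lines of Python. This is only an illustration: lenient_to_int and is_valid_choice are hypothetical names, and the lenient conversion is an assumption that stands in for the kind of conversion function that returns 0 for non-numeric text.

```python
def lenient_to_int(text: str) -> int:
    """Hypothetical helper: mimic conversion functions that return 0
    when the text does not start with a digit."""
    digits = ""
    for ch in text.strip():
        if ch not in "0123456789":
            break  # stop at the first non-digit character
        digits += ch
    return int(digits) if digits else 0


def is_valid_choice(number: int) -> bool:
    """The data space from the example: whole numbers 1 through 9."""
    return 1 <= number <= 9
```

Python’s built-in int("A") raises a ValueError, which corresponds to the first problem. lenient_to_int("A") quietly returns 0, and is_valid_choice(0) is False, which corresponds to the second problem if we forget to make that check.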
Data and Input Spaces
You can define data as the “space” in which a program operates. Any data inside the boundaries of the space is valid data and any data outside the boundaries is invalid data. The overall data space of a program is defined by the problem that the program is designed to solve. Within this overall data space are subset data spaces such as those in our input example. We can think of the overall data space of the program as a large box, and inside the box are smaller boxes and boxes within boxes that define the various types of data that we must handle in a program. Each data space has a set of boundaries, and we must remain within those boundaries if the program is to run successfully.
Let’s keep the box analogy in our mind for the moment. A box is used to store something. We put stuff into the box and we take stuff out of the box. If we put a bowling ball in a box containing our nice set of beer steins, the beer steins will be a nice pile of glass at the end of the day. To correctly pack our box, we need to know both what the box is supposed to hold, and what are the possible items that could find their way into the box. We have to make sure we label the boxes correctly, and then watch the packers pack the box to make sure no bowling balls get packed with our beer steins.
For data within a program, we create a variable that is appropriate to the data we want to manipulate. We label the box an integer. However, we don’t want the full range of an integer in the variable; we only want the numbers 1 through 9, which define the data space of the variable. A box simply wastes space unless it is holding something, so we must put something into our variable to make it useful. Here is where we may have a problem. What are the possible values that could find their way into the variable box? What is the input space of the variable? If we use the integer variable for keyboard input, are we going to get a beer stein or a bowling ball? Will we get the numbers we want, 1 through 9, or the letters A through I?
Like the data space, the input space has a set of defining boundaries. For keyboard input, the number and type of keys on the keyboard define the input space. For a text file, it is all the possible values that a text file can hold, both printable and non-printable characters, with the added complexity of formatting rules that need to be taken into account. For a database, it is the layout of a table or the results of a SQL query. Sometimes we can strictly define the input space, but often we can’t. A text file may be formatted in a particular way that we expect to see, but we can’t be sure that the user hasn’t edited the file and changed the format. How we represent the data in our program depends on the input space to that variable.
We could convert the keyboard input to its equivalent ASCII code and store that number in our integer variable, but now we have data that is outside the data space we have defined for our variable. The numbers we are looking for, 1 through 9, lie within the input space of the keyboard. It is just a matter of mapping the input space to the data space of our variable, which means adding additional code to our program. The amount of code we need to facilitate this mapping process determines the order of magnitude of the data.
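As a rough sketch of that mapping, the following Python function takes one keystroke, looks at its character code, and maps it into the data space of 1 through 9. The name key_to_digit is hypothetical, and the sketch assumes the keystroke arrives as a one-character string.

```python
from typing import Optional


def key_to_digit(key: str) -> Optional[int]:
    """Map a single keystroke to the data space 1..9, or None if it
    falls outside that space (hypothetical helper)."""
    if len(key) != 1:
        return None
    code = ord(key)  # character code, e.g. "5" -> 53 and "A" -> 65
    if ord("1") <= code <= ord("9"):
        return code - ord("0")  # 49..57 becomes 1..9
    return None
```

Storing the raw code for "5" would put 53 into the variable, which is outside the data space. The extra subtraction and range test are the additional code that the mapping costs us.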
First Order Data
First order data has a one-to-one relationship between data space and input space. We can use the minimum amount of code and we know that the input data is within the data space of the variable. Most data generated within a program is first order data.
Second Order Data
Second order data requires some conversion since the input space is a subset or superset of our data space. However, since the two data sets are related, the conversion is minor, consisting of a single state transition from one form to another. By state transition, I mean that a string is converted into a number, a number is converted into a string, an integer is cast to a real, or a real is cast to an integer. Second order data is common when a program must collect keyboard input or parse a text file, and in some database systems where you pass a buffer to the record manager and it returns the data in the buffer.
Second order data is more complex, since the possibility exists that the input may be outside of our data space. Our keyboard input example above is a second order data problem. We can use a string to capture the input, but we must test the value to be sure it falls within our range of 1 to 9.
The added complexity comes from the intermediate values we must use to map the input space to the data space. Suppose in our keyboard example we choose to use a string to collect the input. This string must be converted to a number, and that number must be checked to make sure it lies within our data space of 1 to 9. Only when the input data falls within our defined data space do we store the number into our integer variable.
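Those steps might look like the following in Python. This is a minimal sketch under my own assumptions: parse_choice is a hypothetical name, and returning None for rejected input is just one possible convention.

```python
from typing import Optional


def parse_choice(raw: str, low: int = 1, high: int = 9) -> Optional[int]:
    """Map raw string input to the data space low..high.

    Returns the number when it lies inside the data space,
    otherwise None (hypothetical convention for this sketch)."""
    text = raw.strip()        # intermediate value 1: the cleaned string
    try:
        number = int(text)    # intermediate value 2: the converted number
    except ValueError:
        return None           # input such as "A" is not a number at all
    if low <= number <= high:
        return number         # only now is it safe to store
    return None               # a number, but outside the data space
```

The caller stores the result in the integer variable only when it is not None, so values such as "A", "0" and "10" never reach the variable.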
Third Order Data
Third order data is the most troublesome, since there is little to no commonality between the input space and the data space, and a large effort must be employed to map the two spaces. Converting a JPEG to a bitmap is a third order data problem.
Third order data also has the potential to incur data loss. A 24-bit image may look quite good, but when converted to a 16-bit image, the result isn’t always desirable, due to the loss of information incurred in the conversion process. Third order data must be analyzed carefully to see if the result is worth the increased complexity of our program and the potential loss of data.
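The loss can be seen in a small Python sketch that converts a single 24-bit pixel to 16 bits and back. I am assuming the common 5-6-5 layout (5 bits red, 6 bits green, 5 bits blue) for the 16-bit format; other 16-bit layouts exist, and the function names are hypothetical.

```python
def rgb888_to_rgb565(r: int, g: int, b: int) -> int:
    """Pack 8-bit channels into 16 bits by dropping the low bits
    (assumed 5-6-5 layout)."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def rgb565_to_rgb888(pixel: int) -> tuple:
    """Unpack a 5-6-5 pixel back to 8-bit channels. The bits that
    were dropped cannot be recovered and come back as zeros."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    return (r5 << 3, g6 << 2, b5 << 3)


original = (200, 100, 50)
restored = rgb565_to_rgb888(rgb888_to_rgb565(*original))
# restored is (200, 100, 48): the blue channel lost information
```

A round trip through the smaller format does not return the original pixel, which is exactly the kind of loss that has to be weighed before taking on a third order conversion.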
Since second and third order data require mapping the input space to the data space, we must employ defensive programming techniques to ensure that our program doesn’t fail on unexpected input. An example of this is a Null value in a database. In many programming languages, trying to assign a Null to a control or string will result in an error. Our input routine must take into account a possible Null value and handle the situation accordingly, either by checking for a Null value, or by converting the field to the appropriate value using the language’s conversion functions.
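Here is one way the Null check could look in Python, where the standard sqlite3 module returns a database NULL as None. The table, the column names and the text_or_default helper are all made up for illustration.

```python
import sqlite3


def text_or_default(value, default=""):
    """Convert a database field to text, replacing NULL (None in
    Python) with a default value (hypothetical helper)."""
    return default if value is None else str(value)


def load_nicknames():
    """Build a throwaway in-memory table and read it back defensively."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT, nickname TEXT)")
        conn.executemany(
            "INSERT INTO people VALUES (?, ?)",
            [("Ada", "Countess"), ("Charles", None)],  # None is stored as NULL
        )
        rows = conn.execute(
            "SELECT name, nickname FROM people ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    # Every nickname passes through the Null check before it is used.
    return [(name, text_or_default(nick, "(none)")) for name, nick in rows]
```

Calling load_nicknames() gives a list in which the missing nickname has been replaced by "(none)", so the code that displays the value never sees a Null.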
In many cases we can’t predict what the unexpected input may be, and in these cases, we must utilize the language’s error trapping constructs to prevent program failure. By trapping unexpected data from the input space, the program can choose how it will continue, and do so in a controlled and graceful manner. Program crashes are not only ugly and indicative of sloppy or lazy programming, but can also leave the computer in an unstable state that could affect other processes. If the language doesn’t have any error trapping available, then it may be time to evaluate a language that does if the program needs to execute without failure.
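In Python, the error trapping construct is try/except. The sketch below, with the hypothetical name total_of_valid_lines, assumes we are summing numbers read from the lines of a text file and want to keep going when a line is not what we expected.

```python
def total_of_valid_lines(lines):
    """Sum the lines that hold whole numbers and report the line
    numbers of the ones that do not, instead of crashing."""
    total = 0
    rejected = []
    for line_number, line in enumerate(lines, start=1):
        try:
            total += int(line)            # may raise ValueError
        except ValueError:
            rejected.append(line_number)  # trap it and carry on
    return total, rejected
```

For the input ["3", "x", " 4 ", ""] this returns a total of 7 and reports lines 2 and 4 as rejected. The program decides what to do with the rejected lines; the bad data no longer decides for it.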
Pitfalls of First Order Data
It may seem that first order data is problem free, but in fact a whole host of problems can arise from misrepresenting first order data within a program. The two main areas of misrepresentation are scope and data type. Overuse of global variables within a program is a scope problem and causes unwanted side effects and errors. Data type problems usually involve using the wrong data type or an unnecessarily complicated data type. Using a byte when an integer may be needed, or using a pointer value when a simple variable would work equally well, are examples of misrepresenting the data type.
Scope refers to the accessibility of a variable within a program. Most structured programming languages have at least two levels of scope, global and local. A variable that has global scope can be accessed anywhere within a program. A variable that has local scope is only accessible within the block where it has been defined. The general rule of thumb is to use the lowest level of scope possible for a variable to minimize errors. Unfortunately, there is no single definition of scope across all programming languages. Each language implements scope according to the needs of the language, and there can exist various levels between global and local scope. For example, an object may contain Public variables that can be accessed outside the object, and Private variables that are hidden within the object and inaccessible from outside it.
Learning the scope rules of the language should go hand-in-hand with learning the language’s lexicon in order to properly represent data within the program. Scope errors are some of the most subtle and hard-to-find errors that can occur, but by using scope properly, these errors can be kept to a minimum, and when they occur, can be localized to a small block of code.
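The difference is easy to see in a small Python sketch. All of the names here are hypothetical; the point is only that the global version can be changed from anywhere, while the local version cannot.

```python
total = 0  # global scope: every function in the module can change it


def add_to_global_total(amount):
    global total
    total += amount


def reset_report():
    global total
    total = 0  # side effect: silently discards what other code added


def read_global_total():
    return total


def sum_amounts(amounts):
    subtotal = 0  # local scope: exists only inside this call
    for amount in amounts:
        subtotal += amount
    return subtotal
```

If some distant routine calls reset_report() between two calls to add_to_global_total(), the global total is wrong and the bug could be anywhere in the program. With sum_amounts(), the only code that can touch subtotal is the handful of lines inside the function.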
Data type errors generally arise from simply not understanding the data types available in a programming language. Does an integer represent a number from -32,768 to 32,767, or -2,147,483,648 to 2,147,483,647? Knowing the available types, and the ranges of those types, will eliminate most data type errors.
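The two ranges quoted above are those of 16-bit and 32-bit signed integers. Python’s own integers are not limited this way, so the sketch below only computes the limits that a fixed-width type would have, assuming the usual two’s complement representation. The function names are hypothetical.

```python
def signed_range(bits: int) -> tuple:
    """Smallest and largest value of a two's complement integer
    that is the given number of bits wide."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(value: int, bits: int) -> bool:
    """True when value can be stored in a signed integer of that width."""
    low, high = signed_range(bits)
    return low <= value <= high
```

signed_range(16) gives (-32768, 32767) and signed_range(32) gives (-2147483648, 2147483647). A value such as 40,000 fits the second range but not the first, which is the kind of mismatch that knowing the language’s types lets us avoid.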
However, there is a data type error problem that I like to refer to as the gee-whiz data type. This problem arises from using a complicated data type when a simple data type can be used to get the same result. For example, I have seen complicated, compound pointer types that were used simply because the programmer discovered how to use pointers and thought that the gee-whiz factor of using pointers elevated his or her stature as a programmer. In reality, all it did was complicate the program, make it harder to debug, and open a door to potentially crippling errors. It is an axiom of programming that simple is better, and using the simplest data type possible reduces the complexity of a program and reduces the chance of complexity-related errors.
Good Data, Better Program
Programs are created to solve a particular problem, and that problem is defined by its data space. A good program solves the problem at hand by correctly describing and manipulating the data space of the problem. Good data doesn’t necessarily mean a bug-free and workable solution, but good data does mean that we have adequately described the problem, and that is one less thing we need to worry about as we build our program solutions.