præclarum

Search:

Frank A. Krueger
Seattle, WA, United States

Recent Posts

Knuth’s Solution to the Permutation Problem in C#
Simplifying the Interview: Solving Permutation Pro...
Using SQLite as an On-disk File Format, Part 2
Using SQLite as an On-disk File Format, Part 1
Travel Guide to India
Time Delays in IO

Patterns of Integer Usage

Sunday, April 30, 2006 Link

Can stronger type systems help us write better code? I think so. As an example, I consider three patterns of integer usage and how stronger or more explicit type systems can help us avoid bugs when dealing with those integers.

There are three general uses of integers in programming languages: in the majority of cases they are used for indexing into arrays and other containers; alternatively, they are used for numerical computations; lastly, we use them as a poor-man’s bitfield or other general-purpose memory container. If we wanted to be so-last-year we could say that there are three patterns to the usage of integers.

Sometimes these usage patterns are given representations in the type systems of our programming languages. In C++ for example the type size_t (or its cousins vector<>::size_type, string::size_type, etc.) is used to index into things while we typically use a plain int for numerical computations and the wonders of stdint.h for bitpacking. Even though all of these are just aliases for C’s only notion of integers, these conventions at least force us as programmers to think through our own appropriate uses of them.

This representation of usage patterns as types helps the programmer avoid errors. Later on in this article though, we will see how a stronger type system can help us even more.

In his presentation The Next Mainstream Programming Language: A Game Developer's Perspective, Tim Sweeney notes that 90% of the integers in Unreal are used to index into arrays. The remaining 10% are then split between bitcoding and numeric computations. I think most of us will agree that his numbers are somewhat typical of our own C/C++ programs.

However, this is not always the case. Sometimes we use integers as a means of indirection and therefore have to code.

In this article I want to look for possible refinements in our use of these three kinds of integers. This article is a summary and in some cases an expansion of the ideas presented by Tim Sweeney in his presentation (see the previous link), Reg Braithwaite’s I'll take Static Typing for $800, Alex, and on Wirth’s Algorithms and Data Structures.

Integers are array indices

Some languages are strict in their uses of integers. Languages like Modula-2 and Ada give you very fine-grained control of your integers and subtypes thereof. In these languages emphasis is placed on static (compile time) checking. Other languages are looser and depend solely on runtime bounds checks (Lisp, C#).

I am a big fan of static type checking and static error analysis. This preference comes from my roots as an embedded systems developer. When writing these types of systems, the developer must be able to (1) describe every possible way the system can fail and (2) describe how the system should recover or fail (in the large) given these failures (in the small).

Imagine a language in which you can only index into arrays with an integer that is statically guaranteed to not generate out of bounds exceptions. If the size of the array is static, then it is trivial for a compiler to ensure this. However, things aren’t quite as simple when array sizes are dynamic. In that case we must rely upon runtime bounds checks.

Let us therefore consider two ways that these bounds can be made: one that relies upon the programmer being super anal-retentive (professional) and others that force the compiler to do that boring work for us.

Consider the prototype and comment for the function lubksb as given in Numerical Recipes in C++:

#include <vector>

typedef std::vector<double> Vec_DP;
typedef std::vector<double> Vec_INT;
typedef std::vector<std::vector<double> > Mat_DP;
typedef Mat_DP Mat_I_DP;
typedef Vec_DP Vec_IO_DP;
typedef Vec_INT Vec_I_INT;

//
// Solves the set of n linear equations A*X=B.
// Here a[0..n-1][0..n-1] is input, not as the matrix A but rather as
// its LU decomposition, determined by the routine ludcmp.
// indx[0..n-1] is input as the permutation vector returned by ludcmp.
// b[0..n-1] is input as the right-hand side vector B, and returns the
// solution vector X...
//
void lubksp(Mat_I_DP &a, Vec_I_INT &indx, Vec_IO_DP &b);

That’s a lot of type definitions and a lot of explanation (7 lines of code and 6 lines of very important comment) for a function, and a lot for a programmer who uses this function to keep in mind.

If you think that is bad, consider all the code that the guy who writes lubksp needs to write to ensure that all these conditions are met.

The problem lies in the fact that C++ is not very descriptive of the necessary conditions for a function to succeed. Let’s consider for a moment a hypothetical language that is more descriptive. One that utilizes algebraic types in order to convey these preconditions. In this hypothetical language, the function’s prototype could be boiled down to:

func lubksp{n: nat}(
in a: [n, n]float,
in indx: [n]nat < n,
inout b: [n]float
) : void;

In this notation or language:

n is any natural number

a is a matrix with dimensions n by n

indx is an array of n natural numbers that are all less than n

b is an array of floats with dimension n

When this function is run with unknown inputs, it is equivalent to the following C# function:

static void LU_Back_Substitute(float[,] a, int[] indx, float[] b)
{
if (a == null) throw new ArgumentNullException("a");
if (a.GetLength(0) != a.GetLength(1)) throw new Sys;
int n = a.GetLength(0);

if (indx == null) throw new ArgumentNullException("indx");
if (indx.Length != n) throw new ArgumentOutOfRangeException("indx");
for (int i = 0; i < n; i++) {
if (indx[i] >= n) throw new ArgumentOutOfRangeException("indx");
}

if (b == null) throw new ArgumentNullException("b");
if (b.Length != n) throw new ArgumentOutOfRangeException("b");
}

In Eiffel these argument assertions are called the preconditions of the function. They are written in a much nicer way than is available in C# because Eiffel supports the notion in its syntax. However, not even the Eiffel version is as succinct as the hypothetical language's version.

Now let’s consider one of the overloads of the Substring method of the String class in the .NET library. Its prototype is simply:

class String {
public string Substring(int startIndex, int length);
}

We all know that there is a very fine relationship between startIndex, length and the actual length of the string being operated upon. That relationship, however, cannot be expressed in C++, C#, VB, Delphi, Python, or any other mainstream languages. However, the hypothetical programming language can express it completely. Consider this prototype:

func Substring(
in str: String,
in startIndex: nat < str.Length,
in length: nat <= (str.Length - startIndex)
) : String;

This prototype spells out very clearly which values of startIndex and length are valid for every invocation of Substring.

The str parameter is required to be a valid String—no null values are allowed.

The startIndex must be a natural number that is less than the length of the string.

The length parameter depends on the first two parameters.

Now, all this parameter checking is actually performed by .NET’s implementation of Substring, but those checks do not occur until runtime. In the hypothetical language however, those checks occur at compile time.

To demonstrate this, consider a really boring application that allows a user to extract a substring of a string. The user will be asked to enter the string, the startIndex, and the length of the substring. Also imagine that our input validation code is good enough to verify that the startIndex and length are integers, but nothing more. Something like this:

func ProcessInput : (void -> string)
{
obj str := ReadString() : string;
obj startIndex := ReadInteger() : int;
obj length := ReadInteger() : int;

// Somehow call and return the value of Substring
}

(The syntax of this language borrows from Damian Conway’s SPECS.)

ProcessInput is a function that takes no arguments (void) and returns a string. The string to be returned must come from a call to Substring, but there is a problem. Our local startIndex and length are not type-compatible with the parameters of Substring!

To make them compatible, we need to cast them to the appropriate subtypes. Tim Sweeney proposes one syntax for such casts and I highly recommend you go through his PowerPoint slides. He proposes a casting operator that can either succeed at performing the cast and will execute a body of code, or will execute an alternative body of code. It’s essentially an if-else construct dedicated to casting. Continuing our example, the remainder of ProcessInput using such a cast would be:

    if cast (obj safeIndex := startIndex : nat < str.Length) {
if cast (obj safeLen := length : nat <= (str.Length - safeIndex)) {
Substring(str, safeIndex, safeLen);
}
else {
ReportErrorToUser("You specfied too many characters ");
}
}
else {
ReportErrorToUser("Your starting index is too large");
}

Writing this code is similar to writing the preconditions of a function. The important difference here is that the person writing the code has to immediately consider the repercussions of the failure. No precondition failures can occur in the call to Substring and therefore there can be no spurious runtime failures.

Again, going back to my roots in embedded systems, I prefer error paths in the code to be explicit and in your face. That way, we always can see what the software is going to do and do not have to look around at all call sites to figure out whether errors are being properly handled. This is the paranoia of systems programmers—we need to know everything.

If you are concerned about performance, consider this. If the entire program was written out using explicit and exact types, then the compiler can determine if the types of the values being provided in a function call are subtypes of the parameters of the function, then that function does not need to perform the parameter validation.

In reality, the implementation of this might be best if functions do not validate their parameters. Instead, the calling code must ensure that the values passed to the function are valid. In that way only one (fast) form of functions need exist and the validation code can be customized one invocation site at a time.

Calculating the set of assertions that match the declarative types (as given in the hypothetical language) is not a particularly easy thing to calculate. A future article will attempt to figure out a method to extract those assertions.

Integers for numeric computations

Languages like C do all integer math without a care in the world as to the range of integers it is actually able to represent. Cl version 14 (the Microsoft C/C++ Compiler) is more than happy to compile this program:

#include <iostream>
int main() {
std::cout << 2000000000 * 2000000000 << std::endl;
return 0;
}

that lazily outputs this silly answer:

-1651507200

Languages that take mathematics a little more seriously, such as Scheme, let us know what that value really is:

>(* 2000000000 2000000000)
4000000000000000000

The reason Scheme is able to compute that large number and C cannot (without implementing a library) is because Scheme allows integers to be of any magnitude so long as it is able to dynamically allocate enough memory to hold that value. C, sticking close to its down-to-the-hardware mantra, only performs arithmetic on integers whose values can be represented in the machine’s own word size and refuses to perform dynamic allocation for integers.

Because of this, C and C++ programmers have found the need to use libraries such as SafeInt by David LeBlanc or Michael Howard’s IntSafe.

(But we shouldn’t bash C. It wasn’t designed to be used like this. It’s only an odd reality that we as programmers have asked C to do so much more than what it was designed to do. Instead of pestering C any further we should take the logical step and use more powerful and applicable languages.)

Other languages know that they should take math more seriously, but simply refuse to even try. When attempting to compile the static expression 2000000000 * 2000000000, csc version 8 complains:

error CS0220: The operation overflows at compile time in checked mode

Checked code in C# will tell you it can’t do the math but will wait until runtime to complain. The following program:

class B {
public static int Main(string[] args) {
int v = int.Parse(args[0]);
System.Console.WriteLine(checked (v * v));
return 0;
}
}

throws “System.OverflowException: Arithmetic operation resulted in an overflow” when executed with a parameter of 2000000000. However, if one forgets to use the checked syntax, then the compiler is just as happy as C++ to give -1651507200 as the answer. Needless to say, if you are writing a C# application, you should turn on default arithmetic checks! (Don’t worry, you can just as easily disable them in areas that you prove are safe.)

So why do old languages such as Lisp and Scheme support large integers, but modern languages like C# and Java do not? Well, I can guess at two reasons the designers would give. First, dynamically allocating integers and storing their values on the heap is certainly less efficient than using integers that fit nicely into the machine’s registers. Second, they would probably argue that a programmer should simply use a custom (dynamic) integer class in a library.

In fact, .NET “ships” with a BigInteger class tucked away in vjslib. The reason “ships” is here in quotes is because vjslib isn’t a part of the default .NET 2.0 redistributable, it is a part of the Visual J# redistributable (a separate download). Furthermore, vjslib is incompatible with 64-bit (or non-32-bit) assemblies. Bad news for all the high-end computers out there. Also, you will have to abandon infix notation and stick with prefix calls:

class C {
public static int Main(string[] args) {
java.math.BigInteger value = new java.math.BigInteger("2000000000");
System.Console.WriteLine(value.multiply(value));
return 0;
}

}

If you can handle those restrictions though, BigInteger does the job.

There are of course other libraries. For example, F#’s library comes with a dynamic integer classes. And let’s not forget nice languages like Common Lisp, Scheme, and Erlang where all integers, by default, can have as large a magnitude as the computer’s memory can provide.

Integer as bitfields and other containers

We sometimes use integers in the most obscene ways because of limitations in the C programming language have persisted through time. Some of the most popular abuses include: integers as bitfields (sets of flags), integers as ports (in embedded hardware), and integers as pointers to objects and functions (see the Win32 API for numerous examples).

In fact, it is this use of integers that makes porting of programs from one system to another (32-bit to 64-bit, Intel to Motorola, etc.) so difficult. If the limitations of the C language had not forced us to write such unseemly code, then porting efforts wouldn’t be such, well, efforts.

So what are the alternatives to C? Well C# got flags (sets of Boolean values) partially right by providing the Flags attribute that can be applied to enumerations. I say partially because the enumeration type system is still weak (to allow compatibility with legacy code written in C) and can be overcome. A better solution is presented by the languages Modula-2 and Oberon. They implement sets as first-class types.

What about integers as pointers to objects and functions? Well this is typically a result of not having strong support for discriminating unions or polymorphism.

And what about as general-purpose binary data? Most language designers are content with forcing the user to regard an array of bytes as a general stream of binary data. This seems fin from the start except when you start doing a lot of communications code. In this code, we constantly need to convert types from their in-memory representation to some streaming representation. .NET has a library to assist this, but it does not operate at the language level. Erlang is the only language that I know of that decided to separate byte[] from the rest of the arrays and lists and that provides conversion routines with their own syntax.

I hope that you can see that this use of integers has largely been superseded with more powerful and safe constructs. So use the constructs! Don’t go packing your bits into an integer because everyone else is or was doing the same!

Thoughts on higher-level looping constructs

As we use more and more higher-level constructs, such as recursion, iterators, and enumerators for loops, in our day to day programming, our dependence on array indexing decreases and a whole class of bugs gets eliminated in our software. Eliminating a whole class of bugs is always a good thing.

However, in imperative languages where we sometimes modify the size of an array as we are iterating over it, using integer indexing is still necessary. In those cases it is still important to consider the ideas raised in this article.

If we can use higher-level constructs to eliminate 90% of our bugs, then let’s use other constructs to eliminate the remaining 10%.

Conclusion

As programmers it is our job to always think through how we are using the type systems of our languages and whether we are using it to assist us to the fullest degree. Even if we are using weakly typed languages such as C, we can still avoid whole classes of bugs simply by keeping in mind patterns of usage and how stronger languages deal with those patterns.

Reader Comments

Post a Comment