Advanced String Techniques in C++ - Part II: A Complete String Class by Fredrik Andersson (08 September 2000)
Introduction
This second and final string tutorial focuses on the theory and
implementation of string classes. You'll learn how to go beyond strlen() and its
fellow functions and handle strings in a way that will probably feel a lot more
intuitive to most C++ programmers out there. I'll be using both ASCII and
Unicode related code in this issue, and understanding my previous tutorial will
be very helpful.
Strings Of The Old School
Strings in C and C++ are not intrinsic types like int or float, but rather
arrays of characters terminated by a null character ('\0'). String manipulation
is conducted by manipulating the character sequence directly; for instance,
strlen() counts the characters in an array until it finds the terminating null.
Following that same principle, a function like strcat() locates the end of a
sequence of characters and copies another sequence of characters onto the end
of the first sequence, thus concatenating (merging) two strings.

Such character manipulation is efficient and fairly straightforward, but it extends poorly into the world of object-oriented programming. In a world where almost everything is represented by classes, dealing directly with the nuts and bolts making up such fundamental entities as strings just feels wrong.

As an analogy, consider a vector class. Almost everyone using C++ for 3D graphics has access to a vector class, enabling them to write, for instance, the following to compute the sum of two vectors:
instead of:
As strings are just as fundamental as 3D vectors in today's engines, wouldn't it be nice to be able to apply the same techniques to them? Isn't string concatenation operation #1 more intuitive and easier to follow than #2?

They both perform the same task: setting a string equal to "123" and then appending "456" to the end of it, so that it contains the string "123456". Operation #1 is performed using a string class, whereas #2 is performed using regular C string functions. If you prefer technique #2 (or if you're a C programmer), you can stop reading now, as the rest of this tutorial describes how to implement #1.
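As a rough sketch of the two styles being compared (these are not the article's own listings), the code below puts an overloaded operator next to a C-style function for vectors, and a string class next to the C string functions for the "123" + "456" case. Vec3, VectorAdd(), ConcatWithClass() and ConcatWithCFunctions() are hypothetical names made up for illustration, and std::string stands in for the string class described later.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical 3D vector type, for illustration only.
struct Vec3 { float x, y, z; };

// Class style: an overloaded operator lets callers write c = a + b.
inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// C style: a free function writes the result through a pointer.
inline void VectorAdd(Vec3* out, const Vec3* a, const Vec3* b) {
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
}

// Operation #1: a string class (std::string used as a stand-in).
inline std::string ConcatWithClass() {
    std::string str = "123";
    str += "456";
    return str;
}

// Operation #2: C string functions. The caller owns the buffer and must
// make it large enough: six characters plus the terminating null.
inline void ConcatWithCFunctions(char* buffer, std::size_t size) {
    if (size < 7) {                 // refuse to overflow the buffer
        if (size > 0) buffer[0] = '\0';
        return;
    }
    std::strcpy(buffer, "123");
    std::strcat(buffer, "456");
}
```

Note that only the C-style version forces the caller to think about buffer sizes; the class version hides that bookkeeping.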
String Classes
The type of string class that I'll describe here is a C++ class that
encapsulates the data and functions necessary to represent a character string
and some common operations that can be applied to it. Class libraries from different vendors often come with string classes as well as a multitude of other useful classes. As an example, the STL contains a very competent string class. But since we're game programmers, we like to code things ourselves as it gives us full control and understanding of the code, am I right?

My string class is built around a regular C string, but since the C string itself is declared as protected, code outside the class can practically never touch it directly. Instead, we accomplish what we want by using the class's functions and operators, in the true spirit of C++.

The class works with both Unicode and ASCII strings. By #defining _UNICODE before including the class's header file, the class is set to operate on Unicode strings. Otherwise, it uses ASCII strings. Remember that you must also include tchar.h before the class's header file, as the class relies on the _tcs function set (described in the previous tutorial) to transparently handle both Unicode and ASCII strings.

The class is downloadable via the link below. It might be a good idea to have it available as I briefly describe its member functions. The actual implementation and customization of the class, or any other type of string class that better fits your specific needs, is left as an exercise to you.
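The compile-time character-set switch described above can be pictured roughly as follows. This is a portable imitation of the idea, not the contents of tchar.h or of the class header: the names tchar, tstrlen() and tstrcmp(), and this local definition of T(), are assumptions made for the sketch.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical, portable imitation of a tchar.h-style switch.
// Defining _UNICODE before this point selects wide characters.
#ifdef _UNICODE
typedef wchar_t tchar;
#define T(x) L##x   // turns "abc" into L"abc"
inline std::size_t tstrlen(const tchar* s) { return std::wcslen(s); }
inline int tstrcmp(const tchar* a, const tchar* b) { return std::wcscmp(a, b); }
#else
typedef char tchar;
#define T(x) x      // leaves "abc" unchanged
inline std::size_t tstrlen(const tchar* s) { return std::strlen(s); }
inline int tstrcmp(const tchar* a, const tchar* b) { return std::strcmp(a, b); }
#endif
```

Code written against tchar, T() and the t-prefixed functions compiles unchanged in either mode, which is what lets a single class definition serve both character sets.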
A Look At The Inside
Three member variables are defined in the class. The first one, Text, is used to
hold the actual characters of the string. It's dynamically reallocated (using
new and delete, or malloc() and free() if you prefer) by the two protected
functions AllocStr() and FreeStr(). All memory allocations taking place inside
the string class use these functions, making it easy for you to alter the way
memory is handled if you're, for instance, using some custom memory manager. The
other two member variables are integers holding the size of the memory block
currently allocated for the string (Size), and the number of characters in the
string (Len).

Quite a few constructors are defined: a regular constructor that empties the string, a copy constructor, and constructors that take regular C strings in both ASCII and Unicode format. It is, for instance, possible (with the Unicode version of the string class) to fill a class instance with the Unicode equivalent of an ASCII sequence of characters, and (of course) vice versa. There's also a set of assignment operators that matches the constructors, all according to what I believe is good C++ practice.

It is possible to get a pointer to the actual character string through the accessor GetString() or the * operator, should it prove necessary.

Some interesting functions are Compare(), which compares the string to another string and returns -1, 0 or 1 (regular strcmp() return values) to indicate the result; Find(), which locates a substring inside a string and returns its position; the two versions of Insert(), one of which inserts a character at a given location and the other a string; Delete(), which removes a substring from the string; and GetSubString(), which returns a part of the string.

VarArg() is used to emulate sprintf()-style behavior. It's for instance possible to write:
Notice the use of GetString() to get the address of the string for the VarArg() function above. You can't just pass the string object, as C's variable argument system isn't capable of deriving a character string from it. You must explicitly send the string using the GetString() function. Also note the T() macro, which does the exact same thing as the _TEXT macro discussed in the first part of this tutorial, but saves some typing.

EatLeadingWhitespace() and EatTrailingWhitespace() are used to remove whitespace (spaces and tabs) from the beginning and end of a string.

ToAnsi() and ToUnicode() are used to retrieve regular C character strings in one of the specific character sets. These are useful, for instance, when calling Win95/98 API functions from within a Unicode program - it's easier to call ToAnsi() on a string object to get a Windows-compatible character string than to use WideCharToMultiByte() each and every time you wish to do such a conversion.

The [] operator returns a reference to the character at a specific index in the character array. Since it's a reference, the following code is perfectly legal:
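A minimal sketch of a reference-returning [] operator, assuming a hypothetical MiniString that keeps its characters in a std::string to stay short (the article's class manages its own buffer instead). FlipcodeDemo() is a made-up helper, and the body of IsValidIndex() is a guess at its behavior.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in; std::string storage is an assumption for brevity.
class MiniString {
public:
    explicit MiniString(const char* s) : Text(s) {}

    // True if index refers to a character inside the string.
    bool IsValidIndex(int index) const {
        return index >= 0 && index < static_cast<int>(Text.size());
    }

    // Returning char& (not char) is what makes "str[i] = c" compile.
    char& operator[](int index) {
        assert(IsValidIndex(index));
        return Text[static_cast<std::string::size_type>(index)];
    }

    const char* GetString() const { return Text.c_str(); }

protected:
    std::string Text;
};

// Writes through the reference returned by operator[].
inline std::string FlipcodeDemo() {
    MiniString str("flipcode");
    str[0] = 'F';
    return str.GetString();
}
```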
... and will make str contain the string "Flipcode".

IsValidIndex() can be used to determine if a character index is valid for a specific string.

The operators + and += are overloaded for you to be able to do concatenations quickly and easily. The + operator is defined as a friend function of the class, thus enabling you to write complex concatenation operations such as:
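To illustrate why a friend (non-member) + operator helps, here is a minimal sketch with a hypothetical MiniString backed by std::string; the converting constructor and the operator bodies are assumptions, not the code from CString.h.

```cpp
#include <string>

// Hypothetical stand-in; std::string storage is an assumption for brevity.
class MiniString {
public:
    MiniString() {}
    MiniString(const char* s) : Text(s) {}  // non-explicit, so literals convert

    MiniString& operator+=(const MiniString& rhs) {
        Text += rhs.Text;
        return *this;
    }

    // As a non-member friend, + accepts a C string on either side,
    // because both operands may be converted to MiniString.
    friend MiniString operator+(const MiniString& lhs, const MiniString& rhs) {
        MiniString result(lhs);
        result += rhs;
        return result;
    }

    const char* GetString() const { return Text.c_str(); }

protected:
    std::string Text;
};
```

With this in place, an expression such as `"abc" + a + "def"` compiles when a is a MiniString. Had + been a member function, the leading "abc" could not appear on the left-hand side.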
Finally, I've overloaded all of the comparison operators to call Compare() appropriately, to make it possible to compare strings using this syntax:
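A minimal sketch of comparison operators that forward to Compare(), again using a hypothetical MiniString backed by std::string. The clamping of the strcmp() result to -1, 0 or 1 follows the description of Compare() given earlier; everything else is an assumption.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in; std::string storage is an assumption for brevity.
class MiniString {
public:
    MiniString(const char* s) : Text(s) {}

    // Returns -1, 0 or 1, like a normalized strcmp().
    int Compare(const MiniString& other) const {
        int r = std::strcmp(Text.c_str(), other.Text.c_str());
        return (r > 0) - (r < 0);
    }

    // Every comparison operator is a one-line call to Compare().
    friend bool operator==(const MiniString& a, const MiniString& b) { return a.Compare(b) == 0; }
    friend bool operator!=(const MiniString& a, const MiniString& b) { return a.Compare(b) != 0; }
    friend bool operator<(const MiniString& a, const MiniString& b)  { return a.Compare(b) < 0; }
    friend bool operator>(const MiniString& a, const MiniString& b)  { return a.Compare(b) > 0; }
    friend bool operator<=(const MiniString& a, const MiniString& b) { return a.Compare(b) <= 0; }
    friend bool operator>=(const MiniString& a, const MiniString& b) { return a.Compare(b) >= 0; }

protected:
    std::string Text;
};
```

Client code can then write `if (a < b)` or `if (a == "apple")` instead of testing the return value of a comparison function by hand.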
What About Templates?
For the uninitiated, templates are C++'s way of achieving data type independence.
Many programmers would probably implement a string class as a template class and
thus make the character type selectable between char and wchar_t.
This works perfectly well, but for a number of reasons my string class is not a
template class:
A Few Things to Keep In Mind
The string class is hardly as efficient as it could have been, were it not an
educational piece of code. Many operations are performed per character, whereas
memcpy() or memmove() operations could be faster. Another inefficiency lies in
the fact that the character string is reallocated for every character
inserted into or removed from it. A better approach would be to let AllocStr()
allocate more memory than is actually needed. By leaving such memory vacant for
future operations, future allocations can be avoided.

Another memory optimization comes to mind. malloc() and free() are not as fast as we'd like them to be; using a global pool-based allocator for strings would probably be faster. But I won't spoil you with such luxury; all such optimizations are left as exercises for the reader.
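The over-allocation idea can be sketched as follows. GrowingBuffer, Append() and the allocation counter are hypothetical names introduced for this example, and doubling the block size is one common growth policy chosen here as an assumption; the article does not prescribe one.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical buffer that over-allocates so that most appends
// do not need a new memory block.
class GrowingBuffer {
public:
    GrowingBuffer() : Text(new char[1]), Size(1), Len(0), Allocations(1) {
        Text[0] = '\0';
    }
    ~GrowingBuffer() { delete[] Text; }
    GrowingBuffer(const GrowingBuffer&) = delete;
    GrowingBuffer& operator=(const GrowingBuffer&) = delete;

    void Append(char c) {
        // Room is needed for the new character and the terminating null.
        if (Len + 2 > Size) {
            AllocStr((Len + 2) * 2);  // ask for twice what is needed right now
        }
        Text[Len++] = c;
        Text[Len] = '\0';
    }

    const char* GetString() const { return Text; }
    int GetLength() const { return Len; }
    int GetAllocationCount() const { return Allocations; }

protected:
    // Moves the string into a larger block and counts the allocation.
    void AllocStr(int newSize) {
        char* block = new char[newSize];
        std::memcpy(block, Text, static_cast<std::size_t>(Len) + 1);
        delete[] Text;
        Text = block;
        Size = newSize;
        ++Allocations;
    }

    char* Text;
    int Size;         // bytes allocated
    int Len;          // characters stored, excluding the null
    int Allocations;  // how many blocks have been allocated so far
};
```

Because the block size roughly doubles each time, the number of allocations grows with the logarithm of the string length rather than with the length itself. The same copy in AllocStr() also shows memcpy() replacing a per-character loop.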
Downloads
You may download the string class source code (CString.h) here: article_advstrings_cstring.h
Closing
I bet you're fed up with strings now.

Fredrik Andersson (f01fan@efd.lth.se)
Lead Programmer, Herring Interactive