Site hosted by Angelfire.com: Build your free website today!

Why String Operations are so Slow

This is a draft article only

Introduction

How Strings Are Represented

  1. In VB and VBScript the design focus, when considering strings and string operations, was to make them easy to use at the expense of performance. Both languages make use of the BSTR and OLE_VARIANT definitions, which are provided by the Windows and NT operating systems, to implement strings. These definitions are most suited to storing strings whose length (and to a lesser extent, content) never change.
  2. Each string is stored as a contiguous block of memory, just big enough to hold the content of the string (plus a small area set aside to say how long the string is). Because of this, any operation that changes the length of a string requires that a new block of memory be allocated, and the new content of the string be copied into it. For operations that add or delete only a few characters, this approach is reasonably efficient. But the problem is that if you are carrying out many consecutive operations that change the length of a string, the operating system must make copies of the string over and over again.
  3. For storing very large strings (over several hundred K), another factor comes into play. Because each string is stored in a contiguous block of memory, string operations at this scale require vast amounts of memory. When you are allocating several megabytes of storage at a time (say, for example, when you add one character to a 2Mb string, when you must allocate 2Mb+1 characters for the result), the operating system may have to save the contents of some of the computer's physical memory out to disk to make room. This slows down string operations even further.

The Unicode BSTR definition

Unicode BSTR
SizeContent
4Length (in bytes)
-->
LengthUnicode Content
In VB and VBScript, strings are stored in unicode rather than ASCII format. Unicode strings store each character of the string using sixteen bits (two bytes) of memory. For this reason, operations on unicode strings are, in general, a little less than twice as slow as operations on ASCII strings.

The arrow drawn to the left of the diagram indicates where the pointer to the string actually aims. It points at the first character of the string content, not at the 4-byte length field.

Variants containing strings

Variant
SizeContent
4varType
4Reserved
4Pointer
4Unused
In VBScript and VB variants containing strings require additional memory for an OLE_VARIANT structure (which is shown at left). In VBScript and ASP, all strings are stored as variants. The extra overhead of looking up the string itself via the OLE_VARIANT structure would be a worry, if the string operations themselves were efficient, but because they are so slow, the overhead is not important.

Common Operations

  1. Concatentation - the & and + operators

    Concatenation is very slow, because it requires reallocation. [Todo: outline the steps involved. I could "steal" back the 6-step list in an MSKB article on this topic, since they ripped off one of my posts to the fastcode newsgroup...] The time taken to concatenate N strings of the same length, one at a time, will in general be proportional to N*N.
    In point of fact timings are even worse than this for very long strings. As the overall string length climbs above about 200K, some or all of the memory used for the string may be paged out by the Operating System (this appears to happen about one time in ten or twelve, and results in 60% slower performance, roughly - no precise figures yet!).
    In VB (but not in VBScript) it is possible to implement faster string concatenation using additional variables and steps, or by replacing strings with a class that makes concatenation more efficient. [The classes I've posted to the fastcode and other VB component newsgroups weren't the full quid; I only showed how to optimize concatenation for performance, not how to speed up other string operations (and, in particular, how to write *out* long strings to an ASP Response without undue performance overhead... Here's a transcript of a post I made to the aspxml newsgroup back in July 2000... What I didn't know when I wrote it was that there's probably a dirty trick you can use to Response.Write long strings very rapidly. You loop an index through them sending out about 1000 characters at a time, as shown here... Haven't tried this yet, but it'll work.]
      Public Sub WriteBufferToResponse(byref strBuffer AS String, byval lngLength AS Long _
        , byref Response As ASP.Response)
        'Imagine strBuffer is a buffer (any allocation length), and 
        'lngLength is how many chars are currently in use
        Dim strWindow AS String
        Dim lngIndex AS Long
        strWindow = space(1000)
        for lngIndex=1 to lngLength-999 Step 1000
          Mid(strWindow,1,1000)=mid(strBuffer,lngIndex,1000) 
          Response.Write strWindow
        Next
        if lngIndex<=lngLength then
          Response.Write mid(strBuffer,lngIndex,lngLength-lngIndex+1)      
        end if
      End Sub
    But there are things you can do to make ASP string performance a little better, without making calls to COM components.

    One option is to add the string fragments to the end of a dynamically allocated array, and then iterate through the array, writing out the individual elements (you should not use Join to short-cut this, as Join does not perform well for arrays containing many strings. See below).

    If you don't like that idea (nor do I!), another way to speed up string concatenation of large numbers of short strings is to use intermediate variables. For example, if you are writing out the HTML for an entire table, have separate strings for the current table cell, the current row, and the table thus far. This will help. If you possibly can, though, you are better off issuing individual Response.Write calls to write out the short strings.

    Some people recommend first concatenating all the strings on the right hand side of an assignment bar the first one, like so:
        strLong = strLong & (strShort & strShorter & strShorterYet)
      
    While this will improve performance it does not aid readability. Assigning into an intermediate variable would perform almost as well, and the meaning will be a lot clearer:
        '
        'draft code to calculate a HTML string for a table
        '
        'strTable -- string for the table as a whole
        'strRow   -- string for the row under construction
        'strCell  -- string for the current cell
        '
        strTable = "<TABLE>" 
        '...for each row
          strRow = "  <TR>" & vbCrlf
          '...for each cell
            strCell = & "    <TD>" & strCell & "</TD>" & vbCrlf
            strRow = strRow & strCell
    	  '...after all cells in current row        
          strRow = strRow "  </TR>" & vbCrlf
          strTable = strTable & strRow
        '...after all rows  
        strTable = strTable & "</TABLE>" & vbCrlf
      
    See also, Format Conversions below.

    Which operator? Should you use & or +

    If you know in advance that the left and right hand sides of a concatenation are both going to be strings (before you ask to concatenate them), you can use + instead of &. There is a performance advantage.
     
  2. The Join() function

    Both VB (version 6 and up) and VBScript provide a Join() function. Join takes an array (of Strings or of Variants) and a separator as parameters, concatenates all of the strings in the array, and adds a copy of the separator between each pair of strings. For example,
      Join(Array(1,2,"rabbit"),"--")
    returns
    "1--2--rabbit".
    Join is faster than straight concatenation, but does not scale well when called to concatenate arrays with a large number of elements. The execution time for Join() also increases with the square of the number of strings being concatenated. Note that, although many people will tell you that the average lengths of the strings do not factor into the overall execution time for Join, this is not so. It *is* important (but, I grant, less so than the number of strings).
    The n-squared execution time of the function is due to the fact that Join(), although it makes more efficient (and more direct) calls to the BSTR functions provided by the operating system, does not determine how long the resulting string will need to be before it builds it, so it must reallocate over and over again as it concatenates the individual strings from the array to build the result.
    In VB (but not in VBScript) it is possible to write implementations of Join() which scale better (even with the overhead of going through extra layers of API calls to do individual string operations). [Todo: Insert link to LinearJoin() for VB and NLogNJoin() for VBScript, once I have performance stats and a graph of same] [Unfortunately the LinearJoin implementation I already have doesn't scale down well. The standard Join() works better for relatively short strings. To beat Join() even for short strings I would need to implement in C++ (or possibly in assembler! Yuck!)] [In the meantime, hint: if you looked at every string before starting to build the result you would know exactly how long the result is going to be. Then you could use the Mid statement instead of the & or + operators...]
     
  3. The Len() function

    Determining the length of a string is very fast. Because the length of a string is stored in the BSTR structure itself, it is not necessary to look at the content of a string to determine how long it is. If the length of a string is changing, it is not worth your while to keep track of it with a string length variable.
    In VBScript and ASP it usually isn't worth assigning the length of a string into a variable at all, even if the length of the string isn't changing, as the time it takes for the script engine to make the assignment will probably outweigh the cost of asking for the length!
     
  4. The UCase() and LCase() functions

    The Ucase() function takes time proportional to the length of the string that is being converted (as does the LCase() function). You might think that means you shouldn't use it, and that, if you need case-insensitive comparisons, strComp() is better, but that might not be so. If you are using the same string a lot of times (in case-insensitive operations) it might well be worth converting it to upper or lower case once, and then doing case sensitive comparisons. Indeed, Ucase() and LCase() could just about be considered secret weapons, as we shall see when we come to the Instr(), InstrRev() and Replace() functions.
     
  5. The Instr() and InstrRev() Search functions

    When used in case-sensitive mode Instr is quick (this is the default; if you are calling Instr with two or three parameters, this is what you are using). When used in case-insensitive mode it is slow. About 50 times slower than the case-sensitive version, for long strings [Well, it was in VB5, but I haven't re-run the performance analysis for VB6, or for ASP and VBScript, but I'd be surprised if that has changed much. At best it will be about 10 times slower, I expect]

    This means that if you are searching the same string many times, it pays to convert it to upper or lower case with UCase() or LCase(). If you need to keep a cased version of the string, consider having two copies of the string, one which has been converted to uniform case, and one which has not (the first is used when you are "asking questions" about the string content, the second whenever you wish to "do something to it"). AS we shall see in the discussion of the Replace() function, this apparently daft approach can pay major dividends.
     
  6. The Replace() function

    The Replace() function works very well when there aren't too many copies of the string to be replaced in the target string. Indeed, it is probably one of the best-written of the standard string operations built into VB and VBScript. However, like Instr() and InstrRev() its performance is nowhere near as good when it is run in case-insensitive mode. When you are replacing the same substring hundreds or thousands of times, Replace() pays such a steep price for case insensitive comparison that it is worth your while to replace it altogether (pardon the pun!). I have never done performance analysis on the case-sensitive version of Replace() to see how well it scales for replacing the same string hundreds or thousands of times. I suspect its execution time will, in the end, increase with the square of the length of the string. But I'm not sure. [Todo: insert a link to my own implementation of Replace for VB4/5 - it's 1998 vintage but still does the trick - which is faster for case-insensitive Replace (well, it is when the search pattern is > about 5 chars and the searched string > about 200), and will outperform the case-sensitive version of Replace when the searched string length (or the number of replacements) gets large enough... but before I do this I'll want detailed performance info, showing when it becomes faster]
     
  7. The Split() function

    Split() is your friend. It is very fast (except when used in case-insensitive mode). If, for example, you wish to carry out an operation on each line of a file, a fast way to do it is to load the entire content of a file into a string, and then use Split() to break it into an array of lines. You can then carry out your operation (whatever it is), even using the Mid(), Left() and Right() functions to concatenate substrings if you like, without paying too much of a performance cost. If you are writing back to a file, you might as well (unless it's megabytes in length!) call Join() to stick the lines back together when you are finished processing them.
     
  8. The Mid(), Left() and Right() functions

    Each of these functions creates a new string. This means that their execution time is approximately proportional to the length of their result. The length of the string they are extracting from appears to be immaterial. [Todo: back this up with performance statistics]
    However, you should be wary of using them at all, if you can find a way to avoid it. A common method, for example, of parsing command-line parameters is to start with a string containing the entire command line, and to snap off bits of it like this:
        strNextBit = left(strCommandLine, lngCutPoint)
        strCommandLine = mid(strCommandLine, lngCutPoint + 1)
      
    However, this wastes considerable time, as both operations require new strings. When you are parsing a long string, the second statement, in particular, becomes expensive when you are executing it over and over again. Indeed, you're paying the same performance price that you pay when you concatenate large numbers of small strings to build a big one. Carving a string up like this is just like concatenation, only in reverse.

    Instead, you should look at ways to parse strings using the Mid() function, and keeping track of where you are up to. If you are, for example, searching through a file of several hundred kilobytes, this latter approach will work far better. Not just a few times faster, but hundreds of times faster, in many cases. [Unfortunately the examples I have available to illustrate this point are rather complex. For example, an ASP->VB-Component translator! Once I have a nice simple example I'll put in a link... I suppose I could put in a command-line splitter, that's only about 70 lines of code...]
     
  9. The Space() and String() functions

    The Space() and String() functions are both very fast. If you need a string of a given length and you don't much care what's in it (say you're checking how fast you can read back a string from a COM component), Space() is the function to call. It seems a pity that there isn't a function, say, StringRepeat(), to return a given multiple-character string repeat an arbitrary number of times. But you can get this! Instead of StringRepeat(pattern,count), which won't work because there's no such animal, you can use Replace(Space(count)," ",pattern). Although Replace() won't scale up well for a really large number of repeats, doing it this way will be a lot faster than concatenating it yourself. If you define a StringRepeat() function of your own it will even be more readable.
     
  10. The Mid assignment statement

    If VB didn't have this it's string operations would be unfixable (without API calls, anyway!). Since VBScript doesn't have it, you can't exploit it to solve most of your string performance problems the way you can in VB. The advantage of using the Mid assignment statement is that it does not change the length of the string that you use it on. That means that the execution time (for large strings) is roughly proportional to the number of characters copied rather than the number of characters in the resulting string.

    It turns out that, by exploiting this, you can build classes and routines that have much better performance than you might expect.
    The basic idea is to allocate more string length than you'll need, using the Space() function, and keeping track of how much you have actually used. For example, consider the following example implementation of URLDecode ( which I haven't fully tested yet!):
      Function URLDecode(ByVal DecodeMe As String) As String
        Dim result As String        'The output string (may be shorter than the input)
        Dim i As Long               'Index for next character in the input string
        Dim NextChar As String      'Next character to go to output string (possibly converted)
        Dim outputLoc As Long       'Output position in constructed HTML encoded string
        outputLoc = 1
        result = Space(Len(DecodeMe))
        On Error GoTo BadHexCode
        For i = 1 To Len(DecodeMe)
            NextChar = Mid(DecodeMe, i, 1)
            If NextChar = "%" Then
                NextChar = Chr(Val("&H" & Mid(DecodeMe, i + 1, 2))) 'hex to character
                i = i + 2
            ElseIf Mid(DecodeMe, i, 1) = "+" Then
                NextChar = " "
            End If
            Mid(result, outputLoc, 1) = NextChar
            outputLoc = outputLoc + 1
        Next
        URLDecode = Left(result, outputLoc - 1)
        Exit Function
    BadHexCode:
        Err.Raise 5, "URLDecode", "Could not interpret % sequence starting at character " & i _
            & " of the URL encoded string " & Chr(34) & DecodeMe & Chr(34)
    End Function
    Okay, so it's not as readable as a naive version of URLDecode (by a long shot!), but I assure you it performs much better than this (particularly if you call it to translate very long strings [Todo: include performance stats showing just how big a difference there is!]):
      Function URLDecode(ByVal DecodeMe As String) As String
        Dim result As String        'The output string (may be shorter than the input)
        Dim i As Long               'Index for next character in the input string
        Dim NextChar As String      'Next character to go to output string (possibly converted)
        result = ""
        On Error GoTo BadHexCode
        For i = 1 To Len(DecodeMe)
            NextChar = Mid(DecodeMe, i, 1)
            If NextChar = "%" Then
                NextChar = Chr(Val("&H" & Mid(DecodeMe, i + 1, 2))) 'hex to character
                i = i + 2
            ElseIf Mid(DecodeMe, i, 1) = "+" Then
                NextChar = " "
            End If
            result = result & nextChar
        Next
        URLDecode = result
        Exit Function
    BadHexCode:
        Err.Raise 5, "URLDecode", "Could not interpret % sequence starting at character " & i _
            & " of the URL encoded string " & Chr(34) & DecodeMe & Chr(34)
    End Function
  11. The StrComp function

    [In a nutshell, don't use this!]
     
  12. Format Conversions

    [Todo: I should provide links to my latest VB implementations of QuoteStringForVB, QuoteStringForSQL, QuoteStringForJavascript, and indicate why they beat the tar out of most such.]