Tuesday, February 3, 2009

Breaking Down Strings

There seems to be a misunderstanding among some coders about the use of strings in .NET. I can't say exactly where it comes from, except for maybe a single misconstrued idea: because strings are immutable, every string concatenation operation causes the CLR to make another memory allocation. If you're saying, "What gives? That's true!", keep your shirt on, and keep reading.

I have to say that I was one of those who misunderstood the inner workings of the CLR and blindly believed what other, also misled, developers told me. However, thanks to great developers such as Jon Skeet, who have published a wealth of knowledge on the internal happenings of the .NET CLR, I have been led to the light. This article, therefore, is me evangelizing what I know to be true.

The articles, for reference, that prompted this blog are http://www.yoda.arachsys.com/csharp/stringbuilder.html (Jon Skeet), and http://www.simple-talk.com/community/blogs/jcrease/archive/2009/01/16/71678.aspx (!Jon Skeet). While much of this article is a summary of other people's hard work, I have personally verified as much of it as possible through debugging and IL disassembling. For IL disassembling, I use Red Gate's .NET Reflector.

Getting back to the misconstrued idea that I pointed out earlier: that statement is VERY misleading if not taken in the proper context. It isn't that strings are somehow mutable, because they aren't. Nor is it that memory allocations don't happen when a string is modified, because they very much do. What matters is knowing exactly when and how the CLR creates those new strings. Let's look at the following code:

Example 1:

C# code:
string str_hello = "Hello";
string str_world = "World";
string str_test = str_hello;
str_test += " ";
str_test += str_world;
str_test += "!";

IL code:
L_0001: ldstr "Hello"
L_0006: stloc.0
L_0007: ldstr "World"
L_000c: stloc.1
...
L_0026: ldloc.0
L_0027: stloc.3
L_0028: ldloc.3
L_0029: ldstr " "
L_002e: call string [mscorlib]System.String::Concat(string, string)
L_0033: stloc.3
L_0034: ldloc.3
L_0035: ldloc.1
L_0036: call string [mscorlib]System.String::Concat(string, string)
L_003b: stloc.3
L_003c: ldloc.3
L_003d: ldstr "!"
L_0042: call string [mscorlib]System.String::Concat(string, string)
L_0047: stloc.3

This code made three separate calls to String.Concat, and each one allocated a brand-new string: "Hello ", then "Hello World", then "Hello World!". That's on top of the string literals we started with, and the first two results were thrown away almost as soon as they were created. All of that just to say "Hello World!". How might we make this more efficient? I'm glad you asked. The answer is simple. Have you ever noticed the number of overloads that String.Concat has? Lots! Including one that takes 4 strings. And, hey, we just happen to have 4 strings.
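Before fixing it, it's worth seeing that each += really does hand back a different object. The following is only a sketch, not something taken from the articles above: the class name ConcatIdentityDemo and the helper SameObjectAfterAppend are made up for illustration, and the check relies on Object.ReferenceEquals to compare object identity rather than string contents.

```csharp
using System;

public static class ConcatIdentityDemo
{
    // Hypothetical helper: appends a suffix with += and reports whether the
    // variable still refers to the string object it started with.
    // (With an empty suffix String.Concat can return the original string,
    // so use a non-empty suffix to see the effect.)
    public static bool SameObjectAfterAppend(string original, string suffix)
    {
        string current = original;   // copies the reference, no new string yet
        current += suffix;           // compiles to a String.Concat call
        return Object.ReferenceEquals(original, current);
    }

    public static void Main()
    {
        string str_hello = "Hello";

        // Prints False: the += produced a different string object.
        Console.WriteLine(SameObjectAfterAppend(str_hello, " "));

        // Prints Hello: the original string was never modified.
        Console.WriteLine(str_hello);
    }
}
```

Note that plain assignment (string current = original) only copies a reference; the new string shows up at the Concat call, not before.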

Example 2:

C# code:
string str_hello = "Hello";
string str_world = "World";
string str_test = String.Concat(str_hello, " ", str_world, "!");

IL code:
L_0001: ldstr "Hello"
L_0006: stloc.0
L_0007: ldstr "World"
L_000c: stloc.1
L_000d: ldloc.0
L_000e: ldstr " "
L_0013: ldloc.1
L_0014: ldstr "!"
L_0019: call string [mscorlib]System.String::Concat(string, string, string, string)
L_001e: stloc.2

Great! We did it! Three Concat calls have become one, so the throwaway intermediate strings are gone and only the final result gets allocated. Now all we have to do is use String.Concat all the time, right? Well, not exactly. While that wouldn't hurt anything, there's actually something else that produces exactly the same IL.

Example 3:

C# code:
string str_hello = "Hello";
string str_world = "World";
string str_test = str_hello + " " + str_world + "!";

IL code:
L_0001: ldstr "Hello"
L_0006: stloc.0
L_0007: ldstr "World"
L_000c: stloc.1
L_000d: ldloc.0
L_000e: ldstr " "
L_0013: ldloc.1
L_0014: ldstr "!"
L_0019: call string [mscorlib]System.String::Concat(string, string, string, string)
L_001e: stloc.2

Believe it or not, the IL listings in Examples 2 and 3 came from two separate builds, one for each example, and the compiler produced exactly the same IL for both! Additionally, String.Concat even has an overload that accepts a string[]. That means no matter how many times you use the concatenation operator in a single statement, like those above, the whole thing becomes one Concat call and only one new string is allocated for the result.
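To make the overload point concrete, here is a small sketch that builds the same sentence with the + operator and by calling the string[] overload by hand. The class and method names (ConcatOverloadDemo, WithOperator, WithArray) are invented for this illustration; the only library call assumed is String.Concat itself.

```csharp
using System;

public static class ConcatOverloadDemo
{
    // Four pieces joined in one statement; as in Example 3, this is expected
    // to compile down to a single String.Concat call.
    public static string WithOperator(string hello, string world)
    {
        return hello + " " + world + "!";
    }

    // The same pieces passed through the string[] overload explicitly.
    public static string WithArray(string hello, string world)
    {
        string[] parts = { hello, " ", world, "!" };
        return String.Concat(parts);
    }

    public static void Main()
    {
        Console.WriteLine(WithOperator("Hello", "World")); // Hello World!
        Console.WriteLine(WithArray("Hello", "World"));    // Hello World!
    }
}
```

Both methods return the same text; which one reads better is a matter of taste, since the operator form already gets you the single Concat call.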

There's a fourth and final concatenation example that I'd like to go over before moving on.

Example 4:

C# code:
string str_test = "Hello" + " " + "World" + "!";

IL code:
L_0001: ldstr "Hello World!"
L_0006: stloc.0

That's it. No, really, those 2 lines of IL are all that is generated from the above statement. The C# compiler is smart enough to see that all you're doing is concatenating string literals, and it does it all for you at compile time!
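You can check the compile-time folding without opening a disassembler. In the sketch below (LiteralFoldingDemo and its members are hypothetical names), the concatenation is assigned to a const, which only compiles if the whole expression is a compile-time constant. The ReferenceEquals check then relies on equal string literals within one assembly sharing a single instance.

```csharp
using System;

public static class LiteralFoldingDemo
{
    // Built entirely from literals. Declaring it const only compiles because
    // the compiler can evaluate the whole expression at compile time.
    public const string Folded = "Hello" + " " + "World" + "!";

    public static bool FoldedIsSameAsLiteral()
    {
        // Equal string literals in the same assembly share one instance,
        // so the folded constant and the plain literal are the same object.
        return Object.ReferenceEquals(Folded, "Hello World!");
    }

    public static void Main()
    {
        Console.WriteLine(FoldedIsSameAsLiteral()); // True
    }
}
```

Swap one of the literals for a variable and the const declaration stops compiling, which is a quick way to tell whether a given expression is being folded or handed to String.Concat at run time.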

While this example is trivial, think of a web application where you might need to build long strings of HTML, or even email address lists, with hundreds to thousands of users hitting it all at the same time. There, that trivial difference can easily add up to hundreds of MB, which could easily incapacitate your app. As most developers know, this is fixable using StringBuilder, which I will be discussing later.
