Wednesday, April 29, 2009

Finalizers and Thread local storage

Writing finalizers is generally tricky. The MSDN documentation for finalizers mentions the following limitations:

  1. The exact time when the finalizer executes during garbage collection is undefined.
  2. The finalizers of two objects are not guaranteed to run in any specific order.
  3. The thread on which the finalizer is run is unspecified.
  4. Finalizers might not be run at all.

#3 has interesting consequences. If a native resource is allocated by a managed ctor (or any other method) and the finalizer is used to de-allocate it, then the allocation and de-allocation will not happen on the same thread. The reason is that the CLR uses one (maybe more than one) special thread to run all finalizers. This is easily verified: if you query the thread id inside the finalizer, you will get a different id than that of the main thread the application is running on.

Thread safety is generally easy to handle, but this behavior can still lead to some tricky and hard-to-locate problems. Consider the following code:

using System;

class MyClass
{
    [ThreadStatic]
    static int tls; // each thread gets its own copy of this field

    public MyClass()
    {
        tls = 42;
        Console.WriteLine("Ctor threadid = {0} TLS={1}", AppDomain.GetCurrentThreadId(), tls);
    }

    ~MyClass()
    {
        Console.WriteLine("Finalizer threadid = {0} TLS={1}", AppDomain.GetCurrentThreadId(), tls);
    }

    public void DoWork()
    {
        Console.WriteLine("DoWork threadid = {0} TLS={1}", AppDomain.GetCurrentThreadId(), tls);
    }
}

class Program
{
    static void Main(string[] args)
    {
        MyClass mc = new MyClass();
        mc.DoWork();
        mc = null;
    }
}

Here we create an object, use it and finalize it. Each of these 3 methods prints the thread id and also uses a special variable named tls.

Note that tls is marked with the ThreadStatic attribute. The definition of ThreadStatic is “Indicates that the value of a static field is unique for each thread”.

I hope by now you have figured out the gotcha :). Under the hood the CLR uses a native OS concept called Thread Local Storage (TLS) to ensure that the value of tls is unique per thread. TLS uses a special per-thread data structure to store that data. We set the value in the ctor and read it in the finalizer. Since they run on different threads, each sees a different value of the same field.

On my system the output is as follows

Ctor threadid = 5904 TLS=42
DoWork threadid = 5904 TLS=42
Finalizer threadid = 4220 TLS=0
Press any key to continue . . .

As is evident, the finalizer ran on a different thread (the id is different) and the TLS value is also different from what was set.

So the moral of this story is “Be careful about the thread safety of finalizers and do not use thread local storage in them”.
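The same effect can be reproduced without waiting for the GC. The following is a minimal sketch, not part of the original sample: the class and method names (ThreadStaticDemo, ReadOnSameThread, ReadOnOtherThread) are made up for illustration, and an ordinary worker thread stands in for the finalizer thread.

```csharp
using System;
using System.Threading;

// Hypothetical demo class; an explicit worker thread plays the role of the finalizer thread.
static class ThreadStaticDemo
{
    [ThreadStatic]
    static int tls; // every thread sees its own copy, which starts at 0

    // Sets tls on the calling thread and reads it back on the same thread.
    public static int ReadOnSameThread()
    {
        tls = 42;
        return tls;
    }

    // Sets tls on the calling thread, then reads it from a different thread.
    public static int ReadOnOtherThread()
    {
        tls = 42;
        int seen = -1;
        Thread worker = new Thread(() => { seen = tls; });
        worker.Start();
        worker.Join();
        return seen;
    }

    static void Main()
    {
        Console.WriteLine("Same thread  TLS={0}", ReadOnSameThread());  // 42
        Console.WriteLine("Other thread TLS={0}", ReadOnOtherThread()); // 0
    }
}
```

If a finalizer needs some state, keeping it in an instance field instead of a thread-static field avoids the problem, because instance fields do not depend on which thread reads them.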

Monday, April 20, 2009

What post are you known for?

I was just listening to Scott Hanselman’s podcast where he interviews Joel. They talk about how most bloggers get known for one post. E.g. Joel is known for “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”.

Now obviously I’m not a famous blogger and I’m not known for any post but the fact that my most read post is “How many (.NET) types are loaded for Hello World” makes me feel a bit weird. IMO that is one of the least useful posts I’ve ever written.

Friday, April 17, 2009

.NET Code Pitching

The word pitch can have many meanings, but in the case of code pitching it is used in the sense of “to throw away”.

The desktop CLR never throws away or pitches native code that it JITs. However, the .NET Compact Framework CLR supports code pitching for the following reasons:

  1. It is targeted towards embedded/mobile systems and therefore needs to be more sensitive towards memory usage. So in some situations it throws away JITed code to free up memory.
  2. NETCF runs on RISC processors like ARM where code density is lower than that of x86. This means the JITed native code for a given set of IL is larger on, say, ARM than on x86, which aggravates the memory overhead a bit. However, this isn’t really a primary motivator and there are other workarounds, such as using the newer ARM instruction set extensions.

Code pitching can happen for various reasons, such as (note this is not an exhaustive list):

  1. When the managed application gets a WM_HIBERNATE message, it fires a round of garbage collection and also pitches code.
  2. When the JITer fails to allocate memory, it attempts code pitching to free up memory and retries the allocation.
  3. Other native resource allocation failures also initiate pitching.

Obviously code pitching has performance implications, since a pitched method must be JITed again the next time it is called, and the same method can end up being pitched multiple times. You can monitor code pitching in Remote Performance Monitor.

The GC does drive code pitching, but the two are not inherently tied together. E.g. if user code forces a GC by calling System.GC.Collect(), code pitching is not done. Code pitching is primarily driven by real low-memory scenarios.

Caveat: While pitching, the CLR ensures that it doesn’t throw away any JITed method on the current managed execution stack. No points for guessing why that is important.

Wednesday, April 15, 2009

Multiple calls to GC.ReRegisterForFinalize

What happens when there are multiple calls to GC.ReRegisterForFinalize for the same object?

Consider the following C# code

using System;

class MyClass
{
    ~MyClass() { Console.WriteLine("MyClass finalizer"); }
}

class Program
{
    static void Main(string[] args)
    {
        MyClass mc = new MyClass();
        GC.ReRegisterForFinalize(mc); // 1
        GC.ReRegisterForFinalize(mc); // 2
        GC.ReRegisterForFinalize(mc); // 3
    }
}

Here for the same object we have called GC.ReRegisterForFinalize thrice.

This is one of the few cases where the behavior of .NET and .NET Compact Framework differs.

In case of .NET the MyClass finalizer is called 4 times. Since the object has a finalizer, it is anyway finalized once and for the additional 3 calls to re-register it is finalized 3 more times.

In case of .NET Compact Framework the finalizer is called only once. The reason is that, internally, “finalizable” is just a bool flag. Each time GC.SuppressFinalize is called it is set to false and each time GC.ReRegisterForFinalize is called it is set to true. So multiple calls don’t have any additional effect.

Hopefully this is one piece of information which none of my readers ever find useful, but since I got a bug assigned for the difference in behavior I might as well let folks know.

Tuesday, April 14, 2009

.NET Compact Framework MemMaker

Glen recently discovered that by keeping his application’s EXE empty and putting all the forms, code, resources, and data in managed DLLs, he reduced the amount of virtual memory his app uses inside its slot while at the same time taking advantage of memory outside the slot in the 1 GB shared memory area.

Read more about this very interesting finding on Rob Tiffany’s blog.

Monday, April 13, 2009

Object Resurrection using GC.ReRegisterForFinalize


I have been thinking about writing this post for some time, but Easter weekend provided the right context to do so.

When I first encountered GC.ReRegisterForFinalize I was a bit baffled. The context help “Requests that the system call the finalizer for the specified object for which System.GC.SuppressFinalize(System.Object) has previously been called.” didn’t give much of a clue as to why I would want to finalize an object more than once.

This is when I found out about object resurrection.

res·ur·rec·tion   (rěz'ə-rěk'shən)
  1. The act of rising from the dead or returning to life.
  2. The state of one who has returned to life.
  3. The act of bringing back to practice, notice, or use; revival.

This definition made perfect sense. The basic idea of object resurrection is to handle scenarios where creating an object is very expensive for some reason (e.g. it makes system calls that take a lot of time or consumes a lot of resources). To avoid creating new objects and incurring the creation cost again, you’d try to recycle older objects which have already done their job, or in other words resurrect dead objects.

This can be done by using the following facts:

  1. If an object is garbage (not reachable) and has a finalizer, it is not collected in the first pass of the GC but is placed in the finalizer queue.
  2. The finalizer thread calls the finalizer of each object in the finalizer queue and removes it from the queue.
  3. The next pass of the GC actually reclaims the object (as in, de-allocates the memory).

If, while the finalizer is executing in #2, a new reference to the object is created, the object becomes reachable again and hence is no longer garbage, so in #3 the GC will not reclaim it. If the finalizer also calls GC.ReRegisterForFinalize, then the next time the same object becomes available for collection #1 through #3 happen all over again (so we have set up a cycle).

Consider the scenario where MyClass is a class which is expensive to create. We want to set up a pool of these objects so that we can just pick up objects from this pool, and once an object is done with, it is automatically put back into that pool.

using System;
using System.Collections.Generic;

class MyClass
{
    /// <summary>
    /// Expensive class
    /// </summary>
    /// <param name="pool">Pool from which the objects are picked up</param>
    public MyClass(List<MyClass> pool)
    {
        Console.WriteLine("Expensive MyClass::ctor");
        this.pool = pool; // Store the pool so we can use it later
    }

    public void DoWork()
    {
        Console.WriteLine("{0} Did some work", this.resurrectionCount);
    }

    /// <summary>
    /// Finalizer method
    /// </summary>
    ~MyClass()
    {
        resurrectionCount++;

        // automatically return to the pool, makes this object reachable as well
        // (this runs on the finalizer thread, so real code would have to synchronize access)
        pool.Add(this);

        // Ensure next time this same finalizer is called again
        GC.ReRegisterForFinalize(this);
    }

    private List<MyClass> pool;
    private int resurrectionCount = 0;
}

// Create pool and add a bunch of these objects to the pool
List<MyClass> pool = new List<MyClass>();
pool.Add(new MyClass(pool));
pool.Add(new MyClass(pool));

// Client code
MyClass cl = pool[0]; // get the object at the head of the pool
pool.RemoveAt(0); // remove the object
cl.DoWork(); // start using it. Once done it will automatically go back to pool

At the start a bunch of these objects are created and put in this list of objects (taking a hit in startup time). Then as and when required they are removed from this list and put to work. Once the work is done they automatically get en-queued back to the pool as the finalizer call will ensure that.

Don’t do this

Even though the above mechanism seems pretty tempting, it is actually a very bad idea to use it. The major reason is that the CLR never guarantees when the GC runs or, after that, when the finalizer will run. So in effect, even though a bunch of these objects have done their job, there is no deterministic point at which they land back in the pool.

If you really need object pooling, a much better approach is to define your own interface with a method that client code calls explicitly to return an object, so that the return to the pool is deterministic.
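A minimal sketch of such an explicit pool could look like the following. All names here (ExpensiveResource, ResourcePool, Rent, Return, CreatedCount) are hypothetical and chosen only for illustration; the point is that the object goes back to the pool exactly where the client says so, with no dependence on the GC or the finalizer thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a class that is expensive to create.
class ExpensiveResource
{
    public int UseCount { get; private set; }
    public void DoWork() { UseCount++; }
}

// Hypothetical pool with explicit, deterministic Rent/Return calls.
class ResourcePool
{
    private readonly Stack<ExpensiveResource> free = new Stack<ExpensiveResource>();
    private readonly object gate = new object();

    // Number of objects actually constructed so far.
    public int CreatedCount { get; private set; }

    // Hands out a pooled object if one is free, otherwise creates a new one.
    public ExpensiveResource Rent()
    {
        lock (gate)
        {
            if (free.Count > 0) return free.Pop();
            CreatedCount++;
            return new ExpensiveResource();
        }
    }

    // Called explicitly by client code once it is done with the object.
    public void Return(ExpensiveResource item)
    {
        if (item == null) throw new ArgumentNullException("item");
        lock (gate) { free.Push(item); }
    }
}

static class PoolDemo
{
    static void Main()
    {
        ResourcePool pool = new ResourcePool();
        ExpensiveResource a = pool.Rent();
        a.DoWork();
        pool.Return(a);                    // back in the pool right here
        ExpensiveResource b = pool.Rent(); // gets the same instance again
        Console.WriteLine(object.ReferenceEquals(a, b)); // True
        Console.WriteLine(pool.CreatedCount);            // 1
    }
}
```

The cost is that callers must remember to call Return; a try/finally around the work is the usual way to make sure that happens.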

Tuesday, April 07, 2009

Technical Presentation Tip


I’m no Scott Hanselman, but even then I guess I can share at least one tip for delivering technical presentations that has worked very well for me.

At the very beginning of your presentation, always have a slide clearly calling out what the attendees will “Take Away” and what the “Pre-requisites” for attending your presentation are. Presentation titles are sometimes misleading and are generally not enough to indicate what a presentation is all about.

Benefits of calling out key take-away

  1. Helps you focus on what you want to convey and you can take feedback from the attendees on whether you met the expectation that you set.
  2. If an attendee feels he already knows about the topic or is not really interested for some other reason he can leave at the very beginning. No point wasting anyone’s time.
  3. You can direct discussion/questions. If someone tries venturing into an area you intentionally don’t want to cover, you can get the discussion back onto the right track. When I give my GC Theory talk I call out very clearly that the presentation is not about how either .NET or .NETCF implements GC. Inevitably someone ventures into some .NET implementation detail, and I use this to get back to the main course.

Benefits of calling out pre-requisite

  1. Just like you need to ensure you are not wasting the attendees’ time by calling out take-aways, you should ensure that you are not wasting your time trying to explain garbage collection to device driver programmers :)
  2. Explaining basic stuff to the one attendee who raises it, while all the other attendees already know the answer, is not a good use of anyone’s time. _ASSERT on the pre-reqs and ensure that you are clear about what baseline knowledge you are expecting from all the attendees.

You can learn more real presentation tips from Scott’s post and maybe then one day you’ll earn a feedback comment “The presentation was superb. I think this is the first presentation for which I was completely awake” and try to figure out who that one was ;)

Monday, April 06, 2009

Floating point operations in .NET Compact Framework on WinCE+ARM

There has been some confusion about how .NETCF handles floating point operations. The major reason for this confusion is that the answer differs across the platforms NETCF supports (e.g. S60/Xbox/Zune/WinCE). I made a post on this topic which is partially incorrect. As I followed that up I learnt a lot, especially from Brian Smith. Hopefully this post removes all the confusion floating around floating point handling in .NETCF on WinCE.

How does desktop CLR handle floating point operation

Consider the following floating point addition in C#

float a = 0.7F;
float b = 0.6F;
float c = a + b;

For this code the final assembly generated by the CLR JITter on the x86 platform is

float a = 0.7F;
0000003f mov dword ptr [ebp-40h],3F333333h
float b = 0.6F;
00000046 mov dword ptr [ebp-44h],3F19999Ah
float c = a + b;
0000004d fld dword ptr [ebp-40h]
00000050 fadd dword ptr [ebp-44h]
00000053 fstp dword ptr [ebp-48h]

Here the JITter directly emits the floating point instructions fld, fadd and fstp. It can do so because a floating point unit, and hence the floating point instruction set, is always available on x86.
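The two mov instructions in the listing store the raw IEEE 754 single-precision bit patterns of the constants. A small sketch to check this from managed code follows; the FloatBits class and ToHex helper are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helper for inspecting the bit pattern of a float.
static class FloatBits
{
    // Returns the raw IEEE 754 single-precision bits as 8 hex digits.
    public static string ToHex(float value)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        return bits.ToString("X8");
    }

    static void Main()
    {
        Console.WriteLine(ToHex(0.7F)); // 3F333333
        Console.WriteLine(ToHex(0.6F)); // 3F19999A
    }
}
```

These are the same values that appear as the immediates 3F333333h and 3F19999Ah in the disassembly above.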

Why does NETCF differ

Unfortunately NETCF targets a huge number of HW configs which vary across Xbox, Zune, WinCE on x86/MIPS/ARM/SH4, S60, etc. There are even sub-flavors of these base configs (e.g. ARMv4, ARMv6, MIPS II, MIPS IV). A floating point unit (FPU) is not available on all of these platforms.

This difference in FPU availability is taken care of by using different approaches on different platforms. This post is for WinCE+ARM and hence I’ll skip the other platforms.

Zune is a special WinCE platform

Zune is special because it is tied to a specific version of ARM with an FPU built in (locked HW). NETCF on Zune was primarily targeted at XNA games, and in games the performance of floating point operations is critical. Hence the .NETCF JITter was updated to target the ARM FPU. So for basic mathematical operations it emits inline native ARM FPU instructions, very much like the desktop JITter shown above. The result is that the basic floating point operations are much faster.

WinCE in general

In general for WinCE on ARM, the presence of an FPU cannot be assumed, because the least common subset targeted is ARMv4, which does not necessarily have an FPU.

To understand the implications, it is important to understand that floating point operations can basically be classified into two categories:

  1. BCL System.Math operations like sin/cos/tan
  2. Simple floating point operations like +, -, *, /, conversion, comparison.

For the first category the JITer simply delegates the operation to WinCE by calling into CoreDll.dll, e.g. the sin, sinh, cos, cosh, etc. available in CoreDll.dll.

For the second category the JITer calls into small worker functions implemented inside the NETCF CLR. These worker functions are native code compiled for ARM. If we disassemble them, we would see that the native ARM compiler emits calls into coredll.dll, e.g. to __imp___addd.

It is evident from the above that the performance of managed floating point operations depends heavily on whether the underlying WinCE uses the ARM FPU, since in most scenarios floating point operations are finally delegated to it.
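To get a feel for how the two categories behave on a particular device, one could time them separately. The following is only a rough sketch under assumptions: the FpTiming class and its method names are hypothetical, System.Diagnostics.Stopwatch is assumed to be available on the target framework, and a simple loop like this is not a rigorous benchmark. Note that the sine loop also performs one double addition per iteration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical micro-timing helper for the two categories of FP operations.
static class FpTiming
{
    // Category 2: simple FP IL opcodes (here, float add).
    public static float SumAdds(int n)
    {
        float acc = 0.0F;
        for (int i = 0; i < n; i++) acc += 0.5F;
        return acc;
    }

    // Category 1: System.Math library calls (here, Math.Sin).
    public static double SumSines(int n)
    {
        double acc = 0.0;
        for (int i = 0; i < n; i++) acc += Math.Sin(i);
        return acc;
    }

    static void Main()
    {
        const int N = 1000000;

        Stopwatch sw = Stopwatch.StartNew();
        float adds = SumAdds(N);
        long addMs = sw.ElapsedMilliseconds;

        sw = Stopwatch.StartNew();
        double sines = SumSines(N);
        long sinMs = sw.ElapsedMilliseconds;

        // The results are printed so the loops cannot be treated as dead code.
        Console.WriteLine("float adds: {0} ms (result {1})", addMs, adds);
        Console.WriteLine("Math.Sin:   {0} ms (result {1})", sinMs, sines);
    }
}
```

The absolute numbers mean little on their own; comparing them across devices or OS images is what would hint at whether the floating point work is being emulated or executed on an FPU.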

The whole thing can be summarized in the following table (courtesy Brian Smith)

|  | #1 CE is NOT FPU optimized | #2 CE is FPU optimized | #3 CE is FPU optimized and so is NETCF (e.g. Zune JIT) |
|---|---|---|---|
| Platform | WinCE6 + ARMv4i | WinCE6 + ARMv6_FP | WinCE6 + ARMv6_FP |
| System.Math library calls | Delegated via pinvoke, emulated within CE (speed: slow) | Delegated via pinvoke, FPU instructions within CE (speed: medium) | Delegated via pinvoke, FPU instructions within CE (speed: medium) |
| FP IL opcodes (add, etc) | Delegated via JIT worker, emulated within CE (speed: slow) | Delegated via JIT worker, FPU instructions within CE (speed: medium) | FPU instructions inlined by JITed code (speed: fast) |

#1 is the general case for NETCF + WinCE + ARM. #3 is the current scenario of NETCF + Zune + ARM.

#2 is based on the fact that WinCE 6.0 supports a “pluggable” FP library. However, the NETCF team has not tried out this flavor and hence does not give any guidance on whether plugging an FP library into WinCE will really improve performance, although in theory it does seem likely.

#3 today is only for Zune, but going forward it does seem likely that newer versions of WinCE will update their base supported HW spec to include an FPU, and then this feature will also make it into base NETCF for WinCE.