Friday, February 20, 2009

Back To Basics: Handling overflow in mark stage

Bonfire at pragati resort

This post is Part 7 in the series of posts on Garbage Collection (GC). Please see the index here.

Let’s first recap the basic algorithm of mark-sweep garbage collection. In C-like pseudo code the algorithm looks like

void GC()
    ObjectCollection roots = GetRoots();
    for(int i = 0; i < roots.Count(); ++i)
        Mark(roots[i]);

void Mark(Object* pObj)
    if (!Marked(pObj)) // Marked returns the marked flag from object header
        MarkBit(pObj); // Marks the flag in obj header
        // Get list of references that the current object has
        // and recursively mark them as well
        ObjectCollection children = pObj->GetChildren();
        for(int i = 0; i < children.Count(); ++i)
            Mark(children[i]); // recursively call mark

This algorithm is by nature recursive: in the marking phase we recursively mark every object and its children. However, recursion is not optimal in terms of either time or space usage. Each method call comes with additional overhead, which is particularly bothersome because the GC typically runs when the system is already low on resources.

Controlling the depth of the call stack in a recursive algorithm is also hard. In the algorithm shown above the depth of the stack is equal to the length of the longest reference chain in the heap.

This is worked around with general recursion-to-iteration conversion techniques, such as using an explicit mark stack as shown below

Stack g_MarkStack(MAX_SIZE);

void Init()
    g_MarkStack.Clear();

void GC()
    ObjectCollection roots = GetRoots();
    for(int i = 0; i < roots.Count(); ++i)
        g_MarkStack.Push(roots[i]);
    Mark();

void Mark()
    while(!g_MarkStack.IsEmpty())
        Object *pObj = g_MarkStack.Pop();
        if (!Marked(pObj))
            MarkBit(pObj);
            ObjectCollection children = pObj->GetChildren();
            for(int i = 0; i < children.Count(); ++i)
                g_MarkStack.Push(children[i]);
Using an explicit mark stack we have removed the need for recursion. The algorithm is now more deterministic, and it is easy to detect when the heap graph is too deep and the stack actually overflows (at the point of push we can check g_MarkStack.IsFull()).
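To make the overflow check concrete, here is a minimal sketch in C of a fixed-size mark stack whose push reports failure instead of overflowing. The names (MarkStack, Push, Pop, IsFull) are illustrative, not the actual .NETCF structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SIZE 4   /* tiny, to make the overflow easy to demonstrate */

typedef struct MarkStack {
    void  *items[MAX_SIZE];
    size_t top;
} MarkStack;

bool IsFull(const MarkStack *s)  { return s->top == MAX_SIZE; }
bool IsEmpty(const MarkStack *s) { return s->top == 0; }

/* Returns false instead of pushing when the stack is full -- this is
 * the point where an overflow-handling scheme has to kick in. */
bool Push(MarkStack *s, void *obj)
{
    if (IsFull(s))
        return false;
    s->items[s->top++] = obj;
    return true;
}

void *Pop(MarkStack *s)
{
    return IsEmpty(s) ? NULL : s->items[--s->top];
}
```

What the caller does when Push returns false is exactly the overflow-handling problem the rest of this post is about.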

Handling overflow

An explicit stack makes detecting and handling stack overflow easier. It doesn’t, however, remove the issue. A first-cut attempt is to reduce the number of items pushed onto the stack. Objects are pushed so that they act as branch points: once one child subtree has been marked, we can come back to the parent, branch off, and mark its other children. So simple fixes like pushing an object only when it has two or more children do help in reducing overflows.
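One way to sketch this push-reduction idea (using a hypothetical object model, not the .NETCF one): the mark loop follows the first unmarked child directly and pushes only the remaining siblings, so a chain of single-child objects never touches the stack at all.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define STACK_SIZE   64

/* Illustrative object model: each object knows its children. */
typedef struct Obj {
    int marked;
    int nchildren;
    struct Obj *children[MAX_CHILDREN];
} Obj;

/* Marks everything reachable from root. Instead of pushing every
 * child, the loop follows the first unmarked child in place and
 * pushes only the remaining siblings. Returns the peak number of
 * simultaneously stacked objects, to show the saving. */
int Mark(Obj *root)
{
    Obj *stack[STACK_SIZE];
    int top = 0, peak = 0;
    Obj *obj = root;
    for (;;) {
        while (obj != NULL && !obj->marked) {
            obj->marked = 1;
            Obj *next = NULL;
            for (int i = 0; i < obj->nchildren; ++i) {
                if (obj->children[i]->marked)
                    continue;
                if (next == NULL)
                    next = obj->children[i];   /* follow in place  */
                else
                    stack[top++] = obj->children[i]; /* branch point */
            }
            if (top > peak)
                peak = top;
            obj = next;
        }
        if (top == 0)
            break;
        obj = stack[--top];
    }
    return peak;
}
```

With this shape, marking a linked list uses no stack at all, while a node with n children pushes only n-1 entries.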

There are other ways to work around this problem, like using an alternate graph marking algorithm such as the Schorr-Waite pointer-reversal algorithm.

However, whatever method we try, there will be object graphs deep enough that the stack runs out of space, and that needs to be handled.

The .NET Compact Framework uses a two-level fall-back scheme, as explained below.

Using overflow stack

The basic mark stack is a growable buffer that can grow to a maximum of 4KB. In case this overflows, another small fixed-size stack, called the overflow-stack, is used. Objects on which the overflow happens during marking are pushed onto this overflow-stack, and their children are not traversed.

This means a part of the graph is left out of marking and the growth of the mark stack is contained. The mark stack then unwinds and marking resumes on it. In the next pass marking is restarted with the objects on the overflow-stack treated as roots. This is done iteratively. The idea is that with each such pass more and more of the graph gets marked, and hence marking finally converges.


For really deep trees even the overflow-stack can overflow. To handle this, the following steps are taken:

  1. .NETCF uses pooled memory: objects are sub-allocated from 64K memory pools. The pool in which the overflowed object resides is located. Each pool header contains two pointers to the first and the last overflowed objects in that pool; these are updated to cover the currently overflowed object.
  2. The header of the object on which the overflow happened is marked with a special bit flag, OBJFLAG_MARK_OVERFLOW.
  3. A global linked list of all pools containing overflowed objects is also maintained.
  4. Once the objects in the mark-stack and the overflow-stack are marked, another pass is started which does the following
    foreach Pool in Overflow-arena
        foreach Obj >= Pool.header.start and Obj < Pool.header.end
            if(obj.flag & OBJFLAG_MARK_OVERFLOW)
                Mark(obj) // treat the overflowed object as a root

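The per-pool bookkeeping described above might look something like the following sketch (field and function names are illustrative, not the actual .NETCF layout):

```c
#include <assert.h>
#include <stddef.h>

#define OBJFLAG_MARK_OVERFLOW 0x1

typedef struct Object {
    unsigned flags;
    /* ... payload ... */
} Object;

/* Illustrative 64K pool header: start/end bound the range of
 * overflowed objects in this pool; next chains overflowed pools
 * into the global overflow arena. */
typedef struct Pool {
    Object      *start;   /* first overflowed object in this pool */
    Object      *end;     /* last overflowed object in this pool  */
    struct Pool *next;    /* next pool with overflowed objects    */
    int          linked;  /* already on the global list?          */
} Pool;

/* Record an overflow: flag the object, widen the pool's
 * [start, end] range to cover it, and put the pool on the
 * global list if it isn't there yet. */
void RecordOverflow(Pool *pool, Object *obj, Pool **overflowArena)
{
    obj->flags |= OBJFLAG_MARK_OVERFLOW;
    if (pool->start == NULL || obj < pool->start)
        pool->start = obj;
    if (pool->end == NULL || obj > pool->end)
        pool->end = obj;
    if (!pool->linked) {
        pool->next = *overflowArena;
        *overflowArena = pool;
        pool->linked = 1;
    }
}
```

The later pass then only has to scan the [start, end] range of each pool on the arena list instead of the whole heap.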
The very design of the .NETCF GC caters to low-memory devices, and hence it very rarely needs to use this fall-back. The design ensures that the GC is optimal for low memory usage but can scale up (albeit at a perf cost) to larger memory as well.

Thursday, February 12, 2009

Installing S60 development tools on Windows 7

Fire-eater in action

My first attempt to install Carbide on Win 7 failed, so I thought I’d document the exact steps I followed the second time to make it work.

Install Pre-requisites

  1. Perl
    1. I used ActivePerl
    2. To ensure UAC didn’t cause any issues I first launched an elevated command prompt by going to All Programs –> Accessories, right-clicking on Command Prompt and choosing Run as administrator
    3. In that I ran the command msiexec /i ActivePerl-
    4. I chose the path c:\perl\bin
    5. I ran the path command from another newly launched command prompt to ensure that c:\perl\bin is present in it (so the Path environment variable was updated)
  2. Java
    1. I downloaded jre-6u7-windows-i586-p.exe
    2. Right clicked on it and used Run as administrator to install it

Installing the S60 Tools

  1. S60 SDK
    1. Downloaded
    2. Unzipped to a folder
    3. Right clicked on setup.exe and used Run as administrator to install it
    4. Used c:\S60\devices\ as the install path. So my final installation is in C:\S60\devices\S60_5th_Edition_SDK_v0.9
  2. Carbide
    1. Downloaded Carbide V2.0
    2. Right clicked on it and used Run as administrator to install it
    3. Used C:\Apps\ as the install path
  3. PC Suite
    1. Downloaded Nokia_PC_Suite_7_1_18_0_eng_us_web.exe
    2. Right click on the exe and choose Troubleshoot compatibility
    3. In the wizard that opens click next
    4. Ensure the following is checked on the next page of that wizard
    5. Hit Next and on the next page choose Windows Vista
    6. Go on hitting next and finally the installer will launch

At this point you are done and should be able to launch Carbide, Emulator and PC Suite.

Scott Guthrie’s visit to Microsoft India

I rarely blog about a photo but this is an exception. Scott Guthrie, our Corporate Vice President for the .NET Developer Platform, was here in Hyderabad on a visit to Microsoft India. I had a chance to attend a bunch of product reviews and meetings with him. Every person in our division and outside is in complete and utter awe of this guy; it was inspirational to hear his ideas and vision around .NET and Silverlight.


Scott with a bunch of folks from .NET Compact Framework and Silverlight Mobile team. As you can guess Scott is in the center. I’m the one on the absolute right.

Sunday, February 08, 2009

Back to Basics: Optimizing reference counting garbage collection

Mesmerized by magic

One of the comments on my previous post on reference counting GC prompted this post. The comment goes as

“Ref counting GC seems to have some hard issues. How does it still get used?”

The advantages of reference counting GC are that it reclaims objects immediately after they are no longer in use and that it spreads the GC cost across execution, so it does not suffer from the long pauses of a mark-sweep GC. This makes it compelling in some scenarios (e.g. interactive systems). However, it does suffer from some major drawbacks. In this post I will skim through the issues and how they are worked around in real systems.

Cyclic references

The fact that cyclic references cannot be handled by the general refcounting mechanism is touted as its biggest drawback. A variety of approaches have been attempted to solve this. The two main categories of approaches are ways to distinguish pointers that form part of a cycle from those that do not, and using a backup mark-sweep GC.

Using a backup mark-sweep GC works pretty well for handling cyclic references. The idea is to have a regular refcounting GC that simply ignores cycles, so cyclic garbage builds up over time. Periodically the backup mark-sweep GC is run, which cleans up objects that have a non-zero refcount but are not externally reachable.
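A toy illustration of why the backup trace is needed (a hypothetical two-object model, not any real runtime): after the external reference to a cycle is dropped, both refcounts stay at 1, yet a trace from the roots never reaches either object.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object: one outgoing reference and a refcount. */
typedef struct Obj {
    int refCount;
    int marked;
    struct Obj *ref;
} Obj;

void AddRef(Obj *o)  { if (o) ++o->refCount; }
void Release(Obj *o) { if (o) --o->refCount; }  /* no freeing in this toy */

/* Backup trace: mark everything reachable from the root chain. */
void Trace(Obj *root)
{
    while (root != NULL && !root->marked) {
        root->marked = 1;
        root = root->ref;
    }
}

/* An object is cyclic garbage if refcounting thinks it is alive
 * (refCount > 0) but the trace never reached it. */
int IsCyclicGarbage(const Obj *o)
{
    return o->refCount > 0 && !o->marked;
}
```

The backup collector frees exactly the objects for which IsCyclicGarbage holds.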

Reducing the number of reference count updates

Reference counting makes pointer manipulation very costly because every reference/de-reference results in bookkeeping. One common approach to work around this is to figure out which reference count updates can safely be deferred and handled later.

For a lot of systems most references are held in function-local variables, which take references as the function is called and release them as the function returns. Even inside functions, iterators walk over data structures like lists. Refcounting all of these can be very costly. These references needn’t be explicitly counted right away because they are on the stack and can be accessed later.

However, for global variables and even class member variable assignments the reference count is updated. When, on de-referencing an object, the refcount goes to 0, it cannot be reclaimed immediately because there can be uncounted stack-variable references to it. To handle this, the object is placed on a special list containing all objects whose refcount has hit 0 (e.g. a ZeroRefCountList).

With this in place execution proceeds as usual, and every so often a special routine is fired which walks the call stacks of all active threads to locate the stack variables and then update object refcounts as follows

void UpdateRefCount()
    // Stack object references are not refcounted and so bump up their
    // refcount for all their occurrences
    foreach (object obj in stack)
        IncrementRefCount(obj);

    // At this point the ZeroRefCountList objects are updated to
    // their latest ref count values. So 0 means really 0 refcount
    // and hence free them recursively
    foreach (object obj in ZeroRefCountList)
        if(refCount(obj) == 0)
            foreach(Object child in obj.GetChildren())
                DecrementRefCount(child);
            Free(obj);

    // reset the refcount to get it back to the state
    // where it was before calling UpdateRefCount
    foreach (object obj in stack)
        DecrementRefCount(obj);

The above routine is a condensed version of mark-and-sweep, with the differences that it needn’t walk the entire heap and is far less performance intensive.

Reduce space overhead

In the worst case every pointer in the system can point to the same object, so the refCount field is typically pointer-sized so that it can hold the maximum possible count. In reality this is seldom the case, and being so conservative results in a lot of per-object space overhead. On a 32-bit system it means storing a 32-bit count for every object; for a small object (say an 8-byte one) that can mean up to 50% overhead.

To handle this, a conservative estimate is made of the maximum number of references an object can have. Let’s assume it to be 255. This needs a refcount field of only 8 bits (1 byte), which is stored in the object header. Refcounting is then done as follows

void RefCountIncrement(Object obj)
    if(obj.refCount < 255)
        ++obj.refCount;

void RefCountDecrement(Object obj)
    if(obj.refCount < 255)
        --obj.refCount;

The above algorithm maintains correct refcounts for all objects whose refcount never reaches 255. The moment it does, the refcount sticks at 255 and is never incremented or decremented again. Typically such objects are fairly small in number, but when they do occur they are not reclaimed by refcounting (they stick to the system). Once they accumulate in significant numbers (which takes a long time) a backup tracing GC is run to sweep them out of memory.
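The saturating scheme can be written as runnable C against a hypothetical one-byte object header (the names are illustrative), with the count sticking at the field’s maximum:

```c
#include <assert.h>

/* Sketch of a saturating ("sticky") 8-bit refcount. Once the count
 * hits its maximum it is never changed again, and such objects are
 * left for a backup tracing GC to reclaim. */
#define REFCOUNT_MAX 255u

typedef struct ObjHeader {
    unsigned char refCount;
} ObjHeader;

void RefCountIncrement(ObjHeader *h)
{
    if (h->refCount < REFCOUNT_MAX)
        ++h->refCount;
}

void RefCountDecrement(ObjHeader *h)
{
    /* the extra > 0 check guards against underflow in this sketch */
    if (h->refCount < REFCOUNT_MAX && h->refCount > 0)
        --h->refCount;
}
```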

Why use mark-sweep GC only as a backup?

A bunch of the solutions mentioned above use a mark-sweep GC. The question is: if it is to be used anyway, why not switch to mark-sweep completely and dump refcounting? The main reason is that refcounting acts as an optimization over mark-sweep, reducing the system pauses it introduces (remember, tracing collectors typically halt execution completely) and making it run less often.

Sunday, February 01, 2009

Back To Basics: Copying Garbage Collection

Banana leaf

This post is Part 5 in the series of posts on Garbage Collection (GC). Please see the index here.

In my previous posts I discussed the reference counting and mark-sweep garbage collectors in some detail. This post is a quick intro to the copying GC. I’ll not be going into the gory details because it is not much used in the .NET world.

The copying GC conceptually breaks the memory into two regions. Let’s call them ping and pong, as shown below.


The buffer Ping is initially used for allocation. It maintains a pointer to the end of the last allocated block; each new request simply returns that pointer and then increments it to point to the new end. At the beginning, when no memory is allocated, it points to the bottom. After a bunch of allocations this is what it looks like


Allocation is very fast as it involves only incrementing a pointer. However, after enough allocations the situation arises where current-pointer + requested-size > Ping-top. This means no more memory can be allocated from Ping, and it is at this time that the GC is run.

Garbage collection involves enumerating the roots and then copying every live object from Ping to Pong. Since objects move in memory, all references to them have to be updated as well. After the copy the buffers are switched so that subsequent allocations happen from Pong, and Ping remains clean for the next copy cycle.

For the example set up above, let’s assume that when the GC runs Objects 1 and 3 are live but 2 isn’t. After the GC runs the memory layout becomes


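The whole scheme can be sketched as runnable C under heavy simplifications: fixed-size objects, a flat root list instead of a real object graph, and illustrative names throughout. The forwarding pointer left in the old copy ensures that two roots pointing at the same object end up pointing at the same new copy.

```c
#include <assert.h>
#include <stddef.h>

#define HALF_OBJS 64   /* objects per semispace in this toy */

/* Fixed-size toy object with a forwarding pointer used during copy. */
typedef struct Obj {
    struct Obj *forward;   /* non-NULL => already copied there */
    int         value;
} Obj;

static Obj  ping[HALF_OBJS], pong[HALF_OBJS];
static Obj *fromSpace = ping, *toSpace = pong;
static size_t freeIdx = 0;

/* Bump-pointer allocation: the out-of-memory check is just a compare
 * and allocation is just incrementing the free index. */
Obj *Allocate(void)
{
    if (freeIdx >= HALF_OBJS)
        return NULL;              /* time to run the GC */
    Obj *p = &fromSpace[freeIdx++];
    p->forward = NULL;
    return p;
}

/* Copy one live object into to-space, or return the copy that is
 * already there (the forwarding pointer). */
static Obj *Forward(Obj *obj, size_t *toIdx)
{
    if (obj->forward == NULL) {
        Obj *copy = &toSpace[(*toIdx)++];
        *copy = *obj;
        copy->forward = NULL;
        obj->forward = copy;
    }
    return obj->forward;
}

/* Collect: copy every root-referenced object to the other half,
 * update the roots, then swap the halves. */
void GC(Obj **roots, int nroots)
{
    size_t toIdx = 0;
    for (int i = 0; i < nroots; ++i)
        roots[i] = Forward(roots[i], &toIdx);
    Obj *tmp = fromSpace;
    fromSpace = toSpace;
    toSpace = tmp;
    freeIdx = toIdx;
}
```

A real collector would also scan each copied object for child references and forward those too (Cheney’s algorithm does this with a second index into to-space); the flat root list here is just enough to show the copy, the forwarding, and the swap.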
Advantages and disadvantages

The primary advantage is that allocation is extremely fast (almost as fast as stack allocation) because the out-of-memory check is just a pointer comparison and the actual allocation is just incrementing the free pointer. Since the GC compacts memory it maintains good locality of reference, resulting in better cache hit ratios and hence faster execution.

The main disadvantage of a copying GC is that almost double the memory space is required. Even though it is argued that on virtual-memory systems this shouldn’t be a problem because the unused memory pages will simply be paged out to disk, the argument is not entirely true. Especially during the GC phase both memory halves are touched (due to the ping <-> pong copy), and the number of page faults generated is much larger than with other GC schemes.