Bong Geek - Abhinaba Basu: 01/01/2009

Thursday, January 29, 2009

Back To Basics: Mark and Sweep Garbage Collection

This post is Part 4 in the series of posts on Garbage Collection (GC). Please see the index here.

Recap of definition

I’ll quickly repeat couple of definitions which are used in this post

Object: This is a unit of storage on the heap. It generally means an object in the object oriented sense but is equally applicable to a procedural language (e.g. a struct/native-type in C) or functional language.
Object/Reference graph: This is the directional graph of objects in memory. A typical sample is below. The nodes are objects in memory and the edges (arrows) are references one object holds to another (e.g. a pointer or reference in C/C++). There can be circular references between two nodes (Object 3 and Object 5) or nodes referencing themselves (Object 6).
Roots: These are the set of nodes in the object graph from which the references start. These are typically references held in registers, local variable on the stack or global variables. The green nodes in the diagram above are roots.
Unreachable object: These are nodes in the graph which have no edge referencing them. The Orange node in the diagram above is an unreachable object. This is the node that the GC needs to clean/free because it is not reachable from any node and is hence garbage memory.

Mark-sweep GC

In the last post I discussed reference counting GC which continually tracks memory and frees them the moment they go out of reference. Mark and sweep GC takes a very different approach.

In this scheme memory is not freed the moment they become garbage. Actually no subsystem is used to keep track of memory as they are being used.

When the system starts running out of memory (or some other such trigger) the GC is fired. It first enumerates all the roots and then starts visiting the objects referenced by them recursively (essentially travelling the nodes in the memory graph). When it reaches an object it marks it with a special flag indicating that the object is reachable and hence not garbage. At the end of this mark phase it gets into the sweep phase. Any object in memory that is not marked by this time is garbage and the system disposes it.

The algorithm

At the top level the algorithm in C like Pseudocode looks as follows

void GC()
{
    HaltAllProcessing();
    ObjectCollection roots = GetRoots();
    for(int i = 0; i < roots.Count(); ++i)
        Mark(roots[i]);
    Sweep();
}

It has three phases, enumeration of roots, marking all object references starting from the root and finally sweeping them.

Root Enumeration

As I mentioned above the root enumeration lists all references held in registers, global or static fields, local variables on the stack, function arguments on stack, etc… The runtime system needs to provide ways for the GC to collect the list of roots. E.g. in the case of .NET the JIT maintains the information on the active roots and provides APIs to the GC to enumerate them. The system can also do some form of dynamic analysis to further optimize it. Say a function accepts a pointer argument. Now while JITing the jitter can figure out that the argument is not being used through out the method and can drop it from the list of roots the moment it is last accessed in the method.

Mark

Typically every object is given some sort of header as I mentioned in my last post. This header contains a bit flag which is used to “mark” that object. The Mark procedure is recursive and acts as follows

void Mark(Object* pObj)
{
    if (!Marked(pObj)) // Marked returns the marked flag from object header
    {
        MarkBit(pObj); // Marks the flag in obj header

        // Get list of references that the current object has
        // and recursively mark them as well


        ObjectCollection children = pObj->GetChildren();
        for(int i = 0; i < children.Count(); ++i)
        {

            Mark(children[i]); // recursively call mark
        }	
    }
}

This is a recursive algorithm with heavy dose of simplification used and we will re-visit this in later posts on with actual implementation implications.

The recursion termination criteria is

Either a node doesn’t have any children (leaf node)
If it is already marked which ensures that we do not land in infinite recursion for circular references

At the end of this marking phase every object that can be reached using a reference from the roots would have been visited and it’s marked bit set.

Sweep

Sweep starts by iterating through the entire memory and frees memory blocks that are not marked. It also clears the Mark bit so that subsequent GC passes can correctly mark/unmark them (resets the memory to pre-mark state).

void Sweep()
{
    Object *pHeap = pHeapStart;
    while(pHeap < pHeapEnd)
    {
        if (!Marked(pHeap))
            Free(pHeap); // put it to the free object list
        else
            UnMarkBit(pHeap);


        pHeap = GetNext(pHeap);
    }
}

There are a bunch of assumptions here which has to be met. Some of them are that the

Entire heap can be iterated over
Both free memory blocks and allocated objects have the same sort of header that can be queried with Marked
You can find the next memory block when given one block (GetNext implemented)
There is some sort of free memory pool which tracks the list of free memory and you can add a block to it by calling Free(Object *)

Special memory layout is needed to handle this assumptions and I’ll try to cover how they are handled atleast in .NETCF in later post.

Compaction

Repeated allocation can actually fragment memory and slow allocation speed significantly (head over here for a nice story on this). So many systems actually combine sweep with a compaction. This ensures fragmentation reduces and hence subsequent allocations are faster. However, when memory is compacted objects move and hence all references to them has to be updated to point to the new location.

When is the GC run

At the face of it, it seems that GC should be run when memory allocation fails. However, when memory allocation fails there is a high chance that GC will fail as well because it would need some working memory to function which won’t be available. So in real world systems GC is fired under a variety of conditions like

System is low on memory (enough for a GC to run but soon allocations are going to fail)
Memory it too much fragmented
After allocating a bunch of memory (e.g. .NET CF fires GC on allocating 1MB after the last GC)

This obviously varies from system to system and is carefully tuned to match the scenarios being supported.

Advantage and disadvantages

The primary advantage of mark-sweep is that it handles cyclic references naturally. Moreover, no additional overheads are added while normal execution (e.g. allocation, pointer manipulations). Combined with compaction it ensures good locality of reference and reduced fragmentation and hence optimal subsequent allocations.

However, mark-sweep pauses useful execution and walks entire memory marking and sweeping it. Hence it adds large freezes which is un-acceptable in most interactive systems. However, incremental variations of Mark-sweep are available which works around this limitation. Mark-sweep also starts thrashing when memory fills up as it fails to clean up memory and it gets triggered again and again as memory approaches exhaustion. Self tuning GC helps in this scenario.

Monday, January 26, 2009

Back To Basics: Reference Counting Garbage Collection

This is Part 3 in a series of post on GC, visit the list here.

Reference counting (refcounting) GC is one of the two primary GC mechanisms widely used. The basic workings of this GC is pretty simple and based on counting the number of reference to a memory block (or object) from other blocks. Each time memory is created or a reference to the object is assigned it’s refcount is bumped up. Each time a pointer is assigned away from it the refcount is reduced. When refcount of the block goes to 0, it means that no pointer references it and hence it is un-reachable and is garbage and can be reclaimed.

Let’s assume RefCount(VOID *pMem) gives us the reference count of the object at pMem. Then the following code shows how the reference count of a specific object in memory varies with the code flow.

Object * obj1 = new Object(); // RefCount(obj1) starts at 1
Object * obj2 = obj1;         // RefCount(obj1) incremented to 2 as new reference is added
Object * obj3 = new Object(); 

obj2->SomeMethod();
obj2 = NULL;                  // RefCount(obj1) decremented to 1 as ref goes away
obj1 = obj3;                  // RefCount(obj1) decremented to 0 and can be collected

As the execution flow mutates the state of the program, the refcount of each object is updated transparently and the block is freed when refcount hits 0. As you might guess this record keeping and collection can be implemented in various ways. Here I discuss one of them…

Storage structure and allocation

For each object the refcount has to be stored in some place. Since this field will be accessed very often it’s best to keep it inside or close to the object to preserve spatial locality of reference which has a lot of impact on performance (by ensuring good cache hit ratio). One of the approach we can take (which is most common) is to add a small header to the start of every object allocated. So each time allocation is asked for a object we allocate sizeof(header) + sizeof(type to be allocate). The memory layout looks like

As apparent from the diagram the pointer returned is at the start of the actual data and not the start of the data allocated. This means that the caller isn’t even aware of the header and can use the data as is. However the garbage collector knows that every object has this header and can use any object pointer to access this header.

The header structure we choose is

struct MemHeader
{
    UINT32 refCount;
};

The allocation routine (in pseudo C) looks like

// cb is the number of bytes to be allocated
PVOID GC_Alloc(size_t cb)
{
    // allocate MemHeader + cb but cast it to MemHeader
    MemHeader* pHdr = (MemHeader*)PlatformAlloc(MEMHEADERSIZE + cb);
    if (pHdr == NULL)
        return NULL;

    // set the initial refCount
    pHdr->refCount = 1;

    // increment the pointer by the size of MemHeader 
    // and make it point to the start of the actual data
    ++pHdr;

    return (PVOID)pHdr;
}

When the GC wants to access the header it can simply use

inline MemHeader * GetHeader(PVOID pMem)
{
    return ((MemHeader*)pMem) - 1;
}

Counting references

For every pointer updation the affected objects’ reference count has to be updated. E.g. for a pointer assignment of sort l = r the following needs to be done

void GC_UpdateRef(PVOID *l, PVOID r)
{
    // used for assignment of l = r
    
    // increment ref count for r-value as an additional reference is being taken
    if (r != NULL)
        ++(GetHeader(r)->refCount);

    // decrease ref count of l-value as it’s releasing it’s previous reference
    GC_ReleaseRef(*l);

    // do the final assignment
    *l = r;
}

However, this function has to be called for all pointer assignments and even when pointers go out of scope. This is the tricky part and is achieved by various ways

Compilers can insert call to GC_UpdateRef for all pointer assignment and also when references go out of scope (like it inserts call to destructors/finalizers)
For languages that support operator overriding this behavior can be added from outside. However, that might need some changes to the type system like making all garbage collectable types inherit from some common type (which then has the overriding added).

Freeing objects

The final piece is actually freeing objects when their reference count goes down to 0.

VOID GC_ReleaseRef(PVOID pMem)
{
    if (pMem == NULL)
        return;

    // get the memory header and decrement the refcount
    MemHeader *pHdr = GetHeader(pMem);
    --(pHdr->refCount);
    
    // if ref count has gone down to zero, free it
    if (pHdr->refCount == 0)
    {
        PlatformFree(pHdr);
    }
}

Even though this seems straightforward it is actually lacking a crucial block. Consider the following reference graph

Here if object 1 goes out of reference and gets reclaimed by GC_ReleaseRef then recursively we need to check (and free if required) all other references held by it. So in this particular case object 4 shouldn’t be freed (as object 2 has a ref to it) but both Object 3 and Object 5 needs to be freed. So the full method would be something like

VOID GC_ReleaseRef(PVOID pMem)
{
    if (pMem == NULL) return;
    MemHeader *pHdr = GetHeader(pMem);
    --(pHdr->refCount);
    if (pHdr->refCount == 0)
    {
        foreach(PVOID pChild in Get_Child(pHdr)) 
            GC_ReleaseRef(pChild);
        PlatformFree(pHdr);
    }
}

The additional part are the lines in red. The assumption is that Get_Child returns all other references held by the object and we recursively free them. However, actually implementing Get_Child is hard. One of the mechanism used it to maintain compiler/type-system generated reference mask which I will cover that as a part of mark-sweep garbage collection.

Advantage/disadvantage of reference counting GC

Reference counting GC has a primary disadvantage. Which is used on it’s own it completely fails to handle circular references. Consider the following reference graph

Here Object2 refers to Object3 and vice-versa. Now if Object 1 releases the reference marked in red then both Object2 and Object3 becomes un-reachable and garbage. However, since they refer to each other both will have a refCount of 1 and will not get freed. To handle this situation ref counting GC is never used on it’s own. It is used in conjunction with some other GC like mark-sweep which is run periodically to clean these circular reference leaks.

Refcounting GC adds significant code/perf bloat as for every assignment calls to update refCount has to be added. Moreover, for multi-threaded system it can prove to be a major issue as locks have to be taken when updating refCount. Even the most performant atomic operations (e.g. Win32 InterlockedIncrement) are costly when used repeatedly. The header space cost which seems petty also eats into the system because for small objects they cause significant overhead. E.g. for an allocated int32 it adds 100% overhead (as the header is also 32 bit value).

Advantage besides being easy to implement is that objects gets re-claimed the moment they become garbage. This means if objects support destructors/finalizers then these are run pretty soon and system resources held by them return to the OS faster.

Reference counting GC ensures that the collection is distributed over the whole period of execution and hence for interactive system doesn’t result in system freezes like say mark-sweep GC does.

Sunday, January 25, 2009

Beach by the side of Suratkal Engineering college

After becoming the .NET Compact Framework (.NETCF) dynamic memory management module owner I am continually learning a lot about the subject. Based on hallway discussion I figured out that a lot of developers are not very clear about the subject and would like to learn more. Most online material I encountered gets into too much of details and implementation for most casual readers.

So here is my attempt to write up a series on GC. I plan to cover the basic stuff and then move into details of .NET CF GC including performance, profiling. I plan to cover bits of desktop .NET GC as well (but for that Maoni’s blog is a better resource) The first two in this series is already published. I will keep this post updated to act as an index into the series.

In case you have some suggestion send it my way and I will try to include that as well…

Thursday, January 22, 2009

Windows 7 rocks

Ok I know this blog is not about Windows but I thought I’d share the love.

To me it looks like as is someone has combed through the entire user experience and fixed most of the stuff that bothered me.

The taskbar change alone are good enough for the upgrade. E.g. the taskbar feels very natural when set in vertical position (it didn’t work well in this configuration before). The icons in tray and the status bar can be dragged around and it actually shows download progress in the IE taskbar icon itself (See the green progress in the screenshot below). I can even close an app by hovering over it’s icon on the taskbar (no need to restore it and then travel across the entire screen and hit close button). The taskbar is very accessible through keyboard, hit Win+T and taskbar is in focus and then shift through the icons using the arrow keys.

The greatest shocker was when I connected my laptop to my TV and it instantly popped up the following, it then correctly picked up my TV’s resolution (1080p) and matched screen resolution to that. Almost felt magical…

I could go on and on (like the awesome window management features) but you can head over to Tim’s blog and read more details there.

Monday, January 19, 2009

Back to basics: Why use garbage collection

This is a part of a series of post on GC. See the index here

Garbage Collection does exactly what it’s more fancier name “Automatic dynamic memory management” suggests. As I discussed in my last post on Memory Allocation dynamic memory is hard to manage and GC attempts to do that automatically and relieves the coder from the hard task.

GC basically attempts to take care of two basic scenarios remove garbage and avoid dangling pointers. They are very inter-related but are different scenarios

Garbage Collect

Consider a typical object reference graph. In the graph every rectangle is an object and the arrows denote object references. So A is an object which references B. For a program to be able to use an object it should be in one of these reference chains. The reference chain start from what is called Root and are typically references held in registers, on stack as local variable or global variables.

Let’s assume that due to some operation, A relinquishes the reference to B and the graph becomes something like this…

Now B and hence C is not reachable from any valid root in the program and hence have become Garbage (un-reachable). The programmer must ensure to follow all reference and free (de-allocate them). One of the duty of a GC system is to automate this process by tracking down (using various algorithms) such objects and reclaim the memory used by them automatically. So in a GC system when the reference is broken it will figure out that B and hence C is not reachable and will de-allocate them.

Hanging/dangling reference

Let’s consider another object graph which is similar to the one above but in addition to A, another object A’ also has a reference to B (or in other words B is shared between them)

Even here after some operation object A doesn’t need reference to B. The programmer does what he thinks is right and de-allocates B and C

However, A’ still has a references to B and hence that reference is now hanging or in more specific term pointing to invalid memory and would typically return unpredictable result when accessed. The key here is the unpredictable behavior. It is not necessary that program will crash. Unless the memory location in B is re-used it will seem to have valid data and de-references from A’ will work fine. So the failure will come up in un-expected ways and totally un-related places in the program and will make locating the root cause extremely hard.

GC helps by automatically taking care of both of the above scenarios and ensuring that the system doesn’t land up in either of them.

How important is GC

If GC is so cool then why doesn’t all systems use GC. There are multiple reasons but with newer technology most of them are going away. One of the primary reasons of not using GC is the performance overhead. This makes GC a less lucrative deal for real-time systems, device drivers and even gaming platforms. However, there are examples where GC is used even for these systems, e.g. .NET Compact Framework (the team I work for) is used with full GC support very successfully in XNA games on Xbox.

However, on some systems like functional languages which relies a lot on closures and deferred execution of those, it makes execution flow very un-predictable and hence GC becomes almost mandatory.

Thursday, January 15, 2009

Back To Basics: Memory allocation, a walk down the history

This is a part of a series of post on GC. See the index here

Most languages we deal with today support a whole bunch of different memory allocation options. We have an option to use static allocation, stack-allocation and heap allocation. However, same was not true in the pre-historic era, life was simple then and the earlier languages just supported static allocation, then slowly stack and then heap allocation came by and today we have come to expect automatic memory management from most new languages.

However, that didn’t really mean that there was no advantage is the simplistic approach taken by these early languages.

Static allocation

This is the simplest allocation option and was supported by initial languages like Fortran (early 50’s). All memory was allocated (reserved) during compile time and variable names were bound to the target memory and frozen. This had some serious limitation like variable memory couldn’t be supported. In case memory usage might vary the developer simply allocated the largest size he anticipated and used from it.

This also meant that size of all Data Structures were known at runtime (e.g. fixed length arrays, no link-lists). Since name binding happened statically, simple things we have come to expect over time like recursion was not possible because the second instantiation of a method would mean that same variables are accessed and since names are bound to addresses statically they would over-write data.

Even though this seems to be pretty restrictive this had some advantages. The most obvious is robustness and predictable behavior. Since there are no runtime memory allocation, there is no possibility of running out of memory (so no OutOfMemoryException :) ). The execution was also very fast because there is no memory to manage and no stack/heap to tear down. Moreover the compiler could emit direct memory accesses instruction without going through any indirection (as symbol->address mapping doesn’t change).

Stack Allocation

The first languages to bring in stack allocation was Algol in late 50’s. This worked past some limitations of static allocation by supporting memory frames that got pushed onto the system stack as procedures were called. What this meant was that based on the input parameters of a method it could push a different sized frame and hence had variable memory allocation (atleast for method local variables). However, return value had to be of fixed size (fixed by compile time) because the caller would need to know how much was left behind by callee on the stack.

Recursion was possible because every invocation would have it’s exclusive frame, and hence when a method returned the frame would unwind (pop) and memory allocated by it would be no longer available to the callee.

Obviously with this came reduced robustness because based on control flow the program could get out of stack and the additional stack management overhead reduced speed (though not significantly).

Even though it wasn’t much but it still was a huge improvement…

Heap Allocation

Funny to call it so but this is the state of the art in memory allocation :) and has remained so for a long time (with some improvements).

With heap data could be allocated in chunks of variable size and later de-allocated. There is no ordering requirement in-between the allocation and de-allocation unlike the stack frames where callee memory gets deallocated before the caller. So now it was possible to allocate and pass variable sized memory around (e.g. from Callee to Caller). It was easier to manage memory better, e.g. by using growable non-contiguous lists over arrays, model data closer to their real life existence like a tree.

With the heap allocation life seemed simpler at first but actually became harder. Robustness nose-dived. Memory allocated dynamically had to be managed carefully because if allocated memory is not de-allocated after it’s use is over, it becomes garbage and un-available (called memory leak) and slowly runs out of space. At the end the dreaded out of memory scenario comes up where there is no more memory left to be allocated to the requestor and the program fails to continue. Heap allocation is also a costly operation and can have severe perf implications.

Even with all the tooling support it still remains hard to locate memory leaks and constitutes one of the harder classes of bugs to be tackled.

Garbage Collection

LISP in 1959 started a new technique called Automatic Memory Management known by the more popular term Garbage Collection (GC). This tries to address the memory leak issue by needing explicit memory allocation but removing the need to de-allocate memory. Instead it provides language facilities that actually hunt down the un-used memory and automatically free them without user-code intervention.

As you might guess that even though GC increases robustness and reduces coding overhead significantly it also has performance implications. This makes using GC in hard in a whole bunch of scenarios. E.g. you typically do not expect to see a GC being used for a real-time system or a system driver (but there are exceptions to this as well); but you do expect to see it in a web-development platform that manages huge amount of memory and can sometimes live with a bit of response delay when the GC is actually running.

Putting it all together

Most modern programming languages support all three forms of storage. E.g. C++ or C# supports all 3.

Even though all of them might be available at the same time, the choice of usage is key to ensure robustness and meet performance goals of a program. Unfortunately a lot of developers take a casual approach to it and later land up into trouble when the memory foot-print become larger and harder to handle.

Consider the following pseudo-code

void foo()
{
    MyType m1;                                   // Option-1 stack allocation
    MyType* m2 = AllocateOnHeap(sizeof(MyType)); // Option-2 heap allocation
    ...
    if (SomeCond)
       foo();     // recursive call    
    ...
    DeAllocate(m);
}

Here foo is called in recursion. If MyType is of significant size and we use option-1 of using stack allocation then the depth of recursion supported by the method will be significantly reduced because each function call will need additional stack space to accommodate MyType. However, if we use heap (Option-2) then the stack needed will be significantly lower (increase in depth of recursion) but since heap allocation is much slower the speed will suffer. Based on the size of MyType, the need of the program the allocation would have to be carefully chosen.

Based on target domain different languages have chosen

Bong Geek - Abhinaba Basu

Links

Search