Search

Showing posts with label Garbage Collection. Show all posts
Showing posts with label Garbage Collection. Show all posts

Tuesday, October 16, 2012

Test Driven Development of a Generational Garbage Collection

Crater Lake, OR panorama

These days everyone is talking about being agile and test driven development (TDD). I wanted to share a success story of TDD that we employed for developing Generational Garbage Collector (GC) for Windows Phone Mango.

The .NET runtime on Windows Phone 7 shipped with a mark-sweep-compact; stop the world global non-generational GC. Once a GC was triggered, it stopped all managed execution and scanned the entire managed heap to look up all managed references and cleaned up objects that were not in use. Due to performance bottleneck we decided to enhance the GC by adding a generational GC (referred to as GenGC). However, post the General Availability or GA of WP7 we had a very short coding window. Replacing such a fundamental piece of the runtime in that short window was very risky. So we decided to build various kinds of stress infrastructure first, and then develop the GC. So essentially

  1. Write the tests
  2. See those tests failing
  3. Write code for the generational GC
  4. Get tests to pass
  5. Use the tests for regression tracking as we refactor the code and make it run faster

Now building tests for a GC is not equivalent of traditional testing of features or APIs where you write tests to call into mocked up API, see it fail until you add the right functionality. Rather these tests where verifications modes and combination of runtime stresses that we wrote.

To appreciate the testing steps we took do read the Back To Basics: Generational Garbage Collection and WP7 Mango: Mark-Sweep collection and how does a Generational GC help posts

Essentially in a generational GC run all of the following references should be discovered by the GC

  1. Gen0 Objects reachable via roots (root –> object OR recursively root –> object –> object )
  2. Objects accessible from runtime native code (e.g. pinned for PInvoke, COM interop, internal runtime references)
  3. Objects referenced via Gen1 –> Gen0 pointers

The first two were anyway heavily covered by our traditional GC tests. #3 being the new area being added.

To implement a correct generational GC we needed to ensure that at all places in the runtime where managed object references are updated they need to get reflected in the CardTable (#3 above). This is a daunting task and prone to bugs via omission as we need to ensure that

  1. All forms of assignments in MSIL that we JIT there are calls to update the CardTable.
  2. All places in the native runtime code where such references are directly and or indirectly updated the same thing is ensured. This includes all JIT worker-routines, COM, Marshallers.

If a single instance is missed then it would result in valid/reachable Gen0 objects being collected (or deleted) and hence in the longer run result in memory corruption, crashes that will be hard if not impossible to debug. This was assessed to be the biggest risk to shipping generational GC.

The other problem is that these potential omissions can be only exposed by certain ordering of allocation and collection. E.g. only a missing tracked reference of A –> B can result in a GC issue only if a GC happened in between allocations of A and B (A is in higher generation than B). Also due to performance reasons (write atomicity for lock-less updates) for every assignment of A = B we do not update the card-table bit that covers the memory area of A. Rather we update the whole byte in the card-table. This means an update to A will cover other objects allocated adjacent to A. Hence if an update to an object just beside A in the memory is missed it will not be discovered until some other run where that object lands up being allocated farther away from A.

GC Verification mode

Our solution to all of these problems was to first create the GC verification mode. What this mode does is runs the traditional full mark-sweep GC. While running that GC it goes through all objects in the memory and as it traverses them for every reference A (Gen1) –> B(Gen0), it verifies that the card table bit for A is indeed set. This ensures that if a GenGC was to run, it would not miss that references

Granular CardTable

We used very high granular card-table resolution for test runs. For these special runs each bit of the card-table corresponded to almost one object (1 bit to 2 byte resolution). Even though the card-table size exploded it was fine because this wasn’t a shipping configuration. This spaced out objects covered by the card-table and exposed adjacent objects not being updated.

GC Stress

In addition we ran the GC stress mode, where we made the GC run extremely frequently (we could push it up to a GC in every allocation). The allocator was also updated to ensure that allocations were randomized so that objects moved around everywhere in the memory.

Hole Finder

Hole finder moves all objects around in memory after a GC. This exposes stale pointer issues. If an object didn’t get updated properly due to the GC it would now point to invalid memory because all previous memory locations are now invalid memory. So a subsequent write will fail-fast with AV and we can easily detect that point of failure.

With all of these changes we ran the entire test suites. Also by throttling down the GC Stress mode we could still use the runtime to run real apps on the phone. Let me tell you playing NFS on a device with the verification mode, wasn’t fun :)

With this precaution we ensured that not a single GenGC bug has come in from the phone. It shipped rock solid and we were more confident with code churn because regressions would always be caught. I actually never blogged about this because I felt that if I do, it’ll jinx something :)

WP7: CLR Managed Object overhead

Winter day

A reader contacted me over this blog to inquire about the overhead of managed objects on the Windows Phone. which means that when you use an object of size X the runtime actually uses (X + dx) where dx is the overhead of CLR’s book keeping. dx is generally fixed and hence if x is small and there are a lot of those objects it starts showing up as significant %.

There seems to be a great deal of information about the desktop CLR in this area. I hope this post will fill the gap for NETCF (the CLR for the Windows Phone 7)

All the details below are implementation detail and is provided just as guidance. Since these are not mandated by any specification (e.g. ECMA) they may and most probably will change.

General layout

The overhead varies on the type of objects. However, in general the object layout looks something like

imageSo internally just before the actual object there is a small object-header. If the object is a finalizable object (object with a finalizer) then there’s another 4-byte pointer at the end of the object. 
The exact size of the header and what's inside that header depends on the type of the object.

Small objects (size < 16KB)

All objects smaller than 16KB has a 8 byte object header.image

The 32-bits in the flag are used to store the following information

  1. Locking
  2. GC information. E.g.
    1. Bits to mark objects
    2. Generation of the object
    3. Whether the object is pinned
  3. Hashcode (for GetHashCode support)
  4. Size of the object

Since all objects are 4 byte aligned, the size is stored as an unit of 4 bytes and uses 11 bits. Which means that the maximum size that can be stored in that header is 16KB.

The second flag contains a 32-bit pointer to the Class-Descriptor or the type information of that object. This field is overloaded and is also used by the GC during compaction phase.

Since the header cost is fixed, the smaller the object the larger the overhead becomes in terms of % overhead.

Large Objects

As we saw in the previous section that the standard object header supports describing 16KB objects. However, we do need to support larger objects in the runtime.

This is done using a special large-object header which is 20 bytes.image

The overhead of large object comes to be at max of around 0.1%. A dedicated 32-bit size field allows the runtime to support very large objects (close to 2GB).

It’s weird to see that the normal object header is repeated twice. The reason is that very large objects are rare on the phone. Hence the runtime tries to optimize it’s various operations so that the large object doesn’t bring in additional code paths and checks. Given an object reference having the normal header just before it helps us in speeding various operations. At the same time having the header as the first data-structure of all objects helps us to be faster in other cases. Bottom line is that the additional 8 bytes or 0.04% extra space usage helps us in enough performance gains at other places so that we pay that extra overhead cost.

Pool overhead

NETCF uses 64KB object pools and then sub-allocates objects from it. So in addition to the per-object overhead it uses an additional 44-byte of object-pool-header per 64kb. This means another 0.06% overhead per 64KB. For every object, the contribution of this can be ignored.

Arrays

Arrays pan out slightly differently.

image

In case of value type array every member doesn’t really need to have the type information and hence the object header associated with it. The Array header is a standard object header of 8 bytes or big object header of 20bytes (based on the size of the array). Then there is a 32-bit count of the number of elements in the array. For single dimension array there is no more overhead. However, for higher dimension arrays the length in each dimension is also stored to support runtime bounds check.

The reference type arrays are similar but instead of the array contents being inlined into the array the elements are references to the objects on the managed heap and each of those objects have their standard headers.

Putting it all together

Standard object 8 bytes ( + 4 bytes for finalizable object)
Large objects (>16KB) 20 bytes ( + 4 bytes for finalizable object)
Arrays object overhead + 4 bytes array header + (nDimension x 4 bytes)

Tuesday, June 14, 2011

WP7 Mango: The new Generational GC

In my previous post “Mark-Sweep collection and how does a Generational GC help” I discussed how a generational Garbage Collector (GC) works and how it helps in reducing collection latencies which show up as long load times (startup as well as other load situations like game level load) and gameplay or animation jitter/glitches. In this post I want to discuss how those general principles apply to the WP7 Generational GC (GenGC) specifically.

Generations and Collection Types

We use 2 generations on the WP7 referred to as Gen0 and Gen1. A collection could be any of the following 4 types

  1. An ephemeral or Gen0 collection that runs frequently and only collects Gen0 objects. Object surviving the Gen0 collection is promoted to Gen1
  2. Full mark-sweep collection that collects all managed objects (both Gen1 and Gen0)
  3. Full mark-sweep-compact collection that collects all managed objects (both Gen1 and Gen0)
  4. Full-GC with code-pitch. This is run under severe low memory and can even throw away JITed code (something that desktop CLR doesn’t support)

The list above is in the order of increasing latency (or time they take to run)

Collection triggers

GC triggers are the same and as outlined in my previous post WP7: When does the GC run. The distinction between #2 and #3 above is that at the end of all full-GC the collector considers the memory fragmentation and can potentially run the memory compactor as well.

  1. After significant allocation
    After significant amount of managed allocation the GC is started. The amount today is 1MB (called GC quanta) but is open to change. This GC can be ephemeral or full-GC. In general it’s an ephemeral collection. However, it might be a full collection under the following cases
    1. After significant promotion of objects from Gen0 to Gen1 the collections become full collections. Today 5MB of promotion triggers a full GC (again this number is subject to change).
    2. Application’s total memory usage is close to the maximum memory cap that apps have (very little free memory left). This indicates that the application will get terminated if the memory utilization is not cut-back.
    3. Piling up of native resources. We use different heuristics like native to managed memory ratio and finalizer queue heuristics to detect if GC needs to turn to full collection to release native resources being held-up due to Gen0 only collections
  2. Resource allocation failure
    All resource allocation failure means that the system is under memory pressure and hence such collections are always full collection. This can lead to code pitch as well
  3. User code triggered GC
    User code can start collections via the System.GC.Collect() managed API. This results in a full collection as documented by that API. We have not added the method overload System.GC.Collect(generation). Hence there is no way for the developer to start a ephemeral or Gen0 only collection
  4. Sharing server initiated
    Sharing server can detect phone wide memory issue and start GC in all managed processes running. These are full-GC and can potentially pitch code as well.

 

So from all of the above, the 3 key takeaways are

  1. Low memory or memory cap related collections are always full-collections. These could also turn out to be the more costly compacting collection and/or pitch JITed code
  2. Collections are in general ephemeral and become full-collection after significant object promotion
  3. No fundamental changes to the GC trigger policies. So an app written for WP7 will not see any major changes to the number of GC’s that happen. Some GC will be ephemeral and others will be full-GCs.

 

Write Barriers/Card-table

As explained in my previous post, to keep track of Gen1 to Gen0 reference we use write-barrier/card-table.

Card-table can be visualized as a memory bitmap. Each bit in the card-table covers n bytes of the net address space. Each such bit is called a Card. For managed reference updates like  A.b = C in addition to JITing the real assignment, calls are added to Write-barrier functions. This  write barrier locates the Card corresponding to the address of write and sets it. Later during collection the collector checks all Gen-1 objects covered by a set card-bit and marks Gen-0 references in those objects.

This essentially brings in two additional cost to the system.

  1. Memory cost of adding those calls to the WB in the JITed code
  2. Cost of executing the write barrier while modifying reference

Both of the above are optimized to ensure they have minimum execution impact. We only JIT calls to WB when absolutely required and even then we have an overhead of a single instruction to make the call. The WB are hand-tuned assembly code to ensure they take minimum cycles. In effect the net hit on process memory due to write barriers is way less than 0.1%. The execution hit in real-world applications scenarios is also not in general measureable (other than real targeted testing).

Differences from desktop

In principle both the desktop GC and the WP7 GC are similar in that they use mark-sweep generational GC. However, there are differences based on the fact that the WP7 GC targets a more constrained device.

  1. 2 generations as opposed to 3 on the desktop
  2. No background or incremental collection supported on the phone
  3. WP7 GC has additional logic to track and handle application policies like application memory caps and total memory utilization
  4. The phone CLR uses a very different memory layout which is pooled and not linear. So no concept of Large Object Heap. So lifetime of large objects is no different
  5. No support for particular generation collection from user code

Wednesday, June 08, 2011

WP7 Mango: Mark-Sweep collection and how does a Generational GC help

About a month back we announced that in the next release of Windows Phone 7 (codenamed Mango) we will ship a new garbage collector in the CLR. This garbage collector (GC) is a generational garbage collector.

This post is a “back to basics” post where I’ll try to examine how a mark-sweep GC works and how adding generational collection helps in boosting it’s performance. We will take a simplified look at how mark-sweep-compact GC works and how generational GC can enhance it’s performance. In later posts I’ll try to elaborate on the specifics of the WP7 generational GC and how to ensure you get the best performance out of it.

Object Reference Graph

Objects in the memory can be considered to be a graph. Where every object is a node and the references from one object to another are edges. Something like

image

To use an object the code should be able to “reach” it via some reference. These are called reachable objects (in blue). Objects like a method’s local variable, method parameters, global variables, objects held onto by the runtime (e.g. GCHandles), etc. are directly reachable. They are the starting points of reference chains and are called the roots (in black).

Other objects are reachable if there are some references to them from roots or from other objects that can be reached from the roots. So Object4 is reachable due to the Object2->Object4 reference. Object5 is reachable because of Object1->Object3->Object5 reference chain. All reachable objects are valid objects and needs to be retained in the system.

On the other hand Object6 is not reachable and is hence garbage, something that the GC should remove from the system.

Mark-Sweep-Compact GC

A garbage collector can locate garbage like Object6 in various ways. Some common ways are reference-counting, copying-collection and Mark-Sweep. In this section lets take a more pictorial view of how mark-sweep works.

Consider the following object graph

1

At first the GC pauses the entire application so that the object graph is not being mutated (as in no new objects or references are being created). Then it goes into the mark phase. In mark phase the GC traverses the graph starting at the roots and following the references from object to object. Each time it reaches an object through a reference it flips a bit in the object header indicating that this object is marked or in other words reachable (and hence not garbage). At the end everything looks as follows

2

So the 2 roots and the objects A, C, D are reachable.

Next it goes into the sweep phase. In this phase it starts from the very first object and examines the header. If the header’s mark bit is set it means that it’s a reachable object and the sweep resets that bit. If the header’s bit is not set, it’s not reachable and is flagged as garbage.

3

So B and E gets flagged as garbage. Hence these areas are free to be used for other objects or can be released back to the system

4

This is where the GC is done and it may resume the execution of the application. However, if there are too many of those holes (released objects) created in the system, then the memory gets fragmented. To reduce memory fragmentation. The GC may compact the memory by moving objects around. Do note that compaction doesn’t happen for every GC, it is run based on some fragmentation heuristics.

5

Both C and D is moved here to squeeze out the hole for B. At the same time all references to these objects in the system is also update to point to the correct location.

One important thing to note here is that unlike native objects, managed objects can move around in memory due to compaction and hence taking direct references to it (a.k.a memory pointers) is not possible. In case this is ever required, e.g. a managed buffer is passed to say the microphone driver native code to copy recorded audio into, the GC has to be notified that the corresponding managed object cannot be moved during compaction. If the GC runs a compaction and object moves during that microphone PCM data copy, then memory corruption will happen because the object being copied into would’ve moved. To stop that, GCHandle has to be created to that object with GCHandleType.Pinned to notify the GC that the corresponding object should never move.

On the WP7 the interfaces to these peripherals and sensors are wrapped by managed interfaces and hence the WP7 developer doesn’t really have to do these things, they are taken care offm under the hood by those managed interfaces.

The performance issue

As mentioned before during the entire GC the execution of the application is stopped. So as long as the GC is running the application is frozen. This isn’t a problem in general because the GC runs pretty fast and infrequently. So small latencies of the order of 10-20ms is not really noticeable.

However, with WP7 the capability of the device in terms of CPU and memory drastically increased. Games and large Silverlight applications started coming up which used close to 100mb of memory. As memory increases the number of references those many objects can have also increases exponentially. In the scheme explained above the GC has to traverse each and every object and their reference to mark them and later remove them via sweep. So the GC time also increases drastically and becomes a function of the net workingset of the application. This results in very large pauses in case of large XNA games and SL applications which finally manifests as long startup times (as GC runs during startup) or glitches during the game play/animation.

Generational Approach

If we take a look at a simplified allocation pattern of a typical game (actually other apps are also similar), it looks somewhat like below

image

The game has a large steady state memory which contains much of it’s steady state data (which are not released) and then per-frame it goes on allocating/de-allocating some more data, e.g. for physics, projectiles, frame-data. To collect this memory layout the traditional GC has to walk or traverse the entire 50+ mb of data to locate the garbage in it. However, most of the data it traverses will almost never be garbage and will remain in use.

This application behavior is used for the Generational GC premise

  1. Most objects die young
  2. If an object survives collection (that is doesn’t die young) it will survive for a very long time

Using these premise the generational GC tries to segregate the managed heap into older and younger generations objects. The younger generation called Gen-0 is collected in each GC (premise 1), this is called the Ephemeral or Gen0 Collection. The older generation is called Gen-1. The GC rarely collects the Gen-1 as the probability of finding garbage in it is low (premise 2).

image

So essentially the GC becomes free of the burden of the net working set of the application.

Most GC will be ephemeral GC and it will only traverse the recently allocated objects, hence the GC latency remains very low. Post collection the surviving objects are promoted to the higher generation. Once a lot of objects are promoted, the higher generation starts becoming full and then a full collection is run (which collects both gen-1 and gen-0). However, due to premise 1, the ephemeral collection finds a lot of garbage in their runs and hence promotes very few objects. This means the growth rate of the higher generation is low and hence full-collection will run very infrequently.

Ephemeral/Gen-0 collection

Even in ephemeral collection the GC needs to deterministically find all objects in Gen-0 which are not reachable. This means the following objects needs to survive a Gen-0 collection

  1. Objects directly reachable from roots
  2. Root –> Gen0 –> Gen-0 objects (indirectly reachable from roots)
  3. Objects referenced from Gen1 to Gen0

Now #1 and #2 pose no issues as in the Ephemeral GC, we will anyway scan all roots and Gen-0 objects. However to find objects from Gen1 which are referencing objects in Gen-0, we would have to traverse and look into all Gen1 objects. This will break the very purpose of having segregating the memory into generation. To handle this write-barrier+card-table technique is used.

The runtime maintains a special table called card-table. Each time any references are taken in the managed code e.g. a.b = c; the code JITed for this assignment also updates an internal data-structure called CardTable to capture that write. Later during the ephemeral collection, the GC looks into that table to find all the ‘a’ which took new references. If that ‘a’ is a gen-1 object and ‘c’ a gen-0 object then it marks ‘c’ recursively (which means all objects reachable from ‘c’ is also marked). This technique ensures that without examining all the gen-1 objects we can still find all live objects in Gen-0. However, the cost paid is

  1. Updating object reference is a little bit more costly now
  2. Making old object taking references to new objects increases GC latency (more about these in later posts)

Putting it all together, the traditional GC would traverse all references shown in the diagram below. However, an ephemeral GC can work without traversing the huge number of Red references.

image

It scans all the Roots to Gen-0 references (green) directly. It traverses all the Gen1->Gen0 references (orange) via the CardTable data structure.

Conclusion

  1. Generational GC reduces the GC latency by avoiding looking up all objects in the system
  2. Most collections are gen-0 or ephemeral collection and are of low latency this ensures fast startup and low latency in game play
  3. However, based on how many objects are being promoted full GC’s are run sometimes. When they do, they exhibit the same latency as a full GC on the previous WP7 GC

Given the above most applications will see startup and in-execution perf boost without any modification. E.g today if an application allocates 5 MB of data during startup and GC runs after every MB of allocation then it traverses 15mb (1 + 2 + 3 + 4 + 5). However, with GenGC it might get away with traversing as low as only 5mb.

In addition, especially game developers can optimize their in-gameplay allocations such that during the entire game play there is no full collection and hence only low-latency ephemeral collections happens.

How well the generational scheme works depend on a lot of parameters and has some nuances. In the subsequent posts I will dive into the details of our implementation and some of the design choices we made and what the developers needs to do to get the most value out of it.

Thursday, July 29, 2010

Windows Phone 7 App Development: When does the GC run

Cub

Many moons ago I made a post on When does the .NET Compact Framework Garbage Collector run. Given that a lot of things have changed since then, it’s time to make another post about the same thing.

For the developers coming to Windows Phone 7 (WP7) from the Windows desktop let me first clarify that the runtime (CLR) that is running on the WP7 is not the same as the one running on the desktop. The WP7 runtime is known as .NET Compact Framework (NETCF) and it works differently than the “desktop CLR”. For 90% of cases this is irrelevant as the WP7 developer targets the XNA or Silverlight programming model and hence what is running underneath is really not important. E.g when you drive you really do not care about the engine. This post is for the other 5% where folks do run into issues (smoke coming out of the car).

Moreover do note that when the GC is run is really an implementation detail that is subject to change.

Now that we have all the disclaimers behind us lets get down to the list.

The Garbage Collector is run in the following situations

  1. After some significant allocation:
    When an application tries to allocate managed memory the allocator first checks a counter that indicates the number of bytes of managed data allocated since the last GC. If this counter crosses a threshold (which is changeable and set to 1MB currently) then GC is fired (and the counter is obviously reset).
    The basic idea is that there has been significant allocation since the last GC and hence do it again.
  2. Resource allocation failure
    If some internal native allocation fails, like loadlibrary fails or JIT buffer allocation fails due to out-of-memory condition then GC is started to free up some memory and the allocation is re-attempted (only 1 re-attempt)
  3. User code can trigger GC
    Using the managed API System.GC.Collect(), user code can force a GC
  4. Sharing server initiated
    One of the new features in the WP7 CLR is the sharing server (see my post http://blogs.msdn.com/b/abhinaba/archive/2010/04/28/we-believe-in-sharing.aspx for details). In WP7 there is a central server coordinating all the managed processes. If this sharing server is notified of low memory condition, it starts GC in all the managed processes in the system.

The GC is NOT run in the following cases (I am explicitly calling these out because in various conferences and interactions I’ve heard folks thinking it might be)

  1. GC is not run on some timer. So if a process is not allocating any memory and there is no low-memory situation then GC will never be fired irrespective of how long the application is running
  2. The phone is never woken up by the CLR to run GC. GC is always in response to an active request OR allocation failure OR low memory notification.
  3. In the same lines there is no GC thread or background GC on WP7

 

For folks migrating from NETCF 3.5 the list below gives you the changes

  1. WinForm Application going to background used to fire GC. On WP7 this is no longer true
  2. Sharing server based changes are obviously new
  3. The GC quantum cannot be changed by user

Sunday, December 27, 2009

Back to Basics: Memory leaks in managed systems

Kolkata Trip 2009

Someone contacted me over my blog about his managed application where the working set goes on increasing and ultimately leads to out of memory. In the email at one point he states that he is using .NET and hence there should surely be no leaks. I have also talked with other folks in the past where they think likewise.

However, this is not really true. To get to the bottom of this first we need to understand what the the GC does. Do read up http://blogs.msdn.com/abhinaba/archive/2009/01/20/back-to-basics-why-use-garbage-collection.aspx.

In summary GC keeps track of object usage and collects/removes those that are no longer referenced/used by any other objects. It ensures that it doesn’t leave dangling pointers. You can find how it does this at http://blogs.msdn.com/abhinaba/archive/2009/01/25/back-to-basic-series-on-dynamic-memory-management.aspx

However, there is some catch to the statement above. The GC can only remove objects that are not in use. Unfortunately it’s easy to get into a situation where your code can result in objects never being completely de-referenced.

Example 1: Event Handling (discussed in more detail here).

Consider the following code

EventSink sink = new EventSink();
EventSource src = new EventSource();

src.SomeEvent += sink.EventHandler;
src.DoWork();

sink = null;
// Force collection
GC.Collect();
GC.WaitForPendingFinalizers();

In this example at the point where we are forcing a GC there is no reference to sink (explicitly via sink = null ), however, even then sink will not be collected. The reason is that sink is being used as an event handler and hence src is holding an reference to sink (so that it can callback into sink.EventHandler once the src.SomeEvent is fired) and stopping it from getting collected


Example 2: Mutating objects in collection (discussed here)


There can be even more involved cases. Around 2 years back I saw an issue where objects were being placed inside a Dictionary and later retrieved, used and discarded. Now retrieval was done using the object key. The flow was something like



  1. Create Object and put it in a Dictionary
  2. Later get object using object key
  3. Call some functions on the object
  4. Again get the object by key and remove it

Now the object was not immutable and in using the object in step 3 some fields of that object got modified and the same field was used for calculating the objects hash code (used in overloaded GetHashCode). This meant the Remove call in step 4 didn’t find the object and it remained inside the dictionary. Can you guess why changing a field of an object that is used in GetHashCode fails it from being retrieved from the dictionary? Check out http://blogs.msdn.com/abhinaba/archive/2007/10/18/mutable-objects-and-hashcode.aspx to know why this happens.


There are many more examples where this can happen.


So we can conclude that memory leaks is common in managed memory as well but it typically happens a bit differently where some references are not cleared as they should’ve been and the GC finds these objects referenced by others and does not collect them.

Thursday, December 03, 2009

NETCF: Count your bytes correctly

Pirate

I got a stress related issue reported in which the code tried allocating a 5MB array and do some processing on it but that failed with OOM (Out of Memory). It also stated that there was way more than 5 MB available on the device and surely it’s some bug in the Execution Engine.

Here’s the code.

try{
byte[] result;
long GlobalFileSize = 5242660; //5MB
result = new byte[GlobalFileSize];
string payload = Encoding.UTF8.GetString(result, 0, result.Length);
System.Console.WriteLine("len " + payload.Length);
}
catch (Exception e)
{
System.Console.WriteLine("Exception " + e);
}

The problem is that the user didn’t count his bytes well. The array is 5MB and it actually gets allocated correctly. The problem is with the memory complexity of the UTF8.GetString which allocates further memory based on it’s input. In this particular case the allocation pattern goes something like

  5MB  -- Byte Array allocation (as expected and works)

5MB -- Used by GetString call
10MB -- Used by GetString call
5MB -- Used by GetString call
10MB -- Used by GetString call


So GetString needed a whooping 30MB and the total allocation is around 35MB which was really not available.


Morale of the story: Count your bytes well, preferably through a tool like Remote Performance Monitor

Tuesday, September 01, 2009

NETCF: GC and thread blocking

Hyderabad Zoological Park

One of our customers asked us a bunch of questions on the NETCF garbage collector that qualifies as FAQ and hence I thought I’d put them up here.

Question: What is the priority of the GC thread
Answer: Unlike some variants of the desktop GC and other virtual machines (e.g. JScript) we do not have any background or timer based GC. The CLR fires GC on the same thread that tried allocation and either the allocation failed or during pre-allocation checks the CLR feels that time to run GC has come. So the priority of the GC thread is same as that of the thread that resulted in GC.

Question: Can the GC freeze managed threads that are of higher priority than the GC thread?
Answer: Yes GC can freeze managed threads which has higher priority than itself. Before actually running GC the CLR tries to go into a “safe point”. Each thread has a suspend event associated with it and this event is checked by each thread regularly. Before starting GC the CLR enumerates all managed threads and in each of them sets this event. In the next point when the thread checks and finds this event set, it blocks waiting for the event to get reset (which happens when GC is complete). This mechanism is agnostic of the relative priority of the various threads.

Question: Does it freeze all threads or only Managed threads?
Answer: The NETCF GC belongs to the category “stop-the-world GC”. It needs to freeze threads so that when it is iterating and compacting managed heap nothing on it gets modified. This is unlike some background/incremental GC supported by other virtual machines.

GC freezes only managed threads and not other native threads (e.g. COM or host-threads). However, there are two exclusions even on the managed threads

  1. It doesn’t freeze the managed thread that started GC because the GC will be run on that thread (see answer to first question)
  2. Managed threads that are currently executing in native p/invoke code is not frozen either. However, the moment they try to return to managed execution they get frozen. (I actually missed this subtle point in my response and Brian caught it :))

Hope this answers some of your questions. Feel free to send me any .NET Compact Framework CLR questions. I will either answer them myself or get someone else to do it for you :)

Sunday, May 31, 2009

How does the .NET Compact Memory allocator work

Aalankrita - Twins

As I mentioned in one of my previous posts the .NET Compact Framework uses 5 separate heaps of which one is dedicated for managed data and is called the managed heap. The .NETCF allocator, allocates pools of data from this managed heap and then sub-allocates managed objects from this pool.

This is how the process goes

image At the very beginning the allocator allocates a 64Kb pool (called the obj-pool) and ties it with the current app-domain state (AppState).
image All object allocation requests are served from this pool (Pool 0). After each allocation an internal pointer (shown in yellow) is incremented to point to the end of the last allocated object.
image Subsequent allocation is just incrementing this pointer.
image This goes on until the point where there is no more space left in the current obj-pool
image Since there is no more space left in the current pool (pool 0) a new pool (pool 1) is allocated and all subsequent object allocation requests are serviced from the new pool.

The first question I get asked at this point is what happens when one object itself is bigger than 64Kb :). The answer is that the CLR considers such objects as “huge-objects” and they are placed in their own obj-pool. This essentially means that either obj-pools are 64Kb in size and has more than one object in them or are bigger than 64Kb and has only one huge-object in it.

Object allocation speed varies a lot between whether it is being allocated from the current pool (where it is just the cost of a pointer increment) or it needs a new pool to be allocated (where there is additional cost of a 64Kb allocation). On the average managed objects are 35 bytes in size and around 1800 can fit in one pool and hence on the whole allocation speed attained is pretty good.

Verifying what we discussed above

All of the things I said above can be easily verified using the Remote Performance monitor. I wrote a small application that allocates objects in a loop and launched it under Remote Performance monitor and published the data to perfmon so that I can observe how the managed heap grows. I used the process outlined in Steven Pratschner’s blog post at http://blogs.msdn.com/stevenpr/archive/2006/04/17/577636.aspx 

image

Using the above mentioned mechanism I am observing the managed bytes allocated (green line) and the GC Heap (red line). You can see that the green line increases linearly and after every 64Kb the red line or the GC Heap bumps up by 64Kb as we allocate a new object pool from the heap.

Monday, May 04, 2009

Memory leak via event handlers

Golconda fort - light and sound

One of our customers recently had some interesting out-of-memory issues which I thought I’d share.

The short version

The basic problem was that they had event sinks and sources. The event sinks were disposed correctly. However, in the dispose they missed removing the event sink from the event invoker list of the source. So in effect the disposed sinks were still reachable from the event source which is still in use. So GC did not de-allocate the event sinks and lead to leaked objects.

Now the not-so-long version

Consider the following code where there are EventSource and EventSink objects. EventSource exposes the event SomeEvent which the sink listens to.

EventSink sink = new EventSink();
EventSource src = new EventSource();

src.SomeEvent += sink.EventHandler;
src.DoWork();

sink = null;
// Force collection
GC.Collect();
GC.WaitForPendingFinalizers();

For the above code if you have a finalizer for EventSink and add a breakpoint/Console.WriteLine in it, you’d see that it is never hit even though sink is clearly set to null. The reason is the 3rd line where we added sink to the invoke list of source via the += operator. So even after sink being set to null the original object is still reachable (and hence not garbage) from src.


+= is additive so as the code executes and new sinks are added the older objects still stick around resulting in working set to grow and finally lead to crash at some point.


The fix is to ensure you remove the sink with –= as in

EventSink sink = new EventSink();
EventSource src = new EventSource();
src.SomeEvent += sink.EventHandler;
src.DoWork();
src.SomeEvent -= sink.EventHandler;
sink = null;

The above code is just sample and obviously you shouldn’t do event add removal in a scattered way. Make EventSink implement IDisposable and encapsulate the += –= in ctor/dispose.

Tuesday, April 28, 2009

Finalizers and Thread local storage

Sunset from our balcony

Writing finalizers is generally tricky. In the msdn documentation for finalizers the following limitations are mentioned.

  1. The exact time when the finalizer executes during garbage collection is undefined
  2. The finalizers of two objects are not guaranteed to run in any specific order
  3. The thread on which the finalizer is run is unspecified.
  4. Finalizers might not be run at all

#3 has interesting consequences. If a native resources is allocated by a managed ctor (or any other method) and the finalizers is used to de-allocate that then the allocation and de-allocation will not happen on the same thread. The reason being that the CLR uses one (maybe more than one) special thread to run all finalizers on. This is easily verified by the fact that if you query for the thread id inside the finalizers you will get a different id than the main thread the application is being run on.

The thread safety is generally easy to handle but might lead to some tricky and hard to locate problems. Consider the following code

class MyClass
{
public MyClass()
{
tls = 42;
Console.WriteLine("Ctor threadid = {0} TLS={1}", AppDomain.GetCurrentThreadId(), tls);
}

~MyClass()
{
Console.WriteLine("Finalizer threadid = {0} TLS={1}", AppDomain.GetCurrentThreadId(), tls);
}

public void DoWork()
{
Console.WriteLine("DoWork threadid = {0} TLS={1}", AppDomain.GetCurrentThreadId(), tls);
}

[ThreadStatic]
static int tls;
}

class Program
{
static void Main(string[] args)
{
MyClass mc = new MyClass();
mc.DoWork();
mc = null;
GC.Collect();
GC.WaitForPendingFinalizers();
}
}

Here we create a class, use it and finalize it. In all of these 3 methods we print the thread id and also use a special variable named tls.


If you see the tls is marked with the attribute ThreadStatic. The definition of ThreadStatic is “Indicates that the value of a static field is unique for each thread”.


I hope by now you have figured out the gotcha :). Under the hood the CLR uses a native OS concept called Thread Local Usage (TLS) to ensure that the value of tls is unique per thread. TLS uses special per thread data-structure to store that data. Now we have set the value in the ctor and used it in finalizer. Since they run on different threads each will get different values of the same field.


On my system the out put is as follows

Ctor threadid = 5904 TLS=42
DoWork threadid = 5904 TLS=42
Finalizer threadid = 4220 TLS=0
Press any key to continue . . .

As is evident the finalizer ran on a different thread (id is different) and the TLS value is also different from what was set.


So the moral of this story is “Be careful about thread safety of finalizers and do not use thread local storage in it

Tuesday, April 14, 2009

Multiple calls to GC.ReRegisterForFinalize

Holi Celebration at our appartment complex

What happens when there are multiple calls to GC.ReRegisterForFinalize for the same object?

Consider the following C# code

class MyClass
{
~MyClass()
{
Console.WriteLine("~MyClass");
}
}

class Program
{
static void Main(string[] args)
{
MyClass mc = new MyClass();
GC.ReRegisterForFinalize(mc); // 1
GC.ReRegisterForFinalize(mc); // 2
GC.ReRegisterForFinalize(mc); // 3

GC.Collect();
GC.WaitForPendingFinalizers();
}
}

Here for the same object we have called GC.ReRegisterForFinalize thrice.


This is one of the few cases where the behavior of .NET and .NET Compact Framework differs.


In case of .NET the MyClass finalizer is called 4 times. Since the object has a finalizer, it is anyway finalized once and for the additional 3 calls to re-register it is finalized 3 more times.


In case of .NET Compact Framework the finalizer is called only once. The reason being internally finalizable is just a bool flag. Each time GC.SuppressFinalize is called it is set to false and each time GC.ReRegisterForFinalize is called it is set to true. So multiple calls doesn’t have any effect.


Hopefully this is one piece of information which none of my readers ever find useful, but since I got a bug assigned for the difference in behavior I might as well let folks know.

Monday, March 30, 2009

NETCF: Total Bytes in Use After GC – Perf counter

Holi (The festival of colors) celebration in full swing

At least one of our customers were a bit confused with the .NET Compact Framework performance counter “Total Bytes in Use After GC” which is described as “The number of bytes of memory, native and managed, in use after the last garbage collection.”. The description lead him to believe it is the total memory in use by a given managed application. That is not true.

The memory inside a .NET Compact Framework managed application can be broadly classified into

  1. Managed memory
  2. Jitted code
  3. Other native memory allocated by the CLR
  4. Native memory allocated by other sub-systems like say a COM module or some other native modules which the managed application P/Invokes into.

The first three is managed by .NET Compact Framework by using 5 heaps. The counter mentioned above returns the sum of all bytes allocated in all of the first 3 categories. .NETCF doesn’t keep tab of memory allocated in category #4.

Sunday, March 29, 2009

How many heaps does the .NET Compact framework use

Prokriti

While discussing the memory architecture with an internal customer, he inquired about how many heaps .NETCF creates. I’m not sure how it might be helpful to users, but the answer is 5. This has been touched upon in some blogs and presentations (e.g. MEDC 2005) but I thought I’d put up a handy list.

Heap Description
JIT heap This is the heap where the buffers of jitted code is placed. Unlike the desktop CLR under some circumstances the CLR does free jitted code by a processed named code-pitching (which I will cover in a later blog post)
GC heap This is where all the managed objects resides. The garbage collector is responsible for managing all objects on this heap including compacting it when it gets fragmented.
Short-term heap This is used for short-lived objects. E.g. while doing a number conversion for the System.Convert methods if temporary space is required by the underlying CLR native code then it is allocated from this heap only to be freed soon.
App-domain This is used by the CLR to place things like assembly meta-data, application/appdomain/thread state information.
Process heap This is the default heap where rest of the stuff goes in

Thursday, March 26, 2009

Random GC fun

Laad Bazaar, Charminar

A friend came by and asked “Dude, are you free”. I replied back cautiously “Why? Are you the Garbage Collector?”. I do not even implement a finalizer and hence wouldn’t even get some extra time standing in the finalizer queue. So I’m not taking any chance…

Monday, March 23, 2009

Dangers of large object heap

View of Charminar from inside Laad Bazaar, Hyderabad

Andrew Hunter has an interesting post on the issues with large object heap on the desktop CLR. Visit http://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/

Side Note: .NET Compact Framework doesn’t support large object heap simply because it is primarily targeted towards embedded applications which has far lesser memory requirements.

Monday, March 02, 2009

Back To Basics: How does the GC find object references

Oh Kolkata!!

This post is Part 9 in the series of posts on Garbage Collection (GC). Please see the index here.

Most garbage collection algorithms needs to know all other objects referenced by a given object to correctly traverse object hierarchy. All of reference counting, mark-sweep and even coping collectors need it. Consider the mark algorithm that I used in mark-sweep GC

void Mark(Object* pObj)
{
if (!Marked(pObj)) // Marked returns the marked flag from object header
{
MarkBit(pObj); // Marks the flag in obj header
// Get list of references that the current object has
// and recursively mark them as well

ObjectCollection children = pObj->GetChildren();
for(int i = 0; i < children.Count(); ++i)
{
Mark(children[i]); // recursively call mark
}
}
}

Here a critical piece is the line in red. Given the object header we need to be able to locate all other objects referenced by it (or are children of it in object reference graph). There are two basic ways to do this


Conservative GC


In case a GC is being plugged into an existing system in the form of a library and there is no easy way to modify the runtime then a conservative (as opposed to precise) system is used. A conservative GC considers every bit pattern on the heap to be a valid reference to another heap object. For the above case given a object header the GC needs to have the following basic information



  1. The size of the object
  2. Alignment of pointers (lets assume it to be 32bit for our case)

So once the GC gets a object pointer pObj it considers all 32bit patterns in between pObj and (pObj + sizeof(Object)) to be references to other child objects. It then uses some basic elimination like whether the value points inside the current heap for the system. If it doesn’t match any elimination logic then it is considered to be a reference.


Even though this approach seems a bit extreme, it works out well and the only flip side is that it returns false positives and hence some garbage which should’ve been collected is left out (e.g. because an int containing a value accidentally matches the address of an object on the heap).


Precise GC


In precise GC the GC precisely knows every reference held in a given object. This is possible only if the runtime generates and preserves this information.


As a sample implementation I will describe how the .NET Compact Framework or .NETCF (which uses a precise garbage collector) goes doing about this.


One additional goal employed by .NETCF is that the GC should be independent of the type system (that is shouldn’t need to have any information of types of objects).


Every object in NETCF has an object header which in turn contains a pointer to it’s type descriptor or class descriptor. This class descriptor has a special 32-bit field named RefMap which is used to store reference information stored by a given type


image


So given an object pointer pObj the GC fetches the object header from before the pointer and then de-references the class descriptor pointer to get to the corresponding type’s Class descriptor. It then reads the RefMap stored into it. To do all of the above the GC needs to know only the object header and class descriptor layout and not any other type information.


All references in NETCF are 32-bit aligned and the refmap contains one bit per 32-bits of the object. If the bit is set it means that the corresponding 32-bit is a reference.


Let’s take the following class

class MyType
{
public char a;
public char b;
public AnotherType t1;
public int i;
public AnotherType t2;
}


The memory layout of the class would be something like


image


The first two chars take 16 bits each. After that there are 3, 32-bit entities of which t1 and t2 (1st and 3rd) are references to other objects but i is not.


So the GC needs to know that given this object the 32-bits from 32nd and 96th are actually references. This is exactly what is encoded in RefMap field. It contains the bit-pattern 00000000000000000000000000001010 where the LSB is for the first 32 bits of the object (which is a+b in this case) and so on.


Since bit 1 of refmap is set it means (pObj + 1 * 32) is a reference. If that value is non-NULL then the GC can just de-reference it to get to the t1 object reference held by pObj.


Obviously it means that the maximum size of object that RefMap can encode is 32*4 = 128bytes. No points for guessing how it handles objects larger than that :)

Sunday, March 01, 2009

Back To Basics: Generational Garbage Collection

Oh Kolkata!!  New Market

This post is Part 8 in the series of posts on Garbage Collection (GC). Please see the index here.

One of the primary disadvantage discussed in the post on mark-sweep garbage collection is that it introduces very large system pauses when the entire heap is marked and swept. One of the primary optimization employed to solve this issue is employing generational garbage collection. This optimization is based on the following observations

  1. Most objects die young
  2. Over 90% garbage collected in a GC is newly created post the previous GC cycle
  3. If an object survives a GC cycle the chances of it becoming garbage in the short term is low and hence the GC wastes time marking it again and again in each cycle

The optimization based on the above observations is to segregate objects by age into multiple generations and collect each with different frequencies.

This scheme has proven to work rather well and is widely used in many modern systems (including .NET).

Detailed algorithm

The objects can be segregated into age based generations in different ways, e.g. by time of creation. However one common way is to consider a newly created object to be in Generation 0 (Gen0) and then if it is not collected by a cycle of garbage collection then it is promoted to the next higher generation, Gen1. Similarly if an object in Gen1 survives a GC then that gets promoted to Gen2.

Lower generations are collected more often. This ensures lower system pauses. The higher generation collection is triggered fewer times.

How many generations are employed, varies from system to system. In .NET 3 generations are used. Here for simplicity we will consider a 2 generation system but the concepts are easily extended to more than 2.

Let us consider that the memory is divided into two contiguous blocks, one for Gen1 and the other for Gen0. At start memory is allocated only from Gen0 area as follows

image

So we have 4 objects in Gen0. Now one of the references is released

image

Now if GC is fired it will use mark and sweep on Gen0 objects and cleanup the two objects that are not reachable. So the final state after cleaning up is

image

The two surviving objects are then promoted to Gen1. Promotion includes copying the two objects to Gen1 area and then updating the references to them

image

Now assume a whole bunch of allocation/de-allocation has happened. Since new allocations are in Gen0 the memory layout looks like

image 

The whole purpose of segregating into generations is to reduce the number of objects to inspect for marking. So the first root is used for marking as it points to a Gen0 object. While using the second root the moment the marker sees that the reference is into a Gen1 object it does not follow the reference, speeding up marking process.

Now if we only consider the Gen0 objects for marking then we only mark the objects indicated by ✓. The marking algorithm will fail to locate the Gen1 to Gen0 references (shown in red) and some object marking will be left out leading to dangling pointers.

One of the way to handle this is to somehow record all references from Gen1 to Gen0 (way to do that is in the next section) and then use these objects as new roots for the marking phase. If we use this method then we get a new set of marked objects as follows

image

This now gives the full set of marked objects. Post another GC and promotion of surviving objects to higher generation we get

image

At this point the next cycle as above resumes…

Tracking higher to lower generation references

In general applications there are very few (some studies show < 1% of all references) of these type of references. However, they all need to be recorded. There are two general approached of doing this

Write barrier + card-table

First a table called a card table is created. This is essentially an array of bits. Each bit indicates if a given range of memory is dirty (contains a write to a lower generation object). E.g. we can use a single bit to mark a 4KB block.

 image

Whenever an reference assignment is made in user code, instead of directly doing the assignment it is redirected to a small thunk (incase .NET the JITter does this). The thunk compares the assignees address to that of the Gen1 memory range. If the range falls within, then the thunk updates the corresponding bit in the card table to indicate that the range which the bit covers is now dirty (shown as red).

First marking uses only Gen0 objects. Once this is over it inspects the card table to locate dirty blocks. Then it considers every object in that dirty block to be new roots and marks objects using it.

As you can see that the 4KB block is just an optimization to reduce the size of the card table. If we increase the granularity to be per object then we can save marking time by having to consider only one object (in contrast to all in 4KB range) but our card table size will also significantly increase.

One of the flip sides is that the thunk makes reference assignment slower.

HW support

Hardware support also uses card table but instead of using thunk it simply uses special features exposed by the HW+OS for notification of dirty writes. E.g. it can use the Win32 api GetWriteWatch to get the list of pages where write happened and use that information to get the card table entries.

However, these kind of support is not available on all platforms (or older version of platforms) and hence is less utilized.

Friday, February 20, 2009

Back To Basics: Handling overflow in mark stage

Bonfire at pragati resort

This post is Part 7 in the series of posts on Garbage Collection (GC). Please see the index here.

Let’s first recap the basic algorithm of the mark-sweep garbage collection. In C like pseudo code the algorithm looks like

void GC()
{
HaltAllThreads();
ObjectCollection roots = GetRoots();
for(int i = 0; i < roots.Count(); ++i)
Mark(roots[i]);
Sweep();
}

void Mark(Object* pObj)
{
if (!Marked(pObj)) // Marked returns the marked flag from object header
{
MarkBit(pObj); // Marks the flag in obj header
// Get list of references that the current object has
// and recursively mark them as well
ObjectCollection children = pObj->GetChildren();
for(int i = 0; i < children.Count(); ++i)
{
Mark(children[i]); // recursively call mark
}
}
}

This algorithm is by nature recursive where in the marking phase we recursively mark every object and it’s children. However, recursion is not optimum in terms of both time and space usage. Each method call comes with additional overhead which is particularly more bothersome because GC is anyway run when the system is low on resource.


Controlling the depth of call stack in recursive algorithm is also hard. In the algorithm shown above the depth of the stack is equal to the length of the longest reference chain.


This is worked around by using general recursion to iteration conversion methods like using explicit marking stacks as show below

Stack g_MarkStack(MAX_SIZE);
void Init()
{
g_MarkStack.RemoveAll();
}

void GC()
{
HaltAllThreads();
Init();
ObjectCollection roots = GetRoots();
for(int i = 0; i < roots.Count(); ++i)
g_MarkStack.Push(roots[i]);
Mark();
Sweep();
}

void Mark()
{
while(!g_MarkStack.Empty())
{
Object *pObj = g_MarkStack.Pop();
if(!Marked(pObj))
{
MarkBit(pObj);
ObjectCollection children = pObj->GetChildren();
for(int i = 0; i < children.Count(); ++i)
{
g_MarkStack.Push(children[i]);
}
}
}
}

Using explicit marking stack we have removed the need for recursive algorithm. Now this is more deterministic and it is easy to figure out when the heap graph is too deep and the stack actually overflows (at the point of push we can check for g_MarkStack.IsFull())


Handling overflow


Explicit stack makes detecting and handling of stack overflow easier. It doesn’t however, remove the issue. The first cut attempt is to reduce the number of items being pushed into the stack. Objects are pushed onto the stack so that they behave as branch points so that once a child has been marked we can come back to it’s parent and branch off and mark it’s other children. So simple fixes like only pushing in case an object has two or more children does help in reducing the overflows


There are other ways to work around this problem like using alternate tree marking algorithm like Schorr-Waite graph marking algorithm


However, whatever method we try there will be large objects trees where the stack will run out of space and that needs to be handled.


.NET Compact framework uses a two level fall back scheme as explained below.


Using overflow stack


The basic mark stack used is a grow-able stack buffer which can grow to max 4KB. In case this overflows another small fixed sized stack is used which is called the overflow-stack. Objects where the overflow happens while marking are pushed on to this overflow stack. The children of these are not traversed.


This means a part of the tree is left out from marking and growth of the mark stark is kind of contained. The mark buffer unwinds and marking is started back on it. In the next pass marking is restarted considering the objects on the overflow-stack as roots. This is done iteratively. The idea here is that with each such pass, more and more of the tree is marked and hence the marking finally converges.


Overflow-arena


For really deep trees even the overflow-stack can overflow. To handle this the following steps are taken



  1. .NETCF uses pooled memory where 64K sized memory pools are used and objects are sub-allocated from them. The memory pool where the overflowed object resides is located. Each of these pools contain two pointers which point to the first and the last objects in the pool that overflowed. These two pointers are updated to point to the currently overflowed object.
  2. The header of the object where the overflow happened is marked with a special bit flag OBJFLAG_MARK_OVERFLOW.
  3. At this point the object pool looks something like
    image
  4. A global linklist of all such pools which has overflowed objects is also preserved
  5. Once objects in mark-stack and overflow-stack is marked another pass is started which does the following
    foreach Pool in Overflow-arena
        foreach Obj >= Pool.header.start and Obj < Pool.header.end
             if(obj.flag & OBJFLAG_MARK_OVERFLOW)
                 Mark(obj)

The very design of the .NETCF GC is to cater to low memory devices. Hence it very rarely needs to use the fall-back. The design ensures that GC is optimal for low memory usage but can scale up (albeit with perf cost) for large memory as well.

Sunday, February 01, 2009

Back To Basics: Copying Garbage Collection

Banana leaf

This post is Part 5 in the series of posts on Garbage Collection (GC). Please see the index here.

In my previous post I have discussed reference counting and mark-sweep garbage collector in some detail. This post is a quick intro into the copying GC. I’ll not be going into the gory details because this is not much used in the .NET world.

The copying GC conceptually breaks the memory into two regions. Let’s call the ping and pong as shown below.

image

The buffer Ping is initially used for allocation of memory. It maintains a pointer at the end of the last allocated memory and for each new request simply returns that pointer and then increments it to point to the new end. So at the beginning when no memory is allocated it points to the bottom. After a bunch of allocation this is what the it looks like

image

Allocation is very fast as it involves only incrementing a pointer. However, after a bunch of allocations the situation will arrive where current-Pointer + requested > Ping-top. This means no more memory can be allocated from Ping. It is at this time the GC is run.

Garbage collection involves enumerating the roots and then copying every live object from Ping to Pong. Since objects are moved in memory all their references have to be updated as well. After the copy the buffers are switched so that subsequent allocations can now happen from Pong. Ping remains clean for the next copy cycle.

For the example set above lets assume when the GC is run Object 1 and 3 are live but 2 isn’t. After GC runs the memory layout becomes

image

Advantages and disadvantages

The primary advantage is that allocation is extremely fast (almost equal to stack allocation) because out of memory checks is just a pointer check and actual allocation is just incrementing the free pointer. Since the GC compacts memory it maintains good locality of reference resulting in better cache hit ratios and hence faster execution.

The main disadvantage with copying GC is that almost double the memory space is required. Even though it is argued that in virtual memory based systems it shouldn’t be a problem because the un-used memory pages will simply be paged out to the disk, the argument is not essentially true. Specially during GC phase both the memory half's are touched (due to ping <-> pong copy) and the number of page-faults generated are much larger than other GC schemes.