Tuesday, March 31, 2009

NETCF: Total Bytes in Use After GC – Perf counter

Holi (The festival of colors) celebration in full swing

At least one of our customers were a bit confused with the .NET Compact Framework performance counter “Total Bytes in Use After GC” which is described as “The number of bytes of memory, native and managed, in use after the last garbage collection.”. The description lead him to believe it is the total memory in use by a given managed application. That is not true.

The memory inside a .NET Compact Framework managed application can be broadly classified into

  1. Managed memory
  2. Jitted code
  3. Other native memory allocated by the CLR
  4. Native memory allocated by other sub-systems like say a COM module or some other native modules which the managed application P/Invokes into.

The first three is managed by .NET Compact Framework by using 5 heaps. The counter mentioned above returns the sum of all bytes allocated in all of the first 3 categories. .NETCF doesn’t keep tab of memory allocated in category #4.

Monday, March 30, 2009

How many heaps does the .NET Compact framework use

Prokriti

While discussing the memory architecture with an internal customer, he inquired about how many heaps .NETCF creates. I’m not sure how it might be helpful to users, but the answer is 5. This has been touched upon in some blogs and presentations (e.g. MEDC 2005) but I thought I’d put up a handy list.

Heap Description
JIT heap This is the heap where the buffers of jitted code is placed. Unlike the desktop CLR under some circumstances the CLR does free jitted code by a processed named code-pitching (which I will cover in a later blog post)
GC heap This is where all the managed objects resides. The garbage collector is responsible for managing all objects on this heap including compacting it when it gets fragmented.
Short-term heap This is used for short-lived objects. E.g. while doing a number conversion for the System.Convert methods if temporary space is required by the underlying CLR native code then it is allocated from this heap only to be freed soon.
App-domain This is used by the CLR to place things like assembly meta-data, application/appdomain/thread state information.
Process heap This is the default heap where rest of the stuff goes in

Friday, March 27, 2009

.NET Compact Framework and ARM FPU

Laad Bazaar Pearls, Charminar

One of our customers was porting an application on to .NETCF from native code and observed that on ARM v7 the floating point performance of .NET CF was way below that of native code. In particular he was trying out sine function. Their micro-benchmark showed that doing sine over couple of million times was 27 times slower in .NETCF as compared to native code.

The reason behind this is that native code targets the Floating Point Unit (FPU) when available. However, .NETCF targets the lowest common denominator of ARMV4i and doesn’t use FPU even when run on a higher ARM version where FPU is available. It actually uses SW emulation for floating point operations and hence is slower.

If floating point math performance is critical for your application then the way to address that would be to write the math operation in native code (a native dll) and P/Invoke into it.

However, the cost of making P/Invoke calls is high (~5-6x slower than normal method calls) and you need to ensure that it doesn’t offset the savings in using the FPU. A common way to do that is to use bulking and reduce chattiness of P/Invoke calls. E.g., if you are evaluating an expression of the form func(a) + func(b) + func(c) instead of creating a P/Invokable function func and calling it thrice, its best to put the whole expression inside the native dll and call that one function which does the whole operation. If it is possible you can even try to go one step ahead and pass a buffer of input and output and get the native dll to do all the operation and return the processed buffer in one go.

Trivia: The .NET Compact Framework that runs on the Zune (which uses ARM as well) does use the FPU on it. So we do have the support in code and hopefully at some point in the future it will make it to the main-stream .NET Compact Framework.

Thursday, March 26, 2009

Random GC fun

Laad Bazaar, Charminar

A friend came by and asked “Dude, are you free”. I replied back cautiously “Why? Are you the Garbage Collector?”. I do not even implement a finalizer and hence wouldn’t even get some extra time standing in the finalizer queue. So I’m not taking any chance…

Tuesday, March 24, 2009

Dangers of large object heap

View of Charminar from inside Laad Bazaar, Hyderabad

Andrew Hunter has an interesting post on the issues with large object heap on the desktop CLR. Visit http://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/

Side Note: .NET Compact Framework doesn’t support large object heap simply because it is primarily targeted towards embedded applications which has far lesser memory requirements.

Wednesday, March 04, 2009

Small Basic

Today my father in law started teaching my 4 year old daughter how to write in Bengali which is her native language. Below is the screen shot of her attempt to write অ আ ই, the first character is written by her teacher and the rest by her.

image

This got me thinking around when the time comes how do I introduce her to programming languages. A friend suggested I check out Small Basic which is from Microsoft Dev Labs and is a free download.

I downloaded the Getting Started Guide and skimmed though it for around 5 mins. Then fired Small Basic IDE up and instantly loved the UI, it had KISS written all over it. However, I did feel that it made extra effort to look too cute.

image

I could easily code up the following by skimming through the guide. The only hiccup was figuring out how to get a color from RGB as I didn’t guess that the method would be inside GraphicsWindow.

GraphicsWindow.BackgroundColor = "Black"
GraphicsWindow.Width = 400
GraphicsWindow.Height = 256

For y = 0 To 255
GraphicsWindow.PenColor = GraphicsWindow.GetColorFromRGB(y, 0, 0)
GraphicsWindow.DrawLine(0, y, 400, y)
EndFor

GraphicsWindow.BrushColor = "White"
GraphicsWindow.FontSize = 24
GraphicsWindow.DrawText(10, 110, "Gradient")

The output is the following cool looking gradient effect


image


The whole language and the pre-packaged classes were very immersive. I really started having fun and could code up a small logo like application in around 10 mins. I could just use the following code (which is a sample from the getting started guide)

GraphicsWindow.BackgroundColor = "Black"
GraphicsWindow.MouseDown = OnMouseDown

' Event handler
Sub OnMouseDown
' Get a random image tagged flower from flickr
pic = Flickr.GetRandomPicture("Flower")
GraphicsWindow.DrawResizedImage(pic, 0, 0, 640, 480)
EndSub

to create a UI where every click gets a random image from flickr


I just loved the whole experience, almost felt like Pop-fly, only way more powerful. I think Small Basic has a lot of potential especially after seeing the samples of what folks have created over at http://blogs.msdn.com/smallbasic/

Tuesday, March 03, 2009

Back To Basics: How does the GC find object references

Oh Kolkata!!

This post is Part 9 in the series of posts on Garbage Collection (GC). Please see the index here.

Most garbage collection algorithms needs to know all other objects referenced by a given object to correctly traverse object hierarchy. All of reference counting, mark-sweep and even coping collectors need it. Consider the mark algorithm that I used in mark-sweep GC

void Mark(Object* pObj)
{
if (!Marked(pObj)) // Marked returns the marked flag from object header
{
MarkBit(pObj); // Marks the flag in obj header
// Get list of references that the current object has
// and recursively mark them as well

ObjectCollection children = pObj->GetChildren();
for(int i = 0; i < children.Count(); ++i)
{
Mark(children[i]); // recursively call mark
}
}
}

Here a critical piece is the line in red. Given the object header we need to be able to locate all other objects referenced by it (or are children of it in object reference graph). There are two basic ways to do this


Conservative GC


In case a GC is being plugged into an existing system in the form of a library and there is no easy way to modify the runtime then a conservative (as opposed to precise) system is used. A conservative GC considers every bit pattern on the heap to be a valid reference to another heap object. For the above case given a object header the GC needs to have the following basic information



  1. The size of the object
  2. Alignment of pointers (lets assume it to be 32bit for our case)

So once the GC gets a object pointer pObj it considers all 32bit patterns in between pObj and (pObj + sizeof(Object)) to be references to other child objects. It then uses some basic elimination like whether the value points inside the current heap for the system. If it doesn’t match any elimination logic then it is considered to be a reference.


Even though this approach seems a bit extreme, it works out well and the only flip side is that it returns false positives and hence some garbage which should’ve been collected is left out (e.g. because an int containing a value accidentally matches the address of an object on the heap).


Precise GC


In precise GC the GC precisely knows every reference held in a given object. This is possible only if the runtime generates and preserves this information.


As a sample implementation I will describe how the .NET Compact Framework or .NETCF (which uses a precise garbage collector) goes doing about this.


One additional goal employed by .NETCF is that the GC should be independent of the type system (that is shouldn’t need to have any information of types of objects).


Every object in NETCF has an object header which in turn contains a pointer to it’s type descriptor or class descriptor. This class descriptor has a special 32-bit field named RefMap which is used to store reference information stored by a given type


image


So given an object pointer pObj the GC fetches the object header from before the pointer and then de-references the class descriptor pointer to get to the corresponding type’s Class descriptor. It then reads the RefMap stored into it. To do all of the above the GC needs to know only the object header and class descriptor layout and not any other type information.


All references in NETCF are 32-bit aligned and the refmap contains one bit per 32-bits of the object. If the bit is set it means that the corresponding 32-bit is a reference.


Let’s take the following class

class MyType
{
public char a;
public char b;
public AnotherType t1;
public int i;
public AnotherType t2;
}


The memory layout of the class would be something like


image


The first two chars take 16 bits each. After that there are 3, 32-bit entities of which t1 and t2 (1st and 3rd) are references to other objects but i is not.


So the GC needs to know that given this object the 32-bits from 32nd and 96th are actually references. This is exactly what is encoded in RefMap field. It contains the bit-pattern 00000000000000000000000000001010 where the LSB is for the first 32 bits of the object (which is a+b in this case) and so on.


Since bit 1 of refmap is set it means (pObj + 1 * 32) is a reference. If that value is non-NULL then the GC can just de-reference it to get to the t1 object reference held by pObj.


Obviously it means that the maximum size of object that RefMap can encode is 32*4 = 128bytes. No points for guessing how it handles objects larger than that :)

Monday, March 02, 2009

Back To Basics: Generational Garbage Collection

Oh Kolkata!!  New Market

This post is Part 8 in the series of posts on Garbage Collection (GC). Please see the index here.

One of the primary disadvantage discussed in the post on mark-sweep garbage collection is that it introduces very large system pauses when the entire heap is marked and swept. One of the primary optimization employed to solve this issue is employing generational garbage collection. This optimization is based on the following observations

  1. Most objects die young
  2. Over 90% garbage collected in a GC is newly created post the previous GC cycle
  3. If an object survives a GC cycle the chances of it becoming garbage in the short term is low and hence the GC wastes time marking it again and again in each cycle

The optimization based on the above observations is to segregate objects by age into multiple generations and collect each with different frequencies.

This scheme has proven to work rather well and is widely used in many modern systems (including .NET).

Detailed algorithm

The objects can be segregated into age based generations in different ways, e.g. by time of creation. However one common way is to consider a newly created object to be in Generation 0 (Gen0) and then if it is not collected by a cycle of garbage collection then it is promoted to the next higher generation, Gen1. Similarly if an object in Gen1 survives a GC then that gets promoted to Gen2.

Lower generations are collected more often. This ensures lower system pauses. The higher generation collection is triggered fewer times.

How many generations are employed, varies from system to system. In .NET 3 generations are used. Here for simplicity we will consider a 2 generation system but the concepts are easily extended to more than 2.

Let us consider that the memory is divided into two contiguous blocks, one for Gen1 and the other for Gen0. At start memory is allocated only from Gen0 area as follows

image

So we have 4 objects in Gen0. Now one of the references is released

image

Now if GC is fired it will use mark and sweep on Gen0 objects and cleanup the two objects that are not reachable. So the final state after cleaning up is

image

The two surviving objects are then promoted to Gen1. Promotion includes copying the two objects to Gen1 area and then updating the references to them

image

Now assume a whole bunch of allocation/de-allocation has happened. Since new allocations are in Gen0 the memory layout looks like

image 

The whole purpose of segregating into generations is to reduce the number of objects to inspect for marking. So the first root is used for marking as it points to a Gen0 object. While using the second root the moment the marker sees that the reference is into a Gen1 object it does not follow the reference, speeding up marking process.

Now if we only consider the Gen0 objects for marking then we only mark the objects indicated by ✓. The marking algorithm will fail to locate the Gen1 to Gen0 references (shown in red) and some object marking will be left out leading to dangling pointers.

One of the way to handle this is to somehow record all references from Gen1 to Gen0 (way to do that is in the next section) and then use these objects as new roots for the marking phase. If we use this method then we get a new set of marked objects as follows

image

This now gives the full set of marked objects. Post another GC and promotion of surviving objects to higher generation we get

image

At this point the next cycle as above resumes…

Tracking higher to lower generation references

In general applications there are very few (some studies show < 1% of all references) of these type of references. However, they all need to be recorded. There are two general approached of doing this

Write barrier + card-table

First a table called a card table is created. This is essentially an array of bits. Each bit indicates if a given range of memory is dirty (contains a write to a lower generation object). E.g. we can use a single bit to mark a 4KB block.

 image

Whenever an reference assignment is made in user code, instead of directly doing the assignment it is redirected to a small thunk (incase .NET the JITter does this). The thunk compares the assignees address to that of the Gen1 memory range. If the range falls within, then the thunk updates the corresponding bit in the card table to indicate that the range which the bit covers is now dirty (shown as red).

First marking uses only Gen0 objects. Once this is over it inspects the card table to locate dirty blocks. Then it considers every object in that dirty block to be new roots and marks objects using it.

As you can see that the 4KB block is just an optimization to reduce the size of the card table. If we increase the granularity to be per object then we can save marking time by having to consider only one object (in contrast to all in 4KB range) but our card table size will also significantly increase.

One of the flip sides is that the thunk makes reference assignment slower.

HW support

Hardware support also uses card table but instead of using thunk it simply uses special features exposed by the HW+OS for notification of dirty writes. E.g. it can use the Win32 api GetWriteWatch to get the list of pages where write happened and use that information to get the card table entries.

However, these kind of support is not available on all platforms (or older version of platforms) and hence is less utilized.