Thursday, November 08, 2012

Windows Phone 8: Evolution of the Runtime and Application Compatibility

Long time back at the wake of the release of Windows Phone 7 (WP7) I posted about the Windows Phone 7 series programming model. I also published how .NET Compact framework powered the applications on WP7.

Further simplifying the block diagram, we can think of the entire WP7 application system as follows

As with most block diagrams, this is gross simplifications. However, I hope it helps to easily picture the entire system.

Essentially the application can be purely managed (written in say C# or VB.net). The application can only utilize services exposed by the developer platform and core services provided by .NET Compact Framework. The application can in no way directly use native code or talk to the OS (say call an Win32 API). It has to always go through the runtime infrastructure and is in a security sandbox.

The application manager is the loose term I am using to encompass everything that is used to managed the application including the host.

Windows Phone 8 (WP8) is a huge huge change from Windows Phone 7.x (WP7). From the perspective of a WP7 application running on a WP8 device the system looks as follows

Everything in Green in this diagram got outright replaced with entire new codebase and the rest of the system other than the application was heavily modified to work with the new OS and the new managed runtime.

Shared Windows Core

The OS moved away from Windows Compact Embedded (WinCE) OS core that was used in WP7 to a new OS which shares it’s core with the desktop Windows 8. This means that a bunch of things in the WP8 OS is shared with the desktop implementation, this includes things like kernel, networking, driver framework and others. The shared core obviously brings great value as innovations and features will more easily flow across the two form factors (device and desktop) and also reduce engineering redundancy on Microsoft side. Some of the benefits are readily visible today like great multi-core support, WinRT interop and others are more subtle.

CoreCLR

.NET Compact Framework (NETCF) that was used in WP7 has a very different design philosophy and hence a completely different implementation from the desktop .NET. I will have a follow up post on this but for now it suffices to note that .NETCF is a very portable runtime that is designed to be very versatile and cross platform. Desktop CLR on the other hand is more closely tied with Windows and the processor architecture. It closely works with the OS and the underlying HW to give the maximum performance benefit to managed code running on it.

With the new Windows RT which works on ARM, desktop CLR was anyway updated to work on the ARM processor. So when the phone chose to move to shared core it was an obvious choice to move the CLR as well. This gave the same benefits of shared innovation and reduced engineering redundancy.

The full desktop CLR is more heavy and provides functionality that is not really required by the phone scenarios and hence a lighter variant of it (which is built from the same source) called CoreCLR was chosen for WP8. CoreCLR is the evolution of the lightweight runtime that powered Silverlight. With the move to CoreCLR developers get a much faster runtime with extended feature set that includes interop via WinRT.

Backward Compat

One of the simple statements made during all of these WP8 launch presentations was that applications in the store built for WP7 will work as is for WP8. This is a small statement but is a huge achievement and effort from the runtime implementation perspective. Making apps work as is when the entire runtime, OS and chipset has changed is non-trivial to say the least. Our team worked very hard to make this possible.

One of the biggest things that played out to our benefit was that the WP7 apps were fully sandboxed and couldn’t use any native code. This means that they didn’t have any means of taking behavioral dependence on the OS APIs or underlying HW. The OS APIs were used via the CLR and it could always add quirks to expose consistent behavior to the applications.

API Compatibility
This required us to ensure that CoreCLR exposes the same API set as NETCF. We ran various automated tools to managed API surface area changes and retain meaningful API compat between WP7 and WP8. With a closed application store it was possible for us to gather complete metrics on API usage and correctly prioritize engineering resources to ensure that majority of applications continued to see the same API set in signature and semantics.

We also needed to ensure that the same APIs behave as closely as possible with that provided by NETCF. We tested a lot of applications in the app store to get as close as we can and believe that we are at a place that should allow most WP7 application’s API usage to transparently fall over to the new runtime.

Runtime behavior changes
When a runtime changes there are behavioral changes that can expose pre-existing issues with applications. This includes things like timing differences. Even though these runtime behaviors are not documented, or in some cases especially called out that user code should not take dependence on them, some apps still did do that.

Couple of the examples we saw

Taking dependence on finalization order
Even though CLI specification clearly calls out that finalization order in .NET is not deterministic, code still took subtle dependency on them. In one particular case object F1 used a file and it’s finalizer released it. Later another object F2 opens the same file. With change in the runtime, the timing of GC changed so that at the time F2 tried to open the file, F1 which has been already collected hasn’t yet had its finalizer run. Hence the application crashed. We got in touch with the app developer and got the code moved to use the right dispose pattern/
GC Timing and changes in number of generations
Application finalizers contrary to .NET guidelines modified other managed objects. Now changed GC timings and generation resulted in those objects to have already been collected and hence resulted in finalizer crashes
Threading issues
This was one of the painful ones. With the change in OS and hence thread scheduler and addition of multiple cores in the ARM CPU, a lot of subtle races and deadlocks in the applications got exposed. These were pre-existing issues where synchronization primitives were not used correctly or some code relied on the quanta based thread scheduling WinCE did. With move to WP8, threads run in parallel on different cores and scheduled in different order, this lead to various deadlocks and race driven crashes. Again we had to get in touch with app developer to address these issues
There were other cases where exact precision of floating point math was relied on. This resulted in a board game where pucks flew off from its surface
Whether or not functions got inlined
Order of static constructor initialization (especially in conjunction of function in lining)

We addressed some of these issues where it was realistic to fix them in the runtime. For some of the others we got in touch with the application developers. Obviously all applications in the store and all of their features were not tested. So you should try to test your applications when you have access to WP8.

At one point we were all playing games on our phones telling our managers that I am compat testing Need For Speed and I need to test till level 10 :)

Tuesday, October 16, 2012

Test Driven Development of a Generational Garbage Collection

These days everyone is talking about being agile and test driven development (TDD). I wanted to share a success story of TDD that we employed for developing Generational Garbage Collector (GC) for Windows Phone Mango.

The .NET runtime on Windows Phone 7 shipped with a mark-sweep-compact; stop the world global non-generational GC. Once a GC was triggered, it stopped all managed execution and scanned the entire managed heap to look up all managed references and cleaned up objects that were not in use. Due to performance bottleneck we decided to enhance the GC by adding a generational GC (referred to as GenGC). However, post the General Availability or GA of WP7 we had a very short coding window. Replacing such a fundamental piece of the runtime in that short window was very risky. So we decided to build various kinds of stress infrastructure first, and then develop the GC. So essentially

Write the tests
See those tests failing
Write code for the generational GC
Get tests to pass
Use the tests for regression tracking as we refactor the code and make it run faster

Now building tests for a GC is not equivalent of traditional testing of features or APIs where you write tests to call into mocked up API, see it fail until you add the right functionality. Rather these tests where verifications modes and combination of runtime stresses that we wrote.

To appreciate the testing steps we took do read the Back To Basics: Generational Garbage Collection and WP7 Mango: Mark-Sweep collection and how does a Generational GC help posts

Essentially in a generational GC run all of the following references should be discovered by the GC

Gen0 Objects reachable via roots (root –> object OR recursively root –> object –> object )
Objects accessible from runtime native code (e.g. pinned for PInvoke, COM interop, internal runtime references)
Objects referenced via Gen1 –> Gen0 pointers

The first two were anyway heavily covered by our traditional GC tests. #3 being the new area being added.

To implement a correct generational GC we needed to ensure that at all places in the runtime where managed object references are updated they need to get reflected in the CardTable (#3 above). This is a daunting task and prone to bugs via omission as we need to ensure that

All forms of assignments in MSIL that we JIT there are calls to update the CardTable.
All places in the native runtime code where such references are directly and or indirectly updated the same thing is ensured. This includes all JIT worker-routines, COM, Marshallers.

If a single instance is missed then it would result in valid/reachable Gen0 objects being collected (or deleted) and hence in the longer run result in memory corruption, crashes that will be hard if not impossible to debug. This was assessed to be the biggest risk to shipping generational GC.

The other problem is that these potential omissions can be only exposed by certain ordering of allocation and collection. E.g. only a missing tracked reference of A –> B can result in a GC issue only if a GC happened in between allocations of A and B (A is in higher generation than B). Also due to performance reasons (write atomicity for lock-less updates) for every assignment of A = B we do not update the card-table bit that covers the memory area of A. Rather we update the whole byte in the card-table. This means an update to A will cover other objects allocated adjacent to A. Hence if an update to an object just beside A in the memory is missed it will not be discovered until some other run where that object lands up being allocated farther away from A.

GC Verification mode

Our solution to all of these problems was to first create the GC verification mode. What this mode does is runs the traditional full mark-sweep GC. While running that GC it goes through all objects in the memory and as it traverses them for every reference A (Gen1) –> B(Gen0), it verifies that the card table bit for A is indeed set. This ensures that if a GenGC was to run, it would not miss that references

Granular CardTable

We used very high granular card-table resolution for test runs. For these special runs each bit of the card-table corresponded to almost one object (1 bit to 2 byte resolution). Even though the card-table size exploded it was fine because this wasn’t a shipping configuration. This spaced out objects covered by the card-table and exposed adjacent objects not being updated.

GC Stress

In addition we ran the GC stress mode, where we made the GC run extremely frequently (we could push it up to a GC in every allocation). The allocator was also updated to ensure that allocations were randomized so that objects moved around everywhere in the memory.

Hole Finder

Hole finder moves all objects around in memory after a GC. This exposes stale pointer issues. If an object didn’t get updated properly due to the GC it would now point to invalid memory because all previous memory locations are now invalid memory. So a subsequent write will fail-fast with AV and we can easily detect that point of failure.

With all of these changes we ran the entire test suites. Also by throttling down the GC Stress mode we could still use the runtime to run real apps on the phone. Let me tell you playing NFS on a device with the verification mode, wasn’t fun :)

With this precaution we ensured that not a single GenGC bug has come in from the phone. It shipped rock solid and we were more confident with code churn because regressions would always be caught. I actually never blogged about this because I felt that if I do, it’ll jinx something :)

Core Parking

For some time now, my main box got a bit slow and was glitching all the time. After some investigation I found that some power profile imposed by our I T department enabled CPU parking on my machine. This effectively parks CPU on low load condition to save power. However,

This effects high load conditions as well. There is a perceptible delay between the load hitting the computer and the CPU’s being unparked. Also some CPU’s remain parked for spike loads (large but smaller duration usage spikes)
This is a desktop and not a laptop and hence I really do not care about power consumption that much

Windows Task Manager (Ctrl + Shift + Esc and then Performance tab) clearly shows this parking feature. 3 parked cores show flatlines

You can also find out if your machine is behaving the same from Task Manager -> Performance Tab -> Resource Monitor -> CPU Tab. The CPU graphs on the right will show which cores if any are parked.

To disable this you need to

Find all occurrences of 0cc5b647-c1df-4637-891a-dec35c318583 in the Windows registry (there will be multiple of those)
For all these occurrences set the ValueMin and ValueMax keys to 0
Reboot

For detailed steps see the video http://www.youtube.com/watch?feature=player_detailpage&v=mL-w3X0PQnk#t=85s

Everything is back to normal once this is taken care off Smile

Managing your Bugs

I have worked on bunch of large scale software product development that spanned multiple teams, thousands of engineers, many iterations. Some of those products plug into even larger products. So obviously we deal with a lot of bugs. Bugs filed on us, bugs we file on other partner teams, integration bugs, critical bugs, good to have fixes and plan ol crazy bugs. However, I have noticed one approach to handling those bugs which kind of bugs me.

Some people just cannot make hard calls on the gray area bugs and keeps hanging them around. They bump the priority down and take those bugs from iteration to iteration. These bugs always hang around as Priority 2 or 3 (in a scale of Priority 0 through 3).

I personally believe that bugs should be dealt with the same way emails should be dealt with. I always follow the 4 Ds for real bugs.

Do it

If the bug meets the fix bar and is critical enough, just go fix it. This is the high priority critical bug which rarely gives rise to any confusion. Something like the “GC crashes on GC_Sweep for application X”

Delete it (or rather resolve it)

If you feel a bug shouldn’t be fixed then simply resolve the bug. Give it back to the person who created the bug with clear reasons why you feel this shouldn’t be fixed. I know it’s hard to say “won’t fix”, but the problem is if one was to fix every bug in the system then the product will never ship. So just don’t defer the thought of making this hard call. There is no benefit in just keeping the bug hanging around for 6 months only to make the same call 2 weeks prior to shipping, citing time constraint as a reason. Make the hard call upfront. You can always tag the bug appropriately so that for later releases you can relook.

Defer it

Once you have made a call to fix the bug it should be correctly prioritized. With all bugs you know you won’t be fixing out of the way, its easy to make the right call of fixing it in the current development cycle or move it to the next iteration.

Delegate it

If you feel someone else needs to take a look at the bug because it’s not your area of expertise, QA needs to give better repro steps, someone else needs to make a call on it (see #1 and #2 above) then just assign it to them. No shame in accepting that you aren’t the best person to make a call.

WP7: CLR Managed Object overhead

A reader contacted me over this blog to inquire about the overhead of managed objects on the Windows Phone. which means that when you use an object of size X the runtime actually uses (X + dx) where dx is the overhead of CLR’s book keeping. dx is generally fixed and hence if x is small and there are a lot of those objects it starts showing up as significant %.

There seems to be a great deal of information about the desktop CLR in this area. I hope this post will fill the gap for NETCF (the CLR for the Windows Phone 7)

All the details below are implementation detail and is provided just as guidance. Since these are not mandated by any specification (e.g. ECMA) they may and most probably will change.

General layout

The overhead varies on the type of objects. However, in general the object layout looks something like

So internally just before the actual object there is a small object-header. If the object is a finalizable object (object with a finalizer) then there’s another 4-byte pointer at the end of the object.
The exact size of the header and what's inside that header depends on the type of the object.

Small objects (size < 16KB)

All objects smaller than 16KB has a 8 byte object header.

The 32-bits in the flag are used to store the following information

Locking
GC information. E.g.
1. Bits to mark objects
2. Generation of the object
3. Whether the object is pinned
Hashcode (for GetHashCode support)
Size of the object

Since all objects are 4 byte aligned, the size is stored as an unit of 4 bytes and uses 11 bits. Which means that the maximum size that can be stored in that header is 16KB.

The second flag contains a 32-bit pointer to the Class-Descriptor or the type information of that object. This field is overloaded and is also used by the GC during compaction phase.

Since the header cost is fixed, the smaller the object the larger the overhead becomes in terms of % overhead.

Large Objects

As we saw in the previous section that the standard object header supports describing 16KB objects. However, we do need to support larger objects in the runtime.

This is done using a special large-object header which is 20 bytes.

The overhead of large object comes to be at max of around 0.1%. A dedicated 32-bit size field allows the runtime to support very large objects (close to 2GB).

It’s weird to see that the normal object header is repeated twice. The reason is that very large objects are rare on the phone. Hence the runtime tries to optimize it’s various operations so that the large object doesn’t bring in additional code paths and checks. Given an object reference having the normal header just before it helps us in speeding various operations. At the same time having the header as the first data-structure of all objects helps us to be faster in other cases. Bottom line is that the additional 8 bytes or 0.04% extra space usage helps us in enough performance gains at other places so that we pay that extra overhead cost.

Pool overhead

NETCF uses 64KB object pools and then sub-allocates objects from it. So in addition to the per-object overhead it uses an additional 44-byte of object-pool-header per 64kb. This means another 0.06% overhead per 64KB. For every object, the contribution of this can be ignored.

Arrays

Arrays pan out slightly differently.

In case of value type array every member doesn’t really need to have the type information and hence the object header associated with it. The Array header is a standard object header of 8 bytes or big object header of 20bytes (based on the size of the array). Then there is a 32-bit count of the number of elements in the array. For single dimension array there is no more overhead. However, for higher dimension arrays the length in each dimension is also stored to support runtime bounds check.

The reference type arrays are similar but instead of the array contents being inlined into the array the elements are references to the objects on the managed heap and each of those objects have their standard headers.

Putting it all together

Standard object	8 bytes ( + 4 bytes for finalizable object)
Large objects (>16KB)	20 bytes ( + 4 bytes for finalizable object)
Arrays	object overhead + 4 bytes array header + (nDimension x 4 bytes)

My first Windows Phone App

I have been involved with Windows Phone 7 from the very beginning and worked on delivering the CLR for the GA version and all the way through Nodo and Mango.

All this while I was busy working on one of the lower layers and seeing bits and bytes flying by. For some time I wanted to try out writing an app for the phone to experience how it feels to be our developer customer. Unfortunately my skills fall more on system programming than designing good looking UI. I had a gazillion ideas floating in my head but most were beyond my graphic and UI design skills. I was also waiting for some sort of inspiration.

Over the last school break our local King Country Library System which earned the best library system of the year award run an interesting effort. Students received a sheet which they filled up as they read books. On hitting 500 minutes and 1000 minutes of book reading they got prizes. One of the prize was a Kaleidoscope. Not the one with colored glass pieces like I had as a child, but one through which you could see everything around. This was the inspiration I was waiting for and I wrote an WP7 Mango app that does the same. Interestingly the app got a lot of good reviews and is rated at 5 stars (something I didn’t expect). It’s free, go check it out.

Get the app here, documentation is at http://bonggeek.com/Kaleidoscope/

Delay Sending an Email

Like many engineer I am partially fueled by the engineers angst. I am not always able to vent it in code and sometimes resort to sending really dumb emails :).

I figured out that I can tolerate delaying sending my emails over having to smack my face every other day. Basically all emails I send is delayed by about 5 minutes and that makes a huge difference. Somehow you always know you’ve sent something you shouldn’t have in the first 2 minutes of sending that email.

This is how you set this up on Outlook 2010… Bring up the new rules UI from the ribbon (Home > Rules > Managed Rules and Alerts)

Next you’d get another UI just hit next and yes on the following warning.

After that choose the delay duration. I use 5 minutes

Bonus!!!

In case you want to pretend that you are hard at work when you aren’t then you can also use the timed email feature in Outlook. In the new email window use the following.

Unfortunately if you are an software engineer your work is measured by the amount and quality of code you check in and not emails so sadly this doesn’t work Smile

Bong Geek - Abhinaba Basu

Links

Search