Bong Geek - Abhinaba Basu: 06/01/2011

Thursday, June 23, 2011

Secure Conversion of your personal Microsoft Office Documents to Kindle

I love my Kindle 3 and really enjoy it’s readability and portability. However, a lot of my reading is confidential docs and papers which don’t come off the Kindle store. I had to spend some time to figure out the best way to get those onto Kindle securely. Since I use the latest Microsoft document formats (docx) a bunch of the commonly referred tools like Calibre didn’t work for me.

In case you are not worried about security and do not mind sending your document to Amazon for free conversion and transfer to your kindle, then just visit http://www.amazon.com/gp/help/customer/display.html?nodeId=200375630 and thanks for stopping by on this blog Smile

However, in case the document is confidential and you don’t want to send it to Amazon read on

Using PDF

Kindle 3 supports PDF natively. So you can just transfer a PDF to it by connecting the Kindle to your PC over USB. Save the document to pdf using say Microsoft office save to PDF (File –> Save As (format PDF)) and then attach your kindle to the PC, it will show up as a Removable Storage drive and you can copy the PDF to it.

However, the document might turn out with tiny text and you’d need to scroll all around on the Kindle to view it. This is far from being acceptable and I use the following steps to get around that.

Install CurePDF writer from http://cutepdf.com/Products/CutePDF/writer.asp. This installs a virtual printer to your PC called “CutePDF Writer”. Now print any document to this printer (File –> Print) and it will create a pdf file on your computer. While printing I use the following settings

I use a small page size of A6 to ensure good readability on the Kindle. Once you click on the Print button CutePDF shows a file save as dialog using which you can directly point it to the attached Kindle storage drive. Experiment for what page size works best for the particular kind of document and Kindle you have (plain vs Dx).

Using MobiPocket Creator

Download MobiPocket Creator from http://www.mobipocket.com/en/DownloadSoft/ProductDetailsCreator.asp. Launch MobiPocket and then click on Import –> MS Word Document

Use the browse button to choose the document, in the file open dialog choose the document. However, do note that you may need to change the File type filter to *.* and not *.doc if you are opening the latest docx format

Once you have selected the document click on Import. This will get the document imported to MobiPocket.

Click on Build and then “Open folder containing eBook” and click OK. From the folder that opens copy the *.opf and *.prc to the Kindle’s document folder. Detach the kindle from the PC. The document should show up on the Kindle home screen.

Tuesday, June 14, 2011

WP7 Mango: The new Generational GC

In my previous post “Mark-Sweep collection and how does a Generational GC help” I discussed how a generational Garbage Collector (GC) works and how it helps in reducing collection latencies which show up as long load times (startup as well as other load situations like game level load) and gameplay or animation jitter/glitches. In this post I want to discuss how those general principles apply to the WP7 Generational GC (GenGC) specifically.

Generations and Collection Types

We use 2 generations on the WP7 referred to as Gen0 and Gen1. A collection could be any of the following 4 types

An ephemeral or Gen0 collection that runs frequently and only collects Gen0 objects. Object surviving the Gen0 collection is promoted to Gen1
Full mark-sweep collection that collects all managed objects (both Gen1 and Gen0)
Full mark-sweep-compact collection that collects all managed objects (both Gen1 and Gen0)
Full-GC with code-pitch. This is run under severe low memory and can even throw away JITed code (something that desktop CLR doesn’t support)

The list above is in the order of increasing latency (or time they take to run)

Collection triggers

GC triggers are the same and as outlined in my previous post WP7: When does the GC run. The distinction between #2 and #3 above is that at the end of all full-GC the collector considers the memory fragmentation and can potentially run the memory compactor as well.

After significant allocation
After significant amount of managed allocation the GC is started. The amount today is 1MB (called GC quanta) but is open to change. This GC can be ephemeral or full-GC. In general it’s an ephemeral collection. However, it might be a full collection under the following cases
1. After significant promotion of objects from Gen0 to Gen1 the collections become full collections. Today 5MB of promotion triggers a full GC (again this number is subject to change).
2. Application’s total memory usage is close to the maximum memory cap that apps have (very little free memory left). This indicates that the application will get terminated if the memory utilization is not cut-back.
3. Piling up of native resources. We use different heuristics like native to managed memory ratio and finalizer queue heuristics to detect if GC needs to turn to full collection to release native resources being held-up due to Gen0 only collections
Resource allocation failure
All resource allocation failure means that the system is under memory pressure and hence such collections are always full collection. This can lead to code pitch as well
User code triggered GC
User code can start collections via the System.GC.Collect() managed API. This results in a full collection as documented by that API. We have not added the method overload System.GC.Collect(generation). Hence there is no way for the developer to start a ephemeral or Gen0 only collection
Sharing server initiated
Sharing server can detect phone wide memory issue and start GC in all managed processes running. These are full-GC and can potentially pitch code as well.

So from all of the above, the 3 key takeaways are

Low memory or memory cap related collections are always full-collections. These could also turn out to be the more costly compacting collection and/or pitch JITed code
Collections are in general ephemeral and become full-collection after significant object promotion
No fundamental changes to the GC trigger policies. So an app written for WP7 will not see any major changes to the number of GC’s that happen. Some GC will be ephemeral and others will be full-GCs.

Write Barriers/Card-table

As explained in my previous post, to keep track of Gen1 to Gen0 reference we use write-barrier/card-table.

Card-table can be visualized as a memory bitmap. Each bit in the card-table covers n bytes of the net address space. Each such bit is called a Card. For managed reference updates like A.b = C in addition to JITing the real assignment, calls are added to Write-barrier functions. This write barrier locates the Card corresponding to the address of write and sets it. Later during collection the collector checks all Gen-1 objects covered by a set card-bit and marks Gen-0 references in those objects.

This essentially brings in two additional cost to the system.

Memory cost of adding those calls to the WB in the JITed code
Cost of executing the write barrier while modifying reference

Both of the above are optimized to ensure they have minimum execution impact. We only JIT calls to WB when absolutely required and even then we have an overhead of a single instruction to make the call. The WB are hand-tuned assembly code to ensure they take minimum cycles. In effect the net hit on process memory due to write barriers is way less than 0.1%. The execution hit in real-world applications scenarios is also not in general measureable (other than real targeted testing).

Differences from desktop

In principle both the desktop GC and the WP7 GC are similar in that they use mark-sweep generational GC. However, there are differences based on the fact that the WP7 GC targets a more constrained device.

2 generations as opposed to 3 on the desktop
No background or incremental collection supported on the phone
WP7 GC has additional logic to track and handle application policies like application memory caps and total memory utilization
The phone CLR uses a very different memory layout which is pooled and not linear. So no concept of Large Object Heap. So lifetime of large objects is no different
No support for particular generation collection from user code

Wednesday, June 08, 2011

WP7 Mango: Mark-Sweep collection and how does a Generational GC help

About a month back we announced that in the next release of Windows Phone 7 (codenamed Mango) we will ship a new garbage collector in the CLR. This garbage collector (GC) is a generational garbage collector.

This post is a “back to basics” post where I’ll try to examine how a mark-sweep GC works and how adding generational collection helps in boosting it’s performance. We will take a simplified look at how mark-sweep-compact GC works and how generational GC can enhance it’s performance. In later posts I’ll try to elaborate on the specifics of the WP7 generational GC and how to ensure you get the best performance out of it.

Object Reference Graph

Objects in the memory can be considered to be a graph. Where every object is a node and the references from one object to another are edges. Something like

To use an object the code should be able to “reach” it via some reference. These are called reachable objects (in blue). Objects like a method’s local variable, method parameters, global variables, objects held onto by the runtime (e.g. GCHandles), etc. are directly reachable. They are the starting points of reference chains and are called the roots (in black).

Other objects are reachable if there are some references to them from roots or from other objects that can be reached from the roots. So Object4 is reachable due to the Object2->Object4 reference. Object5 is reachable because of Object1->Object3->Object5 reference chain. All reachable objects are valid objects and needs to be retained in the system.

On the other hand Object6 is not reachable and is hence garbage, something that the GC should remove from the system.

Mark-Sweep-Compact GC

A garbage collector can locate garbage like Object6 in various ways. Some common ways are reference-counting, copying-collection and Mark-Sweep. In this section lets take a more pictorial view of how mark-sweep works.

Consider the following object graph

At first the GC pauses the entire application so that the object graph is not being mutated (as in no new objects or references are being created). Then it goes into the mark phase. In mark phase the GC traverses the graph starting at the roots and following the references from object to object. Each time it reaches an object through a reference it flips a bit in the object header indicating that this object is marked or in other words reachable (and hence not garbage). At the end everything looks as follows

So the 2 roots and the objects A, C, D are reachable.

Next it goes into the sweep phase. In this phase it starts from the very first object and examines the header. If the header’s mark bit is set it means that it’s a reachable object and the sweep resets that bit. If the header’s bit is not set, it’s not reachable and is flagged as garbage.

So B and E gets flagged as garbage. Hence these areas are free to be used for other objects or can be released back to the system

This is where the GC is done and it may resume the execution of the application. However, if there are too many of those holes (released objects) created in the system, then the memory gets fragmented. To reduce memory fragmentation. The GC may compact the memory by moving objects around. Do note that compaction doesn’t happen for every GC, it is run based on some fragmentation heuristics.

Both C and D is moved here to squeeze out the hole for B. At the same time all references to these objects in the system is also update to point to the correct location.

One important thing to note here is that unlike native objects, managed objects can move around in memory due to compaction and hence taking direct references to it (a.k.a memory pointers) is not possible. In case this is ever required, e.g. a managed buffer is passed to say the microphone driver native code to copy recorded audio into, the GC has to be notified that the corresponding managed object cannot be moved during compaction. If the GC runs a compaction and object moves during that microphone PCM data copy, then memory corruption will happen because the object being copied into would’ve moved. To stop that, GCHandle has to be created to that object with GCHandleType.Pinned to notify the GC that the corresponding object should never move.

On the WP7 the interfaces to these peripherals and sensors are wrapped by managed interfaces and hence the WP7 developer doesn’t really have to do these things, they are taken care offm under the hood by those managed interfaces.

The performance issue

As mentioned before during the entire GC the execution of the application is stopped. So as long as the GC is running the application is frozen. This isn’t a problem in general because the GC runs pretty fast and infrequently. So small latencies of the order of 10-20ms is not really noticeable.

However, with WP7 the capability of the device in terms of CPU and memory drastically increased. Games and large Silverlight applications started coming up which used close to 100mb of memory. As memory increases the number of references those many objects can have also increases exponentially. In the scheme explained above the GC has to traverse each and every object and their reference to mark them and later remove them via sweep. So the GC time also increases drastically and becomes a function of the net workingset of the application. This results in very large pauses in case of large XNA games and SL applications which finally manifests as long startup times (as GC runs during startup) or glitches during the game play/animation.

Generational Approach

If we take a look at a simplified allocation pattern of a typical game (actually other apps are also similar), it looks somewhat like below

The game has a large steady state memory which contains much of it’s steady state data (which are not released) and then per-frame it goes on allocating/de-allocating some more data, e.g. for physics, projectiles, frame-data. To collect this memory layout the traditional GC has to walk or traverse the entire 50+ mb of data to locate the garbage in it. However, most of the data it traverses will almost never be garbage and will remain in use.

This application behavior is used for the Generational GC premise

Most objects die young
If an object survives collection (that is doesn’t die young) it will survive for a very long time

Using these premise the generational GC tries to segregate the managed heap into older and younger generations objects. The younger generation called Gen-0 is collected in each GC (premise 1), this is called the Ephemeral or Gen0 Collection. The older generation is called Gen-1. The GC rarely collects the Gen-1 as the probability of finding garbage in it is low (premise 2).

So essentially the GC becomes free of the burden of the net working set of the application.

Most GC will be ephemeral GC and it will only traverse the recently allocated objects, hence the GC latency remains very low. Post collection the surviving objects are promoted to the higher generation. Once a lot of objects are promoted, the higher generation starts becoming full and then a full collection is run (which collects both gen-1 and gen-0). However, due to premise 1, the ephemeral collection finds a lot of garbage in their runs and hence promotes very few objects. This means the growth rate of the higher generation is low and hence full-collection will run very infrequently.

Ephemeral/Gen-0 collection

Even in ephemeral collection the GC needs to deterministically find all objects in Gen-0 which are not reachable. This means the following objects needs to survive a Gen-0 collection

Objects directly reachable from roots
Root –> Gen0 –> Gen-0 objects (indirectly reachable from roots)
Objects referenced from Gen1 to Gen0

Now #1 and #2 pose no issues as in the Ephemeral GC, we will anyway scan all roots and Gen-0 objects. However to find objects from Gen1 which are referencing objects in Gen-0, we would have to traverse and look into all Gen1 objects. This will break the very purpose of having segregating the memory into generation. To handle this write-barrier+card-table technique is used.

The runtime maintains a special table called card-table. Each time any references are taken in the managed code e.g. a.b = c; the code JITed for this assignment also updates an internal data-structure called CardTable to capture that write. Later during the ephemeral collection, the GC looks into that table to find all the ‘a’ which took new references. If that ‘a’ is a gen-1 object and ‘c’ a gen-0 object then it marks ‘c’ recursively (which means all objects reachable from ‘c’ is also marked). This technique ensures that without examining all the gen-1 objects we can still find all live objects in Gen-0. However, the cost paid is

Updating object reference is a little bit more costly now
Making old object taking references to new objects increases GC latency (more about these in later posts)

Putting it all together, the traditional GC would traverse all references shown in the diagram below. However, an ephemeral GC can work without traversing the huge number of Red references.

It scans all the Roots to Gen-0 references (green) directly. It traverses all the Gen1->Gen0 references (orange) via the CardTable data structure.

Conclusion

Generational GC reduces the GC latency by avoiding looking up all objects in the system
Most collections are gen-0 or ephemeral collection and are of low latency this ensures fast startup and low latency in game play
However, based on how many objects are being promoted full GC’s are run sometimes. When they do, they exhibit the same latency as a full GC on the previous WP7 GC

Given the above most applications will see startup and in-execution perf boost without any modification. E.g today if an application allocates 5 MB of data during startup and GC runs after every MB of allocation then it traverses 15mb (1 + 2 + 3 + 4 + 5). However, with GenGC it might get away with traversing as low as only 5mb.

In addition, especially game developers can optimize their in-gameplay allocations such that during the entire game play there is no full collection and hence only low-latency ephemeral collections happens.

How well the generational scheme works depend on a lot of parameters and has some nuances. In the subsequent posts I will dive into the details of our implementation and some of the design choices we made and what the developers needs to do to get the most value out of it.

Bong Geek - Abhinaba Basu

Links

Search