Sunday, April 05, 2009

Floating point operations in .NET Compact Framework on WinCE+ARM


There has been some confusion about how .NETCF handles floating point operations. The main reason is that the answer differs across the platforms NETCF supports (e.g. S60/Xbox/Zune/WinCE). I made an earlier post on this topic which was partially incorrect. As I followed up on it I learnt a lot, especially from Brian Smith. Hopefully this post clears up the confusion floating around floating point handling in .NETCF on WinCE.

How does the desktop CLR handle floating point operations

Consider the following floating point addition in C#:

float a = 0.7F;
float b = 0.6F;
float c = a + b;

For this code, the final assembly generated by the CLR JITter on the x86 platform is:

float a = 0.7F;
0000003f mov dword ptr [ebp-40h],3F333333h
float b = 0.6F;
00000046 mov dword ptr [ebp-44h],3F19999Ah
float c = a + b;
0000004d fld dword ptr [ebp-40h]
00000050 fadd dword ptr [ebp-44h]
00000053 fstp dword ptr [ebp-48h]

Here the JITter directly emits the floating point instructions fld, fadd and fstp. It can do so because a floating point unit, and hence the floating point instructions, are always available on x86.
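The two constants stored by the mov instructions above are the raw IEEE 754 single-precision bit patterns of 0.7F and 0.6F. A minimal sketch to check this from C# follows; the class and method names (FloatBits, ToBits) are made up for illustration and are not part of any framework.

```csharp
using System;

static class FloatBits
{
    // Returns the raw 32-bit IEEE 754 pattern of a float.
    // (Hypothetical helper, named here only for this example.)
    public static uint ToBits(float value)
    {
        return BitConverter.ToUInt32(BitConverter.GetBytes(value), 0);
    }

    static void Main()
    {
        // These should match the immediates in the listing above.
        Console.WriteLine(ToBits(0.7F).ToString("X8")); // 3F333333
        Console.WriteLine(ToBits(0.6F).ToString("X8")); // 3F19999A
    }
}
```

In other words, the JITter only has to store two 32-bit integers on the stack; the actual arithmetic is left entirely to the FPU instructions.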

Why does NETCF differ

Unfortunately NETCF targets a huge number of HW configs which vary across Xbox, Zune, WinCE on x86/MIPS/ARM/SH4, S60, etc. There are even sub-flavors of these base configs (e.g. ARMv4, ARMv6, MIPS II, MIPS IV). A floating point unit (FPU) is not available on all of these platforms.

This difference in FPU availability is handled with different approaches on different platforms. This post is about WinCE+ARM and hence I’ll skip the other platforms.

Zune is a special WinCE platform

Zune is special because it is tied to a specific version of ARM with an FPU built in (locked HW). NETCF on Zune was primarily targeted at XNA games, and in games the performance of floating point operations is critical. Hence the .NETCF JITter was updated to target the ARM FPU. For basic mathematical operations it emits inline native ARM FPU instructions, very much like the desktop JITter shown above. The result is that the basic floating point operations are much faster.

WinCE in general

In general for WinCE on ARM, the presence of an FPU cannot be assumed because the least common subset targeted is ARMv4, which does not necessarily have an FPU.

To understand the implication, it is important to know that floating point operations fall into two basic categories:

  1. BCL System.Math operations like sin/cos/tan
  2. Simple floating point operations like +, -, *, /, conversion and comparison.

For the first category the JITter simply delegates the operation to WinCE by calling into CoreDll.dll, which exports sin, sinh, cos, cosh, etc.

For the second category the JITter calls into small worker functions implemented inside the NETCF CLR. These worker functions are native code compiled for ARM. If we disassemble them, we see that the native ARM compiler emits calls into coredll.dll for these operations, for example through the import __imp___addd.

It is evident from the above that the performance of managed floating point operations depends heavily on whether the underlying WinCE uses the ARM FPU, because in most scenarios floating point operations are ultimately delegated to it.
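To see which case a given device falls into, one option is to time the two categories separately. The following is only a rough sketch: the class and method names (FloatTiming, AddLoop, SinLoop) and the iteration count are arbitrary choices for illustration, and Environment.TickCount is a coarse timer, so the numbers are indicative at best.

```csharp
using System;

static class FloatTiming
{
    // Category 2: repeated float addition (the IL add opcode on floats).
    public static float AddLoop(int iterations)
    {
        float sum = 0.0F;
        for (int i = 0; i < iterations; i++)
        {
            sum += 0.5F;
        }
        return sum;
    }

    // Category 1: repeated System.Math library calls.
    public static double SinLoop(int iterations)
    {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++)
        {
            sum += Math.Sin(i);
        }
        return sum;
    }

    static void Main()
    {
        const int N = 1000000; // arbitrary; adjust for the device under test

        int start = Environment.TickCount;
        float addResult = AddLoop(N);
        int addMs = Environment.TickCount - start;

        start = Environment.TickCount;
        double sinResult = SinLoop(N);
        int sinMs = Environment.TickCount - start;

        // Print the results too, so the loops cannot be optimized away.
        Console.WriteLine("float add: {0} ms (result {1})", addMs, addResult);
        Console.WriteLine("Math.Sin:  {0} ms (result {1})", sinMs, sinResult);
    }
}
```

Running the same program on devices with and without FPU support in the OS would show how much of the cost comes from emulation. The absolute numbers depend entirely on the hardware and the OS image.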

The whole thing can be summarized in the following table (courtesy Brian Smith):

| | #1 CE is NOT FPU optimized (WinCE6 + ARMv4i) | #2 CE is FPU optimized (WinCE6 + ARMv6_FP) | #3 CE is FPU optimized and so is NETCF, e.g. Zune JIT (WinCE6 + ARMv6_FP) |
|---|---|---|---|
| System.Math library calls | Delegated via pinvoke, emulated within CE (speed: slow) | Delegated via pinvoke, FPU instructions within CE (speed: medium) | Delegated via pinvoke, FPU instructions within CE (speed: medium) |
| FP IL opcodes (add, etc) | Delegated via JIT worker, emulated within CE (speed: slow) | Delegated via JIT worker, FPU instructions within CE (speed: medium) | FPU instructions inlined by JITed code (speed: fast) |

#1 is the general case for NETCF + WinCE + ARM. #3 is the current scenario of NETCF + Zune + ARM.

#2 is based on the fact that WinCE 6.0 supports a “pluggable” FP library. However, the NETCF team has not tried out this flavor and hence cannot give any guidance on whether plugging an FP library into WinCE really improves performance, though in theory it seems likely.

#3 today is only for Zune. Going forward, it seems likely that newer versions of WinCE will update their base supported HW spec to include an FPU, and this feature will then also make it into the base NETCF for WinCE.
