Thursday, June 04, 2020

Being Brown

I actually did not realize I am brown until I heard one day from my daughter that kids in her class call her and others of Indian subcontinent origin "coconut": brown on the outside, white inside. Brown due to her skin color and white because of her personality and thought process. I was not equipped to tell her how to handle it. Later I learnt the school principal and teacher reached out to her and used her as an example of how she handles it, not just for herself but also when she sees others doing something similar. No wonder she got elected as student body president. Even if I fail to be a great Black community ally, I am sure I have raised a good one.

I will not pretend that I know about the plight Black people in the US face day to day, but I know enough to stop and ask. I want to be an ally and I understand I may be doing things wrong and I would love to hear about it, learn and make things better. I have been a software engineer and now a manager, and have been in the software industry for close to 20 years. So if someone feels that they can use my help, or wants advice based on the limited experience I have in software development and being successful in a software career, my twitter DMs are open (@abhinaba) or please connect on linkedin. Please reach out and I will try to do whatever is in my power to make things better.

Being Brown has been interesting. The first time I faced racism, I didn't even realize it at the beginning. I was sitting in a small diner in a rural corner of the US. I felt I had been sitting at the bar for a long time waiting for someone to take my breakfast order. Suddenly a gray haired gentleman sitting beside me started screaming at the waitress that he had had enough and she should take my order. The gentleman told a confused me that she was intentionally ignoring me and serving everyone around, and that he had been to the same restaurant and had seen her do the same with others. I learnt what being an ally is.

When I was telling a story to a colleague about being stopped on the highway and police approaching the car with hands on holster, the colleague uncomfortably told me that most likely I was perceived to be Black from a distance. I was let go with just a casual chat as I was not breaking any law. I worry a Black person most likely would get way more than that chat.

15 years in the US now, I see a reverse turn. I feel as if I have brown-privilege, at least in the Pacific Northwest or other large urban locations. The infrequent profanity hurled from a car, or people sneering, still continues, but overall I feel the privilege. The Uber driver or someone I meet in a restaurant or a bar automatically tells me, "so you must be good in maths" or "do you work in software". I get stereotyped as an Indian (brown skinned) engineer. While it might seem good, all stereotyping is bad. I immediately start thinking: will this same person clutch their bags tight on seeing a Black person, will they automatically make judgements on intelligence based on hair color? Most likely they would. Also such stereotyping has an immense negative effect on people's careers. Just the other day I was reading a story from our Azure CVP on how, being Indian, he was told to work on technical stuff and not business stuff.

As a recent immigrant American, I am learning about being a Black community ally, but my promise is I will try my best. Black lives matter. I am here to help.

Wednesday, May 06, 2020

Using Visual Studio Codespaces

One of the pain points we face with remote development is having to go through a few extra hops to get to our virtual dev boxes. Many of us use Azure VMs for development (in addition to local machines) and our security policy is to lock down all VMs to our Microsoft corporate network.

So to ssh or rdp into an Azure VM for development, we first connect over VPN to the corporate network, then use a corpnet machine to log in to the VMs. That is painful, and more so now when we are working remotely.

This is where the newly announced Visual Studio Codespaces comes in. Basically it is a hosted vscode in the cloud. It runs beautifully inside a browser and best of all comes with full access to the linux shell underneath. Since it is run as a service and secured by the Microsoft team building it, we can simply use it from a browser on any machine (obviously over two-factor authentication).

At the time of writing this post, the cost is around $0.17 per hour for 4 core/8GB which brings the price to around $122 max for the whole month. Codespaces also has a snooze feature. I use snooze after one hour of no usage. This does mean additional startup time when you next login, but saves even more money. In between snooze the state of the box is retained.
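The arithmetic behind that estimate can be sketched in a few lines of Go. This is only an illustration: the function names are made up, a 30 day month is assumed, and the usage-based estimate assumes a snoozed instance costs nothing, which ignores any storage charges.

```go
package main

import "fmt"

// maxMonthlyCost returns the cost of keeping an instance active around
// the clock for the given number of days.
func maxMonthlyCost(hourlyRate float64, days int) float64 {
	return hourlyRate * 24 * float64(days)
}

// costWithSnooze returns the cost when the instance is active for only
// activeHoursPerDay. Assumption: snoozed time is free.
func costWithSnooze(hourlyRate, activeHoursPerDay float64, days int) float64 {
	return hourlyRate * activeHoursPerDay * float64(days)
}

func main() {
	fmt.Printf("always on:       $%.2f\n", maxMonthlyCost(0.17, 30))
	fmt.Printf("8 hours per day: $%.2f\n", costWithSnooze(0.17, 8, 30))
}
```

At $0.17 per hour, the always-on figure matches the roughly $122 quoted above, while eight active hours a day would come to about a third of that.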

While just being able to use the IDE on our code base is cool in itself, having access to the shell underneath is even cooler. Hit Ctrl+` in vscode to bring up the terminal window.

I then sync'd my linux development environment from, installed required packages that I need. Finally I have a full shell and IDE in the cloud, just the way I want it.

To try out Codespaces head to

Sunday, April 26, 2020

Managing Baremetal blades in Azure

In this post I give a brief overview of how we run the control plane of BareMetal Compute in Azure that powers the SAP HANA Large Instances (in-memory database running on extremely high memory machines). We support different types of BareMetal blades that go all the way up to 24TB RAM, including special memory support like Intel Optane (persistent memory). Unlike Virtual Machines, customers get full access to the BareMetal physical machine with root access, but still behind network level security sandboxing.

A few years back when we started the project we faced some daunting challenges. We were trying to get custom built, SAP HANA certified bare-metal machines into Azure DCs, fit them in standard Azure racks, manage them at scale and expose control knobs to customers inside Azure Portal. These are behemoths going up to 24TB RAM, and based on size different OEMs were providing us the blades, storage, networking gear and fiber-channel devices.

We quickly realized that most of the native Azure compute stack will not work because it is built with design assumptions that do not hold for us.
  1. Azure fleet nodes or blades are built to Microsoft specification and have common denominator management API surface and monitoring, but we were bringing in disparate externally certified HW that did not meet us there
  2. Our model needed to provide customers with full root access to the bare metal blades, and they were not isolated behind a hypervisor
  3. The allocator and other logic in Azure was not location aware. E.g. We had custom NetApp storage literally placed beside the high memory compute for very low latency, high throughput usage that is required by the SAP HANA in memory databases
  4. We had storage and networking requirements in terms of uptime, latency and throughput that were not met by standard Azure storage and networking, hence we had to build our own.
  5. We differed in basic layout from Azure, e.g. our blades do not have any local storage and everything runs off remote storage, we had different NW topology (many HW NICs per blade with very different I/O requirements).
Obviously we had to re-build the full cloud stack but with much more limited resources. Instead of 100s and 1000s of engineers we had a handful. So we set down a few guiding principles
  1. Be frugal on resourcing
  2. Rely on external services instead of trying to build in-house
  3. Design for maintainability
Finally 3 years in, we can see that many of our decisions and designs are standing the test of time. We have expanded to 10s of regions around the world and added numerous scenarios, but at the same time never had to significantly scale our dev resources.


While it might seem obvious to a lot of people building these kinds of services, it was an unlikely choice for a Microsoft service. Our stack is built on Kubernetes (or rather Azure Kubernetes Service), go-lang, fluentd and similar open source software. Also we stand on the shoulders of giants: we did not have to invent many core areas because they come for free inside Azure, like RBAC, cross region balancing etc.

At the high level our architecture looks as follows

Customer Experience

The customer interacts with our system using either the Azure Portal (screenshot above), the command line tools or the SDK. We build extensions to the Azure portal for our product sub-area. All resources in Azure are exposed using standardized RESTful APIs. We publish the swagger spec here and the CLI and SDK are generated out of those.

In any case all interactions of the customer are handled by the central Azure Resource Manager (ARM). It handles authentication, RBAC, etc. Every resource type in Azure is handled by a corresponding resource provider. In our case it is the HANA or BareMetal RP (BmRP). ARM knows (via data the BmRP provides back to it) how to forward calls that it gets from customers to a particular regional instance of BmRP.

The regional Resource Provider or RP

If we are in N Azure regions then the BmRP (resource provider) is deployed in N instances (one in each region) and it runs on Azure Kubernetes Service (AKS). BmRP is built mostly using go-lang and engineered through Azure DevOps. We have an automated build pipeline for the RP and single click (maybe a few clicks) deployment. We use Helm to manage our deployment.

The service itself is stateless and the state is stored in Azure SQL Server. We use both structured data and document-DB style json. All data is replicated remotely to one more region, and we configure automated backups for disaster recovery scenarios.

We do not share any state across the RP instances. We are particular about ensuring that every regional instance can completely work on its own. This is to ensure that a regional outage does not affect any other region.

Each instance of the RP in turn manages one or more clusters of bare-metal machines. Each cluster is managed by an instance of a cluster manager (CM). All communication between the RP and the cluster manager is via two Azure service-bus-queues (SBQ): one from the BmRP to the CM and the other in the reverse direction. BmRP issues various commands (JSON messages) to the CM through one SBQ and gets back responses from the CM via the other SBQ.
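To make the command and response flow concrete, here is a minimal Go sketch of what such JSON messages could look like. The type names, field names and the handle function are all hypothetical and are not the actual BmRP message format; the queues themselves are left out and only the encode and decode step on each side is shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Command is a hypothetical envelope the RP could put on the RP-to-CM queue.
type Command struct {
	ID       string `json:"id"`       // correlation ID echoed in the response
	Action   string `json:"action"`   // generic operation, e.g. "reboot"
	Resource string `json:"resource"` // target device, e.g. a blade name
}

// Response is the matching hypothetical envelope on the CM-to-RP queue.
type Response struct {
	ID      string `json:"id"`
	Success bool   `json:"success"`
	Message string `json:"message,omitempty"`
}

// handle simulates the CM side: decode a command, pretend to act on it,
// and encode a response that carries the same correlation ID.
func handle(raw []byte) ([]byte, error) {
	var cmd Command
	if err := json.Unmarshal(raw, &cmd); err != nil {
		return nil, fmt.Errorf("decode command: %w", err)
	}
	resp := Response{
		ID:      cmd.ID,
		Success: true,
		Message: cmd.Action + " accepted for " + cmd.Resource,
	}
	return json.Marshal(resp)
}

func main() {
	raw, err := json.Marshal(Command{ID: "42", Action: "reboot", Resource: "blade-7"})
	if err != nil {
		panic(err)
	}
	out, err := handle(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("command: ", string(raw))
	fmt.Println("response:", string(out))
}
```

Carrying a correlation ID in both messages is one simple way to match a response on the return queue to the command that caused it.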

We pump both metrics (hot path) and logs (warm path) into our Azure wide internal telemetry pipeline called Jarvis. We then add backend alerts and dashboards on the metrics for near realtime alerting, and in some cases also on the logs (using logs-to-metrics). The data is also ingested into Azure Kusto (aka Data Explorer), which is a log analytics platform. The alerts tell us if something has gone wrong (severity two and above alerts ring on-call phones) and then we use the logs in Jarvis or Kusto queries to debug.

Also it's good to call out that all control flow across pods and across the services is encrypted in transit over nginx and linkerd. The data that we store in SQL Server is also obviously encrypted in transit and at rest.

Tech usage: AKS, Kubernetes, linkerd, nginx, helm, linux, Docker, go-lang,  Python, Azure Data-explorer, Azure service-bus-queue, Azure SQL Server, Azure DevOps, Azure Container Repository, Azure Key-vault, etc.

Cluster Manager

We have a cluster-manager (CM) per compute cluster. The cluster manager runs on AKS. The AKS vnet is connected over Azure Express Route into the management VLAN for the cluster that contains all our compute, storage and networking devices. All of these devices are in a physical cluster inside Azure Data-center.

We wanted to ensure that the design is such that the cluster manager can be implemented with a lot of versatility and can evolve without dependency on the BmRP. The reason is we imagined one day the cluster-manager could also run inside remote locations (edge sites) and we weren't sure if we could call into the CM from outside or have some other sort of persistent connectivity. So we chose Azure Service Bus Queue for communication, using simple JSON command and response messages going between BmRP and CM. This only requires that the CM can make GET calls on the SBQ end point and nothing more to talk to BmRP.

The cluster manager has two major functions

  1. Provide a device agnostic control abstraction layer to the RP
  2. Monitor various devices (compute, storage and NW) in the cluster


Instead of having our BmRP know specifics of all types of HW in the system, it works on an abstraction. It expects basic generic CRUD type of operations being available on those devices and issues generic commands which then the cluster-manager translates to device specific actions.

It is easier to follow how things work if we take one specific user workflow. Say a customer wants to reboot their BareMetal blade for some reason (an Update category operation). For this the customer hits the reboot button for their blade in Azure Portal and the Portal calls into Azure Resource Manager (ARM). ARM calls into BmRP's REST API for the same. BmRP drops the reboot blade command into the service bus queue. Finally one of the replicas of the cluster-manager container that is listening for those commands picks the command up. Now for various memory sizes and types we use different blades supplied by different OEMs. Some of those blades can be controlled remotely via Redfish APIs, some support ipmi commands, some even proprietary REST APIs. The job of the cluster-manager is to take the generic reboot command, identify the exact type of the blade the command is for, and then issue blade specific control commands. It then takes the response and sends it back to the RP as an ack.

Similarly for storage it can handle a get-storage-status command by servicing it with an ONTAP (NetApp storage manager) REST API call.
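The translation from a generic command to a device specific action can be sketched in Go with an interface per device family. Everything here is illustrative: the interface, the type names and the sample inventory are assumptions, and the controllers are stubs that only return a description instead of making real Redfish or ipmi calls.

```go
package main

import "fmt"

// BladeController is a hypothetical device-agnostic interface; each
// implementation would wrap one vendor protocol.
type BladeController interface {
	Reboot(blade string) (string, error)
}

// redfishController and ipmiController are stubs: they describe the
// action instead of talking to real hardware.
type redfishController struct{}

func (redfishController) Reboot(blade string) (string, error) {
	return "redfish reset sent to " + blade, nil
}

type ipmiController struct{}

func (ipmiController) Reboot(blade string) (string, error) {
	return "ipmi power cycle sent to " + blade, nil
}

// ClusterManager maps each blade to its hardware type and each hardware
// type to the controller that knows how to drive it.
type ClusterManager struct {
	bladeType   map[string]string
	controllers map[string]BladeController
}

// Reboot translates the generic command into a device specific call.
func (cm *ClusterManager) Reboot(blade string) (string, error) {
	kind, ok := cm.bladeType[blade]
	if !ok {
		return "", fmt.Errorf("unknown blade %q", blade)
	}
	ctrl, ok := cm.controllers[kind]
	if !ok {
		return "", fmt.Errorf("no controller for blade type %q", kind)
	}
	return ctrl.Reboot(blade)
}

// newSampleManager builds a manager with a made-up two blade inventory.
func newSampleManager() *ClusterManager {
	return &ClusterManager{
		bladeType: map[string]string{"blade-1": "redfish", "blade-2": "ipmi"},
		controllers: map[string]BladeController{
			"redfish": redfishController{},
			"ipmi":    ipmiController{},
		},
	}
}

func main() {
	cm := newSampleManager()
	for _, blade := range []string{"blade-1", "blade-2", "blade-9"} {
		ack, err := cm.Reboot(blade)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ack:", ack)
	}
}
```

With this shape, supporting a new OEM blade means adding one more controller implementation; the caller that issues the generic command does not change.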

Telemetry and Monitoring

We use a fluentd based pipeline for actively monitoring all devices in the cluster. We use active monitoring, which means that we not only listen to the various events generated by the devices in the cluster, but also call into these devices to get more information.

The devices in our cluster use various types of event mechanisms. We configure these various types of events, like syslog, snmp and redfish events, to be sent to the load-balancer of the AKS cluster. When one of the fluentd end-points gets an event, it is sent through a series of input plugins. Some of the plugins filter out noisy events we do not care about, and some of the plugins call back into the device that generated the event to get more information and augment the event.

Finally the output plugins send the result into our Jarvis telemetry pipeline and other destinations. Many of the plugins we use or have built are open sourced. Some of the critical events, like a blade power-state change (reboot), are also sent back through the service-bus-queue to the BmRP so that it can store blade power-state information in our database.

Generally we rely on events being sent from devices to the fluentd end point for monitoring. Since many of these events come over UDP and the telemetry pipeline itself is running on a remote (from the actual devices) AKS cluster, we expect some of the events to get dropped. To ensure high reliability we therefore also have backup polling. The cluster manager periodically reaches out to the equipment in the cluster, using whatever APIs that equipment supports, to get its status and fault-state (if any). Between the events and backup polling we have both near real-time as well as reliable coverage.
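The interplay between pushed events and backup polling can be sketched in Go as a small state tracker. This is a simplified illustration under assumptions: the Tracker type and its methods are hypothetical, device state is reduced to a string, and the actual event transport and polling timer are left out.

```go
package main

import "fmt"

// Tracker keeps the last known state per device.
type Tracker struct {
	state map[string]string
}

func NewTracker() *Tracker {
	return &Tracker{state: make(map[string]string)}
}

// OnEvent records a state pushed by the device (fast path, may be dropped).
func (t *Tracker) OnEvent(device, state string) {
	t.state[device] = state
}

// OnPoll records a polled state and reports whether it differed from what
// the events had told us so far, which suggests an event was missed.
func (t *Tracker) OnPoll(device, state string) bool {
	prev, known := t.state[device]
	t.state[device] = state
	return !known || prev != state
}

// State returns the last known state of a device ("" if never seen).
func (t *Tracker) State(device string) string {
	return t.state[device]
}

func main() {
	tr := NewTracker()
	tr.OnEvent("blade-1", "on")
	fmt.Println("poll found a change:", tr.OnPoll("blade-1", "on"))
	// Suppose the "off" event was lost in transit; the next poll catches it.
	fmt.Println("poll found a change:", tr.OnPoll("blade-1", "off"))
	fmt.Println("current state:", tr.State("blade-1"))
}
```

Events keep the view fresh in the common case, while the poll bounds how long a dropped event can leave the view stale.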

Whether from events or from polling, all telemetry is sent to our Jarvis pipeline. We then have Jarvis alerts configured on these events. E.g. if a thermal event occurs and a storage node's or blade's temperature goes over a threshold, the alert will fire. Same thing for cases like a blade crashing due to HW issues.

Tech usage: AKS, Kubernetes, linux, Docker, fluentd, Ruby, go-lang, Python, Azure service-bus-queue, snmp, ipmi, Redfish, ONTAP, syslog, etc.

Scaling and Reliability

Our front-door is Azure Resource Manager (ARM) through which all users come in. ARM has served us well and provides us with user-authentication, throttling, caching and regional load balancing. It forwards user calls for blades in one region to the RP in that region. So as we add more regions we simply deploy a new instance of the BmRP for that region, register with ARM and scale. 

Even in the same region, as we land more infra we put it in new clusters with about 100-ish blades and their corresponding storage. Since each cluster has its own cluster manager with a shared nothing model, the cluster manager scales along with every cluster we land. As we add more scenarios in the cluster manager, though, we need to scale the manager itself.

Scaling the cluster manager itself is also trivial in our design. It either gets event traffic from devices or commands from BmRP over SBQ that it then executes. The events come through the load-balancer of the AKS cluster and hence just increasing the replica count of those containers works. Commands coming from the BmRP arrive over the service bus queue and all these containers listen on the same queue, so when we add more replicas of these containers the worker count goes up.
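The "more replicas means more workers on the same queue" idea is the competing consumers pattern. Below is a minimal Go sketch that uses an in-process channel as a stand-in for the service bus queue and goroutines as a stand-in for container replicas; the processAll function and the sample commands are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll drains commands with the given number of workers that all
// listen on the same queue, and returns how many commands each handled.
func processAll(commands []string, replicas int) []int {
	if replicas < 1 {
		replicas = 1 // at least one worker, otherwise nothing drains the queue
	}
	queue := make(chan string)
	handled := make([]int, replicas)
	var wg sync.WaitGroup
	for w := 0; w < replicas; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range queue {
				handled[id]++ // each worker only touches its own slot
			}
		}(w)
	}
	for _, c := range commands {
		queue <- c // whichever worker is free picks the command up
	}
	close(queue)
	wg.Wait()
	return handled
}

func main() {
	commands := []string{"reboot blade-1", "status storage-1", "reboot blade-2", "status blade-3"}
	for _, replicas := range []int{1, 3} {
		fmt.Println(replicas, "replica(s), commands handled per worker:", processAll(commands, replicas))
	}
}
```

How the commands are split across workers varies from run to run; the point is that each command is handled exactly once and adding workers needs no change on the sending side.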

Since we use a separate SBQ pair per CM, adding a new CM automatically means a new SBQ pair gets created for it. The only concern could be that we add so many new scenarios in the CM that the traffic between one CM and BmRP goes up enough to cause a bottleneck in the SBQ. However, SBQ itself can be scaled up to handle it (we have never had to do that), or worst case we may have to add more SBQs, sharding by types of commands, and scale horizontally. To be honest this is not something I think will ever happen.

With all of the above we have been able to meet 99.95% uptime for our control plane and keep latencies under our target. The only place where we hit issues is connectivity with our SQL DB. We had to upscale the DB SKU in the past. We continue to infrequently hit timeouts and other issues at the DB. At one point we were talking about moving to CosmosDB as it is touted to be more reliable, but most likely we will invest in some sort of caching in the future. That can either be by deploying some sort of cache engine in the BmRP itself or, most likely, by using Azure Redis Cache (see principles at the top of this post).

Tuesday, February 18, 2020

System Engineering Guidelines

While building our system that powers memory intensive compute in Azure we use the following engineering guidelines. We use these guidelines to build our BareMetal resource provider, cluster manager etc. These are useful principles we have accumulated from experience building various systems. What other principles do you use and recommend including?
  1. Close on broad design before sending PRs.
    1. Add design as markdown to root of the feature code path and discuss it as a PR. For broad cross RP feature it is ok to place it in the root
    2. For sizable features please have a design meeting
    3. Requirement for a design meeting is to send a pre-read and expectation is all attendees have reviewed the pre-read before coming in
  2. Be aware of distributed system quirks
    1. Think CAP theorem. This is a distributed system, network partition will occur, be explicit about your availability and consistency model in that event
    2. All remote calls will fail, have re-tries that use exponential back-off. Log a warning on re-tries and an error if it finally fails
    3. Ensure we always have consistent state. There should be only 1 authoritative version of truth. Having local data that is eventually consistent with this truth is acceptable. Know the max time-period for eventual consistency
  3. System needs to be reliable, scalable and fault tolerant
    1. Always avoid SPOF (Single Point of Failure), even for absolutely required resources like SQLServer consider retrying (see below), gracefully fail and recover
    2. Have retries
    3. APIs need to be responsive and return in sub second for most scenarios. If something needs to take longer, immediately return with a mechanism to track progress on the background job started
    4. All APIs and actions we support should have a 99.9% uptime/success SLA. Shoot for 99.95%
    5. Our system should be stateless (have state elsewhere in data-store) and designed to be cattle and not pets
    6. Systems should be horizontally scalable. We should be able to simply add more nodes to a cluster to handle more traffic
    7. Choose to use a managed service over attempting to build it or deploy it in-house
  4. Treat configuration as code
    1. Breaks due to out of band config changes are too common. So consider config deployment the same way as code deployment (Use SCD == Safe Config/Code Deployment)
    2. Config should be centralized. Engineers shouldn't be hunting around to look for configs
  5. All features must have feature flag in config.
    1. The feature flag can be used to disable features in per region basis
    2. Once a feature flag is disabled the feature should cause no impact to the system
  6. Try to make sure your system works on a single box. This makes dev-test significantly easier. Mocking auxiliary systems is OK
  7. Never delete things immediately
    1. Don't delete anything instantaneously, especially data. Tombstone deleted data away from user view
    2. Keep data, metadata, machines around for a garbage collector to periodically delete at configurable duration.
  8. Strive to be event driven
    1. Polling is bad as the primary mechanism
    2. Start with event driven approach and have fallback polling
  9. Have good unit tests.
    1. All functionality needs to ship with tests in the same PR (no test PR later)
    2. Unit tests test the functionality of units (e.g. class/modules)
    3. They do not have to test every internal function. Do not write tests for tests' sake. If a test covers all scenarios exposed by a unit, it is OK to push back on comments like "test all methods".
    4. Think about what your unit implements and whether the test can validate that the unit is still working after any changes to it
    5. Similarly if you add a reference to a unit from outside and depend on a behavior, consider adding a test to the callee so that changes to that unit don't break your requirements
    6. Unit test should never call out from dev box, they should be local tests only
    7. Unit test should not require other things to be spun up (e.g. local SQL server)
  10. Consider adding BVT to scenarios that cannot be tested in unit tests.
    E.g. stored procs need to run against real SqlDB deployed in a container during BVT, or test query routing that needs to run inside a web-server
  11. All required tests should be automatically run and not require humans to remember to run them
  12. Test in production via our INT/canary cluster. Some things simply cannot be tested on dev setup as they rely on real services to be up. For these consider testing in production over our INT infra.
    1. All merges are automatically deployed to our INT cluster
    2. Add runners to INT that simulate customer workloads.
    3. Add real lab devices or fake devices that test as much as possible. E.g. add fake snmp trap generator to test fluentd pipeline, have real blades that can be rebooted using our APIs periodically
    4. Bits are then deployed to Canary clusters where there are real devices being used for internal testing, certification. Bake bits in Canary!
  13. All features should have measurable KPIs and metrics.
    1. You must add metrics against new features. Metrics should tell how well your feature is working, if your feature stops working or if any anomaly is observed
    2. Do not skimp on metrics, we can filter metrics on the backend rather than not having them fired
  14. Copious logging is required.
    1. Process should never fail silently
    2. You must add logs for both success and failure paths. Err on the side of too much logging
  15. Do not rely on text logs to catch production issues.
    1. You cannot rely on too many error logs from a container to catch issues. Have metrics instead (see above)
    2. Logs are a way to root-cause and debug and not catch issues
  16. Consider on-call for all development
    1. Ensure you have metrics and logs
    2. Ensure you write good documentation that anyone in the team can understand without tons of context
    3. Add alerts with direct link to TSGs
    4. Add actionable alerts where the on-call can quickly mitigate
    5. On-call should be able to turn off specific features in case it is causing problems in production
    6. All individual merges can be rolled back. Since you cannot control when code snap for production happens the PRs should be such that it can be individually rolled back
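
As an illustration of the retry guideline above (exponential back-off, a warning on each retry, an error on final failure), here is a minimal Go sketch. The retry function, its parameters and the log message format are made up for this example, and it leaves out refinements such as jitter or a maximum delay.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// errTransient stands in for a failed remote call in this example.
var errTransient = errors.New("transient failure")

// retry calls op up to attempts times, doubling the delay after each
// failure. It logs a warning per retry and an error if the last attempt fails.
func retry(attempts int, initial time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := initial
	for i := 1; i <= attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts {
			break // no point sleeping after the final attempt
		}
		log.Printf("WARN attempt %d/%d failed: %v; retrying in %v", i, attempts, err, delay)
		time.Sleep(delay)
		delay *= 2
	}
	log.Printf("ERROR giving up after %d attempts: %v", attempts, err)
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail the first two calls
		}
		return nil
	})
	fmt.Println("calls:", calls, "error:", err)
}
```

Wrapping the last error with %w keeps the original cause available to callers through errors.Is and errors.As.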

Saturday, February 15, 2020

The C Word - Part 2

Our confrontation with Lymphoma started in 2011, when the C word entered our life and my wife got diagnosed with Stage 4 Hodgkins sclerosing lymphoma. We have fought through and continue to do so. Even though she is in remission now, the shadow of Cancer still hangs over us.

We had just moved across the world from India to the US in 2010 and had no family and very few friends around. We had to mostly duke it out ourselves with little support. We were able to get the best treatment available in the world through the Fred Hutch and Seattle Cancer Care Alliance. However, we do realize not everyone is fortunate to be able to do so.

Our daughter has decided to do her part now and raise funds through the Leukemia & Lymphoma Society. If you'd like to help her please head to

Sunday, January 19, 2020

Chobi - Face Detection Based Static Image Gallery Generator

Over the holidays I created a simple static photo gallery generator that I named chobi. The sources are in

Given a source folder of photos, chobi will generate a destination folder containing the original photos, generated thumbnails, css stylesheets, scripts and html files which constitute a website displaying those photos as follows

The Problem

While building chobi and also when I was looking into similar online tools, I kept hitting a major issue and that prompted further work and this post.

You see thumbnail generation from image has a major problem. I needed the gallery generator to create square thumbnails for the image strip shown at the bottom of the page. However, the generated thumbnails would simply be either from the center or some other arbitrary location. This meant that the thumbnails would cut off at weird places.

Consider the following image.

If I create a thumbnail from a tool without knowing where the face is in the image, it will generate something like
Obviously that doesn't work. So using my vanilla generator I got a website with all sorts of similar thumbnails with heads chopped off (marked in Red)

The Solution

Chobi uses face-detection to ensure that does not happen and the face is always fully present in the generated thumbnail. Consider the same thumbnail as above but now with face-detection

Another example of an original image and then generated thumbnail first without and then with face-detection.

With this face-detection plugged in chobi generates a much better web-site, with almost no photo cropped where it shouldn't be.
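
The core idea, picking the crop window based on where the face is rather than always cropping from the center, can be sketched in Go as follows. This is not the actual code from chobi or its thumbnail library: the squareCrop function and the example rectangles are made up, the face rectangle is assumed to come from a detector such as pigo, and the sketch simply takes the largest square that fits the image and slides it toward the face.

```go
package main

import (
	"fmt"
	"image"
)

// Example inputs (made up): a portrait photo with a face near the top,
// and a landscape photo with a face near the right edge.
var (
	portraitBounds  = image.Rect(0, 0, 400, 600)
	portraitFace    = image.Rect(150, 20, 250, 120)
	landscapeBounds = image.Rect(0, 0, 600, 400)
	landscapeFace   = image.Rect(480, 100, 560, 180)
)

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// squareCrop returns the largest square inside bounds, centered on the
// face as far as the image edges allow.
func squareCrop(bounds, face image.Rectangle) image.Rectangle {
	side := bounds.Dx()
	if bounds.Dy() < side {
		side = bounds.Dy()
	}
	cx := (face.Min.X + face.Max.X) / 2
	cy := (face.Min.Y + face.Max.Y) / 2
	// Center on the face, then pull the square back inside the image.
	x0 := clamp(cx-side/2, bounds.Min.X, bounds.Max.X-side)
	y0 := clamp(cy-side/2, bounds.Min.Y, bounds.Max.Y-side)
	return image.Rect(x0, y0, x0+side, y0+side)
}

func main() {
	p := squareCrop(portraitBounds, portraitFace)
	fmt.Println("portrait crop: ", p, "contains face:", portraitFace.In(p))
	l := squareCrop(landscapeBounds, landscapeFace)
	fmt.Println("landscape crop:", l, "contains face:", landscapeFace.In(l))
}
```

For the portrait example a plain center crop would cover rows 100 to 500 and cut through the face, whereas the face-aware crop moves up to the top of the image. The resulting square would then be scaled down to the thumbnail size.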


  1. Chobi sources are at
  2. It uses the face-detected thumbnail generator which I wrote at
  3. That in turn is based on pigo, a face detection library written fully in go


Check out a gallery built using chobi at 

Monday, January 13, 2020

How to run Windows 7 after end of support

Windows 7 end of support is upon us in 1 more day (1/14/2020). This post tries to answer the question of whether you can safely continue to run it. The short answer is that you can't, at least if it is connected to the outside in some form.

However, I have a friend back in India who has some software that he relies on and he can't run it on modern Windows. So when I was answering his question on how he can run it, I thought I'd write it up in the blog as well.

This post outlines how you can run Windows 7 in a virtual machine running on Microsoft Hyper-V on a Windows 10 machine. The process also uses checkpoints to reset the VM back to the old state each time. This ensures that even if something malicious gets hold of the system, you can simply go back to the pristine state you started with.


Obviously you need a computer capable of running Hyper-V. For my purpose I am using a Windows 10 Professional machine. You should also have more than enough CPU cores and memory to run Windows 7 on that machine. I recommend at least 4 cores and 8GB memory so that you can give half of that to the Windows 7 VM and keep the rest for a functional host PC.

Get hold of Windows 7 ISO or download it from

Then visit the system requirements. I decided to give my VM roughly twice the resources listed in the requirements.

Setup VM

Hit windows-key and type hyper-v to launch. Then start creating a VM by clicking New and then Virtual machine.

Choose Generation 1

I then decided to give it 4GB memory and 2 CPU cores

Choose to create a 40GB OS disk

Then choose to install from a bootable CD and point the location of the image file to the downloaded ISO image

Click through next to end and finish the creation wizard. Then right click on the newly created VM and choose "Connect".

Install Windows 7

At this point if everything went well the VM has booted off the installation ISO and we are on the following screen. Choose "Clean install" and proceed through the installation wizard.

Finally installation starts.

A reboot later we have Windows 7 starting up!

Created a username, password

Finally booted into Windows 7 and here's my website displayed in Internet Explorer

Secure by Checkpoint

Even though we have booted into Windows 7, soon this will be a totally unsupported OS, which means no security updates. This is a dangerous system to keep open to the internet. I recommend never doing that!! Also to be doubly sure, we will create a checkpoint. What that does is create a snapshot of the memory and the disk. So in case something malicious lands in this VM, we can go back to the pristine state when the snapshot was created and hence roll back any changes made by the virus or malware.

To create a checkpoint right click on the VM in Hyper-V Manager and choose Checkpoint

You can see the checkpoints created in the Hyper-V Manager.

Let's make a change to the Windows 7 VM by creating a file named "Howdy I am created.txt" on the desktop.

Since the checkpoint was created before creating the file, I can revert back to the checkpoint by right-clicking on the checkpoint and choosing Apply.

After applying the checkpoint when I go back into the VM, the created file is all gone!!!


This is a hack at best and not recommended. However, if for some application or other need you "have" to run Windows 7, this can be an option.

Monday, January 06, 2020

Chobi - A static photo gallery generator

I love using Microsoft Todo and before taking time off in December I create a holiday todo list. I tend to be at home with the family and do a bunch of projects around the house. I try to ensure that I am not doing only work related projects during that time, so I put in a ceiling of half a week for coding related stuff. Other Todos generally involve carpentry, DIY home projects, yardwork, cleaning etc.

One of the projects was to update my online photo gallery. Now being a programmer I made it way more complicated than I should've. I decided to code up a minimalistic program to generate a static photo gallery out of folders of images I export out of Adobe Lightroom. As I mentioned above, one of the requirements was to finish it in around 3 days.

I am happy to share that I have the project done and the sources are available at It took me about 3 days and most of the time was spent figuring out UI stuff which I rarely do and pondering about which photos to put in the gallery.

The code is in Go and it does the following:

  1. It iterates through a folder of images (sub-dir not supported yet) and copies the images to a destination
  2. Also places thumbnails (configurable size) into the destination
  3. There is a template html that it modifies to display those images
  4. It also uses some client side script to 
    1. Randomize the image order
    2. Show a carousel of the images
    3. A thumbnail gallery at the bottom
    4. Automated photo rotation
Here's a screenshot of the sample landscape gallery.

Since this was a very time-bound project, there is tons of stuff left to do, and some basic bugs abound as well. But I decided to time out on the effort for now and revisit it, hopefully in the spring.

Wednesday, October 09, 2019

CAYL - Code as you like day

Building an enterprise-grade distributed service is like trying to fix and improve a car while driving it at high speed down the freeway. Engineering debt accumulates fast, and engineers in the team yearn for the time to get to it. A common complaint is also that we need more time to tinker with cool features and tech, to learn and experiment.

An approach many companies take is the big hackathon event. Even though these have their place, I think they are mostly for PR and eye candy. Which exec doesn’t want to show the world that their company created an AI-powered blockchain running on a quantum computer in just a 3-day hackathon?

This is where CAYL comes in. CAYL, or “Code As You Like”, is loosely named after the “go as you like” event I experienced as a student in India. In a lot of uniform-based schools in Kolkata, it is common to have a go as you like day, where kids dress up however they want.

Even though we call it code as you like, it has evolved beyond coding. One of our extended Program Management teams has also picked this up and calls it WAYL (Work as you like day). This is what we have set aside in our group calendar for this event.

“Code as you like day” is a reserved date every month (first Monday of the month) where we get to code, document or learn something on our own.
There will be no scheduled work items and no SCRUM.
We simply do stuff we want to do. Examples include, but are not limited to:
  1. Solve a pet peeve (e.g. fix a bug that is not scheduled but you really want to get done)
  2. A cool feature
  3. Learn something related to the project that you always wanted to figure out (how do we use fluentd to process events, what is helm)
  4. Learn something technical (how do go channels work, go assembly code)
  5. Shadow someone from a sibling team and learn what they are working on
We can stay late and get things done (you totally do not have to do that) and there will be pizza or ice-cream.
One requirement is that you *have* to present the next day, whatever you did.  5 minutes each

I would say we have had great success with it. We have had CAYL projects all over the spectrum
  1. Speed up build system and just make building easier
  2. ML Vision device that can tell you which bin trash needs to go in (e.g. if it is compostable)
  3. Better BVT system and cross porting it to work on our Macs
  4. Pet peeves like make function naming more uniform, remove TODO from code, spelling/grammar  etc.
  5. Better logging and error handling
  6. Fix SQL resiliency issues
  7. Move some of our older custom management VMs to AKS
  8. Bring in gomock, go vet, static checking
  9. 3D game where mommy penguin gets fish for her babies and learns to be optimal using machine learning
  10. Experiment with Prometheus
  11. A dev spent a day shadowing a dev from another team to learn the cool tech they are using etc.
We just finished our CAYL yesterday and one of my CAYL items was to write a blog about it. So it’s fitting that I am hitting publish on this blog as I sit in the CAYL presentation while eating kale chips.

Monday, September 30, 2019

Azure Dedicated

I remember a discussion with a group of friends around 8 years back. Microsoft was in its early days of becoming a leader in the cloud. Those friends, all techies in the Seattle area, had varying expectations of how it would work out. Many thought that a full-blown move was a few decades away, because their experience indicated that all big companies ran on a stack that was very old and simply couldn’t be moved to the cloud any time soon.

Waves of workloads have since been moving to the cloud. A new breed of startups was cloud native from the start, and they were the first to use the power of the cloud. Many large and small enterprises had already virtualized their workloads, and they moved as well. Some moved their new workloads (green-field), and some even followed lift-n-shift with some modifications (brown-field) into the cloud.

However, a class of large enterprises was stuck in its data centers. They wanted to use the power of the cloud: IoT integration, machine learning and elastic growth of their applications. But the center of their systems was running on some stack that did not run in the standard virtualization offered in the cloud. These enterprises said that if they could not move those workloads into the cloud, they would need to keep the lights on in their data centers, and moving some peripheral workloads simply did not make sense.

This is where Azure Dedicated and we come into the picture.

SAP HANA Large Instance

For some of these customers that #$%#@ is the SAP HANA in-memory DB on a single machine with 768 vCPUs and 24 terabytes of RAM (yup), and we have them covered. Some wanted to scale those out to 60 terabytes in memory, and we have them covered too with our bare-metal machines running in Azure. See SAP HANA Large Instances on Azure. They then wanted to expand their applications elastically using VMs running on Azure with sub-1 ms latency to those bare-metal DB machines, and we have that working too.

We started our journey in this area with this workload. Now we have evolved into our own little organization in Azure called Azure Dedicated and also support the following workloads.

Azure VMware Solutions

Some customers wanted to run their VMware workloads and we have two offers for them, see more about Azure VMware Solution by CloudSimple and Virtustream here

Hardware Security Modules

In partnership with other teams in Azure we support HSMs, which are standard cryptographic appliances powering, say, financial institutions.

Cray Supercomputer?

So you need to simulate something or do ML on tens of thousands of cores? We have Cray supercomputers running inside Azure for that!!

Azure NetApp Files

Working closely with the storage team, we deliver demanding file-based workloads running on Azure NetApp Files.


In partnership with SkyTap we provide IBM Power workloads on Azure to customers.

What next?

We know there are more such anchors holding back enterprises from moving into the cloud. If you have some ideas on what we should take on next, please let me know in the comments!

Wednesday, October 10, 2018

SAP HANA Large Instances on Azure


Over the past year I have been working to light up bare-metal machines on Azure Cloud. These are specialized bare-metal machines that have extremely high amounts of RAM and CPU and, in this particular case, are purpose-built to run the SAP HANA in-memory database. We call them HANA Large Instances and they come certified by SAP (see list here).

So why bare-metal? They are huge, high-performance machines that go all the way up to 24TB RAM (yup) and 960 CPU threads. They are purpose-built for the HANA in-memory database and have the right CPU/memory ratio and high-performance storage to run demanding OLTP + OLAP workloads. Imagine a bank being able to load every credit card transaction of the past 5 years and run analytics, including fraud detection, on a new transaction in a few seconds, or tracking the flow of commodities from the world's largest warehouses to millions of stores and 100s of millions of customers. These machines come with a 99.99% SLA and can be reserved by customers across the world in US-East, US-West, Japan-East, Japan-West, Europe-West, Europe-North, Australia-SouthEast and Australia-East for SAP HANA workloads.

At SAP TechEd and SAPPHIRE I demoed bare-metal HLI machines with standard Azure Portal integration. Right now customers can see their HLI machines in the portal, and coming soon they will even be able to reboot them from the portal.

Portal preview

Click on the screenshot below to see a recorded video of how HANA Large Instances are visible on the Azure portal and how customers can raise support requests from the portal.

Portal screenshot

Reboot Demo

This is something we are working on right now and will be available soon. Click on the screenshot below to see the video of a HANA Large Instance being rebooted from the portal directly.

Getting Access

Customers with HLI blades can run the following CLI command to register our HANA Resource Provider

az provider register --namespace Microsoft.HanaOnAzure

Alternatively, go to your subscription that has HANA Large Instances, select “Resource Providers”, type “Hana” in the search box, and click on Register.



Send them to

Friday, June 01, 2018

Deploy Cloud Dev Box on Azure with Terraform


Summary: See for a terraform based solution to deploy VMs in Azure with full remote desktop access.

Now the longer form :). I have blogged in the past about how to set up an Ubuntu desktop on Azure that you can RDP (remote desktop) into. Over the past few months I have moved to doing most of my development work exclusively on a cloud VM, and I love having the full desktop experience on my customized “Cloud Dev box”. I RDP into it from my dev box at work, Surface Pro, secure laptop, etc.

I wanted to ensure that I can treat the box as cattle and not a pet. So I came up with terraform-based scripts to bring up these cloud dev boxes. I have also shared them with my team in Microsoft, and a few devs are already using them. I hope they will be useful to you as well in case you want something like that. All code is at

A few things about the main terraform script at 

  1. It is a good security practice to ensure that your VM is locked down. I use Azure NSG rules to ensure that the VM denies inbound traffic from the Internet. The script accepts parameters for IP ranges, which are then opened up. This ensures that your VM is accessible only from safe locations; in my case those are the IP ranges of Microsoft (from work) and my home IP address.
  2. While you can use just the TF file and setup script I have a driver script at that you might find useful
  3. Once the VM is created I use remote execution feature of terraform to run the script in to install various software that I need including Ubuntu desktop and xrdp for remote desktop. This takes at least around 10 minutes
  4. By default a Standard_F8s machine is used, but that can be overridden with larger sizes (e.g. Standard_F16s). I have found that machines smaller than that don’t provide adequate performance. Note: You will incur costs for running these biggish VMs
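
To make the locked-down behaviour in item 1 concrete, here is a small sketch of the allow-list idea: inbound traffic is denied unless the source address falls inside one of the supplied IP ranges. It only illustrates the matching logic and is not how Azure NSGs are implemented or how the terraform script configures them. The function name allowed and the sample ranges (documentation-only addresses) are assumptions for the example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowed reports whether src falls inside any of the allowed CIDR ranges.
// Anything that does not match is treated as denied, mirroring a
// deny-by-default inbound rule with a few explicit openings.
func allowed(src string, ranges []string) (bool, error) {
	addr, err := netip.ParseAddr(src)
	if err != nil {
		return false, err
	}
	for _, r := range ranges {
		prefix, err := netip.ParsePrefix(r)
		if err != nil {
			return false, err
		}
		if prefix.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical ranges using documentation-only address blocks,
	// standing in for a work range and a single home IP.
	ranges := []string{"192.0.2.0/24", "198.51.100.7/32"}

	for _, ip := range []string{"192.0.2.44", "203.0.113.9"} {
		ok, err := allowed(ip, ranges)
		if err != nil {
			panic(err)
		}
		fmt.Println(ip, "allowed:", ok)
	}
}
```

A single home address is expressed as a /32 range, which is why the script can take everything as IP ranges.
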


Obviously you need terraform installed. I think the whole system works really well if you launch from because that way all the credential stuff is automatically handled, and cloud shell comes pre-installed with terraform.

If you want to run from any other dev box, you need to have Azure CLI and terraform installed (use script for it). Then do the following, where the subscription Id is that of the subscription under which you want the VM to run.

az login
az account set --subscription="<some subscription Id>"

While you can download the files from here and use them, you would be better off customizing the script and then running it. I use the following to run

curl -O
chmod +x
./ abhinab <password>



Now you can use an RDP client like mstsc to log into the machine.

NOTE: In my experience 1080p resolution works well, while 4K lags too much to be useful. Since the mstsc default is full-screen, be careful if you are working on a hi-res display and explicitly use 1080p resolution.

There I am logged into my cloud VM.


Wednesday, May 16, 2018

Getting Azure Cloud Location


I have gotten some asks on how to discover which Azure cloud the current system is running on. Basically you want to figure out if you are running something in the Azure public cloud or in one of the specialized government clouds.

Unfortunately this is not currently available in the Instance Metadata Service. However, it can be found out using an additional call. The basic logic is to get the current location over IMDS and then call the Azure Management API to see which cloud that location is present in.

Sample script can be found at

locations=`curl -s -H Metadata:True ""`

# Test regions

endpoints=`curl -s` 
publicLocations=`echo "$endpoints" | jq '.cloudEndpoint.public.locations[]'`

# Check each cloud's location list for the current location
if grep -q "$locations" <<< "$publicLocations"; then
    echo "PUBLIC"
    exit 1
fi

chinaLocations=`echo "$endpoints" | jq '.cloudEndpoint.chinaCloud.locations[]'`
if grep -q "$locations" <<< "$chinaLocations"; then
    echo "CHINA"
    exit 2
fi

usGovLocations=`echo "$endpoints" | jq '.cloudEndpoint.usGovCloud.locations[]'`
if grep -q "$locations" <<< "$usGovLocations"; then
    echo "US GOV"
    exit 3
fi

germanLocations=`echo "$endpoints" | jq '.cloudEndpoint.germanCloud.locations[]'`
if grep -q "$locations" <<< "$germanLocations"; then
    echo "GERMAN"
    exit 4
fi

# No cloud listed the current location
echo "Unknown"
exit 0
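
For anyone who would rather do the lookup in Go, the matching step of the script can be sketched as below. It assumes the endpoints document has the shape implied by the jq paths above (cloudEndpoint.<cloud>.locations). The type endpoints, the function cloudFor and the sample JSON are made up for illustration and are not real API output. Unlike grep, it compares whole location names and ignores case.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// endpoints mirrors only the fields the script reads with jq
// (.cloudEndpoint.<cloud>.locations[]). The full response shape
// is not modelled here.
type endpoints struct {
	CloudEndpoint map[string]struct {
		Locations []string `json:"locations"`
	} `json:"cloudEndpoint"`
}

// cloudFor returns the key of the cloud whose location list contains
// location, or an empty string when no cloud lists it.
func cloudFor(doc []byte, location string) (string, error) {
	var e endpoints
	if err := json.Unmarshal(doc, &e); err != nil {
		return "", err
	}
	for cloud, c := range e.CloudEndpoint {
		for _, l := range c.Locations {
			if strings.EqualFold(l, location) {
				return cloud, nil
			}
		}
	}
	return "", nil
}

func main() {
	// Made-up sample document for illustration only.
	sample := []byte(`{"cloudEndpoint":{
		"public":{"locations":["westus","eastus"]},
		"usGovCloud":{"locations":["usgovvirginia"]}}}`)

	cloud, err := cloudFor(sample, "westus")
	if err != nil {
		panic(err)
	}
	fmt.Println(cloud)
}
```

In a real run, the location would come from the IMDS call and the document from the management endpoint, exactly as in the shell script.
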

This is what I see for my VM


Monday, March 26, 2018

Azure Serial Console


My team just announced the public preview of Azure Serial Console. This has been a consistent ask from customers who want to recover VMs in the cloud. Go to your VM in and then click on the Serial Console button


This opens a direct serial console connection to your VM. The VM does not need to be open to the internet. This is amazing for diagnosing VM issues: if you are not able to SSH to the VM for some reason (blocked port, bad config change, busted boot config), you drop into the serial console and interact with your machine. Cool or what!!



To show you the difference between an SSH connection and the serial console, this is my machine booting up!!