Tuesday, February 18, 2020

System Engineering Guidelines

While building our system that powers memory intensive compute in Azure we use the following engineering guidelines. We use these guidelines to build our BareMetal resource provider, cluster manager etc. These are useful principles we have accumulated from experience building various systems. What other principles do you use and recommend including?
  1. Close on broad design before sending PRs.
    1. Add design as markdown to root of the feature code path and discuss it as a PR. For broad cross RP feature it is ok to place it in the root
    2. For sizable features please have a design meeting
    3. Requirement for a design meeting is to send a pre-read and expectation is all attendees have reviewed the pre-read before coming in
  2. Be aware of distributed system quirks
    1. Think CAP theorem. This is a distributed system, network partition will occur, be explicit about your availability and consistency model in that event
    2. All remote calls will fail, have re-tries that uses exponential back-off. Log warning on re-tries and error if it finally fails
    3. Ensure we always have consistent state. There should be only 1 authoritative version of truth. Having local data that is eventually consistent with this truth is acceptable. Know the max time-period for eventual consistency
  3. System needs to be reliable, scalable and fault tolerant
    1. Always avoid SPOF (Single Point of Failure), even for absolutely required resources like SQLServer consider retrying (see below), gracefully fail and recover
    2. Have retries
    3. APIs need to be responsive and return in sub second for most scenarios. If something needs to take longer, immediately return with a mechanism to track progress on the background job started
    4. All API and actions we support should have a 99.9 uptime/success SLA. Shoot for 99.95
    5. Our system should be stateless (have state elsewhere in data-store) and designed to be cattle and not pets
    6. Systems should be horizontally scalable. We should be able to simply add more nodes to a cluster to handle more traffic
    7. Choose to use a managed service over attempting to build it or deploy it in-house
  4. Treat configuration as code
    1. Breaks due to out of band config changes are too common. So consider config deployment the same way as code deployment (Use SCD == Safe Config/Code Deployment)
    2. Config should be centralized. Engineers shouldn't be hunting around to look for configs
  5. All features must have feature flag in config.
    1. The feature flag can be used to disable features in per region basis
    2. Once a feature flag is disabled the feature should cause no impact to the system
  6. Try to make sure your system works on a single boxThis makes dev-test significantly easier. Mocking auxiliary systems is OK
  7. Never delete things immediately
    1. Don't delete anything instantaneously, especially data. Tombstone deleted data away from user view
    2. Keep data, metadata, machines around for a garbage collector to periodically delete at configurable duration.
  8. Strive to be event driven
    1. Polling is bad as the primary mechanism
    2. Start with event driven approach and have fallback polling
  9. Have good unit tests.
    1. All functionality needs to ship with tests in the same PR (no test PR later)
    2. Unit test tests functionality of units (e.g. class/modules)
    3. They do not have to test every internal functions. Do not write tests for tests' sake. If test covers all scenarios exposed by an unit, it is OK to push back on comments like "test all methods".
    4. Think what does your unit implement and can the test validate the unit is working after any changes to it
    5. Similarly if you add a reference to an unit from outside and depend on a behavior consider adding a test to the callee so that changes to that unit doesn’t break your requirements
    6. Unit test should never call out from dev box, they should be local tests only
    7. Unit test should not require other things to be spun up (e.g. local SQL server)
  10. Consider adding BVT to scenarios that cannot be tested in unit tests.
    E.g. stored procs need to run against real SqlDB deployed in a container during BVT, or test query routing that needs to run inside a web-server
  11. All required tests should be automatically run and not require humans to remember to run them
  12. Test in production via our INT/canary clusterSomethings simply cannot be tested on dev setup as they rely on real services to be up. For these consider testing in production over our INT infra.
    1. All merges are automatically deployed to our INT cluster
    2. Add runners to INT that simulate customer workloads.
    3. Add real lab devices or fake devices that test as much as possible. E.g. add fake snmp trap generator to test fluentd pipeline, have real blades that can be rebooted using our APIs periodically
    4. Bits are then deployed to Canary clusters where there are real devices being used for internal testing, certification. Bake bits in Canary!
  13. All features should have measurable KPIs and metrics.
    1. You must add metrics against new features. Metrics should tell how well your feature is working, if your feature stops working or if any anomaly is observed
    2. Do not skimp on metrics, we can filter metrics on the backend rather than not having them fired
  14. Copious logging is required.
    1. Process should never fail silently
    2. You must add logs for both success and failure paths. Err on the side of too much logging
  15. Do not rely on text logs to catch production issues.
    1. You cannot rely on too many error logs from a container to catch issues. Have metrics instead (see above)
    2. Logs are a way to root-cause and debug and not catch issues
  16. Consider on-call for all development
    1. Ensure you have metrics and logs
    2. Ensure you write good documentation that anyone in the team can understand without tons of context
    3. Add alerts with direct link to TSGs
    4. Add actionable alerts where the on-call can quickly mitigate
    5. On-call should be able to turn off specific features in case it is causing problems in production
    6. All individual merges can be rolled back. Since you cannot control when code snap for production happens the PRs should be such that it can be individually rolled back

Saturday, February 15, 2020

The C Word - Part 2

Our confrontation with Lymphoma started in 2011, when the C word entered our life and my wife got diagnosed with Stage 4 Hodgkins sclerosing lymphoma. We have fought through and continue to do so. Even through she is in remission now, the shadow of Cancer still hangs over.

We had just moved across the world from India to the US in 2010 and had no family and very few friends around. We had to mostly duke it out ourselves with little support. We were able to get the best treatment available in the world through the Fred Hutch and Seattle Cancer Care Alliance. However, we do realize not everyone is fortunate to be able to do so.

Our daughter has decided to do her part now and raise funds through the Lymphoma and Leukemia Society. If you'd like to help her  please head to


Sunday, January 19, 2020

Chobi - Face Detection Based Static Image Gallery Generator

Over the holidays I created a simple static photo gallery generator that I named chobi. The sources are in https://github.com/abhinababasu/chobi.

Given a source folder of photos, chobi will generate a destination folder containing the original photos, generated thumbnails, css stylesheets, scripts and html files which constitute a website displaying those photos as follows

The Problem

While building chobi and also when I was looking into similar online tools, I kept hitting a major issue and that prompted further work and this post.

You see thumbnail generation from image has a major problem. I needed the gallery generator to create square thumbnails for the image strip shown at the bottom of the page. However, the generated thumbnails would simply be either from the center or some other arbitrary location. This meant that the thumbnails would cut off at weird places.

Consider the following image.

If I create a thumbnail from a tool without knowing where the face is in the image, it will generate something like
Obviously that doesn't work. So using my vanilla generator I got a website with all sorts of similar head chopped off thumbnails (marked in Red)

The Solution

Chobi uses face-detection to ensure that does not happen and the face is always fully present in the generated thumbnail. Consider the same thumbnail as above but now with face-detection

Another example of an original image and then generated thumbnail first without and then with face-detection.

With this face-detection plugged in chobi generates a much better web-site, with almost no photo cropped where it shouldn't be.


  1. Chobi sources are at https://github.com/abhinababasu/chobi
  2. It uses the face-detected thumbnail generator which I wrote at https://github.com/abhinababasu/facethumbnail
  3. That is in turn is based out of a face detection library pigo written fully in go


Checkout a gallery build using chobi at 

Monday, January 13, 2020

How to run Windows 7 after end of support

Windows 7 end of support is upon us in 1 more day (1/14/2020). This post tries to answer the question on whether you can safely continue to run it. The short answer is that you can't, atleast if it is connected to the outside in some form.

However, I have a friend back in India who has some software that he relies on and he can't run it on modern Windows. So when I was answering his question on how he can run it, I thought I'd write it up in the blog as well.

This post outlines how you can run Windows 7 in virtual machine running on Microsoft Hyper-visor on a windows 10 machine. The process also uses checkpoints to reset the VM back to the old state each time. This ensures that even if something malicious gets hold of the system, you can simple go back to the pristine state you started with.


Obviously you need a computer capable of running Hyper-V. For my purpose I am using a Windows 10 Professional machine. You should also have more than enough CPU cores and memory to run Windows 7 in that machine. I recommend atleast 4 cores and 8GB memory so that you can give half of that to the Windows 7 VM and keep the rest for a functional host PC.

Get hold of Windows 7 ISO or download it from https://www.microsoft.com/en-us/software-download/windows7.

Then visit the system requirement https://support.microsoft.com/en-us/help/10737/windows-7-system-requirements. I decided to give roughly twice the resources as the requirements to create my VM.

Setup VM

Hit windows-key and type hyper-v to launch. Then start creating a VM by clicking New and then Virtual machine.

Choose the following Generation 1

I decided to then give it 4GB memory and 2 CPU cores

Chose to create a 40GB OS disk

Then install from bootable CD and pointed the location of the image file to the downloaded ISO image

Click through next to end and finish the creation wizard. Then right click on the newly created VM and choose "Connect".

Install Windows 7

At this point if everything went well the VM has booted off the installation ISO and we are on the following screen. Choose "Clean install" and proceed through the installation wizard.

Finally installation starts.

A reboot later we have Windows 7 starting up!

Created a username, password

Finally booted into Windows 7 and here's my website displayed in Internet Explorer

Secure by Checkpoint

Even though we have booted into Windows 7 soon this will be a totally unsupported OS, that means no security updates. This is a dangerous system to keep open to the internet. I recommend never doing that!! Also to be doubly sure, we will create a checkpoint. What that does is it creates a snapshot of the memory and the disk. So in case something malicious lands in this VM, we can go back to the pristine state when the snapshot was created and hence rollback any changes made by the virus or malware.

To create a  checkpoint right click on the VM in hyper-v manager and choose Checkpoint

You can see the checkpoints created in the Hyper-V Manager.

Lets make a change to the Windows 7 VM by creating a file named "Howdy I am created.txt" on the desktop.

Since the checkpoint was created before creating the file, I can revert back to the checkpoint by right-click on the checkpoint and choosing Apply.

After applying the checkpoint when I go back into the VM, the created file is all gone!!!


This is a hack at best and not recommended. However, if for some applications or other need where you "have" to run Windows 7, this can be an option.

Monday, January 06, 2020

Chobi - A static photo gallery generator

I love using Microsoft Todo and before taking time off in December I create a holiday todo list. I tend to be at home with the family and do bunch of projects around. I try to ensure that I am not doing only work related projects during that time, so put in a ceiling of half a week for coding related stuff. Other Todos generally involves carpentry, DIY home projects, yardwork, cleaning etc.

One of the projects was to update my online photo gallery. Now being a programmer I made it way more complicated than I should've. I decided to code up a minimalistic program to generate static photogallery out of folders of images I export out of Adobe Lightroom. As I mentioned above one of the requirement was to finish it in around 3 days.

I am happy to share that I have the project done and the sources are available at https://github.com/abhinababasu/chobi. It took me about 3 days and most of the time was spent figuring out UI stuff which I rarely do and pondering about which photos to put in the gallery.

The code is in go and it does the following

  1. It iterates through a folder of images (sub-dir not supported yet) and copies the images to a destination
  2. Also places thumbnails (configurable size) into the destination
  3. There is a template html that it modifies to display those images
  4. It also uses some client side script to 
    1. Randomize the image order
    2. Show a carousel of the images
    3. A thumbnail gallery at the bottom
    4. Automated photo rotation
Here's a screenshot of the sample landscape gallery.

Since this was very time-bound project there are tons to stuff left to do, some basic bugs abound as well. But I decided to timeout on the effort for now and revisit again hopefully in the spring.