- Close on broad design before sending PRs.
- Add the design as markdown at the root of the feature's code path and discuss it as a PR. For a broad cross-RP feature it is OK to place it at the repo root
- For sizable features please have a design meeting
- The requirement for a design meeting is that a pre-read is sent in advance; the expectation is that all attendees have reviewed the pre-read before coming in
- Be aware of distributed system quirks
- Think CAP theorem. This is a distributed system; network partitions will occur. Be explicit about your availability and consistency model in that event
- All remote calls will fail at some point; have retries that use exponential backoff. Log a warning on each retry and an error if the call finally fails
- Ensure we always have consistent state. There should be only one authoritative version of the truth. Local data that is eventually consistent with that truth is acceptable; know the maximum time period for eventual consistency
- System needs to be reliable, scalable and fault tolerant
- Always avoid SPOFs (Single Points of Failure). Even for absolutely required resources like SQL Server, consider retrying (see below), failing gracefully and recovering
- Have retries
- APIs need to be responsive and return in under a second for most scenarios. If something must take longer, return immediately with a mechanism to track progress of the background job that was started
- All APIs and actions we support should have a 99.9% uptime/success SLA. Shoot for 99.95%
- Our system should be stateless (with state kept elsewhere in a data store) and designed as cattle, not pets
- Systems should be horizontally scalable. We should be able to simply add more nodes to a cluster to handle more traffic
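The "return immediately with a way to track the background job" pattern above can be sketched in Go. The `jobTracker` type and its in-memory map are assumptions for illustration only; per the statelessness rule, real job state would live in an external data store:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// jobTracker is an illustrative in-memory job store.
type jobTracker struct {
	mu     sync.Mutex
	nextID int
	status map[int]string
}

func newJobTracker() *jobTracker {
	return &jobTracker{status: map[int]string{}}
}

// start registers a long-running job and returns its id immediately;
// the work itself runs in the background.
func (t *jobTracker) start(work func()) int {
	t.mu.Lock()
	t.nextID++
	id := t.nextID
	t.status[id] = "running"
	t.mu.Unlock()
	go func() {
		work()
		t.mu.Lock()
		t.status[id] = "done"
		t.mu.Unlock()
	}()
	return id
}

// get returns the current status of a job, for a polling endpoint.
func (t *jobTracker) get(id int) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.status[id]
}

func main() {
	tracker := newJobTracker()
	done := make(chan struct{})
	// The handler returns 202 Accepted right away with an id the caller can poll.
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := tracker.start(func() { <-done }) // simulated slow work
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprintf(w, "job %d started", id)
	})
	srv := httptest.NewServer(h)
	defer srv.Close()

	resp, _ := http.Get(srv.URL)
	fmt.Println(resp.StatusCode) // 202
	close(done)
}
```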
- Choose to use a managed service over attempting to build it or deploy it in-house
- Treat configuration as code
- Breaks due to out-of-band config changes are too common, so treat config deployment the same way as code deployment (use SCD == Safe Config/Code Deployment)
- Config should be centralized. Engineers shouldn't be hunting around to look for configs
- All features must have a feature flag in config
- The feature flag can be used to disable a feature on a per-region basis
- Once a feature flag is disabled, the feature should cause no impact to the system
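A minimal Go sketch of a per-region feature flag check; the `flags` shape here is an assumption for illustration, not our actual config schema:

```go
package main

import "fmt"

// flags maps feature name -> set of regions where it is disabled.
// In practice this would come from the centralized config store.
type flags map[string]map[string]bool

// enabled reports whether a feature is on in a region. Unknown features
// default to off, so a missing config entry cannot light up a feature.
func (f flags) enabled(feature, region string) bool {
	disabledRegions, known := f[feature]
	if !known {
		return false
	}
	return !disabledRegions[region]
}

func main() {
	cfg := flags{
		"newScheduler": {"japaneast": true}, // disabled only in japaneast
	}
	fmt.Println(cfg.enabled("newScheduler", "westus"))    // true
	fmt.Println(cfg.enabled("newScheduler", "japaneast")) // false
	fmt.Println(cfg.enabled("unknownFeature", "westus"))  // false
}
```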
- Try to make sure your system works on a single box. This makes dev-test significantly easier. Mocking auxiliary systems is OK
- Never delete things immediately
- Don't delete anything instantaneously, especially data. Tombstone deleted data away from the user's view
- Keep data, metadata, and machines around for a garbage collector to delete periodically after a configurable duration
- Strive to be event driven
- Polling is bad as the primary mechanism
- Start with an event-driven approach and have polling as a fallback
- Have good unit tests.
- All functionality needs to ship with tests in the same PR (no separate test PR later)
- Unit tests test the functionality of units (e.g. classes/modules)
- They do not have to test every internal function. Do not write tests for tests' sake. If the tests cover all scenarios exposed by a unit, it is OK to push back on comments like "test all methods"
- Think about what your unit implements and whether the tests can validate that the unit still works after any change to it
- Similarly, if you add a reference to a unit from outside and depend on a behavior, consider adding a test to that unit so that changes to it don't break your requirements
- Unit tests should never call out from the dev box; they should be local-only tests
- Unit tests should not require other things to be spun up (e.g. a local SQL server)
- Consider adding BVTs for scenarios that cannot be tested in unit tests.
E.g. stored procs need to run against a real SqlDB deployed in a container during BVT, or query routing needs to be tested inside a web-server
- All required tests should be automatically run and not require humans to remember to run them
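To make the unit test guidance concrete, here is a table-driven sketch in Go. `thumbName` is a hypothetical unit invented for the example; in real code the loop would be a table-driven Test function in a `_test.go` file run by `go test`, with `t.Errorf` instead of the panics used here to keep the sketch a single runnable file:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// thumbName is the hypothetical unit under test: it derives the
// thumbnail file name for an image path.
func thumbName(path string) string {
	ext := filepath.Ext(path)
	return strings.TrimSuffix(path, ext) + "_thumb" + ext
}

func main() {
	// Table-driven cases cover the behaviors the unit exposes,
	// not every internal detail.
	cases := []struct{ in, want string }{
		{"IMG_001.jpg", "IMG_001_thumb.jpg"},
		{"pics/a.png", "pics/a_thumb.png"},
		{"noext", "noext_thumb"},
	}
	for _, c := range cases {
		if got := thumbName(c.in); got != c.want {
			panic(fmt.Sprintf("thumbName(%q) = %q, want %q", c.in, got, c.want))
		}
	}
	fmt.Println("all cases pass")
}
```

Note the test needs no network, no SQL server and nothing spun up: it runs entirely on the dev box.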
- Test in production via our INT/canary cluster. Some things simply cannot be tested on a dev setup as they rely on real services being up. For these, consider testing in production over our INT infra.
- All merges are automatically deployed to our INT cluster
- Add runners to INT that simulate customer workloads.
- Add real lab devices or fake devices that test as much as possible. E.g. add a fake SNMP trap generator to test the fluentd pipeline; have real blades that can be rebooted periodically using our APIs
- Bits are then deployed to Canary clusters, where real devices are used for internal testing and certification. Bake bits in Canary!
- All features should have measurable KPIs and metrics.
- You must add metrics for new features. Metrics should tell how well your feature is working, whether it has stopped working, and whether any anomaly is observed
- Do not skimp on metrics; we can filter metrics on the backend, which is better than never having them fired
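A minimal in-process sketch of firing metrics on both the success and failure paths. A real service would use a metrics library and export these to a backend; the hand-rolled `counters` type below is only to show the shape:

```go
package main

import (
	"fmt"
	"sync"
)

// counters is a tiny stand-in for a metrics client.
type counters struct {
	mu sync.Mutex
	m  map[string]int64
}

func newCounters() *counters { return &counters{m: map[string]int64{}} }

// inc bumps a named counter, e.g. "provision_success".
func (c *counters) inc(name string) {
	c.mu.Lock()
	c.m[name]++
	c.mu.Unlock()
}

func (c *counters) get(name string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[name]
}

func main() {
	metrics := newCounters()
	// Fire a metric on both paths, so the backend can compute a success
	// ratio and alert on anomalies instead of scraping text logs.
	for i := 0; i < 5; i++ {
		if i%5 == 4 {
			metrics.inc("provision_failure")
		} else {
			metrics.inc("provision_success")
		}
	}
	fmt.Println(metrics.get("provision_success"), metrics.get("provision_failure")) // 4 1
}
```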
- Copious logging is required.
- Processes should never fail silently
- You must add logs for both success and failure paths. Err on the side of too much logging
- Do not rely on text logs to catch production issues.
- You cannot rely on spotting error logs from a container to catch issues. Have metrics instead (see above)
- Logs are a way to root-cause and debug, not a way to catch issues
- Consider on-call for all development
- Ensure you have metrics and logs
- Ensure you write good documentation that anyone in the team can understand without tons of context
- Add alerts with direct links to TSGs
- Add actionable alerts that let the on-call mitigate quickly
- On-call should be able to turn off specific features in case it is causing problems in production
- All individual merges must be rollback-able. Since you cannot control when the code snap for production happens, PRs should be written such that each can be rolled back individually
Tuesday, February 18, 2020
Saturday, February 15, 2020
Our confrontation with lymphoma started in 2011, when the C word entered our life and my wife was diagnosed with Stage 4 Hodgkin's sclerosing lymphoma. We have fought through and continue to do so. Even though she is in remission now, the shadow of cancer still hangs over us.
We had just moved across the world from India to the US in 2010 and had no family and very few friends around. We had to mostly duke it out ourselves with little support. We were able to get the best treatment available in the world through the Fred Hutch and Seattle Cancer Care Alliance. However, we do realize not everyone is fortunate to be able to do so.
Our daughter has decided to do her part now and raise funds through the Lymphoma and Leukemia Society. If you'd like to help her please head to
Sunday, January 19, 2020
You see, thumbnail generation from images has a major problem. I needed the gallery generator to create square thumbnails for the image strip shown at the bottom of the page. However, the generated thumbnails would simply be taken from the center or some other arbitrary location. This meant that the thumbnails would cut off at weird places.
Monday, January 13, 2020
Windows 7 end of support is upon us in 1 more day (1/14/2020). This post tries to answer the question of whether you can safely continue to run it. The short answer is that you can't, at least not if it is connected to the outside world in some form.
However, I have a friend back in India who has some software that he relies on and he can't run it on modern Windows. So when I was answering his question on how he can run it, I thought I'd write it up in the blog as well.
Get hold of Windows 7 ISO or download it from https://www.microsoft.com/en-us/software-download/windows7.
Choose the following: Generation 1
Choose to create a 40GB OS disk
Then install from a bootable CD, pointing the location of the image file to the downloaded ISO image
Click through next to end and finish the creation wizard. Then right click on the newly created VM and choose "Connect".
Install Windows 7
At this point if everything went well the VM has booted off the installation ISO and we are on the following screen. Choose "Clean install" and proceed through the installation wizard.
Finally installation starts.
Create a username and password
Secure by Checkpoint
After applying the checkpoint, when I go back into the VM, the created file is all gone!!!
Monday, January 06, 2020
I love using Microsoft Todo, and before taking time off in December I created a holiday todo list. I tend to be at home with the family and do a bunch of projects around the house. I try to ensure that I am not doing only work-related projects during that time, so I put in a ceiling of half a week for coding-related stuff. Other todos generally involve carpentry, DIY home projects, yardwork, cleaning etc.
One of the projects was to update my online photo gallery. Being a programmer, I made it way more complicated than I should have. I decided to code up a minimalistic program to generate a static photo gallery out of folders of images I export from Adobe Lightroom. As I mentioned above, one of the requirements was to finish it in around 3 days.
I am happy to share that I have the project done and the sources are available at https://github.com/abhinababasu/chobi. It took me about 3 days and most of the time was spent figuring out UI stuff which I rarely do and pondering about which photos to put in the gallery.
The code is in Go and it does the following:
- It iterates through a folder of images (sub-directories not supported yet) and copies the images to a destination
- It also places thumbnails (of configurable size) into the destination
- There is a template html that it modifies to display those images
- It also uses some client-side script to:
- Randomize the image order
- Show a carousel of the images
- Show a thumbnail gallery at the bottom
- Rotate photos automatically
Wednesday, October 09, 2019
There will be no scheduled work items and no SCRUM
We simply do stuff we want to do. Examples include, but are not limited to:
- Solve a pet peeve (e.g. fix a bug that is not scheduled but you really want to get done)
- A cool feature
- Learn something related to the project that you always wanted to figure out (how do we use fluentd to process events, what is helm)
- Learn something technical (how does go channels work, go assembly code)
- Shadow someone from a sibling team and learn what they are working on
I would say we have had great success with it. We have had CAYL projects all over the spectrum:
- Speed up build system and just make building easier
- ML Vision device that can tell you which bin trash needs to go in (e.g. if it is compostable)
- Better BVT system and cross porting it to work on our Macs
- Pet peeves like making function naming more uniform, removing TODOs from code, fixing spelling/grammar etc.
- Better logging and error handling
- Fix SQL resiliency issues
- Move some of our older custom management VMs to AKS
- Bring in gomock, go vet, static checking
- 3D game where mommy penguin gets fish for her babies and learns to be optimal using machine learning
- Experiment with Prometheus
- A dev spent a day shadowing a dev from another team to learn the cool tech they are using
Monday, September 30, 2019
Waves of workloads have since been moving to the cloud. A new brew of startups were cloud native from the start, and they were the first to use the power of the cloud. Many large and small enterprises had already virtualized their workloads, and they moved as well. Some moved their new workloads (green-field); some even followed lift-n-shift with some modifications (brown-field) into the cloud.
However, a class of large enterprises were stuck in their data centers. They wanted to use the power of the cloud, they wanted to use IoT integration, Machine-learning and the capability of elastic growth of their applications, but the center of their systems were running on some stack that did not run in the standard virtualization offered in the cloud. These enterprises said that if they cannot move those workloads into the cloud, they would need to keep the lights on in their data-centers and moving some peripheral workloads simply did not make sense.
This is where Azure Dedicated and we come into the picture.
SAP HANA Large Instance
For some of these customers that #$%#@ is SAP HANA in-memory DB on a single machine with 768 vCPUs and 24 terabytes of RAM (yup), and we have them covered. Some wanted to scale that out to 60 terabytes in memory; we have them covered too, with our bare-metal machines running in Azure. See SAP HANA Large Instances on Azure. They wanted to then expand their applications elastically using VMs running on Azure with sub-1 ms latency to those bare-metal DB machines; we have that working too.
We started our journey in this area with this workload. Now we have evolved into our own little organization in Azure called Azure Dedicated and also support the following workloads.
Azure VMware Solutions
Some customers wanted to run their VMware workloads and we have two offers for them; see more about Azure VMware Solution by CloudSimple and Virtustream here
Hardware Security Modules
In partnership with other teams in Azure we support HSMs, which are standard cryptographic appliances powering, say, financial institutions.
Cray Supercomputer?
So you need to simulate something or do ML on tens of thousands of cores? We have Cray supercomputers running inside Azure for that!!
Azure NetApp Files
Working closely with the storage team we deliver demanding file-based workloads running on Azure NetApp Files
SkyTap
In partnership with SkyTap we provide IBM Power workloads on Azure to customers.
What next?
We know there are more such anchors holding back enterprises from moving into the cloud. If you have some ideas on what we should take on next, please let me know in the comments!
Wednesday, October 10, 2018
Over the past year I have been working to light up bare-metal machines on the Azure cloud. These are specialized bare-metal machines with extremely high amounts of RAM and CPU, in this particular case purpose-built to run the SAP HANA in-memory database. We call them HANA Large Instances and they come certified by SAP (see the list here).
So why bare-metal? They are huge, high-performance machines that go all the way up to 24TB RAM (yup) and 960 CPU threads. They are purpose-built for the HANA in-memory database and have the right CPU/memory ratio and high-performance storage to run demanding OLTP + OLAP workloads. Imagine a bank being able to load every credit card transaction from the past 5 years and do analytics, including fraud detection, on a new transaction in a few seconds; or track the flow of commodities from the world's largest warehouses to millions of stores and hundreds of millions of customers. These machines come with a 99.99% SLA and can be reserved by customers across the world in US-East, US-West, Japan-East, Japan-West, Europe-West, Europe-North, Australia-SouthEast and Australia-East for SAP HANA workloads.
In SAP TechEd and SAPPHIRE I demoed bare-metal HLI machines with standard Azure Portal integration. Right now customers can see their HLI machines in the portal and coming soon even reboot them from the portal.
Click on the screenshot below to see a recorded video on how the Hana Large Instances are visible on the Azure portal and also how customers can raise support requests from the portal.
Customers with HLI blades can run the following CLI command to register our HANA Resource Provider
az provider register --namespace Microsoft.HanaOnAzure
Or alternatively, use http://portal.azure.com. Go to your subscription that has HANA Large Instances, select “Resource Providers” and type “Hana” in the search box. Click on register.
Send them to email@example.com
Friday, June 01, 2018
Summary: See https://github.com/abhinababasu/cloudbox for a terraform based solution to deploy VMs in Azure with full remote desktop access.
Now the longer form :). I have blogged in the past about how to set up an Ubuntu desktop on Azure that you can RDP (remote desktop) into. Over the past few months I have moved to doing most of my development work exclusively on a cloud VM, and I love having the full desktop experience on my customized “cloud dev box”. I RDP into it from my dev box at work, Surface Pro, secure laptop etc.
I wanted to ensure that I can treat the box as cattle and not a pet. So I came up with terraform-based scripts to bring up these cloud dev boxes. I have also shared them with my team in Microsoft and a few devs are already using them. I hope they will be useful to you as well in case you want something like that. All code is at https://github.com/abhinababasu/cloudbox
A few things about the main terraform script at https://github.com/abhinababasu/cloudbox/blob/master/cloudVM.tf
- It is good security practice to ensure that your VM is locked down. I use Azure NSG rules to ensure that the VM denies in-bound traffic from the Internet. The script accepts parameters where you can give IP ranges which will then be opened up. This ensures that your VM is accessible only from safe locations; in my case those are IP ranges of Microsoft (for work) and my home IP address.
- While you can use just the TF file and setup script I have a driver script at https://github.com/abhinababasu/cloudbox/blob/master/cloudshelldeploy.sh that you might find useful
- Once the VM is created I use the remote execution feature of terraform to run the script in https://github.com/abhinababasu/cloudbox/blob/master/cloudVMsetup.sh to install the various software that I need, including the Ubuntu desktop and xrdp for remote desktop. This takes at least 10 minutes or so
- By default a Standard_F8s machine is used, but that can be overridden with larger sizes (e.g. Standard_F16s). I have found that machines smaller than that don't provide adequate performance. Note: you will incur costs for running these biggish VMs
Obviously you need terraform installed. I think the whole system works really well if you launch from https://shell.azure.com because that way all the credential stuff is automatically handled, and cloud shell comes pre-installed with terraform.
If you want to run from any other dev box, you need to have the Azure CLI and terraform installed (use the installterraform.sh script for that). Then do the following, using the subscription Id under which you want the VM to run.
az login
az account set --subscription="<some subscription Id>"
While you can download the files from here and use them, you are better off customizing the cloudshelldeploy.sh script and then running it. I use the following to run it:
curl -O https://raw.githubusercontent.com/bonggeek/share/master/cloudbox/cloudshelldeploy.sh
chmod +x cloudshelldeploy.sh
./cloudshelldeploy.sh abhinab <password>
Now you can use an RDP client like mstsc to log into the machine.
NOTE: In my experience 1080p resolution works well; 4K lags too much to be useful. Since the mstsc default is full-screen, be careful if you are working on a hi-res display and explicitly use 1080p resolution.
There I am logged into my cloud VM.
Wednesday, May 16, 2018
I have had some asks on how to discover which Azure cloud the current system is running on. Basically you want to figure out whether you are running in the Azure public cloud or in one of the specialized government clouds.
Unfortunately this is not currently available in the Instance Metadata Service (IMDS). However, it can be found out using an additional call. The basic logic is to get the current location over IMDS and then call the Azure management API to see which cloud that location is present in.
Sample script can be found at https://github.com/bonggeek/share/blob/master/azlocation.sh
#!/bin/bash
locations=`curl -s -H Metadata:True "http://169.254.169.254/metadata/instance/compute/location?format=text&api-version=2017-04-02"`

# Test regions
#locations="indiasouth"
#locations="usgovsouthcentral"
#locations="chinaeast"
#locations="germanycentral"

endpoints=`curl -s https://management.azure.com/metadata/endpoints?api-version=2017-12-01`

publicLocations=`echo $endpoints | jq .cloudEndpoint.public.locations`
if grep -q $locations <<< $publicLocations; then
    echo "PUBLIC"
    exit 1
fi

chinaLocations=`echo $endpoints | jq .cloudEndpoint.chinaCloud.locations`
if grep -q $locations <<< $chinaLocations; then
    echo "CHINA"
    exit 2
fi

usGovLocations=`echo $endpoints | jq .cloudEndpoint.usGovCloud.locations`
if grep -q $locations <<< $usGovLocations; then
    echo "US GOV"
    exit 3
fi

germanLocations=`echo $endpoints | jq .cloudEndpoint.germanCloud.locations`
if grep -q $locations <<< $germanLocations; then
    echo "GERMAN"
    exit 4
fi

echo "Unknown"
exit 0
This is what I see for my VM
Monday, March 26, 2018
My team just announced the public preview of the Azure Serial Console. This has been a consistent ask from customers who want to recover VMs in the cloud. Go to your VM in http://portal.azure.com and then click on the Serial Console button
This opens a direct serial console connection to your VM. The VM is not required to be open to the internet. This is amazing for diagnosing VM issues, e.g. if you are not able to SSH to the VM for some reason (blocked port, bad config change, busted boot config). You drop into the serial console and interact with your machine. Cool or what!!
To show you the difference between a SSH connection and serial console, this is my machine booting up!!
Friday, October 06, 2017
For the past many months I have had my dev boxes on the cloud. I am happily using a monster Windows VM and a utility Ubuntu desktop in the cloud. I realized after talking to a few people that they don't realize how easy it is to set up a Linux remote desktop in the Azure cloud. Here are the steps.
For the VM configuration I needed a small-sized VM with large enough network bandwidth to support remoting. Unfortunately on Azure you cannot choose networking bandwidth directly; rather, all the VMs on a box get networking bandwidth proportional to the number of cores they have. So I just created a VM based on the “Standard DS4 v2 Promo (8 vcpus, 28 GB memory)” size and connected it to Microsoft ExpressRoute. If you are OK with a public IP, you can skip setting up ExpressRoute and ensure your VM has a public IP.
Then I went to the portal and enabled RDP. For that, in the portal choose VM –> Networking and add a rule to enable RDP.
Finally I sshed into my VM with
Time to now install a bunch of tools
sudo apt-get update
sudo apt-get install xrdp
sudo apt-get install ubuntu-desktop
sudo apt-get install xfce4
sudo apt-get update
echo xfce4-session > ~/.xsession
Open the following file
sudo gvim /etc/xrdp/startwm.sh
Set the content to
#!/bin/sh
if [ -r /etc/default/locale ]; then
  . /etc/default/locale
  export LANG LANGUAGE
fi
startxfce4
start the xrdp service
sudo /etc/init.d/xrdp start
And then from my Windows machine: mstsc /v:machine_ip. I am presented with the login screen
and then I have the full Ubuntu desktop on Azure :)