When Deploying Before Testing
As you may know, on July 3, 2019, Facebook, Instagram, Down Detector, and a number of other websites were partially or fully unavailable. Cloudflare mentioned did admit that the outage was their fault and did rectify the situation. However, it was not resolved until a number of users complained and reported images not loading.
Cloudflare provides CDN services. CDN stands for Content Delivery Network. Content Delivery Networks are often used by high traffic websites to be able to provide you with images, video, and other online resources faster as they are located closer to where you are.
For instance, let’s say a company has their primary data center in Utah, it may take about 5 seconds for an online image request to be sent and response to be returned to a device that is on the East Coast. However with a CDN, the image that is on the Utah server, is copied to a server in Virginia. Then when the request for that same image is made from the device on the East Coast, the image is provided from the Virginia server instead of the Utah server in about 1 or 2 seconds. This is part of the reason why it is said that if you upload and image or other content online, that you will never be able to delete it.
When I primarily did IT support, we were often asked to do changes that were to be deployed in the testing environment first. Not only this, but we had to provide proof that the change being implemented in the testing environment was successful and did not destroy the testing environment before performing the change in the production environment.
There is one major difference between the testing environment and the production environment. That is that the testing environment may get about 5% of the activity that the production environment gets unless a simulation is performed. Long story short, this change may have been tested in the testing environment and the CPU spike did not occur or was not noticeable to be classified as a failure.
This change was probably not well tested or tested at all in a lower environment. A lower environment is supposed to be an exact copy of the production environment. In the lower environment, you can play around or try things out without breaking the production environment.
Use An Automation or Script
When using an automation or script to perform work, the change is less likely to be error prone. The reason being, is that most automations are tested a number of times before being rolled out. Once rolled out, they are guaranteed to work unless an unexpected situation occurs. For instance, a software dependency is not installed, thus the automation is not able to configure the software. In this circumstance, the automation should fail and roll back what activities that it has done thus far.
Work in Batches
It was mentioned that they noticed that the CPU usage had spiked on the primary and secondary servers. Assuming that the changes were applied at the same time, indicates a bigger problem. From my support experience, we would not deploy or make changes to all the servers at the same. What we would do is deploy to some of the servers and then deploy to the remaining servers after confirming all is well.
For instance, some of the applications that I have supported had 4 application servers. We would deploy to 2 of the servers first. If the deployment went well and post-regression testing passes, then we would deploy to the remaining 2 servers. By doing things this way, it does take longer to complete the deployment. However, it ensures that you do not introduce a major failure or issue into your production environment and impact your customers. If they had deployed to the primary instances and saw that CPU spike that was mentioned, then they could have rolled back after identifying that the
When it comes to code deployment and IT infrastructure changes, this can be a lesson that we can all learn from this.
- Test your code in the lower environment
- Simulate conditions that would be encountered in the production environment as much as possible when testing. This will give you an accurate representation of what would happen under normal use.
- Do not deploy the change to all of the servers in a single environment at the same time
- After deploying the change, be sure to confirm that everything is working as it should be. Any unexpected behaviors that result after the change should be an immediate cause for alarm and the change backed out.