Recently, the Rain system experienced a couple of system outages that affected our entire client base.
In a letter to all of our clients, our founder and CEO, Sean Roylance, said the following:
"Here at Rain, reliability and transparency are of the utmost importance. We are taking this unfortunate occurrence very seriously, and will do everything in our power to provide an improved user experience in the months to come. Again, we apologize for the disruptions that were caused, and we truly appreciate your patience and understanding."
We would like to go into more depth about the changes we are making to our processes and infrastructure to help ensure maximum uptime and minimal disruptions.
Understanding Downtime
The truth is, every online entity, no matter its size or capability, experiences downtime. Every company tries to minimize this in various ways, but it is an unfortunate inevitability of using a system that is entirely online.
Downtime can be caused by several factors, but usually results from one or more of the following:
- Human Error - Even the most brilliant system administrator or web developer is only human, and may make a mistake of some kind that can cause a system outage.
- Equipment Failure - We use Amazon Web Services (AWS), a premier hosting service used by a significant portion of internet entities, including some of the biggest in the world. And while the system has multiple redundant copies that eliminate the possibility of an individual server failure taking the system down, issues that affect the entirety of AWS do have the potential to do so.
- Malicious Cyber Attack - Distributed Denial-of-Service (DDoS) attacks are among the most common methods of attacking a system these days. In a DDoS attack, the hackers use multiple external systems to flood the targeted servers with a never-ending stream of requests, overloading the system and preventing legitimate requests from going through. Hackers also try to 'poison' the Domain Name System (DNS) records that direct visitors to a website, and of course, viruses and malware remain threats as well.
What Rain Is Doing to Deal with Issues
Vigilance is critical. Our systems are monitored at all times, and we have protocols in place to respond whenever a new threat emerges. We always prioritize addressing threats to service stability, responding as quickly as possible when something does go wrong.
One of the most important things we can do to maintain system security is to keep the system up to date. Using AWS makes it easy to keep our servers up to date because we can quickly add and remove servers as needed.
Sometimes problems arise when we update the system with new functionality, and we have recently implemented two big changes to significantly reduce this possibility:
- Beta Process - Prior to 2019, we didn’t have any way to test new updates with customers before releasing them system-wide. Now, we have a process where all significant updates will be available for customers to test before “going live” to everyone else.
- Feature Flags - All updates are pushed out behind a flag so that they don’t change any behavior for existing users. Once we feel ready to “release” a change, we enable the flag for all customers. Pushing things out this way takes extra time, but it also allows us to quickly disable the new functionality if for any reason there is a problem with it. Previously, fixing a problem required rolling back code or creating a solution as quickly as possible.
Problems with new feature updates occur very rarely. We pushed more than 1,200 updates over the past year, including 186 in the past thirty days, and only a handful of those required rolling back code because of a problem. Even so, the above two measures will nearly eliminate the possibility of future updates compromising the system.
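For readers curious about how a feature flag works in practice, here is a minimal sketch in Python. It is an illustration only and not Rain's actual code: the FeatureFlag class, the render_report function, the flag name, and the customer IDs are all hypothetical. It shows a new code path that stays off by default, can be enabled for a small group of beta customers, can then be enabled for everyone, and can be switched off again without rolling back any code.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """Hypothetical flag record: off by default, with an optional beta list."""
    name: str
    enabled_for_all: bool = False
    beta_customers: set = field(default_factory=set)

    def is_enabled(self, customer_id: str) -> bool:
        # The flag is on if it has been released to everyone,
        # or if this customer is part of the beta group.
        return self.enabled_for_all or customer_id in self.beta_customers


def render_report(customer_id: str, flag: FeatureFlag) -> str:
    """Hypothetical feature: choose the new or existing behavior per customer."""
    if flag.is_enabled(customer_id):
        return "new report layout"
    # Existing behavior is untouched while the flag is off.
    return "existing report layout"


flag = FeatureFlag("new_report_layout")
flag.beta_customers.add("store-1")      # beta: only store-1 sees the change
print(render_report("store-1", flag))
print(render_report("store-2", flag))

flag.enabled_for_all = True             # release: everyone sees the change
print(render_report("store-2", flag))

flag.enabled_for_all = False            # problem found: switch it off again
print(render_report("store-2", flag))
```

The point of the design is that turning a feature off is a settings change rather than a code change, which is why it can be done quickly when something goes wrong.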
Timely System Status Information
Any time you experience an issue with the system being down, you can check the site below to see the latest information on the status of the system.
Also, for more information on what you can do if we do experience an outage, please click here.
Our Promise Moving Forward
We will continue to look for new ways to proactively reduce the risk of system outages and to improve our process for system updates. We cannot be successful unless you are successful with our system, so our priority will always be making our service what you need to achieve your success.