Recently, the Rain system experienced a couple of system outages that affected our entire client base, and we want to be transparent about the cause of these issues and open about the steps we are taking to greatly reduce the chance of this happening again.
In a letter to all of our clients, our founder and CEO, Sean Roylance, said the following:
"As you know, we experienced an interruption to our service. We sincerely apologize for the frustration and the disruption to your business that this caused.
"We understand that all systems experience periodic outages. Because the disruption was triggered by some unusual searches and not by a system update, we are taking the following steps to better handle and mitigate unexpected searches that could be problematic:
1) We have specifically added code to identify and eliminate the type of search that caused the problems experienced yesterday.
2) We have a clear plan to significantly reduce the likelihood that a given store will experience any specific future outage by 80%-90%. We are going to split our system into 5 (and later 10) independent segments. That way, if something impacts a particular segment, it will impact at most 20% of our clients at a time. This strategy will also help to significantly reduce the impact of any newly released bugs in the software. The chance of any store immediately experiencing a new bug will be at most 20% rather than as much as 100% with the current infrastructure.
"Here at Rain, reliability and transparency are of the utmost importance. We are taking this unfortunate occurrence very seriously, and will do everything in our power to provide an improved user experience in the months to come. Again, we apologize for the disruptions that were caused, and we truly appreciate your patience and understanding."
Sincerely, Sean Roylance
Rain Retail Software
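For readers who want to see the arithmetic behind the 80%-90% figure in the letter, here is a minimal sketch. It assumes that stores are spread evenly across segments and that an incident stays inside one segment. The function names and the hash-based assignment are illustrative assumptions, not a description of how Rain actually assigns stores.

```python
import hashlib


def segment_for(store_id: str, num_segments: int) -> int:
    """Assign a store to a segment using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(store_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def max_share_affected(num_segments: int) -> float:
    """Largest fraction of clients hit by a one-segment incident,
    assuming all segments hold the same number of clients."""
    return 1 / num_segments


for n in (5, 10):
    share = max_share_affected(n)
    print(f"{n} segments: at most {share:.0%} affected, {1 - share:.0%} reduction")
# prints:
# 5 segments: at most 20% affected, 80% reduction
# 10 segments: at most 10% affected, 90% reduction
```

With 5 segments an incident reaches at most 20% of clients (an 80% reduction), and with 10 segments at most 10% (a 90% reduction), which is where the 80%-90% range comes from.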
Understanding Down Time
The truth is, every online entity, no matter the size or capability, experiences down time. Every company tries to minimize this in various ways, but it is a sad inevitability of utilizing a system that is entirely online.
Down time can be caused by several factors, but usually results from one or more of the following:
- Human Error - Even the most brilliant system administrator or web developer is only human, and may make a mistake of some kind that can cause a system outage.
- Equipment Failure - We use Amazon S3, a premier cloud service used by a significant portion of internet entities, including some of the biggest in the world. While the system keeps multiple redundant copies so that an individual server failure cannot take it down, issues that affect all of S3 do have the potential to do so.
- Malicious Cyber Attack - Distributed Denial-of-Service (DDoS) attacks are the most common method of attacking a system these days. The attackers use multiple external systems to flood the targeted servers with a never-ending stream of requests, overloading the system and preventing legitimate requests from going through. Hackers also try to 'poison' a website's domain name servers (DNS) through malicious code, and of course, viruses and malware remain threats as well.
What Rain Is Doing to Deal with Threats
Vigilance is critical. Our system administrators actively monitor the system status 24/7/365, and have protocols in place to respond whenever a new threat emerges. We always prioritize addressing threats to service stability, responding as quickly as possible when something does go wrong.
Sometimes things go wrong when we update the system with new functionality, and we have recently implemented two big changes to significantly reduce this possibility:
- Beta Process - Prior to 2019, we didn’t have any way to test new updates with customers before releasing them system-wide. Now, we have a process where all significant updates will be available for customers to test before “going live” to everyone else.
- Feature Flags - All updates are pushed out behind a flag so that they don’t change any behavior for existing users. Once we feel ready to “release” a change, we enable its flag for all customers. Pushing things out this way takes extra time, but it also allows us to quickly disable the new functionality if there is a problem with it for any reason. Previously, fixing a problem required rolling back code or creating a solution as quickly as possible.
Problems with new feature updates occur very rarely. We pushed over 1,200 updates over the past year, including 186 in the past 30 days, and only a handful of those required reversing the code due to any kind of problem. Even so, the above two measures will nearly eliminate the possibility of future updates compromising the system.
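To make the feature flag idea above concrete, here is a minimal sketch of how a flag can guard new behavior. The class, the flag name, and the receipt function are hypothetical examples chosen for illustration; they are not taken from Rain's code.

```python
class FeatureFlags:
    """Tiny in-memory flag store; every flag is off unless enabled."""

    def __init__(self) -> None:
        self._enabled: dict[str, bool] = {}

    def is_enabled(self, name: str) -> bool:
        return self._enabled.get(name, False)

    def enable(self, name: str) -> None:
        self._enabled[name] = True

    def disable(self, name: str) -> None:
        self._enabled[name] = False


def render_receipt(total: float, flags: FeatureFlags) -> str:
    """Use the new layout only when its (hypothetical) flag is on."""
    if flags.is_enabled("new_receipt_layout"):
        return f"TOTAL DUE: ${total:.2f}"
    return f"Total: {total:.2f}"  # existing behavior, unchanged by the deploy


flags = FeatureFlags()
print(render_receipt(12.5, flags))   # old layout: the code is deployed but off
flags.enable("new_receipt_layout")
print(render_receipt(12.5, flags))   # new layout: released by flipping the flag
flags.disable("new_receipt_layout")
print(render_receipt(12.5, flags))   # old layout again, with no code rollback
```

The point of the design is that releasing and reversing a change both become a flag flip rather than a code deployment.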
Timely System Status Information
Any time you experience an issue with the system being down, you can check the site below to see the latest information on the status of the system.
Also, for more information on what you can do if we do experience an outage, please click here.
Our Promise Moving Forward
We will continue to strive for new ways to proactively reduce the risk of system outages and also to improve our process for system updates. We cannot be successful unless you are successful with our system, so our priority will always be making our service what you need to achieve your success.