Server Move Post-Mortem

As you may know, we have just completed a migration of the main FATdrop servers to a new datacentre. The migration involved more downtime than we were anticipating and we want to apologise for that, and also take the opportunity to offer an update and explanation to our clients about recent disruptions to the FATdrop service.

Over the past six months, we have been affected by several datacentre-related problems that have caused outages. These have included a hardware failure of the cooling fans in one of the servers, a broken fibre connection at the datacentre, and two power failures. A good datacentre should have backup systems in place for all of these issues and be resilient against them, yet time after time we discovered that our datacentre was not prepared for them or able to deal with them.

Because of that, we decided that we needed to move to another, better-equipped datacentre in order to provide the level of service that we want for our clients, and for the last few months we have been working on this migration. We also decided that, in addition to the datacentre move, we would take the opportunity to upgrade to bigger, faster servers and perform a number of updates to the system that FATdrop runs on. It has been a complex operation, but we believe it will be worthwhile.

On Tuesday 20th May at 5:30 AM BST, we had a planned maintenance window scheduled to migrate the FATdrop services to the new datacentre. We selected this time because we had determined it to be a period of low usage. The migration was expected to last two hours.

During this time, we performed a detailed set of migration tests which appeared to show that the migration had been successful. Not only that, our benchmark analysis of system speed showed a very significant improvement. However, following the switchover to the new datacentre, at 7:00 AM BST, we started to detect a slow but increasing degradation in performance.

Our engineers identified a slow database query that was causing the load on the database to increase over time as the traffic increased. This, in turn, caused the system to slow down and, eventually, stall. Once the problem was identified, our engineers were able to quickly patch the system, and normal access was resumed at 1:00 PM BST.

We are looking into ways to improve our testing methods in order to pick up on potential issues like this before they arise. Please accept our apologies for yesterday’s problems.

We are committed to providing you with the best possible service and we hope you enjoy FATdrop’s bigger, faster, more reliable home.
