Mitigating downtime of a SaaS application

by Lee Porter 13. July 2009 10:49

It seems that the greatest concern when implementing a web-based application is the possibility of downtime, given the fallibility of the many systems that come together to serve a website. Although this problem can never be entirely overcome, it can be mitigated.

I can't speak for other SaaS operations, but this is how we have kept our Programme Office Toolkit (POT) service at 100% uptime for the last few years.

Summary points are:

  1. We provide Production and Failover services for both Websites and Databases.
  2. The Production and Failover services are in geographically remote locations and are hosted by different providers.
  3. Failover websites can also be used to access the Production database.
  4. Backups
    1. Production DB backups are taken daily: held locally, transferred to the failover server and backed up again there (production is backed up in 3 formats daily and the backups are held for 2 years).
    2. Production website code is monitored daily for changes and any changes are transferred to the failover server.
  5. Servers are automatically monitored 24/7/365.
    1. Production: local monitoring via the hosting provider plus remote monitoring, with alerts emailed to 2 IPS office locations (UK/Ireland).
    2. Failover: remote monitoring only.
    3. Both are automatically monitored every 5 minutes for memory, disk space, processing, connectivity and web response.
  6. All servers have RAID disk arrays, so the chance of losing any data is very slim. Fire and theft are the only possible causes, but all server halls have fire suppression, cooling, UPS systems, full-time 24/7 security staff and secure access.
  7. Production servers have a 1-hour replacement of key components, so the worst case is a maximum of 1 hour of downtime if we do not fail over immediately. If the fault is with the website, no time is lost because the failover website is instantly available.
  8. Failover servers are covered by a Dell 4 hour response contract.
  9. IPS has been hosting with Rackspace (a very high quality and costly service provider) since 2003.
  10. Historically, we have never had any problems with POT in the past 5 years, with over 99.9% availability during client business hours.
  11. Failover procedures are tested 4 times a year.
  12. The platform systems and POT3 software have been security and penetration tested by QinetiQ and passed.
  13. The only thing we are not in control of is the Internet but all hosting providers have multiple redundancies on Internet connections. We also suggest our clients have redundant Internet connections from different providers.
  14. IPS has a bilateral support agreement with another company (who also helped develop parts of POT3) to provide a flexible expansion in staffing capabilities during periods of high development demand should the current staff of either company need additional assistance or specialist knowledge.
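
To make the monitoring in point 5 a little more concrete, here is a minimal Python sketch of what one 5-minute check could look like. It is not our actual monitoring tooling. The threshold values, the `Sample` structure and the function names are all assumptions made up for illustration. Only the five things being checked and the 5-minute interval come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

CHECK_INTERVAL_SECS = 5 * 60  # checks run every 5 minutes (from the list above)

# Assumed limits, for illustration only; real thresholds are not stated here.
MAX_MEMORY_PCT = 90.0
MAX_DISK_PCT = 85.0
MAX_CPU_PCT = 95.0
MAX_WEB_RESPONSE_SECS = 5.0


@dataclass
class Sample:
    """One hypothetical set of readings taken from a server."""
    memory_pct: float
    disk_pct: float
    cpu_pct: float
    reachable: bool
    web_response_secs: Optional[float]  # None means the site did not answer


def failed_checks(sample: Sample) -> List[str]:
    """Return the names of the checks that this sample fails."""
    failures = []
    if sample.memory_pct > MAX_MEMORY_PCT:
        failures.append("memory")
    if sample.disk_pct > MAX_DISK_PCT:
        failures.append("disk space")
    if sample.cpu_pct > MAX_CPU_PCT:
        failures.append("processing")
    if not sample.reachable:
        failures.append("connectivity")
    if (sample.web_response_secs is None
            or sample.web_response_secs > MAX_WEB_RESPONSE_SECS):
        failures.append("web response")
    return failures


def alert_message(server: str, sample: Sample) -> Optional[str]:
    """Build the text of an alert email, or None if the server is healthy."""
    failures = failed_checks(sample)
    if not failures:
        return None
    return f"{server}: failed checks: {', '.join(failures)}"
```

In a setup like this, a scheduler would take a sample from each server every `CHECK_INTERVAL_SECS` and email whatever `alert_message` returns. The decision to fail over stays with a person in this sketch.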

