Running Critical Infrastructure

Let me start off by saying that running critical applications, services and infrastructure is not a one-time install after which you are ready to go. It is a process: a continuous cycle of maintaining, upgrading, replacing and optimizing.

The bigger the environment, or the more sites you have, the harder it becomes to maintain insight into the day-to-day operations of services and systems. That can lead to security flaws, unpatched systems and a variety of "low-hanging fruit" ready for the picking, or even to service outages.

Personally, I like to break IT service design and application delivery down into a set of processes, a cycle:

  1. Plan
  2. Design
  3. Build
  4. Maintain
  5. Innovate


Plan is the process in which you first sit down and discuss the end goal, the thing you want to achieve. This step does not speak about the technology used at all, only about the fundamental goal or business case you are trying to solve.

The plan process can take weeks or months to get right. A tip I learned recently is to keep talking about the plan for a few weeks and watch how it changes, until you have thought of all the ways to do it, from the practical methods to the downright insane. This gives you a better base to start from, as planning is the most important step in the process.


When I say design, I am referring to the technical side of the project. How are we going to achieve our goal? What technology should we use? How many users will be using the platform? These are just a few examples of the questions to consider in the design phase. Below is a list of the common areas that need to be thought about:

  • Scaling: How many users? What system capacity? What predicted growth rate?
  • Technology: What stack are we using? What is the community behind it? How easy is it to configure and run?
  • Security: Are best practices used? Access and control, patching, auditing and updates are all steps that make the security process easier and protect the environment from unauthorized use or access.
  • Component Failure: Systems will fail at some point, and that is nothing to be worried about. Disasters occur when systems fail because nobody planned for the failure. Two methods of failure planning are High-Availability and Backup and Recovery. High-Availability can protect against physical component failure and network failure and, if planned right, against network partitions. It can not always protect against data corruption. Backup and Recovery keeps copies of the data over a longer period of time and in multiple locations, which also covers loss of access such as a building burning down. Common industry terms for the protection and recovery goals are RTO (Recovery Time Objective), the maximum time you are willing to spend getting from having a backup to being online again, and RPO (Recovery Point Objective), the maximum amount of recent data, measured in time, that you can afford to lose. An RPO of 24 hours, for example, means the newest backup may never be more than 24 hours old. How long backups are kept is a separate setting, the retention period: a retention period of 1 month means data in the backup is kept for at least a month, and backups older than that can be removed.
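
As a rough sketch of how backup goals like these can be checked, the Python below compares a list of backup timestamps against three targets. It takes the RPO as the maximum allowed age of the newest backup, treats how long backups are kept as a separate retention period, and takes the RTO as the maximum allowed duration of a restore test. The function names, the daily 02:00 schedule and all the numbers are assumptions made up for illustration, not part of any backup tool.

```python
from datetime import datetime, timedelta


def review_backups(backup_times, now, rpo, retention):
    """Check a list of backup timestamps against an RPO and a retention period.

    Returns (meets_rpo, expired): whether the newest backup is recent enough,
    and which backups are older than the retention period.
    """
    newest = max(backup_times, default=None)
    meets_rpo = newest is not None and (now - newest) <= rpo
    expired = sorted(t for t in backup_times if (now - t) > retention)
    return meets_rpo, expired


def meets_rto(restore_duration, rto):
    """A restore test meets the RTO if it finished within the target time."""
    return restore_duration <= rto


# Hypothetical schedule: one backup per day at 02:00 for the last 40 days.
now = datetime(2015, 10, 26, 12, 0)
backups = [datetime(2015, 10, 26, 2, 0) - timedelta(days=d) for d in range(40)]

ok, expired = review_backups(
    backups, now, rpo=timedelta(hours=24), retention=timedelta(days=30)
)
print("RPO met:", ok)
print("Backups past retention:", len(expired))
print("RTO met:", meets_rto(timedelta(hours=3), rto=timedelta(hours=4)))
```

A check like this only tells you that backups exist on schedule. Whether the RTO is realistic still has to be proven by actually timing a restore.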

As you can see, these are just some of the common things that go into a good design. Don't bother chasing a perfect design. Go for the design that works best for your use case, and always aim for the simplest one.


Build is the step where you build the infrastructure. You have settled the planning and design of the system (business case and technical design), including the technology, and you are now ready to build it, whether in a physical rack or on a public cloud.

One step that is not in my list, but is really step 6, is Adapt. At this point you may need to change things, as not everything goes to plan, so make sure you have a process that can adapt with your changes and that you log every change that deviates from the original design. Git can be used to track changes like this in a simple way.


The Maintain step lasts until the end of life of the service, or until a new project to replace it starts.

Maintaining the infrastructure means replacing failed components, updating the stack, reporting changes, monitoring and re-optimizing parts of the service as you go. It is a critical step, as even the simplest applications require maintenance, bug fixes or security updates.


As time goes on, the way we do the same tasks changes as we try to optimize for time and efficiency. This process, Innovate, also continues until the end of life of the service or until a new project to replace it starts.

I use this step as a way to look for optimizations to processes: how to use the service and how to maintain it. Automation is a key factor here. Automating repetitive tasks can save time in the long run but costs extra time in the short run. Technology optimizations can mean looking for ways to make the service faster or easier to use, or leveraging extra performance from your current system by tweaking components.
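
The trade-off between short-run cost and long-run saving in automation can be put into numbers with a simple break-even estimate. The sketch below is only an illustration: the function name and the example figures (8 hours to build the automation, 30 minutes per manual run, 5 minutes per automated run) are assumptions, and it ignores the ongoing cost of maintaining the automation itself.

```python
import math


def break_even_runs(build_minutes, manual_minutes_per_run, automated_minutes_per_run):
    """Number of runs after which automating a task has paid for itself.

    Returns None when the automated run is not faster than the manual one,
    because the time spent building the automation is then never recovered.
    """
    saved_per_run = manual_minutes_per_run - automated_minutes_per_run
    if saved_per_run <= 0:
        return None
    return math.ceil(build_minutes / saved_per_run)


# Hypothetical task: 8 hours to automate, 30 minutes by hand, 5 minutes automated.
runs = break_even_runs(480, 30, 5)
print("Automation pays off after", runs, "runs")
```

A task that runs daily reaches that point within a month, while one that runs once a year may never get there, which is a reasonable way to decide what to automate first.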

Written by Matthew Frost on Monday October 26, 2015