Keeping it simple to avoid network outages

Just last year, some of the most significant and recognisable companies, including Facebook and Twitter, experienced major interruptions to their services due to network outages. It goes without saying that this sent the world’s most dominant social media sites into crisis mode. Similarly, well-publicised outages have affected more than six hundred million users in China, not to mention a recent Google outage that had a huge impact on the Internet.

Whether these organizations experienced downtime from internal network errors or full-blown Denial of Service (DoS) or Distributed Denial of Service (DDoS) attacks, the costs to their reputations and revenues are significant. According to a recent survey from the Ponemon Institute, the average cost per minute of unplanned data centre downtime sits at USD $7,900, a staggering 41 percent increase from USD $5,600 per minute in 2010. With this in mind, companies today need to be prepared to minimise, or even prevent, such outages.

Which is the greater threat?

While media reports tend to hype hacking, DDoS attacks and attackers taking down the entire internet, the reality is that most outages are caused by an organization’s own network. A recent Gartner study projected that through 2015, 80 percent of outages impacting mission-critical services will be caused by people and process issues, and more than 50 percent of those outages will be caused by change/configuration/release integration and hand-off issues. In fact, both Xbox LIVE and Facebook suffered network outages from configuration errors during routine maintenance, and while China’s government blamed its outage on hackers, some independent watchers believe it was actually due to an internal configuration error in the firewall.

Indeed, one of the leading sources of network outages is human intervention, through configuration errors injected during routine maintenance: in other words, good old human error.

That’s not to say that external threats aren’t important to prepare for as well. Attacks are now being carried out on organizations of all sizes. SolarWinds recently sponsored a survey about security in the UK, and the results were surprising: many of the surveyed companies reported having been targeted by various attacks, yet were not taking even basic steps to protect themselves.

Going back to basics

For business owners and IT managers, there are expensive, high-tech ways of mitigating risk and keeping networks up, but there are also low-tech, low-budget ways of mitigating network outages, even if they can never be completely eliminated.

1) Checks and balances. Common sense dictates that system changes should be reviewed by another pair of eyes, but not all organizations do this. The best practice of code reviews in software development has proven to increase code quality and significantly reduce the number of errors injected; operations teams should adopt the same practice.

2) Monitor, monitor, monitor. Ensure systems are monitored properly before any changes are made so that a good baseline is available, making errors more easily detectable. Alerts should be properly configured so that IT teams can respond quickly if the health, availability or performance of a system is impacted negatively following a change. The alerts should also be reviewed regularly to ensure they reflect the SLAs and other requirements dictated by business needs.
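As a minimal sketch of the baseline-then-alert idea, the snippet below summarises pre-change measurements and flags post-change readings that drift outside the established range. The metric (response time in milliseconds) and the three-sigma tolerance are illustrative assumptions, not specific to any monitoring product:

```python
import statistics

def build_baseline(samples):
    """Summarise pre-change measurements (e.g. response times in ms)."""
    return {"mean": statistics.mean(samples),
            "stdev": statistics.pstdev(samples)}

def check_against_baseline(baseline, value, tolerance=3.0):
    """Flag a post-change reading that drifts more than `tolerance`
    standard deviations above the pre-change mean."""
    limit = baseline["mean"] + tolerance * baseline["stdev"]
    return value > limit

# Measurements collected before the maintenance window
before = [102, 98, 105, 101, 99, 103]
baseline = build_baseline(before)

# A reading taken after the change
print(check_against_baseline(baseline, 180))  # True: well above the pre-change range
```

The point is that without the `before` samples there is no `baseline`, so a degradation of this size would only surface once users complain.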

3) Have a back-up plan. Make sure a solid fallback mechanism is in place so that the network can revert to the last state of configuration once a problem is detected.
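One low-tech way to implement such a fallback is to snapshot the configuration before every change and restore it automatically when a post-change health check fails. A sketch in Python, where the file name and the health check are placeholders for whatever the real environment uses:

```python
import shutil
import tempfile
from pathlib import Path

def apply_change(config: Path, new_text: str, health_check) -> bool:
    """Back up `config`, write the new version, and roll back
    automatically if the health check fails afterwards."""
    backup = config.with_suffix(config.suffix + ".bak")
    shutil.copy2(config, backup)       # snapshot the last-known-good state
    config.write_text(new_text)
    if health_check():
        return True                    # change accepted
    shutil.copy2(backup, config)       # revert to the previous state
    return False

# Demonstration with a throwaway file and a health check that fails:
cfg = Path(tempfile.mkdtemp()) / "router.conf"
cfg.write_text("mtu 1500\n")
apply_change(cfg, "mtu 9000\n", health_check=lambda: False)
print(cfg.read_text())  # "mtu 1500" again: the change was rolled back
```

The essential design choice is that the backup is taken *before* the change is applied, so reverting never depends on the failed change being in a recoverable state.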

4) Keep things simple. When an error is buried in a sweeping series of changes affecting multiple parts of the IT infrastructure, it becomes difficult to isolate and remediate. Break massive changes down into smaller, more manageable chunks that can be reverted atomically.

5) Build in room for error. It’s surprising how often IT teams go full steam ahead in rolling out changes without thinking about how they will revert to the previous state should errors occur. These teams should assume errors will happen, and create an action plan for addressing those errors when they do.

6) Communication, the old-fashioned way. Any application or system owners impacted by changes should be notified before the changes occur, including their scope and timeframe. This advance warning lets owners stay vigilant for abnormal application or system behaviour.


Additional threats

DoS attacks can originate internally through a Trojan horse or virus impacting one or more internal systems. Externally, DDoS attacks originate from multiple systems on the Internet acting in an orchestrated manner to bring down publicly facing systems. Mitigating these threats will require more sophistication, but nevertheless, following tried-and-true best practices will still be key to protecting networks.

1) Strengthen your shield. The first level of defence is ensuring firewalls are configured properly and systems are patched with the latest security updates. Will these measures prevent every attack? No, but they are basic steps that many organizations ignore, leaving themselves vulnerable.

2) Keep vigilant. Appropriately monitor the firewalls and key systems in your network to detect the abnormal events that usually accompany DoS and DDoS attacks, including high connection counts and high CPU and bandwidth utilization. Different monitoring systems provide different ways to define what “normal” means, ranging from fully manual thresholds to learning from past data to identify a normal range of operation. No matter the system, it is important for the IT team to understand the different thresholds being used and how those evolve over time. These systems should be capable of alerting IT staff to abnormal network behaviours and events.
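The difference between manual and learned thresholds can be sketched as follows: rather than hard-coding a connection-count limit, derive the “normal” range from recent history. The numbers and the simple peak-times-multiplier rule below are illustrative assumptions; real monitoring systems use more sophisticated models:

```python
def learned_threshold(history, multiplier=2.0):
    """Derive an upper bound for 'normal' from past observations:
    here, simply the historical peak times a safety multiplier."""
    return max(history) * multiplier

# Connection counts observed over recent, attack-free days
history = [1200, 1350, 1100, 1400, 1250]

MANUAL_LIMIT = 10_000                       # a hand-picked static threshold
learned_limit = learned_threshold(history)  # 2800.0: tracks actual usage

current = 5000                              # today's connection count
print(current > MANUAL_LIMIT)   # False: the static limit misses the spike
print(current > learned_limit)  # True: the learned limit flags it
```

A fourfold jump in connections is flagged immediately by the learned limit, while the arbitrary static limit would let it pass unnoticed.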

3) Use appropriate technology. It can be difficult to figure out which data stream(s) to monitor in order to determine a baseline for normal behaviour. Leveraging deep packet inspection or flow-based technology provides a live picture of the traffic on the network, minimizing the time it takes to detect abnormal behaviour.

4) Assign responsibility. Ownership empowers and confers accountability. It is extremely important to designate someone in the IT organization to be responsible for the security posture of the company. That individual should be involved in security assessments and analyses, and be consulted any time there is suspicion of a security-related attack. This person should also be responsible for staying abreast of the security threat landscape that could impact the business, and for briefing and educating the rest of the organization. This does not relieve the IT team at large of security responsibilities; it merely puts someone in charge of the effort.

Enterprises tend to downplay the security risks associated with attacks on their infrastructure, putting themselves at risk of being caught off guard by threats to their network, whether malicious or benign. The basic steps described above are not surefire remedies, but they are essential pillars of network defence that many organizations overlook, and they can help shore up networks against unplanned outages, ultimately saving companies from taking a hard stumble.

Joel Dolisy is the SVP, CTO, CIO at SolarWinds