AWS WAF pillar one: Operational excellence tools and best practices
Harnessing the full power of the AWS® cloud
involves far more than building a solid technical infrastructure. Amazon developed the
Well-Architected Framework (WAF)
to help companies build the most operationally excellent, secure, reliable, high-performing,
and cost-optimized infrastructure possible for their businesses. This post
addresses the first pillar, operational excellence.
Business operations play an increasing role in how companies transform themselves through
cloud computing. Operational excellence is one of the pillars, or areas of focus, in the
AWS WAF. This pillar covers best practices for developing robust, repeatable processes for
all aspects of managing your cloud infrastructure.
Operational excellence in the AWS cloud starts with preparation
A pilot runs through a pre-flight checklist before takeoff. In the same way, AWS recommends
that you use operational checklists to confirm that your workloads are ready for production
operation and to keep untested workloads out of production.
Operational excellence checklists
Create and use these checklists for operational excellence in AWS:
- Operational checklist: Create an operational checklist that you use to evaluate whether
you are ready to operate the workload.
- Planning checklist: This may seem redundant, but it is important to have a plan that
syncs with company events, milestones, and roadmaps to stay in front of events that might
cause sudden increases in traffic and requests for specific resources, where network
performance could impact a company’s revenue or reputation.
- Security checklist: Security is among the most misunderstood features of the cloud.
You should develop and use a detailed security checklist to ensure that you are ready to
securely operate the workload and respond to any security event or attack.
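The three checklists above can be treated as a simple readiness gate. The following is a minimal sketch, not an AWS tool: the `ChecklistItem` and `Checklist` classes, the `ready_for_production` function, and the example items are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    description: str
    done: bool = False


@dataclass
class Checklist:
    name: str
    items: list[ChecklistItem] = field(default_factory=list)

    def open_items(self) -> list[str]:
        """Return descriptions of items that are not yet complete."""
        return [item.description for item in self.items if not item.done]


def ready_for_production(checklists: list[Checklist]) -> tuple[bool, dict[str, list[str]]]:
    """A workload is ready only when no checklist has open items."""
    blockers = {c.name: c.open_items() for c in checklists if c.open_items()}
    return (not blockers, blockers)


# Hypothetical example content; real checklists would be far more detailed.
checklists = [
    Checklist("operational", [ChecklistItem("Runbooks written", True),
                              ChecklistItem("Alarms configured", True)]),
    Checklist("planning", [ChecklistItem("Launch dates reviewed", True)]),
    Checklist("security", [ChecklistItem("Incident response plan tested", False)]),
]

ready, blockers = ready_for_production(checklists)
print(ready, blockers)
```

A gate like this could run as a step in a release process so that a workload with any open item is blocked before it reaches production.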
AWS configuration management best practices
You should document how you monitor, measure, and manage your architecture, environments,
and the configuration parameters for resources within them to easily identify components
for tracking and troubleshooting.
Changes to configurations should also be trackable and automated. In a Configuration
Management Database (CMDB), record detailed resource-tracking information by using tags
and metadata, and keep thorough, accessible documentation of your entire architecture and
infrastructure configuration.
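Tag-based tracking only works if tags are applied consistently, so one possible check is an audit that flags resources with missing tags. The sketch below assumes a hypothetical tagging policy (`REQUIRED_TAGS`) and a made-up in-memory inventory; it does not call any AWS API.

```python
# Hypothetical tagging policy; choose keys that match your own CMDB.
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter"}


def missing_tags(resource_tags: dict[str, str],
                 required: set[str] = REQUIRED_TAGS) -> set[str]:
    """Return required tag keys that are absent or empty on a resource."""
    return {key for key in required if not resource_tags.get(key, "").strip()}


def audit(inventory: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each non-compliant resource ID to its sorted list of missing tags."""
    report = {}
    for resource_id, tags in inventory.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = sorted(gaps)
    return report


# Made-up inventory; in practice this could be built from tag data
# exported from your AWS accounts.
inventory = {
    "i-0abc": {"Owner": "web-team", "Environment": "prod", "CostCenter": "1234"},
    "i-0def": {"Owner": "data-team", "Environment": ""},
}
report = audit(inventory)
print(report)
```

Treating an empty tag value as missing is a design choice in this sketch; some teams might accept empty values for optional keys.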
Automate cloud deployment for operational excellence
Automation can take human error out of the operational excellence equation. You should
include regular quality assurance testing and defined mechanisms that can continually track,
audit, roll back, and review changes as warranted.
Best practices for AWS deployment automation include:
- Develop a deployment pipeline (including a source code repository, build systems, and
deployment and testing automation) with standard automated procedures for continuous
integration and continuous delivery (CI/CD).
- Have an automated release management process.
- Design a process to revert changes if they produce operational issues.
- Create risk management strategies (such as blue/green deployments, canary releases, and
A/B testing) to assess risks continually.
- Use system monitoring with CloudWatch® to monitor system performance.
- Set alarms and notifications based on key performance thresholds that indicate problems
or opportunities for improvement.
- Automate actions based on performance, such as using Auto Scaling to automatically add
capacity based on current conditions.
- Track and save logs, including application logs, AWS service-specific logs, and VPC flow
logs, and use CloudTrail® to record API activity, so that you can troubleshoot and review
performance.
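As one illustration of the alarm practice in the list above, the following sketch builds the arguments for a CloudWatch CPU alarm and passes them to the boto3 `put_metric_alarm` call. The function names, instance ID, SNS topic ARN, and the 80 percent threshold are assumptions for illustration only. The `create_alarm` function needs boto3 and AWS credentials, so the sketch defines it without calling it.

```python
def cpu_alarm_params(instance_id: str, topic_arn: str, threshold: float = 80.0) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                # seconds per datapoint
        "EvaluationPeriods": 2,       # two consecutive breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # for example, an SNS topic that notifies on-call staff
    }


def create_alarm(params: dict) -> None:
    """Create the alarm. Requires boto3 and AWS credentials; not called in this sketch."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers, not real resources.
params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])
```

Keeping the parameters in a plain function makes the alarm definition easy to review and version alongside the rest of the deployment pipeline.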
Respond efficiently in AWS
Responding to network problems is as important as preventing them in the first place. You
should be prepared to automate responses as much as possible, including alerts and
notifications as well as actions and recovery. It is also important to have escalation
procedures in place to get the right issue to the right resources as quickly as possible.
Best practices for responding to unplanned events include:
- Create an event response playbook that everyone follows. The playbook defines escalation
guidelines and procedures and identifies the circumstances for when you should activate it.
- Automate responses as much as possible, such as using Auto Scaling to automatically add
capacity when the system passes critical load thresholds.
- Develop a root cause analysis (RCA) process to ensure that you can resolve, document, and
fix issues so that they do not recur. Make sure you’re not just fixing symptoms of a deeper
problem.
- Develop an escalation process that puts the necessary stakeholders and systems in place
for receiving alerts when escalations occur.
- Automate escalation as much as possible based on demand or time thresholds, sending the
issue to the right resources.
- Create an automated escalation queue between appropriate functional teams based on
priority, impact, and intake mechanisms.
- Use a demand- or time-based approach to escalate higher in the organization as impact,
scale, or time to resolution and recovery of an incident increases.
- Define when to escalate externally to AWS or an AWS partner.
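The time- and impact-based escalation described in the list above can be sketched as a small function. Everything here is hypothetical: the escalation ladder, the priority scale, and the 30-minute step are illustrative assumptions, and a real playbook would define its own values.

```python
from dataclasses import dataclass

# Hypothetical escalation ladder, lowest to highest.
LEVELS = ["on-call engineer", "team lead", "engineering manager", "AWS Support or partner"]


@dataclass
class Incident:
    priority: int      # assumed scale: 1 = most severe, 4 = least severe
    minutes_open: int  # time since the incident was opened


def escalation_level(incident: Incident, step_minutes: int = 30) -> str:
    """Pick who should own the incident based on severity and elapsed time."""
    # The most severe incidents start one rung higher on the ladder.
    start = 1 if incident.priority == 1 else 0
    # Each full step without resolution moves the incident one rung up.
    steps = incident.minutes_open // step_minutes
    index = min(start + steps, len(LEVELS) - 1)
    return LEVELS[index]


print(escalation_level(Incident(priority=3, minutes_open=10)))
print(escalation_level(Incident(priority=1, minutes_open=600)))
```

Capping the index at the top of the ladder means long-running incidents stay with the highest level rather than raising an error.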
Conclusion
The AWS Operational Excellence pillar focuses on running and monitoring systems to deliver
business value and continually improving processes and procedures. It helps organizations
spread the benefits of cloud adoption beyond the IT department. It also ensures that the
cloud infrastructure can efficiently manage changes, respond to events, and automate
standards-based tasks and processes to successfully manage daily operations.
Learn more about Rackspace AWS services.