AWS Well-Architected Framework
Introduction
The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS.
Definitions
On Architecture
General Design Principles
Stop guessing your capacity needs
Test systems at production scale
Automate to make architectural experimentation easier
Allow for evolutionary architectures
Drive architectures using data
Improve through game days
The Five Pillars of the Well-Architected Framework
Operational Excellence
Design Principles
Perform operations as code
Annotate documentation
Make frequent, small, reversible changes
Refine operations procedures frequently
Anticipate failure
Learn from all operational failures
Best Practices
Prepare: AWS Config and AWS Config rules can be used to create standards for
workloads and to determine if environments are compliant with those standards
before being put into production.
Operate: Amazon CloudWatch allows you to monitor the operational health of a
workload.
Evolve: Amazon Elasticsearch Service (Amazon ES) allows you to analyze your log
data to gain actionable insights quickly and securely.
Security
Design Principles
Implement a strong identity foundation
Enable traceability
Apply security at all layers
Automate security best practices
Protect data in transit and at rest
Keep people away from data
Prepare for security events
Best Practices
Identity and Access Management
Detective Controls
Infrastructure Protection
Data Protection
Incident Response
Reliability
Design Principles
Test recovery procedures
Automatically recover from failure
Scale horizontally to increase aggregate system availability
Stop guessing capacity
Manage change in automation
Best Practices
Foundations
Change Management
Failure Management
Performance Efficiency
Design Principles
Democratize advanced technologies
Go global in minutes
Use serverless architectures
Experiment more often
Mechanical sympathy
Best Practices
Selection
Review
Monitoring
Tradeoffs
Cost Optimization
Design Principles
Adopt a consumption model
Measure overall efficiency
Stop spending money on data center operations
Analyze and attribute expenditure
Use managed and application level services to reduce cost of ownership
Best Practices
Cost-Effective Resources
Matching supply and demand
Expenditure Awareness
Optimizing Over Time
Best Practices
Operational Excellence
OPS 1 What factors drive your operational priorities?
Operational priorities are the focus areas of your operations efforts. Clearly define and agree to
your operations priorities to maximize the benefits of your operations efforts.
Evaluate business needs
Evaluate compliance requirements
Evaluate risk
OPS 2 How do you design your workload to enable operability?
The majority of the lifetime of a workload is typically spent in an operating state. Consider
operations needs as a part of system design to help you enable long term sustainment of your
workload.
Share design standards
Design for cloud operations (e.g., elasticity, on-demand scalability, pay-as-you-go pricing, automation)
Provide insights into workload behavior
Provide insights into customer behavior
Implement practices that reduce defects, ease remediation, and improve flow
Mitigate deployment risks
OPS 3 How do you know that you are ready to support a workload?
Evaluate the operational readiness of your workload, processes and procedures, and personnel
to help you understand the operational risks related to your workload.
Continuous improvement culture
Share understanding of the value to the business
Ensure personnel capability
Documented accessible governance and guidance
Use checklists
Use runbooks
Use playbooks
Practice recovery
OPS 4 What factors drive your understanding of operational health?
Define metrics for the evaluation of your workload and processes to help you understand
operations effectiveness in supporting business outcomes. Capture and analyze metrics to gain
visibility to processes and events so that you can take appropriate action.
Define expected business and customer outcomes
Identify success metrics
Identify workload metrics
Identify operations metrics
Establish baselines
Collect and analyze metrics
Validate insights
Business-level view of operations
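As a minimal sketch of how an established baseline supports metric analysis, the following compares a new reading against the mean and standard deviation of historical samples. The sample values, the function names, and the three-sigma cutoff are illustrative assumptions, not part of the framework.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Return (mean, standard deviation) for a list of historical metric values."""
    return mean(samples), stdev(samples)

def deviates(value, baseline, max_sigmas=3.0):
    """True if value lies more than max_sigmas standard deviations from the mean."""
    avg, sd = baseline
    if sd == 0:
        # A perfectly flat history: any different value counts as a deviation.
        return value != avg
    return abs(value - avg) / sd > max_sigmas

# Hypothetical latency samples in milliseconds.
history = [120, 118, 125, 122, 119, 121, 123]
baseline = build_baseline(history)

print(deviates(124, baseline))  # within normal variation
print(deviates(180, baseline))  # far outside the baseline
```

A fixed sigma cutoff is only one possible rule; workloads with strong daily or weekly cycles may need a baseline per time window.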
OPS 5 How do you manage operational events?
Prepare and validate procedures to respond to operational events to help you minimize their
potential disruption to your workload.
Determine priority of operational events based on business impact
Processes for event, incident, and problem management
Process per alert
Identify decision makers
Defined escalation paths
Push notifications
Communicate status through dashboards
Process for root cause analysis
OPS 6 How do you evolve operations?
Dedicate time and resources for continuous incremental improvement to help evolve the
effectiveness and efficiency of your operations.
Process for continuous improvement
Define drivers for improvement
Implement feedback loops
Document and share lessons learned
Perform operations metrics reviews
Security
SEC 1 How do you manage credentials for your workload?
Credentials include passwords, tokens, and keys that grant access directly or indirectly to
manage your workload. Protect credentials with appropriate mechanisms to help you reduce
the risk of accidental or malicious use.
Enforce use of multi-factor authentication
Enforce password requirements
Rotate credentials regularly
Audit credentials periodically
Use a centralized identity provider
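Regular rotation and periodic audits both come down to knowing how old each credential is. The sketch below flags credentials older than a maximum age. The 90-day policy, the inventory contents, and the function name are assumptions made for illustration; a real audit would read creation times from your identity system.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)  # assumed rotation policy, not from the framework

def stale_credentials(credentials, now, max_age=MAX_AGE):
    """Return the ids of credentials created more than max_age before now."""
    return [cid for cid, created in credentials.items() if now - created > max_age]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = {  # hypothetical credential ids and creation times
    "deploy-key": datetime(2024, 1, 15, tzinfo=timezone.utc),
    "ci-token": datetime(2024, 5, 20, tzinfo=timezone.utc),
}

print(stale_credentials(inventory, now))
```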
SEC 2 How do you control human access to services?
Control human access to services with appropriately defined, limited, and segregated access to
help you reduce the risk of unauthorized access.
Credentials are not shared
User life-cycle managed
Minimum privileges
Access requirements clearly defined
Access is granted through roles or federation
SEC 3 How do you control programmatic access to services?
Control programmatic or automated access to services with appropriately limited short-term
credentials and roles to help you reduce the risk of unauthorized access.
Credentials are not shared
Dynamic authentication
Minimum privileges
Access requirements clearly defined
SEC 4 How are you aware of security events in your workload?
Capture and analyze logs and metrics to gain visibility to security threats and events so that
you can take appropriate action.
Logging enabled where available
Analyzing AWS CloudTrail
Analyzing logs centrally
Monitoring and alerting for key metrics and events
AWS Marketplace or APN Partner solution enabled
SEC 5 How do you protect your networks?
Public and private networks and services require multiple layers of defense to help protect your
workloads from network-based threats.
Controlling traffic in Virtual Private Cloud (VPC)
Controlling traffic at the boundary
Controlling traffic using available features
AWS Marketplace or APN Partner solution enabled
SEC 6 How do you stay up to date with AWS security features and industry security
threats?
Staying up to date and implementing AWS and industry best practices including services and
features can improve the security of your workload. Being aware of the latest security threats
will help you build a threat model to identify and implement protective controls.
Evaluating new security services and features
Using security services and features
SEC 7 How do you protect your compute resources?
Configure compute resources with manageable components to protect and monitor their
integrity so that you can take appropriate action.
Hardening default configurations
Checking file integrity
Intrusion detection enabled
AWS Marketplace or APN Partner solution enabled
Configuration management tool
Patching and scanning for vulnerabilities
SEC 8 How do you classify your data?
Classification provides a way to categorize data, based on levels of sensitivity, to help you
determine appropriate protective controls.
Use a data classification schema
Data classification applied
SEC 9 How do you manage data protection mechanisms?
Data protection mechanisms include services and keys that protect data in transit and at rest.
Protect these services and keys to help you reduce the risk of unauthorized access to systems
and data.
Use a secure key management service
Use service level controls
Use client side key management
AWS Marketplace or APN Partner solution
SEC 10 How do you prepare to respond to an incident?
Prepare to investigate and respond to security incidents to help you minimize potential
disruptions to your workload.
Pre-provisioned access
Pre-deployed tools
Run game days
Reliability
REL 1 How are you managing AWS service limits for your accounts?
AWS accounts are provisioned with default service limits to prevent new users from
accidentally provisioning more resources than they need. There are also limits on how often you
can call APIs to protect AWS infrastructure. Evaluate your AWS service needs and request
appropriate changes to your limits for each region.
Active monitoring and managing limits
Implemented automated monitoring and management of limits
Aware of fixed service limits
Ensure there is a sufficient gap between the current service limit and the max usage
to accommodate failover
Service limits are managed across all relevant accounts and regions
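The gap between a service limit and maximum usage can be checked with simple arithmetic: the limit in the surviving location has to cover its own peak usage plus whatever it absorbs during failover. This is a rough sketch with hypothetical numbers and function names; it assumes the failed location's full peak usage moves to one surviving location.

```python
def failover_headroom(limit, peak_usage, failover_usage):
    """Spare capacity under limit after absorbing failover_usage on top of peak_usage.

    A negative result means the limit would be exceeded during failover.
    """
    return limit - (peak_usage + failover_usage)

def can_absorb_failover(limit, peak_usage, failover_usage):
    """True if the limit covers local peak usage plus the failed-over usage."""
    return failover_headroom(limit, peak_usage, failover_usage) >= 0

# Hypothetical figures: each region peaks at 60 instances.
print(can_absorb_failover(limit=100, peak_usage=60, failover_usage=60))
print(can_absorb_failover(limit=150, peak_usage=60, failover_usage=60))
```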
REL 2 How do you plan your network topology on AWS?
Applications can exist in one or more environments: EC2-Classic, the default VPC, or VPC(s)
created by you. Network considerations such as system connectivity, Elastic IP address and
public IP address management, VPC and private address management, and name resolution
are fundamental to using resources in the cloud. Well planned and documented deployments
are essential to reduce the risk of overlap and contention.
Connectivity back to data center is not needed
Highly available connectivity between AWS and on-premises environment is
implemented
Highly available network connectivity for the users of the workload is implemented
Using non-overlapping private IP address ranges in multiple VPCs
IP subnet allocation accounts for expansion and availability
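Overlapping private ranges can be caught before any VPC is created. The following sketch uses Python's standard `ipaddress` module to list every pair of planned ranges that overlap; the VPC names and CIDR blocks are made-up examples.

```python
from ipaddress import ip_network
from itertools import combinations

def overlapping_pairs(cidrs):
    """Return (name_a, name_b) for every pair of CIDR blocks that overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]

# Hypothetical address plan: "dev" sits inside the "prod" range.
vpcs = {
    "prod": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "dev": "10.0.128.0/17",
}

print(overlapping_pairs(vpcs))
```

Running a check like this against the documented address plan, including on-premises ranges, helps keep later peering or VPN connections possible.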
REL 3 How does your system adapt to changes in demand?
A scalable system provides elasticity to add and remove resources automatically so that they
closely match the current demand at any given point in time.
Workload scales automatically
Workload is load tested
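One simple way to match resources to demand is a proportional rule: scale the current capacity by the ratio of the observed metric to its target, then clamp the result to fixed bounds. The sketch below is an illustration of that general idea under assumed names and numbers; it is not a description of how any particular AWS scaling policy is implemented.

```python
import math

def desired_capacity(current, metric, target, minimum, maximum):
    """Proportional scaling: size capacity so the per-unit metric returns to target."""
    raw = math.ceil(current * metric / target)
    # Clamp to the configured bounds so the workload neither vanishes nor runs away.
    return max(minimum, min(maximum, raw))

# Hypothetical example: 4 instances, target 60% average CPU, bounds 2..10.
print(desired_capacity(4, metric=90, target=60, minimum=2, maximum=10))   # scale out
print(desired_capacity(4, metric=15, target=60, minimum=2, maximum=10))   # scale in, floor applies
print(desired_capacity(8, metric=120, target=60, minimum=2, maximum=10))  # ceiling applies
```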
REL 4 How do you monitor AWS resources?
Logs and metrics are a powerful tool for gaining insight into the health of your workloads. You
can configure your system to monitor logs and metrics and send notifications when thresholds
are crossed or significant events occur. Ideally, when low-performance thresholds are crossed or
failures occur, the system has been architected to automatically self-heal or scale in response.
Monitoring the workload in all tiers
Notifications are sent based on the monitoring
Automated responses are performed for events
Reviews are conducted regularly
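Threshold-based notification can be sketched as follows: an alarm fires only when the most recent datapoints all exceed the threshold, which avoids alerting on a single spike. The function names, the metric values, and the list used as a notification channel are stand-ins chosen for illustration.

```python
def sustained_breach(datapoints, threshold, periods):
    """True if the most recent `periods` datapoints all exceed `threshold`."""
    if periods < 1 or len(datapoints) < periods:
        return False
    return all(value > threshold for value in datapoints[-periods:])

def evaluate_alarm(name, datapoints, threshold, periods, notify):
    """Call notify(message) when the alarm condition holds; return the alarm state."""
    in_alarm = sustained_breach(datapoints, threshold, periods)
    if in_alarm:
        notify(f"ALARM {name}: last {periods} datapoints above {threshold}")
    return in_alarm

sent = []  # stand-in for a real notification channel
cpu = [41, 55, 83, 91, 88]  # hypothetical CPU utilization percentages
evaluate_alarm("cpu-high", cpu, threshold=80, periods=3, notify=sent.append)
print(sent)
```

The same `notify` hook could trigger an automated response instead of, or in addition to, a message to a person.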
REL 6 How do you back up data?
Back up data, applications, and operating environments (defined as operating systems
configured with applications) to meet requirements for mean time to recovery (MTTR) and
recovery point objectives (RPO).
Data is backed up manually
Data is backed up using automated processes
Periodic recovery of the data is done to verify backup integrity and processes
Backups are secured and encrypted
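A periodic recovery test needs an objective pass or fail signal. One common approach, sketched here, is to record a checksum when the backup is taken and compare it with the checksum of the restored data. The byte strings and function names are hypothetical; real backups would be hashed in chunks from storage rather than held in memory.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of data."""
    return hashlib.sha256(data).hexdigest()

def restore_matches(recorded_digest: str, restored: bytes) -> bool:
    """True if the restored bytes hash to the digest recorded at backup time."""
    return digest(restored) == recorded_digest

source = b"customer-records-v1"  # hypothetical backup payload
recorded = digest(source)        # stored alongside the backup

print(restore_matches(recorded, b"customer-records-v1"))  # intact restore
print(restore_matches(recorded, b"customer-records-v"))   # truncated restore
```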
REL 7 How does your system withstand component failures?
If your workloads have a requirement, implicit or explicit, for high availability and low mean
time to recovery (MTTR), architect your workloads for resiliency and distribute your workloads
to withstand outages.
Monitoring is done at all layers of the workload to detect failures
Deployed to multiple Availability Zones; Multiple AWS Regions if required
Has loosely coupled dependencies
Has implemented graceful degradation
Automated healing implemented on all layers
Notifications are sent upon availability impacting events
REL 8 How do you test resilience?
Test the resilience of your workload to help you find latent bugs that only surface in
production. Exercise these tests regularly.
Use a playbook
Inject failures to test
Schedule game days
Conduct root cause analysis (RCA)
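Failure injection can be practiced at small scale before it is applied to a whole workload. In this sketch a stand-in dependency is made to fail a fixed number of times so that the caller's retry logic can be exercised deterministically. The class, the retry helper, and the choice of `ConnectionError` are illustrative assumptions.

```python
class FlakyDependency:
    """Callable that raises an injected ConnectionError for its first `failures` calls."""

    def __init__(self, failures):
        self.remaining = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("injected failure")
        return "ok"

def call_with_retries(operation, attempts):
    """Run operation up to `attempts` times, re-raising the last ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
    raise last_error

dependency = FlakyDependency(failures=2)
print(call_with_retries(dependency, attempts=3), dependency.calls)
```

If the injected failures outnumber the attempts, the error surfaces to the caller, which is the behavior a game day would then examine.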
REL 9 How do you plan for disaster recovery?
Disaster recovery (DR) is critical should restoration of data be required from backup methods. Your
definition of and execution on the objectives, resources, locations, and functions of this data
must align with RTO and RPO objectives.
Recovery objectives are defined
Recovery strategy is defined
Configuration drift is managed
Test and validate disaster recovery implementation
Recovery is automated
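Validating a disaster recovery implementation means comparing what a test actually achieved with the defined objectives. The sketch below derives the data loss window and the downtime from three timestamps of a drill and checks them against RPO and RTO. All timestamps, objective values, and names are hypothetical.

```python
from datetime import datetime, timedelta

def check_objectives(last_backup, failure, restored, rpo, rto):
    """Compare a recovery drill against recovery point and recovery time objectives."""
    data_loss_window = failure - last_backup  # work done since the last usable backup
    downtime = restored - failure             # time until service was back
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }

drill = check_objectives(
    last_backup=datetime(2024, 3, 1, 2, 0),
    failure=datetime(2024, 3, 1, 9, 30),
    restored=datetime(2024, 3, 1, 11, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(drill["rpo_met"], drill["rto_met"])
```

In this made-up drill the service returned within the RTO, but the backup was too old to satisfy the RPO, which would point to a more frequent backup schedule.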
Performance Efficiency
PERF 1 How do you select the best performing architecture?
Often, multiple approaches are required to get optimal performance across a workload.
Well-architected systems use multiple solutions and enable different features to improve
performance.
Benchmarking
Load test
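Benchmarking competing approaches on representative data gives a measured basis for an architectural choice. The sketch below uses Python's standard `timeit` module to time two candidate implementations of the same lookup; the candidates and the data size are arbitrary examples, and the timings will vary by machine.

```python
import timeit

def benchmark(candidates, repeat=3, number=1000):
    """Return the best-of-`repeat` time in seconds for each named callable."""
    return {
        name: min(timeit.repeat(fn, repeat=repeat, number=number))
        for name, fn in candidates.items()
    }

data = list(range(1000))
lookup = set(data)

# Two hypothetical ways to answer the same membership question.
results = benchmark({
    "list": lambda: 999 in data,
    "set": lambda: 999 in lookup,
})
fastest = min(results, key=results.get)
print(fastest, results)
```

Taking the minimum over several repeats reduces noise from other activity on the machine; a load test applies the same idea to the whole workload under realistic traffic.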
PERF 2 How do you select your compute solution?
The optimal compute solution for a particular system varies based on application design, usage
patterns, and configuration settings. Architectures may use different compute solutions for
various components and enable different features to improve performance. Selecting the wrong
compute solution for an architecture can lead to lower performance efficiency.
Consider options
Consider instance configuration options
Consider container configuration options
Consider function configuration options
Use elasticity
PERF 3 How do you select your storage solution?
The optimal storage solution for a system varies based on the kind of access method (block,
file, or object), patterns of access (random or sequential), throughput required, frequency of
access (online, offline, archival), frequency of update (WORM, dynamic), and availability and
durability constraints. Well-architected systems use multiple storage solutions and enable
different features to improve performance.
Consider characteristics
Consider configuration options
Consider access patterns
PERF 4 How do you select your database solution?
The optimal database solution for a system varies based on requirements for availability,
consistency, partition tolerance, latency, durability, scalability, and query capability. Many
systems use different database solutions for various sub-systems and enable different features
to improve performance. Selecting the wrong database solution and features for a system can
lead to lower performance efficiency.
Consider characteristics
Consider configuration options
Consider access patterns
Consider other approaches
PERF 5 How do you configure your networking solution?
The optimal network solution for a system varies based on latency, throughput requirements,
and so on. Physical constraints such as user or on-premises resources drive location options,
which can be offset using edge techniques or resource placement.
Consider location
Consider service features
Consider networking features
Cost Optimization
COST 1 How do you evaluate cost when you select AWS services?
Amazon EC2, Amazon EBS, and Amazon S3 are building-block AWS services. Managed services,
such as Amazon RDS and Amazon DynamoDB, are higher level, or application level, AWS
services. By selecting the appropriate building blocks and managed services, you can optimize
your architecture for cost. For example, using managed services, you can reduce or remove
much of your administrative and operational overhead, freeing you to work on applications and
business-related activities.
Select services for cost reduction
Optimize for license costs
Optimize using serverless and container-based approach
Optimize using appropriate storage solutions
Optimize using appropriate databases
Optimize using other application-level services
COST 4 How do you plan for data transfer charges?
Ensure that you monitor data transfer charges so that you can make architectural decisions
that might alleviate some of these costs. For example, if you are a content provider and
have been serving content directly from an S3 bucket to your end users, you might be able to
significantly reduce your costs if you push your content to the Amazon CloudFront content
delivery network (CDN). Remember that a small yet effective architectural change can
drastically reduce your operational costs.
Optimize
Use a content delivery network (CDN)
Use AWS Direct Connect
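The effect of an architectural change on data transfer charges can be estimated with simple arithmetic before the change is made. The rates and volume below are placeholders invented for the example, not AWS prices; substitute the figures from your own bill and the current price list.

```python
from decimal import Decimal

def monthly_transfer_cost(gb_out, rate_per_gb):
    """Cost of transferring gb_out gigabytes at a flat per-gigabyte rate."""
    return Decimal(gb_out) * rate_per_gb

# Placeholder inputs, NOT real prices.
volume_gb = 50_000
direct_rate = Decimal("0.09")  # hypothetical rate when serving directly from storage
cdn_rate = Decimal("0.06")     # hypothetical rate when serving through a CDN

direct = monthly_transfer_cost(volume_gb, direct_rate)
via_cdn = monthly_transfer_cost(volume_gb, cdn_rate)
print(direct, via_cdn, direct - via_cdn)
```

A flat rate is a simplification; tiered pricing, request charges, and cache hit ratio would all need to be included in a real comparison.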