Reliability and High Availability
- “Everything fails, all the time.” - Werner Vogels, CTO, Amazon.com
- Plan for failure (application or workload downtime)
- Architect applications to withstand failure
Reliability
- Definition:
  - A measure of a system’s ability to provide functionality when the user needs it
  - The system includes all components: hardware, firmware, and software
  - Expressed as the probability that the entire system functions as intended for a specified period
- Reliability Metrics:
  - Mean Time Between Failures (MTBF) = total time in service / number of failures
  - Mean Time To Failure (MTTF) = average time the system runs before it fails
  - Mean Time To Repair (MTTR) = average time to repair the system after a failure
  - MTBF = MTTF + MTTR
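As a worked illustration of how these metrics relate, the following minimal sketch derives MTTF, MTTR, and MTBF from a small incident log. The function name `reliability_metrics` and the sample figures are hypothetical, chosen only for illustration. The availability ratio MTTF / MTBF is the commonly used steady-state relation for a repairable system; it is an addition here, not something the list above states.

```python
def reliability_metrics(uptime_hours, repair_hours):
    """Return MTTF, MTTR, MTBF, and steady-state availability.

    uptime_hours: hours of operation before each failure (hypothetical data)
    repair_hours: hours spent on repair after each failure (hypothetical data)
    """
    if not uptime_hours or len(uptime_hours) != len(repair_hours):
        raise ValueError("need exactly one repair duration per failure")

    failures = len(uptime_hours)
    mttf = sum(uptime_hours) / failures   # mean time to failure
    mttr = sum(repair_hours) / failures   # mean time to repair
    mtbf = mttf + mttr                    # MTBF = MTTF + MTTR
    availability = mttf / mtbf            # fraction of each failure cycle spent running
    return {"mttf": mttf, "mttr": mttr, "mtbf": mtbf, "availability": availability}


# Three hypothetical failures over an observation window.
metrics = reliability_metrics([700.0, 900.0, 800.0], [2.0, 6.0, 4.0])
print(metrics)
```

With these sample figures the system averages 800 hours of operation and 4 hours of repair per failure, so each failure cycle lasts 804 hours.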
Availability
- Definition:
  - Normal operation time / total time
  - Expressed as a percentage of uptime over a period (typically 1 year)
  - Often quoted as a “number of 9s” (e.g., five 9s = 99.999% availability)
- Reduced by:
  - Scheduled interruptions
  - Unscheduled interruptions
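The definition above can be sketched as a small calculation that subtracts both kinds of interruption from the total period and reports the result as a count of 9s. The names `availability` and `count_nines`, the 365-day year, and the sample downtime figures are assumptions made for illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def availability(scheduled_down_hours, unscheduled_down_hours,
                 total_hours=HOURS_PER_YEAR):
    """Normal operation time divided by total time, as a fraction in [0, 1]."""
    downtime = scheduled_down_hours + unscheduled_down_hours
    if not 0 <= downtime <= total_hours:
        raise ValueError("downtime must be between 0 and total_hours")
    return (total_hours - downtime) / total_hours


def count_nines(avail):
    """Number of leading 9s in an availability fraction, e.g. 0.999 -> 3."""
    if not 0 <= avail <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    if avail == 1:
        return math.inf  # no downtime at all
    # The small epsilon guards against floating-point rounding just below an
    # exact boundary such as 0.999.
    return math.floor(-math.log10(1 - avail) + 1e-9)


# Hypothetical year: 4 hours of planned maintenance, 2 hours of outages.
yearly = availability(scheduled_down_hours=4, unscheduled_down_hours=2)
print(f"{yearly:.5%} available, {count_nines(yearly)} nines")
```

Scheduled and unscheduled interruptions are summed here because both reduce the time the system is available to users, whatever their cause.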
High Availability
- The system can withstand some measure of degradation while remaining available
- Downtime is minimized
- Minimal human intervention required
- Quickly restores essential services when components fail (often < 1 minute)
Availability Tiers
- Availability requirements vary by application type
- Common tiers:
  - 99% = 87.6 hours max disruption/year (batch processing, data extraction)
  - 99.9% = 8.76 hours max disruption/year (internal tools, knowledge management)
  - 99.95% = 4.38 hours max disruption/year (online commerce, point of sale)
  - 99.99% = 52.56 minutes max disruption/year (video delivery, broadcast systems)
  - 99.999% = 5.26 minutes max disruption/year (ATM transactions, telecommunications)
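The disruption figures for each tier follow from multiplying the unavailable fraction by the length of a year. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `max_downtime_minutes` is hypothetical):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def max_downtime_minutes(availability_percent):
    """Maximum yearly disruption, in minutes, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for tier in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = max_downtime_minutes(tier)
    print(f"{tier}% -> {minutes / 60:.2f} hours ({minutes:.2f} minutes)")
```

Each additional 9 cuts the yearly downtime budget by a factor of ten, which is why the higher tiers leave so little room for manual recovery.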
Factors Influencing Availability
- Fault Tolerance:
  - Built-in redundancy of application components
  - Ability to remain operational despite component failures
  - Uses specialized hardware to detect a failure and switch to redundant components
  - Does not address software failures, which are the most common cause of downtime
- Scalability:
  - Ability to accommodate increases in capacity needs without changing the design
  - Contributes to availability but doesn’t guarantee it
- Recoverability:
  - Processes, policies, and procedures for restoring service after catastrophic events
  - Ability to restore service quickly and without data loss
- Cost Consideration:
  - Improving availability usually increases cost
  - Balance the cost of each improvement against its benefit to users
  - Consider whether the goal is keeping the application “always alive” or “servicing requests within acceptable performance levels”
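To make the trade-off between redundancy and cost more concrete, the sketch below estimates the combined availability of several redundant copies of one component. It assumes that the copies fail independently and that failover is instantaneous. Both are simplifying assumptions not stated in the list above, and the function name `redundant_availability` is hypothetical.

```python
def redundant_availability(component_availability, copies):
    """Availability of `copies` redundant components when any one can serve.

    Assumes independent failures: the group is down only when every copy
    is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


# Hypothetical component that is 99% available on its own.
for copies in range(1, 5):
    combined = redundant_availability(0.99, copies)
    print(f"{copies} copies -> {combined:.8f}")
```

In this idealized model each extra copy multiplies the unavailable fraction by 0.01, but it also adds the cost of running another component. Real systems tend to gain less, because failures that affect every copy at once, such as software defects, are not removed by redundancy.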
Reliability and high availability are critical concepts in cloud architecture that ensure systems remain operational despite failures. By focusing on fault tolerance, scalability, and recoverability, architects can design systems that balance performance needs with cost considerations.