Authors :
Taiwo Fadoyin
Volume/Issue :
Volume 11 - 2026, Issue 1 - January
Google Scholar :
https://tinyurl.com/2s3wjrk6
Scribd :
https://tinyurl.com/3spmdspu
DOI :
https://doi.org/10.38124/ijisrt/26jan1075
Abstract :
Cloud-native microservice architectures have transformed modern software systems, but they also introduce
distinctive, recurring reliability challenges arising from extreme distribution, runtime dynamism and rapid change.
Failures in such systems are rarely isolated; instead, they arise from complex interactions between services, platforms
and control policies, which usually lead to cascading and metastable behaviours that conventional fault tolerance
approaches fail to capture. Despite the volume of literature on resilience patterns, existing work remains fragmented,
pattern-centric and weakly connected to observed failure models. This paper presented a systematic literature review and
synthesis of fault tolerance mechanisms for cloud-native microservices, drawing on recent peer-reviewed research and
authoritative industry practice. Using a structured review methodology, the study constructed a multi-dimensional
taxonomy that classified mechanisms by architectural layer, mechanism family, fault-handling phase and runtime control
characteristics. A comparative matrix evaluated key mechanisms against operational criteria including latency overhead,
scalability impact, complexity and risk of failure amplification. Building on this analysis, the paper mapped mechanisms
to common cloud-native fault models and derived practitioner-oriented decision guidance. The results indicated that
resilience in cloud-native systems is dominated not by redundancy alone but by effective containment, observability and
context-aware control. Misconfigured retries and static policies consistently amplify failures, while adaptive and
observability-driven approaches remain under-explored. The paper concluded by identifying concrete research gaps and
testable hypotheses, providing both actionable design guidance and a foundation for future resilience engineering research.
Keywords :
Cloud-Native, Microservices, Fault Tolerance, Resilience Engineering, Kubernetes, Service Mesh, Chaos Engineering, Reliability.