A Comprehensive Taxonomy and Comparative Analysis of Fault Tolerance Mechanisms in Cloud-Native Microservice Architectures


Authors : Taiwo Fadoyin

Volume/Issue : Volume 11 - 2026, Issue 1 - January


Google Scholar : https://tinyurl.com/2s3wjrk6

Scribd : https://tinyurl.com/3spmdspu

DOI : https://doi.org/10.38124/ijisrt/26jan1075



Abstract : Cloud-native microservice architectures have transformed modern software systems, but they also introduce distinctive and pervasive reliability challenges arising from extreme distribution, runtime dynamism and rapid change. Failures in such systems are rarely isolated; instead, they emerge from complex interactions between services, platforms and control policies, which often lead to cascading and metastable behaviours that conventional fault tolerance approaches fail to capture. Despite the volume of literature on resilience patterns, existing work remains fragmented, pattern-centric and weakly connected to observed failure models. This paper presents a systematic literature review and synthesis of fault tolerance mechanisms for cloud-native microservices, drawing on recent peer-reviewed research and authoritative industry practice. Using a structured review methodology, the study constructs a multi-dimensional taxonomy that classifies mechanisms by architectural layer, mechanism family, fault-handling phase and runtime control characteristics. A comparative matrix evaluates key mechanisms against operational criteria including latency overhead, scalability impact, complexity and risk of failure amplification. Building on this analysis, the paper maps mechanisms to common cloud-native fault models and derives practitioner-oriented decision guidance. The results indicate that resilience in cloud-native systems is dominated not by redundancy alone but by effective containment, observability and context-aware control. Misconfigured retries and static policies consistently amplify failures, while adaptive and observability-driven approaches remain under-explored. The paper concludes by identifying concrete research gaps and testable hypotheses, providing both actionable design guidance and a foundation for future resilience engineering research.

Keywords : Cloud-Native, Microservices, Fault Tolerance, Resilience Engineering, Kubernetes, Service Mesh, Chaos Engineering, Reliability.
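
Illustrative note : The abstract's claim that misconfigured retries amplify failures can be made concrete with a minimal sketch. It is not taken from the paper: the function names, parameters and default values below are assumptions chosen for illustration only. The first function computes the worst-case number of attempts reaching the deepest service when every layer in a call chain retries independently; the second shows one commonly used mitigation, capped exponential backoff with full jitter.

```python
import random
from typing import List, Optional


def worst_case_amplification(retries_per_layer: int, depth: int) -> int:
    """Upper bound on attempts hitting the deepest service for one user request.

    Assumes (hypothetically) a linear call chain of `depth` layers in which
    every layer makes 1 initial attempt plus `retries_per_layer` retries.
    """
    return (1 + retries_per_layer) ** depth


def backoff_delays(
    max_retries: int,
    base: float = 0.1,   # assumed base delay in seconds
    cap: float = 2.0,    # assumed upper bound on any single delay
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Capped exponential backoff with full jitter.

    Delay k is drawn uniformly from [0, min(cap, base * 2**k)], which
    spreads retries out in time instead of synchronising them.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** k)) for k in range(max_retries)]


if __name__ == "__main__":
    # Three layers, each retrying three times: up to 4**3 = 64 attempts.
    print(worst_case_amplification(retries_per_layer=3, depth=3))
    print(backoff_delays(max_retries=5))
```

Under these assumptions, retrying at only one layer keeps the bound linear in the retry count, whereas retrying at every layer makes it grow exponentially with call depth, which is one way a static retry policy can turn a partial outage into a cascading one.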


