A Reference Architecture for Scalable, Reliable, and GPU-Optimized AI Model Execution on Kubernetes


Authors : Shekar Rao Lakavath

Volume/Issue : Volume 11 - 2026, Issue 5 - May


Google Scholar : https://tinyurl.com/57asnrwh

Scribd : https://tinyurl.com/ywymm2n9

DOI : https://doi.org/10.38124/ijisrt/26may1495



Abstract : AI inference workloads stress infrastructure through bursty demand, long model initialization times, heterogeneous GPU requirements, and strict latency constraints. Kubernetes provides a programmable control plane for deployment, scaling, scheduling, and recovery, making it a strong substrate for production inference. This paper introduces KARB (Kubernetes AI Reliability Blueprint), a reference architecture and decision framework that connects service-level objectives (SLOs) to GPU-aware placement, multi-layer autoscaling, and governance controls. We explain GPU scheduling via device plugins, node labels, taints/tolerations, and optional GPU sharing strategies (MIG or time-slicing). We also present a unified scaling approach combining Horizontal Pod Autoscaler (HPA), KEDA event-driven scaling, and node autoscaling to control latency while minimizing GPU cost. Finally, we outline Kubernetes-native MLOps integration using Kubeflow Pipelines and KServe and provide reproducible templates and checklists suitable for enterprise platform engineering teams.

Keywords : Kubernetes, AI Inference, GPU Scheduling, Autoscaling, KEDA, HPA, Node Autoscaling, Kubeflow, KServe, MLOps, SRE, Platform Engineering, and Reliability.
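Two of the mechanisms named in the abstract can be illustrated with a short sketch. The first function builds a Pod spec fragment that combines an extended GPU resource request, a node label selector, and a toleration for tainted GPU nodes. The second applies the replica formula from the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), with the default 10% tolerance band. This is not taken from the paper: the function names, the `accelerator` label key and value, the taint key, and the sample image are assumptions chosen for illustration, and a real cluster would use whatever labels and taints its operators have defined.

```python
import math


def gpu_pod_spec(image, gpu_count=1, node_label=("accelerator", "nvidia-a100")):
    """Build a Pod spec fragment (as a dict) for GPU-aware placement.

    The label key/value and the taint key below are hypothetical examples;
    clusters define their own conventions.
    """
    label_key, label_value = node_label
    return {
        # Steer the Pod to nodes carrying the chosen GPU label.
        "nodeSelector": {label_key: label_value},
        # Allow scheduling onto GPU nodes that are tainted to repel other workloads.
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "containers": [
            {
                "name": "inference",
                "image": image,
                # GPUs exposed by a device plugin are requested as an
                # extended resource under limits.
                "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            }
        ],
    }


def hpa_desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count following the documented HPA formula.

    If the metric ratio is within the tolerance band around 1.0,
    the current replica count is kept.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    spec = gpu_pod_spec("example.invalid/model-server:latest")  # hypothetical image
    print(spec["containers"][0]["resources"])
    print(hpa_desired_replicas(4, current_metric=200, target_metric=100))
```

In a layered setup like the one the abstract describes, a replica count computed this way can leave Pods pending when no GPU node has capacity, and that is the signal a node autoscaler reacts to. The sketch covers only the Pod-level arithmetic, not that interaction.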


