Authors :
Shekar Rao Lakavath
Volume/Issue :
Volume 11 - 2026, Issue 5 - May
Google Scholar :
https://tinyurl.com/57asnrwh
Scribd :
https://tinyurl.com/ywymm2n9
DOI :
https://doi.org/10.38124/ijisrt/26may1495
Abstract :
AI inference workloads stress infrastructure through bursty demand, long model initialization times, heterogeneous GPU requirements, and strict latency constraints. Kubernetes provides a programmable control plane for deployment, scaling, scheduling, and recovery, making it a strong substrate for production inference. This paper introduces KARB (Kubernetes AI Reliability Blueprint), a reference architecture and decision framework that connects service-level objectives (SLOs) to GPU-aware placement, multi-layer autoscaling, and governance controls. We explain GPU scheduling via device plugins, node labels, taints/tolerations, and optional GPU sharing strategies (MIG or time-slicing). We also present a unified scaling approach combining Horizontal Pod Autoscaler (HPA), KEDA event-driven scaling, and node autoscaling to control latency while minimizing GPU cost. Finally, we outline Kubernetes-native MLOps integration using Kubeflow Pipelines and KServe and provide reproducible templates and checklists suitable for enterprise platform engineering teams.
Keywords :
Kubernetes, AI Inference, GPU Scheduling, Autoscaling, KEDA, HPA, Node Autoscaling, Kubeflow, KServe, MLOps, SRE, Platform Engineering, and Reliability.
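As a rough illustration of two mechanisms the abstract names, the sketch below builds a Pod manifest that requests a GPU through the device-plugin resource `nvidia.com/gpu`, pins it to labeled nodes, and tolerates a GPU taint; it also computes a replica count using the formula given in the Kubernetes HPA documentation. It is not taken from the paper's KARB templates. The node label `example.com/gpu-pool`, the taint key, the container name, the image name, and both function names are hypothetical choices made for illustration only.

```python
import math
from typing import Any, Dict


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> Dict[str, Any]:
    """Build a Pod manifest (as a dict) for a GPU-backed inference container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical node label; real clusters define their own.
            "nodeSelector": {"example.com/gpu-pool": "inference"},
            # Assumes GPU nodes carry a NoSchedule taint with this key.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "model-server",
                    "image": image,
                    # Extended resources such as GPUs are requested via limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count per the documented HPA formula, clamped to [min, max]."""
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    pod = gpu_pod_manifest("llm-inference", "registry.example.com/model-server:latest")
    print(pod["spec"]["containers"][0]["resources"])
    print(hpa_desired_replicas(4, current_metric=200.0, target_metric=100.0))
```

The dict can be serialized to YAML or JSON and applied with standard tooling. The replica function covers only the core ratio rule; the real controller also accounts for pod readiness, missing metrics, and stabilization windows.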
References :
- Kubernetes Documentation — Horizontal Pod Autoscaling. https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
- Kubernetes Documentation — Autoscaling Workloads (overview). https://kubernetes.io/docs/concepts/workloads/autoscaling/
- KEDA Documentation — Kubernetes Event-driven Autoscaling. https://keda.sh/
- Microsoft Learn — KEDA add-on for Azure Kubernetes Service. https://learn.microsoft.com/en-us/azure/aks/keda-about
- Kubernetes Documentation — Node Autoscaling. https://kubernetes.io/docs/concepts/cluster-administration/node-autoscaling/
- Kubernetes Documentation — Schedule GPUs (device plugins). https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
- Kubernetes Documentation — Taints and Tolerations. https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
- NVIDIA Docs — NVIDIA GPU Operator (overview). https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
- NVIDIA Docs — GPU sharing and time-slicing. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html
- NVIDIA Docs — MIG Support in Kubernetes. https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html
- KServe — Inference platform for Kubernetes. https://kserve.github.io/website/
- Kubeflow Documentation — KServe Introduction. https://www.kubeflow.org/docs/components/kserve/introduction/
- Kubeflow Documentation — Pipelines concept. https://www.kubeflow.org/docs/components/pipelines/concepts/pipeline/
- Kubernetes Documentation — Network Policies. https://kubernetes.io/docs/concepts/services-networking/network-policies/
- Kubernetes Documentation — RBAC Authorization. https://kubernetes.io/docs/reference/access-authn-authz/rbac/
- Kubernetes Documentation — Pod Security Standards. https://kubernetes.io/docs/concepts/security/pod-security-standards/