International Journal of Innovative Science and Research Technology

A Reference Architecture for Scalable, Reliable, and GPU-Optimized AI Model Execution on Kubernetes

Authors : Shekar Rao Lakavath

Volume/Issue : Volume 11 - 2026, Issue 5 - May

Google Scholar : https://tinyurl.com/57asnrwh

Scribd : https://tinyurl.com/ywymm2n9

DOI : https://doi.org/10.38124/ijisrt/26may1495

Abstract : AI inference workloads stress infrastructure through bursty demand, long model initialization times, heterogeneous GPU requirements, and strict latency constraints. Kubernetes provides a programmable control plane for deployment, scaling, scheduling, and recovery, making it a strong substrate for production inference. This paper introduces KARB (Kubernetes AI Reliability Blueprint), a reference architecture and decision framework that connects service-level objectives (SLOs) to GPU-aware placement, multi-layer autoscaling, and governance controls. We explain GPU scheduling via device plugins, node labels, taints/tolerations, and optional GPU sharing strategies (MIG or time-slicing). We also present a unified scaling approach combining Horizontal Pod Autoscaler (HPA), KEDA event-driven scaling, and node autoscaling to control latency while minimizing GPU cost. Finally, we outline Kubernetes-native MLOps integration using Kubeflow Pipelines and KServe and provide reproducible templates and checklists suitable for enterprise platform engineering teams.

Keywords : Kubernetes, AI Inference, GPU Scheduling, Autoscaling, KEDA, HPA, Node Autoscaling, Kubeflow, KServe, MLOps, SRE, Platform Engineering, Reliability.
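
To make the mechanisms named in the abstract concrete, the following is a minimal Python sketch, not one of the paper's KARB templates. It shows two things: a Pod manifest that combines a device-plugin GPU request with a node label and a toleration, and the replica calculation documented for the Horizontal Pod Autoscaler. The label key `example.com/gpu-pool`, the taint key `example.com/gpu-only`, the container name, and both function names are hypothetical, chosen only for illustration. The resource name `nvidia.com/gpu` assumes the NVIDIA device plugin is installed. The replica function ignores min/max replica bounds and stabilization windows.

```python
import math


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest (as a plain dict) that asks for NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical node label that steers the pod onto GPU nodes.
            "nodeSelector": {"example.com/gpu-pool": "inference"},
            # Hypothetical taint key; tolerating it lets the pod land on
            # nodes that repel workloads without this toleration.
            "tolerations": [
                {
                    "key": "example.com/gpu-only",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": "model-server",  # hypothetical container name
                    "image": image,
                    # Device-plugin resources such as GPUs are requested
                    # under `limits`.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1):
    """Replica count from the documented HPA formula.

    desired = ceil(current_replicas * current_metric / target_metric),
    with no change when the ratio is within `tolerance` of 1.0.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, `hpa_desired_replicas(4, 200, 100)` returns 8, because each of the four replicas carries twice the target load, while `hpa_desired_replicas(4, 105, 100)` returns 4 because the ratio falls inside the default tolerance. The manifest dict can be serialized to YAML or JSON and applied with the usual tooling.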
