GPU Cost Efficiency in Kubernetes: Selection, Sharing, and Savings Strategies

By right-sizing GPU instances, enabling autoscaling, and leveraging sharing strategies like time slicing or MIGs, teams can significantly reduce GPU costs in Kubernetes.

By Emily Dunenfeld

Engineering teams run inference workloads on Kubernetes clusters for easier scaling and job management. But GPUs are among the most expensive resources you can run in the cloud, and once they are set up in Kubernetes their cost is easy to forget about. Most teams have neither visibility into their GPU workloads nor cost-savings measures in place, so they often overpay without knowing it.

Kubernetes GPU Cost Benefits

Maintaining GPUs with Kubernetes is easier than provisioning standalone GPU VMs and managing them manually, especially when workloads vary throughout the day or week. It also provides cost visibility and optimization benefits.

For one, you can configure your cluster to run multiple jobs on the same GPU. Instead of dedicating an entire GPU to a single workload, Kubernetes can help you maximize usage across jobs—saving you both capacity and money.

It also gives you deeper visibility into GPU usage. With the right tools in place, you can attribute GPU memory consumption to specific jobs, users, or namespaces. Combined with a scheduler that supports autoscaling, Kubernetes can dynamically consolidate workloads and scale GPU nodes up or down as needed.

In contrast, running standalone GPU VMs often leads to over-provisioning and idle capacity, without the insight or flexibility to fix it.

Kubernetes GPU Pricing for AWS, Azure, and GCP

Each of the three main cloud providers has its own GPU offerings for Kubernetes.

Cloud providers differ in GPU model availability, pricing, and regional coverage. However, common challenges such as idle usage, memory inefficiency, and lack of observability persist across platforms.

Cloud pricing is typically per instance-hour, not based on actual GPU usage. You're billed for the full GPU allocation, regardless of utilization. Billing dimensions usually include: GPU instance size and family, number of GPUs per node, total node uptime, and any attached storage or networking resources.

All pricing below is listed for the US East region.

AWS EKS GPUs

Charges include the underlying EC2 instance, a version support fee of $0.10 per cluster per hour, and additional per-second Auto Mode charges when enabled.
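For example, a cluster with a single g6.xlarge node running around the clock costs roughly ($0.80 + $0.10) × 730 hours ≈ $657 per month, before storage and networking. Note that the $0.10 fee is per cluster, not per node, so it amortizes as you add nodes.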

| EC2 GPU Instance | On-Demand Hourly Cost | vCPU | Memory (GiB) | GPU Memory (GiB) |
|---|---|---|---|---|
| g4dn.2xlarge | $0.75 | 8 | 32 | 16 |
| g4dn.4xlarge | $1.20 | 16 | 64 | 16 |
| g4dn.8xlarge | $2.18 | 32 | 128 | 16 |
| g4dn.12xlarge | $3.91 | 48 | 192 | 64 |
| g4dn.16xlarge | $4.35 | 64 | 256 | 16 |
| g4dn.metal | $7.82 | 96 | 384 | 128 |
| g4dn.xlarge | $0.53 | 4 | 16 | 16 |
| g5.2xlarge | $1.21 | 8 | 32 | 24 |
| g5.4xlarge | $1.62 | 16 | 64 | 24 |
| g5.8xlarge | $2.45 | 32 | 128 | 24 |
| g5.12xlarge | $5.67 | 48 | 192 | 96 |
| g5.16xlarge | $4.10 | 64 | 256 | 24 |
| g5.24xlarge | $8.14 | 96 | 384 | 96 |
| g5.48xlarge | $16.29 | 192 | 768 | 192 |
| g5.xlarge | $1.01 | 4 | 16 | 24 |
| g5g.2xlarge | $0.56 | 8 | 16 | 16 |
| g5g.4xlarge | $0.83 | 16 | 32 | 16 |
| g5g.8xlarge | $1.37 | 32 | 64 | 16 |
| g5g.16xlarge | $2.74 | 64 | 128 | 32 |
| g5g.metal | $2.74 | 64 | 128 | 32 |
| g5g.xlarge | $0.42 | 4 | 8 | 16 |
| g6.2xlarge | $0.98 | 8 | 32 | 24 |
| g6.4xlarge | $1.32 | 16 | 64 | 24 |
| g6.8xlarge | $2.01 | 32 | 128 | 24 |
| g6.12xlarge | $4.60 | 48 | 192 | 96 |
| g6.16xlarge | $3.40 | 64 | 256 | 24 |
| g6.24xlarge | $6.68 | 96 | 384 | 96 |
| g6.48xlarge | $13.35 | 192 | 768 | 192 |
| g6.xlarge | $0.80 | 4 | 16 | 24 |
| g6e.2xlarge | $2.24 | 8 | 64 | 48 |
| g6e.4xlarge | $3.00 | 16 | 128 | 48 |
| g6e.8xlarge | $4.53 | 32 | 256 | 48 |
| g6e.12xlarge | $10.49 | 48 | 384 | 192 |
| g6e.16xlarge | $7.58 | 64 | 512 | 48 |
| g6e.24xlarge | $15.07 | 96 | 768 | 192 |
| g6e.48xlarge | $30.13 | 192 | 1,536 | 384 |
| g6e.xlarge | $1.86 | 4 | 32 | 48 |
| gr6.4xlarge | $1.54 | 16 | 128 | 24 |
| gr6.8xlarge | $2.45 | 32 | 256 | 24 |
| p4d.24xlarge | $32.77 | 96 | 1,152 | 320 |
| p5.48xlarge | $98.32 | 192 | 2,048 (2 TiB) | 640 (HBM3) |

AWS EKS current generation EC2 GPU instances.
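For reference, pods consume these GPUs through the nvidia.com/gpu extended resource, exposed by the NVIDIA device plugin. A minimal sketch, with an illustrative instance type from the table above and a hypothetical image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g6.xlarge  # illustrative choice from the table above
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1  # one whole GPU; the node is billed hourly regardless of utilization
```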

Azure AKS GPUs

Azure charges for the underlying VM as well as a per-cluster-hour fee: $0.10 per cluster per hour for the Standard tier and $0.60 per cluster per hour for the Premium tier.

| Azure GPU VM | On-Demand Hourly Cost | CPU | Memory (GB) |
|---|---|---|---|
| NC40ads H100 v5 | $6.98 | 40 | 320 |
| NC80adis H100 v5 | $13.96 | 80 | 640 |
| NCC40ads H100 v5 | $6.98 | 40 | 320 |
| NC6s v3 | $3.06 | 6 | 112 |
| NC12s v3 | $6.12 | 12 | 224 |
| NC24s v3 | $12.24 | 24 | 448 |
| NC24rs v3 | $13.46 | 24 | 448 |
| NC4as T4 v3 | $0.53 | 4 | 28 |
| NC8as T4 v3 | $0.75 | 8 | 56 |
| NC16as T4 v3 | $1.20 | 16 | 110 |
| NC64as T4 v3 | $4.35 | 64 | 440 |
| NC24ads A100 v4 | $3.67 | 24 | 220 |
| NC48ads A100 v4 | $7.35 | 48 | 440 |
| NC96ads A100 v4 | $14.69 | 96 | 880 |
| ND96isr MI300X v5 | $48.00 | 96 | 1,850 |
| ND96isr H100 v5 | $98.32 | 96 | 1,900 |
| ND96amsr A100 v4 | $32.77 | 96 | 1,900 |
| ND96asr A100 v4 | $27.20 | 96 | 900 |
| NG8ads V620 v1 | $0.64 | 8 | 16 |
| NG16ads V620 v1 | $1.27 | 16 | 32 |
| NG32ads V620 v1 | $2.54 | 32 | 64 |
| NG32adms V620 v1 | $3.30 | 32 | 176 |
| NV12s v3 | $1.14 | 12 | 112 |
| NV24s v3 | $2.28 | 24 | 224 |
| NV48s v3 | $4.56 | 48 | 448 |
| NV4as v4 | Currently Unavailable | 4 | 14 |
| NV8as v4 | Currently Unavailable | 8 | 28 |
| NV16as v4 | Currently Unavailable | 16 | 56 |
| NV32as v4 | Currently Unavailable | 32 | 112 |
| NV6ads A10 v5 | $0.45 | 6 | 55 |
| NV12ads A10 v5 | $0.91 | 12 | 110 |
| NV18ads A10 v5 | $1.60 | 18 | 220 |
| NV36ads A10 v5 | $3.20 | 36 | 440 |
| NV36adms A10 v5 | $4.52 | 36 | 880 |
| NV72ads A10 v5 | $6.52 | 72 | 880 |

Azure AKS current generation VM GPU instances.

GCP GKE GPUs

GCP charges for the underlying Compute Engine instances and a GKE cluster management fee: $0.10 per cluster per hour for the Standard edition and $0.00822 per vCPU per hour for the Enterprise edition.

You can choose from the following GPU VM types.

| VM | GPU Memory | On-Demand Hourly Cost |
|---|---|---|
| a4-highgpu-8g | 8 GB HBM3e | Currently Unavailable |
| a3-ultragpu-8g | 1,128 GB HBM3e | $84.80690849 |
| a2-ultragpu-1g | 80 GB HBM3 | $5.06879789 |
| a2-ultragpu-2g | 160 GB HBM3 | $10.13759578 |
| a2-ultragpu-4g | 320 GB HBM3 | $20.27519156 |
| a2-ultragpu-8g | 640 GB HBM3 | $40.55038312 |
| g2-standard-4 | 24 GB GDDR6 | $0.70683228 |
| g2-standard-8 | 24 GB GDDR6 | $0.85362431 |
| g2-standard-12 | 24 GB GDDR6 | $1.00041635 |
| g2-standard-16 | 24 GB GDDR6 | $1.14720838 |
| g2-standard-24 | 48 GB GDDR6 | $2.0008327 |
| g2-standard-32 | 24 GB GDDR6 | $1.73437653 |
| g2-standard-48 | 96 GB GDDR6 | $4.00166539 |
| g2-standard-96 | 192 GB GDDR6 | $8.00333078 |

GCP GKE current generation VM GPU instances.

Alternatively, you can attach GPUs to N1 VMs manually, as listed below (the cost of the N1 instance also applies, billed at $0.03 per hour).

| GPU Attachment | GPU Memory | On-Demand Hourly Cost |
|---|---|---|
| NVIDIA T4 Virtual Workstation - 1 GPU | 16 GB GDDR6 | $0.55 |
| NVIDIA T4 Virtual Workstation - 2 GPUs | 32 GB GDDR6 | $1.10 |
| NVIDIA T4 Virtual Workstation - 4 GPUs | 64 GB GDDR6 | $2.20 |
| NVIDIA P4 Virtual Workstation - 1 GPU | 8 GB GDDR5 | $0.80 |
| NVIDIA P4 Virtual Workstation - 2 GPUs | 16 GB GDDR5 | $1.60 |
| NVIDIA P4 Virtual Workstation - 4 GPUs | 32 GB GDDR5 | $3.20 |
| NVIDIA P100 Virtual Workstation - 1 GPU | 16 GB HBM2 | $1.66 |
| NVIDIA P100 Virtual Workstation - 2 GPUs | 32 GB HBM2 | $3.32 |
| NVIDIA P100 Virtual Workstation - 4 GPUs | 64 GB HBM2 | $6.64 |

GCP GKE current generation N1 GPU attachments.

Optimizing GPU Costs in Kubernetes: Right-Sizing, Autoscaling, and Sharing Strategies

As the tables show, GPU instances are inherently expensive by the hour, but selecting the right size and running workloads efficiently can make a big difference.

Select the Right GPU Instance and Right-Size Workloads

Most GPU waste comes from choosing an instance that’s bigger than necessary.

A big reason is that GPU processing time is hard to measure accurately: utilization metrics can be noisy and inconsistent. Memory usage, on the other hand, is more stable and measurable, and it is often the best proxy for cost efficiency.

You can calculate idle memory by comparing the memory a workload actually uses to the memory allocated to it per GPU:

idle_memory = total_allocated_memory - used_memory

Then determine the number of GPUs you actually need; this gives a much better idea of which instance type is the best fit.
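For example, a workload that uses 9 GiB of a 24 GiB GPU leaves 15 GiB, over 60% of the memory you are paying for, idle. Per the AWS table above, that workload would fit on a 16 GiB g4dn.xlarge at $0.53 per hour instead of a 24 GiB g5.xlarge at $1.01.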

To do this automatically for certain NVIDIA workloads, the Vantage Kubernetes agent integrates with NVIDIA DCGM and automatically calculates GPU idle costs by attributing GPU memory usage per workload. This provides a granular view of how memory is consumed, helping you avoid over-provisioning and improve cost efficiency.

Autoscaling

Kubernetes autoscalers can dynamically consolidate workloads and scale GPU nodes up or down as needed. This ensures you're only paying for the resources you actually need. Proper autoscaling configuration requires careful tuning of thresholds and scaling policies to avoid oscillation while still responding promptly to changing demands.
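What this looks like in practice depends on the autoscaler you run. As a minimal sketch using Karpenter on EKS (one option among several; AKS and GKE have their own node autoscalers), a NodePool can pin GPU nodes to a right-sized family and consolidate underutilized nodes:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default        # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6"]     # restrict GPU nodes to one right-sized family
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m     # drain and remove GPU nodes that sit idle for 5 minutes
```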

Commit and Save

While On-Demand pricing gives flexibility, committing to compute usage via Savings Plans can significantly reduce the hourly instance costs for long-running GPU workloads.

For example, the g6.48xlarge Savings Plan rate is $8.69 per hour compared to the $13.35 On-Demand rate, 35% less.

GPU Sharing

Running multiple workloads on a single GPU is one of the most effective ways to reduce cost, especially for smaller models or batch inference jobs that don't fully utilize GPU resources. A few strategies are available, and all three major cloud providers list them as supported options.

Time Slicing/Sharing

Time slicing provides a simple approach to GPU sharing by allocating GPU time slices to different containers on a time-rotated basis. This works well for workloads that don’t require constant GPU access but still benefit from acceleration.
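With the NVIDIA GPU Operator (or the standalone device plugin), time slicing is enabled through a sharing config. A minimal sketch that advertises each physical GPU as four schedulable nvidia.com/gpu resources:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator   # assumes the GPU Operator's namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4     # each physical GPU is advertised as 4 slices
```

Keep in mind that time slicing provides no memory isolation between the pods sharing a GPU, so it is best suited to trusted workloads with modest memory footprints.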


NVIDIA MPS

NVIDIA CUDA Multi-Process Service (MPS) allows multiple processes to share the same GPU concurrently. In Kubernetes, this is supported through fractional GPU memory requests: instead of requesting a full GPU, you request only the memory your workload needs. For example, you can specify a resource request like nvidia.com/gpu-memory: 4Gi to request a subset of memory on a shared GPU.
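Following that example, the request sits in a container's resource limits, as sketched below. Note that nvidia.com/gpu-memory is not a built-in Kubernetes resource name; the exact name depends on the device plugin or scheduler you deploy:

```yaml
resources:
  limits:
    nvidia.com/gpu-memory: 4Gi  # fractional request; resource name varies by device plugin
```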

Multi-Instance GPUs

Multi-Instance GPU (MIG) is available on certain NVIDIA GPUs, such as the A100 Tensor Core GPU. It partitions a GPU at the hardware level into multiple isolated instances, each with its own memory, cache, and compute cores, which provides better isolation than software-based sharing methods.
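With NVIDIA's device plugin in its mixed MIG strategy, each MIG profile is exposed as its own resource, so a pod can request a single slice. For example:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1  # one hardware-isolated slice: 1 compute unit, 5 GB memory
```

An A100 40 GB supports up to seven 1g.5gb instances, so seven small jobs can share one GPU with hardware-level isolation.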

Conclusion

By right-sizing GPU instances, enabling autoscaling, and leveraging sharing strategies like time slicing or MIGs, teams can significantly reduce GPU costs in Kubernetes. These optimizations not only lower spend but also improve resource efficiency.
