<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kubernetes Blog</title>
    <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/</link>
    <description>The Kubernetes blog is used by the project to communicate new features, community reports, and any news that might be relevant to the Kubernetes community.</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <image>
      <url>https://raw.githubusercontent.com/kubernetes/kubernetes/master/logo/logo.png</url>
      <title>The Kubernetes project logo</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/</link>
    </image>
    
    <atom:link href="https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/feed.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/09/01/kubernetes-v1-34-volume-attributes-class/</link>
      <pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/09/01/kubernetes-v1-34-volume-attributes-class/</guid>
      <description>
        
        
        &lt;p&gt;The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;what-is-volumeattributesclass&#34;&gt;What is VolumeAttributesClass?&lt;/h2&gt;
&lt;p&gt;At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a &amp;quot;profile&amp;quot; for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.&lt;/p&gt;
&lt;p&gt;Users can then specify a &lt;code&gt;volumeAttributesClassName&lt;/code&gt; in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.&lt;/p&gt;
&lt;p&gt;This means you can now:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.&lt;/li&gt;
&lt;li&gt;Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.&lt;/li&gt;
&lt;li&gt;Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.&lt;/li&gt;
&lt;/ul&gt;
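&lt;p&gt;As an illustrative sketch, an administrator defines a class and a user references it from a PVC. The class name, driver name, and parameters below are hypothetical and entirely driver-specific:&lt;/p&gt;

```yaml
# Hypothetical example: the driverName and parameters depend on your CSI driver.
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: example.csi.vendor.com
parameters:
  iops: "8000"
  throughput: "500"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: gold   # change this to modify the volume in place
```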
&lt;h2 id=&#34;what-is-new-from-beta-to-ga&#34;&gt;What is new from Beta to GA&lt;/h2&gt;
&lt;p&gt;There are two major enhancements since the beta release.&lt;/p&gt;
&lt;h3 id=&#34;cancel-support-from-infeasible-errors&#34;&gt;Cancel support for infeasible errors&lt;/h3&gt;
&lt;p&gt;To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification becomes infeasible. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.&lt;/p&gt;
&lt;h3 id=&#34;quota-support-based-on-scope&#34;&gt;Quota support based on scope&lt;/h3&gt;
&lt;p&gt;While VolumeAttributesClass doesn&#39;t add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.&lt;/p&gt;
&lt;p&gt;This is achieved by using the &lt;code&gt;scopeSelector&lt;/code&gt; field in a ResourceQuota to target PVCs that have &lt;code&gt;.spec.volumeAttributesClassName&lt;/code&gt; set to a particular VolumeAttributesClass name. See the &lt;a href=&#34;https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-volumeattributesclass&#34;&gt;resource quota documentation&lt;/a&gt; for more details.&lt;/p&gt;
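&lt;p&gt;As a sketch, a ResourceQuota scoped to PVCs that reference a particular class (here, a hypothetical class named &#39;gold&#39; in a hypothetical namespace) might look like this:&lt;/p&gt;

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: vac-gold-quota
  namespace: team-a            # hypothetical namespace
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "10"
  scopeSelector:
    matchExpressions:
    - scopeName: VolumeAttributesClass
      operator: In
      values: ["gold"]         # hypothetical VolumeAttributesClass name
```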
&lt;h2 id=&#34;drivers-support-volumeattributesclass&#34;&gt;Drivers supporting VolumeAttributesClass&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.&lt;/li&gt;
&lt;li&gt;Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;contact&#34;&gt;Contact&lt;/h2&gt;
&lt;p&gt;For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;SIG Storage community&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Tuning Linux Swap for Kubernetes: A Deep Dive</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/19/tuning-linux-swap-for-kubernetes-a-deep-dive/</link>
      <pubDate>Tue, 19 Aug 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/19/tuning-linux-swap-for-kubernetes-a-deep-dive/</guid>
      <description>
        
        
&lt;p&gt;The Kubernetes &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/swap-memory-management/&#34;&gt;NodeSwap feature&lt;/a&gt;, likely to graduate to &lt;em&gt;stable&lt;/em&gt; in the upcoming Kubernetes v1.34 release, allows nodes to use swap: a significant shift from the conventional practice of disabling swap for the sake of performance predictability. This article focuses exclusively on tuning swap on Linux nodes, where this feature is available. By allowing Linux nodes to use secondary storage for additional virtual memory when physical RAM is exhausted, node swap support aims to improve resource utilization and reduce out-of-memory (OOM) kills.&lt;/p&gt;
&lt;p&gt;However, enabling swap is not a &amp;quot;turn-key&amp;quot; solution. The performance and stability of your nodes under memory pressure are critically dependent on a set of Linux kernel parameters. Misconfiguration can lead to performance degradation and interfere with Kubelet&#39;s eviction logic.&lt;/p&gt;
&lt;p&gt;In this blogpost, I&#39;ll dive into critical Linux kernel parameters that govern swap behavior. I will explore how these parameters influence Kubernetes workload performance, swap utilization, and crucial eviction mechanisms.
I will present various test results showcasing the impact of different configurations, and share my findings on achieving optimal settings for stable and high-performing Kubernetes clusters.&lt;/p&gt;
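&lt;p&gt;For context, swap support is enabled through the kubelet configuration; a minimal sketch is shown below (verify the exact fields against the NodeSwap documentation for your Kubernetes version):&lt;/p&gt;

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false            # allow the kubelet to run on a node with swap enabled
memorySwap:
  swapBehavior: LimitedSwap  # Burstable pods may use a limited amount of swap
```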
&lt;h2 id=&#34;introduction-to-linux-swap&#34;&gt;Introduction to Linux swap&lt;/h2&gt;
&lt;p&gt;At a high level, the Linux kernel manages memory through pages, typically 4KiB in size. When physical memory becomes constrained, the kernel&#39;s page replacement algorithm decides which pages to move to swap space. While the exact logic is a sophisticated optimization, this decision-making process is influenced by certain key factors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Page access patterns (how recently pages are accessed)&lt;/li&gt;
&lt;li&gt;Page dirtiness (whether pages have been modified)&lt;/li&gt;
&lt;li&gt;Memory pressure (how urgently the system needs free memory)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&#34;anonymous-vs-file-backed-memory&#34;&gt;Anonymous vs File-backed memory&lt;/h3&gt;
&lt;p&gt;It is important to understand that not all memory pages are the same. The kernel distinguishes between anonymous and file-backed memory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Anonymous memory&lt;/strong&gt;: This is memory that is not backed by a specific file on the disk, such as a program&#39;s heap and stack. From the application&#39;s perspective this is private memory, and when the kernel needs to reclaim these pages, it must write them to a dedicated swap device.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File-backed memory&lt;/strong&gt;: This memory is backed by a file on a filesystem. This includes a program&#39;s executable code, shared libraries, and filesystem caches. When the kernel needs to reclaim these pages, it can simply discard them if they have not been modified (&amp;quot;clean&amp;quot;). If a page has been modified (&amp;quot;dirty&amp;quot;), the kernel must first write the changes back to the file before it can be discarded.&lt;/p&gt;
&lt;p&gt;While a system without swap can still reclaim clean file-backed pages under pressure by dropping them, it has no way to offload anonymous memory. Enabling swap provides this capability, allowing the kernel to move less-frequently accessed anonymous pages to disk, conserving memory and avoiding system-level OOM kills.&lt;/p&gt;
&lt;h3 id=&#34;key-kernel-parameters-for-swap-tuning&#34;&gt;Key kernel parameters for swap tuning&lt;/h3&gt;
&lt;p&gt;To effectively tune swap behavior, Linux provides several kernel parameters that can be managed via &lt;code&gt;sysctl&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;vm.swappiness&lt;/code&gt;: This is the most well-known parameter. It is a value from 0 to 200 (100 in older kernels) that controls the kernel&#39;s preference for swapping anonymous memory pages versus reclaiming file-backed memory pages (page cache).
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High value (eg: 90+)&lt;/strong&gt;: The kernel will be aggressive in swapping out less-used anonymous memory to make room for file-cache.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low value (eg: &amp;lt; 10)&lt;/strong&gt;: The kernel will strongly prefer dropping file cache pages over swapping anonymous memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vm.min_free_kbytes&lt;/code&gt;: This parameter tells the kernel to keep a minimum amount of memory free as a buffer. When the amount of free memory drops below this safety buffer, the kernel starts reclaiming pages more aggressively (swapping, and eventually resorting to OOM kills).
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Function:&lt;/strong&gt; It acts as a safety lever to ensure the kernel has enough memory for critical allocation requests that cannot be deferred.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Impact on swap&lt;/strong&gt;: Setting a higher &lt;code&gt;min_free_kbytes&lt;/code&gt; effectively raises the floor for free memory, causing the kernel to initiate swap earlier under memory pressure.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vm.watermark_scale_factor&lt;/code&gt;: This setting controls the gap between different watermarks: &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt;, which are calculated based on &lt;code&gt;min_free_kbytes&lt;/code&gt;.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Watermarks explained&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;low&lt;/code&gt;: When free memory is below this mark, the &lt;code&gt;kswapd&lt;/code&gt; kernel process wakes up to reclaim pages in the background. This is when a swapping cycle begins.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;min&lt;/code&gt;: When free memory hits this minimum level, aggressive page reclamation blocks process allocations (direct reclaim). If reclaim fails to free enough pages, the OOM killer is invoked.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;high&lt;/code&gt;: Memory reclamation stops once the free memory reaches this level.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Impact&lt;/strong&gt;: A higher &lt;code&gt;watermark_scale_factor&lt;/code&gt; creates a larger buffer between the &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;min&lt;/code&gt; watermarks. This gives &lt;code&gt;kswapd&lt;/code&gt; more time to reclaim memory gradually before the system hits a critical state.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a typical server workload, a long-running process may hold memory that becomes &#39;cold&#39;. A higher &lt;code&gt;swappiness&lt;/code&gt; value can free up RAM by swapping out that cold memory, leaving room for other active processes that benefit from keeping their file cache.&lt;/p&gt;
&lt;p&gt;Tuning the &lt;code&gt;min_free_kbytes&lt;/code&gt; and &lt;code&gt;watermark_scale_factor&lt;/code&gt; parameters to move the swapping window early will give more room for &lt;code&gt;kswapd&lt;/code&gt; to offload memory to disk and prevent OOM kills during sudden memory spikes.&lt;/p&gt;
&lt;h2 id=&#34;swap-tests-and-results&#34;&gt;Swap tests and results&lt;/h2&gt;
&lt;p&gt;To understand the real-world impact of these parameters, I designed a series of stress tests.&lt;/p&gt;
&lt;h3 id=&#34;test-setup&#34;&gt;Test setup&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment&lt;/strong&gt;: GKE on Google Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes version&lt;/strong&gt;: 1.33.2&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Node configuration&lt;/strong&gt;: &lt;code&gt;n2-standard-2&lt;/code&gt; (8GiB RAM, 50GB swap on a &lt;code&gt;pd-balanced&lt;/code&gt; disk, without encryption), Ubuntu 22.04&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workload&lt;/strong&gt;: A custom Go application designed to allocate memory at a configurable rate, generate file-cache pressure, and simulate different memory access patterns (random vs sequential).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;: A sidecar container capturing system metrics every second.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Protection&lt;/strong&gt;: Critical system components (kubelet, container runtime, sshd) were prevented from swapping by setting &lt;code&gt;memory.swap.max=0&lt;/code&gt; in their respective cgroups.&lt;/li&gt;
&lt;/ul&gt;
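&lt;p&gt;The protection in the last bullet can be applied, for example, through a systemd drop-in per critical service, assuming cgroup v2 and systemd-managed units (the path below is illustrative):&lt;/p&gt;

```ini
# /etc/systemd/system/kubelet.service.d/99-no-swap.conf (illustrative path)
# MemorySwapMax=0 sets memory.swap.max=0 on the service's cgroup.
[Service]
MemorySwapMax=0
```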
&lt;h3 id=&#34;test-methodology&#34;&gt;Test methodology&lt;/h3&gt;
&lt;p&gt;I ran a stress-test pod on nodes with different swappiness settings (0, 60, and 90), varying the &lt;code&gt;min_free_kbytes&lt;/code&gt; and &lt;code&gt;watermark_scale_factor&lt;/code&gt; parameters, and observed the outcomes under heavy memory allocation and I/O pressure.&lt;/p&gt;
&lt;h4 id=&#34;visualizing-swap-in-action&#34;&gt;Visualizing swap in action&lt;/h4&gt;
&lt;p&gt;The graph below, from a 100MBps stress test, shows swap in action. As free memory (in the &amp;quot;Memory Usage&amp;quot; plot) decreases, swap usage (&lt;code&gt;Swap Used (GiB)&lt;/code&gt;) and swap-out activity (&lt;code&gt;Swap Out (MiB/s)&lt;/code&gt;) increase. Critically, as the system relies more on swap, the I/O activity and corresponding wait time (&lt;code&gt;IO Wait %&lt;/code&gt; in the &amp;quot;CPU Usage&amp;quot; plot) also rise, indicating that the CPU is increasingly stalled on disk I/O.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Graph showing CPU, Memory, Swap utilization and I/O activity on a Kubernetes node&#34; src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/19/tuning-linux-swap-for-kubernetes-a-deep-dive/swap_visualization.png&#34; title=&#34;swap visualization&#34;&gt;&lt;/p&gt;
&lt;h3 id=&#34;findings&#34;&gt;Findings&lt;/h3&gt;
&lt;p&gt;My initial tests with default kernel parameters (&lt;code&gt;swappiness=60&lt;/code&gt;, &lt;code&gt;min_free_kbytes=68MB&lt;/code&gt;, &lt;code&gt;watermark_scale_factor=10&lt;/code&gt;) quickly led to OOM kills and even unexpected node restarts under high memory pressure. By selecting appropriate kernel parameters, a good balance between node stability and performance can be achieved.&lt;/p&gt;
&lt;h4 id=&#34;the-impact-of-swappiness&#34;&gt;The impact of &lt;code&gt;swappiness&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;The swappiness parameter directly influences the kernel&#39;s choice between reclaiming anonymous memory (swapping) and dropping page cache. To observe this preference, I ran a test where one pod generated and held file-cache pressure, followed by a second pod allocating anonymous memory at 100MB/s:&lt;/p&gt;
&lt;p&gt;My findings reveal a clear trade-off:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;swappiness=90&lt;/code&gt;: The kernel proactively swapped out the inactive anonymous memory to keep the file cache. This resulted in high and sustained swap usage and significant I/O activity (&amp;quot;Blocks Out&amp;quot;), which in turn caused spikes in I/O wait on the CPU.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;swappiness=0&lt;/code&gt;: The kernel favored dropping file-cache pages, delaying swap usage. However, it&#39;s critical to understand that this &lt;strong&gt;does not disable swapping&lt;/strong&gt;. When memory pressure was high, the kernel still swapped anonymous memory to disk.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The choice is workload-dependent. For workloads sensitive to I/O latency, a lower swappiness is preferable. For workloads that rely on a large and frequently accessed file cache, a higher swappiness may be beneficial, provided the underlying disk is fast enough to handle the load.&lt;/p&gt;
&lt;h4 id=&#34;tuning-watermarks-to-prevent-eviction-and-oom-kills&#34;&gt;Tuning watermarks to prevent eviction and OOM kills&lt;/h4&gt;
&lt;p&gt;The most critical challenge I encountered was the interaction between rapid memory allocation and Kubelet&#39;s eviction mechanism. When my test pod, which was deliberately configured to overcommit memory, allocated it at a high rate (e.g., 300-500 MBps), the system quickly ran out of free memory.&lt;/p&gt;
&lt;p&gt;With default watermarks, the buffer for reclamation was too small. Before &lt;code&gt;kswapd&lt;/code&gt; could free up enough memory by swapping, the node would hit a critical state, leading to two potential outcomes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Kubelet eviction&lt;/strong&gt;: If the kubelet&#39;s eviction manager detected that &lt;code&gt;memory.available&lt;/code&gt; was below its threshold, it would evict the pod.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OOM killer&lt;/strong&gt;: In some high-rate scenarios, the OOM killer would activate before eviction could complete, sometimes killing higher-priority pods that were not the source of the pressure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To mitigate this I tuned the watermarks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Increased &lt;code&gt;min_free_kbytes&lt;/code&gt; to 512MiB: This forces the kernel to start reclaiming memory much earlier, providing a larger safety buffer.&lt;/li&gt;
&lt;li&gt;Increased &lt;code&gt;watermark_scale_factor&lt;/code&gt; to 2000: This widened the gap between the &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt; watermarks (from ≈337MB to ≈591MB in my test node&#39;s &lt;code&gt;/proc/zoneinfo&lt;/code&gt;), effectively increasing the swapping window.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This combination gave &lt;code&gt;kswapd&lt;/code&gt; a larger operational zone and more time to swap pages to disk during memory spikes, successfully preventing both premature evictions and OOM kills in my test runs.&lt;/p&gt;
&lt;p&gt;The table below compares the watermark levels from &lt;code&gt;/proc/zoneinfo&lt;/code&gt; (non-NUMA node):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;min_free_kbytes=67584KiB&lt;/code&gt; and &lt;code&gt;watermark_scale_factor=10&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;min_free_kbytes=524288KiB&lt;/code&gt; and &lt;code&gt;watermark_scale_factor=2000&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Node 0, zone Normal &lt;br&gt;   pages free 583273 &lt;br&gt;   boost 0 &lt;br&gt;   min 10504 &lt;br&gt;   low 13130 &lt;br&gt;   high 15756 &lt;br&gt;   spanned 1310720 &lt;br&gt;   present 1310720 &lt;br&gt;   managed 1265603&lt;/td&gt;
&lt;td&gt;Node 0, zone Normal &lt;br&gt;   pages free 470539 &lt;br&gt;   min 82109 &lt;br&gt;   low 337017 &lt;br&gt;   high 591925&lt;br&gt;   spanned 1310720&lt;br&gt;   present 1310720 &lt;br&gt;   managed 1274542&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
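&lt;p&gt;To see where these numbers come from: the kernel derives the &lt;code&gt;low&lt;/code&gt; and &lt;code&gt;high&lt;/code&gt; watermarks from the zone&#39;s &lt;code&gt;min&lt;/code&gt; pages, its managed pages, and &lt;code&gt;watermark_scale_factor&lt;/code&gt;. A rough Python sketch of the per-zone calculation, which reproduces both columns of the table above:&lt;/p&gt;

```python
def zone_watermarks(min_pages, managed_pages, watermark_scale_factor):
    """Approximate the kernel's per-zone watermark calculation
    (modeled on mm/page_alloc.c; values in 4KiB pages).

    The gap between watermarks is the larger of min/4 and
    managed_pages * watermark_scale_factor / 10000.
    """
    gap = max(min_pages >> 2, managed_pages * watermark_scale_factor // 10000)
    low = min_pages + gap
    high = min_pages + 2 * gap
    return low, high

# Left column: defaults (min_free_kbytes=67584, watermark_scale_factor=10)
print(zone_watermarks(10504, 1265603, 10))     # (13130, 15756)
# Right column: tuned (min_free_kbytes=524288, watermark_scale_factor=2000)
print(zone_watermarks(82109, 1274542, 2000))   # (337017, 591925)
```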
&lt;p&gt;The graph below reveals that the kernel buffer size and scaling factor play a crucial role in determining how the system responds to memory load. With the right combination of these parameters, the system can effectively use swap space to avoid eviction and maintain stability.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;A side-by-side comparison of different min_free_kbytes settings, showing differences in Swap, Memory Usage and Eviction impact&#34; src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/19/tuning-linux-swap-for-kubernetes-a-deep-dive/memory-and-swap-growth.png&#34; title=&#34;Memory and Swap Utilization with min_free_kbytes&#34;&gt;&lt;/p&gt;
&lt;h3 id=&#34;risks-and-recommendations&#34;&gt;Risks and recommendations&lt;/h3&gt;
&lt;p&gt;Enabling swap in Kubernetes is a powerful tool, but it comes with risks that must be managed through careful tuning.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Risk of performance degradation&lt;/strong&gt; Swapping is orders of magnitude slower than accessing RAM. If an application&#39;s active working set is swapped out, its performance will suffer dramatically due to high I/O wait times (thrashing). Swap should preferably be provisioned on SSD-backed storage to improve performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Risk of masking memory leaks&lt;/strong&gt; Swap can hide memory leaks in applications, which might otherwise lead to a quick OOM kill. With swap, a leaky application might slowly degrade node performance over time, making the root cause harder to diagnose.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Risk of disabling evictions&lt;/strong&gt; The kubelet proactively monitors the node for memory pressure and terminates pods to reclaim resources. Improper tuning can lead to OOM kills before the kubelet has a chance to evict pods gracefully. A properly configured &lt;code&gt;min_free_kbytes&lt;/code&gt; is essential to ensure the kubelet&#39;s eviction mechanism remains effective.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;kubernetes-context&#34;&gt;Kubernetes context&lt;/h3&gt;
&lt;p&gt;Together, the kernel watermarks and the kubelet eviction threshold create a series of memory pressure zones on a node. The eviction-threshold parameters need to be adjusted so that Kubernetes-managed evictions occur before OOM kills.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;Preferred thresholds for effective swap utilization&#34; src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/19/tuning-linux-swap-for-kubernetes-a-deep-dive/swap-thresholds.png&#34; title=&#34;Recommended Thresholds&#34;&gt;&lt;/p&gt;
&lt;p&gt;As the diagram shows, an ideal configuration creates a large enough &#39;swapping zone&#39; (between the &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;min&lt;/code&gt; watermarks) so that the kernel can handle memory pressure by swapping before available memory drops into the eviction/direct-reclaim zone.&lt;/p&gt;
&lt;h3 id=&#34;recommended-starting-point&#34;&gt;Recommended starting point&lt;/h3&gt;
&lt;p&gt;Based on these findings, I recommend the following as a starting point for Linux nodes with swap enabled. You should benchmark this with your own workloads.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;vm.swappiness=60&lt;/code&gt;: Linux default is a good starting point for general-purpose workloads. However, the ideal value is workload-dependent, and swap-sensitive applications may need more careful tuning.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vm.min_free_kbytes=500000&lt;/code&gt; (500MB): Set this to a reasonably high value (e.g., 2-3% of total node memory) to give the node a reasonable safety buffer.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vm.watermark_scale_factor=2000&lt;/code&gt;: Create a larger window for &lt;code&gt;kswapd&lt;/code&gt; to work with, preventing OOM kills during sudden memory allocation spikes.&lt;/li&gt;
&lt;/ul&gt;
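&lt;p&gt;These values can be applied with &lt;code&gt;sysctl&lt;/code&gt;; a sketch (requires root; file name under &lt;code&gt;/etc/sysctl.d/&lt;/code&gt; is illustrative):&lt;/p&gt;

```shell
# Apply at runtime (illustrative; requires root)
sysctl -w vm.swappiness=60
sysctl -w vm.min_free_kbytes=500000
sysctl -w vm.watermark_scale_factor=2000
# To persist across reboots, place the same key=value lines
# in a file such as /etc/sysctl.d/99-swap-tuning.conf
```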
&lt;p&gt;When setting up swap for the first time in your Kubernetes cluster, I encourage running benchmark tests with your own workloads in test environments. Swap performance can be sensitive to environmental factors such as CPU load, disk type (SSD vs HDD), and I/O patterns.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/15/kubernetes-v1-34-sa-tokens-image-pulls-beta/</link>
      <pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/15/kubernetes-v1-34-sa-tokens-image-pulls-beta/</guid>
      <description>
        
        
        &lt;p&gt;The Kubernetes community continues to advance security best practices
by reducing reliance on long-lived credentials.
Following the successful &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/07/kubernetes-v1-33-wi-for-image-pulls/&#34;&gt;alpha release in Kubernetes v1.33&lt;/a&gt;,
&lt;em&gt;Service Account Token Integration for Kubelet Credential Providers&lt;/em&gt;
has now graduated to &lt;strong&gt;beta&lt;/strong&gt; in Kubernetes v1.34,
bringing us closer to eliminating long-lived image pull secrets from Kubernetes clusters.&lt;/p&gt;
&lt;p&gt;This enhancement allows credential providers
to use workload-specific service account tokens to obtain registry credentials,
providing a secure, ephemeral alternative to traditional image pull secrets.&lt;/p&gt;
&lt;h2 id=&#34;what-s-new-in-beta&#34;&gt;What&#39;s new in beta?&lt;/h2&gt;
&lt;p&gt;The beta graduation brings several important changes
that make the feature more robust and production-ready:&lt;/p&gt;
&lt;h3 id=&#34;required-cachetype-field&#34;&gt;Required &lt;code&gt;cacheType&lt;/code&gt; field&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Breaking change from alpha&lt;/strong&gt;: The &lt;code&gt;cacheType&lt;/code&gt; field is &lt;strong&gt;required&lt;/strong&gt;
in the credential provider configuration when using service account tokens.
This field is new in beta and must be specified to ensure proper caching behavior.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is not a complete configuration example, just a reference for the &amp;#39;tokenAttributes.cacheType&amp;#39; field.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tokenAttributes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountTokenAudience&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cacheType&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;ServiceAccount&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Required field in beta&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requireServiceAccount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Choose between two caching strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Token&lt;/code&gt;&lt;/strong&gt;: Cache credentials per service account token
(use when credential lifetime is tied to the token).
This is useful when the credential provider transforms the service account token into registry credentials
with the same lifetime as the token, or when registries support Kubernetes service account tokens directly.
Note: The kubelet cannot send service account tokens directly to registries;
credential provider plugins are needed to transform tokens into the username/password format expected by registries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ServiceAccount&lt;/code&gt;&lt;/strong&gt;: Cache credentials per service account identity
(use when credentials are valid for all pods using the same service account)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;isolated-image-pull-credentials&#34;&gt;Isolated image pull credentials&lt;/h3&gt;
&lt;p&gt;The beta release provides stronger security isolation for container images
when using service account tokens for image pulls.
It ensures that pods can only access images that were pulled using ServiceAccounts they&#39;re authorized to use.
This prevents unauthorized access to sensitive container images
and enables granular access control where different workloads can have different registry permissions
based on their ServiceAccount.&lt;/p&gt;
&lt;p&gt;When credential providers use service account tokens,
the system tracks ServiceAccount identity (namespace, name, and &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/names/#uids&#34;&gt;UID&lt;/a&gt;) for each pulled image.
When a pod attempts to use a cached image,
the system verifies that the pod&#39;s ServiceAccount matches exactly with the ServiceAccount
that was used to originally pull the image.&lt;/p&gt;
&lt;p&gt;Administrators can revoke access to previously pulled images
by deleting and recreating the ServiceAccount,
which changes the UID and invalidates cached image access.&lt;/p&gt;
&lt;p&gt;For more details about this capability,
see the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/containers/images/#ensureimagepullcredentialverification&#34;&gt;image pull credential verification&lt;/a&gt; documentation.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;h3 id=&#34;configuration&#34;&gt;Configuration&lt;/h3&gt;
&lt;p&gt;Credential providers opt into using ServiceAccount tokens
by configuring the &lt;code&gt;tokenAttributes&lt;/code&gt; field:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CredentialProviderConfig&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;providers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-credential-provider&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchImages&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;*.myregistry.io/*&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;defaultCacheDuration&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;10m&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;credentialprovider.kubelet.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tokenAttributes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountTokenAudience&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cacheType&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;ServiceAccount&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# New in beta&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requireServiceAccount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requiredServiceAccountAnnotationKeys&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;myregistry.io/identity-id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;optionalServiceAccountAnnotationKeys&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;myregistry.io/optional-annotation&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;image-pull-flow&#34;&gt;Image pull flow&lt;/h3&gt;
&lt;p&gt;At a high level, &lt;code&gt;kubelet&lt;/code&gt; coordinates with your credential provider
and the container runtime as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When the image is not present locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; checks its credential cache using the configured &lt;code&gt;cacheType&lt;/code&gt;
(&lt;code&gt;Token&lt;/code&gt; or &lt;code&gt;ServiceAccount&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;If needed, &lt;code&gt;kubelet&lt;/code&gt; requests a ServiceAccount token for the pod&#39;s ServiceAccount
and passes it, plus any required annotations, to the credential provider&lt;/li&gt;
&lt;li&gt;The provider exchanges that token for registry credentials
and returns them to &lt;code&gt;kubelet&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; caches credentials per the &lt;code&gt;cacheType&lt;/code&gt; strategy
and pulls the image with those credentials&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; records the ServiceAccount coordinates (namespace, name, UID)
associated with the pulled image for later authorization checks&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When the image is already present locally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt; verifies the pod&#39;s ServiceAccount coordinates
match the coordinates recorded for the cached image&lt;/li&gt;
&lt;li&gt;If they match exactly, the cached image can be used
without pulling from the registry&lt;/li&gt;
&lt;li&gt;If they differ, &lt;code&gt;kubelet&lt;/code&gt; performs a fresh pull
using credentials for the new ServiceAccount&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With image pull credential verification enabled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Authorization is enforced using the recorded ServiceAccount coordinates,
ensuring pods only use images pulled by a ServiceAccount
they are authorized to use&lt;/li&gt;
&lt;li&gt;Administrators can revoke access by deleting and recreating a ServiceAccount;
the UID changes and previously recorded authorization no longer matches&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
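&lt;p&gt;The cache-verification step above can be sketched schematically. This is an illustrative model, not kubelet&#39;s actual implementation; the type and function names are invented for the example:&lt;/p&gt;

```python
# Schematic model of kubelet's cached-image authorization check:
# an image pulled with one ServiceAccount (namespace, name, UID) may only
# be reused by pods whose ServiceAccount coordinates match exactly.
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceAccountCoordinates:
    namespace: str
    name: str
    uid: str


# image reference -> coordinates recorded at pull time
pull_records: dict[str, ServiceAccountCoordinates] = {}


def record_pull(image: str, sa: ServiceAccountCoordinates) -> None:
    """Remember which ServiceAccount's credentials pulled this image."""
    pull_records[image] = sa


def can_use_cached_image(image: str, pod_sa: ServiceAccountCoordinates) -> bool:
    """True only on an exact match of namespace, name, and UID.

    A recreated ServiceAccount has a new UID, so previously recorded
    authorization no longer matches and a fresh pull is required.
    """
    return pull_records.get(image) == pod_sa
```

&lt;p&gt;In this model, deleting and recreating a ServiceAccount changes its UID, so the exact-match check fails and a fresh pull with fresh credentials is forced.&lt;/p&gt;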
&lt;h3 id=&#34;audience-restriction&#34;&gt;Audience restriction&lt;/h3&gt;
&lt;p&gt;The beta release builds on service account node audience restriction
(beta since v1.33) to ensure &lt;code&gt;kubelet&lt;/code&gt; can only request tokens for authorized audiences.
Administrators configure allowed audiences using RBAC to enable kubelet to request service account tokens for image pulls:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRole&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet-credential-provider-audiences&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;verbs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;request-serviceaccounts-token-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceNames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;registry-access-sa&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Optional: specific SA&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;getting-started-with-beta&#34;&gt;Getting started with beta&lt;/h2&gt;
&lt;h3 id=&#34;prerequisites&#34;&gt;Prerequisites&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes v1.34 or later&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Feature gate enabled&lt;/strong&gt;:
&lt;code&gt;KubeletServiceAccountTokenForCredentialProviders=true&lt;/code&gt; (beta, enabled by default)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential provider support&lt;/strong&gt;:
Update your credential provider to handle ServiceAccount tokens&lt;/li&gt;
&lt;/ol&gt;
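&lt;p&gt;As a sketch, the feature gate can be set explicitly in the kubelet configuration; it is already enabled by default in v1.34, so this fragment is illustrative only:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;# Illustrative kubelet configuration fragment
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletServiceAccountTokenForCredentialProviders: true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;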
&lt;h3 id=&#34;migration-from-alpha&#34;&gt;Migration from alpha&lt;/h3&gt;
&lt;p&gt;If you&#39;re already using the alpha version,
the migration to beta requires minimal changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add &lt;code&gt;cacheType&lt;/code&gt; field&lt;/strong&gt;:
Update your credential provider configuration to include the required &lt;code&gt;cacheType&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review caching strategy&lt;/strong&gt;:
Choose between &lt;code&gt;Token&lt;/code&gt; and &lt;code&gt;ServiceAccount&lt;/code&gt; cache types based on your provider&#39;s behavior&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test audience restrictions&lt;/strong&gt;:
Ensure your RBAC configuration, or other cluster authorization rules, will properly restrict token audiences&lt;/li&gt;
&lt;/ol&gt;
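&lt;p&gt;Concretely, an alpha-era &lt;code&gt;tokenAttributes&lt;/code&gt; block only needs the new required field added (values here are illustrative):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;# Illustrative fragment: the alpha configuration plus the new required field
tokenAttributes:
  serviceAccountTokenAudience: &amp;#34;my-registry-audience&amp;#34;
  requireServiceAccount: true
  cacheType: &amp;#34;ServiceAccount&amp;#34;  # new required field in beta; or &amp;#34;Token&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;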
&lt;h3 id=&#34;example-setup&#34;&gt;Example setup&lt;/h3&gt;
&lt;p&gt;Here&#39;s a complete example
for setting up a credential provider with service account tokens
(this example assumes your cluster uses RBAC authorization):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# CAUTION: this is an example configuration.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#          Do not use this for your own cluster!&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;#&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Service Account with registry annotations&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ServiceAccount&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-access-sa&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;default&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;myregistry.io/identity-id&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;user123&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# RBAC for audience restriction&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRole&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-audience-access&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;verbs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;request-serviceaccounts-token-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;my-registry-audience&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceNames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;registry-access-sa&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Optional: specific ServiceAccount&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRoleBinding&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet-registry-audience&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;roleRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterRole&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-audience-access&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;subjects&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Group&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;system:nodes&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;rbac.authorization.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Pod using the ServiceAccount&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;serviceAccountName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry-access-sa&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myregistry.example/my-app:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;For Kubernetes v1.35, we (Kubernetes SIG Auth) expect the feature to stay in beta,
and we will continue to solicit feedback.&lt;/p&gt;
&lt;p&gt;You can learn more about this feature
on the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/kubelet-credential-provider/#service-account-token-for-image-pulls&#34;&gt;service account token for image pulls&lt;/a&gt;
page in the Kubernetes documentation.&lt;/p&gt;
&lt;p&gt;You can also follow along on the
&lt;a href=&#34;https://kep.k8s.io/4412&#34;&gt;KEP-4412&lt;/a&gt;
to track progress across the coming Kubernetes releases.&lt;/p&gt;
&lt;h2 id=&#34;call-to-action&#34;&gt;Call to action&lt;/h2&gt;
&lt;p&gt;In this blog post,
I have covered the beta graduation of ServiceAccount token integration
for Kubelet Credential Providers in Kubernetes v1.34.
I discussed the key improvements,
including the required &lt;code&gt;cacheType&lt;/code&gt; field
and enhanced integration with image pull credential verification.&lt;/p&gt;
&lt;p&gt;We received positive feedback from the community during the alpha phase
and would love to hear more as we stabilize this feature for GA.
In particular, we would like feedback from credential provider implementors
as they integrate with the new beta API and caching mechanisms.
Please reach out to us on the &lt;a href=&#34;https://kubernetes.slack.com/archives/C04UMAUC4UA&#34;&gt;#sig-auth-authenticators-dev&lt;/a&gt; channel on Kubernetes Slack.&lt;/p&gt;
&lt;h2 id=&#34;how-to-get-involved&#34;&gt;How to get involved&lt;/h2&gt;
&lt;p&gt;If you are interested in getting involved in the development of this feature,
share feedback, or participate in any other ongoing SIG Auth projects,
please reach out on the &lt;a href=&#34;https://kubernetes.slack.com/archives/C0EN96KUY&#34;&gt;#sig-auth&lt;/a&gt; channel on Kubernetes Slack.&lt;/p&gt;
&lt;p&gt;You are also welcome to join the bi-weekly &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-auth/README.md#meetings&#34;&gt;SIG Auth meetings&lt;/a&gt;,
held every other Wednesday.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>PSI Metrics for Kubernetes Graduates to Beta</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/08/introducing-psi-metrics-beta/</link>
      <pubDate>Fri, 08 Aug 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/08/introducing-psi-metrics-beta/</guid>
      <description>
        
        
        &lt;p&gt;As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, &lt;strong&gt;Pressure Stall Information (PSI) Metrics&lt;/strong&gt; has graduated to Beta.&lt;/p&gt;
&lt;h2 id=&#34;what-is-pressure-stall-information-psi&#34;&gt;What is Pressure Stall Information (PSI)?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://docs.kernel.org/accounting/psi.html&#34;&gt;Pressure Stall Information (PSI)&lt;/a&gt; is a feature of the Linux kernel (version 4.20 and later)
that provides a canonical way to quantify pressure on infrastructure resources,
in terms of whether demand for a resource exceeds current supply.
It moves beyond simple resource utilization metrics and instead
measures the amount of time that tasks are stalled due to resource contention.
This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.&lt;/p&gt;
&lt;p&gt;PSI exposes metrics for CPU, memory, and I/O, categorized as either &lt;code&gt;some&lt;/code&gt; or &lt;code&gt;full&lt;/code&gt; pressure:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;some&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;The percentage of time that &lt;strong&gt;at least one&lt;/strong&gt; task is stalled on a resource. This indicates some level of resource contention.&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;full&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;The percentage of time that &lt;strong&gt;all&lt;/strong&gt; non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.&lt;/dd&gt;
&lt;/dl&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/images/psi-metrics-some-vs-full.svg&#34;
         alt=&#34;Diagram illustrating the difference between &amp;#39;some&amp;#39; and &amp;#39;full&amp;#39; PSI pressure.&#34;/&gt; &lt;figcaption&gt;
            &lt;h4&gt;PSI: &amp;#39;Some&amp;#39; vs. &amp;#39;Full&amp;#39; Pressure&lt;/h4&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.&lt;/p&gt;
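&lt;p&gt;These windows mirror the kernel&#39;s own accounting: on a Linux node with PSI enabled, you can inspect the raw data under &lt;code&gt;/proc/pressure&lt;/code&gt;. For example, &lt;code&gt;/proc/pressure/memory&lt;/code&gt; might contain something like this (the numbers here are purely illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;some avg10=0.25 avg60=0.10 avg300=0.03 total=1234567
full avg10=0.00 avg60=0.01 avg300=0.00 total=98765
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;avg10&lt;/code&gt;, &lt;code&gt;avg60&lt;/code&gt;, and &lt;code&gt;avg300&lt;/code&gt; fields are the pressure percentages over the 10-second, 1-minute, and 5-minute windows, and &lt;code&gt;total&lt;/code&gt; is the cumulative stall time in microseconds.&lt;/p&gt;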
&lt;h2 id=&#34;psi-metrics-in-kubernetes&#34;&gt;PSI metrics in Kubernetes&lt;/h2&gt;
&lt;p&gt;With the &lt;code&gt;KubeletPSI&lt;/code&gt; feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/instrumentation/node-metrics/#summary-api-source&#34;&gt;Summary API&lt;/a&gt; and the &lt;code&gt;/metrics/cadvisor&lt;/code&gt; Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.&lt;/p&gt;
&lt;p&gt;The following new metrics are available in Prometheus exposition format via &lt;code&gt;/metrics/cadvisor&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;container_pressure_cpu_stalled_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_cpu_waiting_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_memory_stalled_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_memory_waiting_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_io_stalled_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;container_pressure_io_waiting_seconds_total&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Identify memory leaks:&lt;/strong&gt; A steadily increasing &lt;code&gt;some&lt;/code&gt; pressure for memory can indicate a memory leak in an application.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize resource requests and limits:&lt;/strong&gt; By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autoscale workloads:&lt;/strong&gt; You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.&lt;/li&gt;
&lt;/ul&gt;
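&lt;p&gt;Because these metrics are cumulative counters of stall seconds, the per-second &lt;code&gt;rate()&lt;/code&gt; gives the fraction of time spent under pressure. As a sketch, assuming you scrape &lt;code&gt;/metrics/cadvisor&lt;/code&gt; with Prometheus, an alerting rule for sustained memory pressure could look like this (the rule name and threshold are illustrative, not recommendations):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;groups:
- name: psi-example
  rules:
  - alert: HighMemoryPressure
    # More than 10% of the time, at least one task was stalled on memory
    expr: rate(container_pressure_memory_waiting_seconds_total[5m]) &amp;gt; 0.1
    for: 10m
&lt;/code&gt;&lt;/pre&gt;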
&lt;h2 id=&#34;how-to-enable-psi-metrics&#34;&gt;How to enable PSI metrics&lt;/h2&gt;
&lt;p&gt;To enable PSI metrics in your Kubernetes cluster, you need to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable the &lt;code&gt;KubeletPSI&lt;/code&gt; feature gate on the kubelet.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
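&lt;p&gt;For example, the feature gate can be turned on through the kubelet configuration file (any other settings you already have stay as they are):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletPSI: true
&lt;/code&gt;&lt;/pre&gt;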
&lt;p&gt;Once enabled, you can start scraping the &lt;code&gt;/metrics/cadvisor&lt;/code&gt; endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.&lt;/p&gt;
&lt;p&gt;To learn more about PSI metrics, check out the official &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/instrumentation/understand-psi-metrics/&#34;&gt;Kubernetes documentation&lt;/a&gt;. You can also get involved in the conversation on the &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt; Slack channel.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Introducing Headlamp AI Assistant</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/07/introducing-headlamp-ai-assistant/</link>
      <pubDate>Thu, 07 Aug 2025 20:00:00 +0100</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/08/07/introducing-headlamp-ai-assistant/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;This announcement originally &lt;a href=&#34;https://headlamp.dev/blog/2025/08/07/introducing-the-headlamp-ai-assistant&#34;&gt;appeared&lt;/a&gt; on the Headlamp blog.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To simplify Kubernetes management and troubleshooting, we&#39;re thrilled to
introduce &lt;a href=&#34;https://github.com/headlamp-k8s/plugins/tree/main/ai-assistant#readme&#34;&gt;Headlamp AI Assistant&lt;/a&gt;: a powerful new plugin for Headlamp that helps
you understand and operate your Kubernetes clusters and applications with
greater clarity and ease.&lt;/p&gt;
&lt;p&gt;Whether you&#39;re a seasoned engineer or just getting started, the AI Assistant offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fast time to value:&lt;/strong&gt; Ask questions like &lt;em&gt;&amp;quot;Is my application healthy?&amp;quot;&lt;/em&gt; or
&lt;em&gt;&amp;quot;How can I fix this?&amp;quot;&lt;/em&gt; without needing deep Kubernetes knowledge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deep insights:&lt;/strong&gt; Start with high-level queries and dig deeper with prompts
like &lt;em&gt;&amp;quot;List all the problematic pods&amp;quot;&lt;/em&gt; or &lt;em&gt;&amp;quot;How can I fix this pod?&amp;quot;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Focused &amp;amp; relevant:&lt;/strong&gt; Ask questions in the context of what you&#39;re viewing
in the UI, such as &lt;em&gt;&amp;quot;What&#39;s wrong here?&amp;quot;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action-oriented:&lt;/strong&gt; Let the AI take action for you, like &lt;em&gt;&amp;quot;Restart that
deployment&amp;quot;&lt;/em&gt;, with your permission.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here is a demo of the AI Assistant in action as it helps troubleshoot an
application running with issues in a Kubernetes cluster:&lt;/p&gt;


    
    &lt;div class=&#34;youtube-quote-sm&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen=&#34;allowfullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/GzXkUuCTcd4?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0&#34; title=&#34;Headlamp AI Assistant&#34;
      &gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h2 id=&#34;hopping-on-the-ai-train&#34;&gt;Hopping on the AI train&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) have transformed not just how we access data but
also how we interact with it. The rise of tools like ChatGPT opened a world of
possibilities, inspiring a wave of new applications. Asking questions or giving
commands in natural language is intuitive, especially for users who aren&#39;t deeply
technical. Now anyone can quickly ask how to do X or Y without feeling awkward
or having to comb through page after page of documentation.&lt;/p&gt;
&lt;p&gt;To that end, the Headlamp AI Assistant brings a conversational UI to &lt;a href=&#34;https://headlamp.dev&#34;&gt;Headlamp&lt;/a&gt;.
It is available as a Headlamp plugin, making it easy to integrate into your
existing setup. Users enable it by installing the plugin and configuring
it with their own LLM API keys, giving them control over which model powers
the assistant. Once enabled, the assistant becomes part of the Headlamp UI,
ready to respond to contextual queries and perform actions directly from the
interface.&lt;/p&gt;
&lt;h2 id=&#34;context-is-everything&#34;&gt;Context is everything&lt;/h2&gt;
&lt;p&gt;As expected, the AI Assistant is focused on helping users with Kubernetes
concepts. Yet, while there is a lot of value in answering Kubernetes-related
questions from Headlamp&#39;s UI, we believe the greatest benefit of such
an integration comes when it can use the context of what the user is experiencing
in an application. So, the Headlamp AI Assistant knows what you&#39;re currently
viewing in Headlamp, and this makes the interaction feel more like working
with a human assistant.&lt;/p&gt;
&lt;p&gt;For example, if a pod is failing, users can simply ask &lt;em&gt;&amp;quot;What&#39;s wrong here?&amp;quot;&lt;/em&gt;
and the AI Assistant will respond with the root cause, like a missing
environment variable or a typo in the image name. Follow-up prompts like
&lt;em&gt;&amp;quot;How can I fix this?&amp;quot;&lt;/em&gt; allow the AI Assistant to suggest a fix, streamlining
what used to take multiple steps into a quick, conversational flow.&lt;/p&gt;
&lt;p&gt;Sharing context from Headlamp is not a trivial task, though, so it&#39;s
an area we will keep refining.&lt;/p&gt;
&lt;h2 id=&#34;tools&#34;&gt;Tools&lt;/h2&gt;
&lt;p&gt;Context from the UI is helpful, but sometimes additional capabilities are
needed. If the user is viewing the pod list and wants to identify problematic
deployments, switching views should not be necessary. To address this, the AI
Assistant includes support for a Kubernetes tool. This allows asking questions
like &amp;quot;Get me all deployments with problems&amp;quot;, prompting the assistant to fetch
and display relevant data from the current cluster. Likewise, if the user
requests an action like &amp;quot;Restart that deployment&amp;quot; after the AI points out what
deployment needs restarting, it can also do that. For &amp;quot;write&amp;quot;
operations, the AI Assistant asks the user for permission before running them.&lt;/p&gt;
&lt;h2 id=&#34;ai-plugins&#34;&gt;AI Plugins&lt;/h2&gt;
&lt;p&gt;Although the initial version of the AI Assistant is already useful for
Kubernetes users, future iterations will expand its capabilities. Currently,
the assistant supports only the Kubernetes tool, but further integration with
Headlamp plugins is underway. For example, we could gain richer insights for
GitOps via the Flux plugin, monitoring through Prometheus, package management
with Helm, and more.&lt;/p&gt;
&lt;p&gt;And of course, as the popularity of the Model Context Protocol (MCP) grows, we are looking into how to
integrate it as well, in a more plug-and-play fashion.&lt;/p&gt;
&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out!&lt;/h2&gt;
&lt;p&gt;We hope this first version of the AI Assistant helps users manage Kubernetes
clusters more effectively and helps newcomers navigate the learning
curve. We invite you to try out this early version and give us your feedback.
The AI Assistant plugin can be installed from Headlamp&#39;s Plugin Catalog in the
desktop version, or by using the container image when deploying Headlamp.
Stay tuned for the future versions of the Headlamp AI Assistant!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.34 Sneak Peek</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/28/kubernetes-v1-34-sneak-peek/</link>
      <pubDate>Mon, 28 Jul 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/28/kubernetes-v1-34-sneak-peek/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes v1.34 is coming at the end of August 2025.
This release will not include any removal or deprecation, but it is packed with an impressive number of enhancements.
Here are some of the features we are most excited about in this cycle!&lt;/p&gt;
&lt;p&gt;Please note that this information reflects the current state of v1.34 development and may change before release.&lt;/p&gt;
&lt;h2 id=&#34;featured-enhancements-of-kubernetes-v1-34&#34;&gt;Featured enhancements of Kubernetes v1.34&lt;/h2&gt;
&lt;p&gt;The following list highlights some of the notable enhancements likely to be included in the v1.34 release,
but is not an exhaustive list of all planned changes.
This is not a commitment and the release content is subject to change.&lt;/p&gt;
&lt;h3 id=&#34;the-core-of-dra-targets-stable&#34;&gt;The core of DRA targets stable&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/&#34;&gt;Dynamic Resource Allocation&lt;/a&gt; (DRA) provides a flexible way to categorize,
request, and use devices like GPUs or custom hardware in your Kubernetes cluster.&lt;/p&gt;
&lt;p&gt;Since the v1.30 release, DRA has been based around claiming devices using &lt;em&gt;structured parameters&lt;/em&gt; that are opaque to the core of Kubernetes.
The relevant enhancement proposal, &lt;a href=&#34;https://kep.k8s.io/4381&#34;&gt;KEP-4381&lt;/a&gt;, took inspiration from dynamic provisioning for storage volumes.
DRA with structured parameters relies on a set of supporting API kinds: the ResourceClaim, DeviceClass, ResourceClaimTemplate,
and ResourceSlice types under &lt;code&gt;resource.k8s.io&lt;/code&gt;, while extending the &lt;code&gt;.spec&lt;/code&gt; for Pods with a new &lt;code&gt;resourceClaims&lt;/code&gt; field.
The core of DRA is targeting graduation to stable in Kubernetes v1.34.&lt;/p&gt;
&lt;p&gt;With DRA, device drivers and cluster admins define device classes that are available for use.
Workloads can claim devices from a device class within device requests.
Kubernetes allocates matching devices to specific claims and places the corresponding Pods on nodes that can access the allocated devices.
This framework provides flexible device filtering using CEL, centralized device categorization, and simplified Pod requests, among other benefits.&lt;/p&gt;
&lt;p&gt;Once this feature has graduated, the &lt;code&gt;resource.k8s.io/v1&lt;/code&gt; APIs will be available by default.&lt;/p&gt;
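&lt;p&gt;As a rough sketch of how these pieces fit together, a workload might claim a device like this. The device class name and image are hypothetical, and the exact schema differs between API versions, so consult the DRA documentation for the version you are running:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
  containers:
  - name: app
    image: myregistry.example/my-app:latest
    resources:
      claims:
      - name: gpu
&lt;/code&gt;&lt;/pre&gt;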
&lt;h3 id=&#34;serviceaccount-tokens-for-image-pull-authentication&#34;&gt;ServiceAccount tokens for image pull authentication&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/security/service-accounts/&#34;&gt;ServiceAccount&lt;/a&gt; token integration for &lt;code&gt;kubelet&lt;/code&gt; credential providers is likely to reach beta and be enabled by default in Kubernetes v1.34.
This allows the &lt;code&gt;kubelet&lt;/code&gt; to use these tokens when pulling container images from registries that require authentication.&lt;/p&gt;
&lt;p&gt;That support already exists as alpha, and is tracked as part of &lt;a href=&#34;https://kep.k8s.io/4412&#34;&gt;KEP-4412&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The existing alpha integration allows the &lt;code&gt;kubelet&lt;/code&gt; to use short-lived, automatically rotated ServiceAccount tokens (that follow OIDC-compliant semantics) to authenticate to a container image registry.
Each token is scoped to one associated Pod; the overall mechanism replaces the need for long-lived image pull Secrets.&lt;/p&gt;
&lt;p&gt;Adopting this new approach reduces security risks, supports workload-level identity, and helps cut operational overhead.
It brings image pull authentication closer to modern, identity-aware good practice.&lt;/p&gt;
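&lt;p&gt;As a sketch, a credential provider opts into ServiceAccount tokens through its entry in the kubelet&#39;s &lt;code&gt;CredentialProviderConfig&lt;/code&gt;. The provider name, image pattern, and audience below are hypothetical, and the exact field names may evolve as the feature moves through beta:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: example-provider
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  matchImages:
  - &amp;quot;*.registry.example&amp;quot;
  defaultCacheDuration: &amp;quot;10m&amp;quot;
  tokenAttributes:
    serviceAccountTokenAudience: &amp;quot;registry.example&amp;quot;
    requireServiceAccount: true
&lt;/code&gt;&lt;/pre&gt;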
&lt;h3 id=&#34;pod-replacement-policy-for-deployments&#34;&gt;Pod replacement policy for Deployments&lt;/h3&gt;
&lt;p&gt;After a change to a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/deployment/&#34;&gt;Deployment&lt;/a&gt;, terminating pods may stay up for a considerable amount of time and may consume additional resources.
As part of &lt;a href=&#34;https://kep.k8s.io/3973&#34;&gt;KEP-3973&lt;/a&gt;, the &lt;code&gt;.spec.podReplacementPolicy&lt;/code&gt; field will be introduced (as alpha) for Deployments.&lt;/p&gt;
&lt;p&gt;If your cluster has the feature enabled, you&#39;ll be able to select one of two policies:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;TerminationStarted&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Creates new pods as soon as old ones start terminating, resulting in faster rollouts at the cost of potentially higher resource consumption.&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;TerminationComplete&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;Waits until old pods fully terminate before creating new ones, resulting in slower rollouts but ensuring controlled resource consumption.&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;This feature makes Deployment behavior more predictable by letting you choose when new pods should be created during updates or scaling.
It&#39;s beneficial when working in clusters with tight resource constraints or with workloads with long termination periods.&lt;/p&gt;
&lt;p&gt;It&#39;s expected to be available as an alpha feature and can be enabled using the &lt;code&gt;DeploymentPodReplacementPolicy&lt;/code&gt; and &lt;code&gt;DeploymentReplicaSetTerminatingReplicas&lt;/code&gt; feature gates in the API server and kube-controller-manager.&lt;/p&gt;
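&lt;p&gt;As a sketch, opting a Deployment into the stricter policy would look like this (assuming the feature gates above are enabled in your cluster):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  podReplacementPolicy: TerminationComplete
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: myregistry.example/my-app:latest
&lt;/code&gt;&lt;/pre&gt;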
&lt;h3 id=&#34;production-ready-tracing-for-kubelet-and-api-server&#34;&gt;Production-ready tracing for &lt;code&gt;kubelet&lt;/code&gt; and API Server&lt;/h3&gt;
&lt;p&gt;To address the longstanding challenge of debugging node-level issues by correlating disconnected logs,
&lt;a href=&#34;https://kep.k8s.io/2831&#34;&gt;KEP-2831&lt;/a&gt; provides deep, contextual insights into the &lt;code&gt;kubelet&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This feature instruments critical &lt;code&gt;kubelet&lt;/code&gt; operations, particularly its gRPC calls to the Container Runtime Interface (CRI), using the vendor-agnostic OpenTelemetry standard.
It allows operators to visualize the entire lifecycle of events (for example: a Pod startup) to pinpoint sources of latency and errors.
Its most powerful aspect is the propagation of trace context; the &lt;code&gt;kubelet&lt;/code&gt; passes a trace ID with its requests to the container runtime, enabling runtimes to link their own spans.&lt;/p&gt;
&lt;p&gt;This effort is complemented by a parallel enhancement, &lt;a href=&#34;https://kep.k8s.io/647&#34;&gt;KEP-647&lt;/a&gt;, which brings the same tracing capabilities to the Kubernetes API server.
Together, these enhancements provide a more unified, end-to-end view of events, simplifying the process of pinpointing latency and errors from the control plane down to the node.
These features have matured through the official Kubernetes release process.
&lt;a href=&#34;https://kep.k8s.io/2831&#34;&gt;KEP-2831&lt;/a&gt; was introduced as an alpha feature in v1.25, while &lt;a href=&#34;https://kep.k8s.io/647&#34;&gt;KEP-647&lt;/a&gt; debuted as alpha in v1.22.
Both enhancements were promoted to beta together in the v1.27 release.
Looking forward, Kubelet Tracing (&lt;a href=&#34;https://kep.k8s.io/2831&#34;&gt;KEP-2831&lt;/a&gt;) and API Server Tracing (&lt;a href=&#34;https://kep.k8s.io/647&#34;&gt;KEP-647&lt;/a&gt;) are now targeting graduation to stable in the upcoming v1.34 release.&lt;/p&gt;
&lt;h3 id=&#34;prefersamezone-and-prefersamenode-traffic-distribution-for-services&#34;&gt;&lt;code&gt;PreferSameZone&lt;/code&gt; and &lt;code&gt;PreferSameNode&lt;/code&gt; traffic distribution for Services&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;spec.trafficDistribution&lt;/code&gt; field within a Kubernetes &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/service/&#34;&gt;Service&lt;/a&gt; allows users to express preferences for how traffic should be routed to Service endpoints.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://kep.k8s.io/3015&#34;&gt;KEP-3015&lt;/a&gt; deprecates &lt;code&gt;PreferClose&lt;/code&gt; and introduces two additional values: &lt;code&gt;PreferSameZone&lt;/code&gt; and &lt;code&gt;PreferSameNode&lt;/code&gt;.
&lt;code&gt;PreferSameZone&lt;/code&gt; is equivalent to the current &lt;code&gt;PreferClose&lt;/code&gt;.
&lt;code&gt;PreferSameNode&lt;/code&gt; prioritizes sending traffic to endpoints on the same node as the client.&lt;/p&gt;
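&lt;p&gt;For example, a Service that prefers node-local endpoints would set the field like this (the Service name, selector, and ports are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
  trafficDistribution: PreferSameNode
&lt;/code&gt;&lt;/pre&gt;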
&lt;p&gt;This feature was introduced in v1.33 behind the &lt;code&gt;PreferSameTrafficDistribution&lt;/code&gt; feature gate.
It is targeting graduation to beta in v1.34 with its feature gate enabled by default.&lt;/p&gt;
&lt;h3 id=&#34;support-for-kyaml-a-kubernetes-dialect-of-yaml&#34;&gt;Support for KYAML: a Kubernetes dialect of YAML&lt;/h3&gt;
&lt;p&gt;KYAML aims to be a safer and less ambiguous YAML subset, and was designed specifically
for Kubernetes. Whatever version of Kubernetes you use, you&#39;ll be able to use KYAML for writing manifests
and/or Helm charts.
You can write KYAML and pass it as an input to &lt;strong&gt;any&lt;/strong&gt; version of &lt;code&gt;kubectl&lt;/code&gt;,
because all KYAML files are also valid as YAML.
With kubectl v1.34, we expect you&#39;ll also be able to request KYAML output from &lt;code&gt;kubectl&lt;/code&gt; (as in &lt;code&gt;kubectl get -o kyaml …&lt;/code&gt;).
If you prefer, you can still request the output in JSON or YAML format.&lt;/p&gt;
&lt;p&gt;KYAML addresses specific challenges with both YAML and JSON.
YAML&#39;s significant whitespace requires careful attention to indentation and nesting,
while its optional string-quoting can lead to unexpected type coercion (for example: &lt;a href=&#34;https://hitchdev.com/strictyaml/why/implicit-typing-removed/&#34;&gt;&amp;quot;The Norway Bug&amp;quot;&lt;/a&gt;).
Meanwhile, JSON lacks comment support and has strict requirements for trailing commas and quoted keys.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://kep.k8s.io/5295&#34;&gt;KEP-5295&lt;/a&gt; introduces KYAML, which tries to address the most significant problems by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Always double-quoting value strings&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leaving keys unquoted unless they are potentially ambiguous&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Always using &lt;code&gt;{}&lt;/code&gt; for mappings (associative arrays)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Always using &lt;code&gt;[]&lt;/code&gt; for lists&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This might sound a lot like JSON, because it is! But unlike JSON, KYAML supports comments, allows trailing commas, and doesn&#39;t require quoted keys.&lt;/p&gt;
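&lt;p&gt;Putting those rules together, a small KYAML document might look like this (the exact formatting that &lt;code&gt;kubectl&lt;/code&gt; will emit may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  apiVersion: &amp;quot;v1&amp;quot;,
  kind: &amp;quot;ConfigMap&amp;quot;,
  metadata: {
    name: &amp;quot;example-config&amp;quot;,
  },
  data: {
    # Comments are allowed, and so are trailing commas.
    # This key is quoted because the &amp;quot;.&amp;quot; makes it potentially ambiguous.
    &amp;quot;app.mode&amp;quot;: &amp;quot;production&amp;quot;,
  },
}
&lt;/code&gt;&lt;/pre&gt;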
&lt;p&gt;We&#39;re hoping to see KYAML introduced as a new output format for &lt;code&gt;kubectl&lt;/code&gt; v1.34.
As with all these features, none of these changes are 100% confirmed; watch this space!&lt;/p&gt;
&lt;p&gt;As a format, KYAML is and will remain a &lt;strong&gt;strict subset of YAML&lt;/strong&gt;, ensuring that any compliant YAML parser can parse KYAML documents.
Kubernetes does not require you to provide input specifically formatted as KYAML, and we have no plans to change that.&lt;/p&gt;
&lt;h3 id=&#34;fine-grained-autoscaling-control-with-hpa-configurable-tolerance&#34;&gt;Fine-grained autoscaling control with HPA configurable tolerance&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://kep.k8s.io/4951&#34;&gt;KEP-4951&lt;/a&gt; introduces a new feature that allows users to configure autoscaling tolerance on a per-HPA basis,
overriding the default cluster-wide 10% tolerance setting that often proves too coarse-grained for diverse workloads.
The enhancement adds an optional &lt;code&gt;tolerance&lt;/code&gt; field to the HPA&#39;s &lt;code&gt;spec.behavior.scaleUp&lt;/code&gt; and &lt;code&gt;spec.behavior.scaleDown&lt;/code&gt; sections,
enabling different tolerance values for scale-up and scale-down operations,
which is particularly valuable since scale-up responsiveness is typically more critical than scale-down speed for handling traffic surges.&lt;/p&gt;
&lt;p&gt;Released as alpha in Kubernetes v1.33 behind the &lt;code&gt;HPAConfigurableTolerance&lt;/code&gt; feature gate, this feature is expected to graduate to beta in v1.34.
This improvement helps to address scaling challenges with large deployments, where for scaling in,
a 10% tolerance might mean leaving hundreds of unnecessary Pods running.
Using the new, more flexible approach would enable workload-specific optimization for both
responsive and conservative scaling behaviors.&lt;/p&gt;
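&lt;p&gt;As a sketch, the per-direction tolerances described above would be expressed like this (the target, replica counts, and tolerance values are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 100
  behavior:
    scaleUp:
      tolerance: 0.05  # react to a 5% metric deviation when scaling up
    scaleDown:
      tolerance: 0.2   # be more conservative when scaling down
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
&lt;/code&gt;&lt;/pre&gt;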
&lt;h2 id=&#34;want-to-know-more&#34;&gt;Want to know more?&lt;/h2&gt;
&lt;p&gt;New features and deprecations are also announced in the Kubernetes release notes.
We will formally announce what&#39;s new in &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.34.md&#34;&gt;Kubernetes v1.34&lt;/a&gt; as part of the CHANGELOG for that release.&lt;/p&gt;
&lt;p&gt;The Kubernetes v1.34 release is planned for &lt;strong&gt;Wednesday 27th August 2025&lt;/strong&gt;. Stay tuned for updates!&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The simplest way to get involved with Kubernetes is to join one of the many &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;Special Interest Groups&lt;/a&gt; (SIGs) that align with your interests.
Have something you&#39;d like to broadcast to the Kubernetes community? Share your voice at our weekly &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/communication&#34;&gt;community meeting&lt;/a&gt;, and through the channels below.
Thank you for your continued feedback and support.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Follow us on Bluesky &lt;a href=&#34;https://bsky.app/profile/kubernetes.io&#34;&gt;@kubernetes.io&lt;/a&gt; for the latest updates&lt;/li&gt;
&lt;li&gt;Join the community discussion on &lt;a href=&#34;https://discuss.kubernetes.io/&#34;&gt;Discuss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join the community on &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Post questions (or answer questions) on &lt;a href=&#34;https://serverfault.com/questions/tagged/kubernetes&#34;&gt;Server Fault&lt;/a&gt; or &lt;a href=&#34;http://stackoverflow.com/questions/tagged/kubernetes&#34;&gt;Stack Overflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Share your Kubernetes &lt;a href=&#34;https://docs.google.com/a/linuxfoundation.org/forms/d/e/1FAIpQLScuI7Ye3VQHQTwBASrgkjQDSS5TP0g3AXfFhwSM9YpHgxRKFA/viewform&#34;&gt;story&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read more about what&#39;s happening with Kubernetes on the &lt;a href=&#34;https://kubernetes.io/blog/&#34;&gt;blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn more about the &lt;a href=&#34;https://github.com/kubernetes/sig-release/tree/master/release-team&#34;&gt;Kubernetes Release Team&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Post-Quantum Cryptography in Kubernetes</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/18/pqc-in-k8s/</link>
      <pubDate>Fri, 18 Jul 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/18/pqc-in-k8s/</guid>
      <description>
        
        
        &lt;p&gt;The world of cryptography is on the cusp of a major shift with the advent of
quantum computing. While powerful quantum computers are still largely
theoretical for many applications, their potential to break current
cryptographic standards is a serious concern, especially for long-lived
systems. This is where &lt;em&gt;Post-Quantum Cryptography&lt;/em&gt; (PQC) comes in. In this
article, I&#39;ll dive into what PQC means for TLS and, more specifically, for the
Kubernetes ecosystem. I&#39;ll explain what the (surprising) state of PQC in
Kubernetes ecosystem. I&#39;ll explain what the (suprising) state of PQC in
Kubernetes is and what the implications are for current and future clusters.&lt;/p&gt;
&lt;h2 id=&#34;what-is-post-quantum-cryptography&#34;&gt;What is Post-Quantum Cryptography&lt;/h2&gt;
&lt;p&gt;Post-Quantum Cryptography refers to cryptographic algorithms that are thought to
be secure against attacks by both classical and quantum computers. The primary
concern is that quantum computers, using algorithms like &lt;a href=&#34;https://en.wikipedia.org/wiki/Shor%27s_algorithm&#34;&gt;Shor&#39;s Algorithm&lt;/a&gt;,
could efficiently break widely used public-key cryptosystems such as RSA and
Elliptic Curve Cryptography (ECC), which underpin much of today&#39;s secure
communication, including TLS. The industry is actively working on standardizing
and adopting PQC algorithms. One of the first to be standardized by &lt;a href=&#34;https://www.nist.gov/&#34;&gt;NIST&lt;/a&gt; is
the Module-Lattice Key Encapsulation Mechanism (&lt;code&gt;ML-KEM&lt;/code&gt;), formerly known as
Kyber, and now standardized as &lt;a href=&#34;https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf&#34;&gt;FIPS-203&lt;/a&gt; (PDF download).&lt;/p&gt;
&lt;p&gt;It is difficult to predict when quantum computers will be able to break
classical algorithms. However, it is clear that we need to start migrating to
PQC algorithms now, as the next section shows. To get a feeling for the
predicted timeline we can look at a &lt;a href=&#34;https://nvlpubs.nist.gov/nistpubs/ir/2024/NIST.IR.8547.ipd.pdf&#34;&gt;NIST report&lt;/a&gt; covering the transition to
post-quantum cryptography standards. It declares that systems with classical
crypto should be deprecated after 2030 and disallowed after 2035.&lt;/p&gt;
&lt;h2 id=&#34;timelines&#34;&gt;Key exchange vs. digital signatures: different needs, different timelines&lt;/h2&gt;
&lt;p&gt;In TLS, there are two main cryptographic operations we need to secure:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key Exchange&lt;/strong&gt;: This is how the client and server agree on a shared secret to
encrypt their communication. If an attacker records encrypted traffic today,
they could decrypt it in the future, if they gain access to a quantum computer
capable of breaking the key exchange. This makes migrating KEMs to PQC an
immediate priority.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Digital Signatures&lt;/strong&gt;: These are primarily used to authenticate the server (and
sometimes the client) via certificates. The authenticity of a server is
verified at the time of connection. While important, the risk of an attack
today is much lower, because the decision of trusting a server cannot be abused
after the fact. Additionally, current PQC signature schemes often come with
significant computational overhead and larger key/signature sizes compared to
their classical counterparts.&lt;/p&gt;
&lt;p&gt;Another significant hurdle in the migration to PQ certificates is the upgrade
of root certificates. These certificates have long validity periods and are
installed in many devices and operating systems as trust anchors.&lt;/p&gt;
&lt;p&gt;Given these differences, the focus for immediate PQC adoption in TLS has been
on hybrid key exchange mechanisms. These combine a classical algorithm (such as
Elliptic Curve Diffie-Hellman Ephemeral (ECDHE)) with a PQC algorithm (such as
&lt;code&gt;ML-KEM&lt;/code&gt;). The resulting shared secret is secure as long as at least one of the
component algorithms remains unbroken. The &lt;code&gt;X25519MLKEM768&lt;/code&gt; hybrid scheme is the
most widely supported one.&lt;/p&gt;
&lt;h2 id=&#34;state-of-kems&#34;&gt;State of PQC key exchange mechanisms (KEMs) today&lt;/h2&gt;
&lt;p&gt;Support for PQC KEMs is rapidly improving across the ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Go&lt;/strong&gt;: The Go standard library&#39;s &lt;code&gt;crypto/tls&lt;/code&gt; package introduced support for
&lt;code&gt;X25519MLKEM768&lt;/code&gt; in version 1.24 (released February 2025). Crucially, it&#39;s
enabled by default when there is no explicit configuration, i.e.,
&lt;code&gt;Config.CurvePreferences&lt;/code&gt; is &lt;code&gt;nil&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Browsers &amp;amp; OpenSSL&lt;/strong&gt;: Major browsers like Chrome (version 131, November 2024)
and Firefox (version 135, February 2025), as well as OpenSSL (version 3.5.0,
April 2025), have also added support for the &lt;code&gt;ML-KEM&lt;/code&gt; based hybrid scheme.&lt;/p&gt;
&lt;p&gt;Apple is also &lt;a href=&#34;https://support.apple.com/en-lb/122756&#34;&gt;rolling out support&lt;/a&gt; for &lt;code&gt;X25519MLKEM768&lt;/code&gt; in version
26 of their operating systems. Given the proliferation of Apple devices, this
will have a significant impact on the global PQC adoption.&lt;/p&gt;
&lt;p&gt;For a more detailed overview of the state of PQC in the wider industry,
see &lt;a href=&#34;https://blog.cloudflare.com/pq-2024/&#34;&gt;this blog post by Cloudflare&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;post-quantum-kems-in-kubernetes-an-unexpected-arrival&#34;&gt;Post-quantum KEMs in Kubernetes: an unexpected arrival&lt;/h2&gt;
&lt;p&gt;So, what does this mean for Kubernetes? Kubernetes components, including the
API server and kubelet, are built with Go.&lt;/p&gt;
&lt;p&gt;As of Kubernetes v1.33, released in April 2025, the project uses Go 1.24. A
quick check of the Kubernetes codebase reveals that &lt;code&gt;Config.CurvePreferences&lt;/code&gt;
is not explicitly set. This leads to a fascinating conclusion: Kubernetes
v1.33, by virtue of using Go 1.24, supports hybrid post-quantum
&lt;code&gt;X25519MLKEM768&lt;/code&gt; for TLS connections by default!&lt;/p&gt;
&lt;p&gt;You can test this yourself. If you set up a Minikube cluster running Kubernetes
v1.33.0, you can connect to the API server using a recent OpenSSL client:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; minikube start --kubernetes-version&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;v1.33.0
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; kubectl cluster-info
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;Kubernetes control plane is running at https://127.0.0.1:&amp;lt;PORT&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; kubectl config view --minify --raw -o &lt;span style=&#34;color:#b8860b&#34;&gt;jsonpath&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;{&lt;/span&gt;.clusters&lt;span style=&#34;color:#666&#34;&gt;[&lt;/span&gt;0&lt;span style=&#34;color:#666&#34;&gt;]&lt;/span&gt;.cluster.certificate-authority-data&lt;span style=&#34;color:#666&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\&amp;#39;&lt;/span&gt; | base64 -d &amp;gt; ca.crt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; openssl version
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;OpenSSL 3.5.0 8 Apr 2025 (Library: OpenSSL 3.5.0 8 Apr 2025)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; &lt;span style=&#34;color:#a2f&#34;&gt;echo&lt;/span&gt; -n &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Q&amp;#34;&lt;/span&gt; | openssl s_client -connect 127.0.0.1:&amp;lt;PORT&amp;gt; -CAfile ca.crt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;[...]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;Negotiated TLS1.3 group: X25519MLKEM768
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;[...]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;DONE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Lo and behold, the negotiated group is &lt;code&gt;X25519MLKEM768&lt;/code&gt;! This is a significant
step towards making Kubernetes quantum-safe, seemingly without a major
announcement or dedicated KEP (Kubernetes Enhancement Proposal).&lt;/p&gt;
&lt;h2 id=&#34;the-go-version-mismatch-pitfall&#34;&gt;The Go version mismatch pitfall&lt;/h2&gt;
&lt;p&gt;An interesting wrinkle emerged with Go versions 1.23 and 1.24. Go 1.23
included experimental support for a draft version of &lt;code&gt;ML-KEM&lt;/code&gt;, identified as
&lt;code&gt;X25519Kyber768Draft00&lt;/code&gt;. This was also enabled by default if
&lt;code&gt;Config.CurvePreferences&lt;/code&gt; was &lt;code&gt;nil&lt;/code&gt;. Kubernetes v1.32 used Go 1.23. However,
Go 1.24 removed the draft support and replaced it with the standardized version
&lt;code&gt;X25519MLKEM768&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;What happens if a client and server are using mismatched Go versions (one on
1.23, the other on 1.24)? They won&#39;t have a common PQC KEM to negotiate, and
the handshake will fall back to classical ECC curves (e.g., &lt;code&gt;X25519&lt;/code&gt;). How
could this happen in practice?&lt;/p&gt;
&lt;p&gt;Consider a scenario:&lt;/p&gt;
&lt;p&gt;A Kubernetes cluster is running v1.32 (using Go 1.23 and thus
&lt;code&gt;X25519Kyber768Draft00&lt;/code&gt;). A developer upgrades their &lt;code&gt;kubectl&lt;/code&gt; to v1.33,
compiled with Go 1.24, which only supports &lt;code&gt;X25519MLKEM768&lt;/code&gt;. Now, when &lt;code&gt;kubectl&lt;/code&gt;
communicates with the v1.32 API server, they no longer share a common PQC
algorithm. The connection will downgrade to classical cryptography, silently
losing the PQC protection that has been in place. This highlights the
importance of understanding the implications of Go version upgrades, and the
details of the TLS stack.&lt;/p&gt;
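&lt;p&gt;The downgrade can be modeled with a simplified negotiation function. This is not Go&#39;s actual selection code, and the group lists are only meant to mirror the defaults of the two Go releases discussed above: the intersection of the two PQC offerings is empty, so the handshake quietly lands on classical &lt;code&gt;X25519&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// pickGroup mimics TLS group negotiation: the first group supported by the
// server that the client also offers wins, or "" if there is no overlap.
func pickGroup(clientOffers, serverSupports []string) string {
	offered := map[string]bool{}
	for _, g := range clientOffers {
		offered[g] = true
	}
	for _, g := range serverSupports {
		if offered[g] {
			return g
		}
	}
	return ""
}

func main() {
	// Illustrative default group lists for the two Go releases.
	go123 := []string{"X25519Kyber768Draft00", "X25519", "P-256"} // e.g. a v1.32 API server
	go124 := []string{"X25519MLKEM768", "X25519", "P-256"}        // e.g. a v1.33 kubectl

	fmt.Println(pickGroup(go124, go124)) // X25519MLKEM768: both sides PQC-capable
	fmt.Println(pickGroup(go124, go123)) // X25519: silent fallback to classical crypto
}
```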
&lt;h2 id=&#34;limitation-packet-size&#34;&gt;Limitations: packet size&lt;/h2&gt;
&lt;p&gt;One practical consideration with &lt;code&gt;ML-KEM&lt;/code&gt; is the size of its public keys:
the encoded key is around 1.2 kilobytes for &lt;code&gt;ML-KEM-768&lt;/code&gt;.
This can cause the initial TLS &lt;code&gt;ClientHello&lt;/code&gt; message not to fit inside
a single TCP/IP packet, given the typical networking constraints
(most commonly, the standard Ethernet frame size limit of 1500
bytes). Some TLS libraries or network appliances might not handle this
gracefully, assuming the &lt;code&gt;ClientHello&lt;/code&gt; always fits in one packet. This issue
has been observed in some Kubernetes-related projects and networking
components, potentially leading to connection failures when PQC KEMs are used.
More details can be found at &lt;a href=&#34;https://tldr.fail/&#34;&gt;tldr.fail&lt;/a&gt;.&lt;/p&gt;
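&lt;p&gt;Some back-of-the-envelope arithmetic shows why this happens. The key sizes below come from the &lt;code&gt;ML-KEM-768&lt;/code&gt; specification; the overhead figure for the rest of the &lt;code&gt;ClientHello&lt;/code&gt; is an assumption for illustration, as the real size depends on the offered cipher suites and extensions.&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	const (
		mlkemKeyShare  = 1184 // encoded ML-KEM-768 encapsulation key, bytes
		x25519KeyShare = 32   // classical component of X25519MLKEM768
		helloOverhead  = 300  // cipher suites, extensions, etc. (assumed)

		mtu       = 1500 // standard Ethernet payload size
		ipTCPHdrs = 40   // IPv4 + TCP headers without options
	)

	clientHello := mlkemKeyShare + x25519KeyShare + helloOverhead
	overflow := clientHello > mtu-ipTCPHdrs
	fmt.Printf("ClientHello is roughly %d bytes; needs more than one packet: %v\n",
		clientHello, overflow)
}
```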
&lt;h2 id=&#34;state-of-post-quantum-signatures&#34;&gt;State of Post-Quantum Signatures&lt;/h2&gt;
&lt;p&gt;While KEMs are seeing broader adoption, PQC digital signatures are further
behind in terms of widespread integration into standard toolchains. NIST has
published standards for PQC signatures, such as &lt;code&gt;ML-DSA&lt;/code&gt; (&lt;code&gt;FIPS-204&lt;/code&gt;) and
&lt;code&gt;SLH-DSA&lt;/code&gt; (&lt;code&gt;FIPS-205&lt;/code&gt;). However, implementing these in a way that&#39;s broadly
usable (e.g., for PQC Certificate Authorities) &lt;a href=&#34;https://blog.cloudflare.com/another-look-at-pq-signatures/#the-algorithms&#34;&gt;presents challenges&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Larger Keys and Signatures&lt;/strong&gt;: PQC signature schemes often have significantly
larger public keys and signature sizes compared to classical algorithms like
Ed25519 or RSA. For instance, Dilithium2 keys can be 30 times larger than
Ed25519 keys, and certificates can be 12 times larger.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;: Signing and verification operations &lt;a href=&#34;https://pqshield.github.io/nist-sigs-zoo/&#34;&gt;can be substantially slower&lt;/a&gt;.
While some algorithms are on par with classical algorithms, others may have a
much higher overhead, sometimes on the order of 10x to 1000x worse performance.
To improve this situation, NIST is running a
&lt;a href=&#34;https://csrc.nist.gov/news/2024/pqc-digital-signature-second-round-announcement&#34;&gt;second round of standardization&lt;/a&gt; for PQC signatures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Toolchain Support&lt;/strong&gt;: Mainstream TLS libraries and CA software do not yet have
mature, built-in support for these new signature algorithms. The Go team, for
example, has indicated that &lt;code&gt;ML-DSA&lt;/code&gt; support is a high priority, but the
soonest it might appear in the standard library is Go 1.26 &lt;a href=&#34;https://github.com/golang/go/issues/64537#issuecomment-2877714729&#34;&gt;(as of May 2025)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/cloudflare/circl&#34;&gt;Cloudflare&#39;s CIRCL&lt;/a&gt; (Cloudflare Interoperable Reusable Cryptographic Library)
library implements some PQC signature schemes like variants of Dilithium, and
they maintain a &lt;a href=&#34;https://github.com/cloudflare/go&#34;&gt;fork of Go (cfgo)&lt;/a&gt; that integrates CIRCL. Using &lt;code&gt;cfgo&lt;/code&gt;, it&#39;s
possible to experiment with generating certificates signed with PQC algorithms
like Ed25519-Dilithium2. However, this requires using a custom Go toolchain and
is not yet part of the mainstream Kubernetes or Go distributions.&lt;/p&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The journey to a post-quantum secure Kubernetes is underway, and perhaps
further along than many realize, thanks to the proactive adoption of &lt;code&gt;ML-KEM&lt;/code&gt;
in Go. With Kubernetes v1.33, users are already benefiting from hybrid post-quantum key
exchange in many TLS connections by default.&lt;/p&gt;
&lt;p&gt;However, awareness of potential pitfalls, such as Go version mismatches leading
to downgrades and issues with Client Hello packet sizes, is crucial. While PQC
for KEMs is becoming a reality, PQC for digital signatures and certificate
hierarchies is still in earlier stages of development and adoption for
mainstream use. As Kubernetes maintainers and contributors, staying informed
about these developments will be key to ensuring the long-term security of the
platform.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Navigating Failures in Pods With Devices</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/</link>
      <pubDate>Thu, 03 Jul 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes is the de facto standard for container orchestration, but when it
comes to handling specialized hardware like GPUs and other accelerators, things
get a bit complicated. This blog post dives into the challenges of managing
failure modes when operating pods with devices in Kubernetes, based on insights
from &lt;a href=&#34;https://sched.co/1i7pT&#34;&gt;Sergey Kanzhelev and Mrunal Patel&#39;s talk at KubeCon NA
2024&lt;/a&gt;. You can follow the links to
&lt;a href=&#34;https://static.sched.com/hosted_files/kccncna2024/b9/KubeCon%20NA%202024_%20Navigating%20Failures%20in%20Pods%20With%20Devices_%20Challenges%20and%20Solutions.pptx.pdf?_gl=1*191m4j5*_gcl_au*MTU1MDM0MTM1My4xNzMwOTE4ODY5LjIxNDI4Nzk1NDIuMTczMTY0ODgyMC4xNzMxNjQ4ODIy*FPAU*MTU1MDM0MTM1My4xNzMwOTE4ODY5&#34;&gt;slides&lt;/a&gt;
and
&lt;a href=&#34;https://www.youtube.com/watch?v=-YCnOYTtVO8&amp;list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&amp;index=150&#34;&gt;recording&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;the-ai-ml-boom-and-its-impact-on-kubernetes&#34;&gt;The AI/ML boom and its impact on Kubernetes&lt;/h2&gt;
&lt;p&gt;The rise of AI/ML workloads has brought new challenges to Kubernetes. These
workloads often rely heavily on specialized hardware, and any device failure can
significantly impact performance and lead to frustrating interruptions. As
highlighted in the 2024 &lt;a href=&#34;https://ai.meta.com/research/publications/the-llama-3-herd-of-models/&#34;&gt;Llama
paper&lt;/a&gt;,
hardware issues, particularly GPU failures, are a major cause of disruption in
AI/ML training. You can also learn how much effort NVIDIA spends on handling
devices failures and maintenance in the KubeCon talk by &lt;a href=&#34;https://kccncna2024.sched.com/event/1i7kJ/all-your-gpus-are-belong-to-us-an-inside-look-at-nvidias-self-healing-geforce-now-infrastructure-ryan-hallisey-piotr-prokop-pl-nvidia&#34;&gt;Ryan Hallisey and Piotr
Prokop All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA&#39;s Self-Healing
GeForce NOW
Infrastructure&lt;/a&gt;
(&lt;a href=&#34;https://www.youtube.com/watch?v=iLnHtKwmu2I&#34;&gt;recording&lt;/a&gt;) as they see 19
remediation requests per 1000 nodes a day!
We also see data centers offering spot consumption models and overcommit on
power, making device failures commonplace and a part of the business model.&lt;/p&gt;
&lt;p&gt;However, Kubernetes’s view on resources is still very static. The resource is
either there or not. And if it is there, the assumption is that it will stay
there fully functional - Kubernetes lacks good support for handling full or partial
hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead
to a variety of failure modes, which we discuss here.&lt;/p&gt;
&lt;h3 id=&#34;understanding-ai-ml-workloads&#34;&gt;Understanding AI/ML workloads&lt;/h3&gt;
&lt;p&gt;Generally, all AI/ML workloads require specialized hardware, have challenging
scheduling requirements, and are expensive when idle. AI/ML workloads typically
fall into two categories - training and inference. Here is an oversimplified
view of those categories’ characteristics, which are different from traditional workloads
like web services:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Training&lt;/dt&gt;
&lt;dd&gt;These workloads are resource-intensive, often consuming entire
machines and running as gangs of pods. Training jobs are usually &amp;quot;run to
completion&amp;quot; - but that could be days, weeks or even months. Any failure in a
single pod can necessitate restarting the entire step across all the pods.&lt;/dd&gt;
&lt;dt&gt;Inference&lt;/dt&gt;
&lt;dd&gt;These workloads are usually long-running or run indefinitely,
and can be small enough to consume a subset of a Node’s devices or large enough to span
multiple nodes. They often require downloading huge files with the model
weights.&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;These workload types specifically break many past assumptions:&lt;/p&gt;


 





&lt;table&gt;&lt;caption style=&#34;display: none;&#34;&gt;Workload assumptions before and now&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left&#34;&gt;Before&lt;/th&gt;
&lt;th style=&#34;text-align:left&#34;&gt;Now&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Can get a better CPU and the app will work faster.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Require a &lt;strong&gt;specific&lt;/strong&gt; device (or &lt;strong&gt;class of devices&lt;/strong&gt;) to run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;When something doesn’t work, just recreate it.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Allocation or reallocation is expensive.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Any node will work. No need to coordinate between Pods.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Scheduled in a special way - devices often connected in a cross-node topology.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Each Pod can be plug-and-play replaced if failed.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Container images are slim and easily available.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Container images may be so big that they require special handling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Long initialization can be offset by slow rollout.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Initialization may be long and should be optimized, sometimes across many Pods together.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable.&lt;/td&gt;
&lt;td style=&#34;text-align:left&#34;&gt;Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The existing failure model relies on these old assumptions. It may still work for
the new workload types, but it has limited knowledge about devices and is very
expensive for them - in some cases, prohibitively so. You will see
more examples later in this article.&lt;/p&gt;
&lt;h3 id=&#34;why-kubernetes-still-reigns-supreme&#34;&gt;Why Kubernetes still reigns supreme&lt;/h3&gt;
&lt;p&gt;This article does not go deeper into the question of why not to start fresh for
AI/ML workloads, given how different they are from traditional Kubernetes
workloads. Despite many challenges, Kubernetes remains the platform of choice
for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it
a compelling option. While alternatives exist, they often lack the years of
development and refinement that Kubernetes offers. And the Kubernetes developers
are actively addressing the gaps identified in this article and beyond.&lt;/p&gt;
&lt;h2 id=&#34;the-current-state-of-device-failure-handling&#34;&gt;The current state of device failure handling&lt;/h2&gt;
&lt;p&gt;This section outlines different failure modes and the best practices and DIY
(Do-It-Yourself) solutions used today. The next section describes a roadmap
for improving things for those failure modes.&lt;/p&gt;
&lt;h3 id=&#34;failure-modes-k8s-infrastructure&#34;&gt;Failure modes: K8s infrastructure&lt;/h3&gt;
&lt;p&gt;In order to understand the failures related to the Kubernetes infrastructure,
you need to understand how many moving parts are involved in scheduling a Pod on
the node. The sequence of events when a Pod is scheduled on a Node is as
follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Device plugin&lt;/em&gt; is scheduled on the Node&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Device plugin&lt;/em&gt; is registered with the &lt;em&gt;kubelet&lt;/em&gt; via local gRPC&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Kubelet&lt;/em&gt; uses &lt;em&gt;device plugin&lt;/em&gt; to watch for devices and updates capacity of
the node&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Scheduler&lt;/em&gt; places a &lt;em&gt;user Pod&lt;/em&gt; on a Node based on the updated capacity&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Kubelet&lt;/em&gt; asks &lt;em&gt;Device plugin&lt;/em&gt; to &lt;strong&gt;Allocate&lt;/strong&gt; devices for a &lt;em&gt;User Pod&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Kubelet&lt;/em&gt; creates a &lt;em&gt;User Pod&lt;/em&gt; with the allocated devices attached to it&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This diagram shows some of those actors involved:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/k8s-infra-devices.svg&#34;
         alt=&#34;The diagram shows relationships between the kubelet, Device plugin, and a user Pod. It shows that kubelet connects to the Device plugin named my-device, kubelet reports the node status with the my-device availability, and the user Pod requesting the 2 of my-device.&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;As there are so many actors interconnected, every one of them and every
connection may experience interruptions. This leads to many exceptional
situations that are often considered failures, and may cause serious workload
interruptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pods failing admission at various stages of their lifecycle&lt;/li&gt;
&lt;li&gt;Pods unable to run on perfectly fine hardware&lt;/li&gt;
&lt;li&gt;Scheduling taking an unexpectedly long time&lt;/li&gt;
&lt;/ul&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/k8s-infra-failures.svg&#34;
         alt=&#34;The same diagram as the one above, overlaid with orange bang drawings over individual components, with text indicating what can break in that component. Over the kubelet the text reads: &amp;#39;kubelet restart: loses all device info before re-Watch&amp;#39;. Over the Device plugin the text reads: &amp;#39;device plugin update, eviction, restart: kubelet cannot Allocate devices or loses all device state&amp;#39;. Over the user Pod the text reads: &amp;#39;slow pod termination: devices are unavailable&amp;#39;.&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;The goal for Kubernetes is to make the interaction between these components as
reliable as possible. The kubelet already implements retries, grace periods, and
other techniques to improve it. The roadmap section goes into details on other
edge cases that the Kubernetes project tracks. However, all these improvements
only work when these best practices are followed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Configure and restart the kubelet and the container runtime (such as containerd or CRI-O)
as quickly as possible so they do not interrupt the workload.&lt;/li&gt;
&lt;li&gt;Monitor device plugin health and carefully plan for upgrades.&lt;/li&gt;
&lt;li&gt;Do not overload the node with less-important workloads to prevent interruption
of device plugin and other components.&lt;/li&gt;
&lt;li&gt;Configure user pods tolerations to handle node readiness flakes.&lt;/li&gt;
&lt;li&gt;Configure and code graceful termination logic carefully to not block devices
for too long.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another class of Kubernetes infra-related issues is driver-related. With
traditional resources like CPU and memory, no compatibility checks between the
application and hardware were needed. With special devices like hardware
accelerators, there are new failure modes. Device drivers installed on the node:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Must match the hardware&lt;/li&gt;
&lt;li&gt;Must be compatible with the app&lt;/li&gt;
&lt;li&gt;Must work with other drivers (like &lt;a href=&#34;https://developer.nvidia.com/nccl&#34;&gt;nccl&lt;/a&gt;,
etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Best practices for handling driver versions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Monitor driver installer health&lt;/li&gt;
&lt;li&gt;Plan upgrades of infrastructure and Pods to match the version&lt;/li&gt;
&lt;li&gt;Have canary deployments whenever possible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Following the best practices in this section and using device plugins and device
driver installers from trusted and reliable sources generally eliminates this
class of failures. Kubernetes is tracking work to make this space even better.&lt;/p&gt;
&lt;h3 id=&#34;failure-modes-device-failed&#34;&gt;Failure modes: device failed&lt;/h3&gt;
&lt;p&gt;There is very little handling of device failure in Kubernetes today. Device
plugins report the device failure only by changing the count of allocatable
devices. And Kubernetes relies on standard mechanisms like liveness probes or
container failures to allow Pods to communicate the failure condition to the
kubelet. However, Kubernetes does not correlate device failures with container
crashes and does not offer any mitigation beyond restarting the container while
being attached to the same device.&lt;/p&gt;
&lt;p&gt;This is why many plugins and DIY solutions exist to handle device failures based
on various signals.&lt;/p&gt;
&lt;h4 id=&#34;health-controller&#34;&gt;Health controller&lt;/h4&gt;
&lt;p&gt;In many cases a failed device will leave a very expensive, unrecoverable
node doing nothing. A simple DIY solution is a &lt;em&gt;node health controller&lt;/em&gt;. The
controller could compare the device allocatable count with the capacity and if
the capacity is greater, it starts a timer. Once the timer reaches a threshold,
the health controller kills and recreates a node.&lt;/p&gt;
&lt;p&gt;There are problems with the &lt;em&gt;health controller&lt;/em&gt; approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Root cause of the device failure is typically not known&lt;/li&gt;
&lt;li&gt;The controller is not workload aware&lt;/li&gt;
&lt;li&gt;Failed device might not be in use and you want to keep other devices running&lt;/li&gt;
&lt;li&gt;The detection may be too slow as it is very generic&lt;/li&gt;
&lt;li&gt;The node may be part of a bigger set of nodes and simply cannot be deleted in
isolation from the other nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are variations of the health controller solving some of the problems
above. The overall theme here though is that to best handle failed devices, you
need customized handling for the specific workload. Kubernetes doesn’t yet offer
enough abstraction to express how critical the device is for a node, for the
cluster, and for the Pod it is assigned to.&lt;/p&gt;
&lt;h4 id=&#34;pod-failure-policy&#34;&gt;Pod failure policy&lt;/h4&gt;
&lt;p&gt;Another DIY approach for device failure handling is a per-pod reaction to a
failed device. This approach is applicable for &lt;em&gt;training&lt;/em&gt; workloads that are
implemented as Jobs.&lt;/p&gt;
&lt;p&gt;A Pod can define special exit codes for device failures. For example, whenever
unexpected device behavior is encountered, the Pod exits with a special exit code.
Then the Pod failure policy can handle the device failure in a special way. Read
more on &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#pod-failure-policy&#34;&gt;Handling retriable and non-retriable pod failures with Pod failure
policy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are some problems with the &lt;em&gt;Pod failure policy&lt;/em&gt; approach for Jobs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There is no well-known &lt;em&gt;device failed&lt;/em&gt; condition, so this approach does not work for the
generic Pod case&lt;/li&gt;
&lt;li&gt;Error codes must be coded carefully and in some cases are hard to guarantee.&lt;/li&gt;
&lt;li&gt;Only works with Jobs with &lt;code&gt;restartPolicy: Never&lt;/code&gt;, due to the limitation of a pod
failure policy feature.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, this solution has limited applicability.&lt;/p&gt;
&lt;h4 id=&#34;custom-pod-watcher&#34;&gt;Custom pod watcher&lt;/h4&gt;
&lt;p&gt;A slightly more generic approach is to implement a pod watcher as a DIY solution
or use a third-party tool offering this functionality. The pod watcher is
most often used to handle device failures for inference workloads.&lt;/p&gt;
&lt;p&gt;Since Kubernetes just keeps a pod assigned to a device, even if the device is
reportedly unhealthy, the idea is to detect this situation with the pod watcher
and apply some remediation. It often involves obtaining device health status and
its mapping to the Pod using Pod Resources API on the node. If a device fails,
it can delete the attached Pod as a remediation. The ReplicaSet will
then recreate the Pod on a healthy device.&lt;/p&gt;
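&lt;p&gt;The remediation step of such a watcher boils down to a small mapping exercise. The sketch below uses plain maps with hypothetical names in place of the protobuf messages the Pod Resources API actually returns: given the device-to-pod assignments and the set of devices currently reported unhealthy, it picks the pods to delete so their controller can recreate them elsewhere.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// podsToDelete returns the pods assigned to any unhealthy device, each pod
// listed once. assignments maps device ID to the pods using that device.
func podsToDelete(assignments map[string][]string, unhealthy map[string]bool) []string {
	seen := map[string]bool{}
	var victims []string
	for device, pods := range assignments {
		if !unhealthy[device] {
			continue
		}
		for _, pod := range pods {
			if !seen[pod] {
				seen[pod] = true
				victims = append(victims, pod)
			}
		}
	}
	sort.Strings(victims) // map iteration order is random; sort for stable output
	return victims
}

func main() {
	assignments := map[string][]string{
		"gpu-0": {"inference-a"},
		"gpu-1": {"inference-b", "inference-c"},
	}
	unhealthy := map[string]bool{"gpu-1": true}
	fmt.Println(podsToDelete(assignments, unhealthy)) // [inference-b inference-c]
}
```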
&lt;p&gt;The other reasons to implement this watcher:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Without it, the Pod will keep being assigned to the failed device forever.&lt;/li&gt;
&lt;li&gt;There is no &lt;em&gt;descheduling&lt;/em&gt; for a Pod with &lt;code&gt;restartPolicy: Always&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;There are no built-in controllers that delete Pods in &lt;code&gt;CrashLoopBackOff&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Problems with the &lt;em&gt;custom pod watcher&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The signal for the pod watcher is expensive to get, and involves some
privileged actions.&lt;/li&gt;
&lt;li&gt;It is a custom solution, and it has to assume how important the device is for the Pod.&lt;/li&gt;
&lt;li&gt;The pod watcher relies on external controllers to reschedule a Pod.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are more variations of DIY solutions for handling device failures or
upcoming maintenance. Overall, Kubernetes has enough extension points to
implement these solutions. However, some extension points require higher
privilege than users may be comfortable with or are too disruptive. The roadmap
section goes into more details on specific improvements in handling the device
failures.&lt;/p&gt;
&lt;h3 id=&#34;failure-modes-container-code-failed&#34;&gt;Failure modes: container code failed&lt;/h3&gt;
&lt;p&gt;When container code fails, or something bad happens to it, such as an out-of-memory
condition, Kubernetes knows how to handle it: the container is restarted, or,
if the Pod has &lt;code&gt;restartPolicy: Never&lt;/code&gt;, the Pod fails and is scheduled
on another node. Kubernetes has limited expressiveness on what
constitutes a failure (for example, a non-zero exit code or a liveness probe failure) and how
to react to such a failure (mostly either always restarting or immediately failing the
Pod).&lt;/p&gt;
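&lt;p&gt;Concretely, those built-in knobs boil down to a couple of fields; the image name and probe endpoint in this sketch are illustrative:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  restartPolicy: Always      # the only reactions: Always, OnFailure, Never
  containers:
  - name: main
    image: registry.example/worker:latest
    livenessProbe:           # a failing probe also counts as a failure
      httpGet:
        path: /healthz
        port: 8080
```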
&lt;p&gt;This level of expressiveness is often not enough for the complicated AI/ML
workloads. AI/ML pods are better rescheduled locally or even in-place as that
would save on image pulling time and device allocation. AI/ML pods are often
interconnected and need to be restarted together. This adds another level of
complexity and optimizing it often brings major savings in running AI/ML
workloads.&lt;/p&gt;
&lt;p&gt;There are various DIY solutions for orchestrating Pod failure handling. The most
typical one is to wrap the main executable in a container with some orchestrator,
which can then restart the main executable whenever the
job needs to be restarted because some other Pod has failed.&lt;/p&gt;
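&lt;p&gt;Stripped to its essence, such a wrapper is a restart loop around the main executable. A minimal sketch, with the coordination signal from the rest of the job (which real orchestrators react to) omitted:&lt;/p&gt;

```python
import subprocess

def run_with_restarts(cmd, max_restarts):
    """Run cmd, restarting it on failure up to max_restarts times.

    Returns the final exit code. A real in-container orchestrator would also
    listen for an external "restart now" signal from a job-level coordinator,
    so the whole job can be resynced without recreating Pods.
    """
    attempts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0 or attempts >= max_restarts:
            return code
        attempts += 1
```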
&lt;p&gt;Solutions like this are fragile and elaborate. They are often worth the
money saved compared to a regular JobSet delete/recreate cycle when used in
large training jobs. Making these solutions less fragile and more streamlined
by developing new hooks and extension points in Kubernetes will make them
easy to apply to smaller jobs, benefiting everybody.&lt;/p&gt;
&lt;h3 id=&#34;failure-modes-device-degradation&#34;&gt;Failure modes: device degradation&lt;/h3&gt;
&lt;p&gt;Not all device failures are terminal for the overall workload or batch job.
As the hardware stack gets more and more
complex, misconfiguration on one of the hardware stack layers, or driver
failures, may result in devices that are functional, but lagging on performance.
One device that is lagging behind can slow down the whole training job.&lt;/p&gt;
&lt;p&gt;We see reports of such cases more and more often. Kubernetes has no way to
express this type of failure today, and since it is the newest failure
mode, there is not much best practice offered by hardware vendors for
detection, nor third-party tooling for remediation of these situations.&lt;/p&gt;
&lt;p&gt;Typically, these failures are detected based on observed workload
characteristics, such as the expected speed of AI/ML training steps on
particular hardware. Remediation for these issues depends highly on the needs of the workload.&lt;/p&gt;
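&lt;p&gt;For illustration only, a detector for such degradation can be as simple as comparing each worker&#39;s average step time against the group median; the 1.5x tolerance in this sketch is an arbitrary assumption:&lt;/p&gt;

```python
import statistics

def find_stragglers(step_seconds, tolerance=1.5):
    """Flag workers whose mean step time exceeds tolerance times the group median.

    step_seconds: dict mapping a worker name to a list of observed step durations
    """
    means = {w: statistics.mean(t) for w, t in step_seconds.items()}
    baseline = statistics.median(means.values())
    # A worker lagging far behind the group likely has a degraded device.
    return sorted(w for w, m in means.items() if m > tolerance * baseline)
```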
&lt;h2 id=&#34;roadmap&#34;&gt;Roadmap&lt;/h2&gt;
&lt;p&gt;As outlined in the sections above, Kubernetes offers many extension points
which are used to implement various DIY solutions. The AI/ML space is
developing very fast, with changing requirements and usage patterns. SIG Node is
taking a measured approach: enabling more extension points to implement
workload-specific scenarios, rather than introducing new semantics to support
specific scenarios. This means prioritizing making information about failures
readily available over implementing automatic remediations for those failures
that might only be suitable for a subset of workloads.&lt;/p&gt;
&lt;p&gt;This approach ensures there are no drastic changes for workload handling which
may break existing, well-oiled DIY solutions or experiences with the existing
more traditional workloads.&lt;/p&gt;
&lt;p&gt;Many error handling techniques used today work for AI/ML, but are very
expensive. SIG Node will invest in extension points to make them cheaper, with
the understanding that cutting these costs for AI/ML workloads is critical.&lt;/p&gt;
&lt;p&gt;The following is the set of specific investments we envision for various failure
modes.&lt;/p&gt;
&lt;h3 id=&#34;roadmap-for-failure-modes-k8s-infrastructure&#34;&gt;Roadmap for failure modes: K8s infrastructure&lt;/h3&gt;
&lt;p&gt;The area of Kubernetes infrastructure is the easiest to understand, and very
important to get right for the upcoming transition from Device Plugins to DRA.
SIG Node is tracking many work items in this area, most notably the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/127460&#34;&gt;integrate kubelet with the systemd watchdog · Issue
#127460&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/128696&#34;&gt;DRA: detect stale DRA plugin sockets · Issue
#128696&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/127803&#34;&gt;Support takeover for devicemanager/device-plugin · Issue
#127803&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/127457&#34;&gt;Kubelet plugin registration reliability · Issue
#127457&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/128167&#34;&gt;Recreate the Device Manager gRPC server if failed · Issue
#128167&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/128043&#34;&gt;Retry pod admission on device plugin grpc failures · Issue
#128043&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Basically, every interaction between Kubernetes components must be made reliable,
via either kubelet improvements or best practices in plugin development
and deployment.&lt;/p&gt;
&lt;h3 id=&#34;roadmap-for-failure-modes-device-failed&#34;&gt;Roadmap for failure modes: device failed&lt;/h3&gt;
&lt;p&gt;For device failures, some patterns are already emerging in common scenarios
that Kubernetes can support. However, the very first step is to make information
about failed devices more easily available. This is the goal of the work in
&lt;a href=&#34;https://kep.k8s.io/4680&#34;&gt;KEP 4680&lt;/a&gt; (Add Resource Health Status to the Pod Status for
Device Plugin and DRA).&lt;/p&gt;
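&lt;p&gt;With that feature enabled, device health surfaces in the Pod status, roughly like the following sketch; the resource name and device ID are illustrative:&lt;/p&gt;

```yaml
status:
  containerStatuses:
  - name: main
    allocatedResourcesStatus:
    - name: nvidia.com/gpu       # illustrative device plugin resource
      resources:
      - resourceID: GPU-a1b2c3   # illustrative device ID
        health: Unhealthy        # Healthy, Unhealthy, or Unknown
```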
&lt;p&gt;Longer-term ideas that are yet to be tested include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Integrate device failures into Pod Failure Policy.&lt;/li&gt;
&lt;li&gt;Node-local retry policies, enabling pod failure policies for Pods with
&lt;code&gt;restartPolicy: OnFailure&lt;/code&gt;, and possibly beyond that.&lt;/li&gt;
&lt;li&gt;The ability to &lt;em&gt;deschedule&lt;/em&gt; a Pod, including one with &lt;code&gt;restartPolicy: Always&lt;/code&gt;, so it can
get a new device allocated.&lt;/li&gt;
&lt;li&gt;Add device health to the ResourceSlice used to represent devices in DRA,
rather than simply withdrawing an unhealthy device from the ResourceSlice.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;roadmap-for-failure-modes-container-code-failed&#34;&gt;Roadmap for failure modes: container code failed&lt;/h3&gt;
&lt;p&gt;The main improvements to handle container code failures for AI/ML workloads
all target cheaper error handling and recovery. The savings mostly
come from reusing pre-allocated resources as much as possible: from reusing
Pods by restarting containers in-place, to node-local restarts of containers
instead of rescheduling whenever possible, to snapshotting support, and to
rescheduling that prioritizes the same node to save on image pulls.&lt;/p&gt;
&lt;p&gt;Consider this scenario: a big training job needs 512 Pods to run, and one of the
Pods fails. This means that all Pods need to be interrupted and synced up to
restart the failed step. The most efficient way to achieve this is generally to
reuse as many Pods as possible by restarting them in-place, while replacing the
failed Pod to clear its error, as demonstrated in this picture:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/07/03/navigating-failures-in-pods-with-devices/inplace-pod-restarts.svg&#34;
         alt=&#34;The picture shows 512 Pods, most of them green with a recycle sign next to them indicating that they can be reused, one Pod drawn in red, and a new green replacement Pod next to it indicating that it needs to be replaced.&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;It is possible to implement this scenario today, but all solutions implementing it are
fragile due to the lack of certain extension points in Kubernetes. Adding the
extension points needed to implement this scenario is on the Kubernetes roadmap.&lt;/p&gt;
&lt;h3 id=&#34;roadmap-for-failure-modes-device-degradation&#34;&gt;Roadmap for failure modes: device degradation&lt;/h3&gt;
&lt;p&gt;Very little has been done in this area: there is no clear detection signal,
very limited troubleshooting tooling, and no built-in semantics to express a
&amp;quot;degraded&amp;quot; device in Kubernetes. There has been discussion of adding data on
device performance or degradation to the ResourceSlice used by DRA to represent
devices, but it is not yet clearly defined. There are also projects like
&lt;a href=&#34;https://github.com/medik8s/node-healthcheck-operator&#34;&gt;node-healthcheck-operator&lt;/a&gt;
that can be used for some scenarios.&lt;/p&gt;
&lt;p&gt;We expect developments in this area from hardware vendors and cloud providers, but mostly DIY
solutions in the near future. As more users are exposed to AI/ML workloads, this
is a space that needs feedback on the patterns used.&lt;/p&gt;
&lt;h2 id=&#34;join-the-conversation&#34;&gt;Join the conversation&lt;/h2&gt;
&lt;p&gt;The Kubernetes community encourages feedback and participation in shaping the
future of device failure handling. Join SIG Node and contribute to the ongoing
discussions!&lt;/p&gt;
&lt;p&gt;This blog post provides a high-level overview of the challenges and future
directions for device failure management in Kubernetes. By addressing these
issues, Kubernetes can solidify its position as the leading platform for AI/ML
workloads, ensuring resilience and reliability for applications that depend on
specialized hardware.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Image Compatibility In Cloud Native Environments</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/25/image-compatibility-in-cloud-native-environments/</link>
      <pubDate>Wed, 25 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/25/image-compatibility-in-cloud-native-environments/</guid>
      <description>
        
        
        &lt;p&gt;In industries where systems must run very reliably and meet strict performance criteria, such as telecommunications, high-performance computing, or AI, containerized applications often need a specific operating system configuration or specific hardware to be present.
It is common practice to require the use of specific versions of the kernel, its configuration, device drivers, or system components.
Despite the existence of the &lt;a href=&#34;https://opencontainers.org/&#34;&gt;Open Container Initiative (OCI)&lt;/a&gt;, a governing community to define standards and specifications for container images, there has been a gap in expression of such compatibility requirements.
The need to address this issue has led to different proposals and, ultimately, an implementation in Kubernetes&#39; &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html&#34;&gt;Node Feature Discovery (NFD)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html&#34;&gt;NFD&lt;/a&gt; is an open source Kubernetes project that automatically detects and reports &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/usage/customization-guide.html#available-features&#34;&gt;hardware and system features&lt;/a&gt; of cluster nodes. This information helps users to schedule workloads on nodes that meet specific system requirements, which is especially useful for applications with strict hardware or operating system dependencies.&lt;/p&gt;
&lt;h2 id=&#34;the-need-for-image-compatibility-specification&#34;&gt;The need for image compatibility specification&lt;/h2&gt;
&lt;h3 id=&#34;dependencies-between-containers-and-host-os&#34;&gt;Dependencies between containers and host OS&lt;/h3&gt;
&lt;p&gt;A container image is built on a base image, which provides a minimal runtime environment, often a stripped-down Linux userland, completely empty or distroless. When an application requires certain features from the host OS, compatibility issues arise. These dependencies can manifest in several ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Drivers&lt;/strong&gt;:
Host driver versions must match the supported range of a library version inside the container to avoid compatibility problems. Examples include GPUs and network drivers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Libraries or Software&lt;/strong&gt;:
The container must come with a specific version or range of versions for a library or software to run optimally in the environment. Examples from high performance computing are MPI, EFA, or Infiniband.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kernel Modules or Features&lt;/strong&gt;:
Specific kernel features or modules must be present. Examples include support for write-protected huge page faults, or the presence of VFIO.&lt;/li&gt;
&lt;li&gt;And more…&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While containers in Kubernetes are the most likely unit of abstraction for these needs, the definition of compatibility can extend further to include other container technologies such as Singularity and other OCI artifacts such as binaries from a spack binary cache.&lt;/p&gt;
&lt;h3 id=&#34;multi-cloud-and-hybrid-cloud-challenges&#34;&gt;Multi-cloud and hybrid cloud challenges&lt;/h3&gt;
&lt;p&gt;Containerized applications are deployed across various Kubernetes distributions and cloud providers, where different host operating systems introduce compatibility challenges.
Often those have to be pre-configured before workload deployment or are immutable.
For instance, different cloud providers will include different operating systems like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RHCOS/RHEL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Photon OS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Linux 2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container-Optimized OS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure Linux OS&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;And more...&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each OS comes with unique kernel versions, configurations, and drivers, making compatibility a non-trivial issue for applications requiring specific features.
It must be possible to quickly assess a container for its suitability to run on any specific environment.&lt;/p&gt;
&lt;h3 id=&#34;image-compatibility-initiative&#34;&gt;Image compatibility initiative&lt;/h3&gt;
&lt;p&gt;An effort was made within the &lt;a href=&#34;https://github.com/opencontainers/wg-image-compatibility&#34;&gt;Open Containers Initiative Image Compatibility&lt;/a&gt; working group to introduce a standard for image compatibility metadata.
A specification for compatibility would allow container authors to declare required host OS features, making compatibility requirements discoverable and programmable.
The specification implemented in Kubernetes Node Feature Discovery is one of the discussed proposals.
It aims to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Define a structured way to express compatibility in OCI image manifests.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support a compatibility specification alongside container images in image registries.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Allow automated validation of compatibility before scheduling containers.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The concept has since been implemented in the Kubernetes Node Feature Discovery project.&lt;/p&gt;
&lt;h3 id=&#34;implementation-in-node-feature-discovery&#34;&gt;Implementation in Node Feature Discovery&lt;/h3&gt;
&lt;p&gt;The solution integrates compatibility metadata into Kubernetes via NFD features and the &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/usage/custom-resources.html#nodefeaturegroup&#34;&gt;NodeFeatureGroup&lt;/a&gt; API.
This interface enables the user to match containers to nodes based on exposed hardware and software features, allowing for intelligent scheduling and workload optimization.&lt;/p&gt;
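&lt;p&gt;A NodeFeatureGroup selects nodes by the features NFD has discovered on them. A minimal sketch, in which the group name, rule name, and kernel module are illustrative:&lt;/p&gt;

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: image-compat-example
spec:
  featureGroupRules:
  - name: "vfio nodes"
    matchFeatures:
    - feature: kernel.loadedmodule
      matchExpressions:
        vfio-pci: {op: Exists}
```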
&lt;h3 id=&#34;compatibility-specification&#34;&gt;Compatibility specification&lt;/h3&gt;
&lt;p&gt;The compatibility specification is a structured list of compatibility objects containing &lt;em&gt;&lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/usage/custom-resources.html#nodefeaturegroup&#34;&gt;Node Feature Groups&lt;/a&gt;&lt;/em&gt;.
These objects define image requirements and facilitate validation against host nodes.
The feature requirements are described by using &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/usage/customization-guide.html#available-features&#34;&gt;the list of available features&lt;/a&gt; from the NFD project.
The schema has the following structure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;version&lt;/strong&gt; (string) - Specifies the API version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;compatibilities&lt;/strong&gt; (array of objects) - List of compatibility sets.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;rules&lt;/strong&gt; (object) - Specifies &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/usage/custom-resources.html#nodefeaturegroup&#34;&gt;NodeFeatureGroup&lt;/a&gt; to define image requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;weight&lt;/strong&gt; (int, optional) - Node affinity weight.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;tag&lt;/strong&gt; (string, optional) - Categorization tag.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;description&lt;/strong&gt; (string, optional) - Short description.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An example might look like the following:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;version&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;compatibilities&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;description&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;My image requirements&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;kernel and cpu&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchFeatures&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;feature&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kernel.loadedmodule&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;vfio-pci&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;op&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Exists}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;feature&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;cpu.model&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;vendor_id&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;op: In, value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Intel&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;AMD&amp;#34;&lt;/span&gt;]}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;one of available nics&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchAny&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchFeatures&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;feature&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pci.device&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;vendor&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;op: In, value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0eee&amp;#34;&lt;/span&gt;]}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;class&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;op: In, value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0200&amp;#34;&lt;/span&gt;]}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchFeatures&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;feature&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pci.device&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchExpressions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;vendor&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;op: In, value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0fff&amp;#34;&lt;/span&gt;]}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;class&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;op: In, value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0200&amp;#34;&lt;/span&gt;]}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;client-implementation-for-node-validation&#34;&gt;Client implementation for node validation&lt;/h3&gt;
&lt;p&gt;To streamline compatibility validation, we implemented a &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/reference/node-feature-client-reference.html&#34;&gt;client tool&lt;/a&gt; that allows for node validation based on an image&#39;s compatibility artifact.
In this workflow, the image author would generate a compatibility artifact that points to the image it describes in a registry via the referrers API.
When a need arises to assess the fit of an image to a host, the tool can discover the artifact and verify compatibility of an image to a node before deployment.
The client can validate nodes both inside and outside a Kubernetes cluster, extending the utility of the tool beyond the single Kubernetes use case.
In the future, image compatibility could play a crucial role in creating specific workload profiles based on image compatibility requirements, aiding in more efficient scheduling.
Additionally, it could potentially enable automatic node configuration to some extent, further optimizing resource allocation and ensuring seamless deployment of specialized workloads.&lt;/p&gt;
&lt;h3 id=&#34;examples-of-usage&#34;&gt;Examples of usage&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define image compatibility metadata&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/containers/images/&#34;&gt;container image&lt;/a&gt; can have metadata that describes
its requirements based on features discovered from nodes, like kernel modules or CPU models.
The previous compatibility specification example in this article exemplified this use case.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Attach the artifact to the image&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The image compatibility specification is stored as an OCI artifact.
You can attach this metadata to your container image using the &lt;a href=&#34;https://oras.land/&#34;&gt;oras&lt;/a&gt; tool.
The registry only needs to support OCI artifacts; support for arbitrary types is not required.
Keep in mind that the container image and the artifact must be stored in the same registry.
Use the following command to attach the artifact to the image:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;oras attach &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;&lt;/span&gt;--artifact-type application/vnd.nfd.image-compatibility.v1alpha1 &amp;lt;image-url&amp;gt; &lt;span style=&#34;color:#b62;font-weight:bold&#34;&gt;\&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;lt;path-to-spec&amp;gt;.yaml:application/vnd.nfd.image-compatibility.spec.v1alpha1+yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Validate image compatibility&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After attaching the compatibility specification, you can validate whether a node meets the
image&#39;s requirements. This validation can be done using the
&lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/reference/node-feature-client-reference.html&#34;&gt;nfd client&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;nfd compat validate-node --image &amp;lt;image-url&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Read the output from the client&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Finally, you can read the report generated by the tool, or use your own tooling to act on the generated JSON report.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;validate-node command output&#34; src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/25/image-compatibility-in-cloud-native-environments/validate-node-output.png&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The addition of image compatibility to Kubernetes through Node Feature Discovery underscores the growing importance of addressing compatibility in cloud native environments.
It is only a start, as further work is needed to integrate compatibility into scheduling of workloads within and outside of Kubernetes.
However, by integrating this feature into Kubernetes, mission-critical workloads can now define and validate host OS requirements more efficiently.
Moving forward, the adoption of compatibility metadata within Kubernetes ecosystems will significantly enhance the reliability and performance of specialized containerized applications, ensuring they meet the stringent requirements of industries like telecommunications and high-performance computing, or of any environment that requires special hardware or host OS configuration.&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;Join the &lt;a href=&#34;https://kubernetes-sigs.github.io/node-feature-discovery/v0.17/contributing/&#34;&gt;Kubernetes Node Feature Discovery&lt;/a&gt; project if you&#39;re interested in getting involved with the design and development of the Image Compatibility API and tools.
We always welcome new contributors.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Changes to Kubernetes Slack</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/16/changes-to-kubernetes-slack/</link>
      <pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/16/changes-to-kubernetes-slack/</guid>
      <description>
        
        
        &lt;p&gt;&lt;strong&gt;UPDATE&lt;/strong&gt;: We’ve received notice from Salesforce that our Slack workspace &lt;strong&gt;WILL NOT BE DOWNGRADED&lt;/strong&gt; on June 20th. Stand by for more details, but for now, there is no urgency to back up private channels or direct messages.&lt;/p&gt;
&lt;p&gt;&lt;del&gt;Kubernetes Slack will lose its special status and will be changing into a standard free Slack on June 20, 2025&lt;/del&gt;. Sometime later this year, our community may move to a new platform. If you are responsible for a channel or private channel, or a member of a User Group, you will need to take some actions as soon as you can.&lt;/p&gt;
&lt;p&gt;For the last decade, Slack has supported our project with a free customized enterprise account. They have let us know that they can no longer do so, particularly since our Slack is one of the largest and most active ones on the platform. As such, they will be downgrading it to a standard free Slack while we decide on, and implement, other options.&lt;/p&gt;
&lt;p&gt;On Friday, June 20, we will be subject to the &lt;a href=&#34;https://slack.com/help/articles/27204752526611-Feature-limitations-on-the-free-version-of-Slack&#34;&gt;feature limitations of free Slack&lt;/a&gt;. The primary ones which will affect us will be only retaining 90 days of history, and having to disable several apps and workflows which we are currently using. The Slack Admin team will do their best to manage these limitations.&lt;/p&gt;
&lt;p&gt;Responsible channel owners, members of private channels, and members of User Groups should &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/communication/slack-migration-faq.md#what-actions-do-channel-owners-and-user-group-members-need-to-take-soon&#34;&gt;take some actions&lt;/a&gt; to prepare for the downgrade and preserve information as soon as possible.&lt;/p&gt;
&lt;p&gt;The CNCF Projects Staff have proposed that our community look at migrating to Discord. Because of existing issues where we have been pushing the limits of Slack, they have already explored what a Kubernetes Discord would look like. Discord would allow us to implement new tools and integrations which would help the community, such as GitHub group membership synchronization. The Steering Committee will discuss and decide on our future platform.&lt;/p&gt;
&lt;p&gt;Please see our &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/communication/slack-migration-faq.md&#34;&gt;FAQ&lt;/a&gt;, and check the &lt;a href=&#34;https://groups.google.com/a/kubernetes.io/g/dev/&#34;&gt;kubernetes-dev mailing list&lt;/a&gt; and the &lt;a href=&#34;https://kubernetes.slack.com/archives/C9T0QMNG4&#34;&gt;#announcements channel&lt;/a&gt; for further news. If you have specific feedback on our Slack status, join the &lt;a href=&#34;https://github.com/kubernetes/community/issues/8490&#34;&gt;discussion on GitHub&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Enhancing Kubernetes Event Management with Custom Aggregation</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/10/enhancing-kubernetes-event-management-custom-aggregation/</link>
      <pubDate>Tue, 10 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/10/enhancing-kubernetes-event-management-custom-aggregation/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/kubernetes-api/cluster-resources/event-v1/&#34;&gt;Events&lt;/a&gt; provide crucial insights into cluster operations, but as clusters grow, managing and analyzing these events becomes increasingly challenging. This blog post explores how to build custom event aggregation systems that help engineering teams better understand cluster behavior and troubleshoot issues more effectively.&lt;/p&gt;
&lt;h2 id=&#34;the-challenge-with-kubernetes-events&#34;&gt;The challenge with Kubernetes events&lt;/h2&gt;
&lt;p&gt;In a Kubernetes cluster, events are generated for various operations, from pod scheduling and container starts to volume mounts and network configurations. While these events are invaluable for debugging and monitoring, several challenges emerge in production environments:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Volume&lt;/strong&gt;: Large clusters can generate thousands of events per minute&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retention&lt;/strong&gt;: Default event retention is limited to one hour&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Correlation&lt;/strong&gt;: Related events from different components are not automatically linked&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classification&lt;/strong&gt;: Events lack standardized severity or category classifications&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregation&lt;/strong&gt;: Similar events are not automatically grouped&lt;/li&gt;
&lt;/ol&gt;
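&lt;p&gt;To see the volume and retention challenges first-hand, you can inspect raw events with &lt;code&gt;kubectl&lt;/code&gt;. On a busy cluster the list scrolls by quickly, and anything older than the retention window is already gone:&lt;/p&gt;

```shell
# List events across all namespaces, oldest first
kubectl get events -A --sort-by=.metadata.creationTimestamp

# Rough per-namespace event counts, to gauge volume
kubectl get events -A --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn
```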
&lt;p&gt;To learn more about Events in Kubernetes, read the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/kubernetes-api/cluster-resources/event-v1/&#34;&gt;Event&lt;/a&gt; API reference.&lt;/p&gt;
&lt;h2 id=&#34;real-world-value&#34;&gt;Real-World value&lt;/h2&gt;
&lt;p&gt;Consider a production environment with dozens of microservices, where users report intermittent transaction failures:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Traditional troubleshooting process:&lt;/strong&gt; Engineers waste hours sifting through thousands of standalone events spread across namespaces. By the time they investigate, the older events have long since been purged, and correlating pod restarts with node-level issues is practically impossible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;With custom event aggregation:&lt;/strong&gt; The system groups related events across resources, instantly surfacing correlation patterns such as volume mount timeouts preceding pod restarts. Historical data shows the same pattern occurred during past traffic spikes, pinpointing a storage scalability issue in minutes rather than hours.&lt;/p&gt;
&lt;p&gt;The benefit of this approach is that organizations that implement it commonly cut troubleshooting time significantly while improving system reliability by detecting patterns early.&lt;/p&gt;
&lt;h2 id=&#34;building-an-event-aggregation-system&#34;&gt;Building an Event aggregation system&lt;/h2&gt;
&lt;p&gt;This post explores how to build a custom event aggregation system that addresses these challenges, aligned to Kubernetes best practices. I&#39;ve picked the Go programming language for my example.&lt;/p&gt;
&lt;h3 id=&#34;architecture-overview&#34;&gt;Architecture overview&lt;/h3&gt;
&lt;p&gt;This event aggregation system consists of three main components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Event Watcher&lt;/strong&gt;: Monitors the Kubernetes API for new events&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event Processor&lt;/strong&gt;: Processes, categorizes, and correlates events&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Backend&lt;/strong&gt;: Stores processed events for longer retention&lt;/li&gt;
&lt;/ol&gt;
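&lt;p&gt;Before diving into each component, here is a minimal sketch of how the three stages fit together as a pipeline. The types are simplified placeholders, not the real client-go types used in the rest of this post:&lt;/p&gt;

```go
package main

import "fmt"

// Simplified stand-ins for the three components: the real watcher streams
// Kubernetes Events, the processor classifies them, and the storage
// backend persists them.
type rawEvent struct{ Reason string }
type processedEvent struct{ Reason, Category string }

// watch returns a batch of raw events; the real watcher streams them
// from the Kubernetes API instead.
func watch() []rawEvent {
	return []rawEvent{{Reason: "FailedMount"}, {Reason: "Scheduled"}}
}

// process categorizes a single event.
func process(e rawEvent) processedEvent {
	category := "other"
	if e.Reason == "FailedMount" {
		category = "storage"
	}
	return processedEvent{Reason: e.Reason, Category: category}
}

// store is a placeholder for the storage backend.
func store(p processedEvent) {
	fmt.Printf("%s -> %s\n", p.Reason, p.Category)
}

func main() {
	for _, e := range watch() {
		store(process(e))
	}
}
```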
&lt;p&gt;Here&#39;s a sketch for how to implement the event watcher:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;package&lt;/span&gt; main
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;import&lt;/span&gt; (
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;context&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    metav1 &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/apimachinery/pkg/apis/meta/v1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/client-go/kubernetes&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/client-go/rest&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    eventsv1 &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;k8s.io/api/events/v1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; EventWatcher &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    clientset &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;kubernetes.Clientset
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;NewEventWatcher&lt;/span&gt;(config &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;rest.Config) (&lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;EventWatcher, &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    clientset, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; kubernetes.&lt;span style=&#34;color:#00a000&#34;&gt;NewForConfig&lt;/span&gt;(config)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt;, err
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;EventWatcher{clientset: clientset}, &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; (w &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;EventWatcher) &lt;span style=&#34;color:#00a000&#34;&gt;Watch&lt;/span&gt;(ctx context.Context) (&lt;span style=&#34;color:#666&#34;&gt;&amp;lt;-&lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;chan&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event, &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;error&lt;/span&gt;) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    events &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#a2f&#34;&gt;make&lt;/span&gt;(&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;chan&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    watcher, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; w.clientset.&lt;span style=&#34;color:#00a000&#34;&gt;EventsV1&lt;/span&gt;().&lt;span style=&#34;color:#00a000&#34;&gt;Events&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;).&lt;span style=&#34;color:#00a000&#34;&gt;Watch&lt;/span&gt;(ctx, metav1.ListOptions{})
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt;, err
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;go&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt;() {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;defer&lt;/span&gt; &lt;span style=&#34;color:#a2f&#34;&gt;close&lt;/span&gt;(events)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;for&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;select&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;case&lt;/span&gt; event &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;&amp;lt;-&lt;/span&gt;watcher.&lt;span style=&#34;color:#00a000&#34;&gt;ResultChan&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; e, ok &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; event.Object.(&lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event); ok {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                    events &lt;span style=&#34;color:#666&#34;&gt;&amp;lt;-&lt;/span&gt; e
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;case&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;&amp;lt;-&lt;/span&gt;ctx.&lt;span style=&#34;color:#00a000&#34;&gt;Done&lt;/span&gt;():
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                watcher.&lt;span style=&#34;color:#00a000&#34;&gt;Stop&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; events, &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;event-processing-and-classification&#34;&gt;Event processing and classification&lt;/h3&gt;
&lt;p&gt;The event processor enriches events with additional context and classification:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; EventProcessor &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    categoryRules []CategoryRule
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    correlationRules []CorrelationRule
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; ProcessedEvent &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Event     &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Category  &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Severity  &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    CorrelationID &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Metadata  &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;map&lt;/span&gt;[&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;]&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; (p &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;EventProcessor) &lt;span style=&#34;color:#00a000&#34;&gt;Process&lt;/span&gt;(event &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event) &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;ProcessedEvent {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    processed &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt;ProcessedEvent{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        Event:    event,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        Metadata: &lt;span style=&#34;color:#a2f&#34;&gt;make&lt;/span&gt;(&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;map&lt;/span&gt;[&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;]&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Apply classification rules
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    processed.Category = p.&lt;span style=&#34;color:#00a000&#34;&gt;classifyEvent&lt;/span&gt;(event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    processed.Severity = p.&lt;span style=&#34;color:#00a000&#34;&gt;determineSeverity&lt;/span&gt;(event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Generate correlation ID for related events
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    processed.CorrelationID = p.&lt;span style=&#34;color:#00a000&#34;&gt;correlateEvent&lt;/span&gt;(event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Add useful metadata
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    processed.Metadata = p.&lt;span style=&#34;color:#00a000&#34;&gt;extractMetadata&lt;/span&gt;(event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; processed
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;implementing-event-correlation&#34;&gt;Implementing Event correlation&lt;/h3&gt;
&lt;p&gt;One of the key features you could implement is a way of correlating related Events.
Here&#39;s an example correlation strategy:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; (p &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;EventProcessor) &lt;span style=&#34;color:#00a000&#34;&gt;correlateEvent&lt;/span&gt;(event &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event) &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Correlation strategies:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// 1. Time-based: Events within a time window
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// 2. Resource-based: Events affecting the same resource
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// 3. Causation-based: Events with cause-effect relationships
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    correlationKey &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;generateCorrelationKey&lt;/span&gt;(event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; correlationKey
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;generateCorrelationKey&lt;/span&gt;(event &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;eventsv1.Event) &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Example: Combine namespace, resource type, and name
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; fmt.&lt;span style=&#34;color:#00a000&#34;&gt;Sprintf&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;%s/%s/%s&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        event.Regarding.Namespace,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        event.Regarding.Kind,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        event.Regarding.Name,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    )
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;event-storage-and-retention&#34;&gt;Event storage and retention&lt;/h2&gt;
&lt;p&gt;For long-term storage and analysis, you&#39;ll probably want a backend that supports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Efficient querying of large event volumes&lt;/li&gt;
&lt;li&gt;Flexible retention policies&lt;/li&gt;
&lt;li&gt;Support for aggregation queries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here&#39;s a sample storage interface:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; EventStorage &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;interface&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#00a000&#34;&gt;Store&lt;/span&gt;(context.Context, &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;ProcessedEvent) &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;error&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#00a000&#34;&gt;Query&lt;/span&gt;(context.Context, EventQuery) ([]ProcessedEvent, &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;error&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#00a000&#34;&gt;Aggregate&lt;/span&gt;(context.Context, AggregationParams) ([]EventAggregate, &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;error&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; EventQuery &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    TimeRange     TimeRange
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Categories    []&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Severity      []&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    CorrelationID &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Limit         &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;int&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; AggregationParams &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    GroupBy    []&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    TimeWindow &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Metrics    []&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;good-practices-for-event-management&#34;&gt;Good practices for Event management&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resource Efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement rate limiting for event processing&lt;/li&gt;
&lt;li&gt;Use efficient filtering at the API server level&lt;/li&gt;
&lt;li&gt;Batch events for storage operations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Distribute event processing across multiple workers&lt;/li&gt;
&lt;li&gt;Use leader election for coordination&lt;/li&gt;
&lt;li&gt;Implement backoff strategies for API rate limits&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Handle API server disconnections gracefully&lt;/li&gt;
&lt;li&gt;Buffer events during storage backend unavailability&lt;/li&gt;
&lt;li&gt;Implement retry mechanisms with exponential backoff&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;advanced-features&#34;&gt;Advanced features&lt;/h2&gt;
&lt;h3 id=&#34;pattern-detection&#34;&gt;Pattern detection&lt;/h3&gt;
&lt;p&gt;Implement pattern detection to identify recurring issues:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; PatternDetector &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    patterns &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;map&lt;/span&gt;[&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;]&lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;Pattern
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    threshold &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;int&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; (d &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;PatternDetector) &lt;span style=&#34;color:#00a000&#34;&gt;Detect&lt;/span&gt;(events []ProcessedEvent) []Pattern {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Group similar events
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    groups &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;groupSimilarEvents&lt;/span&gt;(events)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Analyze frequency and timing
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;    patterns &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;identifyPatterns&lt;/span&gt;(groups)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; patterns
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;groupSimilarEvents&lt;/span&gt;(events []ProcessedEvent) &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;map&lt;/span&gt;[&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;][]ProcessedEvent {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    groups &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#a2f&#34;&gt;make&lt;/span&gt;(&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;map&lt;/span&gt;[&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;][]ProcessedEvent)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;for&lt;/span&gt; _, event &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;range&lt;/span&gt; events {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Create similarity key based on event characteristics
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;        similarityKey &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; fmt.&lt;span style=&#34;color:#00a000&#34;&gt;Sprintf&lt;/span&gt;(&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;%s:%s:%s&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            event.Event.Reason,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            event.Event.InvolvedObject.Kind,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            event.Event.InvolvedObject.Namespace,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        )
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Group events with the same key
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;        groups[similarityKey] = &lt;span style=&#34;color:#a2f&#34;&gt;append&lt;/span&gt;(groups[similarityKey], event)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; groups
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; &lt;span style=&#34;color:#00a000&#34;&gt;identifyPatterns&lt;/span&gt;(groups &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;map&lt;/span&gt;[&lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;string&lt;/span&gt;][]ProcessedEvent) []Pattern {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;var&lt;/span&gt; patterns []Pattern
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;for&lt;/span&gt; key, events &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;range&lt;/span&gt; groups {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Only consider groups with enough events to form a pattern
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#a2f&#34;&gt;len&lt;/span&gt;(events) &amp;lt; &lt;span style=&#34;color:#666&#34;&gt;3&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;continue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Sort events by time
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;        sort.&lt;span style=&#34;color:#00a000&#34;&gt;Slice&lt;/span&gt;(events, &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt;(i, j &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;int&lt;/span&gt;) &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;bool&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; events[i].Event.LastTimestamp.Time.&lt;span style=&#34;color:#00a000&#34;&gt;Before&lt;/span&gt;(events[j].Event.LastTimestamp.Time)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        })
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Calculate time range and frequency
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;        firstSeen &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; events[&lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;].Event.FirstTimestamp.Time
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        lastSeen &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; events[&lt;span style=&#34;color:#a2f&#34;&gt;len&lt;/span&gt;(events)&lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;].Event.LastTimestamp.Time
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        duration &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; lastSeen.&lt;span style=&#34;color:#00a000&#34;&gt;Sub&lt;/span&gt;(firstSeen).&lt;span style=&#34;color:#00a000&#34;&gt;Minutes&lt;/span&gt;()
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;var&lt;/span&gt; frequency &lt;span style=&#34;color:#0b0;font-weight:bold&#34;&gt;float64&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; duration &amp;gt; &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            frequency = &lt;span style=&#34;color:#a2f&#34;&gt;float64&lt;/span&gt;(&lt;span style=&#34;color:#a2f&#34;&gt;len&lt;/span&gt;(events)) &lt;span style=&#34;color:#666&#34;&gt;/&lt;/span&gt; duration
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Create a pattern if it meets threshold criteria
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; frequency &amp;gt; &lt;span style=&#34;color:#666&#34;&gt;0.5&lt;/span&gt; { &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// More than 1 event per 2 minutes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;            pattern &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; Pattern{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                Type:         key,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                Count:        &lt;span style=&#34;color:#a2f&#34;&gt;len&lt;/span&gt;(events),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                FirstSeen:    firstSeen,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                LastSeen:     lastSeen,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                Frequency:    frequency,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                EventSamples: events[:&lt;span style=&#34;color:#a2f&#34;&gt;min&lt;/span&gt;(&lt;span style=&#34;color:#666&#34;&gt;3&lt;/span&gt;, &lt;span style=&#34;color:#a2f&#34;&gt;len&lt;/span&gt;(events))], &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Keep up to 3 samples
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;            }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            patterns = &lt;span style=&#34;color:#a2f&#34;&gt;append&lt;/span&gt;(patterns, pattern)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;return&lt;/span&gt; patterns
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With this implementation, the system can identify recurring patterns such as node pressure events, pod scheduling failures, or networking issues that occur with a specific frequency.&lt;/p&gt;
&lt;h3 id=&#34;real-time-alerts&#34;&gt;Real-time alerts&lt;/h3&gt;
&lt;p&gt;The following example provides a starting point for building an alerting system based on event patterns. It is not a complete solution but a conceptual sketch to illustrate the approach.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;type&lt;/span&gt; AlertManager &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;struct&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    rules     []AlertRule
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    notifiers []Notifier
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;func&lt;/span&gt; (a &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;AlertManager) &lt;span style=&#34;color:#00a000&#34;&gt;EvaluateEvents&lt;/span&gt;(events []ProcessedEvent) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;for&lt;/span&gt; _, rule &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;range&lt;/span&gt; a.rules {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; rule.&lt;span style=&#34;color:#00a000&#34;&gt;Matches&lt;/span&gt;(events) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            alert &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; rule.&lt;span style=&#34;color:#00a000&#34;&gt;GenerateAlert&lt;/span&gt;(events)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            a.&lt;span style=&#34;color:#00a000&#34;&gt;notify&lt;/span&gt;(alert)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;A well-designed event aggregation system can significantly improve cluster observability and troubleshooting capabilities. By implementing custom event processing, correlation, and storage, operators can better understand cluster behavior and respond to issues more effectively.&lt;/p&gt;
&lt;p&gt;The solutions presented here can be extended and customized based on specific requirements while maintaining compatibility with the Kubernetes API and following best practices for scalability and reliability.&lt;/p&gt;
&lt;h2 id=&#34;next-steps&#34;&gt;Next steps&lt;/h2&gt;
&lt;p&gt;Future enhancements could include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Machine learning for anomaly detection&lt;/li&gt;
&lt;li&gt;Integration with popular observability platforms&lt;/li&gt;
&lt;li&gt;Custom event APIs for application-specific events&lt;/li&gt;
&lt;li&gt;Enhanced visualization and reporting capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For more information on Kubernetes events and custom &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/architecture/controller/&#34;&gt;controllers&lt;/a&gt;,
refer to the official Kubernetes &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/&#34;&gt;documentation&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Introducing Gateway API Inference Extension</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/05/introducing-gateway-api-inference-extension/</link>
      <pubDate>Thu, 05 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/05/introducing-gateway-api-inference-extension/</guid>
      <description>
        
        
        &lt;p&gt;Modern generative AI and large language model (LLM) services create unique traffic-routing challenges
on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often
long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server
may keep multiple inference sessions active and maintain in-memory token caches.&lt;/p&gt;
&lt;p&gt;Traditional load balancers that route on HTTP paths or distribute requests round-robin lack the specialized
capabilities these workloads need. They also don’t account for model identity or request criticality (e.g., interactive
chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach
is missing.&lt;/p&gt;
&lt;h2 id=&#34;gateway-api-inference-extension&#34;&gt;Gateway API Inference Extension&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/&#34;&gt;Gateway API Inference Extension&lt;/a&gt; was created to address
this gap by building on the existing &lt;a href=&#34;https://gateway-api.sigs.k8s.io/&#34;&gt;Gateway API&lt;/a&gt;, adding inference-specific
routing capabilities while retaining the familiar model of Gateways and HTTPRoutes. By adding an inference
extension to your existing gateway, you effectively transform it into an &lt;strong&gt;Inference Gateway&lt;/strong&gt;, enabling you to
self-host GenAI/LLMs with a “model-as-a-service” mindset.&lt;/p&gt;
&lt;p&gt;The project’s goal is to improve and standardize routing to inference workloads across the ecosystem. Key
objectives include enabling model-aware routing, supporting per-request criticalities, facilitating safe model
roll-outs, and optimizing load balancing based on real-time model metrics. By achieving these, the project aims
to reduce latency and improve accelerator (GPU) utilization for AI workloads.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;The design introduces two new custom resources, defined via CustomResourceDefinitions (CRDs), with distinct responsibilities, each aligning with a
specific user persona in the AI/ML serving workflow:&lt;/p&gt;


&lt;figure class=&#34;diagram-large clickable-zoom&#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/05/introducing-gateway-api-inference-extension/inference-extension-resource-model.png&#34;
         alt=&#34;Resource Model&#34;/&gt; 
&lt;/figure&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/&#34;&gt;InferencePool&lt;/a&gt;
Defines a pool of pods (model servers) running on shared compute (e.g., GPU nodes). The platform admin can
configure how these pods are deployed, scaled, and balanced. An InferencePool ensures consistent resource
usage and enforces platform-wide policies. An InferencePool is similar to a Service but specialized for AI/ML
serving needs and aware of the model-serving protocol.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencemodel/&#34;&gt;InferenceModel&lt;/a&gt;
A user-facing model endpoint managed by AI/ML owners. It maps a public name (e.g., &amp;quot;gpt-4-chat&amp;quot;) to the actual
model within an InferencePool. This lets workload owners specify which models (and optional fine-tuning) they
want served, plus a traffic-splitting or prioritization policy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform
operators manage where and how it’s served.&lt;/p&gt;
&lt;h2 id=&#34;request-flow&#34;&gt;Request flow&lt;/h2&gt;
&lt;p&gt;The flow of a request builds on the Gateway API model (Gateways and HTTPRoutes) with one or more extra inference-aware
steps (extensions) in the middle. Here’s a high-level example of the request flow with the
&lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-selection-extension&#34;&gt;Endpoint Selection Extension (ESE)&lt;/a&gt;:&lt;/p&gt;


&lt;figure class=&#34;diagram-large clickable-zoom&#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/05/introducing-gateway-api-inference-extension/inference-extension-request-flow.png&#34;
         alt=&#34;Request Flow&#34;/&gt; 
&lt;/figure&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gateway Routing&lt;/strong&gt;&lt;br&gt;
A client sends a request (e.g., an HTTP POST to /completions). The Gateway (like Envoy) examines the HTTPRoute
and identifies the matching InferencePool backend.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Endpoint Selection&lt;/strong&gt;&lt;br&gt;
Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension—
the Endpoint Selection Extension—to pick the best of the available pods. This extension examines live pod metrics
(queue lengths, memory usage, loaded adapters) to choose the ideal pod for the request.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inference-Aware Scheduling&lt;/strong&gt;&lt;br&gt;
The chosen pod is the one that can handle the request with the lowest latency or highest efficiency, given the
user’s criticality or resource needs. The Gateway then forwards traffic to that specific pod.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;figure class=&#34;diagram-large clickable-zoom&#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/05/introducing-gateway-api-inference-extension/inference-extension-epp-scheduling.png&#34;
         alt=&#34;Endpoint Extension Scheduling&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to
the client. Additionally, the design is extensible—any Inference Gateway can be enhanced with additional inference-specific
extensions to handle new routing strategies, advanced scheduling logic, or specialized hardware needs. As the project
continues to grow, contributors are encouraged to develop new extensions that are fully compatible with the same underlying
Gateway API model, further expanding the possibilities for efficient and intelligent GenAI/LLM routing.&lt;/p&gt;
&lt;h2 id=&#34;benchmarks&#34;&gt;Benchmarks&lt;/h2&gt;
&lt;p&gt;We evaluated this extension against a standard Kubernetes Service for a &lt;a href=&#34;https://docs.vllm.ai/en/latest/&#34;&gt;vLLM&lt;/a&gt;‐based model
serving deployment. The test environment consisted of multiple H100 (80 GB) GPU pods running vLLM (&lt;a href=&#34;https://blog.vllm.ai/2025/01/27/v1-alpha-release.html&#34;&gt;version 1&lt;/a&gt;)
on a Kubernetes cluster, with 10 Llama2 model replicas. The &lt;a href=&#34;https://github.com/AI-Hypercomputer/inference-benchmark&#34;&gt;Latency Profile Generator (LPG)&lt;/a&gt;
tool was used to generate traffic and measure throughput, latency, and other metrics. The
&lt;a href=&#34;https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json&#34;&gt;ShareGPT&lt;/a&gt;
dataset served as the workload, and traffic was ramped from 100 Queries per Second (QPS) up to 1000 QPS.&lt;/p&gt;
&lt;h3 id=&#34;key-results&#34;&gt;Key results&lt;/h3&gt;


&lt;figure class=&#34;diagram-large clickable-zoom&#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/05/introducing-gateway-api-inference-extension/inference-extension-benchmark.png&#34;
         alt=&#34;Endpoint Extension Scheduling&#34;/&gt; 
&lt;/figure&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Comparable Throughput&lt;/strong&gt;: Throughout the tested QPS range, the ESE delivered throughput roughly on par with a standard
Kubernetes Service.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower Latency&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Per‐Output‐Token Latency&lt;/strong&gt;: The ESE showed significantly lower p90 latency at higher QPS (500+), indicating that
its model-aware routing decisions reduce queueing and resource contention as GPU memory approaches saturation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Overall p90 Latency&lt;/strong&gt;: Similar trends emerged, with the ESE reducing end‐to‐end tail latencies compared to the
baseline, particularly as traffic increased beyond 400–500 QPS.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These results suggest that this extension&#39;s model‐aware routing significantly reduced latency for GPU‐backed LLM
workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can
appear when using traditional load balancing methods for large, long‐running inference requests.&lt;/p&gt;
&lt;h2 id=&#34;roadmap&#34;&gt;Roadmap&lt;/h2&gt;
&lt;p&gt;As the Gateway API Inference Extension heads toward GA, planned features include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Prefix-cache aware load balancing&lt;/strong&gt; for remote caches&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LoRA adapter pipelines&lt;/strong&gt; for automated rollout&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fairness and priority&lt;/strong&gt; between workloads in the same criticality band&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HPA support&lt;/strong&gt; for scaling based on aggregate, per-model metrics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for large multi-modal inputs/outputs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Additional model types&lt;/strong&gt; (e.g., diffusion models)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Heterogeneous accelerators&lt;/strong&gt; (serving on multiple accelerator types with latency- and cost-aware load balancing)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disaggregated serving&lt;/strong&gt; for independently scaling pools&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary&lt;/h2&gt;
&lt;p&gt;By aligning model serving with Kubernetes-native tooling, Gateway API Inference Extension aims to simplify
and standardize how AI/ML traffic is routed. With model-aware routing, criticality-based prioritization, and
more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ready to learn more?&lt;/strong&gt; Visit the &lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/&#34;&gt;project docs&lt;/a&gt; to dive deeper,
give an Inference Gateway extension a try with a few &lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/guides/&#34;&gt;simple steps&lt;/a&gt;,
and &lt;a href=&#34;https://gateway-api-inference-extension.sigs.k8s.io/contributing/&#34;&gt;get involved&lt;/a&gt; if you’re interested in
contributing to the project!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Start Sidecar First: How To Avoid Snags</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/03/start-sidecar-first/</link>
      <pubDate>Tue, 03 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/03/start-sidecar-first/</guid>
      <description>
        
        
&lt;p&gt;From the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/22/multi-container-pods-overview/&#34;&gt;Kubernetes Multicontainer Pods: An Overview blog post&lt;/a&gt; you know what their job is, what the main architectural patterns are, and how they are implemented in Kubernetes. The main thing I’ll cover in this article is how to ensure that your sidecar containers start before the main app. It’s more complicated than you might think!&lt;/p&gt;
&lt;h2 id=&#34;a-gentle-refresher&#34;&gt;A gentle refresher&lt;/h2&gt;
&lt;p&gt;I&#39;d just like to remind readers that the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2023/12/13/kubernetes-v1-29-release/&#34;&gt;v1.29.0 release of Kubernetes&lt;/a&gt; added native support for
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/sidecar-containers/&#34;&gt;sidecar containers&lt;/a&gt;, which can now be defined within the &lt;code&gt;.spec.initContainers&lt;/code&gt; field,
but with &lt;code&gt;restartPolicy: Always&lt;/code&gt;. You can see that illustrated in the following example Pod manifest snippet:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;logshipper&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;alpine:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# this is what makes it a sidecar container&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;tail -F /opt/logs.txt&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeMounts&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;data&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;mountPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/opt&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;What are the specifics of defining sidecars with a &lt;code&gt;.spec.initContainers&lt;/code&gt; block, rather than as a legacy multi-container pod with multiple &lt;code&gt;.spec.containers&lt;/code&gt;?
Well, all &lt;code&gt;.spec.initContainers&lt;/code&gt; are always launched &lt;strong&gt;before&lt;/strong&gt; the main application. If you define Kubernetes-native sidecars, those are terminated &lt;strong&gt;after&lt;/strong&gt; the main application. Furthermore, when used with &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/&#34;&gt;Jobs&lt;/a&gt;, a sidecar container should still be alive and could potentially even restart after the owning Job is complete; Kubernetes-native sidecar containers do not block pod completion.&lt;/p&gt;
&lt;p&gt;To learn more, you can also read the official &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tutorials/configuration/pod-sidecar-containers/&#34;&gt;Pod sidecar containers tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;the-problem&#34;&gt;The problem&lt;/h2&gt;
&lt;p&gt;Now you know that defining a sidecar with this native approach will always start it before the main application. From the &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/537a602195efdc04cdf2cb0368792afad082d9fd/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L827-L830&#34;&gt;kubelet source code&lt;/a&gt;, it&#39;s clear that &amp;quot;started before&amp;quot; often means starting almost in parallel, and that is not always what an engineer wants to achieve. What I&#39;m really interested in is whether I can delay the start of the main application until the sidecar is not just started, but fully running and ready to serve.
This is a bit tricky, because sidecars offer no obvious success signal, unlike init containers, which are designed to run to completion and then exit. With an init container, exit status 0 unambiguously means &amp;quot;I succeeded&amp;quot;. With a sidecar, there are lots of points at which you could say &amp;quot;it is running&amp;quot;.
Starting one container only after the previous one is ready is part of a graceful deployment strategy, ensuring proper sequencing and stability during startup. It’s also how I’d expect sidecar containers to work in the scenario where the main application depends on the sidecar. For example, an app may error out if the sidecar isn’t available to serve requests (e.g., logging with Datadog). Sure, you could change the application code (and that would actually be the “best practice” solution), but sometimes you can’t, and this post focuses on that use case.&lt;/p&gt;
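&lt;p&gt;To make the contrast concrete, here is a minimal sketch (the &lt;code&gt;wait-for-db&lt;/code&gt; container and the &lt;code&gt;db&lt;/code&gt; service it probes are hypothetical): a regular init container signals success by exiting with status 0, while a native sidecar keeps running indefinitely and never produces such a signal.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;initContainers:
  - name: wait-for-db          # regular init container: succeeds by exiting 0
    image: busybox:1.37
    command: [&amp;#39;sh&amp;#39;, &amp;#39;-c&amp;#39;, &amp;#39;until nc -z db 5432; do sleep 1; done&amp;#39;]
  - name: logshipper           # native sidecar: runs forever, no exit-based success signal
    image: alpine:latest
    restartPolicy: Always
    command: [&amp;#39;sh&amp;#39;, &amp;#39;-c&amp;#39;, &amp;#39;tail -F /opt/logs.txt&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;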
&lt;p&gt;I&#39;ll explain some ways that you might try, and show you what approaches will really work.&lt;/p&gt;
&lt;h2 id=&#34;readiness-probe&#34;&gt;Readiness probe&lt;/h2&gt;
&lt;p&gt;To check whether a Kubernetes-native sidecar delays the start of the main application until the sidecar is ready, let’s run a short experiment. First, I’ll create a sidecar container that will never become ready, by giving it a readiness probe that never succeeds. As a reminder, a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/configuration/liveness-readiness-startup-probes/&#34;&gt;readiness probe&lt;/a&gt; checks whether the container is ready to start accepting traffic and, therefore, whether the pod can be used as a backend for services.&lt;/p&gt;
&lt;p&gt;(Unlike standard init containers, sidecar containers can have &lt;a href=&#34;https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/&#34;&gt;probes&lt;/a&gt; so that the kubelet can supervise the sidecar and intervene if there are problems. For example, restarting a sidecar container if it fails a health check.)&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;apps/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Deployment&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;replicas&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;selector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchLabels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;template&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;alpine:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sh&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;-c&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep 3600&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;TCP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;readinessProbe&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- /bin/sh&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- -c&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- exit 1&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# this command always fails, keeping the container &amp;#34;Not Ready&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;periodSeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;5&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;data&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;emptyDir&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The result is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;controlplane $ kubectl get pods -w
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;NAME                    READY   STATUS    RESTARTS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;myapp-db5474f45-htgw5   1/2     Running   0          9m28s
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#888&#34;&gt;controlplane $ kubectl describe pod myapp-db5474f45-htgw5 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;Name:             myapp-db5474f45-htgw5
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;Namespace:        default
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;(...)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;Events:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Type     Reason     Age               From               Message
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  ----     ------     ----              ----               -------
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Scheduled  17s               default-scheduler  Successfully assigned default/myapp-db5474f45-htgw5 to node01
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Pulling    16s               kubelet            Pulling image &amp;#34;nginx:latest&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Pulled     16s               kubelet            Successfully pulled image &amp;#34;nginx:latest&amp;#34; in 163ms (163ms including waiting). Image size: 72080558 bytes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Created    16s               kubelet            Created container nginx
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Started    16s               kubelet            Started container nginx
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Pulling    15s               kubelet            Pulling image &amp;#34;alpine:latest&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Pulled     15s               kubelet            Successfully pulled image &amp;#34;alpine:latest&amp;#34; in 159ms (160ms including waiting). Image size: 3652536 bytes.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Created    15s               kubelet            Created container myapp
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Normal   Started    15s               kubelet            Started container myapp
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;  Warning  Unhealthy  1s (x6 over 15s)  kubelet            Readiness probe failed:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;From these logs it’s evident that only one container is ready - and I know it can’t be the sidecar, because I’ve defined it so it’ll never be ready (you can also check container statuses in &lt;code&gt;kubectl get pod -o json&lt;/code&gt;). I can also see that the myapp container was started before the sidecar became ready. That was not the result I wanted to achieve; in this case, the main app container has a hard dependency on its sidecar.&lt;/p&gt;
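&lt;p&gt;If you want to double-check this from the command line, one quick way (the pod name below is from my run; yours will differ) is to print the ready status of each init container via JSONPath:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-console&#34;&gt;controlplane $ kubectl get pod myapp-db5474f45-htgw5 \
  -o jsonpath=&amp;#39;{range .status.initContainerStatuses[*]}{.name}{&amp;#34;\t&amp;#34;}{.ready}{&amp;#34;\n&amp;#34;}{end}&amp;#39;
nginx	false
&lt;/code&gt;&lt;/pre&gt;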
&lt;h2 id=&#34;maybe-a-startup-probe&#34;&gt;Maybe a startup probe?&lt;/h2&gt;
&lt;p&gt;To ensure that the sidecar is ready before the main app container starts, I can define a &lt;code&gt;startupProbe&lt;/code&gt;. It will delay the start of the main container until the probe command succeeds (returns a &lt;code&gt;0&lt;/code&gt; exit status). If you’re wondering why I’ve added it to my &lt;code&gt;initContainer&lt;/code&gt;, let’s analyse what would happen if I’d added it to the myapp container instead: there would be no guarantee that the probe would run before the main application code - and that code can potentially error out if the sidecar isn’t up and running.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;apps/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Deployment&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;replicas&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;selector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchLabels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;template&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;app&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myapp&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;alpine:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sh&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;-c&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep 3600&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;TCP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;startupProbe&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;httpGet&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;path&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initialDelaySeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;5&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;periodSeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;30&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;failureThreshold&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;timeoutSeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;20&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;data&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;emptyDir&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;{}&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This results in 2/2 containers being ready and running, and the Pod’s events show that the main application started only after nginx had started. But to confirm that it actually waited for the sidecar to become ready, let’s change the &lt;code&gt;startupProbe&lt;/code&gt; to an exec command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;startupProbe&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- /bin/sh&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- -c&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- sleep 15&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;and run &lt;code&gt;kubectl get pods -w&lt;/code&gt; to watch in real time: the readiness of both containers changes only after the 15-second delay. Again, the events confirm that the main application starts after the sidecar.
That means that a &lt;code&gt;startupProbe&lt;/code&gt; with an appropriate check, such as the &lt;code&gt;startupProbe.httpGet&lt;/code&gt; request shown earlier, delays the main application start until the sidecar is ready. It’s not optimal, but it works.&lt;/p&gt;
&lt;h2 id=&#34;what-about-the-poststart-lifecycle-hook&#34;&gt;What about the postStart lifecycle hook?&lt;/h2&gt;
&lt;p&gt;Fun fact: the &lt;code&gt;postStart&lt;/code&gt; lifecycle hook will also do the job, but I’d have to write my own mini shell script, which is even less efficient.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;TCP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;lifecycle&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;postStart&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- /bin/sh&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- -c&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;- |&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            echo &amp;#34;Waiting for readiness at http://localhost:80&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            until curl -sf http://localhost:80; do
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;              echo &amp;#34;Still waiting for http://localhost:80...&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;              sleep 5
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            done
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            echo &amp;#34;Service is ready at http://localhost:80&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;liveness-probe&#34;&gt;Liveness probe&lt;/h2&gt;
&lt;p&gt;An interesting exercise would be to check the sidecar container behavior with a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/configuration/liveness-readiness-startup-probes/&#34;&gt;liveness probe&lt;/a&gt;.
A liveness probe is configured much like a readiness probe, with one key difference: instead of affecting the container’s readiness, it restarts the container when the probe fails.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;livenessProbe&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;exec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- /bin/sh&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- -c&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- exit 1&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# this command always fails, so the kubelet keeps restarting the container&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;periodSeconds&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;5&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After adding a liveness probe configured just like the earlier readiness probe, checking the Pod’s events with &lt;code&gt;kubectl describe pod&lt;/code&gt; shows that the sidecar’s restart count is above 0. Nevertheless, the main application is neither restarted nor influenced at all, even though (in our imaginary worst-case scenario) it can error out when the sidecar is not there to serve requests.
What if I used a &lt;code&gt;livenessProbe&lt;/code&gt; without the lifecycle &lt;code&gt;postStart&lt;/code&gt; hook? Both containers would be ready immediately: at first, this behavior is no different from having no additional probes, since a liveness probe doesn’t affect readiness at all. After a while, the failing probe starts restarting the sidecar, but that doesn’t influence the main container.&lt;/p&gt;
&lt;h2 id=&#34;findings-summary&#34;&gt;Findings summary&lt;/h2&gt;
&lt;p&gt;I’ll summarize the startup behavior in the table below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Probe/Hook&lt;/th&gt;
&lt;th&gt;Sidecar starts before the main app?&lt;/th&gt;
&lt;th&gt;Main app waits for the sidecar to be ready?&lt;/th&gt;
&lt;th&gt;What if the check doesn’t pass?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;readinessProbe&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;, but it’s almost in parallel (effectively &lt;strong&gt;no&lt;/strong&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sidecar is not ready; main app continues running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;livenessProbe&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Yes, but it’s almost in parallel (effectively &lt;strong&gt;no&lt;/strong&gt;)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sidecar is restarted, main app continues running&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;startupProbe&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Main app is not started&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;postStart&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;, main app container starts after &lt;code&gt;postStart&lt;/code&gt; completes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;, but you have to provide custom logic for that&lt;/td&gt;
&lt;td&gt;Main app is not started&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;To summarize: since a sidecar is often a dependency of the main application, you may want to delay the start of the main application until the sidecar is healthy.
The ideal pattern is to start both containers simultaneously and have the app container’s own logic handle delays and retries at every level, but that’s not always possible. If you need to enforce the ordering, you have to apply the right customization to the Pod definition. Thankfully, it’s nice and quick, and you have the recipe ready above.&lt;/p&gt;
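&lt;p&gt;To make the recipe concrete, here is a minimal sketch of the sidecar definition that combines a &lt;code&gt;startupProbe&lt;/code&gt; (to gate the main application start) with a &lt;code&gt;livenessProbe&lt;/code&gt; (to restart the sidecar if it becomes unhealthy later). The probe timings here are illustrative, not prescriptive:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;initContainers:
  - name: nginx
    image: nginx:latest
    restartPolicy: Always   # makes this init container a sidecar
    ports:
      - containerPort: 80
        protocol: TCP
    startupProbe:           # main containers start only after this succeeds
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
      failureThreshold: 10
    livenessProbe:          # restarts the sidecar if it stops responding later
      httpGet:
        path: /
        port: 80
      periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;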
&lt;p&gt;Happy deploying!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Gateway API v1.3.0: Advancements in Request Mirroring, CORS, Gateway Merging, and Retry Budgets</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/02/gateway-api-v1-3/</link>
      <pubDate>Mon, 02 Jun 2025 09:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/02/gateway-api-v1-3/</guid>
      <description>
        
        
        &lt;p&gt;&lt;img alt=&#34;Gateway API logo&#34; src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/06/02/gateway-api-v1-3/gateway-api-logo.svg&#34;&gt;&lt;/p&gt;
&lt;p&gt;Join us in the Kubernetes SIG Network community in celebrating the general
availability of &lt;a href=&#34;https://gateway-api.sigs.k8s.io/&#34;&gt;Gateway API&lt;/a&gt; v1.3.0! We are
also pleased to announce that there are already a number of conformant
implementations to try, made possible by postponing this blog
announcement. Version 1.3.0 of the API was released about a month ago on
April 24, 2025.&lt;/p&gt;
&lt;p&gt;Gateway API v1.3.0 brings a new feature to the &lt;em&gt;Standard&lt;/em&gt; channel
(Gateway API&#39;s GA release channel): &lt;em&gt;percentage-based request mirroring&lt;/em&gt;, and
introduces three new experimental features: cross-origin resource sharing (CORS)
filters, a standardized mechanism for listener and gateway merging, and retry
budgets.&lt;/p&gt;
&lt;p&gt;Also see the full
&lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/54df0a899c1c5c845dd3a80f05dcfdf65576f03c/CHANGELOG/1.3-CHANGELOG.md&#34;&gt;release notes&lt;/a&gt;
and applaud the
&lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/54df0a899c1c5c845dd3a80f05dcfdf65576f03c/CHANGELOG/1.3-TEAM.md&#34;&gt;v1.3.0 release team&lt;/a&gt;
next time you see them.&lt;/p&gt;
&lt;h2 id=&#34;graduation-to-standard-channel&#34;&gt;Graduation to Standard channel&lt;/h2&gt;
&lt;p&gt;Graduation to the Standard channel is a notable achievement for Gateway API
features, as inclusion in the Standard release channel denotes a high level of
confidence in the API surface and provides guarantees of backward compatibility.
Of course, as with any other Kubernetes API, Standard channel features can continue
to evolve with backward-compatible additions over time, and we (SIG Network)
certainly expect
further refinements and improvements in the future. For more information on how
all of this works, refer to the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/concepts/versioning/&#34;&gt;Gateway API Versioning Policy&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;percentage-based-request-mirroring&#34;&gt;Percentage-based request mirroring&lt;/h3&gt;
&lt;p&gt;Leads: &lt;a href=&#34;https://github.com/LiorLieberman&#34;&gt;Lior Lieberman&lt;/a&gt;, &lt;a href=&#34;https://github.com/jakebennert&#34;&gt;Jake Bennert&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-3171: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/main/geps/gep-3171/index.md&#34;&gt;Percentage-Based Request Mirroring&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Percentage-based request mirroring&lt;/em&gt; is an enhancement to the
existing support for &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/http-request-mirroring/&#34;&gt;HTTP request mirroring&lt;/a&gt;, which allows HTTP requests to be duplicated to another backend using the
RequestMirror filter type. Request mirroring is particularly useful in
blue-green deployments. It can be used to assess the impact of request scaling on
application performance without affecting responses to clients.&lt;/p&gt;
&lt;p&gt;The previous mirroring capability worked on all the requests to a &lt;code&gt;backendRef&lt;/code&gt;.
Percentage-based request mirroring allows users to specify a subset of requests
they want to be mirrored, either by percentage or fraction. This can be
particularly useful when services are receiving a large volume of requests.
Instead of mirroring all of those requests, this new feature can be used to
mirror a smaller subset of them.&lt;/p&gt;
&lt;p&gt;Here&#39;s an example with 42% of the requests to &amp;quot;foo-v1&amp;quot; being mirrored to &amp;quot;foo-v2&amp;quot;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;http-filter-mirror&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gateway&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;mirror-gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;mirror-gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostnames&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- mirror.example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo-v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;filters&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;RequestMirror&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requestMirror&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo-v2&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;percent&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;42&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# This value must be an integer between 0 and 100.&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can also configure partial mirroring using a fraction. In the following example,
5 out of every 1000 requests to &amp;quot;foo-v1&amp;quot; are mirrored to &amp;quot;foo-v2&amp;quot;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo-v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;filters&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;RequestMirror&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requestMirror&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo-v2&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fraction&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;numerator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;5&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;denominator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;additions-to-experimental-channel&#34;&gt;Additions to Experimental channel&lt;/h2&gt;
&lt;p&gt;The Experimental channel is where Gateway API features are trialed so the
community can gain confidence in them before they graduate to the Standard
channel.  Please note: features in the Experimental channel may be changed or
removed in later releases.&lt;/p&gt;
&lt;p&gt;Starting in release v1.3.0, in an effort to distinguish Experimental channel
resources from Standard channel resources, any new experimental API kinds have the
prefix &amp;quot;&lt;strong&gt;X&lt;/strong&gt;&amp;quot;.  For the same reason, experimental resources are now added to the
API group &lt;code&gt;gateway.networking.x-k8s.io&lt;/code&gt; instead of &lt;code&gt;gateway.networking.k8s.io&lt;/code&gt;.
Bear in mind that while these experimental resources can coexist with Standard
channel resources, migrating them to the Standard channel will require
recreating them with the Standard channel kind names and API group (which drop
the &amp;quot;X&amp;quot; prefix and the &amp;quot;x-k8s&amp;quot; designator, respectively).&lt;/p&gt;
&lt;p&gt;The v1.3 release introduces two new experimental API kinds: XBackendTrafficPolicy
and XListenerSet.  To be able to use experimental API kinds, you need to install
the Experimental channel Gateway API YAMLs from the locations listed below.&lt;/p&gt;
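&lt;p&gt;As a sketch, installing the Experimental channel CRDs can be done with a single
&lt;code&gt;kubectl apply&lt;/code&gt;; the exact artifact URL below follows the project&#39;s usual release
naming and should be confirmed against the Gateway API v1.3 release page:&lt;/p&gt;

```shell
# Install the Experimental channel CRDs for Gateway API v1.3.
# Note: the release artifact URL is an assumption based on the project's
# usual naming; confirm it against the v1.3 release notes before use.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/experimental-install.yaml
```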
&lt;h3 id=&#34;cors-filtering&#34;&gt;CORS filtering&lt;/h3&gt;
&lt;p&gt;Leads: &lt;a href=&#34;https://github.com/liangli&#34;&gt;Liang Li&lt;/a&gt;, &lt;a href=&#34;https://github.com/EyalPazz&#34;&gt;Eyal Pazz&lt;/a&gt;, &lt;a href=&#34;https://github.com/robscott&#34;&gt;Rob Scott&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-1767: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/main/geps/gep-1767/index.md&#34;&gt;CORS Filter&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Cross-origin resource sharing (CORS) is an HTTP-header-based mechanism that allows
a web page to request restricted resources from a server on an origin (scheme,
domain, or port) different from the one that served the page. This feature
adds a new HTTPRoute &lt;code&gt;filter&lt;/code&gt; type, called &amp;quot;CORS&amp;quot;, to configure how
cross-origin requests are handled before the response is sent back to the client.&lt;/p&gt;
&lt;p&gt;To be able to use experimental CORS filtering, you need to install the
&lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/main/config/crd/experimental/gateway.networking.k8s.io_httproutes.yaml&#34;&gt;Experimental channel Gateway API HTTPRoute yaml&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&#39;s an example of a simple cross-origin configuration:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;http-route-cors&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;http-gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matches&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;path&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PathPrefix&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;value&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/resource/foo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;filters&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CORS&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cors&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowOrigins&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;*&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowMethods&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- GET&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- HEAD&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- POST&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowHeaders&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- Accept&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- Accept-Language&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- Content-Language&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- Content-Type&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- Range&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backendRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Service&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;http-route-cors&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this case, the Gateway returns an &lt;em&gt;origin header&lt;/em&gt;
(&lt;code&gt;Access-Control-Allow-Origin&lt;/code&gt;) of &amp;quot;*&amp;quot;, which means that the
requested resource can be referenced from any origin, a &lt;em&gt;methods header&lt;/em&gt;
(&lt;code&gt;Access-Control-Allow-Methods&lt;/code&gt;) that permits the &lt;code&gt;GET&lt;/code&gt;, &lt;code&gt;HEAD&lt;/code&gt;, and &lt;code&gt;POST&lt;/code&gt;
verbs, and a &lt;em&gt;headers header&lt;/em&gt; (&lt;code&gt;Access-Control-Allow-Headers&lt;/code&gt;) allowing &lt;code&gt;Accept&lt;/code&gt;, &lt;code&gt;Accept-Language&lt;/code&gt;,
&lt;code&gt;Content-Language&lt;/code&gt;, &lt;code&gt;Content-Type&lt;/code&gt;, and &lt;code&gt;Range&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;HTTP/1.1 200 OK
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Access-Control-Allow-Origin: *
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Access-Control-Allow-Methods: GET, HEAD, POST
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Access-Control-Allow-Headers: Accept,Accept-Language,Content-Language,Content-Type,Range
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The complete list of fields in the new CORS filter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;allowOrigins&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;allowMethods&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;allowHeaders&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;allowCredentials&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;exposeHeaders&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;maxAge&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See &lt;a href=&#34;https://fetch.spec.whatwg.org/#http-cors-protocol&#34;&gt;CORS protocol&lt;/a&gt; for details.&lt;/p&gt;
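The fields beyond those shown in the example above follow the same pattern. As an illustrative sketch (the field names come from the list above; the values here are examples, not taken from the release notes):

```yaml
# Hedged sketch of a CORS filter using the remaining fields.
# Field names are from the filter's field list; values are illustrative only.
filters:
- type: CORS
  cors:
    allowOrigins:
    - https://example.com
    allowMethods:
    - GET
    allowCredentials: true    # permit credentials (e.g. cookies) on cross-origin requests
    exposeHeaders:            # response headers the browser is allowed to read
    - Content-Length
    maxAge: 3600              # seconds a preflight response may be cached
```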
&lt;h3 id=&#34;XListenerSet&#34;&gt;XListenerSets (standardized mechanism for Listener and Gateway merging)&lt;/h3&gt;
&lt;p&gt;Lead: &lt;a href=&#34;https://github.com/dprotaso&#34;&gt;Dave Protasowski&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-1713: &lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/pull/3213&#34;&gt;ListenerSets - Standard Mechanism to Merge Multiple Gateways&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This release adds a new experimental API kind, XListenerSet, that allows a
shared list of &lt;em&gt;listeners&lt;/em&gt; to be attached to one or more parent Gateway(s).  In
addition, it expands upon the existing suggestion that Gateway API implementations
may merge configuration from multiple Gateway objects.  It also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;adds a new field &lt;code&gt;allowedListeners&lt;/code&gt; to the &lt;code&gt;.spec&lt;/code&gt; of a Gateway. The
&lt;code&gt;allowedListeners&lt;/code&gt; field defines from which Namespaces to select XListenerSets
that are allowed to attach to that Gateway: Same, All, None, or Selector based.&lt;/li&gt;
&lt;li&gt;allows the total number of listeners attached to a Gateway to grow beyond the
previous per-Gateway maximum of 64, because each attached XListenerSet
contributes its own listeners.&lt;/li&gt;
&lt;li&gt;allows the delegation of listener configuration, such as TLS, to applications in
other namespaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To be able to use experimental XListenerSet, you need to install the
&lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/main/config/crd/experimental/gateway.networking.x-k8s.io_xlistenersets.yaml&#34;&gt;Experimental channel Gateway API XListenerSet yaml&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following example shows a Gateway with an HTTP listener and two child HTTPS
XListenerSets with unique hostnames and certificates.  The combined set of listeners
attached to the Gateway includes the two additional HTTPS listeners in the
XListenerSets that attach to the Gateway.  This example illustrates the
delegation of listener TLS config to application owners in different namespaces
(&amp;quot;store&amp;quot; and &amp;quot;app&amp;quot;).  The HTTPRoute has both the Gateway listener named &amp;quot;foo&amp;quot; and
one XListenerSet listener named &amp;quot;second&amp;quot; as &lt;code&gt;parentRefs&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;prod-external&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;infra&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gatewayClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowedListeners&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;from&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;All&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;listeners&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;80&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.x-k8s.io/v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;XListenerSet&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;store&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;store&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;prod-external&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;listeners&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;first&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;first.foo.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPS&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;443&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tls&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;mode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Terminate&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;certificateRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Secret&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;first-workload-cert&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.x-k8s.io/v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;XListenerSet&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;prod-external&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;listeners&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;second&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostname&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;second.foo.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPS&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;443&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tls&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;mode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Terminate&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;certificateRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Secret&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;second-workload-cert&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HTTPRoute&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;httproute-example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parentRefs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;XListenerSet&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;sectionName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;second&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;parent-gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Gateway&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;sectionName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;foo&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Each listener in a Gateway must have a unique combination of &lt;code&gt;port&lt;/code&gt;, &lt;code&gt;protocol&lt;/code&gt;,
and (where the protocol supports it) &lt;code&gt;hostname&lt;/code&gt; so that all listeners are
&lt;strong&gt;compatible&lt;/strong&gt; and do not conflict over which traffic they should receive.&lt;/p&gt;
&lt;p&gt;Furthermore, implementations can &lt;em&gt;merge&lt;/em&gt; separate Gateways into a single set of
listener addresses if all listeners across those Gateways are compatible.  The
management of merged listeners was under-specified in releases prior to v1.3.0.&lt;/p&gt;
&lt;p&gt;With the new feature, the specification for merging is expanded.  Implementations
must treat a parent Gateway as having the merged list of all listeners from the
Gateway itself and from its attached XListenerSets, and must validate that merged
list exactly as if it were defined in a single Gateway. Within a single
Gateway, listeners are ordered using the following precedence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Single Listeners (not part of an XListenerSet) first,&lt;/li&gt;
&lt;li&gt;Remaining listeners ordered by:
&lt;ul&gt;
&lt;li&gt;object creation time (oldest first), and, if two listeners are defined in
objects that share the same timestamp, then&lt;/li&gt;
&lt;li&gt;alphabetically by &amp;quot;{namespace}/{name of listener}&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
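&lt;p&gt;As a sketch of that precedence (the listener names and hostnames below are illustrative, not taken from the example above), an implementation would validate the merged list in this order:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;# Hypothetical merged view of one Gateway and one attached XListenerSet.
# 1. Listeners defined inline on the Gateway sort first.
- name: foo                 # defined on the Gateway itself
  protocol: HTTPS
  port: 443
  hostname: foo.example.com
# 2. XListenerSet listeners follow, ordered by object creation time
#    (oldest first), then alphabetically by &amp;quot;{namespace}/{listener name}&amp;quot;.
- name: second              # from an XListenerSet in another namespace
  protocol: HTTPS
  port: 443
  hostname: second.example.com
&lt;/code&gt;&lt;/pre&gt;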
&lt;h3 id=&#34;XBackendTrafficPolicy&#34;&gt;Retry budgets (XBackendTrafficPolicy)&lt;/h3&gt;
&lt;p&gt;Leads: &lt;a href=&#34;https://github.com/ericdbishop&#34;&gt;Eric Bishop&lt;/a&gt;, &lt;a href=&#34;https://github.com/mikemorris&#34;&gt;Mike Morris&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GEP-3388: &lt;a href=&#34;https://gateway-api.sigs.k8s.io/geps/gep-3388&#34;&gt;Retry Budgets&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This feature allows you to configure a &lt;em&gt;retry budget&lt;/em&gt; across all endpoints
of a destination Service, limiting additional client-side retries once a configured
threshold is reached. When configuring the budget, you can specify the maximum
percentage of active requests that may be retries, as well as the interval over
which requests are considered when calculating the retry threshold. During the
development of this specification, the existing experimental API kind
BackendLBPolicy was replaced by a new experimental API kind, XBackendTrafficPolicy,
in the interest of reducing the proliferation of policy resources that shared
common traits.&lt;/p&gt;
&lt;p&gt;To be able to use experimental retry budgets, you need to install the
&lt;a href=&#34;https://github.com/kubernetes-sigs/gateway-api/blob/main/config/crd/experimental/gateway.networking.x-k8s.io_xbackendtrafficpolicies.yaml&#34;&gt;Experimental channel Gateway API XBackendTrafficPolicy yaml&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following example shows an XBackendTrafficPolicy whose
&lt;code&gt;retryConstraint&lt;/code&gt; defines a budget that limits retries to at most 20%
of requests over a 10-second interval, while always permitting at least 3 retries
per second.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;gateway.networking.x-k8s.io/v1alpha1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;XBackendTrafficPolicy&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;traffic-policy-example&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;retryConstraint&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;budget&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;percent&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;20&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;interval&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;10s&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;minRetryRate&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;count&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;interval&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;1s&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h2&gt;
&lt;p&gt;Unlike other Kubernetes APIs, you don&#39;t need to upgrade to the latest version of
Kubernetes to get the latest version of Gateway API. As long as you&#39;re running
Kubernetes 1.26 or later, you&#39;ll be able to get up and running with this version
of Gateway API.&lt;/p&gt;
&lt;p&gt;To try out the API, follow the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides/&#34;&gt;Getting Started Guide&lt;/a&gt;.
As of this writing, four implementations are already conformant with Gateway API
v1.3 experimental channel features. In alphabetical order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/airlock/microgateway/releases/tag/4.6.0&#34;&gt;Airlock Microgateway 4.6&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/cilium/cilium&#34;&gt;Cilium main&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/envoyproxy/gateway/releases/tag/v1.4.0&#34;&gt;Envoy Gateway v1.4.0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://istio.io&#34;&gt;Istio 1.27-dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;Wondering when a feature will be added?  There are lots of opportunities to get
involved and help define the future of Kubernetes routing APIs for both ingress
and service mesh.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check out the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/guides&#34;&gt;user guides&lt;/a&gt; to see what use-cases can be addressed.&lt;/li&gt;
&lt;li&gt;Try out one of the &lt;a href=&#34;https://gateway-api.sigs.k8s.io/implementations/&#34;&gt;existing Gateway controllers&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Or &lt;a href=&#34;https://gateway-api.sigs.k8s.io/contributing/&#34;&gt;join us in the community&lt;/a&gt;
and help us build the future of Gateway API together!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The maintainers would like to thank &lt;em&gt;everyone&lt;/em&gt; who&#39;s contributed to Gateway
API, whether in the form of commits to the repo, discussion, ideas, or general
support. We could never have made this kind of progress without the support of
this dedicated and active community.&lt;/p&gt;
&lt;h2 id=&#34;related-kubernetes-blog-articles&#34;&gt;Related Kubernetes blog articles&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/11/21/gateway-api-v1-2/&#34;&gt;Gateway API v1.2: WebSockets, Timeouts, Retries, and More&lt;/a&gt;
(November 2024)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/05/09/gateway-api-v1-1/&#34;&gt;Gateway API v1.1: Service mesh, GRPCRoute, and a whole lot more&lt;/a&gt;
(May 2024)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2023/11/28/gateway-api-ga/&#34;&gt;New Experimental Features in Gateway API v1.0&lt;/a&gt;
(November 2023)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2023/10/31/gateway-api-ga/&#34;&gt;Gateway API v1.0: GA Release&lt;/a&gt;
(October 2023)&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Spotlight on Policy Working Group</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/22/wg-policy-spotlight-2025/</link>
      <pubDate>Thu, 22 May 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/22/wg-policy-spotlight-2025/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let&#39;s take a look back at the work of the Policy Working Group.&lt;/p&gt;
&lt;p&gt;The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.&lt;/p&gt;
&lt;p&gt;Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.&lt;/p&gt;
&lt;p&gt;This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/JimBugwadia&#34;&gt;Jim Bugwadia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/poonam-lamba&#34;&gt;Poonam Lamba&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://twitter.com/sudermanjr&#34;&gt;Andy Suderman&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Interviewed by &lt;a href=&#34;https://twitter.com/arujjval&#34;&gt;Arujjwal Negi&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;These co-chairs explained what the Policy Working Group was all about.&lt;/p&gt;
&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Hello, thank you for the time! Let’s start with some introductions, could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Jim Bugwadia&lt;/strong&gt;: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Andy Suderman&lt;/strong&gt;: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds&#39; journey into the policy space and my involvement in the Policy Working Group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Poonam Lamba&lt;/strong&gt;: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I&#39;ve had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am involved with the Gatekeeper project, contributed to the Policy WG for over 2 years, and served as a co-chair for the group.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Responses to the following questions represent an amalgamation of insights from the former co-chairs.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;about-working-groups&#34;&gt;About Working Groups&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they&#39;ve achieved their objective. Generally, working groups don&#39;t own code or have long-term responsibility for managing a particular area of the Kubernetes project.&lt;/p&gt;
&lt;p&gt;(To know more about SIGs, visit the &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;list of Special Interest Groups&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You mentioned that Working Groups involve multiple SIGS. What SIGS was the Policy WG closely involved with, and how did you coordinate with them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The group collaborated closely with Kubernetes SIG Auth throughout its existence, and more recently it also worked with SIG Security after that SIG&#39;s formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we utilized other community forums to maintain open lines of communication and ensure our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.&lt;/p&gt;
&lt;h2 id=&#34;policy-wg&#34;&gt;Policy WG&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Why was the Policy Working Group created?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We recognized that Kubernetes enables a broad set of use cases because it is powered by a highly declarative, fine-grained, and extensible configuration management system. We&#39;ve observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.&lt;/p&gt;
&lt;p&gt;Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Can you give me an idea of the work you did in the group?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We worked on several Kubernetes policy-related projects. Our initiatives included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We worked on a Kubernetes Enhancement Proposal (KEP) for the Kubernetes Policy Reports API. This aims to standardize how policy reports are generated and consumed within the Kubernetes ecosystem.&lt;/li&gt;
&lt;li&gt;We conducted a CNCF survey to better understand policy usage in the Kubernetes space. This helped gauge the practices and needs across the community at the time.&lt;/li&gt;
&lt;li&gt;We wrote a paper that will guide users in achieving PCI-DSS compliance for containers. This is intended to help organizations meet important security standards in their Kubernetes environments.&lt;/li&gt;
&lt;li&gt;We also worked on a paper highlighting how shifting security down can benefit organizations. This focuses on the advantages of implementing security measures earlier in the development and deployment process.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.&lt;/p&gt;
&lt;p&gt;To accomplish this, we updated the Kubernetes documentation (&lt;a href=&#34;https://kubernetes.io/docs/concepts/policy&#34;&gt;Policies | Kubernetes&lt;/a&gt;), produced several whitepapers (&lt;a href=&#34;https://github.com/kubernetes/sig-security/blob/main/sig-security-docs/papers/policy/CNCF_Kubernetes_Policy_Management_WhitePaper_v1.pdf&#34;&gt;Kubernetes Policy Management&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes/sig-security/blob/main/sig-security-docs/papers/policy_grc/Kubernetes_Policy_WG_Paper_v1_101123.pdf&#34;&gt;Kubernetes GRC&lt;/a&gt;), and created the Policy Reports API (&lt;a href=&#34;https://htmlpreview.github.io/?https://github.com/kubernetes-sigs/wg-policy-prototypes/blob/master/policy-report/docs/index.html&#34;&gt;API reference&lt;/a&gt;), which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major remaining milestone for the Policy WG was to promote the Policy Reports API to a SIG-level API or to find it another stable home.&lt;/p&gt;
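&lt;p&gt;To give a concrete sense of what that standardization looks like, here is a minimal PolicyReport object of the kind such tools emit (the policy names and messages below are illustrative, and the schema shown is the v1alpha2 version from the wg-policy-prototypes repository):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
  name: example-report          # illustrative name
  namespace: default
summary:
  pass: 1
  fail: 1
results:
- policy: require-labels        # illustrative policy
  result: pass
  message: all required labels are present
- policy: disallow-latest-tag   # illustrative policy
  result: fail
  message: container image uses a mutable tag
&lt;/code&gt;&lt;/pre&gt;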
&lt;p&gt;Beyond that, as &lt;a href=&#34;https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/&#34;&gt;ValidatingAdmissionPolicy&lt;/a&gt; and &lt;a href=&#34;https://kubernetes.io/docs/reference/access-authn-authz/mutating-admission-policy/&#34;&gt;MutatingAdmissionPolicy&lt;/a&gt; approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.&lt;/p&gt;
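&lt;p&gt;As a brief illustration of the built-in approach (the resource name and replica limit below are invented for this example), a ValidatingAdmissionPolicy expresses its rule directly as a CEL expression and requires no extra controller, though it only takes effect once referenced by a ValidatingAdmissionPolicyBinding:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-yaml&#34;&gt;apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: demo-replica-limit      # illustrative name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [&amp;quot;apps&amp;quot;]
      apiVersions: [&amp;quot;v1&amp;quot;]
      operations: [&amp;quot;CREATE&amp;quot;, &amp;quot;UPDATE&amp;quot;]
      resources: [&amp;quot;deployments&amp;quot;]
  validations:
  - expression: &amp;quot;object.spec.replicas &amp;lt;= 5&amp;quot;   # illustrative limit
    message: &amp;quot;replica count must not exceed 5&amp;quot;
&lt;/code&gt;&lt;/pre&gt;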
&lt;h2 id=&#34;challenges&#34;&gt;Challenges&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What were some of the major challenges that the Policy Working Group worked on?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;During our work in the Policy Working Group, we encountered several challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;One of the main issues we faced was finding time to consistently contribute. Given that many of us have other professional commitments, it can be difficult to dedicate regular time to the working group&#39;s initiatives.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Another challenge we experienced was related to our consensus-driven model. While this approach ensures that all voices are heard, it can sometimes lead to slower decision-making processes. We valued thorough discussion and agreement, but this can occasionally delay progress on our projects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We&#39;ve also encountered occasional differences of opinion among group members. These situations require careful navigation to ensure that we maintain a collaborative and productive environment while addressing diverse viewpoints.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Lastly, we&#39;ve noticed that newcomers to the group may find it difficult to contribute effectively without consistent attendance at our meetings. The complex nature of our work often requires ongoing context, which can be challenging for those who aren&#39;t able to participate regularly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are no easy answers, but having more contributors and maintainers greatly helps! Overall, the CNCF community is great to work with and very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.&lt;/p&gt;
&lt;p&gt;It often takes a few meetings to fully understand the discussions, so don&#39;t feel discouraged if you don&#39;t grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.&lt;/p&gt;
&lt;p&gt;Additionally, differences of opinion were valued and encouraged within the Policy WG. We adhered to the CNCF core values and resolved disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group&#39;s aims and workings. You can get more info about Working Groups &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: In-Place Pod Resize Graduated to Beta</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/16/kubernetes-v1-33-in-place-pod-resize-beta/</link>
      <pubDate>Fri, 16 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/16/kubernetes-v1-33-in-place-pod-resize-beta/</guid>
      <description>
        
        
        &lt;p&gt;On behalf of the Kubernetes project, I am excited to announce that the &lt;strong&gt;in-place Pod resize&lt;/strong&gt; feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, has graduated to &lt;strong&gt;Beta&lt;/strong&gt; and will be enabled by default in the Kubernetes v1.33 release! This marks a significant milestone in making resource management for Kubernetes workloads more flexible and less disruptive.&lt;/p&gt;
&lt;h2 id=&#34;what-is-in-place-pod-resize&#34;&gt;What is in-place Pod resize?&lt;/h2&gt;
&lt;p&gt;Traditionally, changing the CPU or memory resources allocated to a container required restarting the Pod. While acceptable for many stateless applications, this could be disruptive for stateful services, batch jobs, or any workloads sensitive to restarts.&lt;/p&gt;
&lt;p&gt;In-place Pod resizing allows you to change the CPU and memory requests and limits assigned to containers within a &lt;em&gt;running&lt;/em&gt; Pod, often without requiring a container restart.&lt;/p&gt;
&lt;p&gt;Here&#39;s the core idea:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;spec.containers[*].resources&lt;/code&gt; field in a Pod specification now represents the &lt;em&gt;desired&lt;/em&gt; resources and is mutable for CPU and memory.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;status.containerStatuses[*].resources&lt;/code&gt; field reflects the &lt;em&gt;actual&lt;/em&gt; resources currently configured on a running container.&lt;/li&gt;
&lt;li&gt;You can trigger a resize by updating the desired resources in the Pod spec via the new &lt;code&gt;resize&lt;/code&gt; subresource.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can try it out on a v1.33 Kubernetes cluster by using kubectl to edit a Pod (requires &lt;code&gt;kubectl&lt;/code&gt; v1.32+):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl edit pod &amp;lt;pod-name&amp;gt; --subresource resize
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For detailed usage instructions and examples, please refer to the official Kubernetes documentation:
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/resize-container-resources/&#34;&gt;Resize CPU and Memory Resources assigned to Containers&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;why-does-in-place-pod-resize-matter&#34;&gt;Why does in-place Pod resize matter?&lt;/h2&gt;
&lt;p&gt;Kubernetes still excels at scaling workloads horizontally (adding or removing replicas), but in-place Pod resizing unlocks several key benefits for vertical scaling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Disruption:&lt;/strong&gt; Stateful applications, long-running batch jobs, and sensitive workloads can have their resources adjusted without suffering the downtime or state loss associated with a Pod restart.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Resource Utilization:&lt;/strong&gt; Scale down over-provisioned Pods without disruption, freeing up resources in the cluster. Conversely, provide more resources to Pods under heavy load without needing a restart.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster Scaling:&lt;/strong&gt; Address transient resource needs more quickly. For example, Java applications often need more CPU during startup than during steady-state operation. Start with a higher CPU allocation and resize down later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-s-changed-between-alpha-and-beta&#34;&gt;What&#39;s changed between Alpha and Beta?&lt;/h2&gt;
&lt;p&gt;Since the alpha release in v1.27, significant work has gone into maturing the feature, improving its stability, and refining the user experience based on feedback and further development. Here are the key changes:&lt;/p&gt;
&lt;h3 id=&#34;notable-user-facing-changes&#34;&gt;Notable user-facing changes&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;resize&lt;/code&gt; Subresource:&lt;/strong&gt; Modifying Pod resources must now be done via the Pod&#39;s &lt;code&gt;resize&lt;/code&gt; subresource (&lt;code&gt;kubectl patch pod &amp;lt;name&amp;gt; --subresource resize ...&lt;/code&gt;). &lt;code&gt;kubectl&lt;/code&gt; versions v1.32+ support this argument.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resize Status via Conditions:&lt;/strong&gt; The old &lt;code&gt;status.resize&lt;/code&gt; field is deprecated. The status of a resize operation is now exposed via two Pod conditions:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PodResizePending&lt;/code&gt;: Indicates the Kubelet cannot grant the resize immediately (e.g., &lt;code&gt;reason: Deferred&lt;/code&gt; if temporarily unable, &lt;code&gt;reason: Infeasible&lt;/code&gt; if impossible on the node).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PodResizeInProgress&lt;/code&gt;: Indicates the resize is accepted and being applied. Errors encountered during this phase are now reported in this condition&#39;s message with &lt;code&gt;reason: Error&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sidecar Support:&lt;/strong&gt; Resizing &lt;a class=&#39;glossary-tooltip&#39; title=&#39;An auxiliary container that stays running throughout the lifecycle of a Pod.&#39; data-toggle=&#39;tooltip&#39; data-placement=&#39;top&#39; href=&#39;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/sidecar-containers/&#39; target=&#39;_blank&#39; aria-label=&#39;sidecar containers&#39;&gt;sidecar containers&lt;/a&gt; in-place is now supported.&lt;/li&gt;
&lt;/ul&gt;
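&lt;p&gt;Tying these changes together, a resize can be requested and tracked entirely from the command line. The sketch below is illustrative only: the Pod name &lt;code&gt;my-app&lt;/code&gt; and container name &lt;code&gt;app&lt;/code&gt; are placeholders, and it assumes a running v1.33 cluster with &lt;code&gt;kubectl&lt;/code&gt; v1.32+:&lt;/p&gt;

```shell
# Request more CPU through the resize subresource.
# "my-app" and "app" are placeholder names for illustration.
kubectl patch pod my-app --subresource resize --patch \
  '{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"800m"},"limits":{"cpu":"800m"}}}]}}'

# Follow the resize status via the new Pod conditions.
kubectl get pod my-app -o jsonpath='{.status.conditions[?(@.type=="PodResizePending")]}'
kubectl get pod my-app -o jsonpath='{.status.conditions[?(@.type=="PodResizeInProgress")]}'

# Compare desired (spec) vs. actual (status) resources once the resize settles.
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].resources}'
```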
&lt;h3 id=&#34;stability-and-reliability-enhancements&#34;&gt;Stability and reliability enhancements&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Refined Allocated Resources Management:&lt;/strong&gt; The allocation management logic within the Kubelet was significantly reworked, making it more consistent and robust. The changes eliminated whole classes of bugs, and greatly improved the reliability of in-place Pod resize.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Checkpointing &amp;amp; State Tracking:&lt;/strong&gt; A more robust system for tracking &amp;quot;allocated&amp;quot; and &amp;quot;actuated&amp;quot; resources was implemented, using new checkpoint files (&lt;code&gt;allocated_pods_state&lt;/code&gt;, &lt;code&gt;actuated_pods_state&lt;/code&gt;) to reliably manage resize state across Kubelet restarts and handle edge cases where runtime-reported resources differ from requested ones. Several bugs related to checkpointing and state restoration were fixed. Checkpointing efficiency was also improved.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster Resize Detection:&lt;/strong&gt; Enhancements to the Kubelet&#39;s Pod Lifecycle Event Generator (PLEG) allow the Kubelet to respond to and complete resizes much more quickly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced CRI Integration:&lt;/strong&gt; A new &lt;code&gt;UpdatePodSandboxResources&lt;/code&gt; CRI call was added to better inform runtimes and plugins (like NRI) about Pod-level resource changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Numerous Bug Fixes:&lt;/strong&gt; Addressed issues related to systemd cgroup drivers, handling of containers without limits, CPU minimum share calculations, container restart backoffs, error propagation, test stability, and more.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;Graduating to Beta means the feature is ready for broader adoption, but development doesn&#39;t stop here! Here&#39;s what the community is focusing on next:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stability and Productionization:&lt;/strong&gt; Continued focus on hardening the feature, improving performance, and ensuring it is robust for production environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Addressing Limitations:&lt;/strong&gt; Working towards relaxing some of the current limitations noted in the documentation, such as allowing memory limit decreases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/autoscaling/#scaling-workloads-vertically&#34;&gt;VerticalPodAutoscaler&lt;/a&gt; (VPA) Integration:&lt;/strong&gt; Work to enable VPA to leverage in-place Pod resize is already underway. A new &lt;code&gt;InPlaceOrRecreate&lt;/code&gt; update mode will allow it to attempt non-disruptive resizes first, or fall back to recreation if needed. This will allow users to benefit from VPA&#39;s recommendations with significantly less disruption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Feedback:&lt;/strong&gt; Gathering feedback from users adopting the beta feature is crucial for prioritizing further enhancements and addressing any uncovered issues or bugs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;getting-started-and-providing-feedback&#34;&gt;Getting started and providing feedback&lt;/h2&gt;
&lt;p&gt;With the &lt;code&gt;InPlacePodVerticalScaling&lt;/code&gt; feature gate enabled by default in v1.33, you can start experimenting with in-place Pod resizing right away!&lt;/p&gt;
&lt;p&gt;Refer to the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/resize-container-resources/&#34;&gt;documentation&lt;/a&gt; for detailed guides and examples.&lt;/p&gt;
&lt;p&gt;As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels (GitHub issues, mailing lists, Slack). You can also review the &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources&#34;&gt;KEP-1287: In-place Update of Pod Resources&lt;/a&gt; for the full in-depth design details.&lt;/p&gt;
&lt;p&gt;We look forward to seeing how the community leverages in-place Pod resize to build more efficient and resilient applications on Kubernetes!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Announcing etcd v3.6.0</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/</link>
      <pubDate>Thu, 15 May 2025 16:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;This announcement originally &lt;a href=&#34;https://etcd.io/blog/2025/announcing-etcd-3.6/&#34;&gt;appeared&lt;/a&gt; on the etcd blog.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Today, we are releasing &lt;a href=&#34;https://github.com/etcd-io/etcd/releases/tag/v3.6.0&#34;&gt;etcd v3.6.0&lt;/a&gt;, the first minor release since etcd v3.5.0 on June 15, 2021. This release
introduces several new features, makes significant progress on long-standing efforts like downgrade support and
migration to v3store, and addresses numerous critical &amp;amp; major issues. It also includes major optimizations in
memory usage, improving efficiency and performance.&lt;/p&gt;
&lt;p&gt;In addition to the features of v3.6.0, etcd has joined Kubernetes as a SIG (sig-etcd), enabling us to improve
project sustainability. We&#39;ve introduced systematic robustness testing to ensure correctness and reliability.
Through the etcd-operator Working Group, we plan to improve usability as well.&lt;/p&gt;
&lt;p&gt;What follows are the most significant changes introduced in etcd v3.6.0, along with a discussion of the
roadmap for future development. For a detailed list of changes, please refer to the &lt;a href=&#34;https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.6.md&#34;&gt;CHANGELOG-3.6&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A heartfelt thank you to all the contributors who made this release possible!&lt;/p&gt;
&lt;h2 id=&#34;security&#34;&gt;Security&lt;/h2&gt;
&lt;p&gt;etcd takes security seriously. To enhance software security in v3.6.0, we have improved our workflow checks by
integrating &lt;code&gt;govulncheck&lt;/code&gt; to scan the source code and &lt;code&gt;trivy&lt;/code&gt; to scan container images. These improvements
have also been backported to supported stable releases.&lt;/p&gt;
&lt;p&gt;etcd continues to follow the &lt;a href=&#34;https://github.com/etcd-io/etcd/blob/main/security/security-release-process.md&#34;&gt;Security Release Process&lt;/a&gt; to ensure vulnerabilities are properly managed and addressed.&lt;/p&gt;
&lt;h2 id=&#34;features&#34;&gt;Features&lt;/h2&gt;
&lt;h3 id=&#34;migration-to-v3store&#34;&gt;Migration to v3store&lt;/h3&gt;
&lt;p&gt;The v2store has been deprecated since etcd v3.4 but could still be enabled via &lt;code&gt;--enable-v2&lt;/code&gt;. It remained the source of
truth for membership data. In etcd v3.6.0, v2store can no longer be enabled as the &lt;code&gt;--enable-v2&lt;/code&gt; flag has been removed,
and v3store has become the sole source of truth for membership data.&lt;/p&gt;
&lt;p&gt;While v2store still exists in v3.6.0, etcd will fail to start if it contains any data other than membership information.
To assist with migration, etcd v3.5.18+ provides the &lt;code&gt;etcdutl check v2store&lt;/code&gt; command, which verifies that v2store
contains only membership data (see &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/19113&#34;&gt;PR 19113&lt;/a&gt;).&lt;/p&gt;
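&lt;p&gt;As a pre-upgrade sketch (the data directory path below is a placeholder, and the exact flag spelling should be confirmed against your &lt;code&gt;etcdutl&lt;/code&gt; version):&lt;/p&gt;

```shell
# Offline check with etcdutl from v3.5.18 or later; the path is a placeholder.
etcdutl check v2store --data-dir /var/lib/etcd

# A clean result means v2store holds only membership data,
# so the member can safely be started with v3.6 binaries.
```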
&lt;p&gt;Compared to v2store, v3store offers better performance and transactional support. It is also the actively maintained
storage engine moving forward.&lt;/p&gt;
&lt;p&gt;The removal of v2store is still ongoing and is tracked in &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/12913&#34;&gt;issues/12913&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;downgrade&#34;&gt;Downgrade&lt;/h3&gt;
&lt;p&gt;etcd v3.6.0 is the first version to fully support downgrade. The effort for this downgrade task spans
both versions 3.5 and 3.6, and all related work is tracked in &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/11716&#34;&gt;issues/11716&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At a high level, the process involves migrating the data schema to the target version (e.g., v3.5),
followed by a rolling downgrade.&lt;/p&gt;
&lt;p&gt;First, ensure the cluster is healthy and take a snapshot backup. Then verify that the downgrade target is valid:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ etcdctl downgrade validate 3.5
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Downgrade validate success, cluster version 3.6
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If the downgrade is valid, enable downgrade mode:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ etcdctl downgrade &lt;span style=&#34;color:#a2f&#34;&gt;enable&lt;/span&gt; 3.5
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Downgrade &lt;span style=&#34;color:#a2f&#34;&gt;enable&lt;/span&gt; success, cluster version 3.6
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;etcd will then migrate the data schema in the background. Once complete, proceed with the rolling downgrade.&lt;/p&gt;
&lt;p&gt;For details, refer to the &lt;a href=&#34;https://etcd.io/docs/v3.6/downgrades/downgrade_3_6/&#34;&gt;Downgrade-3.6&lt;/a&gt; guide.&lt;/p&gt;
&lt;h3 id=&#34;feature-gates&#34;&gt;Feature gates&lt;/h3&gt;
&lt;p&gt;In etcd v3.6.0, we introduced Kubernetes-style feature gates for managing new features. Previously, we
indicated unstable features through the &lt;code&gt;--experimental&lt;/code&gt; prefix in feature flag names. The prefix was removed
once the feature was stable, causing a breaking change. Now, features start in Alpha, progress
to Beta, then GA, or are deprecated along the way. This ensures a much smoother upgrade and downgrade experience for users.&lt;/p&gt;
&lt;p&gt;See &lt;a href=&#34;https://etcd.io/docs/v3.6/feature-gates/&#34;&gt;feature-gates&lt;/a&gt; for details.&lt;/p&gt;
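&lt;p&gt;As an illustrative sketch, a gate can be toggled with the &lt;code&gt;--feature-gates&lt;/code&gt; flag, mirroring the Kubernetes convention. The feature name below is only an example; consult the feature-gates documentation for the current list of gates and their maturity levels:&lt;/p&gt;

```shell
# Enable a single feature gate by name; multiple gates are comma-separated.
# The gate name here is an example, not a recommendation.
etcd --feature-gates=StopGRPCServiceOnDefrag=true
```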
&lt;h3 id=&#34;livezreadyz-checks&#34;&gt;livez / readyz checks&lt;/h3&gt;
&lt;p&gt;etcd now supports &lt;code&gt;/livez&lt;/code&gt; and &lt;code&gt;/readyz&lt;/code&gt; endpoints, aligning with Kubernetes&#39; Liveness and Readiness probes.
&lt;code&gt;/livez&lt;/code&gt; indicates whether the etcd instance is alive, while &lt;code&gt;/readyz&lt;/code&gt; indicates when it is ready to serve requests.
This feature has also been backported to release-3.5 (starting from v3.5.11) and release-3.4 (starting from v3.4.29).
See &lt;a href=&#34;https://etcd.io/docs/v3.6/op-guide/monitoring/&#34;&gt;livez/readyz&lt;/a&gt; for details.&lt;/p&gt;
&lt;p&gt;The existing &lt;code&gt;/health&lt;/code&gt; endpoint remains functional. &lt;code&gt;/livez&lt;/code&gt; is similar to &lt;code&gt;/health?serializable=true&lt;/code&gt;, while
&lt;code&gt;/readyz&lt;/code&gt; is similar to &lt;code&gt;/health&lt;/code&gt; or &lt;code&gt;/health?serializable=false&lt;/code&gt;. However, the &lt;code&gt;/livez&lt;/code&gt; and &lt;code&gt;/readyz&lt;/code&gt;
endpoints provide clearer semantics and are easier to understand.&lt;/p&gt;
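&lt;p&gt;A quick way to try the new endpoints against a local member (the address and default client port 2379 are assumed):&lt;/p&gt;

```shell
# Liveness: is the etcd process alive?
curl -i http://localhost:2379/livez

# Readiness: is the member ready to serve requests?
curl -i http://localhost:2379/readyz

# The endpoints are assumed to accept ?verbose for per-check details,
# following the Kubernetes-style health check convention.
curl http://localhost:2379/readyz?verbose
```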
&lt;h3 id=&#34;v3discovery&#34;&gt;v3discovery&lt;/h3&gt;
&lt;p&gt;In etcd v3.6.0, the new discovery protocol &lt;a href=&#34;https://etcd.io/docs/v3.6/dev-internal/discovery_protocol/&#34;&gt;v3discovery&lt;/a&gt; was introduced, based on clientv3.
It facilitates the discovery of all cluster members during the bootstrap phase.&lt;/p&gt;
&lt;p&gt;The previous &lt;a href=&#34;https://etcd.io/docs/v3.5/dev-internal/discovery_protocol/&#34;&gt;v2discovery&lt;/a&gt; protocol, based on clientv2, has been deprecated. Additionally,
the public discovery service at &lt;a href=&#34;https://discovery.etcd.io/&#34;&gt;https://discovery.etcd.io/&lt;/a&gt;, which relied on v2discovery, is no longer maintained.&lt;/p&gt;
&lt;h2 id=&#34;performance&#34;&gt;Performance&lt;/h2&gt;
&lt;h3 id=&#34;memory&#34;&gt;Memory&lt;/h3&gt;
&lt;p&gt;In this release, we reduced average memory consumption by at least 50% (see Figure 1). This improvement is primarily due to two changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The default value of &lt;code&gt;--snapshot-count&lt;/code&gt; has been reduced from 100,000 in v3.5 to 10,000 in v3.6. As a result, etcd v3.6 now retains only about 10% of the history records compared to v3.5.&lt;/li&gt;
&lt;li&gt;Raft history is compacted more frequently, as introduced in &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/18825&#34;&gt;PR/18825&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/figure-1.png&#34;
         alt=&#34;Diagram of memory usage&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Memory usage comparison between etcd v3.5.20 and v3.6.0-rc.2 under different read/write ratios.
Each subplot shows the memory usage over time with a specific read/write ratio. The red line represents etcd
v3.5.20, while the teal line represents v3.6.0-rc.2. Across all tested ratios, v3.6.0-rc.2 exhibits lower and
more stable memory usage.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;throughput&#34;&gt;Throughput&lt;/h3&gt;
&lt;p&gt;Compared to v3.5, etcd v3.6 delivers an average performance improvement of approximately 10%
in both read and write throughput (see Figures 2, 3, 4, and 5). This improvement is not attributed to
any single major change, but rather the cumulative effect of multiple minor enhancements. One such
example is the optimization of the free page queries introduced in &lt;a href=&#34;https://github.com/etcd-io/bbolt/pull/419&#34;&gt;PR/419&lt;/a&gt;.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/figure-2.png&#34;
         alt=&#34;etcd read transaction performance with a high write ratio&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The
read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement
in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.21% to 25.59%.&lt;/em&gt;&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/figure-3.png&#34;
         alt=&#34;etcd read transaction performance with a high read ratio&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio.
The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in
read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 4.38% to 27.20%.&lt;/em&gt;&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/figure-4.png&#34;
         alt=&#34;etcd write transaction performance with a high write ratio&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The
read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement
in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 2.95% to 24.24%.&lt;/em&gt;&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/announcing-etcd-3.6/figure-5.png&#34;
         alt=&#34;etcd write transaction performance with a high read ratio&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio.
The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in
write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.86% to 28.37%.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;breaking-changes&#34;&gt;Breaking changes&lt;/h2&gt;
&lt;p&gt;This section highlights a few notable breaking changes. For a complete list, please refer to
the &lt;a href=&#34;https://etcd.io/docs/v3.6/upgrades/upgrade_3_6/&#34;&gt;Upgrade etcd from v3.5 to v3.6&lt;/a&gt; and the &lt;a href=&#34;https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.6.md&#34;&gt;CHANGELOG-3.6&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;old-binaries-are-incompatible-with-new-schema-versions&#34;&gt;Old binaries are incompatible with new schema versions&lt;/h3&gt;
&lt;p&gt;Old etcd binaries are not compatible with newer data schema versions. For example, etcd 3.5 cannot start with
data created by etcd 3.6, and etcd 3.4 cannot start with data created by either 3.5 or 3.6.&lt;/p&gt;
&lt;p&gt;When downgrading etcd, it&#39;s important to follow the documented downgrade procedure. Simply replacing
the binary or image will result in compatibility issues.&lt;/p&gt;
&lt;h3 id=&#34;peer-endpoints-no-longer-serve-client-requests&#34;&gt;Peer endpoints no longer serve client requests&lt;/h3&gt;
&lt;p&gt;Client endpoints (&lt;code&gt;--advertise-client-urls&lt;/code&gt;) are intended to serve client requests only, while peer
endpoints (&lt;code&gt;--initial-advertise-peer-urls&lt;/code&gt;) are intended solely for peer communication. However, due to an implementation
oversight, the peer endpoints were also able to handle client requests in etcd 3.4 and 3.5. This behavior was misleading and
encouraged incorrect usage patterns. In etcd 3.6, this misleading behavior was corrected via &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/13565&#34;&gt;PR/13565&lt;/a&gt;; peer endpoints
no longer serve client requests.&lt;/p&gt;
&lt;h3 id=&#34;clear-boundary-between-etcdctl-and-etcdutl&#34;&gt;Clear boundary between etcdctl and etcdutl&lt;/h3&gt;
&lt;p&gt;Both &lt;code&gt;etcdctl&lt;/code&gt; and &lt;code&gt;etcdutl&lt;/code&gt; are command line tools. &lt;code&gt;etcdutl&lt;/code&gt; is an offline utility designed to operate directly on
etcd data files, while &lt;code&gt;etcdctl&lt;/code&gt; is an online tool that interacts with etcd over a network. Previously, there were some
overlapping functionalities between the two, but these overlaps were removed in 3.6.0.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Removed &lt;code&gt;etcdctl defrag --data-dir&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;etcdctl defrag&lt;/code&gt; command now supports only online defragmentation and no longer supports offline defragmentation.
To perform offline defragmentation, use the &lt;code&gt;etcdutl defrag --data-dir&lt;/code&gt; command instead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Removed &lt;code&gt;etcdctl snapshot status&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;etcdctl&lt;/code&gt; no longer supports retrieving the status of a snapshot. Use the &lt;code&gt;etcdutl snapshot status&lt;/code&gt; command instead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Removed &lt;code&gt;etcdctl snapshot restore&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;etcdctl&lt;/code&gt; no longer supports restoring from a snapshot. Use the &lt;code&gt;etcdutl snapshot restore&lt;/code&gt; command instead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
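&lt;p&gt;In practice, the split looks like this (the endpoints and file paths below are placeholders):&lt;/p&gt;

```shell
# Offline operations: etcdutl works directly on data and snapshot files.
etcdutl defrag --data-dir /var/lib/etcd
etcdutl snapshot status backup.db
etcdutl snapshot restore backup.db --data-dir /var/lib/etcd-restored

# Online operation: etcdctl talks to a live member over the network.
etcdctl defrag --endpoints=http://localhost:2379
```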
&lt;h2 id=&#34;critical-bug-fixes&#34;&gt;Critical bug fixes&lt;/h2&gt;
&lt;p&gt;Correctness has always been a top priority for the etcd project. In the process of developing 3.6.0, we found and
fixed a few notable bugs that could lead to data inconsistency in specific cases. These fixes have been backported
to previous releases, but we believe they deserve special mention here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data Inconsistency when Crashing Under Load&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Previously, when etcd was applying data, it would update the consistent-index first, followed by committing the
data. However, these operations were not atomic. If etcd crashed in between, it could lead to data inconsistency
(see &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/13766&#34;&gt;issue/13766&lt;/a&gt;). The issue was introduced in v3.5.0, and fixed in v3.5.3 with &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/13854&#34;&gt;PR/13854&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Durability API guarantee broken in single node cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a client writes data and receives a success response, the data is expected to be persisted. However, the data might
be lost if etcd crashes immediately after sending the success response to the client. This was a legacy issue (see &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/14370&#34;&gt;issue/14370&lt;/a&gt;)
affecting all previous releases. It was addressed in v3.4.21 and v3.5.5 with &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/14400&#34;&gt;PR/14400&lt;/a&gt;, and fixed on the raft side in the
main branch (now release-3.6) with &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/14413&#34;&gt;PR/14413&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Revision Inconsistency when Crashing During Defragmentation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If etcd crashed during a defragmentation operation, then upon restart it might reapply
some entries which had already been applied, leading to a revision inconsistency
(see the discussions in &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/14685&#34;&gt;PR/14685&lt;/a&gt;). The issue was introduced in v3.5.0, and fixed in v3.5.6 with &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/14730&#34;&gt;PR/14730&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;upgrade-issue&#34;&gt;Upgrade issue&lt;/h2&gt;
&lt;p&gt;This section highlights a common issue (&lt;a href=&#34;https://github.com/etcd-io/etcd/issues/19557&#34;&gt;issues/19557&lt;/a&gt;) in the etcd v3.5 to v3.6 upgrade that may cause the upgrade
process to fail. For a complete upgrade guide, refer to &lt;a href=&#34;https://etcd.io/docs/v3.6/upgrades/upgrade_3_6/&#34;&gt;Upgrade etcd from v3.5 to v3.6&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The issue was introduced in etcd v3.5.1, and resolved in v3.5.20.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key takeaway&lt;/strong&gt;: users are required to first upgrade to etcd v3.5.20 (or a higher patch version) before upgrading
to etcd v3.6.0; otherwise, the upgrade may fail.&lt;/p&gt;
&lt;p&gt;For more background and technical context, see &lt;a href=&#34;https://etcd.io/blog/2025/upgrade_from_3.5_to_3.6_issue/&#34;&gt;upgrade_from_3.5_to_3.6_issue&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;testing&#34;&gt;Testing&lt;/h2&gt;
&lt;p&gt;We introduced &lt;a href=&#34;https://github.com/etcd-io/etcd/tree/main/tests/robustness&#34;&gt;robustness testing&lt;/a&gt; to verify correctness, which has always been our top priority.
It plays traffic of various types and volumes against an etcd cluster, concurrently injects a random
failpoint, records all operations (including both requests and responses), and finally performs a
linearizability check. It also verifies that the guarantees of the &lt;a href=&#34;https://etcd.io/docs/v3.5/learning/api_guarantees/#watch-apis&#34;&gt;Watch APIs&lt;/a&gt; have not been violated.
The robustness tests increase our confidence in the quality of each etcd release.&lt;/p&gt;
&lt;p&gt;We have migrated most of the etcd workflow tests to Kubernetes&#39; Prow testing infrastructure to
take advantage of its benefits, such as dashboards for viewing test results and the ability
for contributors to rerun failed tests themselves.&lt;/p&gt;
&lt;h2 id=&#34;platforms&#34;&gt;Platforms&lt;/h2&gt;
&lt;p&gt;While retaining all existing supported platforms, we have promoted Linux/ARM64 to Tier 1 support.
For more details, please refer to &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/15951&#34;&gt;issues/15951&lt;/a&gt;. For the complete list of supported platforms,
see &lt;a href=&#34;https://etcd.io/docs/v3.6/op-guide/supported-platform/&#34;&gt;supported-platform&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;dependencies&#34;&gt;Dependencies&lt;/h2&gt;
&lt;h3 id=&#34;dependency-bumping-guide&#34;&gt;Dependency bumping guide&lt;/h3&gt;
&lt;p&gt;We have published an official guide on how to bump dependencies for etcd’s main branch and stable releases.
It also covers how to update the Go version. For more details, please refer to &lt;a href=&#34;https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/dependency_management.md&#34;&gt;dependency_management&lt;/a&gt;.
With this guide available, any contributor can now help with dependency upgrades.&lt;/p&gt;
&lt;h3 id=&#34;core-dependency-updates&#34;&gt;Core Dependency Updates&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/etcd-io/bbolt&#34;&gt;bbolt&lt;/a&gt; and &lt;a href=&#34;https://github.com/etcd-io/raft&#34;&gt;raft&lt;/a&gt; are two core dependencies of etcd.&lt;/p&gt;
&lt;p&gt;Both etcd v3.4 and v3.5 depend on bbolt v1.3, while etcd v3.6 depends on bbolt v1.4.&lt;/p&gt;
&lt;p&gt;For the release-3.4 and release-3.5 branches, raft is included in the etcd repository itself, so etcd v3.4 and v3.5
do not depend on an external raft module. Starting from etcd v3.6, raft was moved to a separate repository (&lt;a href=&#34;https://github.com/etcd-io/raft&#34;&gt;raft&lt;/a&gt;),
and the first standalone raft release is v3.6.0. As a result, etcd v3.6.0 depends on raft v3.6.0.&lt;/p&gt;
&lt;p&gt;Please see the table below for a summary:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;etcd versions&lt;/th&gt;
&lt;th&gt;bbolt versions&lt;/th&gt;
&lt;th&gt;raft versions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3.4.x&lt;/td&gt;
&lt;td&gt;v1.3.x&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.5.x&lt;/td&gt;
&lt;td&gt;v1.3.x&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.6.x&lt;/td&gt;
&lt;td&gt;v1.4.x&lt;/td&gt;
&lt;td&gt;v3.6.x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;grpc-gateway-v2&#34;&gt;grpc-gateway@v2&lt;/h3&gt;
&lt;p&gt;We upgraded &lt;a href=&#34;https://github.com/grpc-ecosystem/grpc-gateway&#34;&gt;grpc-gateway&lt;/a&gt; from v1 to v2 via &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/16595&#34;&gt;PR/16595&lt;/a&gt; in etcd v3.6.0. This is a major step toward
migrating to &lt;a href=&#34;https://github.com/protocolbuffers/protobuf-go&#34;&gt;protobuf-go&lt;/a&gt;, the second major version of the Go protocol buffer API implementation.&lt;/p&gt;
&lt;p&gt;grpc-gateway@v2 is designed to work with &lt;a href=&#34;https://github.com/protocolbuffers/protobuf-go&#34;&gt;protobuf-go&lt;/a&gt;. However, etcd v3.6 still depends on the deprecated
&lt;a href=&#34;https://github.com/gogo/protobuf&#34;&gt;gogo/protobuf&lt;/a&gt;, which is a protocol buffer v1 implementation. To resolve this incompatibility,
we applied a &lt;a href=&#34;https://github.com/etcd-io/etcd/blob/158b9e0d468d310c3edf4cf13f2458c51b0406fa/scripts/genproto.sh#L151-L184&#34;&gt;patch&lt;/a&gt; to the generated *.pb.gw.go files to convert v1 messages to v2 messages.&lt;/p&gt;
&lt;h3 id=&#34;grpc-ecosystem-go-grpc-middleware-providers-prometheus&#34;&gt;grpc-ecosystem/go-grpc-middleware/providers/prometheus&lt;/h3&gt;
&lt;p&gt;We switched from the deprecated (and archived) &lt;a href=&#34;https://github.com/grpc-ecosystem/go-grpc-prometheus&#34;&gt;grpc-ecosystem/go-grpc-prometheus&lt;/a&gt; to
&lt;a href=&#34;https://github.com/grpc-ecosystem/go-grpc-middleware/tree/main/providers/prometheus&#34;&gt;grpc-ecosystem/go-grpc-middleware/providers/prometheus&lt;/a&gt; via &lt;a href=&#34;https://github.com/etcd-io/etcd/pull/19195&#34;&gt;PR/19195&lt;/a&gt;. This change ensures continued
support and access to the latest features and improvements in the gRPC Prometheus integration.&lt;/p&gt;
&lt;h2 id=&#34;community&#34;&gt;Community&lt;/h2&gt;
&lt;p&gt;There are exciting developments in the etcd community that reflect our ongoing commitment
to strengthening collaboration, improving maintainability, and evolving the project’s governance.&lt;/p&gt;
&lt;h3 id=&#34;etcd-becomes-a-kubernetes-sig&#34;&gt;etcd Becomes a Kubernetes SIG&lt;/h3&gt;
&lt;p&gt;etcd has officially become a Kubernetes Special Interest Group: SIG-etcd. This change reflects
etcd’s critical role as the primary datastore for Kubernetes and establishes a more structured
and transparent home for long-term stewardship and cross-project collaboration. The new SIG
designation will help streamline decision-making, align roadmaps with Kubernetes needs,
and attract broader community involvement.&lt;/p&gt;
&lt;h3 id=&#34;new-contributors-maintainers-and-reviewers&#34;&gt;New contributors, maintainers, and reviewers&lt;/h3&gt;
&lt;p&gt;We’ve seen increasing engagement from contributors, which has resulted in the addition of three new maintainers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/fuweid&#34;&gt;fuweid&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/jmhbnz&#34;&gt;jmhbnz&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wenjiaswe&#34;&gt;wenjiaswe&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Their continued contributions have been instrumental in driving the project forward.&lt;/p&gt;
&lt;p&gt;We also welcome two new reviewers to the project:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/ivanvc&#34;&gt;ivanvc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/siyuanfoundation&#34;&gt;siyuanfoundation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We appreciate their dedication to code quality and their willingness to take on broader review responsibilities
within the community.&lt;/p&gt;
&lt;h3 id=&#34;new-release-team&#34;&gt;New release team&lt;/h3&gt;
&lt;p&gt;We&#39;ve formed a new release team led by &lt;a href=&#34;https://github.com/ivanvc&#34;&gt;ivanvc&lt;/a&gt; and &lt;a href=&#34;https://github.com/jmhbnz&#34;&gt;jmhbnz&lt;/a&gt;, streamlining the release process by automating
many previously manual steps. Inspired by Kubernetes SIG Release, we&#39;ve adopted several best practices, including
clearly defined release team roles and the introduction of release shadows to support knowledge sharing and team
sustainability. These changes have made our releases smoother and more reliable, allowing us to approach each
release with greater confidence and consistency.&lt;/p&gt;
&lt;h3 id=&#34;introducing-the-etcd-operator-working-group&#34;&gt;Introducing the etcd Operator Working Group&lt;/h3&gt;
&lt;p&gt;To further advance etcd’s operational excellence, we have formed a new working group: &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-etcd-operator&#34;&gt;WG-etcd-operator&lt;/a&gt;.
The working group is dedicated to enabling the automatic and efficient operation of etcd clusters that run in
the Kubernetes environment using an etcd-operator.&lt;/p&gt;
&lt;h2 id=&#34;future-development&#34;&gt;Future Development&lt;/h2&gt;
&lt;p&gt;The legacy v2store has been deprecated since etcd v3.4, and the flag &lt;code&gt;--enable-v2&lt;/code&gt; was removed entirely in v3.6.
This means that starting from v3.6, there is no longer a way to enable or use the v2store. However, etcd still
bootstraps internally from the legacy v2 snapshots. To address this inconsistency, we plan to change etcd to
bootstrap from the v3store and replay the WAL entries based on the &lt;code&gt;consistent-index&lt;/code&gt;. The work is being tracked
in &lt;a href=&#34;https://github.com/etcd-io/etcd/issues/12913&#34;&gt;issues/12913&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the most persistent challenges remains large range queries from the kube-apiserver, which can
lead to process crashes due to their unpredictable nature. The range stream feature, originally outlined in
the &lt;a href=&#34;https://etcd.io/blog/2021/announcing-etcd-3.5/#future-roadmaps&#34;&gt;v3.5 release blog/Future roadmaps&lt;/a&gt;, remains an idea worth revisiting to address the challenges of large
range queries.&lt;/p&gt;
&lt;p&gt;For more details and upcoming plans, please refer to the &lt;a href=&#34;https://github.com/etcd-io/etcd/blob/main/Documentation/contributor-guide/roadmap.md&#34;&gt;etcd roadmap&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes 1.33: Job&#39;s SuccessPolicy Goes GA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/kubernetes-1-33-jobs-success-policy-goes-ga/</link>
      <pubDate>Thu, 15 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/15/kubernetes-1-33-jobs-success-policy-goes-ga/</guid>
      <description>
        
        
        &lt;p&gt;On behalf of the Kubernetes project, I&#39;m pleased to announce that Job &lt;em&gt;success policy&lt;/em&gt; has graduated to General Availability (GA) as part of the v1.33 release.&lt;/p&gt;
&lt;h2 id=&#34;about-job-s-success-policy&#34;&gt;About Job&#39;s Success Policy&lt;/h2&gt;
&lt;p&gt;In batch workloads, you might want to use leader-follower patterns like &lt;a href=&#34;https://en.wikipedia.org/wiki/Message_Passing_Interface&#34;&gt;MPI&lt;/a&gt;,
in which the leader controls the execution, including the followers&#39; lifecycle.&lt;/p&gt;
&lt;p&gt;In this case, you might want to mark the Job as succeeded
even if some of the indexes failed. Unfortunately, without a success policy, a leader-follower Kubernetes Job would, in most cases, require &lt;strong&gt;all&lt;/strong&gt; Pods to finish successfully
for the Job to reach an overall succeeded state.&lt;/p&gt;
&lt;p&gt;For Kubernetes Jobs, the API allows you to specify early exit criteria using the &lt;code&gt;.spec.successPolicy&lt;/code&gt;
field (you can only use the &lt;code&gt;.spec.successPolicy&lt;/code&gt; field for an &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concept/workloads/controllers/job/#completion-mode&#34;&gt;indexed Job&lt;/a&gt;).
This field describes a set of rules, either using a list of succeeded indexes for the Job, or defining a minimal required count of succeeded indexes.&lt;/p&gt;
&lt;p&gt;This newly stable field is especially valuable for scientific simulation, AI/ML and High-Performance Computing (HPC) batch workloads.
Users in these areas often run numerous experiments and may only need a specific number to complete successfully, rather than requiring all of them to succeed.
In the leader-follower case, the leader index&#39;s failure is the only relevant Job exit criterion, and the outcomes of individual follower Pods are reflected
only indirectly via the status of the leader index.
Moreover, followers do not know when they can terminate themselves.&lt;/p&gt;
&lt;p&gt;After a Job meets any &lt;strong&gt;success policy&lt;/strong&gt; rule, the Job is marked as succeeded, and all Pods are terminated, including the running ones.&lt;/p&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;The following excerpt from a Job manifest, using &lt;code&gt;.successPolicy.rules[0].succeededCount&lt;/code&gt;, shows an example of
using a custom success policy:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parallelism&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completionMode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Indexed&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;successPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;succeededCount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here, the Job is marked as succeeded when any single index succeeds, regardless of which one.
Additionally, you can combine &lt;code&gt;succeededCount&lt;/code&gt; with &lt;code&gt;succeededIndexes&lt;/code&gt; to constrain which index numbers count toward success,
as shown below:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parallelism&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completionMode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Indexed&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;successPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;succeededIndexes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# index of the leader Pod&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;succeededCount&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This example shows that the Job will be marked as succeeded once a Pod with a specific index (Pod index 0) has succeeded.&lt;/p&gt;
&lt;p&gt;Once the Job either satisfies one of the &lt;code&gt;successPolicy&lt;/code&gt; rules, or achieves its &lt;code&gt;Complete&lt;/code&gt; criteria based on &lt;code&gt;.spec.completions&lt;/code&gt;,
the Job controller within kube-controller-manager adds the &lt;code&gt;SuccessCriteriaMet&lt;/code&gt; condition to the Job status.
After that, the job-controller initiates cleanup and termination of Pods for Jobs with the &lt;code&gt;SuccessCriteriaMet&lt;/code&gt; condition.
Eventually, the Job obtains the &lt;code&gt;Complete&lt;/code&gt; condition once the job-controller has finished cleanup and termination.&lt;/p&gt;
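&lt;p&gt;As an illustrative sketch of that sequence (an abridged excerpt, with timestamps, reasons, and messages omitted), the Job status may end up looking roughly like this:&lt;/p&gt;

```yaml
# Illustrative, abridged Job status after a success policy rule is met.
# Real statuses carry additional fields (timestamps, reasons, messages).
status:
  conditions:
  - type: SuccessCriteriaMet   # added once a successPolicy rule is satisfied
    status: "True"
  - type: Complete             # added after Pod cleanup and termination finish
    status: "True"
```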
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Read the documentation for
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#success-policy&#34;&gt;success policy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read the KEP for the &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3998-job-success-completion-policy&#34;&gt;Job success/completion policy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;This work was led by the Kubernetes
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-batch&#34;&gt;batch working group&lt;/a&gt;
in close collaboration with the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps&#34;&gt;SIG Apps&lt;/a&gt; community.&lt;/p&gt;
&lt;p&gt;If you are interested in working on new features in this space, I recommend
subscribing to our &lt;a href=&#34;https://kubernetes.slack.com/messages/wg-batch&#34;&gt;Slack&lt;/a&gt;
channel and attending the regular community meetings.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Updates to Container Lifecycle</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/14/kubernetes-v1-33-updates-to-container-lifecycle/</link>
      <pubDate>Wed, 14 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/14/kubernetes-v1-33-updates-to-container-lifecycle/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes v1.33 introduces a few updates to the lifecycle of containers. The Sleep action for container lifecycle hooks now supports a zero sleep duration (feature enabled by default).
There is also alpha support for customizing the stop signal sent to containers when they are being terminated.&lt;/p&gt;
&lt;p&gt;This blog post goes into the details of these new aspects of the container lifecycle, and how you can use them.&lt;/p&gt;
&lt;h2 id=&#34;zero-value-for-sleep-action&#34;&gt;Zero value for Sleep action&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.29 introduced the &lt;code&gt;Sleep&lt;/code&gt; action for container PreStop and PostStart Lifecycle hooks. The Sleep action lets your containers pause for a specified duration after the container is started or before it is terminated. This was needed to provide a straightforward way to manage graceful shutdowns. Before the Sleep action, folks used to run the &lt;code&gt;sleep&lt;/code&gt; command using the exec action in their container lifecycle hooks. If you wanted to do this you&#39;d need to have the binary for the &lt;code&gt;sleep&lt;/code&gt; command in your container image. This is difficult if you&#39;re using third party images.&lt;/p&gt;
&lt;p&gt;When the Sleep action was initially added, it didn&#39;t support a sleep duration of zero seconds. The &lt;code&gt;time.Sleep&lt;/code&gt; function, which the Sleep action uses under the hood, does: passing a negative or zero value makes it return immediately, resulting in a no-op. We wanted the same behaviour for the Sleep action. Support for a zero duration was later added in v1.32, behind the &lt;code&gt;PodLifecycleSleepActionAllowZero&lt;/code&gt; feature gate.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;PodLifecycleSleepActionAllowZero&lt;/code&gt; feature gate has graduated to beta in v1.33, and is now enabled by default.
The original Sleep action for &lt;code&gt;preStop&lt;/code&gt; and &lt;code&gt;postStart&lt;/code&gt; hooks has been enabled by default since Kubernetes v1.30.
With a cluster running Kubernetes v1.33, you are able to set a
zero duration for sleep lifecycle hooks. For a cluster with default configuration, you don&#39;t need
to enable any feature gate to make that possible.&lt;/p&gt;
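&lt;p&gt;As a minimal sketch (the Pod and container names here are placeholders), a zero-second &lt;code&gt;preStop&lt;/code&gt; sleep that acts as a no-op hook could look like this:&lt;/p&gt;

```yaml
# Minimal sketch: a preStop Sleep hook with a zero duration (a no-op)
# on a cluster running Kubernetes v1.33 with default configuration.
apiVersion: v1
kind: Pod
metadata:
  name: zero-sleep-demo      # hypothetical name
spec:
  containers:
  - name: app                # hypothetical name
    image: nginx:latest
    lifecycle:
      preStop:
        sleep:
          seconds: 0         # returns immediately; zero allowed since v1.32
```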
&lt;h2 id=&#34;container-stop-signals&#34;&gt;Container stop signals&lt;/h2&gt;
&lt;p&gt;Container runtimes such as containerd and CRI-O honor a &lt;code&gt;StopSignal&lt;/code&gt; instruction in the container image definition. This can be used to specify a custom stop signal
that the runtime will use to terminate containers based on that image.
Stop signal configuration was not originally part of the Pod API in Kubernetes.
Until Kubernetes v1.33, the only way to override the stop signal for containers was by rebuilding your container image with the new custom stop signal
(for example, specifying &lt;code&gt;STOPSIGNAL&lt;/code&gt; in a &lt;code&gt;Containerfile&lt;/code&gt; or &lt;code&gt;Dockerfile&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ContainerStopSignals&lt;/code&gt; feature gate which is newly added in Kubernetes v1.33 adds stop signals to the Kubernetes API. This allows users to specify a custom stop signal in the container spec. Stop signals are added to the API as a new lifecycle along with the existing PreStop and PostStart lifecycle handlers. In order to use this feature, we expect the Pod to have the operating system specified with &lt;code&gt;spec.os.name&lt;/code&gt;. This is enforced so that we can cross-validate the stop signal against the operating system and make sure that the containers in the Pod are created with a valid stop signal for the operating system the Pod is being scheduled to. For Pods scheduled on Windows nodes, only &lt;code&gt;SIGTERM&lt;/code&gt; and &lt;code&gt;SIGKILL&lt;/code&gt; are allowed as valid stop signals. Find the full list of signals supported in Linux nodes &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/types.go#L2985-L3053&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;default-behaviour&#34;&gt;Default behaviour&lt;/h3&gt;
&lt;p&gt;If a container has a custom stop signal defined in its lifecycle, the container runtime uses that signal to kill the container, provided the runtime also supports custom stop signals. If no custom stop signal is defined in the container lifecycle, the runtime falls back to the stop signal defined in the container image. If no stop signal is defined in the container image either, the runtime&#39;s default stop signal is used. The default signal is &lt;code&gt;SIGTERM&lt;/code&gt; for both containerd and CRI-O.&lt;/p&gt;
&lt;h3 id=&#34;version-skew&#34;&gt;Version skew&lt;/h3&gt;
&lt;p&gt;For the feature to work as intended, both the Kubernetes version and the container runtime must support container stop signals. The changes to the Kubernetes API and kubelet are available in alpha stage from v1.33, and can be enabled with the &lt;code&gt;ContainerStopSignals&lt;/code&gt; feature gate. The container runtime implementations for containerd and CRI-O are still a work in progress and will be rolled out soon.&lt;/p&gt;
&lt;h3 id=&#34;using-container-stop-signals&#34;&gt;Using container stop signals&lt;/h3&gt;
&lt;p&gt;To enable this feature, you need to turn on the &lt;code&gt;ContainerStopSignals&lt;/code&gt; feature gate in both the kube-apiserver and the kubelet. Once you have nodes where the feature gate is turned on, you can create Pods with a StopSignal lifecycle and a valid OS name like so:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;os&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;linux&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;nginx:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;lifecycle&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;stopSignal&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;SIGUSR1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Do note that the &lt;code&gt;SIGUSR1&lt;/code&gt; signal in this example can only be used if the container&#39;s Pod is scheduled to a Linux node. Hence we need to specify &lt;code&gt;spec.os.name&lt;/code&gt; as &lt;code&gt;linux&lt;/code&gt; to be able to use the signal. You will only be able to configure &lt;code&gt;SIGTERM&lt;/code&gt; and &lt;code&gt;SIGKILL&lt;/code&gt; signals if the Pod is being scheduled to a Windows node. You cannot specify a &lt;code&gt;containers[*].lifecycle.stopSignal&lt;/code&gt; if the &lt;code&gt;spec.os.name&lt;/code&gt; field is nil or unset either.&lt;/p&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;This feature is driven by the &lt;a href=&#34;https://github.com/Kubernetes/community/blob/master/sig-node/README.md&#34;&gt;SIG Node&lt;/a&gt;. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to us!&lt;/p&gt;
&lt;p&gt;You can reach SIG Node by several means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack: &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;Mailing list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/labels/sig%2Fnode&#34;&gt;Open Community Issues/PRs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also contact me directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub: @sreeram-venkitesh&lt;/li&gt;
&lt;li&gt;Slack: @sreeram.venkitesh&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Job&#39;s Backoff Limit Per Index Goes GA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/13/kubernetes-v1-33-jobs-backoff-limit-per-index-goes-ga/</link>
      <pubDate>Tue, 13 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/13/kubernetes-v1-33-jobs-backoff-limit-per-index-goes-ga/</guid>
      <description>
        
        
        &lt;p&gt;In Kubernetes v1.33, the &lt;em&gt;Backoff Limit Per Index&lt;/em&gt; feature reaches general
availability (GA). This blog describes the Backoff Limit Per Index feature and
its benefits.&lt;/p&gt;
&lt;h2 id=&#34;about-backoff-limit-per-index&#34;&gt;About backoff limit per index&lt;/h2&gt;
&lt;p&gt;When you run workloads on Kubernetes, you must consider scenarios where Pod
failures can affect the completion of your workloads. Ideally, your workload
should tolerate transient failures and continue running.&lt;/p&gt;
&lt;p&gt;To achieve failure tolerance in a Kubernetes Job, you can set the
&lt;code&gt;spec.backoffLimit&lt;/code&gt; field. This field specifies the total number of tolerated
failures.&lt;/p&gt;
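For instance, a Job that tolerates up to three Pod failures in total, across all of its Pods, might be declared as follows (a minimal sketch; the name and image are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                 # hypothetical name
spec:
  backoffLimit: 3                   # tolerate at most 3 Pod failures in total
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/worker:latest   # hypothetical image
```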
&lt;p&gt;However, for workloads where every index is considered independent, like
&lt;a href=&#34;https://en.wikipedia.org/wiki/Embarrassingly_parallel&#34;&gt;embarrassingly parallel&lt;/a&gt;
workloads, the &lt;code&gt;spec.backoffLimit&lt;/code&gt; field is often not flexible enough.
For example, you may choose to run multiple suites of integration tests by
representing each suite as an index within an &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/job/indexed-parallel-processing-static/&#34;&gt;Indexed Job&lt;/a&gt;.
In that setup, a fast-failing index (test suite) is likely to consume your
entire budget for tolerating Pod failures, and you might not be able to run the
other indexes.&lt;/p&gt;
&lt;p&gt;To address this limitation, Kubernetes introduced &lt;em&gt;backoff limit per index&lt;/em&gt;,
which allows you to control the number of retries per index.&lt;/p&gt;
&lt;h2 id=&#34;how-backoff-limit-per-index-works&#34;&gt;How backoff limit per index works&lt;/h2&gt;
&lt;p&gt;To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated
Pod failures per index with the &lt;code&gt;spec.backoffLimitPerIndex&lt;/code&gt; field. When you set
this field, the Job executes all indexes by default.&lt;/p&gt;
&lt;p&gt;Additionally, to fine-tune the error handling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Specify the cap on the total number of failed indexes by setting the
&lt;code&gt;spec.maxFailedIndexes&lt;/code&gt; field. When the limit is exceeded, the entire Job is
terminated.&lt;/li&gt;
&lt;li&gt;Define a short-circuit to detect a failed index by using the &lt;code&gt;FailIndex&lt;/code&gt; action in the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#pod-failure-policy&#34;&gt;Pod Failure Policy&lt;/a&gt;
mechanism.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When the number of tolerated failures is exceeded, the Job marks that index as
failed and lists it in the Job&#39;s &lt;code&gt;status.failedIndexes&lt;/code&gt; field.&lt;/p&gt;
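When that happens, the failed indexes are recorded using the same interval notation as completed indexes. An illustrative status fragment (all index values here are hypothetical):

```yaml
# Illustrative Job status fragment; the values are hypothetical.
status:
  completedIndexes: 0,2,6-9
  failedIndexes: 1,3-5    # indexes that exhausted backoffLimitPerIndex
  succeeded: 6
  failed: 6
```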
&lt;h3 id=&#34;example&#34;&gt;Example&lt;/h3&gt;
&lt;p&gt;The following Job spec snippet is an example of how to combine backoff limit per
index with the &lt;em&gt;Pod Failure Policy&lt;/em&gt; feature:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parallelism&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completionMode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Indexed&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backoffLimitPerIndex&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;maxFailedIndexes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;5&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;podFailurePolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;rules&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;action&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Ignore&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;onPodConditions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;DisruptionTarget&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;action&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;FailIndex&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;onExitCodes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;operator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;In&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;values&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;42&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this example, the Job handles Pod failures as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ignores any failed Pods that have the built-in
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/disruptions/#pod-disruption-conditions&#34;&gt;disruption condition&lt;/a&gt;,
called &lt;code&gt;DisruptionTarget&lt;/code&gt;. These Pods don&#39;t count towards Job backoff limits.&lt;/li&gt;
&lt;li&gt;Fails the index corresponding to the failed Pod if any of the failed Pod&#39;s
containers finished with exit code 42, based on the matching &lt;code&gt;FailIndex&lt;/code&gt;
rule.&lt;/li&gt;
&lt;li&gt;Retries the first failure of any index, unless the index failed due to the
matching &lt;code&gt;FailIndex&lt;/code&gt; rule.&lt;/li&gt;
&lt;li&gt;Fails the entire Job if the number of failed indexes exceeds 5 (set by the
&lt;code&gt;spec.maxFailedIndexes&lt;/code&gt; field).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Read the blog post on the closely related feature of Pod Failure Policy &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/08/19/kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga/&#34;&gt;Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For a hands-on guide to using Pod failure policy, including the use of FailIndex, see
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/job/pod-failure-policy/&#34;&gt;Handling retriable and non-retriable pod failures with Pod failure policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read the documentation for
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#backoff-limit-per-index&#34;&gt;Backoff limit per index&lt;/a&gt; and
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/controllers/job/#pod-failure-policy&#34;&gt;Pod failure policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read the KEP for the &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs&#34;&gt;Backoff Limits Per Index For Indexed Jobs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;This work was sponsored by the Kubernetes
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-batch&#34;&gt;batch working group&lt;/a&gt;
in close collaboration with the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps&#34;&gt;SIG Apps&lt;/a&gt; community.&lt;/p&gt;
&lt;p&gt;If you are interested in working on new features in the space we recommend
subscribing to our &lt;a href=&#34;https://kubernetes.slack.com/messages/wg-batch&#34;&gt;Slack&lt;/a&gt;
channel and attending the regular community meetings.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Image Pull Policy the way you always thought it worked!</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/12/kubernetes-v1-33-ensure-secret-pulled-images-alpha/</link>
      <pubDate>Mon, 12 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/12/kubernetes-v1-33-ensure-secret-pulled-images-alpha/</guid>
      <description>
        
        
        &lt;h2 id=&#34;image-pull-policy-the-way-you-always-thought-it-worked&#34;&gt;Image Pull Policy the way you always thought it worked!&lt;/h2&gt;
&lt;p&gt;Some things in Kubernetes are surprising, and the way &lt;code&gt;imagePullPolicy&lt;/code&gt; behaves might
be one of them. Given that Kubernetes is all about running pods, it may be peculiar
to learn that there has been a caveat to restricting pod access to authenticated images for
over 10 years, in the form of &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/18787&#34;&gt;issue 18787&lt;/a&gt;!
It is an exciting release when you can resolve a ten-year-old issue.&lt;/p&gt;

&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;Throughout this blog post, the term &amp;quot;pod credentials&amp;quot; will be used often. In this context,
the term generally encapsulates the authentication material that is available to a pod
to authenticate a container image pull.&lt;/div&gt;

&lt;h2 id=&#34;ifnotpresent-even-if-i-m-not-supposed-to-have-it&#34;&gt;IfNotPresent, even if I&#39;m not supposed to have it&lt;/h2&gt;
&lt;p&gt;The gist of the problem is that the &lt;code&gt;imagePullPolicy: IfNotPresent&lt;/code&gt; strategy has done
precisely what it says, and nothing more. Let&#39;s set up a scenario. To begin, &lt;em&gt;Pod A&lt;/em&gt; in &lt;em&gt;Namespace X&lt;/em&gt; is scheduled to &lt;em&gt;Node 1&lt;/em&gt; and requires &lt;em&gt;image Foo&lt;/em&gt; from a private repository.
For its image pull authentication material, the pod references &lt;em&gt;Secret 1&lt;/em&gt; in its &lt;code&gt;imagePullSecrets&lt;/code&gt;. &lt;em&gt;Secret 1&lt;/em&gt; contains the necessary credentials to pull from the private repository. The Kubelet will utilize the credentials from &lt;em&gt;Secret 1&lt;/em&gt; as supplied by &lt;em&gt;Pod A&lt;/em&gt;
and pull &lt;em&gt;container image Foo&lt;/em&gt; from the registry. This is the intended (and secure)
behavior.&lt;/p&gt;
&lt;p&gt;But now things get curious. If &lt;em&gt;Pod B&lt;/em&gt; in &lt;em&gt;Namespace Y&lt;/em&gt; happens to also be scheduled to &lt;em&gt;Node 1&lt;/em&gt;, unexpected (and potentially insecure) things happen. &lt;em&gt;Pod B&lt;/em&gt; may reference the same private image, specifying the &lt;code&gt;IfNotPresent&lt;/code&gt; image pull policy. &lt;em&gt;Pod B&lt;/em&gt; does not reference &lt;em&gt;Secret 1&lt;/em&gt;
(or in our case, any secret) in its &lt;code&gt;imagePullSecrets&lt;/code&gt;. When the Kubelet tries to run the pod, it honors the &lt;code&gt;IfNotPresent&lt;/code&gt; policy. The Kubelet sees that the &lt;em&gt;image Foo&lt;/em&gt; is already present locally, and will provide &lt;em&gt;image Foo&lt;/em&gt; to &lt;em&gt;Pod B&lt;/em&gt;. &lt;em&gt;Pod B&lt;/em&gt; gets to run the image even though it did not provide credentials authorizing it to pull the image in the first place.&lt;/p&gt;
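The scenario above can be sketched as two manifests (the names, namespaces, and image reference are hypothetical):

```yaml
# Pod A supplies credentials for the private registry.
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
  namespace: namespace-x
spec:
  imagePullSecrets:
  - name: secret-1                        # holds the registry credentials
  containers:
  - name: app
    image: registry.example.com/foo:1.0   # private image Foo
    imagePullPolicy: IfNotPresent
---
# Pod B references the same private image with the same policy, but
# supplies no imagePullSecrets. Historically, it could still run the
# image if it was already cached on the node.
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
  namespace: namespace-y
spec:
  containers:
  - name: app
    image: registry.example.com/foo:1.0
    imagePullPolicy: IfNotPresent
```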


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/12/kubernetes-v1-33-ensure-secret-pulled-images-alpha/ensure_secret_image_pulls.svg&#34;
         alt=&#34;Illustration of the process of two pods trying to access a private image, the first one with a pull secret, the second one without it&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;Using a private image pulled by a different pod&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;While &lt;code&gt;IfNotPresent&lt;/code&gt; should not pull &lt;em&gt;image Foo&lt;/em&gt; if it is already present
on the node, it is an incorrect security posture to allow all pods scheduled
to a node to have access to a previously pulled private image. These pods were never
authorized to pull the image in the first place.&lt;/p&gt;
&lt;h2 id=&#34;ifnotpresent-but-only-if-i-am-supposed-to-have-it&#34;&gt;IfNotPresent, but only if I am supposed to have it&lt;/h2&gt;
&lt;p&gt;In Kubernetes v1.33, we - SIG Auth and SIG Node - have finally started to address this (really old) problem and to get the verification right! The basic expected behavior is unchanged. If
an image is not present, the Kubelet will attempt to pull the image. The credentials each pod supplies will be utilized for this task. This matches the behavior prior to 1.33.&lt;/p&gt;
&lt;p&gt;If the image is present, then the behavior of the Kubelet changes. The Kubelet will now
verify the pod&#39;s credentials before allowing the pod to use the image.&lt;/p&gt;
&lt;p&gt;Performance and service stability have been key considerations while revising the feature.
Pods utilizing the same credential will not be required to re-authenticate. This is
also true when pods source credentials from the same Kubernetes Secret object, even
when the credentials are rotated.&lt;/p&gt;
&lt;h2 id=&#34;never-pull-but-use-if-authorized&#34;&gt;Never pull, but use if authorized&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;imagePullPolicy: Never&lt;/code&gt; option does not fetch images. However, if the
container image is already present on the node, any pod attempting to use the private
image will be required to provide credentials, and those credentials require verification.&lt;/p&gt;
&lt;p&gt;Pods utilizing the same credential will not be required to re-authenticate.
Pods that do not supply credentials previously used to successfully pull an
image will not be allowed to use the private image.&lt;/p&gt;
&lt;h2 id=&#34;always-pull-if-authorized&#34;&gt;Always pull, if authorized&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;imagePullPolicy: Always&lt;/code&gt; option has always worked as intended. Each time an image
is requested, the request goes to the registry, and the registry performs an authentication
check.&lt;/p&gt;
&lt;p&gt;In the past, forcing the &lt;code&gt;Always&lt;/code&gt; image pull policy via pod admission was the only way to ensure
that your private container images didn&#39;t get reused by other pods on nodes which already pulled the images.&lt;/p&gt;
&lt;p&gt;Fortunately, this was somewhat performant. Only the image manifest was pulled, not the image. However, there was still a cost and a risk. During a new rollout, scale up, or pod restart, the image registry that provided the image MUST be available for the auth check, putting the image registry in the critical path for stability of services running inside of the cluster.&lt;/p&gt;
&lt;h2 id=&#34;how-it-all-works&#34;&gt;How it all works&lt;/h2&gt;
&lt;p&gt;The feature is based on persistent, file-based caches that are present on each of
the nodes. The following is a simplified description of how the feature works.
For the complete version, please see &lt;a href=&#34;https://kep.k8s.io/2535&#34;&gt;KEP-2535&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The process of requesting an image for the first time goes like this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A pod requesting an image from a private registry is scheduled to a node.&lt;/li&gt;
&lt;li&gt;The image is not present on the node.&lt;/li&gt;
&lt;li&gt;The Kubelet makes a record of the intention to pull the image.&lt;/li&gt;
&lt;li&gt;The Kubelet extracts credentials from the Kubernetes Secret referenced by the pod
as an image pull secret, and uses them to pull the image from the private registry.&lt;/li&gt;
&lt;li&gt;After the image has been successfully pulled, the Kubelet makes a record of
the successful pull. This record includes details about credentials used
(in the form of a hash) as well as the Secret from which they originated.&lt;/li&gt;
&lt;li&gt;The Kubelet removes the original record of intent.&lt;/li&gt;
&lt;li&gt;The Kubelet retains the record of successful pull for later use.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When future pods scheduled to the same node request the previously pulled private image:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The Kubelet checks the credentials that the new pod provides for the pull.&lt;/li&gt;
&lt;li&gt;If the hash of these credentials, or their source Secret, matches
the hash or source Secret recorded for a previous successful pull,
the pod is allowed to use the previously pulled image.&lt;/li&gt;
&lt;li&gt;If the credentials or their source Secret are not found in the records of
successful pulls for that image, the Kubelet will attempt to use
these new credentials to request a pull from the remote registry, triggering
the authorization flow.&lt;/li&gt;
&lt;/ol&gt;
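To make the record-keeping concrete, a successful-pull record conceptually carries information like the following. This sketch is purely illustrative - it is not the Kubelet's actual on-disk format, which is defined in KEP-2535:

```yaml
# Hypothetical illustration only; the real Kubelet file format differs.
image: registry.example.com/foo:1.0
pulls:
- credentialHash: sha256:9f2c0d1a        # hash of the credentials that succeeded
  sourceSecret:
    namespace: namespace-x
    name: secret-1
```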
&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h2&gt;
&lt;p&gt;In Kubernetes v1.33 we shipped the alpha version of this feature. To give it a spin,
enable the &lt;code&gt;KubeletEnsureSecretPulledImages&lt;/code&gt; feature gate for your 1.33 Kubelets.&lt;/p&gt;
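If you manage your Kubelets through a configuration file, one common way to enable the gate is via the `featureGates` field of `KubeletConfiguration`:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletEnsureSecretPulledImages: true   # alpha in v1.33
```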
&lt;p&gt;You can learn more about the feature and additional optional configuration on the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/containers/images/#ensureimagepullcredentialverification&#34;&gt;concept page for Images&lt;/a&gt;
in the official Kubernetes documentation.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;In future releases we are going to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Make this feature work together with &lt;a href=&#34;https://kep.k8s.io/4412&#34;&gt;Projected service account tokens for Kubelet image credential providers&lt;/a&gt; which adds a new, workload-specific source of image pull credentials.&lt;/li&gt;
&lt;li&gt;Write a benchmarking suite to measure the performance of this feature and assess the impact of
any future changes.&lt;/li&gt;
&lt;li&gt;Implement an in-memory caching layer so that we don&#39;t need to read files for each image
pull request.&lt;/li&gt;
&lt;li&gt;Add support for credential expirations, thus forcing previously validated credentials to
be re-authenticated.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;how-to-get-involved&#34;&gt;How to get involved&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://kep.k8s.io/2535&#34;&gt;Reading KEP-2535&lt;/a&gt; is a great way to understand these changes in depth.&lt;/p&gt;
&lt;p&gt;If you are interested in further involvement, reach out to us on the &lt;a href=&#34;https://kubernetes.slack.com/archives/C04UMAUC4UA&#34;&gt;#sig-auth-authenticators-dev&lt;/a&gt; channel
on Kubernetes Slack (for an invitation, visit &lt;a href=&#34;https://slack.k8s.io/&#34;&gt;https://slack.k8s.io/&lt;/a&gt;).
You are also welcome to join the bi-weekly &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-auth/README.md#meetings&#34;&gt;SIG Auth meetings&lt;/a&gt;,
held every other Wednesday.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Streaming List responses</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/09/kubernetes-v1-33-streaming-list-responses/</link>
      <pubDate>Fri, 09 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/09/kubernetes-v1-33-streaming-list-responses/</guid>
      <description>
        
        
        &lt;p&gt;Managing Kubernetes cluster stability becomes increasingly critical as your infrastructure grows. One of the most challenging aspects of operating large-scale clusters has been handling List requests that fetch substantial datasets - a common operation that could unexpectedly impact your cluster&#39;s stability.&lt;/p&gt;
&lt;p&gt;Today, the Kubernetes community is excited to announce a significant architectural improvement: streaming encoding for List responses.&lt;/p&gt;
&lt;h2 id=&#34;the-problem-unnecessary-memory-consumption-with-large-resources&#34;&gt;The problem: unnecessary memory consumption with large resources&lt;/h2&gt;
&lt;p&gt;Current API response encoders just serialize an entire response into a single contiguous memory block and perform one &lt;a href=&#34;https://pkg.go.dev/net/http#ResponseWriter.Write&#34;&gt;ResponseWriter.Write&lt;/a&gt; call to transmit data to the client. Despite HTTP/2&#39;s capability to split responses into smaller frames for transmission, the underlying HTTP server continues to hold the complete response data as a single buffer. Even as individual frames are transmitted to the client, the memory associated with these frames cannot be freed incrementally.&lt;/p&gt;
&lt;p&gt;As cluster size grows, the single response body can be substantial - hundreds of megabytes in size. At large scale, the current approach becomes particularly inefficient, as it prevents incremental memory release during transmission. Imagine that network congestion occurs: that large response body&#39;s memory block stays active for tens of seconds or even minutes. This limitation leads to unnecessarily high and prolonged memory consumption in the kube-apiserver process. If multiple large List requests occur simultaneously, the cumulative memory consumption can escalate rapidly, potentially leading to an Out-of-Memory (OOM) situation that compromises cluster stability.&lt;/p&gt;
&lt;p&gt;The encoding/json package uses sync.Pool to reuse memory buffers during serialization. While efficient for consistent workloads, this mechanism creates challenges with sporadic large List responses. When processing these large responses, memory pools expand significantly. But due to sync.Pool&#39;s design, these oversized buffers remain reserved after use. Subsequent small List requests continue utilizing these large memory allocations, preventing garbage collection and maintaining persistently high memory consumption in the kube-apiserver even after the initial large responses complete.&lt;/p&gt;
&lt;p&gt;Additionally, &lt;a href=&#34;https://github.com/protocolbuffers/protocolbuffers.github.io/blob/c14731f55296f8c6367faa4f2e55a3d3594544c6/content/programming-guides/techniques.md?plain=1#L39&#34;&gt;Protocol Buffers&lt;/a&gt; are not designed to handle large datasets, but they are great for handling &lt;strong&gt;individual&lt;/strong&gt; messages within a large data set. This highlights the need for streaming-based approaches that can process and transmit large collections incrementally rather than as monolithic blocks.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;From &lt;a href=&#34;https://protobuf.dev/programming-guides/techniques/&#34;&gt;https://protobuf.dev/programming-guides/techniques/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;streaming-encoder-for-list-responses&#34;&gt;Streaming encoder for List responses&lt;/h2&gt;
&lt;p&gt;The streaming encoding mechanism is specifically designed for List responses, leveraging their common well-defined collection structures. The core idea focuses exclusively on the &lt;strong&gt;Items&lt;/strong&gt; field within collection structures, which represents the bulk of memory consumption in large responses. Rather than encoding the entire &lt;strong&gt;Items&lt;/strong&gt; array as one contiguous memory block, the new streaming encoder processes and transmits each item individually, allowing memory to be freed progressively as each frame or chunk is transmitted. As a result, encoding items one by one significantly reduces the memory footprint required by the API server.&lt;/p&gt;
&lt;p&gt;With Kubernetes objects typically limited to 1.5 MiB (a limit inherited from etcd), streaming encoding keeps memory consumption predictable and manageable regardless of how many objects are in a List response. The result is significantly improved API server stability, reduced memory spikes, and better overall cluster performance - especially in environments where multiple large List operations might occur simultaneously.&lt;/p&gt;
&lt;p&gt;To ensure perfect backward compatibility, the streaming encoder validates Go struct tags rigorously before activation, guaranteeing byte-for-byte consistency with the original encoder. Standard encoding mechanisms process all fields except &lt;strong&gt;Items&lt;/strong&gt;, maintaining identical output formatting throughout. This approach seamlessly supports all Kubernetes List types, from built-in &lt;strong&gt;*List&lt;/strong&gt; objects to Custom Resource &lt;strong&gt;UnstructuredList&lt;/strong&gt; objects, requiring zero client-side modifications or awareness that the underlying encoding method has changed.&lt;/p&gt;
&lt;h2 id=&#34;performance-gains-you-ll-notice&#34;&gt;Performance gains you&#39;ll notice&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Memory Consumption:&lt;/strong&gt; Significantly lowers the memory footprint of the API server when handling large &lt;strong&gt;list&lt;/strong&gt; requests,
especially when dealing with &lt;strong&gt;large resources&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Scalability:&lt;/strong&gt; Enables the API server to handle more concurrent requests and larger datasets without running out of memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Increased Stability:&lt;/strong&gt; Reduces the risk of OOM kills and service disruptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Resource Utilization:&lt;/strong&gt; Optimizes memory usage and improves overall resource efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;benchmark-results&#34;&gt;Benchmark results&lt;/h2&gt;
&lt;p&gt;To validate the results, Kubernetes has introduced a new &lt;strong&gt;list&lt;/strong&gt; benchmark which concurrently executes 10 &lt;strong&gt;list&lt;/strong&gt; requests, each returning 1 GB of data.&lt;/p&gt;
&lt;p&gt;The benchmark showed a 20x improvement, reducing memory usage from 70-80 GB to 3 GB.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/09/kubernetes-v1-33-streaming-list-responses/results.png&#34;
         alt=&#34;Screenshot of a K8s performance dashboard showing memory usage for benchmark list going down from 60GB to 3GB&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;List benchmark memory usage&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes 1.33: Volume Populators Graduate to GA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/08/kubernetes-v1-33-volume-populators-ga/</link>
      <pubDate>Thu, 08 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/08/kubernetes-v1-33-volume-populators-ga/</guid>
      <description>
        
        
&lt;p&gt;Kubernetes &lt;em&gt;volume populators&lt;/em&gt; are now generally available (GA)! The &lt;code&gt;AnyVolumeDataSource&lt;/code&gt; feature
gate is treated as always enabled for Kubernetes v1.33, which means that users can specify any appropriate
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/extend-kubernetes/api-extension/custom-resources/#custom-resources&#34;&gt;custom resource&lt;/a&gt;
as the data source of a PersistentVolumeClaim (PVC).&lt;/p&gt;
&lt;p&gt;An example of how to use &lt;code&gt;dataSourceRef&lt;/code&gt; in a PVC:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PersistentVolumeClaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pvc1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;dataSourceRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;provider.example.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Provider&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;provider1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;what-is-new&#34;&gt;What is new&lt;/h2&gt;
&lt;p&gt;This GA release brings four major enhancements over the beta.&lt;/p&gt;
&lt;h3 id=&#34;populator-pod-is-optional&#34;&gt;Populator Pod is optional&lt;/h3&gt;
&lt;p&gt;During the beta phase, implementing a volume populator required running a dedicated populator Pod to carry out the data transfer.
With this GA release, the populator Pod is optional: providers can instead implement the population logic directly, and the controller handles the surrounding orchestration.&lt;/p&gt;
&lt;p&gt;To accommodate this, we&#39;ve introduced three new plugin-based functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PopulateFn()&lt;/code&gt;: Executes the provider-specific data population logic.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PopulateCompleteFn()&lt;/code&gt;: Checks if the data population operation has finished successfully.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PopulateCleanupFn()&lt;/code&gt;: Cleans up temporary resources created by the provider-specific functions after data population is completed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A provider example is added in &lt;a href=&#34;https://github.com/kubernetes-csi/lib-volume-populator/tree/master/example&#34;&gt;lib-volume-populator/example&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;mutator-functions-to-modify-the-kubernetes-resources&#34;&gt;Mutator functions to modify the Kubernetes resources&lt;/h3&gt;
&lt;p&gt;For GA, the CSI volume populator controller code gained a &lt;code&gt;MutatorConfig&lt;/code&gt;, allowing the specification of mutator functions to modify Kubernetes resources.
For example, if the PVC prime is not an exact copy of the PVC and you need provider-specific information for the driver, you can include this information in the optional &lt;code&gt;MutatorConfig&lt;/code&gt;.
This allows you to customize the Kubernetes objects in the volume populator.&lt;/p&gt;
&lt;h3 id=&#34;flexible-metric-handling-for-providers&#34;&gt;Flexible metric handling for providers&lt;/h3&gt;
&lt;p&gt;Our beta phase highlighted a new requirement: the need to aggregate metrics not just from lib-volume-populator, but also from other components within the provider&#39;s codebase.&lt;/p&gt;
&lt;p&gt;To address this, SIG Storage introduced a &lt;a href=&#34;https://github.com/kubernetes-csi/lib-volume-populator/blob/8a922a5302fdba13a6c27328ee50e5396940214b/populator-machinery/controller.go#L122&#34;&gt;provider metric manager&lt;/a&gt;.
This enhancement delegates the implementation of metrics logic to the provider itself, rather than relying solely on lib-volume-populator.
This shift provides greater flexibility and control over metrics collection and aggregation, enabling a more comprehensive view of provider performance.&lt;/p&gt;
&lt;h3 id=&#34;clean-up-for-temporary-resources&#34;&gt;Clean up for temporary resources&lt;/h3&gt;
&lt;p&gt;During the beta phase, we identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress, due to limitations in finalizer handling. We have improved the populator to support the deletion of temporary resources (PVC prime, etc.) if the original PVC is deleted in this GA release.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use-it&#34;&gt;How to use it&lt;/h2&gt;
&lt;p&gt;To try it out, please follow the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2022/05/16/volume-populators-beta/#trying-it-out&#34;&gt;steps&lt;/a&gt; in the previous beta blog.&lt;/p&gt;
&lt;h2 id=&#34;future-directions-and-potential-feature-requests&#34;&gt;Future directions and potential feature requests&lt;/h2&gt;
&lt;p&gt;Looking ahead, there are several potential feature requests for the volume populator:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi sync: the current implementation is a one-time unidirectional sync from source to destination. This can be extended to support multiple syncs, enabling periodic syncs or allowing users to sync on demand&lt;/li&gt;
&lt;li&gt;Bidirectional sync: an extension of multi sync above, but making it bidirectional between source and destination&lt;/li&gt;
&lt;li&gt;Populate data with priorities: with a list of different dataSourceRef, populate based on priorities&lt;/li&gt;
&lt;li&gt;Populate data from multiple sources of the same provider: populate multiple different sources to one destination&lt;/li&gt;
&lt;li&gt;Populate data from multiple sources of different providers: populate multiple different sources to one destination, pipelining the population of the different resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To ensure we&#39;re building something truly valuable, Kubernetes SIG Storage would love to hear about any specific use cases you have in mind for this feature.
For any inquiries or specific questions related to volume populator, please reach out to the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;SIG Storage community&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: From Secrets to Service Accounts: Kubernetes Image Pulls Evolved</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/07/kubernetes-v1-33-wi-for-image-pulls/</link>
      <pubDate>Wed, 07 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/07/kubernetes-v1-33-wi-for-image-pulls/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes has steadily evolved to reduce reliance on long-lived credentials
stored in the API.
A prime example of this shift is the transition of Kubernetes Service Account (KSA) tokens
from long-lived, static tokens to ephemeral, automatically rotated tokens
with OpenID Connect (OIDC)-compliant semantics.
This advancement enables workloads to securely authenticate with external services
without needing persistent secrets.&lt;/p&gt;
&lt;p&gt;However, one major gap remains: &lt;strong&gt;image pull authentication&lt;/strong&gt;.
Today, Kubernetes clusters rely on image pull secrets stored in the API,
which are long-lived and difficult to rotate,
or on node-level kubelet credential providers,
which allow any pod running on a node to access the same credentials.
This presents security and operational challenges.&lt;/p&gt;
&lt;p&gt;To address this, Kubernetes is introducing &lt;strong&gt;Service Account Token Integration
for Kubelet Credential Providers&lt;/strong&gt;, now available in &lt;strong&gt;alpha&lt;/strong&gt;.
This enhancement allows credential providers to use pod-specific service account tokens
to obtain registry credentials, which kubelet can then use for image pulls —
eliminating the need for long-lived image pull secrets.&lt;/p&gt;
&lt;h2 id=&#34;the-problem-with-image-pull-secrets&#34;&gt;The problem with image pull secrets&lt;/h2&gt;
&lt;p&gt;Currently, Kubernetes administrators have two primary options
for handling private container image pulls:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Image pull secrets stored in the Kubernetes API&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These secrets are often long-lived because they are hard to rotate.&lt;/li&gt;
&lt;li&gt;They must be explicitly attached to a service account or pod.&lt;/li&gt;
&lt;li&gt;Compromise of a pull secret can lead to unauthorized image access.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Kubelet credential providers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These providers fetch credentials dynamically at the node level.&lt;/li&gt;
&lt;li&gt;Any pod running on the node can access the same credentials.&lt;/li&gt;
&lt;li&gt;There’s no per-workload isolation, increasing security risks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
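&lt;p&gt;For reference, the first option looks like the manifest below. The Secret and registry names are placeholders; the referenced Secret is typically of type &lt;code&gt;kubernetes.io/dockerconfigjson&lt;/code&gt; and must be created and rotated manually:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1
  imagePullSecrets:
  # Long-lived credential stored in the Kubernetes API
  - name: regcred
&lt;/code&gt;&lt;/pre&gt;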
&lt;p&gt;Neither approach aligns with the principles of &lt;strong&gt;least privilege&lt;/strong&gt;
or &lt;strong&gt;ephemeral authentication&lt;/strong&gt;, leaving Kubernetes with a security gap.&lt;/p&gt;
&lt;h2 id=&#34;the-solution-service-account-token-integration-for-kubelet-credential-providers&#34;&gt;The solution: Service Account token integration for Kubelet credential providers&lt;/h2&gt;
&lt;p&gt;This new enhancement enables kubelet credential providers
to use &lt;strong&gt;workload identity&lt;/strong&gt; when fetching image registry credentials.
Instead of relying on long-lived secrets, credential providers can use
service account tokens to request short-lived credentials
tied to a specific pod’s identity.&lt;/p&gt;
&lt;p&gt;This approach provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Workload-specific authentication&lt;/strong&gt;:
Image pull credentials are scoped to a particular workload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ephemeral credentials&lt;/strong&gt;:
Tokens are automatically rotated, eliminating the risks of long-lived secrets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless integration&lt;/strong&gt;:
Works with existing Kubernetes authentication mechanisms,
aligning with cloud-native security best practices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;h3 id=&#34;1-service-account-tokens-for-credential-providers&#34;&gt;1. Service Account tokens for credential providers&lt;/h3&gt;
&lt;p&gt;Kubelet generates &lt;strong&gt;short-lived, automatically rotated&lt;/strong&gt; tokens for service accounts
if the credential provider it communicates with has opted into receiving
a service account token for image pulls.
These tokens conform to OIDC ID token semantics
and are provided to the credential provider
as part of the &lt;code&gt;CredentialProviderRequest&lt;/code&gt;.
The credential provider can then use this token
to authenticate with an external service.&lt;/p&gt;
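&lt;p&gt;For illustration, a &lt;code&gt;CredentialProviderRequest&lt;/code&gt; that carries a service account token looks roughly like the following (the kubelet serializes it as JSON on the provider&#39;s standard input; the image name and token value here are placeholders):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: credentialprovider.kubelet.k8s.io/v1
kind: CredentialProviderRequest
image: registry.example.com/app:v1
# Short-lived, OIDC-compliant token for the pod&#39;s service account;
# only populated when the provider has opted in.
serviceAccountToken: eyJhbGciOiJSUzI1NiIsImtpZCI6...
&lt;/code&gt;&lt;/pre&gt;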
&lt;h3 id=&#34;2-image-registry-authentication-flow&#34;&gt;2. Image registry authentication flow&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;When a pod starts, the kubelet requests credentials from a &lt;strong&gt;credential provider&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If the credential provider has opted in,
the kubelet generates a &lt;strong&gt;service account token&lt;/strong&gt; for the pod.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;service account token is included in the &lt;code&gt;CredentialProviderRequest&lt;/code&gt;&lt;/strong&gt;,
allowing the credential provider to authenticate
and exchange it for &lt;strong&gt;temporary image pull credentials&lt;/strong&gt;
from a registry (e.g. AWS ECR, GCP Artifact Registry, Azure ACR).&lt;/li&gt;
&lt;li&gt;The kubelet then uses these credentials
to pull images on behalf of the pod.&lt;/li&gt;
&lt;/ul&gt;
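&lt;p&gt;On success, the credential provider replies to the kubelet with a &lt;code&gt;CredentialProviderResponse&lt;/code&gt; carrying the temporary registry credentials, along the lines of the sketch below (the registry name and credential values are placeholders):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: credentialprovider.kubelet.k8s.io/v1
kind: CredentialProviderResponse
# How long the kubelet may cache the returned credentials
cacheDuration: &#34;5m&#34;
auth:
  registry.example.com:
    username: _token
    password: short-lived-registry-credential
&lt;/code&gt;&lt;/pre&gt;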
&lt;h2 id=&#34;benefits-of-this-approach&#34;&gt;Benefits of this approach&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;:
Eliminates long-lived image pull secrets, reducing attack surfaces.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Granular Access Control&lt;/strong&gt;:
Credentials are tied to individual workloads rather than entire nodes or clusters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational Simplicity&lt;/strong&gt;:
No need for administrators to manage and rotate image pull secrets manually.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Compliance&lt;/strong&gt;:
Helps organizations meet security policies
that prohibit persistent credentials in the cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;For Kubernetes &lt;strong&gt;v1.34&lt;/strong&gt;, we expect to ship this feature in &lt;strong&gt;beta&lt;/strong&gt;
while continuing to gather feedback from users.&lt;/p&gt;
&lt;p&gt;In the coming releases, we will focus on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implementing &lt;strong&gt;caching mechanisms&lt;/strong&gt;
to improve performance for token generation.&lt;/li&gt;
&lt;li&gt;Giving more &lt;strong&gt;flexibility to credential providers&lt;/strong&gt;
to decide how the registry credentials returned to the kubelet are cached.&lt;/li&gt;
&lt;li&gt;Making the feature work with
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2535-ensure-secret-pulled-images&#34;&gt;Ensure Secret Pulled Images&lt;/a&gt;
to ensure pods that use an image
are authorized to access that image
when service account tokens are used for authentication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can learn more about this feature
on the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/kubelet-credential-provider/#service-account-token-for-image-pulls&#34;&gt;service account token for image pulls&lt;/a&gt;
page in the Kubernetes documentation.&lt;/p&gt;
&lt;p&gt;You can also follow along on the
&lt;a href=&#34;https://kep.k8s.io/4412&#34;&gt;KEP-4412&lt;/a&gt;
to track progress across the coming Kubernetes releases.&lt;/p&gt;
&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out&lt;/h2&gt;
&lt;p&gt;To try out this feature:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ensure you are running Kubernetes v1.33 or later&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable the &lt;code&gt;ServiceAccountTokenForKubeletCredentialProviders&lt;/code&gt; feature gate&lt;/strong&gt;
on the kubelet.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ensure credential provider support&lt;/strong&gt;:
Modify or update your credential provider
to use service account tokens for authentication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Update the credential provider configuration&lt;/strong&gt;
to opt into receiving service account tokens
by configuring the &lt;code&gt;tokenAttributes&lt;/code&gt; field.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploy a pod&lt;/strong&gt;
that uses the credential provider to pull images from a private registry.&lt;/li&gt;
&lt;/ol&gt;
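&lt;p&gt;For step 4, the opt-in happens in the kubelet&#39;s &lt;code&gt;CredentialProviderConfig&lt;/code&gt;. A minimal sketch is shown below; the provider name, image match pattern, and audience are placeholders for your own values:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: my-credential-provider
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  matchImages:
  - &#34;*.example.com&#34;
  defaultCacheDuration: &#34;10m&#34;
  # Opt in to receiving service account tokens for image pulls
  tokenAttributes:
    serviceAccountTokenAudience: my-registry-audience
    requireServiceAccount: true
&lt;/code&gt;&lt;/pre&gt;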
&lt;p&gt;We would love to hear your feedback on this feature.
Please reach out to us on the
&lt;a href=&#34;https://kubernetes.slack.com/archives/C04UMAUC4UA&#34;&gt;#sig-auth-authenticators-dev&lt;/a&gt;
channel on Kubernetes Slack
(for an invitation, visit &lt;a href=&#34;https://slack.k8s.io/&#34;&gt;https://slack.k8s.io/&lt;/a&gt;).&lt;/p&gt;
&lt;h2 id=&#34;how-to-get-involved&#34;&gt;How to get involved&lt;/h2&gt;
&lt;p&gt;If you are interested in getting involved
in the development of this feature,
sharing feedback, or participating in any other ongoing &lt;strong&gt;SIG Auth&lt;/strong&gt; projects,
please reach out on the
&lt;a href=&#34;https://kubernetes.slack.com/archives/C0EN96KUY&#34;&gt;#sig-auth&lt;/a&gt;
channel on Kubernetes Slack.&lt;/p&gt;
&lt;p&gt;You are also welcome to join the bi-weekly
&lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-auth/README.md#meetings&#34;&gt;SIG Auth meetings&lt;/a&gt;,
held every other Wednesday.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Fine-grained SupplementalGroups Control Graduates to Beta</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/06/kubernetes-v1-33-fine-grained-supplementalgroups-control-beta/</link>
      <pubDate>Tue, 06 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/06/kubernetes-v1-33-fine-grained-supplementalgroups-control-beta/</guid>
      <description>
        
        
&lt;p&gt;The new field, &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt;, was introduced as an opt-in alpha feature for Kubernetes v1.31 and has graduated to beta in v1.33; the corresponding feature gate (&lt;code&gt;SupplementalGroupsPolicy&lt;/code&gt;) is now enabled by default. This feature enables more precise control over supplemental groups in containers, which can strengthen the security posture, particularly when accessing volumes. Moreover, it also enhances the transparency of UID/GID details in containers, offering improved security oversight.&lt;/p&gt;
&lt;p&gt;Please be aware that this beta release contains a breaking behavioral change. See the &lt;a href=&#34;#the-behavioral-changes-introduced-in-beta&#34;&gt;Behavioral Changes Introduced In Beta&lt;/a&gt; and &lt;a href=&#34;#upgrade-consideration&#34;&gt;Upgrade Considerations&lt;/a&gt; sections for details.&lt;/p&gt;
&lt;h2 id=&#34;motivation-implicit-group-memberships-defined-in-etc-group-in-the-container-image&#34;&gt;Motivation: Implicit group memberships defined in &lt;code&gt;/etc/group&lt;/code&gt; in the container image&lt;/h2&gt;
&lt;p&gt;Although many Kubernetes cluster admins and users may not be aware of it, Kubernetes, by default, &lt;em&gt;merges&lt;/em&gt; group information from the Pod with information defined in &lt;code&gt;/etc/group&lt;/code&gt; in the container image.&lt;/p&gt;
&lt;p&gt;Let&#39;s look at an example. The Pod manifest below specifies &lt;code&gt;runAsUser=1000&lt;/code&gt;, &lt;code&gt;runAsGroup=3000&lt;/code&gt;, and &lt;code&gt;supplementalGroups=4000&lt;/code&gt; in the Pod&#39;s security context:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;implicit-groups&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsUser&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#666&#34;&gt;4000&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ctr&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry.k8s.io/e2e-test-images/agnhost:2.45&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sh&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;-c&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep 1h&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowPrivilegeEscalation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;What is the result of the &lt;code&gt;id&lt;/code&gt; command in the &lt;code&gt;ctr&lt;/code&gt; container? The output should be similar to this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;uid=1000 gid=3000 groups=3000,4000,50000
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Where does group ID &lt;code&gt;50000&lt;/code&gt; in supplementary groups (&lt;code&gt;groups&lt;/code&gt; field) come from, even though &lt;code&gt;50000&lt;/code&gt; is not defined in the Pod&#39;s manifest at all? The answer is &lt;code&gt;/etc/group&lt;/code&gt; file in the container image.&lt;/p&gt;
&lt;p&gt;Checking the contents of &lt;code&gt;/etc/group&lt;/code&gt; in the container image reveals the following:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This shows that the container&#39;s primary user &lt;code&gt;1000&lt;/code&gt; belongs to the group &lt;code&gt;50000&lt;/code&gt; in the last entry.&lt;/p&gt;
&lt;p&gt;Thus, the group membership defined in &lt;code&gt;/etc/group&lt;/code&gt; in the container image for the container&#39;s primary user is &lt;em&gt;implicitly&lt;/em&gt; merged into the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.&lt;/p&gt;
&lt;h3 id=&#34;what-s-wrong-with-it&#34;&gt;What&#39;s wrong with it?&lt;/h3&gt;
&lt;p&gt;The &lt;em&gt;implicitly&lt;/em&gt; merged group information from &lt;code&gt;/etc/group&lt;/code&gt; in the container image poses a security risk. These implicit GIDs can&#39;t be detected or validated by policy engines because there&#39;s no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see &lt;a href=&#34;https://issue.k8s.io/112879&#34;&gt;kubernetes/kubernetes#112879&lt;/a&gt; for details) because file permission is controlled by UID/GIDs in Linux.&lt;/p&gt;
&lt;h2 id=&#34;fine-grained-supplemental-groups-control-in-a-pod-supplementarygroupspolicy&#34;&gt;Fine-grained supplemental groups control in a Pod: &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;To tackle the above problem, a Pod&#39;s &lt;code&gt;.spec.securityContext&lt;/code&gt; now includes the &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt; field.&lt;/p&gt;
&lt;p&gt;This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Merge&lt;/em&gt;: The group membership defined in &lt;code&gt;/etc/group&lt;/code&gt; for the container&#39;s primary user is merged in. This is the default when the field is not specified, preserving the existing behavior for backward compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Strict&lt;/em&gt;: Only the group IDs specified in &lt;code&gt;fsGroup&lt;/code&gt;, &lt;code&gt;supplementalGroups&lt;/code&gt;, or &lt;code&gt;runAsGroup&lt;/code&gt; are attached as supplementary groups to the container processes. Group memberships defined in &lt;code&gt;/etc/group&lt;/code&gt; for the container&#39;s primary user are ignored.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&#39;s see how the &lt;code&gt;Strict&lt;/code&gt; policy works. The Pod manifest below specifies &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;strict-supplementalgroups-policy&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsUser&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;runAsGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#666&#34;&gt;4000&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroupsPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Strict&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ctr&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;registry.k8s.io/e2e-test-images/agnhost:2.45&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sh&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;-c&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep 1h&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;allowPrivilegeEscalation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output of the &lt;code&gt;id&lt;/code&gt; command in the &lt;code&gt;ctr&lt;/code&gt; container should be similar to:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;uid=1000 gid=3000 groups=3000,4000
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can see that the &lt;code&gt;Strict&lt;/code&gt; policy excludes group &lt;code&gt;50000&lt;/code&gt; from &lt;code&gt;groups&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;Thus, ensuring &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; (enforced by some policy mechanism) helps prevent implicit supplementary groups in a Pod.&lt;/p&gt;

&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;A container with sufficient privileges can change its process identity. The &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt; only affects the initial process identity. See the following section for details.&lt;/div&gt;

&lt;h2 id=&#34;attached-process-identity-in-pod-status&#34;&gt;Attached process identity in Pod status&lt;/h2&gt;
&lt;p&gt;This feature also exposes the process identity attached to the first container process
via the &lt;code&gt;.status.containerStatuses[].user.linux&lt;/code&gt; field. This is helpful for checking whether implicit group IDs are attached.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerStatuses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ctr&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;user&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;linux&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;gid&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroups&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;3000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;4000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;uid&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1000&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class=&#34;alert alert-info&#34; role=&#34;alert&#34;&gt;&lt;h4 class=&#34;alert-heading&#34;&gt;Note:&lt;/h4&gt;The values in the &lt;code&gt;status.containerStatuses[].user.linux&lt;/code&gt; field show the process identity &lt;em&gt;initially attached&lt;/em&gt;
to the first container process in the container. If the container has sufficient privilege
to call system calls related to process identity (e.g. &lt;a href=&#34;https://man7.org/linux/man-pages/man2/setuid.2.html&#34;&gt;&lt;code&gt;setuid(2)&lt;/code&gt;&lt;/a&gt;, &lt;a href=&#34;https://man7.org/linux/man-pages/man2/setgid.2.html&#34;&gt;&lt;code&gt;setgid(2)&lt;/code&gt;&lt;/a&gt;, or &lt;a href=&#34;https://man7.org/linux/man-pages/man2/setgroups.2.html&#34;&gt;&lt;code&gt;setgroups(2)&lt;/code&gt;&lt;/a&gt;), the container process can change its identity at runtime. Thus, the &lt;em&gt;actual&lt;/em&gt; process identity may differ.&lt;/div&gt;

&lt;h2 id=&#34;strict-policy-requires-newer-cri-versions&#34;&gt;&lt;code&gt;Strict&lt;/code&gt; policy requires newer CRI versions&lt;/h2&gt;
&lt;p&gt;The CRI runtime (e.g. containerd, CRI-O) plays a core role in calculating the supplementary group IDs to attach to containers. Thus, &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; requires a CRI runtime that supports this feature (&lt;code&gt;supplementalGroupsPolicy: Merge&lt;/code&gt; works even with CRI runtimes that do not support it, because that policy is fully backward compatible).&lt;/p&gt;
&lt;p&gt;Here are some CRI runtimes that support this feature, and the versions you need
to be running:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;containerd: v2.0 or later&lt;/li&gt;
&lt;li&gt;CRI-O: v1.31 or later&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can check whether a node supports the feature by looking at its &lt;code&gt;.status.features.supplementalGroupsPolicy&lt;/code&gt; field.&lt;/p&gt;
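&lt;p&gt;For instance, you could list this field for every node with a &lt;code&gt;kubectl&lt;/code&gt; query like the following (a sketch; the output depends on your cluster and requires cluster access to run):&lt;/p&gt;

```shell
# List each node together with whether its CRI runtime supports the feature
kubectl get nodes -o custom-columns='NAME:.metadata.name,SUPPORTED:.status.features.supplementalGroupsPolicy'
```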
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Node&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;features&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;supplementalGroupsPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;the-behavioral-changes-introduced-in-beta&#34;&gt;The behavioral changes introduced in beta&lt;/h2&gt;
&lt;p&gt;In the alpha release, when a Pod with &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; was scheduled to a node that did not support the feature (i.e., &lt;code&gt;.status.features.supplementalGroupsPolicy=false&lt;/code&gt;), the Pod&#39;s supplemental groups policy silently fell back to &lt;code&gt;Merge&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In v1.33, this feature entered beta and the policy is enforced more strictly: the kubelet rejects pods whose nodes cannot ensure the specified policy. If your pod is rejected, you will see warning events with &lt;code&gt;reason=SupplementalGroupsPolicyNotSupported&lt;/code&gt; like the following:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Event&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Warning&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reason&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;SupplementalGroupsPolicyNotSupported&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;message&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;SupplementalGroupsPolicy=Strict is not supported in this node&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;involvedObject&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;upgrade-consideration&#34;&gt;Upgrade consideration&lt;/h2&gt;
&lt;p&gt;If you&#39;re already using this feature, especially the &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; policy, we assume that your cluster&#39;s CRI runtimes already support this feature. In that case, you don&#39;t need to worry about the pod rejections described above.&lt;/p&gt;
&lt;p&gt;However, if your cluster:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;uses the &lt;code&gt;supplementalGroupsPolicy: Strict&lt;/code&gt; policy, but&lt;/li&gt;
&lt;li&gt;its CRI runtimes do NOT yet support the feature (i.e., &lt;code&gt;.status.features.supplementalGroupsPolicy=false&lt;/code&gt;),&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;you need to prepare for the behavioral change (pod rejection) when upgrading your cluster.&lt;/p&gt;
&lt;p&gt;We recommend several ways to avoid unexpected pod rejections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Upgrade your cluster&#39;s CRI runtimes together with Kubernetes, or before upgrading Kubernetes&lt;/li&gt;
&lt;li&gt;Label nodes according to whether their CRI runtime supports this feature, and add a node selector to pods with the &lt;code&gt;Strict&lt;/code&gt; policy so they are scheduled only onto supporting nodes (in this case, you will need to monitor the number of &lt;code&gt;Pending&lt;/code&gt; pods instead of pod rejections)&lt;/li&gt;
&lt;/ul&gt;
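&lt;p&gt;As a sketch of the labeling approach, assume a hypothetical node label &lt;code&gt;example.com/supports-strict-supplemental-groups&lt;/code&gt; that the cluster administrator applies to nodes whose CRI runtime supports the feature:&lt;/p&gt;

```yaml
# Hypothetical label, applied by the administrator beforehand, e.g.:
#   kubectl label node <node-name> example.com/supports-strict-supplemental-groups="true"
apiVersion: v1
kind: Pod
metadata:
  name: strict-policy-pod
spec:
  nodeSelector:
    example.com/supports-strict-supplemental-groups: "true"
  securityContext:
    supplementalGroupsPolicy: Strict
  containers:
  - name: ctr
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
```

&lt;p&gt;With this selector, unsupported nodes are never chosen, so the pod stays &lt;code&gt;Pending&lt;/code&gt; rather than being rejected by the kubelet.&lt;/p&gt;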
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;This feature is driven by the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node&#34;&gt;SIG Node&lt;/a&gt; community.
Please join us to connect with the community and share your ideas and feedback around the above feature and
beyond. We look forward to hearing from you!&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;!-- https://github.com/kubernetes/website/pull/46920 --&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/security-context/&#34;&gt;Configure a Security Context for a Pod or Container&lt;/a&gt;
for the further details of &lt;code&gt;supplementalGroupsPolicy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3619&#34;&gt;KEP-3619: Fine-grained SupplementalGroups control&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Prevent PersistentVolume Leaks When Deleting out of Order graduates to GA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/05/kubernetes-v1-33-prevent-persistentvolume-leaks-when-deleting-out-of-order-graduate-to-ga/</link>
      <pubDate>Mon, 05 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/05/kubernetes-v1-33-prevent-persistentvolume-leaks-when-deleting-out-of-order-graduate-to-ga/</guid>
      <description>
        
        
        &lt;p&gt;I am thrilled to announce that the feature to prevent
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/persistent-volumes/&#34;&gt;PersistentVolume&lt;/a&gt; (or PVs for short)
leaks when deleting out of order has graduated to General Availability (GA) in
Kubernetes v1.33! This improvement, initially introduced as a beta
feature in Kubernetes v1.31, ensures that your storage resources are properly
reclaimed, preventing unwanted leaks.&lt;/p&gt;
&lt;h2 id=&#34;how-did-reclaim-work-in-previous-kubernetes-releases&#34;&gt;How did reclaim work in previous Kubernetes releases?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/persistent-volumes/#Introduction&#34;&gt;PersistentVolumeClaim&lt;/a&gt; (or PVC for short) is
a user&#39;s request for storage. A PV and a PVC are considered &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/persistent-volumes/#Binding&#34;&gt;Bound&lt;/a&gt;
if a newly created PV or an existing matching PV is found for the PVC. The PVs themselves are
backed by volumes allocated by the storage backend.&lt;/p&gt;
&lt;p&gt;Normally, if the volume is to be deleted, then the expectation is to delete the
PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV
before deleting a PVC.&lt;/p&gt;
&lt;p&gt;For a &lt;code&gt;Bound&lt;/code&gt; PV-PVC pair, the ordering of PV-PVC deletion determines whether
the PV reclaim policy is honored. The reclaim policy is honored if the PVC is
deleted first; however, if the PV is deleted prior to deleting the PVC, then the
reclaim policy is not exercised. As a result of this behavior, the associated
storage asset in the external infrastructure is not removed.&lt;/p&gt;
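&lt;p&gt;For illustration, the problematic out-of-order deletion looks like this (the resource names are placeholders):&lt;/p&gt;

```shell
# Deleting the PV first, then the PVC: prior to this fix, the volume's
# Delete reclaim policy was skipped and the backend storage asset leaked.
kubectl delete pv <pv-name>
kubectl delete pvc <pvc-name>
```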
&lt;h2 id=&#34;pv-reclaim-policy-with-kubernetes-v1-33&#34;&gt;PV reclaim policy with Kubernetes v1.33&lt;/h2&gt;
&lt;p&gt;With the graduation to GA in Kubernetes v1.33, this issue is now resolved. Kubernetes
now reliably honors the configured &lt;code&gt;Delete&lt;/code&gt; reclaim policy, even when PVs are deleted
before their bound PVCs. This is achieved through the use of finalizers,
ensuring that the storage backend releases the allocated storage resource as intended.&lt;/p&gt;
&lt;h3 id=&#34;how-does-it-work&#34;&gt;How does it work?&lt;/h3&gt;
&lt;p&gt;For CSI volumes, the new behavior is achieved by adding a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/finalizers/&#34;&gt;finalizer&lt;/a&gt; &lt;code&gt;external-provisioner.volume.kubernetes.io/finalizer&lt;/code&gt;
on new and existing PVs. The finalizer is only removed after the backend storage is deleted. The addition and removal of the finalizer are handled by the &lt;code&gt;external-provisioner&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here is an example of a PV with the finalizer; notice the new finalizer in the finalizers list:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
&lt;/code&gt;&lt;/pre&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PersistentVolume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;pv.kubernetes.io/provisioned-by&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;csi.example.driver.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;creationTimestamp&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;2021-11-17T19:28:56Z&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;finalizers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- kubernetes.io/pv-protection&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- external-provisioner.volume.kubernetes.io/finalizer&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;194711&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;uid&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;087f14f2-4157-4e95-8a70-8294b039d30e&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;accessModes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- ReadWriteOnce&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;capacity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storage&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;1Gi&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;claimRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PersistentVolumeClaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-vanilla-block-pvc&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;default&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resourceVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;194677&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;uid&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;a7b7e3ba-f837-45ba-b243-dec7d8aaed53&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;csi&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;driver&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;csi.example.driver.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;fsType&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ext4&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeAttributes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storage.kubernetes.io/csiProvisionerIdentity&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;1637110610497-8081&lt;/span&gt;-csi.example.driver.com&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;type&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;CNS Block Volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeHandle&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;2dacf297-803f-4ccc-afc7-3d3c3f02051e&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;persistentVolumeReclaimPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Delete&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storageClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-vanilla-block-sc&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeMode&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Filesystem&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;phase&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Bound&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/finalizers/&#34;&gt;finalizer&lt;/a&gt; prevents this
PersistentVolume from being removed from the
cluster. As stated previously, the finalizer is only removed from the PV object
after it is successfully deleted from the storage backend. To learn more about
finalizers, please refer to &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2021/05/14/using-finalizers-to-control-deletion/&#34;&gt;Using Finalizers to Control Deletion&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Similarly, the finalizer &lt;code&gt;kubernetes.io/pv-controller&lt;/code&gt; is added to dynamically provisioned in-tree plugin volumes.&lt;/p&gt;
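&lt;p&gt;For illustration, a dynamically provisioned in-tree volume protected by this fix would carry that finalizer in its metadata, along the lines of this sketch (the PV name here is hypothetical):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv  # hypothetical name
  finalizers:
  - kubernetes.io/pv-controller
&lt;/code&gt;&lt;/pre&gt;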
&lt;h3 id=&#34;important-note&#34;&gt;Important note&lt;/h3&gt;
&lt;p&gt;The fix does not apply to statically provisioned in-tree plugin volumes.&lt;/p&gt;
&lt;h2 id=&#34;how-to-enable-new-behavior&#34;&gt;How to enable the new behavior&lt;/h2&gt;
&lt;p&gt;To take advantage of the new behavior, you must have upgraded your cluster to the v1.33 release of Kubernetes
and be running version &lt;code&gt;5.0.1&lt;/code&gt; or later of the CSI &lt;a href=&#34;https://github.com/kubernetes-csi/external-provisioner&#34;&gt;&lt;code&gt;external-provisioner&lt;/code&gt;&lt;/a&gt;.
The feature was released as beta in the v1.31 release of Kubernetes, where it was enabled by default.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2644-honor-pv-reclaim-policy&#34;&gt;KEP-2644&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes-csi/external-provisioner/issues/546&#34;&gt;Volume leak issue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/08/16/kubernetes-1-31-prevent-persistentvolume-leaks-when-deleting-out-of-order/&#34;&gt;Beta Release Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-storage/README.md#contact&#34;&gt;SIG Storage communication channels&lt;/a&gt;, including the Kubernetes Slack workspace, are great mediums to reach out to the SIG Storage and migration working group teams.&lt;/p&gt;
&lt;p&gt;Special thanks to the following people for their insightful reviews, thorough consideration, and valuable contributions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fan Baofa (carlory)&lt;/li&gt;
&lt;li&gt;Jan Šafránek (jsafrane)&lt;/li&gt;
&lt;li&gt;Xing Yang (xing-yang)&lt;/li&gt;
&lt;li&gt;Matthew Wong (wongma7)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Join the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;Kubernetes Storage Special Interest Group (SIG)&lt;/a&gt; if you&#39;re interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We’re rapidly growing and always welcome new contributors.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Mutable CSI Node Allocatable Count</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/02/kubernetes-1-33-mutable-csi-node-allocatable-count/</link>
      <pubDate>Fri, 02 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/02/kubernetes-1-33-mutable-csi-node-allocatable-count/</guid>
      <description>
        
        
        &lt;p&gt;Scheduling stateful applications reliably depends heavily on accurate information about resource availability on nodes.
Kubernetes v1.33 introduces an alpha feature called &lt;em&gt;mutable CSI node allocatable count&lt;/em&gt;, allowing Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle.
This capability significantly enhances the accuracy of pod scheduling decisions and reduces scheduling failures caused by outdated volume capacity information.&lt;/p&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node&#39;s lifecycle for various reasons, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Manual or external operations attaching/detaching volumes outside of Kubernetes control.&lt;/li&gt;
&lt;li&gt;Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.&lt;/li&gt;
&lt;li&gt;Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don&#39;t, leading to pods stuck in a &lt;code&gt;ContainerCreating&lt;/code&gt; state.&lt;/p&gt;
&lt;h2 id=&#34;dynamically-adapting-csi-volume-limits&#34;&gt;Dynamically adapting CSI volume limits&lt;/h2&gt;
&lt;p&gt;With the new feature gate &lt;code&gt;MutableCSINodeAllocatableCount&lt;/code&gt;, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler has the most accurate, up-to-date view of node capacity.&lt;/p&gt;
&lt;h3 id=&#34;how-it-works&#34;&gt;How it works&lt;/h3&gt;
&lt;p&gt;When this feature is enabled, Kubernetes supports two mechanisms for updating the reported node volume limits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Periodic Updates:&lt;/strong&gt; CSI drivers specify an interval to periodically refresh the node&#39;s allocatable capacity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reactive Updates:&lt;/strong&gt; An immediate update triggered when a volume attachment fails due to exhausted resources (&lt;code&gt;ResourceExhausted&lt;/code&gt; error).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;enabling-the-feature&#34;&gt;Enabling the feature&lt;/h3&gt;
&lt;p&gt;To use this alpha feature, you must enable the &lt;code&gt;MutableCSINodeAllocatableCount&lt;/code&gt; feature gate in these components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kube-apiserver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
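&lt;p&gt;For example, the kubelet side can be configured through a &lt;code&gt;KubeletConfiguration&lt;/code&gt; fragment like the following sketch; the &lt;code&gt;kube-apiserver&lt;/code&gt; takes the equivalent &lt;code&gt;--feature-gates=MutableCSINodeAllocatableCount=true&lt;/code&gt; command line flag:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MutableCSINodeAllocatableCount: true
&lt;/code&gt;&lt;/pre&gt;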
&lt;h3 id=&#34;example-csi-driver-configuration&#34;&gt;Example CSI driver configuration&lt;/h3&gt;
&lt;p&gt;Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This configuration directs Kubelet to periodically call the CSI driver&#39;s &lt;code&gt;NodeGetInfo&lt;/code&gt; method every 60 seconds, updating the node’s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.&lt;/p&gt;
&lt;h3 id=&#34;immediate-updates-on-attachment-failures&#34;&gt;Immediate updates on attachment failures&lt;/h3&gt;
&lt;p&gt;In addition to periodic updates, Kubernetes now reacts to attachment failures. Specifically, if a volume attachment fails with a &lt;code&gt;ResourceExhausted&lt;/code&gt; error (gRPC code &lt;code&gt;8&lt;/code&gt;), an immediate update is triggered to correct the allocatable count promptly.&lt;/p&gt;
&lt;p&gt;This proactive correction prevents repeated scheduling errors and helps maintain cluster health.&lt;/p&gt;
&lt;h2 id=&#34;getting-started&#34;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;To experiment with mutable CSI node allocatable count in your Kubernetes v1.33 cluster:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Enable the feature gate &lt;code&gt;MutableCSINodeAllocatableCount&lt;/code&gt; on the &lt;code&gt;kube-apiserver&lt;/code&gt; and &lt;code&gt;kubelet&lt;/code&gt; components.&lt;/li&gt;
&lt;li&gt;Update your CSI driver configuration by setting &lt;code&gt;nodeAllocatableUpdatePeriodSeconds&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Monitor and observe improvements in scheduling accuracy and pod placement reliability.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;next-steps&#34;&gt;Next steps&lt;/h2&gt;
&lt;p&gt;This feature is currently in alpha and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution toward beta and GA stability.&lt;/p&gt;
&lt;p&gt;Join discussions in the &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;Kubernetes Storage Special Interest Group (SIG-Storage)&lt;/a&gt; to shape the future of Kubernetes storage capabilities.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: New features in DRA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/01/kubernetes-v1-33-dra-updates/</link>
      <pubDate>Thu, 01 May 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/05/01/kubernetes-v1-33-dra-updates/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/&#34;&gt;Dynamic Resource Allocation&lt;/a&gt; (DRA) was originally introduced as an alpha feature in the v1.26 release, and then went through a significant redesign for Kubernetes v1.31. The main DRA feature went to beta in v1.32, and the project hopes it will be generally available in Kubernetes v1.34.&lt;/p&gt;
&lt;p&gt;The basic feature set of DRA provides a far more powerful and flexible API for requesting devices than the Device Plugin API. And while DRA remains a beta feature for v1.33, the DRA team has been hard at work implementing a number of new features and UX improvements. One feature has been promoted to beta, while a number of new features have been added in alpha. The team has also made progress towards getting DRA ready for GA.&lt;/p&gt;
&lt;h3 id=&#34;features-promoted-to-beta&#34;&gt;Features promoted to beta&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaim-device-status&#34;&gt;Driver-owned Resource Claim Status&lt;/a&gt; was promoted to beta. This allows the driver to report driver-specific device status data for each allocated device in a resource claim, which is particularly useful for supporting network devices.&lt;/p&gt;
&lt;h3 id=&#34;new-alpha-features&#34;&gt;New alpha features&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#partitionable-devices&#34;&gt;Partitionable Devices&lt;/a&gt; lets a driver advertise several overlapping logical devices (“partitions”), and the driver can reconfigure the physical device dynamically based on the actual devices allocated. This makes it possible to partition devices on-demand to meet the needs of the workloads and therefore increase the utilization.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations&#34;&gt;Device Taints and Tolerations&lt;/a&gt; allow devices to be tainted and for workloads to tolerate those taints. This makes it possible for drivers or cluster administrators to mark devices as unavailable. Depending on the effect of the taint, this can prevent devices from being allocated or cause eviction of pods that are using the device.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#prioritized-list&#34;&gt;Prioritized List&lt;/a&gt; lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.&lt;/p&gt;
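&lt;p&gt;As an illustrative sketch (the claim name and device classes below are hypothetical, and alpha API fields may still change), a ResourceClaim using a prioritized list could look like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-claim  # hypothetical
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:  # alternatives, tried in order
      - name: one-large-gpu
        deviceClassName: large-gpu.example.com
        count: 1
      - name: two-medium-gpus
        deviceClassName: medium-gpu.example.com
        count: 2
&lt;/code&gt;&lt;/pre&gt;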
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access&#34;&gt;Admin Access&lt;/a&gt; has been updated so that only users with access to a namespace with the &lt;code&gt;resource.k8s.io/admin-access: &amp;quot;true&amp;quot;&lt;/code&gt; label are authorized to create ResourceClaim or ResourceClaimTemplate objects with the &lt;code&gt;adminAccess&lt;/code&gt; field within the namespace. This grants administrators access to in-use devices and may enable additional permissions when making the device available in a container. This ensures that non-admin users cannot misuse the feature.&lt;/p&gt;
&lt;h3 id=&#34;preparing-for-general-availability&#34;&gt;Preparing for general availability&lt;/h3&gt;
&lt;p&gt;A new v1beta2 API has been added to simplify the user experience and to prepare for additional features being added in the future. The RBAC rules for DRA have been improved and support has been added for seamless upgrades of DRA drivers.&lt;/p&gt;
&lt;h3 id=&#34;what-s-next&#34;&gt;What’s next?&lt;/h3&gt;
&lt;p&gt;The plan for v1.34 is even more ambitious than for v1.33. Most importantly, we (the Kubernetes device management working group) hope to bring DRA to general availability, which will make it available by default on all v1.34 Kubernetes clusters. This also means that many, perhaps all, of the DRA features that are still beta in v1.34 will become enabled by default, making it much easier to use them.&lt;/p&gt;
&lt;p&gt;The alpha features that were added in v1.33 will be brought to beta in v1.34.&lt;/p&gt;
&lt;h3 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h3&gt;
&lt;p&gt;A good starting point is joining the WG Device Management &lt;a href=&#34;https://kubernetes.slack.com/archives/C0409NGC1TK&#34;&gt;Slack channel&lt;/a&gt; and &lt;a href=&#34;https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?tab=t.0#heading=h.tgg8gganowxq&#34;&gt;meetings&lt;/a&gt;, which happen at US/EU and EU/APAC friendly time slots.&lt;/p&gt;
&lt;p&gt;Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.&lt;/p&gt;
&lt;h3 id=&#34;acknowledgments&#34;&gt;Acknowledgments&lt;/h3&gt;
&lt;p&gt;A huge thanks to everyone who has contributed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cici Huang (&lt;a href=&#34;https://github.com/cici37&#34;&gt;cici37&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Ed Bartosh (&lt;a href=&#34;https://github.com/bart0sh&#34;&gt;bart0sh&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;John Belamaric (&lt;a href=&#34;https://github.com/johnbelamaric&#34;&gt;johnbelamaric&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Jon Huhn (&lt;a href=&#34;https://github.com/nojnhuh&#34;&gt;nojnhuh&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Kevin Klues (&lt;a href=&#34;https://github.com/klueska&#34;&gt;klueska&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Morten Torkildsen (&lt;a href=&#34;https://github.com/mortent&#34;&gt;mortent&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Patrick Ohly (&lt;a href=&#34;https://github.com/pohly&#34;&gt;pohly&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Rita Zhang (&lt;a href=&#34;https://github.com/ritazh&#34;&gt;ritazh&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Shingo Omura (&lt;a href=&#34;https://github.com/everpeace&#34;&gt;everpeace&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Storage Capacity Scoring of Nodes for Dynamic Provisioning (alpha)</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/30/kubernetes-v1-33-storage-capacity-scoring-feature/</link>
      <pubDate>Wed, 30 Apr 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/30/kubernetes-v1-33-storage-capacity-scoring-feature/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes v1.33 introduces a new alpha feature called &lt;code&gt;StorageCapacityScoring&lt;/code&gt;. This feature adds a scoring method for pod scheduling
with &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/&#34;&gt;the topology-aware volume provisioning&lt;/a&gt;.
This feature makes it easier to schedule pods onto nodes with either the most or the least available storage capacity.&lt;/p&gt;
&lt;h2 id=&#34;about-this-feature&#34;&gt;About this feature&lt;/h2&gt;
&lt;p&gt;This feature extends the kube-scheduler&#39;s VolumeBinding plugin to perform scoring using node storage capacity information
obtained from &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/storage-capacity/&#34;&gt;Storage Capacity&lt;/a&gt;. Before this change, the scheduler could only filter out nodes with insufficient storage capacity,
so you had to use a scheduler extender to achieve storage-capacity-based pod scheduling.&lt;/p&gt;
&lt;p&gt;This feature is useful for provisioning node-local PVs, which have size limits based on the node&#39;s storage capacity. By using this feature,
you can assign the PVs to the nodes with the most available storage space so that you can expand the PVs later as much as possible.&lt;/p&gt;
&lt;p&gt;In another use case, you might want to reduce the number of nodes as much as possible to lower operating costs in cloud environments by choosing
the node with the least available storage capacity. This feature helps maximize resource utilization by filling up nodes more sequentially, starting with the most
utilized nodes that still have enough storage capacity for the requested volume size.&lt;/p&gt;
&lt;h2 id=&#34;how-to-use&#34;&gt;How to use&lt;/h2&gt;
&lt;h3 id=&#34;enabling-the-feature&#34;&gt;Enabling the feature&lt;/h3&gt;
&lt;p&gt;In the alpha phase, &lt;code&gt;StorageCapacityScoring&lt;/code&gt; is disabled by default. To use this feature, add &lt;code&gt;StorageCapacityScoring=true&lt;/code&gt;
to the kube-scheduler command line option &lt;code&gt;--feature-gates&lt;/code&gt;.&lt;/p&gt;
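&lt;p&gt;For example (illustrative flag usage):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;kube-scheduler --feature-gates=StorageCapacityScoring=true
&lt;/code&gt;&lt;/pre&gt;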
&lt;h3 id=&#34;configuration-changes&#34;&gt;Configuration changes&lt;/h3&gt;
&lt;p&gt;You can configure node priorities based on storage utilization using the &lt;code&gt;shape&lt;/code&gt; parameter in the VolumeBinding plugin configuration.
This allows you to prioritize nodes with higher available storage capacity (default) or, conversely, nodes with lower available storage capacity.
For example, to prioritize lower available storage capacity, configure &lt;code&gt;KubeSchedulerConfiguration&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubescheduler.config.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeSchedulerConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;profiles&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;pluginConfig&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;VolumeBinding&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;args&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;shape&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;utilization&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;score&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;utilization&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;100&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;score&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For more details, please refer to the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/config-api/kube-scheduler-config.v1/#kubescheduler-config-k8s-io-v1-VolumeBindingArgs&#34;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;further-reading&#34;&gt;Further reading&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/4049-storage-capacity-scoring-of-nodes-for-dynamic-provisioning/README.md&#34;&gt;KEP-4049: Storage Capacity Scoring of Nodes for Dynamic Provisioning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;additional-note-relationship-with-volumecapacitypriority&#34;&gt;Additional note: Relationship with VolumeCapacityPriority&lt;/h2&gt;
&lt;p&gt;The alpha feature gate &lt;code&gt;VolumeCapacityPriority&lt;/code&gt;, which performs node scoring based on available storage capacity during static provisioning,
will be deprecated and replaced by &lt;code&gt;StorageCapacityScoring&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Please note that while &lt;code&gt;VolumeCapacityPriority&lt;/code&gt; prioritizes nodes with lower available storage capacity by default,
&lt;code&gt;StorageCapacityScoring&lt;/code&gt; prioritizes nodes with higher available storage capacity by default.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Image Volumes graduate to beta!</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/29/kubernetes-v1-33-image-volume-beta/</link>
      <pubDate>Tue, 29 Apr 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/29/kubernetes-v1-33-image-volume-beta/</guid>
      <description>
        
        
        &lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/08/16/kubernetes-1-31-image-volume-source&#34;&gt;Image Volumes&lt;/a&gt; were
introduced as an Alpha feature with the Kubernetes v1.31 release as part of
&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4639&#34;&gt;KEP-4639&lt;/a&gt;. In Kubernetes v1.33, this feature graduates to &lt;strong&gt;beta&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Please note that the feature is still &lt;em&gt;disabled&lt;/em&gt; by default, because not all
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/setup/production-environment/container-runtimes/&#34;&gt;container runtimes&lt;/a&gt; have
full support for it. &lt;a href=&#34;https://cri-o.io&#34;&gt;CRI-O&lt;/a&gt; has supported the initial feature since version v1.31 and
will add support for Image Volumes as beta in v1.33.
&lt;a href=&#34;https://github.com/containerd/containerd/pull/10579&#34;&gt;containerd merged&lt;/a&gt; support
for the alpha feature which will be part of the v2.1.0 release and is working on
beta support as part of &lt;a href=&#34;https://github.com/containerd/containerd/pull/11578&#34;&gt;PR #11578&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;what-s-new&#34;&gt;What&#39;s new&lt;/h3&gt;
&lt;p&gt;The major change for the beta graduation of Image Volumes is the support for
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#using-subpath&#34;&gt;&lt;code&gt;subPath&lt;/code&gt;&lt;/a&gt; and
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#using-subpath-expanded-environment&#34;&gt;&lt;code&gt;subPathExpr&lt;/code&gt;&lt;/a&gt; mounts
for containers via &lt;code&gt;spec.containers[*].volumeMounts.[subPath,subPathExpr]&lt;/code&gt;. This
allows end-users to mount a specific subdirectory of an image volume, which is
still mounted read-only (and &lt;code&gt;noexec&lt;/code&gt;). This means that non-existing
subdirectories cannot be mounted by default. As with other &lt;code&gt;subPath&lt;/code&gt; and
&lt;code&gt;subPathExpr&lt;/code&gt; values, Kubernetes ensures that the specified sub path contains no
absolute paths or relative path components. Container runtimes are
also required to double-check those requirements for safety reasons. If a
specified subdirectory does not exist within a volume, then runtimes should fail
on container creation and provide user feedback through existing kubelet
events.&lt;/p&gt;
&lt;p&gt;Besides that, there are also three new kubelet metrics available for image volumes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubelet_image_volume_requested_total&lt;/code&gt;: Tracks the number of requested image volumes.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet_image_volume_mounted_succeed_total&lt;/code&gt;: Counts the number of successful image volume mounts.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubelet_image_volume_mounted_errors_total&lt;/code&gt;: Counts the number of failed image volume mounts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To use an existing subdirectory for a specific image volume, just use it as
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#using-subpath&#34;&gt;&lt;code&gt;subPath&lt;/code&gt;&lt;/a&gt; (or
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#using-subpath-expanded-environment&#34;&gt;&lt;code&gt;subPathExpr&lt;/code&gt;&lt;/a&gt;)
value of the container&#39;s &lt;code&gt;volumeMounts&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;image-volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;shell&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;infinity&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;debian&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeMounts&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;mountPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;subPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;dir&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;volume&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reference&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;quay.io/crio/artifact:v2&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;pullPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;IfNotPresent&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then, create the pod on your cluster:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl apply -f image-volumes-subpath.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now you can attach to the container:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl attach -it image-volume bash
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;And check the content of the file from the &lt;code&gt;dir&lt;/code&gt; sub path in the volume:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cat /volume/file
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output will be similar to:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;1
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Thank you for reading through to the end of this blog post! SIG Node is proud and
happy to deliver this feature graduation as part of Kubernetes v1.33.&lt;/p&gt;
&lt;p&gt;As the author of this blog post, I would like to extend my special thanks to
&lt;strong&gt;all&lt;/strong&gt; the individuals involved!&lt;/p&gt;
&lt;p&gt;If you would like to provide feedback or suggestions, feel free to reach out
to SIG Node using the &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;Kubernetes Slack (#sig-node)&lt;/a&gt;
channel or the &lt;a href=&#34;https://groups.google.com/g/kubernetes-sig-node&#34;&gt;SIG Node mailing list&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;further-reading&#34;&gt;Further reading&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/image-volumes/&#34;&gt;Use an Image Volume With a Pod&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#image&#34;&gt;&lt;code&gt;image&lt;/code&gt; volume overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: HorizontalPodAutoscaler Configurable Tolerance</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/28/kubernetes-v1-33-hpa-configurable-tolerance/</link>
      <pubDate>Mon, 28 Apr 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/28/kubernetes-v1-33-hpa-configurable-tolerance/</guid>
      <description>
        
        
        &lt;p&gt;This post describes &lt;em&gt;configurable tolerance for horizontal Pod autoscaling&lt;/em&gt;,
a new alpha feature first available in Kubernetes 1.33.&lt;/p&gt;
&lt;h2 id=&#34;what-is-it&#34;&gt;What is it?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/run-application/horizontal-pod-autoscale/&#34;&gt;Horizontal Pod Autoscaling&lt;/a&gt;
is a well-known Kubernetes feature that allows your workload to
automatically resize by adding or removing replicas based on resource
utilization.&lt;/p&gt;
&lt;p&gt;Let&#39;s say you have a web application running in a Kubernetes cluster with 50
replicas. You configure the HorizontalPodAutoscaler (HPA) to scale based on
CPU utilization, with a target of 75% utilization. Now, imagine that the current
CPU utilization across all replicas is 90%, which is higher than the desired
75%. The HPA will calculate the required number of replicas using the formula:&lt;/p&gt;

&lt;div class=&#34;math&#34;&gt;$$desiredReplicas = \left\lceil currentReplicas \times \frac{currentMetricValue}{desiredMetricValue} \right\rceil$$&lt;/div&gt;&lt;p&gt;In this example:&lt;/p&gt;

&lt;div class=&#34;math&#34;&gt;$$50 \times (90/75) = 60$$&lt;/div&gt;&lt;p&gt;So, the HPA will increase the number of replicas from 50 to 60 to reduce the
load on each pod. Similarly, if the CPU utilization were to drop below 75%, the
HPA would scale down the number of replicas accordingly. The Kubernetes
documentation provides a
&lt;a href=&#34;https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details&#34;&gt;detailed description of the scaling algorithm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In order to avoid replicas being created or deleted whenever a small metric
fluctuation occurs, Kubernetes applies a form of hysteresis: it only changes the
number of replicas when the current and desired metric values differ by more
than 10%. In the example above, the ratio between the current and desired
metric values is \(90/75\), or 20% above target. Since that exceeds the 10%
tolerance, the scale-up action proceeds.&lt;/p&gt;
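&lt;p&gt;To illustrate with a hypothetical variation of the same example: if the current
CPU utilization were 80% instead of 90%, the raw formula would give&lt;/p&gt;
&lt;div class=&#34;math&#34;&gt;$$50 \times (80/75) \approx 53.3$$&lt;/div&gt;&lt;p&gt;but since the ratio \(80/75\) is only about 7% above target, well within the 10%
tolerance, the HPA would leave the replica count at 50.&lt;/p&gt;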
&lt;p&gt;This default tolerance of 10% is cluster-wide; in older Kubernetes releases, it
could not be fine-tuned. It&#39;s a suitable value for most usage, but too coarse
for large deployments, where a 10% tolerance represents tens of pods. As a
result, the community has long
&lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/116984&#34;&gt;asked&lt;/a&gt; to be able to
tune this value.&lt;/p&gt;
&lt;p&gt;In Kubernetes v1.33, this is now possible.&lt;/p&gt;
&lt;h2 id=&#34;how-do-i-use-it&#34;&gt;How do I use it?&lt;/h2&gt;
&lt;p&gt;After enabling the &lt;code&gt;HPAConfigurableTolerance&lt;/code&gt;
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/command-line-tools-reference/feature-gates/&#34;&gt;feature gate&lt;/a&gt; in
your Kubernetes v1.33 cluster, you can add your desired tolerance for your
HorizontalPodAutoscaler object.&lt;/p&gt;
&lt;p&gt;Tolerances appear under the &lt;code&gt;spec.behavior.scaleDown&lt;/code&gt; and
&lt;code&gt;spec.behavior.scaleUp&lt;/code&gt; fields and can thus be different for scale up and scale
down. A typical usage would be to specify a small tolerance on scale up (to
react quickly to spikes), but a higher one on scale down (to avoid removing
replicas too quickly in response to small metric fluctuations).&lt;/p&gt;
&lt;p&gt;For example, an HPA with a tolerance of 5% on scale-down, and no tolerance on
scale-up, would look like the following:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;autoscaling/v2&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;HorizontalPodAutoscaler&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;my-app&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;behavior&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;scaleDown&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerance&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;0.05&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;scaleUp&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;tolerance&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;i-want-all-the-details&#34;&gt;I want all the details!&lt;/h2&gt;
&lt;p&gt;Get all the technical details by reading
&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-autoscaling/4951-configurable-hpa-tolerance&#34;&gt;KEP-4951&lt;/a&gt;
and follow &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4951&#34;&gt;issue 4951&lt;/a&gt;
to be notified of the feature graduation.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: User Namespaces enabled by default!</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/25/userns-enabled-by-default/</link>
      <pubDate>Fri, 25 Apr 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/25/userns-enabled-by-default/</guid>
      <description>
        
        
&lt;p&gt;In Kubernetes v1.33, support for user namespaces is enabled by default. This means
that, when the stack requirements are met, pods can opt in to using user
namespaces. There is no longer any Kubernetes feature gate to enable!&lt;/p&gt;
&lt;p&gt;In this blog post we answer some common questions about user namespaces. But,
before we dive into that, let&#39;s recap what user namespaces are and why they are
important.&lt;/p&gt;
&lt;h2 id=&#34;what-is-a-user-namespace&#34;&gt;What is a user namespace?&lt;/h2&gt;
&lt;p&gt;Note: Linux user namespaces are a different concept from &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/namespaces/&#34;&gt;Kubernetes
namespaces&lt;/a&gt;.
The former is a Linux kernel feature; the latter is a Kubernetes feature.&lt;/p&gt;
&lt;p&gt;Linux provides different namespaces to isolate processes from each other. For
example, a typical Kubernetes pod runs within a network namespace to isolate the
network identity and a PID namespace to isolate the processes.&lt;/p&gt;
&lt;p&gt;One Linux namespace that was left behind is the &lt;a href=&#34;https://man7.org/linux/man-pages/man7/user_namespaces.7.html&#34;&gt;user
namespace&lt;/a&gt;. It
isolates the UIDs and GIDs of the containers from the ones on the host. The
identifiers in a container can be mapped to identifiers on the host in a way
where the host and container(s) never end up with overlapping UIDs/GIDs. Furthermore,
the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on
the host. This brings three key benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Prevention of lateral movement&lt;/em&gt;: As the UIDs and GIDs for different
containers are mapped to different UIDs and GIDs on the host, containers have a
harder time attacking each other, even if they escape the container boundaries.
For example, suppose container A runs with different UIDs and GIDs on the host
than container B. In that case, the operations it can do on container B&#39;s files
and processes are limited: it can only read/write what a file allows to others, as
it will never have owner or group permissions (the UIDs/GIDs on the host are
guaranteed to be different for different containers).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Increased host isolation&lt;/em&gt;: As the UIDs and GIDs are mapped to unprivileged
users on the host, if a container escapes the container boundaries, even if it
runs as root inside the container, it has no privileges on the host. This
greatly protects what host files it can read/write, which process it can send
signals to, etc. Furthermore, capabilities granted are only valid inside the
user namespace and not on the host, limiting the impact a container
escape can have.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Enablement of new use cases&lt;/em&gt;: User namespaces allow containers to gain
certain capabilities inside their own user namespace without affecting the host.
This unlocks new possibilities, such as running applications that require
privileged operations without granting full root access on the host. This is
particularly useful for running nested containers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;figure class=&#34;diagram-medium &#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/images/blog/2024-04-22-userns-beta/image.svg&#34;
         alt=&#34;Image showing IDs 0-65535 are reserved to the host, pods use higher IDs&#34;/&gt; &lt;figcaption&gt;
            &lt;h4&gt;User namespace IDs allocation&lt;/h4&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If a pod running as the root user without a user namespace manages to break out,
it has root privileges on the node.  If some capabilities were granted to the
container, the capabilities are valid on the host too. None of this is true when
using user namespaces (modulo bugs, of course 🙂).&lt;/p&gt;
&lt;h2 id=&#34;demos&#34;&gt;Demos&lt;/h2&gt;
&lt;p&gt;Rodrigo created demos to understand how some CVEs are mitigated when user
namespaces are used. We showed them here before (see &lt;a href=&#34;https://kubernetes.io/blog/2023/09/13/userns-alpha/&#34;&gt;here&lt;/a&gt; and
&lt;a href=&#34;https://kubernetes.io/blog/2024/04/22/userns-beta/&#34;&gt;here&lt;/a&gt;), but take a look if you haven&#39;t:&lt;/p&gt;
&lt;p&gt;Mitigation of CVE 2024-21626 with user namespaces:&lt;/p&gt;


    
    &lt;div class=&#34;youtube-quote-sm&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen=&#34;allowfullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/07y5bl5UDdA?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0&#34; title=&#34;Mitigation of CVE-2024-21626 on Kubernetes by enabling User Namespace support&#34;
      &gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;p&gt;Mitigation of CVE 2022-0492 with user namespaces:&lt;/p&gt;


    
    &lt;div class=&#34;youtube-quote-sm&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen=&#34;allowfullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/M4a2b4KkXN8?autoplay=0&amp;controls=1&amp;end=0&amp;loop=0&amp;mute=0&amp;start=0&#34; title=&#34;Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support&#34;
      &gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h2 id=&#34;everything-you-wanted-to-know-about-user-namespaces-in-kubernetes&#34;&gt;Everything you wanted to know about user namespaces in Kubernetes&lt;/h2&gt;
&lt;p&gt;Here we try to answer some of the questions we have been asked about user
namespaces support in Kubernetes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. What are the requirements to use it?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The requirements are documented &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/user-namespaces/#before-you-begin&#34;&gt;here&lt;/a&gt;. But we will elaborate a bit
more in the following questions.&lt;/p&gt;
&lt;p&gt;Note this is a Linux-only feature.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. How do I configure a pod to opt-in?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A complete step-by-step guide is available &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/user-namespaces/&#34;&gt;here&lt;/a&gt;. But the short
version is you need to set the &lt;code&gt;hostUsers: false&lt;/code&gt; field in the pod spec. For
example like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;userns&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostUsers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;shell&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;sleep&amp;#34;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;infinity&amp;#34;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;debian&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Yes, it is that simple. Applications will run just fine, without any other
changes needed (unless your application needs host privileges).&lt;/p&gt;
&lt;p&gt;User namespaces allow you to run as root inside the container without having
privileges on the host. However, if your application needs privileges on the
host, for example an app that needs to load a kernel module, then you can&#39;t use
user namespaces.&lt;/p&gt;
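&lt;p&gt;To verify (an illustrative check, not a required step) that a pod with
&lt;code&gt;hostUsers: false&lt;/code&gt; really runs inside a user namespace, you can inspect the
UID map from within the container, assuming the pod above is named &lt;code&gt;userns&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl exec userns -- cat /proc/self/uid_map&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output will be similar to the following, showing container UID 0 mapped to an
unprivileged high UID on the host, rather than the identity mapping
&lt;code&gt;0 0 4294967295&lt;/code&gt; you would see without a user namespace:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-none&#34; data-lang=&#34;none&#34;&gt;0 2952792064 65536
&lt;/code&gt;&lt;/pre&gt;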
&lt;p&gt;&lt;strong&gt;3. What are idmap mounts, and why do the file-systems used need to support them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Idmap mounts are a Linux kernel feature that uses a mapping of UIDs/GIDs when
accessing a mount. When combined with user namespaces, it greatly simplifies the
support for volumes, as you can forget about the host UIDs/GIDs the user
namespace is using.&lt;/p&gt;
&lt;p&gt;In particular, thanks to idmap mounts we can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run each pod with different UIDs/GIDs on the host. This is key for the
lateral movement prevention we mentioned earlier.&lt;/li&gt;
&lt;li&gt;Share volumes with pods that don&#39;t use user namespaces.&lt;/li&gt;
&lt;li&gt;Enable/disable user namespaces without needing to chown the pod&#39;s volumes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Support for idmap mounts in the kernel is per file-system, and different kernel
releases added support for different file-systems.&lt;/p&gt;
&lt;p&gt;To find which kernel version added support for each file-system, you can check
out the &lt;code&gt;mount_setattr&lt;/code&gt; man page, or the online version of it
&lt;a href=&#34;https://man7.org/linux/man-pages/man2/mount_setattr.2.html#NOTES&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Most popular file-systems are supported; the notable absence is NFS, which
isn&#39;t supported yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. Can you clarify exactly which file-systems need to support idmap mounts?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The file-systems that need to support idmap mounts are all the file-systems used
by a pod in the &lt;code&gt;pod.spec.volumes&lt;/code&gt; field.&lt;/p&gt;
&lt;p&gt;This means: for PV/PVC volumes, the file-system used in the PV needs to support
idmap mounts; for hostPath volumes, the file-system used in the hostPath
needs to support idmap mounts.&lt;/p&gt;
&lt;p&gt;What does this mean for secrets/configmaps/projected/downwardAPI volumes? For
these volumes, the kubelet creates a &lt;code&gt;tmpfs&lt;/code&gt; file-system. So, you will need a
6.3 kernel to use these volumes (note that consuming them as environment
variables instead is fine).&lt;/p&gt;
&lt;p&gt;And what about emptyDir volumes? Those volumes are created by the kubelet by
default in &lt;code&gt;/var/lib/kubelet/pods/&lt;/code&gt;. You can also use a custom directory for
this. But what needs to support idmap mounts is the file-system used in that
directory.&lt;/p&gt;
&lt;p&gt;The kubelet creates some more files for the container, like &lt;code&gt;/etc/hostname&lt;/code&gt;,
&lt;code&gt;/etc/resolv.conf&lt;/code&gt;, &lt;code&gt;/dev/termination-log&lt;/code&gt;, &lt;code&gt;/etc/hosts&lt;/code&gt;, etc. These files are
also created in &lt;code&gt;/var/lib/kubelet/pods/&lt;/code&gt; by default, so it&#39;s important for the
file-system used in that directory to support idmap mounts.&lt;/p&gt;
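&lt;p&gt;To see which file-system backs that directory on a node (an illustrative check,
assuming the default &lt;code&gt;/var/lib/kubelet&lt;/code&gt; path), you can run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;findmnt -n -o FSTYPE --target /var/lib/kubelet&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The output will be something like &lt;code&gt;ext4&lt;/code&gt;, which you can then look up in the
&lt;code&gt;mount_setattr&lt;/code&gt; man page mentioned above.&lt;/p&gt;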
&lt;p&gt;Also, some container runtimes may put some of these ephemeral volumes inside a
&lt;code&gt;tmpfs&lt;/code&gt; file-system, in which case you will need support for idmap mounts in
&lt;code&gt;tmpfs&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Can I use a kernel older than 6.3?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yes, but you will need to make sure you are not using a &lt;code&gt;tmpfs&lt;/code&gt; file-system. If
you avoid that, you can easily use 5.19 (if all the other file-systems you use
support idmap mounts in that kernel).&lt;/p&gt;
&lt;p&gt;It can be tricky to avoid using &lt;code&gt;tmpfs&lt;/code&gt;, though, as we just described above.
Besides having to avoid those volume types, you will also have to avoid mounting the
service account token. Every pod has it mounted by default, and it uses a
projected volume that, as we mentioned, uses a &lt;code&gt;tmpfs&lt;/code&gt; file-system.&lt;/p&gt;
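&lt;p&gt;If your workload does not need to talk to the API server, one way to avoid that
particular &lt;code&gt;tmpfs&lt;/code&gt; mount is to disable the automatic token mount in the pod
spec (a minimal sketch; the pod name is illustrative):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;apiVersion: v1&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kind: Pod&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;metadata:&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  name: userns-no-token&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;spec:&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  hostUsers: false&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  # skip the tmpfs-backed projected service account token volume&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  automountServiceAccountToken: false&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  containers:&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - name: shell&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    command: [&amp;#34;sleep&amp;#34;, &amp;#34;infinity&amp;#34;]&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    image: debian&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;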
&lt;p&gt;You could even go lower than 5.19, all the way to 5.12. However, your container
rootfs probably uses an overlayfs file-system, and support for overlayfs was
added in 5.19. We wouldn&#39;t recommend using a kernel older than 5.19, as not
being able to use idmap mounts for the rootfs is a big limitation. If you
absolutely need to, you can check &lt;a href=&#34;https://kinvolk.io/blog/2023/11/tips-and-tricks-for-user-namespaces-with-kubernetes-and-containerd&#34;&gt;this blog post&lt;/a&gt; Rodrigo wrote
some years ago, about tricks to use user namespaces when you can&#39;t support
idmap mounts on the rootfs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6. If my stack supports user namespaces, do I need to configure anything else?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;No, if your stack supports it and you are using Kubernetes v1.33, there is
nothing you &lt;em&gt;need&lt;/em&gt; to configure. You should be able to follow the task: &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/user-namespaces/&#34;&gt;Use a
user namespace with a pod&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, in case you have specific requirements, you may configure various
options. You can find more information &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/user-namespaces/#set-up-a-node-to-support-user-namespaces&#34;&gt;here&lt;/a&gt;. You can also
enable a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/user-namespaces/#integration-with-pod-security-admission-checks&#34;&gt;feature gate to relax the PSS rules&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;7. The demos are nice, but are there more CVEs that this mitigates?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yes, quite a lot, actually! Besides the ones in the demos, the KEP has &lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/b8013bfbceb16843686aebbb2ccffce81a6e772d/keps/sig-node/127-user-namespaces/README.md#motivation&#34;&gt;more CVEs
you can check&lt;/a&gt;. That list is not exhaustive; there are many more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;8. Can you sum up why user namespaces is important?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Think about running a process as root, maybe even an untrusted process. Do you
think that is secure? What if we limit it by adding seccomp and apparmor, mask
some files in /proc (so it can&#39;t crash the node, etc.), and apply some more tweaks?&lt;/p&gt;
&lt;p&gt;Wouldn&#39;t it be better if we don&#39;t give it privileges in the first place, instead
of trying to play whack-a-mole with all the possible ways root can escape?&lt;/p&gt;
&lt;p&gt;This is what user namespaces do, plus some other goodies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run as an unprivileged user on the host without making changes to your application&lt;/strong&gt;.
Greg and Vinayak gave a great talk on the pains you can face when trying to run
unprivileged without user namespaces. The section on those pain points &lt;a href=&#34;https://youtu.be/uouH9fsWVIE?feature=shared&amp;t=351&#34;&gt;starts here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;All pods run with different UIDs/GIDs, which significantly limits lateral
movement&lt;/strong&gt;. This is guaranteed with user namespaces (the kubelet chooses the
mappings for you). In the same talk, Greg and Vinayak show that to achieve the same without
user namespaces, they had to build a quite complex custom solution. This part
&lt;a href=&#34;https://youtu.be/uouH9fsWVIE?feature=shared&amp;t=793&#34;&gt;starts here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The capabilities granted are only valid inside the user namespace&lt;/strong&gt;. That
means that if a pod breaks out of the container, those capabilities are not valid on the
host. We can&#39;t provide that guarantee without user namespaces.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;It enables new use cases in a &lt;em&gt;secure&lt;/em&gt; way&lt;/strong&gt;. You can run Docker in Docker,
unprivileged container builds, Kubernetes inside Kubernetes, and more, all &lt;strong&gt;in a secure
way&lt;/strong&gt;. Most of the previous solutions for these required privileged containers or
put the node at high risk of compromise.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;9. Is there container runtime documentation for user namespaces?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yes, we have &lt;a href=&#34;https://github.com/containerd/containerd/tree/b22a302a75d9a7d7955780e54cc5b32de6c8525d/docs/user-namespaces&#34;&gt;containerd
documentation&lt;/a&gt;.
It explains the limitations of containerd 1.7 and how to use
user namespaces in containerd without Kubernetes pods (using &lt;code&gt;ctr&lt;/code&gt;). Note that
if you use containerd, you need containerd 2.0 or later to use user namespaces
with Kubernetes.&lt;/p&gt;
&lt;p&gt;CRI-O doesn&#39;t have special documentation for user namespaces; it works out of
the box.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;10. What about the other container runtimes?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;No other container runtime that we are aware of supports user namespaces with
Kubernetes. That sadly includes &lt;a href=&#34;https://github.com/Mirantis/cri-dockerd/issues/74&#34;&gt;cri-dockerd&lt;/a&gt; too.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;11. I&#39;d like to learn more about it, what would you recommend?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Rodrigo did an introduction to user namespaces at KubeCon 2022:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://sched.co/182K0&#34;&gt;Run As “Root”, Not Root: User Namespaces In K8s - Marga Manterola, Isovalent &amp;amp; Rodrigo Campos Catelin&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The aforementioned presentation at KubeCon 2023 is also
useful as motivation for user namespaces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://sched.co/1HyX4&#34;&gt;Least Privilege Containers: Keeping a Bad Day from Getting Worse - Greg Castle &amp;amp; Vinayak Goyal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bear in mind that these presentations are a few years old; some things have changed since
then. Use the Kubernetes documentation as the source of truth.&lt;/p&gt;
&lt;p&gt;If you would like to learn more about the low-level details of user namespaces,
you can check &lt;code&gt;man 7 user_namespaces&lt;/code&gt; and &lt;code&gt;man 1 unshare&lt;/code&gt;. You can easily create
namespaces and experiment with how they behave. Be aware that the &lt;code&gt;unshare&lt;/code&gt; tool
is very flexible, and with that flexibility comes the ability to create incomplete setups.&lt;/p&gt;
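&lt;p&gt;As a quick hands-on sketch (assuming a Linux system where unprivileged user namespace creation is enabled), you can use &lt;code&gt;unshare&lt;/code&gt; to see how your UID is mapped inside a new user namespace:&lt;/p&gt;

```shell
# Create a new user namespace, mapping the current (unprivileged) user
# to root inside it, then show the UID as seen from inside the
# namespace along with the namespace's UID mapping.
unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
```

&lt;p&gt;Inside the namespace the process reports UID 0, while &lt;code&gt;/proc/self/uid_map&lt;/code&gt; shows that UID 0 inside is mapped to your real, unprivileged UID on the host. If the command fails, unprivileged user namespace creation may be disabled on your distribution or blocked by a seccomp profile.&lt;/p&gt;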
&lt;p&gt;If you would like to know more about idmap mounts, you can check &lt;a href=&#34;https://docs.kernel.org/filesystems/idmappings.html&#34;&gt;its Linux
kernel documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Running pods as root is not ideal, and running them as non-root is also hard
with containers, as it can require a lot of changes to your applications.
User namespaces are a unique feature that gives you the best of both worlds: running
as non-root without any changes to your application.&lt;/p&gt;
&lt;p&gt;This post covered what user namespaces are, why they are important, some real-world
examples of CVEs mitigated by user namespaces, and some common questions.
Hopefully, this post helped you eliminate any remaining doubts, and you
will now try user namespaces (if you didn&#39;t already!).&lt;/p&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;You can reach SIG Node by several means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slack: &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-node&#34;&gt;#sig-node&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;Mailing list&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/labels/sig%2Fnode&#34;&gt;Open Community Issues/PRs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can also contact us directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub: @rata @giuseppe @saschagrunert&lt;/li&gt;
&lt;li&gt;Slack: @rata @giuseppe @sascha&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Continuing the transition from Endpoints to EndpointSlices</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/24/endpoints-deprecation/</link>
      <pubDate>Thu, 24 Apr 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/24/endpoints-deprecation/</guid>
      <description>
        
        
        &lt;p&gt;Since the addition of &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2020/09/02/scaling-kubernetes-networking-with-endpointslices/&#34;&gt;EndpointSlices&lt;/a&gt; (&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0752-endpointslices/README.md&#34;&gt;KEP-752&lt;/a&gt;) as alpha in v1.15
and later GA in v1.21, the
Endpoints API in Kubernetes has been gathering dust. New Service
features like &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/dual-stack/&#34;&gt;dual-stack networking&lt;/a&gt; and &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#traffic-distribution&#34;&gt;traffic distribution&lt;/a&gt; are
only supported via the EndpointSlice API, so all service proxies,
Gateway API implementations, and similar controllers have had to be
ported from using Endpoints to using EndpointSlices. At this point,
the Endpoints API is really only there to avoid breaking end user
workloads and scripts that still make use of it.&lt;/p&gt;
&lt;p&gt;As of Kubernetes 1.33, the Endpoints API is now officially deprecated,
and the API server will return warnings to users who read or write
Endpoints resources rather than using EndpointSlices.&lt;/p&gt;
&lt;p&gt;Eventually, the plan (as documented in &lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/4974-deprecate-endpoints/README.md&#34;&gt;KEP-4974&lt;/a&gt;) is to change the
&lt;a href=&#34;https://www.cncf.io/training/certification/software-conformance/&#34;&gt;Kubernetes Conformance&lt;/a&gt; criteria to no longer require that clusters
run the &lt;em&gt;Endpoints controller&lt;/em&gt; (which generates Endpoints objects
based on Services and Pods), to avoid doing work that is unneeded in
most modern-day clusters.&lt;/p&gt;
&lt;p&gt;Thus, while the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/deprecation-policy/&#34;&gt;Kubernetes deprecation policy&lt;/a&gt; means that the
Endpoints type itself will probably never completely go away, users
who still have workloads or scripts that use the Endpoints API should
start migrating them to EndpointSlices.&lt;/p&gt;
&lt;h2 id=&#34;notes-on-migrating-from-endpoints-to-endpointslices&#34;&gt;Notes on migrating from Endpoints to EndpointSlices&lt;/h2&gt;
&lt;h3 id=&#34;consuming-endpointslices-rather-than-endpoints&#34;&gt;Consuming EndpointSlices rather than Endpoints&lt;/h3&gt;
&lt;p&gt;For end users, the biggest change between the Endpoints API and the
EndpointSlice API is that while every Service with a &lt;code&gt;selector&lt;/code&gt; has
exactly 1 Endpoints object (with the same name as the Service), a
Service may have any number of EndpointSlices associated with it:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; kubectl get endpoints myservice
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;NAME        ENDPOINTS          AGE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;myservice   10.180.3.17:443    1h
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; kubectl get endpointslice -l kubernetes.io/service-name&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;myservice
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;NAME              ADDRESSTYPE   PORTS   ENDPOINTS          AGE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;myservice-7vzhx   IPv4          443     10.180.3.17        21s
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;myservice-jcv8s   IPv6          443     2001:db8:0123::5   21s
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In this case, because the service is dual stack, it has 2
EndpointSlices: 1 for IPv4 addresses and 1 for IPv6 addresses. (The
Endpoints API does not support dual stack, so the Endpoints object
shows only the addresses in the cluster&#39;s primary address family.)
Although any Service with multiple endpoints &lt;em&gt;can&lt;/em&gt; have multiple
EndpointSlices, there are three main cases where you will see this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;An EndpointSlice can only represent endpoints of a single IP
family, so dual-stack Services will have separate EndpointSlices
for IPv4 and IPv6.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All of the endpoints in an EndpointSlice must target the same
ports. So, for example, if you have a set of endpoint Pods
listening on port 80, and roll out an update to make them listen
on port 8080 instead, then while the rollout is in progress, the
Service will need 2 EndpointSlices: 1 for the endpoints listening
on port 80, and 1 for the endpoints listening on port 8080.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When a Service has more than 100 endpoints, the EndpointSlice
controller will split the endpoints into multiple EndpointSlices
rather than aggregating them into a single excessively-large
object like the Endpoints controller does.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
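&lt;p&gt;Because of this, code that consumes EndpointSlices needs to merge (and deduplicate) endpoints across all of a Service&#39;s slices. A minimal sketch of that pattern, using simplified stand-in structs rather than the real &lt;code&gt;discovery.k8s.io/v1&lt;/code&gt; Go types:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the discovery.k8s.io/v1 types; the real
// EndpointSlice carries more fields (ports, conditions, topology, etc.).
type Endpoint struct {
	Addresses []string
	Ready     bool
}

type EndpointSlice struct {
	AddressType string
	Endpoints   []Endpoint
}

// readyAddresses merges the ready addresses from every slice belonging
// to one Service, deduplicating in case an address appears in two
// slices (for example, briefly during a rollout).
func readyAddresses(slices []EndpointSlice) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range slices {
		for _, ep := range s.Endpoints {
			if !ep.Ready {
				continue // skip endpoints that are not serving
			}
			for _, addr := range ep.Addresses {
				if !seen[addr] {
					seen[addr] = true
					out = append(out, addr)
				}
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	slices := []EndpointSlice{
		{AddressType: "IPv4", Endpoints: []Endpoint{
			{Addresses: []string{"10.180.3.17"}, Ready: true},
			{Addresses: []string{"10.180.6.6"}, Ready: false},
		}},
		{AddressType: "IPv6", Endpoints: []Endpoint{
			{Addresses: []string{"2001:db8:0123::5"}, Ready: true},
		}},
	}
	fmt.Println(readyAddresses(slices)) // prints: [10.180.3.17 2001:db8:0123::5]
}
```

&lt;p&gt;With the real API types you would additionally check &lt;code&gt;endpoint.Conditions.Ready&lt;/code&gt; (a &lt;code&gt;*bool&lt;/code&gt;, where consumers are advised to treat &lt;code&gt;nil&lt;/code&gt; as ready) and filter by &lt;code&gt;addressType&lt;/code&gt;, but the overall merge-and-deduplicate shape is the same.&lt;/p&gt;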
&lt;p&gt;Because there is not a predictable 1-to-1 mapping between Services and
EndpointSlices, there is no way to know what the actual name of the
EndpointSlice resource(s) for a Service will be ahead of time; thus,
instead of fetching the EndpointSlice(s) by name, you instead ask for
all EndpointSlices with a &amp;quot;&lt;code&gt;kubernetes.io/service-name&lt;/code&gt;&amp;quot;
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/overview/working-with-objects/labels/&#34;&gt;label&lt;/a&gt; pointing
to the Service:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;$&lt;/span&gt; kubectl get endpointslice -l kubernetes.io/service-name&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;myservice
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A similar change is needed in Go code. With Endpoints, you would do
something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Get the Endpoints named `name` in `namespace`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;endpoint, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; client.&lt;span style=&#34;color:#00a000&#34;&gt;CoreV1&lt;/span&gt;().&lt;span style=&#34;color:#00a000&#34;&gt;Endpoints&lt;/span&gt;(namespace).&lt;span style=&#34;color:#00a000&#34;&gt;Get&lt;/span&gt;(ctx, name, metav1.GetOptions{})
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; apierrors.&lt;span style=&#34;color:#00a000&#34;&gt;IsNotFound&lt;/span&gt;(err) {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;		&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// No Endpoints exists for the Service (yet?)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;		&lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// handle other errors
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	&lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// process `endpoint`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With EndpointSlices, this becomes:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-go&#34; data-lang=&#34;go&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// Get all EndpointSlices for Service `name` in `namespace`.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;slices, err &lt;span style=&#34;color:#666&#34;&gt;:=&lt;/span&gt; client.&lt;span style=&#34;color:#00a000&#34;&gt;DiscoveryV1&lt;/span&gt;().&lt;span style=&#34;color:#00a000&#34;&gt;EndpointSlices&lt;/span&gt;(namespace).&lt;span style=&#34;color:#00a000&#34;&gt;List&lt;/span&gt;(ctx,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	metav1.ListOptions{LabelSelector: discoveryv1.LabelServiceName &lt;span style=&#34;color:#666&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;=&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;+&lt;/span&gt; name})
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; err &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;nil&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// handle errors
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	&lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;} &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;else&lt;/span&gt; &lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#a2f&#34;&gt;len&lt;/span&gt;(slices.Items) &lt;span style=&#34;color:#666&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;	&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// No EndpointSlices exist for the Service (yet?)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;	&lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;// process `slices.Items`
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;generating-endpointslices-rather-than-endpoints&#34;&gt;Generating EndpointSlices rather than Endpoints&lt;/h3&gt;
&lt;p&gt;For people (or controllers) generating Endpoints, migrating to
EndpointSlices is slightly easier, because in most cases you won&#39;t
have to worry about multiple slices. You just need to update your YAML
or Go code to use the new type (which organizes the information in a
slightly different way than Endpoints did).&lt;/p&gt;
&lt;p&gt;For example, this Endpoints object:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Endpoints&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myservice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;subsets&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;addresses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ip&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10.180.3.17&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-4&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ip&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10.180.5.22&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-9&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ip&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10.180.18.2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-7&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;notReadyAddresses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ip&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;10.180.6.6&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-8&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;https&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;TCP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;443&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;would become something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;discovery.k8s.io/v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;EndpointSlice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myservice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;labels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kubernetes.io/service-name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myservice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;addressType&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;IPv4&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;endpoints&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;addresses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;10.180.3.17&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-4&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;addresses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;10.180.5.22&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-9&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;addresses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;10.180.18.2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-7&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;addresses&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- &lt;span style=&#34;color:#666&#34;&gt;10.180.6.6&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-8&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;conditions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ready&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;https&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;protocol&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;TCP&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;port&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;443&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Some points to note:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;This example uses an explicit &lt;code&gt;name&lt;/code&gt;, but you could also use
&lt;code&gt;generateName&lt;/code&gt; and let the API server append a unique suffix. The name
itself does not matter: what matters is the
&lt;code&gt;&amp;quot;kubernetes.io/service-name&amp;quot;&lt;/code&gt; label pointing back to the Service.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You have to explicitly indicate &lt;code&gt;addressType: IPv4&lt;/code&gt; (or &lt;code&gt;IPv6&lt;/code&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An EndpointSlice is similar to a single element of the &lt;code&gt;&amp;quot;subsets&amp;quot;&lt;/code&gt;
array in Endpoints. An Endpoints object with multiple subsets will
normally need to be expressed as multiple EndpointSlices, each with
different &lt;code&gt;&amp;quot;ports&amp;quot;&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;endpoints&lt;/code&gt; and &lt;code&gt;addresses&lt;/code&gt; fields are both arrays, but by
convention, each &lt;code&gt;addresses&lt;/code&gt; array only contains a single element. If
your Service has multiple endpoints, then you need to have multiple
elements in the &lt;code&gt;endpoints&lt;/code&gt; array, each with a single element in its
&lt;code&gt;addresses&lt;/code&gt; array.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Endpoints API lists &amp;quot;ready&amp;quot; and &amp;quot;not-ready&amp;quot; endpoints
separately, while the EndpointSlice API allows each endpoint to have
conditions (such as &amp;quot;&lt;code&gt;ready: false&lt;/code&gt;&amp;quot;) associated with it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
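&lt;p&gt;As a sketch of point 3, an Endpoints object whose two subsets exposed different ports could be expressed as two EndpointSlices along these lines (the Service name and addresses are only illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  generateName: my-service-
  labels:
    kubernetes.io/service-name: my-service
addressType: IPv4
endpoints:
- addresses:
  - 10.180.3.17
ports:
- name: http
  protocol: TCP
  port: 80
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  generateName: my-service-
  labels:
    kubernetes.io/service-name: my-service
addressType: IPv4
endpoints:
- addresses:
  - 10.180.9.41
ports:
- name: https
  protocol: TCP
  port: 443
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both slices carry the same &lt;code&gt;kubernetes.io/service-name&lt;/code&gt; label, so they are treated as parts of the same Service.&lt;/p&gt;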
&lt;p&gt;And of course, once you have ported to EndpointSlice, you can make use
of EndpointSlice-specific features, such as topology hints and
terminating endpoints. Consult the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/kubernetes-api/service-resources/endpoint-slice-v1/&#34;&gt;EndpointSlice API documentation&lt;/a&gt;
for more information.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33: Octarine</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/23/kubernetes-v1-33-release/</link>
      <pubDate>Wed, 23 Apr 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/23/kubernetes-v1-33-release/</guid>
      <description>
        
        
        &lt;p&gt;&lt;strong&gt;Editors:&lt;/strong&gt; Agustina Barbetta, Aakanksha Bhende, Udi Hofesh, Ryota Sawada, Sneha Yadav&lt;/p&gt;
&lt;p&gt;Similar to previous releases, the release of Kubernetes v1.33 introduces new stable, beta, and alpha
features. The consistent delivery of high-quality releases underscores the strength of our
development cycle and the vibrant support from our community.&lt;/p&gt;
&lt;p&gt;This release consists of 64 enhancements. Of those enhancements, 18 have graduated to Stable, 20 are
entering Beta, 24 have entered Alpha, and 2 are deprecated or withdrawn.&lt;/p&gt;
&lt;p&gt;There are also several notable &lt;a href=&#34;#deprecations-and-removals&#34;&gt;deprecations and removals&lt;/a&gt; in this
release; make sure to read about those if you already run an older version of Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;release-theme-and-logo&#34;&gt;Release theme and logo&lt;/h2&gt;


&lt;figure class=&#34;release-logo &#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/23/kubernetes-v1-33-release/k8s-1.33.svg&#34;
         alt=&#34;Kubernetes v1.33 logo: Octarine&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;The theme for Kubernetes v1.33 is &lt;strong&gt;Octarine: The Color of Magic&lt;/strong&gt;&lt;sup&gt;1&lt;/sup&gt;, inspired by Terry
Pratchett’s &lt;em&gt;Discworld&lt;/em&gt; series. This release highlights the open source magic&lt;sup&gt;2&lt;/sup&gt; that
Kubernetes enables across the ecosystem.&lt;/p&gt;
&lt;p&gt;If you’re familiar with the world of Discworld, you might recognize a small swamp dragon perched
atop the tower of the Unseen University, gazing up at the Kubernetes moon above the city of
Ankh-Morpork with 64 stars&lt;sup&gt;3&lt;/sup&gt; in the background.&lt;/p&gt;
&lt;p&gt;As Kubernetes moves into its second decade, we celebrate the wizardry of its maintainers, the
curiosity of new contributors, and the collaborative spirit that fuels the project. The v1.33
release is a reminder that, as Pratchett wrote, &lt;em&gt;“It’s still magic even if you know how it’s done.”&lt;/em&gt;
Even if you know the ins and outs of the Kubernetes code base, stepping back at the end of the
release cycle, you’ll realize that Kubernetes remains magical.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.33 is a testament to the enduring power of open source innovation, where hundreds of
contributors&lt;sup&gt;4&lt;/sup&gt; from around the world work together to create something truly
extraordinary. Behind every new feature, the Kubernetes community works to maintain and improve the
project, ensuring it remains secure, reliable, and released on time. Each release builds upon the
other, creating something greater than we could achieve alone.&lt;/p&gt;
&lt;p&gt;&lt;sub&gt;1. Octarine is the mythical eighth color, visible only to those attuned to the arcane—wizards,
witches, and, of course, cats. And occasionally, someone who’s stared at iptables rules for too
long.&lt;/sub&gt;&lt;br&gt;
&lt;sub&gt;2. Any sufficiently advanced technology is indistinguishable from magic…?&lt;/sub&gt;&lt;br&gt;
&lt;sub&gt;3. It’s not a coincidence 64 KEPs (Kubernetes Enhancement Proposals) are also included in
v1.33.&lt;/sub&gt;&lt;br&gt;
&lt;sub&gt;4. See the Project Velocity section for v1.33 🚀&lt;/sub&gt;&lt;/p&gt;
&lt;h2 id=&#34;spotlight-on-key-updates&#34;&gt;Spotlight on key updates&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.33 is packed with new features and improvements. Here are a few select updates the
Release Team would like to highlight!&lt;/p&gt;
&lt;h3 id=&#34;stable-sidecar-containers&#34;&gt;Stable: Sidecar containers&lt;/h3&gt;
&lt;p&gt;The sidecar pattern involves deploying separate auxiliary container(s) to handle extra capabilities
in areas such as networking, logging, and metrics gathering. Sidecar containers graduate to stable
in v1.33.&lt;/p&gt;
&lt;p&gt;Kubernetes implements sidecars as a special class of init containers with &lt;code&gt;restartPolicy: Always&lt;/code&gt;,
ensuring that sidecars start before application containers, remain running throughout the pod&#39;s
lifecycle, and terminate automatically after the main containers exit.&lt;/p&gt;
&lt;p&gt;Additionally, sidecars can utilize probes (startup, readiness, liveness) to signal their operational
state, and their Out-Of-Memory (OOM) score adjustments are aligned with primary containers to
prevent premature termination under memory pressure.&lt;/p&gt;
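&lt;p&gt;As a minimal sketch (the container names and images here are placeholders), a sidecar is declared as an init container with &lt;code&gt;restartPolicy: Always&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar
spec:
  initContainers:
  - name: log-shipper               # runs for the whole Pod lifecycle
    image: example.com/log-shipper:1.0
    restartPolicy: Always           # this marks the init container as a sidecar
  containers:
  - name: app
    image: example.com/app:1.0
&lt;/code&gt;&lt;/pre&gt;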
&lt;p&gt;To learn more, read &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/sidecar-containers/&#34;&gt;Sidecar Containers&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/753&#34;&gt;KEP-753: Sidecar Containers&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;beta-in-place-resource-resize-for-vertical-scaling-of-pods&#34;&gt;Beta: In-place resource resize for vertical scaling of Pods&lt;/h3&gt;
&lt;p&gt;Workloads can be defined using APIs like Deployment, StatefulSet, etc. These describe the template
for the Pods that should run, including memory and CPU resources, as well as the desired number of
replica Pods. Workloads can be scaled horizontally by updating the Pod replica
count, or vertically by updating the resources required in the Pods container(s). Before this
enhancement, container resources defined in a Pod&#39;s &lt;code&gt;spec&lt;/code&gt; were immutable, and updating any of these
details within a Pod template would trigger Pod replacement.&lt;/p&gt;
&lt;p&gt;But what if you could dynamically update the resource configuration for your existing Pods without
restarting them?&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;KEP-1287&lt;/a&gt; is precisely about allowing such in-place Pod updates. It was
released as alpha in v1.27, and has graduated to beta in v1.33. This opens up various possibilities
for vertical scale-up of stateful processes without any downtime, seamless scale-down when the
traffic is low, and even allocating larger resources during startup, which can then be reduced once
the initial setup is complete.&lt;/p&gt;
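&lt;p&gt;As a sketch (assuming a Pod named &lt;code&gt;my-app&lt;/code&gt; with a container named &lt;code&gt;app&lt;/code&gt;, on a cluster with this beta feature enabled), a resize request targets the Pod&#39;s &lt;code&gt;resize&lt;/code&gt; subresource:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;kubectl patch pod my-app --subresource resize --patch \
  &#39;{&#34;spec&#34;:{&#34;containers&#34;:[{&#34;name&#34;:&#34;app&#34;,&#34;resources&#34;:{&#34;requests&#34;:{&#34;cpu&#34;:&#34;800m&#34;}}}]}}&#39;
&lt;/code&gt;&lt;/pre&gt;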
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;KEP-1287: In-Place Update of Pod Resources&lt;/a&gt;
led by SIG Node and SIG Autoscaling.&lt;/p&gt;
&lt;h3 id=&#34;alpha-new-configuration-option-for-kubectl-with-kuberc-for-user-preferences&#34;&gt;Alpha: New configuration option for kubectl with &lt;code&gt;.kuberc&lt;/code&gt; for user preferences&lt;/h3&gt;
&lt;p&gt;In v1.33, &lt;code&gt;kubectl&lt;/code&gt; introduces a new alpha feature with opt-in configuration file &lt;code&gt;.kuberc&lt;/code&gt; for user
preferences. This file can contain &lt;code&gt;kubectl&lt;/code&gt; aliases and overrides (e.g. defaulting to use
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/server-side-apply/&#34;&gt;server-side apply&lt;/a&gt;), while leaving cluster
credentials and host information in kubeconfig. This separation allows sharing the same user
preferences for &lt;code&gt;kubectl&lt;/code&gt; interaction, regardless of target cluster and kubeconfig used.&lt;/p&gt;
&lt;p&gt;To enable this alpha feature, users can set the environment variable of &lt;code&gt;KUBECTL_KUBERC=true&lt;/code&gt; and
create a &lt;code&gt;.kuberc&lt;/code&gt; configuration file. By default, &lt;code&gt;kubectl&lt;/code&gt; looks for this file in
&lt;code&gt;~/.kube/kuberc&lt;/code&gt;. You can also specify an alternative location using the &lt;code&gt;--kuberc&lt;/code&gt; flag, for
example: &lt;code&gt;kubectl --kuberc /var/kube/rc&lt;/code&gt;.&lt;/p&gt;
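&lt;p&gt;Since this is an alpha feature, the exact schema may still change; a &lt;code&gt;.kuberc&lt;/code&gt; file defining one alias and one flag override might look like this sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubectl.config.k8s.io/v1alpha1
kind: Preference
aliases:
- name: getn                 # &#34;kubectl getn&#34; expands to &#34;kubectl get namespaces&#34;
  command: get
  appendArgs:
  - namespaces
overrides:
- command: apply
  flags:
  - name: server-side       # default &#34;kubectl apply&#34; to server-side apply
    default: &#34;true&#34;
&lt;/code&gt;&lt;/pre&gt;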
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3104&#34;&gt;KEP-3104: Separate kubectl user preferences from cluster configs&lt;/a&gt; led by
SIG CLI.&lt;/p&gt;
&lt;h2 id=&#34;features-graduating-to-stable&#34;&gt;Features graduating to Stable&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This is a selection of some of the improvements that are now stable following the v1.33 release.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;backoff-limits-per-index-for-indexed-jobs&#34;&gt;Backoff limits per index for indexed Jobs&lt;/h3&gt;
&lt;p&gt;​This release graduates a feature that allows setting backoff limits on a per-index basis for Indexed
Jobs. Traditionally, the &lt;code&gt;backoffLimit&lt;/code&gt; parameter in Kubernetes Jobs specifies the number of retries
before considering the entire Job as failed. This enhancement allows each index within an Indexed
Job to have its own backoff limit, providing more granular control over retry behavior for
individual tasks. This ensures that the failure of specific indices does not prematurely terminate
the entire Job, allowing the other indices to continue processing independently.&lt;/p&gt;
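&lt;p&gt;A sketch of an Indexed Job using the newly stable fields (the image and command are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-retries
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1   # each index may fail at most once
  maxFailedIndexes: 5       # fail the whole Job once more than 5 indexes fail
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: [&#34;sh&#34;, &#34;-c&#34;, &#34;echo processing index $JOB_COMPLETION_INDEX&#34;]
&lt;/code&gt;&lt;/pre&gt;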
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3850&#34;&gt;KEP-3850: Backoff Limit Per Index For Indexed Jobs&lt;/a&gt; led by SIG Apps.&lt;/p&gt;
&lt;h3 id=&#34;job-success-policy&#34;&gt;Job success policy&lt;/h3&gt;
&lt;p&gt;Using &lt;code&gt;.spec.successPolicy&lt;/code&gt;, users can specify which pod indexes must succeed (&lt;code&gt;succeededIndexes&lt;/code&gt;),
how many pods must succeed (&lt;code&gt;succeededCount&lt;/code&gt;), or a combination of both. This feature benefits
various workloads, including simulations where partial completion is sufficient, and leader-worker
patterns where only the leader&#39;s success determines the Job&#39;s overall outcome.&lt;/p&gt;
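&lt;p&gt;For the leader-worker pattern, a success policy might look like this sketch (image and command are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: leader-worker
spec:
  completions: 8
  parallelism: 8
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededIndexes: &#34;0&#34;   # the Job succeeds once the leader (index 0) succeeds
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: [&#34;sh&#34;, &#34;-c&#34;, &#34;echo index $JOB_COMPLETION_INDEX&#34;]
&lt;/code&gt;&lt;/pre&gt;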
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/3998&#34;&gt;KEP-3998: Job success/completion policy&lt;/a&gt; led
by SIG Apps.&lt;/p&gt;
&lt;h3 id=&#34;bound-serviceaccount-token-security-improvements&#34;&gt;Bound ServiceAccount token security improvements&lt;/h3&gt;
&lt;p&gt;This enhancement introduced features such as including a unique token identifier (i.e.
&lt;a href=&#34;https://datatracker.ietf.org/doc/html/rfc7519#section-4.1.7&#34;&gt;JWT ID Claim, also known as JTI&lt;/a&gt;) and
node information within the tokens, enabling more precise validation and auditing. Additionally, it
supports node-specific restrictions, ensuring that tokens are only usable on designated nodes,
thereby reducing the risk of token misuse and potential security breaches. These improvements, now
generally available, aim to enhance the overall security posture of service account tokens within
Kubernetes clusters.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4193&#34;&gt;KEP-4193: Bound service account token improvements&lt;/a&gt; led by SIG Auth.&lt;/p&gt;
&lt;h3 id=&#34;subresource-support-in-kubectl&#34;&gt;Subresource support in kubectl&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;--subresource&lt;/code&gt; argument is now generally available for kubectl subcommands such as &lt;code&gt;get&lt;/code&gt;,
&lt;code&gt;patch&lt;/code&gt;, &lt;code&gt;edit&lt;/code&gt;, &lt;code&gt;apply&lt;/code&gt; and &lt;code&gt;replace&lt;/code&gt;, allowing users to fetch and update subresources for all
resources that support them. To learn more about the subresources supported, visit the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/kubectl/conventions/#subresources&#34;&gt;kubectl reference&lt;/a&gt;.&lt;/p&gt;
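&lt;p&gt;For example (assuming a Deployment named &lt;code&gt;nginx&lt;/code&gt; exists in the current namespace):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Read the status subresource instead of the main resource
kubectl get deployment nginx --subresource=status

# Update only the scale subresource
kubectl patch deployment nginx --subresource=scale \
  --patch &#39;{&#34;spec&#34;:{&#34;replicas&#34;:3}}&#39;
&lt;/code&gt;&lt;/pre&gt;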
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/2590&#34;&gt;KEP-2590: Add subresource support to kubectl&lt;/a&gt; led by SIG CLI.&lt;/p&gt;
&lt;h3 id=&#34;multiple-service-cidrs&#34;&gt;Multiple Service CIDRs&lt;/h3&gt;
&lt;p&gt;This enhancement introduced a new implementation of allocation logic for Service IPs. Across the
whole cluster, every Service of &lt;code&gt;type: ClusterIP&lt;/code&gt; must have a unique IP address assigned to it.
Trying to create a Service with a specific cluster IP that has already been allocated will return an
error. The updated IP address allocator logic uses two newly stable API objects: &lt;code&gt;ServiceCIDR&lt;/code&gt; and
&lt;code&gt;IPAddress&lt;/code&gt;. Now generally available, these APIs allow cluster administrators to dynamically
increase the number of IP addresses available for &lt;code&gt;type: ClusterIP&lt;/code&gt; Services (by creating new
ServiceCIDR objects).&lt;/p&gt;
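&lt;p&gt;A sketch of adding an extra range (the object name and CIDR here are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: ServiceCIDR
metadata:
  name: extra-service-cidr
spec:
  cidrs:
  - 10.100.0.0/16   # additional range for ClusterIP allocation
&lt;/code&gt;&lt;/pre&gt;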
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/1880&#34;&gt;KEP-1880: Multiple Service CIDRs&lt;/a&gt; led by SIG
Network.&lt;/p&gt;
&lt;h3 id=&#34;nftables-backend-for-kube-proxy&#34;&gt;&lt;code&gt;nftables&lt;/code&gt; backend for kube-proxy&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;nftables&lt;/code&gt; backend for kube-proxy is now stable, adding a new implementation that significantly
improves performance and scalability for Services implementation within Kubernetes clusters. For
compatibility reasons, &lt;code&gt;iptables&lt;/code&gt; remains the default on Linux nodes. Check the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables&#34;&gt;migration guide&lt;/a&gt;
if you want to try it out.&lt;/p&gt;
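&lt;p&gt;Opting in is a matter of setting the proxy mode in the kube-proxy configuration, as in this sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: &#34;nftables&#34;
&lt;/code&gt;&lt;/pre&gt;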
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/3866&#34;&gt;KEP-3866: nftables kube-proxy backend&lt;/a&gt; led
by SIG Network.&lt;/p&gt;
&lt;h3 id=&#34;topology-aware-routing-with-trafficdistribution-preferclose&#34;&gt;Topology aware routing with &lt;code&gt;trafficDistribution: PreferClose&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;This release graduates topology-aware routing and traffic distribution to GA, allowing you to
optimize service traffic in multi-zone clusters. Topology-aware hints in EndpointSlices
enable components like kube-proxy to prioritize routing traffic to endpoints within the same zone,
thereby reducing latency and cross-zone data transfer costs. Building upon this, the
&lt;code&gt;trafficDistribution&lt;/code&gt; field has been added to the Service specification, with the &lt;code&gt;PreferClose&lt;/code&gt; option
directing traffic to the nearest available endpoints based on network topology. This configuration
enhances performance and cost-efficiency by minimizing inter-zone communication.&lt;/p&gt;
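&lt;p&gt;A minimal sketch of a Service opting in (the name, selector, and port are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80
    protocol: TCP
  trafficDistribution: PreferClose   # prefer endpoints topologically close to the client
&lt;/code&gt;&lt;/pre&gt;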
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4444&#34;&gt;KEP-4444: Traffic Distribution for Services&lt;/a&gt;
and &lt;a href=&#34;https://kep.k8s.io/2433&#34;&gt;KEP-2433: Topology Aware Routing&lt;/a&gt; led by SIG Network.&lt;/p&gt;
&lt;h3 id=&#34;options-to-reject-non-smt-aligned-workload&#34;&gt;Options to reject non SMT-aligned workload&lt;/h3&gt;
&lt;p&gt;This feature added policy options to the CPU Manager, enabling it to reject workloads that do not
align with Simultaneous Multithreading (SMT) configurations. This enhancement, now generally
available, ensures that when a pod requests exclusive use of CPU cores, the CPU Manager can enforce
allocation of entire core pairs (comprising primary and sibling threads) on SMT-enabled systems,
thereby preventing scenarios where workloads share CPU resources in unintended ways.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/2625&#34;&gt;KEP-2625: node: cpumanager: add options to reject non SMT-aligned workload&lt;/a&gt;
led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;defining-pod-affinity-or-anti-affinity-using-matchlabelkeys-and-mismatchlabelkeys&#34;&gt;Defining Pod affinity or anti-affinity using &lt;code&gt;matchLabelKeys&lt;/code&gt; and &lt;code&gt;mismatchLabelKeys&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;matchLabelKeys&lt;/code&gt; and &lt;code&gt;mismatchLabelKeys&lt;/code&gt; fields are available in Pod affinity terms, enabling
users to finely control the scope where Pods are expected to co-exist (Affinity) or not
(AntiAffinity). These newly stable options complement the existing &lt;code&gt;labelSelector&lt;/code&gt; mechanism. The
affinity fields facilitate enhanced scheduling for versatile rolling updates, as well as isolation
of services managed by tools or controllers based on global configurations.&lt;/p&gt;
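&lt;p&gt;For instance, to co-locate a Pod only with replicas from the same rollout (the labels here are illustrative), part of a Pod spec could look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        matchLabelKeys:
        - pod-template-hash   # restrict co-location to Pods from the same ReplicaSet
        topologyKey: topology.kubernetes.io/zone
&lt;/code&gt;&lt;/pre&gt;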
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3633&#34;&gt;KEP-3633: Introduce MatchLabelKeys to Pod Affinity and Pod Anti Affinity&lt;/a&gt;
led by SIG Scheduling.&lt;/p&gt;
&lt;h3 id=&#34;considering-taints-and-tolerations-when-calculating-pod-topology-spread-skew&#34;&gt;Considering taints and tolerations when calculating Pod topology spread skew&lt;/h3&gt;
&lt;p&gt;This enhanced &lt;code&gt;PodTopologySpread&lt;/code&gt; by introducing two fields: &lt;code&gt;nodeAffinityPolicy&lt;/code&gt; and
&lt;code&gt;nodeTaintsPolicy&lt;/code&gt;. These fields allow users to specify whether node affinity rules and node taints
should be considered when calculating pod distribution across nodes. By default,
&lt;code&gt;nodeAffinityPolicy&lt;/code&gt; is set to &lt;code&gt;Honor&lt;/code&gt;, meaning only nodes matching the pod&#39;s node affinity or
selector are included in the distribution calculation. The &lt;code&gt;nodeTaintsPolicy&lt;/code&gt; defaults to &lt;code&gt;Ignore&lt;/code&gt;,
indicating that node taints are not considered unless specified. This enhancement provides finer
control over pod placement, ensuring that pods are scheduled on nodes that meet both affinity and
taint toleration requirements, thereby preventing scenarios where pods remain pending due to
unsatisfied constraints.&lt;/p&gt;
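&lt;p&gt;A sketch of a spread constraint using both fields (the &lt;code&gt;app: web&lt;/code&gt; label is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
    nodeAffinityPolicy: Honor   # the default: only nodes matching the Pod&#39;s affinity count
    nodeTaintsPolicy: Honor     # non-default: skip nodes whose taints the Pod does not tolerate
&lt;/code&gt;&lt;/pre&gt;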
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3094&#34;&gt;KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew&lt;/a&gt;
led by SIG Scheduling.&lt;/p&gt;
&lt;h3 id=&#34;volume-populators&#34;&gt;Volume populators&lt;/h3&gt;
&lt;p&gt;After being released as beta in v1.24, &lt;em&gt;volume populators&lt;/em&gt; have graduated to GA in v1.33. This newly
stable feature provides a way to allow users to pre-populate volumes with data from various sources,
and not just from PersistentVolumeClaim (PVC) clones or volume snapshots. The mechanism relies on
the &lt;code&gt;dataSourceRef&lt;/code&gt; field within a PersistentVolumeClaim. This field offers more flexibility than
the existing &lt;code&gt;dataSource&lt;/code&gt; field, and allows for custom resources to be used as data sources.&lt;/p&gt;
&lt;p&gt;A special controller, &lt;code&gt;volume-data-source-validator&lt;/code&gt;, validates these data source references,
alongside a newly stable CustomResourceDefinition (CRD) for an API kind named VolumePopulator. The
VolumePopulator API allows volume populator controllers to register the types of data sources they
support. You need to set up your cluster with the appropriate CRD in order to use volume populators.&lt;/p&gt;
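&lt;p&gt;A sketch of a PVC referencing a custom data source (the &lt;code&gt;apiGroup&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, and names are hypothetical; a matching populator controller and CRD must be installed):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: populated-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSourceRef:
    apiGroup: example.com   # hypothetical populator API group
    kind: DataImport        # hypothetical custom resource served by a populator
    name: my-import
&lt;/code&gt;&lt;/pre&gt;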
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/1495&#34;&gt;KEP-1495: Generic data populators&lt;/a&gt; led by
SIG Storage.&lt;/p&gt;
&lt;h3 id=&#34;always-honor-persistentvolume-reclaim-policy&#34;&gt;Always honor PersistentVolume reclaim policy&lt;/h3&gt;
&lt;p&gt;This enhancement addressed an issue where the Persistent Volume (PV) reclaim policy is not
consistently honored, leading to potential storage resource leaks. Specifically, if a PV is deleted
before its associated Persistent Volume Claim (PVC), the &amp;quot;Delete&amp;quot; reclaim policy may not be
executed, leaving the underlying storage assets intact. To mitigate this, Kubernetes now sets
finalizers on relevant PVs, ensuring that the reclaim policy is enforced regardless of the deletion
sequence. This enhancement prevents unintended retention of storage resources and maintains
consistency in PV lifecycle management.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/2644&#34;&gt;KEP-2644: Always Honor PersistentVolume Reclaim Policy&lt;/a&gt; led by SIG
Storage.&lt;/p&gt;
&lt;h2 id=&#34;new-features-in-beta&#34;&gt;New features in Beta&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This is a selection of some of the improvements that are now beta following the v1.33 release.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;support-for-direct-service-return-dsr-in-windows-kube-proxy&#34;&gt;Support for Direct Service Return (DSR) in Windows kube-proxy&lt;/h3&gt;
&lt;p&gt;DSR provides performance optimizations by allowing the return traffic routed through load balancers
to bypass the load balancer and respond directly to the client, reducing load on the load balancer
and also reducing overall latency. For information on DSR on Windows, read
&lt;a href=&#34;https://techcommunity.microsoft.com/blog/networkingblog/direct-server-return-dsr-in-a-nutshell/693710&#34;&gt;Direct Server Return (DSR) in a nutshell&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Initially introduced in v1.14, support for DSR has been promoted to beta by SIG Windows as part of
&lt;a href=&#34;https://kep.k8s.io/5100&#34;&gt;KEP-5100: Support for Direct Service Return (DSR) and overlay networking in Windows kube-proxy&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;structured-parameter-support&#34;&gt;Structured parameter support&lt;/h3&gt;
&lt;p&gt;While structured parameter support continues as a beta feature in Kubernetes v1.33, this core part
of Dynamic Resource Allocation (DRA) has seen significant improvements. A new v1beta2 version
simplifies the &lt;code&gt;resource.k8s.io&lt;/code&gt; API, and regular users with the namespaced cluster &lt;code&gt;edit&lt;/code&gt; role can
now use DRA.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;kubelet&lt;/code&gt; now includes seamless upgrade support, enabling drivers deployed as DaemonSets to use
a rolling update mechanism. For DRA implementations, this prevents the deletion and re-creation of
ResourceSlices, allowing them to remain unchanged during upgrades. Additionally, a 30-second grace
period has been introduced before the &lt;code&gt;kubelet&lt;/code&gt; cleans up after unregistering a driver, providing
better support for drivers that do not use rolling updates.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4381&#34;&gt;KEP-4381: DRA: structured parameters&lt;/a&gt; by WG
Device Management, a cross-functional team including SIG Node, SIG Scheduling, and SIG Autoscaling.&lt;/p&gt;
&lt;h3 id=&#34;dynamic-resource-allocation-dra-for-network-interfaces&#34;&gt;Dynamic Resource Allocation (DRA) for network interfaces&lt;/h3&gt;
&lt;p&gt;The standardized reporting of network interface data via DRA, introduced in v1.32, has graduated to
beta in v1.33. This enables more native Kubernetes network integrations, simplifying the development
and management of networking devices. This was covered previously in the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/11/kubernetes-v1-32-release/#dra-standardized-network-interface-data-for-resource-claim-status&#34;&gt;v1.32 release announcement blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4817&#34;&gt;KEP-4817: DRA: Resource Claim Status with possible standardized network interface data&lt;/a&gt;
led by SIG Network, SIG Node, and WG Device Management.&lt;/p&gt;
&lt;h3 id=&#34;handle-unscheduled-pods-early-when-scheduler-does-not-have-any-pod-on-activeq&#34;&gt;Handle unscheduled pods early when scheduler does not have any pod on activeQ&lt;/h3&gt;
&lt;p&gt;This feature improves queue scheduling behavior. Behind the scenes, the scheduler achieves this by
popping pods that are not backed off due to errors from the &lt;em&gt;backoffQ&lt;/em&gt; when the &lt;em&gt;activeQ&lt;/em&gt; is
empty. Previously, the scheduler would become idle when the &lt;em&gt;activeQ&lt;/em&gt; was empty, even if the
&lt;em&gt;backoffQ&lt;/em&gt; held pods that were ready to run; this enhancement improves scheduling efficiency by
preventing that idle time.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/5142&#34;&gt;KEP-5142: Pop pod from backoffQ when activeQ is empty&lt;/a&gt; led by SIG
Scheduling.&lt;/p&gt;
&lt;h3 id=&#34;asynchronous-preemption-in-the-kubernetes-scheduler&#34;&gt;Asynchronous preemption in the Kubernetes Scheduler&lt;/h3&gt;
&lt;p&gt;Preemption ensures higher-priority pods get the resources they need by evicting lower-priority ones.
Asynchronous Preemption, introduced in v1.32 as alpha, has graduated to beta in v1.33. With this
enhancement, heavy operations such as API calls to delete pods are processed in parallel, allowing
the scheduler to continue scheduling other pods without delays. This improvement is particularly
beneficial in clusters with high Pod churn or frequent scheduling failures, ensuring a more
efficient and resilient scheduling process.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4832&#34;&gt;KEP-4832: Asynchronous preemption in the scheduler&lt;/a&gt; led by SIG Scheduling.&lt;/p&gt;
&lt;h3 id=&#34;clustertrustbundles&#34;&gt;ClusterTrustBundles&lt;/h3&gt;
&lt;p&gt;ClusterTrustBundle, a cluster-scoped resource designed for holding X.509 trust anchors (root
certificates), has graduated to beta in v1.33. This API makes it easier for in-cluster certificate
signers to publish and communicate X.509 trust anchors to cluster workloads.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3257&#34;&gt;KEP-3257: ClusterTrustBundles (previously Trust Anchor Sets)&lt;/a&gt; led by SIG
Auth.&lt;/p&gt;
&lt;h3 id=&#34;fine-grained-supplementalgroups-control&#34;&gt;Fine-grained SupplementalGroups control&lt;/h3&gt;
&lt;p&gt;Introduced in v1.31, this feature graduates to beta in v1.33 and is now enabled by default. Provided
that your cluster has the &lt;code&gt;SupplementalGroupsPolicy&lt;/code&gt; feature gate enabled, the
&lt;code&gt;supplementalGroupsPolicy&lt;/code&gt; field within a Pod&#39;s &lt;code&gt;securityContext&lt;/code&gt; supports two policies: the default
&lt;code&gt;Merge&lt;/code&gt; policy maintains backward compatibility by combining specified groups with those from the
container image&#39;s &lt;code&gt;/etc/group&lt;/code&gt; file, while the new &lt;code&gt;Strict&lt;/code&gt; policy applies only to explicitly
defined groups.&lt;/p&gt;
&lt;p&gt;This enhancement helps to address security concerns where implicit group memberships from container
images could lead to unintended file access permissions and bypass policy controls.&lt;/p&gt;
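&lt;p&gt;A sketch of a Pod using the strict policy (the UIDs, GIDs, and image are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: strict-groups
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups:
    - 4000
    supplementalGroupsPolicy: Strict   # ignore groups from the image&#39;s /etc/group
  containers:
  - name: app
    image: example.com/app:1.0
&lt;/code&gt;&lt;/pre&gt;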
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3619&#34;&gt;KEP-3619: Fine-grained SupplementalGroups control&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;support-for-mounting-images-as-volumes&#34;&gt;Support for mounting images as volumes&lt;/h3&gt;
&lt;p&gt;Support for using Open Container Initiative (OCI) images as volumes in Pods, introduced in v1.31,
has graduated to beta. This feature allows users to specify an image reference as a volume in a Pod
while reusing it as a volume mount within containers. It opens up the possibility of packaging
volume data separately and sharing it among containers in a Pod without including it in the main
image, thereby reducing vulnerabilities and simplifying image creation.&lt;/p&gt;
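&lt;p&gt;For illustration (the artifact reference below is a placeholder), mounting an OCI image as a read-only volume might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: image-volume-demo
spec:
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    volumeMounts:
    - name: shared-data
      mountPath: /data
      readOnly: true
  volumes:
  - name: shared-data
    image:
      # hypothetical OCI artifact holding the volume data
      reference: example.com/my-org/my-data:v1
      pullPolicy: IfNotPresent
&lt;/code&gt;&lt;/pre&gt;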
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4639&#34;&gt;KEP-4639: VolumeSource: OCI Artifact and/or Image&lt;/a&gt; led by SIG Node and SIG
Storage.&lt;/p&gt;
&lt;h3 id=&#34;support-for-user-namespaces-within-linux-pods&#34;&gt;Support for user namespaces within Linux Pods&lt;/h3&gt;
&lt;p&gt;One of the oldest open KEPs as of writing is &lt;a href=&#34;https://kep.k8s.io/127&#34;&gt;KEP-127&lt;/a&gt;, Pod security
improvement by using Linux &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/user-namespaces/&#34;&gt;User namespaces&lt;/a&gt; for
Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release
in v1.25, initial beta in v1.30 (where it was disabled by default), and has moved to on-by-default
beta as part of v1.33.&lt;/p&gt;
&lt;p&gt;This support will not impact existing Pods unless you manually specify &lt;code&gt;pod.spec.hostUsers&lt;/code&gt; to opt
in. As highlighted in the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/03/12/kubernetes-1-30-upcoming-changes/&#34;&gt;v1.30 sneak peek blog&lt;/a&gt;, this is an important
milestone for mitigating vulnerabilities.&lt;/p&gt;
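&lt;p&gt;Opting a Pod in to user namespaces takes a single field (the Pod name and image here are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: userns-demo
spec:
  # false = run this Pod in its own user namespace
  hostUsers: false
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
&lt;/code&gt;&lt;/pre&gt;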
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/127&#34;&gt;KEP-127: Support User Namespaces in pods&lt;/a&gt; led
by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;pod-procmount-option&#34;&gt;Pod &lt;code&gt;procMount&lt;/code&gt; option&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;procMount&lt;/code&gt; option, introduced as alpha in v1.12, and off-by-default beta in v1.31, has moved to
an on-by-default beta in v1.33. This enhancement improves Pod isolation by allowing users to
fine-tune access to the &lt;code&gt;/proc&lt;/code&gt; filesystem. Specifically, it adds a &lt;code&gt;procMount&lt;/code&gt; field to the container&#39;s
&lt;code&gt;securityContext&lt;/code&gt; that lets you override the default behavior of masking and marking certain &lt;code&gt;/proc&lt;/code&gt;
paths as read-only. This is particularly useful for scenarios where users want to run unprivileged
containers inside the Kubernetes Pod using user namespaces. Normally, the container runtime (via the
CRI implementation) starts the outer container with strict &lt;code&gt;/proc&lt;/code&gt; mount settings. However, to
successfully run nested containers with an unprivileged Pod, users need a mechanism to relax those
defaults, and this feature provides exactly that.&lt;/p&gt;
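&lt;p&gt;A sketch of the nested-containers scenario (the image reference is a placeholder; the Unmasked setting is intended to be paired with user namespaces):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: unmasked-proc-demo
spec:
  hostUsers: false   # pair Unmasked with a user namespace
  containers:
  - name: nested-runtime
    image: example.com/nested-container-runtime:latest
    securityContext:
      procMount: Unmasked   # do not mask /proc paths for this container
&lt;/code&gt;&lt;/pre&gt;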
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4265&#34;&gt;KEP-4265: add ProcMount option&lt;/a&gt; led by SIG
Node.&lt;/p&gt;
&lt;h3 id=&#34;cpumanager-policy-to-distribute-cpus-across-numa-nodes&#34;&gt;CPUManager policy to distribute CPUs across NUMA nodes&lt;/h3&gt;
&lt;p&gt;This feature adds a new policy option for the CPU Manager to distribute CPUs across Non-Uniform
Memory Access (NUMA) nodes, rather than concentrating them on a single node. It optimizes CPU
resource allocation by balancing workloads across multiple NUMA nodes, thereby improving performance
and resource utilization in multi-NUMA systems.&lt;/p&gt;
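&lt;p&gt;The option is enabled through the kubelet configuration, roughly like this (it applies on top of the static CPU Manager policy):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  distribute-cpus-across-numa: &#34;true&#34;
&lt;/code&gt;&lt;/pre&gt;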
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/2902&#34;&gt;KEP-2902: Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them&lt;/a&gt;
led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;zero-second-sleeps-for-container-prestop-hooks&#34;&gt;Zero-second sleeps for container PreStop hooks&lt;/h3&gt;
&lt;p&gt;Kubernetes 1.29 introduced a Sleep action for the &lt;code&gt;preStop&lt;/code&gt; lifecycle hook in Pods, allowing
containers to pause for a specified duration before termination. This provides a straightforward
method to delay container shutdown, facilitating tasks such as connection draining or cleanup
operations.&lt;/p&gt;
&lt;p&gt;The Sleep action in a &lt;code&gt;preStop&lt;/code&gt; hook can now accept a zero-second duration as a beta feature. This
allows defining a no-op &lt;code&gt;preStop&lt;/code&gt; hook, which is useful when a &lt;code&gt;preStop&lt;/code&gt; hook is required but no
delay is desired.&lt;/p&gt;
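&lt;p&gt;A no-op &lt;code&gt;preStop&lt;/code&gt; hook is as simple as the following sketch (the zero value requires the PodLifecycleSleepActionAllowZero feature gate; the Pod name and image are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: noop-prestop
spec:
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    lifecycle:
      preStop:
        sleep:
          seconds: 0   # explicit no-op preStop hook
&lt;/code&gt;&lt;/pre&gt;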
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/3960&#34;&gt;KEP-3960: Introducing Sleep Action for PreStop Hook&lt;/a&gt; and
&lt;a href=&#34;https://kep.k8s.io/4818&#34;&gt;KEP-4818: Allow zero value for Sleep Action of PreStop Hook&lt;/a&gt; led by SIG
Node.&lt;/p&gt;
&lt;h3 id=&#34;internal-tooling-for-declarative-validation-of-kubernetes-native-types&#34;&gt;Internal tooling for declarative validation of Kubernetes-native types&lt;/h3&gt;
&lt;p&gt;Behind the scenes, the internals of Kubernetes are starting to use a new mechanism for validating
objects and changes to objects. Kubernetes v1.33 introduces &lt;code&gt;validation-gen&lt;/code&gt;, an internal tool that
Kubernetes contributors use to generate declarative validation rules. The overall goal is to improve
the robustness and maintainability of API validations by enabling developers to specify validation
constraints declaratively, reducing manual coding errors and ensuring consistency across the
codebase.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/5073&#34;&gt;KEP-5073: Declarative Validation Of Kubernetes Native Types With validation-gen&lt;/a&gt;
led by SIG API Machinery.&lt;/p&gt;
&lt;h2 id=&#34;new-features-in-alpha&#34;&gt;New features in Alpha&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;This is a selection of some of the improvements that are now alpha following the v1.33 release.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;configurable-tolerance-for-horizontalpodautoscalers&#34;&gt;Configurable tolerance for HorizontalPodAutoscalers&lt;/h3&gt;
&lt;p&gt;This feature introduces configurable tolerance for HorizontalPodAutoscalers, which dampens scaling
reactions to small metric variations.&lt;/p&gt;
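&lt;p&gt;As a sketch of the shape described in KEP-4951 (the workload names are placeholders, and the field requires the alpha HPAConfigurableTolerance feature gate), a per-direction tolerance is set under &lt;code&gt;spec.behavior&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  behavior:
    scaleDown:
      tolerance: 0.05   # ignore metric deviations below 5% when scaling down
&lt;/code&gt;&lt;/pre&gt;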
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4951&#34;&gt;KEP-4951: Configurable tolerance for Horizontal Pod Autoscalers&lt;/a&gt; led by
SIG Autoscaling.&lt;/p&gt;
&lt;h3 id=&#34;configurable-container-restart-delay&#34;&gt;Configurable container restart delay&lt;/h3&gt;
&lt;p&gt;Introduced as alpha in v1.32, this feature provides a set of kubelet-level configuration options to
fine-tune how CrashLoopBackOff is handled.&lt;/p&gt;
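&lt;p&gt;As a sketch (the field name follows KEP-4603, requires the corresponding alpha feature gate, and the duration value is illustrative), capping the maximum restart backoff might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashLoopBackOff:
  maxContainerRestartPeriod: &#34;30s&#34;
&lt;/code&gt;&lt;/pre&gt;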
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4603&#34;&gt;KEP-4603: Tune CrashLoopBackOff&lt;/a&gt; led by SIG
Node.&lt;/p&gt;
&lt;h3 id=&#34;custom-container-stop-signals&#34;&gt;Custom container stop signals&lt;/h3&gt;
&lt;p&gt;Before Kubernetes v1.33, stop signals could only be set in container image definitions (for example,
via the &lt;code&gt;StopSignal&lt;/code&gt; configuration field in the image metadata). If you wanted to modify termination
behavior, you needed to build a custom container image. By enabling the (alpha)
&lt;code&gt;ContainerStopSignals&lt;/code&gt; feature gate in Kubernetes v1.33, you can now define custom stop signals
directly within Pod specifications. This is defined in the container&#39;s &lt;code&gt;lifecycle.stopSignal&lt;/code&gt; field
and requires the Pod&#39;s &lt;code&gt;spec.os.name&lt;/code&gt; field to be present. If unspecified, containers fall back to
the image-defined stop signal (if present), or the container runtime default (typically SIGTERM for
Linux).&lt;/p&gt;
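&lt;p&gt;For example (the signal choice, Pod name, and image are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: custom-stop-signal
spec:
  os:
    name: linux   # required when setting a custom stop signal
  containers:
  - name: app
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    lifecycle:
      stopSignal: SIGUSR1
&lt;/code&gt;&lt;/pre&gt;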
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4960&#34;&gt;KEP-4960: Container Stop Signals&lt;/a&gt; led by SIG
Node.&lt;/p&gt;
&lt;h3 id=&#34;dra-enhancements-galore&#34;&gt;DRA enhancements galore!&lt;/h3&gt;
&lt;p&gt;Kubernetes v1.33 continues to develop Dynamic Resource Allocation (DRA) with features designed for
today’s complex infrastructures. DRA is an API for requesting and sharing resources between pods and
containers inside a pod. Typically those resources are devices such as GPUs, FPGAs, and network
adapters.&lt;/p&gt;
&lt;p&gt;The following are all the alpha DRA feature gates introduced in v1.33:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Similar to Node taints, by enabling the &lt;code&gt;DRADeviceTaints&lt;/code&gt; feature gate, devices support taints and
tolerations. An admin or a control plane component can taint devices to limit their usage.
Scheduling of pods which depend on those devices can be paused while a taint exists and/or pods
using a tainted device can be evicted.&lt;/li&gt;
&lt;li&gt;By enabling the feature gate &lt;code&gt;DRAPrioritizedList&lt;/code&gt;, DeviceRequests get a new field named
&lt;code&gt;firstAvailable&lt;/code&gt;. This field is an ordered list that allows the user to specify that a request may
be satisfied in different ways, including allocating nothing at all if some specific hardware is
not available.&lt;/li&gt;
&lt;li&gt;With feature gate &lt;code&gt;DRAAdminAccess&lt;/code&gt; enabled, only users authorized to create ResourceClaim or
ResourceClaimTemplate objects in namespaces labeled with &lt;code&gt;resource.k8s.io/admin-access: &amp;quot;true&amp;quot;&lt;/code&gt;
can use the &lt;code&gt;adminAccess&lt;/code&gt; field. This ensures that non-admin users cannot misuse the &lt;code&gt;adminAccess&lt;/code&gt;
feature.&lt;/li&gt;
&lt;li&gt;While it has been possible to consume device partitions since v1.31, vendors had to pre-partition
devices and advertise them accordingly. By enabling the &lt;code&gt;DRAPartitionableDevices&lt;/code&gt; feature gate in
v1.33, device vendors can advertise multiple partitions, including overlapping ones. The
Kubernetes scheduler will choose the partition based on workload requests, and prevent the
allocation of conflicting partitions simultaneously. This feature gives vendors the ability to
dynamically create partitions at allocation time. The allocation and dynamic partitioning are
automatic and transparent to users, enabling improved resource utilization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These feature gates have no effect unless you also enable the &lt;code&gt;DynamicResourceAllocation&lt;/code&gt; feature
gate.&lt;/p&gt;
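&lt;p&gt;As a sketch of the prioritized-list idea (the device class names are placeholders, and exact field shapes may change while these features are alpha), a ResourceClaim could prefer a large device but accept a smaller one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:   # subrequests are tried in order
      - name: large
        deviceClassName: large-gpu.example.com
      - name: small
        deviceClassName: small-gpu.example.com
&lt;/code&gt;&lt;/pre&gt;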
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/5055&#34;&gt;KEP-5055: DRA: device taints and tolerations&lt;/a&gt;,
&lt;a href=&#34;https://kep.k8s.io/4816&#34;&gt;KEP-4816: DRA: Prioritized Alternatives in Device Requests&lt;/a&gt;,
&lt;a href=&#34;https://kep.k8s.io/5018&#34;&gt;KEP-5018: DRA: AdminAccess for ResourceClaims and ResourceClaimTemplates&lt;/a&gt;,
and &lt;a href=&#34;https://kep.k8s.io/4815&#34;&gt;KEP-4815: DRA: Add support for partitionable devices&lt;/a&gt;, led by SIG
Node, SIG Scheduling and SIG Auth.&lt;/p&gt;
&lt;h3 id=&#34;robust-image-pull-policy-to-authenticate-images-for-ifnotpresent-and-never&#34;&gt;Robust image pull policy to authenticate images for &lt;code&gt;IfNotPresent&lt;/code&gt; and &lt;code&gt;Never&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;This feature allows users to ensure that kubelet requires an image pull authentication check for
each new set of credentials, regardless of whether the image is already present on the node.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/2535&#34;&gt;KEP-2535: Ensure secret pulled images&lt;/a&gt; led
by SIG Auth.&lt;/p&gt;
&lt;h3 id=&#34;node-topology-labels-are-available-via-downward-api&#34;&gt;Node topology labels are available via downward API&lt;/h3&gt;
&lt;p&gt;This feature enables Node topology labels to be exposed via the downward API. Prior to Kubernetes
v1.33, a workaround involved using an init container to query the Kubernetes API for the underlying
node; this alpha feature simplifies how workloads can access Node topology information.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4742&#34;&gt;KEP-4742: Expose Node labels via downward API&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;better-pod-status-with-generation-and-observed-generation&#34;&gt;Better pod status with generation and observed generation&lt;/h3&gt;
&lt;p&gt;Prior to this change, the &lt;code&gt;metadata.generation&lt;/code&gt; field was unused in Pods. This feature makes
&lt;code&gt;metadata.generation&lt;/code&gt; meaningful for Pods and introduces &lt;code&gt;status.observedGeneration&lt;/code&gt; to provide
clearer pod status.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/5067&#34;&gt;KEP-5067: Pod Generation&lt;/a&gt; led by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;support-for-split-level-3-cache-architecture-with-kubelet-s-cpu-manager&#34;&gt;Support for split level 3 cache architecture with kubelet’s CPU Manager&lt;/h3&gt;
&lt;p&gt;Previously, the kubelet&#39;s CPU Manager was unaware of split L3 cache architecture (also known as Last
Level Cache, or LLC) and could distribute CPU assignments without considering it, causing a noisy
neighbor problem. This alpha feature makes the CPU Manager aware of split L3 cache topology so that
it can assign CPU cores with performance in mind.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/5109&#34;&gt;KEP-5109: Split L3 Cache Topology Awareness in CPU Manager&lt;/a&gt; led by SIG
Node.&lt;/p&gt;
&lt;h3 id=&#34;psi-pressure-stall-information-metrics-for-scheduling-improvements&#34;&gt;PSI (Pressure Stall Information) metrics for scheduling improvements&lt;/h3&gt;
&lt;p&gt;This feature adds support on Linux nodes for providing PSI stats and metrics using cgroupv2. PSI
data can help detect resource shortages and enable more granular control over pod scheduling.&lt;/p&gt;
&lt;p&gt;This work was done as part of &lt;a href=&#34;https://kep.k8s.io/4205&#34;&gt;KEP-4205: Support PSI based on cgroupv2&lt;/a&gt; led
by SIG Node.&lt;/p&gt;
&lt;h3 id=&#34;secret-less-image-pulls-with-kubelet&#34;&gt;Secret-less image pulls with kubelet&lt;/h3&gt;
&lt;p&gt;The kubelet&#39;s on-disk credential provider now supports optional Kubernetes ServiceAccount (SA) token
fetching. This simplifies authentication with image registries by allowing cloud providers to better
integrate with OIDC-compatible identity solutions.&lt;/p&gt;
&lt;p&gt;This work was done as part of
&lt;a href=&#34;https://kep.k8s.io/4412&#34;&gt;KEP-4412: Projected service account tokens for Kubelet image credential providers&lt;/a&gt;
led by SIG Auth.&lt;/p&gt;
&lt;h2 id=&#34;graduations-deprecations-and-removals-in-v1-33&#34;&gt;Graduations, deprecations, and removals in v1.33&lt;/h2&gt;
&lt;h3 id=&#34;graduations-to-stable&#34;&gt;Graduations to stable&lt;/h3&gt;
&lt;p&gt;This lists all the features that have graduated to stable (also known as &lt;em&gt;general availability&lt;/em&gt;).
For a full list of updates including new features and graduations from alpha to beta, see the
release notes.&lt;/p&gt;
&lt;p&gt;This release includes a total of 18 enhancements promoted to stable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3094&#34;&gt;Take taints/tolerations into consideration when calculating PodTopologySpread skew&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3633&#34;&gt;Introduce &lt;code&gt;MatchLabelKeys&lt;/code&gt; to Pod Affinity and Pod Anti Affinity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4193&#34;&gt;Bound service account token improvements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/1495&#34;&gt;Generic data populators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/1880&#34;&gt;Multiple Service CIDRs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/2433&#34;&gt;Topology Aware Routing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/2589&#34;&gt;Portworx file in-tree to CSI driver migration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/2644&#34;&gt;Always Honor PersistentVolume Reclaim Policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3866&#34;&gt;nftables kube-proxy backend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4004&#34;&gt;Deprecate status.nodeInfo.kubeProxyVersion field&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/2590&#34;&gt;Add subresource support to kubectl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3850&#34;&gt;Backoff Limit Per Index For Indexed Jobs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3998&#34;&gt;Job success/completion policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/753&#34;&gt;Sidecar Containers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4008&#34;&gt;CRD Validation Ratcheting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/2625&#34;&gt;node: cpumanager: add options to reject non SMT-aligned workload&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4444&#34;&gt;Traffic Distribution for Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/3857&#34;&gt;Recursive Read-only (RRO) mounts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;deprecations-and-removals&#34;&gt;Deprecations and removals&lt;/h3&gt;
&lt;p&gt;As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better
ones to improve the project&#39;s overall health. See the Kubernetes
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/deprecation-policy/&#34;&gt;deprecation and removal policy&lt;/a&gt; for more details on
this process. Many of these deprecations and removals were announced in the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/26/kubernetes-v1-33-upcoming-changes/&#34;&gt;Deprecations and Removals blog post&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;deprecation-of-the-stable-endpoints-api&#34;&gt;Deprecation of the stable Endpoints API&lt;/h4&gt;
&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/endpoint-slices/&#34;&gt;EndpointSlices&lt;/a&gt; API has been stable since
v1.21, which effectively replaced the original Endpoints API. While the original Endpoints API was
simple and straightforward, it also posed some challenges when scaling to large numbers of network
endpoints. The EndpointSlices API has introduced new features such as dual-stack networking, making
the original Endpoints API ready for deprecation.&lt;/p&gt;
&lt;p&gt;This deprecation affects only those who use the Endpoints API directly from workloads or scripts;
these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with
more details on the deprecation implications and migration plans.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/4974&#34;&gt;KEP-4974: Deprecate v1.Endpoints&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;removal-of-kube-proxy-version-information-in-node-status&#34;&gt;Removal of kube-proxy version information in node status&lt;/h4&gt;
&lt;p&gt;Following its deprecation in v1.31, as highlighted in the v1.31
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/07/19/kubernetes-1-31-upcoming-changes/#deprecation-of-status-nodeinfo-kubeproxyversion-field-for-nodes-kep-4004-https-github-com-kubernetes-enhancements-issues-4004&#34;&gt;release announcement&lt;/a&gt;,
the &lt;code&gt;.status.nodeInfo.kubeProxyVersion&lt;/code&gt; field for Nodes was removed in v1.33.&lt;/p&gt;
&lt;p&gt;This field was set by the kubelet, but its value was not consistently accurate. Having been
disabled by default since v1.31, the field has now been removed entirely.&lt;/p&gt;
&lt;p&gt;You can find more in
&lt;a href=&#34;https://kep.k8s.io/4004&#34;&gt;KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;removal-of-in-tree-gitrepo-volume-driver&#34;&gt;Removal of in-tree gitRepo volume driver&lt;/h4&gt;
&lt;p&gt;The gitRepo volume type has been deprecated since v1.11, nearly 7 years ago. Since its deprecation,
there have been security concerns, including how gitRepo volume types can be exploited to gain
remote code execution as root on the nodes. In v1.33, the in-tree driver code is removed.&lt;/p&gt;
&lt;p&gt;There are alternatives such as git-sync and init containers. The &lt;code&gt;gitRepo&lt;/code&gt; volume type is not
removed from the Kubernetes API, so pods with &lt;code&gt;gitRepo&lt;/code&gt; volumes will still be admitted by
kube-apiserver, but kubelets with the &lt;code&gt;GitRepoVolumeDriver&lt;/code&gt; feature gate set to false will not run
them and will return an appropriate error to the user. This allows users to opt in to re-enabling
the driver for 3 versions, giving them enough time to fix workloads.&lt;/p&gt;
&lt;p&gt;The feature gate in kubelet and in-tree plugin code is planned to be removed in the v1.39 release.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/5040&#34;&gt;KEP-5040: Remove gitRepo volume driver&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;removal-of-host-network-support-for-windows-pods&#34;&gt;Removal of host network support for Windows pods&lt;/h4&gt;
&lt;p&gt;Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density
by allowing containers to use the Node’s networking namespace. The original implementation landed as
alpha with v1.26, but because it faced unexpected containerd behaviours and alternative solutions
were available, the Kubernetes project has decided to withdraw the associated KEP. Support was fully
removed in v1.33.&lt;/p&gt;
&lt;p&gt;Please note that this does not affect
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/create-hostprocess-pod/&#34;&gt;HostProcess containers&lt;/a&gt;, which
provide host network as well as host-level access. The KEP withdrawn in v1.33 was about providing
host network only, which was never stable due to technical limitations with Windows networking
logic.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/3503&#34;&gt;KEP-3503: Host network support for Windows pods&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;release-notes&#34;&gt;Release notes&lt;/h2&gt;
&lt;p&gt;Check out the full details of the Kubernetes v1.33 release in our
&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.33.md&#34;&gt;release notes&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;availability&#34;&gt;Availability&lt;/h2&gt;
&lt;p&gt;Kubernetes v1.33 is available for download on
&lt;a href=&#34;https://github.com/kubernetes/kubernetes/releases/tag/v1.33.0&#34;&gt;GitHub&lt;/a&gt; or on the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/releases/download/&#34;&gt;Kubernetes download page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To get started with Kubernetes, check out these &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tutorials/&#34;&gt;interactive tutorials&lt;/a&gt; or run
local Kubernetes clusters using &lt;a href=&#34;https://minikube.sigs.k8s.io/&#34;&gt;minikube&lt;/a&gt;. You can also easily
install v1.33 using
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/&#34;&gt;kubeadm&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;release-team&#34;&gt;Release Team&lt;/h2&gt;
&lt;p&gt;Kubernetes is only possible with the support, commitment, and hard work of its community. The
Release Team is made up of dedicated community volunteers who work together to build the many pieces
that make up the Kubernetes releases you rely on. This requires the specialized skills of people
from all corners of our community, from the code itself to its documentation and project
management.&lt;/p&gt;
&lt;p&gt;We would like to thank the entire
&lt;a href=&#34;https://github.com/kubernetes/sig-release/blob/master/releases/release-1.33/release-team.md&#34;&gt;Release Team&lt;/a&gt;
for the hours spent hard at work to deliver the Kubernetes v1.33 release to our community. The
Release Team&#39;s membership ranges from first-time shadows to returning team leads with experience
forged over several release cycles. A new team structure was adopted in this release cycle: the
Release Notes and Docs subteams were combined into a unified Docs subteam. Thanks to the new Docs
team&#39;s meticulous effort in organizing the relevant information and resources, both Release Notes
and Docs tracking saw a smooth and successful transition. Finally, a very
special thanks goes out to our release lead, Nina Polshakova, for her support throughout a
successful release cycle, her advocacy, her efforts to ensure that everyone could contribute
effectively, and her challenges to improve the release process.&lt;/p&gt;
&lt;h2 id=&#34;project-velocity&#34;&gt;Project velocity&lt;/h2&gt;
&lt;p&gt;The CNCF K8s
&lt;a href=&#34;https://k8s.devstats.cncf.io/d/11/companies-contributing-in-repository-groups?orgId=1&amp;var-period=m&amp;var-repogroup_name=All&#34;&gt;DevStats&lt;/a&gt;
project aggregates several interesting data points related to the velocity of Kubernetes and various
subprojects. This includes everything from individual contributions, to the number of companies
contributing, and illustrates the depth and breadth of effort that goes into evolving this
ecosystem.&lt;/p&gt;
&lt;p&gt;During the v1.33 release cycle, which spanned 15 weeks from January 13 to April 23, 2025, Kubernetes
received contributions from as many as 121 different companies and 570 individuals (as of writing, a
few weeks before the release date). In the wider cloud native ecosystem, the figure goes up to 435
companies counting 2400 total contributors. You can find the data source in
&lt;a href=&#34;https://k8s.devstats.cncf.io/d/11/companies-contributing-in-repository-groups?orgId=1&amp;var-period=d28&amp;var-repogroup_name=All&amp;var-repo_name=kubernetes%2Fkubernetes&amp;from=1736755200000&amp;to=1745477999000&#34;&gt;this dashboard&lt;/a&gt;.
Compared to the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/11/kubernetes-v1-32-release/#project-velocity&#34;&gt;velocity data from previous release, v1.32&lt;/a&gt;,
we see a similar level of contribution from companies and individuals, indicating strong community
interest and engagement.&lt;/p&gt;
&lt;p&gt;Note that a “contribution” is counted when someone makes a commit, performs a code review, creates
an issue or PR, reviews a PR (including blogs and documentation), or comments on issues and PRs. If
you are
interested in contributing, visit
&lt;a href=&#34;https://www.kubernetes.dev/docs/guide/#getting-started&#34;&gt;Getting Started&lt;/a&gt; on our contributor
website.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://k8s.devstats.cncf.io/d/11/companies-contributing-in-repository-groups?orgId=1&amp;var-period=m&amp;var-repogroup_name=All&#34;&gt;Check out DevStats&lt;/a&gt;
to learn more about the overall velocity of the Kubernetes project and community.&lt;/p&gt;
&lt;h2 id=&#34;event-update&#34;&gt;Event update&lt;/h2&gt;
&lt;p&gt;Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and
other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;May 2025&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-costa-rica-presents-kcd-costa-rica-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Costa Rica&lt;/strong&gt;&lt;/a&gt;:
May 3, 2025 | Heredia, Costa Rica&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-helsinki-presents-kcd-helsinki-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Helsinki&lt;/strong&gt;&lt;/a&gt;:
May 6, 2025 | Helsinki, Finland&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-texas-presents-kcd-texas-austin-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Texas Austin&lt;/strong&gt;&lt;/a&gt;:
May 15, 2025 | Austin, USA&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-south-korea-presents-kcd-seoul-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Seoul&lt;/strong&gt;&lt;/a&gt;:
May 22, 2025 | Seoul, South Korea&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-istanbul-presents-kcd-istanbul-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Istanbul, Turkey&lt;/strong&gt;&lt;/a&gt;:
May 23, 2025 | Istanbul, Turkey&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-sf-bay-area-presents-kcd-san-francisco-bay-area/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: San Francisco Bay Area&lt;/strong&gt;&lt;/a&gt;:
May 28, 2025 | San Francisco, USA&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;June 2025&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-new-york-presents-kcd-new-york-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: New York&lt;/strong&gt;&lt;/a&gt;:
June 4, 2025 | New York, USA&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-czech-slovak-presents-kcd-czech-amp-slovak-bratislava-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Czech &amp;amp; Slovak&lt;/strong&gt;&lt;/a&gt;:
June 5, 2025 | Bratislava, Slovakia&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-bengaluru-presents-kubernetes-community-days-bengaluru-2025-in-person/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Bengaluru&lt;/strong&gt;&lt;/a&gt;:
June 6, 2025 | Bangalore, India&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://events.linuxfoundation.org/kubecon-cloudnativecon-china/&#34;&gt;&lt;strong&gt;KubeCon + CloudNativeCon China 2025&lt;/strong&gt;&lt;/a&gt;:
June 10-11, 2025 | Hong Kong&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-guatemala-presents-kcd-antigua-guatemala-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Antigua Guatemala&lt;/strong&gt;&lt;/a&gt;:
June 14, 2025 | Antigua Guatemala, Guatemala&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://events.linuxfoundation.org/kubecon-cloudnativecon-japan&#34;&gt;&lt;strong&gt;KubeCon + CloudNativeCon Japan 2025&lt;/strong&gt;&lt;/a&gt;:
June 16-17, 2025 | Tokyo, Japan&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.cncf.io/kcds/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Nigeria, Africa&lt;/strong&gt;&lt;/a&gt;: June 19, 2025 |
Nigeria, Africa&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;July 2025&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-netherlands-presents-kcd-utrecht-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Utrecht&lt;/strong&gt;&lt;/a&gt;:
July 4, 2025 | Utrecht, Netherlands&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-taiwan-presents-kcd-taipei-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Taipei&lt;/strong&gt;&lt;/a&gt;:
July 5, 2025 | Taipei, Taiwan&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-lima-peru-presents-kcd-lima-peru-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Lima, Peru&lt;/strong&gt;&lt;/a&gt;:
July 19, 2025 | Lima, Peru&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;August 2025&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://events.linuxfoundation.org/kubecon-cloudnativecon-india-2025/&#34;&gt;&lt;strong&gt;KubeCon + CloudNativeCon India 2025&lt;/strong&gt;&lt;/a&gt;:
August 6-7, 2025 | Hyderabad, India&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://community.cncf.io/events/details/cncf-kcd-colombia-presents-kcd-colombia-2025/&#34;&gt;&lt;strong&gt;KCD - Kubernetes Community Days: Colombia&lt;/strong&gt;&lt;/a&gt;:
August 29, 2025 | Bogotá, Colombia&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can find the latest KCD details &lt;a href=&#34;https://www.cncf.io/kcds/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;upcoming-release-webinar&#34;&gt;Upcoming release webinar&lt;/h2&gt;
&lt;p&gt;Join members of the Kubernetes v1.33 Release Team on &lt;strong&gt;Friday, May 16th 2025 at 4:00 PM (UTC)&lt;/strong&gt;, to
learn about the highlights of this release, as well as deprecations and removals to help
plan for upgrades. For more information and registration, visit the
&lt;a href=&#34;https://community.cncf.io/events/details/cncf-cncf-online-programs-presents-cncf-live-webinar-kubernetes-133-release/&#34;&gt;event page&lt;/a&gt;
on the CNCF Online Programs site.&lt;/p&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The simplest way to get involved with Kubernetes is by joining one of the many
&lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;Special Interest Groups&lt;/a&gt; (SIGs)
that align with your interests. Have something you’d like to broadcast to the Kubernetes community?
Share your voice at our weekly
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/communication&#34;&gt;community meeting&lt;/a&gt;, and through
the channels below. Thank you for your continued feedback and support.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Follow us on Bluesky &lt;a href=&#34;https://bsky.app/profile/kubernetes.io&#34;&gt;@kubernetes.io&lt;/a&gt; for the latest
updates&lt;/li&gt;
&lt;li&gt;Join the community discussion on &lt;a href=&#34;https://discuss.kubernetes.io/&#34;&gt;Discuss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join the community on &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Post questions (or answer questions) on
&lt;a href=&#34;https://serverfault.com/questions/tagged/kubernetes&#34;&gt;Server Fault&lt;/a&gt; or
&lt;a href=&#34;http://stackoverflow.com/questions/tagged/kubernetes&#34;&gt;Stack Overflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Share your Kubernetes
&lt;a href=&#34;https://docs.google.com/a/linuxfoundation.org/forms/d/e/1FAIpQLScuI7Ye3VQHQTwBASrgkjQDSS5TP0g3AXfFhwSM9YpHgxRKFA/viewform&#34;&gt;story&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read more about what’s happening with Kubernetes on the &lt;a href=&#34;https://kubernetes.io/blog/&#34;&gt;blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn more about the
&lt;a href=&#34;https://github.com/kubernetes/sig-release/tree/master/release-team&#34;&gt;Kubernetes Release Team&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes Multicontainer Pods: An Overview</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/22/multi-container-pods-overview/</link>
      <pubDate>Tue, 22 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/22/multi-container-pods-overview/</guid>
      <description>
        
        
        &lt;p&gt;As cloud-native architectures continue to evolve, Kubernetes has become the go-to platform for deploying complex, distributed systems. One of the most powerful yet nuanced design patterns in this ecosystem is the sidecar pattern—a technique that allows developers to extend application functionality without diving deep into source code.&lt;/p&gt;
&lt;h2 id=&#34;the-origins-of-the-sidecar-pattern&#34;&gt;The origins of the sidecar pattern&lt;/h2&gt;
&lt;p&gt;Think of a sidecar like a trusty companion motorcycle attachment. Historically, IT infrastructures have always used auxiliary services to handle critical tasks. Before containers, we relied on background processes and helper daemons to manage logging, monitoring, and networking. The microservices revolution transformed this approach, making sidecars a structured and intentional architectural choice.
With the rise of microservices, the sidecar pattern became more clearly defined, allowing developers to offload specific responsibilities from the main service without altering its code. Service meshes like Istio and Linkerd have popularized sidecar proxies, demonstrating how these companion containers can elegantly handle observability, security, and traffic management in distributed systems.&lt;/p&gt;
&lt;h2 id=&#34;kubernetes-implementation&#34;&gt;Kubernetes implementation&lt;/h2&gt;
&lt;p&gt;In Kubernetes, &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/sidecar-containers/&#34;&gt;sidecar containers&lt;/a&gt; operate within
the same Pod as the main application, enabling communication and resource sharing.
Does this sound just like defining multiple containers alongside each other inside the Pod? It does, and
that is how sidecar containers had to be implemented before Kubernetes v1.29.0, which introduced
native support for sidecars.
Sidecar containers can now be defined within a Pod manifest using the &lt;code&gt;spec.initContainers&lt;/code&gt; field. What makes
such a container a sidecar is that you specify it with &lt;code&gt;restartPolicy: Always&lt;/code&gt;. You can see an example of this below, which is a partial snippet of a full Kubernetes manifest:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;initContainers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;logshipper&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;alpine:latest&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;restartPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Always&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;[&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;sh&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;-c&amp;#39;&lt;/span&gt;,&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;tail -F /opt/logs.txt&amp;#39;&lt;/span&gt;]&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeMounts&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;data&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;mountPath&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;/opt&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;That field name, &lt;code&gt;spec.initContainers&lt;/code&gt;, may sound confusing. Why, when you want to define a sidecar container, do you put an entry in the &lt;code&gt;spec.initContainers&lt;/code&gt; array? Entries in &lt;code&gt;spec.initContainers&lt;/code&gt; normally run to completion just before the main application starts, so they are one-off, whereas sidecars often run in parallel with the main app container. It is the &lt;code&gt;restartPolicy: Always&lt;/code&gt; setting that distinguishes Kubernetes-native sidecar containers from classic &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/init-containers/&#34;&gt;init containers&lt;/a&gt; and ensures they are kept running.&lt;/p&gt;
&lt;h2 id=&#34;when-to-embrace-or-avoid-sidecars&#34;&gt;When to embrace (or avoid) sidecars&lt;/h2&gt;
&lt;p&gt;While the sidecar pattern can be useful in many cases, it is generally not the preferred approach unless the use case justifies it. Adding a sidecar increases complexity, resource consumption, and potential network latency. Instead, simpler alternatives such as built-in libraries or shared infrastructure should be considered first.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deploy a sidecar when:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You need to extend application functionality without touching the original code&lt;/li&gt;
&lt;li&gt;Implementing cross-cutting concerns like logging, monitoring or security&lt;/li&gt;
&lt;li&gt;Working with legacy applications requiring modern networking capabilities&lt;/li&gt;
&lt;li&gt;Designing microservices that demand independent scaling and updates&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Proceed with caution if:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Resource efficiency is your primary concern&lt;/li&gt;
&lt;li&gt;Minimal network latency is critical&lt;/li&gt;
&lt;li&gt;Simpler alternatives exist&lt;/li&gt;
&lt;li&gt;You want to minimize troubleshooting complexity&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;four-essential-multi-container-patterns&#34;&gt;Four essential multi-container patterns&lt;/h2&gt;
&lt;h3 id=&#34;init-container-pattern&#34;&gt;Init container pattern&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;Init container&lt;/strong&gt; pattern is used to execute (often critical) setup tasks before the main application container starts. Unlike regular containers, init containers run to completion and then terminate, ensuring that preconditions for the main application are met.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ideal for:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Preparing configurations&lt;/li&gt;
&lt;li&gt;Loading secrets&lt;/li&gt;
&lt;li&gt;Verifying dependency availability&lt;/li&gt;
&lt;li&gt;Running database migrations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The init container ensures your application starts in a predictable, controlled environment without code modifications.&lt;/p&gt;
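&lt;p&gt;As a minimal sketch (the container names, images, and command here are illustrative assumptions, not taken from a specific deployment), an init container that blocks the main application until a database service is reachable could look like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;spec:
  initContainers:
    - name: wait-for-db            # runs to completion before the app starts
      image: busybox:1.36
      command: [&#39;sh&#39;, &#39;-c&#39;, &#39;until nc -z db-service 5432; do sleep 2; done&#39;]
  containers:
    - name: app
      image: my-app:1.0
&lt;/code&gt;&lt;/pre&gt;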
&lt;h3 id=&#34;ambassador-pattern&#34;&gt;Ambassador pattern&lt;/h3&gt;
&lt;p&gt;An ambassador container provides Pod-local helper services that expose a simple way to access a network service. Commonly, ambassador containers send network requests on behalf of an application container and
take care of challenges such as service discovery, peer identity verification, or encryption in transit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Perfect when you need to:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Offload client connectivity concerns&lt;/li&gt;
&lt;li&gt;Implement language-agnostic networking features&lt;/li&gt;
&lt;li&gt;Add security layers like TLS&lt;/li&gt;
&lt;li&gt;Create robust circuit breakers and retry mechanisms&lt;/li&gt;
&lt;/ol&gt;
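&lt;p&gt;As a sketch (the ambassador image and port below are hypothetical placeholders, not a published proxy), the pattern lets the application talk only to &lt;code&gt;localhost&lt;/code&gt; while the companion container handles the remote connection:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;spec:
  containers:
    - name: app
      image: my-app:1.0            # connects to localhost:6379 only
    - name: redis-ambassador       # hypothetical proxy handling discovery and TLS
      image: example.com/redis-ambassador:latest
      ports:
        - containerPort: 6379
&lt;/code&gt;&lt;/pre&gt;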
&lt;h3 id=&#34;configuration-helper&#34;&gt;Configuration helper&lt;/h3&gt;
&lt;p&gt;A &lt;em&gt;configuration helper&lt;/em&gt; sidecar provides configuration updates to an application dynamically, ensuring it always has access to the latest settings without disrupting the service. Often the helper needs to provide an initial
configuration before the application can start successfully.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fetching environment variables and secrets&lt;/li&gt;
&lt;li&gt;Polling configuration changes&lt;/li&gt;
&lt;li&gt;Decoupling configuration management from application logic&lt;/li&gt;
&lt;/ol&gt;
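&lt;p&gt;With the native sidecar support described earlier, a configuration helper can be sketched as an init container with &lt;code&gt;restartPolicy: Always&lt;/code&gt; that keeps a shared volume up to date (the helper image name below is a hypothetical placeholder):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;spec:
  initContainers:
    - name: config-helper          # keeps polling for configuration changes
      image: example.com/config-fetcher:latest
      restartPolicy: Always        # makes this a sidecar, not a one-off init container
      volumeMounts:
        - name: config
          mountPath: /etc/app-config
  containers:
    - name: app
      image: my-app:1.0
      volumeMounts:
        - name: config
          mountPath: /etc/app-config
          readOnly: true
  volumes:
    - name: config
      emptyDir: {}
&lt;/code&gt;&lt;/pre&gt;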
&lt;h3 id=&#34;adapter-pattern&#34;&gt;Adapter pattern&lt;/h3&gt;
&lt;p&gt;An &lt;em&gt;adapter&lt;/em&gt; (or sometimes &lt;em&gt;façade&lt;/em&gt;) container enables interoperability between the main application container and external services. It does this by translating data formats, protocols, or APIs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Transforming legacy data formats&lt;/li&gt;
&lt;li&gt;Bridging communication protocols&lt;/li&gt;
&lt;li&gt;Facilitating integration between mismatched services&lt;/li&gt;
&lt;/ol&gt;
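&lt;p&gt;For instance (the image names here are hypothetical), an adapter can re-expose a legacy application&#39;s metrics in the format a modern monitoring system expects:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;spec:
  containers:
    - name: legacy-app
      image: example.com/legacy-app:2.3     # emits metrics in its own plain-text format
    - name: metrics-adapter                 # translates them into Prometheus exposition format
      image: example.com/metrics-adapter:latest
      ports:
        - containerPort: 9090
&lt;/code&gt;&lt;/pre&gt;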
&lt;h2 id=&#34;wrap-up&#34;&gt;Wrap-up&lt;/h2&gt;
&lt;p&gt;While sidecar patterns offer tremendous flexibility, they&#39;re not a silver bullet. Each added sidecar introduces complexity, consumes resources, and potentially increases operational overhead. Always evaluate simpler alternatives first.
The key is strategic implementation: use sidecars as precision tools to solve specific architectural challenges, not as a default approach. When used correctly, they can improve security, networking, and configuration management in containerized environments.
Choose wisely, implement carefully, and let your sidecars elevate your container ecosystem.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Introducing kube-scheduler-simulator</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/07/introducing-kube-scheduler-simulator/</link>
      <pubDate>Mon, 07 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/04/07/introducing-kube-scheduler-simulator/</guid>
      <description>
        
        
        &lt;p&gt;The Kubernetes Scheduler is a crucial control plane component that determines which node a Pod will run on.
Thus, anyone utilizing Kubernetes relies on a scheduler.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/kube-scheduler-simulator&#34;&gt;kube-scheduler-simulator&lt;/a&gt; is a &lt;em&gt;simulator&lt;/em&gt; for the Kubernetes scheduler that started as a &lt;a href=&#34;https://summerofcode.withgoogle.com/&#34;&gt;Google Summer of Code 2021&lt;/a&gt; project developed by me (Kensei Nakada) and has since received many contributions.
This tool allows users to closely examine the scheduler’s behavior and decisions.&lt;/p&gt;
&lt;p&gt;It is useful for casual users who employ scheduling constraints (for example, &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity&#34;&gt;inter-Pod affinity&lt;/a&gt;)
and experts who extend the scheduler with custom plugins.&lt;/p&gt;
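&lt;p&gt;To make that concrete, a constraint such as the following inter-Pod affinity rule (the label values are illustrative) is exactly the kind of input whose effect on placement the simulator helps you inspect:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache             # co-locate with Pods labeled app=cache
        topologyKey: kubernetes.io/hostname
&lt;/code&gt;&lt;/pre&gt;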
&lt;h2 id=&#34;motivation&#34;&gt;Motivation&lt;/h2&gt;
&lt;p&gt;The scheduler often appears as a black box,
composed of many plugins that each contribute to the scheduling decision-making process from their unique perspectives.
Understanding its behavior can be challenging due to the multitude of factors it considers.&lt;/p&gt;
&lt;p&gt;Even if a Pod appears to be scheduled correctly in a simple test cluster, it might have been scheduled based on different calculations than expected. This discrepancy could lead to unexpected scheduling outcomes when deployed in a large production environment.&lt;/p&gt;
&lt;p&gt;Also, testing a scheduler is a complex challenge.
There are countless patterns of operations executed within a real cluster, making it unfeasible to anticipate every scenario with a finite number of tests.
More often than not, bugs are discovered only when the scheduler is deployed in an actual cluster.
Indeed, many bugs are found by users only after a release ships,
even in the upstream kube-scheduler.&lt;/p&gt;
&lt;p&gt;Having a development or sandbox environment for testing the scheduler — or, indeed, any Kubernetes controllers — is a common practice.
However, this approach falls short of capturing all the potential scenarios that might arise in a production cluster
because a development cluster is often much smaller with notable differences in workload sizes and scaling dynamics.
It never sees the exact same use or exhibits the same behavior as its production counterpart.&lt;/p&gt;
&lt;p&gt;The kube-scheduler-simulator aims to solve those problems.
It enables users to test their scheduling constraints, scheduler configurations,
and custom plugins while checking every detailed part of scheduling decisions.
It also allows users to create a simulated cluster environment, where they can test their scheduler
with the same resources as their production cluster without affecting actual workloads.&lt;/p&gt;
&lt;h2 id=&#34;features-of-the-kube-scheduler-simulator&#34;&gt;Features of the kube-scheduler-simulator&lt;/h2&gt;
&lt;p&gt;The kube-scheduler-simulator’s core feature is its ability to expose the scheduler&#39;s internal decisions.
The scheduler operates based on the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/scheduling-framework/&#34;&gt;scheduling framework&lt;/a&gt;,
using various plugins at different extension points to
filter nodes (the Filter phase), score nodes (the Score phase), and ultimately determine the best node for the Pod.&lt;/p&gt;
&lt;p&gt;The simulator allows users to create Kubernetes resources and observe how each plugin influences the scheduling decisions for Pods.
This visibility helps users understand the scheduler’s workings and define appropriate scheduling constraints.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/images/blog/2025-04-07-kube-scheduler-simulator/simulator.png&#34;
         alt=&#34;Screenshot of the simulator web frontend that shows the detailed scheduling results per node and per extension point&#34;/&gt; &lt;figcaption&gt;
            &lt;h4&gt;The simulator web frontend&lt;/h4&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Inside the simulator, a debuggable scheduler runs instead of the vanilla scheduler.
This debuggable scheduler records the result of each scheduler plugin at every extension point in the Pod’s annotations, as the following manifest shows,
and the web frontend formats and visualizes the scheduling results based on these annotations.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Pod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# The JSONs within these annotations are manually formatted for clarity in the blog post. &lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/bind-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{&amp;#34;DefaultBinder&amp;#34;:&amp;#34;success&amp;#34;}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/filter-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;node-jjfg5&amp;#34;:{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeName&amp;#34;:&amp;#34;passed&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;passed&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeUnschedulable&amp;#34;:&amp;#34;passed&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;TaintToleration&amp;#34;:&amp;#34;passed&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        },
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;node-mtb5x&amp;#34;:{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeName&amp;#34;:&amp;#34;passed&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;passed&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeUnschedulable&amp;#34;:&amp;#34;passed&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;TaintToleration&amp;#34;:&amp;#34;passed&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      }&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/finalscore-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;node-jjfg5&amp;#34;:{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;ImageLocality&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeAffinity&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesBalancedAllocation&amp;#34;:&amp;#34;52&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;47&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;TaintToleration&amp;#34;:&amp;#34;300&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;VolumeBinding&amp;#34;:&amp;#34;0&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        },
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;node-mtb5x&amp;#34;:{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;ImageLocality&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeAffinity&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesBalancedAllocation&amp;#34;:&amp;#34;76&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;73&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;TaintToleration&amp;#34;:&amp;#34;300&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;VolumeBinding&amp;#34;:&amp;#34;0&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      } &lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/permit-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/postfilter-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/prebind-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{&amp;#34;VolumeBinding&amp;#34;:&amp;#34;success&amp;#34;}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/prefilter-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;AzureDiskLimits&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;EBSLimits&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;GCEPDLimits&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;InterPodAffinity&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodeAffinity&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodePorts&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;success&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodeVolumeLimits&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;PodTopologySpread&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;VolumeBinding&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;VolumeRestrictions&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;VolumeZone&amp;#34;:&amp;#34;&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      }&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/prescore-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;InterPodAffinity&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodeAffinity&amp;#34;:&amp;#34;success&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodeResourcesBalancedAllocation&amp;#34;:&amp;#34;success&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;success&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;PodTopologySpread&amp;#34;:&amp;#34;&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;TaintToleration&amp;#34;:&amp;#34;success&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      }&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/reserve-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{&amp;#34;VolumeBinding&amp;#34;:&amp;#34;success&amp;#34;}&amp;#39;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/result-history&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      [
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/bind-result&amp;#34;:&amp;#34;{\&amp;#34;DefaultBinder\&amp;#34;:\&amp;#34;success\&amp;#34;}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/filter-result&amp;#34;:&amp;#34;{\&amp;#34;node-jjfg5\&amp;#34;:{\&amp;#34;NodeName\&amp;#34;:\&amp;#34;passed\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;passed\&amp;#34;,\&amp;#34;NodeUnschedulable\&amp;#34;:\&amp;#34;passed\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;passed\&amp;#34;},\&amp;#34;node-mtb5x\&amp;#34;:{\&amp;#34;NodeName\&amp;#34;:\&amp;#34;passed\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;passed\&amp;#34;,\&amp;#34;NodeUnschedulable\&amp;#34;:\&amp;#34;passed\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;passed\&amp;#34;}}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/finalscore-result&amp;#34;:&amp;#34;{\&amp;#34;node-jjfg5\&amp;#34;:{\&amp;#34;ImageLocality\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeAffinity\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeResourcesBalancedAllocation\&amp;#34;:\&amp;#34;52\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;47\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;300\&amp;#34;,\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;0\&amp;#34;},\&amp;#34;node-mtb5x\&amp;#34;:{\&amp;#34;ImageLocality\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeAffinity\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeResourcesBalancedAllocation\&amp;#34;:\&amp;#34;76\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;73\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;300\&amp;#34;,\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;0\&amp;#34;}}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/permit-result&amp;#34;:&amp;#34;{}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout&amp;#34;:&amp;#34;{}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/postfilter-result&amp;#34;:&amp;#34;{}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/prebind-result&amp;#34;:&amp;#34;{\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;success\&amp;#34;}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/prefilter-result&amp;#34;:&amp;#34;{}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status&amp;#34;:&amp;#34;{\&amp;#34;AzureDiskLimits\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;EBSLimits\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;GCEPDLimits\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;InterPodAffinity\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;NodeAffinity\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;NodePorts\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;success\&amp;#34;,\&amp;#34;NodeVolumeLimits\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;PodTopologySpread\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;VolumeRestrictions\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;VolumeZone\&amp;#34;:\&amp;#34;\&amp;#34;}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/prescore-result&amp;#34;:&amp;#34;{\&amp;#34;InterPodAffinity\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;NodeAffinity\&amp;#34;:\&amp;#34;success\&amp;#34;,\&amp;#34;NodeResourcesBalancedAllocation\&amp;#34;:\&amp;#34;success\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;success\&amp;#34;,\&amp;#34;PodTopologySpread\&amp;#34;:\&amp;#34;\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;success\&amp;#34;}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/reserve-result&amp;#34;:&amp;#34;{\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;success\&amp;#34;}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/score-result&amp;#34;:&amp;#34;{\&amp;#34;node-jjfg5\&amp;#34;:{\&amp;#34;ImageLocality\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeAffinity\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeResourcesBalancedAllocation\&amp;#34;:\&amp;#34;52\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;47\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;0\&amp;#34;},\&amp;#34;node-mtb5x\&amp;#34;:{\&amp;#34;ImageLocality\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeAffinity\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;NodeResourcesBalancedAllocation\&amp;#34;:\&amp;#34;76\&amp;#34;,\&amp;#34;NodeResourcesFit\&amp;#34;:\&amp;#34;73\&amp;#34;,\&amp;#34;TaintToleration\&amp;#34;:\&amp;#34;0\&amp;#34;,\&amp;#34;VolumeBinding\&amp;#34;:\&amp;#34;0\&amp;#34;}}&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;kube-scheduler-simulator.sigs.k8s.io/selected-node&amp;#34;:&amp;#34;node-mtb5x&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      ]&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/score-result&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&amp;gt;-&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;node-jjfg5&amp;#34;:{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;ImageLocality&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeAffinity&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesBalancedAllocation&amp;#34;:&amp;#34;52&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;47&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;TaintToleration&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;VolumeBinding&amp;#34;:&amp;#34;0&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        },
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        &amp;#34;node-mtb5x&amp;#34;:{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;ImageLocality&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeAffinity&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesBalancedAllocation&amp;#34;:&amp;#34;76&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;NodeResourcesFit&amp;#34;:&amp;#34;73&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;TaintToleration&amp;#34;:&amp;#34;0&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;            &amp;#34;VolumeBinding&amp;#34;:&amp;#34;0&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;        }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;      }&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kube-scheduler-simulator.sigs.k8s.io/selected-node&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;node-mtb5x&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Users can also integrate &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/scheduling-framework/&#34;&gt;their custom plugins&lt;/a&gt; or &lt;a href=&#34;https://github.com/kubernetes/design-proposals-archive/blob/main/scheduling/scheduler_extender.md&#34;&gt;extenders&lt;/a&gt; into the debuggable scheduler and visualize their results.&lt;/p&gt;
&lt;p&gt;This debuggable scheduler can also run standalone, for example, on any Kubernetes cluster or in integration tests.
This would be useful to custom plugin developers who want to test their plugins or examine their custom scheduler in a real cluster with better debuggability.&lt;/p&gt;
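&lt;p&gt;Each annotation value shown above is itself a JSON string (and &lt;code&gt;result-history&lt;/code&gt; nests JSON inside JSON), so reading the results programmatically takes one extra decoding step. A minimal sketch in Python; the helper and the abbreviated sample values are our own, not part of the simulator:&lt;/p&gt;

```python
import json

# Abbreviated annotation values, as they would appear on a scheduled Pod;
# in a real cluster you would read pod.metadata.annotations instead.
annotations = {
    "kube-scheduler-simulator.sigs.k8s.io/score-result":
        '{"node-jjfg5":{"NodeResourcesFit":"47"},"node-mtb5x":{"NodeResourcesFit":"73"}}',
    "kube-scheduler-simulator.sigs.k8s.io/selected-node": "node-mtb5x",
}

PREFIX = "kube-scheduler-simulator.sigs.k8s.io/"

def decode(annotations):
    """Parse each simulator annotation; most values are JSON strings."""
    results = {}
    for key, value in annotations.items():
        if not key.startswith(PREFIX):
            continue
        name = key[len(PREFIX):]
        try:
            results[name] = json.loads(value)
        except json.JSONDecodeError:
            # Plain string values such as selected-node are kept as-is.
            results[name] = value
    return results

decoded = decode(annotations)
print(decoded["selected-node"])                                   # node-mtb5x
print(decoded["score-result"]["node-mtb5x"]["NodeResourcesFit"])  # 73
```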
&lt;h2 id=&#34;the-simulator-as-a-better-dev-cluster&#34;&gt;The simulator as a better dev cluster&lt;/h2&gt;
&lt;p&gt;As mentioned earlier, with a limited set of tests, it is impossible to predict every possible scenario in a real-world cluster.
Typically, users will test the scheduler in a small, development cluster before deploying it to production, hoping that no issues arise.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/kube-scheduler-simulator/blob/master/simulator/docs/import-cluster-resources.md&#34;&gt;The simulator’s importing feature&lt;/a&gt;
provides a solution by allowing users to simulate deploying a new scheduler version in a production-like environment without impacting their live workloads.&lt;/p&gt;
&lt;p&gt;By continuously syncing between a production cluster and the simulator, users can safely test a new scheduler version with the same resources their production cluster handles.
Once confident in its performance, they can proceed with the production deployment, reducing the risk of unexpected issues.&lt;/p&gt;
&lt;h2 id=&#34;what-are-the-use-cases&#34;&gt;What are the use cases?&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cluster users&lt;/strong&gt;: Examine if scheduling constraints (for example, PodAffinity, PodTopologySpread) work as intended.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cluster admins&lt;/strong&gt;: Assess how a cluster would behave with changes to the scheduler configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scheduler plugin developers&lt;/strong&gt;: Test custom scheduler plugins or extenders, use the debuggable scheduler in integration tests or development clusters, or use the &lt;a href=&#34;https://github.com/kubernetes-sigs/kube-scheduler-simulator/blob/simulator/v0.3.0/simulator/docs/import-cluster-resources.md&#34;&gt;syncing&lt;/a&gt; feature for testing within a production-like environment.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;getting-started&#34;&gt;Getting started&lt;/h2&gt;
&lt;p&gt;The simulator only requires Docker to be installed on a machine; a Kubernetes cluster is not necessary.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;git clone git@github.com:kubernetes-sigs/kube-scheduler-simulator.git
cd kube-scheduler-simulator
make docker_up
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can then access the simulator&#39;s web UI at &lt;code&gt;http://localhost:3000&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Visit the &lt;a href=&#34;https://sigs.k8s.io/kube-scheduler-simulator&#34;&gt;kube-scheduler-simulator repository&lt;/a&gt; for more details!&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;The scheduler simulator is developed by &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-scheduling/README.md#kube-scheduler-simulator&#34;&gt;Kubernetes SIG Scheduling&lt;/a&gt;. Your feedback and contributions are welcome!&lt;/p&gt;
&lt;p&gt;Open issues or PRs at the &lt;a href=&#34;https://sigs.k8s.io/kube-scheduler-simulator&#34;&gt;kube-scheduler-simulator repository&lt;/a&gt;.
Join the conversation on the &lt;a href=&#34;https://kubernetes.slack.com/messages/sig-scheduling&#34;&gt;#sig-scheduling&lt;/a&gt; Slack channel.&lt;/p&gt;
&lt;h2 id=&#34;acknowledgments&#34;&gt;Acknowledgments&lt;/h2&gt;
&lt;p&gt;The simulator has been maintained by dedicated volunteer engineers, overcoming many challenges to reach its current form.&lt;/p&gt;
&lt;p&gt;A big shout out to all &lt;a href=&#34;https://github.com/kubernetes-sigs/kube-scheduler-simulator/graphs/contributors&#34;&gt;the awesome contributors&lt;/a&gt;!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.33 sneak peek</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/26/kubernetes-v1-33-upcoming-changes/</link>
      <pubDate>Wed, 26 Mar 2025 10:30:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/26/kubernetes-v1-33-upcoming-changes/</guid>
      <description>
        
        
        &lt;p&gt;As the release of Kubernetes v1.33 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the overall health of the project. This blog post outlines some planned changes for the v1.33 release, which the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes environment and to keep you up-to-date with the latest developments.  The information below is based on the current status of the v1.33 release and is subject to change before the final release date.&lt;/p&gt;
&lt;h2 id=&#34;the-kubernetes-api-removal-and-deprecation-process&#34;&gt;The Kubernetes API removal and deprecation process&lt;/h2&gt;
&lt;p&gt;The Kubernetes project has a well-documented &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/deprecation-policy/&#34;&gt;deprecation policy&lt;/a&gt; for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Beta or pre-release API versions must be supported for 3 releases after the deprecation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whether an API is removed as a result of a feature graduating from beta to stable, or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/using-api/deprecation-guide/&#34;&gt;deprecation guide&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;deprecations-and-removals-for-kubernetes-v1-33&#34;&gt;Deprecations and removals for Kubernetes v1.33&lt;/h2&gt;
&lt;h3 id=&#34;deprecation-of-the-stable-endpoints-api&#34;&gt;Deprecation of the stable Endpoints API&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/endpoint-slices/&#34;&gt;EndpointSlices&lt;/a&gt; API has been stable since v1.21, and it effectively replaces the original Endpoints API. While the original Endpoints API was simple and straightforward, it posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API also introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation.&lt;/p&gt;
&lt;p&gt;This deprecation only impacts those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with more details on the deprecation implications and migration plans in the coming weeks.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/4974&#34;&gt;KEP-4974: Deprecate v1.Endpoints&lt;/a&gt;.&lt;/p&gt;
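&lt;p&gt;To see what the migration target looks like, you can list the EndpointSlices that back a Service today; the EndpointSlice controller labels them with the owning Service&#39;s name (&lt;code&gt;my-service&lt;/code&gt; below is a placeholder):&lt;/p&gt;

```shell
# EndpointSlices for a Service are selected by a well-known label:
kubectl get endpointslices -l kubernetes.io/service-name=my-service

# Compare with the legacy Endpoints object your scripts may read today:
kubectl get endpoints my-service -o yaml
```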
&lt;h3 id=&#34;removal-of-kube-proxy-version-information-in-node-status&#34;&gt;Removal of kube-proxy version information in node status&lt;/h3&gt;
&lt;p&gt;Following its deprecation in v1.31, as highlighted in the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/07/19/kubernetes-1-31-upcoming-changes/#deprecation-of-status-nodeinfo-kubeproxyversion-field-for-nodes-kep-4004-https-github-com-kubernetes-enhancements-issues-4004&#34;&gt;release announcement&lt;/a&gt;, the &lt;code&gt;status.nodeInfo.kubeProxyVersion&lt;/code&gt; field will be removed in v1.33. This field was set by kubelet, but its value was not consistently accurate. As it has been disabled by default since v1.31, the v1.33 release will remove this field entirely.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/4004&#34;&gt;KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;removal-of-host-network-support-for-windows-pods&#34;&gt;Removal of host network support for Windows pods&lt;/h3&gt;
&lt;p&gt;Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node’s networking namespace.
The original implementation landed as alpha with v1.26, but because it faced unexpected containerd behaviours
and alternative solutions were available, the Kubernetes project has decided to withdraw the associated
KEP. We expect support to be fully removed in v1.33.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/3503&#34;&gt;KEP-3503: Host network support for Windows pods&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;featured-improvement-of-kubernetes-v1-33&#34;&gt;Featured improvement of Kubernetes v1.33&lt;/h2&gt;
&lt;p&gt;As authors of this article, we picked one improvement as the most significant change to call out!&lt;/p&gt;
&lt;h3 id=&#34;support-for-user-namespaces-within-linux-pods&#34;&gt;Support for user namespaces within Linux Pods&lt;/h3&gt;
&lt;p&gt;One of the oldest open KEPs today is &lt;a href=&#34;https://kep.k8s.io/127&#34;&gt;KEP-127&lt;/a&gt;, Pod security improvement by using Linux &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/user-namespaces/&#34;&gt;User namespaces&lt;/a&gt; for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and now is set to be a part of v1.33, where the feature is available by default.&lt;/p&gt;
&lt;p&gt;This support will not impact existing Pods unless you manually specify &lt;code&gt;pod.spec.hostUsers&lt;/code&gt; to opt in. As highlighted in the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/03/12/kubernetes-1-30-upcoming-changes/&#34;&gt;v1.30 sneak peek blog&lt;/a&gt;, this is an important milestone for mitigating vulnerabilities.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/127&#34;&gt;KEP-127: Support User Namespaces in pods&lt;/a&gt;.&lt;/p&gt;
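&lt;p&gt;Opting in is a one-field change on the Pod spec. A minimal sketch (the Pod name and image are placeholders):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo          # placeholder name
spec:
  hostUsers: false           # run this Pod in its own user namespace
  containers:
  - name: app
    image: busybox:1.36      # placeholder image
    command: ["sleep", "infinity"]
```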
&lt;h2 id=&#34;selected-other-kubernetes-v1-33-improvements&#34;&gt;Selected other Kubernetes v1.33 improvements&lt;/h2&gt;
&lt;p&gt;The following list of enhancements is likely to be included in the upcoming v1.33 release. This is not a commitment and the release content is subject to change.&lt;/p&gt;
&lt;h3 id=&#34;in-place-resource-resize-for-vertical-scaling-of-pods&#34;&gt;In-place resource resize for vertical scaling of Pods&lt;/h3&gt;
&lt;p&gt;When provisioning a Pod, you can use various resources such as Deployment, StatefulSet, etc. Scalability requirements may need horizontal scaling by updating the Pod replica count, or vertical scaling by updating resources allocated to Pod’s container(s). Before this enhancement, container resources defined in a Pod&#39;s &lt;code&gt;spec&lt;/code&gt; were immutable, and updating any of these details within a Pod template would trigger Pod replacement.&lt;/p&gt;
&lt;p&gt;But what if you could dynamically update the resource configuration for your existing Pods without restarting them?&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;KEP-1287&lt;/a&gt; is precisely about allowing such in-place Pod updates. It opens up various possibilities: vertical scale-up of stateful processes without any downtime, seamless scale-down when traffic is low, and even allocating larger resources during startup that are eventually reduced once the initial setup is complete. This was released as alpha in v1.27, and is expected to land as beta in v1.33.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/1287&#34;&gt;KEP-1287: In-Place Update of Pod Resources&lt;/a&gt;.&lt;/p&gt;
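&lt;p&gt;The enhancement adds a per-container &lt;code&gt;resizePolicy&lt;/code&gt; declaring how each resource may be resized in place. A minimal sketch (the Pod name, image, and values are illustrative):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resize-demo           # placeholder name
spec:
  containers:
  - name: app
    image: nginx              # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # CPU can change without a restart
    - resourceName: memory
      restartPolicy: RestartContainer # memory changes restart the container
    resources:
      requests:
        cpu: "500m"
        memory: 128Mi
      limits:
        cpu: "1"
        memory: 256Mi
```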
&lt;h3 id=&#34;dra-s-resourceclaim-device-status-graduates-to-beta&#34;&gt;DRA’s ResourceClaim Device Status graduates to beta&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;devices&lt;/code&gt; field in ResourceClaim &lt;code&gt;status&lt;/code&gt;, originally introduced in the v1.32 release, is likely to graduate to beta in v1.33. This field allows drivers to report device status data, improving both observability and troubleshooting capabilities.&lt;/p&gt;
&lt;p&gt;For example, reporting the interface name, MAC address, and IP addresses of network interfaces in the status of a ResourceClaim can significantly help in configuring and managing network services, as well as in debugging network related issues. You can read more about ResourceClaim Device Status in &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#resourceclaim-device-status&#34;&gt;Dynamic Resource Allocation: ResourceClaim Device Status&lt;/a&gt; document.&lt;/p&gt;
&lt;p&gt;Also, you can find more about the planned enhancement in &lt;a href=&#34;https://kep.k8s.io/4817&#34;&gt;KEP-4817: DRA: Resource Claim Status with possible standardized network interface data&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;ordered-namespace-deletion&#34;&gt;Ordered namespace deletion&lt;/h3&gt;
&lt;p&gt;This KEP introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. The current semi-random deletion order can create security gaps or unintended behaviour, such as Pods persisting after their associated NetworkPolicies are deleted. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources. The design improves Kubernetes’s security and reliability by mitigating risks associated with non-deterministic deletions.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/5080&#34;&gt;KEP-5080: Ordered namespace deletion&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;enhancements-for-indexed-job-management&#34;&gt;Enhancements for indexed job management&lt;/h3&gt;
&lt;p&gt;These two KEPs are both set to graduate to GA to provide better reliability for job handling, specifically for indexed jobs. &lt;a href=&#34;https://kep.k8s.io/3850&#34;&gt;KEP-3850&lt;/a&gt; provides per-index backoff limits for indexed jobs, which allows each index to be fully independent of the other indexes. Also, &lt;a href=&#34;https://kep.k8s.io/3998&#34;&gt;KEP-3998&lt;/a&gt; extends the Job API with a policy for marking an indexed job as successfully completed even when not all indexes have succeeded.&lt;/p&gt;
&lt;p&gt;You can find more in &lt;a href=&#34;https://kep.k8s.io/3850&#34;&gt;KEP-3850: Backoff Limit Per Index For Indexed Jobs&lt;/a&gt; and &lt;a href=&#34;https://kep.k8s.io/3998&#34;&gt;KEP-3998: Job success/completion policy&lt;/a&gt;.&lt;/p&gt;
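&lt;p&gt;To illustrate how these two features combine, here is a sketch of an indexed Job using both fields; the values, names, and image are illustrative only:&lt;/p&gt;

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-example      # illustrative name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  backoffLimitPerIndex: 2    # KEP-3850: retries are tracked per index
  maxFailedIndexes: 3        # the Job fails once more than 3 indexes fail
  successPolicy:             # KEP-3998: Job succeeds once indexes 0-2 succeed
    rules:
    - succeededIndexes: "0-2"
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.k8s.io/e2e-test-images/busybox:1.36.1-1
        command: ["sh", "-c", "echo index $JOB_COMPLETION_INDEX"]
```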
&lt;h2 id=&#34;want-to-know-more&#34;&gt;Want to know more?&lt;/h2&gt;
&lt;p&gt;New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what&#39;s new in &lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.33.md&#34;&gt;Kubernetes v1.33&lt;/a&gt; as part of the CHANGELOG for that release.&lt;/p&gt;
&lt;p&gt;Kubernetes v1.33 release is planned for &lt;strong&gt;Wednesday, 23rd April, 2025&lt;/strong&gt;. Stay tuned for updates!&lt;/p&gt;
&lt;p&gt;You can also see the announcements of changes in the release notes for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.32.md&#34;&gt;Kubernetes v1.32&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.31.md&#34;&gt;Kubernetes v1.31&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.30.md&#34;&gt;Kubernetes v1.30&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-involved&#34;&gt;Get involved&lt;/h2&gt;
&lt;p&gt;The simplest way to get involved with Kubernetes is by joining one of the many &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-list.md&#34;&gt;Special Interest Groups&lt;/a&gt; (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/communication&#34;&gt;community meeting&lt;/a&gt;, and through the channels below. Thank you for your continued feedback and support.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Follow us on Bluesky &lt;a href=&#34;https://bsky.app/profile/kubernetes.io&#34;&gt;@kubernetes.io&lt;/a&gt; for the latest updates&lt;/li&gt;
&lt;li&gt;Join the community discussion on &lt;a href=&#34;https://discuss.kubernetes.io/&#34;&gt;Discuss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Join the community on &lt;a href=&#34;http://slack.k8s.io/&#34;&gt;Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Post questions (or answer questions) on &lt;a href=&#34;https://serverfault.com/questions/tagged/kubernetes&#34;&gt;Server Fault&lt;/a&gt; or &lt;a href=&#34;http://stackoverflow.com/questions/tagged/kubernetes&#34;&gt;Stack Overflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Share your Kubernetes &lt;a href=&#34;https://docs.google.com/a/linuxfoundation.org/forms/d/e/1FAIpQLScuI7Ye3VQHQTwBASrgkjQDSS5TP0g3AXfFhwSM9YpHgxRKFA/viewform&#34;&gt;story&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Read more about what’s happening with Kubernetes on the &lt;a href=&#34;https://kubernetes.io/blog/&#34;&gt;blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Learn more about the &lt;a href=&#34;https://github.com/kubernetes/sig-release/tree/master/release-team&#34;&gt;Kubernetes Release Team&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Fresh Swap Features for Linux Users in Kubernetes 1.32</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/25/swap-linux-improvements/</link>
      <pubDate>Tue, 25 Mar 2025 10:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/25/swap-linux-improvements/</guid>
      <description>
        
        
&lt;p&gt;Swap is a fundamental and invaluable Linux feature.
It offers numerous benefits, such as effectively increasing a node’s memory by
swapping out unused data,
shielding nodes from system-level memory spikes,
preventing Pods from crashing when they hit their memory limits,
and &lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#user-stories&#34;&gt;much more&lt;/a&gt;.
As a result, the node special interest group within the Kubernetes project
has invested significant effort into supporting swap on Linux nodes.&lt;/p&gt;
&lt;p&gt;The 1.22 release &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2021/08/09/run-nodes-with-swap-alpha/&#34;&gt;introduced&lt;/a&gt; Alpha support
for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis.
Later, in release 1.28, support for swap on Linux nodes graduated to Beta, along with many
new improvements.
In the following Kubernetes releases more improvements were made, paving the way
to GA in the near future.&lt;/p&gt;
&lt;p&gt;Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems.
This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization
when swap memory was involved. As a result, swap support was deemed out of scope in the initial
design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory
was detected on a node.&lt;/p&gt;
&lt;p&gt;In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage.
This provided Linux users the opportunity to experiment with the swap feature for the first time.
However, as an Alpha version, it was not fully developed and worked only partially, in limited environments.&lt;/p&gt;
&lt;p&gt;In version 1.28 swap support on Linux nodes was promoted to Beta.
The Beta version was a drastic leap forward.
Not only did it fix a large number of bugs and make swap work in a stable way,
but it also brought cgroup v2 support and introduced a wide variety of tests,
including complex scenarios such as node-level pressure, and more.
It also brought many exciting new capabilities such as the &lt;code&gt;LimitedSwap&lt;/code&gt; behavior
which sets an auto-calculated swap limit to containers, OpenMetrics instrumentation
support (through the &lt;code&gt;/metrics/resource&lt;/code&gt; endpoint) and Summary API for
VerticalPodAutoscalers (through the &lt;code&gt;/stats/summary&lt;/code&gt; endpoint), and more.&lt;/p&gt;
&lt;p&gt;Today we are working on more improvements, paving the way for GA.
Currently, the focus is especially towards ensuring node stability,
enhanced debug abilities, addressing user feedback,
polishing the feature and making it stable.
For example, in order to increase stability, containers in high-priority pods
cannot access swap, which ensures the memory they need is ready to use.
In addition, the &lt;code&gt;UnlimitedSwap&lt;/code&gt; behavior was removed since it might compromise
the node&#39;s health.
Secret content protection against swapping has also been introduced
(see relevant &lt;a href=&#34;#memory-backed-volumes&#34;&gt;security-risk section&lt;/a&gt; for more info).&lt;/p&gt;
&lt;p&gt;To conclude, compared to previous releases, the kubelet&#39;s support for running with swap enabled
is more stable and robust, more user-friendly, and addresses many known shortcomings.
That said, the NodeSwap feature introduces basic swap support, and this is just the beginning.
In the near future, additional features are planned to enhance swap functionality in various ways,
such as improving evictions, extending the API, increasing customizability, and more!&lt;/p&gt;
&lt;h2 id=&#34;how-do-i-use-it&#34;&gt;How do I use it?&lt;/h2&gt;
&lt;p&gt;In order for the kubelet to initialize on a swap-enabled node, the &lt;code&gt;failSwapOn&lt;/code&gt;
field must be set to &lt;code&gt;false&lt;/code&gt; in the kubelet&#39;s configuration, or the deprecated
&lt;code&gt;--fail-swap-on&lt;/code&gt; command line flag must be set to &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is possible to configure the &lt;code&gt;memorySwap.swapBehavior&lt;/code&gt; option to define the
manner in which a node utilizes swap memory.
For instance,&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# this fragment goes into the kubelet&amp;#39;s configuration file&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memorySwap&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;swapBehavior&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;LimitedSwap&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The currently available configuration options for &lt;code&gt;swapBehavior&lt;/code&gt; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NoSwap&lt;/code&gt; (default): Kubernetes workloads cannot use swap. However, processes
outside of Kubernetes&#39; scope, like system daemons (such as kubelet itself!) can utilize swap.
This behavior is beneficial for protecting the node from system-level memory spikes,
but it does not safeguard the workloads themselves from such spikes.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LimitedSwap&lt;/code&gt;: Kubernetes workloads can utilize swap memory, but with certain limitations.
The amount of swap available to a Pod is determined automatically,
based on the proportion of the memory requested relative to the node&#39;s total memory.
Only non-high-priority Pods under the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/pod-qos/#burstable&#34;&gt;Burstable&lt;/a&gt;
Quality of Service (QoS) tier are permitted to use swap.
For more details, see the &lt;a href=&#34;#how-is-the-swap-limit-being-determined-with-limitedswap&#34;&gt;section below&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If configuration for &lt;code&gt;memorySwap&lt;/code&gt; is not specified,
by default the kubelet will apply the same behaviour as the &lt;code&gt;NoSwap&lt;/code&gt; setting.&lt;/p&gt;
&lt;p&gt;On Linux nodes, Kubernetes only supports running with swap enabled for hosts that use cgroup v2.
On cgroup v1 systems, no Kubernetes workloads are allowed to use swap memory.&lt;/p&gt;
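&lt;p&gt;To check which cgroup version a node uses, you can inspect the filesystem type mounted at &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;: on cgroup v2 hosts it is the unified &lt;code&gt;cgroup2fs&lt;/code&gt; hierarchy, while on cgroup v1 hosts it is typically a &lt;code&gt;tmpfs&lt;/code&gt; holding the per-controller hierarchies:&lt;/p&gt;

```shell
# Print the filesystem type mounted at /sys/fs/cgroup.
# "cgroup2fs" indicates cgroup v2 (required for swap support);
# "tmpfs" indicates a legacy cgroup v1 layout.
stat -fc %T /sys/fs/cgroup
```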
&lt;h2 id=&#34;install-a-swap-enabled-cluster-with-kubeadm&#34;&gt;Install a swap-enabled cluster with kubeadm&lt;/h2&gt;
&lt;h3 id=&#34;before-you-begin&#34;&gt;Before you begin&lt;/h3&gt;
&lt;p&gt;This demo requires the kubeadm tool to be installed, following the steps outlined in the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/&#34;&gt;kubeadm installation guide&lt;/a&gt;.
If swap is already enabled on the node, cluster creation may proceed.
If swap is not enabled, please refer to the provided instructions for enabling swap.&lt;/p&gt;
&lt;h3 id=&#34;create-a-swap-file-and-turn-swap-on&#34;&gt;Create a swap file and turn swap on&lt;/h3&gt;
&lt;p&gt;I&#39;ll demonstrate creating 4 GiB of swap, in both the encrypted and unencrypted cases.&lt;/p&gt;
&lt;h4 id=&#34;setting-up-unencrypted-swap&#34;&gt;Setting up unencrypted swap&lt;/h4&gt;
&lt;p&gt;An unencrypted swap file can be set up as follows.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Allocate storage and restrict access&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;fallocate --length 4GiB /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;chmod &lt;span style=&#34;color:#666&#34;&gt;600&lt;/span&gt; /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Format the swap space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;mkswap /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Activate the swap space for paging&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;swapon /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id=&#34;setting-up-encrypted-swap&#34;&gt;Setting up encrypted swap&lt;/h4&gt;
&lt;p&gt;An encrypted swap file can be set up as follows.
Bear in mind that this example uses the &lt;code&gt;cryptsetup&lt;/code&gt; binary (which is available
on most Linux distributions).&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Allocate storage and restrict access&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;fallocate --length 4GiB /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;chmod &lt;span style=&#34;color:#666&#34;&gt;600&lt;/span&gt; /swapfile
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Create an encrypted device backed by the allocated storage&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cryptsetup --type plain --cipher aes-xts-plain64 --key-size &lt;span style=&#34;color:#666&#34;&gt;256&lt;/span&gt; -d /dev/urandom open /swapfile cryptswap
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Format the swap space&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;mkswap /dev/mapper/cryptswap
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Activate the swap space for paging&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;swapon /dev/mapper/cryptswap
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id=&#34;verify-that-swap-is-enabled&#34;&gt;Verify that swap is enabled&lt;/h4&gt;
&lt;p&gt;You can verify that swap is enabled with either the &lt;code&gt;swapon -s&lt;/code&gt; command or the &lt;code&gt;free&lt;/code&gt; command:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;gt; swapon -s
Filename				Type		Size		Used		Priority
/dev/dm-0                               partition	4194300		0		-2
&lt;/code&gt;&lt;/pre&gt;&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;&amp;gt; free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       1.3Gi       249Mi        25Mi       2.5Gi       2.5Gi
Swap:          4.0Gi          0B       4.0Gi
&lt;/code&gt;&lt;/pre&gt;&lt;h4 id=&#34;enable-swap-on-boot&#34;&gt;Enable swap on boot&lt;/h4&gt;
&lt;p&gt;After setting up swap, to start the swap file at boot time,
you either set up a systemd unit to activate (encrypted) swap, or you
add a line similar to &lt;code&gt;/swapfile swap swap defaults 0 0&lt;/code&gt; into &lt;code&gt;/etc/fstab&lt;/code&gt;.&lt;/p&gt;
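&lt;p&gt;For the systemd route, a minimal swap unit could look like the following sketch; note that a swap unit&#39;s file name must match the systemd-escaped path of the device or file it activates (for &lt;code&gt;/swapfile&lt;/code&gt; that is &lt;code&gt;swapfile.swap&lt;/code&gt;):&lt;/p&gt;

```ini
# /etc/systemd/system/swapfile.swap
[Unit]
Description=Swap file for the node

[Swap]
What=/swapfile

[Install]
WantedBy=swap.target
```

&lt;p&gt;Enable it with &lt;code&gt;systemctl enable --now swapfile.swap&lt;/code&gt;.&lt;/p&gt;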
&lt;h3 id=&#34;set-up-a-kubernetes-cluster-that-uses-swap-enabled-nodes&#34;&gt;Set up a Kubernetes cluster that uses swap-enabled nodes&lt;/h3&gt;
&lt;p&gt;To make things clearer, here is an example kubeadm configuration file &lt;code&gt;kubeadm-config.yaml&lt;/code&gt; for the swap enabled cluster.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;kubeadm.k8s.io/v1beta3&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;InitConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;---&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeletConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;failSwapOn&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;false&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;memorySwap&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;swapBehavior&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;LimitedSwap&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then create a single-node cluster using &lt;code&gt;kubeadm init --config kubeadm-config.yaml&lt;/code&gt;.
During init, a warning is shown because swap is enabled on the node, and another in case the kubelet&#39;s
&lt;code&gt;failSwapOn&lt;/code&gt; is set to &lt;code&gt;true&lt;/code&gt;. We plan to remove this warning in a future release.&lt;/p&gt;
&lt;h2 id=&#34;how-is-the-swap-limit-being-determined-with-limitedswap&#34;&gt;How is the swap limit being determined with LimitedSwap?&lt;/h2&gt;
&lt;p&gt;The configuration of swap memory, including its limitations, presents a significant
challenge. Not only is it prone to misconfiguration, but as a system-level property, any
misconfiguration could potentially compromise the entire node rather than just a specific
workload. To mitigate this risk and ensure the health of the node, we have implemented
swap support with automatically configured limitations.&lt;/p&gt;
&lt;p&gt;With &lt;code&gt;LimitedSwap&lt;/code&gt;, Pods that do not fall under the Burstable QoS classification (i.e.
&lt;code&gt;BestEffort&lt;/code&gt;/&lt;code&gt;Guaranteed&lt;/code&gt; QoS Pods) are prohibited from utilizing swap memory.
&lt;code&gt;BestEffort&lt;/code&gt; QoS Pods exhibit unpredictable memory consumption patterns and lack
information regarding their memory usage, making it difficult to determine a safe
allocation of swap memory.
Conversely, &lt;code&gt;Guaranteed&lt;/code&gt; QoS Pods are typically employed for applications that rely on the
precise allocation of resources specified by the workload, with memory being immediately available.
To maintain the aforementioned security and node health guarantees,
these Pods are not permitted to use swap memory when &lt;code&gt;LimitedSwap&lt;/code&gt; is in effect.
In addition, high-priority pods are not permitted to use swap in order to ensure that the memory
they consume always stays resident in RAM, hence ready to use.&lt;/p&gt;
&lt;p&gt;Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nodeTotalMemory&lt;/code&gt;: The total amount of physical memory available on the node.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;totalPodsSwapAvailable&lt;/code&gt;: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;containerMemoryRequest&lt;/code&gt;: The container&#39;s memory request.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Swap limitation is configured as:
&lt;code&gt;(containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;In other words, the amount of swap that a container is able to use is proportionate to its
memory request, the node&#39;s total physical memory and the total amount of swap memory on
the node that is available for use by Pods.&lt;/p&gt;
&lt;p&gt;It is important to note that, for containers within Burstable QoS Pods, it is possible to
opt-out of swap usage by specifying memory requests that are equal to memory limits.
Containers configured in this manner will not have access to swap memory.&lt;/p&gt;
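&lt;p&gt;As a worked example of the formula above (a plain arithmetic sketch, not kubelet code; the function and variable names are descriptive only):&lt;/p&gt;

```python
# Per-container swap limit under LimitedSwap:
# (containerMemoryRequest / nodeTotalMemory) * totalPodsSwapAvailable

def swap_limit(container_memory_request: int,
               node_total_memory: int,
               total_pods_swap_available: int) -> int:
    # Integer arithmetic in bytes; multiply first to avoid losing precision.
    return container_memory_request * total_pods_swap_available // node_total_memory

GiB = 1024 ** 3
# A container requesting 8 GiB on a 64 GiB node, with 16 GiB of swap
# available to Pods, may use 8/64 = 1/8 of that swap: 2 GiB.
print(swap_limit(8 * GiB, 64 * GiB, 16 * GiB) // GiB)  # 2
```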
&lt;h2 id=&#34;how-does-it-work&#34;&gt;How does it work?&lt;/h2&gt;
&lt;p&gt;There are a number of possible ways that one could envision swap use on a node.
When swap is already provisioned and available on a node,
the kubelet can be configured so that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It can start with swap on.&lt;/li&gt;
&lt;li&gt;It will direct the Container Runtime Interface to allocate zero swap memory
to Kubernetes workloads by default.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Swap configuration on a node is exposed to a cluster admin via the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/config-api/kubelet-config.v1/&#34;&gt;&lt;code&gt;memorySwap&lt;/code&gt; in the KubeletConfiguration&lt;/a&gt;.
As a cluster administrator, you can specify the node&#39;s behaviour in the
presence of swap memory by setting &lt;code&gt;memorySwap.swapBehavior&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The kubelet employs the &lt;a href=&#34;https://kubernetes.io/docs/concepts/architecture/cri/&#34;&gt;CRI&lt;/a&gt;
(container runtime interface) API, and directs the container runtime to
configure specific cgroup v2 parameters (such as &lt;code&gt;memory.swap.max&lt;/code&gt;) in a manner that will
enable the desired swap configuration for a container. For runtimes that use control groups,
the container runtime is then responsible for writing these settings to the container-level cgroup.&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-monitor-swap&#34;&gt;How can I monitor swap?&lt;/h2&gt;
&lt;h3 id=&#34;node-and-container-level-metric-statistics&#34;&gt;Node and container level metric statistics&lt;/h3&gt;
&lt;p&gt;Kubelet now collects node and container level metric statistics,
which can be accessed at the &lt;code&gt;/metrics/resource&lt;/code&gt; (which is used mainly by monitoring
tools like Prometheus) and &lt;code&gt;/stats/summary&lt;/code&gt; (which is used mainly by Autoscalers) kubelet HTTP endpoints.
This allows clients who can directly interrogate the kubelet to
monitor swap usage and remaining swap memory when using &lt;code&gt;LimitedSwap&lt;/code&gt;.
Additionally, a &lt;code&gt;machine_swap_bytes&lt;/code&gt; metric has been added to cadvisor to show
the total physical swap capacity of the machine.
See &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/instrumentation/node-metrics/&#34;&gt;this page&lt;/a&gt; for more info.&lt;/p&gt;
&lt;h3 id=&#34;node-feature-discovery&#34;&gt;Node Feature Discovery (NFD)&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/kubernetes-sigs/node-feature-discovery&#34;&gt;Node Feature Discovery&lt;/a&gt;
is a Kubernetes addon for detecting hardware features and configuration.
It can be utilized to discover which nodes are provisioned with swap.&lt;/p&gt;
&lt;p&gt;As an example, to figure out which nodes are provisioned with swap,
use the following command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;kubectl get nodes -o &lt;span style=&#34;color:#b8860b&#34;&gt;jsonpath&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#39;{range .items[?(@.metadata.labels.feature\.node\.kubernetes\.io/memory-swap)]}{.metadata.name}{&amp;#34;\t&amp;#34;}{.metadata.labels.feature\.node\.kubernetes\.io/memory-swap}{&amp;#34;\n&amp;#34;}{end}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This will result in an output similar to:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;k8s-worker1: true
k8s-worker2: true
k8s-worker3: false
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;In this example, swap is provisioned on nodes &lt;code&gt;k8s-worker1&lt;/code&gt; and &lt;code&gt;k8s-worker2&lt;/code&gt;, but not on &lt;code&gt;k8s-worker3&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;caveats&#34;&gt;Caveats&lt;/h2&gt;
&lt;p&gt;Having swap available on a system reduces predictability.
While swap can enhance performance by making more RAM available, swapping data
back to memory is a heavy operation, sometimes slower by many orders of magnitude,
which can cause unexpected performance regressions.
Furthermore, swap changes a system&#39;s behaviour under memory pressure.
Enabling swap increases the risk of noisy neighbors,
where Pods that frequently use their RAM may cause other Pods to swap.
In addition, since swap allows workloads in Kubernetes to use memory that cannot be predictably
accounted for, and because of the unexpected packing configurations this can produce,
the scheduler currently does not account for swap memory usage.
This further heightens the risk of noisy neighbors.&lt;/p&gt;
&lt;p&gt;The performance of a node with swap memory enabled depends on the underlying physical storage.
When swap memory is in use, performance will be significantly worse in an I/O
operations per second (IOPS) constrained environment, such as a cloud VM with
I/O throttling, when compared to faster storage mediums like solid-state drives
or NVMe.
As swap might cause I/O pressure, it is recommended to give a higher I/O latency
priority to system-critical daemons. See the relevant part of the
&lt;a href=&#34;#good-practice-for-using-swap-in-a-kubernetes-cluster&#34;&gt;recommended practices&lt;/a&gt; section below.&lt;/p&gt;
&lt;h3 id=&#34;memory-backed-volumes&#34;&gt;Memory-backed volumes&lt;/h3&gt;
&lt;p&gt;On Linux nodes, memory-backed volumes (such as &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/configuration/secret/&#34;&gt;&lt;code&gt;secret&lt;/code&gt;&lt;/a&gt;
volume mounts, or &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volumes/#emptydir&#34;&gt;&lt;code&gt;emptyDir&lt;/code&gt;&lt;/a&gt; with &lt;code&gt;medium: Memory&lt;/code&gt;)
are implemented with a &lt;code&gt;tmpfs&lt;/code&gt; filesystem.
The contents of such volumes should remain in memory at all times, hence should
not be swapped to disk.
To ensure the contents of such volumes remain in memory, the &lt;code&gt;noswap&lt;/code&gt; tmpfs option
is used.&lt;/p&gt;
&lt;p&gt;The Linux kernel officially supports the &lt;code&gt;noswap&lt;/code&gt; option from version 6.3 (more info
can be found in &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/node/kernel-version-requirements/#requirements-other&#34;&gt;Linux Kernel Version Requirements&lt;/a&gt;).
However, different distributions often backport this mount option to older
Linux versions as well.&lt;/p&gt;
&lt;p&gt;In order to verify whether the node supports the &lt;code&gt;noswap&lt;/code&gt; option, the kubelet will do the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the kernel&#39;s version is above 6.3 then the &lt;code&gt;noswap&lt;/code&gt; option will be assumed to be supported.&lt;/li&gt;
&lt;li&gt;Otherwise, the kubelet will try to mount a dummy tmpfs with the &lt;code&gt;noswap&lt;/code&gt; option at startup.
If the mount fails with an error indicating an unknown option, &lt;code&gt;noswap&lt;/code&gt; will be assumed
to be unsupported, and hence will not be used.
If the mount succeeds, the dummy tmpfs will be deleted and the &lt;code&gt;noswap&lt;/code&gt; option will be used.
&lt;ul&gt;
&lt;li&gt;If the &lt;code&gt;noswap&lt;/code&gt; option is not supported, kubelet will emit a warning log entry,
then continue its execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
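&lt;p&gt;You can perform a similar probe manually on a node to check whether its kernel accepts the
&lt;code&gt;noswap&lt;/code&gt; option (the mount point below is arbitrary; requires root):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Probe whether this kernel accepts the noswap tmpfs mount option.
mkdir -p /tmp/noswap-probe
if mount -t tmpfs -o noswap tmpfs /tmp/noswap-probe; then
  echo noswap is supported
  umount /tmp/noswap-probe
else
  echo noswap is not supported
fi
&lt;/code&gt;&lt;/pre&gt;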
&lt;p&gt;It is strongly encouraged to encrypt the swap space.
See the &lt;a href=&#34;#setting-up-encrypted-swap&#34;&gt;section above&lt;/a&gt; for an example of setting up encrypted swap.
However, handling encrypted swap is not within the scope of the kubelet;
rather, it is a general OS configuration concern and should be addressed at that level.
It is the administrator&#39;s responsibility to provision encrypted swap to mitigate this risk.&lt;/p&gt;
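&lt;p&gt;As an illustration only (the device name is an example; adapt the commands to your
distribution), an ephemeral encrypted swap device can be set up with &lt;code&gt;cryptsetup&lt;/code&gt;,
using a random key that is discarded on reboot:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Assumes /dev/sdb is a dedicated swap disk -- adjust for your system.
# Map the disk through dm-crypt with a random, reboot-ephemeral key.
cryptsetup open --type plain --key-file /dev/urandom /dev/sdb cryptswap
mkswap /dev/mapper/cryptswap   # format the mapped device as swap
swapon /dev/mapper/cryptswap   # enable it
&lt;/code&gt;&lt;/pre&gt;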
&lt;h2 id=&#34;good-practice-for-using-swap-in-a-kubernetes-cluster&#34;&gt;Good practice for using swap in a Kubernetes cluster&lt;/h2&gt;
&lt;h3 id=&#34;disable-swap-for-system-critical-daemons&#34;&gt;Disable swap for system-critical daemons&lt;/h3&gt;
&lt;p&gt;During the testing phase and based on user feedback, it was observed that the performance
of system-critical daemons and services might degrade when they are swapped out.
This implies that system daemons, including the kubelet, could operate slower than usual.
If this issue is encountered, it is advisable to configure the cgroup of the system slice
to prevent swapping (i.e., set &lt;code&gt;memory.swap.max=0&lt;/code&gt;).&lt;/p&gt;
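&lt;p&gt;For systemd-based nodes, one way to achieve this is a drop-in for the system slice
(the drop-in path below is an example; &lt;code&gt;MemorySwapMax=&lt;/code&gt; requires cgroup v2):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/systemd/system/system.slice.d/99-noswap.conf (example path)
# Sets memory.swap.max=0 for the system slice via systemd.
[Slice]
MemorySwapMax=0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;systemctl daemon-reload&lt;/code&gt; afterwards for the change to take effect.&lt;/p&gt;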
&lt;h3 id=&#34;protect-system-critical-daemons-for-i-o-latency&#34;&gt;Protect system-critical daemons from I/O latency&lt;/h3&gt;
&lt;p&gt;Swap can increase the I/O load on a node.
When memory pressure causes the kernel to rapidly swap pages in and out,
system-critical daemons and services that rely on I/O operations may
experience performance degradation.&lt;/p&gt;
&lt;p&gt;To mitigate this, it is recommended for systemd users to prioritize the system slice in terms of I/O latency.
For non-systemd users,
setting up a dedicated cgroup for system daemons and processes and prioritizing I/O latency in the same way is advised.
This can be achieved by setting &lt;code&gt;io.latency&lt;/code&gt; for the system slice,
thereby granting it higher I/O priority.
See &lt;a href=&#34;https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v2.rst&#34;&gt;cgroup&#39;s documentation&lt;/a&gt; for more info.&lt;/p&gt;
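&lt;p&gt;With systemd, for example, the &lt;code&gt;io.latency&lt;/code&gt; target can be set through
&lt;code&gt;IODeviceLatencyTargetSec=&lt;/code&gt; (the device path and target value below are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# /etc/systemd/system/system.slice.d/99-io-latency.conf (example path)
# Grants the system slice an I/O latency target on the given device,
# which systemd translates into the cgroup v2 io.latency controller.
[Slice]
IODeviceLatencyTargetSec=/dev/sda 25ms
&lt;/code&gt;&lt;/pre&gt;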
&lt;h3 id=&#34;swap-and-control-plane-nodes&#34;&gt;Swap and control plane nodes&lt;/h3&gt;
&lt;p&gt;The Kubernetes project recommends running control plane nodes without any swap space configured.
The control plane primarily hosts Guaranteed QoS Pods, so swap can generally be disabled.
The main concern is that swapping critical services on the control plane could negatively impact performance.&lt;/p&gt;
&lt;h3 id=&#34;use-of-a-dedicated-disk-for-swap&#34;&gt;Use of a dedicated disk for swap&lt;/h3&gt;
&lt;p&gt;It is recommended to use a separate, encrypted disk for the swap partition.
If swap resides on a partition or the root filesystem, workloads may interfere
with system processes that need to write to disk.
When they share the same disk, processes can overwhelm swap,
disrupting the I/O of the kubelet, the container runtime, and systemd, which would impact other workloads.
Since swap space is located on a disk, it is crucial to ensure the disk is fast enough for the intended use cases.
Alternatively, one can configure I/O priorities between different mapped areas of a single backing device.&lt;/p&gt;
&lt;h2 id=&#34;looking-ahead&#34;&gt;Looking ahead&lt;/h2&gt;
&lt;p&gt;As you can see, the swap feature has been dramatically improved recently,
paving the way for its graduation to GA.
However, this is just the beginning:
it&#39;s a foundational implementation that marks the start of enhanced swap functionality.&lt;/p&gt;
&lt;p&gt;In the near future, additional features are planned to further improve swap capabilities,
including better eviction mechanisms, extended API support, increased customizability,
better debugging capabilities, and more!&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;p&gt;You can review the current &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/architecture/nodes/#swap-memory&#34;&gt;documentation&lt;/a&gt;
for using swap with Kubernetes.&lt;/p&gt;
&lt;p&gt;For more information, please see &lt;a href=&#34;https://github.com/kubernetes/enhancements/issues/4128&#34;&gt;KEP-2400&lt;/a&gt; and its
&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md&#34;&gt;design proposal&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;Your feedback is always welcome! SIG Node &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node#meetings&#34;&gt;meets regularly&lt;/a&gt;
and &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-node#contact&#34;&gt;can be reached&lt;/a&gt;
via &lt;a href=&#34;https://slack.k8s.io/&#34;&gt;Slack&lt;/a&gt; (channel &lt;strong&gt;#sig-node&lt;/strong&gt;), or the SIG&#39;s
&lt;a href=&#34;https://groups.google.com/forum/#!forum/kubernetes-sig-node&#34;&gt;mailing list&lt;/a&gt;. A Slack
channel dedicated to swap is also available at &lt;strong&gt;#sig-node-swap&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Feel free to reach out to me, Itamar Holder (&lt;strong&gt;@iholder101&lt;/strong&gt; on Slack and GitHub)
if you&#39;d like to help or ask further questions.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Ingress-nginx CVE-2025-1974: What You Need to Know</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/24/ingress-nginx-cve-2025-1974/</link>
      <pubDate>Mon, 24 Mar 2025 12:00:00 -0800</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/24/ingress-nginx-cve-2025-1974/</guid>
      <description>
        
        
        &lt;p&gt;Today, the ingress-nginx maintainers have released patches for a batch of critical vulnerabilities that could make it easy for attackers to take over your Kubernetes cluster: &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.12.1&#34;&gt;ingress-nginx v1.12.1&lt;/a&gt; and &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.11.5&#34;&gt;ingress-nginx v1.11.5&lt;/a&gt;. If you are among the over 40% of Kubernetes administrators using &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/&#34;&gt;ingress-nginx&lt;/a&gt;, you should take action immediately to protect your users and data.&lt;/p&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/ingress/&#34;&gt;Ingress&lt;/a&gt; is the traditional Kubernetes feature for exposing your workload Pods to the world so that they can be useful. In an implementation-agnostic way, Kubernetes users can define how their applications should be made available on the network. Then, an &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/ingress-controllers/&#34;&gt;ingress controller&lt;/a&gt; uses that definition to set up local or cloud resources as required for the user’s particular situation and needs.&lt;/p&gt;
&lt;p&gt;Many different ingress controllers are available, to suit users of different cloud providers or brands of load balancers. Ingress-nginx is a software-only ingress controller provided by the Kubernetes project. Because of its versatility and ease of use, ingress-nginx is quite popular: it is deployed in over 40% of Kubernetes clusters!&lt;/p&gt;
&lt;p&gt;Ingress-nginx translates the requirements from Ingress objects into configuration for nginx, a powerful open source webserver daemon. Then, nginx uses that configuration to accept and route requests to the various applications running within a Kubernetes cluster. Proper handling of these nginx configuration parameters is crucial, because ingress-nginx needs to allow users significant flexibility while preventing them from accidentally or intentionally tricking nginx into doing things it shouldn’t.&lt;/p&gt;
&lt;h2 id=&#34;vulnerabilities-patched-today&#34;&gt;Vulnerabilities Patched Today&lt;/h2&gt;
&lt;p&gt;Four of today’s ingress-nginx vulnerabilities are improvements to how ingress-nginx handles particular bits of nginx config. Without these fixes, a specially-crafted Ingress object can cause nginx to misbehave in various ways, including revealing the values of &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/configuration/secret/&#34;&gt;Secrets&lt;/a&gt; that are accessible to ingress-nginx. By default, ingress-nginx has access to all Secrets cluster-wide, so this can often lead to complete cluster takeover by any user or entity that has permission to create an Ingress.&lt;/p&gt;
&lt;p&gt;The most serious of today’s vulnerabilities, &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131009&#34;&gt;CVE-2025-1974&lt;/a&gt;, rated &lt;a href=&#34;https://www.first.org/cvss/calculator/3-1#CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H&#34;&gt;9.8 CVSS&lt;/a&gt;, allows anything on the Pod network to exploit configuration injection vulnerabilities via the Validating Admission Controller feature of ingress-nginx. This makes such vulnerabilities far more dangerous: ordinarily one would need to be able to create an Ingress object in the cluster, which is a fairly privileged action. When combined with today’s other vulnerabilities, &lt;strong&gt;CVE-2025-1974 means that anything on the Pod network has a good chance of taking over your Kubernetes cluster, with no credentials or administrative access required&lt;/strong&gt;. In many common scenarios, the Pod network is accessible to all workloads in your cloud VPC, or even anyone connected to your corporate network! This is a very serious situation.&lt;/p&gt;
&lt;p&gt;Today, we have released &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.12.1&#34;&gt;ingress-nginx v1.12.1&lt;/a&gt; and &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.11.5&#34;&gt;ingress-nginx v1.11.5&lt;/a&gt;, which have fixes for all five of these vulnerabilities.&lt;/p&gt;
&lt;h2 id=&#34;your-next-steps&#34;&gt;Your next steps&lt;/h2&gt;
&lt;p&gt;First, determine if your clusters are using ingress-nginx. In most cases, you can check this by running &lt;code&gt;kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx&lt;/code&gt; with cluster administrator permissions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you are using ingress-nginx, make a plan to remediate these vulnerabilities immediately.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The best and easiest remedy is to &lt;a href=&#34;https://kubernetes.github.io/ingress-nginx/deploy/upgrade/&#34;&gt;upgrade to the new patch release of ingress-nginx&lt;/a&gt;.&lt;/strong&gt; All five of today’s vulnerabilities are fixed by installing today’s patches.&lt;/p&gt;
&lt;p&gt;If you can’t upgrade right away, you can significantly reduce your risk by turning off the Validating Admission Controller feature of ingress-nginx.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you have installed ingress-nginx using Helm
&lt;ul&gt;
&lt;li&gt;Reinstall, setting the Helm value &lt;code&gt;controller.admissionWebhooks.enabled=false&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If you have installed ingress-nginx manually
&lt;ul&gt;
&lt;li&gt;delete the ValidatingWebhookConfiguration called &lt;code&gt;ingress-nginx-admission&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;edit the &lt;code&gt;ingress-nginx-controller&lt;/code&gt; Deployment or DaemonSet, removing &lt;code&gt;--validating-webhook&lt;/code&gt; from the controller container’s argument list&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
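&lt;p&gt;As a sketch (the release name and namespace below are assumptions; match them to your
installation), these mitigations look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Helm install: disable the admission webhooks
# (assumes a release named ingress-nginx in the ingress-nginx namespace)
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --reuse-values \
  --set controller.admissionWebhooks.enabled=false

# Manual install: remove the webhook configuration and the controller flag
kubectl delete validatingwebhookconfiguration ingress-nginx-admission
kubectl edit deployment ingress-nginx-controller   # remove --validating-webhook from the args
&lt;/code&gt;&lt;/pre&gt;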
&lt;p&gt;If you turn off the Validating Admission Controller feature as a mitigation for CVE-2025-1974, remember to turn it back on after you upgrade. This feature provides important quality of life improvements for your users, warning them about incorrect Ingress configurations before they can take effect.&lt;/p&gt;
&lt;h2 id=&#34;conclusion-thanks-and-further-reading&#34;&gt;Conclusion, thanks, and further reading&lt;/h2&gt;
&lt;p&gt;The ingress-nginx vulnerabilities announced today, including CVE-2025-1974, present a serious risk to many Kubernetes users and their data. If you use ingress-nginx, you should take action immediately to keep yourself safe.&lt;/p&gt;
&lt;p&gt;Thanks go out to Nir Ohfeld, Sagi Tzadik, Ronen Shustin, and Hillai Ben-Sasson from Wiz for responsibly disclosing these vulnerabilities, and for working with the Kubernetes SRC members and ingress-nginx maintainers (Marco Ebert and James Strong) to ensure we fixed them effectively.&lt;/p&gt;
&lt;p&gt;For further information about the maintenance and future of ingress-nginx, please see this &lt;a href=&#34;https://github.com/kubernetes/ingress-nginx/issues/13002&#34;&gt;GitHub issue&lt;/a&gt; and/or attend &lt;a href=&#34;https://kccnceu2025.sched.com/event/1tcyc/&#34;&gt;James and Marco’s KubeCon/CloudNativeCon EU 2025 presentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For further information about the specific vulnerabilities discussed in this article, please see the appropriate GitHub issue: &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131005&#34;&gt;CVE-2025-24513&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131006&#34;&gt;CVE-2025-24514&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131007&#34;&gt;CVE-2025-1097&lt;/a&gt;, &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131008&#34;&gt;CVE-2025-1098&lt;/a&gt;, or &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/131009&#34;&gt;CVE-2025-1974&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This blog post was revised in May 2025 to update the hyperlinks.&lt;/em&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Introducing JobSet</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/23/introducing-jobset/</link>
      <pubDate>Sun, 23 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/23/introducing-jobset/</guid>
      <description>
        
        
        &lt;p&gt;&lt;strong&gt;Authors&lt;/strong&gt;: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)&lt;/p&gt;
&lt;p&gt;In this article, we introduce &lt;a href=&#34;https://jobset.sigs.k8s.io/&#34;&gt;JobSet&lt;/a&gt;, an open source API for
representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML
training and HPC workloads on Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;why-jobset&#34;&gt;Why JobSet?&lt;/h2&gt;
&lt;p&gt;The Kubernetes community’s recent enhancements to the batch ecosystem on Kubernetes have attracted ML
engineers who have found it to be a natural fit for the requirements of running distributed training
workloads.&lt;/p&gt;
&lt;p&gt;Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a
single host are often distributed across tens of thousands of accelerator chips, which in turn may
span thousands of hosts.&lt;/p&gt;
&lt;p&gt;As such, the model training code is often containerized and executed simultaneously on all these
hosts, performing distributed computations which often shard both the model parameters and/or the
training dataset across the target accelerator chips, using communication collective primitives like
all-gather and all-reduce to perform distributed computations and synchronize gradients between
hosts.&lt;/p&gt;
&lt;p&gt;These workload characteristics make Kubernetes a great fit for this type of workload, as efficiently
scheduling and managing the lifecycle of containerized applications across a cluster of compute
resources is an area where it shines.&lt;/p&gt;
&lt;p&gt;It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and
controllers which manage the behavior and life cycle of these objects, allowing engineers to develop
custom distributed training orchestration solutions to fit their needs.&lt;/p&gt;
&lt;p&gt;However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives no
longer adequately model them on their own.&lt;/p&gt;
&lt;p&gt;Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become
fragmented, and each of the existing solutions in this fragmented landscape has certain limitations
that make it non-optimal for distributed ML training.&lt;/p&gt;
&lt;p&gt;For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g.
PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution fitted
specifically to the target framework, each with different semantics and behavior.&lt;/p&gt;
&lt;p&gt;On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed
completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of
the most recent enhancements. However, running ML training and HPC workloads using the upstream Job
API requires extra orchestration to fill the following gaps:&lt;/p&gt;
&lt;p&gt;Multi-template Pods : Most HPC or ML training jobs include more than one type of Pod. The different
Pods are part of the same workload, but they need to run a different container, request different
resources, or have different failure policies. A common example is the driver-worker pattern.&lt;/p&gt;
&lt;p&gt;Job groups : Large scale training workloads span multiple network topologies, running across
multiple racks for example. Such workloads are network latency sensitive, and aim to localize
communication and minimize traffic crossing the higher-latency network links. To facilitate this,
the workload needs to be split into groups of Pods each assigned to a network topology.&lt;/p&gt;
&lt;p&gt;Inter-Pod communication : Create and manage the resources (e.g. &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/services-networking/service/#headless-services&#34;&gt;headless
Services&lt;/a&gt;) necessary to establish
communication between the Pods of a job.&lt;/p&gt;
&lt;p&gt;Startup sequencing : Some jobs require a specific start sequence of pods; sometimes the driver is
expected to start first (like Ray or Spark), in other cases the workers are expected to be ready
before starting the driver (like MPI).&lt;/p&gt;
&lt;p&gt;JobSet aims to address those gaps using the Job API as a building block to build a richer API for
large-scale distributed HPC and ML use cases.&lt;/p&gt;
&lt;h2 id=&#34;how-jobset-works&#34;&gt;How JobSet Works&lt;/h2&gt;
&lt;p&gt;JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to
easily specify different pod templates for different distinct groups of pods (e.g. a leader,
workers, parameter servers, etc.).&lt;/p&gt;
&lt;p&gt;It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is
essentially a Job Template with some desired number of Job replicas specified. This provides a
declarative way to easily create identical child-jobs to run on different islands of accelerators,
without resorting to scripting or Helm charts to generate many versions of the same job but with
different names.&lt;/p&gt;


&lt;figure class=&#34;diagram-large clickable-zoom&#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/23/introducing-jobset/jobset_diagram.svg&#34;
         alt=&#34;JobSet Architecture&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;Some other key JobSet features which address the problems described above include:&lt;/p&gt;
&lt;p&gt;Replicated Jobs : In modern data centers, hardware accelerators like GPUs and TPUs are allocated in
islands of homogeneous accelerators connected via specialized, high-bandwidth network links. For
example, a user might provision nodes containing a group of hosts co-located on a rack, each with
H100 GPUs, where GPU chips within each host are connected via NVLink, with a NVLink Switch
connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of
64 hosts, each with 4 TPU v5e chips attached, all connected via ICI mesh. When running a distributed
training job across multiple of these islands, we often want to partition the workload into a group
of smaller identical jobs, 1 per island, where each pod primarily communicates with the pods within
the same island to do segments of distributed computation, while keeping the gradient synchronization
over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum.&lt;/p&gt;
&lt;p&gt;Automatic headless service creation, configuration, and lifecycle management : Pod-to-pod
communication via pod hostname is enabled by default, with automatic configuration and lifecycle
management of the headless service enabling this.&lt;/p&gt;
&lt;p&gt;Configurable success policies : JobSet has configurable success policies which target specific
ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can
configure the JobSet to be marked complete if and only if all pods that are part of the “worker”
ReplicatedJob are completed.&lt;/p&gt;
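&lt;p&gt;For instance, a success policy targeting a &lt;code&gt;worker&lt;/code&gt; ReplicatedJob might look like the
following sketch (field names per the JobSet &lt;code&gt;v1alpha2&lt;/code&gt; API; the ReplicatedJob name is an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# JobSet spec fragment: mark the JobSet complete only when
# all child jobs of the worker ReplicatedJob have completed.
spec:
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - worker
&lt;/code&gt;&lt;/pre&gt;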
&lt;p&gt;Configurable failure policies : JobSet has configurable failure policies which allow the user to
specify a maximum number of times the JobSet should be restarted in the event of a failure. If any
job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the
last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.&lt;/p&gt;
&lt;p&gt;Exclusive placement per topology domain : JobSet allows users to express that child jobs have 1:1
exclusive assignment to a topology domain, typically an accelerator island like a rack. For example,
if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job
will be co-located on the same island, and that only one child job is allowed to schedule per
island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training
strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices),
running 1 model replica in each accelerator island. It ensures that the forward and backward passes
within each model replica occur over the high-bandwidth interconnect linking the accelerator chips
within the island, and that only the gradient synchronization between model replicas crosses
accelerator islands over the lower-bandwidth data center network.&lt;/p&gt;
&lt;p&gt;Integration with Kueue : Users can submit JobSets via &lt;a href=&#34;https://kueue.sigs.k8s.io/&#34;&gt;Kueue&lt;/a&gt; to
oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial
scheduling and deadlocks, enable multi-tenancy, and more.&lt;/p&gt;
&lt;h2 id=&#34;example-use-case&#34;&gt;Example use case&lt;/h2&gt;
&lt;h3 id=&#34;distributed-ml-training-on-multiple-tpu-slices-with-jax&#34;&gt;Distributed ML training on multiple TPU slices with Jax&lt;/h3&gt;
&lt;p&gt;The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e
&lt;a href=&#34;https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#slices&#34;&gt;slices&lt;/a&gt;. To learn more about
TPU concepts and terminology, please refer to these
&lt;a href=&#34;https://cloud.google.com/tpu/docs/system-architecture-tpu-vm&#34;&gt;docs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This example uses &lt;a href=&#34;https://jax.readthedocs.io/en/latest/quickstart.html&#34;&gt;Jax&lt;/a&gt;, an ML framework with
native support for Just-In-Time (JIT) compilation targeting TPU chips via
&lt;a href=&#34;https://github.com/openxla&#34;&gt;OpenXLA&lt;/a&gt;. However, you can also use
&lt;a href=&#34;https://pytorch.org/xla/release/2.3/index.html&#34;&gt;PyTorch/XLA&lt;/a&gt; to do ML training on TPUs.&lt;/p&gt;
&lt;p&gt;This example makes use of several JobSet features (both explicitly and implicitly) to support the
unique scheduling requirements of TPU multislice training out-of-the-box with very little
configuration required by the user.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Run a simple Jax workload on &lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;jobset.x-k8s.io/v1alpha2&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;JobSet&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;multislice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Give each child Job exclusive usage of a TPU slice &lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;alpha.jobset.sigs.k8s.io/exclusive-topology&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;cloud.google.com/gke-nodepool&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;failurePolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;maxRestarts&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;3&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;replicatedJobs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;workers&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;replicas&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;4&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Set to number of TPU slices&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;template&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;parallelism&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Set to number of VMs per TPU slice&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;completions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# Set to number of VMs per TPU slice&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;backoffLimit&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;template&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;          &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;hostNetwork&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;dnsPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;ClusterFirstWithHostNet&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;nodeSelector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cloud.google.com/gke-tpu-accelerator&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;tpu-v5-lite-podslice&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cloud.google.com/gke-tpu-topology&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;2x4&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containers&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;            &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;jax-tpu&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;image&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;python:3.8&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;ports&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8471&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- &lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;containerPort&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;8080&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;securityContext&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;privileged&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;command&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- bash&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- -c&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;- |&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;                pip install &amp;#34;jax[tpu]&amp;#34; -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;                python -c &amp;#39;import jax; print(&amp;#34;Global device count:&amp;#34;, jax.device_count())&amp;#39;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#b44;font-style:italic&#34;&gt;                sleep 60&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;              &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;limits&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;                  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;google.com/tpu&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;4&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;future-work-and-getting-involved&#34;&gt;Future work and getting involved&lt;/h2&gt;
&lt;p&gt;We have a number of features on the JobSet roadmap planned for development this year, which can be
found in the &lt;a href=&#34;https://github.com/kubernetes-sigs/jobset?tab=readme-ov-file#roadmap&#34;&gt;JobSet roadmap&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Please feel free to reach out with feedback of any kind. We’re also open to additional contributors,
whether that’s fixing or reporting bugs, adding new features, or writing documentation.&lt;/p&gt;
&lt;p&gt;You can get in touch with us via our &lt;a href=&#34;http://sigs.k8s.io/jobset&#34;&gt;repo&lt;/a&gt;, &lt;a href=&#34;https://groups.google.com/a/kubernetes.io/g/wg-batch&#34;&gt;mailing
list&lt;/a&gt; or on
&lt;a href=&#34;https://kubernetes.slack.com/messages/wg-batch&#34;&gt;Slack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Last but not least, thanks to all &lt;a href=&#34;https://github.com/kubernetes-sigs/jobset/graphs/contributors&#34;&gt;our
contributors&lt;/a&gt; who made this project
possible!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Spotlight on SIG Apps</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/12/sig-apps-spotlight-2025/</link>
      <pubDate>Wed, 12 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/12/sig-apps-spotlight-2025/</guid>
      <description>
        
        
        &lt;p&gt;In our ongoing SIG Spotlight series, we dive into the heart of the Kubernetes project by talking to
the leaders of its various Special Interest Groups (SIGs). This time, we focus on
&lt;strong&gt;&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps#apps-special-interest-group&#34;&gt;SIG Apps&lt;/a&gt;&lt;/strong&gt;,
the group responsible for everything related to developing, deploying, and operating applications on
Kubernetes. &lt;a href=&#34;https://www.linkedin.com/in/sandipanpanda&#34;&gt;Sandipan Panda&lt;/a&gt;
(&lt;a href=&#34;https://www.devzero.io/&#34;&gt;DevZero&lt;/a&gt;) had the opportunity to interview &lt;a href=&#34;https://github.com/soltysh&#34;&gt;Maciej
Szulik&lt;/a&gt; (&lt;a href=&#34;https://defenseunicorns.com/&#34;&gt;Defense Unicorns&lt;/a&gt;) and &lt;a href=&#34;https://github.com/janetkuo&#34;&gt;Janet
Kuo&lt;/a&gt; (&lt;a href=&#34;https://about.google/&#34;&gt;Google&lt;/a&gt;), the chairs and tech leads of
SIG Apps. They shared their experiences, challenges, and visions for the future of application
management within the Kubernetes ecosystem.&lt;/p&gt;
&lt;h2 id=&#34;introductions&#34;&gt;Introductions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: Hello, could you start by telling us a bit about yourself, your role, and your journey
within the Kubernetes community that led to your current roles in SIG Apps?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maciej&lt;/strong&gt;: Hey, my name is Maciej, and I’m one of the leads for SIG Apps. Aside from this role, you
can also find me helping
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-cli#readme&#34;&gt;SIG CLI&lt;/a&gt; and also being one of
the Steering Committee members. I’ve been contributing to Kubernetes since late 2014 in various
areas, including controllers, apiserver, and kubectl.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Janet&lt;/strong&gt;: Certainly! I&#39;m Janet, a Staff Software Engineer at Google, and I&#39;ve been deeply involved
with the Kubernetes project since its early days, even before the 1.0 launch in 2015.  It&#39;s been an
amazing journey!&lt;/p&gt;
&lt;p&gt;My current role within the Kubernetes community is one of the chairs and tech leads of SIG Apps. My
journey with SIG Apps started organically. I started with building the Deployment API and adding
rolling update functionalities. I naturally gravitated towards SIG Apps and became increasingly
involved. Over time, I took on more responsibilities, culminating in my current leadership roles.&lt;/p&gt;
&lt;h2 id=&#34;about-sig-apps&#34;&gt;About SIG Apps&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;All following answers were jointly provided by Maciej and Janet.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: For those unfamiliar, could you provide an overview of SIG Apps&#39; mission and objectives?
What key problems does it aim to solve within the Kubernetes ecosystem?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As described in our
&lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-apps/charter.md#scope&#34;&gt;charter&lt;/a&gt;, we cover a
broad area related to developing, deploying, and operating applications on Kubernetes. That, in
short, means we’re open to each and everyone showing up at our bi-weekly meetings and discussing the
ups and downs of writing and deploying various applications on Kubernetes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: What are some of the most significant projects or initiatives currently being undertaken
by SIG Apps?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At this point in time, the main factors driving the development of our controllers are the
challenges coming from running various AI-related workloads. It’s worth giving credit here to two
working groups we’ve sponsored over the past years:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-batch&#34;&gt;The Batch Working Group&lt;/a&gt;, which is
looking at running HPC, AI/ML, and data analytics jobs on top of Kubernetes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-serving&#34;&gt;The Serving Working Group&lt;/a&gt;, which
is focusing on hardware-accelerated AI/ML inference.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;best-practices-and-challenges&#34;&gt;Best practices and challenges&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: SIG Apps plays a crucial role in developing application management best practices for
Kubernetes. Can you share some of these best practices and how they help improve application
lifecycle management?&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Implementing &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/&#34;&gt;health checks and readiness probes&lt;/a&gt;
ensures that your applications are healthy and ready to serve traffic, leading to improved
reliability and uptime. The above, combined with comprehensive logging, monitoring, and tracing
solutions, will provide insights into your application&#39;s behavior, enabling you to identify and
resolve issues quickly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/autoscaling/&#34;&gt;Auto-scale your application&lt;/a&gt; based
on resource utilization or custom metrics, optimizing resource usage and ensuring your
application can handle varying loads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use Deployment for stateless applications, StatefulSet for stateful applications, Job
and CronJob for batch workloads, and DaemonSet for running a daemon on each node. Use
Operators and CRDs to extend the Kubernetes API to automate the deployment, management, and
lifecycle of complex applications, making them easier to operate and reducing manual
intervention.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
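&lt;p&gt;As a minimal illustration of the first practice, a container might declare liveness and readiness probes along these lines (the endpoint paths, port, and timings here are hypothetical placeholders, not from any specific application):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;containers:
- name: my-app
  image: my-app:1.0
  ports:
  - containerPort: 8080
  livenessProbe:      # kubelet restarts the container if this fails
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  readinessProbe:     # Pod is removed from Service endpoints while this fails
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Separating the two checks matters: a failing liveness probe triggers a container restart, while a failing readiness probe only stops traffic from being routed to the Pod.&lt;/p&gt;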
&lt;p&gt;&lt;strong&gt;Sandipan: What are some of the common challenges SIG Apps faces, and how do you address them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The biggest challenge we’re facing all the time is the need to reject a lot of features, ideas, and
improvements. This requires a lot of discipline and patience to be able to explain the reasons
behind those decisions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: How has the evolution of Kubernetes influenced the work of SIG Apps? Are there any
recent changes or upcoming features in Kubernetes that you find particularly relevant or beneficial
for SIG Apps?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The main benefit for both us and the whole community around SIG Apps is the ability to extend
Kubernetes with &lt;a href=&#34;https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/&#34;&gt;Custom Resource Definitions&lt;/a&gt;
and the fact that users can build their own custom controllers leveraging the built-in ones to
achieve whatever sophisticated use cases they might have and we, as the core maintainers, haven’t
considered or weren’t able to efficiently resolve inside Kubernetes.&lt;/p&gt;
&lt;h2 id=&#34;contributing-to-sig-apps&#34;&gt;Contributing to SIG Apps&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: What opportunities are available for new contributors who want to get involved with SIG
Apps, and what advice would you give them?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We get the question, &amp;quot;What good first issue might you recommend we start with?&amp;quot; a lot :-) But
unfortunately, there’s no easy answer to it. We always tell everyone that the best option to start
contributing to core controllers is to find one you are willing to spend some time with. Read
through the code, then try running unit tests and integration tests focusing on that
controller. Once you grasp the general idea, try breaking it and running the tests again to verify your
breakage. Once you start feeling confident you understand that particular controller, you may want
to search through open issues affecting that controller and either provide suggestions, explaining
the problem users have, or maybe attempt your first fix.&lt;/p&gt;
&lt;p&gt;Like we said, there are no shortcuts on that road; you need to spend the time with the codebase to
understand all the edge cases we’ve slowly built up to get to the point where we are. Once you’re
successful with one controller, you’ll need to repeat that same process with others all over again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: How does SIG Apps gather feedback from the community, and how is this feedback
integrated into your work?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We always encourage everyone to show up and present their problems and solutions during our
bi-weekly &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps#meetings&#34;&gt;meetings&lt;/a&gt;. As long
as you’re solving an interesting problem on top of Kubernetes and you can provide valuable feedback
about any of the core controllers, we’re always happy to hear from everyone.&lt;/p&gt;
&lt;h2 id=&#34;looking-ahead&#34;&gt;Looking ahead&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: Looking ahead, what are the key focus areas or upcoming trends in application management
within Kubernetes that SIG Apps is excited about? How is the SIG adapting to these trends?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Definitely the current AI hype is the major driving factor; as mentioned above, we have two working
groups, each covering a different aspect of it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sandipan: What are some of your favorite things about this SIG?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Without a doubt, the people that participate in our meetings and on
&lt;a href=&#34;https://kubernetes.slack.com/messages/sig-apps&#34;&gt;Slack&lt;/a&gt;, who tirelessly help triage issues, pull
requests, and invest a lot of their time (very frequently their private time) into making Kubernetes
great!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;SIG Apps is an essential part of the Kubernetes community, helping to shape how applications are
deployed and managed at scale. From its work on improving Kubernetes&#39; workload APIs to driving
innovation in AI/ML application management, SIG Apps is continually adapting to meet the needs of
modern application developers and operators. Whether you’re a new contributor or an experienced
developer, there’s always an opportunity to get involved and make an impact.&lt;/p&gt;
&lt;p&gt;If you’re interested in learning more or contributing to SIG Apps, be sure to check out their &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps&#34;&gt;SIG
README&lt;/a&gt; and join their bi-weekly &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-apps#meetings&#34;&gt;meetings&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://groups.google.com/a/kubernetes.io/g/sig-apps&#34;&gt;SIG Apps Mailing List&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://kubernetes.slack.com/messages/sig-apps&#34;&gt;SIG Apps on Slack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>Spotlight on SIG etcd</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/04/sig-etcd-spotlight/</link>
      <pubDate>Tue, 04 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/04/sig-etcd-spotlight/</guid>
      <description>
        
        
        &lt;p&gt;In this SIG etcd spotlight we talked with &lt;a href=&#34;https://github.com/jmhbnz&#34;&gt;James Blair&lt;/a&gt;, &lt;a href=&#34;https://github.com/serathius&#34;&gt;Marek
Siarkowicz&lt;/a&gt;, &lt;a href=&#34;https://github.com/wenjiaswe&#34;&gt;Wenjia Zhang&lt;/a&gt;, and
&lt;a href=&#34;https://github.com/ahrtr&#34;&gt;Benjamin Wang&lt;/a&gt; to learn a bit more about this Kubernetes Special Interest
Group.&lt;/p&gt;
&lt;h2 id=&#34;introducing-sig-etcd&#34;&gt;Introducing SIG etcd&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Frederico: Hello, thank you for the time! Let’s start with some introductions, could you tell us a
bit about yourself, your role and how you got involved in Kubernetes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benjamin:&lt;/strong&gt; Hello, I am Benjamin. I am a SIG etcd Tech Lead and one of the etcd maintainers. I
work for VMware, which is part of the Broadcom group. I got involved in Kubernetes &amp;amp; etcd &amp;amp; CSI
(&lt;a href=&#34;https://github.com/container-storage-interface/spec/blob/master/spec.md&#34;&gt;Container Storage Interface&lt;/a&gt;)
because of work and also a big passion for open source. I have been working on Kubernetes &amp;amp; etcd
(and also CSI) since 2020.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;James:&lt;/strong&gt; Hey team, I’m James, a co-chair for SIG etcd and etcd maintainer. I work at Red Hat as a
Specialist Architect helping people adopt cloud native technology. I got involved with the
Kubernetes ecosystem in 2019. Around the end of 2022 I noticed how the etcd community and project
needed help, so I started contributing as often as I could. There is a saying in our community that
&amp;quot;you come for the technology, and stay for the people&amp;quot;: for me this is absolutely real, it’s been a
wonderful journey so far and I’m excited to support our community moving forward.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek:&lt;/strong&gt; Hey everyone, I&#39;m Marek, the SIG etcd lead. At Google, I lead the GKE etcd team, ensuring
a stable and reliable experience for all GKE users. My Kubernetes journey began with &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-instrumentation&#34;&gt;SIG
Instrumentation&lt;/a&gt;, where I
created and led the &lt;a href=&#34;https://kubernetes.io/blog/2020/09/04/kubernetes-1-19-introducing-structured-logs/&#34;&gt;Kubernetes Structured Logging effort&lt;/a&gt;.&lt;br&gt;
I&#39;m still the main project lead for &lt;a href=&#34;https://kubernetes-sigs.github.io/metrics-server/&#34;&gt;Kubernetes Metrics Server&lt;/a&gt;,
providing crucial signals for autoscaling in Kubernetes. I started working on etcd 3 years ago,
right around the 3.5 release. We faced some challenges, but I&#39;m thrilled to see etcd now the most
scalable and reliable it&#39;s ever been, with the highest contribution numbers in the project&#39;s
history. I&#39;m passionate about distributed systems, extreme programming, and testing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wenjia:&lt;/strong&gt; Hi there, my name is Wenjia, I am the co-chair of SIG etcd and one of the etcd
maintainers. I work at Google as an Engineering Manager, working on GKE (Google Kubernetes Engine)
and GDC (Google Distributed Cloud).  I have been working in the area of open source Kubernetes and
etcd since the Kubernetes v1.10 and etcd v3.1 releases. I got involved in Kubernetes because of my
job, but what keeps me in the space is the charm of the container orchestration technology, and more
importantly, the awesome open source community.&lt;/p&gt;
&lt;h2 id=&#34;becoming-a-kubernetes-special-interest-group-sig&#34;&gt;Becoming a Kubernetes Special Interest Group (SIG)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Frederico: Excellent, thank you. I&#39;d like to start with the origin of the SIG itself: SIG etcd is
a very recent SIG, could you quickly go through the history and reasons behind its creation?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: Absolutely! SIG etcd was formed because etcd is a critical component of Kubernetes,
serving as its data store. However, etcd was facing challenges like maintainer turnover and
reliability issues. &lt;a href=&#34;https://etcd.io/blog/2023/introducing-sig-etcd/&#34;&gt;Creating a dedicated SIG&lt;/a&gt;
allowed us to focus on addressing these problems, improving development and maintenance processes,
and ensuring etcd evolves in sync with the cloud-native landscape.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frederico: And has becoming a SIG worked out as expected? Better yet, are the motivations you just
described being addressed, and to what extent?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: It&#39;s been a positive change overall. Becoming a SIG has brought more structure and
transparency to etcd&#39;s development. We&#39;ve adopted Kubernetes processes like KEPs
(&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/README.md&#34;&gt;Kubernetes Enhancement Proposals&lt;/a&gt;)
and PRRs (&lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md&#34;&gt;Production Readiness Reviews&lt;/a&gt;),
which has improved our feature development and release cycle.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frederico: On top of those, what would you single out as the major benefit that has resulted from
becoming a SIG?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: The biggest benefits for me was adopting Kubernetes testing infrastructure, tools like
&lt;a href=&#34;https://docs.prow.k8s.io/&#34;&gt;Prow&lt;/a&gt; and &lt;a href=&#34;https://testgrid.k8s.io/&#34;&gt;TestGrid&lt;/a&gt;. For large projects like
etcd there is just no comparison to the default GitHub tooling. Having known, easy to use, clear
tools is a major boost to etcd, as it makes it much easier for Kubernetes contributors to also
help etcd.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wenjia&lt;/strong&gt;: Totally agree, while challenges remain, the SIG structure provides a solid foundation
for addressing them and ensuring etcd&#39;s continued success as a critical component of the Kubernetes
ecosystem.&lt;/p&gt;
&lt;p&gt;The positive impact on the community is another crucial aspect of SIG etcd&#39;s success that I’d like
to highlight. The Kubernetes SIG structure has created a welcoming environment for etcd
contributors, leading to increased participation from the broader Kubernetes community.  We have had
greater collaboration with other SIGs like &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-api-machinery/README.md&#34;&gt;SIG API
Machinery&lt;/a&gt;,
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-scalability&#34;&gt;SIG Scalability&lt;/a&gt;,
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-testing&#34;&gt;SIG Testing&lt;/a&gt;,
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle&#34;&gt;SIG Cluster Lifecycle&lt;/a&gt;, etc.&lt;/p&gt;
&lt;p&gt;This collaboration helps ensure etcd&#39;s development aligns with the needs of the wider Kubernetes
ecosystem. The formation of the &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/wg-etcd-operator/README.md&#34;&gt;etcd Operator Working Group&lt;/a&gt;
under the joint effort between SIG etcd and SIG Cluster Lifecycle exemplifies this successful
collaboration, demonstrating a shared commitment to improving etcd&#39;s operational aspects within
Kubernetes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frederico: Since you mentioned collaboration, have you seen changes in terms of contributors and
community involvement in recent months?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;James&lt;/strong&gt;: Yes -- as shown in our
&lt;a href=&#34;https://etcd.devstats.cncf.io/d/23/prs-authors-repository-groups?orgId=1&amp;var-period=m&amp;var-repogroup_name=All&amp;from=1422748800000&amp;to=1738454399000&#34;&gt;unique PR author data&lt;/a&gt;
we recently hit an all-time high in March and are trending in a positive direction:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/04/sig-etcd-spotlight/stats.png&#34;
         alt=&#34;Unique PR author data stats&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;Additionally, looking at our
&lt;a href=&#34;https://etcd.devstats.cncf.io/d/74/contributions-chart?orgId=1&amp;from=1422748800000&amp;to=1738454399000&amp;var-period=m&amp;var-metric=contributions&amp;var-repogroup_name=All&amp;var-country_name=All&amp;var-company_name=All&amp;var-company=all&#34;&gt;overall contributions across all etcd project repositories&lt;/a&gt;
we are also observing a positive trend showing a resurgence in etcd project activity:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/03/04/sig-etcd-spotlight/stats2.png&#34;
         alt=&#34;Overall contributions stats&#34;/&gt; 
&lt;/figure&gt;
&lt;h2 id=&#34;the-road-ahead&#34;&gt;The road ahead&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Frederico: That&#39;s quite telling, thank you. In terms of the near future, what are the current
priorities for SIG etcd?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: Reliability is always top of mind -– we need to make sure etcd is rock-solid. We&#39;re also
working on making etcd easier to use and manage for operators. And we have our sights set on making
etcd a viable standalone solution for infrastructure management, not just for Kubernetes. Oh, and of
course, scaling -– we need to ensure etcd can handle the growing demands of the cloud-native world.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benjamin&lt;/strong&gt;: I agree that reliability should always be our top guiding principle. We need to ensure
not only correctness but also compatibility. Additionally, we should continuously strive to improve
the understandability and maintainability of etcd. Our focus should be on addressing the pain points
that the community cares about the most.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frederico: Are there any specific SIGs that you work closely with?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: SIG API Machinery, for sure – they own the structure of the data etcd stores, so we&#39;re
constantly working together. And SIG Cluster Lifecycle – etcd is a key part of Kubernetes clusters,
so we collaborate on the newly created etcd Operator Working Group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wenjia&lt;/strong&gt;: Other than SIG API Machinery and SIG Cluster Lifecycle that Marek mentioned above, SIG
Scalability and SIG Testing are two other groups that we work closely with.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frederico: In a more general sense, how would you list the key challenges for SIG etcd in the
evolving cloud native landscape?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: Well, reliability is always a challenge when you&#39;re dealing with critical data. The
cloud-native world is evolving so fast that scaling to meet those demands is a constant effort.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Frederico: We&#39;re almost at the end of our conversation, but for those interested in etcd, how
can they get involved?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: We&#39;d love to have them! The best way to start is to join our
&lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-etcd/README.md#meetings&#34;&gt;SIG etcd meetings&lt;/a&gt;,
follow discussions on the &lt;a href=&#34;https://groups.google.com/g/etcd-dev&#34;&gt;etcd-dev mailing list&lt;/a&gt;, and check
out our &lt;a href=&#34;https://github.com/etcd-io/etcd/issues&#34;&gt;GitHub issues&lt;/a&gt;. We&#39;re always looking for people to
review proposals, test code, and contribute to documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wenjia&lt;/strong&gt;: I love this question 😀. There are numerous ways for people interested in contributing
to SIG etcd to get involved and make a difference. Here are some key areas where you can help:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code Contributions&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Bug Fixes&lt;/em&gt;: Tackle existing issues in the etcd codebase. Start with issues labeled &amp;quot;good first
issue&amp;quot; or &amp;quot;help wanted&amp;quot; to find tasks that are suitable for newcomers.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Feature Development&lt;/em&gt;: Contribute to the development of new features and enhancements. Check the
etcd roadmap and discussions to see what&#39;s being planned and where your skills might fit in.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Testing and Code Reviews&lt;/em&gt;: Help ensure the quality of etcd by writing tests, reviewing code
changes, and providing feedback.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Documentation&lt;/em&gt;: Improve &lt;a href=&#34;https://etcd.io/docs/&#34;&gt;etcd&#39;s documentation&lt;/a&gt; by adding new content,
clarifying existing information, or fixing errors. Clear and comprehensive documentation is
essential for users and contributors.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Community Support&lt;/em&gt;: Answer questions on forums, mailing lists, or &lt;a href=&#34;https://kubernetes.slack.com/archives/C3HD8ARJ5&#34;&gt;Slack channels&lt;/a&gt;.
Helping others understand and use etcd is a valuable contribution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Getting Started&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Join the community&lt;/em&gt;: Start by joining the etcd community on Slack,
attending SIG meetings, and following the mailing lists. This will
help you get familiar with the project, its processes, and the
people involved.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Find a mentor&lt;/em&gt;: If you&#39;re new to open source or etcd, consider
finding a mentor who can guide you and provide support. Stay tuned!
The first cohort of our mentorship program was very successful, and a
new round is coming up.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Start small&lt;/em&gt;: Don&#39;t be afraid to start with small contributions. Even
fixing a typo in the documentation or submitting a simple bug fix
can be a great way to get involved.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By contributing to etcd, you&#39;ll not only be helping to improve a
critical piece of the cloud-native ecosystem but also gaining valuable
experience and skills. So, jump in and start contributing!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frederico: Excellent, thank you. Lastly, one piece of advice that
you&#39;d like to give to other newly formed SIGs?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Marek&lt;/strong&gt;: Absolutely! My advice would be to embrace the established
processes of the larger community, prioritize collaboration with other
SIGs, and focus on building a strong community.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wenjia&lt;/strong&gt;: Here are some tips I myself found very helpful in my OSS
journey:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Be patient&lt;/em&gt;: Open source development can take time. Don&#39;t get
discouraged if your contributions aren&#39;t accepted immediately or if
you encounter challenges.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Be respectful&lt;/em&gt;: The etcd community values collaboration and
respect. Be mindful of others&#39; opinions and work together to achieve
common goals.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Have fun&lt;/em&gt;: Contributing to open source should be
enjoyable. Find areas that interest you and contribute in ways that
you find fulfilling.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Frederico: A great way to end this spotlight, thank you all!&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;For more information and resources, please take a look at:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;etcd website: &lt;a href=&#34;https://etcd.io/&#34;&gt;https://etcd.io/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;etcd GitHub repository: &lt;a href=&#34;https://github.com/etcd-io/etcd&#34;&gt;https://github.com/etcd-io/etcd&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;etcd community: &lt;a href=&#34;https://etcd.io/community/&#34;&gt;https://etcd.io/community/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

      </description>
    </item>
    
    <item>
      <title>NFTables mode for kube-proxy</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/28/nftables-kube-proxy/</link>
      <pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/28/nftables-kube-proxy/</guid>
      <description>
        
        
        &lt;p&gt;A new nftables mode for kube-proxy was introduced as an alpha feature
in Kubernetes 1.29. Currently in beta, it is expected to be GA as of
1.33. The new mode fixes long-standing performance problems with the
iptables mode and all users running on systems with reasonably-recent
kernels are encouraged to try it out. (For compatibility reasons, even
once nftables becomes GA, iptables will still be the &lt;em&gt;default&lt;/em&gt;.)&lt;/p&gt;
&lt;h2 id=&#34;why-nftables-part-1-data-plane-latency&#34;&gt;Why nftables? Part 1: data plane latency&lt;/h2&gt;
&lt;p&gt;The iptables API was designed for implementing simple firewalls, and
has problems scaling up to support Service proxying in a large
Kubernetes cluster with tens of thousands of Services.&lt;/p&gt;
&lt;p&gt;In general, the ruleset generated by kube-proxy in iptables mode has a
number of iptables rules proportional to the sum of the number of
Services and the total number of endpoints. In particular, at the top
level of the ruleset, there is one rule to test each possible Service
IP (and port) that a packet might be addressed to:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment &amp;#34;namespace1/service1:p80 cluster IP&amp;#34; -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment &amp;#34;namespace2/service2:p443 cluster IP&amp;#34; -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment &amp;#34;namespace3/service3:p80 cluster IP&amp;#34; -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This means that when a packet comes in, the time it takes the kernel
to check it against all of the Service rules is &lt;strong&gt;O(n)&lt;/strong&gt; in the number
of Services. As the number of Services increases, both the average and
the worst-case latency for the first packet of a new connection
increases (with the difference between best-case, average, and
worst-case being mostly determined by whether a given Service IP
address appears earlier or later in the &lt;code&gt;KUBE-SERVICES&lt;/code&gt; chain).&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/28/nftables-kube-proxy/iptables-only.svg&#34;
         alt=&#34;kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;By contrast, with nftables, the normal way to write a ruleset like
this is to have a &lt;em&gt;single&lt;/em&gt; rule, using a &amp;quot;verdict map&amp;quot; to do the
dispatch:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;table ip kube-proxy {

        # The service-ips verdict map indicates the action to take for each matching packet.
	map service-ips {
		type ipv4_addr . inet_proto . inet_service : verdict
		comment &amp;#34;ClusterIP, ExternalIP and LoadBalancer IP traffic&amp;#34;
		elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                             172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                             172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                             ... }
        }

        # Now we just need a single rule to process all packets matching an
        # element in the map. (This rule says, &amp;#34;construct a tuple from the
        # destination IP address, layer 4 protocol, and destination port; look
        # that tuple up in &amp;#34;service-ips&amp;#34;; and if there&amp;#39;s a match, execute the
        # associated verdict.)
	chain services {
		ip daddr . meta l4proto . th dport vmap @service-ips
	}

        ...
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Since there&#39;s only a single rule, with a roughly &lt;strong&gt;O(1)&lt;/strong&gt; map lookup,
packet processing time is more or less constant regardless of cluster
size, and the best/average/worst cases are very similar:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/28/nftables-kube-proxy/nftables-only.svg&#34;
         alt=&#34;kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;But note the huge difference in the vertical scale between the
iptables and nftables graphs! In the clusters with 5000 and 10,000
Services, the p50 (median) latency for nftables is about the same as
the p01 (approximately best-case) latency for iptables. In the 30,000
Service cluster, the p99 (approximately worst-case) latency for
nftables manages to beat out the p01 latency for iptables by a few
microseconds! Here are both sets of data together, but you may have
to squint to see the nftables results:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/28/nftables-kube-proxy/iptables-vs-nftables.svg&#34;
         alt=&#34;kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes&#34;/&gt; 
&lt;/figure&gt;
&lt;h2 id=&#34;why-nftables-part-2-control-plane-latency&#34;&gt;Why nftables? Part 2: control plane latency&lt;/h2&gt;
&lt;p&gt;While the improvements to data plane latency in large clusters are
great, there&#39;s another problem with iptables kube-proxy that often
keeps users from even being able to grow their clusters to that size:
the time it takes kube-proxy to program new iptables rules when
Services and their endpoints change.&lt;/p&gt;
&lt;p&gt;With both iptables and nftables, the total size of the ruleset as a
whole (actual rules, plus associated data) is &lt;strong&gt;O(n)&lt;/strong&gt; in the combined
number of Services and their endpoints. Originally, the iptables
backend would rewrite every rule on every update, and with tens of
thousands of Services, this could grow to be hundreds of thousands of
iptables rules. Starting in Kubernetes 1.26, we began improving
kube-proxy so that it could skip updating &lt;em&gt;most&lt;/em&gt; of the unchanged
rules in each update, but the limitations of &lt;code&gt;iptables-restore&lt;/code&gt; as an
API meant that it was still always necessary to send an update that&#39;s
&lt;strong&gt;O(n)&lt;/strong&gt; in the number of Services (though with a noticeably smaller
constant than it used to be). Even with those optimizations, it can
still be necessary to make use of kube-proxy&#39;s &lt;code&gt;minSyncPeriod&lt;/code&gt; config
option to ensure that it doesn&#39;t spend every waking second trying to
push iptables updates.&lt;/p&gt;
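&lt;p&gt;As an illustration, a &lt;code&gt;KubeProxyConfiguration&lt;/code&gt; fragment along these lines (a minimal sketch; check the reference documentation for your Kubernetes version) caps how often the iptables backend re-syncs:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: iptables
iptables:
  # Do not re-sync more often than once every 10 seconds, even if
  # Services and endpoints are changing continuously.
  minSyncPeriod: 10s
&lt;/code&gt;&lt;/pre&gt;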
&lt;p&gt;The nftables APIs allow for doing much more incremental updates, and
when kube-proxy in nftables mode does an update, the size of the
update is only &lt;strong&gt;O(n)&lt;/strong&gt; in the number of Services and endpoints that
have changed since the last sync, regardless of the total number of
Services and endpoints. The fact that the nftables API allows each
nftables-using component to have its own private table also means that
there is no global lock contention between components like with
iptables. As a result, kube-proxy&#39;s nftables updates can be done much
more efficiently than with iptables.&lt;/p&gt;
&lt;p&gt;(Unfortunately I don&#39;t have cool graphs for this part.)&lt;/p&gt;
&lt;h2 id=&#34;why-not-nftables&#34;&gt;Why &lt;em&gt;not&lt;/em&gt; nftables?&lt;/h2&gt;
&lt;p&gt;All that said, there are a few reasons why you might not want to jump
right into using the nftables backend for now.&lt;/p&gt;
&lt;p&gt;First, the code is still fairly new. While it has plenty of unit
tests, performs correctly in our CI system, and has now been used in
the real world by multiple users, it has not seen anything close to as
much real-world usage as the iptables backend has, so we can&#39;t promise
that it is as stable and bug-free.&lt;/p&gt;
&lt;p&gt;Second, the nftables mode will not work on older Linux distributions;
currently it requires a 5.13 or newer kernel. Additionally, because of
bugs in early versions of the &lt;code&gt;nft&lt;/code&gt; command line tool, you should not
run kube-proxy in nftables mode on nodes that have an old (earlier
than 1.0.0) version of &lt;code&gt;nft&lt;/code&gt; in the host filesystem (or else
kube-proxy&#39;s use of nftables may interfere with other uses of nftables
on the system).&lt;/p&gt;
&lt;p&gt;Third, you may have other networking components in your cluster, such
as the pod network or NetworkPolicy implementation, that do not yet
support kube-proxy in nftables mode. You should consult the
documentation (or forums, bug tracker, etc.) for any such components
to see if they have problems with nftables mode. (In many cases they
will not; as long as they don&#39;t try to directly interact with or
override kube-proxy&#39;s iptables rules, they shouldn&#39;t care whether
kube-proxy is using iptables or nftables.) Additionally, observability
and monitoring tools that have not been updated may report less data
for kube-proxy in nftables mode than they do for kube-proxy in
iptables mode.&lt;/p&gt;
&lt;p&gt;Finally, kube-proxy in nftables mode is intentionally not 100%
compatible with kube-proxy in iptables mode. There are a few old
kube-proxy features whose default behaviors are less secure, less
performant, or less intuitive than we&#39;d like, but where we felt that
changing the default would be a compatibility break. Since the
nftables mode is opt-in, this gave us a chance to fix those bad
defaults without breaking users who weren&#39;t expecting changes. (In
particular, with nftables mode, NodePort Services are now only
reachable on their nodes&#39; default IPs, as opposed to being reachable
on all IPs, including &lt;code&gt;127.0.0.1&lt;/code&gt;, with iptables mode.) The
&lt;a href=&#34;https://kubernetes.io/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables&#34;&gt;kube-proxy documentation&lt;/a&gt; has more information about this, including
information about metrics you can look at to determine if you are
relying on any of the changed functionality, and what configuration
options are available to get more backward-compatible behavior.&lt;/p&gt;
&lt;h2 id=&#34;trying-out-nftables-mode&#34;&gt;Trying out nftables mode&lt;/h2&gt;
&lt;p&gt;Ready to try it out? In Kubernetes 1.31 and later, you just need to
pass &lt;code&gt;--proxy-mode nftables&lt;/code&gt; to kube-proxy (or set &lt;code&gt;mode: nftables&lt;/code&gt; in
your kube-proxy config file).&lt;/p&gt;
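&lt;p&gt;If you use a configuration file, the relevant fragment looks like this (a minimal sketch; merge it into your existing &lt;code&gt;KubeProxyConfiguration&lt;/code&gt;):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Equivalent to passing --proxy-mode nftables on the command line.
mode: nftables
&lt;/code&gt;&lt;/pre&gt;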
&lt;p&gt;If you are using kubeadm to set up your cluster, the kubeadm
documentation explains &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy&#34;&gt;how to pass a &lt;code&gt;KubeProxyConfiguration&lt;/code&gt; to
&lt;code&gt;kubeadm init&lt;/code&gt;&lt;/a&gt;. You can also &lt;a href=&#34;https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode&#34;&gt;deploy nftables-based clusters with
&lt;code&gt;kind&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can also convert existing clusters from iptables (or ipvs) mode to
nftables by updating the kube-proxy configuration and restarting the
kube-proxy pods. (You do not need to reboot the nodes: when restarting
in nftables mode, kube-proxy will delete any existing iptables or ipvs
rules, and likewise, if you later revert back to iptables or ipvs
mode, it will delete any existing nftables rules.)&lt;/p&gt;
&lt;h2 id=&#34;future-plans&#34;&gt;Future plans&lt;/h2&gt;
&lt;p&gt;As mentioned above, while nftables is now the &lt;em&gt;best&lt;/em&gt; kube-proxy mode,
it is not the &lt;em&gt;default&lt;/em&gt;, and we do not yet have a plan for changing
that. We will continue to support the iptables mode for a long time.&lt;/p&gt;
&lt;p&gt;The future of the IPVS mode of kube-proxy is less certain: its main
advantage over iptables was that it was faster, but certain aspects of
the IPVS architecture and APIs were awkward for kube-proxy&#39;s purposes
(for example, the fact that the &lt;code&gt;kube-ipvs0&lt;/code&gt; device needs to have
&lt;em&gt;every&lt;/em&gt; Service IP address assigned to it), and some parts of
Kubernetes Service proxying semantics were difficult to implement
using IPVS (particularly the fact that some Services had to have
different endpoints depending on whether you connected to them from a
local or remote client). And now, the nftables mode has the same
performance as IPVS mode (actually, slightly better), without any of
the downsides:&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/28/nftables-kube-proxy/ipvs-vs-nftables.svg&#34;
         alt=&#34;kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;(In theory the IPVS mode also has the advantage of being able to use
various other IPVS functionality, like alternative &amp;quot;schedulers&amp;quot; for
balancing endpoints. In practice, this ended up not being very useful,
because kube-proxy runs independently on every node, and the IPVS
schedulers on each node had no way of sharing their state with the
proxies on other nodes, thus thwarting the effort to balance traffic
more cleverly.)&lt;/p&gt;
&lt;p&gt;While the Kubernetes project does not have an immediate plan to drop
the IPVS backend, it is probably doomed in the long run, and people
who are currently using IPVS mode should try out the nftables mode
instead (and file bugs if you think there is missing functionality in
nftables mode that you can&#39;t work around).&lt;/p&gt;
&lt;h2 id=&#34;learn-more&#34;&gt;Learn more&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&amp;quot;&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3866-nftables-proxy/README.md&#34;&gt;KEP-3866: Add an nftables-based kube-proxy backend&lt;/a&gt;&amp;quot; has the
history of the new feature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;quot;&lt;a href=&#34;https://youtu.be/yOGHb2HjslY?si=6O4PVJu7fGpReo1U&#34;&gt;How the Tables Have Turned: Kubernetes Says Goodbye to IPTables&lt;/a&gt;&amp;quot;,
from KubeCon/CloudNativeCon North America 2024, talks about porting
kube-proxy and Calico from iptables to nftables.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&amp;quot;&lt;a href=&#34;https://youtu.be/uYo2O3jbJLk?si=py2AXzMJZ4PuhxNg&#34;&gt;From Observability to Performance&lt;/a&gt;&amp;quot;, from KubeCon/CloudNativeCon
North America 2024. (This is where the kube-proxy latency data came
from; the &lt;a href=&#34;https://docs.google.com/spreadsheets/d/1-ryDNc6gZocnMHEXC7mNtqknKSOv5uhXFKDx8Hu3AYA/edit&#34;&gt;raw data for the charts&lt;/a&gt; is also available.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
    <item>
      <title>The Cloud Controller Manager Chicken and Egg Problem</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/14/cloud-controller-manager-chicken-egg-problem/</link>
      <pubDate>Fri, 14 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/14/cloud-controller-manager-chicken-egg-problem/</guid>
      <description>
        
        
        &lt;p&gt;Kubernetes 1.31
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/05/20/completing-cloud-provider-migration/&#34;&gt;completed the largest migration in Kubernetes history&lt;/a&gt;, removing the in-tree
cloud provider. While the component migration is now done, it leaves some additional
complexity for users and installer projects (for example, kOps or Cluster API). We will go
over those additional steps and failure points and make recommendations for cluster owners.
The migration was complex: some logic had to be extracted from the core components into
four new subsystems.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cloud controller manager&lt;/strong&gt; (&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-cloud-provider/2392-cloud-controller-manager/README.md&#34;&gt;KEP-2392&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API server network proxy&lt;/strong&gt; (&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1281-network-proxy&#34;&gt;KEP-1281&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;kubelet credential provider plugins&lt;/strong&gt; (&lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2133-kubelet-credential-providers&#34;&gt;KEP-2133&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage migration to use &lt;a href=&#34;https://github.com/container-storage-interface/spec?tab=readme-ov-file#container-storage-interface-csi-specification-&#34;&gt;CSI&lt;/a&gt;&lt;/strong&gt; (&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/625-csi-migration/README.md&#34;&gt;KEP-625&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/architecture/cloud-controller/&#34;&gt;cloud controller manager is part of the control plane&lt;/a&gt;. It is a critical component
that replaces some functionality that existed previously in the kube-controller-manager and the
kubelet.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/images/docs/components-of-kubernetes.svg&#34;
         alt=&#34;Components of Kubernetes&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;Components of Kubernetes&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;One of the most critical functionalities of the cloud controller manager is the node controller,
which is responsible for the initialization of the nodes.&lt;/p&gt;
&lt;p&gt;As you can see in the following diagram, when the &lt;strong&gt;kubelet&lt;/strong&gt; starts, it registers the Node
object with the API server and taints the node so that it is processed first by the
cloud-controller-manager. The initial Node object is missing the cloud-provider-specific
information, such as the node addresses and the labels with cloud-provider-specific details
like the region and instance type.&lt;/p&gt;


&lt;figure class=&#34;diagram-medium &#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/02/14/cloud-controller-manager-chicken-egg-problem/ccm-chicken-egg-problem-sequence-diagram.svg&#34;
         alt=&#34;Chicken and egg problem sequence diagram&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;Chicken and egg problem sequence diagram&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This new initialization process adds some latency to node readiness. Previously, the kubelet
was able to initialize the node at the same time it created it. Since the logic has moved
to the cloud-controller-manager, this can cause a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/running-cloud-controller/#chicken-and-egg&#34;&gt;chicken and egg problem&lt;/a&gt;
during cluster bootstrapping for Kubernetes architectures that do not deploy the
cloud-controller-manager in the same way as the other control plane components, commonly as
static pods, standalone binaries, or DaemonSets/Deployments with tolerations for the taints
and using &lt;code&gt;hostNetwork&lt;/code&gt; (more on this below).&lt;/p&gt;
&lt;h2 id=&#34;examples-of-the-dependency-problem&#34;&gt;Examples of the dependency problem&lt;/h2&gt;
&lt;p&gt;As noted above, it is possible during bootstrapping for the cloud-controller-manager to be
unschedulable and as such the cluster will not initialize properly. The following are a few
concrete examples of how this problem can be expressed and the root causes for why they might
occur.&lt;/p&gt;
&lt;p&gt;These examples assume you are running your cloud-controller-manager using a Kubernetes resource
(e.g. Deployment, DaemonSet, or similar) to control its lifecycle. Because these methods
rely on Kubernetes to schedule the cloud-controller-manager, care must be taken to ensure it
will schedule properly.&lt;/p&gt;
&lt;h3 id=&#34;example-cloud-controller-manager-not-scheduling-due-to-uninitialized-taint&#34;&gt;Example: Cloud controller manager not scheduling due to uninitialized taint&lt;/h3&gt;
&lt;p&gt;As &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager&#34;&gt;noted in the Kubernetes documentation&lt;/a&gt;, when the kubelet is started with the command line
flag &lt;code&gt;--cloud-provider=external&lt;/code&gt;, its corresponding &lt;code&gt;Node&lt;/code&gt; object will have a no schedule taint
named &lt;code&gt;node.cloudprovider.kubernetes.io/uninitialized&lt;/code&gt; added. Because the cloud-controller-manager
is responsible for removing the no schedule taint, this can create a situation where a
cloud-controller-manager that is being managed by a Kubernetes resource, such as a &lt;code&gt;Deployment&lt;/code&gt;
or &lt;code&gt;DaemonSet&lt;/code&gt;, may not be able to schedule.&lt;/p&gt;
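&lt;p&gt;On an affected cluster, the taint appears in the Node spec roughly like this (an illustrative, abbreviated manifest):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: v1
kind: Node
spec:
  taints:
  # Added by the kubelet when started with --cloud-provider=external;
  # only the cloud-controller-manager removes it.
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: &amp;#34;true&amp;#34;
    effect: NoSchedule
&lt;/code&gt;&lt;/pre&gt;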
&lt;p&gt;If the cloud-controller-manager is not able to be scheduled during the initialization of the
control plane, then the resulting &lt;code&gt;Node&lt;/code&gt; objects will all have the
&lt;code&gt;node.cloudprovider.kubernetes.io/uninitialized&lt;/code&gt; no schedule taint. It also means that this taint
will not be removed as the cloud-controller-manager is responsible for its removal. If the no
schedule taint is not removed, then critical workloads, such as the container network interface
controllers, will not be able to schedule, and the cluster will be left in an unhealthy state.&lt;/p&gt;
&lt;h3 id=&#34;example-cloud-controller-manager-not-scheduling-due-to-not-ready-taint&#34;&gt;Example: Cloud controller manager not scheduling due to not-ready taint&lt;/h3&gt;
&lt;p&gt;The next example would be possible in situations where the container network interface (CNI) is
waiting for IP address information from the cloud-controller-manager (CCM), and the CCM has not
tolerated the taint which would be removed by the CNI.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/reference/labels-annotations-taints/#node-kubernetes-io-not-ready&#34;&gt;Kubernetes documentation describes&lt;/a&gt; the &lt;code&gt;node.kubernetes.io/not-ready&lt;/code&gt; taint as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;The Node controller detects whether a Node is ready by monitoring its health and adds or removes this taint accordingly.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the conditions that can lead to a Node resource having this taint is when the container
network has not yet been initialized on that node. As the cloud-controller-manager is responsible
for adding the IP addresses to a Node resource, and the IP addresses are needed by the container
network controllers to properly configure the container network, it is possible in some
circumstances for a node to become stuck as not ready and uninitialized permanently.&lt;/p&gt;
&lt;p&gt;This situation occurs for a similar reason as the first example, although in this case, the
&lt;code&gt;node.kubernetes.io/not-ready&lt;/code&gt; taint is used with the no execute effect and thus will cause the
cloud-controller-manager not to run on the node with the taint. If the cloud-controller-manager is
not able to execute, then it will not initialize the node. This cascades into the container
network controllers not being able to run properly, and the node will end up carrying both the
&lt;code&gt;node.cloudprovider.kubernetes.io/uninitialized&lt;/code&gt; and &lt;code&gt;node.kubernetes.io/not-ready&lt;/code&gt; taints,
leaving the cluster in an unhealthy state.&lt;/p&gt;
&lt;h2 id=&#34;our-recommendations&#34;&gt;Our Recommendations&lt;/h2&gt;
&lt;p&gt;There is no one “correct way” to run a cloud-controller-manager. The details will depend on the
specific needs of the cluster administrators and users. When planning your clusters and the
lifecycle of the cloud-controller-managers please consider the following guidance:&lt;/p&gt;
&lt;p&gt;For cloud-controller-managers running in the same cluster they are managing:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use host network mode, rather than the pod network: in most cases, a cloud controller manager
will need to communicate with an API service endpoint associated with the infrastructure.
Setting “hostNetwork” to true will ensure that the cloud controller is using the host
networking instead of the container network and, as such, will have the same network access as
the host operating system. It will also remove the dependency on the networking plugin. This
will ensure that the cloud controller has access to the infrastructure endpoint (always check
your networking configuration against your infrastructure provider’s instructions).&lt;/li&gt;
&lt;li&gt;Use a scalable resource type. &lt;code&gt;Deployments&lt;/code&gt; and &lt;code&gt;DaemonSets&lt;/code&gt; are useful for controlling the
lifecycle of a cloud controller. They allow easy access to running multiple copies for redundancy
as well as using the Kubernetes scheduling to ensure proper placement in the cluster. When using
these primitives to control the lifecycle of your cloud controllers and running multiple
replicas, you must remember to enable leader election, or else your controllers will collide
with each other which could lead to nodes not being initialized in the cluster.&lt;/li&gt;
&lt;li&gt;Target the controller manager containers to the control plane. There might exist other
controllers which need to run outside the control plane (for example, Azure’s node manager
controller). Still, the controller managers themselves should be deployed to the control plane.
Use a node selector or affinity stanza to direct the scheduling of cloud controllers to the
control plane to ensure that they are running in a protected space. Cloud controllers are vital
to adding and removing nodes in a cluster as they form a link between Kubernetes and the
physical infrastructure. Running them on the control plane will help to ensure that they run
with a similar priority as other core cluster controllers and that they have some separation
from non-privileged user workloads.
&lt;ol&gt;
&lt;li&gt;It is worth noting that an anti-affinity stanza to prevent cloud controllers from running
on the same host is also very useful to ensure that a single node failure will not degrade
the cloud controller performance.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Ensure that the tolerations allow operation. Use tolerations on the manifest for the cloud
controller container to ensure that it will schedule to the correct nodes and that it can run
in situations where a node is initializing. This means that cloud controllers should tolerate
the &lt;code&gt;node.cloudprovider.kubernetes.io/uninitialized&lt;/code&gt; taint, and they should also tolerate any
taints associated with the control plane (for example, &lt;code&gt;node-role.kubernetes.io/control-plane&lt;/code&gt;
or &lt;code&gt;node-role.kubernetes.io/master&lt;/code&gt;). It can also be useful to tolerate the
&lt;code&gt;node.kubernetes.io/not-ready&lt;/code&gt; taint to ensure that the cloud controller can run even when the
node is not yet available for health monitoring.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For cloud-controller-managers that will not be running on the cluster they manage (for example,
in a hosted control plane on a separate cluster), the rules are much more constrained by the
dependencies of the environment of the cluster running the cloud-controller-manager. The advice
for running on a self-managed cluster may not be appropriate as the types of conflicts and network
constraints will be different. Please consult the architecture and requirements of your topology
for these scenarios.&lt;/p&gt;
&lt;h3 id=&#34;example&#34;&gt;Example&lt;/h3&gt;
&lt;p&gt;This is an example of a Kubernetes Deployment highlighting the guidance shown above. It is
important to note that this is for demonstration purposes only; for production use, please
consult your cloud provider’s documentation.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-controller-manager
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloud-controller-manager
      annotations:
        kubernetes.io/description: Cloud controller manager for my infrastructure
    spec:
      containers: # the container details will depend on your specific cloud controller manager
      - name: cloud-controller-manager
        command:
        - /bin/my-infrastructure-cloud-controller-manager
        - --leader-elect=true
        - -v=1
        image: registry/my-infrastructure-cloud-controller-manager:latest
        resources:
          requests:
            cpu: 200m
            memory: 50Mi
      hostNetwork: true # these Pods are part of the control plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: &amp;#34;&amp;#34;
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: &amp;#34;kubernetes.io/hostname&amp;#34;
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: cloud-controller-manager
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
        operator: Exists
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When deciding how to deploy your cloud controller manager it is worth noting that
cluster-proportional, or resource-based, pod autoscaling is not recommended. Running multiple
replicas of a cloud controller manager is good practice for ensuring high availability and
redundancy, but does not contribute to better performance. In general, only a single instance
of a cloud controller manager will be reconciling a cluster at any given time.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Spotlight on SIG Architecture: Enhancements</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/01/21/sig-architecture-enhancements/</link>
      <pubDate>Tue, 21 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2025/01/21/sig-architecture-enhancements/</guid>
      <description>
        
        
        &lt;p&gt;&lt;em&gt;This is the fourth interview of a SIG Architecture Spotlight series that will cover the different
subprojects, and we will be covering &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#enhancements&#34;&gt;SIG Architecture:
Enhancements&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this SIG Architecture spotlight we talked with &lt;a href=&#34;https://github.com/kikisdeliveryservice&#34;&gt;Kirsten
Garrison&lt;/a&gt;, lead of the Enhancements subproject.&lt;/p&gt;
&lt;h2 id=&#34;the-enhancements-subproject&#34;&gt;The Enhancements subproject&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Frederico (FSM): Hi Kirsten, very happy to have the opportunity to talk about the Enhancements
subproject. Let&#39;s start with some quick information about yourself and your role.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kirsten Garrison (KG)&lt;/strong&gt;: I’m a lead of the Enhancements subproject of SIG-Architecture and
currently work at Google. I first got involved by contributing to the service-catalog project with
the help of &lt;a href=&#34;https://github.com/carolynvs&#34;&gt;Carolyn Van Slyck&lt;/a&gt;. With time, &lt;a href=&#34;https://github.com/kubernetes/sig-release/blob/master/releases/release-1.17/release_team.md&#34;&gt;I joined the Release
team&lt;/a&gt;,
eventually becoming the Enhancements Lead and a Release Lead shadow. While on the release team, I
worked on some ideas to make the process better for the SIGs and Enhancements team (the opt-in
process) based on my team’s experiences. Eventually, I started attending Subproject meetings and
contributing to the Subproject’s work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FSM: You mentioned the Enhancements subproject: how would you describe its main goals and areas of
intervention?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: The &lt;a href=&#34;https://github.com/kubernetes/community/blob/master/sig-architecture/README.md#enhancements&#34;&gt;Enhancements
Subproject&lt;/a&gt;
primarily concerns itself with the &lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/0000-kep-process/README.md&#34;&gt;Kubernetes Enhancement
Proposal&lt;/a&gt;
(&lt;em&gt;KEP&lt;/em&gt; for short)—the &amp;quot;design&amp;quot; documents required for all features and significant changes
to the Kubernetes project.&lt;/p&gt;
&lt;h2 id=&#34;the-kep-and-its-impact&#34;&gt;The KEP and its impact&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FSM: The improvement of the KEP process was (and is) one in which SIG Architecture was heavily
involved. Could you explain the process to those that aren’t aware of it?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: &lt;a href=&#34;https://kubernetes.io/releases/release/#the-release-cycle&#34;&gt;Every release&lt;/a&gt;, the SIGs let the
Release Team know which features they intend to work on to be put into the release. As mentioned
above, the prerequisite for these changes is a KEP - a standardized design document that all authors
must fill out and approve in the first weeks of the release cycle. Most features &lt;a href=&#34;https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/#feature-stages&#34;&gt;will move
through 3
phases&lt;/a&gt;:
alpha, beta, and finally GA, so approving a feature represents a significant commitment for the SIG.&lt;/p&gt;
&lt;p&gt;The KEP serves as the full source of truth of a feature. The &lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md&#34;&gt;KEP
template&lt;/a&gt;
has different requirements based on what stage a feature is in, but it generally requires a detailed
discussion of the design and the impact as well as providing artifacts of stability and
performance. The KEP takes quite a bit of iterative work between authors, SIG reviewers, the API review
team and the Production Readiness Review team&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; before it is approved. Each set of reviewers is
looking to make sure that the proposal meets their standards in order to have a stable and
performant Kubernetes release. Only after all approvals are secured can an author go forth and
merge their feature in the Kubernetes code base.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FSM: I see, quite a bit of additional structure was added. Looking back, what were the most
significant improvements of that approach?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: In general, I think that the improvements with the most impact had to do with focusing on
the core intent of the KEP. KEPs exist not just to memorialize designs, but to provide a structured way
to discuss and come to an agreement about different facets of the change. At the core of the KEP
process is communication and consideration.&lt;/p&gt;
&lt;p&gt;To that end, some of the significant changes revolve around a more detailed and accessible KEP
template. A significant amount of work was put in over time to get the
&lt;a href=&#34;https://github.com/kubernetes/enhancements&#34;&gt;k/enhancements&lt;/a&gt; repo into its current form -- a
directory structure organized by SIG with the contours of the modern KEP template (with
Proposal/Motivation/Design Details subsections). We might take that basic structure for granted
today, but it really represents the work of many people trying to get the foundation of this process
in place over time.&lt;/p&gt;
&lt;p&gt;As Kubernetes matures, we’ve needed to think about more than just the end goal of getting a single
feature merged. We need to think about things like: stability, performance, setting and meeting user
expectations. And as we’ve thought about those things the template has grown more detailed. The
addition of the Production Readiness Review was major as well as the enhanced testing requirements
(varying at different stages of a KEP’s lifecycle).&lt;/p&gt;
&lt;h2 id=&#34;current-areas-of-focus&#34;&gt;Current areas of focus&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FSM: Speaking of maturing, we’ve &lt;a href=&#34;https://kubernetes.io/blog/2024/08/13/kubernetes-v1-31-release/&#34;&gt;recently released Kubernetes
v1.31&lt;/a&gt;, and work on v1.32 &lt;a href=&#34;https://github.com/fsmunoz/sig-release/tree/release-1.32/releases/release-1.32&#34;&gt;has
started&lt;/a&gt;. Are there
any areas that the Enhancements sub-project is currently addressing that might change the way things
are done?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: We’re currently working on two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Creating a Process KEP template.&lt;/em&gt; Sometimes people want to harness the KEP process for
significant changes that are more process oriented rather than feature oriented. We want to
support this because memorializing changes is important and giving people a better tool to do so
will only encourage more discussion and transparency.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;KEP versioning.&lt;/em&gt; While our template changes aim to be as non-disruptive as possible, we
believe that it will be easier to track and communicate those changes to the community with
a versioned KEP template and the policies that go alongside such versioning.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Both features will take some time to get right and fully roll out (just like a KEP feature) but we
believe that they will both provide improvements that will benefit the community at large.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FSM: You mentioned improvements: I remember when project boards for Enhancement tracking were
introduced in recent releases, to great effect and unanimous applause from release team members. Was
this a particular area of focus for the subproject?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: The Subproject provided support to the Release Team’s Enhancement team in the migration away
from using the spreadsheet to a project board. The collection and tracking of enhancements has
always been a logistical challenge. During my time on the Release Team, I helped with the transition
to an opt-in system of enhancements, whereby the SIG leads &amp;quot;opt-in&amp;quot; KEPs for release tracking. This
helped to enhance communication between authors and SIGs before any significant work was undertaken
on a KEP and removed toil from the Enhancements team. This change used the existing tools to avoid
introducing too many changes at once to the community. Later, the Release Team approached the
Subproject with an idea of leveraging GitHub Project Boards to further improve the collection
process. This was to be a move away from the use of complicated spreadsheets to using repo-native
labels on &lt;a href=&#34;https://github.com/kubernetes/enhancements&#34;&gt;k/enhancement&lt;/a&gt; issues and project boards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FSM: That surely has an impact on simplifying the workflow...&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: Removing sources of friction and promoting clear communication is very important to the
Enhancements Subproject. At the same time, it’s important to give careful consideration to
decisions that impact the community as a whole. We want to make sure that changes are balanced to
give an upside while not causing any regressions or pain in the rollout. We supported the
Release Team in ideation as well as through the actual migration to the project boards. It was a
great success and exciting to see the team make high impact changes that helped everyone involved in
the KEP process!&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;FSM: For those reading who might be curious and interested in helping, how would you describe the
required skills for participating in the sub-project?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: Familiarity with KEPs either via experience or taking time to look through the
kubernetes/enhancements repo is helpful. All are welcome to participate if interested - we can take
it from there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FSM: Excellent! Many thanks for your time and insight -- any final comments you would like to
share with our readers?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KG&lt;/strong&gt;: The Enhancements process is one of the most important parts of Kubernetes and requires
enormous amounts of coordination and collaboration of people and teams across the project to make it
successful. I’m thankful and inspired by everyone’s continued hard work and dedication to making the
project great. This is truly a wonderful community.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34;&gt;
&lt;p&gt;For more information, check the &lt;a href=&#34;https://kubernetes.io/blog/2023/11/02/sig-architecture-production-readiness-spotlight-2023/&#34;&gt;Production Readiness Review spotlight
interview&lt;/a&gt;
in this series.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes 1.32: Moving Volume Group Snapshots to Beta</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/18/kubernetes-1-32-volume-group-snapshot-beta/</link>
      <pubDate>Wed, 18 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/18/kubernetes-1-32-volume-group-snapshot-beta/</guid>
      <description>
        
        
        &lt;p&gt;Volume group snapshots were &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/&#34;&gt;introduced&lt;/a&gt;
as an Alpha feature with the Kubernetes 1.27 release.
The recent release of Kubernetes v1.32 moved that support to &lt;strong&gt;beta&lt;/strong&gt;.
The support for volume group snapshots relies on a set of
&lt;a href=&#34;https://kubernetes-csi.github.io/docs/group-snapshot-restore-feature.html#volume-group-snapshot-apis&#34;&gt;extension APIs for group snapshots&lt;/a&gt;.
These APIs allow users to take crash consistent snapshots for a set of volumes.
Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims
for snapshotting.
A key aim is to allow you to restore that set of snapshots to new volumes and
recover your workload based on a crash consistent recovery point.&lt;/p&gt;
&lt;p&gt;This new feature is only supported for &lt;a href=&#34;https://kubernetes-csi.github.io/docs/&#34;&gt;CSI&lt;/a&gt; volume drivers.&lt;/p&gt;
&lt;h2 id=&#34;an-overview-of-volume-group-snapshots&#34;&gt;An overview of volume group snapshots&lt;/h2&gt;
&lt;p&gt;Some storage systems provide the ability to create a crash consistent snapshot of
multiple volumes. A group snapshot represents &lt;em&gt;copies&lt;/em&gt; made from multiple volumes that
are taken at the same point in time. A group snapshot can be used either to rehydrate
new volumes (pre-populated with the snapshot data) or to restore existing volumes to
a previous state (represented by the snapshots).&lt;/p&gt;
&lt;h2 id=&#34;why-add-volume-group-snapshots-to-kubernetes&#34;&gt;Why add volume group snapshots to Kubernetes?&lt;/h2&gt;
&lt;p&gt;The Kubernetes volume plugin system already provides a powerful abstraction that
automates the provisioning, attaching, mounting, resizing, and snapshotting of block
and file storage.&lt;/p&gt;
&lt;p&gt;Underpinning all these features is the Kubernetes goal of workload portability:
Kubernetes aims to create an abstraction layer between distributed applications and
underlying clusters so that applications can be agnostic to the specifics of the
cluster they run on and application deployment requires no cluster specific knowledge.&lt;/p&gt;
&lt;p&gt;There was already a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/storage/volume-snapshots/&#34;&gt;VolumeSnapshot&lt;/a&gt; API
that provides the ability to take a snapshot of a persistent volume to protect against
data loss or data corruption. However, there are other snapshotting functionalities
not covered by the VolumeSnapshot API.&lt;/p&gt;
&lt;p&gt;Some storage systems support consistent group snapshots that allow a snapshot to be
taken from multiple volumes at the same point in time to achieve write order consistency.
This can be useful for applications that contain multiple volumes. For example,
an application may have data stored in one volume and logs stored in another volume.
If snapshots for the data volume and the logs volume are taken at different times,
the application will not be consistent and will not function properly if it is restored
from those snapshots when a disaster strikes.&lt;/p&gt;
&lt;p&gt;It is true that you can quiesce the application first, take an individual snapshot from
each volume that is part of the application one after the other, and then unquiesce the
application after all the individual snapshots are taken. This way, you would get
application consistent snapshots.&lt;/p&gt;
&lt;p&gt;However, sometimes the application quiesce can be so time consuming that you want to do it less frequently,
or it may not be possible to quiesce an application at all.
For example, a user may want to run weekly backups with application quiesce
and nightly backups without application quiesce but with consistent group support which
provides crash consistency across all volumes in the group.&lt;/p&gt;
&lt;h2 id=&#34;kubernetes-apis-for-volume-group-snapshots&#34;&gt;Kubernetes APIs for volume group snapshots&lt;/h2&gt;
&lt;p&gt;Kubernetes&#39; support for &lt;em&gt;volume group snapshots&lt;/em&gt; relies on three API kinds that
are used for managing snapshots:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;VolumeGroupSnapshot&lt;/dt&gt;
&lt;dd&gt;Created by a Kubernetes user (or perhaps by your own automation) to request
creation of a volume group snapshot for multiple persistent volume claims.
It contains information about the volume group snapshot operation such as the
timestamp when the volume group snapshot was taken and whether it is ready to use.
The creation and deletion of this object represents a desire to create or delete a
cluster resource (a group snapshot).&lt;/dd&gt;
&lt;dt&gt;VolumeGroupSnapshotContent&lt;/dt&gt;
&lt;dd&gt;Created by the snapshot controller for a dynamically created VolumeGroupSnapshot.
It contains information about the volume group snapshot including the volume group
snapshot ID.
This object represents a provisioned resource on the cluster (a group snapshot).
The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it
was created with a one-to-one mapping.&lt;/dd&gt;
&lt;dt&gt;VolumeGroupSnapshotClass&lt;/dt&gt;
&lt;dd&gt;Created by cluster administrators to describe how volume group snapshots should be
created, including the driver information, the deletion policy, etc.&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;These three API kinds are defined as
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/extend-kubernetes/api-extension/custom-resources/&#34;&gt;CustomResourceDefinitions&lt;/a&gt;
(CRDs).
These CRDs must be installed in a Kubernetes cluster for a CSI Driver to support
volume group snapshots.&lt;/p&gt;
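&lt;p&gt;For illustration, a VolumeGroupSnapshotClass might look like the following sketch. The
driver name here is a placeholder; check your CSI driver&#39;s documentation for the driver name
and any parameters it supports.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupSnapclass
driver: hostpath.csi.k8s.io # placeholder: the name of your CSI driver
deletionPolicy: Delete
&lt;/code&gt;&lt;/pre&gt;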
&lt;h2 id=&#34;what-components-are-needed-to-support-volume-group-snapshots&#34;&gt;What components are needed to support volume group snapshots&lt;/h2&gt;
&lt;p&gt;Volume group snapshots are implemented in the
&lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshotter&#34;&gt;external-snapshotter&lt;/a&gt; repository.
Implementing volume group snapshots meant adding or changing several components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New CustomResourceDefinitions for VolumeGroupSnapshot and two supporting APIs were added.&lt;/li&gt;
&lt;li&gt;Volume group snapshot controller logic was added to the common snapshot controller.&lt;/li&gt;
&lt;li&gt;Logic to make CSI calls was added to the snapshotter sidecar controller.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The volume snapshot controller and CRDs are deployed once per
cluster, while the sidecar is bundled with each CSI driver.&lt;/p&gt;
&lt;p&gt;Therefore, it makes sense to deploy the volume snapshot controller and CRDs as a cluster addon.&lt;/p&gt;
&lt;p&gt;The Kubernetes project recommends that Kubernetes distributors
bundle and deploy the volume snapshot controller and CRDs as part
of their Kubernetes cluster management process (independent of any CSI Driver).&lt;/p&gt;
&lt;h2 id=&#34;what-s-new-in-beta&#34;&gt;What&#39;s new in Beta?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The VolumeGroupSnapshot feature in CSI spec moved to GA in the &lt;a href=&#34;https://github.com/container-storage-interface/spec/releases/tag/v1.11.0&#34;&gt;v1.11.0 release&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The snapshot validation webhook was deprecated in external-snapshotter v8.0.0 and has now been removed.
Most of the validation webhook logic was added as validation rules into the CRDs.
The minimum required Kubernetes version for these validation rules is 1.25.
One check from the validation webhook that was not moved to the CRDs is the prevention of creating
multiple default volume snapshot classes and multiple default volume group snapshot classes
for the same CSI driver.
With the removal of the validation webhook, an error will still be raised when dynamically
provisioning a VolumeSnapshot or VolumeGroupSnapshot when multiple default volume snapshot
classes or multiple default volume group snapshot classes for the same CSI driver exist.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;enable-volumegroup-snapshot&lt;/code&gt; flag in the snapshot-controller and the CSI snapshotter
sidecar has been replaced by a feature gate.
Since VolumeGroupSnapshot is a new API, the feature moves to Beta, but the feature gate is
disabled by default.
To use this feature, enable the feature gate by adding the flag &lt;code&gt;--feature-gates=CSIVolumeGroupSnapshot=true&lt;/code&gt;
when starting the snapshot-controller and the CSI snapshotter sidecar.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The logic to dynamically create the VolumeGroupSnapshot and its corresponding individual
VolumeSnapshot and VolumeSnapshotContent objects is moved from the CSI snapshotter to the common
snapshot-controller.
New RBAC rules are added to the common snapshot-controller and some RBAC rules are removed from
the CSI snapshotter sidecar accordingly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
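&lt;p&gt;As a sketch, the feature gate mentioned above is enabled by adding the flag to the
container arguments of the snapshot-controller (and likewise for the CSI snapshotter sidecar).
The image tag and the other arguments below are placeholders; consult the external-snapshotter
documentation for your deployment.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# fragment of a snapshot-controller Deployment pod spec
containers:
- name: snapshot-controller
  image: registry.k8s.io/sig-storage/snapshot-controller:v8.2.0 # placeholder tag
  args:
  - --leader-election=true
  - --feature-gates=CSIVolumeGroupSnapshot=true
&lt;/code&gt;&lt;/pre&gt;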
&lt;h2 id=&#34;how-do-i-use-kubernetes-volume-group-snapshots&#34;&gt;How do I use Kubernetes volume group snapshots&lt;/h2&gt;
&lt;h3 id=&#34;creating-a-new-group-snapshot-with-kubernetes&#34;&gt;Creating a new group snapshot with Kubernetes&lt;/h3&gt;
&lt;p&gt;Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to
snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot
object.&lt;/p&gt;
&lt;p&gt;The source of the group snapshot specifies whether the underlying group snapshot
should be dynamically created or if a pre-existing VolumeGroupSnapshotContent
should be used.&lt;/p&gt;
&lt;p&gt;A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator.
It contains the details of the real volume group snapshot on the storage system which
is available for use by cluster users.&lt;/p&gt;
&lt;p&gt;One of the following members in the source of the group snapshot must be set.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;selector&lt;/code&gt; - a label query over PersistentVolumeClaims that are to be grouped
together for snapshotting. This selector will be used to match the label
added to a PVC.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;volumeGroupSnapshotContentName&lt;/code&gt; - specifies the name of a pre-existing
VolumeGroupSnapshotContent object representing an existing volume group snapshot.&lt;/li&gt;
&lt;/ul&gt;
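&lt;p&gt;For example, a VolumeGroupSnapshot that refers to a pre-existing
VolumeGroupSnapshotContent, rather than requesting dynamic provisioning, could look like the
following (the object names are illustrative):&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: restored-group-snapshot
  namespace: demo-namespace
spec:
  source:
    volumeGroupSnapshotContentName: pre-provisioned-group-snapshot-content
&lt;/code&gt;&lt;/pre&gt;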
&lt;h4 id=&#34;dynamically-provision-a-group-snapshot&#34;&gt;Dynamically provision a group snapshot&lt;/h4&gt;
&lt;p&gt;In the following example, there are two PVCs.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;pvc-0   Bound    pvc-6e1f7d34-a5c5-4548-b104-01e72c72b9f2   100Mi      RWO            csi-hostpath-sc   &amp;lt;unset&amp;gt;                 2m15s
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;pvc-1   Bound    pvc-abc640b3-2cc1-4c56-ad0c-4f0f0e636efa   100Mi      RWO            csi-hostpath-sc   &amp;lt;unset&amp;gt;                 2m7s
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Label the PVCs.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;%&lt;/span&gt; kubectl label pvc pvc-0 &lt;span style=&#34;color:#b8860b&#34;&gt;group&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;myGroup
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;persistentvolumeclaim/pvc-0 labeled
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;&lt;/span&gt;&lt;span style=&#34;&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;%&lt;/span&gt; kubectl label pvc pvc-1 &lt;span style=&#34;color:#b8860b&#34;&gt;group&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;myGroup
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;persistentvolumeclaim/pvc-1 labeled
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs
with the matching labels to be snapshotted together.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;groupsnapshot.storage.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;VolumeGroupSnapshot&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;snapshot-daily-20241217&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo-namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeGroupSnapshotClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;csi-groupSnapclass&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;source&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;selector&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;matchLabels&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;        &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;group&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;myGroup&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass, which
identifies the CSI driver to be used for creating the group snapshot.
A VolumeGroupSnapshotClass is required for dynamic provisioning.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;groupsnapshot.storage.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;VolumeGroupSnapshotClass&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;csi-groupSnapclass&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;annotations&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kubernetes.io/description&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;Example group snapshot class&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;driver&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example.csi.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;deletionPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Delete&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As a result of the volume group snapshot creation, a corresponding VolumeGroupSnapshotContent
object will be created with a volumeGroupSnapshotHandle pointing to a resource on the storage
system.&lt;/p&gt;
&lt;p&gt;Two individual volume snapshots will be created as part of the volume group snapshot creation.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;NAME                                                                        READYTOUSE   SOURCEPVC   RESTORESIZE   SNAPSHOTCONTENT                                                                AGE
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   true         pvc-0       100Mi         snapcontent-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   16m
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;snapshot-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   true         pvc-1       100Mi         snapcontent-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   16m
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h4 id=&#34;importing-an-existing-group-snapshot-with-kubernetes&#34;&gt;Importing an existing group snapshot with Kubernetes&lt;/h4&gt;
&lt;p&gt;To import a pre-existing volume group snapshot into Kubernetes, you must also import
the corresponding individual volume snapshots.&lt;/p&gt;
&lt;p&gt;Identify the individual volume snapshot handles, manually construct a
VolumeSnapshotContent object first, then create a VolumeSnapshot object pointing to
the VolumeSnapshotContent object. Repeat this for every individual volume snapshot.&lt;/p&gt;
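&lt;p&gt;As a sketch, a statically provisioned pair for one member snapshot might look like the following (the driver name and snapshot handle below are illustrative; substitute the values from your storage system):&lt;/p&gt;

```yaml
# Illustrative static provisioning of one individual snapshot.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: static-snapshot-content-0
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    # Handle of a snapshot that already exists on the storage system.
    snapshotHandle: e8779147-a93e-11ef-9549-66940726f2fd
  volumeSnapshotRef:
    name: static-snapshot-0
    namespace: demo-namespace
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: static-snapshot-0
  namespace: demo-namespace
spec:
  source:
    volumeSnapshotContentName: static-snapshot-content-0
```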
&lt;p&gt;Then manually create a VolumeGroupSnapshotContent object, specifying the
volumeGroupSnapshotHandle and individual volumeSnapshotHandles already existing
on the storage system.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;groupsnapshot.storage.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;VolumeGroupSnapshotContent&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;static-group-content&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;deletionPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;Delete&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;driver&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;hostpath.csi.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;source&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;groupSnapshotHandles&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeGroupSnapshotHandle&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;e8779136-a93e-11ef-9549-66940726f2fd&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeSnapshotHandles&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- e8779147-a93e-11ef-9549-66940726f2fd&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;- e8783cd0-a93e-11ef-9549-66940726f2fd&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeGroupSnapshotRef&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;static-group-snapshot&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo-namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After that, create a VolumeGroupSnapshot object that points to the VolumeGroupSnapshotContent
object.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;groupsnapshot.storage.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;VolumeGroupSnapshot&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;static-group-snapshot&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo-namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;source&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;volumeGroupSnapshotContentName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;static-group-content&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;how-to-use-group-snapshot-for-restore-in-kubernetes&#34;&gt;How to use group snapshot for restore in Kubernetes&lt;/h3&gt;
&lt;p&gt;At restore time, the user can request a new PersistentVolumeClaim to be created from
a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger
provisioning of a new volume that is pre-populated with data from the specified
snapshot. The user should repeat this until all volumes are created from all the
snapshots that are part of a group snapshot.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;v1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;PersistentVolumeClaim&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;metadata&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;examplepvc-restored-2024-12-17&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;namespace&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;demo-namespace&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;spec&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storageClassName&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;example-foo-nearline&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;dataSource&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;name&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;VolumeSnapshot&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiGroup&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;snapshot.storage.k8s.io&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;accessModes&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;- ReadWriteOncePod&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;resources&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;    &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;requests&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;      &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;storage&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;100Mi&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#080;font-style:italic&#34;&gt;# must be enough storage to fit the existing snapshot&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;as-a-storage-vendor-how-do-i-add-support-for-group-snapshots-to-my-csi-driver&#34;&gt;As a storage vendor, how do I add support for group snapshots to my CSI driver?&lt;/h2&gt;
&lt;p&gt;To implement the volume group snapshot feature, a CSI driver &lt;strong&gt;must&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement a new group controller service.&lt;/li&gt;
&lt;li&gt;Implement group controller RPCs: &lt;code&gt;CreateVolumeGroupSnapshot&lt;/code&gt;, &lt;code&gt;DeleteVolumeGroupSnapshot&lt;/code&gt;, and &lt;code&gt;GetVolumeGroupSnapshot&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add group controller capability &lt;code&gt;CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See the &lt;a href=&#34;https://github.com/container-storage-interface/spec/blob/master/spec.md&#34;&gt;CSI spec&lt;/a&gt;
and the &lt;a href=&#34;https://kubernetes-csi.github.io/docs/&#34;&gt;Kubernetes-CSI Driver Developer Guide&lt;/a&gt;
for more details.&lt;/p&gt;
&lt;p&gt;As mentioned earlier, it is strongly recommended that Kubernetes distributors
bundle and deploy the volume snapshot controller and CRDs as part
of their Kubernetes cluster management process (independent of any CSI Driver).&lt;/p&gt;
&lt;p&gt;As part of this recommended deployment process, the Kubernetes team provides a number of
sidecar (helper) containers, including the
&lt;a href=&#34;https://kubernetes-csi.github.io/docs/external-snapshotter.html&#34;&gt;external-snapshotter sidecar container&lt;/a&gt;
which has been updated to support volume group snapshot.&lt;/p&gt;
&lt;p&gt;The external-snapshotter watches the Kubernetes API server for
VolumeGroupSnapshotContent objects, and triggers &lt;code&gt;CreateVolumeGroupSnapshot&lt;/code&gt; and
&lt;code&gt;DeleteVolumeGroupSnapshot&lt;/code&gt; operations against a CSI endpoint.&lt;/p&gt;
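&lt;p&gt;As an illustration, deploying the sidecar with group snapshot support typically means passing a feature gate in its container args. The fragment below is a sketch: the exact feature-gate name and image tag are assumptions, so confirm them against the external-snapshotter release you deploy.&lt;/p&gt;

```yaml
# Sketch of the external-snapshotter sidecar container spec.
# The feature-gate name and image tag are assumptions; verify them
# against your external-snapshotter release notes.
- name: csi-snapshotter
  image: registry.k8s.io/sig-storage/csi-snapshotter:v8.2.0
  args:
    - "--csi-address=$(ADDRESS)"
    - "--feature-gates=CSIVolumeGroupSnapshot=true"
```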
&lt;h2 id=&#34;what-are-the-limitations&#34;&gt;What are the limitations?&lt;/h2&gt;
&lt;p&gt;The beta implementation of volume group snapshots for Kubernetes has the following limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Does not support reverting an existing PVC to an earlier state represented by
a snapshot (only supports provisioning a new volume from a snapshot).&lt;/li&gt;
&lt;li&gt;No application consistency guarantees beyond any guarantees provided by the storage system
(e.g. crash consistency). See this &lt;a href=&#34;https://github.com/kubernetes/community/blob/30d06f49fba22273f31b3c616b74cf8745c19b3d/wg-data-protection/data-protection-workflows-white-paper.md#quiesce-and-unquiesce-hooks&#34;&gt;doc&lt;/a&gt;
for more discussions on application consistency.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What’s next?&lt;/h2&gt;
&lt;p&gt;Depending on feedback and adoption, the Kubernetes project plans to push the volume
group snapshot implementation to general availability (GA) in a future release.&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot&#34;&gt;design spec&lt;/a&gt;
for the volume group snapshot feature.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://github.com/kubernetes-csi/external-snapshotter&#34;&gt;code repository&lt;/a&gt; for volume group
snapshot APIs and controller.&lt;/li&gt;
&lt;li&gt;CSI &lt;a href=&#34;https://kubernetes-csi.github.io/docs/&#34;&gt;documentation&lt;/a&gt; on the group snapshot feature.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;how-do-i-get-involved&#34;&gt;How do I get involved?&lt;/h2&gt;
&lt;p&gt;This project, like all of Kubernetes, is the result of hard work by many contributors
from diverse backgrounds working together. On behalf of SIG Storage, I would like to
offer a huge thank you to the contributors who stepped up these last few quarters
to help the project reach beta:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ben Swartzlander (&lt;a href=&#34;https://github.com/bswartz&#34;&gt;bswartz&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Cici Huang (&lt;a href=&#34;https://github.com/cici37&#34;&gt;cici37&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Hemant Kumar (&lt;a href=&#34;https://github.com/gnufied&#34;&gt;gnufied&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;James Defelice (&lt;a href=&#34;https://github.com/jdef&#34;&gt;jdef&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Jan Šafránek (&lt;a href=&#34;https://github.com/jsafrane&#34;&gt;jsafrane&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Madhu Rajanna (&lt;a href=&#34;https://github.com/Madhu-1&#34;&gt;Madhu-1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Manish M Yathnalli (&lt;a href=&#34;https://github.com/manishym&#34;&gt;manishym&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Michelle Au (&lt;a href=&#34;https://github.com/msau42&#34;&gt;msau42&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Niels de Vos (&lt;a href=&#34;https://github.com/nixpanic&#34;&gt;nixpanic&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Leonardo Cecchi (&lt;a href=&#34;https://github.com/leonardoce&#34;&gt;leonardoce&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Rakshith R (&lt;a href=&#34;https://github.com/Rakshith-R&#34;&gt;Rakshith-R&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Raunak Shah (&lt;a href=&#34;https://github.com/RaunakShah&#34;&gt;RaunakShah&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Saad Ali (&lt;a href=&#34;https://github.com/saad-ali&#34;&gt;saad-ali&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Xing Yang (&lt;a href=&#34;https://github.com/xing-yang&#34;&gt;xing-yang&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Yati Padia (&lt;a href=&#34;https://github.com/yati1998&#34;&gt;yati1998&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For those interested in getting involved with the design and development of CSI or
any part of the Kubernetes Storage system, join the
&lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-storage&#34;&gt;Kubernetes Storage Special Interest Group&lt;/a&gt; (SIG).
We always welcome new contributors.&lt;/p&gt;
&lt;p&gt;We also hold regular &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/wg-data-protection&#34;&gt;Data Protection Working Group meetings&lt;/a&gt;.
New attendees are welcome to join our discussions.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Enhancing Kubernetes API Server Efficiency with API Streaming</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/17/kube-apiserver-api-streaming/</link>
      <pubDate>Tue, 17 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/17/kube-apiserver-api-streaming/</guid>
      <description>
        
        
&lt;p&gt;Managing Kubernetes clusters efficiently is critical, especially as they grow in size.
A significant challenge with large clusters is the memory overhead caused by &lt;strong&gt;list&lt;/strong&gt; requests.&lt;/p&gt;
&lt;p&gt;In the existing implementation, the kube-apiserver processes &lt;strong&gt;list&lt;/strong&gt; requests by assembling the entire response in-memory before transmitting any data to the client.
But what if the response body is substantial, say hundreds of megabytes? Additionally, imagine a scenario where multiple &lt;strong&gt;list&lt;/strong&gt; requests flood in simultaneously, perhaps after a brief network outage.
While &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/flow-control/&#34;&gt;API Priority and Fairness&lt;/a&gt; has proven to reasonably protect kube-apiserver from CPU overload, its impact is visibly smaller for memory protection.
This can be explained by the differing nature of resource consumption by a single API request - the CPU usage at any given time is capped by a constant, whereas memory, being uncompressible, can grow proportionally with the number of processed objects and is unbounded.
This situation poses a genuine risk, potentially overwhelming and crashing any kube-apiserver within seconds due to out-of-memory (OOM) conditions. To better visualize the issue, let&#39;s consider the below graph.&lt;/p&gt;


&lt;figure class=&#34;diagram-large clickable-zoom&#34;&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/17/kube-apiserver-api-streaming/kube-apiserver-memory_usage.png&#34;
         alt=&#34;Monitoring graph showing kube-apiserver memory usage&#34;/&gt; 
&lt;/figure&gt;
&lt;p&gt;The graph shows the memory usage of a kube-apiserver during a synthetic test.
(see the &lt;a href=&#34;#the-synthetic-test&#34;&gt;synthetic test&lt;/a&gt; section for more details).
The results clearly show that increasing the number of informers significantly boosts the server&#39;s memory consumption.
Notably, at approximately 16:40, the server crashed when serving only 16 informers.&lt;/p&gt;
&lt;h2 id=&#34;why-does-kube-apiserver-allocate-so-much-memory-for-list-requests&#34;&gt;Why does kube-apiserver allocate so much memory for list requests?&lt;/h2&gt;
&lt;p&gt;Our investigation revealed that this substantial memory allocation occurs because, before sending the first byte to the client, the server must:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fetch data from the database,&lt;/li&gt;
&lt;li&gt;deserialize the data from its stored format,&lt;/li&gt;
&lt;li&gt;and finally construct the final response by converting and serializing the data into the client-requested format&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This sequence results in significant temporary memory consumption.
The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects.&lt;/p&gt;
&lt;p&gt;Unfortunately, neither &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/flow-control/&#34;&gt;API Priority and Fairness&lt;/a&gt; nor Golang&#39;s garbage collection or Golang memory limits can prevent the system from exhausting memory under these conditions.
The memory is allocated suddenly and rapidly, and just a few requests can quickly deplete the available memory, leading to resource exhaustion.&lt;/p&gt;
&lt;p&gt;Depending on how the API server is run on the node, it might either be OOM-killed by the kernel when it exceeds the configured memory limits during these uncontrolled spikes, or, if no limits are configured, it might have an even worse impact on the control plane node.
Even worse, after the first API server failure, the same requests will likely hit another control plane node in an HA setup, probably with the same impact.
This is a situation that is potentially hard to diagnose and hard to recover from.&lt;/p&gt;
&lt;h2 id=&#34;streaming-list-requests&#34;&gt;Streaming list requests&lt;/h2&gt;
&lt;p&gt;Today, we&#39;re excited to announce a major improvement.
With the graduation of the &lt;em&gt;watch list&lt;/em&gt; feature to beta in Kubernetes 1.32, client-go users can opt in (after explicitly enabling the &lt;code&gt;WatchListClient&lt;/code&gt; feature gate)
to streaming lists by switching from &lt;strong&gt;list&lt;/strong&gt; to (a special kind of) &lt;strong&gt;watch&lt;/strong&gt; requests.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Watch&lt;/strong&gt; requests are served from the &lt;em&gt;watch cache&lt;/em&gt;, an in-memory cache designed to improve scalability of read operations.
By streaming each item individually instead of returning the entire collection, the new method maintains constant memory overhead.
The API server is bound by the maximum allowed size of an object in etcd plus a few additional allocations.
This approach drastically reduces the temporary memory usage compared to traditional &lt;strong&gt;list&lt;/strong&gt; requests, ensuring a more efficient and stable system,
especially in clusters with a large number of objects of a given type, or with large average object sizes, where memory consumption used to be high despite paging.&lt;/p&gt;
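&lt;p&gt;As an illustrative sketch (not the actual API server code), the memory difference between buffering a whole collection and streaming it item by item looks like this:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
)

// serializeAllAtOnce models the traditional list path: the entire
// response is assembled in memory before the first byte is sent.
func serializeAllAtOnce(items [][]byte) []byte {
	var buf bytes.Buffer
	for _, it := range items {
		buf.Write(it)
	}
	return buf.Bytes()
}

// streamItems models the watch-list path: each item leaves the server
// before the next one is encoded, so peak memory stays near one item.
func streamItems(items [][]byte, send func([]byte)) {
	for _, it := range items {
		send(it)
	}
}

func main() {
	items := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	fmt.Println(len(serializeAllAtOnce(items))) // prints: 3

	sent := 0
	streamItems(items, func(b []byte) { sent += len(b) })
	fmt.Println(sent) // prints: 3
}
```

&lt;p&gt;Both paths deliver the same bytes; the difference is only how long the server must hold them all at once.&lt;/p&gt;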
&lt;p&gt;Building on the insight gained from the synthetic test (see the &lt;a href=&#34;#the-synthetic-test&#34;&gt;synthetic test&lt;/a&gt;), we developed an automated performance test to systematically evaluate the impact of the &lt;em&gt;watch list&lt;/em&gt; feature.
This test replicates the same scenario, generating a large number of Secrets with a large payload, and scaling the number of informers to simulate heavy &lt;strong&gt;list&lt;/strong&gt; request patterns.
The automated test is executed periodically to monitor memory usage of the server with the feature enabled and disabled.&lt;/p&gt;
&lt;p&gt;The results showed significant improvements with the &lt;em&gt;watch list&lt;/em&gt; feature enabled.
With the feature turned on, the kube-apiserver’s memory consumption stabilized at approximately &lt;strong&gt;2 GB&lt;/strong&gt;.
By contrast, with the feature disabled, memory usage increased to approximately &lt;strong&gt;20 GB&lt;/strong&gt;, a &lt;strong&gt;10x&lt;/strong&gt; increase!
These results confirm the effectiveness of the new streaming API, which reduces the temporary memory footprint.&lt;/p&gt;
&lt;h2 id=&#34;enabling-api-streaming-for-your-component&#34;&gt;Enabling API Streaming for your component&lt;/h2&gt;
&lt;p&gt;Upgrade to Kubernetes 1.32. Make sure your cluster uses etcd in version 3.4.31+ or 3.5.13+.
Change your client software to use watch lists. If your client code is written in Golang, you&#39;ll want to enable &lt;code&gt;WatchListClient&lt;/code&gt; for client-go.
For details on enabling that feature, read &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/08/12/feature-gates-in-client-go&#34;&gt;Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-s-next&#34;&gt;What&#39;s next?&lt;/h2&gt;
&lt;p&gt;In Kubernetes 1.32, the feature is enabled by default in the kube-controller-manager despite its beta state.
This will eventually be expanded to other core components, such as the kube-scheduler or the kubelet, once the feature becomes generally available, if not earlier.
Third-party components are encouraged to opt in to the feature during the beta phase, especially when they are at risk of accessing a large number of resources, or kinds with potentially large object sizes.&lt;/p&gt;
&lt;p&gt;For the time being, &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/cluster-administration/flow-control/&#34;&gt;API Priority and Fairness&lt;/a&gt; assigns a reasonably small cost to &lt;strong&gt;list&lt;/strong&gt; requests.
This is necessary to allow enough parallelism for the average case, where &lt;strong&gt;list&lt;/strong&gt; requests are cheap enough.
But it does not match the exceptional, spiky situation of many large objects.
Once the majority of the Kubernetes ecosystem has switched to &lt;em&gt;watch list&lt;/em&gt;, the &lt;strong&gt;list&lt;/strong&gt; cost estimation can be raised without risking degraded performance in the average case,
thereby increasing the protection against the kind of requests that can still hit the API server in the future.&lt;/p&gt;
&lt;h2 id=&#34;the-synthetic-test&#34;&gt;The synthetic test&lt;/h2&gt;
&lt;p&gt;In order to reproduce the issue, we conducted a manual test to understand the impact of &lt;strong&gt;list&lt;/strong&gt; requests on kube-apiserver memory usage.
In the test, we created 400 Secrets, each containing 1 MB of data, and used informers to retrieve all Secrets.&lt;/p&gt;
&lt;p&gt;The results were alarming: only 16 informers were needed to cause the test server to run out of memory and crash, demonstrating how quickly memory consumption can grow under such conditions.&lt;/p&gt;
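&lt;p&gt;A back-of-the-envelope calculation shows why: with 400 Secrets of 1 MiB each, every informer performing a full &lt;strong&gt;list&lt;/strong&gt; pulls roughly 400 MiB, so 16 concurrent informers account for over 6 GiB of raw payload alone, before any decoding and conversion overhead:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	const secretCount = 400
	const secretSizeMiB = 1
	const informerCount = 16
	// Raw payload alone; deserialization and conversion described
	// earlier multiply the real memory cost of each request.
	payloadMiB := secretCount * secretSizeMiB * informerCount
	fmt.Println(payloadMiB, "MiB of raw payload in flight") // prints: 6400 MiB of raw payload in flight
}
```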
&lt;p&gt;Special shout out to &lt;a href=&#34;https://github.com/deads2k&#34;&gt;@deads2k&lt;/a&gt; for his help in shaping this feature.&lt;/p&gt;
&lt;h2 id=&#34;kubernetes-1-33-update&#34;&gt;Kubernetes 1.33 update&lt;/h2&gt;
&lt;p&gt;Since work on this feature began, &lt;a href=&#34;https://github.com/serathius&#34;&gt;Marek Siarkowicz&lt;/a&gt; has integrated a new technology into the
Kubernetes API server: &lt;em&gt;streaming collection encoding&lt;/em&gt;.
Kubernetes v1.33 introduced two related feature gates, &lt;code&gt;StreamingCollectionEncodingToJSON&lt;/code&gt; and &lt;code&gt;StreamingCollectionEncodingToProtobuf&lt;/code&gt;.
These features encode the response as a stream and avoid allocating all the memory at once.
This functionality is bit-for-bit compatible with existing &lt;strong&gt;list&lt;/strong&gt; encodings, produces even greater server-side memory savings, and doesn&#39;t require any changes to client code.
In 1.33, the &lt;code&gt;WatchList&lt;/code&gt; feature gate is disabled by default.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.32 Adds A New CPU Manager Static Policy Option For Strict CPU Reservation</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/16/cpumanager-strict-cpu-reservation/</link>
      <pubDate>Mon, 16 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/16/cpumanager-strict-cpu-reservation/</guid>
      <description>
        
        
        &lt;p&gt;In Kubernetes v1.32, after years of community discussion, we are excited to introduce a
&lt;code&gt;strict-cpu-reservation&lt;/code&gt; option for the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options&#34;&gt;CPU Manager static policy&lt;/a&gt;.
This feature is currently in alpha, with the associated policy hidden by default. You can only use the
policy if you explicitly enable the alpha behavior in your cluster.&lt;/p&gt;
&lt;h2 id=&#34;understanding-the-feature&#34;&gt;Understanding the feature&lt;/h2&gt;
&lt;p&gt;The CPU Manager static policy is used to reduce latency or improve performance. The &lt;code&gt;reservedSystemCPUs&lt;/code&gt; option defines an explicit CPU set for OS system daemons and Kubernetes system daemons. This option is designed for Telco/NFV use cases where uncontrolled interrupts/timers may impact workload performance. You can use this option to define an explicit cpuset for the system/Kubernetes daemons as well as for interrupts/timers, so that the remaining CPUs on the system can be used exclusively for workloads, with less impact from uncontrolled interrupts/timers. More details on this parameter can be found on the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/reserve-compute-resources/#explicitly-reserved-cpu-list&#34;&gt;Explicitly Reserved CPU List&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;If you want to protect your system daemons and interrupt processing, the obvious way is to use the &lt;code&gt;reservedSystemCPUs&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;However, until the Kubernetes v1.32 release, this isolation was only implemented for guaranteed
pods that made requests for a whole number of CPUs. At pod admission time, the kubelet only
compares the CPU &lt;em&gt;requests&lt;/em&gt; against the allocatable CPUs. In Kubernetes, limits can be higher than
the requests; the previous implementation allowed burstable and best-effort pods to use up
the capacity of &lt;code&gt;reservedSystemCPUs&lt;/code&gt;, which could then starve host OS services of CPU - and we
know that people saw this in real-life deployments.
The existing behavior also made benchmarking results (for both infrastructure and workloads) inaccurate.&lt;/p&gt;
&lt;p&gt;When this new &lt;code&gt;strict-cpu-reservation&lt;/code&gt; policy option is enabled, the CPU Manager static policy will not allow any workload to use the reserved system CPU cores.&lt;/p&gt;
&lt;h2 id=&#34;enabling-the-feature&#34;&gt;Enabling the feature&lt;/h2&gt;
&lt;p&gt;To enable this feature, you need to turn on both the &lt;code&gt;CPUManagerPolicyAlphaOptions&lt;/code&gt; feature gate and the &lt;code&gt;strict-cpu-reservation&lt;/code&gt; policy option. You also need to remove the &lt;code&gt;/var/lib/kubelet/cpu_manager_state&lt;/code&gt; file if it exists, and restart the kubelet.&lt;/p&gt;
&lt;p&gt;With the following kubelet configuration:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;kind&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;KubeletConfiguration&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;apiVersion&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;kubelet.config.k8s.io/v1beta1&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;featureGates&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;...&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;CPUManagerPolicyOptions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;CPUManagerPolicyAlphaOptions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#a2f;font-weight:bold&#34;&gt;true&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpuManagerPolicy&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;static&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;cpuManagerPolicyOptions&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;  &lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;strict-cpu-reservation&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#008000;font-weight:bold&#34;&gt;reservedSystemCPUs&lt;/span&gt;:&lt;span style=&#34;color:#bbb&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#b44&#34;&gt;&amp;#34;0,32,1,33,16,48&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#00f;font-weight:bold&#34;&gt;...&lt;/span&gt;&lt;span style=&#34;color:#bbb&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When &lt;code&gt;strict-cpu-reservation&lt;/code&gt; is not set or set to false:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;#&lt;/span&gt; cat /var/lib/kubelet/cpu_manager_state
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;0-63&amp;#34;,&amp;#34;checksum&amp;#34;:1058907510}
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;When &lt;code&gt;strict-cpu-reservation&lt;/code&gt; is set to true:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000080;font-weight:bold&#34;&gt;#&lt;/span&gt; cat /var/lib/kubelet/cpu_manager_state
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#888&#34;&gt;{&amp;#34;policyName&amp;#34;:&amp;#34;static&amp;#34;,&amp;#34;defaultCpuSet&amp;#34;:&amp;#34;2-15,17-31,34-47,49-63&amp;#34;,&amp;#34;checksum&amp;#34;:4141502832}
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;monitoring-the-feature&#34;&gt;Monitoring the feature&lt;/h2&gt;
&lt;p&gt;You can monitor the feature impact by checking the following CPU Manager counters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cpu_manager_shared_pool_size_millicores&lt;/code&gt;: reports the shared pool size, in millicores (e.g. 13500m)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cpu_manager_exclusive_cpu_allocation_count&lt;/code&gt;: reports the number of exclusively allocated cores, counting full cores (e.g. 16)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your best-effort workloads may starve if the &lt;code&gt;cpu_manager_shared_pool_size_millicores&lt;/code&gt; count is zero for a prolonged time.&lt;/p&gt;
&lt;p&gt;We believe any pod that is required for operational purposes, like a log forwarder, should not run as best-effort; you can review and adjust the number of reserved CPU cores as needed.&lt;/p&gt;
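&lt;p&gt;As a cross-check of the &lt;code&gt;cpu_manager_state&lt;/code&gt; output shown earlier, the shared pool is simply the set of all online CPUs minus &lt;code&gt;reservedSystemCPUs&lt;/code&gt;. The following sketch (assuming the 64-CPU node from the example above) reproduces the &lt;code&gt;defaultCpuSet&lt;/code&gt; notation:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// sharedPool returns every CPU id not present in reserved, mirroring
// what strict-cpu-reservation leaves for workloads.
func sharedPool(totalCPUs int, reserved []int) []int {
	res := make(map[int]bool, len(reserved))
	for _, c := range reserved {
		res[c] = true
	}
	var pool []int
	for c := 0; c != totalCPUs; c++ {
		if !res[c] {
			pool = append(pool, c)
		}
	}
	return pool
}

// ranges renders a sorted CPU list in the kubelet's "2-15,17-31" notation.
func ranges(cpus []int) string {
	sort.Ints(cpus)
	var parts []string
	i := 0
	for i != len(cpus) {
		j := i
		for j+1 != len(cpus) {
			if cpus[j+1] != cpus[j]+1 {
				break
			}
			j++
		}
		if i == j {
			parts = append(parts, strconv.Itoa(cpus[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", cpus[i], cpus[j]))
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

func main() {
	reserved := []int{0, 32, 1, 33, 16, 48} // from the example kubelet config
	fmt.Println(ranges(sharedPool(64, reserved)))
	// prints: 2-15,17-31,34-47,49-63
}
```

&lt;p&gt;The output matches the &lt;code&gt;defaultCpuSet&lt;/code&gt; recorded in &lt;code&gt;cpu_manager_state&lt;/code&gt; when &lt;code&gt;strict-cpu-reservation&lt;/code&gt; is set to true.&lt;/p&gt;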
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Strict CPU reservation is critical for Telco/NFV use cases. It is also a prerequisite for enabling the all-in-one type of deployments where workloads are placed on nodes serving combined control+worker+storage roles.&lt;/p&gt;
&lt;p&gt;We encourage you to start using the feature and look forward to your feedback.&lt;/p&gt;
&lt;h2 id=&#34;further-reading&#34;&gt;Further reading&lt;/h2&gt;
&lt;p&gt;Please check out the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/cpu-management-policies/&#34;&gt;Control CPU Management Policies on the Node&lt;/a&gt;
task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;This feature is driven by the &lt;a href=&#34;https://github.com/Kubernetes/community/blob/master/sig-node/README.md&#34;&gt;SIG Node&lt;/a&gt;. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.32: Memory Manager Goes GA</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/13/memory-manager-goes-ga/</link>
      <pubDate>Fri, 13 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/13/memory-manager-goes-ga/</guid>
      <description>
        
        
        &lt;p&gt;With Kubernetes 1.32, the memory manager has officially graduated to General Availability (GA),
marking a significant milestone in the journey toward efficient and predictable memory allocation for containerized applications.
Since Kubernetes v1.22, where it graduated to beta, the memory manager has proved itself reliable, stable, and a good complement to the
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/tasks/administer-cluster/cpu-management-policies/&#34;&gt;CPU Manager&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As part of kubelet&#39;s workload admission process,
the memory manager provides topology hints
to optimize memory allocation and alignment.
This enables users to allocate exclusive
memory for Pods in the &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/workloads/pods/pod-qos/#guaranteed&#34;&gt;Guaranteed&lt;/a&gt; QoS class.
More details about the process can be found in the memory manager goes to beta &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2021/08/11/kubernetes-1-22-feature-memory-manager-moves-to-beta/&#34;&gt;blog&lt;/a&gt;.&lt;/p&gt;
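&lt;p&gt;As a rough sketch (the values are illustrative, not recommendations), opting a node in to this behavior means selecting the &lt;code&gt;Static&lt;/code&gt; memory manager policy and declaring reserved memory in the kubelet configuration:&lt;/p&gt;

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
memoryManagerPolicy: Static
# reservedMemory must add up to the node's kube-reserved, system-reserved
# and eviction-threshold memory; 1Gi here is purely illustrative.
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi
```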
&lt;p&gt;Most of the changes introduced since the Beta are bug fixes, internal refactoring and
observability improvements, such as metrics and better logging.&lt;/p&gt;
&lt;h2 id=&#34;observability-improvements&#34;&gt;Observability improvements&lt;/h2&gt;
&lt;p&gt;As part of the effort
to increase the observability of memory manager, new metrics have been added
to provide some statistics on memory allocation patterns.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;memory_manager_pinning_requests_total&lt;/strong&gt; -
tracks the number of times the pod spec required the memory manager to pin memory pages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;memory_manager_pinning_errors_total&lt;/strong&gt; -
tracks the number of times the pod spec required the memory manager
to pin memory pages, but the allocation failed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;improving-memory-manager-reliability-and-consistency&#34;&gt;Improving memory manager reliability and consistency&lt;/h2&gt;
&lt;p&gt;The kubelet does not guarantee pod ordering
when admitting pods after a restart or reboot.&lt;/p&gt;
&lt;p&gt;In certain edge cases, this behavior could cause
the memory manager to reject some pods,
and in more extreme cases, it may cause kubelet to fail upon restart.&lt;/p&gt;
&lt;p&gt;Previously, the beta implementation lacked certain checks and logic to prevent
these issues.&lt;/p&gt;
&lt;p&gt;To stabilize the memory manager for general availability (GA) readiness,
small but critical refinements have been
made to the algorithm, improving its robustness and handling of edge cases.&lt;/p&gt;
&lt;h2 id=&#34;future-development&#34;&gt;Future development&lt;/h2&gt;
&lt;p&gt;There is more to come for the Topology Manager in general,
and the memory manager in particular.
Notably, ongoing efforts are underway
to extend &lt;a href=&#34;https://github.com/kubernetes/kubernetes/pull/128560&#34;&gt;memory manager support to Windows&lt;/a&gt;,
enabling CPU and memory affinity on a Windows operating system.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;This feature is driven by the &lt;a href=&#34;https://github.com/Kubernetes/community/blob/master/sig-node/README.md&#34;&gt;SIG Node&lt;/a&gt; community.
Please join us to connect with the community
and share your ideas and feedback around the above feature and
beyond.
We look forward to hearing from you!&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Kubernetes v1.32: QueueingHint Brings a New Possibility to Optimize Pod Scheduling</title>
      <link>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/12/scheduler-queueinghint/</link>
      <pubDate>Thu, 12 Dec 2024 00:00:00 +0000</pubDate>
      
      <guid>https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/12/scheduler-queueinghint/</guid>
      <description>
        
        
        &lt;p&gt;The Kubernetes &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/kube-scheduler/&#34;&gt;scheduler&lt;/a&gt; is the core
component that selects the nodes on which new Pods run. The scheduler processes
these new Pods &lt;strong&gt;one by one&lt;/strong&gt;. Therefore, the larger your clusters, the more important
the throughput of the scheduler becomes.&lt;/p&gt;
&lt;p&gt;Over the years, Kubernetes SIG Scheduling has improved the throughput
of the scheduler in multiple enhancements. This blog post describes a major improvement to the
scheduler in Kubernetes v1.32: a
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/scheduling-framework/#extension-points&#34;&gt;scheduling context element&lt;/a&gt;
named &lt;em&gt;QueueingHint&lt;/em&gt;. This page provides background knowledge of the scheduler and explains how
QueueingHint improves scheduling throughput.&lt;/p&gt;
&lt;h2 id=&#34;scheduling-queue&#34;&gt;Scheduling queue&lt;/h2&gt;
&lt;p&gt;The scheduler stores all unscheduled Pods in an internal component called the &lt;em&gt;scheduling queue&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The scheduling queue consists of the following data structures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ActiveQ&lt;/strong&gt;: holds newly created Pods or Pods that are ready to be retried for scheduling.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BackoffQ&lt;/strong&gt;: holds Pods that are ready to be retried but are waiting for a backoff period to end. The
backoff period depends on the number of unsuccessful scheduling attempts performed by the scheduler on that Pod.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unschedulable Pod Pool&lt;/strong&gt;: holds Pods that the scheduler won&#39;t attempt to schedule for one of the
following reasons:
&lt;ul&gt;
&lt;li&gt;The scheduler previously attempted and was unable to schedule the Pods. Since that attempt, the cluster
hasn&#39;t changed in a way that could make those Pods schedulable.&lt;/li&gt;
&lt;li&gt;The Pods are blocked from entering the scheduling cycles by PreEnqueue Plugins,
for example, they have a &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/pod-scheduling-readiness/#configuring-pod-schedulinggates&#34;&gt;scheduling gate&lt;/a&gt;,
and get blocked by the scheduling gate plugin.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
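&lt;p&gt;The three components above can be pictured with a minimal stand-in (the real scheduler types are far more involved):&lt;/p&gt;

```go
package main

import "fmt"

// SchedulingQueue is a minimal stand-in for the three components
// described above, holding pod names only.
type SchedulingQueue struct {
	ActiveQ           []string // ready to be scheduled now
	BackoffQ          []string // ready to retry, waiting out a backoff
	UnschedulablePods []string // parked until a relevant cluster event
}

func main() {
	q := SchedulingQueue{ActiveQ: []string{"pod-a"}}
	// A pod that fails a scheduling cycle is parked in the pool rather
	// than being retried hotly on every cycle.
	q.UnschedulablePods = append(q.UnschedulablePods, q.ActiveQ[0])
	q.ActiveQ = q.ActiveQ[:0]
	fmt.Println(len(q.ActiveQ), len(q.UnschedulablePods)) // prints: 0 1
}
```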
&lt;h2 id=&#34;scheduling-framework-and-plugins&#34;&gt;Scheduling framework and plugins&lt;/h2&gt;
&lt;p&gt;The Kubernetes scheduler is implemented following the Kubernetes
&lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/scheduling-framework/&#34;&gt;scheduling framework&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All scheduling features are implemented as plugins
(e.g., &lt;a href=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity&#34;&gt;Pod affinity&lt;/a&gt;
is implemented in the &lt;code&gt;InterPodAffinity&lt;/code&gt; plugin).&lt;/p&gt;
&lt;p&gt;The scheduler processes pending Pods in phases called &lt;em&gt;cycles&lt;/em&gt; as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scheduling cycle&lt;/strong&gt;: the scheduler takes pending Pods from the activeQ component of the scheduling
queue  &lt;em&gt;one by one&lt;/em&gt;. For each Pod, the scheduler runs the filtering/scoring logic from every scheduling plugin. The
scheduler then decides on the best node for the Pod, or decides that the Pod can&#39;t be scheduled at that time.&lt;/p&gt;
&lt;p&gt;If the scheduler decides that a Pod can&#39;t be scheduled, that Pod enters the Unschedulable Pod Pool
component of the scheduling queue. However, if the scheduler decides to place the Pod on a node,
the Pod goes to the binding cycle.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Binding cycle&lt;/strong&gt;: the scheduler communicates the node placement decision to the Kubernetes API
server. This operation binds the Pod to the selected node.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Aside from some exceptions, most unscheduled Pods enter the unschedulable pod pool after each scheduling
cycle. The Unschedulable Pod Pool component is crucial because of how the scheduling cycle processes Pods one by one. If the scheduler had to constantly retry placing unschedulable Pods, instead of offloading those
Pods to the Unschedulable Pod Pool, multiple scheduling cycles would be wasted on those Pods.&lt;/p&gt;
&lt;h2 id=&#34;improvements-to-retrying-pod-scheduling-with-queuinghint&#34;&gt;Improvements to retrying Pod scheduling with QueuingHint&lt;/h2&gt;
&lt;p&gt;Unschedulable Pods only move back into the ActiveQ or BackoffQ components of the scheduling
queue if changes in the cluster might allow the scheduler to place those Pods on nodes.&lt;/p&gt;
&lt;p&gt;Prior to v1.32, each plugin registered which cluster changes could resolve its failures (an object creation, update, or deletion in the cluster, called &lt;em&gt;cluster events&lt;/em&gt;)
with &lt;code&gt;EnqueueExtensions&lt;/code&gt; (&lt;code&gt;EventsToRegister&lt;/code&gt;),
and the scheduling queue retried a Pod when an event occurred that was registered by a plugin that had rejected the Pod in a previous scheduling cycle.&lt;/p&gt;
&lt;p&gt;Additionally, there was an internal feature called &lt;code&gt;preCheck&lt;/code&gt;, which further filtered events for efficiency, based on Kubernetes core scheduling constraints;
for example, &lt;code&gt;preCheck&lt;/code&gt; could filter out node-related events when the node status was &lt;code&gt;NotReady&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;However, those approaches had two issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requeueing based on cluster events alone was too broad and could lead to scheduling retries for no reason.
&lt;ul&gt;
&lt;li&gt;A newly scheduled Pod &lt;em&gt;might&lt;/em&gt; resolve the &lt;code&gt;InterPodAffinity&lt;/code&gt; plugin&#39;s failure, but not every one does.
For example, if a new Pod is created without a label matching the &lt;code&gt;InterPodAffinity&lt;/code&gt; term of the unschedulable Pod, that Pod still wouldn&#39;t be schedulable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;preCheck&lt;/code&gt; relied on the logic of in-tree plugins and was not extensible to custom plugins,
like in issue &lt;a href=&#34;https://github.com/kubernetes/kubernetes/issues/110175&#34;&gt;#110175&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where QueueingHints come into play:
a QueueingHint subscribes to a particular kind of cluster event and decides whether each incoming event could make the Pod schedulable.&lt;/p&gt;
&lt;p&gt;For example, consider a Pod named &lt;code&gt;pod-a&lt;/code&gt; that has a required Pod affinity. &lt;code&gt;pod-a&lt;/code&gt; was rejected in
the scheduling cycle by the &lt;code&gt;InterPodAffinity&lt;/code&gt; plugin because no node had an existing Pod that matched
the Pod affinity specification for &lt;code&gt;pod-a&lt;/code&gt;.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/12/scheduler-queueinghint/queueinghint1.svg&#34;
         alt=&#34;A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;code&gt;pod-a&lt;/code&gt; moves into the Unschedulable Pod Pool. The scheduling queue records which plugin caused
the scheduling failure for the Pod. For &lt;code&gt;pod-a&lt;/code&gt;, the scheduling queue records that the &lt;code&gt;InterPodAffinity&lt;/code&gt;
plugin rejected the Pod.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pod-a&lt;/code&gt; will never be schedulable until the InterPodAffinity failure is resolved.
There are some scenarios in which the failure could be resolved; one example is an existing running Pod getting a label update that now matches the Pod affinity.
For this scenario, the &lt;code&gt;InterPodAffinity&lt;/code&gt; plugin&#39;s &lt;code&gt;QueuingHint&lt;/code&gt; callback function checks every Pod label update that occurs in the cluster.
Then, if a Pod gets a label update that matches the Pod affinity requirement of &lt;code&gt;pod-a&lt;/code&gt;, the &lt;code&gt;InterPodAffinity&lt;/code&gt;
plugin&#39;s &lt;code&gt;QueuingHint&lt;/code&gt; prompts the scheduling queue to move &lt;code&gt;pod-a&lt;/code&gt; back into the ActiveQ or
the BackoffQ component.&lt;/p&gt;


&lt;figure&gt;
    &lt;img src=&#34;https://deploy-preview-55276--kubernetes-io-main-staging.netlify.app/blog/2024/12/12/scheduler-queueinghint/queueinghint2.svg&#34;
         alt=&#34;A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;
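&lt;p&gt;The hint logic described above can be sketched as a self-contained Go example. The types here are local stand-ins, not the real scheduler framework API; the function models how an &lt;code&gt;InterPodAffinity&lt;/code&gt;-style hint examines a Pod label update and decides whether a retry is worthwhile:&lt;/p&gt;

```go
package main

import "fmt"

type QueueingHint int

const (
	QueueSkip QueueingHint = iota // event cannot make the pod schedulable
	Queue                         // event may make it schedulable; retry
)

type Pod struct {
	Name   string
	Labels map[string]string
	// RequiredAffinityLabel models pod-a's required Pod affinity term.
	RequiredAffinityLabel string
}

// isPodLabelUpdateRelevant mimics the hint pattern: only a label update
// that newly satisfies the pending pod's affinity warrants a retry.
func isPodLabelUpdateRelevant(pending Pod, oldPod, newPod Pod) QueueingHint {
	key := pending.RequiredAffinityLabel
	_, hadBefore := oldPod.Labels[key]
	_, hasNow := newPod.Labels[key]
	if hasNow {
		if !hadBefore {
			return Queue // the update may unblock pending: requeue it
		}
	}
	return QueueSkip // irrelevant update: leave the pod parked
}

func main() {
	pending := Pod{Name: "pod-a", RequiredAffinityLabel: "app"}
	old := Pod{Name: "pod-b", Labels: map[string]string{}}
	updated := Pod{Name: "pod-b", Labels: map[string]string{"app": "web"}}
	fmt.Println(isPodLabelUpdateRelevant(pending, old, updated) == Queue) // prints: true
}
```

&lt;p&gt;An update that does not touch the relevant label returns &lt;code&gt;QueueSkip&lt;/code&gt;, avoiding the wasted scheduling retries described above.&lt;/p&gt;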
&lt;h2 id=&#34;queueinghint-s-history-and-what-s-new-in-v1-32&#34;&gt;QueueingHint&#39;s history and what&#39;s new in v1.32&lt;/h2&gt;
&lt;p&gt;At SIG Scheduling, we have been working on the development of QueueingHint since
Kubernetes v1.28.&lt;/p&gt;
&lt;p&gt;While QueuingHint isn&#39;t user-facing, we implemented the &lt;code&gt;SchedulerQueueingHints&lt;/code&gt; feature gate as a
safety measure when we originally added this feature. In v1.28, we implemented QueueingHints with a
few in-tree plugins experimentally, and enabled the feature gate by default.&lt;/p&gt;
&lt;p&gt;However, users reported a memory leak, and consequently we disabled the feature gate in a
patch release of v1.28.  From v1.28 until v1.31, we kept working on the QueueingHint implementation
within the rest of the in-tree plugins and fixing bugs.&lt;/p&gt;
&lt;p&gt;In v1.32, we enabled this feature by default again. We finished implementing QueueingHints
in all plugins and also identified the cause of the memory leak!&lt;/p&gt;
&lt;p&gt;We thank all the contributors who participated in the development of this feature and those who reported and investigated the earlier issues.&lt;/p&gt;
&lt;h2 id=&#34;getting-involved&#34;&gt;Getting involved&lt;/h2&gt;
&lt;p&gt;These features are managed by Kubernetes &lt;a href=&#34;https://github.com/kubernetes/community/tree/master/sig-scheduling&#34;&gt;SIG Scheduling&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Please join us and share your feedback.&lt;/p&gt;
&lt;h2 id=&#34;how-can-i-learn-more&#34;&gt;How can I learn more?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/4247-queueinghint/README.md&#34;&gt;KEP-4247: Per-plugin callback functions for efficient requeueing in the scheduling queue&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

      </description>
    </item>
    
  </channel>
</rss>
