Troubleshoot IP address management in VPC clusters


This section provides guidance for resolving IP address management issues in VPC-native clusters. You can also view GKE IP address utilization insights.

The default network resource is not ready

Symptoms

You get an error message similar to the following:

projects/[PROJECT_NAME]/regions/XXX/subnetworks/default
Potential causes

There are parallel operations on the same subnet. For example, another VPC-native cluster is being created, or a secondary range is being added or deleted on the subnet.

Resolution

Retry the command.

Invalid value for IPCidrRange

Symptoms

You get an error message similar to the following:

resource.secondaryIpRanges[1].ipCidrRange': 'XXX'. Invalid IPCidrRange: XXX conflicts with existing subnetwork 'default' in region 'XXX'
Potential causes

Another VPC-native cluster is being created at the same time and is attempting to allocate the same ranges in the same VPC network.

The same secondary range is being added to the subnetwork in the same VPC network.

Resolution

If this error is returned on cluster creation when no secondary ranges were specified, retry the cluster creation command.

Not enough free IP address space for Pods

Symptoms

Cluster is stuck in a provisioning state for an extended period of time.

Cluster creation returns a Managed Instance Group (MIG) error.

When you add one or more nodes to a cluster, the following error appears:

[IP_SPACE_EXHAUSTED] Instance 'INSTANCE_NAME' creation failed: IP space of 'projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME-SECONDARY_RANGE_NAME' is exhausted.
Potential causes

Node IP address exhaustion: The primary IP address range of the subnet assigned to your cluster runs out of available IP addresses. This typically happens when scaling node pools or creating large clusters.

Pod IP address exhaustion: The Pod CIDR range assigned to your cluster is full. This occurs when the number of Pods exceeds the capacity of the Pod CIDR, especially with high Pod density per node or large deployments. For example, with the default maximum of 110 Pods per node, GKE reserves a /24 Pod range for each node, so a /21 cluster Pod range is exhausted after only 8 nodes.

Specific subnet naming conventions: The way a subnet is named in an error message can help you figure out if the problem is with the node IP address range (where the nodes themselves get their IP address) or the Pod IP address range (where the containers inside the Pods get their IP addresses).

Secondary range exhaustion (Autopilot): In Autopilot clusters, secondary ranges assigned for Pod IP addresses are exhausted due to scaling or high Pod density.

Solution

Gather the following information about your cluster:

  • Name, control plane version, and mode of operation.
  • Associated VPC name, and subnet name and CIDR.
  • The default and any additional cluster Pod IPv4 ranges, including names and CIDRs.
  • Whether VPC-native traffic routing is enabled.
  • The maximum Pods per node setting at the cluster level and, if it differs, at the node pool level (both default and custom configurations).
  • Any impacted node pools, along with their specific IPv4 Pod IP address ranges and maximum Pods per node configurations if they differ from the cluster-wide settings.
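
Much of this information is available from the cluster description. The following command is a minimal sketch, assuming CLUSTER_NAME and LOCATION are your cluster's name and location; the format expression selects only the fields relevant to IP address planning:

# Show the cluster's network, subnet, Pod IP ranges, and maximum Pods per node settings.
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="yaml(name, currentMasterVersion, autopilot, network, subnetwork, ipAllocationPolicy, defaultMaxPodsConstraint, nodePools)"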

Confirm IP address exhaustion issue

  • Network Intelligence Center: Check for high IP address allocation rates in the Pod IP address ranges in the Network Intelligence Center for your GKE cluster.

    If you observe a high IP address allocation rate in the Pod ranges within Network Intelligence Center, then your Pod IP address range is exhausted.

    If the Pod IP address ranges show normal allocation rates, but you are still experiencing IP address exhaustion, then it's likely your node IP address range is exhausted.

  • Audit logs: Examine the resourceName field in IP_SPACE_EXHAUSTED entries, comparing it to subnet names or secondary Pod IP address range names. An example query sketch follows this list.

  • Check whether the exhausted range is the node IP address range or the Pod IP address range.

    To do this, check whether the value of resourceName in the ipSpaceExhausted portion of an IP_SPACE_EXHAUSTED log entry corresponds to the subnet name or to the name of the secondary IPv4 address range for Pods used in the impacted GKE cluster.

    If the value of resourceName has the format [Subnet_name], the node IP address range is exhausted. If the value of resourceName has the format [Subnet_name]-[Name_of_Secondary_IPv4_range_for_pods]-[HASH_8BYTES], the Pod IP address range is exhausted.
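
For example, the following Cloud Logging query is a sketch only; the exact log field that contains the IP_SPACE_EXHAUSTED string can differ in your entries, so adjust the filter as needed:

# Assumption: the IP_SPACE_EXHAUSTED string appears in protoPayload.status.message
# of the failed instance-creation audit log entry.
gcloud logging read 'protoPayload.status.message:"IP_SPACE_EXHAUSTED"' \
    --project=PROJECT_ID \
    --freshness=7d \
    --format="value(protoPayload.resourceName)"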

Resolve Pod IP address exhaustion:

  • Resize existing Pod CIDR: Increase the size of the current Pod IP address range, or add Pod IP ranges to the cluster using discontiguous multi-Pod CIDR (see the sketch after this list).
  • Create additional subnets: Add subnets with dedicated Pod CIDRs to the cluster.
  • Reduce Pods per node: Lowering the maximum Pods per node frees up IP addresses, because GKE reserves a smaller Pod CIDR block for each node when fewer Pods are allowed on it.
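
The following node pool command is a sketch only, assuming a Standard cluster and a secondary range named NEW_POD_RANGE that you have already created on the cluster's subnet; POOL_NAME, CLUSTER_NAME, and LOCATION are placeholders:

# Create a node pool that draws Pod IP addresses from its own secondary range
# and uses a lower maximum Pods per node so a smaller block is reserved per node.
gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --pod-ipv4-range=NEW_POD_RANGE \
    --max-pods-per-node=64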

Address node IP address exhaustion:

  • Review IP address planning: Ensure the node IP address range aligns with your scaling requirements. A subnet check sketch follows this list.
  • Create new cluster (if necessary): If the node IP address range is severely constrained, create a replacement cluster with appropriate IP address range sizing. Refer to IP ranges for VPC-native clusters and IP range planning.
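
To see how large the node (primary) range is, you can describe the cluster's subnet; a minimal sketch, where SUBNET_NAME and REGION are placeholders for your cluster's subnet:

# Show the primary (node) range and the secondary (Pod and Service) ranges of the subnet.
gcloud compute networks subnets describe SUBNET_NAME \
    --region=REGION \
    --format="yaml(ipCidrRange, secondaryIpRanges)"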

Debug IP address exhaustion issues with gcpdiag

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

To examine IP address exhaustion causes on Autopilot and Standard GKE clusters, consider the following:
  • Cluster status: Checks the cluster status if IP address exhaustion is reported.
  • Network analyzer: Queries Cloud Logging for network analyzer logs to confirm whether there is Pod or node IP address exhaustion.
  • Cluster type: Checks the cluster type and provides relevant recommendations based on the cluster type.

Google Cloud console

  1. Complete and then copy the following command:

    gcpdiag runbook gke/ip-exhaustion --project=PROJECT_ID \
        --parameter name=CLUSTER_NAME \
        --parameter location=ZONE|REGION \
        --parameter start_time=yyyy-mm-ddThh:mm:ssZ \
        --parameter end_time=yyyy-mm-ddThh:mm:ssZ

  2. Open the Google Cloud console and activate Cloud Shell.
  3. Paste the copied command.
  4. Run the gcpdiag command, which downloads the gcpdiag docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

  1. Copy and run the following command on your local workstation.
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command.
    ./gcpdiag runbook gke/ip-exhaustion --project=PROJECT_ID \
        --parameter name=CLUSTER_NAME \
        --parameter location=ZONE|REGION \
        --parameter start_time=yyyy-mm-ddThh:mm:ssZ \
        --parameter end_time=yyyy-mm-ddThh:mm:ssZ

View available parameters for this runbook.

Replace the following:

  • PROJECT_ID: The ID of the project containing the resource.
  • CLUSTER_NAME: The name of the target GKE cluster within your project.
  • LOCATION: The zone or region in which your cluster is located.
  • start_time: The time the issue started.
  • end_time: The time the issue ended. Set this to the current time if the issue is ongoing.

Useful flags:

  • --project: The PROJECT_ID
  • --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
  • --parameter or -p: Runbook parameters

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.

Confirm whether default SNAT is disabled

Use the following command to check the status of default SNAT:

gcloud container clusters describe CLUSTER_NAME

Replace CLUSTER_NAME with the name of your cluster.

The output is similar to the following:

networkConfig:
  disableDefaultSnat: true
  network: ...
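
To print only that field, you can add a format expression; a minimal sketch, where LOCATION is the cluster's zone or region:

# Print only the disableDefaultSnat setting.
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="value(networkConfig.disableDefaultSnat)"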

Cannot use --disable-default-snat without --enable-ip-alias

This error message, along with must disable default sNAT (--disable-default-snat) before using public IP address privately in the cluster, means that you must explicitly set the --disable-default-snat flag when creating the cluster, because you are using public IP addresses in your private cluster.

If you see error messages like cannot disable default sNAT ... , this means the default SNAT can't be disabled in your cluster. To resolve this issue, review your cluster configuration.
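
The following cluster creation command is a sketch only, showing --disable-default-snat set together with --enable-ip-alias on a private cluster; the cluster name, region, and CIDR values are placeholders:

# Private, VPC-native cluster with default SNAT disabled.
gcloud container clusters create CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --enable-ip-alias \
    --enable-private-nodes \
    --master-ipv4-cidr=172.16.0.32/28 \
    --disable-default-snat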

Debugging Cloud NAT with default SNAT disabled

If you have a private cluster created with the --disable-default-snat flag, have set up Cloud NAT for internet access, and aren't seeing internet-bound traffic from your Pods, make sure that the Pod range is included in the Cloud NAT configuration.
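
One way to check is to describe the NAT configuration on the Cloud Router; a sketch, assuming Cloud NAT is configured on a Cloud Router, where NAT_NAME, ROUTER_NAME, and REGION are placeholders:

# Verify that the NAT config covers the subnet's secondary (Pod) ranges,
# either through ALL_SUBNETWORKS_ALL_IP_RANGES or an explicit subnetworks list.
gcloud compute routers nats describe NAT_NAME \
    --router=ROUTER_NAME \
    --region=REGION \
    --format="yaml(sourceSubnetworkIpRangesToNat, subnetworks)"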

If there is a problem with Pod to Pod communication, examine the iptables rules on the nodes to verify that the Pod ranges are not masqueraded by iptables rules.

For more information, see the GKE IP masquerade documentation.

If you have not configured an IP masquerade agent for the cluster, GKE automatically ensures that Pod to Pod communication is not masqueraded. However, if an IP masquerade agent is configured, it overrides the default IP masquerade rules. Verify that additional rules are configured in the IP masquerade agent to ignore masquerading the Pod ranges.
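
If an IP masquerade agent is deployed, you can inspect its ConfigMap; a sketch, assuming GKE's standard ip-masq-agent deployment in the kube-system namespace:

# Show the ip-masq-agent configuration. In the "config" data key, verify that
# the cluster's Pod range is listed under nonMasqueradeCIDRs so that
# Pod-to-Pod traffic is not masqueraded.
kubectl get configmap ip-masq-agent --namespace=kube-system --output=yaml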

The dual-stack cluster network communication is not working as expected

Potential causes
The firewall rules created by the GKE cluster don't include the allocated IPv6 addresses.
Resolution
You can validate the firewall rule by following these steps:
  1. Verify the firewall rule content:

    gcloud compute firewall-rules describe FIREWALL_RULE_NAME
    

    Replace FIREWALL_RULE_NAME with the name of the firewall rule.

    Each dual-stack cluster creates a firewall rule that allows nodes and Pods to communicate with each other. The firewall rule content is similar to the following:

    allowed:
    - IPProtocol: esp
    - IPProtocol: ah
    - IPProtocol: sctp
    - IPProtocol: tcp
    - IPProtocol: udp
    - IPProtocol: '58'
    creationTimestamp: '2021-08-16T22:20:14.747-07:00'
    description: ''
    direction: INGRESS
    disabled: false
    enableLogging: false
    id: '7326842601032055265'
    kind: compute#firewall
    logConfig:
      enable: false
    name: gke-ipv6-4-3d8e9c78-ipv6-all
    network: https://www.googleapis.com/compute/alpha/projects/my-project/global/networks/alphanet
    priority: 1000
    selfLink: https://www.googleapis.com/compute/alpha/projects/my-project/global/firewalls/gke-ipv6-4-3d8e9c78-ipv6-all
    selfLinkWithId: https://www.googleapis.com/compute/alpha/projects/my-project/global/firewalls/7326842601032055265
    sourceRanges:
    - 2600:1900:4120:fabf::/64
    targetTags:
    - gke-ipv6-4-3d8e9c78-node
    

    The sourceRanges value must be the same as the subnetIpv6CidrBlock. The targetTags value must be the same as the tags on the GKE nodes. To fix this issue, update the firewall rule with the cluster ipAllocationPolicy block information.
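
    For example, the following commands are a sketch of that comparison and fix; CLUSTER_NAME, LOCATION, and FIREWALL_RULE_NAME are placeholders, and the CIDR shown is from the example output above:

    # Read the cluster's IPv6 subnet range from the ipAllocationPolicy block.
    gcloud container clusters describe CLUSTER_NAME \
        --location=LOCATION \
        --format="value(ipAllocationPolicy.subnetIpv6CidrBlock)"

    # If it doesn't match the firewall rule, update the rule's source ranges.
    gcloud compute firewall-rules update FIREWALL_RULE_NAME \
        --source-ranges=2600:1900:4120:fabf::/64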

The Private Service Connect endpoint might leak during cluster deletion

Symptoms

You cannot see a connected endpoint under Private Service Connect in your Private Service Connect-based cluster.

You can't delete the subnet or VPC network where the Private Service Connect endpoint is allocated. An error message similar to the following appears:

projects/<PROJECT_ID>/regions/<REGION>/subnetworks/<SUBNET_NAME> is already being used by projects/<PROJECT_ID>/regions/<REGION>/addresses/gk3-<ID>
Potential causes

On GKE clusters that use Private Service Connect, GKE deploys a Private Service Connect endpoint by using a forwarding rule that allocates an internal IP address to access the cluster's control plane in the control plane's network. To protect the communication between the control plane and the nodes by using Private Service Connect, GKE keeps the endpoint invisible, and you can't see it in the Google Cloud console or with the gcloud CLI.

Resolution

To prevent leaking the Private Service Connect endpoint before cluster deletion, complete the following steps:

  1. Assign the Kubernetes Engine Service Agent role to the GKE service account.
  2. Ensure that the compute.forwardingRules.* and compute.addresses.* permissions are not explicitly denied for the GKE service account.
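
To check the role binding, you can inspect the project's IAM policy; a sketch, where PROJECT_ID and PROJECT_NUMBER are placeholders and the service account shown is the GKE service agent:

# List the roles granted to the GKE service agent; roles/container.serviceAgent
# (Kubernetes Engine Service Agent) should be among them.
gcloud projects get-iam-policy PROJECT_ID \
    --flatten="bindings[].members" \
    --filter="bindings.members:service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com" \
    --format="table(bindings.role)"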

If you see the Private Service Connect endpoint leaked, contact support.
