Upgrading a Neglected EKS Cluster from 1.17 to 1.30

On July 3rd, 2024, I ended up dealing with one of the messier Kubernetes situations I’ve encountered: upgrading an EKS cluster that had effectively been left behind for years. The upgrade wasn’t optional: the cluster version was approaching the end of AWS extended support, and staying on it meant continuing to pay the higher extended support costs.

The nodes were manually deployed instead of using AWS managed nodes and had been sitting on Kubernetes 1.17 for a long time. At some point, the control plane had been upgraded to 1.23, but the nodes were never touched. This created a version mismatch that slowly turned into a multi-layered failure once changes started being made.

The plan was straightforward: create a new AWS-managed node group running 1.23 to match the control plane, then slowly migrate workloads over by cordoning the old nodes and evicting deployments one at a time. That way, the cluster could be brought forward without a hard cutover, with the option to move workloads back if anything went significantly wrong.

What followed wasn’t a clean migration. As workloads started moving to the new nodes, a chain of failures began to surface across ingress, IAM, API compatibility, container permissions, and outdated manifests. This is a write-up of what happened and what I ran into while trying to stabilize and move the cluster forward.


Background

This cluster had:

  • Nodes manually provisioned
  • Nodes stuck on 1.17
  • Control plane upgraded separately to 1.23
  • Old manifests and container images
  • No consistent lifecycle management

Once an effort began to modernize and move toward a newer Kubernetes release, the ingress layer started failing.

Without a functioning ingress controller, nothing web-facing could enter the cluster.

That’s when the real work started.


How Ingress Works in EKS

In managed cloud Kubernetes environments, the ingress flow has an extra layer.

A Kubernetes ingress manifest results in:

  1. A cloud load balancer being created (AWS/GCP-managed)
  2. SSL termination happening at the cloud LB
  3. Traffic forwarded into the cluster
  4. NGINX ingress controller routing to services

This means:

  • SSL certs live in AWS, not in the cluster
  • IAM and cloud-controller integrations matter
  • If ingress pods fail, external traffic stops entirely

In a non-cloud cluster, you’d typically expose services directly or put your own LB in front.

In EKS, web services depend on the ingress controller functioning.
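To make the in-cluster half of that flow concrete, here’s a minimal Ingress sketch of the kind of object the NGINX controller turns into routing rules. The hostname, service name, and namespace are placeholders, not manifests from this cluster:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  namespace: default
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-app   # in-cluster Service the controller routes to
            port:
              number: 80

Everything in front of this (the AWS load balancer, its listeners, and the certificate) comes from annotations and the cloud integration rather than from the routing rules themselves, which is why so many of the failures below are about annotations, IAM, and RBAC rather than routing.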


Initial Failure: Ingress Pods Wouldn’t Recover

The issue started when ingress pods began failing after changes intended to move workloads onto newer nodes.

This turned into a chain reaction.

The external ingress appeared to be partially functional, but the internal ingress still needed the same recovery steps.


Issue #1 – Expired SSL Certificate

One of the AWS load balancers referenced an SSL certificate that expired in 2023.

This prevented the ingress from restarting cleanly.

The affected LB may not have even been actively used, but the reference was still there in the ingress manifest.

Fix:

  • Updated the ingress manifest to reference a valid, current SSL certificate.
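With an NGINX ingress controller on AWS, the certificate usually isn’t a Kubernetes object at all: it’s an ACM ARN referenced from an annotation, typically on the controller’s Service of type LoadBalancer (the ALB controller uses a similar annotation on the Ingress itself). I no longer have the original manifest, so treat the names and ARN below as placeholders; the fix amounted to swapping the stale ARN for a valid one:

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller     # placeholder name
  namespace: ingress-nginx
  annotations:
    # Point at a valid ACM certificate instead of the expired one
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:111122223333:certificate/REPLACE-ME
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress
  ports:
  - name: http
    port: 80
    targetPort: 80
  - name: https
    port: 443
    targetPort: 80    # TLS terminates at the AWS load balancer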

Issue #2 – Nonexistent WAF Resources

Next failure:

The ingress controller was trying to attach to AWS WAF resources that no longer existed.

I couldn’t find them anywhere in AWS.

Fix:

  • Removed the WAF references from the ingress rules.
  • Redeployed.
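I can’t say exactly how the WAF was referenced in this cluster; if it was through the usual AWS ingress annotations, “removing the references” just means deleting lines like the ones below from the Ingress metadata. These annotation names come from the AWS ALB ingress controller family and are shown purely as an illustration:

metadata:
  annotations:
    # References to WAF ACLs that no longer exist in the account;
    # deleting them lets the controller reconcile the load balancer again.
    alb.ingress.kubernetes.io/waf-acl-id: 00000000-0000-0000-0000-000000000000
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:111122223333:regional/webacl/example/REPLACE-ME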

This got me to the next failure.


Issue #3 – Token Permission Denied

After redeploying, the pod started failing with:

permission denied: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

The pod was running as a non-root user. The token file was owned by root (root:nobody).

The container simply couldn’t read it.

Fix: Added this to the pod spec:

fsGroup: 65534

Important detail:

  • This must go in the pod spec, not the container spec.

This aligned file permissions so the process could read the token.
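For reference, this is roughly where the setting lands in the controller’s Deployment. fsGroup only exists in the pod-level securityContext (the container-level securityContext has no such field), which is why putting it anywhere else doesn’t help. Names and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
    spec:
      securityContext:        # pod spec, not the container spec
        fsGroup: 65534        # projected token files get group 65534, which the container processes carry as a supplemental group
      containers:
      - name: nginx-ingress-controller
        image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1  # whichever controller image was in use at the time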


Issue #4 – Fake SSL Cert Creation Failing

Next error:

The ingress container tried to create a fallback self-signed cert:

could not create PEM certificate file:
permission denied

Newer ingress controller images run as a specific non-root UID and expect file ownership inside the container to match.

Fix:

  • Set:
runAsUser: 101

This matched the expected file ownership inside the container.
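Combined with the fsGroup fix, the securityContext ended up looking something like this. UID 101 is the non-root nginx user baked into the ingress-nginx images; whether runAsUser sits at the pod level (as here) or on the container makes no practical difference for this case, so treat it as a sketch rather than the exact manifest:

securityContext:
  runAsUser: 101    # the nginx user the controller image expects to run as
  fsGroup: 65534    # keeps the projected service account token readable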


Issue #5 – API Version Mismatch

This was one of the bigger structural problems.

At this point:

  • Nodes (kubelet): 1.17
  • Control plane: 1.23

APIs had changed between versions.

Some resources used by the old ingress controller relied on deprecated API versions (such as extensions/v1beta1 and networking.k8s.io/v1beta1 for Ingress, both removed in Kubernetes 1.22) that no longer existed in the control plane.

Symptoms:

  • API version errors
  • Resource incompatibility
  • Ingress controller instability

Fix:

  1. Labeled the 1.23 nodes with ingress-ready=true.
  2. Added a nodeSelector to move the ingress controller onto those nodes (see the sketch below).
  3. Upgraded ingress-nginx from v0.24.1 to v1.1.0.

This version supported the newer APIs.
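The scheduling half of that fix is just a node label plus a matching nodeSelector: something like kubectl label nodes <node-name> ingress-ready=true on the new 1.23 nodes, then a selector in the controller’s Deployment. A sketch, with the image paths being the upstream defaults for those versions rather than anything I can verify from the original manifests:

spec:
  template:
    spec:
      nodeSelector:
        ingress-ready: "true"    # only schedule onto the labeled 1.23 nodes
      containers:
      - name: nginx-ingress-controller
        # upgraded from quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.24.1
        image: k8s.gcr.io/ingress-nginx/controller:v1.1.0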


Issue #6 – Outdated NGINX Template File

After upgrading the controller, another failure surfaced.

The ingress controller was trying to load a template file that didn’t match the version.

Fix:

  • Pulled the template (nginx.tmpl) from the same release as the controller
  • Created a ConfigMap (nginx-template-v101) from it
  • Mounted it into the deployment (see the sketch below)

This is something Helm would normally manage automatically. Since this deployment wasn’t Helm-managed, it had to be manually added.
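For the template mount, ingress-nginx reads /etc/nginx/template/nginx.tmpl inside the container, so the ConfigMap (created from the nginx.tmpl that ships with the v1.1.0 release, e.g. kubectl create configmap nginx-template-v101 --from-file=nginx.tmpl) gets mounted over that path. Roughly:

spec:
  template:
    spec:
      containers:
      - name: nginx-ingress-controller
        volumeMounts:
        - name: nginx-template
          mountPath: /etc/nginx/template   # where the controller reads nginx.tmpl from
          readOnly: true
      volumes:
      - name: nginx-template
        configMap:
          name: nginx-template-v101
          items:
          - key: nginx.tmpl
            path: nginx.tmpl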


Issue #7 – Missing Cluster API Permissions

Once the controller started running further, it hit RBAC failures:

User "system:serviceaccount:ingress-nginx:nginx-ingress-serviceaccount"
cannot list resource "ingresses"

The service account didn’t have permission to read cluster ingress resources.

Fix: Added permissions:

- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - ingressclasses
  verbs:
  - get
  - list
  - watch

Issue #8 – Leader Election Permission Failures

Next error:

cannot update resource "configmaps"

The ingress controller couldn’t update its leader-election configmap.

Fix: Added:

- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch

This allowed leader election to function correctly.
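Both of those rules only matter because the service account named in the error is bound to the role they live in. In the stock (non-Helm) manifests of that era this is a ClusterRole plus a ClusterRoleBinding along these lines; the role and binding names here are guesses, not something I can confirm from the original cluster:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nginx-ingress-clusterrole-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nginx-ingress-clusterrole        # the ClusterRole the rules above were added to
subjects:
- kind: ServiceAccount
  name: nginx-ingress-serviceaccount     # from the RBAC error message
  namespace: ingress-nginx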


Side Issue – Pod Migration / Volume Attach Problems

While moving workloads, I also ran into:

Unable to attach or mount volumes: timed out waiting for the condition

I found a likely fix, which involved adding permissions to an AWS IAM role, in a comment on a Kubernetes issue thread:

https://github.com/kubernetes/kubernetes/issues/110158#issuecomment-1717824677

"Kubernetes.29 does not include a `awsElasticBlock` volume type.

The AWSElsticBlockStore in-tree storage driver was deprecated in the Kubernetes v1.19 released and then removed entirely in the v1.27 release. 

The Kubernetes project suggests that you use the AWS EBS third party storage driver instead.

I installed the referenced EBS CSI driver, which resolved the issue.
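The driver itself is normally installed as the aws-ebs-csi-driver EKS add-on or via its Helm chart, plus an IAM role for its controller service account. Once it’s running, the old in-tree awsElasticBlockStore volumes are served through CSI migration, and new volumes go through a StorageClass pointed at the CSI provisioner. A sketch of such a StorageClass; gp3 and the binding mode are choices, not requirements:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3
provisioner: ebs.csi.aws.com          # the out-of-tree AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete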


What Actually Happened Here

This wasn’t one problem.

It was multiple layers of drift:

  • Nodes never upgraded
  • Control plane upgraded independently
  • Old manifests
  • Old controller images
  • Templates that didn’t upgrade with the controller
  • Missing RBAC
  • Stale cloud references
  • Expired certs

Each fix revealed the next failure.


Lessons Learned

1) Control plane and nodes must stay aligned, obviously

Running 1.17 kubelets against a 1.23 control plane is asking for breakage: at the time, the supported skew allowed kubelets to trail the API server by at most two minor versions, and these nodes were six behind.

2) Manual clusters rot fast

Anything not managed by Helm, Terraform, or Git can drift or break during upgrades.

3) Ingress is a dependency magnet

Ingress touches:

  • IAM
  • AWS LB
  • SSL
  • RBAC
  • APIs
  • Controller versions
  • kubelet / API server version compatibility

When it breaks, it rarely breaks for just one reason.

4) Old container images carry assumptions that predate current standards

UID expectations, file permissions, API usage — everything changes over time.


The end?

At the end of this effort:

  • Ingress was functional again on a newer node
  • Controller updated
  • Permissions corrected
  • Templates aligned
  • Cert references fixed
  • The old, manual node pool on 1.17 was destroyed
  • Kubernetes was upgraded to 1.27

The next step was pushing the cluster forward again to 1.30 and then repeating the process on the production cluster using what was learned here.

That second upgrade happened on a Friday night a few months later and took less than an hour. Having the fixes documented chronologically and dealing with fewer API changes between 1.23 and 1.30 made it much more predictable.

…and then both clusters were decommissioned a few months later.
