Kubernetes fails: 3 ways to kill your clusters

Henning Jacobs, Head of Developer Productivity, Zalando SE

Kubernetes has its virtues and is worth investing in, but it is undoubtedly complex and comes with many operational challenges. We faced many of them on our journey toward "cloud native" at Zalando.

We constantly learned from other organizations that shared their failures and insights, so I started to compile a list of public failure horror stories related to Kubernetes. The goal was to make it easier for people tasked with operations to find outage reports to learn from.

Many of these failures had a few things in common. Here are the factors, in four major buckets, that contributed to failure.

Missing operational maturity

Infrastructure operations is a challenge for most organizations, and the transformation toward end-to-end responsibility (DevOps, "you build it, you run it") is often in full swing. Smaller organizations usually use a tool to bootstrap a cluster (e.g., kops), but do not dedicate time to set up full continuous delivery for the infrastructure. This leads to painful manual Kubernetes upgrades, untested infrastructure changes, and brittle clusters.

The same situation applies to managed infrastructure, since cloud offerings never come with all batteries included. Infrastructure changes should get at least the same attention and rigor as your customer-facing app deployments.

Upstream Kubernetes/Docker issues

Some of the failures can be attributed to upstream issues, e.g., Docker daemon hanging, issues with a kubelet not reconnecting to the control plane, kernel CPU throttling bugs, unsafe CronJob defaults, and kubelet memory leaks.
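As one illustration of the CronJob point, here is a minimal sketch of a pre-deployment check. The list above does not say which defaults are meant; this sketch assumes two commonly discussed ones: `concurrencyPolicy` (which defaults to `Allow` when unset, so slow runs can overlap) and `startingDeadlineSeconds` (which has no deadline when unset). The function name `cronjob_warnings` and the warning texts are hypothetical, and the manifest is assumed to be already parsed into a Python dict.

```python
def cronjob_warnings(manifest):
    """Return warnings for CronJob spec fields left at their implicit defaults.

    `manifest` is assumed to be a CronJob manifest already parsed into a dict.
    """
    if manifest.get("kind") != "CronJob":
        return []
    spec = manifest.get("spec", {})
    warnings = []
    # When unset, concurrencyPolicy defaults to "Allow", so slow runs can overlap
    if "concurrencyPolicy" not in spec:
        warnings.append("concurrencyPolicy not set (defaults to Allow)")
    # When unset, there is no deadline for starting a missed run
    if "startingDeadlineSeconds" not in spec:
        warnings.append("startingDeadlineSeconds not set (no deadline)")
    return warnings


example = {
    "kind": "CronJob",
    "spec": {"schedule": "*/5 * * * *", "concurrencyPolicy": "Forbid"},
}
print(cronjob_warnings(example))
```

A check like this could run in a delivery pipeline so that every CronJob states these settings explicitly instead of inheriting them silently.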

If you hit an upstream issue, congratulations! You can follow or file an upstream issue and hope for a fix, or contribute a fix yourself and help many others. I would expect this class of failure causes to get smaller over time as CNCF projects mature and the user base grows, making it less probable that you’ll be the first to hit an upstream issue.

Cloud and other integrations

Kubernetes comes in more than one flavor—there are many possible combinations of Kubernetes components and configurations. Kubernetes needs to interact with your cloud platform, such as Google Cloud or AWS, and your existing IT landscape. And all of these integrations can lead to failure scenarios.

We saw Kubernetes' AWS cloud provider code easily hit AWS API rate limits and have problems with EBS persistent volume attachments. Using AWS Elastic Load Balancing with dynamic IPs caused problems with the kubelet losing connections. The AWS IAM integration (kube2iam) is notoriously prone to race conditions.
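For code you control that calls a rate-limited cloud API, a common mitigation is to retry with exponential backoff and jitter. The following is a generic sketch of that technique, not the fix applied in Kubernetes' cloud provider code. `RateLimitedError` and `call_with_backoff` are hypothetical names standing in for whatever throttling error and wrapper your client library uses.

```python
import random
import time


class RateLimitedError(Exception):
    """Hypothetical stand-in for a cloud API throttling error."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` on RateLimitedError using exponential backoff with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # Cap the exponential window, then pick a random point inside it
            window = min(max_delay, base_delay * (2 ** attempt))
            sleep(window * rng())
```

The random jitter spreads retries out so that many clients throttled at the same moment do not all retry in lockstep and hit the limit again.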

Human error

Let’s be clear: There is no such thing as "human error" as a root cause. If your root-cause analysis (RCA) concludes with "human error," start over and ask some hard questions.

Share what you learn

Nowadays everybody is talking about failure culture, but what organization is truly ready to share its failures and lessons learned publicly? Kubernetes gives us a common ground where we can all broadly benefit from sharing our experiences with one another.

Many contributing factors are not new, such as the maturity of infrastructure change processes, Docker, distributed systems, and so on. But Kubernetes gives us a common language to talk through and address them. As shared experiences reduce the unknown unknowns of operating or using Kubernetes, it will get easier for everyone over time.

Do you have experiences to share? Post them below. And for more on Kubernetes failures, come to my talk, “Kubernetes Failure Stories and How to Crash Your Clusters,” at KubeCon + CloudNativeCon Europe 2019 in Barcelona, Spain, on May 20-23.
