Ingress Hardening at Cruise

This is a cross-post from the original post on LinkedIn.

At Cruise we require a highly available, central ingress service that can be used by all service owners alike. Our ingress solution of choice is a self-managed Istio setup, since we make use of both its ingress and its service mesh components. We started off with an in-cluster ingress service. However, company-wide outages started piling up because a Kubernetes service of type LoadBalancer on GCP can only have 250 backends (1,000 on Azure) on the provisioned cloud load balancer for internal load balancing.

When a Kubernetes service of type LoadBalancer is created on a cluster, the cloud-native service controller provisions a cloud-vendor-specific internal L4 load balancer resource. This load balancer pulls a random subset of 250 nodes in the cluster as backends. If the traffic-handling istio-ingressgateway pods happen to be scheduled outside this subset and externalTrafficPolicy is set to Local on the Kubernetes service, all incoming cluster traffic is blocked. This can technically be circumvented by setting externalTrafficPolicy to Cluster, which means incoming traffic can hit a random node and gets forwarded to the target node running the ingress pod by the kube-proxy daemonset in the kube-system namespace. However, besides being an extra, unnecessary network hop, this also does not preserve the source IP of the caller. This is explained in more detail in the blog post “Kubernetes: The client source IP preservation dilemma”. While changing the externalTrafficPolicy to Cluster might work for some services, it was not acceptable for us: our offering is a central service used by many other services at Cruise, some of which require the client source IP to be preserved.
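
For illustration, a minimal Service manifest with externalTrafficPolicy set to Local could look like the following sketch. The names, ports and annotation are assumptions for a GKE internal L4 load balancer, not Cruise's actual configuration:

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway      # hypothetical name
  namespace: istio-system
  annotations:
    # requests an internal (rather than external) L4 load balancer on GKE;
    # the annotation name varies by GKE version
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # preserves the client source IP, but only nodes running the pod serve traffic
  selector:
    app: istio-ingressgateway
  ports:
    - name: https
      port: 443
      targetPort: 8443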

The interim solution was to run the istio-ingressgateway on self-managed VMs, with the VirtualService resources on the target Kubernetes clusters as input for the routing configuration. This resulted in a complex Terraform module with 26 individual resources, containing many cloud resources like

  • Instance template
  • Instance group manager
  • Autoscaler
  • Firewall
  • IP addresses
  • Forwarding rules
  • Health checks

We essentially had to recreate a Kubernetes Deployment and Service. Since this was running outside the cluster, we had to prepare the VMs with a running docker daemon, authentication to our internal container registry and a monitoring agent, all done via a startup script and all redundant to what a Kubernetes cluster would have offered out of the box. While this setup mitigated the problem at hand, its complexity triggered misconfigurations during upgrades. The startup script would also sometimes fail due to external dependencies, potentially resulting in ingress outages.

An additional contributing factor to complexity was that some larger clusters ran the out-of-cluster solution described above, while some smaller, highly automated clusters ran the in-cluster solution. The two solutions were rolled out through different processes (Terraform vs. Google Config Connector), requiring separate maintenance and version upgrades, and every desired config change had to be made to both. The path forward was clear: we wanted to consolidate on the in-cluster solution, which allows for more automation, less duplication of Kubernetes concepts, a single source of truth for configurations, fewer created cloud-native resources (only a dedicated IP address is needed for the in-cluster service) and overall easier maintainability of the service, while still working on clusters with more than 250 nodes and without sacrificing the source IP.

The Path Forward

For this to be possible we worked closely with the responsible teams at Google Cloud Platform to deliver features like Container Native Load Balancing, ILB Subsetting and the removal of unhealthy backends from cloud load balancers as quickly as possible. We provided feedback over multiple iterations on the outcomes of our load tests, based on which the Google teams implemented further improvements.

Internally we hardened the in-cluster setup with the following (a condensed manifest sketch follows the list):

  • Dedicated node pools for ingress pods.
  • Allow only one pod per node via anti-affinity rules to prevent connection overloading at the node level.
  • Set the number of ingress replicas to the number of zones + 2. This ensures that at least one zone has more than one replica. Because the GCP scaling API only allows us to scale by one node per zone at a time, all nodes could get replaced at once if we ran only one node per zone. With this formula we guarantee that there are always at least 2 running pods.
  • Set a PodDisruptionBudget that requires fewer healthy pods than the desired replicas in the deployment, so that node drains are not blocked.
  • Set HorizontalPodAutoscalers based on memory and CPU usage.
  • Add a PreStop lifecycle hook that sleeps for 120 seconds, so that existing connections are left untouched upon pod termination and can drain for a full 2 minutes before a SIGTERM is received.
  • Set terminationGracePeriodSeconds to 180 seconds to give long-running connections an additional minute to terminate gracefully after the SIGTERM is received.
  • Tweak liveness and readiness probes. Liveness probes have a higher failure threshold to prevent frequent pod restarts, while readiness probes have a lower failure threshold so that traffic is not routed to an unhealthy pod.
  • Lower resource requests while raising resource limits. This achieves a higher likelihood of scheduling on a cramped node while still allowing the pod to use a lot of resources if necessary.
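
The following is a condensed, hypothetical sketch of how several of these settings combine on the ingress Deployment and its PodDisruptionBudget. Names, label selectors and the concrete numbers are illustrative, not Cruise's actual values:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  replicas: 5                                # number of zones (e.g. 3) + 2
  selector:
    matchLabels:
      app: istio-ingressgateway
  template:
    metadata:
      labels:
        app: istio-ingressgateway
    spec:
      terminationGracePeriodSeconds: 180     # one extra minute after the 120s preStop sleep
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname    # at most one ingress pod per node
              labelSelector:
                matchLabels:
                  app: istio-ingressgateway
      containers:
        - name: istio-proxy
          image: docker.io/istio/proxyv2:1.12.0      # placeholder image and version
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "120"]    # let existing connections drain before SIGTERM
          resources:
            requests:                        # low requests to ease scheduling on cramped nodes
              cpu: 500m
              memory: 512Mi
            limits:                          # high limits to allow bursting
              cpu: "4"
              memory: 4Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  minAvailable: 2                            # fewer than the desired replicas, so node drains are never blocked
  selector:
    matchLabels:
      app: istio-ingressgateway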

Migration

Cluster traffic at Cruise grew along with its criticality, as we now have fully driverless vehicles on the roads. So the migration from the out-of-cluster solution to the in-cluster one had to be carefully planned and executed. We specifically had to migrate 3 critical, large clusters across 3 environments (which makes it 9 migrations). We deployed both the in-cluster and the out-of-cluster solution in parallel, each with its own L4 load balancer and a dedicated IP. At that point only an A record swap for our ingress domain was necessary. Client domains use CNAMEs from their service names to our centrally offered ingress domain, so no client-side change was necessary. We swapped the A records slowly, cluster by cluster and environment by environment, while closely monitoring the metrics. Our preparation paid off: we rerouted 100% of all mission-critical traffic at Cruise without a blip and without any customer experiencing downtime.

To learn more about technical challenges we’re tackling at Cruise infrastructure, visit getcruise.com/careers.

Tags: kubernetes, cncf, clusters, infrastructure, ingress, tips & tricks

Tips and Tricks Developing a Kubernetes Controller

This is a cross-post from the original post on LinkedIn.

In an effort to contribute to the Kubernetes controller development community, we want to outline a few of the highlights that truly helped us implement a reliable, production-grade controller. These are generic, so they apply to any Kubernetes controller, not just infrastructure-related ones.

API Integration Tests via gomock

While Kubebuilder already creates an awesome testing framework with Ginkgo and Envtest, which spins up a local Kubernetes API server for true integration tests, it is not complete if your controller has one or more integrations with third-party APIs, as ours does with Vault, Buildkite and others. If your reconciliation logic contains an API call to a third party, it should be mocked during integration testing so as not to create a dependency on, or load against, that API during your CI runs.

We chose gomock as our mocking framework and defined all API clients as interfaces. That allows us to implement the API clients as gomock stubs during the integration tests and as the actual API clients in the production build. The following is an example of one such interface, including the go:generate instruction to create the mock:

//go:generate $GOPATH/bin/mockgen -destination=./mocks/buildkite.go -package=mocks -build_flags=--mod=mod github.com/cruise/cluster-operator/controllers/cluster BuildkiteClientInterface
type BuildkiteClientInterface interface {
	Create(org string, pipeline string, b *buildkite.CreateBuild) (*buildkite.Build, *buildkite.Response, error)
}

During integration tests we just replace the Create function with a stub that returns a *buildkite.Build or an error.
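
A minimal sketch of how such a stub could be wired into a Ginkgo/Envtest test follows. The import paths of the generated mocks package and of the buildkite library, the expected arguments, and the way the stub is injected into the reconciler are assumptions for illustration, not the exact setup of our code base:

package cluster_test

import (
	"github.com/buildkite/go-buildkite/buildkite" // import path may differ by go-buildkite version
	"github.com/golang/mock/gomock"
	. "github.com/onsi/ginkgo/v2" // or github.com/onsi/ginkgo for Ginkgo v1

	"github.com/cruise/cluster-operator/controllers/cluster/mocks" // hypothetical path of the generated mocks
)

var _ = Describe("cluster reconciler", func() {
	It("triggers a Buildkite build", func() {
		ctrl := gomock.NewController(GinkgoT())
		defer ctrl.Finish()

		// generated by the go:generate directive above
		buildkiteMock := mocks.NewMockBuildkiteClientInterface(ctrl)
		buildkiteMock.EXPECT().
			Create("my-org", "cluster-deploy", gomock.Any()).
			Return(&buildkite.Build{}, &buildkite.Response{}, nil)

		// the stub is then injected into the reconciler under test in place of the
		// real client, e.g. reconciler.BuildkiteClient = buildkiteMock (field name hypothetical),
		// before the Cluster resource is created against the Envtest API server
	})
})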

Communication Between Sub-Controllers

It is often required that one controller pass information on to another controller. For instance, our netbox controller, which provisions and documents CIDR ranges for new clusters as described in our last blog post, needs to pass the new CIDR ranges on as properties to the ComputeSubnetwork, which is in turn reconciled by the GCP Config Connector. We utilize the Cluster resource's status property to pass such properties along between sub-resources. That has the positive side effect that the Cluster resource contains all potentially generated metadata of the cluster in its status field. The root controller, which reconciles the Cluster resource, implements the logic and coordinates which source property goes to which target.
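
As a hypothetical illustration (the field names are invented, not Cruise's actual API), the Cluster status could carry the generated ranges so the root controller can copy them into the ComputeSubnetwork spec:

// ClusterStatus collects metadata generated by sub-controllers so the root
// controller can pass it on to other sub-resources.
type ClusterStatus struct {
	// CIDR ranges carved out by the netbox sub-controller
	MasterCIDR  string `json:"masterCIDR,omitempty"`
	NodeCIDR    string `json:"nodeCIDR,omitempty"`
	PodCIDR     string `json:"podCIDR,omitempty"`
	ServiceCIDR string `json:"serviceCIDR,omitempty"`
}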

Server-Side Apply

Server-Side Apply is a Kubernetes feature that was in beta since version 1.16 and became stable (generally available) in 1.22. It helps users and controllers manage their resources through declarative configurations and define ownership of fields. It introduces the managedFields property in the metadata of each resource, storing which manager (user or controller) claims ownership over which field. That is important because otherwise two or more controllers can edit the same field, changing it back and forth, triggering reconciliation of each other and thus creating infinite loops.

When we made the switch to Server-Side Apply, we decided to store each resource as a go template yaml file, embed it via go 1.16's //go:embed and apply it as an *unstructured.Unstructured resource, even though we have the go struct for the specific resource available. The issue with using the go struct is that empty fields (nil values) count as fields that the controller has an opinion about. Imagine an int field on the struct: as soon as the struct is initialized, that int field is set to 0. The json marshaller doesn't know whether it was explicitly set to 0 or is just unset, and marshals it as 0 into the resulting json that gets sent to the API server. With an *unstructured.Unstructured resource we ensure that we only apply fields that the controller has an opinion about. It works very much like a regular kubectl apply at this point. A go template yaml file could look like the following:

apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  annotations:
    cnrm.cloud.google.com/project-id: {{ .Spec.Project }}
  name: {{ . | serviceAccountName }}
  namespace: {{ .Namespace }}
spec:
  description: "default service account for cluster {{ .Name }}"
  displayName: {{ . | serviceAccountName }}

The template properties get filled in by parsing the file with a *template.Template from the text/template package. The rendered yaml then gets parsed into the *unstructured.Unstructured resource and applied via the CreateOrPatch function in controller-runtime. This allows us to only explicitly set fields we have an opinion about.
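
A condensed sketch of that flow, assuming a hypothetical embedded file name, a simplified serviceAccountName helper, and that the Cluster type of the operator is in scope; error handling and metadata merging are shortened for brevity:

package cluster

import (
	"bytes"
	"context"
	_ "embed"
	"text/template"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	"sigs.k8s.io/yaml"
)

//go:embed templates/iamserviceaccount.yaml
var iamServiceAccountYAML string

func applyIAMServiceAccount(ctx context.Context, c client.Client, cluster *Cluster) error {
	// render the go template with the Cluster resource as input
	tmpl := template.Must(template.New("iamsa").
		Funcs(template.FuncMap{"serviceAccountName": func(c *Cluster) string { return c.Name + "-sa" }}). // hypothetical helper
		Parse(iamServiceAccountYAML))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, cluster); err != nil {
		return err
	}

	// parse the rendered yaml into an unstructured object, so only the fields we
	// rendered (i.e. have an opinion about) end up in the request to the API server
	desired := &unstructured.Unstructured{}
	if err := yaml.Unmarshal(buf.Bytes(), desired); err != nil {
		return err
	}

	// CreateOrPatch reads the live object (if any) into obj, lets the mutate
	// function set the fields we care about, and then creates or patches it
	obj := desired.DeepCopy()
	_, err := controllerutil.CreateOrPatch(ctx, c, obj, func() error {
		obj.Object["spec"] = desired.Object["spec"] // metadata such as annotations could be merged here as well
		return nil
	})
	return err
}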

This was especially important in conjunction with the GCP Config Connector as it often writes resulting values (e.g. a default cluster version) back to the original spec of the resource. Thus our controller and the GCP Config Connector controller often would “fight” over a field before we rolled out Server-Side Apply, changing it back and forth. With Server-Side Apply a field is clearly claimed by one controller, while the other controller accepts the opinion of the first controller, thus eliminating infinite control loops.
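
To see which manager owns which field on a live resource, the managed fields can be inspected directly. The output below is abbreviated and purely illustrative; the resource name and manager names are assumptions and will differ per setup:

kubectl get iamserviceaccount my-cluster-sa -o yaml --show-managed-fields

metadata:
  managedFields:
    - manager: cluster-operator            # our controller, via Server-Side Apply
      operation: Apply
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:displayName: {}
          f:description: {}
    - manager: cnrm-controller-manager     # the GCP Config Connector controller
      operation: Update
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions: {}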

If you implement or build upon any of these frameworks, I’d love to hear about it — reach out to me on Twitter with your feedback!

Tags: kubernetes, cncf, clusters, infrastructure, automation, api, controllers, operators, tips & tricks

Herd Your Clusters Like Cattle: How We Automated our Clusters

This is a cross-post from the original post on LinkedIn.

In 2020, it took one engineer about a week to create a new Kubernetes cluster integrated into the Cruise environment. Today we abstract all configurations and creating a new cluster is a matter of one API call, occupying essentially no engineering hours. Here is how we did it.

Before, every platform engineer knew each individual Kubernetes cluster by name and whether the workloads running on it were business critical or not. There was an entire set of always-running test clusters to roll out new features. Creating a new cluster meant following a wiki guide that walked an engineer through the exact sequence of steps to perform. That involved a bunch of pull requests to many different repositories, which had to be executed in a certain order, with the success of one step unblocking the next (like creating a new Terraform Enterprise workspace before applying Terraform resources). The following is already a simplified flow chart of what an engineer had to perform.

While Cruise uses managed Kubernetes clusters on the cloud, which technically can already be created with only one API call, there is a bunch of customization that Cruise applies to every new cluster. That includes, but is not limited to:

  • Claiming free CIDR ranges for masters, nodes, pods and services and register them on netbox
  • Integration into the internal VPC
  • Firewall configuration
  • Authorized network configuration
  • Creation of cluster node service account and IAM bindings
  • Customizing the default node pool
  • Creation of a dedicated ingress node pool
  • Configuring private container registry access for nodes
  • Creating load balancer IP address resources
  • Creating DNS record entries
  • Creating backup storage bucket
  • Vault integration (using the Vault Kubernetes Auth Method + configuring custom Vault roles)
  • Configure API monitoring
  • Install cluster addons (istio, coredns, external dns, datadog agent, fluentd, runscope agent, velero backup and custom operators to only name a few)

Decommissioning clusters was a simpler but still pretty involved process. These extensive creation and decommissioning processes led us to treat our clusters as pets.

You might ask, “But why is that bad? I love pets!” While pets are pretty great, it's not favorable to treat your clusters like pets, just as it is not favorable to treat virtual machines like pets. Here are a few reasons why:

  • It takes time (a lot) to create, manage and delete them
  • It is not scalable
  • We often conduct risky ‘open heart surgeries’ to save them
  • We can’t replace them without notice to users
  • We are incentivized to have fewer clusters which increases the blast radius of outages

However, here is what we would like to have:

  • Clusters don’t have to be known individually, they’re all just an entry in a database identified by unique identifiers
  • They’re exactly identical to one another
  • If one gets sick (dysfunctional) we just replace it
  • We scale them up and down as we require
  • We spin one up for a quick test and delete it right after

All these attributes would not apply to pets, but they do sound a lot like cattle. Bill Baker first coined this term for IT infrastructure in his presentation “Scaling SQL Server” in 2012 [1], [2], [3]. He implied that IT infrastructure should be managed by software, and that an individual component (like a server) is easily replaceable by another equivalent one created by the same software.

That’s what we have achieved at Cruise through the cluster-operator, a Kubernetes operator that reconciles custom Cluster resources and achieves a desired state in the real world via API calls through eventual consistency.

Bryan Liles (VP, Principal Engineer at VMWare) put the underlying idea into a tweet on 1/4/2021, which Tim Hockin (Principal Software Engineer at Google & Founder of Kubernetes) affirmed.

Even though work on the cluster-operator started before that tweet, it is a perfect expression of what we were trying to achieve: a declarative infra API that expresses a desired state, and a controller that asynchronously reconciles to eventually reach that desired state. We utilized kubebuilder for the original scaffolding of our operator. Kubebuilder is maintained by a Kubernetes special interest group and creates an initial code base with all best practices for controllers (using controller-runtime under the hood). That includes watchers, informers, queuing, backoff retries, and the entire API architecture including automatic code generation for the custom resource definitions (CRDs). All we had to do was create controllers and sub-controllers and fill the reconciliation stubs with business logic. Kubebuilder is also used by other major infrastructure operators like the GCP Config Connector and the Azure Service Operator.

Based on the API we defined as go structs, a custom Cluster resource could look like the following:

apiVersion: cluster.paas.getcruise.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster
  namespace: clusters
spec:
  project: cruise-paas-platform
  region: us-central1
  environment: dev
  istio:
    enabled: true
    ingressEnabled: true
  template:
    spec:
      clusterAutoscaling:
        enabled: false
      minMasterVersion: 1.20.11-gke.13001

The creation of this Cluster resource triggers the main control loop to create sub-resources needed by every cluster. Some of them are proprietary and are in turn reconciled by other in-house controllers (examples are a netbox controller that automatically carves out new CIDR ranges for new clusters, a vault controller that automatically sets up the Kubernetes auth method for the new cluster, or a deployment controller that triggers Buildkite deployments via API, among others). Others are cloud provider specific resources like the ContainerCluster (which is the actual cluster itself), ComputeSubnetwork, ComputeAddress, ComputeFirewall, IAMServiceAccount and more. Those cloud provider specific resources get reconciled by the GCP Config Connector in this case, so we didn't have to implement the GCP API layer and authentication ourselves. All we had to do was communicate with the Kubernetes API server and create those resources.

At the time of writing, the creation of a Cluster resource triggers the creation of 16 such sub-resources. The relationship between the root controller for the Cluster resource and all sub-resources, which themselves can have sub-resources again, resembles a tree data structure consisting of controllers and their resources.

The yaml of the Cluster resource above can easily be created on a cluster running the operator via kubectl apply -f cluster.yaml and deleted the same way. All provisioning and decommissioning is then handled by the tree of controllers. We created a helm chart around this in which a list of clusters is maintained; the helm chart then applies all the Cluster resources. That way cluster creation and decommissioning is still tracked and reviewed via GitOps, and it's bundled in a single PR of a few lines with essential properties. In fact, most of the properties are defaulted as well, so that you could hypothetically create a Cluster resource with just a name set.
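
As a hypothetical illustration of that chart (the value names are invented), the chart's values could simply hold a list of cluster entries that the templates render into Cluster resources:

clusters:
  - name: my-cluster
    project: cruise-paas-platform
    region: us-central1
    environment: dev
  - name: another-cluster
    environment: staging    # most other properties fall back to defaults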

Of course this is also available via the raw Kubernetes API under /apis/cluster.paas.getcruise.com/v1alpha1/clusters, which makes it easy to integrate with any other software. Imagine a load test that spins up a cluster, performs the load test, stores the result and deletes the cluster right after.

Notice the difference between .spec and .spec.template.spec in the Cluster resource above. While .spec holds generic, cloud-agnostic properties defined by us, .spec.template.spec holds cloud vendor specific properties, much like the equivalent template spec on a native Kubernetes Deployment that contains the spec of the underlying desired pod. This is realized through a json.RawMessage struct field that allows for any arbitrary json/yaml on that field. It gets parsed into a map[string]interface{} for further usage, and its properties are used to override the chosen defaults on the core ContainerCluster resource. It is important to preserve unknown fields via // +kubebuilder:pruning:PreserveUnknownFields to allow the input yaml/json to contain any fields the cloud provider specifies (or will specify in the future).
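
A minimal sketch of how such an API could be declared as go structs, with invented field names rather than Cruise's actual types:

package v1alpha1

import "encoding/json"

// ClusterSpec holds the generic, cloud-agnostic properties defined by us.
type ClusterSpec struct {
	Project     string          `json:"project,omitempty"`
	Region      string          `json:"region,omitempty"`
	Environment string          `json:"environment,omitempty"`
	Template    ClusterTemplate `json:"template,omitempty"`
}

// ClusterTemplate wraps the cloud vendor specific spec, much like the pod
// template spec on a native Deployment.
type ClusterTemplate struct {
	// Spec accepts arbitrary yaml/json (e.g. ContainerCluster fields) that
	// overrides the controller's defaults on the underlying resource.
	// +kubebuilder:pruning:PreserveUnknownFields
	Spec json.RawMessage `json:"spec,omitempty"`
}

The raw bytes can later be unmarshalled into a map[string]interface{} (e.g. via json.Unmarshal) and merged over the controller's defaults.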

Conclusion

This cluster provisioning and decommissioning via a single API call has allowed us to treat clusters as cattle. We create them for quick tests and experiments, delete them again, move noisy neighbors to dedicated clusters and manage the properties and configuration of all our clusters centrally. Clusters are owned by well tested production-grade software and not by humans anymore. What used to cost days of engineering work was minimized to a few minutes of human attention to enter some cluster metadata. We’re not only automating a fleet of cars, but also a fleet of clusters.

Tags: kubernetes, cncf, clusters, infrastructure, automation, api, controllers, operators

The Kubernetes Discovery Cache: Blessing and Curse

This blog post explores the idea of using Kubernetes as a generic platform for managing declarative APIs and asynchronous controllers beyond its ability as a container orchestrator, and how the discovery cache plays into that consideration.

It all started for me with a tweet by Bryan Liles (VP, Principal Engineer at VMWare) almost a year ago, to which Tim Hockin (Principal Software Engineer at Google & Founder of Kubernetes) agreed.

I think this is so true. Kubernetes embodies a lot of principles of the perfect API: every resource has a group, a version and a kind; it offers extendable metadata, a spec for input data (the desired state) and a status of the resource (the actual state). What else could you wish for? The core Kubernetes controllers like the deployment controller, pod controller, autoscaler controller and persistent volume claim controller are also perfect examples of the asynchronous controller pattern, taking a desired state as input and achieving that state through reconciliation with eventual consistency. The fact that this functionality is exposed to Kubernetes users via custom resource definitions (CRDs) makes the entire platform incredibly extendable.

Controllers like crossplane.io, the GCP Config Connector or the Azure Service Operator have adopted this pattern to a large degree and install 100s, if not 1,000s, of CRDs on clusters. However, that doesn't come without its drawbacks…

These drawbacks aren't due to high load on the Kubernetes API server. In fact, the API server has a pretty robust and advanced rate limiting mechanism through the priority and fairness design, which will most likely ensure that it doesn't crash even if a lot of requests are made.

However, installing many CRDs on a cluster can impact the OpenAPI spec publishing as well as the discovery cache creation. OpenAPI spec creation for many CRDs has recently been fixed through lazy marshalling. While an interesting concept, that could be the topic of another blog post; in this one we are focusing on the latter: discovery cache creation.

It started with more and more log messages like the following when using regular kubectl commands:

Waited for 1.140352693s due to client-side throttling, not priority and fairness, request: GET:https://10.212.0.242/apis/storage.k8s.io/v1beta1?timeout=32s

What's interesting is that this immediately rules out the priority and fairness mechanism described earlier and talks about ‘client-side throttling’. My first instinct, however, was just to suppress the log line, because I hadn't asked kubectl to print any debug logs (for instance with -v 1). I found this issue on kubernetes/kubernetes pursuing the same goal and gave it a thumbs up, in the hope of just suppressing this annoying log message that you couldn't switch off. However, as the discussion on that PR progressed, one comment in particular, saying that “this log message saves 10 hours of debugging for every hour it costs someone trying to hide it”, got me thinking that there must be more to the story and that merely not printing the log message was not the right approach. The PR eventually was closed without merging.

This led me down the rabbit hole of looking at some kubectl debug logs, and I found that a simple request for pods via kubectl get pod -v 8 led to 100s of GET requests à la

GET https://<host>/apis/dns.cnrm.cloud.google.com/v1beta1?timeout=32s
GET https://<host>/apis/templates.gatekeeper.sh/v1alpha1?timeout=32s
GET https://<host>/apis/firestore.cnrm.cloud.google.com/v1beta1?timeout=32s

This was on a cluster that already had a few controllers installed, like the GCP Config Connector or Gatekeeper. I noticed the group versions like dns.cnrm.cloud.google.com/v1beta1 or templates.gatekeeper.sh/v1alpha1 in the debug output relating to those controllers even though I simply queried for pods.

It occurred to me that these many GET requests were made to populate the discovery cache and would ultimately trigger the client-side rate limiting. This reddit post helped me understand the behavior, and I also reported it back on the original Kubernetes issue regarding those ominous log messages, which prompted the community to raise a new issue altogether regarding a fix for client-side throttling due to discovery caching.

The Discovery Cache

But why do we even need 100s of requests in the background for simply querying pods via kubectl get pods? That is thanks to the ingenious idea of the Kubernetes discovery client. It allows us to run all variations of kubectl get po, kubectl get pod and kubectl get pods, and the Kubernetes API server always knows what we want. That becomes even more useful for resources that implement categories, where a kubectl get <category> can return various different kinds of resources.

The way this works is to translate any of those kubectl commands to the actual API server endpoint like

GET https://<host>/api/v1/namespaces/<current namespace>/pods

You see that kubectl has to fill in the <current namespace> and query for /pods (and not for /po or /pod). It gets the <current namespace> through the $KUBECONFIG (which is usually stored at ~/.kube/config), or falls back to default. It is also possible to query pods of all namespaces at once. The way kubectl resolves a request for po or pod to the final endpoint /pods is through a local cache stored at ~/.kube/cache/discovery/<host>/v1/serverresources.json. In fact, there is a serverresources.json file for every group version of resources installed on the cluster. If you look at the entry for pods you will find something like

{
  "name": "pods",
  "singularName": "",
  "namespaced": true,
  "kind": "Pod",
  "verbs": [...],
  "shortNames": [
    "po"
  ],
  "categories": [
    "all"
  ]
}

With this reference kubectl knows that a request for pods, pod (which is the kind), po (which is in the shortNames array) or all (which is in the categories) should result in the final request for /pods.

kubectl creates the serverresources.json for every group version either if the requested kind is not present in any of the cached serverresources.json files, or if the cache is invalid. The cache invalidates itself every 10 minutes.

That means that in those cases kubectl has to make a request to every group version on the cluster to populate the cache again, which results in those 100s of GET requests described earlier, which in turn trigger the client-side rate limiting. On large clusters with many CRDs, kubectl get requests can easily take up to a minute to run through all these requests plus the pauses for the rate limiting. Thus it is advisable not to let your CRD count grow limitless. In fact, the scale target for GA of custom resource definitions is set to 500 in the Kubernetes enhancements repo.
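
For completeness, the same cached discovery mechanism that kubectl uses is also available programmatically through client-go. The following is a minimal sketch, assuming kubectl-like cache locations (kubectl additionally namespaces the cache directory by API server host) and the default 10-minute TTL:

package main

import (
	"fmt"
	"path/filepath"
	"time"

	"k8s.io/client-go/discovery/cached/disk"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// load the current kube context, just like kubectl does
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}

	// a disk-cached discovery client with a 10 minute TTL; when the cache is stale,
	// repopulating it issues one GET per group version installed on the cluster
	cacheDir := filepath.Join(homedir.HomeDir(), ".kube", "cache", "discovery")
	httpCacheDir := filepath.Join(homedir.HomeDir(), ".kube", "cache", "http")
	dc, err := disk.NewCachedDiscoveryClientForConfig(config, cacheDir, httpCacheDir, 10*time.Minute)
	if err != nil {
		panic(err)
	}

	// lists every group version and its resources (the serverresources.json content)
	_, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d group versions\n", len(resources))
}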

So while the discovery cache is actually adding usability to Kubernetes, it also is the limiting factor for extending the platform with custom controllers and CRDs.

Scaling

The crossplane community especially has a vested interest in unlocking this limitation, because crossplane's entire design philosophy is built upon the idea of creating CRDs for every object in the real world and reconciling them through controllers. But it will also be important for other controllers that introduce many CRDs, like the GCP Config Connector or the Azure Service Operator.

For now, the aforementioned issue on kubernetes/kubernetes, based on my user report regarding the many GET requests after a simple kubectl get pods, triggered a set of PRs (1, 2) aimed at increasing the rate limits during discovery. However, this is just kicking the can down the road (or, as @liggitt correctly put it, the ‘kubernetes equivalent of the debt ceiling’), as it doesn't solve the underlying issue of many unnecessary GET requests; it merely rate limits less often, which still means a strain on resources and that we will run into the same issue again at a later point in time with even more CRDs. While kubectl still performs 100s of GET requests, at least the total run time is roughly cut in half, as there is no additional rate limiting anymore with the fixes.

I also raised a separate issue to challenge the status quo of invalidating the cache every 10 minutes, by increasing that default and making the timeout configurable (rather than hard-coding it). But again, this just raises limits and doesn't actually minimize the amount of unused GET requests.

So the real, lasting solution might be a bit more involved and require fetching only the serverresources.json of the group version that is actually requested once the cache is invalid or not present. A request via kubectl get pods would then only populate the ~/.kube/cache/discovery/<host>/v1/serverresources.json file (because pods are in group "" and version v1), rather than every single group version. This would eliminate all unnecessary requests for unrelated resources and greatly reduce the total number of GET requests. It would also require a server-side change to offer an endpoint that reveals all potential group versions for a given kind.

If you have other ideas to solve this, feel free to reach out to me, @jonnylangefeld on twitter, to discuss or file an issue directly on kubernetes/kubernetes.

Tags: kubernetes, go, cncf, discovery, sig, api, controllers, operators

Introducing kubectl mc

If you work at a company or organization that maintains multiple Kubernetes clusters, it is fairly common to connect to several different clusters throughout your day. And sometimes you want to execute a command against multiple clusters at once, for instance to get the status of a deployment across all staging clusters. You could run your kubectl command in a bash loop. That not only requires some bash logic, but it also takes a while to return your results, because every loop iteration is an individual API round trip executed sequentially.

kubectl mc (short for multi cluster) supports this workflow and significantly reduces the return time by executing the necessary API requests in parallel go routines.

It does that by creating a semaphore channel with at most n entries alongside a wait group. The maximum number of parallel executions can be configured with the -p flag. It then iterates through the list of contexts that match a regex, given via the -r flag, and executes the given kubectl command in parallel goroutines.
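
The following is a simplified sketch of that pattern, not the actual kubectl-mc implementation: a buffered channel acts as a semaphore that caps concurrency, and a wait group waits for all contexts to finish.

package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"sync"
)

// runParallel runs the given kubectl args against every context matching pattern,
// with at most p commands in flight at once.
func runParallel(contexts []string, pattern string, p int, args ...string) {
	re := regexp.MustCompile(pattern)
	sem := make(chan struct{}, p) // semaphore limiting parallelism to p
	var wg sync.WaitGroup

	for _, kubeCtx := range contexts {
		if !re.MatchString(kubeCtx) {
			continue
		}
		wg.Add(1)
		sem <- struct{}{} // blocks while p commands are already running
		go func(kubeCtx string) {
			defer wg.Done()
			defer func() { <-sem }()
			out, _ := exec.Command("kubectl", append([]string{"--context", kubeCtx}, args...)...).CombinedOutput()
			fmt.Printf("%s\n---------\n%s\n", kubeCtx, out)
		}(kubeCtx)
	}
	wg.Wait()
}

func main() {
	// e.g. run `kubectl get pods -n kube-system` against all contexts matching "kind"
	runParallel([]string{"kind-kind", "kind-another-kind-cluster"}, "kind", 2, "get", "pods", "-n", "kube-system")
}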

Installation

While you can install the binary via go get github.com/jonnylangefeld/kubectl-mc, it is recommended to use the krew package manager. Check out their website for installation.

Run krew install mc to install mc.

In both cases a binary named kubectl-mc is made available on your PATH. kubectl automatically identifies binaries named kubectl-* and makes them available as kubectl plugins. So you can use it via kubectl mc.

Usage

Run kubectl mc for help and examples. Here is one to begin with:

$ kubectl mc --regex kind -- get pods -n kube-system

kind-kind
---------
NAME                                         READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-q7gnm                      1/1     Running   0          99m
coredns-f9fd979d6-zd4jn                      1/1     Running   0          99m
etcd-kind-control-plane                      1/1     Running   0          99m
kindnet-8qd8p                                1/1     Running   0          99m
kube-apiserver-kind-control-plane            1/1     Running   0          99m
kube-controller-manager-kind-control-plane   1/1     Running   0          99m
kube-proxy-nb55k                             1/1     Running   0          99m
kube-scheduler-kind-control-plane            1/1     Running   0          99m

kind-another-kind-cluster
-------------------------
NAME                                                         READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-l2xdb                                      1/1     Running   0          91s
coredns-f9fd979d6-m99fx                                      1/1     Running   0          91s
etcd-another-kind-cluster-control-plane                      1/1     Running   0          92s
kindnet-jlrqg                                                1/1     Running   0          91s
kube-apiserver-another-kind-cluster-control-plane            1/1     Running   0          92s
kube-controller-manager-another-kind-cluster-control-plane   1/1     Running   0          92s
kube-proxy-kq2tr                                             1/1     Running   0          91s
kube-scheduler-another-kind-cluster-control-plane            1/1     Running   0          92s

As you can see, for each context that matches the regex kind, kubectl mc executes the kubectl command given after the --, which must be surrounded by spaces.

Apart from the plain standard kubectl output, you can also get everything as json or yaml using the --output flag. Here is an example of json output piped into jq, running a query that prints the context and the pod name for each pod:

$ kubectl mc --regex kind --output json -- get pods -n kube-system | jq 'keys[] as $k | "\($k) \(.[$k] | .items[].metadata.name)" '
"kind-another-kind-cluster coredns-66bff467f8-6xp9m"
"kind-another-kind-cluster coredns-66bff467f8-7z842"
"kind-another-kind-cluster etcd-another-kind-cluster-control-plane"
"kind-another-kind-cluster kindnet-k4vnm"
"kind-another-kind-cluster kube-apiserver-another-kind-cluster-control-plane"
"kind-another-kind-cluster kube-controller-manager-another-kind-cluster-control-plane"
"kind-another-kind-cluster kube-proxy-dllm6"
"kind-another-kind-cluster kube-scheduler-another-kind-cluster-control-plane"
"kind-kind coredns-66bff467f8-4lnsg"
"kind-kind coredns-66bff467f8-czsf6"
"kind-kind etcd-kind-control-plane"
"kind-kind kindnet-j682f"
"kind-kind kube-apiserver-kind-control-plane"
"kind-kind kube-controller-manager-kind-control-plane"
"kind-kind kube-proxy-trbmh"
"kind-kind kube-scheduler-kind-control-plane"

Check out the github repo for a speed comparison. If you have any questions or feedback email me [email protected] or tweet me @jonnylangefeld.

Tags: kubernetes, addon, krew, multi cluster, cli, tool, go

Kubernetes: How to View Swagger UI

Oftentimes, especially during development of software that talks to the Kubernetes API server, I find myself looking for a detailed Kubernetes API specification. And no, I am not talking about the official Kubernetes API reference, which mainly covers the core object models. I am also interested in the specific API paths, the possible headers, query parameters and responses.

Typically you find all this information in an openapi specification doc which you can view via the Swagger UI, ReDoc or other tools of that kind.

So the next logical step is to google something like ‘kubernetes swagger’ or ‘kubernetes openapi spec’, hoping for a Swagger UI to pop up and answer all your questions. But while some of those search results can lead you in the right direction, you won't find the Swagger UI you were looking for.

The reason for that is that the spec of every Kubernetes API server is actually different due to custom resource definitions, and it therefore exposes different paths and models.

That requires the Kubernetes API server to generate its very own OpenAPI specification, which we will do in the following:

First, make sure that your current kube context points to the Kubernetes API server you are interested in. You can double-check via kubectl config current-context. Then open a local proxy to your Kubernetes API server:

kubectl proxy --port=8080

That saves us a bunch of authentication configuration, which is now all handled by kubectl. Run the following commands in a new terminal window to keep the proxy alive (or run the proxy in the background). Save the OpenAPI (Swagger) file of your Kubernetes API server via the following command:

curl localhost:8080/openapi/v2 > k8s-swagger.json

As a last step we now just need to start a Swagger UI server with the downloaded OpenAPI json as input. We will use a docker container for that:

docker run \
    --rm \
    -p 80:8080 \
    -e SWAGGER_JSON=/k8s-swagger.json \
    -v $(pwd)/k8s-swagger.json:/k8s-swagger.json \
    swaggerapi/swagger-ui

Now visit http://localhost and you’ll get a Swagger UI including the models and paths of all custom resources installed on your cluster!

Tags: kubernetes, swagger, api, tips & tricks