Common Issues · CloudOps for Kubernetes

Services start failing to resolve

If the CloudOps for Kubernetes Jenkins and Nexus DNS hostnames stop resolving, this might be because the kube-proxy DaemonSet is unhealthy.

To check the status of the DaemonSet with kubectl, run the following command:

kubectl --namespace kube-system describe daemonsets kube-proxy
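One way to turn the DaemonSet status into a yes/no answer is to compare its desired and ready pod counts. The following is a minimal sketch, not part of CloudOps for Kubernetes: the function names `daemonset_is_healthy` and `check_kube_proxy` are hypothetical, and the sketch relies on the standard `status.desiredNumberScheduled` and `status.numberReady` fields of a DaemonSet.

```shell
# Hypothetical helper: succeeds only when at least one pod is scheduled
# and every scheduled pod is ready.
# Arguments: desired pod count, ready pod count.
daemonset_is_healthy() {
  desired="$1"
  ready="$2"
  [ "$desired" -gt 0 ] && [ "$desired" -eq "$ready" ]
}

# Hypothetical wrapper: reads both counts from the cluster and prints a verdict.
# It needs kubectl access to the cluster, so it is only defined here, not called.
check_kube_proxy() {
  counts=$(kubectl --namespace kube-system get daemonset kube-proxy \
    --output jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}') || return 1
  # Word splitting is intended: counts holds "<desired> <ready>".
  set -- $counts
  if daemonset_is_healthy "${1:-0}" "${2:-0}"; then
    echo "kube-proxy healthy: ${2:-0}/${1:-0} pods ready"
  else
    echo "kube-proxy unhealthy: ${2:-0}/${1:-0} pods ready"
    return 1
  fi
}
```

After sourcing the functions, running `check_kube_proxy` against the cluster prints a one-line summary and returns a non-zero status when pods are missing.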

You can also check the status of the kube-proxy DaemonSet in the kube-system namespace through the Kubernetes Dashboard.

To access the Kubernetes Dashboard, see Accessing Kubernetes Dashboard.

DaemonSets in the Kubernetes cluster are failing

If DaemonSets are failing in the Kubernetes cluster and the pod logs show the following error:

[Image: An error log of a failing DaemonSet]

Check the resource reservation in the kubelet configuration. The failures might occur because the kubelet has run out of memory and started taking memory from the DaemonSets.

To see the kubelet configuration of a node:

Choose a node to inspect. In this example, the name of this node is referred to as NODE_NAME.
In one terminal tab, start the kubectl proxy:
```
kubectl proxy --port=8001
```

In another terminal tab, run the following command to get the configuration from the configuration endpoint:

NODE_NAME="the-name-of-the-node-you-are-inspecting"; curl -sSL "http://localhost:8001/api/v1/nodes/${NODE_NAME}/proxy/configz" | jq '.kubeletconfig|.kind="KubeletConfiguration"|.apiVersion="kubelet.config.k8s.io/v1beta1"'

note

The resources reserved for the kubelet are in the kubeReserved JSON object.

To compare the resources used by the DaemonSets on the node with the resources reserved in the kubelet configuration, run the following command for the same node used in the previous steps:
```
kubectl describe node ${NODE_NAME}
```
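For the comparison, it can help to know how much memory the node withholds from pods in total. Kubernetes derives a node's allocatable memory from its capacity minus the kubelet reservation (kubeReserved), the system reservation, and the eviction threshold, so the difference between capacity and allocatable is an upper bound on what kubeReserved accounts for. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the sketch assumes both values are reported with the `Ki` suffix, which is common but not guaranteed.

```shell
# Hypothetical helper: prints the numeric part of a quantity such as 16393428Ki.
# Rejects anything that is not a plain integer followed by Ki.
ki_value() {
  number="${1%Ki}"
  case "$number" in
    '' | *[!0-9]*) echo "not a Ki quantity: $1" >&2; return 1 ;;
  esac
  # If stripping the suffix changed nothing, the Ki suffix was missing.
  [ "$number" != "$1" ] || { echo "not a Ki quantity: $1" >&2; return 1; }
  echo "$number"
}

# Hypothetical helper: prints the memory (in Mi, rounded down) that a node
# withholds from pods.
# Arguments: capacity memory, allocatable memory (both in Ki).
withheld_memory_mi() {
  capacity=$(ki_value "$1") || return 1
  allocatable=$(ki_value "$2") || return 1
  echo $(( (capacity - allocatable) / 1024 ))
}
```

With a live cluster, the two inputs can be read with `kubectl get node "${NODE_NAME}" --output jsonpath='{.status.capacity.memory} {.status.allocatable.memory}'` and passed to `withheld_memory_mi`.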

Running Out of Ephemeral Storage

Pods can run out of ephemeral storage when building large Self Managed Commerce deployment packages, depending on the size of the source code. When this happens, the logs of the build-deployment-package Jenkins job typically contain a message like the following:

default/jenkins-worker-51db7930-910b-4888-b88e-eeb68b987803-9wjfl-4k9ms Pod just failed (Reason: Evicted, Message: Pod ephemeral local storage usage exceeds the total limit of containers 13322Mi. )
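To confirm which limit was hit, the figure can be pulled out of the eviction message and compared with the pod definition. The following is a minimal sketch: `eviction_limit_mi` is a hypothetical helper, not part of CloudOps for Kubernetes, and it assumes the message wording shown above with the limit expressed in Mi.

```shell
# Hypothetical helper: prints the limit (in Mi) quoted in an eviction message.
# Argument: the full log line. Assumes the wording shown in the job log above.
eviction_limit_mi() {
  limit=$(printf '%s\n' "$1" \
    | sed -n 's/.*exceeds the total limit of containers \([0-9][0-9]*\)Mi.*/\1/p')
  if [ -z "$limit" ]; then
    echo "no ephemeral storage limit found in message" >&2
    return 1
  fi
  echo "$limit"
}
```

For the message above, the function prints 13322. As the message says, this is the total limit across the pod's containers, so it can differ slightly from the value set for a single container.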

When you experience this error, review the podYaml definition in the Jenkinsfile for the Jenkins job that is failing:

def podYamlFromFile = new File("${env.JENKINS_HOME}/workspace/${env.JOB_NAME}@script/cloudops-for-kubernetes/jenkins/agents/kubernetes/maven-5gb-2core-1container.yaml").text.trim();
String podYaml = podYamlFromFile.replace('${dockerRegistryAddress}', "${dockerRegistryAddress}").replace('${jenkinsAgentImageTag}', "${jenkinsAgentImageTag}")

For this error, edit the maven-5gb-2core-1container.yaml file. Open the YAML file in a text editor and find the ephemeral-storage values under resources:

resources:
  requests:
    memory: "5632Mi"
    cpu: "2"
    ephemeral-storage: "13Gi"
  limits:
    memory: "5632Mi"
    cpu: "2"
    ephemeral-storage: "13Gi"

Increase the ephemeral storage to the amount that you require. For example, to set the requests and limits to 20Gi:

resources:
  requests:
    memory: "5632Mi"
    cpu: "2"
    ephemeral-storage: "20Gi"
  limits:
    memory: "5632Mi"
    cpu: "2"
    ephemeral-storage: "20Gi"
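The same edit can be scripted. The following is a minimal sketch that rewrites every ephemeral-storage value in a pod YAML file. The function name `set_ephemeral_storage` is hypothetical, and the sketch assumes that each value is written on one line in double quotes, as in the excerpts above.

```shell
# Hypothetical helper: sets every ephemeral-storage value in a YAML file.
# Arguments: path to the YAML file, new quantity (for example 20Gi).
# Assumes the form  ephemeral-storage: "<quantity>"  on a single line.
set_ephemeral_storage() {
  file="$1"
  size="$2"
  tmp=$(mktemp) || return 1
  # Write to a temporary file first so that a failed sed leaves the original intact.
  if sed "s/\(ephemeral-storage:[[:space:]]*\)\"[^\"]*\"/\1\"${size}\"/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place to keep the file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

For example, `set_ephemeral_storage maven-5gb-2core-1container.yaml 20Gi` changes both the request and the limit. Reviewing the result with `git diff` before committing is a reasonable precaution.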

Use Git to commit and push your changes to your CloudOps for Kubernetes repository. The failing Jenkins job can then use the updated definition from your repository. Repeat this process until you find limits that work for the job.

Jenkins Will Not Start

If the CloudOps for Kubernetes Jenkins pod will not start, it could be due to incompatible Jenkins plugins. To reset the Jenkins plugins to their default state, perform the following steps:

note

This procedure removes any custom Jenkins plugins and related configurations that may be installed in your environment. Only the Jenkins plugins shipped with CloudOps for Kubernetes will remain.

Edit the docker-compose.override.yml file that you used when you initially set up the cluster.
Set the value of TF_VAR_jenkins_overwrite_plugins to true.

note

This variable must be added to the docker-compose.override.yml file if it is not already present.

Read through the docker-compose.override.yml file and confirm that all other configurations are correct for your environment.
Run the Docker Compose command build to build the Docker image and update the CloudOps for Kubernetes cluster.
Edit docker-compose.override.yml and reset the value of TF_VAR_jenkins_overwrite_plugins to false.
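Because the variable is switched to true and then back to false, it is easy to leave it in the wrong state. The following is a minimal sketch that prints the value a variable currently has in the override file. The function name `compose_var_value` is hypothetical, and the sketch assumes the variable is written on one line as either `NAME: "value"` or `NAME=value`. The layout of your docker-compose.override.yml file may differ.

```shell
# Hypothetical helper: prints the value assigned to a variable in a Compose file.
# Arguments: path to the file, variable name.
# Assumes one-line entries such as  NAME: "true"  or  - NAME=true.
# Commented-out entries (lines starting with #) are ignored.
compose_var_value() {
  file="$1"
  name="$2"
  value=$(sed -n "s/^[[:space:]\"'-]*${name}[[:space:]]*[:=][[:space:]]*//p" "$file" \
    | head -n 1 | tr -d "\"'" | sed 's/[[:space:]]*$//')
  if [ -z "$value" ]; then
    echo "${name} is not set in ${file}" >&2
    return 1
  fi
  echo "$value"
}
```

For example, `compose_var_value docker-compose.override.yml TF_VAR_jenkins_overwrite_plugins` should print false once the procedure is complete. The same check works for any other variable in the file.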

Jenkins Configurations are Lost When Jenkins Restarts

If changes made in the Jenkins web interface are not retained when the Jenkins pod restarts, set TF_VAR_jenkins_overwrite_config to false by following the procedure below.

To apply the change, identify a maintenance window during which no Jenkins jobs will be running. The Jenkins server pod restarts when the change is applied, which causes any running jobs to fail.
Obtain the most recent copy of the docker-compose.override.yml file for your CloudOps for Kubernetes cluster.
Set the TF_VAR_jenkins_overwrite_config variable to false. If the variable does not exist in your copy of docker-compose.override.yml, you will need to add it.
(Optional) Set the TF_VAR_rebuild_nodegroups variable to false to avoid unnecessary pod redeployment.
Follow the procedure described in Updating Cluster Configuration to apply the change.

note

Some CloudOps for Kubernetes version upgrades may require resetting the Jenkins configuration. Ensure you are following the upgrade documentation specific to your CloudOps for Kubernetes version.