Troubleshooting Embedded Cluster
This topic provides information about troubleshooting Replicated Embedded Cluster installations. For more information about Embedded Cluster, including built-in extensions and architecture, see Embedded Cluster Overview.
Troubleshoot with Support Bundles
This section includes information about how to collect support bundles for Embedded Cluster installations. For more information about support bundles, see About Preflight Checks and Support Bundles.
About the Default Embedded Cluster Support Bundle Spec
Embedded Cluster includes a default support bundle spec that collects both host- and cluster-level information:
- The host-level information is useful for troubleshooting failures related to host configuration like DNS, networking, or storage problems.
- Cluster-level information includes details about the components provided by Replicated, such as the Admin Console and Embedded Cluster Operator that manage install and upgrade operations. If the cluster has not installed successfully and cluster-level information is not available, then it is excluded from the bundle.
In addition to the host- and cluster-level details provided by the default Embedded Cluster spec, support bundles generated for Embedded Cluster installations also include app-level details provided by any custom support bundle specs that you included in the application release.
Generate a Bundle For Versions 1.17.0 and Later
For Embedded Cluster 1.17.0 and later, you can run the Embedded Cluster `support-bundle` command to generate a support bundle. The `support-bundle` command uses the default Embedded Cluster support bundle spec to collect both cluster- and host-level information. It also automatically includes any application-specific support bundle specs in the generated bundle.
To generate a support bundle:
1. SSH onto a controller node.

   > **Note:** You can SSH onto a worker node to generate a support bundle that contains information specific to that node. However, when run on a worker node, the `support-bundle` command does not capture cluster-wide information.

2. Run the following command:

   ```bash
   sudo ./APP_SLUG support-bundle
   ```

   Where `APP_SLUG` is the unique slug for the application.
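When the command completes, it writes the support bundle as a compressed archive to the working directory. Assuming the naming convention of the underlying support-bundle tooling (the exact filename includes a timestamp), you can locate it with:

```bash
# The generated archive name includes a timestamp, for example
# support-bundle-2024-01-01T00_00_00.tar.gz
ls support-bundle-*.tar.gz
```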
Generate a Bundle For Versions Earlier Than 1.17.0
For Embedded Cluster versions earlier than 1.17.0, you can generate a support bundle from the shell using the kubectl support-bundle plugin.
To generate a bundle with the support-bundle plugin, you pass the default Embedded Cluster spec to collect both cluster- and host-level information. You also pass the `--load-cluster-specs` flag, which discovers all support bundle specs that are defined in Secrets or ConfigMaps in the cluster. This ensures that any application-specific specs are also included in the bundle. For more information, see Discover Cluster Specs in the Troubleshoot documentation.
To generate a bundle:
1. SSH onto a controller node.

2. Use the Embedded Cluster shell command to start a shell with access to the cluster:

   ```bash
   sudo ./APP_SLUG shell
   ```

   Where `APP_SLUG` is the unique slug for the application.

   The output looks similar to the following:

   ```
      __4___
    _  \ \ \ \   Welcome to APP_SLUG debug shell.
   <'\ /_/_/_/   This terminal is now configured to access your cluster.
    ((____!___/) Type 'exit' (or CTRL+d) to exit.
     \0\0\0\0\/  Happy hacking.
    ~~~~~~~~~~~
   root@alex-ec-2:/home/alex# export KUBECONFIG="/var/lib/embedded-cluster/k0s/pki/admin.conf"
   root@alex-ec-2:/home/alex# export PATH="$PATH:/var/lib/embedded-cluster/bin"
   root@alex-ec-2:/home/alex# source <(kubectl completion bash)
   root@alex-ec-2:/home/alex# source /etc/bash_completion
   ```

   The appropriate kubeconfig is exported, and the location of useful binaries like kubectl and the preflight and support-bundle plugins is added to PATH.

   > **Note:** The shell command cannot be run on non-controller nodes.

3. Generate the support bundle using the default Embedded Cluster spec and the `--load-cluster-specs` flag:

   ```bash
   kubectl support-bundle --load-cluster-specs /var/lib/embedded-cluster/support/host-support-bundle.yaml
   ```
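The plugin writes the bundle as a compressed archive to the working directory. To review its contents locally before sharing it with support (a quick sketch; the archive name includes a timestamp, and this assumes a single archive is present):

```bash
# Extract the archive into the current directory for inspection
tar -xzf ./support-bundle-*.tar.gz
```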
View Logs
You can view logs for both Embedded Cluster and the k0s systemd service to help troubleshoot Embedded Cluster deployments.
View Installation Logs for Embedded Cluster
To view installation logs for Embedded Cluster:
1. SSH onto a controller node.

2. Navigate to `/var/log/embedded-cluster` and open the `.log` file to view logs.
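For example, a minimal way to find and page through the logs (the exact log filename varies by install):

```bash
# List the available Embedded Cluster log files, then page through one
ls /var/log/embedded-cluster/
less /var/log/embedded-cluster/*.log
```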
View k0s Logs
You can use the journalctl command line tool to access logs for systemd services, including k0s. For more information about k0s, see the k0s documentation.
To use journalctl to view k0s logs:
1. SSH onto a controller node or a worker node.

2. Use journalctl to view logs for the k0s systemd service that was deployed by Embedded Cluster. Examples:

   ```bash
   journalctl -u k0scontroller
   journalctl -u k0sworker
   ```

   Use `k0scontroller` on controller nodes and `k0sworker` on worker nodes.
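journalctl also supports following the logs in real time and filtering by time window, which can help while reproducing an issue:

```bash
# Follow k0s controller logs as they are written
journalctl -u k0scontroller -f
# Show only entries from the last hour
journalctl -u k0scontroller --since "1 hour ago"
```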
Access the Cluster
When troubleshooting, it can be useful to inspect cluster resources and view logs using the kubectl command line tool. For additional suggestions related to troubleshooting applications, see Troubleshooting Applications in the Kubernetes documentation.
To access the cluster and use other included binaries:
1. SSH onto a controller node.

   > **Note:** You cannot run the `shell` command on worker nodes.

2. Use the Embedded Cluster shell command to start a shell with access to the cluster:

   ```bash
   sudo ./APP_SLUG shell
   ```

   Where `APP_SLUG` is the unique slug for the application.

   The output looks similar to the following:

   ```
      __4___
    _  \ \ \ \   Welcome to APP_SLUG debug shell.
   <'\ /_/_/_/   This terminal is now configured to access your cluster.
    ((____!___/) Type 'exit' (or CTRL+d) to exit.
     \0\0\0\0\/  Happy hacking.
    ~~~~~~~~~~~
   root@alex-ec-1:/home/alex# export KUBECONFIG="/var/lib/embedded-cluster/k0s/pki/admin.conf"
   root@alex-ec-1:/home/alex# export PATH="$PATH:/var/lib/embedded-cluster/bin"
   root@alex-ec-1:/home/alex# source <(k0s completion bash)
   root@alex-ec-1:/home/alex# source <(cat /var/lib/embedded-cluster/bin/kubectl_completion_bash.sh)
   root@alex-ec-1:/home/alex# source /etc/bash_completion
   ```

   The appropriate kubeconfig is exported, and the location of useful binaries like kubectl and Replicated's preflight and support-bundle plugins is added to PATH.

3. Use the available binaries as needed. Example:

   ```bash
   kubectl version
   ```
   ```
   Client Version: v1.29.1
   Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
   Server Version: v1.29.1+k0s
   ```

4. Type `exit` or press Ctrl + D to exit the shell.
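From the shell, a few common starting points for inspecting the cluster (a sketch; `POD_NAME` and `NAMESPACE` are placeholders to replace with your own values):

```bash
# List all pods in all namespaces and look for pods that are not Running
kubectl get pods -A
# Describe a problem pod to see its recent events
kubectl describe pod POD_NAME -n NAMESPACE
# Tail the pod's logs
kubectl logs POD_NAME -n NAMESPACE --tail=100
```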
Troubleshoot Errors
This section provides troubleshooting advice for common errors.
Installation failure when the NVIDIA GPU Operator is included as a Helm extension
Symptom
A release that includes the NVIDIA GPU Operator as a Helm extension fails to install.
Cause
If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.
This is the result of a known issue with v24.9.x of the NVIDIA GPU Operator. For more information about the known issue, see container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary in the nvidia-container-toolkit repository in GitHub.
For more information about including the GPU Operator as a Helm extension, see NVIDIA GPU Operator in Using Embedded Cluster.
Solution
To troubleshoot:
1. Remove any existing containerd services that are running on the host, such as those deployed by Docker. See the sketch after this procedure for an example.

2. Reset and reboot the node:

   ```bash
   sudo ./APP_SLUG reset
   ```

   Where `APP_SLUG` is the unique slug for the application. For more information, see Reset a Node in Using Embedded Cluster.

3. Re-install with Embedded Cluster.
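For step 1, a minimal sketch of stopping Docker's containerd on a systemd host (whether to disable or fully remove Docker, and the exact unit names, depend on your environment):

```bash
# Stop and disable Docker and its bundled containerd service (assumes systemd)
sudo systemctl stop docker containerd
sudo systemctl disable docker containerd
```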
Calico networking issues
Symptom
Symptoms of Calico networking issues can include:
- The pod is stuck in a CrashLoopBackOff state with failed health checks:

  ```
  Warning Unhealthy 6h51m (x3 over 6h52m) kubelet Liveness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
  Warning Unhealthy 6h51m (x19 over 6h52m) kubelet Readiness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
  ...
  Unhealthy pod/registry-dc699cbcf-pkkbr Readiness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Unhealthy pod/registry-dc699cbcf-pkkbr Liveness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  ...
  ```

- The pod log contains an I/O timeout:

  ```
  server APIs: config.k8ssandra.io/v1beta1: Get \"https://***HIDDEN***:443/apis/config.k8ssandra.io/v1beta1\": dial tcp ***HIDDEN***:443: i/o timeout"}
  ```
Cause
Reasons can include:
- The pod CIDR and service CIDR overlap with the host network CIDR.

- Incorrect kernel parameter values.

- VXLAN traffic getting dropped. By default, Calico uses VXLAN as the overlay networking protocol in Always mode, which encapsulates all pod-to-pod traffic in VXLAN packets. If the network filters or drops these VXLAN packets, pods cannot communicate with other pods.
Solution
The solution depends on the cause. The following procedures address, in turn: pod and service CIDRs that overlap with the host network CIDR, incorrect kernel parameter values, and dropped VXLAN traffic.
To troubleshoot pod CIDR and service CIDR overlapping with the host network CIDR:
1. Run the following command to verify the pod and service CIDRs:

   ```bash
   cat /etc/k0s/k0s.yaml | grep -i cidr
   ```
   ```
   podCIDR: 10.244.0.0/17
   serviceCIDR: 10.244.128.0/17
   ```

   The default pod CIDR is 10.244.0.0/16 and the default service CIDR is 10.96.0.0/12.

2. View the host's network routes, excluding Calico interfaces, and ensure that none of the CIDRs overlap:

   ```bash
   ip route | grep -v cali
   ```
   ```
   default via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
   10.152.0.1 dev ens4 proto dhcp scope link src 10.152.0.4 metric 100
   blackhole 10.244.101.192/26 proto 80
   169.254.169.254 via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
   ```

3. Reset and reboot the installation:

   ```bash
   sudo ./APP_SLUG reset
   ```

   Where `APP_SLUG` is the unique slug for the application. For more information, see Reset a Node in Using Embedded Cluster.

4. Reinstall the application with different CIDRs using the `--cidr` flag:

   ```bash
   sudo ./APP_SLUG install --license license.yaml --cidr 172.16.136.0/16
   ```

   For more information, see Embedded Cluster Install Options.
Embedded Cluster 1.19.0 and later automatically sets the `net.ipv4.conf.default.arp_filter`, `net.ipv4.conf.default.arp_ignore`, and `net.ipv4.ip_forward` kernel parameters. Additionally, host preflight checks automatically run during installation to verify that the kernel parameters were set correctly. For more information about the Embedded Cluster preflight checks, see About Host Preflight Checks in Using Embedded Cluster.

If the kernel parameters are not set correctly and these preflight checks fail, you might see a message such as `IP forwarding must be enabled.` or `ARP filtering must be disabled by default for newly created interfaces.`
To troubleshoot incorrect kernel parameter values:
1. Use sysctl to set the kernel parameters to the correct values (to confirm that the values took effect, see the verification sketch after this procedure):

   ```bash
   echo "net.ipv4.conf.default.arp_filter=0" >> /etc/sysctl.d/99-embedded-cluster.conf
   echo "net.ipv4.conf.default.arp_ignore=0" >> /etc/sysctl.d/99-embedded-cluster.conf
   echo "net.ipv4.ip_forward=1" >> /etc/sysctl.d/99-embedded-cluster.conf
   sysctl --system
   ```

2. Reset and reboot the installation:

   ```bash
   sudo ./APP_SLUG reset
   ```

   Where `APP_SLUG` is the unique slug for the application. For more information, see Reset a Node in Using Embedded Cluster.

3. Re-install with Embedded Cluster.
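To verify the kernel parameter values after applying them (a quick check with sysctl):

```bash
# Expect arp_filter=0, arp_ignore=0, and ip_forward=1
sysctl net.ipv4.conf.default.arp_filter net.ipv4.conf.default.arp_ignore net.ipv4.ip_forward
```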
To troubleshoot dropped VXLAN traffic, set the Calico VXLAN mode to CrossSubnet as a temporary measure and see if the issue persists. This mode encapsulates only traffic between pods on different subnets in VXLAN:

```bash
kubectl patch ippool default-ipv4-ippool --type=merge -p '{"spec": {"vxlanMode": "CrossSubnet"}}'
```
If this resolves the connectivity issues, there is likely an underlying network configuration problem with VXLAN traffic that should be addressed.
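To investigate further, confirm that VXLAN traffic can actually flow between nodes. A sketch, assuming Calico's default VXLAN port of 4789 and that tcpdump is installed on the host:

```bash
# Watch for VXLAN packets while pods attempt to communicate across nodes;
# if none appear, look for host or network firewall rules filtering UDP 4789
sudo tcpdump -i any udp port 4789 -c 10
```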