Troubleshoot snapshots

When a snapshot fails, KOTS automatically collects and stores a support bundle that contains logs and system state from the time of the failure. Start troubleshooting by reviewing the logs in this bundle.

note

Replicated KOTS is available only for existing customers. For supporting installations into customer-managed clusters, we recommend Helm. For more information, see About Helm Installations with Replicated.

KOTS is a Generally Available (GA) product for existing customers. For more information about the Replicated product lifecycle phases, see Support Lifecycle Policy.

Velero is crashing

If Velero is crashing and not starting, some common causes are:

Invalid cloud credentials

Symptom

You see the following error message from Velero when trying to configure a snapshot.

time="2020-04-10T14:22:24Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-04-10T14:22:24Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-04-10T14:22:27Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-04-10T14:22:31Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-04-10T14:22:31Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
An error occurred: some backup storage locations are invalid: backup store for location "default" is invalid: rpc error: code = Unknown desc = NoSuchBucket: The specified bucket does not exist
        status code: 404, request id: BEFAE2B9B05A2DCF, host id: YdlejsorQrn667ziO6Xr6gzwKJJ3jpZzZBMwwMIMpWj18Phfii6Za+dQ4AgfzRcxavQXYcgxRJI=

Cause

If the cloud access credentials are invalid or do not have access to the location in the configuration, Velero enters a crash loop. The support bundle includes the Velero logs, which contain the error message shown in the Symptom section above.

Solution

Replicated recommends that you validate the access key and secret, or the service account JSON, and confirm that the credentials have access to the location in the configuration.

Invalid top-level directories

Symptom

You see the following error message when Velero is starting:

time="2020-04-10T14:12:42Z" level=info msg="Checking existence of namespace" logSource="pkg/cmd/server/server.go:337" namespace=velero
time="2020-04-10T14:12:42Z" level=info msg="Namespace exists" logSource="pkg/cmd/server/server.go:343" namespace=velero
time="2020-04-10T14:12:44Z" level=info msg="Checking existence of Velero custom resource definitions" logSource="pkg/cmd/server/server.go:372"
time="2020-04-10T14:12:44Z" level=info msg="All Velero custom resource definitions exist" logSource="pkg/cmd/server/server.go:406"
time="2020-04-10T14:12:44Z" level=info msg="Checking that all backup storage locations are valid" logSource="pkg/cmd/server/server.go:413"
An error occurred: some backup storage locations are invalid: backup store for location "default" is invalid: Backup store contains invalid top-level directories: [other-directory]

Cause

Velero displays this error message when it starts against a reconfigured or reused bucket that contains data other than Velero backups. In the example above, the bucket contains a directory named other-directory.

The bucket that you configure for Velero cannot contain other data, or Velero crashes.

Solution

Configure Velero to use a bucket that does not contain other data.
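
The two startup errors in this section can be told apart by the text of the final log line. The following is a minimal sketch, not part of KOTS or Velero, of a shell helper that reads Velero logs and prints a hint for each case. The function name classify_velero_startup_error and the hint labels are hypothetical and used only for illustration.

```shell
# Hypothetical helper: reads Velero server logs on stdin and prints a hint
# for the two startup errors described in this section.
classify_velero_startup_error() {
  logs=$(cat)
  case "$logs" in
    # Check the more specific message first, because it also contains
    # the generic "backup storage locations are invalid" text.
    *"invalid top-level directories"*)
      echo "bucket-not-empty: use a bucket that does not contain other data" ;;
    *"backup storage locations are invalid"*)
      echo "storage-location-invalid: validate the credentials and the configured location" ;;
    *)
      echo "unknown: no known startup error found" ;;
  esac
}
```

For example, assuming Velero runs as the velero deployment in the velero namespace, you might pipe the logs into the helper with kubectl logs deployment/velero -n velero | classify_velero_startup_error.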

Node agent is crashing

If the node-agent Pod is crashing and not starting, some common causes are:

Metrics server is failing to start

Symptom

You see the following error in the node-agent logs.

time="2023-11-16T21:29:44Z" level=info msg="Starting metric server for node agent at address []" logSource="pkg/cmd/cli/nodeagent/server.go:229"
time="2023-11-16T21:29:44Z" level=fatal msg="Failed to start metric server for node agent at []: listen tcp :80: bind: permission denied" logSource="pkg/cmd/cli/nodeagent/server.go:236"

Cause

This issue occurs in Velero 1.12.0 and 1.12.1. Velero does not set the port correctly when starting the metrics server. The metrics server fails to start with a permission denied error. The error occurs in environments that do not run MinIO and have Host Path, Network File System (NFS), or internal storage destinations configured. When the metrics server fails to start, the node-agent Pod crashes. For more information about this issue, see the GitHub issue details.

Solution

Replicated recommends that you either upgrade to Velero 1.12.2 or later, or downgrade to a version earlier than 1.12.0.
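
To decide whether this issue applies, compare the installed Velero version against the two affected releases. The following sketch shows one way to do that in shell. The function name velero_metrics_bug_affected is hypothetical, and the commented usage assumes that the Velero image tag matches the Velero version, which might not hold for images pinned by digest or mirrored under a custom tag.

```shell
# Hypothetical helper: returns 0 (true) when the given Velero version is one
# of the releases affected by the metrics server port issue (1.12.0 or 1.12.1).
velero_metrics_bug_affected() {
  version=${1#v}   # strip an optional leading "v", as in "v1.12.1"
  case "$version" in
    1.12.0|1.12.1) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (assumes the image tag is the Velero version):
#   image=$(kubectl get deployment velero -n velero \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   velero_metrics_bug_affected "${image##*:}" && echo "Upgrade to Velero 1.12.2 or later"
```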

Snapshot creation is failing

Timeout error when creating a snapshot

Symptom

You see a backup error that includes a timeout message when attempting to create a snapshot. For example:

Error backing up item
timed out after 12h0m0s

Cause

This error message appears when the node-agent Pod operation reaches the timeout limit. The default timeout is 240 minutes.

For Velero 1.16 and earlier, Velero integrates with Restic to provide a solution for backing up and restoring Kubernetes volumes. For more information, see File System Backup in the Velero documentation.

For Velero 1.17 and later, Velero uses Kopia for file-system backups by default.

Solution

Use the kubectl Kubernetes command-line tool to patch the Velero deployment to increase the timeout:

Velero 1.17 and later:

kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--fs-backup-timeout=TIMEOUT_LIMIT"}]'

Velero 1.16 and earlier:

kubectl patch deployment velero -n velero --type json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=TIMEOUT_LIMIT"}]'

Replace TIMEOUT_LIMIT with a length of time for the node-agent Pod operation timeout in hours, minutes, and seconds. Use the format 0h0m0s. For example, 48h30m0s.

note

The timeout value reverts to the default value if you rerun the velero install command.
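
Because the flag name depends on the Velero version and the value must use the 0h0m0s format, it can help to build the patch with a small helper that checks both. The following is a sketch under those assumptions. The function name velero_timeout_patch is hypothetical, and the mapping of versions to flags follows the two commands above.

```shell
# Hypothetical helper: prints the JSON patch for the timeout argument.
# Usage: velero_timeout_patch VELERO_MINOR_VERSION TIMEOUT_LIMIT
velero_timeout_patch() {
  minor=$1
  timeout=$2
  # Accept only the 0h0m0s format, for example 48h30m0s.
  if ! printf '%s\n' "$timeout" | grep -Eq '^[0-9]+h[0-9]+m[0-9]+s$'; then
    echo "invalid timeout: $timeout (expected format 0h0m0s)" >&2
    return 1
  fi
  # Per the commands above: 1.17 and later use --fs-backup-timeout,
  # 1.16 and earlier use --restic-timeout.
  if [ "$minor" -ge 17 ]; then
    flag="--fs-backup-timeout"
  else
    flag="--restic-timeout"
  fi
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"%s=%s"}]\n' "$flag" "$timeout"
}
```

For example, on Velero 1.17 you might run kubectl patch deployment velero -n velero --type json -p "$(velero_timeout_patch 17 48h30m0s)".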

Memory limit reached on the node-agent pod

Symptom

The Linux kernel Out Of Memory (OOM) killer kills the node-agent Pod, or snapshots fail with errors similar to:

pod volume backup failed: ... signal: killed

Cause

Velero sets default limits for the velero Pod and the node-agent Pod during installation. For Velero 1.16 and earlier, a known Restic issue causes high memory usage. This high memory usage can cause failures during snapshot creation when the Pod reaches the memory limit. For Velero 1.17 and later, Velero uses Kopia for file-system backups by default. Large volumes with Kopia can also require a higher memory limit.

For more information about the Restic issue, see the Restic backup — OOM-killed on raspberry pi after backing up another computer to same repo issue in the Restic GitHub repository.

Solution

Increase the default memory limit for the node-agent Pod if your application is particularly large. For more information about configuring Velero resource requests and limits, see Customize resource requests and limits in the Velero documentation.

For example, the following kubectl command increases the memory limit for the node-agent DaemonSet from the default of 1Gi to 2Gi:

kubectl -n velero patch daemonset node-agent -p '{"spec":{"template":{"spec":{"containers":[{"name":"node-agent","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

Alternatively, you can lower the memory garbage collection target percentage on the node-agent DaemonSet. This can help the node-agent Pod avoid reaching the memory limit during snapshot creation. Run the following kubectl command:

kubectl -n velero set env daemonset/node-agent GOGC=1

Velero cannot read at least one source file

Symptom

You see the following error in Velero logs:

Error backing up item...Warning: at least one source file could not be read

Cause

For Velero 1.16 and earlier, this error occurs when files change between Restic's initial scan of the volume and the backup to the Restic store.

Solution

To resolve this issue, do one of the following:

Use hooks to export data to an EmptyDir volume and include that in the backup instead of the primary PVC volume. See Configure Backup and Restore Hooks for Snapshots.
Freeze the file system to ensure all pending disk I/O operations have completed prior to taking a snapshot. For more information, see Hook Example with fsfreeze in the Velero documentation.

Kopia file-system backup issues (Velero 1.17 and later)

Data mover pods do not start or complete

Cause

For Velero 1.17 and later, the node-agent spawns data mover pods for Kopia backups and restores. If a backup or restore stays in progress and never finishes, the data mover pods might have failed to start or complete.

Solution

Check the node-agent logs for errors that prevent data mover pods from starting. Verify that the cluster can pull the data mover pod image and that any pod security policies or security context constraints allow the pod to start.

BackupRepository is not available

Cause

For Velero 1.17 and later, Kopia uses BackupRepository custom resources (CRs) to manage repositories. A backup or restore can fail if the BackupRepository CR is not available or is in a failed state.

Solution

Check the status of the BackupRepository CRs:

kubectl get backuprepositories -n velero

Describe any CR that is not in a Ready state to view the error:

kubectl describe backuprepository BACKUP_REPOSITORY_NAME -n velero

Common errors include invalid credentials, network connectivity issues, or problems with the underlying storage. After you resolve the issue, Velero retries the backup repository operations.
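
When there are many repositories, filtering the list down to the ones that are not Ready can save time. The following sketch assumes that each BackupRepository CR reports its state in the status.phase field and that a healthy repository shows Ready. The function name not_ready_repositories and the repository names in any sample input are hypothetical.

```shell
# Hypothetical helper: reads "NAME PHASE" lines on stdin and prints the
# names of repositories whose phase is anything other than Ready.
# A repository with an empty phase is also printed.
not_ready_repositories() {
  awk 'NF > 0 && $2 != "Ready" { print $1 }'
}

# Example usage (assumes the phase is exposed at .status.phase):
#   kubectl get backuprepositories -n velero \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
#     | not_ready_repositories
```

You can then pass each printed name to the kubectl describe backuprepository command above.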

Read-only root filesystem errors

Cause

For Velero 1.17 and later, Kopia needs writable directories for cache and configuration. The default paths are /home/cnb/udmrepo and /home/cnb/.cache. If ReadOnlyRootFilesystem applies to the Velero or node-agent pods, Kopia cannot write to these directories and the backup or restore fails.

Solution

Add emptyDir volumes for /home/cnb/udmrepo and /home/cnb/.cache to the Velero deployment and the node-agent daemon set. Mount the volumes at the required paths so Kopia can write cache and configuration data.
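
One way to add the volumes is with a JSON patch. The following is a sketch only. The volume names udmrepo and cache and the function name kopia_emptydir_patch are hypothetical, and the patch assumes that the first container in the Pod template is the Velero or node-agent container and that the volumes and volumeMounts lists already exist. Review the generated patch against your own manifests before you apply it.

```shell
# Hypothetical helper: prints a JSON patch that adds two emptyDir volumes
# and mounts them at the Kopia cache and configuration paths.
kopia_emptydir_patch() {
  cat <<'EOF'
[
  {"op":"add","path":"/spec/template/spec/volumes/-","value":{"name":"udmrepo","emptyDir":{}}},
  {"op":"add","path":"/spec/template/spec/volumes/-","value":{"name":"cache","emptyDir":{}}},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts/-","value":{"name":"udmrepo","mountPath":"/home/cnb/udmrepo"}},
  {"op":"add","path":"/spec/template/spec/containers/0/volumeMounts/-","value":{"name":"cache","mountPath":"/home/cnb/.cache"}}
]
EOF
}

# Example usage:
#   kubectl patch deployment velero -n velero --type json -p "$(kopia_emptydir_patch)"
#   kubectl patch daemonset node-agent -n velero --type json -p "$(kopia_emptydir_patch)"
```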

LVP storage location is unavailable after upgrade

Cause

The Local Volume Provider (LVP) is not compatible with Kopia. If you upgrade to Velero 1.17 or later and the existing storage location uses LVP, snapshots fail because Kopia cannot write to LVP storage.

important

LVP backups created on Velero 1.16 and earlier are not restorable on Velero 1.17 and later. Before you upgrade, migrate to a Kopia-compatible storage destination. For more information, see Upgrade Velero for snapshots.

Solution

Before you upgrade to Velero 1.17 or later, migrate from LVP to a Kopia-compatible destination. Replicated recommends one of the following options:

Reinstall KOTS with --with-minio=true.
Reconfigure the storage location to use an external S3-compatible object store, such as Amazon S3, Google Cloud Storage, Azure Blob Storage, or another S3-compatible provider. Install the target Velero plugin before you reconfigure the storage location.

For more information, see Upgrade Velero for snapshots.

Snapshot restore is failing

Service NodePort is already allocated

Symptom

In the Replicated KOTS Admin Console, you see an Application failed to restore error. The error indicates that the port number for a static NodePort is already in use. For example:

[Image: Snapshot Troubleshoot Service NodePort]

Cause

A known issue in Kubernetes versions earlier than version 1.19 can cause static NodePort services to collide in multi-primary high availability setups. This collision occurs when recreating the services. For more information about this known issue, see https://github.com/kubernetes/kubernetes/issues/85894.

Solution

Kubernetes version 1.19 fixes this issue. To resolve this issue, upgrade to Kubernetes version 1.19 or later.

For more information about the fix, see https://github.com/kubernetes/kubernetes/pull/89937.

Partial snapshot restore finishes with warnings

Symptom

In the Admin Console, when the partial snapshot restore completes, you see warnings indicating that Velero did not restore Endpoint resources:

[Image: Snapshot Troubleshoot Restore Warnings]

Cause

Velero changed the resource restore priority in 1.10.3 and 1.11.0, which leads to this warning when restoring Endpoint resources. For more information about this issue, see the issue details in GitHub.

Solution

These warnings do not necessarily mean that the restore itself failed. The endpoints likely exist because Kubernetes creates them when the restore process restores the related Service resources. However, to avoid these warnings, use Velero version 1.11.1 or later.
