Troubleshooting Guide

This guide documents common failure modes and solutions for operators using logtap.

Receiver Won't Start
Sidecar Not Forwarding Logs
logtap check Failures
Disk Full / Rotation Not Working
Capture Cannot Be Opened
Port-Forward Drops / Tunnel Instability
Redaction Not Working
logtap tap Timed Out

Receiver Won't Start

Symptom: The logtap recv command fails to start or immediately exits with an error.

Cause:

Port in use: Another process is already listening on the specified port (--listen).
Directory permissions: logtap does not have write permissions to the capture directory (--dir).
Configuration errors: Invalid values in the config file or command-line flags.

Solution:

Port in use:
- Choose a different port: logtap recv --listen :9001
- Identify and stop the process using the port (e.g., lsof -i :9000).
Directory permissions:
- Ensure the user running logtap has write permissions to the --dir path.
- Change the capture directory to an accessible location: logtap recv --dir /tmp/my-capture.
Configuration errors:
- Review the error message for specific configuration issues.
- Double-check the values provided for flags like --max-disk, --max-file, etc.
- If using a config file, ensure it's valid YAML and follows the schema.

Sidecar Not Forwarding Logs

Symptom: logtap tap successfully injects the sidecar, but no logs appear in the receiver's TUI or capture directory.

Cause:

Image pull errors: The Kubernetes cluster cannot pull the logtap-forwarder image (e.g., wrong image name, private registry authentication issues, image not multi-arch compatible with node).
Network policy blocking: A Kubernetes NetworkPolicy prevents the sidecar from reaching the logtap receiver.
Receiver unreachable: The --target address specified for the sidecar is incorrect or the receiver pod is not running/accessible.
Log volume mounting: The sidecar cannot access the application logs (e.g., incorrect volume mount, application logs to a non-standard location).

Solution:

Image pull errors:
- Verify the sidecar image name and tag: kubectl describe pod <tapped-pod-name>.
- Ensure the image exists and is accessible from the cluster.
- For private registries, verify image pull secrets are configured correctly.
Network policy blocking:
- Temporarily disable or adjust relevant NetworkPolicies.
- Create a NetworkPolicy that allows egress from the sidecar to the receiver.
Receiver unreachable:
- Double-check the --target address is correct (e.g., logtap.logtap:9000 for in-cluster).
- Use kubectl logs <logtap-forwarder-pod> to check the sidecar logs for connection errors.
- Verify the receiver is running and its service is reachable (e.g., kubectl get svc -n logtap logtap).
Log volume mounting:
- Inspect the logtap-forwarder container spec in the tapped pod to confirm volume mounts.
- Ensure the application logs to standard output/error or a path that the sidecar can access.

`logtap check` Failures

Symptom: logtap check reports warnings or errors regarding RBAC, quota, or orphaned resources.

Cause:

RBAC Missing: The Kubernetes user or service account lacks permissions to perform necessary actions (e.g., patch deployments, create pods).
Quota Exceeded: Injecting sidecars would exceed a namespace's resource quota.
Orphaned Resources: Previous logtap sessions were not fully cleaned up, leaving behind sidecars or tunnel pods/services.
Prod Namespace Warning: Attempting to tap a namespace identified as "production" without the --allow-prod flag.

Solution:

RBAC Missing:
- Work with a cluster administrator to grant the required RBAC permissions to your user or service account.
- The logtap check output often provides hints about missing permissions.
Quota Exceeded:
- Increase the namespace's ResourceQuota (requires admin privileges).
- Reduce the sidecar's resource requests using --sidecar-memory or --sidecar-cpu flags in logtap tap.
- Use --force with logtap tap if you understand the risks (pods may fail to schedule).
Orphaned Resources:
- Follow the suggestions from logtap check to clean up:
  - logtap untap --all to remove orphaned sidecars.
  - kubectl delete ns logtap to remove orphaned tunnel resources if logtap was deployed in-cluster.
Prod Namespace Warning:
- If you intend to tap a production namespace, explicitly use the --allow-prod flag with logtap tap. Be aware of the implications.
- Ensure PII redaction is enabled (--redact) if tapping production.

Disk Full / Rotation Not Working

Symptom: The receiver stops writing logs, reports drops, or the capture directory exceeds its --max-disk limit without old files being deleted.

Cause:

Disk write errors: The underlying disk is full, has I/O errors, or permissions issues.
Incorrect --max-disk or --max-file: Values are too high, or a configuration mistake prevents rotation.
Slow file deletion: The system is slow to delete files, or logtap is unable to delete files.

Solution:

Disk write errors:
- Check available disk space on the volume hosting the capture directory.
- Investigate system logs for disk-related errors.
- Ensure logtap has full permissions to manage files in the capture directory.
Incorrect limits:
- Review logtap recv command flags and config file for --max-disk and --max-file values.
- Ensure --max-disk is a reasonable limit for your storage.
Slow file deletion:
- Monitor logtap's internal metrics for logtap_disk_usage_bytes and logtap_backpressure_events_total.
- Consider increasing disk I/O capacity or reducing log volume if persistently hitting limits.

Capture Cannot Be Opened

Symptom: logtap open, inspect, slice, or export fail with errors about corrupt metadata, missing index, or invalid file formats.

Cause:

Corrupt metadata.json: The metadata.json file in the capture directory is malformed or missing.
Corrupt index.jsonl: The index.jsonl file is malformed, preventing proper indexing of log files.
Incomplete capture: The logtap recv process was terminated abruptly without flushing buffers, leaving an incomplete or inconsistent capture.
Manual tampering: Files within the capture directory were manually modified, moved, or deleted.

Solution:

Corrupt files:
- If the capture is critical, attempt to manually repair metadata.json or index.jsonl if the corruption is minor and understandable.
- For a robust solution, avoid manual modification of capture directories.
Incomplete capture:
- Always ensure logtap recv is gracefully shut down (Ctrl+C) to allow for buffer flushing and metadata updates.
- For corrupted captures, data integrity cannot be guaranteed. Use logtap inspect to assess what is recoverable.
Manual tampering:
- Restore the capture directory from a backup if available.
- Avoid direct manipulation of files within logtap capture directories.

Port-Forward Drops / Tunnel Instability

Symptom: When using logtap recv --in-cluster, the connection between the in-cluster forwarder and your local receiver frequently disconnects or logs stop flowing.

Cause:

Local network instability: Your machine's network connection is unreliable.
Kubernetes API server load: The Kubernetes API server is under heavy load, causing kubectl port-forward to become unstable.
Receiver pod restart: The temporary receiver pod in the cluster (used for the tunnel) is restarting due to resource limits, errors, or Kubernetes eviction policies.
Idle timeouts: Some network infrastructure or VPNs may aggressively close idle connections, even if traffic is low.

Solution:

Local network stability:
- Ensure your local machine has a stable network connection.
- Avoid actions that might disrupt network connectivity (e.g., VPN changes).
Kubernetes API server load:
- Monitor the Kubernetes API server's health and resource usage.
- If API server load is an issue, consider deploying logtap recv --in-cluster for more stability.
Receiver pod restart:
- Use kubectl describe pod <receiver-pod-name> -n logtap to check for restart reasons (OOMKilled, errors).
- Increase the resource requests/limits for the in-cluster receiver pod if necessary.
Idle timeouts:
- If possible, configure your network or VPN to have longer idle timeouts.
- Consider logtap recv --in-cluster as an alternative if the tunnel remains unstable.

Redaction Not Working

Symptom: PII or sensitive data is still visible in captured logs despite using the --redact flag.

Cause:

Incorrect flag usage: --redact was not used, or specific patterns were not enabled (e.g., --redact=email but a credit card number is visible).
Pattern not matching: The built-in or custom redaction patterns do not correctly match the format of the sensitive data in the logs.
Custom patterns file format: The --redact-patterns YAML file is malformed or its regex patterns are incorrect.
Redaction pipeline bypass: Logs are entering the system through a path that bypasses the redaction pipeline.

Solution:

Correct flag usage:
- Ensure logtap recv --redact is used to enable all built-in patterns.
- If specific patterns are desired, ensure all relevant ones are listed (e.g., --redact=credit_card,email,jwt).
Pattern not matching:
- Examine the sensitive data's exact format in the raw logs.
- If using custom patterns, test the regex patterns against sample data using an online regex tester.
- Built-in patterns are comprehensive but might not cover highly unusual formats.
Custom patterns file format:
- Validate your patterns.yaml file using a YAML linter.
- Ensure the regex values are valid regular expressions.
Redaction pipeline bypass:
- Confirm all log ingestion paths (Loki push API, raw JSON) are properly configured to pass through the redaction stage. This is usually handled internally by logtap when --redact is active.

`logtap tap` Timed Out

Symptom: logtap tap command returns a timeout error after a period of waiting, especially when interacting with Kubernetes.

Cause:

Unresponsive Kubernetes API server: The Kubernetes API server is slow, overloaded, or unreachable from where logtap is running.
Network latency: High network latency between the logtap CLI and the Kubernetes cluster.
Resource constraints on logtap host: The machine running logtap is under heavy load, causing operations to be slow.
Long-running cluster operations: The Kubernetes cluster itself is busy with other operations (e.g., many rolling updates) that delay logtap's ability to patch resources.

Solution:

Unresponsive Kubernetes API server:
- Check the health and load of your Kubernetes API server (kubectl top node, kubectl get --raw=/metrics).
- If possible, try running logtap from a location with better network connectivity to the cluster.
Increase timeout:
- Use the --timeout flag to allow logtap more time to complete its operations, e.g., logtap tap --deployment my-app --target ... --timeout 60s.
- This is a workaround; address the root cause of the slow API server if possible.
Check local machine resources:
- Ensure the machine running the logtap CLI has sufficient CPU and memory.
Monitor cluster activity:
- Observe if other cluster operations are ongoing that might be consuming resources or locking resources logtap needs to modify.
- Retry the logtap tap command after some time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Troubleshooting Guide

Table of Contents

Receiver Won't Start

Sidecar Not Forwarding Logs

`logtap check` Failures

Disk Full / Rotation Not Working

Capture Cannot Be Opened

Port-Forward Drops / Tunnel Instability

Redaction Not Working

`logtap tap` Timed Out

FilesExpand file tree

troubleshooting.md

Latest commit

History

troubleshooting.md

File metadata and controls

Troubleshooting Guide

Table of Contents

Receiver Won't Start

Sidecar Not Forwarding Logs

logtap check Failures

Disk Full / Rotation Not Working

Capture Cannot Be Opened

Port-Forward Drops / Tunnel Instability

Redaction Not Working

logtap tap Timed Out

`logtap check` Failures

`logtap tap` Timed Out