This guide documents common failure modes and solutions for operators using logtap.

- Receiver Won't Start
- Sidecar Not Forwarding Logs
- `logtap check` Failures
- Disk Full / Rotation Not Working
- Capture Cannot Be Opened
- Port-Forward Drops / Tunnel Instability
- Redaction Not Working
- `logtap tap` Timed Out
## Receiver Won't Start

Symptom: The `logtap recv` command fails to start or immediately exits with an error.

Cause:

- Port in use: Another process is already listening on the specified port (`--listen`).
- Directory permissions: `logtap` does not have write permission for the capture directory (`--dir`).
- Configuration errors: Invalid values in the config file or command-line flags.

Solution:

- Port in use:
  - Choose a different port: `logtap recv --listen :9001`
  - Identify and stop the process using the port (e.g., `lsof -i :9000`).
- Directory permissions:
  - Ensure the user running `logtap` has write permission for the `--dir` path.
  - Change the capture directory to an accessible location: `logtap recv --dir /tmp/my-capture`.
- Configuration errors:
  - Review the error message for specific configuration issues.
  - Double-check the values provided for flags like `--max-disk`, `--max-file`, etc.
  - If using a config file, ensure it is valid YAML and follows the schema.
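The two most common startup blockers (a taken port and an unwritable capture directory) can be pre-checked before starting the receiver. A minimal sketch, assuming bash; the port and path are illustrative, not logtap defaults:

```shell
# Pre-flight checks before `logtap recv` (port and path are examples).
PORT=9000
DIR=/tmp/logtap-capture

# Port check: bash's /dev/tcp redirect connects only if something is
# already listening on the port.
if (exec 3<>"/dev/tcp/127.0.0.1/$PORT") 2>/dev/null; then
  port_status="in use (pick another, e.g. --listen :9001)"
else
  port_status="free"
fi
echo "port $PORT: $port_status"

# Directory check: can the current user create and write files under --dir?
mkdir -p "$DIR" 2>/dev/null
if touch "$DIR/.logtap-writetest" 2>/dev/null; then
  rm -f "$DIR/.logtap-writetest"
  dir_status="writable"
else
  dir_status="not writable"
fi
echo "dir $DIR: $dir_status"
```

If either check fails, fix it first; a clean pre-flight makes the receiver's own error messages much easier to interpret.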
## Sidecar Not Forwarding Logs

Symptom: `logtap tap` successfully injects the sidecar, but no logs appear in the receiver's TUI or capture directory.

Cause:

- Image pull errors: The Kubernetes cluster cannot pull the `logtap-forwarder` image (e.g., wrong image name, private registry authentication issues, image not multi-arch compatible with the node).
- Network policy blocking: A Kubernetes NetworkPolicy prevents the sidecar from reaching the `logtap` receiver.
- Receiver unreachable: The `--target` address specified for the sidecar is incorrect, or the receiver pod is not running/accessible.
- Log volume mounting: The sidecar cannot access the application logs (e.g., incorrect volume mount, or the application logs to a non-standard location).

Solution:

- Image pull errors:
  - Verify the sidecar image name and tag: `kubectl describe pod <tapped-pod-name>`.
  - Ensure the image exists and is accessible from the cluster.
  - For private registries, verify image pull secrets are configured correctly.
- Network policy blocking:
  - Temporarily disable or adjust relevant NetworkPolicies.
  - Create a NetworkPolicy that allows egress from the sidecar to the receiver.
- Receiver unreachable:
  - Double-check the `--target` address is correct (e.g., `logtap.logtap:9000` for in-cluster).
  - Use `kubectl logs <logtap-forwarder-pod>` to check the sidecar logs for connection errors.
  - Verify the receiver is running and its service is reachable (e.g., `kubectl get svc -n logtap logtap`).
- Log volume mounting:
  - Inspect the `logtap-forwarder` container spec in the tapped pod to confirm volume mounts.
  - Ensure the application logs to standard output/error or to a path the sidecar can access.
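If a NetworkPolicy turns out to be blocking the sidecar, an egress rule along these lines can unblock it. This is a sketch, not a drop-in manifest: the policy name, namespace, pod labels, receiver namespace, and port 9000 are all assumptions to adapt to your cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-logtap-egress   # hypothetical name
  namespace: my-namespace     # namespace of the tapped workload
spec:
  podSelector:
    matchLabels:
      app: my-app             # must match the tapped pods' labels
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: logtap  # receiver's namespace
      ports:
        - protocol: TCP
          port: 9000          # receiver's --listen port
```

Note that if the namespace has any other egress policy selecting these pods, policies are additive, so this rule only needs to cover the receiver traffic.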
## `logtap check` Failures

Symptom: `logtap check` reports warnings or errors regarding RBAC, quota, or orphaned resources.

Cause:

- RBAC Missing: The Kubernetes user or service account lacks permissions to perform necessary actions (e.g., `patch deployments`, `create pods`).
- Quota Exceeded: Injecting sidecars would exceed a namespace's resource quota.
- Orphaned Resources: Previous `logtap` sessions were not fully cleaned up, leaving behind sidecars or tunnel pods/services.
- Prod Namespace Warning: Attempting to tap a namespace identified as "production" without the `--allow-prod` flag.

Solution:

- RBAC Missing:
  - Work with a cluster administrator to grant the required RBAC permissions to your user or service account.
  - The `logtap check` output often provides hints about missing permissions.
- Quota Exceeded:
  - Increase the namespace's `ResourceQuota` (requires admin privileges).
  - Reduce the sidecar's resource requests using the `--sidecar-memory` or `--sidecar-cpu` flags of `logtap tap`.
  - Use `--force` with `logtap tap` if you understand the risks (pods may fail to schedule).
- Orphaned Resources:
  - Follow the suggestions from `logtap check` to clean up: `logtap untap --all` to remove orphaned sidecars, and `kubectl delete ns logtap` to remove orphaned tunnel resources if `logtap` was deployed in-cluster.
- Prod Namespace Warning:
  - If you intend to tap a production namespace, explicitly pass the `--allow-prod` flag to `logtap tap`, and be aware of the implications.
  - Ensure PII redaction is enabled (`--redact`) if tapping production.
## Disk Full / Rotation Not Working

Symptom: The receiver stops writing logs, reports drops, or the capture directory exceeds its `--max-disk` limit without old files being deleted.

Cause:

- Disk write errors: The underlying disk is full, has I/O errors, or has permission problems.
- Incorrect `--max-disk` or `--max-file`: Values are too high, or a configuration mistake prevents rotation.
- Slow file deletion: The system is slow to delete files, or `logtap` is unable to delete them.

Solution:

- Disk write errors:
  - Check available disk space on the volume hosting the capture directory.
  - Investigate system logs for disk-related errors.
  - Ensure `logtap` has full permissions to manage files in the capture directory.
- Incorrect limits:
  - Review the `logtap recv` command flags and config file for the `--max-disk` and `--max-file` values.
  - Ensure `--max-disk` is a reasonable limit for your storage.
- Slow file deletion:
  - Monitor `logtap`'s internal metrics for `logtap_disk_usage_bytes` and `logtap_backpressure_events_total`.
  - Consider increasing disk I/O capacity or reducing log volume if you persistently hit limits.
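To tell a full disk apart from broken rotation, it helps to compare the capture directory's actual size against the budget you gave the receiver. A sketch; the path and the 500 MiB budget are examples, not logtap defaults:

```shell
# Compare the capture directory's size against a --max-disk budget.
# Path and budget are illustrative.
DIR=/tmp/logtap-capture
MAX_DISK_KB=$((500 * 1024))   # 500 MiB, in KiB to match `du -sk`

mkdir -p "$DIR"
used_kb=$(du -sk "$DIR" | awk '{print $1}')

if [ "$used_kb" -gt "$MAX_DISK_KB" ]; then
  disk_status="over budget"     # rotation is not keeping up, or cannot delete
else
  disk_status="within budget"
fi
echo "capture dir uses ${used_kb} KiB: $disk_status"
```

If the directory is over budget but the volume still has free space, suspect rotation (limits or deletion permissions); if the volume itself is full, suspect the disk.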
## Capture Cannot Be Opened

Symptom: `logtap open`, `inspect`, `slice`, or `export` fail with errors about corrupt metadata, a missing index, or invalid file formats.

Cause:

- Corrupt `metadata.json`: The `metadata.json` file in the capture directory is malformed or missing.
- Corrupt `index.jsonl`: The `index.jsonl` file is malformed, preventing proper indexing of log files.
- Incomplete capture: The `logtap recv` process was terminated abruptly without flushing buffers, leaving an incomplete or inconsistent capture.
- Manual tampering: Files within the capture directory were manually modified, moved, or deleted.

Solution:

- Corrupt files:
  - If the capture is critical, attempt to manually repair `metadata.json` or `index.jsonl` if the corruption is minor and understandable.
  - For a robust solution, avoid manual modification of capture directories.
- Incomplete capture:
  - Always shut `logtap recv` down gracefully (Ctrl+C) to allow buffer flushing and metadata updates.
  - For corrupted captures, data integrity cannot be guaranteed. Use `logtap inspect` to assess what is recoverable.
- Manual tampering:
  - Restore the capture directory from a backup if available.
  - Avoid direct manipulation of files within `logtap` capture directories.
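Before attempting a repair, it is worth finding out which lines of `index.jsonl` are actually broken, since JSONL corruption is usually confined to the last (partially written) line. A minimal sketch, assuming `python3` is available; it writes a small demo file (one valid line, one truncated line) so it is self-contained, and you would point `INDEX` at the real file in your capture directory instead:

```shell
# Find malformed lines in an index.jsonl file (path and contents are a demo).
INDEX=/tmp/logtap-demo-index.jsonl
printf '%s\n%s\n' \
  '{"file":"chunk-0001.log","entries":1200}' \
  '{"file":"chunk-0002.log","entries":' > "$INDEX"   # truncated line

bad=0
n=0
while IFS= read -r line; do
  n=$((n + 1))
  # json.tool exits non-zero when its input is not valid JSON.
  if ! printf '%s' "$line" | python3 -m json.tool >/dev/null 2>&1; then
    echo "line $n is not valid JSON: $line"
    bad=$((bad + 1))
  fi
done < "$INDEX"
echo "checked $n lines, $bad malformed"
```

If only the final line is malformed, the capture was likely cut off mid-write; earlier entries are usually intact.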
## Port-Forward Drops / Tunnel Instability

Symptom: When the receiver runs locally and logs are tunneled out of the cluster, the connection between the in-cluster forwarder and your local receiver frequently disconnects or logs stop flowing.

Cause:

- Local network instability: Your machine's network connection is unreliable.
- Kubernetes API server load: The Kubernetes API server is under heavy load, causing `kubectl port-forward` to become unstable.
- Receiver pod restart: The temporary receiver pod in the cluster (used for the tunnel) is restarting due to resource limits, errors, or Kubernetes eviction policies.
- Idle timeouts: Some network infrastructure or VPNs may aggressively close idle connections, even if traffic is low.

Solution:

- Local network stability:
  - Ensure your local machine has a stable network connection.
  - Avoid actions that might disrupt connectivity (e.g., VPN changes).
- Kubernetes API server load:
  - Monitor the Kubernetes API server's health and resource usage.
  - If API server load is an issue, consider `logtap recv --in-cluster` for more stability.
- Receiver pod restart:
  - Use `kubectl describe pod <receiver-pod-name> -n logtap` to check for restart reasons (OOMKilled, errors).
  - Increase the resource requests/limits for the in-cluster receiver pod if necessary.
- Idle timeouts:
  - If possible, configure your network or VPN to use longer idle timeouts.
  - Consider `logtap recv --in-cluster` as an alternative if the tunnel remains unstable.
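While a flaky tunnel is being diagnosed, a small wrapper that restarts it whenever it exits can keep logs flowing. A runnable sketch: `tunnel_cmd` is a stand-in (it always fails immediately so the loop terminates here), and the retry budget is arbitrary; in practice you would replace it with the real port-forward invocation.

```shell
# Restart the tunnel command whenever it drops, up to a retry budget.
# tunnel_cmd is a stand-in; the real command would be something like:
#   kubectl port-forward -n logtap svc/logtap 9000:9000
tunnel_cmd() { false; }

retries=0
max_retries=3
while [ "$retries" -lt "$max_retries" ]; do
  if tunnel_cmd; then
    break   # clean exit: stop restarting
  fi
  retries=$((retries + 1))
  echo "tunnel dropped; restart $retries/$max_retries"
done
echo "stopped after $retries restarts"
```

A restart loop masks the symptom rather than fixing it; if restarts happen constantly, move to `logtap recv --in-cluster` instead.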
## Redaction Not Working

Symptom: PII or sensitive data is still visible in captured logs despite using the `--redact` flag.

Cause:

- Incorrect flag usage: `--redact` was not used, or specific patterns were not enabled (e.g., `--redact=email` but a credit card number is visible).
- Pattern not matching: The built-in or custom redaction patterns do not correctly match the format of the sensitive data in the logs.
- Custom patterns file format: The `--redact-patterns` YAML file is malformed or its regex patterns are incorrect.
- Redaction pipeline bypass: Logs are entering the system through a path that bypasses the redaction pipeline.

Solution:

- Correct flag usage:
  - Ensure `logtap recv --redact` is used to enable all built-in patterns.
  - If specific patterns are desired, ensure all relevant ones are listed (e.g., `--redact=credit_card,email,jwt`).
- Pattern not matching:
  - Examine the sensitive data's exact format in the raw logs.
  - If using custom patterns, test the regex patterns against sample data using an online regex tester.
  - Built-in patterns are comprehensive but might not cover highly unusual formats.
- Custom patterns file format:
  - Validate your `patterns.yaml` file with a YAML linter.
  - Ensure the `regex` values are valid regular expressions.
- Redaction pipeline bypass:
  - Confirm all log ingestion paths (Loki push API, raw JSON) pass through the redaction stage. This is usually handled internally by `logtap` when `--redact` is active.
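Candidate patterns can also be checked locally with `grep -E` before they go into a patterns file. A sketch; the sample log line and the email regex below are illustrative only, not logtap's built-in rules:

```shell
# Does the candidate redaction regex actually match the offending log line?
# Sample line and pattern are illustrative, not logtap built-ins.
sample='user login: alice@example.com from 10.0.0.5'
pattern='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

if printf '%s\n' "$sample" | grep -Eq "$pattern"; then
  match=yes
else
  match=no
fi
echo "pattern matches sample: $match"
```

If `grep -E` matches but the data still appears unredacted in captures, suspect the patterns file format or a pipeline bypass rather than the regex itself.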
## `logtap tap` Timed Out

Symptom: The `logtap tap` command returns a timeout error after a period of waiting, especially when interacting with Kubernetes.

Cause:

- Unresponsive Kubernetes API server: The Kubernetes API server is slow, overloaded, or unreachable from where `logtap` is running.
- Network latency: High network latency between the `logtap` CLI and the Kubernetes cluster.
- Resource constraints on the `logtap` host: The machine running `logtap` is under heavy load, causing operations to be slow.
- Long-running cluster operations: The Kubernetes cluster itself is busy with other operations (e.g., many rolling updates) that delay `logtap`'s ability to patch resources.

Solution:

- Unresponsive Kubernetes API server:
  - Check the health and load of your Kubernetes API server (`kubectl top node`, `kubectl get --raw=/metrics`).
  - If possible, run `logtap` from a location with better network connectivity to the cluster.
- Increase timeout:
  - Use the `--timeout` flag to give `logtap` more time to complete its operations, e.g., `logtap tap --deployment my-app --target ... --timeout 60s`.
  - This is a workaround; address the root cause of the slow API server if possible.
- Check local machine resources:
  - Ensure the machine running the `logtap` CLI has sufficient CPU and memory.
- Monitor cluster activity:
  - Observe whether other cluster operations are consuming or locking resources that `logtap` needs to modify.
  - Retry the `logtap tap` command after some time.
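When the API server is only intermittently slow, retrying with a growing `--timeout` often succeeds without manual babysitting. A runnable sketch: `tap_cmd` is a stand-in that simulates success on the third attempt, and the 30-second step is arbitrary; the real invocation is shown in the comment.

```shell
# Retry with a growing timeout. tap_cmd simulates "succeeds on attempt 3"
# so the sketch runs anywhere; the real call would be:
#   logtap tap --deployment my-app --target ... --timeout "${timeout_s}s"
tap_cmd() { [ "$1" -ge 3 ]; }

attempt=1
result=failed
while [ "$attempt" -le 3 ]; do
  timeout_s=$((30 * attempt))
  echo "attempt $attempt with --timeout ${timeout_s}s"
  if tap_cmd "$attempt"; then
    result=succeeded
    break
  fi
  attempt=$((attempt + 1))
done
echo "tap $result after $attempt attempt(s)"
```

Treat this as a stopgap, like the `--timeout` flag itself: if every run needs multiple attempts, the API server or network path is the real problem.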