Debugging Docker Container Initialization Failures: A Deep Dive into libnss_files.so.2 Hosts Workaround


Docker container initialization issues, especially ones involving low-level system modifications, often show puzzling symptoms. In this case, plain ls works while ls -l fails silently. The steps that follow give a systematic way to track down this kind of problem.

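The root cause lies in the LD_LIBRARY_PATH modification, which makes glibc load a patched copy of libnss_files.so.2. That library serves the passwd and group databases as well as hosts, so any command that needs a user or group name loads the patched copy too. The perl step also swaps the 10-byte string /etc/hosts for the 19-byte /etc-override/hosts, which changes the size of the binary and can corrupt it. The workaround attempts to override /etc/hosts through these steps: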

RUN mkdir -p -- /lib-override /etc-override && cp /lib/libnss_files.so.2 /lib-override
ADD hosts.template /etc-override/hosts
RUN perl -pi -e 's:/etc/hosts:/etc-override/hosts:g' /lib-override/libnss_files.so.2
ENV LD_LIBRARY_PATH /lib-override
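
Patching a path inside a shared object is only safe when the file keeps its exact size, because offsets stored in the ELF file point at fixed byte positions. The following is a minimal Python sketch of a length-preserving patch, not the author's method. The function name patch_path and the /tmp/hosts target are hypothetical and chosen only because /tmp/hosts has the same length as /etc/hosts.

```python
def patch_path(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new` without changing len(data).

    A shorter replacement is padded with NUL bytes, which C string handling
    treats as the end of the string. A longer replacement is rejected because
    it would shift every byte after it and break offsets stored in the file.
    Caveat: padding truncates longer strings that merely start with `old`.
    """
    if len(new) > len(old):
        raise ValueError(
            f"replacement is {len(new) - len(old)} bytes too long; "
            "the patched file would change size"
        )
    padded = new + b"\x00" * (len(old) - len(new))
    return data.replace(old, padded)


if __name__ == "__main__":
    # Stand-in bytes; a real run would read the copied library from disk.
    blob = b"header\x00/etc/hosts\x00trailer"
    patched = patch_path(blob, b"/etc/hosts", b"/tmp/hosts")  # hypothetical target
    print(len(blob) == len(patched))
```

Calling patch_path with /etc-override/hosts as the replacement raises ValueError, which is the same size mismatch the perl one-liner silently introduces.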

1. Enable Docker Daemon Debug Logging

Start the Docker daemon with debug mode:

dockerd --debug

Or edit /etc/docker/daemon.json:

{
  "debug": true,
  "log-level": "debug"
}
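
If /etc/docker/daemon.json already holds other settings, the debug keys should be merged in rather than overwriting the file. A small Python sketch of that merge is shown here; the helper name enable_debug is hypothetical, and the daemon still has to be restarted or reloaded afterwards for the change to take effect.

```python
import json


def enable_debug(config_text: str) -> str:
    """Return daemon.json content with debug logging switched on.

    Existing keys are preserved; an empty file is treated as an empty object.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    config["debug"] = True
    config["log-level"] = "debug"
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Example input only; "storage-driver" stands in for any existing setting.
    print(enable_debug('{"storage-driver": "overlay2"}'))
```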

2. Using strace for System Call Tracing

For containers that fail silently, strace becomes invaluable:

docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -it your_image strace ls -l
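
strace output gets long quickly, so it helps to filter it down to the file opens that failed. The sketch below is one way to do that in Python, assuming the usual strace line format for open and openat calls; the function name failed_opens is hypothetical.

```python
import re

# Matches strace lines such as:
#   openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
FAILED_OPEN = re.compile(
    r'open(?:at)?\(.*?"(?P<path>[^"]+)".*\)\s+=\s+-1\s+(?P<errno>E[A-Z]+)'
)


def failed_opens(strace_output: str) -> list[tuple[str, str]]:
    """Return (path, errno name) for every open/openat call that returned -1."""
    results = []
    for line in strace_output.splitlines():
        match = FAILED_OPEN.search(line)
        if match:
            results.append((match.group("path"), match.group("errno")))
    return results


if __name__ == "__main__":
    import sys

    # Usage: docker run ... your_image strace ls -l 2>&1 | python this_script.py
    for path, errno_name in failed_opens(sys.stdin.read()):
        print(f"{errno_name}\t{path}")
```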

3. Alternative Approach for libnss_files Debugging

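A cleaner solution avoids LD_LIBRARY_PATH manipulation altogether. Mounts created in a RUN step do not persist into the final image, so a custom hosts file is better supplied when the container starts, for example as a read-only bind mount (custom_hosts is a file in the current directory on the host):

# Instead of modifying libnss_files.so.2
docker run --rm -v "$(pwd)/custom_hosts:/etc/hosts:ro" your_image ls -l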

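The key difference lies in what each command asks glibc to look up:

# plain ls only reads directory entries and needs no NSS lookups
# ls -l triggers NSS lookups for:
# - User names (getpwuid) for the owner column
# - Group names (getgrgid) for the group column
# Both go through libnss_files.so.2 when nsswitch.conf lists "files",
# so the patched library is loaded even though no hostname is resolved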
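
The ownership columns in ls -l come from user and group database lookups, which is where NSS gets involved. A short Python sketch of the same lookups is given here, assuming a Unix system; Python's pwd and grp modules call into the C library's lookup functions, and owner_and_group is a hypothetical helper name.

```python
import grp
import os
import pwd


def owner_and_group(path: str) -> tuple[str, str]:
    """Resolve the owner and group names of `path`, as ls -l does.

    Falls back to the numeric ID when the database has no matching entry.
    """
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return user, group


if __name__ == "__main__":
    print(owner_and_group("."))
```

Running this inside the affected container is a quick way to check whether name lookups fail independently of ls.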

For Kubernetes environments:

apiVersion: v1
kind: Pod
metadata:
  name: host-aliases
spec:
  hostAliases:
  - ip: "127.0.0.1"
    hostnames:
    - "foo.local"
    - "bar.local"

For Docker Compose:

services:
  app:
    extra_hosts:
      - "somehost:162.242.195.82"
      - "otherhost:50.31.209.229"

To confirm your container's NSS configuration:

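# Quote the command so that $(which ls) is expanded inside the container, not on the host
docker run --rm your_image sh -c 'ldd "$(which ls)"'
docker run --rm your_image getent hosts
# NSS modules are loaded at run time, so look for them in the trace rather than in ldd
docker run --rm your_image strace -e trace=openat ls -l 2>&1 | grep -E 'nss|hosts'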
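
Which NSS module handles a lookup is decided by /etc/nsswitch.conf inside the container. The following Python sketch extracts the configured sources for one database from the text of that file; nss_sources is a hypothetical helper, and the parsing covers only the common line format.

```python
import re


def nss_sources(nsswitch_text: str, database: str) -> list[str]:
    """Return the sources listed for `database` (e.g. "hosts", "passwd")."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        if name.strip() == database:
            # Remove action items such as [NOTFOUND=return]
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []


if __name__ == "__main__":
    with open("/etc/nsswitch.conf", encoding="utf-8") as fh:
        text = fh.read()
    for db in ("passwd", "group", "hosts"):
        print(db, nss_sources(text, db))
```

A source named files in the output means lookups for that database go through libnss_files.so.2.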

When a Docker container builds successfully but fails during initialization, it's often related to runtime configuration or environment variables. The specific case here involves a custom /etc/hosts modification through library overrides, which works for simple commands like ls but fails for ls -l.

Here are several ways to investigate such issues:

# 1. Check container logs (even if they appear empty)
docker logs --details CONTAINER_ID

# 2. Run the Docker client with debug output enabled
docker --debug run your_image

# 3. Force interactive mode with shell fallback
# (single quotes keep $? from being expanded by the host shell)
docker run -it --entrypoint=/bin/sh your_image -c 'ls -l || echo "Failed with status $?"'

When standard methods don't reveal the issue:

# 4. Use strace to trace system calls
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  your_image strace ls -l

# 5. Inspect the container's filesystem
docker run --rm -it --entrypoint=/bin/sh your_image
# Inside container:
mount | grep overlay
cat /proc/mounts

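The original approach patches libnss_files.so.2 to read a custom hosts file. While creative, it is fragile: the same library also serves user and group lookups, and editing a binary in place can corrupt it. Docker's built-in host mapping avoids both problems: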

# Better alternative: Use --add-host at runtime
docker run --add-host custom.host:127.0.0.1 your_image

# Or in docker-compose:
extra_hosts:
  - "custom.host:127.0.0.1"
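
When the custom entries already exist in hosts-file format, they can be translated into --add-host arguments instead of being maintained twice. A minimal Python sketch of that conversion follows, assuming IPv4 entries; add_host_flags is a hypothetical helper name.

```python
def add_host_flags(hosts_text: str) -> list[str]:
    """Turn hosts-file lines into a flat list of --add-host arguments."""
    flags: list[str] = []
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line or an address with no hostname
        ip, names = fields[0], fields[1:]
        for name in names:
            flags += ["--add-host", f"{name}:{ip}"]
    return flags


if __name__ == "__main__":
    sample = "127.0.0.1 custom.host  # local override\n"
    print(" ".join(["docker", "run", *add_host_flags(sample), "your_image"]))
```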

Create a debug-friendly image variant:

FROM your_base_image
RUN apt-get update && apt-get install -y \
    strace lsof procps
ENTRYPOINT ["/bin/bash", "-c"]
CMD ["strace -f -o /tmp/debug.log your_command"]

Implement proper logging in your application:

# Python example
import logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[logging.StreamHandler()]
)

# Bash example
exec > >(tee /var/log/startup.log) 2>&1