Troubleshooting Samba Winbind User Resolution Issues in CentOS 6 AD Environments


The Winbind user resolution failures here follow a clear pattern:

  • 5 of 6 CentOS 6 servers resolve all Active Directory users consistently
  • 1 problematic server fails to resolve specific users (both existing and new)
  • The issue persists even after clearing Winbind cache files (winbindd_idmap.tdb and winbindd_cache.tdb)

Before diving deep, run these diagnostic commands to gather more information:

# Check basic Winbind functionality
wbinfo -u | grep problematic_user
wbinfo -i problematic_user

# Verify domain trust
net ads testjoin
net ads info

# Check name resolution
getent passwd problematic_user
getent group 'domain users'
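
To tell whether a failure sits in winbind itself or in the NSS layer above it, the wbinfo and getent checks can be combined. This is a minimal sketch; the function names classify_resolution and check_user are illustrative and not part of Samba.

```shell
#!/bin/bash
# Hypothetical helpers: the names are illustrative, not part of Samba.

# Map the exit codes of `wbinfo -i` and `getent passwd` to the layer
# that is most likely at fault.
classify_resolution() {
    local wb_rc=$1 nss_rc=$2
    if [ "$wb_rc" -eq 0 ] && [ "$nss_rc" -eq 0 ]; then
        echo "ok"            # both layers resolve the user
    elif [ "$wb_rc" -eq 0 ]; then
        echo "nss"           # winbind answers, NSS does not: check nsswitch.conf
    elif [ "$nss_rc" -eq 0 ]; then
        echo "other-source"  # resolved by another NSS source, not by winbind
    else
        echo "winbind"       # winbind itself cannot resolve the user
    fi
}

# Run both lookups for one user and print the verdict.
check_user() {
    local user=$1 wb_rc nss_rc
    wbinfo -i "$user" >/dev/null 2>&1; wb_rc=$?
    getent passwd "$user" >/dev/null 2>&1; nss_rc=$?
    echo "$user: $(classify_resolution "$wb_rc" "$nss_rc")"
}

# Example: check_user problematic_user
```

Running check_user on both a working and the failing server shows quickly whether the two disagree at the winbind level or only at the NSS level.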

The Winbind cache corruption theory makes sense, but since deleting TDB files didn't help, we need to look deeper. Here's how to inspect the cache without service restart:

# Dump cache contents for analysis
tdbdump /var/lib/samba/winbindd_idmap.tdb
tdbdump /var/lib/samba/winbindd_cache.tdb

# Alternative: dump all records with tdbtool and search for the user's SID
tdbtool /var/lib/samba/winbindd_cache.tdb dump | grep SID_of_user

When standard cache clearing fails, consider these advanced steps:

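  1. Idmap Backup and Cache Flush:
    # Save current mappings first; restore later with: net idmap restore /root/idmap.dump
    # (a restore only matters for the tdb backend; the ad backend reads IDs from AD)
    net idmap dump /var/lib/samba/winbindd_idmap.tdb > /root/idmap.dump
    net cache flush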
  2. SELinux Context Check:
    ls -lZ /var/lib/samba/winbindd_*
    restorecon -Rv /var/lib/samba
  3. Winbind Socket Analysis:
    lsof -U | grep winbind
    ss -xlnp | grep winbind

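For the intermittent new user resolution issue, check these configuration aspects. With the ad backend, a user resolves only if its uidNumber in AD falls inside the configured range, so a range that differs on one server can hide specific users there: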

# In smb.conf verify these parameters:
idmap config DOMAIN : backend = ad
idmap config DOMAIN : range = 10000-999999
winbind enum users = yes
winbind refresh tickets = yes
winbind offline logon = no
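
Because the ad backend ignores users whose uidNumber lies outside the configured range, it can help to test a failing user's uidNumber against the range on the problematic server. The sketch below assumes you already know the uidNumber (for example from the AD attribute editor); the helper name uid_in_range is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a uidNumber lies inside an idmap range
# written the same way as in smb.conf, e.g. "10000-999999".
uid_in_range() {
    local uid_number=$1
    local low=${2%-*}    # text before the dash
    local high=${2#*-}   # text after the dash
    [ "$uid_number" -ge "$low" ] && [ "$uid_number" -le "$high" ]
}

# Example:
# if uid_in_range 5021 "10000-999999"; then echo "mappable"; else echo "outside range"; fi
```

A user created with a uidNumber below the lower bound, or with no uidNumber at all, would explain why only some accounts fail.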

Create a test script to automate user resolution checks:

#!/bin/bash
TEST_USERS=("user1" "user2" "new.user" "problematic.user")

for user in "${TEST_USERS[@]}"; do
    if getent passwd "$user" >/dev/null; then
        echo "[SUCCESS] $user resolved successfully"
    else
        echo "[FAILURE] $user resolution failed"
        wbinfo -i "$user" || echo "wbinfo failed for $user"
    fi
done

If all else fails, this complete reset procedure often works (requires service restart):

# CentOS 6 uses SysV init scripts, so use service rather than systemctl
service winbind stop
service smb stop
rm -f /var/lib/samba/winbindd_*
net cache flush
net ads join -U adminuser
service smb start
service winbind start
wbinfo -p
wbinfo -t

When winbind suddenly stops resolving specific Active Directory users while working perfectly for others, it creates one of the most frustrating scenarios for Linux-AD integration. From your description, we're dealing with:

  • Intermittent failures (some users resolve, others don't)
  • Server-specific behavior (one server fails while others work)
  • Persistence after cache clearing

Start by checking winbind's view of the domain with its own diagnostic tools:

# Show the sequence number winbind holds for each domain
wbinfo --sequence

# Confirm winbind considers the domain online
wbinfo --online-status

# Test resolution for problematic user
wbinfo --name-to-sid 'DOMAIN\problem_user'

During change freezes, try these non-restart methods:

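# net cache flush takes no arguments and clears the whole generic cache.
# To target one user, first find cached entries that mention the name
net cache search '*problem_user*'

# Then delete a single entry, using a key printed by the search
net cache del 'KEY_FROM_SEARCH'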

Since clearing tdb files didn't resolve the issue, let's examine:

  • idmap configuration: Verify consistency across servers
  • Time synchronization: Kerberos is time-sensitive
  • DNS resolution: Check DC availability

# Check time sync status
ntpstat

# Verify DC connectivity
nslookup yourdomaincontroller.domain.com

# Compare idmap settings
testparm -s | grep idmap
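
Comparing idmap settings by eye across six servers is error-prone, so the comparison can be scripted. This is a sketch that assumes you have saved the output of testparm -s from a working and the failing server into two files; the function names and file paths are illustrative.

```shell
#!/bin/bash
# Hypothetical helpers for comparing saved `testparm -s` output.
# On each server, for example: testparm -s 2>/dev/null > /tmp/$(hostname -s).conf

# Keep only identity-related settings, normalise whitespace, and sort.
extract_identity_settings() {
    grep -iE '^[[:space:]]*(idmap|winbind|realm|workgroup|security|password server)' "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*=[[:space:]]*/ = /' \
        | sort
}

# Print a diff of the two files' identity settings.
# Returns 0 if they match and 1 if they differ, like diff itself.
compare_identity_settings() {
    local tmp_a tmp_b rc
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    extract_identity_settings "$1" > "$tmp_a"
    extract_identity_settings "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$rc"
}

# Example: compare_identity_settings /tmp/goodserver.conf /tmp/badserver.conf
```

Any line in the diff, especially a different idmap range or backend, is a strong candidate for the server-specific behavior.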

Enable detailed logging to catch resolution failures:

# Add to smb.conf's [global] section
log level = 3 winbind:5

# Or raise winbindd's debug level at runtime, without a restart
smbcontrol winbindd debug 5

Monitor real-time events while testing:

tail -f /var/log/samba/log.winbindd | grep 'problem_user'
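
Once a test run has produced log output, a summary of the NT_STATUS codes often points at the failure class faster than reading the raw log. This is a small sketch; the helper name is hypothetical and the log path in the example is an assumption that depends on your "log file" setting.

```shell
#!/bin/bash
# Hypothetical helper: count each NT_STATUS_* code found in a log file,
# most frequent first.
summarize_nt_status() {
    grep -oE 'NT_STATUS_[A-Z0-9_]+' "$1" | sort | uniq -c | sort -rn
}

# Example (path is an assumption; adjust to your configuration):
# summarize_nt_status /var/log/samba/log.winbindd
```

A dominant NT_STATUS_NO_SUCH_USER suggests the DC did not know the account, while timeout or logon-server codes point toward connectivity.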

The random nature of the failures suggests one of the following:

  • Multiple domain controllers with inconsistent replication
  • Network partitioning issues
  • DNS round-robin problems
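
To test the replication theory, the same account can be looked up on every DC that DNS advertises. The sketch below makes several assumptions: the host utility is installed, the server is joined so that net ads search can authenticate with the machine account (-P), and matching attributes are printed as "name: value" lines. All function names are illustrative.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative.

# Turn `host -t SRV` output into a sorted list of DC host names.
parse_srv_hosts() {
    awk '/has SRV record/ { sub(/\.$/, "", $NF); print $NF }' | sort -u
}

# List the DCs that DNS advertises for an AD DNS domain.
list_dcs() {
    host -t SRV "_ldap._tcp.dc._msdcs.$1" | parse_srv_hosts
}

# Ask each DC in turn whether it knows the account.
# Assumption: the output contains a "sAMAccountName: ..." line on a hit.
check_user_on_each_dc() {
    local domain=$1 user=$2 dc
    for dc in $(list_dcs "$domain"); do
        if net ads search -P -S "$dc" "(sAMAccountName=$user)" sAMAccountName 2>/dev/null \
            | grep -q '^sAMAccountName:'; then
            echo "$dc: found"
        else
            echo "$dc: NOT found"
        fi
    done
}

# Example: check_user_on_each_dc domain.com problem_user
```

If one DC reports the user as missing while the others find it, the problematic server is probably bound to that DC.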

Force winbind to use specific DCs:

# In smb.conf
password server = your-primary-dc
winbind reconnect delay = 30

When all else fails, rebuild the trust relationship:

# Remove server from domain
net ads leave -U adminuser

# Rejoin domain
net ads join -U adminuser

Remember to verify keytab consistency after rejoining:

klist -kte