test: fix fd leak in the Samba DLM handoff test (promote xfail checks) (#9592)

test(mount): fix fd leak that deadlocked the DLM handoff check

The cross-mount handoff checks held a file open on mount 2 via fd 9 to
keep the distributed lock, then started the SMB writer in a background
subshell. The subshell inherited fd 9, so the SMB writer kept the file
open and waited on a lock held by its own descriptor; the put could
never complete, and the two checks were parked as expected-fail.

Close fd 9 in the subshell (9>&-) so the writer does not hold the file.
The waiter now acquires the freed lock within ~1s, so the two checks are
promoted to hard assertions and the xfail machinery is removed.
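
The inheritance behaviour behind this fix can be reproduced without any mounts. The following is a minimal sketch, not part of the commit: it uses a temporary file instead of a FUSE mount, so it shows only that a subshell inherits fd 9 unless `9>&-` is applied, not the DLM lock itself. The names `tmp`, `leaked`, and `closed` are hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative only: a temp file stands in for the file held open on mount 2.
tmp="$(mktemp)"
exec 9>"${tmp}"   # the parent shell opens fd 9, as the test does

# Without 9>&-, the subshell inherits fd 9 and can still write to it.
( echo inherited >&9 ) 2>/dev/null && leaked=yes || leaked=no

# With 9>&-, fd 9 is closed inside the subshell, so the same write fails.
( echo inherited >&9 ) 9>&- 2>/dev/null && closed=no || closed=yes

exec 9>&-         # close the parent's descriptor and clean up
rm -f "${tmp}"
echo "leaked=${leaked} closed=${closed}"
```

Under bash, both variables should come out as `yes`: the first subshell still holds the descriptor, the second does not. In the test, that second form is what keeps the SMB writer from holding the file open through its own inherited descriptor.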
Chris Lu
2026-05-20 16:17:13 -07:00
committed by GitHub
parent a5d0e4a735
commit afcc491517
2 changed files with 16 additions and 55 deletions
+2 -26
@@ -40,32 +40,8 @@ the same data on each.
 > file at a time) and guarantees writes are not torn. It does not guarantee
 > which concurrent writer wins or instant cross-mount read convergence — the
 > holder's buffered data is flushed on close, asynchronously to lock release.
-### Known issue: DLM handoff stalls under same-file contention
-The two handoff checks in test 2 (`blocked SMB write succeeds after the other
-mount releases` and `post-release content is the SMB writer's payload`) are
-marked **expected-fail** (xfail) — they pin a remaining DLM liveness bug without
-failing CI. If the handoff is fixed they flip to `[XPASS]` and turn the suite
-red, a reminder to promote them to hard assertions.
-When two mounts contend for the *same* file, the lock handoff does not complete
-in a reasonable time because the holder releases the distributed lock only on
-the FUSE `Release` op, which the kernel delays by tens of seconds after
-`close()` (vs ~12 ms uncontended). The waiting writer's client gives up before
-the lock frees. This is a **liveness/latency** problem, not data corruption —
-the lock stays over-conservative, so no torn writes occur.
-Two contributing causes have been fixed in the lock client (`weed/cluster/lock_client.go`):
-- the waiter no longer polls with `util.RetryUntil`'s growing backoff; it polls
-at a steady cadence so a freed lock is picked up promptly, and
-- `Stop()` no longer races the renewal goroutine, which previously could send a
-stale unlock token and leave the lock lingering as "owned" at the filer.
-The remaining cause — the holder-side release waiting on FUSE `Release` — needs
-the lock released promptly on flush/close (with care for the multi-fd case), and
-is left as a follow-up.
+> When a holder closes the file, a writer on another mount acquires the freed
+> lock within ~1s and completes.
 ## Layout
+14 -29
@@ -31,14 +31,8 @@ trap 'rm -rf "${WORK}"' EXIT
 PASS=0
 FAIL=0
-XFAIL=0
-XPASS=0
 pass() { printf ' [PASS] %s\n' "$1"; PASS=$((PASS + 1)); }
 fail() { printf ' [FAIL] %s\n' "$1"; FAIL=$((FAIL + 1)); }
-# Expected failure: a known-broken behavior. [XFAIL] does not fail the suite;
-# an unexpected pass ([XPASS]) does, so the check gets promoted once it's fixed.
-xfail() { printf ' [XFAIL] %s\n' "$1"; XFAIL=$((XFAIL + 1)); }
-xpass() { printf ' [XPASS] %s\n' "$1"; XPASS=$((XPASS + 1)); }
 smb() {
 smbclient "//${SMB_HOST}/${SMB_SHARE}" -p "${SMB_PORT}" \
@@ -112,16 +106,9 @@ FAIL=$((FAIL + $(grep -c '\[FAIL\]' <<<"${fcntl_out}")))
 # 2. Distributed lock blocks a cross-mount writer, then hands it off ----------
 # mount 2 holds a file open for writing (holding the DLM lock on its path).
 # An SMB put of the same file goes through mount 1 and must (a) block while
-# mount 2 holds it and (b) actually SUCCEED once mount 2 releases, leaving the
-# SMB writer's payload on disk. smbclient gets a long client timeout (-t) so we
-# are testing the lock handoff itself, not smbclient's own ~20s default timeout.
-#
-# KNOWN ISSUE (expected failure): the (b) handoff checks are marked xfail. The
-# holder releases the distributed lock only on FUSE Release, which the kernel
-# delays for tens of seconds under this contention, so the waiting writer does
-# not acquire in time. This is a DLM liveness bug, not data corruption. When the
-# holder-side release is fixed these flip to [XPASS] and fail the suite, a
-# reminder to promote them to hard assertions.
+# mount 2 holds it and (b) succeed once mount 2 releases, leaving the SMB
+# writer's payload on disk. smbclient gets a long client timeout (-t) so we are
+# testing the lock handoff itself, not smbclient's own ~20s default timeout.
 echo "==> 2. distributed lock: cross-mount write coordination"
 dlmfile="dlm_coord.bin"
 newdata="${WORK}/dlm_new.bin"
@@ -131,14 +118,17 @@ head -c 4096 /dev/urandom >"${newdata}"
 exec 9>"${MOUNT2_SHARE}/${dlmfile}"
 printf 'held-by-mount2' >&9
-# Start the SMB write; record its real exit code when it returns.
+# Start the SMB write; record its real exit code when it returns. The subshell
+# must NOT inherit fd 9 (9>&-): otherwise the SMB writer keeps the file open and
+# waits on a DLM lock held by its own inherited descriptor, deadlocking the
+# handoff this test is meant to exercise.
 rm -f "${WORK}/dlm_put.rc"
 (
 smbclient "//${SMB_HOST}/${SMB_SHARE}" -p "${SMB_PORT}" \
 -U "${SMB_USER}%${SMB_PASS}" -m SMB3 -t 120 \
 -c "put ${newdata} ${dlmfile}" >/dev/null 2>&1
 echo "$?" >"${WORK}/dlm_put.rc"
-) &
+) 9>&- &
 smb_bg=$!
 sleep 4
@@ -164,21 +154,19 @@ done
 kill "${smb_bg}" 2>/dev/null
 wait "${smb_bg}" 2>/dev/null
-# xfail: the handoff stalls because the lock is freed only on the delayed FUSE
-# Release. A pass here means the holder-side release was fixed.
 if [[ "${put_rc}" == "0" ]]; then
-xpass "blocked SMB write succeeds after the other mount releases"
+pass "blocked SMB write succeeds after the other mount releases"
 else
-xfail "blocked SMB write succeeds after the other mount releases (rc=${put_rc})"
+fail "blocked SMB write succeeds after the other mount releases (rc=${put_rc})"
 fi
 # A correct handoff leaves the SMB writer's payload on disk: mount 1 acquired
 # the lock and wrote after mount 2 released.
 got="${WORK}/dlm_got.bin"
 if smb "get ${dlmfile} ${got}" >/dev/null 2>&1 && [[ "$(md5 "${got}")" == "$(md5 "${newdata}")" ]]; then
-xpass "post-release content is the SMB writer's payload (correct handoff)"
+pass "post-release content is the SMB writer's payload (correct handoff)"
 else
-xfail "post-release content is the SMB writer's payload (correct handoff)"
+fail "post-release content is the SMB writer's payload (correct handoff)"
 fi
 # 3. Distributed lock integrity: concurrent writers, same file ---------------
@@ -226,8 +214,5 @@ else
 fi
 echo
-echo "==> Summary: ${PASS} passed, ${FAIL} failed, ${XFAIL} expected-fail"
-if [[ "${XPASS}" -gt 0 ]]; then
-echo "==> ${XPASS} check(s) unexpectedly passed - the DLM handoff appears fixed; promote them from xfail to assertions"
-fi
-[[ "${FAIL}" -eq 0 && "${XPASS}" -eq 0 ]]
+echo "==> Summary: ${PASS} passed, ${FAIL} failed"
+[[ "${FAIL}" -eq 0 ]]