mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
test: fix fd leak in the Samba DLM handoff test (promote xfail checks) (#9592)
test(mount): fix fd leak that deadlocked the DLM handoff check The cross-mount handoff checks held a file open on mount 2 via fd 9 to keep the distributed lock, then started the SMB writer in a background subshell. The subshell inherited fd 9, so the SMB writer kept the file open and waited on a lock held by its own descriptor; the put could never complete, and the two checks were parked as expected-fail. Close fd 9 in the subshell (9>&-) so the writer does not hold the file. The waiter now acquires the freed lock within ~1s, so the two checks are real assertions and the xfail machinery is gone.
This commit is contained in:
+2
-26
@@ -40,32 +40,8 @@ the same data on each.
|
||||
> file at a time) and guarantees writes are not torn. It does not guarantee
|
||||
> which concurrent writer wins or instant cross-mount read convergence — the
|
||||
> holder's buffered data is flushed on close, asynchronously to lock release.
|
||||
|
||||
### Known issue: DLM handoff stalls under same-file contention
|
||||
|
||||
The two handoff checks in test 2 (`blocked SMB write succeeds after the other
|
||||
mount releases` and `post-release content is the SMB writer's payload`) are
|
||||
marked **expected-fail** (xfail) — they pin a remaining DLM liveness bug without
|
||||
failing CI. If the handoff is fixed they flip to `[XPASS]` and turn the suite
|
||||
red, a reminder to promote them to hard assertions.
|
||||
|
||||
When two mounts contend for the *same* file, the lock handoff does not complete
|
||||
in a reasonable time because the holder releases the distributed lock only on
|
||||
the FUSE `Release` op, which the kernel delays by tens of seconds after
|
||||
`close()` (vs ~12 ms uncontended). The waiting writer's client gives up before
|
||||
the lock frees. This is a **liveness/latency** problem, not data corruption —
|
||||
the lock stays over-conservative, so no torn writes occur.
|
||||
|
||||
Two contributing causes have been fixed in the lock client (`weed/cluster/lock_client.go`):
|
||||
|
||||
- the waiter no longer polls with `util.RetryUntil`'s growing backoff; it polls
|
||||
at a steady cadence so a freed lock is picked up promptly, and
|
||||
- `Stop()` no longer races the renewal goroutine, which previously could send a
|
||||
stale unlock token and leave the lock lingering as "owned" at the filer.
|
||||
|
||||
The remaining cause — the holder-side release waiting on FUSE `Release` — needs
|
||||
the lock released promptly on flush/close (with care for the multi-fd case), and
|
||||
is left as a follow-up.
|
||||
> When a holder closes the file, a writer on another mount acquires the freed
|
||||
> lock within ~1s and completes.
|
||||
|
||||
## Layout
|
||||
|
||||
|
||||
+14
-29
@@ -31,14 +31,8 @@ trap 'rm -rf "${WORK}"' EXIT
|
||||
|
||||
PASS=0
|
||||
FAIL=0
|
||||
XFAIL=0
|
||||
XPASS=0
|
||||
pass() { printf ' [PASS] %s\n' "$1"; PASS=$((PASS + 1)); }
|
||||
fail() { printf ' [FAIL] %s\n' "$1"; FAIL=$((FAIL + 1)); }
|
||||
# Expected failure: a known-broken behavior. [XFAIL] does not fail the suite;
|
||||
# an unexpected pass ([XPASS]) does, so the check gets promoted once it's fixed.
|
||||
xfail() { printf ' [XFAIL] %s\n' "$1"; XFAIL=$((XFAIL + 1)); }
|
||||
xpass() { printf ' [XPASS] %s\n' "$1"; XPASS=$((XPASS + 1)); }
|
||||
|
||||
smb() {
|
||||
smbclient "//${SMB_HOST}/${SMB_SHARE}" -p "${SMB_PORT}" \
|
||||
@@ -112,16 +106,9 @@ FAIL=$((FAIL + $(grep -c '\[FAIL\]' <<<"${fcntl_out}")))
|
||||
# 2. Distributed lock blocks a cross-mount writer, then hands it off ----------
|
||||
# mount 2 holds a file open for writing (holding the DLM lock on its path).
|
||||
# An SMB put of the same file goes through mount 1 and must (a) block while
|
||||
# mount 2 holds it and (b) actually SUCCEED once mount 2 releases, leaving the
|
||||
# SMB writer's payload on disk. smbclient gets a long client timeout (-t) so we
|
||||
# are testing the lock handoff itself, not smbclient's own ~20s default timeout.
|
||||
#
|
||||
# KNOWN ISSUE (expected failure): the (b) handoff checks are marked xfail. The
|
||||
# holder releases the distributed lock only on FUSE Release, which the kernel
|
||||
# delays for tens of seconds under this contention, so the waiting writer does
|
||||
# not acquire in time. This is a DLM liveness bug, not data corruption. When the
|
||||
# holder-side release is fixed these flip to [XPASS] and fail the suite, a
|
||||
# reminder to promote them to hard assertions.
|
||||
# mount 2 holds it and (b) succeed once mount 2 releases, leaving the SMB
|
||||
# writer's payload on disk. smbclient gets a long client timeout (-t) so we are
|
||||
# testing the lock handoff itself, not smbclient's own ~20s default timeout.
|
||||
echo "==> 2. distributed lock: cross-mount write coordination"
|
||||
dlmfile="dlm_coord.bin"
|
||||
newdata="${WORK}/dlm_new.bin"
|
||||
@@ -131,14 +118,17 @@ head -c 4096 /dev/urandom >"${newdata}"
|
||||
exec 9>"${MOUNT2_SHARE}/${dlmfile}"
|
||||
printf 'held-by-mount2' >&9
|
||||
|
||||
# Start the SMB write; record its real exit code when it returns.
|
||||
# Start the SMB write; record its real exit code when it returns. The subshell
|
||||
# must NOT inherit fd 9 (9>&-): otherwise the SMB writer keeps the file open and
|
||||
# waits on a DLM lock held by its own inherited descriptor, deadlocking the
|
||||
# handoff this test is meant to exercise.
|
||||
rm -f "${WORK}/dlm_put.rc"
|
||||
(
|
||||
smbclient "//${SMB_HOST}/${SMB_SHARE}" -p "${SMB_PORT}" \
|
||||
-U "${SMB_USER}%${SMB_PASS}" -m SMB3 -t 120 \
|
||||
-c "put ${newdata} ${dlmfile}" >/dev/null 2>&1
|
||||
echo "$?" >"${WORK}/dlm_put.rc"
|
||||
) &
|
||||
) 9>&- &
|
||||
smb_bg=$!
|
||||
|
||||
sleep 4
|
||||
@@ -164,21 +154,19 @@ done
|
||||
kill "${smb_bg}" 2>/dev/null
|
||||
wait "${smb_bg}" 2>/dev/null
|
||||
|
||||
# xfail: the handoff stalls because the lock is freed only on the delayed FUSE
|
||||
# Release. A pass here means the holder-side release was fixed.
|
||||
if [[ "${put_rc}" == "0" ]]; then
|
||||
xpass "blocked SMB write succeeds after the other mount releases"
|
||||
pass "blocked SMB write succeeds after the other mount releases"
|
||||
else
|
||||
xfail "blocked SMB write succeeds after the other mount releases (rc=${put_rc})"
|
||||
fail "blocked SMB write succeeds after the other mount releases (rc=${put_rc})"
|
||||
fi
|
||||
|
||||
# A correct handoff leaves the SMB writer's payload on disk: mount 1 acquired
|
||||
# the lock and wrote after mount 2 released.
|
||||
got="${WORK}/dlm_got.bin"
|
||||
if smb "get ${dlmfile} ${got}" >/dev/null 2>&1 && [[ "$(md5 "${got}")" == "$(md5 "${newdata}")" ]]; then
|
||||
xpass "post-release content is the SMB writer's payload (correct handoff)"
|
||||
pass "post-release content is the SMB writer's payload (correct handoff)"
|
||||
else
|
||||
xfail "post-release content is the SMB writer's payload (correct handoff)"
|
||||
fail "post-release content is the SMB writer's payload (correct handoff)"
|
||||
fi
|
||||
|
||||
# 3. Distributed lock integrity: concurrent writers, same file ---------------
|
||||
@@ -226,8 +214,5 @@ else
|
||||
fi
|
||||
|
||||
echo
|
||||
echo "==> Summary: ${PASS} passed, ${FAIL} failed, ${XFAIL} expected-fail"
|
||||
if [[ "${XPASS}" -gt 0 ]]; then
|
||||
echo "==> ${XPASS} check(s) unexpectedly passed - the DLM handoff appears fixed; promote them from xfail to assertions"
|
||||
fi
|
||||
[[ "${FAIL}" -eq 0 && "${XPASS}" -eq 0 ]]
|
||||
echo "==> Summary: ${PASS} passed, ${FAIL} failed"
|
||||
[[ "${FAIL}" -eq 0 ]]
|
||||
|
||||
Reference in New Issue
Block a user