mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
e3e02d3364
* [CheckDisk][GRPC]: implement MVP for disk health detection, added timeout for new grpc connections * fix(volume): build disk health check on every platform setDiskStatus only existed behind the statfs build tag, so disk.go failed to compile on windows, openbsd, solaris, netbsd and plan9. Move the timeout wrapper and failure tracking into the shared disk.go and have each platform's fillInDiskStatus return an error, so every platform gets the same protection from a stuck filesystem. Also restore the uint64(fs.Bavail) cast: Bavail is int64 on freebsd, so the unguarded multiply broke the freebsd build. * fix(volume): keep one outstanding statfs probe per disk A stuck statfs used to leave isChecking cleared by the timeout path, so the next check spawned another goroutine while the previous one was still blocked in the syscall, leaking one goroutine per minute on a hung disk. Clear the flag only when statfs returns and treat an overlapping check as a failure, so a hung filesystem keeps a single outstanding probe and still gets reported. * fix(volume): assume disk available until the first health check isDiskAvailable defaulted to false, and CollectHeartbeat skips locations that are not available. A freshly started volume server would therefore omit every volume from its first heartbeats until the async CheckDiskSpace ran, so the master could briefly treat all of them as missing. * fix(volume): label the disk error metric by data directory The new gauge tagged the series with IdxDirectory while every neighbouring resource gauge uses Directory, so the error series would not line up with them in dashboards. Also log the underlying error instead of a generic message. * test(volume): cover disk health success and repeated-failure paths * fix(volume): make a healthy disk the zero-value default Track the disk as isDiskUnavailable instead of isDiskAvailable so the safe state is the zero value, matching isDiskSpaceLow. CollectHeartbeat only skips a location once a check has actively marked it unavailable, so any DiskLocation built without running CheckDiskSpace (tests, future call sites) still reports its volumes instead of silently dropping them. * feat(disk): detect degraded disks using IO latency probes * feat(stats): introduce configurable disk I/O health probe with EWMA-based latency detection * feat(disk): replace EWMA with sliding window algorithm for disk health detection and added user-friendly options * feat(disk): improve disk health probing and recovery * feat(volume): configure disk health checks via volume.toml * fix(volume): Remove disk IO probe CLI options --------- Co-authored-by: ptukha <ptukha@tochka.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>