seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-13 23:36:45 +03:00

Aleksey e3e02d3364 [CheckDisk]: implement disk health detection (#9560)

* [CheckDisk][GRPC]: implement MVP for disk health detection, add timeout for new gRPC connections

* fix(volume): build disk health check on every platform

setDiskStatus only existed behind the statfs build tag, so disk.go failed
to compile on windows, openbsd, solaris, netbsd and plan9. Move the timeout
wrapper and failure tracking into the shared disk.go and have each platform's
fillInDiskStatus return an error, so every platform gets the same protection
from a stuck filesystem.

Also restore the uint64(fs.Bavail) cast: Bavail is int64 on freebsd, so the
unguarded multiply broke the freebsd build.
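
The cast issue above can be illustrated with a minimal sketch. The statfsResult type is a hypothetical stand-in rather than the real platform struct: its signed Bavail follows the description above, while the unsigned Bsize is an assumption made only to reproduce the type mismatch.

```go
package main

import "fmt"

// statfsResult is a hypothetical stand-in for a platform statfs struct.
// Bavail is signed, as described for freebsd; the uint64 Bsize is an
// assumption for illustration.
type statfsResult struct {
	Bavail int64
	Bsize  uint64
}

// freeBytes multiplies available blocks by block size.
// Without the uint64 cast, int64 * uint64 does not compile in Go
// because the operand types differ.
func freeBytes(fs statfsResult) uint64 {
	return uint64(fs.Bavail) * fs.Bsize
}

func main() {
	fmt.Println(freeBytes(statfsResult{Bavail: 1000, Bsize: 4096})) // prints 4096000
}
```

Note that converting a negative int64 to uint64 wraps to a very large value, so code that may see a negative count might want to clamp it first.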

* fix(volume): keep one outstanding statfs probe per disk

A stuck statfs used to leave isChecking cleared by the timeout path, so the
next check spawned another goroutine while the previous one was still blocked
in the syscall, leaking one goroutine per minute on a hung disk. Clear the
flag only when statfs returns and treat an overlapping check as a failure, so
a hung filesystem keeps a single outstanding probe and still gets reported.

* fix(volume): assume disk available until the first health check

isDiskAvailable defaulted to false, and CollectHeartbeat skips locations that
are not available. A freshly started volume server would therefore omit every
volume from its first heartbeats until the async CheckDiskSpace ran, so the
master could briefly treat all of them as missing.

* fix(volume): label the disk error metric by data directory

The new gauge tagged the series with IdxDirectory while every neighbouring
resource gauge uses Directory, so the error series would not line up with them
in dashboards. Also log the underlying error instead of a generic message.

* test(volume): cover disk health success and repeated-failure paths

* fix(volume): make a healthy disk the zero-value default

Track the disk as isDiskUnavailable instead of isDiskAvailable so the safe
state is the zero value, matching isDiskSpaceLow. CollectHeartbeat only skips a
location once a check has actively marked it unavailable, so any DiskLocation
built without running CheckDiskSpace (tests, future call sites) still reports
its volumes instead of silently dropping them.
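
A minimal sketch of why the negated flag is the safer default, assuming a trimmed-down location type and a collectVolumes helper that are both hypothetical and only mimic the heartbeat behaviour described above:

```go
package main

import "fmt"

// location is a hypothetical, trimmed-down stand-in for a DiskLocation.
type location struct {
	dir               string
	volumes           []string
	isDiskUnavailable bool // zero value (false): healthy or not yet checked
}

// collectVolumes mimics a heartbeat: it skips only locations that a
// health check has actively marked unavailable.
func collectVolumes(locs []*location) []string {
	var out []string
	for _, l := range locs {
		if l.isDiskUnavailable {
			continue
		}
		out = append(out, l.volumes...)
	}
	return out
}

func main() {
	// fresh was never health-checked, so its flag is still the zero value.
	fresh := &location{dir: "/data1", volumes: []string{"1", "2"}}
	failed := &location{dir: "/data2", volumes: []string{"3"}, isDiskUnavailable: true}

	fmt.Println(collectVolumes([]*location{fresh, failed})) // prints [1 2]
}
```

With a positive isDiskAvailable flag, the same unchecked location would start as false and its volumes would be dropped until the first check completed.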

* feat(disk): detect degraded disks using IO latency probes

* feat(stats): introduce configurable disk I/O health probe with EWMA-based latency detection

* feat(disk): replace EWMA with sliding window algorithm for disk health detection and add user-friendly options

* feat(disk): improve disk health probing and recovery

* feat(volume): configure disk health checks via volume.toml

* fix(volume): remove disk IO probe CLI options

---------

Co-authored-by: ptukha <ptukha@tochka.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>

2026-06-02 09:02:05 -07:00

credential.toml

migrate IAM policies to multi-file storage (#8114)

2026-01-26 11:28:23 -08:00

example.go

[CheckDisk]: implement disk health detection (#9560)

2026-06-02 09:02:05 -07:00

filer.toml

filer(mysql): TLS hostname/SNI knobs + MariaDB upsert documentation (#9260)

2026-04-28 01:29:41 -07:00

master.toml

Disable master maintenance scripts when admin server runs (#8499)