mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
9ae905e456
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin

S3 and filer already use a refreshing pemfile provider for their HTTPS cert, so rotated certificates (e.g. from k8s cert-manager) are picked up without a restart. Master, volume, webdav, and admin, however, passed cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at startup; rotating those certs required a pod restart.

Add a small helper NewReloadingServerCertificate in weed/security that wraps pemfile.Provider and returns a tls.Config.GetCertificate closure, then wire it into the four remaining HTTPS entry points. httpdown now also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates but CertFile/KeyFile are empty, so volume server can pre-populate TLSConfig.

A unit test exercises the rotation path (write cert, rotate on disk, assert the callback returns the new cert) with a short refresh window.

* refactor(security): route filer/s3 HTTPS through the shared cert reloader

Before: filer.go and s3.go each kept a *certprovider.Provider on the options struct plus a duplicated GetCertificateWithUpdate method. Both were loading pemfile themselves. Behaviorally they already reloaded, but the logic was duplicated two ways and neither path was shared with the newly-added master/volume/webdav/admin wiring.

After: both use security.NewReloadingServerCertificate like the other servers. The per-struct certProvider field and GetCertificateWithUpdate method are removed, along with the now-unused certprovider and pemfile imports.

Net: -32 lines, one code path for all HTTPS cert reloading. No behavior change: the refresh window, cache, and handshake contract are identical (the helper wraps the same pemfile.NewProvider).

* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc

The HTTP client in weed/util/http/client loaded the mTLS client cert once at startup via tls.LoadX509KeyPair. That left every long-lived HTTPS client process (weed mount, backup, filer.copy, filer→volume, s3→filer/volume) unable to pick up a rotated client cert without a restart, even though the same cert-manager setup was already rotating the server side fine.

Swap the client cert loader for a tls.Config.GetClientCertificate callback backed by the same refreshing pemfile provider. New TLS handshakes pick up the rotated cert; in-flight pooled connections keep their old cert and drop as normal transport churn happens.

To keep this reusable from both server and client TLS code without an import cycle (weed/security already imports weed/util/http/client for LoadHTTPClientFromFile), extract the pemfile wrapper into a new weed/security/certreload subpackage. weed/security keeps its thin NewReloadingServerCertificate wrapper. The existing unit test moves with the implementation.

gRPC mTLS was already handled by security.LoadServerTLS / LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ agent, Kafka gateway, and FUSE mount control plane are gRPC-only and therefore already rotate.

CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once; this is noted as a known limitation in the wiki.

* fix(security): address PR review feedback on cert reloader

Bots (gemini-code-assist + coderabbit) flagged three real issues and a couple of nits. Addressing them here:

1. KeyMaterial used context.Background(). The grpc pemfile provider's KeyMaterial blocks until material arrives or the context deadline expires; with Background() a slow disk could hang the TLS handshake indefinitely. Switched both the server and client callbacks to use hello.Context() / cri.Context() so a stuck read is bounded by the handshake timeout.

2. Admin server loaded TLS inside the serve goroutine. If the cert was bad, the goroutine returned but startAdminServer kept blocking on <-ctx.Done() with no listener, making the process look healthy with nothing bound. Moved TLS setup to run before the goroutine starts and propagate errors via fmt.Errorf; also captures the provider and defers Close().

3. HTTP client discarded the certprovider.Provider from NewClientGetCertificate. That leaked the refresh goroutine, and NewHttpClientWithTLS had a worse case where a CA-file failure after provider creation orphaned the provider entirely. Added a certProvider field and a Close() method on HTTPClient, and made the constructors close the provider on subsequent error paths.

4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain the provider. filer and webdav run ServeTLS synchronously, so a plain defer works. master/volume/s3 dispatch goroutines and return while the server keeps running, so they hook Close() into grace.OnInterrupt.

5. Test: certreload_test now tolerates transient read/parse errors during file rotation (writeSelfSigned rewrites cert before key) and reports the last error only if the deadline expires.

No user-visible behavior change for the happy path.

* test(tls): add end-to-end HTTPS cert rotation integration test

Boots a real `weed master` with HTTPS enabled, captures the leaf cert served at TLS handshake time, atomically rewrites the cert/key files on disk (the same rename-in-place pattern kubelet does when it swaps a cert-manager Secret), and asserts that a subsequent TLS handshake observes the rotated leaf, with no process restart, no SIGHUP, no reloader sidecar.

Verifies the full path: on-disk change → pemfile refresh tick → provider.KeyMaterial → tls.Config.GetCertificate → server TLS handshake.

Runtime is ~1s by exposing the reloader's refresh window as an env var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the test. The same env var is user-facing (documented in the wiki), so operators running short-lived certs (Vault, cert-manager with duration: 24h, etc.) can tighten the rotation-pickup window without a rebuild. Defaults to 5h to preserve prior behavior.

security.CredRefreshingInterval is kept for API compatibility but now aliases certreload.DefaultRefreshInterval so the same env controls both gRPC mTLS and HTTPS reload.

* ci(tls): wire the TLS rotation integration test into GitHub Actions

Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner, Go 1.25, build weed, run `go test` in test/tls_rotation, upload master logs on failure. 10-minute job timeout; the test itself finishes in about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms inside the test.

Runs on every push to master and on every PR to master.

* fix(tls): address follow-up PR review comments

Three new comments on the integration test + volume shutdown path:

1. Test: peekServerCert was swallowing every dial/handshake error, which meant waitForCert's "last err: <nil>" fatal message lost all diagnostic value. Thread errors back through: peekServerCert now returns (*x509.Certificate, error), and waitForCert records the latest error so a CI flake points at the actual cause (master didn't come up, handshake rejected, CA pool mismatch, etc.).

2. Test: set HOME=<tempdir> on the master subprocess. Viper today registers the literal path "$HOME/.seaweedfs" without env expansion, so a developer's ~/.seaweedfs/security.toml is accidentally invisible, and the test was relying on that. Pinning HOME is belt-and-braces against a future viper upgrade that does expand env vars.

3. volume.go: startClusterHttpService's provider close was registered via grace.OnInterrupt, which fires on SIGTERM but NOT on the v.shutdownCtx.Done() path used by mini / integration tests. The pemfile refresh goroutine leaked in that shutdown path. Now the helper returns a close func and the caller invokes it on BOTH shutdown paths for parity.

Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the ast-grep static-analysis nit; this is zero-risk since the pool only trusts our in-memory CA.

Test runs clean 3/3.
324 lines
10 KiB
Go
// Package tls_rotation exercises HTTPS certificate rotation end-to-end:
// start a real `weed master` with an HTTPS listener, capture the leaf
// served at handshake time, rewrite the cert/key files on disk, and
// assert that a subsequent handshake sees the new leaf, all without
// stopping the master process. The test shortens the reloader's refresh
// window to ~half a second via WEED_TLS_CERT_REFRESH_INTERVAL so it
// completes in seconds rather than hours.
package tls_rotation

import (
	"context"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"net"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
	"testing"
	"time"
)

// TestMasterHTTPSCertRotation boots `weed master` with HTTPS, confirms
// the initial leaf is served, rotates the cert/key pair on disk, and
// asserts the rotated leaf is served on subsequent TLS handshakes.
func TestMasterHTTPSCertRotation(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping HTTPS rotation integration test in -short mode")
	}

	weedBin := findWeedBinary(t)

	dir := t.TempDir()
	tlsDir := filepath.Join(dir, "tls")
	if err := os.MkdirAll(tlsDir, 0o755); err != nil {
		t.Fatalf("mkdir tls: %v", err)
	}
	certPath := filepath.Join(tlsDir, "server.crt")
	keyPath := filepath.Join(tlsDir, "server.key")

	ca, caKey := generateCA(t)
	leafSerial1 := big.NewInt(10001)
	leafSerial2 := big.NewInt(10002)

	// Initial leaf on disk.
	writeLeaf(t, certPath, keyPath, ca, caKey, leafSerial1)

	masterDir := filepath.Join(dir, "master")
	if err := os.MkdirAll(masterDir, 0o755); err != nil {
		t.Fatalf("mkdir master: %v", err)
	}
	// Empty security.toml so the master doesn't pick up a user's
	// ~/.seaweedfs/security.toml during the test.
	if err := os.WriteFile(filepath.Join(masterDir, "security.toml"), []byte("# test\n"), 0o644); err != nil {
		t.Fatalf("write security.toml: %v", err)
	}

	// Master auto-derives gRPC port as port+10000 when -port.grpc is
	// unset, so both must fit in uint16. Pin both explicitly.
	port, grpcPort := getFreeTCPPort(t), getFreeTCPPort(t)

	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, weedBin, "master",
		"-ip", "127.0.0.1",
		"-port", strconv.Itoa(port),
		"-port.grpc", strconv.Itoa(grpcPort),
		"-mdir", masterDir,
	)
	cmd.Dir = masterDir
	cmd.Env = append(os.Environ(),
		// Isolate HOME so the subprocess cannot pick up a developer's
		// ~/.seaweedfs/security.toml. Viper's AddConfigPath uses the
		// literal string "$HOME/.seaweedfs" without env expansion today,
		// so this is only belt-and-braces, but it protects us against a
		// future viper upgrade that does expand env vars.
		"HOME="+dir,
		"WEED_HTTPS_MASTER_CERT="+certPath,
		"WEED_HTTPS_MASTER_KEY="+keyPath,
		// Short refresh window so rotation completes in seconds.
		"WEED_TLS_CERT_REFRESH_INTERVAL=500ms",
	)
	logPath := filepath.Join(masterDir, "master.log")
	logOut, err := os.Create(logPath)
	if err != nil {
		t.Fatalf("create master log: %v", err)
	}
	cmd.Stdout = logOut
	cmd.Stderr = logOut

	if err := cmd.Start(); err != nil {
		t.Fatalf("start master: %v", err)
	}
	t.Cleanup(func() {
		// Ask the master to stop gracefully; kill it if it has not exited
		// within 10 seconds.
		if cmd.Process != nil {
			_ = cmd.Process.Signal(syscall.SIGTERM)
			done := make(chan struct{})
			go func() { _ = cmd.Wait(); close(done) }()
			select {
			case <-done:
			case <-time.After(10 * time.Second):
				_ = cmd.Process.Kill()
				<-done
			}
		}
		_ = logOut.Close()
		// Dump the master log only when the test failed.
		if t.Failed() {
			if b, readErr := os.ReadFile(logPath); readErr == nil {
				t.Logf("master.log:\n%s", string(b))
			}
		}
	})

	caPool := x509.NewCertPool()
	caPool.AddCert(ca)

	addr := fmt.Sprintf("127.0.0.1:%d", port)

	// 1. Wait for the initial leaf to appear. Master takes a few seconds
	// to open its HTTPS listener.
	waitForCert(t, addr, caPool, leafSerial1, 30*time.Second, "initial cert")

	// Sanity: a second handshake still observes the initial leaf.
	got, err := peekServerCert(addr, caPool)
	if err != nil || got == nil || got.SerialNumber.Cmp(leafSerial1) != 0 {
		t.Fatalf("second probe before rotation did not return initial leaf: cert=%v err=%v", got, err)
	}

	// 2. Rotate on disk. Each file is replaced via an atomic rename
	// (tempfile in the same directory), so the reloader never reads a
	// partially written file.
	writeLeaf(t, certPath, keyPath, ca, caKey, leafSerial2)

	// 3. Wait for new leaf to take over. With a 500ms refresh and no
	// connection pooling (tls.Dial opens a fresh conn each time), this
	// should take a couple of seconds.
	waitForCert(t, addr, caPool, leafSerial2, 15*time.Second, "rotated cert")
}

// waitForCert polls until a TLS handshake against addr yields a peer
// cert with the expected serial, or fails the test at the deadline.
// The last handshake error is surfaced in the fatal message so that a
// CI flake makes the root cause obvious (master didn't come up, TLS
// handshake rejected, CA pool mismatch, etc.).
func waitForCert(t *testing.T, addr string, caPool *x509.CertPool, wantSerial *big.Int, within time.Duration, label string) {
	t.Helper()
	deadline := time.Now().Add(within)
	var lastErr error
	var lastSerial *big.Int
	for time.Now().Before(deadline) {
		cert, err := peekServerCert(addr, caPool)
		if err != nil {
			lastErr = err
		} else if cert != nil {
			lastSerial = cert.SerialNumber
			if cert.SerialNumber.Cmp(wantSerial) == 0 {
				return
			}
		}
		time.Sleep(250 * time.Millisecond)
	}
	t.Fatalf("timeout waiting for %s (want serial %s, last seen %v, last err %v)", label, wantSerial, lastSerial, lastErr)
}

// peekServerCert opens a one-shot TLS connection and returns the leaf.
// Errors (dial failure, handshake rejection, empty peer chain) are
// returned rather than swallowed, so the caller can surface them when
// the test times out.
func peekServerCert(addr string, caPool *x509.CertPool) (*x509.Certificate, error) {
	d := &net.Dialer{Timeout: 2 * time.Second}
	conn, err := tls.DialWithDialer(d, "tcp", addr, &tls.Config{
		RootCAs:    caPool,
		ServerName: "localhost",
		MinVersion: tls.VersionTLS12,
	})
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	state := conn.ConnectionState()
	if len(state.PeerCertificates) == 0 {
		return nil, fmt.Errorf("handshake returned empty peer chain")
	}
	return state.PeerCertificates[0], nil
}

// getFreeTCPPort asks the kernel for an ephemeral port by listening on
// 127.0.0.1:0, then closes the listener and returns the port number.
func getFreeTCPPort(t *testing.T) int {
	t.Helper()
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatalf("listen ephemeral: %v", err)
	}
	port := ln.Addr().(*net.TCPAddr).Port
	_ = ln.Close()
	return port
}

// findWeedBinary looks for a built `weed` binary in a few relative
// locations, then on PATH, and skips the test if none is found.
func findWeedBinary(t *testing.T) string {
	t.Helper()
	candidates := []string{
		"../../weed/weed",
		"../weed/weed",
		"./weed",
	}
	for _, c := range candidates {
		if _, err := os.Stat(c); err == nil {
			abs, absErr := filepath.Abs(c)
			if absErr == nil {
				return abs
			}
			return c
		}
	}
	if path, err := exec.LookPath("weed"); err == nil {
		return path
	}
	t.Skip("weed binary not found; build with `cd weed && go build` first")
	return ""
}

// --- cert fixtures -------------------------------------------------------

// generateCA returns a self-signed CA cert and its private key.
func generateCA(t *testing.T) (*x509.Certificate, *ecdsa.PrivateKey) {
	t.Helper()
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		t.Fatalf("gen CA key: %v", err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "tls-rotation-test-CA"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		t.Fatalf("create CA cert: %v", err)
	}
	parsed, err := x509.ParseCertificate(der)
	if err != nil {
		t.Fatalf("parse CA cert: %v", err)
	}
	return parsed, key
}

// writeLeaf signs a new leaf cert with the given serial and writes it
// plus its key to the given paths via atomic rename, the pattern
// Kubernetes (cert-manager → Secret volume mount) produces in practice.
func writeLeaf(t *testing.T, certPath, keyPath string, ca *x509.Certificate, caKey *ecdsa.PrivateKey, serial *big.Int) {
	t.Helper()
	leafKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		t.Fatalf("gen leaf key: %v", err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject: pkix.Name{
			CommonName: "localhost",
		},
		NotBefore:   time.Now().Add(-time.Hour),
		NotAfter:    time.Now().Add(24 * time.Hour),
		KeyUsage:    x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		DNSNames:    []string{"localhost"},
		IPAddresses: []net.IP{net.ParseIP("127.0.0.1"), net.ParseIP("::1")},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &leafKey.PublicKey, caKey)
	if err != nil {
		t.Fatalf("create leaf cert: %v", err)
	}

	atomicWritePEM(t, certPath, "CERTIFICATE", der)

	keyDER, err := x509.MarshalECPrivateKey(leafKey)
	if err != nil {
		t.Fatalf("marshal leaf key: %v", err)
	}
	atomicWritePEM(t, keyPath, "EC PRIVATE KEY", keyDER)
}

// atomicWritePEM writes a PEM file via tempfile-in-same-directory plus
// rename, matching what kubelet does when it swaps the ..data symlink
// for a renewed Secret. Ensures the reader never sees a truncated file.
func atomicWritePEM(t *testing.T, path, blockType string, der []byte) {
	t.Helper()
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".tls-*")
	if err != nil {
		t.Fatalf("create tempfile: %v", err)
	}
	ok := false
	defer func() {
		if !ok {
			_ = os.Remove(tmp.Name())
		}
	}()
	if err := pem.Encode(tmp, &pem.Block{Type: blockType, Bytes: der}); err != nil {
		tmp.Close()
		t.Fatalf("pem encode: %v", err)
	}
	if err := tmp.Close(); err != nil {
		t.Fatalf("close tempfile: %v", err)
	}
	if err := os.Chmod(tmp.Name(), 0o600); err != nil {
		t.Fatalf("chmod tempfile: %v", err)
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		t.Fatalf("rename tempfile onto %s: %v", path, err)
	}
	ok = true
}