Chris Lu 2d1b8be22b s3: route object reads to the key's owner filer (#9806)
* s3: route object reads to the key's owner filer

Writes already route by key to the owner filer on the lock ring, where the
entry is created. Reads went to the gateway's local filer and treated its
NotFound as authoritative, so a GET on one gateway could miss an object
another gateway had just written until the filers' metadata replication
caught up.

Resolve an object's entry from the key's owner first, failing over to the
gateway's filer set only on transport errors. An owner NotFound stays
authoritative: no fan-out across filers, and no resurrecting a peer's
not-yet-replicated tombstone, so a delete routed to the owner is visible at
once and a genuine miss costs one lookup. Reads of keys owned by the local
filer are unchanged. Objects written through the non-routed lock path land on
a gateway's local filer, so they can still read as absent on the owner until
they replicate.

withFilerClientFailover takes a preferred start filer; the object-entry
reads pass the owner, every other caller passes "" and keeps the
current-filer fast path.
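
A minimal sketch of the owner-first read described above, not the actual
gateway code. The sentinel errors, `lookupFunc`, and `readOwnerFirst` are
hypothetical names for illustration; the real path goes through
withFilerClientFailover. The sketch assumes any non-transport answer
(including NotFound) is final and only transport errors advance to the next
candidate.

```go
package main

import "errors"

// Hypothetical sentinel errors for this sketch; the real code has its own.
var (
	ErrNotFound  = errors.New("entry not found")
	ErrTransport = errors.New("transport error")
	ErrNoFilers  = errors.New("no filer candidates")
)

// lookupFunc is a hypothetical stand-in for a single filer entry lookup.
type lookupFunc func(filer, key string) (string, error)

// readOwnerFirst tries the key's owner first, then the gateway's filer set.
// An empty owner means "no preferred start": only the fallbacks are tried.
func readOwnerFirst(owner string, fallbacks []string, key string, lookup lookupFunc) (string, error) {
	candidates := make([]string, 0, len(fallbacks)+1)
	if owner != "" {
		candidates = append(candidates, owner)
	}
	for _, f := range fallbacks {
		if f != owner { // do not dial the owner twice
			candidates = append(candidates, f)
		}
	}

	lastErr := ErrNoFilers
	for _, f := range candidates {
		entry, err := lookup(f, key)
		if err == nil {
			return entry, nil
		}
		if !errors.Is(err, ErrTransport) {
			// NotFound (or any other answer from a reachable filer) is
			// authoritative: no fan-out to the remaining candidates.
			return "", err
		}
		lastErr = err // transport error: fail over to the next candidate
	}
	return "", lastErr
}
```

With an owner that answers NotFound, the function returns after one lookup;
with an owner that fails at the transport level, the next filer in the set is
consulted.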

* s3: consult the prior owner on a rebalance-window read miss

Owner-first reads route a key to its current ring owner. When a filer joins,
~1/N of keys reassign to it, and the new owner may not have replicated a
just-moved key yet, so an owner NotFound would surface a transient 404 for an
object that already exists elsewhere.

Retain the previous ring on the gateway's LockClient for a cooling-off window
(PriorOwnerForKey, mirroring the master's LockRing.PriorOwner) and, on the
owner's NotFound, probe the key's previous owner once before treating the miss
as final. The probe is scoped to keys whose ownership actually moved and only
within the window, so steady-state reads are untouched.

This trades the transient scale-up 404 for a transient stale read if a delete
routed to the new owner races the same window — the same authoritative-NotFound
tradeoff, narrowed to the rebalance.

* s3: try healthy filers before unhealthy ones on failover

The candidate list probed its first entry (usually the current filer)
unconditionally, so a health-flagged current filer cost a transport timeout on
every ordinary call before failover reached a replica. Partition candidates into
healthy and unhealthy, keep priority within each, and fall back to unhealthy
ones only when all healthy ones fail.
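
A rough sketch of that partition, assuming a `healthy` predicate supplied by
the caller; `orderByHealth` is a hypothetical helper name, not one taken from
the change.

```go
package main

// orderByHealth returns the candidates with healthy filers first and
// unhealthy ones last, preserving the original priority inside each group.
func orderByHealth(candidates []string, healthy func(string) bool) []string {
	ordered := make([]string, 0, len(candidates))
	var unhealthy []string
	for _, c := range candidates {
		if healthy(c) {
			ordered = append(ordered, c)
		} else {
			unhealthy = append(unhealthy, c)
		}
	}
	// Unhealthy filers stay in the list so they are still tried as a last
	// resort when every healthy filer fails.
	return append(ordered, unhealthy...)
}
```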

* reduce comments on the routed read and lock client paths

* s3: skip a recently-unreachable owner on route-by-key reads

The gateway's filer health tracking no-ops for an owner outside the static
-filer list, so during a sustained owner outage every route-by-key read
re-dials the dead owner before failing over. Flag an owner whose owner-first
read hit a transport error and skip it (read local-first) for a short TTL, so
reads pay one dead dial per TTL instead of one per request; the flag expires so
owner-first reads resume once the owner or the ring recovers.

* s3: always try the preferred owner first, health-order only the rest

The healthy/unhealthy partition also demoted a health-flagged preferred owner
behind healthy replicas, so a replica's authoritative NotFound could mask a
write that had only reached the owner — the read-after-write race this routing
exists to close. Pull preferred out of the partition and keep it first; the
recently-unreachable gate already steers reads away from a genuinely dead owner.
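
The resulting ordering can be sketched like this, with the preferred owner
pinned to the front and only the remaining candidates ordered by health.
`orderCandidates` and the `healthy` predicate are hypothetical names used for
illustration.

```go
package main

// orderCandidates keeps a non-empty preferred filer first regardless of its
// health flag, then lists the remaining candidates healthy-first, preserving
// priority within each group.
func orderCandidates(preferred string, candidates []string, healthy func(string) bool) []string {
	ordered := make([]string, 0, len(candidates)+1)
	if preferred != "" {
		ordered = append(ordered, preferred)
	}
	var unhealthy []string
	for _, c := range candidates {
		if c == preferred {
			continue // already placed at the front
		}
		if healthy(c) {
			ordered = append(ordered, c)
		} else {
			unhealthy = append(unhealthy, c)
		}
	}
	return append(ordered, unhealthy...)
}
```

Because the preferred owner is never demoted, a healthy replica cannot answer
NotFound ahead of an owner that holds the only copy of a fresh write.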
2026-06-03 00:12:28 -07:00