* feat(filer.backup): -initialSnapshot seeds destination from live tree
Replaying the metadata event log on a fresh sync only leaves files that
still exist on the source at replay time: any entry that was created and
later deleted is replayed as a create/delete pair and never materializes
on the destination. Users who wipe the destination and re-run
filer.backup therefore see "only new files" instead of a full backup,
even when -timeAgo=876000h is passed and the subscription genuinely
starts from epoch (ref discussion #8672).
Add an opt-in -initialSnapshot flag: when set on a fresh sync (no prior
checkpoint, -timeAgo unset), walk the live filer tree under -filerPath
via TraverseBfs and seed the destination through sink.CreateEntry, then
persist the walk-start timestamp as the checkpoint and subscribe from
there. Capturing the timestamp before the walk lets the subscription
catch any create/update/delete that races with the walk; replaying those
events is safe because sink.CreateEntry is idempotent across the
built-in sinks.
Honors existing -filerExcludePaths / -filerExcludeFileNames /
-filerExcludePathPatterns filters and skips /topics/.system/log the
same way the subscription path does.
Also log "starting from <t> (no prior checkpoint)" instead of a
misleading "resuming from 1970-01-01" when the KV has no stored offset.
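The ordering described above (take the timestamp, walk, then follow the
log from that timestamp) can be sketched with in-memory stand-ins. This
is a minimal illustration, not the filer.backup code: the event type,
seedThenFollow, and the map-backed destination are all hypothetical.

```go
package main

import "fmt"

// event is a hypothetical stand-in for one metadata log record.
type event struct {
	tsNs     int64
	path     string
	isDelete bool
}

// seedThenFollow copies the live tree into a destination map, then applies
// every event stamped at or after snapshotTsNs. Writes are plain map
// assignments, so an entry seen by both the walk and the replay is stored
// once (this models an idempotent create).
func seedThenFollow(live []string, events []event, snapshotTsNs int64) map[string]bool {
	dest := make(map[string]bool)
	for _, p := range live {
		dest[p] = true
	}
	for _, ev := range events {
		if ev.tsNs < snapshotTsNs {
			continue // older than the checkpoint: already reflected in the walk
		}
		if ev.isDelete {
			delete(dest, ev.path)
		} else {
			dest[ev.path] = true
		}
	}
	return dest
}

func main() {
	live := []string{"/a", "/b"}
	events := []event{
		{tsNs: 5, path: "/old"},
		{tsNs: 6, path: "/old", isDelete: true},
		{tsNs: 12, path: "/b"}, // raced with the walk, replayed harmlessly
		{tsNs: 13, path: "/c"}, // created after the walk started
	}
	dest := seedThenFollow(live, events, 10)
	fmt.Println(len(dest), dest["/c"], dest["/old"])
}
```

Events older than the checkpoint are never replayed, so the old
create/delete pair for /old has no effect, while /c is picked up by the
subscription.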
* fix(filer.backup): guard initialSnapshot counters under TraverseBfs workers
TraverseBfs fans the callback out across 5 worker goroutines, so the
entryCount / byteCount updates and the 5-second progress-log gate in
runInitialSnapshot were racing. Switch the counters to atomic.Int64 and
protect the lastLog check/update with a short-scoped mutex so the heavy
sink.CreateEntry call stays outside the critical section.
Flagged by gemini-code-assist on #9126; verified with go test -race.
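A minimal sketch of the pattern, assuming Go 1.19+ for atomic.Int64. The
progress type, record, and runWorkers are hypothetical names used only
for illustration; the real counters live in runInitialSnapshot.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// progress is a hypothetical tracker shared by all walk workers.
type progress struct {
	entryCount atomic.Int64
	byteCount  atomic.Int64

	mu      sync.Mutex // guards lastLog only
	lastLog time.Time
}

// record bumps the counters and reports whether the caller should emit a
// progress line. The mutex covers just the lastLog check/update.
func (p *progress) record(size int64, now time.Time, every time.Duration) bool {
	p.entryCount.Add(1)
	p.byteCount.Add(size)

	p.mu.Lock()
	defer p.mu.Unlock()
	if now.Sub(p.lastLog) < every {
		return false
	}
	p.lastLog = now
	return true
}

// runWorkers simulates concurrent callbacks, each recording perWorker entries.
func runWorkers(workers, perWorker int, size int64) *progress {
	p := &progress{}
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				if p.record(size, time.Now(), 5*time.Second) {
					fmt.Printf("progress: %d entries, %d bytes\n",
						p.entryCount.Load(), p.byteCount.Load())
				}
				// The expensive sink write would run here, outside the mutex.
			}
		}()
	}
	wg.Wait()
	return p
}

func main() {
	p := runWorkers(5, 1000, 10)
	fmt.Println(p.entryCount.Load(), p.byteCount.Load())
}
```

Running this under `go run -race` is a quick way to confirm the counters
and the log gate no longer race.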
* fix(filer.backup): harden initialSnapshot against transient errors and path edge cases
Three review items from CodeRabbit on #9126:
1. getOffset errors no longer leave isFreshSync=true. Previously, a
   transient KV read failure caused runFilerBackup's retry loop to redo
   the full -initialSnapshot walk on every retry. Any offset-read error
   is now treated as "not fresh", so the snapshot runs only after we
   have verified that there really is no prior checkpoint.
2. initialSnapshotTargetKey now normalizes sourcePath to a
   trailing-slash base before stripping the prefix, so edge cases where
   sourceKey equals sourcePath (trailing-slash mismatch or root-entry
   emission) no longer index past the end. Unit tests cover both
   forms.
3. Documented on runInitialSnapshot that TraverseBfs still enumerates
   excluded subtrees, which costs walk time; pruning them requires a
   separate change to TraverseBfs itself.
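The trailing-slash normalization in item 2 can be sketched as follows.
relativeKey is a hypothetical helper written for this note; the actual
initialSnapshotTargetKey also builds the sink-specific destination key,
which is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// relativeKey returns sourceKey relative to sourcePath. Both values are
// compared in trailing-slash form, so "/data" and "/data/" are treated as
// the same base and the root entry maps to the empty string instead of
// slicing past the end of the key.
func relativeKey(sourcePath, sourceKey string) string {
	base := strings.TrimSuffix(sourcePath, "/") + "/"
	if strings.TrimSuffix(sourceKey, "/")+"/" == base {
		return "" // sourceKey is the source root itself
	}
	// Keys outside base are returned unchanged by TrimPrefix.
	return strings.TrimPrefix(sourceKey, base)
}

func main() {
	fmt.Printf("%q\n", relativeKey("/data", "/data/a/b.txt"))
	fmt.Printf("%q\n", relativeKey("/data/", "/data"))
	fmt.Printf("%q\n", relativeKey("/", "/a"))
}
```

A slice expression such as sourceKey[len(sourcePath)+1:] would panic in
the second case; comparing normalized forms first avoids that.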
* fix(filer.backup): retry setOffset after initialSnapshot to avoid full re-walks
If the snapshot walk finishes but the subsequent setOffset fails, the
retry loop in runFilerBackup re-enters doFilerBackup with an empty
checkpoint and runs the full BFS again. On a multi-million-entry tree
that is hours of wasted work over a 100-byte KV write. Retry the write a
handful of times with exponential backoff before giving up, and log
loudly at the final failure (with snapshotTsNs + sinkId) so operators
recognize the symptom instead of guessing at mysterious repeated walks.
Nitpick raised by CodeRabbit on #9126.
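A minimal sketch of a bounded retry with exponential backoff. The
attempt count, initial delay, and the names retryWithBackoff and demo
are assumptions for illustration; the values used in filer.backup are
not stated here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls op up to attempts times, doubling the delay after
// each failure. It returns nil on the first success, or the last error
// wrapped with the attempt count.
func retryWithBackoff(attempts int, initial time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// demo simulates a checkpoint write that fails `failures` times before
// succeeding, and reports how many calls were made.
func demo(failures, attempts int) (calls int, err error) {
	err = retryWithBackoff(attempts, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errors.New("transient KV error")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo(2, 5)
	fmt.Println(calls, err)
}
```

The final wrapped error is where a loud log line with the snapshot
timestamp and sink ID would be emitted.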
* fix(filer.backup): initialSnapshot ignore404, skew margin, exclude dir-entry itself
Three review items from CodeRabbit on #9126:
1. ignore404Error now threads into runInitialSnapshot. If a file is
   listed by TraverseBfs and then deleted before CreateEntry reads its
   chunks, the follow path already ignores the 404, but the snapshot
   path was aborting and triggering a full re-walk. Treat an ignorable
   404 as "skip this entry, continue."
2. snapshotTsNs now uses `time.Now() - 1min` instead of `time.Now()`.
   Metadata events are stamped server-side, so a fast backup-host clock
   could skip events that fire during or right after the walk. This
   matches the 1-minute margin meta_aggregator.go applies on initial
   peer traversal; duplicate replay is harmless because CreateEntry is
   idempotent.
3. Exclude checks now run against the entry's own full path, not just
   its parent. Previously, a walked directory whose full path matched
   SystemLogDir or -filerExcludePaths was still seeded to the
   destination; only its descendants were skipped. Verified with a
   manual repro where -filerExcludePaths=/data/skipdir now keeps the
   skipdir entry itself off the destination.
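The difference in item 3 comes down to which path is tested. Below is a
rough sketch, assuming excluded paths match on whole path segments;
isExcluded is a hypothetical helper and the real filter semantics in
filer.backup may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isExcluded reports whether fullPath is one of the excluded paths or lies
// beneath one. Checking the entry's own full path (not only its parent)
// means the excluded directory entry itself is skipped too.
func isExcluded(fullPath string, excludePaths []string) bool {
	for _, p := range excludePaths {
		if p == "" {
			continue
		}
		p = strings.TrimSuffix(p, "/")
		if fullPath == p || strings.HasPrefix(fullPath, p+"/") {
			return true
		}
	}
	return false
}

func main() {
	excludes := []string{"/data/skipdir"}
	for _, path := range []string{
		"/data/skipdir",       // the directory entry itself
		"/data/skipdir/a.txt", // a descendant
		"/data/skipdir2",      // a sibling with a shared name prefix
	} {
		fmt.Println(path, isExcluded(path, excludes))
	}
}
```

Testing only the parent directory would let /data/skipdir through,
because its parent /data is not excluded.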
* refactor(filer): share destKey helper between buildKey and initialSnapshot
Extract destKey(dataSink, targetPath, sourcePath, sourceKey, mTime) from
buildKey in filer_sync.go. Both the event-log path (buildKey) and the
initialSnapshot walk (initialSnapshotTargetKey) now go through the same
helper, so a walk-seeded file and an event-replayed file always resolve
to the same destination key.
As a bonus, buildKey picks up the defensive trailing-slash normalization
that initialSnapshotTargetKey introduced, removing the index-past-end
risk when sourceKey happens to equal sourcePath. Also tightens the mTime
lookup to guard against nil Attributes (caught by an existing test
against buildKey when I first moved the lookup out of the incremental
branch).
* fix: extend ignore404Error to match the "404 Not Found" string from S3 sink errors
* test: add unit tests for isIgnorable404 error matching
* improve: pre-compute ignorable 404 string and simplify isIgnorable404
* test: replace init() with TestMain for global HTTP client setup
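The 404 string matching from the commits above could look roughly like
this. It is a sketch under assumptions: looksLikeNotFound, sinkError,
and notFoundText are hypothetical names, and the actual isIgnorable404
may check additional conditions.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// notFoundText is computed once at package init and evaluates to
// "404 Not Found".
var notFoundText = fmt.Sprintf("%d %s",
	http.StatusNotFound, http.StatusText(http.StatusNotFound))

// sinkError is a hypothetical error type standing in for a sink failure.
type sinkError string

func (e sinkError) Error() string { return string(e) }

// looksLikeNotFound reports whether err should be skipped: the ignore flag
// must be set and the error text must contain the precomputed 404 string.
func looksLikeNotFound(err error, ignore404 bool) bool {
	if !ignore404 || err == nil {
		return false
	}
	return strings.Contains(err.Error(), notFoundText)
}

func main() {
	err := sinkError("read chunk: 404 Not Found")
	fmt.Println(looksLikeNotFound(err, true))
	fmt.Println(looksLikeNotFound(err, false))
}
```

Substring matching is fragile compared with typed errors, but it works
when a sink surfaces only the formatted HTTP status text.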