From f4ce2be875e1afe00c598632c6082cce621e1130 Mon Sep 17 00:00:00 2001 From: Ping Qiu Date: Mon, 20 Apr 2026 02:24:44 -0700 Subject: [PATCH] =?UTF-8?q?doc:=20P14=20S8=20final=20bounded=20close=20?= =?UTF-8?q?=E2=80=94=20evidence=20matrix=20+=20P15=20handoff=20(#9142)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * doc: P14 S8 final bounded close — evidence matrix + P15 handoff Adds the six S8 closure deliverables consolidating S4-S7 evidence, classifying V2 scenarios, and mapping residual product gaps onto canonical P15 tracks (per v3-phase-15-product-plan.md §4). New docs: - v3-phase-14-s8-assignment.md — S8 execution contract. - v3-phase-14-s8-final-bounded-close.md — bounded P14 target, accepted topology, reject conditions. - v3-phase-14-s8-evidence-matrix.md — 16 claims × {L0, L1, L2, L3, Status, Residual}. 15 PROVEN, 1 PARTIAL (Claim 15 fence quantitative bound, P14 internal follow-up). Rounds 2-3 architect corrections: Claim 10 / 12 L2 narrowed; Claim 6 refresh gap closed by the new L1 test (see companion commit in seaweed_block). - v3-phase-14-s8-v2-scenario-classification.md — every V2 scenario mapped to RUNNABLE-P14 / BLOCKED-FRONTEND / BLOCKED-OPS / BLOCKED-HA / BLOCKED-PERF / PORT-MECHANISM; scenario YAMLs kept as L3 shape, not executed evidence. - v3-phase-14-s8-p15-handoff.md — 11 rows (10 canonical P15 tracks + 1 P14 internal follow-up anchored to Claim 15 PARTIAL); §4 integrity check split by row class. - v3-phase-14-s8-closure.md — final P14 closure statement matching the close doc §10 wording; explicit non-goals; all 9 P15 tracks named with canonical numbering. No claim of CSI / frontend / migration / security / performance / production readiness. Every product gap is handed off with a concrete first-proof gate. Companion: seaweed_block commit adds the IntentRefreshEndpoint L1 route test that closes Claim 6. 
Co-Authored-By: Claude Opus 4.7 (1M context) * doc: P14 S8 — resolve port-now doc conflict (CodeRabbit #4) final-bounded-close.md §7 previously said "Port now: testrunner, scenarios, component harness, qa_block, learn/test" while v2-scenario-classification.md §2 says S8 does NOT port testrunner machinery and defers all actual porting to P15. Align final-bounded-close.md §7 with classification: section renamed "Classify now (S8 scope), port deferred to P15". Every item now states which P15 track actually owns the port (Final Gate or T1 Frontend + Data Path as applicable). No scope expansion; no new handoff gap. Pure doc-consistency fix. Co-Authored-By: Claude Opus 4.7 (1M context) --------- Co-authored-by: Claude Opus 4.7 (1M context) --- sw-block/design/v3-phase-14-s8-assignment.md | 187 ++++++++++++++ sw-block/design/v3-phase-14-s8-closure.md | 114 +++++++++ .../design/v3-phase-14-s8-evidence-matrix.md | 241 ++++++++++++++++++ .../v3-phase-14-s8-final-bounded-close.md | 212 +++++++++++++++ sw-block/design/v3-phase-14-s8-p15-handoff.md | 86 +++++++ ...-phase-14-s8-v2-scenario-classification.md | 107 ++++++++ 6 files changed, 947 insertions(+) create mode 100644 sw-block/design/v3-phase-14-s8-assignment.md create mode 100644 sw-block/design/v3-phase-14-s8-closure.md create mode 100644 sw-block/design/v3-phase-14-s8-evidence-matrix.md create mode 100644 sw-block/design/v3-phase-14-s8-final-bounded-close.md create mode 100644 sw-block/design/v3-phase-14-s8-p15-handoff.md create mode 100644 sw-block/design/v3-phase-14-s8-v2-scenario-classification.md diff --git a/sw-block/design/v3-phase-14-s8-assignment.md b/sw-block/design/v3-phase-14-s8-assignment.md new file mode 100644 index 000000000..4bde46842 --- /dev/null +++ b/sw-block/design/v3-phase-14-s8-assignment.md @@ -0,0 +1,187 @@ +# V3 Phase 14 S8 Assignment + +Date: 2026-04-19 +Status: draft +Owner: sw, with architect and tester review +Purpose: execute the final bounded P14 close by producing evidence, scenario 
classification, and a clean P15 handoff + +## 1. One-Line Goal + +Close P14 as a bounded internal single-active-master topology/control-plane loop, backed by L0/L1/L2 evidence and honest L3/P15 carry-forward. + +## 2. Scope + +Included: + +1. S4-S7 route evidence consolidation +2. acceptance matrix for all P14 internal claims +3. final supported topology statement +4. unsupported/deferred list +5. V2 testrunner/scenario port classification +6. process smoke gaps review +7. 14A final targeted review packet +8. P15 handoff mapping + +Excluded: + +1. CSI implementation +2. external API implementation +3. user data-path production readiness +4. migration implementation +5. security implementation +6. multi-master / HA +7. performance or soak claims + +## 3. Required Work + +### A. Evidence Inventory + +Create an evidence table that maps S4-S7 claims to concrete tests and commands. + +Must include: + +1. observation institution +2. durable authority institution +3. convergence institution +4. real-route restart closure +5. placement/rebalance/failover policy for accepted topology +6. unsupported topology evidence +7. old-slot/stale observation rejection +8. per-volume isolation + +### B. Gap Classification + +Classify every missing proof as one of: + +1. P14 internal blocker +2. P14 S8 test/harness gap +3. P15 product-surface gap +4. 14A verification follow-up +5. explicit non-goal + +Any P14 internal blocker must be fixed before S8 close. A P15 product-surface gap must be carried forward, not disguised as P14 closure. + +### C. V2 Scenario Port Classification + +Inspect V2 scenario/test assets and produce a table: + +| V2 asset | P14 use | P15 use | Port shape | Blocker | +|---|---|---|---|---| + +Start with: + +1. `weed/storage/blockvol/testrunner/` +2. `weed/storage/blockvol/testrunner/scenarios/public/ha-restart-recovery.yaml` +3. `weed/storage/blockvol/testrunner/scenarios/public/ha-failover.yaml` +4. 
`weed/storage/blockvol/testrunner/scenarios/public/fault-partition.yaml` +5. `weed/storage/blockvol/test/component/` +6. `weed/server/qa_block_*` +7. `learn/test` evidence layout + +### D. L2 Process Smoke Decision + +Answer whether current `sparrow` hidden smoke surfaces are sufficient for S8 L2 internal close. + +If not sufficient, add the minimum internal S8 smoke surface needed to prove the S4-S7 route without inventing P15 external product APIs. + +The smoke must check: + +1. real process starts +2. durable authority reloads +3. observation or fixture input drives controller +4. assignment reaches adapter through `VolumeBridge` +5. restart does not mint backward +6. structured output can be consumed by tester + +### E. 14A Final Targeted Pass + +Prepare a 14A checklist for: + +1. stale authority after restart +2. stale observation after reassignment +3. old-slot delivery after authority move +4. unsupported topology no-op/evidence +5. convergence stuck/supersede clearing +6. publication honesty during transition +7. per-volume isolation + +14A should review findings first. It should not expand P14 scope. + +### F. P15 Handoff + +Produce a P15 handoff table: + +| Gap | Why not P14 | P15 owner track | Required first proof | +|---|---|---|---| + +Must include: + +1. CSI lifecycle +2. frontend data path +3. external volume API +4. security/auth +5. diagnostics/operator workflow +6. migration/coexistence +7. deployment/hardening +8. final cluster validation agent + +## 4. Required Commands + +At minimum run or document blockers for: + +```text +go test ./core/engine ./core/adapter ./core/authority -count=1 +go test ./cmd/sparrow ./core/authority -count=1 +go test ./... -count=1 +``` + +If sandbox or Windows filesystem behavior blocks durable-store tests, rerun outside the sandbox and record the reason. + +If L3 hardware scenarios are not runnable, do not fake them. Record the scenario names and blockers. + +## 5. Expected Files + +Expected docs: + +1. 
`sw-block/design/v3-phase-14-s8-final-bounded-close.md` — this scope document +2. `sw-block/design/v3-phase-14-s8-assignment.md` — this assignment +3. `sw-block/design/v3-phase-14-s8-closure.md` — final closure statement, created after implementation/evidence +4. updates to `v3-phase-14-checklist.md` +5. updates to `v3-phase-14-log.md` +6. updates to `v3-phase-14a-checklist.md` if 14A final pass finds coverage gaps + +Expected code/tests only if needed: + +1. internal process smoke helper in `cmd/sparrow` +2. component evidence consolidation tests in `core/authority` or a shared component package +3. scenario classification artifacts under the chosen test/runner directory + +## 6. Acceptance + +S8 can be accepted only when: + +1. the evidence matrix exists and is honest +2. all P14 internal blockers are either fixed or explicitly rejected from the P14 claim +3. L0 and L1 tests pass +4. L2 process smoke is present or its absence is classified as a P15 product-surface blocker +5. L3 scenarios are classified with runnable/deferred/blocker status +6. 14A final targeted pass has no open P14 safety blocker +7. P15 handoff maps every product gap to a P15 track +8. final closure statement does not claim production readiness + +## 7. Review Focus + +Architect should review: + +1. whether S8 closure is product-honest +2. whether any P15 work was pulled backward into P14 +3. whether any P14 internal truth gap was pushed forward into P15 +4. whether test evidence matches the claimed layer +5. whether V2 porting stays mechanism-only + +Tester should review: + +1. whether evidence is reproducible +2. whether commands and scenario blockers are specific +3. whether L2/L3 gaps are visible +4. 
whether final artifacts are enough to start P15 without re-litigating P14 diff --git a/sw-block/design/v3-phase-14-s8-closure.md b/sw-block/design/v3-phase-14-s8-closure.md new file mode 100644 index 000000000..c47ce61c7 --- /dev/null +++ b/sw-block/design/v3-phase-14-s8-closure.md @@ -0,0 +1,114 @@ +# V3 Phase 14 S8 Closure + +Date: 2026-04-20 +Status: draft — pending 14A final pass + architect + tester sign-off +Purpose: the final P14 closure statement per `v3-phase-14-s8-final-bounded-close.md` §10 + +## 1. What P14 Now Owns + +After S4-S7, P14 closes **one bounded single-active-master internal topology/control-plane truth loop** for the accepted topology: + +1. **Observation institution (S4)**: heartbeats and inventory flow through `ObservationHost` into a synthesized `ClusterSnapshot` + `SupportabilityReport`. Observation never mints authority. Partial / conflicting / duplicate / unknown / expired inventory becomes explicit `VolumeUnsupportedEvidence`; pending state is distinct from unsupported; per-volume isolation holds. + +2. **Durable authority institution (S5)**: current authority line is durable, single-owner, atomically written, and reloaded synchronously at publisher boot. One record per volume. Corrupt records fail closed per volume and surface as structured `ReloadSkips`. Process lock is exclusive and idempotent. Store never mints authority (boundary-guard test). + +3. **Convergence institution (S6)**: desired assignments have bounded fate — confirmed (cleared), publish-not-observed (stuck-evidence, no re-mint, no churn), or superseded (by newer publisher line or different decision). Passive retry only; no active re-drive. Normal observation lag does NOT supersede. Per-volume retry clock; no global state. + +4. **Real-route restart closure (S7)**: the full `ObservationHost → TopologyController → Publisher(reloaded) → VolumeBridge → VolumeReplicaAdapter` route composes correctly at restart. 
Restart recomputes desired from durable authority + fresh observation (no durable desired store). Old-slot per-replica state is NOT revived from the store (S5 one-record-per-volume rule); live-route old-slot delivery via bridge is rejected by the adapter's monotonic guard. Stale observation cannot move authority backward (4 simulation shapes). Unsupported topology after restart records evidence, not silent idle. L2 process smoke proves real-binary restart × 2 preserves durable truth with no backward mint, via pinned structured JSON. + +5. **Accepted topology** (S8 §4): single active master; multiple volumes; three replica slots per volume on distinct servers; one primary + two candidates; publisher-owned `Epoch` / `EndpointVersion`; durable current line; passive convergence; real `VolumeBridge` into adapter. + +## 2. What P14 Explicitly Does NOT Own + +1. CSI lifecycle (Create/Delete/Publish/Node*). +2. iSCSI / NVMe-oF frontend data path. +3. External volume management API (REST / gRPC). +4. Security / auth (authn, CHAP, encryption). +5. Operator diagnostics surface (dashboards, alerts, `/metrics`). +6. Operator workflow (drain, planned failover, manual reassign, supervised rebuild). +7. V2 ↔ V3 migration and coexistence. +8. Deployment / packaging / hardening (systemd, container, backup of durable store). +9. Performance / soak / release-hardening claims. +10. Multi-master / leader election / distributed authority (explicit non-goal, not just deferred). +11. V2 `HandleAssignment` / `promote` / `demote` semantics (explicit non-goal). +12. Heartbeat-as-authority (explicit non-goal). + +Every item 1–9 is mapped to a P15 track in `v3-phase-14-s8-p15-handoff.md` §2. Items 10–12 are permanent non-goals. + +## 3. Evidence At L0 / L1 / L2 / L3 + +Full matrix: `v3-phase-14-s8-evidence-matrix.md` §3. Summary of coverage: + +| Claim | L0 | L1 | L2 | L3 | +|---|---|---|---|---| +| 1. Observation system-fed | ✓ | ✓ | — (via L1) | L3 shape only | +| 2. 
Supportability explicit | ✓ | ✓ | — | L3 shape only | +| 3. Durable authority / reload | ✓ (13 tests) | ✓ | ✓ (subprocess) | L3 shape only | +| 4. No old-slot durable revival | ✓ | ✓ | ✓ | L3 shape only | +| 5. Live-route old-slot rejected | ✓ | ✓ | N/A (live-route) | N/A | +| 6. Controller-driven bind/reassign/refresh | ✓ (7 tests) | ✓ | ✓ (Bind only — Reassign / RefreshEndpoint are L1-only; subprocess does not ingest heartbeats) | L3 shape only | +| 7. Confirmation clears desired | ✓ | ✓ | N/A | L3 shape only | +| 8. Stuck evidence bounded / no churn | ✓ (5 tests) | ✓ | N/A | L3 shape only | +| 9. Supersede (2 modes) | ✓ (3 tests) | ✓ | N/A | L3 shape only | +| 10. Restart re-anchors via VolumeBridge | ✓ | ✓ | **subclaim only** — L2 proves durable `Publisher` reload (Claim 3); the bridge/adapter re-anchor is NOT at L2 because `sparrow` subprocess does not construct `ObservationHost` / controller / bridge / adapter | L3 shape only | +| 11. Stale observation cannot go backward | ✓ | ✓ (4 sub-cases) | — | L3 shape only | +| 12. Unsupported → evidence | ✓ | ✓ | **not at L2** — `ReloadSkips` is S5 durable-corruption evidence, not observation-layer unsupported-topology. Controller `LastUnsupported` is L1 | L3 shape only | +| 13. Per-volume isolation | ✓ (4 tests) | ✓ | N/A | L3 shape only | +| 14. Placement / rebalance | ✓ (5 tests) | ✓ | N/A | L3 shape only | +| 15. Fence route bounded | ✓ (adapter) | PARTIAL | N/A | L3 shape only | +| 16. Boundary guards | ✓ | N/A (structural) | N/A | N/A | + +Sixteen claims; fifteen PROVEN at their target level; one (Claim 15) PARTIAL — adapter-package follow-up per §5 below. L3 is classification-only at P14 (`v3-phase-14-s8-v2-scenario-classification.md`) — L3 entries are scenario SHAPES, not executed evidence. Runnable L3 is P15 Final Gate (Cluster Validation Agent). Claims 6, 10, and 12 have NARROWED L2 cells where the subprocess binary cannot carry the full claim surface; each is handed off to the correct canonical P15 track. 
+ +## 4. Test Baseline (2026-04-20) + +``` +$ go test ./core/engine ./core/adapter ./core/authority ./cmd/sparrow -count=1 +ok github.com/seaweedfs/seaweed-block/core/engine 0.016s +ok github.com/seaweedfs/seaweed-block/core/adapter 2.344s +ok github.com/seaweedfs/seaweed-block/core/authority 4.569s +ok github.com/seaweedfs/seaweed-block/cmd/sparrow 2.264s + +$ go test ./... -count=1 +(14 packages, all PASS) +``` + +193 Go tests across the P14 scope. L2 subprocess smoke (`TestS7Process_RealSubprocessRestartSmoke`) spawns a real `sparrow` binary twice on the same store directory and asserts the pinned JSON pass-line schema. + +## 5. Residual Risks + +| # | Risk | Scope / carry-forward | +|---|---|---| +| 1 | Windows `os.RemoveAll` on `t.TempDir()` can race with lock-file release during teardown. No correctness impact. | S7 sketch §8.5 — logged as `t.Logf`, not a failure. | +| 2 | Claim 15 PARTIAL: the quantitative fence timeout bound is owned by the adapter package and depends on the still-uncommitted fence-watchdog branch landing. | `v3-phase-14-s8-p15-handoff.md` item 11 — P14 internal follow-up, NOT P15. Once that branch lands, a single adapter test replaces the PARTIAL status. | +| 3 | L2 subprocess drives mint via `StaticDirective`; it does NOT construct `ObservationHost` / controller / bridge / adapter. Full heartbeat-ingress and real-binary adapter route are L1-only in S8. | **P15 T1** Frontend + Data Path (handoff items #1 and #2). | + +**No S8-blocking correctness residuals.** One adapter-owned quantitative fence proof (Claim 15) remains as P14-internal follow-up per row 11 of the handoff table — not a P15 track, not an S8 gate. Zero backward-authority-mint residuals. Zero silent-idle residuals. + +## 6. P15 Track Owners + +Handoff complete per `v3-phase-14-s8-p15-handoff.md`, aligned to canonical `v3-phase-15-product-plan.md` §4. 
Every P14 residual product gap has a named P15 track and a concrete first-proof gate: + +- T1 Frontend + Data Path Contract (includes subprocess heartbeat ingress) +- T2 CSI / External Lifecycle Surface +- T3 External Control API +- T4 Security And Auth Posture +- T5 Diagnostics And Explainability +- T6 Operator Workflow +- T7 V2/V3 Coexistence And Migration +- T8 Deployment / Upgrade / Release Hardening +- Final Gate — Cluster Validation Agent + +## 7. Final P14 Claim + +> **P14 has closed one bounded single-active-master internal topology/control-plane truth loop for the accepted topology set.** Observation is system-fed; authority is durable, single-owner, and restart-recovered; convergence has bounded fate; the real `VolumeBridge → adapter` route re-anchors across restart without backward mint or old-slot revival. +> +> **P14 has NOT closed CSI, external API, frontend data path, migration, security, deployment, performance, or production readiness.** Those nine tracks are handed off to P15 with concrete first-proof gates. Multi-master, V2 `HandleAssignment/promote/demote` semantics, and heartbeat-as-authority are explicit non-goals — not deferrals. + +## 8. Sign-off State + +- **Architect** — pending review of this matrix + handoff. +- **Tester** — pending reproducibility check (commands in §4). +- **14A final targeted pass** — pending (scope in `v3-phase-14a-checklist.md`). + +S8 cannot be accepted until all three sign off. After that, P14 is closed and the next active work is P15 T1 Frontend. 
diff --git a/sw-block/design/v3-phase-14-s8-evidence-matrix.md b/sw-block/design/v3-phase-14-s8-evidence-matrix.md new file mode 100644 index 000000000..85f5fc5bd --- /dev/null +++ b/sw-block/design/v3-phase-14-s8-evidence-matrix.md @@ -0,0 +1,241 @@ +# V3 Phase 14 S8 Evidence Matrix + +Date: 2026-04-20 +Status: draft (S8 evidence consolidation) +Purpose: map every internal P14 claim from `v3-phase-14-s8-final-bounded-close.md` §5 to concrete tests and commands, classify L0/L1/L2/L3 coverage, and surface residual gaps for 14A / P15 handoff + +## 1. How To Read This Matrix + +Each row is one internal P14 claim. Columns: + +- **L0 Unit** — package-local invariant / policy test that proves the claim inside one package. +- **L1 Component** — in-process multi-package route test (real `ObservationHost` + `TopologyController` + `Publisher` + `VolumeBridge` + `VolumeReplicaAdapter`, no shell, no kernel). +- **L2 Process** — real `sparrow` binary or in-process `Bootstrap()` against a real filesystem store. +- **L3 Scenario** — hardware / YAML-driven scenario. For P14 S8, L3 is **classification-only**: the rows below say "L3 shape only, not executed" where a V2 YAML scenario describes the same claim shape. L3 entries are NOT executed evidence; they are the scenario shapes that P15 Cluster Validation (Final Gate) will run. A claim with an L3 shape listed is NOT proven at L3 by S8. +- **Status** — `PROVEN` (explicit test); `PARTIAL` (route covered but some sub-claim deferred); `DEFERRED` (intentionally pushed to later slice / P15); `N/A` (not applicable at that level). +- **Residual** — what is NOT proven and where it is carried. + +All cited L0/L1/L2 tests live in the committed tree; re-run with: + +``` +go test ./core/engine ./core/adapter ./core/authority ./cmd/sparrow -count=1 +``` + +Baseline on 2026-04-20: engine 0.02s / adapter 2.3s / authority 4.6s / sparrow 2.3s — all green, 193 Go tests total across the S4-S7 scope. + +## 2. 
Accepted Topology Claim Recap + +Pinned by the S8 close doc §4. Reproduced here so every row in the matrix below is read against the correct bounded shape: + +1. single active master +2. multiple volumes +3. three bounded replica slots per volume, distinct servers +4. one current authoritative primary, two failover/rebalance candidates +5. publisher-owned `Epoch` / `EndpointVersion` +6. durable current authority line, one record per volume +7. passive convergence (publish → observe → confirm / stuck / supersede) +8. real `VolumeBridge` delivery into adapter/engine + +Anything outside this set is P15 or explicit non-goal. + +## 3. Evidence Matrix + +### Claim 1 — Observation is system-fed, not test-authored + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestObservation_BuilderStampsAuthorityFromReader` / `TestObservation_SupportabilityDoesNotReadClusterSnapshotOutput` / `TestObservation_NeverMintsAuthority_BoundaryGuard` | `TestObservation_EndToEnd_SystemFedSnapshotReachesController` / `TestConvergenceRoute_ObservationFedFailover_ConfirmsAndClearsDesired` / `TestS7_ReloadedAuthority_ReanchorsAdapterViaVolumeBridge` | (covered at L1; `sparrow` run-path uses the same `ObservationHost` construction) | L3 shape only, not executed — `ha-restart-recovery.yaml` / `ha-failover.yaml` carry the L3 shape | PROVEN | none at L0/L1 | + +### Claim 2 — Topology supportability is explicit and fail-closed (partial inventory / missing-server / conflict / duplicate / unknown-volume / expired) + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestObservation_PartialInventory_PendingThenUnsupported` / `TestObservation_MissingServerObservation_NeverOrdinaryIneligible` / `TestObservation_ConflictingPrimaryClaim_Unsupported` / `..._NoWinnerEverChosen` / `TestObservation_DuplicateServerTopology_UnsupportedAndEvidenceRecorded` / `TestObservation_UnknownObservedVolume_ProducesUnsupportedEvidence` / 
`TestObservation_ExpiredFact_SemanticallyIneligible` / `TestObservation_ExpiredDoesNotReachController` / `TestObservation_FreshnessWindowHonored_NotTickMultiples` | `TestObservation_SupportabilityCollapsePropagatesToController` / `TestObservation_SupportabilityCollapseClearsControllerDesired` / `TestS7_Restart_UnsupportedTopologyRecordsEvidence` | N/A (covered by L0/L1) | L3 shape only, not executed — `fault-partition.yaml` is the L3 shape | PROVEN | none | + +### Claim 3 — Durable authority is single-owner, atomic, and restart-recovered + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestFileAuthorityStore_RoundTrip` / `TestFileAuthorityStore_LatestWinsPerVolumeID` / `TestFileAuthorityStore_EnvelopeShape` / `TestDurableAuthority_AtomicWrite_NoTornRecord` / `TestDurableAuthority_CorruptRecord_PerVolumeFailClosed` / `TestDurableAuthority_ProcessLock_ExclusiveOwner` / `TestDurableAuthority_ProcessLock_IdempotentRelease` / `TestDurableAuthority_ProcessLock_SurvivesStaleFileOnDisk` / `TestDurableAuthority_UnreadableIndex_BootFailsClosed` / `TestDurableAuthority_StoreNeverMintsAuthority_BoundaryGuard` / `TestDurableAuthority_StoreWriteCallerAllowlist_BoundaryGuard` / `TestEncodeVolumeIDForFilename_Injective` / `TestEncodeVolumeIDForFilename_NoFilesystemUnsafeOutput` | `TestDurableAuthority_ReloadReproducesCurrentLine` / `TestDurableAuthority_ControllerSeesReloadedLineAtBoot` / `TestDurableAuthority_StoreBackedPublisherRunsNormalRoute` / `TestDurableAuthority_PutFailureRollsBackInMemoryState` / `TestS7_ReloadedAuthority_ReanchorsAdapterViaVolumeBridge` | `TestBootstrap_FreshDir_AcquiresLockAndReloads` / `TestBootstrap_WithExistingRecord_ReloadsPublisherState` / `TestBootstrap_CorruptRecord_LoggedAtStartup` / `TestBootstrap_LockContention_RefusesSecondBootstrap` / `TestS7Process_BootstrapReloadRouteSmoke` / **subprocess**: `TestS7Process_RealSubprocessRestartSmoke` | L3 shape only, not executed — `ha-restart-recovery.yaml` / 
`diag-restart-recovery.yaml` | PROVEN | Windows `os.RemoveAll` may race with lock release at teardown — logged as residual per S7 sketch §8.5; does NOT affect correctness | + +### Claim 4 — No old-slot durable revival after restart + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestDurableAuthority_StalePreFailoverRecordRejectedOnReload` | `TestS7_Restart_OldSlotNotRevivedFromDurable` (explicitly asserts `LastAuthorityBasis("vr","r1")==false` and `("vr","r2")==Epoch=2` post-restart) | covered by S7 subprocess smoke plus in-process two-run `TestS7Process_RestartDoesNotRegressAuthorityLine` | L3 shape only, not executed | PROVEN | none — row 4a/4b split in S7 closes the original ambiguity | + +### Claim 5 — Live-route monotonic guard rejects stale old-slot delivery (pre-restart) + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| adapter-internal: engine monotonic guards covered in `core/adapter/adapter_test.go` and `core/engine/*_test.go` | `TestS7_LiveRoute_OldSlotDeliveryRejectedByAdapter` (real `VolumeBridge`, in-memory publisher holds both per-replica keys, lower-epoch catch-up rejected) | N/A (live-route, pre-restart) | N/A | PROVEN | none | + +### Claim 6 — Controller-driven bind / reassign / refresh over accepted topology (system-driven, not test-driven) + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestTopologyController_InitialPlacementBalancesAcrossVolumes` / `..._InitialPlacementUsesEvidenceTieBreakOnEqualLoad` / `..._FailoverUsesHighestEvidenceCandidate` / `..._RebalanceMovesToLighterServer` / `..._RebalanceSkipsWhenLoadAlreadyWithinBound` / `..._ObservedAuthorityClearsPending` / `..._StalePreConfirmSnapshotDoesNotDuplicateQueuedMove` | Placement / failover: `TestTopologyControllerToPublisher_E2E_MultiVolumePlacementAndFailover` / `TestObservationHost_IdenticalSupportedSnapshots_NoDuplicateAsks` / `TestObservationHost_VolumeRecovers_ClearsStaleDesired`. 
Rebalance: `TestTopologyControllerToPublisher_E2E_Rebalance`. **Refresh: `TestConvergenceRoute_RefreshEndpoint_ConfirmsAndClearsDesired`** (host heartbeat changes slot DataAddr/CtrlAddr while publisher holds old addrs → controller emits `IntentRefreshEndpoint` → publisher bumps `EndpointVersion` on same `Epoch` → bridge delivers refreshed `AssignmentInfo` → confirmation clears desired). | `TestS7Process_RealSubprocessRestartSmoke` covers the Bind mint via the real binary (Reassign/Refresh are L1-only in `sparrow` today; real binary does not ingest heartbeats — see §5 L2 scope note). | L3 shape only, not executed | PROVEN | none | + +### Claim 7 — Convergence: confirmation clears desired + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestConvergence_Confirmation_ClearsDesired` / `TestConvergence_Confirmation_RequiresBothSources` / `TestConvergence_StaleObservation_NoConfirmation` / `TestConvergence_StuckThenConfirm_ClearsStuckEvidence` | `TestConvergenceRoute_ObservationFedFailover_ConfirmsAndClearsDesired` | N/A | L3 shape only, not executed | PROVEN | none | + +### Claim 8 — Publish-not-observed stuck evidence is bounded and non-churning + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestConvergence_PublishNotObserved_StuckAfterWindow` / `TestConvergence_Stuck_DoesNotReMintReassign` / `TestConvergence_Reassign_CalledAtMostOncePerDesired` / `TestConvergence_RetryStateIsPerVolume` / `TestConvergence_SupportedRecovery_ClearsPriorUnsupportedEvidence` | `TestConvergenceRoute_PublishNotObserved_StuckWithoutChurn` | N/A | L3 shape only, not executed | PROVEN | none | + +### Claim 9 — Supersede by newer authority or different decision + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestConvergence_Supersede_PublisherAdvancedToOtherReplica` / `TestConvergence_NormalLag_OldObservationDoesNotSupersede` / `TestConvergence_Supersede_DecisionTableProducesDifferentAsk` | 
`TestConvergenceRoute_Supersede_PublisherMovedElsewhere_DropsStaleDesired` | N/A | L3 shape only, not executed | PROVEN | Case 2 of the supersede rule was intentionally dropped (see S6 sketch §9) — that decision is tested indirectly by `TestConvergence_NormalLag_OldObservationDoesNotSupersede` which would otherwise fail | + +### Claim 10 — Restart re-anchors authority via real `VolumeBridge` into the adapter/engine route + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestDurableAuthority_PublisherAdvancesFromReloaded` | `TestS7_ReloadedAuthority_ReanchorsAdapterViaVolumeBridge` / `TestS7_Restart_RecomputesDesiredNotTransientState` (hard precondition prevents vacuous pass) | **Not at L2.** The `sparrow` subprocess smoke (`TestS7Process_RealSubprocessRestartSmoke`) reloads durable `Publisher` state only; it does NOT construct `ObservationHost`, `TopologyController`, `VolumeBridge`, or `VolumeReplicaAdapter`, so the bridge/adapter re-anchor cannot be counted at L2. L2 proves a strictly narrower subclaim (durable `Publisher` reload — Claim 3). 
| L3 shape only, not executed — `ha-restart-recovery.yaml` is the L3 shape | PROVEN at L1 (full route); L2 COVERS a subclaim only (durable reload, see Claim 3) | Subprocess heartbeat ingress + real-binary adapter route → **P15 T1** (handoff item #2) | + +### Claim 11 — Stale observation cannot move authority backward + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestConvergence_StaleObservation_NoConfirmation` | `TestS7_Restart_StaleObservationCannotMoveBackward` (4 sub-cases: stale-epoch / stale-replica / stale-endpointVersion / unassigned via `snapshottingReader` host-reader freeze) / `TestConvergenceRoute_PublishNotObserved_StuckWithoutChurn` (proves host-reader lag scenario) | covered at L1 | L3 shape only, not executed — `fault-partition.yaml` | PROVEN | stale-basis simulation is via host-reader freeze; heartbeat wire does not carry authority (documented in S7 sketch §9.1) | + +### Claim 12 — Unsupported topology after restart records evidence, not silent idle + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestConvergence_UnsupportedClearsDesiredWithEvidence` / `TestConvergence_PendingClearsDesired` | `TestS7_Restart_UnsupportedTopologyRecordsEvidence` | **Not at L2.** This claim is about the controller's `LastUnsupported(vid)` surface being populated from conflicting / incomplete observation; that requires `ObservationHost` + controller composition, which the subprocess does not construct. `Bootstrap.ReloadSkips` is a DIFFERENT surface (S5 durable-record corruption, see Claim 3 residual path) and is NOT observation-layer unsupported-topology evidence. 
| L3 shape only, not executed — `fault-partition.yaml` + `diag-restart-recovery.yaml` are the L3 shapes | PROVEN at L1 | Operator-visible surface over `LastUnsupported` / `LastConvergenceStuck` → **P15 T5** Diagnostics (handoff item #6) | + +### Claim 13 — Per-volume isolation (one volume's failure does not block others) + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestObservation_BadVolumeDoesNotBlockHealthyVolume_Isolation` / `TestObservation_VolumeUnsupportedEvidence_IsolatedPerVolume` / `TestConvergence_OneStuckVolumeDoesNotBlockOthers` / `TestTopologyController_UnsupportedVolumeDoesNotBlockOtherVolumes` | `TestS7_Restart_PerVolumeIsolation` | N/A | L3 shape only, not executed | PROVEN | none | + +### Claim 14 — Placement / rebalance / failover within the accepted topology + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestTopologyController_InitialPlacementBalancesAcrossVolumes` / `..._FailoverUsesHighestEvidenceCandidate` / `..._RebalanceMovesToLighterServer` / `..._RebalanceSkipsWhenLoadAlreadyWithinBound` / `..._OutOfTopologyAuthority_NoAskPlusEvidence` | `TestTopologyControllerToPublisher_E2E_MultiVolumePlacementAndFailover` / `TestTopologyControllerToPublisher_E2E_Rebalance` | N/A | L3 shape only, not executed — `ha-failover.yaml` carries the L3 shape | PROVEN | none | + +### Claim 15 — Fence path routes through bridge + adapter without spurious Healthy + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| adapter fence-slot / monotonic-guard tests in `core/adapter/adapter_test.go` | `TestS7_Restart_FenceRouteBoundedWithoutCallback` (Fence command reaches executor via real bridge; withheld callback does not produce Healthy) | N/A | L3 shape only, not executed | PARTIAL | Full timeout-expiry quantitative bound (watchdog fires at exact deadline + structured event) is adapter-owned proof (sketch §10.1 Path B — S7 may not reconfigure the fence path). 
Carried to the adapter package's own tests; not an L1 S7 proof | + +### Claim 16 — Boundary guards (no unauthorized `AssignmentInfo` minting, no observation-side asks, no store minting, no hidden adapter ingress) + +| L0 | L1 | L2 | L3 | Status | Residual | +|---|---|---|---|---|---| +| `TestObservation_NeverMintsAuthority_BoundaryGuard` / `TestDurableAuthority_StoreNeverMintsAuthority_BoundaryGuard` / `TestDurableAuthority_StoreWriteCallerAllowlist_BoundaryGuard` / `TestNonForgeability_*` (AST-level guards in `authority_test.go`) | N/A (structural, not route) | N/A | N/A | PROVEN | none | + +## 4. Gap Classification + +Per S8 assignment §3.B, every missing proof must be classified. Scanning the matrix above: + +### 4.1 P14 internal blockers +**None.** Every P14 internal claim has at least one PROVEN row. + +### 4.2 P14 S8 test/harness gap +**Claim 15 (fence quantitative bound).** S7 was deliberately capped at L1 route coverage. The full `fence timeout fires at exact deadline` proof lives in the adapter package and depends on the currently uncommitted fence-watchdog branch being committed. Once that branch lands, a single-file adapter test replaces the current PARTIAL status. Not an S8 gate. + +### 4.3 P15 product-surface gaps (carried forward) +See `v3-phase-14-s8-p15-handoff.md` for the full mapping (aligned to canonical `v3-phase-15-product-plan.md` §4 track numbering). 
Cross-reference summary: + +- Frontend data path + subprocess heartbeat ingress → **P15 T1** Frontend + Data Path +- CSI lifecycle → **P15 T2** CSI / External Lifecycle Surface +- External control API → **P15 T3** External Control API +- Security / auth → **P15 T4** Security And Auth Posture +- Operator diagnostics for `LastUnsupported` / `LastConvergenceStuck` / `ReloadSkips` → **P15 T5** Diagnostics +- Operator workflow (drain / supervised reassign / rebuild) → **P15 T6** Operator Workflow +- V2/V3 migration → **P15 T7** Migration +- Deployment / hardening → **P15 T8** Deployment / Release Hardening +- End-to-end cluster validation → **P15 Final Gate** + +### 4.4 14A verification follow-ups +See `v3-phase-14a-checklist.md` for the sidecar list. The S8 matrix above is the input to 14A's final targeted review. + +### 4.5 Explicit non-goals (do NOT carry) +- multi-master / leader election +- performance / soak claims at P14 close +- long-running stability under production load + +## 5. L2 Process Smoke Decision + +Per S8 assignment §3.D, S8 must answer: are current `sparrow` smoke surfaces sufficient? + +**Answer: scoped YES — L2 proves durable-restart only; the full S4-S7 route remains L1-proven.** (Round-2 architect correction.) + +### 5.1 What L2 proves today + +| Required check | Covered by | +|---|---| +| real process starts | `TestS7Process_RealSubprocessRestartSmoke` spawns the real `sparrow` binary. | +| durable authority reloads | same test + component `TestS7Process_BootstrapReloadRouteSmoke`. | +| restart does not mint backward | subprocess asserts `backwardMint: false` in the pinned JSON pass-line; component `TestS7Process_RestartDoesNotRegressAuthorityLine` asserts reloaded `Epoch` equals pre-restart `Epoch`. | +| structured output can be consumed by tester | subprocess pass-line schema pinned in S7 sketch §8.4. 
| + +### 5.2 What L2 deliberately does NOT prove + +| Not covered at L2 | Where it IS proven | Carry-forward | +|---|---|---| +| "heartbeat wire → `ObservationHost` → `TopologyController`" in a real process. Subprocess drives mints via `StaticDirective`; there is no heartbeat ingress in `sparrow` today. | L1: `TestObservation_EndToEnd_SystemFedSnapshotReachesController`, `TestConvergenceRoute_ObservationFedFailover_ConfirmsAndClearsDesired`, `TestConvergenceRoute_RefreshEndpoint_ConfirmsAndClearsDesired`, `TestS7_ReloadedAuthority_ReanchorsAdapterViaVolumeBridge` (all run real `ObservationHost` → controller → publisher → bridge → adapter, in-process). | **P15 T1 Frontend + Data Path** (subprocess heartbeat ingress is part of the frontend/transport contract). Handoff item #2 in `v3-phase-14-s8-p15-handoff.md`. | +| "assignment reaches adapter via bridge IN THE REAL BINARY". The subprocess binary does not construct a VolumeReplicaAdapter — only the in-process component test does. | L1: `TestS7_ReloadedAuthority_ReanchorsAdapterViaVolumeBridge` composes the full stack in-process and waits for adapter `ModeHealthy`. | **P15 T1 Frontend + Data Path** — the first real-binary adapter ↔ frontend route is a T1 deliverable. | +| Multi-process mixed-route scenarios (separate authority + observing processes). | N/A at P14 | **P15 Final Gate** (Cluster Validation). | + +### 5.3 Decision + +**No new S8 smoke is added.** The scope split above is honest: + +- L2 subprocess smoke correctly proves durable restart truth. +- The full S4-S7 composed route is proven at L1 — including the `RefreshEndpoint` path added in this pass. +- Subprocess heartbeat ingress and full real-binary adapter route are genuinely P15 T1 surfaces; inventing a hidden `--s8-route-smoke` heartbeat fixture would be a P15 surface in P14 clothing. + +This is a downgrade from the round-1 matrix claim ("L2 covers the S4-S7 internal route"), which was an overclaim. + +## 6. 
Baseline Command Result (2026-04-20) + +``` +$ go test ./core/engine ./core/adapter ./core/authority ./cmd/sparrow -count=1 +ok github.com/seaweedfs/seaweed-block/core/engine 0.016s +ok github.com/seaweedfs/seaweed-block/core/adapter 2.344s +ok github.com/seaweedfs/seaweed-block/core/authority 4.569s +ok github.com/seaweedfs/seaweed-block/cmd/sparrow 2.264s + +$ go test ./... -count=1 +(14 packages, all PASS; runtime/schema have no tests) +``` + +Evidence is reproducible by any reviewer on Windows / Linux from the committed tree on `phase-14` at commit `a6421f9` or later. + +## 7. Residual Risks + +| # | Risk | Scope | +|---|------|-------| +| 1 | Windows `os.RemoveAll` on the lock file during `t.TempDir()` teardown races with lock release. Does NOT affect correctness — only teardown. | S7 sketch §8.5 residual; logged as `t.Logf` in `restart_route_test.go`. | +| 2 | Claim 15 partial: L1 covers "fence command dispatched via real bridge + withheld callback does not produce Healthy"; full timeout-expiry bound is adapter-package-owned. | Closes when adapter fence-watchdog branch commits. Not an S8 blocker. | +| 3 | L2 subprocess smoke drives the directive via `StaticDirective`, not real heartbeat ingress to `ObservationHost`. The full heartbeat-path is L1-proven; subprocess widening requires a P15 external surface. | P15 Frontend / Cluster Validation. | + +**No S8-blocking correctness residuals.** One adapter-owned quantitative fence proof (Claim 15) remains as a P14-internal follow-up, landing when the fence-watchdog branch commits. No backward-authority-mint risk on restart. No unsupported-topology silent-idle path. + +## 8. Closure Readiness + +Against S8 acceptance gate (`v3-phase-14-s8-assignment.md` §6): + +1. evidence matrix exists and is honest — **yes** (this document) +2. P14 internal blockers fixed or explicitly non-P14 — **yes** (§4.1 empty) +3. L0 + L1 tests pass — **yes** (baseline §6) +4. 
L2 process smoke present or absence classified — **yes** (§5; present) +5. L3 classified — **yes** (L3 shape only, not executed this slice; see `v3-phase-14-s8-v2-scenario-classification.md`) +6. 14A final pass has no open P14 safety blocker — **pending** (14A review on this matrix) +7. P15 handoff maps every product gap — **yes** (see `v3-phase-14-s8-p15-handoff.md`) +8. final closure statement does not claim production readiness — **yes** (see `v3-phase-14-s8-closure.md`, draft) + +**S8 is evidence-complete pending 14A sign-off.** Awaiting architect + tester review. diff --git a/sw-block/design/v3-phase-14-s8-final-bounded-close.md b/sw-block/design/v3-phase-14-s8-final-bounded-close.md new file mode 100644 index 000000000..55d20fd05 --- /dev/null +++ b/sw-block/design/v3-phase-14-s8-final-bounded-close.md @@ -0,0 +1,212 @@ +# V3 Phase 14 S8 Final Bounded Close + +Date: 2026-04-19 +Status: draft +Purpose: close the bounded P14 internal topology/control-plane claim and hand off product surfaces to P15 without pretending P14 is production-ready block storage + +## 1. Target + +`P14 S8` closes one bounded internal control-plane product shape: + +`heartbeat/observation -> ClusterSnapshot -> TopologyController -> Publisher durable authority -> VolumeBridge -> adapter/engine -> observed convergence/restart evidence`. + +S8 is not another protocol slice and not a broad product surface slice. It is the final acceptance and evidence package for the internal single-active-master topology/control-plane loop built by S4-S7. + +## 2. What S8 Must Prove + +S8 must prove, at the appropriate test layer, that: + +1. observation is system-fed, not test-authored +2. topology supportability is explicit and fail-closed +3. authority is durable, single-owner, and restart-recovered +4. controller decisions are system-driven for the accepted topology set +5. convergence has bounded fate: confirmed, stuck, or superseded +6. 
restart re-anchors current authority through the real bridge/adapter route +7. stale observation, stale old-slot delivery, and unsupported topology do not silently move the system backward +8. per-volume isolation holds under mixed supported/unsupported states +9. the accepted topology set and unsupported list are explicit +10. every product-facing gap is handed to P15, not hidden inside P14 closure + +## 3. What S8 Does Not Prove + +S8 does not prove: + +1. CSI completion +2. iSCSI/NVMe product frontend completion +3. external volume lifecycle API completion +4. real user data-path production readiness +5. V2/V3 migration readiness +6. multi-master HA / leader election / distributed authority +7. final performance or soak readiness + +Those are P15 tracks and the P15 final cluster validation gate. + +S8 may prepare scenario shapes and port test runner muscle for those later gates, but it must not claim them as P14 closure. + +## 4. Accepted Topology Claim + +The S8 supported topology claim is bounded to: + +1. single active master +2. multiple volumes +3. three bounded replica slots per volume +4. distinct server per slot +5. one current authoritative primary +6. two bounded failover/rebalance candidates +7. publisher-owned `Epoch` / `EndpointVersion` +8. durable current authority line, one record per volume +9. passive convergence: publish, observe, confirm / stuck / supersede +10. real `VolumeBridge` delivery into the adapter/engine path + +Anything outside this set is unsupported or deferred. + +## 5. Required Evidence Matrix + +S8 must produce a closure table with these columns: + +| Claim | L0 Unit | L1 Component | L2 Process | L3 Scenario | Status | Residual | +|---|---|---|---|---|---|---| + +Minimum claims: + +1. observation supportability +2. duplicate/missing/conflicting inventory handling +3. durable authority reload +4. no old-slot durable revival +5. controller-driven bind/reassign/refresh +6. convergence confirm/clear +7. 
publish-not-observed stuck evidence +8. supersede by newer authority +9. restart re-anchor via `VolumeBridge` +10. stale observation cannot move backward +11. unsupported topology records evidence +12. per-volume isolation +13. placement/rebalance within accepted topology +14. P15 handoff gaps identified + +## 6. Required Test Levels + +### L0 Unit / Invariant + +Required command baseline: + +```text +go test ./core/engine ./core/adapter ./core/authority -count=1 +``` + +Must include boundary guards for: + +1. no non-authority `AssignmentInfo` minting +2. no observation code constructing assignment asks +3. no store minting authority +4. no hidden adapter ingress + +### L1 Component + +Required routes: + +1. `ObservationHost -> TopologyController -> Publisher -> VolumeBridge -> VolumeReplicaAdapter` +2. observation-fed failover confirms and clears desired +3. stuck evidence appears without ask churn +4. supersede drops stale desired +5. restart route reloads and re-anchors authority +6. old-slot live delivery is rejected by adapter monotonicity +7. unsupported topology records evidence + +### L2 Local Process + +Required shapes: + +1. real `sparrow` durable authority restart smoke +2. process-level S4-S7 route smoke if the current binary surface can expose it cleanly +3. if process route smoke cannot be added without inventing P15 surfaces, S8 must record the blocker and carry it to P15 T1/T2/T3 + +L2 tests must check exit code, artifact/log output, lock release, reload count, and no backward mint. + +### L3 Hardware / Scenario + +S8 should port scenario shapes from V2 and classify each as: + +1. runnable in P14 internal mode +2. blocked by missing P15 frontend/API/data-path surface +3. deferred to P15 final cluster validation + +Initial scenario candidates: + +1. restart recovery +2. failover +3. partition / delayed heartbeat +4. unsupported topology +5. 
IO continuity only if frontend/data path exists + +S8 can close P14 with L3 blockers only if the blockers are genuinely P15 product-surface gaps, not internal truth gaps. + +## 7. V2 Port Plan + +### Classify now (S8 scope), port deferred to P15 + +S8 **does not port testrunner machinery into the V3 tree** (round-2 doc-consistency fix, aligned with `v3-phase-14-s8-v2-scenario-classification.md` §2). The items below are the V2 muscles S8 inventories and classifies; actual porting is a P15 Final Gate (Cluster Validation Agent) deliverable, not an S8 one. + +1. `weed/storage/blockvol/testrunner/` + - **classify now**: runner architecture, scenario YAML shape, action vocabulary, artifact collection, report generation — recorded as PORT-MECHANISM in the classification table. **Port in P15 Final Gate.** + +2. `weed/storage/blockvol/testrunner/scenarios/public` and selected internal scenarios + - **classify now**: map each scenario shape to RUNNABLE-P14 / BLOCKED-FRONTEND / BLOCKED-OPS / BLOCKED-HA / BLOCKED-PERF. **Port in P15 Final Gate.** + +3. `weed/storage/blockvol/test/component/` + - **classify now**: component harness structure and failure-injection style. **Port in P15 T1 Frontend + Data Path** as that track's real-route harness. + +4. `weed/server/qa_block_*` + - **classify now**: coverage taxonomy and route scenario ideas as reference for the evidence matrix (§5). **Port only as applicable per P15 track needs.** + +5. `learn/test` evidence convention + - **classify now**: manifest / result / report / run-notes convention referenced as the target shape for future P15 evidence bundles. **Adopt when P15 ships real L3/L4 runs.** + +### Do not port + +1. `promotion.go` +2. `HandleAssignment` / `promote` / `demote` +3. local runtime self-promotion +4. heartbeat timing directly becoming authority +5. V2 authority ownership or HA assumptions +6. CSI/iSCSI/NVMe product integration as P14 close criteria + +## 8. Deliverables + +S8 must deliver: + +1. 
final P14 supported-topology statement +2. final P14 unsupported/deferred list +3. acceptance evidence matrix +4. scenario-port classification table +5. tester sign-off packet template +6. 14A final targeted review checklist +7. P15 handoff statement mapping every remaining gap to a P15 track +8. code/tests only where needed to make the evidence route real + +## 9. Reject Conditions + +Reject S8 if: + +1. closure is based only on local unit tests +2. a claimed route still uses manual assignment choreography +3. S8 invents CSI/API/operator surfaces to make tests look complete +4. unsupported topology becomes silent idle +5. stale observation can emit a newer decision from an old basis +6. durable restart can revive stale old-slot truth +7. convergence can churn epochs without observation confirmation +8. S8 claims production readiness without P15 frontend/data-path/security/cluster validation + +## 10. Closure Statement Shape + +The final S8 closure statement must say exactly: + +1. what P14 now owns +2. what P14 explicitly does not own +3. what evidence exists at L0/L1/L2/L3 +4. what risks remain +5. which P15 track owns each product gap + +The expected final claim is: + +`P14 has closed one bounded single-active-master internal topology/control-plane truth loop for the accepted topology set. P14 has not closed CSI, external API, frontend data path, migration, security, deployment, performance, or production readiness.` diff --git a/sw-block/design/v3-phase-14-s8-p15-handoff.md b/sw-block/design/v3-phase-14-s8-p15-handoff.md new file mode 100644 index 000000000..672da5b41 --- /dev/null +++ b/sw-block/design/v3-phase-14-s8-p15-handoff.md @@ -0,0 +1,86 @@ +# V3 Phase 14 S8 — P15 Handoff + +Date: 2026-04-20 +Status: draft (S8 P15 handoff, round-2 aligned to canonical P15 plan) +Purpose: map every P14-residual product gap to the canonical P15 track in `v3-phase-15-product-plan.md`, so P15 execution starts from the correct ownership map + +## 1. 
Handoff Contract + +Per `v3-phase-14-s8-assignment.md` §3.F: every gap that P14 does NOT close must appear here with four columns — the gap, why it is not P14, which canonical P15 track owns it, and the minimum first proof gate. + +**Canonical P15 track numbering** (from `v3-phase-15-product-plan.md` §4): + +1. `T1` Frontend + Data Path Contract +2. `T2` CSI / External Lifecycle Surface +3. `T3` External Control API +4. `T4` Security And Auth Posture +5. `T5` Diagnostics And Explainability +6. `T6` Operator Workflow +7. `T7` V2/V3 Coexistence And Migration +8. `T8` Deployment / Upgrade / Release Hardening +9. `Final Gate` Cluster Validation Agent + +S8 does NOT invent new tracks, renumber existing ones, or prescribe P15 implementation — only the acceptance gate each P15 track must pass for the handed-off gap. + +## 2. Handoff Table + +| # | Gap | Why not P14 | P15 owner track | Required first proof | +|---|---|---|---|---| +| 1 | **Frontend data path** (attach → real read → real write → stale-primary fence at frontend boundary) | P14 stops at adapter/engine projection; there is no user-data I/O path in P14, and `adapter.PublishHealthy` is a control-plane signal, not a data-path handshake. | **P15 T1** Frontend + Data Path | L2: V3 frontend backend attaches to an internal volume, writes data, reads it back, triggers P14 reassignment/failover, and verifies the stale primary can no longer serve writes. Covered by T1 §5 Acceptance. | +| 2 | **Subprocess heartbeat ingress** — the L2 subprocess smoke (`TestS7Process_RealSubprocessRestartSmoke`) drives mint via `StaticDirective`, not a real heartbeat wire. The full `heartbeat → observation → controller` path is L1-only in the subprocess. | Real heartbeat ingress is a frontend/transport surface. P14 has no intention of adding one. 
| **P15 T1** Frontend + Data Path (observation ingress is part of the frontend/transport contract) | L2: subprocess accepts heartbeats over a real transport; observation store mutates; controller sees the change; adapter advances — end-to-end in a real process. | +| 3 | **CSI driver lifecycle** (CreateVolume / DeleteVolume / ControllerPublish / NodeStage / NodePublish / NodeUnpublish / ControllerUnpublish) | P14 has no orchestrator-facing lifecycle surface; CSI sits above the P14 authority/adapter route. | **P15 T2** CSI / External Lifecycle Surface | L2: CSI sidecar ↔ sparrow process, full Create → Publish → use → Unpublish → Delete cycle via real CSI gRPC against V3 authority (T2 §5 Acceptance). | +| 4 | **External control API** (REST / gRPC verbs for list / inspect / create / delete / resize / pause / resume) | P14 exposes hidden test flags on `sparrow` only. An external API is a net-new surface. | **P15 T3** External Control API | L2: OpenAPI / gRPC verbs backed by V3 authority state, exercised by an external client that does NOT construct `AssignmentInfo` directly. | +| 5 | **Security / auth posture** (API authn/authz, CHAP for iSCSI, TLS/mTLS where applicable, audit trail) | P14 has no external authn surface. | **P15 T4** Security And Auth | L1+L2: CHAP positive+negative for any iSCSI target shipped; API authn positive+negative; tenant-scoped listing; audit log for mutating verbs. | +| 6 | **Operator diagnostics surface** (health dashboard, structured readout of `LastUnsupported` / `LastConvergenceStuck` / `ReloadSkips`, alerts) | P14 records internal evidence (per-volume maps + structured JSON pass-line in S7 subprocess smoke) but does NOT expose it to operators. | **P15 T5** Diagnostics And Explainability | L2: structured `/healthz` + `/metrics` surfacing P14 internal states; L3 `cp85-metrics-verify.yaml` scenario class reshaped for V3. 
| +| 7 | **Operator workflow** (planned failover, graceful drain, supervised reassign, supervised rebuild) | P14 rejects V2 `promote/demote/HandleAssignment`. Operator-initiated actions MUST route through the publisher mint path as ordinary `AssignmentAsk`s. A new workflow surface is required. | **P15 T6** Operator Workflow | L2: operator-triggered reassign lands in `TopologyController` as an `AssignmentAsk`; publisher mints new epoch; adapter converges. Manual promote is explicitly reshaped from V2 semantics. | +| 8 | **V2 ↔ V3 migration / coexistence** (online migrate a V2 volume to V3 authority; rollback; mixed V2+V3 cluster) | Migration is a product-level transition requiring both T1 and T6. P14 is V3-only. | **P15 T7** V2/V3 Coexistence And Migration | L3 + L4: `op-upgrade-rollback.yaml` scenario class on a mixed cluster; explicit no-split-brain check under migration. | +| 9 | **Deployment / hardening** (systemd / container packaging, lockfile+storedir pathing, crash policy, log rotation, store backup) | `sparrow` is currently a test binary with hidden smoke flags; no deployment story. | **P15 T8** Deployment / Upgrade / Release Hardening | L2: deployable package starts `sparrow` with the durable store correctly mounted; crash-restart cycle preserves the P14 restart truth. L4: release-hardening soak (`cp84-soak-4h.yaml`, `cp85-soak-24h.yaml`). | +| 10 | **Final cluster validation** (end-to-end: operator creates volume → data written → primary killed → failover → operator reads data back → no loss, on real hardware) | P14 covers the control plane only; full cluster validation needs T1 + T2 + T3 + T4 + T5 + T6 all landed. | **P15 Final Gate** Cluster Validation Agent | Composite L3/L4 pack: `smoke-block-api` + `ha-restart-recovery` + `ha-failover` + `ha-io-continuity` + `fault-partition` in one run on real hardware, with evidence artifacts in `learn/test/`. 
| +| 11 | **Adapter fence-watchdog quantitative timeout bound** (watchdog fires at exact fence deadline + structured watchdog event for fence lineage) | S7 L1 covers "Fence command reaches executor via bridge + withheld callback does not produce Healthy"; the quantitative bound is adapter-package-owned and depends on the currently-uncommitted fence-watchdog branch. S7 sketch §10.1 Path B forbids S7 from reconfiguring the fence path. | **P14 internal follow-up** (NOT a P15 track) | Single adapter-package test replacing the PARTIAL row for evidence-matrix Claim 15. Lands when the fence-watchdog branch commits. Not an S8 gate. | + +## 3. Out-Of-Scope (NOT Handed Off) + +These are explicit non-goals for both P14 and current P15 bounds — not failures, not deferrals, permanent exclusions. + +| Non-goal | Reason | +|---|---| +| Multi-master / leader election / distributed authority store | `v3-phase-14-s8-final-bounded-close.md` §3.6 and `v3-phase-15-product-plan.md` §2 both reject. Single-active-master is the accepted topology. Any multi-master claim would require a separate phase with a distributed-authority institution. | +| V2 `HandleAssignment` / `promote` / `demote` authority-owning semantic | Rejected at S2, reaffirmed S3-S8. Not ported under any P15 track. Operator promote surface in T6 is reshaped, not ported. | +| V2 heartbeat-as-authority | Rejected at S4. Heartbeats are observation inputs only under any P14 or P15 surface. | + +## 4. Handoff Integrity Check + +The table splits into two distinct row classes, each with its own integrity rule: + +**Rows 1-10 — P15 product-surface gaps.** Each of these must satisfy all four checks as of 2026-04-20: + +1. **Gap is not proven at any P14 level.** Cross-reference `v3-phase-14-s8-evidence-matrix.md` §3 — none of rows 1-10 appear as PROVEN or PARTIAL rows of the internal matrix. These are genuinely product-surface gaps P14 never attempted. +2. 
**Gap is assigned to a canonical P15 track.** Numbering matches `v3-phase-15-product-plan.md` §4. +3. **First-proof gate is concrete**, not a hand-wave. Every row names a specific test layer and a specific scenario or surface shape. +4. **Gap does not silently depend on a P14 claim being wider than is actually proven.** Example: P15 T1 is allowed to depend on P14 Claim 10 (restart re-anchors via VolumeBridge, PROVEN at L1) but NOT on any data-path claim — P14 makes none. + +**Row 11 — P14 internal follow-up (exception to rule #1).** Row 11 is NOT a P15 gap; it is a targeted P14 internal completion that lands when the fence-watchdog branch commits. Its integrity rules are narrower: + +1. **Row 11 MUST correspond to an existing PARTIAL or subclaim-only row of the internal matrix** — the whole point of the row is to name the follow-up that closes that partial. In this case row 11 corresponds to evidence-matrix **Claim 15 PARTIAL** (fence quantitative timeout bound). This is the inverse of rule #1 above: row 11 is listed here precisely BECAUSE it appears as PARTIAL in the matrix. +2. **Row 11 MUST NOT be assigned to any P15 track** — it is explicitly labeled "P14 internal follow-up (NOT a P15 track)" in the gap table. +3. **Row 11's first-proof gate MUST be a single-package test that closes the matrix PARTIAL row** — not a multi-track composite. + +Splitting the integrity check this way keeps rule #1 for P15 gaps strict (any cross-reference to PROVEN/PARTIAL in the matrix is a bug there) while making row 11's purpose explicit and auditable (every P14 internal follow-up row should have a matrix PARTIAL anchor). + +## 5. Per-Track Dependency Summary + +| P15 Track | Depends on P14 claims | Notes | +|---|---|---| +| T1 Frontend + Data Path | Claims 3, 6, 10, 14 (durable reload, controller-driven, restart re-anchor, placement/rebalance) | Core product dependency. | +| T2 CSI / Lifecycle | T1 + Claims 6, 10 | CSI needs a working data path. 
| +| T3 External API | Claims 6, 14 | API translates external verbs to `AssignmentAsk`s. | +| T4 Security | T1 + T3 | Authn/authz sits above the surfaces T1 and T3 expose. | +| T5 Diagnostics | Claims 2, 8, 11, 12 (supportability / stuck / unsupported evidence) | Surfaces what P14 already records internally. | +| T6 Operator Workflow | Claims 6, 9, 14 (controller-driven, supersede, placement) | Operator actions are `AssignmentAsk`s through the controller. | +| T7 Migration | T1 + T6 | Needs both frontend and operator surfaces. | +| T8 Deployment | Claim 3 (durable reload) + T1 | Depends on durable store shape and frontend. | +| Final Gate | All P14 claims + all P15 tracks | Composite. | + +## 6. Closure + +Every product gap that S8 identifies is listed here with a canonical P15 track or explicit P14 follow-up. Track numbering matches `v3-phase-15-product-plan.md`. P15 execution starts from this table without needing to re-derive the ownership map. diff --git a/sw-block/design/v3-phase-14-s8-v2-scenario-classification.md b/sw-block/design/v3-phase-14-s8-v2-scenario-classification.md new file mode 100644 index 000000000..4df09426d --- /dev/null +++ b/sw-block/design/v3-phase-14-s8-v2-scenario-classification.md @@ -0,0 +1,107 @@ +# V3 Phase 14 S8 — V2 Scenario Port Classification + +Date: 2026-04-20 +Status: draft (S8 scenario classification) +Purpose: classify every V2 testrunner scenario (`weed/storage/blockvol/testrunner/scenarios/`) as P14-runnable / P15-blocked / deferred, producing the table required by `v3-phase-14-s8-assignment.md` §3.C + +## 1. Classification Vocabulary + +| Class | Meaning | P14 S8 action | +|---|---|---| +| **RUNNABLE-P14** | Shape maps onto S4-S7 internal surfaces. Runnable with V3 binary + no external frontend. | Keep scenario YAML shape; adapt actions to V3 route; run as L3 classification only (S8 does not need to execute L3 itself). | +| **BLOCKED-FRONTEND** | Requires iSCSI / NVMe / CSI / real data path. P15 Frontend track gate. 
| Preserve scenario reference; hand off to P15 with required precondition. | +| **BLOCKED-OPS** | Requires operator CLI / HTTP / admin workflow. P15 Ops track gate. | Hand off to P15 Ops. | +| **BLOCKED-HA** | Requires multi-master / leader election / distributed authority. Outside P14/P15 bounded claim. | Mark as out-of-scope for both P14 and current P15 bounds. Document only. | +| **BLOCKED-PERF** | Benchmark / soak / perf-baseline; not a correctness gate. | Defer to release-hardening track; document only. | +| **PORT-MECHANISM** | Testrunner machinery itself (not a scenario): action vocabulary, artifact collection, scenario YAML shape. | Port machinery without V2 authority semantics (see `v3-phase-14-s6-s8-v2-port-plan.md` §5). | + +## 2. Testrunner Machinery (port decision) + +`weed/storage/blockvol/testrunner/` — YAML-driven runner with action vocabulary, artifact collection, report generation. V3 acceptance pack (P14 S8 carries scenario SHAPES forward; actual V3 testrunner integration is a P15 Cluster Validation track deliverable). + +| Component | Classification | S8 action | +|---|---|---| +| `testrunner/*.go` (engine / parser / registry / reporter / metrics) | PORT-MECHANISM | Port to V3 when V3 has a stable CLI surface. S8 does not run the runner directly. | +| `testrunner/actions/*.go` (37 registered actions) | PORT-MECHANISM | Port per-action as V3 surfaces appear. S8 ports NONE directly (no V3 CLI surface exists yet outside `sparrow` hidden smoke flags). | +| `testrunner/scenarios/*.yaml` | scenario-by-scenario below | — | + +**S8 decision:** do not port testrunner machinery into the V3 tree as part of S8. The scenario SHAPES below are what S8 classifies; the runner itself is a P15 Cluster Validation deliverable. + +## 3. Public Scenarios Table + +Path: `weed/storage/blockvol/testrunner/scenarios/public/`. 
+
+| # | Scenario | P14 route it touches | Class | Notes |
+|---|---|---|---|---|
+| 1 | `smoke-block-api.yaml` | block create / write / read / verify | **BLOCKED-FRONTEND** | Requires V3 block API surface + real data path. No V3 equivalent exists. P15 Frontend track. |
+| 2 | `smoke-iscsi.yaml` | kernel iSCSI sanity | **BLOCKED-FRONTEND** | Requires iSCSI target + kernel initiator. P15 Frontend. |
+| 3 | `smoke-kv.yaml` | KV layer smoke | **BLOCKED-FRONTEND** | KV path is out of P14 scope. P15 Frontend. |
+| 4 | `e2e-block.yaml` / `e2e-block-auto.yaml` | end-to-end block I/O with V3 backend | **BLOCKED-FRONTEND** | Same preconditions as #1. |
+| 5 | `e2e-kv.yaml` / `e2e-kv-auto.yaml` / `e2e-combined-auto.yaml` | KV + block combined | **BLOCKED-FRONTEND** | Same as #3. |
+| 6 | `ha-restart-recovery.yaml` | restart-and-reload closed loop | **RUNNABLE-P14** | Shape maps directly onto S7 restart route. L1 / L2 subprocess smoke already proves the route in-process; scenario YAML would just drive the same route under the scenario harness. Keep shape; port when V3 testrunner lands. Covered by Claim 3 / Claim 10 of the evidence matrix at L0/L1/L2. |
+| 7 | `ha-failover.yaml` | failover under live I/O | **BLOCKED-FRONTEND** | Shape touches S4-S7 (observation → controller → reassign → adapter), but "under live I/O" requires the data path. Classification: control-plane portions are RUNNABLE-P14 (covered by Claim 14 at L1); I/O-continuity portions are BLOCKED-FRONTEND. Split at P15 port time. |
+| 8 | `ha-full-lifecycle.yaml` | bind → heal → failover → rebuild → restart full cycle | **BLOCKED-FRONTEND** | Contains rebuild + I/O. Rebuild is P14-internal but has no V3 equivalent outside V2 bridge; I/O is frontend. Defer whole scenario to P15. |
+| 9 | `ha-io-continuity.yaml` | zero data loss across failover | **BLOCKED-FRONTEND** | Entirely data-path. P15 Frontend. |
+| 10 | `ha-rebuild.yaml` | full-extent rebuild via transport | **BLOCKED-FRONTEND** | Rebuild transport is V2-side; V3 adapter has no rebuild surface in current slice. Deferred. |
+| 11 | `crash-recovery.yaml` | process kill + restart + verify | **RUNNABLE-P14** | Control-plane portion identical to #6. Data-verify portion is BLOCKED-FRONTEND. Split at port time. |
+| 12 | `diag-restart-recovery.yaml` | diagnostics on restart | **RUNNABLE-P14** (control-plane subset) | S7 subprocess smoke emits structured JSON (`Bootstrap.ReloadedRecords`, `ReloadSkips`, no-backward-mint); maps onto this shape. Operator dashboards / full diag bundle are P15 Diagnostics. |
+| 13 | `fault-partition.yaml` | network partition + recovery | **RUNNABLE-P14** (control-plane) | Control-plane partition → stale observation / convergence stuck is covered by Claims 8, 11 at L1. Real netem + data-path partition is BLOCKED-FRONTEND. |
+| 14 | `fault-netem.yaml` | generic network fault injection | **BLOCKED-FRONTEND** | Needs data path under load. P15. |
+| 15 | `fault-disk-full.yaml` | ENOSPC on primary | **BLOCKED-FRONTEND** | Needs write path. P15. |
+| 16 | `consistency-epoch.yaml` | epoch monotonicity across failover | **RUNNABLE-P14** | Pure control-plane claim, already covered at L0 by `TestDurableAuthority_PublisherAdvancesFromReloaded` and at L1 by `TestS7_*` restart tests. L3 version would add cross-process multi-primary validation. |
+| 17 | `consistency-lease.yaml` | lease-based guard | **BLOCKED-OPS** | Lease semantics are V2-specific; V3 uses publisher-owned Epoch. Would need a reshaped V3 scenario. Defer. |
+| 18 | `lease-expiry-write-gate.yaml` | write blocked on lease expiry | **BLOCKED-FRONTEND** | Write path + lease. P15. |
+| 19 | `lease-renewal-under-io.yaml` | lease renewal during I/O | **BLOCKED-FRONTEND** | Same. |
+| 20 | `cp11b3-auto-failover.yaml` | automatic failover trigger | **RUNNABLE-P14** (control-plane) | Control-plane covered at L1 by `TestTopologyControllerToPublisher_E2E_MultiVolumePlacementAndFailover`. |
+| 21 | `cp11b3-manual-promote.yaml` | operator-initiated promote | **BLOCKED-OPS** | Requires operator CLI + admin API. P15 Ops. Also, manual promote is V2 semantics; V3 does not have a direct equivalent by design (S6-S8 V2 port plan §2 rejects V2 `promote` ownership). |
+| 22 | `cp11b3-fast-reconnect.yaml` | reconnect skips unnecessary failover | **RUNNABLE-P14** (control-plane) | S6 normal-lag handling covers this (`TestConvergence_NormalLag_OldObservationDoesNotSupersede`). The full scenario requires a real reconnect transport: BLOCKED-FRONTEND for the transport layer, RUNNABLE-P14 for the control plane. |
+
+**Public scenarios summary:**
+- RUNNABLE-P14: 6 scenarios (ha-restart-recovery, crash-recovery control plane, diag-restart-recovery, fault-partition control plane, consistency-epoch, cp11b3-auto-failover control plane); all are covered by Claims 3/10/11/12/14 at L0/L1/L2.
+- BLOCKED-FRONTEND: 13 scenarios
+- BLOCKED-OPS: 2 scenarios (consistency-lease, cp11b3-manual-promote)
+- Mixed (split required at port time): ha-failover (counted under BLOCKED-FRONTEND above) and cp11b3-fast-reconnect (tabled as RUNNABLE-P14 control-plane, not counted in the 6 above)
+
+## 4. Internal Scenarios Table (selected, by category)
+
+Path: `weed/storage/blockvol/testrunner/scenarios/internal/`. 50+ files, classified by category rather than per file.
+
+| Category | Example files | Class | Notes |
+|---|---|---|---|
+| Recovery baselines | `recovery-baseline-restart.yaml` / `recovery-baseline-failover.yaml` / `recovery-baseline-partition.yaml` / `recovery-baseline-rebuild.yaml` | RUNNABLE-P14 (control-plane) | Restart/failover/partition portions map onto S4-S7; rebuild portions need V3 rebuild surface (not in S8). |
+| Coordination dev-cycle | `coord-dev-cycle.yaml` / `coord-ha-failover.yaml` / `coord-smoke-iscsi.yaml` | BLOCKED-FRONTEND | iSCSI / end-to-end workflows. |
+| CP103 performance matrix | `cp103-*.yaml` | BLOCKED-PERF | Performance, not correctness. Defer to release hardening. |
+| CP85 chaos | `cp85-chaos-partition.yaml` / `cp85-chaos-primary-kill-loop.yaml` / `cp85-chaos-replica-kill-loop.yaml` / `cp85-role-flap.yaml` / `cp85-session-storm.yaml` | Mixed | Control-plane portions RUNNABLE-P14 as property tests (kill-loop restart, role-flap convergence). I/O portions BLOCKED-FRONTEND. |
+| CP85 metrics / observability | `cp85-metrics-verify.yaml` | BLOCKED-OPS | Operator metrics pipeline. P15 Diagnostics. |
+| CP85 soak | `cp85-soak-24h.yaml` / `cp84-soak-4h.yaml` | BLOCKED-PERF | Long-running stability. Release hardening. |
+| CP85 database / filesystem | `cp85-db-ext4-fsck.yaml` / `cp85-db-sqlite-crash.yaml` / `cp85-expand-failover.yaml` | BLOCKED-FRONTEND | Real filesystem + DB workload. |
+| Snapshot | `cp11a4-snapshot-export-import.yaml` / `cp83-snapshot-expand.yaml` / `cp85-snapshot-stress.yaml` | BLOCKED-FRONTEND | Snapshot API is P15 Frontend / Ops. |
+| EC / Erasure | `ec3-*.yaml` / `ec5-*.yaml` | BLOCKED-FRONTEND | Data path erasure. |
+| HA extensions | `ha-failover-during-rebuild.yaml` / `ha-multi-client-failover.yaml` | BLOCKED-FRONTEND | Multi-client data path. |
+| Benchmark | `benchmark-*.yaml` / `bench-validated.yaml` / `baseline-full-roce.yaml` / `fsync-only-test.yaml` | BLOCKED-PERF | Performance. |
+| DM / stripe | `dm-stripe-two-server.yaml` | BLOCKED-FRONTEND | Device mapper. |
+| Operator lifecycle | `op-upgrade-rollback.yaml` / `op-csi-lifecycle.yaml` / `op-failure-injection.yaml` | BLOCKED-OPS | P15 Ops + Migration. |
+| Real-workload validation | `cp13-8-real-workload-validation.yaml` | BLOCKED-FRONTEND + BLOCKED-PERF | Full stack. |
+
+**Internal scenarios summary:**
+- Directly useful now as P14-internal control-plane L3 shape: ~8 scenarios (recovery-baseline × 3, cp85 chaos × 3, plus consistency-epoch and diag-restart-recovery from the public set). All are already covered at L0/L1/L2 by the evidence matrix; a runnable L3 version is P15 Cluster Validation work.
+- BLOCKED-FRONTEND majority: ~30 scenarios tied to I/O / iSCSI / NVMe / snapshot / DB workloads.
+- BLOCKED-OPS: ~8 scenarios tied to operator workflows.
+- BLOCKED-PERF: ~10 scenarios tied to perf / soak.
+- BLOCKED-HA: none of the current set explicitly requires multi-master. `cp85-role-flap` (classified Mixed in the table above) would become BLOCKED-HA if its port turned out to need multi-master or leader election, as would any future "multi-master" test.
+
+## 5. Port Shape (V3)
+
+When V3 eventually ships a testrunner integration (P15 Cluster Validation), the port order should be:
+
+1. **RUNNABLE-P14 scenarios first**: port scenario YAML shape (not V2 actions) against V3 `sparrow` binary + testrunner-wrapped smoke flags. Keep the structured-JSON stdout shape from S7 smoke; wrap it in testrunner action vocabulary.
+2. **Mixed scenarios second**: split each into (control-plane sub-scenario, data-path sub-scenario). Port the control-plane half first.
+3. **BLOCKED-FRONTEND last**: arrives when P15 Frontend ships iSCSI / NVMe / CSI.
+4. **BLOCKED-OPS as needed**: arrives with P15 Ops admin surface.
+5. **BLOCKED-PERF and BLOCKED-HA**: release-hardening and explicit non-goals respectively.
+
+## 6. Closure
+
+Per S8 assignment §3.C, this table is sufficient for S8. Actual scenario PORTING is P15 Cluster Validation work and is not an S8 deliverable. S8 produces the classification + the residual map; P15 executes.
+
+Evidence that the RUNNABLE-P14 scenarios' underlying claims are proven today is in `v3-phase-14-s8-evidence-matrix.md` §3: each RUNNABLE-P14 scenario above has at least one PROVEN row backing it at L0/L1/L2.