mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-13 23:36:45 +03:00
79ac279fe1
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns (encode time in unix nanos). It rides in .vif and the read request so a read can be scoped to the encode run that produced the index. * fix(ec): stamp each encode and reject cross-run shard reads Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the shard's owning volume (resolved together via FindEcVolumeWithShard, so a multi-disk server validates the disk that actually serves the bytes) and reject a shard from a different encode run, recovering from parity. A zero on either side (pre-upgrade volume) skips the guard. * fix(ec): stamp the encode identity on the worker-generated .vif The worker-local encode path now writes EncodeTsNs (and the resolved EC ratio) into the .vif, so the read guard is not silently off for volumes encoded by the maintenance worker. * fix(ec): wipe stale EC artifacts before re-encoding VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and removes its on-disk shard/index/sidecar files before writing fresh ones, so a retried encode never builds on a partial prior run and the unlink frees the inodes instead of leaving open fds serving old bytes. * fix(ec): unmount EC shards across all disks UnmountEcShards walked only the first disk holding the shard, leaving a duplicate copy mounted on a sibling disk (split-disk reconciled volumes) still serving and heartbeating. Traverse every disk and emit one deletion delta per disk. * fix(ec): delete orphan shards without a local .ecx deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx, so it could not clean an orphan .ecNN left by a failed copy on a disk with no index. Delete the requested shard files unconditionally; the index-file (.ecx/.ecj/.vif) routing stays gated as before. * fix(ec): clear stale EC shards cluster-wide before re-encoding ec.encode unmounts and deletes EC shards for the target volumes on every node before regenerating: fatal for the shards the topology reports (mounted leftovers), best-effort for the rest (a sweep that catches unmounted failed-copy orphans). A down node is a no-op. * fix(ec): don't nil EC fds on close so reads can't race eviction A reader resolves an EcVolume/shard under the lock then reads after it is released, so an eviction that nils ecxFile/ecdFile would race that read and panic. Close the fds without nilling the fields: the field is now write-once (no data race) and a concurrent read hits a closed fd, getting a clean error that the caller recovers from parity. * fix(ec): wipe stale EC artifacts on every disk and surface failures The pre-encode wipe only deleted beside the source volume, so a stale shard on a sibling disk survived and could be mounted against the new index at reconcile. Sweep every disk. Removal also ignored os.Remove errors, reporting a failed cleanup as success and letting a stale shard join the next generation; surface the first real failure (treating already-gone as success) from removeStaleEcArtifacts and the shard delete. * fix(ec): log when a local shard is skipped for a different encode run The cross-run guard returned errShardNotLocal, indistinguishable in logs from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so operators can tell "wrong encode generation" from "shard not here". * fix(ec): surface metadata removal failures in the shard delete path deleteEcShardIdsForEachLocation still dropped os.Remove errors on the .ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index condition this path prevents, so route those through removeFileIfExists and return the first real failure instead of reporting cleanup as success. * fix(ec): fail orphan cleanup when a reachable node's delete fails The pre-encode orphan sweep swallowed every error for unreported (node, volume) pairs. That is only safe for an unreachable node, which cannot receive this encode's new generation. A reachable node whose delete genuinely failed (permission/IO) keeps an orphan shard that a later copy re-stamps with the new run's volume-level .vif identity, so the read guard would accept stale data. Surface those; stay best-effort only for unreachable nodes (gRPC Unavailable / no status). * fix(ec): guard ecjFile under its lock in the EC delete path EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved its .ecx lookup before a concurrent eviction (the generate-time UnloadEcVolume) could then reach the journal append with a nil fd. Bail with a clear "volume closed" error under the lock instead. * fix(ec): reject an unstamped shard when the caller has an encode identity The read guard required both identities nonzero, so a current (stamped) caller accepted a holder with identity 0 and could be served a stale pre-upgrade shard. Reject when the caller is stamped and the holder differs (including unstamped); stay lenient only when the caller itself has no identity (pre-upgrade reader). A skipped shard recovers from parity. * fix(ec): full-teardown delete so cluster cleanup wipes a whole generation The pre-encode cluster sweep deleted only the listed canonical shards on remote nodes, leaving index/sidecar (and, on builds with versioned generations, those too) behind. Add a full_teardown flag to VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode cleanup paths set it. Other delete callers (balance/decode/repair) are unchanged. * fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while Close nils it under the lock, a data race on the field. Take the lock first, then nil-check inside, in both. * fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success An old volume server silently ignores full_teardown and returns success for an ordinary delete, so the caller wrongly believes the generation was wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done in the response; the worker destination cleanup fails when it is absent, and the shell cluster sweep fails for a reported (mounted) leftover while staying best-effort for an unreported node. encode_ts_ns stays an accepted transient (an old server just skips the new read guard, no regression). * fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown A reachable pre-upgrade server ignores full_teardown and returns success without wiping an orphan, which a later copy then folds into the new generation. Treat a missing full_teardown_done ack as fatal for every reachable node (best-effort only for a gRPC-unreachable one), not just for topology-reported pairs. * fix(ec): return the served shard identity and validate it client-side The encode identity was only enforced server-side, so a pre-upgrade server ignored the request field and served bytes unchecked. Echo the served shard's EncodeTsNs on every read response chunk and have the client reject a mismatch (including 0 from an old server), so the guard holds regardless of server version; a rejected read recovers from parity. * fix(ec): reject a short/empty remote shard read instead of serving zeros doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and returned success with a partly zero-filled, unvalidated buffer (the server stamps the identity only on chunks that carry bytes). A non-deleted interval must arrive whole: require n == len(buf), exempting the is_deleted short-circuit (n=0), matching readLocalEcShardInterval's local check. A short read now fails so the caller recovers from parity. * test(ec): fake volume server echoes the full_teardown acknowledgement The worker now fails a teardown delete that isn't acknowledged (so a pre-upgrade server can't silently skip the wipe). The fake server's no-op VolumeEcShardsDelete returned an empty response, which the worker read as a skipped teardown and aborted the encode. Echo full_teardown_done. * feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif and rejects a read served from a shard of a different run; full_teardown wipes a whole generation and acknowledges it. The Rust volume server had none of it. Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume, stamp it on every read response, and reject a request/response mismatch on both the server and the distributed-read client (recovering from parity); handle full_teardown by evicting the volume and wiping every EC artifact on each disk, echoing full_teardown_done so the caller can detect a server that ignored it. * fix(ec): remove a stale .vif on full teardown of a shard-only node A shard copy installs shards + .ecx before .vif, so an interrupted copy after a teardown could mount the new files under the previous run's identity / version / shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during full teardown, gated on .idx absence so a source-volume holder keeps its live .vif. In Rust this lives in a teardown-only helper so the reconcile / load- fallback paths (which share the base removal) still preserve .vif. * fix(ec): treat a missing teardown ack as fatal, not as an unreachable node isNodeUnreachable returned true for any non-gRPC-status error, so a reachable pre-upgrade server's missing full_teardown_done ack (a plain error) was classified unreachable and the unreported pair was silently skipped. Classify only a real codes.Unavailable as unreachable, and wrap the missing ack in a sentinel the sweep treats as fatal regardless. A genuinely down node still surfaces as Unavailable from the RPC and stays best-effort. * fix(ec): reject a short shard read in the local EC needle reader read_ec_shard_needle ignored the byte count from shard.read_at and appended the whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later length check and parsed as garbage. Require n == buf.len() per interval, erroring on a short read like the local interval reader already does. * fix(ec): probe reachability before skipping a node that returns Unavailable The pre-encode sweep skipped any node whose teardown delete returned codes.Unavailable, but a reachable volume server in maintenance mode also returns that code for the maintenance-gated delete, so its stale EC files were left behind on a node that can still receive the new generation. Confirm with a non-maintenance-gated empty-target Ping: skip only when the node fails the probe too (genuinely unreachable). * fix(ec): use try_exists for the teardown .vif .idx guard The teardown-only .vif removal gated on Path::exists(), which returns false on a permission/IO stat error, so a stat failure on a present .idx would read as a shard-only node and delete the live source volume's .vif. Gate on try_exists() == Ok(false) instead, preserving the sidecar on any stat error. * fix(ec): only skip a sweep node when a Ping confirms it is transport-down The pre-encode sweep skipped a node whenever its teardown delete and a liveness Ping both failed, but it treated ANY Ping error as down — an application-level Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a reachable node's stale generation in place. Classify the Ping tri-state and skip only when it transport-fails with codes.Unavailable; a reachable or inconclusive node stays fatal. * fix(ec): exclude sweep-skipped nodes from the encode's rebalance The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance then recollected the current topology — a node that recovered between the two could become a copy target and receive the new generation while still holding its stale, never-cleared shards. Have the sweep return the skipped set and exclude those nodes from the rebalance for this encode, so a node we could not clean cannot receive the new generation. Standalone ec.balance is unaffected. * fix(ec): re-sweep recovered nodes before generation so they aren't stranded A node skipped as down by the pre-encode sweep is excluded from the rebalance, but it can recover and become the generation host — mounting all shards locally, then being excluded from distribution. Union-only verification accepts all shards on one node and deletes the originals: a single point of failure. Re-sweep the skipped nodes just before generation; one whose teardown now succeeds leaves the skipped set and rebalances normally, while a node still down stays skipped. * fix(ec): abort the encode if a selected source is still skipped after re-sweep The re-sweep un-skips a recovered node, but the source was selected before it and a node can stay down through the re-sweep then recover just in time to be the generation host — mounting all shards locally while still excluded from the rebalance, which union-only verification accepts before deleting the originals. Abort the encode when a selected source remains skipped after the re-sweep. * fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch If a volume is not EC at the batch-delete classification but is encoded to EC and its .dat deleted before the regular-volume mutation, the mutation returns an exact "not found" that the filer chunk-GC treats as completed, dropping the delete. Recheck EC presence under the mutation lock and return a retriable 503 with the "try again" token so the filer requeues it onto the EC path. * fix(ec): recheck EC state before the regular batch-delete mutation ec.encode mounts EC shards (copied from the .dat) before deleting the originals, so a volume can be EC while its .dat still exists. The batch delete only rechecked EC after a NotFound, so a successful regular-volume delete in that window wrote a tombstone to the soon-removed .dat — the delete was lost and the needle resurrected from the pre-tombstone shards. Recheck has_ec_volume under the write lock before delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path. * fix(volume): make the metrics push test independent of test order test_push_metrics_once asserted the pushed body contains the request-counter family without ever touching the counter — a CounterVec with no children emits nothing, so the assertion only held when another test had already created a labelset in the shared registry. Create one in the test itself.
800 lines
22 KiB
Protocol Buffer
800 lines
22 KiB
Protocol Buffer
syntax = "proto3";
|
|
|
|
package volume_server_pb;
|
|
option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb";
|
|
|
|
import "remote.proto";
|
|
|
|
//////////////////////////////////////////////////
|
|
|
|
// Persistent state for volume servers.
|
|
message VolumeServerState {
|
|
// whether the server is in maintenance (i.e. read-only) mode.
|
|
bool maintenance = 1;
|
|
// incremental version counter
|
|
uint32 version = 2;
|
|
}
|
|
|
|
//////////////////////////////////////////////////
|
|
|
|
service VolumeServer {
|
|
//Experts only: takes multiple fid parameters. This function does not propagate deletes to replicas.
|
|
rpc BatchDelete (BatchDeleteRequest) returns (BatchDeleteResponse) {
|
|
}
|
|
|
|
rpc VacuumVolumeCheck (VacuumVolumeCheckRequest) returns (VacuumVolumeCheckResponse) {
|
|
}
|
|
rpc VacuumVolumeCompact (VacuumVolumeCompactRequest) returns (stream VacuumVolumeCompactResponse) {
|
|
}
|
|
rpc VacuumVolumeCommit (VacuumVolumeCommitRequest) returns (VacuumVolumeCommitResponse) {
|
|
}
|
|
rpc VacuumVolumeCleanup (VacuumVolumeCleanupRequest) returns (VacuumVolumeCleanupResponse) {
|
|
}
|
|
|
|
rpc DeleteCollection (DeleteCollectionRequest) returns (DeleteCollectionResponse) {
|
|
}
|
|
rpc AllocateVolume (AllocateVolumeRequest) returns (AllocateVolumeResponse) {
|
|
}
|
|
|
|
rpc VolumeSyncStatus (VolumeSyncStatusRequest) returns (VolumeSyncStatusResponse) {
|
|
}
|
|
rpc VolumeIncrementalCopy (VolumeIncrementalCopyRequest) returns (stream VolumeIncrementalCopyResponse) {
|
|
}
|
|
|
|
rpc VolumeMount (VolumeMountRequest) returns (VolumeMountResponse) {
|
|
}
|
|
rpc VolumeUnmount (VolumeUnmountRequest) returns (VolumeUnmountResponse) {
|
|
}
|
|
rpc VolumeDelete (VolumeDeleteRequest) returns (VolumeDeleteResponse) {
|
|
}
|
|
rpc VolumeMarkReadonly (VolumeMarkReadonlyRequest) returns (VolumeMarkReadonlyResponse) {
|
|
}
|
|
rpc VolumeMarkWritable (VolumeMarkWritableRequest) returns (VolumeMarkWritableResponse) {
|
|
}
|
|
rpc VolumeConfigure (VolumeConfigureRequest) returns (VolumeConfigureResponse) {
|
|
}
|
|
rpc VolumeStatus (VolumeStatusRequest) returns (VolumeStatusResponse) {
|
|
}
|
|
|
|
rpc GetState (GetStateRequest) returns (GetStateResponse) {
|
|
}
|
|
rpc SetState (SetStateRequest) returns (SetStateResponse) {
|
|
}
|
|
|
|
// copy the .idx .dat files, and mount this volume
|
|
rpc VolumeCopy (VolumeCopyRequest) returns (stream VolumeCopyResponse) {
|
|
}
|
|
rpc ReadVolumeFileStatus (ReadVolumeFileStatusRequest) returns (ReadVolumeFileStatusResponse) {
|
|
}
|
|
rpc CopyFile (CopyFileRequest) returns (stream CopyFileResponse) {
|
|
}
|
|
rpc ReceiveFile (stream ReceiveFileRequest) returns (ReceiveFileResponse) {
|
|
}
|
|
|
|
rpc ReadNeedleBlob (ReadNeedleBlobRequest) returns (ReadNeedleBlobResponse) {
|
|
}
|
|
rpc ReadNeedleMeta (ReadNeedleMetaRequest) returns (ReadNeedleMetaResponse) {
|
|
}
|
|
rpc WriteNeedleBlob (WriteNeedleBlobRequest) returns (WriteNeedleBlobResponse) {
|
|
}
|
|
rpc ReadAllNeedles (ReadAllNeedlesRequest) returns (stream ReadAllNeedlesResponse) {
|
|
}
|
|
|
|
rpc VolumeTailSender (VolumeTailSenderRequest) returns (stream VolumeTailSenderResponse) {
|
|
}
|
|
rpc VolumeTailReceiver (VolumeTailReceiverRequest) returns (VolumeTailReceiverResponse) {
|
|
}
|
|
|
|
// erasure coding
|
|
rpc VolumeEcShardsGenerate (VolumeEcShardsGenerateRequest) returns (VolumeEcShardsGenerateResponse) {
|
|
}
|
|
rpc VolumeEcShardsRebuild (VolumeEcShardsRebuildRequest) returns (VolumeEcShardsRebuildResponse) {
|
|
}
|
|
rpc VolumeEcShardsCopy (VolumeEcShardsCopyRequest) returns (VolumeEcShardsCopyResponse) {
|
|
}
|
|
rpc VolumeEcShardsDelete (VolumeEcShardsDeleteRequest) returns (VolumeEcShardsDeleteResponse) {
|
|
}
|
|
rpc VolumeEcShardsMount (VolumeEcShardsMountRequest) returns (VolumeEcShardsMountResponse) {
|
|
}
|
|
rpc VolumeEcShardsUnmount (VolumeEcShardsUnmountRequest) returns (VolumeEcShardsUnmountResponse) {
|
|
}
|
|
rpc VolumeEcShardRead (VolumeEcShardReadRequest) returns (stream VolumeEcShardReadResponse) {
|
|
}
|
|
rpc VolumeEcBlobDelete (VolumeEcBlobDeleteRequest) returns (VolumeEcBlobDeleteResponse) {
|
|
}
|
|
rpc VolumeEcShardsToVolume (VolumeEcShardsToVolumeRequest) returns (VolumeEcShardsToVolumeResponse) {
|
|
}
|
|
rpc VolumeEcShardsInfo (VolumeEcShardsInfoRequest) returns (VolumeEcShardsInfoResponse) {
|
|
}
|
|
|
|
// tiered storage
|
|
rpc VolumeTierMoveDatToRemote (VolumeTierMoveDatToRemoteRequest) returns (stream VolumeTierMoveDatToRemoteResponse) {
|
|
}
|
|
rpc VolumeTierMoveDatFromRemote (VolumeTierMoveDatFromRemoteRequest) returns (stream VolumeTierMoveDatFromRemoteResponse) {
|
|
}
|
|
|
|
rpc VolumeServerStatus (VolumeServerStatusRequest) returns (VolumeServerStatusResponse) {
|
|
}
|
|
rpc VolumeServerLeave (VolumeServerLeaveRequest) returns (VolumeServerLeaveResponse) {
|
|
}
|
|
|
|
// remote storage
|
|
rpc FetchAndWriteNeedle (FetchAndWriteNeedleRequest) returns (FetchAndWriteNeedleResponse) {
|
|
}
|
|
|
|
// scrubbing
|
|
rpc ScrubVolume (ScrubVolumeRequest) returns (ScrubVolumeResponse) {
|
|
}
|
|
rpc ScrubEcVolume (ScrubEcVolumeRequest) returns (ScrubEcVolumeResponse) {
|
|
}
|
|
|
|
// <experimental> query
|
|
rpc Query (QueryRequest) returns (stream QueriedStripe) {
|
|
}
|
|
|
|
rpc VolumeNeedleStatus (VolumeNeedleStatusRequest) returns (VolumeNeedleStatusResponse) {
|
|
}
|
|
|
|
rpc Ping (PingRequest) returns (PingResponse) {
|
|
}
|
|
|
|
}
|
|
|
|
//////////////////////////////////////////////////
|
|
|
|
message BatchDeleteRequest {
|
|
repeated string file_ids = 1;
|
|
bool skip_cookie_check = 2;
|
|
}
|
|
|
|
message BatchDeleteResponse {
|
|
repeated DeleteResult results = 1;
|
|
}
|
|
message DeleteResult {
|
|
string file_id = 1;
|
|
int32 status = 2;
|
|
string error = 3;
|
|
uint32 size = 4;
|
|
uint32 version = 5;
|
|
}
|
|
|
|
message Empty {
|
|
}
|
|
|
|
message VacuumVolumeCheckRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VacuumVolumeCheckResponse {
|
|
double garbage_ratio = 1;
|
|
}
|
|
|
|
message VacuumVolumeCompactRequest {
|
|
uint32 volume_id = 1;
|
|
int64 preallocate = 2;
|
|
}
|
|
message VacuumVolumeCompactResponse {
|
|
int64 processed_bytes = 1;
|
|
float load_avg_1m = 2;
|
|
}
|
|
|
|
message VacuumVolumeCommitRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VacuumVolumeCommitResponse {
|
|
bool is_read_only = 1;
|
|
uint64 volume_size = 2;
|
|
}
|
|
|
|
message VacuumVolumeCleanupRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VacuumVolumeCleanupResponse {
|
|
}
|
|
|
|
message DeleteCollectionRequest {
|
|
string collection = 1;
|
|
}
|
|
message DeleteCollectionResponse {
|
|
}
|
|
|
|
message AllocateVolumeRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
int64 preallocate = 3;
|
|
string replication = 4;
|
|
string ttl = 5;
|
|
uint32 memory_map_max_size_mb = 6;
|
|
string disk_type = 7;
|
|
uint32 version = 8;
|
|
}
|
|
message AllocateVolumeResponse {
|
|
}
|
|
|
|
message VolumeSyncStatusRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VolumeSyncStatusResponse {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
string replication = 4;
|
|
string ttl = 5;
|
|
uint64 tail_offset = 6;
|
|
uint32 compact_revision = 7;
|
|
uint64 idx_file_size = 8;
|
|
uint32 version = 9;
|
|
}
|
|
|
|
message VolumeIncrementalCopyRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 since_ns = 2;
|
|
}
|
|
message VolumeIncrementalCopyResponse {
|
|
bytes file_content = 1;
|
|
}
|
|
|
|
message VolumeMountRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VolumeMountResponse {
|
|
}
|
|
|
|
message VolumeUnmountRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VolumeUnmountResponse {
|
|
}
|
|
|
|
message VolumeDeleteRequest {
|
|
uint32 volume_id = 1;
|
|
bool only_empty = 2;
|
|
// when true, do not remove the cloud-tier object backing the volume.
|
|
// used for moves where another server is taking over the same .vif.
|
|
bool keep_remote_data = 3;
|
|
}
|
|
message VolumeDeleteResponse {
|
|
}
|
|
|
|
message VolumeMarkReadonlyRequest {
|
|
uint32 volume_id = 1;
|
|
bool persist = 2;
|
|
}
|
|
message VolumeMarkReadonlyResponse {
|
|
}
|
|
|
|
message VolumeMarkWritableRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VolumeMarkWritableResponse {
|
|
}
|
|
|
|
message VolumeConfigureRequest {
|
|
uint32 volume_id = 1;
|
|
string replication = 2;
|
|
}
|
|
message VolumeConfigureResponse {
|
|
string error = 1;
|
|
}
|
|
|
|
message VolumeStatusRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VolumeStatusResponse {
|
|
bool is_read_only = 1;
|
|
uint64 volume_size = 2;
|
|
uint64 file_count = 3;
|
|
uint64 file_deleted_count = 4;
|
|
}
|
|
|
|
message GetStateRequest {
|
|
}
|
|
message GetStateResponse {
|
|
VolumeServerState state = 1;
|
|
}
|
|
|
|
message SetStateRequest {
|
|
// SetState updates *all* volume server flags at once. Retrieve state with GetState(),
|
|
// modify individual flags as required, then call this RPC to update.
|
|
VolumeServerState state = 1;
|
|
}
|
|
message SetStateResponse {
|
|
VolumeServerState state = 1;
|
|
}
|
|
|
|
message VolumeCopyRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
string replication = 3;
|
|
string ttl = 4;
|
|
string source_data_node = 5;
|
|
string disk_type = 6;
|
|
int64 io_byte_per_second = 7;
|
|
}
|
|
message VolumeCopyResponse {
|
|
uint64 last_append_at_ns = 1;
|
|
int64 processed_bytes = 2;
|
|
}
|
|
|
|
message CopyFileRequest {
|
|
uint32 volume_id = 1;
|
|
string ext = 2;
|
|
uint32 compaction_revision = 3;
|
|
uint64 stop_offset = 4;
|
|
string collection = 5;
|
|
bool is_ec_volume = 6;
|
|
bool ignore_source_file_not_found = 7;
|
|
}
|
|
message CopyFileResponse {
|
|
bytes file_content = 1;
|
|
int64 modified_ts_ns = 2;
|
|
}
|
|
|
|
message ReceiveFileRequest {
|
|
oneof data {
|
|
ReceiveFileInfo info = 1;
|
|
bytes file_content = 2;
|
|
}
|
|
}
|
|
|
|
message ReceiveFileInfo {
|
|
uint32 volume_id = 1;
|
|
string ext = 2;
|
|
string collection = 3;
|
|
bool is_ec_volume = 4;
|
|
uint32 shard_id = 5;
|
|
uint64 file_size = 6;
|
|
uint32 disk_id = 7; // EC shard disk; 0 = auto-select (see VolumeEcShardsCopyRequest.disk_id)
|
|
}
|
|
|
|
message ReceiveFileResponse {
|
|
uint64 bytes_written = 1;
|
|
string error = 2;
|
|
}
|
|
|
|
message ReadNeedleBlobRequest {
|
|
uint32 volume_id = 1;
|
|
int64 offset = 3; // actual offset
|
|
int32 size = 4;
|
|
}
|
|
message ReadNeedleBlobResponse {
|
|
bytes needle_blob = 1;
|
|
}
|
|
|
|
message ReadNeedleMetaRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 needle_id = 2;
|
|
int64 offset = 3; // actual offset
|
|
int32 size = 4;
|
|
}
|
|
message ReadNeedleMetaResponse {
|
|
uint32 cookie = 1;
|
|
uint64 last_modified = 2;
|
|
uint32 crc = 3;
|
|
string ttl = 4;
|
|
uint64 append_at_ns = 5;
|
|
}
|
|
|
|
message WriteNeedleBlobRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 needle_id = 2;
|
|
int32 size = 3;
|
|
bytes needle_blob = 4;
|
|
}
|
|
message WriteNeedleBlobResponse {
|
|
}
|
|
|
|
message ReadAllNeedlesRequest {
|
|
repeated uint32 volume_ids = 1;
|
|
}
|
|
message ReadAllNeedlesResponse {
|
|
uint32 volume_id = 1;
|
|
uint64 needle_id = 2;
|
|
uint32 cookie = 3;
|
|
bytes needle_blob = 5;
|
|
bool needle_blob_compressed = 6;
|
|
uint64 last_modified = 7;
|
|
uint32 crc = 8;
|
|
bytes name = 9;
|
|
bytes mime = 10;
|
|
}
|
|
|
|
message VolumeTailSenderRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 since_ns = 2;
|
|
uint32 idle_timeout_seconds = 3;
|
|
}
|
|
message VolumeTailSenderResponse {
|
|
bytes needle_header = 1;
|
|
bytes needle_body = 2;
|
|
bool is_last_chunk = 3;
|
|
uint32 version = 4;
|
|
}
|
|
|
|
message VolumeTailReceiverRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 since_ns = 2;
|
|
uint32 idle_timeout_seconds = 3;
|
|
string source_volume_server = 4;
|
|
}
|
|
message VolumeTailReceiverResponse {
|
|
}
|
|
|
|
message VolumeEcShardsGenerateRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
}
|
|
message VolumeEcShardsGenerateResponse {
|
|
}
|
|
|
|
message VolumeEcShardsRebuildRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
bool unsafe_ignore_sidecar = 3; // bypass the bitrot-sidecar fail-closed guard (operator override; distinct from ec.rebuild -force)
|
|
}
|
|
message VolumeEcShardsRebuildResponse {
|
|
repeated uint32 rebuilt_shard_ids = 1;
|
|
}
|
|
|
|
message VolumeEcShardsCopyRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
repeated uint32 shard_ids = 3;
|
|
bool copy_ecx_file = 4;
|
|
string source_data_node = 5;
|
|
bool copy_ecj_file = 6;
|
|
bool copy_vif_file = 7;
|
|
uint32 disk_id = 8; // Target disk ID for storing EC shards
|
|
bool copy_ecsum_file = 9; // copy the bitrot checksum sidecar (.ecsum) when present; tolerant of a missing source (no-op), since this non-2PC path has no Prepare backstop
|
|
}
|
|
message VolumeEcShardsCopyResponse {
|
|
}
|
|
|
|
message VolumeEcShardsDeleteRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
repeated uint32 shard_ids = 3;
|
|
bool full_teardown = 4; // pre-encode cleanup: wipe every EC artifact + generation for this volume, not just shard_ids
|
|
}
|
|
message VolumeEcShardsDeleteResponse {
|
|
bool full_teardown_done = 1; // set by a new server that performed full_teardown; absent from an old server lets the caller detect the silent no-op
|
|
}
|
|
|
|
message VolumeEcShardsMountRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
repeated uint32 shard_ids = 3;
|
|
string source_disk_type = 4; // disk type of the source volume, applied to the in-memory EC volume so heartbeats report under it (#9423)
|
|
}
|
|
message VolumeEcShardsMountResponse {
|
|
}
|
|
|
|
message VolumeEcShardsUnmountRequest {
|
|
uint32 volume_id = 1;
|
|
repeated uint32 shard_ids = 3;
|
|
}
|
|
message VolumeEcShardsUnmountResponse {
|
|
}
|
|
|
|
message VolumeEcShardReadRequest {
|
|
uint32 volume_id = 1;
|
|
uint32 shard_id = 2;
|
|
int64 offset = 3;
|
|
int64 size = 4;
|
|
uint64 file_key = 5;
|
|
reserved 6;
|
|
int64 encode_ts_ns = 7; // caller's expected encode time; the server rejects a shard from a different encode run
|
|
}
|
|
message VolumeEcShardReadResponse {
|
|
bytes data = 1;
|
|
bool is_deleted = 2;
|
|
int64 encode_ts_ns = 3; // identity of the shard actually served; client rejects a mismatch (0 = pre-upgrade server)
|
|
}
|
|
|
|
message VolumeEcBlobDeleteRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
uint64 file_key = 3;
|
|
uint32 version = 4;
|
|
}
|
|
message VolumeEcBlobDeleteResponse {
|
|
}
|
|
|
|
message VolumeEcShardsToVolumeRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
}
|
|
message VolumeEcShardsToVolumeResponse {
|
|
}
|
|
|
|
message VolumeEcShardsInfoRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message VolumeEcShardsInfoResponse {
|
|
repeated EcShardInfo ec_shard_infos = 1;
|
|
uint64 volume_size = 2;
|
|
uint64 file_count = 3;
|
|
uint64 file_deleted_count = 4;
|
|
}
|
|
|
|
message EcShardInfo {
|
|
uint32 shard_id = 1;
|
|
int64 size = 2;
|
|
string collection = 3;
|
|
uint32 volume_id = 4;
|
|
}
|
|
|
|
message ReadVolumeFileStatusRequest {
|
|
uint32 volume_id = 1;
|
|
}
|
|
message ReadVolumeFileStatusResponse {
|
|
uint32 volume_id = 1;
|
|
uint64 idx_file_timestamp_seconds = 2;
|
|
uint64 idx_file_size = 3;
|
|
uint64 dat_file_timestamp_seconds = 4;
|
|
uint64 dat_file_size = 5;
|
|
uint64 file_count = 6;
|
|
uint32 compaction_revision = 7;
|
|
string collection = 8;
|
|
string disk_type = 9;
|
|
VolumeInfo volume_info = 10;
|
|
uint32 version = 11;
|
|
}
|
|
|
|
message DiskStatus {
|
|
string dir = 1;
|
|
uint64 all = 2;
|
|
uint64 used = 3;
|
|
uint64 free = 4;
|
|
float percent_free = 5;
|
|
float percent_used = 6;
|
|
string disk_type = 7;
|
|
string error = 8;
|
|
}
|
|
|
|
message MemStatus {
|
|
int32 goroutines = 1;
|
|
uint64 all = 2;
|
|
uint64 used = 3;
|
|
uint64 free = 4;
|
|
uint64 self = 5;
|
|
uint64 heap = 6;
|
|
uint64 stack = 7;
|
|
}
|
|
|
|
// tired storage on volume servers
|
|
message RemoteFile {
|
|
string backend_type = 1;
|
|
string backend_id = 2;
|
|
string key = 3;
|
|
uint64 offset = 4;
|
|
uint64 file_size = 5;
|
|
uint64 modified_time = 6;
|
|
string extension = 7;
|
|
}
|
|
message VolumeInfo {
|
|
repeated RemoteFile files = 1;
|
|
uint32 version = 2;
|
|
string replication = 3;
|
|
uint32 bytes_offset = 4;
|
|
int64 dat_file_size = 5; // store the original dat file size
|
|
uint64 expire_at_sec = 6; // expiration time of ec volume
|
|
bool read_only = 7;
|
|
EcShardConfig ec_shard_config = 8; // EC shard configuration (optional, null = use default 10+4)
|
|
}
|
|
|
|
// EcShardConfig specifies erasure coding shard configuration
|
|
message EcShardConfig {
|
|
uint32 data_shards = 1; // Number of data shards (e.g., 10)
|
|
uint32 parity_shards = 2; // Number of parity shards (e.g., 4)
|
|
int64 encode_ts_ns = 3; // encode time (unix nanos); a read served from a shard of a different encode run is rejected
|
|
}
|
|
|
|
// EcBitrotProtection is the entire content of a bitrot checksum sidecar
|
|
// (<base>.ecsum for the legacy generation, <base>.ecsum.v<N> for vacuum
|
|
// generation N). On disk it is wrapped in a fixed header carrying a CRC32C
|
|
// over this serialized payload (see weed/storage/erasure_coding/ec_bitrot.go).
|
|
message EcBitrotProtection {
|
|
ChecksumAlgorithm algorithm = 1; // CRC32C (Castagnoli)
|
|
uint32 block_size = 2; // bytes per checksum block; default 16777216 (16 MiB), a power-of-two multiple of 1 MiB
|
|
uint32 generation = 3; // EC vacuum generation these checksums describe (0 = legacy/fresh); must match the sidecar filename version
|
|
EcShardConfig ec_shard_config = 4; // data/parity shard counts at encode time
|
|
repeated EcShardChecksums shards = 5; // one entry per shard id in the active layout
|
|
bytes encode_uuid = 6; // random per-encode identity, for stale-sidecar detection across in-place re-encodes
|
|
}
|
|
|
|
message EcShardChecksums {
|
|
uint32 shard_id = 1; // 0..MaxShardCount-1 (custom EC ratios go up to 32)
|
|
int64 covered_size = 2; // shard byte length these checksums cover (must equal the on-disk shard length)
|
|
bytes block_crc32c = 3; // packed little-endian uint32[] = ceil(covered_size/block_size) entries
|
|
}
|
|
|
|
enum ChecksumAlgorithm {
|
|
CHECKSUM_NONE = 0;
|
|
CHECKSUM_CRC32C = 1;
|
|
}
|
|
message OldVersionVolumeInfo {
|
|
repeated RemoteFile files = 1;
|
|
uint32 version = 2;
|
|
string replication = 3;
|
|
uint32 BytesOffset = 4;
|
|
int64 dat_file_size = 5; // store the original dat file size
|
|
uint64 DestroyTime = 6; // expiration time of ec volume
|
|
bool read_only = 7;
|
|
}
|
|
|
|
// tiered storage
|
|
message VolumeTierMoveDatToRemoteRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
string destination_backend_name = 3;
|
|
bool keep_local_dat_file = 4;
|
|
}
|
|
message VolumeTierMoveDatToRemoteResponse {
|
|
int64 processed = 1;
|
|
float processedPercentage = 2;
|
|
}
|
|
|
|
message VolumeTierMoveDatFromRemoteRequest {
|
|
uint32 volume_id = 1;
|
|
string collection = 2;
|
|
bool keep_remote_dat_file = 3;
|
|
}
|
|
message VolumeTierMoveDatFromRemoteResponse {
|
|
int64 processed = 1;
|
|
float processedPercentage = 2;
|
|
}
|
|
|
|
message VolumeServerStatusRequest {
|
|
|
|
}
|
|
message VolumeServerStatusResponse {
|
|
repeated DiskStatus disk_statuses = 1;
|
|
MemStatus memory_status = 2;
|
|
string version = 3;
|
|
string data_center = 4;
|
|
string rack = 5;
|
|
VolumeServerState state = 6;
|
|
}
|
|
|
|
message VolumeServerLeaveRequest {
|
|
}
|
|
message VolumeServerLeaveResponse {
|
|
}
|
|
|
|
// remote storage
|
|
message FetchAndWriteNeedleRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 needle_id = 2;
|
|
uint32 cookie = 3;
|
|
int64 offset = 4;
|
|
int64 size = 5;
|
|
message Replica {
|
|
string url = 1;
|
|
string public_url = 2;
|
|
int32 grpc_port = 3;
|
|
}
|
|
repeated Replica replicas = 6;
|
|
string auth = 7;
|
|
int32 download_concurrency = 8; // multipart download concurrency if supported by the remote storage client; for S3, 0 = default (5)
|
|
// remote conf
|
|
remote_pb.RemoteConf remote_conf = 15;
|
|
remote_pb.RemoteStorageLocation remote_location = 16;
|
|
}
|
|
message FetchAndWriteNeedleResponse {
|
|
string e_tag = 1;
|
|
}
|
|
|
|
enum VolumeScrubMode {
|
|
UNKNOWN = 0;
|
|
INDEX = 1;
|
|
FULL = 2;
|
|
LOCAL = 3;
|
|
CHECKSUM = 4; // EC only: verify each local shard's raw bytes against the bitrot checksum sidecar
|
|
}
|
|
|
|
message ScrubVolumeRequest {
|
|
VolumeScrubMode mode = 1;
|
|
// optional list of volume IDs to scrub. if empty, all volumes for the server are scrubbed.
|
|
repeated uint32 volume_ids = 2;
|
|
bool mark_broken_volumes_readonly = 3;
|
|
}
|
|
message ScrubVolumeResponse {
|
|
uint64 total_volumes = 1;
|
|
uint64 total_files = 2;
|
|
repeated uint32 broken_volume_ids = 3;
|
|
repeated string details = 4;
|
|
}
|
|
|
|
message ScrubEcVolumeRequest {
|
|
VolumeScrubMode mode = 1;
|
|
// optional list of volume IDs to scrub. if empty, all EC volumes for the server are scrubbed.
|
|
repeated uint32 volume_ids = 2;
|
|
}
|
|
message ScrubEcVolumeResponse {
|
|
uint64 total_volumes = 1;
|
|
uint64 total_files = 2;
|
|
repeated uint32 broken_volume_ids = 3;
|
|
repeated EcShardInfo broken_shard_infos = 4;
|
|
repeated string details = 5;
|
|
}
|
|
|
|
// select on volume servers
|
|
message QueryRequest {
|
|
repeated string selections = 1;
|
|
repeated string from_file_ids = 2;
|
|
message Filter {
|
|
string field = 1;
|
|
string operand = 2;
|
|
string value = 3;
|
|
}
|
|
Filter filter = 3;
|
|
|
|
message InputSerialization {
|
|
// NONE | GZIP | BZIP2
|
|
string compression_type = 1;
|
|
message CSVInput {
|
|
string file_header_info = 1; // Valid values: NONE | USE | IGNORE
|
|
string record_delimiter = 2; // Default: \n
|
|
string field_delimiter = 3; // Default: ,
|
|
string quote_character = 4; // Default: "
|
|
string quote_escape_character = 5; // Default: "
|
|
string comments = 6; // Default: #
|
|
// If true, records might contain record delimiters within quote characters
|
|
bool allow_quoted_record_delimiter = 7; // default False.
|
|
}
|
|
message JSONInput {
|
|
string type = 1; // Valid values: DOCUMENT | LINES
|
|
}
|
|
message ParquetInput {
|
|
}
|
|
|
|
CSVInput csv_input = 2;
|
|
JSONInput json_input = 3;
|
|
ParquetInput parquet_input = 4;
|
|
}
|
|
InputSerialization input_serialization = 4;
|
|
|
|
message OutputSerialization {
|
|
message CSVOutput {
|
|
string quote_fields = 1; // Valid values: ALWAYS | ASNEEDED
|
|
string record_delimiter = 2; // Default: \n
|
|
string field_delimiter = 3; // Default: ,
|
|
string quote_character = 4; // Default: "
|
|
string quote_escape_character = 5; // Default: "
|
|
}
|
|
message JSONOutput {
|
|
string record_delimiter = 1;
|
|
}
|
|
|
|
CSVOutput csv_output = 2;
|
|
JSONOutput json_output = 3;
|
|
}
|
|
|
|
OutputSerialization output_serialization = 5;
|
|
}
|
|
message QueriedStripe {
|
|
bytes records = 1;
|
|
}
|
|
|
|
message VolumeNeedleStatusRequest {
|
|
uint32 volume_id = 1;
|
|
uint64 needle_id = 2;
|
|
}
|
|
message VolumeNeedleStatusResponse {
|
|
uint64 needle_id = 1;
|
|
uint32 cookie = 2;
|
|
uint32 size = 3;
|
|
uint64 last_modified = 4;
|
|
uint32 crc = 5;
|
|
string ttl = 6;
|
|
}
|
|
|
|
message PingRequest {
|
|
string target = 1; // default to ping itself
|
|
string target_type = 2;
|
|
}
|
|
message PingResponse {
|
|
int64 start_time_ns = 1;
|
|
int64 remote_time_ns = 2;
|
|
int64 stop_time_ns = 3;
|
|
}
|