Gateway Cluster State Synchronization

State synchronization in confidential computing represents one of the most challenging aspects of distributed system design. Unlike traditional distributed systems where trust can be assumed among cluster members, confidential computing environments operate under a zero-trust model where every node must continuously prove its identity and integrity. The dstack gateway’s state synchronization system addresses this challenge by implementing cryptographically secure protocols that enable multiple gateway nodes to maintain consistent cluster state while ensuring that only authenticated and authorized nodes can participate.

The concept of “state” in this context encompasses far more than simple configuration data. It includes the complete knowledge of registered CVMs, their attestation status, cryptographic keys, network topology, health metrics, and authorization policies. This state must be synchronized across multiple gateway nodes to ensure high availability and load distribution, but this synchronization must never compromise the security guarantees that make confidential computing possible.

What makes dstack’s approach unique is its integration of hardware-backed attestation with traditional distributed consensus mechanisms. Each gateway node does more than sync data: it continuously validates the cryptographic identity of its peers, ensuring that compromised or malicious nodes cannot inject false state information or gain unauthorized access to sensitive cluster operations.

Gateway Cluster Topology and State Synchronization

The dstack gateway supports distributed deployment through a cluster architecture where multiple gateway nodes synchronize their state to provide redundancy and load balancing.

Cluster State Management

The gateway maintains its cluster state in a central ProxyStateMut structure (see source). This state includes information about other gateway nodes (nodes), registered applications (apps), CVM instances (instances), allocated network addresses, and additional internal tracking fields. By synchronizing this structured state across the cluster, the gateway ensures consistent knowledge of node membership, application registrations, and CVM instance assignments.

Each gateway node tracks the other nodes in the cluster using a GatewayNodeInfo structure (see source), which includes the node’s id, url, WireGuard peer information, and last-seen timestamp.
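
To make the shape of this state concrete, the following is a minimal sketch in Rust. It is not the actual dstack definition: the struct name ClusterState, the WgPeer and InstanceInfo types, all field types, and the "keep the most recently seen record" merge rule are assumptions made for illustration. Only the names nodes, apps, instances, id, url, and last-seen come from the description above.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::net::Ipv4Addr;
use std::time::SystemTime;

/// Hypothetical WireGuard peer details for a gateway node.
#[derive(Debug, Clone, PartialEq)]
pub struct WgPeer {
    pub public_key: String,
    pub endpoint: String,
}

/// Illustrative stand-in for GatewayNodeInfo; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct GatewayNodeInfo {
    pub id: String,
    pub url: String,
    pub wg_peer: WgPeer,
    pub last_seen: SystemTime,
}

/// Hypothetical record for a registered CVM instance.
#[derive(Debug, Clone, PartialEq)]
pub struct InstanceInfo {
    pub app_id: String,
    pub ip: Ipv4Addr,
}

/// Illustrative stand-in for the synchronized part of ProxyStateMut.
#[derive(Debug, Default)]
pub struct ClusterState {
    /// Other gateway nodes, keyed by node id.
    pub nodes: BTreeMap<String, GatewayNodeInfo>,
    /// Registered applications: app id -> instance ids.
    pub apps: BTreeMap<String, BTreeSet<String>>,
    /// CVM instances, keyed by instance id.
    pub instances: BTreeMap<String, InstanceInfo>,
    /// Addresses already handed out to instances.
    pub allocated_addresses: BTreeSet<Ipv4Addr>,
}

impl ClusterState {
    /// Merge a node record received from a peer.
    /// Assumed rule: keep whichever record has the newer last_seen.
    pub fn upsert_node(&mut self, incoming: GatewayNodeInfo) {
        let is_newer = self
            .nodes
            .get(&incoming.id)
            .map_or(true, |existing| incoming.last_seen > existing.last_seen);
        if is_newer {
            self.nodes.insert(incoming.id.clone(), incoming);
        }
    }
}
```

Keying the maps by id keeps merges idempotent: receiving the same record twice from different peers leaves the state unchanged.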

State Synchronization Protocol

Gateway node state synchronization is achieved via secure RPC channels with mutual TLS authentication, which provides both confidentiality and strong peer verification. The synchronization client (SyncClient) initiates connections to peer nodes using a purpose-built RPC client that strictly enforces certificate-based authentication and validates application identity (see the sync client handshake and validation logic).
During the handshake, the client checks that the remote node presents a valid TLS certificate and that the application ID matches the expected cluster identity. If the app ID does not match, the connection is immediately rejected and a log entry records the error.
This rigorous validation mechanism ensures that only properly authenticated and authorized gateway nodes can participate in the state sync protocol, effectively blocking unauthorized or misconfigured nodes from joining the cluster.
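
The app ID check described above can be sketched as a small pure function. This is an illustration under assumptions, not the SyncClient code: the function name verify_peer_app_id, the PeerCheckError type, the string representation of the app ID, and the treatment of a missing app ID as a rejection are all hypothetical. The sketch also assumes the TLS certificate itself has already been validated by the TLS layer.

```rust
/// Hypothetical error type for a failed peer identity check.
#[derive(Debug, PartialEq)]
pub enum PeerCheckError {
    /// The peer certificate carried no app ID (assumed to be a rejection).
    MissingAppId,
    /// The peer presented an app ID other than the cluster's.
    AppIdMismatch { expected: String, actual: String },
}

/// Compare the app ID extracted from the peer's certificate against the
/// app ID this cluster expects. Certificate chain validation is assumed
/// to have happened before this function is called.
pub fn verify_peer_app_id(
    expected: &str,
    presented: Option<&str>,
) -> Result<(), PeerCheckError> {
    let actual = presented.ok_or(PeerCheckError::MissingAppId)?;
    if actual != expected {
        // Record the rejection so operators can spot misconfigured peers.
        eprintln!(
            "rejecting peer: app id mismatch (expected {}, got {})",
            expected, actual
        );
        return Err(PeerCheckError::AppIdMismatch {
            expected: expected.to_string(),
            actual: actual.to_string(),
        });
    }
    Ok(())
}
```

A caller would run this check right after the handshake and drop the connection on any Err before sending state.
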
The state sync protocol is engineered for resilience, with built-in timeout protection and comprehensive error handling as detailed in the sync client async logic. When a node initiates a sync with a peer, it logs the attempt, enforces a timeout on the operation, and monitors for both timeout and RPC-level errors. If synchronization fails, whether due to a timeout, network issues, or any other error, the system logs the failure with diagnostic details and continues operating, so that transient faults or misbehaving nodes do not compromise cluster stability. A failed sync to one peer therefore does not block state propagation to the rest of the gateway cluster.
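
The log, timeout, and continue pattern can be sketched as follows. The source describes async logic; this sketch swaps in a standard-library worker thread and channel so it runs without an async runtime. The names sync_with_timeout and SyncOutcome, and the closure standing in for the RPC call, are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical result of one sync attempt to one peer.
#[derive(Debug, PartialEq)]
pub enum SyncOutcome {
    Synced,
    TimedOut,
    Failed(String),
}

/// Run `rpc` (a stand-in for the real sync RPC) on a worker thread and
/// wait at most `timeout` for its result. Failures are logged and
/// returned as values, never propagated as panics, so the caller can
/// move on to the next peer.
pub fn sync_with_timeout<F>(peer_url: &str, timeout: Duration, rpc: F) -> SyncOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    println!("syncing state to {}", peer_url);
    let (tx, rx) = mpsc::channel();
    // The worker is detached: on timeout it keeps running until the
    // call returns, and its late result is silently dropped.
    thread::spawn(move || {
        let _ = tx.send(rpc());
    });
    match rx.recv_timeout(timeout) {
        Ok(Ok(())) => SyncOutcome::Synced,
        Ok(Err(e)) => {
            eprintln!("sync to {} failed: {}", peer_url, e);
            SyncOutcome::Failed(e)
        }
        Err(mpsc::RecvTimeoutError::Timeout) => {
            eprintln!("sync to {} timed out after {:?}", peer_url, timeout);
            SyncOutcome::TimedOut
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            let msg = "sync worker exited without a result".to_string();
            eprintln!("sync to {} failed: {}", peer_url, msg);
            SyncOutcome::Failed(msg)
        }
    }
}
```

A caller would invoke this once per peer in a loop and treat every outcome other than Synced as non-fatal.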

Node Recycling and Health Monitoring

The gateway automatically removes (“recycles”) stale nodes from the cluster based on their last-seen timestamp. As implemented in main_service.rs#L439C4-L452C10, the recycling logic iterates over all known nodes and checks whether the time since each node’s last_seen exceeds the configured node_timeout. A node that is considered stale and is not the current node itself is removed from the cluster state. This keeps only recently active nodes in the cluster, which improves reliability and prevents issues caused by unresponsive or disconnected peers.

Node health monitoring is configured in the gateway’s TOML file under the [core.recycle] section (see gateway.toml), which sets parameters such as interval, timeout, and especially node_timeout (e.g., node_timeout = "10m"). The mechanism is implemented in the gateway service using the GatewayNodeInfo structure and is part of the cluster state management logic, relying on the gateway’s internal state synchronization and recycling packages.
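
The recycling rule can be sketched in a few lines of Rust. This is an illustration, not the code in main_service.rs: the function name recycle_stale_nodes, the NodeRecord type, the map keyed by node id, and the decision to keep nodes whose last_seen lies in the future (clock skew) are assumptions.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};

/// Hypothetical, trimmed-down node record; only last_seen matters here.
#[derive(Debug, Clone)]
pub struct NodeRecord {
    pub url: String,
    pub last_seen: SystemTime,
}

/// Remove every node whose last_seen is older than `node_timeout`,
/// except the current node itself. Returns the ids that were removed.
pub fn recycle_stale_nodes(
    nodes: &mut BTreeMap<String, NodeRecord>,
    self_id: &str,
    now: SystemTime,
    node_timeout: Duration,
) -> Vec<String> {
    let mut removed = Vec::new();
    nodes.retain(|id, node| {
        // duration_since fails if last_seen is later than now; such a
        // node is treated as fresh (an assumption of this sketch).
        let stale = now
            .duration_since(node.last_seen)
            .map_or(false, |age| age > node_timeout);
        let keep = id.as_str() == self_id || !stale;
        if !keep {
            removed.push(id.clone());
        }
        keep
    });
    removed
}
```

With node_timeout = "10m", a peer last seen 15 minutes ago would be dropped on the next pass, while a peer seen one minute ago and the local node would both be kept.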