In a normal cloud, updating software is something you do to the machines. You own them, you have a control plane, and a rollout is a button you press and a graph you watch. A decentralized network takes that away. The machines belong to other people. They live behind home routers, they disappear when someone reboots, and there is no privileged door you can open to swap a binary on a whim.
So here is the lesson, start to finish: how does software on a fleet you do not control move forward on its own, without breaking what people are running, and without ever asking the people who own the machines to lift a finger?
Lesson 1: push is the wrong verb
The instinct from centralized infrastructure is to push. The instinct is wrong here, and it is wrong in a specific way.
PUSH MODEL                        ASK MODEL

control plane                     control plane
      |  "take this now"                ^  "may I?"
      |                                 |  "yes / not yet"
      v                                 |
[node] [node] [node]              [node] [node] [node]

one bad build                     one bad build
= already everywhere              = still nowhere until allowed
A push pipeline fails open in the worst possible way. Nothing stands between the pipeline and the fleet, so the moment a bad build leaves the pipeline it is already headed for every machine. Flip the arrow. Let each node state where it stands and ask before it moves. Now there is a place to say no, per machine, before anything happens at all.
Lesson 2: the agent has the agency
The only thing that reliably reaches a machine is the software already on it. So the agent updates itself. On a jittered interval it wakes, notices a newer build may exist, and instead of grabbing it, it reports its situation and waits for an answer.
The jitter is not cosmetic. If every node checked at the same instant, a release would arrive on the whole planet in the same minute. Nothing about your changelog should make a stranger's computer stutter, so the fleet is deliberately smeared across time.
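A minimal sketch of that smearing, in Python. The base interval, the jitter fraction, and the function name are all assumptions made for illustration; the only idea taken from the text is that each node draws its own wait rather than sharing a schedule.

```python
import random

BASE_INTERVAL_S = 3600.0  # hypothetical: nominal time between update checks
JITTER_FRACTION = 0.5     # hypothetical: spread each wait by +/- 50%


def next_check_delay(rng: random.Random,
                     base: float = BASE_INTERVAL_S,
                     jitter: float = JITTER_FRACTION) -> float:
    """Return how many seconds this node sleeps before its next check.

    The delay is drawn uniformly from
    [base * (1 - jitter), base * (1 + jitter)], so nodes that start in
    lockstep drift apart instead of asking in the same minute.
    """
    low = base * (1.0 - jitter)
    high = base * (1.0 + jitter)
    return rng.uniform(low, high)


# Simulate 1000 nodes each picking their next wake-up time.
# Seeding by node id just keeps the illustration repeatable.
delays = [next_check_delay(random.Random(node_id)) for node_id in range(1000)]
print(f"earliest check: {min(delays):.0f}s, latest check: {max(delays):.0f}s")
```

With these illustrative numbers the fleet's checks land anywhere in a one-hour window, so a new release reaches the network as a trickle of questions rather than a single spike.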
And the person who owns that machine never enters this story. They plugged in hardware. They do not read release notes, they do not run a command, they do not get paged. The correct number of update steps for a node operator is zero, forever. If they ever have to think about an update, the design has already failed them.
Lesson 3: one question is really three
When a node asks to move, the answer is not a single yes. It is three independent checks, and the whole robustness of the thing comes from keeping them apart.
node: "I am on vN. I could move to vN+1. May I?"
  |
  v
[1] is this build allowed on the network at all?
  |    no     stop. nothing moves to an unlisted build.
  |    yes
  v
[2] is THIS node allowed to move right now?
  |    no     stay. (nobody / a named set / everyone)
  |    yes
  v
[3] can the network spare this node this moment?
  |    no     stay, ask again later
  |    yes
  v
update, then rejoin
Gate one is the line between a build existing and a build being permitted: a version is only a candidate once its exact artifact is on an allowlist. Gate two is the dial an operator actually turns (nobody, a named handful, or everyone), which is what makes a careful staged release possible. Gate three is the interesting one.
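The three gates can be sketched as one function that checks them in order and reports which one said no. Everything named here (`Policy`, `Rollout`, `Verdict`, `decide`, the digest strings) is hypothetical and invented for the example, and gate three is passed in as a callable because it needs a view of the whole fleet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet


class Rollout(Enum):
    NOBODY = "nobody"
    NAMED = "named"
    EVERYONE = "everyone"


class Verdict(Enum):
    STOP = "stop"      # gate 1 refused: the artifact is not listed
    STAY = "stay"      # gate 2 refused: this node is not in the rollout
    LATER = "later"    # gate 3 refused: the network cannot spare the node
    UPDATE = "update"  # all three gates passed


@dataclass(frozen=True)
class Policy:
    allowed_digests: FrozenSet[str]          # gate 1: exact artifacts permitted
    rollout: Rollout = Rollout.NOBODY        # gate 2: the dial
    named_nodes: FrozenSet[str] = frozenset()  # used only when rollout is NAMED


def decide(policy: Policy,
           node_id: str,
           artifact_digest: str,
           can_spare: Callable[[str], bool]) -> Verdict:
    """Apply the three gates in order; the first refusal wins."""
    # Gate 1: is this exact artifact on the allowlist?
    if artifact_digest not in policy.allowed_digests:
        return Verdict.STOP
    # Gate 2: is this node inside the current rollout?
    if policy.rollout is Rollout.NOBODY:
        return Verdict.STAY
    if policy.rollout is Rollout.NAMED and node_id not in policy.named_nodes:
        return Verdict.STAY
    # Gate 3: can the rest of the fleet carry this node's work right now?
    if not can_spare(node_id):
        return Verdict.LATER
    return Verdict.UPDATE


# Usage: a staged release to one named node.
policy = Policy(allowed_digests=frozenset({"sha256:aaa"}),
                rollout=Rollout.NAMED,
                named_nodes=frozenset({"node-1"}))
print(decide(policy, "node-1", "sha256:aaa", lambda _node: True))
print(decide(policy, "node-2", "sha256:aaa", lambda _node: True))
```

Keeping the gates as separate early returns means each one can be changed, audited, or tightened without touching the other two, which is the point of keeping them apart.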
Lesson 4: deciding the network can spare a node
A node is not just a machine. At any moment it might be the thing answering a request. Pulling it out mid-work to install an update is a worse outcome than running last week's build for another twenty minutes.
So the node says what it is actually doing, and the only component that can see the whole fleet at once weighs it. The real question is not "is this one machine busy." It is "if this machine steps away, can the rest carry the work."
BEFORE                       DURING UPDATE

  request                      request
     |                            |
     v                            v
[A] [B] [C]                  [A] [.] [C]

work spread A B C            B steps out, A and C absorb it
                             nothing flickers from outside
                             then B returns, updated
Because the work was never pinned to a single machine, a release stops being one synchronized event and becomes a sequence of small, individually safe departures and returns. From the outside, the service simply stays up. There is an override for the rare case where a build matters more than never interrupting anything, but it is an override on purpose, not the default road.
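One way to phrase "can the rest carry the work" is as a capacity check over the fleet's self-reported load. This is a sketch under assumptions: the `NodeStatus` fields, the 20% headroom, and the `force` flag standing in for the override are all invented for illustration, and a real network would weigh more than two numbers per node.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NodeStatus:
    capacity: float  # hypothetical: work units this node can serve
    load: float      # hypothetical: work units it is serving right now


def can_spare(fleet: Dict[str, NodeStatus],
              node_id: str,
              headroom: float = 0.2,
              force: bool = False) -> bool:
    """Decide whether node_id may step out without anyone noticing.

    The fleet's total current load must fit into the capacity of the
    remaining nodes, with a fraction `headroom` of that capacity held
    in reserve. `force` models the deliberate override and skips the check.
    """
    if force:
        return True
    remaining_capacity = sum(status.capacity
                             for other_id, status in fleet.items()
                             if other_id != node_id)
    total_load = sum(status.load for status in fleet.values())
    return total_load <= remaining_capacity * (1.0 - headroom)


# Usage: three equal nodes, as in the picture above.
quiet = {name: NodeStatus(capacity=10.0, load=5.0) for name in "ABC"}
busy = {name: NodeStatus(capacity=10.0, load=6.0) for name in "ABC"}
print(can_spare(quiet, "B"))             # A and C can absorb B's share
print(can_spare(busy, "B"))              # too tight, B stays and asks later
print(can_spare(busy, "B", force=True))  # the override
```

The check looks at the fleet, not the node: B being busy is not what keeps it in place, A and C lacking room is.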
The whole lesson, in one picture
sleep (jittered)
|
v
notice newer build (a nudge skips the wait, still gated)
|
v
ASK the network
|
v
[1] build allowed? -- no --> wait
|
yes
v
[2] this node allowed now? -- no --> wait
|
yes
v
[3] can the network spare it? -- no --> wait
|
yes
v
pull -> swap self -> restore work -> rejoin fleet
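The picture above can be sketched as two small pieces: a wait that a nudge can cut short, and a single cycle that always asks before it moves. The function names and the `log` list are assumptions for the example; the actual pull, swap, restore, and rejoin steps are represented only as log entries, since the text does not say how they are implemented.

```python
import threading
from typing import Callable, List, Optional


def wait_for_next_check(nudge: threading.Event, delay_s: float) -> bool:
    """Sleep up to delay_s seconds; return True if a nudge ended the wait early.

    A nudge only shortens the sleep. The cycle that follows still has
    to ask the network, so the gates are never skipped.
    """
    nudged = nudge.wait(timeout=delay_s)
    nudge.clear()
    return nudged


def run_cycle(current: str,
              candidate: Optional[str],
              ask_network: Callable[[str, str], bool],
              log: List[str]) -> str:
    """Run one wake-up of the agent and return the version it ends on."""
    if candidate is None or candidate == current:
        log.append("nothing newer")
        return current
    # ASK: the network applies its gates and answers yes or not yet.
    if not ask_network(current, candidate):
        log.append("wait")
        return current
    # Placeholder steps, in the order the picture gives them.
    for step in ("pull", "swap self", "restore work", "rejoin fleet"):
        log.append(step)
    return candidate


# Usage: the same node asks twice; the network says "not yet", then "yes".
answers = iter([False, True])
log: List[str] = []
version = "v1"
for _ in range(2):
    version = run_cycle(version, "v2", lambda cur, new: next(answers), log)
print(version, log)
```

A refusal costs the node nothing but another sleep, which is what lets every gate default to no.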
Why it is shaped this way
It all comes from accepting one fact early. You cannot treat a stranger's computer like a server in your rack, and you cannot ask the person who owns it to operate it for you. So the agent gets the agency, the network gets three separate places to say no, and the decision to remove a node is made by the only part of the system that can tell whether anyone will notice.
It has more moving parts than pressing deploy. It is also the only version where the people who make the network exist never have to think about it, and the people who use it never see a gap. On infrastructure built from other people's machines, that is not a nice-to-have. That is the job.



