Testnet halt: lessons learned
A couple of weeks ago, we decided to do a runtime upgrade of our testnet on Rococo. We hadn't updated it in quite a while and were excited to roll out the completely overhauled t3rn Circuit, bringing new features like side effect bidding live. All of our tests passed, the runtime worked in Zombienet, and we had no issues running this version in standalone mode. We felt pretty confident about the upgrade.
The upgrade transaction was included in block 845420, updating the runtime. We expected the chain to continue producing blocks, but we were not so fortunate: block production stopped, and our chain was halted. And that was on a Friday afternoon. Yeah, great.
So what went wrong?
A lot has happened since the last time we updated our testnet. We refactored many areas of the code base, cleaning things up. Among those changes was an effort to unify the different mock runtimes we used. A mock runtime executes unit tests without starting the entire chain, which makes them run a lot faster. Over time we had amassed a lot of mock runtimes, maintaining a separate one for each pallet. This was becoming increasingly annoying to manage, so we created a single mock runtime that is used in all tests. During this refactor, the block time of our chain was changed from 12s to 6s.
This is roughly what the config was changed to in our code base. It looks harmless at first, but the problem becomes clear when you read the (ill-placed) note under the actual variable.
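A minimal sketch, modeled on the standard Cumulus parachain template (our actual constant names may differ slightly):

```rust
/// This determines the average expected block time that we are targeting.
/// Changed from 12_000 (12s) to 6_000 (6s) during the mock runtime refactor.
pub const MILLISECS_PER_BLOCK: u64 = 6_000;

// NOTE: Currently it is not possible to change the slot duration after the
// chain has started. Attempting to do so will brick block production.
pub const SLOT_DURATION: u64 = MILLISECS_PER_BLOCK;
```

That NOTE is the ill-placed warning: it sits under the value that everyone edits.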
So now what?
Fixing the issue in the runtime is quick, but how do we apply this change? How do we get the chain to produce blocks again?
It became apparent that we had to learn more about the underlying mechanisms upon which block production and consensus are based. It took us quite some digging to understand what options we had in this situation.
The first thing to understand is what is happening on our parachain. Why is block production not working anymore? Essentially, the new block time made the blocks our collators produce invalid from the relay chain's perspective: Aura derives the current slot from the timestamp divided by the slot duration, so changing the duration mid-chain throws off every slot calculation.
We also had to find a way to tell our collators not to execute the broken runtime, because the upgrade itself was successful and the new code was now registered on the relay chain.
On top of that, we couldn't simply perform another runtime upgrade to fix the issue: upgrades are applied in blocks, and the chain wasn't producing any. The chain was bricked. So how do we get around this?
Code substitutions
Code substitutions were created for precisely this situation. They enable a substitute runtime to be supplied via the chain spec, activated at a specific block.
It's important to note that this is not a runtime upgrade. It's a way to restart block production on a bricked chain without doing a hard fork or going back in time.
How to perform a code substitution:
1. Build the new WASM runtime, fixing the issue
2. Add a code substitution map to the raw chain spec (see the sketch below)
- As a key, select the block number on which the chain is stuck (as a string)
- As a value, add the WASM byte code (with 0x notation)
3. Do NOT update the spec version. Keep the version of the bad upgrade
4. Upgrade the runtime code on the relay chain (via governance or root)
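For the chain spec change in step 2, this is roughly the shape. The block number is ours; the WASM hex is a placeholder for the full runtime blob, which runs to megabytes:

```json
{
  "codeSubstitutes": {
    "845420": "0x<fixed-runtime-wasm-hex>"
  }
}
```

From that block onwards, the client executes the substitute code instead of the broken on-chain runtime, until the on-chain spec version changes.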
So, to start fixing the issue, we got in touch with Santiago and Alejandro from the Substrate Builders Program to update the runtime code on the relay chain. Since we are on Rococo, this can be done via root access (e.g. the relay chain's paras.forceSetCurrentCode call), which made it easy to deal with. On Mainnet, this would have to pass a governance vote. Once this was completed, all we had to do was resync our collators to start producing blocks again.
However, there is a big gotcha (that turned into a massive rabbit hole) with resyncing the collators. To resync, we stopped the collators, purged the parachain state, and started them again. Sounds logical. Why would we have to resync the relay chain as well?
Weellllll, it is required. We didn't do this initially, which resulted in the collators syncing the old blocks but never finalizing them: they would sync up to block 845420, but the collator telemetry showed the last finalized block as 0.
Needless to say, block production didn't resume either. Once we also purged the relay chain state, the old blocks were finalized and block production continued.
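For reference, a sketch of the full procedure. The binary name, base path, and chain spec are placeholders for whatever your collator uses, and the relay chain database location may differ per setup:

```bash
# Stop the collator first, then purge the parachain database
# (placeholder binary and paths).
./your-collator purge-chain --base-path /data/collator \
    --chain t3rn-rococo-raw.json -y

# The collator's embedded relay chain node keeps its own database,
# typically under <base-path>/polkadot. This is the part we missed:
# without wiping it too, old blocks sync but never finalize.
rm -rf /data/collator/polkadot/chains/<relay-chain-id>/db
```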
Bad Blocks
The other solution is to revert the chain to a point before the runtime upgrade. If we reset our chain to block 845419, the runtime upgrade effectively never happened. The chain should continue running from that point onwards, forking away from the lousy runtime upgrade.
To make this work, we must tell the collators where the wrong fork starts. They can then simply restart, reverting the chain to that point. We do this by adding bad blocks to the chain spec. (Make sure your parachain node actually supports the bad blocks chain spec extension; a sketch follows below.)
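If your node doesn't declare the extension yet, it goes on the chain spec's Extensions type. A sketch, modeled on how Polkadot itself declares it; the surrounding fields will differ per node:

```rust
use sc_chain_spec::ChainSpecExtension;
use serde::{Deserialize, Serialize};

// Your node's concrete block type; shown here as the common opaque block.
type Block = sp_runtime::generic::Block<
    sp_runtime::generic::Header<u32, sp_runtime::traits::BlakeTwo256>,
    sp_runtime::OpaqueExtrinsic,
>;

/// Node `ChainSpec` extensions.
#[derive(Default, Clone, Serialize, Deserialize, ChainSpecExtension)]
#[serde(rename_all = "camelCase")]
pub struct Extensions {
    /// The relay chain of the parachain.
    pub relay_chain: String,
    /// The id of the parachain.
    pub para_id: u32,
    /// Known bad block hashes; serialized as `badBlocks` in the spec JSON.
    pub bad_blocks: sc_client_api::BadBlocks<Block>,
}
```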
For us, it looked roughly like the following; the real entry carries the full hash of the first bad block (in our case, block 845420), replaced by a placeholder here:
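```json
{
  "badBlocks": [
    "0x<hash-of-block-845420>"
  ]
}
```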
As the last step, the parachain head stored on the relay chain must be reset. For us, resetting it to the header of block 845419 would make sense. Again, this can be done via root on Rococo (e.g. the paras.forceSetCurrentHead call) or via governance on Mainnet.
Ultimately, this is the approach we should have gone for to fix our chain. Still, it's good to be aware of how the two approaches differ: a code substitution restarts block production while keeping the chain's history intact, whereas bad blocks perform a proper hard fork, enabling the network to also go back in time. Having to apply either is never ideal, but it pays to know they exist.
Lessons learned
We learned a lot while resolving this issue, and in the end, we're pretty happy it happened.
Have we felt like a big gap existed in our knowledge about Polkadot, Substrate, and even blockchains at some point? Very much so. Did we accidentally reset our head on Rococo with an invalid block header (we set the block hash instead of the encoded header), creating new errors? Absolutely. Did it take a while to figure out that we were now dealing with multiple issues? Of course it did.
But it also forced us to dig into the lower-level workings of Polkadot and uncovered the blind spots we had.
We now feel well prepared if this were ever to happen on a production chain.
The team learned a lot dealing with this, and we have a good idea of how to prevent it in the future and make our response much faster. For one, we should have tested the runtime upgrade on Zombienet; coincidentally, we were already building that into our testing pipeline.
Another learning for us was how our response was coordinated internally. In the beginning, we weren't cohesive enough and didn't work together closely enough.
There is a first for everything, and this was a good dry run. Our parachain on Rococo is back online and has been running stably since.
We would like to thank Santiago and Alejandro for their support in guiding us through this process. We couldn't have done it without you guys!