Post-Mortem
As mentioned last week on our Discord channel, the StakeWise team wants to provide a detailed update on the events that led to our validators being down during the early hours of the Medalla launch.
On August 4th at approximately 9 am EST, the Medalla beacon chain was launched. Approximately 16 hours earlier, the Prysmatic Labs team had introduced a keystore migration that transforms keys from one format to another. Currently, StakeWise uses the Prysm client to run 100% of its validators.
StakeWise already had more than 1000 validators running by the time the new client version was released. StakeWise uses its own “validator operator” which listens to its smart contracts, creates new validator keys, and calls the registration function on its smart contracts to register new validators.
For security purposes, StakeWise’s infra containers run in such a way that it is impossible to add, delete, or update files outside of the chain data. This means that the keystore migration, which runs when a validator client starts, would fail with a “permission denied” error when transforming keys from one format to another. Disabling v2 accounts on the validator clients was not an option either, as we had already migrated to the keystore format compatible with version alpha-17.
As a result, we had the following options less than 24 hours before genesis:
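This kind of failure can be caught before an upgrade with a simple pre-flight check. The following is a minimal Python sketch, not StakeWise's or Prysm's actual code: the helper name `can_write_to` and the example paths are hypothetical, and it only tests whether a process can create a file in a given directory, which is what an in-place keystore migration needs.

```python
import os
import tempfile


def can_write_to(directory: str) -> bool:
    """Return True if a file can be created (and removed) in `directory`.

    Hypothetical helper: an in-place keystore migration has to write new
    files next to the old ones, so a read-only location makes it fail.
    """
    try:
        # Create a throwaway file; it is deleted when the handle closes.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers permission errors, read-only filesystems,
        # and directories that do not exist.
        return False


if __name__ == "__main__":
    # A freshly created temporary directory stands in for a writable volume.
    with tempfile.TemporaryDirectory() as writable_dir:
        print(can_write_to(writable_dir))
    # A path that does not exist stands in for an unusable keystore location.
    missing = os.path.join(tempfile.gettempdir(), "no-such-keystore-dir")
    print(can_write_to(missing))
```

Running a check like this inside the locked-down container, against the directory that holds the keys, would show ahead of time whether a client version that rewrites keys on startup can succeed there.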
1. Migrate our operator client and existing keys to the new keystore format, where all the keys of each validator client are stored in a single file called “all-accounts.keystore.json”, and migrate to alpha-18.
- We didn’t go for this option as there wasn’t enough time to make changes to the validator operator.
2. Disable the security context for the containers to allow the validator client to migrate the keys on its own, and upgrade to alpha-18.
- This would compromise our containers’ security, and we put security above all else.
3. Stay on alpha-17 until the genesis and execute option 1 afterward.
- This felt like the safest way to join genesis.
However, as soon as genesis started, our validators were unable to attest due to the “Failed to update assignments error=no signing key found in keys cache” error.
Solving What Went Wrong
The bug we faced could only be observed once the validators were already running, so it appeared right after genesis. When we started to dig into the issue, we noticed that this error was fixed in the alpha-18 release (https://github.com/prysmaticlabs/prysm/issues/6834). The only way to stop the validator penalties as quickly as possible was therefore to disable the security context for the containers, allow the validator client to perform the migration on its own, and upgrade to alpha-18. It took us a couple of hours to roll out that fix across all 1,500 validators we had at that time.
As of today, we have migrated our operator client and existing keys to the new keystore format, which dramatically improved validator provisioning speed, and we have re-enabled the security context for our containers.
Risk Mitigation in the Future
In order to prevent this from happening on the Mainnet, we’ll be making the following changes to our infra:
- Remote Key Manager — This will detach the keystore completely from the validator so that keystore management changes on the validator client don’t affect us.
- The Prewashed Version of the Client — We won’t upgrade to the newer version of the client on the Mainnet without ensuring that it runs smoothly on the Testnet and we expect new client versions to become more stable by that time.
- Multiple Clients in Use — Lastly, we plan to run multiple clients and switch from one to another depending on their stability.
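The last two points amount to a release policy that can be stated precisely. As a rough Python sketch only: the `ClientStatus` fields, the placeholder client name "other-client", the version strings, and the seven-day threshold are all illustrative assumptions, not StakeWise's actual tooling or criteria.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_TESTNET_DAYS = 7  # assumed threshold, chosen only for illustration


@dataclass
class ClientStatus:
    name: str                 # client name; "other-client" is a placeholder
    version: str              # version string, made up for the example
    testnet_days_stable: int  # days this version ran on Testnet without incident
    healthy: bool             # current health as reported by monitoring


def approved_for_mainnet(client: ClientStatus) -> bool:
    """A version qualifies only after running smoothly on Testnet."""
    return client.testnet_days_stable >= MIN_TESTNET_DAYS


def pick_client(candidates: List[ClientStatus]) -> Optional[ClientStatus]:
    """Return the first healthy, Testnet-proven client, or None.

    Candidates are assumed to be listed in order of preference.
    """
    for client in candidates:
        if client.healthy and approved_for_mainnet(client):
            return client
    return None


if __name__ == "__main__":
    candidates = [
        # Freshly released version: not yet proven on Testnet.
        ClientStatus("prysm", "new-release", testnet_days_stable=1, healthy=True),
        # Hypothetical second client with a longer clean Testnet record.
        ClientStatus("other-client", "stable-release", testnet_days_stable=10, healthy=True),
    ]
    chosen = pick_client(candidates)
    print(chosen.name if chosen else "no approved client")
```

Under these assumptions, a release that came out hours before a launch would never be selected for the Mainnet, which is the situation this policy is meant to rule out.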
Thankfully this is a Testnet where we can still iron out bugs and ready ourselves for the Mainnet. As always, if you have any questions, please let us know by reaching out on Twitter or Discord.