Large Node Operator
Blueprint provided by Kiln
Last updated
Try not to have more than one active channel per stakeholder.
Define incident response plans in advance so the team is prepared for potential incidents.
Aim to follow the Google SRE incident management practices.
Focus on declaring incidents quickly and easily, stopping the bleeding, and reducing over-communication.
Continuous training of people on incident management
Ensuring you have a good on-call rotation
Reduce over-communication
incident.io is used for incident communications. It streamlines internal resolution, helps with writing the post-mortem, and publishes to https://status.kiln.fi/
Worknet is used as a Slack bot for bulk messaging to groups of customer Slack channels.
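The bulk-messaging idea can be illustrated with a minimal sketch. It does not show Worknet's actual interface: the group names, channel IDs, and the `bulk_message` helper are all hypothetical, and the `send` callable is an assumption standing in for whatever actually posts to Slack (for example, a wrapper around the Slack Web API method `chat.postMessage`).

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical channel groups; names and channel IDs are placeholders.
CHANNEL_GROUPS: Dict[str, List[str]] = {
    "enterprise": ["C_ENTERPRISE_A", "C_ENTERPRISE_B"],
    "liquid_staking": ["C_LST_A", "C_ENTERPRISE_B"],
}


def bulk_message(
    groups: Iterable[str],
    text: str,
    send: Callable[[str, str], None],
    channel_groups: Dict[str, List[str]] = CHANNEL_GROUPS,
) -> List[str]:
    """Send `text` once to every channel in the selected groups.

    Returns the channels that were messaged, in sending order.
    """
    sent: List[str] = []
    seen = set()
    for group in groups:
        for channel in channel_groups[group]:
            if channel in seen:
                # A channel listed in two groups receives the message only once.
                continue
            seen.add(channel)
            send(channel, text)
            sent.append(channel)
    return sent
```

Deduplicating channels across groups keeps one update from landing twice in the same customer channel, which fits the goal of reducing over-communication.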
Are SLAs being met? (The ETH target is 99% uptime, the same as Coinbase, so there is typically enough buffer.)
Are the customers happy with the level of communication?
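To make the 99% figure concrete, here is a small sketch of the downtime budget it implies. It assumes uptime is measured as wall-clock availability over a fixed period, which is a simplification (the actual SLA may be defined on validator performance instead); the function names are illustrative.

```python
from datetime import timedelta

SLA_TARGET = 0.99  # uptime target quoted in the text


def downtime_budget(period: timedelta, target: float = SLA_TARGET) -> timedelta:
    """Maximum downtime allowed in `period` while still meeting `target`."""
    return period * (1.0 - target)


def sla_met(period: timedelta, downtime: timedelta, target: float = SLA_TARGET) -> bool:
    """True if the observed uptime fraction is at least `target`."""
    return 1.0 - downtime / period >= target


month = timedelta(days=30)
print(downtime_budget(month))               # about 7.2 hours for a 30-day month
print(sla_met(month, timedelta(hours=3)))   # 3 hours of downtime is within budget
```

Under these assumptions a 30-day month tolerates roughly 7.2 hours of downtime, which is why short outages usually leave a buffer.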
Group | Stakeholder | Level of Engagement | Comms. Channels |
---|---|---|---|
Institutional Stakers | Enterprise customers (e.g. Bitpanda) | Communicating the monthly performance. Communication if there is a major outage that might affect the SLAs | |
Service Partners | Liquid staking protocol customers (e.g. Lido) | Communication with them on any outage, communication of postmortems (not just with the LST protocol but also with other validators where relevant). Performing tests for them (e.g. Holesky Web3 signer scale test) | Telegram |
Software Providers | 3rd party software teams (e.g. Web3 signer, clients Teku, Prysm) | Share bugs and issues | Telegram, GitHub issues |
Auditors | Auditors (e.g. Quantstamp) | Share any outage and postmortem. Share architectural designs/changes | Slack |
Communities | Ethereum Foundation | Organizing talks, sharing some feedback on the latest hot topics (upgrades, now timing games) | Telegram |
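The stakeholder matrix above can also be kept as data, so that "who needs to hear about this?" becomes a lookup. The following is a minimal sketch: the event labels (`outage`, `major_outage`, `postmortem`, and so on) are hypothetical names chosen for illustration, and the channel tuple for Institutional Stakers is left empty because the table lists none.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stakeholder:
    group: str
    stakeholder: str
    channels: Tuple[str, ...]   # empty when the table lists no channel
    notify_on: Tuple[str, ...]  # hypothetical event labels


MATRIX: Tuple[Stakeholder, ...] = (
    Stakeholder("Institutional Stakers", "Enterprise customers", (),
                ("monthly_report", "major_outage")),
    Stakeholder("Service Partners", "Liquid staking protocol customers", ("Telegram",),
                ("outage", "major_outage", "postmortem")),
    Stakeholder("Software Providers", "3rd party software teams",
                ("Telegram", "GitHub issues"), ("bug",)),
    Stakeholder("Auditors", "Auditors", ("Slack",),
                ("outage", "major_outage", "postmortem", "architecture_change")),
    Stakeholder("Communities", "Ethereum Foundation", ("Telegram",), ("talk",)),
)


def who_to_notify(event: str) -> List[Stakeholder]:
    """Return the stakeholders whose engagement level covers `event`."""
    return [s for s in MATRIX if event in s.notify_on]


for s in who_to_notify("major_outage"):
    print(s.group, s.channels)
```

Keeping the matrix in one place makes it easier to spot stakeholders with several active channels or with none recorded.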