Large Node Operator
Blueprint provided by Kiln
Stakeholder Overview
Group | Stakeholder | Level of Engagement | Comms. Channels |
---|---|---|---|
Institutional Stakers | Enterprise customers (e.g. Bitpanda) | Communicating the monthly performance. Communicating any major outage that might affect the SLAs | |
Service Partners | Liquid staking protocol customers (e.g. Lido) | Communicating any outage; sharing postmortems (not just with the LST protocol but also with other validators where relevant); performing tests for them (e.g. the Holesky Web3Signer scale test) | Telegram |
Software Providers | Third-party software teams (e.g. Web3Signer; clients such as Teku and Prysm) | Sharing bugs and issues | Telegram, GitHub issues |
Auditors | Auditors (e.g. Quantstamp) | Sharing any outage and postmortem; sharing architectural designs and changes | Slack |
Communities | Ethereum Foundation | Organizing talks; sharing feedback on the latest hot topics (upgrades, and currently timing games) | Telegram |
Best Practices
Try not to have more than one active channel per stakeholder.
Define incident response plans to be prepared for potential incidents.
Aim to follow the Google SRE incident management practices.
Focus on declaring incidents quickly and easily, stopping the bleeding, and reducing over-communication.
Hot Topics
Continuous training of people on incident management
Ensuring you have a good on-call rotation
Reduce over-communication
Tools in Use
incident.io is used for incident communications. It streamlines internal resolution, helps write the postmortem, and publishes to https://status.kiln.fi/
Worknet is used as a Slack bot to send bulk messages to groups of customer Slack channels.
Effectiveness Metrics
Are SLAs being met? (The ETH target is 99% uptime, similar to Coinbase's, so there is typically enough buffer.)
Are the customers happy with the level of communication?
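To make the 99% uptime target more concrete, the sketch below converts an uptime target into a downtime budget and checks observed downtime against it. It is a minimal illustration, not Kiln's tooling: the function names are hypothetical, and it assumes uptime is measured as a fraction of wall-clock time over a 30-day period, which may differ from how a given SLA defines it.

```python
def downtime_budget_hours(uptime_target: float, period_days: int = 30) -> float:
    """Hours of downtime allowed in the period for a given uptime target.

    Assumption: uptime is a fraction of wall-clock time (0.99 means 99%).
    """
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be between 0 and 1")
    return (1.0 - uptime_target) * period_days * 24


def sla_met(observed_downtime_hours: float,
            uptime_target: float = 0.99,
            period_days: int = 30) -> bool:
    """True if the observed downtime fits within the budget for the period."""
    return observed_downtime_hours <= downtime_budget_hours(uptime_target, period_days)


if __name__ == "__main__":
    budget = downtime_budget_hours(0.99)
    # A 99% target over 30 days leaves a budget of roughly 7.2 hours.
    print(f"Monthly downtime budget: {budget:.1f} hours")
    print("2h outage within SLA:", sla_met(2.0))
    print("8h outage within SLA:", sla_met(8.0))
```

Under these assumptions, a single two-hour outage in a month stays well inside the budget, which is one way to read the "enough buffer" remark above.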