Risk Management Procedures
Processes and actions that should be defined to address risks and that cover the concrete procedures in case of related incidents.
1. Risk Monitoring
Leverage monitoring dashboards or systems to identify the risk and gain relevant data.
Beacon Chain Monitoring
Slashing Events: Monitor the beacon chain for any slashing events.
Anti-Slashing Database: Regularly poll the local node to ensure the anti-slashing database is enabled and functioning correctly.
Impact of Slashing: Assess and monitor the broader impact of any slashing incidents on the network.
Relay List Monitoring: Monitor the relay list for availability metrics and load balance capabilities between various relayers for downtime conditions.
Chain Reorganizations: Track events and causes of chain reorganizations
Non-finalized Events: Monitor events preventing the consensus layer from confirming finality
Special Software Conditions: Monitor major software upgrades requiring specific durations and events that will conclude the upgrade
Node and System Health
Node Health Metrics: Monitor key metrics like CPU, memory, restarts, and uptime of nodes.
System Configuration: Monitor system configuration settings in real-time and continuously.
Key Usages: Track the usage of critical system keys.
App-specific: App specific metrics (e.g. metrics for Dirk & Vouch)
Security and Compliance
Access Control and Logs: Keep an eye on access controls to nodes and abnormal configuration changes.
Phishing and Endpoint Protection: Monitor for phishing attacks and ensure the security of endpoint protection systems, both for employee devices and infrastructure nodes.
Bastion Nodes: If applicable, monitor bastion or connection nodes.
Suspicious Internal Interactions: Watch for any suspicious internal interactions with infrastructure, cloud security platforms, or network monitoring solutions.
Access Patterns and Configurations: Check for unusual access patterns and the configurations of VPNs and 2FA systems.
Relay Compliance: Monitor relay compliance aspects and availability metrics.
Upgrade and Code Management
Upgrade Process: Monitor the upgrade process, including client code source, notification channels, bug reports, and community disclosures.
Customized Code in Testnet: Monitor any new custom code deployed in the testnet.
Hardware and Network
Baremetal and Network Equipment Health: Monitor the health of bare metals and networking equipment, including internet and peering connectivity.
Predictive Models: Use predictive models for future malfunctions and equipment replacement needs.
Capacity and Resource Usage: Track capacity usage, processing memory, and CPU.
Peering Connectivity: Monitor both internal and external network peering connectivity.
Firewall Configuration and Metrics: Keep an eye on firewall configuration changes or unexpected increases in drop metrics.
Cloud and Infrastructure
Cloud Monitoring Solutions: Utilize cloud monitoring solutions to keep track of uptime and internal issues.
Cloud Service Notifications: Stay informed about cloud service announcements regarding expected downtime and maintenance.
Take a look at Collection of Tools, Scripts & Templates for tool examples to perform the monitoring of some of the metrics mentioned above.
2. Incident Response Plan
Define Incident Response Plans (ICR) for all specific Risks and store them in a central location with access for all relevant employees. ICRs establish plans for managing security incidents and events, and offer guidance for employees or incident responders who believe they have discovered, or are responding to, a security incident. Ensure that relevant employees are aware of the location. Simulations of an Incident Response Plan should be conducted at least once a year.
Incident Response Plan template can be found here: https://docs.google.com/document/d/1ynZfeMh3vxZu7Juh-f34b50_3WHgejiL/edit?usp=sharing&ouid=117284374075970906179&rtpof=true&sd=true
3. Disaster Recovery Plan
A Disaster Recovery Plan gives guidance on recovering one or more information systems at an alternate facility in response to a major hardware or software failure or destruction of facilities. Simulations of a Disaster Recovery Plan should be conducted at least once a year in a test environment.
Disaster Recovery Plan templates can be found here:
4. Pre-Mortem
Perform Pre-Mortems for specific risks. The article linked below provides instructions on how to use Pre-Mortems to prevent incidents. Example topics for a Pre-Mortem could be
Unauthorized users could gain access to the servers
In the case of a security compromise it is not clear what to do
A specific scenario could result in downtime
Last updated