Mitigation Strategies

This page summarizes different, node-operator specific mitigation strategies.

Node-Operator Technology Stack Mitigations

Local anti-slashing database

To avoid double signing, validators maintain a history of messages they signed, and this is usually stored inside of a database. In some cases, this feature is enabled by an external web3signer. The maintenance and protection of this database is crucial, as inconsistencies in this database may cause a double-signing event. The following items need to be in place:

  • Persistence of anti-slashing database: Ensure that a persistent, not a temporary storage is used for the anti-slashing database.

  • Ensure that slashing databases are always connected: It is possible to run a validator and a database, but never connect those two. Verify via monitoring that they interact.

  • Prevent deletion

Links to risks:

  • SLS1

  • SLS2

  • SLS3

Doppelgänger protection

While there are multiple measures possible to be taken to avoid two validator running with the same signing keys, one can also employ technologies that detect and prevent two validators running at the same time. This can be done using monitoring and alert systems, robust StatefulSet handling in Kubernetes to ensure no two containers with the same keys run at the same time, or pre-defined tools such as DoppelBuster.

Links to risks:

  • SLS2

Use of a Web3Signer

The main benefit of the use of Web3 signers is to have a service that is focused on the signing task directly, and comes with protection mechanisms.

Similar to the anti-slashing database, whenever used, a web3signer needs to be

  • Connected to a storage system (such as a database), and it needs to be ensured that it is always connected.

  • Ensured that they are not accidentally terminated.

  • Ensured that the failover is using the same web3signer

Links to risks:

  • SLS2 - SLS3

  • SLS14 - SLS15

  • KEC5 - KEC6

Client diversity

Maintain a diverse set of clients for different protocols, in order to reduce blast radius in case one of the clients appears to have a protocol error or other bug. In some cases, migrate keys to different clients in case of a specific client error observed, such as startup issues after controlled update or bug in the latest version of the chosen client.

Links to risks:

  • SLS6

  • SLS7

  • DOW2

  • DOW19

Distributed Validator Technology (DVT)

In order to avoid the single-point of failure problem for a node-validator without risking a slashing incident, DVT has been developed.

Links to Risks:

  • SLS1

  • SLS14

  • SLS15

  • KEC2 - KEC6

Lido-specific: Handling of delinquent state

In order to avoid loosing out on opportunity cost, Node operators need to develop and adhere to strict processes to properly exit validators, as they are otherwise put into a delinquent state. This results in monetary losses.

Links to risks:

  • SPS1

Secret Management

Controlled/audited secret access

Any secret needs to be accessed and authorized through a vault system. In this way, everything is audited, and anomaly detection can be activated for those vaults.

Also, multi-sig wallets should be used where appropriate.

Furthermore, access credentials for internal systems should also be stored inside those vaults, and key rotation managed from there.

Links to Risks:

  • SLS5

  • KEC1 - KEC4

  • KEC6

  • KEC8

  • KEC9 - KEC11

  • GIR25

Encryption of data at rest/in transit

Many different components interplay while a staking operation is going on. It is crucial, since sensitive information may be transmitted, to ensure that data is stored and transmitted in an encrypted fashion.

Links to risks:

  • SLS8

  • KEC5 - KEC7

  • KEC10 - KEC11

  • GR10

Store withdrawal keys in a cold location

Ideally, since these keys are not used often, it makes sense to store them in locations where data is not as often accessed. Ideally Air-Gapped.

Links to risks:

  • KEC5 - KEC7

Employees and signing keys

Employees should not be able to delete signing keys and there should be a back-up for the signing keys. Modern vault systems can have policies where deletion is prevented by certain or all users. Signing keys should be only possible to be removed by the root-user or through some multi-signing mechanism.

Links to risks:

  • KEC10

Access to unencrypted signing keys

The use case where an employee would need to access a signing key is low, and this should only be possible with a clear protocol when a support case is required. Vault systems can be set up that only verifier container roles can access these keys.

Links to risks:

  • KEC2

  • KEC11

Key rotation

Key rotation and a proper process around it is key to protect one's infrastructure from a potential breach of credentials. When in doubt, keys should be rotated. This includes, but is not limited to:

  • The Postgres database used by Web3Signer

  • The vault itself

  • Any SSH keys

  • Any API keys for your cloud infrastructure

Links to risks:

  • SLS8

  • GIR6 - GIR7

Access Management

Access controls & access management

The principle to follow is "least privilege". This is usually achieved by using an enforcing role-based access control, and create fine-grained roles throughout all processes of an organization.

Each user should be assigned roles, and some are temporary. There should be a clear lifetime of a user, that is automatically enforced and can be extended when needed.

Links to risks:

  • SLS8 - SLS9

  • DOW16

  • GIR1

  • GIR7

  • GIR22

Least Privilege

Even when employing RBAC, there are ways to log into containers as users and acquire larger privileges from there. Take docker exec -uroot as an example. These mechanisms can be disabled on the orchestration level (and should be).

Links to risks:

  • SLS8 - SLS9

  • DOW16

  • GIR1

  • GIR22

  • KEC8

  • GIR25

Strict employment termination process in place

Ensure that terminated employees do not have lingering credentials they can use to cause harm.

Links to risks:

  • SLS10

  • DOW17

  • GIR25

No access from external network to the nodes

Following the principle of defense in depth and least privilege, it is important that nodes are generally not accessible from the web. Any web access should be proxied through a load-balancer that has a firewall attached to it. The reason is that there are many software pieces on a node, potentially, and the attack vector due to a potential CVE may be incresed.

Links to risks:

  • SLS12

Strong authentication

Use password policies at every layer of the infrastructure (i.e. DUCK123 should never be an allowed password ;-)). When users are authenticating, MFA should be used.

Links to risks:

  • SLS13

Prevent physical access to non-authorized persons

This is mainly for bare-metal installations. If you host your nodes on-premise, ensure that physical access to the servers is restricted through a key-mechanism. Ideally, any entry and exit should be logged.

Links to risks:

  • DOW4

  • KEC6

  • KEC8

Development and Update Process

Testing and review of all changes to infrastructure code

Anything on the infrastructure should be captured in a code repository, and changes managed through a versioning system such as Git. No direct push to the main branch should be possible; everything should go through pull requests and review.

All code should go through static and dynamic analysis tools to minimize risk.

There should be custom tests created, and a strict testing policy before pushing to prod needs to be in place.

Ideally, metrics should be used to verify a high degree of testing culture. This includes, but is not limited to:

  • Line coverage

  • Endpoint coverage

  • Accidental human error detection

  • Architectural enforcement

Links to risks:

  • SLS4-SLS7

  • SLS18

  • DOW2

  • DOW6

  • DOW11-DOW14

  • GIR11

  • GIR13

  • GIR18

  • GIR21

  • GIR23-GIR24

  • DOW19

  • DOW20

No custom changes to the validator software

Validator software is open source, but in order to ensure that no protocol error occurs, the code should not be touched.

Links to risks:

  • SLS7

  • DOW13

  • DOW19

  • DOW20

Sanitize inputs

Unchecked inputs are a major cause for overflow attacks and brute force. Ideally, the load balancer in front of the node filters out all traffic that has too large headers and payloads. Additionally, if JSON payloads are being used, they should be checked to adhere to a certain schema.

Links to risks:

  • GIR8

Use of separate tests and staging environments

This minimizes a potential blast radius. It is important to run any change (even an update of a validator software or Web3Signer) through a test environment first, and then roll it out in a staged fashion. If it causes some slashing event, it is then contained to the few nodes that it was rolled out to.

Links to risks:

  • GIR11

  • DOW19

  • DOW20

Use containerized and orchestrated environments only.

Follow their best practice recommendations. Their mechanisms are more than battle-tested in different environments. Any make-shift approach to do mechanisms such as fail-over by hand should be deemed insecure.

Links to risks:

  • GIR23

Automation where possible

Human error is a real threat, and every process should at least follow an automated script that may or not be invoked by a human. The other risk of non-manual steps is the reduction of the risk of exposure of secrets. Everything should be done through pipelines and job-mechanisms (GitHub Actions, Apache Airflow, Apache Nifi)

Links to risks:

  • GIR16

  • GIR18 - GIR21

  • SLS17

  • DOW19

  • DOW20

  • GIR25

Minimize CVEs in images

Analyzing images for potential CVEs is simple nowadays (use e.g. Trivy). Further configurations inside these images can be checked using CoGuard. Any image used in your infrastructure should be checked this way.

Links to risks:

  • GIR17

Monitoring

Logging/Alerting at all levels of the infrastructure

Every component of your node operation is producing logs. These should be captured, analyzed, and alert systems should be set up to warn if something is wrong. Examples include, but are not limited to:

  • Web3Signer database has no CRUD operations going on (is it connected?)

  • CPU/Memory spike suddenly in container

  • Network traffic in and out of container

  • Relays

  • Slashing related logs on validator nodes

The alert systems should be automatically set up to take actions such as shutting nodes down (nuking).

Links to risks:

  • SLS8

  • SLS16

  • DOW6

  • DOW15

General Measures

  • General cyber security (Firewall, Intrusion Detection System, ....)

  • Check the uptime promise of cloud provider (minimum three 9s)

  • Failover system (also in different locations)

  • Keeping track of age and replacing appliances

  • Conduct an internal special study of failover and load balancer strategies

  • Securing the physical access

  • Being informed about the relevant natural catastrophes

  • Ensure stable Internet connection of the System (Cloud, Bare Metal, ....)

  • Ensure stable Power connection of the System (Cloud, Bare Metal, ....)

  • Ensure proper load-balancer and firewall at the front

  • Only necessary software on the relevant servers

  • Being able to switch the relayer or disconnect from the relay

  • Back-Up/DR / BC Policies

  • Validate cloud, data center or infrastructure provider regarding security

  • Safety training

  • Central & accessible documentation of critical knowledge

  • Having a communication toolkit and process prepared

  • Having a incident response policy / strategy

Last updated