
Protocol Implementation Risk

Building crosschain protocols involves creating complex on-chain and off-chain components while accounting for the peculiarities and pitfalls of different programming languages, frameworks, virtual machines, runtime environments, and protocols. Inevitably, such complexity increases the likelihood of bugs and vulnerabilities. This type of risk has thus far been the most common cause of bridge hacks witnessed over the last couple of years.

The best practices learned over the years in building secure systems and blockchain applications apply directly in the context of crosschain protocols. Discussing each of these would be beyond the scope of this document and largely duplicative. Instead, we refer the reader to resources such as Ethereum Smart Contract Best Practices, which outlines best practices, patterns, anti-patterns, common attacks, and guiding philosophies for building and maintaining secure applications. It covers general principles such as the importance of simplicity, modularity, reuse of standard implementations, and planning for failures, all of which are critical considerations in crosschain protocols.

This section discusses some of the salient principles and considerations for thinking about crosschain implementation risk. In general, protocol developers should mitigate implementation risk by reducing its likelihood upfront, uncovering extant risk, and having controls to respond to materialized risk.

  • Reducing risk

    • Managing Complexity: Generally, the higher the complexity of a protocol's design, code, and runtime environments, the higher the implementation risk. Various ways to gauge complexity include: the size and intricacy of codebases (e.g., cyclomatic complexity), the diversity of execution environments, and the number of moving and coordinating pieces.
    • Assurances of Correctness: Ensuring the correctness of the implementation of a protocol against its specification, at different levels of granularity, under a range of inputs and conditions is critical. There are numerous techniques for attaining varying degrees of confidence about the correctness of a protocol, ranging from formal verification to different types of testing. Rigorously employing such techniques reduces implementation risk.
    • Principle of Least Privilege: Ensuring that fine-grained access controls are in place reduces the impact of compromised entities or credentials.
    • Principle of Diffuse Privilege: Ensuring that control of critical operations is decentralized offers checks and balances that mitigate the likelihood and impact of implementation risk.
    • Nascency Risk: Crosschain protocols are built to operate across disparate networks. Some of these ecosystems use tools, frameworks and execution environments that are less mature or well understood. This increases implementation risk.
  • Uncovering extant risk: Protocols should enable, encourage and incentivize external review and scrutiny of their codebase. This enables vulnerabilities to be surfaced by good-faith actors and reduces the risk of exploits. Some practices that are critical aids to this include up-to-date audits, bug bounties, open-source codebases, and good documentation.

  • Responding to materialized risk: Despite best efforts, hacks and exploits are likely to occur. Protocol implementations should include efficient mechanisms for responding to such incidents. Capabilities that enable this include the ability to monitor, detect anomalies, and pause protocols.

The rest of this section will discuss specific practices and considerations that expand on the above framework.


Reducing Risk

Mixing of Control and Data Plane

The terms Control Plane and Data Plane come from networking [Wikipedia]. In the context of networking, the Control Plane configures the network topology and routing tables, and the Data Plane is the information that is communicated across the network. In the context of computing, the Control Plane is the configuration of the system, and the Data Plane is the data processing.

Functions in smart contracts can be ones that control the configuration of the contract. These can be thought of as Control Plane functions. For example, a function to pause the contract is a Control Plane function. Data Plane functions are functions that process data. For example, a function to mint some tokens is a Data Plane function.

Poor project design can result in smart contract functions that contain both Control Plane and Data Plane logic. Mixing the two planes in one function dramatically increases the risk of the project. An attacker may be able to compromise the Data Plane part of a mixed processing function, and then use that to change the configuration of the project by accessing the Control Plane part of the mixed function. This can give the attacker the ability to control aspects of the project such as minting tokens.
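
To make this concrete, the sketch below contrasts a mixed-plane function with a design that separates the two planes. The contract and function names are invented for illustration, and the message-verification logic is elided.

contract MixedPlanes {
    address public trustedSigner;

    // Risky: a single Data Plane function whose input can also reach
    // Control Plane state (the trusted signer).
    function processMessage(bytes calldata message, bool updateSigner, address newSigner) external {
        if (updateSigner) {
            trustedSigner = newSigner; // Control Plane reachable via Data Plane input
        }
        // ... verify and execute the crosschain message ...
    }
}

contract SeparatedPlanes {
    address public admin = msg.sender;
    address public trustedSigner;

    // Control Plane: configuration changes have a separately
    // access-controlled entry point.
    function setTrustedSigner(address newSigner) external {
        require(msg.sender == admin, "Not admin");
        trustedSigner = newSigner;
    }

    // Data Plane: processes messages but cannot alter configuration.
    function processMessage(bytes calldata message) external {
        // ... verify and execute the crosschain message ...
    }
}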

Example

An example of this type of issue being exploited is the August 2021 PolyNetwork hack. The PolyNetwork code was written such that its EthCrossChainManager contract was the owner of the EthCrossChainData contract. The EthCrossChainData contract held important information including the public keys used to verify crosschain requests. This design allowed function calls to the EthCrossChainData contract to go via the EthCrossChainManager contract. Access from the EthCrossChainManager contract to the EthCrossChainData contract could be deemed part of the Control Plane. The EthCrossChainManager contract also had a function, verifyHeaderAndExecuteTx, that was used to process Data Plane requests. The attacker was able to create a carefully constructed call to verifyHeaderAndExecuteTx that allowed the Data Plane request to modify data in the EthCrossChainData contract, which ultimately let the attacker steal funds.

The PolyNetwork code would not have been vulnerable to this type of attack if there had been a clear separation of Control Plane and Data Plane. For example, rather than doing updates to the EthCrossChainData contract via the EthCrossChainManager contract, updates could have been allowed only from an Externally Owned Account (EOA) or a MultiSig Wallet account.

Role Based Access Control

Role Based Access Control (RBAC) allows different entities to be responsible for different configuration actions. Systems that are managed by a single entity are inherently less secure than those with narrowly-scoped privileges for different entities and specific contexts.

With contracts, this can be used to limit which accounts can execute which functions. For example, imagine a contract that operates as a crosschain bridge. It could have a role called PAUSER. This role could be required to call a function that enables pausing of the contract. Any transaction submitted by an account that did not have the PAUSER role would be reverted.

Simplistic contracts might have a single role, OWNER, that can only be assigned to one account. For these contracts, the owner account is the only account that can submit transactions that call configuration functions without reverting.

The greater degree of flexibility afforded by Role Based Access Control compared to simplistic OWNER style access control has security implications. For example, a contract might be able to mint new tokens, and thus have a MINTER role to control this action. Minting new tokens could change the tokenomics of the contract, and hence must only be executed if there is agreement between administrators. Access to this configuration action might be limited to a multi-signature wallet account. The same contract might have a PAUSER role that can be used to stop data processing within the contract. Pausing the contract needs to occur as quickly as possible, to halt an in-progress attack. However, access to the role needs to be limited to trusted accounts, to prevent attackers from mounting a Denial of Service attack by continually pausing the contract. Using a multi-signature wallet to control this action is not ideal, as multiple parties would need to coordinate to pause the contract, allowing attacks to continue longer than they otherwise would. In this situation, multiple trusted accounts could each be granted the PAUSER role, and any one of them could then pause the contract.

For a small project, when a contract is deployed, it might be tempting to use simplistic OWNER style access control. However, it is better to deploy a contract configured for fine-grained Role Based Access Control, with all roles initially assigned to one account. In this way, as the project using the contract matures, new accounts can be granted roles and the original account's access can be revoked. It should be noted that the benefits of RBAC are only realised once access for different roles is allocated to additional accounts.

For Ethereum based projects, the OpenZeppelin project has an example contract AccessControl.sol that can be used to implement Role Based Access Control. Using this template, checking that an address has been granted a role becomes as simple as calling the hasRole function. The code below (with imports and a role definition added, assuming OpenZeppelin v4.x paths) shows how this would work in practice.

import "@openzeppelin/contracts/access/AccessControl.sol";
import "@openzeppelin/contracts/security/Pausable.sol";

contract Example is Pausable, AccessControl {
    bytes32 public constant PAUSER_ROLE = keccak256("PAUSER_ROLE");

    constructor() {
        // Initially grant all roles to the deployer.
        _setupRole(DEFAULT_ADMIN_ROLE, msg.sender);
        _setupRole(PAUSER_ROLE, msg.sender);
    }

    function pause() external {
        require(hasRole(PAUSER_ROLE, msg.sender), "Must have PAUSER role");
        _pause();
    }
}
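
The deployer above initially holds all roles, matching the deployment pattern described earlier. As the project matures, the holder of DEFAULT_ADMIN_ROLE can grant PAUSER_ROLE to additional trusted accounts (or to a multi-signature wallet) via AccessControl's grantRole function, and revoke the original account's access via revokeRole, realising the flexibility described above.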

Upgradable

Like any software, smart contract software can have bugs. This is true both of code that appears to be extremely simple and of more complex code. Additionally, applications may require new features. As such, many projects have the ability to upgrade their smart contract code.

Having the ability to upgrade a contract is inherently risky as it can lead to rug-pulling attacks, where the operators of a project change the contract logic and steal customer funds (for instance, the Hunter Defi rug-pull). However, as a proportion of both crypto rug-pull attacks and projects that use upgradable contracts, using contract upgrades as a method of stealing funds is very rare. We have no hard statistics on this; this is just an observation.

Another major source of risk related to upgrading a contract is that vulnerabilities can either be introduced due to the new code, or the interaction of the new code with the old data. This was the source of the issue in the Nomad hack in August 2022.

An important consideration is the processes and controls that govern contract upgrades. That is: who can perform upgrades, how decentralised is this privilege, and are there timelocks so that upgrades take effect some time after they are triggered? Many of these governance considerations are covered in the Role Based Access Control section.
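
As a minimal sketch of the timelock idea (the two-day delay and all names here are assumptions for illustration; production systems typically use an audited implementation such as OpenZeppelin's TimelockController):

contract UpgradeTimelock {
    uint256 public constant DELAY = 2 days;
    address public admin = msg.sender;

    // Queued upgrade proposals, keyed by hash, with the time they were queued.
    mapping(bytes32 => uint256) public queuedAt;

    function queueUpgrade(address proxy, address newImplementation) external {
        require(msg.sender == admin, "Not admin");
        queuedAt[keccak256(abi.encode(proxy, newImplementation))] = block.timestamp;
    }

    function executeUpgrade(address proxy, address newImplementation) external {
        bytes32 id = keccak256(abi.encode(proxy, newImplementation));
        require(queuedAt[id] != 0, "Not queued");
        require(block.timestamp >= queuedAt[id] + DELAY, "Timelock not expired");
        delete queuedAt[id];
        // ... call the proxy's upgrade function here ...
    }
}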

The three common methodologies for upgrading smart contracts are:

  • Data Holder Upgrade Pattern: Have a data holder contract and a separate business logic contract. The business logic contracts are upgraded, and connect to the existing data holder contracts. Issues with this approach are that new versions of the business logic contract need to be able to utilise old data formats stored in the data holder contract.
  • Transparent Upgrade Proxy Pattern: Have a transparent upgrade proxy contract and a business logic contract. The business logic contract executes in the context of the upgrade proxy contract. The upgrade logic resides in the proxy contract. Issues with this approach are that extreme caution needs to be exercised to ensure there are no storage slot collisions between the proxy and the business logic contracts, and between different versions of the business logic contract. A minimal sketch of this pattern follows this list.
  • Transparent Upgrade Proxy Pattern with Upgradable Upgrade Logic: As per the Transparent Upgrade Proxy Pattern described above, but with the upgrade logic in the business logic contract. The advantage of this approach is that governance that exists in the business logic contract can be used to approve the upgrade. Issues with this approach over and above the issues with the Transparent Upgrade Proxy Pattern approach is that bugs with the business logic contract can interfere with the upgrade logic, thus preventing upgrade, or enabling an attacker to maliciously upgrade the contract.
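
The following is a minimal, illustrative sketch of the proxy idea underlying the Transparent Upgrade Proxy Pattern, assuming EIP-1967 storage slots to avoid collisions. It omits the selector-clash protections of production implementations such as OpenZeppelin's TransparentUpgradeableProxy, which should be used in practice.

contract MinimalUpgradeProxy {
    // EIP-1967 implementation slot:
    // bytes32(uint256(keccak256("eip1967.proxy.implementation")) - 1)
    bytes32 private constant _IMPL_SLOT =
        0x360894a13ba1a3210667c828492db98dca3e2076cc3735a920a3ca505d382bbc;

    address public admin = msg.sender;

    constructor(address implementation_) {
        _setImplementation(implementation_);
    }

    // Control Plane: only the admin can swap the logic contract.
    function upgradeTo(address newImplementation) external {
        require(msg.sender == admin, "Not admin");
        _setImplementation(newImplementation);
    }

    function _setImplementation(address newImplementation) private {
        bytes32 slot = _IMPL_SLOT;
        assembly {
            sstore(slot, newImplementation)
        }
    }

    // Data Plane: all other calls are delegated to the logic contract,
    // which executes in the storage context of this proxy.
    fallback() external payable {
        bytes32 slot = _IMPL_SLOT;
        assembly {
            let impl := sload(slot)
            calldatacopy(0, 0, calldatasize())
            let ok := delegatecall(gas(), impl, 0, calldatasize(), 0, 0)
            returndatacopy(0, 0, returndatasize())
            switch ok
            case 0 { revert(0, returndatasize()) }
            default { return(0, returndatasize()) }
        }
    }
}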

In summary, having the ability to upgrade the contracts of a project has risks. However, not having the ability to upgrade contracts, thus resolving bugs, has other, possibly larger risks.

Secret Storage

Most projects use cryptographic keys to operate the system. These keys should be stored in a networked Hardware Security Module (HSM) or a hardware wallet, and not in a file on disk on a server.

Product Development Maturity

Organisations that lack defined software development practices are likely to create lower quality software than organisations that have them. These practices are collectively known as the Software Development Lifecycle (SDLC).

Well Known Platform

More people are likely to be able to review and understand projects that are created in environments that they are familiar with. The majority of blockchain developers in the world understand Ethereum and the Ethereum Virtual Machine (EVM). Hence, projects that use blockchains that support the EVM, or that are forks of Ethereum, are inherently less risky than projects that involve other blockchains.

Well Known Smart Contract Programming Language

More people are likely to be able to review and understand code that has been created in programming languages that they are familiar with. Solidity is the smart contract programming language known by most blockchain developers. Hence, projects that use Solidity are inherently less risky than projects that use other smart contract programming languages.

Inline assembly can be used in Solidity to implement complex features not available in the standard language. Assembly code is more complex than Solidity code and is more likely to contain bugs and have unexpected consequences.


Uncovering Extant Risk

Formal Verification

Formal verification proves that an implementation matches its specification. Hence, formal verification cannot detect bugs in the design (the specification itself) of the project. However, it will pick up implementation bugs that testing might miss.

Testing

Code that is not tested is far more likely to contain bugs than code that has been tested. Comprehensive tests allow new features to be added without fear of breaking existing functionality. Hence, the more comprehensive the testing of the project, the less risky the project is.

Testing falls into several categories:

  • Unit Testing: Checks the operation of a single component, module, class, or contract in isolation. Because of the low-level nature of this testing, it should be possible to check all error conditions. Mocking can be used to simulate the behavior of real components that the component under test relies upon.
  • Integration Testing: Checks the operation of multiple components together.
  • System Testing: Checks the operation of the entire product. This testing could check the system with no data (simulating a new install), with prefilled data (simulating multiple years of operation), or with real-world data. A type of System Testing is Upgrade Testing, which checks that existing version(s) of the system can be upgraded to the new version. System Testing is also known as Regression Testing or End to End Testing.
  • Performance Testing: Measures the latency, speed of operation, gas usage, transactions per second, or some other metric. These tests can be used to check that the performance of a system has not degraded from one software release to the next. Performance testing can be done at the unit, integration or system level.
  • Acceptance Testing: Checks that the software can be deployed and that the deployed software operates as expected.
  • Interoperability Testing: Checks that two software products can communicate based on a standard.

Testing can be automated or manual. Automated tests allow software to be tested using scripts. Continuous Integration is a form of automated testing that runs a test script when code is committed to a source code repository. Manual testing requires people to perform a sequence of tests. As manual testing is labor intensive, there is a risk that it is performed for some software releases and not others. As such, it is preferable to rely on manual testing only for some Acceptance Testing, and not for other forms of testing, such as System Testing.

Tests can be happy path tests that only check the operation of the software in ideal conditions. Unhappy path tests check what occurs when an error condition arises. Tests should check both happy and unhappy paths.
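
As a hedged illustration, the sketch below shows one happy path and one unhappy path unit test for the Example contract from the Role Based Access Control section, assuming the Foundry testing framework (forge-std); the import path is an assumption.

import "forge-std/Test.sol";
import "../src/Example.sol"; // path assumed

contract ExampleTest is Test {
    Example example;
    address attacker = address(0xBAD);

    function setUp() public {
        example = new Example(); // the test contract receives PAUSER_ROLE
    }

    // Happy path: an account holding PAUSER_ROLE can pause.
    function testPauserCanPause() public {
        example.pause();
        assertTrue(example.paused());
    }

    // Unhappy path: an account without PAUSER_ROLE must revert.
    function testNonPauserCannotPause() public {
        vm.prank(attacker);
        vm.expectRevert(bytes("Must have PAUSER role"));
        example.pause();
    }
}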

Code Coverage describes what percentage of the code is tested, based on the lines of code. It provides a rough metric of how well code has been tested. It can be used to determine whether, when new code is added, more or less code overall is being tested.

Considerations

Unit Testing

  • Mocking can lead to brittle tests. That is, if complex behavior is simulated in some mocked code, then the mocking code will need to be modified whenever the component it mocks changes. An additional problem is that the component being mocked might change, but developers fail to update the mock. In this case, the unit test passes despite the fact that the component being tested would fail at integration and system testing. A minimal sketch of the pattern follows this list.
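
As a minimal sketch of the mocking pattern (the oracle interface and names here are invented for illustration): a mock implements the same interface as the real dependency, so the component under test can be exercised in isolation with scripted behavior. If IPriceOracle later changes but the mock is not updated, unit tests may keep passing while integration fails, which is exactly the brittleness described above.

interface IPriceOracle {
    function latestPrice() external view returns (uint256);
}

// Test-only stand-in for the real oracle: behavior is scripted by the test.
contract MockPriceOracle is IPriceOracle {
    uint256 private price;

    function setPrice(uint256 newPrice) external {
        price = newPrice;
    }

    function latestPrice() external view override returns (uint256) {
        return price;
    }
}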

Code Coverage

  • Code coverage should be higher for complex pieces of code.

Code Auditing

Code audits are necessary to ensure that a protocol's code performs as per its intended logic. Code that has not been audited is far more likely to contain bugs than code that has. Any protocol that does not have code audits should be trusted less by users and developers. However, code audits should not be viewed as a comprehensive security solution. For example, protocols sometimes conduct audits on specific parts of their architecture or deploy unaudited code for changes/new features, which can be a potential source of risk outside of the initial audit.

Considerations:

  • Has the code been audited? How many audits have been completed? Were these audits conducted by different organizations?

    Several audits, ideally by different organizations, are more likely to uncover more potential vulnerabilities.

  • When was the most recent audit? Has the protocol been upgraded since the last audit? How often is the protocol's code audited?

    Ensuring audits are up-to-date with code changes and are conducted frequently offers more assurance about their validity.

  • What was the scope of the audit? Does it cover all of the key on-chain components?

    Wide audit scope offers assurances around more parts of the protocol. At a minimum, however, audits should cover all core on-chain contracts.

  • Is the deployed version of the protocol's code audited?

  • What were the findings of the audit? Were there critical findings that were left unaddressed? What are the scenarios in which these could be exploited?

    Sometimes findings are "acknowledged" by team members but not addressed either because the attack scenarios are mitigated by other means, are considered difficult to pull off, or the requested changes are difficult to make. Understanding these findings enables a better assessment of the potential implementation risks.

  • What are the audit firm's track record and reputation?

    Not all audit firms are created equal. The level of confidence around the audit might need to be weighted by their track record, reputation, or other measures of the caliber of the team.

  • Have changes made in response to audit findings been audited? If not, how significant are these changes?

    The critical bug that led to the Nomad hack in August 2022 was introduced during a response to the auditor's findings. While Nomad's team was under the impression that the post-remediation code changes were re-audited by the auditor, it was, in fact, not the case. The auditor's report only certified the state of the codebase prior to the changes.

While code audits are vital to ensure the robustness of a protocol, they do not guarantee security. In the past, protocols have been compromised despite completing several audits. Thus, audits should only be one of the risk management strategies used by protocols, but not their entire security stack against hacks.

Open Source Code

If the code is stored in a public GitHub repository, it allows people to review the code and the test system. If many people view the code, then it is likely that defects in the code will be found. Additionally, it allows for the assessment of such things as the number of tests.

Some people argue that a private GitHub repository is more secure, believing that issues can be hidden from attackers. However, sufficiently motivated attackers often obtain access to a private repository or to a copy of the code, and are then able to exploit any vulnerabilities. Keeping the repository private then hinders white-hat developers from helping in the case of an attack.

When using a public repository, it is important that issues relating to vulnerabilities, and code fixes for vulnerabilities, are not put on the public repository before a release including the fix has been deployed. Doing otherwise equates to publishing vulnerabilities that can be used to exploit the project. The approach that should be taken is to review and test the vulnerability fix using the private repository, deploy from the private repository, and then push the fix to the public repository.

Verified Code on Block Explorer

Source code verification for all deployed smart contracts is critical for user safety. It ensures that the source code of a protocol’s smart contract is the same as the one executed at the contract address. Moreover, it allows users interacting with a protocol’s smart contract to read and audit it. This enables them to determine what code has been deployed and what types of risks the user is taking by interacting with the protocol. As a result, if a protocol’s code isn’t verified, it is considered less transparent and should be trusted less by users and developers.

All deployed contracts should have source code uploaded and verified on Etherscan, or other block explorers specific to the chain if the contracts are deployed on an EVM-compatible chain. For instance, code for smart contracts on Polygon can be verified on Polygonscan, and those on Arbitrum can be verified on Arbiscan. For non-EVM compatible chains, the practice of source-code verification through block explorers is not a standard and is typically not supported. However, there often exist other tools or processes that are considered a standard for source code verification for specific programming languages used by different non-EVM chains. For instance, Move Prover (MVP) is used to verify smart contracts written in the Move programming language for chains like Aptos and Sui. Thus, entities building applications on these need to find different ways to ensure the trustworthiness of their deployed smart contracts.

Additionally, one should verify that the security parameters that govern the logic of smart contracts are initialized as expected. Any discrepancies at the initialization of a smart contract could be a risk vector. For instance, in the case of Celo’s Optics bridge, the recovery mode timelock was initialized at only 1 second instead of the expected 1 day, leaving users’ funds at risk.

The process of verifying source code on a block explorer like Etherscan typically includes the following steps:

  1. Enter the contract address that needs to be verified.
  2. Input the compilation settings (such as compiler type and version) and the open-source license type.
  3. Provide the source code that makes up the smart contract.
  4. The verification tool compares the deployed bytecode with the recompiled bytecode. If the codes match, the contract is deemed verified.
  5. Once verified, the smart contract is marked “verified” and listed under “Verified Contracts” in the “Blockchain” tab.

Considerations:

  • What tool has been used to verify the source code?

    Source code can be verified by different tools such as Etherscan, Sourcify, Tenderly, etc., but not all tools are created equal. The level of confidence around the source code verification must be assessed based on the tool used. For instance, for contracts deployed on Ethereum, Etherscan’s verify contract code tool is considered a trustworthy tool.

  • Are all smart contracts of the application verified?

    It’s critical for the source code of all key smart contracts of a protocol to be verified. It’s possible for protocols to verify only certain smart contracts while hiding malicious code in others, giving a false perception that their smart contracts are verified.

  • Could the protocol be less transparent about their source code on purpose?

    The rule should always be to trust, but verify. If the source code isn’t verified, the protocol could be a potential rug pull, hiding the real intention of its smart contracts. Or, the protocol could be following the practice of ‘security through obscurity’, keeping the code unverified and thus private, to prevent people from learning how a specific feature works and potentially misusing it.

Documentation

More documentation makes a project easier to analyze. A lack of documentation can lead to confusion and issues being missed. Projects that have good documentation are easier to maintain.

Types of documentation a good project should have are:

  • Architecture document that includes the component and deployment architecture.
  • Threat model that includes all parts of the project.
  • Sequence diagrams for all major data flows.
  • Test plan describing how the project will be tested.
  • Smart contract code comments at the contract and function level.
  • Off-chain code with comments at the class and method level.
  • Test code with at least a class level comment.

Bug Bounty

A public bug bounty program incentivizes whitehats to uncover and safely disclose vulnerabilities that might exist in a protocol. This increases the number of people that can thoroughly scrutinize a codebase and prevent exploits by malicious actors. Bug bounties have to offer adequate compensation and have clear scope covering the protocol's critical elements. The process requires trust and transparency and should ideally be managed by an independent third party (e.g., Immunefi).


Responding to Materialized Risk

Ability to Pause Project

This section describes the implementation risks associated with being able to pause a project. Operational risks related to pausing are covered in the Ability to Pause Operational Risk section.

All Data Plane functions should be pausable. For example, a bridge contract could have a function that could transfer coins based on actions on another blockchain. The ability to pause a function in a project allows administrators to stop functions from successfully executing. If there is a vulnerability that is being actively exploited in a project, having the ability to pause a function could stop the exploitation of the project midway through the attack.

For Ethereum based projects, the OpenZeppelin project has an example contract Pausable.sol that can be used to implement pausing. Using this template, pausing a function becomes as simple as adding a whenNotPaused modifier. The code below (with imports and an owner added, assuming OpenZeppelin v4.x paths) shows how this would work in practice.

import "@openzeppelin/contracts/security/Pausable.sol";
import "@openzeppelin/contracts/access/Ownable.sol";

contract Example is Pausable, Ownable {
    // Control Plane: only the owner can pause.
    function pause() external onlyOwner {
        _pause();
    }

    // Data Plane: reverts while the contract is paused.
    function transfer(
        address _sender,
        address _tokenContract,
        address _recipient,
        uint256 _amount
    ) external whenNotPaused {
        // Only executed when not paused
    }
}
When analyzing whether a project can be paused, it is important to check whether all data processing functions can be paused, or just some parts of the project.

Example

For example, in August 2022 the Nomad Bridge had an issue (see Rekt for an analysis of the issue from people outside the team). An attacker was able to determine a methodology for stealing funds using the Replica contract's process() function. Despite most of the Data Plane processing functions in the project being pausable, the process() function was not. This meant that the attack was able to proceed without the administrators of the project being able to stop it.

Ability to Ban Addresses

Addresses may be associated with stolen funds. Tornado Cash was a project that had sanctions placed against it due to its association with stolen funds. Projects that have the ability to block such addresses, or freeze in-transit funds, are more likely to avoid this type of regulatory risk.
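
A minimal sketch of an address deny-list follows. The BANNER role name and structure are assumptions for illustration; a real implementation must also decide which Data Plane functions the check guards.

import "@openzeppelin/contracts/access/AccessControl.sol";

contract Bannable is AccessControl {
    bytes32 public constant BANNER_ROLE = keccak256("BANNER_ROLE");

    mapping(address => bool) public banned;

    // Control Plane: only BANNER_ROLE holders can update the deny-list.
    function setBanned(address account, bool isBanned) external {
        require(hasRole(BANNER_ROLE, msg.sender), "Must have BANNER role");
        banned[account] = isBanned;
    }

    // Applied to Data Plane functions such as transfers.
    modifier notBanned(address account) {
        require(!banned[account], "Address is banned");
        _;
    }
}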

