Protocol Operation Risk

Operational Security

Operational Security (also known as OPSEC) is a process used to protect sensitive information. The idea is to determine how to protect sensitive information by viewing operations from the perspective of an attacker. The process originated during the Vietnam War. The following sections discuss the five steps of the Operational Security process.

Identification of Sensitive Information

The first step of Operational Security is identifying what information needs to be protected, and the relative sensitivity level of each type of information. In the context of a crosschain protocol, sensitive information is likely to include:

Private Keys used to sign blockchain transactions.
Private Keys used to attest to the validity of information in the crosschain protocol.
Private Keys used as part of blockchain consensus protocols.
Private Keys associated with a Transport Layer Security (TLS) web server certificate.
Private Keys used with key agreement and asymmetric encryption protocols.
Private Keys used with PGP and other signed and encrypted email.
Passwords.
Information about and code fixes for security vulnerabilities that have not yet been deployed into production.

Sensitive information could also include:

The organization's Org Chart. This allows attackers to identify specific individuals to target with Spear Phishing attacks. Note that social media platforms such as LinkedIn are used by attackers to determine organizations' org charts.
Identities of validators or other entities that help operate the crosschain protocol. Attackers could target entities involved in the protocol with Spear Phishing attacks.
Internal organizational procedures. Sensitive procedures include obvious ones such as vulnerability response procedures and what security software is installed on company computers, but also include whether employees use company-issued computers or their own computers (that is, Bring Your Own Device), software development practices, and HR policies.
Architecture and design information. Most crosschain protocols publish their system architecture and design information, hoping that this will provide users of their protocol greater assurance of the trustworthiness of the protocol. However, publishing design information makes it easier for attackers to identify potential weak points in the system.
Source code. Most crosschain protocol code is open source. This allows everyone to review the code, thus helping to provide users with assurance that the code has been written with security in mind. However, attackers are also able to view the code, and may be able to identify vulnerabilities in the code.
Known issues with source code. As most crosschain protocol code is open source, it is common for reported issues to be also publicly available. Sometimes the reported issues, though appearing to be innocuous, may highlight security issues.
Server log file information. This information is likely to indicate the usage of the protocol, which may be commercially sensitive. It may also provide insights into unexpected behavior within the protocol. Attackers may use this unexpected behavior to mount attacks.

Analysis of Threats

The second step of Operational Security is identifying possible threat actors for each of the categories of sensitive information and analyzing their capabilities. For crosschain protocols, threats are likely to come from several groups:

Attackers aiming to steal funds from the crosschain protocol. These attackers can range in sophistication from people new to blockchain to state sponsored hackers such as North Korea's Lazarus Group.
Attackers or competitors aiming to discredit or reduce trust in the crosschain protocol.
General Front Running Bots viewing transactions in the transaction pool and submitting similar transactions ahead of the original transactions. These bots could see an attack in progress and would then attempt to execute the attack repeatedly automatically.
White Hat Hackers hoping to earn a bug bounty for identifying an issue.
University researchers aiming to identify vulnerabilities in protocols. If they can claim credit for finding the vulnerability, then they will be able to write an academic paper about the issue, and gain academic recognition. This is particularly important for Doctoral Candidates who need to contribute some original research to complete their doctorate.
Disgruntled employees and other insiders acting maliciously.

Analysis of Weaknesses

Given the information being protected, and the possible threat actors, assess the current safeguards that are in place. From there, determine what weaknesses exist.

Assessment of Risk

The next step is to rank each of the weaknesses to determine the likelihood of the attack happening and the likely impact of the attack. The likely impact will include the immediate financial loss, reputational damage, and the time to address the attack. The more likely an attack and the higher the likely loss, the higher the priority should be to mitigate the weakness.
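The ranking described above can be sketched as a simple likelihood-times-impact score. This is a minimal illustration, not a prescribed method: the 1 to 5 scales, the `Weakness` class, the `rank_weaknesses` function, and the example entries are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weakness:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); hypothetical scale
    impact: int      # 1 (negligible) to 5 (severe); hypothetical scale

    @property
    def risk_score(self) -> int:
        # Simple likelihood x impact product; other scoring schemes exist.
        return self.likelihood * self.impact


def rank_weaknesses(weaknesses: list[Weakness]) -> list[Weakness]:
    """Return weaknesses ordered from highest to lowest risk score."""
    return sorted(weaknesses, key=lambda w: w.risk_score, reverse=True)


# Illustrative entries only; names and scores are made up for the example.
example = [
    Weakness("Shared administrator key", likelihood=3, impact=5),
    Weakness("Public issue tracker leaks hints", likelihood=2, impact=3),
    Weakness("Validator spear phishing", likelihood=4, impact=5),
]

for w in rank_weaknesses(example):
    print(f"{w.risk_score:>2}  {w.name}")
```

In practice the impact figure would combine the immediate financial loss, reputational damage, and time to address the attack, as described above; a single integer is used here only to keep the sketch short.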

Application of Countermeasures

The final step of Operational Security is to mitigate risks. The range of possible countermeasures is vast and beyond the scope of this document. However, two important mitigations are Responsible Disclosure and Bug Bounty Programs. A Bug Bounty Program gives White Hat Hackers a means of being compensated for finding issues. A Responsible Disclosure process gives university researchers a means of reporting their findings while being sure that they can claim recognition. Not having these programs in place may lead White Hat Hackers to attack the system and return some percentage of their attack earnings, while retaining the rest as a Bug Bounty, and may lead university researchers to publish their results prior to contacting the crosschain project.

Ability to Pause

The pausing capability described in the Ability to Pause Implementation Risk section is only effective if the pause() function can be executed expeditiously. Typically, attacks on protocols are mounted in time periods ranging from minutes to hours. As such, pausing a project several hours after an attack has commenced is unlikely to be effective, as by this time, the project may have already been drained of funds.

To prevent malicious parties unnecessarily halting the project by calling the pause() function, pause() functions need to have some form of access control.

For projects operated by a single organization, a simple administration set-up could be used. However, as described below, this has serious issues. For example, a single administrator account (that is, an owner account) could be the only party authorized to call the pause() function. However, to be responsive to the need to pause the project, the private key belonging to the administrator account would need to be shared with support staff who live in timezones around the world. The issue with sharing a single private key is that it would be impossible to determine which support staff member used the key to pause the project. Additionally, if one of the support staff left the company or was compromised, then the shared key would need to be changed.

A better approach than using a shared key is to provide each support staff member with their own administrator account, and grant all of these accounts the Pauser Role. The advantage of this approach over using a shared key is that the account that pauses the project can be associated with a specific support staff member. Additionally, if one of the support staff left the company or was compromised, then only their account would need to be disabled. This would allow the other support staff to operate without changing their keys.
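The per-account Pauser Role can be illustrated with a small in-memory model. This is only a sketch in Python, not contract code: the `PausableProject` class, its method names, and the account names are hypothetical, and a real deployment would implement the same checks on-chain.

```python
class PausableProject:
    """Toy in-memory model of a contract with a Pauser role (not contract code)."""

    def __init__(self, admin: str) -> None:
        self.admin = admin
        self.pausers: set[str] = set()
        self.paused = False
        self.pause_log: list[str] = []  # which account paused, for audit

    def grant_pauser(self, caller: str, account: str) -> None:
        self._require_admin(caller)
        self.pausers.add(account)

    def revoke_pauser(self, caller: str, account: str) -> None:
        self._require_admin(caller)
        self.pausers.discard(account)

    def pause(self, caller: str) -> None:
        if caller not in self.pausers:
            raise PermissionError(f"{caller} does not hold the Pauser role")
        self.paused = True
        self.pause_log.append(caller)

    def _require_admin(self, caller: str) -> None:
        if caller != self.admin:
            raise PermissionError(f"{caller} is not the admin")


project = PausableProject(admin="admin")
for staff in ("alice", "bob", "carol"):
    project.grant_pauser("admin", staff)

project.revoke_pauser("admin", "bob")  # bob leaves; other accounts are unaffected
project.pause("carol")                 # the log records that carol paused
```

Because each staff member has a distinct account, the pause log identifies who acted, and revoking one account does not force anyone else to rotate keys.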

For projects that are operated by multiple organizations, the pausing capability should be controlled by a multi-signature wallet. Multi-signature wallets require a threshold number of owners to approve a proposed transaction before it is executed. In the case of pausing, the proposed transaction would be to call the pause() function. Within the application contract, the multi-signature wallet contract would be authorized to call the pause() function.
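The threshold behaviour of a multi-signature wallet can be sketched as follows. This is a simplified Python model under stated assumptions: the `MultiSigWallet` and `Bridge` classes, the proposal identifier, and the owner names are hypothetical, and the model omits the on-chain check that only the wallet may call pause().

```python
from typing import Callable


class MultiSigWallet:
    """Toy model: runs an action once `threshold` distinct owners approve it."""

    def __init__(self, owners: set[str], threshold: int) -> None:
        if not 1 <= threshold <= len(owners):
            raise ValueError("threshold must be between 1 and the number of owners")
        self.owners = set(owners)
        self.threshold = threshold
        self.approvals: dict[str, set[str]] = {}
        self.executed: set[str] = set()

    def approve(self, owner: str, proposal_id: str, action: Callable[[], None]) -> bool:
        """Record a vote; return True once the proposal has been executed."""
        if owner not in self.owners:
            raise PermissionError(f"{owner} is not an owner")
        if proposal_id in self.executed:
            return True
        votes = self.approvals.setdefault(proposal_id, set())
        votes.add(owner)  # a set ignores duplicate votes from the same owner
        if len(votes) >= self.threshold:
            self.executed.add(proposal_id)
            action()
            return True
        return False


class Bridge:
    def __init__(self) -> None:
        self.paused = False

    def pause(self) -> None:
        self.paused = True


bridge = Bridge()
wallet = MultiSigWallet({"org-a", "org-b", "org-c"}, threshold=2)
wallet.approve("org-a", "pause-1", bridge.pause)  # first vote, not yet executed
wallet.approve("org-a", "pause-1", bridge.pause)  # duplicate vote is ignored
wallet.approve("org-c", "pause-1", bridge.pause)  # second distinct vote pauses
```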

Many projects also incorporate services that monitor the operation of the project. The services could be authorized to call the pause() function automatically based on the detection of anomalous conditions. This automatic pausing capability needs to operate in parallel with the other manual approaches described above.
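As one possible illustration of automatic pausing, the sketch below pauses a project when the total outflow within a sliding time window exceeds a limit. The choice of anomaly (outflow volume), the `OutflowMonitor` class, and the window and limit values are assumptions for the example; real monitoring services may use quite different signals.

```python
from collections import deque
from typing import Callable


class OutflowMonitor:
    """Pauses a project when outflow in a sliding time window exceeds a limit."""

    def __init__(self, pause: Callable[[], None], window_seconds: float,
                 max_outflow: float) -> None:
        self.pause = pause  # callable that the monitor is authorized to invoke
        self.window_seconds = window_seconds
        self.max_outflow = max_outflow
        self.events: deque[tuple[float, float]] = deque()  # (timestamp, amount)
        self.total = 0.0

    def record_withdrawal(self, timestamp: float, amount: float) -> bool:
        """Record a withdrawal; return True if this call triggered a pause."""
        self.events.append((timestamp, amount))
        self.total += amount
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_amount = self.events.popleft()
            self.total -= old_amount
        if self.total > self.max_outflow:
            self.pause()
            return True
        return False


pause_calls: list[str] = []
monitor = OutflowMonitor(lambda: pause_calls.append("paused"),
                         window_seconds=60, max_outflow=100)
monitor.record_withdrawal(timestamp=0, amount=60)
monitor.record_withdrawal(timestamp=30, amount=30)
```

A limit that is too tight would pause the project during legitimate bursts of activity, so the automatic path complements, rather than replaces, the manual approaches.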

The pausing operations described above are typically performed by Bridge Operators. Granting Bridge Validators the right to pause bridges also has merit, as the validators' role is to verify state updates. They are likely to be able to detect anomalous behavior.

Things to consider when setting up a multi-signature wallet and choosing the threshold are:

How dispersed across timezones are the owners? If they are mostly located in one timezone region, then pausing will be difficult if an attack is mounted when most are asleep. The threshold for pausing could be lowered to match the number of owners in various timezones across the world. In this way, the project can be paused even if an attack is mounted when most owners are asleep.
How engaged are the owners? Owners who are volunteers may not be as responsive as is needed to pause a project expeditiously.
If the threshold is too low, then perhaps a subset of owners who are not happy with the project's direction could choose to disrupt the project by pausing the project.
How independent are the owners? If multiple owners rely on a single party to hold or operate their keys, then that party effectively has multiple votes.
Attackers could target owners, aiming to gain access to their key. They could pause the project if they can gain access to the threshold number of keys.
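The timezone consideration above can be checked with a rough calculation: for each hour of the day, count how many owners are likely to be awake, and compare the minimum with the pause threshold. The sketch assumes every owner is awake from 08:00 to 22:00 local time and uses whole-hour UTC offsets; the function names and the example offsets are hypothetical.

```python
def awake_at(utc_hour: int, utc_offset: int, wake: int = 8, sleep: int = 22) -> bool:
    """True if an owner at the given UTC offset is awake at utc_hour.

    Assumes everyone is awake from `wake` to `sleep` local time (a simplification).
    """
    local_hour = (utc_hour + utc_offset) % 24
    return wake <= local_hour < sleep


def min_owners_awake(utc_offsets: list[int]) -> int:
    """Smallest number of owners awake at any UTC hour of the day."""
    return min(
        sum(awake_at(hour, offset) for offset in utc_offsets)
        for hour in range(24)
    )


# Illustrative owner sets, given as UTC offsets in hours.
dispersed = [-8, -5, 0, 1, 8, 10]
clustered = [0, 1, 1, 2]

print(min_owners_awake(dispersed))
print(min_owners_awake(clustered))
```

Under these assumptions the dispersed set always has at least two owners awake, while the clustered set has hours with none awake. A pause threshold higher than the minimum means there are hours of the day in which the project cannot be paused without waking someone.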

Codebase Diversity

Is there just one implementation, or have multiple parties implemented the protocol?

Decentralization of Operations

Third-party Attestation Protocols rely on sets of independent parties. For example, in a Proof of Authority protocol messages are deemed valid if at least M of N parties sign messages. The Third-party Attestation Protocols section described a set of considerations that should be taken into account when reviewing these protocols. This section highlights the critical importance of ensuring the operation of a protocol delivers on the security guarantees of the protocol design.

When external parties attempt to audit the security of the protocol deployment, they will reason about the security of the protocol based on the threshold number of signers (M) and the total number of signers (N). They will expect that at least M parties would need to be compromised, or choose to maliciously sign a value, for a malicious message to be trusted. If one party controls multiple signers, then the true security of the system is weaker than it first appears. For example, if one party controls M - 1 signers, then an attacker would only need to compromise that party and one of the other independent parties. This is what occurred in the Ronin bridge hack.
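The gap between the apparent and the effective threshold can be computed directly: sort the parties by the number of signers they control and count how many parties an attacker needs before holding M signers. The function name and the party names below are hypothetical, and the 5-of-9 figures are illustrative.

```python
def min_parties_to_compromise(signers_per_party: dict[str, int], threshold: int) -> int:
    """Fewest parties an attacker must compromise to control `threshold` signers."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if threshold > sum(signers_per_party.values()):
        raise ValueError("threshold exceeds the total number of signers")
    collected = 0
    # Greedy: compromising the parties with the most signers first is optimal.
    ordered = sorted(signers_per_party.values(), reverse=True)
    for parties_needed, signer_count in enumerate(ordered, start=1):
        collected += signer_count
        if collected >= threshold:
            return parties_needed
    raise AssertionError("unreachable: threshold was validated above")


# Nine signers, threshold M = 5, each held by an independent party.
independent = {f"party-{i}": 1 for i in range(9)}

# Same M and N, but one party controls M - 1 = 4 signers.
concentrated = {"operator": 4, "p1": 1, "p2": 1, "p3": 1, "p4": 1, "p5": 1}

print(min_parties_to_compromise(independent, threshold=5))
print(min_parties_to_compromise(concentrated, threshold=5))
```

With independent signers the attacker needs five parties; in the concentrated case two suffice, even though both deployments advertise the same M and N.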

Another operational consideration is latency and compensation. Parties might only be compensated if they sign messages. If a message is submitted to a contract immediately after M parties have signed it, then it may be that the M parties that have the lowest latency between each other sign most or all of the messages. In this situation, parties with high latency relative to other parties are not as heavily incentivized as parties that have low latency.
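The effect of latency on compensation can be seen in a small simulation. The model is an assumption made for the sketch: each party's signature arrives after a fixed base latency plus random jitter, and only the first M signatures to arrive are used and compensated. The function name, party names, and latency figures are hypothetical.

```python
import random
from collections import Counter


def simulate_signing(latencies_ms: dict[str, float], threshold: int,
                     messages: int, jitter_ms: float = 0.0,
                     seed: int = 0) -> Counter:
    """Count how often each party is among the first `threshold` signers."""
    rng = random.Random(seed)
    counts = Counter({party: 0 for party in latencies_ms})
    for _ in range(messages):
        # Arrival time = base latency + uniform jitter (a simplifying assumption).
        arrival = {
            party: base + rng.uniform(0, jitter_ms)
            for party, base in latencies_ms.items()
        }
        fastest = sorted(arrival, key=arrival.get)[:threshold]
        counts.update(fastest)
    return counts


latencies = {"a": 10, "b": 12, "c": 15, "d": 80, "e": 120}
counts = simulate_signing(latencies, threshold=3, messages=100, jitter_ms=5)
for party, signed in sorted(counts.items()):
    print(party, signed)
```

With these figures the jitter is too small for the slower parties to ever overtake the faster ones, so parties d and e never sign a used message and would never be compensated under a pay-per-signature scheme.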

All parties could be compensated to participate, whether their signature is one of the M signatures used or not. In this situation, parties that have high latency relative to other parties are still compensated. Additionally, this means that parties that are temporarily offline are also compensated. A challenge for protocols like this is to prove that all N signers are usually online and are actively participating in the protocol.

Complex inter-node communications mechanisms can be set up to ping nodes, to check that the parties are participating in the protocol. However, the question then is how to prove that a party did not reply, and how to prove this in a forum, probably on-chain, that can be used to slash parties not following the protocol.

Security of off-chain systems

TODO

(e.g. validators) Standard security practices such as ISO27001

Vulnerability Response Plan

A Vulnerability Response Plan is a process of agreed steps to take when an issue is reported with a crosschain protocol (or any other software system). The term vulnerability should be used carefully. When an issue is first reported, it should be considered a possible Security Issue. A Security Issue could be classified as a Vulnerability (also known as a Security Vulnerability) if the issue can be exploited to cause some sort of damage to the crosschain protocol.

Identify a Vulnerability Response Virtual Team

The Vulnerability Response Virtual Team is a cross-functional team that will manage the vulnerability response process. The team should include domain experts such as the lead protocol designer, system architects, and lead engineer, and representatives from teams such as DevOps, SecOps, Product Management, Public Relations, and Executive Management. The team is a virtual team as it draws on team members on an as-needed basis.

If the team is too large, then the probability of information about the vulnerability leaking increases. Additionally, the more people in the team, the more people who have their focus diverted from their core work. Having the team too small may mean that people who could provide useful insights are excluded. Experts could be brought into a small team on an as-needed basis. However, this approach assumes that members of the team can identify the correct people to temporarily bring into the team.

Reporting Mechanism

Organizations should identify and advertise how they expect vulnerabilities to be reported. External parties such as White Hat Hackers and University Researchers (see Section Operational Security) need a mechanism to communicate with the Vulnerability Response Virtual Team. This could be via a group email alias or via a web form.

People within the organization should have a defined way of reporting vulnerabilities. Using the standard bug tracking system will advertise the issue to anyone who can access the system. Additionally, the true significance of the issue may not be fully understood, and other higher priority issues may be resolved ahead of the vulnerability. Reporting to an individual such as a manager again might mean that the significance of the issue is not fully understood. Allowing the person to directly contact the Vulnerability Response Virtual Team, possibly by the same mechanism as external parties, will ensure the issue is appropriately triaged.

Once a vulnerability has been reported to the Vulnerability Response Virtual Team, someone from the team needs to take responsibility as the contact point for the person who reported the issue (from now on, the vulnerability reporter) for the duration of the vulnerability response. They need to assure the vulnerability reporter that they are being taken seriously, and that the issue is being triaged. They should provide a realistic schedule for when they will respond to the vulnerability reporter with the results of the triage process.

Triage

Once a possible Security Issue has been reported to the Vulnerability Response Virtual Team, the issue needs to be triaged. The team can analyze the issue against the following criteria:

Scope: This could be:
- Fundamental cryptographic building block: For example, an issue found in a cryptographic algorithm such as SHA256 would affect all protocols and systems using that algorithm, and their implementations in all programming languages.
- A protocol or system: For example, an issue found in a communications protocol such as Transport Layer Security (TLS) would affect all applications that use TLS, independent of programming language.
- Software Library: All applications and systems using a software library.
- Crosschain Protocol: The issue just pertains to this crosschain protocol. If the protocol is implemented in a variety of programming languages, does the issue apply to all languages, or just one implementation?
On-chain or off-chain components: Does the issue relate to contracts that are on-chain or off-chain services?
Not deployed, Deployed, or Historic: Does the issue relate to code that has never been deployed, is currently deployed, or was deployed previously, but has subsequently been superseded?
Direct Impact: What is the direct impact of exploiting the issue? Could funds be stolen? Could a Denial of Service (DoS) attack be mounted against the protocol? Could private information such as customer data be stolen?
Reputational Impact: Could the issue cause reputational loss?

Contact the vulnerability reporter once the triage process has been completed. Discuss the findings of the process and give feedback to the vulnerability reporter. They may highlight some misunderstandings that the Vulnerability Response Virtual Team has. If the Vulnerability Response Virtual Team feels that there is no substantive issue, it would be good if they could convince the vulnerability reporter of this, so that they don't go public with their perception that there is a serious issue with the protocol. Assuming there is a vulnerability, the vulnerability reporter needs to be given the planned schedule of events that will culminate in the vulnerability fix being deployed and the issue being publicly announced.

Creating, Validating, and Deploying a Fix

Once it has been determined how the vulnerability will be addressed, the code for the fix will need to be written, tested, and then deployed. Even if the project uses an open source repository for their main development effort, a private repository must be used for developing, testing, and deploying the vulnerability fix code. The reason for this is that submitting vulnerability fix code to a public repository is tantamount to publishing the vulnerability. Even if the time between submitting the vulnerability fix code and deploying it is only ten minutes, this will be long enough for an attacker who is already aware of the vulnerability. The attacker may have been preparing to mount an attack. Faced with the prospect of the vulnerability fix code being deployed soon, the attacker will launch their attack.

Prior to deploying the fix, it may be helpful to again reach out to the vulnerability reporter to check that they too believe the fix will address the vulnerability.

Publishing a Root Cause Analysis

Immediately after the vulnerability fix code has been deployed, a Root Cause Analysis should be published. This should identify what the issue was, possible impacts if it had been exploited, how it was resolved, actions that users should take, any bug bounty that has been paid, and, most importantly, it should attribute the finding of the vulnerability to the vulnerability reporter. The publication of this analysis should be coordinated with the vulnerability reporter as they may wish to publish their own press release. The details of the analysis should be checked with the vulnerability reporter to ensure they agree with what has been said in the analysis.

Last update: October 13, 2023
Created: October 13, 2023