Group state forks are faulty states that MLS groups can end up in. This article looks at what exactly they are, how they happen, and how to resolve them. We also look at a new OpenMLS feature that makes fork resolution a little easier.
Over time, an MLS group evolves its group state. Whenever a new member is added, another is removed, a group extension is added, or someone updates their encryption keys, the state must change. The versions a group moves through over its lifetime are called epochs and are identified by a counter value. The way this works is that first a number of requested state changes (proposals in MLS terminology) are collected (“Add Dora”, “This is my new key”, …) and then applied in one big update (called a commit in MLS). Once a commit is applied, the group state changes and the group epoch counter is incremented.
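To make this concrete, here is a minimal sketch of one such epoch change using OpenMLS' `MlsGroup` API, assuming that `group`, `provider`, `signer` and a `dora_key_package` have been set up elsewhere (exact method signatures vary between library versions):

```rust
let epoch_before = group.epoch();

// Queue a proposal ("Add Dora"). The proposal alone does not change
// the group state; it is only recorded as pending.
group
    .propose_add_member(provider, signer, &dora_key_package)
    .unwrap();

// Turn all pending proposals into a single commit ...
let (_commit_msg, _welcome, _group_info) = group
    .commit_to_pending_proposals(provider, signer)
    .unwrap();

// ... and merge it. Only now does the group state change and the
// epoch counter increase by one.
group.merge_pending_commit(provider).unwrap();
assert_eq!(group.epoch().as_u64(), epoch_before.as_u64() + 1);
```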
A state fork occurs when two members of a group end up in different states because at some point the sequences of commits they merged diverged. These two members are now effectively in two different groups: they might see a different membership and a different group context, and they will definitely have different keys for the group.
The protocol designers made sure that this can’t happen as long as everyone plays by the rules. For example, the client who assembles a commit does not apply it to their own local state immediately. Instead they send it to the delivery service (DS), a largely untrusted relay server that delivers messages to all group members. In some cases application developers may need a more complex design than a simple relay, but the idea is the same: the DS should only deliver one commit per epoch. Only when the sender receives the message (or a hint that the commit was approved) back from the DS may they apply the commit. And when adding new members, we perform sufficient checks to ensure that the joiner has the same view of the group as the committer, so their state also matches that of the rest of the group. If any check fails at a client, it should notify the DS of the invalid commit. By induction, we can’t have group state forks: there is a known-good base case (when the party joined) and a known-good transition (when the party applies a commit).
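In code, the “commit, but only merge once the DS has approved it” rule could look like the following sketch. `ds_accepted_our_commit` is a placeholder for application-specific DS interaction, and `alice_group`, `alice_provider` and `alice_signature_keys` are the same assumed names as in the example further below.

```rust
// Create a commit, but do not touch the local group state yet.
let (commit_msg, _welcome, _group_info) = alice_group
    .commit_to_pending_proposals(alice_provider, &alice_signature_keys)
    .unwrap();

// Send the commit to the DS; only one commit per epoch may be delivered.
if ds_accepted_our_commit(&commit_msg) {
    // Green light from the DS: it is now safe to advance our own state.
    alice_group.merge_pending_commit(alice_provider).unwrap();
} else {
    // Another commit won this epoch: keep our current state, process the
    // winning commit like any other incoming message, and discard ours.
}
```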
But still, forks happen. How? Bugs.
For example, a DS might malfunction and send out more than one commit for the same epoch. Or a client might malfunction and apply a commit that has not yet been greenlit by the DS. These sound like easy things to fix, but state management in MLS is tricky, and if you pair that with the realities of building scalable and usable apps for the devices people actually use, things sometimes slip through. In the case of MLS, the result can render a group unusable.
At first glance, the solution might seem straightforward: just undo the commit so you are back in the old state, and then apply the right one. It’s just `git reset HEAD^`! However, in order to achieve forward secrecy, the old state must be deleted promptly after a commit, so that old messages remain safe in case of a compromise. It is technically possible to keep the old state around for a short, predefined time and recover that way. While this does not completely break forward secrecy, it does weaken it in a way that many users of MLS are not willing to accept.
The MLS standard provides two relatively simple approaches for healing a group fork:
- Just create a new group and add all members of the previous group (sketched right after this list).
- If the member initiating the fork healing knows which members are on the other side of the fork, they can create a commit that removes and re-adds them. Those members then receive Welcome messages for the initiator’s branch and reconstruct their group state from them.
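As a rough illustration of the first approach, the sketch below uses the regular OpenMLS `MlsGroup` API rather than the new fork-resolution helpers. `delivery_service.get_key_package` is the same application-specific stub as in the example further below, the `alice_*` names and `old_group` are assumed to exist, and the group configuration is simplified.

```rust
// Start from scratch: create a brand-new group ...
let mut fresh_group = MlsGroup::new(
    alice_provider,
    &alice_signature_keys,
    &MlsGroupCreateConfig::default(),
    alice_credential_with_key,
)
.unwrap();

// ... fetch a fresh key package for every other member of the forked group ...
let key_packages: Vec<KeyPackage> = old_group
    .members()
    .filter(|member| member.index != old_group.own_leaf_index())
    .map(|member| delivery_service.get_key_package(&member.credential))
    .collect();

// ... and add all of them in a single commit. The resulting Welcome
// (ignored here) has to be delivered to every added member.
let (_commit, _welcome, _group_info) = fresh_group
    .add_members(alice_provider, &alice_signature_keys, &key_packages)
    .unwrap();
fresh_group.merge_pending_commit(alice_provider).unwrap();
```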
These approaches both come with a bit of a cost:
- The member initiating the recovery needs to fetch a lot of key packages and encrypt to every single one of them
- The size of the Welcome grows linearly with the number of members added
- All the newly added members will be unmerged leaves in the ratchet tree, making future updates more expensive (but this can be remedied later)
With the currently standardized protocol, there seems to be no way around paying these costs. They do tell us, however, that we should keep the number of re-added parties as low as possible. This means we should prefer the remove-and-re-add technique over the start-from-scratch one.
To make recovery from a fork a little easier for users of the OpenMLS library, we have added helper functions that take some of that burden off application builders. For example, if Alice noticed a group state fork and wanted to heal it, she could do so using this code:
// For readding, Alice needs to know who shares her view
let our_partition = &[alice_leaf_index, charlie_leaf_index];
let builder = alice_group.recover_fork_by_readding(our_partition).unwrap();

// Here we iterate over the members of the complement partition
// to get their key packages.
let readded_key_packages = builder
    .complement_partition()
    .iter()
    .map(|member| {
        // Get a new key package for the member to be re-added.
        // Note: This is a stub for application specific code
        delivery_service.get_key_package(&member.credential)
    })
    .collect();

// Specify the key packages to be re-added and create the commit and welcome messages.
let readd_messages = builder
    .provide_key_packages(readded_key_packages)
    .load_psks(alice_provider.storage())
    .unwrap()
    .build(
        alice_provider.rand(),
        alice_provider.crypto(),
        &alice_signature_keys,
        |_| true,
    )
    .unwrap()
    .stage_commit(alice_provider)
    .unwrap();
More information about this feature can be found in the Fork Resolution chapter of the OpenMLS Book.
One thing we have not touched on so far is the problem of detecting that a fork has occurred in the first place. Detection is much more application-specific than resolution, which is why we do not offer a general solution for it yet. We will write more about this in a later blog post.