Replicated data types (RDTs) are data structures that permit concurrent modification of multiple, potentially geo-distributed, replicas without coordination between them. RDTs are designed so that conflicting operations are eventually deterministically reconciled ensuring convergence. However, constructing correct RDTs remains difficult due to the complexity of reasoning about independently evolving replica states, and existing approaches sacrifice efficiency for correctness.
We present Peepul, a pragmatic approach to building and verifying efficient RDTs. We cast RDTs as mergeable replicated data types (MRDTs) in the mould of distributed version control (like Git), equipping them with a three-way merge function. We develop replication-aware simulation relations to verify both convergence and functional correctness, and implement Peepul as an F* library that discharges proof obligations to an SMT solver. The verified RDTs are extracted as OCaml code and run on Irmin, a Git-like distributed database.
Modern distributed software replicates data across multiple locations for fault tolerance and low latency. When replicas are modified concurrently, conflicts must be resolved automatically. Convergent Replicated Data Types (CRDTs) solve this but carry expensive causal metadata and use inefficient merge strategies.
Mergeable Replicated Data Types (MRDTs) offer a better alternative: ordinary purely functional data structures are promoted to RDTs by equipping them with a three-way merge function (using a lowest common ancestor, like Git). This avoids tombstones and metadata overhead. But two key challenges remain:
Peepul bridges the gap between high-level specifications and efficient implementations through three key ideas:
An observed-remove set (OR-set) where concurrent add and remove of the same element results in the add winning. Peepul provides three verified implementations with different space/time trade-offs:
// OR-set state: set of (element, timestamp) pairs
Sigma = P(N x N)
sigma_0 = {}
do(add(a), sigma, t) = (sigma ∪ {(a, t)}, ⊥)
do(remove(a), sigma, t) = ({e ∈ sigma | fst(e) ≠ a}, ⊥)
merge(sigma_lca, sigma_a, sigma_b) =
(sigma_lca ∩ sigma_a ∩ sigma_b) ∪
(sigma_a - sigma_lca) ∪
(sigma_b - sigma_lca)
The merge takes the LCA and two divergent versions: it keeps elements present in all three, and adds elements that are new in either branch. The space-efficient variant (OR-set-space) deduplicates entries, while OR-set-spacetime uses a balanced BST for O(log n) operations.
Peepul includes 9 verified MRDT implementations with varying complexity:
| MRDT | Lines of Code | Lines of Proof | Lemmas | Verif. Time (s) |
|---|---|---|---|---|
| Increment-only counter | 6 | 43 | 2 | 3.5 |
| PN counter | 8 | 43 | 2 | 23.2 |
| Enable-wins flag | 20 | 58–89 | 3–7 | 104–1074 |
| LWW register | 5 | 44 | 1 | 4.2 |
| G-set / G-map | 10 / 48 | 23–33 / 26 | 0–2 / 0 | 2–5 / 26 |
| Mergeable log | 39 | 95 | 2 | 36.6 |
| OR-set (3 variants) | 30–97 | 36–266 | 0–7 | 8.8–1854 |
| Queue | 32 | 1123 | 75 | 4753 |
The queue is the most complex MRDT — the first formally verified replicated queue with a complete specification capturing FIFO ordering, at-least-once dequeue semantics, and correct merge behaviour. Its O(n) merge is orders of magnitude faster than prior work.
@inproceedings{soundarapandian2022certified,
title={Certified Mergeable Replicated Data Types},
author={Soundarapandian, Vimala and Kamath, Adharsh and Nagar, Kartik and Sivaramakrishnan, KC},
booktitle={Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI)},
pages={332--347},
doi={10.1145/3519939.3523735},
year={2022}
}