Certified Mergeable Replicated Data Types

Vimala Soundarapandian^*, Adharsh Kamath^†, Kartik Nagar^*, KC Sivaramakrishnan^*‡

^*IIT Madras ^†NITK Surathkal ^‡Tarides

PLDI 2022 (ACM SIGPLAN Conference on Programming Language Design and Implementation)

Abstract

Replicated data types (RDTs) are data structures that permit concurrent modification of multiple, potentially geo-distributed, replicas without coordination between them. RDTs are designed so that conflicting operations are eventually deterministically reconciled ensuring convergence. However, constructing correct RDTs remains difficult due to the complexity of reasoning about independently evolving replica states, and existing approaches sacrifice efficiency for correctness.

We present Peepul, a pragmatic approach to building and verifying efficient RDTs. We cast RDTs as mergeable replicated data types (MRDTs) in the mould of distributed version control (like Git), equipping them with a three-way merge function. We develop replication-aware simulation relations to verify both convergence and functional correctness, and implement Peepul as an F* library that discharges proof obligations to an SMT solver. The verified RDTs are extracted as OCaml code and run on Irmin, a Git-like distributed database.

The Problem

Modern distributed software replicates data across multiple locations for fault tolerance and low latency. When replicas are modified concurrently, conflicts must be resolved automatically. Convergent Replicated Data Types (CRDTs) solve this but carry expensive causal metadata and use inefficient merge strategies.

Mergeable Replicated Data Types (MRDTs) offer a better alternative: ordinary purely functional data structures are promoted to RDTs by equipping them with a three-way merge function (using a lowest common ancestor, like Git). This avoids tombstones and metadata overhead. But two key challenges remain:

Efficiency: How do we guarantee that optimized implementations still preserve the intent of the data type?
Correctness: How do we formally verify both convergence and functional correctness of complex MRDTs like queues?

How Peepul Works

Peepul bridges the gap between high-level specifications and efficient implementations through three key ideas:

Git-like Store Model: MRDTs operate over a store modelled after distributed version control — data is versioned in branches with explicit merges. The store provides the lowest common ancestor (LCA) for three-way merging, abstracting away causal metadata.
Replication-Aware Simulation Relations: To prove an MRDT implementation correct, we define a simulation relation between the concrete state (the efficient data structure) and an abstract state (a high-level specification based on events and visibility). The relation must be maintained across all operations and merges, ensuring both functional correctness (operations return the right values) and convergence modulo observable behaviour (replicas with the same history behave identically).
Automated Verification in F*: The simulation relation generates proof obligations that F* discharges to the Z3 SMT solver. For most MRDTs, the proofs are fully automatic. The verified implementations are extracted as OCaml code and run on Irmin.

Example: OR-Set

An observed-remove set (OR-set) where concurrent add and remove of the same element results in the add winning. Peepul provides three verified implementations with different space/time trade-offs:

// OR-set state: set of (element, timestamp) pairs
Sigma = P(N x N)
sigma_0 = {}

do(add(a), sigma, t) = (sigma ∪ {(a, t)}, ⊥)
do(remove(a), sigma, t) = ({e ∈ sigma | fst(e) ≠ a}, ⊥)

merge(sigma_lca, sigma_a, sigma_b) =
  (sigma_lca ∩ sigma_a ∩ sigma_b) ∪
  (sigma_a - sigma_lca) ∪
  (sigma_b - sigma_lca)

The merge takes the LCA and two divergent versions: it keeps elements present in all three, and adds elements that are new in either branch. The space-efficient variant (OR-set-space) deduplicates entries, while OR-set-spacetime uses a balanced BST for O(log n) operations.

Verified MRDTs

Peepul includes 9 verified MRDT implementations with varying complexity:

MRDT	Lines of Code	Lines of Proof	Lemmas	Verif. Time (s)
Increment-only counter	6	43	2	3.5
PN counter	8	43	2	23.2
Enable-wins flag	20	58–89	3–7	104–1074
LWW register	5	44	1	4.2
G-set / G-map	10 / 48	23–33 / 26	0–2 / 0	2–5 / 26
Mergeable log	39	95	2	36.6
OR-set (3 variants)	30–97	36–266	0–7	8.8–1854
Queue	32	1123	75	4753

The queue is the most complex MRDT — the first formally verified replicated queue with a complete specification capturing FIFO ordering, at-least-once dequeue semantics, and correct merge behaviour. Its O(n) merge is orders of magnitude faster than prior work.

Key Results

				Orders-of-magnitude speedup over Quark. For queues, Quark's merge takes 10s for 1000 operations and 178s for 5000 operations due to costly relational reification. Peepul's linear-time merge takes less than a millisecond in all cases, operating directly on the concrete representation.
			

				Space efficiency. Quark's OR-set grows unboundedly due to duplicates (reaching 10,000 elements after 100K operations), while Peepul's space-efficient OR-set stays bounded below 1000 elements for the same workload.
			

				Compositionality. Complex MRDTs (like an IRC chat application) can be built by composing simpler verified MRDTs using a generic α-map. The correctness proof of the composite follows from the proofs of the components.
			

				Convergence modulo observable behaviour. A new, weaker notion of convergence that allows replicas to have different internal states (e.g., differently balanced BSTs) as long as their observable behaviour is identical. This enables even more efficient implementations.
			

Contributions

A store semantics for MRDTs with formal definitions of convergence and functional correctness, including a novel notion of convergence modulo observable behaviour.
A verification technique based on replication-aware simulation relations adapted from CRDTs to the MRDT setting, with automated proof discharge via F* and Z3.
The Peepul F* library with 9 verified efficient MRDTs, including the first formally specified and verified replicated queue.
Experimental results showing Peepul MRDTs are orders of magnitude more efficient than prior approaches, with verified OCaml code running on the Irmin distributed database.

BibTeX

@inproceedings{soundarapandian2022certified,
  title={Certified Mergeable Replicated Data Types},
  author={Soundarapandian, Vimala and Kamath, Adharsh and Nagar, Kartik and Sivaramakrishnan, KC},
  booktitle={Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI)},
  pages={332--347},
  doi={10.1145/3519939.3523735},
  year={2022}
}