PowerSync vs. Chaos Monkey

Authors
lofi2025

What happens when your sync engine faces the Chaos Monkey - network partitions, forced crashes, and general mayhem?

It all started a while ago when I learnt about Jepsen via a certain NatureNurture. Embarrassingly, I'd never heard of the project, but I immediately fell in love: given my background in aerospace engineering and cybersecurity, I have a very soft spot for resiliency audits. So I reached out to them to see if they'd be keen to put PowerSync through the wringer. They agreed, and we hired them to run rigorous consistency tests on PowerSync using Jepsen, evaluating Causal Consistency, Atomic Transactions, and Strong Convergence. This was in May of 2024!

Fast forward a year or so and we have v1 of the testing framework.

TL;DR Results

PowerSync held up impressively under certain fault conditions, and we identified areas for improvement in other cases. Check out v1 here: https://github.com/nurturenature/jepsen-powersync

What is Jepsen?

Jepsen is the gold standard for testing distributed systems like databases for correctness during failures (network failures, process crashes, etc.). We tested PowerSync's active/active sync from PostgreSQL to SQLite clients in progressive scenarios: from single-user to multi-user chaos.

Results Summary

Overall we have four fuzzing scenarios with suboptimal results that we'll be investigating:

  1. Both ordered and random client disconnects can result in stalled replication. [1]
  2. Random graceful exits of the client application have a similar impact, but with lower frequency. [2]
  3. Network partitioning can result in replication disruption: either clients just stop uploading or, less frequently, changes are uploaded to Postgres but never make it back to clients. Impact: "divergent final reads". [3]
  4. Randomly killing client applications. Impact: in ~0.5% of cases, db.currentStatus gets into an unexpected state, and local transactions committed after the kill are not uploaded to the sync service. [4]
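
To make "divergent final reads" concrete: once faults stop and the system is given time to settle, every replica should return the same value for every key. The snippet below is a minimal sketch of that kind of check, not the code used in the Jepsen test suite. The function name `find_divergent_keys`, the replica names, and the data shape (a dict of key/value dicts per replica) are all assumptions made for illustration.

```python
from typing import Any, Dict, Hashable

# Sentinel for "this replica has no row for the key" (hypothetical helper).
_MISSING = object()


def find_divergent_keys(
    final_reads: Dict[str, Dict[Hashable, Any]],
) -> Dict[Hashable, Dict[str, Any]]:
    """Return the keys whose final value differs between replicas.

    final_reads maps a replica name (e.g. "postgres", "client-1") to the
    key/value state it reported in its final read. An empty result means
    the replicas converged.
    """
    # Collect every key seen on any replica.
    all_keys = set()
    for state in final_reads.values():
        all_keys.update(state)

    divergent: Dict[Hashable, Dict[str, Any]] = {}
    for key in all_keys:
        # What each replica reported for this key ("<missing>" if absent).
        seen = {name: state.get(key, _MISSING) for name, state in final_reads.items()}
        values = list(seen.values())
        if any(v is not values[0] and v != values[0] for v in values[1:]):
            divergent[key] = {
                name: ("<missing>" if v is _MISSING else v) for name, v in seen.items()
            }
    return divergent


if __name__ == "__main__":
    # Illustrative data only: client-2 never received the last write to key 2.
    reads = {
        "postgres": {1: 10, 2: 20},
        "client-1": {1: 10, 2: 20},
        "client-2": {1: 10, 2: 19},
    }
    print(find_divergent_keys(reads))
```

An empty result means all replicas agree; any entry in the result names the key and shows what each replica reported for it, which is the starting point for tracing where replication stalled.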

Wins

In no-fault environments, PowerSync nailed full Causal Consistency, Atomicity, and Strong Convergence. Even with faults like disconnects, pauses, and kills, PowerSync performed "better than most similar systems that have been privately tested" while demonstrating no data loss.

Future work

We're evaluating how to harden the system to cater for the scenarios listed above.

Huge thanks to NatureNurture for the deep dive - public Jepsen tests like this push us forward!

Footnotes

  1. Impact summary: https://github.com/nurturenature/jepsen-powersync/blob/main/README.md#impact-on-consistencycorrectness

  2. Impact summary: https://github.com/nurturenature/jepsen-powersync/blob/main/README.md#impact-on-consistencycorrectness-1

  3. Impact summary: https://github.com/nurturenature/jepsen-powersync/blob/main/README.md#impact-on-consistencycorrectness-2

  4. Impact summary: https://github.com/nurturenature/jepsen-powersync/blob/main/README.md#impact-on-consistencycorrectness-4