How We Synchronized Editing for Rec Room’s Multiplayer Scripting System

Circuits is Rec Room’s technology for programming rooms: a multiplayer-synchronized realtime scripting system that lets players build interactive experiences together. Creating Circuits is one of the top honors of my career. I’d like to share how we solved core networking challenges, for anyone working on similar problems.

Before the Circuits we know today, we had “Circuits V1”. Circuits V1 was built in a few weeks by the talented Jamie McDuffee, reusing our shapes synchronization technology. For shapes, the unit of synchronization was a game object, so thousands of scripting elements needed thousands of game objects. This caused memory and CPU concerns that placed a low ceiling on complexity.

The tradeoff was deliberate. We didn’t know Circuits would land, so shipping fast and testing the hypothesis was more important than perfect architecture. Circuits V1 met the market with massive success and drove all Rec Room scripting for years. We learned that combining 3D modeling, basic scripting, and an excited community naturally resulted in interesting 3D games. But after scaling the product to its ceiling, we needed a different strategy: a rewrite we called Circuits V2.

For Circuits V2, a key decision reduced complexity: if objects were the expensive part of synchronization, what if all of Circuits was one game object? Starting from that premise, we built the Circuits V2 synchronization model on top. We made two attempts at this: one we released for MVP; another built on top after discovering issues in retail.

I’ll call the first attempt “In-Memory Database”. We imagined entities in Circuits, like nodes and edges, as rows in a database. Every row subscribed to network events describing changes it experienced: “Add Node”, “Edit Property”, “Remove Node”. This model caused issues around cross-object operations. What if I connected nodes N1 and N2 with edge E1, while someone else deleted N1? These two operations touched the same three rows: add edge E1, update N1’s edge list, and update N2’s edge list; then delete N1, update N2’s edge list, and remove E1.

If you weren’t careful with row updates, you’d see an error.

Null Reference on Deletion

Simple operations could be corrected with null checks, but with our peer-to-peer networking model, synchronization complexity grew with the number of players and operations. Rows could easily get out of sync as messages were sent to every player and arrived in conflicting orders. We explored transactions, idempotence, and consensus algorithms for eventual consistency. None of these were insurmountable engineering challenges, but they weren’t where a startup found value spending time.

We came to a solution inspired by the popular JavaScript library Redux. Redux’s core idea is that your entire application state lives in one object, and the only way to change it is to send an action through a pure function called a “reducer”. The reducer takes the current state and an action and returns the new state.

We kept our Circuits database but reimagined it as one global state. Now, any change to state can be represented by the reducer function f(S1, A) = S2 where S1 is the original state, A is some action we want to perform, and S2 is the new state.

We take advantage of the fact that Circuits uses a single game object, and use a single RPC method for all networking. This means we can manage all networking in one place rather than having bespoke RPCs for every mutation.

We use a fluent interface for registering responses at the top level.

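The fluent-interface listing didn’t survive this copy of the post. As an illustration only, a registry of this shape (all names hypothetical; the real system is C#, sketched here in TypeScript) might look like:

```typescript
// Hypothetical sketch of a fluent registry mapping action types to reducers.
type Action = { type: string; payload?: unknown };
type Reducer<S> = (state: S, action: Action) => S;

class ActionRegistry<S> {
  private reducers = new Map<string, Reducer<S>>();

  // Fluent: each call returns `this` so registrations chain.
  on(type: string, reduce: Reducer<S>): this {
    this.reducers.set(type, reduce);
    return this;
  }

  // Single entry point: every networked action flows through here.
  dispatch(state: S, action: Action): S {
    const reduce = this.reducers.get(action.type);
    return reduce ? reduce(state, action) : state; // unknown actions are no-ops
  }
}
```

Registrations then chain naturally, e.g. `registry.on("AddNode", …).on("RemoveNode", …)`, and the single RPC handler calls `dispatch` for every incoming action.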

With this setup, we can easily implement Reduce functions.
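The original Reduce examples are also missing from this copy; a minimal sketch of what a pair might look like, assuming a nodes-with-edge-lists state shape (TypeScript stand-in for the real C#):

```typescript
// Illustrative state and reducers; names and shapes are assumptions.
interface CircuitsState {
  nodes: Record<string, { edges: string[] }>;
}

// Pure function: f(S1, A) = S2 — never mutates S1.
function reduceAddNode(state: CircuitsState, nodeId: string): CircuitsState {
  return { nodes: { ...state.nodes, [nodeId]: { edges: [] } } };
}

function reduceRemoveNode(state: CircuitsState, nodeId: string): CircuitsState {
  const nodes = { ...state.nodes };
  delete nodes[nodeId];
  // Also drop dangling edge references so later actions can't hit a null row.
  for (const id of Object.keys(nodes)) {
    nodes[id] = { edges: nodes[id].edges.filter((e) => e !== nodeId) };
  }
  return { nodes };
}
```

Because reducers are pure, `reduceRemoveNode` can clean up dangling edge references in the same step, which is exactly the class of cross-row problem that bit the In-Memory Database design.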

Simple Networking for Circuits

So our data is in a database modeled like a Redux-style reducible state container. There’s still one last problem to solve to bring this all together: players can send actions peer-to-peer, so latency between peers may cause players to see actions out of order. A sends to B and C while B sends to A and C. A would think their action happened first, and B would think their action happened first.

We again opted for a simple solution:

  1. When an action is performed, set temporary local state so it feels snappy
  2. Send all actions to a single “owner” client
  3. That client forwards the actions to everyone else
  4. Stomp local state with actions received from the owner and commit them
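The four steps above can be sketched roughly like this (names and shapes are assumptions, in TypeScript rather than the production C#):

```typescript
// Minimal sketch of the "action funnel": optimistic local apply, owner
// forwarding, then committing in the owner's order.
type FAction = { id: number; type: string };

class FunnelClient {
  committed: FAction[] = []; // authoritative order, as forwarded by the owner
  pending: FAction[] = [];   // optimistic local state, applied immediately

  constructor(private network: { sendToOwner(a: FAction): void }) {}

  // Steps 1 + 2: apply locally for snappiness, then send to the owner.
  perform(action: FAction): void {
    this.pending.push(action);
    this.network.sendToOwner(action);
  }

  // Step 4: the owner's broadcast stomps local state and commits.
  onOwnerBroadcast(action: FAction): void {
    this.pending = this.pending.filter((a) => a.id !== action.id);
    this.committed.push(action);
  }
}

// Step 3 lives on the owner: it simply rebroadcasts in arrival order.
class FunnelOwner {
  constructor(private clients: FunnelClient[]) {}
  receive(action: FAction): void {
    for (const c of this.clients) c.onOwnerBroadcast(action);
  }
}
```

Because every client commits actions only in the order the owner forwards them, all clients converge on the same sequence regardless of peer-to-peer latency.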

Funneling every action through one owner gives us, in effect, serializable execution:

> Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. Thus, the database guarantees that if the transactions behave correctly when run individually, they continue to be correct when run concurrently – in other words, the database prevents all possible race conditions. (Martin Kleppmann, *Designing Data-Intensive Applications*)

We colloquially refer to this solution as the “action funnel”. The mental model I encourage is to think of all actions being thrown into a funnel. Then they “drip” out one by one when they reach the bottom.

The action funnel is a success. On the shapes side, we still use bespoke RPCs for all of the various operations. This causes synchronization defects that must be resolved on a case-by-case basis. Since Circuits V1 was built on shapes networking, it’s susceptible to these defects.

These defects disappeared after moving to the action funnel in Circuits V2. In the rare cases where we find a defect, we fix it in a single location, rather than scouring the codebase for a cause. Our rate of defects in this component is close to 1 defect every 2 years. Simple code also means that everyone on the team can contribute solutions, even new college grads with little experience.

In addition to the problems we targeted directly, we’ve found that several other problems become trivial in the simpler codebase. Let’s cover some of those.

Join-in-progress is when a player joins a session after other players have built some new things. Players download save data when they join, but it doesn’t contain in-progress creation information. This means joining players have to replay network state or “fast forward” to catch up.

To solve this problem with the action funnel, we periodically send a special InitializePayloadData action which contains the latest save data for the room. Rather than sending this up to cloud storage like save data, we store it in the session server that players connect to for realtime multiplayer. As players create, additional actions are stored in the session server after the InitializePayloadData action. When we hit a certain threshold (initially 25 actions), we create another InitializePayloadData “snapshot” and store it in the server while clearing the queue.

Joining players catch up with simple logic:

  1. Join the room
  2. Check if the first action in the queue is InitializePayloadData
  3. If it is, use it to restore; otherwise, restore from save data
  4. Replay the remaining actions on top in a loop
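A rough sketch of the snapshot queue and the catch-up loop, under assumed shapes (the 25-action threshold comes from the post; everything else is illustrative TypeScript):

```typescript
// Session-server action storage with periodic snapshots, plus the
// join-in-progress catch-up loop.
type SAction = { type: string; payload?: unknown };
const SNAPSHOT_THRESHOLD = 25; // initial production value from the post

class SessionQueue {
  queue: SAction[] = [];

  store(action: SAction, snapshotOf: () => unknown): void {
    this.queue.push(action);
    // Once enough actions pile up, replace the queue with one snapshot.
    if (this.queue.length >= SNAPSHOT_THRESHOLD) {
      this.queue = [{ type: "InitializePayloadData", payload: snapshotOf() }];
    }
  }
}

// Joining players restore from the snapshot if present, then replay the rest.
function catchUp<S>(
  queue: SAction[],
  restoreFromSave: () => S,
  restoreFromSnapshot: (payload: unknown) => S,
  reduce: (state: S, action: SAction) => S
): S {
  let state: S;
  let rest = queue;
  if (queue.length > 0 && queue[0].type === "InitializePayloadData") {
    state = restoreFromSnapshot(queue[0].payload);
    rest = queue.slice(1);
  } else {
    state = restoreFromSave();
  }
  for (const action of rest) state = reduce(state, action);
  return state;
}
```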

Compression and Splitting

Since our networking code relies on a single RPC, we can update a single method to transform everything that flows through it. Two modifications we use in production are compression and splitting.

By changing the payload during transform, we can handle large actions where bandwidth is a concern. CompressIfOversized looks at the size of the action and runs zip compression if it’s too large (> ~100KB), returning a new action whose payload wraps the compressed bytes.
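A hedged sketch of what CompressIfOversized could look like, using gzip as a stand-in for the zip compression mentioned (the payload shape and every name besides CompressIfOversized are assumptions):

```typescript
import { gzipSync, gunzipSync } from "zlib";

const COMPRESS_LIMIT = 100_000; // ~100KB, per the post

// Assumed payload shape: a flag plus the (possibly compressed) bytes.
interface ActionPayload {
  compressed: boolean;
  bytes: Buffer;
}

// Small actions pass through untouched; oversized ones get gzipped.
function compressIfOversized(bytes: Buffer): ActionPayload {
  if (bytes.length <= COMPRESS_LIMIT) return { compressed: false, bytes };
  return { compressed: true, bytes: gzipSync(bytes) };
}

// Receivers undo the transform before dispatching the action.
function decompress(payload: ActionPayload): Buffer {
  return payload.compressed ? gunzipSync(payload.bytes) : payload.bytes;
}
```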

If the action is still too large (> ~300KB), we shrink it further with SplitIfOversized, which returns an array of actions whose payloads each carry one part of the original.

Each player has a unique player_id and tracks a local action_id. This combination creates unique-per-session keys. Players store these actions in structures like Dictionary<(int PlayerId, int ActionId), List<byte[]>>. When the count of the list equals a payload’s part_count, the action is reassembled and dispatched like any other action. While we could have sent the count once in the first part, we chose to send it every time to simplify some debugging scenarios.
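Splitting and reassembly might be sketched like this; the `player_id`, `action_id`, and `part_count` fields follow the post, while the rest is an illustrative TypeScript translation of the C# dictionary approach:

```typescript
const SPLIT_LIMIT = 300_000; // ~300KB, per the post

interface Part {
  player_id: number;
  action_id: number;
  part_index: number;
  part_count: number; // sent on every part to simplify debugging
  bytes: Buffer;
}

function splitIfOversized(playerId: number, actionId: number, bytes: Buffer): Part[] {
  if (bytes.length <= SPLIT_LIMIT) {
    return [{ player_id: playerId, action_id: actionId, part_index: 0, part_count: 1, bytes }];
  }
  const count = Math.ceil(bytes.length / SPLIT_LIMIT);
  const parts: Part[] = [];
  for (let i = 0; i < count; i++) {
    parts.push({
      player_id: playerId,
      action_id: actionId,
      part_index: i,
      part_count: count,
      bytes: bytes.subarray(i * SPLIT_LIMIT, (i + 1) * SPLIT_LIMIT),
    });
  }
  return parts;
}

// Mirrors the C# Dictionary<(int PlayerId, int ActionId), List<byte[]>>.
class Reassembler {
  private buffers = new Map<string, Buffer[]>();

  // Returns the full action bytes once every part has arrived, else null.
  receive(part: Part): Buffer | null {
    const key = `${part.player_id}:${part.action_id}`;
    const list = this.buffers.get(key) ?? [];
    list[part.part_index] = part.bytes;
    this.buffers.set(key, list);
    if (list.filter((b) => b !== undefined).length === part.part_count) {
      this.buffers.delete(key);
      return Buffer.concat(list);
    }
    return null;
  }
}
```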

Techniques like these helped us work around networking limits that capped the total number of nodes a player could use in their creations.

Logging and automation also take advantage of the single code path. We provide hooks to capture every sent and received action; the receive hook simply hands the raw bytes to WriteToLog.
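A minimal sketch of such a hook system, assuming shapes not shown in the post (only WriteToLog is named in the original):

```typescript
type Hook = (rawBytes: Buffer) => void;

class ActionPipeline {
  private receiveHooks: Hook[] = [];

  onReceive(hook: Hook): void {
    this.receiveHooks.push(hook);
  }

  // Every action received from the network flows through this one path.
  receive(rawBytes: Buffer): void {
    for (const hook of this.receiveHooks) hook(rawBytes);
    // ...then decode and dispatch to the reducer as usual.
  }
}

// writeToLog appends raw bytes to an in-memory log here; production writes
// them to a log file instead.
const log: Buffer[] = [];
function writeToLog(rawBytes: Buffer): void {
  log.push(rawBytes);
}
```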

WriteToLog takes the raw bytes received from the network and writes them into a log file. Action logs are uploaded during crashes or if players press an in-game button. The action logs benefit two debugging techniques:

  1. The sequence of actions following the latest InitializePayloadData action should be identical for every player.
  2. We can replay the actions in a simulator when something goes wrong.

The second technique is critical for our testing framework. After an error is discovered, we can encode the log into a unit test. Fixing the test then fixes the error and ensures that it can’t happen again.

The state of every player’s Circuits instance is consistent at any point in the action stream. In other words, when sending actions A, B, C, if every player hashes their state after receiving B, the hashes must be equivalent. We used this fact to implement our observability stack.

  1. Every 15 minutes, a player sends a WatchdogPayloadData action
  2. On receipt of the WatchdogPayloadData, players hash their state and return it to the sender
  3. The sender compares all of the hashes and records divergence in analytics
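The watchdog comparison can be sketched as follows; the hash function choice and state shapes are assumptions:

```typescript
import { createHash } from "crypto";

// Every client hashes its state the same way, so identical states produce
// identical hashes. SHA-256 over serialized state is an assumed choice.
function hashState(state: unknown): string {
  return createHash("sha256").update(JSON.stringify(state)).digest("hex");
}

// The sender records divergence when any returned hash differs from its own.
function detectDivergence(senderHash: string, returnedHashes: string[]): boolean {
  return returnedHashes.some((h) => h !== senderHash);
}
```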

This observability stack helps us iron out synchronization defects in retail. The telemetry includes the id of the defective game. We can either pull the log to see what went wrong or contact the creator for more information on the scenario.

Real Corruption Detected and Fixed in Amplitude

Social Solutions to Social Problems

A big part of our design was being intentional about problems we didn’t solve. We wanted a little effort to do a lot of work, so we were happy to accept artifacts that left us time to improve other, more important parts of Rec Room, as long as players had a path forward.

A common principle in our decision-making framework when deciding whether to resolve a defect is “social solutions to social problems”. If folks working on something together can sit around and reason through a problem, can we rely on that social reasoning rather than technology?

I’ll discuss how we used this principle to find simple solutions to problems which would otherwise require complicated technology.

Replaying Dropped Actions

A few anomalies cause dropped actions, for example the master client disconnecting. Some of Rec Room’s most common defects are related to master disconnects or transitions. From the point of view of the action funnel, only the master can send actions to everyone. When the master client changes, we simply drop the action and return failure to the sender.

In Rec Room, master disconnects are obvious. There’s usually a hitch or network interruption, and if you are building with someone, they disappear. Since weird stuff already happens during a disconnect, we don’t think it’s worthwhile to solve additional weirdness: you’re already in a state where reality is suspended.

As a concrete example, if you send an AddNodePayloadData, a player disconnects, and you see the node disappear, it’s easy enough to recreate it. Lost work is limited to a small number of recent actions.

We did some early thinking on approaches with fancy hashing mechanisms to detect and replay dropped actions but in the end we settled on dropping them outright. Fancy algorithms weren’t worthwhile when we could otherwise have a codebase accessible to more people. Each trick you add to a system comes out of a cognitive budget and you have to reason about all the tricks together whenever you do something new.

People Editing the Same Object

Another scenario solved more elegantly by social solutions is rapid edits to the same object by different people. The outcome of this scenario is an object potentially jolting back and forth as movement is synchronized and local imposters are updated. We have a mechanism for locking objects, but once again, complexity makes it a rarely-used tool.

Instead, we thought about the prevalence of this scenario. How often do you edit the same paragraph concurrently in a Google Doc? We’d done a lot of work in Google Docs and it didn’t seem like something that happened. Either folks edited different parts of the doc, or folks talked about one part of a doc together as one person edited it.

The realtime editing experience of Google Docs was similar to behaviors we saw in Rec Room. It was just uncommon for folks to be editing the same parts of a Circuits script at the same time. If they did, they tended to be talking about it. If you were fighting over where a node should go and moving it simultaneously, you had a social problem to solve before the tech even mattered.

Thus, we ruled out weirdness caused by delayed consistency due to rapid edits as a problem worth solving with technology.

The undercurrent of our synchronization work was a bet that simple code outperformed clever code in the long run: collapsing game objects into one, funneling actions through a single owner, dropping actions instead of replaying them. That bet paid off. With a defect rate near one every two years, a system accessible to new grads, and flexible enough to absorb compression, splitting, observability, and join-in-progress without structural changes, we could spend the time on the product rather than chasing technology.

If there’s one thing I hope you take from this, it’s that the best architecture for a multiplayer system isn’t always the most sophisticated. Sometimes it’s the one that’s easy enough for your whole team to hold in their heads, boring enough to rarely break, and small enough to fix in one place when it does.

Our CEO, Nick Fajt, phrased it in a way that’s always stuck with me:

> I’ve never heard a team say, “I wish we took more time and shipped something larger.” It’s always, “I wish we did something smaller, less complex, got to players sooner.”
