Mixed-Reality Interactions

Introspect — a mixed-reality installation accompanying live performance, where AI watches the audience and grows the virtual scene in real time in response.

The challenge

Create an immersive mixed-reality experience that intelligently responds to audience interactions during a live performance.

Our approach

React AI built a real-time system that reads live VR sensor data — players' positions, movements and interactions — and feeds it into an extended Growing Neural Cellular Automata model that grows and evolves the visuals in response. Delivered with Octopus Immersive and performer Sian Cross, supported by the BBC via the MyWorld programme.

How we built the sensor-fed animations

Introspect is a mixed-reality installation that accompanies a live performance. As the audience moves through the space in VR, an AI model reads what they do and grows the visuals around them in real time — patterns that emerge, spread and fade in direct response to people and events in the room. It was delivered with Octopus Immersive and performer Sian Cross, supported by the BBC through the MyWorld Amplifying Imagination programme.

This is a write-up of how the sensor-fed animation engine works.

Built on growing neural cellular automata

The animation is generated by a neural cellular automaton (CA), building on the technique from Distill’s Growing Neural Cellular Automata (Mordvintsev et al., 2020).

A cellular automaton is a grid of cells. Each cell can only “sense” its immediate neighbours — the eight cells around it in a 3×3 window — and a small neural network learns a local update rule that is applied to every cell, every time step. From those purely local rules, complex, organic images grow and even self-repair. Cells are either alive or dead; a dead cell can only come alive if one of its neighbours was alive on the previous step, and only living cells and their immediate surroundings get updated. The practical consequence: patterns must grow outward from a seed rather than appearing all at once.
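
The growth constraint can be illustrated with a small NumPy sketch. It assumes the convention from the Distill article, where a cell counts as alive when its alpha value exceeds 0.1; the names `alive_mask` and `masked_step` are illustrative, and the all-ones `proposed` array stands in for the learned rule. It is not the Introspect code.

```python
import numpy as np

ALIVE_THRESHOLD = 0.1  # assumption: threshold used in the Distill article


def alive_mask(alpha: np.ndarray, threshold: float = ALIVE_THRESHOLD) -> np.ndarray:
    """True where a cell, or any of its 8 neighbours, has alpha above the threshold."""
    h, w = alpha.shape
    padded = np.pad(alpha, 1)  # zero border so edge cells have a full 3x3 window
    # Nine shifted views of the grid cover every position in the 3x3 window.
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(windows, axis=0) > threshold


def masked_step(alpha: np.ndarray, proposed: np.ndarray) -> np.ndarray:
    """Accept the proposed alpha only where the neighbourhood was alive beforehand."""
    return np.where(alive_mask(alpha), proposed, 0.0)


alpha = np.zeros((5, 5))
alpha[2, 2] = 1.0             # a single seed in the centre
proposed = np.ones((5, 5))    # stand-in rule that wants every cell alive at once

after_one = masked_step(alpha, proposed)
after_two = masked_step(after_one, proposed)
print(int(after_one.sum()), int(after_two.sum()))  # prints: 9 25
```

Even though the stand-in rule asks for the whole grid at once, the living region can only expand by one ring of cells per step.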

Making it react to the room

The published model grows a fixed image from a single seed. For Introspect we extended it in two ways so it could respond to a live environment:

A “sense” image. Alongside its own state, the model takes a second input image that lets the environment influence how the cells evolve. Importantly, sense input alone can’t bring a cell to life — growth still has to start from a seed — so the environment steers the pattern rather than overwriting it.
Event-driven seeds. Every time a player or object enters the scene, we drop a new seed into the automaton at that location, and tag the sense image with the type of event. The automaton starts growing there, in a way appropriate to what just happened — so the visuals track the life of the room.
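
A rough sketch of how the two extensions could fit together, assuming the "sense cannot create life" property comes from the same alive masking used in the Distill model. The channel count, the event codes, `drop_seed`, `step` and `toy_rule` are all hypothetical names and values chosen for illustration.

```python
import numpy as np

CHANNELS = 16  # assumption: channel count borrowed from the Distill article
EVENT_CODES = {"player_entered": 1.0, "object_spawned": 2.0}  # hypothetical tags


def neighbourhood_alive(alpha, threshold=0.1):
    """True where the cell or one of its 8 neighbours has alpha above threshold."""
    h, w = alpha.shape
    padded = np.pad(alpha, 1)
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(shifted, axis=0) > threshold


def drop_seed(state, sense, row, col, event):
    """Seed one cell and tag the sense image with the event type (in place)."""
    state[row, col, 3:] = 1.0             # alpha and hidden channels on, RGB left at 0
    sense[row, col] = EVENT_CODES[event]


def step(state, sense, rule):
    """One update: the rule sees state and sense, but only living regions change."""
    alive = neighbourhood_alive(state[..., 3])
    return state + rule(state, sense) * alive[..., None]


def toy_rule(state, sense):
    """Stand-in for the learned network: adds 0.5 to every channel where sense is non-zero."""
    return np.repeat((sense > 0)[..., None], state.shape[-1], axis=-1) * 0.5


state = np.zeros((8, 8, CHANNELS))
sense = np.zeros((8, 8))
sense[1:4, 1:4] = 1.0                     # activity in the room, but no seed yet
no_seed = step(state, sense, toy_rule)    # nothing is alive, so nothing changes

drop_seed(state, sense, 2, 2, "object_spawned")
with_seed = step(state, sense, toy_rule)  # growth starts around the seed only
```

In this sketch `no_seed` stays all zeros, while `with_seed` changes only in the 3×3 block around the seeded cell.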

From headset to animation

The live pipeline turns sensor data into rendered frames roughly ten times a second:

VR telemetry over MQTT. The headsets (Oculus) stream events over MQTT — head and hand positions on every update, plus messages when an object is spawned or a player interacts with one. A head-position update looks like:
```
XRUserData/Update {"dataType":"OI.Oculus.XRSensorData",
  "payload":{"HeadPosition":{"x":0.616,"y":1.640,"z":-0.105},
    "LeftHandPosition":{...},"RightHandPosition":{...}}}
```
Building the sense image. A SenseImage component maps player-space coordinates into the model’s image space (a configured scale, offset and rotation) and accumulates the activity into the sense image, placing a seed wherever a spawn event occurs.
Growing the next frame. Roughly ten times a second the server runs the current sense image and seeds through the trained CA to produce the next animation frame.
Publishing back. The frame is JPEG-encoded and published back over MQTT, where it drives the scene’s visuals.
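
The first two steps above can be sketched with the standard library alone. This assumes the message body is the JSON shown in the example, that `y` is the vertical axis (so the floor plane is `x`/`z`), and that the mapping is a rotation followed by a uniform scale and an offset. The function names and the scale, offset and image-size values are hypothetical, not the configured ones.

```python
import json
import math


def parse_head_position(payload_json):
    """Extract (x, y, z) of the head from an XRSensorData message body."""
    head = json.loads(payload_json)["payload"]["HeadPosition"]
    return head["x"], head["y"], head["z"]


def to_image_space(x, z, scale, offset, rotation_deg, size):
    """Rotate, scale and offset floor-plane coordinates, then clamp to pixel indices."""
    theta = math.radians(rotation_deg)
    rx = x * math.cos(theta) - z * math.sin(theta)
    rz = x * math.sin(theta) + z * math.cos(theta)
    col = int(round(rx * scale + offset[0]))
    row = int(round(rz * scale + offset[1]))

    def clamp(v):
        return max(0, min(size - 1, v))

    return clamp(row), clamp(col)


# Head position taken from the sample message; hand positions omitted here.
message = (
    '{"dataType":"OI.Oculus.XRSensorData",'
    '"payload":{"HeadPosition":{"x":0.616,"y":1.640,"z":-0.105}}}'
)
x, y, z = parse_head_position(message)
row, col = to_image_space(x, z, scale=8.0, offset=(32.0, 32.0), rotation_deg=0.0, size=64)
print(row, col)  # prints: 31 37
```

Clamping keeps a player who steps outside the mapped area on the edge of the image instead of raising an index error.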

The whole loop — sense → seed → grow → render → publish — runs continuously throughout the performance, so the imagery is never pre-baked: it is computed live from what the audience is doing.

flowchart LR
  H["VR headsets
head + hand positions"] -->|"MQTT"| SI
  EV["Spawn / interaction
events"] -->|"MQTT"| SI["Sense image
maps to image space
+ drops seeds"]
  SI --> CA["Neural cellular
automaton"]
  CA -->|"~10 fps · MQTT 'image'"| VIS["Scene visuals"]
  CA -->|"state carries to next step"| CA

Styles: ripple, trail and aura

What the automaton grows is shaped by the targets it was trained on. Three styles drive the look of the piece:

Ripple — expanding rings that move out across the image from a point of activity.
Trail — a line that follows a player’s path through the space.
Aura — an experimental style that tracks a player’s speed and their interactions with other players.

Seeing it in action

Two recorded runs of the trained system generating output:

Recorded output

Teaching it what to grow

The model’s update rule isn’t hand-coded; it’s learned from recorded performance sessions. The training loop boils down to: given this sensor input, produce this image, then adjust the network weights until the rendered output matches the target.

The dataset. During development we record performance and rehearsal sessions as streams of MQTT events (head/hand positions and spawn events). A build step replays each recording offline through the same SenseImage pipeline the live server uses, producing paired (sense_image, target_keyframe) tuples. Each scenario lives in a directory under scenarios/ with a seeds.csv (when and where to drop a seed) and a sequence of hand-crafted keyframe PNGs — the targets the CA should produce at specific frame indices.
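
As an illustration of the seed schedule, here is a minimal reader for a file shaped like seeds.csv. The column names `frame`, `x` and `y` are assumptions, since the actual layout is not described here, and `load_seeds` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def load_seeds(csv_file):
    """Group seed positions by the frame index on which they should be dropped."""
    seeds = defaultdict(list)
    for record in csv.DictReader(csv_file):
        seeds[int(record["frame"])].append((int(record["x"]), int(record["y"])))
    return dict(seeds)


# Hypothetical contents: one seed at frame 0, two more at frame 40.
example = io.StringIO("frame,x,y\n0,32,32\n40,10,50\n40,50,10\n")
seeds_by_frame = load_seeds(example)
print(seeds_by_frame)  # prints: {0: [(32, 32)], 40: [(10, 50), (50, 10)]}
```

Keying the schedule by frame index lets a replay loop look up the seeds for the current frame in one step.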

The loss. Mean squared error between the CA’s rendered RGBA (state channels 0..3) and the target keyframe, taken only at frames where a keyframe exists. Recent additions also penalise drift in non-RGBA channels (so they don’t silently do double duty) and add a direction-invariant perception loss, for robustness to the direction the audience approaches from.
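
The core term can be written in a few lines of NumPy. This sketch assumes each frame is an H×W×C state array and that keyframes are stored in a dictionary from frame index to an H×W×4 RGBA target. It leaves out the channel-drift penalty and the perception loss, and `keyframe_loss` is an illustrative name.

```python
import numpy as np


def keyframe_loss(frames, keyframes):
    """Mean squared error over RGBA, averaged over the frames that have a keyframe."""
    losses = [
        np.mean((frames[index][..., :4] - target) ** 2)  # channels 0..3 only
        for index, target in keyframes.items()
    ]
    return float(np.mean(losses))


frames = [np.zeros((4, 4, 16)) for _ in range(3)]
frames[2][..., :4] = 0.5                 # rendered RGBA at frame 2
frames[2][..., 4:] = 9.0                 # hidden channels do not affect this term
keyframes = {2: np.ones((4, 4, 4))}      # only frame 2 has a target

print(keyframe_loss(frames, keyframes))  # prints: 0.25
```

Frames 0 and 1 have no keyframe, so they contribute nothing, which leaves the model free to choose how it gets from one target to the next.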

Pool-based training. Rather than starting every rollout from a fresh seed, we keep a pool of CA states. Each iteration we sample a batch from the pool, run ~200 update steps starting from those states, score the result against the keyframes, backpropagate, then write the updated states back. Because the pool fills up with partly grown states, the model learns to recover gracefully from many intermediate states, not just from clean seed frames.
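
The bookkeeping behind the pool is simple enough to sketch without a deep-learning framework. Here `rollout` is a placeholder for the ~200 CA steps plus the loss and backpropagation, and the pool size, grid size and `pool_iteration` name are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def pool_iteration(pool, rollout, batch_size=8):
    """Sample a batch of states, advance them, and write the results back in place."""
    idx = rng.choice(len(pool), size=batch_size, replace=False)
    batch = pool[idx]                  # fancy indexing copies the sampled states
    final_states = rollout(batch)      # real loop: ~200 CA steps, loss, backprop
    pool[idx] = final_states           # later iterations resume from these states
    return idx


def fake_rollout(batch):
    """Stand-in that marks every state it touched by adding 1."""
    return batch + 1.0


pool = np.zeros((32, 8, 8, 16))        # 32 stored states on a small 8x8 grid
touched = pool_iteration(pool, fake_rollout)
```

After one call, exactly eight pool entries have been advanced and the other 24 are untouched, so successive iterations see a mix of fresh and well-developed states.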

Training run. Adam, batch 8, image size 64, ~10k iterations on a consumer GPU. The whole pipeline is PyTorch; W&B captures the loss curves and sample renders so we can spot regressions early. Resulting checkpoints are under 100 KB — small enough to ship inside the website.

Reference: technique builds on Growing Neural Cellular Automata (Mordvintsev et al., 2020); we extend it with the sense-image conditioning and event-driven seeding described above.

flowchart LR
  REC["Recorded sessions
(MQTT event logs)"] --> BUILD["buildSenseImages.py"]
  BUILD --> SENSE["Sense images"]
  BUILD --> KEYS["Keyframes
(target images)"]
  SENSE --> TRAIN["Train CA
(PyTorch)"]
  KEYS --> TRAIN
  TRAIN --> MODEL["Trained model
(local update rule)"]

Reference & credits

Technique: Growing Neural Cellular Automata, Mordvintsev, Randazzo, Niklasson & Levin — Distill, 2020.
Delivered with Octopus Immersive and performer Sian Cross; supported by the BBC through the MyWorld Amplifying Imagination programme.