Beyond Turn-Taking: Reimagining Conversational AI for RPGs
Storytelling Gaming AI Language Nov 11, 2024 11:39:21 AM Robert Dixon 5 min read

I'm a bit of a nerd. It's a part of my identity that manifests in different ways - I'm a tech CTO working on generative AI for gaming, and an avid tabletop role player. But one of my less common interests (in my circles at least) is my fascination with language. And with the problems we're tackling in Mind Mage, my passion for language is just as relevant as the others!
In this post, I want to explore why the way we model conversations in today's LLMs is fundamentally misaligned with how people actually communicate during roleplaying games - and what we're doing about it.
The Evolution of LLM Conversation Models
When Large Language Models first emerged, they used simple "completion" endpoints - you'd pass in the start of some text, and the model would generate what comes next. This approach was flexible and versatile, allowing developers to implement any conversational format they could express as text.
However, the industry quickly recognized the value of standardization. Having a consistent schema for dialogue means:
- Smaller models perform better, faster - a model shouldn't need to waste precious parameters learning your bespoke conversation format on the fly.
- Stronger output guarantees - a consistent representation of conversation can be validated to ensure it plays nicely with your application. Model decoders can even be 'constrained' to produce only valid conversations, but only if we agree up front on what a valid conversation looks like.
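To make the contrast concrete, here's a minimal sketch of the two styles: a raw completion prompt, where the conversation format is whatever text we invent, versus a standardized chat schema that can be validated before it ever reaches a model. The dict keys follow the common `role`/`content` convention; the `validate` helper is purely illustrative.

```python
# A completion-style prompt: the model just continues the text,
# and the conversation format is an ad-hoc convention of ours.
completion_prompt = (
    "Player: I approach the guard and smile.\n"
    "Game Master:"
)

# The standardized chat schema: a list of role-tagged messages.
chat_messages = [
    {"role": "user", "content": "I approach the guard and smile."},
    {"role": "assistant", "content": "The guard eyes you warily."},
]

def validate(messages):
    """A shared schema can be checked up front, before inference."""
    for msg in messages:
        assert msg["role"] in {"system", "user", "assistant"}
        assert isinstance(msg["content"], str)
    return True
```

The completion prompt is more flexible, but nothing stops the model from drifting out of our invented format; the chat schema trades that flexibility for checkable structure.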
Current Limitations of Chat APIs
These developments led to standardized "chat" APIs, now common across most LLM providers. However, these APIs make several fundamental assumptions about how conversation works:
- Binary Interaction Model: Most LLM chat APIs assume a two-party conversation (user/assistant), making multi-party dynamics difficult to represent.
- Turn-based Interaction: Participants take sequential turns, adding complete messages to the history. There's no way to interrupt mid-message.
- Broadcast Communication: While messages are tagged with speakers, they lack explicit recipients. This works for simple chats but breaks down when modelling multiple concurrent conversations, or directed speech more complex than each person addressing 'the group'.
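One hedged way to see what the broadcast assumption leaves out: give each message an explicit sender and an optional recipient list, so directed speech between specific participants becomes representable. The field names here are my own illustration, not part of any real chat API.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    content: str
    recipients: list = field(default_factory=list)  # empty list = broadcast

# A slice of table chatter: two directed messages and one broadcast.
table = [
    Message("rogue", "Lovely evening for a patrol, isn't it?", ["guard"]),
    Message("wizard", "Does Shield stack with Mage Armor?", ["gm"]),
    Message("dave", "Pizza's here!"),  # everyone hears this
]

def audible_to(participant, messages):
    """Messages a given participant would receive."""
    return [m for m in messages
            if not m.recipients or participant in m.recipients]
```

With standard user/assistant roles, all three of these messages would collapse into one undifferentiated stream; with recipients, the guard hears the rogue and the pizza announcement, but not the rules question.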
Innovations in Digital Communication
Modern written messaging platforms have developed several features to enhance the written conversational experience, though major LLM APIs don't natively support these today:
- Typing Indicators: Showing when someone is composing a message. While we still can't receive thoughts as they unfold, we can at least see that something's brewing.
- Threading: Allowing parallel conversations through nested replies. This is a neat innovation native to written conversation - while interruption might be difficult, we can develop multiple conversation branches simultaneously.
- Reactions: Enabling lightweight acknowledgments and expressing sentiment without a fully verbalized response. These substitute for some important non-verbal features of spoken conversation.
While these innovations improve written communication, they still don't capture the unique dynamics of tabletop RPG conversations.
The Chaos of the Gaming Table
Picture this: you're sitting around the gaming table with your friends. The party's rogue is sweet-talking a guard while your wizard frantically flips through their spellbook, clarifying some rules with the game master. You're muttering to yourself, trying to remember if your barbarian knows what "diplomacy" means, while someone else updates everyone on the pizza order.
When it comes to building AI Agents that can engage in this sort of chat, current LLM Chat APIs face several key challenges:
1. Concurrent Speech and Interruptions
Sometimes we think of interruptions as a rude way to 'break' conversational protocol. But interruptions aren't just disruptions - they're a natural part of fluid conversation. We often build our spoken statements in ways that invite interjection and engagement. Turn-based protocols can't capture this dynamic flow: you can never decide whether to interrupt until it's too late, because you've already received the message in full.
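One possible middle ground between full turns and real-time streaming - and I should stress this is my own sketch, not Mind Mage's actual protocol - is to emit each utterance as a sequence of chunks, letting another participant claim the floor at any chunk boundary:

```python
# An utterance split into chunks; each boundary is a potential
# interruption point, without requiring token-level streaming.
utterance = [
    "So, about the ancient prophecy,",
    "which dates back three hundred years,",
    "the chosen one must...",
]

def deliver(chunks, interrupt_after=None):
    """Yield chunks until an interruption claims the floor."""
    delivered = []
    for i, chunk in enumerate(chunks):
        delivered.append(chunk)
        if interrupt_after is not None and i == interrupt_after:
            break  # another speaker jumps in here
    return delivered
```

The granularity of the chunks then becomes a design knob: sentence-level boundaries keep messages cheap to process, while clause-level boundaries feel closer to real speech.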
2. Diegetic Layers
This challenge is more specific to RPGs than to spoken conversation in general. Diegesis distinguishes events that happen inside the world we're telling a story about from events that are part of telling the story. In role-playing games, players seamlessly blend three distinct modes:
- Speaking as a character ("Greetings barkeep, I'll have an ale!")
- Speaking on behalf of a character, narrating their actions ("I pay the man 3 gold pieces")
- Speaking as a player ("What's the attack modifier for that axe?")
Players switch between these modes fluidly, often mid-sentence. This layered communication is central to the RPG experience yet completely absent from current conversational models, which makes it hard for LLMs to interpret the distinction between the layers.
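Because a single turn can mix modes mid-sentence, one way to represent this - again, a sketch under my own naming, not an established schema - is to model a turn as a list of mode-tagged spans rather than one flat string:

```python
from enum import Enum

class Mode(Enum):
    IN_CHARACTER = "in_character"  # speaking as the character
    NARRATION = "narration"        # narrating the character's actions
    OUT_OF_CHARACTER = "ooc"       # speaking as the player

# One player's turn, mixing all three diegetic layers.
turn = [
    (Mode.IN_CHARACTER, "Greetings barkeep, I'll have an ale!"),
    (Mode.NARRATION, "I slide three gold pieces across the bar"),
    (Mode.OUT_OF_CHARACTER, "wait, how much does ale cost here?"),
]

def in_world_text(spans):
    """Only the diegetic spans: what characters in the world perceive."""
    return [text for mode, text in spans if mode != Mode.OUT_OF_CHARACTER]
```

A representation like this would let an AI barkeep respond to the greeting and the gold, while routing the rules question to the game master instead of the fiction.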
3. Private Communication
The game master often needs to communicate privately with specific players, and this can play out in various ways:
- Open addressing: The game master speaks to a player while others can hear
- Visible but private: The other players can see that the storyteller is communicating with someone, but can't hear the content
- Completely private: Communication that other players aren't aware is happening at all
Current chat protocols don't have good ways to represent these nuanced levels of privacy and awareness. Roll20, an online RPG platform, implements a version of this kind of private messaging through what it calls 'whispers'.
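The three privacy levels above can be captured as message metadata. The sketch below is illustrative only (the enum and function names are my assumptions): the key idea is that what a message looks like depends on who is observing it.

```python
from dataclasses import dataclass
from enum import Enum

class Privacy(Enum):
    OPEN = "open"        # everyone hears the content
    VISIBLE = "visible"  # others see a whisper happened, not its content
    SECRET = "secret"    # others don't know it happened at all

@dataclass
class Whisper:
    sender: str
    recipient: str
    content: str
    privacy: Privacy

def render_for(observer, msg):
    """What a given observer perceives of a message."""
    if observer in (msg.sender, msg.recipient) or msg.privacy is Privacy.OPEN:
        return msg.content
    if msg.privacy is Privacy.VISIBLE:
        return f"[{msg.sender} whispers to {msg.recipient}]"
    return None  # SECRET: the observer sees nothing at all
```

Note that a per-observer rendering like this can't be expressed in a flat, shared message history - each participant effectively has their own view of the conversation, which is exactly what standard chat APIs lack.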
Looking Forward: A New Approach
One way to solve this might be to abandon structured schemas entirely - perhaps building a streaming protocol where AI agents process continuous speech in real-time, and learn these quirky conventions more innately, just like humans do. However, I think there are some benefits to text models I'm not yet ready to leave behind: they're cheaper and more efficient, easier to process, and far easier to collect training data for.
Instead, at Mind Mage we're working on something more innovative: a protocol for multi-agent written conversation that can simulate the natural flow and interruptions of real RPG conversations, interface with AI agents as well as human participants, and doesn't require streaming. We're aiming to preserve the benefits of text-based models while capturing the dynamic, multilayered nature of tabletop conversations.
Stay tuned for more updates as we continue this work. And if you're as excited by the intersection of language, gaming, and AI as we are, there's plenty more to come!