Imagine a voice agent that picks up a customer's call, listens in real time, looks up their order in your database, processes a refund, and sounds genuinely human while doing it -- all without a single human agent in the loop. That's exactly what we're going to build in this tutorial.
We'll use Gemini Live, Python, and a small tool-calling layer to ship a fully working real-time voice AI customer support agent. By the end, you'll have a runnable project, an architecture you understand, and a personality system you can customize without touching the core code.
Let's dive in.
Last updated: April 2026
What We're Building
A real-time voice agent with five capabilities:
- Listens to a customer through the microphone with sub-second latency
- Talks back in a natural-sounding voice using Gemini Live's audio output
- Calls tools (lookup order, issue refund, escalate to human) when the conversation needs them
- Maintains conversation memory across turns
- Loads its personality and behavior from a YAML profile file -- no code changes needed to retune
No frontend. No microservices. One Python process. ~400 lines of code total.
Architecture at a Glance
Here's how the data flows once everything is wired up:
```
Microphone ──► PCM audio ──► Gemini Live (streaming)
                                   │
                                   ├──► Audio response ──► Speakers
                                   │
                                   └──► Tool call ──► Dispatcher
                                               │
                                               ├──► CRM API
                                               ├──► Refund API
                                               └──► Escalation queue
```
Three things are happening concurrently:
- Audio in: the microphone streams 16 kHz PCM chunks to Gemini Live over a WebSocket session.
- Audio out: Gemini Live streams response audio back, which we play through the speakers in real time.
- Tool dispatch: when Gemini decides a tool call is needed, we run it on a background task so audio playback never stutters.
The trick is keeping all three loops async and non-blocking. Get that wrong and the agent sounds choppy or freezes mid-sentence.
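To see why this matters, here's a minimal, self-contained sketch (an illustration, not project code) of the pattern: a fast loop and a slow tool call sharing one event loop. As long as the slow task awaits instead of blocking, the fast loop keeps ticking.

```python
import asyncio

async def audio_loop():
    # Stand-in for the playback loop: must keep ticking to avoid stutter.
    ticks = 0
    for _ in range(5):
        await asyncio.sleep(0.05)
        ticks += 1
    return ticks

async def slow_tool():
    # A tool call that takes a while -- but it awaits, so it yields control.
    await asyncio.sleep(0.3)
    return "tool done"

async def main():
    # Both coroutines run concurrently on one event loop; neither starves
    # the other. Swap slow_tool's sleep for time.sleep() and the audio
    # loop would freeze for the full 300 ms.
    return await asyncio.gather(audio_loop(), slow_tool())

ticks, result = asyncio.run(main())
print(ticks, result)  # 5 tool done
```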
Prerequisites
You need:
- Python 3.11 or newer
- A Gemini API key (free tier works for development) -- get one at aistudio.google.com
- A working microphone and speakers
- 20 minutes
Install the dependencies:
```shell
pip install google-genai sounddevice numpy pyyaml httpx
```
google-genai is the official Gemini SDK. sounddevice handles microphone input and speaker output without any platform-specific headaches.
Project Layout
```
voice-agent/
├── main.py          # entry point, launches the session
├── audio.py         # microphone capture + speaker playback
├── tools.py         # tool definitions and dispatcher
├── profiles/
│   └── support.yaml # personality + system instructions
└── .env             # API keys
```
Small enough to hold in your head, big enough to extend.
Step 1: The Audio Loop
The microphone-to-speaker pipeline is the foundation. If this doesn't work cleanly, nothing else matters.
```python
# audio.py
import asyncio

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000  # Gemini Live expects 16 kHz input
CHUNK_SIZE = 1024

async def mic_stream(queue: asyncio.Queue):
    loop = asyncio.get_running_loop()

    def callback(indata, frames, time, status):
        # sounddevice fires this on its own thread, so we hop back
        # onto the event loop before touching the asyncio queue.
        loop.call_soon_threadsafe(
            queue.put_nowait, indata.copy().tobytes()
        )

    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="int16",
        blocksize=CHUNK_SIZE,
        callback=callback,
    ):
        while True:
            await asyncio.sleep(0.1)

async def play_audio(pcm_bytes: bytes):
    audio = np.frombuffer(pcm_bytes, dtype=np.int16)
    # Gemini Live returns 24 kHz audio; play it back at that rate.
    sd.play(audio, samplerate=24000)
    # Wait for playback on a worker thread so the event loop stays free.
    await asyncio.to_thread(sd.wait)
```
Two things to notice. First, the input runs at 16 kHz (what Gemini Live expects) and the output runs at 24 kHz (what Gemini Live returns). Don't try to "simplify" this -- you'll get robot voice. Second, we use a loop.call_soon_threadsafe bridge because sounddevice callbacks fire on a separate thread.
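To make the "robot voice" failure concrete, here's a toy calculation (an illustration, not project code): playing audio generated at 24 kHz back at the 16 kHz mic rate stretches a one-second clip to 1.5 seconds, slowing and pitch-shifting the voice.

```python
SAMPLES = 24_000  # one second of Gemini Live output audio at 24 kHz

correct_duration = SAMPLES / 24_000  # played at the rate it was generated
wrong_duration = SAMPLES / 16_000    # played at the mic rate by mistake

print(correct_duration, wrong_duration)  # 1.0 1.5
```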
Step 2: Define the Tools
Tools are how the agent actually does things instead of just talking about them. We'll define three.
```python
# tools.py
import httpx

TOOL_DECLARATIONS = [
    {
        "name": "lookup_order",
        "description": "Look up a customer order by order ID. Returns status, items, and total.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The customer's order ID"}
            },
            "required": ["order_id"],
        },
    },
    {
        "name": "issue_refund",
        "description": "Issue a full or partial refund for an order. Use only after confirming with the customer.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number", "description": "Refund amount in USD"},
                "reason": {"type": "string"},
            },
            "required": ["order_id", "amount", "reason"],
        },
    },
    {
        "name": "escalate_to_human",
        "description": "Transfer the conversation to a human agent. Use when the customer is frustrated or the issue is out of scope.",
        "parameters": {
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
    },
]

async def dispatch(name: str, args: dict) -> dict:
    if name == "lookup_order":
        async with httpx.AsyncClient() as client:
            r = await client.get(f"https://api.example.com/orders/{args['order_id']}")
            return r.json()
    if name == "issue_refund":
        # Real implementation: call Stripe / your payment processor
        return {"status": "refunded", "amount": args["amount"]}
    if name == "escalate_to_human":
        # Push to your escalation queue
        return {"status": "escalated", "ticket_id": "T-1042"}
    return {"error": f"Unknown tool: {name}"}
```
Two design notes that matter at scale:
- Each tool description is the prompt. The model reads these strings to decide when to call them. Vague descriptions = wrong tool calls. Be specific about when to use each tool, not just what it does.
- issue_refund says "use only after confirming." That's a guardrail baked into the description. The model will respect it surprisingly well -- but in production, you should also enforce it programmatically.
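Here's one way that programmatic enforcement might look -- a hypothetical wrapper around the dispatcher. The cap value and the stub dispatcher below are assumptions for illustration, not part of the tutorial code:

```python
import asyncio

MAX_REFUND = 200  # mirrors max_refund_without_escalation in the profile

async def dispatch(name: str, args: dict) -> dict:
    # Stand-in for the real dispatcher in tools.py.
    return {"status": "refunded", "amount": args["amount"]}

async def guarded_dispatch(name: str, args: dict) -> dict:
    # Enforce the refund cap in code, not just in the tool description.
    if name == "issue_refund" and args.get("amount", 0) > MAX_REFUND:
        # Return an error the model can read, so it escalates gracefully.
        return {"error": f"Refunds over ${MAX_REFUND} require escalate_to_human."}
    return await dispatch(name, args)

result = asyncio.run(
    guarded_dispatch("issue_refund", {"order_id": "A-9921", "amount": 500, "reason": "lost"})
)
print(result)  # the guardrail fires instead of the refund
```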
Step 3: The Personality Profile
This is where most tutorials hardcode the system prompt and call it a day. Don't. Externalize it.
```yaml
# profiles/support.yaml
name: "Aria"
voice: "Aoede"

system_instructions: |
  You are Aria, a friendly and efficient customer support agent for ChatsbyMart.

  Your goals, in order:
  1. Make the customer feel heard before solving their problem.
  2. Solve the problem in as few turns as possible.
  3. Never invent order details, prices, or policies. Use tools.
  4. If a customer is angry, acknowledge the frustration, then offer concrete action.
  5. Confirm refunds before issuing them.

  Tone: warm but concise. Avoid filler phrases like "I completely understand."
  Speak in short sentences. This is a voice call, not an email.

guardrails:
  max_refund_without_escalation: 200
  topics_to_escalate:
    - legal
    - billing dispute over $500
    - account security
```
Now anyone on your team -- product, support lead, marketing -- can tune the agent's behavior by editing YAML. No Python knowledge required. This single decision is the difference between an agent your team can iterate on and one that ossifies after launch.
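Since the profile is now a runtime input, it's worth failing fast on a malformed file. A minimal validation sketch -- load_profile and the required-key set are assumptions for illustration, not part of the tutorial code:

```python
import yaml

REQUIRED_KEYS = {"name", "voice", "system_instructions"}

def load_profile(text: str) -> dict:
    # Catch a broken profile at startup instead of at connect time.
    profile = yaml.safe_load(text)
    missing = REQUIRED_KEYS - set(profile)
    if missing:
        raise ValueError(f"profile missing keys: {sorted(missing)}")
    return profile

profile = load_profile("name: Aria\nvoice: Aoede\nsystem_instructions: Be brief.")
print(profile["voice"])  # Aoede
```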
Step 4: Wire It All Together
Here's the main entry point. This is where the audio loop, the Gemini Live session, and the tool dispatcher meet.
```python
# main.py
import asyncio
import os

import yaml
from google import genai
from google.genai import types

from audio import mic_stream, play_audio
from tools import TOOL_DECLARATIONS, dispatch

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

async def run_agent(profile_path: str):
    with open(profile_path) as f:
        profile = yaml.safe_load(f)

    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name=profile["voice"]
                )
            )
        ),
        system_instruction=profile["system_instructions"],
        tools=[{"function_declarations": TOOL_DECLARATIONS}],
    )

    async with client.aio.live.connect(model=MODEL, config=config) as session:
        mic_queue = asyncio.Queue()

        async def send_mic():
            async for chunk in iter_queue(mic_queue):
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def receive_responses():
            async for response in session.receive():
                if response.data:
                    await play_audio(response.data)
                if response.tool_call:
                    # Run the tool in the background so playback never stalls.
                    asyncio.create_task(handle_tool_call(session, response.tool_call))

        await asyncio.gather(
            mic_stream(mic_queue),
            send_mic(),
            receive_responses(),
        )

async def handle_tool_call(session, tool_call):
    responses = []
    for fc in tool_call.function_calls:
        result = await dispatch(fc.name, dict(fc.args))
        responses.append(types.FunctionResponse(
            id=fc.id, name=fc.name, response={"result": result}
        ))
    await session.send_tool_response(function_responses=responses)

async def iter_queue(q):
    while True:
        yield await q.get()

if __name__ == "__main__":
    asyncio.run(run_agent("profiles/support.yaml"))
```
Look at receive_responses. When a tool call comes in, we don't await it inline -- we spawn it as a background task. This is the single most important pattern in the whole project. If you await the tool call, audio playback freezes while the API call is running and the agent sounds broken.
Step 5: Run It
```shell
export GEMINI_API_KEY=your_key_here
python main.py
```
Speak into your microphone:
"Hey, I never got my order. The number is A-9921."
Aria will respond in real time. Behind the scenes, she'll call lookup_order, get the result, and explain what happened in natural language. Try asking for a refund. Try being unreasonable. Try asking about something completely off-topic and watch her route you to a human.
Built-in Voices and Models
| Option | Value | Notes |
|---|---|---|
| Model | gemini-2.5-flash-preview-native-audio-dialog | Native audio, lowest latency |
| Voices | Aoede, Charon, Fenrir, Kore, Puck | Pick to match your brand |
| Sample rate (in) | 16000 Hz | Required by Gemini Live |
| Sample rate (out) | 24000 Hz | What Gemini Live returns |
| Max session length | 15 min | Reconnect for longer calls |
Customization: Beyond the Tutorial
The real power of this architecture is how easy it is to extend. A few ideas:
Swap profiles per use case. Make a profiles/sales.yaml, a profiles/onboarding.yaml, a profiles/triage.yaml. Same code, three completely different agents.
Add memory across calls. Drop the conversation transcript into a vector store at session end. On the next call, retrieve the customer's history and inject it into the system prompt. You now have an agent that remembers customers.
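A minimal sketch of what that injection could look like, assuming a simple in-process store keyed by caller ID (a real system would persist to a database or vector store; save_transcript and build_system_prompt are hypothetical names):

```python
from collections import defaultdict

call_history: dict[str, list[str]] = defaultdict(list)

def save_transcript(caller_id: str, transcript: str) -> None:
    # At session end, stash the transcript under the caller's ID.
    call_history[caller_id].append(transcript)

def build_system_prompt(base_instructions: str, caller_id: str) -> str:
    # On the next call, prepend recent history to the system prompt.
    history = call_history[caller_id]
    if not history:
        return base_instructions
    recent = "\n".join(history[-3:])  # the last few calls is plenty of context
    return f"{base_instructions}\n\nPrevious calls with this customer:\n{recent}"

save_transcript("+15550100", "Asked about order A-9921; it shipped late.")
prompt = build_system_prompt("You are Aria.", "+15550100")
print("A-9921" in prompt)  # True
```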
Connect real CRM tools. Replace the example HTTP calls in tools.py with calls to Salesforce, HubSpot, Stripe, or your internal API. Most of the work is writing tool descriptions clearly enough that the model picks the right one.
Add a video feed. Gemini Live supports continuous video input. Stream a webcam frame every 500ms and the agent can see the customer's screen, product, or face.
Thoughtful tool design is much of what separates a great agent from a mediocre one. For the adjacent pieces: how Chatsby optimizes RAG goes deeper on the retrieval side, and generative AI vs rule-based chatbots covers why script-based bots fall apart on conversations like the ones above.
Common Pitfalls (And How to Avoid Them)
Audio is choppy. You're probably blocking the event loop. Make sure tool calls run in asyncio.create_task() and audio playback is non-blocking.
The agent makes things up. Your tool descriptions are vague, or you're not actually returning the tool result back to the session. Print the dispatcher output and verify the model is seeing real data.
The voice sounds robotic. Sample rate mismatch. Verify input is 16 kHz and playback is 24 kHz.
Latency is high. You're using a non-native-audio model. Switch to gemini-2.5-flash-preview-native-audio-dialog -- it skips the text intermediate and is dramatically faster.
Sessions disconnect after 15 minutes. That's the current Gemini Live session limit. Implement a reconnect handler that restores conversation history from a transcript.
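One possible shape for that handler, sketched with a fake session so it's runnable (run_with_reconnect and fake_session are illustrations, not the real Gemini Live API):

```python
import asyncio

async def run_with_reconnect(run_session, transcript: list[str], max_retries: int = 3):
    # Restart the session when it drops, re-seeding it with the transcript.
    for attempt in range(max_retries):
        try:
            return await run_session(history=transcript)
        except ConnectionError:
            await asyncio.sleep(0)  # back off before reconnecting (shortened here)
    raise RuntimeError("gave up reconnecting")

# Fake session that drops once, then succeeds -- stands in for the real loop.
calls = {"n": 0}
async def fake_session(history):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("15-minute limit hit")
    return "done"

result = asyncio.run(run_with_reconnect(fake_session, []))
print(result)  # done
```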
For a broader look at the traps teams fall into when shipping AI agents to production, top mistakes businesses make when adding AI chatbots is worth the read.
Why This Matters
A few years ago, building something like this required a speech-to-text service, a language model, a text-to-speech service, an orchestration layer, and a small team to wire them together. The end result was usually slow, brittle, and expensive.
Native audio models like Gemini Live collapse that entire stack into a single API call. The latency is low enough to feel like a real conversation, the voice quality is good enough to pass for human in many contexts, and the tool calling works well enough to actually take action on a customer's behalf.
This is the moment voice AI agents stopped being a research demo and started being something a small team can ship in a weekend. According to Gartner, agentic AI is one of the top strategic technology trends of the decade -- and voice is going to be the interface most customers actually experience.
If you want to understand the bigger picture of where this is heading, what is agentic AI covers how autonomous agents are reshaping customer-facing work.
Skip the Code: Get a Production Voice Agent in Minutes
If you want all of this -- voice, memory, tools, personality profiles, escalation, analytics -- without writing or maintaining a single line of Python, Chatsby gives you a production-ready AI agent that connects to your knowledge base and tools out of the box. Build the prototype above to understand how it works under the hood, then deploy Chatsby when you're ready to ship something your team can actually maintain.
