This product is a Beta product. This means that both the product and its documentation are subject to change, updates, or removal.
If you test this product, please let us know what you think, so that we can make it the best possible product for you. Please share your feedback with us here.
The Realtime Voice API provides a websocket for streaming and processing bidirectional audio during phone calls. Use this API only if you need programmatic access to the audio data itself, for example to connect your virtual phone number to an AI agent.
Using the Realtime API requires both a regular virtual phone number and a websocket number. Your application must also accept websocket connections. We have a detailed setup guide if you need help getting this set up.
Each time your websocket number receives a call, the API will establish a new
websocket connection to your application at the websocket_url specified on
the number.
API -- wss://example.com/incoming-call --> Application
Once the websocket connection is established, the API and your application use
JSON messages to start the session, negotiate audio formats, stream audio, and
gracefully end the call. Each message has a field t that denotes its type.
The session ALWAYS starts with the API sending a hello message containing
some metadata about the call…
// API -> Application
{
  "t": "hello",
  "callid": "c13d1e772...",
  "from": "+46701234567",
  "to": "+46766861234"
}
…and ALWAYS ends with the API sending a bye message. This contains the
reason why the call ended, if any errors occurred, as well as a human-readable
explanation as to what happened.
// API -> Application
{
  "t": "bye",
  "reason": "hangup",
  "message": "the caller hung up"
}
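A minimal sketch of dispatching on the t field, using the hello and bye message shapes shown above (the summary strings are illustrative, not part of the API):

```python
import json

def describe(raw: str) -> str:
    """Return a human-readable summary of an API message based on its "t" field."""
    msg = json.loads(raw)
    if msg["t"] == "hello":
        return f"call {msg['callid']}: {msg['from']} -> {msg['to']}"
    if msg["t"] == "bye":
        return f"call ended ({msg['reason']}): {msg['message']}"
    return f"unhandled message type: {msg['t']}"
```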
After the initial hello your application can start sending and receiving
audio. To send audio, you must first send a sending message that specifies
which audio format you'll be sending in. Likewise, to receive audio you must
first send a listening message that specifies which audio format you want to
receive in. You can choose to just send audio, just receive audio, or both.
// Application -> API
{
  "t": "sending",
  "format": "pcm_24000"
}
// Application -> API
{
  "t": "listening",
  "format": "pcm_24000"
}
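Since you can send audio, receive audio, or both, a small helper can build only the setup messages you need. A sketch (the helper name and its optional-argument shape are conveniences of this example, not part of the API; pcm_24000 is the format used in the messages above):

```python
import json

def setup_messages(send_format=None, listen_format=None):
    """Build the "sending" and/or "listening" setup messages.

    Pass only the directions you intend to use.
    """
    msgs = []
    if send_format is not None:
        msgs.append(json.dumps({"t": "sending", "format": send_format}))
    if listen_format is not None:
        msgs.append(json.dumps({"t": "listening", "format": listen_format}))
    return msgs
```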
Audio is sent in both directions via audio messages. These contain Base64
encoded audio data in the format specified in the previous sending and
listening messages.
// API <-> Application
{
  "t": "audio",
  "data": "<base64 encoded audio data>"
}
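Packing raw audio bytes into an audio message, and unpacking one, is a Base64 round trip. A minimal sketch (the helper names are this example's own):

```python
import base64
import json

def pack_audio(chunk: bytes) -> str:
    """Wrap raw audio bytes in an "audio" message."""
    return json.dumps({"t": "audio", "data": base64.b64encode(chunk).decode()})

def unpack_audio(raw: str) -> bytes:
    """Extract the raw audio bytes from an "audio" message."""
    msg = json.loads(raw)
    assert msg["t"] == "audio"
    return base64.b64decode(msg["data"])
```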
If you want to end the call before the caller hangs up, you can do so by
sending a bye message to the API.
// Application -> API
{
"t": "bye",
}
The API will hang up the call once all buffered audio data has been played,
then send a bye message back to your application. Your application must
ALWAYS wait for the final bye from the API before disconnecting; failure to
do so will cause the call to disconnect before all buffered audio has been
played.
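One way to honour this rule is to keep reading messages after sending your bye until the API's final bye arrives. A sketch, written against any async iterable of raw messages (such as the websocket connection itself), assuming your application can discard any audio that arrives in the meantime:

```python
import json

async def wait_for_bye(messages):
    """Read messages until the API's final "bye", discarding anything else.

    `messages` can be the websocket connection itself, since it is an
    async iterable of raw messages.
    """
    async for raw in messages:
        msg = json.loads(raw)
        if msg["t"] == "bye":
            return msg
    return None
```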
Putting it all together, the control flow of a typical bidirectional audio session will look something like this:
API Your application
| |
| *Websocket connection established* |
| |
| *API says hello* |
| ------------ hello --------------> |
| |
| *Audio setup* |
| <--------- listening ------------- |
| <---------- sending -------------- |
| |
| *Audio streaming* |
| <----------- audio --------------- |
| ------------ audio --------------> |
| ------------ audio --------------> |
| <----------- audio --------------- |
| <------------ ... ---------------> |
| |
| *Application is done* |
| <------------ bye ---------------- |
| |
| ------------ audio --------------> |
| ------------ audio --------------> |
| ------------- ... ---------------> |
| |
| *Call ends* |
| <------------ bye ---------------- |
| |
| *Websocket is closed* |
v v
All examples use the websockets library in Python, and assume that a websocket
connection between the API and your application has already been established.
See our integration guide if you need help setting that up.
This code example implements an “echo server” that just plays the caller’s audio back to them.
import base64
import json

async def echo_server(ws):
    # Get the call metadata from the hello message
    hello = json.loads(await ws.recv())
    print(f"Received {hello['to']} <- {hello['from']} ({hello['callid']})")

    # Tell the API the format we want to receive audio in
    await ws.send(json.dumps({
        "t": "listening",
        "format": "ulaw"
    }))

    # Tell the API the format we'll be sending audio in
    await ws.send(json.dumps({
        "t": "sending",
        "format": "ulaw"
    }))

    # Loop over and play back each audio message until the call ends
    async for raw in ws:
        msg = json.loads(raw)
        if msg["t"] == "audio":
            audio = base64.b64decode(msg["data"])
            # Echo the audio back to the caller
            await ws.send(json.dumps({
                "t": "audio",
                "data": base64.b64encode(audio).decode()
            }))
        elif msg["t"] == "bye":
            print("Call ended:", msg["message"])
            break
This example plays a WAV file from disk and then hangs up.
async def play_file(ws):
    # Get the call metadata from the hello message
    hello = json.loads(await ws.recv())
    print(f"Received {hello['to']} <- {hello['from']} ({hello['callid']})")

    # Tell the API the format we'll be sending audio in
    await ws.send(json.dumps({
        "t": "sending",
        "format": "wav"
    }))

    # Send the WAV file bytes in chunks
    with open("audio.wav", "rb") as f:
        for chunk in iter(lambda: f.read(32 * 1024), b""):
            await ws.send(json.dumps({
                "t": "audio",
                "data": base64.b64encode(chunk).decode()
            }))

    # Hang up
    await ws.send(json.dumps({
        "t": "bye"
    }))

    # Wait for the call to end
    msg = json.loads(await ws.recv())
    print("Call ended:", msg["message"])
This example records the caller’s audio to a WAV file on disk.
import wave
# ...

async def record_call(ws):
    # Get the phone call metadata from the hello message
    hello = json.loads(await ws.recv())
    print(f"Call started: {hello['to']} <- {hello['from']} ({hello['callid']})")

    # Tell the API the format we want to receive audio in
    await ws.send(json.dumps({
        "t": "listening",
        "format": "pcm_16000"
    }))

    # Open the wav file for writing
    with wave.open(f"{hello['callid']}.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(16000)

        # Loop over each audio message until the call ends
        async for raw in ws:
            msg = json.loads(raw)
            if msg["t"] == "audio":
                audio = base64.b64decode(msg["data"])
                wf.writeframes(audio)
            elif msg["t"] == "bye":
                print("Call ended:", msg["reason"])
                break