
Media & Multilingual TTS

SIP.IO speaks 16 languages out of the box, with gender-correct voices and grammatically correct number agreement in dynamic announcements. This page covers media storage, the prompt catalog, and how TTS is generated.

Custom audio (uploaded prompts, music on hold, greetings) lives in object storage and is described by a media_file row:

{
  "id": "media_01j22nfknpx47prkr1dpm4yxtk",
  "account_id": "acc_01jf18ah3jeb5w6dfp27sgjsbt",
  "name": "After-hours greeting",
  "kind": "greeting",
  "object_key": "1/greetings/after-hours.wav",
  "format": "wav",
  "duration_ms": 8200
}
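| Field | Purpose |
| --- | --- |
| kind | One of prompt, moh (music on hold), greeting, or announcement. |
| object_key | Object key in the MEDIA bucket. |
| format / duration_ms | Metadata: the audio format and the duration in milliseconds. |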
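As a rough illustration of the row shape above, the following Python sketch models a media_file record and rejects an unknown kind. The class name MediaFile, the helper parse_media_file, and the validation rules are hypothetical; the source only lists the fields and the four kind values.

```python
from dataclasses import dataclass

# The four kinds listed in the field table above.
MEDIA_KINDS = {"prompt", "moh", "greeting", "announcement"}


@dataclass(frozen=True)
class MediaFile:
    """Hypothetical in-memory model of a media_file row."""
    id: str
    account_id: str
    name: str
    kind: str
    object_key: str
    format: str
    duration_ms: int

    def __post_init__(self) -> None:
        # Assumed validation; the source does not say how rows are checked.
        if self.kind not in MEDIA_KINDS:
            raise ValueError(f"unknown media kind: {self.kind!r}")
        if self.duration_ms < 0:
            raise ValueError("duration_ms must be non-negative")


def parse_media_file(row: dict) -> MediaFile:
    """Build a MediaFile from a decoded JSON object with exactly these fields."""
    return MediaFile(**row)


example_row = {
    "id": "media_01j22nfknpx47prkr1dpm4yxtk",
    "account_id": "acc_01jf18ah3jeb5w6dfp27sgjsbt",
    "name": "After-hours greeting",
    "kind": "greeting",
    "object_key": "1/greetings/after-hours.wav",
    "format": "wav",
    "duration_ms": 8200,
}
greeting = parse_media_file(example_row)
```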

The media engine at the edge fetches media through the edge runtime’s /media proxy, so the bucket stays private: recordings and voicemail are never world-readable. The edge fetches each object lazily, on first use, and caches it. You can also group media into ordered media_playlist sets.

The platform ships a pre-generated catalog of system prompts in every supported language and both genders. Prompts are stored in object storage at a content-addressed key:

system/<lang>/<gender>/<id>.wav
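| Code | Language |
| --- | --- |
| en-US | English |
| es-ES | Spanish |
| fr-FR | French |
| de-DE | German |
| it-IT | Italian |
| pt-BR | Portuguese |
| nl-NL | Dutch |
| pl-PL | Polish |
| ru-RU | Russian |
| tr-TR | Turkish |
| ar-XA | Arabic |
| he-IL | Hebrew |
| hi-IN | Hindi |
| ja-JP | Japanese |
| ko-KR | Korean |
| cmn-CN | Mandarin |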

The catalog covers the prompts the platform itself needs:

welcome · please-hold · confirm-accept · whisper-queue · all-busy · no-agents · vm-greeting · goodbye · invalid · transferring
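To make the size of the catalog concrete, here is a minimal Python sketch that enumerates every object key following the system/<lang>/<gender>/<id>.wav layout. It assumes that <id> is the prompt name exactly as listed above (for example welcome); the source does not confirm that, and the function names are illustrative.

```python
from itertools import product

# Prompt ids, language codes, and genders as listed on this page.
PROMPT_IDS = [
    "welcome", "please-hold", "confirm-accept", "whisper-queue", "all-busy",
    "no-agents", "vm-greeting", "goodbye", "invalid", "transferring",
]
LANGS = [
    "en-US", "es-ES", "fr-FR", "de-DE", "it-IT", "pt-BR", "nl-NL", "pl-PL",
    "ru-RU", "tr-TR", "ar-XA", "he-IL", "hi-IN", "ja-JP", "ko-KR", "cmn-CN",
]
GENDERS = ["male", "female"]


def system_prompt_key(lang: str, gender: str, prompt_id: str) -> str:
    """Build the object key for one system prompt (assumes <id> is the prompt name)."""
    return f"system/{lang}/{gender}/{prompt_id}.wav"


def catalog_keys() -> list:
    """Return one key per prompt x language x gender combination."""
    return [
        system_prompt_key(lang, gender, prompt_id)
        for prompt_id, lang, gender in product(PROMPT_IDS, LANGS, GENDERS)
    ]
```

With 10 prompts, 16 languages, and 2 genders, this layout yields 320 objects.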

Each language has one locked male voice and one locked female voice, both from the neural speech engine and preferring its premium neural tiers. For example, en-US uses en-US-neural-male (male) and en-US-neural-female (female); he-IL uses he-IL-neural-male (male) and he-IL-neural-female (female). An account picks its narrator gender via voice_gender (default male), and the matching voice is used everywhere.
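The voice lookup described above can be sketched as a small table plus a default. Only the two languages named in the text are included, because the voice names for the other languages are not given here; the VOICES table and pick_voice function are hypothetical names.

```python
from typing import Optional

# Voice names taken from the examples above; other languages are omitted
# because this page does not list their voice names.
VOICES = {
    "en-US": {"male": "en-US-neural-male", "female": "en-US-neural-female"},
    "he-IL": {"male": "he-IL-neural-male", "female": "he-IL-neural-female"},
}

DEFAULT_GENDER = "male"  # voice_gender defaults to male


def pick_voice(lang: str, voice_gender: Optional[str] = None) -> str:
    """Return the locked voice for a language and the account's voice_gender."""
    gender = voice_gender or DEFAULT_GENDER
    try:
        return VOICES[lang][gender]
    except KeyError:
        raise ValueError(f"no voice for lang={lang!r}, gender={gender!r}") from None
```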

System prompts are synthesized by an admin endpoint:

curl -X POST 'https://nodes.sip.io/admin/tts/generate' \
  -H "x-admin-token: $ADMIN_TOKEN"
# double quotes let the shell expand $ADMIN_TOKEN; single quotes would send it literally
# optional query parameters: ?lang=he-IL, ?gender=female, ?force=1

It walks every prompt × language × gender, calls the neural speech engine, and writes the result to object storage. The job is content-addressed and idempotent: it HEADs each object and skips what already exists, so a re-run only synthesizes new or changed text (use ?force=1 to rebuild).
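The skip-if-present behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's job: InMemoryStore stands in for object storage, synthesize stands in for the neural speech engine, and the existence check is by key only. Detecting changed text would need the key or object metadata to encode the text, which this page does not detail.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class InMemoryStore:
    """Hypothetical stand-in for object storage with HEAD and PUT."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def head(self, key: str) -> bool:
        # True if the object already exists, like a successful HEAD request.
        return key in self.objects

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def generate(
    jobs: Iterable[Tuple[str, str]],
    store: InMemoryStore,
    synthesize: Callable[[str], bytes],
    force: bool = False,
) -> List[str]:
    """Synthesize each (key, text) job unless its object already exists.

    Returns the keys that were actually written. With force=True every
    job is rebuilt, mirroring the ?force=1 flag.
    """
    written = []
    for key, text in jobs:
        if not force and store.head(key):
            continue  # already present: skip, which makes re-runs idempotent
        store.put(key, synthesize(text))
        written.append(key)
    return written
```

Running the same job list twice writes nothing on the second pass, which is what makes the endpoint safe to call repeatedly.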

Position announcements need to say numbers (“you are caller 3”, “about 5 minutes”). These are generated on demand from per-language templates with placeholders (position, estimated wait) and cached, content-addressed, at:

tts/<lang>/<voice>/<hash>.wav

Each unique line is synthesized once and cached forever, so the second caller who is “number 3” reuses the first one’s audio.
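A minimal sketch of this cache, assuming the hash is a SHA-256 of the rendered announcement text (the page does not name the hash function or say exactly what is hashed). AnnouncementCache, the dictionary used as storage, and the template string are all hypothetical.

```python
import hashlib
from typing import Callable, Dict


def announcement_key(lang: str, voice: str, text: str) -> str:
    """Content-addressed key: the same rendered text always maps to the same object."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # assumed hash choice
    return f"tts/{lang}/{voice}/{digest}.wav"


class AnnouncementCache:
    """Synthesizes each unique rendered line once and reuses it afterwards."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize          # stand-in for the speech engine
        self._objects: Dict[str, bytes] = {}   # stand-in for object storage
        self.synth_calls = 0

    def get(self, lang: str, voice: str, template: str, **values) -> str:
        # Fill the placeholders first, then hash the final text.
        text = template.format(**values)
        key = announcement_key(lang, voice, text)
        if key not in self._objects:
            self._objects[key] = self._synthesize(voice, text)
            self.synth_calls += 1
        return key


cache = AnnouncementCache(lambda voice, text: text.encode("utf-8"))
first = cache.get("en-US", "en-US-neural-male", "You are caller {position}.", position=3)
second = cache.get("en-US", "en-US-neural-male", "You are caller {position}.", position=3)
```

Here first and second are the same key and the synthesizer runs only once, because the key depends only on language, voice, and rendered text.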

Many languages inflect cardinal numbers by the grammatical gender of the noun they modify, so passing a bare digit to the speech engine can produce the wrong form. SIP.IO ships per-language number-word tables so that announcements agree correctly in languages such as Hebrew, Arabic, Spanish, French, Italian, Portuguese, and Russian.

Hebrew is a good worked example, because the two numbers in the same announcement take different genders:

  • Queue position agrees masculine with מספר (“number”), e.g. 2 → שניים.
  • Estimated wait minutes agrees feminine with דקות (“minutes”), e.g. 2 → שתי (construct form, not שתיים).

The tables cover 1–99; above that the engine falls back to the digit. This is the kind of detail that separates “technically multilingual” from “actually sounds right”: the announcement engine tracks which noun a number modifies, per language, not just the number.
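The agreement rule can be sketched as a lookup keyed by language and by the role the number plays in the sentence. The table below is deliberately truncated to a few Hebrew entries for illustration (the real tables cover 1–99), and the names NUMBER_WORDS, number_word, and the role labels position and minutes are hypothetical.

```python
# Truncated illustrative table: the role decides which gendered form is used.
# "position" agrees with the masculine noun for "number";
# "minutes" agrees with the feminine noun for "minutes".
NUMBER_WORDS = {
    "he-IL": {
        "position": {2: "שניים", 3: "שלושה"},
        "minutes": {2: "שתי", 3: "שלוש"},
    },
}

TABLE_MAX = 99  # tables cover 1-99; larger numbers fall back to digits


def number_word(lang: str, role: str, n: int) -> str:
    """Return the agreeing number word, or the bare digits when no entry exists."""
    if n < 1 or n > TABLE_MAX:
        return str(n)
    table = NUMBER_WORDS.get(lang, {}).get(role, {})
    # Missing entries also fall back to digits in this truncated sketch.
    return table.get(n, str(n))
```

The same value 2 therefore renders differently depending on whether it is a queue position or a wait time, which is why the lookup takes the role and not just the number.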