Media & Multilingual TTS
SIP.IO speaks 16 languages out of the box, with gender-correct voices and grammatically correct number agreement for dynamic announcements. This page covers media storage, the prompt catalog, and how TTS is generated.
Media storage
Custom audio (uploaded prompts, music on hold, greetings) lives in object storage and is described by a media_file row:
{
  "id": "media_01j22nfknpx47prkr1dpm4yxtk",
  "account_id": "acc_01jf18ah3jeb5w6dfp27sgjsbt",
  "name": "After-hours greeting",
  "kind": "greeting",
  "object_key": "1/greetings/after-hours.wav",
  "format": "wav",
  "duration_ms": 8200
}

| Field | Purpose |
|---|---|
| kind | prompt / moh / greeting / announcement. |
| object_key | Object key in the MEDIA bucket. |
| format / duration_ms | Metadata: audio format (e.g. wav) and clip duration in milliseconds. |
Media is served to the media engine edge through the edge runtime’s /media proxy, so the bucket stays private: recordings and voicemail are never world-readable. The edge lazily fetches and caches each object. You can also group media into ordered media_playlist sets.
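The lazy fetch-and-cache behavior can be pictured with a small sketch. This is not the edge runtime's code: the `EdgeMediaCache` class, its `fetch` callback, and the in-memory dictionary are hypothetical stand-ins for the /media proxy and the edge's cache.

```python
from typing import Callable, Dict


class EdgeMediaCache:
    """Fetch each object at most once, on first use, then serve it from memory."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch               # hypothetical call through the /media proxy
        self._cache: Dict[str, bytes] = {}
        self.fetch_count = 0              # exposed only to make the behavior visible

    def get(self, object_key: str) -> bytes:
        if object_key not in self._cache:
            # First request for this key: go to the proxy and remember the result.
            self.fetch_count += 1
            self._cache[object_key] = self._fetch(object_key)
        return self._cache[object_key]


# Stand-in for the private bucket behind the proxy (assumed content).
fake_bucket = {"1/greetings/after-hours.wav": b"RIFF....WAVE"}
cache = EdgeMediaCache(lambda key: fake_bucket[key])

first = cache.get("1/greetings/after-hours.wav")
second = cache.get("1/greetings/after-hours.wav")  # served from the cache
```

Only the first `get` reaches the bucket; nothing is fetched until a call actually needs the object.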
The system-prompt catalog
The platform ships a pre-generated catalog of system prompts in every supported language and both genders. Prompts are stored in object storage at a content-addressed key:
system/<lang>/<gender>/<id>.wav
Supported languages
| en-US English | es-ES Spanish | fr-FR French | de-DE German |
| it-IT Italian | pt-BR Portuguese | nl-NL Dutch | pl-PL Polish |
| ru-RU Russian | tr-TR Turkish | ar-XA Arabic | he-IL Hebrew |
| hi-IN Hindi | ja-JP Japanese | ko-KR Korean | cmn-CN Mandarin |
System prompt ids
The catalog covers the prompts the platform itself needs:
welcome · please-hold · confirm-accept · whisper-queue · all-busy · no-agents · vm-greeting · goodbye · invalid · transferring
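To make the size of the catalog concrete, here is a minimal sketch that enumerates every key of the form system/<lang>/<gender>/<id>.wav from the lists on this page. The helper name `system_prompt_key` is hypothetical, and using `male` / `female` as the path segment is an assumption based on the `voice_gender` values.

```python
from itertools import product

LANGUAGES = [
    "en-US", "es-ES", "fr-FR", "de-DE",
    "it-IT", "pt-BR", "nl-NL", "pl-PL",
    "ru-RU", "tr-TR", "ar-XA", "he-IL",
    "hi-IN", "ja-JP", "ko-KR", "cmn-CN",
]
GENDERS = ["male", "female"]  # assumed path values
PROMPT_IDS = [
    "welcome", "please-hold", "confirm-accept", "whisper-queue", "all-busy",
    "no-agents", "vm-greeting", "goodbye", "invalid", "transferring",
]


def system_prompt_key(lang: str, gender: str, prompt_id: str) -> str:
    """Build the object key for one system prompt (hypothetical helper)."""
    return f"system/{lang}/{gender}/{prompt_id}.wav"


# One key per language x gender x prompt id.
ALL_KEYS = [
    system_prompt_key(lang, gender, prompt_id)
    for lang, gender, prompt_id in product(LANGUAGES, GENDERS, PROMPT_IDS)
]
```

With 16 languages, 2 genders, and 10 prompt ids, the list holds 320 keys.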
Voices & gender
Each language has one locked male voice and one locked female voice, both from the neural speech engine (premium neural tiers are preferred). For example, en-US uses en-US-neural-male (male) and en-US-neural-female (female); he-IL uses he-IL-neural-male (male) and he-IL-neural-female (female). An account picks its narrator gender via voice_gender (default male); the matching voice is used everywhere.
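Voice selection can be sketched as a lookup keyed on language and `voice_gender`. The `voice_for` function is hypothetical, and the `<lang>-neural-<gender>` naming pattern is an assumption generalized from the two examples above; the real mapping may be an explicit per-language table.

```python
def voice_for(lang: str, voice_gender: str = "male") -> str:
    """Return the locked voice name for a language and narrator gender.

    Assumes every voice follows the <lang>-neural-<gender> pattern seen in
    the en-US and he-IL examples.
    """
    if voice_gender not in ("male", "female"):
        raise ValueError(f"unsupported voice_gender: {voice_gender!r}")
    return f"{lang}-neural-{voice_gender}"


default_voice = voice_for("en-US")            # voice_gender defaults to male
hebrew_voice = voice_for("he-IL", "female")
```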
Generating prompts
System prompts are synthesized by an admin endpoint:
curl -X POST 'https://nodes.sip.io/admin/tts/generate' \
  -H "x-admin-token: $ADMIN_TOKEN"
# optional filters: ?lang=he-IL ?gender=female ?force=1

It walks every prompt × language × gender, calls the neural speech engine, and writes the result to object storage. The job is content-addressed and idempotent: it HEADs each object and skips what already exists, so a re-run only synthesizes new or changed text (use ?force=1 to rebuild).
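The skip-if-present loop can be sketched as follows. This is an illustration under stated assumptions, not the job's actual code: `generate_prompts`, the dictionary standing in for object storage, and the `synthesize` callback are all hypothetical, and the membership check stands in for the HEAD request.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def generate_prompts(
    store: Dict[str, bytes],
    jobs: Iterable[Tuple[str, str]],
    synthesize: Callable[[str], bytes],
    force: bool = False,
) -> List[str]:
    """Synthesize each (key, text) job unless the object already exists.

    Returns the keys that were actually written on this run.
    """
    written = []
    for key, text in jobs:
        # `key in store` plays the role of the HEAD check against the bucket.
        if key in store and not force:
            continue
        store[key] = synthesize(text)
        written.append(key)
    return written


store: Dict[str, bytes] = {}
jobs = [
    ("system/en-US/male/welcome.wav", "Welcome."),
    ("system/en-US/male/goodbye.wav", "Goodbye."),
]
fake_tts = lambda text: text.encode("utf-8")  # placeholder for the speech engine

first_run = generate_prompts(store, jobs, fake_tts)
second_run = generate_prompts(store, jobs, fake_tts)               # nothing to do
forced_run = generate_prompts(store, jobs, fake_tts, force=True)   # rebuild all
```

Note that this sketch checks the object path only. Picking up changed text on a re-run would also require the key (or some stored metadata) to depend on the text, which this page does not spell out.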
Dynamic announcements
Position announcements need to say numbers (“you are caller 3”, “about 5 minutes”). These are generated on demand from per-language templates with placeholders (position, estimated wait) and cached, content-addressed, at:
tts/<lang>/<voice>/<hash>.wav
Each unique line is synthesized once and cached forever, so the second caller who is “number 3” reuses the first one’s audio.
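A minimal sketch of how such a key could be derived, assuming the hash is a SHA-256 digest of the fully rendered line (the actual hash function and its inputs are not specified here). The template text, the placeholder names, and `announcement_key` are hypothetical.

```python
import hashlib


def announcement_key(lang: str, voice: str, text: str) -> str:
    """Derive a cache key from the rendered line (SHA-256 is an assumption)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"tts/{lang}/{voice}/{digest}.wav"


# Hypothetical English template with the two placeholders named above.
TEMPLATE_EN = "You are caller {position}. Your wait is about {wait_minutes} minutes."

line_a = TEMPLATE_EN.format(position=3, wait_minutes=5)
line_b = TEMPLATE_EN.format(position=3, wait_minutes=5)   # a later caller, same values
line_c = TEMPLATE_EN.format(position=4, wait_minutes=5)

key_a = announcement_key("en-US", "en-US-neural-male", line_a)
key_b = announcement_key("en-US", "en-US-neural-male", line_b)
key_c = announcement_key("en-US", "en-US-neural-male", line_c)
```

Identical rendered lines map to the same key, so the second caller hits the cache; changing any placeholder value yields a different key and a fresh synthesis.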
Gender-correct numbers
Many languages inflect cardinal numbers by the grammatical gender of the noun they modify, so reading a bare digit gets this wrong. SIP.IO ships per-language number-word tables so announcements agree correctly, across languages like Hebrew, Arabic, Spanish, French, Italian, Portuguese, and Russian.
Hebrew is a good worked example, because the two numbers in the same announcement take different genders:
- Queue position agrees masculine with מספר (“number”), e.g. 2 → שניים.
- Estimated wait minutes agrees feminine with דקות (“minutes”), e.g. 2 → שתי (construct form, not שתיים).
The tables cover 1–99; above that the engine falls back to the digit. This is the kind of detail that separates “technically multilingual” from “actually sounds right”: the announcement engine tracks which noun a number modifies, per language, not just the number.
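The lookup-with-fallback idea can be sketched in a few lines. The table below is a deliberately tiny illustrative subset (two entries per role, versus 1–99 in the real tables), and the `position` / `minutes` role names and the `number_word` function are hypothetical.

```python
from typing import Dict, Tuple

# (language, role of the number in the sentence) -> {number: word}
NUMBER_WORDS: Dict[Tuple[str, str], Dict[int, str]] = {
    # Queue position: masculine forms.
    ("he-IL", "position"): {2: "שניים", 3: "שלושה"},
    # Wait minutes: feminine forms (2 uses the construct form).
    ("he-IL", "minutes"): {2: "שתי", 3: "שלוש"},
}


def number_word(lang: str, role: str, n: int) -> str:
    """Return the agreeing number word, or the bare digits if not covered."""
    table = NUMBER_WORDS.get((lang, role), {})
    return table.get(n, str(n))


position_two = number_word("he-IL", "position", 2)
minutes_two = number_word("he-IL", "minutes", 2)
out_of_range = number_word("he-IL", "minutes", 100)   # falls back to digits
```

Keying the table on the role as well as the language is what lets the same number come out differently within one announcement.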