Media & Multilingual TTS

SIP.IO speaks 16 languages out of the box, with gender-correct voices and grammatically correct number agreement in dynamic announcements. This page covers media storage, the system-prompt catalog, and how TTS audio is generated.

Media storage

Custom audio (uploaded prompts, music on hold, greetings) lives in object storage and is described by a media_file row:

{ "id": "media_01j22nfknpx47prkr1dpm4yxtk", "account_id": "acc_01jf18ah3jeb5w6dfp27sgjsbt", "name": "After-hours greeting",
  "kind": "greeting", "object_key": "1/greetings/after-hours.wav",
  "format": "wav", "duration_ms": 8200 }

Field	Purpose
`kind`	`prompt` / `moh` / `greeting` / `announcement`.
`object_key`	Object key in the `MEDIA` bucket.
`format` / `duration_ms`	Audio file format and playback duration in milliseconds.

Media reaches the media engine at the edge through the edge runtime’s /media proxy, so the bucket stays private and recordings and voicemail are never world-readable. The edge fetches each object lazily and caches it. Media can also be grouped into ordered media_playlist sets.
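The lazy fetch-and-cache behaviour can be pictured with a minimal Python sketch. The class name LazyMediaCache, the fetch callback, and the in-memory dictionary standing in for the bucket are all hypothetical; the actual edge runtime is not shown on this page and may work differently.

```python
from typing import Callable, Dict


class LazyMediaCache:
    """Fetch each object at most once, on first request (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch            # e.g. a call through the /media proxy
        self._cache: Dict[str, bytes] = {}
        self.fetch_count = 0           # exposed only to make the behaviour visible

    def get(self, object_key: str) -> bytes:
        if object_key not in self._cache:
            # First request for this key: go to the origin and remember the bytes.
            self.fetch_count += 1
            self._cache[object_key] = self._fetch(object_key)
        return self._cache[object_key]


# Stand-in for the private bucket; real code would go through the proxy.
fake_bucket = {"1/greetings/after-hours.wav": b"RIFF...."}
cache = LazyMediaCache(lambda key: fake_bucket[key])
cache.get("1/greetings/after-hours.wav")
cache.get("1/greetings/after-hours.wav")  # served from cache, no second fetch
```

After the two calls, fetch_count is 1: the second request never touches the bucket.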

The system-prompt catalog

The platform ships a pre-generated catalog of system prompts in every supported language and both genders. Prompts are stored in object storage at a predictable key built from the language, gender, and prompt id:

system/<lang>/<gender>/<id>.wav

Supported languages


`en-US` English	`es-ES` Spanish	`fr-FR` French	`de-DE` German
`it-IT` Italian	`pt-BR` Portuguese	`nl-NL` Dutch	`pl-PL` Polish
`ru-RU` Russian	`tr-TR` Turkish	`ar-XA` Arabic	`he-IL` Hebrew
`hi-IN` Hindi	`ja-JP` Japanese	`ko-KR` Korean	`cmn-CN` Mandarin

System prompt ids

The catalog covers the prompts the platform itself needs:

welcome · please-hold · confirm-accept · whisper-queue · all-busy · no-agents · vm-greeting · goodbye · invalid · transferring
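As a rough illustration, the key for any catalog entry can be built from the language, gender, and prompt id listed above. The function name system_prompt_key is hypothetical, and the sketch assumes the gender path segment is the literal string male or female.

```python
SUPPORTED_LANGS = {
    "en-US", "es-ES", "fr-FR", "de-DE", "it-IT", "pt-BR", "nl-NL", "pl-PL",
    "ru-RU", "tr-TR", "ar-XA", "he-IL", "hi-IN", "ja-JP", "ko-KR", "cmn-CN",
}
PROMPT_IDS = {
    "welcome", "please-hold", "confirm-accept", "whisper-queue", "all-busy",
    "no-agents", "vm-greeting", "goodbye", "invalid", "transferring",
}
GENDERS = {"male", "female"}  # assumed spelling of the <gender> path segment


def system_prompt_key(lang: str, gender: str, prompt_id: str) -> str:
    """Build the object key for one system prompt (hypothetical helper)."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {lang}")
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender}")
    if prompt_id not in PROMPT_IDS:
        raise ValueError(f"unknown prompt id: {prompt_id}")
    return f"system/{lang}/{gender}/{prompt_id}.wav"


key = system_prompt_key("he-IL", "female", "welcome")
```

Here key is system/he-IL/female/welcome.wav. Validating against the published lists up front means a typo in a language code fails loudly instead of producing a key that never exists.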

Voices & gender

Each language has one fixed male voice and one fixed female voice, both from the neural speech engine, with premium neural tiers preferred. For example, en-US uses en-US-neural-male (male) and en-US-neural-female (female); he-IL uses he-IL-neural-male (male) and he-IL-neural-female (female). An account picks its narrator gender via voice_gender (default male), and the matching voice is used everywhere.
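Voice selection then reduces to a lookup on language and voice_gender. The sketch below is illustrative only: the VOICES table holds just the four voices named above, and pick_voice is a hypothetical helper, not a platform API.

```python
from typing import Dict, Tuple

# Only the voices named on this page; the full table is not shown here.
VOICES: Dict[Tuple[str, str], str] = {
    ("en-US", "male"): "en-US-neural-male",
    ("en-US", "female"): "en-US-neural-female",
    ("he-IL", "male"): "he-IL-neural-male",
    ("he-IL", "female"): "he-IL-neural-female",
}


def pick_voice(lang: str, voice_gender: str = "male") -> str:
    """Return the fixed voice for a language; voice_gender defaults to male."""
    try:
        return VOICES[(lang, voice_gender)]
    except KeyError:
        raise ValueError(f"no voice for {lang}/{voice_gender}") from None


default_voice = pick_voice("en-US")            # account left voice_gender unset
hebrew_voice = pick_voice("he-IL", "female")
```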

Generating prompts

System prompts are synthesized by an admin endpoint:

curl -X POST 'https://nodes.sip.io/admin/tts/generate' \
  -H "x-admin-token: $ADMIN_TOKEN"
# optional query parameters: lang=he-IL, gender=female, force=1 (join several with &)

The endpoint walks every prompt × language × gender combination, calls the neural speech engine, and writes the result to object storage. The job is content-addressed and idempotent: it issues a HEAD request for each object and skips those that already exist, so a re-run only synthesizes new or changed text (use ?force=1 to rebuild everything).
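The skip-if-present loop might look roughly like the following. This is a sketch under stated assumptions, not the endpoint's implementation: generate_catalog is a hypothetical name, a dictionary stands in for object storage, a membership test stands in for the HEAD request, and synthesize is a placeholder for the speech engine call. The sketch keys on prompt id only, so it detects missing objects; detecting changed text would need the text (or a hash of it) to feed into the key or the object metadata, which is left out here.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def generate_catalog(
    prompts: Dict[str, str],                 # prompt id -> text to speak
    langs: Sequence[str],
    genders: Sequence[str],
    store: Dict[str, bytes],                 # stand-in for object storage
    synthesize: Callable[[str, str, str], bytes],
    force: bool = False,
) -> List[str]:
    """Synthesize missing prompts and return the keys that were written."""
    written: List[str] = []
    for (prompt_id, text), lang, gender in product(prompts.items(), langs, genders):
        key = f"system/{lang}/{gender}/{prompt_id}.wav"
        if key in store and not force:       # stands in for the HEAD check
            continue
        store[key] = synthesize(text, lang, gender)
        written.append(key)
    return written


store: Dict[str, bytes] = {}
fake_tts = lambda text, lang, gender: b"audio"
first_run = generate_catalog({"welcome": "Welcome"}, ["en-US", "he-IL"], ["male", "female"], store, fake_tts)
second_run = generate_catalog({"welcome": "Welcome"}, ["en-US", "he-IL"], ["male", "female"], store, fake_tts)
forced_run = generate_catalog({"welcome": "Welcome"}, ["en-US", "he-IL"], ["male", "female"], store, fake_tts, force=True)
```

The first run writes four objects, the second writes none, and the forced run rewrites all four.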

Dynamic announcements

Position announcements need to say numbers (“you are caller 3”, “about 5 minutes”). These are generated on demand from per-language templates with placeholders (position, estimated wait) and cached, content-addressed, at:

tts/<lang>/<voice>/<hash>.wav

Each unique line is synthesized once and cached forever, so the second caller who is “number 3” reuses the first one’s audio.
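A content-addressed cache of this kind can be sketched in a few lines. The choice of SHA-256 over the rendered text, the helper names, and the template string are assumptions made for illustration; the page does not say which hash or template format the platform uses.

```python
import hashlib
from typing import Callable, Dict


def announcement_key(lang: str, voice: str, text: str) -> str:
    """Derive the cache key from the rendered line (SHA-256 is an assumption)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"tts/{lang}/{voice}/{digest}.wav"


def get_announcement(
    lang: str,
    voice: str,
    template: str,
    store: Dict[str, bytes],                 # stand-in for object storage
    synthesize: Callable[[str], bytes],
    **values: object,
) -> str:
    """Render the template, synthesize on a cache miss, and return the key."""
    text = template.format(**values)
    key = announcement_key(lang, voice, text)
    if key not in store:
        store[key] = synthesize(text)
    return key


store: Dict[str, bytes] = {}
calls = []


def fake_synthesize(text: str) -> bytes:
    calls.append(text)                       # record each real synthesis
    return b"audio:" + text.encode("utf-8")


first = get_announcement("en-US", "en-US-neural-male", "You are caller {position}.", store, fake_synthesize, position=3)
second = get_announcement("en-US", "en-US-neural-male", "You are caller {position}.", store, fake_synthesize, position=3)
```

Both calls return the same key and fake_synthesize runs only once, which is the reuse described above.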

Gender-correct numbers

Many languages inflect cardinal numbers for the grammatical gender of the noun they modify, so reading out a bare digit can produce the wrong form. SIP.IO ships per-language number-word tables so that announcements agree correctly in languages such as Hebrew, Arabic, Spanish, French, Italian, Portuguese, and Russian.

Hebrew is a good worked example, because the two numbers in the same announcement take different genders:

Queue position agrees masculine with מספר (“number”), e.g. 2 → שניים.
Estimated wait minutes agrees feminine with דקות (“minutes”), e.g. 2 → שתי (construct form, not שתיים).

The tables cover 1–99; above that the engine falls back to the digit. This is the kind of detail that separates “technically multilingual” from “actually sounds right”: the announcement engine tracks which noun a number modifies, per language, not just the number.
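The lookup-with-fallback idea can be sketched as follows. The table layout, the role names position and minutes, and the function number_word are hypothetical, and the table is a three-entry excerpt for illustration, whereas the shipped tables cover 1–99.

```python
from typing import Dict

# Illustrative excerpt only; keyed by language, then by what the number modifies.
NUMBER_WORDS: Dict[str, Dict[str, Dict[int, str]]] = {
    "he-IL": {
        "position": {1: "אחד", 2: "שניים", 3: "שלושה"},  # masculine forms
        "minutes": {1: "אחת", 2: "שתי", 3: "שלוש"},      # feminine forms
    },
}


def number_word(lang: str, role: str, n: int) -> str:
    """Return the agreeing number word, or the bare digits if not tabulated."""
    table = NUMBER_WORDS.get(lang, {}).get(role, {})
    return table.get(n, str(n))


position_two = number_word("he-IL", "position", 2)
minutes_two = number_word("he-IL", "minutes", 2)
fallback = number_word("he-IL", "minutes", 120)   # outside the table
```

The same number 2 yields two different words depending on its role, and 120 falls back to the digits "120".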