链载Ai

Title: Speech-to-Text, Text-to-Speech: OpenAI Releases 2 New Model Sets and 1 New Website

Author: 链载Ai Time: 6 hours ago

At 1 a.m., OpenAI abruptly shipped three releases:

Speech-to-text (STT) models
A text-to-speech (TTS) model
A demo website: OpenAI.fm

Bottom line up front:

A modest release, but a practical one, with a nice playground.

Speech-to-text (STT) models

There are two models, gpt-4o-transcribe and gpt-4o-mini-transcribe. Compared with the earlier Whisper, they cost the same or less and perform better, especially with accents, background noise, and varying speaking speeds.

Whisper: ~ $0.006/min
gpt-4o-transcribe: ~ $0.006/min
gpt-4o-mini-transcribe: ~ $0.003/min
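
To make the per-minute prices above concrete, here is a minimal sketch that estimates what a recording would cost at each listed rate. The function name `estimate_cost` and the table layout are illustrative, `whisper-1` is assumed to be the model ID behind the "Whisper" row, and the rates are the approximate figures quoted above rather than an official price sheet.

```python
# Approximate USD per minute of audio, copied from the list above.
# "whisper-1" is assumed to be the model ID for the "Whisper" row.
PRICE_PER_MINUTE = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def estimate_cost(model: str, seconds: float) -> float:
    """Return the estimated transcription cost in USD for `seconds` of audio."""
    return PRICE_PER_MINUTE[model] * seconds / 60


if __name__ == "__main__":
    # Compare the three models on a one-hour recording.
    for name in PRICE_PER_MINUTE:
        print(f"{name}: ${estimate_cost(name, 3600):.2f} per hour")
```

At these rates the mini model comes out at half the cost of the other two for the same audio length.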

Next, the error-rate comparison (lower is better):

Versus OpenAI's own Whisper

Versus competing models

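There are two endpoints: transcriptions and translations. The transcriptions endpoint also accepts the new models; translations, as far as the official API reference describes it, still runs on whisper-1 only. The former is plain speech-to-text, and a simple call looks like this: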

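from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",  # "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" can be used here too
        file=audio_file,
    )

print(transcription.text)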

The latter transcribes and translates in one step (into English only). A call looks roughly like this:

from openai import OpenAI

client = OpenAI()

with open("/path/to/file/speech.mp3", "rb") as audio_file:
    # Call the translations endpoint; the returned text is always English.
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)

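The rest are updates to the API parameters:

Timestamps: set the timestamp_granularities parameter to get JSON output with timestamps, at segment level or word level.
Streaming transcriptions: set stream=True to receive a transcript.text.delta event as soon as the model finishes transcribing each piece of audio, followed by a final transcript.text.done event that carries the complete transcript.
Realtime API: for ongoing audio streams (such as live meetings or voice input), send audio over a WebSocket connection and receive transcription events in real time.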
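
The streaming option above can be sketched as follows. This is a minimal example, not a definitive implementation: `collect_transcript` and `stream_transcription` are hypothetical helper names, and the shape of the call (`stream=True`, events exposing `type`, `delta`, and `text`) reflects my reading of the official SDK and should be checked against the docs. The `openai` import is deferred into the function so the event-handling logic can be exercised without the package or network access.

```python
from types import SimpleNamespace


def collect_transcript(events) -> str:
    """Accumulate streamed transcription events into the final text."""
    parts = []
    for event in events:
        if event.type == "transcript.text.delta":
            # Each delta carries the newly transcribed piece of text.
            parts.append(event.delta)
        elif event.type == "transcript.text.done":
            # The final event carries the complete transcript.
            return event.text
    # Fall back to the joined deltas if no "done" event arrived.
    return "".join(parts)


def stream_transcription(path: str, model: str = "gpt-4o-mini-transcribe") -> str:
    """Hypothetical helper: stream a transcription for the audio file at `path`."""
    from openai import OpenAI  # deferred import; requires the openai package

    client = OpenAI()
    with open(path, "rb") as audio_file:
        events = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            response_format="text",
            stream=True,
        )
        return collect_transcript(events)


if __name__ == "__main__":
    # Offline demonstration with stand-in event objects.
    fake_events = [
        SimpleNamespace(type="transcript.text.delta", delta="hello "),
        SimpleNamespace(type="transcript.text.delta", delta="world"),
        SimpleNamespace(type="transcript.text.done", text="hello world"),
    ]
    print(collect_transcript(fake_events))
```

Keeping the event handling separate from the API call makes it easy to show partial text in a UI as deltas arrive while still returning the full transcript at the end.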

Full documentation:

https://platform.openai.com/docs/guides/speech-to-text
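
For the timestamp option listed above, a rough sketch might look like this. The helper names `format_words` and `transcribe_with_word_timestamps` are hypothetical. The sketch also assumes, based on my understanding of the docs linked above, that timestamp_granularities needs response_format="verbose_json" and is documented for whisper-1, so treat the parameter combination as an assumption to verify.

```python
from types import SimpleNamespace


def format_words(words) -> list:
    """Render word-level timestamps as 'start-end word' lines."""
    return [f"{w.start:.2f}-{w.end:.2f} {w.word}" for w in words]


def transcribe_with_word_timestamps(path: str) -> list:
    """Hypothetical helper: request word-level timestamps for one audio file."""
    from openai import OpenAI  # deferred import; requires the openai package

    client = OpenAI()
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",  # assumed: the model documented for timestamps
            file=audio_file,
            response_format="verbose_json",  # assumed requirement for timestamps
            timestamp_granularities=["word"],
        )
    return format_words(result.words)


if __name__ == "__main__":
    # Offline demonstration with stand-in word objects.
    sample = [
        SimpleNamespace(word="hello", start=0.0, end=0.42),
        SimpleNamespace(word="world", start=0.5, end=1.0),
    ]
    for line in format_words(sample):
        print(line)
```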

Text-to-speech (TTS) model

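The model is called gpt-4o-mini-tts, a highly steerable TTS:

You can specify what to say, e.g. "I'm a solo trainee who has been training for two and a half years"
You can specify the speaking style, e.g. "in a coy, sweet tone"

Chinese sample

English sample

Personally, I don't think the results are very good (though you can reroll to get a different timbre).

On length, the input is capped at 2000 tokens.

On price, it costs $0.015/min. Sample code: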

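import asyncio

from openai import AsyncOpenAI
from openai.helpers import LocalAudioPlayer

openai = AsyncOpenAI()

# Text to speak (Chinese sample): a self-introduction by a trainee of two and a half years.
input_text = """大家好，我是练习时长两年半的个人练习生，你坤坤，喜欢唱、跳、Rap和篮球，music~\n\n在今后的节目中，有我很多作词，作曲，编舞的原创作品，期待的话多多投票吧！"""

# Style instruction: "in a coy, sweet tone, with a high-pitched girlish voice".
instructions = """用娇滴滴的语气，萝莉音"""


async def main() -> None:
    # Stream raw PCM audio from the API and play it as it arrives.
    async with openai.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=input_text,
        instructions=instructions,
        response_format="pcm",
    ) as response:
        await LocalAudioPlayer().play(response)


if __name__ == "__main__":
    asyncio.run(main())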
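
Because input is capped at 2000 tokens, longer scripts have to be split before they are sent. The sketch below greedily packs paragraphs into chunks under a budget. The name `split_for_tts` is hypothetical, and the default counter (`len`, one "token" per character) is only a crude stand-in: pass a real tokenizer-based counter for accurate limits. A single paragraph that exceeds the budget on its own is kept whole rather than split further.

```python
def split_for_tts(text: str, max_tokens: int = 2000, count_tokens=len) -> list:
    """Greedily pack paragraphs into chunks of at most `max_tokens` each.

    `count_tokens` maps a string to its token count; the default `len`
    is a placeholder assumption, not the model's real tokenizer.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and count_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "first paragraph\n\nsecond paragraph\n\nthird"
    for i, chunk in enumerate(split_for_tts(script, max_tokens=35)):
        print(i, repr(chunk))
```

Each chunk can then be passed as the input of a separate speech request and the resulting audio played or concatenated in order.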

Full documentation:

https://platform.openai.com/docs/guides/text-to-speech

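New website: OpenAI.fm

This is a playground for experimenting with voices, and it is quite fun.

You can also export the code with one click from the top-right corner.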
