
Text-To-Speech (TTS) API


Spring AI provides a unified API for Text-To-Speech (TTS) through the TextToSpeechModel and StreamingTextToSpeechModel interfaces. This lets you write portable code that works across different TTS providers.

Supported Providers

The providers referenced on this page are OpenAI and ElevenLabs.

Common Interface

All TTS providers implement the following shared interfaces:

TextToSpeechModel

The TextToSpeechModel interface provides methods for converting text to speech:

public interface TextToSpeechModel extends Model<TextToSpeechPrompt, TextToSpeechResponse>, StreamingTextToSpeechModel {

    /**
     * Converts text to speech with default options.
     */
    default byte[] call(String text) {
        // Default implementation
    }

    /**
     * Converts text to speech with custom options.
     */
    TextToSpeechResponse call(TextToSpeechPrompt prompt);

    /**
     * Returns the default options for this model.
     */
    default TextToSpeechOptions getDefaultOptions() {
        // Default implementation
    }
}

StreamingTextToSpeechModel

The StreamingTextToSpeechModel interface provides methods for streaming audio in real time:

@FunctionalInterface
public interface StreamingTextToSpeechModel extends StreamingModel<TextToSpeechPrompt, TextToSpeechResponse> {

    /**
     * Streams text-to-speech responses with metadata.
     */
    Flux<TextToSpeechResponse> stream(TextToSpeechPrompt prompt);

    /**
     * Streams audio bytes for the given text.
     */
    default Flux<byte[]> stream(String text) {
        // Default implementation
    }
}

TextToSpeechPrompt

The TextToSpeechPrompt class encapsulates the input text and options:

TextToSpeechPrompt prompt = new TextToSpeechPrompt(
    "Hello, this is a text-to-speech example.",
    options
);

TextToSpeechResponse

The TextToSpeechResponse class contains the generated audio and metadata:

TextToSpeechResponse response = model.call(prompt);
byte[] audioBytes = response.getResult().getOutput();
TextToSpeechResponseMetadata metadata = response.getMetadata();
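Once the audio bytes are in hand, a common next step is to persist them. The following is a minimal sketch using only the JDK. AudioFileWriter is a hypothetical helper, not part of Spring AI, and the caller is assumed to pick a file extension that matches whatever format the provider returned.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Spring AI) for persisting synthesized audio. */
final class AudioFileWriter {

    private AudioFileWriter() {
    }

    /**
     * Writes the audio bytes to the target path, creating parent directories
     * if needed and overwriting any existing file. Returns the written path.
     */
    static Path save(byte[] audioBytes, Path target) throws IOException {
        if (audioBytes == null || audioBytes.length == 0) {
            throw new IllegalArgumentException("No audio data to write");
        }
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Default open options: create the file, truncate if it exists, write.
        return Files.write(target, audioBytes);
    }
}
```

With the response from the snippet above, this could be called as AudioFileWriter.save(audioBytes, Path.of("speech.mp3")).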

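Writing Provider-Agnostic Code

One of the key benefits of the shared TTS interfaces is that you can write code that works with any TTS provider without modification. The actual provider (OpenAI, ElevenLabs, and so on) is determined by your Spring Boot configuration, so you can switch providers without changing application code.

Basic Service Example

The shared interfaces let you write code that works with any TTS provider: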

@Service
public class NarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public NarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] narrate(String text) {
        // Works with any TTS provider
        return textToSpeechModel.call(text);
    }

    public byte[] narrateWithOptions(String text, TextToSpeechOptions options) {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

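This service works with OpenAI, ElevenLabs, or any other TTS provider; the actual implementation is determined by your Spring Boot configuration.

Advanced Example: Multi-Provider Support

You can build applications that support multiple TTS providers at the same time: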

@Service
public class MultiProviderNarrationService {

    private final Map<String, TextToSpeechModel> providers;

    public MultiProviderNarrationService(List<TextToSpeechModel> models) {
        // Spring will inject all available TextToSpeechModel beans
        this.providers = models.stream()
            .collect(Collectors.toMap(
                model -> model.getClass().getSimpleName(),
                model -> model
            ));
    }

    public byte[] narrateWithProvider(String text, String providerName) {
        TextToSpeechModel model = providers.get(providerName);
        if (model == null) {
            throw new IllegalArgumentException("Unknown provider: " + providerName);
        }
        return model.call(text);
    }

    public Set<String> getAvailableProviders() {
        return providers.keySet();
    }
}
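Because the map is keyed by getClass().getSimpleName(), callers have to pass the implementation class name as the provider name, and two beans of the same class would collide. The sketch below reproduces that keying and lookup logic with JDK-only stand-ins so the behavior can be tried in isolation. SpeechStub, AlphaSpeechStub, BetaSpeechStub, and StubProviderRegistry are hypothetical names, not Spring AI types.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a TTS model; NOT the Spring AI interface. */
interface SpeechStub {
    byte[] call(String text);
}

/** Fake provider: returns the UTF-8 bytes of the text instead of real audio. */
final class AlphaSpeechStub implements SpeechStub {
    @Override
    public byte[] call(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}

/** Second fake provider: always returns a single marker byte. */
final class BetaSpeechStub implements SpeechStub {
    @Override
    public byte[] call(String text) {
        return new byte[] {42};
    }
}

/** Mirrors the keying and lookup logic of the service above, using the stubs. */
final class StubProviderRegistry {

    private final Map<String, SpeechStub> providers;

    StubProviderRegistry(List<SpeechStub> models) {
        // Key by simple class name; toMap throws IllegalStateException on duplicate keys.
        this.providers = models.stream()
            .collect(Collectors.toMap(model -> model.getClass().getSimpleName(), Function.identity()));
    }

    byte[] narrateWithProvider(String text, String providerName) {
        SpeechStub model = providers.get(providerName);
        if (model == null) {
            throw new IllegalArgumentException("Unknown provider: " + providerName);
        }
        return model.call(text);
    }

    Set<String> getAvailableProviders() {
        return providers.keySet();
    }
}
```

In the stub registry the valid provider names are "AlphaSpeechStub" and "BetaSpeechStub". By the same rule, the real service would presumably expect names such as "OpenAiAudioSpeechModel", assuming the injected beans are not proxied. If stable, user-facing names matter, keying by an explicit identifier rather than the class name may be the more robust choice.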

Streaming Audio Example

The shared interfaces also support streaming for real-time audio generation:

@Service
public class StreamingNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public StreamingNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public Flux<byte[]> streamNarration(String text) {
        // TextToSpeechModel extends StreamingTextToSpeechModel
        return textToSpeechModel.stream(text);
    }

    public Flux<TextToSpeechResponse> streamWithMetadata(String text, TextToSpeechOptions options) {
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        return textToSpeechModel.stream(prompt);
    }
}
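A streamed response arrives as a sequence of byte[] chunks. A consumer that wants the complete clip (for caching or saving, say) has to concatenate those chunks in arrival order. The sketch below shows only that joining step, with a plain Iterable standing in for the emitted chunks so it runs on the JDK alone. AudioChunkJoiner is a hypothetical helper, and in a reactive pipeline the chunks would first be gathered from the Flux (for example with collectList()).

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper: joins streamed audio chunks into one clip. */
final class AudioChunkJoiner {

    private AudioChunkJoiner() {
    }

    /** Concatenates chunks in iteration order; null or empty chunks are skipped. */
    static byte[] join(Iterable<byte[]> chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            if (chunk != null && chunk.length > 0) {
                buffer.write(chunk, 0, chunk.length);
            }
        }
        return buffer.toByteArray();
    }
}
```

Buffering the whole clip gives up the latency benefit of streaming, so this fits cases where the full audio is needed before use. For live playback, forwarding each chunk as it arrives is usually the better fit.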

REST Controller Example

Building a provider-agnostic TTS REST API:

@RestController
@RequestMapping("/api/tts")
public class TextToSpeechController {

    private final TextToSpeechModel textToSpeechModel;

    public TextToSpeechController(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    @PostMapping(value = "/synthesize", produces = "audio/mpeg")
    public ResponseEntity<byte[]> synthesize(@RequestBody SynthesisRequest request) {
        byte[] audio = textToSpeechModel.call(request.text());
        return ResponseEntity.ok()
            .contentType(MediaType.parseMediaType("audio/mpeg"))
            .header("Content-Disposition", "attachment; filename=\"speech.mp3\"")
            .body(audio);
    }

    @GetMapping(value = "/stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public Flux<byte[]> streamSynthesis(@RequestParam String text) {
        return textToSpeechModel.stream(text);
    }

    record SynthesisRequest(String text) {}
}
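To exercise the /synthesize endpoint from Java, a client needs to POST a JSON body with a text field, matching the SynthesisRequest record. The following sketch builds such a request with the JDK's java.net.http API. TtsRequestSketch and its hand-rolled JSON escaping are illustrative assumptions, and the base URL is whatever host the application is assumed to run on.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical client-side helper for the controller above. */
final class TtsRequestSketch {

    private TtsRequestSketch() {
    }

    /** Builds a POST to {baseUrl}/api/tts/synthesize with the body {"text":"..."}. */
    static HttpRequest synthesizeRequest(String baseUrl, String text) {
        String json = "{\"text\":\"" + escapeJson(text) + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/tts/synthesize"))
            .header("Content-Type", "application/json")
            .header("Accept", "audio/mpeg")
            .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
            .build();
    }

    /** Minimal JSON string escaping: quotes, backslashes, and control characters. */
    static String escapeJson(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

The request can then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofByteArray()), which yields the audio as a byte array. In a real client a JSON library would normally replace the manual escaping.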

Configuration-Based Provider Selection

Use Spring profiles or properties to switch between providers:

# application-openai.yml
spring:
  ai:
    model:
      audio:
        speech: openai
    openai:
      api-key: ${OPENAI_API_KEY}
      audio:
        speech:
          options:
            model: gpt-4o-mini-tts
            voice: alloy

# application-elevenlabs.yml
spring:
  ai:
    model:
      audio:
        speech: elevenlabs
    elevenlabs:
      api-key: ${ELEVENLABS_API_KEY}
      tts:
        options:
          model-id: eleven_turbo_v2_5
          voice-id: your_voice_id

Then activate the desired provider:

# Use OpenAI
java -jar app.jar --spring.profiles.active=openai

# Use ElevenLabs
java -jar app.jar --spring.profiles.active=elevenlabs

Using Portable Options

For maximum portability, use only the methods of the common TextToSpeechOptions interface:

@Service
public class PortableNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public PortableNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] createPortableNarration(String text) {
        // Use provider's default options for maximum portability
        TextToSpeechOptions defaultOptions = textToSpeechModel.getDefaultOptions();
        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, defaultOptions);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

Using Provider-Specific Features

When you need provider-specific features, you can still use them while keeping the rest of the codebase portable:

@Service
public class FlexibleNarrationService {

    private final TextToSpeechModel textToSpeechModel;

    public FlexibleNarrationService(TextToSpeechModel textToSpeechModel) {
        this.textToSpeechModel = textToSpeechModel;
    }

    public byte[] narrate(String text, TextToSpeechOptions baseOptions) {
        TextToSpeechOptions options = baseOptions;

        // Apply provider-specific optimizations if available
        if (textToSpeechModel instanceof OpenAiAudioSpeechModel) {
            options = OpenAiAudioSpeechOptions.builder()
                .from(baseOptions)
                .model("tts-1-hd") // OpenAI-specific: use the higher-quality model
                .speed(1.0)
                .build();
        } else if (textToSpeechModel instanceof ElevenLabsTextToSpeechModel) {
            // ElevenLabs-specific options could go here
        }

        TextToSpeechPrompt prompt = new TextToSpeechPrompt(text, options);
        TextToSpeechResponse response = textToSpeechModel.call(prompt);
        return response.getResult().getOutput();
    }
}

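Best Practices for Portable Code

  1. Depend on interfaces: always inject TextToSpeechModel rather than a concrete implementation

  2. Use common options: stick to the TextToSpeechOptions interface methods for the best portability

  3. Handle metadata gracefully: different providers return different metadata, so handle it generically

  4. Test with multiple providers: make sure your code works with at least two TTS providers

  5. Document provider assumptions: if you rely on provider-specific behavior, document it clearly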
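As an illustration of item 3, the sketch below reads metadata defensively, treating every entry as optional and falling back when a key is missing or has an unexpected type. How TextToSpeechResponseMetadata exposes its entries is not shown on this page, so the metadata is modeled here as a plain Map. MetadataReader and the key names used with it are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: provider-neutral access to loosely typed metadata. */
final class MetadataReader {

    private MetadataReader() {
    }

    /** Returns the entry as text, or empty if the map or the entry is absent. */
    static Optional<String> text(Map<String, Object> metadata, String key) {
        if (metadata == null) {
            return Optional.empty();
        }
        Object value = metadata.get(key);
        return value == null ? Optional.empty() : Optional.of(String.valueOf(value));
    }

    /** Returns the entry as a long, or the fallback if it is absent or not numeric. */
    static long number(Map<String, Object> metadata, String key, long fallback) {
        Object value = metadata == null ? null : metadata.get(key);
        return value instanceof Number ? ((Number) value).longValue() : fallback;
    }
}
```

Application code written this way keeps working when a provider omits a field, which is the situation the guideline warns about.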

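Provider-Specific Features

Although the shared interfaces ensure portability, each provider also offers specific features through provider-specific options classes (for example, OpenAiAudioSpeechOptions and ElevenLabsSpeechOptions). These classes implement the TextToSpeechOptions interface while adding provider-specific capabilities.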
