AidVoice C++ API Documentation
💡Note: When developing with AidVoice-SDK C++, please keep the following in mind:
- Include the header file during compilation, located at: /usr/local/include/aidlux/aidvoice/aidvoice_speech.hpp
- Specify the library file during linking, located at: /usr/local/lib/libaidvoice_speech.so
Specific Function Types .enum FeatureType
FeatureType is used to specify core business modules when initializing the AidVoice SDK. Since the AidVoice SDK includes various voice functions, developers must explicitly specify the specific voice function through this enumeration when creating a functional instance (Object). Currently, the SDK supports Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) functions, with more voice features being continuously integrated.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_ASR | uint8_t | 1 | Speech Recognition |
| TYPE_TTS | uint8_t | 2 | Speech Synthesis |
Model Type .enum ModelType
ModelType defines the inference model algorithms supported by the AidVoice SDK. Users should select the appropriate model based on the application scenario. Currently, the whisper_base and sensevoice_small models are supported for ASR, and the melotts_chinese and melotts_english models for TTS.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_WHISPER | uint8_t | 1 | whisper_base model |
| TYPE_SENSEVOICE | uint8_t | 2 | sensevoice_small model |
| TYPE_MELOTTS_CHINESE | uint8_t | 3 | melotts_chinese model |
| TYPE_MELOTTS_ENGLISH | uint8_t | 4 | melotts_english model |
💡Note
Special Instructions: TYPE_WHISPER and TYPE_SENSEVOICE are ASR models, both featuring multi-language support (Chinese, English, etc.).
- English scenarios: It is recommended to prioritize TYPE_WHISPER, as it performs better in English recognition accuracy and semantic understanding.
- Chinese scenarios: It is recommended to prioritize TYPE_SENSEVOICE, as it performs better in Chinese recognition accuracy and semantic understanding.
TYPE_MELOTTS_CHINESE and TYPE_MELOTTS_ENGLISH are TTS models, supporting Chinese and English speech synthesis respectively.
Audio Type .enum AudioType
When performing Speech Recognition (ASR) or Speech Synthesis (TTS) tasks, it is necessary to specify the encoding format and sampling attributes of the input/output audio. By setting this enumeration, the SDK can correctly parse audio stream data or generate audio files in the specified format.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_WAV | uint8_t | 1 | WAV format audio |
| TYPE_PCM | uint8_t | 2 | PCM format audio |
💡Note
To ensure the accuracy of Automatic Speech Recognition (ASR) and system stability, the raw audio stream fed into the SDK must strictly comply with the following standards:
- Sampling Frequency: Fixed at 16 kHz.
- Channel Configuration: Mono input only.
- Data Precision: Signed 16-bit depth.
Currently, the TTS module only supports outputting audio streams in WAV format; the relevant audio enumeration values are primarily used by the ASR module as criteria for identifying different input audio types.
Log Level .enum LogLevel
The AidVoice SDK provides interfaces for configuring logging (described later). To specify which log level the SDK currently uses, you must use this log level enumeration.
| Member Name | Type | Value | Description |
|---|---|---|---|
| INFO | uint8_t | 0 | Information |
| WARNING | uint8_t | 1 | Warning |
| ERROR | uint8_t | 2 | Error |
| FATAL | uint8_t | 3 | Fatal Error |
| DEBUG | uint8_t | 4 | Debug |
| OFF | uint8_t | 5 | Off |
Return Result Status .enum ResultStatus
ResultStatus is used to define the return status of all SDK executions. By checking this enumeration value, developers can determine whether the current execution flow was successful. If a non-successful status is returned, troubleshooting can be performed based on specific error information.
| Member Name | Type | Value | Description |
|---|---|---|---|
| AV_OK | uint8_t | 0 | Execution successful |
| AV_ERR_INVALID_ARG | uint8_t | 1 | Invalid argument |
| AV_ERR_LOAD_FILE | uint8_t | 2 | File loading failed |
| AV_ERR_RUN_FAIL | uint8_t | 3 | Runtime error |
| AV_ERR_UNSUPPORTED | uint8_t | 4 | Operation not supported |
| AV_ERR_GENERATE_OBJECT | uint8_t | 5 | Object creation failed |
| AV_OTHER | uint8_t | 6 | Other |
Global Configuration Class .class FeatureConfig
The FeatureConfig structure is used to store all configuration information required to construct a specific function object. Before initializing an SDK instance, developers need to instantiate this class and set the function type, model selection, and log level according to business requirements.
Member Variable List
FeatureConfig includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| feature_type | FeatureType | No default value | Specifies the SDK's business mode, such as speech recognition or synthesis |
| model_type | ModelType | No default value | Specifies the model used by the function object |
| log_type | LogLevel | OFF | Specifies the log level |
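Putting the fields above together, a minimal configuration sketch might look like the following. The enum member names follow the tables above; any namespace or scope qualification required by the real header is an assumption.

```cpp
// Sketch: populate FeatureConfig for an ASR instance.
// Enum scoping (FeatureType::..., ModelType::...) is an assumption;
// adjust to the actual declarations in aidvoice_speech.hpp.
FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;      // business mode: speech recognition
cfg.model_type   = ModelType::TYPE_SENSEVOICE; // recommended for Chinese scenarios
cfg.log_type     = LogLevel::INFO;             // defaults to OFF if not set
```

The populated object is then passed to the corresponding create interface (create_asr or create_tts, described below).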
Speech Recognition ASR Related Interfaces
This section details the core API interfaces related to Automatic Speech Recognition (ASR) in the AidVoice SDK. Developers can use these interfaces to complete the entire process from ASR instance creation and audio stream input to retrieving recognition results.
Function Description: The ASR module is designed to convert input 16k/Mono/16-bit raw audio streams into text information in real-time or offline. Currently, it supports two mainstream inference models: sensevoice_small and whisper_base.
Speech Recognition Mode .enum ASRMode
ASRMode is used to define the output strategy for speech recognition results. Developers should choose between incremental feedback (Streaming) or full-sentence output (Non-Streaming) based on the real-time requirements of the application scenario.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_STREAM | uint8_t | 0 | Streaming output: Can return intermediate transcription results |
| TYPE_NOSTREAM | uint8_t | 1 | Non-streaming output: Returns final transcription result per process |
💡Note
Special Instructions:
- Streaming: Generates temporary transcription results in real-time during audio processing. As subsequent audio streams are input, the SDK continuously corrects and updates intermediate text. This mode allows returning data before the audio buffer is completely processed, significantly reducing first-word latency and improving interactive experience.
- Non-Streaming: Transcribes based on a fixed audio duration. If speech ends early (e.g., user stops inputting) or the speech duration is short, the system will immediately generate and return the final transcription text.
💡Note
Different inference models have strict limits on the audio duration for a single process. Developers must pay special attention when configuring non-streaming transcription or pre-processing audio segments:
- TYPE_WHISPER: Maximum audio input length per process is 24s.
- TYPE_SENSEVOICE: Maximum audio input length per process is 15s.
- If the audio data sent in a single process exceeds these limits, the SDK will truncate the input audio data, and the truncated subsequent audio will automatically trigger a new round of transcription tasks.
Speech Transcription Status .enum AsrStatus
AsrStatus is used to identify the state of the text results returned by ASR within the current recognition cycle. By parsing this status, developers can distinguish whether the current result is an "intermediate word being corrected" or a "finalized statement."
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_PARTIAL | uint8_t | 0 | Intermediate transcription process; transcription not finished |
| TYPE_FINAL | uint8_t | 1 | Final transcription result; transcription completed |
💡Note
Special Instructions:
- Streaming Mode (TYPE_STREAM): In streaming mode, if the status information is PARTIAL, it indicates intermediate results returned before the current audio buffer is fully processed. Only when the status flag is FINAL does it indicate that the processing of the current buffer data is complete.
- Non-Streaming Mode (TYPE_NOSTREAM): In non-streaming mode, the status information for every transcription is FINAL.
```
// Streaming Mode:
I am. TYPE_PARTIAL
I am a boy. TYPE_PARTIAL
I am a boy. I like Aplux. TYPE_FINAL

// Non-Streaming Mode:
I am a boy. I like Aplux. TYPE_FINAL
```
ASR Result Return Class .class AsrResult
The AsrResult structure is used to carry the transcription data and status output by ASR. When the SDK completes processing a segment of audio, it encapsulates the recognized text content and its corresponding real-time status (intermediate or final) into this class to return to the developer.
Member Variable List
AsrResult includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | AsrStatus | No default value | Transcription status of the current returned result |
| text | std::string | No default value | Text result of the current transcription |
| id | int | 0 | ID value of the returned result |
ASR Error Return Class .class AsrError
The AsrError structure is used to carry exception information output by ASR. When an interface call returns a non-successful status, developers can use this class to obtain specific error codes and detailed description text.
Member Variable List
AsrError includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | ResultStatus | No default value | Error information status |
| error_code | int | No default value | Error code |
| message | std::string | No default value | Current returned error message |
ASR Callback Interface Class .class ASRCallbacks
ASRCallbacks is a virtual base class used to define the listening interfaces for pushing data from the SDK to the application layer. Developers need to inherit from this class and implement its virtual functions to asynchronously obtain recognition results or error information.
Get Transcription Results .onResult()
This callback is automatically triggered when the underlying ASR processes a segment of audio and generates text.
| API | onResult |
|---|---|
| Description | Speech recognition result callback function |
| Parameters | result: Transcription result object, containing the currently recognized text content and result status |
| Return | void |
Get Error Information .onError()
This callback is automatically triggered when an exception occurs during ASR operation.
| API | onError |
|---|---|
| Description | Error information callback function. Used to receive and process various exceptions generated during ASR operation |
| Parameters | error: Error information object, containing the current error description, error code, etc. |
| Return | void |
```cpp
class ASRCallbacksImpl : public ASRCallbacks
{
public:
    void onResult(const AsrResult &result) override
    {
        std::string asrResult = result.text;
        int sid = result.id;
        AsrStatus status = result.status;
        printf("=================\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "asrResult: \n" << asrResult << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("=================\n");
    }
    void onError(const AsrError &error) override
    {
        int errCode = error.error_code;
        int errStatus = (int)error.status;
        std::string errMsg = error.message;
        printf("=================\n");
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("=================\n");
    }
    ~ASRCallbacksImpl() = default;
};
```
ASR Core Business Class .class AidVoiceASR
AidVoiceASR is the main functional body of the SDK, responsible for managing the complete lifecycle of speech recognition. Developers use the interfaces provided by this class for model loading, audio data pushing, and stopping recognition tasks. This class must be initialized in conjunction with FeatureConfig.
Create Instance Object .create_asr()
This interface is the first step in using the SDK. It constructs and initializes a specific ASR recognition object in memory based on the passed global configuration information (such as function type, model type, etc.).
| API | create_asr |
|---|---|
| Description | Constructs a specific ASR instance based on the passed configuration object |
| Parameters | cfg: Global configuration parameters used to specify model type and log level |
| Return | Returns a pointer to the AidVoiceASR instance on success. Returns nullptr on failure. |
💡Note
Special Instructions:
- Before calling this interface, ensure that `cfg.feature_type` has been set to `TYPE_ASR`.
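A typical creation call might look like the following sketch. Whether create_asr is exposed as a free function or a static member of AidVoiceASR is not specified above and is an assumption here.

```cpp
// Sketch: create an ASR instance from a populated FeatureConfig.
// The AidVoiceASR::create_asr scoping is an assumption; adjust to the header.
FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;
cfg.model_type   = ModelType::TYPE_WHISPER; // recommended for English scenarios
AidVoiceASR *asr = AidVoiceASR::create_asr(cfg);
if (asr == nullptr)
{
    printf("create_asr() failure!\n");
    return EXIT_FAILURE;
}
```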
Set Mode .set_mode()
This interface is used to set the working mode of ASR. Developers can set it to streaming or non-streaming mode based on business requirements (such as real-time voice interaction or offline long-text transcription).
| API | set_mode |
|---|---|
| Description | Sets the working mode for ASR recognition |
| Parameters | mode: The specific recognition mode |
| Return | void |
Set Callback .set_callback()
This interface is used to register the user-implemented callback listener instance with the ASR. After registration, the SDK will asynchronously push transcription text (onResult) or error information (onError) through this instance.
| API | set_callback |
|---|---|
| Description | Registers a callback listener object to receive asynchronously returned recognition results |
| Parameters | cb: Pointer to the instance of the user-defined ASRCallbacks implementation class |
| Return | void |
```cpp
// The callback must be dynamically allocated; its memory is released by the AidVoice underlying layer
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();
// After registration, ownership of the object is transferred to AidVoice internally
asr->set_callback(mASRCallbacks);
```
💡Note
Special Instructions:
- The passed callback instance must be dynamically allocated on the heap using the `new` keyword. Once `set_callback` is called, the lifecycle of this pointer is managed by the AidVoice underlying layer; the SDK automatically performs a `delete` when the ASR instance is destroyed.
Set Maximum Audio Processing Duration .set_echo_ms()
This interface is used to configure the maximum audio length that a single ASR inference task can receive, in milliseconds (ms).
| API | set_echo_ms |
|---|---|
| Description | Sets the audio duration threshold for a single ASR inference process |
| Parameters | echo_ms: Single processing duration threshold |
| Return | void |
💡Note
When setting this parameter, the single processing limit of the selected model must be observed:
- Whisper model: `echo_ms` must not exceed 24000 (24s).
- SenseVoice model: `echo_ms` must not exceed 15000 (15s).
Set Streaming Feedback Interval .set_step_ms()
This interface is designed specifically for Streaming Mode (TYPE_STREAM) and is used to configure the time frequency at which ASR returns intermediate transcription results (PARTIAL status).
| API | set_step_ms |
|---|---|
| Description | Sets the callback frequency for streaming transcription results (unit: milliseconds). A callback is triggered every time the specified duration of audio is processed. |
| Parameters | step_ms: Time step for result feedback. |
| Return | void |
💡Note
Special Attention:
- This setting only takes effect in streaming mode. In non-streaming mode, the system ignores this configuration and returns the final result directly upon completion of recognition.
- The smaller the `step_ms`, the higher the real-time performance. In microphone real-time input scenarios, setting `step_ms` too small may lead to incoherent output. This is because the underlying SDK uses an overwrite cache strategy to guarantee real-time performance for the "current moment": if model processing cannot keep up with the audio input, unprocessed old data may be overwritten by newly arrived audio, producing incoherent recognition results.
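For example, a streaming setup that emits intermediate results roughly twice per second could be configured as follows. The specific values are illustrative, not prescriptive.

```cpp
// Sketch: streaming configuration with illustrative values.
asr->set_mode(ASRMode::TYPE_STREAM); // step_ms only applies in streaming mode
asr->set_echo_ms(15000);             // within the SenseVoice 15 s per-process limit
asr->set_step_ms(500);               // partial results every ~500 ms of audio
```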
Set Audio Saving .set_save_audio()
This interface is primarily used for real-time microphone input scenarios. When enabled, the SDK automatically captures the raw audio stream received by the microphone and saves it locally in WAV format.
| API | set_save_audio |
|---|---|
| Description | Configures whether to save the raw audio data from the microphone input to a local file. |
| Parameters | save_audio: Boolean. `true` to enable saving; `false` to disable (default). |
| Return | void |
Initialization Interface .init()
After the ASR object is created, certain initialization operations (such as environment checks and resource building) must be executed.
| API | init |
|---|---|
| Description | Completes the necessary initialization work required for inference. |
| Parameters | void |
| Return | A value of 0 indicates that the initialization was successful; otherwise, a non-zero value indicates failure. |
```cpp
// ASR initialization; returns non-zero on error
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
    printf("asr->init() failure!\n");
    return EXIT_FAILURE;
}
```
Data Input .write()
After the ASR instance has successfully called init(), audio data to be recognized can be sent via the write() interface. The SDK supports multiple data source inputs to adapt to different business scenarios such as file transcription and streaming recording.
Audio File as Input Data
This interface directly reads a local audio file for recognition.
| API | write |
|---|---|
| Description | Passes the path of a 16kHz sample rate WAV audio file. The SDK will automatically parse the file and perform recognition. |
| Parameters | wav_16k_file: Absolute or relative path of the local audio file. |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
```cpp
// Audio file as input data; returns non-zero on error
std::string wave_path = "audio.wav";
int ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
```
💡Note
Special Attention: The audio file must be in WAV/PCM format with mono channel, 16-bit depth, and a sample rate of 16000Hz.
Raw Byte Stream as Input Data
This interface receives a raw audio byte stream from memory.
| API | write |
|---|---|
| Description | Pushes raw audio byte stream to ASR. |
| Parameters | data: Pointer to the audio data buffer. len: Byte length of the buffer data. |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
```cpp
// Raw byte stream as input data; returns non-zero on error
char *data = new char[data_len];
// ... fill the buffer with 16 kHz/mono/16-bit audio here ...
int ret = asr->write(data, data_len);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
```
Float Array as Input Data
This interface receives floating-point audio sample data, suitable for audio streams already pre-processed into a standard floating-point format.
| API | write |
|---|---|
| Description | Pushes a float-type audio sample array to ASR. |
| Parameters | audio_data: Float array containing audio samples. |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
```cpp
// Float array as input data; returns non-zero on error
std::vector<float> audio_;
// ... fill audio data here ...
int ret = asr->write(audio_);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
```
Real-time Microphone Input .audio_microphone()
This interface calls the microphone driver for audio collection based on the set microphone ID and sends the acquired data directly to the ASR underlying layer.
| API | audio_microphone |
|---|---|
| Description | Starts the microphone device with the specified ID and begins real-time speech recognition. |
| Parameters | id: Hardware device ID of the microphone; the default device is 0. |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
💡Note
Special Attention:
- This interface only supports streaming mode (`TYPE_STREAM`); `set_mode(TYPE_STREAM)` must be called before using this interface.
- In terminal environments, capturing system signals via Ctrl + C is supported to safely stop the input stream. If the microphone device is disconnected during collection, ASR also terminates the input automatically.
```cpp
// After execution, microphone device ID 1 starts receiving real-time speech.
asr->audio_microphone(1);
```
ASR Stop Input .stop()
This interface is used to notify the ASR underlying layer that the audio stream has ended. After calling, ASR will process all remaining audio in the buffer and trigger the corresponding onResult callback.
| API | stop |
|---|---|
| Description | Stops audio input. |
| Parameters | void |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
ASR Object Destruction .asr_destory()
This interface must be called when all speech recognition tasks are finished and the ASR functionality is no longer needed. It completely releases all resources.
| API | asr_destory |
|---|---|
| Description | Completely destroys the ASR instance and releases all associated resources. |
| Parameters | void |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
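The interfaces above combine into the following end-to-end sketch of an ASR session. Error handling is abbreviated; the create_asr scoping and enum qualification are assumptions, as noted earlier.

```cpp
// Sketch: full ASR lifecycle, abbreviated error handling.
FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;
cfg.model_type   = ModelType::TYPE_SENSEVOICE;

AidVoiceASR *asr = AidVoiceASR::create_asr(cfg); // factory scoping is an assumption
asr->set_mode(ASRMode::TYPE_NOSTREAM);
asr->set_callback(new ASRCallbacksImpl());       // released by the SDK on destruction

if (asr->init() != 0)
    return EXIT_FAILURE;
if (asr->write(std::string("audio.wav")) != 0)   // 16 kHz/mono/16-bit WAV
    return EXIT_FAILURE;

asr->stop();        // flush remaining buffered audio, triggering onResult
asr->asr_destory(); // release all resources (spelling per the SDK)
```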
Speech Synthesis TTS Related Interfaces
This section details the core API interfaces related to Text-to-Speech (TTS) in the AidVoice SDK. Developers can implement the entire workflow from TTS instance creation and text input to final audio acquisition through these interfaces.
Function Description: The TTS module is designed to convert input text into corresponding audio files. Currently, it supports two mainstream inference models: melotts_chinese and melotts_english.
Speech Synthesis Mode .enum TTSMode
TTSMode is used to configure the output strategy for synthesized audio. Developers should choose between full-sentence output or fragmented output based on the real-time requirements of the application scenario.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_WHOLE | uint8_t | 0 | Whole output: Once the entire text synthesis is complete, the full audio is returned in one callback. |
| TYPE_FRAGMENT | uint8_t | 1 | Fragment output: Slices based on punctuation or semantic pauses; synthesizes and outputs short phrases immediately. |
💡Note
Special Instructions:
- Whole output (whole): Treats the complete text as a single task and triggers the result callback only after all audio data is synthesized, ensuring the integrity of the output audio.
- Fragment output (fragment): Intelligently splits long text into multiple short segments based on punctuation and semantic pauses, outputting each fragment immediately upon completion of synthesis.
Synthesized Audio Status .enum TTSStatus
TTSStatus is used to identify the real-time status of the audio result returned by the TTS engine. By parsing this status, developers can accurately determine whether the currently received data is a "partial fragment" or the "final result" of the task.
| Member Name | Type | Value | Description |
|---|---|---|---|
| TYPE_PARTIAL | uint8_t | 0 | Partially synthesized audio |
| TYPE_FINAL | uint8_t | 1 | Fully synthesized audio |
💡Note
Special Instructions:
- Fragment Mode (TYPE_FRAGMENT): In fragment mode, if the status information is PARTIAL, it indicates that the current audio is only an intermediate slice (e.g., a short phrase) of the entire text. Only when the status flag is FINAL does it represent that the current synthesis task (the specific sentence) has completely ended.
- Whole Mode (TYPE_WHOLE): In whole mode, the status information for every audio synthesis is FINAL.
TTS Result Return Class .class TTSResult
The TTSResult structure is used to carry the audio synthesized by TTS and the audio status. After the SDK synthesizes a segment of audio, it encapsulates the audio's relevant information and its corresponding status (partial or final) into this class to return to the developer.
Member Variable List
TTSResult includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | TTSStatus | No default value | Status of the current returned audio |
| audio_name | std::string | No default value | Filename of the current output audio |
| audio_len | double | 0 | Duration of the current output audio, in seconds (s) |
| seq | int | 1 | Index of the current audio block within the input text synthesis sequence |
| id | int | 0 | ID value of the returned result |
TTS Error Return Class .class TTSError
The TTSError structure is used to carry exception information output by TTS. When an interface call returns a non-successful status, developers can use this class to obtain specific error codes and detailed description text.
Member Variable List
TTSError includes the following parameters:
| Member | Type | Default | Description |
|---|---|---|---|
| status | ResultStatus | ResultStatus::AV_OTHER | Error message status |
| error_code | int | -1 | Error code |
| message | std::string | No default value | Current returned error message |
TTS Callback Interface Class .class TTSCallbacks
TTSCallbacks is a virtual base class used to define the listening interface for the SDK to push data to the application layer. Developers need to inherit from this class and implement its virtual functions to asynchronously obtain synthesis results or error information.
Get Synthesis Results .onResult()
This callback is automatically triggered when the underlying TTS processes a segment of text and generates audio.
| API | onResult |
|---|---|
| Description | Speech synthesis result callback function |
| Parameters | result: Synthesis result object, containing information about the current synthesized audio and the audio status |
| Return | void |
Get Error Information .onError()
This callback is automatically triggered when an exception occurs during TTS operation.
| API | onError |
|---|---|
| Description | Error message callback function. Used to receive and process various exceptions generated during TTS operation |
| Parameters | error: Error message object, containing the current error description, error code, etc. |
| Return | void |
```cpp
class TTSCallbacksImpl : public TTSCallbacks
{
public:
    void onResult(const TTSResult &result) override
    {
        std::string audio_name = result.audio_name;
        double audio_len = result.audio_len;
        int seq = result.seq;
        int sid = result.id;
        TTSStatus status = result.status;
        printf("=================\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "audio_name: " << audio_name << std::endl;
        std::cout << "audio_len: " << audio_len << std::endl;
        std::cout << "seq: " << seq << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("=================\n");
    }
    void onError(const TTSError &error) override
    {
        int errCode = error.error_code;
        int errStatus = (int)error.status;
        std::string errMsg = error.message;
        printf("=================\n");
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("=================\n");
    }
    ~TTSCallbacksImpl() = default;
};
```
TTS Core Business Class .class AidVoiceTTS
AidVoiceTTS is the functional core of the SDK, responsible for managing the complete lifecycle of speech synthesis. Developers interact with the interfaces provided by this class for model loading, pushing synthesis text, and stopping audio synthesis tasks. This class must be initialized in conjunction with FeatureConfig.
Create Instance Object .create_tts()
This interface is the first step in using the SDK. It builds and initializes a specific TTS speech synthesis object in memory based on the global configuration information provided (such as function type, model type, etc.).
| API | create_tts |
|---|---|
| Description | Builds a specific TTS instance based on the provided configuration object |
| Parameters | cfg: Global configuration parameters, used to specify model type and log level |
| Return | Successful return of an AidVoiceTTS instance pointer. Failure returns nullptr. |
💡Note
Special Instructions:
- Before calling this interface, ensure that `cfg.feature_type` has been set to `TYPE_TTS`.
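A typical creation call might look like the following sketch. As with create_asr, whether create_tts is a free function or a static member of AidVoiceTTS is an assumption here.

```cpp
// Sketch: create a TTS instance; factory scoping is an assumption.
FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_TTS;
cfg.model_type   = ModelType::TYPE_MELOTTS_ENGLISH; // English synthesis
AidVoiceTTS *tts = AidVoiceTTS::create_tts(cfg);
if (tts == nullptr)
{
    printf("create_tts() failure!\n");
    return EXIT_FAILURE;
}
```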
Set Mode .set_mode()
This interface is used to set the working mode of TTS. Developers can set it to whole output or fragment output based on business needs. By default, TTS operates in whole mode (TYPE_WHOLE).
| API | set_mode |
|---|---|
| Description | Sets the working mode of TTS |
| Parameters | mode: The specific working mode |
| Return | void |
Set Callback .set_callback()
This interface is used to register a user-implemented callback listener instance with the TTS. Once registered, the SDK will asynchronously push synthesized audio information (onResult) or error information (onError) through this instance.
| API | set_callback |
|---|---|
| Description | Registers a callback listener object to receive asynchronously returned audio synthesis information. |
| Parameters | cb: Pointer to the instance of the user-defined TTSCallbacks implementation class. |
| Return | void |
```cpp
// The callback must be dynamically allocated; its memory is released by the AidVoice underlying layer.
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();
// After registration, ownership of the object is transferred to AidVoice internally.
tts->set_callback(mTTSCallbacks);
```
💡Note
Special Instructions:
- The passed callback instance must be dynamically allocated on the heap using the `new` keyword. Once `set_callback` is called, the lifecycle of this pointer is managed by the AidVoice underlying layer; the SDK automatically performs a `delete` when the TTS instance is destroyed.
Initialization Interface .init()
After the TTS object is created, certain initialization operations (such as environment checks and resource building) must be performed.
| API | init |
|---|---|
| Description | Completes the necessary initialization work required for inference. |
| Parameters | void |
| Return | A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure. |
```cpp
// TTS initialization; returns non-zero on failure.
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
    printf("tts->init() failure!\n");
    return EXIT_FAILURE;
}
```
Data Input .write()
After the TTS instance is successfully initialized (i.e., init() returns success), developers can submit text data to be synthesized via the write() interface. This interface supports bulk text input (received as a string array) and returns synthesized audio asynchronously.
| API | write |
|---|---|
| Description | Delivers text data to the TTS. The interface accepts a string array (vector<string>), allowing for the simultaneous submission of multiple independent text segments. |
| Parameters | Text array for synthesis. Each element in the array is treated as an independent synthesis task. |
| Return | A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure. |
```cpp
// String array as input data; returns non-zero on failure.
std::vector<std::string> str_vec = {"I am a boy.", "I like Aplux."};
int ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
    printf("tts->write() failure!\n");
    return EXIT_FAILURE;
}
```
💡Note
Special Attention: After text data is delivered, the system will asynchronously return the synthesized audio stream through the callback interface. The output audio strictly adheres to the following specifications:
- File encapsulation: Standard WAV format
- Sample rate: 44100 Hz
- Channel count: Mono (Single channel)
TTS Stop Input .stop()
This interface is used to notify the TTS engine that the current input phase has ended. Once called, the SDK will no longer accept new text input, but it will ensure that the remaining text already in the buffer is processed completely.
| API | stop |
|---|---|
| Description | Formally closes the text input stream. |
| Parameters | void |
| Return | A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure. |
TTS Object Destruction .tts_destory()
This interface must be called when all audio synthesis tasks are finished and the TTS functionality is no longer needed within the application lifecycle. This operation will completely release the system resources occupied by the SDK.
| API | tts_destory |
| Description | Completely destroys the TTS instance and releases all associated resources. |
| Parameters | void |
| Return | A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure. |
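A brief sketch of the teardown step, again assuming the `tts` handle from the earlier snippets; call it only after stop() and after all callbacks have completed:

```cpp
// Destroy the TTS instance and release all SDK resources.
int ret = tts->tts_destory();
if (ret != EXIT_SUCCESS)
{
    printf("tts->tts_destory() failure!\n");
    return EXIT_FAILURE;
}
```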
Other Methods
Besides the inference-related interfaces mentioned above, the AidVoice SDK also provides several auxiliary interfaces.
Get Microphone List .show_mircophone_dev()
Before calling audio_mircophone(), it is recommended to call this interface to enumerate the available audio input devices on the current system to obtain the correct device ID.
| API | show_mircophone_dev |
| Description | Lists all available microphone hardware devices in the system. This interface prints the device names and their corresponding IDs to standard output (stdout) or the log system. |
| Parameters | void |
| Return | A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure. |
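The receiver of this call is not spelled out above; the sketch below assumes it is invoked on the functional instance (here `asr`), like the other interfaces:

```cpp
// Enumerate available microphone devices and their IDs,
// so the right ID can be passed to audio_mircophone() later.
int ret = asr->show_mircophone_dev();
if (ret != EXIT_SUCCESS)
{
    printf("show_mircophone_dev() failure!\n");
    return EXIT_FAILURE;
}
```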
Get Current AidVoice SDK Version .get_library_version()
Retrieves information regarding the version of the current AidVoice SDK.
| API | get_library_version |
| Description | Obtains version information of the current AidVoice SDK. |
| Parameters | void |
| Return | string: Version information. |
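A one-line usage sketch; as above, the `asr` receiver is an assumption, since the document does not state whether this is an instance method or a free function:

```cpp
// Query and print the SDK version string.
std::string version = asr->get_library_version();
std::cout << "AidVoice SDK version: " << version << std::endl;
```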
Get Current Log Level .get_log_level()
| API | get_log_level |
| Description | Retrieves the current log level. |
| Parameters | void |
| Return | LogLevel: Log level. |
Set Log Level .set_log_level()
| API | set_log_level |
| Description | Sets the log level. |
| Parameters | LogLevel: Log level. |
| Return | Returns 0 by default. |
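A combined sketch of reading and setting the log level. The concrete `LogLevel` enumerators are not listed in this document, so the example simply re-applies the current level; substitute the desired value in real code. The `asr` receiver is an assumption:

```cpp
// Query the current log level (printed numerically, since the
// LogLevel enumerators are not listed in this document).
LogLevel level = asr->get_log_level();
std::cout << "current log level: " << (int)level << std::endl;

// Re-apply a level; pass the desired LogLevel value here.
int ret = asr->set_log_level(level);
```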
Output Log to Console .log_to_console()
| API | log_to_console |
| Description | Configures the log information to be output to the standard error terminal (stderr). |
| Parameters | void |
| Return | Returns 0 by default. |
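A one-line sketch (receiver assumed to be the functional instance, as above):

```cpp
// Route subsequent SDK log output to stderr.
asr->log_to_console();
```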
Output Log to Text File .log_to_file()
| API | log_to_file |
| Description | Configures log information to be output to a specified text file. |
| Parameters | path_and_prefix: storage path and filename prefix for the log files. also_to_console: whether to also output logs to stderr (default: false). |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
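A hedged sketch; the path and prefix below are illustrative only, and the `asr` receiver is an assumption:

```cpp
// Write logs to files under /tmp with the prefix "aidvoice"
// (illustrative path), and also mirror them to stderr.
int ret = asr->log_to_file("/tmp/aidvoice", true);
if (ret != EXIT_SUCCESS)
{
    printf("log_to_file() failure!\n");
    return EXIT_FAILURE;
}
```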
AidVoice C++ Example Programs
AidVoice ASR Recognition Example
Taking speech recognition from an audio file as an example, a C++ program generally consists of the following parts:
// Global configuration information
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;
cfg.model_type = ModelType::TYPE_WHISPER;
// Construct ASR object
auto asr = AidLux::AidVoice::create_asr(cfg);
if (!asr)
{
printf("create_asr failure!\n");
return EXIT_FAILURE;
}
// Implement the callback interface
class ASRCallbacksImpl : public ASRCallbacks
{
public:
void onResult(const AsrResult &result) override
{
std::string asrResult = result.text;
int sid = result.id;
AsrStatus status = result.status;
printf("=================\n");
std::cout << "sid: " << sid << std::endl;
std::cout << "asrResult: \n"
<< asrResult << std::endl;
std::cout << "status: " << (int)status << std::endl;
printf("=================\n");
}
void onError(const AsrError &error) override
{
std::string errMsg = error.message;
printf("=================\n");
std::cout << "errMsg: " << errMsg << std::endl;
printf("=================\n");
}
~ASRCallbacksImpl() = default;
};
// Create callback class and set the callback
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();
asr->set_callback(mASRCallbacks);
// Initialize audio object
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
printf("asr->init() failure!\n");
return EXIT_FAILURE;
}
// Pass audio data
ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
printf("asr->write() failure!\n");
return EXIT_FAILURE;
}
// Stop input data
ret = asr->stop();
if (ret != EXIT_SUCCESS)
{
printf("asr->stop() failure!\n");
return EXIT_FAILURE;
}
// Destroy object
ret = asr->asr_destory();
if (ret != EXIT_SUCCESS)
{
printf("asr->asr_destory() failure!\n");
return EXIT_FAILURE;
}
AidVoice TTS Synthesis Example
A C++ speech synthesis program generally consists of the following parts:
// Global configuration information
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_TTS;
cfg.model_type = ModelType::TYPE_MELOTTS_ENGLISH;
// Construct TTS object
auto tts = AidLux::AidVoice::create_tts(cfg);
if (!tts)
{
printf("create tts failure!\n");
return EXIT_FAILURE;
}
// Set TTS working mode
tts->set_mode(TTSMode::TYPE_WHOLE);
// Implement the callback interface
class TTSCallbacksImpl : public TTSCallbacks
{
public:
void onResult(const TTSResult &result) override
{
std::string audio_name = result.audio_name;
double audio_len = result.audio_len;
int seq = result.seq;
int sid = result.id;
TTSStatus status = result.status;
printf("=================\n");
std::cout << "sid: " << sid << std::endl;
std::cout << "audio_name: " << audio_name << std::endl;
std::cout << "audio_len: " << audio_len << std::endl;
std::cout << "seq: " << seq << std::endl;
std::cout << "status: " << (int)status << std::endl;
printf("=================\n");
}
void onError(const TTSError &error) override
{
std::string errMsg = error.message;
printf("=================\n");
std::cout << "errMsg: " << errMsg << std::endl;
printf("=================\n");
}
~TTSCallbacksImpl() = default;
};
// Create callback class and set the callback
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();
tts->set_callback(mTTSCallbacks);
// Initialize TTS object
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
printf("tts->init() failure!\n");
return EXIT_FAILURE;
}
// Pass synthesis text
std::vector<std::string> str_vec = {"This is an example of text to speech using Melo for English. How does it sound?"};
ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
printf("tts->write() failure!\n");
return EXIT_FAILURE;
}
// Stop input data
ret = tts->stop();
if (ret != EXIT_SUCCESS)
{
printf("tts->stop() failure!\n");
return EXIT_FAILURE;
}
// Destroy object
ret = tts->tts_destory();
if (ret != EXIT_SUCCESS)
{
printf("tts->tts_destory() failure!\n");
return EXIT_FAILURE;
}
💡Note
More usage examples are stored in the following path:
- CPP example program path:
/usr/local/share/aidvoice/examples/asr/cpp/
This concludes the documentation of the AidVoice SDK interfaces.