AidVoice C++ API Documentation

💡Note: When developing with AidVoice-SDK C++, please keep the following in mind:

  • Include the header file during compilation, located at: /usr/local/include/aidlux/aidvoice/aidvoice_speech.hpp
  • Specify the library file during linking, located at: /usr/local/lib/libaidvoice_speech.so

Function Type .enum FeatureType

FeatureType is used to specify core business modules when initializing the AidVoice SDK. Since the AidVoice SDK includes various voice functions, developers must explicitly specify the specific voice function through this enumeration when creating a functional instance (Object). Currently, the SDK supports Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) functions, with more voice features being continuously integrated.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_ASR | uint8_t | 1 | Speech Recognition |
| TYPE_TTS | uint8_t | 2 | Speech Synthesis |

Model Type .enum ModelType

ModelType defines the inference model algorithms supported by the AidVoice SDK. Users should select the appropriate model based on the application scenario. For ASR, the whisper_base and sensevoice_small models are currently supported; for TTS, the melotts_chinese and melotts_english models are supported.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_WHISPER | uint8_t | 1 | whisper_base model |
| TYPE_SENSEVOICE | uint8_t | 2 | sensevoice_small model |
| TYPE_MELOTTS_CHINESE | uint8_t | 3 | melotts_chinese model |
| TYPE_MELOTTS_ENGLISH | uint8_t | 4 | melotts_english model |

💡Note

Special Instructions: TYPE_WHISPER and TYPE_SENSEVOICE are ASR models, both featuring multi-language support (Chinese, English, etc.).

  • English scenarios: It is recommended to prioritize TYPE_WHISPER, as it performs better in English recognition accuracy and semantic understanding.
  • Chinese scenarios: It is recommended to prioritize TYPE_SENSEVOICE, as it performs better in Chinese recognition accuracy and semantic understanding.

TYPE_MELOTTS_CHINESE and TYPE_MELOTTS_ENGLISH are TTS models, supporting Chinese and English speech synthesis respectively.

Audio Type .enum AudioType

When performing Speech Recognition (ASR) or Speech Synthesis (TTS) tasks, it is necessary to specify the encoding format and sampling attributes of the input/output audio. By setting this enumeration, the SDK can correctly parse audio stream data or generate audio files in the specified format.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_DEFAULT | uint8_t | 0 | Invalid data type |
| TYPE_WAV | uint8_t | 1 | WAV format audio |
| TYPE_PCM | uint8_t | 2 | PCM format audio |

💡Note

To ensure the accuracy of Automatic Speech Recognition (ASR) and system stability, the raw audio stream fed into the SDK must strictly comply with the following standards:

  • Sampling Frequency: Fixed at 16 kHz.
  • Channel Configuration: Supports Mono input only.
  • Data Precision: Must be 16-bit (Signed 16-bit) depth.

Currently, the TTS module only supports outputting audio streams in WAV format; the relevant audio enumeration values are primarily used by the ASR module as criteria for identifying different input audio types.

Log Level .enum LogLevel

The AidVoice SDK provides interfaces for configuring logging (introduced later in this document). Use this enumeration to specify the log level the SDK should use.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| INFO | uint8_t | 0 | Information |
| WARNING | uint8_t | 1 | Warning |
| ERROR | uint8_t | 2 | Error |
| FATAL | uint8_t | 3 | Fatal Error |
| DEBUG | uint8_t | 4 | Debug |
| OFF | uint8_t | 5 | Off |

Return Result Status .enum ResultStatus

ResultStatus is used to define the return status of all SDK executions. By checking this enumeration value, developers can determine whether the current execution flow was successful. If a non-successful status is returned, troubleshooting can be performed based on specific error information.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| AV_OK | uint8_t | 0 | Execution successful |
| AV_ERR_INVALID_ARG | uint8_t | 1 | Invalid argument |
| AV_ERR_LOAD_FILE | uint8_t | 2 | File loading failed |
| AV_ERR_RUN_FAIL | uint8_t | 3 | Runtime error |
| AV_ERR_UNSUPPORTED | uint8_t | 4 | Operation not supported |
| AV_ERR_GENERATE_OBJECT | uint8_t | 5 | Object creation failed |
| AV_OTHER | uint8_t | 6 | Other |

Global Configuration Class .class FeatureConfig

The FeatureConfig structure is used to store all configuration information required to construct a specific function object. Before initializing an SDK instance, developers need to instantiate this class and set the function type, model selection, and log level according to business requirements.

Member Variable List

FeatureConfig includes the following parameters:

| Member | Type | Default | Description |
| --- | --- | --- | --- |
| feature_type | FeatureType | No default value | Specifies the SDK's business mode (e.g., speech recognition or speech synthesis) |
| model_type | ModelType | No default value | Specifies the model used by the function object |
| log_type | LogLevel | OFF | Specifies the log level |

This section details the core API interfaces related to Automatic Speech Recognition (ASR) in the AidVoice SDK. Developers can use these interfaces to complete the entire process from ASR instance creation and audio stream input to retrieving recognition results.

Function Description: The ASR module is designed to convert input 16k/Mono/16-bit raw audio streams into text information in real-time or offline. Currently, it supports two mainstream inference models: sensevoice_small and whisper_base.

Speech Recognition Mode .enum ASRMode

ASRMode is used to define the output strategy for speech recognition results. Developers should choose between incremental feedback (Streaming) or full-sentence output (Non-Streaming) based on the real-time requirements of the application scenario.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_STREAM | uint8_t | 0 | Streaming output: can return intermediate transcription results |
| TYPE_NOSTREAM | uint8_t | 1 | Non-streaming output: returns the final transcription result per process |

💡Note

Special Instructions:

  • Streaming: Generates temporary transcription results in real-time during audio processing. As subsequent audio streams are input, the SDK continuously corrects and updates intermediate text. This mode allows returning data before the audio buffer is completely processed, significantly reducing first-word latency and improving interactive experience.
  • Non-Streaming: Transcribes based on a fixed audio duration. If speech ends early (e.g., user stops inputting) or the speech duration is short, the system will immediately generate and return the final transcription text.

💡Note

Different inference models have strict limits on the audio duration for a single process. Developers must pay special attention when configuring non-streaming transcription or pre-processing audio segments:

  • TYPE_WHISPER: Maximum audio input length per process is 24s.
  • TYPE_SENSEVOICE: Maximum audio input length per process is 15s.
  • If the audio data sent in a single process exceeds these limits, the SDK will truncate the input audio data, and the truncated subsequent audio will automatically trigger a new round of transcription tasks.

Speech Transcription Status .enum AsrStatus

AsrStatus is used to identify the state of the text results returned by ASR within the current recognition cycle. By parsing this status, developers can distinguish whether the current result is an "intermediate word being corrected" or a "finalized statement."

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_PARTIAL | uint8_t | 0 | Intermediate transcription in progress; transcription not finished |
| TYPE_FINAL | uint8_t | 1 | Final transcription result; transcription completed |

💡Note

Special Instructions:

  • Streaming Mode (TYPE_STREAM): In streaming mode, if the status information is PARTIAL, it indicates intermediate results returned before the current audio buffer is fully processed. Only when the status flag is FINAL does it indicate that the processing of the current buffer data is complete.
  • Non-Streaming Mode (TYPE_NOSTREAM): In non-streaming mode, the status information for every transcription is FINAL.
```cpp
// Streaming Mode:
I am.                                TYPE_PARTIAL
I am a boy.                          TYPE_PARTIAL
I am a boy. I like Aplux.            TYPE_FINAL
// Non-Streaming Mode:
I am a boy. I like Aplux.            TYPE_FINAL
```

ASR Result Return Class .class AsrResult

The AsrResult structure is used to carry the transcription data and status output by ASR. When the SDK completes processing a segment of audio, it encapsulates the recognized text content and its corresponding real-time status (intermediate or final) into this class to return to the developer.

Member Variable List

AsrResult includes the following parameters:

| Member | Type | Default | Description |
| --- | --- | --- | --- |
| status | AsrStatus | No default value | Transcription status of the current return result |
| text | std::string | No default value | Text result of the current transcription |
| id | int | 0 | ID value of the return result |

ASR Error Return Class .class AsrError

The AsrError structure is used to carry exception information output by ASR. When an interface call returns a non-successful status, developers can use this class to obtain specific error codes and detailed description text.

Member Variable List

AsrError includes the following parameters:

| Member | Type | Default | Description |
| --- | --- | --- | --- |
| status | ResultStatus | No default value | Error information status |
| error_code | int | No default value | Error code |
| message | std::string | No default value | Current returned error message |

ASR Callback Interface Class .class ASRCallbacks

ASRCallbacks is a virtual base class used to define the listening interfaces for pushing data from the SDK to the application layer. Developers need to inherit from this class and implement its virtual functions to asynchronously obtain recognition results or error information.

Get Transcription Results .onResult()

This callback is automatically triggered when the underlying ASR processes a segment of audio and generates text.

API onResult
Description Speech recognition result callback function
Parameters result: Transcription result object, containing the currently recognized text content and result status
Return void

Get Error Information .onError()

This callback is automatically triggered when the ASR encounters an exception during operation.

API onError
Description Error information callback function. Used to receive and process various exceptions generated during ASR operation
Parameters error: Error information object, containing the current error description, error code, etc.
Return void
```cpp
class ASRCallbacksImpl : public ASRCallbacks
{
public:
    void onResult(const AsrResult &result) override
    {
        std::string asrResult = result.text;
        int sid = result.id;
        AsrStatus status = result.status;
        printf("=================\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "asrResult: \n"
                  << asrResult << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("=================\n");
    }

    void onError(const AsrError &error) override
    {
        int errCode = error.error_code;
        int errStatus = (int)error.status;
        std::string errMsg = error.message;
        printf("=================\n");
        std::cout << "errCode: " << errCode << ", status: " << errStatus << std::endl;
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("=================\n");
    }
    ~ASRCallbacksImpl() = default;
};
```

ASR Core Business Class .class AidVoiceASR

AidVoiceASR is the main functional body of the SDK, responsible for managing the complete lifecycle of speech recognition. Developers use the interfaces provided by this class for model loading, audio data pushing, and stopping recognition tasks. This class must be initialized in conjunction with FeatureConfig.

Create Instance Object .create_asr()

This interface is the first step in using the SDK. It constructs and initializes a specific ASR recognition object in memory based on the passed global configuration information (such as function type, model type, etc.).

API create_asr
Description Constructs a specific ASR instance based on the passed configuration object
Parameters cfg: Global configuration parameters used to specify model type and log level
Return Returns a pointer to the AidVoiceASR instance on success. Returns nullptr on failure.

💡Note

Special Instructions:

  • Before calling this interface, ensure that cfg.feature_type has been set to TYPE_ASR.

Set Mode .set_mode()

This interface is used to set the working mode of ASR. Developers can set it to streaming or non-streaming mode based on business requirements (such as real-time voice interaction or offline long-text transcription).

API set_mode
Description Sets the working mode for ASR recognition
Parameters mode: The specific recognition mode
Return void

Set Callback .set_callback()

This interface is used to register the user-implemented callback listener instance with the ASR. After registration, the SDK will asynchronously push transcription text (onResult) or error information (onError) through this instance.

API set_callback
Description Registers a callback listener object to receive asynchronously returned recognition results
Parameters cb: Pointer to the instance of the user-defined ASRCallbacks implementation class
Return void
```cpp
// This must be dynamically allocated; memory is released by the AidVoice underlying layer
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();

// After registration, ownership of the object is transferred to AidVoice internally
asr->set_callback(mASRCallbacks);
```

💡Note

Special Instructions:

  • The passed callback instance must be dynamically allocated on the heap using the new keyword. Once set_callback is called, the lifecycle of this pointer will be managed by the AidVoice underlying layer. The SDK will automatically perform a delete operation when the ASR instance is destroyed.

Set Maximum Audio Processing Duration .set_echo_ms()

This interface is used to configure the maximum audio length that a single ASR inference task can receive, in milliseconds (ms).

API set_echo_ms
Description Sets the audio duration threshold for a single ASR inference process
Parameters echo_ms: Single processing duration threshold
Return void

💡Note

When setting this parameter, the single processing limit of the selected model must be observed:

  • Whisper model: Must not exceed 24000 (24s).
  • SenseVoice model: echo_ms must not exceed 15000 (15s).

Set Streaming Feedback Interval .set_step_ms()

This interface is designed specifically for Streaming Mode (TYPE_STREAM) and is used to configure the time frequency at which ASR returns intermediate transcription results (PARTIAL status).

API set_step_ms
Description Sets the callback frequency for streaming transcription results (unit: milliseconds). A callback is triggered every time the specified duration of audio is processed.
Parameters step_ms: Time step for result feedback.
Return void

💡Note

Special Attention:

  • This setting only takes effect in streaming mode. In non-streaming mode, the system ignores this configuration and returns the final result directly upon completion of recognition.
  • The smaller the step_ms, the higher the real-time performance. In microphone real-time input scenarios, setting step_ms too small may lead to incoherent output. This is because the underlying SDK uses an overwrite cache strategy to ensure real-time performance for the "current moment." If the model processing speed cannot keep up with the audio input speed, unprocessed old data may be overwritten by newly arrived audio, resulting in incoherent recognition results.

Set Audio Saving .set_save_audio()

This interface is primarily used for real-time microphone input scenarios. When enabled, the SDK automatically captures the raw audio stream received by the microphone and saves it locally in WAV format.

API set_save_audio
Description Configures whether to save the raw audio data from the microphone input to a local file.
Parameters save_audio: Boolean. `true` to enable saving; `false` to disable (default).
Return void

Initialization Interface .init()

After the ASR object is created, certain initialization operations (such as environment checks and resource building) must be executed.

API init
Description Completes the necessary initialization work required for inference.
Parameters void
Return A value of 0 indicates that the initialization was successful; otherwise, a non-zero value indicates failure.
```cpp
// ASR Initialization; returns non-zero on error
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
    printf("asr->init() failure!\n");
    return EXIT_FAILURE;
}
```

Data Input .write()

After the ASR instance has successfully called init(), audio data to be recognized can be sent via the write() interface. The SDK supports multiple data source inputs to adapt to different business scenarios such as file transcription and streaming recording.

Audio File as Input Data

This interface directly reads a local audio file for recognition.

API write
Description Passes the path of a 16kHz sample rate WAV audio file. The SDK will automatically parse the file and perform recognition.
Parameters wav_16k_file: Absolute or relative path of the local audio file.
Return A value of 0 indicates success; otherwise, a non-zero value indicates failure.
```cpp
// Audio file as input data; returns non-zero on error
std::string wave_path = "audio.wav";
int ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
```

💡Note

Special Attention: The audio file must be in WAV/PCM format with mono channel, 16-bit depth, and a sample rate of 16000Hz.

Raw Byte Stream as Input Data

This interface receives a raw audio byte stream from memory.

API write
Description Pushes raw audio byte stream to ASR.
Parameters data: Pointer to the audio data buffer. len: Byte length of the buffer data.
Return A value of 0 indicates success; otherwise, a non-zero value indicates failure.
```cpp
// Raw byte stream as input data; returns non-zero on error
char *data = new char[data_len];  // data_len: byte length of the audio payload
// ... Fill audio data here ...
int ret = asr->write(data, data_len);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
delete[] data;
```

Float Array as Input Data

This interface receives floating-point audio sample data, suitable for audio streams already pre-processed into a standard floating-point format.

API write
Description Pushes a float-type audio sample array to ASR.
Parameters audio_data: Float array containing audio samples.
Return A value of 0 indicates success; otherwise, a non-zero value indicates failure.
```cpp
// Float array as input data; returns non-zero on error
std::vector<float> audio_;
// ... Fill audio data here ...
int ret = asr->write(audio_);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}
```

Real-time Microphone Input .audio_microphone()

This interface calls the microphone driver for audio collection based on the set microphone ID and sends the acquired data directly to the ASR underlying layer.

API audio_microphone
Description Starts the microphone device with the specified ID and begins real-time speech recognition.
Parameters id: Hardware device ID of the microphone; the default device is 0.
Return A value of 0 indicates success; otherwise, a non-zero value indicates failure.

💡Note

Special Attention:

  • This interface only supports streaming mode (TYPE_STREAM). set_mode(TYPE_STREAM) must be called before using this interface.
  • In terminal environments, it supports capturing system signals via Ctrl + C to safely stop the input stream. If the microphone device is disconnected during collection, ASR will also automatically terminate the input.
```cpp
// After execution, microphone device ID 1 will start receiving real-time speech.
asr->audio_microphone(1);
```

ASR Stop Input .stop()

This interface is used to notify the ASR underlying layer that the audio stream has ended. After calling, ASR will process all remaining audio in the buffer and trigger the corresponding onResult callback.

API stop
Description Stops audio input.
Parameters void
Return A value of 0 indicates success; otherwise, a non-zero value indicates failure.

ASR Object Destruction .asr_destory()

This interface must be called when all speech recognition tasks are finished and the ASR functionality is no longer needed. It will completely release resources.

API asr_destory
Description Completely destroys the ASR instance and releases all associated resources.
Parameters void
Return A value of 0 indicates success; otherwise, a non-zero value indicates failure.

This section details the core API interfaces related to Text-to-Speech (TTS) in the AidVoice SDK. Developers can implement the entire workflow from TTS instance creation and text input to final audio acquisition through these interfaces.

Function Description: The TTS module is designed to convert input text into corresponding audio files. Currently, it supports two mainstream inference models: melotts_chinese and melotts_english.

Speech Synthesis Mode .enum TTSMode

TTSMode is used to configure the output strategy for synthesized audio. Developers should choose between full-sentence output or fragmented output based on the real-time requirements of the application scenario.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_WHOLE | uint8_t | 0 | Whole output: once the entire text synthesis is complete, the full audio is returned in one callback |
| TYPE_FRAGMENT | uint8_t | 1 | Fragment output: slices based on punctuation or semantic pauses; synthesizes and outputs short phrases immediately |

💡Note

Special Instructions:

  • Whole output (whole): Treats the complete text as a single task and triggers the result callback only after all audio data is synthesized, ensuring the integrity of the output audio.
  • Fragment output (fragment): Intelligently splits long text into multiple short segments based on punctuation and semantic pauses, outputting each fragment immediately upon completion of synthesis.

Synthesized Audio Status .enum TTSStatus

TTSStatus is used to identify the real-time status of the audio result returned by the TTS engine. By parsing this status, developers can accurately determine whether the currently received data is a "partial fragment" or the "final result" of the task.

| Member Name | Type | Value | Description |
| --- | --- | --- | --- |
| TYPE_PARTIAL | uint8_t | 0 | Partially synthesized audio |
| TYPE_FINAL | uint8_t | 1 | Fully synthesized audio |

💡Note

Special Instructions:

  • Fragment Mode (TYPE_FRAGMENT): In fragment mode, if the status information is PARTIAL, it indicates that the current audio is only an intermediate slice (e.g., a short phrase) of the entire text. Only when the status flag is FINAL does it represent that the current synthesis task (the specific sentence) has completely ended.
  • Whole Mode (TYPE_WHOLE): In whole mode, the status information for every audio synthesis is FINAL.

TTS Result Return Class .class TTSResult

The TTSResult structure is used to carry the audio synthesized by TTS and the audio status. After the SDK synthesizes a segment of audio, it encapsulates the audio's relevant information and its corresponding status (partial or final) into this class to return to the developer.

Member Variable List

TTSResult includes the following parameters:

| Member | Type | Default | Description |
| --- | --- | --- | --- |
| status | TTSStatus | No default value | Status of the current returned audio |
| audio_name | std::string | No default value | Filename of the current output audio |
| audio_len | double | 0 | Duration of the current output audio, in seconds (s) |
| seq | int | 1 | Indicates which segment in the input text synthesis sequence the current audio block belongs to |
| id | int | 0 | ID value of the return result |

TTS Error Return Class .class TTSError

The TTSError structure is used to carry exception information output by TTS. When an interface call returns a non-successful status, developers can use this class to obtain specific error codes and detailed description text.

Member Variable List

TTSError includes the following parameters:

| Member | Type | Default | Description |
| --- | --- | --- | --- |
| status | ResultStatus | ResultStatus::AV_OTHER | Error message status |
| error_code | int | -1 | Error code |
| message | std::string | No default value | Current returned error message |

TTS Callback Interface Class .class TTSCallbacks

TTSCallbacks is a virtual base class used to define the listening interface for the SDK to push data to the application layer. Developers need to inherit from this class and implement its virtual functions to asynchronously obtain synthesis results or error information.

Get Synthesis Results .onResult()

This callback is automatically triggered when the underlying TTS processes a segment of text and generates audio.

API onResult
Description Speech synthesis result callback function
Parameters result: Synthesis result object, containing information about the current synthesized audio and the audio status
Return void

Get Error Information .onError()

This callback is automatically triggered when the TTS encounters an exception during operation.

API onError
Description Error message callback function. Used to receive and process various exceptions generated during TTS operation
Parameters error: Error message object, containing the current error description, error code, etc.
Return void
```cpp
class TTSCallbacksImpl : public TTSCallbacks
{
public:
    void onResult(const TTSResult &result) override
    {
        std::string audio_name = result.audio_name;
        double audio_len = result.audio_len;
        int seq = result.seq;
        int sid = result.id;
        TTSStatus status = result.status;
        printf("=================\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "audio_name: " << audio_name << std::endl;
        std::cout << "audio_len: " << audio_len << std::endl;
        std::cout << "seq: " << seq << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("=================\n");
    }

    void onError(const TTSError &error) override
    {
        int errCode = error.error_code;
        int errStatus = (int)error.status;
        std::string errMsg = error.message;
        printf("=================\n");
        std::cout << "errCode: " << errCode << ", status: " << errStatus << std::endl;
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("=================\n");
    }
    ~TTSCallbacksImpl() = default;
};
```

TTS Core Business Class .class AidVoiceTTS

AidVoiceTTS is the functional core of the SDK, responsible for managing the complete lifecycle of speech synthesis. Developers interact with the interfaces provided by this class for model loading, pushing synthesis text, and stopping audio synthesis tasks. This class must be initialized in conjunction with FeatureConfig.

Create Instance Object .create_tts()

This interface is the first step in using the SDK. It builds and initializes a specific TTS speech synthesis object in memory based on the global configuration information provided (such as function type, model type, etc.).

API create_tts
Description Builds a specific TTS instance based on the provided configuration object
Parameters cfg: Global configuration parameters, used to specify model type and log level
Return Successful return of an AidVoiceTTS instance pointer. Failure returns nullptr.

💡Note

Special Instructions:

  • Before calling this interface, ensure that cfg.feature_type is set to TYPE_TTS.

Set Mode .set_mode()

This interface is used to set the working mode of TTS. Developers can set it to whole output or fragment output based on business needs. By default, TTS operates in whole mode (TYPE_WHOLE).

API set_mode
Description Sets the working mode of TTS
Parameters mode: The specific working mode
Return void

Set Callback .set_callback()

This interface is used to register a user-implemented callback listener instance with the TTS. Once registered, the SDK will asynchronously push synthesized audio information (onResult) or error information (onError) through this instance.

API set_callback
Description Registers a callback listener object to receive asynchronously returned audio synthesis information.
Parameters cb: Pointer to the instance of the user-defined TTSCallbacks implementation class.
Return void
```cpp
// This must be dynamically allocated; memory is released by the AidVoice underlying layer.
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();

// After registration, ownership of the object is transferred to AidVoice internally.
tts->set_callback(mTTSCallbacks);
```

💡Note

Special Instructions:

  • The passed callback instance must be dynamically allocated on the heap using the new keyword. Once set_callback is called, the lifecycle of this pointer will be managed by the AidVoice underlying layer. The SDK will automatically perform a delete operation when the TTS instance is destroyed.

Initialization Interface .init()

After the TTS object is created, certain initialization operations (such as environment checks and resource building) must be performed.

API init
Description Completes the necessary initialization work required for inference.
Parameters void
Return A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure.
```cpp
// TTS initialization; returns non-zero on failure.
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
    printf("tts->init() failure!\n");
    return EXIT_FAILURE;
}
```

Data Input .write()

After the TTS instance is successfully initialized (i.e., init() returns success), developers can submit text data to be synthesized via the write() interface. This interface supports bulk text input (received as a string array) and returns synthesized audio asynchronously.

API write
Description Delivers text data to the TTS. The interface accepts a string array (vector<string>), allowing for the simultaneous submission of multiple independent text segments.
Parameters Text array for synthesis. Each element in the array is treated as an independent synthesis task.
Return A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure.
```cpp
// String array as input data; returns non-zero on failure.
std::vector<std::string> str_vec = {"I am a boy.", "I like Aplux."};
int ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
    printf("tts->write() failure!\n");
    return EXIT_FAILURE;
}
```

💡Note

Special Attention: After text data is delivered, the system will asynchronously return the synthesized audio stream through the callback interface. The output audio strictly adheres to the following specifications:

  • File encapsulation: Standard WAV format
  • Sample rate: 44100 Hz
  • Channel count: Mono (Single channel)

TTS Stop Input .stop()

This interface is used to notify the TTS engine that the current input phase has ended. Once called, the SDK will no longer accept new text input, but it will ensure that the remaining text already in the buffer is processed completely.

API stop
Description Formally closes the text input stream.
Parameters void
Return A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure.

TTS Object Destruction .tts_destory()

This interface must be called when all audio synthesis tasks are finished and the TTS functionality is no longer needed within the application lifecycle. This operation will completely release the system resources occupied by the SDK.

API tts_destory
Description Completely destroys the TTS instance and releases all associated resources.
Parameters void
Return A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure.

Other Methods

Besides the inference-related interfaces mentioned above, the AidVoice SDK also provides several auxiliary interfaces.

Get Microphone List .show_mircophone_dev()

Before calling audio_microphone(), it is recommended to call this interface to enumerate the available audio input devices on the current system and obtain the correct device ID.

| API | show_mircophone_dev |
| --- | --- |
| Description | Lists all available microphone hardware devices in the system, printing each device name and its corresponding ID to standard output (stdout) or the log system. |
| Parameters | void |
| Return | A value of 0 indicates the operation was successful; otherwise, a non-zero value indicates failure. |
Get Current AidVoice SDK Version .get_library_version()

Retrieves information regarding the version of the current AidVoice SDK.

| API | get_library_version |
| --- | --- |
| Description | Obtains version information of the current AidVoice SDK. |
| Parameters | void |
| Return | string: Version information. |
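If the application needs to compare versions programmatically, the returned string can be parsed. The sketch below assumes a `major.minor.patch` layout, which is only an assumption — inspect the actual string returned by `get_library_version()` before relying on it:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Assumed version layout: "major.minor.patch" (e.g. "1.2.3").
struct Version { int major_v = 0, minor_v = 0, patch_v = 0; };

// Parse the assumed layout; returns false if the string does not match.
bool parse_version(const std::string& s, Version& v) {
    char dot1 = 0, dot2 = 0;
    std::istringstream in(s);
    if (!(in >> v.major_v >> dot1 >> v.minor_v >> dot2 >> v.patch_v))
        return false;
    return dot1 == '.' && dot2 == '.';
}
```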
Get Current Log Level .get_log_level()
| API | get_log_level |
| --- | --- |
| Description | Retrieves the current log level. |
| Parameters | void |
| Return | LogLevel: Current log level. |
Set Log Level .set_log_level()
| API | set_log_level |
| --- | --- |
| Description | Sets the log level. |
| Parameters | LogLevel: Log level to set. |
| Return | Returns 0 by default. |
Output Log to Console .log_to_console()
| API | log_to_console |
| --- | --- |
| Description | Configures log information to be output to the standard error stream (stderr). |
| Parameters | void |
| Return | Returns 0 by default. |
Output Log to Text File .log_to_file()
| API | log_to_file |
| --- | --- |
| Description | Configures log information to be output to a specified text file. |
| Parameters | path_and_prefix: Storage path and name prefix for log files. also_to_console: Flag indicating whether to simultaneously output logs to the stderr terminal; default is false. |
| Return | A value of 0 indicates success; otherwise, a non-zero value indicates failure. |
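The `path_and_prefix` argument supplies both the directory and the file-name prefix; the SDK appends its own suffix to form the final file name. That naming scheme is SDK-internal, but the sketch below illustrates the general idea with a hypothetical date tag (not the SDK's actual rule):

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration only: the SDK defines its own log-file naming.
// This shows one common way a "path and prefix" argument is expanded, e.g.
// "/var/log/aidvoice/app" + "20250101" -> "/var/log/aidvoice/app_20250101.log".
std::string make_log_name(const std::string& path_and_prefix,
                          const std::string& date_tag) {
    return path_and_prefix + "_" + date_tag + ".log";
}
```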

AidVoice C++ Example Programs

AidVoice ASR Recognition Example

Taking audio speech recognition as an example, a C++ program generally consists of the following parts:

cpp
// Global configuration information
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;  
cfg.model_type = ModelType::TYPE_WHISPER;  

// Construct ASR object
auto asr = AidLux::AidVoice::create_asr(cfg);  
if (!asr)
{
    printf("create_asr failure!\n");
    return EXIT_FAILURE;
}

// Implement the callback interface
class ASRCallbacksImpl : public ASRCallbacks
{
public:
    void onResult(const AsrResult &result) override
    {
        std::string asrResult = result.text;
        int sid = result.id;
        AsrStatus status = result.status;
        printf("=================\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "asrResult: \n"
                  << asrResult << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("=================\n");
    }

    void onError(const AsrError &error) override
    {
        std::string errMsg = error.message;
        printf("=================\n");
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("=================\n");
    }
    ~ASRCallbacksImpl() = default;
};

// Create the callback object and register it. A std::unique_ptr (requires
// <memory>) avoids leaking the bare new; set_callback() takes a raw pointer.
auto mASRCallbacks = std::make_unique<ASRCallbacksImpl>();
asr->set_callback(mASRCallbacks.get());

// Initialize ASR object
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
    printf("asr->init() failure!\n");
    return EXIT_FAILURE;
}

// Pass audio data (wave_path holds the path to the input audio file)
ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
    printf("asr->write() failure!\n");
    return EXIT_FAILURE;
}

// Stop input data
ret = asr->stop();
if (ret != EXIT_SUCCESS)
{
    printf("asr->stop() failure!\n");
    return EXIT_FAILURE;
}

// Destroy object
ret = asr->asr_destory();
if (ret != EXIT_SUCCESS)
{
    printf("asr->asr_destory() failure!\n");
    return EXIT_FAILURE;
}
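Every call in the example above repeats the same return-code check. A small helper — not part of the SDK, just a sketch — can collapse that boilerplate:

```cpp
#include <cstdio>
#include <cstdlib>

// Not an SDK function: wraps the repeated "non-zero return means failure"
// pattern used throughout the examples. Returns true on success.
bool check(int ret, const char* what) {
    if (ret != EXIT_SUCCESS) {
        std::fprintf(stderr, "%s failure! (ret=%d)\n", what, ret);
        return false;
    }
    return true;
}

// Usage sketch, assuming an initialized asr object as in the example above:
//   if (!check(asr->init(), "asr->init()"))   return EXIT_FAILURE;
//   if (!check(asr->stop(), "asr->stop()"))   return EXIT_FAILURE;
```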

AidVoice TTS Synthesis Example

A C++ speech-synthesis program generally consists of the following parts:

cpp
// Global configuration information
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_TTS;
cfg.model_type = ModelType::TYPE_MELOTTS_ENGLISH;

// Construct TTS object
auto tts = AidLux::AidVoice::create_tts(cfg);
if (!tts)
{
    printf("create tts failure!\n");
    return EXIT_FAILURE;
}

// Set TTS working mode
tts->set_mode(TTSMode::TYPE_WHOLE);

// Implement the callback interface
class TTSCallbacksImpl : public TTSCallbacks
{
public:
    void onResult(const TTSResult &result) override
    {
        std::string audio_name = result.audio_name;
        double audio_len = result.audio_len;
        int seq = result.seq;
        int sid = result.id;
        TTSStatus status = result.status;
        printf("=================\n");
        std::cout << "sid: " << sid << std::endl;
        std::cout << "audio_name: " << audio_name << std::endl;
        std::cout << "audio_len: " << audio_len << std::endl;
        std::cout << "seq: " << seq << std::endl;
        std::cout << "status: " << (int)status << std::endl;
        printf("=================\n");
    }

    void onError(const TTSError &error) override
    {
        std::string errMsg = error.message;
        printf("=================\n");
        std::cout << "errMsg: " << errMsg << std::endl;
        printf("=================\n");
    }
    ~TTSCallbacksImpl() = default;
};

// Create the callback object and register it. A std::unique_ptr (requires
// <memory>) avoids leaking the bare new; set_callback() takes a raw pointer.
auto mTTSCallbacks = std::make_unique<TTSCallbacksImpl>();
tts->set_callback(mTTSCallbacks.get());

// Initialize TTS object
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
    printf("tts->init() failure!\n");
    return EXIT_FAILURE;
}

// Pass synthesis text
std::vector<std::string> str_vec = {"This is an example of text to speech using Melo for English. How does it sound?"};
ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
    printf("tts->write() failure!\n");
    return EXIT_FAILURE;
}

// Stop input data
ret = tts->stop();
if (ret != EXIT_SUCCESS)
{
    printf("tts->stop() failure!\n");
    return EXIT_FAILURE;
}

// Destroy object
ret = tts->tts_destory();
if (ret != EXIT_SUCCESS)
{
    printf("tts->tts_destory() failure!\n");
    return EXIT_FAILURE;
}
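Since `write()` accepts a vector of strings, longer passages can be split into sentence-sized entries before synthesis. The helper below is an illustrative sketch (not an SDK function) using a naive punctuation split; real text may need smarter segmentation:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative helper: split a paragraph on '.', '!' and '?' so each
// element of the vector passed to tts->write() stays sentence-sized.
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> out;
    std::string cur;
    auto flush = [&out, &cur]() {
        // Trim leading spaces before storing the accumulated sentence.
        size_t start = cur.find_first_not_of(' ');
        if (start != std::string::npos) out.push_back(cur.substr(start));
        cur.clear();
    };
    for (char c : text) {
        cur += c;
        if (c == '.' || c == '!' || c == '?') flush();
    }
    flush();  // keep any trailing fragment without final punctuation
    return out;
}
```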

💡Note

More usage examples are stored in the following path:

  • CPP example program path: /usr/local/share/aidvoice/examples/asr/cpp/

This concludes the presentation of all interfaces for the AidVoice SDK.