AidVoice C++ API Documentation

Important things to know when developing with the AidVoice SDK for C++:

This document corresponds to SDK version aidvoice-sdk_v1.3.1
Include the header file at compile time: /usr/local/include/aidlux/aidvoice/aidvoice_speech.hpp
Link the library file at build time: /usr/local/lib/libaidvoice_speech.so

Feature Type.enum FeatureType

FeatureType is used to specify the core feature module when initializing the AidVoice SDK. Since the SDK includes multiple speech features, you must use this enum to clearly define which feature you want when creating a feature instance (object). The SDK currently supports automatic speech recognition (ASR) and text-to-speech (TTS). More speech features will be added in future releases.

Member Name	Type	Value	Description
TYPE_DEFAULT	uint8_t	0	Invalid data type
TYPE_ASR	uint8_t	1	Automatic speech recognition
TYPE_TTS	uint8_t	2	Text to speech

Audio Type.enum AudioType

When running an ASR task, you need to define the encoding format and sampling properties of the input or output audio. By setting this enum, the SDK can correctly parse audio stream data or generate audio files in the required format. This enum is mainly used to identify the audio type for ASR input. The TTS output format is currently fixed by the engine.

Member Name	Type	Value	Description
TYPE_DEFAULT	uint8_t	0	Invalid data type
TYPE_WAV	uint8_t	1	WAV audio
TYPE_PCM	uint8_t	2	PCM audio

Important

To ensure ASR accuracy and system stability, the raw audio stream sent to the SDK must strictly follow these requirements:

Sample rate: fixed at 16 kHz.
Channel configuration: mono only.
Bit depth: 16-bit signed.

The TTS module currently outputs WAV audio only. The audio enum values here are mainly used by the ASR module to identify different input audio types.

Log Level.enum LogLevel

AidVoice SDK provides logging-related interfaces, which are introduced later in this document. If you need to specify the current logging level used by AidVoice SDK, use this enum.

Member Name	Type	Value	Description
TYPE_INFO	uint8_t	0	Information
TYPE_WARNING	uint8_t	1	Warning
TYPE_ERROR	uint8_t	2	Error
TYPE_FATAL	uint8_t	3	Fatal error
TYPE_DEBUG	uint8_t	4	Debug
TYPE_OFF	uint8_t	5	Disabled

Return Status.enum ResultStatus

ResultStatus defines the return status for all SDK operations. By checking this enum value, you can tell whether the current operation completed successfully. If the return value indicates failure, you can troubleshoot the issue based on the detailed error information.

Member Name	Type	Value	Description
AV_OK	uint8_t	0	Success
AV_ERR_INVALID_ARG	uint8_t	1	Invalid argument
AV_ERR_LOAD_FILE	uint8_t	2	Failed to load file
AV_ERR_RUN_FAIL	uint8_t	3	Runtime error
AV_ERR_UNSUPPORTED	uint8_t	4	Operation not supported
AV_ERR_GENERATE_OBJECT	uint8_t	5	Failed to create object
AV_OTHER	uint8_t	6	Other

Device Information.struct DeviceInfo

The DeviceInfo class is used to describe NPU information on the current device.

Member List

Member	version
Type	uint32_t
Default	0X00020301
Description	Version of the current device information

Member	id
Type	uint32_t
Default	0
Description	ID of the current device

Member	type
Type	uint32_t
Default	0
Description	Type of the current device

Member	cores_num
Type	uint32_t
Default	1
Description	Number of cores in the current device information (NPU cores)

Member	cores_id
Type	std::vector<uint32_t>
Default	No default value
Description	IDs of all cores in the current device information

Important

The current SDK device information class is only used to query NPU information on the device. For CPU, GPU, and other hardware information, please use other methods.

Global Configuration.struct FeatureConfig

The FeatureConfig structure stores all configuration information required to build a specific feature object. Before initializing an SDK instance, you need to create this structure and set the feature type, model selection, and logging level based on your business needs.

Member List

FeatureConfig includes the following parameters:

Member	feature_type
Type	FeatureType
Default	FeatureType::TYPE_ASR
Description	Specifies the SDK feature mode, such as speech recognition or speech synthesis

Member	model_path
Type	string
Default	Empty string
Description	Specifies the path of the selected model

Member	log_type
Type	LogLevel
Default	TYPE_OFF
Description	Specifies the log level

Member	custom_device_info
Type	std::vector<DeviceInfo>
Default	Empty array
Description	List of NPU device information

ASR Interfaces

This section describes the core API interfaces in AidVoice SDK for automatic speech recognition (ASR). With these interfaces, you can complete the full ASR workflow, from creating an ASR instance and sending audio data to retrieving recognition results.

Feature overview: The ASR module converts 16 kHz, mono, 16-bit raw audio streams into text in real time or offline. It currently supports mainstream inference models such as senseVoice_small and whisper_tiny, whisper_base, whisper_base_en, whisper_small, and whisper_medium.

ASR Mode.enum ASRMode

ASRMode defines how ASR results are returned. Depending on your real-time requirements, you can choose streaming mode for incremental feedback or non-streaming mode for complete sentence output.

Member Name	Type	Value	Description
TYPE_STREAM	uint8_t	0	Streaming output: can return intermediate transcription results
TYPE_NOSTREAM	uint8_t	1	Non-streaming output: returns the final transcription result for each processing step

Important

Notes:

Streaming: temporary transcription results are generated in real time during audio processing. As more audio is fed in, the SDK continues to revise and update the intermediate text. This mode can return data before the full audio buffer is processed, which reduces first-token latency and improves the interaction experience.
Non-streaming: transcription is performed based on a fixed audio duration. If speech end is detected early, such as when the user stops input, or if the speech duration is shorter than expected, the system immediately returns the final transcription result.

Important

Different inference models have strict limits on the audio length for a single processing task. When configuring non-streaming transcription or preparing audio segments, keep the following limits in mind:

Whisper models: the maximum audio length for a single input is 24 seconds.
SenseVoice models: the maximum audio length for a single input is 15 seconds.
If the audio sent in one request exceeds these limits, the SDK truncates the input audio. The remaining audio after truncation automatically starts a new transcription task.

Speech Transcription Status.enum AsrStatus

AsrStatus indicates the state of the ASR text result in the current recognition round. By checking this status, you can tell whether the current text is an intermediate transcription that may still change or the final confirmed result.

Member Name	Type	Value	Description
TYPE_PARTIAL	uint8_t	0	Intermediate transcription, recognition not finished
TYPE_FINAL	uint8_t	1	Final transcription result, recognition finished

Important

Notes:

Streaming mode (TYPE_STREAM): when the result status is PARTIAL, it means the SDK is returning an intermediate result before the current audio buffer has been fully processed. Only when the status is FINAL does it mean processing of the current buffer is complete.
Non-streaming mode (TYPE_NOSTREAM): in non-streaming mode, the status of each transcription result is always FINAL.

cpp

// Streaming mode:
I am.                                TYPE_PARTIAL
I am a boy.                          TYPE_PARTIAL
I am a boy. I like Aplux.            TYPE_FINAL
// Non-streaming mode:
I am a boy. I like Aplux.            TYPE_FINAL

ASR Result.class AsrResult

The AsrResult structure carries the ASR transcription result and its status. When the SDK finishes processing a segment of audio, it returns the recognized text together with its status, either intermediate or final, in this structure.

Member List

AsrResult includes the following parameters:

Member	status
Type	AsrStatus
Default	No default value
Description	Transcription status of the current result

Member	text
Type	std::string
Default	Empty string
Description	Current transcription text

Member	id
Type	int
Default	0
Description	Result ID

ASR Error.class AsrError

The AsrError structure carries ASR error information. When an interface returns a non-success status, you can use this structure to get the specific error code and a detailed message.

Member List

AsrError includes the following parameters:

Member	status
Type	ResultStatus
Default	No default value
Description	Error status

Member	error_code
Type	int
Default	-1
Description	Error code

Member	message
Type	std::string
Default	No default value
Description	Error message

ASR Callback Interface.class ASRCallbacks

ASRCallbacks is a virtual base class that defines the listener interface used by the SDK to push data to the application layer. You need to inherit from this class and implement its virtual functions so that you can asynchronously receive recognition results or error information.

Get Transcription Result.onResult()

This callback is triggered automatically when the ASR engine finishes processing a segment of audio and generates text.

API	onResult
Description	Speech recognition result callback
Parameters	result: transcription result object that contains the current recognized text and result status
Returns	void

API	onError
Description	Error callback. Used to receive and handle different exceptions that occur while ASR is running
Parameters	error: error information object that contains the current error message, error code, and related details
Returns	void

API	onStop
Description	Stop callback
Parameters	result: after stop() is called, the current task stops accepting new input and returns any remaining result that is still being processed through this callback
Returns	void

cpp

class ASRCallbacksImpl : public ASRCallbacks
{
public:
	void onResult(const AsrResult &result) override
	{
		string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
		total_echo = sid;
	}

	void onError(const AsrError &error) override
	{
		int errCode = error.error_code;
		int errStatus = (int)error.status;
		string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errMsg: " << errMsg << std::endl;
		printf("===========================================\n\n");
	}

	void onStop(const AsrResult &result) override
	{
		string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n");
	}
	~ASRCallbacksImpl() = default;
};

ASR Core Class.class AidVoiceASR

AidVoiceASR is the main ASR feature class in the SDK. It manages the full lifecycle of speech recognition. You use the interfaces provided by this class to load models, push audio data, and stop recognition tasks. This class must be initialized together with FeatureConfig.

Create Instance.create_asr()

This is the first interface you call when using the SDK. It creates and initializes a specific ASR object in memory based on the global configuration, such as feature type and model type.

API	create_asr
Description	Builds a specific ASR instance from the configuration object
Parameters	cfg: global configuration used to specify the model type and log level
Returns	Returns an std::shared_ptr<AidVoiceASR> instance pointer on success. Returns nullptr on failure

Important

Notes:

Before calling this interface, make sure cfg.feature_type is set to TYPE_ASR.

Set Mode.set_mode()

This interface sets the working mode of ASR. Depending on your application needs, such as real-time speech interaction or offline long-form transcription, you can set it to streaming mode or non-streaming mode.

API	set_mode
Description	Sets the ASR recognition mode
Parameters	mode: the recognition mode to use
Returns	void

Set Callback.set_callback()

This interface registers a user-implemented callback listener instance with ASR. Once registered, the SDK uses this instance to asynchronously return transcription results through onResult and error information through onError.

API	set_callback
Description	Registers a callback listener object used to receive asynchronous recognition results
Parameters	cb: pointer to an instance of a user-defined ASRCallbacks implementation
Returns	void

cpp

// The callback object must be allocated on the heap.
// It will be released by AidVoice internally.
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();

// After registration, ownership of the object is transferred to AidVoice.
asr->set_callback(mASRCallbacks);

Important

Notes:

The callback instance must be allocated on the heap with new. Once set_callback is called, the pointer lifetime is managed by AidVoice internally. The SDK automatically deletes it when the ASR instance is destroyed.

Enable Special Token Output.set_special_tokens()

This interface controls whether special tokens are included in the output, such as the model start token, end token, and other tokens with special meanings. If enabled, the callback also returns these special tokens.

API	set_special_tokens
Description	Controls whether the model's special token characters are returned
Parameters	is_add: whether to include them. The default value is false
Returns	void

Set Maximum Audio Processing Duration.set_echo_ms()

This interface sets the maximum audio length, in milliseconds, that a single ASR inference task can accept.

API	set_echo_ms
Description	Sets the audio duration threshold for a single ASR inference task
Parameters	echo_ms: duration threshold for a single processing task
Returns	void

Important

When setting this value, you must follow the per-model processing limits:

Whisper models: must not exceed 24000 (24 s).
SenseVoice models: echo_ms must not exceed 15000 (15 s).

Set Streaming Feedback Interval.set_step_ms()

This interface is designed for streaming mode (TYPE_STREAM). It sets how often ASR returns intermediate transcription results with PARTIAL status.

API	set_step_ms
Description	Sets the callback interval for streaming transcription results in milliseconds. A callback is triggered each time the specified amount of audio has been processed
Parameters	step_ms: time step for result feedback
Returns	void

Important

Notes:

This setting only takes effect in streaming mode. In non-streaming mode, the system ignores this setting and returns the final result only after recognition is complete.
A smaller step_ms gives better real-time feedback. In live microphone input scenarios, setting step_ms too small may make the output less continuous. To preserve real-time responsiveness, the SDK uses an overwrite-based buffer strategy. If the model cannot process audio as fast as it is being fed in, older unprocessed data may be overwritten by newly received audio, which can lead to broken recognition results.

Save Input Audio.set_save_audio()

This interface is mainly used for live microphone input. When enabled, the SDK automatically captures the raw audio stream received from the microphone and saves it locally in WAV format.

API	set_save_audio
Description	Controls whether raw microphone input audio is saved locally
Parameters	save_audio: boolean value. true enables saving. false disables it, which is the default
Returns	void

Get Device NPU Information.get_device_info()

This interface is used to query NPU information on the current device.

API	get_device_info
Description	Queries NPU information on the current device
Parameters	device_info_list
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

You must call this interface after initialization has completed to get the current device NPU information.

cpp

	auto asr = AidLux::AidVoice::create_asr(cfg);
	asr->init();
	std::vector<DeviceInfo> device_info;
	asr->get_device_info(device_info);

Bind an NPU Device for Execution.set_device_info()

This interface configures the NPU binding information used by the SDK. By passing the target NPU device settings, you can bind SDK inference tasks to the selected NPU so that the model runs on the intended hardware.

API	set_device_info
Description	Sets the NPU device used by the SDK
Parameters	device_info
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

You must set the NPU device information before SDK initialization is completed. Otherwise, the setting may not take effect after initialization.

cpp

	auto asr = AidLux::AidVoice::create_asr(cfg);
	DeviceInfo device01;
	device01.id = 0;
	device01.type = 0;
	device01.cores_num = 1;
	device01.cores_id = {0};
	asr->set_device_info(device01);
	asr->init();

Initialize.init()

After the ASR object is created, you need to run initialization steps such as environment checks and resource setup.

API	init
Description	Completes the initialization work required for inference
Parameters	void
Returns	Returns 0 on successful initialization. Any non-zero value means the operation failed

cpp

// Initialize ASR. Any non-zero return value indicates an error.
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
	printf("asr->init() failure!\n");
	return EXIT_FAILURE;
}

Data Input.write()

After init() succeeds, you can use the write() interface to send audio data for recognition. The SDK supports multiple input sources to cover different scenarios, such as file transcription and streaming audio capture.

Use an Audio File as Input

This interface directly reads a local audio file for recognition.

API	write
Description	Passes the path of a 16 kHz WAV audio file. The SDK parses the file automatically and performs recognition
Parameters	wav_16k_file: absolute or relative path to the local audio file
Returns	Returns 0 on success. Any non-zero value means the operation failed

cpp

// Use an audio file as input. Any non-zero return value indicates an error.
std::string wave_path = "audio.wav";
int ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}

Important

Notes:

The audio file must be mono, 16-bit WAV or PCM audio with a sample rate of 16000 Hz.

Use a Raw Byte Stream as Input

This interface accepts raw audio bytes stored in memory.

API	write
Description	Pushes a raw audio byte stream to ASR
Parameters	data: pointer to the audio data buffer len: byte length of the buffer data
Returns	Returns 0 on success. Any non-zero value means the operation failed

cpp

// Use a raw byte stream as input. Any non-zero return value indicates an error.
char *data = new char[fileLen];
// ... Fill the audio data here ..
int ret = asr->write(data, data_size);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}

Use a float Array as Input

This interface accepts floating-point audio sample data. It is suitable for audio streams that have already been preprocessed and converted to the standard float format.

API	write
Description	Pushes a float array of audio samples to ASR
Parameters	audio_data: float array containing audio sample points
Returns	Returns 0 on success. Any non-zero value means the operation failed

cpp

// Use a float array as input. Any non-zero return value indicates an error.
std::vector<float> audio_;
// ... Fill the audio data here ..
int ret = asr->write(audio_);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}

Real-Time Microphone Input.audio_microphone()

This interface uses the configured microphone ID to capture audio through the microphone driver and sends the captured data directly to the ASR engine.

API	audio_microphone
Description	Starts the microphone device with the specified ID and begins real-time speech recognition
Parameters	id: hardware device ID of the microphone. The default device is 0
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

This interface supports streaming mode only (TYPE_STREAM). Before calling it, you must first call set_mode(TYPE_STREAM).
In a terminal environment, you can stop the input stream safely by capturing Ctrl + C. If the microphone device is disconnected during capture, ASR also stops input automatically.

cpp

// After this call, the microphone device with ID 1 starts and receives speech in real time.
asr->audio_microphone(1);

Stop ASR Input.stop()

This interface notifies the ASR engine that the audio stream has ended. After it is called, ASR immediately cuts off the audio input. Only the result still being processed is returned through the onStop callback, and any remaining unprocessed data is discarded.

API	stop
Description	Stops audio input
Parameters	void
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

This interface is intended for interrupted output scenarios. After it is called, the internal buffer is cleared immediately and input is cut off. Only the remaining data currently being processed is returned through onStop. It is suitable for fast interruption in streaming mode.

Destroy ASR Object.asr_destroy()

When all speech recognition tasks are finished and the ASR feature is no longer needed, you must call this interface. It fully releases all related resources.

API	asr_destroy
Description	Completely destroys the ASR instance and releases all related resources
Parameters	void
Returns	Returns 0 on success. Any non-zero value means the operation failed

Get Audio Read Time.get_read_time()

This interface returns timing statistics for the audio file read stage or the input data transfer stage, either for the most recent recognition run or as a cumulative total.

API	get_read_time
Description	Gets the cumulative time spent in the audio read stage, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Initialization Time.get_init_time()

This interface returns cumulative timing statistics for the ASR initialization stage.

API	get_init_time
Description	Gets the cumulative time spent in initialization, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Feature Extraction Time.get_character_time()

This interface returns cumulative timing statistics for the ASR feature extraction stage. Depending on the model, this stage typically includes preprocessing overhead such as mel, fbank, LFR, and CMVN.

API	get_character_time
Description	Gets the cumulative time spent in audio feature extraction, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Encoder Inference Time.get_encoder_time()

This interface returns cumulative timing statistics for the ASR encoder inference stage.

API	get_encoder_time
Description	Gets the cumulative time spent in encoder inference, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Decoder Inference Time.get_decoder_time()

This interface returns cumulative timing statistics for the ASR decoding stage. For model implementations that do not use a separate decoder, this value may be 0.

API	get_decoder_time
Description	Gets the cumulative time spent in decoding, in seconds
Parameters	void
Returns	A double value representing the elapsed time

TTS Interfaces

This section describes the core API interfaces in AidVoice SDK for text-to-speech (TTS). With these interfaces, you can complete the full TTS workflow, from creating a TTS instance and submitting text to retrieving the generated audio.

Feature overview: The TTS module converts input text into audio files. It currently supports two mainstream inference models: melotts_chinese and melotts_english.

TTS Mode.enum TTSMode

TTSMode is used to configure how synthesized audio is returned. Based on your real-time requirements, you can choose whole-output mode or fragment-output mode.

Member Name	Type	Value	Description
TYPE_WHOLE	uint8_t	0	Whole output: the complete audio is returned in one callback after the full sentence is synthesized
TYPE_FRAGMENT	uint8_t	1	Fragment output: the text is split by punctuation or semantic pauses, and short sentences are returned as soon as they are synthesized

Important

Notes:

Whole output: the full input text is processed as one task. The result callback is triggered only after all audio data has been synthesized, which ensures the audio is returned as one complete output.
Fragment output: long text is intelligently split into multiple short sentences based on punctuation and semantic pauses. Each sentence is returned as soon as its audio has been synthesized.

Synthesized Audio Status.enum TTSStatus

TTSStatus indicates the current state of the audio result returned by the TTS engine. By checking this status, you can tell whether the returned audio is a partial segment produced during the task or the final result.

Member Name	Type	Value	Description
TYPE_PARTIAL	uint8_t	0	Partial synthesized audio
TYPE_FINAL	uint8_t	1	Complete synthesized audio

Important

Notes:

Fragment output (TYPE_FRAGMENT): when the result status is PARTIAL, the current audio is only one intermediate segment of the full input text, such as a short sentence. Only when the status is FINAL does it mean the synthesis task for the current sentence has fully finished.
Whole output (TYPE_WHOLE): in whole mode, the status of each synthesized audio result is always FINAL.

TTS Result.class TTSResult

The TTSResult structure carries synthesized audio data and its status. When the SDK finishes synthesizing a segment of audio, it returns the related audio information and audio status, either partial or final, in this structure.

Member List

TTSResult includes the following parameters:

Member	status
Type	TTSStatus
Default	No default value
Description	Status of the current returned audio

Member	audio_name
Type	std::string
Default	Empty string
Description	File name of the current output audio

Member	audio_data
Type	vector<float >
Default	Empty vector
Description	Generated raw audio data. Output format: float, mono, 44100 Hz sample rate

Member	audio_time
Type	double
Default	0
Description	Duration of the current output audio in seconds

Member	seq
Type	int
Default	0
Description	Indicates which segment of the synthesis sequence this returned audio block belongs to

Member	id
Type	int
Default	0
Description	Result ID

TTS Error.class TTSError

The TTSError structure carries TTS error information. When an interface returns a non-success status, you can use this structure to get the specific error code and a detailed message.

Member List

TTSError includes the following parameters:

Member	status
Type	ResultStatus
Default	No default value
Description	Error status

Member	error_code
Type	int
Default	-1
Description	Error code

Member	message
Type	std::string
Default	No default value
Description	Error message

TTS Callback Interface.class TTSCallbacks

TTSCallbacks is a virtual base class that defines the listener interface used by the SDK to push data to the application layer. You need to inherit from this class and implement its virtual functions so that you can asynchronously receive synthesis results or error information.

Get Synthesis Result.onResult()

This callback is triggered automatically when the TTS engine finishes processing a segment of text and generates audio.

API	onResult
Description	Speech synthesis result callback
Parameters	result: synthesis result object that contains the current synthesized audio information and status
Returns	void

API	onError
Description	Error callback. Used to receive and handle different exceptions that occur while TTS is running
Parameters	error: error information object that contains the current error message, error code, and related details
Returns	void

API	onStop
Description	Stop callback
Parameters	result: after stop() is called, the current task stops accepting new input and returns any remaining result that is still being processed through this callback
Returns	void

cpp

class TTSCallbacksImpl : public TTSCallbacks
{
public:
	void onResult(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_time = result.audio_time;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_time: " << (double)audio_time << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}

	void onError(const TTSError &error) override
	{
		int errCode = error.error_code;
		int errStatus = (int)error.status;
		string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errMsg: " << errMsg << std::endl;
		printf("===========================================\n\n");
	}

	void onStop(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_time = result.audio_time;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_time: " << (double)audio_time << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}
	~TTSCallbacksImpl() = default;
};

TTS Core Class.class AidVoiceTTS

AidVoiceTTS is the main TTS feature class in the SDK. It manages the full lifecycle of speech synthesis. You use the interfaces provided by this class to load models, submit text, and stop synthesis tasks. This class must be initialized together with FeatureConfig.

Create Instance.create_tts()

This is the first interface you call when using the SDK. It creates and initializes a specific TTS object in memory based on the global configuration, such as feature type and model type.

API	create_tts
Description	Builds a specific TTS instance from the configuration object
Parameters	cfg: global configuration used to specify the model type and log level
Returns	Returns an std::shared_ptr<AidVoiceTTS> instance pointer on success. Returns nullptr on failure

Important

Notes:

Before calling this interface, make sure cfg.feature_type is set to TYPE_TTS.

Set Mode.set_mode()

This interface sets the working mode of TTS. Depending on your business needs, you can set it to whole-output mode or fragment-output mode. By default, TTS runs in whole mode (TYPE_WHOLE).

API	set_mode
Description	Sets the working mode of TTS
Parameters	mode: the target working mode
Returns	void

Set Callback.set_callback()

This interface registers a user-implemented callback listener instance with TTS. Once registered, the SDK uses this instance to asynchronously return synthesized audio information through onResult and error information through onError.

API	set_callback
Description	Registers a callback listener object used to receive asynchronously returned synthesized audio information
Parameters	cb: pointer to an instance of a user-defined TTSCallbacks implementation
Returns	void

cpp

// The callback object must be allocated on the heap.
// It will be released by AidVoice internally.
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();

// After registration, ownership of the object is transferred to AidVoice.
tts->set_callback(mTTSCallbacks);

Important

Notes:

The callback instance must be allocated on the heap with new. Once set_callback is called, the pointer lifetime is managed by AidVoice internally. The SDK automatically deletes it when the TTS instance is destroyed.

Enable Speaker Playback for Synthesized Audio.set_play_audio()

This interface is used to play the speech generated by TTS. You must specify the speaker device ID. If you do not know the correct speaker device ID, see audio device lookup.

API	set_play_audio
Description	Plays the synthesized speech
Parameters	dev_id: audio device ID
Returns	void

Important

Notes:

Audio playback is disabled by default. You need to set an audio device ID greater than or equal to 0 to enable the audio device.

Set Output Audio File Path.set_out_audio_path()

This interface sets the save path for the synthesized audio. It can be either a relative path or an absolute path. By default, no output audio path is set, so no audio file is generated.

API	set_out_audio_path
Description	Save path for synthesized audio
Parameters	path: audio save path
Returns	0: path set successfully; otherwise, setting failed

Get Device NPU Information.get_device_info()

This interface is used to query NPU information on the current device.

API	get_device_info
Description	Queries NPU information on the current device
Parameters	device_info_list
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

You must call this interface after initialization has completed to get the current device NPU information.

cpp

	auto tts = AidLux::AidVoice::create_tts(cfg);
	tts->init();
	std::vector<DeviceInfo> device_info;
	tts->get_device_info(device_info);

Bind an NPU Device for Execution.set_device_info()

API	set_device_info
Description	Sets the NPU device used by the SDK
Parameters	device_info
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

You must set the NPU device information before SDK initialization is completed. Otherwise, the setting may not take effect after initialization.

cpp

	auto tts = AidLux::AidVoice::create_tts(cfg);
	DeviceInfo device01;
	device01.id = 0;
	device01.type = 0;
	device01.cores_num = 1;
	device01.cores_id = {0};
	tts->set_device_info(device01);
	tts->init();

Initialize.init()

After the TTS object is created, you need to run initialization steps such as environment checks and resource setup.

API	init
Description	Completes the initialization work required for inference
Parameters	void
Returns	Returns 0 on successful initialization. Any non-zero value means the operation failed

cpp

// Initialize TTS. Any non-zero return value indicates an error.
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
	printf("tts->init() failure!\n");
	return EXIT_FAILURE;
}

Data Input.write()

After the TTS instance is initialized successfully, meaning init() returns success, you can use write() to submit text for synthesis. This interface accepts a string array, vector<string>, so you can submit multiple independent text items in one call. The synthesized audio is returned asynchronously.

API	write
Description	Submits text data to TTS. The interface accepts a string array, vector<string>, and supports submitting multiple independent text segments in one call
Parameters	Array of text to synthesize. Each element in the array is treated as an independent synthesis task
Returns	Returns 0 on success. Any non-zero value means the operation failed

cpp

// Use a string array as input. Any non-zero return value indicates an error.
std::vector<std::string> str_vec = {"I am a boy.", "I like Aplux."};
int ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
	printf("tts->write() failure!\n");
	return EXIT_FAILURE;
}

Important

Notes:

After the text is submitted, the system returns the synthesized audio stream asynchronously through the callback interface. The output audio strictly follows these specifications:

File container: standard WAV format
Sample rate: 44100 Hz
Channels: mono

Stop TTS Input.stop()

This interface notifies the TTS engine that the input stream has ended. After it is called, TTS immediately cuts off the input data. Only the result still being processed is returned through the onStop callback, and any remaining unprocessed data is discarded.

API	stop
Description	Formally closes the text input stream
Parameters	void
Returns	Returns 0 on success. Any non-zero value means the operation failed

Important

Notes:

This interface is intended for interruption scenarios. After it is called, the internal buffer is cleared immediately and input is cut off. Only the remaining data currently being processed is returned through onStop, which makes it useful when you need to stop output quickly.

Destroy TTS Object.tts_destroy()

When all audio synthesis tasks have finished and the application no longer needs the TTS feature during its lifecycle, you must call this interface. It fully releases the system resources used by the SDK.

API	tts_destroy
Description	Completely destroys the TTS instance and releases all related resources
Parameters	void
Returns	Returns 0 on success. Any non-zero value means the operation failed

Get Text Read Time.get_read_time()

This interface returns timing statistics for the text read stage or preprocessing stage, either for the most recent synthesis run or as a cumulative total.

API	get_read_time
Description	Gets the cumulative time spent in text reading, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Initialization Time.get_init_time()

This interface returns cumulative timing statistics for the TTS initialization stage.

API	get_init_time
Description	Gets the cumulative time spent in initialization, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Text Feature Processing Time.get_character_time()

This interface returns cumulative timing statistics for the text analysis, segmentation, preprocessing, or feature construction stage.

API	get_character_time
Description	Gets the cumulative time spent in text feature processing, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Encoder Inference Time.get_encoder_time()

This interface returns cumulative timing statistics for the TTS encoder inference stage.

API	get_encoder_time
Description	Gets the cumulative time spent in encoder inference, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Get Decoder Inference Time.get_decoder_time()

This interface returns cumulative timing statistics for the TTS decoder or vocoder inference stage.

API	get_decoder_time
Description	Gets the cumulative time spent in decoder inference, in seconds
Parameters	void
Returns	A double value representing the elapsed time

Other Methods

In addition to the inference-related interfaces described above, the AidVoice SDK also provides the following helper interfaces.

Get Microphone List.show_microphone_dev()

Before calling audio_microphone(), it is recommended to call this interface first to list the available audio input devices on the current system, so you can get the correct device ID.

API	show_microphone_dev
Description	Lists all available microphone hardware devices in the system. This interface prints the device name and its corresponding ID to standard output or the logging system
Parameters	void
Returns	No return value

Get Current AidVoice SDK Version.get_library_version()

Gets version information for the current AidVoice SDK.

API	get_library_version
Description	Gets version information for the current AidVoice SDK
Parameters	void
Returns	string: version information

Get Current Log Level.get_log_level()

API	get_log_level
Description	Gets the current log level
Parameters	void
Returns	LogLevel: log level

Set Log Level.set_log_level()

API	set_log_level
Description	Sets the log level
Parameters	LogLevel: log level
Returns	Returns 0 by default

Output Logs to the Console.log_to_console()

API	log_to_console
Description	Sends log output to the standard error console
Parameters	void
Returns	Returns 0 by default

Output Logs to a Text File.log_to_file()

API	log_to_file
Description	Sends log output to the specified text file
Parameters	path_and_prefix: path and filename prefix for log files also_to_console: whether to also output logs to stderr. The default value is false
Returns	Returns 0 on success. Any non-zero value means the operation failed

AidVoice C++ Sample Programs

AidVoice ASR Audio Recognition Sample

Using audio transcription as an example, a typical C++ sample for ASR includes the following parts:

cpp

// Global configuration
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_ASR;
cfg.model_path = "model_path";

// Build the ASR object
auto asr = AidLux::AidVoice::create_asr(cfg);
if (!asr)
{
	printf("create_asr failure!\n");
	return EXIT_FAILURE;
}

// Inherit the callback interface
class ASRCallbacksImpl : public ASRCallbacks
{
public:
   void onResult(const AsrResult &result) override
	{
		string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}

	void onError(const AsrError &error) override
	{
		int errCode = error.error_code;
		int errStatus = (int)error.status;
		string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errMsg: " << errMsg << std::endl;
		printf("===========================================\n\n");
	}

	void onStop(const AsrResult &result) override
	{
		string asrResult = result.text;
		int sid = result.id;
		AsrStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "asrResult: \n"
				  << asrResult << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}
	~ASRCallbacksImpl() = default;
};

// Create the callback object and register it
ASRCallbacksImpl *mASRCallbacks = new ASRCallbacksImpl();
asr->set_callback(mASRCallbacks);

// Initialize the ASR object
int ret = asr->init();
if (ret != EXIT_SUCCESS)
{
	printf("asr->init() failure!\n");
	return EXIT_FAILURE;
}

// Send audio file data
ret = asr->write(wave_path);
if (ret != EXIT_SUCCESS)
{
	printf("asr->write() failure!\n");
	return EXIT_FAILURE;
}

// Stop input
ret = asr->stop();
if (ret != EXIT_SUCCESS)
{
	printf("asr->stop() failure!\n");
	return EXIT_FAILURE;
}

// Destroy the object
ret = asr->asr_destroy();
if (ret != EXIT_SUCCESS)
{
	printf("asr->asr_destroy() failure!\n");
	return EXIT_FAILURE;
}

AidVoice TTS Audio Synthesis Sample

A typical C++ sample for audio synthesis includes the following parts:

cpp

// Global configuration
AidLux::AidVoice::FeatureConfig cfg;
cfg.feature_type = FeatureType::TYPE_TTS;
cfg.model_path = "model_path"; // You can set the model path here or pass it through command-line arguments

// Build the TTS object
auto tts = AidLux::AidVoice::create_tts(cfg);
if (!tts)
{
	printf("create tts failure!\n");
	return EXIT_FAILURE;
}

// Set the TTS working mode
tts->set_mode(TTSMode::TYPE_WHOLE);

// Inherit the callback interface
class TTSCallbacksImpl : public TTSCallbacks
{
public:
		void onResult(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_time = result.audio_time;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============callback result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_time: " << (double)audio_time << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}

	void onError(const TTSError &error) override
	{
		int errCode = error.error_code;
		int errStatus = (int)error.status;
		string errMsg = error.message;
		printf("============error callback=================\n");
		std::cout << "errMsg: " << errMsg << std::endl;
		printf("===========================================\n\n");
	}

	void onStop(const TTSResult &result) override
	{
		std::string audio_name = result.audio_name;
		std::vector<float> audio_data = result.audio_data;
		double audio_time = result.audio_time;
		int seq = result.seq;
		int sid = result.id;
		TTSStatus status = result.status;
		printf("============stop result ===============\n");
		std::cout << "sid: " << sid << std::endl;
		std::cout << "audio_name:" << audio_name << std::endl;
		std::cout << "audio_data size:" << audio_data.size() << std::endl;
		std::cout << "audio_time: " << (double)audio_time << std::endl;
		std::cout << "seq: " << (int)seq << std::endl;
		std::cout << "status: " << (int)status << std::endl;
		printf("===========================================\n\n");
	}
	~TTSCallbacksImpl() = default;
};

// Create the callback object and register it
TTSCallbacksImpl *mTTSCallbacks = new TTSCallbacksImpl();
tts->set_callback(mTTSCallbacks);

// Initialize the TTS object
int ret = tts->init();
if (ret != EXIT_SUCCESS)
{
	printf("tts->init() failure!\n");
	return EXIT_FAILURE;
}

// Send text for synthesis
std::vector<std::string> str_vec = {"This is an example of text to speech using Melo for English. How does it sound?"};
ret = tts->write(str_vec);
if (ret != EXIT_SUCCESS)
{
	printf("tts->write() failure!\n");
	return EXIT_FAILURE;
}

// Stop input
ret = tts->stop();
if (ret != EXIT_SUCCESS)
{
	printf("tts->stop() failure!\n");
	return EXIT_FAILURE;
}

// Destroy the object
ret = tts->tts_destroy();
if (ret != EXIT_SUCCESS)
{
	printf("tts->tts_destroy() failure!\n");
	return EXIT_FAILURE;
}

Important

More usage examples are available in the following locations:

C++ ASR sample path: /usr/local/share/aidvoice/examples/asr/cpp/
C++ TTS sample path: /usr/local/share/aidvoice/examples/tts/cpp/

This completes the full list of interfaces provided by the AidVoice SDK.

GenAI Inference Toolkit

GenAI HTTP Service

API Documentation

AI Development

Generative AI Development

Audio AI Development

Model Farm

AidVoice C++ API Documentation ​

Feature Type.enum FeatureType ​

Audio Type.enum AudioType ​

Log Level.enum LogLevel ​

Return Status.enum ResultStatus ​

Device Information.struct DeviceInfo ​

Member List ​

Global Configuration.struct FeatureConfig ​

Member List ​

ASR Interfaces ​

ASR Mode.enum ASRMode ​

Speech Transcription Status.enum AsrStatus ​

ASR Result.class AsrResult ​

Member List ​

ASR Error.class AsrError ​

Member List ​

ASR Callback Interface.class ASRCallbacks ​

Get Transcription Result.onResult() ​

ASR Core Class.class AidVoiceASR ​

Create Instance.create_asr() ​

Set Mode.set_mode() ​

Set Callback.set_callback() ​

Enable Special Token Output.set_special_tokens() ​

Set Maximum Audio Processing Duration.set_echo_ms() ​

Set Streaming Feedback Interval.set_step_ms() ​

Save Input Audio.set_save_audio() ​

Get Device NPU Information.get_device_info() ​

Bind an NPU Device for Execution.set_device_info() ​

Initialize.init() ​

Data Input.write() ​

Use an Audio File as Input ​

Use a Raw Byte Stream as Input ​

Use a float Array as Input ​

Real-Time Microphone Input.audio_microphone() ​

Stop ASR Input.stop() ​

Destroy ASR Object.asr_destroy() ​

Get Audio Read Time.get_read_time() ​

Get Initialization Time.get_init_time() ​

Get Feature Extraction Time.get_character_time() ​

Get Encoder Inference Time.get_encoder_time() ​

Get Decoder Inference Time.get_decoder_time() ​

TTS Interfaces ​

TTS Mode.enum TTSMode ​

Synthesized Audio Status.enum TTSStatus ​

TTS Result.class TTSResult ​

Member List ​

TTS Error.class TTSError ​

Member List ​

TTS Callback Interface.class TTSCallbacks ​

Get Synthesis Result.onResult() ​

TTS Core Class.class AidVoiceTTS ​

Create Instance.create_tts() ​

Set Mode.set_mode() ​

Set Callback.set_callback() ​

Enable Speaker Playback for Synthesized Audio.set_play_audio() ​

Set Output Audio File Path.set_out_audio_path() ​

Get Device NPU Information.get_device_info() ​

Bind an NPU Device for Execution.set_device_info() ​

Initialize.init() ​

Data Input.write() ​

Stop TTS Input.stop() ​

Destroy TTS Object.tts_destroy() ​

Get Text Read Time.get_read_time() ​

Get Initialization Time.get_init_time() ​

Get Text Feature Processing Time.get_character_time() ​

Get Encoder Inference Time.get_encoder_time() ​

Get Decoder Inference Time.get_decoder_time() ​

Other Methods ​

Get Microphone List.show_microphone_dev() ​

Get Current AidVoice SDK Version.get_library_version() ​

Get Current Log Level.get_log_level() ​

Set Log Level.set_log_level() ​

Output Logs to the Console.log_to_console() ​

AidVoice C++ API Documentation

Feature Type.enum FeatureType

Audio Type.enum AudioType

Log Level.enum LogLevel

Return Status.enum ResultStatus

Device Information.struct DeviceInfo

Member List

Global Configuration.struct FeatureConfig

Member List

ASR Interfaces

ASR Mode.enum ASRMode

Speech Transcription Status.enum AsrStatus

ASR Result.class AsrResult

Member List

ASR Error.class AsrError

Member List

ASR Callback Interface.class ASRCallbacks

Get Transcription Result.onResult()

ASR Core Class.class AidVoiceASR

Create Instance.create_asr()

Set Mode.set_mode()

Set Callback.set_callback()

Enable Special Token Output.set_special_tokens()

Set Maximum Audio Processing Duration.set_echo_ms()

Set Streaming Feedback Interval.set_step_ms()

Save Input Audio.set_save_audio()

Get Device NPU Information.get_device_info()

Bind an NPU Device for Execution.set_device_info()

Initialize.init()

Data Input.write()

Use an Audio File as Input

Use a Raw Byte Stream as Input

Use a float Array as Input

Real-Time Microphone Input.audio_microphone()

Stop ASR Input.stop()

Destroy ASR Object.asr_destroy()

Get Audio Read Time.get_read_time()

Get Initialization Time.get_init_time()

Get Feature Extraction Time.get_character_time()

Get Encoder Inference Time.get_encoder_time()

Get Decoder Inference Time.get_decoder_time()

TTS Interfaces

TTS Mode.enum TTSMode

Synthesized Audio Status.enum TTSStatus

TTS Result.class TTSResult

Member List

TTS Error.class TTSError

Member List

TTS Callback Interface.class TTSCallbacks

Get Synthesis Result.onResult()

TTS Core Class.class AidVoiceTTS

Create Instance.create_tts()

Set Mode.set_mode()

Set Callback.set_callback()

Enable Speaker Playback for Synthesized Audio.set_play_audio()

Set Output Audio File Path.set_out_audio_path()

Get Device NPU Information.get_device_info()

Bind an NPU Device for Execution.set_device_info()

Initialize.init()

Data Input.write()

Stop TTS Input.stop()

Destroy TTS Object.tts_destroy()

Get Text Read Time.get_read_time()

Get Initialization Time.get_init_time()

Get Text Feature Processing Time.get_character_time()

Get Encoder Inference Time.get_encoder_time()

Get Decoder Inference Time.get_decoder_time()

Other Methods

Get Microphone List.show_microphone_dev()

Get Current AidVoice SDK Version.get_library_version()

Get Current Log Level.get_log_level()

Set Log Level.set_log_level()

Output Logs to the Console.log_to_console()

Output Logs to a Text File.log_to_file()

AidVoice C++ Sample Programs

AidVoice ASR Audio Recognition Sample

AidVoice TTS Audio Synthesis Sample