\input texinfo   @c -*-texinfo-*-
@c %**start of header
@setfilename tts-api.info
@settitle Common Text-to-Speech API (draft, 17.5.2006)
@finalout
@c @setchapternewpage odd
@c %**end of header

@copying
Copyright @copyright{} 2006 Brailcom, o.p.s.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:@*
1. Redistributions of source code of this document must retain the
above copyright notice and this list of conditions.@*
2. Redistributions in binary form and in printed form must reproduce the
above copyright notice, this list of conditions and/or other materials provided
with the distribution.@*
3. The names of the authors may not be used to endorse or promote products
derived from this document without specific prior written permission.
@end copying

@include macros.texi

@c @set version 2006-03-09

@c Directory, keywords
@dircategory Sound
@dircategory Development
@dircategory Accessibility

@direntry
* TTS API: (tts-api).    Common TTS API
@end direntry

@c Title page for printed version
@titlepage
@title Common Text-to-Speech API @break (draft, 28.4.2006)

@author Hynek Hanke, Brailcom <@email{hanke@@brailcom.org}>
@author Milan Zamazal, Brailcom <@email{zamazal@@brailcom.org}>
@author Willie Walker, GNOME, Sun Microsystems <@email{willie.walker@@sun.com}>
@author Olaf Jan Schmidt, KDE <@email{ojschmidt@@kde.org}>
@author Gary Cramblitt, KDE <@email{garycramblitt@@comcast.net}>

@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents

@c Title page for INFO
@ifnottex
@node Top, Introduction, (dir), (dir)
@top Common TTS Application Interface
@insertcopying
@end ifnottex

@menu
* Introduction::
* Interface Description::
* Notes About the Interface::
* Requirements on the API::
* Extended SSML Markup::
* Key Names::
* Requirements on the synthesizers::
* Recommended Sound Icons::
* Related Specifications::
* Index of Functions::
@end menu

@node Introduction
@chapter Introduction

The purpose of this document is to define a common low-level interface
for access to the various speech synthesizers on Free Software and Open
Source platforms. It is designed to be used by applications that do
not need advanced functionality like message management (such as
txt2wave) and by applications providing high-level interfaces (such as
@dispatcher{}, @gnomespeech{}, @kttsd{} etc.). The purpose of this
document is not to define and force an API on the speech
synthesizers. The synthesizers might use different interfaces that
will be handled by their drivers.

This interface will be implemented by a simple layer integrating
available speech synthesis drivers and in some cases emulating some of
the functionality missing in the synthesizers themselves.

Advanced capabilities not directly related to speech, like message
management, prioritization, synchronization etc. are left out of scope for
this low-level interface. They will be dealt with by higher-level
interfaces. Such a high-level interface
(not necessarily limited to speech) will make good use of the already
existing low-level interface.

It is desirable that simple applications can use this API in a simple
way. However, the API must also be complex enough so that it doesn't
limit more advanced applications in use of the synthesizers.

Requirements on this interface have been gathered between various accessibility
projects, most notably KDE, GNOME, Emacspeak, Speakup and Free-b-Soft.
They are summarized in Appendix A and Appendix B of this document. Appendix
A deals with general requirements and required functionality, while
Appendix B describes the extended SSML subset in use and thus also
defines required parameter settings. The interface definition contained
in chapter 2 was composed based on these requirements.

@temporary{A goal is a real implementation of this interface in the
near future.  The next step will be merging the available
engine drivers in the various accessibility projects under this
interface and using this interface. For this reason, we need all
accessibility projects who want to participate in this common effort
to make sure all their requirements on a low-level speech output
interface are met and that such an interface is defined so that it is
suitable for their needs.}

@temporary{Any comments about this draft are welcome and
useful. But since the goal of these requirements is a real
implementation, we need to avoid endless discussions and keep the
comments focused and to the point.}

@node Interface Description
@chapter Interface Description

This section defines the low-level TTS interface for use by
all assistive technologies on free software platforms.

@menu
* General Points::
* Speech Synthesis Driver Discovery::
* Voice Discovery::
* Speech Synthesis Commands::
* Speech Control Commands::
* Parameter Settings::
* Audio Retrieval::
* Event Callbacks::
@end menu

@node General Points
@section General Points

@itemize

@item
The definition of this interface is not meant to imply that the final
interface will be provided by C library calls. C syntax is only used
for convenience.

@itemize
@item
The general design of the API must be respected in every
implementation of it. Regardless of language of implementation, the
full set of functions defined here must be available.


@item
Function definitions (return value, arguments) must be respected.

@item
Data types must be respected in their meaning and possible values.

@item
The names of functions, data types and the names of the values for
enumeration and set data types should be the same as given here,
except for writing them in the form most appropriate for the language
in use and prepending/appending them with namespace identifiers, class
names etc.

@item
This API definition uses numerical values as error return values for
functions. A given implementation might choose a better mechanism
for reporting errors, appropriate for the language or communication mechanism
used (for example exceptions, three digit error
codes for a TCP protocol etc).

@item
Where this API speaks about callbacks, other means of asynchronous
notification can be used. (Such as asynchronous messages for a TCP
text protocol.)

@end itemize

@item
The interface is designed such that both a simple library interface
and a serialized language-independent character-based protocol for use
over sockets and pipes can be provided.

@item
This interface is meant to be provided by a simple library or process
running the various synthesis drivers and emulating MUST HAVE
functionality where possible and needed. It can also try to emulate
some of the SHOULD HAVE and NICE TO HAVE capabilities at implementor's
discretion.

@item
The interface between this library or process and the synthesis drivers
themselves, hardware or software, will be a subset of this interface.

@item
The interface definition uses the type @code{bool_t} not present in C
for variables whose value can be either @code{TRUE} or @code{FALSE}.

@item
For strings that can possibly contain characters not present in the
ASCII table (for example various language specific characters), the
data type used in this interface definition is @code{wchar_t} and the
corresponding expected format for Unicode is UTF-32. Strings marked
with type @code{char*} must not contain any wide characters. UTF-8
encoding should be used for values marked as @code{wchar_t} only
when the particular implementation of the interface (for example a
text protocol) does not allow for different types.

@item
When the word 'strings' is used in this interface description, it
means a NULL-terminated array of characters of type @code{char} for
single-byte encoding or a NULL-terminated array of type @code{wchar_t}
for UTF-32 encoding (in other words, four consecutive zero bytes
at the end). More appropriate data types can be used in other languages.

@end itemize
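
As an illustration of the UTF-8 fallback mentioned above, the following
minimal sketch converts a zero-terminated UTF-32 string to UTF-8, as a
text protocol implementation might do. The helper names
@code{utf32_to_utf8_char} and @code{utf32_to_utf8} are hypothetical and
not part of this interface, and @code{uint32_t} is used in place of
@code{wchar_t} on the assumption that @code{wchar_t} may not be 32 bits
wide on every platform.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 when cp is not a valid
 * Unicode scalar value (a surrogate or a value above U+10FFFF).
 * Hypothetical helper, not defined by this API. */
size_t utf32_to_utf8_char(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert a zero-terminated UTF-32 string to zero-terminated UTF-8.
 * Returns the number of bytes written, not counting the terminator,
 * or (size_t)-1 on invalid input or when dst is too small. */
size_t utf32_to_utf8(const uint32_t *src, char *dst, size_t dst_size)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != 0; src++) {
        unsigned char buf[4];
        size_t n = utf32_to_utf8_char(*src, buf);

        /* Keep one byte free for the terminating zero. */
        if (n == 0 || used + n >= dst_size)
            return (size_t)-1;
        for (size_t i = 0; i < n; i++)
            dst[used++] = (char)buf[i];
    }
    if (used >= dst_size)
        return (size_t)-1;
    dst[used] = '\0';
    return used;
}
```
@end verbatim

A caller would typically size the destination buffer at four bytes per
input character plus one for the terminator, which is the worst case
for UTF-8.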

@node Speech Synthesis Driver Discovery
@section Speech Synthesis Driver Discovery

This section deals with the discovery of the synthesis drivers
available behind this interface. It also covers discovery of the
capabilities and voices provided by the drivers.

@anchor{driver_capabilities_t}
@deftp {Variable Type} driver_capabilities_t

@code{driver_capabilities_t} is a structure data type intended for
carrying information about driver capabilities.

@verbatim
typedef struct {
    /* Voice discovery */
    bool_t can_list_voices;
    bool_t can_set_voice_by_properties;
    bool_t can_get_current_voice;

    /* Prosody parameters */
    bool_t can_set_rate_relative;
    bool_t can_set_rate_absolute;
    bool_t can_get_rate_default;

    bool_t can_set_pitch_relative;
    bool_t can_set_pitch_absolute;
    bool_t can_get_pitch_default;

    bool_t can_set_pitch_range_relative;
    bool_t can_set_pitch_range_absolute;
    bool_t can_get_pitch_range_default;

    bool_t can_set_volume_relative;
    bool_t can_set_volume_absolute;
    bool_t can_get_volume_default;

    /* Style parameters */
    bool_t can_set_punctuation_mode_all;
    bool_t can_set_punctuation_mode_none;
    bool_t can_set_punctuation_mode_some;
    bool_t can_set_punctuation_detail;

    bool_t can_set_capital_letters_mode_spelling;
    bool_t can_set_capital_letters_mode_icon;
    bool_t can_set_capital_letters_mode_pitch;

    bool_t can_set_number_grouping;

    /* Synthesis */
    bool_t can_say_text_from_position;
    bool_t can_say_char;
    bool_t can_say_key;
    bool_t can_say_icon;

    /* Dictionaries */
    bool_t can_set_dictionary;

    /* Audio playback/retrieval */
    bool_t can_retrieve_audio;
    bool_t can_play_audio;

    /* Events and index marking */
    bool_t can_report_events_by_sentences;
    bool_t can_report_events_by_words;
    bool_t can_report_custom_index_marks;

    /* Performance guidelines */
    int honors_performance_guidelines;

    /* Deferring messages */
    bool_t can_defer_message;

    /* SSML Support */
    bool_t can_parse_ssml;

    /* Multilingual utterances */
    bool_t supports_multilingual_utterances;
} driver_capabilities_t;
@end verbatim

@var{can_set_rate_*}, @var{can_set_pitch_*},
@var{can_set_pitch_range_*} and @var{can_set_volume_*} variables
indicate whether the corresponding prosody parameter setting commands
are supported. See @pref{Prosody Parameters}.

@var{can_set_punctuation_mode_*}
variables indicate which parameters
are supported for @func{set_punctuation_mode()}. See
@pref{set_punctuation_mode()}.

@var{can_set_punctuation_detail} indicates
whether the function @func{set_punctuation_detail()}
is supported. See @pref{set_punctuation_detail()}.

@var{can_set_capital_letters_mode_*} variables indicate
which parameters are supported for @func{set_capital_letters_mode()}.
See @pref{set_capital_letters_mode()}.

@var{can_set_number_grouping} indicates whether the function
@func{set_number_grouping} is supported. See @pref{set_number_grouping()}.

@var{can_say_text_from_position} indicates whether
the capability to start synthesis at a given position in
the text is supported, as described in @pref{say_text()}.

Other @var{can_say_*} variables indicate whether the corresponding
@code{say_} synthesis command is supported. See @pref{Speech Synthesis
Commands}.

@var{can_set_dictionary} indicates whether the function
@func{set_dictionary()} is supported.

@var{can_play_audio} and @var{can_retrieve_audio} variables indicate
whether the corresponding audio output methods are allowed
for @func{set_audio_output}. See @pref{Audio Retrieval}.

@var{can_report_*} variables indicate which kind of audio events
and index marks are supported. See @pref{Event Callbacks}.

@var{honors_performance_guidelines} variable is @code{0} if
performance guidelines are not honored, @code{1} if performance
guidelines are honored on the @should{} level and @code{2}
if performance guidelines are honored on the @niceto{} level.

@var{can_defer_message} indicates whether the defer capability is
supported. If this variable is true, @func{defer()}
and @func{say_deferred()} must be supported. It is expected that
the synthesizer will be able to defer multiple messages
at the same time. See @pref{defer()},
@pref{say_deferred()}.

@var{can_parse_ssml} indicates whether the synthesizer is
able to parse SSML. It doesn't indicate which SSML elements
and attributes are supported.

@var{supports_multilingual_utterances} indicates whether the
synthesizer supports multilingual utterances (utterances containing
multiple languages).

@end deftp
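
The following sketch shows how an application might consult the
capability flags before choosing how to set the speech rate. It uses a
reduced copy of @code{driver_capabilities_t} containing only three of
the fields above. The names @code{rate_method_t},
@code{choose_rate_method} and @code{performance_level_name} are
hypothetical, the @code{bool_t} definition is one possible choice, and
preferring absolute over relative settings is an assumption of this
example rather than a rule of the interface.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* One possible definition of bool_t; the API leaves this open. */
typedef int bool_t;
#define TRUE 1
#define FALSE 0

/* Reduced copy of driver_capabilities_t with only the fields used here. */
typedef struct {
    bool_t can_set_rate_relative;
    bool_t can_set_rate_absolute;
    int honors_performance_guidelines;
} driver_capabilities_t;

/* Hypothetical result type for this example. */
typedef enum {
    RATE_METHOD_NONE,
    RATE_METHOD_ABSOLUTE,
    RATE_METHOD_RELATIVE
} rate_method_t;

/* Pick a way to set the rate that the driver reports as supported. */
rate_method_t choose_rate_method(const driver_capabilities_t *caps)
{
    assert(caps != NULL);
    if (caps->can_set_rate_absolute)
        return RATE_METHOD_ABSOLUTE;
    if (caps->can_set_rate_relative)
        return RATE_METHOD_RELATIVE;
    return RATE_METHOD_NONE;
}

/* Map honors_performance_guidelines (0, 1 or 2) to a readable label. */
const char *performance_level_name(int level)
{
    switch (level) {
    case 0:  return "not honored";
    case 1:  return "SHOULD HAVE";
    case 2:  return "NICE TO HAVE";
    default: return "unknown";
    }
}
```
@end verbatim

When @code{choose_rate_method} yields @code{RATE_METHOD_NONE}, the
application can expect rate setting calls to have no effect for this
driver.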

@deftp {Variable Type} driver_description_t
@anchor{driver_description_t}

@code{driver_description_t} is a structure containing information
about a single driver.

@verbatim
typedef struct {
    char*           driver_id;
    char*           driver_version;
    char*           synthesizer_name;
    char*           synthesizer_version;
} driver_description_t;
@end verbatim

@var{driver_id} is the identification string of the driver.

@var{synthesizer_version} carries information about the synthesizer
version in use in a human readable form. There is no strict rule
for formatting the version information inside the string as the
versioning schemes of the various synthesizers differ significantly.
If it is not possible to determine the synthesizer version, this string
should be NULL.

@var{synthesizer_name} is the full name of the synthesizer engine.

@var{driver_version} carries information about the driver version in
use for the given synthesizer. It has the form @code{"major.minor"}
where @code{major} is the major version number for the driver and
@code{minor} is the minor version number for the driver.

@var{driver_capabilities} contains information about the support
of the driver for functions and features defined in this interface.
See (@ref{driver_capabilities_t}) for a list of the available information.

@mycexample{
driver_id = "festival"
synthesizer_name = "Festival Speech Synthesis System"
synthesizer_version = "1.94beta"
driver_version = "1.2"
}
@end deftp
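
Since @var{driver_version} has the fixed form @code{"major.minor"}, an
application can split it into numbers. The following is a minimal
sketch under that assumption; @code{parse_driver_version} is a
hypothetical helper, not part of this interface. It deliberately
rejects free-form strings such as the @var{synthesizer_version} value
shown above, and it does not check for numeric overflow.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse a "major.minor" version string.
 * Returns 0 on success and stores both numbers, or -1 when the
 * string does not have that exact form. On failure the contents
 * of *major and *minor are unspecified. */
int parse_driver_version(const char *version, long *major, long *minor)
{
    char *end;

    assert(major != NULL && minor != NULL);
    if (version == NULL || !isdigit((unsigned char)version[0]))
        return -1;
    *major = strtol(version, &end, 10);
    /* A dot followed by at least one digit must come next. */
    if (*end != '.' || !isdigit((unsigned char)end[1]))
        return -1;
    *minor = strtol(end + 1, &end, 10);
    /* Nothing may follow the minor number. */
    return (*end == '\0') ? 0 : -1;
}
```
@end verbatim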

@deftypefun driver_description_t** list_drivers (void)
@anchor{list_drivers()}

@func{list_drivers()} returns a newly allocated null-terminated array
of available synthesizer drivers.  Each of the items in the array is of
the type @code{driver_description_t*}, @pref{driver_description_t}, and
must carry a properly filled in variable @var{driver_id}.

@fperror
@end deftypefun
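
A null-terminated array like the one returned by @func{list_drivers()}
can be walked as sketched below. The helpers @code{count_drivers} and
@code{find_driver} are hypothetical and only illustrate the array
layout. Treating a null array as empty is an assumption of this
example, and how the array is released is not specified here, so the
sketch does not free anything.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    char *driver_id;
    char *driver_version;
    char *synthesizer_name;
    char *synthesizer_version;
} driver_description_t;

/* Count the entries before the terminating NULL pointer. */
size_t count_drivers(driver_description_t **drivers)
{
    size_t n = 0;

    if (drivers == NULL)
        return 0;
    while (drivers[n] != NULL)
        n++;
    return n;
}

/* Return the description whose driver_id equals the given id,
 * or NULL when no such driver is listed. */
const driver_description_t *find_driver(driver_description_t **drivers,
                                        const char *driver_id)
{
    assert(driver_id != NULL);
    if (drivers == NULL)
        return NULL;
    for (size_t i = 0; drivers[i] != NULL; i++) {
        if (drivers[i]->driver_id != NULL &&
            strcmp(drivers[i]->driver_id, driver_id) == 0)
            return drivers[i];
    }
    return NULL;
}
```
@end verbatim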

@deftypefun driver_capabilities_t* driver_capabilities (char* driver_id)
@anchor{driver_capabilities()}

@func{driver_capabilities} returns information about the
capabilities of the driver in a @code{driver_capabilities_t} structure.

Under this API, a driver is not guaranteed to support all of the
functionality defined in this document. It must, however, provide the
full set of functions. Whether a function will have the described
effect can be discovered by examining the entries of the
@code{driver_capabilities_t} structure and comparing them with the
documentation for the given function.

@arg{driver_id} is the unique identification string for the
synthesizer driver whose capabilities should be reported.  See
@pref{list_drivers()}.

This function returns a properly filled @code{driver_capabilities_t}
structure on success.
@fperror
@end deftypefun

@node Voice Discovery
@section Voice Discovery

@deftp {Variable Type} voice_description_t
@anchor{voice_description_t}

@code{voice_description_t} is a structure containing the description
of a voice.

@verbatim
typedef struct {
    wchar_t *name;
    char *language;
    wchar_t *dialect;
    voice_gender gender;
    unsigned int age;
} voice_description_t;
@end verbatim

@var{name} is the name of the voice as recognized by the synthesizer.

@var{language} is an ISO 639 language code represented as a character
string. Examples are @code{en}, @code{fr}, @code{cs}.

@var{dialect} is a string describing the language dialect or NULL if
unknown or not applicable. Examples are @code{american} or @code{british}
with English language or @code{moravian} with Czech language.

@openissue{Is there a standard way of describing dialects?}

@var{gender} indicates the gender of the voice. The values @code{MALE},
@code{FEMALE} and @code{UNKNOWN} are permitted.

@var{age} gives the approximate age of the voice in years. A value
of @code{0} means the age is unknown.

@end deftp

@deftypefun voice_description_t** list_voices (char* driver_id)
@anchor{list_voices}

For a given driver specified as @var{driver_id},
@func{list_voices()} returns a newly allocated null-terminated
array describing the available voices in @code{voice_description_t*}
items, one for each voice.

@arg{driver_id} is the identification string of the driver
as returned by @func{list_drivers()} @pref{list_drivers()}.

@fperror

@end deftypefun
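
An application that wants to choose a voice itself could scan the
array returned by @func{list_voices()} as in the sketch below. The
definition of @code{voice_gender} is an assumption based on the values
@code{MALE}, @code{FEMALE} and @code{UNKNOWN} named above, and
@code{pick_voice} is a hypothetical helper. It returns the first voice
matching both language and gender, falls back to the first voice with
the right language, and treats @code{UNKNOWN} as "any gender".

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Assumed definition; the document only names the three values. */
typedef enum { MALE, FEMALE, UNKNOWN } voice_gender;

typedef struct {
    wchar_t *name;
    char *language;
    wchar_t *dialect;
    voice_gender gender;
    unsigned int age;
} voice_description_t;

/* Choose a voice for the given ISO 639 language code.
 * Returns NULL when no listed voice has that language. */
const voice_description_t *pick_voice(voice_description_t **voices,
                                      const char *language,
                                      voice_gender gender)
{
    const voice_description_t *fallback = NULL;

    assert(language != NULL);
    if (voices == NULL)
        return NULL;
    for (size_t i = 0; voices[i] != NULL; i++) {
        const voice_description_t *v = voices[i];

        if (v->language == NULL || strcmp(v->language, language) != 0)
            continue;
        if (gender == UNKNOWN || v->gender == gender)
            return v;            /* language and gender both match */
        if (fallback == NULL)
            fallback = v;        /* language matches, gender does not */
    }
    return fallback;
}
```
@end verbatim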

@node Speech Synthesis Commands
@section Speech Synthesis Commands

Functions defined in this section generally accept a message to
synthesize, with driver, voice and other parameters according to the
current settings at the time when the function is called. Several
types of messages are handled by this API. It can be either a text
message, containing plain text or SSML, or it can be a 'key' or
'character' event or any general event.

The functions defined in this section can only block the calling
process for as long as is necessary to fully receive and/or transfer
the message, which should generally be a very short time. These
functions will not block the calling process for the time of synthesis
of the message and audio output.

The result of these commands will either be that the resulting audio
stream is played on the audio device or that the audio stream is
returned via the registered communication channel. Please see
@ref{Audio Settings}.

@anchor{message_format_t}
@deftp {Variable Type} message_format_t

@code{message_format_t} is an enumeration type to indicate the type
of the content of a message.

@verbatim
typedef enum {
    MESSAGE_TYPE_SSML,
    MESSAGE_TYPE_PLAIN
} message_format_t;
@end verbatim

@code{MESSAGE_TYPE_SSML} means the content of the message is text
formatted according to the Speech Synthesis Markup Language. See
@pref{SSML}.

@code{MESSAGE_TYPE_PLAIN} means the content of the message is plain
text.

@end deftp

@anchor{message_id_t}
@deftp {Variable Type} message_id_t

@verbatim
typedef signed int message_id_t;
@end verbatim

A positive value represents the identification number of the message.  The
value of @code{0} means 'no message' and @code{-1} means an
error occurred.
@end deftp

@anchor{event_type_t}
@deftp {Variable Type} event_type_t

@code{event_type_t} is used to describe the type of an event both in the
original text and in the synthesized audio data.

@verbatim
typedef enum {
    EVENT_MESSAGE_BEGIN,
    EVENT_MESSAGE_END,
    EVENT_SENTENCE_BEGIN,
    EVENT_SENTENCE_END,
    EVENT_WORD_BEGIN,
    EVENT_WORD_END,
    EVENT_NONE
} event_type_t;
@end verbatim

@code{EVENT_MESSAGE_BEGIN} and @code{EVENT_MESSAGE_END} are events
corresponding to the beginning and end of the message.

@code{EVENT_SENTENCE_BEGIN} and @code{EVENT_SENTENCE_END} are
events corresponding to the beginning and end of a sentence.

@code{EVENT_WORD_BEGIN} and @code{EVENT_WORD_END} are events
corresponding to the beginning and end of a word.
@end deftp

@anchor{say_text()}
@anchor{say_text_from_event()}
@anchor{say_text_from_index_mark()}
@anchor{say_text_from_character()}
@deftypefun message_id_t say_text (message_format_t format, wchar_t* text)
@deftypefunx message_id_t say_text_from_event (message_format_t format, wchar_t* text, unsigned int position, event_type_t position_type)
@deftypefunx message_id_t say_text_from_index_mark (message_format_t format, wchar_t* text, char* index_mark)
@deftypefunx message_id_t say_text_from_character (message_format_t format, wchar_t* text, size_t character_position)

@func{say_text} accepts a text message to synthesize. The
@code{say_text_from_*} variants start synthesis at the given position.

@var{position} and @var{position_type} describe the position in the
message where synthesis should be started. @var{position_type} can be
either a word or sentence event. @var{position} is a positive counter
of events of type @var{position_type} from the beginning of the
message. So for example the position @code{2} of event
@code{EVENT_WORD_BEGIN} describes the start of the second word.  In a
similar way, @var{index_mark} specifies the name of the index mark
where synthesis should start and @var{character_position} gives a
position in the text as a positive number in characters.

There is no explicit upper limit on the size of the text, but the
server administrator may set one in the configuration or the limit can
be enforced by available system resources.  If the limit is exceeded,
the whole text is accepted, but the excess is ignored and an
error is returned.

When a markup language, such as SSML, is being used as the format
of the text, this markup may or may not be checked for validity,
according to the user's settings. If a validity check is performed and
the text is found to be invalid, an error code is returned and the
text is not processed further.

Errors found while processing the document, such as a markup
request to set a language which is not available for the synthesizer,
are not reported.

If the position requested through @func{say_text_from_character} falls in
the middle of a markup tag, the synthesis should begin with the text
following the tag. If the position is in the middle of a word, the
synthesizer can either synthesize from the exact position or it can
start from the beginning of the word. Neither of these is considered
an error.

@arg{format} is a format of the message according to @pref{message_format_t}.

@arg{text} is the text to be synthesized in the form according to the
value of the @var{format} argument.

@arg{position} is a positive number counting the events of the given
type.  If @var{position_type} is set to @code{EVENT_MESSAGE_BEGIN},
the value of this argument is irrelevant and is conventionally set to
@code{0}.

@arg{position_type} is one of @code{EVENT_MESSAGE_BEGIN},
@code{EVENT_SENTENCE_BEGIN}, @code{EVENT_SENTENCE_END},
@code{EVENT_WORD_BEGIN} and @code{EVENT_WORD_END}.

@arg{index_mark} is the name of the index mark where synthesis
should begin.

@arg{character_position} is a positive number of the character
where synthesis should begin.

On success, a positive value -- a unique message identifier -- is
returned.
@fierror

For example, calling @func{say_text_from_event} with the following arguments
@example @code
say_text_from_event(MESSAGE_TYPE_PLAIN, "This is an example.", 3, EVENT_WORD_BEGIN)
@end example
should result in audio which starts with the word 'an' and continues
to the end of the sentence.

@note{For longer and more complicated texts, it will not be possible
to say in advance where the audio will start, given just the original
text of the message and the position description. The placing of
events across the original text may be ambiguous and depends on the
synthesizer. However, this capability is designed for purposes like
rewinding (rewind 5 sentences forward) or context pause (resume
speaking from a place which we already got event information about
when we executed pause).  The application must not try to guess where
exactly the events are and rely on that guess if it did not receive
the information from the synthesizer earlier.}
@end deftypefun
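
The rule that a character position falling inside a markup tag moves
synthesis to the text following the tag can be sketched on the driver
side as follows. This is only an illustration under stated
assumptions: @code{adjust_start_position} is a hypothetical helper,
positions are taken as zero-based indexes into the text, and tags are
recognized naively as anything between @code{<} and @code{>}, ignoring
comments, CDATA sections and quoted attribute values that a real SSML
parser would handle.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Return the index where synthesis should start for a requested
 * character position. If pos lies inside a tag, the index just
 * after the closing '>' is returned. Positions past the end of the
 * text, or inside an unterminated tag, are clamped to the end. */
size_t adjust_start_position(const wchar_t *text, size_t pos)
{
    int in_tag = 0;   /* nonzero while scanning between '<' and '>' */
    size_t i;

    assert(text != NULL);
    for (i = 0; text[i] != L'\0'; i++) {
        if (i == pos && !in_tag)
            return pos;          /* outside any tag, or at its '<' */
        if (text[i] == L'<') {
            in_tag = 1;
        } else if (text[i] == L'>' && in_tag) {
            in_tag = 0;
            if (i >= pos)
                return i + 1;    /* pos was inside this tag */
        }
    }
    return i;
}
```
@end verbatim

Moving a position that falls in the middle of a word back to the
beginning of that word is left out, because the text above allows
either behavior.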

@anchor{say_deferred()}
@anchor{say_deferred_from_position()}
@anchor{say_deferred_from_index_mark()}
@anchor{say_deferred_from_character()}
@deftypefun message_id_t say_deferred (message_id_t message_id)
@deftypefunx message_id_t say_deferred_from_position (message_id_t message_id, signed int position_from, PositionType position_type)
@deftypefunx message_id_t say_deferred_from_index_mark (message_id_t message_id, char* index_mark)
@deftypefunx message_id_t say_deferred_from_character (message_id_t message_id, size_t character)

@func{say_deferred} works just like @func{say_text}, except it works
on messages which were previously deferred and if position is set to
0, this has an additional meaning of ''start where speech was
interrupted last time''. Please see @pref{defer()}.

@arg{message_id} is the id of the message to synthesize, as obtained by
@func{defer()}.

On success, a positive value -- a unique message identifier -- is
returned.
@fierror
@end deftypefun

@deftypefun message_id_t say_key (wchar_t* key_name)

@func{say_key} accepts a key name to synthesize. The command is
intended to be used for speaking keys pressed by the user.

@arg{key_name} is a valid key name as defined in @ref{appendix-C}.

On success, a positive value -- a unique message identifier -- is
returned.
@fierror
@end deftypefun

@deftypefun message_id_t say_char (wchar_t character_name)

@func{say_char} accepts a letter (or syllable if the language doesn't
have individual letters) to synthesize. The command is intended to be
used for speaking single character messages, produced when the user is
moving the cursor over a word.

@arg{character_name} is the character to synthesize.


On success, a positive value -- a unique message identifier -- is
returned.
@fierror
@end deftypefun

@deftypefun message_id_t say_icon (char* icon_name)

@func{say_icon} accepts a general sound icon to synthesize.  The
command is intended to be used for general events like `new-line',
`message-arrived', `question' or `new-email'. The exact sound
produced or text synthesized depends on the user's configuration.

The name for the icon can be one of the names given in
@pref{recommended-sound-icons} or any other name. If the
icon name is not recognized by the synthesizer, the
synthesizer tries to synthesize the name of the icon itself.

@arg{icon_name} is the name of the icon to synthesize. It must not contain
any whitespace characters.

On success, a positive value -- a unique message identifier -- is
returned.
@fierror
@end deftypefun
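
Because an icon name must not contain whitespace, a caller or an
implementation may want to check names before passing them on. The
sketch below shows one way to do this; @code{is_valid_icon_name} is a
hypothetical helper, and rejecting the empty string is an assumption
of this example that the text above does not state.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 when icon_name is non-empty and free of whitespace
 * characters, 0 otherwise. */
int is_valid_icon_name(const char *icon_name)
{
    assert(icon_name != NULL);
    if (icon_name[0] == '\0')
        return 0;
    for (const char *p = icon_name; *p != '\0'; p++) {
        /* The cast avoids undefined behavior for negative char values. */
        if (isspace((unsigned char)*p))
            return 0;
    }
    return 1;
}
```
@end verbatim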

@node Speech Control Commands
@section Speech Control Commands

@deftypefun void cancel (void)

@func{cancel} immediately stops synthesis and audio output of the
current message. When this function returns, the audio output is fully
stopped and the synthesizer is ready to synthesize a new message.

If this function is called during the transfer of audio data to the
application, the data block currently being transferred is completed and
no further data block is sent.

Calling this command when no message is being processed is
not considered an error.
@fierror

@end deftypefun

@anchor{defer()}
@deftypefun void defer (void)

@func{defer} is similar to @func{cancel}, except that after stopping the
synthesis process and audio playback the message is not thrown away in
the synthesizer. Instead, data that might be useful for working with
the message later (such as rewinding, repeating or resuming the
synthesis process) are preserved. This might or might not include the original
text of the message. In any case, enough information must
be preserved so that the synthesizer is able to fully reproduce the
audio data for the message.

If this function is called during the transfer of audio data to the
application, the data block currently being transferred is completed and
no further data block is sent.

This function can also be called after all the audio has already been
transferred to the application, but before another synthesis request is
issued, with no @func{cancel()} request in between. In that case, the
data for the previous message are stored.

There is no explicit upper limit on the number of messages
that can be simultaneously postponed by @func{defer()}. There might,
however, be a limit imposed by the administrator or forced by
available system resources. If such a limit is exceeded,
@func{defer()} will return with an error.

After the message is no longer needed, the application must make
sure to discard it through @func{discard()}, otherwise system
resources will be wasted.

@fireturn
@end deftypefun

@deftypefun int discard (message_id_t message)

Discards a previously deferred message. The driver/engine
will drop all information about this message and the
message will be removed from the list of deferred messages.

See @pref{defer()}.

@arg{message} is the message ID of the message to discard.
Passing the ID of a message that is not deferred is considered
an error.

@fireturn
@end deftypefun
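
The bookkeeping implied by @func{defer} and @func{discard} can be
sketched as a small bounded store. This is only an illustrative model,
not the driver's implementation: @code{deferred_store_t},
@code{store_defer}, @code{store_discard} and @code{MAX_DEFERRED} are
hypothetical names, @code{message_id_t} is assumed here to be an integer
type, and the @code{0}/@code{-1} return values are placeholders for
whatever error convention the API actually uses.

@verbatim
```c
#include <stddef.h>

#define MAX_DEFERRED 4      /* assumed limit, for illustration only */

typedef int message_id_t;   /* assumption: the real type is defined elsewhere */

/* Hypothetical store of deferred message IDs. */
typedef struct {
    message_id_t ids[MAX_DEFERRED];
    size_t count;
} deferred_store_t;

/* Remember a deferred message.
 * Returns 0 on success, -1 if the limit would be exceeded. */
int store_defer(deferred_store_t *s, message_id_t id)
{
    if (s->count >= MAX_DEFERRED)
        return -1;
    s->ids[s->count++] = id;
    return 0;
}

/* Forget a deferred message.
 * Returns 0 on success, -1 if the ID is not currently deferred. */
int store_discard(deferred_store_t *s, message_id_t id)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->ids[i] == id) {
            /* Order is not significant, so fill the gap with the last entry. */
            s->ids[i] = s->ids[s->count - 1];
            s->count--;
            return 0;
        }
    }
    return -1;
}
```
@end verbatim

In this model, deferring beyond the limit and discarding an ID that was
never deferred both fail, mirroring the two error conditions described
above.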

@node Parameter Settings
@section Parameter Settings

@menu
* Driver Selection and Parameters::
* Voice Selection::
* Prosody Parameters::
* Style Parameters::
* Dictionaries::
* Audio Settings::
@end menu

@node Driver Selection and Parameters
@subsection Driver Selection and Parameters

@deftypefun int set_driver (char* driver_id)

Set the synthesis driver. See @ref{list_drivers()}.

@arg{driver_id} is the unique ID of the driver
as returned by @code{list_drivers()}.

@end deftypefun

@node Voice Selection
@subsection Voice Selection

Setting parameters in this section only has effect until the
synthesizer driver in use is changed by the application.

@deftypefun int set_voice_by_name (wchar_t* voice_name)

@func{set_voice_by_name} selects the voice with the given name.

@arg{voice_name} is the name of the desired voice. It must be one of
the names returned by @code{list_voices()}.  See @ref{list_voices}.

@fireturn
@end deftypefun

@deftypefun int set_voice_by_properties (voice_description_t *voice_description, unsigned int variant)

@func{set_voice_by_properties} selects a voice under the current
driver most closely matching the given description. The exact voice
selected might be subject to user preference settings for voice
selection inside the synthesizer.

There is no guarantee that any of the given parameters will be
respected, although language generally is supposed to be respected,
unless impossible or unless the user wishes otherwise.

If no voice matches the given language, the synthesizer should
pick the general default voice (if applicable) or choose an
arbitrary voice. This alone is not considered an error and must not be
a reason for the synthesizer to refuse further synthesis requests,
unless there is some other related reason (for example, the voice being
unable to handle the given Unicode character range).

The application can check which voice was selected and how closely (if
at all) it matches the given description by calling
@func{get_current_voice}.

@arg{voice_description} is a description of the desired voice. Any of
its entries except language can be filled in or left blank
(@code{NULL} for strings, @code{0} for integer values, @code{UNKNOWN}
for VoiceGender). Please see @ref{voice_description_t} for more
information about the format and allowed values.

@arg{variant} is a positive (1,2,3...) number specifying which of the
voices matching the description and assigned equal priority inside the
synthesizer should be selected. Please see @pref{SSML} for more details.

@note{This function differs from calling @code{list_voices()} and
then @code{set_voice_by_name}, because here the user settings for
voice selection inside the synthesizer are respected.}

@fireturn
@end deftypefun

@deftypefun voice_description_t* get_current_voice (void)

@func{get_current_voice} returns a @code{voice_description_t}
structure filled in with all known information about the voice
currently in use.

@fperror
@end deftypefun

@node Prosody Parameters
@subsection Prosody Parameters

Setting parameters in this section only has effect until the
synthesizer driver in use is changed.

@deftypefun int set_rate_relative (signed int rate_relative)
@deftypefunx int set_rate_absolute (unsigned int rate_absolute)
@deftypefunx {unsigned int} get_rate_absolute_default (void)

Set/get the rate of speech.

@arg{rate_relative} represents the relative change with respect to the
default value for the given voice. For example @code{0} means the
default value for the given voice while @code{-50} means a fifty
percent lower rate with respect to the default.

@arg{rate_absolute} is the desired rate in words per minute.

@fireturn
@end deftypefun
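
The relative value can be read as a percentage offset from the default
of the current voice. A minimal sketch of that arithmetic follows,
assuming a hypothetical helper @code{apply_relative_percent} that is not
part of this API. Clamping results below zero to @code{0} is also an
assumption; the behaviour for values under @code{-100} is not specified
here.

@verbatim
```c
#include <limits.h>

/* Hypothetical helper, not part of the TTS API.
 * default_abs: default absolute value for the voice (e.g. words per minute).
 * relative:    signed percentage change; 0 keeps the default, -50 halves it. */
unsigned int apply_relative_percent(unsigned int default_abs, int relative)
{
    /* Use a wide signed type so the intermediate product cannot overflow. */
    long long scaled = (long long)default_abs * (100LL + relative) / 100LL;

    if (scaled < 0)
        return 0;               /* assumed: never go below zero */
    if (scaled > UINT_MAX)
        return UINT_MAX;        /* saturate instead of wrapping around */
    return (unsigned int)scaled;
}
```
@end verbatim

For instance, with a default of 180 words per minute, a relative rate of
@code{-50} gives 90 words per minute and @code{0} leaves the rate at 180.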

@deftypefun int set_pitch_relative (signed int pitch_relative)
@deftypefunx int set_pitch_absolute (unsigned int pitch_absolute)
@deftypefunx {unsigned int} get_pitch_absolute_default (void)

Set/get the voice base pitch.

@arg{pitch_relative} represents the relative change with respect to the default
value for the given voice. For example @code{0} means the default
value for the given voice while @code{-50} means a fifty percent lower
pitch with respect to the default.

@arg{pitch_absolute} is the desired pitch in Hz.

@fireturn
@end deftypefun

@deftypefun int set_pitch_range_relative (signed int range)

Set voice pitch range in relative units. Pitch range is how much pitch
changes in intonation with respect to the base pitch.

@arg{range} represents the relative change with respect to the default
value for the given voice. For example @code{0} means the default
value for the given voice while @code{-50} means a fifty percent lower
pitch range with respect to the default.
@end deftypefun

@deftypefun int set_pitch_range_absolute (unsigned int range)

@openissue{How should this work? It is not clear from the SSML specs.}

@fireturn
@end deftypefun

@deftypefun int set_volume_relative (signed int volume_relative)
@deftypefunx int set_volume_absolute (unsigned int volume_absolute)
@deftypefunx {unsigned int} get_volume_absolute_default (void)

Set/get the volume of speech.

@arg{volume_relative} represents the relative volume change with respect to the
default value for the given voice. For example @code{0} means the
default value for the given voice while @code{-50} means a fifty
percent lower volume with respect to the default.

@arg{volume_absolute} is a number from the range 0 to 100 where
the value of @code{0} means silence and @code{100} means
maximum volume.

@fireturn
@end deftypefun

@node Style Parameters
@subsection Style Parameters

@deftp {Variable Type} punctuation_mode_t
@anchor{punctuation_mode_t}

@code{punctuation_mode_t} is an enumeration type that specifies
the punctuation signalling mode.

@verbatim
typedef enum {
    PUNCTUATION_NONE,
    PUNCTUATION_ALL,
    PUNCTUATION_SOME
} punctuation_mode_t;
@end verbatim

@code{PUNCTUATION_NONE} means no punctuation is signalled.

@code{PUNCTUATION_ALL} means all punctuation characters are signalled.

@code{PUNCTUATION_SOME} means only selected punctuation characters
are signalled. (See @ref{set_punctuation_detail()}).
@end deftp
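
How the three modes might be applied to a single character can be
sketched as follows. This is an illustration only:
@code{should_signal} is a hypothetical helper, and representing the
selected characters as a plain C string is an assumption, since the
format accepted by @code{set_punctuation_detail()} is not described in
this section.

@verbatim
```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef enum {
    PUNCTUATION_NONE,
    PUNCTUATION_ALL,
    PUNCTUATION_SOME
} punctuation_mode_t;

/* Hypothetical helper: decide whether character c should be signalled.
 * selected lists the characters chosen for PUNCTUATION_SOME and may be
 * NULL (assumed representation). */
bool should_signal(punctuation_mode_t mode, char c, const char *selected)
{
    /* Only punctuation characters are ever signalled. */
    if (!ispunct((unsigned char)c))
        return false;

    switch (mode) {
    case PUNCTUATION_NONE:
        return false;
    case PUNCTUATION_ALL:
        return true;
    case PUNCTUATION_SOME:
        return selected != NULL && strchr(selected, c) != NULL;
    }
    return false;   /* unknown mode: signal nothing */
}
```
@end verbatim

The sketch handles single-byte characters only; a real driver working
with Unicode text would need a wider character type.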

@deftypefun int set_punctuation_mode (punctuation_mode_t mode)
@anchor{set_punctuation_mode()}

Set punctuation reading mode. In other words, this influences which
punctuation characters will be signalled while reading the text.
Signalling means either synthesizing their name (e.g. `question mark')
or playing the appropriate sound icon, according to user settings
inside the synthesizer.