Method and device for transmission of speech activity in a distributed voice recognition system

FIELD: speech activity transmission in distributed voice recognition systems.

SUBSTANCE: the distributed voice recognition system has a local voice recognition (VR) engine in the subscriber unit and a VR server engine on the server. The local VR engine has a feature extraction (FE) module, which extracts features from the voice signal. A voice activity detection (VAD) module detects voice activity in the voice signal. An indication of voice activity is transmitted from the subscriber unit to the server ahead of the features.

EFFECT: reduced network congestion, reduced delay, and improved voice recognition performance.

3 cl, 8 dwg, 2 tbl

 

Technical field of the invention

The present invention relates generally to the field of communications and, in particular, to a system and method for transmitting speech activity in a distributed voice recognition system.

Background art

Voice recognition (VR) is one of the most important techniques for endowing a machine with simulated intelligence to recognize user voice commands and to facilitate human interaction with the machine. VR is also a key technique for human speech understanding. Systems that employ techniques to recover a linguistic message from an acoustic speech signal are called voice recognizers.

The use of VR (also commonly called speech recognition) is becoming increasingly important for safety reasons. For example, VR can be used to replace the manual task of pushing buttons on a wireless telephone keypad. This is especially important when a user initiates a telephone call while driving. When using a car phone without VR, the driver must remove one hand from the steering wheel and look at the keypad while pushing the buttons to dial the call. These actions increase the likelihood of a road accident. A speech-enabled car phone (i.e., a phone designed for speech recognition) allows the driver to place telephone calls while continuously watching the road. In addition, a hands-free car kit system would allow the driver to keep both hands on the steering wheel while initiating a telephone call. An exemplary vocabulary for a hands-free car kit might include: the ten digits; the keywords "call", "send", "dial", "cancel", "clear", "add", "delete", "archive", "program", "yes", and "no"; and a predefined number of names of coworkers, friends, or family members who are called most often.

A voice (speech) recognition device, that is, a VR system, comprises an acoustic processor, also called the front end of the voice recognizer, and a word decoder, also called the back end of the voice recognizer. The acoustic processor performs feature extraction: it extracts from the incoming raw speech a sequence of information-bearing features (vectors) needed for VR. The word decoder decodes this sequence of features (vectors) to yield a meaningful and desired output format, for example a sequence of linguistic words corresponding to the spoken utterance.

When implementing a voice recognition device with a distributed system architecture, it is often desirable to place the word decoding task on a subsystem that can absorb the computational and memory load appropriately, e.g., on a network server. The acoustic processor, in turn, should reside as close to the speech source as possible, that is, in the user device, to reduce the effects of vocoders (used to compress speech before transmission) introduced during signal processing and/or errors induced in the channel. Thus, in a distributed voice recognition (DVR) system, the acoustic processor resides in the user device and the word decoder resides in the network.

DVR systems enable devices such as cellular phones, personal communicators, personal digital assistants (PDAs), and so on, to access information and services from a wireless network, such as the Internet, using voice commands, by accessing voice recognition servers on the network.

The use of speech compression (vocoding) degrades the accuracy of voice recognition systems used over wireless channels. This degradation can be mitigated by performing feature extraction from the user's voice commands on a device such as a subscriber unit (also called a subscriber station, mobile station, mobile unit, remote station, remote terminal, access terminal, or user equipment) and transmitting the VR features as data traffic, instead of transmitting the voice commands as voice traffic. Thus, in a DVR system, the features for front-end processing are extracted in the device, after which they are sent into the network. The device may be mobile or stationary and may communicate with one or more base stations (BS) (also called cellular base stations, base station transceiver systems (BTS), base transceivers, central communication points, access points, access nodes, node base stations, or modem pool transceivers (MPT)).

Complex voice recognition tasks require significant computational resources. Implementing such systems on a subscriber unit with limited CPU, memory, and battery resources is impractical. DVR systems leverage the computational resources available in the network. In a typical DVR system, the word decoder places far higher demands on computation and memory than the front end of the voice recognizer. Thus, a server-based VR system in the network serves as the back end of the voice recognition system and performs word decoding. This yields the benefit of performing complex VR tasks using network resources. Examples of DVR systems are disclosed in U.S. Patent No. 5956683, "Distributed Voice Recognition System", assigned to the assignee of the present invention and incorporated herein by reference.

In addition to feature extraction, simple VR tasks may also be performed in the subscriber unit, in which case the voice recognition system does not use network resources for simple VR tasks. Network traffic is thereby reduced, and with it the cost of providing speech-enabled services.

Even with simple VR tasks performed in the subscriber unit, traffic congestion in the network can cause subscriber units to receive poor service from the server-based VR system. A distributed VR system enables rich user interface features backed by complex VR tasks, but at the cost of increased network traffic and, at times, transmission delay. If the local VR engine in the subscriber unit fails to recognize the user's voice commands, those commands must be forwarded after front-end processing to the server-based VR engine, which increases network traffic and leads to congestion. Network congestion occurs when a large volume of traffic is transferred from subscriber units to the server-based VR system simultaneously. After the voice commands are interpreted by the server-based VR engine, the results must be sent back to the subscriber unit, which can introduce significant delay if the network is congested.

Thus, a DVR system needs a system and method for reducing network congestion and delay. Such a system and method would also enhance the performance of VR.

Summary of the invention

The described embodiments are directed to a system and method of transmitting speech activity that reduce network congestion and delay. The system and method for transmitting speech activity in a voice recognition system include a voice activity detection (VAD) module and a feature extraction (FE) module in a subscriber unit.

In one aspect, a subscriber unit comprises a feature extraction module configured to extract a plurality of features from a speech signal, a voice activity detection module configured to detect voice activity within the speech signal and provide an indication of the detected voice activity, and a transmitter coupled to the feature extraction module and the voice activity detection module and configured to transmit the indication of detected voice activity ahead of the plurality of features.

In another aspect, a subscriber unit comprises means for extracting a plurality of features from a speech signal, means for detecting voice activity within the speech signal and providing an indication of the detected voice activity, and a transmitter coupled to the feature extraction means and the voice activity detection means and configured to transmit the indication of detected voice activity ahead of the plurality of features.

In one aspect, the subscriber unit further comprises means for combining the plurality of features with the indication of detected voice activity, wherein the indication of detected voice activity precedes the plurality of features.

In one aspect, a method of transmitting speech activity comprises extracting a plurality of features from a speech signal, detecting voice activity within the speech signal and providing an indication of the detected voice activity, and transmitting the indication of detected voice activity ahead of the plurality of features.

Brief description of drawings

Figure 1 - a voice recognition system comprising an acoustic processor and a word decoder according to one embodiment of the present invention;

figure 2 - an exemplary embodiment of a distributed voice recognition system;

figure 3 - delays in an exemplary embodiment of a distributed voice recognition system;

figure 4 - a block diagram of a VAD module according to one embodiment of the invention;

figure 5 - a block diagram of a VAD submodule according to one embodiment of the invention;

figure 6 - a block diagram of an FE module according to one embodiment of the invention;

figure 7 - a state diagram of the VAD module according to one embodiment of the invention; and

figure 8 - parts of speech and VAD events on a time axis according to one embodiment of the invention.

Detailed description of the invention

Figure 1 shows a voice recognition system 2 comprising an acoustic processor 4 and a word decoder 6 according to one embodiment of the invention. The word decoder 6 comprises an acoustic pattern matching element 8 and a language modeling element 10. The language modeling element 10 is also known as a grammar description element. The acoustic processor 4 is coupled to the acoustic pattern matching element 8 of the word decoder 6. The acoustic pattern matching element 8 is coupled to the language modeling element 10.

The acoustic processor 4 extracts features from an input speech signal and provides them to the word decoder 6. Generally speaking, the word decoder 6 converts the acoustic features received from the acoustic processor 4 into an estimate of the speaker's original word sequence. This is done in two stages: acoustic pattern matching and language modeling. Language modeling can be omitted in applications for isolated word recognition. The acoustic pattern matching element 8 detects and classifies possible acoustic patterns such as phonemes, syllables, words, and so on. The candidate patterns are provided to the language modeling element 10, which models the syntactic constraint rules that determine which word sequences are grammatically well formed and meaningful. Syntactic information can be a valuable guide for voice recognition when acoustic information alone is ambiguous. Based on language modeling, the VR sequentially interprets the acoustic feature matching results and provides the estimated word sequence.

Both acoustic pattern matching and language modeling in the word decoder 6 require mathematical models, either deterministic or stochastic, to describe the speaker's phonological and acoustic-phonetic variations. The performance of a speech recognition system is directly related to the quality of these two models. Among the various classes of models for acoustic pattern matching known to those skilled in the art, the two best known are template-based dynamic time warping (DTW) and the stochastic hidden Markov model (HMM).

The acoustic processor 4 is a front-end speech analysis subsystem of the voice recognition system 2. In response to an input speech signal, it provides an appropriate representation characterizing the time-varying speech signal. It should discard irrelevant information such as background noise, channel distortion, speaker characteristics, and manner of speaking. An effective acoustic feature endows voice recognizers with higher acoustic discrimination power. The most useful characteristic is the short-time spectral envelope. In characterizing the short-time spectral envelope, a commonly used spectral analysis technique is filter-bank-based spectral analysis.

Combining multiple VR systems (also known as VR engines) provides improved accuracy and uses more of the information in the input speech signal than a single VR system. A system and method for combining VR engines is described in U.S. Patent Application No. 09/618177, "Combined Engine System and Method for Voice Recognition", filed July 18, 2000, and U.S. Patent Application No. 09/657760, "System and Method for Automatic Voice Recognition Using Mapping", filed September 8, 2000, assigned to the assignee of the present invention and fully incorporated herein by reference.

In one embodiment, multiple VR engines are combined in a distributed VR system. Thus, there is a VR engine in both the subscriber unit and the network server. The VR engine in the subscriber unit is the local VR engine; the VR engine on the server is the network VR engine. The local VR engine comprises a processor for running the local VR engine and a memory for storing speech information. The network VR engine comprises a processor for running the network VR engine and a memory for storing speech information.

An exemplary DVR system is disclosed in U.S. Patent Application No. 09/755561, "System and Method For Improving Voice Recognition In A Distributed Voice Recognition System", filed January 5, 2001, assigned to the assignee of the present invention and incorporated herein by reference. Figure 2 shows an exemplary embodiment of the present invention. In this exemplary embodiment, the environment is a wireless communication system comprising a subscriber unit 40 and a central communication point, called a cellular base station 42. In this exemplary embodiment, a distributed VR system is presented. In distributed VR, the acoustic processor or feature extraction element 22 resides in the subscriber unit 40 and the word decoder 48 resides at the central communication point. If, instead of distributed VR, voice recognition were implemented solely in the subscriber unit, recognition even of a medium-size vocabulary would be highly impractical because of the high computational cost of word recognition. If, on the other hand, VR resided only at the base station, accuracy could be drastically reduced due to the degradation of the speech signal caused by the speech codec and channel effects. Clearly, the distributed design has advantages. First, it reduces the cost of the subscriber unit, since no word-decoder hardware is needed in the subscriber unit 40. Second, it reduces the drain on the battery (not shown) of the subscriber unit 40 that would result from performing the computationally intensive word decoding locally. Third, recognition accuracy is improved, in addition to the flexibility and extensibility of the distributed system.

Speech is input to a microphone 20, which converts the speech signal into electrical signals supplied to a feature extraction element 22. The signals from the microphone 20 may be analog or digital. If they are analog, an analog-to-digital converter (not shown) may be interposed between the microphone 20 and the feature extraction element 22. The speech signals are provided to the feature extraction element 22, which extracts the relevant characteristics of the input speech that will be used to decode the linguistic interpretation of the input speech. One example of characteristics that can be used to estimate speech is the frequency characteristics of an input speech frame, often provided as linear predictive coding parameters of the input speech frame. The extracted speech features are then provided to a transmitter 24, which codes, modulates, and amplifies the feature signal and provides the features through a duplexer 26 to an antenna 28, from which the speech features are transmitted to the cellular base station or central communication point 42. Various types of digital coding, modulation, and transmission schemes well known to those skilled in the art may be employed.
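
As an illustration of such frequency characteristics, the following sketch computes linear prediction coefficients for one frame via the standard autocorrelation method and Levinson-Durbin recursion. The function name, prediction order, and the use of Python/NumPy are our assumptions for illustration, not part of the patent:

    import numpy as np

    def lpc_coefficients(frame, order=10):
        # Autocorrelation of the frame up to the prediction order.
        frame = np.asarray(frame, dtype=float)
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        # Levinson-Durbin recursion: solve the Toeplitz normal equations.
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                   # reflection coefficient
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k               # residual prediction error
        return a                             # a[0] = 1; a[1:] are LPC parameters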

At the central communication point 42, the transmitted features are received by an antenna 44 and provided to a receiver 46. The receiver 46 may perform demodulation and decoding of the received transmitted features, which it provides to a word decoder 48. The word decoder 48 determines from the speech features a linguistic estimate of the speech and provides an action signal to a transmitter 50. The transmitter 50 amplifies, modulates, and codes the action signal and provides the amplified signal to an antenna 52, which transmits the estimated words or command signal to the portable telephone 40. The transmitter 50 may likewise employ digital coding, modulation, or transmission techniques known to those skilled in the art.

At the subscriber unit 40, the estimated words or command signal is received by the antenna 28, which passes the received signal through the duplexer 26 to a receiver 30; the receiver, in turn, demodulates and decodes the signal and then provides the command signal or estimated words to a control element 38. In response to the received command signal or estimated words, the control element 38 produces the intended response (for example, dialing a phone number, presenting information on a display screen of the portable phone, and so forth).

In one embodiment of the invention, the information sent back from the central communication point 42 need not be an interpretation of the transmitted speech; rather, the information sent back from the central communication point 42 may be a response to the decoded message sent by the portable phone. For example, one may query a remote answering machine (not shown) coupled via a communications network to the central communication point 42, in which case the signal transmitted from the central communication point 42 to the subscriber unit 40 may be a message from the answering machine. In this alternative embodiment, a second control element 49 is located at the central communication point.

The VR engine receives speech data as pulse code modulation (PCM) signals. The VR engine processes the signal until a valid recognition is made or the user stops speaking and all the speech has been processed. In one embodiment, the DVR architecture includes a local VR engine that receives the PCM data and generates front-end information. In one embodiment, the front-end information is cepstral parameters. In another embodiment, the front-end information may be any type of information/features that characterize the input speech signal. It will be understood by those skilled in the art that any type of features known to those skilled in the art could be used to characterize the input speech signal.

For a typical recognition task, the local VR engine obtains a set of trained templates from its memory. The local VR engine obtains a grammar specification from an application. An application is service logic that enables users to accomplish a task using the subscriber unit. This logic is executed by a processor on the subscriber unit. It is a component of a user interface module in the subscriber unit.

A system and method for improving template storage in a voice recognition system is described in U.S. Patent Application No. 09/760076, "System And Method For Efficient Storage Of Voice Recognition Models", filed January 12, 2001, assigned to the assignee of the present invention and fully incorporated herein by reference. A system and method for improving voice recognition in noisy environments and under frequency mismatch conditions, and for improving template storage, is described in U.S. Patent Application No. 09/703191, "System and Method for Improving Voice Recognition In Noisy Environments and Frequency Mismatch Conditions", filed October 30, 2000, assigned to the assignee of the present invention and fully incorporated herein by reference.

A grammar specifies the active vocabulary using subword models. Typical grammars include 7-digit phone numbers, dollar amounts, and a city name from a set of names. Typical grammar specifications include an "out of vocabulary (OOV)" condition to represent the case where a reliable recognition decision could not be made on the basis of the input speech signal.

In one embodiment, the local VR engine generates a recognition hypothesis locally if it can handle the VR task specified by the grammar. When the specified grammar is too complex to be handled by the local VR engine, the local VR engine forwards the front-end data to the VR server.

A forward link refers to transmission from the network server to a subscriber unit, and a reverse link refers to transmission from a subscriber unit to the network server. Transmission time is partitioned into time units (blocks). In one embodiment, transmission time may be partitioned into frames. In another embodiment, transmission time may be partitioned into time slots. In one embodiment, data is partitioned into data packets, with each data packet being transmitted over one or more time units. During each time unit, the base station can direct data transmission to any subscriber unit that is in communication with that base station. In one embodiment, frames may be further partitioned into a plurality of time slots. In yet another embodiment, time slots may be further partitioned; for example, a time slot may be partitioned into half-slots and quarter-slots.

Figure 3 shows delays in an exemplary embodiment of a distributed voice recognition system 100. The DVR system 100 comprises a subscriber unit 102, a network 150, and a speech recognition (SR) server 160. The subscriber unit 102 is coupled to the network 150, and the network 150 is coupled to the SR server 160. The front end of the DVR system 100 is the subscriber unit 102, which comprises a feature extraction (FE) module 104 and a voice activity detection (VAD) module 106. The FE performs feature extraction from a speech signal and compression of the resulting features. In one embodiment, the VAD module 106 determines which frames will be transmitted from the subscriber unit to the SR server. The VAD module 106 segments the input speech into segments comprising the frames in which speech is detected, together with adjacent frames before and after the frames with detected speech. In one embodiment, the end of each segment (EOS) is marked in the payload by sending a null frame.

The VR front end performs front-end processing to characterize a speech segment. Vector s is a speech signal, and vectors F and V are the FE and VAD vectors, respectively. In one embodiment, the VAD vector is one element long, and that one element contains a binary value. In another embodiment, the VAD vector is a binary value combined with additional features. In one embodiment, the additional features are band energies, which enable the server to perform fine end-pointing. End-pointing is the demarcation of a speech signal into silence and speech segments. Thus, the server can use additional computational resources to arrive at a more reliable VAD decision.

The band energies correspond to Bark amplitudes. The Bark scale is a warped frequency scale of critical bands corresponding to human auditory perception. Bark amplitude calculation is known to those skilled in the art and is described in Lawrence Rabiner & Biing-Hwang Juang, Fundamentals of Speech Recognition (1993), which is fully incorporated herein by reference. In one embodiment, digitized PCM speech signals are converted to band energies.

Figure 3 shows the delays in an exemplary embodiment of a distributed voice recognition system. The delays in computing vectors F and V and transmitting them over the network are shown using Z-transform notation. The algorithmic delay introduced in computing vector F is k, and in one embodiment k ranges from 100 to 300 ms. Similarly, the algorithmic delay in computing the VAD information is j, and in one embodiment j ranges from 10 to 100 ms. Thus, FE feature vectors are available with a delay of k units and VAD information is available with a delay of j units. The delay introduced in transmitting the information over the network is n units. The network delays for vectors F and V are the same.

Figure 4 shows a block diagram of the VAD module 400. A framing module 402 includes an analog-to-digital converter (not shown). In one embodiment, the output speech sampling rate of the analog-to-digital converter is 8 kHz. It will be understood by those skilled in the art that other output sampling rates can be used. The speech samples are divided into overlapping frames. In one embodiment, the frame length is 25 ms (200 samples) and the frame rate is 10 ms (80 samples).

In one embodiment, each frame is windowed by a windowing module 404 using a Hamming window function. A fast Fourier transform (FFT) module 406 computes the magnitude spectrum of each windowed frame. In one embodiment, a fast Fourier transform of length 256 is used to compute the magnitude spectrum of each windowed frame. In one embodiment, the first 129 bins of the magnitude spectrum are retained for further processing. A power spectrum (PS) module 408 computes the power spectrum by squaring the magnitude spectrum.
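
A minimal NumPy sketch of this front-end chain, under the 8 kHz sampling rate, 25 ms/10 ms framing, Hamming window, 256-point FFT, and 129 retained bins stated above (function names are illustrative, not from the patent):

    import numpy as np

    def frame_signal(pcm, frame_len=200, frame_step=80):
        # 25 ms frames (200 samples) taken every 10 ms (80 samples) at 8 kHz.
        n = 1 + max(0, (len(pcm) - frame_len) // frame_step)
        return np.stack([pcm[i * frame_step:i * frame_step + frame_len]
                         for i in range(n)])

    def power_spectrum(frames, n_fft=256):
        # Hamming window, then a 256-point FFT; rfft keeps the first 129 bins.
        windowed = frames * np.hamming(frames.shape[1])
        magnitude = np.abs(np.fft.rfft(windowed, n=n_fft))
        return magnitude ** 2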

In one embodiment, a MEL filtering module 409 computes the MEL-warped spectrum using the full frequency band (0-4000 Hz). This band is divided into 23 channels equally spaced on the MEL frequency scale, so there are 23 energy values per frame. The output of the MEL filtering module 409 is the weighted sum of the FFT power spectrum values in each band. The output of the MEL filtering module 409 passes through a logarithm module 410, which performs a nonlinear transformation of the MEL filtering output. In one embodiment, the nonlinear transformation is the natural logarithm. Those skilled in the art will understand that other nonlinear transformations can be used.
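
A sketch of the 23-channel MEL filter bank and logarithm stage under the same assumptions. The triangular filter shape is a common choice assumed here; the patent itself only specifies 23 channels equally spaced on the MEL scale over 0-4000 Hz:

    def mel_filterbank(n_filters=23, n_fft=256, fs=8000):
        # Triangular filters with centers equally spaced on the MEL scale.
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
        bins = np.floor((n_fft + 1) * edges / fs).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
            fb[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fb[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        return fb

    def log_mel_energies(power_frames):
        # Weighted sum of power-spectrum values per band, then natural log.
        energies = power_frames @ mel_filterbank().T
        return np.log(np.maximum(energies, 1e-10))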

A voice activity detector (VAD) submodule 412 takes the transformed output of the logarithm module 410 as input and discriminates between speech and non-speech frames. The VAD submodule 412 detects the presence of voice activity within a frame: it determines whether or not a frame contains speech activity. In one embodiment, the VAD submodule 412 is a three-layer feed-forward neural network.

Figure 5 shows a block diagram of a VAD submodule 500. In one embodiment, a downsampling module 420 downsamples the output of the logarithm module by a factor of two.

A discrete cosine transform (DCT) module 422 computes cepstral coefficients from the 23 downsampled logarithmic energies on the MEL scale. In one embodiment, the DCT module 422 computes 15 cepstral coefficients.
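
The DCT stage can be sketched as follows, using SciPy's DCT-II and keeping 15 of the 23 coefficients as stated (the function name and the orthonormal normalization are our assumptions):

    from scipy.fftpack import dct

    def cepstral_coefficients(log_mel, n_ceps=15):
        # DCT-II of the 23 log MEL energies; keep the first 15 coefficients.
        return dct(log_mel, type=2, axis=-1, norm="ortho")[..., :n_ceps]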

A neural network (NN) module 424 provides an estimate of the posterior probability that the current frame is speech or non-speech. A threshold module 426 compares the estimate from the NN module 424 with a threshold to convert the estimate to a binary feature. In one embodiment, a threshold of 0.5 is used.

A median filter module 427 smooths the binary feature. In one embodiment, the binary feature is smoothed using an 11-point median filter. In one embodiment, the median filter module 427 removes any short pauses or short bursts of speech lasting less than 40 ms. In one embodiment, the median filter module 427 also adds seven frames before and after each transition from silence to speech. In one embodiment, a bit is set according to whether a frame is determined to contain speech activity or silence.
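
A sketch of the threshold-and-smooth stage, assuming per-frame posteriors from the neural network. At the 10 ms frame rate, an 11-point median window spans 110 ms and suppresses isolated pauses or bursts of a few frames (roughly the 40 ms scale mentioned above); the 7-frame padding step is omitted here:

    def smooth_vad_bits(posteriors, threshold=0.5, win=11):
        # Binarize the NN speech posteriors, then apply an 11-point median filter.
        bits = (np.asarray(posteriors) > threshold).astype(int)
        pad = win // 2
        padded = np.pad(bits, pad, mode="edge")
        return np.array([int(np.median(padded[i:i + win]))
                         for i in range(len(bits))])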

Figure 6 shows a block diagram of the FE module 600. A framing module 602, a windowing module 604, an FFT module 606, a PS module 608, an MF module 609, and a logarithm module 610 are also parts of the FE and perform the same functions in the FE module 600 as in the VAD module 400. In one embodiment, these common modules are shared between the VAD module 400 and the FE module 600.

A VAD submodule 612 is coupled to the logarithm module 610. A linear discriminant analysis (LDA) module 428 is coupled to the VAD submodule 612 and applies a bandpass filter to the output of the VAD submodule 612. In one embodiment, the bandpass filter is a RASTA filter. Examples of bandpass filters that can be used in the VR front end are the RASTA filters described in U.S. Patent No. 5450522, "Auditory Model for Parameterization of Speech", September 12, 1995, which is incorporated herein by reference.

A downsampling module 430 downsamples the output of the LDA module. In one embodiment, the downsampling module 430 downsamples the output of the LDA module by a factor of two.

A discrete cosine transform (DCT) module 432 computes cepstral coefficients from the 23 downsampled logarithmic energies on the MEL scale. In one embodiment, the DCT module 432 computes 15 cepstral coefficients.

To compensate for noise, an online normalization (OLN) module 434 applies mean and variance normalization to the cepstral coefficients from the DCT module 432. The estimates of the local mean and variance are updated for each frame. In one embodiment, an experimentally determined bias is added to the variance estimates before normalizing the features. The bias eliminates the effects of small noisy variance estimates in long silence regions. Dynamic features are derived from the normalized static features. This not only saves the computation required for normalization but also provides better recognition performance.
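
A minimal sketch of such online normalization; the smoothing constant and variance bias below are illustrative assumptions, since the patent states only that the bias was determined experimentally:

    def online_normalize(cepstra, alpha=0.99, bias=1.0):
        # Running per-dimension mean/variance, updated every frame; `bias`
        # damps noisy variance estimates during long silence regions.
        mean = np.zeros(cepstra.shape[1])
        var = np.ones(cepstra.shape[1])
        out = np.empty_like(cepstra, dtype=float)
        for t, c in enumerate(cepstra):
            mean = alpha * mean + (1.0 - alpha) * c
            var = alpha * var + (1.0 - alpha) * (c - mean) ** 2
            out[t] = (c - mean) / np.sqrt(var + bias)
        return out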

A feature compression module 436 compresses the feature vectors. A bit-stream formatting and framing module 438 performs bit-stream formatting of the compressed feature vectors, preparing them for transmission. In one embodiment, the feature compression module 436 performs error protection of the formatted bit stream.

The FE module 600 concatenates the vector F z^-k and the vector V z^-j. Thus, each extended FE vector comprises the concatenation of the vector F z^-k with the vector V z^-j.

In the present invention, the VAD output is transmitted ahead of the payload, which reduces the overall delay of the DVR system, because the VAD front-end processing is shorter (j < k) than the FE front-end processing.

In one embodiment, an application running on the server can determine the end of the user's utterance when the vector V indicates silence for a period longer than S_hangover. S_hangover is the duration of silence after active speech required to complete capture of the utterance. S_hangover must be longer than any silence permitted within an utterance. If S_hangover > k, the FE algorithmic delay will not increase the response time. In one embodiment, the FE features corresponding to time t-k and the VAD features corresponding to time t-j are combined to form extended FE features. The VAD output is sent when it becomes available and does not depend on when the FE output is available for transmission. The VAD and FE outputs are synchronized with the transmitted payload. In one embodiment, information corresponding to every segment of speech is transmitted, that is, no frames are dropped.
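
How a server-side application might apply the S_hangover rule to the incoming V vectors (a sketch; 100 frames corresponds to 1 s at the 10 ms frame rate, an assumed value inside the 500-1500 ms range given later):

    def end_of_utterance(vad_bits, s_hangover=100):
        # First frame index at which s_hangover consecutive silence frames
        # have followed active speech; None if the utterance is still open.
        silence_run, seen_speech = 0, False
        for t, bit in enumerate(vad_bits):
            if bit:
                seen_speech, silence_run = True, 0
            elif seen_speech:
                silence_run += 1
                if silence_run >= s_hangover:
                    return t
        return None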

In another embodiment, the channel bandwidth is reduced during silence periods. The vector F is quantized at a lower bit rate when the vector V indicates a silence region. This is similar to variable-rate and multi-rate vocoders, in which the bit rate is changed based on a voice activity decision. The VAD and FE outputs are both synchronized with the transmitted payload, and information corresponding to every speech segment is transmitted. Thus, the VAD output is transmitted, but the bit rate is reduced on silence frames.
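
A toy illustration of such VAD-driven variable-rate coding; the quantization step sizes are invented for the example, as the patent specifies only that silence frames are quantized at a lower bit rate:

    def quantize_frame(fe_vector, vad_bit):
        # Fine quantization for speech frames, coarse for silence frames.
        step = 0.05 if vad_bit else 0.4
        return np.round(np.asarray(fe_vector) / step).astype(np.int16)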

In another embodiment, only speech frames are transmitted to the server. Thus, silence frames are dropped entirely. Because only speech frames are transmitted, the server needs some way of determining that the user has finished speaking. This does not depend on the values of the delays k, j, and n. Consider a multi-word utterance such as "Portland <pause> Maine" or "617-555-<pause>1212". A separate channel is used to transmit the VAD information. The FE features corresponding to the <pause> region are dropped in the subscriber unit, and without a separate channel the server would have no information from which to conclude that the user has finished speaking. In this embodiment, there is a separate channel for transmitting the VAD information.

In another embodiment, the recognition state on the server is kept alive even during long pauses in the user's speech, according to the state diagram in Fig. 7 and the events and actions in Table 1. Upon detection of voice activity, the FE module 600 feature vector averaged over the dropped frames and the total number of dropped frames are transmitted before the speech frames are transmitted. In addition, when the mobile subscriber unit determines that S_hangover frames of silence have been observed, it detects the end of the user's utterance. In one embodiment, the speech frames and the total number of dropped frames are transmitted to the server together with the averaged feature vector of the FE module 600 on the same channel. Thus, the payload includes both the features and the VAD output. In one embodiment, the VAD output is sent last in the payload to indicate the end of speech.
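
The dropped-frame summary described above might be computed as follows (a sketch; the averaged vector and the frame-drop count FD_count are then sent ahead of the buffered speech frames):

    def summarize_dropped_frames(dropped_fe_vectors):
        # Average FE vector over the dropped silence frames, plus FD_count.
        dropped = np.asarray(dropped_fe_vectors, dtype=float)
        return dropped.mean(axis=0), len(dropped)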

For a typical utterance, the VAD module 400 begins in an idle state 702 and enters an initial silence state 704 due to event A. There may be several B events, which leave the module in initial silence. Upon detection of speech, event C causes a transition to an active speech state 706. The module 400 then toggles between the active speech state 706 and an embedded silence state 708 due to events D and E. If the embedded silence lasts longer than S_hangover, this is interpreted as the end of the utterance, and event F causes a transition to the idle state 702. Event Z represents a long initial silence in the utterance. It triggers a TIME_OUT error condition when no user speech is detected. Event X aborts a given state and returns the module to the idle state 702. This may be a user-initiated or a system-initiated event.
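
A simplified sketch of this state machine: it tracks states 702-708 and emits the Table 1 event labels, but omits the frame-buffering and transmission actions, and the threshold defaults (frame counts at 10 ms per frame) are our assumptions:

    IDLE, INIT_SIL, ACTIVE, EMB_SIL = "702", "704", "706", "708"

    def vad_state_machine(vad_bits, s_min=10, s_hangover=100, max_sil=250):
        state, events = INIT_SIL, ["A"]      # event A: capture initiated
        speech_run = sil_run = 0
        for bit in vad_bits:
            if state == INIT_SIL:
                if bit:
                    speech_run += 1
                    if speech_run >= s_min:
                        events.append("C")   # start of speech found
                        state, sil_run = ACTIVE, 0
                else:
                    if 0 < speech_run < s_min:
                        events.append("B")   # click/noise rejected
                    speech_run, sil_run = 0, sil_run + 1
                    if sil_run >= max_sil:
                        events.append("Z")   # TIME_OUT: no speech detected
                        sil_run = 0
            elif state == ACTIVE:
                if not bit:
                    state, sil_run = EMB_SIL, 1
            elif state == EMB_SIL:
                if bit:
                    events.append("E")       # speech resumes after embedded silence
                    state, speech_run = ACTIVE, 0
                else:
                    sil_run += 1
                    if sil_run >= s_hangover:
                        events.append("F")   # end of utterance
                        state = IDLE
            else:                            # IDLE (702): wait for next capture
                break
        return events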

Fig. 8 shows the parts of speech and the VAD events on a time axis. Referring to Fig. 8 and Tables 1 and 2, the events that cause the VAD module 400 to transition from one state to another are shown.

Table 1

Event   Action
A       Capture of an utterance is initiated by the user.
B       S_active < S_min. The duration of active speech is shorter than the minimum utterance duration. Prevents false detection due to clicks and other extraneous noise.
C       S_active > S_min. Start of speech found. Send the averaged FE feature vector, FD_count, and the S_before frames. Start sending FE feature vectors.
D       S_sil > S_after. Send the S_after frames. Reset FD_count to zero.
E       S_active > S_min. Active speech detected after embedded silence. Send the averaged FE feature vector, FD_count, and the S_before frames. Start sending FE feature vectors.
F       S_sil > S_hangover. End of the user's speech found. Send the averaged FE feature vector and FD_count.
X       Abort initiated by the user. May be initiated by the user from the keypad, initiated by the server when recognition is complete, or initiated by a higher-priority interrupt in the device.
Z       S_sil > MAX_SIL_DURATION. MAX_SIL_DURATION < 2.5 seconds for an 8-bit frame-drop counter. Send the averaged FE feature vector and FD_count. Reset FD_count to zero.

In Table 1, S_before and S_after are the numbers of silence frames transmitted to the server before and after active speech.

From the state diagram and the event table, which shows the corresponding actions in the mobile station, it can be seen that a number of thresholds are used to initiate state transitions. Specific default values can be used for these thresholds. However, those skilled in the art will understand that threshold values other than those shown in Table 1 can be used.

In addition, the server can change the default values depending on the particular application. The default values are programmable, as defined in Table 2.

Table 2

Segment name   Coordinates on Fig. 8     Description
S_min          >(b-a)                    The minimum utterance duration in frames. Used to prevent clicks and noise from being falsely detected as active speech.
S_active       (e-d) and (i-h)           The duration of an active speech segment in frames, as determined by the VAD module.
S_before       (d-c) and (h-g)           The number of frames transmitted before speech detected by the VAD; the size of the silence region transmitted before the speech.
S_after        (f-e) and (j-i)           The number of frames transmitted after active speech, as determined by the VAD; the size of the silence region transmitted after the active speech.
S_sil          (d-0), (h-e), (k-i)       The duration of the current silence segment in frames, as determined by the VAD.
S_embedded     >(h-e)                    The duration in frames of the silence (S_sil) between two active speech segments.
FD_count       ---                       The number of dropped silence frames before the current active speech segment.
S_hangover     <(k-i), >(h-e)            The duration in frames of the silence (S_sil) after the last active speech segment required to complete capture of the utterance. S_hangover >= S_embedded.
S_maxsil       ---                       The maximum silence duration during which the mobile station drops frames. If the maximum silence duration is exceeded, the mobile unit sends the averaged FE feature vector and resets the counter to zero. This is useful for keeping the recognition state on the server alive.
S_minsil       ---                       The minimum silence duration expected before and after active speech. If active speech is observed with less silence than S_minsil around it, the server may decide not to perform certain adaptation tasks using these data. This is sometimes called a Spoke_Too_Soon error. Alternatively, the server can detect this condition from the value of FD_count, so a separate variable may not be necessary.

In one embodiment, the minimum utterance duration S_min is about 100 ms. In one embodiment, the size of the silence region transmitted before the speech, S_before, is about 200 ms. In one embodiment, the size of the silence region transmitted after active speech, S_after, is about 200 ms. In one embodiment, the duration of silence after active speech required to complete capture of the utterance, S_hangover, ranges from 500 ms to 1500 ms depending on the VR application. In one embodiment, an eight-bit counter supports an S_maxsil of 2.5 seconds at 100 frames per second. In one embodiment, the minimum silence duration expected before and after active speech, S_minsil, is about 200 ms.

Thus, a novel and improved method and apparatus for voice recognition have been described. Those skilled in the art will understand that the various illustrative logical blocks, modules, and mapping described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. The various illustrative components, blocks, modules, circuits, and steps have been described generally in terms of their functionality. Whether the functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Those skilled in the art recognize the interchangeability of hardware and software under these circumstances, and how best to implement the described functionality for each particular application. As examples, the various illustrative logical blocks, modules, and mapping described in connection with the embodiments disclosed herein may be implemented or performed with a processor executing a set of firmware instructions, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components such as registers, any conventional programmable software module together with a processor, or any combination thereof designed to perform the functions described herein. The VAD module 400 and the FE module 600 may advantageously be executed in a microprocessor, which is a preferred embodiment; in the alternative, the VAD module 400 and the FE module 600 may be executed in any conventional processor, controller, microcontroller, or state machine. The templates could reside in RAM memory, flash memory, ROM memory, EPROM (erasable programmable ROM) memory, EEPROM (electrically erasable programmable ROM) memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known to those skilled in the art. The memory (not shown) may be integral to any aforementioned processor (not shown). A processor (not shown) and memory (not shown) may reside in an ASIC (not shown). The ASIC may reside in a telephone.

The previous description of the embodiments of the invention is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the novel principles and features disclosed herein.

1. A subscriber unit for use in a voice recognition system, comprising

a feature extraction module configured to extract a plurality of features from a speech signal and to provide a signal representative of the extracted plurality of features,

a voice activity detection module configured to detect voice activity within the speech signal and to provide a signal indicative of the detected voice activity, and

a transmitter coupled to the feature extraction module and the voice activity detection module and configured to transmit the signal indicative of the detected voice activity ahead of the signal representative of said plurality of features.

2. A subscriber unit for use in a voice recognition system, comprising

means for extracting a plurality of features from a speech signal to provide a signal representative of the extracted plurality of features,

means for detecting voice activity within the speech signal and providing a signal indicative of said detected voice activity, and a transmitter coupled to the feature extraction means and the voice activity detection means and configured to transmit the signal indicative of the detected voice activity ahead of the signal representative of said plurality of features.

3. A method of transmitting a speech activity signal, comprising

extracting a plurality of features from a speech signal to represent the signal with the extracted plurality of features,

detecting voice activity within the speech signal and providing a signal indicative of the detected voice activity, and

transmitting the signal indicative of the detected voice activity ahead of the signal representative of said plurality of features.



 
