Method and device for dynamic adjustment of the beam in Viterbi search

FIELD: speech recognition technology; in particular, a method and device for dynamic adjustment of the beam in a Viterbi search.

SUBSTANCE: the method includes selecting an initial beam width, determining whether the probability value for a frame changes, dynamically adjusting the beam width, and decoding the input speech signal with the dynamically adjusted beam width. Also proposed is a device including a processor, a speech recognition component connected to the processor, and a memory connected to the processor. The speech recognition component automatically adjusts the beam width to decode the input speech signal.

EFFECT: increased speech recognition reliability.

6 cl, 6 dwg, 4 tbl


Background of the invention

The technical field to which the invention relates.

The invention relates to speech recognition and, more particularly, to a method and apparatus for dynamic adjustment of the beam in the Viterbi search.

Prior art

Speech or voice recognition has become very widely used to improve productivity. The speech recognition process uses multiple technologies to recognize the human voice. Speech recognition is used to convert digital audio signals, from devices such as a personal computer (PC) sound card, into recognized speech. These signals may pass through several stages in which various mathematical and statistical transformations are performed with the objective of determining what was actually said.

Many applications that perform speech recognition have a database containing thousands of frequencies, or "phonemes" (also called "phones" in speech recognition systems). A phoneme is the smallest unit of speech in a language or dialect (i.e., the smallest unit of sound that can distinguish two words). The pronunciation of one phoneme differs from the pronunciation of another. Thus, if one phoneme in a word is replaced by another, the word will have a different meaning. For example, if the "B" in the word "bat" is replaced by the phoneme "R", the result is a new word: "rat". The database containing phonemes is used for comparison with reference ranges of sound frequencies. For example, if an incoming frequency sounds like "T", the application will try to match it with the corresponding phoneme from the database. Pronunciation can also be affected by neighboring phones, called context. For example, the "T" in the word "that" sounds different from the "T" in "truck". A phone with a fixed left (right) context is usually called a left (right) "biphone". A phone with fixed left and right context is usually called a "triphone". The phoneme database may contain, for each phoneme, multiple entries corresponding to biphones and triphones. Each phoneme is given a record number, which is then assigned to the incoming signal.

There can be so many variations of sounds, resulting in different spoken words, that it is almost impossible to achieve an exact match between the input signal and the entries in the database. Moreover, different people may pronounce the same word differently. Further, the environment contributes noise. Therefore, the application must use sophisticated techniques for approximating the input signal and determining which phonemes it contains.

Another problem of speech recognition lies in determining the moment when one phoneme (or smaller unit) ends and the next one begins. To solve such problems, a technique called the hidden Markov model (HMM) may be used. The HMM approach is a model-matching approach to speech recognition.

An HMM, in the general case, is defined by the following elements: first, the number N of states in the model; next, the matrix of transition probabilities A, where a_ij is the probability that the process moves from state q_i to state q_j at time t = 1, 2, ..., given that at time t - 1 the process was in state q_i; the probability distributions of the observations b_i(o_t) for all states q_i, i = 1, ..., N; and the probabilities of the initial states π_i, i = 1, ..., N.
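The elements above can be sketched as a minimal HMM parameter set. This is an illustrative sketch only; the state count, topology, and values are assumptions, not taken from the patent:

```python
# A toy HMM with N = 3 states, in the notation of the text:
# A[i][j] = a_ij, the probability of moving from state q_i to state q_j;
# pi[i] = the initial-state probability of state q_i.
# All values here are illustrative assumptions.
N = 3
A = [[0.6, 0.4, 0.0],   # left-to-right topology, common in speech HMMs
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
pi = [1.0, 0.0, 0.0]    # pi_1 = 1, pi_i = 0 for i > 1, as the text later suggests

# Each row of A must be a probability distribution over next states.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)
assert abs(sum(pi) - 1.0) < 1e-12
```

The observation distributions b_i(o_t) would be added per state (for example, Gaussian mixtures in practical systems), completing the parameter set the text enumerates.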

In order to carry out speech recognition using HMMs, a language is usually divided into a limited number of phonemes. For example, English can be divided into roughly 40-50 phonemes. A stochastic model is then constructed for each unit (i.e., each phone). Given an acoustic observation, one can determine the most probable phoneme corresponding to the observation. It should be noted, however, that when context-dependent units such as biphones or triphones are used, the number of such units may be several thousand, and a probabilistic model is constructed for each unit. The method for determining the most probable phoneme corresponding to an acoustic observation uses the Viterbi evaluation (named after A. J. Viterbi).

Brief description of drawings

The present invention is illustrated by way of example and is not limited by the accompanying drawings, in which like references denote like elements.

Figure 1 depicts the four steps of the Viterbi model search algorithm.

Figure 2 depicts a block diagram of one implementation of the present invention containing the dynamic adjustment of the beam.

Figure 3 shows a block diagram of one implementation of the present invention, applying the dynamic adjustment of the beam (when there is a sufficient number of active paths) or decoding N best (if too many or too few active paths).

Figure 4 illustrates a typical system that can be used for speech recognition.

Figure 5 illustrates an implementation of the present invention containing the process of dynamic adjustment of the beam incorporated into a system.

Figure 6 depicts an implementation of the present invention containing a device for dynamically adjusting the beam.

In general, the invention relates to a method and apparatus for dynamic adjustment of the beam in the Viterbi search. Typical implementations of the present invention will be described below with reference to the figures. The typical implementations are selected to illustrate the invention and should not be construed as limiting the invention in any way.

Figure 1 illustrates the four steps of the Viterbi model search algorithm. In this method, an acoustic observation and a stochastic phoneme model are combined to assess the probability that the observation corresponds to the phoneme. The Viterbi evaluation determines only the best sequence of states, that is, the path of state-to-state transitions that gives the highest probability that the observation matches the model. This is carried out for each of the 40-50 phonemes (or smaller units, such as triphones) of the English language, thereby determining the phoneme that most likely corresponds to the acoustic observation. If the acoustic observation contains more than one phoneme, for example a spoken word, the above procedure can be repeated to determine the set of phonemes that best matches the acoustic observation. In a speech recognition system, after a model is determined, for example when decoding the word "ask", the system selects all possible continuations, concatenates the HMMs corresponding to those continuations, and continues decoding. For example, after decoding the second triphone "AE-S+K", the speech recognition system can select a triphone "S-K+*" (where * denotes any other phone). Then, after decoding some part of the input data (for example, corresponding to a sentence or a sequence of words), the speech recognition system selects the best hypothesis as the result.

The Viterbi algorithm is commonly used in speech recognition with HMMs, typically to search for the best sequence of words corresponding to the recognized speech. Hypotheses are expressed in terms of probability values, and the search is performed over all possible sequences of words. For systems with a large dictionary, the search space grows dramatically. More precisely, the search space grows as N^L, where N is the size of the dictionary and L is the length of the sentence.

To avoid an exponential increase of the search space, search algorithms use beam pruning to remove unlikely hypotheses, or search paths. Beam search requires the calculation of a reference score representing the logarithm of the probability of the most probable hypothesis. A score is the negative logarithm of a probability value; a high score usually means a low probability, and a low score means a high probability. When an adjustable beam is used in continuous speech recognition (CSR), a static beam width value is used to prune the set of probability values (relative to the highest probability value) that the paths have at a given point in time.
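The pruning rule described above can be sketched as follows. This is a hypothetical sketch: scores here are kept as log probabilities (higher is better), and the names `beam_prune`, `scores`, and `beam` are illustrative, not from the patent:

```python
def beam_prune(scores, beam):
    """Keep only hypotheses whose log-probability score is within
    `beam` of the best score; discard the unlikely paths."""
    best = max(scores)
    return [s for s in scores if s >= best - beam]

# Four active paths with log-probability scores and a beam width of 5.0:
active = [-3.0, -6.5, -7.9, -12.0]
kept = beam_prune(active, 5.0)   # scores below -8.0 are discarded
```

With a static beam, the same width is applied at every time step; the invention below varies `beam` from frame to frame.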

The four steps of the Viterbi model search algorithm are a pre-processing step 110, an initialization step 120, a recursion step 130, and a termination step 140. It should be noted that in speech recognition systems the pre-processing step 110 may not be fully completed before the other steps have to be performed. There are several reasons to perform the other steps before completing the pre-processing step 110: all models considered during speech recognition are not known before the start of the recognition process because of the concatenation of hidden Markov models (HMMs), which depends on intermediate results of the recognition process; it is impossible to enumerate all sequences of words for continuous speech recognition; or the input observation data to be decoded in real time may not be available. For each speech model φ_n (where φ_n can represent all models of all possible fragments, HMM fragments, or concatenated HMMs), where n = 1, ..., M, the four steps are as follows:

pre-processing 110: compute and store log π_i, log a_ij and the observation log probabilities log b_j(o_t);

initializing 120: δ_1(i) = log π_i + log b_i(o_1), 1 ≤ i ≤ N;

recursion 130: δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) + log a_ij] + log b_j(o_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N;

termination 140: D(φ_n) = δ_T(N),

where the score δ_t(j) is an approximation of the logarithm of the probability of the most probable path passing through vertex j at time t, and D(φ_n) = δ_T(N) is the logarithm of the probability of the most probable path ending at vertex N at time T. The result of recognition (that is, the word corresponding to the unknown speech signal) is the model φ_n* for which

n* = argmax_{1 ≤ n ≤ M} D(φ_n).

In the pre-processing step 110, the logarithms of the probabilities of the initial states π_i for i = 1, ..., N, the observation probability distributions b_i(o_t), where 1 ≤ i ≤ N and 1 ≤ t ≤ T, and the state transition probabilities a_ij, where 1 ≤ i, j ≤ N, are calculated and stored in memory. The functions b_i and the state transition probabilities a_ij, generally speaking, depend on the particular speech model φ_n under discussion. However, in order to reduce the amount of data describing the models, some constants are taken to be identical for all models. For example, in all speech models the probabilities of the initial states can be set to π_1 = 1 and π_i = 0 for i > 1. The values defined in the pre-processing step are sometimes calculated and stored only once.

In the initialization step 120, the path scores δ_1(i) are computed for time 1 and states i = 1, ..., N.

In the recursion step 130, the scores δ_t(j) are calculated with the state i varying from 1 to N, for times t with 2 ≤ t ≤ T and states j with 1 ≤ j ≤ N.

In the termination step 140, from the calculations performed in the recursion step 130, the result with the highest probability (i.e., the best path score) is determined for each specific model. The overall best path score is obtained by comparing the best path scores obtained for each model.
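The four steps can be sketched as a log-domain Viterbi evaluation. This is a sketch of the standard algorithm with illustrative toy values, not the patent's exact implementation:

```python
import math

NEG_INF = float("-inf")

def viterbi_score(log_pi, log_a, log_b):
    """Return the termination score: the log probability of the most
    probable state path. log_b[t][j] is the observation log
    probability log b_j(o_t)."""
    T, N = len(log_b), len(log_pi)
    # initialization (step 120): delta_1(i) = log pi_i + log b_i(o_1)
    delta = [log_pi[i] + log_b[0][i] for i in range(N)]
    # recursion (step 130): maximize over predecessor states
    for t in range(1, T):
        delta = [max(delta[i] + log_a[i][j] for i in range(N)) + log_b[t][j]
                 for j in range(N)]
    # termination (step 140): best path score
    return max(delta)

# A toy 2-state model: pi_1 = 1, left-to-right transitions.
log = math.log
log_pi = [log(1.0), NEG_INF]
log_a = [[log(0.5), log(0.5)],
         [NEG_INF, log(1.0)]]
log_b = [[log(0.9), log(0.1)],   # observation at t = 1
         [log(0.2), log(0.8)]]   # observation at t = 2
score = viterbi_score(log_pi, log_a, log_b)   # best path 1 -> 2
```

Running this for each model φ_n and taking the maximum of the returned scores corresponds to comparing the best path scores of the models, as described above.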

In one implementation of the present invention, during recognition, the beam adjustment mechanism dynamically changes the beam width depending on the data, which improves efficiency. At the beginning of speech recognition it is advisable to use a wider beam, because of uncertainty about what was said, and as more data arrives a narrower beam can be used. In other words, a beam width that works well for one fragment of speech may not work for the middle of the fragment, or may be too wide for other parts of the speech, producing many useless hypotheses. Thus, in accordance with one implementation of the invention, the beam adjustment method dynamically controls the beam width depending on the data obtained in the course of recognition.

One implementation of the present invention is described as follows. Let f_t denote the set of active paths at time t. Let N_t be the number of paths in f_t. Let R(φ) denote the probability score of an arbitrary path φ ∈ f_t. Let α_t denote the maximum score in the set f_t, with α_0 = 0. Then

α_t = max_{φ ∈ f_t} R(φ).

Also, let β denote the beam width.

In some implementations of the invention, with beam width β, those paths in f_t for which R(φ) < α_t − β are discarded (that is, truncation occurs). In one implementation of the invention, the beam width β_t at time t is determined in proportion to some initial beam width B in the following way: β_t = b_t × B, where the coefficient b_t is defined as follows:

where [b1, b2] is the range of variation of the coefficient, and these boundaries are determined heuristically. For example, the initial value can be α_1 = 0. Further, the beam width β_t is dynamically adjusted in order to reduce processing time while maintaining a low word error rate (WER).
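The clamping of the coefficient to [b1, b2] can be sketched as follows. The patent's exact formula for b_t is not reproduced here; as a labeled assumption, b_t in this sketch is driven by the change in the best score α_t and then clipped to the heuristic range [b1, b2]:

```python
def beam_width(alpha_t, alpha_prev, B, b1=0.5, b2=1.05, scale=0.01):
    """Illustrative dynamic beam width beta_t = b_t * B.
    ASSUMPTION: the driver of b_t (narrowing when the best score
    alpha_t improves, widening when it degrades) is hypothetical;
    only the clamping to [b1, b2] is taken from the text."""
    b_t = 1.0 - scale * (alpha_t - alpha_prev)  # hypothetical driver
    b_t = min(max(b_t, b1), b2)                 # clamp to [b1, b2]
    return b_t * B

# With B = 140 (the initial beam of the Chinese-language task below)
# and an improving best score, the beam narrows:
beta = beam_width(alpha_t=-10.0, alpha_prev=-20.0, B=140.0)
```

The boundaries b1 = 0.5 and b2 = 1.05 match the interval [0.5, 1.05] used in the experiments reported later in this document.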

Figure 2 depicts a block diagram of an implementation of the present invention comprising a process 200 of dynamic beam adjustment. After initial testing, the speech recognition system selects a set of optimal parameters. The selection is based on the optimal decoding speed and WER. The optimal decoding speed and WER are determined on the basis of the number of active hypotheses. For example, the more active hypotheses there are, the lower the decoding speed and the fewer the errors.

In beam search, all hypotheses that are worse, by the selected beam width, than the best hypothesis are discarded. In block 210 of process 200, the beam width is set to a predetermined value. The predetermined value is based on the set of optimal parameters. In block 220, the next speech frame (observation) is received. In block 230, the best probability of the best hypothesis is determined. In block 240, the beam width for selecting hypotheses for the current speech frame is determined. If block 240 determines that the probability value for the frame grows (the score grows more slowly), then process 200 reduces the beam width in comparison with the preselected initial beam width. This reduced width is used to decode the speech.

Block 240 also determines whether the probability value for the frame is decreasing (the score grows faster). If block 240 determines that the probability value for the frame is decreasing, the beam width is increased. In one implementation of the present invention, the beam width in block 240 is reduced or increased by an amount specified by the user. In another implementation of the invention, the speech recognition system automatically reduces or increases the beam width by a small value based on a selected percentage, such as 10%. It should be noted that other ways of reducing or increasing the beam width can be implemented without departing from the various embodiments of the invention.

Process 200 continues at block 250. In block 250, all active paths are truncated with the new (dynamically changed) beam width. In block 260, it is checked whether the decoding of speech is finished. If block 260 determines that the decoding is not complete, process 200 continues at block 220. If block 260 determines that the decoding is complete, process 200 continues at block 270. In block 270, the best path is announced as the result of speech recognition.

When, in process 200, the current beam width decreases, increases, or remains constant during speech decoding, the WER is not compromised, while the decoding completes faster (i.e., the decoding time is reduced).
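Blocks 210-270 of process 200 can be sketched as a frame-by-frame loop. This is a sketch: the 10% step is the illustrative percentage mentioned in the text, while the frame scoring, the function name, and the data layout are assumptions:

```python
def decode(frames, initial_beam, step=0.10):
    """Sketch of process 200: per frame, compare the best hypothesis
    score with the previous frame's best and nudge the beam width by
    `step` (e.g. 10%) before truncating the active paths.
    `frames` is a list of lists of hypothesis log-prob scores."""
    beam = initial_beam                      # block 210: initial width
    prev_best = None
    active = []
    for scores in frames:                    # block 220: next frame
        best = max(scores)                   # block 230: best probability
        if prev_best is not None:            # block 240: adjust beam
            if best > prev_best:             # score improving -> narrow
                beam *= 1.0 - step
            elif best < prev_best:           # score degrading -> widen
                beam *= 1.0 + step
        prev_best = best
        # block 250: truncate active paths with the new beam width
        active = [s for s in scores if s >= best - beam]
    return max(active)                       # block 270: best path

result = decode([[-1.0, -4.0], [-2.5, -3.0, -9.0]], initial_beam=5.0)
```

In the second frame the best score degrades, so the beam widens to 5.5 before truncation; the path at -9.0 still falls outside it and is discarded.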

Table I illustrates the results of a Chinese speech recognition task using an implementation of the present invention containing the process of dynamic beam adjustment. It should be noted that other languages can also be used as input for speech recognition. From Table I it is clear that using the invention with dynamic beam adjustment on the Chinese speech recognition task increases the recognition speed by sixty percent (60%), i.e. the real time decreased from 3.14 to 1.24. Note that the real time is the time required by the central processing unit (CPU) to complete the decoding, divided by the duration of the spoken words; if the real time is less than one, decoding can be performed in real time.

Table I
Dynamic beam adjustment for the Chinese-language task
                        Static beam    Dynamic beam
Real time               3.14           1.24
Word error rate         8.4            8.3

Table II shows the results of an English speech recognition task using an implementation of the present invention containing the process of dynamic beam adjustment. From Table II it is seen that using the invention increases the recognition speed by forty-five percent (45%) (the real time decreases from 3.4 to 1.85). As in the Chinese speech recognition task, no noticeable increase in WER is observed in the English task. Note that the results presented in Table I and Table II were obtained on a computer with an Intel Pentium® processor with a clock frequency of 550 MHz, a cache size of 512K, and 512 megabytes of synchronous dynamic random access memory (SDRAM). It should be noted that implementations may use different computer systems with different processing speeds and memory.

In the examples of the Chinese-language and English-language tasks, the initial beam values were taken to be 140 and 180, respectively. The same interval [b1, b2] was used, taken equal to [0.5, 1.05].

Table II
Dynamic beam adjustment for the English-language task
                        Static beam    Dynamic beam
Real time               3.4            1.85
Word error rate         11             11.4

In one implementation of the invention, dynamic beam adjustment is used in a Viterbi search in which the beam width is adjusted only if there is a normal number of active paths (not too many and not too few). In this implementation, denote by B_N the beam width at which exactly N active paths remain in f_t after truncation. The beam width β_t at time t is then as follows:

where 0 < N1 < N2 < N3 are preselected thresholds and B is the initial beam width. The value of b_t is defined as follows:

In this implementation of the invention the beam width is adjusted only if there is a normal number of active paths (not too many and not too few), namely when the number of paths lies in the interval [2N2, 2N3]. In this implementation of the invention, if the number of active paths falls outside the bounds of [2N2, 2N3], truncation is applied so that the number of active paths returns to the specified interval. If the total number of active paths is less than the threshold N1, the number of paths is not reduced [that is, the beam width is equal to infinity].
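The thresholded rule above can be sketched as follows. This is a sketch under stated assumptions: the N-best cuts (to N3 above 2N3, to N2 below 2N2) and the no-pruning case below 2N1 follow this document's description, while the function name, the (path, score) layout, and the simplified beam branch are illustrative:

```python
def select_paths(scored_paths, beam, n1, n2, n3):
    """Sketch: scored_paths is a list of (path, log_prob) pairs.
    Beam pruning is applied only when the number of active paths is
    'normal' (within [2*n2, 2*n3]); otherwise an N-best cut or no
    pruning at all is used, mirroring the thresholds in the text."""
    count = len(scored_paths)
    ranked = sorted(scored_paths, key=lambda p: p[1], reverse=True)
    if count > 2 * n3:          # too many paths: N-best with N = n3
        return ranked[:n3]
    if count < 2 * n1:          # critically few: propagate everything
        return ranked
    if count < 2 * n2:          # few paths: N-best with N = n2
        return ranked[:n2]
    best = ranked[0][1]         # normal range: dynamic beam pruning
    return [p for p in ranked if p[1] >= best - beam]
```

For example, with three active paths and thresholds giving 2*n3 = 2, only the single best path survives; with 2*n3 raised above the count, the beam branch keeps every path within `beam` of the best.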

Figure 3 shows an implementation of the present invention containing a process 300 which, during decoding, dynamically adjusts the beam only if there is a normal number of active paths (not too many and not too few), or uses N-best decoding if there are too many or too few active paths. In the N-best decoding method, for each time frame only the N best hypotheses are kept, and the remaining hypotheses are discarded. Thus the list of the N best hypotheses is continuously recomputed. Process 300 begins at block 310, where the initial beam width is set.

The initial beam width is determined by recognizing some example input speech signal. Process 300 continues to block 320, where the boundary coefficients are defined. After initial testing of the speech recognition system, a set of optimal parameters is selected. The selection is based on the optimal decoding speed and WER, which are determined on the basis of the number of active hypotheses. Process 300 continues to block 330, where statistics of the example input speech signal are determined. Once block 330 has processed the example input speech signal and determined the statistics for the given input, block 340 sets the thresholds for the number of active hypotheses. Note that the thresholds are set in accordance with equation 2. In one implementation of the invention, the threshold 2N3 is a number of hypotheses that is exceeded in no more than about 10% of the time frames. In one implementation of the invention, the threshold 2N2 is a number of hypotheses such that about 10% of the time frames contain fewer than 2N2 hypotheses. It should be noted that other implementations of the invention can use other thresholds.

In one implementation of the invention, the threshold N1 is taken equal to N2/5. Note that the threshold N1 is used in critical situations, when the number of hypotheses is very small. In one implementation of the present invention, blocks 310-340 can be separated from process 300 and performed during the construction and testing of the speech recognition system, or they can be performed when the speech recognition system adapts (indirectly or directly) to the speaker and/or the environment.

Process 300 continues to block 350, where the next observation (or fragment) is obtained. Process 300 continues at block 355. In block 355, it is determined whether the number of active hypotheses exceeds the threshold 2N3. If block 355 determines that the number of active hypotheses exceeds the threshold 2N3, process 300 proceeds to block 356. In block 356, N-best decoding is performed, where N is taken to be N3. In this case it is important to keep the WER parameter unchanged. Setting N = N3 accelerates the decoding process. If block 355 determines that the number of active hypotheses does not exceed the threshold 2N3, process 300 proceeds to block 365.

In block 365, it is checked whether the number of active hypotheses is greater than, less than, or equal to the threshold 2N2. If block 365 determines that the number of active hypotheses is not less than the threshold 2N2, process 300 proceeds to block 366. In block 366, the best probability α_t and the dynamically adjusted beam width b_t are determined as defined in process 200. Process 300 then proceeds to block 367, where all paths with a score greater than α_t − b_t are propagated. Next, process 300 continues at block 380.

In block 380, it is determined whether the current observation is the last. If block 380 determines that the observation at time t = t + 1 is not the last, process 300 continues at block 350. If block 380 determines that the observation is the last, process 300 continues at block 385. In block 385, the best path is given as the result of the speech recognition process.

If block 365 determines that the number of active hypotheses is less than the threshold 2N2, process 300 proceeds to block 375. In block 375, it is checked whether the number of active hypotheses is greater than, less than, or equal to the threshold 2N1. If block 375 determines that the number of active hypotheses is not less than the threshold 2N1, process 300 proceeds to block 376. In block 376, the active hypotheses are decoded using N-best decoding, with the parameter N assigned the value N2. Process 300 continues at block 380. If block 375 determines that the number of active hypotheses is less than the threshold 2N1, process 300 proceeds to block 390. In block 390, all hypotheses are propagated further.

Table III shows a comparison of the results of the Chinese speech recognition task obtained using the following methods: speech recognition using Viterbi search with a static beam; speech recognition using the implementation of the present invention containing the process of dynamic beam adjustment (process 200); and speech recognition using the implementation of the invention containing the process of dynamic beam adjustment in which the beam width is adjusted only if there is a normal number of active paths (not too many and not too few), or using N-best decoding if there are too many or too few active paths (process 300). It should be noted that other languages can also be used as input for speech recognition.

In Table III, the results illustrate an implementation of the invention (the modified dynamic beam width adjustment; third column) that combines dynamic beam adjustment (process 200) with dynamic beam adjustment applied only when there is a normal number of active paths (not too many and not too few), or N-best decoding when there are too many or too few active paths (process 300): the first pass uses only the implementation of the invention having dynamic beam adjustment (process 200), and the second pass uses the implementation of the invention with dynamic beam adjustment in which the beam width is changed only when there are "enough" active paths (process 300).

In Table III, the use of the invention containing the modified process of dynamic beam adjustment (third column) has improved the WER from 8.3 to 7.8, compared with the implementation of the invention containing only the process of dynamic beam adjustment, without restricting the beam-width change to the case of a normal number of active paths (not too many and not too few) or using N-best decoding when there are too many or too few active paths. The results of the example illustrated in Table III were obtained for an implementation of the invention using the same parameters as the implementation whose results are shown in Table I. The parameters for the implementation using both dynamic beam adjustment and dynamic beam adjustment with the beam width changed only for a normal number of active paths (not too many and not too few), or N-best decoding when there are too many or too few active paths, are N1 = 160, N2 = 800 and N3 = 5000.

Table III
Dynamic beam adjustment for the Chinese-language task
                        Static beam    Dynamic beam    Modified dynamic beam
Real time               3.14           1.24            1.32
Word error rate         8.5            8.3             7.8

Table IV illustrates a comparison of the results of the English speech recognition task obtained using the following methods: Viterbi search with a static beam; the implementation of the present invention containing the process of dynamic beam adjustment; and the implementation of the invention containing the process of dynamic beam adjustment in which the beam width is adjusted only in the case of a normal number of active paths (not too many and not too few), or using N-best decoding if there are too many or too few active paths. It should be noted that implementations of the present invention can solve speech recognition problems in other languages.

The results illustrated in Table IV for this implementation of the invention use the same parameters as the implementation whose results are shown in Table II. The parameters for the implementation using both dynamic beam adjustment and dynamic beam adjustment with the beam width adjusted only for a normal number of active paths (not too many and not too few), or N-best decoding when there are too many or too few active paths, are N1 = 300, N2 = 1500 and N3 = 6000. The results in Table IV illustrate that the WER improved from 11.4 to 11, as seen from a comparison of the implementation of the present invention containing the process of dynamic beam adjustment for the first pass with the implementation containing the process of dynamic beam adjustment in which the beam width is adjusted only in the case of a normal number of active paths (not too many and not too few), or N-best decoding when there are too many or too few active paths, for the second pass. The results in Tables III and IV were obtained using a PC with an Intel Pentium™ processor with a clock frequency of 550 MHz, 512K cache and 512 MB SDRAM. It should be noted that implementations may use different computer systems with different processing speeds and memory.

Table IV
Dynamic beam adjustment for the English-language task
                        Static beam    Dynamic beam    Modified dynamic beam
Real time               3.4            1.85            1.95
Word error rate         11             11.4            11

It should be noted that the implementations of the invention discussed above can be applied to problems in which the number of hypotheses is too large for that number to be determined in a reasonable time, and in which beam-width truncation is usually applied.

Figure 4 illustrates a typical computing system 400 that may be used for speech recognition. The system 400 includes a memory 410, a central processing unit (CPU) with cache memory 420, a north bridge 430, a south bridge 435, an audio output device 440, and an audio input device 450. The audio output device 440 may be a device such as a loudspeaker system. The audio input device 450 may be a device such as a microphone.

Figure 5 shows a system 500 that contains an implementation of the present invention with a process 510 of dynamic beam adjustment. In one implementation of the invention, the dynamic beam adjustment process 510 is in the form of process 200 (illustrated in Figure 2). In another implementation of the invention, the dynamic beam adjustment process 510 is in the form of process 300 (illustrated in Figure 3). In a third implementation of the invention, the dynamic beam adjustment process 510 is in the form of process 200 and process 300, where process 200 is used for the first pass and process 300 is used for the remaining passes. The process 510 may be implemented as an application stored in memory 410. The memory 410 may be a storage device such as random access memory (RAM), dynamic RAM, or synchronous DRAM (SDRAM). Note that other storage devices, including future developments in this area, can be used as well. Also note that process 510 can be stored on other media from which information can be read, such as floppy disks, compact disks (CD-ROM), and so on.

Figure 6 shows a variant implementation of the invention with a process 610 (shown in Figure 5 as process 510), where the process 610 is implemented in hardware. In one embodiment of the invention, the process 610 is implemented using programmable logic arrays (PLA). It should be noted that the process 610 may be implemented using other electronic devices, such as registers and transistors. In another implementation of the invention, the process is implemented in firmware.

Using an implementation of the present invention with the Viterbi search method in speech recognition allows the processing speed to be increased without increasing the WER. Thus, less time is required to solve the speech recognition problem.

The implementations described above can also be stored on a device or machine-readable medium and read by a machine to perform commands. A machine-readable medium includes any mechanism that contains (i.e., stores and/or transmits) information that can be read by a machine (e.g., a computer). For example, a machine-readable medium may be a read-only memory (ROM); a random access memory (RAM); magnetic disk storage; an optical drive; flash memory; or an electrical, optical, acoustic, or any other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). The device or machine-readable medium may be a semiconductor memory device and/or a rotating magnetic or optical disk. The device or machine-readable medium may be distributed, with the commands divided between different machines, for example between computers united in a network.

Although some typical implementations have been described and shown in the accompanying drawings, it should be understood that these implementations are merely illustrative and do not limit the scope of the invention, and that the invention is not limited to the specific structures and arrangements shown and described here, since a specialist in the relevant field can propose a variety of other modifications.

1. A method for dynamically adjusting the beam in a Viterbi search, including: selecting an initial beam width; determining whether the probability value for a frame changes; dynamically adjusting the beam width; and decoding the input speech signal with the dynamically adjusted beam width.
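Claim 1 leaves the adjustment rule itself open; one hedged reading (the function name, the multiplicative step, and the clamping bounds below are illustrative assumptions, not taken from the patent) is to widen the beam when the best per-frame score degrades, indicating a harder frame, and narrow it when the score improves:

```python
def adjust_beam(beam_width, best_score, prev_best_score,
                min_width=50.0, max_width=300.0, step=1.2):
    """Illustrative dynamic beam adjustment (an assumed rule, not the
    patent's exact heuristic): widen the beam when the best frame score
    drops, narrow it when the score rises, clamped to [min_width, max_width].
    """
    if prev_best_score is None:
        # First frame: nothing to compare against yet.
        return beam_width
    if best_score < prev_best_score:
        # Frame got harder: widen the beam to keep more hypotheses alive.
        return min(beam_width * step, max_width)
    # Frame got easier: narrow the beam to prune more aggressively.
    return max(beam_width / step, min_width)
```

A rule of this shape implements the claim's dependence on whether the per-frame probability value changes, while the bounds prevent the beam from collapsing or growing without limit.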

2. The method according to claim 1, in which said decoding includes truncation of the set of active paths using the dynamically adjusted beam width.

3. The method according to claim 1, in which said decoding further includes the use of a hidden Markov model (HMM).

4. The method according to claim 3, in which the HMM is evaluated by the Viterbi search.

5. The method according to claim 1, in which the specified dynamic adjustment includes a heuristic to determine the first and second boundary coefficients.

6. The method according to claim 1, in which the value changes per frame, with the per-frame probability of one hypothesis from the best group increasing and the per-frame probability of the best hypothesis decreasing.

7. A method for dynamically adjusting the beam in a Viterbi search, including: determining an initial beam width; defining a set of threshold values used for the set of active hypotheses; identifying the current number of active hypotheses; comparing the current number of active hypotheses with the threshold values; dynamically adjusting the beam width for the set of active hypotheses; and decoding the input speech signal.

8. The method according to claim 7, in which said decoding includes truncation of the set of active hypotheses either using the dynamically adjusted beam width or by N-best truncation, the truncation being based on the set of threshold values.
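Claims 7 and 8 combine count thresholds with beam pruning and N-best truncation; a minimal sketch of that control loop follows (the threshold values `low` and `high` and the 10% adjustment factor are illustrative assumptions, not taken from the patent):

```python
def prune_hypotheses(scores, beam_width, low, high, n_best):
    """Illustrative pruning in the spirit of claims 7-8 (thresholds and
    the 10% beam adjustment are assumed, not the patent's exact values).

    scores: dict mapping hypothesis -> log-probability.
    Returns (surviving scores, adjusted beam width).
    """
    n_active = len(scores)
    # Compare the active-hypothesis count against the thresholds.
    if n_active > high:
        beam_width *= 0.9   # too many hypotheses: tighten the beam
    elif n_active < low:
        beam_width *= 1.1   # too few hypotheses: relax the beam
    # Beam pruning against the best current score.
    best = max(scores.values())
    kept = {h: v for h, v in scores.items() if v >= best - beam_width}
    # Fall back to N-best truncation if the beam alone keeps too many.
    if len(kept) > n_best:
        top = sorted(kept, key=kept.get, reverse=True)[:n_best]
        kept = {h: kept[h] for h in top}
    return kept, beam_width
```

Called once per frame, a loop of this shape keeps the number of active hypotheses near the threshold band, which is the mechanism the claims describe for bounding the search effort.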

9. The method according to claim 8, in which said decoding further includes the use of a hidden Markov model (HMM).

10. The method according to claim 9, in which the HMM is evaluated by the Viterbi search.

11. The method according to claim 7, in which the specified dynamic adjustment includes a heuristic to determine the first and second boundary coefficients.

12. The method according to claim 7, in which defining the set of threshold values includes determining threshold values on a fragment of the input speech signal.

13. A system for dynamic adjustment of the beam in a Viterbi search, containing: a processor; a bus coupled to the processor; a memory connected to the processor, in which a speech recognition process is recorded; an input device connected to the processor; and a display device coupled to the processor; where the speech recognition process dynamically adjusts the beam width in order to decode the input speech signal.

14. The system of claim 13, in which the speech recognition process decodes the input speech signal by truncating the set of active hypotheses either using the dynamically adjusted beam width or by N-best truncation, the truncation occurring on the basis of a set of threshold values stored in memory.

15. The system of claim 14, in which the speech recognition process decodes the input speech signal additionally using a hidden Markov model (HMM).

16. The system of claim 15, in which the HMM is evaluated by the Viterbi search.

17. A device for dynamic adjustment of the beam in a Viterbi search, containing: a processor; a first circuit for performing a speech recognition process, connected to the processor; and a memory connected to the processor; where the speech recognition process dynamically adjusts the beam width with the aim of decoding the input speech signal.

18. The device according to claim 17, in which the speech recognition process decodes the input speech signal by truncating the set of active hypotheses either using the dynamically adjusted beam width or by N-best truncation, the truncation occurring on the basis of a set of threshold values stored in memory.

19. The device according to claim 18, in which the speech recognition process decodes the input speech signal additionally using a hidden Markov model (HMM).

20. The device according to claim 19, in which the HMM is evaluated by the Viterbi search.

21. The device according to claim 17, in which the first circuit is a programmable logic array.

22. A device for dynamic adjustment of the beam in a Viterbi search, containing a machine-readable medium with recorded commands, the execution of which by a machine causes the machine to perform operations comprising: selecting an initial beam width; determining whether the probability value for a frame changes; dynamically adjusting the beam width; and decoding the input speech signal with the dynamically adjusted beam width.

23. The device according to claim 22, in which said decoding includes truncation of the set of active paths using the dynamically adjusted beam width.

24. The device according to claim 23, in which said decoding further includes the use of a hidden Markov model (HMM).

25. The device according to claim 24, in which the HMM is evaluated by the Viterbi search.

26. The device according to claim 24, in which said adjustment further comprises commands, the execution of which by the machine causes the machine to perform operations including heuristic determination of first and second boundary coefficients.

27. The device according to claim 22, in which the value changes per frame, with the per-frame probability of one hypothesis from the best group increasing and the per-frame probability of the best hypothesis decreasing.

28. A device for dynamic adjustment of the beam in a Viterbi search, containing a machine-readable medium with recorded commands, the execution of which by a machine causes the machine to perform operations comprising: determining an initial beam width; defining a set of threshold values used for the set of active hypotheses; identifying the current number of active hypotheses; comparing the current number of active hypotheses with the threshold values; dynamically adjusting the beam width for the set of active hypotheses; and decoding the input speech signal.

29. The device according to claim 28, in which said decoding further comprises commands, the execution of which by the machine causes the machine to perform operations including truncation of the set of active hypotheses either using the dynamically adjusted beam width or by N-best truncation, the truncation being based on the set of threshold values.

30. The device according to claim 29, in which said decoding additionally includes evaluation by the Viterbi search.



 
