
Method of detecting emotions from voice (RU 2510955)


FIELD: physics, acoustics.

SUBSTANCE: the invention relates to means for recognizing human emotions from voice. The intensity of the voice and the tempo, defined by the rate at which the voice appears, are detected, and the intonation, which reflects the pattern of intensity variation within each word pronounced, is detected from the input voice signal as a function of time. A first variation value indicating the variation of the detected voice intensity along the time axis, a second variation value indicating the variation of the voice tempo along the time axis, and a third variation value indicating the variation of the voice intonation along the time axis are obtained. The voice signal of a Russian-speaking subscriber is input, and the intensity and tempo of the voice are then detected. After the third variation value is obtained, the fundamental frequency of the voice signal is detected, and a fourth variation value indicating the variation of the fundamental frequency along the time axis is obtained; signals expressing the emotional states of anger, fear, sorrow and pleasure are generated, respectively, on the basis of said first, second, third and fourth variation values.

EFFECT: high accuracy of determining the emotional state of a Russian-speaking subscriber.

3 dwg

 

The invention relates to the recognition of human emotions from voice and can be used to detect emotions in intelligent information and communication systems, as well as in various kinds of psychological research.

The expanding scope of communicative interaction between officials, together with the growing psychological stress of managerial decision-making, is connected with a declining credit of trust between communicating parties and with the transformation of formal, role-based business communication, which, along with the exchange of information, should take into account the personality traits of the recipient, his mood, and his physiological and emotional states. A promising direction in this respect is to abandon the traditional principles of coding and transmission of audio (speech) signals in communication systems in favour of intelligent signal processing.

The intelligence of information and communication systems (the combination of transmission and processing of information at different levels of representation) must be provided for at the early stages of their life cycle, and one of the functions to be implemented is the ability to determine the emotional state of a subscriber from the voice.

Known methods for determining emotional tension (stress) (patents RU 2068653 of 10.11.1996 and RU 2073484 of 20.02.1997) record galvanic skin response, heart rate and respiratory rate, and assess emotional tension from their dynamics. A common shortcoming of these analogues is the inability to detect a person's emotions (emotional tension) without the use of sensors.

There is also a method of determining emotions from a synthesized speech signal (patent JP 02-236600 of 19.09.1990), according to which the fundamental frequency is extracted from the digitized speech signal and the amplitude spectrum is calculated, after which a signal expressing emotion is generated on the basis of these parameters. The disadvantage of this analogue is the low accuracy of detecting emotional states.

The closest to the claimed method in technical essence, and selected as the prototype, is a method for detecting emotions (patent RU 2287856 of 20.11.2006) in which a voice signal is input; the intensity of the voice and the tempo, determined by the rate at which the voice appears, are detected, and the intonation, which reflects the pattern of intensity variation within each word pronounced, is detected from the input voice signal as a function of time; a first variation value indicating the variation of the detected voice intensity along the time axis, a second variation value indicating the variation of the voice tempo along the time axis, and a third variation value indicating the variation of the intonation along the time axis are obtained; and signals expressing at least the emotional states of anger, sorrow and pleasure are generated, respectively, on the basis of these first, second and third variation values.

The prototype method recognizes emotions from changes in intensity, tempo and intonation over time. In most languages, however, the emotion-distinguishing function is largely carried by the fundamental frequency (pitch). A study of the acoustic parameters of emotional speech [Assessment of acoustic parameters of emotional speech / First annual scientific conference of students and postgraduates of the basic departments of the RAS Southern Scientific Centre, 2009, p. 214] found that the mean fundamental frequency rises in a state of pleasure and falls in sorrow; the dynamics of the fundamental frequency also change markedly: in sorrow it decreases smoothly, whereas in anger sharp peaks appear in the frequency contour. Changes in the fundamental frequency are thus an important carrier of emotional information, and the disadvantage of the prototype method is its low accuracy of emotion detection, in particular for the Russian language.

The objective of the invention is to develop a method for detecting emotions from voice that increases the accuracy of determining the emotional state of a Russian-speaking subscriber.

In the proposed method this task is solved as follows. As in the prototype, a voice signal is input; the intensity of the voice and the tempo, determined by the rate at which the voice appears, are detected, and the intonation, which reflects the pattern of intensity variation within each word pronounced, is detected from the input voice signal; a first variation value indicating the variation of the detected voice intensity along the time axis, a second variation value indicating the variation of the voice tempo along the time axis, and a third variation value indicating the variation of the intonation along the time axis are obtained. In addition, the fundamental frequency of the voice signal is detected, and a fourth variation value indicating the variation of the fundamental frequency along the time axis is obtained. Signals expressing the emotional states of anger, fear, sorrow and pleasure are then generated, respectively, on the basis of said first, second, third and fourth variation values.

This new set of essential features achieves the technical result by additionally detecting changes in the fundamental frequency and by generating the signals expressing the speaker's emotional state on the basis of four variation values.

Analysis of the prior art has established that there are no analogues characterized by a set of features identical to all the features of the claimed method for detecting emotions. The claimed invention therefore meets the patentability condition of "novelty".

A search of known solutions in this and related fields of technology for features matching the distinctive features of the claimed method over the prototype showed that they do not follow explicitly from the prior art. Nor does the prior art reveal any known effect of the transformations provided by the essential features of the claimed invention on the achievement of the technical result. The claimed invention therefore meets the patentability condition of "inventive step".

The claimed invention is illustrated by the following figures:

figure 1 - an embodiment of the system for detecting emotions from voice according to the proposed method;

figure 2 - the decision rules for determining emotions according to the proposed method;

figure 3 - the results of evaluating the accuracy of determining the emotional state.

The claimed method is implemented as follows (figure 1).

The voice signal is input via microphone 101 and quantized by analog-to-digital converter 102, which converts it into a digital signal. The digital voice signal obtained at the output of the analog-to-digital converter is fed to signal-processing block 103, phoneme-detection block 104, word-detection block 105 and fundamental-frequency-detection block 106.
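For concreteness only (this sketch is an editorial illustration, not part of the disclosure), the digitized signal can be split into consecutive analysis frames before being fed to blocks 103-106; the frame length is an assumed parameter:

```python
import numpy as np

def frames(signal: np.ndarray, fs: int, frame_s: float = 0.02):
    """Split the digitized voice signal (output of ADC 102) into
    consecutive, non-overlapping analysis frames for blocks 103-106.
    frame_s is a hypothetical frame length; the patent does not fix one."""
    step = int(fs * frame_s)
    for start in range(0, len(signal) - step + 1, step):
        yield signal[start:start + step]
```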

Signal-processing block 103 extracts the frequency components needed to detect the intensity of the voice. Intensity-detection block 107 detects the intensity of the signal extracted by block 103. For example, the intensity may be taken as the result of averaging the amplitude of the voice signal, or as the dynamic range D.
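A minimal sketch of the two intensity measures just mentioned, assuming the frame arrives as a NumPy array of samples; the function names and the exact dynamic-range formula are assumptions, since the patent leaves them open:

```python
import numpy as np

def detect_intensity(frame: np.ndarray) -> float:
    """Intensity of one analysis frame as the average absolute
    amplitude of the voice signal (one of the measures the text names)."""
    return float(np.mean(np.abs(frame)))

def dynamic_range(frame: np.ndarray, eps: float = 1e-12) -> float:
    """Alternative intensity measure D: dynamic range of the frame,
    here taken as the peak-to-mean amplitude ratio in dB (an assumed
    formula; the patent does not specify one)."""
    mag = np.abs(frame)
    return float(20.0 * np.log10((mag.max() + eps) / (mag.mean() + eps)))
```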

Phoneme-detection block 104 segments the voice signal fed into it into individual phonemes. Tempo-detection block 108 receives the phoneme segmentation produced by block 104 and detects the number of phonemes F appearing per unit time. The tempo-detection cycle is set to a period of, for example, 10 s. However, if a phrase boundary is detected, the phoneme count stops at the moment of that boundary, even if it occurs within the 10-second window, and the tempo value is then calculated. The tempo is thus determined for each phrase.
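The tempo computation can be sketched as follows, assuming phoneme onset times from block 104 are available as a list of seconds relative to the start of the detection window; the names and parameters are illustrative assumptions:

```python
def detect_tempo(phoneme_onsets: list[float], phrase_end: float,
                 window_s: float = 10.0) -> float:
    """Tempo F as the number of phonemes per unit time.
    phoneme_onsets: onset times (s) from phoneme-detection block 104;
    phrase_end: time (s) of the detected phrase boundary.
    Counting stops at the phrase boundary even when it falls inside
    the 10 s detection window, so the tempo is computed per phrase."""
    horizon = min(phrase_end, window_s)
    if horizon <= 0:
        return 0.0
    count = sum(1 for t in phoneme_onsets if t <= horizon)
    return count / horizon
```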

Word-detection block 105 segments the voice signal fed into it into individual words. Intonation-detection block 109 receives the word segmentation produced by block 105 and detects the intonation, which expresses the pattern of intensity variation within the word; that is, block 109 detects the characteristic intensity pattern of a segment. As in the prototype, intonation-detection block 109 comprises a band-pass filter, an absolute-value conversion unit, a comparison unit, a zone-detection unit and a zone-interval-detection unit. The intonation value I output by block 109 is the result of averaging the intervals between those zones of the power spectrum of the signal that exceed a certain threshold value.
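One possible reading of this intonation measure is sketched below (the band-pass filtering stage is omitted for brevity; the threshold choice and the start-to-start definition of "interval between zones" are assumptions, as the patent does not pin them down):

```python
import numpy as np

def detect_intonation(word: np.ndarray, fs: int,
                      threshold_ratio: float = 0.5) -> float:
    """Intonation value I for one word segment: the average interval
    (in Hz) between 'zones' of the power spectrum exceeding a threshold."""
    spectrum = np.abs(np.fft.rfft(word)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(len(word), d=1.0 / fs)   # bin frequencies, Hz
    above = spectrum > threshold_ratio * spectrum.max()
    # indices where contiguous above-threshold zones begin
    starts = np.flatnonzero(np.diff(above.astype(int)) == 1) + 1
    if above[0]:
        starts = np.insert(starts, 0, 0)
    if len(starts) < 2:                              # fewer than two zones
        return 0.0
    return float(np.mean(np.diff(freqs[starts])))
```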

Fundamental-frequency-detection block 106 determines the fundamental frequency F0 of the voice signal fed into it. Block 106 can be implemented, for example, in accordance with the known solution of patent No. 78977 of 10.12.2008.
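The patent delegates F0 detection to that previously known solution; purely as a stand-in, here is a textbook autocorrelation pitch estimator (an assumption, not the cited detector):

```python
import numpy as np

def detect_f0(frame: np.ndarray, fs: int,
              f0_min: float = 70.0, f0_max: float = 400.0) -> float:
    """Fundamental frequency F0 of one frame by the classic
    autocorrelation method, searching lags in a plausible voice range."""
    frame = frame - np.mean(frame)                       # remove DC offset
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    lag_max = min(lag_max, len(corr) - 1)
    if lag_max <= lag_min:
        return 0.0
    best = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return fs / best
```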

A person's emotional state changes over time; therefore, to identify emotions correctly, including anger, fear, sorrow and pleasure, it is necessary to track the changes in the characteristic values: intensity D, tempo F, intonation I and fundamental frequency F0.

In the emotion-detection system shown in figure 1, to make it possible to rely on past values of the characteristics, the intensity value D produced by intensity-detection block 107, the tempo value F produced by tempo-detection block 108, the intonation value I produced by intonation-detection block 109 and the fundamental-frequency value F0 produced by fundamental-frequency-detection block 106 are temporarily retained in temporary data-storage block 110.

In addition, emotion-change-detection block 111 receives the current intensity value D from block 107, the current tempo value F from block 108, the current intonation value I from block 109 and the current fundamental-frequency value F0 from block 106. Block 111 also receives the previous values of intensity, tempo, intonation and fundamental frequency stored in temporary data-storage block 110, and thereby detects the changes in the intensity, tempo, intonation and fundamental frequency of the voice. Voice-emotion-detection block 112 receives the changes in intensity ΔD, tempo ΔF, intonation ΔI and fundamental frequency ΔF0 produced by block 111, evaluates the current emotional state and, in this embodiment of the system, generates the signals expressing the emotional states of anger, fear, sorrow and pleasure.
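Blocks 110 and 111 can be sketched in miniature as follows (naming and structure are illustrative assumptions); the mapping from the four deltas to the emotion signals in block 112 follows the decision rules of figure 2, which are not reproduced here:

```python
class EmotionChangeDetector:
    """Retain the previous feature vector (intensity D, tempo F,
    intonation I, pitch F0) and return the four variation values
    along the time axis, as blocks 110 and 111 do."""

    def __init__(self):
        self._previous = None  # block 110: temporary storage of the last feature set

    def update(self, d, f, i, f0):
        """Return (dD, dF, dI, dF0) relative to the previous cycle."""
        current = (d, f, i, f0)
        if self._previous is None:      # first cycle: nothing to compare with
            self._previous = current
            return (0.0, 0.0, 0.0, 0.0)
        deltas = tuple(c - p for c, p in zip(current, self._previous))
        self._previous = current
        return deltas

detector = EmotionChangeDetector()
print(detector.update(0.20, 12.0, 35.0, 180.0))  # (0.0, 0.0, 0.0, 0.0)
print(detector.update(0.35, 14.0, 30.0, 215.0))  # ≈ (0.15, 2.0, -5.0, 35.0)
```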

The claimed method of detecting emotions from voice provides more accurate determination of the emotional state of a Russian-speaking subscriber. The following experimental studies were carried out to confirm the claimed technical result.

Recordings of emotional speech by 80 professional actors, men and women aged 28 to 32, were used to determine the emotional state. Each of them pronounced 4 words (cardboard, quietly, milk, utensils) with the expression of the four emotional states: anger, fear, sorrow and pleasure.

These recordings were processed by an embodiment of emotion detection according to the prototype method and by the embodiment of the system for detecting emotions from voice (figure 1) according to the proposed method. Voice-emotion-detection block 112 evaluated the current emotional state and generated the signals expressing the emotional states of anger, fear, sorrow and pleasure according to the decision rules for determining emotions presented in figure 2.

To assess the accuracy of determining the emotional state of a Russian-speaking subscriber, the hit ratio

K_i = N_match.i / N_i

was used, where N_match.i is the number of correctly recognized recordings with the expression of the i-th emotional state; N_i is the total number of recordings with the expression of the i-th emotional state; and i = 1, 2, 3, 4 indexes the emotional states of anger, fear, sorrow and pleasure, respectively.
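A worked example of the hit ratio (the numbers here are hypothetical and are not results from figure 3):

```python
def hit_ratio(n_match: int, n_total: int) -> float:
    """K_i = N_match.i / N_i for the i-th emotional state."""
    return n_match / n_total

# e.g. if 64 of 80 recordings expressing anger were recognized correctly:
print(hit_ratio(64, 80))  # 0.8
```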

The results of the evaluation by the prototype method and by the proposed method (figure 3) indicate more accurate determination of the emotional state by the claimed method, and thus the possibility of solving the task of the invention.

The method for detecting emotions from voice in which the intensity of the voice and the tempo, determined by the rate at which the voice appears, are detected, and the intonation, which reflects the pattern of intensity variation within each word pronounced, is detected from the input voice signal; a first variation value indicating the variation of the detected voice intensity along the time axis, a second variation value indicating the variation of the voice tempo along the time axis, and a third variation value indicating the variation of the intonation along the time axis are obtained; characterized in that the voice signal of a Russian-speaking subscriber is input, after which the intensity and tempo of the voice are detected; after the third variation value is obtained, the fundamental frequency of the voice signal is detected and a fourth variation value indicating the variation of the fundamental frequency along the time axis is obtained; and signals expressing the emotional states of anger, fear, sorrow and pleasure are generated, respectively, on the basis of said first, second, third and fourth variation values.

 
