RussianPatents.com

Method and discriminator for classifying different signal segments. RU patent 2507609.

IPC classes for Russian patent "Method and discriminator for classifying different signal segments" (RU 2507609):

G10L25/00 - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; AUDIO ANALYSIS OR PROCESSING

FIELD: information technology.

SUBSTANCE: the method and discriminator classify different segments of a signal which comprises segments of at least a first type and a second type, e.g. music and speech segments. A short-term classification (150) of the signal, based on at least one short-term feature extracted from the signal, yields a short-term classification result (152); a long-term classification (154) of the signal, based on at least one short-term feature and at least one long-term feature extracted from the signal, yields a long-term classification result (156). The short-term classification result (152) and the long-term classification result (156) are combined (158) to provide an output decision signal (160) indicating whether a segment of the signal is of the first type or of the second type.

EFFECT: an improved approach for discriminating between segments of different types in a signal while keeping the delay introduced by the discriminator low.

17 cl, 7 dwg

 

The invention relates to an approach for classifying different segments of a signal comprising segments of at least a first type and a second type. In particular, the invention relates to speech/music discrimination during encoding of an audio signal.

Frequency-domain encoding schemes such as MP3 or AAC are known. These encoders are based on a time-to-frequency transform, a subsequent quantization stage, in which the quantization error is controlled using information from a psychoacoustic module, and an encoding stage, in which the quantized spectral coefficients and corresponding side information are entropy-encoded without loss using code tables.

On the other hand, there are encoders that are well suited to speech processing, such as AMR-WB+, described in 3GPP TS 26.290. Such speech encoding schemes perform linear prediction (LP) filtering of the time-domain signal. The LP filter is derived from a linear prediction analysis of the time-domain input signal. The resulting LP filter coefficients are encoded and transmitted as side information. The process is known as linear predictive coding (LPC). At the output of the filter, the prediction residual signal or prediction error signal, which is also known as the excitation signal, is encoded using an ACELP encoder or, alternatively, using a transform encoder that applies a Fourier transform with overlap. The choice between ACELP encoding and transform-coded excitation, which is also called TCX encoding, is made using a closed-loop or an open-loop algorithm.
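The LP analysis described above can be sketched in a few lines. The following fragment is an illustrative toy implementation of the autocorrelation method with the Levinson-Durbin recursion, not the AMR-WB+ implementation; the frame length, model order, and test signal are arbitrary choices for the example.

```python
import math

def lpc_coefficients(frame, order):
    """Toy LP analysis: autocorrelation method + Levinson-Durbin recursion.

    Returns the coefficients a (a[0] == 1) of A(z) = 1 + a1*z^-1 + ... and
    the final prediction-error (excitation) energy.  Illustrative only; real
    speech codecs such as AMR-WB+ add windowing, lag windowing and bandwidth
    expansion on top of this basic recursion.
    """
    n = len(frame)
    # Autocorrelation sequence r[0..order]
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err                 # reflection coefficient
        a_prev = a[:]
        for j in range(1, m):
            a[j] = a_prev[j] + k * a_prev[m - j]
        a[m] = k
        err *= (1.0 - k * k)           # residual energy shrinks each order
    return a, err

# A strongly predictable signal: the residual ("excitation") carries far
# less energy than the signal itself.
frame = [math.sin(0.1 * t) for t in range(512)]
a, err = lpc_coefficients(frame, order=8)
```

The residual obtained by filtering the frame through A(z) is what an ACELP or transform-based excitation coder would then encode.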

Frequency-domain schemes, such as the High-Efficiency AAC encoding scheme, which combines the AAC encoding scheme with the spectral band replication technique (reconstruction of the high-frequency part of the spectrum), can also be combined with a joint stereo or multi-channel tool known as "MPEG Surround". Frequency-domain encoding schemes are advantageous in that they show high quality at low bitrates for music signals. Problematic, however, is the quality of speech signals at low bitrates.

Speech encoders such as AMR-WB+, on the other hand, also have a high-frequency enhancement stage (unit) and stereo functionality. Speech encoding schemes show high quality for speech signals even at low bitrates, but show low quality for music signals at low bitrates.

Given the above-mentioned available encoding schemes, some of which are better suited to encoding speech and others better suited to encoding music, automatic segmentation and classification of the audio signal to be encoded is an important tool in many multimedia applications and can be used to choose the appropriate process for each different class of audio signals. The overall performance of the application depends heavily on the reliability of the audio signal classification. Indeed, a misclassification leads to a wrong choice and configuration of the subsequent processing.

Figure 6 shows a typical encoder design used to encode speech and music depending on the type of the audio signal. The encoder design includes a speech encoding branch 100, including a suitable speech encoder 102, for example an AMR-WB+ speech encoder as described in the technical specification "Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec", 3GPP TS 26.290 V6.3.0, 2005-06. Further, the encoder design includes a music encoding branch 104, including a music encoder 106, for example an AAC music encoder as described in Generic Coding of Moving Pictures and Associated Audio: Advanced Audio Coding, International Standard 13818-7, ISO/IEC JTC1/SC29/WG11 Moving Pictures Expert Group, 1997.

The outputs of the encoders 102 and 106 are connected to an input of a multiplexer 108. The inputs of the encoders 102 and 106 are selectively connectable to the input audio signal line 110. The input audio signal is applied selectively to the speech encoder 102 or the music encoder 106 by means of a switch 112, shown schematically in Figure 6 and controlled by a switch controller 114. Furthermore, the encoder includes a speech/music discriminator 116, which also receives the input audio signal and outputs a control signal to the switch controller 114. The switch controller 114 also outputs a mode indicator signal on line 118, which is connected to a second input of the multiplexer 108, so that the mode indicator signal can be sent along with the encoded signal. The mode indicator signal indicates whether the block of data associated with it is encoded speech or encoded music, so that the decoder does not need to perform any discrimination. Based on the mode indicator bit supplied together with the encoded data, the decoder can generate a corresponding switching signal to direct the received encoded data to the appropriate speech or music decoder.
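The role of the mode indicator signal can be illustrated with a minimal multiplexing sketch. The block layout here (one mode byte and a 16-bit payload length per coded block) is a hypothetical format invented for the example; the actual bitstream layout of a switched codec is different.

```python
import struct

SPEECH, MUSIC = 0, 1

def mux_block(mode, payload):
    """Prepend a mode byte and a 16-bit payload length to each coded block."""
    return struct.pack(">BH", mode, len(payload)) + payload

def demux_block(stream, offset=0):
    """Read back one block; the decoder can route the payload to the speech
    or music decoder without re-classifying the signal."""
    mode, length = struct.unpack_from(">BH", stream, offset)
    start = offset + 3
    return mode, stream[start:start + length], start + length

# Two consecutive blocks with different modes in one stream:
blob = mux_block(SPEECH, b"acelp-data") + mux_block(MUSIC, b"aac-data")
mode1, p1, nxt = demux_block(blob)
mode2, p2, _ = demux_block(blob, nxt)
```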

Figure 6 depicts the traditional encoder design used to digitally encode the speech and music signals on line 110. In general, speech encoders are more effective for speech, and music encoders are more effective for music. A universal encoding scheme can be built using a system that switches from one encoder to the other according to the nature of the input signal. A non-trivial problem here is to design a suitable input signal classifier that controls the switch. The classifier is the speech/music discriminator 116 shown in Figure 6. Usually, a reliable audio classification introduces a considerable delay, while on the other hand delay is an important factor in real-time applications.

In general, it is desirable that the total algorithmic delay introduced by the speech/music discriminator be small enough that the switched encoders can be used in real-time applications.

Fig.7 illustrates the delay of the encoder presented in Figure 6. It is assumed that the signal applied to input line 110 is encoded in frames of 1024 samples at a sampling rate of 16 kHz, so that the speech/music decision must be made for each frame, i.e. every 64 milliseconds. The transition between the two encoders can be performed as described in WO 2008/071353 A2, and the speech/music discriminator should not significantly increase the algorithmic delay of the switched decoders, which in general is 1600 samples, not counting the delay needed for the speech/music discrimination. Further, it is desirable to provide the speech/music decision for the same frame that is resolved by the AAC block-switching decision. The situation is shown in Fig.7, which illustrates an AAC long block 120 having a length of 2048 samples, i.e. the long block 120 spans two frames of 1024 samples, AAC short blocks 122 spanning one frame of 1024 samples, and an AMR-WB+ superframe 124 spanning one frame of 1024 samples.

In Figure 7, the AAC block-switching decision and the speech/music decision are performed on frames 126 and 128, respectively, each 1024 samples in size and covering the same period of time. The two decisions are placed at this temporal position so that the encoder can use transition windows to move properly from one encoding mode to the other. Beyond these two decisions, a minimum delay of 512+64 samples must be paid. This delay has to be added to the delay of 1024 samples caused by the 50% overlap of the AAC MDCT, which yields a minimum delay of 1600 samples. In plain AAC, only block switching is present, and the delay is exactly 1600 samples. This delay is needed to switch at once from a long block to short blocks when a transient is detected in frame 126. This switching of the transform length is desirable in order to avoid pre-echo artifacts. The frame 130 to be decoded, shown in Fig.7, is the first frame that can be reconstructed by the decoder in either case (long or short blocks).
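The delay budget described above is plain arithmetic; the following sketch merely restates the numbers from the text (16 kHz sampling, 1024-sample frames) as a sanity check.

```python
FS = 16000                      # sampling rate in Hz
MDCT_OVERLAP = 1024             # 50% overlap of the 2048-sample AAC MDCT
BLOCK_SWITCH = 512 + 64         # delay paid for the block-switching decision
AAC_DELAY = MDCT_OVERLAP + BLOCK_SWITCH   # minimum AAC delay in samples

def ms(samples):
    """Convert a delay in samples to milliseconds at the given rate."""
    return 1000.0 * samples / FS
```

At 16 kHz, `AAC_DELAY` is 1600 samples, i.e. the 100 milliseconds quoted in the following paragraph.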

In a switched encoder using AAC as the music encoder, the switching decision made by the decision stage should not add too much additional delay to the original AAC delay. The additional delay comes from the lookahead frame 132, which the decision stage needs for analyzing the signal.

For example, at a sampling frequency of 16 kHz the AAC delay is 100 milliseconds, whereas a conventional speech/music discriminator uses approximately 500 milliseconds of observation, so that the switched coding structure would have a delay of 600 milliseconds. The total delay would then be six times the original AAC delay.

This object, namely an improved approach for discriminating between segment types with low delay, is achieved by the method of claim 1 and the discriminator of claim 14. One solution of the invention provides a method for classifying different segments of a signal comprising segments of at least a first type and a second type. The method includes: a short-term classification of the signal on the basis of at least one short-term feature extracted from the signal, producing a short-term classification result; a long-term classification of the signal on the basis of at least one short-term feature and at least one long-term feature extracted from the signal, producing a long-term classification result; and combining the short-term classification result and the long-term classification result to provide an output signal indicating whether a segment of the signal is of the first type or of the second type.

Another solution of the invention is a discriminator, including: a short-term classifier designed to receive the signal and to produce a short-term classification result on the basis of at least one short-term feature extracted from the signal, the signal comprising segments of at least a first type and a second type; a long-term classifier designed to receive the signal and to produce a long-term classification result on the basis of at least one short-term feature and at least one long-term feature extracted from the signal; and a decision circuit designed to combine the short-term classification result and the long-term classification result and to generate an output signal indicating whether a segment of the signal is of the first type or of the second type.

Solutions of the invention provide the output signal on the basis of a comparison of the short-term analysis result and the long-term analysis result.

Solutions of the invention relate to an approach for classifying different non-overlapping short segments of an audio signal as speech-like or non-speech-like, or as speech-like or other classes. The approach is based on feature extraction and on an analysis of the feature statistics over two different lengths of analysis windows. The first window is long and directed mainly toward the past. The first window is used to obtain a reliable but delayed decision cue for the signal classification. The second window is short and considers mainly the segment processed at present, the current segment. The second window is used to obtain an instant decision cue. The two decision cues are optimally combined, using a hysteresis decision that takes memory information from the delayed cues and instant information from the instant cues.

Solutions of the invention use the short-term features both in the short-term classifier and in the long-term classifier, the two classifiers using different statistics of the same features. The short-term classifier extracts only instant information, because it has access to only one set of features. For example, it can use the mean value of the features. The long-term classifier, on the other hand, has access to several sets of features, because it considers several frames. As a consequence, the long-term classifier can exploit more characteristics of the signal by processing statistics over a larger number of frames than the short-term classifier. For example, the long-term classifier can use the variance of the features or the evolution of the features over time. Thus, the long-term classifier can use more information than the short-term classifier, but this introduces a delay or waiting time. However, the long-term features, despite introducing a delay or waiting time, make the long-term classification result more accurate and reliable. In some solutions, the short-term and long-term classifiers can consider the same short-term features, which can then be computed once and used by both classifiers. In such a solution, the long-term classifier can obtain the short-term features directly from the short-term classifier.
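The difference between the two classifiers can be illustrated with the statistics just mentioned: in this sketch a hypothetical per-frame feature set is summarized by its mean within the current frame (instant cue) and by the variance of those means across several past frames (delayed cue). The feature values and frame counts are invented for the example.

```python
from statistics import mean, variance

def short_term_stat(frame_features):
    """Instant cue: a statistic of the single feature set of the current frame."""
    return mean(frame_features)

def long_term_stat(history):
    """Delayed cue: a statistic across several frames, e.g. the variance of
    the per-frame means, capturing how the feature evolves over time."""
    return variance([mean(f) for f in history])

# Four frames of a made-up feature (e.g. per-subframe energies); the strong
# frame-to-frame alternation yields a large long-term variance.
history = [[0.9, 1.1, 1.0], [0.1, 0.2, 0.1], [1.0, 0.8, 1.2], [0.2, 0.1, 0.3]]
instant = short_term_stat(history[-1])   # uses only the current frame
delayed = long_term_stat(history)        # uses all four frames
```

The instant statistic is available as soon as the current frame is, while the delayed statistic requires the whole history, which is exactly the delay/reliability trade-off described above.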

Thus, the new approach provides a correct classification while introducing low latency. In contrast to conventional approaches, the solutions of the invention limit the delay introduced by the speech/music decision while maintaining the reliability of the decision. In one solution, the lookahead of the invention is limited to 128 samples, which leads to an overall delay of only 108 milliseconds.

Brief description of the drawings

Solutions of the invention are described below with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of a speech/music discriminator in accordance with a solution of the invention;

Figure 2 illustrates the analysis windows used by the long-term and short-term classifiers of the discriminator of Figure 1;

Figure 3 illustrates the hysteresis decision used in Figure 1;

Figure 4 is a block diagram of an example encoding scheme including a discriminator in accordance with solutions of the invention;

Figure 5 is a block diagram of a decoding scheme corresponding to the encoding scheme of Figure 4;

Figure 6 is a block diagram of a conventional encoder design for encoding speech and music signals; and

Fig.7 illustrates the delay introduced at the encoder shown in Figure 6.

Figure 1 shows a block diagram of the speech/music discriminator 116 in accordance with a solution of the invention. The speech/music discriminator 116 includes a short-term classifier 150, which receives the input signal, for example an audio signal including music segments and speech segments. The short-term classifier 150 outputs on line 152 the short-term classification result, the instant decision cue. The discriminator 116 further includes a long-term classifier 154, which also receives the input signal and outputs on line 156 the long-term classification result, the delayed decision cue. Further, a hysteresis decision circuit 158 is provided, which combines the output signals of the short-term classifier 150 and the long-term classifier 154 in the manner described in detail below, to form the speech/music decision signal that is output on line 160 and can be used to control the further processing of a segment of the input signal in the manner described above with respect to Figure 6, i.e. the speech/music decision signal 160 can be used to direct the classified segment of the input signal to the speech encoder or to the music encoder.

Thus, in accordance with solutions of the invention, two different classifiers 150 and 154 are used in parallel, both receiving the input data via input line 110. The two classifiers are called the long-term classifier 154 and the short-term classifier 150, and they differ in the statistics of the features they analyze and in their analysis windows. The two classifiers form the output signals 152 and 156, namely the instant decision cue (IDC) and the delayed decision cue (DDC). The short-term classifier 150 forms the IDC on the basis of short-term features, in order to provide instant information about the nature of the input signal. The IDC relates to short-term attributes of the signal, which can change quickly and at any time. Consequently, the short-term features are responsive and do not introduce a large delay into the discrimination process. For example, since the signal can be regarded as stationary over 5-20 millisecond intervals, short-term features can be computed for each 16 millisecond frame at a sampling rate of 16 kHz. The long-term classifier 154 forms the DDC on the basis of features obtained from a longer observation of the signal (long-term features) and therefore achieves a more reliable classification.

Figure 2 illustrates the analysis windows used by the long-term classifier 154 and the short-term classifier 150 shown in Figure 1. For a frame length of 1024 samples at a sampling rate of 16 kHz, the length of the long-term classifier window 162 is 4 ∗ 1024+128 samples, i.e. the long-term classifier window 162 covers four audio frames plus an additional 128 samples needed by the long-term classifier 154 for its analysis. This additional delay, which is referred to as "lookahead", is indicated in Figure 2 by reference numeral 164. Figure 2 also shows the short-term classifier window 166 of length 1024+128 samples, which covers one frame of the audio signal plus the additional delay needed to analyse the current segment. The current segment is denoted by reference numeral 128; this is the segment for which the speech/music decision is made.

The long-term classifier window shown in Figure 2 is long enough to capture the characteristic 4 Hz energy modulation of speech. The 4 Hz energy modulation is a distinctive feature of speech, which has traditionally been used in speech/music discrimination, for example in Scheirer E. and Slaney M., "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator", ICASSP'97, Munich, 1997. The 4 Hz energy modulation is a feature that can only be determined by observing the signal over a long time segment. The additional delay introduced by the speech/music discriminator through the lookahead 164 of 128 samples is needed by each of the classifiers 150 and 154 to perform the corresponding analysis, namely a perceptual linear prediction analysis as described in H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990, and H. Hermansky et al., "Perceptually based linear predictive analysis of speech", ICASSP, pp. 509-512, 1985. Thus, using the discriminator in the encoder presented in Figure 6, the full switching delay of the encoders 102 and 106 is 1600+128 samples, which equals 108 milliseconds and is small enough for real-time applications.
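The window lengths and the resulting overall delay follow directly from the numbers above; the following sanity-check sketch restates them (16 kHz sampling, 1024-sample frames, 128-sample lookahead):

```python
FS = 16000        # sampling rate in Hz
FRAME = 1024      # frame length in samples
LOOKAHEAD = 128   # additional "lookahead" delay 164

long_window = 4 * FRAME + LOOKAHEAD   # long-term classifier window 162
short_window = FRAME + LOOKAHEAD      # short-term classifier window 166
total_delay = 1600 + LOOKAHEAD        # switched-encoder delay + lookahead

# The 4224-sample long window spans 264 ms, i.e. more than one full
# period (250 ms) of the 4 Hz speech energy modulation.
long_window_ms = 1000 * long_window / FS
total_delay_ms = 1000 * total_delay / FS
```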

Figure 3 describes the combination of the output signals 152 and 156 of the classifiers 150 and 154 of the discriminator 116 in order to obtain the speech/music decision signal 160. The delayed decision cue DDC and the instant decision cue IDC are, in accordance with a solution of the invention, combined using a hysteresis. Processes with hysteresis are widely used to make decisions robust and to stabilize them. Figure 3 illustrates a two-state hysteresis decision process, as a function of the DDC and the IDC, which determines whether the speech/music decision indicates that the currently processed segment of the input signal is a speech segment or a music segment. Figure 3 shows a typical hysteresis cycle, where the IDC and DDC signals of the classifiers 150 and 154 are normalized so that they take values between -1 and 1, where -1 means that the segment is entirely music-like and 1 means that the segment is entirely speech-like.

The decision is based on the value of a function F(IDC, DDC), examples of which will be described below. In Figure 3, the function F1(DDC, IDC) indicates the threshold that the function F(IDC, DDC) has to cross to move from the state "music" to the state "speech". The function F2(DDC, IDC) indicates the threshold that the function F(IDC, DDC) has to cross to move from the state "speech" to the state "music". The final decision D(n) for the current segment or current frame with index n can be computed based on the following pseudo-code:

%Hysteresis Decision Pseudo Code

If (D(n-1) == music)

    If (F(IDC,DDC) < F1(DDC,IDC))

        D(n) = music

    Else

        D(n) = speech

Else

    If (F(IDC,DDC) > F2(DDC,IDC))

        D(n) = speech

    Else

        D(n) = music

%End Hysteresis Decision Pseudo Code

In accordance with an embodiment of the invention, the function F(IDC, DDC) and the above thresholds are defined as follows:

F(IDC,DDC)=IDC

F1(IDC,DDC)=0.4-0.4 ∗ DDC

F2(IDC,DDC)=-0.4-0.4 ∗ DDC

Alternatively, the following definitions can be used:

F(IDC,DDC)=(2 ∗ IDC+DDC)/3

F1(IDC,DDC)=-0.75 ∗ DDC

F2(IDC,DDC)=-0.75 ∗ DDC

When the latter definitions are used, the hysteresis cycle disappears, and the decision is made solely on the basis of a single adaptive threshold.
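As an illustration, the hysteresis decision with the first set of definitions above can be sketched in Python; the function name and the boolean encoding of the state are illustrative, not part of the patent:

```python
def hysteresis_decision(idc, ddc, prev_is_speech):
    """Hysteresis decision using F(IDC,DDC)=IDC with the adaptive
    thresholds F1 = 0.4 - 0.4*DDC (to leave the "music" state) and
    F2 = -0.4 - 0.4*DDC (to leave the "speech" state)."""
    f = idc                  # F(IDC, DDC) = IDC
    f1 = 0.4 - 0.4 * ddc     # threshold music -> speech
    f2 = -0.4 - 0.4 * ddc    # threshold speech -> music
    if not prev_is_speech:   # previous state: music
        return f >= f1       # switch to speech only above F1
    else:                    # previous state: speech
        return f > f2        # stay speech while above F2
```

Note the memory effect: with DDC=0, an IDC of -0.3 keeps a "speech" state (above F2=-0.4) but would not switch a "music" state (below F1=0.4).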

The invention is not limited to the hysteresis decision described above. Further embodiments for combining the analysis results to obtain the output signal are described next.

Instead of a hysteresis decision, simple threshold processing using the DDC and IDC features can be used. The DDC is considered to provide a more reliable cue, because it is obtained from a longer observation of the signal. However, the calculation of the DDC is partly based on past observations of the signal. A conventional classifier, which merely compares the DDC value with the threshold 0 and rates the segment as speech-like when DDC>0 and as music-like otherwise, produces a deferred (delayed) decision. In this embodiment of the invention, threshold processing based on the IDC can be used to make the decision faster. The adaptive threshold can then be computed according to the following pseudo-code:

% Pseudo code of adaptive thresholding

If(DDC>-0.5 ∗ IDC)

  D(n)=speech

Else

  D(n)=music

%End of adaptive thresholding
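This adaptive threshold rule can be sketched as a one-line function (illustrative Python, not part of the patent text):

```python
def adaptive_threshold_decision(idc, ddc):
    """Classify the current frame as speech when the delayed cue DDC
    exceeds the adaptive threshold -0.5*IDC; otherwise music.
    A strongly speech-like IDC (close to 1) lowers the threshold,
    so the decision can react faster than with DDC > 0 alone."""
    return "speech" if ddc > -0.5 * idc else "music"
```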

In another embodiment, the DDC can be used to make the IDC more reliable. The IDC is fast but not as robust as the DDC. In addition, analyzing the evolution of the DDC between the past and the current segment can provide another cue showing how frame 166 of figure 2 affects the DDC computed over segment 162. The notation DDC(n) is used for the current value of the DDC and DDC(n-1) for the past value. Using both DDC(n) and DDC(n-1), the IDC can be made more reliable by means of a decision tree, as described below:

% Pseudo code of decision tree

If(IDC>0 && DDC(n)>0)

  D(n)=speech

Else if(IDC<0 && DDC(n)<0)

  D(n)=music

Else if(IDC>0 && DDC(n)-DDC(n-1)>0)

  D(n)=speech

Else if(IDC<0 && DDC(n)-DDC(n-1)<0)

  D(n)=music

Else if(DDC(n)>0)

  D(n)=speech

Else

  D(n)=music

%End of decision tree

In the above decision tree, a decision is taken directly if both cues indicate the same result. If the two cues give conflicting indications, the evolution of the DDC is examined. If the difference DDC(n)-DDC(n-1) is positive, the current segment may be assumed to be speech-like; otherwise, it may be assumed to be music-like. If this new cue points in the same direction as the IDC, the final decision is taken. If both attempts fail to give a clear decision, the decision is taken solely on the basis of the delayed cue DDC, since the reliability of the IDC is insufficient.
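The decision tree above can be sketched in Python as follows (illustrative names; the logic follows the pseudo-code):

```python
def decision_tree(idc, ddc_n, ddc_n1):
    """Combine the instantaneous cue IDC with the delayed cue DDC(n)
    and its past value DDC(n-1), as in the decision tree above."""
    if idc > 0 and ddc_n > 0:        # both cues agree on speech
        return "speech"
    if idc < 0 and ddc_n < 0:        # both cues agree on music
        return "music"
    delta = ddc_n - ddc_n1           # evolution of the delayed cue
    if idc > 0 and delta > 0:        # IDC confirmed by rising DDC
        return "speech"
    if idc < 0 and delta < 0:        # IDC confirmed by falling DDC
        return "music"
    # no clear confirmation: fall back on the delayed cue alone
    return "speech" if ddc_n > 0 else "music"
```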

In the following, classifiers 150 and 154 according to embodiments of the invention are described.

First, regarding the long-term classifier 154, note that a number of features are extracted for each subframe of 256 samples. The first feature is the Perceptual Linear Prediction Cepstral Coefficient (PLPCC), described in H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990, and H. Hermansky et al., "Perceptually based linear predictive analysis of speech," ICASSP, pp. 509-512, 1985. The PLPCCs are effective for speaker classification because they exploit a model of human auditory perception. These coefficients can be used to distinguish speech from music: they capture the formant structure of speech, as well as the syllabic 4 Hz modulation of speech, when the evolution of the features over time is analyzed.

However, to obtain stronger grounds for discrimination, the PLPCCs are merged with another feature capable of capturing pitch information, which is another important characteristic of speech and is critical in speech coding. Indeed, speech coding relies on the assumption that the input signal is a speech signal, and a speech coding scheme is efficient for such a signal. On the other hand, the pitch characteristics of speech harm the coding efficiency of music coders. The smooth pitch fluctuation, i.e. the natural vibrato of speech, makes the frequency representation of the signal in music coders unsuitable for the strong energy compaction that is required to achieve high coding efficiency.

The following pitch characteristics are used:

Glottal pulse energy ratio:

This feature computes the ratio of the energy between the glottal pulses and the LPC residual signal. The glottal pulses are extracted from the LPC residual using a peak-picking algorithm. The LPC residual of a voiced speech segment usually shows a pronounced pulse-like structure caused by the vibration of the vocal cords. This feature is high for voiced segments.

Long-term prediction gain:

The long-term prediction gain is usually computed in speech encoders (see, for example, the "Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec", 3GPP TS 26.290 V6.3.0, 2005-06, Technical Specification) during the long-term prediction. This feature measures the periodicity of the signal and is based on the pitch estimate.

Pitch lag fluctuation:

This feature measures the difference between the current pitch lag estimate and the previous one. For voiced speech this feature should be low, but not zero, and should evolve smoothly.

Once the required set of features has been extracted for the long-term classifier, a statistical classifier is applied to the extracted features. The classifier is first trained by extracting the features for speech and music training sets. The extracted features are normalized to a mean of 0 and a variance of 1 over the training sets. For each training set, the extracted and normalized features are collected within the long-term classifier window and modeled by a Gaussian Mixture Model (GMM) using five Gaussians. At the end of the training, a set of normalization parameters and two sets of GMM parameters are obtained and stored.

For classification, the features are first extracted for each frame and normalized with the normalization parameters. The maximum likelihood for speech (lld_speech) and the maximum likelihood for music (lld_music) are computed for the extracted and normalized features using the GMM of the speech class and the GMM of the music class, respectively. The delayed decision cue DDC is then calculated as follows:

DDC=(lld_speech-lld_music)/(abs(lld_music)+abs(lld_speech))

The DDC lies between -1 and 1 and is positive when the likelihood of speech is higher than the likelihood of music, i.e. when lld_speech>lld_music.

The short-term classifier uses the PLPCCs as short-term feature. Unlike the long-term classifier, this feature is analyzed only over the frame of 128 samples. The statistics of the feature over this short period are modeled by a Gaussian Mixture Model (GMM) using five Gaussians. Two models are trained, one for music and one for speech; note that these two models differ from the models obtained for the long-term classifier. For the classification, the PLPCCs are first obtained for each frame, and the maximum likelihood for speech (lld_speech) and the maximum likelihood for music (lld_music) are computed using the GMM of the speech class and the GMM of the music class, respectively. The instantaneous decision cue IDC is then calculated as follows:

IDC=(lld_speech-lld_music)/(abs(lld_music)+abs(lld_speech))

IDC varies from -1 to 1.
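Both cues reduce to the same normalized log-likelihood difference. A minimal Python sketch, assuming lld_speech and lld_music are the log-likelihoods obtained from the respective trained GMMs (the function name is illustrative):

```python
def decision_cue(lld_speech, lld_music):
    """Normalized log-likelihood difference in [-1, 1]; positive when
    the speech likelihood exceeds the music likelihood. The same
    formula yields the DDC (long-term GMMs) and the IDC (short-term
    GMMs), only the underlying models and windows differ."""
    return (lld_speech - lld_music) / (abs(lld_music) + abs(lld_speech))
```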

Thus, the short-term classifier 150 generates the short-term classification result signal on the basis of the perceptual linear prediction cepstral coefficients (PLPCC), and the long-term classifier 154 forms the long-term classification result signal on the basis of the same feature, the PLPCCs, and the above-mentioned additional feature(s), i.e. the pitch characteristic(s) of the speech signal. In addition, the long-term classifier can exploit different statistics of the shared feature, i.e. of the PLPCCs, because the long-term classifier has access to a longer observation window. Thus, after the combination of the short-term and long-term results, the short-term features are essentially taken into account in the classification, i.e. their properties are substantially used.

Further embodiments of the respective classifiers 150 and 154 are described in detail below.

The short-term features analyzed by the short-term classifier according to this embodiment correspond mainly to the perceptual linear prediction cepstral coefficients, the PLPCCs mentioned above. PLPCCs are widely used in speech and speaker recognition, as are MFCCs (see above). The PLPCCs were retained because they share much of the functionality of the linear prediction (LP) used in most modern speech coders and already implemented in the envisaged switched codec. Using the PLPCCs, the formant structure of speech can be extracted as with LP, but thanks to the perceptual considerations the PLPCCs are more independent of the speaker and thus carry more relevance with respect to the linguistic information. For a signal with a sampling rate of 16 kHz, an order of 16 is used.

In addition to the PLPCCs, the voicing strength is computed as a short-term feature. The voicing strength is not necessarily discriminating by itself, but it is beneficial in combination with the PLPCCs. It allows the feature space to be divided into at least two groups, corresponding respectively to the voiced and unvoiced pronunciation of speech. The separation into these groups is based on computing the voicing characteristic from several parameters, namely the zero crossing counter (zc), the spectral tilt (tilt), the pitch stability (ps), and the normalized correlation of the pitch (nc). All four parameters are normalized between 0 and 1 such that 0 corresponds to a typical unvoiced signal and 1 corresponds to a typical voiced signal. In this embodiment, the voicing strength is taken from the speech classification criterion used in the speech encoder VMR-WB, described in Milan Jelinek and Redwan Salami, "Wideband speech coding advances in VMR-WB standard," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1167-1179, May 2007. The criterion is based on the behaviour of the pitch filter, derived from the autocorrelation. For a frame with index k, the voicing strength v(k) has the following form:

v(k) = (1/5) · (2·nc(k) + 2·ps(k) + tilt(k) + zc(k))
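The voicing strength formula can be sketched directly in Python (illustrative function name; the four inputs are the normalized parameters defined above):

```python
def voicing_strength(nc, ps, tilt, zc):
    """Voicing strength v(k) = (2*nc + 2*ps + tilt + zc) / 5, where
    each input is normalized to [0, 1] (0 = typically unvoiced,
    1 = typically voiced). The normalized correlation nc and pitch
    stability ps carry double weight."""
    for x in (nc, ps, tilt, zc):
        assert 0.0 <= x <= 1.0, "parameters must be normalized to [0, 1]"
    return (2 * nc + 2 * ps + tilt + zc) / 5
```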

The discriminating ability of the short-term features is evaluated using Gaussian Mixture Models (GMMs) as a classifier. Two GMMs are used, one for the speech class and one for the music class. The number of Gaussian mixture components is varied to assess its effect on performance. Table 1 shows the classification accuracy for different numbers of mixture components. The values are computed over segments of four consecutive frames each. The total delay equals 64 milliseconds, which is suitable for the switching application. It can be seen that the accuracy increases with the number of mixture components. The gap between 1-GMM and 5-GMM is particularly significant and may be explained by the fact that the formant representation of speech is too complex to be captured sufficiently by a single Gaussian.

          1-GMM   5-GMM   10-GMM  20-GMM
Speech    95.33   96.52   97.02   97.60
Music     92.17   91.97   91.61   91.77
Average   93.75   94.25   94.31   94.68

Table 1: Classification accuracy (%) using short-term features

Considering the long-term classifier 154, it is noted in many works, e.g. M. J. Carey et al., "A comparison of features for speech and music discrimination," Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, ICASSP, vol. 12, pp. 149-152, March 1999, that the variances of features are better suited for discrimination than the features themselves. As a rough general rule, music can be regarded as more stationary, usually showing smaller variations. Speech, on the contrary, can easily be distinguished by its pronounced 4 Hz energy modulation, since the signal alternates periodically between voiced and unvoiced segments. Moreover, the succession of different phonemes makes speech less stationary. According to the proposed embodiment, two long-term features are used: one based on a variance computation, and another based on a priori knowledge of the pitch contour of speech. The long-term features are adapted to a low-latency SMD (speech/music discrimination).

The moving variance of the PLPCCs consists in computing a variance for each set of PLPCCs over an analysis window covering several frames, giving particular weight to the most recent frames. To limit the introduced delay, the analysis window is asymmetric and considers only the current and past frames. In a first step, the moving average mam(k) of the PLPCCs over the last N frames is computed as follows:

mam(k) = Σ_{i=0}^{N-1} PLPCm(k-i) · w(i)

where PLPCm(k) is the m-th cepstral coefficient of the M coefficients obtained for frame k, and w(i) is a weighting window. The moving variance mvm(k) is then defined as:

mvm(k) = Σ_{i=0}^{N-1} (PLPCm(k-i) - mam(k))² · w(i)
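The moving average and moving variance above can be sketched in plain Python; the window w here is an assumption (any asymmetric window over the last N frames emphasizing recent frames would do), and the names are illustrative:

```python
def plpcc_moving_stats(history, w):
    """Weighted moving average mam(k) and variance mvm(k) of one PLPCC
    coefficient. history[i] holds PLPCm(k - i), so history[0] is the
    current frame; w[i] is the asymmetric analysis window (N values,
    assumed here to sum to 1)."""
    n = len(w)
    ma = sum(history[i] * w[i] for i in range(n))              # mam(k)
    mv = sum((history[i] - ma) ** 2 * w[i] for i in range(n))  # mvm(k)
    return ma, mv
```

A stationary coefficient track yields zero variance, while an alternating track (typical of speech) yields a large variance.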

The pitch contour characteristic pc(k) is computed from the difference between consecutive pitch lag estimates:

pc(k) = 0    if |p(k)-p(k-1)| < 1
        0.5  if 1 ≤ |p(k)-p(k-1)| < 2
        1    if 2 ≤ |p(k)-p(k-1)| < 20
        0.5  if 20 ≤ |p(k)-p(k-1)| < 25
        0    otherwise

where p(k) is the pitch lag computed for the frame with index k on the LP residual signal at a sampling rate of 16 kHz. The speech merit sm(k) is calculated from the pitch contour in such a way that it is expected to show a smoothly fluctuating pitch lag during voiced segments and a strong spectral tilt toward high frequencies during unvoiced segments:

sm(k) = nc(k) · pc(k)                if v(k) ≥ 0.5
        (1 - nc(k)) · (1 - tilt(k))  otherwise

where nc(k), tilt(k) and v(k) are defined above (see the short-term classifier). The speech merit feature is then weighted with the window w defined above and integrated over the last N frames:

ams(k) = Σ_{i=0}^{N-1} sm(k-i) · w(i)
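The pitch contour characteristic pc(k) and the speech merit sm(k) can be sketched as follows (illustrative Python; the thresholds are taken from the piecewise definitions above):

```python
def pitch_contour(p_k, p_k1):
    """pc(k): scores the pitch-lag difference |p(k) - p(k-1)|.
    Moderate lag changes (typical of speech vibrato) score highest;
    near-constant or wildly jumping lags score 0."""
    d = abs(p_k - p_k1)
    if d < 1:
        return 0.0
    if d < 2:
        return 0.5
    if d < 20:
        return 1.0
    if d < 25:
        return 0.5
    return 0.0

def speech_merit(nc, tilt, v, pc):
    """sm(k): nc*pc during voiced frames (v >= 0.5), otherwise
    (1 - nc)*(1 - tilt), rewarding low pitch correlation and a
    low (unvoiced-like) spectral tilt."""
    return nc * pc if v >= 0.5 else (1 - nc) * (1 - tilt)
```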

The pitch contour is also an important indication of whether a signal is better suited to a speech coder or to a music coder. Indeed, speech coders operate mainly in the time domain and assume that the signal is harmonic and stationary over short segments of approximately 5 milliseconds. Under these assumptions they can efficiently model the natural pitch fluctuation of speech. On the contrary, the same fluctuation harms the efficiency of conventional music coders, which apply linear transforms over long analysis windows; the energy of the signal is then spread over several transform coefficients.

Like the short-term features, the long-term features are evaluated using a statistical classifier, thus forming the long-term classification result (DDC). The two features are computed using N=25 frames, corresponding to an analysis of 400 milliseconds of past signal. Before applying a 3-GMM, linear discriminant analysis (LDA) is used to reduce the features to a one-dimensional space. Table 2 shows the classification accuracy determined on the training and test sets over segments of four consecutive frames.

          Training set   Test set
Speech    97.99          97.84
Music     95.93          95.44
Average   96.96          96.64

Table 2: Classification accuracy (%) using long-term features

The combined classifier system according to embodiments of the invention merges the short-term and long-term features in such a way that each makes its own specific contribution to the final decision. To this end, the final decision stage with hysteresis described above can be used, where the memory effect is driven by the DDC, or long-term decision cue (LTDC), while the instantaneous input comes from the IDC, or short-term decision cue (STDC). These two cues are formed at the outputs of the long-term and short-term classifiers, as illustrated in figure 1. The decision is based on the IDC, but is stabilized by the DDC, which dynamically controls the thresholds triggering a change of state.

The long-term classifier 154 uses the long-term and short-term features defined above with LDA followed by a 3-GMM. The DDC equals the logarithmic ratio of the long-term classifier likelihoods of the speech class and the music class, computed over the last 4·K frames. The number of frames taken into account can be varied by means of the parameter K in order to add more or less memory effect to the final decision. The short-term classifier, on the other hand, uses only the short-term features with a 5-GMM, which shows a good compromise between efficiency and complexity. The IDC equals the logarithmic ratio of the short-term classifier likelihoods of the speech class and the music class, computed over the last 4 frames only.

To evaluate this approach specifically for switched coding, three different performance measures were estimated. The first performance measure was obtained on conventional speech vs. music discrimination (SvM); the score was computed on a large set of speech and music items. The second performance measure was obtained on a large set of items in which speech and music segments alternate every 3 seconds. The discrimination accuracy is then called speech after/before music (SabM) and mainly reflects the reactivity of the system. Finally, the stability of the discrimination was evaluated by performing the classification on a large set of speech-over-music items, with speech and music mixed at different levels. The speech-over-music (SoM) score is obtained by computing the ratio of the number of switches to the total number of frames.

The long-term and short-term classifiers alone serve as references to assess conventional approaches using a single classifier. The short-term classifier shows good reactivity, but lower stability and discrimination ability. On the other hand, the long-term classifier, by greatly increasing the number of frames 4·K, can achieve better stability and a better ability to distinguish music from speech, at the cost of the reactivity of the decision. Compared with these conventional approaches, the proposed combined classifier system according to the invention has several advantages. One advantage is that it maintains good discrimination between pure speech and music while preserving the reactivity of the system. Another advantage is a good trade-off between reactivity and stability.

Figures 4 and 5 illustrate examples of encoding and decoding schemes that include a discriminator or decision block operating in accordance with embodiments of the invention.

In the encoding scheme shown in figure 4, a mono signal, a stereo signal or a multichannel signal is input to a common preprocessing block 200.

The common preprocessing block 200 may implement joint stereo functionality, surround (multichannel) functionality, and/or bandwidth extension functionality. The output of block 200 is a mono channel, a stereo channel or several channels, which are input to one or more switches 202. A switch 202 can be provided for each output of block 200 when block 200 has two or more outputs, i.e. when block 200 outputs a stereo signal or a multichannel signal. For example, the first channel of a stereo signal could be a speech channel and the second channel could be a music channel; in this case the decision of decision block 204 may differ between the two channels at the same time instant.

Switch 202 is controlled by decision block 204. The decision block includes a discriminator in accordance with embodiments of the invention and receives, as input, the signal entering block 200 or the signal output by block 200. Alternatively, decision block 204 may also receive side information that is included in the mono, stereo or multichannel signal, or is at least associated with such a signal, where this information was, for example, generated when the original mono, stereo or multichannel signal was produced.

In one embodiment, the decision block does not control the preprocessing block 200, and the arrow connecting blocks 204 and 200 is absent. In a further implementation, the processing in block 200 is controlled to some extent by decision block 204 in order to set one or more parameters in block 200 based on the decision. This, however, does not affect the general algorithm of block 200, whose basic functionality remains active regardless of the decision produced by block 204.

The decision generated by block 204 causes switch 202 to feed the output of the common preprocessing block either to the frequency-domain encoding branch 206, shown in the upper part of figure 4, or to the LPC-domain encoding branch 208, shown in the lower part of figure 4.

In one embodiment, switch 202 switches between the two encoding branches 206, 208. In further embodiments there may be additional encoding branches, such as a third encoding branch, a fourth encoding branch or even more. In an embodiment with three encoding branches, the third branch can be similar to the second branch, but includes an excitation encoder different from the excitation encoder 210 of the second branch 208. In this embodiment, the second branch comprises the LPC block 212 and a codebook-based excitation encoder such as ACELP, and the third branch comprises an LPC block and an excitation encoder operating on a spectral representation of the LPC block output signal.

The frequency-domain encoding branch comprises a spectral conversion block 214, which converts the output signal of the common preprocessing block into the spectral domain. The spectral conversion block may include an MDCT algorithm (modified discrete cosine transform), a QMF, an FFT algorithm (fast Fourier transform), wavelet analysis, or a filter bank such as a critically sampled filter bank (a filter bank that balances frequency and time resolution) having a certain number of filter bank channels, where the subband signals may be real-valued or complex-valued. The output of spectral conversion block 214 is encoded using a spectral audio encoder 216, which may include processing blocks known from the AAC coding scheme.

The LPC-domain encoding branch 208 comprises a source model analyzer, such as LPC block 212, which outputs two kinds of signals. One is the LPC information signal used to control the filter characteristic of the LPC synthesis filter; this LPC information is transmitted to the decoder. The other output signal of LPC block 212 is the excitation signal, or LPC-domain signal, which is input to excitation encoder 210. The excitation encoder 210 may be any encoder, such as a CELP encoder, an ACELP encoder, or any other encoder processing an LPC-domain signal.

Another implementation of the excitation encoder is transform coding of the excitation signal. In this embodiment, the excitation signal is not encoded with ACELP; instead, a spectral representation of the excitation signal is obtained, and the coefficients of this representation, such as subband signals in the case of a filter bank or frequency coefficients in the case of a transform such as an FFT, are encoded to obtain data compression. An implementation of this kind of excitation encoder is the TCX coding mode known from AMR-WB+.

The decision signal generated by block 204 should be formed such that decision block 204 performs music/speech discrimination and controls switch 202 in such a way that music signals are routed to the upper branch 206 and speech signals to the lower branch 208. In one embodiment, block 204 inserts its decision information into the output bitstream, so that the decoder can use this information to perform the correct decoding operation.

Such a decoder is illustrated in figure 5. After transmission, the signal formed by spectral audio encoder 216 enters spectral audio decoder 218. The output of spectral audio decoder 218 is fed to a time-domain converter 220. The output of excitation encoder 210 of figure 4 enters excitation decoder 222, which outputs an LPC-domain signal. The LPC-domain signal enters LPC synthesis block 224, which receives, as a further input, the LPC information generated by the corresponding LPC analysis block 212. The output of time-domain converter 220 and/or the output of LPC synthesis block 224 are connected to switch 226. Switch 226 is controlled by a switch control signal, which was, for example, generated by decision block 204, or provided externally, e.g. by the creator of the original mono, stereo or multichannel signal.

The output signal of switch 226 is a complete mono signal, which is subsequently fed to a common postprocessing block 228, which may perform joint stereo processing, bandwidth extension processing, etc. Alternatively, the output of the switch may also be a stereo signal or a multichannel signal: it is a stereo signal when the preprocessing included a downmix to two channels, and it may even be a multichannel signal when a downmix to three channels was performed, or when no downmix was performed at all and only spectral band replication was applied.

Depending on the specific functionality of the common postprocessing block, a mono, stereo or multichannel signal is output which has a larger bandwidth than the signal input to block 228, when the common postprocessing block 228 performs a bandwidth extension operation.

In one embodiment, switch 226 switches between the two decoding branches 218, 220 and 222, 224. In further embodiments there may be additional decoding branches, such as a third decoding branch, a fourth decoding branch or even more. In an embodiment with three decoding branches, the third branch can be similar to the second branch, but includes an excitation decoder different from the excitation decoder 222 of the second branch 222, 224. In this embodiment, the second branch comprises the LPC synthesis block 224 and a codebook-based excitation decoder such as in ACELP, and the third branch comprises an LPC block and an excitation decoder operating on a spectral representation of the signal fed to LPC block 224.

In another embodiment, the common preprocessing block includes a surround/joint stereo block, which outputs joint stereo parameters and a mono output signal formed by downmixing the input signal having two or more channels. In general, the signal output by this block may also have more channels, but due to the downmix operation the number of channels output by the block will be smaller than the number of input channels. In this embodiment, the frequency-domain encoding branch comprises a spectral conversion stage and a subsequently connected quantization/encoding stage. The quantization/encoding stage may include any of the functionalities known from modern frequency-domain encoders such as the AAC encoder. Furthermore, the quantization operation in the quantization/encoding stage may be controlled via a psychoacoustic module that generates psychoacoustic information, such as a psychoacoustic masking threshold over frequency, which is input to this stage. Preferably, the spectral conversion is performed using an MDCT operation, even more preferably a time-warped MDCT operation, where the warping strength can be controlled between zero and a high warping strength; at zero warping strength the MDCT operation is the straight MDCT operation. The LPC-domain encoder may include an ACELP core computing a pitch gain, a pitch lag and/or codebook information such as a codebook index and a code gain.

Although some of the drawings illustrate block diagrams of an apparatus, it should be noted that these drawings simultaneously illustrate a method, in which the functionality of each block corresponds to a step of the method.

The embodiments described above relate to an input audio signal consisting of different segments or frames associated with speech information or music information. The invention is not limited to such embodiments; rather, the approach for classifying different segments of a signal comprising segments of at least a first type and a second type can also be applied to audio signals comprising three or more different types of segments, each of which is to be encoded with a different coding scheme. Examples of such segment types are:

- Stationary/non-stationary: such segments can be processed using different filter banks, windows, or coding devices. For example, a transient should be encoded using a filter bank with good time resolution, while a pure sinusoid should be encoded using a filter bank with good frequency resolution.

- Voiced/unvoiced: voiced segments are well handled by a speech coder such as CELP, but for unvoiced segments too many bits are wasted; parametric coding would be more efficient.

- Silence/activity: silence can be encoded with fewer bits than active segments.

- Harmonic/non-harmonic: for harmonic segments it is beneficial to use coding based on linear prediction in the frequency domain.

In addition, the invention is not limited to the audio domain; the above classification approach can be applied to other kinds of signals, such as video or data signals, where those signals include segments of different types requiring different processing.

The invention could be adapted to any application that requires temporal signal segmentation. For example, face detection in video surveillance can be based on a classifier that specifies, for each pixel of a frame (a frame here corresponds to a snapshot taken at time n), whether it belongs to a human face or not. The classification (i.e. face segmentation) must be performed for each frame of the video stream. However, using the present invention, the segmentation of the current frame may take past consecutive frames into consideration in order to obtain better accuracy, exploiting the fact that successive pictures are strongly correlated. Two classifiers can then be applied: one analyzing the current frame, and another analyzing a set of frames including the present and past frames. The latter classifier can combine the set of frames and determine regions likely to contain a face. The decision of the classifier operating only on the current frame is then compared with the likely regions, and the decision may be confirmed or modified.

Embodiments of the invention use the switch to switch between the branches, so that only one branch receives the signal to be processed and the other branch receives no signal. In an alternative embodiment, however, the switch may also be arranged after the processing blocks or branches, for example after the audio encoder and the speech encoder, so that both branches process the same signal in parallel; the signal produced by one of these branches is then selected for the output bitstream.

Depending on specific implementation requirements, the proposed methods can be implemented in hardware or in software. The implementation can be carried out using a digital storage medium, in particular a DVD or a CD, having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the proposed methods are performed. In general, the invention is therefore a computer program product with program code stored on a machine-readable carrier, the program code performing the proposed methods when the computer program product runs on a computer. In other words, the proposed methods are therefore a computer program having program code for performing at least one of the proposed methods when the computer program runs on a computer.

The embodiments described above are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and details described herein will be apparent to those skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended claims and not by the specific details presented by way of description of the embodiments herein.

In the solutions described above, a signal comprising a plurality of frames is described, and a current frame is evaluated for the switching decision. It is noted that the current segment of the signal that is evaluated for the switching decision may be a single frame; however, the invention is not limited to such solutions. A segment of the signal may also comprise a plurality, that is, two or more, of frames.

Further, in the solutions described, the short-term classifier and the long-term classifier used the same short-term features. This approach can be used for various reasons, such as the need to calculate the short-term features only once and to use them in both classifiers, which reduces the complexity of the system because, for example, the short-term features can be calculated in one of the short-term or long-term classifiers and passed to the other classifier. In addition, the comparison of the short-term and long-term classification results can be more meaningful, since the contribution of the current frame to the long-term classification result is easier to judge against the short-term result when the two classifiers share common features.
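The feature-sharing idea can be sketched like this. All names are illustrative; `extract_short_term` is a toy stand-in (frame energy) for a real short-term analysis such as PLPCC, and the two classifiers are passed in as callables.

```python
def extract_short_term(frame):
    """Toy short-term feature (mean frame energy); a real system
    would compute e.g. PLPCC coefficients here."""
    return sum(x * x for x in frame) / len(frame)

def discriminate(frames, short_classifier, long_classifier):
    """Compute the short-term features once and feed them to both
    classifiers, instead of recomputing them in each."""
    # Short-term features for every frame in the analysis window;
    # the last entry belongs to the current frame.
    st_feats = [extract_short_term(f) for f in frames]
    # The short-term classifier looks only at the current frame.
    short_result = short_classifier(st_feats[-1])
    # The long-term classifier reuses the same short-term features and
    # derives its long-term view from their evolution over past frames.
    long_result = long_classifier(st_feats)
    return short_result, long_result
```

Because both results are derived from the same feature stream, the short-term features are computed only once per frame, and the two results remain directly comparable.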

The invention, however, is not limited to such an approach, and the long-term classifier need not use the same short-term features as the short-term classifier; i.e., the short-term classifier and the long-term classifier may each calculate short-term feature(s) that differ from one another.

While the solutions described above use PLPCCs as the short-term feature, note that other features may also be considered, for example variations of the PLPCCs.
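The overall classification method can be sketched end to end as follows. This is a minimal illustration only: the feature extractors are toy stand-ins (log frame energy instead of PLPCC, feature variance as the long-term view), the thresholds are arbitrary, and the combination rule (take the common decision when both classifiers agree, otherwise keep the previous decision) is just one plausible way of forming the output by comparing the two results.

```python
import math

def short_term_feature(frame):
    """Toy stand-in for a short-term feature such as PLPCC:
    log energy of one frame."""
    return math.log(sum(x * x for x in frame) + 1e-12)

def long_term_feature(frames):
    """Toy long-term feature: variance of the short-term feature over
    the current frame and past frames (speech tends to fluctuate more
    than steady music)."""
    feats = [short_term_feature(f) for f in frames]
    mean = sum(feats) / len(feats)
    return sum((v - mean) ** 2 for v in feats) / len(feats)

def classify_segment(history, prev_decision="music",
                     st_threshold=0.0, lt_threshold=1.0):
    """Combine a short-term and a long-term classification result.

    history       -- list of frames; the last element is the current segment
    prev_decision -- decision taken for the previous segment
    """
    short_result = ("speech" if short_term_feature(history[-1]) > st_threshold
                    else "music")
    long_result = ("speech" if long_term_feature(history) > lt_threshold
                   else "music")
    # Example combination rule: agreement wins, disagreement keeps
    # the previous decision (avoids rapid switching).
    return short_result if short_result == long_result else prev_decision
```

A steady low-energy signal is classified as music, a signal alternating between loud and quiet frames as speech, and when the two classifiers disagree the previous decision is retained.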

1. A method of classifying different segments of an audio signal that comprises speech and music segments, including: short-term classification (150) of the audio signal on the basis of at least one short-term feature extracted from the audio signal, to determine whether the current segment of the audio signal is a speech segment or a music segment, and producing a short-term classification result (152) indicating whether the segment of the audio signal is a speech segment or a music segment; long-term classification (154) of the audio signal on the basis of at least one short-term feature and at least one long-term feature extracted from the audio signal, to determine whether the current segment of the audio signal is a speech segment or a music segment, and producing a long-term classification result (156) indicating whether the segment of the audio signal is a speech segment or a music segment; and combining (158) the short-term classification result (152) and the long-term classification result (156) to generate an output signal (160) indicating whether the current segment of the audio signal is a speech segment or a music segment.

2. The method according to claim 1, where the combining step includes forming the output signal on the basis of a comparison of the short-term classification result (152) and the long-term classification result (156).

3. The method according to claim 1, where the at least one short-term feature is obtained by analyzing the current segment of the audio signal being classified, and the at least one long-term feature is obtained by analyzing the current segment of the audio signal and one or more previous segments of the audio signal.

4. The method according to claim 1, where the at least one short-term feature is obtained by analyzing, with a first analysis method, an analysis window (168) of a first length, and the at least one long-term feature is obtained by analyzing, with a second analysis method, an analysis window (162) of a second length, the first length being shorter than the second length, and the first and second analysis methods being different.

5. The method according to claim 4, where the first length covers the current segment of the audio signal, the second length covers the current segment of the audio signal and one or more previous segments of the audio signal, and the first and second lengths include an additional analysis period (164).

9. The method according to claim 1, where the short-term feature used for the short-term classification and the short-term feature used for the long-term classification are the same or different.

10. A method of processing an audio signal that includes speech and music segments, including: classifying (116) the current segment of the audio signal in accordance with the method of any one of claims 1-9; depending on the output signal (160) formed at the classification step (116), processing (102, 206; 106, 208) the current segment in accordance with a first process or a second process; and outputting the processed segment.

11. The method according to claim 10, where the segment is processed by a speech encoder (102) when the output signal (160) indicates that the segment is a speech segment, and the segment is processed by a music encoder (106) when the output signal (160) indicates that the segment is a music segment.

12. The method according to claim 11, additionally including combining (108) the encoded segment and information from the output signal (160) indicating the segment type.

13. A machine-readable storage medium having stored thereon program code for performing the method according to claim 1 when the code runs on a computer or processor.

14. A discriminator, including: a short-term classifier (150) configured to receive the audio signal and determine whether the current segment of the audio signal is a speech segment or a music segment, and to produce a short-term classification result (152) of the audio signal on the basis of at least one short-term feature extracted from the audio signal, the short-term classification result (152) indicating whether the current segment of the audio signal is a speech segment or a music segment of the audio signal comprising speech and music segments; a long-term classifier (154) intended to receive the audio signal and determine whether the current segment of the audio signal is a speech segment or a music segment, and to produce a long-term classification result (156) of the audio signal on the basis of at least one short-term feature and at least one long-term feature extracted from the audio signal, the long-term classification result (156) indicating whether the current segment of the audio signal is a speech segment or a music segment; and a decision circuit (158) designed to combine the short-term classification result (152) and the long-term classification result (156) to provide an output signal (160) indicating whether the current segment of the audio signal is a speech segment or a music segment.

15. The discriminator according to claim 14, where the decision circuit (158) is intended to form the output signal on the basis of a comparison of the short-term classification result (152) and the long-term classification result (156).

16. An audio signal processing device, including: an input (110) for receiving the audio signal to be processed, where the audio signal includes speech and music segments; a first processing channel (102; 206) for processing speech segments; a second processing channel (104; 208) for processing music segments; a discriminator (116; 204) according to claim 14 or 15, connected to the input; and a device (112; 202) connecting the input to the first or second processing channel, intended to feed the audio signal from the input (110) to one of the processing channels depending on the output signal (160) of the discriminator (116).

17. An audio encoder, including the audio signal processing device according to claim 16, where the first processing channel includes a speech encoder (102) and the second processing channel includes a music encoder (106).

 
