Method for realizing machine estimation of quality of sound signals

FIELD: analysis of sound signal quality, possible use for estimating quality of speech transferred through radio communication channels.

SUBSTANCE: in accordance with the method for machine estimation of sound signal quality, the signal is divided into critical bands and spectral energy values are computed for the critical bands, spectral similarity values of the active-phase fragments are determined, and the quality of the tested sound signal is determined by means of a weighted linear combination of the aforementioned quality values for each phase. The difference of the method is that the selected fragments of the active and inactive phases of both signals are synchronized, inactive-phase spectra are determined for each fragment, the resulting spectra of the active and inactive phases of the fragments are divided into additional sets of bands, for each of which spectral energy values are computed, the resulting spectral energies of the active and inactive fragment phases are compared in pairs to determine spectral similarity coefficients, and the resulting similarity coefficient for each phase is determined as the average of the similarity coefficients over all sets of bands, which is the quality estimate of each phase.

EFFECT: ensured universality and possibility of optimizing the estimation process depending on the purpose of estimation.

5 cl, 13 dwg, 6 tbl

 

The invention relates to the analysis of the quality of audio signals and can be used to assess the quality of speech transmitted via radio communication, telephony and the paths of intercom devices, as well as to assess the quality of sound from various audio equipment, including signals that have undergone compression/restoration processing by various vocoders, and to evaluate the acoustic quality of rooms.

Quality assessment of audio signals is becoming increasingly important with the growing spread and use of mobile communication systems, telephony, and various portable recording and reproducing devices. Hence the desire to create a method that ensures objectivity of the assessment (i.e., independence from the judgment of a specific person) and the possibility of its automatic implementation: an objective evaluation is necessary both to compare samples of competitors' products and to optimize the parameters of one's own.

One of the main quality indicators of systems for compression, transmission and playback of audio information is the quality of the restored, received or reproduced sound.

A quantitative measure of sound quality has its own specific features, associated with the fact that, in the end, the receiver of a sound signal is always a human being, who is also the source of most audio signals. Accordingly, the quality of audio signals is determined not only by the technical characteristics of the systems for processing and transmitting sound, but also by the properties of human speech and hearing, which change over time and from person to person.

Subjective and objective methods of measuring speech quality are distinguished. Subjective methods are those in which human hearing is an integral part of the measuring system. Accordingly, objective methods exclude the human ear from the measurement process.

The most common subjective method of evaluating quality (not necessarily of speech, although usually of speech) is the MOS estimate (mean opinion score), an evaluation on a five-point scale.

The score on the MOS scale is determined by processing the estimates given by a group of auditors for multiple audio signals spoken by various speakers. Each auditor gives an assessment of each signal. The results are then averaged.

The process of organizing and conducting a subjective examination is a rather complicated, lengthy and expensive procedure, so for many years work has been under way to find objective methods of intelligibility assessment that enable quick and automated estimates agreeing well with subjective examinations.

There are various assessment methods; some of them are listed below:

AI (Articulation Index). The idea is that the entire frequency range of the speech signal is divided into 20 bands, within which the signal-to-noise ratio is determined. The widths of the bands are chosen so that the contribution of each band to speech perception is the same. The signal-to-noise ratio is calculated in each band. The articulation index equals the weighted sum of the values over the bands.

The drawback of the articulation index is that it is focused on the speech signal and does not take into account the properties of hearing and speech production.

SII (Speech Intelligibility Index) - a development of the AI. The speech intelligibility index is included in the American standard ANSI S3.5-1997 and offers four measuring procedures on different groups of bands: critical bands (21 bands), third-octave bands (18 bands), equally contributing critical bands (17 bands) and octave bands (6 bands). The signal-to-noise ratio is calculated in each band, and the total SII coefficient, lying in the range from 0 to 1, is computed.

The speech intelligibility index takes into account only the properties of hearing and does not take into account the properties of speech production.

STI (Speech Transmission Index). A speech signal can approximately be considered as a broadband signal modulated by a low-frequency (envelope) signal. The modulation frequency is determined by the speed of articulation. A decrease in modulation depth makes the speech signal more noise-like and reduces its intelligibility. Accordingly, the decrease in intelligibility can be estimated from the decrease of the modulation depth.

The entire speech band is divided into seven octave bands, and an octave noise signal is fed to the input of the system under test. The intensity distribution of the test signal coincides with the distribution of intensities of the speech signal. The frequency of the modulating signal is varied from 0.5 to 12.5 Hz in third-octave intervals (14 frequencies in total).

The STI measurement method is specified in the international standard IEC 268-16.

RASTI/STIPA (Rapid Speech Transmission Index) - a rapid speech transmission index. The STI method requires a large number of measurements and calculations. A simplified method has been developed, involving measurement in only two bands at five modulation frequencies, with a reduced number of measurements and calculations. For good intelligibility, RASTI values must not be lower than 0.6.

The speech transmission index, as well as its rapid version, mimics the process of speech production using a noise model; however, the account taken of the properties of speech production and hearing is far from complete.

C50 (clarity ratio) determines the sharpness or clarity of sound and is calculated as the ratio of early to late echo. The method is based on the fact that echo decreases the intelligibility of the signal. The ratio of early and late echo is measured in multiple frequency bands. Early echo (up to 33 ms) is considered useful signal, and late echo (more than 33 ms) is considered interfering.

The clarity factor takes into account only one type of possible distortion, so it can serve only as one partial estimate of speech quality.

There is a method of assessing the intelligibility of speech produced by the paths of intercom devices and individual respiratory protection equipment, which uses a converter of the voice message into an electrical signal and a set of equipment for reception and processing to obtain the amplitude-frequency characteristic of the voice message, determines the formants of equal intelligibility and their perception, and calculates the probability of formant reception, from which speech intelligibility is assessed; it is characterized in that the converter of the voice message into an electrical signal is connected to the input of a PC audio adapter (digitization card), the information is converted from analog to digital form, the digital information is processed, and the output characteristics required to assess intelligibility are determined (application No. 2002133196).

The disadvantage of this method is that it does not fully take into account the properties of speech production. Formants are present only in vowels and voiced consonants. In addition, this method is applicable only to the assessment of speech intelligibility as a measure of the quality of a speech signal; it is not suitable for audio signals in general.

The closest technical solution to the claimed method is a method of machine evaluation of the transmission quality of audio signals, particularly voice signals, in which the spectra of the transmitted source signal and the received signal are determined in one frequency range and the spectral similarity value, which corresponds to the transmission quality, is determined as the covariance of the spectra of the source signal and the received signal divided by the product of the standard deviations of both spectra (RF Patent No. 2232434).

In addition, the spectral similarity value is weighted by a coefficient that depends on the ratio of the spectral energies of the received signal and the source signal, which provides control of signal interference: the higher the energy of the received signal, the more strongly the similarity value is reduced.

During pre-processing, active and inactive phases are distinguished in the source signal and the received signal: fragments of the signal whose energy exceeds a predefined threshold are assigned to the active phases, and the remaining fragments qualify as pauses. Pauses and the noise during pauses are taken into account to a lesser extent than the active phases of the signals.

On this basis, the spectral similarity value is measured only for the fragments of the received and source signals belonging to the active phase, while for the inactive phase a quality function is used that depends on the maximum and average power during the pause and decreases degressively.

Before transformation into the frequency domain, the active-phase signals are subjected to temporal masking: the data are divided into time blocks such that successive blocks overlap by a significant part, up to 50%, and before the temporal masking the spectral components are compressed by exponentiation with an exponent less than 1.

The spectra of the source and the received signal are divided into critical bands (Zwicker's model), and the similarity coefficients are calculated for them. Before the similarity value is determined, the spectra are subjected to convolution with an asymmetric frequency blur function, and before the convolution the spectral components are expanded by exponentiation with an exponent greater than 1.

The transmission quality is calculated as a weighted linear combination of the similarity value of the active phase and the quality value of the inactive phase.

The main disadvantages of the prototype include:

- the processing affects mainly the active phase of the source and received (test) signals, which reduces the objectivity of the assessment;

- the method does not take into account the properties of speech production, since the Zwicker critical bands used by the authors of that invention reflect only the properties of hearing;

- the method takes into account the perception of the inactive phase only at the loudness level, which reduces the accuracy of the estimates.

The task of the invention is to develop a method for objective assessment of sound quality that can be used in the applications of the invention indicated above.

The technical result is achieved by the fact that the following changes are made to the known method of machine quality assessment of audio signals, in which fragments of the active and inactive phases are detected in the source signal and the test signal, the spectra of the active phase are determined, the spectral energy values in the critical bands and the similarity value are calculated, and the quality of the test sound signal is determined by a weighted linear combination of the obtained values for each phase, namely:

- the selected fragments of the active and inactive phases are synchronized in time;

- the spectra of the fragments of the inactive phase are additionally determined;

- the obtained spectra of fragments of both phases are divided into additional sets of bands, for which the spectral energy values are calculated;

- the fragments are compared in pairs;

- the resulting similarity coefficient for each phase is determined as the average of the similarity coefficients over all sets of bands and all fragments.

Then, taking into account the obtained results, the quality of the test sound signal is assessed.

In addition:

- either an arbitrary audio signal or a specialized set of signals can be used as the source signal;

- the spectra of the fragments of the active and inactive phases are determined using a discrete cosine transform;

- logarithmic, resonator and various well-known critical bands may be used as additional sets of bands;

- the number and composition of the sets of bands may be varied in various combinations to determine the similarity coefficient of each phase.

The essence of the invention is illustrated by Figures 1-3, Figures 4-8 explain an example implementation, and Figures 9-13 show possible uses:

Figure 1 - high-level algorithm for evaluating sound quality;

Figure 2 - algorithm for comparing signal fragments by bands;

Figure 3 - general algorithm for synchronizing the source and test signals;

Figure 4 - algorithm for filtering VAD outliers;

Figure 5 - operation algorithm of the synchronization block (start);

Figure 6 - operation algorithm of the synchronization block (continued);

Figure 7 - operation algorithm of the synchronization block (continued);

Figure 8 - operation algorithm of the synchronization block (end);

Figure 9 - example of assessing the quality of sound transmitted through the telephone network;

Figure 10 - example of assessing the quality of sound transmission over IP networks;

Figure 11 - example of assessing the quality of sound transmission in cellular and satellite communication networks;

Figure 12 - example of quality assessment by a group of developers of sound processing systems;

Figure 13 - example of assessing the acoustic quality of a room.

The need to develop new methods and improve existing ones is caused by the desire to bring objective and subjective quality assessments closer together and by the need to take into account the properties of hearing and speech production.

The use of an arbitrary or a specialized signal as the source signal depends on the purpose of the assessment (determination of speech intelligibility, of sound quality, assessment of the quality of speech transmitted through the paths of intercom devices, etc.) and makes it possible to increase its objectivity.

Almost any audio signal can be divided into active and inactive phases. The first corresponds to active sound processes, the second to low-level background noise. The simplest way to split the signal is by its energy level; however, this approach does not have high accuracy. In the proposed method, the known VAD algorithm specified in Recommendation G.723 (as part of that vocoder) is used to separate the signal into active and inactive phases.

The source and the test sound signals are analyzed and separated into active and inactive phases (Figure 1). The fragments of the active and inactive phases are then synchronized (identical fragments are aligned in time) and analyzed by different blocks of the same algorithm. The synchronization algorithm is described below.

Separate comparison of pairs of fragments of the active and inactive phases makes it possible to improve the accuracy of the obtained estimates.

For each fragment, the integral spectrum is determined using the discrete cosine transform (DCT), which, for achieving the technical result, has certain advantages over the fast Fourier transform (FFT).

The integral spectrum is calculated by formula (1):

where j = 0...N/2-1 is the index of a spectral energy value;

i is the number of the integration step;

N is the number of signal samples used in computing the spectrum;

- the obtained average value of the spectrum;

- the average value of the spectrum at the previous step;

Spi,j is the value of the spectrum obtained using the DCT.

When calculating the integral spectrum, the windows overlap by N/2 samples, and each window is weighted with a known window function: Hamming or Blackman-Harris.
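The formula itself appears in the original only as an image and is not reproduced in this text. A minimal hedged reconstruction, assuming that the integral spectrum is simply a running (cumulative) average of the windowed DCT spectra, and introducing the illustrative notation S-bar for the averaged spectrum, could read:

```latex
\bar{S}_{i,j} \;=\; \frac{(i-1)\,\bar{S}_{i-1,j} + Sp_{i,j}}{i},
\qquad j = 0 \ldots N/2-1, \quad \bar{S}_{1,j} = Sp_{1,j}
```

This is only a sketch consistent with the variable list above, not the exact formula of the patent.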

For all selected sets of bands, the spectral energy levels in the bands are determined. Several groups of critical bands are known, defined by different authors on the basis of different models of sound perception and speech production.

The human hearing apparatus is a nonlinear system, which gives rise to a phenomenon called masking. Masking occurs when a message is listened to against background interference or masking sounds.

Studying the masking of harmonic signals by narrowband noise, Zwicker determined that the entire range of audible frequencies can be divided into frequency bands distinguished by the human ear. A similar conclusion was reached by Fletcher, who identified these frequency bands as the critical bands of hearing.

The critical bands defined by Fletcher and Zwicker differ, since the former defined the bands using noise masking and the latter from relations of perceived loudness.

Sapozhkov defined a critical band as a frequency band of the speech range that is perceived as a single whole. In his early research he even spoke of the possibility of replacing the sound signal in a band with an equivalent tone signal; however, this assumption did not withstand experimental verification. The critical bands defined by Sapozhkov differ from the bands defined by Fletcher and Zwicker, because Sapozhkov proceeded from the properties of the speech signal.

Pokrovsky also defined critical bands based on the properties of the speech signal. The bands defined by Pokrovsky ensure an equal probability of a formant falling into each of them.

The spectral energy values of the bands can be used for various purposes, one of which is assessing the quality of an audio signal. However, using the critical bands of only one author (in the prototype, for example, the Zwicker critical bands are used) does not make it possible to obtain a sufficiently objective assessment, since they reflect only one aspect of either perception or speech production. In the present invention the spectral energy can be determined over different critical bands, as well as over logarithmic and resonator bands, which makes it possible to take into account more features of hearing and speech production.

Taking into account the fact that the bands defined by Pokrovsky and Sapozhkov are better suited for speech signals, but not for audio signals in general, makes it possible to increase the precision of the estimate depending on its purpose. Table 1 shows the critical bands according to various authors.

The following notation is used:

Fc is the central frequency of the band;

L is the band width.

Table 1
Critical bands defined by different authors

No. | Zwicker (Fc / L) | Pokrovsky (Fc / L) | Fletcher (Fc / L) | Sapozhkov (Fc / L)
1 | 51 / 80 | 260 / 320 | 200 / 53 | 200 / 60
2 | 150 / 100 | 495 / 150 | 300 / 50 | 300 / 60
3 | 250 / 100 | 640 / 140 | 400 / 50 | 500 / 60
4 | 350 / 100 | 787 / 155 | 500 / 50 | 800 / 70
5 | 450 / 110 | 947 / 165 | 600 / 53 | 1000 / 80
6 | 570 / 120 | 1125 / 190 | 700 / 54 | 1500 / 100
7 | 700 / 140 | 1315 / 190 | 800 / 58 | 2000 / 130
8 | 840 / 150 | 1505 / 190 | 900 / 60 | 3000 / 200
9 | 1000 / 160 | 1690 / 180 | 1000 / 63 | 5000 / 300
10 | 1170 / 190 | 1870 / 180 | 1250 / 71 | 8000 / 600
11 | 1370 / 210 | 2050 / 180 | 1500 / 80 |
12 | 1600 / 240 | 2230 / 180 | 1750 / 87 |
13 | 1850 / 280 | 2435 / 230 | 2000 / 98 |
14 | 2150 / 320 | 2725 / 350 | 2500 / 120 |
15 | 2500 / 380 | 3100 / 400 | 3000 / 141 |
16 | 2900 / 450 | 3480 / 360 | 4000 / 200 |
17 | 3400 / 550 | 3855 / 390 | 5000 / 276 |
18 | 4000 / 700 | 4530 / 960 | 6000 / 370 |
19 | 4800 / 900 | 6130 / 2240 | 7000 / 480 |
20 | 5800 / 1100 | 8625 / 2750 | 8000 / 590 |
21 | 7000 / 1300 | | |
22 | 8500 / 1800 | | |
23 | 10500 / 2500 | | |
24 | 13500 / 3500 | | |

Additionally, it is suggested to use logarithmic bands, or bands of equal loudness. The idea is simple: loudness is proportional to 10 times the logarithm of the energy. To determine the boundaries of the logarithmic bands, recordings of a phonetically representative text (a known text developed at the Department of Phonetics of the State University) read by speakers of different sex and age were used.

The vocal tract is a complex sound system. The acoustics of the vocal tract are nonstationary and nonlinear. As the speech organs move, the shape and volume of the upper resonators change, which results in speech production. The pitch of the voice is determined by the number of vibrations of the vocal cords per second, as well as by the length of the cords, the strength of their tension and the position of the epiglottis. The power of the sound is determined by the force of closure of the vocal cords and the force of exhalation. The timbre changes depending on the position of the larynx and epiglottis.

Because of the anatomical features of the structure of the vocal apparatus and the ability to use the resonators, in some people the result is a strengthening or weakening of the harmonic components of the sound. The main influence on the timbre comes from the upper (oral) cavity and the throat. The nasal cavity and paranasal sinuses also perform a resonator function, amplifying the tone of the voice and giving it its individual character.

Resonator bands characteristic of different speech sounds were identified by V.N. Sorokin (Table 2). Taking resonator bands into account is useful in determining the quality of sound (especially speech) signals. Resonator bands can be used to determine the quality of reproduction of individual sounds.

The indices at the central frequencies and bandwidths are given according to Sorokin: Fx corresponds to Fc, and Lx to L.

Table 2
Resonator bands

No. | Sound | Fp / Lp | F1 / L1 | F2 / L2 | F3 / L3 | F4 / L4 | F5 / L5 | F6 / L6
1 | "A" | 273.5 / 72.4 | 574.6 / 78.1 | 994.1 / 48.3 | 2404.8 / 77.7 | 2711.4 / 102.5 | 3796.5 / 145.6 | 4735.3 / 221.8
2 | "O" | 287.6 / 72.4 | 497.1 / 100.9 | 914.2 / 47.1 | 2316.4 / 67.9 | 2635.1 / 87.6 | 4030.9 / 142.3 | 4728.3 / 189.5
3 | "U" | 296.8 / 72.4 | 408.6 / 149.2 | 858.0 / 41.9 | 2042.8 / 54.2 | 2761.3 / 71.2 | 3612.3 / 92.4 | 4434.3 / 122.7
4 | "I" | 287.7 / 72.4 | 393.5 / 54.9 | 2272.1 / 66.1 | 3094.6 / 77.6 | 4003.6 / 83.7 | 5047.3 / 117.0 | 6103.5 / 133.6
5 | "Y" | 302.6 / 72.4 | 485.7 / 85.5 | 1378.4 / 47.0 | 1847.7 / 46.3 | 2574.5 / 63.3 | 3732.5 / 97.7 | 4421.9 / 124.8
6 | "E" | 279.0 / 72.4 | 490.9 / 73.1 | 1353.0 / 41.4 | 2235.0 / 60.8 | 2775.0 / 78.5 | 3575.7 / 109.4 | 4226.4 / 141.3
7 | "S" | 325.4 / 72.4 | 482.7 / 2.7 | 1619.4 / 45.7 | 2861.0 / 72.7 | 4029.8 / 106.3 | 4406.1 / 115.9 | 5290.6 / 153.9
8 | "Sh" | 335.1 / 72.4 | 473.4 / 97.5 | 1439.9 / 53.7 | 2101.6 / 57.1 | 2528.8 / 62.8 | 3159.8 / 72.9 | 4516.78 / 117.3
9 | "Kh" | 349.9 / 72.4 | 543.8 / 91.9 | 1459.7 / 54.8 | 2035.0 / 53.5 | 2915.1 / 78.5 | 3699.1 / 93.5 | 4540.6 / 120.5
10 | "F" | 274.9 / 72.4 | 338.9 / 83.2 | 1024.6 / 37.4 | 2110.2 / 43.2 | 2694.5 / 53.5 | 3872.9 / 78.0 | 4798.0 / 104.9

In addition, importance coefficients of the bands can be determined, based on the assumption that the smaller the integral energy in a band, the higher the importance of that band for speech perception. Accordingly, for assessing the quality of audio signals in general it is reasonable to consider all bands equally important, while for assessing the quality of speech signals transmitted through the paths of intercom devices the importance coefficients should be taken into account.
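As a hedged illustration of this idea (the authors' exact normalization is not given in the text; the function name and the normalization to a unit sum are assumptions), importance coefficients can be computed from the inverse integral band energies:

```python
import numpy as np

def importance_coefficients(band_energies):
    """Sketch: the smaller the integral energy of a band, the higher its
    importance; the coefficients are normalized to sum to 1."""
    e = np.asarray(band_energies, dtype=float)
    inv = 1.0 / np.maximum(e, 1e-12)      # guard against empty bands
    return inv / inv.sum()

# Example: the band with the least energy receives the largest weight
print(importance_coefficients([10.0, 1.0, 0.5]))
```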

The band boundaries (start and end indexes) are determined by the following formula:

where nSpecLen is the number of points in the spectrum (N/2);

SampleRate is the sampling rate of the signal;

n is the number of bands.

The band energies are defined as:

where the values of the integral spectrum (obtained on the last window of the fragment) are used.
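Since the boundary and energy formulas are given above only through their variable lists, the following Python sketch shows one plausible reading: the band edges Fc plus or minus L/2 (in Hz) are mapped onto the nSpecLen spectral points covering 0...SampleRate/2, and the band energy is the sum of the integral-spectrum values between the resulting indexes. The rounding rule and function names are assumptions, not taken from the patent.

```python
import numpy as np

def band_indexes(fc, width, sample_rate, n_spec_len):
    """Map a band (center fc, width, in Hz) to start/end spectral indexes.
    n_spec_len = N/2 points cover the range 0 ... sample_rate/2."""
    hz_per_point = (sample_rate / 2.0) / n_spec_len
    start = int(round((fc - width / 2.0) / hz_per_point))
    end = int(round((fc + width / 2.0) / hz_per_point))
    return max(start, 0), min(end, n_spec_len)

def band_energy(integral_spectrum, start, end):
    """Band energy: sum of the integral-spectrum values inside the band."""
    return float(np.sum(integral_spectrum[start:end]))

# Example for an 8 kHz signal and a 512-point spectrum (N = 1024)
spectrum = np.abs(np.random.randn(512))
s, e = band_indexes(fc=1000, width=160, sample_rate=8000, n_spec_len=512)
print(s, e, band_energy(spectrum, s, e))
```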

The algorithm for comparing the bands (for one set) is presented in Figure 2. The initial quality estimate is assumed to equal 100%; it is then decreased in proportion to the difference of the band energies. A quality estimate is determined for each set of bands. The quality estimate over all sets of bands is defined as the average of the individual estimates by the formula:

where Nk is the number of band tables used;

k is the number of the current table;

dQk is the estimate obtained for the k-th table of bands;

- the integral estimate over all tables.

The quality estimate for each phase is defined as the average over all pairs of fragments:

where - the obtained integral value of the quality coefficient;

- the integral value of the quality coefficient at the previous step;

- the value of the quality coefficient for the pair of fragments t;

- the value of the quality coefficient for the first pair of fragments;

t is the number of the pair of fragments.

The resulting quality estimate over the whole signal (dQGlobal) is defined as the weighted sum of the quality estimates of the active (Active) and inactive (Pause) phases:
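The averaging formulas above are likewise given only through their variable lists, so the sketch below is a hedged reading of the overall scheme: a per-set estimate starts at 100% and decreases in proportion to the band-energy differences, estimates are averaged over the Nk band tables, then over all fragment pairs, and the phase estimates are finally combined with weights. The proportionality constant and the weights w_active/w_pause are illustrative placeholders, not values from the patent.

```python
import numpy as np

def set_quality(e_source, e_test, scale=1.0):
    """Quality for one set of bands: 100 % minus a penalty proportional
    to the mean relative difference of band energies (sketch)."""
    e_source = np.asarray(e_source, dtype=float)
    e_test = np.asarray(e_test, dtype=float)
    diff = np.abs(e_source - e_test) / np.maximum(e_source, 1e-12)
    return max(0.0, 100.0 - scale * 100.0 * float(np.mean(diff)))

def fragment_quality(qualities_per_set):
    """dQ for one pair of fragments: average over the Nk band tables."""
    return float(np.mean(qualities_per_set))

def phase_quality(fragment_qualities):
    """Quality of a phase: average over all pairs of fragments."""
    return float(np.mean(fragment_qualities))

def global_quality(q_active, q_pause, w_active=0.8, w_pause=0.2):
    """dQGlobal: weighted sum of the active and pause phase estimates."""
    return w_active * q_active + w_pause * q_pause

# Conversion to MOS-like points (the percent score divided by 20,
# as described further in the text)
def to_mos(dq_global_percent):
    return dq_global_percent / 20.0
```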

The general algorithm of signal synchronization is presented in Figure 3. The synchronizer inputs receive signal segments (pDATA) equal in duration to the VAD frame, together with the VAD activity flags for the pDATA segments. There are two inputs: one for the reference (source) signal and one for the test signal.

Before synchronization, outliers in the VAD activity flags are filtered: on short segments (shorter than a threshold) the activity flag is set equal to the activity flags of the surrounding signal.

After filtering, the activity flags and signal frames are fed to the synchronization blocks, which assemble fragments of active signal and pauses. The blocks share common data: the buffer of the active reference signal (EBuffer1), the buffer of the active test signal (TBuffer1), the buffer of reference-signal pauses (EBuffer0), the buffer of test-signal pauses (TBuffer0), the readiness flags of the active and pause buffers (dReady[0...1]), and a synchronization error counter (dErrorCounter).

At the output of the synchronizer, a pair of buffers with active signal or a pair of buffers with pauses is obtained. Either synchronization block can initiate the appearance of a pair of synchronized buffers.

Depending on the activity flag, the synchronized buffers are passed to the comparator of active fragments or of pauses (Figure 1).

At present, the proposed method is being tested with respect to assessing the quality of telephone channels and IP telephony. A search for optimal synchronization algorithms is under way, and the relationship between the quality estimate and syllabic intelligibility is being clarified.

Below is a description of an implementation of the method. The proposed method of assessing the quality of sound signals is implemented on a personal computer using software developed by the authors of the invention. The method is realized in the form of a program for assessing the quality of vocoders and for comparing external source and test signals.

Arbitrary signals recorded with a sampling frequency of 8 kHz and 16-bit samples can be used as external signals. It is assumed that the test signal is obtained from the source signal as a result of some transformation (e.g., compression/restoration, transmission over a communication channel, filtering).

Additionally, recordings of a phonetically representative text read by multiple speakers of different sex and age can be used as the external source signal.

As internal source signals (signals to which the user of the program does not have access), signals generated according to the noise model (the generator is described below) and signals generated on the basis of the statistical model are used.

The internal signals are fed to the input of an implementation of an audio data compression/restoration system, implemented as a DLL with a specified interface. DLLs developed both by the authors of the proposed method and by third-party developers may be used. The signal that has undergone processing by the methods contained in the DLL becomes the test signal and is subjected to the quality assessment procedure described above.

Figure 4 shows the algorithm for filtering VAD outliers. The source data are the signal segments pDATA and the VAD activity flags dVAD. Table 3 lists the variables, their purpose and initial values. In addition to the variables, three constants are used in the algorithm: the threshold for correcting pauses to the active state (dBound[0]=6), the threshold for correcting the active state to a pause (dBound[1]=4), and the length of the delay line (dDLSize=max(dBound[])+1).

The constant values used were determined experimentally for the case of assessing the quality of signals that have undergone compression/restoration, and can be changed in an implementation for better synchronization of specific signals.

Table 3
Variables used by the VAD outlier filter

Variable | Purpose | Initial value
dVAD | The value of the activity flag at the input of the algorithm | -
pDATA | An array of signal samples with length equal to the VAD frame | -
dState | The activity flag of the current section (the previous value of the activity flag) | -1
dSLen | The number of consecutive frames with the same activity flag | 0
dNDLFrames | The total number of frames received at the input of the algorithm | 0
DelayLine[] | The delay line; stores the activity flags and the arrays of samples | -

The algorithm checks the activity flag of the current signal block. If the activity flag coincides with the currently accepted state, the incoming frame is simply added to the delay line, and the first element of the delay line is passed to the input of the synchronization block.

If the activity flag does not match the currently accepted state, a check is made for whether this is the first frame of the signal. The first frame is simply placed in the delay line, and its activity flag is taken as the current state.

If a change of activity of the received signal occurs during filtering, the number of signal frames received in the previous state is checked. If the number of frames is less than the set threshold, their activity flag is changed to the opposite one; if not, the current state is simply changed and the counter of frames accepted in the current state is reset. After all state-changing operations the frame is placed in the delay line.

The algorithm terminates upon receiving the end-of-signal flag. In that case, all the accumulated signal (if any) is passed to the input of the synchronization block, followed by the end-of-signal flag.
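A hedged, non-streaming sketch of this filtering is given below: instead of the frame-by-frame delay-line algorithm of the patent, the same rule is applied to a whole sequence of VAD flags at once, so that short runs of identical flags are re-labelled with the surrounding state. The function name and the list-based interface are assumptions.

```python
def filter_vad_outliers(flags, bound_pause_to_active=6, bound_active_to_pause=4):
    """Flip short runs of VAD activity flags (0 = pause, 1 = active).

    A run of pauses shorter than bound_pause_to_active, or a run of active
    frames shorter than bound_active_to_pause, that has neighbours on both
    sides is re-labelled with the opposite (surrounding) state.
    """
    flags = list(flags)
    runs, i = [], 0                      # (value, start, length) of each run
    while i < len(flags):
        j = i
        while j < len(flags) and flags[j] == flags[i]:
            j += 1
        runs.append((flags[i], i, j - i))
        i = j
    for k, (value, start, length) in enumerate(runs):
        threshold = bound_pause_to_active if value == 0 else bound_active_to_pause
        if 0 < k < len(runs) - 1 and length < threshold:
            for idx in range(start, start + length):
                flags[idx] = 1 - value
    return flags

# Example: a 3-frame burst of activity inside a long pause is suppressed
print(filter_vad_outliers([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]))
```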

For signal synchronization, a pair of synchronization blocks is used, working with the shared variables described above. The operation algorithm of a synchronization block is shown in Figures 5-8.

Synchronization block 0 processes the reference signal (Figure 5), and block 1 processes the test signal. The block algorithms are identical; the blocks use cross-references to the buffers: in block 0, XBuffer0 is the buffer of reference-signal pauses and the paired buffer is that of the test signal, and vice versa, in block 1 XBuffer0 is the buffer of test-signal pauses and the paired buffer is that of the reference signal.

Similarly, in block 0 XBuffer1 is the buffer of the active reference signal and the paired buffer is that of the test signal, and vice versa, in block 1 XBuffer1 is the buffer of the active test signal and the paired buffer is that of the reference signal.

Upon receiving the end-of-signal flag, the algorithm terminates. The termination branch is presented in Figure 8.

Depending on the VAD activity flag, the signal is placed either in the pause buffer or in the active-signal buffer. If the buffer size exceeds the threshold value, the synchronized buffers are issued to the comparison module. The branch for issuing the synchronized buffers when the size is exceeded is presented in Figure 7.

After the signal is placed in the buffer, it is checked whether the current activity state of the signal has changed. If it is the same as before, the algorithm returns to the beginning and waits for new data. If the state has changed, it is checked whether this was the first piece of data; if so, its activity state is accepted and the algorithm returns to the beginning.

If not, the readiness flag of the signal in this state is incremented, and it is then checked whether both signals are ready, i.e., whether sections of active signal or of pause are synchronized. If there are synchronized sections of signal, the algorithm goes to the branch presented in Figure 6; if not, it returns to the beginning.

From the current state it is determined whether the synchronization found is for pauses or for the active signal. A synchronization error is checked for by comparing the sizes of the signal buffer (and of the buffer from the parallel block) with zero. If at least one of them is zero, there is a synchronization error.

If everything is in order, the synchronized buffers are issued to the input of the comparison module. If not, the error counter is incremented, the buffers are reset, the activity state is changed, and the algorithm returns to waiting for the next piece of data.

Before the buffers are issued when the segment size threshold is exceeded, the size of the buffer of the parallel block is checked (Figure 7). If the buffer of the parallel block is empty, the buffers are reset and the synchronization error counter is incremented. If data is present in both buffers, the synchronized fragments of the signals are issued to the comparison module.

Before finishing, it is checked whether there is any data in the pause buffers and in the active-signal buffers. If so, the corresponding synchronized pair (or pairs) of signals is issued to the comparison module. Then the end-of-signal mark is passed to the comparison module.

Next, the integral spectra of the selected and combined fragments are calculated in accordance with the method described above. A 1024-point discrete cosine transform is used to calculate the spectra, which provides sufficient accuracy in determining the band boundaries for an 8 kHz signal.
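A hedged sketch of this spectrum computation is given below: a 1024-point DCT of Hamming-windowed blocks overlapping by N/2 samples, accumulated into a running average as assumed earlier. The use of the DCT-II type and the function name are assumptions.

```python
import numpy as np
from scipy.fft import dct

def integral_spectrum(fragment, n=1024):
    """Running average of DCT spectra over Hamming-windowed blocks
    that overlap by N/2 samples (sketch)."""
    window = np.hamming(n)
    avg = None
    step_count = 0
    for start in range(0, len(fragment) - n + 1, n // 2):
        block = fragment[start:start + n] * window
        spectrum = np.abs(dct(block, type=2, norm="ortho"))[: n // 2]
        step_count += 1
        if avg is None:
            avg = spectrum
        else:
            avg = ((step_count - 1) * avg + spectrum) / step_count
    return avg

# Example: 1 second of noise at 8 kHz gives a 512-point integral spectrum
print(integral_spectrum(np.random.randn(8000)).shape)
```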

Table 4 presents the importance coefficients of individual bands, defined in accordance with the described method. The coefficients are defined for signals recorded with a sampling frequency of 8 kHz.

Table 4
The importance coefficients of the critical bands

No. | Zwicker (Vclog / Vcline) | Pokrovsky (Vclog / Vcline) | Fletcher (Vclog / Vcline) | Sapozhkov (Vclog / Vcline)
1 | .112257 / .022757 | .023221 / .000234 | .060620 / .000554 | .119224 / .002201
2 | .071777 / .001918 | .052324 / .001002 | .062399 / .000774 | .122034 / .002950
3 | .063108 / .000816 | .059593 / .002275 | .056358 / .001181 | .150998 / .008126
4 | .066354 / .001426 | .057859 / .004045 | .058028 / .001615 | .136877 / .027655
5 | .063906 / .001986 | .061305 / .009510 | .061681 / .003002 | .141754 / .081240
6 | .063221 / .003309 | .059533 / .019082 | .064525 / .004624 | .124286 / .165172
7 | .056019 / .005001 | .061430 / .029982 | .067569 / .006987 | .123389 / .462036
8 | .057442 / .009524 | .06473 / .032110 | .072189 / .012508 | .073987 / .249604
9 | .061323 / .023545 | .068123 / .037674 | .068211 / .020201 |
10 | .055594 / .037177 | .074750 / .066339 | .068900 / .045774 |
11 | .052207 / .048333 | .082703 / .121272 | .062134 / .041917 |
12 | .046928 / .043929 | .086423 / .153918 | .059783 / .055413 |
13 | .043235 / .066483 | .067701 / .114197 | .063507 / .122423 |
14 | .043545 / .132619 | .042399 / .063481 | .049434 / .112761 |
15 | .037087 / .111828 | .035929 / .044686 | .041380 / .072089 |
16 | .029355 / .072354 | .045791 / .100261 | .081286 / .497994 |
17 | .026423 / .087377 | .055098 / .199775 | |
18 | .049202 / .329361 | | |

Logarithmic bands and their importance coefficients, valid for an 8 kHz signal, were also defined in accordance with the method (Table 5).

Table 5
Logarithmic bands and their importance coefficients

No. | Fc | L | Vcline
1 | 74 | 149 | .005170
2 | 207 | 117 | .000426
3 | 324 | 117 | .000556
4 | 445 | 125 | .000925
5 | 574 | 133 | .001577
6 | 715 | 148 | .002717
7 | 867 | 156 | .005893
8 | 1035 | 180 | .013173
9 | 1219 | 188 | .022271
10 | 1410 | 195 | .029766
11 | 1609 | 203 | .027939
12 | 1816 | 211 | .042986
13 | 2047 | 250 | .079971
14 | 2301 | 258 | .099413
15 | 2551 | 242 | .090052
16 | 2797 | 250 | .081422
17 | 3035 | 227 | .069182
18 | 3273 | 250 | .079362
19 | 3539 | 281 | .142375
20 | 3836 | 313 | .204682

In accordance with the description of the method, the integral spectra of the fragments, the band energies, the quality estimate for each pair of fragments, and the integral resulting quality estimate are calculated. This implementation uses all the sets of bands.

Then, for the convenience of comparison with subjective assessments, the objective score in percent is converted into MOS points by dividing by 20.

The signal generator corresponding to the noise model of speech production works as follows: white noise is generated, from which the critical bands defined by Pokrovsky or Sapozhkov (Table 1) are cut. Each band is modulated by the frequencies listed below. The modulation frequencies are applied sequentially for the number of samples specified for each frequency (the number in parentheses). Once all the modulation frequencies have been applied, a pause of 8000 samples (1 second) is made and the generator proceeds to the next band.

The following modulation frequencies are used: 0.63 Hz (40000 samples), 0.84 (40000), 1.05 (40000), 1.26 (40000), 1.68 (40000), 2.10 (20000), 2.52 (20000), 3.36 (20000), 4.20 (20000), 5.04 (20000), 6.72 (10000), 8.40 (10000), 10.08 (10000), 13.44 (10000).
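A hedged sketch of such a generator is given below (simplified: the bands are cut from white noise with a Butterworth band-pass filter and modulated with a raised sinusoidal envelope; the exact filter and modulation law used by the authors are not specified in the text, and the function names are assumptions):

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 8000  # sampling frequency, Hz

# (modulation frequency in Hz, number of samples), as listed above
MODULATION = [(0.63, 40000), (0.84, 40000), (1.05, 40000), (1.26, 40000),
              (1.68, 40000), (2.10, 20000), (2.52, 20000), (3.36, 20000),
              (4.20, 20000), (5.04, 20000), (6.72, 10000), (8.40, 10000),
              (10.08, 10000), (13.44, 10000)]

def band_noise(fc, width, n, fs=FS):
    """White noise band-limited to (fc - width/2, fc + width/2) Hz."""
    low = max(fc - width / 2.0, 1.0) / (fs / 2.0)
    high = min(fc + width / 2.0, fs / 2.0 - 1.0) / (fs / 2.0)
    b, a = butter(4, [low, high], btype="band")
    return lfilter(b, a, np.random.randn(n))

def noise_model_signal(bands, fs=FS):
    """Modulated band noise for every band, with 1-second pauses between bands."""
    pieces = []
    for fc, width in bands:                  # e.g. Pokrovsky or Sapozhkov bands
        for f_mod, n in MODULATION:
            t = np.arange(n) / fs
            envelope = 0.5 * (1.0 + np.sin(2.0 * np.pi * f_mod * t))
            pieces.append(band_noise(fc, width, n, fs) * envelope)
        pieces.append(np.zeros(fs))          # pause of 8000 samples (1 second)
    return np.concatenate(pieces)

# Example with the first two Sapozhkov bands from Table 1
signal = noise_model_signal([(200, 60), (300, 60)])
```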

The statistical model generates a sound signal on the basis of knowledge of the sound structure of the Russian language, the frequency of sounds, statistical information about the physical characteristics of sounds, statistical data on the composition of the population, and samples of the voices of multiple speakers. The model generates the source sound signal as a sequence of samples of the speakers' voices, taken at random in proportion to their frequency of occurrence.

In testing the proposed method, quality estimates of several standard vocoders were obtained. Table 6 gives the quality estimates of several standard vocoders obtained on different test signals by the proposed method using the described implementation. For comparison, the table also shows the MOS estimates.

Table 6
Assessment of the sound quality of vocoders

Codec | MOS | Noise model (- / Vc) | Statistical model (- / Vc) | FPT minimal (- / Vc) | FPT abbreviated (- / Vc) | FPT full (- / Vc)
A-Law | 4.10 | 4.79 / 4.73 | 4.78 / 4.78 | 4.78 / 4.78 | 4.79 / 4.80 | 4.80 / 4.84
Mu-Law | 4.10 | 4.79 / 4.84 | 4.77 / 4.77 | 4.77 / 4.78 | 4.78 / 4.79 | 4.79 / 4.82
G.723 6.3 | 3.90 | 4.25 / 4.48 | 4.21 / 4.29 | 4.22 / 4.33 | 4.15 / 4.04 | 4.08 / 3.95
GSM 6.10 | 3.70 | 3.20 / 1.99 | 3.01 / 1.65 | 3.04 / 1.78 | 4.22 / 3.66 | 4.01 / 3.21
G.723 5.3 | 3.65 | 4.23 / 4.44 | 4.18 / 4.27 | 4.19 / 4.32 | 4.14 / 4.04 | 4.06 / 3.93

The columns marked "-" show the estimates obtained with all bands considered equally important, and the columns marked "Vc" show the estimates obtained taking the importance coefficients into account.

The proposed method for the evaluation of audio signals has several advantages over known methods of quality measurement, namely:

- it is universal, since it allows judging the quality of signals of different origin that have undergone various processing procedures;

the quality evaluation process can be optimized depending on the purpose of obtaining estimates:

- speed (for example, it is possible to quickly obtain a rough estimate);

- by the type of signal (using different bands for speech signals and for audio signals in general);

- the estimates correlate well with MOS estimates;

- the quality estimates obtained for speech signals can be recalculated into values of various kinds of intelligibility.

Below is a brief description of several possible applications of the proposed method for assessing the quality of the sound.

Figure 9 presents the scheme of using the proposed method for assessing the quality of sound transmission through the telephone network (PSTN). This scheme is valid for both local and long-distance telephony.

The sound quality assessment server generates a source signal (or selects one from among those prepared in advance) and passes it to one of the subscribers participating in the testing.

The subscriber who received the signal establishes a standard telephone connection with a second subscriber and reproduces the source signal. The second subscriber records the received sound signal and transmits it to the sound quality assessment server.

The sound quality assessment server compares the source and test signals in accordance with the proposed method and gives an estimate of the quality of the sound transmitted through the telephone network. This estimate can be used to improve the quality of customer service, to make decisions about the need to replace or adjust equipment (both on the subscriber's side and on the exchange side), for advertising purposes, and so on.

The quality of sound transmitted over an IP network is evaluated similarly, as presented in Figure 10. The difference from the previous application lies in the way the source and test audio signals are transferred from the sound quality assessment server to the subscribers, and in the method of transferring data between the subscribers.

Furthermore, the quality assessment can be used to select the codecs used in VoIP communications and to select operators providing VoIP services.

The proposed method can similarly be used to assess the quality of cellular and satellite communication (see Figure 11). The resulting estimates can be used by subscribers to select carriers and phones, and by operators to optimize the placement of base stations.

Figure 12 presents the process of using the proposed method of sound quality assessment in the development and testing of systems and algorithms (methods) for compressing audio data. Each version of a codec (or a codec with a given set of parameters) requires evaluation and comparison with analogues. Each developer can access the audio samples, compress and restore the signal, and obtain an objective assessment of the quality of the codec.

Such a system makes it possible to manage the process of codec development and the optimization of codec parameters; in addition, the end user will be able to obtain the optimal algorithm rather than merely a working one.

Figure 13 presents the process of evaluating the acoustic quality of a room. In this case the source signal is obtained from a microphone located in front of the speaker, and the test signals from microphones located in different parts of the room, at the locations of the listeners and the audio equipment.

The resulting estimates can be used to optimize the location of the sound reproducing equipment, furniture and seats.

After testing of the proposed method during 2005-2006, it will be widely used in various fields of technology.

1. A method of machine assessment of the quality of sound signals, in which fragments of the active and inactive phases are distinguished in the source and the test signal, the spectra of the active phase are determined and divided into critical bands, the spectral energy values in the critical bands are calculated, the spectral similarity values of the active-phase fragments are determined, and the quality of the test sound signal is determined by a weighted linear combination of the obtained quality values for each phase, characterized in that the selected fragments of the active and inactive phases of both signals are synchronized, the spectra of the inactive phase are determined for each of the fragments, the obtained spectra of the active and inactive phases of the fragments are divided into sets of bands, including additional sets of critical, logarithmic and resonator bands, for each of which the spectral energy values are calculated, the obtained spectral energies of the active and inactive phases of the fragments are compared in pairs to determine the spectral similarity coefficients, and the resulting similarity coefficient for each phase is determined as the average of the similarity coefficients over all sets of bands, which is the quality estimate of that phase.

2. The method according to claim 1, characterized in that either an arbitrary audio signal or a specialized set of signals can be used as the source signal.

3. The method according to claim 1, characterized in that the spectra of the fragments of the active and inactive phases are determined using a discrete cosine transform.

4. The method according to claim 1, characterized in that the spectral energy values of the active and inactive phases of each fragment are calculated taking into account the importance coefficients of each band included in the set.

5. The method according to claim 1 or 4, characterized in that different combinations of logarithmic, resonator and known critical bands are used as the sets of bands.



 
