RussianPatents.com

Method and device for reproducing speech signals and method for transferring said signals

IPC classes for Russian patent Method and device for reproducing speech signals and method for transferring said signals (RU 2255380):

G10L19 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, e.g. for compression or expansion, using source-filter models or psychoacoustic analysis

FIELD: speech recording/reproducing devices.

SUBSTANCE: during encoding, speech signals are divided into frames, and the divided signals are encoded on a frame basis to output encoding parameters such as line spectral pair (LSP) parameters, pitch, voiced/unvoiced (V/UV) decisions and spectral amplitudes. The encoding parameters are then interpolated to calculate modified encoding parameters associated with time points set on a frame basis. During decoding, harmonic waves and noise are synthesized on the basis of the modified encoding parameters, and the synthesized speech signals are output.

EFFECT: broader functional capabilities, higher efficiency.

3 cl, 24 dwg

 

Background of invention

The technical field to which the invention relates.

The present invention relates to a method and device for reproducing speech signals, in which the input speech signal is divided into frames as units and encoded to find encoding parameters, on the basis of which at least harmonic waves are synthesized to reproduce the speech signal. The invention also concerns a method of transmitting modified encoding parameters obtained by interpolating the encoding parameters.

Description of related technology

A variety of encoding methods are currently known that compress signals by exploiting the statistical properties of audio signals, including speech signals, in the time domain and in the frequency domain, together with the psychoacoustic characteristics of the human auditory system. These methods can be roughly classified into time-domain coding, frequency-domain coding, and analysis-synthesis coding.

Meanwhile, with highly efficient speech coding methods that process signals on the time axis, exemplified by code excited linear prediction (CELP) coding, difficulties are encountered in time-axis rate conversion (modification), because of the large amount of processing applied to the signals output from the decoder.

In addition, the above method cannot be used, for example, for pitch conversion, because rate control is performed on the decoded time-domain signal.

In view of the foregoing, it is an object of the present invention to provide a method and apparatus for reproducing speech signals, in which rate control at an arbitrary rate over a wide range can easily be performed with high quality, while leaving the phoneme and pitch unchanged.

In one aspect, the present invention provides a method for reproducing an input speech signal based on encoding parameters obtained by dividing the input speech signal into pre-set frames on the time axis and encoding the divided input speech signal on a frame basis, comprising the steps of interpolating the encoding parameters to find modified encoding parameters associated with desired time points, and generating a speech signal differing in rate from said input speech signal based on the modified encoding parameters. Thus, rate control at an arbitrary rate over a wide range can easily be performed with high signal quality while leaving the phoneme and pitch unchanged.
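The core idea just described — interpolating per-frame encoding parameters at arbitrary time points so that playback rate changes without touching phoneme or pitch — can be sketched in Python. This is a minimal illustration only: the function name, the 20 ms frame length and the use of plain linear interpolation are assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

def modified_parameters(frames, speed, frame_ms=20.0):
    """Interpolate per-frame encoding parameters at time points spaced
    frame_ms * speed apart on the original time axis; playing the result
    back at the normal frame rate changes the speed, while the
    interpolated parameters keep phoneme identity and pitch intact.

    frames : (N, P) array, one row of P encoding parameters per frame.
    speed  : > 1 gives faster playback (fewer frames), < 1 slower.
    """
    frames = np.asarray(frames, dtype=float)
    src_t = np.arange(len(frames)) * frame_ms          # original frame times
    dst_t = np.arange(0.0, src_t[-1] + 1e-9, frame_ms * speed)
    out = np.empty((len(dst_t), frames.shape[1]))
    for p in range(frames.shape[1]):                   # per parameter track
        out[:, p] = np.interp(dst_t, src_t, frames[:, p])
    return out
```

For example, with 10 frames and `speed=2.0`, only 5 parameter sets are produced, each interpolated at the desired time points.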

In another aspect, the present invention provides a device for reproducing a speech signal, in which the input speech signal is reproduced based on encoding parameters obtained by dividing the input speech signal into pre-set frames on the time axis and encoding the divided input speech signal on a frame basis, comprising interpolation means for interpolating the encoding parameters to find modified encoding parameters associated with desired time points, and speech signal generating means for generating a speech signal differing in rate from said input speech signal based on the modified encoding parameters. It thus becomes possible to adjust the bit rate, so that rate control at an arbitrary rate over a wide range can easily be performed with high signal quality while leaving the phoneme and pitch unchanged.

In yet another aspect, the present invention provides a method for transmitting speech signals, in which encoding parameters are found by dividing the input speech signal into pre-set frames on the time axis as units and encoding the divided input speech signal on a frame basis, the encoding parameters thus found are interpolated to find modified encoding parameters associated with desired time points, and the modified encoding parameters are transmitted, thus providing the possibility of regulating the bit rate.

By dividing the input speech signal into pre-set frames on the time axis and encoding the signal on a frame basis to find the encoding parameters, by interpolating the encoding parameters to find modified encoding parameters, and by synthesizing at least harmonic waves based on the modified encoding parameters to reproduce the speech signals, rate control at an arbitrary rate becomes possible.

Brief description of drawings

Fig.1 is a block diagram illustrating the layout of a speech signal reproducing device according to a first embodiment of the present invention.

Fig.2 is a schematic block diagram of the speech signal reproducing device shown in Fig.1.

Fig.3 is a block diagram illustrating the encoder of the speech signal reproducing device shown in Fig.1.

Fig.4 is a block diagram illustrating a schematic layout of multiband excitation (MBE) analysis as an illustrative example of the harmonics/noise coding scheme of the encoder.

Fig.5 illustrates the layout of a vector quantizer.

Fig.6 is a graph illustrating average values of the input signal for voiced sound, unvoiced sound, and voiced and unvoiced sounds taken together.

Fig.7 is a graph illustrating average values of the weighting factor for voiced sound, unvoiced sound, and voiced and unvoiced sounds taken together.

Fig.8 is a graph illustrating the method of generating a vector quantization codebook for voiced sound, unvoiced sound, and voiced and unvoiced sounds taken together.

Fig.9 is a flow chart illustrating the schematic operation of the modified encoding parameter calculation unit used in the speech signal reproducing device shown in Fig.1.

Fig.10 is a schematic view illustrating the modified encoding parameters obtained by the modified parameter calculation unit on the time axis.

Fig.11 is a flow chart illustrating the detailed operation of the modified encoding parameter calculation unit used in the speech signal reproducing device shown in Fig.1.

Figs.12A, 12B and 12C are schematic views showing an illustrative operation of the modified encoding parameter calculation unit.

Figs.13A, 13B and 13C are schematic views showing another illustrative operation of the modified encoding parameter calculation unit.

Fig.14 is a block diagram illustrating the decoder of the speech signal reproducing device.

Fig.15 is a block diagram illustrating the layout of multiband excitation (MBE) synthesis as an illustrative example of the harmonics/noise synthesis scheme used in the decoder.

Fig.16 is a block diagram illustrating a speech signal transmission device according to a second embodiment of the present invention.

Fig.17 is a flow chart illustrating the operation of the transmitting side of the speech signal transmission device.

Figs.18A, 18B and 18C illustrate the operation of the speech signal transmission device.

Description of the preferred embodiments of the invention

Referring to the drawings, preferred embodiments of the present method and device for reproducing speech signals, and of the method for transmitting speech signals, will now be described in detail.

First, a speech signal reproducing device to which the method and apparatus for reproducing speech signals according to the present invention are applied will be described. Fig.1 shows a block diagram of a speech signal reproducing device 1, in which input speech signals are divided into pre-set frames as units on the time axis and encoded on a frame basis to find encoding parameters. Based on these encoding parameters, sine waves and noise are synthesized to reproduce the speech signals.

In particular, in this speech signal reproducing device 1, the encoding parameters are interpolated to find modified encoding parameters associated with desired time points, and sine waves and noise are synthesized on the basis of these modified encoding parameters. Although sine waves and noise are synthesized here based on the modified encoding parameters, it is also possible to synthesize at least harmonic waves.

The speech signal reproducing device 1 includes an encoding unit 2 for dividing the speech signals supplied at an input terminal 10 into frames as units and encoding the speech signals on a frame basis to output encoding parameters such as line spectral pair (LSP) parameters, pitch, voiced (V)/unvoiced (UV) decisions, and spectral amplitudes Am. The speech signal reproducing device 1 also includes a calculation unit 3 for interpolating the encoding parameters to find modified encoding parameters associated with desired time points, and a decoding unit 6 for synthesizing harmonic waves and noise on the basis of the modified encoding parameters and outputting the synthesized speech signals at an output terminal 37. The encoding unit 2, the calculation unit 3 for calculating the modified encoding parameters, and the decoding unit 6 are controlled by a controller (not shown).

The calculation unit 3 for calculating the modified encoding parameters of the speech signal reproducing device 1 includes a period modification circuit 4 for compressing/expanding the time axis of the encoding parameters produced in each pre-set frame so as to modify the output period of the encoding parameters, and an interpolation circuit 5 for interpolating the period-modified parameters to produce modified encoding parameters associated with frame-based time points, as shown for example in Fig.2. The calculation unit 3 for calculating the modified encoding parameters will be described later.

First, the encoding unit 2 will be described. The encoding unit 2 and the decoding unit 6 represent short-term prediction residuals, for example linear predictive coding (LPC) residuals, by harmonics coding and noise. Alternatively, the encoding unit 2 and the decoding unit 6 carry out multiband excitation (MBE) coding or multiband excitation (MBE) analysis.

In conventional code excited linear prediction (CELP) coding, the LPC residuals are directly vector-quantized as time-domain waveforms. Since the encoding unit 2 encodes the residuals by harmonics coding or by MBE analysis, a smoother synthesized waveform can be obtained by vector quantization of the spectral envelope amplitudes of the harmonics with a smaller number of bits, while the output of the LPC synthesis filter is also highly satisfactory in sound quality. Meanwhile, the spectral envelope amplitudes are quantized using the data number conversion (dimension conversion) technique proposed by the present applicant in Japanese patent publication Kokai JP-A-51800. That is, the spectral envelope amplitudes are vector-quantized with a pre-set number of vector dimensions.

Fig.3 shows an illustrative layout of the encoding unit 2. The speech signals supplied at the input terminal 10 are freed of signals of unwanted frequencies by a filter 11 and then fed to a linear predictive coding (LPC) analysis circuit 12 and an inverse filter circuit 21.

The LPC analysis circuit 12 applies a Hamming window to the input signal waveform, with a length of the order of 256 samples as a block, and finds the linear prediction coefficients, the so-called α-parameters, by the autocorrelation method. The framing interval as a data output unit is of the order of 160 samples. If the sampling frequency is, for example, 8 kHz, a framing interval of 160 samples corresponds to 20 ms.
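The LPC analysis step (Hamming windowing of a roughly 256-sample block followed by the autocorrelation method) is conventionally implemented with the Levinson-Durbin recursion. The following is a minimal sketch with hypothetical names, not the patent's own implementation:

```python
import numpy as np

def lpc_alpha(block, order=10):
    """Estimate LPC (alpha) coefficients for one block of samples using
    the autocorrelation method with the Levinson-Durbin recursion.
    In the scheme above, `block` would be a 256-sample segment at 8 kHz."""
    w = block * np.hamming(len(block))             # Hamming-weighted block
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                     # prediction error power
    for i in range(1, order + 1):                  # Levinson-Durbin recursion
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a               # a[0] = 1, a[1:] are the alpha parameters
```

Applied to a signal with a single pole near 0.9, a first-order analysis recovers a coefficient close to -0.9, as expected for the predictor x[n] ≈ 0.9·x[n-1].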

The α-parameters from the LPC analysis circuit 12 are fed to an α-to-LSP conversion circuit 13 for conversion into line spectral pair (LSP) parameters. That is, the α-parameters, found as direct-type filter coefficients, are converted into, for example, ten, i.e. five pairs of, LSP parameters. This conversion is performed using, for example, the Newton-Raphson method. The reason for converting the α-parameters into LSP parameters is that the LSP parameters are superior to the α-parameters in interpolation characteristics.

The LSP parameters from the α-to-LSP conversion circuit 13 are vector-quantized by an LSP vector quantizer 14. At this point, the inter-frame difference may be found first before carrying out vector quantization. Alternatively, plural frames may be collected together and quantized by matrix quantization. For quantization here, the LSP parameters, calculated every 20 ms, are vector-quantized, one frame being 20 ms long.
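Nearest-neighbour vector quantisation of an LSP vector against a codebook can be sketched as follows. The tiny codebook and function names here are stand-ins for illustration; a real quantiser would use a trained codebook (e.g. LBG-trained) and, as noted above, may quantise inter-frame differences instead of the raw vectors:

```python
import numpy as np

def vq_encode(vec, codebook):
    """Return the index of the codebook entry closest to `vec`
    in squared Euclidean distance."""
    d = np.sum((codebook - vec) ** 2, axis=1)   # distance to every entry
    return int(np.argmin(d))

def vq_decode(index, codebook):
    """Look the quantised vector back up from its transmitted index."""
    return codebook[index]
```

Only the index is transmitted; the decoder recovers the quantised vector with the same codebook.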

The quantized output signal of the LSP vector quantizer 14, that is the LSP vector quantization index, is taken out at a terminal 15, while the quantized LSP vectors are fed to an LSP interpolation circuit 16.

The LSP interpolation circuit 16 interpolates the LSP vectors, vector-quantized every 20 ms, to provide an eightfold rate. That is, the LSP vectors are updated every 2.5 ms. The reason is that, if the residual waveform is processed by multiband excitation (MBE) encoding/decoding analysis-synthesis, the envelope of the synthesized waveform is extremely smooth, so that, if the linear predictive coding (LPC) coefficients change abruptly every 20 ms, peculiar sounds tend to be produced. Production of such sounds can be prevented if the LPC coefficients change gradually every 2.5 ms.
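The eightfold LSP interpolation (one vector per 20 ms frame expanded to one per 2.5 ms) amounts to interpolating between consecutive quantised LSP vectors; linear interpolation is assumed here as the simplest case, and the names are hypothetical:

```python
import numpy as np

def interpolate_lsp(lsp_prev, lsp_cur, steps=8):
    """Linearly interpolate between the quantised LSP vectors of two
    consecutive 20 ms frames to get `steps` vectors at 2.5 ms spacing,
    so the synthesis filter coefficients change gradually instead of
    jumping once per frame."""
    lsp_prev = np.asarray(lsp_prev, dtype=float)
    lsp_cur = np.asarray(lsp_cur, dtype=float)
    t = np.arange(1, steps + 1)[:, None] / steps    # 1/8, 2/8, ..., 1
    return (1.0 - t) * lsp_prev + t * lsp_cur       # shape (steps, order)
```

The returned rows would each be converted back to α-parameters to update the inverse filter every 2.5 ms, as described below.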

For inverse filtering of the input speech signal using the LSP vectors thus interpolated at 2.5 ms intervals, the LSP parameters are converted by an LSP-to-α conversion circuit 17 into α-parameters, which are direct-type filter coefficients of, for example, ten orders. The output of the LSP-to-α conversion circuit 17 is fed to the inverse filter circuit 21, which performs inverse filtering with the α-parameters updated at 2.5 ms intervals to produce a smooth output signal. The output of the inverse filter circuit 21 is fed to a harmonics/noise encoding circuit 22, specifically a multiband excitation (MBE) analysis circuit.

The harmonics/noise encoding circuit (MBE analysis circuit) 22 analyzes the output of the inverse filter circuit 21 by a method similar to MBE analysis. That is, the harmonics/noise encoding circuit 22 detects the pitch and calculates the amplitude Am of each harmonic. The harmonics/noise encoding circuit 22 also makes the voiced (V)/unvoiced (UV) decision on the speech signal, and converts the number of harmonic amplitudes Am, which varies with the pitch, to a constant number by dimension conversion. For pitch detection, the autocorrelation of the input LPC residuals is used, as explained below.

Fig.4 shows an illustrative example of a multiband excitation (MBE) analysis/encoding circuit as the harmonics/noise encoding circuit 22.

In the MBE analysis scheme shown in Fig.4, modeling is performed on the assumption that a voiced portion and an unvoiced portion are both present in the frequency range of the same time point, that is of the same block or frame.

The LPC residuals, that is the linear predictive coding residuals, from the inverse filter circuit 21 are supplied to an input terminal 111 shown in Fig.4. Thus, the MBE analysis circuit performs MBE analysis and encoding on the input LPC residuals.

The linear predictive coding (LPC) residuals supplied at the input terminal 111 are fed to a pitch extraction unit 113, a windowing unit 114 and a sub-block energy calculation unit 126, as described below.

Since the input signal of the pitch extraction unit 113 is the LPC residual, pitch detection can be performed by detecting the maximum value of the autocorrelation of the residuals. The pitch extraction unit 113 performs the pitch search by open-loop search. The extracted pitch data are fed to a fine pitch search unit 116, where a fine pitch search is performed by closed-loop search.
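Open-loop pitch extraction by maximising the autocorrelation of the LPC residual can be sketched as follows. The 60-400 Hz search range is an assumed typical pitch range, not a value taken from the patent, and the names are hypothetical:

```python
import numpy as np

def coarse_pitch(residual, fs=8000, fmin=60.0, fmax=400.0):
    """Coarse (integer-lag) pitch estimate from the LPC residual: pick
    the lag that maximises the normalised autocorrelation within the
    expected pitch-period range."""
    lag_min = int(fs / fmax)                 # shortest period searched
    lag_max = int(fs / fmin)                 # longest period searched
    x = residual - np.mean(residual)
    best_lag, best_val = lag_min, -np.inf
    for lag in range(lag_min, lag_max + 1):
        v = np.dot(x[:-lag], x[lag:]) / (np.dot(x, x) + 1e-12)
        if v > best_val:
            best_lag, best_val = lag, v
    return best_lag                          # pitch period in samples
```

For a residual resembling a pulse train with a 40-sample period (200 Hz at 8 kHz), the autocorrelation peaks at lag 40.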

The windowing unit 114 applies a pre-set window function, for example a Hamming window, to each block of N samples, moving the windowed block along the time axis at the frame interval. The time-domain data sequence from the windowing unit 114 is processed by an orthogonal transform unit 115, for example by fast Fourier transform (FFT).

If all the bands in the block are determined to be unvoiced (UV), the sub-block energy calculation unit 126 extracts a characteristic value representing the time-domain waveform envelope of the unvoiced sound signal of that block.

The fine pitch search unit 116 is supplied with the coarse pitch data, as integer values, extracted by the pitch extraction unit 113, and with the frequency-domain data produced by FFT in the orthogonal transform unit 115. The fine pitch search unit 116 swings the pitch by ± several samples, at a step of 0.2 to 0.5, about the coarse pitch value as the center, to arrive at fine pitch data with an optimum fractional (floating-point) value. As the fine search technique, analysis by synthesis is used, and the pitch is selected so that the synthesized power spectrum is closest to the power spectrum of the original signal.

That is, a number of pitch values above and below the coarse pitch found by the pitch extraction unit 113 as the center are provided at an interval of, for example, 0.25. For these pitch values, which differ minutely from one another, the sum of errors ∑εm is found. In this case, once the pitch is set, the bandwidth is set, so that, using the power spectrum of the frequency-domain data and the spectrum of the excitation signal, the error εm is found for each band; hence the sum of errors ∑εm over all bands can be found. This error sum ∑εm is found for each pitch value, and the pitch corresponding to the minimum error sum is selected as the optimum pitch. In this way the optimum fine pitch, with an interval of approximately 0.25, is found by the fine pitch search unit, and the amplitude |Am| for the optimum pitch is determined. The amplitude value is calculated by an amplitude evaluation unit 118V for voiced sound.
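The structure of this fine pitch search — a candidate grid at 0.25-sample steps, a per-band fitting error εm, and selection of the candidate with the minimum error sum — can be illustrated roughly as follows. This sketch substitutes a Gaussian harmonic prototype for the true excitation spectrum and a least-squares amplitude for the patent's amplitude evaluation, so it shows the shape of the search rather than the exact MBE computation; all names are hypothetical:

```python
import numpy as np

def fine_pitch(spectrum, coarse, search=1.0, step=0.25, sigma=1.0):
    """Refine an integer pitch period on a 0.25-step grid.  For each
    candidate period p, split the magnitude spectrum into one band per
    harmonic, fit each band with a scaled harmonic prototype
    (least-squares amplitude a_m), and keep the candidate whose summed
    per-band error is smallest."""
    nbins = len(spectrum)
    fft_size = 2 * nbins
    best_p, best_err = float(coarse), np.inf
    for p in np.arange(coarse - search, coarse + search + 1e-9, step):
        w0 = fft_size / p                    # harmonic spacing in FFT bins
        err, m = 0.0, 1
        while (m + 0.5) * w0 < nbins:        # band m around harmonic m
            k = np.arange(int(np.ceil((m - 0.5) * w0)),
                          int(np.floor((m + 0.5) * w0)) + 1)
            band = spectrum[k]
            proto = np.exp(-0.5 * ((k - m * w0) / sigma) ** 2)
            a_m = np.dot(band, proto) / np.dot(proto, proto)
            err += np.sum((band - a_m * proto) ** 2)   # epsilon_m
            m += 1
        if err < best_err:
            best_p, best_err = float(p), err
    return best_p
```

Given a synthetic spectrum with harmonics of a 40.5-sample period, the search started from the integer estimate 40 lands on 40.5.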

In the above description of the fine pitch search, it is assumed that all the bands are voiced. However, since the model used in the MBE analysis-synthesis system is such that an unvoiced region may be present on the frequency axis at the same time point, it becomes necessary to make the voiced/unvoiced decision for each band.

The optimum pitch from the fine pitch search unit 116 and the amplitude data from the amplitude evaluation unit 118V for voiced sound are fed to a voiced/unvoiced decision unit 117, which makes the decision between voiced and unvoiced sound for each band. For this decision, the signal-to-noise ratio (SNR) is used.

Meanwhile, since the number of bands divided on the basis of the fundamental pitch frequency, i.e. the number of harmonics, varies in the range of about 8 to 63 depending on the pitch of the sound, the number of V/UV flags per band varies similarly. Thus, in the present embodiment, the V/UV decision results are grouped, or degraded, for each of a predetermined number of bands of constant width. Specifically, a predetermined frequency range, for example 0 to 4000 Hz, including the audible range, is divided into NB bands, such as 12 bands, and the weighted mean SNR value of each band is compared with a predetermined threshold Th2 to judge V/UV for each band.
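The grouping of the variable number of harmonic bands into NB = 12 fixed bands with a threshold comparison might be sketched as follows; the threshold value `th2` and the amplitude weighting of the mean SNR are illustrative assumptions.

```python
import numpy as np

def band_vuv_flags(snr_per_harmonic, amp_per_harmonic, nb=12, th2=4.0):
    """Group per-harmonic SNRs into `nb` fixed bands and threshold them.

    The variable number of harmonic bands (about 8..63, depending on
    the pitch) is mapped onto `nb` bands of constant width; in each
    band an amplitude-weighted mean SNR is compared with `th2` to
    produce one V (True) / UV (False) flag per band.
    """
    n = len(snr_per_harmonic)
    flags = []
    for b in range(nb):
        lo = b * n // nb
        hi = max((b + 1) * n // nb, lo + 1)      # at least one harmonic
        snr = np.asarray(snr_per_harmonic[lo:hi], float)
        w = np.asarray(amp_per_harmonic[lo:hi], float)
        mean_snr = float(np.sum(w * snr) / max(np.sum(w), 1e-12))
        flags.append(mean_snr > th2)
    return flags
```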

The amplitude evaluation block 118U for unvoiced sound receives the frequency-domain data from the orthogonal transform block 115, the fine pitch data from the fine pitch search block 116, the amplitude data from the amplitude evaluation block for voiced sound 118V, and the voiced/unvoiced (V/UV) discrimination data from the V/UV discrimination block 117. The amplitude evaluation block 118U for unvoiced sound finds the amplitude anew, by amplitude re-evaluation, for a band which the V/UV discrimination block 117 has found to be unvoiced (UV). For bands found to be voiced, the amplitude evaluation block 118U for unvoiced sound directly outputs the value input from the amplitude evaluation block for voiced sound 118V.

The data from the amplitude evaluation block 118U for unvoiced sound are supplied to the data-number conversion block 119, which is a kind of sampling-rate converter. The data-number conversion block 119 is used to produce a constant number of data, in view of the fact that the number of divided bands of the frequency spectrum, and hence the number of data, in particular the number of amplitude data, differs with the pitch of the sound. That is, if the effective frequency band is up to, for example, 3400 Hz, this effective band is divided into 8 to 63 bands depending on the pitch, so the number mMX+1 of amplitude data, including the amplitude of the UV band, varies in the range from 8 to 63. Thus, the data-number conversion block 119 converts the amplitude data of the variable number mMX+1 into a constant number M of data, for example 44 data.

The data-number conversion block 119 appends to the amplitude data corresponding to one effective band on the frequency axis dummy data interpolating the values from the last datum in the block to the first datum in the block, thereby expanding the number of data to NF. The data-number conversion block 119 then performs band-limiting-type oversampling with an oversampling ratio Os of, for example, 8, to find an Os-fold number of amplitude data. This Os-fold number ((mMX+1)×Os) of amplitude data are linearly interpolated to produce a still larger number NM of data, for example 2048 data. The NM data are decimated for conversion to the predetermined constant number M of data, for example 44 data.
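A minimal sketch of this conversion chain, with plain linear interpolation standing in for the band-limited oversampling filter:

```python
import numpy as np

def convert_data_number(am, M=44, os_ratio=8, nm=2048):
    """Convert a variable number of amplitude data to a constant number M.

    Sketch of the conversion described above: the mMX+1 amplitudes are
    oversampled by the factor `os_ratio` (linear interpolation here,
    rather than true band-limited oversampling), the result is linearly
    interpolated up to `nm` points, and those are decimated to M.
    """
    am = np.asarray(am, float)
    n = len(am)
    # Os-fold oversampling (linear stand-in for band-limited oversampling).
    x_os = np.linspace(0.0, n - 1, n * os_ratio)
    am_os = np.interp(x_os, np.arange(n), am)
    # Expand to NM points by linear interpolation.
    x_nm = np.linspace(0.0, len(am_os) - 1, nm)
    am_nm = np.interp(x_nm, np.arange(len(am_os)), am_os)
    # Decimate to the constant number M.
    idx = np.linspace(0, nm - 1, M).round().astype(int)
    return am_nm[idx]

out = convert_data_number(np.arange(30.0))   # 30 amplitudes in, 44 out
```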

The data (amplitude data of the predetermined constant number M) from the data-number conversion block 119 are supplied to the vector quantizer 23, being gathered into a vector of the M data, or into a vector of a predetermined number of data, for vector quantization.

The pitch data from the fine pitch search block 116 are transmitted via a fixed contact of the switch 27 to the output terminal 28. This technique, disclosed in our Japanese patent application No. 5-185325 (1993), consists in switching to information representing a characteristic value of the time-domain waveform of the unvoiced signal in place of the pitch information if all the bands in the block are unvoiced (UV), in which case the pitch information becomes unnecessary.

These data are obtained by processing data of an N-number, for example 256, of samples per block. Since the block advances along the time axis in frame units, the transmitted data are obtained on a frame basis. That is, the pitch data, the V-UV discrimination data and the amplitude data are updated with the frame period. As the V-UV discrimination data from the V-UV discrimination block 117, it is possible to use data in which the number of bands has been reduced, or degraded, to 12, or data specifying one or more positions of the boundary between the voiced (V) and unvoiced (UV) regions in the entire frequency range. Alternatively, the totality of the bands may be represented by one of V and UV, or the V-UV decision may be made on a frame basis.

If the entire block is found to be unvoiced (UV), one block of, for example, 256 samples may be further divided into a plurality of sub-blocks, each consisting of 32 samples, which are supplied to the sub-block power calculation block 126.

The sub-block power calculation block 126 calculates the proportion, or ratio, of the mean power or RMS value of the totality of the samples in the block, for example 256 samples, to the mean power or RMS value of the samples in each sub-block.

That is, the mean power of, for example, the k-th sub-block and the mean power of the full block are found, and the square root of the ratio of the mean power of the entire block to the mean power p(k) of the k-th sub-block is calculated.

The square roots thus found are regarded as a vector of a predetermined dimension, in order to perform vector quantization in the vector quantizer 127 arranged downstream of the sub-block power calculation block.
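The sub-block energy-contour vector might be formed as below; the normalization direction chosen (sub-block power over block power) is one plausible reading of the text, taken as an assumption.

```python
import numpy as np

def subblock_gain_vector(block, sub_len=32):
    """Per-sub-block normalized gains for an all-UV block (sketch).

    The block (e.g. 256 samples) is split into sub-blocks of `sub_len`
    samples (e.g. 8 sub-blocks of 32).  For each sub-block k the ratio
    of its mean power p(k) to the mean power of the whole block is
    formed, and the square root taken, so that the resulting vector
    describes the energy contour to be vector-quantized.
    """
    block = np.asarray(block, float)
    p_block = np.mean(block ** 2)
    subs = block.reshape(-1, sub_len)
    p_sub = np.mean(subs ** 2, axis=1)
    return np.sqrt(p_sub / max(p_block, 1e-12))

v = subblock_gain_vector(np.ones(256))   # flat block: all gains equal 1
```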

The vector quantizer 127 performs 8-dimensional, 8-bit direct vector quantization (codebook size 256). The output UV_E of this vector quantization, that is, the code representing the code vector, is supplied to one fixed terminal of the switch 27. The other fixed terminal of the switch 27 receives the pitch data from the fine pitch search block 116, while the output of the switch 27 is supplied to the output terminal 28.

The switching of the switch 27 is controlled by the discrimination output of the voiced/unvoiced discrimination block 117, so that the movable contact of the switch 27 is set to the respective fixed contacts when at least one of the bands in the block is found to be voiced (V) and when all the bands are found to be unvoiced, respectively.

Thus, the vector-quantized outputs of the sub-block-wise normalized RMS values are transmitted by being inserted into the slot essentially used for transmitting the pitch information. That is, if all the bands in the block are found to be unvoiced (UV), the pitch information is unnecessary; therefore, if, and only if, the V-UV discrimination flags are found to be all unvoiced, the vector quantization output UV_E is transmitted in place of the pitch information.

The following is a description, with reference to Fig. 3, of the weighted vector quantization of the spectral envelope (Am) in the vector quantizer 23.

The vector quantizer 23 is of a two-stage configuration of a given dimension, for example a 44-dimensional configuration.

That is, the sum of the output vectors of the vector-quantization codebook, which is 44-dimensional and has a codebook size of 32, is multiplied by a gain gl, and the resulting product is used as the quantized value of the 44-dimensional spectral envelope vector. In Fig. 5, CB0 and CB1 denote the two shape codebooks, whose output vectors are s0i and s1j, respectively, where 0≤i, j≤31. The output of the gain codebook CBg is gl, a scalar value, where 0≤l≤31. The final output becomes gl(s0i + s1j).

The spectral envelope Am, obtained by the above multiband excitation (MBE) analysis of the linear prediction coding (LPC) residual and converted to a predetermined dimension, is denoted x. It is crucial how x is to be quantized efficiently.

The quantization error energy E is defined by the following expression:

E = ‖W{Hx − Hgl(s0i + s1j)}‖²   (1)

where H denotes the characteristics of the LPC synthesis filter on the frequency axis and W a weighting matrix representing the characteristics of perceptual weighting on the frequency axis.

The quantization error energy is found by sampling, at points corresponding to the L-dimensional, for example 44-dimensional, vector, the frequency response of

H(z) = 1 / (1 + Σ_{i=1..P} αi·z^(−i))   (2)

where αi, 1≤i≤P, are the α-parameters obtained by LPC analysis of the current frame.

For this calculation, 0's are appended to the string 1, α1, α2, ..., αP to give 1, α1, α2, ..., αP, 0, 0, ..., 0, yielding, for example, 256-point data. A 256-point fast Fourier transform is then executed, and the values of (re² + im²)^(1/2) are calculated for the points corresponding to the range 0 to π. Next, the reciprocals of the calculated values are found and decimated to, for example, 44 points. A matrix whose diagonal elements are these reciprocals is defined:

H = diag(h(1), h(2), ..., h(L)).
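A sketch of this computation, assuming a 256-point FFT and nearest-point decimation as described in the text:

```python
import numpy as np

def synthesis_filter_diag(alpha, L=44, n_fft=256):
    """Diagonal elements h(1..L) of the matrix H (illustrative sketch).

    The string 1, a1, ..., aP is zero-padded to `n_fft` points, an FFT
    is executed, and 1/|A(e^{jw})| = |H(e^{jw})| is evaluated on the
    half-spectrum 0..pi; the values are then thinned out to L points
    by nearest-point selection.
    """
    a = np.zeros(n_fft)
    a[0] = 1.0
    a[1:1 + len(alpha)] = alpha
    A = np.fft.fft(a)[: n_fft // 2 + 1]       # points for 0..pi
    h0 = 1.0 / np.abs(A)                      # reciprocals of (re^2+im^2)^(1/2)
    idx = np.rint(np.arange(1, L + 1) * (n_fft // 2) / L).astype(int)
    return h0[idx]

h = synthesis_filter_diag([-0.9], L=44)       # single-tap predictor, P = 1
```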

The perceptual weighting matrix W is determined from the frequency response of:

W(z) = (1 + Σ_{i=1..P} αi·λb^i·z^(−i)) / (1 + Σ_{i=1..P} αi·λa^i·z^(−i))   (3)

where αi is the result of the LPC analysis, and λa, λb are constant values, such as λa = 0.4, λb = 0.9.

The matrix W can be determined from the frequency response of equation (3). In one example, the string 1, α1λb, α2λb², ..., αPλb^P, 0, 0, ..., 0 is made into 256-point data and a fast Fourier transform is executed to find the numerator response (re²[i] + im²[i])^(1/2), where 0≤i≤128. Then the frequency response of the denominator is found by a 256-point fast Fourier transform of 1, α1λa, α2λa², ..., αPλa^P, 0, 0, ..., 0 for 128 points in the range 0 to π, giving (re'²[i] + im'²[i])^(1/2), where 0≤i≤128.

The frequency response of the above equation (3) can then be determined by the equation:

w0[i] = (re²[i] + im²[i])^(1/2) / (re'²[i] + im'²[i])^(1/2)

where 0≤i≤128.

This frequency response is determined in the following manner for the respective points of, for example, the 44-dimensional vector. Although linear interpolation should be used for more accurate results, the values of the nearest points are substituted in the following example.

That is, ω[i] = ω0[nint(128·i/L)],

where 1≤i≤L, and nint(x) is a function returning the integer nearest to x.

The values of H, namely h(1), h(2), ..., h(L), are found in a similar manner. That is, H and W are diagonal matrices,

H = diag(h(1), h(2), ..., h(L)),   W = diag(w(1), w(2), ..., w(L)),

so that

W' = WH = diag(w(1)h(1), w(2)h(2), ..., w(L)h(L)).   (4)

As a modified embodiment, in order to reduce the number of fast Fourier transform operations, the frequency response may be found after first determining H(z)W(z).

That is, the combined transfer function is

W'(z) = H(z)W(z) = (1 + Σ_{i=1..P} αi·λb^i·z^(−i)) / {(1 + Σ_{i=1..P} αi·z^(−i))·(1 + Σ_{i=1..P} αi·λa^i·z^(−i))}   (5)

The denominator of equation (5) is expanded as follows:

(1 + Σ_{i=1..P} αi·z^(−i))·(1 + Σ_{i=1..P} αi·λa^i·z^(−i)) = 1 + Σ_{i=1..2P} βi·z^(−i).

By appending 0's to the string 1, β1, β2, ..., β2P, 256-point data, for example, are formed. A 256-point fast Fourier transform is then executed to give the amplitude frequency response of the denominator,

(re'²[i] + im'²[i])^(1/2), where 0≤i≤128.

From this and the numerator response, the following equation is obtained:

wh0[i] = (re²[i] + im²[i])^(1/2) / (re'²[i] + im'²[i])^(1/2)

where 0≤i≤128.

This value is found for each corresponding point of the L-dimensional vector. If the number of FFT points is small, linear interpolation should be used; in the present case, however, the nearest values are taken. That is,

wh[i] = wh0[nint(128·i/L)], where 1≤i≤L.

A matrix W' having these nearest values as its diagonal elements is defined by the following expression:

W' = diag(wh(1), wh(2), ..., wh(L)).   (6)

The above equation (6) gives the same matrix as equation (4).
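The alternative route through W'(z) = H(z)W(z) can be sketched as follows; the β-coefficients are obtained by polynomial multiplication (convolution) of the two denominator factors, and the FFT length and decimation follow the text.

```python
import numpy as np

def weighted_synthesis_diag(alpha, lam_a=0.4, lam_b=0.9, L=44, n_fft=256):
    """Diagonal wh(1..L) of W' = WH, via the expansion of equation (5).

    The denominator (1 + sum a_i z^-i)(1 + sum a_i lam_a^i z^-i) is
    expanded by convolution into 1 + sum b_i z^-i (i = 1..2P);
    numerator and denominator magnitude responses come from 256-point
    FFTs and their ratio is thinned out to L points.
    """
    alpha = np.asarray(alpha, float)
    p = len(alpha)
    num = np.concatenate(([1.0], alpha * lam_b ** np.arange(1, p + 1)))
    d1 = np.concatenate(([1.0], alpha))
    d2 = np.concatenate(([1.0], alpha * lam_a ** np.arange(1, p + 1)))
    den = np.convolve(d1, d2)                 # 1, b1, ..., b2P

    def mag(coeffs):
        buf = np.zeros(n_fft)
        buf[: len(coeffs)] = coeffs
        return np.abs(np.fft.fft(buf)[: n_fft // 2 + 1])

    wh0 = mag(num) / mag(den)
    idx = np.rint(np.arange(1, L + 1) * (n_fft // 2) / L).astype(int)
    return wh0[idx]

wh = weighted_synthesis_diag([-0.9])
```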

Using this matrix, i.e. the frequency response of the weighted synthesis filter, equation (1) can be rewritten as follows:

E = ‖W'(x − gl(s0i + s1j))‖².   (7)

The following is a description of the method of learning the shape codebooks and the gain codebook.

First, for all frames k that select the code vector s0c for the codebook CB0, the expected value of the distortion is minimized. If there are M such frames, it suffices to minimize

J = (1/M)·Σ_{k=1..M} ‖W'k(xk − gk(s0c + s1k))‖²   (8)

In equation (8), W'k, xk, gk and s1k denote the weight for the k-th frame, the input of the k-th frame, the gain of the k-th frame and the output of the codebook CB1 for the k-th frame, respectively.

To minimize equation (8), it is differentiated with respect to s0c and the derivative set equal to zero, giving

Σ_{k=1..M} gk·W'k^T·W'k·(xk − gk·s1k) = Σ_{k=1..M} gk²·W'k^T·W'k·s0c

so that

s0c = {Σ_{k=1..M} gk²·W'k^T·W'k}^(−1)·{Σ_{k=1..M} gk·W'k^T·W'k·(xk − gk·s1k)}   (11)

where {·}^(−1) denotes the inverse matrix and W'k^T denotes the transposed matrix of W'k.

Next, optimization with respect to the gain is considered.

The expected value Jg of the distortion for the k-th frames that select the code word gc for the gain is defined as follows:

Jg = (1/M)·Σ_{k=1..M} ‖W'k(xk − gc(s0k + s1k))‖²

Solving the equation ∂Jg/∂gc = 0, we get

gc = {Σ_{k=1..M} (s0k + s1k)^T·W'k^T·W'k·xk} / {Σ_{k=1..M} (s0k + s1k)^T·W'k^T·W'k·(s0k + s1k)}   (12)

The above equations give the optimum centroid conditions for the shape s0i and the gain gi, where 0≤i≤31, that is, the optimum decoder outputs. The optimum decoder output for s1i can be defined in the same way as for s0i.

Next, the optimum encoding condition (nearest-neighbour condition) is considered.

The shapes s0i and s1j that minimize the distortion measure of equation (7), that is,

E = ‖W'(x − gl(s0i + s1j))‖²

are determined each time the input x and the weight matrix W' are given, that is, for each frame.

Intrinsically, E should be determined for all combinations of gl (0≤l≤31), s0i (0≤i≤31) and s1j (0≤j≤31), that is, 32×32×32 combinations, in a round-robin fashion, with the aim of finding the set of s0i, s1j and gl that gives the least value of E. However, since this leads to a voluminous number of arithmetic operations, the encoding block 2 performs a sequential search of the shape and the gain. The round-robin search then has to be performed for the 32×32 = 1024 combinations of s0i and s1j. In the following explanation, s0i + s1j is written sm for simplicity.

The above equation then becomes E = ‖W'(x − gl·sm)‖². For further simplification, putting xw = W'x and sw = W'sm, we get

E = ‖xw − gl·sw‖².

Thus, assuming that gl can be set with sufficient accuracy, the search can be performed in two steps:

1) a search for the sw that maximizes (xw^T·sw)² / ‖sw‖², and

2) a search for the gl that is closest to xw^T·sw / ‖sw‖².

If the above is rewritten using the original notation, the search can be performed in two steps:

1)' a search for the set of s0i and s1j that maximizes (x^T·W'^T·W'·(s0i + s1j))² / ‖W'(s0i + s1j)‖², and

2)' a search for the gl closest to x^T·W'^T·W'·(s0i + s1j) / ‖W'(s0i + s1j)‖².   (15)

Equation (15) gives the optimum encoding condition (nearest-neighbour condition).
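The sequential shape-then-gain search of conditions 1)' and 2)' can be sketched as below, with toy codebook sizes in place of the 32-entry codebooks:

```python
import numpy as np

def shape_gain_search(x, W, cb0, cb1, cbg):
    """Two-step (shape first, then gain) search per equation (15) (sketch).

    For every pair (s0i, s1j) the quantity
    (x^T W^T W (s0i+s1j))^2 / ||W (s0i+s1j)||^2 is maximized; then the
    scalar gain closest to the optimal ratio is picked from the gain
    codebook.  Codebook sizes are arbitrary here.
    """
    xw = W @ x
    best, best_val = None, -np.inf
    for i, s0 in enumerate(cb0):
        for j, s1 in enumerate(cb1):
            sw = W @ (s0 + s1)
            e = sw @ sw
            if e <= 0.0:
                continue
            val = (xw @ sw) ** 2 / e
            if val > best_val:
                best_val, best = val, (i, j, (xw @ sw) / e)
    i, j, g_opt = best
    l = int(np.argmin(np.abs(np.asarray(cbg) - g_opt)))
    return i, j, l

# Toy check: if x equals g*(s0_1 + s1_2), the search recovers that entry.
rng = np.random.default_rng(0)
cb0 = rng.standard_normal((4, 8))
cb1 = rng.standard_normal((4, 8))
cbg = [0.5, 1.0, 2.0]
x = 2.0 * (cb0[1] + cb1[2])
i, j, l = shape_gain_search(x, np.eye(8), cb0, cb1, cbg)
```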

Using the centroid conditions of equations (11) and (12) and the condition of equation (15), the codebooks CB0, CB1 and CBg can be trained simultaneously by the generalized Lloyd algorithm (GLA).

Referring to Fig. 3, it is noted that the vector quantizer 23 is connected through the switch 24 to a codebook for voiced sound 25V and a codebook for unvoiced sound 25U. By controlling the switching of the switch 24 in dependence on the V-UV discrimination output of the harmonics/noise encoding circuit 22, vector quantization of the voiced sound and of the unvoiced sound is performed using the codebook for voiced sound 25V and the codebook for unvoiced sound 25U, respectively.

The reason for switching the codebooks in dependence on the voiced (V)/unvoiced (UV) discrimination is that, since new centroids corresponding to equations (11) and (12) are computed by weighted averaging of the parameters xk and gk, it is undesirable to average together values of xk and gk that differ significantly.

Meanwhile, the encoding block 2 uses W' divided by the norm of the input x. That is, W'/‖x‖ is substituted for W' throughout in the processing of equations (11), (12) and (15).

When the codebooks are switched in dependence on the V-UV discrimination, the training data are distributed in the same way, to prepare the respective training data for the codebook for voiced sound and the codebook for unvoiced sound.

To reduce the number of bits of the V-UV flags, the encoding block 2 uses single-band excitation, and a given frame is judged to be a voiced (V) frame or an unvoiced (UV) frame according as the proportion of V exceeds 50 % or not, respectively.

Figs. 6 and 7 show the mean value of the input x and the mean value of the weight W'/‖x‖ for voiced sound, for unvoiced sound, and for voiced and unvoiced sounds taken together, i.e. without regard to the distinction between voiced and unvoiced sounds.

Fig. 6 shows that the energy distribution of x on the frequency axis does not differ much between V and UV, while the mean value of the gain (‖x‖) differs greatly between V and UV. However, it is seen from Fig. 7 that the shape of the weight differs between V and UV, and that the weight is such as to increase the assignment of bits to the low range for V as compared with UV. This explains why a codebook of higher performance can be produced by separate training for V and for UV.

The next figure shows the manner of training for the three cases, i.e. voiced sound (V), unvoiced sound (UV), and voiced and unvoiced sounds combined. That is, the three curves of the figure show the course of training with V only, with UV only, and with the combined values of V and UV, the final values being 3.72, 7.011 and 6.25, respectively.

The figure shows that separating the training of the codebook for V and the codebook for UV leads to a reduced expected value of the distortion of the output. Although the state of the expected value deteriorates for the UV-only curve, the expected value is improved on the whole, because the region for V is longer than the region for UV. As an example of the relative frequency of V and UV, the measured lengths of the V-only and UV-only regions are 0.538 and 0.462 for a training-data length of 1. Thus, from the final values of the V-only and UV-only curves, the expected value of the total distortion is

3.72 × 0.538 + 7.011 × 0.462 ≈ 5.24,

which is an improvement of about 0.76 dB over the expected value of 6.25 for training V and UV together.

Judging from the manner of training, the improvement in the expected value is about 0.76 dB. However, it was found that, when speech samples of four male and four female speakers from outside the training set were processed to find the SNR for the case where quantization is not performed, separating V and UV improves the segmental SNR by about 1.3 dB. The reason for this probably lies in the fact that the proportion of V is significantly higher than that of UV.

It should be noted that, although the weight w' used for perceptual weighting in the vector quantization by the vector quantizer 23 is defined by the above equation (6), a weight w' taking the temporal masking effect into account can be found by determining the current weight w' with the past w' taken into account.

As for the elements wh(1), wh(2), ..., wh(L) of the above equation (6), those calculated at time n, that is, for the n-th frame, are denoted whn(1), whn(2), ..., whn(L).

If the weights taking the past values into account at time n are defined as An(i), where 1≤i≤L, then

An(i) = λ·An−1(i) + (1 − λ)·whn(i)   for whn(i) ≤ An−1(i),

An(i) = whn(i)   for whn(i) > An−1(i),

where λ may be set, for example, to λ = 0.2. An(i), where 1≤i≤L, may be used as the diagonal elements of a matrix, which is used as the above-mentioned weight.
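The recursion can be sketched directly; the weight follows increases immediately and decays smoothly, modelling the temporal masking effect:

```python
def temporal_weight(wh_frames, lam=0.2):
    """Weights A_n(i) taking the past into account, per the recursion above.

    A_n(i) = lam*A_{n-1}(i) + (1-lam)*wh_n(i)  if wh_n(i) <= A_{n-1}(i)
    A_n(i) = wh_n(i)                           if wh_n(i) >  A_{n-1}(i)

    The first frame initializes A; returns the history of A per frame.
    """
    a = list(wh_frames[0])
    out = [list(a)]
    for wh in wh_frames[1:]:
        a = [w if w > ap else lam * ap + (1 - lam) * w
             for w, ap in zip(wh, a)]
        out.append(list(a))
    return out

frames = [[1.0], [0.0], [0.0], [2.0]]   # a single-element weight per frame
hist = temporal_weight(frames)
```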

Returning to Fig. 1, the modified-encoding-parameter calculation unit 3 is noted here. The speech-signal reproducing device 1 modifies the encoding parameters output by the encoding block 2 for a desired rate by means of the modified-encoding-parameter calculation unit 3, which calculates the modified encoding parameters, and decodes the modified encoding parameters by the decoding block, thereby reproducing the recorded content at, for example, twice the real-time speed. Since the pitch and the phonemes remain unchanged despite the higher reproducing speed, the recorded content remains intelligible even when reproduced at the higher speed.

Since the encoding parameters are modified for the speed control, the modified-encoding-parameter calculation unit 3 requires no processing after the decoding and output, and can easily cope with different fixed rates with the same algorithm.

Referring to the flow charts of Figs. 9 and 11, the operation of the modified-encoding-parameter calculation unit 3 of the speech-signal reproducing device 1 is now described in detail. As explained with reference to Fig. 2, the modified-encoding-parameter calculation unit 3 consists of the period modification circuit 4 and the interpolation circuit 5.

First, at step S1 of Fig. 9, the period modification circuit 4 receives, via the input terminals 15, 28, 29 and 26, the encoding parameters, such as LSP, pitch, V-UV and Am. The pitch is set to Pch[n], V-UV to vuv[n], Am to am[n][l] and LSP to lsp[n][i]. The modified encoding parameters ultimately calculated by the modified-encoding-parameter calculation unit 3 are set to mod_Pch[m], mod_vuv[m], mod_am[m][l] and mod_lsp[m][i], where l denotes the harmonic number, i denotes the LSP order number, and n and m are frame numbers corresponding to indices on the time axis before and after the time-axis transformation, respectively. Meanwhile, 0≤n&lt;N1 and 0≤m&lt;N2, where n and m are each frame indices with a frame interval of, for example, 20 ms.

As described above, l denotes the number of harmonics. The above processing may be performed after restoring the number of harmonics in am[n][l] to the number corresponding to the real number of harmonics, or may be performed in the state am[n][l] (l = 0 to 43). That is, the data-number conversion may be performed either before or after decoding by the decoding device.

At step S2, the period modification circuit 4 sets the number of frames corresponding to the original time duration to N1, while setting the number of frames corresponding to the time duration after the modification to N2. Then, at step S3, the period modification circuit 4 time-axis-compresses the speech of the N1 frames to the speed of the N2 frames. That is, the time-axis compression ratio spd in the period modification circuit 4 is found as the ratio N2/N1.

Next, at step S4, the interpolation circuit 5 sets m, the frame number corresponding in turn to the index of the time axis after the time-axis transformation, equal to 2.

Next, at step S5, the interpolation circuit 5 finds the two frames fr0 and fr1 and the "left" and "right" distances between the two frames fr0, fr1 and the position m/spd. Denoting the encoding parameters Pch, vuv, am and lsp generically by an asterisk (*), mod_*[m] can be expressed by the general formula

mod_*[m] = *[m/spd], where 0≤m&lt;N2.

However, since the ratio m/spd is generally not an integer, the modified encoding parameter at m/spd is produced by interpolation from the two frames fr0 = ⌊m/spd⌋ and fr1 = fr0 + 1. Between the frame fr0, the position m/spd and the frame fr1 there is the relation shown in Fig. 10, namely

left = m/spd − fr0,

right = fr1 − m/spd.

The encoding parameter at m/spd in Fig. 10, that is, the modified encoding parameter, is produced by interpolation, as shown at step S6. The modified encoding parameter may be found simply by linear interpolation, as:

mod_*[m] = *[fr0] × right + *[fr1] × left.

However, if, in interpolating between fr0 and fr1, the two frames differ in V-UV, that is, if one of the two frames is V and the other is UV, the above general formula cannot be applied. Therefore, the interpolation circuit 5 changes the manner of finding the encoding parameters in dependence on the voiced or unvoiced character of the two frames fr0 and fr1, as shown at step S11 of Fig. 11.

First it is judged whether the two frames fr0 and fr1 are voiced (V) or unvoiced (UV). If it is found that both frames fr0 and fr1 are voiced (V), the program proceeds to step S12, where all the parameters are linearly interpolated and the modified encoding parameters are represented as follows:

mod_Pch[m] = Pch[fr0] × right + Pch[fr1] × left,

mod_am[m][l] = am[fr0][l] × right + am[fr1][l] × left,

where 0≤l&lt;L. Note that L denotes the maximum possible number of harmonics, and that "0" is filled into am[n][l] where there is no harmonic. If the number of harmonics differs between the frames fr0 and fr1, the value of the missing counterpart harmonic is taken as zero in the above interpolation. Before passage through the data-number conversion block, the number L may be constant, for example L = 43, with 0≤l&lt;L.

In addition, the modified encoding parameters are produced as follows:

mod_lsp[m][i] = lsp[fr0][i] × right + lsp[fr1][i] × left,

where 0≤i&lt;I, I denoting the number of LSP orders and being usually equal to 10, and

mod_vuv[m] = 1.

It should be understood that, in the V-UV discrimination, 1 and 0 denote voiced (V) and unvoiced (UV) frames, respectively.

If it is found at step S11 that the two frames fr0 and fr1 are not both voiced (V), a judgment is made at step S13, that is, it is judged whether both frames fr0 and fr1 are unvoiced (UV). If the result is affirmative (YES), i.e. if both frames are unvoiced (UV), the interpolation circuit 5 sets Pch to a constant value and finds am and lsp by linear interpolation, as follows:

mod_Pch[m] = MaxPitch,

so as to fix the pitch at a constant value, for example the maximum pitch value for unvoiced sound, MaxPitch = 148;

mod_am[m][l] = am[fr0][l] × right + am[fr1][l] × left, where 0≤l&lt;L;

mod_lsp[m][i] = lsp[fr0][i] × right + lsp[fr1][i] × left, where 0≤i&lt;I; and

mod_vuv[m] = 0.

If the frames fr0 and fr1 are not both unvoiced, the program proceeds to step S15, where it is judged whether the frame fr0 is voiced (V) and the frame fr1 unvoiced (UV). If the result is affirmative (YES), i.e. if the frame fr0 is voiced (V) and the frame fr1 unvoiced (UV), the program proceeds to step S16. If the result is negative (NO), i.e. if the frame fr0 is unvoiced (UV) and the frame fr1 voiced (V), the program proceeds to step S17.

The processing at step S16 and beyond relates to the case where the frames fr0 and fr1 differ in V-UV, that is, where one of the frames is voiced and the other unvoiced. It takes into account the fact that interpolation of parameters between two frames fr0 and fr1 that differ in V-UV is meaningless. In this case, the parameter value of the frame closer to the time m/spd is adopted, without performing interpolation.

If the frame fr0 is voiced (V) and the frame fr1 unvoiced (UV), the program proceeds to step S16, where the sizes "left" (= m/spd − fr0) and "right" (= fr1 − m/spd) shown in Fig. 10 are compared with each other. This judges which of the frames fr0 and fr1 is closer to m/spd. The modified encoding parameters are then calculated using the parameter values of the frame closer to m/spd.

If the result of the judgment at step S16 is affirmative (YES), the "right" size is larger, and hence the frame fr1 is farther from m/spd. Thus, at step S18, the modified encoding parameters are found using the parameters of the frame fr0 closer to m/spd, as follows:

mod_Pch[m] = Pch[fr0],  mod_am[m][l] = am[fr0][l],  mod_lsp[m][i] = lsp[fr0][i],  mod_vuv[m] = 1.

If the result of the judgment at step S16 is negative (NO), then left ≥ right, and hence the frame fr1 is closer to m/spd; the program therefore proceeds to step S19, where the pitch value is maximized and the modified parameters are set using the parameters of the frame fr1, as follows:

mod_Pch[m] = MaxPitch,  mod_am[m][l] = am[fr1][l],  mod_lsp[m][i] = lsp[fr1][i],  mod_vuv[m] = 0.

Next, at step S17, entered on the judgment of step S15 that the two frames fr0 and fr1 are unvoiced (UV) and voiced (V), respectively, a judgment similar to that of step S16 is made. That is, in this case too, interpolation is not performed, and the parameter values of the frame closer to the time m/spd are used.

If the result of the judgment at step S17 is affirmative (YES), the pitch is maximized at step S20 and, for the remaining parameters, the modified encoding parameters are set using the parameters of the closer frame fr0, as follows:

mod_Pch[m] = MaxPitch,  mod_am[m][l] = am[fr0][l],  mod_lsp[m][i] = lsp[fr0][i],  mod_vuv[m] = 0.

If the result of the judgment at step S17 is negative (NO), then, since left ≥ right and hence the frame fr1 is closer to m/spd, the program proceeds to step S21, where the modified encoding parameters are set using the parameters of the frame fr1, as follows:

mod_Pch[m] = Pch[fr1],  mod_am[m][l] = am[fr1][l],  mod_lsp[m][i] = lsp[fr1][i],  mod_vuv[m] = 1.

Thus, the interpolation circuit 5 performs different interpolating operations at step S6 of Fig. 9, depending on the voiced (V)/unvoiced (UV) relationship between the two frames fr0 and fr1. After the interpolation at step S6, the program proceeds to step S7, where the parameter m is incremented. The operations of steps S5 and S6 are repeated until the value of m becomes equal to N2.
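The branching of steps S11 to S21 for a single output frame might be sketched as follows; the function name and the parameter layout are illustrative, not the patent's, with pch[n], vuv[n], am[n][l] and lsp[n][i] as in the text and MaxPitch = 148 for unvoiced frames.

```python
import math

MAX_PITCH = 148  # constant pitch value used for unvoiced frames

def modify_param(m, spd, pch, vuv, am, lsp):
    """One output frame of the interpolation of steps S11..S21 (sketch).

    Maps output frame m back to the position m/spd on the input time
    axis, and either linearly interpolates between frames fr0 and fr1
    (both V or both UV) or copies the parameters of the nearer frame
    (mixed V/UV).  Returns (pitch, vuv, am, lsp) for the output frame.
    """
    pos = m / spd
    fr0 = int(math.floor(pos))
    fr1 = fr0 + 1
    left, right = pos - fr0, fr1 - pos

    def lerp(a, b):
        return [x * right + y * left for x, y in zip(a, b)]

    if vuv[fr0] == 1 and vuv[fr1] == 1:        # both voiced: interpolate all
        return (pch[fr0] * right + pch[fr1] * left, 1,
                lerp(am[fr0], am[fr1]), lerp(lsp[fr0], lsp[fr1]))
    if vuv[fr0] == 0 and vuv[fr1] == 0:        # both unvoiced: pitch fixed
        return (MAX_PITCH, 0,
                lerp(am[fr0], am[fr1]), lerp(lsp[fr0], lsp[fr1]))
    near = fr0 if right > left else fr1        # mixed V/UV: take nearer frame
    p = pch[near] if vuv[near] == 1 else MAX_PITCH
    return (p, vuv[near], list(am[near]), list(lsp[near]))
```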

In addition, the sequence of short-term RMS values of the unvoiced (UV) portions is normally used for controlling the noise gain. Here, however, this parameter is set to 1.

The operation of the modified-encoding-parameter calculation unit is shown schematically in the next figure. The model of encoding parameters extracted every 20 ms by the encoding block 2 is shown at (a). The period modification circuit 4 of the modified-encoding-parameter calculation unit 3 sets the period to 15 ms and performs the compression along the time axis, as shown at (b). The modified encoding parameters shown at (c) are calculated by the interpolating operations matched to the V-UV states of the two frames fr0, fr1, as explained above.

The modified-encoding-parameter calculation unit 3 may also reverse the sequence in which the operations of the period modification circuit 4 and the interpolation circuit 5 are performed, i.e. first interpolate the encoding parameters shown at (a), as shown at (b), and then perform the compression to calculate the modified encoding parameters, as shown at (c).

The modified encoding parameters from the modified-encoding-parameter calculation unit 3 are supplied to the decoding circuit 6 shown in Fig. 1. The decoding circuit 6 synthesizes sine waves and noise on the basis of the modified encoding parameters and outputs the synthesized sound at the output terminal 37.

The operation of the decoding circuit is now described with reference to the next figure and to Fig. 15. For the sake of explanation, it is assumed that the parameters supplied to the decoding circuit 6 are the ordinary encoding parameters.

In the figure, the terminal 31 receives the vector-quantized output of the line spectral pairs (LSPs), corresponding to the output at the terminal 15 of Fig. 3, i.e. the so-called index.

This input signal is supplied to the inverse LSP vector quantizer 32 for inverse vector quantization to produce line-spectral-pair (LSP) data, which are then supplied to the LSP interpolation circuit 33 for LSP interpolation. The resulting interpolated data are converted by the LSP-to-α conversion circuit into α-parameters of the linear prediction codes (LPC). These α-parameters are sent to the synthesis filter 35.

The terminal 41 of the figure receives the index data of the weighted vector quantization code word of the spectral envelope (Am), corresponding to the output at the terminal 26 of the encoding device shown in Fig. 3. The terminal 43 receives the pitch information from the terminal 28 of Fig. 3 and the data indicating the characteristic value of the time-domain waveform for the UV block, while the terminal 46 receives the V-UV discrimination data from the terminal 29 of Fig. 3.

The vector-quantized data of the amplitude Am from the terminal 41 are supplied to the inverse vector quantizer 42 for inverse vector quantization. The resulting spectral-envelope data are supplied to the harmonics/noise synthesis circuit, or multiband excitation (MBE) synthesis circuit, 45. The synthesis circuit 45 is supplied from the terminal 43 with the data switched by the switch 44 between the pitch data and the data indicating the characteristic value of the waveform for the UV frame, depending on the V-UV discrimination data. The synthesis circuit 45 is also supplied with the V-UV discrimination data from the terminal 46.

An arrangement of the MBE synthesis circuit, as an illustrative arrangement of the synthesis circuit 45, is described below with reference to the next figure.

From the synthesis circuit 45, LPC residual data corresponding to the output of the inverse filtering circuit 21 of Fig. 3 are taken out. The residual data thus obtained are sent to the synthesis filter 35, where LPC synthesis is performed to produce time-domain waveform data, which are additionally filtered by the post-filter 36, so that the reproduced time-domain waveform signal is output at the output terminal 37.

The illustrative arrangement of the MBE synthesis circuit, as an example of the synthesis circuit 45, is now described with reference to the figure.

Fig. shows that the spectral envelope data from the inverse vector quantizer 42 of Fig., that is, the actual spectral envelope data of the LPC residual values, are supplied to the input terminal 131. The data received at terminals 43 and 46 are the same as those shown in Fig. The data received at terminal 43 are selected by the switch 44, so that the pitch data and the data indicating the characteristic value of the UV signal waveform are routed to the voiced-sound synthesis unit 137 and to the inverse vector quantizer 152, respectively.

The spectral amplitude data of the LPC residual values from terminal 131 are sent to the inverse data-number conversion circuit 136 for inverse conversion. The inverse data-number conversion circuit 136 performs a conversion which is the inverse of the conversion performed by the data-number conversion unit 119. The resulting amplitude data are fed to the voiced-sound synthesis unit 137 and to the unvoiced-sound synthesis unit 138. The pitch data received from terminal 43 through a fixed terminal of the switch 44 are routed to the synthesis units 137 and 138. The synthesis units 137 and 138 also receive the V/UV discrimination data from terminal 46.

The voiced-sound synthesis unit 137 synthesizes the time-domain voiced-sound waveform, for example by cosine or sine wave synthesis, while the unvoiced-sound synthesis unit 138 filters, for example, white noise through a band-pass filter in order to synthesize the time-domain unvoiced waveform. The voiced waveform and the unvoiced waveform are summed by the adder 141, so that the result can be taken out at the output terminal 142.
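The voiced/unvoiced split and the summation in the adder can be sketched as follows. This is a minimal Python/NumPy illustration assuming a single pitch value and a flat noise gain rather than the patent's per-band processing; all names are hypothetical:

```python
import numpy as np

def synthesize_voiced(amplitudes, pitch_hz, fs, n):
    """Sum of harmonic cosines weighted by spectral-envelope
    amplitudes -- a stand-in for voiced-sound synthesis unit 137."""
    t = np.arange(n) / fs
    out = np.zeros(n)
    for k, a in enumerate(amplitudes, start=1):
        out += a * np.cos(2.0 * np.pi * k * pitch_hz * t)
    return out

def synthesize_unvoiced(gain, n, seed=0):
    """Scaled white noise -- a crude stand-in for the band-pass
    filtered noise of unvoiced-sound synthesis unit 138."""
    rng = np.random.default_rng(seed)
    return gain * rng.standard_normal(n)

fs = 8000
# Adder 141: the voiced and unvoiced waveforms are simply summed.
frame = synthesize_voiced([1.0, 0.5], 100.0, fs, 160) \
        + synthesize_unvoiced(0.1, 160)
```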

If the V/UV discrimination data are transmitted as a V/UV code, all the bands can be divided at a single demarcation point into a voiced (V) region and an unvoiced (UV) region, and band-based V/UV discrimination data can be obtained on the basis of this demarcation point. If the number of bands has been reduced on the analysis (encoder) side to a constant number, for example 12 bands, this reduction can be undone to restore a variable number of bands with a band width corresponding to the original pitch.

The steps of unvoiced-sound synthesis performed by the unvoiced-sound synthesis unit 138 are described below.

The time-domain white-noise waveform from the white-noise generator 143 is sent to the windowing unit 144 for windowing with a suitable window function of compact support, for example a Hamming window, of a predetermined length, for example 256 samples. The windowed waveform is then sent to the short-term Fourier transform (STFT) circuit 145 for a short-term Fourier transform, in order to produce the frequency-domain power spectrum of the white noise. The power spectrum from the STFT circuit 145 is sent to the band amplitude processing unit 146, where the bands deemed unvoiced (UV) are multiplied by the corresponding amplitude, while the amplitudes of the other bands, deemed voiced (V), are set to 0. The band amplitude processing unit 146 receives the amplitude data, the pitch data and the V/UV discrimination data.

The output of the band amplitude processing unit 146 is sent to the inverse short-term Fourier transform (inverse STFT) unit 147, which performs the transform inverse to the STFT, using the phase of the original white noise as the phase, in order to convert the signal into the time domain. The output of the inverse STFT unit 147 is sent, through the power-distribution shaping unit 156 and the multiplier 157 described below, to the overlap-and-add unit 148, where overlapping and adding are repeated with suitable weighting along the time axis in order to restore the original continuous waveform. A continuous time-domain waveform is thus produced by the synthesis. The output of the overlap-and-add unit 148 is sent to the adder 141.
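The noise path described above (window white noise, transform, zero the voiced bands, inverse-transform, overlap-add) can be sketched as follows. This is a simplified Python/NumPy illustration that works per FFT bin rather than per pitch-dependent band, so it is a sketch of the technique, not the patent's exact layout:

```python
import numpy as np

def unvoiced_from_noise(n_frames, uv_mask, frame_len=256, hop=128, seed=0):
    """Window white noise with a 256-sample Hamming window, take a
    short-term FFT, zero the bins of voiced bands (uv_mask False),
    inverse-transform and overlap-add along the time axis.
    uv_mask holds one boolean flag per rfft bin."""
    rng = np.random.default_rng(seed)
    win = np.hamming(frame_len)
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i in range(n_frames):
        spec = np.fft.rfft(win * rng.standard_normal(frame_len))
        spec[~uv_mask] = 0.0               # voiced bins contribute nothing
        out[i * hop:i * hop + frame_len] += np.fft.irfft(spec, frame_len)
    return out

mask = np.zeros(129, dtype=bool)           # 256-point rfft has 129 bins
mask[64:] = True                           # upper half treated as unvoiced
noise = unvoiced_from_noise(4, mask)
```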

If at least one of the bands in the block is voiced (V), the above processing is performed in the respective synthesis units 137 and 138. If all the bands in the block are found to be unvoiced (UV), the movable contact of the switch 44 is set to its other fixed terminal, so that the information on the time-domain waveform of the unvoiced signal, instead of the pitch information, is sent to the inverse vector quantization unit 152.

That is, the inverse vector quantization unit 152 receives data corresponding to the data from the vector quantization unit 127 of Fig. 4. These data are subjected to inverse vector quantization to yield the data representing the characteristic waveform value of the unvoiced signal.

The output of the inverse STFT unit 147, before being sent to the multiplier 157, has its time-domain power distribution adjusted by the power-distribution shaping unit 156. The multiplier 157 multiplies the output of the inverse STFT unit 147 by the signal supplied from the inverse vector quantization unit 152 through the smoothing unit 153. The smoothing unit 153 suppresses rapid gain changes that would otherwise be pronounced.

The unvoiced sound signal thus synthesized is taken from the unvoiced-sound synthesis unit 138 and sent to the adder 141, where it is added to the signal from the voiced-sound synthesis unit 137, so that the LPC residual signals, as the synthesized output signals, are taken out at the output terminal 142.

These LPC residual signals are sent to the synthesis filter 35 of Fig. in order to create the desired reproduced speech signal.

The speech signal reproducing device 1 causes the unit 3 for calculating modified encoding parameters to calculate the modified encoding parameters under the control of a controller (not shown), and synthesizes a speech signal which is modified along the time axis relative to the original speech signal, using the modified encoding parameters.

In this case, the signal from the unit 3 for calculating modified encoding parameters is used instead of the output signal of the LSP inverse vector quantization circuit. The modified encoding parameter is used instead of the inherent vector-quantized value. The modified encoding parameter is sent to the LSP interpolation circuit 33 for interpolation of the linear spectral pairs (LSP), and thence to the LSP-to-α conversion circuit 34, where it is converted into the α-parameters of linear prediction coding (LPC), which are supplied to the synthesis filter 35.

On the other hand, the modified encoding parameter is used instead of the output signal or the input signal of the data-number conversion circuit 136. Terminals 43 and 46 receive the corresponding modified signals, respectively.

The modified encoding parameter is sent to the harmonics/noise synthesis circuit 45 as the spectral envelope data. The synthesis circuit 45 receives the modified signal from terminal 43 through the switch 44, depending on the discrimination data, and also receives the modified signal from terminal 46.

By means of the arrangement of Fig. described above, speech signals modified along the time axis relative to the original speech signals are synthesized using the aforementioned modified encoding parameters, so that they can be output at the output terminal 37.

Thus, the speech signal reproducing device 1 decodes the modified array of encoding parameters, indexed by m (where 0≤m&lt;N2), instead of the inherent array *[n] (0≤n&lt;N1). The frame interval during decoding can be constant, typically equal to, for example, 20 ms. Thus, if N2&lt;N1 or N2&gt;N1, the time axis is compressed with a speed increase, or the time axis is expanded, respectively.

If the time axis is modified as described above, the instantaneous spectrum and the pitch remain unchanged, so that despite a significant change over the range 0.5≤spd≤2, hardly any deterioration is produced.
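The N1-to-N2 resampling of the parameter arrays can be sketched as a linear re-gridding of each per-frame parameter track. This is an illustrative Python/NumPy sketch, not the patent's exact indexing; names and the rounding rule for N2 are assumptions:

```python
import numpy as np

def modify_time_axis(params, spd):
    """Resample a per-frame parameter track for speed control.

    params : array of shape (N1, dim), one row per 20 ms frame
    spd    : speed ratio; spd > 1 compresses the time axis (fewer
             frames, faster playback), spd < 1 expands it.
    Returns about N2 = N1 / spd linearly interpolated rows."""
    params = np.asarray(params, dtype=float)
    n1 = len(params)
    n2 = max(1, int(round(n1 / spd)))
    pos = np.linspace(0.0, n1 - 1, n2)       # fractional source frames
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n1 - 1)
    frac = (pos - lo)[:, None]
    return (1.0 - frac) * params[lo] + frac * params[hi]

track = np.arange(10.0)[:, None]             # ten frames of one parameter
double_speed = modify_time_axis(track, 2.0)  # about five frames remain
```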

With this system, since the ultimately obtained parameter sequence is decoded after being arranged in a definite order with an inherent interval of 20 ms, arbitrary speed control in the direction of increase or decrease is easy to implement. Moreover, speed increase and speed decrease can be carried out by the same processing, without transition points.

Thus, densely recorded contents can be reproduced at a speed twice the real-time speed. Since the pitch and the phonemes remain unchanged despite the increased playback speed, densely recorded contents can still be understood when playback is performed at the higher speed. Moreover, as regards the speech codec, the additional operations, for example the arithmetic operations after decoding and signal output, that are required when code-excited linear prediction (CELP) coding is used, can be dispensed with.

Although in the above-described first embodiment the unit 3 for calculating modified encoding parameters is separate from the decoding unit 6, the unit 3 may also be provided within the decoding unit 6.

When the parameters are calculated by the unit 3 for calculating modified encoding parameters in the speech signal reproducing device 1, the interpolation operation on Am may be performed on the vector-quantized values or on the inverse-vector-quantized values.

The following is a description of the speech signal transmission device 50, designed to carry out the method of transmitting speech signals according to the present invention. Fig. shows that the speech signal transmission device 50 includes a transmitting device 51 designed to divide the input speech signal into predetermined time-domain frames as units, to encode the input speech signal on a frame basis to find the encoding parameters, to interpolate the encoding parameters to find the modified encoding parameters, and to transmit the modified encoding parameters. The speech signal transmission device 50 also includes a receiving device 56 designed to receive the modified encoding parameters and to synthesize harmonic waves and noise.

That is, the transmitting device 51 includes an encoding unit 53 designed to divide the input speech signal into predetermined time-domain frames as units and to encode the speech signal on a frame basis to extract the encoding parameters, an interpolator 54 intended to interpolate the encoding parameters in order to determine the modified encoding parameters, and a transmission unit 55 for transmitting the modified encoding parameters. The receiving device 56 includes a reception unit 57, an interpolator 58 intended to interpolate the modified encoding parameters, and a decoding unit 59 designed to synthesize harmonic waves and noise on the basis of the interpolated parameters and to output the synthesized speech signal at the output terminal 60.

The basic operation of the encoding unit 53 and the decoding unit 59 is similar to that of the corresponding units in the speech signal reproducing device 1, so that a detailed description thereof is omitted here for simplicity.

The operation of the transmitting device 51 is described with reference to the flow chart of Fig., which jointly shows the encoding steps of the encoding unit 53 and the interpolation steps of the interpolator 54.

The encoding unit 53 extracts the encoding parameters, consisting of the LSP, the pitch Pch, V/UV and Am, in steps S31 and S33. In particular, the LSP is interpolated and rearranged by the interpolator 54 at step S31 and quantized at step S32, while the pitch Pch, V/UV and Am are interpolated and rearranged at step S34 and quantized at step S35. These quantized data are transmitted by the transmission unit 55 to the receiving device 56.

The quantized data received by the reception unit 57 of the receiving device 56 are sent to the interpolation unit 58, where the parameters are interpolated and rearranged at step S36. At step S37 the data are synthesized by the decoding unit 59.

Thus, in order to increase the speed by compressing the time axis, the speech signal transmission device 50 interpolates the parameters and changes the frame interval of the parameters during transmission. Meanwhile, since reproduction during reception is performed by finding parameters at a constant frame interval equal to 20 ms, the speed-control algorithm can be used directly for bit-rate conversion.

That is, it might be assumed that, when parameter interpolation is used for speed control, this interpolation is performed in the decoding device. However, if this processing is performed in the encoding device, so that data compressed (thinned out) along the time axis are encoded and then expanded (interpolated) along the time axis in the decoding device, the bit rate can be adjusted in accordance with the ratio spd.

If the fixed transmission rate is, for example, 1.975 kbit/s and encoding is performed at double speed by setting spd=0.5, then, since encoding is performed for 5 seconds instead of the inherent 10 seconds, the transmission rate becomes 1.975×0.5 kbit/s.
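The rate arithmetic in this example is simply a scaling of the fixed rate by spd; a one-line illustration (function name is hypothetical):

```python
def effective_bit_rate(fixed_kbps, spd):
    """Average transmission rate when encoding is performed at
    1/spd of the inherent duration: the same content occupies spd
    times the original time, so the rate scales by spd."""
    return fixed_kbps * spd

rate = effective_bit_rate(1.975, 0.5)   # 1.975 x 0.5 = 0.9875 kbit/s
```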

In addition, the encoding parameters obtained in the encoding unit 53, shown in Fig. a, are interpolated and rearranged by the interpolator 54 at an arbitrary interval, for example equal to 30 ms, as shown in Fig. b. Then the encoding parameters are interpolated and rearranged by the interpolator 58 of the receiving device 56 back to 20 ms, as shown in Fig. c, and are synthesized by the decoding unit 59.
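The 20 ms → 30 ms → 20 ms re-gridding described here can be sketched with linear interpolation over the parameter time axis. This is an illustrative Python/NumPy sketch; the function and variable names are assumptions, and real parameter tracks are not exactly recovered by the round trip (a linear track, as below, is):

```python
import numpy as np

def resample_track(track, src_ms, dst_ms):
    """Linearly re-grid a one-dimensional parameter track from a
    src_ms frame interval to a dst_ms frame interval."""
    track = np.asarray(track, dtype=float)
    t_src = np.arange(len(track)) * float(src_ms)
    t_dst = np.arange(0.0, t_src[-1] + 1e-9, float(dst_ms))
    return np.interp(t_dst, t_src, track)

pitch = np.array([100., 110., 120., 130., 140., 150., 160.])  # 20 ms grid
sent = resample_track(pitch, 20, 30)   # transmitted at 30 ms spacing
back = resample_track(sent, 30, 20)    # receiver restores the 20 ms grid
```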

If a similar arrangement is provided in the decoding device, the speed can be restored to its initial value, although the speech signal can also be heard at the higher or lower speed. That is, the speed control can be used as a variable bit-rate codec.

1. A method of reproducing an input speech signal on the basis of encoding parameters obtained by dividing the input speech signal into predetermined frames on the time axis and by encoding the input speech signal on a frame basis, comprising the steps of interpolating the encoding parameters to determine modified encoding parameters associated with frame-based time points, and generating a modified speech signal, differing along the time axis from the input speech signal, on the basis of the modified encoding parameters.

2. The method according to claim 1, characterized in that the modified speech signal is generated by at least synthesizing sine waves on the basis of the modified encoding parameters for reproducing the speech signals.

3. The method according to claim 2, characterized in that the output period of the encoding parameters is changed by compression-expansion of the time axis of the encoding parameters generated in each predetermined frame, before or after the interpolation step.

4. The method according to claim 1, characterized in that the interpolation of the encoding parameters comprises linear interpolation of the linear spectral pair parameters, the pitch and the residual spectral envelope contained in the encoding parameters.

5. The method according to claim 1, characterized in that the encoding parameters are determined by representing residual values of short-term prediction of the input speech signal as synthesized sine waves and noise, and by encoding the frequency-spectral information of each of the synthesized sine waves and the noise.

6. A speech signal reproducing device in which the input speech signal is restored on the basis of encoding parameters obtained by dividing the input speech signal into predetermined frames on the time axis and by frame-based encoding of the input speech signal to find the encoding parameters, comprising means for interpolating the encoding parameters, designed to determine modified encoding parameters associated with frame-based time points, and means for generating a modified speech signal, used to produce a transformed speech signal differing along the time axis from the input speech signal, on the basis of the modified encoding parameters.

7. The device according to claim 6, characterized in that the means for generating the modified speech signal is configured to at least synthesize harmonic waves in accordance with the modified encoding parameters.

8. The device according to claim 7, characterized in that it further comprises period-changing means designed for compression-expansion of the time axis of the encoding parameters generated in each predetermined frame, in order to change the output period of the encoding parameters, and installed before or after the interpolation means.

9. The device according to claim 6, characterized in that the means for interpolating the encoding parameters is configured for linear interpolation of the linear spectral pair parameters, the pitch and the residual spectral envelope contained in the encoding parameters.

10. The device according to claim 6, characterized in that the encoding parameters are determined by representing residual values of short-term prediction of the input speech signal as synthesized sine waves and noise, and by encoding the frequency-spectral information of each of the synthesized harmonic waves and the noise.

11. A method of transmitting input speech signals, in which the encoding parameters are obtained by dividing the input speech signal into predetermined frames on the time axis as units and by encoding the input speech signal on a frame basis, the encoding parameters are interpolated to determine modified encoding parameters associated with frame-based time points, and the modified encoding parameters are transmitted.

12. The method according to claim 11, characterized in that the encoding parameters are determined by representing residual values of short-term prediction of the input speech signal as synthesized sine waves and noise, and by encoding the frequency-spectral envelope of each of the synthesized harmonic waves and the noise.

 
