RussianPatents.com
Method and device for reproducing speech signals and method for transferring said signals
IPC classes for Russian patent Method and device for reproducing speech signals and method for transferring said signals (RU 2255380):
FIELD: speech recording/reproducing devices. SUBSTANCE: during encoding, speech signals are divided into frames, and the divided signals are encoded on a frame basis to output encoding parameters such as linear spectral pair (LSP) parameters, pitch, voiced/unvoiced (V/UV) decisions, or spectral amplitudes. To calculate modified encoding parameters, the encoding parameters are interpolated so as to find modified encoding parameters associated with frame-based time points. During decoding, harmonic waves and noise are synthesized on the basis of the modified encoding parameters, and the synthesized speech signals are output. EFFECT: broader functional capabilities, higher efficiency. 3 cl, 24 dwg
Background of the invention

Technical field. The present invention relates to a method and a device for reproducing speech signals, in which the input speech signal is divided into a plurality of frames as units and encoded to detect encoding parameters, on the basis of which at least harmonic waves are synthesized in order to reproduce the speech signal. The invention also concerns a method of transmitting the modified encoding parameters obtained by interpolating the encoding parameters.

Description of the related art. At present there are many encoding methods that compress signals by exploiting the statistical properties of audio signals, including speech signals, in the time domain and in the frequency domain, together with the psychoacoustic characteristics of the human auditory system. These encoding methods are roughly classified into time-domain coding, frequency-domain coding and analysis-synthesis coding. Meanwhile, with high-efficiency speech coding methods that process the signal on the time axis, as exemplified by code-excited linear prediction (CELP) coding, difficulties are encountered in converting (changing) the speed of the time axis, because of the large volume of processing applied to the signals output from the decoding device. In addition, such a method cannot be used, for example, for converting the fundamental frequency, because the speed control is performed in the domain of the decoded linear signal.

In view of the foregoing, the object of the present invention is to provide a method and a device for reproducing speech signals in which speed control at an arbitrary rate over a wide range can easily be performed with high quality while leaving the phoneme and the pitch unchanged.
In one aspect, the present invention provides a method for reproducing an input speech signal on the basis of encoding parameters obtained by dividing the input speech signal into pre-set frames on the time axis and encoding the so-divided input speech signal on a frame basis, the method comprising the steps of interpolating the encoding parameters to find modified encoding parameters associated with desired time points, and generating a speech signal modified in rate with respect to said input speech signal on the basis of the modified encoding parameters. Thus, speed control at an arbitrary rate over a wide range can easily be performed with high signal quality while leaving the phoneme and the pitch unchanged. In another aspect, the present invention provides a device for reproducing a speech signal, in which the input speech signal is restored on the basis of encoding parameters obtained by dividing the input speech signal into pre-set frames on the time axis and encoding the so-divided input speech signal on a frame basis, the device comprising interpolation means for interpolating the encoding parameters to find modified encoding parameters associated with desired time points, and speech-signal generating means for generating a speech signal modified in rate with respect to said input speech signal on the basis of the modified encoding parameters. It thus becomes possible to regulate the bit rate. Therefore, speed control at an arbitrary rate over a wide range can easily be performed with high signal quality while leaving the phoneme and the pitch unchanged.
In yet another aspect, the present invention provides a method for transmitting speech signals, in which encoding parameters are found by dividing the input speech signal into pre-set frames on the time axis as units and encoding the so-divided input speech signal on a frame basis to detect the encoding parameters; the encoding parameters thus found are interpolated to determine modified encoding parameters associated with desired time points; and the modified encoding parameters are transmitted, thereby making it possible to regulate the bit rate. By dividing the input speech signal into pre-set frames on the time axis, encoding the signal on a frame basis to detect the encoding parameters, interpolating the encoding parameters to determine modified encoding parameters, and synthesizing at least harmonic waves on the basis of the modified encoding parameters to restore the speech signal, it becomes possible to adjust the speed to an arbitrary rate.

Brief description of the drawings

Fig.1 is a structural block diagram illustrating the layout of a speech signal reproducing device according to a first embodiment of the present invention. Fig.2 is a structural block diagram illustrating the layout of the speech signal reproducing device shown in Fig.1. Fig.3 is a block diagram illustrating the encoder of the speech signal reproducing device shown in Fig.1. Fig.4 is a block diagram illustrating the schematic layout of multi-band excitation (MBE) analysis as an illustrative example of the harmonic and noise coding scheme of the encoding device. Fig.5 illustrates the layout of a vector quantizer.
Fig.6 is a graph illustrating mean values of the input signal. Fig.7 is a graph illustrating mean values of the weighting factor. Fig.8 is a graph illustrating the method of generating vector quantization codebooks for voiced sound, for unvoiced sound, and for voiced and unvoiced sounds collected together. Fig.9 is a flowchart illustrating the schematic operation of the circuit for calculating the modified encoding parameters used in the speech signal reproducing device shown in Fig.1. Fig.10 is a schematic view illustrating the modified encoding parameters obtained by the circuit for calculating the modified encoding parameters on the time axis. Fig.11 is a flowchart illustrating the detailed operation of the circuit for calculating the modified encoding parameters used in the speech signal reproducing device shown in Fig.1. Figs.12A, 12B and 12C are schematic views showing an illustrative operation of the circuit for calculating the modified encoding parameters. Figs.13A, 13B and 13C are schematic views showing another illustrative operation of the circuit for calculating the modified encoding parameters. Fig.14 is a block diagram illustrating the decoding device of the speech signal reproducing device. Fig.15 is an electrical block diagram illustrating the layout of multi-band excitation (MBE) synthesis as an illustrative example of the harmonic and noise synthesis scheme used in the decoding device. Fig.16 is a block diagram illustrating a speech signal transmission device according to a second embodiment of the present invention. Fig.17 is a flowchart illustrating the operation of the transmitting side of the speech signal transmission device. Figs.18A, 18B and 18C illustrate the operation of the speech signal transmission device.
Description of the preferred embodiments of the invention

Preferred embodiments of the method and device for reproducing speech signals, and of the method for transmitting speech signals, according to the present invention will now be described in detail with reference to the drawings. First, a description is given of a device for reproducing speech signals to which the method and apparatus for reproducing speech signals of the present invention are applied. Fig.1 shows a block diagram of a speech signal reproducing device 1, in which the input speech signal is divided into pre-set frames as units on the time axis and encoded on a frame basis to detect the encoding parameters. On the basis of these encoding parameters, sine waves and noise are synthesized to reproduce the speech signal. In particular, in this speech signal reproducing device 1, the encoding parameters are interpolated to determine modified encoding parameters associated with desired time points, and the sine waves and the noise are synthesized on the basis of these modified encoding parameters. Although sine waves and noise are synthesized here on the basis of the modified encoding parameters, it is also possible to synthesize at least harmonic waves. The speech signal reproducing device 1 includes a coding block 2 for dividing the speech signal supplied to an input terminal 10 into frames as units and encoding the speech signal on a frame basis, so as to output encoding parameters such as linear spectral pair (LSP) parameters, the pitch, voiced (V)/unvoiced (UV) decisions, or the spectral amplitude Am.
The speech signal reproducing device 1 also includes a computing unit 3 for interpolating the encoding parameters so as to determine the modified encoding parameters associated with desired time points, and a decoding block 6 for synthesizing harmonic waves and noise on the basis of the modified encoding parameters and outputting the synthesized speech signal at an output terminal 37. The coding block 2, the computing unit 3 for calculating the modified encoding parameters, and the decoding block 6 are controlled by a controller (not shown). The computing unit 3 for calculating the modified encoding parameters of the speech signal reproducing device 1 includes a period modification circuit 4 for compressing or expanding the time axis of the encoding parameters produced in each pre-set frame, so as to modify the output period of the encoding parameters, and an interpolation circuit 5 for interpolating the period-modified parameters so as to produce modified encoding parameters associated with frame-based time points, as shown, for example, in Fig.2. The computing unit 3 for calculating the modified encoding parameters will be described later; first, the coding block 2 is described. The coding block 2 and the decoding block 6 represent the short-term prediction residual, for example the linear prediction coding (LPC) residual, by harmonic and noise coding. Alternatively, the coding block 2 and the decoding block 6 carry out multi-band excitation (MBE) coding or multi-band excitation (MBE) analysis. In conventional code-excited linear prediction (CELP) coding, the LPC residual is directly vector-quantized as a time-domain waveform.
Since the coding block 2 encodes the residual by harmonic coding or MBE analysis, a smoother synthesized waveform can be obtained by vector quantization of the spectral envelope amplitudes of the harmonics with a smaller number of bits, while the waveform output by the LPC synthesis filter is also highly agreeable in sound quality. Meanwhile, the spectral envelope amplitudes are quantized using the technique of dimension conversion, or data number conversion, proposed by the present applicant in Japanese patent publication Kokai JP-A-51800. That is, the spectral envelope amplitudes are vector-quantized with a pre-set number of vector dimensions. Fig.3 shows an illustrative layout of the coding block 2. The speech signal supplied to the input terminal 10 is freed of signals of unwanted frequency bands by a filter 11 and then fed to a linear prediction coding (LPC) analysis circuit 12 and to an inverse filter circuit 21. The LPC analysis circuit 12 applies a Hamming window to the input waveform, with a block length of the order of 256 samples, and finds the linear prediction coefficients, the so-called α-parameters, by the autocorrelation method. The framing interval, as a unit of output data, is of the order of 160 samples. If the sampling frequency is, for example, 8 kHz, the framing interval of 160 samples corresponds to 20 milliseconds. The α-parameters from the LPC analysis circuit 12 are fed to an α-to-LSP conversion circuit 13 for conversion into linear spectral pair (LSP) parameters. That is, the α-parameters, found as direct-type filter coefficients, are converted into, for example, ten, that is five pairs of, LSP parameters. This conversion is performed using, for example, the Newton-Raphson method. The reason for converting the α-parameters into LSP parameters is that the LSP parameters are superior to the α-parameters in interpolation characteristics.
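The α-parameter extraction described above (Hamming windowing of a 256-sample block followed by the autocorrelation method) can be sketched as follows. This is an illustrative Python sketch, not the patented implementation; the function name, the default order and the test signal are assumptions.

```python
import numpy as np

def lpc_autocorrelation(block, order=10):
    """Estimate LPC alpha-parameters for one analysis block by the
    autocorrelation (Levinson-Durbin) method, as in analysis circuit 12.
    Returns (alpha, residual_energy); A(z) = 1 + sum alpha[i-1] z^-i."""
    w = np.hamming(len(block))               # Hamming window, e.g. 256 samples
    x = block * w
    full = np.correlate(x, x, mode="full")
    r = full[len(x) - 1 : len(x) + order]    # autocorrelation lags 0..order
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:], err
```

With 8 kHz sampling, such a block would be analyzed every 160 samples (20 ms), as stated above.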
The LSP parameters from the α-to-LSP conversion circuit 13 are vector-quantized by an LSP vector quantizer 14. At this point, the inter-frame difference may be found first before vector quantization. Alternatively, plural frames may be collected together and quantized by matrix quantization. For quantization, the LSP parameters, calculated every 20 ms, are vector-quantized, with one frame being 20 ms long. The quantized output signal of the LSP vector quantizer 14, that is the index of the LSP vector quantization, is taken out at a terminal 15, while the quantized LSP vectors are fed to an LSP interpolation circuit 16. The LSP interpolation circuit 16 interpolates the LSP vectors, vector-quantized every 20 ms, to provide an eightfold rate. That is, the LSP vectors are arranged to be updated every 2.5 ms. The reason is that, if the residual waveform is processed by analysis-synthesis using multi-band excitation (MBE) encoding-decoding, the envelope of the synthesized waveform is an extremely smooth waveform, so that, if the linear prediction (LPC) coefficients change abruptly every 20 ms, extraneous sounds tend to be produced. The generation of such extraneous sounds can be prevented if the LPC coefficients change gradually every 2.5 milliseconds. For inverse filtering of the input speech signal using the LSP vectors thus interpolated at 2.5 ms intervals, the LSP parameters are converted by an LSP-to-α conversion circuit 17 into α-parameters, which are direct-type filter coefficients of, for example, the tenth order. The output signal of the LSP-to-α conversion circuit 17 is fed to the inverse filter circuit 21, which performs inverse filtering with the α-parameters updated at 2.5 ms intervals so as to produce a smooth output signal.
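The eightfold interpolation performed by the LSP interpolation circuit 16 amounts to producing one LSP vector per 2.5 ms sub-interval between two quantized 20 ms vectors. A minimal sketch follows; the linear interpolation law is an assumption for illustration, the text only requiring that the vectors be updated every 2.5 ms.

```python
import numpy as np

def interpolate_lsp(lsp_prev, lsp_cur, steps=8):
    """One interpolated LSP vector per 2.5 ms sub-interval between two
    adjacent 20 ms frames (circuit 16); the last row equals lsp_cur."""
    lsp_prev = np.asarray(lsp_prev, float)
    lsp_cur = np.asarray(lsp_cur, float)
    t = np.arange(1, steps + 1) / steps          # 1/8, 2/8, ..., 1
    return (1.0 - t)[:, None] * lsp_prev + t[:, None] * lsp_cur
```

Each of the eight rows would then be converted to α-parameters for one 2.5 ms segment of inverse filtering.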
The output signal of the inverse filter circuit 21 is fed to a harmonic and noise encoding circuit 22, specifically a multi-band excitation (MBE) analysis circuit. The harmonic and noise encoding circuit (MBE analysis circuit) 22 analyzes the output signal of the inverse filter circuit 21 by a method similar to MBE analysis. That is, the harmonic and noise encoding circuit 22 detects the pitch and calculates the amplitude Am of each harmonic. The harmonic and noise encoding circuit 22 also makes the voiced (V)/unvoiced (UV) discrimination of the speech signal and converts the number of harmonic amplitudes Am, which varies with the pitch, into a constant number by dimension conversion. For pitch detection, the autocorrelation of the input LPC residual is used, as explained below. Fig.4 shows an illustrative example of the multi-band excitation (MBE) analysis and encoding circuit as the harmonic and noise encoding circuit 22. In the MBE analysis circuit shown in Fig.4, modeling is performed on the assumption that a voiced portion and an unvoiced portion are both present in the frequency band at the same time point, that is within the same block or frame. The LPC residual, that is the residual of linear prediction coding, from the inverse filter circuit 21 is supplied to the input terminal 111 shown in Fig.4. Thus, the MBE analysis circuit performs MBE analysis and encoding on the input LPC residual. The LPC residual supplied to the input terminal 111 is fed to a pitch extraction block 113, a windowing unit 114 and a sub-block power calculation unit 126, as described below. Since the input signal of the pitch extraction block 113 is the LPC residual, pitch detection can be performed by detecting the maximum value of the autocorrelation of the residual.
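Coarse pitch detection from the LPC residual, as performed by the pitch extraction block 113, can be sketched as a search for the autocorrelation maximum over plausible lags. The 60-400 Hz search range and the function name are illustrative assumptions, not values stated in the text.

```python
import numpy as np

def coarse_pitch(residual, fs=8000, fmin=60.0, fmax=400.0):
    """Integer pitch period (in samples) from the autocorrelation maximum
    of the LPC residual, as in the open-loop search of block 113."""
    x = np.asarray(residual, float)
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    lo = int(fs / fmax)                 # shortest plausible period
    hi = int(fs / fmin)                 # longest plausible period
    return lo + int(np.argmax(r[lo:hi + 1]))
```

The integer lag found here is the coarse value refined by the closed-loop fine search described next.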
The pitch extraction block 113 performs the pitch search by an open-loop search. The extracted pitch data is fed to a fine pitch search block 116, where a fine pitch search is performed by a closed-loop search. The windowing unit 114 applies a pre-set windowing function, for example a Hamming window, to each block of N samples, moving the windowed block along the time axis at the frame interval. The time-domain data sequence from the windowing unit 114 is processed by an orthogonal transform block 115, for example by a fast Fourier transform (FFT). If all the bands in the block are found to be unvoiced (UV), the sub-block power calculation unit 126 extracts a characteristic value representing the time-domain envelope of the unvoiced signal of the block. The fine pitch search block 116 is supplied with the coarse pitch data, extracted as an integer by the pitch extraction block 113, and with the frequency-domain data produced by the FFT in the orthogonal transform block 115. The fine pitch search block 116 swings by ± several samples, at a step of 0.2 to 0.5, about the coarse pitch value as the center, in order to arrive at fine pitch data with an optimal fractional (floating) point. The analysis-by-synthesis method is used as the fine search technique, and the pitch is selected that gives a synthesized power spectrum closest to the power spectrum of the original signal. That is, a number of pitch values above and below the coarse pitch found by the pitch extraction block 113 as the center are provided at a step of, for example, 0.25. For these pitch values, differing slightly from one another, the sum of errors Σεm is found. In this case, once the pitch is set, the bandwidth is set, so that, using the power spectrum on the frequency axis and the spectrum of the excitation signal, the error εm is determined for each band.
Thus, the sum of errors Σεm over the totality of the bands may be found. This sum of errors Σεm is found for each pitch value, and the pitch corresponding to the minimum sum of errors is selected as the optimal pitch. In this way the optimal fine pitch, at a step of approximately 0.25, is found by the fine pitch search block, and the amplitude |Am| for the optimal pitch is determined. In the above description of the fine pitch search it is assumed that all the bands are voiced. However, since the model employed in the analysis-synthesis system used here is a model in which an unvoiced region is present on the frequency axis at the same time point, it becomes necessary to make the voiced/unvoiced discrimination for each of the bands. The optimal pitch from the fine pitch search block 116 and the amplitude data are used for this discrimination. Meanwhile, since the number of bands divided on the basis of the fundamental pitch frequency, that is the number of harmonics, varies in a range of about 8 to 63 depending on the pitch of the speech, the number of V/UV flags per band varies similarly. Thus, in the present embodiment, the results of the V/UV discrimination are grouped, or degraded, into a pre-set number of bands of fixed width. In particular, a pre-set frequency range, for example 0 to 4000 Hz, including the speech range, is divided into NB bands, such as 12 bands, and the difference between the weighted mean value of the SNR of each band and a pre-set threshold value Th2 is used to judge the V/UV discrimination for each band.
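The analysis-by-synthesis error used in the fine pitch search can be sketched as follows: for each band around harmonic m, the amplitude minimising the error between the measured magnitude spectrum S and the excitation magnitude spectrum E is Am = ΣSE/ΣE², the remaining error is εm, and the candidate pitch minimising Σεm wins. This is an illustrative sketch over real magnitude spectra; the band-edge rounding and function names are assumptions.

```python
import numpy as np

def band_amplitude_and_error(S, E, lo, hi):
    """Best amplitude Am for one harmonic band and the residual error
    eps_m = sum (S - Am*E)^2 over the band's FFT bins."""
    s, e = S[lo:hi], E[lo:hi]
    denom = float(e @ e)
    Am = float(s @ e) / denom if denom > 0.0 else 0.0
    d = s - Am * e
    return Am, float(d @ d)

def total_error_for_pitch(S, E, pitch, n_fft):
    """Sum of per-band errors for one candidate pitch; the fine search
    (block 116) keeps the candidate minimising this sum."""
    f0_bins = n_fft / pitch             # harmonic spacing in FFT bins
    total, m = 0.0, 1
    while (m + 0.5) * f0_bins < len(S):
        lo = max(int(round((m - 0.5) * f0_bins)), 0)
        hi = int(round((m + 0.5) * f0_bins))
        total += band_amplitude_and_error(S, E, lo, hi)[1]
        m += 1
    return total
```

Evaluating `total_error_for_pitch` at candidate pitches spaced 0.25 samples apart around the coarse value reproduces the swing described above.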
An amplitude evaluation unit 118U for the unvoiced signal is supplied with the frequency-domain data from the orthogonal transform block 115, the fine pitch data from the fine pitch search block 116 and the amplitude data, and re-evaluates the amplitude for the bands found to be unvoiced. The data from the amplitude evaluation unit 118U for the unvoiced signal arrive at a data number conversion unit 119, which is a kind of sampling rate converter. The data number conversion unit 119 is used to produce a constant number of data items, in view of the fact that the number of divided bands of the frequency spectrum, and hence the number of data items, in particular the number of amplitude data items, varies with the pitch. That is, if the effective frequency band extends, for example, up to 3400 Hz, this effective band is divided into 8 to 63 bands depending on the pitch, so that the number mMX+1 of amplitude data items varies accordingly. The data number conversion unit 119 appends, to the amplitude data of one effective band on the frequency axis, dummy data interpolating from the last data item in the block to the first data item in the block, so as to increase the number of data items to NF. The data number conversion unit 119 then performs bandwidth-limiting-type oversampling with an oversampling factor Os, for example equal to 8, to find an Os-fold number of amplitude data items. This Os-fold number ((mMX+1)×Os) of amplitude data items is linearly interpolated to produce a still larger number NM of data items, for example 2048 data items. The NM data items are decimated for conversion into the pre-set constant number M, for example 44, of data items. The data (the amplitude data of the pre-set constant number M) from the data number conversion unit 119 arrive at a vector quantizer 23, either as one vector with the number of data items M or gathered into vectors each having a pre-set number of data items, for vector quantization.
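The dimension conversion of unit 119 can be sketched as below. For brevity the bandwidth-limiting-type oversampling is replaced by plain linear interpolation, so the numbers Os = 8, NM = 2048 and M = 44 come from the text but the interpolation kernel is an assumption of this sketch.

```python
import numpy as np

def convert_data_number(amplitudes, M=44, oversample=8, NM=2048):
    """Map a pitch-dependent number of harmonic amplitudes onto a constant
    number M: oversample, linearly interpolate up to NM points, then
    decimate to M (data number conversion unit 119, simplified)."""
    a = np.asarray(amplitudes, float)
    n = len(a)
    xs = np.linspace(0.0, n - 1, n * oversample)         # Os-fold data
    up = np.interp(xs, np.arange(n), a)
    xm = np.linspace(0.0, len(up) - 1, NM)               # expand to NM
    dense = np.interp(xm, np.arange(len(up)), up)
    idx = np.linspace(0, NM - 1, M).round().astype(int)  # decimate to M
    return dense[idx]
```

Whatever the number of harmonics (8 to 63), the output always has M = 44 values ready for vector quantization.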
The pitch data from the fine pitch search block 116 is transmitted via a fixed contact of a changeover switch 27. The above data are obtained by processing the data of an N-sample block, for example 256 samples; since the block advances along the time axis at the above-mentioned frame interval, the transmitted data are taken out on a frame basis. That is, the pitch data, the V/UV discrimination data and the amplitude data are updated at the frame period. As the V/UV discrimination data from the V/UV discrimination block 117, it is possible to use data in which the number of bands is reduced, or degraded, to 12, or to use data specifying the position of one or more boundaries between the voiced (V) and unvoiced (UV) regions over the entire frequency range. Alternatively, the totality of the bands may be represented by a single one of V and UV, or the V/UV discrimination may be made on a frame basis. If a block is found to be entirely unvoiced (UV), one block of, for example, 256 samples may be further subdivided into a plurality of sub-blocks, each consisting of 32 samples, which are fed to the sub-block power calculation unit 126. The sub-block power calculation unit 126 calculates the proportion, or ratio, of the mean power, or RMS value, of the totality of the samples in the block, for example 256 samples, to the mean power, or RMS value, of the samples in each sub-block. That is, the mean power of, for example, the k-th sub-block and the mean power of the full block are found, and the square root of the ratio of the mean power of the entire block to the mean power p(k) of the k-th sub-block is calculated. The square roots thus found are taken as a vector of a pre-set dimension, which undergoes vector quantization in a vector quantizer 127 arranged downstream of the sub-block power calculation unit 126. The vector quantizer 127 performs 8-dimensional, 8-bit direct vector quantization (with a codebook size of 256).
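The sub-block power feature for a fully unvoiced block can be sketched as follows, following the text literally (square root of the ratio of the whole-block mean power to each sub-block's mean power); the guard against zero-power sub-blocks is an added assumption of the sketch.

```python
import numpy as np

def subblock_rms_vector(block, sub_len=32):
    """Time-envelope feature of a fully unvoiced block (unit 126): for each
    32-sample sub-block, sqrt(mean power of whole block / mean power of
    sub-block). A 256-sample block yields an 8-dimensional vector."""
    x = np.asarray(block, float)
    p_block = np.mean(x ** 2)
    p_sub = np.mean(x.reshape(-1, sub_len) ** 2, axis=1)
    return np.sqrt(p_block / np.maximum(p_sub, 1e-12))  # guard vs. silence
```

The resulting 8-dimensional vector is exactly the size handled by the 8-dimensional, 8-bit quantizer 127.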
The output index UV_E of this vector quantizer 127, that is the code representing the vector, is fed to a fixed contact of the changeover switch 27. The switch 27 is actuated by the discrimination output signal of the V/UV discrimination block 117, so that either the pitch information or the index UV_E is selected. Thus, the output signal of the vector quantization of the sub-blocks normalized by the RMS values is transmitted by being inserted into the slot essentially used for transmitting the pitch information. That is, if the block is found to be entirely unvoiced (UV), the pitch is unnecessary; therefore, if and only if the V/UV discrimination flags are found to be entirely unvoiced, the index UV_E of the vector quantization output is transmitted in place of the pitch information. The following is a description, with reference to Fig.3, of the weighted vector quantization of the spectral envelope (Am) in the vector quantizer 23. The vector quantizer 23 is of a two-stage, L-dimensional, for example 44-dimensional, configuration. That is, the sum of the output vectors of two 44-dimensional vector quantization codebooks, each with a codebook size of 32, is multiplied by a gain gi, and the resulting product is used as the quantized value of the 44-dimensional spectral envelope vector x. The spectral envelope Am, obtained by multi-band excitation (MBE) analysis of the linear prediction coding (LPC) residual and converted into the pre-set dimension, is the vector x. The energy of the quantization error is then of the form E = ||W(Hx − Hgi(s0i + s1j))||², where H denotes the frequency-axis characteristics of the LPC synthesis filter, W a weighting matrix representing the characteristics of perceptual weighting on the frequency axis, and s0i and s1j the output vectors of the two shape codebooks.
The energy of the quantization error is found by sampling the corresponding L-dimensional, for example 44-dimensional, points of the frequency characteristics of the synthesis filter H(z) = 1/(1 + Σ αi z^(-i)), summed for i = 1 to P, where αi, 1 ≤ i ≤ P, are the α-parameters obtained by LPC analysis of the current frame. For the calculation, 0s are appended to the sequence 1, α1, α2, ..., αP to give the sequence 1, α1, α2, ..., αP, 0, 0, ..., 0, thereby providing, for example, 256-point data. A 256-point fast Fourier transform is then performed and the amplitude values are calculated. The matrix W for the weighting of auditory perception is defined from the weighting filter W(z) = (1 + Σ αi λa^i z^(-i)) / (1 + Σ αi λb^i z^(-i)), where αi is the result of the LPC analysis and λa, λb are constant values, such as λa = 0.4, λb = 0.9. The matrix W can be determined from the frequency characteristics of W(z): 256-point sequences 1, α1 λa, α2 λa², ..., αP λa^P, 0, ..., 0 and 1, α1 λb, ..., αP λb^P, 0, ..., 0 are formed, a 256-point FFT is performed on each, and the frequency characteristic is found as the ratio of the resulting amplitudes for the points 0 ≤ i ≤ 128. The frequency characteristics are then found for the respective points of the, for example, 44-dimensional vector in the following manner. Although linear interpolation should be used for more accurate results, the values of the nearest points are used in the example below. That is, ω[i] = ω0[nint(128·i/L)], where 1 ≤ i ≤ L and nint(x) is the function returning the integer nearest to x. The values h(1), h(2), ..., h(L) of H are found in a similar manner. As a modified embodiment, the frequency characteristics may be found after first determining H(z)W(z), in order to reduce the number of fast Fourier transform operations. The denominator of H(z)W(z) is expanded into a polynomial of order 2P with coefficients β1, β2, ..., β2P. By setting the sequence 1, β1, β2, ..., β2P, 0, 0, ..., 0, 256-point data, for example, is formed. A 256-point fast Fourier transform is then performed to provide the amplitude frequency characteristics for the points 0 ≤ i ≤ 128.
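The FFT-based evaluation of the weighting characteristic described above can be sketched as follows (zero-pad each polynomial of W(z) to 256 points, take the FFT, divide the magnitudes, then sample nearest points); the function names are illustrative assumptions.

```python
import numpy as np

def weighting_response(alpha, lam_a=0.4, lam_b=0.9, n_fft=256):
    """|W| on a 129-point half-spectrum for the weighting filter
    W(z) = (1 + sum a_i*la^i z^-i) / (1 + sum a_i*lb^i z^-i)."""
    alpha = np.asarray(alpha, float)
    p = len(alpha)
    i = np.arange(1, p + 1)
    num = np.zeros(n_fft)
    num[0] = 1.0
    num[1:p + 1] = alpha * lam_a ** i       # numerator coefficients, zero-padded
    den = np.zeros(n_fft)
    den[0] = 1.0
    den[1:p + 1] = alpha * lam_b ** i       # denominator coefficients, zero-padded
    return np.abs(np.fft.rfft(num)) / np.abs(np.fft.rfft(den))

def sample_points(mag, L=44):
    """Nearest-point sampling w[i] = w0[nint(128*i/L)], 1 <= i <= L."""
    idx = np.rint(128.0 * np.arange(1, L + 1) / L).astype(int)
    return mag[idx]
```

The 44 sampled values serve as the diagonal elements of the weighting matrix used in the quantization error energy.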
This value is found for each of the respective points of the L-dimensional vector. If the number of FFT points is small, linear interpolation should be used; here, however, the nearest values are employed. That is, wh[i] = wh0[nint(128·i/L)], where 1 ≤ i ≤ L. The matrix W′ having these nearest values as diagonal elements is the same matrix as that given above. Using this matrix, that is the frequency characteristics of the weighted synthesis filter, the error energy can be rewritten as E = ||W′(x − gi(s0i + s1j))||². The following is a description of the method of learning the shape codebooks and the gain codebook. First, for all frames k that select a code vector s0c for the codebook CB0, the expected value of the distortion is minimized; minimizing this expected value gives the optimal centroid condition for the shape, which is solved using the inverse matrix, where { }^(-1) denotes the inverse matrix. Next, optimization with respect to the gain is considered. The expected value Jg of the distortion for the k-th frame selecting the code word gc of the gain is minimized, and solving the resulting equation gives the optimal centroid condition for the gain. The above conditions give the optimal centroid conditions for the shape and the gain. Next, the optimal encoding condition (the nearest neighbor condition) is considered. The shapes s0i and s1j are determined each time the input x and the weighting matrix W′ are given. Exhaustively, E should be found for all combinations of gl (0 ≤ l ≤ 31), s0i and s1j, and the combination giving the smallest E selected. However, assuming that sufficient accuracy is provided for gl, the search can be performed in two steps: 1) a search for the combination of s0i and s1j that maximizes the quantity (x^T W′^T W′ (s0i + s1j))² / ||W′(s0i + s1j)||², and 2) a search for the gl that is closest to the resulting optimal gain x^T W′^T W′ (s0i + s1j) / ||W′(s0i + s1j)||². This gives the optimal encoding condition (the nearest neighbor condition). Using the centroid conditions for the shape and the gain together with the nearest neighbor condition, the codebooks CB0, CB1 and CBg can be trained simultaneously by the generalized Lloyd algorithm (GLA).
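The two-step nearest-neighbor search described above can be sketched as follows; for brevity, the sums s0i + s1j of the two shape codebooks are flattened into a single candidate list, which is an illustrative simplification.

```python
import numpy as np

def gain_shape_search(x, shapes, gains, Wp):
    """Two-step gain-shape VQ search: 1) the shape s maximising
    (x^T A s)^2 / (s^T A s), with A = Wp^T Wp; 2) the codebook gain
    closest to the optimal scalar gain for that shape."""
    A = Wp.T @ Wp
    best_score, best_i, g_opt = -np.inf, -1, 0.0
    for i, s in enumerate(shapes):
        num = float(x @ A @ s)
        den = float(s @ A @ s)
        score = num * num / den
        if score > best_score:
            best_score, best_i, g_opt = score, i, num / den  # num/den = optimal gain
    j = int(np.argmin(np.abs(np.asarray(gains) - g_opt)))
    return best_i, j
```

The two-step decomposition avoids evaluating E for all 32 × 32 × 32 combinations of shapes and gains, as noted in the text.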
Referring to Fig.3, it will be noted that the vector quantizer 23 is connected through a changeover switch 24 to a codebook 25V for the voiced sound signal and a codebook 25U for the unvoiced sound signal. By controlling the switching of the switch 24 in dependence on the V/UV discrimination output signal of the harmonic and noise encoding circuit 22, vector quantization of the voiced sound and of the unvoiced sound is performed using the codebook 25V for voiced sound and the codebook 25U for unvoiced sound, respectively. The reason for switching the codebooks depending on the discrimination between voiced sound (V) and unvoiced sound (UV) is that, since weighted averaging of the parameters is performed, it is undesirable to average together values that differ significantly between V and UV. Meanwhile, the coding block 2 uses the weight w′ divided by the norm of the input signal x. When switching between the codebooks depending on the V/UV discrimination, the training data is similarly divided, in order to prepare the respective training data for the codebook for voiced sound and the codebook for unvoiced sound. To reduce the number of bits for the V/UV discrimination, the coding block 2 judges a frame to be a voiced (V) frame or an unvoiced (UV) frame according to whether or not the proportion of V exceeds 50 %. Figs.6 and 7 show the mean values of the input signal x and of the weighting factor w′/||x||, respectively. Fig.6 shows that the energy distribution differs between V and UV. Fig.8 shows how training proceeds for the three cases, that is for voiced sound (V), for unvoiced sound (UV), and for voiced and unvoiced sounds collected together. Fig.8 shows that dividing the codebook training into V and UV leads to a reduced expected value of the distortion of the output signal: the expected value becomes 3.72×0.538 + 7.011×0.462 = 5.24, which is an improvement of around 0.76 dB compared with the expected value of 6.25 for training V and UV together.
From this type of training, the improvement in the expected value is about 0.76 dB. It was further found that, when speech samples of four male and four female speakers outside the training set were processed in order to measure the SNR for the case where quantization is not performed, the separation into V and UV leads to an improvement of the segmental SNR of about 1.3 dB. The reason probably lies in the fact that the proportion of V is significantly higher than that of UV. It should be noted that, although the weighting factor w' used for perceptual weighting in vector quantization by the vector quantizer 23 is as described above by equation (6), a weighting factor w' that also takes the temporal masking effect into account can be obtained by determining the current w' with the past w' taken into account. As for the elements wh(1), wh(2), ..., wh(L) in the above equation (6), when these elements are calculated at time n, that is, for the n-th frame, they are denoted whn(1), whn(2), ..., whn(L). Taking the previous value into account, the weighting factor at time n is defined by the value An(i), where 1≤i≤L. In this case

An(i) = λAn-1(i) + (1−λ)whn(i)  for whn(i) ≤ An-1(i),
An(i) = whn(i)                  for whn(i) > An-1(i),

where λ may be set, for example, to λ=0.2. An(i), where 1≤i≤L, can then be used as the diagonal elements of the matrix serving as the above-mentioned weighting coefficients. Returning to Fig. 1, note the computing unit for modified encoding parameters 3. The speech signal reproducing device 1 modifies the encoding parameters output from the coding block 2 for a given speed by means of the computing unit for modified encoding parameters 3, which is intended to calculate the modified encoding parameters, and decodes the modified encoding parameters by the decoding block, so that, for example, continuously recorded content can be played back at twice real-time speed.
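The recursive temporal-masking update of the weights above can be sketched directly from the given formula (the function name and list representation are illustrative assumptions):

```python
def update_masking_weights(prev_A, wh, lam=0.2):
    """Recursive perceptual-weight update with temporal masking (sketch).

    Implements An(i) = lam*An-1(i) + (1-lam)*whn(i) when whn(i) <= An-1(i),
    and An(i) = whn(i) otherwise, with lam = 0.2 as suggested in the text.
    """
    return [lam * a + (1 - lam) * w if w <= a else w
            for a, w in zip(prev_A, wh)]
```

The asymmetric rule lets the weight jump up immediately when the new value is larger, but decay only gradually when it is smaller, which is what models the masking after-effect of a loud frame.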
Because the pitch and the phonemes remain unchanged despite the higher playback speed, the recorded content remains intelligible even when it is played back at a higher speed. Since the encoding parameters themselves are modified for the desired speed, the computing unit for modified encoding parameters 3 requires no processing after decoding and output of the signals, and can easily accommodate different fixed speeds with the same algorithm. Referring to the flow charts of Figs. 9 and 11, the operation of the computing unit for modified encoding parameters 3 of the speech signal reproducing device 1 is described in detail. As described with reference to Fig. 2, the computing unit for modified encoding parameters 3 consists of the period changing scheme 4 and the interpolation scheme 5. First, at step S1 of Fig. 9, the period changing scheme 4 receives through the input terminals 15, 28, 29 and 26 the encoding parameters, such as LSP, pitch, V-UV and Am. The pitch is set to Pch[n], V-UV to vuv[n], Am to am[n][l] and LSP to lsp[n][i]. The modified encoding parameters ultimately calculated by the computing unit for modified encoding parameters 3 are set to the corresponding values, as described above. At step S2, the period changing scheme 4 sets the number of frames corresponding to the initial time duration to N1, and at the same time sets the number of frames corresponding to the time duration after the change to N2. Then, at step S3, the period changing scheme 4 compresses the time axis of the speech of duration N1 to the duration N2. That is, the time-axis compression ratio spd in the period changing scheme 4 is determined as the ratio N2/N1. Next, at step S4, the interpolation scheme 5 sets m. Next, at step S5, the interpolation scheme 5 finds the two frames fr0 and fr1 and the distances "left" and "right" between the two frames fr0 and fr1 and the point m/spd. If the encoding parameters Pch, vuv, am and lsp are denoted with an asterisk (*), then the modified parameters are defined for 0≤m<N2.
However, since the ratio m/spd is generally not an integer, the modified encoding parameters for m/spd are created by interpolation from the two frames fr0=⌊m/spd⌋ and fr1=fr0+1. It should be noted that between the frame fr0, the point m/spd and the frame fr1 there is the relation shown in Fig. 10, namely left = m/spd − fr0 and right = fr1 − m/spd. The encoding parameters at m/spd in Fig. 10, that is, the modified encoding parameters, are generated by interpolation, as shown at step S6. The modified encoding parameters can be found simply by linear interpolation in the form:
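The frame mapping and linear interpolation just described can be sketched for a single scalar parameter as follows (a minimal sketch; the function name and the frame-indexed list `params` are assumptions):

```python
import math

def modified_param(params, m, spd):
    """Linear interpolation of a scalar encoding parameter at m/spd (sketch).

    fr0 = floor(m/spd), fr1 = fr0 + 1, left = m/spd - fr0 and
    right = fr1 - m/spd, as in Fig. 10; `params` is indexed by frame.
    """
    t = m / spd
    fr0 = math.floor(t)
    fr1 = fr0 + 1
    left = t - fr0
    right = fr1 - t
    # Each frame is weighted by its distance to the *other* frame,
    # so the nearer frame contributes more.
    return params[fr0] * right + params[fr1] * left
```

With spd < 1 the mapping stretches the parameter sequence (slower playback); with spd > 1 it compresses it (faster playback), which is exactly the N2/N1 ratio set at step S3.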
At step S11, it is first decided whether the two frames fr0 and fr1 are voiced (V) or unvoiced (UV). If it is determined that both frames fr0 and fr1 are voiced (V), the program proceeds to step S12, where all parameters are linearly interpolated and the modified encoding parameters are represented as follows: where 0≤l<L. Note that L denotes the maximum possible number of harmonics, and that am[n][l] is filled with "0" where there are no harmonics. If the number of harmonics differs between the frames fr0 and fr1, the value of the missing harmonic is taken to be zero in the above interpolation. Before passing through the data-number conversion unit, the number L may be held constant, for example L=43 with 0≤l<L. In addition, the modified encoding parameters are reproduced as follows: where 0≤i<I, and I denotes the order of the LSP, usually equal to 10. It should be understood that, in the V-UV decision, 1 and 0 indicate voiced (V) and unvoiced (UV) frames, respectively. If at step S11 the decision is that the two frames fr0 and fr1 are not both voiced (V), the program proceeds to step S13, where it is judged whether both frames fr0 and fr1 are unvoiced (UV). If the evaluation result is positive (YES), i.e. if both frames are unvoiced (UV), the interpolation scheme 5 sets Pch to a constant and determines am and lsp by linear interpolation as follows,
the pitch value being fixed at a constant, for example the maximum pitch value for unvoiced sound, MaxPitch=148. If the frames fr0 and fr1 are not both unvoiced, the program proceeds to step S15, where it is decided whether the frame fr0 is voiced (V) and the frame fr1 unvoiced (UV). If the evaluation result is positive (YES), i.e. if the frame fr0 is voiced (V) and the frame fr1 unvoiced (UV), the program proceeds to step S16. If the evaluation result is negative (NO), i.e. if the frame fr0 is unvoiced (UV) and the frame fr1 voiced (V), the program proceeds to step S17. The processing from step S16 onward refers to the cases when fr0 and fr1 differ in their V-UV decision, that is, when one of the frames is voiced and the other unvoiced. It takes into account the fact that interpolating parameters between two frames fr0 and fr1 differing in their V-UV decision is meaningless. In this case, the parameter values of the frame closer to the time m/spd are used without performing interpolation. If the frame fr0 is voiced (V) and the frame fr1 unvoiced (UV), the program proceeds to step S16, which compares with each other the "left" (=m/spd−fr0) and "right" (=fr1−m/spd) distances shown in Fig. 10. This makes it possible to judge which of the frames fr0 and fr1 is closer to m/spd. The modified encoding parameters are then calculated using the parameter values of the frame closer to m/spd. If the evaluation result at step S16 is positive (YES), this means that the "right" distance is larger and hence the frame fr1 is farther from m/spd.
Thus, at step S18 the modified encoding parameters are determined using the parameters of the frame fr0, which is closer to m/spd, as follows: If the evaluation result at step S16 is negative (NO), then "left" ≥ "right" and hence the frame fr1 is closer to m/spd, so the program proceeds to step S19, where the pitch is set to the maximum value and the modified encoding parameters are set using the parameters of the frame fr1 as follows: Next, at step S17, entered as a result of the decision at step S15 that the two frames fr0 and fr1 are unvoiced (UV) and voiced (V) respectively, an evaluation similar to that at step S16 is made. That is, in this case too, interpolation is not performed, and the parameter values of the frame closer to the time m/spd are used. If the evaluation result at step S17 is positive (YES), the pitch is set to the maximum value at step S20 and, using the parameters of the closer frame fr0, the remaining modified encoding parameters are set as follows: If the evaluation result at step S17 is negative (NO), then, since "left" ≥ "right" and hence the frame fr1 is closer to m/spd, the program proceeds to step S21, where the modified encoding parameters are set using the parameters of the frame fr1 as follows: Thus, the interpolation scheme 5 performs different interpolating operations at step S6 of Fig. 9 depending on the voiced (V)/unvoiced (UV) relationship between the two frames fr0 and fr1. After the interpolation at step S6, the program proceeds to step S7, where the parameter m is incremented. The actions of steps S5 and S6 are repeated until the value of m becomes equal to N2. In addition, the sequence of short-term RMS values of the unvoiced (UV) parts is normally used to control the gain of the noise; here, however, this parameter is set to 1.
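The case analysis of steps S11 to S21 can be summarized in a short sketch. The dictionary field names (`vuv`, `pitch`, `am`) and the reduction of the spectral amplitudes to a single scalar are illustrative assumptions; only the branching logic follows the text.

```python
MAX_PITCH = 148  # fixed pitch value used for unvoiced frames (MaxPitch above)

def interpolate_frame(f0, f1, left, right):
    """Per-frame parameter interpolation depending on V/UV (sketch of S11-S21).

    left = m/spd - fr0 and right = fr1 - m/spd, as in Fig. 10.
    """
    if f0['vuv'] == 'V' and f1['vuv'] == 'V':
        # Both voiced: linear interpolation of all parameters (step S12).
        pitch = f0['pitch'] * right + f1['pitch'] * left
        am = f0['am'] * right + f1['am'] * left
        return {'vuv': 'V', 'pitch': pitch, 'am': am}
    if f0['vuv'] == 'UV' and f1['vuv'] == 'UV':
        # Both unvoiced: pitch held constant, amplitudes interpolated (step S14).
        am = f0['am'] * right + f1['am'] * left
        return {'vuv': 'UV', 'pitch': MAX_PITCH, 'am': am}
    # Mixed V/UV: no interpolation; take the frame nearer to m/spd,
    # fixing the pitch at the maximum when that frame is unvoiced (S16-S21).
    near = f0 if left < right else f1
    pitch = near['pitch'] if near['vuv'] == 'V' else MAX_PITCH
    return {'vuv': near['vuv'], 'pitch': pitch, 'am': near['am']}
```

The key design point carried over from the text is that interpolation across a V/UV boundary is never performed, since averaging a pitch with noise parameters would be meaningless.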
The operation of the computing unit for modified encoding parameters is shown schematically in the figure. Frames of encoding parameters extracted every 20 ms by the coding block 2 are shown at (a). The period changing scheme 4 of the computing unit for modified encoding parameters 3 sets a time period of 15 ms and performs compression along the time axis, as shown at (b). The modified encoding parameters shown at (c) are calculated by the interpolation scheme in accordance with the V-UV decisions of the two frames fr0 and fr1, as explained above. The computing unit for modified encoding parameters 3 can also reverse the sequence in which the operations of the period changing scheme 4 and the interpolation scheme are performed, i.e. it can first interpolate the encoding parameters shown at (a), as shown at (b), and then compress them to calculate the modified encoding parameters, as shown at (c). The modified encoding parameters from the computing unit for modified encoding parameters 3 are transferred to the decoding scheme 6 shown in Fig. 1. The decoding scheme 6 synthesizes sine waves and noise on the basis of the modified encoding parameters and outputs the synthesized audio signal at the output terminal 37. The operation of the decoding scheme is described with reference to Fig. and Fig. 15. For purposes of explanation it is assumed that the parameters fed to the decoding scheme 6 are ordinary encoding parameters. In the figure, terminal 31 receives the output of the vector quantization of line spectral pairs (LSP), corresponding to the output at terminal 15 of Fig. 3, i.e. the so-called index. This input signal is supplied to the inverse LSP vector quantizer 32 for inverse vector quantization, so as to generate line spectral pair (LSP) data, which are then fed to the LSP interpolation scheme 33 for LSP interpolation.
The resulting interpolated data are converted by the LSP-to-α conversion scheme into α-parameters of linear prediction coding (LPC). These α-parameters are fed to the synthesis filter 35. Terminal 41 of the figure receives the index data of the weighted vector quantization of the spectral envelope (Am), corresponding to the output at terminal 26 of the coding device shown in Fig. 3. Terminal 43 receives the pitch information from terminal 28 of Fig. 3, together with data indicating the characteristic waveform value for a UV block, while terminal 46 receives the V-UV discrimination data from terminal 29 of Fig. 3. The vector-quantized amplitude (Am) data from terminal 41 are fed to the inverse vector quantizer 42 for inverse vector quantization. The resulting spectral envelope data are fed to the harmonics-and-noise synthesis scheme, or multi-band excitation (MBE) synthesis scheme, 45. The synthesis scheme 45 receives the data from terminal 43, which are switched by the switch 44 between the pitch data and the data indicating the characteristic waveform value of a UV frame, depending on the V-UV discrimination data. The synthesis scheme 45 also receives the V-UV discrimination data from terminal 46. From the synthesis scheme 45, LPC residual data are obtained, corresponding to the output of the inverse LPC filter scheme 21 of Fig. 3. The residual data thus obtained are fed to the synthesis scheme 35, which performs LPC synthesis to create time-domain waveform data; these are then filtered by the post-filter 36, so that the reproduced time-domain waveform signal appears at the output terminal 37. An illustrative arrangement of the MBE synthesis scheme, as an example of the synthesis scheme 45, is described with reference to the figure.
The figure shows that the spectral envelope data from the inverse vector quantizer 42, in effect the spectral envelope data of the LPC residual, are fed to the input terminal 131. The data received at terminals 43 and 46 are the same as shown in the figure. The data received at terminal 43 are selected by the switch 44 so that the pitch data and the data indicating the characteristic waveform value of the UV signal are fed to the voiced-sound synthesis block 137 and the inverse vector quantizer 152, respectively. The spectral amplitude data of the LPC residual from terminal 131 are fed to the inverse data-number conversion scheme 136 for inverse conversion. The inverse data-number conversion scheme 136 performs an inverse conversion, which is the inverse of the conversion performed by the data-number conversion unit 119. The resulting amplitude data are fed to the voiced-sound synthesis block 137 and to the unvoiced-sound synthesis block 138. The pitch data obtained from terminal 43 pass through the fixed contact of the switch. The voiced-sound synthesis block 137 synthesizes the time-domain voiced waveform, for example by cosine or sine wave synthesis, while the unvoiced-sound synthesis block 138 filters, for example, white noise through a band-pass filter so as to synthesize the time-domain unvoiced waveform. The voiced waveform and the unvoiced waveform are added together by the adder 141 so that they can be output at the output terminal 142. If the V-UV discrimination data are transmitted as a V-UV code, all the bands can be divided at a single demarcation point into a voiced (V) region and an unvoiced (UV) region, and band-based V-UV discrimination data can be obtained on the basis of this demarcation point.
If the number of bands has been reduced on the analysis (encoder) side to a constant number, for example 12 bands, this reduction can be undone so as to restore a variable number of bands with band widths corresponding to the original pitch. Below, the steps of unvoiced sound synthesis by the unvoiced-sound synthesis block 138 are described. The time-domain white-noise waveform from the white-noise generator 143 is fed to the windowing unit 144 for windowing with a suitable compactly supported function, for example a Hamming window, of predetermined length, for example 256 samples. The windowed waveform is then fed to the short-term Fourier transform (STFT) scheme 145 for a short-term Fourier transform, producing the frequency-domain energy spectrum of the white noise. The energy spectrum from the STFT block 145 is fed to the band amplitude processing unit 146, where the bands deemed unvoiced (UV) are multiplied by the amplitude. The output of the band amplitude processing unit 146 is fed to the inverse STFT unit 147, which performs an inverse STFT using the phase of the original white noise, so as to convert the signal to the time domain. The output of the inverse STFT block 147 is fed, through the power-distribution shaping unit 156 and the multiplier 157 described below, to the overlap-and-add unit 148, where overlapping and adding are repeated with suitable weighting on the time axis so as to restore a continuous waveform. Thus a continuous time-domain waveform is created by the synthesis. The output of the overlap-and-add unit 148 is fed to the adder 141. If at least one of the bands in the block is voiced (V), the above processing is performed in both synthesis blocks 137 and 138.
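The noise-synthesis pipeline above (white noise, 256-sample Hamming window, STFT, per-band amplitude shaping, inverse STFT keeping the noise phase, overlap-add) can be sketched as follows. The band layout, hop size and double application of the window are illustrative assumptions; the step order follows the text.

```python
import numpy as np

def synthesize_unvoiced(band_amps, band_edges, n_fft=256, hop=128,
                        n_frames=4, seed=0):
    """Unvoiced synthesis sketch: windowed white noise -> STFT ->
    band amplitude shaping -> inverse STFT (noise phase kept) -> overlap-add.

    band_amps[k] scales the FFT bins from band_edges[k] to band_edges[k+1].
    """
    rng = np.random.default_rng(seed)
    window = np.hamming(n_fft)          # 256-sample Hamming window, as in the text
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    for f in range(n_frames):
        noise = rng.standard_normal(n_fft)
        spec = np.fft.rfft(window * noise)       # short-term spectrum of the noise
        for k, a in enumerate(band_amps):        # shape the UV bands
            spec[band_edges[k]:band_edges[k + 1]] *= a
        frame = np.fft.irfft(spec, n_fft)        # phase of the noise is preserved
        out[f * hop:f * hop + n_fft] += window * frame  # weighted overlap-add
    return out
```

Because the amplitudes scale a noise spectrum rather than harmonics, the result is a shaped-noise excitation, which is exactly the role block 138 plays opposite the harmonic block 137.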
If it is determined that all the bands in the block are unvoiced, the movable contact of the switch 44 is set to the corresponding fixed terminal. That is, the inverse vector quantization unit 152 receives data corresponding to the data coming from the vector quantization unit 127 of Fig. 4. These data are subjected to inverse vector quantization to yield data representing the characteristic waveform value of the unvoiced signal. The output of the inverse STFT block 147, before being fed to the multiplier 157, has its time-domain energy distribution adjusted by the power-distribution shaping unit 156. The multiplier 157 multiplies the output of the inverse STFT block 147 by the signal supplied from the inverse vector quantization block 152 through the smoothing block 153. The smoothing block 153 suppresses rapid changes in gain, which would otherwise be clearly audible. The unvoiced audio signal thus synthesized is taken from the unvoiced-sound synthesis block 138 and fed to the adder 141, where it is added to the signal from the voiced-sound synthesis block 137, so that the LPC residual signals appear at the output terminal 142 as the synthesized output. These LPC residual signals are fed to the synthesis filter 35 of the figure to create the desired reproduced speech signal.
The speech signal reproducing device 1 causes the computing unit for modified encoding parameters 3 to calculate the modified encoding parameters under the control of a controller (not shown), and synthesizes a speech signal that is modified on the time axis with respect to the original speech signal, using the modified encoding parameters. Through the schemes shown in the figure, speech signals modified on the time axis with respect to the original speech signals are synthesized using the aforementioned modified encoding parameters, so that they can be output at the output terminal 37. Thus, the speech signal reproducing device 1 decodes the modified array of encoding parameters. When the time axis is changed as described above, the instantaneous spectrum and the pitch remain unchanged, so that despite a significant change over the range 0.5≤spd≤2, hardly any deterioration arises. In this system, since the ultimately obtained parameter sequence is decoded after being arranged in a specific order with a fixed interval of 20 ms, arbitrary speed control in the direction of increase or decrease is easy to implement. Moreover, speeding up and slowing down can be performed by the same processing, without transition points. Thus, densely recorded content can be played back at, for example, twice real-time speed. Because the pitch and the phonemes remain unchanged despite the increased playback speed, densely recorded content remains intelligible even when playback is performed at a higher speed. Furthermore, as regards the speech codec, additional operations, for example the arithmetic operations after decoding and outputting the signals that are required when code excited linear prediction (CELP) coding is used, can be dispensed with.
Although in the above-described first embodiment the computing unit for modified encoding parameters 3 is provided separately from the decoding block 6, the computing unit 3 may also be provided inside the decoding block 6. When the parameters are calculated by the computing unit for modified encoding parameters 3 in the speech signal reproducing device 1, the interpolating operation on the parameter am is performed on the vector-quantized values or on the inverse-vector-quantized values. The following is a description of the speech signal transmission device 50, designed to carry out the speech signal transmission method according to the present invention. The figure shows that the speech signal transmission device 50 includes a transmitting device 51, designed to divide the input speech signal into predetermined time-domain frames as units, to encode the input speech signal on a frame basis so as to obtain the encoding parameters, to interpolate the encoding parameters so as to obtain modified encoding parameters, and to transmit the modified encoding parameters. The speech signal transmission device 50 also includes a receiving device 56, designed to receive the modified encoding parameters and to synthesize harmonic waves and noise. That is, the transmitting device 51 includes the encoder 53, designed to divide the input speech signal into predetermined time-domain frames as units and to encode the speech signal on a frame basis so as to extract the encoding parameters; the interpolator 54, intended to interpolate the encoding parameters so as to determine the modified encoding parameters; and the transmission unit, for transmitting the modified encoding parameters.
The receiving device 56 includes the reception unit 57; the interpolator 58, intended to interpolate the modified encoding parameters; and the decoding block 59, designed to synthesize harmonic waves and noise on the basis of the interpolated parameters and to output the synthesized speech signal at the output terminal 60. The basic operation of the coding block 53 and the decoding block 59 is similar to that of the corresponding blocks of the speech signal reproducing device 1, so for simplicity a detailed description is omitted here. The operation of the transmitting device 51 is described with reference to the flow chart in the figure, which shows together the coding steps of the coding block 53 and the interpolation by the interpolator 54. The coding block 53 extracts the encoding parameters, consisting of LSP, pitch Pch, V-UV and am, at steps S31 and S33. In particular, LSP is interpolated and reordered by the interpolator 54 at step S31 and quantized at step S32, while the pitch Pch, V-UV and am are interpolated and reordered at step S34 and quantized at step S35. These quantized data are transmitted by the transmission unit 55 to the receiving device 56. The quantized data received by the reception unit 57 of the receiving device 56 are fed to the interpolation unit 58, where the parameters are interpolated and reordered at step S36. At step S37 the data are synthesized by the decoding block 59. Thus, in order to increase the speed by compressing the time axis, the speech signal transmission device 50 interpolates the parameters and changes the interval between the parameter frames at transmission time. Meanwhile, since playback is performed during reception by recovering parameters with a constant frame interval of 20 ms, the speed-control algorithm can be used directly for bit-rate conversion. That is, it might be assumed that the parameter interpolation used for speed control must be performed in the decoding device.
However, if this processing is performed in the encoding device, so that the data compressed (thinned) along the time axis are encoded and then expanded (interpolated) along the time axis in the decoding device, the bit rate can be adjusted by the ratio spd. If the ordinary transmission bit rate is, for example, 1.975 kbit/s, and encoding is performed at double speed by setting spd=0.5, then, since a signal of inherent duration 10 seconds is encoded in 5 seconds, the transmission bit rate becomes 1.975×0.5 kbit/s. In addition, the encoding parameters obtained in the coding block 53, shown at (a), are interpolated and reordered by the interpolator 54 with an arbitrary interval, for example 30 ms, as shown at (b). The encoding parameters are then interpolated and reordered back to 20 ms by the interpolator 58 of the receiving device 56, as shown at (c), and are synthesized by the decoding block 59. If a similar scheme is provided in the decoding device, the speed can be restored to its initial value, although the speech signal can also be listened to at the higher or lower speed. That is, the speed control can be used as a variable-bit-rate codec.

1. A method of reproducing an input speech signal on the basis of encoding parameters obtained by dividing the input speech signal into predetermined frames on a time axis and by encoding the input speech signal on a frame basis, comprising the steps of interpolating the encoding parameters so as to determine modified encoding parameters associated with frame-based time periods, and generating a speech signal modified on the time axis with respect to the input speech signal on the basis of the modified encoding parameters.

2. The method according to claim 1, characterized in that the modified speech signal is generated by at least synthesizing sine waves on the basis of the modified encoding parameters for reproducing speech signals.

3.
The method according to claim 2, characterized in that the output period of the encoding parameters is changed by compression or expansion of the time axis of the encoding parameters generated in each predetermined frame, before or after the interpolation step.

4. The method according to claim 1, characterized in that the interpolation of the encoding parameters comprises linear interpolation of the line spectral pair parameters, the pitch and the residual spectral envelope contained in the encoding parameters.

5. The method according to claim 1, characterized in that the encoding parameters are determined by representing short-term prediction residuals of the input speech signal as synthesized sine waves and noise, and by encoding the frequency-spectrum information of each of the synthesized sine waves and the noise.

6. A device for reproducing a speech signal, in which the input speech signal is reproduced on the basis of encoding parameters obtained by dividing the input speech signal into predetermined frames on the time axis and encoding the input speech signal on a frame basis so as to obtain the encoding parameters, comprising means for interpolating the encoding parameters, designed to determine modified encoding parameters associated with frame-based time periods, and means for generating a modified speech signal, used to produce a speech signal differing on the time axis from the input speech signal on the basis of the modified encoding parameters.

7. The device according to claim 6, characterized in that the means for generating the modified speech signal is configured to at least synthesize harmonic waves in accordance with the modified encoding parameters.

8.
The device according to claim 7, characterized in that it further comprises period changing means, designed for compression or expansion of the time axis of the encoding parameters generated in each predetermined frame so as to change the output period of the encoding parameters, and installed before or after the interpolation means.

9. The device according to claim 6, characterized in that the means for interpolating the encoding parameters is configured to perform linear interpolation of the line spectral pair parameters, the pitch and the residual spectral envelope contained in the encoding parameters.

10. The device according to claim 6, characterized in that the encoding parameters are determined by representing short-term prediction residuals of the input speech signal as synthesized sine waves and noise, and by encoding the frequency-spectrum information of each of the synthesized harmonic waves and the noise.

11. A method of transmitting input speech signals, in which encoding parameters are obtained by dividing the input speech signal into predetermined frames on the time axis as units and by encoding the divided input speech signal on a frame basis; the encoding parameters are interpolated so as to determine modified encoding parameters associated with frame-based time periods; and the modified encoding parameters are transmitted.

12. The method according to claim 11, characterized in that the encoding parameters are determined by representing short-term prediction residuals of the input speech signal as synthesized sine waves and noise, and by encoding the frequency-spectrum envelope of each of the synthesized harmonic waves and the noise.