RussianPatents.com

Complexity scalable perceptual tempo estimation. RU patent 2507606.

IPC classes for Russian patent RU 2507606, Complexity scalable perceptual tempo estimation:

G10H1/40 - Rhythm (metronomes G04F0005020000)

FIELD: information technology.

SUBSTANCE: method and system for extracting tempo information of an audio signal from an encoded bit stream of the audio signal comprising spectral band replication data are described. The method comprises steps of determining a payload quantity associated with the amount of spectral band replication data contained in the encoded bit stream for a time interval of the audio signal; repeating the determining step for successive time intervals of the encoded bit stream of the audio signal, thereby determining a sequence of payload quantities; identifying periodicity in the sequence of payload quantities; and extracting tempo information of the audio signal from the identified periodicity.

EFFECT: enables tempo estimation that is invariant to the type of codec and/or applicable to any musical genre.

22 cl, 4 tbl, 13 dwg

TECHNICAL FIELD

This document relates to methods and systems for estimating the tempo of a media signal, such as an audio signal or a combined video/audio signal. In particular, the document relates to the estimation of the tempo perceived by human listeners, as well as to methods and systems for tempo estimation at scalable computational complexity.

BACKGROUND OF THE INVENTION

Portable handheld devices, such as PDAs, smartphones, mobile phones and portable media players, typically include audio and video rendering capabilities and have become important entertainment platforms. This development is pushed forward by the growing penetration of wireless or wired data transmission capabilities into such devices. Thanks to support for media transmission and/or storage protocols, such as the HE-AAC format, media content can be continuously downloaded to and stored on portable handheld devices, which thereby provide a virtually unlimited amount of media content.

However, low-complexity algorithms are crucial for mobile/handheld devices, because limited computing power and energy consumption are critical constraints for such devices. These constraints are even more critical for low-end portable devices in emerging markets. Given the large number of media files available on typical portable electronic devices, MIR (Music Information Retrieval) applications are desirable tools for clustering or classifying the media files, allowing the user of a portable electronic device to identify an appropriate media file, for example an audio, music and/or video file. Low-complexity calculation schemes are needed for such MIR applications, because otherwise their usability on portable electronic devices with limited computing and energy resources would be compromised.

Musical tempo is an important musical feature for various MIR applications, such as genre and mood classification, music summarization, audio thumbnailing, automatic playlist generation, and music recommendation systems that use musical similarity. A procedure for tempo determination with low computational complexity would therefore contribute to the development of decentralized implementations of these MIR applications on mobile devices.

In addition, although it is common to describe musical tempo by the notated tempo on sheet music or in a musical score, in BPM (beats per minute), this value often does not correspond to the perceived tempo. For example, if a group of listeners (including experienced musicians) is asked to annotate the tempo of music excerpts, they tend to give different answers, i.e. they usually tap at different metrical levels. For some musical excerpts the perceived tempo is less ambiguous and all listeners typically tap at the same metrical level, but for other excerpts the tempo can be ambiguous, and different listeners identify different tempi. In other words, perceptual experiments have shown that the perceived tempo may differ from the notated tempo. A piece of music may feel faster or slower than its notated tempo when the dominant perceived pulse lies at a metrical level higher or lower than the notated tempo. Because MIR applications should preferably take into account the tempo most likely to be perceived by the user, an automatic tempo extractor should predict the perceptually most salient tempo of the audio signal.

Known tempo estimation methods and systems have drawbacks. In many cases they are limited to specific audio codecs, for example MP3, and cannot be applied to audio tracks encoded with other codecs. Such tempo estimation methods also tend to work properly only when applied to Western popular music with simple and clear rhythmical structures. In addition, the known tempo estimation methods do not take perceptual aspects into account, i.e. they are not aimed at estimating the tempo that is most likely to be perceived by a listener. Finally, known tempo estimation schemes typically work in only one of the uncompressed PCM (pulse-code modulation) domain, the transform domain, or the compressed domain.

It is desirable to provide tempo estimation methods and systems that overcome the aforementioned shortcomings of the known tempo estimation schemes. In particular, it is desirable to provide tempo estimation that is invariant to the type of codec and/or applicable to any kind of musical genre. In addition, it is desirable to provide a tempo estimation scheme that estimates the perceptually most salient tempo of the audio signal. A tempo estimation scheme is also desirable that is applicable to audio signals in any of the above-mentioned domains, i.e. in the uncompressed PCM domain, the transform domain and the compressed domain. It is also desirable to provide tempo estimation schemes with low computational complexity.

Tempo estimation schemes can be used in various applications. Because tempo is fundamental semantic information in music, a reliable tempo estimate will improve the performance of other MIR applications, such as automatic content-based genre classification, mood classification, musical similarity, audio thumbnailing and music summarization. In addition, a reliable estimate of the perceived tempo is a useful statistic for music selection, comparison, mixing and playlist creation. The perceived tempo, or feel, is generally more relevant than the notated or physical tempo, especially for automatic playlist generators, music navigators or DJ apparatus. A reliable estimate of the perceived tempo can also be useful for gaming applications. For example, the tempo of a soundtrack can be used to control relevant game parameters, such as the speed of the game, and vice versa. This can be used to personalize game content using audio and to provide users with an enhanced experience. Another field of application is content-based audio/video synchronization, where the musical beat, or tempo, is the primary source of information used as the anchor for timing events.

It should be noted that in this document the term "tempo" is understood as the rate of the tactus pulse. This tactus is also referred to as the foot-tapping rate, i.e. the rate at which listeners tap their feet when listening to an audio signal, such as a music signal. This term differs from the musical meter, which defines the hierarchical structure of the music signal.

Document WO 2006/037366 A1 describes a device and a method for generating an encoded rhythmic pattern of a piece of music based on its PCM representation in the time domain. US 7518053 B1 describes a method for extracting beats from two audio streams and for aligning the beats of the two streams.

SUMMARY OF THE INVENTION

According to one aspect, a method is described for extracting tempo information of an audio signal from an encoded bit stream of the audio signal, where the encoded bit stream comprises spectral band replication data. The encoded bit stream may be an HE-AAC bit stream or an mp3PRO bit stream. The audio signal may comprise a music signal, and extracting tempo information may comprise estimating the tempo of the music signal.

The method may comprise the step of determining a payload quantity associated with the amount of spectral band replication data contained in the encoded bit stream for a time interval of the audio signal. In particular, when the encoded bit stream is an HE-AAC bit stream, this step may comprise determining the amount of data contained in one or more fill-element fields of the encoded bit stream in the time interval, and determining the payload quantity based on the amount of data contained in the one or more fill-element fields of the encoded bit stream in the time interval.

In one embodiment, the payload quantity corresponds to the net amount of spectral band replication data contained in the one or more fill-element fields of the encoded bit stream in the time interval. Alternatively or in addition, further overhead data may be removed from the one or more fill-element fields in order to determine the actual spectral band replication data.

The encoded bit stream may comprise a number of frames, where each frame corresponds to an excerpt of the audio signal of a predetermined duration. For example, a frame may comprise an excerpt of a few milliseconds of a music signal. The time interval may correspond to the duration covered by a frame of the encoded bit stream. For example, an AAC frame typically comprises 1024 spectral values, for example MDCT coefficients. The spectral values are the frequency representation of a specific time instant, or time interval, of the audio signal. The relationship between time and frequency can be expressed as follows:

$f_S = 2 \cdot f_{MAX}$, and

$t = \frac{1}{f_S}$,

where $f_{MAX}$ is the covered frequency range, $f_S$ is the sampling frequency, and $t$ is the time resolution, i.e. the time interval of the audio signal covered by a frame. For a sampling frequency $f_S = 44100$ Hz and 1024 spectral values per frame, this corresponds to a time resolution of

$t = \frac{1024}{44100\ \mathrm{Hz}} = 23.219\ \mathrm{ms}$

for an AAC frame. Since, in one embodiment, HE-AAC is defined as a "dual-rate system" whose core encoder (AAC) operates at half the sampling frequency, a maximum time resolution of

$t = \frac{1024}{22050\ \mathrm{Hz}} = 46.4399\ \mathrm{ms}$

can be achieved.
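The frame durations quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name and the derived count of frames per 6-second excerpt are illustrative additions, not values taken from the patent.

```python
def frame_duration_ms(values_per_frame: int, sample_rate_hz: float) -> float:
    """Return the time span covered by one frame, in milliseconds."""
    return 1000.0 * values_per_frame / sample_rate_hz

# Core coder running at the full sampling rate.
full_rate_ms = frame_duration_ms(1024, 44100)
# Dual-rate case: the core coder runs at half the sampling rate.
half_rate_ms = frame_duration_ms(1024, 22050)

# Illustrative: how many half-rate frames fit into a 6-second excerpt.
frames_per_6_seconds = round(6000.0 / half_rate_ms)

print(f"{full_rate_ms:.3f} ms, {half_rate_ms:.3f} ms, {frames_per_6_seconds} frames")
```

At roughly 46 ms per frame, a payload sequence is sampled at only about 21.5 values per second, which is one reason why analysing it is cheap compared with analysing PCM samples.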

The method may comprise the further step of repeating the above determining step for successive time intervals of the encoded bit stream of the audio signal, thereby determining a sequence of payload quantities. If the encoded bit stream comprises a sequence of frames, this repeating step may be performed for a certain set of frames of the encoded bit stream, i.e. for all frames of the encoded bit stream.
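As a minimal sketch of how such a sequence might be assembled, the Python fragment below derives one payload quantity per frame. It assumes the sizes of the fill-element fields have already been read from the bit stream; the `Frame` class and its field names are hypothetical and do not correspond to any codec library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    # Hypothetical container: both sizes are assumed to be parsed already.
    fill_element_bytes: int   # total size of the frame's fill-element fields
    overhead_bytes: int = 0   # header or other overhead inside those fields

def payload_quantities(frames: List[Frame], net: bool = True) -> List[int]:
    """Return one payload quantity per frame (one time interval per frame)."""
    if net:
        # Net amount: subtract the overhead, never going below zero.
        return [max(f.fill_element_bytes - f.overhead_bytes, 0) for f in frames]
    return [f.fill_element_bytes for f in frames]

frames = [Frame(120, 8), Frame(64, 8), Frame(118, 8), Frame(60, 8)]
sequence = payload_quantities(frames)
print(sequence)
```

No entropy decoding is involved here: only the sizes of the fields are used, not their contents.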

In a further step, the method may identify a periodicity in the sequence of payload quantities. This may be achieved by identifying a periodicity of peaks, or of recurring patterns, in the sequence of payload quantities. Periodicities may be identified by performing spectral analysis on the sequence of payload quantities, yielding a set of power values and corresponding frequencies. A periodicity may be identified in the sequence of payload quantities by determining a relative maximum in the set of power values and by selecting the corresponding frequency as the periodicity. In one embodiment, the absolute maximum is determined.

The spectral analysis is usually performed on the sequence of payload quantities along the time axis. In addition, the spectral analysis is usually performed on a number of sub-sequences of the sequence of payload quantities, thus yielding a number of sets of power values. For example, a sub-sequence may cover a certain duration of the audio signal, for example 6 seconds. In addition, the sub-sequences may overlap each other, for example by 50%. In this way a number of sets of power values can be obtained, where each set of power values corresponds to a specific excerpt of the audio signal. An overall set of power values for the complete audio signal can be obtained by averaging the sets of power values. It should be understood that the term "averaging" covers various types of mathematical operations, such as calculating the mean or determining the median. That is, the overall set of power values can be obtained by calculating the set of mean power values, or the set of median power values, over the sets of power values. In one embodiment, performing the spectral analysis comprises performing a frequency transform, such as a Fourier transform or an FFT.
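The windowed analysis described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented implementation: it uses a naive DFT instead of an FFT for brevity, a rectangular window, and a synthetic cosine-shaped payload sequence; all function names are hypothetical.

```python
import cmath
import math
import statistics

def power_spectrum(x):
    """Power of DFT bins 0..N/2 for a real sequence, after removing its mean."""
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                  for i, v in enumerate(centred))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def averaged_spectrum(payload, window, use_median=False):
    """Combine the spectra of sub-sequences that overlap by 50%."""
    hop = window // 2
    starts = range(0, len(payload) - window + 1, hop)
    spectra = [power_spectrum(payload[s:s + window]) for s in starts]
    combine = statistics.median if use_median else statistics.fmean
    return [combine(column) for column in zip(*spectra)]

def dominant_tempo_bpm(payload, frame_seconds, window_seconds=6.0):
    """Frequency of the strongest non-DC bin, in beats per minute."""
    window = round(window_seconds / frame_seconds)
    spectrum = averaged_spectrum(payload, window)
    peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return 60.0 * peak_bin / (window * frame_seconds)

# Synthetic payload sequence that fluctuates twice per second (120 BPM).
frame_seconds = 1024 / 22050
payload = [200 + 50 * math.cos(2 * math.pi * 2.0 * i * frame_seconds)
           for i in range(640)]
tempo = dominant_tempo_bpm(payload, frame_seconds)
print(round(tempo, 1))
```

With 6-second sub-sequences the frequency resolution is about 1/6 Hz, i.e. roughly 10 BPM per bin, so the estimate lands on the bin nearest to 120 BPM rather than on 120 BPM exactly. Passing `use_median=True` to `averaged_spectrum` switches from the mean to the median.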

The sets of power values may be processed further. In one embodiment, the set of power values is multiplied by weights associated with the human perceptual preference for the corresponding frequencies. For example, these perceptual weights may emphasize frequencies that correspond to tempi detected frequently by human listeners, while frequencies that correspond to tempi rarely detected by human listeners are attenuated.
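The weighting step amounts to an element-wise multiplication. The sketch below uses a log-Gaussian bump centred on 120 BPM purely as a placeholder; the shape, the centre and the width are assumptions made for illustration, since the text here does not specify the weighting function.

```python
import math

def perceptual_weight(bpm, preferred_bpm=120.0, width_octaves=1.0):
    """Hypothetical weight: a Gaussian over the log-tempo distance."""
    distance = math.log2(bpm / preferred_bpm)
    return math.exp(-0.5 * (distance / width_octaves) ** 2)

def apply_weights(tempi_bpm, power_values):
    """Multiply each power value by the weight of its tempo."""
    return [p * perceptual_weight(b) for b, p in zip(tempi_bpm, power_values)]

tempi_bpm = [60.0, 120.0, 240.0]
power_values = [1.0, 1.0, 1.0]
weighted = apply_weights(tempi_bpm, power_values)
print([round(w, 3) for w in weighted])
```

With these placeholder parameters, tempi one octave below or above the preferred tempo are attenuated equally, while the preferred tempo itself is left unchanged.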

According to a further aspect, a method is described for estimating a perceptually salient tempo of an audio signal. A perceptually salient tempo may be the tempo that is most frequently perceived by a group of users when listening to the audio signal, such as a music signal. It typically differs from the physically salient tempo of the audio signal, which may be defined as the physically, or acoustically, most prominent tempo of the audio signal, such as a music signal.

The method may comprise the step of determining a modulation spectrum from the audio signal, where the modulation spectrum typically comprises a number of frequencies of occurrence and a corresponding number of importance values, and the importance values indicate the relative importance of the respective frequencies of occurrence in the audio signal. In other words, the frequencies of occurrence indicate certain periodicities in the audio signal, while the corresponding importance values indicate the significance of these periodicities in the audio signal. For example, a periodicity may be a transient in the audio signal, for example the sound of a bass drum in a music signal, which occurs at recurring time instants. If the transient is distinctive, the importance value corresponding to its periodicity will typically be high.

In one embodiment, the audio signal is represented by a sequence of PCM samples along the time axis. In such cases, the step of determining the modulation spectrum may comprise the steps of selecting a number of successive, partially overlapping sub-sequences from the sequence of PCM samples; determining, for the successive sub-sequences, a number of successive power spectra having a spectral resolution; condensing the spectral resolution of the successive power spectra using a Mel frequency transformation or any other perceptually motivated non-linear frequency transformation; and/or performing spectral analysis along the time axis on the successive condensed power spectra, thereby obtaining the importance values and the corresponding frequencies of occurrence.

In one embodiment, the audio signal is represented by a sequence of successive blocks of subband coefficients along the time axis. These subband coefficients may, for example, be MDCT coefficients, as in the case of the MP3, AAC, HE-AAC, Dolby Digital and Dolby Digital Plus codecs. In such cases, the step of determining the modulation spectrum may comprise condensing the number of subband coefficients in a block using a Mel frequency transformation; and/or performing spectral analysis along the time axis on the sequence of successive blocks of condensed subband coefficients, thereby yielding the importance values and the corresponding frequencies of occurrence.
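To illustrate what condensing a block with a Mel frequency transformation can look like, the sketch below sums the powers of 1024 coefficients into 40 Mel-spaced bands. Several choices are assumptions made for the example: the common approximation mel = 2595·log10(1 + f/700) is used because the text here does not state a formula, the bands are rectangular rather than triangular filters, and the band count of 40 is arbitrary.

```python
import math

def hz_to_mel(f_hz):
    """One widely used Mel approximation (an assumption for this sketch)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_max_hz, n_bands):
    """Band edges in Hz that are equally spaced on the Mel scale."""
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * i / n_bands) for i in range(n_bands + 1)]

def condense(coefficient_powers, f_max_hz, n_bands):
    """Sum coefficient powers into n_bands Mel-spaced rectangular bands."""
    n = len(coefficient_powers)
    edges = mel_band_edges(f_max_hz, n_bands)
    bands = [0.0] * n_bands
    band = 0
    for k, power in enumerate(coefficient_powers):
        centre_hz = (k + 0.5) * f_max_hz / n
        # Advance to the band whose upper edge lies above this coefficient.
        while band < n_bands - 1 and centre_hz >= edges[band + 1]:
            band += 1
        bands[band] += power
    return bands

block = [1.0] * 1024          # flat toy block of coefficient powers
bands = condense(block, 22050.0, 40)
print(len(bands), bands[0], bands[-1])
```

Because the Mel scale is compressive, low bands collect only a few coefficients and high bands collect many, which reduces the amount of data the subsequent spectral analysis along the time axis has to process.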

In one embodiment, the audio signal is represented by an encoded bit stream comprising spectral band replication data and a number of successive frames along the time axis. For example, the encoded bit stream may be an HE-AAC bit stream or an mp3PRO bit stream. In such cases, the step of determining the modulation spectrum may comprise determining a sequence of payload quantities associated with the amount of spectral band replication data in the sequence of frames of the encoded bit stream; selecting a number of successive, partially overlapping sub-sequences from the sequence of payload quantities; and/or performing spectral analysis along the time axis on the successive sub-sequences, thereby yielding the importance values and the corresponding frequencies of occurrence. In other words, the modulation spectrum may be determined according to the method described above.

In addition, the step of determining the modulation spectrum may comprise processing intended to enhance the modulation spectrum. This processing may comprise multiplying the importance values by weights associated with the human perceptual preference for the corresponding frequencies of occurrence.

The method may comprise the further step of determining the physically salient tempo as the frequency of occurrence corresponding to a maximum value of the importance values. This maximum value may be the absolute maximum of the importance values.
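In code, this step reduces to an arg-max over the modulation spectrum. A minimal sketch follows, assuming the spectrum is held as two parallel lists; the representation and the toy values are illustrative only.

```python
def physically_salient_tempo(frequencies_bpm, importance):
    """Return the frequency of occurrence with the largest importance value."""
    best = max(range(len(importance)), key=importance.__getitem__)
    return frequencies_bpm[best]

frequencies_bpm = [60.0, 90.0, 120.0, 180.0]
importance = [0.4, 0.1, 0.9, 0.3]
tempo = physically_salient_tempo(frequencies_bpm, importance)
print(tempo)
```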

The method may comprise the further step of determining a beat metric of the audio signal from the modulation spectrum. In one embodiment, the beat metric indicates the relationship between the physically salient tempo and at least one other frequency of occurrence corresponding to a relatively high value among the importance values, for example the second highest of the importance values. The beat metric may take one of the following values: 3, for example in the case of a 3/4 beat; or 2, for example in the case of a 4/4 beat. The beat metric may be a factor associated with the ratio between the physically salient tempo and at least one other salient tempo of the audio signal, i.e. a frequency of occurrence corresponding to a relatively high value among the importance values. In general, the beat metric may represent the relationship between a number of physically salient tempi of the audio signal, for example between the two physically most salient tempi of the audio signal.

In one embodiment, determining the beat metric comprises determining the autocorrelation of the modulation spectrum for a number of non-zero frequency lags; and/or determining the beat metric based on a corresponding frequency lag and the physically salient tempo. Determining the beat metric may also comprise the steps of determining the cross-correlation between the modulation spectrum and a number of synthesized tapping functions, which correspond to a number of beat metrics, respectively; and/or selecting the beat metric that yields the maximum cross-correlation.
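The template-matching idea can be sketched as follows. This is a deliberately simplified stand-in for the cross-correlation with synthesized tapping functions: each template is assumed to consist of just two unit impulses, at the physically salient tempo divided by and multiplied by the candidate metric, and the modulation spectrum is assumed to lie on a uniform BPM grid. Neither assumption comes from the text.

```python
def beat_metric(importance, bpm_step, physical_bpm, candidates=(2, 3)):
    """Choose the candidate metric best supported by the modulation spectrum.

    importance[i] is the importance value at i * bpm_step beats per minute.
    """
    def value_at(bpm):
        index = round(bpm / bpm_step)
        # Tempi outside the grid contribute nothing.
        return importance[index] if 0 <= index < len(importance) else 0.0

    def score(metric):
        # Correlation with a two-impulse template for this metric.
        return value_at(physical_bpm / metric) + value_at(physical_bpm * metric)

    return max(candidates, key=score)

# Toy spectrum on a 1 BPM grid: strongest peak at 180 BPM, second at 60 BPM.
spectrum = [0.0] * 301
spectrum[180], spectrum[60] = 1.0, 0.5
metric = beat_metric(spectrum, 1.0, 180.0)
print(metric)
```

In this toy case the second peak sits at one third of the strongest one, so the triple metric is selected; a second peak at half or double the strongest tempo would select 2 instead.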

The method may comprise the step of determining a perceptual tempo indicator from the modulation spectrum. A first perceptual tempo indicator may be determined as the mean value of the importance values, normalized by the maximum of the importance values. A second perceptual tempo indicator may be determined as the maximum importance value among the importance values. A third perceptual tempo indicator may be determined as a frequency of occurrence of the modulation spectrum.
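The indicators can be computed directly from the modulation spectrum, as in the sketch below. The first two follow the definitions given above. For the third, the text only says that it is a frequency of occurrence of the modulation spectrum; the importance-weighted centroid used here is an assumption chosen for illustration.

```python
def perceptual_tempo_indicators(frequencies_bpm, importance):
    """Return (mean normalized by max, max value, centroid frequency)."""
    peak = max(importance)
    mean_over_max = (sum(importance) / len(importance)) / peak
    # Assumption: third indicator taken as the importance-weighted centroid.
    centroid = (sum(f * v for f, v in zip(frequencies_bpm, importance))
                / sum(importance))
    return mean_over_max, peak, centroid

frequencies_bpm = [60.0, 120.0, 180.0]
importance = [1.0, 2.0, 1.0]
first, second, third = perceptual_tempo_indicators(frequencies_bpm, importance)
print(round(first, 3), second, third)
```

A spectrum with one dominant peak gives a small first indicator, whereas a spectrum with energy spread over many frequencies of occurrence gives a value closer to 1.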

The method may comprise the step of determining the perceptually salient tempo by modifying the physically salient tempo in accordance with the beat metric, where the modifying step takes into account the relation between the perceptual tempo indicator and the physically salient tempo. In one embodiment, the step of determining the perceptually salient tempo comprises determining whether the first perceptual tempo indicator exceeds a first threshold; and modifying the physically salient tempo only if the first threshold is exceeded. In one embodiment, the step of determining the perceptually salient tempo comprises determining whether the second perceptual tempo indicator is below a second threshold; and modifying the physically salient tempo if the second perceptual tempo indicator is below the second threshold.

Alternatively or in addition, the step of determining the perceptually salient tempo may comprise determining a mismatch between the third perceptual tempo indicator and the physically salient tempo; and, if a mismatch is determined, modifying the physically salient tempo. A mismatch may be determined, for example, by determining that the third perceptual tempo indicator is below a third threshold and the physically salient tempo is above a fourth threshold; and/or by determining that the third perceptual tempo indicator is above a fifth threshold and the physically salient tempo is below a sixth threshold. Typically, at least one of the third, fourth, fifth and sixth thresholds is associated with human perceptual tempo preferences. These perceptual tempo preferences may indicate a correlation between the third perceptual tempo indicator and the subjective perception of the speed of the audio signal as perceived by a group of users.
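The mismatch test can be written as two threshold comparisons. In the sketch below the threshold defaults are placeholders, not values from the patent, and the returned direction ("decrease" when the indicator suggests slower music than the physical estimate, "increase" in the opposite case) is an interpretation added for the example; the third indicator is assumed to be expressed in BPM.

```python
def correction_direction(third_indicator, physical_bpm,
                         t3=90.0, t4=150.0, t5=150.0, t6=90.0):
    """Return "decrease", "increase" or None. Thresholds are placeholders."""
    if third_indicator < t3 and physical_bpm > t4:
        return "decrease"   # indicator is low but the estimate is fast
    if third_indicator > t5 and physical_bpm < t6:
        return "increase"   # indicator is high but the estimate is slow
    return None             # no mismatch: keep the physical tempo

print(correction_direction(70.0, 180.0))
print(correction_direction(170.0, 60.0))
print(correction_direction(120.0, 120.0))
```

In practice the thresholds would be tuned on listening-test data, in line with the statement that they are associated with human perceptual tempo preferences.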

The step of modifying the physically salient tempo in accordance with the beat metric may comprise increasing the metrical level to the next higher metrical level of the underlying beat; and/or decreasing the metrical level to the next lower metrical level of the underlying beat. For example, if the underlying beat is a 4/4 beat, increasing the metrical level may comprise increasing the physically salient tempo, for example the tempo corresponding to the quarter notes, by a factor of 2, thereby yielding the next higher tempo, corresponding to the eighth notes. Similarly, decreasing the metrical level may comprise dividing by 2, for example shifting from a tempo based on 1/8 notes to a tempo based on 1/4 notes.

In one embodiment, increasing or decreasing the metrical level may comprise multiplying or dividing the physically salient tempo by 3 in the case of a 3/4 beat; and/or multiplying or dividing the physically salient tempo by 2 in the case of a 4/4 beat.
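The tempo modification itself is a multiplication or division by the beat metric, as the following sketch shows. The function name and the string-valued `direction` argument are hypothetical conveniences for the example.

```python
def shift_metrical_level(physical_bpm, beat_metric, direction):
    """Move the tempo one metrical level up or down using the beat metric."""
    if beat_metric not in (2, 3):
        raise ValueError("beat metric must be 2 (4/4 beat) or 3 (3/4 beat)")
    if direction == "increase":
        return physical_bpm * beat_metric
    if direction == "decrease":
        return physical_bpm / beat_metric
    return physical_bpm  # any other value leaves the tempo unchanged

print(shift_metrical_level(60.0, 2, "increase"))   # quarter notes to eighth notes
print(shift_metrical_level(180.0, 3, "decrease"))  # one level down in a 3/4 beat
```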

According to a further aspect, a computer program product is described that comprises executable instructions for performing the method described in this document when executed on a computer.

According to a further aspect, a portable electronic device is described. The device may comprise a storage unit configured to store an audio signal; an audio rendering unit configured to render the audio signal; a user interface configured to receive a user's request for tempo information about the audio signal; and/or a processor configured to determine the tempo information by performing the steps of the method described in this document on the audio signal.

According to another aspect, a system is described that is configured to extract tempo information of an audio signal from an encoded bit stream comprising spectral band replication data of the audio signal, such as an HE-AAC bit stream. The system may comprise means for determining a payload quantity associated with the amount of spectral band replication data contained in the encoded bit stream for a time interval of the audio signal; means for repeating the determining step for successive time intervals of the encoded bit stream of the audio signal, thereby determining a sequence of payload quantities; and/or means for extracting tempo information of the audio signal from the identified periodicity.

According to a further aspect, a system is described that is configured to estimate a perceptually salient tempo of an audio signal. The system may comprise means for determining a modulation spectrum from the audio signal, where the modulation spectrum comprises a number of frequencies of occurrence and corresponding importance values, and the importance values indicate the relative importance of the respective frequencies of occurrence in the audio signal; means for determining the physically salient tempo as the frequency of occurrence corresponding to a maximum value of the importance values; means for determining a perceptual tempo indicator from the modulation spectrum; and means for determining the perceptually salient tempo by modifying the physically salient tempo in accordance with the beat metric, where the modifying step takes into account the relation between the perceptual tempo indicator and the physically salient tempo.

According to another aspect, a method is described for generating an encoded bit stream of an audio signal that comprises metadata. The method may comprise the step of encoding the audio signal into a sequence of payload data, thereby yielding the encoded bit stream. For example, the audio signal may be encoded into an HE-AAC, MP3, AAC, Dolby Digital or Dolby Digital Plus bit stream. Alternatively or in addition, the method may rely on an already encoded bit stream, for example the method may comprise the step of receiving the encoded bit stream.

The method may comprise the steps of determining metadata associated with the tempo of the audio signal, and inserting the metadata into the encoded bit stream. The metadata may be data representing the physically salient tempo and/or the perceptually salient tempo of the audio signal. The metadata may also be data representing a modulation spectrum from the audio signal, where the modulation spectrum comprises a number of frequencies of occurrence and a corresponding number of importance values, and the importance values indicate the relative importance of the respective frequencies of occurrence in the audio signal. It should be noted that the metadata associated with the tempo of the audio signal may be determined according to any of the methods described in this document. That is, the tempi and the modulation spectra may be determined according to the methods described in this document.

According to a further aspect, an encoded bit stream of an audio signal comprising metadata is described. The encoded bit stream may be an HE-AAC, MP3, AAC, Dolby Digital or Dolby Digital Plus bit stream. The metadata may comprise data representing at least one of: a physically salient tempo and/or a perceptually salient tempo of the audio signal; or a modulation spectrum from the audio signal, where the modulation spectrum comprises a number of frequencies of occurrence and corresponding importance values, and the importance values indicate the relative importance of the respective frequencies of occurrence in the audio signal. In particular, the metadata may comprise data representing the tempo data and the modulation spectrum data generated by the methods described in this document.

According to another aspect, an audio encoder is described that is configured to generate an encoded bit stream of an audio signal comprising metadata. The encoder may comprise means for encoding the audio signal into a sequence of payload data, thereby yielding the encoded bit stream; means for determining metadata associated with the tempo of the audio signal; and means for inserting the metadata into the encoded bit stream. As with the method described above, the encoder may rely on an already existing encoded bit stream, and the encoder may comprise means for receiving the encoded bit stream.

It should be noted that, according to a further aspect, a corresponding method for decoding an encoded bit stream of an audio signal and a corresponding decoder configured to decode the encoded bit stream of the audio signal are described. The method and the decoder are configured to extract the respective metadata from the encoded bit stream, in particular the metadata associated with the tempo information.

It should be noted that the embodiments and aspects of the invention described in this document may be combined arbitrarily. In particular, the aspects and features described in the context of a system are also applicable in the context of the corresponding method, and vice versa. In addition, the disclosure of this document also covers claim combinations other than those explicitly given by the back references in the dependent claims, i.e. the claims and their technical features can be combined in any order and in any form.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described by way of illustrative examples, which do not limit the scope or spirit of the invention, with reference to the accompanying drawings, in which:

Fig. 1 illustrates an example resonance model for large music collections compared with the tempi for a single piece of music;

Fig. 2 shows an example of the interleaving of MDCT coefficients for short blocks;

Fig. 3 shows an example Mel scale and an example Mel-scale filter bank;

Fig. 4 illustrates an example of a function;

Fig. 5 illustrates an example weighting function;

Fig. 6 illustrates examples of a power spectrum and a modulation spectrum;

Fig. 7 shows an example SBR data element;

Fig. 8 illustrates an example sequence of SBR payload quantities and the resulting modulation spectrum;

Fig. 9 shows an example overview of the proposed tempo estimation schemes;

Fig. 10 shows an example comparison of the proposed tempo estimation schemes;

Fig. 11 shows example modulation spectra for audio tracks with different beat metrics;

Fig. 12 illustrates example experimental results for the classification of the perceived tempo; and

Fig. 13 illustrates an example block diagram of a tempo estimation system.

DETAILED DESCRIPTION

The embodiments described below merely illustrate the principles of the methods and systems for tempo estimation. It should be understood that modifications and variations of the arrangements and details described in this disclosure will be apparent to those skilled in the art. The intention is therefore to be limited only by the scope of the following claims, and not by the specific details presented in this disclosure by way of description and explanation of the embodiments.

As mentioned in the introductory part, known tempo estimation schemes are limited to certain domains of signal representation, for example the PCM domain, the transform domain or the compressed domain. In particular, there is no existing solution for tempo estimation in which the features are calculated directly from the compressed HE-AAC bit stream without performing entropy decoding. In addition, existing systems are limited mainly to Western popular music.

Existing schemes do not take into account the tempo perceived by listeners, and as a result they suffer from octave errors, i.e. double/half-period confusion. This confusion may arise because, in music, different instruments play rhythms whose periodicities are integer multiples of each other. As will be outlined below, it is an insight of the inventors that the perception of tempo depends not only on the repetition rate of the periodicities but is also influenced by other perceptual factors, so that this confusion can be overcome by using additional perceptual features. Based on these additional perceptual features, the extracted tempo is corrected in a perceptually motivated way, i.e. the above-mentioned tempo confusion is reduced or removed.

As already highlighted, when talking about "tempo" it is necessary to distinguish between notated tempo, physically measured tempo and perceived tempo. Physically measured tempo is obtained from actual measurements on the sampled audio signal, whereas perceived tempo is subjective in nature and is typically determined from listening experiments. Furthermore, tempo is a highly content-dependent musical feature and is sometimes very hard to detect automatically, because in some audio or music tracks the tempo-carrying part of the piece is present only implicitly. The musical experience of the listeners and their focus also strongly influence the results of tempo estimation. This may lead to differences in the tempo metric used when comparing notated, physically measured and perceived tempo. Nevertheless, physical and perceptual approaches to tempo estimation may be used in combination in order to correct each other. This can be seen, for example, when full and double notes, which correspond to a certain beats per minute (BPM) value and its multiple, have been detected by a physical measurement on the audio signal, but the perceived tempo is rated as slow. Consequently, assuming that the physical measurement is reliable, the correct tempo is the slower of the detected ones. In other words, an estimation scheme that focuses on estimating the notated tempo will give ambiguous results corresponding to the full and the double notes. When combined with methods for estimating the perceived tempo, the correct (perceptual) tempo can be determined.

Large-scale experiments on human tempo perception show that people tend to perceive musical tempo in the range of 100 to 140 BPM, with a peak at 120 BPM. This can be modelled by the dashed resonance curve 101 shown in Fig. 1. This model can be used to predict the tempo distribution for large data sets. However, when the results of a tapping experiment for a single music file or track (see reference signs 102 and 103) are compared with the resonance curve 101, it can be seen that the perceived tempi 102, 103 of an individual audio track do not necessarily fit the model 101. As can be seen, subjects may tap the tempo at different metrical levels 102 or 103, which sometimes results in a curve that is completely different from the model 101. This is especially true for different genres and different kinds of rhythms. Such metrical ambiguity leads to a high degree of confusion in tempo determination and is a possible explanation for the generally "unsatisfactory" performance of tempo estimation algorithms that are not perceptually driven.

To overcome this confusion, a new perceptually motivated tempo correction scheme is proposed, in which weights are assigned to the different metrical levels based on the extraction of a number of acoustic cues, i.e. musical parameters or features. These weights can be used to correct the extracted, physically calculated tempo. In particular, such a correction may be used to determine the perceptually salient tempo.

The following describes how to extract tempo information from the PCM domain and the transform domain. Modulation spectral analysis may be used for this purpose. In general, modulation spectral analysis may be used to capture the repetitiveness of musical features over time. It can be used to evaluate long-term statistics of a music track and/or for quantitative tempo estimation. Modulation spectra based on Mel power spectra may be determined for the audio track in the uncompressed PCM (pulse-code modulation) domain and/or for the audio track in the transform domain, for example the HE-AAC (High Efficiency Advanced Audio Coding) transform domain.

For a signal represented in the PCM domain, the modulation spectrum is determined directly from the PCM samples of the audio signal. For audio signals represented in the transform domain, for example the HE-AAC transform domain, subband coefficients of the signal may be used to determine the modulation spectrum. For the HE-AAC transform domain, the modulation spectrum may be determined on a frame-by-frame basis from a certain number, for example 1024, of MDCT (modified discrete cosine transform) coefficients taken directly from the HE-AAC decoder during decoding, or from the encoder during encoding.

For a single frame that comprises eight short blocks, it is proposed to interleave the MDCT coefficients into a long block. Typically, two types of blocks can be distinguished: long blocks and short blocks. In an embodiment, a long block equals the size of a frame (i.e. 1024 spectral coefficients, which corresponds to a particular time resolution). A short block comprises 128 spectral values, in order to achieve an eight times higher time resolution (1024/128) for a proper representation of the temporal characteristics of the audio signal and to avoid pre-echo artifacts. Consequently, a frame is formed by eight short blocks, at the cost of a frequency resolution reduced by the same factor of eight. This scheme is usually referred to as the "AAC block-switching scheme".

This is shown in Fig. 2, where the MDCT coefficients of the 8 short blocks 201 to 208 are interleaved such that the corresponding coefficients of the 8 short blocks are regrouped, i.e. the first MDCT coefficients of the 8 blocks 201 to 208 are grouped first, followed by the second MDCT coefficients of the 8 blocks 201 to 208, and so on. In this way, corresponding MDCT coefficients, i.e. MDCT coefficients that correspond to the same frequency, are grouped together. The interleaving of short blocks within a frame may be understood as an operation that "artificially" increases the frequency resolution within the frame. It should be noted that other means of increasing the frequency resolution may be contemplated.
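The interleaving described above can be sketched in a few lines of Python. This is only an illustration: the function name `interleave_short_blocks` and the list-of-lists input layout are assumptions made for this example, not part of the described system.

```python
def interleave_short_blocks(short_blocks):
    """Interleave equally sized short blocks into one long block.

    short_blocks: list of 8 lists with 128 MDCT coefficients each
    (hypothetical layout). The result lists the first coefficient of
    every block, then the second coefficient of every block, and so on.
    """
    block_len = len(short_blocks[0])
    if any(len(block) != block_len for block in short_blocks):
        raise ValueError("all short blocks must have the same length")
    return [block[k] for k in range(block_len) for block in short_blocks]


# Example: 8 short blocks of 128 values become one block of 1024 values.
blocks = [[b * 1000 + k for k in range(128)] for b in range(8)]
long_block = interleave_short_blocks(blocks)
```

With this ordering, coefficients that belong to the same frequency index of the eight short blocks end up next to each other in the 1024-element result.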

In the illustrated example, a block 210 comprising 1024 MDCT coefficients is obtained for a suite of 8 short blocks. Since long blocks also comprise 1024 MDCT coefficients, a complete sequence of blocks comprising 1024 MDCT coefficients each is obtained for the audio signal. In other words, by forming long blocks 210 from eight successive short blocks 201 to 208, a sequence of long blocks is obtained.

Based on the block 210 of interleaved MDCT coefficients (in the case of short blocks) and on the block of MDCT coefficients for long blocks, a power spectrum is calculated for every block of MDCT coefficients. An exemplary power spectrum is illustrated in Fig. 6a.

It should be noted that human auditory perception is, in general, a (typically non-linear) function of loudness and frequency, and not all frequencies are perceived with equal loudness. MDCT coefficients, on the other hand, are represented on a linear scale for both amplitude (or power) and frequency, which is contrary to the human auditory system, which is non-linear in both respects. To obtain a signal representation that is closer to human perception, a transformation from linear to non-linear scales may be used. In an embodiment, a transformation of the power spectrum of the MDCT coefficients to a logarithmic scale in dB is used to model human loudness perception. This power spectrum transformation may be calculated as follows:

$$MDCT_{dB}[i] = 10 \log_{10}\left(MDCT[i]^{2}\right).$$
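As a minimal sketch of this transformation, the following Python function applies the formula to a block of coefficients. The function name `mdct_power_db` and the small `floor` value, which only guards against taking the logarithm of zero, are assumptions of this example and are not taken from the description.

```python
import math


def mdct_power_db(mdct_coeffs, floor=1e-12):
    """Return 10*log10(MDCT[i]^2) for each coefficient.

    floor is a hypothetical lower bound on the power that avoids
    log10(0) for coefficients that are exactly zero.
    """
    return [10.0 * math.log10(max(c * c, floor)) for c in mdct_coeffs]


# A coefficient of magnitude 10 has power 100, i.e. 20 dB.
example_db = mdct_power_db([1.0, 10.0, -10.0])
```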

Similarly, a power spectrogram or power spectrum may be calculated for an audio signal in the uncompressed PCM domain. For this purpose, an STFT (short-time Fourier transform) of a certain length along time is applied to the audio signal. Subsequently, a power transformation is performed. To model human loudness perception, a transformation to a non-linear scale, such as the transformation to the logarithmic scale above, may be performed. The size of the STFT may be chosen such that the resulting time resolution equals the time resolution of the HE-AAC frames. However, the size of the STFT may also be set to larger or smaller values, depending on the required accuracy and computational complexity.

In a next step, the non-linearity of human frequency perception may be modelled by filtering with a Mel filter bank. For this purpose, a non-linear frequency scale (Mel scale), as shown in Fig. 3, is applied. The scale 300 is approximately linear for low frequencies (< 500 Hz) and logarithmic for higher frequencies. The reference point 301 to the linear frequency scale is a tone with a frequency of 1000 Hz, which is defined as 1000 Mel. A tone whose pitch is perceived as twice as high is defined as 2000 Mel, a tone whose pitch is perceived as half as high is defined as 500 Mel, and so on. In mathematical terms, the Mel scale is given by:

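$$m_{Mel} = 1127.01048 \, \ln\left(1 + \frac{f_{Hz}}{700}\right),$$

where $f_{Hz}$ is the frequency in Hz and $m_{Mel}$ is the frequency in Mel.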
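The Mel mapping above is easy to check numerically. The following sketch implements the formula and its algebraic inverse; the function names `hz_to_mel` and `mel_to_hz` are chosen for this example only.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale using the formula above."""
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(m_mel):
    """Inverse mapping, obtained by solving the formula for f_hz."""
    return 700.0 * (math.exp(m_mel / 1127.01048) - 1.0)


# The 1000 Hz reference tone maps to approximately 1000 Mel.
reference_mel = hz_to_mel(1000.0)
```

The inverse is useful, for example, when the edges of Mel filter bands have to be placed on the linear frequency axis.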

In this way, a Mel power spectrum is obtained which represents the audible frequency range with only a few coefficients. An exemplary Mel power spectrum is shown in Fig. 6b. As a result of the Mel scale filtering, the power spectrum is smoothed and specific details at higher frequencies are lost. In an exemplary case, the frequency axis of the Mel power spectrum may be represented by only 40 coefficients, instead of 1024 MDCT coefficients per frame for the HE-AAC transform domain and a potentially higher number of spectral coefficients for the uncompressed PCM domain.

In order to further reduce the amount of data along the frequency axis to a meaningful minimum, a companding function (CF) may be introduced which maps the higher Mel bands to single coefficients. The rationale is that, typically, most of the information and signal power is located in the lower frequency regions. An experimentally evaluated companding function is shown in Table 1, and the corresponding curve 400 is shown in Fig. 4. In an exemplary case, this companding function reduces the number of Mel power coefficients to 12. An exemplary companded Mel power spectrum is shown in Fig. 6c.

Table 1

Companded Mel band index    Mel band index (sum of (...))
1                           1
2                           2
3                           3-4
4                           5-6
5                           7-8
6                           9-10
7                           11-12
8                           13-14
9                           15-18
10                          19-23
11                          24-29
12                          30-40

It should be noted that the companding function may be weighted in order to emphasize different frequency ranges. In an embodiment, the weighting may ensure that the companded frequency bands reflect the average power of the Mel frequency bands comprised in a particular companded band. This differs from the unweighted companding function, where the companded bands reflect the total power of the Mel frequency bands comprised in a particular companded band. In an embodiment, the weight may be inversely proportional to the number of Mel frequency bands comprised in a particular companded band.
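The mapping of Table 1, together with the optional weighting, can be sketched as follows. This is an illustrative reading of the table: the names `COMPANDING_TABLE` and `compand_mel_bands` are hypothetical, and the sketch assumes that the 40 input values are linear power values (the description does not state whether the sums are taken on linear or dB values).

```python
# (first, last) Mel band indices, 1-based and inclusive, as in Table 1.
COMPANDING_TABLE = [
    (1, 1), (2, 2), (3, 4), (5, 6), (7, 8), (9, 10),
    (11, 12), (13, 14), (15, 18), (19, 23), (24, 29), (30, 40),
]


def compand_mel_bands(mel_power, weighted=False):
    """Reduce 40 Mel band powers to 12 companded band powers.

    weighted=False: each companded band is the sum of its Mel bands.
    weighted=True:  the sum is divided by the number of Mel bands,
                    i.e. the companded band holds the average power.
    """
    if len(mel_power) != 40:
        raise ValueError("expected 40 Mel band powers")
    companded = []
    for first, last in COMPANDING_TABLE:
        band = mel_power[first - 1:last]
        total = sum(band)
        companded.append(total / len(band) if weighted else total)
    return companded


flat = [1.0] * 40
summed = compand_mel_bands(flat)
averaged = compand_mel_bands(flat, weighted=True)
```

For a flat input the unweighted output grows with the width of each companded band, whereas the weighted output stays flat, which is the difference between total and average power described above.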

In order to determine the modulation spectrum, the companded Mel power spectrum, or any other of the previously determined power spectra, may be segmented into blocks representing a predetermined length of the audio signal. Furthermore, it may be beneficial to define a partial overlap between the blocks. In an embodiment, blocks corresponding to a predetermined duration of the audio signal, with a 50% overlap along the time axis, are selected. The length of the blocks may be chosen as a trade-off between the ability to cover the long-term characteristics of the audio signal and computational complexity. An exemplary modulation spectrum determined from a companded Mel power spectrum is shown in Fig. 6d. As a side note, the approach of determining modulation spectra is not limited to Mel-filtered spectral data, but may also be used to obtain long-term statistics of essentially any musical feature or spectral representation.

For each such segment or block, an FFT is calculated along the time and frequency axis in order to obtain the amplitude-modulated signal of the loudness. Typically, modulation frequencies in the range of 0-10 Hz are considered in the context of tempo estimation, since modulation frequencies beyond this range are typically irrelevant. As an outcome of the FFT analysis, which is determined for the power spectral data along the time (or frame) axis, the peaks of the power spectrum and the corresponding FFT frequency bins may be determined. The frequency, or frequency bin, of such a peak corresponds to the frequency of a power-intensive event in the audio or music track and is thereby an indication of the tempo of the audio or music track.
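The segmentation and the transform along the time axis can be sketched as follows. Several points are assumptions of this example rather than statements of the description: the frame rate of 20 Hz, the block length of 40 frames, the helper names, and the use of a naive DFT (chosen only to keep the sketch free of external libraries; a practical implementation would use an FFT).

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitudes of DFT bins 0..len(x)//2 (naive O(n^2) DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def segment_frames(frames, block_len):
    """Split a frame sequence into blocks with 50% overlap."""
    hop = block_len // 2
    return [frames[s:s + block_len]
            for s in range(0, len(frames) - block_len + 1, hop)]


def modulation_spectrum(block):
    """Per-band DFT magnitude along the time (frame) axis.

    block[t][n] is the power of band n in frame t; the result is
    indexed as result[n][d] with d the modulation frequency bin.
    """
    n_bands = len(block[0])
    return [dft_magnitude([frame[n] for frame in block])
            for n in range(n_bands)]


def bin_to_hz(k, frame_rate_hz, block_len):
    """Modulation frequency in Hz of DFT bin k."""
    return k * frame_rate_hz / block_len


# Synthetic example: band 0 pulsates at 2 Hz, band 1 is constant.
frame_rate = 20.0  # hypothetical frame rate in frames per second
frames = [[1.0 + math.cos(2 * math.pi * 2.0 * t / frame_rate), 1.0]
          for t in range(80)]
blocks = segment_frames(frames, 40)
band0 = modulation_spectrum(blocks[0])[0]
peak_bin = max(range(1, len(band0)), key=lambda k: band0[k])
peak_hz = bin_to_hz(peak_bin, frame_rate, 40)
```

In this synthetic case the strongest non-DC bin of band 0 lies at 2 Hz, which would correspond to 120 BPM.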

In order to improve the determination of the relevant peaks in the companded Mel power spectrum, the data may be submitted to further processing, such as perceptual weighting and blurring. In view of the fact that human tempo preference varies with modulation frequency, and that very high and very low modulation frequencies are unlikely to occur, a perceptual weighting function may be introduced which emphasizes tempi with a high likelihood of occurrence and suppresses tempi that are unlikely to occur. An experimentally evaluated weighting function 500 is shown in Fig. 5. The weighting function 500 may be applied to every band of the companded Mel power spectrum along the modulation frequency axis of each segment or block of the audio signal, i.e. the power values of each Mel band may be multiplied by the weighting function 500. An exemplary weighted modulation spectrum is shown in Fig. 6e. It should be noted that the weighting filter, or weighting function, may be adapted if the genre of the music is known. For example, if it is known that electronic music is being analyzed, the weighting function may have a peak at around 2 Hz and may be restrictive outside a rather narrow range. In other words, the weighting function may depend on the music genre.

In order to further emphasize signal variations and to bring out the rhythmic content of the modulation spectrum, absolute differences along the modulation frequency axis may be calculated. As a result, the peak lines in the modulation spectrum may be enhanced. An exemplary differentiated modulation spectrum is shown in Fig. 6f.
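A minimal sketch of the absolute difference calculation is given below. The function name and the `mms[n][d]` layout (band `n`, modulation bin `d`) are assumptions of this example, as is the choice of a first-order difference that shortens each band by one bin.

```python
def abs_diff_along_modulation(mms):
    """Absolute first-order difference along the modulation axis.

    mms[n][d] holds the modulation spectrum of band n at bin d.
    Each output row has one bin fewer than the input row.
    """
    return [[abs(row[d + 1] - row[d]) for d in range(len(row) - 1)]
            for row in mms]


differentiated = abs_diff_along_modulation([[0.0, 1.0, 4.0, 2.0]])
```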

Additionally, perceptual blurring may be performed along the Mel frequency bands, or along the Mel frequency axis and the modulation frequency axis. Typically, this step smooths the data in such a way that adjacent modulation frequency lines are combined into a broader region, independent of the amplitude. In addition, the blurring may reduce the influence of noisy patterns in the data and thus lead to better visual interpretability. Furthermore, the blurring may adapt the modulation spectrum to the shape of the tapping histograms obtained from tapping experiments for individual music items (as shown by reference signs 102, 103 in Fig. 1). An exemplary blurred modulation spectrum is shown in Fig. 6g.

In a final step, the joint frequency representation of a suite of segments or blocks of the audio signal may be averaged in order to obtain a very compact Mel-frequency modulation spectrum that is independent of the length of the audio file. As outlined above, the term "average" may refer to various mathematical operations, including the calculation of a mean value and the determination of a median value. An exemplary averaged modulation spectrum is shown in Fig. 6h.
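The averaging over segments can be sketched as follows, with the mean and the median as the two examples of "average" mentioned in the text. The function name and the `segment_spectra[s][n][d]` layout (segment `s`, band `n`, modulation bin `d`) are assumptions of this example.

```python
import statistics


def average_modulation_spectra(segment_spectra, use_median=False):
    """Combine per-segment modulation spectra into one spectrum.

    segment_spectra[s][n][d]: value of segment s, band n, bin d.
    Returns averaged[n][d], computed with the mean or the median
    across all segments.
    """
    aggregate = statistics.median if use_median else statistics.mean
    n_bands = len(segment_spectra[0])
    n_bins = len(segment_spectra[0][0])
    return [[aggregate([seg[n][d] for seg in segment_spectra])
             for d in range(n_bins)]
            for n in range(n_bands)]


segments = [[[1.0, 2.0]], [[3.0, 6.0]], [[2.0, 100.0]]]
mean_spectrum = average_modulation_spectra(segments)
median_spectrum = average_modulation_spectra(segments, use_median=True)
```

The example shows why the choice matters: a single outlier segment (the value 100.0) dominates the mean but not the median.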

It should be noted that a property of this averaged modulation spectral representation of an audio track is that it is able to indicate the tempo at several metrical levels. Furthermore, the modulation spectrum is able to indicate the relative physical salience of the several metrical levels in a format that is compatible with the tapping experiments used to determine the perceived tempo. In other words, this representation matches well with the experimental "tapping" representation 102, 103 of Fig. 1, and it may therefore serve as the basis for perceptually motivated decisions on the tempo estimate of an audio track.

As mentioned above, the frequencies corresponding to the peaks of the processed companded Mel power spectrum provide an indication of the tempo of the analyzed audio signal. Furthermore, it should be noted that the modulation spectral representation may be used to compare the rhythmic similarity between songs. In addition, the modulation spectral representation of the individual segments or blocks may be used to compare similarity within a song, for audio thumbnailing or segmentation applications.

Overall, a method has been described for obtaining tempo information from audio signals in the transform domain, for example the HE-AAC transform domain, and in the PCM domain. However, it may be desirable to extract tempo information of the audio signal directly from the compressed domain. The following describes a method for determining tempo estimates for audio signals that are represented in the compressed domain, or bit stream domain. Particular attention is given to HE-AAC encoded audio signals.

HE-AAC encoding makes use of high-frequency reconstruction (HFR) or spectral band replication (SBR) techniques. The SBR encoding process comprises a transient detection stage, an adaptive T/F (time/frequency) grid selection for proper representation, an envelope estimation stage and additional methods for correcting a mismatch in signal characteristics between the low-frequency and the high-frequency parts of the signal.

Consequently, the choice of the time-frequency resolution has a significant influence on the bit rate of the SBR data, because longer time segments can be encoded more efficiently than shorter time segments. At the same time, for rapidly changing content, i.e. typically for audio content with a higher tempo, the number of envelopes, and consequently the number of envelope coefficients that have to be transmitted for a proper representation of the audio signal, is higher than for slowly varying content. In addition to the influence of the selected time resolution, this effect further influences the size of the SBR data. In fact, it has been observed that the sensitivity of the SBR data rate to tempo variations of the underlying audio signal is higher than the sensitivity of the Huffman code length used in the context of mp3 codecs. Variations in the bit rate of the SBR data have therefore been identified as valuable information which can be used to determine rhythmic components directly from the encoded bit stream.

Fig. 7 shows an exemplary AAC raw data block 701 which comprises a fill_element field 702. The fill_element field 702 in the bit stream is used to store additional parametric side information, such as SBR data. When parametric stereo (PS) is used in addition to SBR (i.e. in HE-AAC v2), the fill_element field 702 also contains PS side information. The following explanations are based on the mono case. It should be noted, however, that the described method is also applicable to bit streams conveying any number of channels, for example the stereo case.

The size of the fill_element field 702 varies with the amount of parametric side information that is transmitted. Consequently, the size of the fill_element field 702 may be used to extract tempo information directly from the compressed HE-AAC stream. As shown in Fig. 7, the fill_element field 702 comprises an SBR header 703 and SBR payload data 704.

The SBR header 703 is of constant size for an individual audio file and is transmitted repeatedly as part of the fill_element field 702. This retransmission of the SBR header 703 results in a recurring peak in the payload data at a certain frequency, and consequently in a peak in the modulation frequency domain at 1/x Hz with a certain amplitude (x being the repetition rate of the transmission of the SBR header 703). However, this repeatedly transmitted SBR header 703 does not contain any rhythmic information and should therefore be removed.

This can be done by determining the length and the time of occurrence of the SBR header 703 directly after parsing the bit stream. Owing to the periodicity of the SBR header 703, this determination step typically has to be carried out only once. If the length and occurrence information is available, the total SBR data 705 can easily be corrected by subtracting the length of the SBR header 703 from the SBR data 705 at the time of occurrence of the SBR header 703, i.e. at the time at which the SBR header 703 is transmitted. This yields the size of the SBR payload 704, which can be used for tempo determination. It should be noted that the size of the fill_element field 702, corrected by subtracting the length of the SBR header 703, may be used for tempo determination in a similar manner, because it differs from the size of the SBR payload 704 only by a constant overhead.
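The header correction described above amounts to a periodic subtraction. The sketch below assumes that the per-frame sizes have already been obtained by a bit stream parser, which is not shown; the function name and the parameters `header_len`, `first_header_frame` and `header_period` are hypothetical names for the header length and occurrence information mentioned in the text.

```python
def remove_sbr_header(total_sizes, header_len, first_header_frame,
                      header_period):
    """Subtract the constant header length from the frames carrying it.

    total_sizes: per-frame size of the total SBR data (e.g. in bits).
    header_len: size of the SBR header in the same unit.
    first_header_frame, header_period: frame index of the first header
    and the number of frames between two header transmissions.
    """
    payload_sizes = list(total_sizes)
    for i in range(first_header_frame, len(payload_sizes), header_period):
        payload_sizes[i] -= header_len
    return payload_sizes


# Every third frame carries a 20-unit header in this toy sequence.
payload = remove_sbr_header([30, 10, 12, 30, 11, 9, 30],
                            header_len=20,
                            first_header_frame=0,
                            header_period=3)
```

After the correction, the periodic spikes caused by the header are gone and only the frame-to-frame variation of the payload remains.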

An example of a sequence of SBR payload data 704 sizes, or corrected fill_element field 702 sizes, is given in Fig. 8a. The x-axis shows the frame number, and the y-axis indicates the size of the SBR payload data 704, or the size of the corrected fill_element field 702, for the corresponding frame. As can be seen, the size of the SBR payload data 704 varies from frame to frame. In the following, reference is made only to the size of the SBR payload data 704. Tempo information may be extracted from the sequence 801 of SBR payload data 704 sizes by identifying periodicities in the size of the SBR payload data 704. In particular, periodicities of peaks or recurring patterns in the size of the SBR payload data 704 may be identified. This can be done, for example, by applying an FFT to overlapping subsequences of the SBR payload data 704 sizes. The subsequences may correspond to a certain signal length, for example 6 seconds. Successive subsequences may overlap by 50%. Subsequently, the FFT coefficients of the subsequences may be averaged over the length of the complete audio track. This yields averaged FFT coefficients for the complete audio track, which may be represented as a modulation spectrum 811 as shown in Fig. 8b. It should be noted that other methods of identifying periodicities in the size of the SBR payload data 704 may be contemplated.
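The procedure of the preceding paragraph (overlapping subsequences, a transform per subsequence, averaging over the track) can be sketched as follows. The function names are hypothetical, a naive DFT stands in for the FFT to avoid external dependencies, and the removal of the mean of each subsequence is an assumption of this example that merely suppresses the DC bin.

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitudes of DFT bins 0..len(x)//2 (naive O(n^2) DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def sbr_payload_modulation_spectrum(payload_sizes, sub_len):
    """Average DFT magnitude over 50%-overlapping subsequences."""
    hop = sub_len // 2
    spectra = []
    for start in range(0, len(payload_sizes) - sub_len + 1, hop):
        sub = payload_sizes[start:start + sub_len]
        mean = sum(sub) / sub_len
        # Assumption of this sketch: remove the mean to suppress DC.
        spectra.append(dft_magnitude([v - mean for v in sub]))
    if not spectra:
        raise ValueError("sequence shorter than one subsequence")
    return [sum(sp[k] for sp in spectra) / len(spectra)
            for k in range(sub_len // 2 + 1)]


# Toy payload sequence whose size oscillates with a period of 8 frames.
sizes = [100.0 + 20.0 * math.cos(2 * math.pi * i / 8) for i in range(128)]
mod_spectrum = sbr_payload_modulation_spectrum(sizes, 64)
strongest_bin = mod_spectrum.index(max(mod_spectrum))
```

A period of 8 frames in subsequences of 64 frames shows up in bin 64/8 = 8 of the averaged spectrum.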

The peaks 812, 813, 814 in the modulation spectrum 811 indicate repetitive, i.e. rhythmic, patterns with a certain frequency of occurrence. The frequency of occurrence may also be referred to as the modulation frequency. It should be noted that the maximum possible modulation frequency is limited by the time resolution of the underlying core audio codec. Because HE-AAC is defined as a dual-rate system in which the AAC core codec operates at half the sampling frequency, the maximum possible modulation frequency for a sequence of 6 seconds length (128 frames) and a sampling frequency of Fs = 44100 Hz is approximately 21.74 Hz/2, i.e. about 11 Hz. This maximum modulation frequency corresponds to approximately 660 BPM, which covers the tempo of almost any piece of music. For convenience, while still ensuring correct processing, the maximum modulation frequency may be limited to 10 Hz, which corresponds to 600 BPM.

The modulation spectrum of Fig. 8b may be further enhanced in a manner similar to that described in the context of the modulation spectra determined from the transform domain or PCM domain representation of the audio signal. For example, perceptual weighting using the weighting curve 500 shown in Fig. 5 may be applied to the SBR payload data modulation spectrum 811 in order to model human tempo preferences. The resulting perceptually weighted SBR payload data modulation spectrum 821 is shown in Fig. 8c. As can be seen, very low and very high tempi are suppressed. In particular, the low-frequency peak 822 and the high-frequency peak 824 have been reduced compared with the initial peaks 812 and 814, respectively. The middle peak 823, on the other hand, has been maintained.

By determining the maximum value of the SBR payload data modulation spectrum and its corresponding modulation frequency, the physically most salient tempo can be obtained. In the case of Fig. 8c, the result is 178.659 BPM. In this example, however, the physically most salient tempo does not correspond to the perceptually most salient tempo, which is around 89 BPM. Consequently, there is a double confusion, i.e. a confusion of the metrical level, which needs to be corrected. For this purpose, a perceptual tempo correction scheme is described below.
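The step from the spectral maximum to a tempo value is a simple conversion: a modulation frequency of f Hz corresponds to 60·f BPM. The sketch below illustrates this; the function name, the frame rate of 21.5 frames per second and the subsequence length of 128 frames are assumptions of the example, not values prescribed by the description.

```python
def most_salient_tempo_bpm(mod_spectrum, frame_rate_hz, sub_len):
    """Tempo in BPM of the strongest non-DC modulation bin.

    mod_spectrum[k] is the magnitude of bin k of a transform of
    length sub_len taken over frames arriving at frame_rate_hz.
    """
    peak_bin = max(range(1, len(mod_spectrum)),
                   key=lambda k: mod_spectrum[k])
    modulation_hz = peak_bin * frame_rate_hz / sub_len
    return 60.0 * modulation_hz


# Toy spectrum with its maximum in bin 8 (bin 0, the DC bin, is ignored).
spectrum = [9.0] + [0.1] * 64
spectrum[8] = 5.0
spectrum[4] = 3.0
tempo = most_salient_tempo_bpm(spectrum, frame_rate_hz=21.5, sub_len=128)
```

In the toy spectrum the weaker peak in bin 4 lies at exactly half the tempo of the maximum, which mirrors the double confusion discussed above.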

It should be noted that the proposed approach to tempo estimation based on SBR payload data is independent of the bit rate of the musical input signal. When the bit rate of an HE-AAC encoded bit stream is changed, the encoder automatically sets the SBR start and stop frequencies according to the highest output quality achievable at that bit rate, i.e. the SBR cross-over frequency changes. Nevertheless, the SBR payload still comprises information on the repeating transient components of the audio track. This can be seen in Fig. 8d, which shows SBR payload modulation spectra for different bit rates (from 16 kbit/s to 64 kbit/s). As can be seen, the repetitive parts of the audio signal (i.e. peaks in the modulation spectrum such as peak 833) remain dominant at all bit rates. It can also be seen that fluctuations are present in the different modulation spectra, because the encoder tries to save bits in the SBR part when the bit rate is decreased.

To summarize the above, reference is made to Fig. 9. Three different representations of an audio signal are considered. In the compressed domain, the audio signal is represented by its encoded bit stream, for example an HE-AAC bit stream 901. In the transform domain, the audio signal is represented by its transform or subband coefficients, for example MDCT coefficients 902. In the PCM domain, the audio signal is represented by its PCM samples 903. The above description has outlined how to determine a modulation spectrum in any of these three signal domains. A method has been described for determining a modulation spectrum 911 based on the SBR payload of an HE-AAC bit stream 901. A method has also been described for determining a modulation spectrum 912 based on the transform representation 902, for example the MDCT coefficients, of the audio signal. In addition, a method has been described for determining a modulation spectrum 913 based on the PCM representation 903 of the audio signal.

Any of the estimated modulation spectra 911, 912, 913 may be used as a basis for physical tempo estimation. For this purpose, various steps of enhancement processing may be performed, for example perceptual weighting using the weighting curve 500, perceptual blurring and/or absolute difference calculation. Eventually, the maxima of the modulation spectra 911, 912, 913 and the corresponding modulation frequencies are determined. The absolute maximum of the modulation spectrum 911, 912, 913 is an estimate of the physically most salient tempo of the analyzed audio signal. The other maxima typically correspond to other metrical levels of this physically most salient tempo.

Thus, methods and systems have been described which allow the physically salient tempo to be estimated by means of modulation spectra derived from different forms of signal representation. These methods are applicable to various types of music and are not restricted to Western popular music. Furthermore, different methods are applicable to the different forms of signal representation, and each can be performed at low computational complexity for the respective signal representation.

As can be seen in Figs. 6, 8 and 10, a modulation spectrum typically contains a plurality of peaks, which usually correspond to different metrical levels of the tempo of the audio signal. This can be seen, for example, in Fig. 8b, where the three peaks 812, 813, 814 have similar strength and may therefore all be candidates for the underlying tempo of the audio signal. Selecting the maximum peak 813 yields the physically most salient tempo. As outlined above, this physically most salient tempo may not correspond to the perceptually most salient tempo. In order to estimate the perceptually most salient tempo automatically, a perceptual tempo correction scheme is outlined below.

In an embodiment, the perceptual tempo correction scheme comprises the determination of the physically most salient tempo from the modulation spectrum. In the case of the modulation spectrum of Fig. 8b, the peak 813 and the corresponding modulation frequency would be determined. In addition, further parameters that assist the tempo correction may be extracted from the modulation spectrum. A first parameter may be MMS_Centroid (MMS: Mel modulation spectrum), which is the centroid of the modulation spectrum according to equation 1. The centroid parameter MMS_Centroid may be used as an indicator of the speed of the audio signal.

$$MMS_{Centroid} = \frac{\sum_{d=1}^{D} d \cdot \sum_{n=1}^{N} \overline{MMS}(n,d)}{\sum_{d=1}^{D} \sum_{n=1}^{N} \overline{MMS}(n,d)} \qquad (1)$$

In the above equation, D is the number of modulation frequency bins and d = 1, ..., D identifies the respective modulation frequency bin. N is the total number of frequency bins along the Mel frequency axis and n = 1, ..., N identifies the respective frequency bin on the Mel frequency axis. MMS(n,d) denotes the modulation spectrum of a particular segment of the audio signal, whereas $\overline{MMS}(n,d)$ denotes the summarized modulation spectrum that characterizes the entire audio signal.
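Equation 1 can be sketched in Python as follows. The function name `mms_centroid` and the `mms[n][d]` list layout are assumptions of this example; the modulation bins are counted from 1, as in the equation.

```python
def mms_centroid(mms):
    """Centroid of the modulation spectrum along the modulation axis.

    mms[n][d] is the summarized modulation spectrum for Mel band n and
    modulation bin d (0-based in the list, counted from 1 in the sum).
    """
    n_bins = len(mms[0])
    column_sums = [sum(row[d] for row in mms) for d in range(n_bins)]
    total = sum(column_sums)
    if total == 0:
        raise ValueError("modulation spectrum is all zero")
    return sum((d + 1) * column_sums[d] for d in range(n_bins)) / total


# Energy concentrated in the second bin gives a centroid of 2.
centroid_mid = mms_centroid([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
# Energy in the highest bin moves the centroid up.
centroid_high = mms_centroid([[0.0, 0.0, 2.0]])
```

A higher centroid means that more modulation energy sits at higher modulation frequencies, which is why the parameter can serve as an indicator of speed.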

A second parameter that assists the tempo correction may be MMS_BEATSTRENGTH, which is the maximum value of the modulation spectrum according to equation 2. Typically, this value is high for electronic music and small for classical music.

$$MMS_{BEATSTRENGTH} = \max_{d}\left(\sum_{n=1}^{N} \overline{MMS}(n,d)\right) \qquad (2)$$
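A minimal sketch of equation 2, using the same hypothetical `mms[n][d]` layout (Mel band `n`, modulation bin `d`); the function name is chosen for this example only.

```python
def mms_beat_strength(mms):
    """Maximum over the modulation bins of the sum over all Mel bands."""
    n_bins = len(mms[0])
    return max(sum(row[d] for row in mms) for d in range(n_bins))


# Column sums are [1, 5, 1], so the beat strength is 5.
beat_strength = mms_beat_strength([[1.0, 2.0, 0.0], [0.0, 3.0, 1.0]])
```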

A further parameter is MMS_CONFUSION, which is the mean value of the modulation spectrum after normalization to 1, according to equation 3. If this parameter is low, this indicates strong peaks in the modulation spectrum (for example, as in Fig. 6). If this parameter is high, the modulation spectrum is widely spread, has no significant peaks, and there is a high degree of confusion.

$$MMS_{CONFUSION} = \frac{1}{N \cdot D} \sum_{n=1}^{N} \sum_{d=1}^{D} \left(\frac{\overline{MMS}(n,d)}{\max_{(n,d)}\left(\overline{MMS}(n,d)\right)}\right) \qquad (3)$$
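Equation 3 can be sketched in the same way. The function name and the `mms[n][d]` layout are again assumptions of this example, and the sketch presumes non-negative spectrum values with at least one positive entry.

```python
def mms_confusion(mms):
    """Mean of the modulation spectrum after normalizing its maximum to 1."""
    peak = max(max(row) for row in mms)
    if peak <= 0:
        raise ValueError("modulation spectrum has no positive value")
    n_values = len(mms) * len(mms[0])
    return sum(value / peak for row in mms for value in row) / n_values


# One isolated peak: low confusion. Flat spectrum: confusion of 1.
peaky = mms_confusion([[1.0, 0.0], [0.0, 0.0]])
flat = mms_confusion([[2.0, 2.0], [2.0, 2.0]])
```

The two extreme cases illustrate the interpretation given in the text: a single pronounced peak yields a low value, a spread-out spectrum a high one.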

In addition to these parameters, i.e. the modulation spectral centroid MMS_Centroid, the modulation beat strength MMS_BEATSTRENGTH and the modulation tempo confusion MMS_CONFUSION, other perceptually meaningful parameters may be extracted which can be used for MIR applications.

It should be noted that the equations in this document have been formulated for Mel-frequency modulation spectra, i.e. for the modulation spectra 912, 913 determined from audio signals represented in the PCM domain and in the transform domain. Where the modulation spectrum 911 is determined from audio signals represented in the compressed domain, the terms MMS(n,d) and

$$\sum_{n=1}^{N} \overline{MMS}(n,d)$$

in the equations of this document need to be replaced by the term MS_SBR(d) (the modulation spectrum based on SBR payload data).

1. Define the basic size of a music track, for example, the size 4/4 or ¾.

2. Exercise collapse tempo to the targeted range in accordance with the MMS BEATSTRENGTH .

3. Carry out correction of the tempo in accordance with the criterion of the perceived speed MMS Cmtroid .

Optionally, determining the modulation uncertainty MMS_CONFUSION may provide a measure of the reliability of the perceptual tempo estimate.

In the first step, the underlying metre of the music track may be determined in order to identify the possible factors by which the physically measured tempo should be corrected. For example, the peaks in the modulation spectrum of a music track with a 3/4 beat occur at three times the frequency of the base rhythm. The tempo correction should therefore be based on a factor of three. For a music track with a 4/4 beat, the tempo correction should be based on a factor of 2. This is shown in figure 11, which shows the SBR payload modulation spectra of a jazz track with a 3/4 beat (fig. 11a) and of a metal track with a 4/4 beat (fig. 11b). The metre can be determined from the distribution of the peaks in the SBR payload modulation spectrum. For a 4/4 beat the significant peaks are multiples of each other on a basis of two, whereas for a 3/4 beat they are multiples on a basis of three.

To mitigate this potential source of tempo estimation errors, a cross-correlation method can be applied. In one embodiment of the invention, the autocorrelation of the modulation spectrum can be determined for different frequency lags Δd. The autocorrelation can be given by:

$$Corr(\Delta d) = \frac{1}{D N}\sum_{d=1}^{D}\sum_{n=1}^{N} MMS(n,d)\cdot MMS(n,d+\Delta d). \tag{4}$$

Frequency lags Δd that yield a maximum correlation Corr(Δd) provide an indication of the underlying metre. More precisely, if d_max is the physically most salient modulation frequency, the expression

$$\frac{d_{max} + \Delta d}{d_{max}}$$

provides an indication of the underlying metre.
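By way of illustration only, the autocorrelation of equation 4 and the resulting ratio can be sketched as follows. The text does not state how terms with d + Δd beyond D are handled, so this sketch simply drops them. It also assumes that bin index 0 corresponds to 0 Hz, so that index ratios equal frequency ratios. All names and the toy spectrum are hypothetical.

```python
def modulation_autocorrelation(mms, lag):
    """Equation 4 for one non-negative lag; terms with d + lag beyond D are dropped."""
    n_bands, n_bins = len(mms), len(mms[0])
    total = 0.0
    for row in mms:
        for d in range(n_bins - lag):
            total += row[d] * row[d + lag]
    return total / (n_bands * n_bins)


def best_lag(mms, min_lag=1):
    """Return the lag >= min_lag that yields the largest autocorrelation."""
    n_bins = len(mms[0])
    return max(range(min_lag, n_bins),
               key=lambda lag: modulation_autocorrelation(mms, lag))


# Toy single-band spectrum with peaks at bins 4 and 8 (illustrative values only).
mms = [[0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 2.0, 0.0]]
lag = best_lag(mms)
d_max = mms[0].index(max(mms[0]))       # most salient modulation bin
metric_hint = (d_max + lag) / d_max     # ratio from the text
print(lag, d_max, metric_hint)
```

In this toy case the ratio is 2, which would point to a metre based on two.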

In one embodiment of the invention, the underlying metre can be determined by cross-correlating the averaged modulation spectrum with synthesized, perceptually modified multiples of the physically most salient tempo. The sets of multiples for the double (equation 5) and triple (equation 6) uncertainty are calculated as follows:

$$Multiples_{double} = d_{max}\cdot\left\{\tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 4\right\}, \tag{5}$$

$$Multiples_{triple} = d_{max}\cdot\left\{\tfrac{1}{6}, \tfrac{1}{3}, 1, 3, 6\right\}. \tag{6}$$

The next step is the synthesis of tapping functions for the different metres, where the tapping functions have the same length as the modulation spectrum representation, i.e. the same length as the modulation frequency axis (equation 7):

$$SynthTab_{double,triple}(d) = \begin{cases} 1 & \text{if } d \in Multiples_{double,triple} \\ 0 & \text{otherwise} \end{cases}, \quad 1 \le d \le D. \tag{7}$$

The synthesized tapping functions $SynthTab_{double,triple}(d)$ represent a model of a person tapping at different metrical levels of the underlying tempo. That is, assuming a 3/4 beat, the tempo may be tapped at 1/6 of its beat, at 1/3 of its beat, at its beat, at three times its beat and at six times its beat. Similarly, assuming a 4/4 beat, the tempo may be tapped at 1/4 of its beat, at 1/2 of its beat, at its beat, at twice its beat and at four times its beat.
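By way of illustration only, equations 5 to 7 can be sketched as follows. The text does not say how non-integer multiples are mapped to bins, so this sketch rounds them with Python's `round()` and drops multiples that fall outside 1..D. The names are hypothetical.

```python
DOUBLE_FACTORS = (0.25, 0.5, 1, 2, 4)      # equation 5
TRIPLE_FACTORS = (1 / 6, 1 / 3, 1, 3, 6)   # equation 6


def synth_tap(d_max, factors, n_bins):
    """Equation 7: 1 at bins that are multiples of d_max, 0 elsewhere.

    Bins are numbered 1..n_bins as in the text. Rounding non-integer
    multiples and dropping out-of-range ones are assumptions of this sketch.
    """
    taps = [0.0] * n_bins
    for factor in factors:
        d = round(d_max * factor)
        if 1 <= d <= n_bins:
            taps[d - 1] = 1.0
    return taps


double_taps = synth_tap(12, DOUBLE_FACTORS, 48)
triple_taps = synth_tap(12, TRIPLE_FACTORS, 48)
print([i + 1 for i, v in enumerate(double_taps) if v])
print([i + 1 for i, v in enumerate(triple_taps) if v])
```

With d_max = 12 and D = 48 the double function marks bins 3, 6, 12, 24 and 48, while the triple function marks bins 2, 4, 12 and 36, because the multiple 72 lies outside the axis.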

If perceptually modified versions of the modulation spectra are considered, the synthesized tapping functions may also need to be modified in order to obtain a common representation. If perceptual blurring is omitted from the perceptual tempo extraction scheme, this step can be skipped. Otherwise, the synthesized tapping functions should undergo the blurring described by equation 8, in order to adapt them to the shape of human tempo tapping histograms.

$$SynthTab_{double,triple}(d) = SynthTab_{double,triple}(d) * B, \quad 1 \le d \le D, \tag{8}$$

where B is the blurring kernel and * is the convolution operation. The blurring kernel B is a vector of fixed length that has the shape of a peak of a tempo tapping histogram, for example the shape of a triangular or a narrow pulse. The shape of the blurring kernel B preferably reflects the shape of the peaks of tempo tapping histograms, e.g. histograms 102, 103 of figure 1. The width of the blurring kernel B, i.e. the number of coefficients of the kernel B and thus the range of modulation frequencies covered by it, is typically the same across the complete modulation frequency range D. In one embodiment of the invention, the blurring kernel B is a narrow bell-shaped pulse with a maximum amplitude of one. The blurring kernel B may cover a modulation frequency range of 0.265 Hz (about 16 BPM), i.e. it can have a width of +/- 8 BPM relative to the centre of the pulse.
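By way of illustration only, the blurring of equation 8 can be sketched as a convolution that keeps the original length of the tapping function. The Gaussian shape, the value of `sigma` and the bin spacing are assumptions of this sketch. With a hypothetical spacing of 1 BPM per bin, a width of +/- 8 BPM would correspond to `half_width=8`.

```python
import math


def bell_kernel(half_width, sigma):
    """Bell-shaped kernel with 2 * half_width + 1 taps and a peak amplitude of 1."""
    return [math.exp(-0.5 * (k / sigma) ** 2)
            for k in range(-half_width, half_width + 1)]


def blur(taps, kernel):
    """Equation 8: convolve taps with an odd-length kernel, keeping the length of taps."""
    half = len(kernel) // 2
    out = [0.0] * len(taps)
    for d, value in enumerate(taps):
        if value == 0.0:
            continue
        # Spread each tap over its neighbouring bins, weighted by the kernel.
        for k, weight in enumerate(kernel):
            j = d + k - half
            if 0 <= j < len(taps):
                out[j] += value * weight
    return out


kernel = bell_kernel(half_width=2, sigma=1.0)
blurred = blur([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], kernel)
print(blurred)
```

The single tap keeps its amplitude of one at the centre and decays symmetrically on both sides, which is the behaviour the text asks of the kernel.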

Once the perceptual modification of the synthesized tapping functions has been performed (if required), the zero-lag cross-correlation between the tapping functions and the original modulation spectrum is calculated. This is shown in equation 9:

$$Corr_{double,triple} = \sum_{d=1}^{D}\left(\sum_{n=1}^{N} MMS(n,d)\right)\cdot SynthTab_{double,triple}(d). \tag{9}$$

Finally, a correction factor is determined by comparing the correlation results obtained with the synthesized tapping function for the "double" metre and with the synthesized tapping function for the "triple" metre. The correction factor is set to 2 if the correlation obtained with the tapping function for the double uncertainty is greater than or equal to the correlation obtained with the tapping function for the triple uncertainty, and to 3 otherwise (equation 10):

$$Correction = \begin{cases} 2 & \text{if } Corr_{double} \ge Corr_{triple} \\ 3 & \text{else} \end{cases}. \tag{10}$$
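By way of illustration only, equations 9 and 10 can be sketched as follows. The tapping functions are written out by hand for a hypothetical axis of D = 8 bins with d_max = 4. The names and the toy spectrum are not part of the described system.

```python
def tap_correlation(mms, taps):
    """Equation 9: zero-lag correlation of the band-summed spectrum with a tapping function."""
    summed = [sum(row[d] for row in mms) for d in range(len(taps))]
    return sum(s * t for s, t in zip(summed, taps))


def correction_factor(corr_double, corr_triple):
    """Equation 10: 2 if the double metre correlates at least as well, otherwise 3."""
    return 2 if corr_double >= corr_triple else 3


# Toy single-band spectrum with peaks at bins 2, 4 and 8 (illustrative values only).
mms = [[0.0, 2.0, 0.0, 5.0, 0.0, 0.0, 0.0, 3.0]]
double_taps = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]   # bins 1, 2, 4, 8
triple_taps = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # bins 1, 4

corr_double = tap_correlation(mms, double_taps)
corr_triple = tap_correlation(mms, triple_taps)
print(corr_double, corr_triple, correction_factor(corr_double, corr_triple))
```

Because the peaks of the toy spectrum are spaced by factors of two, the double tapping function collects more of the spectrum and the correction factor comes out as 2.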

It should be noted that, in general terms, a correction factor is determined by applying correlation techniques to the modulation spectrum. The correction factor is associated with the underlying metre of the music signal, i.e. a 4/4, 3/4 or other beat. The underlying beat metre may be determined by applying correlation techniques to the modulation spectrum of the music signal, some of which are described above.

Using the correction factor, the actual perceptual tempo correction can be performed. In one embodiment of the invention, this is done in a stepwise manner. Pseudocode for this illustrative embodiment is given in table 2.

In the first step, the physically most salient tempo, referred to as "Tempo" in table 2, is mapped into the range of interest using the parameter MMS_BEATSTRENGTH and the previously calculated correction factor. If the value of MMS_BEATSTRENGTH is below a certain threshold (which depends on the signal domain, audio codec, bit rate and sampling frequency), and if the physically determined tempo, i.e. the parameter "Tempo", is relatively high or relatively low, the physically most salient tempo is corrected by the determined correction factor, i.e. by the beat metre.

The above thresholds for the parameter MMS_Centroid are used in the second tempo correction step indicated in table 2. In the second tempo correction step, large discrepancies between the tempo estimate and the parameter MMS_Centroid are identified and, where present, corrected. For example, if the estimated tempo is relatively high and the parameter MMS_Centroid indicates that the perceived speed should be relatively low, the estimated tempo is reduced by the correction factor. Similarly, if the estimated tempo is relatively low while the parameter MMS_Centroid indicates that the perceived speed should be relatively high, the estimated tempo is increased by the correction factor.
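By way of illustration only, the two correction steps described above can be sketched as follows. The sketch follows the prose description and does not reproduce table 2. Every threshold in `LIMITS` is a hypothetical placeholder, and reusing the same tempo limits in both steps is a simplification of this sketch.

```python
# Hypothetical placeholder thresholds; real values depend on domain, codec and bit rate.
LIMITS = {
    "beat_strength": 0.5,   # below this the beat is treated as weak
    "tempo_high": 160.0,    # BPM regarded as relatively high
    "tempo_low": 70.0,      # BPM regarded as relatively low
    "centroid_slow": 2.0,   # centroid below this suggests a slow perceived speed
    "centroid_fast": 4.0,   # centroid above this suggests a fast perceived speed
}


def correct_tempo(tempo, correction, beat_strength, centroid, limits=LIMITS):
    """Two-step perceptual correction of a physically salient tempo in BPM."""
    # Step 1: fold the tempo into the range of interest when the beat is weak.
    if beat_strength < limits["beat_strength"]:
        if tempo > limits["tempo_high"]:
            tempo /= correction
        elif tempo < limits["tempo_low"]:
            tempo *= correction
    # Step 2: resolve large mismatches between the tempo and the perceived speed.
    if tempo > limits["tempo_high"] and centroid < limits["centroid_slow"]:
        tempo /= correction
    elif tempo < limits["tempo_low"] and centroid > limits["centroid_fast"]:
        tempo *= correction
    return tempo


print(correct_tempo(200.0, 2, beat_strength=0.2, centroid=3.0))
print(correct_tempo(60.0, 3, beat_strength=0.9, centroid=5.0))
```

Passing the thresholds as a dictionary keeps them replaceable, which matters because the text states that they depend on the signal and codec.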

Another embodiment of the perceptual tempo correction scheme is described in table 4. The pseudocode is shown for a correction factor of 2; however, the example applies equally to other correction factors. In the perceptual tempo correction scheme of table 4, the first step checks whether the uncertainty, i.e. MMS_CONFUSION, exceeds a certain threshold. If not, the physically salient tempo t_1 is assumed to correspond to the perceptually salient tempo. If, however, the level of uncertainty exceeds the threshold, the physically salient tempo t_1 is corrected using information on the perceived speed of the music signal, which is drawn from the parameter MMS_Centroid.

It should be noted that alternative schemes can also be used to classify music tracks. For example, a classifier can be designed to classify the speed and then to perform the perceptual corrections. In one embodiment of the invention, the parameters used for tempo correction, i.e. in particular MMS_CONFUSION, MMS_Centroid and MMS_BEATSTRENGTH, can be trained and modelled in order to classify the uncertainty, the speed and the beat strength of unknown music signals automatically. The classifiers can be used to perform perceptual corrections similar to those described above. In this way, the use of fixed thresholds, as presented in tables 3 and 4, can be avoided and the system can be made more flexible.

As mentioned above, the proposed parameter MMS_CONFUSION provides an indication of the reliability of the estimated tempo. This parameter can also be used as a MIR (music information retrieval) feature for mood and genre classification.

It should be noted that the above perceptual tempo correction scheme can be applied on top of a variety of physical tempo estimation methods. This is illustrated in figure 9, which shows that the perceptual tempo correction scheme can be applied to physical tempo estimates obtained from the compressed domain (reference sign 921), from the transform domain (reference sign 922) and from the PCM domain (reference sign 923).

Fig. 13 shows an example block diagram of a tempo estimation system 1300. It should be noted that, depending on the requirements, different components of the tempo estimation system 1300 can be used separately. The system 1300 includes a system control unit 1310, a domain parser 1301, a pre-processing stage 1302, 1303, 1304, 1305, 1306, 1307 for obtaining a unified signal representation, an algorithm 1311 for determining the salient tempo, and a post-processing unit 1308, 1309 for the perceptual correction of the extracted tempo.

The signal flow may be as follows. First, the input signal of any domain is fed to the domain parser 1301, which extracts from the input audio file all the information needed for tempo determination and correction, such as the sampling rate and the number of channels. These values are then stored in the system control unit 1310, which sets up the computational path in accordance with the domain of the input signal.

Segments that include pre-processed MDCT or PCM data undergo a Mel-scale transformation and/or a processing step that reduces the dimensionality by means of a function (Mel-scale processing unit 1306). Segments containing SBR payload data are fed directly to the next processing unit 1307, the modulation spectrum determination unit, where an N-point FFT is calculated along the time axis. This step yields the desired modulation spectra. The number N of modulation frequency bins depends on the time resolution of the underlying domain and may be supplied to the algorithm by the system control unit 1310. In one embodiment of the invention, the spectrum is limited to 10 Hz in order to stay within perceptually relevant tempo ranges, and the spectrum is perceptually weighted in accordance with the human tempo preference curve 500.
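By way of illustration only, the computation of a modulation spectrum from a per-frame sequence, such as SBR payload sizes, can be sketched as follows. A direct DFT stands in for the FFT named in the text so that the sketch needs only the standard library. The perceptual weighting with curve 500 is omitted. The frame rate of 20 frames per second and the synthetic payload sequence are hypothetical.

```python
import cmath
import math


def modulation_spectrum(values, frame_rate, max_hz=10.0):
    """Magnitude spectrum of a per-frame sequence along time, kept up to max_hz.

    Returns (frequencies_hz, magnitudes). The 0 Hz bin is skipped because it
    carries no tempo information.
    """
    n = len(values)
    freqs, mags = [], []
    for k in range(1, n // 2 + 1):
        freq = k * frame_rate / n
        if freq > max_hz:
            break
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(values))
        freqs.append(freq)
        mags.append(abs(coeff))
    return freqs, mags


# Synthetic payload sizes pulsing at 2 Hz, i.e. 120 BPM (illustrative values only).
frame_rate = 20.0
payload = [100 + 50 * math.cos(2 * math.pi * 2.0 * t / frame_rate) for t in range(200)]

freqs, mags = modulation_spectrum(payload, frame_rate)
peak_hz = freqs[mags.index(max(mags))]
print(peak_hz, peak_hz * 60)
```

The modulation frequency of the largest peak, multiplied by 60, gives the physically most salient tempo in BPM, which is the quantity that the subsequent perceptual correction operates on.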

To enhance the modulation peaks in the spectra based on the uncompressed domain and the transform domain, the absolute difference along the modulation frequency axis may be calculated in the next step (within the modulation spectrum determination unit 1307), followed by perceptual blurring along both the Mel-scale frequency axis and the modulation frequency axis in order to adapt to the shape of tempo tapping histograms. This computational step is optional for the uncompressed and transform domains, since no new information is generated, but it usually leads to an improved visual representation of the modulation spectra.

Finally, the segments processed in unit 1307 can be combined by an averaging operation. As noted above, averaging can include the calculation of a mean value or the determination of a median value. This leads to the final representation of the perceptually motivated Mel-scale modulation spectrum (MMS) from uncompressed PCM data or transform-domain MDCT data, or to the final representation of the perceptually motivated SBR payload modulation spectrum (MS_SBR) for compressed-domain bit stream components.
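By way of illustration only, the combination of segments can be sketched as follows. The sketch assumes that each segment has already been reduced to a one-dimensional modulation spectrum of equal length, as is the case for MS_SBR(d). The function name and the toy values are hypothetical.

```python
from statistics import mean, median


def combine_segments(segments, use_median=False):
    """Combine per-segment modulation spectra bin by bin with a mean or a median."""
    reducer = median if use_median else mean
    return [reducer(bin_values) for bin_values in zip(*segments)]


# Three toy segment spectra with three bins each (illustrative values only).
segments = [
    [1.0, 2.0, 9.0],
    [1.0, 4.0, 0.0],
    [4.0, 6.0, 0.0],
]
print(combine_segments(segments))
print(combine_segments(segments, use_median=True))
```

In the toy data the last bin contains a single outlier, which the median suppresses while the mean does not.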

From the modulation spectra, parameters such as the modulation spectrum centroid, the modulation spectrum beat strength and the modulation spectrum tempo uncertainty can be calculated. Any of these parameters can be fed to the perceptual tempo correction unit 1309 and used by it to correct the physically most salient tempo obtained from the maximum calculation 1311. The output of the system 1300 is the perceptually most salient tempo of the current input music file.

It should be noted that the tempo estimation methods outlined in this document can be applied in an audio decoder as well as in an audio encoder. The methods for estimating the tempo of audio signals in the compressed domain, the transform domain and the PCM domain may be used when decoding an encoded file. These methods are equally applicable when encoding an audio signal. The complexity-scalability concept of these methods holds both when decoding and when encoding an audio signal.

It should also be noted that, although the methods described in this document are described in the context of tempo estimation and correction on audio signals as a whole, they may also be applied to subsections of the audio signal, for example to the MMS segments, thereby providing tempo information for the subsections of the audio signal.

As a further aspect, it should be noted that the physical tempo information and/or the perceptual tempo information of an audio signal can be written into the encoded bit stream in the form of metadata. These metadata can be extracted and used by a media player or by a MIR application.

In addition, it is contemplated to modify and compress modulation spectral representations (for example, the modulation spectra 1001 and, in particular, 1002 and 1003 in figure 10) and to store the possibly modified and/or compressed modulation spectra as metadata in an audio/video file or bit stream. This information can be used as an acoustic image thumbnail of the audio signal. This can be useful for providing the user with details of the rhythmic content of the audio signal.

Furthermore, the proposed methods and system make use of knowledge about human tempo perception and about the distribution of musical tempo in large music data sets. Besides an evaluation of a suitable representation of the audio signal for tempo estimation, a perceptual tempo weighting function is described, as well as a perceptual tempo correction scheme that provides reliable estimates of the perceptually salient tempo of audio signals.

The proposed methods and systems can be used in the context of MIR applications, for example for genre classification. Owing to their low computational complexity, the tempo estimation schemes, in particular the estimation method based on the SBR payload, can be implemented directly on portable electronic devices, which usually have limited processing and memory resources.

Furthermore, the determination of the perceptually salient tempo can be used for music selection, comparison, mixing and playlist generation. For example, when generating a playlist with smooth rhythmic transitions between adjacent tracks, information on the perceptually salient tempo of the music tracks may be more relevant than information on the physically salient tempo.

The tempo estimation methods and systems described in this document can be implemented as software, firmware and/or hardware. Some components may, for example, be implemented as software running on a digital signal processor or a microprocessor. Other components may, for example, be implemented as hardware or as application-specific integrated circuits. The signals encountered in the described methods and systems can be stored on media such as memory or optical storage media. They can be transmitted over networks such as radio networks, satellite networks, wireless networks or wired networks, e.g. the Internet. Typical devices using the methods and systems described in this document are portable electronic devices or other consumer equipment used to store and/or reproduce audio signals. The methods and system can also be used on computer systems, such as Internet web servers, that store audio signals, e.g. music signals, and provide them for download.

1. A method for extracting tempo information of an audio signal from a spectral band replication encoded bit stream of the audio signal, wherein the encoded bit stream includes spectral band replication data, the method including the steps of: - determining a payload quantity associated with the amount of spectral band replication data contained in the encoded bit stream for a time interval of the audio signal; - repeating the determining step for successive time intervals of the encoded bit stream of the audio signal, thereby determining a sequence of payload quantities; - identifying a periodicity in the sequence of payload quantities; and - extracting tempo information of the audio signal from the identified periodicity.

2. The method according to claim 1, characterized in that determining the payload quantity includes the steps of: - determining the amount of data contained in one or more fill-element fields of the encoded bit stream in the time interval; and - determining the payload quantity on the basis of the amount of data contained in the one or more fill-element fields of the encoded bit stream in the time interval.

3. The method according to claim 2, characterized in that determining the payload quantity includes the steps of: - determining the amount of spectral band replication header data contained in the one or more fill-element fields of the encoded bit stream in the time interval; - determining a net amount of data contained in the one or more fill-element fields of the encoded bit stream in the time interval by subtracting the amount of spectral band replication header data contained in the one or more fill-element fields of the encoded bit stream in the time interval; and - determining the payload quantity on the basis of the net amount of data.

4. The method according to claim 3, wherein the payload quantity corresponds to the net amount of data.

5. The method according to any one of the preceding claims, wherein: - the encoded bit stream includes a succession of frames, each frame corresponding to an excerpt of the audio signal of a pre-defined length of time; and - the time interval corresponds to a frame of the encoded bit stream.

6. The method according to claim 1, wherein the repeating step is performed for all frames of the encoded bit stream.

8. The method according to claim 1, characterized in that identifying the periodicity includes the steps of: - performing a spectral analysis on the sequence of payload quantities, which yields a set of energy values and corresponding frequencies; and - identifying the periodicity in the sequence of payload quantities by determining a relative maximum in the set of energy values and selecting the periodicity as the corresponding frequency.

9. The method according to claim 8, wherein performing the spectral analysis includes the steps of: - performing a spectral analysis on a plurality of subsequences of the sequence of payload quantities, which yields a plurality of sets of energy values; and - averaging the plurality of sets of energy values.

10. The method according to claim 9, wherein the subsequences partly overlap.

11. The method according to any one of claims 8 to 10, wherein performing the spectral analysis involves performing a Fourier transform.

12. The method according to claim 11, further including the step of: - multiplying the set of energy values by weights associated with the human perceptual preference for the corresponding frequencies.

13. The method according to claim 12, wherein extracting the tempo information includes the step of: - determining the frequency corresponding to the absolute maximum value of the set of energy values; wherein this frequency corresponds to the physically salient tempo of the audio signal.

14. The method according to claim 1, characterized in that the audio signal includes a music signal, and wherein extracting the tempo information includes estimating the tempo of the music signal.

15. A data medium including a software program adapted for execution on a processor and for performing the steps of the method according to any one of claims 1 to 14 when carried out on a computing device.

16. A portable electronic device comprising: - a memory unit configured to store an audio signal; - an audio rendering unit configured to render the audio signal; - a user interface configured to receive a user request for tempo information on the audio signal; and - a processor configured to determine the tempo information by performing the steps of the method according to any one of claims 1 to 14 on the audio signal.

17. A system configured to extract tempo information of an audio signal from a spectral band replication encoded bit stream of the audio signal, wherein the encoded bit stream includes spectral band replication data of the audio signal, the system including: - means for determining a payload quantity associated with the amount of spectral band replication data contained in the encoded bit stream for a time interval of the audio signal; - means for repeating the determining step for successive time intervals of the encoded bit stream of the audio signal, thereby determining a sequence of payload quantities; - means for identifying a periodicity in the sequence of payload quantities; and - means for extracting tempo information of the audio signal from the identified periodicity.

18. A method for generating an encoded bit stream that includes metadata of an audio signal, the method including the steps of: - determining metadata associated with the tempo of the audio signal, wherein the tempo is determined in accordance with the steps of the method according to any one of claims 1 to 14; and - inserting the metadata into the encoded bit stream.

19. The method according to claim 18, wherein the metadata includes data representing the physically salient tempo of the audio signal.

20. The method according to claim 19, wherein the metadata includes data representing a modulation spectrum of the audio signal, wherein the modulation spectrum includes a plurality of frequencies of occurrence and a corresponding plurality of importance values, the importance values indicating the relative importance of the corresponding frequencies of occurrence in the audio signal.

21. The method according to claim 20, further including the step of: - encoding the audio signal into a sequence of payload data of the encoded bit stream using one of the following encoders: HE-AAC, MP3, AAC, Dolby Digital or Dolby Digital Plus.

22. An audio encoder configured to generate an encoded bit stream that includes metadata of an audio signal, the encoder including: - means for determining metadata associated with the tempo of the audio signal, wherein the tempo is determined in accordance with the steps of the method according to any one of claims 1 to 14; and - means for inserting the metadata into the encoded bit stream.