Apparatus and method of generating output audio signals using object based metadata
IPC classes for Russian patent Apparatus and method of generating output audio signals using object based metadata (RU 2510906):
Generation of binaural signals / 2505941
Described is a device for generating a binaural signal based on a multi-channel signal representing a plurality of channels and intended for reproduction by a speaker system, wherein a virtual sound source position is associated with each channel. The device includes a correlation reducer for differently converting, and thereby reducing correlation between, at least one of a left and a right channel of the plurality of channels, a front and a rear channel of the plurality of channels, and a centre and a non-centre channel of the plurality of channels, in order to obtain an inter-similarity reduced combination of channels; a plurality of directional filters; a first mixer for mixing output signals of the directional filters modelling the acoustic transmission to the first ear canal of the listener; and a second mixer for mixing output signals of the directional filters modelling the acoustic transmission to the second ear canal of the listener. Also disclosed is an approach where the centre level is reduced to form a downmix signal, which is further transmitted to a processor for constructing an acoustic space. Another approach involves generating a set of inter-similarity reduced transfer functions modelling the ear canal of the person.
Apparatus for merging spatial audio streams / 2504918
Method comprises estimating a first wave representation comprising a first wave direction measure characterising the direction of a first wave and a first wave field measure being related to the magnitude of the first wave for the first spatial audio stream, having a first audio representation comprising a measure for pressure or magnitude of a first audio signal and a first direction of arrival of sound; estimating a second wave representation comprising a second wave direction characterising the direction of the second wave and a second wave field measure being related to the magnitude of the second wave for the second spatial audio stream, having a second audio representation comprising a measure for pressure or magnitude of a second audio signal and a second direction of arrival of sound; processing the first wave representation and the second wave representation to obtain a merged wave representation comprising a merged wave field measure, a merged direction of arrival measure and a merged diffuseness parameter; processing the first audio representation and the second audio representation to obtain a merged audio representation, and forming a merged audio stream.
Apparatus for generating multichannel audio signal / 2498526
Apparatus (100) for generating a multichannel audio signal (142) based on an input audio signal (102) comprises a main signal upmixing means (110), a section (segment) selector (120), a section signal upmixing means (130) and a combiner (140). The main signal upmixing means (110) is configured to provide a main multichannel audio signal (112) based on the input audio signal (102). The section selector (120) is configured to select or not select a section of the input audio signal (102) based on analysis of the input audio signal (102). The selected section of the input audio signal (102), a processed selected section of the input audio signal (102) or a reference signal associated with the selected section of the input audio signal (102) is provided as section signal (122). The section signal upmixing means (130) is configured to provide a section upmix signal (132) based on the section signal (122), and the combiner (140) is configured to overlay the main multichannel audio signal (112) and the section upmix signal (132) to obtain the multichannel audio signal (142).
Lossless multi-channel audio codec using adaptive segmentation with random access point (rap) and multiple prediction parameter set (mpps) capability / 2495502
Invention relates to a lossless multi-channel audio codec which uses adaptive segmentation with random access point (RAP) and multiple prediction parameter set (MPPS) capability. The lossless audio codec encodes/decodes a lossless variable bit rate (VBR) bit stream with random access point (RAP) capability to initiate lossless decoding at a specified segment within a frame and/or multiple prediction parameter set (MPPS) capability partitioned to mitigate transient effects. This is accomplished with an adaptive segmentation technique that fixes segment start points based on constraints imposed by the existence of a desired RAP and/or a detected transient in the frame and selects an optimum segment duration in each frame to reduce the encoded frame payload subject to an encoded segment payload constraint. RAP and MPPS are particularly applicable to improving overall performance for longer frame durations.
Surround sound virtualiser with dynamic range compression and method / 2491764
Method and system for generating output signals for reproduction by two physical speakers in response to input audio signals indicative of sound from multiple source locations including at least two rear locations. Typically, the input signals are indicative of sound from three front locations and two rear locations (left and right surround sources). A virtualiser generates left and right surround output signals suitable for driving front loudspeakers to emit sound that a listener perceives as emitted from rear sources. Typically, the virtualiser generates left and right surround output signals by transforming rear source input signals in accordance with a sound perception simulation function. To ensure that virtual channels are well heard in the presence of other channels, the virtualiser performs dynamic range compression on rear source input signals. The dynamic range compression is preferably performed by amplifying rear source input signals or partially processed versions thereof in a nonlinear way relative to front source input signals.
Improved reproduction of multiple audio channels / 2479149
Invention discloses the method for reproduction of multiple audio channels, according to which out-of-phase information is extracted from side and/or rear side channels contained in a multi-channel audio signal.
Audio coding using step-up mixing / 2474887
Audio decoder for decoding a multi-object audio signal comprises a module to compute the prediction factors of a prediction matrix C based on object level difference (OLD) data, as well as a means for upmixing based on the prediction factors to obtain a first upmix audio signal approximating the first type audio signal and/or a second upmix audio signal approximating the second type audio signal. Note here that the multi-object audio signal comprises encoded audio signals of the first and second types. The multi-object audio signal consists of a downmix signal 112 and side information. The side information comprises data on first and second type signal levels at a first predefined time/frequency resolution.
Method and apparatus for supporting speech perceptibility in multichannel ambient sound with minimum effect on surround sound system / 2467406
Invention relates to processing audio signals, particularly to improving intelligibility of dialogue and oral speech, for example, in surround entertainment ambient sound. A multichannel audio signal is processed to form a first characteristic and a second characteristic. The first channel is processed to generate a speech probability value. The first characteristic corresponds to a first measured indicator which depends on the signal level in the first channel of the multichannel audio signal containing speech and non-speech audio. The second characteristic corresponds to a second measured indicator which depends on the signal level in the second channel of the multichannel audio signal primarily containing non-speech audio. Further, the first and second characteristics of the multichannel audio signal are compared to generate an attenuation coefficient, wherein the difference between the first measured indicator and the second measured indicator is determined, and the attenuation coefficient is calculated based on the obtained difference and a threshold value. The attenuation coefficient is then adjusted in accordance with the speech probability value and the second channel is attenuated using the adjusted attenuation coefficient.
User annunciation on microphone cover / 2449497
Invention relates to a mechanism which tracks signals of a secondary microphone in a mobile device with multiple microphones in order to warn a user if one or more secondary microphones are covered at the moment when the mobile device is used. In one example, smoothed averaged estimates of secondary microphone power may be calculated and compared to the estimate of the minimum noise floor of the main microphone. Detection of microphone cover may be carried out by comparing the smoothed estimates of secondary microphone power with the minimum noise floor estimate for the main microphone. In another example, the minimum noise floor estimates for the signals of the main and secondary microphones may be compared against the difference in sensitivity of the first and second microphones in order to detect whether the secondary microphone is covered. As soon as detection is complete, a warning signal may be generated and issued to the user.
Signal processing method and apparatus / 2449387
Signal processing method involves: receiving a signal and spatial information which includes channel level difference (CLD) information, a channel prediction coefficient (CPC) and interchannel coherence (ICC) information; obtaining mode information for determining the encoding scheme and modification flag information indicating whether the signal has been modified. If the mode information indicates an audio encoding scheme, the signal is decoded according to the audio encoding scheme. If the modification flag information indicates that the signal has been modified, post-modification restoration information is obtained, which indicates the value for adjusting the window length applied to the signal; the window length is modified based on the post-modification restoration information and the signal is decoded using the window with the modified length. Further, based on extension information, the base extension signal is determined; a downmix extended signal is generated, having a bandwidth which is extended using the base extension signal by restoring the high-frequency region signal; and a multichannel signal is generated by applying spatial information to the downmix extended signal.
Audio encoding / 2363116
Invention relates to encoding a multichannel audio signal, particularly encoding a multichannel signal containing first, second and third signal components. The method of encoding a multichannel audio signal containing at least a first signal component (LF), a second signal component (LR) and a third signal component (RF) involves encoding the first and second signal components using a first parametric encoder (202) to obtain the first encoded signal (L) and the first set (P2) of coding parameters. The first encoded signal and an additional signal (R) are encoded using a second parametric encoder to obtain a second encoded signal (T) and a second set (P1) of coding parameters. The additional signal is obtained from at least the third signal component, and the multichannel audio signal is represented in the form of at least the resultant encoded signal (T), obtained from at least the second encoded signal, the first set of coding parameters and the second set of coding parameters.
Multichannel surrounding sound of frontal installation of speakers / 2364053
Invention concerns multichannel sound reproduction systems, particularly the application of psychoacoustic principles in acoustic system design. The surround sound reproduction system uses a number of filters and a system of main and auxiliary speakers producing the effect of phantom rear surround channels, or phantom surround sound, via an acoustic system or a pair of speakers installed in front of the listener. The acoustic system takes left and right surround input signals and left and right frontal input signals. Left and right auxiliary speakers and left and right main speakers are positioned in front of the listening position. The distance between each main speaker and its corresponding auxiliary speaker is equal to the distance between the ears of an average human.
Parametric composite coding audio sources / 2376654
Invention relates to jointly coding several signals from audio sources which must be transmitted or stored for the purpose of mixing into a synthesised wave field, multichannel three-dimensional or stereophonic audio signals after decoding. The proposed method provides more efficient joint coding of the signals than their separate coding, even when there is no redundancy between the signals. This is possible due to the statistical properties of the signals, the properties of the coding method and spatial hearing. The sum of the signals is transmitted together with the statistical properties which mainly determine the perceptually important spatial features of the final mixed audio signals. The signals are reconstructed in a receiver so that their statistical properties are approximately identical to the corresponding properties of the initial source signals.
Device and method for generating encoded stereo signal of audio part or stream of audio data / 2376726
Invention relates to multichannel audio technology and, specifically, to applications of multichannel audio in connection with headphone technologies. The device for generating an encoded stereo signal from a multichannel presentation includes a multichannel decoder (11), which forms three or more channels from at least one main channel and parametric information. Said three or more channels are subjected to headphone signal processing (12) so as to generate an uncoded first stereo channel and an uncoded second stereo channel, which are then input into a stereo encoder (13) so as to generate an encoded stereo file at the output side. The encoded stereo file can be transmitted to any suitable playback device, such as a CD player or a portable playback device, such that the user not only receives a normal stereo impression but a multichannel impression as well.
FIELD: physics, acoustics.

SUBSTANCE: invention relates to processing signals in an audio frequency band. The apparatus for generating at least one output audio signal representing a superposition of two different audio objects includes a processor for processing an input audio signal to provide an object representation of the input audio signal, where that object representation can be generated by a parametrically guided approximation of the original objects using an object downmix signal. An object manipulator individually manipulates objects using audio-object-based metadata relating to the individual audio objects to obtain manipulated audio objects. The manipulated audio objects are mixed using an object mixer to finally obtain an output audio signal having one or more channel signals depending on a specific rendering setup.

EFFECT: efficient audio signal transmission bit rate.

14 cl, 17 dwg
The scope of the invention

This invention relates to processing signals in the audio frequency band and, in particular, to processing such signals in the context of audio object coding, such as spatial audio object coding.

Background of the invention and prior art

In modern broadcasting systems, such as television, under certain circumstances it is desirable not to reproduce the audio tracks as they were designed by the sound engineer, but rather to apply special adjustments in order to address constraints defined at presentation (rendering) time. A well-known technique for controlling such adjustments at the final stage is to provide appropriate metadata along with the audio tracks.

Traditional sound reproduction systems, such as old home television systems, consist of a single loudspeaker or a pair of stereo loudspeakers. More sophisticated multi-channel reproduction systems use five or more loudspeakers. When multi-channel reproduction is considered, sound engineers have much more freedom in placing individual sources in a two-dimensional plane and can therefore also use a higher dynamic range for the overall audio tracks, since the voice becomes more distinct thanks to the well-known "cocktail party" effect.

However, realistic, highly dynamic sound can cause problems on traditional reproduction systems. There may be scenarios where a consumer does not want such a highly dynamic signal, because he or she listens to the content in a noisy environment (for example, while travelling in a car or using an in-flight mobile entertainment system), wears a hearing aid, or does not want to disturb the neighbours (late at night, for example). In addition, broadcasters face the problem that different items of the same program (for example, commercials) can be at different loudness levels due to different crest factors, requiring level adjustment of consecutive items.

In a classic broadcast transmission chain, the end user receives the already mixed audio track. Any further manipulation on the receiver side is possible only in a very limited form. Currently, a small feature set of Dolby metadata allows the user to modify some properties of the audio signal. Usually, manipulations based on the above-mentioned metadata are applied without any frequency-selective distinction, because the metadata traditionally accompanying the audio signal do not provide sufficient information to do so. In addition, only the complete audio stream can be manipulated; it is impossible to pick out and manipulate each individual audio object within this audio stream. This can be unsatisfactory, especially in an unsuitable listening environment.

In the midnight mode, the audio processor in use cannot distinguish ambient noise from dialogue due to the lack of guiding information. Therefore, in the case of high-level noise (which must be compressed or limited in loudness), the dialogue will also be manipulated in parallel. This can be harmful to speech intelligibility. Increasing the dialogue level relative to the ambient sound helps to improve speech perception, especially for people with hearing impairments. This technique only works if the audio signal is really separated into dialogue and ambient components on the receiver side, in addition to the availability of property control information.
If only a stereo downmix is available, no further separation can be applied for the individual recognition and manipulation of the speech information. Modern downmix implementations allow dynamic adjustment of the stereo levels for the center and surround channels. But for any loudspeaker configuration other than stereo there is no real indication from the transmitter of how to downmix the final multi-channel audio source; only a default formula in the decoder performs the signal mixing.

In all of these scenarios, there are usually two different approaches. The first approach is that, when generating the audio signal to be transmitted, a number of audio objects are downmixed into a mono, stereo or multi-channel signal. The signal to be transmitted to a user of this signal via broadcast, via any other transmission protocol, or via distribution on a machine-readable data carrier typically has a number of channels smaller than the number of original audio objects, which were downmixed, for example, in a studio environment. In addition, metadata can be attached to allow several different modifications, but these modifications can only be applied to the whole transmitted signal or, if the transmitted signal has several different transmitted channels, to individual transmitted channels as a whole. Since, however, such transmitted channels are always superpositions of many audio objects, individual manipulation of a certain audio object, while a further audio object is left unmanipulated, is impossible.

The other approach does not perform an object downmix, but transmits the audio object signals as they are, as separate transmitted channels. Such a scenario works well when the number of audio objects is small. When, for example, there are only five audio objects, it is possible to transmit these five different audio objects separately from each other within a 5.1 scenario. Metadata can be associated with these channels to indicate the specific nature of an object/channel. Then, on the receiver side, the transmitted channels can be manipulated based on the transmitted metadata. A disadvantage of this approach is that it is not backward compatible and works well only in the context of a small number of audio objects. When the number of audio objects increases, the bit rate required for transmitting all objects as separate explicit audio tracks rapidly increases as well. Such an increased bit rate is particularly unhelpful in the context of broadcast applications.

Therefore, the existing bit-rate-efficient approaches do not allow individual manipulation of individual audio objects. Such individual manipulation is available only when each object is transmitted separately. This approach, however, is not bit-rate efficient and is therefore not feasible, specifically, in broadcast scenarios. The objective of the invention is to provide a bit-rate-efficient yet flexible solution to these problems.
According to the first aspect of the present invention, this is achieved by an apparatus for generating at least one output audio signal representing a superposition of at least two different audio objects, comprising: a processor for processing an input audio signal to provide an object representation of the input audio signal, in which the at least two different audio objects are separated from each other, the at least two different audio objects are available as separate audio object signals, and the at least two different audio objects are manipulable independently of each other; an object manipulator for manipulating the audio object signal, or a mixed audio object signal, of at least one audio object based on audio-object-based metadata relating to the at least one audio object, to obtain a manipulated audio object signal, or a manipulated mixed audio object signal, for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a different manipulated audio object that is manipulated differently than the at least one audio object.

According to the second aspect of the present invention, this is achieved by a method of generating at least one output audio signal representing a superposition of at least two different audio objects, including: processing an input audio signal to provide an object representation of the input audio signal, in which the at least two different audio objects are separated from each other, the at least two different audio objects are available as separate audio object signals, and the at least two different audio objects are manipulable independently of each other; manipulating the audio object signal, or a mixed audio object signal, of at least one audio object based on audio-object-based metadata relating to the at least one audio object, to obtain a manipulated audio object signal, or a manipulated mixed audio object signal, for the at least one audio object; and mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a different manipulated audio object that is manipulated differently than the at least one audio object.

According to a third aspect of the present invention, this is achieved by an apparatus for generating an encoded audio signal representing a superposition of at least two different audio objects, including: a data stream formatter for formatting a data stream so that the data stream includes an object downmix signal representing a combination of the at least two different audio objects and, as side information, metadata relating to at least one of the different audio objects.

According to a fourth aspect of the present invention, this is achieved by a method of generating an encoded audio signal representing a superposition of at least two different audio objects, comprising: formatting a data stream so that the data stream includes an object downmix signal representing a combination of the at least two different audio objects and, as side information, metadata relating to at least one of the different audio objects.

Further aspects of this invention relate to computer programs implementing the inventive methods, and to a machine-readable data carrier storing an object downmix signal and, as side information, object parametric data and metadata for one or more audio objects included in the object downmix signal.
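As an illustration of the third and fourth aspects, the following is a minimal Python sketch of a data stream formatter that packs an object downmix together with object-based metadata as side information. All names (ObjectMetadata, format_stream) and the concrete byte layout are hypothetical choices for this sketch, not taken from the patent.

```python
import json
import struct
from dataclasses import dataclass

import numpy as np

@dataclass
class ObjectMetadata:
    """Hypothetical object-based metadata entry (side information)."""
    object_id: int
    kind: str          # e.g. "speech" or "ambience"
    gain_db: float     # per-object manipulation hint, e.g. for midnight mode

def format_stream(downmix: np.ndarray, metadata: list[ObjectMetadata]) -> bytes:
    """Pack an object downmix signal plus per-object metadata into one stream.

    downmix: (num_channels, num_samples) float32 array - the object downmix.
    The layout (fixed header, JSON side info, raw PCM) is an assumption.
    """
    side_info = json.dumps([m.__dict__ for m in metadata]).encode("utf-8")
    header = struct.pack("<III", downmix.shape[0], downmix.shape[1], len(side_info))
    return header + side_info + downmix.astype(np.float32).tobytes()

# Usage: two objects mixed into a stereo downmix; metadata kept per object.
obj1 = np.random.randn(1, 48000).astype(np.float32)  # speech object
obj2 = np.random.randn(1, 48000).astype(np.float32)  # ambience object
downmix = np.vstack([0.7 * obj1 + 0.5 * obj2,
                     0.7 * obj1 + 0.5 * obj2])
stream = format_stream(downmix, [ObjectMetadata(1, "speech", 0.0),
                                 ObjectMetadata(2, "ambience", -6.0)])
```

The key point mirrored from the aspects above is that the metadata travel as side information per object, while the main payload remains a conventional downmix.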
This invention is based on the finding that an individual manipulation of separate audio objects, or of separate sets of already mixed audio object signals, allows an individual object-related processing based on object-related metadata. According to this invention, the result of the manipulation is not output directly to a loudspeaker, but is provided to an object mixer, which generates output signals for a certain rendering scenario, where the output signals are generated by a superposition of at least one manipulated object signal, or a set of mixed object signals, together with other manipulated object signals and/or an unmodified object signal. Naturally, there is no need to manipulate every object; in some cases it is sufficient to manipulate only one object and leave a further object of the plurality of audio objects unmanipulated. The result of the object mixing operation is one or more output audio signals based on the manipulated objects. These output audio signals can be transmitted to loudspeakers, can be stored for further use, or can even be transmitted to a further receiver, depending on the specific application scenario.

Preferably, the signal input into the inventive manipulation/mixing device is a downmix signal generated by downmixing a plurality of audio object signals. The downmix operation can be metadata-controlled for each object individually, or can be uncontrolled, i.e., the same for each object. In the former case, the manipulation of the object in accordance with the metadata is an object-individual manipulation and an object-specific mixing operation, in which a speaker component signal representing this object is generated. Preferably, spatial object parameters are provided as well, which can be used to reconstruct approximate versions of the original signals using the transmitted object downmix signal. Then, the processor for processing the input audio signal to provide an object representation of the input audio signal is operative to calculate reconstructed versions of the original audio objects based on the parametric data, where these approximated object signals can then be individually manipulated by object-based metadata.

Preferably, object rendering information is provided as well, where the object rendering information includes information on the intended audio reproduction setup and information on the placement of the individual audio objects within the reproduction scenario. Specific embodiments, however, can also work without such object location data. Such configurations operate, for example, with stationary object positions, which may be fixedly set or which may be negotiated between transmitter and receiver for a complete audio track.
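The parametrically guided approximation of the original objects mentioned above can be sketched as a minimum-mean-square-style estimate of the objects from the downmix, as is common in spatial audio object coding. The estimator Ŝ = E Dᴴ (D E Dᴴ)⁻¹ X below is the standard SAOC-style formula; the concrete code is an illustrative assumption, not the patent's exact algorithm.

```python
import numpy as np

def reconstruct_objects(X: np.ndarray, D: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Approximate the N object subband signals from a K-channel downmix.

    X: (K, T) downmix subband samples for one time/frequency tile.
    D: (K, N) downmix matrix.
    E: (N, N) object covariance matrix (the transmitted object parameters).
    Returns S_hat: (N, T) parametrically reconstructed object signals.
    """
    # MMSE-style upmix: S_hat = E D^H (D E D^H)^{-1} X
    G = E @ D.conj().T @ np.linalg.pinv(D @ E @ D.conj().T)
    return G @ X

# Usage: three objects, stereo downmix of one subband tile.
rng = np.random.default_rng(0)
S = rng.standard_normal((3, 256))          # "true" objects (unknown to the decoder)
D = np.array([[1.0, 0.7, 0.0],
              [0.0, 0.7, 1.0]])            # object downmix matrix
E = (S @ S.T) / S.shape[1]                 # object covariance parameters
S_hat = reconstruct_objects(D @ S, D, E)   # approximate object signals
```

As the text notes, the accuracy of S_hat is limited by the parametric side information; the reconstructed signals are approximations, not exact copies, of the original objects.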
Brief description of drawings

Preferred embodiments of the present invention are subsequently discussed in the context of the attached drawings, in which:

Fig. 1 illustrates a preferred embodiment of an apparatus for generating at least one output audio signal;
Fig. 2 illustrates a preferred implementation of the processor of Fig. 1;
Fig. 3a illustrates a preferred embodiment of the manipulator for manipulating object signals;
Fig. 3b illustrates a preferred implementation of the object mixer in the context of a manipulator as illustrated in Fig. 3a;
Fig. 4 illustrates a processor/manipulator/object-mixer configuration in a situation in which the manipulation is performed subsequent to an object downmix, but before a final object mix;
Fig. 5a illustrates a preferred embodiment of an apparatus for generating an encoded audio signal;
Fig. 5b illustrates a transmission signal having an object downmix, object based metadata and spatial object parameters;
Fig. 6 illustrates a map showing several audio objects identified by a certain ID, having an object audio file, and a joint audio object information matrix E;
Fig. 7 illustrates an explanation of the object covariance matrix E of Fig. 6;
Fig. 8 illustrates a downmix matrix and an audio object encoder controlled by the downmix matrix D;
Fig. 9 illustrates a target rendering matrix A, which is normally provided by a user, and an example for a specific target rendering scenario;
Fig. 10 illustrates a preferred embodiment of an apparatus for generating at least one output audio signal in accordance with a further aspect of the present invention;
Fig. 11a illustrates a further embodiment;
Fig. 11b illustrates another embodiment;
Fig. 11c illustrates a further embodiment;
Fig. 12a illustrates an exemplary application scenario; and
Fig. 12b illustrates a further exemplary application scenario.

Detailed description of preferred embodiments

In order to solve the above problems, a preferred approach is to provide appropriate metadata along with the audio tracks. Such metadata can contain information to control the following three factors (the three "classic" D's):
- dialogue normalization,
- dynamic range control,
- downmix.

Such audio metadata help the receiver to manipulate the received audio signal based on adjustments made by the listener. To distinguish this kind of audio metadata from others (for example, descriptive metadata such as Author, Title, ...), it is usually called "Dolby Metadata" (because it is, as yet, implemented only by Dolby). Subsequently, only this kind of audio metadata is considered, and it is simply called "metadata".

Audio metadata is additional control information that is carried along with an audio program and contains information about the audio that is essential to the receiver. Metadata provide many important functions, including dynamic range control for less-than-ideal listening environments, level matching between programs, downmixing information for reproducing multi-channel audio through fewer speaker channels, and other information. Metadata provide the tools necessary for audio programs to be reproduced accurately and artistically in many different listening situations, from full-scale home theaters to in-flight entertainment, regardless of the number of speaker channels, the quality of the playback equipment, or the relative ambient noise level.
While the engineer or content provider takes care of ensuring the highest possible audio quality within their program, he or she has no control over the vast array of consumer electronics or listening environments that will attempt to reproduce the originally produced soundtrack. Metadata provide the engineer or content provider the ability to control how their work is reproduced and heard in almost every conceivable listening environment.

Dolby Metadata are a special format for providing information to control the three mentioned factors. The three most important functions of Dolby metadata are:
- Dialogue normalization, to achieve a long-term average level of dialogue within a presentation, often consisting of different program types, such as feature film, commercials, etc., as sketched below.
- Dynamic range control, to satisfy most of the audience with pleasant audio compression, but at the same time allow each individual listener to control the dynamics of the audio signal and adjust the compression to his or her personal listening environment.
- Downmix, to map the sounds of a multi-channel audio signal to two channels or one channel if no multi-channel audio playback equipment is available.

Dolby metadata are used along with Dolby Digital (AC-3) and Dolby E. The Dolby E audio metadata format is described in [16]. Dolby Digital (AC-3) is intended for delivering audio into the home through digital television broadcast (high or standard definition), DVD or other media. Dolby Digital can carry anything from a single audio channel up to a full 5.1-channel program, including metadata. In both digital television and DVD, it is commonly used for the transmission of stereo as well as full 5.1 discrete audio programs.

Dolby E is specifically intended for the distribution of multi-channel audio within professional production and distribution environments. At any time prior to delivery to the consumer, Dolby E is the preferred method for distributing multi-channel/multi-program audio with video. Dolby E can carry up to eight discrete audio channels, configured into any number of individual program configurations (including metadata for each), within an existing two-channel digital audio infrastructure. Unlike Dolby Digital, Dolby E can handle many encode/decode generations and is synchronous with the video frame rate. Like Dolby Digital, Dolby E carries metadata for each individual audio program encoded within the data stream. Using Dolby E, the resulting audio data stream can be decoded, modified and re-encoded without audible degradation. Because the Dolby E stream is synchronous with the video frame rate, it can be routed, switched and edited in a professional broadcast environment.

Apart from this, means are provided along with MPEG AAC to perform dynamic range control and to control downmix generation. In order to handle source material with variable peak levels, mean levels and dynamic range in a manner that minimizes variability for the consumer, it is necessary to control the reproduced level such that, for example, the dialogue level or the mean music level is set to a listener-controlled level at reproduction, regardless of how the program was originated. Additionally, not all consumers will be able to listen to the program in a good (i.e., low-noise) environment, with no constraint on how loud they make the sound.
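As a toy illustration of the dialogue normalization feature just described, the sketch below assumes the AC-3 convention in which programs are aligned so that dialogue reproduces at a -31 dBFS reference; the function name and target value are illustrative assumptions, not the Dolby implementation.

```python
import numpy as np

def apply_dialnorm(audio: np.ndarray, dialnorm_db: float,
                   target_db: float = -31.0) -> np.ndarray:
    """Attenuate a program so its stated dialogue level lands on the target.

    dialnorm_db: long-term dialogue level of this program in dBFS
    (carried as metadata with the stream). With the assumed -31 dBFS
    target, a program announced at -24 dBFS is attenuated by 7 dB, so
    consecutive items (film, commercial, ...) play back at a uniform
    dialogue level without manual volume adjustment.
    """
    gain_db = target_db - dialnorm_db
    return audio * (10.0 ** (gain_db / 20.0))

loud_commercial = np.random.randn(48000) * 0.3
matched = apply_dialnorm(loud_commercial, dialnorm_db=-24.0)  # -7 dB gain
```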
The automotive environment, for example, has a high level of ambient noise, and it can therefore be expected that the listener will want to reduce the range of levels that would otherwise be reproduced. For both of these reasons, dynamic range control has to be available within the AAC (Advanced Audio Coding) specification. To achieve this, it is necessary to accompany the reduced-bit-rate audio with data used to set and control the dynamic range of the program items. This control has to be specified relative to a reference level and in relation to important program elements, such as the dialogue. The features of dynamic range control are as follows:

1. Dynamic range control (DRC) is fully optional. Therefore, with correct syntax, there is no change in complexity for those not wishing to invoke DRC.
2. The reduced-bit-rate audio data are transmitted with the full dynamic range of the source material, with supporting data to assist in dynamic range control.
3. The dynamic range control data can be sent every frame in order to minimize the latency in setting replay gains.
4. The dynamic range control data are sent using the "fill_element" feature of the AAC format.
5. The Reference Level is defined as full-scale.
6. The Program Reference Level is transmitted to permit level parity between the replay levels of different sources and to provide a reference about which dynamic range control may be applied. It is that feature of the source signal which is most relevant to the subjective impression of the loudness of a program, such as the level of the dialogue content of a program or the average level of a music program.
7. The Program Reference Level is that level of the program which may be reproduced at a set level relative to the Reference Level in the consumer hardware in order to achieve replay level parity. Relative to this, the quieter portions of the program may be increased in level and the louder portions of the program may be reduced in level.
8. The Program Reference Level is specified within the range 0 to 31.75 dB relative to the Reference Level.
9. The Program Reference Level uses a 7-bit field with 0.25 dB steps.
10. The dynamic range control is specified within the range of ±31.75 dB.
11. The dynamic range control uses an 8-bit field (1 sign, 7 magnitude) with 0.25 dB steps.
12. The dynamic range control can be applied to all of an audio channel's spectral coefficients or frequency bands as a single entity, or the coefficients can be split into different scale-factor bands, each being controlled separately by separate sets of dynamic range control data.
13. The dynamic range control can be applied to all channels (of a stereo or multi-channel bitstream) as a single entity, or can be split, with sets of channels being controlled separately by separate sets of dynamic range control data.
14. If an expected set of dynamic range control data is missing, the last valid received values should be used.
15. Not all elements of the dynamic range control data are sent every time. For example, the Program Reference Level may on average be sent only once every 200 milliseconds.
16. Where necessary, error detection/protection is provided by the transport layer.
17. The user shall be given the means to alter the amount of dynamic range control, present in the bitstream, that is applied to the level of the signal.
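A minimal sketch decoding the control fields exactly as characterized in the list above (items 8-11: 7-bit program level in 0.25 dB steps, 8-bit sign-plus-magnitude DRC word in 0.25 dB steps; item 17: user-controlled scaling of the applied compression). The function names are illustrative, not the normative AAC syntax.

```python
def program_level_db(level_7bit: int) -> float:
    """Items 8-9: 7-bit Program Reference Level field, 0.25 dB steps,
    0..31.75 dB relative to the full-scale Reference Level."""
    assert 0 <= level_7bit < 128
    return level_7bit * 0.25

def drc_gain_db(drc_8bit: int, user_scale: float = 1.0) -> float:
    """Items 10-11: 8-bit DRC word (1 sign bit, 7 magnitude bits),
    0.25 dB steps, range +/-31.75 dB. Item 17: the user may scale the
    degree of compression actually applied (0.0 = none, 1.0 = full)."""
    sign = -1.0 if drc_8bit & 0x80 else 1.0
    magnitude = (drc_8bit & 0x7F) * 0.25
    return sign * magnitude * user_scale

# A frame carrying DRC word 0x8C asks for -3 dB; a listener choosing
# half compression gets -1.5 dB applied instead.
print(drc_gain_db(0x8C))        # -3.0
print(drc_gain_db(0x8C, 0.5))   # -1.5
```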
In addition to the possibility of transmitting separate mono or stereo downmix channels in a 5.1-channel transmission, AAC also allows automatic downmix generation from the 5-channel source track. The LFE channel shall be omitted in this case. This matrix downmix method may be controlled by the editor of the audio track with a small set of parameters defining the amount of the rear channels added to the downmix. The matrix downmix method applies only to downmixing a 3-front/2-rear speaker configuration (5 channels) to a stereo or a mono program; it is not applicable to any program other than the 3/2 configuration.

Within MPEG, several means are provided to control the audio rendering on the receiver side. A generic technique is provided by a scene description language, such as BIFS or LASeR. Both technologies are used for rendering audio-visual elements from separately coded objects into a playback scene. BIFS is standardized in [5] and LASeR in [6]. MPEG-D mainly deals with (parametric) descriptions (i.e., metadata):
- to generate multi-channel audio based on downmixed audio representations (MPEG Surround); and
- to generate MPEG Surround parameters based on audio objects (MPEG Spatial Audio Object Coding).

MPEG Surround exploits inter-channel differences in level, phase and coherence, equivalent to the ILD, ITD and IC cues, to capture the spatial image of a multi-channel audio signal relative to a transmitted downmix signal, and encodes these cues in a very compact form such that the cues and the transmitted signal can be decoded to synthesize a high-quality multi-channel representation. The MPEG Surround encoder receives a multi-channel audio signal, where N is the number of input channels (e.g., 5.1). A key aspect of the encoding process is that a downmix signal, xt1 and xt2, which is typically stereo (but could also be mono), is derived from the multi-channel input signal, and it is this downmix signal that is compressed for transmission over the channel, rather than the multi-channel signal itself. The encoder may be able to exploit the downmix process to its advantage, such that it creates a faithful equivalent of the multi-channel signal in the mono or stereo downmix, and also creates the best possible multi-channel decoding based on the downmix and the encoded spatial cues. Alternatively, the downmix can be supplied externally. The MPEG Surround encoding process is agnostic to the compression algorithm used for the transmitted channels; it could be any of a number of high-performance compression algorithms such as MPEG-1 Layer III, MPEG-4 AAC or MPEG-4 High-Efficiency AAC, or it could even be PCM (pulse code modulation).

MPEG Surround technology supports very efficient parametric coding of multi-channel audio signals. The idea of MPEG SAOC (Spatial Audio Object Coding) is to apply similar key assumptions, together with a similar parametric representation, for very efficient parametric coding of individual audio objects (tracks). Additionally, a rendering functionality is included to interactively render the audio objects into an acoustic scene for several types of reproduction systems (1.0, 2.0, 5.0, ... for loudspeakers, or binaural for headphones). SAOC is designed to transmit a number of audio objects in a joint mono or stereo downmix so as to later allow a reproduction of the individual objects in an interactively rendered audio scene.
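The matrix downmix from a 3-front/2-rear program to stereo described above can be sketched as follows. The fixed 1/√2 center weight and the editor-controlled rear weight are typical ITU-style choices, assumed here for illustration; the actual AAC parameter coding differs.

```python
import numpy as np

def matrix_downmix_3_2_to_stereo(L, R, C, Ls, Rs, rear_gain=1.0 / np.sqrt(2)):
    """Downmix a 3-front/2-rear (5-channel) program to stereo.

    rear_gain is the small editor-controlled parameter deciding how much
    of the rear channels is added to the downmix; the LFE channel is
    omitted, as stated above. Coefficients are typical ITU-style values,
    assumed for this sketch.
    """
    c = 1.0 / np.sqrt(2)                     # fixed center weight
    Lt = L + c * C + rear_gain * Ls
    Rt = R + c * C + rear_gain * Rs
    return Lt, Rt

# Usage with 1 s of dummy 5-channel audio at 48 kHz:
ch = [np.random.randn(48000) for _ in range(5)]
Lt, Rt = matrix_downmix_3_2_to_stereo(*ch, rear_gain=0.5)
```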
To this end, SAOC encodes Object Level Differences (OLD), Inter-Object Cross Coherences (IOC) and Downmix Channel Level Differences (DCLD) into a parametric bitstream (a computational sketch of these quantities is given after this overview). The SAOC decoder converts the SAOC parametric representation into an MPEG Surround parametric representation, which is then decoded together with the downmix signal by an MPEG Surround decoder to produce the desired audio scene. The user interactively controls this process to alter the representation of the audio objects in the resulting audio scene. Among the numerous conceivable applications of SAOC, a few typical scenarios are listed here.

Consumers can create personal interactive remixes using a virtual mixing console. Certain instruments can, for example, be attenuated for playing along (as in karaoke), the original mix can be modified to suit personal taste, the dialogue level in movies/broadcasts can be adjusted for better speech intelligibility, etc. For interactive gaming, SAOC is a storage- and computationally-efficient way of reproducing soundtracks. Moving around in the virtual scene is reflected by an adaptation of the object rendering parameters. Networked multi-player games benefit from the transmission efficiency of using one SAOC stream to represent all the sound objects that are external to a given player's terminal.

In the context of this application, the term "audio object" also comprises a "stem" known from sound production scenarios. In particular, stems are the individual components of a mix, separately stored (usually on disc) for use in remixes. Related stems are usually taken from the same original location. Examples are a drum stem (includes all drum-related instruments in a mix), a vocal stem (includes only the vocal tracks) or a rhythm stem (includes all rhythm-related instruments, such as drums, guitar, keyboard, ...).

Current telecommunication infrastructure is monophonic and can be extended in its functionality. Terminals equipped with an SAOC extension pick up several sound sources (objects) and produce a mono downmix signal, which is transmitted in a compatible way using existing (speech) coders. The side information can be conveyed in an embedded, backward-compatible way. Legacy terminals will continue to produce mono output, while SAOC-enabled terminals can render the acoustic scene and thus increase intelligibility by spatially separating the different speakers ("cocktail party effect").

A brief overview of actual available applications of Dolby audio metadata is given in the next sections.

Midnight mode. As mentioned above, there may be scenarios where the listener may not want a highly dynamic signal. Therefore, he or she may activate the so-called "midnight mode" of the receiver. A compressor is then applied to the total audio signal. To control the parameters of this compressor, the transmitted metadata are evaluated and applied to the total audio signal.

Clean audio. Another scenario concerns people with hearing impairments who do not want to have highly dynamic ambient noise, but who want a fairly clean signal containing the dialogue ("Clean Audio"). This mode may also be enabled using metadata. A currently proposed solution is defined in [15], Annex E: the balance between the main stereo signal and an additional mono channel describing the dialogue is handled here by an individual set of level parameters.
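Returning to the SAOC side information introduced above, the parameters can be computed per time/frequency tile roughly as follows. This follows the usual SAOC definitions (object levels relative to the strongest object, normalized cross-coherence, downmix channel level ratio per object) and is a sketch, not the normative standard text.

```python
import numpy as np

def saoc_parameters(S: np.ndarray, D: np.ndarray):
    """Compute OLD, IOC and DCLD for one time/frequency tile.

    S: (N, T) object subband signals; D: (2, N) stereo downmix matrix.
    """
    E = (S @ S.conj().T) / S.shape[1]          # object covariance
    p = np.real(np.diag(E))                    # object powers
    OLD = p / p.max()                          # levels relative to strongest object
    IOC = E / np.sqrt(np.outer(p, p) + 1e-12)  # normalized cross coherence
    # DCLD: per object, power ratio between the two downmix channels (in dB)
    DCLD = 10 * np.log10((D[0] ** 2 + 1e-12) / (D[1] ** 2 + 1e-12))
    return OLD, IOC, DCLD

rng = np.random.default_rng(1)
S = rng.standard_normal((3, 512))
D = np.array([[1.0, 0.7, 0.2], [0.2, 0.7, 1.0]])
OLD, IOC, DCLD = saoc_parameters(S, D)
```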
The proposed solution, based on a separate syntax, is called the supplementary audio service in DVB (Digital Video Broadcasting).

Downmix. There are separate metadata parameters governing the L/R downmix. Certain metadata parameters allow the engineer to select how the stereo downmix is constructed and which stereo analogue signal is preferred. Here, the center and surround downmix levels define the final mixing balance of the downmix signal for every decoder.

Fig. 1 illustrates an apparatus for generating at least one output audio signal representing a superposition of at least two different audio objects, in accordance with a preferred embodiment of this invention. The apparatus of Fig. 1 comprises a processor 10 for processing an input audio signal 11 to provide an object representation 12 of the input audio signal, in which at least two different audio objects are separated from each other, in which the at least two different audio objects are available as separate audio object signals, and in which the at least two different audio objects are manipulable independently of each other. The manipulation of the object representation is performed in an object manipulator 13 for manipulating the audio object signal, or a mixed representation of the audio object signal, of at least one audio object based on audio-object-based metadata 14 relating to the at least one audio object. The object manipulator 13 is adapted to obtain a manipulated audio object signal, or a manipulated mixed audio object signal representation 15, of the at least one audio object.

The signals generated by the object manipulator are input into an object mixer 16 for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a differently manipulated audio object, where the differently manipulated audio object is manipulated in another way than the at least one audio object. The result of the object mixer comprises one or more output signals 17a, 17b, 17c. Preferably, the one or more output signals 17a-17c are designed for a specific rendering setup, such as a mono rendering setup, a stereo rendering setup, or a multi-channel rendering setup comprising three or more channels, such as a surround setup requiring at least five or at least seven different output signals.

Fig. 2 illustrates a preferred implementation of the processor 10 for processing the input audio signal. Preferably, the input audio signal 11 is implemented as an object downmix 11, as obtained by the object downmixer 101a of Fig. 5a, which is described later. In this situation, the processor additionally receives object parameters 18, as generated, for example, by the object parameter calculator 101b of Fig. 5a, as described later. Then, the processor 10 is in a position to calculate separate audio object signals 12. The number of audio object signals 12 can be higher than the number of channels in the object downmix 11. The object downmix 11 can include a mono downmix, a stereo downmix, or even a downmix having more than two channels. In any case, the processor 10 can be operative to generate more audio object signals 12 than the number of individual signals in the object downmix 11.
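Put together, the Fig. 1 chain (processor 10, object manipulator 13, object mixer 16) can be sketched as below. The MMSE-style separation and the per-object gain rule are illustrative stand-ins for whatever separation and manipulation the metadata actually prescribe; only the block structure follows the figure description above.

```python
import numpy as np

def decode_and_render(X, D, E, metadata_gains_db, A):
    """Fig. 1 chain sketch: separate (10), manipulate (13), mix (16).

    X: (K, T) object downmix 11; D: (K, N) downmix matrix;
    E: (N, N) object covariance (object parameters 18);
    metadata_gains_db: per-object gains taken from the object metadata 14;
    A: (M, N) rendering matrix for the target setup (e.g. M=3 for L/C/R).
    Returns the M output signals (17a..17c for M=3).
    """
    # Processor 10: parametric approximation of the separate object signals 12.
    S_hat = E @ D.T @ np.linalg.pinv(D @ E @ D.T) @ X
    # Object manipulator 13: independent, metadata-driven per-object gains 15.
    g = 10.0 ** (np.asarray(metadata_gains_db) / 20.0)
    S_man = g[:, None] * S_hat
    # Object mixer 16: superpose the (partly manipulated) objects per channel.
    return A @ S_man

rng = np.random.default_rng(2)
S = rng.standard_normal((3, 256))
D = np.array([[1.0, 0.7, 0.0], [0.0, 0.7, 1.0]])
E = (S @ S.T) / S.shape[1]
A = np.array([[0.8, 0.1, 0.5],   # left output
              [0.1, 0.8, 0.0],   # center output
              [0.1, 0.1, 0.5]])  # right output
out = decode_and_render(D @ S, D, E, [0.0, 6.0, -6.0], A)  # boost object 2
```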
The audio object signals are, due to the parametric processing performed by the processor 10, not true reproductions of the original audio objects that were present before the object downmix 11 was performed; instead, the audio object signals are approximated versions of the original audio objects, where the accuracy of approximation depends on the kind of separation algorithm performed in the processor 10 and, of course, on the accuracy of the transmitted parameters. Preferred object parameters are the parameters known from spatial audio object coding, and a preferred reconstruction algorithm for generating the individually separated audio object signals is the reconstruction algorithm performed in accordance with the spatial audio object coding standard. Preferred embodiments of the processor 10 and of the object parameters are subsequently discussed in the context of Figs. 6-9.

Figs. 3a and 3b collectively illustrate an implementation in which the object manipulation is performed before the object downmix for the reproduction setup, while Fig. 4 illustrates a further implementation in which the object downmix is performed before the manipulation, and the manipulation is performed before the final object mixing operation. The result of the procedure of Figs. 3a, 3b is the same as that of Fig. 4, but the object manipulation is performed at a different level in the processing chain. When the manipulation of the object signals is an issue in the context of efficiency and computational resources, the Fig. 3a/3b embodiment is preferred, since the audio signal manipulation has to be performed only on a single audio signal rather than on a plurality of audio signals as in Fig. 4. In a different implementation, where there might be a requirement that the object downmix has to be performed using unmodified object signals, the Fig. 4 configuration is preferred, in which the manipulation is performed subsequent to the object downmix, but before the final object mix, to obtain the output signals for, for example, the left channel L, the center channel C or the right channel R.

Fig. 3a illustrates the situation in which the processor 10 of Fig. 2 outputs separate audio object signals. At least one audio object signal, such as the signal for object 1, is manipulated in a manipulator 13a based on the metadata for this object 1. Depending on the implementation, other objects, such as object 2, are manipulated as well, by a manipulator 13b. Naturally, the situation can arise in which there actually exists an object, such as object 3, which is not manipulated but which is nevertheless generated by the object separation. The result of the Fig. 3a processing is, in the Fig. 3a example, two manipulated object signals and one non-manipulated signal.

These results are input into the object mixer 16, which includes a first mixer stage implemented as object downmixers 19a, 19b, 19c, and which furthermore comprises a second object mixer stage implemented by the devices 16a, 16b, 16c. The first stage of the object mixer 16 includes, for each output of Fig. 3a, an object downmixer, such as object downmixer 19a for output 1 of Fig. 3a, object downmixer 19b for output 2 of Fig. 3a, and object downmixer 19c for output 3 of Fig. 3a. The purpose of the object downmixers 19a-19c is to "distribute" each object among the output channels. Therefore, each object downmixer 19a, 19b, 19c has an output for a left component signal L, a center component signal C and a right component signal R.
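The statement above that Figs. 3a/3b and Fig. 4 yield the same result can be checked numerically for the common case where the manipulation is a linear per-object gain, since such a gain commutes with the linear object downmix; the check below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.standard_normal(1024)        # one object signal
gain = 0.25                          # metadata-driven manipulation (linear gain)
d = np.array([0.7, 0.5, 0.7])        # object downmixer weights for L, C, R

# Fig. 3a/3b order: manipulate the object signal, then object-downmix it.
components_3ab = np.outer(d, gain * s)

# Fig. 4 order: object-downmix first, then manipulate the component signals.
components_4 = gain * np.outer(d, s)

assert np.allclose(components_3ab, components_4)  # identical L/C/R contributions
```

For nonlinear manipulations (e.g., compression), the two orders are generally not identical, which is why the text treats them as distinct embodiments with different efficiency trade-offs.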
Thus, if, for example, object 1 were the single object, the object downmixer 19a would be a straightforward downmixer, and the output of block 19a would be the same as the final output L, C, R indicated at 17a, 17b, 17c. The object downmixers 19a-19c preferably receive rendering information indicated at 30, where the rendering information may describe the rendering setup, i.e., as in the Fig. 3b embodiment, that only three output loudspeakers exist. These outputs are a left speaker L, a center speaker C and a right speaker R. If, for example, the rendering setup or reproduction setup comprised a 5.1 scenario, then each object downmixer would have six output channels, and there would exist six adders, so that a final output signal for the left channel, a final output signal for the right channel, a final output signal for the center channel, a final output signal for the left surround channel, a final output signal for the right surround channel and a final output signal for the low frequency enhancement (subwoofer) channel would be obtained.

Specifically, the adders 16a, 16b, 16c are adapted to combine the component signals for the respective channel, which were generated by the corresponding object downmixers. This combination is preferably a straightforward sample-by-sample addition but, depending on the implementation, weighting factors can be applied as well. Furthermore, the functionalities of Figs. 3a, 3b can be implemented in the frequency or subband domain, so that the elements 19a-16c would operate in the frequency domain, with some kind of frequency/time conversion before actually outputting the signals to the speakers in a reproduction setup.

Fig. 4 illustrates an alternative implementation in which the functionalities of the elements 19a, 19b, 19c, 16a, 16b, 16c are similar to the Fig. 3b embodiment. Importantly, however, the manipulation which occurred in Fig. 3a before the object downmix 19a now occurs subsequent to the object downmix 19a. Thus, the object-specific manipulation, which is controlled by the metadata for the respective object, is done in the downmix domain, i.e., before the actual addition of the then-manipulated component signals. When Fig. 4 is compared to Fig. 1, it becomes clear that object downmixers such as 19a, 19b, 19c will be implemented within the processor 10, and the object mixer 16 will comprise the adders 16a, 16b, 16c. When Fig. 4 is implemented and the object downmixers are part of the processor, the processor will receive, in addition to the object parameters 18 of Fig. 1, the rendering information 30, that is, information on the position of each audio object, information on the rendering setup, and additional information as the case may be. Furthermore, the manipulation can include the downmix operation implemented by the blocks 19a, 19b, 19c. In this embodiment, the manipulator includes these blocks, and additional manipulations can take place, but are not required in any case.

Fig. 5a illustrates an encoder-side embodiment which can generate a data stream as schematically illustrated in Fig. 5b. Specifically, Fig. 5a illustrates an apparatus for generating an encoded audio signal 50 representing a superposition of at least two different audio objects.
Basically, the apparatus of Fig. 5a illustrates a data stream formatter 51 for formatting the data stream 50 so that the data stream comprises an object downmix signal 52 representing a combination, such as a weighted or unweighted combination, of the at least two audio objects. Furthermore, the data stream 50 comprises, as side information, object-related metadata 53 relating to at least one of the different audio objects. Preferably, the data stream 50 furthermore comprises parametric data 54, which are time- and frequency-selective and which allow a high-quality separation of the object downmix signal into several audio objects, where this operation is also termed an object upmix operation and is performed by the processor 10 of Fig. 1, as discussed earlier.

The object downmix signal 52 is preferably generated by the object downmixer 101a. The parametric data 54 are preferably generated by the object parameter calculator 101b, and the object-selective metadata 53 are generated by an object-selective metadata provider 55. The object-selective metadata provider may be an input for receiving metadata as generated by a sound engineer in a recording studio, or may receive data produced by an object-related analysis, which could be performed subsequent to the object separation. Specifically, the object-selective metadata provider could be implemented to analyze the objects output by the processor 10, for example, in order to find out whether an object is a speech object, a sound object or an ambient sound object. Thus, a speech object could be analyzed by some of the well-known speech detection algorithms known from speech coding, and the object-selective analysis could also be implemented to detect sound objects stemming from instruments. Such sound objects have a highly tonal nature and can therefore be distinguished from speech objects or ambient sound objects. Ambient sound objects will have a quite noisy nature, reflecting the background sound that typically exists, for example, in movies, where the background noises are, for example, traffic sounds or any other stationary noise signals, or non-stationary signals having a broadband spectrum, such as occur when, for example, a shooting scene takes place in a movie.

Based on this analysis, one could amplify a speech object and attenuate the other objects in order to emphasize the speech, since this helps hearing-impaired or elderly people to better understand a movie. As stated before, other implementations include the provision of object-specific metadata, such as an object identification and object-related data, by a sound engineer generating the actual object downmix signal on a CD or a DVD, such as a stereo downmix or a surround-sound downmix.

Fig. 5b illustrates an exemplary data stream 50, which comprises, as main information, the mono, stereo or multi-channel object downmix and which has, as side information, the object parameters 54 and the object based metadata 53, which are static in the case of only identifying objects as speech or ambience, or which are time-varying in the case of providing level data as object based metadata, such as required for the midnight mode. Preferably, however, the object based metadata are not provided in a frequency-selective way, in order to save data rate.
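The object-selective metadata provider's analysis described above (speech vs. tonal instrument vs. noisy ambience) could, for instance, use a spectral flatness measure as a crude tonality indicator. This heuristic and its thresholds are assumptions, not the patent's method; real speech detection would use a dedicated algorithm, as the text notes.

```python
import numpy as np

def spectral_flatness(frame: np.ndarray) -> float:
    """Geometric mean / arithmetic mean of the power spectrum, in 0..1:
    near 0 for tonal signals, nearer 1 for noise-like (ambience) signals."""
    p = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(p))) / np.mean(p))

def classify_object(signal: np.ndarray, tonal_thr=0.1, noise_thr=0.5) -> str:
    """Crude label for object-selective metadata; thresholds illustrative."""
    n = len(signal) // 1024 * 1024
    sfm = np.mean([spectral_flatness(f)
                   for f in np.split(signal[:n], n // 1024)])
    if sfm < tonal_thr:
        return "tonal/instrument"
    if sfm > noise_thr:
        return "ambience"
    return "speech-like"

tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
noise = np.random.randn(48000)
print(classify_object(tone), classify_object(noise))  # tonal/instrument, ambience
```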
Fig. 6 illustrates an embodiment of an audio object map for a number of N objects. In the exemplary explanation of Fig. 6, each object has an object ID, a corresponding object audio file and, importantly, audio object parameter information, which is preferably information relating to the energy of the audio object and to the inter-object correlation of the audio object. Specifically, the audio object parameter information includes an object covariance matrix E for each subband and for each time block. An example of such an object parameter matrix E is illustrated in Fig. 7. The diagonal elements e_ii include power or energy information of the audio object i in the corresponding subband and the corresponding time block. To this end, the subband signal representing a certain audio object i is input into a power or energy calculator, which may, for example, perform an autocorrelation function (acf) to obtain the value e_ii with or without some normalization. Alternatively, the energy can be calculated as the sum of the squares of the signal over a certain length (i.e. the vector product ss*). The acf can, in some sense, describe the spectral distribution of the energy but, since a time/frequency (T/F) transform is preferably used for frequency selection anyway, the energy calculation can be performed without an acf for each subband separately. Thus, the main diagonal elements of the object parameter matrix E indicate a measure of the power or energy of an audio object in a certain subband and a certain time block.

Conversely, the off-diagonal element e_ij indicates a correlation measure between audio objects i and j in the corresponding subband and time block. It is clear from Fig. 7 that, for real-valued entries, the matrix E is symmetric with respect to the main diagonal; in general, this matrix is a Hermitian matrix. The correlation measure e_ij can be calculated, for example, by a cross-correlation of the two subband signals of the respective audio objects, so that a cross-correlation measure is obtained, which may or may not be normalized. Other correlation measures, not calculated via a cross-correlation operation but determined by other ways of estimating the correlation between two signals, can be used as well. For practical reasons, all elements of the matrix E are normalized so that they have magnitudes between 0 and 1, where 1 indicates maximum power or maximum correlation, 0 indicates minimum power (zero power) and -1 indicates minimum correlation (out of phase).

The downmix matrix D of size K x N, where K > 1, determines the K-channel downmix signal in the form of a matrix with K rows through the matrix multiplication X = D S. Fig. 8 illustrates an example of a downmix matrix D having downmix matrix elements d_ij. Such an element d_ij indicates whether a portion of or the whole object j is included in the object downmix signal i or not. When, for example, d_12 equals zero, object 2 is not included in object downmix signal 1. On the other hand, a value of d_23 equal to 1 indicates that object 3 is fully included in object downmix signal 2. Values of the downmix matrix elements between 0 and 1 are possible.
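In code, the object parameter computation and the downmix X = D S can be sketched as follows; this is a minimal numpy sketch and the function name is an assumption.

```python
import numpy as np

def object_parameters(S):
    """S: (N, T) subband samples of the N objects in one subband/time
    block.  Returns (E, R): E holds the raw cross products (diagonal =
    object powers e_ii), R the correlation measures scaled into [-1, 1].
    """
    E = (S @ S.conj().T).real / S.shape[1]        # powers and cross terms
    p = np.sqrt(np.clip(np.diag(E), 1e-12, None))
    R = E / np.outer(p, p)                        # e_ij / sqrt(e_ii * e_jj)
    return E, R

# Applying a K x N downmix matrix D to the object signals: X = D S.
rng = np.random.default_rng(1)
S = rng.standard_normal((3, 1024))                # N = 3 objects
D = np.array([[1.0, 0.0, 0.5],                    # left downmix channel
              [0.0, 1.0, 0.5]])                   # right downmix channel
X = D @ S                                         # (K, T) object downmix
E, R = object_parameters(S)
```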
In particular, a value of 0.5 indicates that a certain object is included in a downmix signal, but only with half its energy. Thus, when an audio object, such as object number 4, is equally distributed to both downmix signal channels, d_24 and d_14 would be equal to 0.5. This way of downmixing is an energy-conserving downmix operation, which is preferred for some situations. Alternatively, however, a non-energy-conserving downmix can be used as well, in which the whole audio object is introduced into both the left downmix channel and the right downmix channel, so that the energy of this audio object is doubled with respect to the other audio objects within the downmix signal.

The lower portion of Fig. 8 gives a schematic diagram of the object encoder 101 of Fig. 1. Specifically, the object encoder 101 includes two different portions 101a and 101b. Portion 101a is a downmixer which preferably performs a weighted linear combination of the audio objects 1, 2, ..., N, and the second portion of the object encoder 101 is the audio object parameter calculator 101b, which calculates audio object parameter information, such as the matrix E, for each time block or subband, in order to provide the audio energy and correlation information. This is parametric information and can therefore be transmitted at a low bit rate or stored consuming a small amount of memory resources.

The user-controlled object rendering matrix A (target rendering matrix) of size M x N determines the M-channel target rendering of the audio objects in the form of a matrix with M rows through the matrix multiplication Y = A S. It will be assumed throughout the following derivation that M = 2, since the focus is on stereo rendering. Given an initial rendering matrix for more than two channels, and a downmix rule from those several channels into two channels, it is obvious for those skilled in the art to derive the corresponding rendering matrix A of size 2 x N for stereo rendering. For simplicity, it will also be assumed that K = 2, so that the object downmix is also a stereo signal. The case of a stereo object downmix is, furthermore, the most important special case in terms of application scenarios.

Fig. 9 illustrates a detailed explanation of the target rendering matrix A. Depending on the application, the target rendering matrix A can be provided by the user. The user has the freedom to indicate where the audio objects should be located virtually within the replay setup. The strength of the audio object concept is that the downmix information and the audio object parameter information are completely independent of any specific localization of the audio objects. This localization of audio objects is provided by the user in the form of target rendering information. Preferably, the target rendering information is implemented as a target rendering matrix A, which may be in the form of the matrix of Fig. 9. Specifically, the rendering matrix A has M rows and N columns, where M equals the number of channels in the rendered output signal and N is the number of audio objects. M equals two in the preferred stereo rendering scenario, but if an M-channel rendering is performed, the matrix A has M rows. Specifically, a matrix element a_ij indicates whether a portion of or the whole object j is to be rendered in the specific output channel i or not.
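As a hedged illustration of how a user-chosen position maps into a column of the rendering matrix A, the sketch below uses a simple linear panning law matching the 0.5/0.5 center placement of the text; the helper and its values are assumptions, and a concrete matrix example follows in the next passage.

```python
import numpy as np

def pan_column(position):
    """position in [0, 1]: 0 = hard left, 1 = hard right; 0.5 places the
    object midway between the two speakers (column (0.5, 0.5))."""
    return np.array([1.0 - position, position])

# Rendering matrix A (M = 2 rows, one column per object) and Y = A S.
positions = [0.0, 1.0, 0.5]                      # user-chosen placements
A = np.stack([pan_column(p) for p in positions], axis=1)  # (2, N)
S = np.random.default_rng(2).standard_normal((3, 1024))   # object signals
Y = A @ S                                        # (2, T) rendered stereo output
```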
The lower portion of Fig. 9 gives a simple example of a target rendering matrix for a scenario in which there are six audio objects AO1 to AO6, where only the first five audio objects are to be rendered at specific positions and the sixth audio object is not to be rendered at all. Regarding audio object AO1, the user wants this audio object rendered at the left side of the replay scenario. Therefore, this object is placed at the position of the left loudspeaker in the (virtual) replay room, so that the first column of the rendering matrix A is (1, 0). Regarding the second audio object, a_22 is 1 and a_12 is 0, which means that the second audio object is to be rendered on the right side. Audio object 3 is to be rendered in the middle between the left speaker and the right speaker, so that 50% of the level or signal of this audio object goes into the left channel and 50% into the right channel, and the corresponding third column of the target rendering matrix A is (0.5, 0.5). Similarly, any placement between the left speaker and the right speaker can be indicated by the target rendering matrix. Regarding audio object 4, the placement is more to the right side, since the matrix element a_24 is larger than a_14. Similarly, the fifth audio object AO5 is rendered more towards the left speaker, as indicated by the rendering matrix elements a_15 and a_25. The rendering matrix A additionally allows a certain audio object not to be rendered at all. This is exemplarily illustrated by the sixth column of the target rendering matrix, which has zero elements.

Subsequently, a preferred embodiment of the present invention is summarized with reference to Fig. 10. Preferably, the methods known from SAOC (Spatial Audio Object Coding) split one audio signal into different parts. These parts may be, for example, different audio objects, but are not limited thereto. If the metadata are transmitted for each single part of the audio signal, it becomes possible to adjust just some of the signal components, while other parts remain unchanged or can even be modified with different metadata. This may be done for different audio objects, but also for individual spectral ranges. Parameters for object separation are classical, or even new, metadata (gain, compression, level, ...) for each individual audio object. These data are preferably transmitted.

The decoder processing box is implemented in two different stages: in the first stage, the object separation parameters are used to generate (10) the individual audio objects. In the second stage, the processing unit 13 has multiple instances, one per individual object, and here the object-specific metadata should be applied. At the end of the decoder, all individual objects are again combined (16) into one single audio signal. Additionally, a dry/wet controller 20 may allow a smooth fade between the original and the manipulated signal, to give the end user a simple possibility of finding her or his preferred setting. Depending on the specific implementation, Fig. 10 illustrates two aspects. In a base aspect, the object-related metadata merely indicate an object description for a specific object. Preferably, the object description is related to an object ID, as indicated at 21 in Fig. 10.
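The two-stage decoder of Fig. 10 and the dry/wet controller 20 can be sketched as follows. The separator is a stand-in placeholder (the actual SAOC-style parametric reconstruction is outside this sketch), and gain-based manipulation is only one possible stage-two operation; all names are assumptions.

```python
import numpy as np

def decode(downmix, separate, gains_db, wet=1.0):
    """downmix: (K, T) object downmix; separate: callable implementing
    stage one (object separation) and returning an (N, T) array;
    gains_db: stage-two per-object manipulations from the metadata."""
    objects = separate(downmix)                        # stage 1: separation
    g = 10.0 ** (np.asarray(gains_db)[:, None] / 20.0)
    manipulated = (g * objects).sum(axis=0)            # stage 2 + recombination
    original = objects.sum(axis=0)                     # unmanipulated recombination
    return (1.0 - wet) * original + wet * manipulated  # dry/wet controller (20)

# Example with a trivial stand-in separator (identity on a 2-object mix).
mix = np.random.default_rng(3).standard_normal((2, 1024))
out = decode(mix, separate=lambda x: x, gains_db=[6.0, -12.0], wet=0.8)
```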
Thus, the object-based metadata for the upper object manipulated by device 13a may be just the information that this object is a "speech" object, while the object-based metadata for the second object, processed by device 13b, contains the information that this second object is an "ambience" object. This basic object-related metadata for both objects might be sufficient to implement an enhanced clean-audio mode, in which the speech object is amplified and the ambience object is attenuated or, generally speaking, the speech object is amplified with respect to the ambience object, or the ambience object is attenuated with respect to the speech object. Preferably, however, the user can select different processing modes on the receiver/decoder side, which can be programmed via a mode control input. These different modes can be a dialogue level mode, a compression mode, a downmix mode, an enhanced midnight mode, an enhanced clean-audio mode, a dynamic downmix mode, a guided upmix mode, a mode for relocating objects, etc.

Depending on the implementation, the different modes require different object-based metadata in addition to the basic information indicating the kind or characteristic of an object, such as speech or ambience. In the midnight mode, in which the dynamic range of an audio signal has to be compressed, it is preferred that, for each object, such as the speech object and the ambience object, either the actual level or the target level for the midnight mode is provided as metadata. When the actual level of the object is provided, the receiver has to calculate the target level for the midnight mode; when, however, the relative target level is given, the decoder/receiver-side processing is reduced. In this implementation, each object has a time-varying object-based sequence of level information which is used by the receiver to compress the dynamic range so that the level differences within a single object are reduced. This automatically results in a final audio signal in which the level differences are reduced over time, as required by a midnight-mode implementation. For clean-audio applications, a target level for the speech object can be provided as well. The ambience object might then be set to zero or almost zero, in order to strongly emphasize the speech object within the sound generated by a certain loudspeaker setup. A high-fidelity application, which is the opposite of the midnight mode, could even expand the dynamic range of the object, or the dynamic range of the differences between the objects. In this implementation, it is preferable to provide target object gain levels, since these target levels guarantee that, in the end, a sound is obtained which was created by an artistic sound engineer within a sound studio and therefore has a higher quality than any automatic or user-defined setting.

In other implementations, in which the object-based metadata relate to advanced downmixes, the object manipulation includes a downmix different from a downmix designed for a specific rendering setup. The object-based metadata are then introduced into the object downmixer blocks 19a to 19c of Fig. 3b or Fig. 4. In this implementation, the manipulator may include the blocks 19a to 19c when an individual object downmix is performed depending on the rendering setup. Specifically, the object downmix blocks 19a to 19c can be set differently from each other.
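For the midnight mode described above, the per-object gain derivation can be sketched as follows: the receiver pulls each object's time-varying level towards a target, compressing the level differences within the object. This is a hedged sketch; the 2:1 ratio and the function name are illustrative assumptions.

```python
import numpy as np

def midnight_gains(level_db, target_db, ratio=2.0):
    """level_db: per-frame measured object levels from the metadata.
    Returns per-frame linear gains that shrink deviations from target
    by the given compression ratio (assumed 2:1 here)."""
    compressed_db = target_db + (level_db - target_db) / ratio
    return 10.0 ** ((compressed_db - level_db) / 20.0)

frames = np.array([-30.0, -12.0, -24.0, -6.0])   # object level sequence (dB)
gains = midnight_gains(frames, target_db=-18.0)  # quiet frames boosted, loud frames attenuated
```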
In this case, a speech object may be introduced only into the center channel rather than into the left or right channels, depending on the channel configuration. The downmixer blocks 19a to 19c might then have different numbers of component signal outputs. The downmix can also be performed dynamically. Additionally, guided upmix information and information for relocating objects can be provided as well.

Below is a summary of preferred ways of providing metadata and of the use of object-specific metadata (an illustrative sketch follows this summary). Audio objects may not be separated ideally, as in a typical SAOC application. For manipulating audio, it may be sufficient to have a "mask" of the objects rather than a complete separation. This could lead to fewer/coarser parameters for the object separation. For the application called "midnight mode", the audio engineer needs to define all metadata parameters independently for each object, yielding, for example, a constant dialogue volume but manipulated ambient noise ("enhanced midnight mode"). This may also be useful for people wearing hearing aids ("enhanced clean audio"). New downmix scenarios: different separated objects may be treated differently for each specific downmix situation. For example, a 5.1-channel signal must be downmixed for a home television stereo system, while another receiver has only mono playback available. Therefore, different objects may be treated in different ways (and all of this is controlled by the sound engineer during production, thanks to the metadata provided by the engineer). Downmixes to 3.0 and so on are preferred as well. The generated downmix is then not defined by a fixed global parameter (set), but may instead be generated from time-varying, object-dependent parameters. With new object-based metadata, a guided upmix can also be performed. Objects may be placed at different positions, for example, to make the spatial image wider when the ambience is attenuated. This helps to improve speech intelligibility for hearing-impaired people.

The method proposed in this document extends the existing metadata concept implemented and mainly used in Dolby codec systems. Now it is possible to apply the known metadata concept not only to the whole audio stream but also to extracted objects within this stream. This gives audio engineers and artists much more flexibility and greater ranges of adjustment and, thereby, better audio quality and more enjoyment for the listeners.

Figs. 12a and 12b illustrate different application scenarios for the inventive concept. In a classical scenario, there are television broadcasts of sporting events, where the stadium atmosphere is in all 5.1 channels and the announcer channel is mapped to the center channel. This mapping can be performed by a straightforward addition of the announcer channel to the center channel of the 5.1 channels carrying the stadium atmosphere. Now, the inventive process allows having such a center channel within the stadium-atmosphere sound description. The addition operation then mixes the center channel of the stadium atmosphere and the announcer. By generating object parameters for the announcer and the center channel of the stadium atmosphere, the present invention makes it possible to separate these two sound objects on the decoder side, and to amplify or attenuate the announcer or the center channel of the stadium atmosphere.
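As a hedged sketch of the object-dependent downmix treatment summarized above (a speech object routed only to the center channel where one exists, ambience attenuated for mono), the rule tables below are illustrative assumptions, not coefficient values from the patent.

```python
import numpy as np

DOWNMIX_RULES = {
    # target configuration -> {object kind: per-channel coefficients}
    "3.0":    {"speech":   np.array([0.0, 1.0, 0.0]),    # L, C, R: center only
               "ambience": np.array([0.7, 0.0, 0.7])},
    "stereo": {"speech":   np.array([0.707, 0.707]),
               "ambience": np.array([1.0, 1.0])},
    "mono":   {"speech":   np.array([1.0]),
               "ambience": np.array([0.5])},             # ambience attenuated
}

def downmix_object(signal, kind, target):
    """Route one (T,) object signal per the rule table; returns (channels, T)."""
    coeffs = DOWNMIX_RULES[target][kind]
    return np.outer(coeffs, signal)

speech = np.random.default_rng(7).standard_normal(1024)
centered = downmix_object(speech, "speech", "3.0")        # speech into C only
```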
A further scenario involves two announcers. Such a situation may arise when two persons are commenting on the same soccer game. Specifically, when two announcers speak simultaneously, it may be useful to have these two announcers as separate objects and, additionally, to have them separate from the stadium-atmosphere channels. In such an application, the 5.1 channels and the two announcer channels can be processed as eight different audio objects, or as seven different audio objects when the low-frequency enhancement (subwoofer) channel is neglected. Since the straightforward distribution infrastructure is adapted to a 5.1-channel audio signal, the seven (or eight) objects can be downmixed into a 5.1-channel downmix signal, and the object parameters can be provided in addition to the 5.1 downmix channels, so that, on the receiver side, the objects can be separated again. Since the object-based metadata identify the announcer objects as distinct from the stadium-atmosphere objects, object-specific processing is possible on the receiver side before the object mixer performs a final 5.1-channel downmix. In this scenario, one could also have a first object comprising the first announcer, a second object comprising the second announcer, and a third object comprising the complete stadium atmosphere.

Subsequently, different implementations of object-based downmix scenarios are discussed in the context of Figs. 11a to 11c. When, for example, the sound generated by the scenario of Fig. 12a or 12b has to be replayed on a conventional 5.1 playback system, the embedded metadata stream can be disregarded and the received stream played as it is. When, however, playback has to take place on a stereo loudspeaker setup, a downmix from 5.1 to stereo must occur. If the surround channels were simply added to left/right, the announcers might end up at too low a level. Therefore, it is preferred to reduce the atmosphere level, before or after the downmix, before the announcer is (re-)added.

Hearing-impaired people may want to reduce the atmosphere level for better speech intelligibility while still keeping the two announcers separated between left and right. This exploits what is known as the "cocktail-party effect", where a person hears her or his name and then concentrates on the direction from which the name was heard. This direction-specific concentration will, from a psychoacoustic point of view, attenuate the sound arriving from other directions. Hence, a specific placement of a specific object, such as an announcer located at the left, at the right, or at both left and right so that the announcer appears in the middle, may improve intelligibility. To this end, the input audio stream is preferably divided into separate objects, and the objects carry a ranking in the metadata saying whether an object is more important or less important. The level differences among the objects can then be adjusted in accordance with the metadata, or the object positions can be relocated to improve intelligibility in accordance with the metadata. To obtain this goal, metadata are applied not to the transmitted signal but to the single separable audio objects, before or after the object downmix, as the case may be. Moreover, the present invention no longer requires that objects be limited to spatial channels in order for these channels to be manipulated individually.
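The importance-based level adjustment just described can be sketched as follows: objects carry an importance rank in the metadata, and less important objects are attenuated relative to more important ones. This is a hedged sketch; the 3 dB-per-rank step is an illustrative assumption.

```python
import numpy as np

def importance_gains(ranks, db_per_rank=3.0):
    """ranks: per-object importance, 0 = most important.  Returns linear
    gains attenuating each object by db_per_rank per rank step."""
    ranks = np.asarray(ranks, dtype=float)
    return 10.0 ** (-db_per_rank * ranks / 20.0)

# e.g. two announcers (rank 0) and the stadium atmosphere (rank 1)
gains = importance_gains([0, 0, 1])
```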
Instead, the inventive object-based metadata concept does not require a certain object to sit in a certain channel: objects can be downmixed to several channels and still be manipulated individually.

Fig. 11a illustrates a further implementation of a preferred embodiment. The object downmixer 16 generates m output channels out of k x n input channels, where k is the number of objects and n channels are generated per object. Fig. 11a corresponds to the scenario of Figs. 3a and 3b, where the manipulation 13a, 13b, 13c takes place before the object downmix. Fig. 11a additionally comprises level manipulators 19d, 19e, 19f, which can be implemented without metadata control. Alternatively, however, these manipulators can also be controlled by object-based metadata, so that the level modification implemented by the blocks 19d to 19f is also part of the object manipulator 13 of Fig. 1. The same is true for the downmix operations 19a, 19b, 19c, when these downmix operations are controlled by object-based metadata. This case is not illustrated in Fig. 11a, but could be implemented as well by also forwarding the object-based metadata to the downmix blocks 19a to 19c. In the latter case, these blocks would also be part of the object manipulator 13 of Fig. 11a, and the remaining functionality of the object mixer 16 is the output-channel-wise combination of the manipulated object component signals into the corresponding output channels. Fig. 11a furthermore comprises a dialogue normalization functionality 25, which can be implemented with conventional metadata, since this dialogue normalization takes place not in the object domain but in the output channel domain.

Fig. 11b illustrates an implementation of an object-based 5.1-to-stereo downmix. Here, the downmix is performed before the manipulation, and therefore Fig. 11b corresponds to the scenario of Fig. 4. The level modification 13a, 13b is performed based on object-based metadata, where, for example, the upper branch corresponds to the speech object and the lower branch corresponds to the ambience object or, as in the example of Figs. 12a and 12b, the upper branch corresponds to one or both announcers and the lower branch corresponds to all the ambience information. The level manipulator blocks 13a, 13b would then manipulate both objects based on fixedly set parameters, so that the object-based metadata would merely be an identification of the objects; but the level manipulators 13a, 13b could also manipulate the levels based on target levels provided by the metadata 14, or based on actual levels provided by the metadata 14. Therefore, to generate a stereo downmix for a multichannel input, a downmix formula is applied for each object, and the objects are weighted by a given level before being remixed into an output signal again.

For clean-audio applications, as illustrated in Fig. 11c, an importance level is transmitted as metadata to enable a reduction of the less important signal components. The upper branch would then correspond to the important components, which are amplified, while the lower branch could correspond to the less important components, which can be attenuated. How the specific attenuation and/or amplification of the different objects is performed can be fixedly set by the receiver, but it can also be controlled by object-based metadata, as implemented by the dry/wet control 14 of Fig. 11c.
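The object-based 5.1-to-stereo downmix of Fig. 11b can be sketched as follows: each branch (e.g. speech vs. ambience) receives its own level modification from the metadata before being panned and remixed into the stereo output. This is a minimal sketch under stated assumptions; the gain values and the linear panning law are illustrative, not taken from the patent.

```python
import numpy as np

def object_stereo_downmix(objects, gains_db, positions):
    """objects: iterable of (T,) object signals already separated from
    the 5.1 input; gains_db: per-object level modifications (cf. 13a,
    13b) from the metadata; positions: per-object pan in [0, 1]."""
    out = None
    for sig, g_db, pos in zip(objects, gains_db, positions):
        g = 10.0 ** (g_db / 20.0)
        contrib = np.stack([(1.0 - pos) * g * sig, pos * g * sig])
        out = contrib if out is None else out + contrib
    return out                                    # (2, T) stereo output

speech = np.random.default_rng(4).standard_normal(1024)
ambience = np.random.default_rng(5).standard_normal(1024)
stereo = object_stereo_downmix([speech, ambience],
                               gains_db=[3.0, -9.0],  # boost speech, duck ambience
                               positions=[0.5, 0.5])
```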
Generally, dynamic range control can be performed in the object domain, implemented similarly to the AAC dynamic range control, i.e. as a multiband compression. The object-based metadata can even be frequency-selective data, so that a frequency-selective compression similar to an equalizer is performed. As stated before, the dialogue normalization is preferably performed subsequent to the downmix, i.e. on the downmix signal. The downmix should, in general, be able to process k objects with n input channels into m output channels.

It is not necessarily important to separate the objects into discrete objects. It may be sufficient to "mask out" the signal components which are to be manipulated. This is similar to editing masks in image processing. A generalized "object" is then a superposition of several original objects, where this superposition includes a number of objects smaller than the total number of original objects. All objects are again added up at a final stage. There might be no interest in separated single objects, and for some objects the level value may be set to 0 (a high negative dB figure) when a certain object has to be removed completely, as in karaoke applications, where one might be interested in completely removing the vocal object so that the karaoke singer can introduce her or his own vocals over the remaining instrumental objects.

Other preferred applications of the invention, as stated before, are an enhanced midnight mode, in which the dynamic range of single objects can be reduced, and a high-fidelity mode, in which the dynamic range of objects is expanded. In this context, the transmitted signal may be compressed, and it is intended that this compression be inverted. The application of dialogue normalization is mainly preferred for the total signal as output to the loudspeakers, but a non-linear attenuation/amplification of different objects is useful when the dialogue normalization is adjusted. In addition to the parametric data for separating the different audio objects from the object downmix signal, it is preferred to transmit, for each object and in addition to the classical metadata related to the sum signal: level values for the downmix; importance values indicating an importance level for clean audio; an object identification; actual absolute or relative levels as time-varying information; or absolute or relative target levels as time-varying information; and so on.

The described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended claims and not by the specific details presented by way of description and explanation of the embodiments herein. Depending on certain implementation requirements of the inventive methods, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a DVD or a CD, having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed.
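Returning to the karaoke application mentioned above: setting an object's level to zero (effectively minus infinity dB) removes it completely from the remix. A minimal sketch, assuming the objects have already been separated and labelled:

```python
import numpy as np

def remix_without(objects, kinds, remove="vocals"):
    """objects: (N, T) separated object signals; kinds: per-object labels.
    Sums all objects whose label differs from `remove`, i.e. the removed
    object's level is set to zero."""
    keep = np.array([k != remove for k in kinds])
    return objects[keep].sum(axis=0)

tracks = np.random.default_rng(6).standard_normal((3, 1024))
backing = remix_without(tracks, ["vocals", "drums", "bass"])  # instrumental remix
```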
Generally, the present invention is, therefore, a computer program product with a program code stored on a machine-readable carrier, the program code being operative to perform the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are, therefore, a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer.

References
[1] ISO/IEC 13818-7: MPEG-2 (Generic coding of moving pictures and associated audio information) - Part 7: Advanced Audio Coding (AAC).
[2] ISO/IEC 23003-1: MPEG-D (MPEG audio technologies) - Part 1: MPEG Surround.
[3] ISO/IEC 23003-2: MPEG-D (MPEG audio technologies) - Part 2: Spatial Audio Object Coding (SAOC).
[4] ISO/IEC 13818-7: MPEG-2 (Generic coding of moving pictures and associated audio information) - Part 7: Advanced Audio Coding (AAC).
[5] ISO/IEC 14496-11: MPEG-4 (Coding of audio-visual objects) - Part 11: Scene description and application engine (BIFS).
[6] ISO/IEC 14496-20: MPEG-4 (Coding of audio-visual objects) - Part 20: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format (SAF).
[7] http://www.dolby.com/assets/pdf/techlibrary/17_AllMetadata.pdf
[8] http://www.dolby.com/assets/pdf/tech_library/18_Metadata.Guide.pdf
[9] Krauss, Kurt; Röden, Jonas; Schildbach, Wolfgang: Transcoding of Dynamic Range Control Coefficients and Other Metadata into MPEG-4 HE-AAC, AES Convention 123, October 2007.
[10] Robinson, Charles Q.; Gundry, Kenneth: Dynamic Range Control via Metadata, AES Convention 102, September 1999.
[11] Dolby: "Standards and Practices for Authoring Dolby Digital and Dolby E Bitstreams", Issue 3.
[14] Coding Technologies / Dolby: "Dolby E / aacPlus Metadata Transcoder Solution for aacPlus Multichannel Digital Video Broadcast (DVB)", V1.1.0.
[15] ETSI TS 101 154: Digital Video Broadcasting (DVB), V1.8.1.
[16] SMPTE RDD 6-2008: Description and Guide to the Use of the Dolby E Audio Metadata Serial Bitstream.

Claims
1. A device for generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: a processor for processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; an object manipulator for manipulating, based on audio-object-based metadata relating to at least one audio object, the audio object signal of the at least one audio object, or a mixed audio object signal obtained from the at least one audio object, to obtain a manipulated audio object signal for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object signal with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; wherein the device is adapted to generate m output signals, m being an integer greater than 1; wherein the processor is operative to provide the object representation with k audio objects, k being an integer greater than m; wherein the object manipulator is adapted to manipulate at least two objects differently from each other, based on metadata relating to at least one of the at least two objects; and wherein the object mixer is operative to combine the audio signals of the at least two differently manipulated objects to obtain the m output signals such that each output signal is influenced by the manipulated audio signals of the at least two different objects.

2. The device according to claim 1, in which the audio input signal is a downmix representation of a plurality of audio objects and includes, as side information, object-based metadata having information on one or more audio objects included in the downmix representation, and in which the object manipulator is adapted to extract the object-based metadata from the audio input signal.

3. The device according to claim 1, in which the object manipulator is adapted to manipulate each of a plurality of object component signals in the same way, based on the metadata for the object, to obtain object component signals for the audio object, and in which the object mixer is adapted to add the object component signals from different objects belonging to the same output channel, to obtain the audio output signal for that output channel.

4. The device according to claim 1, further comprising an output signal mixer for mixing the audio output signal, obtained based on the manipulation of the at least one audio object, with a corresponding audio output signal obtained without the manipulation of the at least one audio object.

5. The device according to claim 1, in which the metadata include information on a gain, a compression, a level, a downmix setup or a characteristic specific to a certain object, and in which the object manipulator is adapted to manipulate the object, or other objects, based on the metadata, so as to implement, in an object-specific way, a midnight mode, a high-fidelity mode, a clean-audio mode, a dialogue normalization, a downmix-specific manipulation, a dynamic downmix, a guided upmix, a relocation of speech objects or an attenuation of an ambience object.
6. A device for generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: a processor for processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; an object manipulator for manipulating, based on audio-object-based metadata relating to at least one audio object, the audio object signal of the at least one audio object, or a mixed audio object signal obtained from the at least one audio object, to obtain a manipulated audio object signal for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; in which the processor is adapted to receive the input signal, the input signal being a downmix representation of a plurality of audio objects; in which the processor is adapted to receive audio object parameters for controlling a reconstruction algorithm for reconstructing an approximated representation of the original audio objects; and in which the processor is adapted to conduct the reconstruction algorithm using the input signal and the audio object parameters, to obtain the object representation comprising audio object signals that are an approximation of the original audio object signals.

7. The device according to claim 6, in which the audio input signal includes the audio object parameters as side information, and in which the processor is adapted to extract this side information from the audio input signal.

8. A device for generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: a processor for processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; an object manipulator for manipulating, based on audio-object-based metadata relating to at least one audio object, the audio object signal of the at least one audio object, or a mixed audio object signal obtained from the at least one audio object, to obtain a manipulated audio object signal for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object signal with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; in which the object mixer is adapted to apply a downmix rule to each object, based on a rendering position of the object and a reproduction setup, to obtain an object component signal for each audio output signal, and in which the object mixer is adapted to add the object component signals from different objects belonging to the same output channel, to obtain the audio output signal for that output channel.
9. A device for generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: a processor for processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; an object manipulator for manipulating, based on audio-object-based metadata relating to at least one audio object, the audio object signal of the at least one audio object, or a mixed audio object signal obtained from the at least one audio object, to obtain a manipulated audio object signal for the at least one audio object; and an object mixer for mixing the object representation by combining the manipulated audio object signal with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; in which the object parameters include, for a plurality of time portions of an audio object, parameters for each frequency band of a plurality of frequency bands in the respective time portion, and in which the metadata include only non-frequency-selective information for an audio object.

10. A method of generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; manipulating, based on audio-object-based metadata relating to at least one audio object, the audio object signal of the at least one audio object, or a mixed audio object signal obtained from the at least one audio object, to obtain a manipulated audio object signal or a manipulated mixed audio object signal for the at least one audio object; and mixing the object representation by combining the manipulated audio object with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; in which the method generates m output signals, m being an integer greater than 1; the processing provides the object representation with k audio objects, k being an integer greater than m; at least two objects are manipulated differently from each other, based on metadata relating to at least one of the at least two objects; and the mixing combines the audio signals of the at least two differently manipulated objects to obtain the m output signals such that each output signal is influenced by the manipulated audio signals of the at least two different objects.
11. A method of generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; manipulating, based on audio-object-based metadata relating to at least one audio object, the audio object signal of the at least one audio object, or a mixed audio object signal obtained from the at least one audio object, to obtain a manipulated audio object signal for the at least one audio object; and mixing the object representation by combining the manipulated audio object signal with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; in which audio object parameters are used to control a reconstruction algorithm for reconstructing an approximated representation of the original audio objects, and in which the reconstruction algorithm uses the input signal and the audio object parameters to obtain the object representation comprising audio object signals that are an approximation of the original audio object signals.

12. A method of generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; and mixing the object representation by combining the manipulated audio object signal with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object; in which a downmix rule is applied to each object, based on a rendering position of the object and a reproduction setup, to obtain an object component signal for each audio output signal, and in which the object component signals from different objects belonging to the same output channel are added, to obtain the audio output signal for that output channel.

13. A method of generating at least one audio output signal representing a superposition of at least two different audio objects, comprising: processing an audio input signal to provide an object representation of the audio input signal, in which the at least two different audio objects are separated from each other, are available as separate audio object signals, and are manipulable independently of each other; and mixing the object representation by combining the manipulated audio object signal with an unmodified audio object or with a different manipulated audio object manipulated differently from the at least one audio object.

14. A computer-readable storage medium having recorded thereon a computer program for performing, when running on a computer, a method of generating at least one audio output signal according to any one of claims 10, 11, 12 or 13.