RussianPatents.com

Method and device for parallel conjunction of data with shift to the right (RU 2273044)
FIELD: microprocessor and computing-system engineering; in particular, devices for parallel merging of data with a shift to the right.
SUBSTANCE: in the method, in parallel with a left shift of a first operand, having a first set of L data elements, by 'L - M' data elements, a second operand, having a second set of L data elements, is shifted to the right by M data elements, and the shifted first set is combined with the shifted second set to produce a result having L data elements.
EFFECT: efficient support of SIMD operations without substantial reduction of overall efficiency.
6 cl, 39 dwg
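The combination summarized above can be sketched element-wise as follows (a behavioral model of our own for illustration; the function name and the use of Python lists to stand in for registers are our assumptions, not the patented hardware):

```python
def shift_right_merge(a, b, m):
    """Parallel right-shift merge of two L-element operands (index 0 is
    the least significant element).

    As in the abstract: the first operand `a` is shifted left by (L - m)
    elements, the second operand `b` is shifted right by m elements, and
    the two shifted sets are combined into one L-element result."""
    L = len(a)
    assert len(b) == L and 0 <= m <= L
    shifted_a = [0] * (L - m) + a[:m]   # left shift by L - m elements
    shifted_b = b[m:] + [0] * m         # right shift by m elements
    return [x | y for x, y in zip(shifted_a, shifted_b)]

# Merging two 8-element registers with a right shift of 3 elements takes
# the upper five elements of b followed by the lower three elements of a:
a = [0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7]
b = [0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7]
print(shift_right_merge(a, b, 3))  # b[3..7] followed by a[0..2]
```

With m = 0 the result is simply b, and with m = L it is a, which matches the boundary cases of the claimed operation.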
SCOPE OF THE INVENTION
The present invention relates in general to the field of microprocessors and computer systems. More specifically, the present invention relates to a method and apparatus for parallel merging of data with a shift to the right.

PRIOR ART
As processor technology advances, new program code is written to run on computers with these processors. Users generally expect and demand higher performance from their computers regardless of the type of software used. One source of such problems is the kinds of instructions and operations that are actually performed within the processor. Certain types of operations require more time to complete because of the complexity of the operation and/or the circuitry it requires. This creates an opportunity to optimize the way certain complex operations execute inside the processor. Media playback applications (multimedia applications) have driven microprocessor development for more than a decade. In fact, most computing upgrades in recent years have been driven by media applications. These upgrades have occurred predominantly within the consumer segment, although significant advances have also been seen in the business segment, for example for education and communication purposes with an entertainment aspect. Nevertheless, future media playback applications will place ever higher demands on computation. As a result, in the near future the use of personal computers (PCs) will be richer in audiovisual effects and easier in practical use, and, more importantly, computing will merge with communications. Accordingly, the display of images, as well as playback of audio and video data, which together are referred to as content, have become increasingly popular applications for current computing devices.
Filtering and convolution are among the most common operations performed on content data, such as image, audio and video information. As known in the art, filtering and correlation calculations are performed with a multiply-accumulate operation that sums the products of data and coefficients. The correlation of two vectors A and B consists in computing the sum S:

    S[k] = sum_{i=0..N-1} a[i] * b[i+k]    (1)

which is very often used with k = 0:

    S = sum_{i=0..N-1} a[i] * b[i]    (2)

In the case where an N-tap filter f is applied to a vector V, the sum S can be computed as follows:

    S[i] = sum_{k=0..N-1} f[k] * V[i+k]    (3)

Such operations are computationally intensive, but offer a high level of data parallelism that can be exploited in an efficient implementation using various data storage devices, such as single-instruction, multiple-data (SIMD) registers. Applications of filtering operations are found in a great many tasks in image processing, video processing and communications. Examples of the use of filters are the reduction of block artifacts in MPEG video (a standard developed by the Moving Picture Experts Group), noise reduction, separation of watermarks from pixel values to improve watermark detection, correlation for smoothing, sharpening, noise reduction, edge detection and scaling of images or video frames, sub-pixel frame sampling for motion estimation, audio enhancement, and pulse shaping and signal equalization in communications. Accordingly, filtering operations, like convolution, are vital to computing devices that play back content, including image, audio and video data. Unfortunately, existing methods and instructions target the general needs of filtering and are not comprehensive. In fact, many architectures do not support a means for efficient filter calculation across a range of filter lengths and data types.
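The sums in equations (1)-(3) can be sketched in scalar form as follows (illustrative helpers of our own, not the SIMD implementation described later; the names `correlate` and `fir_filter` are our assumptions):

```python
def correlate(a, b, k=0):
    """Correlation sum S[k] = sum_i a[i] * b[i + k], equations (1) and (2).

    Requires len(b) >= len(a) + k so every product is defined."""
    return sum(a[i] * b[i + k] for i in range(len(a)))

def fir_filter(f, v, i):
    """N-tap filter applied at position i: S[i] = sum_k f[k] * v[i + k],
    equation (3)."""
    return sum(f[k] * v[i + k] for k in range(len(f)))

# Example: a hypothetical 3-tap filter applied to a small pixel row.
pixels = [10, 20, 30, 40, 50]
taps = [1, 2, 1]
print(fir_filter(taps, pixels, 1))  # 1*20 + 2*30 + 1*40 = 120
```

Each output sample is one multiply-accumulate chain, which is exactly the work the packed multiply-accumulate and adjacent-sum instructions below parallelize.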
In addition, the ordering of data within storage devices such as SIMD registers, as well as the summation of adjacent values within a register and partial data transfers between registers, are in the general case not supported. As a result, existing architectures require unnecessary data type changes, which minimize the number of operations per instruction and significantly increase the number of clock cycles required to order data for arithmetic operations.

BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not limitation in the accompanying drawings, in which like references indicate similar elements and in which: Fig. 1 is a block diagram of a computer system implementing one embodiment of the present invention; Fig. 2 is a block diagram of an embodiment of the processor shown in Fig. 1, in accordance with a further embodiment of the present invention; Fig. 3 illustrates packed data types according to a further embodiment of the present invention; Fig. 4A illustrates an in-register packed byte representation according to one embodiment of the present invention; Fig. 4B illustrates an in-register packed word representation according to one embodiment of the present invention; Fig. 4C illustrates an in-register packed double word representation according to one embodiment of the present invention; Fig. 5 is a diagram illustrating the effect of a byte shuffle instruction in accordance with an embodiment of the present invention; Fig. 6 is a diagram illustrating a byte multiply-accumulate instruction in accordance with an embodiment of the present invention; Figs. 7A-7C are diagrams illustrating the byte shuffle instruction of Fig. 5 combined with the byte multiply-accumulate instruction of Fig. 6 to obtain a set of sums of pairs of products in accordance with a further embodiment of the present invention; Figs. 8A-8D are diagrams illustrating an adjacent-element summation instruction in
accordance with a further embodiment of the present invention; Figs. 9A and 9B illustrate a register merge instruction in accordance with a further embodiment of the present invention; Fig. 10 is a flow diagram of operations for efficient processing of content data in accordance with one embodiment of the present invention; Fig. 11 is a flow diagram of a method of processing content data according to a data processing operation in accordance with a further embodiment of the present invention; Fig. 12 is a flow diagram of operations continuing the processing of content data in accordance with a further embodiment of the present invention; Fig. 13 is a flow diagram illustrating the register merge operation in accordance with a further embodiment of the present invention; Fig. 14 is a flow diagram of a method of selecting unprocessed data elements from a source data storage device in accordance with an exemplary embodiment of the present invention; Fig. 15 is a block diagram of the processor microarchitecture of one embodiment, which includes a logic circuit for a parallel right-shift operation in accordance with the present invention; Fig. 16A is a block diagram of one embodiment of a logic circuit for performing a parallel merge-with-right-shift operation on data operands in accordance with the present invention; Fig. 16B is a block diagram of a further embodiment of a logic circuit for performing the right-shift merge operation; Fig. 17A illustrates the action of a parallel merge-with-right-shift instruction in accordance with a first embodiment of the present invention; Fig. 17B illustrates the action of the merge-with-right-shift instruction in accordance with a second embodiment; Fig. 18A is a flow diagram of one embodiment of a method of parallel right shift and merge of data operands; Fig. 18B is a flow diagram of a further variant of the
method of right shift and data merge; Figs. 19A-19B show examples of motion estimation; Fig. 20 shows an example application of motion estimation and the resulting prediction; Figs. 21A-21B show an example of current and previous frames processed during motion estimation; Figs. 22A-22D illustrate motion estimation operations on frames in accordance with one embodiment of the present invention; and Figs. 23A-23B show a sequence of operations of one embodiment of a method of motion estimation and prediction.

DETAILED DESCRIPTION
Described below is a method and apparatus for performing parallel merging of data with a shift to the right. Also described is a method and apparatus for efficient filtering and convolution of content data. Further disclosed is a method and apparatus for fast full-search motion estimation using SIMD merge operations. The embodiments are described in the context of a microprocessor, but are not limited to it. Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. The same methods and teachings of the present invention can easily be applied to other types of semiconductor circuits or devices that can benefit from higher pipeline throughput and improved performance. The principles of the present invention are applicable to any processor or device that performs data processing. However, the present invention is not limited to processors or devices that perform operations on 256-bit, 128-bit, 64-bit, 32-bit or 16-bit data, and can be used in any processor or device in which a right-shift merge of data is desired. In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the present invention.
However, it will be clear to those skilled in the art that these specific details are not necessary to practice the present invention. In other instances, well-known electrical structures and circuits are not set forth in detail in order not to obscure the essence of the present invention. Moreover, the following description and drawings present various examples for purposes of illustration. However, these examples should not be construed as limiting, since they are intended merely to present examples of the present invention rather than to provide an exhaustive list of all its possible embodiments. In one embodiment, the methods of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor programmed with them to perform the steps of the present invention. Alternatively, the steps of the present invention can be performed by specific hardware components that contain hard-wired logic for performing the steps, or by any combination of programmed computer components and custom hardware components. The present invention may be provided as a computer program product or software, which may include a machine-readable medium having stored thereon instructions that can be used to program a computer (or other electronic devices) to perform the method according to the present invention. Such software may be stored in system memory. Similarly, the code may be distributed via a network or via other machine-readable media.
Machine-readable media may include, but are not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs) and magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or transmission over the Internet or similar means. Accordingly, a machine-readable medium includes any type of media/machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer). In addition, the present invention may also be downloaded as a computer program product, whereby the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client). The transfer of the program may be accomplished by means of electrical, optical, acoustic or other forms of data signals embodied in a carrier wave or other propagation medium via a communication channel (e.g., a modem, a network connection or other similar means). Modern processors use a variety of execution units to process and execute various code and instructions. Not all instructions are alike: some complete faster, while others can take a huge number of clock cycles. The faster instructions execute, the better the overall performance of the processor. Thus it would be advantageous to have as many instructions as possible execute as fast as possible. However, there are certain instructions of greater complexity that require more in terms of CPU time and processor resources, for example floating-point instructions, load/store operations, data moves, and so on. As more and more computer systems are used in Internet and multimedia applications, additional processor support has been introduced over time.
For example, single-instruction, multiple-data (SIMD) integer/floating-point instructions and Streaming SIMD Extensions (SSE) are instructions that reduce the overall number of instructions required to execute a particular program task. These instructions can speed up software by operating on multiple data elements in parallel. As a result, performance gains can be achieved in a wide range of applications, including processing of video, speech, and images/photos. The implementation of SIMD instructions in microprocessors and other types of logic circuits usually poses many problems. Furthermore, the complexity of SIMD operations often requires additional circuitry for correct processing and data handling. Embodiments of the present invention provide a way of implementing a parallel right-shift merge instruction as an algorithm that makes use of SIMD-related hardware. One embodiment of the algorithm is based on the principle of right-shifting a desired number of data segments from one operand into the most significant side of a second operand, while the same number of data segments is shifted out of the least significant side of the second operand. Conceptually, the right-shift merge operation can be viewed as merging two blocks of data into one block and shifting this combined block so as to align the data segments at the desired position, forming a new pattern of data. Thus, embodiments of a right-shift merge algorithm in accordance with the present invention can be implemented in a processor to support SIMD operations efficiently without significantly compromising overall performance.

COMPUTING ARCHITECTURE
Fig. 1 shows a computer system 100 in which an embodiment of the present invention can be implemented.
Computer system 100 includes a bus 101 for transmitting information and a processor 109 connected to bus 101 for processing information. Computer system 100 also includes a memory subsystem 104-107, connected to bus 101, for storing information and instructions for processor 109. Processor 109 includes an execution module 130, a register file 200, cache memory 160, a decoder 165 and an internal bus 170. Cache memory 160 is connected to execution module 130 and stores frequently and/or recently used information for processor 109. Register file 200 stores information in processor 109 and is connected to execution module 130 via internal bus 170. In one embodiment of the invention, register file 200 includes multimedia registers, for example SIMD registers, for storing multimedia information. In one embodiment, each of the multimedia registers stores up to one hundred twenty-eight bits of packed data. The multimedia registers may be dedicated multimedia registers or registers used for storing both multimedia information and other information. In one embodiment, the multimedia registers store multimedia data during multimedia operations and floating-point data during floating-point operations. Execution module 130 operates on packed data according to instructions received by processor 109 that are included in packed instruction set 140. Execution module 130 also operates on scalar data according to instructions found in general-purpose processors. Processor 109 can support the Pentium® microprocessor instruction set and the packed instruction set 140. By including packed instruction set 140 in a standard microprocessor instruction set, such as the Pentium® microprocessor instruction set, packed data instructions can easily be incorporated into existing software (previously written for the standard microprocessor instruction set).
Other standard instruction sets, such as the PowerPC™ and Alpha™ instruction sets, may also be used in accordance with the described invention. (Pentium® is a registered trademark of Intel Corporation. PowerPC™ is a trademark of IBM, Apple Computer and Motorola. Alpha™ is a trademark of Digital Equipment Corporation.) In one embodiment, packed instruction set 140 includes instructions (described in more detail below) for a data move operation 143 (MOVD) and a data shuffle operation 145 (PSHUFD) for organizing data within a storage device; a packed multiply-accumulate operation (147 PMADDUSBW) for a first unsigned source register and a second signed source register; a packed multiply-accumulate operation (149 PMADDUUBW) for performing multiply-accumulate on a first unsigned source register and a second unsigned source register; a packed multiply-accumulate operation (151 PMADDSSBW) for first and second signed source registers; and a standard multiply-accumulate operation (153 PMADDWD) for first and second signed source registers containing 16-bit data. Finally, the packed instruction set includes adjacent-value summation instructions for summing adjacent bytes (operation 155 PAADDNB), words (157 PAADDNWD) and double words (159 PAADDNDWD); two word values (161 PAADDWD); two words to produce a 16-bit result (operation 163 PAADDNWW); two quadwords to produce a quadword result (165 PAADDNDD); and a register merge operation 167. By including packed instruction set 140 in the instruction set of general-purpose processor 109, together with the associated circuitry to execute the instructions, the operations used by many existing multimedia applications can be performed using packed data in a general-purpose processor.
Thus, many multimedia applications can run faster and more efficiently by using the full width of the processor's data bus for operations on packed data. This eliminates the need to transfer smaller units of data across the processor's data bus in order to perform one or more operations on one data element at a time. As shown in Fig. 1, a computer system 100 corresponding to the present invention may include a display device 121, such as a monitor. Display device 121 may include an intermediate device, such as a frame buffer. Computer system 100 also includes an input device 122, such as a keyboard, and a cursor control device 123, such as a mouse, trackball or touch pad. Display device 121, input device 122 and cursor control device 123 are connected to bus 101. Computer system 100 may also include a network connector 124 so that computer system 100 can form part of a local area network (LAN) or a wide area network (WAN). Additionally, computer system 100 can be connected to a device 125 for recording and/or playing back sound, such as an audio digitizer coupled to a microphone for recording voice input for speech recognition. Computer system 100 may also include a video digitizing device 126 that can be used to capture video, a hard-copy (documentary) device 127, such as a printer, and a CD-ROM device 128. Devices 124-128 are also connected to bus 101.

PROCESSOR
Fig. 2 shows a detailed diagram of processor 109. Processor 109 can be implemented on one or more substrates using any of a number of process technologies, such as BiCMOS (bipolar complementary metal-oxide-semiconductor), CMOS (complementary metal-oxide-semiconductor) and NMOS (n-channel metal-oxide-semiconductor). Processor 109 includes a decoder 202 for decoding control signals and data used by processor 109.
The data can then be stored in register file 200 via internal bus 205. It should be made clear that the registers of an embodiment are not limited to any particular type of circuit. A register of this embodiment need only be capable of storing and providing data and performing the functions described herein. Depending on the data type, data can be stored in integer registers 201, registers 209, status registers 208 or instruction pointer register 211. Register file 204 may include other registers, such as floating-point registers. In one embodiment, integer registers 201 store thirty-two-bit integer data. In one embodiment, registers 209 contain eight multimedia registers, R0 212a through R7 212h, for example SIMD registers containing packed data. Each register in registers 209 is one hundred twenty-eight bits long. R1 212a, R2 212b and R3 212c are examples of individual registers in registers 209. Thirty-two bits of data in registers 209 can be moved into one of the integer registers 201. Similarly, a value in an integer register can be moved into thirty-two bits of one of registers 209. Status registers 208 indicate the status of processor 109. Instruction pointer register 211 stores the address of the next instruction to be executed. Integer registers 201, registers 209, status registers 208 and instruction pointer register 211 are all connected to internal bus 205. Any additional registers would also be connected to internal bus 205. In another embodiment, some of these registers can be used for two different types of data. For example, registers 209 and integer registers 201 can be combined, with each register able to store either integer data or packed data. In another embodiment, registers 209 can be used as floating-point registers. In this embodiment, either packed data or floating-point data can be stored in registers 209.
In one embodiment, the combined registers are one hundred twenty-eight bits long, and integers are represented as one hundred twenty-eight bits. In this embodiment, the registers need not differentiate between the two data types when storing packed data and integer data. Functional module 203 performs the operations carried out by processor 109. Such operations may include shifts, addition, subtraction, multiplication and so on. Functional module 203 is connected to internal bus 205. Cache 160 is an optional element of processor 109 and can be used to cache data and/or control signals from, for example, memory 104. Cache 160 is connected to decoder 202 and is connected to receive control signals 207.

DATA FORMATS AND MEMORY
Fig. 3 illustrates three packed data types: packed byte 221, packed word 222 and packed double word (dword) 223. Packed byte 221 is one hundred twenty-eight bits long and contains sixteen packed byte data elements. In general, a data element is an individual piece of data that is stored in a single register (or memory location) together with other data elements of the same length. In packed data sequences, the number of data elements stored in a register is 128 bits divided by the bit length of a data element. Packed word 222 is 128 bits long and contains eight packed word data elements. Each packed word contains 16 bits of information. Packed double word 223 is 128 bits long and contains four packed double word data elements. Each packed double word data element contains 32 bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements. Figs. 4A-4C illustrate in-register packed data representations according to one embodiment of the invention.
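The element counts above follow directly from dividing the register length by the element length; a trivial sketch (our own helper, for illustration only):

```python
def packed_elements(register_bits, element_bits):
    """Number of packed data elements = register length in bits divided
    by the bit length of one data element."""
    return register_bits // element_bits

# For a 128-bit register:
print(packed_elements(128, 8))   # 16 packed bytes
print(packed_elements(128, 16))  # 8 packed words
print(packed_elements(128, 32))  # 4 packed double words
print(packed_elements(128, 64))  # 2 packed quadwords
```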
Unsigned packed byte in-register representation 310 illustrates the storage of an unsigned packed byte 221 in one of the multimedia registers 209, as shown in Fig. 4A. Information for each byte data element is stored in bits seven through zero (7 to 0) for byte zero, bits 15 through 8 for byte one, bits 23 through 16 for byte two, and finally bits 127 through 120 for byte fifteen. Thus, all available bits in the register are used. This storage arrangement increases the storage efficiency of the processor. Moreover, with sixteen data elements accessed, one operation can now be performed on sixteen data elements simultaneously. Signed packed byte in-register representation 311 illustrates the storage of a signed packed byte 221. Note that the eighth bit of every byte data element is the sign indicator. Unsigned packed word in-register representation 312 shows how words seven through zero are stored in a register of multimedia registers 209, as shown in Fig. 4B. Signed packed word in-register representation 313 is similar to unsigned packed word in-register representation 312. Note that the sixteenth bit of each word data element is the sign indicator. Unsigned packed double word in-register representation 314 shows a multimedia register 209 storing two double word data elements, as shown in Fig. 4C. Signed packed double word in-register representation 315 is similar to unsigned packed double word in-register representation 314. Note that the sign bit is the thirty-second bit of each double word data element. Efficient filtering and convolution of content data, as disclosed in the present invention, begin with loading a source data storage device with the filter/convolution coefficients and the data. In many cases, the order of the data or the coefficients within a storage device, such as a SIMD register, requires changes before arithmetic calculations can be performed.
Accordingly, efficient filter and convolution calculations require not only appropriate arithmetic instructions, but also efficient ways of organizing the data required for the calculations. For example, as mentioned in the description of the prior art, an image is filtered by replacing the value of, for example, a given pixel I with S[I]. The values of the pixels on either side of pixel I are used in the filter calculation of S[I]. Similarly, the pixels on either side of pixel I+1 are required to compute S[I+1]. Consequently, to compute filter results for more than one pixel in a SIMD register, the data are duplicated and arranged within the SIMD register for the calculation. Unfortunately, current computing architectures lack an efficient way of arranging data for all of the appropriate data sizes within the computing architecture. Accordingly, as depicted in Fig. 5, the present invention includes a byte shuffle instruction (PSHUFB) 145 that efficiently orders data of any size. Byte shuffle operation 145 orders data of sizes larger than a byte by preserving the relative positions of the bytes within the larger data during the shuffle operation. In addition, byte shuffle operation 145 can change the relative positions of data in a SIMD register and can also duplicate data. Fig. 5 shows an example of byte shuffle operation 145 for a filter with three coefficients. Using conventional methods, the filter coefficients (not shown) would be applied to three pixels, and then the filter coefficients would be moved to another pixel and applied again. However, to perform these operations in parallel, the present invention describes a new instruction for organizing data. Accordingly, as depicted in Fig. 5, data 404 are organized within a destination data storage device 406, which in one embodiment is source data storage device 404, using mask 402 to specify the addresses at which the respective data elements are stored in destination register 406.
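The mask-driven byte shuffle just described can be modeled as follows (a behavioral sketch of our own; the mask and data values are hypothetical examples, and the zeroing convention mirrors PSHUFB-style semantics):

```python
def byte_shuffle(src, mask):
    """Each result byte i takes src[mask[i]]; a mask entry of None
    (the most significant mask bit set, in hardware) zeroes that
    destination byte. Source bytes may be duplicated freely."""
    return [0 if m is None else src[m] for m in mask]

# 8-byte example: gather two overlapping 3-pixel windows and zero every
# fourth byte, as when a 3-tap filter uses a zero fourth tap.
src = [10, 11, 12, 13, 14, 15, 16, 17]
mask = [0, 1, 2, None, 1, 2, 3, None]
print(byte_shuffle(src, mask))  # [10, 11, 12, 0, 11, 12, 13, 0]
```

Note how the shuffle both duplicates data (bytes 1 and 2 appear twice) and repositions it, which is exactly what the parallel filter setup in Fig. 5 requires.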
In one embodiment, the arrangement of the mask is based on the desired data processing operation, which may include, for example, a filtering operation, a convolution operation or the like. Accordingly, using mask 402, processing of data 406 together with coefficients can be performed in parallel. In the example described, source data storage device 404 is a 128-bit SIMD register that initially stores sixteen 8-bit pixels. When a pixel filter with three coefficients is used, the fourth coefficient is set to zero. In one embodiment, depending on the number of data elements in source data storage device 404, source register 404 can be used as the destination data storage device, or destination register, thereby reducing the number of registers generally needed. Moreover, data overwritten in source data storage device 404 can be reloaded from memory or from another register. In addition, multiple registers can be used as the source data storage device 404, with their respective data arranged in destination data storage device 406 as required. When the ordering of the data elements and coefficients is complete, the data and the corresponding coefficients must be processed according to the data processing operation. It will be clear to those skilled in the art that filter calculations with different precisions, as well as convolution calculations using different numbers of filter coefficients and different data sizes, require different operations. The most basic filter operation multiplies two pairs of numbers and adds their products. This operation is called a multiply-accumulate instruction. Unfortunately, current computing architectures do not provide support for efficient multiply-accumulate calculations for multiple array or filter lengths and multiple sizes of data using signed or unsigned coefficients. Moreover, byte operations are not supported.
As a result, conventional computer architectures must convert 16-bit data using unpack instructions. These computer architectures generally include support for a multiply-accumulate operation that computes the products of 16-bit data stored in separate registers and then adds adjacent products to give a 32-bit result. This solution is acceptable for filter coefficients and data that require 16-bit precision, but for 8-bit filter coefficients and 8-bit data (the common case for images and video) the available instruction- and data-level parallelism is wasted. Fig. 6 depicts a first source register 452 and a second source register 454. In one embodiment, the first and second source registers are N-bit SIMD registers, such as 128-bit XMM registers of Intel® SSE2 technology. A multiply-accumulate instruction executed on these registers gives the following result for the two pixel vectors 452 and 454, which is stored in destination register 456. Accordingly, the example shows an 8-bit byte to 16-bit word multiply-accumulate instruction, called operation 147 PMADDUSBW (Fig. 1), in which the letters U and S in the instruction mnemonic refer to unsigned and signed bytes. The bytes in one of the source registers are signed, and in the other they are unsigned. In one embodiment of the present invention, the register with the unsigned data is the destination and holds the 16 multiply-accumulate results. The reason for this choice is that in most implementations the data are unsigned and the coefficients are signed. Accordingly, it is preferable to overwrite the data, because the data are less likely to be needed in future calculations. Additional byte multiply-accumulate instructions, as shown in Fig. 1, are operation 149 PMADDUUBW for unsigned bytes in both source registers and operation 151 PMADDSSBW for signed bytes in both source registers.
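The byte multiply-accumulate just described can be modeled as follows (a behavioral sketch of our own; saturation of the 16-bit results is omitted for simplicity, and the values are hypothetical):

```python
def pmadd_bytes(u, s):
    """Multiply unsigned bytes u[i] by signed bytes s[i] and add each
    adjacent pair of products into one word-sized result:
    r[j] = u[2j]*s[2j] + u[2j+1]*s[2j+1]."""
    assert len(u) == len(s) and len(u) % 2 == 0
    return [u[2 * j] * s[2 * j] + u[2 * j + 1] * s[2 * j + 1]
            for j in range(len(u) // 2)]

# Four unsigned pixels against signed filter coefficients:
pixels = [100, 50, 25, 200]
coeffs = [1, -2, 1, 0]
print(pmadd_bytes(pixels, coeffs))  # [100 - 100, 25 + 0] -> [0, 25]
```

Each result word already holds the sum of a pair of products, which is why only adjacent sums (described below) remain to finish a longer filter.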
The multiply-accumulate instructions are completed by instruction PMADDWD 153, which applies to pairs of signed 16-bit words to produce signed 32-bit products. In filtering operations, the second vector generally contains the filter coefficients. Accordingly, to prepare an XMM register, the coefficients can be loaded into part of the register and copied into the rest of the register using move instruction 145. For example, as shown in Fig. 7A, a coefficient storage device 502, such as a 128-bit XMM register, initially loads three coefficients in response to a data load instruction. However, it should be clear to those skilled in the art that the filter coefficients may be organized in memory before the data is processed. Also, the coefficients may be initially loaded, as shown in Fig. 7B, on the basis of their organization in memory prior to the filtering operation. Coefficient register 502 includes filter coefficients F3, F2 and F1, which can be coded as signed or unsigned bytes. Once coefficient register 502 is loaded, the existing instruction PSHUFD can be used to copy the filter coefficients into the remaining portion of the coefficient register to obtain the result shown in Fig. 7B. Coefficient register 504 now includes the replicated (copied) coefficients required for parallel execution of the data processing operation. As is known in the art, filters with three coefficients are common in image processing algorithms. However, it should be clear to those skilled in the art that certain filtering operations, such as JPEG 2000, use nine and seven 16-bit coefficients. Accordingly, processing such coefficients exceeds the capacity of the coefficient registers, resulting in a partially filtered result. Processing therefore continues with each coefficient until the final result is obtained.
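The PSHUFD-style replication of the coefficient group across the register can be sketched as follows. This is a simplified model under stated assumptions: the real instruction takes an immediate selector byte, whereas here the selector is a plain list, and the packed coefficient value is hypothetical.

```python
def pshufd(xmm_dwords, order):
    """Simplified model of PSHUFD: rearranges the four 32-bit doublewords of
    a 128-bit register according to a 4-element selector."""
    return [xmm_dwords[i] for i in order]

# F1, F2, F3 and a zero fourth coefficient packed as bytes into the low
# doubleword (value chosen for illustration), then broadcast across the
# register for parallel filtering
coeffs = [0x00030201, 0, 0, 0]           # 0x00 F3 F2 F1, low doubleword only
reg = pshufd(coeffs, [0, 0, 0, 0])       # broadcast the low doubleword
print([hex(d) for d in reg])
```

Selecting element 0 four times reproduces the three-coefficient group (plus the zero pad) in every doubleword lane, which is the layout the parallel multiply-accumulate step needs.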
Fig. 7C illustrates the arrangement of the pixel data in source register 506, which was originally contained in source register 404, as shown in Fig. 5, and was moved into destination storage device 406. Accordingly, in response to the data processing operation, instruction PMADDUSBW can be used to compute the sums of the pairs of products, which are then stored in destination register 510. Unfortunately, to complete the computation and generate the result of the selected data processing operation, it is necessary to add adjacent pairs of the sums of products in destination register 510. Accordingly, if the multiply-accumulate sum spans more than two pixels, which is the general case, the separate sums must be combined. Unfortunately, existing computing architectures do not provide an efficient way of adding such related sums, because the related sums reside within the same destination register. Accordingly, the present invention uses adjacent-element summation, the results of which are depicted in Figs. 8A-8D. Fig. 8A depicts destination register 552 after the summation of two adjacent 16-bit values (operation PAADD2WD 157) into a 32-bit sum. Also, Fig. 8A depicts two adjacent 16-bit results of the byte multiply-accumulate instructions being summed to obtain the 32-bit sum over four byte products. Fig. 8B depicts the adjacent-value summation instruction (operation PAADD4WD 157), which sums 4 adjacent 16-bit values to obtain a 32-bit sum. Also, 4 adjacent 16-bit byte multiply-accumulate results are summed to obtain the 32-bit sum over eight byte products. Fig. 8C shows the adjacent-value summation instruction (operation PAADD8WD 157), which sums 8 adjacent 16-bit values to obtain a 32-bit sum. Also, this example shows 8 adjacent 16-bit byte multiply-accumulate results added together to obtain the 32-bit sum over sixteen byte products.
Accordingly, the choice of instruction to perform the adjacent-value summation is based on the number of terms in the sum (N). For example, using a filter with three coefficients, as shown in Figs. 7A-7C, the first instruction (operation PAADD2WD 157) obtains the result shown in Fig. 8D. However, due to the correlation between two 16-bit pixel vectors (for example, the first row of a macroblock), the last instruction (operation PAADD8WD 157) is used, as shown in Fig. 8C. This operation becomes increasingly important for efficient implementation as SIMD registers grow in size. Without such an operation, many additional instructions would be required. The set of adjacent-element summation instructions of the present invention supports a wide range of the number of related values that can be summed, as well as the full range of common data types. In one embodiment, the summation of adjacent 16-bit values comprises a set of instructions (operation PAADDNWD 157) whose range begins with the summation of two adjacent values (N=2), with the number of summands doubling to four (N=4), then to eight (N=8), and so on up to the total number of values in the register. The data size of the sum of the 16-bit adjacent-value summation results is 32 bits. In another embodiment, adjacent 16-bit values (operation PAADDWD 161) are summed to obtain a 32-bit sum. This implementation does not include any other instruction with 16-bit data size, because the adjacent-element summation instructions with 32-bit inputs are used to add the sums generated by the instruction with 16-bit input values. Both implementations include a set of instructions for summing adjacent 32-bit values (operation PAADDNDWD 159), whose range begins with the summation of two adjacent values (N=2), with the number of summands doubling to four (N=4), then to eight (N=8), and so on up to the total number of values in the register. The data size of the sums of the 32-bit adjacent-value summation operations is 32 bits.
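The PAADDNWD family described above can be sketched as one parameterized Python model. The function name is hypothetical; the model assumes only what the text states: each group of N adjacent 16-bit values is summed into one 32-bit result.

```python
def paadd_nwd(words, n):
    """Model of the adjacent-value summation family (PAADDNWD-style):
    sums each group of n adjacent 16-bit values into a 32-bit result."""
    assert len(words) % n == 0
    return [sum(words[i:i + n]) for i in range(0, len(words), n)]

# eight 16-bit multiply-accumulate results (illustrative values)
sums = [100, 200, 300, 400, 500, 600, 700, 800]
print(paadd_nwd(sums, 2))  # [300, 700, 1100, 1500]  (PAADD2WD case)
print(paadd_nwd(sums, 8))  # [3600]                  (PAADD8WD case)
```

Doubling n from 2 to 4 to 8 reproduces the instruction range the text describes, up to the total number of values in the register.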
In some cases, the results do not fill the register. For example, for the instructions shown in Figs. 8A, 8B and 8C, the three different adjacent-value summations yield 4, 2 and 1 32-bit results, respectively. In one embodiment, the results are stored in the lowest, least significant part of the destination data storage device. Accordingly, when there are two 32-bit results, as shown in Fig. 8B, the results are stored in the lower 64 bits. In the case of a single 32-bit result, as shown in Fig. 8C, the result is stored in the lower 32 bits. Those skilled in the art will understand that some applications use sums of adjacent bytes. The present invention supports the summation of adjacent bytes with an instruction (operation PAADDNB 155) that sums two adjacent signed bytes, producing a 16-bit word, and with an instruction that adds two adjacent unsigned bytes, producing a 16-bit word. Applications that require summation over more than two adjacent bytes add the 16-bit sums of two bytes using the corresponding 16-bit adjacent-value summation operations. Once the result of the data processing operation has been computed, the next operation is to write the results back to the memory device. As shown by the embodiments described above, the results can be coded with 32-bit precision. Therefore, the results can be written back to memory using simple move operations operating on doublewords, for example the above-described operation MOVD 143, together with a logical right shift operating on the whole register (PSRLDQ), a logical shift right by double quadwords. Writing all of the results back to memory requires four MOVD operations and three PSRLDQ operations in the first case (Fig. 8A), two MOVD operations and one PSRLDQ operation in the second case (Fig. 8B), and, finally, only one MOVD operation in the last case, as shown in Fig. 8C.
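The alternating MOVD/PSRLDQ write-back sequence can be sketched as follows. This is a minimal model under stated assumptions: MOVD is modeled as taking the low 32 bits of the register value, PSRLDQ as a byte-granular logical right shift of the whole 128-bit register, and the packed result values are illustrative.

```python
def movd(xmm):
    """Model of MOVD: the low 32 bits (one doubleword) of a 128-bit register."""
    return xmm & 0xFFFFFFFF

def psrldq(xmm, nbytes):
    """Model of PSRLDQ: logical right shift of the whole 128-bit register
    by nbytes bytes."""
    return (xmm & (2**128 - 1)) >> (8 * nbytes)

# four 32-bit results packed in one 128-bit register; the first case above
# (Fig. 8A) needs four MOVD operations and three PSRLDQ operations
reg = (4 << 96) | (3 << 64) | (2 << 32) | 1
out = []
for _ in range(4):
    out.append(movd(reg))   # store the lowest doubleword
    reg = psrldq(reg, 4)    # bring the next doubleword into position
print(out)  # [1, 2, 3, 4]
```

With two results only one shift is needed, and with a single result one MOVD suffices, matching the three cases enumerated in the text.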
Unfortunately, although the adjacent-value summation operation shown in Fig. 8C can be executed in parallel, filter calculations in general require the next pixels of the image. One or more pixels must be loaded into the source data storage device, or into a register. To avoid loading these eight pixels into the registers each time, two solutions are proposed for this operation. In one embodiment, the present invention describes a register merge operation 163, as shown in Fig. 9A. For processing pixels A1 to A8, pixels A7-A1 in destination register 606 are joined with pixel A8 to form pixels A8-A1 in destination register 606. Accordingly, the register merge operation takes the number of bytes of the selected registers as an input parameter. Fig. 9B depicts an alternative embodiment for performing the register merge operation. Initially, eight pixels are loaded into a first source register 608 (MM0). Then the next eight pixels are loaded into a second source register 610 (MM1). Then a permute operation is performed on second source register 610. After this, register 610 is copied into a third source register (MM2) 612. Then first source register 608 is shifted right by eight bits. In addition, second source register 610 and mask register 614 are combined by a packed logical AND and stored in first source register 608. Then a logical OR is performed on second source register 610 and first source register 608 to obtain the result in destination register 620, which completes the register merge operation. The process continues, as shown, with a shift operation on first source register 608. Then second source register 610 is shifted to obtain the value of register 612. Then a logical AND is performed on mask register 614 and second source register 612, and the result is stored in destination register 622.
Finally, a packed logical OR of second source register 612 and first source register 608 is performed, which yields the subsequent register merge result in destination register 624. The following flowcharts describe procedures for implementing the principles of the present invention.

Operation

Fig. 10 depicts a flowchart illustrating a method 700 for efficient filtering and convolution of content data, for example in computer system 100, as shown in Figs. 1 and 2. As described, content data refers to image, audio, video and speech data. In addition, the present invention refers to storage devices, which, as will be clear to those skilled in the art, include various devices for storing digital data, including, for example, data registers such as the 128-bit registers of Intel® SSE2 technology. According to Fig. 10, the method begins at step 702, where it is determined whether a data processing operation is to be performed. As described, data processing operations include, but are not limited to, convolution and filtering operations performed on pixel data. If yes, the process continues at step 704, where a data load instruction is executed. In response to the data load instruction, at process step 706 the input data stream is loaded into source data storage device 212A and secondary data storage device 212B, for example as shown in Fig. 2. At process step 708 it is determined whether the data processing operation has issued a data move instruction. In response to the data move instruction, at process step 710 a selected portion of the data, for example from source data storage device 212B, is arranged in the destination data storage device, or according to the ordering of the coefficients in the coefficient storage device (see Fig. 5). The coefficients in the coefficient storage device are ordered according to the computations required by the data processing operation (for example, as shown in Figs. 7A and 7B).
In one embodiment, the coefficients are ordered in memory before any filtering operations. Accordingly, the coefficients can be loaded into the coefficient data storage device without having to be rearranged (see Fig. 7B). As described above, the ordering of data and coefficients is required to enable the parallel computations needed by the processing operations, as shown in Figs. 7A-7C. However, since the coefficients are known prior to the data processing operation, they can be ordered in memory so that, as ordered in memory, they can be loaded directly into the coefficient register without requiring coefficient rearrangement during the data processing operation. Finally, at process step 720 the loaded data is processed according to the data processing operation to obtain one or more data processing results. Once the results of the data processing operation are obtained, they may be written back to memory. Fig. 11 depicts a flowchart illustrating a method 722 of processing data according to the data processing operation. At process step 724 it is determined whether the data processing operation has issued a multiply-accumulate instruction. In response to the multiply-accumulate instruction, at process step 726 a plurality of sums of pairs of products of the data in the destination storage device and the coefficients in the coefficient storage device is generated, as shown in Fig. 7C. Then at process step 728 it is determined whether the data processing operation has issued an adjacent-value summation instruction. In response to the adjacent-value summation instruction, at process step 730 the related sums of pairs of products in destination data storage device 510 (Fig. 7C) are added to obtain one or more results of the data processing operation (see Fig. 8D).
However, in some embodiments, when the number of coefficients exceeds the capacity of the coefficient register (see process step 732), partial results are obtained. Hence, the processing and ordering of coefficients (step 734) and the processing of data (step 736) continue until the final results of the data processing operation are obtained, as indicated by the additional process steps 732-736. Otherwise, at step 738 one or more results of the data processing operation are stored. Finally, at process step 790 it is determined whether processing of the input data stream is complete. Process steps 724-732 are repeated until processing of the input data stream is complete. Once processing is complete, control returns to step 720, where method 700 ends. A further flowchart depicts an additional method 740 for processing additional input data. At process step 742 it is determined whether source data storage device 212A contains unaccessed data. As described, unaccessed data refers to data in source data storage device 212A that has not yet been moved into a storage device for execution of the multiply-accumulate instruction. If the storage device contains unaccessed data, at process step 744 a portion of the data is selected from the source data storage device as the selected data. After this selection, process step 786 is executed. Otherwise, at process step 746 one or more unprocessed data elements are selected from the source data storage device, as well as one or more data elements from the secondary storage device. As described, unprocessed data elements are data elements for which the result of the data processing operation has not yet been computed.
Then at process step 780 a register merge instruction is executed (see Figs. 9A and 9B), which joins the unprocessed data elements of the source data storage device with the data elements selected from the secondary storage device to form the selected data. Then at process step 782 the data from the secondary storage device is moved into the source data storage device. In essence, the data in the source data storage device is no longer needed, because all of it has already been accessed. Accordingly, the secondary storage device, which contains unaccessed data, can be used to overwrite the data in the source data storage device. At process step 784 the secondary storage device loads, from memory, input stream data that requires additional processing, such as filtering or convolution. Finally, at process step 786 the selected data is arranged in the destination data storage device, or according to the ordering of the coefficients in the coefficient storage device (see Fig. 5). After this sequence of operations, control returns to step 790 to continue processing the selected data, as shown in Fig. 11. A further flowchart depicts an additional method 748 for selecting unprocessed data elements. At process step 750 it is determined whether the source data storage device contains unprocessed data. If every data element in the source data storage device has been processed, the process continues at step 770, where a portion of the data is selected from the secondary storage device to serve as the selected data, which is then processed in accordance with the data processing operation. Otherwise, at process step 752 one or more unprocessed data elements are selected from the source data storage device. Finally, at process step 766 additional data elements are selected from the secondary storage device according to the count of unprocessed data elements, to form the selected data.
The data selected for transfer to the destination data storage device for performing the data processing operation is limited by a data element count based on the number of filter coefficients. Accordingly, using this data element count, the number of unprocessed data elements is subtracted from the data element count to determine the number of elements that must be selected from the secondary data storage device to perform the register merge operation. A further flowchart depicts an additional method 754 for selecting the unprocessed data elements of process step 752 described above. At process step 756 a data element is selected from the source data storage device. Then at process step 758 it is determined whether the result of the data processing operation has been computed for this data element. If the result has been computed, the selected data element is ignored. Otherwise, at process step 760 the selected data element is stored as an unprocessed data element. Then at process step 762 the count of unprocessed data elements is incremented. Finally, at process step 764, process steps 756-762 are repeated until every data element in the source data storage device has been examined. Using the principles of the present invention, unnecessary data type conversions can be avoided, which maximizes the number of SIMD operations per instruction. In addition, a significant reduction in the number of clock cycles required to arrange data for arithmetic operations is achieved. Accordingly, Table 1 gives estimated speedup values for various filtering applications using the principles and instructions disclosed in the present invention.
Alternative implementations

The foregoing describes various aspects of a computing architecture for efficient filtering and convolution of content data using SIMD registers. However, various implementations of the computing architecture provide numerous features that complement and/or replace the features described above. In different implementations, the features may be realized as part of the computing architecture or as part of specific software or hardware components. In addition, in the foregoing description, specific nomenclature has been used for purposes of explanation to provide a thorough understanding of the invention. However, it will be obvious to specialists that these specific details are not required to practice the invention. In addition, although the described embodiment is directed to efficient filtering and convolution of content data using SIMD registers, those skilled in the art should understand that the principles of the present invention can be applied to other systems. In fact, systems for processing images, audio and video are covered by the present invention without departing from the spirit and scope of the present invention. The embodiments described above were chosen and described in order to best explain the principles of the invention and its practical application. These embodiments were selected so as to enable other specialists to best utilize the invention and its various embodiments with various modifications suited to the particular application contemplated. Embodiments of the present invention provide many advantages over known methods. The present invention includes the ability to efficiently implement filtering/convolution for multiple array lengths, data sizes and coefficient signs. These operations are performed using a small number of instructions that form part of a small set of SIMD processing instructions.
Accordingly, the present invention avoids unnecessary data type conversions. As a result, the present invention increases the number of SIMD operations per instruction, greatly reducing the number of clock cycles required to arrange data for arithmetic operations such as multiply-accumulate. The next figure presents a block diagram of the processor microarchitecture of one embodiment, which includes logic circuits for performing a parallel shift-right merge operation in accordance with the present invention. The shift-right merge operation may also be referred to as the register merge operation and the register merge instruction, as in the discussion above. For one embodiment of the shift-right merge instruction (PSRMRG), the instruction produces the same results as register merge operation 167 of Figs. 1, 9A and 9B. The in-order front end 1001 is the part of processor 1000 that fetches the macroinstructions to be executed and prepares them for later use in the processor pipeline. The front end of this embodiment includes several modules. Instruction prefetch module 1026 fetches macroinstructions from memory and passes them to instruction decoder 1028, which in turn decodes them into primitives called micro-ops or micro-operations (also referred to as uops), whose execution the machine understands. Trace cache 1030 takes the decoded micro-operations and assembles them into program-ordered sequences, or traces, in micro-operation queue 1034 for execution. When trace cache 1030 encounters a complex macroinstruction, microcode ROM 1032 provides the micro-operations needed to complete the operation. Many macroinstructions are converted into a single micro-operation, while others require several micro-operations to complete the full operation.
In this embodiment, if more than four micro-operations are needed to complete a macroinstruction, decoder 1028 accesses microcode ROM 1032 to execute the macroinstruction. In one embodiment, the instruction for the parallel shift-right merge algorithm can be stored in microcode ROM 1032 in case a large number of micro-operations is required to perform the operation. Trace cache 1030 refers to an entry-point programmable logic array (PLA) to determine the correct micro-instruction pointer for reading the microcode sequences for the algorithms in microcode ROM 1032. After microcode ROM 1032 finishes sequencing micro-operations for the current macroinstruction, front end 1001 of the machine resumes fetching micro-operations from trace cache 1030. Some SIMD and other multimedia types of instructions are treated as complex instructions. Most floating-point-related instructions are also complex instructions. When instruction decoder 1028 encounters a complex macroinstruction, microcode ROM 1032 is accessed at the appropriate location to retrieve the microcode sequence for that macroinstruction. The various micro-operations required to execute the macroinstruction are passed to out-of-order execution engine 1003 for execution in the appropriate integer and floating-point execution modules. The out-of-order execution engine 1003 is where the micro-instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth and reorder the flow of micro-instructions, optimizing performance as they go down the pipeline and are scheduled for execution. The allocator logic allocates the machine buffers and resources that each micro-operation needs in order to execute. The register renaming logic renames logical registers onto entries in a register file.
The allocator also allocates an entry for each micro-operation in one of two queues, one for memory operations and one for non-memory operations, ahead of the instruction schedulers: the memory scheduler, fast scheduler 1002, slow/general floating-point scheduler 1004, and simple floating-point scheduler 1006. Schedulers 1002, 1004, 1006 determine when a micro-operation is ready to execute based on the readiness of its dependent input register operands and the availability of the execution resources the micro-operation needs to complete its operation. Fast scheduler 1002 of this embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports when scheduling micro-operations for execution. Register files 1008, 1010 are located between schedulers 1002, 1004, 1006 and execution modules 1012, 1014, 1016, 1018, 1020, 1022, 1024 of execution block 1011. There are separate register files 1008, 1010 for integer and floating-point operations, respectively. Each register file 1008, 1010 of this embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent micro-operations. Integer register file 1008 and floating-point register file 1010 are also capable of exchanging data with each other. In one embodiment, integer register file 1008 is split into two separate register files: one register file for the low-order 32 bits of data and a second register file for the high-order 32 bits of data. Floating-point register file 1010 of one embodiment has entries 128 bits wide, because floating-point instructions typically have operands from 64 to 128 bits wide.
Execution block 1011 contains execution modules 1012, 1014, 1016, 1018, 1020, 1022, 1024, where the instructions are actually executed. This section includes register files 1008, 1010, which store the integer and floating-point operand values that the micro-instructions need to execute. Processor 1000 of this embodiment comprises several execution modules: address generation unit (AGU) 1012, AGU 1014, fast arithmetic logic unit (ALU) 1016, fast ALU 1018, slow ALU 1020, floating-point ALU 1022, and floating-point move module 1024. In this embodiment, floating-point execution modules 1022, 1024 execute MMX, SIMD and SSE floating-point operations. Floating-point ALU 1022 of this embodiment includes a 64-bit by 64-bit floating-point divider for executing divide, square-root and remainder micro-operations. For embodiments of the present invention, any action involving a floating-point value is performed with the floating-point hardware. For example, conversions between integer format and floating-point format involve the floating-point register file. Similarly, the floating-point divide operation is performed in the floating-point divider. On the other hand, non-floating-point numbers and integers are handled by the integer hardware. Simple, very frequent ALU operations are directed to fast ALU execution modules 1016, 1018. Fast ALUs 1016, 1018 of this embodiment can perform fast operations with an effective latency of half a clock cycle.
In one embodiment, most complex integer operations are directed to slow ALU 1020, because slow ALU 1020 includes integer execution hardware for long-latency operations, such as multiplies, shifts, flag logic and branch processing. Memory load/store operations are executed by AGUs 1012, 1014. In this embodiment, integer ALUs 1016, 1018, 1020 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, ALUs 1016, 1018, 1020 can be implemented to support a variety of data widths, including 16, 32, 128, 256 bits, etc. Similarly, floating-point modules 1022, 1024 can be implemented to support a range of operands with various bit widths. In one embodiment, floating-point modules 1022, 1024 can operate on 128-bit-wide packed data operands in conjunction with SIMD and multimedia instructions. In this embodiment, schedulers 1002, 1004, 1006 dispatch dependent operations before the parent load has finished executing. Because micro-operations are scheduled and executed speculatively in processor 1000, processor 1000 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes the instructions that used the incorrect data. Only the dependent operations need to be replayed; the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of the processor are also designed to catch instruction sequences for extended-precision integer divide operations.
The term "registers" is used here to refer to the on-board processor storage locations that are used as part of macroinstructions to identify operands. In other words, the registers referred to are those visible from outside the processor (from a programmer's perspective). However, the registers described can be implemented by circuits within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. For the discussion that follows, the registers are understood to be data registers designed to hold packed data, such as the 64-bit-wide MMX™ registers (mm registers) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating-point form, can be used with the packed data elements that accompany SIMD and SSE instructions. Similarly, the 128-bit XMM registers related to SSE2 technology can also be used to hold such packed data operands. In the examples shown in the following figures, a number of data operands are described. For simplicity, the data segments are labeled alphabetically, with segment A located at the lowest address and segment Z at the highest address. Thus, A may be at address 0, B at address 1, C at address 2, and so on. Although the data sequences in some of the examples appear with the letters in reverse alphabetical order, the addressing still starts with A at 0, B at 1, etc. Conceptually, a shift-right operation, such as the shift-right merge of one embodiment, entails right-shifting the data segments toward the lower address if the sequence is D, C, B, A. Thus, a right shift simply moves the data elements of a data block to the right across a stationary boundary line.
Furthermore, a shift right merge operation can conceptually shift the rightmost data segments from one operand into the left side of another data operand, as if the two operands were contiguous. FIG. 11A is a block diagram of one embodiment of logic to perform a parallel shift right merge operation on data operands in accordance with the present invention. The instruction (PSRMRG) for the shift right merge (also a register shift) operation of this embodiment begins with three pieces of information: a first data operand 1102, a second data operand 1104, and a shift count 1106. In one embodiment, the PSRMRG shift instruction is decoded into a single micro-operation. In an alternative embodiment, the instruction can be decoded into a varying number of micro-operations to perform the shift merge operation on the data operands. For this example, the data operands 1102, 1104 are 64-bit wide pieces of data stored in a register/memory, and the shift count 1106 is an 8-bit wide immediate value. Depending on the particular implementation, the data operands and the shift count can have other widths, such as 128/256 bits and 16 bits, respectively. The first operand 1102 in this example consists of eight data segments: P, O, N, M, L, K, J, and I. The second operand 1104 consists of eight data segments: H, G, F, E, D, C, B, and A. The data segments are of equal length, and each contains a single byte (8 bits) of data. However, another embodiment of the present invention operates with longer 128-bit operands, wherein each data segment consists of a single byte (8 bits) and the 128-bit wide operand has sixteen byte-wide data segments. Similarly, if each data segment were a double word (32 bits) or a quadword (64 bits), the 128-bit operand would have four double-word-wide or two quadword-wide data segments, respectively.
Thus, embodiments of the present invention are not restricted to particular lengths of data operands, data segments, or shift counts, and can be sized appropriately for each implementation. The operands 1102, 1104 can reside either in a register or in a memory location or a register file, or in a mix of these. The data operands 1102, 1104 and the count 1106 are sent to an execution unit 1110 in the processor along with a shift right merge instruction. By the time the shift right merge instruction reaches the execution unit 1110, the instruction will have been decoded earlier in the processor pipeline, so the shift right merge instruction can be in the form of a micro-operation (uop) or some other decoded format. For this embodiment, the two data operands 1102, 1104 are received at concatenation logic and a temporary register. The concatenation logic merges/joins the data segments of the two operands and places the new block of data in the temporary register. The new data block consists of sixteen data segments: P, O, N, M, L, K, J, I, H, G, F, E, D, C, B, A. As this example works with 64-bit wide operands, the temporary register needs to hold the combined data, which is 128 bits wide. For 128-bit wide data operands, a 256-bit wide temporary register is needed. Right shift logic 1114 in the execution unit 1110 takes the contents of the temporary register and performs a logical shift right of the data block by n data segments, as requested by the count 1106. In this embodiment, the count 1106 indicates the number of bytes to shift right. Depending on the particular implementation, the count 1106 can also be used to indicate the number of bits, nibbles, words, double words, quadwords, etc. to shift, depending on the granularity of the data segments.
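A minimal sketch of this concatenate-then-shift mechanism in C (function and parameter names are hypothetical; hardware would hold the merged data in a 2L-wide temporary register):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the concatenate-then-shift mechanism just described
   (illustrative names, not the patent's code). op2 supplies the low
   eight segments of the 16-segment temporary, op1 the high eight. */
static void psrmrg_concat(const uint8_t op1[8], const uint8_t op2[8],
                          unsigned n, uint8_t result[8]) {
    uint8_t tmp[16];
    memcpy(tmp, op2, 8);      /* second operand at the lower addresses  */
    memcpy(tmp + 8, op1, 8);  /* first operand at the higher addresses  */
    for (int i = 0; i < 8; i++) {
        unsigned src = (unsigned)i + n;        /* right shift by n segments */
        result[i] = (src < 16) ? tmp[src] : 0; /* zeros enter from the left */
    }
}
```

With the example operands (first operand P..I, second operand H..A) and a count of 3, the resultant holds K, J, I, H, G, F, E, D, listed from the most to the least significant segment.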
For this example, n equals 3, so the temporary register contents are shifted by three bytes. If each data segment were a word or double word wide, the count could indicate the number of words or double words to shift, respectively. For this embodiment, 0's are shifted in from the left side of the temporary register to fill the vacated space as the data in the register is shifted right. Thus, if the shift count 1106 is greater than the number of data segments in a data operand (eight in this case), one or more 0's can appear in the resultant 1108. Furthermore, if the shift count 1106 equals or exceeds the total number of data segments in both operands, the resultant will consist of all 0's, as all the data segments will have been shifted out. The right shift logic 1114 outputs the appropriate number of data segments from the temporary register as the resultant 1108. In another embodiment, an output multiplexer or latch can be included after the right shift logic to output the resultant. For this example, the resultant is 64 bits wide and includes eight bytes. Due to the shift right merge operation on the two data operands 1102, 1104, the resultant consists of the following eight data segments: K, J, I, H, G, F, E, and D. FIG. 11B is a block diagram of another embodiment of logic to perform a shift right merge operation. Like the previous example of FIG. 11A, the shift right merge operation of this embodiment begins with three pieces of information: a 64-bit wide first data operand 1102, a 64-bit wide second data operand 1104, and an 8-bit wide shift count 1106. The shift count 1106 indicates how far to shift the data segments. For this embodiment, the count 1106 is stated in number of bytes.
In an alternative embodiment, the count can indicate the number of bits, nibbles, words, double words, or quadwords to shift the data. For this example, the first operand 1102 consists of eight equal-length, byte-wide data segments (P, O, N, M, L, K, J, I), and the second operand 1104 consists of eight data segments (H, G, F, E, D, C, B, A). The count n equals 3. Another embodiment of the invention can operate with operands and data segments of alternative sizes, such as 128/256/512-bit wide operands with bit/byte/word/double-word/quadword sized data segments and 8/16/32-bit wide shift counts. Thus, embodiments of the present invention are not restricted to particular lengths of data operands, data segments, or shift counts, and can be sized appropriately for each implementation. The data operands 1102, 1104 and the count 1106 are sent to an execution unit 1120 in the processor along with a shift right merge instruction. For this embodiment, the first data operand 1102 and the second data operand 1104 are received at left shift logic 1122 and right shift logic 1124, respectively. The count 1106 is also sent to the shift logic 1122, 1124. The left shift logic 1122 shifts the data segments of the first operand 1102 left by "number of data segments in the first operand − n" segments. As the data segments are shifted left, 0's are inserted from the right side to fill the vacated space. In this case there are eight data segments, so the first operand 1102 is shifted left by eight minus three, or five, places. The first operand 1102 is shifted by this difference value in order to achieve the correct data alignment for the merge at the logical OR gate 1126. After the left shift, the first data operand takes the form: K, J, I, 0, 0, 0, 0, 0.
If the count 1106 is greater than the number of data segments in an operand, the left shift calculation can yield a negative number, indicating a negative left shift. A logical left shift with a negative count is interpreted as a shift in the negative direction and is essentially a logical right shift. A negative left shift brings 0's in from the left side of the first operand 1102. Similarly, the right shift logic 1124 shifts the data segments of the second operand right by n segments. As the data segments are shifted right, 0's are inserted from the left side to fill the vacated space. The second data operand takes the form: 0, 0, 0, H, G, F, E, D. The shifted operands are output from the left/right shift logic 1122, 1124 and merged at the logical OR gate 1126. The OR gate performs a logical OR of the data segments and provides the 64-bit wide resultant 1108 of this embodiment. The OR of "K, J, I, 0, 0, 0, 0, 0" and "0, 0, 0, H, G, F, E, D" forms the resultant 1108, comprising eight bytes: K, J, I, H, G, F, E, D. This result is identical to that of the first embodiment of the present invention shown in FIG. 11A. Note that for a count n 1106 greater than the number of data elements in an operand, the corresponding number of 0's can appear in the resultant, starting from the left side. Furthermore, if the count 1106 is greater than or equal to the total number of data elements in both operands, the resultant will consist of all 0's. FIG. 12A illustrates the action of a parallel shift right merge instruction in accordance with a first embodiment of the present invention. For these discussions, MM1 1204, MM2 1206, TEMP 1232, and DEST 1242 are generally referred to as operands or data blocks, and include, but are not limited to, registers, register files, and memory locations.
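The OR-based second embodiment just described can be sketched in C, assuming 64-bit operands with segment 0 in the least significant byte (names are illustrative). The n >= 8 branch models the "negative left shift" case, which degenerates into a right shift of the first operand.

```c
#include <stdint.h>

/* Sketch of the OR-based variant: shift op1 left by (8 - n) segments,
   shift op2 right by n segments, then merge with a logical OR.
   Illustrative function, not the patent's code. */
static uint64_t psrmrg_or(uint64_t op1, uint64_t op2, unsigned n) {
    if (n == 0)  return op2;                  /* no shift at all           */
    if (n >= 16) return 0;                    /* everything shifted out    */
    if (n >= 8)  return op1 >> (8 * (n - 8)); /* "negative left shift"     */
    return (op1 << (8 * (8 - n)))             /* align op1 for the merge   */
         | (op2 >> (8 * n));                  /* shift op2 right by n      */
}
```

For the example operands packed as 0x504F4E4D4C4B4A49 (P..I) and 0x4847464544434241 (H..A), a count of 3 yields 0x4B4A494847464544, i.e. K, J, I, H, G, F, E, D, the same resultant as the first embodiment.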
In one embodiment, MM1 1204 and MM2 1206 are 64-bit wide MMX registers (also sometimes referred to as 'mm' registers). At state I 1200, a shift count imm[y] 1202, a first operand MM1[x] 1204, and a second operand MM2[x] 1206 are sent with the parallel shift right merge instruction. The count 1202 is an immediate value y bits wide. The first 1204 and second 1206 operands are data blocks including x data segments, each operand having a total width of 8x bits if each data segment is a byte (8 bits). The first 1204 and second 1206 operands are each packed with a number of smaller data segments. For this example, the first data operand MM1 1204 consists of eight equal-length data segments: P 1211, O 1212, N 1213, M 1214, L 1215, K 1216, J 1217, I 1218. Similarly, the second data operand MM2 1206 consists of eight equal-length data segments: H 1221, G 1222, F 1223, E 1224, D 1225, C 1226, B 1227, A 1228. Each of these operands is thus 'x·8' bits wide. So if x is 8, each operand is 8 bytes, or 64 bits, wide. For other embodiments, a data element can be a nibble (4 bits), word (16 bits), double word (32 bits), quadword (64 bits), etc. In alternative embodiments, x can be 16, 32, 64, etc. data elements. The count y equals 8 for this embodiment, and the immediate can be represented as a byte. In alternative embodiments, y can be 4, 16, 32, etc. bits. Furthermore, the count 1202 is not limited to an immediate value and can also be stored in a register or memory location. The operands MM1 1204 and MM2 1206 are merged together at state II 1230 to form a temporary data block TEMP[2x] 1232 that is 2x data elements (or bytes, in this case) wide. The merged data 1232 of this example consists of sixteen data segments arranged as: P, O, N, M, L, K, J, I, H, G, F, E, D, C, B, and A.
An eight-byte-wide window 1234 frames eight data segments of the temporary data block 1232, starting from the least significant byte (right edge). Thus, the right edge of the window 1234 lines up with the right edge of the data block 1232, so that the window 1234 frames the data segments: H, G, F, E, D, C, B, and A. The shift count n 1202 indicates the desired amount by which to right shift the merged data. The count value can specify the shift amount in terms of bits, nibbles, bytes, words, double words, quadwords, etc., or as a particular number of data segments. Based on the count value 1202, the data block 1232 is here shifted right 1236 by n data segments. For this example, n equals 3, and the data block 1232 is slid three bytes to the right. Another way of looking at this is to shift the window 1234 in the opposite direction; in other words, the window 1234 can conceptually be viewed as moving three places to the left from the right edge of the temporary data block 1232. For one embodiment, if the shift count n were greater than the total number of data segments, 2x, present in the combined data block, the resultant would consist of all 0's. Similarly, if the shift count n is greater than or equal to the number of data segments x in the first operand 1204, the resultant includes one or more 0's starting from the left side. At state III 1240, the data segments (K, J, I, H, G, F, E, D) framed by the window 1234 are output as the resultant to a destination DEST[x] 1242, which is x data elements wide. FIG. 12B illustrates the action of a shift right merge instruction in accordance with a second embodiment. The shift right merge instruction at state I 1250 is accompanied by a count imm[y] of y bits, a first data operand MM1[x] of x data segments, and a second data operand MM2[x] of x data segments.
As with the example of FIG. 12A, y equals 8 and x equals 8, so each of MM1 and MM2 is 64 bits, or 8 bytes, wide. The first 1204 and second 1206 operands of this embodiment are packed with a number of equally sized data segments, each one byte wide in this case: "P 1211, O 1212, N 1213, M 1214, L 1215, K 1216, J 1217, I 1218" and "H 1221, G 1222, F 1223, E 1224, D 1225, C 1226, B 1227, A 1228", respectively. At state II 1260, the shift count n 1202 is applied to the first 1204 and second 1206 operands. The count of this embodiment indicates the number of data segments by which to right shift the merged data. For this embodiment, the shifting occurs before the merging of the first 1204 and second 1206 operands. As a result, the first operand 1204 is shifted differently. In this example, the first operand 1204 is shifted left by x minus n data segments. The "x − n" calculation allows for the proper data alignment at the later merging of the data. Thus, for a count n of 3, the first operand 1204 is shifted left by five data segments, or five bytes, with 0's entering from the right side to fill the vacated space. But if the shift count n 1202 is greater than the number of data segments x available in the first operand 1204, the left shift calculation "x − n" can yield a negative number, which in essence indicates a negative left shift. In one embodiment, a logical left shift with a negative count is interpreted as a left shift in the negative direction and, essentially, as a logical right shift. A negative left shift brings in 0's from the left side of the first operand 1204. Similarly, the second operand 1206 is shifted right by the shift count of 3, with 0's entering from the left side to fill the vacated space.
The shifted results of the first 1204 and second 1206 operands are held in registers TEMP1 1266 and TEMP2 1268, each x data segments wide, respectively. The shifted results from TEMP1 1266 and TEMP2 1268 are merged 1272 to produce the desired shift merged data in the destination register DEST 1242 at state III 1270. If the shift count n 1202 is greater than x, the resultant can contain one or more 0's from the left side. Furthermore, if the shift count n 1202 is equal to or greater than 2x, the resultant in the DEST register 1242 will consist of all 0's. In the above examples, such as in FIGS. 12A and 12B, one or both of MM1 and MM2 can be 64-bit data registers in a processor enabled with MMX/SSE technology, or 128-bit data registers with SSE2 technology. Depending on the implementation, these registers can be 64/128/256 bits wide. Similarly, one or both of MM1 and MM2 can be memory locations other than registers. In the processor architecture of one embodiment, MM1 and MM2 are the source operands of a shift right merge instruction (PSRMRG), as described above. The shift count IMM is also an immediate for such a PSRMRG instruction. For one embodiment, the destination for the resultant, DEST, is also an MMX or XMM data register. Furthermore, the DEST register can be the same register as one of the source operands. For instance, in one architecture, a PSRMRG instruction has a first source operand MM1 and a second source operand MM2, and the predefined destination for the resultant can be the register of the first source operand, in this case the register MM1. FIG. 13A is a flow chart illustrating one embodiment of a method for right shifting and merging data operands in parallel. The length value L is generally used here to represent the width of the operands and data blocks.
Depending on the particular embodiment, L can be used to designate the width in terms of data segments, bits, bytes, words, etc. At block 1302, a first data operand of length L is received for use in the execution of a shift merge operation. A second data operand of length L for the shift merge operation is likewise received at block 1304. At block 1306, a shift count is received, indicating a number of data segments or a distance in bits/nibbles/bytes/words/double words/quadwords. At block 1308, execution logic concatenates the first operand and the second operand. For one embodiment, a temporary register of length 2L holds the concatenated data block. In an alternative embodiment, the merged data is held in a memory location. At block 1310, the concatenated data block is shifted right by the shift count. If the count is expressed as a data segment count, the data block is shifted right by that many data segments, with 0's inserted at the left, into the most significant bits of the data block, to fill the vacated space. If the count is expressed in bits or bytes, for example, the data block is similarly shifted right by that distance. At block 1312, a resultant of length L is generated from the right side, or least significant end, of the data block. For one embodiment, the L data segments are multiplexed from the data block to a destination register or memory location. FIG. 13B is a flow chart illustrating another embodiment of a method for right shifting and merging data. A first data operand of length L is received for processing with a shift right and merge operation at block 1352. A second data operand of length L is received at block 1354. At block 1356, a shift count indicates the desired right shift distance. At block 1358, the first data operand is shifted left based on a calculation involving the shift count.
The calculation of one embodiment involves subtracting the shift count from L. For instance, if the operand length L and the shift count are in terms of data segments, the first operand is shifted left by "L − shift count" segments, with 0's entering at the least significant end of the operand. Similarly, if L is expressed in bits and the count in bytes, the first operand is shifted left by "L − shift count·8" bits. At block 1360, the second data operand is shifted right by the shift count, with 0's entering at the most significant end of the second operand to fill the vacated space. At block 1362, the shifted first operand and the shifted second operand are merged together to generate a resultant of length L. For one embodiment, the merging yields a result comprising the desired data segments from both the first and second operands. One increasingly popular use of computers involves the processing of extremely large video and audio files. Even though these video and audio files are typically transferred over networks of very high bandwidth or on high-capacity storage media, data compression is still necessary to handle the traffic. As a result, various compression algorithms have become important elements of the representation or coding schemes for many popular audio, video, and image formats. Video in accordance with one of the MPEG standards is one application that uses compression. MPEG video is broken up into a hierarchy of layers to help with error handling, random searching and editing, and synchronization. For illustration purposes, the layers that constitute one MPEG video are briefly described. At the top level is the video sequence layer, an independent, self-contained bitstream. The second layer down is the group of pictures, composed of one or more groups of intra and/or non-intra frames.
The third layer down is the picture layer itself, and the next layer beneath it is the slice layer. Each slice is a contiguous sequence of ordered macroblocks in the picture, most often on a row basis in typical video applications, but not limited to this. Each slice consists of macroblocks, which are 16x16 arrays of luminance pixels, or picture elements, with two 8x8 arrays of associated chrominance pixels. The macroblocks can be further divided into distinct 8x8 blocks for further processing, such as transform coding. The macroblock is the fundamental unit for motion compensation and motion estimation, and can have motion vectors associated with it. Depending on the embodiment, macroblocks can be 16 rows by 16 columns or a variety of other sizes. One temporal prediction technique used in MPEG video is based on motion estimation. Motion estimation rests on the premise that consecutive video frames will generally be similar except for changes caused by objects moving within the frames. If there is zero motion between frames, an encoder can easily and efficiently predict the current frame as a duplicate of the previous, or prediction, frame. The previous frame may also be called the reference frame. In another embodiment, the reference frame can be the next frame or even some other frame in the sequence. Embodiments of motion estimation are not required to compare the current frame with a previous frame; any other frame can be used in the comparison. The encoder then transmits the syntactic overhead information necessary to reconstruct the picture from the original reference frame. But when there is motion between the pictures, the situation is more complex. The differences between the best matching macroblock and the current macroblock would ideally be a set of mostly zero values.
When encoding a macroblock, the differences between the best matching macroblock and the current macroblock are transformed and quantized. For one embodiment, the quantized values are passed to a variable length coding means for compression. Because 0's compress very well, a best match with many zero-valued differences is desirable. Motion vectors can also be derived from the difference values. FIG. 14A illustrates a first example of motion estimation. The left frame 1402 is a sample of a previous video frame, including a stick figure and a sign post. The right frame 1404 is a sample of the current video frame, likewise including a stick figure and a sign post. In the current frame 1404, panning has resulted in the sign post moving to the right and downward from its original position in the previous frame 1402. The stick figure, now with raised arms, has also moved down and to the right of center relative to the previous frame 1402. Motion estimation algorithms can be used to adequately represent the changes between the two frames 1402, 1404. For one embodiment, the motion estimation algorithm performs a comprehensive two-dimensional (2D) spatial search for each luminance macroblock. Depending on the implementation, motion estimation may not be applied directly to the chrominance in MPEG video, as the color motion may be adequately represented by the same motion information as the luminance. Many different ways of implementing motion estimation are possible, and the particular motion estimation scheme depends to some extent on the complexity-versus-quality tradeoff for the specific application. A full, exhaustive search over a wide two-dimensional (2D) area generally yields the best matching results. However, this quality comes at an extreme computational cost, as motion estimation is often the most computationally expensive portion of video encoding.
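The matching metric implied above, and used by the pseudo-code later in this description, is the sum of absolute differences (SAD) over the pixels being compared. A sketch for one row of byte-wide pixels (the helper name is illustrative, not from the patent):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences (SAD) over one row of eight byte-wide
   pixels. A good candidate macroblock yields mostly zero differences,
   which transform and compress well. */
static unsigned row_sad(const uint8_t cur[8], const uint8_t prev[8]) {
    unsigned sad = 0;
    for (int i = 0; i < 8; i++)
        sad += (unsigned)abs((int)cur[i] - (int)prev[i]);
    return sad;
}
```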
Attempts to lower the cost by limiting the pixel search range or the type of search can come at the expense of some video quality. FIG. 14B shows an example of a macroblock search. Frames 1410, 1420 each include various macroblocks. The target macroblock 1430 of the current frame is the current macroblock to be matched against previous macroblocks from the previous frames 1410, 1420. In the first frame 1410, a bad match macroblock 1412 contains a portion of the sign post and is a poor match with the current macroblock. In the second frame 1420, a good match macroblock 1422 contains bits of the sign post and the head of the stick figure, just as in the current macroblock 1430 that is being coded. The two macroblocks 1422, 1430 have a good deal of commonality, and only a slight error is visible. Because a relatively good match is found, the encoder assigns motion vectors to the macroblock; these vectors indicate how far the macroblock has to be moved horizontally and vertically to obtain the match. FIG. 15 shows an example application of motion estimation and the resulting prediction in generating a second frame. The previous frame 1510 comes before the current frame 1520 in time. For this example, the current frame 1520 is subtracted from the previous frame 1510 to obtain a less complicated residual error picture 1530 that can be encoded and transmitted. The previous frame 1510 of this example comprises a sign post 1511 and a stick figure 1513. The current frame 1520 comprises a sign post 1521 and two stick figures 1522, 1523 on a board 1524. The more accurately the motion is estimated and matched, the more likely it is that the residual error will approach zero, leading to high coding efficiency. Macroblock prediction can help reduce the search window size. Coding efficiency can be achieved by taking advantage of the fact that motion vectors tend to be highly correlated between macroblocks.
Thus, the horizontal component can be compared with the previously valid horizontal motion vector and the difference coded. Similarly, a difference for the vertical component can be calculated before coding. For this example, subtracting the current frame 1520 from the previous frame 1510 yields a residual picture 1530 that includes the second stick figure 1532 with raised arms and the board 1534. This residual picture 1530 is compressed and transmitted. Ideally, the residual picture 1530 is less complex to code and takes less memory than compressing and transmitting the current frame 1520 itself. However, not every macroblock search results in an acceptable match. If the encoder determines that no acceptable match exists, that particular macroblock can be encoded by itself. FIGS. 16A-B show examples of a current frame 1601 and a previous frame 1650 that are processed during motion estimation. The previous frame 1650 precedes the current frame 1601 in chronological order in the video frame sequence. Each frame is composed of a very large number of pixels that extend across the frame in horizontal and vertical directions. The current frame 1601 comprises a number of macroblocks 1610, 1621-1627 arranged horizontally and vertically. For this embodiment, the current frame 1601 is divided into equally sized, non-overlapping macroblocks 1610, 1621-1627. Each of these square macroblocks is further subdivided into an equal number of rows and columns. For the sample macroblock 1610, a matrix of eight rows and eight columns is shown. Each square of the macroblock 1610 corresponds to a single pixel, so this sample macroblock 1610 includes 64 pixels. In other embodiments, macroblocks are sixteen rows by sixteen columns (16x16) in size. For one embodiment, the data for each pixel consists of eight bits, or a single byte.
In alternative embodiments, the pixel data can be of other sizes, including nibbles, words, double words, quadwords, etc. These current macroblocks of the current frame are to be matched against macroblocks in the previous frame 1650 for motion estimation. For this embodiment, the previous frame 1650 includes a search window 1651, which encompasses a portion of the frame. The search window 1651 contains the area in which a match for a current macroblock from the current frame 1601 is sought. Like the current frame, the search window is divided into a number of equally sized macroblocks. A sample macroblock 1660 having eight rows and eight columns is illustrated, but macroblocks can be of various other sizes, including sixteen rows by sixteen columns. During the motion estimation algorithm of one embodiment, each individual macroblock within the search window 1651 is compared in turn with the current macroblock of the current frame in order to find an acceptable match. For one embodiment, the upper left corner of the first previous macroblock in the search window 1651 is aligned with the upper left corner of the search window 1651. During one motion estimation algorithm, the direction of macroblock processing proceeds from the left side of the search window toward the right edge, pixel by pixel. Thus, the leftmost edge of the second macroblock is one pixel over from the left edge of the search window, and so on. At the end of the first pixel row, the algorithm returns to the left edge of the search window and proceeds from the first pixel of the next row. This process repeats until the macroblocks for each of the pixel positions in the search window 1651 have been compared against the current macroblock. FIGS. 17A-D illustrate the operation of motion estimation on frames in accordance with one embodiment of the present invention.
The embodiments of the present invention discussed here employ full search motion estimation algorithms. With a full search, macroblocks at all pixel positions in the search window of the previous (reference) frame are attempted matches with a macroblock from the current frame. For one embodiment, the fast full search motion estimation algorithm employs SIMD shift right merge operations to quickly process packed data from the frames. The SIMD shift right merge operations of one embodiment can also improve processor performance by reducing the number of data loads, especially unaligned memory loads, and other data manipulation instructions. Generally, the motion estimation procedure of one embodiment can be described in pseudo-code as:

for each current block in both the x and y directions {
    for all pixel positions, stepping by 1 along the Y axis of the search window {
        for all pixel positions, stepping by 4 along the X axis of the search window {
            load pixel data from memory into registers;
            attempt a block match against 4 adjacent previous macroblocks;
            keep track of the minimum value and the location index for that previous macroblock;
        }
    }
}

wherein the block match operation is defined as:

for each row from 1 to m {
    for each of the macroblocks, starting from column 1 through 4 {
        generate the correct data for previous[row] from the data held in registers;
        difference[row] += sum of absolute differences(current[row], previous[row]);
    }
}

Thus, for this embodiment, the previous macroblock at each pixel position in the search window is compared against the current macroblock. As indicated above, this embodiment evaluates four adjacent previous macroblocks per loop iteration. The pixel data is loaded from memory into registers with aligned memory loads.
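Stripped of the four-at-a-time SIMD reuse, the exhaustive search above can be sketched as a scalar C skeleton (all names and the small sizes are illustrative, not from the patent):

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between an 8x8 current macroblock and the 8x8 block at one
   candidate position; both are addressed with the same row stride. */
static unsigned block_sad(const uint8_t *cur, const uint8_t *prev, int stride) {
    unsigned sad = 0;
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            sad += (unsigned)abs((int)cur[r * stride + c] -
                                 (int)prev[r * stride + c]);
    return sad;
}

/* Scans a w-by-h search window left to right, top to bottom, one pixel
   at a time, recording the (x, y) offset of the minimum-SAD match. */
static void full_search(const uint8_t *cur, const uint8_t *win, int stride,
                        int w, int h, int *best_x, int *best_y) {
    unsigned best = ~0u;
    for (int y = 0; y + 8 <= h; y++)
        for (int x = 0; x + 8 <= w; x++) {
            unsigned sad = block_sad(cur, win + y * stride + x, stride);
            if (sad < best) { best = sad; *best_x = x; *best_y = y; }
        }
}
```

A real encoder would vectorize the inner comparison, which is what the shift right merge operations described in this document enable.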
Using right-shift merge operations, the pixel data can be manipulated to form different combinations of shifted data segments corresponding to the adjacent macroblocks. For example, the first, second, third and fourth pixels of the first row of the first previous macroblock can begin at memory addresses 0, 1, 2 and 3, respectively. The first pixel of the first row of the second previous macroblock then begins at memory address 1. Thus a right-shift merge on register data can produce the needed row of pixel data for the second previous macroblock by reusing data already loaded from memory for the first previous macroblock, saving time and resources. Similar shift-merge operations can generate the row data for the other adjacent previous macroblocks, for example the third, fourth, and so on. Thus the block comparison procedure of the motion estimation algorithm of one embodiment can be described in pseudocode as follows:

block comparison for four adjacent previous macroblocks {
    for each row from 1 to m {
        load the pixel data for one row of the current macroblock;
        perform two aligned loads from memory into registers of two consecutive "chunks" of pixel data for one row of the search window;
        generate the correct row of pixel data for each of the four adjacent previous macroblocks from the loaded data via right-shift merge operations;
        calculate the sum of absolute differences between the row of the previous macroblock and the corresponding row of the current macroblock, for each of the four adjacent previous macroblocks;
        accumulate four separate sums of absolute differences, one for each of the four adjacent previous macroblocks;
    }
}

This procedure is described further below.
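The right-shift merge that produces each adjacent macroblock's row from already loaded data can be modeled over element lists (a sketch, with the lower-addressed section listed first; real implementations operate on packed registers):

```python
def merge_shift_right(lo_section, hi_section, count):
    """Model of a right-shift merge over two register-held sections:
    lo_section holds the lower-addressed L pixels, hi_section the next L.
    A shift count of 'count' yields the row starting 'count' pixels in,
    reusing data already loaded for the first macroblock."""
    assert len(lo_section) == len(hi_section)
    assert 0 <= count <= len(lo_section)
    return lo_section[count:] + hi_section[:count]
```

With pixels starting at addresses 0, 1, 2, ..., a count of one produces the row of the macroblock that begins one address further on, with no additional load.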
Although these examples are described in terms of operations on four adjacent macroblocks in the search window, alternative embodiments of the present invention are not so limited. Moreover, embodiments of the present invention are not limited to processing adjacent macroblocks. The multiple reference macroblocks processed together do not necessarily have to differ by a distance of one pixel. In one embodiment, any reference macroblock whose pixel location falls within a 16 by 16 window around a given pixel location can be processed together. Depending on the amount of hardware resources, such as available data registers and execution units, other embodiments can perform block comparison and calculation of sums of absolute differences over a larger or smaller number of macroblocks. For example, another embodiment using at least 8 packed-data registers to store 4 different combinations of pixel data, generated by right-shift merge operations on two data "chunks" of 8 data segments each, could process 4 adjacent previous macroblocks using just two aligned loads from memory of 8 data segments each. Four of the 8 packed-data registers are used for working data: they hold the first 8 data segments of the previous frame, the next 8 data segments of the previous frame, 8 data segments of the current frame, and 8 data segments resulting from the right-shift merge operation. The other four packed-data registers are used to accumulate running totals of the sums of absolute differences (SAD) for each of the four macroblocks. More packed-data registers can be added for SAD calculation and accumulation in order to increase the number of reference macroblocks processed together. Thus, if four additional packed-data registers are available, four additional previous macroblocks may also be processed.
The number of packed-data registers available to store the accumulated sums of absolute differences may, in one embodiment, limit the number of macroblocks processed simultaneously. In addition, in some processor architectures memory accesses have a specific granularity and are aligned on certain boundaries. For example, one processor may fetch from memory in 16- or 32-byte blocks. In that case, accessing data not aligned on a 16- or 32-byte boundary may require an unaligned memory access, which is costly in time and resources. Worse still is the case where the desired piece of data crosses a boundary and overlaps multiple memory blocks. Cache line splits, which require unaligned loads to access data located on two separate cache lines, can be costly. The situation is compounded when the data cross a memory page boundary. For example, consider a process that works with 8-byte memory blocks and a macroblock spanning 8 pixels, with one byte of data per pixel: one aligned load from memory would suffice for that macroblock row. But for the next adjacent macroblock, located one pixel column further on, the data needed for that row of pixels would comprise 7 bytes from the first macroblock's memory block, and would also cross the memory boundary to take 1 byte of data from the next memory block. Embodiments of the present invention use a right-shift merge operation for efficient data processing. In one embodiment, two consecutive memory blocks are loaded at aligned memory boundaries and held in registers for repeated use. When right-shift merge operations are performed, these memory blocks can be used and their data segments shifted the required distance to obtain the correct row of data.
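The two-aligned-loads scheme described here can be illustrated with a byte-addressed memory model (an assumed 8-byte access granularity, matching the example in the text):

```python
def aligned_load(memory, addr, width=8):
    """Load one block; raises if addr is not block-aligned, standing in
    for the cost of an unaligned access."""
    if addr % width != 0:
        raise ValueError("unaligned access")
    return memory[addr:addr + width]

def load_row(memory, addr, width=8):
    """Fetch 'width' bytes starting at an arbitrary addr using only two
    aligned loads plus a right-shift merge of the cached blocks."""
    base = addr - (addr % width)
    lo = aligned_load(memory, base, width)
    hi = aligned_load(memory, base + width, width)
    m = addr % width
    return lo[m:] + hi[:m]      # right-shift merge by m elements
```

Once `lo` and `hi` are held in registers, rows for all the adjacent macroblocks covered by those two blocks come from merges alone, with no further memory traffic.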
So in this example, a right-shift merge can take the two already loaded memory blocks, shift one byte of data out of the second block and shift in one byte of data from the first block, to generate the data for the first row of the second macroblock without requiring an unaligned load. Embodiments of the motion estimation can also break dependency chains, depending on how the algorithm is implemented. For example, by changing the order of calculations, data/instruction dependencies can be removed or rearranged so that some calculations and instructions can be executed out of order, as in the processor 1000 shown earlier. Performance improvements can be even greater in newer generations of processor architectures, due to deeper pipelines and increased available computing resources. By using an embodiment of the right-shift merge instruction, certain dependencies in the block comparison sequence can be avoided. For example, multiple sum-of-absolute-differences operations and/or accumulation operations can be executed in parallel. Fig. A illustrates the traversal of the current macroblocks of the current frame 1701. In this embodiment, each current macroblock 1710 is divided into 16 rows and 16 columns and thus contains 256 individual pixels. In this embodiment, the pixels of each macroblock 1710 are processed one row 1711 at a time. When all sixteen rows of the current block have been processed against the desired macroblocks of the search window, the next current macroblock is processed. The macroblocks of this embodiment are traversed in the horizontal direction 1720 from the left side to the right side of the current frame 1701, advancing by the size of a macroblock each time. In other words, the current macroblocks do not overlap in this embodiment, and the current macroblocks are arranged so that each macroblock is adjacent to the next.
For example, the first macroblock can extend from pixel column 1 to pixel column 16. The second macroblock extends from column 17 to column 32, and so on. At the end of a row of macroblocks, the process returns 1722 to the left side and moves down by one macroblock height, sixteen rows in this example. The macroblocks one macroblock size lower are then processed horizontally 1724 from left to right, until comparisons have been made for the entire frame 1701. Fig. B illustrates the traversal of the macroblocks in the search window 1751 of the previous (reference) frame. Depending on the specific implementation, the search window 1751 may be focused on a particular area and thus may be smaller than the previous frame. In another embodiment, the search window may completely overlap the previous frame. Like the current blocks, each previous macroblock 1760, 1765, 1770, 1775 is divided into 16 rows and 16 columns, for a total of 256 pixels per macroblock. In this embodiment of the present invention, four previous macroblocks 1760, 1765, 1770, 1775 of the search window 1751 are processed in parallel against a single current block in the search for a match. Unlike the current macroblocks of the current frame, the previous macroblocks 1760, 1765, 1770, 1775 in the search window 1751 can overlap, and do overlap in this example. Here, each previous macroblock is shifted by one pixel column. Thus the leftmost pixel of the first row of macroblock BLK 1 is pixel 1761; for macroblock BLK 2 it is pixel 1766, for macroblock BLK 3 pixel 1771, and for macroblock BLK 4 pixel 1776. During execution of the motion estimation algorithm, each row of a previous macroblock 1760, 1765, 1770, 1775 is compared with the corresponding row of the current block. For example, row 1 of each of macroblocks BLK 1 1760, BLK 2 1765, BLK 3 1770 and BLK 4 1775 is processed against row 1 of the current block.
The row-by-row comparison of the four overlapping, adjacent macroblocks continues until all 16 rows of the macroblocks have been processed. To handle the next four macroblocks, the algorithm of this embodiment shifts over by four pixel columns. Thus, for this example, the leftmost first pixel columns of the next four macroblocks are pixel 1796, pixel 1797, pixel 1798 and pixel 1799, respectively. In this embodiment, processing of previous macroblocks continues rightward 1780 in the search window 1751, wrapping around 1782 to resume one pixel row lower, at the leftmost pixel of the search window 1751, until the search window is completed. Whereas the current macroblocks of the current frame in this embodiment do not overlap, and successive individual macroblocks are one macroblock height or width apart, the previous macroblocks of the previous or reference frame do overlap, and successive macroblocks advance by one row or column of pixels. Although the four reference macroblocks 1760, 1765, 1770, 1775 of this example are adjacent and differ by one pixel column, any macroblock in the search window 1751 that covers a specific area relative to a chosen pixel location can be processed together with the macroblock at that pixel location. For example, consider the processing of macroblock 1760 with pixel 1796. Any macroblock within a 16x16 window relative to pixel 1796 may be processed together with macroblock 1760. The 16x16 window in this example follows from the size of the macroblock and the length of a row. In this case, one data row holds 16 data elements.
Because the block comparison function of this embodiment of the motion estimation algorithm can load two rows of data of 16 data elements each and perform right-shift merges to obtain various rows of data that are shifted/merged versions of those two rows, other macroblocks that overlap the 16x16 window for which the data were loaded can at least partially reuse the loaded data. Thus any macroblock that overlaps macroblock 1760, for example macroblocks 1765, 1770, 1775, or a macroblock starting at the lower right pixel position of macroblock 1760, can be processed together with macroblock 1760. The degree of overlap determines how much previously loaded data can be reused. In embodiments of motion estimation in accordance with the present invention, the macroblock analysis performs a comparison between a previous (reference) macroblock and the current macroblock on a row-by-row basis to obtain a sum of absolute differences between the two macroblocks. The sum of absolute differences indicates how different the macroblocks are and how closely they match. Each previous macroblock of one embodiment can be represented by a value obtained by accumulating the sums of absolute differences over all sixteen rows of the macroblock. For the current macroblock under analysis, a record of the closest-matching macroblock is maintained. For example, the minimum accumulated sum of absolute differences and the location index of the corresponding previous macroblock are tracked. As the motion estimation proceeds through the search window, the accumulated sum of each previous macroblock is compared with this minimum value.
If a later previous macroblock has a smaller accumulated difference value than the tracked minimum value, indicating a closer match than the existing closest match, then the accumulated difference value and the index information of that later previous macroblock become the new minimum difference value and index. When the candidate macroblocks for all pixel positions in the search window have been processed, the indexed macroblock with the minimum difference value can, in one embodiment, be used to obtain a residual image for compression of the current frame. Fig. C shows the parallel processing of four reference macroblocks 1810, 1815, 1820, 1825 for a given search window against a current block 1840 in one embodiment of the present invention. In this example, the data for the pixels of the search window are ordered as "A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P" 1860, where "A" is at the lowest address (0) in the data set and "P" at the highest address (15). This set of pixels 1860 consists of two sections 1861, 1862, each of which has eight (m) data segments. Using right-shift merge operations as described above, embodiments of the present invention can manipulate operands over these two data sections 1861, 1862 and generate properly aligned rows of data 1830 for the various previous macroblocks 1810, 1815, 1820, 1825. All the macroblocks, the previous ones 1810, 1815, 1820, 1825 and the current one 1840, have a size of m rows by m columns. For ease of discussion, in this example the value of m is eight. Alternative embodiments may use macroblocks of other sizes, in which, for example, the value of m is 4, 16, 32, 64, 128, 256, etc. In this example, the motion estimation algorithm is applied to the first rows of the four previous blocks 1810, 1815, 1820, 1825 against the first row of the current block 1840.
In one embodiment, the pixel data comprising these two data sections 1861, 1862, with a total length of two macroblock rows (2m), are loaded from memory using two aligned load operations and held in temporary registers. Right-shift merge operations on these two data sections 1861, 1862 can generate the nine possible combinations of row data 1830 without numerous memory accesses. Furthermore, unaligned loads from memory, which are costly in time and resources, can be avoided. In this example, the two data sections 1861, 1862 are aligned on memory boundaries. Loads from memory that do not begin at such an aligned address, for example at data segment B, C or D, would in the typical case require an unaligned load operation. The row data 1830 for each of the blocks are as described below, with the leftmost data segment having the lowest address. For block 1 1810, row 1 1811 contains "A, B, C, D, E, F, G, H". As the data of row 1 1811 are the same as the first data section 1861, no shift is needed. But row 1 1816 of block 2 1815 contains "B, C, D, E, F, G, H, I". Because previous block 1 1810 and block 2 1815 are separated by one pixel horizontally, block 2 1815 begins with pixel data B, whereas block 1 1810 begins with pixel data A and has B as its second pixel. Thus a right-shift merge of the two data sections 1861, 1862 with a shift count of one yields row 1 of block 2. Similarly, block 3 1820 is one pixel further to the right, and row 1 1821 of block 3 1820 begins with pixel data C and contains "C, D, E, F, G, H, I, J". A right-shift merge operation on the operands of the two data sections 1861, 1862 with a shift count of two forms row 1 of block 3. Row 1 1826 of block 4 1825 consists of "D, E, F, G, H, I, J, K".
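The nine row combinations obtainable from the two loaded sections can be enumerated directly (a sketch over the "A".."P" example; shift counts 0 through 8):

```python
def row_combinations(section_1861, section_1862):
    """Enumerate the nine rows (shift counts 0..8) that right-shift
    merges can produce from two loaded 8-element sections, with no
    further memory accesses."""
    m = len(section_1861)
    block = section_1861 + section_1862
    return [block[count:count + m] for count in range(m + 1)]
```

Each count corresponds to one candidate reference macroblock one pixel further to the right, so one pair of aligned loads serves a whole run of adjacent blocks.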
These data can be formed using a right-shift merge with a shift count of three on the same data operands. Thus using right-shift merge operations on the temporarily stored, previously loaded data sections 1861, 1862 makes it possible to reuse data when generating the rows for the other adjacent macroblocks, saving time and resources by reducing the number of loads from memory, especially unaligned loads. Note that the pixel data of the current block are the same for all the sum-of-absolute-differences comparisons against the reference macroblocks of the previous frame. A single aligned load from memory may suffice for the row data 1842 of the current block 1840, since the current block 1840 may be aligned on memory boundaries. Continuing the motion estimation example of one embodiment, each row of the previous macroblocks 1810, 1815, 1820, 1825 is compared with the corresponding row of the current block 1840 to obtain a sum-of-absolute-differences value. Thus row 1 1811 of block 1 1810 is compared with row 1 1841 of the current block 1840 by a sum-of-absolute-differences (SAD) operation 1850. The same happens with the other three blocks being processed. Although the four macroblocks 1810, 1815, 1820, 1825 appear to be processed simultaneously or in parallel, other embodiments of the present invention are not so limited. Thus the operations on these four macroblocks may occur sequentially in time, as a sequence of four operations. For example, row 1 of each reference block undergoes the SAD operation 1850 with the row of the current block 1840 in the order: block 1 1810, block 2 1815, block 3 1820 and block 4 1825. Then row 2 of each reference block undergoes the SAD operation 1850, and so on. After each SAD operation 1850, the running totals of the sums of absolute differences are accumulated in temporary registers.
Thus, in this exemplary embodiment, four registers accumulate the sums of absolute differences until all m rows of the macroblock have been processed. The accumulated value for each block is compared with the current minimum difference value as part of the search for the best matching macroblock. Although this example describes processing four adjacent, overlapping previous macroblocks, other macroblocks that overlap the first block BLK 1810 in the search window can also be processed together with the data loads for BLK 1810, if the relevant row data are available. Thus macroblocks within a 16x16-pixel window around the macroblock currently being processed may also be processed. Fig. 22D shows the sum-of-absolute-differences (SAD) operations 1940 and the summation of the SAD values. Here each row, from row A to row P, of the reference macroblock BLOCK 1 1900 and the corresponding row of the current macroblock 1920 undergo a SAD operation 1940. The SAD operation 1940 compares the data representing the pixels of each row and calculates a value representing the absolute difference between the two rows, one from the previous macroblock 1900 and one from the current macroblock 1920. The values of these SAD operations 1940 for all rows from A to P are added together into a block sum 1942. This block sum 1942 provides the accumulated sum of absolute differences for the whole of the previous macroblock 1900 against the current macroblock 1920. Based on this block sum 1942, the motion estimation algorithm can determine how similar, or how close a match, the previous macroblock 1900 is to the given current macroblock 1920. Although this embodiment handles four reference macroblocks at a time, alternative embodiments may handle a different number of macroblocks, depending on the amount of pixel data loaded and the number of available registers.
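The per-row SAD (operation 1940) and the block sum (1942) can be modeled as follows (a scalar sketch of the packed operation):

```python
def sad(row_prev, row_cur):
    """SAD over one pair of pixel rows (cf. operation 1940)."""
    return sum(abs(p - c) for p, c in zip(row_prev, row_cur))

def block_sad(prev_block, cur_block):
    """Accumulate the row SADs over all rows of a macroblock pair,
    giving the block sum (cf. 1942) used to rank candidate matches."""
    return sum(sad(p, c) for p, c in zip(prev_block, cur_block))
```

A block sum of zero indicates an exact match; the smallest block sum over the search window identifies the closest match.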
Furthermore, a variety of registers can be used for motion estimation. For example, extended registers, such as the mm registers of MMX technology or the XMM registers of SSE2 technology, can be used to hold packed data such as pixel data. In one embodiment, a 64-bit MMX register can hold eight bytes, or eight individual pixels, if each pixel has eight bits of data. In another embodiment, a 128-bit XMM register can hold sixteen bytes, or sixteen individual pixels, if each pixel has eight bits of data. Similarly, registers of other sizes, such as 32/128/256/512 bits, that can hold packed data can also be used in embodiments of the present invention. Conversely, calculations that do not require packed-data registers, such as integer operations, can use integer registers and integer hardware. Fig. A presents a flowchart showing one embodiment of a method of motion prediction and estimation. At step 2002, the tracked minimum (min) value and the location index of that minimum value are initialized. In this embodiment, the tracked minimum value and index indicate which of the previous reference macroblocks of the search window is the closest match to the current macroblock. At step 2004 it is checked whether all desired macroblocks of the current frame have been processed. If so, this part of the motion estimation algorithm is complete. If not all desired current macroblocks have been processed, then at step 2006 an unprocessed current macroblock of the current frame is selected. At step 2008, block comparison continues from the first pixel location of the search window of the previous (reference) frame. At step 2010 it is checked whether processing of the search window is complete. On the first pass, the search window will not yet have been processed.
On subsequent passes, if the entire search window has been processed, the flow returns to step 2004 to determine whether there are other current macroblocks. If the entire search window has not yet been analyzed, then at step 2012 it is determined whether all pixels of the row along the X axis have been processed. If the row has been processed, the row counter is advanced to the next row and the procedure returns to step 2010 to check whether there are more macroblocks on this new row of the search window. But if not all macroblock pixel positions of the row have been processed, then at step 2014 it is checked whether the macroblock at this pixel row and column has been processed. If that macroblock has been processed, the column counter is incremented and the procedure returns to step 2012 to check whether the macroblock at the pixel in the new column has been processed. But if the macroblock at this pixel row and column has not been processed, then block comparison is performed between the reference macroblock and the current macroblock. For simplicity, the flow of this example is described in increments of one pixel at a time for the pixel rows and columns along the X and Y axes. However, in one embodiment of the present invention, four previous macroblocks are processed per pass, so the column counter is incremented by four columns per pass. Other embodiments can also process 8, 16, 32, etc. macroblocks at a time, with the column counter correspondingly incremented by 8, 16, 32, etc. columns to indicate the correct pixel position for the next iteration of the algorithm. Although the block comparison process of this embodiment searches in an ordered fashion along the X and Y axes, the block comparison of alternative embodiments may use another algorithm, such as a diamond search, which uses a different pattern, or a logarithmic search.
Fig. B presents a flowchart further describing the block comparison of Fig. A. At step 2222, the data for the reference macroblock and the current macroblock are loaded. In one embodiment, the reference macroblock data are loaded as two "chunks" of packed data, which include the data for a number of consecutive pixels. In one embodiment, each "chunk" of packed data contains eight data elements. At step 2224, where necessary to obtain the correct portion of data, right-shift merge operations are performed on the "chunks" of data. For the embodiment described above, in which four previous macroblocks are processed together, right-shift merge operations can be performed on the data "chunks" that correspond to the rows located in each macroblock. The data "chunk" for each adjacent macroblock, located one pixel position over, is likewise shifted by one position, so that the macroblocks appear to slide across the search window by one pixel at a time for each row of pixels in the search window. At steps 2226, 2228, 2230 and 2232, the operations are applied to each of the four previous macroblocks being processed together. In one embodiment, all four macroblocks undergo the same operation before the next operation takes place. In an alternative embodiment, all operations on one previous macroblock can be completed before the next previous macroblock, with its "chunk" of data containing the appropriately shifted data segments, is processed. At step 2226, the sum of absolute differences between the corresponding rows of the previous macroblock and the current macroblock is calculated for each row of the macroblocks. At step 2228, the sums of absolute differences for all rows of the previous macroblock are accumulated (cumulatively summed).
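Steps 2222 through 2228 for four adjacent reference macroblocks can be sketched as follows (hypothetical layout: search-window rows as plain lists, with 'col' marking an aligned section boundary):

```python
def compare_four_blocks(window_rows, col, current_block, m=8):
    """For each of the m rows: take two consecutive m-element sections
    from the search window (step 2222), derive the four shifted reference
    rows by right-shift merges with counts 0..3 (step 2224), compute the
    row SADs (step 2226), and accumulate per-block totals (step 2228)."""
    totals = [0, 0, 0, 0]
    for i in range(m):
        sections = window_rows[i][col:col + 2 * m]   # two aligned "chunks"
        for count in range(4):
            ref_row = sections[count:count + m]      # right-shift merge
            totals[count] += sum(abs(a - b)
                                 for a, b in zip(ref_row, current_block[i]))
    return totals
```

The four returned totals are the accumulated SADs for the four candidate macroblocks offset by 0, 1, 2 and 3 pixel columns from 'col'.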
At step 2230, the accumulated difference value for the previous macroblock is compared with the current minimum value. If at step 2232 it is determined that the difference value for this previous macroblock is less than the current minimum value, the minimum value is updated with the new difference value. The index is updated to reflect the location of this previous macroblock, indicating that it is now the closest match. But if at step 2232 it is determined that the new difference value is greater than the current minimum value, then this previous block is not a closer match than those already compared. Embodiments of motion estimation algorithms in accordance with the present invention can also improve processor and system performance with existing hardware resources. But as technology continues to improve, embodiments of the present invention, combined with greater hardware resources and faster, more efficient logic circuits, can have an even more profound impact on performance. Thus one efficient embodiment of motion estimation can have a different, greater impact than the advent of a new processor generation alone. Simply adding more resources to a modern processor architecture does not guarantee better performance. By also maintaining the efficiency of the application, as in one embodiment of motion estimation using the right-shift merge instruction (PSRMRG), larger performance gains are possible. Although the above examples are generally described in the context of 64-bit hardware/registers/operands to simplify the discussion, other embodiments use 128-bit hardware/registers/operands to perform register merge operations, right-shift merge operations and motion estimation calculations.
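The minimum-tracking update of steps 2230 and 2232 reduces to the following (illustrative names):

```python
def update_minimum(min_sad, min_index, new_sad, new_index):
    """Steps 2230-2232: keep whichever accumulated SAD is smaller,
    together with the location index of its reference macroblock."""
    if new_sad < min_sad:
        return new_sad, new_index
    return min_sad, min_index
```

Applied after every block comparison, this leaves the minimum value and index pointing at the closest-matching reference macroblock once the search window is exhausted.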
Furthermore, embodiments of the present invention are not limited to specific hardware or technology types, such as MMX/SSE/SSE2 technologies, and can be used with other SIMD implementations and other graphics data processing technologies. Although the motion estimation and block comparison embodiments described above through Fig. 23B are described in the context of rows of eight pixels or eight data elements and macroblocks of eight rows by eight columns, other embodiments use other dimensions. For example, rows can have a length of sixteen pixels or sixteen data elements, and macroblocks can have sixteen rows and sixteen columns. In the foregoing description, the invention has been described with reference to certain exemplary embodiments; it will be evident, however, that various modifications and changes may be made without departing from the broader scope and spirit of the invention as defined in the claims. Accordingly, the description and drawings are to be regarded in an illustrative rather than a restrictive sense.

1. A method of parallel data merge with a right shift, comprising the steps of: receiving a right-shift merge instruction that contains a shift count of M; shifting a first operand from a first source register specified by the right-shift merge instruction, the first operand having a first set of L data elements, to the left by 'L - M' data elements; in parallel with the shifting of the first operand, shifting a second operand from a second data storage location specified by the right-shift merge instruction, the second operand having a second set of L data elements, to the right by M data elements; and merging said shifted first set with said shifted second set to obtain a result having L data elements.
2. The method of claim 1, wherein said shifting of the first operand forms said shifted first set containing M data elements aligned with the left edge of the first operand.
3. The method of claim 2, wherein said left shift removes 'L - M' data elements of the first operand and enters zeros from the right edge of the first operand to fill the positions vacated by the 'L - M' data elements shifted out.
4. The method of claim 3, wherein said shifting of the second operand forms said shifted second set containing 'L - M' data elements aligned with the right edge of the second operand.
5. The method of claim 4, wherein said right shift removes M data elements of the second operand, and zeros are entered from the left edge of the second operand to fill the positions vacated by the M data elements shifted out.
6. The method of claim 5, wherein said merging comprises performing a logical OR operation on said shifted first set and said shifted second set.
7. The method of claim 6, wherein said result consists of M data elements of said first set and 'L - M' data elements of said second set, the M data elements of the first set not intersecting the 'L - M' data elements of the second set.
8. The method of claim 7, wherein the first operand, the second operand and the result are packed data operands.
9. The method of claim 8, wherein each data element is one byte of data.
10. The method of claim 9, wherein the value of L is 8.
11. The method of claim 10, wherein M has a value ranging from 0 to 15.
12. The method of claim 9, wherein the value of L is 16.
13. The method of claim 12, wherein M has a value ranging from 0 to 31.
14.
14. A method of parallel combining of data with a shift to the right, comprising the steps of: receiving a right-shift merge instruction that specifies a shift count, a first data operand from a first source register, the first data operand including a first set of data elements, and a second data operand from a second data storage location, the second data operand including a second set of data elements; shifting the first set of data elements to the left until the number of data elements remaining in the first data operand equals said count; in parallel with the shift of the first set of data elements, shifting the second set of data elements to the right so as to remove a number of data elements equal to said count from said second data operand; and combining the shifted first set of data elements with the shifted second set of data elements to obtain a result that includes data elements of both the first data operand and the second data operand.

15. The method according to claim 14, characterized in that said left shift of the first set of data elements comprises the steps of removing data elements from the left edge of the first data operand and entering zeros from the right edge of the first data operand to fill the positions freed by the removed data elements.

16. The method according to claim 15, characterized in that said right shift of the second set of data elements comprises the steps of removing data elements from the right edge of the second data operand and entering zeros from the left edge of the second data operand to fill the positions freed by the removed data elements.

17. The method according to claim 16, in which said combining comprises a logical OR operation on the shifted first set of data elements and the shifted second set of data elements.

18. The method according to claim 17, characterized in that the first operand and the second operand are loaded from adjacent memory cells of a contiguous block of data, and the first set of data elements and the second set of data elements do not overlap.
19. A method of parallel combining of data with a shift to the right, comprising the steps of: receiving a shift-merge instruction and a shift count of M; concatenating a first operand, from a source register specified by the shift-merge instruction and having a first set of L data elements, with a second operand, from a second data storage location or memory cell specified by the shift-merge instruction and having a second set of L data elements, to form a block of data elements of length 2L; shifting said block to the right by M positions, the M rightmost data elements being dropped; and outputting the L rightmost data elements of said shifted block as the result of the shift-merge instruction.

20. The method according to claim 19, characterized in that said right shift further comprises writing zeros at the left edge of said block to fill the positions freed by the M data elements.

21. The method according to claim 20, characterized in that the first operand and the second operand are packed data operands.

22. The method according to claim 21, characterized in that each data element contains one byte of data.

23. The method according to claim 22, characterized in that the value of L is 8.

24. The method according to claim 23, characterized in that M has a value ranging from 0 to 15.

25. The method according to claim 24, characterized in that said block is held in a temporary packed data register having a capacity sufficient for 2L data elements.
26. An apparatus for parallel combining of data with a shift to the right, comprising: a decoder for decoding a right-shift merge instruction; a scheduler for dispatching said instruction for execution with a first source register having a first operand consisting of a first set of L data elements, a second data storage location having a second operand consisting of a second set of L data elements, and a shift count of M; and an execution unit for executing said instruction, the execution unit comprising left-shift logic for shifting the first operand to the left by 'L - M' data elements, right-shift logic for shifting the second operand to the right by M data elements in parallel with the shift of the first operand, and merge logic for combining the shifted first operand with the shifted second operand to produce a result having L data elements.

27. The apparatus according to claim …, characterized in that the right-shift merge instruction consists of one microinstruction (micro-operation).

28. The apparatus according to claim 27, characterized in that the left shift of the first operand produces a shifted first set consisting of M data elements aligned with the left edge of the first operand.

29. The apparatus according to claim …, characterized in that the left shift removes 'L - M' data elements of the first operand, and zeros are entered from the right edge of the first operand to fill the positions freed by the 'L - M' data elements that were shifted out.

30. The apparatus according to claim 29, characterized in that the right shift of the second operand produces a shifted second set containing 'L - M' data elements aligned with the right edge of the second operand.

31. The apparatus according to claim 30, characterized in that the right shift removes M data elements of the second operand, and zeros are inserted at the left edge of the second operand to fill the positions freed by the M data elements that were shifted out.
32. The apparatus according to claim …, characterized in that the first operand, the second operand and the result are held in packed data registers.

33. The apparatus according to claim …, characterized in that each data element is one byte of data.

34. The apparatus according to claim …, characterized in that the value of L is 8.

35. The apparatus according to claim 34, characterized in that M has a value ranging from 0 to 15.

36. The apparatus according to claim …, characterized in that said apparatus has a 64-bit architecture.

37. The apparatus according to claim …, characterized in that the value of L is 16, M has a value ranging from 0 to 31, and said apparatus has a 128-bit architecture.

38. A system for parallel combining of data with a shift to the right, comprising: a memory for storing data and instructions; and a processor connected to the memory through a bus, the processor being capable of performing right-shift merge operations, the processor comprising: a bus unit for receiving instructions from the memory; a decoder for decoding an instruction for performing a right-shift merge, by a shift count of M, of a first operand from a first source register having a first set of K data elements and a second operand from a second data storage location having a second set of L data elements; a scheduler for dispatching the decoded instruction for execution; and an execution unit for executing said instruction, comprising left-shift logic for shifting the first operand to the left by 'K - M' data elements, right-shift logic for shifting the second operand to the right by M data elements in parallel with the shift of the first operand, and merge logic for combining the shifted first operand with the shifted second operand to produce a result having K data elements.

39. The system according to claim 38, characterized in that the value of K is equal to the value of L, and both K and L are equal to 8.
40. The system according to claim 38, characterized in that said left shift removes 'K - M' data elements of the first operand, and zeros are entered from the right edge of the first operand to fill the positions freed by the 'K - M' data elements that were shifted out, and said right shift removes M data elements of the second operand, and zeros are entered from the left edge of the second operand to fill the positions freed by the M data elements that were shifted out.

41. The system according to claim 38, characterized in that each data element contains one byte of data, and the first operand and the second operand are packed data operands.

42. A machine-readable medium having embodied thereon a computer program, the computer program being executable by a computer to perform a method comprising the steps of: receiving a first instruction and a shift count of M; shifting, in response to the first instruction, a first operand from a first source register having a first set of L data elements to the left by 'L - M' data elements; shifting in parallel, and in response to the first instruction, a second operand from a second data storage location having a second set of L data elements to the right by M data elements; and combining, in response to the first instruction, the shifted first set with the shifted second set to obtain a result having L data elements.

43. The machine-readable medium according to claim 42, in which said left shift removes 'L - M' data elements of the first operand and enters zeros from the right edge of the first operand to fill the positions freed by said shifted-out 'L - M' data elements; said right shift removes M data elements of the second operand and enters zeros from the left edge of the second operand to fill the positions freed by said shifted-out M data elements; and said combining comprises performing a logical OR operation on said shifted first set and said shifted second set.
44. The machine-readable medium according to claim 43, in which the first operand, the second operand and the result are packed data operands.
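To make the operation of claim 1 concrete, the following Python sketch (an editorial illustration, not part of the patent) models the two parallel shifts and the logical OR merge on lists of data elements; the function name and the list representation, with index 0 taken as the left edge of an operand, are assumptions for illustration only:

```python
def merge_shift_right(op1, op2, m):
    """Sketch of the claim-1 method: shift op1 left by L - M elements,
    shift op2 right by M elements in parallel, then OR the two together.
    Lists are ordered with index 0 at the left edge of the operand."""
    L = len(op1)
    assert len(op2) == L and 0 <= m <= L
    # Left shift: the leftmost L - M elements of op1 fall off, and zeros
    # enter from the right edge (claims 2 and 3).
    shifted_first = op1[L - m:] + [0] * (L - m)
    # Right shift: the rightmost M elements of op2 fall off, and zeros
    # enter from the left edge (claims 4 and 5).
    shifted_second = [0] * m + op2[:L - m]
    # Logical OR merge (claim 6); the non-zero regions never overlap (claim 7).
    return [a | b for a, b in zip(shifted_first, shifted_second)]

# Example with L = 8 byte-sized elements and M = 3:
print(merge_shift_right([1, 2, 3, 4, 5, 6, 7, 8],
                        [9, 10, 11, 12, 13, 14, 15, 16], 3))
# → [6, 7, 8, 9, 10, 11, 12, 13]
```

The result carries the M rightmost elements of the first operand followed by the L - M leftmost elements of the second, which is the non-overlapping split that claim 7 describes.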
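Claim 19 states the same operation in a concatenation form: join the two operands into a 2L-element block, shift the block right by M positions, and keep the L rightmost elements. A minimal sketch of that formulation (again an illustration with an assumed function name, not text from the patent):

```python
def merge_shift_right_concat(op1, op2, m):
    """Sketch of the claim-19 formulation: concatenate the operands into a
    2L-element block (op1 at the left edge), shift the block right by M
    positions dropping the M rightmost elements, zero-fill at the left
    edge (claim 20), and output the L rightmost elements."""
    L = len(op1)
    block = op1 + op2                      # 2L data elements
    shifted = [0] * m + block[:2 * L - m]  # right shift by M positions
    return shifted[L:]                     # the L rightmost elements

# With L = 8 and M = 3 this yields the same result as the
# parallel-shift method of claim 1:
print(merge_shift_right_concat([1, 2, 3, 4, 5, 6, 7, 8],
                               [9, 10, 11, 12, 13, 14, 15, 16], 3))
# → [6, 7, 8, 9, 10, 11, 12, 13]
```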
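The execution-unit datapath of claims 26-31 (a left shifter and a right shifter operating in parallel, feeding OR-merge logic) can also be modeled at the register level. The sketch below, an editorial illustration under the assumption of L = 8 one-byte elements in a 64-bit operand, holds each packed operand in a Python integer; for simplicity it restricts M to 0..L, although claim 35 allows counts up to 15:

```python
def merge_shift_right_64(r1, r2, m):
    """Register-level model of the claims 26-31 datapath for L = 8 byte
    elements held in a 64-bit packed operand."""
    assert 0 <= m <= 8  # sketch covers 0..L only; claim 35 allows M up to 15
    MASK64 = (1 << 64) - 1
    left = (r1 << (8 * (8 - m))) & MASK64  # left-shift logic: L - M byte positions
    right = (r2 >> (8 * m)) & MASK64       # right-shift logic: M bytes, in parallel
    return left | right                    # merge (OR) logic

print(hex(merge_shift_right_64(0x0102030405060708,
                               0x090A0B0C0D0E0F10, 3)))
# → 0x60708090a0b0c0d
```

The mask models the finite register width: bytes shifted past the left edge are discarded, and zeros enter at the vacated positions, exactly as the zero-fill clauses of claims 29 and 31 require.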