Method and device for shuffling data

FIELD: engineering of microprocessors and computer systems.

SUBSTANCE: in accordance with a shuffle instruction, a first operand containing a set of L data elements and a second operand containing a set of L shuffle masks are received, where each shuffle mask includes a "set to zero" field and a selection field. For each shuffle mask, if its "set to zero" field is not set, the data indicated by the selection field of the shuffle mask is moved from a data element of the first operand into the associated data element of the result; if its "set to zero" field is set, zero is placed in the associated data element of the result.

EFFECT: improved processor characteristics and increased processor performance.

8 cl, 43 dwg

 

The present application is a continuation-in-part of U.S. patent application No. 09/952,891, "Apparatus and method for efficient filtering and convolution of content data," filed October 29, 2001.

This patent application is related to the co-filed U.S. patent application No. - entitled "Method and apparatus for parallel table lookup using SIMD instructions," filed June 30, 2003, and the co-filed U.S. patent application No. - entitled "Method and apparatus for rearranging data between multiple registers," filed June 30, 2003.

Technical field

The present invention relates to the field of microprocessors and computer systems. More specifically, the present invention relates to a method and apparatus for shuffling data.

Prior art

Computer systems have become increasingly pervasive in modern society. The processing capabilities of computers have increased the efficiency and productivity of workers in a wide range of professions. As the cost of purchasing and owning a computer continues to fall, more and more users are able to take advantage of newer and faster machines. In addition, many people prefer notebook computers (laptops) because of the freedom of use they provide. Mobile computers allow users to easily transport their data and work with it when they leave the office or travel. This scenario is familiar to marketing staff, corporate executives, and even students.

As processor technology advances, newer software code is also being written to run on machines with these processors. Users generally expect and demand higher performance from their computers regardless of the type of software being used. One issue that can arise concerns the kinds of instructions and operations that are actually performed within the processor. Certain types of operations take more time to complete because of the complexity of the operations and/or the type of circuitry required. This creates an opportunity to optimize the way certain complex operations are executed inside the processor.

Multimedia applications have been driving microprocessor development for more than a decade. In fact, most computing upgrades in recent years have been driven by multimedia applications. These upgrades have occurred predominantly in the consumer segment, although significant advances have also been seen in the enterprise segment for purposes such as entertainment-enhanced education and communication. Nevertheless, future multimedia applications will demand even higher computational performance. As a result, the personal computing experience of tomorrow will be richer in audio-visual effects, easier to use, and, more importantly, computing will merge with communications.

Accordingly, the display of images and the playback of audio and video data, collectively referred to as "content", have become increasingly popular applications for modern computing devices. Filtering and convolution are some of the most common operations performed on content data, such as audio data and video images. Such operations are computationally intensive, but they offer a high level of data parallelism that can be exploited through an efficient implementation using various data storage devices, such as, for example, single instruction, multiple data (SIMD) registers. A number of current architectures also require unnecessary data type changes, which reduce instruction throughput and significantly increase the number of clock cycles required to order data for arithmetic operations.

Brief description of drawings

The present invention is illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals indicate similar elements and in which:

Fig. 1A is a block diagram of a computer system formed with a processor that includes execution units to execute an instruction for shuffling data in accordance with one embodiment of the present invention.

Fig. 1B is a block diagram of another exemplary computer system in accordance with an embodiment of the present invention.

Fig. 1C is a block diagram of yet another exemplary computer system in accordance with an alternative embodiment of the present invention.

Fig. 2 is a block diagram of the architecture of a processor of one embodiment that includes logic circuits to perform data shuffle operations in accordance with the present invention.

Figs. 3A-C illustrate shuffle masks according to various embodiments of the present invention.

Fig. 4A illustrates various packed data types in multimedia registers in accordance with one embodiment of the present invention.

Fig. 4B illustrates packed data types in accordance with an alternative embodiment of the present invention.

Fig. 4C illustrates one embodiment of an operation encoding (opcode) format for a shuffle instruction.

Fig. 4D illustrates an alternative operation encoding format.

Fig. 4E illustrates yet another alternative operation encoding format.

Fig. 5 is a block diagram of one embodiment of logic to perform a shuffle operation on a data operand based on a shuffle mask in accordance with the present invention.

Fig. 6 is a block diagram of one embodiment of a circuit for performing a data shuffle operation in accordance with the present invention.

Fig. 7 illustrates the operation of a data shuffle on byte-wide data elements in accordance with one embodiment of the present invention.

Fig. 8 illustrates the operation of a data shuffle on word-wide data elements in accordance with another embodiment of the present invention.

Fig. 9 is a flowchart of one embodiment of a method for shuffling data.

Figs. 10A-H illustrate the operation of a parallel table lookup algorithm using SIMD instructions.

Fig. 11 is a flowchart of one embodiment of a method for performing a table lookup using SIMD instructions.

Fig. 12 is a flowchart of another embodiment of a method for performing a table lookup.

Figs. 13A-C illustrate an algorithm for rearranging data between multiple registers.

Fig. 14 is a flowchart illustrating one embodiment of a method for rearranging data between multiple registers.

Figs. 15A-K illustrate an algorithm for shuffling data between multiple registers to generate interleaved data.

Fig. 16 is a flowchart illustrating one embodiment of a method for shuffling data between multiple registers to generate interleaved data.

Detailed description

A method and apparatus for shuffling data are disclosed. A method and apparatus for parallel table lookup using SIMD instructions are also described. A method and apparatus for rearranging data between multiple registers are also disclosed. The embodiments presented here are described in the context of a microprocessor, but are not so limited. Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. The same techniques and teachings of the present invention can easily be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance. The teachings of the present invention are applicable to any processor or machine that performs data manipulation. Moreover, the present invention is not limited to processors or machines that perform 256-bit, 128-bit, 64-bit, 32-bit or 16-bit data operations, and can be applied to any processor or machine in which data needs to be shuffled.

In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of the present invention. One of ordinary skill in the art will appreciate, however, that these specific details are not required in order to practice the present invention. In other instances, well-known electrical structures and circuits are not described in detail so as not to obscure the present invention unnecessarily. In addition, the following description provides examples, and the accompanying drawings show various examples for purposes of illustration. However, these examples should not be construed in a limiting sense, since they are intended only to provide examples of the present invention rather than an exhaustive list of all possible implementations of the present invention.

In one embodiment, the methods of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps of the present invention. Alternatively, the steps of the present invention can be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.

Although the examples below describe instruction handling and distribution in the context of execution units and logic circuits, other embodiments of the present invention can be accomplished by way of software. The present invention may be provided as a computer program product, or software, which may include a machine-readable or computer-readable medium having stored thereon instructions that can be used to program a computer (or other electronic devices) to perform a process according to the present invention. Such software can be stored within a memory in the system. Similarly, the code can be distributed via a network or by way of other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact discs (CD), compact disc read-only memory (CD-ROM), magneto-optical disks, ROM, RAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), magnetic or optical cards, flash memory, transmission over the Internet, electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and the like.

Accordingly, the machine-readable medium includes any type of media suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer). Moreover, the present invention may also be downloaded as a computer program product. As such, the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client). The transfer of the program may be by way of electrical, optical, acoustical or other forms of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem, a network connection, or the like).

In addition, embodiments of integrated circuit designs in accordance with the present invention can be communicated or transferred in electronic form as a database on a tape or other machine-readable media. For example, the electronic form of an integrated circuit design of a processor in one embodiment can be processed or manufactured at a fabrication facility to obtain a computer component. In another instance, an integrated circuit design in electronic form can be processed by a machine to simulate a computer component. Thus, the circuit layout plans and/or designs of processors in some embodiments can be distributed via machine-readable media, or embodied on them, for fabrication into a circuit or for simulation of an integrated circuit which, when processed by a machine, simulates a processor. A machine-readable medium is also capable of storing data representing predetermined functions in accordance with the present invention in other embodiments.

Modern processors use a number of different execution units to process and execute a variety of code and instructions. Not all instructions are created equal: some complete quickly, while others take an enormous number of clock cycles. The higher the instruction throughput, the better the overall performance of the processor. It would therefore be advantageous to have as many instructions as possible execute as fast as possible. However, certain instructions are more complex and require more execution time and processor resources. Examples include floating-point instructions, load/store operations, data moves, and so on.

As more and more computer systems are used for Internet and multimedia applications, additional processor support has been introduced over time. For instance, single instruction, multiple data (SIMD) integer/floating-point instructions and Streaming SIMD Extensions (SSE) are instructions that reduce the overall number of instructions required to execute a particular program task. These instructions can speed up software by operating on multiple data elements in parallel. As a result, performance gains can be achieved in a wide range of applications, including video, speech, and image/photo processing. Implementing SIMD instructions in microprocessors and similar types of logic circuits usually involves a number of issues. Furthermore, the complexity of SIMD operations often creates a need for additional circuitry to process and manipulate the data correctly.

Embodiments of the present invention provide a way to implement a packed byte shuffle instruction with a set-to-zero capability as an algorithm that makes use of SIMD-related hardware. In one embodiment, the algorithm is based on the concept of shuffling data from a particular register or memory location according to the values of a control mask for each data element position. Embodiments of a packed byte shuffle can be used to reduce the number of instructions required in many different applications that rearrange data. A packed byte shuffle instruction can also be used in any application with unaligned loads. Embodiments of this shuffle instruction can be used in filtering to arrange data for efficient multiply-accumulate operations. Similarly, a packed shuffle instruction can be used in video and encryption applications for ordering data and for small lookup tables. The instruction can also be used to mix data from two or more registers. Thus, embodiments of a packed shuffle with a set-to-zero capability in accordance with the present invention can be implemented in a processor to support SIMD operations efficiently without seriously compromising overall performance.
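One way to picture how the set-to-zero capability helps when mixing data from two registers is the software sketch below. It is only an illustration under assumptions, not the described hardware: the names `shuffle16`, `merge16`, `sel` and `from_b` are hypothetical, and combining the two partial results with a bitwise OR is an assumed technique that works because every byte zeroed in one partial result is filled from the other register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZERO_FLAG 0x80u /* bit 7 of a mask byte: force the result byte to zero */

/* Byte shuffle with set-to-zero on 16-byte operands; dst must not overlap src. */
static void shuffle16(uint8_t dst[16], const uint8_t src[16],
                      const uint8_t mask[16])
{
    for (size_t i = 0; i < 16; i++)
        dst[i] = (mask[i] & ZERO_FLAG) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical helper: out[i] = from_b[i] ? b[sel[i]] : a[sel[i]].
 * Each source is shuffled with the other source's positions zeroed,
 * then the two partial results are combined with a bitwise OR. */
static void merge16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16],
                    const uint8_t sel[16], const uint8_t from_b[16])
{
    uint8_t mask_a[16], mask_b[16], part_a[16], part_b[16];

    for (size_t i = 0; i < 16; i++) {
        assert(sel[i] < 16); /* selector must fit in four bits */
        mask_a[i] = from_b[i] ? (uint8_t)ZERO_FLAG : sel[i];
        mask_b[i] = from_b[i] ? sel[i] : (uint8_t)ZERO_FLAG;
    }
    shuffle16(part_a, a, mask_a);
    shuffle16(part_b, b, mask_b);
    for (size_t i = 0; i < 16; i++)
        out[i] = (uint8_t)(part_a[i] | part_b[i]);
}
```

Without a zeroing option, the unwanted bytes of each partial result would need an extra masking step before the two could be combined.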

Embodiments of the present invention provide a packed data shuffle instruction (PSHUFB) with a set-to-zero capability for efficiently ordering and arranging data of any size. In one embodiment, data is shuffled or rearranged in a register with byte granularity. The byte shuffle operation orders data elements larger than a byte by maintaining the relative position of the bytes within each larger element during the shuffle. In addition, the byte shuffle operation can change the relative position of data in a SIMD register and can also duplicate data. The PSHUFB instruction shuffles bytes of a first source register according to the contents of the shuffle control bytes in a second source register. Although the instruction permutes the data, the shuffle mask is left unaffected and unchanged by the shuffle operation of this embodiment. The mnemonic for one implementation is "PSHUFB register 1, register 2/memory", where the first and second operands are SIMD registers. However, the register of the second operand can also be replaced with a memory location. The first operand holds the source data for the shuffle. In this embodiment, the register of the first operand is also the destination register. Embodiments in accordance with the present invention also include the capability of setting selected bytes to zero in addition to changing their position.
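To illustrate how a byte-granular shuffle can reorder larger elements while keeping the bytes of each element together, the following sketch expands word selectors into byte selectors. It is a software model under assumptions: `shuffle16` and `word_to_byte_mask` are hypothetical names, and the low byte of word j is assumed to sit at byte position 2j.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte shuffle with set-to-zero on 16-byte operands; dst must not overlap src. */
static void shuffle16(uint8_t dst[16], const uint8_t src[16],
                      const uint8_t mask[16])
{
    for (size_t i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical helper: expand eight word selectors into sixteen byte
 * selectors. Word j occupies bytes 2j and 2j+1, so selecting both in
 * order preserves the byte order inside each 16-bit word. A word
 * selector with bit 7 set zeroes both bytes of that result word. */
static void word_to_byte_mask(uint8_t byte_mask[16], const uint8_t word_sel[8])
{
    for (size_t w = 0; w < 8; w++) {
        if (word_sel[w] & 0x80) {
            byte_mask[2 * w] = 0x80;
            byte_mask[2 * w + 1] = 0x80;
        } else {
            assert(word_sel[w] < 8); /* three selector bits for eight words */
            byte_mask[2 * w] = (uint8_t)(2 * word_sel[w]);
            byte_mask[2 * w + 1] = (uint8_t)(2 * word_sel[w] + 1);
        }
    }
}
```

Passing the same word selector in every position duplicates one word across the register, which matches the remark that the shuffle can also duplicate data.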

The second operand holds the set of shuffle control mask bytes that designate the shuffle pattern. The number of bits used to select a source data element is the base-two logarithm of the number of data elements in the source operand. For example, the number of bytes in a 128-bit register embodiment is sixteen, and log2(16) = 4, so four bits, or a nibble, are needed. The index [3:0] in the code below refers to these four bits. If the most significant bit (MSB) of a shuffle control byte, bit 7 in this embodiment, is set, a constant zero is written to the corresponding result byte. If the least significant nibble of byte I of the second operand (the mask set) contains the integer J, the shuffle instruction causes the J-th byte of the first source register to be copied to the I-th byte position of the destination register. The following is example pseudocode for one embodiment of a packed byte shuffle operation on 128-bit operands:
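The semantics just described for 128-bit operands can be sketched in C roughly as follows. This is an illustrative software model, not the pseudocode of the specification: the function name `pshufb128` and the use of byte arrays to stand in for SIMD registers are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a packed byte shuffle on 128-bit operands.
 * dest_src: first operand, used as both source and destination.
 * mask:     second operand (shuffle control bytes), left unchanged. */
static void pshufb128(uint8_t dest_src[16], const uint8_t mask[16])
{
    uint8_t result[16];

    assert(dest_src != NULL && mask != NULL);
    for (int i = 0; i < 16; i++) {
        if (mask[i] & 0x80) {
            result[i] = 0;                 /* bit 7 set: write zero */
        } else {
            uint8_t index = mask[i] & 0x0F; /* bits [3:0] select source byte */
            result[i] = dest_src[index];
        }
    }
    /* Write back only after every source byte has been read. */
    memcpy(dest_src, result, sizeof result);
}
```

The temporary buffer is needed in software because the first operand is overwritten by the result.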

Similarly, the following is example pseudocode for another embodiment of a packed byte shuffle operation on 64-bit operands:
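A comparable sketch for 64-bit operands is given below, this time modelling each register as a single `uint64_t` with byte 0 in the least significant position. The function names and that byte-numbering convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a packed byte shuffle on 64-bit operands held in integers.
 * Byte i of a value occupies bits [8*i+7 : 8*i]. */
static uint64_t pshufb64(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;

    for (unsigned i = 0; i < 8; i++) {
        uint8_t m = (uint8_t)(mask >> (8 * i));
        if ((m & 0x80) == 0) {             /* bit 7 clear: copy a byte */
            unsigned index = m & 0x07;     /* bits [2:0] select source byte */
            uint64_t byte = (src >> (8 * index)) & 0xFF;
            result |= byte << (8 * i);
        }                                  /* bit 7 set: byte stays zero */
    }
    return result;
}

/* Usage: selectors 7,6,...,0 reverse the byte order of the value. */
static void pshufb64_example(void)
{
    assert(pshufb64(0x0807060504030201ULL, 0x0001020304050607ULL)
           == 0x0102030405060708ULL);
}
```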

Note that in this 64-bit register embodiment, the lower three bits of the mask are used, because there are eight bytes in a 64-bit register and log2(8) = 3. The index [2:0] in the code above refers to these three bits. In alternative embodiments, the number of bits in the mask can vary to match the number of data elements in the source data. For example, a mask with five lower bits is needed to select data elements in a 256-bit register.
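The relationship between register width and selector width can be made concrete with a small sketch. The helper names `selector_bits`, `selector_mask` and `shuffle_n` are hypothetical, and the limit of 128 elements is an assumption that follows from reserving bit 7 of each mask byte as the zero flag.

```c
#include <assert.h>
#include <stdint.h>

/* Number of selector bits: log2 of the number of data elements. */
static unsigned selector_bits(unsigned num_elements)
{
    unsigned bits = 0;

    /* Element count must be a power of two that leaves bit 7 free. */
    assert(num_elements != 0 && (num_elements & (num_elements - 1)) == 0);
    assert(num_elements <= 128);
    while ((1u << bits) < num_elements)
        bits++;
    return bits;
}

/* Bit mask that isolates the selector field of a shuffle control byte. */
static uint8_t selector_mask(unsigned num_elements)
{
    return (uint8_t)((1u << selector_bits(num_elements)) - 1u);
}

/* Byte shuffle with set-to-zero for n-byte operands; dst must not overlap src. */
static void shuffle_n(uint8_t *dst, const uint8_t *src,
                      const uint8_t *mask, unsigned n)
{
    uint8_t sel = selector_mask(n);

    for (unsigned i = 0; i < n; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & sel];
}
```

Under these assumptions, `selector_bits` yields 3, 4 and 5 for 8-, 16- and 32-byte operands, matching the 64-bit, 128-bit and 256-bit cases in the text.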

At present, rearranging data in a SIMD register is somewhat difficult and tedious. Some algorithms require more instructions to arrange data for arithmetic operations than the number of instructions that actually execute those operations. By implementing embodiments of a packed byte shuffle instruction in accordance with the present invention, the number of instructions needed to rearrange data can be reduced drastically. For example, one embodiment of a packed byte shuffle instruction can broadcast a byte of data to all positions of a 128-bit register. Broadcasting data in a register is often used in filtering applications, where a single data item is multiplied by many coefficients. Without such an instruction, the data byte would have to be extracted from its source and shifted to the lowest byte position. That single byte would then have to be duplicated to form a word, the two bytes duplicated again to form a doubleword, and the doubleword duplicated to finally form a quadword. All of these operations can be replaced with a single packed shuffle instruction.
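The broadcast case can be sketched as a shuffle whose sixteen control bytes all carry the same selector. The code below is a software model under assumptions; `pshufb128` and `broadcast_byte` are hypothetical names, and byte arrays stand in for SIMD registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Packed byte shuffle with set-to-zero; reg is both source and destination. */
static void pshufb128(uint8_t reg[16], const uint8_t mask[16])
{
    uint8_t result[16];

    for (int i = 0; i < 16; i++)
        result[i] = (mask[i] & 0x80) ? 0 : reg[mask[i] & 0x0F];
    memcpy(reg, result, sizeof result);
}

/* Hypothetical helper: copy byte j of reg into all sixteen positions. */
static void broadcast_byte(uint8_t reg[16], unsigned j)
{
    uint8_t mask[16];

    assert(j < 16);
    memset(mask, (int)j, sizeof mask); /* every control byte selects byte j */
    pshufb128(reg, mask);
}
```

A single shuffle with a constant mask takes the place of the extract, shift and repeated duplication steps described above.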

Similarly, reversing all of the bytes in a 128-bit register, as when converting between big-endian and little-endian formats, can easily be performed with a single packed shuffle instruction. While even these fairly simple patterns require a number of instructions if a packed shuffle instruction is not used, complex or random patterns require even less efficient instruction sequences. The most straightforward way to reorder random bytes in a SIMD register is to write them to a buffer, use integer byte reads and writes to rearrange them, and then read them back into a SIMD register. All of this data processing requires a lengthy code sequence, whereas a single packed shuffle instruction can suffice. Reducing the number of instructions required greatly reduces the number of clock cycles needed to produce the same result. Embodiments of the present invention also use shuffle instructions to access multiple values in a table with SIMD instructions. Even where a table is twice the size of a register, algorithms in accordance with the present invention allow data elements to be accessed at a faster rate than the one data element per instruction achieved with integer operations.
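The byte reversal mentioned above corresponds to a shuffle mask whose control byte at position i selects source byte 15 - i. The sketch below models this in software; the names `pshufb128` and `reverse_bytes` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Packed byte shuffle with set-to-zero; reg is both source and destination. */
static void pshufb128(uint8_t reg[16], const uint8_t mask[16])
{
    uint8_t result[16];

    for (int i = 0; i < 16; i++)
        result[i] = (mask[i] & 0x80) ? 0 : reg[mask[i] & 0x0F];
    memcpy(reg, result, sizeof result);
}

/* Hypothetical helper: reverse the order of all sixteen bytes,
 * as in a big-endian / little-endian conversion of a 128-bit value. */
static void reverse_bytes(uint8_t reg[16])
{
    uint8_t mask[16];

    assert(reg != NULL);
    for (int i = 0; i < 16; i++)
        mask[i] = (uint8_t)(15 - i); /* result byte i takes source byte 15 - i */
    pshufb128(reg, mask);
}
```

Any other fixed permutation follows the same pattern with a different constant mask, which is what lets one instruction replace the buffer round trip described in the text.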

Fig. 1A shows a block diagram of an exemplary computer system formed with a processor that includes execution units to execute an instruction for shuffling data in accordance with one embodiment of the present invention. System 100 includes a component, such as processor 102, that employs execution units including logic to perform algorithms for shuffling data in accordance with the present invention, as in the embodiment described here. System 100 is representative of processing systems based on the Pentium® III, Pentium® 4, Celeron®, Xeon™, Itanium®, XScale™ and/or StrongARM™ microprocessors available from Intel Corporation (Santa Clara, California), although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes and so on) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation (Redmond, Washington), although other operating systems (such as UNIX and Linux), embedded software and/or graphical user interfaces may also be used. Thus, the present invention is not limited to any specific combination of hardware circuitry and software.

The present invention is not limited to computer systems. Alternative embodiments of the present invention can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs) and handheld PCs. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that performs integer shuffle operations on operands. In addition, some architectures have been implemented to allow instructions to operate on several data items simultaneously in order to improve the efficiency of multimedia applications. As the types and volume of data increase, computers and their processors have to be enhanced to manipulate data more efficiently.

Fig. 1A shows a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform a data shuffle algorithm in accordance with the present invention. The present embodiment is described in the context of a single-processor desktop or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a hub architecture. Computer system 100 includes a processor 102 to process data signals. The processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computer (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. The processor 102 is coupled to a processor bus 110 that can transmit data signals between the processor 102 and other components in the system 100. The elements of system 100 perform their conventional functions, which are well known to those skilled in the art.

In one embodiment, the processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside outside the processor 102. Other embodiments can also include a combination of internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers and an instruction pointer register.

Execution unit 108, which includes logic to perform integer and floating-point operations, also resides in the processor 102. The processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. In this embodiment, execution unit 108 includes logic to handle a packed instruction set 109. In one embodiment, the packed instruction set 109 includes a packed shuffle instruction for organizing data. By including the packed instruction set 109 in the instruction set of a general-purpose processor 102, along with associated circuitry to execute the instructions, the operations used by many multimedia applications can be performed using packed data in a general-purpose processor 102. Thus, many multimedia applications can be accelerated and executed more efficiently by using the full width of the processor's data bus to perform operations on packed data. This can eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.

Alternative embodiments of execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device or another memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by the processor 102.

A system logic chip 116 is coupled to the processor bus 110 and memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). The processor 102 can communicate with the MCH 116 via the processor bus 110. The MCH 116 provides a high-bandwidth memory path 118 to memory 120 for storing instructions and data and for storing graphics commands, data and textures. The MCH 116 directs data signals between the processor 102, memory 120 and other components in the system 100, and bridges the data signals between the processor bus 110, memory 120 and system input/output (I/O) 122. In some embodiments, the system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. The MCH 116 is coupled to memory 120 through a memory interface 118. The graphics card 112 is coupled to the MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a proprietary hub interface bus 122 to couple the MCH 116 to the I/O controller hub (ICH) 130. The ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to the memory 120, the chipset and the processor 102. Some examples are the audio controller, a firmware hub (flash BIOS) 128, a wireless transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. The data storage device 124 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device or another mass storage device.

In another embodiment of a system, an execution unit that executes an algorithm with a shuffle instruction can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or a graphics controller, can also be located on the system on a chip.

Fig. 1B shows an alternative embodiment of a data processing system 140 that implements the principles of the present invention. One embodiment of data processing system 140 is an Intel® Personal Internet Client Architecture (Intel® PCA) applications processor with Intel XScale™ technology (as described on the World Wide Web at developer.intel.com). Those skilled in the art will readily appreciate that the embodiments described here can be used with alternative processing systems without departing from the scope of the invention.

Computer system 140 comprises a processing core 159 capable of performing SIMD operations, including a shuffle. In one embodiment, processing core 159 represents a processing unit of any type of architecture, including but not limited to CISC, RISC or VLIW type architectures. Processing core 159 comprises an execution unit 142, a set of register files 145 and a decoder 144. Processing core 159 also includes additional circuitry (not shown) that is not necessary for understanding the present invention. Execution unit 142 is used for executing instructions received by processing core 159. In addition to recognizing typical processor instructions, execution unit 142 can recognize instructions in packed instruction set 143 for performing operations on packed data formats. Packed instruction set 143 includes instructions for supporting shuffle operations and may also include other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. Register file 145 represents a storage area on processing core 159 for storing information, including data. As mentioned above, the particular storage area used for storing the packed data is not critical. Execution unit 142 is coupled to decoder 144. Decoder 144 decodes instructions received by processing core 159 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 142 performs the appropriate operations.

Processing core 159 is coupled to bus 141 for communicating with various other system devices, which may include, but are not limited to, for example, synchronous dynamic random access memory (SDRAM) control, static random access memory (SRAM) control 147, burst flash memory interface 148, Personal Computer Memory Card International Association (PCMCIA)/compact flash (CF) card control 149, liquid crystal display (LCD) control 150, direct memory access (DMA) controller 151 and alternative bus master interface 152. In one embodiment, data processing system 140 may also comprise an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices may include, but are not limited to, for example, a universal asynchronous receiver/transmitter (UART) 155, universal serial bus (USB) 156, Bluetooth wireless UART 157 and I/O expansion interface 158.

One embodiment of data processing system 140 provides for mobile, network and/or wireless communications and a processing core 159 capable of performing SIMD operations, including a shuffle operation. Processing core 159 may be programmed with various audio, video, imaging and communications algorithms, including discrete transformations such as the Walsh-Hadamard transform, the fast Fourier transform (FFT), the discrete cosine transform (DCT) and their respective inverse transforms; compression/decompression techniques such as color space transformation, video encode motion estimation or video decode motion compensation; and modulation/demodulation (MODEM) functions such as pulse code modulation (PCM).

FIG. 1C illustrates yet another embodiment of a data processing system capable of performing SIMD shuffle operations. In accordance with this alternative embodiment, data processing system 160 may include a main processor 166, a SIMD coprocessor 161, a cache memory 167, and an input/output system 168. The input/output system 168 may optionally be coupled to a wireless interface 169. SIMD coprocessor 161 is capable of performing SIMD operations, including data shuffles. Processor core 170 may be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate the manufacture of all or part of data processing system 160 including processor core 170.

In one embodiment, SIMD coprocessor 161 comprises an execution unit 162 and a set of register files 164. One embodiment of main processor 166 comprises a decoder 165 to recognize instructions of instruction set 163, including SIMD shuffle instructions, for execution by execution unit 162. In alternative embodiments, SIMD coprocessor 161 also comprises at least part of decoder 165B to decode instructions of instruction set 163. Processor core 170 also includes additional circuitry (not shown) that is not necessary to the understanding of the present invention.

In operation, the main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with the cache memory 167 and the input/output system 168. Embedded within the stream of data processing instructions are SIMD coprocessor instructions. The decoder 165 of main processor 166 recognizes these SIMD coprocessor instructions as being of a type that should be executed by an attached SIMD coprocessor 161. Accordingly, the main processor 166 issues these SIMD coprocessor instructions (or control signals representing SIMD coprocessor instructions) on the coprocessor bus 166, from which they are received by any attached SIMD coprocessors. In this case, the SIMD coprocessor 161 accepts and executes any received SIMD coprocessor instructions intended for it.

Data may be received via wireless interface 169 for processing by the SIMD coprocessor instructions. For one example, voice communication may be received in the form of a digital signal, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples representative of the voice communication. For another example, compressed audio and/or video may be received in the form of a digital bit stream, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples and/or motion video frames. In one embodiment of processor core 170, main processor 166 and SIMD coprocessor 161 are integrated into a single processor core 170 comprising an execution unit 162, a set of register files 164, and a decoder 165 to recognize instructions of instruction set 163, including SIMD shuffle instructions.

FIG. 2 is a block diagram of the micro-architecture for a processor 200 of one embodiment that includes logic circuits to perform shuffle operations in accordance with the present invention. The shuffle operation may also be referred to as a packed data shuffle operation or a packed shuffle instruction, as in the discussion above. In one embodiment of the shuffle instruction, the instruction can shuffle packed data with byte granularity. That instruction may also be referred to as PSHUFB or packed shuffle byte. In other embodiments, the shuffle instruction can also be implemented to operate on data elements having sizes of words, doublewords, quadwords, etc. The in-order front end 201 is the part of the processor 200 that fetches the macro-instructions to be executed and prepares them for later use in the processor pipeline. The front end 201 of this embodiment includes several units. The instruction prefetcher 226 fetches macro-instructions from memory and feeds them to an instruction decoder 228, which in turn decodes them into primitives called micro-instructions or micro-operations (also called micro-ops or uops) that the machine can execute. The trace cache 230 takes the decoded uops and assembles them into program-ordered sequences, or traces, in the uop queue 234 for execution. When the trace cache 230 encounters a complex macro-instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Many macro-instructions are converted into a single micro-op, while others need several micro-ops to complete the full operation. In this embodiment, if more than four micro-ops are needed to complete a macro-instruction, the decoder 228 accesses the microcode ROM 232 to execute the macro-instruction. In one embodiment, a packed shuffle instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, the instructions for a packed data shuffle algorithm can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine the correct micro-instruction pointer for reading the microcode sequences for the shuffle algorithms in the microcode ROM 232. After the microcode ROM 232 finishes sequencing micro-ops for the current macro-instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230.
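The decode rule described above can be sketched as follows. This is a minimal illustration only: the function name `uop_source` and the constant `MAX_DECODER_UOPS` are hypothetical, and the threshold of four micro-ops is the one stated for this embodiment.

```python
# Hypothetical sketch of the decode rule; names are illustrative only.
MAX_DECODER_UOPS = 4  # more than four micro-ops are sequenced from the microcode ROM


def uop_source(uop_count: int) -> str:
    """Return which unit supplies the micro-ops for one macro-instruction."""
    if uop_count < 1:
        raise ValueError("a macro-instruction needs at least one micro-op")
    if uop_count > MAX_DECODER_UOPS:
        return "microcode_rom"
    return "decoder"
```

Under this sketch, a shuffle instruction that decodes into a single micro-op is handled by the decoder, while a longer shuffle algorithm would be read from the microcode ROM.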

Some SIMD and other multimedia types of instructions are considered complex instructions. Most floating point related instructions are also complex instructions. As such, when the instruction decoder 228 encounters a complex macro-instruction, the microcode ROM 232 is accessed at the appropriate location to retrieve the microcode sequence for that macro-instruction. The various micro-ops needed for performing that macro-instruction are communicated to the out-of-order execution engine 203 for execution at the appropriate integer and floating point execution units.

The out-of-order execution engine 203 is where the micro-instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of micro-instructions to optimize performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each micro-operation (uop) needs in order to execute. The allocator also allocates an entry for each uop in one of the two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: the memory scheduler, fast scheduler 202, slow/general floating point scheduler 204, and simple floating point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of their dependent input register operand sources and the availability of the execution resources the uops need to complete their operation. The fast scheduler 202 of this embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.

The register files 208, 210 sit between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 in the execution block 211. There is a separate register file 208, 210 for integer and floating point operations, respectively. Each register file 208, 210 of this embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating point register file 210 are also capable of communicating data with each other. In one embodiment, the integer register file 208 is split into two separate register files, one register file for the low-order 32 bits of data and a second register file for the high-order 32 bits of data. The floating point register file 210 of one embodiment has 128-bit wide entries because floating point instructions typically have operands from 64 to 128 bits in width.

The execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210, which store the integer and floating point data operand values that the micro-instructions need to execute. The processor 200 of this embodiment comprises a number of execution units: address generation unit (AGU) 212, AGU 214, fast arithmetic logic unit (ALU) 216, fast ALU 218, slow ALU 220, floating point ALU 222, and floating point move unit 224. In this embodiment, the floating point execution units 222, 224 execute floating point, MMX, SIMD, and SSE operations. The floating point ALU 222 of this embodiment includes a floating point divider to execute divide, square root, and remainder micro-ops. In embodiments of the present invention, any act involving a floating point value is carried out with the floating point hardware. For example, conversions between integer format and floating point format involve the floating point register file. Similarly, a floating point divide operation is performed at the floating point divider. On the other hand, non-floating point numbers and integer types are handled with integer hardware resources. The simple, very frequent ALU operations go to the fast ALUs 216, 218. The fast ALUs 216, 218 of this embodiment can execute fast operations with an effective latency of half a clock cycle. In one embodiment, most complex integer operations go to the slow ALU 220, as the slow ALU 220 includes integer execution hardware for long-latency operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212, 214. In this embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands.
In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data bit widths, including 16, 32, 128, 256, etc. Similarly, the floating point units 222, 224 can be implemented to support a range of operands having various bit widths. In one embodiment, the floating point units 222, 224 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

In this embodiment, the uop schedulers 202, 204, 206 dispatch dependent operations before the parent load has finished executing. Because uops are speculatively scheduled and executed in processor 200, the processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed, and the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instruction sequences for shuffle operations.

The term "registers" is used herein to refer to the on-board processor storage locations that are used as part of macro-instructions to identify operands. In other words, the registers referred to herein are those that are visible from outside the processor (from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data and performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store 32-bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data. For the discussions below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit wide MMX™ registers (also referred to in some instances as "mm" registers) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating point forms, can operate with the packed data elements that accompany SIMD and SSE instructions. Similarly, the 128-bit wide XMM registers relating to SSE2 technology can also be used to hold such packed data operands. In this embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types.

In the examples of the following figures, a number of data operands are described. For simplicity, the initial source data segments are labeled alphabetically from the letter A onwards, where A is located at the lowest address and Z would be located at the highest address. Thus, A may initially be at address 0, B at address 1, C at address 2, and so on. Conceptually, a shuffle operation, as in the packed byte shuffle of one embodiment, entails shuffling data segments from a first operand and rearranging one or more of the source data elements into a pattern specified by a set of masks in a second operand. Thus, a shuffle can rotate or completely rearrange some or all of the data elements into any desired order. Furthermore, any particular data element or number of data elements can be duplicated or broadcast in the resultant. Embodiments of the shuffle instruction in accordance with the present invention include a flush to zero functionality, whereby the mask for each particular data element can cause that data element position to be zeroed out in the resultant.

FIGS. 3A-C are illustrations of shuffle masks in accordance with various embodiments of the present invention. This example shows a packed data operand 310 comprising a plurality of individual data elements 311, 312, 313, 314. The packed operand 310 of this example is described in the context of a packed data operand that contains a set of masks indicating a shuffle pattern for the corresponding packed data elements of another operand. Thus, the mask in each of the data elements 311, 312, 313, 314 of packed operand 310 designates the contents of the corresponding data element position of the resultant. For example, data element 311 corresponds to the leftmost data element position. The mask in data element 311 designates what data should be shuffled or placed into the leftmost data element position of the resultant of the shuffle operation. Similarly, data element 312 corresponds to the second data element position from the left. The mask in data element 312 designates what data should be shuffled or placed into the second data element position from the left of the resultant. In this embodiment, each of the data elements in the packed operand containing the shuffle masks has a one-to-one correspondence with a data element position of the packed resultant.

In FIG. 3A, data element 312 is used to describe the contents of an example shuffle mask of one embodiment. The shuffle mask 318 of one embodiment comprises three parts: a "set to zero" flag field 315, a "reserved" field 316, and a "selection bits" field 317. The "set to zero" flag field 315 indicates whether the resultant data element position designated by the present mask should be zeroed out or, in other words, replaced with a value of zero ('0'). In one embodiment, the "set to zero" flag field is dominant: if the "set to zero" flag field 315 is set, the remaining fields in the mask 318 are ignored and the resultant data element position is filled with '0'. The "reserved" field 316 includes one or more bits that may or may not be used in alternative embodiments, or that may be reserved for future or special use. The "selection bits" field 317 of shuffle mask 318 designates the data source for the corresponding data element position of the packed resultant.

In one embodiment of a packed data shuffle instruction, one operand contains a set of masks and another operand contains a set of packed data elements. Both operands are of the same size. Depending on the number of data elements in an operand, a varying number of selection bits are needed to select an individual data element from the second packed data operand for placement in the packed resultant. For example, with a 128-bit source operand of packed bytes, at least four selection bits are needed, since there are sixteen byte data elements available for selection. Based on the value indicated by the selection bits of the mask, the appropriate data element is placed into the data element position corresponding to that mask. For example, mask 318 of data element 312 corresponds to the second data element position from the left. If the selection bits 317 of this mask 318 contain a value 'X', then the data element at data element position 'X' of the source data operand is shuffled into the second data element position from the left in the resultant. But if the "set to zero" flag field 315 is set, the second data element position from the left in the resultant is replaced with '0' and the designation of the selection bits 317 is ignored.
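The relation between operand size and the minimum number of selection bits can be checked with a short calculation. The following is a minimal sketch; the function name `select_bits_needed` is hypothetical and not part of the described embodiments.

```python
def select_bits_needed(operand_bits: int, element_bits: int) -> int:
    """Minimum number of selection bits needed to address every element."""
    if element_bits <= 0 or operand_bits % element_bits != 0:
        raise ValueError("operand width must be a multiple of the element width")
    elements = operand_bits // element_bits
    # The largest index is elements - 1; its bit length is the field width.
    return (elements - 1).bit_length()
```

For a 128-bit operand of packed bytes there are sixteen candidates, so the sketch yields four selection bits, in agreement with the example above.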

FIG. 3B illustrates the structure of a mask 328 for an embodiment that operates with byte-size data elements and 128-bit wide packed operands. In this embodiment, the "set to zero" field 325 consists of bit 7, and the "selection" field 327 consists of bits 3 through 0, since there are sixteen possible data element selections. Bits 6 through 4 are not used in this embodiment and remain in the "reserved" field 326. In another embodiment, the number of bits used in the "selection" field 327 can be increased as needed to accommodate the number of possible data element selections available in the source data operand.
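The byte mask layout just described (bit 7 for the "set to zero" flag, bits 6 through 4 reserved, bits 3 through 0 for selection) can be decoded as in the following minimal sketch. The names `ShuffleMask` and `decode_mask` are hypothetical and are introduced only for illustration.

```python
from typing import NamedTuple


class ShuffleMask(NamedTuple):
    set_to_zero: bool  # bit 7
    reserved: int      # bits 6..4, unused in this embodiment
    select: int        # bits 3..0, one of sixteen byte positions


def decode_mask(mask: int) -> ShuffleMask:
    """Split one byte-wide shuffle mask into its three fields."""
    if not 0 <= mask <= 0xFF:
        raise ValueError("mask must fit in one byte")
    return ShuffleMask(bool(mask & 0x80), (mask >> 4) & 0x7, mask & 0x0F)
```

For instance, `decode_mask(0x8F)` reports the "set to zero" flag as set, so the selection value 0xF would be ignored.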

FIG. 3C illustrates the structure of a mask 338 for an alternative embodiment that operates with byte-size data elements and 128-bit wide packed operands, but with multiple data element sources. The mask 338 of this embodiment comprises a "set to zero" field 335, a "source selection" field 336, and a "selection" field 337. The "set to zero" field 335 and the "selection" field 337 function as described above. The "source selection" field 336 of this embodiment indicates from which data source the data element designated by the selection bits should be obtained. For example, the same set of masks may be used with multiple data sources, such as a plurality of multimedia registers. Each source multimedia register is assigned a numeric value, and the value in the "source selection" field 336 points to one of these source registers. Depending on the contents of the "source selection" field 336, the selected data element is taken from the appropriate data source for placement in the corresponding data element position of the packed resultant.
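A shuffle with a source selection field might be sketched as follows. The text does not give the bit positions of field 336, so this sketch assumes, purely for illustration, that it occupies bits 6 through 4 (the bits left reserved in the single-source layout); the function name `shuffle_multi_source` is likewise hypothetical. Sequences are indexed by data element position, with index 0 as the lowest position.

```python
def shuffle_multi_source(masks, sources):
    """Shuffle byte elements drawn from several source operands.

    masks   -- one byte-wide mask per result position
    sources -- list of source operands, each a sequence of 16 elements
    Assumed layout: bit 7 = set to zero, bits 6..4 = source number,
    bits 3..0 = element position within that source.
    """
    result = []
    for mask in masks:
        if mask & 0x80:              # the zero flag overrides all other fields
            result.append(0)
            continue
        source = (mask >> 4) & 0x7   # which source register (assumed position)
        select = mask & 0x0F         # which element inside that source
        result.append(sources[source][select])
    return result
```

With two sources, a mask of 0x13 would pick element 3 of source 1 under this assumed layout, while 0x03 would pick element 3 of source 0.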

FIG. 4A illustrates various packed data type representations in multimedia registers in accordance with one embodiment of the present invention. FIG. 4A shows the data types for packed byte 410, packed word 420, and packed doubleword 430 for 128-bit wide operands. The packed byte format 410 of this example is 128 bits long and contains sixteen packed byte data elements. A byte is defined here as 8 bits of data. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. Moreover, with sixteen data elements accessed at once, one operation can now be performed on sixteen data elements in parallel.
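The bit positions listed for the packed byte format follow a simple rule: byte i occupies bits 8i+7 through 8i. A minimal sketch, with the hypothetical helper names `byte_bit_range` and `extract_byte`, treats a 128-bit operand as a Python integer:

```python
def byte_bit_range(index: int, operand_bits: int = 128):
    """Return (high_bit, low_bit) of packed byte `index` in the operand."""
    if not 0 <= index < operand_bits // 8:
        raise IndexError("byte index outside the operand")
    low = 8 * index
    return (low + 7, low)


def extract_byte(value: int, index: int) -> int:
    """Read packed byte `index` from an operand held as an integer."""
    _, low = byte_bit_range(index)
    return (value >> low) & 0xFF
```

Byte 15, for example, maps to bits 127 through 120, matching the layout described above.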

Generally, a data element is an individual piece of data that is stored in a single operand (a register or memory location) together with other data elements of the same length. In packed data sequences relating to SSE2 technology, the number of data elements stored in an operand (XMM register or memory location) is 128 bits divided by the length in bits of an individual data element. Similarly, in packed data sequences relating to MMX and SSE technology, the number of data elements stored in an operand (MMX register or memory location) is 64 bits divided by the length in bits of an individual data element. The packed word format 420 of this example is 128 bits long and contains eight packed word data elements. Each packed word contains 16 bits of information. The packed doubleword format 430 of FIG. 4A is 128 bits long and contains four packed doubleword data elements. Each packed doubleword data element contains 32 bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
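The element counts given above follow from dividing the operand width by the element width. The following is a minimal sketch; the function name `element_count` and the `PACKED_FORMATS` table are hypothetical and only restate the figures from the text.

```python
# Element widths in bits for the packed formats named in the text.
PACKED_FORMATS = {"byte": 8, "word": 16, "doubleword": 32, "quadword": 64}


def element_count(operand_bits: int, element_bits: int) -> int:
    """Number of equally sized data elements that fit in one operand."""
    if element_bits <= 0 or operand_bits % element_bits != 0:
        raise ValueError("operand width must be a multiple of the element width")
    return operand_bits // element_bits
```

A 128-bit operand therefore holds 16 bytes, 8 words, 4 doublewords, or 2 quadwords, while a 64-bit operand holds half as many of each.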

FIG. 4B illustrates alternative in-register data storage formats. Each packed data can include more than one independent data element. Three packed data formats are illustrated: packed half 441, packed single 442, and packed double 443. In one embodiment, packed half 441, packed single 442, and packed double 443 contain fixed-point data elements. In an alternative embodiment, one or more of packed half 441, packed single 442, and packed double 443 may contain floating-point data elements. One alternative embodiment of packed half 441 is 128 bits long and contains eight 16-bit data elements. One embodiment of packed single 442 is 128 bits long and contains four 32-bit data elements. One embodiment of packed double 443 is 128 bits long and contains two 64-bit data elements. It will be appreciated that such packed data formats may be further extended to other register lengths, for example, 96 bits, 160 bits, 192 bits, 224 bits, 256 bits or more.

FIG. 4C illustrates one embodiment of an operation encoding (opcode) format 460, having 32 or more bits, and register/memory operand addressing modes corresponding to a type of opcode format described in the "IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference", available from Intel Corporation (Santa Clara, CA) on the world wide web (www) at intel.com/design/litcentr. The type of shuffle operation may be encoded by one or more of fields 461 and 462. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 464 and 465. In one embodiment of the shuffle instruction, destination operand identifier 466 is the same as source operand identifier 464. In an alternative embodiment, destination operand identifier 466 is the same as source operand identifier 465. Therefore, in embodiments of a shuffle operation, one of the source operands identified by source operand identifiers 464 and 465 is overwritten by the results of the shuffle operation. In one embodiment of the shuffle instruction, operand identifiers 464 and 465 may be used to identify 64-bit source and destination operands.

FIG. 4D illustrates another alternative operation encoding (opcode) format 470, having 40 or more bits. Opcode format 470 corresponds to opcode format 460 and comprises an optional prefix byte 478. The type of shuffle operation may be encoded by one or more of fields 478, 471, and 472. Up to two operand locations per instruction may be identified by source operand identifiers 474 and 475 and by prefix byte 478. In one embodiment of the shuffle instruction, prefix byte 478 may be used to identify 128-bit source and destination operands. In one embodiment of the shuffle instruction, destination operand identifier 476 is the same as source operand identifier 474. In an alternative embodiment, destination operand identifier 476 is the same as source operand identifier 475. Therefore, in embodiments of the shuffle operation, one of the source operands identified by source operand identifiers 474 and 475 is overwritten by the results of the shuffle operation. Opcode formats 460 and 470 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD fields 463 and 473 and by optional scale-index-base and displacement bytes.

Turning next to FIG. 4E, in some alternative embodiments, 64-bit SIMD arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Opcode format 480 depicts one such CDP instruction having CDP opcode fields 482 and 489. For alternative embodiments of shuffle operations, the type of CDP instruction may be encoded by one or more of fields 483, 484, 487, and 488. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 485 and 490 and one destination operand identifier 486. One embodiment of the coprocessor can operate on 8-, 16-, 32-, and 64-bit values. In one embodiment, the shuffle operation is performed on floating-point data elements and on integer data elements. In some embodiments, a shuffle instruction may be executed conditionally, using condition field 481. For some shuffle instructions, the source data sizes may be encoded by field 483. In some embodiments of the shuffle instruction, zero (Z), negative (N), carry (C), and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 484.

FIG. 5 is a block diagram of one embodiment of logic to perform a shuffle operation on a data operand based on a shuffle mask in accordance with the present invention. The instruction (PSHUFB) for a shuffle operation with a set to zero capability of this embodiment begins with two pieces of information: a first (mask) operand 510 and a second (data) operand 520. For the following discussions, MASK, DATA, and RESULTANT are generally referred to as operands or data blocks, but they are not restricted as such and also include registers, register files, and memory locations. In one embodiment, the shuffle PSHUFB instruction is decoded into one micro-operation. In an alternative embodiment, the instruction may be decoded into a varying number of micro-ops to perform the shuffle operation on the data operands. In this example, the operands 510, 520 are 128-bit wide pieces of information stored in a source register/memory having byte-wide data elements. In one embodiment, the operands 510, 520 are held in 128-bit long SIMD registers, such as 128-bit SSE2 XMM registers. However, one or both of the operands 510, 520 can also be loaded from a memory location. In one embodiment, the RESULTANT 540 is also an MMX or XMM data register. Furthermore, the RESULTANT 540 may be the same register or memory location as one of the source operands. Depending on the particular implementation, the operands and registers can be other lengths, such as 32, 64, and 256 bits, and can have word, doubleword, or quadword sized data elements. The first operand 510 in this example comprises a set of sixteen masks (in hexadecimal format): 0x0E, 0x0A, 0x09, 0x8F, 0x02, 0x0E, 0x06, 0x06, 0x06, 0xF0, 0x04, 0x08, 0x08, 0x06, 0x0D, and 0x00. Each individual mask specifies the contents of its corresponding data element position in the resultant 540.

The second operand 520 comprises sixteen data segments: P, O, N, M, L, K, J, I, H, G, F, E, D, C, B, and A. Each data segment in the second operand 520 is also labeled with its data element position value in hexadecimal format. The data segments here are of equal length, and each comprises a single byte (8 bits) of data. If each data element were a word (16 bits), doubleword (32 bits), or quadword (64 bits), the 128-bit operands would have eight word-wide, four doubleword-wide, or two quadword-wide data elements, respectively. However, other embodiments of the present invention can operate with other sizes of operands and data segments. Embodiments of the present invention are not restricted to data operands, data segments, or shift counts of a particular length, and can be sized appropriately for each implementation.

The operands 510, 520 can reside in a register, a memory location, a register file, or a mix of these. The data operands 510, 520 are sent to the shuffle logic 530 of an execution unit in the processor along with a shuffle instruction. By the time the shuffle instruction reaches the execution unit, the instruction should have been decoded earlier in the processor pipeline. Thus, the shuffle instruction can be in the form of a micro-operation (uop) or some other decoded format. In this embodiment, the two data operands 510, 520 are received at the shuffle logic 530. The shuffle logic 530 selects data elements from the data operand 520 based on the values in the mask operand 510 and arranges/shuffles the selected data elements into the appropriate positions in the resultant 540. The shuffle logic 530 also zeroes out the given data element positions in the resultant 540 as specified. Here, the resultant 540 comprises sixteen data segments: O, K, J, '0', C, O, G, G, F, '0', E, I, I, G, N, and A.
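The behavior of the shuffle logic on this example can be reproduced with a short software model. This is only an illustrative sketch, not the circuit of the embodiment; the function name `pshufb_model` is hypothetical, and the first three mask values are written as 0x0E, 0x0A and 0x09 on the assumption that these are the values implied by the resultant and by the description of FIG. 5. Operands are listed left to right, so the leftmost entry is data element position 0xF. Note that the ninth mask is listed as 0x06, which selects G, whereas the resultant list shows F in that position; F would correspond to a selection value of 0x05. The sketch keeps the mask values as listed.

```python
MASKS = [0x0E, 0x0A, 0x09, 0x8F, 0x02, 0x0E, 0x06, 0x06,
         0x06, 0xF0, 0x04, 0x08, 0x08, 0x06, 0x0D, 0x00]
DATA = "PONMLKJIHGFEDCBA"  # leftmost entry is position 0xF, rightmost is 0x0


def pshufb_model(masks, data):
    """Software model of a 16-element byte shuffle with set to zero."""
    if len(masks) != 16 or len(data) != 16:
        raise ValueError("this model assumes sixteen masks and sixteen elements")
    result = []
    for mask in masks:
        if mask & 0x80:                 # set to zero flag (bit 7) wins
            result.append("0")
        else:
            position = mask & 0x0F      # selection field, bits 3..0
            result.append(data[15 - position])  # convert position to list index
    return result


print(", ".join(pshufb_model(MASKS, DATA)))
```

The model places O, K and J in the three leftmost positions and '0' in the fourth, in line with the walkthrough of the individual masks.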

The operation of the shuffle logic 530 is described here with a few of the data elements. The shuffle mask for the leftmost data element position in mask operand 510 is 0x0E. The shuffle logic 530 interprets the various fields of the mask as described above with reference to FIGS. 3A-C. In this case, the "set to zero" field is not set. The selection field, comprising the lower four bits, or nibble, has a hexadecimal value of 'E'. The shuffle logic 530 shuffles the data O at data element position 0xE of data operand 520 into the leftmost data element position of resultant 540. Similarly, the shuffle mask for the second data element position from the left in mask operand 510 is 0x0A. The shuffle logic 530 interprets the mask for that position. This selection field has a hexadecimal value of 'A'. The shuffle logic 530 copies the data K at data element position 0xA of data operand 520 into the second data element position from the left in resultant 540.

The shuffle logic 530 of this embodiment also supports the flush to zero function of the shuffle instruction. The shuffle mask at the fourth data element position from the left in mask operand 510 is 0x8F. The shuffle logic 530 recognizes that the "set to zero" field is set, as indicated by the value '1' in the most significant bit (bit 7) of the mask. In response, the flush to zero directive overrides the selection field, and the shuffle logic 530 ignores the hexadecimal value 'F' in the selection field of the mask. A '0' is placed in the corresponding fourth data element position from the left in resultant 540. In this embodiment, the shuffle logic 530 evaluates the "set to zero" field and the selection field of each mask and does not care about other bits that may exist outside those fields in the mask, such as reserved bits and a source selection field. This processing of the shuffle masks and data shuffling is repeated for the entire set of masks in the mask operand 510. In one embodiment, all the masks are processed in parallel. In another embodiment, a certain portion of the set of masks and data elements can be processed together at a time.

In embodiments of the shuffle instruction, the data elements of an operand can be reordered in various ways. In addition, data from a particular data element can be repeated in multiple data element positions or even broadcast to every position. For example, the fourth and fifth masks from the right both have the hexadecimal value 0x08. As a result, the data I at data element position 0x8 of data operand 520 is moved into both the fourth and fifth data element positions from the right of result 540. By means of the set-to-zero function, embodiments of the shuffle instruction can also force any data element position in result 540 to '0'.

Depending on the particular implementation, each shuffle mask can be used to specify the content of a single data element position in the result. In this example, each individual byte-wide shuffle mask corresponds to one byte-wide data element position in result 540. In another embodiment, combinations of multiple masks can be used together to specify blocks of data elements. For example, two byte-wide masks can be used together to specify a word-wide data element. Shuffle masks are not limited to a byte in width and can have any other size required by a particular implementation. Similarly, the data elements and data element positions can have a granularity other than a byte.

Fig. 6 is a block diagram of one embodiment of a circuit 600 for performing the data shuffle operation in accordance with the present invention. The circuit of this embodiment includes a multiplexing structure that selects the correct result bytes from the first source operand based on decoding the shuffle masks of the second operand. The source data operand here comprises upper packed data elements and lower packed data elements. The multiplexing structure of this embodiment is relatively simpler than other multiplexing structures used to implement other packed instructions. As a result, the multiplexing structure of this embodiment does not introduce any new timing-critical path. Circuit 600 of this embodiment includes a shuffle mask block, blocks for storing the upper and lower packed data elements of the source operand, a first set of (8:1) multiplexers for the initial selection of data elements, another set of (3:1) multiplexers for selecting between the upper and lower data elements, multiplexer select and zero logic, and a number of control signals. For simplicity, Fig. 6 shows a limited number of (8:1) and (3:1) multiplexers, with the remainder represented by dots. Their function, however, is similar to that of the multiplexers shown in the drawing and is explained below.

During a shuffle operation in this example, shuffle circuit 600 receives two operands: a first operand with a set of packed data elements and a second operand with a set of shuffle masks. The shuffle masks are supplied to shuffle mask block 602. The set of shuffle masks is decoded in multiplexer select and zero logic block 604 to generate the selection signals (SELECT A 606, SELECT B 608, SELECT C 610) and the set-to-zero signal (ZERO) 611. These signals are used to control the operation of the multiplexers in assembling result 632.

In this example, the mask operand and the data operand are both 128 bits long, and each consists of 16 byte-wide data segments. The value N shown for the various signals is 16 in this case. In this embodiment, the data elements are divided into a set of lower and a set of upper packed data elements, each set containing 8 data elements. This allows smaller (8:1) multiplexers to be used in selecting data elements instead of (16:1) multiplexers. The sets of upper and lower packed data elements are stored in upper and lower storage regions 612, 622, respectively. Starting with the lower data set, each of the 8 data elements is sent to a first set of 16 individual (8:1) multiplexers A-D over a set of lines, such as routing lines 614. Each of the 16 (8:1) multiplexers A-D is controlled by one of the N SELECT A signals 606. Depending on the value of its SELECT A signal 606, the multiplexer outputs one of the 8 lower data elements 614 for further processing. There are 16 (8:1) multiplexers for the set of lower packed data elements because any of the lower data elements can be moved into any of the 16 data element positions of the result. Each of these (8:1) multiplexers corresponds to one of the 16 data element positions of the result. Similarly, there are 16 (8:1) multiplexers for the set of upper packed data elements. The 8 upper data elements are sent to a second set of 16 (8:1) multiplexers A-D. Each of these 16 (8:1) multiplexers A-D is controlled by one of the N SELECT B signals 608. Depending on the value of its SELECT B signal 608, the (8:1) multiplexer outputs one of the 8 upper data elements 616 for further processing.

Each of the 16 (3:1) multiplexers A-D corresponds to a data element position in result 632. The 16 output signals A-D of the 16 lower data multiplexers A-D are routed to the set of 16 (3:1) upper/lower selection multiplexers A-D, as are the output signals A-D of the upper data multiplexers A-D. Each of these (3:1) multiplexers A-D receives its own SELECT C 610 and ZERO 611 signals from multiplexer select and zero logic 604. The value of the SELECT C signal 610 for a given (3:1) multiplexer specifies whether that multiplexer outputs the selected data from the lower data set or from the upper data set. The ZERO control signal 611 for each (3:1) multiplexer specifies whether that multiplexer forces its output to zero ('0'). In this embodiment, the ZERO control signal 611 overrides the selection made by the SELECT C signal 610 and forces a '0' into the corresponding data element position of result 632.

For example, (3:1) multiplexer A receives the selected lower data element A from the lower (8:1) multiplexer A, together with the selected upper data element A from the upper (8:1) multiplexer A, for a given data element position. The SELECT C signal 610 controls which of these data elements is passed to its output 630 for the corresponding data element position in result 632. However, if the ZERO signal 611 supplied to multiplexer A is active, indicating that the shuffle mask for this data element position requires a '0', the multiplexer output 630 is set to '0' and neither of the inputs A, A is used. Result 632 of the shuffle operation consists of the output signals 630A-D of the (3:1) multiplexers A-D, each of which corresponds to a particular data element position and represents either a data element or a '0'. In this example, each (3:1) multiplexer output is a byte wide, and the result is a data block of 16 packed data bytes.
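The two-level multiplexing structure of circuit 600 can be approximated by the following Python sketch. The text does not state how logic 604 derives the select signals from the mask bits, so the decode used here is an assumption: bits 0-2 drive both SELECT A and SELECT B, bit 3 drives SELECT C, and bit 7 drives ZERO. All function names are hypothetical.

```python
def decode_mask(mask):
    """Assumed decode of one byte mask into (SELECT A, SELECT B, SELECT C, ZERO)."""
    within_half = mask & 0x07      # picks 1 of 8 elements inside a half
    use_upper = (mask >> 3) & 0x1  # assumed: bit 3 chooses upper vs lower half
    zero = (mask >> 7) & 0x1       # "set to zero" field
    return within_half, within_half, use_upper, zero


def mux_8to1(inputs, select):
    """One (8:1) multiplexer: pass through the selected one of 8 inputs."""
    return inputs[select]


def mux_3to1(lower_in, upper_in, select_c, zero):
    """One (3:1) multiplexer: lower input, upper input, or a forced '0'."""
    if zero:
        return 0                   # ZERO overrides SELECT C
    return upper_in if select_c else lower_in


def shuffle_circuit(data, masks):
    """Model of the structure: two (8:1) stages feeding one (3:1) stage per position."""
    lower, upper = data[:8], data[8:16]
    result = []
    for mask in masks:
        sel_a, sel_b, sel_c, zero = decode_mask(mask)
        lower_pick = mux_8to1(lower, sel_a)
        upper_pick = mux_8to1(upper, sel_b)
        result.append(mux_3to1(lower_pick, upper_pick, sel_c, zero))
    return result


data = list(range(100, 116))       # arbitrary byte values at positions 0x0..0xF
print(shuffle_circuit(data, [0x0E, 0x03, 0x8F, 0x08]))
```

Under this assumed decode, the structure gives the same result as a direct 16-way selection on the lower nibble, while each multiplexer needs only 8 data inputs.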

Fig. 7 illustrates a data shuffle operation on byte-wide data elements in accordance with one embodiment of the present invention. This is an example of the instruction "PSHUFB DATA, MASK". Note that the most significant bit of the shuffle masks for byte positions 0x6 and 0xC of mask 701 is set, so that the resulting data in result 741 for these positions is zero. In this example, data is arranged in destination data storage device 721, which in one embodiment is also source data storage device 721, in accordance with the set of masks 701, which specify the positions in which the respective data elements from source operand 721 are to be stored in destination register 741. The two source operands, mask 701 and data 721, each contain 16 packed data elements in this example, as does result 741. In this embodiment, each data element is eight bits, or one byte, in size. Thus, the mask 701, data 721, and result 741 data blocks are each 128 bits long. Further, these data blocks can reside either in memory or in registers. In one embodiment, the arrangement of the masks is based on the desired data processing operation, which can include, for example, a filtering operation or a convolution operation.

As shown in Fig. 7, mask operand 701 contains data elements with the shuffle masks: 0x0E 702, 0x0A 703, 0x09 704, 0x8F 705, 0x02 706, 0x0E 707, 0x06 708, 0x06 709, 0x05 710, 0xF0 711, 0x04 712, 0x08 713, 0x08 714, 0x06 715, 0x0D 716, 0x00 717. Similarly, data operand 721 includes the source data elements: P 722, O 723, N 724, M 725, L 726, K 727, J 728, I 729, H 730, G 731, F 732, E 733, D 734, C 735, B 736, A 737. In the representation of the data segments in Fig. 7, the data element position is also indicated below the data as a hexadecimal value. A packed shuffle operation is performed with mask 701 and data 721. Using the set of shuffle masks 701, the processing of data 721 can be performed in parallel.

As each of the shuffle masks is evaluated, the appropriate data from the designated data element, or a '0', is moved to the data element position corresponding to that particular shuffle mask. For example, the rightmost shuffle mask 717 has the value 0x00, which is decoded to designate data position 0x0 of the source data operand. In response, the data A from data position 0x0 is copied to the rightmost position of result 741. Similarly, the second shuffle mask 716 from the right has the value 0x0D, which is decoded as 0xD. In response, the data N from data position 0xD is copied to the second position from the right of result 741.

The fourth data element position from the left in result 741 is '0'. This is due to the value 0x8F in the shuffle mask for that data element position. In this embodiment, bit 7 of the shuffle mask byte is the "set to zero" or "reset to zero" indicator. If this field is set, the corresponding data element position in the result is filled with the value '0' instead of data from source data operand 721. Similarly, the seventh position from the right in result 741 is set to '0'. This is due to the shuffle mask value 0xF0 for that data element position in mask 701. Note that not all bits of a shuffle mask are necessarily used in some embodiments. In this embodiment, the lower nibble, or four bits, of a shuffle mask is sufficient to select any of the 16 possible data elements in source data operand 721. Since bit 7 is the "set to zero" field, the three other bits remain unused and can be reserved or ignored in some embodiments. In this embodiment, the "set to zero" field controls and overrides the data element selection indicated by the lower nibble of the shuffle mask. In both of these cases, the fourth data element position from the left and the seventh position from the right, a shuffle mask of 0x80, in which the "set to zero" flag is set in bit 7, would likewise cause the corresponding result data element position to be filled with '0'.

As shown in Fig. 7, the arrows illustrate the shuffling of the data elements according to the shuffle masks in mask 701. Depending on the particular set of shuffle masks, one or more of the source data elements may not appear in result 741. In some cases, one or more '0's can also appear at various data element positions in result 741. If the shuffle masks are configured to broadcast one data element or a particular group of data elements, the data of those data elements can be repeated in a chosen pattern in the result. Embodiments of the present invention are not limited to any particular arrangement or pattern of shuffle masks.

As noted above, the source data register is also used as the destination data storage register in this embodiment, thereby reducing the number of registers needed. Although source data 721 is overwritten, the set of shuffle masks 701 is not changed and remains available for future reference. Overwritten data in the source data storage device can be reloaded from memory or from another register. In another embodiment, multiple registers can be used as the source data storage device, with their respective data arranged in the destination storage device as required.

Fig. 8 illustrates a data shuffle operation on word-wide data elements in accordance with another embodiment of the present invention. The general description of this example is somewhat similar to that of Fig. 7. In this scenario, however, the data elements of data operand 821 and result 831 are a word wide. In this embodiment, the word-wide data elements are handled as pairs of byte-wide data elements, because the shuffle masks in mask operand 801 are a byte in size. Thus, a pair of byte-wide shuffle masks is used to define each word-wide data element position. In another embodiment, however, the shuffle masks can also have word granularity and describe word-wide data element positions in the result.

Mask operand 801 contains byte-wide data elements with the shuffle masks: 0x03 802, 0x02 803, 0x0F 804, 0x0E 805, 0x83 806, 0x82 807, 0x0D 808, 0x0C 809, 0x05 810, 0x04 811, 0x0B 812, 0x0A 813, 0x0D 814, 0x0C 815, 0x01 816, 0x00 817. Data operand 821 includes the source data elements: H 822, G 823, F 824, E 825, D 826, C 827, B 828, A 829. In the representation of the data segments in Fig. 8, the data element positions are also indicated below the data as hexadecimal values. As shown in Fig. 8, each of the word-wide data elements in data operand 821 has data position addresses corresponding to two byte-wide positions. For example, data H 822 occupies the data positions of byte-wide data elements 0xF and 0xE.

A packed shuffle operation is performed with mask 801 and data 821. The arrows in Fig. 8 illustrate the shuffling of the data elements according to the shuffle masks in mask 801. As each of the shuffle masks is evaluated, the appropriate data from the designated data element of data operand 821, or a '0', is moved to the data element position in result 831 corresponding to that particular shuffle mask. In this embodiment, the byte-wide shuffle masks operate in pairs to designate word-wide data elements. For example, the two leftmost masks 802, 803 in mask operand 801 together correspond to the leftmost word-wide data element position 832 of result 831. During the shuffle operation, the two data bytes, or one data word, at the byte data element positions designated by masks 802, 803, which in this case correspond to data B 828, are arranged in the two leftmost byte-wide data element positions 832 of result 831.

In addition, the shuffle masks can also be configured to set a word-wide data element to '0' in the result, as shown by shuffle masks 806 and 807 for the third word-wide data element position 834 in result 831. Masks 806 and 807 have their "set to zero" fields set. Although the byte-wide shuffle masks are grouped in pairs here, other groupings can also be implemented, for example arranging four bytes together as a quadword or eight bytes together as a double quadword. Similarly, the groupings are not limited to consecutive shuffle masks or to particular bytes. In another embodiment, word-wide shuffle masks can be used to designate word-wide data elements.
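The pairing of byte-wide masks to move word-wide data elements can be sketched as follows. This Python model is an illustration under stated assumptions, not the figure itself: the helper names are hypothetical, the 16-bit values and the list of source words are invented, and each word i is assumed to occupy byte positions 2i (low byte) and 2i+1 (high byte).

```python
def shuffle_bytes(data, masks):
    """Byte shuffle: bit 7 of a mask zeroes the result byte, low nibble selects."""
    return [0 if m & 0x80 else data[m & 0x0F] for m in masks]


def word_mask_pair(src_word):
    """Two byte masks (low byte first) for one result word; None zeroes the word."""
    if src_word is None:
        return [0x80, 0x80]
    return [2 * src_word, 2 * src_word + 1]


def shuffle_words(words, word_sources):
    """Shuffle eight 16-bit words using pairs of byte-wide masks."""
    data = list(b"".join(w.to_bytes(2, "little") for w in words))
    masks = [m for src in word_sources for m in word_mask_pair(src)]
    out = bytes(shuffle_bytes(data, masks))
    return [int.from_bytes(out[i:i + 2], "little") for i in range(0, len(out), 2)]


# Hypothetical word values at word positions 0..7.
words = [0xA1A0, 0xB1B0, 0xC1C0, 0xD1D0, 0xE1E0, 0xF1F0, 0x1110, 0x2120]

# Result word position i takes source word word_sources[i]; None forces zero.
result = shuffle_words(words, [0, 6, 5, 2, 6, None, 7, 1])
print([hex(w) for w in result])
```

Because the two masks of a pair are independent, nothing in this model prevents pairing non-consecutive bytes, which matches the remark that the groupings are not limited to consecutive masks.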

Fig. 9 is a flow chart illustrating one embodiment of a method for shuffling data. The length value L is used here generically to represent the width of the operands and data blocks. Depending on the particular case, L can designate the width in terms of a number of data segments, bits, bytes, and so on. In block 910, a first packed data operand of length L is received for use in the shuffle operation. A set, of length L, of shuffle masks of length M specifying the shuffle pattern is received in block 920. In this example, L is 128 bits and M is 8 bits, or one byte. In other embodiments, L and M can have other values, for example 256 and 16, respectively. In block 930, the shuffle operation is performed, and data elements from the data operand are shuffled into the order in the result specified by the shuffle pattern.

The details of the shuffle in block 930 of this embodiment are further described in terms of what happens for each data element position. In one embodiment, the shuffling for all the packed data element positions is processed in parallel. In another embodiment, a certain portion of the masks may be processed together at a time. In block 932, a check is made as to whether the zero flag is set. This zero flag refers to the "set/reset to zero" field of each shuffle mask. If the zero flag is determined to be set in block 932, the result data element position corresponding to that particular shuffle mask is set to '0'. If the zero flag is determined not to be set in block 932, the data from the source data element designated by the shuffle mask is placed in the destination data element position of the result corresponding to that particular shuffle mask.

Currently, table lookups using integer instructions require a large number of instructions. An even greater number of instructions is needed per lookup if integer operations are used to access data for algorithms implemented with SIMD instructions. By using embodiments of the packed byte shuffle instruction, however, the instruction count and execution time are significantly reduced. For example, 16 bytes of data can be accessed in a table lookup with a single instruction if the table size is 16 bytes or less. 11 SIMD instructions can be used to look up table data if the table size is between 17 and 32 bytes. 32 SIMD instructions are needed if the table size is between 33 and 64 bytes.

There are some applications with data parallelism that could not be implemented with SIMD instructions because of their use of lookup tables. The quantization and deblocking algorithms in the H.26L video compression method are examples of algorithms that use small lookup tables that may not fit into a 128-bit register. In some cases, the lookup tables used by these algorithms are small. If a table fits in a single register, the table lookup can be performed with one packed shuffle instruction. But if the memory space required by the table exceeds the size of a single register, embodiments of the packed shuffle instruction can still be used with a different algorithm. One embodiment of a method for handling oversized tables divides the table into sections, each equal to the capacity of a register, and accesses each of these table sections with a shuffle instruction. The shuffle instruction uses the same shuffle control sequence to access each section of the table. A parallel table lookup can thus be implemented in these cases with the packed byte shuffle instruction, allowing SIMD instructions to be used to improve algorithm performance. Embodiments of the present invention can improve performance and reduce the number of memory accesses needed by algorithms that use small lookup tables. Other embodiments also allow multiple lookup table elements to be accessed using SIMD instructions. A packed byte shuffle instruction in accordance with the present invention permits an efficient SIMD implementation, in place of the less efficient integer implementation, of algorithms that use small lookup tables. This embodiment of the present invention demonstrates how to access data from a table that requires more memory space than a single register. In this example, the registers contain different segments of the table.

Figs. 10A-H illustrate the operation of a parallel table lookup algorithm using SIMD instructions. The example described with reference to Figs. 10A-H involves looking up data from multiple tables, wherein certain selected data elements, as specified in a set of masks, are shuffled from these multiple tables into a merged result data block. The following description is given in the context of packed operations, in particular the packed shuffle instruction disclosed above. The shuffle operation in this example overwrites the source table data in the register. If the table is to be reused after the lookup, the table data should be copied to another register before the operation is executed, so that another load is not needed. In an alternative embodiment, the shuffle operation uses three separate registers or memory locations: two sources and one destination. The destination in the alternative embodiment is a register or memory location different from either of the source operands. Thus, the source table data is not overwritten and can be reused. In this example, the table data is treated as coming from different portions of one large table. For example, "low table data" 1021 is from the lower address region of the table and "high table data" 1051 is from the higher address region of the table. Embodiments of the present invention place no restriction on where the table data can come from. Data blocks 1021, 1051 can be adjacent, far apart, or even overlapping. Similarly, the table data can also come from different data tables or different memory sources. It is also envisioned that such table lookup and data merging can be performed on data from multiple tables. For example, instead of coming from different portions of the same table, "low table data" 1021 can come from a first table and "high table data" 1051 from a second table.

Fig. 10A illustrates a packed data shuffle of a first set of data elements from a table based on a set of shuffle masks. This first set of data elements is grouped as an operand named "low table data" 1021. Mask 1001 and "low table data" 1021 each contain 16 elements in this example. The shuffle operation on mask 1001 and "low table data" 1021 yields an intermediate result, TEMP RESULTANT A 1041. The lower portion of a shuffle control byte selects the data element within the register. The number of bits needed to select a data element is log2 of the number of data elements in the register. For example, if the register capacity is 128 bits and the data type is a byte, the number of data elements in the register is 16. In this case, four bits are needed to select a data element. Fig. 10B illustrates a packed data shuffle of a second set of data elements from a table based on the same set of shuffle masks shown in Fig. 10A. This second set of data elements is grouped as an operand named "high table data" 1051. "High table data" 1051 also contains 16 elements in this example. The shuffle operation on mask 1001 and "high table data" 1051 yields an intermediate result, TEMP RESULTANT B 1042.

Because the same set of shuffle masks was used for both "low table data" 1021 and "high table data" 1051, their respective results 1041, 1042 contain similarly positioned data, but from different sources. For example, the leftmost data position of both results 1041, 1042 holds the data from data element 0xE 1023, 1053 of its respective data source 1021, 1051. Fig. 10C illustrates a packed logical AND operation involving select filter 1043 and the set of shuffle masks MASK 1001. The select filter in this case is a filter that distinguishes which shuffle masks in MASK 1001 refer to the first table data 1021 and which to the second table data 1051. The shuffle masks of this embodiment use a source selection field, "source selection" 336, as described above with reference to Fig. 3C. The lower bits of a shuffle control byte are used to select the data element within a register, and the upper bits, excluding the most significant bit, are used to select the table section. In this embodiment, the bits immediately above and adjacent to those used to select the data are used to select the table section. Select filter 1043 applies 0x10 to all the shuffle masks in MASK 1001 to extract the source selection field from the shuffle masks. The packed AND operation yields "table select mask" 1044, which indicates whether each data element position in the final result is to come from the first data set 1021 or from the second data set 1051.

The number of bits needed to select a table section is log2 of the number of table sections. For example, if the table size is between 17 and 32 bytes with 16-byte registers, the lower 4 bits select the data and the fifth bit selects the table section. Here, the source selection uses the lowest bit of the second nibble, bit 4, of each shuffle mask to identify the data source, since there are two data sources 1021 and 1051. The table section with indices 0 through 15 is accessed with the packed shuffle instruction of Fig. 10A. The table section with indices 16 through 31 is accessed with the packed shuffle instruction of Fig. 10B. The field that selects the table section is isolated from the shuffle control bytes, or indices, as shown in Fig. 10C. In implementations with a larger number of data sources, additional bits may be needed for the source selection field. In the case of a 32-byte table, shuffle control bytes 0x00 through 0x0F select table elements 0 through 15 in the first table section, and shuffle control bytes 0x10 through 0x1F select table elements 16 through 31 in the second table section. For example, consider the case where a shuffle control byte is specified as 0x19. The bit representation of 0x19 is 0001 1001. The lower four bits, 1001, select the ninth byte (counting from 0), and the fifth bit, which is set to 1, selects the second of the two tables. A fifth bit equal to 0 would select the first table.
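The split of a shuffle control byte into a table section number and an element index can be checked with a few lines of Python. The function name and its parameters are hypothetical; the sketch assumes the register size is a power of two and that the section bits sit directly above the index bits, as described for this embodiment.

```python
def decode_control_byte(ctrl, register_bytes=16, sections=2):
    """Split a shuffle control byte into (table section, index within section)."""
    index_bits = register_bytes.bit_length() - 1   # log2(16) = 4 bits
    section_bits = (sections - 1).bit_length()     # log2(2) = 1 bit
    index = ctrl & ((1 << index_bits) - 1)
    section = (ctrl >> index_bits) & ((1 << section_bits) - 1)
    return section, index


# 0x19 = 0001 1001: section 1 (second table section), element 9.
print(decode_control_byte(0x19))
# 0x09 = 0000 1001: section 0 (first table section), element 9.
print(decode_control_byte(0x09))
```

With four table sections the same function would use two section bits, bits 4 and 5, which corresponds to the remark that more data sources require a wider source selection field.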

The mask for selecting the data values accessed from the first table section, with indices 0 through 15, is computed in this embodiment with a packed "compare equal" operation, as in Fig. 10D, by selecting the shuffle control bytes whose fifth bit is zero. Fig. 10D illustrates a packed "compare equal" operation on low filter 1045 and "table select mask" 1044. The low table select mask produced in Fig. 10D for the first table section selects the data elements accessed from the first table section with another packed shuffle operation. Low filter 1045 in this example is a mask for extracting or highlighting the data element positions indicated by the shuffle masks as coming from the first data set 1021. If the source selection field is '0' in this embodiment, the data source is "low table data" 1021. The "compare equal" operation yields "low table select mask" 1046, with 0xFF values at the data element positions whose source selection value is '0'.

The mask for selecting the data values accessed from the second table section, with indices 16 through 31, is computed with a packed "compare equal" operation, as in Fig. 10E, by selecting the shuffle control bytes whose fifth bit is one. Fig. 10E illustrates a similar "compare equal" operation on high filter 1047 and "table select mask" 1044. The high table select mask produced in Fig. 10E for the second table section selects the data elements accessed from the second table section with a packed shuffle operation. High filter 1047 is a mask for extracting the data element positions indicated by the source selection fields of the shuffle masks as coming from the second data set 1051. If the source selection field is '1' in this embodiment, the data source is "high table data" 1051. The "compare equal" operation yields "high table select mask" 1048, with 0xFF values at the data element positions whose source selection value is '1'.

The data elements selected from the two table sections are combined, as shown in Fig. 10F. Fig. 10F shows a packed AND operation on "low table select mask" 1046 and intermediate result A 1041. This packed AND operation filters out the selected shuffled data elements from the first data set 1021 according to mask 1046, which is based on the source selection fields. For example, the source selection field in shuffle mask 1002 for the leftmost data element position has a value of '0', as shown in "table select mask" 1044. Accordingly, "low table select mask" 1046 has the value 0xFF in that position. The AND operation of Fig. 10F between this 0xFF and the data in the leftmost data element position passes the data through to "selected low table data" 1049. On the other hand, the source selection field in shuffle mask 1004 for the third data element position from the left has a value of '1', indicating that the data is to come from a source other than the first data set 1021. Accordingly, "low table select mask" 1046 has the value 0x00 in that position. The AND operation in this case does not pass the data J through to "selected low table data" 1049, and that position remains empty, i.e., 0x00.

Similarly, a packed AND operation is performed on "high table select mask" 1048 and intermediate result B 1042, as shown in Fig. 10G. This packed AND operation filters out the selected shuffled data elements from the second data set 1051 according to mask 1048. In contrast to the packed AND operation described with reference to Fig. 10F, mask 1048 allows the data designated by the source selection fields as coming from the second data set to pass through to "selected high table data" 1050, while the other data element positions remain empty.

Fig. 10H illustrates the merging of the selected data from the first data set and the second data set. A packed logical OR operation is performed on "selected low table data" 1049 and "selected high table data" 1050 to obtain "merged selected table data" 1070, which is the desired result of the parallel table lookup algorithm in this example. In an alternative embodiment, a packed add operation that adds "selected low table data" 1049 and "selected high table data" 1050 can also yield "merged selected table data" 1070. As shown in Fig. 10H, either "selected low table data" 1049 or "selected high table data" 1050 has the value 0x00 at any given data position in this embodiment. This is because the other operand, the one that does not have the value 0x00, is to contain the desired table data selected from the appropriate source. Here, the leftmost data element position in result 1070 is O, which is the shuffled data 1041 from the first data set 1021. Similarly, the third data element position from the left in result 1070 is Z, which is the shuffled data 1042 from the second data set 1051.

The table lookup method for oversized tables in this example embodiment can be summarized generally by the following operations. First, the table data is copied or loaded into registers. Table values from each table section are accessed with a packed shuffle operation. The source selection field, which identifies the table section, is extracted from the shuffle masks. A "compare equal" operation is performed on the source selection fields against the table section number to determine which table sections are the appropriate sources for the shuffled data elements. The "compare equal" operation provides masks for further filtering out the desired shuffled data elements for each table section. The desired data elements from the appropriate table sections are merged together to form the final table lookup result.
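The sequence of operations summarized above can be sketched end to end in Python for a 32-byte table held in two 16-byte sections. The packed operations are emulated on lists of byte values, and the function names (pshufb, pand, por, pcmpeq, lookup_32) are hypothetical labels for this illustration. The filter values 0x10 and 0x00 follow from bit 4 being the source selection field in this example; the table contents and indices are invented.

```python
def pshufb(table, masks):
    """Packed byte shuffle: bit 7 zeroes, low nibble selects, other bits ignored."""
    return [0 if m & 0x80 else table[m & 0x0F] for m in masks]


def pand(a, b):
    """Packed bitwise AND of two byte lists."""
    return [x & y for x, y in zip(a, b)]


def por(a, b):
    """Packed bitwise OR of two byte lists."""
    return [x | y for x, y in zip(a, b)]


def pcmpeq(a, b):
    """Packed compare-equal: 0xFF where bytes match, 0x00 elsewhere."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def lookup_32(table, indices):
    """Parallel lookup of 16 indices (0..31) in a 32-byte table."""
    low, high = table[:16], table[16:32]
    temp_a = pshufb(low, indices)                   # shuffle of low section
    temp_b = pshufb(high, indices)                  # shuffle of high section
    table_select = pand(indices, [0x10] * 16)       # isolate source selection bit
    low_select = pcmpeq(table_select, [0x00] * 16)  # 0xFF where source is low
    high_select = pcmpeq(table_select, [0x10] * 16) # 0xFF where source is high
    selected_low = pand(temp_a, low_select)
    selected_high = pand(temp_b, high_select)
    return por(selected_low, selected_high)         # merge the two selections


table = [(i * 7 + 3) & 0xFF for i in range(32)]     # arbitrary 32-byte table
indices = [0x19, 0x00, 0x1F, 0x0F, 0x10, 0x05, 0x15, 0x0A,
           0x1A, 0x01, 0x11, 0x0E, 0x1E, 0x07, 0x17, 0x08]
print(lookup_32(table, indices))
```

Both shuffles use the same index operand unchanged, which is why the source selection bit must be a bit that the shuffle itself ignores.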

Fig. 11 is a flow chart illustrating one embodiment of a method for performing a table lookup using SIMD instructions. The flow described here generally follows the method of Figs. 10A-H, but is not limited to it. Some of these operations can also be performed in a different order or using other types of SIMD instructions. In block 1102, a set of shuffle masks specifying a shuffle pattern is received. These shuffle masks also include a source field indicating from which table or source the data elements are to be shuffled to obtain the desired result. In block 1104, the data elements of a first portion of a table or a first data set are loaded. The data elements of the first portion are shuffled in block 1106 in accordance with the shuffle pattern of block 1102. The data elements of a second portion of the table or a second data set are loaded in block 1108. The data elements of the second portion are shuffled in block 1110 in accordance with the shuffle pattern of block 1102. In block 1112, the table selections are filtered out of the shuffle masks. The table selections of this embodiment use the source selection fields, which indicate the source from which the data elements are to come. In block 1114, a table select mask is generated for the shuffled data from the first portion of the table. A table select mask is generated for the shuffled data from the second portion of the table in block 1116. These table select masks are used to filter out the desired shuffled data elements for particular data element positions from the appropriate table data sources.

In block 1118, data elements are selected from the shuffled data of the first part of the table in accordance with the table select mask of block 1114 for the first part of the table. In block 1120, data elements are selected from the shuffled data of the second part of the table in accordance with the table select mask of block 1116 for the second part of the table. The shuffled data elements selected from the first part of the table in block 1118 and from the second part of the table in block 1120 are merged together in block 1122 to obtain the merged table data. In one embodiment, the merged table data includes data elements shuffled from both the first table data and the second table data. In another embodiment, the merged table data can include data looked up from more than two table sources or memory regions.
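The flow of Figure 11 can be sketched in software as follows. This is only a minimal emulation on Python lists, not the hardware implementation. The byte layout of a shuffle mask used here (bit 7 as the "reset to zero" flag, bits 4-6 as the source selector, bits 0-3 as the element position) and the names `packed_shuffle` and `lookup_two_parts` are assumptions made for illustration.

```python
# Hypothetical mask layout (an assumption, not taken from the text):
#   bit 7    : "reset to zero" flag
#   bits 4-6 : source selector (which part of the table)
#   bits 0-3 : position of the element inside a 16-byte part
ZERO_FLAG = 0x80
SOURCE_FIELD = 0x70
POSITION_FIELD = 0x0F


def packed_shuffle(data, masks):
    """Emulate a 16-byte packed shuffle with a reset-to-zero flag."""
    return [0 if m & ZERO_FLAG else data[m & POSITION_FIELD] for m in masks]


def lookup_two_parts(part0, part1, masks):
    """Look up 16 values in a table that is split across two 16-byte parts."""
    shuffled0 = packed_shuffle(part0, masks)                     # blocks 1104, 1106
    shuffled1 = packed_shuffle(part1, masks)                     # blocks 1108, 1110
    selectors = [m & SOURCE_FIELD for m in masks]                # block 1112
    select0 = [0xFF if s == 0x00 else 0x00 for s in selectors]   # block 1114
    select1 = [0xFF if s == 0x10 else 0x00 for s in selectors]   # block 1116
    picked0 = [d & k for d, k in zip(shuffled0, select0)]        # block 1118
    picked1 = [d & k for d, k in zip(shuffled1, select1)]        # block 1120
    return [a | b for a, b in zip(picked0, picked1)]             # block 1122


# A 32-entry table; each index byte doubles as a shuffle mask because
# bit 4 of the index names the part and bits 0-3 name the position.
table = [(7 * i + 3) % 256 for i in range(32)]
indices = [0, 17, 31, 5, 16, 15, 8, 24, 1, 30, 2, 29, 3, 28, 4, 27]
result = lookup_two_parts(table[:16], table[16:], indices)
print(result == [table[i] for i in indices])
```

Both parts are shuffled with the same masks because the shuffle itself ignores the source selector bits; the selection masks then discard the values that came from the wrong part.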

Figure 12 shows a flow chart illustrating another embodiment of a method of performing a table lookup. In block 1202, a table with a plurality of data elements is loaded. In block 1204, it is determined whether the table fits in a single register. If the table fits in a single register, the table lookup is performed with a shuffle operation in block 1216. If the data does not fit in a single register, the table lookup is performed with shuffle operations on each relevant part of the table in block 1206. A packed logical AND operation is performed to obtain the bits or fields that select the table part or data source. A "compare for equality" operation in block 1210 generates masks for selecting the table data from the relevant parts of the table being looked up. In block 1212, a logical AND operation is used to look up and select data elements from the table parts. A logical OR operation merges the selected data in block 1214 to obtain the desired table lookup data.
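The decision flow of blocks 1202 to 1216 might be emulated as below. The sketch assumes 16-byte registers and a hypothetical 3-bit source selector in bits 4-6 of each shuffle mask, which limits the table to eight parts; the helper names are illustrative and do not come from the description.

```python
REGISTER_BYTES = 16
ZERO_FLAG = 0x80        # bit 7: "reset to zero" flag
SOURCE_FIELD = 0x70     # assumed: bits 4-6 hold the table part number
SOURCE_SHIFT = 4
POSITION_FIELD = 0x0F   # bits 0-3: element position inside one part


def packed_shuffle(data, masks):
    """Emulate a 16-byte packed shuffle with a reset-to-zero flag."""
    return [0 if m & ZERO_FLAG else data[m & POSITION_FIELD] for m in masks]


def table_lookup(table, masks):
    """Look up one value per mask in a table of up to 128 byte entries."""
    if len(table) > REGISTER_BYTES * 8:
        raise ValueError("a 3-bit source selector addresses at most 8 parts")
    if len(table) <= REGISTER_BYTES:                                 # block 1204
        padded = list(table) + [0] * (REGISTER_BYTES - len(table))
        return packed_shuffle(padded, masks)                         # block 1216
    # A packed AND isolates the field that names the table part.
    selectors = [(m & SOURCE_FIELD) >> SOURCE_SHIFT for m in masks]
    result = [0] * len(masks)
    for start in range(0, len(table), REGISTER_BYTES):
        part = list(table[start:start + REGISTER_BYTES])
        part += [0] * (REGISTER_BYTES - len(part))
        part_no = start // REGISTER_BYTES
        shuffled = packed_shuffle(part, masks)                       # block 1206
        keep = [0xFF if s == part_no else 0x00 for s in selectors]   # block 1210
        picked = [d & k for d, k in zip(shuffled, keep)]             # block 1212
        result = [r | p for r, p in zip(result, picked)]             # block 1214
    return result


small_table = list(range(50, 60))
big_table = [(5 * i + 1) % 256 for i in range(64)]
print(table_lookup(small_table, [9, 0, 3] + [0x80] * 13))
print(table_lookup(big_table, [63, 0, 16, 32] + [0x80] * 12))
```

A mask whose "reset to zero" flag is set yields zero in every part, so it stays zero after the final OR regardless of its selector bits.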

One embodiment of the packed shuffle instruction is applied in an algorithm for rearranging data between multiple registers with the help of the reset-to-zero capability. The purpose of the mixing operation is to merge the contents of two or more SIMD registers into a single SIMD register in a selected arrangement, in which the positions of the data in the result differ from their original positions in the source operands. The selected data elements are first moved to the desired result positions, and the non-selected data elements are set to zero. Positions to which selected data elements have been moved for one register are set to zero for the other registers. Consequently, only one of the result registers can contain a non-zero data element at a given data element position. The following general instruction sequence can be used to mix data from two operands:

packed byte shuffle DATA A, MASK A;

packed byte shuffle DATA B, MASK B;

packed logical OR RESULTANT A, RESULTANT B.
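The three-instruction sequence above can be emulated on plain Python lists as a rough sketch. The function names are hypothetical, list index 0 stands for the rightmost (lowest) byte of a 128-bit register, and only the behavior described in the text is modeled: a mask byte with its high bit set produces zero, otherwise its low four bits select a source byte.

```python
ZERO_FLAG = 0x80       # high bit of a control byte: "reset to zero"
POSITION_FIELD = 0x0F  # low four bits: source byte position (16-byte operands)


def packed_byte_shuffle(data, masks):
    """Emulate the packed byte shuffle on two 16-element lists."""
    return [0 if m & ZERO_FLAG else data[m & POSITION_FIELD] for m in masks]


def packed_or(a, b):
    """Emulate the packed logical OR."""
    return [x | y for x, y in zip(a, b)]


def mix(data_a, mask_a, data_b, mask_b):
    resultant_a = packed_byte_shuffle(data_a, mask_a)
    resultant_b = packed_byte_shuffle(data_b, mask_b)
    return packed_or(resultant_a, resultant_b)


data_a = list(range(0xA0, 0xB0))
data_b = list(range(0xB0, 0xC0))
# Low half of the result from the low half of B, high half from the low half of A.
mask_a = [0x80] * 8 + list(range(8))
mask_b = list(range(8)) + [0x80] * 8
mixed = mix(data_a, mask_a, data_b, mask_b)
print([hex(v) for v in mixed])
```

Because every position is zeroed in exactly one of the two intermediate results, the OR acts as a merge and no separate blend or select step is needed.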

The operands DATA A and DATA B contain the elements that are to be rearranged or set to zero. The operands MASK A and MASK B contain the shuffle control bytes that determine which data elements are to be moved and which are to be set to zero. In this embodiment, the destination positions not set to zero by MASK A are set to zero by MASK B, and the destination positions not set to zero by MASK B are set to zero by MASK A. Figs. 13A-C illustrate the algorithm for rearranging data between multiple registers. In this example, data elements from two data sources or registers 1304, 1310 are shuffled together into an interleaved data block 1314. The data blocks, including the masks 1302, 1308, the source data 1304, 1310 and the results 1306, 1312, 1314, are 128 bits long in this example and each consists of sixteen byte-sized data elements. However, alternative embodiments can include data blocks of other lengths with data elements of other sizes.

Fig. 13A shows the first packed shuffle operation, which applies the first mask MASK A 1302 to the first source data operand DATA A 1304. In this example, the desired interleaved result 1314 is to include an interleaved pattern of one data element from the first data source 1304 and another data element from the second data source 1310. Here, the fifth byte of DATA A 1304 is to be interleaved with the twelfth byte of DATA B 1310. MASK A 1302 includes a repeating pattern of 0x80 and 0x05 in this embodiment. The value 0x80 in this embodiment has its "reset to zero" field set, so the corresponding data element position is filled with '0'. The value 0x05 indicates that the corresponding result data element for that shuffle mask is to receive the data F1 from data element 0x05 of DATA A 1304. In essence, the shuffle pattern in MASK A 1302 arranges and repeats the data F1 at every other data element position of the result. Here the data F1 is the single piece of data to be shuffled from DATA A 1304. In alternative embodiments, data from a different number of source data elements can be shuffled. Embodiments are thus not limited to patterns involving a single piece of data or to any particular pattern. The possible combinations of mask patterns are of all kinds. The arrows in Fig. 13A show the shuffling of data elements for the shuffle masks of MASK A 1302. RESULTANT A 1306 of this shuffle operation thus consists of '0' and F1 arranged according to the mask pattern 1302.

Fig. 13B illustrates the second packed shuffle operation, which applies the second mask MASK B 1308 to the second source data operand DATA B 1310. MASK B 1308 includes a repeating pattern of 0x0C and 0x80. The value 0x80 specifies that the corresponding result data element position for that shuffle mask receives '0'. The value 0x0C specifies that the result data element position corresponding to that shuffle mask receives the data M2 from data element 0x0C of DATA B 1310. The shuffle pattern of MASK B 1308 places the data M2 at every other data element position of the result. The arrows in Fig. 13B illustrate the shuffling of data elements for the set of shuffle masks in MASK B 1308. RESULTANT B 1312 of this shuffle operation thus consists of '0' and M2 arranged according to the mask pattern 1308.

Fig. 13C illustrates the merging of the shuffled data RESULTANT A 1306 and RESULTANT B 1312 to obtain the INTERLEAVED RESULTANT 1314. The merge is performed with a packed logical OR operation. The pattern of '0' values in RESULTANT A 1306 and RESULTANT B 1312 allows the data values M2 and F1 to be interleaved in 1314. For example, at the leftmost data element position, the logical OR of '0' and M2 yields M2 in the leftmost data element position of 1314. Similarly, at the rightmost data element position, the logical OR of F1 and '0' yields F1 in the rightmost data element position of 1314. Thus, data from multiple registers or memory locations can be rearranged into a desired arrangement.
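The example of the three figures just described can be replayed with concrete byte values. In this hedged sketch, list index 0 is the rightmost data element and index 15 the leftmost. The selector 0x0C for MASK B is taken from the text; the selector 0x05 for MASK A is an assumption inferred from the statement that the fifth byte of DATA A is used. The sample contents of the two source registers are invented.

```python
ZERO_FLAG = 0x80


def packed_byte_shuffle(data, masks):
    """Emulate a 16-byte packed shuffle with a reset-to-zero flag."""
    return [0 if m & ZERO_FLAG else data[m & 0x0F] for m in masks]


data_a = [0xA0 + i for i in range(16)]   # invented contents for DATA A
data_b = [0xB0 + i for i in range(16)]   # invented contents for DATA B

mask_a = [0x05, 0x80] * 8   # selects at the rightmost position (0x05 assumed)
mask_b = [0x80, 0x0C] * 8   # selects at the leftmost position

# Each result position must be claimed by exactly one of the two masks.
complementary = all(
    bool(a & ZERO_FLAG) != bool(b & ZERO_FLAG) for a, b in zip(mask_a, mask_b)
)

resultant_a = packed_byte_shuffle(data_a, mask_a)
resultant_b = packed_byte_shuffle(data_b, mask_b)
interleaved = [x | y for x, y in zip(resultant_a, resultant_b)]
print(complementary, [hex(v) for v in interleaved])
```

The complementarity check is the property that makes the final OR safe: a position that held data in both intermediate results would be corrupted by the merge.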

Figure 14 shows a flow chart illustrating one embodiment of a method of rearranging data between multiple registers. Data is loaded from a first register or memory location in block 1402. The data of the first register is shuffled in block 1404 according to a first set of shuffle masks. In block 1406, data is loaded from a second register or memory location. The data of the second register is shuffled in block 1408 according to a second set of shuffle masks. The shuffled data from the first and second registers is merged in block 1410 with a logical OR to obtain an interleaved data block containing data from the first and second registers.

Figs. 15A-K illustrate an algorithm for shuffling data between multiple registers to generate interleaved data. This sample application concerns the interleaving of planar color image data. Image data is often processed in separate color planes, and these planes are later interleaved for display. The algorithm described below illustrates the interleaving of red plane, green plane and blue plane data as used by image formats such as bitmaps. Many color spaces and interleave patterns exist, so this approach can readily be extended to other color spaces and formats. This example implements a commonly used image processing data format in which the red (R) plane, green (G) plane and blue (B) plane data are interleaved into an RGB format. The example illustrates how the reset-to-zero capability according to the present invention substantially reduces memory accesses.

Data from three sources is merged together in an interleaved manner. More specifically, the data is pixel color data. For example, the color data for each pixel can include information from red (R), green (G) and blue (B) sources. By combining the color information, the RGB data can be evaluated to provide the desired color for a particular pixel. Here, the red color data is held in operand DATA A 1512, the green color data in operand DATA B 1514, and the blue color data in operand DATA C 1516. This arrangement can exist in graphics or memory systems where the data for each color is stored together or collected separately, as with streaming data. To use this information in recreating or displaying the desired image, the pixel data must be arranged in an RGB pattern in which all the data for a particular pixel is grouped together.

In this embodiment, a set of masks with predefined patterns is used to interleave the RGB data. Fig. 15A shows the set of masks: MASK A 1502 with a first pattern, MASK B 1504 with a second pattern, and MASK C 1506 with a third pattern. The data from each register is to be spread out three bytes apart so that it can be interleaved with the data from the two other registers. Control bytes with the hexadecimal value 0x80 have the most significant bit set, so that the corresponding byte is set to zero by the packed byte shuffle instruction. In each of these masks, every third shuffle mask selects a data element to be shuffled, while the two intervening shuffle masks have the value 0x80. The value 0x80 indicates that the "reset to zero" field of the masks for those data element positions is set. Thus, '0' is placed in the data element positions associated with those masks. In this example, the mask patterns essentially separate out the data elements for each color so that the interleaving can be accomplished. For example, when MASK A 1502 is applied to a data operand in a shuffle operation, MASK A 1502 causes six data elements (0x00, 0x01, 0x02, 0x03, 0x04, 0x05) to be shuffled apart, with a gap of two data elements between each. Similarly, MASK B 1504 shuffles apart the data elements 0x00, 0x01, 0x02, 0x03, 0x04, and MASK C 1506 shuffles apart the data elements 0x00, 0x01, 0x02, 0x03, 0x04.

Note that in this embodiment, for each overlapping data element position, two of the shuffle masks have their "reset to zero" fields set and one shuffle mask specifies a data element. For example, at the rightmost data element position of the three sets of masks, the mask values are 0x00, 0x80 and 0x80 for MASK A 1502, MASK B 1504 and MASK C 1506, respectively. Thus, only the shuffle mask 0x00 of MASK A 1502 specifies data for this position. The masks in this embodiment are patterned so that the shuffled data can be easily merged to form an interleaved block of RGB data.
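The three mask patterns can be generated programmatically, which also makes the "exactly one selecting mask per position" property easy to verify. The following is a minimal sketch; the function name `make_plane_mask` and the generated byte values are a reconstruction from the description (six, five and five selected elements, with MASK A selecting at the rightmost position), not a copy of the figure. List index 0 is the rightmost data element.

```python
ZERO = 0x80   # control byte with the "reset to zero" bit set


def make_plane_mask(phase, width=16):
    """Build one mask; phase 0, 1, 2 correspond to MASK A, MASK B, MASK C."""
    mask = []
    next_element = 0
    for position in range(width):
        if position % 3 == phase:
            mask.append(next_element)   # select the next unused source element
            next_element += 1
        else:
            mask.append(ZERO)           # leave room for the other two colors
    return mask


mask_a, mask_b, mask_c = (make_plane_mask(p) for p in range(3))

# Count, for every result position, how many masks select data there.
selectors_per_position = [
    sum(1 for m in trio if not m & ZERO) for trio in zip(mask_a, mask_b, mask_c)
]
elements_per_mask = [
    sum(1 for m in mask if not m & ZERO) for mask in (mask_a, mask_b, mask_c)
]
print(selectors_per_position, elements_per_mask)
```

The per-mask element counts are the same numbers by which each color plane is later shifted.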

Fig. 15B illustrates the blocks of data to be interleaved: DATA A 1512, DATA B 1514 and DATA C 1516. In this embodiment, each set of data holds data elements with color information for 16 pixel positions. For example, R0 is the red color data for pixel 0 and G15 is the green color data for pixel 15. The hexadecimal value shown with each data element is the number of that data element position. The color data (DATA A 1512, DATA B 1514, DATA C 1516) can be copied to other registers so that the data is not overwritten by the shuffle operation and can be reused without another load operation. In the example of this embodiment, three passes with the three masks 1502, 1504, 1506 are needed to complete the interleaving of the pixel data. For other implementations and other amounts of data, the number of passes and shuffle operations can vary as needed.

Fig. 15C shows the resulting data block MASKED DATA A 1522 of a packed shuffle operation on the red pixel data DATA A 1512 with the first shuffle pattern MASK A 1502. In response to MASK A 1502, the red pixel data is arranged at every third data element position. Similarly, Fig. 15D shows the resulting data block MASKED DATA B 1524 of a packed shuffle operation on the green pixel data DATA B 1514 with the second shuffle pattern MASK B 1504. Fig. 15E shows the resulting data block MASKED DATA C 1526 of a packed shuffle operation on the blue pixel data DATA C 1516 with the third shuffle pattern MASK C 1506. The mask patterns in this embodiment cause the result data blocks of these shuffle operations to have their data elements staggered, so that at each position one data block holds data while the other two hold '0'. For example, the leftmost data element positions of these results 1522, 1524, 1526 contain R5, '0' and '0', respectively. At the next data element position, pixel data for another one of the RGB colors is present. Thus, when the results are merged, an RGB type of grouping is obtained.

In this embodiment, the shuffled data described above for the red color data and the green color data is first merged together with a packed logical OR operation. Fig. 15F shows the resulting data INTERLEAVED A & B DATA 1530 of the packed logical OR of MASKED DATA A 1522 and MASKED DATA B 1524. The shuffled blue color data is then merged with the interleaved red and green color data by another packed logical OR operation. Fig. 15G shows the new result INTERLEAVED A, B & C DATA 1532 of the packed logical OR of MASKED DATA C 1526 and INTERLEAVED A & B DATA 1530. The result data block shown in Fig. 15G thus contains the interleaved RGB data for the first five pixels and part of the sixth pixel. Subsequent iterations of the algorithm in this embodiment produce the interleaved RGB data for the rest of the 16 pixels.
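The first pass can be traced end to end with a small emulation. This is a sketch under stated assumptions: the masks are regenerated by a hypothetical helper rather than copied from the figure, the pixel values are invented (0x1n for red, 0x2n for green, 0x3n for blue of pixel n), and list index 0 is the rightmost data element.

```python
ZERO = 0x80


def packed_byte_shuffle(data, masks):
    """Emulate a 16-byte packed shuffle with a reset-to-zero flag."""
    return [0 if m & ZERO else data[m & 0x0F] for m in masks]


def packed_or(a, b):
    return [x | y for x, y in zip(a, b)]


def make_plane_mask(phase, width=16):
    """Select consecutive source elements at every third result position."""
    mask, next_element = [], 0
    for position in range(width):
        if position % 3 == phase:
            mask.append(next_element)
            next_element += 1
        else:
            mask.append(ZERO)
    return mask


red = [0x10 + i for i in range(16)]     # stands in for DATA A
green = [0x20 + i for i in range(16)]   # stands in for DATA B
blue = [0x30 + i for i in range(16)]    # stands in for DATA C

masked_a = packed_byte_shuffle(red, make_plane_mask(0))
masked_b = packed_byte_shuffle(green, make_plane_mask(1))
masked_c = packed_byte_shuffle(blue, make_plane_mask(2))

interleaved_ab = packed_or(masked_a, masked_b)
interleaved_abc = packed_or(masked_c, interleaved_ab)
print([hex(v) for v in interleaved_abc])
```

The sixteen result bytes hold five complete RGB triples followed by the red value of the sixth pixel, which matches the description of the first-pass result.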

At this point, one third of the data in DATA A 1512, DATA B 1514 and DATA C 1516 has been interleaved. The remaining data in these registers can be handled in two ways. Another set of shuffle control bytes can be used to arrange the data to be interleaved, or the data in DATA A 1512, DATA B 1514 and DATA C 1516 can be shifted right so that the shuffle masks 1502, 1504 and 1506 can be used again. In the embodiment described here, the data is shifted to avoid the memory accesses needed to load additional shuffle control bytes. Without these shift operations, this embodiment would need nine sets of control bytes instead of three (MASK A 1502, MASK B 1504, MASK C 1506). This embodiment can also be applied in architectures where the number of registers is limited and memory accesses are slow.

In alternative embodiments where a large number of registers is available, it can be more efficient to keep all or many of the mask sets in registers so that the shift operations are unnecessary. Furthermore, in an architecture with many registers and execution units, all of the shuffle operations can be performed without waiting for a shift to complete. For example, an out-of-order processor with nine shuffle units and nine mask sets can perform nine shuffle operations in parallel. In the embodiments described above, the data must be shifted before the masks can be reapplied.
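The shift-free alternative with nine mask sets can be sketched as follows. The mask contents are not given in the text, so the generator below is an assumption: for pass p and color plane c it selects, at each result position, the source pixel that belongs at that point of the 48-byte RGB stream. Function names and sample pixel values are hypothetical, and list index 0 is the rightmost data element.

```python
ZERO = 0x80


def packed_byte_shuffle(data, masks):
    """Emulate a 16-byte packed shuffle with a reset-to-zero flag."""
    return [0 if m & ZERO else data[m & 0x0F] for m in masks]


def make_mask(pass_no, plane, width=16):
    """Mask for one (pass, plane) pair; no shifting of the source is needed."""
    mask = []
    for position in range(width):
        stream_index = pass_no * width + position   # index in the RGB stream
        if stream_index % 3 == plane:
            mask.append(stream_index // 3)          # pixel number, 0..15
        else:
            mask.append(ZERO)
    return mask


def interleave_without_shifts(planes):
    """Interleave three 16-byte planes using nine precomputed mask sets."""
    masks = [[make_mask(p, c) for c in range(3)] for p in range(3)]
    stream = []
    for pass_no in range(3):
        block = [0] * 16
        for plane_no in range(3):
            shuffled = packed_byte_shuffle(planes[plane_no], masks[pass_no][plane_no])
            block = [x | y for x, y in zip(block, shuffled)]
        stream.extend(block)
    return stream


red = [0x10 + i for i in range(16)]
green = [0x20 + i for i in range(16)]
blue = [0x30 + i for i in range(16)]
rgb_stream = interleave_without_shifts([red, green, blue])
print(len(rgb_stream))
```

The nine shuffles are independent of one another, which is what would allow them to be issued in parallel on hardware with enough shuffle units.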

The data elements in the source color data DATA A 1512, DATA B 1514 and DATA C 1516 are shifted according to the number of data elements already processed for that particular color. In this example, the data for six pixels has been processed for red, so the data elements of the red color data operand DATA A 1512 are shifted right by six data element positions. Similarly, the data for five pixels has been processed for both green and blue, so the data elements of the green color data operand DATA B 1514 and of the blue color data operand DATA C 1516 are each shifted right by five data element positions. The shifted source data is shown in Fig. 15H as DATA A' 1546, DATA B' 1542 and DATA C' 1544 for the red, green and blue colors, respectively.

The shuffle and logical OR operations described above with reference to Figs. 15A-G are repeated on this shifted data. Subsequent packed shuffle operations on DATA A' 1546, DATA B' 1542 and DATA C' 1544 with the masks MASK A 1502, MASK B 1504 and MASK C 1506, combined with packed logical OR operations on the three packed shuffle results, provide interleaved RGB data for another four pixels and parts of two others. The resulting data INTERLEAVED A', B' & C' DATA 1548 is shown in Fig. 15I. Note that the two rightmost data elements belong to the sixth pixel, whose red color data R5 was already arranged in the first interleaved data set 1532. The source pixel color data is shifted again by the appropriate number of positions according to the results processed in the second pass. Here, data for five more pixels has been processed for red and blue, so the data elements of the red color data operand DATA A' 1546 and of the blue color data operand DATA C' 1544 are shifted right by five data element positions. Data for six pixels has been processed for green, so the data elements of the green color data operand DATA B' 1542 are shifted right by six positions. The shifted data for this third pass is shown in Fig. 15J. The packed shuffle and logical OR operations are repeated on DATA C'' 1552, DATA A'' 1554 and DATA B'' 1556. The resulting interleaved RGB data for the last of the 16 pixels is shown in Fig. 15J as INTERLEAVED A'', B'' & C'' DATA 1558. The rightmost data element B10 belongs to the eleventh pixel, whose green color data G10 and red color data R10 were already arranged in the second interleaved data set 1548. Thus, through a series of packed shuffles with a set of mask patterns and packed logical OR operations, data from multiple sources 1512, 1514, 1516 can be merged and rearranged together in a desired manner for further use or processing, as in these results 1532, 1548, 1558.

Figure 16 shows a flow chart of one embodiment of a method of shuffling data between multiple registers to generate interleaved data. For example, embodiments of the present method can be applied to generate interleaved pixel data as explained with reference to Figs. 15A-K. Although the present embodiment is described in the context of three data sources or data planes, other embodiments can operate with two or more data planes. These data planes can include color data for one or more image frames. In block 1602, frame data for a set of pixels is available as individual color data from separate planes. The data in the first plane is for red, the data in the second plane is for green, and the data in the third plane is for blue. In block 1604, a set of masks with shuffle control patterns (M1, M2 and M3) is loaded. These shuffle control patterns define the shuffle patterns and the arrangement of the data needed to interleave the colors together. Depending on the implementation, any number of shuffle patterns can be used to generate the desired data configuration.

In block 1606, an appropriate control pattern is selected for each data plane. In this embodiment, the control pattern is selected according to the desired order of the color data and the iteration currently being performed. The frame data from the first data set, for red, is shuffled with a first shuffle control pattern in block 1608 to obtain shuffled red color data. The second data set, for green, is shuffled with a second shuffle control pattern in block 1610 to obtain shuffled green color data. The third data set, for blue, is shuffled with a third shuffle control pattern in block 1612 to obtain shuffled blue color data. Although the three masks and their shuffle control patterns differ from one another in this embodiment, a mask and its shuffle control pattern can be used for more than one data set in each iteration. In addition, some masks can be used more often than others.

In block 1614, the shuffled data of blocks 1608, 1610, 1612 for the three data sets is merged together to form the interleaved result of this pass. For example, the result of the first pass can be the interleaved data 1532 shown in Fig. 15G, in which the RGB data for each pixel is grouped together as a set. In block 1616, a check is made to determine whether more frame data is loaded in the registers for shuffling. If not, a check is made in block 1620 to determine whether more data from the three data planes remains to be interleaved. If not, the method is complete. If block 1620 determines that more plane data remains, the flow returns to block 1602 to load more frame data for shuffling.

If the determination in block 1616 is true, the frame data in each color data plane is shifted by a predetermined count that corresponds to the mask pattern applied to the data set for that particular color during the last pass. For example, following the first-pass example of Fig. 15G, the red, green and blue color data in the first, second and third planes is shifted by six, five and five positions, respectively. Depending on the implementation, the shuffle pattern selected for each color data can differ on each pass, or the same pattern can be reused. During the second pass in one embodiment, the three masks from the first iteration are rotated so that the data of the first plane is now paired with the third mask, the data of the second plane with the first mask, and the data of the third plane with the second mask. This rotation of masks maintains the proper continuity of the interleaved RGB data from one pass to the next, as shown in Figs. 15G and 15I. The shuffling and merging continue as in the first pass. If a third or further iterations are needed, the shuffle mask patterns in this embodiment continue to be rotated among the different data planes to generate more interleaved RGB data.
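The complete flow with shifting and mask rotation can be emulated as below. This is a sketch rather than the patented implementation: the three masks are regenerated by a hypothetical helper, the rotation is expressed as `(plane_no - pass_no) % 3`, which reproduces the pairing described for the second pass, and the shift count is derived from the number of elements each mask selects. List index 0 is the rightmost data element, so a right shift drops elements from the front of the list.

```python
ZERO = 0x80


def packed_byte_shuffle(data, masks):
    """Emulate a 16-byte packed shuffle with a reset-to-zero flag."""
    return [0 if m & ZERO else data[m & 0x0F] for m in masks]


def make_plane_mask(phase, width=16):
    """Select consecutive source elements at every third result position."""
    mask, next_element = [], 0
    for position in range(width):
        if position % 3 == phase:
            mask.append(next_element)
            next_element += 1
        else:
            mask.append(ZERO)
    return mask


def shift_right(data, count):
    """Discard `count` already processed elements and pad with zeros."""
    return data[count:] + [0] * count


def interleave_planes(red, green, blue):
    masks = [make_plane_mask(p) for p in range(3)]                 # block 1604
    consumed = [sum(1 for m in mask if not m & ZERO) for mask in masks]
    planes = [list(red), list(green), list(blue)]                  # block 1602
    stream = []
    for pass_no in range(3):
        block = [0] * 16
        for plane_no in range(3):
            mask_no = (plane_no - pass_no) % 3                     # block 1606
            shuffled = packed_byte_shuffle(planes[plane_no], masks[mask_no])
            block = [x | y for x, y in zip(block, shuffled)]       # block 1614
            # Shift by the number of elements this mask just consumed.
            planes[plane_no] = shift_right(planes[plane_no], consumed[mask_no])
        stream.extend(block)
    return stream


red = [0x10 + i for i in range(16)]
green = [0x20 + i for i in range(16)]
blue = [0x30 + i for i in range(16)]
rgb_stream = interleave_planes(red, green, blue)
print([hex(v) for v in rgb_stream[16:19]])
```

Only three mask sets are ever loaded; the cost of avoiding the other six is one shift per plane per pass.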

Embodiments of algorithms using the packed shuffle instruction in accordance with the present invention can also improve processor and system performance with present hardware resources. But as technology continues to improve, embodiments of the present invention, when combined with greater amounts of hardware resources and faster, more efficient logic circuits, can have an even more profound impact on improving performance. Thus, an efficient implementation of a packed shuffle instruction with byte granularity and a reset-to-zero option can have a different and greater impact across processor generations. Simply adding more resources to modern processor architectures does not by itself guarantee better performance. By also maintaining the efficiency of applications, as with embodiments of the parallel table lookup and the packed byte shuffle instruction (PSHUFB), larger performance improvements are possible.

Although the above examples are described in the general context of 128-bit hardware, registers and operands to simplify the description, other embodiments can use 64-bit or 128-bit hardware, registers and operands to perform packed shuffle operations, parallel table lookups and the rearrangement of data between multiple registers. Furthermore, embodiments of the invention are not limited to specific hardware or technology types such as MMX/SSE/SSE2 technology, and can be used with other SIMD implementations and other graphical data manipulation techniques.

In the foregoing description, the invention has been explained with reference to specific exemplary embodiments. It is evident, however, that various modifications and changes can be made without departing from the spirit and scope of the invention as set forth in the claims. Accordingly, the description and drawings are to be regarded not in a restrictive sense but only as illustrative.

1. A method of providing a shuffle instruction, comprising: receiving a first operand having a set of L data elements; receiving a second operand having a set of L shuffle masks, each associated with a data element of a result, wherein each of the shuffle masks includes a "reset to zero" field and a selection field; and, for each shuffle mask, if the "reset to zero" field of the shuffle mask is not set, moving the data specified by the selection field of the shuffle mask from a data element of the first operand into the associated data element of the result, and, if the "reset to zero" field of the shuffle mask is set, placing a zero in the associated data element of the result.

2. The method according to claim 1, in which each of said L shuffle masks occupies a particular position in the second operand and is associated with the similarly situated data element position in the result.

3. The method according to claim 2, in which each of said L data elements occupies a particular position in the first operand.

4. The method according to claim 3, in which each shuffle mask is intended to specify a data element of the first operand by the position number of the data element.

5. The method according to claim 4, in which each of said shuffle masks contains a "reset to zero" field, intended to indicate whether the data element position associated with that shuffle mask is to be filled with a zero value, and a selection field, intended to specify from which data element of the first operand the data is to be moved.

6. The method according to claim 5, in which each of said shuffle masks further comprises a source selection field.

7. The method according to claim 2, further comprising outputting a resulting data block containing the data that has been shuffled from the first operand in response to the shuffle masks of the second operand.

8. The method according to claim 1, in which each of the data elements contains a byte of data.

9. The method according to claim 8, in which each shuffle mask is one byte in size.

10. The method according to claim 9, in which L is equal to 8 and in which the first operand, the second operand and the result each contain 64-bit packed data.

11. The method according to claim 9, in which L is equal to 16 and in which the first operand, the second operand and the result each contain 128-bit packed data.

12. The method according to claim 1, in which the receiving of the first operand, the receiving of the second operand, and the placing of data into the associated data element of the result are performed in response to receiving a single packed shuffle instruction, which specifies three bits for a first register holding the first operand and specifies three bits for a second register holding the second operand, in which the first and second operands have the same size and each of the L data elements and L shuffle masks has the same size, and in which each of the L shuffle masks is divided into three parts: a first part, which is the "reset to zero" bit, occupying the most significant bit position of each shuffle mask; a second part, which is the position selection field, which has a size of at least log2 L bits and indicates the position of one of said L data elements; and a third part.

13. A device for providing a shuffle instruction, comprising: a memory for storing data and a shuffle instruction, the instruction providing for the shuffling of data of at least one of L data elements of a first operand based on a set of L shuffle masks of a second operand, each associated with a data element of a result, wherein each shuffle mask of said set of L shuffle masks includes a "reset to zero" field and a selection field, and wherein the memory is coupled, through a memory controller hub used to route signals between the processor and the memory, to a processor bus; and a processor comprising an execution unit for executing the shuffle instruction, wherein, for each shuffle mask, the execution unit moves the data specified by the selection field of the shuffle mask from a data element of the first operand into the associated data element of the result if the "reset to zero" field of the shuffle mask is not set, and places a zero in the associated data element of the result if the "reset to zero" field of the shuffle mask is set.

14. The device according to claim 13, in which each of said L shuffle masks occupies a particular position in the second operand and is associated with the similarly situated data element position of the result.

15. The device according to claim 14, in which each individual shuffle mask is intended to specify a data element of the first operand by the position number of the data element.

16. The device according to claim 15, in which each of said shuffle masks contains a "reset to zero" field, intended to indicate whether the data element position associated with that shuffle mask is to be filled with a zero value, and a selection field, intended to specify from which data element of the first operand the data is to be moved.

17. The device according to claim 16, in which each of said shuffle masks further comprises a source selection field.

18. The device according to claim 17, in which the shuffle instruction further provides for the output, from said execution unit, of a result containing L data element positions that have been filled on the basis of the set of L shuffle masks.

19. The device according to claim 13, in which each of the data elements contains a byte of data and each shuffle mask is one byte in size.

20. The device according to claim 19, in which L is equal to 8 and in which the first operand, the second operand and the result each contain 64-bit packed data.

21. The device according to claim 19, in which L is equal to 16 and in which the first operand, the second operand and the result each contain 128-bit packed data.

22. Machine-readable media that stores data representing a preset function, including receiving a first operand having a set of L data elements, the acceptance of the second operand with a set of L masks shuffling associated with the data element of the result, and each of the masks shuffling involves the "reset to zero" and a selection box for each mask shuffling if the "reset to zero" mask shuffling is not set, then move the data specified field of the selection mask shuffling of the data element of the first operand in the data element of the result, and the "reset to zero" mask shuffling installed, the location of the zero in the data element of the result.

23. Machine-readable medium according to article 22, in which the data is stored machine readable media, the present structure of the integrated circuit, which, when made, performs the aforementioned predefined function in response to a single statement.

24. Machine-readable medium according to item 23, in which the mentioned pre-set the second function additionally includes generating the result, with L positions of the data elements that are filled in accordance with said set of L masks shuffling.

25. Machine-readable media according to paragraph 24, in which each of these L masks shuffling associated with similarly situated position of the data element of the result.

26. Machine-readable media on A.25, in which each individual mask shuffling is intended for specifying a data element of the first operand by the position number of the data element.

27. The machine-readable medium according to p, in which each of said data elements contains a byte of data.

28. The machine-readable medium according to p, in which each of said shuffling masks contains a "reset to zero" field, intended to indicate whether the data element position associated with this mask is to be filled with a zero value, and a selection field, used to specify from which data element of the first operand the data are to be moved.

29. The machine-readable medium according to p, in which each of said shuffling masks further comprises a source selection field.

30. The machine-readable medium according to claim 22, in which the data stored on the machine-readable medium represent computer instructions which, when executed by a computer, cause the computer to perform said predefined function.

31. A method of providing a shuffling instruction, comprising: receiving a first operand having a set of L data elements; receiving a second operand having a set of L shuffling masks, where each of the L shuffling masks occupies a specific position in the second operand and is associated with a similarly situated data element position of the result, and each of the L shuffling masks includes a "reset to zero" field; and, for each shuffling mask, determining whether the "reset to zero" field is set, and, if the "reset to zero" field is set, placing a zero in the associated data element position of the result, and if the "reset to zero" field is not set, moving the data from the data element of the first operand specified by said shuffling mask into said associated data element position of the result.

32. The method according to p, in which each of said L shuffling masks occupies a specific position in the second operand and is associated with a similarly situated data element position of the result.

33. The method according to p, in which each of said L shuffling masks contains a "reset to zero" field, intended to indicate whether the data element position associated with this shuffling mask is to be filled with a zero value, and a selection field, used to specify from which data element of the first operand the data are to be moved.

34. The method according to p, in which each mask further comprises a source selection field.

35. The method according to claim 34, in which the first operand, the second operand and the result each contain 64-bit packed data.

36. The method according to claim 34, in which the first operand, the second operand and the result each contain 128-bit packed data.

37. A method of providing a shuffling instruction, comprising: receiving a first operand having a set of L data elements; receiving a second operand having a set of L shuffling masks, each of the L shuffling masks being associated with a similarly situated data element position of the result; and, for each individual shuffling mask, determining whether the "reset to zero" field is set, and, if the "reset to zero" field is set, placing a zero in the associated data element position of the result, and otherwise moving the data from the data element of the first operand specified by said individual shuffling mask into said associated data element position of the result.

38. The method according to claim 37, in which each of said L shuffling masks contains a "reset to zero" field, intended to indicate whether the data element position associated with this mask is to be filled with a zero value, and a selection field, intended to specify from which data element of the first operand the data are to be moved.

39. The method according to claim 38, in which each of said shuffling masks further comprises a source selection field.

40. A device for providing a shuffling instruction, comprising: at least one data storage block; a shuffling mask block, where each shuffling mask includes a "reset to zero" field and a selection field; a first set of multiplexers and a second set of multiplexers; and multiplexer selection and zeroing logic, designed to decode the shuffling masks that said logic receives from the shuffling mask block, to generate selection signals used to control the first and second sets of multiplexers, and to generate a set-to-zero signal used to control the second set of multiplexers; wherein the at least one data storage block is connected by routing lines to the first set of multiplexers, which in turn is connected to the second set of multiplexers, and each multiplexer of the second set outputs a zero if the set-to-zero signal is active, or outputs the value of a data element received from the first set of multiplexers if the set-to-zero signal is not active.

41. The device according to p, in which the set of source data elements is a first packed data operand.

42. The device according to claim 41, in which the set of shuffling masks is a second packed data operand.

43. The device according to claim 41, in which the first and second memory cells are single instruction, multiple data (SIMD) registers.

44. The device according to claim 43, in which the first packed operand has a length of 64 bits and each of the source data elements has a size of one byte, and the second packed operand has a length of 64 bits and each of the shuffling masks has a size of one byte.

45. The device according to claim 43, in which the first packed operand has a length of 128 bits and each of the source data elements has a size of one byte, and the second packed operand has a length of 128 bits and each of the shuffling masks has a size of one byte.

46. A device for providing a shuffling instruction, comprising: at least one data storage block; a shuffling mask block for storing L shuffling masks, each of which includes a "reset to zero" field and a selection field; a set of L multiplexers; and multiplexer selection and zeroing logic, designed to decode the shuffling masks that said logic receives from the shuffling mask block and to generate selection signals and a set-to-zero signal that are supplied to each of the L multiplexers; wherein the at least one data storage block is connected by routing lines to the set of L multiplexers, and each of the multiplexers outputs a zero value if the set-to-zero signal is active, or outputs the value of a data element from the source data storage block if the set-to-zero signal is not active.

47. The device according to claim 46, additionally containing a register with L unique data element positions, where each data element position stores the output of the associated multiplexer.

48. The device according to p, in which L is equal to 16, and M is 16.

49. A data processing system, comprising: a memory for storing data and a shuffling instruction, which provides shuffling of the data of at least one of L data elements of a first operand based on a set of L shuffling masks of a second operand, each associated with a data element of the result, where each shuffling mask includes a "reset to zero" field and a selection field; and a processor connected to the memory by a bus, where the processor provides the shuffling operation and contains a register file for receiving the shuffling instruction from the memory and an execution unit connected to the register file and designed to execute said shuffling instruction; wherein, for each shuffling mask, the execution unit moves the data specified by the source selection field of the shuffling mask from a data element of the first operand into the associated data element of the result if the "reset to zero" field of the shuffling mask is not set, and places a zero into the associated data element of the result if the "reset to zero" field of the shuffling mask is set.

50. The system according to claim 49, in which each of said shuffling masks contains a "reset to zero" field, intended to indicate whether the data element position associated with this shuffling mask is to be filled with a zero value, and a selection field, used to specify from which data element of the first operand the data are to be moved.

51. The system according to claim 50, in which each of said shuffling masks further comprises a source selection field.

52. The system according to claim 49, in which the instruction is a packed byte shuffling instruction with reset to zero.

53. The system according to claim 49, in which each data element has a size of one byte, each shuffling mask has a size of one byte, and L is 8.

54. The system according to claim 49, in which the first operand has a length of 64 bits and the second operand has a length of 64 bits.
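
The shuffling operation recited in the method claims can be illustrated with a minimal software sketch. It is only a model under stated assumptions, not the claimed hardware: the claims do not fix a bit layout for the mask, so the sketch assumes that the "reset to zero" field is the top bit of a one-byte mask and that the selection field is taken from the remaining bits, wrapped to the number of elements L. The names ZERO_FLAG and shuffle_with_zeroing are hypothetical.

```python
from typing import List, Sequence

# Assumption: the "reset to zero" field is bit 7 of a one-byte shuffling mask.
ZERO_FLAG = 0x80


def shuffle_with_zeroing(data: Sequence[int], masks: Sequence[int]) -> List[int]:
    """Model the shuffle: mask i controls result position i."""
    if len(data) != len(masks):
        raise ValueError("both operands must hold the same number L of elements")
    length = len(data)
    result = []
    for mask in masks:
        if mask & ZERO_FLAG:
            # "Reset to zero" field set: a zero is placed in the result element.
            result.append(0)
        else:
            # Assumption: the selection field is the remaining mask bits,
            # wrapped so that it always names a valid source position.
            select = (mask & ~ZERO_FLAG) % length
            result.append(data[select])
    return result


# Example with L = 8 byte-sized data elements.
data = [0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17]
masks = [0x07, 0x80, 0x00, 0x03, 0x83, 0x01, 0x01, 0x06]
print([hex(x) for x in shuffle_with_zeroing(data, masks)])
```

In this example the second and fifth masks have the assumed zero flag set, so the corresponding result positions are zero, while the sixth and seventh masks both select source position 1, which shows that one source element may be copied to several result positions.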



 

Same patents:

FIELD: network communications, in particular, control means built into applications for conduction of network exchange.

SUBSTANCE: expandable communication control means is used for maintaining communication between computing device and remote communication device. In a computer program adapted for using expandable communication control means, information about contacting side is found, and on basis of found contact information it is determined which types of transactions may be used for communication with contacting side at remote communication device. As soon as communication setup function is determined using contacting side information, communication setup request, associated with such a function, is dispatched to communication address. After receipt, expandable communication control means begins conduction of communication with remote communication device.

EFFECT: creation of more flexible and adaptable software communication control means (program components) for processing communications (connections, exchange) between devices.

3 cl, 11 dwg

FIELD: computing devices with configurable number length for long numbers.

SUBSTANCE: device consists of two computing device units, each of them divided into at least four subunits, which consist of a quantity of unit cells. Named units are spatially located so that the distance between unit cell of first unit and equal unit cell in the second unit is minimal. Computing device configuration can be changed using configurational switches, which are installed between device subunits.

EFFECT: increased performance of computing device, reduced time of data processing.

12 cl, 6 dwg

FIELD: engineering of data processing systems that implement "single instruction, multiple data" (SIMD) operations.

SUBSTANCE: a system is disclosed with an instruction (ADD8TO16) that unpacks non-adjacent parts of a data word using sign or zero extension and combines them by means of a SIMD arithmetic operation, such as addition, performed in response to one and the same instruction. The instruction is especially useful in systems having a data path that contains a shifting circuit before the arithmetic circuit.

EFFECT: more efficient use of existing processing resources in a data processing system.

3 cl, 5 dwg

The invention relates to data processing systems having a register bank and supporting vector operations

The invention relates to data processing devices

The invention relates to electronics

The invention relates to the addressing of the registers in the processing unit and can be used for digital signal processing

The invention relates to data processing systems

The invention relates to the field of computer systems and may be used for the execution of floating-point and packed data instructions by a processor

FIELD: physics.

SUBSTANCE: invention pertains to the means of providing for computer architecture. Description is given of the method, system and the computer program for computing the data authentication code. The data are stored in the memory of the computing medium. The memory unit required for computing the authentication code is given through commands. During the computing operation the processor defines one of the encoding methods, which is subject to implementation during computation of the authentication code.

EFFECT: wider functional capabilities of the computing system with provision for new extra commands or instructions with possibility of emulating other architectures.

10 cl, 15 dwg

FIELD: physics; computer technology.

SUBSTANCE: present invention pertains to digital signal processors with configurable multiplier-accumulation units and arithmetic-logical units. The device has a first multiplier-accumulation unit for receiving and multiplying the first and second operands, storage of the obtained result in the first intermediate register, adding it to the third operand, a second multiplier-accumulation unit, for receiving and multiplying the fourth and fifth operands, storage of the obtained result in the second intermediate register, adding the sixth operand or with the stored second intermediate result, or with the sum of the stored first and second intermediate results. Multiplier-accumulation units react on the processor instructions for dynamic reconfiguration between the first configuration, in which the first and second multiplier-accumulation units operate independently, and the second configuration, in which the first and second multiplier-accumulation units are connected and operate together.

EFFECT: faster operation of the device and flexible simultaneous carrying out of different types of operations.

21 cl, 9 dwg

FIELD: information technologies.

SUBSTANCE: a message digest generation command is fetched from memory. In response, the message digest generation operation to be executed is determined on the basis of a previously specified function code, which defines either a message digest calculation operation or a function query operation. If the operation to be executed is a message digest calculation, a message digest calculation that contains a hash coding algorithm is executed on the operand. If the operation to be executed is a function query, the bits of a condition word that correspond to one or several function codes installed in the processor are stored in a parameter block.

EFFECT: expansion of computer field by addition of new commands or instructions.

14 cl, 18 dwg

FIELD: information technology.

SUBSTANCE: present invention relates to computer engineering and can be used in signal processing systems. The device contains an instruction buffer, memory control unit, second level cache memory, integral arithmetic-logic unit (ALU), floating point arithmetic unit and a system controller.

EFFECT: more functional capabilities of the device due to processing signals and images when working with floating point arithmetic.

4 cl, 4 dwg

FIELD: physics; computer engineering.

SUBSTANCE: invention relates to processors with pipeline architecture. The method of correcting an incorrectly early decoded instruction comprises stages on which: the early decoding error is detected and a procedure is called for correcting branching with a destination address for the incorrectly early decoded instruction in response to detection of the said error. The early decoded instruction is evaluated as an instruction, which corresponds to incorrectly predicted branching.

EFFECT: improved processor efficiency.

22 cl, 3 dwg, 1 tbl

FIELD: information technology.

SUBSTANCE: method involves defining a granule which is equal to the smallest length instruction in the instruction set and defining the number of granules making up the longest length instruction in the instruction denoted MAX. The method also involves determining the end of an embedded data segment, when a program is compiled or assembled into the instruction string and inserting a padding of length MAX-1 into the instruction string to the end of the embedded data. Upon pre-decoding of the padded instruction string, a pre-decoder maintains synchronisation with the instructions in the padded instruction string even if embedded data are randomly encoded to resemble an existing instruction in the variable length instruction set.

EFFECT: ensuring reconstruction during repeated synchronisation owing to reduced errors of synchronising the mechanism for pre-decoding the instruction string.

20 cl, 11 dwg
