Instruction and logic circuit for performing a dot product operation

FIELD: information technologies.

SUBSTANCE: a system for performing a dot product operation includes a first memory device that stores a single instruction, multiple data (SIMD) dot product instruction, and a processor coupled to the first memory device to execute the SIMD dot product instruction. The SIMD dot product instruction includes a source operand indicator, a destination operand indicator and at least one immediate value indicator, and the immediate value indicator includes a plurality of control bits.

EFFECT: increased processor efficiency.

28 cl, 18 dwg

 

Technical field of the invention

The present invention relates to the field of processing devices and to the associated software and software sequences that perform mathematical operations.

Background of the invention

Computer systems have become increasingly pervasive in our society. The processing capabilities of computers have increased efficiency and productivity in a wide range of professions. As the cost of purchasing and maintaining a computer continues to fall, more and more consumers are able to take advantage of newer and faster machines. In addition, many people enjoy using portable computers because of the freedom they provide. Mobile computers allow users to easily carry their data and work with it outside the office or while traveling. This scenario is quite familiar to marketing staff, corporate personnel and even students.

As processor technology advances, newer software code is also being written to run on machines with these processors. Users generally expect and demand high performance from their computers regardless of the type of software being used. One issue arises from the kinds of instructions and operations that are actually performed within the processor. Certain types of operations take more time to complete because of the complexity of the operations and/or the type of circuitry required. This provides an opportunity to optimize the way certain complex operations are executed inside the processor.

Multimedia applications have driven microprocessor development for more than a decade. In fact, most computing upgrades in recent years have been driven by multimedia applications. These upgrades have occurred predominantly in the consumer segment, although significant advances have also been seen in the enterprise segment for entertainment-enhanced education and for communications. Nevertheless, future multimedia applications will place even greater demands on computation. As a result, the personal computing experience of tomorrow will be even richer in audiovisual effects and easier to use, and, more importantly, computing will merge with communications.

Accordingly, the display of images and the playback of audio and video data, collectively referred to as content, have become increasingly popular applications for current computing devices. Filtering and convolution are among the most common operations performed on content data such as image, audio and video data. Such operations are computationally intensive but offer a high level of data parallelism that can be exploited through an efficient implementation using various data storage devices, such as single instruction, multiple data (SIMD) registers. A number of current architectures also require multiple operations, instructions or sub-instructions (often called "micro-operations" or "uops") to perform various mathematical operations on a number of operands, thereby reducing throughput and increasing the number of clock cycles required to perform the mathematical operations.

For example, an instruction sequence consisting of a number of instructions may be needed to perform one or more of the operations necessary to generate a dot product, which includes adding the products of two or more numbers represented by various data types within a processing device, system or computer program. However, such prior art techniques may require numerous processing cycles and may cause the processor or system to consume unnecessary power in order to generate the dot product. Furthermore, some prior art techniques may be limited in the operand data types on which they can operate.

Brief description of drawings

The present invention is illustrated by way of example, and not by way of limitation, in the accompanying drawings:

FIG. 1A is a block diagram of a computer system formed with a processor that includes execution units to execute an instruction for a dot product operation in accordance with one embodiment of the present invention;

FIG. 1B is a block diagram of another exemplary computer system in accordance with an alternative embodiment of the present invention;

FIG. 1C is a block diagram of yet another exemplary computer system in accordance with another alternative embodiment of the present invention;

FIG. 2 is a block diagram of the architecture of a processor of one embodiment that includes logic circuits to perform a dot product operation in accordance with the present invention;

FIG. 3A illustrates various packed data type representations in multimedia registers in accordance with one embodiment of the present invention;

FIG. 3B illustrates packed data types in accordance with an alternative embodiment;

FIG. 3C illustrates various signed and unsigned packed data type representations in multimedia registers in accordance with one embodiment of the present invention;

FIG. 3D illustrates one embodiment of an operation encoding (opcode) format;

FIG. 3E illustrates an alternative operation encoding (opcode) format;

FIG. 3F illustrates yet another alternative operation encoding format;

FIG. 4 is a block diagram of one embodiment of a logic circuit to perform a dot product operation on packed data operands in accordance with the present invention;

FIG. 5A is a block diagram of a logic circuit to perform a dot product operation on single-precision packed data operands in accordance with one embodiment of the present invention;

FIG. 5B is a block diagram of a logic circuit to perform a dot product operation on double-precision packed data operands in accordance with one embodiment of the present invention;

FIG. 6A is a block diagram of a circuit for performing a dot product operation in accordance with one embodiment of the present invention;

FIG. 6B is a block diagram of a circuit for performing a dot product operation in accordance with another embodiment of the present invention;

FIG. 7A is a pseudocode representation of operations that may be performed when executing the SPA instruction, in accordance with one embodiment;

FIG. 7B is a pseudocode representation of operations that may be performed when executing the SPD instruction, in accordance with one embodiment.

Detailed description of the invention

The following description presents embodiments of a technique for performing a dot product operation within a processing device, a computer system or a software program. Numerous specific details, such as processor types, micro-architectural conditions, events, enablement mechanisms and the like, are set forth in order to provide a more thorough understanding of the present invention. It will be clear to a person skilled in the art, however, that the invention can be practiced without such specific details. In addition, some well-known structures, circuits and the like have not been shown in detail, to avoid unnecessarily obscuring the present invention.

Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. The same techniques and teachings of the present invention can easily be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance. The teachings of the present invention are applicable to any processor or machine that performs data manipulations. Moreover, the present invention is not limited to processors or machines that perform 256-bit, 128-bit, 64-bit, 32-bit or 16-bit data operations, and can be applied to any processor and machine in which manipulation of packed data is needed.

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. A person skilled in the art will appreciate, however, that these specific details are not required in order to practice the present invention. In other instances, well-known electrical structures and circuits have not been described in particular detail, so as not to obscure the present invention unnecessarily. In addition, the following description provides examples, and the accompanying drawings show various examples for purposes of illustration. These examples should not be construed in a limiting sense, however, because they are intended merely to provide examples of the present invention rather than an exhaustive list of all its possible embodiments.

Although the examples below describe instruction handling and distribution in the context of execution units and logic circuits, other embodiments of the present invention can be accomplished by way of software. In one embodiment, the methods of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the present invention. The present invention may be provided as a computer program product, or software, which may include a machine-readable or computer-readable medium having stored on it instructions that can be used to program a computer (or other electronic devices) to perform a method in accordance with the present invention. Alternatively, the steps of the present invention may be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components. Such software can be stored in a memory in the system. Similarly, the code can be distributed via a network or by way of other machine-readable media.

Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memory (CD-ROM), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, transmission over the Internet, and electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.) or the like. Accordingly, the machine-readable medium includes any type of medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer). Moreover, the present invention may also be downloaded as a computer program product. The program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client). The transfer of the program may take the form of electrical, optical, acoustical or other data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem, a network connection or the like).

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. In addition, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs at some stage reach a level of data representing the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. An optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or a magnetic or optical storage medium such as a disk may be the machine-readable medium. Any of these media may "carry" or "indicate" the design or software information. When an electrical carrier wave indicating or carrying the code or design is transmitted, a new copy is made to the extent that copying, buffering or retransmission of the electrical signal is performed. Thus, a communication provider or a network provider may make copies of an article (a carrier wave) embodying techniques of the present invention.

Modern processors use a number of different execution units to process and execute a variety of code and instructions. Not all instructions are created equal: some complete quickly, while others can take an enormous number of clock cycles. The higher the instruction throughput, the better the overall performance of the processor. It would therefore be advantageous to have as many instructions as possible execute as quickly as possible. However, certain instructions are more complex and require more execution time and more processor resources. Examples include floating-point instructions, load/store operations, data moves and so on.

As more and more computer systems are used in Internet and multimedia applications, additional processor support has been introduced over time. For example, single instruction, multiple data (SIMD) integer/floating-point instructions and Streaming SIMD Extensions (SSE) are instructions that reduce the total number of instructions required to execute a particular program task, which in turn can reduce power consumption. These instructions can speed up software by operating on multiple data elements in parallel. As a result, performance gains can be achieved in a wide range of applications, including video, speech and image/photo processing. Implementing SIMD instructions in microprocessors and similar types of logic circuits usually involves a number of issues. Furthermore, the complexity of SIMD operations often leads to a need for additional circuitry in order to process and manipulate the data correctly.

At present, a SIMD dot product instruction is not available. Without a SIMD dot product instruction, a large number of instructions and data registers may be needed to achieve the same results in applications such as audio/video compression, processing and manipulation. Thus, at least one dot product instruction in accordance with embodiments of the present invention can reduce code overhead and resource requirements. Embodiments of the present invention provide a way to implement a dot product operation as an algorithm that makes use of SIMD-related hardware. At present, it is somewhat difficult and tedious to perform dot product operations on data held in a SIMD register. Some algorithms require more instructions to arrange the data for arithmetic operations than the number of instructions that actually execute those operations. By implementing a dot product operation in accordance with embodiments of the present invention, the number of instructions needed for dot product processing can be greatly reduced.
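To illustrate the point about data-arrangement overhead, the following is a minimal Python sketch that models a four-element SIMD register as a list. The helper names (`packed_mul`, `shuffle`, `packed_add`, `dot_with_generic_ops`, `dot_single_op`) and the step counts are assumptions made for illustration only; they do not correspond to any real instruction set or to the circuits of the invention.

```python
from typing import List, Tuple


def packed_mul(a: List[float], b: List[float]) -> List[float]:
    """Element-wise multiply of two modelled SIMD registers."""
    return [x * y for x, y in zip(a, b)]


def packed_add(a: List[float], b: List[float]) -> List[float]:
    """Element-wise add of two modelled SIMD registers."""
    return [x + y for x, y in zip(a, b)]


def shuffle(a: List[float], order: List[int]) -> List[float]:
    """Rearrange elements; stands in for a data-arrangement step."""
    return [a[i] for i in order]


def dot_with_generic_ops(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Dot product of two 4-element registers built from generic packed steps.

    Returns the result and the number of modelled steps used.
    """
    steps = 0
    prod = packed_mul(a, b)
    steps += 1
    swapped = shuffle(prod, [1, 0, 3, 2])   # arrange neighbours for adding
    steps += 1
    pairs = packed_add(prod, swapped)
    steps += 1
    crossed = shuffle(pairs, [2, 3, 0, 1])  # arrange the two pair sums
    steps += 1
    total = packed_add(pairs, crossed)
    steps += 1
    return total[0], steps


def dot_single_op(a: List[float], b: List[float]) -> Tuple[float, int]:
    """The same result modelled as one dedicated dot product step."""
    return sum(x * y for x, y in zip(a, b)), 1
```

In this model, two of the five generic steps only rearrange data, which is the kind of overhead a dedicated dot product operation is meant to avoid.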

Embodiments of the present invention involve an instruction for implementing a dot product operation. A dot product operation generally involves multiplying at least two values and adding that product to the product of at least two other values. Other variations may be made on the generic dot product algorithm, including adding the results of various dot product operations to generate another dot product. For example, a dot product operation in accordance with one embodiment, as applied to data elements, can be generically represented as:

DEST1←SRC1*SRC2;

DEST2←SRC3*SRC4;

DEST3←DEST1+DEST2.

For a packed SIMD data operand, this flow can be applied to each data element of each operand.

In the above flow, "DEST" and "SRC" are generic terms that represent the source and destination of the corresponding data or operation. In some embodiments they may be implemented by registers, memory or other storage areas having names or functions other than those shown. For example, in one embodiment DEST1 and DEST2 may be a first and a second temporary storage area (e.g., the "TEMP1" and "TEMP2" registers), SRC1 and SRC3 may be a first and a second destination storage area (e.g., the "DEST1" and "DEST2" registers), and so on. In other embodiments, two or more of the SRC and DEST storage areas may correspond to different data storage elements within the same storage area (e.g., a SIMD register). Furthermore, in one embodiment, a dot product operation may generate the sum of the dot products produced by the above generic flow.
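The generic flow can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the function names `dot_product_flow` and `packed_dot_product` are hypothetical, the packed operands are modelled as plain sequences, and an even element count is assumed so that the elements can be consumed two at a time.

```python
from typing import Sequence


def dot_product_flow(src1: float, src2: float, src3: float, src4: float) -> float:
    """One pass of the generic flow: two products, then their sum."""
    dest1 = src1 * src2      # DEST1 <- SRC1 * SRC2
    dest2 = src3 * src4      # DEST2 <- SRC3 * SRC4
    dest3 = dest1 + dest2    # DEST3 <- DEST1 + DEST2
    return dest3


def packed_dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of the results of the generic flow over two packed operands.

    Elements are taken two at a time, so an even element count is assumed.
    """
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("operands must have the same, even number of elements")
    return sum(
        dot_product_flow(a[i], b[i], a[i + 1], b[i + 1])
        for i in range(0, len(a), 2)
    )
```

For example, `packed_dot_product([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])` applies the flow twice (giving 17.0 and 53.0) and adds the two results.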

FIG. 1A is a block diagram of an exemplary computer system formed with a processor that includes execution units to execute an instruction for a dot product operation in accordance with one embodiment of the present invention. System 100 includes a component, such as a processor 102, that employs execution units including logic to perform algorithms for processing data in accordance with the present invention, such as in the embodiment described here. System 100 is representative of processing systems based on the PENTIUM® III, PENTIUM® 4, Xeon™, Itanium®, XScale™ and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, Calif., although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cell phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs) and laptops. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches or any other system that performs dot product operations on operands. Furthermore, some architectures have been implemented to allow instructions to operate on several data items simultaneously in order to improve the efficiency of multimedia applications. As the types and volume of data increase, computers and their processors have to be enhanced to manipulate the data more efficiently.

FIG. 1A is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm for calculating the dot product of data elements from one or more operands in accordance with one embodiment of the present invention. One embodiment may be described in the context of a single-processor desktop or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a hub architecture. Computer system 100 includes a processor 102 to process data signals. The processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. The processor 102 is coupled to a processor bus 110 that can transmit data signals between the processor 102 and other components in the system 100. The elements of system 100 perform their conventional functions, which are well known to those skilled in the art.

In one embodiment, the processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside external to the processor 102. Other embodiments can also include a combination of internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers and an instruction pointer register.

Execution unit 108, which includes logic to perform integer and floating-point operations, also resides in the processor 102. The processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. In this embodiment, execution unit 108 includes logic to handle a packed instruction set 109. In one embodiment, the packed instruction set 109 includes a packed dot product instruction for calculating the dot product of a number of operands. By including the packed instruction set 109 in the instruction set of a general-purpose processor 102, along with the associated circuitry to execute the instructions, the operations used by many multimedia applications can be performed using packed data in the general-purpose processor 102. Thus, many multimedia applications can be accelerated and executed more efficiently by using the full width of the processor's data bus to perform operations on packed data. This can eliminate the need to transfer smaller units of data across the processor's data bus in order to perform one or more operations one data element at a time.
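The benefit of moving packed data over the full bus width can be pictured with a short Python sketch. It assumes, purely for illustration, a 128-bit unit holding four single-precision elements; the helper names `pack_floats`, `unpack_floats` and `transfers_needed` are hypothetical and model only the data layout and transfer count, not any hardware.

```python
import struct
from typing import List, Sequence


def pack_floats(values: Sequence[float]) -> bytes:
    """Pack four single-precision values into one 16-byte (128-bit) unit."""
    if len(values) != 4:
        raise ValueError("exactly four elements are expected")
    return struct.pack("<4f", *values)


def unpack_floats(packed: bytes) -> List[float]:
    """Recover the four single-precision elements from a packed unit."""
    return list(struct.unpack("<4f", packed))


def transfers_needed(element_count: int, elements_per_transfer: int) -> int:
    """Number of bus transfers needed to move the given number of elements."""
    return -(-element_count // elements_per_transfer)  # ceiling division
```

Under these assumptions, four elements moved one at a time take `transfers_needed(4, 1)` transfers, while the same four elements moved as one packed unit take `transfers_needed(4, 4)`.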

Alternative embodiments of execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device or another memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by the processor 102.

A system logic chip 116 is coupled to the processor bus 110 and to memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). The processor 102 can communicate with the MCH 116 via the processor bus 110. The MCH 116 provides a high-bandwidth memory path 118 to memory 120 for storing instructions and data and for storing graphics commands, data and textures. The MCH 116 directs data signals between the processor 102, memory 120 and other components in the system 100, and bridges the data signals between the processor bus 110, memory 120 and system I/O 122. In some embodiments, the system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. The MCH 116 is coupled to memory 120 through a memory interface 118. The graphics card 112 is coupled to the MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a proprietary hub interface bus 122 to couple the MCH 116 to the I/O controller hub (ICH) 130. The ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to the memory 120, the chipset and the processor 102. Some examples are the audio controller, a firmware hub (flash BIOS) 128, a wireless transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. The data storage device 124 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device or another mass storage device.

In another embodiment of a system, an execution unit that executes an algorithm with a dot product instruction can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. In addition, other logic blocks, such as a memory controller or a graphics controller, can also be located on the system on a chip.

FIG. 1B illustrates a data processing system 140 that implements the principles of one embodiment of the present invention. A person skilled in the art will readily appreciate that the embodiments described here can be used with alternative processing systems without departing from the scope of the invention.

Computer system 140 comprises a processing core 159 capable of performing SIMD operations, including a dot product operation. In one embodiment, processing core 159 represents a processing unit of any type of architecture, including, but not limited to, a CISC, RISC or VLIW architecture. Processing core 159 may also be suitable for manufacture in one or more process technologies and, by being represented in sufficient detail on a machine-readable medium, may be suitable to facilitate that manufacture.

Processing core 159 comprises an execution unit 142, a set of register file(s) 145 and a decoder 144. Processing core 159 also includes additional circuitry (not shown) that is not necessary for an understanding of the present invention. Execution unit 142 is used to execute instructions received by processing core 159. In addition to recognizing typical processor instructions, execution unit 142 can recognize instructions in packed instruction set 143 for performing operations on packed data formats. Packed instruction set 143 includes instructions for supporting dot product operations and may also include other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. Register file 145 represents a storage area on processing core 159 for storing information, including data. As mentioned above, it should be understood that the particular storage area used to store the packed data is not critical. Execution unit 142 is coupled to decoder 144. Decoder 144 is used to decode instructions received by processing core 159 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 142 performs the appropriate operations.
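The relationship between the decoder and the execution unit can be modelled in software as a lookup followed by a dispatch. The sketch below is an illustration under stated assumptions, not the actual decode logic: the mnemonics `ADD` and `PACKED_DOT`, the `ControlSignal` record and the functions `decode` and `execute` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ControlSignal:
    """Stand-in for the control signals produced by the decoder."""
    operation: str
    is_packed: bool


# Hypothetical mnemonics; a real decoder works on binary opcodes.
DECODE_TABLE: Dict[str, ControlSignal] = {
    "ADD": ControlSignal(operation="add", is_packed=False),
    "PACKED_DOT": ControlSignal(operation="dot", is_packed=True),
}


def decode(mnemonic: str) -> ControlSignal:
    """Translate a received instruction into a control signal."""
    try:
        return DECODE_TABLE[mnemonic]
    except KeyError:
        raise ValueError(f"unrecognized instruction: {mnemonic}") from None


def execute(signal: ControlSignal, a: List[float], b: List[float]) -> List[float]:
    """Perform the operation selected by the control signal."""
    if signal.operation == "add":
        return [x + y for x, y in zip(a, b)]
    if signal.operation == "dot":
        return [sum(x * y for x, y in zip(a, b))]
    raise ValueError(f"unsupported operation: {signal.operation}")
```

For example, `execute(decode("PACKED_DOT"), [1.0, 2.0], [3.0, 4.0])` first looks up the control signal and then performs the multiply-and-sum that it selects.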

Processing core 159 is coupled to a bus 141 for communicating with various other system devices, which may include, but are not limited to, synchronous dynamic random access memory (SDRAM) control 146, static random access memory (SRAM) control 147, burst flash memory interface 148, Personal Computer Memory Card International Association (PCMCIA)/compact flash (CF) card control 149, liquid crystal display (LCD) control 150, direct memory access (DMA) controller 151 and an alternative bus master interface 152. In one embodiment, data processing system 140 may also comprise an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices may include, but are not limited to, a universal asynchronous receiver/transmitter (UART) 155, a universal serial bus (USB) 156, a Bluetooth wireless UART 157 and an I/O expansion interface 158.

In one embodiment, data processing system 140 provides for mobile, network and/or wireless communications and contains a processing core 159 capable of performing SIMD operations, including a dot product operation. Processing core 159 may be programmed with various audio, video, imaging and communications algorithms, including discrete transforms such as the Walsh-Hadamard transform, the fast Fourier transform (FFT), the discrete cosine transform (DCT) and their respective inverse transforms; compression/decompression techniques such as color space transformation, motion estimation for video encoding or motion compensation for video decoding; and modulation/demodulation (MODEM) functions such as pulse code modulation (PCM). Some embodiments of the invention may also be applied to graphics applications, such as three-dimensional (3D) modeling, rendering, object collision detection, 3D object transformation and lighting, and so on.

FIG. 1C illustrates further alternative embodiments of a data processing system capable of performing SIMD dot product operations. In accordance with one alternative embodiment, data processing system 160 may include a main processor 166, a SIMD coprocessor 161, a cache memory 167 and an input/output system 168. The input/output system 168 may optionally be coupled to a wireless interface 169. SIMD coprocessor 161 is capable of performing SIMD operations, including dot product operations. Processing core 170 may be suitable for manufacture in one or more process technologies and, by being represented in sufficient detail on a machine-readable medium, may be suitable to facilitate the manufacture of all or part of data processing system 160, including processing core 170.

In one embodiment, SIMD coprocessor 161 comprises an execution unit 162 and a set of register file(s) 164. One embodiment of main processor 166 comprises a decoder 165 to recognize instructions of instruction set 163, including SIMD dot product calculation instructions, for execution by execution unit 162. In alternative embodiments, SIMD coprocessor 161 also comprises at least part of decoder 165B to decode instructions of instruction set 163. Processing core 170 also includes additional circuitry (not shown) which is not necessary to the understanding of embodiments of the present invention.

In operation, the main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with the cache memory 167 and the input/output system 168. Embedded within the stream of data processing instructions are SIMD coprocessor instructions. The decoder 165 of main processor 166 recognizes these SIMD coprocessor instructions as being of a type that should be executed by an attached SIMD coprocessor 161. Accordingly, the main processor 166 issues these SIMD coprocessor instructions (or control signals representing SIMD coprocessor instructions) on the coprocessor bus 166, from which they are received by any attached SIMD coprocessors. In this case, the SIMD coprocessor 161 will accept and execute any received SIMD coprocessor instructions intended for it.

Data may be received via wireless interface 169 for processing by the SIMD coprocessor instructions. As one example, voice communication may be received in the form of a digital signal, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples representative of the voice communication. As another example, compressed audio and/or video may be received in the form of a digital bit stream, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples and/or motion video frames. In one embodiment of processing core 170, main processor 166 and SIMD coprocessor 161 are integrated into a single processing core 170 comprising an execution unit 162, a set of register file(s) 164, and a decoder 165 to recognize instructions of instruction set 163, including SIMD dot product instructions.

FIG. 2 is a block diagram of the micro-architecture of a processor 200 that includes logic circuits to perform a dot product instruction in accordance with one embodiment of the present invention. In one embodiment, the dot product instruction can multiply a first data element by a second data element and add this product to the product of a third and a fourth data element. In some embodiments, the dot product instruction can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as on data types such as single and double precision integer and floating point data types. In one embodiment, the in-order front end 201 is the part of the processor 200 that fetches the macro-instructions to be executed and prepares them to be used later in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches macro-instructions from memory and feeds them to an instruction decoder 228, which in turn decodes them into primitives called micro-instructions or micro-operations (also called micro-ops or uops) that the machine can execute. In one embodiment, the trace cache 230 takes decoded uops and assembles them into program-ordered sequences, or traces, in the uop queue 234 for execution. When the trace cache 230 encounters a complex macro-instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Many macro-instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete a macro-instruction, the decoder 228 accesses the microcode ROM 232 to perform the macro-instruction. In one embodiment, a packed dot product instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction for a packed dot product algorithm can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine the correct micro-instruction pointer for reading the microcode sequences for the dot product algorithm in the microcode ROM 232. After the microcode ROM 232 finishes sequencing micro-ops for the current macro-instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230.
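The decode rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names MAX_DECODER_UOPS and decode_path are hypothetical, and the threshold of four micro-ops is the one given for the embodiment described above.

```python
# Hypothetical threshold taken from the embodiment described in the text:
# macro-instructions needing more than four micro-ops go to the microcode ROM.
MAX_DECODER_UOPS = 4


def decode_path(uop_count: int) -> str:
    """Return which unit supplies the micro-ops for a macro-instruction."""
    if uop_count < 1:
        raise ValueError("a macro-instruction needs at least one micro-op")
    if uop_count > MAX_DECODER_UOPS:
        return "microcode_rom"   # sequence read from microcode ROM 232
    return "decoder"             # handled directly by instruction decoder 228
```

Under this sketch, a dot product instruction that decodes into a small number of micro-ops stays on the decoder path, while a longer sequence would be fetched from the microcode ROM.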

Some SIMD and other multimedia types of instructions are considered complex instructions. Most floating point related instructions are also complex instructions. As such, when the instruction decoder 228 encounters a complex macro-instruction, the microcode ROM 232 is accessed at the appropriate location to retrieve the microcode sequence for that macro-instruction. The various micro-ops needed to perform that macro-instruction are communicated to the out-of-order execution engine 203 for execution at the appropriate integer and floating point execution units.

The out-of-order execution engine 203 is where the micro-instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and re-order the flow of micro-instructions to optimize performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each uop in one of the two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: memory scheduler, fast scheduler 202, slow/general floating point scheduler 204, and simple floating point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the uop needs to complete its operation. The fast scheduler 202 of this embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.
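The readiness test applied by the schedulers can be illustrated with a minimal Python sketch. The Uop class, its field names, and the register and unit labels are hypothetical and introduced only for illustration; the sketch models nothing more than the two conditions named above, namely that all source operands are ready and that the required execution resource is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    """Hypothetical micro-op record used only for this illustration."""
    name: str
    sources: tuple   # names of the registers this uop reads
    unit: str        # execution resource the uop needs


def is_ready(uop: Uop, ready_registers: set, free_units: set) -> bool:
    """A uop may be dispatched when its inputs and its execution unit are ready."""
    inputs_ready = all(src in ready_registers for src in uop.sources)
    unit_free = uop.unit in free_units
    return inputs_ready and unit_free


mul = Uop("mul", ("r1", "r2"), "fp_alu")
```

With this sketch, the multiply uop becomes ready only once both r1 and r2 hold valid data and the floating point ALU is free.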

Register files 208, 210 sit between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 in the execution block 211. There is a separate register file 208, 210 for integer and floating point operations, respectively. Each register file 208, 210 of this embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating point register file 210 are also capable of communicating data with each other. In one embodiment, the integer register file 208 is split into two separate register files, one register file for the low-order 32 bits of data and a second register file for the high-order 32 bits of data. The floating point register file 210 of one embodiment has 128-bit wide entries, because floating point instructions typically have operands from 64 to 128 bits in width.

The execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210, which store the integer and floating point data operand values that the micro-instructions need to execute. The processor 200 of this embodiment comprises a number of execution units: address generation unit (AGU) 212, AGU 214, fast arithmetic logic unit (ALU) 216, fast ALU 218, slow ALU 220, floating point ALU 222, and floating point move unit 224. In this embodiment, the floating point execution blocks 222, 224 execute floating point, MMX, SIMD, and SSE operations. The floating point ALU 222 of this embodiment includes a 64-bit by 64-bit floating point divider to execute divide, square root, and remainder micro-ops. In embodiments of the present invention, any act involving a floating point value occurs in the floating point hardware. For example, conversions between integer format and floating point format involve the floating point register file. Similarly, a floating point divide operation happens at the floating point divider. On the other hand, non-floating-point numbers and integer types are handled with integer hardware resources. The simple, very frequent ALU operations go to the high-speed ALU execution units 216, 218. The fast ALUs 216, 218 of this embodiment can execute fast operations with an effective latency of half a clock cycle. In one embodiment, most complex integer operations go to the slow ALU 220, as the slow ALU 220 includes integer execution hardware for long-latency operations, such as a multiplier, shifts, flag logic, and branch processing.
Memory load/store operations are executed by the AGUs 212, 214. In this embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data widths, including 16, 32, 128, 256, etc. bits. Similarly, the floating point units 222, 224 can be implemented to support a range of operands having various bit widths. In one embodiment, the floating point units 222, 224 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

In this embodiment, the uop schedulers 202, 204, 206 dispatch dependent operations before the parent load has finished executing. Because uops are speculatively scheduled and executed in processor 200, the processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed; the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instruction sequences for dot product operations.
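The selection of operations to replay can be pictured as a walk over the dependence graph that starts at the load that missed. The following Python sketch is an illustration only; the function name uops_to_replay and the consumers mapping are hypothetical and do not describe the actual replay circuitry.

```python
def uops_to_replay(missed_load: str, consumers: dict) -> set:
    """Collect every uop that directly or indirectly consumed the missed load.

    `consumers` maps a producer uop name to the list of uop names that read
    its result. Uops not reachable from the missed load are independent and
    are therefore allowed to complete.
    """
    replay = set()
    pending = [missed_load]
    while pending:
        producer = pending.pop()
        for dependent in consumers.get(producer, []):
            if dependent not in replay:
                replay.add(dependent)
                pending.append(dependent)
    return replay


# "mul" reads the load, "add" reads "mul"; "sub" depends only on "other".
graph = {"load": ["mul"], "mul": ["add"], "other": ["sub"]}
```

In this example only the multiply and the add are replayed, while the subtract completes normally.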

The term "registers" is used herein to refer to the on-board processor storage locations that are used as part of macro-instructions to identify operands. In other words, the registers referred to herein are those that are visible from outside the processor (from a programmer's perspective). However, the registers of an embodiment should not be limited in meaning to a particular type of circuit. Rather, a register of an embodiment need only be capable of storing and providing data and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store thirty-two bit integer data. A register file of one embodiment also contains sixteen XMM and general purpose registers and eight multimedia (e.g., "EM64T" additions) SIMD registers for packed data. For the discussion below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit wide MMX(TM) registers (also referred to as 'mm' registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit wide XMM registers relating to SSE2, SSE3, SSE4, or later technology (referred to generically as "SSEx") can also be used to hold such packed data operands. In this embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types.

In the examples of the following figures, a number of data operands are described. FIG. 3A illustrates various packed data type representations in multimedia registers according to one embodiment of the present invention. FIG. 3A illustrates data types for a packed byte 310, a packed word 320, and a packed doubleword (dword) 330 for 128-bit wide operands. The packed byte format 310 of this example is 128 bits long and contains sixteen packed byte data elements. A byte is defined here as 8 bits of data. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 120 through bit 127 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. In addition, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in parallel.
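The bit positions listed above follow a simple rule: byte i occupies bits 8i+7 through 8i. A minimal Python sketch of this layout is given below, modelling the 128-bit operand as an ordinary integer; the helper names packed_byte and byte_bit_range are hypothetical.

```python
def byte_bit_range(index: int) -> tuple:
    """Return (high_bit, low_bit) occupied by byte `index` of a 128-bit operand."""
    if not 0 <= index < 16:
        raise ValueError("a 128-bit operand holds byte elements 0..15")
    return 8 * index + 7, 8 * index


def packed_byte(register: int, index: int) -> int:
    """Extract byte `index` from a 128-bit packed-byte value held as an int."""
    _, low_bit = byte_bit_range(index)
    return (register >> low_bit) & 0xFF


# Build an example operand in which byte i holds the value i.
reg = sum(i << (8 * i) for i in range(16))
```

Because every one of the 128 bits belongs to exactly one byte element, a single operation applied to the whole register touches all sixteen elements at once.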

Generally, a data element is an individual piece of data that is stored in a single register or memory location with other data elements of the same length. In packed data sequences relating to SSEx technology, the number of data elements stored in an XMM register is 128 bits divided by the length in bits of an individual data element. Similarly, in packed data sequences relating to MMX and SSE technology, the number of data elements stored in an MMX register is 64 bits divided by the length in bits of an individual data element. Although the data types illustrated in FIG. 3A are 128 bits long, embodiments of the present invention can also operate with 64-bit wide operands or operands of other sizes. The packed word format 320 of this example is 128 bits long and contains eight packed word data elements. Each packed word contains sixteen bits of information. The packed doubleword format 330 of FIG. 3A is 128 bits long and contains four packed doubleword data elements. Each packed doubleword data element contains thirty-two bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
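The element counts given above all follow from one division. A short Python sketch, with the hypothetical helper element_count, reproduces them for 128-bit XMM and 64-bit MMX registers.

```python
def element_count(register_bits: int, element_bits: int) -> int:
    """Number of packed elements: register width divided by element width."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


# Element widths in bits as defined in the text.
ELEMENT_BITS = {"byte": 8, "word": 16, "doubleword": 32, "quadword": 64}

xmm_counts = {name: element_count(128, bits) for name, bits in ELEMENT_BITS.items()}
mmx_counts = {name: element_count(64, bits) for name, bits in ELEMENT_BITS.items()}
```

For a 128-bit register this gives sixteen bytes, eight words, four doublewords, or two quadwords, matching the formats described for FIG. 3A.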

FIG. 3B illustrates alternative in-register data storage formats. Each packed data item can include more than one independent data element. Three packed data formats are illustrated: packed half 341, packed single 342, and packed double 343. In one embodiment, packed half 341, packed single 342, and packed double 343 contain fixed-point data elements. In an alternative embodiment, one or more of packed half 341, packed single 342, and packed double 343 may contain floating-point data elements. One alternative embodiment of packed half 341 is one hundred twenty-eight bits long and contains eight 16-bit data elements. One embodiment of packed single 342 is one hundred twenty-eight bits long and contains four 32-bit data elements. One embodiment of packed double 343 is one hundred twenty-eight bits long and contains two 64-bit data elements. It will be appreciated that such packed data formats may be further extended to other register lengths, for example, to 96 bits, 160 bits, 192 bits, 224 bits, 256 bits or more.
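As an illustration of the packed single and packed double formats in their floating-point variant, the following Python sketch packs values into a 128-bit integer using the standard struct module. It assumes IEEE 754 encodings and that element 0 occupies the least significant bits; neither assumption is stated in the passage above, and the function names are hypothetical.

```python
import struct


def pack_single(values) -> int:
    """Pack four single-precision values into a 128-bit integer (element 0 lowest)."""
    if len(values) != 4:
        raise ValueError("packed single holds exactly four 32-bit elements")
    raw = struct.pack("<4f", *values)       # little-endian, four 32-bit floats
    return int.from_bytes(raw, "little")


def unpack_single(register: int) -> list:
    """Recover the four single-precision elements from a 128-bit integer."""
    raw = register.to_bytes(16, "little")
    return list(struct.unpack("<4f", raw))


def pack_double(values) -> int:
    """Pack two double-precision values into a 128-bit integer (element 0 lowest)."""
    if len(values) != 2:
        raise ValueError("packed double holds exactly two 64-bit elements")
    return int.from_bytes(struct.pack("<2d", *values), "little")
```

Values that are exactly representable in single precision, such as 1.0 or 2.5, survive a pack and unpack round trip unchanged.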

FIG. 3C illustrates various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention. Unsigned packed byte representation 344 illustrates the storage of an unsigned packed byte in a SIMD register. Information for each byte data element is stored in bit seven through bit zero for byte zero, bit fifteen through bit eight for byte one, bit twenty-three through bit sixteen for byte two, and finally bit one hundred twenty through bit one hundred twenty-seven for byte fifteen. Thus, all available bits are used in the register. This storage arrangement can increase the storage efficiency of the processor. In addition, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in parallel. Signed packed byte representation 345 illustrates the storage of a signed packed byte. Note that the eighth bit of every byte data element is the sign indicator. Unsigned packed word representation 346 illustrates how word seven through word zero are stored in a SIMD register. Signed packed word representation 347 is similar to the unsigned packed word in-register representation 346. Note that the sixteenth bit of each word data element is the sign indicator. Unsigned packed doubleword representation 348 shows how doubleword data elements are stored. Signed packed doubleword representation 349 is similar to the unsigned packed doubleword in-register representation 348. Note that the necessary sign bit is the thirty-second bit of each doubleword data element.
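The difference between the signed and unsigned representations lies only in how the most significant bit of each element is interpreted. The Python sketch below illustrates this under the assumption of two's complement encoding, which the passage does not state explicitly; the helper names split_elements and as_signed are hypothetical.

```python
def split_elements(register: int, element_bits: int, register_bits: int = 128) -> list:
    """Split a packed register into unsigned elements, element 0 first."""
    mask = (1 << element_bits) - 1
    count = register_bits // element_bits
    return [(register >> (i * element_bits)) & mask for i in range(count)]


def as_signed(value: int, element_bits: int) -> int:
    """Reinterpret an unsigned element as signed (two's complement assumed)."""
    sign_bit = 1 << (element_bits - 1)   # 8th, 16th, or 32nd bit of the element
    return value - (1 << element_bits) if value & sign_bit else value


# Byte 0 holds 0xFF and byte 1 holds 0x7F; all other bytes are zero.
reg = 0xFF | (0x7F << 8)
```

Here byte 0 reads as 255 when unsigned and as -1 when signed, while byte 1 reads as 127 in both representations because its sign indicator is clear.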

FIG. 3D depicts one embodiment of an operation encoding (opcode) format 360, having thirty-two bits, and register/memory operand addressing modes corresponding to a type of opcode format described in the "IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference," which is available from Intel Corporation, Santa Clara, California on the world wide web (www) at intel.com/design/litcentr. In one embodiment, a dot product operation may be encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. In one embodiment of the dot product instruction, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. In an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment of a dot product instruction, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the results of the dot product operations, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. In one embodiment of the dot product instruction, operand identifiers 364 and 365 may be used to identify 32-bit or 64-bit source and destination operands.

FIG. 3E depicts another alternative operation encoding (opcode) format 370, having forty or more bits. Opcode format 370 corresponds to opcode format 360 and comprises an optional prefix byte 378. The type of dot product operation may be encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction may be identified by source operand identifiers 374 and 375 and by prefix byte 378. In one embodiment of the dot product instruction, prefix byte 378 may be used to identify 32-bit or 64-bit source and destination operands. In one embodiment of the dot product instruction, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. In an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment, the dot product operation multiplies one of the operands identified by operand identifiers 374 and 375 by the other operand identified by operand identifiers 374 and 375, and one of those operands is overwritten by the results of the dot product operation, whereas in other embodiments the dot product of the operands identified by identifiers 374 and 375 is written to another data element in another register. Opcode formats 360 and 370 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.

Turning next to FIG. 3F, in some alternative embodiments, 64-bit single instruction multiple data (SIMD) arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Operation encoding (opcode) format 380 depicts one such CDP instruction having CDP opcode fields 382 and 389. The type of CDP instruction, for alternative embodiments of dot product operations, may be encoded by one or more of fields 383, 384, 387, and 388. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the coprocessor can operate on 8-, 16-, 32-, and 64-bit values. In one embodiment, the dot product operation is performed on integer data elements. In some embodiments, a dot product instruction may be executed conditionally, using condition field 381. For some dot product instructions, source data sizes may be encoded by field 383. In some embodiments of the dot product instruction, zero (Z), negative (N), carry (C), and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 384.

FIG. 4 is a block diagram of one embodiment of logic to perform a dot product operation on packed data operands in accordance with the present invention. Embodiments of the present invention can be implemented to function with various types of operands, such as those described above. In one implementation, dot product operations in accordance with the present invention are implemented as a set of instructions to operate on specific data types. For instance, a dot product packed single-precision (DPPS) instruction is provided to determine the dot product for 32-bit data types, including integer and floating point. Similarly, a dot product packed double-precision (DPPD) instruction is provided to determine the dot product for 64-bit data types, including integer and floating point. Although these instructions have different names, the general dot product operation that they perform is similar. For simplicity, the following discussion and the examples below are in the context of a dot product instruction that processes data elements.

In one embodiment, the dot product instruction identifies various information, including: an identifier of a first data operand DATA A 410, an identifier of a second data operand DATA B 420, and an identifier for the RESULTANT 440 of the dot product operation (which may be the same identifier as one of the first data operand identifiers in one embodiment). For the following discussion, DATA A, DATA B, and RESULTANT are generally referred to as operands or data blocks, but are not restricted as such, and also include registers, register files, and memory locations. In one embodiment, each dot product instruction (DPPS, DPPD) is decoded into one micro-operation. In an alternative embodiment, each instruction may be decoded into a varying number of micro-ops to perform the dot product operation on the data operands. In this example, the operands 410, 420 are 128-bit wide pieces of information stored in a source register/memory having word-wide data elements. In one embodiment, the operands 410, 420 are held in 128-bit long SIMD registers, such as 128-bit SSEx XMM registers. In one embodiment, the RESULTANT 440 is also an XMM data register. Furthermore, RESULTANT 440 may also be the same register or memory location as one of the source operands. Depending on the particular implementation, the operands and registers can have other lengths, such as 32, 64, and 256 bits, and can have byte, doubleword, or quadword sized data elements. Although the data elements of this example are word size, the same concept can be extended to byte and doubleword sized elements. In one embodiment, where the data operands are 64 bits wide, MMX registers are used in place of the XMM registers.

The first operand 410 in this example comprises a set of four data elements: A3, A2, A1, and A0. Each individual data element corresponds to a data element position in the resultant 440. The second operand 420 comprises another set of four data segments: B3, B2, B1, and B0. The data segments here are of equal length, and each comprises a single word (32 bits) of data. However, data elements and data element positions can have granularities other than words. If each data element were a byte (8 bits), a doubleword (32 bits), or a quadword (64 bits), the 128-bit operands would have sixteen byte-wide, four doubleword-wide, or two quadword-wide data elements, respectively. Embodiments of the present invention are not restricted to data operands or data segments of a particular length, and these can be sized appropriately for each implementation.

The operands 410, 420 can reside in a register, in a memory location, in a register file, or in a mix of these. The data operands 410, 420 are sent to the dot product computation logic 430 of an execution unit in the processor along with a dot product instruction. By the time the dot product instruction reaches the execution unit, the instruction should have been decoded earlier in the processor pipeline, in one embodiment. Thus, the dot product instruction can be in the form of a micro-operation (uop) or some other decoded format. In one embodiment, the two data operands 410, 420 are received at the dot product computation logic 430. The dot product computation logic 430 generates a first multiplication product of two data elements of the first operand 410 and a second multiplication product of two data elements in the corresponding data element positions of the second operand 420, and stores the sum of the first and second multiplication products into the appropriate position in the resultant 440, which may correspond to the same storage location as either the first or second operand. In one embodiment, the data elements from the first and second operands are single precision (e.g., 32-bit), whereas in other embodiments the data elements from the first and second operands are double precision (e.g., 64-bit).

In one embodiment, the data elements for all of the data positions are processed in parallel. In another embodiment, a certain portion of the data element positions can be processed together at a time. In one embodiment, the resultant 440 comprises two or four possible dot product result positions, depending on whether DPPD or DPPS is performed, respectively: DOT-PRODUCT A31-0, DOT-PRODUCT A63-32, DOT-PRODUCT A95-64, DOT-PRODUCT A127-96 (for DPPS instruction results), and DOT-PRODUCT A63-0, DOT-PRODUCT A127-64 (for DPPD instruction results).

In one embodiment, the position of the dot product result in the resultant 440 depends on a select field associated with the dot product instruction. For example, for DPPS instructions, the position of the dot product result in the resultant 440 is DOT-PRODUCT A31-0 if the select field is equal to a first value, DOT-PRODUCT A63-32 if the select field is equal to a second value, DOT-PRODUCT A95-64 if the select field is equal to a third value, and DOT-PRODUCT A127-96 if the select field is equal to a fourth value. In the case of DPPD instructions, the position of the dot product result in the resultant 440 is DOT-PRODUCT A63-0 if the select field is equal to a first value, and DOT-PRODUCT A127-64 if the select field is equal to a second value.
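The mapping from select field to result position can be sketched as follows. The passage does not give the encoding of the "first" through "fourth" values, so the sketch assumes, purely for illustration, that they are the integers 0 through 3; the function name result_bit_range is likewise hypothetical.

```python
# Element width in bits for each instruction, as given in the text.
ELEMENT_WIDTH = {"DPPS": 32, "DPPD": 64}


def result_bit_range(select: int, instruction: str) -> tuple:
    """Return (high_bit, low_bit) of the result position in the 128-bit resultant.

    Assumption: the select field values are encoded as 0, 1, 2, 3.
    """
    width = ELEMENT_WIDTH[instruction]
    positions = 128 // width            # four for DPPS, two for DPPD
    if not 0 <= select < positions:
        raise ValueError(f"{instruction} has only {positions} result positions")
    low_bit = select * width
    return low_bit + width - 1, low_bit
```

Under this assumed encoding, a DPPS select value of 3 yields bits 127 through 96, and a DPPD select value of 1 yields bits 127 through 64.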

FIG. 5A illustrates the operation of a dot product instruction according to one embodiment of the present invention. Specifically, FIG. 5A illustrates the operation of a DPPS instruction, according to one embodiment. In one embodiment, the dot product operation of the example illustrated in FIG. 5A may be performed substantially by the dot product computation logic 430 of FIG. 4. In other embodiments, the dot product operation of FIG. 5A may be performed by other logic, including hardware, software, or some combination thereof.

In other embodiments, the operations illustrated in FIGS. 4, 5A and 5B may be performed in any combination or order to produce the dot product result. In one embodiment, FIG. 5A illustrates a 128-bit source register including storage locations to store up to four single-precision floating point or integer values of 32 bits each, A0-A3. Similarly, FIG. 5A illustrates a 128-bit destination register 505A including storage locations to store up to four single-precision floating point or integer values of 32 bits each, B0-B3. In one embodiment, each value A0-A3 stored in the source register is multiplied by the corresponding value B0-B3 stored in the corresponding position of the destination register, and each resulting value A0·B0, A1·B1, A2·B2, A3·B3 (referred to herein as the "products") is stored in a corresponding storage location of a first 128-bit temporary register ("TEMP1") 510A, which includes storage locations to store up to four single-precision floating point or integer values of 32 bits each.

In one embodiment, pairs of the products are added together, and each sum (referred to here as a "subtotal") is stored in a storage location of a second 128-bit temporary register ("TEMP2") and a third 128-bit temporary register ("TEMP3"). In one embodiment, the subtotals are stored in the least significant 32-bit element storage location of the second and third temporary registers. In other embodiments, they can be stored in other element storage locations of these temporary registers. In addition, in some embodiments the subtotals can be stored in the same register, for example in either the second or the third temporary register.

In one embodiment, the subtotals are added together (the result is referred to here as the "final sum") and stored in a storage element of a fourth 128-bit temporary register ("TEMP4"). In one embodiment, the final sum is stored in the least significant 32-bit storage element of TEMP4, while in other embodiments the final sum is stored in other storage elements of TEMP4. The final sum is then stored in a storage element of the destination register 505A. The exact storage element in which the final sum is to be stored may depend on variables that are configurable within the scalar product instruction. In one embodiment, an immediate field ("IMMy[x]") containing a number of bit storage locations can be used to determine the storage element of the destination register in which the final sum is to be stored. For example, in one embodiment, if the field IMM8[0] contains a first value (for example, "1"), the final sum is stored in storage element B0 of the destination register; if the field IMM8[1] contains the first value (for example, "1"), the final sum is stored in storage element B1; if the field IMM8[2] contains the first value (for example, "1"), the final sum is stored in storage element B2 of the destination register; and if the field IMM8[3] contains the first value (for example, "1"), the final sum is stored in storage element B3 of the destination register. In other embodiments, other immediate fields can be used to determine the storage element of the destination register in which the final sum is stored.
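The pairwise summation and the selection of the destination element described above can be sketched in Python. This is only an illustration under stated assumptions, not the logic of any particular embodiment: the function name `reduce_and_store` is hypothetical, the 32-bit elements are modelled as plain Python numbers, and unselected destination elements are assumed to receive zero (the text above does not fix their contents at this point).

```python
def reduce_and_store(products, imm8):
    """Sum four products pairwise and place the final sum per IMM8[3:0]."""
    if len(products) != 4:
        raise ValueError("expected four products (A0*B0 .. A3*B3)")
    temp1 = list(products)        # first temporary register: the four products
    temp2 = temp1[0] + temp1[1]   # first subtotal (second temporary register)
    temp3 = temp1[2] + temp1[3]   # second subtotal (third temporary register)
    temp4 = temp2 + temp3         # final sum (fourth temporary register)
    # Assumption: a destination element receives the final sum when its
    # IMM8 bit is 1 and zero otherwise.
    return [temp4 if (imm8 >> i) & 1 else 0 for i in range(4)]


dest = reduce_and_store([1, 4, 9, 16], imm8=0b0101)
print(dest)  # [30, 0, 30, 0]
```

With IMM8[0] and IMM8[2] set, the final sum appears in elements B0 and B2, which shows that more than one destination element can be selected at once in this model.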

In one embodiment, the immediate field can be used to control whether each of the multiplication and summation operations shown in Fig. 5A is performed. For example, IMM8[4] can be used to indicate (for example, by being set to "0" or "1") whether A0 is multiplied by B0 and the result stored in TEMP1. Similarly, IMM8[5] can be used to indicate (for example, by being set to "0" or "1") whether A1 is multiplied by B1 and the result stored in TEMP1. Similarly, IMM8[6] can be used to indicate (for example, by being set to "0" or "1") whether A2 is multiplied by B2 and the result stored in TEMP1. Finally, IMM8[7] can be used to indicate (for example, by being set to "0" or "1") whether A3 is multiplied by B3 and the result stored in TEMP1.
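The effect of the upper immediate bits on the multiplications can be illustrated with a short Python sketch. The name `masked_products` is hypothetical, and the polarity (a bit value of "1" enables the multiplication, "0" yields zero) is an assumption, since the text allows either value to act as the enable.

```python
def masked_products(a, b, imm8):
    """Return the TEMP1 elements: A[i]*B[i] if IMM8[4+i] is 1, else 0."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four elements per operand")
    # Assumption: "1" enables a multiplication; the text allows either polarity.
    return [a[i] * b[i] if (imm8 >> (4 + i)) & 1 else 0 for i in range(4)]


temp1 = masked_products([1, 2, 3, 4], [5, 6, 7, 8], imm8=0b0111_0000)
print(temp1)  # [5, 12, 21, 0]
```

Clearing IMM8[7], as in this call, drops the fourth product, so a four-element register can hold a three-component vector without the unused element contributing to the sum.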

Fig. 5B illustrates the operation of the SPD instruction in accordance with one embodiment. One difference between the SPO and SPD instructions is that SPD operates on double-precision floating-point and integer values (for example, 64-bit values) instead of single-precision values. Accordingly, there are fewer data elements to manage, and therefore fewer intermediate operations and storage units (for example, registers) are involved in the SPD instruction than in the SPO instruction, in one embodiment.

In one embodiment, Fig. 5B illustrates a 128-bit source register 501b that includes storage elements for up to two double-precision floating-point or integer values, each 64 bits in size, A0-A1. Similarly, Fig. 5B shows a 128-bit destination register 505b that includes storage elements for up to two double-precision floating-point or integer values, each 64 bits in size, B0-B1. In one embodiment, each value A0-A1 stored in the source register is multiplied by the corresponding value B0-B1 stored in the corresponding position of the destination register, and each resulting value A0·B0, A1·B1 (referred to below as a "product") is stored in the corresponding storage element of a first 128-bit temporary register ("TEMP1") 510b, which includes storage elements for up to two double-precision floating-point or integer values, each 64 bits in size.

In one embodiment, the pair of products is added together, and the sum (referred to below as the "final sum") is stored in a storage element of a second 128-bit temporary register ("TEMP2") 515b. In one embodiment, the products and the final sum are stored in the least significant 64-bit element storage location of the first and second temporary registers, respectively. In other embodiments, they can be stored in other element storage locations of the first and second temporary registers.

In one embodiment, the final sum is stored in a storage element of the destination register 505b. The exact storage element in which the final sum is to be stored may depend on variables that are configurable within the scalar product instruction. In one embodiment, an immediate field ("IMMy[x]") containing a number of bit storage locations can be used to determine the storage element of the destination register in which the final sum is to be stored. For example, in one embodiment, if the field IMM8[0] contains a first value (for example, "1"), the final sum is stored in storage element B0 of the destination register; if the field IMM8[1] contains the first value (for example, "1"), the final sum is stored in storage element B1. In other embodiments, other immediate fields can be used to determine the storage element of the destination register in which the final sum is stored.

In one embodiment, the immediate field can be used to control whether each of the multiplication operations of the scalar product operation shown in Fig. 5B is performed. For example, IMM8[4] can be used to indicate (for example, by being set to "0" or "1") whether A0 is multiplied by B0 and the result stored in TEMP1. Similarly, IMM8[5] can be used to indicate (for example, by being set to "0" or "1") whether A1 is multiplied by B1 and the result stored in TEMP1. In other embodiments, other control techniques can be used to determine whether to perform the multiplication operations of the scalar product.
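The two-element flow just described can be put together in a few lines of Python. This is a hedged sketch rather than a definitive model: the function name `spd` is chosen for the example, the polarity "1 = enabled" is assumed, and unselected destination elements are assumed to be zero. Python floats are IEEE 754 double-precision values, so they match the 64-bit element size discussed above.

```python
def spd(src, dest, imm8):
    """Model the two-element flow on 64-bit values held as Python numbers."""
    if len(src) != 2 or len(dest) != 2:
        raise ValueError("expected two elements per operand")
    # TEMP1: each product is kept only if its control bit IMM8[4+i] is 1.
    temp1 = [src[i] * dest[i] if (imm8 >> (4 + i)) & 1 else 0 for i in range(2)]
    temp2 = temp1[0] + temp1[1]  # final sum (second temporary register)
    # Assumption: unselected destination elements are set to zero.
    return [temp2 if (imm8 >> i) & 1 else 0 for i in range(2)]


print(spd([1.5, 2.0], [4.0, 0.5], imm8=0b0011_0001))  # [7.0, 0]
```

Here both multiplications are enabled (IMM8[5:4] = 11) and only element B0 is selected (IMM8[1:0] = 01), so the sum 1.5·4.0 + 2.0·0.5 lands in the low element.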

Fig. 6A shows a block diagram of a circuit 600A for performing a scalar product operation on single-precision integer or floating-point values in accordance with one embodiment. The circuit 600A of this embodiment multiplies, by means of a set of multipliers (starting with 610a), the corresponding single-precision elements of two registers; the products can be selected by multiplexers using the immediate field IMM8[7:4]. Alternatively, the multiplexers can select a zero value instead of the product for each element. The outputs of the multiplexers are then added together by an adder, and the result is stored in any of the elements of the result register 630a depending on the value of the immediate field IMM8[3:0], which selects the corresponding sum result from the adder by means of a further set of multiplexers. In one embodiment, these multiplexers can select zeros to fill an element of the result register 630a if the sum result is not selected for storage in that element. In other embodiments, a larger number of adders can be used to generate sums of the various multiplication results. In addition, in some embodiments intermediate storage elements can be used to hold the results of the multiplication or summation until further operations are performed on them.

Fig. 6B shows a block diagram of a circuit 600b for performing a scalar product operation on double-precision integer or floating-point values in accordance with one embodiment. The circuit 600b of this embodiment multiplies, by means of multipliers 610b, 612b, the corresponding double-precision elements of two registers 601b and 605b, and the products can be selected by multiplexers 615b, 617b using the immediate field IMM8[7:4]. Alternatively, the multiplexers 615b, 617b can select a zero value instead of the product for each element. The outputs of the multiplexers 615b, 617b are then added together by adder 620b, and the result is stored in any of the elements of the result register 630b depending on the value of the immediate field IMM8[3:0], which selects the corresponding sum result from the adder 620b by means of multiplexers 625b, 627b. In one embodiment, the multiplexers 625b, 627b can select zeros to fill an element of the result register 630b if the sum result is not selected for storage in that element. In other embodiments, a larger number of adders can be used to generate sums of the various multiplication results. In addition, in some embodiments intermediate storage elements can be used to hold the results of the multiplication or summation until further operations are performed on them.
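The dataflow of the two circuits (multipliers, product-or-zero multiplexers, an adder, and sum-or-zero multiplexers feeding the result register) can be mimicked in Python. The helper names `mux` and `dot_product_circuit` are hypothetical, a select value of 1 is assumed to pass the data input, and the same function is used for the four-element and two-element cases purely for brevity.

```python
def mux(select, value):
    """Two-input multiplexer: pass `value` when select is 1, else zero."""
    return value if select else 0


def dot_product_circuit(reg_a, reg_b, imm8):
    """Multipliers -> selection multiplexers -> adder -> result multiplexers."""
    lanes = len(reg_a)
    if lanes != len(reg_b) or lanes not in (2, 4):
        raise ValueError("expected two or four elements per register")
    # Multipliers: every element pair is multiplied unconditionally.
    products = [x * y for x, y in zip(reg_a, reg_b)]
    # First multiplexer stage, driven by the upper immediate bits (from bit 4 up).
    selected = [mux((imm8 >> (4 + i)) & 1, p) for i, p in enumerate(products)]
    # Adder: a single sum of all selected values.
    total = sum(selected)
    # Second multiplexer stage, driven by the lower immediate bits (from bit 0 up).
    return [mux((imm8 >> i) & 1, total) for i in range(lanes)]


print(dot_product_circuit([1, 2, 3, 4], [5, 6, 7, 8], 0xF1))  # [70, 0, 0, 0]
print(dot_product_circuit([2, 3], [10, 100], 0x32))           # [0, 320]
```

Unlike a description in which a multiplication is skipped, this circuit-style model always multiplies and lets the multiplexers discard unwanted products; the two views give the same result register contents.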

Fig. 7A shows, in pseudo-code form, the operations for executing the SPO instruction in accordance with one embodiment. The pseudo-code illustrated in Fig. 7A indicates that the single-precision floating-point or integer value stored in bits 31-0 of the source register ("SRC") is to be multiplied by the single-precision floating-point or integer value stored in bits 31-0 of the destination register ("DEST"), and the result stored in bits 31-0 of a temporary register ("TEMP1"), only if the immediate value stored in the immediate field ("IMM8[4]") is equal to "1". Otherwise, bits 31-0 can contain a null value, such as all zeros.

Fig. 7A also shows pseudo-code indicating that the single-precision floating-point or integer value stored in bits 63-32 of the SRC register is to be multiplied by the single-precision floating-point or integer value stored in bits 63-32 of the DEST register, and the result stored in bits 63-32 of the TEMP1 register, only if the immediate value stored in the immediate field ("IMM8[5]") is equal to "1". Otherwise, bits 63-32 can contain a null value, such as all zeros.

Similarly, Fig. 7A shows pseudo-code indicating that the single-precision floating-point or integer value stored in bits 95-64 of the SRC register is to be multiplied by the single-precision floating-point or integer value stored in bits 95-64 of the DEST register, and the result stored in bits 95-64 of the TEMP1 register, only if the immediate value stored in the immediate field ("IMM8[6]") is equal to "1". Otherwise, bits 95-64 can contain a null value, such as all zeros.

Finally, Fig. 7A shows pseudo-code indicating that the single-precision floating-point or integer value stored in bits 127-96 of the SRC register is to be multiplied by the single-precision floating-point or integer value stored in bits 127-96 of the DEST register, and the result stored in bits 127-96 of the TEMP1 register, only if the immediate value stored in the immediate field ("IMM8[7]") is equal to "1". Otherwise, bits 127-96 can contain a null value, such as all zeros.

Further, Fig. 7A illustrates that bits 31-0 of TEMP1 are added to bits 63-32 of TEMP1 and the result is stored in bits 31-0 of a second temporary register ("TEMP2"). Similarly, bits 95-64 of TEMP1 are added to bits 127-96 of TEMP1 and the result is stored in bits 31-0 of a third temporary register ("TEMP3"). Finally, bits 31-0 of TEMP2 are added to bits 31-0 of TEMP3 and the result is stored in bits 31-0 of a fourth temporary register ("TEMP4").

The data stored in the temporary registers can then be stored in the DEST register, in one embodiment. The particular location in the DEST register in which the data is stored may depend on other fields of the SPO instruction, such as fields in IMM8[x]. In particular, Fig. 7A shows that in one embodiment bits 31-0 of TEMP4 are stored in DEST bits 31-0 if IMM8[0] is "1", in DEST bits 63-32 if IMM8[1] is "1", in DEST bits 95-64 if IMM8[2] is "1", or in DEST bits 127-96 if IMM8[3] is "1". Otherwise, the corresponding DEST element will contain a zero value, for example all zeros.
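The bit ranges used in the pseudo-code steps above can be made concrete with a Python model in which each 128-bit register is a Python integer and each 32-bit element is a single-precision float encoded with the standard `struct` module. This is a sketch under assumptions: the helper names (`f32`, `bits32`, `pack4`, `spo`) are invented for the example, only floating-point elements are handled, and rounding every intermediate result to single precision is an assumption about the arithmetic that the text does not spell out.

```python
import struct

MASK32 = 0xFFFFFFFF


def f32(bits):
    """Interpret the low 32 bits of `bits` as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & MASK32))[0]


def bits32(value):
    """Round a Python float to single precision and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def pack4(values):
    """Pack four floats into a 128-bit int, element 0 in bits 31-0."""
    return sum(bits32(v) << (32 * i) for i, v in enumerate(values))


def spo(src, dest, imm8):
    """Follow the described steps on 128-bit registers held as Python ints."""
    temp1 = []
    for i in range(4):
        lo = 32 * i  # element i occupies bits lo+31 .. lo
        if (imm8 >> (4 + i)) & 1:
            prod = f32(src >> lo) * f32(dest >> lo)
            temp1.append(f32(bits32(prod)))  # assumed: round to 32 bits
        else:
            temp1.append(0.0)                # masked element is all zeros
    temp2 = f32(bits32(temp1[0] + temp1[1]))  # bits 31-0 plus bits 63-32
    temp3 = f32(bits32(temp1[2] + temp1[3]))  # bits 95-64 plus bits 127-96
    temp4 = bits32(temp2 + temp3)             # final sum as a bit pattern
    result = 0
    for i in range(4):
        if (imm8 >> i) & 1:                   # IMM8[i] selects element i
            result |= temp4 << (32 * i)
    return result


src = pack4([1.0, 2.0, 3.0, 4.0])
dest = pack4([0.5, 0.5, 0.5, 0.5])
out = spo(src, dest, 0xF1)
print(f32(out))   # 5.0
print(out >> 32)  # 0
```

With all four multiplications enabled and only IMM8[0] set, the sum 0.5 + 1.0 + 1.5 + 2.0 = 5.0 occupies bits 31-0 of the result and bits 127-32 are zero.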

Fig. 7B shows, in pseudo-code form, the operations for executing the SPD instruction in accordance with one embodiment. The pseudo-code shown in Fig. 7B indicates that the double-precision floating-point or integer value stored in bits 63-0 of the source register ("SRC") is to be multiplied by the double-precision floating-point or integer value stored in bits 63-0 of the destination register ("DEST"), and the result stored in bits 63-0 of a temporary register ("TEMP1"), only if the immediate value stored in the immediate field ("IMM8[4]") is equal to "1". Otherwise, bits 63-0 can contain a null value, such as all zeros.

Fig. 7B also shows pseudo-code indicating that the double-precision floating-point or integer value stored in bits 127-64 of the SRC register is to be multiplied by the double-precision floating-point or integer value stored in bits 127-64 of the DEST register, and the result stored in bits 127-64 of the TEMP1 register, only if the immediate value stored in the immediate field ("IMM8[5]") is equal to "1". Otherwise, bits 127-64 can contain a null value, such as all zeros.

Further, Fig. 7B illustrates that bits 63-0 of TEMP1 are added to bits 127-64 of TEMP1 and the result is stored in bits 63-0 of a second temporary register ("TEMP2"). The data stored in the temporary register can then be stored in the DEST register, in one embodiment. The particular location in the DEST register in which the data is stored may depend on other fields of the SPD instruction, such as fields in IMM8[x]. In particular, Fig. 7B shows that in one embodiment bits 63-0 of TEMP2 are stored in DEST bits 63-0 if IMM8[0] is "1", or in DEST bits 127-64 if IMM8[1] is "1". Otherwise, the corresponding DEST element will contain a zero value, for example all zeros.

The operations disclosed in Figs. 7A and 7B represent only one form of the operations that can be used in one or more embodiments of the invention. In particular, the pseudo-code shown in Figs. 7A and 7B corresponds to operations performed in accordance with one or more processor architectures having 128-bit registers. Other embodiments can be carried out in processor architectures with registers of any size, or with any other type of storage. In addition, other embodiments may not use registers exactly as shown in Figs. 7A and 7B. For example, in some embodiments a different number of temporary registers can be used, or no registers at all may be used to store the operands. Finally, embodiments of the invention can be carried out across multiple processors or processing cores using any number of registers and data types.

Thus, techniques for performing a scalar product operation have been disclosed. Although certain exemplary embodiments have been described and shown in the accompanying drawings, it should be understood that such embodiments are merely illustrative of, and do not limit, the broad invention, and that this invention should not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advances are difficult to foresee, the disclosed embodiments can be readily modified in arrangement and detail as technology develops, without departing from the principles of the present disclosure or the scope of the accompanying claims.

1. A machine-readable medium on which are stored instructions which, if executed by a machine, cause the machine to perform a method comprising the steps of:
generating a first product by multiplying a first packed data element of an integer data type contained in a first operand by a second packed data element of the integer data type contained in a second operand;
generating a second product by multiplying a third packed data element of the integer data type contained in the first operand by a fourth packed data element of the integer data type contained in the second operand;
storing the first and second products;
determining the scalar product of the first operand and the second operand by adding a first selected value and a second selected value, wherein the first selected value is selected from the stored first product or a zero value and the second selected value is selected from the stored second product or a zero value, the at least two operands each having a plurality of packed values of a first data type;
storing the result of the scalar product in a destination register.

2. The machine-readable medium according to claim 1, in which the first data type is an integer data type.

3. The machine-readable medium according to claim 1, in which the first data type is a floating-point data type.

4. The machine-readable medium according to claim 1, in which each of the at least two operands has only two packed values.

5. The machine-readable medium according to claim 1, in which each of the at least two operands has only four packed values.

6. The machine-readable medium according to claim 1, in which each of the plurality of packed values is a single-precision value represented by 32 bits.

7. The machine-readable medium according to claim 1, in which each of the plurality of packed values is a double-precision value represented by 64 bits.

8. The machine-readable medium according to claim 1, in which the at least two operands and the result of the scalar product are to be stored in at least two registers designed to store up to 128 bits of data.

9. A device for performing a scalar product operation, comprising:
first logic for executing a scalar product instruction of the "single instruction, multiple data" (SIMD) type on at least two packed operands of a first data type, wherein the SIMD scalar product instruction includes a source operand indicator, a destination operand indicator and at least one immediate value indicator, the source operand indicator includes the address of a source register having a plurality of storage elements for storing a plurality of packed values, and at least a plurality of control bits indicate the storage element of the destination in which the result is to be stored.

10. The device according to claim 9, in which the destination operand indicator includes the address of a destination register having a plurality of storage elements for storing a plurality of packed values.

11. The device according to claim 10, in which the immediate value indicator includes a plurality of control bits.

12. The device according to claim 9, in which each of the at least two packed operands is a double-precision integer.

13. The device according to claim 9, in which each of the at least two packed operands is a double-precision floating-point value.

14. The device according to claim 9, in which each of the at least two packed operands is a single-precision integer.

15. The device according to claim 9, in which each of the at least two packed operands is a single-precision floating-point value.

16. A system for performing a scalar product operation, comprising:
a first storage device for storing a scalar product instruction of the "single instruction, multiple data" (SIMD) type;
a processor connected to the first storage device to execute the SIMD scalar product instruction, wherein the SIMD scalar product instruction includes a source operand indicator, a destination operand indicator and at least one immediate value indicator, and the immediate value indicator includes a plurality of control bits.

17. The system according to claim 16, in which the source operand indicator includes the address of a source register having a plurality of storage elements for storing a plurality of packed values.

18. The system according to claim 17, in which the destination operand indicator includes the address of a destination register having a plurality of storage elements for storing a plurality of packed values.

19. The system according to claim 16, in which each of the at least two packed operands is a double-precision integer.

20. The system according to claim 16, in which each of the at least two packed operands is a double-precision floating-point value.

21. The system according to claim 16, in which each of the at least two packed operands is a single-precision integer.

22. The system according to claim 16, in which each of the at least two packed operands is a single-precision floating-point value.

23. A processor for performing a scalar product operation, comprising:
a source register designed to store a first packed operand that includes first and second data values;
a destination register designed to store a second packed operand that includes third and fourth data values;
a logic circuit for executing a scalar product instruction of the "single instruction, multiple data" (SIMD) type in accordance with a control value indicated by the scalar product instruction,
wherein the logic circuit includes a first multiplier for multiplying the first and third data values to generate a first product and a second multiplier for multiplying the second and fourth data values to generate a second product, the logic circuit additionally includes at least one adder for summing the first and second products to obtain at least one sum, and the logic circuit additionally includes a first multiplexer for selecting between the first product and a zero value depending on a first bit of the control value, a second multiplexer for selecting between the second product and a zero value depending on a second bit of the control value, a third multiplexer for selecting between the sum and a zero value for storage in a first element of the destination register, and a fourth multiplexer for selecting between the sum and a zero value for storage in a second element of the destination register.

24. The processor according to claim 23, in which the first, second, third and fourth data values are 64-bit integer values.

25. The processor according to claim 23, in which the first, second, third and fourth data values are 64-bit floating-point values.

26. The processor according to claim 23, in which the first, second, third and fourth data values are 32-bit integer values.

27. The processor according to claim 23, in which the first, second, third and fourth data values are 32-bit floating-point values.

28. The processor according to claim 23, in which the source and destination registers are designed to store at least 128 bits of data.



 

Same patents:

FIELD: information technology.

SUBSTANCE: delay in launching certain applications can enhance overall system performance. Applications which must be delayed may be placed in a container object or packaging to that they can be monitored and so that other applications, which depend on the delayed applications, can be processed appropriately.

EFFECT: improve system performance.

20 cl, 3 dwg

FIELD: information technology.

SUBSTANCE: apparatus has a first input which is configured to receive an instruction address, and a second input which is configured to receive predecoded information which describes the instruction address as being related to an implicit subroutine call in a subroutine. In response to the predecoded information, the apparatus also includes an adder configured to add a constant to the instruction address defining a return address, causing the return address to be stored to an explicit subroutine resource, thus, facilitating subsequent branch prediction of a return call instruction.

EFFECT: emulating branch prediction of a subroutine call in order to reduce power and increase pipeline processor utilisation factor.

13 cl, 1 tbl, 7 dwg

FIELD: information technology.

SUBSTANCE: method involves the following: identification of a property of a first instruction, where the property differs from other properties encoded in a first set of pre-decoding bits, for which all available encodings are defined or reserved; coding the first instruction in a second format, whose length differs from that of the first format, including part of the first instruction and the first set of pre-decoding bits, where the second format contains part of the second instruction and a second set of pre-decoding bits, encoding the second set of pre-decoding bits using one of the available encodings.

EFFECT: faster operation.

17 cl, 4 dwg

FIELD: information technology.

SUBSTANCE: method of managing shadow register file system involves the following steps: allocating one or more multi-port registers from a physical register file to a first procedure, corresponding to part of the logic stack of registers, storing data associated with the first procedure in the allocated multi-port registers; selectively saving data associated with the first procedure from one or more multi-port registers to one or more registers of the first file of shadow registers of the shadow register file system, wherein one or more registers has independent data reading/recording ports, and freeing up corresponding allocated multi-port registers for allocating the second procedure; storing data associated with the first procedure from the first shadow register file to the second shadow register file; storing at least part of data associated with the first procedure from a specific register of the second shadow register file in backing memory, and then extraction of said part of data associated with the first procedure from the backing memory to a specific register of the second shadow register file; extracting data from the second shadow register file into one or more registers of the first shadow register file; and before continuing to execute the first procedure, retrieving data associated with the first procedure from one or more registers into one or more multi-port registers, and reallocating the first procedure one or more multi-port registers.

EFFECT: higher efficiency.

15 cl, 5 dwg

FIELD: information technology.

SUBSTANCE: method involves defining a granule which is equal to the smallest length instruction in the instruction set and defining the number of granules making up the longest length instruction in the instruction denoted MAX. The method also involves determining the end of an embedded data segment, when a program is compiled or assembled into the instruction string and inserting a padding of length MAX-1 into the instruction string to the end of the embedded data. Upon pre-decoding of the padded instruction string, a pre-decoder maintains synchronisation with the instructions in the padded instruction string even if embedded data are randomly encoded to resemble an existing instruction in the variable length instruction set.

EFFECT: ensuring reconstruction during repeated synchronisation owing to reduced errors of synchronising the mechanism for pre-decoding the instruction string.

20 cl, 11 dwg

FIELD: physics; computer engineering.

SUBSTANCE: invention relates to processors with pipeline architecture. The method of correcting an incorrectly early decoded instruction comprises stages on which: the early decoding error is detected and a procedure is called for correcting branching with a destination address for the incorrectly early decoded instruction in response to detection of the said error. The early decoded instruction is evaluated as an instruction, which corresponds to incorrectly predicted branching.

EFFECT: improved processor efficiency.

22 cl, 3 dwg, 1 tbl

FIELD: information technology.

SUBSTANCE: present invention relates to computer engineering and can be used in signal processing systems. The device contains an instruction buffer, memory control unit, second level cache memory, integral arithmetic-logic unit (ALU), floating point arithmetic unit and a system controller.

EFFECT: more functional capabilities of the device due to processing signals and images when working with floating point arithmetic.

4 cl, 4 dwg

FIELD: information technologies.

SUBSTANCE: command of message digest generation is selected from memory, in response to selection of message digest generation command from memory on the basis of previously specified code of function, operation of message digest generation, which is subject to execution, is determined, at that previously specified code of function defines operation of message digest calculation or operation of function request, if determined operation of message digest generation subject to execution is operation of message digest calculation, in respect to operand, operation of message digest calculation is executed, which contains algorithm of hash coding, if determined operation of message digest generation subject to execution is operation of function request, bits of condition word are stored in block of parameters that correspond to one or several codes of function installed in processor.

EFFECT: expansion of computer field by addition of new commands or instructions.

14 cl, 18 dwg

FIELD: physics; computer technology.

SUBSTANCE: present invention pertains to digital signal processors with configurable multiplier-accumulation units and arithmetic-logical units. The device has a first multiplier-accumulation unit for receiving and multiplying the first and second operands, storage of the obtained result in the first intermediate register, adding it to the third operand, a second multiplier-accumulation unit, for receiving and multiplying the fourth and fifth operands, storage of the obtained result in the second intermediate register, adding the sixth operand or with the stored second intermediate result, or with the sum of the stored first and second intermediate results. Multiplier-accumulation units react on the processor instructions for dynamic reconfiguration between the first configuration, in which the first and second multiplier-accumulation units operate independently, and the second configuration, in which the first and second multiplier-accumulation units are connected and operate together.

EFFECT: faster operation of the device and flexible simultaneous carrying out of different types of operations.

21 cl, 9 dwg

FIELD: physics.

SUBSTANCE: invention pertains to the means of providing for computer architecture. Description is given of the method, system and the computer program for computing the data authentication code. The data are stored in the memory of the computing medium. The memory unit required for computing the authentication code is given through commands. During the computing operation the processor defines one of the encoding methods, which is subject to implementation during computation of the authentication code.

EFFECT: wider functional capabilities of the computing system with provision for new extra commands or instructions with possibility of emulating other architectures.

10 cl, 15 dwg

FIELD: radio engineering.

SUBSTANCE: invention applies new sequence of interrelated actions, including procedure of vector disturbance in combination of array basis reduction and multi-alternative quantisation. Invention makes it possible to simultaneously service group of several subscriber stations in one and the same physical channel. Invention advantage is possibility of quite simple realisation in transmitter and especially simple realisation in receiver of subscriber station. Invention advantage is possibility of realisation with only one receiving antenna available in each of subscriber stations.

EFFECT: increased throughput capacity of communication channel.

6 cl, 7 dwg

FIELD: information technology.

SUBSTANCE: device has a matrix of m rows and n columns of a homogeneous medium, n ones-counting units, a unit for finding the maximum, adders, a memory unit, and a lower-bound estimate search unit, which has a pulse generator, element selection multiplexers, a row selection decoder, incident vertex decoders, fixed arc decoders, row and column counters, fixed arc counters, incident vertex counters, mode triggers, a group of m triggers, a group of m inhibit circuit units, a matrix (i, j) (i = 1, 2, …, m; j = 1, 2, …, n) of fixed arc counters, a matrix (i, j) (i = 1, 2, …, m; j = 1, 2, …, n) of OR elements, matrices (i, j) (i = 1, 2, …, m; j = 1, 2, …, n) of AND elements, an OR element, inverters, AND elements, and a group of m OR elements.

EFFECT: broader functional capabilities.

2 dwg

FIELD: computer engineering; can be used for parallel digit-slice computation of sums of pairwise products of complex numbers, in particular for problems of digital signal processing, spectral analysis, sonar and automatic control systems.

SUBSTANCE: device contains an adder-subtracter and two blocks for computing sums of products, each of which comprises multiplier registers, multiplicand registers, matrix multiplexers, a converter of equilibrium codes to positional codes, and matrix adders.

EFFECT: expanded functional capabilities, increased speed of operation.

5 dwg
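
The arithmetic behind a sum of pairwise products of complex numbers can be sketched in Python as follows. This is only an illustration of the mathematics: it splits the work into real-valued sums of products and combines them with one subtraction and one addition. How the patented device actually divides the work between its two blocks is not stated in the abstract, and the function names are hypothetical.

```python
def sum_of_products(u, v):
    """Real-valued sum of pairwise products, as one block might compute."""
    return sum(p * q for p, q in zip(u, v))


def complex_sum_of_products(xs, ys):
    """Sum of x_k * y_k over complex sequences xs and ys.

    Uses (a + bi)(c + di) = (ac - bd) + (ad + bc)i, so the result needs
    four real sums of products, one subtraction and one addition.
    """
    re_x = [x.real for x in xs]
    im_x = [x.imag for x in xs]
    re_y = [y.real for y in ys]
    im_y = [y.imag for y in ys]

    rr = sum_of_products(re_x, re_y)  # real * real terms
    ii = sum_of_products(im_x, im_y)  # imag * imag terms
    ri = sum_of_products(re_x, im_y)  # real * imag terms
    ir = sum_of_products(im_x, re_y)  # imag * real terms

    # Subtraction for the real part, addition for the imaginary part.
    return complex(rr - ii, ri + ir)


result = complex_sum_of_products([1 + 2j, 3 - 1j], [2 - 1j, 1 + 1j])
```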

FIELD: computer science; can be used in devices for processing arrays of numeric information, in particular for permuting the rows of a two-dimensional array (matrix) stored in the memory of a computing device.

SUBSTANCE: device contains a matrix of unary registers of a first memory and an identical matrix of unary registers of a second memory, with a commutator positioned between them. The unary memory registers that notionally lie in one row are connected to each other as a row shift register. Following an externally specified rule, the commutator connects the output of the shift register of the first memory that corresponds to row i to the input of the shift register of the second memory that corresponds to row j. When a packet of shift pulses is sent to the shift input of shift register i of the first memory, its contents move to shift register j of the second memory. Row i is thereby transferred to position j in the new array. Rows can be transferred one at a time or all simultaneously; the structure of the commutator differs between these cases.

EFFECT: realization of a given permutation of rows and/or columns of a two-dimensional array.

7 cl, 10 dwg, 1 tbl
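
The shift-register transfer described in this abstract can be modelled with a short Python sketch. It is a software analogy only, assuming that each row is a list acting as a shift register and that the commutator rule is a dictionary mapping source row i to destination row j; the names `permute_rows` and `mapping` are hypothetical.

```python
def permute_rows(first_memory, mapping):
    """Move row i of first_memory to row mapping[i] of a second memory.

    Each row is treated as a shift register: one shift pulse moves one
    element out of the source row and into the destination row.
    """
    n_rows = len(first_memory)
    width = len(first_memory[0])
    second_memory = [[None] * width for _ in range(n_rows)]

    for i, j in mapping.items():
        src = list(first_memory[i])  # copy, so the input is not modified
        dst = second_memory[j]
        for _ in range(width):  # a packet of `width` shift pulses
            element = src.pop()     # element leaves the output end of row i
            dst.pop()               # destination row shifts by one place
            dst.insert(0, element)  # element enters the input end of row j
    return second_memory


rows = [[1, 2], [3, 4], [5, 6]]
permuted = permute_rows(rows, {0: 2, 1: 0, 2: 1})
```

Here the rows are processed one after another; the abstract notes that a simultaneous transfer of all rows is also possible with a different commutator structure.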

FIELD: computer science.

SUBSTANCE: in the first variant the device has a block of registers of a first memory, a block of registers of a second memory, a block for controlling reading of columns, a block for controlling reading of rows, and a block for controlling reverse recording. In the second variant the device has the same elements except the block for controlling reverse recording. The third variant differs from the second by the absence of the block for controlling reading of columns, and the fourth variant differs from the second by the absence of the block for controlling reading of rows.

EFFECT: higher efficiency.

4 cl, 9 dwg

The invention relates to computer technology and can be used in data mining systems, including the processing and analysis of geological and geophysical information and other data obtained in the study of natural or socio-economic objects or phenomena.

The invention relates to the field of spectral analysis and can be used in the classification of quasi-periodic signals.

The invention relates to computer technology and can be used in specialized computing devices for solving problems that include digital processing of signals and images.

The invention relates to the field of computer engineering and can be used in specialized computer systems for computing the eigenvalues of an n × n matrix.
