Computer system and data transmission method in a computer system

 

The invention relates to computer networks. The technical result is a reduction in the latency of data transfer operations. To this end, the system includes a local processor node, a remote processor node and an inter-node interconnect, each processor node containing a local interconnect, a processor, a system memory and a node controller. The node controller discards the data received from the remote processor node in response to a request if the coherence response to that request received in the local processor node indicates modified or shared intervention. The method includes the following operations: a request received from the local interconnect is speculatively forwarded to the remote processor node; when a response to that request arrives in the local processor node, the response is processed in accordance with the resolution of the request in the local processor node; and during this processing the data received from the remote processor node are discarded if the coherence response to the request received in the local processor node indicates modified or shared intervention.

The invention relates to a computer system and to a data transmission method in a computer system. The invention relates primarily to data processing in systems with non-uniform memory access (NUMA) and, in particular, to a NUMA data processing system and a data transmission method in a NUMA data processing system in which read requests are speculatively forwarded to remote memory.

It is well known in computing that the performance of a computer system can be improved by harnessing the combined processing power of several individual processors connected in tandem. A multiprocessor computer system can be implemented on the basis of various topologies, some of which may be better suited to particular applications than others, depending on the performance requirements and the software used to solve a given applied problem. One of the most common multiprocessor topologies is the symmetric multiprocessor (SMP) configuration, in which several processors share common resources, such as the system memory and the input/output (I/O) subsystem, which are normally connected to a shared system interconnect (bus). Such a computer system is called symmetric because in an SMP computer system the latency of access to data stored in the shared system memory is ideally the same for all processors.

Although SMP computer systems permit the use of relatively simple methods of inter-processor communication and data sharing, they have limited scalability. In other words, the performance of a typical SMP computer system cannot be increased simply by scaling it up (i.e., by increasing the number of processors), because the inherent limits on the capacity of the shared bus, memory and I/O subsystem prevent any significant gain in performance when an SMP computer system is scaled beyond the implementation-specific size for which the use of these shared resources was optimized. Thus, as the scale of the system increases, the shared resources themselves become the bottleneck, above all when accessing system memory. In addition, the scaling of SMP computer systems entails additional production costs and therefore lower profitability. For example, although some components can be optimized for use both in single-processor computer systems and in small SMP computer systems, such components are typically inefficient in large SMP computer systems. Conversely, components designed for use in large SMP computer systems are economically disadvantageous in small systems.

Therefore, to eliminate most of the limitations of SMP computer systems, an alternative multiprocessor topology known as the non-uniform memory access (NUMA) architecture was developed, at the expense of some additional complexity. A typical NUMA computer system has several interconnected nodes, each of which includes one or more processors and a local system memory. Such computer systems are said to have non-uniform memory access because the latency of access to data stored in the system memory of a processor's local node is lower than the latency of access to data stored in the system memory of a remote node. NUMA systems can further be divided into non-coherent and cache-coherent systems, depending on whether the coherence of data between the caches of different nodes is maintained. The complexity of cache-coherent NUMA (CC-NUMA) systems is largely attributable to the additional hardware required to maintain the coherence of data not only between the levels of cache memory and the system memory within each node, but also between the caches and system memory modules of different nodes. However, NUMA computer systems are free of the scalability limitations of traditional SMP computer systems, since each node of a NUMA computer system can be implemented as a smaller SMP system. The shared components within each node can thus be optimized for use by only a few processors, while the system as a whole benefits from large-scale parallelism, manifested in higher performance, while maintaining relatively low latency.

The main problem affecting the performance of CC-NUMA computer systems is the latency of data transfer operations (i.e., the transfer of data, messages, commands, requests and so on) over the inter-node interconnect. In particular, the latency of read operations on data stored in a remote system memory, read operations being the most common type of operation performed in a computer system, can be up to twice the latency of read operations on data stored in the local system memory. For example, application EP-A-0817072 describes a multiprocessor system with a sub-node controller that manages communication between a processor node and the rest of the system. When an operation is initiated in a processor node, it is first determined whether the operation can be performed locally, and remote access is used only if it cannot. In this scheme no action is taken until it has been determined whether the request can be serviced locally or not. Because of the relatively high latency of reads over the inter-node interconnect as compared to the local interconnect, it appears desirable to reduce the latency of read operations performed over the inter-node interconnect. The approach described in application EP-A-0379771 partially solves this problem and reduces the latency of the corresponding operations by storing a modified copy of the requested data in a cache.

The object of the present invention is therefore to develop an approach that at least partly eliminates the above-described disadvantages inherent in the prior-art solutions.

According to the invention, this object is attained by the proposed computer system, in particular a NUMA computer system, comprising an inter-node interconnect and at least a local processor node and a remote processor node, each of which is coupled to the inter-node interconnect. The local processor node has a local interconnect, a processor and a system memory connected to this local interconnect, and a node controller disposed between the local interconnect and the inter-node interconnect. The node controller speculatively forwards a request received from the local interconnect to the remote processor node via the inter-node interconnect and, upon receipt of a response in the local processor node, discards the data received from the remote processor node in response to the request if the coherence response to the request received in the local processor node indicates modified or shared intervention.

In such a computer system the remote processor node preferably also has a local interconnect and a node controller disposed between the inter-node interconnect and that local interconnect; in response to receiving a speculative request, the node controller of the remote processor node forwards the speculative request onto the local interconnect of the remote processor node.

According to another preferred embodiment, the proposed computer system also has a third processor node, the request contains an address, and the node controller of the first processor node selects, at least partly on the basis of the address information contained in the request, the processor node that is to receive the speculatively forwarded request.

In accordance with a further preferred embodiment, the remote processor node has a system memory, and the node controller of the local processor node speculatively forwards the request to the remote processor node upon establishing that the request address relates to the system memory of that remote processor node.

In addition, in the proposed computer system the node controller of the local processor node preferably forwards the data received from the remote processor node onto the local interconnect of the local processor node if the coherence response to the request received in the local processor node indicates that the request cannot be serviced locally.

The invention also proposes a method of transferring data in a computer system, primarily in a NUMA computer system, having an inter-node interconnect linking at least a local processor node and a remote processor node, the local processor node having a local interconnect, a processor and a system memory connected to this local interconnect, and a node controller disposed between the local interconnect and the inter-node interconnect. According to the method, a request obtained from the local interconnect is speculatively forwarded to the remote processor node; when a response to this request arrives in the local processor node from the remote processor node, the response is processed in accordance with the resolution of the request in the local processor node; and during this processing the data received from the remote processor node are discarded if the coherence response to the request received in the local processor node indicates modified or shared intervention.

According to one preferred variant of the proposed method, upon receipt of the speculative request in the remote processor node, the speculative request is forwarded onto the local interconnect of the remote processor node.

In another preferred embodiment of the proposed method, the computer system also has a third processor node, the request contains an address, and the processor node that is to receive the speculatively forwarded request is selected at least partly on the basis of the address information contained in the request.

In accordance with still another preferred embodiment, the remote processor node has a system memory, and during speculative forwarding the request is forwarded to the remote processor node upon establishing that the request address relates to the system memory of that remote processor node.

According to another preferred variant of the proposed method, when processing the response, the data received from the remote processor node are forwarded onto the local interconnect of the local processor node if the coherence response to the request received in the local processor node indicates that the request cannot be serviced locally.

Other distinctive features and advantages of the invention are discussed below in more detail with reference to one embodiment thereof and to the accompanying drawings, in which: Fig.1 shows a diagram of a NUMA computer system according to one embodiment of the invention; Fig.2 is a detailed structural diagram of a node controller shown in Fig.1; Fig.3A and 3B together form a high-level logical flowchart illustrating one possible example of the method of processing read requests; Fig.4A-4G are diagrams illustrating an example of data processing in accordance with the method illustrated in Fig.3A and 3B.

The overall structure of the system
Referring now to the accompanying drawings, and first to Fig.1, one possible embodiment of the proposed NUMA computer system is considered. The system shown in the drawing may be implemented as a workstation, a server or a mainframe. As shown in the drawing, the NUMA computer system 6 includes several (N≥2) processor nodes 8A-8n, which are interconnected by an inter-node interconnect 22. Each of the processor nodes 8A-8n may include M (M≥1) processors 10, a local interconnect 16 and a system memory 18, which is accessed through a memory controller 17. The processors 10A-10m are preferably (but not necessarily) identical processors, for which PowerPC™ processors manufactured by International Business Machines Corporation (IBM), Armonk, New York, may be used. In addition to elements such as registers, instruction logic and execution units used to execute program instructions, each of the processors 10A-10m also includes an on-chip hierarchical cache memory 14 that is used to stage data from the system memory modules 18 to the associated processor core 12. Each such hierarchical cache memory 14 may include, for example, a level-1 (L1) cache and a level-2 (L2) cache with capacities of 8-32 kilobytes (KB) and 1-16 megabytes (MB), respectively.

Each of the processor nodes 8A-8n has a corresponding node controller 20 connected between the local interconnect 16 and the inter-node interconnect 22. Each node controller 20 serves as a local agent for the remote processor nodes 8 and performs at least two functions to this end. First, each node controller 20 snoops the associated local interconnect 16 and forwards appropriate data transfers from it to the remote processor nodes 8. Second, each node controller 20 snoops the data transfers on the inter-node interconnect 22 and directs the corresponding data transfers onto the local interconnect 16. Data transfer on each local interconnect 16 is controlled by an arbiter 24. The arbiter 24 arbitrates access to the local interconnect 16 among the processors 10 and compiles the coherence responses for snooped data transfers on the local interconnect 16, as described in more detail below.

The local interconnect 16 is connected via a bridge 26 to a second-level bus 30, which can be implemented, for example, as a PCI local bus (from the English "Peripheral Component Interconnect"). The bus bridge 26 provides a low-latency path through which the processors 10 can directly access the I/O devices 32 and the storage devices 34 mapped onto the bus memory and/or I/O address space, as well as a high-bandwidth path through which the I/O devices 32 and the storage devices 34 can access the system memory 18. The I/O devices 32 may include, for example, a display, a keyboard, a graphical pointing device, and serial and parallel ports for connection to external networks or separately connected devices. The storage devices 34 may include optical or magnetic disks providing nonvolatile storage for software.

The organization of memory
In the NUMA computer system 6 all processors 10 share a single physical memory address space, i.e., each physical address corresponds to only a single location in one of the system memory modules 18. Accordingly, the overall contents of the system memory, which can in general be accessed by any processor 10 of the NUMA computer system 6, can be regarded as distributed among the system memory modules 18. For example, according to one possible embodiment of the present invention with four processor nodes 8, the NUMA computer system may use a 16-gigabyte (GB) physical address space consisting of a general-purpose memory area and a reserved area. The general-purpose memory area is divided into 500 MB segments, with each of the four processor nodes 8 being allocated every fourth segment. The reserved area, whose size may be approximately 2 GB, stores data for system control, peripheral memory and I/O. The processor node 8 in whose system memory 18 a particular data item is stored is called the base node for that data item; conversely, the other processor nodes 8A-8n are referred to as remote nodes with respect to that data item.
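The interleaved mapping described above can be pictured in a short sketch. This is an illustrative model using the example parameters just given (500 MB segments, four nodes, every fourth segment per node); the function name and interface are assumptions for illustration, not part of the invention.

```python
# Illustrative sketch of the example memory map described above: the
# general-purpose area is divided into 500 MB segments, and each of the
# four processor nodes 8 is allocated every fourth segment.

SEGMENT_SIZE = 500 * 2**20   # 500 MB per segment
NUM_NODES = 4                # four processor nodes in this example

def base_node(phys_addr: int) -> int:
    """Return the node whose system memory 18 holds this physical address."""
    segment_index = phys_addr // SEGMENT_SIZE
    return segment_index % NUM_NODES
```

With such a mapping, a node controller 20 can determine by simple arithmetic whether a request address is local or must be forwarded to a remote base node.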

The coherence of memory
Since any of the processors 10 of the NUMA computer system 6 may request, access and modify data stored in any of the system memories 18, the computer system 6 uses a cache coherence protocol that maintains coherence both between the caches of the same processor node and between the caches of different processor nodes. Accordingly, the NUMA computer system 6 is properly classified as a cache-coherent NUMA (CC-NUMA) computer system. The cache coherence protocol used depends on the specific implementation of the system and may be, for example, the well-known MESI protocol (from the English "Modified, Exclusive, Shared, Invalid": modified (M), exclusive (E), shared (S), invalid (I)) or a variant thereof. In the following it is assumed that the hierarchical caches 14 and the arbiters 24 use the conventional MESI protocol, of which the M, S and I states are recognized by the node controllers 20, while the E state is merged into the M state; that is, the node controllers 20 assume that data held exclusively in a remote cache has been modified, regardless of whether it has actually been modified or not.
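As a rough illustration of the MESI states mentioned above, the following sketch models a few common transitions of a single cache line. It is a generic MESI sketch under invented event names, not the exact protocol of the system described here; recall in particular that the node controllers 20 treat a remote E line as M.

```python
# Generic sketch of MESI state transitions for one cache line. Event
# names are assumptions for illustration; (state, event) pairs not
# listed leave the state unchanged.

TRANSITIONS = {
    ("I", "local_read_exclusive"): "E",  # load; no other cache holds the line
    ("I", "local_read_shared"): "S",     # load; another cache holds the line
    ("I", "local_write"): "M",
    ("E", "local_write"): "M",           # silent upgrade, no bus transaction
    ("S", "local_write"): "M",           # after invalidating other copies
    ("M", "snoop_read"): "S",            # supply data, demote to Shared
    ("E", "snoop_read"): "S",
    ("M", "snoop_write"): "I",           # another processor writes the line
    ("E", "snoop_write"): "I",
    ("S", "snoop_write"): "I",
}

def next_state(state: str, event: str) -> str:
    return TRANSITIONS.get((state, event), state)
```

Note that the E-to-M transition on a local write generates no bus transaction, which is precisely why the directory information of the base node can become imprecise, as discussed below.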

The architecture of the interconnects
The local interconnects 16 and the inter-node interconnect 22 can each be implemented on the basis of any bus-based broadcast architecture, switch-based broadcast architecture or switch-based non-broadcast architecture. In the preferred embodiment, however, at least the inter-node interconnect 22 is implemented as a switch-based non-broadcast interconnect controlled by the HH communication protocol developed by IBM Corporation. The local interconnects 16 and the inter-node interconnect 22 permit split transactions, which means that there is no fixed timing relationship between the address and data tenures of a transferred item of information and that data packets may be ordered differently from the corresponding address packets. In the preferred embodiment the utilization of the local interconnects 16 is also enhanced by pipelined operation, which allows a subsequent data transfer operation to be sourced before the master that initiated a previous operation has received the coherence responses from each recipient of the data.

Regardless of the type or types of interconnect architecture used, at least three types of "packets" (the term "packet" is used here to refer to a discrete unit of information) are used to transfer information between the processor nodes 8 over the inter-node interconnect 22 and between snooping modules over the local interconnects 16, namely address packets, data packets and coherence response packets. Tables I and II summarize the fields and their descriptions for address packets and data packets, respectively.

As follows from tables I and II, each transferred packet is marked with a tag so that the recipient node or snooping module can determine to which data transfer operation each packet belongs. Those skilled in the art will appreciate that additional flow control logic and corresponding flow control signals may be used to regulate the utilization of the finite communication resources.

Within each processor node 8, status and coherence responses are transmitted between each snooping module and the local arbiter 24. The signal lines used in the local interconnects 16 for transmitting status and coherence information are presented below in table III.

Between the status and coherence responses transmitted over the AResp and AStat lines of the local interconnects 16 and the corresponding address packets there is preferably a fixed but programmable timing relationship. For example, the AStatOut votes, which give preliminary information about whether each snooping module has successfully received the address packet transmitted over the local interconnect 16, may be required in the second cycle after receipt of the address packet. The arbiter 24 compiles the AStatOut votes and then, after a fixed but programmable number of cycles (for example, 1 cycle), issues the AStatIn vote result. The possible AStat vote results are given below in table IV.

At the end of the AStatIn period, after a further fixed but programmable number of cycles, the ARespOut votes of each snooping module may be required; the arbiter 24 compiles them and issues the ARespIn vote result, preferably during the next cycle. The possible AResp vote results preferably include the coherence responses listed below in table V.
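The compilation performed by the arbiter 24 can be pictured as selecting the highest-priority vote received. Since tables IV and V are not reproduced here, the priority ordering below is purely an assumed example for illustration, not the ordering those tables define.

```python
# Assumed sketch of vote compilation by the arbiter 24: the compiled
# result is taken to be the highest-priority individual vote. The
# ordering below is an illustrative assumption only; the actual
# orderings are defined by tables IV and V.

ASSUMED_PRIORITY = ["Retry", "ReRun", "Modified Intervention",
                    "Shared", "Ack", "Null"]   # high to low (assumed)

def compile_votes(votes):
    """Compile individual AStatOut/ARespOut votes into one In result."""
    for result in ASSUMED_PRIORITY:
        if result in votes:
            return result
    return "Null"
```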

The ReRun vote result, which is usually issued by a node controller 20 as the AResp vote result, indicates that the snooped request has a long latency and that the requester will be instructed to reissue the operation at a later time. Thus, in contrast to the Retry vote result, which may also be issued as the AResp vote result, the ReRun vote result makes the recipient of the data transfer that voted ReRun (and not its sender) responsible for causing the operation to be reissued at a later time.
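The difference in responsibility between Retry and ReRun can be summarized in a short sketch; the function name and return labels are invented for illustration.

```python
# Sketch of who must act after an AResp vote: on Retry the original
# requester reissues the operation itself, whereas on ReRun the
# snooping module that voted ReRun (typically a node controller 20)
# must later instruct the requester to reissue it.

def responsible_party(aresp_vote: str) -> str:
    if aresp_vote == "Retry":
        return "requester"        # the sender reissues on its own
    if aresp_vote == "ReRun":
        return "rerun_voter"      # the recipient triggers the reissue later
    return "none"                 # the operation proceeds normally
```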

The node controller
Fig.2 depicts a functional diagram of the node controller 20 used in the NUMA computer system 6 shown in Fig.1. As shown in Fig.2, each node controller 20, which is connected between a local interconnect 16 and the inter-node interconnect 22, has a request receive unit (BPG) 40, a request send unit (BOZ) 42, a data receive unit (BPOA) 44 and a data send unit (BAUD) 46, each of which can be implemented, for example, as an application-specific integrated circuit. As noted above, the address path and the data path through the node controller 20 are bifurcated, the address (and coherence) packets being processed by the BPG 40 and the BOZ 42, and the data packets being processed by the BPOA 44 and the BAUD 46.

The BPG 40, whose name indicates that it receives the stream of requests arriving from the inter-node interconnect 22, is responsible for receiving address and coherence packets from the inter-node interconnect 22, issuing requests on the local interconnect 16 and forwarding responses to the BOZ 42. The BPG 40 includes a response multiplexer 52, which receives packets from the inter-node interconnect 22 and forwards them to the bus master device (RUPDS) 54 and to the coherence response processing logic 56 in the BOZ 42. In response to receiving an address packet from the response multiplexer 52, the bus master device 54 initiates a data transfer operation on the local interconnect 16, which operation may be of the same type as the data transfer operation indicated in the received address packet or may differ from it.

The BOZ 42, which, as its name implies, sends requests onto the inter-node interconnect 22, has a pending buffer 60 with a number of entries that temporarily store the attributes of data transfer operations sent onto the inter-node interconnect 22 that have not yet been completed. The transfer attributes stored in an entry of the pending buffer 60 preferably include at least the address of the data transfer (including the tag), its type and the number of expected coherence responses. Each pending buffer entry has an associated state, which may either be Null, indicating that the entry can be removed from the pending buffer, or ReRun, indicating that the operation is still not completed. Apart from sending address packets onto the inter-node interconnect 22, the BOZ 42 interacts with the BPG 40 to process memory access requests and issues commands to the BPOA 44 and the BAUD 46 to control the transfer of data between the local interconnect 16 and the inter-node interconnect 22. The BOZ 42 also implements the selected coherence protocol (for example, the MSI protocol) for the inter-node interconnect 22 together with the coherence response processing logic 56 and maintains the coherence directory 50 by means of the directory control logic 58.

The coherence directory 50 stores indications of the system memory addresses of the data (for example, cache lines) checked out to caches of remote nodes for which the local processor node is the base node. The address indication for each cache line is stored together with an identifier of each remote processor node that has a copy of the cache line and with the coherence state of the cache line in each such remote processor node. The possible coherence states for entries in the coherence directory 50 are shown in table VI.

As indicated in table VI, the coherence state information for cache lines held in remote processor nodes is imprecise. This imprecision is due to the fact that a cache line held in a remote node can pass from S to I, from E to I or from E to M without notifying the node controller 20 of the base node.
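The conservative treatment this imprecision forces on the base node (cf. the merging of E into M noted earlier) can be sketched as follows; the function name and state labels are assumptions for illustration.

```python
# Sketch of the conservative assumption the base node's controller 20
# must make: because a remote line can pass from E to M (or from E or
# S to I) silently, a directory entry of E is treated as if the line
# were M.

def assumed_remote_state(directory_state: str) -> str:
    """State the base node must assume for a remotely held cache line."""
    if directory_state == "E":
        return "M"    # a remote Exclusive line may have been modified silently
    return directory_state
```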

Processing of read requests
Fig.3A and 3B show two high-level logical flowcharts which together illustrate one possible example of the method of processing read requests in accordance with the present invention. The procedure whose flowchart is shown in Fig.3A begins at step 70 and then proceeds to step 72, where a processor 10, for example the processor 10A of the processor node 8A, issues a read request on the local interconnect 16 of the processor node 8A. In response to the read request, the snooping modules issue their AStatOut votes, which are compiled by the arbiter 24 to form the AStatIn vote (step 74). If the read request specifies an address in a remote system memory 18, then before issuing an Ack vote as its AStatOut vote, thereby permitting further processing of the read request, the node controller 20 allocates in the pending buffer 60 a read entry and a write-with-clean entry. Because both of these entries are allocated, the node controller 20 is able to speculatively forward the read request to the base node of the requested cache line and to handle the response to the read request correctly regardless of the subsequent AResp voting in the processor node 8A.

If the AStatIn vote obtained in step 74 is Retry, the read request is cancelled in the next step 76, any entries allocated in the pending buffer 60 are released, and the procedure returns to step 72 discussed above. In this case the processor 10A must reissue the read request later. If the AStatIn vote is not Retry, the procedure passes from step 76 to step 78, where the node controller 20 determines, by reference to the memory map, whether its processor node 8A is the base node for the physical address specified in the read request. If so, the procedure passes to step 80; otherwise, i.e. if the processor node 8A is not the base node for the read request, the procedure passes to step 100.

At step 80 the snooping modules of the processor node 8A vote on ARespOut, and the results are compiled by the arbiter 24 into the ARespIn vote. If the coherence directory 50 indicates that the cache line whose address is specified in the read request is checked out to at least one remote processor node 8, the node controller 20 issues a ReRun vote if servicing the read request requires communication with a remote processor node 8. For example, if the coherence directory 50 indicates that the state of the requested cache line in a remote processor node 8 is Modified, then to service the read request a request must be sent to that remote processor node 8. Similarly, if the coherence directory 50 indicates that the state of the requested cache line in a remote processor node 8 is Shared, then to service a read request with intent to modify a kill command must be sent to the remote processor node 8 in order to invalidate the copy or copies of the requested cache line located there. If the ARespIn vote obtained in step 82 is not ReRun, the procedure passes to step 90, described below; if the ARespIn vote is ReRun, the procedure passes to step 84.
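The decision just described can be condensed into a short sketch; the function name and the intent-to-modify flag are illustrative assumptions.

```python
# Sketch of the step-80 decision: the base node's controller 20 votes
# ReRun when servicing the read requires communication with a remote
# node, i.e. when the directory shows the line Modified remotely, or
# Shared remotely for a read with intent to modify.

def must_vote_rerun(directory_state: str, intent_to_modify: bool) -> bool:
    if directory_state == "Modified":
        return True    # must fetch the modified copy from the remote node
    if directory_state == "Shared" and intent_to_modify:
        return True    # must send a kill command to invalidate remote copies
    return False       # the request can be serviced locally
```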

In step 84 the node controller 20 sends, via the inter-node interconnect 22, an appropriate message to the remote processor node or nodes 8 to which the requested cache line is checked out. As indicated above, such a message may be a command addressed to the caches (for example, a kill command) or a read request. The procedure then passes to step 86, which repeats cyclically until the node controller 20 has received a response from each remote processor node 8 to which a message was sent in step 84. After receiving the appropriate number of responses, among which a copy of the requested cache line may be present, the node controller 20 transmits on the local interconnect 16 a ReRun request instructing the processor 10A to reissue the read request. Thereafter, in step 88, the processor 10A that initiated the read request reissues the read request on the local interconnect 16 in response to the ReRun request. After the AStat and AResp periods, in the next step 90, the read request is serviced either by the node controller 20 of the local processor node, which supplies the copy of the requested cache line received from the remote processor node 8, or by another local snooping module of the processor node 8A (for example, the memory controller 17 or a hierarchical cache 14), which supplies the requested cache line. The procedure then passes to step 150, where it ends.

If at step 78 the node controller 20 of the processor node 8A determines that the processor node 8A is not the base node for the requested cache line, then at step 100 the node controller 20 speculatively forwards the read request to the remote processor node 8 that is the base node for the requested cache line. As shown in Fig.3A, the read request is forwarded by the node controller 20 at latest concurrently with the ARespIn period, preferably immediately upon receipt of the AStatIn vote from the arbiter 24 and without waiting for the ARespIn vote. Then, in step 102, the snooping modules issue their ARespOut votes, which are compiled by the arbiter 24 into the ARespIn vote. Thereafter, in step 110 and the subsequent steps, the base node generates a response to the read request, which response is processed by the node controller 20 in accordance with the ARespIn vote for the read request in the processor node 8A.

If the ARespIn vote is Retry, the read request is essentially cancelled in the processor node 8A. Accordingly, in response to receiving the Retry ARespIn vote, the states of the read and write entries allocated in the pending buffer 60 are changed to Null. The procedure then passes through step 110 to steps 112 and 114, in the first of which the node controller 20 waits to receive the requested cache line from the base node, and in the second of which it discards the cache line, since the state of the read entry in the pending buffer 60 is the Null state. The procedure then passes to step 150, where it ends.

If the ARespIn vote is Modified Intervention ("modified intervention"), the read request is serviced locally, and any (possibly stale) data subsequently received from the home node is discarded. Accordingly, in response to receiving an ARespIn vote of Modified Intervention, the state of the read entry in pending buffer 60 is changed to Null, after which execution passes from step 102 through steps 110 and 120 to step 122. At step 122, the tracking module that voted Modified Intervention ("modified intervention") during the ARespOut period sources the requested cache line on local interconnect 16 of processor node 8A. The coherence state of the requested cache line in the tracking module that sourced it is then changed from Modified ("modified") to Shared ("shared"). In response to receiving the requested cache line, the processor 10A that initiated the read request loads the cache line into its hierarchical cache 14 at step 124. In addition, node controller 20 snoops the requested cache line on local interconnect 16 and at step 126 issues to the home node a write-with-clean command containing the cache line, in order to update system memory 18 of the home node with the modified cache line. Computer system 6 may additionally support shared intervention, i.e., the servicing of a read request by a local hierarchical cache 14 in which the requested cache line is held in the Shared ("shared") state. If shared intervention is supported by the cache coherency protocol used in computer system 6, and the ARespIn vote for the issued read request is Shared Intervention ("shared intervention"), then the tracking module that voted Shared Intervention sources the requested cache line on local interconnect 16 at step 132.
In response to receiving the requested cache line, the processor 10A that initiated the read request loads the requested cache line into its hierarchical cache 14 at the next step, 134. Because no change to system memory 18 is required, the states of the read and write entries allocated in pending buffer 60 are changed to Null, after which execution passes to step 150, where the entire procedure completes.
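The two intervention cases can be contrasted in one sketch. This is an illustrative model with assumed names, not the patent's circuitry: on Modified Intervention the owning cache sources the line, downgrades itself to Shared, and the line is written back to the home node's memory; on Shared Intervention the line is sourced with no memory update required.

```python
# Sketch of steps 122-126 (Modified Intervention) and 132-134 (Shared
# Intervention): both source the line locally; only the modified case needs a
# write-with-clean back to the home node.
def service_intervention(vote, cache_state, writeback_to_home):
    line = "cache-line"                      # sourced on the local interconnect
    if vote == "Modified Intervention":
        cache_state["line"] = "Shared"       # Modified -> Shared (step 122)
        writeback_to_home(line)              # write-with-clean (step 126)
    elif vote == "Shared Intervention":
        pass                                 # memory already current (step 134)
    return line

writebacks = []
st_mod = {"line": "Modified"}
service_intervention("Modified Intervention", st_mod, writebacks.append)
st_shr = {"line": "Shared"}
service_intervention("Shared Intervention", st_shr, writebacks.append)
```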

If the ARespIn vote for the read request in processor node 8A is ReRun, the state of the read entry in pending buffer 60 is changed to ReRun. Execution then passes from step 102 through steps 110, 120 and 130 to step 142, where node controller 20 of processor node 8A waits, looping until it receives the requested cache line from the home node. After receiving the requested cache line from the home node via node interconnect 22, node controller 20 passes this cache line at the next step, 144, to the requesting processor 10A over local interconnect 16. Then, at the next step, 146, upon receipt of the requested cache line, the processor 10A that initiated the read request loads it into its hierarchical cache 14. Execution then passes to step 150, where the entire procedure completes.
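The ReRun path of steps 142-146 reduces to waiting for the home node's reply and forwarding it. A minimal sketch, with assumed names, using a blocking queue to stand in for the node interconnect:

```python
# Sketch of steps 142-146: block until the home node supplies the line, then
# forward it to the requesting processor over the local interconnect.
import queue

def wait_and_forward(incoming, deliver):
    line = incoming.get()        # step 142: loop/block until the line arrives
    deliver(line)                # steps 144-146: forward; the requester caches it
    return line

interconnect = queue.Queue()
interconnect.put(b"requested-line")      # the home node's reply arrives
delivered = []
got = wait_and_forward(interconnect, delivered.append)
```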

Fig.3B shows a high-level logical flowchart illustrating the processing, in the home node, of messages received from another processor node. As shown in the drawing, the procedure begins at step 160 and proceeds to step 162, where it is checked whether the home node has received a message from another processor node via node interconnect 22. If no such message has been received, the check at step 162 is repeated until a message arrives. When node controller 20 of the home node receives a message from a remote processor node 8, execution passes to step 164, where node controller 20 of the home node issues the message received at step 162 on local interconnect 16 of the home node. Thereafter, at step 170, it is determined whether the message issued on local interconnect 16 is a read request, and if so, execution passes to step 172, where the read request is serviced by the tracking module that sends a copy of the requested cache line to node controller 20 of the home node. Upon receipt of this requested cache line, node controller 20 transmits it, at the next step, 174, to the processor node 8 that initiated the request, via node interconnect 22. Execution then passes to step 190, where the entire procedure completes.

If the message received at step 164 and issued on local interconnect 16 of the home node is a write request (for example, a write-with-clean), then after the check at step 170 execution passes to step 180, and from there to step 184, where memory controller 17 updates system memory 18, followed by a transition to step 190, at which the entire procedure completes. If the message issued on local interconnect 16 of the home node is neither a read request nor a write request, execution passes to step 182, where the home node performs the operation(s) specified in the message, followed by a transition to step 190, where the entire procedure completes. Examples of such operations, which are neither reads nor writes and can be performed upon receipt of an appropriate message, include changing the coherence state of cache lines held in a hierarchical cache 14 of the home node.
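The home node's three-way dispatch of Fig.3B can be sketched as a small handler. The message format and names are illustrative assumptions: a read is serviced and the line sent back, a write (e.g., write-with-clean) updates system memory, and anything else is some other coherence operation.

```python
# Illustrative dispatch for Fig.3B (steps 170-184): service a read, apply a
# write to system memory, or perform another coherence operation.
def handle_remote_message(msg, memory, read_line, send_back):
    if msg["kind"] == "read":                 # steps 172-174
        send_back(read_line(msg["addr"]))
    elif msg["kind"] == "write":              # steps 180-184
        memory[msg["addr"]] = msg["data"]
    else:                                     # step 182, e.g. a state change
        pass

mem = {0x10: "old"}
replies = []
handle_remote_message({"kind": "read", "addr": 0x10}, mem, mem.get, replies.append)
handle_remote_message({"kind": "write", "addr": 0x10, "data": "new"}, mem,
                      mem.get, replies.append)
```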

Fig. 4A-4D show, by way of example, one way of performing data processing in accordance with the present invention. In this embodiment, the data processing is discussed below using a simplified model of computer system 6 with two processor nodes 8A and 8B, each of which has two processors 10A and 10B. The coherence state of the requested cache line is indicated in the hierarchical cache 14 of each processor 10 and in coherence directory 50 of home node 8A. As shown in Fig.4A, processor 10B of processor node 8B issues a read request for a cache line whose state in its hierarchical cache 14 is Invalid ("invalid") (i.e., the cache line is not resident in it). In response to receiving this read request, node controller 20 of processor node 8B speculatively forwards the read request to processor node 8A, which is the home node for the cache line specified in the read request. Following the speculative forwarding of the read request to processor node 8A, processor 10A of processor node 8B votes Modified Intervention ("modified intervention") during the ARespOut period, because the requested cache line is held in its hierarchical cache 14 in the Modified ("modified") state. The arbiter of processor node 8B compiles the ARespOut votes and, as the ARespIn vote, sends Modified Intervention ("modified intervention") to each tracking module of processor node 8B.

After that, as shown in Fig.4B, node controller 20 of processor node 8A receives the speculatively forwarded read request and issues this read request on local interconnect 16. Node controller 20, in response to an indication in coherence directory 50 that the requested cache line is held remotely in the Modified ("modified") state, gives a Null vote during the ARespOut period. Node controller 20 thereby recognizes a special condition that governs the further processing of the read request, as described in more detail below with reference to Fig.4D.

As shown in Fig.4C, processor 10A of processor node 8B, independently of the speculative forwarding of the read request to processor node 8A (and possibly before, simultaneously with, or after the sending of that request), sources the requested cache line on local interconnect 16 in response to the read request and changes the coherence state of the requested cache line in its hierarchical cache 14 to Shared ("shared"). Upon detecting the requested cache line, the processor 10B that initiated the read request loads the requested cache line into its hierarchical cache 14 and sets the corresponding coherence state to Shared ("shared"). In addition, node controller 20 of processor node 8B snoops the cache line and issues to processor node 8A a write-with-clean request message containing the modified cache line. Upon receipt of this message, node controller 20 of home node 8A issues the write on local interconnect 16, and the corresponding line of system memory 18 of home node 8A is updated with the modified data.

As shown in Fig.4D, system memory 18 of processor node 8A, independently of the update described above (and possibly before, simultaneously with, or after that update), sends, in response to the read request, a possibly stale copy of the requested cache line to node controller 20 of processor node 8A via local interconnect 16. Node controller 20 of processor node 8A then sends that copy of the requested cache line to node controller 20 of processor node 8B, which discards the cache line, because in its pending buffer 60 the read request is marked Null.
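The Fig.4A-4D scenario can be replayed end to end as a toy sketch (all names are illustrative assumptions): the local Modified Intervention vote services the read and nulls the pending entry, so the possibly stale copy that the home node returns later is discarded.

```python
# Toy replay of Fig.4A-4D: the read is serviced locally by intervention, and
# the home node's stale copy, which arrives afterwards, is dropped.
def run_scenario():
    pending = {"read": "pending"}
    pending["read"] = "Null"            # local ARespIn = Modified Intervention
    local_line = "modified-data"        # sourced by the owning local cache
    home_copy = "stale-data"            # arrives later from the home node
    delivered = local_line if pending["read"] == "Null" else home_copy
    discarded = home_copy if pending["read"] == "Null" else None
    return delivered, discarded

delivered, discarded = run_scenario()
```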

Thus, the present invention provides an improved computer system with NUMA and an improved data transmission method in a computer system with NUMA. According to the invention, a read request is speculatively sent to a remote (i.e., home) processor node via the node interconnect before it is determined whether the read request can be serviced locally by intervention; the response to the read request received from the remote node is then processed by the processor node that initiated the request in accordance with the local coherence response to the read request. This approach can significantly reduce the latency of data transmission and data exchange operations.

In accordance with the above description, the invention provides a computer system with non-uniform memory access (NUMA) having at least a local processor node and a remote processor node, each of which is connected to a node interconnect. The local processor node has a local interconnect, a processor and a system memory connected to this local interconnect, and a node controller located between the local interconnect and the node interconnect. This node controller, in response to a read request received from the local interconnect, speculatively forwards the read request to the remote processor node via the node interconnect. Thereafter, upon receiving a response to the read request from the remote processor node, the node controller processes the response in accordance with the resolution of the request in the local processor node; in particular, the data contained in the response received from the remote processor node are discarded by the node controller if the response to the read request obtained in the local processor node is a coherence response indicating Modified Intervention.


Claims

1. A computer system having a node interconnect and at least a local processor node and a remote processor node, each of which is connected to the node interconnect, wherein the local processor node has a local interconnect, a processor and a system memory connected to this local interconnect, and a controller located between the local interconnect and the node interconnect and intended for speculatively transmitting a request received from the local interconnect to the remote processor node via the node interconnect and for processing the response to the request received from the remote processor node in accordance with the resolution of the request in the local processor node, wherein the controller of the local processor node discards the data received from the remote processor node if the response to the request received in the local processor node is a coherence response indicating modified or shared intervention.

2. The computer system of claim 1, in which the remote processor node also has a local interconnect and a controller located between the said node interconnect and that local interconnect, wherein, in response to receiving the speculative request, the controller of the remote processor node sends the speculative request to the local interconnect of the remote processor node.

3. The computer system of claim 1, which also has a third processor node, wherein the request contains an address, and the controller of the local processor node, at least partially on the basis of the address information contained in the request, selects the destination processor node that is to receive the speculatively transmitted request.

4. The computer system of claim 1, in which the remote processor node has a system memory, and the controller of the local processor node speculatively forwards the request to the remote processor node upon establishing that the address of the request relates to the said system memory of the remote processor node.

5. The computer system of claim 1, in which, when processing the response, the controller of the local processor node sends the data received from the remote processor node to the local interconnect of the local processor node if the response to the request received in the local processor node is a coherence response indicating that the request cannot be serviced locally.

6. A data transmission method in a computer system having a node interconnect linking at least a local processor node and a remote processor node, wherein the local processor node has a local interconnect, a processor and a system memory connected to this local interconnect, and a controller located between the local interconnect and the node interconnect, the method comprising: speculatively forwarding, via the node interconnect, a request received from the local interconnect to the remote processor node; upon arrival in the local processor node of a response to the said request from the remote processor node, processing the response in accordance with the resolution of the request in the local processor node; and, when processing the said response, discarding the data received from the remote processor node if the response to the request received in the local processor node is a coherence response indicating modified or shared intervention.

7. The method of claim 6, in which, upon receipt of the speculative request in the remote processor node, the speculative request is sent to the local interconnect of the remote processor node.

8. The method of claim 6, in which the computer system also has a third processor node, the request contains an address, and the destination processor node that is to receive the speculatively transmitted request is selected at least partially on the basis of the address information contained in the request.

9. The method of claim 6, in which the remote processor node has a system memory, and during the speculative transmission the speculative request is forwarded to the remote processor node upon establishing that the address of the request relates to the said system memory of the remote processor node.

10. The method of claim 6, in which, when processing the response received from the remote processor node, the data are sent to the local interconnect of the local processor node if the response to the request received in the local processor node is a coherence response indicating that the request cannot be serviced locally.

 

Same patents:

The invention relates to processing circuits for the recognition and matching of complex patterns in high-speed data streams, in particular for use in search and retrieval engines

Computer // 2216033
The invention relates to computing, and in particular to computing devices that process information using flow control

The invention relates to the field of information security and, in particular, to the hardware and software components of firewalls used to prevent unauthorized access and to control the exchange of information between the various subscribers of computer networks

The invention relates to information-measuring technique and is designed for gathering information from geographically dispersed and hard-to-reach objects

The macro processor // 2210808
The invention relates to computing

The invention relates to computing and to relay protection technology and can be used to automate the collection of information about the state of input data, connections and switches of the controlled object, to automate the collection, analysis and storage of information about emergency processes, and to collect diagnostic information from relay protection and automation units

The invention relates to information management systems and is designed for collecting information, assigning missions and generating control signals for the weapons systems and technical means of a ship

The invention relates to computer technology and can be used for signal processing of multi-element antenna arrays in underwater acoustics

The invention relates to automated banking machines and can be used for communication of users of one institution with the banking machines of other institutions

The invention relates to a method of monitoring the performance of computer programs in accordance with their intended purpose

The invention relates to the field of optical recording and reproducing video and/or audio data, in particular to the recording medium for storing identification information of the manufacturer of the recording device, changing the contents of the recording media

The invention relates to semiconductor integrated circuits operated both in parallel and serially

The invention relates to a device and method for authentication of the content of the memory

The invention relates to the processing unit and method for accessing a memory having multiple memory cells for storing data values

The invention relates to processing of video signals for display

The invention relates to protected memory, in particular memory, providing multiple layers of protection for areas of application

The invention relates to systems for protection against illegal use of the software product

The invention relates to the field of protection against unauthorized access to information stored in the personal computer, and can be used in automated systems for handling confidential information-based personal computers

FIELD: computers.

SUBSTANCE: system has nine registers, four address selectors, triggers, AND elements, OR elements and delay elements.

EFFECT: higher speed.

8 dwg
