Data storage for a knowledge-based system for extraction of information from data

FIELD: information extraction systems engineering.

SUBSTANCE: the system contains a data store, a lower-level analysis engine, higher-level analysis engines and an indexer. According to the method, a first key is generated on the basis of a first set of rules corresponding to a first analysis engine and is sent to a second analysis engine, in which a second key is generated on the basis of a second set of rules; the first and second keys are attached to objects, and the keys and key values are indexed.

EFFECT: decreased time and computing resources spent on processing of large data arrays to extract required information.

2 cl, 5 dwg

 

Technical field of the invention

The present invention generally relates to systems for extracting information from data.

Background of the invention

Data mining is the process of extracting information from data collections in accordance with the wishes of the user. This process (called "data mining" in the original English-language sources) is also referred to in Russian-language literature by several other terms, such as "knowledge discovery"; in the context of the invention, all these terms are regarded as synonymous with extracting information from data. Perhaps the most common example of extracting information from data is the functionality of the search engines included in most Web browsers, which allow users to enter keywords and then get back a list of documents (sometimes thousands of documents), which the user must then sift through to find the needed information.

Existing search engines, such as AltaVista, Google, Northern Light, FAST and Inktomi, are based on the principle of crawling the World Wide Web, i.e. these systems access Web pages and the pages reached through the hyperlinks contained in each accessed page, generating an inverted index of the keywords found on the Web pages. In this index, keywords are associated with the identifiers (uniform resource locators, or URLs) of the Web pages that contain those keywords. To answer a query, the index is accessed using the requested keywords as arguments, and the index returns the URLs of the pages that satisfy the query. The returned page identifiers are usually ordered by relevance, for example by link information or by the frequency of keyword occurrence.
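The inverted index described above can be illustrated with a minimal Python sketch. The function names, the example URLs and the whitespace-based splitting of text into keywords are assumptions made for illustration only; they are not taken from any of the search engines named.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, keywords):
    """Return the URLs that contain every requested keyword, sorted by URL."""
    url_sets = [index.get(k.lower(), set()) for k in keywords]
    if not url_sets:
        return []
    return sorted(set.intersection(*url_sets))

# Hypothetical pages standing in for crawled Web content.
pages = {
    "http://example.com/a": "acrobat is compatible with word",
    "http://example.com/b": "word processing tips",
}
index = build_inverted_index(pages)
```

A real engine would additionally rank the returned URLs by relevance; this sketch only returns them in a fixed order.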

Although most commercial search engines order search results by relevance, a user looking for information of a particular type usually has to wade through a huge number of query results. This is because separating useful information from useless information often requires special knowledge in a particular area. Indeed, the inventors are aware of situations in which processing a large collection of documents requires one specialist to select a subset of the documents using one set of selection criteria, after which another specialist, using other criteria, must systematize the necessary information contained in the subset selected by the first specialist. Although this procedure is only a necessary step preceding the use of the data, it involves significant effort and may take more time than any other stage of the work.

For example, consider the task of answering a difficult marketing question, such as "What is the opinion of our customers in the Pacific Northwest states about our competitors' wellness products in terms of brand recognition and value?". Analysis of Web pages can start with a keyword search, for which the name of a competitor can be used, but a specialist will then need considerable time to weed out perhaps thousands of documents, such as government reports, that do not answer the question even though they are otherwise relevant. Among the documents that pass the first phase of screening there may be many that have even less bearing on the matter, such as documents from teen chat rooms in which the name of a competitor may be mentioned, but weeding them out requires special knowledge of the demographic composition of the target customer segment.

Consider also the simple question "Is Adobe Acrobat compatible with MS Word?". In response to this simple query, entered into one of the above search engines, a list of 33 million Web pages was returned, most of which did not contain the required answer "yes" or "no". To weed out the useless pages, an expert would have to at least look at each page and determine whether it is a page that may contain information about program compatibility. Another expert would then have to examine the pages selected by the first specialist and determine whether these pages contain the answer to the specific question. It is easy to see that having successive specialists screen a large amount of information may take an excessively long time.

Summary of the invention

One object of the invention is a system for extracting information from data, comprising: at least one repository, or data store, which contains objects; at least one lower-level analysis engine, which is associated with the data store and generates output data on the basis of a first set of rules implemented in said lower-level analysis engine; and at least one higher-level analysis engine, which receives the output of the lower-level analysis engine and generates its own output on the basis of a second set of rules implemented in said higher-level analysis engine, the outputs of the lower-level and higher-level analysis engines being attached to the objects contained in the data store.

In particular embodiments of the proposed system, the data store may be a database and may contain vertical and horizontal tables. Entry into a vertical table can be made using the output of an analysis engine of one of the above levels, and entry into a horizontal table using an object identifier. The output can be keys representing relevant characteristics of the object to which these keys are attached. Alternatively, the data store may be, for example, a file system.

Optionally, the system may also use an indexer associated with the data store, as well as a high-speed semiconductor cache memory and a query processor through which at least one analysis engine can issue queries. In addition, work queues can be associated with the analysis engines.

In a particular preferred embodiment, the indexer contains indexes of the keys and key values present in the data storage structures, such as tables. It can also contain Boolean indexes that store the values "yes" or "no" in answer to queries of the form "does key k have value v?". In addition, the indexer may contain interval indexes that store the values of keys, and text indexes. If necessary, the indexer may be a generalized form of a text indexer, in the form of an inverted file, that indexes Web documents and provides an application programming interface (API) for searching documents by keyword.

In a preferred embodiment, the indexer may contain certain keys so that queries against a particular object can be made using Boolean logic. In addition, the indexer may contain graph data that supports queries about incoming and outgoing links.

It is advisable that tokenization (the generation of tokens, or labels) and indexing by the indexer be carried out separately.

Another object of the invention is a method of data storage for extracting information from data, comprising: storing objects in at least one data store; communicating with the data store by means of at least one first analysis engine; generating output data on the basis of a first set of rules corresponding to the first analysis engine; sending this output to at least one second analysis engine; generating output data on the basis of a second set of rules corresponding to the second analysis engine; and attaching the generated output data to the objects.
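The steps of this method can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the in-memory `DataStore` class, the key names `mentions_competitor` and `relevant_to_pricing`, the competitor name "acme" and both rule sets are hypothetical and do not come from the description.

```python
class DataStore:
    """Toy in-memory store: object id -> dictionary of attached keys."""

    def __init__(self):
        self.objects = {}

    def put(self, object_id, content):
        self.objects[object_id] = {"content": content}

    def attach(self, object_id, key, value):
        self.objects[object_id][key] = value


def first_analysis_engine(obj):
    # First rule set (hypothetical): does the text mention a competitor?
    return "mentions_competitor", "acme" in obj["content"].lower()


def second_analysis_engine(obj):
    # Second rule set (hypothetical): consumes the key produced by the first engine.
    relevant = obj["mentions_competitor"] and "price" in obj["content"].lower()
    return "relevant_to_pricing", relevant


def run_pipeline(store):
    """Run both engines over every object and attach their outputs as keys."""
    for object_id, obj in store.objects.items():
        key, value = first_analysis_engine(obj)
        store.attach(object_id, key, value)
        key, value = second_analysis_engine(obj)
        store.attach(object_id, key, value)
```

After `run_pipeline`, each stored object carries both keys, so a later query can select objects by key value without re-reading the content.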

Brief description of the drawings

The invention is described below in more detail with reference to preferred embodiments illustrated in the accompanying drawings, in which:

figure 1 shows a block diagram of the system with the preferred system architecture,

figure 2 is a flowchart illustrating the general logic of operation,

figure 3 shows a diagram of a horizontal table,

figure 4 shows a diagram of a vertical table,

figure 5 is a flowchart illustrating the logic of operation of an analyzer.

Detailed description of the preferred embodiment

Figure 1 shows a system, generally designated 10, used to answer data requests entered by clients. In essence, the system 10 brings together the knowledge of many experts that is needed to screen, i.e. analyze in detail, a large body of information and to respond to requests for information that can be quite complex, as in the examples described above. In particular embodiments, which do not limit the scope of the claims, the system 10 can be used to analyze data about companies, gather information about competitors, analyze trends, identify hidden relationships, and provide services for Web server clustering and for the creation of taxonomic hierarchies. In addition, the system 10 may be used to support targeted functions that require a significant amount of specialized knowledge, such as service provisioning (of interest to a particular part of an enterprise).

The system 10 may be hosted at a single site, where it is maintained by its developer and runs on a single processor or a group of processors, in order to respond to customer requests for data in a service format. Alternatively, some components of the system 10 can be made available to the client so that information can be extracted from data on the client's own equipment.

As described in more detail below, the system 10 has a data collection level, a data storage level, a data analysis level, a data presentation level and a system management level. At the data collection level, shown in the left part of figure 1, a World Wide Web navigator (crawler) 12 has access to the World Wide Web 14 (and, if necessary, to other parts of the Internet). The navigator 12 can also access corporate intranets 16, including private company information that can only be obtained with appropriate authentication. Preferably, the navigator 12 crawls the World Wide Web 14 continuously, visiting some pages more often than others depending on the frequency of page updates and other criteria, and uses the application programming interface (API) 20 to output the pages to the data store 18. In a particular preferred embodiment of the invention, the interface 20 is IBM's service-request protocol known as "Vinci xTalk", a lightweight XML-based protocol combined with a set of usage conventions governing monitoring, logging and data transfer. The network-level APIs used in the system are described herein using the frame terminology of the xTalk protocol.

It is also preferable for the navigator 12 to have a feedback channel so that its operation can be changed if necessary. In a particular preferred embodiment of the invention, the navigator 12 is the tool described in patent US 6263364, or the navigator described in IBM's patent US 641833 (U.S. patent application No. 09/239921), entitled "System and Method for Focussed Web Crawling", also incorporated herein by reference. In addition to receiving data from the navigator 12, the system may, if necessary, also have a structured data collection tool 22, which processes data from client and third-party databases 24 and transmits the processed data to the repository, or data store, 18.

With regard to the data store 18, in one embodiment of the invention the data store is a relational database management system (RDBMS), such as IBM's DB2. Other embodiments can use other systems, such as a file system. The following description applies to data stores of both types.

In one embodiment, the data store 18 may be implemented as a centralized program running on one or more computers. The analysis engines, or analyzers, described below can then run on independent computers and access the data store program with requests to read or write data. Alternatively, the data store 18 may be distributed over multiple computers, with the analyzers running on those computers in parallel. In this embodiment, to improve the efficiency of resource use, a document may be read from the local part of the data store into memory, passed, while resident in RAM, through a chain of dependent or independent analyzers, and then stored back in the data store. In fact, both types of architecture can be implemented in the same system 10, since some analyzers work better in the architecture of the second type (for example, per-page analyzers), while other analyzers may require service signals or data provided by the architecture of the first type.

The data store 18 is connected to the indexer 26 and, if necessary, to a high-speed semiconductor cache memory 28. So that the analyzers can issue queries, a query processor 30 is provided that has access to the cache memory 28, the indexer 26 and the data store 18, as described below. The work queues of the analyzers, discussed below, can also be implemented as part of the data storage level of the system 10.

The data store 18 contains a fairly large amount of information, for example the Web page data received from the navigator 12. The data store 18 also contains objects that represent the data on the basis of which decisions can be made, as described below. Each of these objects has a universal object identifier (UEID) that encodes the identity and type of the object, for example "Web page", "hyperlink", "private person", "corporation" or "article". Objects can also contain keys with corresponding values, which are attached to the objects by the analyzers described below. For example, an analyzer may process a page object and create a key named "Crawl:Content" that contains the HTTP content of the corresponding Web page (so the key value is relatively long). In any case, the objects can be stored in a file system, in a DBMS such as DB2, where they are represented in horizontal and vertical tables, or in another data storage system.

The indexer 26 contains, among other things, indexes of the keys and key values present in the data store. The indexer 26 may contain Boolean indexes that store the values "yes" or "no" in answer to queries of the form "does key k have value v?". In addition, the indexer 26 may contain interval indexes that store the values of keys, for example the coordinates of geographic areas; text indexes, which are the usual indexes for describing data objects; and, optionally, other indexes.
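The two index types just described can be sketched in Python as follows. The class names, method names and the linear scan used for interval lookup are assumptions made for illustration; the description does not specify how the indexer 26 implements these indexes.

```python
from collections import defaultdict


class BooleanIndex:
    """Answers 'does key k have value v?' for a given object."""

    def __init__(self):
        self._pairs = defaultdict(set)  # (key, value) -> set of object ids

    def add(self, object_id, key, value):
        self._pairs[(key, value)].add(object_id)

    def has(self, object_id, key, value):
        return object_id in self._pairs.get((key, value), set())


class IntervalIndex:
    """Stores one numeric interval per entry and finds entries overlapping a query."""

    def __init__(self):
        self._intervals = []  # (low, high, object_id)

    def add(self, object_id, low, high):
        self._intervals.append((low, high, object_id))

    def overlapping(self, low, high):
        # Two closed intervals overlap when each starts before the other ends.
        return sorted(
            oid for lo, hi, oid in self._intervals if lo <= high and low <= hi
        )
```

A production interval index would normally use a tree structure instead of a linear scan; the scan is kept here only to keep the sketch short.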

In any case, it is preferable that the indexes (and the data storage tables, when the data store is a database) do not record exactly where a particular name or text is located, but only that a particular page has a certain characteristic, or that a specific text element occurs somewhere on that page. Thus, thanks to the fairly high degree of structure in the system 10, the data store places lower demands on memory than it otherwise would, which facilitates the practical implementation of the data store. However, if desired, the exact location of a specific name or text on a Web page can be recorded.

As for additional details of the data storage level of the system 10, in a particular preferred embodiment of the invention the indexer 26 is a generalized form of a plain text indexer in the form of an inverted file. In one embodiment, it indexes Web documents and provides an application programming interface (API) for searching documents by keyword. The set of keywords associated with a document may simply be the words of the document, or it may be supplemented as needed (by the analyzers described below) with additional information relating, for example, to geographic areas mentioned on the page, proper names, references to products, restaurants or other objects known to the system 10, the results of semantic analysis and the like. In that case the keyword search API allows any word from these extended keyword sets to be included in queries.

In other embodiments the indexer 26 contains certain keys so that queries against a specific object can be made using Boolean logic, or holds graph data to support queries about incoming and outgoing links, and the like. To achieve this versatility, tokenization (the generation of tokens, or labels) is kept separate from indexing. In particular, the indexer 26 is designed to receive a stream of tokens rather than a stream of documents. Accordingly, tokenization is carried out before indexing. For each indexed token, the token position (the relative position of the token in the stream) is stored together with user-defined token data, which can be arbitrary. This simplified model makes indexing efficient and provides a universal API that is applicable in a variety of cases. In addition, decoupling tokenization from indexing makes it possible for one index to hold tokens originating from different rule implementations (for example, from different tokenizers).
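The separation between generating tokens (labels) and indexing them can be illustrated with a small Python sketch. The tuple layout `(doc_id, position, token, user_data)`, the two tokenizers and the `name:` prefix are hypothetical choices made for this example; they are not the interface of the indexer 26.

```python
from collections import defaultdict


def whitespace_tokenizer(doc_id, text):
    """Yield (doc_id, position, token, user_data); user_data is arbitrary."""
    for position, word in enumerate(text.split()):
        yield doc_id, position, word.lower(), {"source": "whitespace"}


def proper_name_tokenizer(doc_id, text):
    """Hypothetical second tokenizer: treats capitalized words as proper names."""
    for position, word in enumerate(text.split()):
        if word[:1].isupper():
            yield doc_id, position, "name:" + word.lower(), {"source": "names"}


class TokenIndexer:
    """Consumes token streams, not documents, so any tokenizer can feed it."""

    def __init__(self):
        self.postings = defaultdict(list)  # token -> [(doc_id, position, user_data)]

    def index(self, token_stream):
        for doc_id, position, token, user_data in token_stream:
            self.postings[token].append((doc_id, position, user_data))

    def lookup(self, token):
        return [(doc_id, pos) for doc_id, pos, _ in self.postings.get(token, [])]
```

Because the indexer only sees token tuples, the output of both tokenizers can share one index, which is the benefit the decoupling is meant to provide.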

Multiple instances of the indexer 26 can run simultaneously. For simplicity, the "main" text indexer is taken to be the indexer that stores the tokens (labels) corresponding to the entire set of crawled and selected pages. As indicated below with respect to the analyzers according to the present invention, the analyzers attach keys to the objects stored in the data store 18. The tokenizers (token generators) associated with the indexer 26 work on exactly the same principle. In a particular embodiment of the invention, the text tokenizer can be based on the tokenizer of the Text Analysis Framework (TAF), developed by IBM Research and IBM Software (Böblingen). This tokenizer reads the page data and writes out the resulting initial tokenization of each page. Other tokenizers then use this data or, if they choose, the raw page data, and store further tokens in the data store. In particular, one tokenizer can match proper names and tag them as such, while another tokenizer can read only the output of the proper-name tokenizer and write tokens containing metadata that relate the proper names to specific known objects elsewhere in the system 10. All these tokenizers are registered with the main indexer 26.

The detailed description of the preferred embodiment of the indexer 26 is followed by a consideration of the query processor 30. To engage the query processor 30, the analyzers described below may request a data stream from the data store 18 in an extensible query language. The query processor 30 is accessed on exactly the same principle as the indexer 26, i.e. a requester sends a request to the service (in this case, a query in the extensible query language) and receives a data stream from the query processor 30. A query may involve combining multiple streams using standard operators that combine data streams (Boolean operators such as AND and OR; database join operators such as inner and outer joins; sort operators; and operators that augment a stream with additional information, for example by adding to each universal object identifier UEID in the stream the value of a specific key). The query language can combine any arbitrary streams.
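The stream-combining operators mentioned above can be sketched with Python generators. This is an illustrative sketch, not the extensible query language itself: the function names are hypothetical, the input streams are assumed to be sorted sequences of UEIDs, and the store is modeled as a plain dictionary.

```python
import heapq


def stream_and(a, b):
    """Boolean AND: yield UEIDs present in both sorted input streams."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)


def stream_or(a, b):
    """Boolean OR: merge two sorted input streams, dropping duplicates."""
    last = object()  # sentinel that never equals a real UEID
    for ueid in heapq.merge(a, b):
        if ueid != last:
            yield ueid
            last = ueid


def augment(stream, store, key):
    """Attach to each UEID in the stream the value of the given key, if any."""
    for ueid in stream:
        yield ueid, store.get(ueid, {}).get(key)
```

Because every operator consumes and produces a stream, operators can be nested freely, for example `augment(stream_and(s1, s2), store, "geo")`.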

As mentioned above, the data analysis level of the system 10 includes an analyzer library 32, which holds software-implemented analyzers that interact with the API 20 of the data collection level and, through it, with the data storage level. In a typical particular embodiment of the invention, the analyzer library 32 includes a link analyzer 34, which outputs the links to and/or from a specific page; a spam filter 36, which detects spam in the data store 18; a pornography filter 38, which identifies pornographic pages in the data store 18; a classifying analyzer 42, which classifies pages on the basis of the sequences of terms present on them; a geographic data analyzer 44, which finds any geographic information on a Web page; a corporation analyzer 46; a taxonomy analyzer 48, which outputs the pages belonging to given taxonomic categories; a regular expression analyzer 50, which outputs a stream of pages that contain a particular regular expression; and so on.

An "analyzer", or "data analysis element", is understood to be an analysis engine that generates output data on the basis of a set of rules, in particular output that may contain one or more keys representing characteristics of an object. Such rules can be determined heuristically and may include statistically based rules. As an example, the analyzer 38, which implements the pornography filter function, can use image analysis techniques to determine whether a Web page is pornographic and attach a Boolean key to the page: "porn = yes" or "porn = no". This pornography analyzer can use, for example, the principles described in patent US 6295559 in the name of IBM. Similarly, the corporation analyzer 46 may determine, on the basis of association rules, analysis of the URL or other means, whether a particular page is the page of some corporation, and then attach to the page a key indicating the result of the analysis. Likewise, the analyzer that implements the spam filter 36 can use, for example, the principles described in patent US 6266692 in the name of IBM. This analyzer can attach to Web pages, or to messages received by e-mail, keys indicating whether they are spam. Further, the geographic data analyzer 44 can attach to a Web page a key representing the intervals of degrees of latitude and longitude relevant to the subject of the page or to its author, using rules for extracting such information. In a particular example implementation of the invention, the geographic data analyzer can be based on the principles described in patent US 6285996 in the name of IBM. Note that the specific types of analyzers and the specific rules used by each analyzer may vary without affecting the functionality or scope of the present invention.

In any case, the analyzers are modular components with well-defined specifications of their input and output data. They can be written in any language and may range, for example, from a few lines of simple Perl for finding keywords to tens of thousands (or more) of lines of code that perform complex distributed operations. Large tasks can be broken down into smaller parts, each of which can easily be solved by an individual analyzer or its developer. The intermediate results obtained can easily be viewed, tested and debugged, and they may be of independent interest to the developers of other analyzers. Thus, the analyzers are the equivalent of object-oriented design implemented in an architecture that serves a wide range of queries. Analyzers are specified in terms of the data, usually identified by the keys discussed below, that must be present before they begin work, and the data (including other keys) that they create when they process data successfully.

In particular, in the preferred embodiment of the invention, an analyzer can be supplied with work (jobs) from a system-managed queue on the basis of one or more dependencies declared by that analyzer. As an example, some analyzer (analyzer A) that is interested in processing pages containing specific persons or certain geographic areas may register its dependence on the geographic data analyzer 44 and on a person analyzer. The work queue for analyzer A will then be continuously updated to include objects that have been processed by the geographic data analyzer and the person analyzer, as indicated by the keys attached to these objects in the data store by those analyzers, but not yet by analyzer A. After processing such objects, analyzer A can attach its own key or keys to the processed objects, using existing object tables if the data store is a database, or can create new objects (using the appropriate tables if the data store is implemented as a database), each key representing a specific characteristic of the object. Analyzers that extract references to specific types of products, trade names, people, industry sectors, actors and so on operate in this mode.
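The dependency-driven work queue can be illustrated with a short Python sketch. It assumes, purely for illustration, that the store is a dictionary mapping object identifiers to their attached keys and that an analyzer records its completion by attaching a key of its own; the names `work_queue`, `geo`, `person` and `analyzerA` are hypothetical.

```python
def work_queue(objects, analyzer_key, depends_on):
    """Return ids of objects that carry every dependency key but not analyzer_key."""
    return [
        object_id
        for object_id, keys in sorted(objects.items())
        if all(dep in keys for dep in depends_on) and analyzer_key not in keys
    ]


# Hypothetical store contents: object id -> keys attached so far.
objects = {
    "page1": {"geo": "WA", "person": "J. Doe"},
    "page2": {"geo": "OR"},  # not yet seen by the person analyzer
    "page3": {"geo": "WA", "person": "A. Roe", "analyzerA": "done"},
}
```

Here only `page1` is queued for analyzer A: `page2` is still missing a dependency key, and `page3` has already been processed. Once analyzer A attaches its key to `page1`, the queue becomes empty until new objects qualify.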

On the other hand, an analyzer may not take jobs from a work queue but may instead register its data freshness requirements with the management system described below, which determines how often and in what environment the analyzer should be run. Other analyzers, which for example compute weekly totals, may ask the management system described below to start one or more instances of the analyzer, which then build the weekly summary tables or data structures, likewise using as input objects to which the appropriate keys are attached.

Thus, an analyzer reads from the data store 18 persistent, reliable streams of raw content, as well as processed data created by other analyzers. These analyzers, and in fact many analyzers of the system 10, consume and process the data. The two data access models considered above are random access to a specific object or set of objects, and streaming access to a list of objects. For random access to the data store 18, the analyzer simply requests the relevant part of the corresponding object using the universal object identifier UEID. To obtain a data stream, the compilation of a list of objects is initiated by requesting data from the data store 18 through the indexer 26 or the query processor 30.

For example, analyzers with more complex data requirements may send the query processor 30 complex queries, possibly requiring access to multiple components; the query processor applies conventional query optimization methods and generates data streams in response. Such queries may involve database joins across multiple tables, index lookups including text search, interval queries, geographic lookups, and the combination within the system of smaller result sets from many different sources. Regardless of whether the lists are obtained from the indexer 26 or from the query processor 30, these lists are persistent, and they can be accessed sequentially or in parallel depending on the nature of the processing.

The results of the processing performed by an analyzer are stored in the data store 18, where they are accessible to other analyzers and to end users. As mentioned above, to write data to the data store 18 so that other analyzers can access it, the analyzer simply creates the new keys and values to be attached to the object and then performs the write operation.

The results of a specific client's request for information, provided by the analyzers according to the present invention, may be presented at the data presentation level 52. The results can be printed or presented in audiovisual or any other form, as desired. The levels described above are controlled by the cluster management subsystem level 54, which is described in more detail below. To facilitate data input and the return of information, the client interface 56 may, if desired, access the API 20 of the data collection level and the client databases 58.

In this preferred embodiment, the management subsystem level 54 schedules, initiates, monitors and logs the operations occurring in the various components. Target applications take their results from rendered tables, from the data store 18, or from analyzers that process requests in real time.

In a particular preferred embodiment of the invention, the system 10 and the management subsystem level 54 support a large cluster (cluster computing system). In addition to controlling the analyzers, the management system 54 detects faults and failures of the cluster's hardware and software and, by means of a dedicated program, restores the system after failures, notifying the system administrators as appropriate. The management subsystem level 54 also provides functionality such as relocation, load balancing and capacity planning for each software component.

All events in the system 10 flow into a single information server, which monitors the state of the system, keeps statistics and logs, and receives error codes from applications and infrastructure components. Events are generated by a variety of sources, including the error classes used by the software components of the cluster, the event and log facilities of the DB2 database manager associated with the data store 18, system and network monitoring components, and the so-called observer agents ("nannies"), which are part of the management subsystem level 54 and run on the respective computers of the cluster.

Preferably, the observer agents start, stop and monitor processes and track the resources of the computers on which they are installed. They perform and/or monitor data transfer and the use of disk, memory, CPU and kernel resources (processes, sockets, etc.), and they carry out process management, including the commands "start", "stop" and "kill all". The observer agents also receive status information from the individual analyzers running on the computers on which they are installed, including log messages, error messages, statistics, the number of pending documents, the number of documents processed per second, the actual document flow rate, the processing speed in bytes or objects per second, and other status information characteristic of the analyzers.

Figure 2 presents the general logic of the above-described system 10. At step 60 the navigator 12 crawls the World Wide Web 14 to populate the data store 18. If desired, the data store 18 can additionally be populated with data from the databases 24 at step 62 using the data collection tool 22.

After the data store 18 has been populated, at step 64 at least those analyzers that can be regarded as lower-level analyzers access the data and process it as described above. The lower-level analyzers write their processing results back to the data store 18. For example, filtering analyzers such as the spam filter 36 and the pornography filter 38 can process all Web pages stored in the data store 18 and attach to each corresponding object a key indicating whether the corresponding page is a source of spam or pornography. In addition, a tag-stripping analyzer can be run on each page; when it processes a page it removes the hypertext markup language (HTML) markup elements, leaving only the raw text, and then attaches to each respective object a key, "untagged", indicating this.
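A tag-stripping analyzer of the kind just described might look like the following Python sketch, which uses the standard library `html.parser` module. Storing the stripped text as the value of the "untagged" key is an assumption made for this example (the description only says that the key is attached), and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def detag(html):
    """Return the text of an HTML page with markup removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def detag_analyzer(obj):
    """Attach an 'untagged' key holding the page text without HTML markup."""
    obj["untagged"] = detag(obj["Crawl:Content"])
    return obj
```

Note that this simple extractor also keeps the text inside `script` and `style` elements; a fuller analyzer would skip those.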

Next, at step 66, client requests for information may be received. In response, at step 68 additional lower-level analyzers may be written, or higher-level analyzers created, if they are needed but have not yet been developed. Higher-level analyzers are understood to be analyzers that declare dependencies on the output of other analyzers, i.e. that need to process objects tagged with keys produced by lower-level analyzers.

An example of a higher-level analyzer, or analysis engine, is an analyzer that answers the above question "What is the opinion of our customers in the Pacific Northwest states about our competitors' wellness products in terms of brand recognition and value?". Such an analyzer may specify that it requires only pages from the Pacific Northwest states, as indicated by the geographic key attached to the object by the geographic data analyzer, and only if the object contains the name of a competitor, as indicated by the key attached to the object by the proper-name analyzer. Many such dependencies can be formulated as hypotheses: it is assumed that the specialist who sets up the dependencies for this analyzer works heuristically on the basis of his or her own knowledge and experience, and need not know how the specialist who created, for example, the geographic data analyzer arrived at its decisions. At step 70 the results are communicated to the client, who is billed either per query or by subscription.

Figures 3 and 4 show diagrams of horizontal and vertical tables that can be used when, as one particular example implementation of the invention, the data store is implemented in a DBMS such as DB2. Figure 3 shows a horizontal table 72, each row 74 of which represents an object. Each row has a column 76 of universal object identifiers (UEID), a column 78 of timestamps (if required) and a plurality of key columns 80. In contrast to the horizontal table, the vertical table 82 shown in Figure 4 has a plurality of rows 84, each of which has a single key column 86, a UEID column 88, a key code column 90 specifying the type of key, and a key value column 92 specifying the key value, for example a Boolean value, an interval, etc. A timestamp column 94, indicating the time at which the corresponding data was entered into the table, may optionally be added.
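The two table layouts can be illustrated with a minimal sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for DB2, and every table name, column name and sample value below is hypothetical rather than taken from the figures.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a DBMS such as DB2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Horizontal table: one row per object, one column per key.
CREATE TABLE page_horizontal (
    ueid          TEXT PRIMARY KEY,
    ts            TEXT,
    crawl_content TEXT,
    crawl_header  TEXT
);
-- Vertical table: one row per (key, object) pair.
CREATE TABLE key_vertical (
    key_name  TEXT,
    ueid      TEXT,
    key_code  TEXT,   -- type of the key, e.g. 'boolean' or 'interval'
    key_value TEXT,
    ts        TEXT    -- optional timestamp
);
""")

# Illustrative sample data (hypothetical values).
conn.execute(
    "INSERT INTO page_horizontal VALUES (?, ?, ?, ?)",
    ("page:1", "2001-01-01", "raw page text", "HTTP/1.1 200 OK"),
)
conn.executemany(
    "INSERT INTO key_vertical VALUES (?, ?, ?, ?, ?)",
    [
        ("spam", "page:1", "boolean", "false", "2001-01-02"),
        ("untagged", "page:1", "boolean", "true", "2001-01-02"),
    ],
)


def keys_for(ueid):
    """Return all keys attached to one object, read from the vertical table."""
    rows = conn.execute(
        "SELECT key_name, key_value FROM key_vertical WHERE ueid = ?",
        (ueid,),
    ).fetchall()
    return dict(rows)
```

In this sketch the horizontal table suits keys that are always written together for one object, while the vertical table lets any analyzer attach a new key without changing the schema.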

From the foregoing it should be clear that the data store 18, by being implemented in the form of database tables 72, 82, abstracts away the actual data format. This makes it possible to decide which table type to use for a particular object so as to improve efficiency for the access patterns expected to be typical for that type of object. Preferably, the data store 18 also abstracts away existing DB2 restrictions on row length by automatically using variable-length character objects (VARCHAR) or binary large objects (BLOBs) to store values whose length exceeds the maximum row length. So that programmers do not have to access the DB2 database directly and can write code that does not depend on the physical location of the data, the proposed system provides APIs.

For example, the Navigator 12 writes the keys Crawl:Content (page content), Crawl:Header (page header) and a set of extracted metadata keys, such as the URL, the fetch latency data, the date the page was last modified, server information, the HTTP return code, and so on. If the data store 18 is implemented as a database, all information of this kind is recorded in a single horizontal table in which each Navigator key has its own column. The information is written only by the Navigator 12, but it can be read by any analyzer that has permission. Analyzers that need the content of a page only have to request the value of the key Crawl:Content, and the data store 18 maps the request to the corresponding table.

To make this mapping more efficient, a data dictionary may optionally be created in the data store 18, containing information about the mapping between a particular key and its actual location in the relational database. The dictionary also contains auxiliary information such as the type and owner of the key. Analyzers that write many keys can write them to a dedicated horizontal table, so that many keys can be written with a single row update operation.
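A data dictionary of this kind might look roughly as follows. This is only a sketch under stated assumptions: the KeyInfo record, the table and column names, the key Geo:Region and the build_row_update helper are all hypothetical and are not part of the described system's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyInfo:
    """Hypothetical dictionary entry: where a key lives, its type and owner."""
    table: str
    column: str
    key_type: str
    owner: str


# Hypothetical data dictionary mapping key names to physical locations.
DATA_DICTIONARY = {
    "Crawl:Content": KeyInfo("page_horizontal", "crawl_content", "text", "navigator"),
    "Crawl:Header": KeyInfo("page_horizontal", "crawl_header", "text", "navigator"),
    "Geo:Region": KeyInfo("geo_horizontal", "region", "text", "geo_analyzer"),
}


def build_row_update(ueid, values):
    """Build one parameterized UPDATE that writes several keys of one object.

    `values` maps key names to new values; all keys must live in the same
    horizontal table so that a single row update covers them.
    """
    infos = [DATA_DICTIONARY[name] for name in values]
    tables = {info.table for info in infos}
    if len(tables) != 1:
        raise ValueError("keys must all be stored in the same table")
    table = tables.pop()
    assignments = ", ".join(f"{info.column} = ?" for info in infos)
    params = list(values.values()) + [ueid]
    return f"UPDATE {table} SET {assignments} WHERE ueid = ?", params
```

An analyzer would then refer to keys only by name, and the dictionary would decide which table and column are touched.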

As mentioned above, several analyzers, including the Navigator 12, most naturally work at the page level, creating and consuming page information. Other analyzers, however, may work with objects other than raw pages. In particular, some analyzers, such as the link-based spam filter 36, work with entire Web sites, determining whether a whole site is a source of spam. Other analyzers may work with phrases, proper names, company names, or the names of places, restaurants, business people, etc. Each category of data is a separate object type and requires its own set of horizontal and vertical tables (or other data storage structures) in the data store 18. Accordingly, just as the Navigator 12 writes data to the horizontal table of the page object in the database, the corporation analyzer 46 can populate a horizontal table for corporations. Other analyzers that need to attach key values to corporations can access the keys attached to the object by the corporation analyzer 46 and then write other keys to another data structure of the corporation object.

Figure 5 shows a specific logic flow diagram for the case in which a Web page is received from the Navigator 12 at step 96. At step 98, a tag-removal analyzer may be invoked which, when processing the page at step 100, removes the HTML markup, leaving only the raw text, and attaches to the object an "untagged" key indicating this.
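A tag-removal analyzer of the kind described for steps 98 and 100 could be sketched as follows, assuming Python's standard html.parser module and a page object represented as a plain dictionary. The dictionary layout and the "text" key are hypothetical; only the Crawl:Content key and the "untagged" key come from the description.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and ignores all markup elements."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: the contents of <script> and <style> elements also arrive
        # here; a fuller analyzer would probably skip them.
        self.parts.append(data)


def remove_tags(page):
    """Return a copy of the page object with raw text and an 'untagged' key."""
    collector = _TextCollector()
    collector.feed(page["Crawl:Content"])
    collector.close()
    # Join the text fragments and collapse runs of whitespace.
    text = " ".join(" ".join(collector.parts).split())
    result = dict(page)
    result["text"] = text        # hypothetical key holding the raw text
    result["untagged"] = True    # key signalling that markup was removed
    return result
```

The original page object is left unchanged, so other analyzers that still need the markup can keep reading Crawl:Content.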

At step 102, in accordance with the foregoing principles, the object may be passed to other analyzers, to which the control system 54 delivers the object on the basis of the "untagged" key. At step 104, these other analyzers process the data characterizing the object and may add their own keys to elements of the object's data structure, for example, both to the horizontal table representing the object and to the corresponding vertical table representing the key, if the data store is implemented as a database. Some analyzers may also extract information, such as the name of a corporation, from the page object and create additional object storage data structures (files or tables) representing such objects, for example, corporation objects.

After the initial analyzers have finished, the flow proceeds to decision step 106, shown as a diamond, which determines whether objects with the given set of keys have been requested by any other analyzers, for example, by an n-th analyzer. If an object contains all the keys that the n-th analyzer requires as input, at step 108 the object is delivered to the n-th analyzer, for example, by placing the object in the analyzer's work queue. At step 110, the n-th analyzer accesses the object, for example by consulting its work queue, in order to process the object and/or the data characterizing it. At step 112, the n-th analyzer generates its own key or keys and enters them into the object's data structure as appropriate to associate the key or keys with the object. Next, at step 114, the client's analyzer may invoke other analyzers and/or access the relevant objects as appropriate to create a database containing the information sought by the client.
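The decision at step 106 and the queueing at steps 108 to 112 can be sketched as below. This is an illustrative model only: the Analyzer class, the dispatch function, the "dispatched_to" bookkeeping field and the sample key names are hypothetical, and the sketch keeps objects and queues in memory rather than in the data store.

```python
from collections import deque


class Analyzer:
    """Hypothetical analyzer with required input keys and a work queue."""

    def __init__(self, name, requires, produce):
        self.name = name
        self.requires = frozenset(requires)  # keys the analyzer needs as input
        self.produce = produce               # function: object -> dict of new keys
        self.queue = deque()                 # the analyzer's work queue

    def run(self):
        """Process every queued object and attach the analyzer's own keys."""
        while self.queue:
            obj = self.queue.popleft()
            obj["keys"].update(self.produce(obj))


def dispatch(obj, analyzers):
    """Queue the object for each analyzer whose required keys are all present."""
    queued = []
    seen = obj.setdefault("dispatched_to", set())
    for analyzer in analyzers:
        if analyzer.name in seen:
            continue  # do not hand the same object to an analyzer twice
        if analyzer.requires.issubset(obj["keys"]):
            analyzer.queue.append(obj)
            seen.add(analyzer.name)
            queued.append(analyzer.name)
    return queued


# Two lower-level analyzers and one higher-level analyzer that depends on both.
geo = Analyzer("geo", {"untagged"}, lambda o: {"Geo:Region": "Pacific Northwest"})
names = Analyzer("names", {"untagged"}, lambda o: {"Name:Competitor": True})
opinion = Analyzer(
    "opinion", {"Geo:Region", "Name:Competitor"}, lambda o: {"Opinion:Brand": "positive"}
)
analyzers = [geo, names, opinion]

obj = {"ueid": "page:1", "keys": {"untagged": True}}
first_round = dispatch(obj, analyzers)
for a in analyzers:
    a.run()
second_round = dispatch(obj, analyzers)
for a in analyzers:
    a.run()
```

The higher-level analyzer is reached only in the second round, once both keys it depends on have been attached by the lower-level analyzers.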

The described system 10 can be used to solve a variety of specific tasks in the interests of clients. One such task is the creation of action links/drill notes for objects: the system receives a document, and an analyzer of the system, based on certain rules, identifies important objects or entities in the document (e.g., people, places, events). A compiler analyzer of the system 10 then compiles a dossier, or a data set in some other form, on each of these objects. The dossier (or its equivalent) is then linked to the object in the source document.

A dossier or its analogue may be a mini-portal for the given type of object; for example, it may look like a dedicated Yahoo-like directory for the object. Accordingly, if the object is a person, subcategories can be created for it containing addresses relating to the person, names of people related to the person, locations relating to the person, fields of activity relating to the person, publications of the person, etc. The choice of objects for which links are created is made by the compiler analyzer, preferably in accordance with a configurable preference function or another rule, which may be determined heuristically.

Another particular application example of the system 10 is an application for "searching for and linking legal relationships", one component of which is the above-mentioned action link/drill note system and which additionally searches for probable relationships between objects stored in the data store 18, where at least some of the desired connecting elements may be missing. As an example, consider a case in which John Doe and Jane Smith are listed in the data structure of person objects, but the data stored in the data store 18 does not indicate any explicit relationship between them. The analyzer of this application identifies both as important objects based on a set of specific rules and then determines whether there are other objects through which these two people may be connected. For example, it may turn out that both are board members of some company or charitable organization, that they published a report together, or that they were mentioned in the press as colleagues or as partners in some transaction, etc. In such cases the connecting element (for example, the company on whose board both John and Jane sit, or the report they published together) can be regarded as a "connecting object" and included in the query for identifying hidden relationships.

The hidden-relationship discovery phase of a court case can therefore be extended to include not only requests for documents relevant to specific topics, people or events, but also a search of external data sources for documents relevant to the "connecting objects".
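The search for "connecting objects" can be illustrated with a minimal sketch, assuming that the relationships already extracted by analyzers are available as a mapping from each person object to the set of other objects it is linked to. The function name, the identifiers and the sample links are hypothetical.

```python
def connecting_objects(links, person_a, person_b):
    """Return the objects linked to both persons, in sorted order.

    `links` maps a person identifier to the set of identifiers of objects
    (companies, reports, transactions, ...) that the person is linked to.
    """
    return sorted(links.get(person_a, set()) & links.get(person_b, set()))


# Hypothetical links, as they might be derived from keys in the data store.
links = {
    "person:john_doe": {"company:acme_board", "report:annual_review", "city:seattle"},
    "person:jane_smith": {"company:acme_board", "report:annual_review", "charity:food_bank"},
}

shared = connecting_objects(links, "person:john_doe", "person:jane_smith")
```

Each identifier in `shared` could then be added to the document query as a further search term.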

In another particular example, the capabilities of the system 10 are used in a competitive product marketing application. Information about a product range can be entered into the system 10, and analyzers can be created to unambiguously identify mentions of a product and to classify such mentions according to their context, for example, to distinguish the detergent Tide from the tide as a natural phenomenon. To classify the context of mentions, a classifier/profiler analyzer is also used, which classifies or characterizes mentions on the basis of a set of previously classified or profiled mentions, preferably using statistical tools. In addition, the geographic data analyzer can be used to determine, in broad terms, the geographic affiliation of the source in which a mention occurs. The application can then be configured to report finely differentiated measures of the "public attention" or "buzz" surrounding a given set of products in comparison with the "public attention" or "buzz" generated by competing products. Such information may be displayed on a map, for example using different colors or brightness levels corresponding to the level of "public attention" or "buzz". It can also be tracked over time, which helps to identify noteworthy positive or negative trends. Data on the cost of advertising or other marketing activities related to a product, segmented geographically or demographically, can also be entered into the system, and an analyzer can check for a direct correlation between this activity and the "public attention" or "buzz", thereby providing a measure of the effectiveness of the marketing activities.

Although the knowledge-based system for extracting information from data and the data store for it, described in detail above, are fully capable of achieving the above-described objects of the invention, it should be borne in mind that this is only the presently preferred embodiment of the invention and is representative of the subject matter broadly contemplated by it. The scope of the present claims fully encompasses other embodiments that may become obvious to those skilled in the art, and that scope is limited by nothing other than the claims, in which reference to an element in the singular means not "one and only one" but "one or more", unless explicitly so stated. All structural and functional equivalents of the elements of the above-described preferred embodiment that are known or later become known to those skilled in the art are included within the scope of the present claims. Moreover, a device or method need not address every problem sought to be solved by the present invention in order to be encompassed by the present claims. Furthermore, no element, component or method step in the present description is intended to be dedicated to the public, regardless of whether the element, component or method step is explicitly recited in the claims.

1. A computer system for data mining, comprising at least one data store containing objects, and the following software:

at least one lower-level analysis engine that interacts with the data store and generates as output at least one key representing a characteristic of the object with which the key is associated, based on a first set of rules implemented in the lower-level analysis engine;

at least one higher-level analysis engine that receives the output of the lower-level analysis engine and generates as output at least one key representing a characteristic of the object with which the key is associated, based on a second set of rules implemented in the higher-level analysis engine, these outputs being attached to the objects contained in the data store; and

an indexer associated with the data store and containing indexes of the keys and key values available in the data store, wherein the generation of keys in the analysis engines and the indexing by the indexer are performed separately.

2. The system according to claim 1, in which the data store includes vertical and horizontal tables, wherein entries in the vertical table are keyed by one of the keys and entries in the horizontal table are keyed by object identifier.

3. The system according to claim 1, in which the data store is a relational database.

4. The system according to claim 1, wherein the data store is a file system.

5. The system according to claim 4, comprising a high-speed semiconductor cache memory associated with the indexer and the data store.

6. The system according to claim 5, comprising a query processor having access to one or more of the following elements: the cache memory, the indexer and the data store, for executing queries of at least one analysis engine.

7. The system according to claim 6, comprising at least one work queue associated with at least one analysis engine.

8. The system according to claim 1, in which the objects are identified by identifiers that encode the type of the object.

9. The system according to claim 1, in which the indexer contains Boolean indexes that store "yes" or "no" answers to queries of the form "does key k have value v?".

10. The system according to claim 1, in which the indexer contains an interval index that stores intervals of key values.

11. The system according to claim 1, in which the indexer contains text indexes.

12. The system according to claim 1, in which the indexes of the indexer and the tables of the data store do not indicate the location of a particular name or text within an object, but only indicate that the object has certain characteristics.

13. The system according to claim 1, in which the indexer indicates the presence of a particular text element in a Web document and provides an application programming interface (API) for searching documents by keyword.

14. The system according to claim 1, in which the indexer contains keys for making queries against a particular object using Boolean operators.

15. A method of storing data to support a data mining system, in which:

objects are stored in at least one data store;

the data store is interacted with by means of at least one first analysis mechanism;

based on a first set of rules associated with the first analysis mechanism, at least one first key is generated representing a characteristic of the object with which the first key is associated;

the first key is sent to at least one second analysis mechanism;

based on a second set of rules associated with the second analysis mechanism, at least one second key is generated representing a characteristic of the object with which the second key is associated;

the first and second keys are attached to the objects; and

the keys and key values available in the data store are indexed, wherein the generation of keys in the analysis mechanisms and the indexing are performed separately.



 
