Method for finding and selecting information in various databases

FIELD: technology for finding and identifying documents based on their descriptions, present in various databases and information resources with different document creation standards.

SUBSTANCE: in accordance with the invention, search requests formed by the user are dispatched to the search system of a server, which processes them by selecting documents from various databases. The search system combines all selected documents into a single list, sorts them by topic, creates folders, each containing the documents of one topic, and then re-sorts the documents according to their final rating. After that, the sections of the future report are determined on the basis of the user request; the search system determines the text signs of the beginning and end of the sections; the text of the documents selected for the highest final rating is marked up; text segments are selected inside each section and sorted according to publication date; and a final report is prepared in which the text segments, sorted according to the publication date of the original document, are combined into a single text array. The final report is then dispatched to the user terminal through telecommunication means.

EFFECT: increased precision of searching and analyzing of received information.

4 dwg, 1 tbl

 

The invention relates to searching for and locating documents by their descriptions in different databases and information resources with different document formation standards, and can also be used for filling a database with fragments of initially unstructured texts.

Methods are known for identifying documents by their descriptions, which consist in converting natural-language texts in selected areas of knowledge into signals suitable for machine processing, forming a query as a sample of keywords, and comparing the keyword sample of the query with the thesaurus of the texts stored in the database (see patents of the Russian Federation No. 2107942 and 2167450, U.S. patent No. 6460034, and search in the Yandex database).

The disadvantage of these methods is that they are limited to one database with a known formation standard.

The closest analogue, adopted as the prototype, is the method of search and retrieval of information from databases described in patent RU 2236699, in which: a user at a workplace, which may be any personal computer with access to various databases, generates at least one search query; the generated query is transmitted to the search engine; the search engine processes the query by selecting documents from a database; the search engine sorts the selected documents by topic and forms folders, each containing documents on the same topic; the characteristics of each sorted document are identified; within each folder, the search engine determines the rating of each characteristic contained in each sorted document; the search engine then determines the number of matches between the characteristics of each sorted document in one folder and the characteristics of documents in other folders; the final rating of each sorted document is determined from the number of matching characteristics, taking into account the weighting factor of the database; after that, the search engine re-sorts the documents by final rating and sends the documents, sorted by final rating, to the user's workplace.

The disadvantage of the prototype is the lack of structural processing and analysis of the documents according to their relevance to a given query term. Treating all selected objects and documents as equivalent increases the volume of selected information and the amount of information noise, which ultimately increases the cost of the intellectual labor the user spends on processing the selected information.

In addition, when working with multiple document repositories that use different document formation standards, the identification of objects becomes difficult.

The technical result of the claimed invention is the extension of functionality by improving the accuracy of search and analysis of the information received.

The technical result is achieved in that, in a method of search and retrieval of information from various databases, which includes forming at least one search query by a user on a user computer, transmitting the generated query through telecommunication facilities to a search engine server, and processing the query by the search engine by selecting documents from different databases, the search engine additionally: combines all the selected documents into a single list; sorts the selected documents by topic; creates folders, each containing the documents sorted on one topic; defines the relevance parameters for each folder; for each sorted document in each folder, selects the features that characterize the document; within each folder, determines the rating of each characteristic contained in each sorted document; determines the final rating of each selected document based on the rating of each characteristic of the document and the relevance parameters of the folder; re-sorts the documents according to the final rating; stores the documents sorted by final rating in the memory of the search engine; selects the number of documents specified by the user in the query that have the highest final ratings; performs structural processing of the selected documents to prepare a final report; generates a graph showing the dependence of the number of documents sorted by final rating on time; and transmits the final report and the generated graph to the user terminal through the telecommunication facilities.

In trend analysis and monitoring of a particular subject area based on the analysis of unstructured information (books, articles, reviews and so on), the user faces the following main problems:

1. A large amount of information. In traditional subject areas, the number of publications reaches hundreds of thousands or even millions. The user cannot read this volume, so conclusions about the state of the subject area have to be drawn from a sample. No user can be confident that significant materials refuting his point of view do not exist and have simply not been found.

2. The relevance of the materials. Almost all search engines (Google, Yahoo, Yandex, and others) return search results in a single list. In these lists the relevance (value) of materials is determined by formal principles (the number of links, the concentration of words, attendance, and so on). In everyday activity, however, the user never applies such principles when comparing dissimilar things: a book is not compared with a person, a company with an event, or a patent with a news item. In some cases one kind of information, such as an article, is important to the user; in other cases it may be another kind, such as conference materials or a patent.

3. The incompleteness of information. Typically, the user has to deal with a set of heterogeneous information (texts): articles, books, conference papers, news, etc. The selected materials often refer to different periods in the development of the subject area and are written by different authors representing different organizations, with different interests, goals, objectives, and levels of development and competence. In addition, these texts vary in structure, depth, terminology, and volume of data. The differences can be fundamental: viewpoints may conflict with one another, even those of a single author, whose position often varies from work to work over time.

The claimed method is aimed at solving the above problems.

The essence of the claimed invention is that, after the user forms a search query on the user's computer, the query is transmitted through telecommunication facilities to a search engine server, and the search engine processes it by selecting documents from different databases, the following stages are performed.

Stage 1. Selection of information.

Initially, the user selects an arbitrary number of documents in various search engines, such as AltaVista, Yahoo, Google and others, in the order in which each search engine returns them according to its own rules. As noted above, such a list of links contains materials that are heterogeneous in nature: articles, patents, companies, news, etc. Obviously, in real life we do not compare a person with a patent or a book with a company. The value of a material depends on various subjective factors and on the interests of the user in the particular case.

In the proposed method, the search system is used:

To combine all found materials into a single list, without regard to the order in which they appear in the lists of the various search engines;

To sort the documents by topic, namely to collect materials that are homogeneous in nature into separate folders (groups): all books in the folder "Books", all patents in the folder "Patents", information about specialists in the folder "Personalities", and so on.

The number of folders can be arbitrary, but each folder should contain homogeneous documents that exist as objects in real life.

An example set of folders:

Popular materials (Introduction)

News

News sources

Events

Organization

Personalities

Portals

Periodicals

Books

Reviews
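The merging and folder-sorting steps described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys `url` and `kind`, the use of the URL as a document identity, and the function names are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def merge_results(*result_lists):
    """Combine result lists from several engines into one list.

    The position of a document in any engine's list is deliberately ignored.
    Assumption: a document is identified by its URL, so duplicates collapse.
    """
    merged = {}
    for results in result_lists:
        for doc in results:
            merged.setdefault(doc["url"], doc)
    return list(merged.values())


def sort_into_folders(documents):
    """Group documents into folders by their kind (Books, News, ...)."""
    folders = defaultdict(list)
    for doc in documents:
        # Assumption: each document already carries a "kind" label.
        folders[doc["kind"]].append(doc)
    return dict(folders)


engine_a = [
    {"url": "http://example.org/book1", "kind": "Books", "title": "Book one"},
    {"url": "http://example.org/news1", "kind": "News", "title": "News item"},
]
engine_b = [
    {"url": "http://example.org/book1", "kind": "Books", "title": "Book one"},
    {"url": "http://example.org/person1", "kind": "Personalities", "title": "Expert"},
]

folders = sort_into_folders(merge_results(engine_a, engine_b))
for name, docs in sorted(folders.items()):
    print(name, len(docs))
```

How a document's kind is recognized in practice is not specified in the text, so the sketch simply assumes the label is available.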

To determine, for each folder, the parameters for assessing the relevance (importance) of a material.

For example, for the folder "Books" the evaluation parameters are: the citation index, reader feedback in online shops, the year of publication, the fame of the author, the language of the work, and others.

In the folder "Company" the parameters can be: the company's reputation, its specialization in the given domain, capitalization, region, country, etc.

The number of parameters that can be entered by the user is not limited. In practice, however, the number of important parameters rarely exceeds 7. It has been established empirically that in most practically important cases the optimal number of parameters is 4.

Each parameter in each folder is assigned a degree of importance, called the weight of the parameter, which takes values from 0 to 1.

An integer level is set for each parameter in each folder, depending on its actual value.

For example, for the folder "Books":

The parameter "Citation index":

10 or more - 5 points;

more than 2 but less than 10 - 4 points;

1-2 - 3 points;

0 - 2 points.

The parameter "Language":

"English" - 5 points;

"English" - 4 points;

"German" - 3 points;

"Japanese" - 2 points;

"Arabic" - 1 point.

The parameter "Year of publication":

2004-2006 - 5 points;

2000-2003 - 4 points;

1995-1999 - 3 points;

1980-1995 - 2 points;

before 1980 - 1 point.
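The mapping from actual parameter values to integer levels can be sketched as a pair of small functions. The thresholds are taken from the "Books" example above; the function names are hypothetical. Two points are assumptions of this sketch: the year bands overlap at 1995, and the code assigns 1995 to the newer band; years after 2006 are not covered by the example and are treated here as belonging to the newest band.

```python
def citation_level(citations):
    """Level for the "Citation index" parameter of the "Books" folder."""
    if citations >= 10:
        return 5
    if citations > 2:
        return 4
    if citations >= 1:
        return 3
    return 2


def year_level(year):
    """Level for the "Year of publication" parameter of the "Books" folder."""
    if year >= 2004:  # assumption: years after 2006 also count as newest
        return 5
    if year >= 2000:
        return 4
    if year >= 1995:  # assumption: 1995 falls into the 1995-1999 band
        return 3
    if year >= 1980:
        return 2
    return 1


print(citation_level(12), citation_level(5), citation_level(2), citation_level(0))
print(year_level(2005), year_level(2001), year_level(1995), year_level(1975))
```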

Each user can configure the system parameters and assign weights according to his own preferences. The level of a parameter for each document in the folder depends on the actual value of that parameter for the document.

The rating of the i-th document in the j-th folder is determined by a formula in the following quantities:

m_j - the number of parameters in the j-th folder;

a_kj - the weight of the k-th parameter in the j-th folder;

p_kij - the actual value of the k-th parameter for the i-th document in the j-th folder;

c_ij - the number of groups in which the i-th document of the j-th folder is referenced.

Then, in each folder, only the first few documents with the highest relevance indicators in that folder are selected.
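A rough sketch of rating and selection within one folder is given below. The exact rating formula is not preserved in this text, so the sketch assumes the simplest plausible form, a weighted sum of parameter levels, and leaves out the count of referencing groups entirely. The weights, parameter names, and function names are hypothetical.

```python
def rating(weights, levels):
    """Assumed rating: sum of weight * level over the folder's parameters.

    This is NOT the patent's formula (which is not preserved in the text);
    it is a stand-in that uses only the weights and the parameter levels.
    """
    return sum(weights[name] * levels[name] for name in weights)


def top_documents(folder, weights, limit):
    """Return the `limit` documents with the highest assumed rating."""
    ranked = sorted(folder, key=lambda doc: rating(weights, doc["levels"]), reverse=True)
    return ranked[:limit]


# Hypothetical weights in the range 0..1, as the description requires.
weights = {"citation_index": 1.0, "language": 0.5, "year": 0.5}

books = [
    {"title": "A", "levels": {"citation_index": 5, "language": 4, "year": 3}},
    {"title": "B", "levels": {"citation_index": 2, "language": 5, "year": 5}},
    {"title": "C", "levels": {"citation_index": 3, "language": 4, "year": 5}},
]

best = top_documents(books, weights, limit=2)
print([doc["title"] for doc in best])
```

With these sample values the assumed ratings are 8.5 for A, 7.0 for B and 7.5 for C, so A and C are kept.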

The documents sorted by the search system in accordance with the final rating are stored in the memory of the search engine.

The initial selection of documents, their sorting into folders (groups), the computation of the relevance of each item, and the selection of the final number of documents with the highest relevance indicators complete stage 1.

Stage 2. Structural processing.

At stage 2 it is necessary:

1. To determine the sections of the future final report, for example: Goals, Objectives, Forecasts, Current results, and others.

2. To determine the keywords specific to each section.

3. To identify the text characteristics of the beginning and end of each section.

4. To mark up the text of the documents selected at the first stage in accordance with the sections; to highlight text segments and move them into the database under the corresponding section.

5. To sort the segments within each section. Segments are sorted by the publication date of the original document. Segments that cite one another are placed consecutively, starting with the segment with the earlier date.

6. To update the database with new documents (monitoring).

7. To generate a report on the problem.

At any time the user can generate the final report on the problem. When a report is created, all segments of a section are merged into a single text array with references to the primary sources.
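The structural processing steps can be sketched roughly as follows. The sketch simplifies the markup step: instead of detecting the text characteristics of the beginning and end of a section, it treats each paragraph as a segment and assigns it to the first section whose keyword it contains. The section names, keywords, field names, and sample documents are hypothetical, and the grouping of mutually citing segments is not modeled.

```python
from datetime import date

# Hypothetical report sections and their keywords.
SECTION_KEYWORDS = {
    "Goals": ["goal", "aim"],
    "Forecasts": ["forecast", "expected"],
}


def mark_up(document, section_keywords):
    """Split a document into paragraphs and tag each with a matching section."""
    segments = []
    for paragraph in document["text"].split("\n\n"):
        lowered = paragraph.lower()
        for section, keywords in section_keywords.items():
            if any(word in lowered for word in keywords):
                segments.append({
                    "section": section,
                    "text": paragraph.strip(),
                    "published": document["published"],
                    "source": document["source"],
                })
                break  # a paragraph goes to the first matching section only
    return segments


def build_report(documents, section_keywords):
    """Merge segments into one text, sorted by publication date per section."""
    segments = [s for doc in documents for s in mark_up(doc, section_keywords)]
    lines = []
    for section in section_keywords:
        lines.append(section)
        in_section = [s for s in segments if s["section"] == section]
        for s in sorted(in_section, key=lambda s: s["published"]):
            lines.append(f'{s["text"]} [{s["source"]}, {s["published"].isoformat()}]')
    return "\n".join(lines)


documents = [
    {"source": "Review A", "published": date(2005, 3, 1),
     "text": "The goal of the project is a faster index.\n\nSales are expected to double."},
    {"source": "Article B", "published": date(2003, 6, 15),
     "text": "Our aim is a common format.\n\nUnrelated remark."},
]

report = build_report(documents, SECTION_KEYWORDS)
print(report)
```

Each segment keeps its source and date, so the merged text retains references to the primary sources, as the description requires.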

The above method of selecting information and structurally processing it can also be used for filling a database with fragments of initially unstructured texts.

Stage 3. Trend analysis.

Most subject areas develop in accordance with the G-curve (Fig. 1). This curve was first published by the analytical company Gartner Group in 1995.

The G-curve shows the dependence of the number of information messages about a technology on time and on the level of development of the technology. An information message is any mention of the technology in the media, literature, the Internet, and other information sources. The text of each mention is considered a document.
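The dependence of the number of documents on time can be tabulated with a simple count per period. The sketch below assumes yearly granularity and a hypothetical `year` field on each document; neither is prescribed by the text.

```python
from collections import Counter


def documents_per_year(documents):
    """Return (year, number of documents) pairs in chronological order."""
    counts = Counter(doc["year"] for doc in documents)
    return sorted(counts.items())


# Hypothetical mentions of a technology, one dictionary per document.
mentions = [
    {"year": 2000}, {"year": 1999}, {"year": 2001},
    {"year": 2000}, {"year": 2001}, {"year": 2001},
]

series = documents_per_year(mentions)
print(series)
```

The resulting pairs are the points from which a curve of document counts over time could be drawn.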

The curve is divided into five sections:

Section 1 - technology launch (new concept).

In this section the number of documents grows rapidly. This is due to a significant increase in investment in the technology, the growth of advertising, and the increase in the number of participants in the development and promotion of the technology.

Section 2 - peak of inflated expectations.

In this section the number of publications stops growing, the volume of advertising decreases, and the number of participants stops growing.

Section 3 - disappointment.

The number of published documents decreases, the volume of advertising decreases, and many participants leave the business. Investment leaves the subject area. Spending is directed to the search for qualitative changes in the technology.

Section 4 - slope of enlightenment.

The quality problems have been identified. Investment starts to grow. The volume of documents increases. The volume of advertising increases.

Section 5 - plateau of productivity.

A technology with optimal parameters has been developed. The number of documents grows steadily. Advertising budgets are optimized. Investment increases.

When starting the analysis of a new subject area, the user faces the fact that he cannot determine which section of the G-curve a given informational material (document) corresponds to (Fig. 2).

To solve this problem it would be useful to compare information materials relating to different time intervals (Fig. 3). In practice, however, this task is difficult, since most information materials differ in structure and completeness. For example, one report discusses goals, objectives and development prospects, another discusses problems and technical specifications, and a third discusses the investment attractiveness of the project. It is very difficult to determine the development trend from individual financial, technical, and scientific materials relating to different time intervals. In addition, the numerical values of some characteristics of the studied technologies may be absent from the selected materials.

The method makes it possible to construct a G-curve and to identify a trend from fragmentary data and verbal descriptions of the characteristics of the state of technology development.

To build the G-curve, the following steps are performed:

1. In each section created at stage 2, the respective characteristics are evaluated. As noted above, all collected segments of the original documents within a section are placed sequentially in accordance with the publication date of the original document. To formalize the verbal characteristics of the technology, adjacent segments are compared pairwise.

If the qualitative characteristics of the subject area strengthen, +1 is written in the table of processing results; if they weaken, -1 is written. After the table is filled in, the G-curve is built from it (Fig. 4).

2. To simplify formalization, a table of rules is used:

No. | Formalization rule | Value
1. | The number of points in the subsequent segment of the document grew | +1
2. | The text repeats | -1
3. | A date of an event appeared | +1
4. | Numerical parameters are specified | +1
5. | The words "transferred", "not satisfied", "no match" are included | -1
... | ... | ...
n | If..., then... |

If the conditions of the rules in the table are reversed, the numerical value of the segment is changed to the opposite.
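The pairwise comparison and the construction of the curve can be sketched as follows. The sketch implements only simplified versions of rules 2, 4 and 5 from the table, applies the first rule that fires, and builds the curve as a running sum of the +1/-1 values. The rule order, the use of "contains a digit" as a test for numerical parameters, the value 0 when no rule fires, and the running-sum construction are all assumptions of this sketch, not statements of the patent.

```python
import re
from itertools import accumulate

NEGATIVE_WORDS = ("transferred", "not satisfied", "no match")


def score_pair(previous, current):
    """Score the transition between two adjacent segments as +1, -1 or 0."""
    lowered = current.lower()
    if any(word in lowered for word in NEGATIVE_WORDS):
        return -1  # rule 5: negative wording
    if current.strip() == previous.strip():
        return -1  # rule 2: the text repeats
    if re.search(r"\d", current):
        return +1  # rule 4 (simplified): numerical parameters are given
    return 0       # assumption: no rule fired, no change


def g_curve(segments):
    """Running sum of pairwise scores over date-ordered segments."""
    scores = [score_pair(a, b) for a, b in zip(segments, segments[1:])]
    return list(accumulate(scores, initial=0))


# Hypothetical segments of one report section, already sorted by date.
segments = [
    "A new concept is announced.",
    "Throughput reaches 200 requests per second.",
    "Throughput reaches 200 requests per second.",
    "The release was transferred to next year.",
    "Version 2 ships with 3 new modules.",
]

curve = g_curve(segments)
print(curve)
```

Further rules from the table would be added as extra branches in `score_pair`; how conflicting rules are reconciled is left open by the text.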

The user computer is any personal computer, for example an IBM PC, consisting of a system unit connected to a monitor, a keyboard, and a pointing device of the "mouse" type.

The user computer must have access to databases, which can be either remote or local. Access to the databases can be provided by connecting the user's computer through telecommunication facilities to a search engine server on the global Internet or on a local network, for example an intranet.

The databases can be either homogeneous, each containing documents on one subject only, for example a patent database, or heterogeneous, containing documents on different topics, such as Yandex.

The database is recorded in the memory of the computer or server, for example on a hard disk.

The search engine runs on an ordinary 32-bit machine (for example, under Linux, Solaris, FreeBSD, or Win32).

The search engine used can be, for example, the Fast search system, which implements the well-known logic of direct search. The Fast search engine is developed and marketed by the Norwegian company Fast Search & Transfer ASA.

The application of the method also makes it possible to reduce the computing time of the search, to increase the relevance of the document sample to the request, and to reduce the cost of intellectual work in analyzing the document sample.

A method of search and retrieval of information from various databases, including forming at least one search query by a user on a user computer, transmitting the generated query through telecommunication facilities to a search engine server, and processing the query by the search engine by selecting documents from different databases, wherein the search engine combines all the selected documents into a single list, sorts the selected documents by topic, creates folders, each containing the documents sorted on one topic, defines the relevance parameters for each folder, selects for each sorted document in each folder the features that characterize the document, determines within each folder the rating of each characteristic contained in each sorted document, determines the final rating of each selected document based on the rating of each characteristic of the document and the relevance parameters of the folder, re-sorts the documents according to the final rating, stores the documents sorted by final rating in the memory of the search engine, and selects the number of documents specified by the user in the query that have the highest final ratings, characterized in that structural processing of the selected documents is performed, consisting in that the user specifies in the search request the sections of the future final report and the keywords specific to each section; the text characteristics of the beginning and end of the sections are identified using the search engine; the text of the documents selected with the highest final ratings is marked up; text segments are highlighted within each section and sorted by publication date; a final report is prepared in which the text segments, sorted by the publication date of the original document, are merged into a single text array; and the final report is then transmitted to the user terminal through the telecommunication facilities.



 

