Method for automated classification of documents (Russian patent RU 2254610)
FIELD: document classification systems. SUBSTANCE: the method is based on repeated application of a simple classifier to a document, followed by correction of the classification results with allowance for thematic dependences between categories. EFFECT: dependences between categories can be set a priori in the form of a category tree. 3 dwg
The invention relates to document classification systems and can be used in information processing systems, databases and digital repositories in cases where the thematic dependence of the categories on one another can be represented in the form of a tree.

A method of automatic classification of documents is known [1], which consists in the following: the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, and the weights of the words in the document are calculated in accordance with the frequency of their occurrence; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and, on the basis of a simple Bayesian classification criterion and the classification criteria, it is determined whether the document belongs to a category. This method is suitable for processing machine-readable texts in natural language.

The simple Bayesian classification of documents uses the hypothesis that the words of a document are independent of one another. Both the document and the categories are treated as probabilistic systems in which the probabilities of occurrence of word forms are calculated as independent events. To determine whether a document belongs to a category, a measure of closeness between the two probabilistic systems is computed. The simple Bayesian classification method can be used both for binary classification (it must be determined whether the document belongs to a given category or not) and for multi-class classification (the single category to which the document belongs must be found in a list of categories). In the latter case the document can belong to only one category from the list. In problems where a document can belong to several categories, several binary classifiers of the described type are used, each of which determines whether or not the current document belongs to its category; in doing so, the hypothesis is adopted that the categories are independent of one another. This method, however, has the disadvantage that it cannot classify documents when the categories are thematically dependent on one another, for example when they depend on one another hierarchically.

Another known method of automatic document classification [2] consists in the following: the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, and the weights of the words in the document are calculated in accordance with the frequency of their occurrence; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and, on the basis of the SVM (Support Vector Machines) classification criterion and the classification criteria, it is determined whether the document belongs to a category. This method, like the previous one, is intended for processing machine-readable texts in natural language.
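As an illustration only, the following is a minimal sketch in Python of a simple (naive) Bayesian binary classifier of the kind described above, under the assumption that the word forms of a document occur independently; the class name, the Laplace smoothing and the log-space computation are conventions of the sketch and are not taken from [1].

import math
from collections import Counter

class NaiveBayesBinary:
    # Decides whether a document belongs to one category,
    # treating its word forms as independent events.
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing          # Laplace smoothing, an assumption of the sketch
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, documents):
        # documents: iterable of (list_of_word_forms, belongs_to_category)
        for words, label in documents:
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)

    def _score(self, words, label):
        # log P(label) + sum of log P(word | label), computed with smoothing
        total = sum(self.word_counts[label].values())
        vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        prior = self.doc_counts[label] / sum(self.doc_counts.values())
        score = math.log(prior)
        for w in words:
            p = (self.word_counts[label][w] + self.smoothing) / \
                (total + self.smoothing * len(vocabulary))
            score += math.log(p)
        return score

    def belongs(self, words):
        # The document is assigned to the category when the "in category"
        # score exceeds the "not in category" score.
        return self._score(words, True) > self._score(words, False)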
The method described in [2] is based on the SVM classification method, which constructs, in a multidimensional feature space, a hyperplane separating the features of documents that belong to a category from the features of documents that do not belong to it. This method can also be used in cases where a document may belong to several categories. However, it has the same disadvantage as [1]: it cannot classify documents when the categories are thematically dependent on one another.

The closest in technical essence to the present invention is the multi-class classification method [3], adopted as the prototype, which consists in the following: the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated in accordance with the frequency of their occurrence and the feature vector of the document is thereby formed; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed and the classification criteria are stored in a database; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and the feature vector of the document is formed, after which a decision is made on whether or not the document belongs to each of the categories. In this method, too, natural-language text means machine-readable text.

To classify, this method uses weak hypotheses about the membership of the document in a set of categories for iterative refinement of the distribution function of the categories over the set of documents. Binary classification methods are used to obtain the weak hypotheses, and during classification the constructed distribution is used to determine the list of categories to which the document belongs. This method shows good performance, since it repeatedly applies simple classification methods, which results in a more accurate classification. In addition, within this method the categories are not considered independent: the dependence between them is established at the training stage by providing an appropriate training sample of documents.

The disadvantage of the prototype is the impossibility of using, during classification, a priori information about the dependences of the categories on one another. The technical result of the invention is the elimination of this disadvantage of the prototype, i.e. the possibility of specifying the dependences of the categories on one another a priori in the form of a category tree.
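The multi-category setting mentioned above, in which one independent binary classifier is run for every category, can be sketched as follows; the factory interface and the train/belongs methods mirror the NaiveBayesBinary sketch given earlier and are illustrative assumptions rather than details of [2] or [3].

class MultiLabelClassifier:
    # One independent binary classifier per category: each answers the question
    # "does the document belong to my category?", and the categories are treated
    # as independent of one another.
    def __init__(self, categories, make_binary_classifier):
        # make_binary_classifier: factory returning an object with train(documents)
        # and belongs(words) methods, e.g. the NaiveBayesBinary sketch above.
        self.classifiers = {c: make_binary_classifier() for c in categories}

    def train(self, documents):
        # documents: iterable of (list_of_word_forms, set_of_category_labels)
        documents = list(documents)
        for category, classifier in self.classifiers.items():
            classifier.train([(words, category in labels) for words, labels in documents])

    def classify(self, words):
        # The set of categories the document is assigned to.
        return {c for c, clf in self.classifiers.items() if clf.belongs(words)}

For example, MultiLabelClassifier(["economy", "finance"], NaiveBayesBinary) would train two independent classifiers over the same training sample; the category names here are, of course, only an illustration.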
This technical result is achieved in that the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated in accordance with the frequency of their occurrence and the feature vector of the document is thereby formed; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed and the classification criteria are stored in a database; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and the feature vector of the document is formed, after which a decision is made on whether or not the document belongs to each of the categories; in doing so, a priori information about the dependences of the categories on one another is used. An additional feature of the method is that the dependence of the categories on one another is specified by a category tree. In addition, binary classifiers are used to determine the set of categories of the document, after which the classification results are corrected by analysing, for each category in the set of categories of the document, its membership in the categories of a higher level. This method is suitable for processing machine-readable texts in natural language.

In this method the words of a document are considered independent within a category. The whole set of categories is initially represented in the form of a tree that reflects the thematic dependence of one category on another and their generality with respect to one another. If a document belongs to some category, it also belongs to all categories located above it in the hierarchy. Taking this into account, the classification method makes it possible to determine, for an input document, to which nodes of the category tree it belongs and to which it does not.

Documents for classification can be presented in various formats that allow their textual content to be extracted. These can be text files of various formats, graphic files containing an image of some text, audio files with voice recordings, and other files for which there is a mechanism for extracting text that reflects their content. Each document (whether a training document or one undergoing classification) first passes a primary processing stage, which consists in determining the format of the document and ascertaining whether text can be extracted from a document in this format. If so, the text is extracted from the document. After the text has been split into words, each word is reduced to its base word form by one of the methods [4-7]. Most often the Porter algorithm [4] is used for such tasks; it applies special rules for stripping and replacing the endings of words.

According to the proposed method, each document D_i is represented by a feature vector of the form d_i = (w_1, ..., w_n), where the value of the j-th feature w_j is the weight of the j-th word in document D_i, calculated by the formula w_j = c_ij / N_i. Here c_ij is the number of times the j-th word form occurs in the i-th document, and N_i is the total number of word forms in the i-th document. To initialize the classifier and build the classification criteria, the training phase of the classifier is carried out, for which a set of training documents, pre-classified manually, must be provided.
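A minimal sketch of this feature-vector construction is given below, assuming Python with the Porter stemmer from the NLTK library as one possible implementation of [4]; the function name, the regular expression used to split the text, and counting N_i over the retained word forms are illustrative choices of the sketch, not requirements of the method.

import re
from collections import Counter
from nltk.stem import PorterStemmer   # one possible implementation of [4]

def document_features(text, insignificant_words=frozenset()):
    # Convert the extracted text into the feature vector d_i = (w_1, ..., w_n),
    # with w_j = c_ij / N_i for every word form kept in the document.
    stemmer = PorterStemmer()
    words = re.findall(r"[a-z]+", text.lower())            # split the text into words
    stems = [stemmer.stem(w) for w in words
             if w not in insignificant_words]              # discard insignificant words
    counts = Counter(stems)                                 # c_ij: occurrences of each word form
    total = len(stems)                                      # N_i: word forms kept in the document
    return {s: c / total for s, c in counts.items()} if total else {}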
After the text content has been extracted, a dictionary of the documents is built. The document dictionary contains the base word forms of all words encountered in the training documents. When classifying a document, not all words from the document dictionary are taken into account, but only those that belong to the working vocabulary of the classifier for the given category. The working vocabulary of the classifier includes the word forms that are most informative for determining the membership of a document in the given category and that are not contained in the stop dictionary. The information content of word form w_i for the classifier of category C_j is determined by a formula from [8]. An informativeness threshold ε is set, and the working vocabulary of the classifier includes those words, not contained in the stop dictionary, whose information content exceeds this threshold. The stop dictionary consists of words whose frequency of occurrence in the set of training documents exceeds a predetermined threshold δ. This cuts off words that carry no semantic load, such as prepositions, conjunctions, introductory words, etc. In this method the coefficient δ is set in the range from 0.05 to 0.7, depending on the particular application of the method; the informativeness threshold ε may also differ in different applications of the method.

Building the classification criteria includes calculating the a priori probabilities of the categories and building the distributions of the word forms from the working vocabulary of each category. To calculate the a priori probability of each category, the fraction of training documents, out of their total number, that fall into that category is determined. To calculate the distributions of the word forms of the working vocabularies of the categories, the frequencies of occurrence of these word forms in the categories over the set of training documents are determined. The a priori probabilities of the categories and the distributions of the word forms of the working vocabularies are stored in the database of classification criteria.

Document classification is performed in two stages. At the first stage of classification the categories are considered independent, and the membership of the document in each of the categories is determined independently. For this, a decision inequality is used in which P(C_j) is the a priori probability of the current category C_j, n is the number of words in the current document D, T is the number of distinct word forms in document D, P(w_i) is the probability of occurrence of the i-th word form in document D, and P(C_j|w_i) is the probability of category C_j under the condition that the i-th word form appears. If this inequality is satisfied, it is decided that document D belongs to category C_j.
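The construction of the stop dictionary and of a working vocabulary for one category can be sketched as follows; since the informativeness formula of [8] is not reproduced above, information_content is left as a function to be supplied, and the threshold δ is interpreted here as a fraction of the training documents in which a word occurs, which is only one possible reading.

def build_stop_dictionary(training_docs, delta):
    # training_docs: list of word-form lists, one list per training document.
    # A word enters the stop dictionary when the fraction of training documents
    # containing it exceeds delta (one reading of the threshold described above).
    n_docs = len(training_docs)
    document_frequency = {}
    for words in training_docs:
        for w in set(words):
            document_frequency[w] = document_frequency.get(w, 0) + 1
    return {w for w, df in document_frequency.items() if df / n_docs > delta}

def build_working_vocabulary(training_docs, stop_dictionary, epsilon, information_content):
    # information_content(word) must implement the informativeness measure of [8]
    # for the current category; it is not defined in this sketch.
    vocabulary = {w for words in training_docs for w in words}
    return {w for w in vocabulary
            if w not in stop_dictionary and information_content(w) > epsilon}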
At the second stage of classification, the results obtained at the first stage are corrected taking into account the a priori information about the thematic dependence of the categories, represented in the form of a category tree. The correction is based on the assertion that if a document belongs to a certain category, it also belongs to all categories lying higher in the hierarchy. The correction stage consists in traversing all vertices of the category tree from the root (except the root itself, since it is not a category) to the leaves. At each iteration it is checked whether the current document belongs to the given category. If it does, an ascent is made in the hierarchy from the current vertex to the root of the tree, during which the number of vertices to which the document belongs according to the decisions of the classifier and the number of vertices to which it does not belong according to the decisions of the classifier are determined. If the number of vertices to which the document belongs exceeds the number of vertices to which it does not belong, it is decided at the correction stage that the document belongs to the current vertex; the decisions of the classifier are then corrected along the entire path from the current vertex to the root of the tree by assigning a positive classification decision to all intermediate vertices. If the number of vertices to which the document belongs does not exceed the number of vertices to which it does not belong, a negative decision is assigned on the membership of the current document in the category corresponding to the current vertex.

The invention is illustrated by drawings, in which figure 1 shows a block diagram of a computing device implementing the method, figure 2 shows the algorithm for forming the features of documents, and figure 3 shows the algorithm for training the classifiers. The device implementing the method (figure 1) consists of a source of documents 1, a unit 2 for forming the features of a document, a control unit 3, a unit 4 for training the classifiers, a training data unit 5, a database 6, a classification unit 7, a unit 8 for correcting the classification results, and a unit 9 for outputting the classification results.

The device operates as follows. When a new document appears in the source of documents 1, it enters unit 2, which processes the document and constructs its feature vector. First the primary processing of the document is performed (figure 2), during which the text content is extracted from the document; then the base forms of all words contained in the text of the document are determined, and the weights of the word forms in the document are determined. The control unit 3 is controlled by the operator of the device and can operate in two modes: training mode and classification mode. In training mode the features of the document, formed in unit 2, are passed on to unit 4, in which the training documents are accumulated and the classification criteria are formed, using the training data received from unit 5 and stored in database 6. The set of training data provides information on the membership of each training document in the categories. This information, together with the features of the training documents, is accumulated in database 6. At the end of the training mode, immediately before the control unit is switched to classification mode, all the accumulated training documents are extracted from database 6 and the classification criteria are formed (figure 3). To do this, a dictionary of the documents containing all word forms occurring in the training documents is created from the accumulated set of training documents, then the working vocabularies of the classifiers are created, and then the classification criteria are constructed. The resulting classification criteria are stored in database 6. In classification mode the features of a document are passed through the control unit to unit 7, where the document is classified by its features using the classification criteria received from database 6. The classification results are then passed to unit 8, where they are corrected with regard to the thematic dependences between the categories, after which the corrected list of categories to which the document belongs is passed to unit 9. Thus, the method makes it possible to classify documents in accordance with a priori dependences between the categories specified by a category tree, thereby achieving the above technical result.
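A minimal sketch of the second-stage correction over the category tree described above is given below; the representation of the tree by a parent map and the inclusion of the current vertex in the counted path are assumptions of the sketch, since the description does not fix these details.

from collections import deque

def correct_decisions(decisions, parent, root):
    # decisions: {category: bool} produced at the first stage by the binary classifiers.
    # parent:    {category: parent category or root}; the root itself is not a category.
    corrected = dict(decisions)
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)

    def path_to_root(category):
        # Vertices from the current vertex up to, but excluding, the root.
        path = []
        while category != root:
            path.append(category)
            category = parent[category]
        return path

    queue = deque(children.get(root, []))     # traverse from the root towards the leaves
    while queue:
        category = queue.popleft()
        queue.extend(children.get(category, []))
        if not corrected[category]:
            continue                          # only positively classified vertices are checked
        path = path_to_root(category)
        positive = sum(1 for c in path if corrected[c])
        if positive > len(path) - positive:
            for c in path:                    # confirm the entire path up to the root
                corrected[c] = True
        else:
            corrected[category] = False       # reject the current vertex
    return corrected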
"Classification of text documents", The Computer Journal, 41, 8, pp.537-546, 1998. 2. U.S. patent 6327581, CL G 06 F 015/18. 3. Schapire R.E., Singer Y. "BoosTexter: A boosting-based system for text categorization. Machine Learning 39, 2/3, 2000, pp.135-168 - prototype. 4. Porter, M.F., "An algorithm for suffix stripping", Program, Vol.14, No.3, 1980, pp.130-137. 5. RF patent №2096825 With, CL G 06 F 17/00. 6. U.S. patent 6308149, CL G 06 F 17/27. 7. U.S. patent 6430557, CL G 06 F 017/30; G 06 F 017/27; G 06 F 017/21. 8. Craven, M., DiPasquo, D., Freitag D. et al. "Learning to onstruct knowledge bases from the World Wide Web, Artificial Intelligence, Vol.118(1-2), 2000, pp.69-113. A method of automatic classification of documents, namely, that carry out the conversion of the document from a storage format to text in natural language, translate the words of the converted document in the basic word forms, discard insignificant words, carry out the calculation of the weights of words in the document in accordance with the frequency of their appearance and thereby form the feature vector of the document, at the training stage after presenting a set of manually classified documents form the set of classification criteria, retain classification features in the database, the classification of the document shall transform it from a storage format to text in natural language, convert the word document into the base word forms, discard insignificant words, carry out the calculation of the weights of words in the document and form the feature vector of the document, and then take a decision about belonging or not belonging document each of the categories, wherein the step of determining the membership of a document each of the categories using a priori information about zavisimost the s categories from each other, asked by categories tree, using binary classifiers to determine a set of document categories, followed by an analysis of the membership of each category of document categories a higher level, and if the number of nodes in the tree, which belongs to the document exceeds the number of vertices, which he does not belong, then decide on the conformity of the document to the current vertex, then make adjustment decisions of the classifier on the entire path from the current node to the root of the tree and classify the document through all intermediate nodes in the tree.