Method for automated classification of documents (Russian patent RU 2254610)
FIELD: document classification systems. SUBSTANCE: the method is based on repeated application of a simple classifier to a document, followed by correction of the classification results with allowance for thematic dependences between categories. EFFECT: dependences between categories can be set a priori in the form of a category tree. 3 dwg
The invention relates to document classification systems and can be used in information processing systems, databases and digital repositories in cases where the thematic dependence of the categories on one another can be represented in the form of a tree.

A method of automatic classification of documents is known [1], which consists in the following: the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, and the weights of the words in the document are calculated in accordance with the frequency of their occurrence; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and, on the basis of a simple Bayesian classification criterion and the classification criteria, it is determined whether the document belongs to a category. This method is suitable for processing machine-readable texts in natural language.

The simple Bayesian classification of documents uses the hypothesis that the words of a document are independent of one another. Both the document and the categories are treated as probabilistic systems in which the probabilities of occurrence of word forms are calculated as independent events. To determine whether a document belongs to a category, a measure of closeness between the two probabilistic systems is computed. The simple Bayesian classification method can be used both for binary classification (it must be determined whether the document belongs to a given category or not) and for multi-class classification (the single category to which the document belongs must be found in a list of categories). In the latter case the document can belong to only one category from the list. In problems where a document can belong to several categories, several binary classifiers of the described type are used, each of which determines whether or not the current document belongs to its category; in doing so, the hypothesis is adopted that the categories are independent of one another. This method, however, has the disadvantage that it cannot classify documents when the categories are thematically dependent on one another, for example when they depend on one another hierarchically.

Another known method of automatic document classification [2] consists in the following: the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, and the weights of the words in the document are calculated in accordance with the frequency of their occurrence; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and, on the basis of the SVM (Support Vector Machines) classification criterion and the classification criteria, it is determined whether the document belongs to a category. This method, like the previous one, is intended for processing machine-readable texts in natural language.
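As an illustration only, the following is a minimal sketch in Python of a simple (naive) Bayesian binary classifier of the kind described above, under the assumption that the word forms of a document occur independently; the class name, the Laplace smoothing and the log-space computation are conventions of the sketch and are not taken from [1].

import math
from collections import Counter

class NaiveBayesBinary:
    # Decides whether a document belongs to one category,
    # treating its word forms as independent events.
    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing          # Laplace smoothing, an assumption of the sketch
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, documents):
        # documents: iterable of (list_of_word_forms, belongs_to_category)
        for words, label in documents:
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)

    def _score(self, words, label):
        # log P(label) + sum of log P(word | label), computed with smoothing
        total = sum(self.word_counts[label].values())
        vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        prior = self.doc_counts[label] / sum(self.doc_counts.values())
        score = math.log(prior)
        for w in words:
            p = (self.word_counts[label][w] + self.smoothing) / \
                (total + self.smoothing * len(vocabulary))
            score += math.log(p)
        return score

    def belongs(self, words):
        # The document is assigned to the category when the "in category"
        # score exceeds the "not in category" score.
        return self._score(words, True) > self._score(words, False)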
The method described in [2] is based on the SVM classification method, which constructs, in a multidimensional feature space, a hyperplane separating the features of documents that belong to a category from the features of documents that do not belong to it. This method can also be used in cases where a document may belong to several categories. However, it has the same disadvantage as [1]: it cannot classify documents when the categories are thematically dependent on one another.

The closest in technical essence to the present invention is the multi-class classification method [3], adopted as the prototype, which consists in the following: the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated in accordance with the frequency of their occurrence and the feature vector of the document is thereby formed; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed and the classification criteria are stored in a database; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and the feature vector of the document is formed, after which a decision is made on whether or not the document belongs to each of the categories. In this method, too, natural-language text means machine-readable text.

To classify, this method uses weak hypotheses about the membership of the document in a set of categories for iterative refinement of the distribution function of the categories over the set of documents. Binary classification methods are used to obtain the weak hypotheses, and during classification the constructed distribution is used to determine the list of categories to which the document belongs. This method shows good performance, since it repeatedly applies simple classification methods, which results in a more accurate classification. In addition, within this method the categories are not considered independent: the dependence between them is established at the training stage by providing an appropriate training sample of documents.

The disadvantage of the prototype is the impossibility of using, during classification, a priori information about the dependences of the categories on one another. The technical result of the invention is the elimination of this disadvantage of the prototype, i.e. the possibility of specifying the dependences of the categories on one another a priori in the form of a category tree.
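The multi-category setting mentioned above, in which one independent binary classifier is run for every category, can be sketched as follows; the factory interface and the train/belongs methods mirror the NaiveBayesBinary sketch given earlier and are illustrative assumptions rather than details of [2] or [3].

class MultiLabelClassifier:
    # One independent binary classifier per category: each answers the question
    # "does the document belong to my category?", and the categories are treated
    # as independent of one another.
    def __init__(self, categories, make_binary_classifier):
        # make_binary_classifier: factory returning an object with train(documents)
        # and belongs(words) methods, e.g. the NaiveBayesBinary sketch above.
        self.classifiers = {c: make_binary_classifier() for c in categories}

    def train(self, documents):
        # documents: iterable of (list_of_word_forms, set_of_category_labels)
        documents = list(documents)
        for category, classifier in self.classifiers.items():
            classifier.train([(words, category in labels) for words, labels in documents])

    def classify(self, words):
        # The set of categories the document is assigned to.
        return {c for c, clf in self.classifiers.items() if clf.belongs(words)}

For example, MultiLabelClassifier(["economy", "finance"], NaiveBayesBinary) would train two independent classifiers over the same training sample; the category names here are, of course, only an illustration.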
This technical result is achieved in that the document is converted from its storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated in accordance with the frequency of their occurrence and the feature vector of the document is thereby formed; at the training stage, after presentation of a set of manually classified documents, a set of classification criteria is formed and the classification criteria are stored in a database; to classify a document, it is converted from the storage format into natural-language text, the words of the document are reduced to their base word forms, insignificant words are discarded, the weights of the words in the document are calculated and the feature vector of the document is formed, after which a decision is made on whether or not the document belongs to each of the categories; in doing so, a priori information about the dependences of the categories on one another is used. An additional feature of the method is that the dependence of the categories on one another is specified by a category tree. In addition, binary classifiers are used to determine the set of categories of the document, after which the classification results are corrected by analysing, for each category in the set of categories of the document, its membership in the categories of a higher level. This method is suitable for processing machine-readable texts in natural language.

In this method the words of a document are considered independent within a category. The whole set of categories is initially represented in the form of a tree that reflects the thematic dependence of one category on another and their generality with respect to one another. If a document belongs to some category, it also belongs to all categories located above it in the hierarchy. Taking this into account, the classification method makes it possible to determine, for an input document, to which nodes of the category tree it belongs and to which it does not.

Documents for classification can be presented in various formats that allow their textual content to be extracted. These can be text files of various formats, graphic files containing an image of some text, audio files with voice recordings, and other files for which there is a mechanism for extracting text that reflects their content. Each document (whether a training document or one undergoing classification) first passes a primary processing stage, which consists in determining the format of the document and ascertaining whether text can be extracted from a document in this format. If so, the text is extracted from the document. After the text has been split into words, each word is reduced to its base word form by one of the methods [4-7]. Most often the Porter algorithm [4] is used for such tasks; it applies special rules for stripping and replacing the endings of words.

According to the proposed method, each document D_i is represented by a feature vector of the form d_i = (w_1, ..., w_n), where the value of the j-th feature w_j is the weight of the j-th word in document D_i, calculated by the formula w_j = c_ij / N_i. Here c_ij is the number of times the j-th word form occurs in the i-th document, and N_i is the total number of word forms in the i-th document. To initialize the classifier and build the classification criteria, the training phase of the classifier is carried out, for which a set of training documents, pre-classified manually, must be provided.
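A minimal sketch of this feature-vector construction is given below, assuming Python with the Porter stemmer from the NLTK library as one possible implementation of [4]; the function name, the regular expression used to split the text, and counting N_i over the retained word forms are illustrative choices of the sketch, not requirements of the method.

import re
from collections import Counter
from nltk.stem import PorterStemmer   # one possible implementation of [4]

def document_features(text, insignificant_words=frozenset()):
    # Convert the extracted text into the feature vector d_i = (w_1, ..., w_n),
    # with w_j = c_ij / N_i for every word form kept in the document.
    stemmer = PorterStemmer()
    words = re.findall(r"[a-z]+", text.lower())            # split the text into words
    stems = [stemmer.stem(w) for w in words
             if w not in insignificant_words]              # discard insignificant words
    counts = Counter(stems)                                 # c_ij: occurrences of each word form
    total = len(stems)                                      # N_i: word forms kept in the document
    return {s: c / total for s, c in counts.items()} if total else {}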
After the text content has been extracted, a dictionary of the documents is built. The document dictionary contains the base word forms of all words encountered in the training documents. When classifying a document, not all words from the document dictionary are taken into account, but only those that belong to the working vocabulary of the classifier for the given category. The working vocabulary of the classifier includes the word forms that are most informative for determining the membership of a document in the given category and that are not contained in the stop dictionary. The information content of word form w_i for the classifier of category C_j is determined by a formula from [8]. An informativeness threshold ε is set, and the working vocabulary of the classifier includes those words, not contained in the stop dictionary, whose information content exceeds this threshold. The stop dictionary consists of words whose frequency of occurrence in the set of training documents exceeds a predetermined threshold δ. This cuts off words that carry no semantic load, such as prepositions, conjunctions, introductory words, etc. In this method the coefficient δ is set in the range from 0.05 to 0.7, depending on the particular application of the method; the informativeness threshold ε may also differ in different applications of the method.

Building the classification criteria includes calculating the a priori probabilities of the categories and building the distributions of the word forms from the working vocabulary of each category. To calculate the a priori probability of each category, the fraction of training documents, out of their total number, that fall into that category is determined. To calculate the distributions of the word forms of the working vocabularies of the categories, the frequencies of occurrence of these word forms in the categories over the set of training documents are determined. The a priori probabilities of the categories and the distributions of the word forms of the working vocabularies are stored in the database of classification criteria.

Document classification is performed in two stages. At the first stage of classification the categories are considered independent, and the membership of the document in each of the categories is determined independently. For this, a decision inequality is used in which P(C_j) is the a priori probability of the current category C_j, n is the number of words in the current document D, T is the number of distinct word forms in document D, P(w_i) is the probability of occurrence of the i-th word form in document D, and P(C_j|w_i) is the probability of category C_j under the condition that the i-th word form appears. If this inequality is satisfied, it is decided that document D belongs to category C_j.
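The construction of the stop dictionary and of a working vocabulary for one category can be sketched as follows; since the informativeness formula of [8] is not reproduced above, information_content is left as a function to be supplied, and the threshold δ is interpreted here as a fraction of the training documents in which a word occurs, which is only one possible reading.

def build_stop_dictionary(training_docs, delta):
    # training_docs: list of word-form lists, one list per training document.
    # A word enters the stop dictionary when the fraction of training documents
    # containing it exceeds delta (one reading of the threshold described above).
    n_docs = len(training_docs)
    document_frequency = {}
    for words in training_docs:
        for w in set(words):
            document_frequency[w] = document_frequency.get(w, 0) + 1
    return {w for w, df in document_frequency.items() if df / n_docs > delta}

def build_working_vocabulary(training_docs, stop_dictionary, epsilon, information_content):
    # information_content(word) must implement the informativeness measure of [8]
    # for the current category; it is not defined in this sketch.
    vocabulary = {w for words in training_docs for w in words}
    return {w for w in vocabulary
            if w not in stop_dictionary and information_content(w) > epsilon}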
At the second stage of classification, the results obtained at the first stage are corrected taking into account the a priori information about the thematic dependence of the categories, represented in the form of a category tree. The correction is based on the assertion that if a document belongs to a certain category, it also belongs to all categories lying higher in the hierarchy. The correction stage consists in traversing all vertices of the category tree from the root (except the root itself, since it is not a category) to the leaves. At each iteration it is checked whether the current document belongs to the given category. If it does, an ascent is made in the hierarchy from the current vertex to the root of the tree, during which the number of vertices to which the document belongs according to the decisions of the classifier and the number of vertices to which it does not belong according to the decisions of the classifier are determined. If the number of vertices to which the document belongs exceeds the number of vertices to which it does not belong, it is decided at the correction stage that the document belongs to the current vertex; the decisions of the classifier are then corrected along the entire path from the current vertex to the root of the tree by assigning a positive classification decision to all intermediate vertices. If the number of vertices to which the document belongs does not exceed the number of vertices to which it does not belong, a negative decision is assigned on the membership of the current document in the category corresponding to the current vertex.

The invention is illustrated by drawings, in which figure 1 shows a block diagram of a computing device implementing the method, figure 2 shows the algorithm for forming the features of documents, and figure 3 shows the algorithm for training the classifiers. The device implementing the method (figure 1) consists of a source of documents 1, a unit 2 for forming the features of a document, a control unit 3, a unit 4 for training the classifiers, a training data unit 5, a database 6, a classification unit 7, a unit 8 for correcting the classification results, and a unit 9 for outputting the classification results.

The device operates as follows. When a new document appears in the source of documents 1, it enters unit 2, which processes the document and constructs its feature vector. First the primary processing of the document is performed (figure 2), during which the text content is extracted from the document; then the base forms of all words contained in the text of the document are determined, and the weights of the word forms in the document are determined. The control unit 3 is controlled by the operator of the device and can operate in two modes: training mode and classification mode. In training mode the features of the document, formed in unit 2, are passed on to unit 4, in which the training documents are accumulated and the classification criteria are formed, using the training data received from unit 5 and stored in database 6. The set of training data provides information on the membership of each training document in the categories. This information, together with the features of the training documents, is accumulated in database 6. At the end of the training mode, immediately before the control unit is switched to classification mode, all the accumulated training documents are extracted from database 6 and the classification criteria are formed (figure 3). To do this, a dictionary of the documents containing all word forms occurring in the training documents is created from the accumulated set of training documents, then the working vocabularies of the classifiers are created, and then the classification criteria are constructed. The resulting classification criteria are stored in database 6. In classification mode the features of a document are passed through the control unit to unit 7, where the document is classified by its features using the classification criteria received from database 6. The classification results are then passed to unit 8, where they are corrected with regard to the thematic dependences between the categories, after which the corrected list of categories to which the document belongs is passed to unit 9. Thus, the method makes it possible to classify documents in accordance with a priori dependences between the categories specified by a category tree, thereby achieving the above technical result.
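A minimal sketch of the second-stage correction over the category tree described above is given below; the representation of the tree by a parent map and the inclusion of the current vertex in the counted path are assumptions of the sketch, since the description does not fix these details.

from collections import deque

def correct_decisions(decisions, parent, root):
    # decisions: {category: bool} produced at the first stage by the binary classifiers.
    # parent:    {category: parent category or root}; the root itself is not a category.
    corrected = dict(decisions)
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)

    def path_to_root(category):
        # Vertices from the current vertex up to, but excluding, the root.
        path = []
        while category != root:
            path.append(category)
            category = parent[category]
        return path

    queue = deque(children.get(root, []))     # traverse from the root towards the leaves
    while queue:
        category = queue.popleft()
        queue.extend(children.get(category, []))
        if not corrected[category]:
            continue                          # only positively classified vertices are checked
        path = path_to_root(category)
        positive = sum(1 for c in path if corrected[c])
        if positive > len(path) - positive:
            for c in path:                    # confirm the entire path up to the root
                corrected[c] = True
        else:
            corrected[category] = False       # reject the current vertex
    return corrected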
"Classification of text documents", The Computer Journal, 41, 8, pp.537-546, 1998. 2. U.S. patent 6327581, CL G 06 F 015/18. 3. Schapire R.E., Singer Y. "BoosTexter: A boosting-based system for text categorization. Machine Learning 39, 2/3, 2000, pp.135-168 - prototype. 4. Porter, M.F., "An algorithm for suffix stripping", Program, Vol.14, No.3, 1980, pp.130-137. 5. RF patent №2096825 With, CL G 06 F 17/00. 6. U.S. patent 6308149, CL G 06 F 17/27. 7. U.S. patent 6430557, CL G 06 F 017/30; G 06 F 017/27; G 06 F 017/21. 8. Craven, M., DiPasquo, D., Freitag D. et al. "Learning to onstruct knowledge bases from the World Wide Web, Artificial Intelligence, Vol.118(1-2), 2000, pp.69-113. A method of automatic classification of documents, namely, that carry out the conversion of the document from a storage format to text in natural language, translate the words of the converted document in the basic word forms, discard insignificant words, carry out the calculation of the weights of words in the document in accordance with the frequency of their appearance and thereby form the feature vector of the document, at the training stage after presenting a set of manually classified documents form the set of classification criteria, retain classification features in the database, the classification of the document shall transform it from a storage format to text in natural language, convert the word document into the base word forms, discard insignificant words, carry out the calculation of the weights of words in the document and form the feature vector of the document, and then take a decision about belonging or not belonging document each of the categories, wherein the step of determining the membership of a document each of the categories using a priori information about zavisimost the s categories from each other, asked by categories tree, using binary classifiers to determine a set of document categories, followed by an analysis of the membership of each category of document categories a higher level, and if the number of nodes in the tree, which belongs to the document exceeds the number of vertices, which he does not belong, then decide on the conformity of the document to the current vertex, then make adjustment decisions of the classifier on the entire path from the current node to the root of the tree and classify the document through all intermediate nodes in the tree.