Method of classifying documents by categories

FIELD: information technology.

SUBSTANCE: method of classifying documents by categories includes constructing an ontology in the form of a set of categories. For each category, terms, i.e. sequences of words typical for texts in said category, are identified, and the weight of each of the identified terms is determined when reading electronic versions of the documents from a training collection of documents. A profile is formed for each of the categories in the form of a list of all terms in all ontology categories with indication of the weight of each term in said category. A list of possible combinations of word forms of said term is compiled for each term. Identified terms are selected in each document to be classified when reading an electronic version thereof, considering only word forms from the compiled list. For each document to be classified, a profile is formed for each category based on the selected terms. Relevance of said document to each category is determined by comparing profiles of said document with profiles of categories in the ontology. A classification spectrum of the document is constructed in the form of a set of categories with relevance found for each of them.

EFFECT: high rate of classification and reduced size of consumed memory.

7 cl

 

The present invention relates to a method of classifying documents into categories and can be used in developing new systems, and improving existing systems, for scanning text documents.

Background art

In automatic analysis of text documents, for example when monitoring documents circulating on a company network for the presence of confidential information, the scanned document must be assigned to one category or another.

At present there are various methods of classifying text documents.

Thus, Russian Federation patent No. 2167450 (publ. 20.05.2001) describes a method for identifying objects by their descriptions, in which all the words of the text are linguistically sorted into specified clusters. Using all the words of the text for classification considerably lengthens the classification process and requires a large amount of memory for storing all (or most) of the words of the language used.

U.S. patent application No. 2008/0098010 (publ. 24.04.2008) discloses a system and method for classifying, publishing, searching for and locating electronic documents. According to this application, electronic documents are classified according to an ontological description consisting of vectors, each of which contains attribute-value pairs. Each coordinate of the vector corresponds to an attribute, and the range of each coordinate corresponds to the set of all possible values of that attribute. To build the classification, two hash functions are used: the first maps each attribute to the number of the corresponding vector coordinate, and the second maps the value in each pair to a numerical value within the range of that coordinate. The result of the two hash functions can be represented as a node of a hypercube. This method also takes a rather long time to carry out.

The closest analogue of the present invention is presented in U.S. patent application No. 2010/0205525 (publ. 12.08.2010), which discloses a method for automatic classification of text using a computer system. In this method, the text to be classified is converted into a sequence of alphanumeric characters, which in turn is transformed into so-called shingles, i.e. byte strings in which some special characters are replaced by letters. The frequency of occurrence of a shingle in the text being classified is found and compared with the frequency of the same shingle in reference documents, and the document is classified depending on the result of this comparison.

However, this method requires a long time for analysis, because the shingles are often whole words supplied with various additional markers: the part of speech (noun, adjective, etc.), the type of phrase (verbal, adverbial participle, etc.), the level of synonymy (words of the same level - "drizzle" and "raining buckets", words of adjacent levels - "CSKA" and "football team", etc.). Therefore, this method requires analysing shingles composed of most of the words of the language used, which, incidentally, also requires a large amount of memory for storing such shingles.

Disclosure of the invention

The present invention is intended to overcome the above-mentioned disadvantages of the prior art and provides a technical result consisting in increased classification speed and reduced required memory.

To achieve this technical result, a method of classifying documents into categories is proposed, which consists in: building an ontology in the form of a set of categories; identifying, for each of the categories, terms, each of which is a sequence of words characteristic of texts in that category; determining the weight of each identified term in each of the categories while reading electronic versions of documents from a training collection of documents; forming for each category its profile in the form of a list of all terms in all categories of the ontology, indicating the weight of each term in the given category; compiling for each term a list of possible combinations of the word forms of the words included in that term; selecting the identified terms in each document to be classified while reading its electronic version, taking into account only the word forms from the list compiled for the given term; forming, for each document to be classified, profiles for each of the categories based on the terms selected during reading; finding the relevance of the document to each of the categories by comparing the profiles of the document with the profiles of the categories in the ontology; and building a classification spectrum of the document in the form of a set of categories with the relevance found for each of them.

A feature of this method is that each word of a term can be assigned a unique identifier, and these unique identifiers can be used when forming the profiles.

Another feature of this method is that a vector in a multidimensional space, each dimension of which corresponds to one term, can be built for each generated profile, and the cosine measure between the compared vectors in that space can be calculated when comparing profiles. In this case, when building the classification spectrum for any document, only those categories are used for which the cosine measure between the compared vectors exceeds a predetermined threshold value.

Another feature of this method is that the weight of each term can be defined as TF·IDF, where TF is the frequency of the term in all documents of the given category in the training collection of documents, and IDF is the inverse document frequency, which characterizes in how many documents of this category, out of the total number of documents, the term occurs.

Another feature of this method is that the ontology can be built in the form of a hierarchically related sequence of categories.

Finally, another feature of this method is that syntactic analysis can be used to resolve lexical ambiguity of terms in the text on the basis of the lists compiled for each term.

Detailed description of embodiments

The present invention can be implemented in any computing system, for example a personal computer, a server, etc. Implementation of the invention requires a corresponding database that stores electronic files of text documents.

The method according to the present invention serves to classify into different categories those documents which can then be subjected, for example, to so-called copyright analysis (the English equivalent is fingerprint detection), whose task is to establish the similarity of binary and/or text documents to documents previously entered in a database (library) as reference documents, or to any other text processing.

Classification makes it possible to assign incoming electronic versions of text documents to one or more categories. Categories can be chosen at the designer's discretion or in accordance with the requirements of the system that uses the method according to the present invention. Examples of categories can be found in the aforementioned U.S. patent applications No. 2008/0098010 and 2010/0205525, as well as in U.S. patent application No. 2009/0327189 (publ. 31.12.2009) and in international application No. WO 2010/134752. Categories can be chosen independently of one another, but preferably the categories are arranged in a hierarchically related sequence, as, for example, in the above-mentioned international application No. WO 2010/134752 and U.S. patent application No. 2009/0327189.

The set of selected categories by which incoming electronic versions of documents will be classified constitutes the classification ontology. As already indicated, the ontology is preferably built as a hierarchically related sequence of the selected categories. This makes it possible, when no suitable category exists at some level of the ontology, to move to a higher level in the hierarchical tree.
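The move to a higher level of the hierarchical tree can be illustrated with a minimal sketch. The category names, the `PARENT` mapping and the function `nearest_available` are hypothetical and serve only as an illustration; the source does not prescribe any particular data structure for the ontology.

```python
from typing import Dict, Optional, Set

# Hypothetical ontology: each category maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "documents": None,
    "finance": "documents",
    "contracts": "finance",
    "invoices": "finance",
}


def nearest_available(category: str, available: Set[str]) -> Optional[str]:
    """Walk up the tree until a category present in `available` is found."""
    current: Optional[str] = category
    while current is not None:
        if current in available:
            return current
        current = PARENT.get(current)
    return None  # no suitable category at any level
```

For example, if only "finance" and "documents" are available, a request for "contracts" resolves to "finance", its nearest ancestor.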

For each of the selected categories of the ontology, terms are identified, each of which is a sequence of words characteristic of texts in that category. The sequence of words in a term may contain one or more words. The word forms of each word included in the term are taken into account. This is especially important for highly inflected languages such as Russian and other Slavic languages, but it is also applicable to less inflected languages such as English. Word forms are taken into account as follows.

For each term, a list of possible combinations of the word forms of all words included in the term is compiled. Preferably, each word is assigned a unique number, and all sequences of word forms (or of their numbers) that belong to a given term are marked with the identifier of that term. The subsequent selection of the identified terms while processing an incoming electronic version of a text document is carried out by word forms: they are found in the processed text, and it is determined to which term each word form belongs. The text is then classified by the combinations of word forms included in a particular term.
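The compilation of word-form combinations and the subsequent selection of terms in a text can be sketched as follows. This is an illustration under stated assumptions: the `WORD_FORMS` table, the term identifiers and the functions `build_form_index` and `find_terms` are hypothetical, and in practice the word forms would come from a morphological dictionary that the source does not specify.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical word-form lists for the words that occur in terms.
WORD_FORMS: Dict[str, List[str]] = {
    "bank": ["bank", "banks"],
    "account": ["account", "accounts"],
}


def build_form_index(terms: Dict[int, List[str]]) -> Dict[Tuple[str, ...], int]:
    """Map every combination of word forms of a term to that term's ID."""
    index: Dict[Tuple[str, ...], int] = {}
    for term_id, words in terms.items():
        for combo in product(*(WORD_FORMS[w] for w in words)):
            index[combo] = term_id
    return index


def find_terms(tokens: List[str], index: Dict[Tuple[str, ...], int]) -> List[int]:
    """Return IDs of terms found in `tokens`, scanning left to right."""
    max_len = max((len(key) for key in index), default=0)
    found: List[int] = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            key = tuple(tokens[i:i + n])
            # Skip truncated windows at the end of the text.
            if len(key) == n and key in index:
                found.append(index[key])
    return found
```

Only word forms present in the index are ever matched, so words that belong to no term are skipped without any further analysis.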

At the "training" stage, as indeed at the later stage of classifying incoming texts, electronic versions of documents are read; at the stage of training and building the ontology, these are documents from the training collection of documents (reference documents, so to speak). In the process of reading and finding the identified terms, the weight of each identified term in each of the aforementioned categories is determined. The weight can be determined by any method, for example in the same way as in the aforementioned U.S. application No. 2008/0098010. The present invention preferably uses a method in which the weight of each term is defined as TF·IDF, where TF is the frequency of the term in all documents of the given category in the training collection of documents (i.e. the number of occurrences of the term in all the documents of that category), and IDF is the inverse document frequency, which characterizes in how many documents of this category, out of the total number of documents, the term occurs (see http://ru.wikipedia.org/wiki/TF-IDF).
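The TF·IDF weighting described above can be sketched as follows. The description of IDF in the text leaves the exact formula open, so this sketch assumes the common form IDF = log(N / df), where N is the total number of documents in the training collection and df is the number of those documents containing the term; the data layout and the function name `category_profiles` are likewise assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

# Training collection: category -> list of documents, each a list of term IDs
# already selected from the text (hypothetical data layout).
Collection = Dict[str, List[List[int]]]


def category_profiles(collection: Collection) -> Dict[str, Dict[int, float]]:
    """Return, per category, a weight for every term found in the collection."""
    all_docs = [doc for docs in collection.values() for doc in docs]
    n_docs = len(all_docs)
    # Document frequency: number of training documents containing the term.
    df = Counter(term for doc in all_docs for term in set(doc))
    profiles: Dict[str, Dict[int, float]] = {}
    for category, docs in collection.items():
        # TF: number of occurrences of the term in all documents of the category.
        tf = Counter(term for doc in docs for term in doc)
        # Assumed IDF = log(N / df); terms absent from the category get 0.0.
        profiles[category] = {
            term: tf[term] * math.log(n_docs / df[term]) for term in df
        }
    return profiles
```

Under this assumed formula a term that occurs in every training document receives zero weight in every category, since it does not help to distinguish between categories.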

After the weight of each term has been determined, a profile is formed for each of the selected categories in the form of a list of all terms in all categories of the constructed ontology, indicating the weight of each term in that category. For documents from the training collection these profiles are regarded as reference profiles, and for the documents being checked as working profiles. If, as in the preferred embodiment, each word of a term has been assigned a unique identifier, these unique identifiers are used to form the profiles.

After the profile of a particular document being classified has been formed for each of the categories of the ontology on the basis of the terms selected while reading the text of the document, the relevance of the document to each category of the ontology is found by comparing the profiles of the document with the profiles of the categories in the ontology. The comparison can be carried out in different ways, for example as in the above-mentioned U.S. patent application No. 2008/0098010. However, the present invention preferably compares profiles by calculating the Pearson coefficient, i.e. the cosine of the angle between the profile vectors in a multidimensional vector space in which each term has its own dimension (see http://rcdl.ru/doc/2010/430-435.pdf). In this case the cosine measure may vary from -1 to +1.
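The comparison of two profiles by the cosine measure can be sketched as follows. The sketch computes the plain cosine of the angle between two sparse vectors keyed by term identifier; note that the Pearson correlation coefficient coincides with this cosine only when both vectors are mean-centered. The function name `cosine_measure` and the treatment of an empty profile are assumptions made for illustration.

```python
import math
from typing import Dict


def cosine_measure(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Cosine of the angle between two sparse profile vectors keyed by term ID."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # assumption: an empty profile is treated as not relevant
    return dot / (norm_p * norm_q)
```

Storing profiles as dictionaries means that only terms actually present in a profile take part in the computation, which is consistent with the stated aim of reducing memory use.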

From the relevance values found, the classification spectrum of the particular document is built as a set of categories with the relevance found for each of them. The classification spectrum includes the categories whose relevance exceeds a certain threshold value, for example 0.1.
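Building the classification spectrum from the relevance values can be sketched as follows. The function name `classification_spectrum` and the ordering of the result by decreasing relevance are assumptions made for illustration; the default threshold of 0.1 is the example value given in the text.

```python
from typing import Dict


def classification_spectrum(
    relevance: Dict[str, float], threshold: float = 0.1
) -> Dict[str, float]:
    """Keep categories whose relevance exceeds the threshold, most relevant first."""
    kept = {cat: rel for cat, rel in relevance.items() if rel > threshold}
    return dict(sorted(kept.items(), key=lambda item: item[1], reverse=True))
```

A category whose relevance is exactly equal to the threshold is not included, since the text requires the value to exceed the threshold.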

When reading the electronic version of the document to be classified, as already mentioned, only the word forms from the list compiled for a given term are taken into account. This reduces processing time because, first, only those word forms that are present in the constructed ontology are used, which speeds up the search for identified terms (the first, lower level of processing), and second, only those word forms that occur in the identified terms are selected, which speeds up classification of the text (the second, upper level of processing). In addition, a large amount of memory is not needed, since only the word forms present in the terms have to be stored, rather than all the words of the language in which the document being classified is written.

There is another advantage of using only the word forms present in the terms. When two words are homonyms, this lexical ambiguity of terms in the text can be resolved, on the basis of the lists compiled for each term, by syntactic rather than semantic analysis, which greatly simplifies the procedure.

Thus, the method of classifying documents into categories in accordance with the present invention provides a technical result consisting in increased classification speed and reduced required memory.

1. A method of classifying documents into categories, comprising:
- building an ontology as a set of said categories;
- identifying, for each of said categories, terms, each of which is a sequence of words characteristic of texts in that category;
- determining the weight of each of the identified terms in each of said categories while reading electronic versions of documents from a training collection of documents;
- forming for each of said categories its profile in the form of a list of all terms in all categories of said ontology, indicating the weight of each term in that category;
- compiling for each term a list of possible combinations of the word forms of the words included in that term;
- selecting said identified terms in each document to be classified while reading its electronic version, taking into account only the word forms from said list compiled for the given term;
- forming, for each document to be classified, profiles for each of said categories based on the terms selected during reading;
- finding the relevance of the document to each of said categories by comparing the profiles of the document with the profiles of said categories in the ontology;
- building a classification spectrum of said document in the form of a set of said categories with the relevance found for each of them.

2. The method according to claim 1, in which:
- each word of a term is assigned a unique identifier;
- said unique identifiers are used in said forming of the profiles.

3. The method according to claim 1 or 2, in which:
- a vector in a multidimensional space, each dimension of which corresponds to one term, is built for each of the generated profiles;
- in said comparing of the profiles, the cosine measure between the compared vectors in said multidimensional space is calculated.

4. The method according to claim 3, in which, in said building of the classification spectrum of any document, only those of said categories are used for which said cosine measure between the compared vectors exceeds a predetermined threshold value.

5. The method according to claim 1, in which said weight of each term is defined as TF·IDF, where TF is the frequency of the term in all documents of the given category in the training collection of documents, and IDF is the inverse document frequency, which characterizes in how many documents of this category, out of the total number of documents, the term occurs.

6. The method according to claim 1, in which the ontology is built in the form of a hierarchically related sequence of said categories.

7. The method according to claim 1 or 2, in which syntactic analysis is used to resolve lexical ambiguity of said terms in the texts on the basis of said lists compiled for each term.



 

Same patents:

FIELD: information technologies.

SUBSTANCE: method is realised for building of semantic relations between elements extracted from document content, in order to generate semantic representation of content. Semantic representations may contain elements identified or analysed in the text part of the content, elements of which may be associated with other elements, which jointly use semantic relations, such as relations of an agent, a location or a topic. Relations may also be built by means of association of one element, which is connected to another element or is near, thus allowing for quick and efficient comparison of associations found in the semantic representation, with associations received from requests. Semantic relations may be defined on the basis of semantic information, such as potential values and grammatical functions of each element within the text part of the content.

EFFECT: provision of quick detection of most relevant results.

21 cl, 11 dwg

FIELD: information technology.

SUBSTANCE: method of constructing a semantic model of a document consists of two basic steps. At the first step, ontology is extracted from external information resources that contain descriptions of separate objects of the object region. At the second step, text information of the document is tied to ontology concepts and a semantic model of the document is constructed. The information sources used are electronic resources, both tied and untied to the structure of hypertext links. First, all terms of the document are separated and tied to ontology concepts such that each term corresponds to a single concept which is its value, and values of terms are then ranked according to significance for the document.

EFFECT: enabling enrichment of document with metadata, which enable to improve and increase the rate of comprehension of basic information, and which enable to determine and highlight key terms in the text, which speeds up reading and improves understanding.

15 cl, 6 dwg

FIELD: information technology.

SUBSTANCE: mechanism converts messages in different formats to a common format, and the common format message is processed by a business logic application. The syntax analyser analyses the message and determines the suitable scheme for the specific format of the received message. The scheme is a data structure in a scheme register which includes a grammatical structure for the received format, as well as handler pointers for converting different message fields to an internal message format using a grammatical structure ("grammar" may include a field priority, field type, length, symbol coding, optional and mandatory fields etc). The handlers are compiled separately. As far as formats change, new formats or changes in old formats may be dynamically added to the syntax analysis/assembly mechanism by loading a new scheme and handlers.

EFFECT: broader functional capabilities, particularly the possibility of receiving and handling electronic messages in different formats, received using an application which is isolated from all external factors which are used through other external formats.

11 cl, 21 dwg

FIELD: information technology.

SUBSTANCE: text is segmented in electronic form to elementary units. Fixed collocations are identified and sentences are formed. Semantically significant objects and semantically significant relationships between them are identified. Several triads are formed for each semantically significant relationship, in which a single first type triad corresponds to the link set by the semantically significant relationship between two semantically significant objects. Each second type triad corresponds to the value of a specific attribute of one of these semantically significant objects. Each third type triad corresponds to the value of a specific attribute of the semantically significant relationship itself. All semantically significant objects which are linked by semantically significant relationships are separately indexed into several formed triads. The formed triads and the obtained indices together with the link to the initial text from which said triads were formed are stored in a database.

EFFECT: more accurate and faster searching for relevant facts and documents.

12 cl, 9 dwg, 16 tbl, 1 ex

FIELD: electrical engineering.

SUBSTANCE: methods, systems and computer carriers for complementing hinting instructions for symbol intended for improving symbol bit card produced from symbol outline with certain size and output resolution when symbol outline is converted during scanning. Output symbol is extracted. Symbol belonging to symbol semantic classification is defined and hinting instructions are addressed that are associated with symbol semantic classification. Hinting instruction stores symbol semantic meaning and, at the same time, varying dash availability and position or both for at least one dash of at least one attribute of symbol proceeding from basic sizes and output resolution of symbol. If actual symbol size and output resolution do not fall beyond basic size and output resolution for hinting instruction, then the latter is executed.

EFFECT: improved eligibility of scaled symbol bit card.

40 cl, 10 dwg

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to processing natural language text and can be used to automate the search for required documents in a large collection. When a request is received, its content is broken down into sentences. Sentences of the text array and of the search request are compared pairwise, and the relevance of each document of the text array to the request is calculated from the results for the sentences included in the document. The text array is indexed by separate sentences. The precise meanings of words in sentences are identified first and semantic links between them are established. The precise word meanings are then replaced by their breakdown into elementary meanings, which are stored for each meaning in a thesaurus, after which a matrix containing the links between all pairs of objects included in the sentence is made for each sentence. An inverted index is then made, in which, for each object included in the text array, the documents and sentences where it occurs and the number of occurrences are indicated.

EFFECT: invention enables comparison of phrases according to sense.

2 cl

FIELD: information technology.

SUBSTANCE: invention relates to identification of rephrasing in text. A set of text segments is obtained from a cluster of different articles, written for a common event. The set of text segments is then processed in accordance with text alignment methods to identify rephrasing based on segments in the text. Identification of rephrasing can be used in machine translation systems.

EFFECT: possibility of identifying rephrasing in different text, related to the same event.

7 cl, 5 dwg

FIELD: information technology.

SUBSTANCE: invention relates to information searching and sampling. To achieve the technical outcome, a sequence of ambiguous information components is received from a user and transformed into one or more corresponding sequences of less ambiguous information components. These sequences of less ambiguous information are given as input data into the search engine. Search results are received from the search engine and presented to the user. Translation between these sets of characters and/or languages can be done by analysing use of terms in the aligned text. Probabilities can be associatively linked to each possible translation. These probabilities can be corrected by analysing interaction of the user with the search results.

EFFECT: possibility of searching using queries written in set of characters or language, which is different from the set of characters or language of documents, which are to be found and obtaining relevant search results.

45 cl, 16 dwg

FIELD: information technology.

SUBSTANCE: present invention relates to access and presentation of information in a computer system, and more specifically to data presentation based on voice input by the user. The method of presenting information to a user from a document based on a request, involves the following stages: presentation of data in a document for the user, identification of the first and second objects from the request, accessing the document to identify semantic tags, related to text in the document, linking the first object to the first semantic tag, corresponding to the first portion of text stored in the document, and the second object to the second semantic tag, corresponding to the second portion of text stored in the document. At least one of these first and second portions stored in the text is related to data in the document, which were presented. The third portion of stored text, related to the first portion and second portion, is identified and selectively presented in understandable form.

EFFECT: more functional capabilities.

33 cl, 20 dwg

FIELD: physics; computer engineering.

SUBSTANCE: present invention pertains to computer technology. The elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains correct within the limits of its intrinsic schema. Elements of the second schema are "transparent" to the elements of the first schema, when the text processor checks correctness of elements of the first schema. Elements of the second schema are verified separately so that, elements of the first schema are "transparent" for verification of elements, corresponding to the second schema.

EFFECT: provision for validity checking of an extensible mark-up language (XML) document, with elements, linked to two or more schemata, where elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains correct within the limits of its intrinsic schema.

16 cl, 6 dwg

FIELD: computer science.

SUBSTANCE: method includes text messages from data channel, linguistic words processing is performed, thesaurus of each text message is formed, statistical processing of words in thesaurus is performed, text message and thesaurus are stored in storage. Membership of text message in one of categories from the list is determined, starting data value of text message is determined, stored in storage with text message, data value values are periodically updated with consideration of time passed since their appearance and text messages with data value below preset threshold are erased, during processing of each message values of categories classification signs are updated.

EFFECT: higher efficiency.

1 dwg

FIELD: technology for automated synthesis of text documents.

SUBSTANCE: method includes, in data variable, selecting variable unified information (common word combinations), variable inputted data (details), and variable non-unified information (free word combinations), while variable unified information is separated as a plurality of support words, constituting lexicological document skeleton, and is recorded in machine-readable database, lexicological document tree is formed and data document control contour is formed, and during generation of document, all branches of formed lexicological document tree are passed to select necessary support words for inserting matching word combinations into generated document.

EFFECT: lower probability of errors, lower laboriousness.

3 cl, 7 dwg

FIELD: computer science, in particular, system for identification of preparedness of text documents in network for distributed processing of data.

SUBSTANCE: system contains block for receiving sections of text documents, block for selection of base addresses of text documents, block for selecting structure of text documents, block for forming signals for recording and reading database, block for gating sections of text documents, block for addressing of text documents, block for receiving sections of text documents from database of server, block for commutator of channels for dispensing sections of text documents, block for counting number of finished sections of text documents, comparator, counter.

EFFECT: increased speed of operation of system.

8 dwg

FIELD: technology for recognizing text information from graphic file.

SUBSTANCE: in accordance to method, set in advance is order of access to additional information, assigned also is estimate of quality for each type of additional information, different variants of division of image of selected rows on fragments are constructed, for each fragment of row linear division graph is built, images of graphic elements are recognized, using a classifier, and an estimate is assigned to each recognition variant, transition from variants of recognition of graphic elements to variants of alphabet symbols is performed, for each chain, connecting starting and ending vertexes, chains are built, appropriate for all variants of recognition of graphical elements and variants of transitions from recognized graphical elements to alphabet symbols, produced variants are ranked in order of decrease of recognition quality estimate, produced variants are processed with usage of information about position of uppercase and lowercase letters, if more than one variant of symbol is available based on results of recognition of graphic element, variants are processed with successive usage of additional information, and/or when necessary simultaneous usage of all types of additional information, quality estimate is assigned to each produced variant, variants of symbols with estimate below predetermined value are discarded, produced variants are sorted using pair-wise comparison, and additional correction of recognition of spaces, erroneously recognized at previous stages, is performed.

EFFECT: increased precision of recognition of text and increased interference resistance of text recognition.

9 cl, 2 dwg

FIELD: devices for recognition of written symbols.

SUBSTANCE: method contains a stage of receiving written symbols which are written on a sensor screen, where the sensor screen contains at least an area for writing symbols and an area for writing punctuation. Then a stage is performed of determining the ratio of the written symbols which are written in the punctuation writing area relative to the symbol writing area, followed by a stage of recognition of punctuation marks. The recognition stage is conducted for the written symbols when the ratio exceeds a threshold value, where the recognition of punctuation marks determines, from a set of punctuation marks, at least one possible punctuation mark similar to the written symbols.

EFFECT: automatic recognition of punctuation marks with increased precision.

8 cl, 8 dwg

FIELD: information technologies.

SUBSTANCE: invention relates checking methods of documents accuracy of extensible markup language (XML) and message delivery about the real-time scheme violation. Parallel tree is supported with portions corresponding the elements of another XML document XML. When irregularities take place in XML document, elements of another XML of document are pointed out which comply the irregularities. Portions which correspond the pointed out elements of another XML document are verified according to the XML scheme, which in its turn corresponds another XML document positioning. This elements and portions which comply the errors in another XML document positioning are reported to the user according to image indicators in XML document and parallel tree.

EFFECT: XML document accuracy check provision and messaging about scheme irregularities in real-time mode while document correcting by the user.

20 cl, 8 dwg

FIELD: physics, computer equipment.

SUBSTANCE: present invention is related to components of trees ordering in system of sentences realisation. Component accepts disordered syntactical tree and generates ranged list of alternatively ordered syntactical trees from disordered syntactical tree. Component also includes statistic models of components structure that are used by component of trees ordering for estimation of alternatively ordered trees.

EFFECT: provision of proper order of words in treelike structure.

24 cl, 11 dwg

FIELD: computer engineering.

SUBSTANCE: application program interface (API) for import can be implemented to import content from hierarchically structured document, such as XML-file. Import API works in conjunction with syntax analyser to preview document and extract content from selected elements, units, attributes and text. Import API also uses callback component to process extracted content. Export API also can be realised to export data with the aim of creation of hierarchically structured document, such as XML-file. Export API works in conjunction with editor to receive data and export data in the form of elements, units, attributes and text in hierarchically structured document.

EFFECT: providing selective data import and export in electronic document.

20 cl, 5 dwg

FIELD: physics, computer technology.

SUBSTANCE: invention concerns methods and systems of text segmentation. Method involves addressing symbol line (204), long lexeme determination (206), recording adjoining symbols in long lexeme (208), determination of lexemes from symbol line by holding together the adjoining symbols, and determination of multiple lexeme combinations (210), with number of lexeme combinations reduced by means of recorded adjoining symbols.

EFFECT: enhanced speed of text fragmentation.

22 cl, 3 dwg

FIELD: physics; computer engineering.

SUBSTANCE: present invention pertains to computer technology. The elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains correct within the limits of its intrinsic schema. Elements of the second schema are "transparent" to the elements of the first schema, when the text processor checks correctness of elements of the first schema. Elements of the second schema are verified separately so that, elements of the first schema are "transparent" for verification of elements, corresponding to the second schema.

EFFECT: provision for validity checking of an extensible mark-up language (XML) document, with elements, linked to two or more schemata, where elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains correct within the limits of its intrinsic schema.

16 cl, 6 dwg
