Method for automatic semantic classification of natural language texts

FIELD: information technology.

SUBSTANCE: the method for automatic semantic classification of natural language texts comprises presenting each text to be classified in digital form for subsequent processing; indexing the text to obtain elementary units of the first through fifth levels; detecting the frequency of occurrence of units of the fourth level, each being a semantically significant object or attribute, and the frequency of occurrence of semantically significant relationships linking semantically significant objects, as well as objects and attributes; forming a semantic network from triads, which are units of the fifth level; renormalising the frequencies of occurrence into the semantic weights of the units of the fourth level; ranking the units of the fourth level according to semantic weight by comparing it with a threshold value, and removing those units having a weight below the threshold value; detecting the degree of intersection between the semantic network of the text and the semantic networks of text samples; selecting as a class for the text at least one subject area whose semantic network has a degree of intersection with the semantic network of the text greater than the threshold.

EFFECT: faster process of comparing texts.

6 cl, 2 dwg, 24 tbl

 

Technical field to which the invention relates

The present invention relates to the field of information technologies, in particular to a method of automated semantic classification of texts in natural language.

Prior art

There are various methods of automated semantic (i.e., meaning-based) classification of texts in natural languages (see, for example, patents of the Russian Federation No. 2107943 (publ. 27.03.1998) and No. 2108622 (publ. 10.04.1998), and EPO application No. 0241717 (publ. 21.10.1987)).

Generally speaking, semantic classification of natural language texts cannot be performed directly, because what must be classified in this case is not the specific words present in the text, but the meaning behind whole sentences and even paragraphs or sections. Therefore, semantic classification of texts is usually preceded by semantic indexing of these texts, which is carried out in various ways. Resolving the semantic ambiguity of these texts is also important.

Such methods of semantic indexing of texts for their subsequent comparison, with resolution of semantic ambiguity, are described, for example, in patent of the Russian Federation No. 2242048 (publ. 10.12.2004), U.S. patents No. 6871199 (publ. 22.03.2005), 7024407 (publ. 04.04.2006) and 7383169 (publ. 03.06.2008), U.S. patent applications No. 2007/0005343 and 2007/000544 (both publ. 04.01.2007) and 2008/0097951 (publ. 24.04.2008), laid-open applications of Japan No. 05-128149 (publ. 25.05.1993), 06-195374 (publ. 15.07.1994), 10-171806 (publ. 26.06.1998) and 2005-182438 (publ. 07.07.2005), and EPO application No. 0853286 (publ. 15.07.1998).

Closest to the claimed invention is the method of automated semantic indexing of natural language text disclosed in patent of the Russian Federation No. 2399959 (publ. 20.09.2010). In this method, the text in digital form is segmented into elementary units of the first level (words); for each elementary unit of the first level (word), an elementary unit of the second level (normalized word form) is formed; the text in digital form is segmented into sentences corresponding to sections of the indexed text; elementary units of the third level (word combinations) are identified in the text in the course of linguistic analysis; in a multistep process of semantic-syntactic analysis, by accessing a pre-generated database of linguistic and heuristic rules in a predetermined linguistic environment, elementary units of the fourth level (semantically significant objects and their attributes) are identified in each of the formed sentences, together with the semantically significant relationships between the identified semantically significant objects, and between semantically significant objects and attributes; within the text, for each of the identified semantically significant relationships, a set of elementary units of the fifth level (triads) is formed; on the set of formed triads, all semantically significant objects and attributes linked by semantically significant relationships are indexed individually, as are all triads of the form "semantically significant object - semantically significant relationship - semantically significant object" and all triads of the form "semantically significant object - semantically significant relationship - attribute"; the formed triads and the resulting indexes are retained in a database with references to the original text from which these triads were formed.

The disadvantage of this method is the lack of ranking of the formed elementary units of the fourth level according to their relevance to the text, which leads to an unnecessarily large amount of computation associated with the need to further process the entire generated index.

Disclosure of the invention

The purpose of the present invention is to expand the arsenal of methods for semantic classification of natural language texts by accelerating the process of comparing texts.

The achievement of this purpose and of the indicated technical result is ensured in the present invention by a method of automated semantic classification of natural language texts, in which: each text to be classified is presented in digital form for subsequent automatic and/or automated processing; each text to be classified is indexed in digital form, obtaining: elementary units of the first level, which include at least words; elementary units of the second level, each of which represents a normalized word form; elementary units of the third level, each of which represents a stable word combination in the text; elementary units of the fourth level, each of which is a semantically significant object or attribute; and elementary units of the fifth level, each of which represents a triad of either two semantically significant objects and the semantically significant relationship between them, or a semantically significant object, an attribute, and the semantically significant relationship linking them; the frequencies of occurrence of the elementary units of the fourth level and the frequencies of occurrence of the semantically significant relationships are detected; the formed elementary units of the second, third, fourth and fifth levels, the detected frequencies of occurrence of the elementary units of the fourth level and of the semantically significant relationships, as well as the received index with links to specific sentences of the text, are retained in a database; a semantic network is formed from the triads so that the first elementary unit of the fourth level of a subsequent triad coincides with the second elementary unit of the fourth level of the previous triad; in the course of an iterative procedure, the frequencies of occurrence are renormalized into the semantic weights of the elementary units of the fourth level, which are the vertices of the semantic network, so that elementary units of the fourth level connected in the network with a large number of other elementary units of the fourth level of high frequency of occurrence increase their semantic weight, while the other elementary units of the fourth level correspondingly lose it; the elementary units of the fourth level are ranked by semantic weight by comparing the semantic weight of each of them with a preset threshold value, and elementary units of the fourth level with a semantic weight below the threshold are removed; the remaining elementary units of the fourth level with semantic weight above the threshold, together with the semantically significant relationships between them, are kept in memory; the degree of intersection between the semantic network of the classified text and the semantic networks of text samples is identified, each text sample being composed of previously classified texts and describing a subject area of the semantic classification; the degree of intersection is identified both over the vertices of the semantic networks and over the links between these vertices, taking into account the formed semantic weights of the vertices of the considered semantic networks and the weight characteristics of their relationships; the identified degree of intersection between the semantic networks of the classified text and a specific text sample is taken as a value characterizing the semantic similarity of the classified text and this text sample; and at least one of the subject areas whose semantic network has a degree of intersection with the semantic network of the classified text greater than a predetermined threshold is chosen as a class for the classified text.

A feature of the method according to the present invention is that when the predetermined threshold degree of intersection is exceeded for multiple subject areas, the subject areas can be ranked by the degree of their closeness to the classified text.

In this case, a predetermined number of subject areas to which the classified text is assigned can be chosen.

Another feature of the method according to the present invention is that the indexing is carried out in the following stages: the text in digital form is segmented into elementary units of the first level, which include at least words; the text in digital form is segmented into sentences according to graphematic rules; for each elementary unit of the first level that is a word, an elementary unit of the second level comprising its normalized word form is formed on the basis of morphological analysis; the frequency of occurrence of each elementary unit of the first level is calculated for two or more adjacent units of the first level in the text, and sequences of words following each other in the text are combined into elementary units of the third level, representing stable word combinations, if for every two or more consecutive words in the text the difference between the calculated frequencies of occurrence of these words remains unchanged for each pair of words of the sequence, both at the first occurrence of the given word sequence and at several subsequent occurrences; in a multistep process of semantic-syntactic analysis, by accessing a pre-generated database of linguistic and heuristic rules in a predefined linguistic environment, semantically significant objects and attributes (elementary units of the fourth level) are identified in each formed sentence; for each elementary unit of the fourth level, the identity of reference between the corresponding semantically significant object or attribute and the corresponding anaphoric link, if present in the classified text, is recorded, replacing each individual anaphoric link with the corresponding antecedent; each semantically significant object and attribute is kept in memory; in a multistep process of semantic-syntactic analysis, by accessing the pre-generated database of linguistic and heuristic rules in the predefined linguistic environment, semantically significant relationships between the identified units of the fourth level (between semantically significant objects, and between semantically significant objects and attributes) are identified in each formed sentence; each semantically significant relationship is assigned a corresponding type stored in an ontology database for the subject area to which the classified text relates; the frequencies of occurrence of the elementary units of the fourth level and the frequencies of occurrence of the mentioned semantically significant relationships are detected throughout the text; each of the identified semantically significant relationships is kept in memory together with its assigned type; within the text, for each of the identified semantically significant relationships linking, as appropriate, semantically significant objects or a semantically significant object and its attribute, a set of triads (elementary units of the fifth level) is formed; on the set of formed triads, all semantically significant objects linked by semantically significant relationships are indexed individually with their frequencies of occurrence, together with all attributes with their frequencies of occurrence and all formed triads.

Another feature of the method according to the present invention is that the degree of intersection of two semantic networks is calculated as the sum of matches between the elementary units of the fifth level of these two semantic networks.

In this case, stages are carried out in which: of the two semantic networks, the one in which more vertices remain after the ranking and removal of vertices with semantic weights below the threshold is chosen as the base network, and the other is chosen as the comparison network; for each vertex of the base network, a vertex of the comparison network is found that is the same elementary unit of the fourth level, i.e. the same semantically significant object or the same attribute; for each of the found vertices in each of the base and comparison networks, the magnitudes of all triads associated with that vertex are calculated as the areas of triangles whose sides correspond to the characteristics of each of these triads and whose angle between the sides is proportional to the weight of the semantically significant relationship; for each pair of triads associated with a specific pair of vertices in the base and comparison networks, the smaller of the calculated values is chosen as the degree of intersection of these triads in the base and comparison networks; for each pair of associated vertices, all the selected calculated values are summed, yielding the degree of intersection for the given pair of vertices of the base and comparison networks; the found sum is normalized by the number of semantically significant objects and attributes associated with the given vertex in whichever of the base and comparison networks contains more vertices associated with this vertex; the normalized sums are summed over all the vertices of whichever of the base and comparison networks contains more vertices; and the resulting sum is normalized by the number of elementary units of the fourth level remaining in that network, yielding the degree of intersection of the two semantic networks.
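The intersection procedure described above can be sketched in simplified form. In this illustrative sketch (an assumption about the data layout, not the patent's implementation), each network maps a vertex to a dictionary of its neighbours with precomputed triad magnitudes; relation types and weight characteristics are omitted for brevity.

```python
def network_intersection(base, comp):
    """Degree of intersection of two semantic networks (simplified sketch).

    Each network maps a vertex (fourth-level unit) to a dict
    {neighbour: triad_magnitude}. The network with more vertices
    serves as the base network; a vertex missing from the comparison
    network contributes zero.
    """
    if len(comp) > len(base):
        base, comp = comp, base           # larger network is the base
    total = 0.0
    for vertex, base_triads in base.items():
        comp_triads = comp.get(vertex)
        if not comp_triads:
            continue                      # absent vertex: zero contribution
        shared = set(base_triads) & set(comp_triads)
        # take the smaller magnitude for each matching pair of triads
        pair_sum = sum(min(base_triads[n], comp_triads[n]) for n in shared)
        # normalize by the larger neighbour count of the two networks
        denom = max(len(base_triads), len(comp_triads)) or 1
        total += pair_sum / denom
    # normalize by the number of remaining fourth-level units
    return total / (len(base) or 1)
```

For two small networks sharing the vertices "a" and "b", the function returns a value between 0 and 1 that grows with the overlap of their triads.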

Brief description of drawings

The present invention is further illustrated by the description of a concrete example of its implementation and by the accompanying drawings.

Fig.1 shows a conventional block diagram illustrating the claimed method.

Fig.2 is a block diagram illustrating a preferred method of indexing the text.

Detailed description of the invention

The method according to the present invention can be implemented in virtually any computing environment, for example, on a personal computer connected to external databases. The stages of the method are illustrated in Fig.1.

All further explanations are given as applied to the Russian language, which is one of the most highly inflected languages, although the proposed method is applicable to the semantic classification of texts in any natural language.

First of all, each text subject to semantic classification must be presented in electronic form for subsequent automated processing. This stage is conventionally denoted in Fig.1 by reference position 1 and can be performed by any known method, for example, by scanning the text with subsequent recognition using well-known tools like ABBYY FineReader. If the text to be classified is received from an electronic network, for example, from the Internet, the stage of presenting it in electronic form is performed in advance, before the text is placed on the network.

It should be clear to specialists that the operations of this and subsequent steps are performed with storage of intermediate results, for example, in a random access memory (RAM) device.

The text converted into electronic form is received for processing, during which indexing is performed. This indexing (step 2 in Fig.1) may be the same as disclosed, for example, in the mentioned patent of the Russian Federation No. 2399959 or in U.S. patent application 2007/0073533 (publ. 29.03.2007). During this indexing, elementary units of the text at different levels are obtained. The elementary units of the first level include at least words; each elementary unit of the second level is a normalized word form; each elementary unit of the third level is a sequence of consecutive words in the processed text; each elementary unit of the fourth level is a semantically significant object or attribute; each elementary unit of the fifth level is a triad of either two semantically significant objects and the semantically significant relationship between them, or a semantically significant object, its attribute, and the semantically significant relationship linking them.

Preferably, however, the text is indexed using the method claimed in patent of the Russian Federation 2012150734 (priority from 27.11.2012) and illustrated in Fig.2. In this method, the text in digital form is first segmented into elementary units of the first level, which include at least words. (In the above-mentioned patent RF No. 2399959 these elementary units of the first level are called tokens.) A token can be any text object from the following set: a word consisting of a sequence of letters and, possibly, hyphens; a sequence of spaces; a punctuation mark; a number. Sometimes this also includes such character sequences as A300, i150b, etc. The extraction of tokens always follows fairly simple rules, for example, as in the above-mentioned patent RF No. 2399959. In Fig.2 this step is conventionally denoted by reference position 21.
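The token extraction described above can be sketched as follows. The regular expressions in this sketch are illustrative assumptions standing in for the "fairly simple rules" mentioned in the text, not the patent's actual rules.

```python
import re

# Illustrative token classes: codes and numbers (A300, i150b), words with
# optional hyphens, runs of spaces, and single punctuation marks.
TOKEN_RE = re.compile(
    r"(?P<alnum>\w*\d\w*)"                    # numbers and codes such as A300, i150b
    r"|(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"    # words, possibly hyphenated
    r"|(?P<space>\s+)"                        # sequences of spaces
    r"|(?P<punct>[^\w\s])",                   # punctuation marks
    re.UNICODE,
)

def tokenize(text):
    """Segment text into (kind, token) pairs, the elementary units of the first level."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

Note that the code-like class is tried first, so a mixed sequence such as "A300" is kept as a single token rather than split into a letter and a number.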

Following this, at step 22 (Fig.2), the indexed text in digital form is segmented into sentences corresponding to sections of this text. Such segmentation is carried out according to graphematic rules. For example, the simplest rule for the extraction of sentences is: "A sentence is a sequence of tokens beginning with a capital letter and ending with a period."
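The simple graphematic rule quoted above can be sketched directly over the token stream. This is a minimal illustration: the split is driven by the final period, with the capital-letter start condition left implicit.

```python
def split_sentences(tokens):
    """Group (kind, token) pairs into sentences by the simplest graphematic
    rule: a sentence ends at a period token. Trailing tokens without a
    final period form a last, unterminated sentence."""
    sentences, current = [], []
    for kind, tok in tokens:
        current.append((kind, tok))
        if kind == "punct" and tok == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```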

Next, for each elementary unit of the first level (for each token that is a word), the corresponding elementary unit of the second level, which is a normalized word form, hereinafter referred to as the lemma, is formed on the basis of morphological analysis. For example, for the word "went" the normalized form is "go", for the word "beautiful" the normalized form is "beautiful", and for the word "walls" the normalized form is "wall". In addition, for each word the part of speech to which it belongs and its morphological characteristics are indicated. Naturally, these characteristics differ for different parts of speech. For example, for nouns and adjectives they are gender (masculine, feminine, neuter), number (singular or plural), and case; for verbs they are aspect (perfective or imperfective), person, and number (singular or plural); and so on. Thus, for a given word, its normalized form (lemma) plus its morphological characteristics, including the part of speech, constitute a morph. The same word can have multiple morphs. For example, the Russian word translated as "glass" has two morphs: one for the neuter noun and one for the verb in the past tense. This step is conventionally denoted in Fig.2 by reference position 23.
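A morph record (lemma plus part of speech plus morphological characteristics) might be represented as sketched below. The tiny dictionary is a toy assumption for illustration; a real system would use a full morphological analyzer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Morph:
    lemma: str       # normalized word form (elementary unit of the second level)
    pos: str         # part of speech
    features: tuple  # morphological characteristics, e.g. gender, number, case

# Toy morphological dictionary (an illustrative assumption, not a real analyzer).
# Like the ambiguous Russian word discussed in the text, the key "glass" is
# given two morphs: a neuter-noun reading and a past-tense-verb reading.
MORPH_DICT = {
    "glass": [Morph("glass", "NOUN", ("neuter", "singular")),
              Morph("glass", "VERB", ("past", "singular"))],
    "walls": [Morph("wall", "NOUN", ("plural",))],
}

def analyze(word):
    """Return every morph of a word; more than one morph means ambiguity."""
    return MORPH_DICT.get(word.lower(), [])
```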

The next step, conventionally denoted in Fig.2 by reference position 24, consists in calculating the frequency of occurrence of each of the elementary units of the first level mentioned in the text. In other words, it is determined how many times each word occurs in the processed text. This operation is carried out automatically, for example, by simply counting the frequency of occurrence of each token, or as described in patent of the Russian Federation No. 2167450 (publ. 20.05.2001) or in U.S. patent No. 6189002 (publ. 13.02.2001). Simultaneously with the counting of the frequency of occurrence, for each two or more consecutive words in the text, the difference between the calculated frequencies of occurrence of these words at the first appearance of this word sequence and at their subsequent appearances is found. If these differences remain the same at the first occurrence of the given word sequence and at several subsequent occurrences, such a sequence of words following each other in the text (i.e., of elementary units of the second level) is combined into an elementary unit of the third level, representing a stable word combination (idiomatic expression).
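The detection of stable word combinations can be sketched as follows. As a simple proxy for the frequency-difference criterion described above (an assumption for this sketch, not the patent's exact test), a pair of adjacent words is treated as stable when its two words occur in the text only as part of that pair.

```python
from collections import Counter

def stable_pairs(words):
    """Find adjacent word pairs that qualify as stable combinations
    (elementary units of the third level). A pair is stable here when it
    occurs at least twice and its words never occur outside the pair."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return {pair for pair, n in bigrams.items()
            if n >= 2 and n == unigrams[pair[0]] == unigrams[pair[1]]}
```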

Then, in the next step, denoted in Fig.2 by reference position 25, multistep semantic-syntactic analysis is performed in order to identify semantically significant objects and attributes. Such multistep semantic-syntactic analysis is performed by accessing a pre-generated database of linguistic and heuristic rules in a predetermined linguistic environment. Such an environment may be, for example, the linguistic environment mentioned in U.S. patent application No. 2007/0073533 or in the above-mentioned patents of the Russian Federation No. 2242048 and No. 2399959, or any other linguistic environment that determines the relevant rules allowing syntactic and semantic ambiguity of the words and expressions of a real text to be resolved. The linguistic and heuristic rules in the selected environment are hereinafter referred to simply as rules.

The identification of semantically significant objects and attributes, which are the elementary units of the fourth level, is performed within each sentence on the set of elementary units of the first, second and/or third levels.

For each semantically significant object or attribute, i.e. elementary unit of the fourth level with an assigned type, the corresponding anaphoric link is found (if present). For example, in the sentence "Mechanics is the part of physics that studies the laws of mechanical motion and the causes producing or modifying this motion", the anaphoric link for the word "mechanics" is the pronoun "that", and the word "mechanics" is the antecedent of this anaphora; likewise, the anaphoric link for the word "mechanical" is "this" (in "this motion"), and the word "mechanical" is the antecedent of this anaphora. This stage of finding anaphoric links is conventionally denoted in Fig.2 by reference position 26. Each anaphoric link is replaced with the corresponding antecedent. After that, each identified semantically significant object and attribute is retained in the corresponding memory.
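The replacement of anaphoric links with their antecedents can be sketched as a simple substitution over the token sequence. Finding the links themselves is the hard part and is assumed here to have been done by the semantic-syntactic analysis; the sketch only performs the replacement step.

```python
def resolve_anaphora(tokens, links):
    """Replace each anaphoric reference with its antecedent.

    `tokens` is the word sequence of a sentence; `links` maps the index of
    an anaphoric token to its antecedent string, as produced (we assume)
    by the preceding semantic-syntactic analysis."""
    return [links.get(i, tok) for i, tok in enumerate(tokens)]
```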

In the next step, denoted in Fig.2 by reference position 27, multistep semantic-syntactic analysis is performed, during which, on the basis of the elementary units of the first, second, third and fourth levels, the semantically significant relationships between semantically significant objects, and between semantically significant objects and attributes, are found using the above-mentioned rules.

At the stage denoted in Fig.2 by reference position 28, each semantically significant relationship is assigned a corresponding type stored in an ontology database for the subject area to which the indexed text belongs. After that, every semantically significant relationship is retained in the corresponding memory together with its assigned type and the morphological and semantic attributes found for it.

After this, at the stage denoted in Fig.2 by reference position 29, the frequencies of occurrence of semantically significant objects and attributes are detected throughout the text, as well as the frequencies of occurrence of the semantically significant relationships between semantically significant objects and between semantically significant objects and attributes. This operation is performed in much the same way as at stage 24 for the elementary units of the first level.

At the stage denoted in Fig.2 by reference position 30, the stored semantically significant objects and attributes and the semantically significant relationships are used to form triads. Within the indexed text, for each of the identified semantically significant relationships linking certain semantically significant objects and attributes, a set of triads of two types is formed. Each of the triads of the first type includes a semantically significant relationship and the two semantically significant objects associated by this semantically significant relationship. Each of the triads of the second type includes a semantically significant relationship, one semantically significant object, and its attribute associated by this semantically significant relationship. If two semantically significant objects are denoted by O_i and O_j, and the semantically significant relationship linking them by R_ij, then each of the triads of the first type can be represented as O_i → R_ij → O_j. Each of the triads of the second type can be represented as O_i → R_im → A_m, where A_m is the corresponding attribute and R_im is the semantically significant relationship linking the semantically significant object and the attribute. In these notations the indices i, j, m are integers.
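The two triad types described above can be represented by a small data structure. The example triads are illustrative assumptions loosely based on the "mechanics" sentence used earlier; the relation names are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triad:
    """Elementary unit of the fifth level: O_i -> R_ij -> O_j (object-object)
    or O_i -> R_im -> A_m (object-attribute)."""
    source: str     # semantically significant object O_i
    relation: str   # semantically significant relationship R
    target: str     # second object O_j, or attribute A_m
    kind: str       # "object-object" or "object-attribute"

# Illustrative triads (hypothetical relation names):
triads = [
    Triad("mechanics", "is_part_of", "physics", "object-object"),
    Triad("motion", "has_property", "mechanical", "object-attribute"),
]
```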

Then, at the stage denoted in Fig.2 by reference position 31, the indexing of the text is performed. In doing so, on the set of formed triads, all semantically significant objects linked by semantically significant relationships are indexed separately with their frequencies of occurrence, together with all attributes with their frequencies of occurrence and all formed triads.

For this purpose, on the set of formed triads, all semantically significant objects and their attributes are indexed individually with their frequencies of occurrence, as well as all triads of the form "semantically significant object - semantically significant relationship - semantically significant object" and all triads of the form "semantically significant object - semantically significant relationship - attribute". The triads formed at stage 30 and the index received at step 31, with references to the specific sentences of the source text from which these triads were formed, are saved in the database (step 32 in Fig.2).

It is obvious to specialists that the storage devices mentioned at the individual stages can actually be either different devices or a single sufficient storage device. Similarly, the separate databases mentioned at the respective stages can be not only physically separate databases but also a single database. Moreover, these storage devices (memories) can store either the same single database or the separately mentioned databases. It is also clear to experts that the methods claimed in the present invention are performed in a corresponding computing environment running corresponding programs recorded on machine-readable media intended for direct participation in the operation of the computer.

Let us return to the block diagram of Fig.1. In step 3, the frequencies of occurrence of the elementary units of the fourth level (i.e., semantically significant objects and attributes) are identified, and the frequencies of occurrence of the semantically significant relationships are detected. Note that the formed elementary units of the fourth level are retained in the database together with the detected frequencies of occurrence. In addition, the received index is kept in the database with links to specific sentences of the text.

Then, in step 4, in the method according to the present invention a semantic network is formed in such a way that the first semantically significant object of a subsequent triad coincides with the second semantically significant object of the previous triad. In the course of an iterative procedure, the renormalization of the frequencies of semantically significant objects and attributes into the semantic weights of the semantically significant objects and attributes that are the vertices of the semantic network is carried out. This renormalization is carried out in such a way that semantically significant objects and attributes connected in the network with a large number of semantically significant objects and attributes with a high frequency of occurrence increase their semantic weight, while the other semantically significant objects and attributes correspondingly lose it (step 5 in Fig.1).
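The iterative renormalization described above resembles an eigenvector-centrality computation over the semantic network. The sketch below assumes a PageRank-style update rule; the damping factor, iteration count, and exact formula are assumptions of this sketch, since the text does not fix a specific formula.

```python
def renormalize(freq, edges, iters=50, damping=0.85):
    """Turn frequencies of fourth-level units (network vertices) into
    semantic weights: vertices linked to many high-frequency neighbours
    gain weight, the rest correspondingly lose it.

    freq:  {vertex: frequency_of_occurrence}
    edges: list of (vertex, vertex) semantically significant relationships
    """
    nodes = list(freq)
    total = sum(freq.values()) or 1.0
    weight = {n: freq[n] / total for n in nodes}
    neighbours = {n: [b for a, b in edges if a == n] +
                     [a for a, b in edges if b == n] for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # each neighbour shares its weight equally among its own links
            share = sum(weight[m] / max(len(neighbours[m]), 1)
                        for m in neighbours[n])
            new[n] = (1 - damping) * freq[n] / total + damping * share
        weight = new
    return weight
```

A hub vertex connected to several neighbours ends up with a higher semantic weight than its leaves, which is the qualitative behaviour the renormalization step requires.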

Next, the elementary units of the fourth level are ranked by semantic weight by comparing their semantic weights with a predetermined threshold value (step 6 in Fig.1).

Elementary units of the fourth level with a semantic weight below the threshold are removed (step 7 in Fig.1). The remaining elementary units of the fourth level with a weight above the threshold are kept in memory (step 8). The semantically significant relationships between the semantically significant objects, and between the semantically significant objects and attributes, remaining in the semantic network are also kept in memory.
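Steps 6 through 8 amount to thresholding the weighted vertices and discarding relationships whose endpoints no longer survive, which can be sketched as:

```python
def prune(weights, relations, threshold):
    """Rank fourth-level units by semantic weight and drop those below the
    threshold; keep only relationships whose both endpoints survive.

    weights:   {vertex: semantic_weight}
    relations: list of (source, relation, target) triples
    """
    kept = {v: w for v, w in weights.items() if w >= threshold}
    kept_rel = [(a, r, b) for a, r, b in relations
                if a in kept and b in kept]
    return kept, kept_rel
```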

Next, in step 9, the degree of intersection between the constructed semantic network of the classified text and the semantic networks of text samples is identified. These text samples are composed of previously classified texts. They describe the subject areas of the semantic classification for which the classified text is being processed. The degree of intersection of the semantic networks is revealed both over their vertices and over the relations between these vertices, taking into account the semantic weights of the vertices of the considered semantic networks and the weight characteristics of their relationships.

The identified degree of intersection between the semantic networks of the classified text and a specific text sample is taken as a value characterizing the semantic similarity of the classified text and this text sample. Then, at least one of the subject areas whose semantic network has a degree of intersection with the semantic network of the classified text greater than the predetermined threshold is chosen as the class for the classified text (step 10 in Fig.1).

The degree of intersection of the two semantic networks formed as described above is calculated as the sum of matches between the elementary units of the fifth level of these two semantic networks. In principle, this calculation may be conducted by various methods known in the art.

Preferably, the degree of intersection is calculated as the sum of the intersections of the elementary units of the fifth level of the two networks. To do this, of the two semantic networks, the one in which, after the ranking and removal of vertices with semantic weights below the threshold value (see step 7 in Fig.1), more vertices remain than in the other is chosen as the base network, and the other is chosen as the comparison network. For each vertex of the base network, a vertex of the comparison network is found that is the same elementary unit of the fourth level, i.e. the same semantically significant object or the same attribute. For each of the found vertices in each of the base and comparison networks, the magnitudes of all triads associated with that vertex are calculated as the areas of triangles whose sides correspond to the characteristics of each of these triads. This area can be calculated, normalized to 100%, from the product of the vectors c_i and c_j, where the vector c_i corresponds to the first semantically significant object or attribute of an elementary unit of the fifth level, the vector c_j corresponds to the second semantically significant object or attribute of that elementary unit of the fifth level, and the angle w_ij between the vectors c_i and c_j is proportional to the frequency of occurrence of the semantically significant relationship between the first and second semantically significant objects, or between the first semantically significant object and the attribute, normalized to 90°: w_ij ∈ (0...90°).
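Reading the triad magnitude as the area of a triangle whose two sides are the characteristics (e.g. weights) of the two fourth-level units and whose enclosed angle encodes the relationship frequency, the computation is the standard side-angle-side area formula. This reading is an interpretation of the description above, not a formula stated verbatim in it.

```python
import math

def triad_area(c_i, c_j, w_deg):
    """Magnitude of a triad as a triangle area.

    c_i, c_j: lengths of the sides, i.e. characteristics of the two
              fourth-level units of the triad
    w_deg:    angle in degrees, in (0..90], proportional to the frequency
              of the semantically significant relationship
    """
    return 0.5 * c_i * c_j * math.sin(math.radians(w_deg))
```

With a right angle (the maximum relationship weight) the area reduces to half the product of the sides, and it shrinks toward zero as the relationship weight does.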

Next, for each pair of triads associated with a specific pair of vertices in the base and comparison networks, the smaller of the calculated values is chosen as the degree of intersection of these triads in the base and comparison networks. All selected calculated values are summed for each pair of vertices, yielding the degree of intersection for the given pair of vertices of the base and comparison networks. The found sum is normalized by the number of semantically significant objects and attributes associated with the given vertex in whichever of the base and comparison networks contains more vertices. The normalized sums so obtained are summed over all the vertices of whichever of the base and comparison networks contains more vertices. Finally, the resulting sum is normalized by the number of elementary units of the fourth level, i.e. semantically significant objects and attributes, remaining in the network, yielding the degree of intersection of the semantic networks.

Obviously, if some vertex is absent from the comparison network, the degree of intersection for this vertex is taken to be zero.

Example

To illustrate the implementation of the claimed method of automated semantic classification of natural language text, consider the following example. Let some Russian text be presented on the Internet site http://www.unn.ru/rus/priem.htm, along with a few (e.g., three) sample texts describing classes (subject areas), presented on the same site. Thus, the conversion of the texts into electronic form, denoted in Fig.1 by reference number 1, can be considered to have already been completed.

A typical example of such a text is the following fragment:

"Throughout the world, the mathematics exam is the written solution of problems. The written nature of the tests is regarded everywhere as being as obligatory a feature of a democratic society as elections from several candidates. Indeed, at an oral exam the student is completely defenceless. While taking exams at the department of differential equations of the mechanics and mathematics faculty of Moscow State University, I have happened to hear examiners who "stoked" (failed) students at the neighbouring table who were giving excellent answers (perhaps surpassing the level of understanding of the teacher). Cases are also known when students were stoked on purpose (sometimes one could save oneself from this by entering the audience in time)."

In accordance with the claimed method of automated semantic classification of natural language texts, a pre-created database of syntactic rules and dictionaries is used to perform the word processing and build the semantic index. Such databases are prepared by expert linguists, who, on the basis of their experience and knowledge, determine the sequence and composition of the syntactic processing of text in the specific language.

The expert linguists pre-build a set of syntactic rules which, together with the corresponding linguistic dictionaries also pre-built by the expert linguists, allow specific information to be identified automatically in the texts being processed: semantically significant objects, attributes of semantically significant objects, and semantically significant relationships that can occur between semantically significant objects or between semantically significant objects and attributes.

In addition to the specification of the subject area and the rules, dictionaries of general and special vocabulary are used in accordance with the above methods.

In accordance with the claimed method of automated semantic comparison of natural language texts, the text is first segmented into elementary units, the tokens (reference position 21 in Fig.2), and morphological analysis of the word tokens is performed (reference position 23 in Fig.2). In this phase the source text is transformed into a set of tokens and morphs, presented in Table 1 and Table 2 respectively.
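A minimal sketch of this first-level segmentation, producing tokens with character offsets and types as in Table 1, is shown below; the regular expression is an illustrative stand-in for the claimed segmentation rules, not part of the patent:

```python
import re

def tokenise(text):
    """Split text into first-level units: word tokens and punctuation signs,
    each with its start/end character offsets (cf. Table 1).
    The regex is an assumed, simplified segmentation rule."""
    tokens = []
    for m in re.finditer(r"\w+|[^\w\s]", text):
        kind = "word" if m.group()[0].isalnum() else "punctuation sign"
        tokens.append((m.group(), m.start(), m.end(), kind))
    return tokens

for tok in tokenise("In all the world, the exam."):
    print(tok)
```

Detection of introductory words, parenthetical constructions and sentence boundaries (also annotated in Table 1) would require the dictionaries and rules described above and is omitted here.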

Introductory words and parenthetical constructions do not carry any syntactic load, so tokens of this type are excluded from further analysis.

Tokens that are geographical names are treated as a single word, with a morph corresponding to the morph of the main word.

Next, after segmentation of the text into tokens and morphological analysis of the word tokens, phrases are extracted (reference position 24 in Fig.2). To do this, the frequencies of occurrence of words in sequences of two or more words in the text are calculated. The difference of the frequencies of occurrence of the words in a sequence is then compared for the first occurrence of the given word sequence and for several subsequent occurrences.

The frequencies of occurrence of words at the first appearance of a sequence and at its subsequent appearances, and the differences of these frequencies, are presented in Table 3.

At this stage the original text, in addition to the elementary units of the first and second levels, is supplemented with a number of units of the third level, the stable phrases. The phrases for our example are shown in Table 4.
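The selection of stable phrases from co-occurrence frequencies (Tables 3 and 4) can be sketched as follows. Treating a zero (or near-zero) frequency difference as the stability criterion is an assumption drawn from the "asymptotically stable" example, where the difference is 0:

```python
from collections import Counter

def stable_phrases(tokens, max_diff=0):
    """Detect stable two-word phrases (third-level units) in a token list.

    A word pair counts as a stable phrase when it occurs at least twice and
    the frequency of the pair is (nearly) the same as the frequencies of its
    member words, i.e. the words almost never occur outside the pair.
    `max_diff` is an assumed tolerance on that frequency difference."""
    word_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))
    phrases = []
    for (w1, w2), f in pair_freq.items():
        if f >= 2 and word_freq[w1] - f <= max_diff and word_freq[w2] - f <= max_diff:
            phrases.append(f"{w1} {w2}")
    return phrases
```

On a toy token stream in which "asymptotically" is always followed by "stable", only that pair survives the criterion, matching Table 4.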

After the above steps, the processed text is fragmented into sentences (reference position 22 in Fig.2). In this phase the sets formed above are supplemented with the set of sentences presented in Table 5.

Thus, after all the above steps, the processed text is segmented into sentences, each of which is annotated with sets of elementary units of the first, second and third levels.

Following this, in accordance with the claimed method of automated semantic comparison of natural language texts, semantically significant objects and attributes (elementary units of the fourth level) are detected (reference position 25 in Fig.2). This is done in each sentence on the set of elementary units of the first, second and/or third levels, using a pre-formed set of linguistic and heuristic rules together with the corresponding pre-formed linguistic dictionaries.

The semantico-syntactic processing of sentences is carried out in several stages. All the stages will be illustrated on the text we have selected as the example.

1. Partitioning of sentences at punctuation marks and conjunctions (conjunctive words and phrases) into initial fragments, and determination of the type of each fragment based on its morphological characteristics. A dictionary of conjunctions, conjunctive words and phrases is used for this.

Fragment boundaries are placed at all punctuation marks and conjunctions (conjunctive words and phrases) except the full stop. In addition, the dictionary of conjunctions is used to determine whether there is a compound conjunction whose beginning lies in the fragment adjacent on the left and whose end lies in the current one. In our case such a conjunctive phrase is "up until". If such a conjunction is present, the comma is transferred to the whole conjunction.

The fragment type is one of the values listed in Table 6. In the order specified in Table 6, the fragment is searched for a word form with the appropriate homonym; other homonyms of the word forms found are not considered.

2. Merging of the initial fragments in simple cases of homogeneous series of adjectives, adverbs, nouns, etc. The characteristic of homogeneity is the presence of a coordinating conjunction (or comma) before and after which there are word forms of the same part of speech having homonyms with the same morphological information. Other homonyms are not considered in the subsequent analysis, so a partial disambiguation takes place.

In our example, the coordinating conjunction "as" connects fragments 2.1 and 2.2, since tokens 14 ("nature") and 26 ("elections") of Table 1 have homonyms of the same part of speech with the same morphological information: nominative or accusative case. The type of the resulting fragment is 1.

3. Construction of simple syntactic groups corresponding to the attribute level of description (Table 8): feature of the object/subject/action + object/subject/action; measure of a feature + feature of the object/subject/action + object/subject/action.

Later, anaphoric links are identified and resolved in the sentences of the text. To do this, within the processed text, at the step indicated in Fig.2 by reference position 26, pronouns that could be anaphoric links to appropriate words are found, and for the pronouns that indeed are such links, cross-references are fixed between the corresponding semantically significant object and its anaphoric link. In our example, anaphora is absent.

4. Attachment of embedded fragments (participial and adverbial-participial phrases, subordinate attributive clauses, etc.) and establishment of a hierarchy of the fragments. A participial phrase and a subordinate attributive clause become a feature of the corresponding object; an adverbial-participial phrase becomes a feature of the action.

In our example, the following embeddings occur:

- fragment 4.2 (Table 7) of type 6, "taking exams at the Department of differential equations of the mechanics and mathematics faculty", is an adverbial-participial phrase with the main word "taking"; hence the whole fragment 4.2 is subordinate to the verb "to hear" of the preceding fragment;

- fragment 4.5 (Table 7) of type 5, "gave excellent answers", is a participial phrase with the main word "gave", agreeing in gender with the noun "students" of the previous fragment; therefore the entire fragment 4.5 is subordinate to the noun "students" as its attributive description. Thus, the whole fragment 4.5 is an attribute (feature) of the noun "students".

The second column of Table 10 shows the enlarged sentence fragments obtained after attachment.

5. The construction of a set of unambiguous morphological interpretations of each fragment.

Within each fragment, a partial disambiguation at the morphological level is performed by:

1) selection of groups of nouns agreeing with one or more adjectives/participles/pronoun-adjectives in a homogeneous chain (the so-called attribute level, described above in paragraph 3);

2) analysis of the position of the dash, which removes ambiguity, firstly, for the word form "this", because a dash before this word indicates that it is a particle, and secondly, for the nouns before and after the dash, because the noun closest to the dash on the right can only be in the nominative case, and the one on the left in the nominative or instrumental. Thus, in our example, the word "this" (token 8, Table 2) is a particle, and the words "exam" (token 4, Table 2) and "solution" (token 10, Table 2) can only be in the nominative case;

3) identification of participial phrases standing after a noun, and of adverbial-participial phrases, because such phrases are set off by commas and the nouns included in them depend on the verb forms and cannot be in the nominative case. Thus, in our example, the words "exams" (token 45, Table 2) and "answers" (token 65, Table 2) cannot be in the nominative case;

4) identification of prepositions; for a noun subordinate to a preposition, those homonyms are removed whose case cannot be used with that preposition (using the government model of the preposition). In our example:

- the preposition "from" (token 27, Table 1) before the word "candidates" (token 29, Table 1) cannot govern a noun in the accusative case;

- the preposition "on" (token 46, Table 1) before the word "department" (token 47, Table 1) cannot govern a noun in the dative case;

- the word "me" (token 40, Table 1), before which there is no preposition, cannot be in the prepositional case,

therefore these homonyms are removed from consideration.

In Table 2, the variants of homonyms excluded from consideration as a result of the partial removal of homonymy at the morphological level are highlighted in grey.

6. Merging of fragments into simple sentences within the structure of the complex sentence using subordinating conjunctions. Subordinating conjunctions act as the boundaries of simple sentences (Table 10, column 3).

7. Identification of the predicative minimum (including the main semantically significant objects and the main semantically significant relationships, the predicates) of a sentence by comparing its structure with a dictionary of templates of minimal structural schemes of sentences, a fragment of which is shown in Table 11. The result for our example is shown in Table 12.

8. Identification of the remaining members of the simple sentences (the other semantically significant objects and attributes) and of the other semantically significant relationships, by sequential comparison of the sentences with the actant structures of verbs from the dictionary of verb valences. The filled valence slots of the predicates of the example text are shown in Table 13.

Let us take a closer look at the predicate "stoked". According to the semantic classification used in the dictionary of verb valences, it expects a situation of action by a subject upon an object. Verbs of this class have a formal expression of the form "noun in the nominative case - verb - noun in the genitive case". Thus the main semantically significant objects "teacher" and "student", and the main semantically significant relationship "impact", are identified.

9. Construction, within the simple sentences obtained, of the syntactic groups in which the arguments of the predicates are the main words, using syntactic rules that identify syntactic relations between words. The groups built are shown in Table 14.

Thus the set of other semantically significant objects and attributes, as well as the other semantically significant relationships, is obtained. For this example they are summarised in Table 15.

After the previous steps have been performed, semantically significant relationships between semantically significant objects are identified on the set of selected elementary units of the first, second, third and fourth levels with the help of the mentioned rules. For example, in the sentence "Throughout the world, the mathematics exam is the written solution of problems" of the text, using the set of rules corresponding to the processing unit shown in Fig.2 (processing paragraphs 1-9) and the dictionaries used in these rules, presented in Tables 6-16, the semantically significant relationship "is" is identified. Other semantically significant relationships are extracted using the same set of rules. Each semantically significant relationship is assigned its type. As a result, the semantically significant relationships are identified in the source text. The set of these semantically significant relationships with their assigned types for this example is presented in Table 16.

Thus, after all the above processing steps, the source text is marked up with a set of annotations that correspond to the semantically significant objects, the attributes, and the semantically significant relationships between semantically significant objects and between semantically significant objects and attributes.

At the next phase, indicated in Fig.2 by reference position 29, the frequencies of occurrence of semantically significant objects and attributes, as well as of the semantically significant relationships between semantically significant objects and between semantically significant objects and attributes, are identified throughout the text. This operation is performed in almost the same way as at stage 24 for the elementary units of the first level. A fragment of such a frequency dictionary for our example is shown in Tables 17 and 18.

The next step, indicated in Fig.2 by reference position 30, is technical and is performed to form the triads corresponding to the stored semantically significant objects, attributes and semantically significant relationships. A fragment of the set of such triads for our example is presented in Table 19. In effect, the set of triads formed is the source data for building the semantic index of the text processed in the previous steps.

At the stage indicated in Fig.2 by reference position 31, the semantic index is built as follows: first, from the set of triads obtained at the previous step, subsets of triads are formed, each corresponding to one semantically significant object with its attributes, and each resulting subset of triads is used as input for one of the standard indexers, for example the widely known redistributable Lucene indexer, the indexer of the Yandex search engine, the Google indexer, or any other indexer, whose output is an index unique to the given subset of triads. A similar sequence of steps is performed for all subsets of triads, both for triads of the form "semantically significant object - semantically significant relationship - semantically significant object" and for triads of the form "semantically significant object - semantically significant relationship - attribute", yielding the corresponding unique indexes, which together make up the semantic index of the text.
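The grouping of triads into per-object subsets that are then handed to an indexer can be illustrated with a toy inverted index. The patent does not fix the index format and names Lucene and others only as examples, so the structure below is purely an illustrative stand-in:

```python
from collections import defaultdict

def build_semantic_index(triads):
    """Group fifth-level triads by their first semantically significant
    object and build a toy inverted index for each subset (a stand-in for a
    production indexer such as Lucene).

    triads: iterable of (object, relationship, object-or-attribute) tuples.
    Returns {object: {term: set of positions of triads containing the term}}.
    """
    subsets = defaultdict(list)
    for obj, relationship, other in triads:
        subsets[obj].append((obj, relationship, other))
    index = {}
    for obj, subset in subsets.items():
        postings = defaultdict(set)
        for pos, triad in enumerate(subset):
            for term in triad:
                postings[term].add(pos)
        index[obj] = dict(postings)
    return index
```

Each per-object sub-index is unique to its subset of triads, mirroring the "one unique index per subset" property described above.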

At the stage indicated in Fig.2 by reference position 32, the triads formed at stage 30 and the indexes obtained at stage 31, together with a link to the original text from which these triads were formed, are saved in the database.

In accordance with the method of automated semantic comparison of natural language texts, a semantic network can be formed from these triads in such a way that the first semantically significant object of a subsequent triad is associated with the identical second semantically significant object of the previous triad. An example of a fragment of such a semantic network is shown in Table 20.

However, before the formed triads and derived indexes are saved in the database, an iterative procedure renormalises the frequencies of occurrence of the semantically significant objects and attributes, as well as the frequencies of occurrence of the semantically significant relationships, into the semantic weights of the semantically significant objects and attributes that are the vertices of the semantic network, in such a way that semantically significant objects or attributes connected in the network with a large number of semantically significant objects or attributes of high frequency of occurrence increase their semantic weight, while the other semantically significant objects or attributes uniformly lose it. Example numerical values of the weights of the concepts of the semantic network renormalised into semantic weights are shown in Table 21. The sample texts describing the classes (in this example, three), with which the classified text is to be compared, are processed similarly.
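The patent describes this renormalisation only qualitatively, so the update rule below is an assumption: a damped power-iteration over the network that conserves the total weight mass. It reproduces the stated behaviour, in that well-connected vertices with high-frequency neighbours gain weight while the rest lose it uniformly through the rescaling:

```python
def renormalise_weights(graph, freq, iterations=10):
    """Renormalise occurrence frequencies into semantic weights (sketch).

    graph: {vertex: set of neighbouring vertices} -- the semantic network;
    freq:  {vertex: frequency of occurrence of the object or attribute}.
    Each round a vertex keeps its weight and adds support from its
    neighbours' weights; all weights are then rescaled so that the total
    mass stays constant, which uniformly drains weight from poorly
    connected vertices.  The exact rule is an illustrative assumption."""
    weight = dict(freq)
    mass = sum(freq.values()) or 1.0
    for _ in range(iterations):
        new = {v: weight[v] + sum(weight[n] for n in graph[v]) for v in graph}
        scale = mass / (sum(new.values()) or 1.0)
        weight = {v: w * scale for v, w in new.items()}
    return weight
```

On a toy network where vertex "a" is linked to three others while "e" has a single neighbour, "a" ends with the highest weight even when all start with equal frequencies, which is the behaviour the text describes.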

Next, the degrees of intersection of the semantic networks of the classified text and of the sample texts describing the classes (subject areas) are computed over their vertices and relations, taking into account the semantic weights of the vertices and the weight characteristics of their relationships. Example values of the degrees of intersection of the semantic networks of the classified text and of the sample texts describing the classes (subject areas) are shown in Table 22. The greater degree of intersection of the classified text with the class "Mathematics" indicates their semantic similarity, compared with the other classes.

If the threshold for assigning the classified text to subject areas (classes) is set equal to 2.00000, the text does not fall within any of the defined classes. When the threshold is set equal to 1.50000, the text falls entirely within the subject area "Mathematics".

The degree of intersection of two semantic networks, belonging to the classified text and to a sample text describing a class (subject area), is calculated as the sum of the degrees of intersection of the elementary units of the fifth level of the two networks. This sum is formed over all vertices of the network that has more vertices. For each vertex of this network, the vertex of the other network is found that is the same elementary unit of the fourth level, the same semantically significant object or the same attribute. If no such vertex is found in the second network, the degree of intersection for this vertex is equal to zero. Example values of the degrees of intersection of the vertices of the semantic networks of the classified text and of the sample texts describing one of the classes are listed in Table 23.

For each vertex of one of the semantic networks (for each semantically significant object or attribute, an elementary unit of the fourth level), the degree of intersection with the corresponding vertex of the other semantic network is calculated. In the example under consideration, take for instance the vertex "function", which is included in the semantic networks of both compared texts. This degree of intersection is calculated as the sum of the degrees of intersection over all semantically significant objects and attributes associated with this vertex. In the semantic networks of the classified text and of the sample text characterising the class "Mathematics", these are "equation", "derivative", "score", "the solution of the equation" and others in one semantic network, and "equation", "derivative", "solving equations", "order", etc. in the other semantic network.

For the vertex "function", the scalar products normalised to 100% are computed for the link with the vertex "equation" in each of the two networks: 99×99×sin(52.2°)/100 = 77.44 and 99×99×sin(75.6°)/100 = 94.93. The same is done for all vertices of the semantic network whose semantic weight exceeds the threshold value (chosen equal to 70 in this example).
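These two numbers can be checked directly (note that math.sin expects radians, hence the degree conversion):

```python
import math

def normalised_product(w1, w2, angle_deg):
    # Scalar product of two semantic weights, normalised to 100%, with the
    # angle proportional to the frequency of the linking relationship.
    return w1 * w2 * math.sin(math.radians(angle_deg)) / 100.0

print(round(normalised_product(99, 99, 52.2), 2))  # 77.44
print(round(normalised_product(99, 99, 75.6), 2))  # 94.93
```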

The total degree of intersection of the two semantic networks at the vertex "function", equal to 177.49 over all neighbouring vertices of the semantic networks, is normalised by 120, the larger number of vertices remaining after removal of the sub-threshold vertices in one of the two semantic networks of the compared texts.

The degree of intersection of the semantic networks is therefore calculated by summing the smaller degrees of intersection over the pairs of identically named semantically significant objects or attributes of the two compared networks (see Table 24). For this, the semantic intersection of the semantic weights of each semantically significant object or attribute associated with the given vertex in the two networks is calculated. These semantic intersections are calculated as the scalar product, normalised to 100%, of the semantic weights of the first and second vertices, with the angle between them taken proportional to the normalised-to-100% frequency of occurrence of the linking semantically significant relationship. The smaller of the scalar products is added to the resulting sum. If the second network has no corresponding semantically significant object or attribute for the given vertex, the degree of intersection for that semantically significant object or attribute is equal to zero. After summation over all semantically significant objects or attributes associated with the current vertex, the resulting sum is normalised by the larger, over the two networks, number of semantically significant objects and attributes associated with that vertex, and the procedure moves on to the next vertex.

The sum obtained over all vertices (of the network with the larger number of vertices) is then normalised by the number of elementary units of the fourth level retained after the processing of step 7 (see Fig.1).

The subject area (class) "Mathematics" is thus the subject area (class) to which the classified text belongs.

It should again be noted that although in the claimed method the set of syntactic rules and the corresponding linguistic dictionaries are pre-built by expert linguists (which is why the title of the claimed method contains the word "automated"), the semantic text classification disclosed above is performed without operator intervention.

Thus, the present invention implements semantic classification of natural language texts practically without operator intervention. The main difference of this method from the known methods is that the frequencies of occurrence of the elementary units of the fourth level, i.e. semantically significant objects and attributes, are calculated and then renormalised into semantic weights. Combining the triads of semantically significant objects and attributes by means of semantically significant relations into a semantic network provides fast classification of texts, especially texts in highly inflectional languages.

Table 1
Segmentation of text into tokens
No. tokenTokenStartEndToken type
1In12word
2all47word
3world912word
4exam1420word
5on2224word
6mathematics2636word
7-3838sign ven.
8it4042word
9written......word
10solution......word
11tasks......word, sentence boundary
12......sign ven.
13Written......word
14character......word
15test......word
16considered......word
17everywhere......word
18so......word
19same......word
20the obligation is passed ......word
21sign......word
22democratic......word
23society......word
24,......sign ven.
25as......word
26elections......word
27from......word
28short ......word
29candidates......word, sentence boundary
30.......sign ven.
31Really......introductory word
32,......sign ven.
33on......word
34oral......word
35exam......word
36student ......word
37fully......word
38helpless......word, sentence boundary
39.......sign ven.
40I......word
41happened......word
42to hear......word

43,......sign ven.
44 taking......word
45exams......word
46on......word
47the Department......word
48differential......word
49equations......word
50the mechanics and mathematics......word
51faculty......word
52MSU......reduction
53,......sign ven.
54examiners......word
55,......sign ven.
56that......word
57stoked......word
58for......word
59neighbouring......word
60 table......word
61students......word
62,......sign ven.
63giving......word
64excellent......word
65answers......word
66(perhaps surpassing the level of understanding of the teacher)......plug-in design, the sentence boundary
67....... sign ven.
68Known......word
69and......word
70such......word
71cases......word
72,......sign ven.
73when......word
74stoked......word
75deliberately......word
76(sometimes this can save time entering into the audience)......plug-in design, the sentence boundary
77.......sign ven.

Table 2
Lemma and morphs
No. tokenLemmaMorphs
1inExcuse
2allDate.p. Mn.h. Pronoun Mestom.-prilog
TV.p. M. R. Ed.h. Pronoun Mestom.-prilog
TV.p. Cf. Ed.h. Pronoun Mestom.-prilog

Th.p. M. R. Ed.h. Pronoun Mestom.-prilog
Th.p. Cf. Ed.h. The pronoun who esteem.-prilog
3worldTh.p. M. R. Ed.h. Noun Neous.
4examThem.p. M. R. Ed.h. Noun Neous.
Wines.p. M. R. Ed.h. Noun Neous.
5onExcuse
6mathematicianTh.p. M. R. Ed.h. Noun Dushell.
mathematicsDate.p. J. R. Ed.h. Noun Neous.
Th.p. J. R. Ed.h. Noun Neous.
8itParticle
thisThem.p. Cf. Ed.h. Pronoun Mestom.-prilog
Wines.p. Cf. Ed.h. Pronoun Neous. Mestom.-prilog
Wines.p. Cf. Ed.h. Pronoun Dushell. Mestom.-prilog
9writtenWines.p. Cf.Ed.h. The Adjective Dushell.
Them.p. Cf. Ed.h. The Adjective Dushell.
10solutionThem.p. Cf. Ed.h. Noun Neous.
Wines.p. Cf. Ed.h. Noun Neous.
11taskThe genus.p. J. R. Mn.h. Noun Neous.
13writtenWines.p. M. R. Ed.h. The Adjective Dushell.
Them.p. M. R. Ed.h. The Adjective Dushell.
14characterThem.p. M. R. Ed.h. Noun Neous.
Wines.p. M. R. Ed.h. Noun Neous.
15testThe genus.p. Cf. Mn.h. Noun Neous.
16consideredEd.h. The present 3rd person Imperfect Verb
17everywhereAdverb
18soAdverb
19sameParticle
20requiredDate.p. Mn.h. Adjective
TV.p. M. R. Ed.h. Adjective
TV.p. Cf. Ed.h. Adjective
21signTV.p. M. R. Ed.h. Noun Neous.
22democraticThe genus.p. M. R. Ed.h. Adjective
The genus.p. Cf. Ed.h. Adjective
Wines.p. M. R. Ed.h. The Adjective Dushell.
Wines.p. Cf. Ed.h. The Adjective Dushell.
24societyThem.p. Cf. Mn.h. Noun Neous.
The genus.p. Cf. Ed.h. Noun Neous.
Wines.p. Cf. Mn.h. Noun Neous.
25as Union
26choiceThem.p. M. R. Mn.h. Noun Neous.
Wines.p. M. R. Mn.h. Noun Neous
27fromExcuse
28moreThe genus.p. Mn.h. Numeral Quantitative
Th.p. Mn.h. Numeral Quantitative
Wines.p. Mn.h. Numeral Quantitative Dushell.
29candidateThe genus.p. M. R. Mn.h. Noun Dushell.
Wines.p. M. R. Mn.h. Noun Dushell.
35onExcuse
34oralTh.p. M. R. Ed.h. Adjective
Th.p. Cf. Ed.h. Adjective
35examTh.p. M. R. Ed.h. Noun Neous.

36studentThem.p. M. R. Ed.h. Noun Dushell.
37fullyAdverb
38helplessM. R. Ed.h. Short F. Adjective
40IDate.p. Ed.h. 1st person Personal Pronoun (animation)
O-p. Ed.h. 1st person Personal Pronoun (animation)
41happenCf. Ed.h. The past. The Imperfect verb
42to hearThe Imperfect verb
44to takeThis Depricate Imperfect
45examThem.p. M. R. Mn.h. Noun Neous.
Wines.p. M. R. Mn.h. Noun Neous.
46 onExcuse
47DepartmentDate.p. J. R. Ed.h. Noun Neous.
Th.p. J. R. Ed.h. Noun Neous.
48differentialThe genus.p. Mn.h. Adjective
Wines.p. Mn.h. The Adjective Dushell.
Th.p. Mn.h. Adjective
49equationThe genus.p. Cf. Mn.h. Noun Neous.
50the mechanics and mathematicsThe genus.p. M. R. Ed.h. Adjective
The genus.p. Cf. Ed.h. Adjective
51facultyThe genus.p. M. R. Ed.h. Noun Neous.
54examinerThe genus.p. M. R. Mn.h. Noun Dushell.
Wines.p. M. R. Mn.h. Noun Dushell.
56whichThem.p. Mn.h. Pronoun Mestom.-prilog
Wines.p. Mn.h. Pronoun Neous. Mestom.-prilog
57flushingMn.h. The past. The Imperfect verb
58forExcuse
59neighbouringDate.p. Mn.h. Adjective
TV.p. M. R. Ed.h. Adjective
TV.p. Cf. Ed.h. Adjective
60tableTV.p. M. R. Ed.h. Noun Neous.
61studentThe genus.p. M. R. Mn.h. Noun Dushell.
Wines.p. M. R. Mn.h. Noun Dushell.
63to giveThe genus.p. Mn.h. The past. Active Participle Imperfect
Wines.p. Mn.h. The past. Active Participle Dushell. Imperfect
Th.p. Mn.h. The past. Active Participle Imperfect
64excellentWines.p. Mn.h. Adjective, Neous.
Them.p. Mn.h. Adjective, Neous.
Them.p. M. R. Mn.h. Noun Neous.
65replyWines.p. M. R. Mn.h. Noun Neous.
68knownMn.h. Short F. Adjective
69andUnion
70suchThem.p. Mn.h. Pronoun Mestom.-prilog
Wines.p. Mn.h. Pronoun Neous. Mestom.-prilog
Them.p. M. R. Mn.h. Noun Neous.
71caseWines.p. M. R. Mn.h. Noun Neous.
73 whenUnion
74stokedMn.h. The past. The Imperfect verb
75deliberatelyAdverb

Table 3
Frequencies of occurrence of the words of a sequence at its first and subsequent appearances in the text, and the differences of these frequencies
The repetition of a sequence of words in the textWord sequenceThe frequency of occurrenceThe difference frequency
11asymptotically1
stable10
2asymptotically2
stable20
3asymptotically3
stable30
......
asymptotically70
7stable70
...............

Table 4
Phrases words in the text
The phrase
asymptotically stable
......

Table 5
The set of sentences of the text
No. th.Proposal textUnit 1 levelUnit 2 levelUnit 3 level
1Worldwide examination in mathematics is written decision tasks.In all, the world, the exam, mathematics, it, writing, solving, taskin the whole world, exam, (mathematician, mathematics), (this, this) written decision task
2The written nature of the tests considered everywhere as a mandatory feature of a democratic society, the election of several candidates.Written, character tests, is everywhere, as mandatory, indication, democratic,society, as elections, several candidateswriting, nature, trial, be deemed to be everywhere, as mandatory, indication, democratic society, as the choice of several, Ph.D.
3Indeed, at the oral exam, the student is completely defenseless.On oral examination, the student, completely defenselesson oral examination, the student, completely defenseless
4I just happened to hear, taking exams at the Department of differential equations, faculty of mechanics and mathematicsto me, happened to hear, taking, exams, chair, differential equations, mechanics and mathematics,I happen to hear, to take the exam, the Department, differential equation

faculty, examiners, who were drowned at the next table students gave excellent answers (perhaps surpassing the level of understanding of the teacher).faculty, the examiners, who were drowned, adjacent, Desk, students, giving perfect answersthe mechanics and mathematics faculty, the examiner who, flushing, adjacent, Desk, student, give the perfect answer
5Known are cases when stoked on purpose (sometimes from this, you can save time by logging into the audience).Known, and such cases, when heated, deliberatelyknown, and such is the case when, flushing, deliberately

Table 6
Fragment types
Finite verb | Short participle | Adjective | Predicative word | Participle | Adverbial participle | Infinitive | Introductory word | Other
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Table 7
Results of the initial fragmentation of sentences
No. fragm.Fragments of sentencesType fragment
1.1worldwide examination in mathematics is written decision tasksDASH
2.1the written nature of the tests considered everywhere as a mandatory feature of a democratic society1
2.2as the election of several candidates9
3.1at the oral exam, the student is completely defenseless3
4.1I just happened to hear1
4.2taking exams at the Department of differential equations of the mechanics and mathematics faculty6
4.3examiners9
4.4who drowned at the next table students1
4.5gave excellent answers5
5.1known3
5.2and such cases9
5.3when stoked on purpose1

Table 8
Elements of the attribute level of description
Component of the sentence | Morphological features
Object/Subject | Noun, pronoun-noun
Action | Verb
Feature of the object | Full adjective, ordinal numeral, pronoun-adjective, agreeing with the object/subject in gender, number and case
Feature of the action | Adverb
Measure of a feature | Adverb, adverbial numeral

Table 9
Syntax groups corresponding to the attribute level of description
No. th.Elements of syntax groupThe number of tokensSyntax group
1the feature of the object + object2+3the world
1the feature of the object + object9+10a written decision
2the feature of the object + object13+14written character
2action + sign action 16+17is everywhere
2measure the sign + sign + object18+19+20+21as a mandatory attribute
2the feature of the object + object22+23a democratic society
2the feature of the object + object29+30several candidates
3the feature of the object + object35+36oral exam
3measure the sign + sign object38+39completely defenceless
4the feature of the object + object49+50differential equations
4the feature of the object + object51+52the mechanics and mathematics faculty
4 the feature of the object + object60+61the next table
4the feature of the object + object65+66excellent answers
5the feature of the object + object71+72such cases
5action + sign action75+76stoked on purpose

Table 10
Simple sentences obtained by merging fragments
no simple offers.Enlarged fragmentsComponents of simple sentences
1.1worldwide examination in mathematics is written decision tasksworldwide examination in mathematics is written decision tasks
2.1the written nature of the tests is posy who we are as a mandatory feature of a democratic society the written nature of the tests considered everywhere as a mandatory feature of a democratic society as the election of several candidates
as the election of several candidates
3.1at the oral exam, the student is completely defenselessat the oral exam, the student is completely defenseless
4.1I just happened to hear, taking exams at the Department of differential equations, faculty of mechanics and mathematics faculty examinersI just happened to hear, taking exams at the Department of differential equations, faculty of mechanics and mathematics faculty examiners
4.2examiners stoked at the next table students gave excellent answersexaminers stoked at the next table students gave excellent answers
5.1known and such casesknown and such cases
5.2when stoked on purposewhen the top is whether deliberately
Note to table: the first digit in the number of simple sentences corresponds to the number of proposals to which it refers.

Table 11
Minimal structural schemes of sentences (fragment)
MCC | Example sentences
N1 V(f) | The rooks have arrived. Things are done by people.
N1 Cop(f) Adj1 | The night was quiet.
N1 Cop(f) Adj5 | The night was quiet (instrumental case).
N1 Cop(f) Adj(f) | The night was quieter than the day.
N1 Cop(f) N1 | He (was) a student.
N1 Cop(f) N5 | He was a student.
Cop(f) N1 | It's going to rain. It was winter. A whisper. Timid breathing. Silence. ...

Note to table 11:

V(f) - conjugated (finite) form of the verb (not the infinitive);

Cop(f) - conjugated form of the copula verbs "to be", "to become", "to appear";

Inf - infinitive of a verb or copula;

N1, N5 - nominative and instrumental cases of a noun (substantive);

Adj1, Adj5 - nominative and instrumental cases of adjectives and passive participles;

Adj(f) - short and comparative forms of adjectives and passive participles.

A sentence of the template Cop(f) N1 can be nominative, i.e. the copula verb is not present explicitly. In this case the predicate is assumed to be zero and is denoted as NULL.
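The template matching just described can be sketched in a few lines. This is an illustrative fragment, not part of the patent; the tag names follow the notation of Table 11, and the zero (NULL) copula is handled as in the note above.

```python
# Illustrative sketch (not from the patent): match a sentence's sequence
# of morphological tags against the minimal structural schemes (MCC) of
# Table 11. Tag names follow the table's notation.
MCC_TEMPLATES = {
    ("N1", "V(f)"): "N1 V(f)",
    ("N1", "Cop(f)", "Adj1"): "N1 Cop(f) Adj1",
    ("N1", "Cop(f)", "N1"): "N1 Cop(f) N1",
    ("N1", "Cop(f)", "N5"): "N1 Cop(f) N5",
    ("Cop(f)", "N1"): "Cop(f) N1",
}

def match_mcc(tags):
    """Return (template, zero_predicate) for a tag sequence."""
    key = tuple(tags)
    if key in MCC_TEMPLATES:
        return MCC_TEMPLATES[key], None
    # Nominative sentence with no explicit verb or copula: assume a
    # zero copula, so the predicate is NULL, as in the note above.
    if "V(f)" not in key and "Cop(f)" not in key:
        with_cop = ("Cop(f)",) + key
        if with_cop in MCC_TEMPLATES:
            return MCC_TEMPLATES[with_cop], "NULL"
    return None, None
```

For a bare nominative such as "Silence.", `match_mcc(["N1"])` yields the template `Cop(f) N1` with a NULL predicate.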

Table 12
Predicative minimums of the simple sentences making up the sentences of the source text
Simple sentence no. | Components of the simple sentence | MCC template | Predicative minimum (subject; predicate)
1.1 | worldwide, an examination in mathematics is the written solution of tasks | N1 Cop(f) N1 - noun in the nominative case + copula verb + noun in the nominative case | (the exam; is a solution)
2.1 | the written nature of the tests is considered everywhere a mandatory feature of a democratic society as the election from several candidates | N1 Cop(f) N1/5 N2 - noun in the nominative case + copula verb + noun in the instrumental case | (the nature, the elections; is considered a feature)
3.1 | at the oral exam, the student is completely defenceless | N1 Cop(f) Adj1/5 - noun in the nominative case + copula verb + adjective in the nominative case | (the student; is defenceless)
4.1 | I just happened to hear examiners taking exams at the department of differential equations of the mechanics and mathematics faculty | N3 V(f) Inf - noun in the dative case + conjugated verb form + infinitive | (me; happened to hear)
4.2 | examiners failed students at the next table who gave excellent answers | N1 V(f) - noun in the nominative case + conjugated verb form | (the examiners; failed)
5.1 | such cases are also known | N1 Cop(f) Adj1/5 - noun in the nominative case + copula verb + adjective in the nominative case | (the cases; are known)
5.2 | when they failed on purpose | conjugated verb form, plural | (NULL; failed)

Table 13
Filling of the valence slots of the predicates of the sample text
Simple sentence no. | Predicate | 1. Subject | 2. Object | 3. Addressee | 4. Instrument | 5-7. Locative
1.1 | is a solution | exam | tasks | - | - | -
2.1 | is considered a feature | nature, elections | - | - | - | -
3.1 | is defenceless | student | - | - | - | -
4.1 | happened to hear | I | examiners | - | - | -
4.2 | failed | examiners | students | - | - | -
5.1 | are known | cases | - | - | - | -
5.2 | failed | - | - | - | - | -
Note to the table: 5 - initial locative, 6 - final locative, 7 - intermediate locative.

Table 14
Syntactic groups derived from the source text using syntactic rules
Simple sentence no. | Components of the simple sentence | Syntactic groups in which an argument or the predicate is the main word | Name of the group and rule
1.1 | worldwide, an examination in mathematics is the written solution of tasks | examination in mathematics | genitive attribute in postposition
 | | a written solution | feature of the object + object
 | | solution of tasks | genitive attribute in postposition
2.1 | the written nature of the tests is considered everywhere a mandatory feature of a democratic society as the election from several candidates | written nature | feature of the object + object
 | | nature of the tests | genitive attribute in postposition
 | | a mandatory feature | feature of the object + object
 | | feature of the society | genitive attribute in postposition
 | | election of candidates | genitive attribute in postposition
3.1 | at the oral exam, the student is completely defenceless | - | -
4.1 | I just happened to hear examiners taking exams at the department of differential equations of the mechanics and mathematics faculty | - | -
4.2 | examiners failed students at the next table who gave excellent answers | excellent answers | feature of the object + object
5.1 | such cases are also known | such cases | feature of the object + object
5.2 | when they failed on purpose | - | -

Table 15
Set of semantically meaningful objects and attributes (fragment)
Simple sentence | Semantically meaningful objects | Attributes
worldwide, an examination in mathematics is the written solution of tasks | exam, mathematics, solution, tasks | written
the written nature of the tests is considered everywhere a mandatory feature of a democratic society as the election from several candidates | nature, tests, feature, society, elections, candidates | written, mandatory

Table 16
Relations between semantically meaningful objects, and between semantically meaningful objects and attributes
 | Semantically meaningful object 1 | Semantically meaningful object 2 | Semantically meaningful relation | Type of the relation
1 | exam | solution | is | being
2 | nature | feature | is considered | being
3 | examiners | students | failed | influence
... | ... | ... | ... | ...

Table 17
The frequency of occurrence of semantically meaningful objects and attributes
 | Semantically meaningful object or attribute | Frequency of occurrence
1 | teacher | 14
2 | student | 27
3 | function | 16
4 | equation | 44
...

Table 18
The frequency of occurrence of semantically meaningful relations between semantically meaningful objects, and between semantically meaningful objects and attributes
 | Semantically meaningful objects 1-2 | Semantically meaningful relation | Frequency of occurrence of the relation
1 | teacher - student | to change status | 8
2 | function - equation | - | 4
...

Table 19
Set of triads (fragment)
 | Triad
1 | exam - solution
2 | teacher - student
3 | student - exam
...

Table 20
Semantic network formed from the triads (fragment)
 | Main word | Relation | Subordinate word
1 | teacher | to change status | student
2 | student | is evaluated | exam
3 | exam | is | solution
...

Table 21
Semantic weights of semantically meaningful objects and attributes
 | Semantically meaningful object or attribute | Semantic weight
1 | teacher | 99
2 | student | 99
3 | function | 100
4 | equation | 99
...

Table 22
Degrees of intersection of the semantic network of the source text with the semantic networks of sample texts
Class | Mathematics | Education | History
Degree of intersection | 1.59160 | 0.46480 | 0.18382
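The degrees of intersection in the table above are what the final classification step compares against a threshold. A minimal sketch of that decision, with hypothetical threshold values:

```python
# Minimal sketch of the classification decision: subject areas whose
# degree of intersection exceeds a preset threshold are chosen as
# classes and ranked by proximity. The threshold values are hypothetical;
# the degrees mirror Table 22.
degrees = {"Mathematics": 1.59160, "Education": 0.46480, "History": 0.18382}

def classify(degrees, threshold):
    chosen = {area: d for area, d in degrees.items() if d > threshold}
    return sorted(chosen, key=chosen.get, reverse=True)

print(classify(degrees, threshold=1.0))  # ['Mathematics']
print(classify(degrees, threshold=0.4))  # ['Mathematics', 'Education']
```

With a lower threshold, several subject areas pass, and the ranking by degree of intersection corresponds to claim 2.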

Table 23
Degree of intersection, at one vertex, of the semantic network of the classified text and the semantic network of the sample texts
 | Fragment of the first network: linked vertex, semantic weight | Fragment of the second network: linked vertex, semantic weight | Degree of intersection of objects or attributes
Vertex "function" | | | 177.49/120 = 1.4791
1 | equation, 100.58 | equation, 99.84 | 77.44
2 | derivative, 99.48 | derivative, 99.62 | 67.09
8 | score, 99.48 | - | 0
3 | solution of the equation, 99.32 | solution of the equation, 87.25 | 32.96
 | - | order, 99.62 | 0
 | - | argument, 97.57 | 0
Degree of intersection at the vertex "function" | | | 177.49
...

Table 24
Degree of intersection of the semantic networks of the classified text and the sample texts of one of the subject areas
First network: no., vertex | Second network: no., vertex | Total weight
1, equation | 1, equation | 14.25
2, function | 2, function | 15.15
3, score | 3, argument | 0
4, plane | 4, plane | 13.10
6, derivative | 6, derivative | 16.23
8, solution | 8, solution | 15.20
9, point | 9, point | 14.01
... | ... | ...
76, vector field | 76, process | 0
- | 625, differential equation | 0
Sum | | 994.75
Normalized sum | | 994.75/625 = 1.5916
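The bottom rows of the table reduce to simple arithmetic, which can be sketched as follows (the function is illustrative; only the numbers come from the table):

```python
# Sketch of the final arithmetic of Table 24: the total weights of the
# coinciding vertex pairs are summed and the sum is normalized by the
# number of fourth-level units in the larger network.
def normalized_intersection(pair_weights, n_units):
    return sum(pair_weights) / n_units

# With the full set of pair weights the sum is 994.75, and
# 994.75 / 625 = 1.5916, matching the Mathematics entry of Table 22.
print(round(994.75 / 625, 4))  # 1.5916
```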

1. A method for the automated semantic classification of natural language texts, in which:
- each text to be classified is presented in digital form for subsequent automatic and/or automated processing;
- each text to be classified is indexed in digital form, obtaining:
- elementary units of the first level, comprising at least words,
- elementary units of the second level, each of which is a normalized word form,
- elementary units of the third level, each of which is a stable word combination occurring in said text,
- elementary units of the fourth level, each of which is a semantically meaningful object or attribute, and
- elementary units of the fifth level, each of which is a triad: either two semantically meaningful objects and the semantically meaningful relation linking them, or a semantically meaningful object and an attribute and the semantically meaningful relation linking them;
- the frequencies of occurrence of the elementary units of the fourth level and of said semantically meaningful relations are detected;
- the formed elementary units of the second, third, fourth and fifth levels, the detected frequencies of occurrence of the elementary units of the fourth level and of the semantically meaningful relations, and the obtained indices with links to specific sentences of the text are stored in a database;
- a semantic network is formed from said triads such that the first elementary unit of the fourth level of a subsequent triad is linked with the identical second elementary unit of the fourth level of the preceding triad;
- in an iterative procedure, said frequencies of occurrence are renormalized into semantic weights of the elementary units of the fourth level, which are the vertices of the semantic network, such that elementary units of the fourth level linked in the network with a large number of other elementary units of the fourth level of high frequency of occurrence increase their semantic weight, while the other elementary units of the fourth level correspondingly lose it;
- the elementary units of the fourth level are ranked by semantic weight by comparing the semantic weight of each of them with a preset threshold value, and elementary units of the fourth level whose semantic weight is below the threshold value are removed;
- the remaining elementary units of the fourth level, whose semantic weight is above said threshold, as well as the semantically meaningful relations between them, are kept in memory;
- the degrees of intersection of said semantic network of the classified text with the semantic networks of sample texts are detected, each sample text being composed of previously classified texts and describing one of the subject areas of said semantic classification; these degrees of intersection are detected over the coinciding vertices of the considered semantic networks and the relations between these vertices, taking into account the semantic weights of the vertices and the weight characteristics of their relations; and the detected degree of intersection of the semantic networks of the classified text and a specific sample text is taken as a value characterizing the semantic similarity of the classified text and that sample text;
- at least one of said subject areas, the degree of intersection of whose semantic network with the semantic network of the classified text is greater than a predetermined threshold, is chosen as the class for the classified text.
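The renormalization step of the claim can be read as a PageRank-like iteration. The sketch below is one possible reading, not the patent's exact formula: the damping factor, iteration count, and equal splitting of a vertex's weight among its neighbours are all assumptions.

```python
# One possible reading (an assumption, not the patent's exact formula)
# of the renormalization step: an iterative, PageRank-like redistribution
# in which vertices linked to many high-weight neighbours gain semantic
# weight and the rest correspondingly lose it.
def renormalize(neighbours, freq, iters=30, d=0.85):
    """neighbours: {vertex: set of linked vertices}; freq: occurrence counts."""
    total = float(sum(freq.values()))
    w = {v: freq[v] / total for v in freq}  # start from relative frequencies
    n = len(w)
    for _ in range(iters):
        new = {}
        for v in w:
            # weight flowing into v from every vertex u linked to it
            share = sum(w[u] / len(neighbours[u])
                        for u in neighbours if v in neighbours[u])
            new[v] = (1 - d) / n + d * share
        w = new
    return w
```

On a toy network where "function" is linked to both "equation" and "solution", the iteration concentrates weight on "function", the most connected vertex, which is the behaviour the claim describes.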

2. The method according to claim 1, in which, when said predetermined threshold is exceeded by the degrees of intersection for several subject areas, said subject areas are ranked according to the degree of their proximity to the classified text.

3. The method according to claim 1 or 2, in which a predetermined number of said subject areas, to which the classified text is assigned, is chosen.

4. The method according to claim 1, in which the indexing is carried out in the following stages:
- the text in digital form is segmented into elementary units of the first level, comprising at least words;
- the text in digital form is segmented into sentences according to graphematic rules;
- for each elementary unit of the first level that is a word, an elementary unit of the second level comprising its normalized word form is formed on the basis of morphological analysis;
- the frequency of occurrence of each elementary unit of the first level and of every two or more adjacent units of the first level in the text is counted, and a sequence of words following one another in the text is merged from said elementary units of the first level into an elementary unit of the third level representing a stable word combination if, for every two or more consecutive words in the text, the difference between the frequencies of occurrence counted for the first appearance of the given word sequence and for several subsequent appearances remains unchanged for each pair of adjacent words;
- in a multistep process of semantic-syntactic analysis, by accessing a pre-generated database of linguistic and heuristic rules of a predefined language environment, semantically meaningful objects and attributes (units of the fourth level) are identified in each formed sentence;
- for each elementary unit of the fourth level, the identity of reference between the corresponding semantically meaningful object or attribute and the corresponding anaphoric link, if present in the classified text, is recorded, each anaphoric link being replaced by one of its antecedents;
- every semantically meaningful object and attribute is kept in memory;
- in a multistep process of semantic-syntactic analysis, by accessing a pre-generated database of linguistic and heuristic rules of a predefined language environment, semantically meaningful relations are identified in each formed sentence between the identified units of the fourth level: between semantically meaningful objects, and between semantically meaningful objects and their attributes;
- each semantically meaningful relation is assigned a corresponding type stored in a database domain ontology of the subject area to which the classified text belongs;
- the frequencies of occurrence of the elementary units of the fourth level and of said semantically meaningful relations are detected throughout the text;
- each identified semantically meaningful relation, with its assigned type, is kept in memory;
- within the text, for each identified semantically meaningful relation, linking either semantically meaningful objects or a semantically meaningful object and its attribute, a set of triads is formed, which are the elementary units of the fifth level;
- on the set of formed triads, all semantically meaningful objects linked by semantically meaningful relations, with their frequencies of occurrence, all attributes with their frequencies of occurrence, and all formed triads are indexed separately.
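The third-level step above (merging words into stable combinations by comparing frequencies) can be illustrated with a simplified sketch. This is my reading, not the claim's exact rule: a word pair is treated as stable when it occurs at least twice and as often as its rarer member, i.e. the words almost always appear together.

```python
from collections import Counter

# Simplified sketch (an interpretation, not the patent's exact rule) of
# detecting stable word combinations: keep a bigram whose count equals
# the count of its rarer member, so the words co-occur consistently.
def stable_bigrams(tokens, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    stable = []
    for (a, b), n in bigrams.items():
        if n >= min_count and n == min(unigrams[a], unigrams[b]):
            stable.append((a, b))
    return stable

text = ("differential equations are solved here ; "
        "differential equations appear everywhere").split()
print(stable_bigrams(text))  # [('differential', 'equations')]
```

In this toy text "differential equations" is the only pair that recurs with unchanged frequency relative to its parts, so it alone is promoted to a third-level unit.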

5. The method according to claim 1, in which said degree of intersection of two semantic networks is calculated as the sum over the coinciding elementary units of the fifth level of these two semantic networks.

6. The method according to claim 5, in which:
- of said two semantic networks, the one in which, after the ranking and removal of vertices with semantic weights below said threshold value, more vertices remain than in the other is chosen as the base network, and the other as the comparison network;
- for each vertex of the base network, a vertex is found in said comparison network that is the same elementary unit of the fourth level, i.e. the same semantically meaningful object or the same attribute;
- for each of the found vertices, in each of said base and comparison networks, values are calculated for all triads associated with that vertex as the areas of triangles whose sides correspond to the elements of each of these triads and whose angle between the sides is proportional to the weight of the semantically meaningful relation;
- for each pair of said triads associated with a specific pair of vertices in the base and comparison networks, the smaller of the calculated values is chosen as the degree of intersection of said triads in the base and comparison networks;
- for each of the vertices associated with the found vertex, all the chosen calculated values are summed, giving the degree of intersection for the given pair of vertices of the base and comparison networks;
- the found sum is normalized by the number of said semantically meaningful objects and attributes associated with the given vertex in whichever of the base and comparison networks contains more vertices associated with this vertex;
- the normalized sums are summed over all vertices of whichever of the base and comparison networks contains more vertices;
- the resulting sum is normalized by the number of elementary units of the fourth level remaining in that network, giving said degree of intersection of the two semantic networks.
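The triangle construction of claim 6 can be sketched as follows. The proportionality constant between the relation weight and the angle is an assumption (the claim only says "proportional"), as is the dictionary layout of the triads.

```python
import math

# Sketch of the claim 6 computation under stated assumptions: each triad
# incident to a shared vertex is scored as the area of a triangle whose
# sides are the semantic weights of its two vertices and whose angle is
# proportional to the relation weight (the constant k is an assumption).
def triad_area(w_vertex, w_linked, w_relation, k=math.pi / 200):
    # area = 1/2 * a * b * sin(angle)
    return 0.5 * w_vertex * w_linked * math.sin(k * w_relation)

def vertex_intersection(triads1, triads2):
    """triadsN: {linked_vertex: (w_vertex, w_linked, w_relation)}.

    The smaller of the two networks' areas is taken for each coinciding
    triad, summed, then normalized by the larger neighbourhood size."""
    score = 0.0
    for v in triads1.keys() & triads2.keys():
        score += min(triad_area(*triads1[v]), triad_area(*triads2[v]))
    return score / max(len(triads1), len(triads2))
```

Summing this per-vertex value over all shared vertices and normalizing by the number of remaining fourth-level units would give the overall degree of intersection, as in the last two steps of the claim.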



 

Same patents:

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to information technology. The disclosed method includes presenting two texts to be compared in digital form for subsequent processing; indexing the texts to obtain elementary units of the first to fifth levels; detecting the frequency of occurrence of elementary units of the fourth level, each being a semantically significant object or attribute, and the frequency of occurrence of semantically significant relationships linking semantically significant objects, as well as the semantically significant objects and attributes; storing the formed elementary units of the second to fifth levels and the obtained indices together with links to specific sentences of said text; forming a semantic network from triads, which are elementary units of the fifth level; ranking the elementary units of the fourth level according to semantic weight by comparing the semantic weight of each of them with a predetermined threshold and removing elementary units of the fourth level having a semantic weight below the threshold; detecting for two compared texts the degree of crossing of their semantic networks.

EFFECT: faster process of comparing texts.

4 cl, 2 dwg, 26 tbl

FIELD: information technology.

SUBSTANCE: the method of generating syntactically and semantically correct commands includes converting a textual Backus-Naur form (BNF) containing a command meta-description into a relational BNF containing a recognisable DBMS command meta-description. A text semantic rule containing a command usage restriction is converted to a relational semantic rule containing a recognisable DBMS command usage restriction. A command is identified and a basic rule is assigned to the identified command, wherein the basic semantic rule consists of a plurality of relational semantic rules. A resultant dynamic structure is formed for the identified command. Elements of the basic semantic rule are identified for the identified command and all elements of all relational semantic rules are applied to the identified command. A syntactically and semantically correct command is then generated.

EFFECT: automation and high accuracy of generating DBMS commands and a smaller amount of computation required to generate DBMS commands.

38 cl, 18 dwg

FIELD: information technology.

SUBSTANCE: method for automatic semantic indexing of natural language text comprises segmenting the text into elementary first level units (words) and sentences; forming second level units (standardised word forms); calculating the frequency of occurrence of each first level unit for adjacent first level units and merging the sequence of words into third level units (stable word combinations); identifying in each sentence a semantically significant entity and an attribute thereof (fourth level units); identifying in each sentence semantically significant relationships between semantically significant entities and between semantically significant entities and attributes; determining the frequency of occurrence of second level and third level units; forming, for each semantically significant relationship, a plurality of triads (fifth level units); on the plurality of the formed triads, separately indexing all semantically significant entities linked by semantically significant relationships with their frequency of occurrence, all attributes with their frequency of occurrence and all formed triads.

EFFECT: high accuracy of indexing natural language texts.

6 cl, 2 dwg, 23 tbl

FIELD: information technology.

SUBSTANCE: programming language parsing method is based on table LR parsing. Canonical LR tables of a parser are dynamically rearranged during compilation using grammar extension directives given separately for each hierarchy level of nesting grammatical rules of the programming language, said directives being intended for inputting new grammatical structures. The compiler continues parsing of the program using the rearranged LR tables.

EFFECT: enabling dynamic modification of compilation tables which form the basis for a parser by extending the grammar of the programming language.

5 cl

FIELD: information technology.

SUBSTANCE: method includes a step for syntax analysis of text. A step for extracting text components and relationships thereof in the text is then executed. A graph or graphic representation of the text is generated or used as representation of the meaning of the text independent of the language. That graph or graphic representation is used to perform modelling, knowledge presentation and processing in a language processing system. A judgment of the representation in the model of the semantic realm is made during the processing step, thereby checking consistency of the extracted text semantics.

EFFECT: improvement and further advancement of the method of processing natural language which enables to properly process text semantics or other data.

29 cl, 15 dwg

FIELD: information technology.

SUBSTANCE: method of classifying documents by categories includes constructing ontology in form of a set of categories. For each category, terms, i.e. sequences of words typical for texts in said category, are identified and the weight of each of the identified terms is determined when reading electronic versions of the documents from a training collection of documents. A profile is formed for each of the categories in form of a list of all terms in all ontology categories with indication of the weight of each term in said category. A list of possible combinations word forms of said term is compiled for each term. Identified terms are selected in each document to be classified when reading an electronic version thereof, considering only word forms from the compiled list. For each document to be classified, a profile is formed for each category based on the selected terms. Relevance of said document to each category is determined by comparing profiles of said document with profiles of categories in the ontology. A classification spectrum of the document is constructed in form of a set of categories with relevance found for each of them.

EFFECT: high rate of classification and reduced size of consumed memory.

7 cl

FIELD: information technologies.

SUBSTANCE: method is realised for building of semantic relations between elements extracted from document content, in order to generate semantic representation of content. Semantic representations may contain elements identified or analysed in the text part of the content, elements of which may be associated with other elements, which jointly use semantic relations, such as relations of an agent, a location or a topic. Relations may also be built by means of association of one element, which is connected to another element or is near, thus allowing for quick and efficient comparison of associations found in the semantic representation, with associations received from requests. Semantic relations may be defined on the basis of semantic information, such as potential values and grammatical functions of each element within the text part of the content.

EFFECT: provision of quick detection of most relevant results.

21 cl, 11 dwg

FIELD: information technology.

SUBSTANCE: method of constructing a semantic model of a document consists of two basic steps. At the first step, ontology is extracted from external information resources that contain descriptions of separate objects of the object region. At the second step, text information of the document is tied to ontology concepts and a semantic model of the document is constructed. The information sources used are electronic resources, both tied and untied to the structure of hypertext links. First, all terms of the document are separated and tied to ontology concepts such that each term corresponds to a single concept which is its value, and values of terms are then ranked according to significance for the document.

EFFECT: enabling enrichment of document with metadata, which enable to improve and increase the rate of comprehension of basic information, and which enable to determine and highlight key terms in the text, which speeds up reading and improves understanding.

15 cl, 6 dwg

FIELD: information technology.

SUBSTANCE: mechanism converts messages in different formats to a common format, and the common format message is processed by a business logic application. The syntax analyser analyses the message and determines the suitable scheme for the specific format of the received message. The scheme is a data structure in a scheme register which includes a grammatical structure for the received format, as well as handler pointers for converting different message fields to an internal message format using a grammatical structure ("grammar" may include a field priority, field type, length, symbol coding, optional and mandatory fields etc). The handlers are compiled separately. As far as formats change, new formats or changes in old formats may be dynamically added to the syntax analysis/assembly mechanism by loading a new scheme and handlers.

EFFECT: broader functional capabilities, particularly the possibility of receiving and handling electronic messages in different formats, received using an application which is isolated from all external factors which are used through other external formats.

11 cl, 21 dwg

FIELD: information technology.

SUBSTANCE: text is segmented in electronic form into elementary units. Fixed collocations are identified and sentences are formed. Semantically significant objects and semantically significant relationships between them are identified. Several triads are formed for each semantically significant relationship, in which a single first-type triad corresponds to the link set by the semantically significant relationship between two semantically significant objects. Each second-type triad corresponds to the value of a specific attribute of one of these semantically significant objects. Each third-type triad corresponds to the value of a specific attribute of the semantically significant relationship itself. All semantically significant objects which are linked by semantically significant relationships are indexed separately on the set of formed triads. The formed triads and the obtained indices, together with the link to the initial text from which said triads were formed, are stored in a database.

EFFECT: more accurate and faster searching for relevant facts and documents.

12 cl, 9 dwg, 16 tbl, 1 ex

FIELD: computer science.

SUBSTANCE: the method includes receiving text messages from a data channel; linguistic word processing is performed, a thesaurus of each text message is formed, statistical processing of the words in the thesaurus is performed, and the text message and thesaurus are stored in storage. The membership of the text message in one of the categories from the list is determined; the starting data value of the text message is determined and stored in storage with the text message; the data values are periodically updated taking into account the time passed since their appearance, and text messages with a data value below a preset threshold are erased; during the processing of each message, the values of the classification features of the categories are updated.

EFFECT: higher efficiency.

1 dwg

FIELD: technology for automated synthesis of text documents.

SUBSTANCE: method includes, in data variable, selecting variable unified information (common word combinations), variable inputted data (details), and variable non-unified information (free word combinations), while variable unified information is separated as a plurality of support words, constituting lexicological document skeleton, and is recorded in machine-readable database, lexicological document tree is formed and data document control contour is formed, and during generation of document, all branches of formed lexicological document tree are passed to select necessary support words for inserting matching word combinations into generated document.

EFFECT: lower probability of errors, lower laboriousness.

3 cl, 7 dwg

FIELD: computer science, in particular, system for identification of preparedness of text documents in network for distributed processing of data.

SUBSTANCE: system contains block for receiving sections of text documents, block for selection of base addresses of text documents, block for selecting structure of text documents, block for forming signals for recording and reading database, block for gating sections of text documents, block for addressing of text documents, block for receiving sections of text documents from database of server, block for commutator of channels for dispensing sections of text documents, block for counting number of finished sections of text documents, comparator, counter.

EFFECT: increased speed of operation of system.

8 dwg

FIELD: technology for recognizing text information from graphic file.

SUBSTANCE: according to the method, the order of access to additional information is set in advance, and a quality estimate is assigned to each type of additional information. Different variants of dividing the image of the selected rows into fragments are constructed. For each row fragment a linear division graph is built, the images of graphic elements are recognised using a classifier, and an estimate is assigned to each recognition variant. A transition is made from variants of recognition of graphic elements to variants of alphabet symbols. For each chain connecting the starting and ending vertices, chains are built corresponding to all variants of recognition of graphic elements and all variants of transition from recognised graphic elements to alphabet symbols. The produced variants are ranked in order of decreasing recognition-quality estimate and are processed using information about the position of uppercase and lowercase letters. If more than one symbol variant is available from the results of recognising a graphic element, the variants are processed with successive and, when necessary, simultaneous use of all types of additional information; a quality estimate is assigned to each produced variant, symbol variants with estimates below a predetermined value are discarded, the produced variants are sorted using pair-wise comparison, and an additional correction of spaces erroneously recognised at previous stages is performed.

EFFECT: increased precision of recognition of text and increased interference resistance of text recognition.

9 cl, 2 dwg

FIELD: devices for recognition of written symbols.

SUBSTANCE: the method comprises a stage of receiving handwritten symbols written on a touch screen, where the touch screen contains at least a symbol-writing area and a punctuation-writing area. A stage is then performed of determining, for symbols written in the punctuation-writing area, their ratio relative to the symbol-writing area, followed by a stage of punctuation mark recognition. The recognition stage is carried out for written symbols when the ratio exceeds a threshold value, and recognition of punctuation symbols determines, from a set of punctuation marks, at least one candidate punctuation mark similar to the written symbols.

EFFECT: automatic recognition of punctuation marks with increased accuracy.

8 cl, 8 dwg
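The routing decision described above (run punctuation recognition only when the ratio between the two writing areas exceeds a threshold) reduces to a simple comparison. A minimal sketch, assuming the ratio is measured as the share of stroke pixels falling in the punctuation area; the function name, the pixel-count inputs and the 0.6 threshold are invented for illustration:

```python
def classify_stroke(punct_area_pixels, symbol_area_pixels, threshold=0.6):
    """Decide whether a stroke on the touch screen should be sent to
    the punctuation recogniser.  The ratio compares how much of the
    stroke falls in the punctuation-writing area versus the
    symbol-writing area; the threshold value is illustrative, not
    taken from the patent.
    """
    total = punct_area_pixels + symbol_area_pixels
    if total == 0:
        return "empty"
    ratio = punct_area_pixels / total
    return "punctuation" if ratio > threshold else "symbol"

print(classify_stroke(punct_area_pixels=90, symbol_area_pixels=10))  # punctuation
print(classify_stroke(punct_area_pixels=20, symbol_area_pixels=80))  # symbol
```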

FIELD: information technologies.

SUBSTANCE: the invention relates to methods of checking the validity of extensible markup language (XML) documents and reporting schema violations in real time. A parallel tree is maintained with portions corresponding to the elements of another XML document. When violations occur in the XML document, the elements of the other XML document that correspond to the violations are identified. The portions corresponding to the identified elements are validated against the XML schema that, in turn, corresponds to the positioning of the other XML document. The elements and portions that correspond to the errors in the positioning of the other XML document are reported to the user by means of visual indicators in the XML document and the parallel tree.

EFFECT: checking the validity of an XML document and reporting schema violations in real time while the user edits the document.

20 cl, 8 dwg
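The reporting idea above (check edited elements against their schema rules and point the user at the offending elements) can be sketched in miniature. This is not the patented mechanism; the rule table, element tuples and message format are all invented, and real XML Schema validation is replaced by simple per-tag predicates:

```python
def check_realtime(elements, rules):
    """Check each edited element against the rule registered for its
    tag and return (element_id, message) pairs for the violations,
    which a UI could then flag with visual indicators.  Rules and
    tags are illustrative stand-ins for schema constraints.
    """
    violations = []
    for elem_id, tag, text in elements:
        rule = rules.get(tag)
        if rule and not rule(text):
            violations.append((elem_id, f"<{tag}> violates its schema"))
    return violations

rules = {"year": str.isdigit, "name": bool}
edited = [(1, "year", "2007"), (2, "year", "two thousand"), (3, "name", "")]
print(check_realtime(edited, rules))
# [(2, '<year> violates its schema'), (3, '<name> violates its schema')]
```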

FIELD: physics, computer equipment.

SUBSTANCE: the present invention relates to a tree-ordering component in a sentence realisation system. The component accepts an unordered syntax tree and generates from it a ranked list of alternatively ordered syntax trees. The component also includes statistical models of constituent structure, which the tree-ordering component uses to score the alternatively ordered trees.

EFFECT: ensures proper word order in a tree structure.

24 cl, 11 dwg
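The core idea above, generating alternative orderings of unordered constituents and ranking them with a statistical model, can be sketched with a toy bigram model. The words, log-probabilities and function names are invented for illustration and stand in for the patent's constituent-structure models:

```python
from itertools import permutations

# Toy bigram log-probabilities standing in for the statistical models
# of constituent structure; the numbers are made up for illustration.
BIGRAM_LOGP = {
    ("the", "dog"): -0.5, ("dog", "barks"): -0.7,
    ("dog", "the"): -3.0, ("barks", "dog"): -3.5,
    ("the", "barks"): -4.0, ("barks", "the"): -4.0,
}

def score(order):
    """Sum bigram log-probabilities over adjacent pairs; unseen pairs
    get a low default score."""
    return sum(BIGRAM_LOGP.get(pair, -5.0) for pair in zip(order, order[1:]))

def rank_orders(words):
    """Return all orderings of the unordered constituents, best first."""
    return sorted(permutations(words), key=score, reverse=True)

best = rank_orders(["dog", "the", "barks"])[0]
print(best)  # ('the', 'dog', 'barks')
```

A real sentence-realisation component would order subtrees of a syntax tree rather than a flat word list, but the scoring-and-ranking loop is the same shape.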

FIELD: computer engineering.

SUBSTANCE: an import application program interface (API) can be implemented to import content from a hierarchically structured document, such as an XML file. The import API works in conjunction with a syntax analyser to preview the document and extract content from selected elements, units, attributes and text. The import API also uses a callback component to process the extracted content. An export API can likewise be implemented to export data in order to create a hierarchically structured document, such as an XML file. The export API works in conjunction with an editor to receive data and export it in the form of elements, units, attributes and text in the hierarchically structured document.

EFFECT: selective import and export of data in an electronic document.

20 cl, 5 dwg
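The import-side pattern above, a syntax analyser extracting content from selected elements and handing it to a callback component, is the same shape as a SAX-style parse. A minimal sketch using Python's standard `xml.sax`; the element names, the `ImportHandler` class and the sample document are invented for illustration and are not the patented API:

```python
import xml.sax

class ImportHandler(xml.sax.ContentHandler):
    """Callback component: the parser pushes content from the selected
    elements here, analogous to the import API's callback."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted          # element names to extract
        self.current = None
        self.extracted = []

    def startElement(self, name, attrs):
        if name in self.wanted:
            self.current = name

    def characters(self, content):
        if self.current and content.strip():
            self.extracted.append((self.current, content.strip()))

    def endElement(self, name):
        if name == self.current:
            self.current = None

doc = "<doc><title>Report</title><skip>noise</skip><body>Hello</body></doc>"
handler = ImportHandler(wanted={"title", "body"})
xml.sax.parseString(doc.encode("utf-8"), handler)
print(handler.extracted)  # [('title', 'Report'), ('body', 'Hello')]
```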

FIELD: physics, computer technology.

SUBSTANCE: the invention concerns methods and systems for text segmentation. The method involves accessing a symbol line (204), determining a long lexeme (206), recording the adjoining symbols within the long lexeme (208), determining lexemes from the symbol line by keeping the adjoining symbols together, and determining multiple lexeme combinations (210), with the number of lexeme combinations reduced by means of the recorded adjoining symbols.

EFFECT: increased speed of text segmentation.

22 cl, 3 dwg
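The pruning idea above, recording adjoining symbols inside a long lexeme so that no segmentation may split them, can be sketched as a constrained enumeration. The lexicon, the glued pair and the function name are invented for illustration; the point is only that forbidding splits between recorded adjoining symbols shrinks the set of lexeme combinations:

```python
def segmentations(text, lexicon, glued=()):
    """Enumerate divisions of `text` into lexemes from `lexicon`.
    Pairs in `glued` are adjoining symbols recorded inside a long
    lexeme: no split is allowed between them, which prunes the
    search space.
    """
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        # A split after position i is forbidden if it would separate
        # a recorded adjoining pair.
        if i < len(text) and (text[i - 1], text[i]) in glued:
            continue
        head = text[:i]
        if head in lexicon:
            for rest in segmentations(text[i:], lexicon, glued):
                results.append([head] + rest)
    return results

lex = {"a", "ab", "b", "c", "bc", "abc"}
print(len(segmentations("abc", lex)))                      # 4 combinations
print(len(segmentations("abc", lex, glued={("b", "c")})))  # 2 combinations
```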

FIELD: physics; computer engineering.

SUBSTANCE: the present invention pertains to computer technology. The elements of each schema can be arbitrarily embedded in the elements of another schema, and each set of elements remains valid within its own schema. Elements of the second schema are "transparent" to the elements of the first schema when the text processor checks the validity of the first schema's elements. Elements of the second schema are verified separately, so that elements of the first schema are "transparent" during verification of the elements corresponding to the second schema.

EFFECT: validity checking of an extensible markup language (XML) document with elements linked to two or more schemata, where the elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains valid within its own schema.

16 cl, 6 dwg
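The "transparency" idea above can be approximated by hiding one schema's elements, identified by their namespace, before handing the tree to the other schema's validator. A minimal sketch using Python's standard `xml.etree.ElementTree`; the namespace URI, element names and function name are invented for illustration, and the actual schema validation step is omitted:

```python
import xml.etree.ElementTree as ET

def strip_namespace(root, ns_uri):
    """Return a copy of the tree with elements of the given namespace
    removed, so a validator for the remaining schema never sees them.
    This mirrors the "transparency" idea in miniature.
    """
    clone = ET.fromstring(ET.tostring(root))
    for parent in clone.iter():
        for child in list(parent):
            if child.tag.startswith("{" + ns_uri + "}"):
                parent.remove(child)
    return clone

doc = ET.fromstring(
    '<doc xmlns:b="urn:second">'
    '<title>Hi</title><b:note>aside</b:note><body>Text</body>'
    '</doc>'
)
visible = strip_namespace(doc, "urn:second")
print([e.tag for e in visible])  # ['title', 'body']
```

Validating the second schema's elements separately would be the symmetric operation: keep only the `urn:second` subtrees and hide the rest.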
