Selection of text classifier parameter based on semantic characteristics

FIELD: physics.

SUBSTANCE: to evaluate text classifier parameters based on semantic characteristics, a processing device performs semantic-syntactic analysis of a natural language text from a corpus of natural language texts to create a semantic structure representing a set of semantic classes. A characteristic of the natural language text is identified and extracted based on a set of values of characteristic extraction parameters. The corpus is divided into a training sample comprising a first set of natural language texts and a test sample comprising a second set of natural language texts. A set of characteristic extraction parameter values is determined taking into account the categories of the training sample. The resulting set of parameter values is then evaluated using the test sample.

EFFECT: improving the accuracy of classification results.

20 cl, 15 dwg
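The split-and-evaluate procedure in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: each text is represented as a list of semantic class labels (a stand-in for the output of the semantic-syntactic analysis), the single extraction parameter is a hypothetical `min_df` (the minimum number of training texts in which a semantic class must occur to be kept as a characteristic), and the classifier is a simple overlap count. For brevity the sketch scores each candidate value directly on the test sample.

```python
from collections import Counter, defaultdict

def build_vocabulary(train, min_df):
    """Extraction step: keep semantic classes seen in at least min_df training texts."""
    df = Counter()
    for classes, _ in train:
        df.update(set(classes))
    return {c for c, n in df.items() if n >= min_df}

def train_profiles(train, vocab):
    """Count the retained semantic classes per category of the training sample."""
    profiles = defaultdict(Counter)
    for classes, category in train:
        profiles[category].update(c for c in classes if c in vocab)
    return profiles

def classify(classes, profiles, vocab):
    """Pick the category whose profile overlaps the text's characteristics most."""
    feats = [c for c in classes if c in vocab]
    # sorted() makes tie-breaking deterministic (alphabetically first category wins)
    return max(sorted(profiles), key=lambda cat: sum(profiles[cat][c] for c in feats))

def evaluate(train, test, min_df):
    """Accuracy on the test sample for one value of the extraction parameter."""
    vocab = build_vocabulary(train, min_df)
    profiles = train_profiles(train, vocab)
    hits = sum(classify(classes, profiles, vocab) == cat for classes, cat in test)
    return hits / len(test)

def select_parameter(train, test, candidates):
    """Score every candidate value and return the best one with all scores."""
    scores = {p: evaluate(train, test, p) for p in candidates}
    best = max(sorted(scores), key=scores.get)
    return best, scores

# Toy corpus (hypothetical): (semantic classes of the text, category)
corpus = [
    (["MONEY", "BANK", "TO_PAY"], "finance"),
    (["MONEY", "LOAN", "BANK"], "finance"),
    (["BALL", "TEAM", "TO_PLAY"], "sport"),
    (["TEAM", "GOAL", "BALL"], "sport"),
    (["BANK", "MONEY", "TEAM"], "finance"),
    (["BALL", "TO_PLAY", "MONEY"], "sport"),
]
train, test = corpus[:4], corpus[4:]  # training sample / test sample

best, scores = select_parameter(train, test, [1, 2, 3])
print(best, scores)
```

On this toy data the loosest setting (`min_df = 1`) scores best, because stricter filtering discards the rarer classes that separate the two categories.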

 



 

Similar patents:

FIELD: information technology.

SUBSTANCE: in a method for recognising textual information and evaluating its integrity in Internet electronic documents, an electronic document is split into areas presumed to contain text paragraphs and lines. Splitting continues until the areas contain the largest possible continuous, logically connected text. Redundant information is deleted. The validity of the symbol encoding is analysed by checking whether letters belong to the alphabet and whether words belong to the vocabulary of the given language. Statistical characteristics of word classes and their forms are calculated. From these values a working vocabulary attribute vector is generated, which is converted into a principal component vector using principal component analysis and classified using previously trained classifiers. Textual information integrity is evaluated by a voting method of decision making.

EFFECT: higher throughput of a system for content-based processing of electronic documents and a larger number of data sources that can be analysed.

5 dwg

FIELD: information technologies.

SUBSTANCE: in the method of forming a relational description of command syntax from a metadescription of command syntax, the metadescription is identified (110). Elements of the metadescription are identified (120) and each element is assigned a unique identifier (ID), the IDs being assigned in the order in which the elements appear in the metadescription. A table containing all elements is formed (130), with the elements held in one column of the table, each in a different row. Opening and closing structural elements are identified (140) among the elements in the table, and bidirectional links are generated between corresponding opening and closing structural elements. Unidirectional hierarchical links are generated (150) between each opening element and the opening element at the previous level of nesting; these links are generated for every opening element at any level except the first.

EFFECT: automatic formation of a relational description of command syntax from a metadescription of command syntax.

17 cl, 15 dwg
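Steps 110 to 150 of this abstract can be sketched in Python as follows. This is an illustrative sketch only: the function name `build_relational_description`, the bracket characters treated as structural elements, and the sample metadescription are all hypothetical, and the input is assumed to be well formed (every closing element has a matching opening element).

```python
def build_relational_description(elements, openers=("[", "{"), closers=("]", "}")):
    """Return (table, pairs, parents) for a tokenised metadescription.

    table   -- rows with sequential IDs, one element per row (steps 120, 130)
    pairs   -- bidirectional links between opening and closing elements (step 140)
    parents -- one-way links from an opening element to the enclosing one (step 150)
    """
    # IDs are assigned in the order in which the elements appear
    table = [{"id": i, "element": e} for i, e in enumerate(elements, start=1)]
    pairs = {}
    parents = {}
    stack = []  # IDs of opening elements that are not yet closed
    for row in table:
        if row["element"] in openers:
            if stack:  # first-level openers have no parent link
                parents[row["id"]] = stack[-1]
            stack.append(row["id"])
        elif row["element"] in closers:
            open_id = stack.pop()  # assumes well-formed input
            pairs[open_id] = row["id"]
            pairs[row["id"]] = open_id
    return table, pairs, parents

# Hypothetical metadescription: SHOW [ ip { route } ]
table, pairs, parents = build_relational_description(
    ["SHOW", "[", "ip", "{", "route", "}", "]"]
)
print(pairs)    # links between IDs 2 and 7, and between IDs 4 and 6
print(parents)  # the "{" with ID 4 is nested inside the "[" with ID 2
```

A stack is a natural fit here because the most recently opened element is always the one that the next closing element matches, and it is also the parent of any opening element encountered before that.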

FIELD: information technologies.

SUBSTANCE: in the method of transforming a structured data array that contains natural language text, a first data structure of the structured data array is created (101) from the end data structure of the structured data array. A database of logical connections between logical sections of elements of the first data structure is created (102). A second data structure of the structured data array is created (103). A database of semantic parts of logical sections of elements of the second data structure is created (104). Grammatically and orthographically correct semantic parts of logical sections of the second data structure elements are created (105) by applying linguistic transformations to the specified semantic parts. The end data structure of the structured data array is created (106).

EFFECT: creation of a logically, grammatically and orthographically correct data structure that allows quick and convenient navigation among structure elements.

17 cl, 15 dwg, 3 tbl

FIELD: information technology.

SUBSTANCE: method of determining vulnerable functions in automated scanning of web applications for presence of vulnerabilities and non-declared capabilities comprises compiling a list of source texts of web applications intended for generating testing parameters, and setting source text parameters for testing; parsing the source texts using the given parameters and adding distinctive labels to the source text with indication of label-function pairs; performing automatic scanning and search for program errors in web applications and, in case of error, obtaining debugging data in the form of machine code, describing the currently executed module and containing the name of the corresponding label; determining, from said label, the corresponding label-function pair and obtaining the name of the vulnerable function, as well as the full name of the module containing the vulnerable function.

EFFECT: high number of potentially detected vulnerabilities of web applications, shorter time needed for manual analysis of program errors in order to determine criticality thereof.

3 cl

FIELD: information technology.

SUBSTANCE: method for automatic semantic classification of natural language texts comprises presenting each text to be classified in digital form for subsequent processing; indexing the text to obtain elementary units of the first through fifth levels; detecting the frequency of occurrence of units of the fourth level, each being a semantically significant object or attribute, and the frequency of occurrence of semantically significant relationships linking semantically significant objects, as well as objects and attributes; forming a semantic network from triads, which are units of the fifth level; renormalising the frequencies of occurrence into the semantic weights of the units of the fourth level; ranking the units of the fourth level according to semantic weight by comparing it with a threshold value and removing those having a weight below the threshold value; detecting the degree of crossing of the semantic networks of the text and of text samples; selecting, as classes for the text, the subject areas whose semantic networks cross the semantic network of the text to a degree greater than the threshold.

EFFECT: faster process of comparing texts.

6 cl, 2 dwg, 24 tbl

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to information technology. The disclosed method includes presenting two texts to be compared in digital form for subsequent processing; indexing the texts to obtain elementary units of the first to fifth levels; detecting the frequency of occurrence of elementary units of the fourth level, each being a semantically significant object or attribute, and the frequency of occurrence of semantically significant relationships linking semantically significant objects, as well as semantically significant objects and attributes; storing the formed elementary units of the second to fifth levels and the obtained indices together with links to specific sentences of said text; forming a semantic network from triads, which are elementary units of the fifth level; ranking the elementary units of the fourth level according to semantic weight by comparing the semantic weight of each of them with a predetermined threshold and removing elementary units of the fourth level having a semantic weight below the threshold; detecting, for the two compared texts, the degree of crossing of their semantic networks.

EFFECT: faster process of comparing texts.

4 cl, 2 dwg, 26 tbl
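The pruning and comparison steps of this abstract can be sketched in Python. The abstract does not say how the degree of crossing is computed, so this sketch assumes a Jaccard-style ratio of shared triads to all triads; the function names, weights and triads below are hypothetical.

```python
def prune(weights, threshold):
    """Keep fourth-level units whose semantic weight reaches the threshold."""
    return {unit for unit, w in weights.items() if w >= threshold}

def build_network(triads, kept_units):
    """Semantic network: the (subject, relation, object) triads whose
    subject and object both survived pruning."""
    return {(s, r, o) for s, r, o in triads if s in kept_units and o in kept_units}

def degree_of_crossing(net_a, net_b):
    """Assumed measure: shared triads divided by all distinct triads."""
    union = net_a | net_b
    if not union:
        return 0.0
    return len(net_a & net_b) / len(union)

# Hypothetical text A: the low-weight unit "red" is pruned
weights_a = {"car": 0.8, "engine": 0.9, "red": 0.1}
triads_a = [("car", "has", "engine"), ("car", "is", "red")]

# Hypothetical text B: all units survive
weights_b = {"car": 0.7, "engine": 0.6, "wheel": 0.9}
triads_b = [("car", "has", "engine"), ("car", "has", "wheel")]

net_a = build_network(triads_a, prune(weights_a, 0.5))
net_b = build_network(triads_b, prune(weights_b, 0.5))
print(degree_of_crossing(net_a, net_b))  # 0.5: one shared triad out of two distinct
```

Pruning before comparison means that low-weight units cannot inflate or dilute the overlap, which is consistent with the stated aim of comparing texts faster.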

FIELD: information technology.

SUBSTANCE: method of generating syntactically and semantically correct commands includes converting a text Backus-Naur form (BNF) containing a command meta-description into a relational BNF containing a command meta-description recognisable by a database management system (DBMS). A text semantic rule containing a command usage restriction is converted into a relational semantic rule containing a DBMS-recognisable command usage restriction. A command is identified and a basic semantic rule is assigned to the identified command, wherein the basic semantic rule consists of a plurality of relational semantic rules. A resultant dynamic structure is formed for the identified command. Elements of the basic semantic rule are identified for the identified command and all elements of all relational semantic rules are applied to the identified command. A syntactically and semantically correct command is then generated.

EFFECT: automated, highly accurate generation of DBMS commands with fewer computations.

38 cl, 18 dwg

FIELD: information technology.

SUBSTANCE: method for automatic semantic indexing of natural language text comprises segmenting the text into elementary first level units (words) and sentences; forming second level units (standardised word forms); calculating the frequency of occurrence of each first level unit for adjacent first level units and merging the sequence of words into third level units (stable word combinations); identifying in each sentence a semantically significant entity and an attribute thereof (fourth level units); identifying in each sentence semantically significant relationships between semantically significant entities and between semantically significant entities and attributes; determining the frequency of occurrence of second level and third level units; forming, for each semantically significant relationship, a plurality of triads (fifth level units); on the plurality of the formed triads, separately indexing all semantically significant entities linked by semantically significant relationships with their frequency of occurrence, all attributes with their frequency of occurrence and all formed triads.

EFFECT: high accuracy of indexing natural language texts.

6 cl, 2 dwg, 23 tbl

FIELD: information technology.

SUBSTANCE: programming language parsing method is based on table LR parsing. Canonical LR tables of a parser are dynamically rearranged during compilation using grammar extension directives given separately for each hierarchy level of nesting grammatical rules of the programming language, said directives being intended for inputting new grammatical structures. The compiler continues parsing of the program using the rearranged LR tables.

EFFECT: enabling dynamic modification of compilation tables which form the basis for a parser by extending the grammar of the programming language.

5 cl

FIELD: information technology.

SUBSTANCE: method includes a step for syntax analysis of text. A step for extracting text components and relationships thereof in the text is then executed. A graph or graphic representation of the text is generated or used as representation of the meaning of the text independent of the language. That graph or graphic representation is used to perform modelling, knowledge presentation and processing in a language processing system. A judgment of the representation in the model of the semantic realm is made during the processing step, thereby checking consistency of the extracted text semantics.

EFFECT: improvement and further advancement of natural language processing methods, enabling proper processing of text semantics and other data.

29 cl, 15 dwg

FIELD: computer science.

SUBSTANCE: method includes receiving text messages from a data channel, performing linguistic processing of words, forming a thesaurus for each text message, statistically processing the words in the thesaurus, and storing the text message and thesaurus in storage. The membership of the text message in one of the categories from a list is determined; an initial data value of the text message is determined and stored with the text message. Data values are periodically updated to account for the time elapsed since the messages appeared, and text messages with a data value below a preset threshold are erased. During processing of each message, the values of the category classification features are updated.

EFFECT: higher efficiency.

1 dwg
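The periodic update and erasure of stored messages described in this abstract can be sketched in Python. The abstract does not specify how the data value depends on elapsed time, so this sketch assumes exponential decay with a hypothetical half-life; the storage layout and all names are likewise illustrative.

```python
def decayed_value(initial, age_hours, half_life_hours=24.0):
    """Assumed update rule: the data value halves every half_life_hours."""
    return initial * 0.5 ** (age_hours / half_life_hours)

def purge(storage, now, threshold, half_life_hours=24.0):
    """Return the messages whose current data value is still at or above
    the preset threshold; the rest are treated as erased."""
    return {
        msg_id: msg
        for msg_id, msg in storage.items()
        if decayed_value(msg["value"], now - msg["created"], half_life_hours) >= threshold
    }

# Hypothetical storage: initial data value and creation time (in hours)
storage = {
    "a": {"value": 1.0, "created": 0, "text": "market report"},
    "b": {"value": 0.3, "created": 0, "text": "weather note"},
}

# After 24 hours "a" has value 0.5 and is kept; "b" has 0.15 and is erased
remaining = purge(storage, now=24, threshold=0.2)
print(sorted(remaining))
```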

FIELD: technology for automated synthesis of text documents.

SUBSTANCE: method includes selecting, in the variable data, variable unified information (common word combinations), variable input data (details), and variable non-unified information (free word combinations). The variable unified information is separated out as a plurality of support words constituting a lexicological document skeleton and is recorded in a machine-readable database. A lexicological document tree and a data document control contour are formed. During document generation, all branches of the lexicological document tree are traversed to select the support words needed for inserting matching word combinations into the generated document.

EFFECT: lower probability of errors, lower labour intensity.

3 cl, 7 dwg

FIELD: computer science, in particular, system for identification of preparedness of text documents in network for distributed processing of data.

SUBSTANCE: system contains block for receiving sections of text documents, block for selection of base addresses of text documents, block for selecting structure of text documents, block for forming signals for recording and reading database, block for gating sections of text documents, block for addressing of text documents, block for receiving sections of text documents from database of server, block for commutator of channels for dispensing sections of text documents, block for counting number of finished sections of text documents, comparator, counter.

EFFECT: increased speed of operation of system.

8 dwg

FIELD: technology for recognizing text information from graphic file.

SUBSTANCE: in accordance with the method, the order of access to additional information is set in advance and a quality estimate is assigned to each type of additional information. Different variants of dividing the image of the selected rows into fragments are constructed, and a linear division graph is built for each row fragment. Images of graphic elements are recognised using a classifier, and an estimate is assigned to each recognition variant. A transition is made from variants of recognised graphic elements to variants of alphabet symbols. For each chain connecting the starting and ending vertices, chains are built that correspond to all variants of recognition of graphic elements and all variants of transition from recognised graphic elements to alphabet symbols. The resulting variants are ranked in decreasing order of recognition quality estimate and processed using information about the position of uppercase and lowercase letters. If recognition of a graphic element yields more than one symbol variant, the variants are processed using the types of additional information successively and/or, when necessary, simultaneously. A quality estimate is assigned to each resulting variant, and symbol variants with an estimate below a predetermined value are discarded. The remaining variants are sorted by pair-wise comparison, and spaces erroneously recognised at previous stages are additionally corrected.

EFFECT: increased precision of recognition of text and increased interference resistance of text recognition.

9 cl, 2 dwg

FIELD: devices for recognition of written symbols.

SUBSTANCE: method includes a stage of receiving written symbols that are written on a touch screen, where the touch screen contains at least an area for writing symbols and an area for writing punctuation. A stage then determines, for the written symbols, the ratio of what is written in the punctuation writing area to what is written in the symbol writing area. A stage of recognising punctuation marks is performed for the written symbols when the ratio exceeds a threshold value; this recognition determines, from a set of punctuation marks, at least one possible punctuation mark similar to the written symbols.

EFFECT: automatic recognition of punctuation marks with increased precision.

8 cl, 8 dwg

FIELD: information technologies.

SUBSTANCE: invention relates to methods of validating extensible markup language (XML) documents and reporting schema violations in real time. A parallel tree is maintained with nodes corresponding to the elements of an XML document. When errors occur in the XML document, the elements of the XML document that correspond to the errors are marked. The nodes corresponding to the marked elements are validated against the XML schema associated with the XML document. The elements and nodes that correspond to errors are reported to the user by means of visual indicators in the XML document and the parallel tree.

EFFECT: validation of an XML document and reporting of schema violations in real time while the user edits the document.

20 cl, 8 dwg

FIELD: physics, computer equipment.

SUBSTANCE: present invention relates to a tree ordering component in a sentence realisation system. The component accepts an unordered syntax tree and generates from it a ranked list of alternative ordered syntax trees. The component also includes statistical models of constituent structure, which the tree ordering component uses to evaluate the alternative ordered trees.

EFFECT: provision of proper word order in a tree structure.

24 cl, 11 dwg

FIELD: computer engineering.

SUBSTANCE: an import application program interface (API) can be implemented to import content from a hierarchically structured document, such as an XML file. The import API works in conjunction with a parser to scan the document and extract content from selected elements, nodes, attributes and text. The import API also uses a callback component to process the extracted content. An export API can likewise be implemented to export data in order to create a hierarchically structured document, such as an XML file. The export API works in conjunction with an editor to receive data and export it as elements, nodes, attributes and text in the hierarchically structured document.

EFFECT: providing selective data import and export in electronic document.

20 cl, 5 dwg

FIELD: physics, computer technology.

SUBSTANCE: invention concerns methods and systems of text segmentation. The method involves accessing a character string (204), identifying a long lexeme (206), recording the adjoining characters in the long lexeme (208), identifying lexemes in the character string by keeping the adjoining characters together, and determining multiple lexeme combinations (210), with the number of lexeme combinations reduced by means of the recorded adjoining characters.

EFFECT: faster text segmentation.

22 cl, 3 dwg

FIELD: physics; computer engineering.

SUBSTANCE: present invention pertains to computer technology. The elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains correct within the limits of its intrinsic schema. Elements of the second schema are "transparent" to the elements of the first schema, when the text processor checks correctness of elements of the first schema. Elements of the second schema are verified separately so that, elements of the first schema are "transparent" for verification of elements, corresponding to the second schema.

EFFECT: provision for validity checking of an extensible mark-up language (XML) document, with elements, linked to two or more schemata, where elements of each schema can be arbitrarily embedded in the elements of another schema and each set of elements remains correct within the limits of its intrinsic schema.

16 cl, 6 dwg
