Method for stream processing of text messages

FIELD: computer science.

SUBSTANCE: the method includes receiving text messages from a data channel, performing linguistic processing of words, forming a thesaurus for each text message, performing statistical processing of the words in the thesaurus, and storing the text message and its thesaurus in storage. The membership of the text message in one category from a list is determined; the initial information content of the text message is determined and stored with the message; information content values are periodically updated to account for the time elapsed since the messages arrived, and messages whose information content falls below a preset threshold are erased; during the processing of each message the values of the category classification criteria are updated.

EFFECT: higher efficiency.

1 dwg

 

The invention relates to text message classification systems and can be used in information processing systems, databases, and digital repositories that have a constant source of textual information.

A known method of classifying messages [1] consists in converting text messages from a special storage format into natural-language text, reducing the words of the document to their base forms, and calculating the weights of the words in the document according to their frequency of occurrence. During the training phase, a set of manually classified documents is presented and a set of classification criteria is formed. When a document needs to be classified, it is converted from the special storage format into natural-language text, its words are reduced to their base forms, the word weights are calculated, and an SVM (Support Vector Machine) classifier together with the classification criteria determines the category of the document.

However, this method has significant limitations: it classifies messages only in static mode and has no means for stream processing, such as sequential training of the classifier, nor for estimating the information content of a text message and the duration of its storage.

Another known method of classifying messages [2] continuously processes a sequence of text messages, updating the classification criteria at each step using the Widrow-Hoff algorithm; this makes it possible both to categorize messages and to train the classifier. There is no need to present the entire training set of documents at once.

However, this method is also limited: it describes only the procedure for classifying messages and training the classifier, not the full cycle of processing messages in streaming mode (for example, initial document processing and storage), and it provides no mechanism for estimating the information content of a text message and the duration of its storage.

The closest in technical essence to the present invention is a method for identifying objects from their descriptions [3], adopted as the prototype. It consists in receiving natural-language text messages from an information channel, performing linguistic processing of the words of each message, forming a thesaurus of each message, performing statistical processing of the words in the message thesaurus, and saving the text message and its thesaurus in the repository.

This method makes it possible to compare a given text message with the sets of messages received during given time intervals, and thereby to determine its thematic proximity to these intervals treated as categories. The disadvantages of the prototype are the impossibility of defining categories by anything other than a time attribute, and the absence of mechanisms for estimating the information content of a text message and deleting it from the repository when it loses its informational value.

The technical result of the invention is the elimination of the disadvantages of the prototype, namely the ability to define categories arbitrarily. In addition, the information content of each text message is determined, and this affects how long the message is kept in the repository.

This technical result is achieved by receiving natural-language text messages from an information channel, performing linguistic processing of the words of each message, forming a thesaurus of each message, performing statistical processing of the words in the message thesaurus, and saving the text message and its thesaurus in the repository. In addition, the membership of each text message in one category from a predefined list of categories is determined automatically. The initial information content of the text message is also determined and stored in the repository together with the message; the information content values of the stored text messages are periodically updated to account for the time elapsed since their arrival, and those text messages whose information content has fallen below a predetermined threshold are deleted. A further feature of the method is that the values of the category classification criteria are updated during the processing of every text message.

According to this method, the information content of the text message is determined by two factors:

- the content of the message (the word forms it contains and how far its vocabulary deviates from that of earlier messages);

- the time elapsed since the message entered the database.

The values of the classification criteria define the distribution of weights of the word forms most common in messages of the given category. For each text message and each category a membership function is defined that characterizes the degree to which the message belongs to the category. The category for which the value of this function is maximum is assigned to the message. If the value of the membership function of a message in its category is small, the message has unusual vocabulary and, in accordance with this method, is assigned a higher information content than a message that yields a large value of this function.

Immediately after a message enters the database it has, under this method, its maximum information content, since it has most likely not yet been read and evaluated by the operators of the message processing system. Over time, the information content of the message decreases.

In accordance with the patented method, each text message s in the database is assigned at each moment of time t an information content value according to the following formula, which is consistent with the reasoning above:

I(s) = 1 - (x(s), c_k) - α(t - t_0),

where x(s) is the vector of weights of all word forms in the thesaurus of message s, c_k is the vector of classification criteria of category k, to which text message s belongs, t_0 is the time at which message s entered the database, and α is the rate of information loss. The values of t and t_0 can be expressed in any time unit, such as seconds. The choice of time unit affects the value of the coefficient α.

The coefficient α determines how much the information content of a message decreases per unit time and is chosen according to the requirements of the particular application of the method.

As soon as the information content of a message falls below the information content threshold ε, the message is deleted from the database as uninformative.

Thus, messages that received the highest information content values at the outset remain in the database longer, while those that were uninformative are quickly removed from the database.
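The formula and the threshold rule together imply a storage lifetime for each message. The following Python sketch is an illustration only: the function names and the numeric values are assumptions and do not appear in the source. It evaluates I(s) and solves 1 - (x(s), c_k) - α·Δt = ε for Δt.

```python
def information_content(similarity, alpha, t, t0):
    """I(s) = 1 - (x(s), c_k) - alpha * (t - t0).

    similarity -- the scalar product (x(s), c_k), assumed to be precomputed
    """
    return 1.0 - similarity - alpha * (t - t0)


def time_until_deletion(similarity, alpha, epsilon):
    """Time after arrival at which I(s) reaches the threshold epsilon.

    Solves 1 - similarity - alpha * dt = epsilon for dt. A result of 0.0
    means the message starts at or below the threshold.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return max(0.0, (1.0 - similarity - epsilon) / alpha)


# Illustrative values (assumed): alpha per second, threshold 0.1.
unusual = time_until_deletion(similarity=0.2, alpha=0.01, epsilon=0.1)
typical = time_until_deletion(similarity=0.8, alpha=0.01, epsilon=0.1)
```

With these assumed values the message with unusual vocabulary (small scalar product) is kept several times longer than the typical one, which is the behaviour the text describes.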

The method can be implemented on a computer or computing device, shown as a block diagram in the drawing.

The device for implementing the method consists of information channel 1, unit 2 for forming text message thesauri, control unit 3, training data unit 4, classification unit 5, unit 6 for determining the initial information content of text messages, unit 7 for saving text messages, text message storage 8, unit 9 for updating the classification criteria, classification criteria storage 10, unit 11 for generating time samples, unit 12 for updating the information content of text messages, and unit 13 for deleting text messages.

The device operates according to the method as follows. When a text message arrives on information channel 1, it is passed to unit 2.

In unit 2 the text message first undergoes pre-processing, which consists in reducing all of its words to their base forms. Any of the methods [4-7] may be used for this. The thesaurus of the text message is then formed. The Porter algorithm [4], which applies special rules for stripping and replacing word endings, is most often used to find base forms.

The thesaurus of a message consists of all the words it contains. Each word is mapped to its normalized weight in the message text, defined by the TF-IDF (Term Frequency-Inverse Document Frequency) formula:

tfidf(w) = tf(w) · idf(w),

where tf(w) is the frequency of word w in the message, that is,

tf(w) = c(w) / N.

Here c(w) is the number of times the word w occurs in the message and N is the total number of words in the message. The inverse document frequency idf(w) is calculated by the formula:

idf(w) = log(M) - log(d(w)),

where d(w) is the number of documents known to the system that contain the word w and M is the total number of documents known to the system. The vector of weights is normalized using the Euclidean norm:

x_w(s) = tfidf(w) / sqrt(Σ_v tfidf(v)^2),

where the sum runs over all words v of the message.
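The weighting step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary-based representation of the thesaurus, and the handling of words not yet seen in any document (treated as d(w) = 1) are assumptions.

```python
import math
from collections import Counter


def tfidf_thesaurus(word_forms, doc_freq, total_docs):
    """Map each base word form of one message to its normalized TF-IDF weight.

    word_forms -- list of base forms in the message (already stemmed)
    doc_freq   -- dict: word form -> d(w), number of known documents with it
    total_docs -- M, total number of documents known to the system
    """
    counts = Counter(word_forms)
    n = len(word_forms)
    weights = {}
    for w, c in counts.items():
        tf = c / n
        # Assumption: a word absent from doc_freq is treated as d(w) = 1.
        idf = math.log(total_docs) - math.log(max(doc_freq.get(w, 0), 1))
        weights[w] = tf * idf
    # Euclidean normalization of the weight vector.
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0.0:
        return weights  # all weights are zero; nothing to scale
    return {w: v / norm for w, v in weights.items()}


thesaurus = tfidf_thesaurus(
    ["rate", "bank", "rate"], {"rate": 10, "bank": 100}, total_docs=100
)
```

In this example "bank" occurs in every known document, so its idf and therefore its weight is zero, and the whole unit length of the vector falls on "rate".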

Next, the generated thesaurus of the text message is sent to control unit 3, which is controlled by the operator of the device. Control unit 3 allows the device to operate in two modes: initial training and normal operation. Initial training is needed to build the initial classification criteria. In this mode the control unit receives the category to which a text message belongs from unit 4. In normal operation the control unit calls unit 5 to determine the category to which a text message belongs.

To classify a text message, the scalar product of the normalized weight vector of the message and the weight vector (classification criteria) of each category is calculated. Since these vectors are normalized, their scalar product equals the cosine of the angle between them in the corresponding multidimensional space. The category for which this scalar product is maximum is assigned to the text message:

k(s) = argmax_k (x(s), c_k).

Information about this category is passed from unit 5 to control unit 3 and then to unit 6 for determining the initial information content of text messages.
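The classification step can be sketched as follows, assuming that the message vector and the criteria of each category are stored as sparse dictionaries from word form to weight. The function names and the sample categories are hypothetical.

```python
def dot(x, c):
    """Scalar product over the word forms present in both sparse vectors."""
    if len(x) > len(c):
        x, c = c, x  # iterate over the shorter vector
    return sum(v * c[w] for w, v in x.items() if w in c)


def classify(x, criteria):
    """Return (category, score) for the largest scalar product (x, c_k).

    criteria -- dict: category name -> sparse weight vector c_k
    """
    if not criteria:
        raise ValueError("at least one category is required")
    best = max(criteria, key=lambda k: dot(x, criteria[k]))
    return best, dot(x, criteria[best])


# Hypothetical categories and a normalized message vector.
criteria = {
    "finance": {"rate": 0.8, "bank": 0.6},
    "sport": {"match": 1.0},
}
category, score = classify({"rate": 0.6, "bank": 0.8}, criteria)
```

Because both vectors have unit length, the returned score is the cosine of the angle between the message and the winning category.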

To determine the initial information content of a text message, the following formula is used:

I(s) = 1 - (x(s), c_k),

where s is the current text message, x(s) is the vector of weights of all word forms in the thesaurus of message s, and c_k is the vector of classification criteria of category k, to which text message s belongs. The scalar product is calculated over the word forms in the intersection of the set of word forms of the text message and the set of word forms of the classification criteria of category k; the vectors x(s) and c_k have components corresponding to these word forms.
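A short sketch of this calculation, restricted to the shared word forms as described. The function name and the sample vectors are assumptions made for illustration.

```python
def initial_information_content(x, c_k):
    """I(s) = 1 - (x(s), c_k), summed over word forms common to both vectors."""
    shared = x.keys() & c_k.keys()
    return 1.0 - sum(x[w] * c_k[w] for w in shared)


# Hypothetical criteria of the assigned category.
c_finance = {"rate": 0.8, "bank": 0.6}

typical = initial_information_content({"rate": 0.6, "bank": 0.8}, c_finance)
unusual = initial_information_content({"eclipse": 1.0}, c_finance)
```

A message whose vocabulary matches its category closely starts near zero, while a message sharing no word forms with the category starts at one.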

From unit 6 the data on the current text message, namely the message itself, its thesaurus, and its initial information content, are passed to unit 7, which saves them in storage 8.

Control is then passed to unit 9. The classification criteria of each category are a vector of weights for all the word forms encountered in any text message of that category. The weights are normalized according to the Euclidean norm. The classification criteria for each category are determined iteratively, using the Widrow-Hoff dynamic algorithm for training linear classifiers. The classification criteria of the categories are retrieved from storage 10 and recalculated according to the formula:

c_kj,new = c_kj,old - 2η((c_k,old, x(s)) - y_k) x_j(s),

where c_kj,old and c_kj,new are the j-th components of the old and new classification criteria vectors of the k-th category, respectively; y is a vector with a one in the position corresponding to the category to which text message s belongs and zeros in all other positions, y_k being its k-th component; and η is the learning rate, set by the operator of the device. The updated values of the classification criteria are then written back to storage 10.
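One Widrow-Hoff update step can be sketched as below. The sparse dictionary layout and the function name are assumptions. The source states that the weights are normalized but does not say whether renormalization follows each update, so this sketch leaves it out.

```python
def widrow_hoff_update(criteria, x, true_category, eta):
    """Update every category's weight vector in place for one message.

    criteria      -- dict: category -> {word form: weight}
    x             -- normalized sparse weight vector of the message
    true_category -- category of the message (target 1, all others 0)
    eta           -- learning rate
    """
    for k, c_k in criteria.items():
        y_k = 1.0 if k == true_category else 0.0
        # (c_k_old, x(s)): prediction computed with the old weights.
        prediction = sum(v * c_k.get(w, 0.0) for w, v in x.items())
        error = prediction - y_k
        # Only components with nonzero x_j(s) can change.
        for w, x_j in x.items():
            c_k[w] = c_k.get(w, 0.0) - 2.0 * eta * error * x_j
    return criteria


criteria = {"a": {}, "b": {}}
widrow_hoff_update(criteria, {"w": 1.0}, true_category="a", eta=0.25)
```

After this single step the weight of "w" in category "a" moves from 0 toward the target, while category "b", whose prediction already equals its target of 0, is left unchanged.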

Unit 11 sends signals to unit 12 at a constant time period set by the operator of the device. On receiving a signal, unit 12 loops through all the messages in storage 8 and updates their information content values according to the following formula:

I(s) = I(s) - αΔt,

where α is the rate of information loss and Δt is the time period between successive signals from unit 11 for generating time samples. The new information content values are written to storage 8. The coefficient α is set by the operator.

Unit 13 for deleting text messages iterates over all the messages in text message storage 8 and deletes every message whose information content at the time of the check is below the information content threshold ε, which is also set by the operator of the device.
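The periodic update and the deletion pass can be sketched together. In the source these are separate units (12 and 13); combining them in one function, the in-memory dictionary standing in for storage 8, and all names are assumptions of this illustration.

```python
def decay_and_purge(storage, alpha, delta_t, epsilon):
    """Apply I(s) <- I(s) - alpha * delta_t to every stored message, then
    delete the messages whose information content is below epsilon.

    storage -- dict: message id -> {"text": str, "info": float}
    Returns the list of deleted message ids.
    """
    for record in storage.values():
        record["info"] -= alpha * delta_t
    expired = [mid for mid, rec in storage.items() if rec["info"] < epsilon]
    for mid in expired:
        del storage[mid]
    return expired


# Illustrative values (assumed): one tick of 10 time units.
storage = {
    1: {"text": "unusual report", "info": 0.9},
    2: {"text": "routine notice", "info": 0.15},
}
deleted = decay_and_purge(storage, alpha=0.01, delta_t=10, epsilon=0.1)
```

Calling this function once per signal from the time sample generator reproduces the linear decay of the closed-form expression for I(s), provided Δt equals the signal period.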

The values of the coefficients α, η, and ε may differ depending on the specific application of the device.

Thus, the method classifies text messages into a predetermined set of categories, determines the information content of each text message, and deletes the message once it loses its informational value, thereby achieving the technical result stated above.

Sources of information taken into account in the preparation of the application materials:

1. U.S. patent No. 6327581, cl. G 06 F 015/18.

2. Lewis D.D., Schapire R.E., Callan J.P., and Papka R. "Training algorithms for linear text classifiers". In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 294-306, Zurich, CH, 1996.

3. RF patent No. 2167450 C2, cl. G 06 F 17/30 (prototype).

4. Porter M.F. "An algorithm for suffix stripping". Program, Vol. 14, No. 3, 1980, pp. 130-137.

5. RF patent No. 2096825 C1, cl. G 06 F 17/00.

6. U.S. patent No. 6308149, cl. G 06 F 17/27.

7. U.S. patent No. 6430557, cl. G 06 F 017/30; G 06 F 017/27; G 06 F 017/21.

A method for stream processing of text messages, consisting in receiving natural-language text messages from an information channel, performing linguistic processing of the words of each message, forming a thesaurus of each message, performing statistical processing of the words in the message thesaurus, and saving the text message and its thesaurus in the repository, characterized in that the membership of each text message in one category from a predefined list of categories is determined automatically; the initial information content of the text message is determined and stored in the repository together with the text message; the information content values of the text messages stored in the repository are periodically updated to account for the time elapsed since their arrival, and those text messages whose information content has fallen below a predetermined threshold are deleted; and during the processing of each text message the values of the category classification criteria are updated.



 
