Method for automated detection of the language and (or) encoding of a text document

FIELD: information technologies.

SUBSTANCE: in the method of automated detection of the language and (or) encoding of a text document, byte sequences are extracted from reference documents and statistics of occurrence of the extracted byte sequences are calculated. Using these statistics, a profile of each language and (or) each encoding is built, a search machine is built to extract the sought byte sequences from the byte stream of a scanned document, and the search machine and the profiles of languages and (or) encodings are saved in memory. Byte sequences are then found in the electronic version of each scanned document with the help of the search machine, and the statistics of occurrence of the found byte sequences are calculated as the profile of the scanned document. The calculated profile of the scanned document is compared with the profiles of languages and (or) encodings to determine the relevance of each language and (or) encoding to the scanned document.

EFFECT: expanded range of technical means, making it possible to automatically detect the language and encoding of text in any text documents on the basis of previously collected statistics.

3 cl

 

Technical field to which the invention relates

The present invention relates to automatic detection of the language and (or) encoding of a text document and can be used in developing new systems for scanning text documents and in improving existing ones.

Prior art

In the automatic analysis of text documents, for example when documents circulating on a company network are monitored for the presence of sensitive information, there is an acute problem of determining the language in which the scanned document is written, or the encoding of the binary representation of the document.

Several methods of determining the language and (or) encoding of a text document are currently known.

Thus, Russian Federation patent No. 2251737 (publ. 10.05.2005) describes a method for automatic language detection in multilingual recognition, in which hypotheses about the language of a character group are formed from the recognized individual characters of the text, and these hypotheses are tested against a list of language models that contain predetermined character features for all the presumed languages. A similar approach is used in U.S. patent application No. 2009/0024385 (publ. 22.01.2009), which describes a semantic grammar analyzer that sequentially analyzes sentences with components in multiple languages and builds from the results a graph used in further analysis. The effectiveness of these methods is not high enough because they rely on individual characters or groups of characters, whereas a text may include quotations in another language.

The closest analogue can be considered U.S. patent No. 8041566 (publ. 18.10.2011), which discloses topic-specific models for text formatting and speech recognition. In this method the text is segmented, and each segment is assigned a model from a set of topic-specific models that contain statistical information on language model probabilities, text processing and formatting rules (for example, interpretation of commands for punctuation, highlighting of parts of the text, etc.) and a dictionary characteristic of the segment. The method is of limited use because it is designed primarily for text formatting.

Disclosure of the invention

The purpose of the present invention is to provide a method that expands the range of technical means and makes it possible to automatically determine the language and (or) encoding of text in any text documents on the basis of previously collected statistics.

To solve this problem and achieve the specified technical result, the present invention proposes a method of automated detection of the language and (or) encoding of a text document, which consists in:
- extracting byte sequences from the electronic versions of reference documents whose texts are written in the corresponding language and (or) encoded in the corresponding encoding;
- calculating statistics of occurrence of the extracted byte sequences for the reference documents whose texts are written in one language and (or) encoded in one encoding;
- building, on the basis of the calculated statistics, a profile of each language and (or) each encoding in the form of a set of byte sequences, with an indication of the weight of each byte sequence in the given language and (or) the given encoding;
- building a search machine to extract the sought byte sequences from the byte stream of a scanned document;
- saving in memory the built search machine and the profiles of languages and encodings, collectively referred to as the ontology of this collection of languages and encodings;
- finding byte sequences in the electronic version of each scanned document using the search machine;
- calculating, as the profile of the scanned document, the statistics of occurrence of the found byte sequences on the basis of the ontology;
- comparing the calculated profile of the scanned document with the profiles of languages and encodings to determine the relevance of each language and (or) encoding to the scanned document.

A feature of the method according to the present invention is that the most characteristic byte sequences can be selected for each language and (or) each encoding.

Another feature of the method according to the present invention is that the search machine can be built on the basis of the Aho-Corasick algorithm for exact matching of a set of patterns.

Detailed description of the invention

The present invention can be implemented in any computing system, for example, a personal computer, a server, etc. To carry out the invention, an appropriate database that stores electronic files of text documents must also be available.

The method of automated detection of the language and (or) encoding of a text document is intended to automatically determine the language and encoding of a text document.

First, on the basis of specially selected collections of texts in different languages, or of collections already in use (by supplementing the initial collection of texts), byte sequences are extracted from the electronic versions of reference documents whose texts are written in the corresponding language and (or) encoded in the corresponding encoding, and statistics of occurrence of the extracted byte sequences are calculated for the reference documents whose texts are written in one language and (or) encoded in one encoding. On the basis of the calculated statistics, a profile of each language and (or) each encoding is built in the form of a set of byte sequences, with an indication of the weight of each byte sequence in the given language and (or) the given encoding.
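The profile-building step described above could be sketched roughly as follows. This is only a minimal illustration under assumptions that the text does not fix: byte sequences are taken to be fixed-length byte n-grams, the weight is taken to be relative frequency, and the names `byte_ngrams`, `build_profile`, `n` and `top_k` are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 3):
    """Yield every contiguous n-byte sequence of data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(reference_docs, n: int = 3, top_k: int = 1000):
    """Build a profile {byte sequence: weight} from reference documents
    that share one language and (or) one encoding.

    Assumption: weight = relative frequency among all extracted n-grams.
    """
    counts = Counter()
    for doc in reference_docs:
        counts.update(byte_ngrams(doc, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Keep only the top_k most frequent sequences.
    return {seq: c / total for seq, c in counts.most_common(top_k)}


# Hypothetical reference documents for one language/encoding pair.
russian_docs = ["привет мир".encode("utf-8"), "привет всем".encode("utf-8")]
profile_ru = build_profile(russian_docs, n=2, top_k=5)
```

One profile of this kind would be built per language and (or) encoding and stored alongside the search machine.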

It should be noted that, as byte sequences are accumulated for each language and (or) each encoding, operators working at automated workstations can select the most characteristic byte sequences. This operation performs what machine learning calls "dimensionality reduction". Various algorithms can be used for it, in particular calculating the mutual information (see, for example, http://en.wikipedia.org/wiki/Mutual_information) between the occurrence of a byte sequence in labelled documents and the categories with which the documents are labelled. In this case the categories are languages or encodings. For example, suppose the sequence "ABC" occurs in Ukrainian and Russian: in 10 Ukrainian documents and in 15 Russian documents. In total there are 50 documents in Russian and 30 in Ukrainian. Suppose also that the documents are presented in 8 languages in total.

From these numbers the mutual information of the byte sequence (in this case, the sequence "ABC") in these documents is calculated.

The profile of each language and (or) each encoding is usually formed as a vector of the selected byte sequences specific to that language and (or) encoding, one vector per language.

In principle, different collections of documents ("training sets") are used for languages and for encodings, but the actions are the same in either case. The languages may be not only natural languages but also programming languages. The encoding can be anything: for example, ASCII or any other encoding, including a specially designed one.
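As an illustration of such a calculation, the sketch below computes mutual information from a 2x2 contingency table (sequence present or absent, document inside or outside the category). It is one common formulation rather than necessarily the one intended here. The text does not state the total size of the collection, so `TOTAL_DOCS` is a purely hypothetical value, and the sketch also assumes that "ABC" occurs in no language other than Russian and Ukrainian.

```python
from math import log2


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """Mutual information (in bits) between 'document contains the sequence'
    and 'document belongs to the category'.

    n11: in category, contains sequence    n10: not in category, contains it
    n01: in category, lacks sequence       n00: not in category, lacks it
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each cell is paired with its row total (sequence present or absent)
    # and its column total (inside or outside the category).
    for cell, row, col in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if cell:  # a zero cell contributes nothing to the sum
            mi += (cell / n) * log2(cell * n / (row * col))
    return mi


TOTAL_DOCS = 200                 # hypothetical: not given in the text
with_abc = {"ru": 15, "uk": 10}  # documents containing "ABC"
docs_in = {"ru": 50, "uk": 30}   # documents per language

scores = {}
for lang in ("ru", "uk"):
    n11 = with_abc[lang]
    n10 = sum(with_abc.values()) - n11   # "ABC" in the other language
    n01 = docs_in[lang] - n11
    n00 = TOTAL_DOCS - n11 - n10 - n01
    scores[lang] = mutual_information(n11, n10, n01, n00)
    print(lang, round(scores[lang], 4))
```

Sequences with the highest scores for a category would be candidates for the "most characteristic" set; the threshold or number of sequences kept is a design choice left open by the text.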

Next, a search machine is built to extract the sought byte sequences from the byte stream of the document to be scanned. This search machine can be built on the basis of the Aho-Corasick algorithm for exact matching of a set of patterns (see, for example, http://e-maxx.ru/algo/aho_corasick or http://aho-corasick.narod.ru). The built search machine and the profiles of languages and encodings, collectively referred to as the ontology of this collection of languages and encodings, are saved in memory for further use when subsequent text documents are scanned.

Now, when the byte stream of a scanned document arrives at the input, the search machine is used to find byte sequences in the electronic version of each scanned document, and the statistics of occurrence of the found byte sequences are calculated on the basis of the stored ontology. In other words, the built and saved search machine automatically detects in the scanned document the byte sequences selected from the reference documents, and the statistics of their occurrence are calculated. These statistics constitute the profile of the scanned document. This profile (the calculated statistics) of the scanned document is then compared with the stored profiles of languages and encodings to determine the relevance of each language and (or) each encoding to the scanned document. The language and (or) encoding taken as the result of these actions is, for example, the language or encoding with the greatest degree of relevance.

Thus, the method according to the present invention expands the range of technical means that make it possible to automatically determine the language and (or) encoding of text in any text documents on the basis of previously collected statistics.
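A search machine of this kind could be sketched as a small Aho-Corasick automaton over bytes. The class below is a minimal textbook-style illustration, not the implementation used in the invention; the name `ByteSearchMachine` and its methods are hypothetical.

```python
from collections import deque


class ByteSearchMachine:
    """Aho-Corasick automaton over bytes; finds all patterns in one pass."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-state transitions: byte value -> state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # patterns recognised on reaching each state
        for p in dict.fromkeys(patterns):  # drop duplicate patterns
            if p:                          # ignore empty patterns
                self._add(p)
        self._build_failure_links()

    def _add(self, pattern: bytes):
        state = 0
        for b in pattern:
            if b not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][b] = len(self.goto) - 1
            state = self.goto[state][b]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep failure link 0.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for b, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and b not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(b, 0)
                # Inherit the patterns that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, data: bytes):
        """Yield (end_index, pattern) for every occurrence in data."""
        state = 0
        for i, b in enumerate(data):
            while state and b not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(b, 0)
            for pattern in self.out[state]:
                yield i, pattern


machine = ByteSearchMachine([b"he", b"she", b"his", b"hers"])
found = list(machine.find(b"ushers"))
```

The automaton reads the document's byte stream once, regardless of how many byte sequences the profiles contain, which is the usual reason for choosing this algorithm when the pattern set is large.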
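The comparison step could look roughly like the sketch below. The text does not specify the measure of relevance, so cosine similarity between the document profile and each stored profile is used here purely as an assumption; the function names, the labels and the profile values are all hypothetical.

```python
from collections import Counter
from math import sqrt


def document_profile(matches):
    """Relative frequency of each known byte sequence found in a document."""
    counts = Counter(matches)
    total = sum(counts.values())
    return {seq: c / total for seq, c in counts.items()} if total else {}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse weight vectors (assumed measure)."""
    dot = sum(w * b.get(seq, 0.0) for seq, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def detect(doc_profile: dict, profiles: dict):
    """Return (label, relevance) pairs, most relevant first."""
    scores = {label: cosine(doc_profile, p) for label, p in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical stored profiles and search-machine output.
profiles = {
    "ru/utf-8": {b"\xd0\xbe": 0.6, b"\xd0\xb0": 0.4},
    "en/ascii": {b"th": 0.7, b"he": 0.3},
}
matches = [b"th", b"he", b"th"]  # sequences reported by the search machine
ranking = detect(document_profile(matches), profiles)
best_label = ranking[0][0]
```

Returning the full ranking rather than a single label keeps the degree of relevance for every language and encoding available, so a caller can apply its own threshold before accepting the top candidate.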

1. A method of automated detection of the language and (or) encoding of a text document, which consists in:
- extracting byte sequences from the electronic versions of reference documents whose texts are written in the corresponding language and (or) encoded in the corresponding encoding;
- calculating statistics of occurrence of the extracted byte sequences for the reference documents whose texts are written in one language and (or) encoded in one encoding;
- building, on the basis of the calculated statistics, a profile of each language and (or) each encoding in the form of a set of byte sequences, with an indication of the weight of each byte sequence in the given language and (or) the given encoding;
- building a search machine to extract the sought byte sequences from the byte stream of a scanned document;
- saving in memory the built search machine and the profiles of languages and encodings, collectively referred to as the ontology of this collection of languages and encodings;
- finding byte sequences in the electronic version of each scanned document using the search machine;
- calculating, as the profile of the scanned document, the statistics of occurrence of the found byte sequences on the basis of said ontology;
- comparing the calculated profile of the scanned document with said profiles of languages and encodings to determine the relevance of each language and (or) encoding to the scanned document.

2. The method according to claim 1, in which the most characteristic byte sequences are selected for each language and (or) each encoding.

3. The method according to claim 1, in which said search machine is built on the basis of the Aho-Corasick algorithm for exact matching of a set of patterns.



 

Similar patents:

FIELD: information technology.

SUBSTANCE: first version and at least one cell associated with the document are received, wherein at least one cell has a cell identifier and the cell identifier is associated with the first version, having at least one first version identifier. Each of the at least one first version identifiers presents cell status at a moment in time, and the coverage area defines a plurality of cells and versions, the coverage area including at least one root object. Updates for a first computing device are received. The updates indicate the identifier of the updated version, associated with each cell, associated with the document. The first version of each cell is stored if the first version identifier matches the identifier of the updated version of the cell. A new version of each cell is generated, wherein generation of the new version includes assigning a new version of the identifier of the new version if the identifier of the first version of the cell does not match the identifier of the updated version of the cell. Any cell on which there were no links in root objects is deleted and the document is synchronised by replacing cells with a new version of each cell.

EFFECT: reduced volume of altered information.

12 cl, 6 dwg

FIELD: information technologies.

SUBSTANCE: the associative technological event identifier implements a circuit for identifying expected design events/conditions of a control system, determined by the readings of primary process control sensors, and whenever such events occur it generates alternative design data for direct control of the process, without using software or processor resources, in asynchronous mode and at the moment the control data arrive. It has a multi-layer architecture that organises an address-free memory space and gives input information equal and asynchronous access to each layer; with respect to input data, all layers are interconnected memory cells with elements for data comparison and for control of the recording procedure.

EFFECT: increased indices of reliability, trustworthiness and validity.

3 cl, 3 dwg

FIELD: radio engineering, communication.

SUBSTANCE: the disclosed system for controlling, collecting and processing data with onboard spacecraft recording equipment includes at least one onboard recording equipment unit connected by at least two communication channels to a data control and processing unit, which is connected to onboard spacecraft equipment through at least one communication channel for subsequent collection of information on Earth. The data control and processing unit includes: an interfacing device, a self-contained timer, a single-board computer, a forced cooling system, a heat sensor system, a storage unit, a synchronous data transmission unit, a secondary power unit and a command transmission and power distribution system.

EFFECT: easier and reliable simultaneous connection to different onboard recording equipment.

7 cl, 2 dwg

FIELD: oil and gas industry.

SUBSTANCE: the proposed method involves acquiring oil deposit databases related to oil-field objects. A self-organising map (SOM) is formed by assigning each of multiple data fields to one of multiple SOM maps. Each of multiple oil-field objects is assigned to one of multiple SOM positions on the basis of a pre-determined SOM algorithm for presenting statistical patterns in the oil deposit databases. A stochastic database is formed from the oil deposit databases on the basis of an artificial neural network. The oil deposit databases are screened, on the basis of the stochastic database, to identify candidates among the oil-field objects. Each candidate is assessed in detail, and an oil-field object is selected from the candidates on the basis of the detailed assessment. Oil-field operations are performed for the chosen oil-field object.

EFFECT: improving assessment accuracy of oil-field objects.

22 cl, 23 dwg

FIELD: information technology.

SUBSTANCE: method for digital distribution of media content using a distribution main line system comprises steps of receiving a media content request from a client, the request including the profile of the client; performing inventory check and analysis of source assets by iteratively going through the client profile to generate output data; mapping functionalities, wherein several rules enable to map source assets onto the client profile; and scheduling the production process, which determines work elements and execution steps based on functionalities mapped in response to the media content request from the client.

EFFECT: automation of a process which downloads content in digital format and seamlessly manages said content by delivering to the client.

27 cl, 23 dwg

FIELD: radio engineering, communication.

SUBSTANCE: information on characteristics of weapons of each party is switched; the information is stored in a first memory unit; the information is supplemented with characteristics of a backup group with variable input time; information on weapons of all groups is used to pre-evaluate characteristics of their difference and determine coefficients of independent combat superiority of party A over groups B1, B2; the obtained information is used to select a strategy of combat operations; the remaining weapons of all groups are determined; intermediate characteristics of groups and the outcome of combat operations are stored in a second memory unit and read therefrom, and then transmitted to inputs of a display unit, where information on the outcome of combat operations of party A is displayed: win, loss, draw; the remaining weapons in groups: type of strategy, delayed backup, type of difference, values of coefficients of combat superiority and coefficients of distribution of weapons.

EFFECT: high combat efficiency and effectiveness of operations with different groups, rapid planning of the selection of the optimum target distribution strategy.

2 cl, 5 dwg

FIELD: information technology.

SUBSTANCE: method creating an audio scene for an avatar in a virtual environment comprises the following steps: creating a link structure in a virtual environment between a plurality of avatars; reproducing an audio scene for each avatar based on its connection with other avatars connected by the links; wherein the created link structure is configured to determine the angle for reproducing the audio scene and/or the attenuation coefficient for applying to audio streams on input links. The angle for reproducing the audio scene corresponds to angles of links between said each avatar and other avatars connected by links; the link structure includes a minimum spanning tree. Loops are introduced into the minimum spanning tree such that the minimum length of the loops is shorter than a predetermined value so as to prevent echo in the reproduced audio scenes.

EFFECT: solving a task such as creating voices which really seem to originate from avatars in a virtual environment.

12 cl, 6 dwg

FIELD: information technology.

SUBSTANCE: information on unit indicators of compared means is switched, recorded in a first memory unit, sent to a worst quality and best quality reference forming unit, which forms the corresponding beginning and end of a straight line which defines a quality estimation scale; planes perpendicular to that straight line are made through points of the compared means in the space of the unit indicators; parameters of the points of intersection with the estimation scale are found, values of which form complex quality indicators of the compared means, the maximum value of one of which corresponds to the preferred means.

EFFECT: high security of devices.

2 cl, 2 dwg

FIELD: information technology.

SUBSTANCE: system for hosting interactive audio/video (A/V) streaming with short waiting time includes a plurality of servers on which one or more applications are executed. The system also includes a network with input routing, which receives packet streams from users and routes these packets to one or more said servers, wherein said packet streams include user control signal input, wherein one or more said servers is configured to calculate A/V data in response to user control signal input. Furthermore, the system includes a compressing unit which is connected to receive A/V data from one or more servers and derive therefrom streaming compressed A/V data with short waiting time. The system also includes an output routing network which routes streaming compressed A/V data with short waiting time to each user over a communication channel through an interface.

EFFECT: high quality of A/V data transmitted over a communication channel.

29 cl, 40 dwg

FIELD: information technology.

SUBSTANCE: device for simulating the process of choosing a commodity has an array of m*n first registers, second registers whose number equals the number of rows of the array, adders whose number equals the number of rows of the array, AND element units whose number equals the number of rows of the array, third and fourth registers whose number equals the number of columns of the array, an array of m*n divider units, an array of multiplier units, a unit of OR elements, a maximum code selecting unit, a decoder, four delay elements and a flip-flop.

EFFECT: broader functional capabilities by providing selection of the best version of a commodity based on given consumer criteria.

1 dwg

FIELD: message boards and mail servers.

SUBSTANCE: the electronic message board is provided with a database of several known words, selected in a specific order, each word being connected to a respective URI. The message text from a user computer is checked against the plurality of known words. When the message text does not include any of the known words, the message is placed on the electronic board. When a known word is found in the text, the known word is converted to hypertext format with the URI connected to the word as the link destination, and the message is placed on the message board.

EFFECT: higher efficiency.

7 cl, 4 dwg

FIELD: computers.

SUBSTANCE: device has control trigger, random pulse generators, block for forming program of functioning of modeled multimode system, working modes and technological mode blocks, operation time counters, random pulses generators, OR block, orders counters.

EFFECT: broader functional capabilities.

3 dwg

FIELD: computers.

SUBSTANCE: device has matrix of m rows and n columns for homogenous environment, maximum detection block, adder, memory block, n blocks for counting units, block for estimating channels load levels, containing pulse generator, element selection multiplexer, element selection decoder, row selection decoder, m OR elements, m triggers, m counters of channel load, row number counter, column number counter, group of m blocks of forbidding elements.

EFFECT: broader functional capabilities.

5 dwg

FIELD: computers.

SUBSTANCE: the method includes detecting the connection of a user computer to a sub-domain Web site in a national language and selecting a service for registering a domain name in the national language; running a software extension that automatically forms a combination of English alphabet symbols matching the national-language domain name; determining whether this combination of English symbols has previously been registered as an existing domain name; and, if not, registering it as a domain name.

EFFECT: higher efficiency.

4 cl, 7 dwg

FIELD: computer science.

SUBSTANCE: device has random time ranges generators, imitating specific usage modes, elements AND, OR, triggers, delay elements, random numbers generators, decrypters and differentiative elements, providing for modeling of dynamics and specifics of operation of surface mobile measuring complex.

EFFECT: higher precision, broader functional capabilities.

2 dwg

FIELD: computer science.

SUBSTANCE: device has input registers, NOT elements, comparison blocks, counters, indication blocks, delay elements, adders, division blocks, Or elements, commutators, subtraction blocks.

EFFECT: broader functional capabilities, higher efficiency.

8 dwg

FIELD: measuring equipment.

SUBSTANCE: device has harmonic signals generator, synchronization pulse generator, counter, rectangular pulse generator, key, quadrature demodulator block, digital block for adjusting synchronization phase, analog-digital generator of tone frequencies signals, three ADC, three buffer registers, three digital filters, five multiplexers for temporal division, six arithmetical subtracter-adders, six arithmetical multiplication blocks with accumulation, microprocessor set with input and output data registers, data processing block, micro-programmed control block and synchronous pulse generator and indication block.

EFFECT: higher precision, higher efficiency, broader functional capabilities.

1 dwg

FIELD: radiolocation.

SUBSTANCE: the method includes analog-digital conversion of the received signal reflected from targets; calculating complex correlation sums of a sample of the received signal and reference quadrature signals, with values of the resolution parameter of the reference signal taken on an even mesh; determining the widest intervals of values of the resolution parameter inside which all moduli of the correlation sums exceed the detection threshold. In intervals containing more than one local maximum, each local maximum is judged to correspond to one target. For an interval containing one local maximum, the width of the interval is calculated; if the width exceeds a threshold width, the local maximum is judged to correspond to two targets; otherwise the minimal mean-square mismatch between the counts of the complex correlation sums and the counts of the standard correlation sums of a single-target signal is calculated, and the local maximum is judged to correspond to one target if the mismatch is less than the mismatch threshold, and to two targets otherwise.

EFFECT: higher efficiency.

3 dwg

FIELD: equipment characteristics prognosis technologies.

SUBSTANCE: the device forms a statistical model for predicting equipment characteristics. The device receives input data in the form of an equipment parameter that includes multiple values appropriate for the parameter. The input data are fed into the model, a data set corresponding to the model's response to the input data is formed, a system of equations is formed as a model of the equipment characteristic, and the received data are statistically processed to form a probability image of the equipment characteristic.

EFFECT: higher efficiency.

6 cl, 4 dwg

FIELD: computer science.

SUBSTANCE: device has two registers blocks, inputs of which are device parameters inputs, pulses multiplication block, four multiplication blocks, comparator, clock pulses generator, adder block, two subtraction blocks, block for multiplication by zero, counter, division block, integrator and register.

EFFECT: broader functional capabilities.

1 dwg
