Method of identifying personal data in open sources of unstructured information

FIELD: information technology.

SUBSTANCE: personal data are identified by linguistic techniques implemented on a data collection server, a linguistic processing server and an application server. The disclosed method includes creating a task based on open-source crawl parameters received from an administrator's automated workstation. The method further includes: loading texts, either by crawling open sources or by receiving texts from an external system; extracting links from the loaded texts and adding them to the addresses for further crawling; extracting text and converting binary files to a text format; analysing the text prepared for analysis and identifying entities in it; selecting the entities in the text that are personal data; identifying personal data; and identifying facts of personal data in the text (entities identified at the previous step that are associated with persons).

EFFECT: high relevance of results when identifying personal data in open information sources and in text files of the most common formats.

7 cl, 3 dwg

 

Technical field

The invention relates to the field of information technology. The described method of identifying personal data is intended for detecting and cleaning (replacing) personal data obtained from monitored sources of information.

Background art

Public authorities, municipal authorities, and legal or natural persons that organize or carry out the processing of personal data, as well as third parties that gain access to personal data, must ensure the confidentiality of such data, which includes a prohibition on publishing the data in open materials without the consent of the data subject.

Federal Law No. 152-FZ of 27 July 2006 "On personal data" defines personal data as any information relating to a natural person (data subject) who is identified or identifiable on the basis of such information. Examples of such data about a natural person are: 1) surname, name, patronymic; 2) year, month, date and place of birth; 3) address; 4) family, social and property status; 5) education; 6) profession; 7) earnings (except for persons whose incomes are subject to public scrutiny, for example officials); 8) other information.

The application of anonymisation procedures to personal data is established by Resolution of the Government of the Russian Federation No. 211 of 21 March 2012 and is mandatory for operators of personal data.

Federal Law of the Russian Federation No. 152-FZ of 27 July 2006 "On personal data" (No. 152-FZ) requires operators of personal data to take numerous measures, both organizational and technical, to build a system for protecting personal data.

Public authorities that exercise control and supervision over compliance with No. 152-FZ, as well as operators of personal data (as defined in the act), need tools for monitoring compliance with the legislation. Such tools must handle large volumes of unstructured open-source data in order to discover personal data automatically and efficiently and to clean (replace) it in those texts while preserving their semantic integrity. These tools also need flexible configuration mechanisms so that they can be adjusted to the applicable law.

There are many products with different features designed to support these tasks. Often, however, a considerable amount of the work of cleaning (replacing) and transforming materials that contain personal information has to be done manually.

A known solution is RU No. 2096824 C1, IPC G06F 15/16, G06F 17/60, a device for the automated processing of information materials for personalized use. The system detects in the processed information materials the elements of their content that are described by the characteristics of the user's information needs, records the presence of such elements, and then uses these elements and their combinations when presenting the contents of the processed materials to the user. Processing is carried out interactively: the separate semantic fragments into which the processed materials are subdivided are displayed one after another in a form suited to their type. When a semantic relationship is identified between the content of a fragment and one or another element of the characteristics of the user's information needs, the relationship is recorded by forming an individual attribute for each element with which the fragment is semantically related. When the fragment is related to different elements to different degrees, an attribute is formed that assigns these elements to different levels of the user's information needs according to the number of gradations of the relationship identified. An image of the local structure of the semantic fragment is then formed as an undirected labeled graph whose vertices correspond to the elements of the characteristics of the user's information needs that are related to the semantic content of the fragment; the graph is fully connected.
In that patent, the result of processing is displayed in human-readable form, with multiple edges replaced by geometric images whose sizes or colors correspond to their multiplicity, and with a numeric indication of the multiplicity of the edges of the resulting graph.

This known solution identifies semantic elements of a text on the basis of pre-established contextual features, but it does not make it possible to monitor, detect and clean (replace) fragments of personal data or personal data in its entirety.

Disclosure of the invention

The claimed method controls and cleans personal data (replacing it with code words) in the stream of textual content from open sources of information and in the texts of downloaded files, while preserving the semantic integrity of the original text of the electronic document, message or publication. The claimed method makes it possible to ensure compliance with Federal Law No. 152-FZ of 27 July 2006 "On personal data" across the large information space of open information sources using modern information technologies.

In one embodiment, the method is performed in three stages: monitoring, detection and cleaning (replacement) of personal data. Monitoring of personal data is the systematic collection and processing of information; it can be used to improve the control of personal data handling and its automation and, indirectly, to educate the public about compliance with the requirements of the Federal Law. The disclosed method is thus a tool for assessing compliance with No. 152-FZ. The claimed method performs one or more of three functions determined by the law:

- detects open sources of information that publish personal data, so that supervisory authorities can assess the scale and systematic nature of violations and choose appropriate sanctions, and so that the owners of the sources can eliminate the causes of potential violations of the law;

- provides a factual basis for carrying out supervisory procedures in accordance with No. 152-FZ;

- establishes conformity with the regulations of No. 152-FZ.

The claimed method solves the following set of tasks:

1) monitoring of open information sources (websites of public authorities, print mass media, Internet media, blogs and forums, other open sources);

2) semantic analysis of open-source texts and downloaded text files, and identification of personal data in Russian-language texts;

3) linguistic processing of materials from open sources in order to increase the accuracy and convenience of analysing information about detected personal data and of measuring the damage to the person whose data have been published, including: clustering and categorizing texts, ranking, genre classification of texts, identifying objects in the text (person, organization, brand, geographical concept), calculating the quality index of an object, determining the role of an object identified in the text, and identifying a person's direct speech in the text;

4) removal (replacement) and cleaning of personal data while preserving the semantic integrity and structure of the source text, with formation of a separate dictionary of substitutions;

5) recovery of personal data in the source text by using the dictionary of substitutions.
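Tasks 4 and 5 can be illustrated with a minimal Python sketch. It assumes that detection has already produced non-overlapping (offset, length) spans; the code-word format `[PD_0001]`, the helper `spans_of` and all function names are hypothetical and are not taken from the described System.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (offset, length) of one detected personal-data fragment


def spans_of(text: str, fragments: List[str]) -> List[Span]:
    """Locate every occurrence of each fragment (a stand-in for the detector)."""
    spans: List[Span] = []
    for fragment in fragments:
        start = text.find(fragment)
        while start != -1:
            spans.append((start, len(fragment)))
            start = text.find(fragment, start + len(fragment))
    return spans


def depersonalize(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace non-overlapping spans with code words; return text and dictionary."""
    dictionary: Dict[str, str] = {}  # code word -> original fragment
    codes: Dict[str, str] = {}       # original fragment -> code word
    parts: List[str] = []
    cursor = 0
    for offset, length in sorted(spans):
        fragment = text[offset:offset + length]
        if fragment not in codes:
            # Assumed code-word format; it must not occur in the source text.
            codes[fragment] = f"[PD_{len(codes) + 1:04d}]"
            dictionary[codes[fragment]] = fragment
        parts.append(text[cursor:offset])
        parts.append(codes[fragment])
        cursor = offset + length
    parts.append(text[cursor:])
    return "".join(parts), dictionary


def restore(clean_text: str, dictionary: Dict[str, str]) -> str:
    """Recover the source text from the cleaned text and the dictionary."""
    for code, fragment in dictionary.items():
        clean_text = clean_text.replace(code, fragment)
    return clean_text


source = "Ivan Petrov was born on 12.03.1975. Write to Ivan Petrov at ivan@example.org."
spans = spans_of(source, ["Ivan Petrov", "12.03.1975", "ivan@example.org"])
clean, dictionary = depersonalize(source, spans)
```

Because the same fragment always receives the same code word, the cleaned text keeps its structure and co-references, and the dictionary can be stored separately from the text so that only its holder can restore the original.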

The claimed method is implemented by a system of devices and software modules (the System) that contains the following functional elements: a server for collecting data from open sources of information; a linguistic processing server, which includes a linguistic processor; and an application server, which provides the interface between the System and the user.

The server for collecting data from open sources of information is coupled to a network interface that provides network access to open sources of information (the Internet) and data transmission to the network interface of the linguistic processing server.

The linguistic processing server contains a linguistic processor for text processing and a database.

The linguistic processor performs the following text processing:

1. Identification of an information object and ranking of the significance of its mention in the message text (main, supporting or episodic role). The information extracted by this procedure is used as auxiliary input for assessing the potential damage from the publication of personal data when the quality index is calculated.

2. Categorization of texts, in which each text is assigned to one of three categories (the set can be extended hierarchically during operation using the standard tools of the System): public, private, closed. Categorization is needed to simplify setting the search criteria for violations and to order how violations are displayed in the System interface. Publication of information about a person's position, place of work or education, with rare exceptions, poses no threat; as a rule, such data are open. In contrast, the passport series and number, the person's address, information about income, etc. are private information. Specialists responsible for detecting personal data can, in particular, limit the size of the resulting sample and increase the relevance of search results using the standard tools of the System (by configuring the categorization through the introduction of child subcategories, so that only the data categories of interest appear in the search results and the System reports).

3. Genre classification of texts. Taking the genre of a text into account is needed to maintain statistics on the publication of personal data and to understand which kinds of texts (news; interviews; analytics; analytical articles; TV; talk shows; legislation; press releases; essays; reviews; ratings; etc.) are potentially the most dangerous. Considering the statistics obtained from genre classification together with those from the categorization of personal data appears particularly useful.

4. Identification of groups of linguistically similar texts and clustering of incoming information materials. Clustering is needed, for example, to detect a shift in the focus of risk from the publication of personal data and to define new categories for the categorization of personal data. In addition, knowing that texts from a single source belong to a single cluster makes it possible to exclude superfluous data from the search results. For example, information on the incomes of officials, unlike that of other categories of the population, is open, and the income declarations published on agency websites will form a cluster. Taking this into account makes it possible to exclude the declarations from the results of a search for documents containing private personal information.

5. Identification of the direct speech of information objects. The information extracted by this procedure is used as auxiliary input for assessing the potential damage from the publication of personal data when the quality index is calculated (see item 7 of this section).

6. Determination of the amount of airtime of the stories covering an information object. The information extracted by this procedure is used as auxiliary input for assessing the potential damage from the publication of personal data when the quality index is calculated (see item 7 of this section).

7. Calculation of the quality index for the identified information objects, which reflects a qualitative assessment of the attitude of the open-source text toward the object. The quality index is calculated from the following data: the influence of the source (calculated from frequently updated data on how often it is cited), the number of pages, the size of the text, the presence of an illustration, the role of the object in the text, the presence of quotations of the object in the text, and the nature of the mention of the object in the text (negative or positive). The quality index makes it possible, in particular, to assess the degree of potential harm from the publication of personal data in a particular source.
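The description names the inputs of the quality index in item 7 but does not give a formula. The Python sketch below only illustrates how such inputs could be combined into a signed score; every weight, the 5000-character size cap, the multiplicative form and all names (`Mention`, `quality_index`, `ROLE_WEIGHT`) are assumptions made for this illustration, and the number of pages is left out for brevity.

```python
from dataclasses import dataclass

# Assumed weights for the role of the object in the text (not from the source).
ROLE_WEIGHT = {"main": 1.0, "supporting": 0.5, "episodic": 0.2}


@dataclass
class Mention:
    source_influence: float  # 0..1, e.g. derived from citation data
    text_size: int           # size of the text in characters
    has_illustration: bool
    role: str                # "main", "supporting" or "episodic"
    has_direct_speech: bool  # the object is quoted in the text
    tone: int                # -1 negative, 0 neutral, +1 positive


def quality_index(m: Mention) -> float:
    """Signed score: the sign follows the tone, the magnitude the prominence."""
    prominence = ROLE_WEIGHT[m.role]
    # Illustrations and quotations each raise prominence by an assumed 25 %.
    prominence *= 1.0 + 0.25 * m.has_illustration + 0.25 * m.has_direct_speech
    # Longer texts count more, saturating at an assumed 5000 characters.
    size_factor = min(m.text_size / 5000.0, 1.0)
    magnitude = m.source_influence * prominence * (0.5 + 0.5 * size_factor)
    return -magnitude if m.tone < 0 else magnitude


main_negative = Mention(0.8, 5000, True, "main", True, -1)
episodic_positive = Mention(0.8, 5000, False, "episodic", False, 1)
```

With these assumed weights, a negative main-role mention in an influential source yields a large negative score, while an episodic mention in the same source yields a small positive one, which matches the stated purpose of estimating potential harm.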

Thus, the data collection servers crawl the specified open sources and transfer the pre-processed texts to the linguistic processing server. After a text has been parsed, and depending on the result, the parsed text, information about the crawl and the addresses of the source links are stored in the database of the linguistic processing server. Access to the analysed texts, the records of the analysis results, statistical data, etc. stored on the linguistic processing server is provided by the application server through the web interface of the analyst's automated workstation (AWS). The web interface of the administrator's workstation also runs on this server. In addition, the web interfaces of the analyst's and administrator's workstations store user settings.

In addition, the System infrastructure includes switching equipment and a firewall.

The key (but not the only) functions provided by the claimed method are:

- collection and storage of text data from open sources in the System database. The input data of the System (depending on the source) are as follows:

1. HTML and XML files: Internet blogs and forums, Internet media;

2. DOC, DOCX, RTF, XLS, XLSX, CSV, TXT, HTML and PDF files: websites of public authorities (GIA);

3. PDF files: official print media;

4. PDF, HTML and XML files: Internet media;

5. multimedia files: video from federal television channels.

- identification in unstructured texts of persons and their related attributes (position, place of employment, business phone number, email address, information on income and education, date of birth, place of birth, personal mobile phone number, taxpayer identification number (INN), family ties, passport details, driving licence, bank card attributes, vehicle registration number, home phone, address of registration or residence, etc.).

- identification in the texts of other entities (organizations, brands, geographical concepts, etc.) in order to make the analysis of texts containing personal information more efficient.

- categorization of unstructured texts in order to assign texts containing personal information to one of the categories, with the possibility of introducing new categories (subcategories) using the standard tools of the System and, accordingly, of fine-tuning the monitoring, analysis and reporting of the System. The preset categories in the System are:

- Public data: data that do not make it possible to establish the whereabouts of the person (outside work) or of his family members or to contact them, and whose publication poses no threat to reputation or is mandatory by virtue of official duties. Documents that uniquely identify a person, as well as information about racial origin and religious beliefs, do not belong to the public category. Public data include: 1) name; 2) gender; 3) position; 4) place of work; 5) business phone number; 6) business address; 7) education (except education documents); 8) information about military registration (excluding the registration document data); 9) profession; 10) information on income and property (for civil servants).

- Private data, available on job search sites, classified ads, dating sites and phone books: 1) date of birth; 2) place of birth; 3) personal e-mail address; 4) personal mobile phone number; 5) citizenship; 6) TIN; 7) family ties.

- Closed data: 1) all passport data; 2) driving licence; 3) the number and other attributes of a bank card (CVV, expiry date); 4) income and property (for persons other than civil servants); 5) vehicle registration number; 6) home telephone number; 7) address of registration or residence.

- clustering of information materials to identify groups of incoming texts, intended to refine the categorization procedures by grouping texts that share types of personal data. Clustering texts by shared types of personal data leads to the formation of clusters within a category. To avoid clusters that are too large and uninformative, clusters include only texts containing at least 3 types of personal data. Information on the incomes of officials, unlike that of other categories of the population, is open, and the income declarations published on agency websites will form a cluster of open information. The corresponding directory of public figures contains the list of officials whose personal data belong to the permitted category. An example of such a category is "Income of an official".

- ranking of an object in the text (determining the role in which the object is mentioned in the text: main or episodic) in order to increase the accuracy and convenience of analysing information about detected personal data and of measuring the damage to the person whose data have been published.

- genre classification of texts in order to increase the accuracy and convenience of analysing information about detected personal data, in particular to gather information for various studies and to clarify the target audience, and also as supporting information when facts of personal data detection have to be investigated.

- calculation of the quality index of an object identified in the text in order to increase the accuracy and convenience of analysing information about detected personal data and of measuring the damage incurred or potentially incurred by the person whose data have been published, taking into account the influence (including citation) of the source of the text with personal data, and also as supporting information when facts of personal data detection have to be investigated.

- identification of a person's direct speech in the text in order to increase the accuracy and convenience of analysing information about detected personal data and to identify the original source of the information used in the text, focusing on the author of the quotation.

- rating of the sources that publish personal data by the number of violations, with the possibility of grouping by type of open source (media, blog, etc.).

- filtering out of sources that, for one reason or another, are of no interest for monitoring purposes, in order to make the analysis of texts containing personal information more efficient.

- filtering out of texts that contain personal information about public figures from the corresponding directory, which is edited using the standard tools of the System, in order to make the analysis of texts containing personal information more efficient.

- adjustment of the sensitivity of the linguistic processor using the standard tools of the System (a mode of increased accuracy of personal data detection and a mode of maximum completeness of personal data detection), in order to make the analysis of texts containing personal information more efficient.

- selection of the retrospective depth and of different analytical views for visualizing the results of linguistic processing in the form of reports.

- transformation of the unstructured text received by the System into a structured form (XML format) that contains information about the personal data and objects identified in the text.

- export of the results of linguistic processing in one of the most common formats (doc, docx, rtf, xls, xlsx, csv, txt, xml, pdf).

- cleaning (replacement) of personal data in the text while preserving the semantic integrity and structure of the source text, with the possibility of replacing elements of personal data with code words and saving the dictionary of substitutions in a separate file or database.

- saving in the System database of factual data about the publication of personal data: the text of the publication in its original format, a link to the source, the date and time of publication, the text of the document produced by substituting the personal data, and records of the actions of the System servers in the activity log (the log file).
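The three preset categories named in the list above (public, private, closed) can be modelled as a lookup from attribute type to category. The sketch below is illustrative only: the attribute keys, the reading of the third group as "closed", and the rule that a text takes the most restrictive category among its attributes are assumptions, not statements from the description.

```python
from enum import IntEnum
from typing import Iterable


class Category(IntEnum):
    """Ordered so that a larger value means more restrictive."""
    PUBLIC = 0
    PRIVATE = 1
    CLOSED = 2


# Hypothetical attribute keys mapped to the preset categories.
ATTRIBUTE_CATEGORY = {
    "name": Category.PUBLIC,
    "position": Category.PUBLIC,
    "place_of_work": Category.PUBLIC,
    "business_phone": Category.PUBLIC,
    "education": Category.PUBLIC,
    "profession": Category.PUBLIC,
    "date_of_birth": Category.PRIVATE,
    "place_of_birth": Category.PRIVATE,
    "personal_email": Category.PRIVATE,
    "personal_mobile": Category.PRIVATE,
    "citizenship": Category.PRIVATE,
    "tin": Category.PRIVATE,
    "family_ties": Category.PRIVATE,
    "passport": Category.CLOSED,
    "driving_licence": Category.CLOSED,
    "bank_card": Category.CLOSED,
    "vehicle_number": Category.CLOSED,
    "home_phone": Category.CLOSED,
    "home_address": Category.CLOSED,
}


def categorize(attribute_types: Iterable[str]) -> Category:
    """Assumed rule: a text gets the most restrictive category it contains."""
    return max(
        (ATTRIBUTE_CATEGORY[a] for a in attribute_types),
        default=Category.PUBLIC,
    )
```

Subcategories, which the System allows to be added with its standard tools, could be represented by extending the mapping values with a second, finer-grained key.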

The claimed method provides tools that reduce the amount of manual work and make it possible to connect additional sources for processing. The mechanisms for identifying and cleaning (replacing) data are defined declaratively and are used both with newly connected data sources and with files in common text formats. The claimed method provides efficient and reliable execution of all phases of collecting, processing, monitoring, analysing and transforming data for multiple sources and large data sets.

Summary of the invention

The technical result achieved by the present invention is high relevance of search results when identifying personal data in open sources of information and in text files of the most common formats, together with the results of processing unstructured text information, including:

1. obtaining factual data confirming the publication of personal data, in material form (an electronic or paper report) on the results of personal data detection;

2. forming a depersonalized text that does not contain personal data, and a dictionary of substitutions of personal data;

3. forming a set of analytical reports.

Said technical result is achieved by collecting data from open sources, semantic parsing and analysis of text, and linguistic processing to identify personal data, objects and related attributes in the processed natural-language (Russian) texts. The synergistic effect of the invention is achieved by applying the methods and results of linguistic processing of unstructured text to the tasks of identifying personal data and cleaning (replacing) it while preserving the semantic integrity and structure of the texts. The technical result is achieved in particular by collecting and storing text data from open sources in the System database and subsequently applying linguistic processing methods, which provides at least the following:

- identification in unstructured texts of persons and their related attributes (position, place of employment, business phone number, email address, information on income and education, date of birth, place of birth, personal mobile phone number, taxpayer identification number, family ties, passport details, driving licence, bank card attributes, vehicle registration number, home phone, address of registration or residence);

- identification in the texts of other entities (organizations, brands, geographical concepts);

- clustering of information materials to identify groups of incoming texts, intended to refine the categorization procedures by grouping texts that share types of personal data. Clustering texts by shared types of personal data leads to the formation of clusters within a category. To avoid clusters that are too large and uninformative, clusters include only texts containing at least 3 types of personal data. Information on the incomes of officials, unlike that of other categories of the population, is open, and the income declarations published on agency websites will form a cluster of open information. The corresponding directory of public figures contains the list of officials whose personal data belong to the permitted category. An example of such a category is "Income of an official";

- ranking of an object in the text (determining the role in which the object is mentioned in the text: main or episodic) in order to increase the accuracy and convenience of analysing information about detected personal data and of measuring the damage to the person whose data have been published;

- genre classification of texts in order to increase the accuracy and convenience of analysing information about detected personal data, in particular to gather information for various studies and to clarify the target audience, and also as supporting information when facts of personal data detection have to be investigated;

- calculation of the quality index of an object identified in the text in order to increase the accuracy and convenience of analysing information about detected personal data and of measuring the damage incurred or potentially incurred by the person whose data have been published, taking into account the influence (including citation) of the source of the text with personal data, and also as supporting information when facts of personal data detection have to be investigated;

- identification of a person's direct speech in the text in order to increase the accuracy and convenience of analysing information about detected personal data and to identify the original source of the information used in the text, focusing on the author of the quotation;

- adjustment of the sensitivity of the linguistic processor using the standard tools of the System (a mode of increased accuracy of personal data detection and a mode of maximum completeness of personal data detection), in order to make the analysis of texts containing personal data more efficient;

- transformation of the unstructured text received by the System into a structured form (XML format) that contains information about the personal data and objects identified in the text;

- cleaning of personal data from the text while preserving the semantic integrity and structure of the source text, with the possibility of replacing elements of personal data with code words and saving the dictionary of substitutions in a separate file or database;

- saving in the System database of factual data about the publication of personal data: the text of the publication in its original format, a link to the source, the date and time of publication, the text of the document produced by substituting the personal data, and records of the actions of the System servers in the activity log (the log file).

The main effect of the method disclosed in the invention lies in applying these linguistic technologies to obtain statistical, analytical and factual information about the facts and sources of publication of personal data; in the flexible and convenient configuration of the tools and functions implemented in the invention, which solve both the main tasks and auxiliary ones, such as quantifying the damage incurred (or potentially incurred) by a person whose data were published; and in the possibility of increasing the level of security of the storage and transmission of personal data thanks to the depersonalization functions.

The present invention assumes the existence of at least one server running Microsoft Windows Server, which makes it possible to rely on Microsoft technical solutions for the system and application software used in the System. The analyst's and administrator's workstations are platform independent (cross-platform).

The claimed method is implemented using the following means:

- data collection servers;

- linguistic processing servers;

- application servers.

Brief description of the drawings

Fig.1 is a diagram of a System that implements the claimed method.

Fig.2 is a block diagram of the method of identifying personal data.

Fig.3 is a diagram of data acquisition and processing.

Implementation of the invention

According to Fig.1, the System that implements the method of identifying personal data in open sources of unstructured information includes a personal data cleaning device 1, which accesses open sources of information 12 over the Internet 13 through a firewall 11.

The System that implements the claimed method contains at least one server 2 for collecting data from open information sources 12, connected to network interface 4. The collection server 2 crawls and collects information from the open sources included in the list of sources to be processed, which can be extended using the standard tools of the System. In addition, the collection server 2 extracts links and text from the downloaded documents. The collected information is stored in database 10 on the linguistic processing server 3.

In addition to identifying entities and facts in incoming texts, the linguistic processing server 3 performs categorization, genre classification and clustering of materials, identification and ranking of objects in the text, detection of persons, calculation of the quality index, and identification and cleaning (replacement) of personal data. The results and the information needed to perform these tasks are stored in a database on the linguistic processing server 3.

The application server 14 supports the web interfaces of the analyst's and administrator's workstations and provides storage of user settings and access control.

According to Fig.1, the method of identifying personal data is implemented using the collection server 2, which contains at least the following modules: a module for collecting personal data, a parsing module, a module for processing and loading parsed texts, and a module for identifying personal data.

In one embodiment of the claimed method, the linguistic processing server 3 contains a personal data cleaning module 8.

According to Fig.2, the algorithm of one embodiment of the invention begins with creating a task, using the administrator's AWS, for crawling open sources 12. The parameters of the created task are transmitted to the collection module of the collection server 2 (see Fig.1).

The personal data collection module performs the following steps of the claimed method:

A1. Loading text. At this stage the sites of open sources 12 are crawled and texts are downloaded, or texts are transmitted from an external system (for example, a media monitoring system or a file storage);

A2. Link extraction. At this stage links (URLs) are extracted from the downloaded texts and added to the addresses for further crawling. If there are no links, the crawl stops.
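Steps A1 and A2 amount to a breadth-first crawl that stops when no new links remain. The following Python sketch shows the idea with the standard library only; the `fetch` callback, the `max_pages` limit and the in-memory `PAGES` dictionary are hypothetical stand-ins for the network access of the collection server.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Step A2: collect the href values of <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_urls: List[str],
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Step A1: download texts; newly found links extend the queue."""
    queue = deque(start_urls)
    seen = set(start_urls)
    texts: Dict[str, str] = {}
    while queue and len(texts) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # source unavailable: skip it
            continue
        texts[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return texts                  # the queue is empty: no links left to follow


# In-memory pages replace real HTTP requests in this sketch.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": "no links here",
}
result = crawl(["http://example.org/"], PAGES.get)
```

Passing the download function as a parameter keeps the traversal logic testable without network access; a real collection module would also have to respect the crawl parameters set in the task.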

The parsing module performs the following steps of the method:

B1. Receiving text. At this stage texts are received from the collection module.

B2. Text extraction. At this stage text is extracted from the received documents, and binary files are converted to a text format;

B3. Determining the language and code page. At this stage the language and encoding of the text are determined.
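The description does not say how step B3 determines the code page and language. One simple heuristic for Russian text, shown below as an assumption rather than the System's actual technique, tries strict UTF-8 first and otherwise picks the legacy encoding whose decoding yields the most lowercase Cyrillic letters (text in cp1251 decoded as koi8-r, and vice versa, comes out mostly in upper case). The function names and the candidate list are hypothetical.

```python
from typing import Tuple


def lowercase_cyrillic(text: str) -> int:
    """Count lowercase Russian letters; ordinary prose is mostly lowercase."""
    return sum("а" <= ch <= "я" or ch == "ё" for ch in text)


def decode(raw: bytes, candidates=("cp1251", "koi8-r")) -> Tuple[str, str]:
    """Return (text, encoding) for raw bytes using a frequency heuristic."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    best = max(
        candidates,
        key=lambda enc: lowercase_cyrillic(raw.decode(enc, errors="replace")),
    )
    return raw.decode(best, errors="replace"), best


def guess_language(text: str) -> str:
    """Very rough guess: 'ru' when most letters are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cyrillic = sum("\u0400" <= ch <= "\u04ff" for ch in letters)
    return "ru" if cyrillic / len(letters) > 0.5 else "other"
```

The heuristic is deliberately crude: it cannot separate Russian from other Cyrillic-script languages and can fail on very short or all-uppercase texts, so a production system would more likely rely on statistical language models.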

The module for processing the parsed texts of open sources 12 and loading them into the System database performs the following steps of the claimed method:

V1. Receiving text. At this stage the text prepared for analysis arrives from the parsing module.

V2. Sending for personal data identification and receiving the results. At this stage the text is sent to the personal data identification module and, when processing is complete, the results are returned to the processing and loading module;

V3. Sending for linguistic processing and receiving the results. At this stage the text is sent for processing to linguistic processor 6 and the analysed text is received back;

V4. Indexing text. At this stage the text is indexed for later loading into the database;

V5. Saving text. The parsed and indexed text is stored in the System database.

The personal data identification module performs the following steps of the claimed method:

G1. Text analysis and entity extraction. At this stage, morphological analysis is performed and entities are extracted from the text (such as named and unnamed entities and special entities);

G2. Building a semantic network. At this stage, the module forms a network containing all the entities mentioned in the text: names of objects and persons, actions and characteristics, connected by various types of syntactic and semantic relations;

G3. Identifying personal data. At this stage, semantic templates stored in the System are used to select fragments of text that contain an entity constituting an element of personal data together with the person to whom this entity belongs, i.e. relations between the entities identified at step G1.
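
The idea of a template that links a person to an element of personal data can be sketched as follows. This is a deliberate simplification: regular expressions over English text replace the morphological analysis and semantic network of the described method, and the templates, the `Fact` record and `find_facts` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    person: str
    kind: str
    value: str
    offset: int  # position of the value in the text
    length: int


PERSON = r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)"
# Hypothetical templates: each links a person entity to one personal data entity.
TEMPLATES = [
    ("birth_date", re.compile(PERSON + r" was born on (?P<value>\d{2}\.\d{2}\.\d{4})")),
    ("residence", re.compile(PERSON + r" lives in (?P<value>[A-Z][a-z]+)")),
]


def find_facts(text):
    """Return the facts matched by any template, ordered by position in the text."""
    facts = []
    for kind, pattern in TEMPLATES:
        for m in pattern.finditer(text):
            facts.append(Fact(m.group("person"), kind, m.group("value"),
                              m.start("value"), len(m.group("value"))))
    return sorted(facts, key=lambda f: f.offset)
```

Each fact carries the offset and length of the detected value, which is the information a later replacement step would need.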

The personal data identification module is configured to highlight personal data in the text, including detecting mentions of persons in the text and additional personal information associated with the identified person. Semantic analysis of the data also reveals links between persons. The personal data identification module performs:

- Processing of the database 5 of the linguistic processor 6, containing the texts of articles, in order to identify personal data.

- Identification and extraction of personal data in the flow of information from the database 5 of the linguistic processor 6.

- Identification of facts with personal data based on templates.

- Saving of the offset and length of the personal data to the database.

The settings of the personal data identification module use the following dictionaries and rules:

- morphological dictionary of the Russian language;

- morphological dictionary of Russian names;

- semantic dictionary of the Russian language;

- dictionary of geographic names;

- keywords; keywords of foreign organizations;

- names of the months;

- names of the months in the nominative case;

- names of the months in the genitive case;

- rules for splitting text into tokens (pieces of text: sentences and tokens delimited on the basis of punctuation and HTML tags);

- dictionary of keywords denoting an individual entrepreneur;

- dictionary for fuzzy recognition of keywords denoting an individual entrepreneur;

- rules for recognizing a correctly written TIN;

- rules for recognizing numbers;

- rules for recognizing the series and number of a passport;

- rules for recognizing correctly written accounts;

- rules for recognizing dates;

- rules for recognizing incorrectly written accounts;

- rules for recognizing an incorrectly written TIN;

- rules for recognizing the date of birth;

- rules for recognizing the place of birth;

- rules for merging several tokens into one token;

- a rule for extracting vehicle registration numbers;

- a rule for extracting e-mail addresses;

- auxiliary rule for extracting names of organizations;

- a rule for extracting dates;

- a rule for extracting dates;

- a rule for extracting complex dates;

- a rule for extracting dates;

- a rule for extracting geographical names;

- auxiliary rule for extracting geographical names forming part of addresses;

- auxiliary rule for extracting names of persons (full);

- a rule for extracting parts of addresses;

- a rule for extracting full addresses;

- a rule for extracting foreign addresses;

- a rule for extracting passport data;

- a rule for extracting passport data;

- a rule for extracting positions;

- a rule for extracting positions;

- auxiliary rule for extracting complex geographical names;

- auxiliary rule for extracting names of persons (individual);

- a rule for extracting telephone numbers;

- rules for replacements.
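
The description does not disclose the content of these rules. As an illustration of what a rule distinguishing a correctly written TIN from an incorrectly written one might check, the sketch below assumes the publicly documented check-digit scheme of the Russian taxpayer number (10 digits for organizations, 12 for individuals). The function names are hypothetical.

```python
import re

# Weights of the check-digit scheme (assumed, not taken from the description).
_W10 = (2, 4, 10, 3, 5, 9, 4, 6, 8)
_W11 = (7, 2, 4, 10, 3, 5, 9, 4, 6, 8)
_W12 = (3, 7, 2, 4, 10, 3, 5, 9, 4, 6, 8)


def _check_digit(digits, weights):
    """Weighted sum of the leading digits, modulo 11, then modulo 10."""
    return sum(d * w for d, w in zip(digits, weights)) % 11 % 10


def is_valid_tin(value):
    """True if `value` is a 10- or 12-digit TIN whose check digits match."""
    if not re.fullmatch(r"\d{10}|\d{12}", value):
        return False
    d = [int(ch) for ch in value]
    if len(d) == 10:
        return _check_digit(d, _W10) == d[9]
    return _check_digit(d, _W11) == d[10] and _check_digit(d, _W12) == d[11]


def find_tins(text):
    """Split digit runs of TIN length into correctly and incorrectly written."""
    candidates = re.findall(r"(?<!\d)(?:\d{12}|\d{10})(?!\d)", text)
    valid = [c for c in candidates if is_valid_tin(c)]
    invalid = [c for c in candidates if not is_valid_tin(c)]
    return valid, invalid
```

Keeping the failed candidates in a separate list mirrors the presence of separate rules for correctly and incorrectly written numbers in the list above.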

In one embodiment, when a mention of a person is detected, the personal data identification module extracts the following attributes from the text:

1) Date and place of birth of the person. The date of birth of the person is a complete birth date or a partial date (year, month or a certain period in the past).

2) Residence address of the person. The residence address of a person is any indication of the person's region of residence, city of residence, or a more precise address.

3) Marital status. The marital status of a person is a reference to the current status or an indication of the presence or absence of family relations in the past.

4) Presence of children. The presence of children means an indication of the presence or absence of children.

5) Profession, place of work. Profession and place of work are any indication of the person's occupation and place of work (past and present).

6) Assessment of property status. An assessment of property status is an indication of the presence of property of any kind.

7) Information on income. Information on income is any information indicating the presence of income, sources of income and amounts of income of the person.

8) Information on education. Information on education is any information indicating the place of study, type of education, existence of documents proving education, and dates of study of the person.

The personal data sanitization module 8 performs the following steps of the claimed method:

D1. Formation of replacement fragments. At this stage, the replacement text fragments are formed on the basis of the information about facts of personal data received from the personal data identification module.

D2. Substitution of the formed fragments. At this stage, on the basis of information on the position of personal data in the text, the corresponding parts are marked for removal and the replacement fragments are inserted.

D3. Deletion of personal data. At this stage, text fragments are deleted in accordance with the markup.

The personal data sanitization module 8 comprises a set of technical solutions enabling it to perform:

1. Selection of a code word or phrase from the relevant directory of substitutions in accordance with the item of personal data being replaced (for example, the phrase "A. F. Belyaev, living at the address: Solyanka str., 1/2, apt 47" will be replaced with the phrase "[N], residing at [When]", where N is a natural number calculated incrementally for each occurrence of personal data within the same text).

2. Removal of the identified personal data from the text.

3. Replacement of the personal data identified in the text with identifiers, as described in the example in item 1.

4. Recording in the database 5 of the linguistic processor 6 of the values of the relevant attributes, preserving the links 'original value' - 'replacement'.

5. Recording in the database of processing results of the text obtained after substitution of the personal data identified in the text, and of the lookup table of codes of the replaced personal data.
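
The substitution with incrementally numbered identifiers and the 'original value' - 'replacement' lookup table can be sketched as below. The placeholder format `[KIND N]` written without a space, the span tuples and the function names are assumptions made for the illustration; spans are assumed not to overlap.

```python
def sanitize(text, spans):
    """Replace each (offset, length, kind) span with a numbered placeholder.

    Returns the sanitized text and a lookup table placeholder -> original value.
    """
    result, table, cursor = [], {}, 0
    for number, (offset, length, kind) in enumerate(sorted(spans), start=1):
        placeholder = f"[{kind}{number}]"  # assumed placeholder format
        table[placeholder] = text[offset:offset + length]
        result.append(text[cursor:offset])  # keep the text before the span
        result.append(placeholder)
        cursor = offset + length
    result.append(text[cursor:])
    return "".join(result), table


def restore(sanitized, table):
    """Put the original values back using the lookup table."""
    for placeholder, original in table.items():
        sanitized = sanitized.replace(placeholder, original)
    return sanitized
```

Storing the table separately from the sanitized text is what allows the personal data to be kept apart from the published version while still permitting restoration.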

The linguistic processor 6 performs the following steps of the method:

E1. Selection and ranking of information objects. At this stage, the role of the object in the text is assessed;

E2. Extraction of direct speech of information objects (persons). At this stage, the direct speech of objects is extracted from the texts.

E3. Genre classification of text. At this stage, the text containing personal data is assigned to one of the genres predefined in the System dictionary.

E4. Determination of the amount of airtime. At this stage, information about the amount of airtime of a television segment is extracted from the text accompanying the video clip of the segment;

E5. Calculation of the quality index. At this stage, the quality of media coverage of the person's activities is assessed on the basis of the characteristics of the text, the influence of the media outlet and some other parameters;

E6. Assignment of the text to a category. At this stage, the category to which the text belongs is determined on the basis of the personal data found in it.

E7. Clustering of texts. At this stage, groups of texts sharing common and most frequently co-occurring types of personal data are identified within each category.

In one embodiment of the claimed method, the application server 14 includes a personal data monitoring processor, which forms the database of processing results, generates analytical reports on the basis of linguistically processed data, and performs filtering and selection of the retrospective depth and analytical sections. The personal data monitoring processor also exports the results of personal data sanitization to the database of processing results in the form of files stored in common data storage formats. On the basis of information stored in the database of the linguistic processor 6, the personal data monitoring processor produces the final report in one of the following presentation types: ratings, dynamics of indicators, regional indicators, clusters. The personal data monitoring processor builds a rating by the number of detected violations concerning personal data and exports the original and processed texts containing personal data in the following formats: doc, docx, rtf, xls, xlsx, txt, xml, pdf.
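
A rating by the number of detected violations can be sketched as a simple count. The description does not say what is being rated; the sketch assumes a rating of sources, and `violation_rating` and the pair format are hypothetical.

```python
from collections import Counter


def violation_rating(violations):
    """Rank sources by the number of detected personal data violations.

    `violations` is an iterable of (source, data_type) pairs.
    """
    counts = Counter(source for source, _ in violations)
    # Sort by descending count, then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```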

The linguistic processor 6 of the linguistic processing server 3 is configured for connection of external modules 9 (SOAP architecture). The external modules 9 extend the capabilities for identifying personal data. The external modules 9 are described in more detail below.

In one embodiment of the invention, the linguistic processor 6 is connected to a text categorization module configured to assign the processed texts to one of the categories of personal data (closed, private, public). The text categorization module helps to identify the most potentially dangerous texts. It assigns a text to a category based on the types of personal data found in it.

The fundamental procedure in this and most other modules that perform linguistic text processing is the extraction of entities and facts from the text, which is performed by the linguistic processor 6.

The text categorization module is a tool for detecting potentially dangerous (from the point of view of publication of personal data) types of text. Joint analysis of the statistics of the categorization module and the genre classification module appears particularly useful. Categorization is performed by recording the relevant attributes in the database 5 of the linguistic processor 6 and assigning the detected personal data to one of the groups: closed, private, public.

The closed category includes texts containing: 1) all passport data; 2) driving licence data; 3) the number and other attributes of a bank card (CVV, expiration date); 4) income and property (not for civil servants); 5) vehicle registration number; 6) home telephone number; 7) address of registration or residence.

The private category includes texts containing conditionally protected data that is available on job search sites, in advertisements, on dating sites, or published in telephone directories: 1) date of birth; 2) place of birth; 3) personal e-mail address; 4) personal mobile phone number; 5) citizenship; 6) TIN; 7) family ties. This includes data whose publication is not a threat to reputation or is mandatory owing to official duties. For example: date of birth; place of birth; bank account details; personal e-mail address; personal mobile phone number; citizenship; OGRNIP, TIN; data of education documents; data relating to racial or ethnic origin, political opinions, religious or philosophical beliefs; data on marital status and the presence of children.

The public category includes texts that do not make it possible to locate the object (person) or his or her family members or to contact them, as well as data whose publication is not a threat to the reputation of the object or is mandatory owing to official duties; this category does not include texts that uniquely identify a person, or information about the person's racial background and religious beliefs. For example: 1) name; 2) gender; 3) position; 4) place of work; 5) business telephone number; 6) office address; 7) education (except for data of education documents); 8) information about military records (excluding data of the records); 9) profession; 10) information on income and property (for civil servants).
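
The assignment of a text to one of the three categories can be sketched as a lookup over the types of personal data found in it. The type identifiers are illustrative, only part of the lists above is encoded, and the rule that the most restrictive category wins is an assumption; the description does not state how several types in one text are combined.

```python
# Data types taken from the three lists above; the identifiers are illustrative.
CATEGORY_OF_TYPE = {
    "passport": "closed", "driving_licence": "closed", "card_number": "closed",
    "vehicle_number": "closed", "home_phone": "closed", "home_address": "closed",
    "birth_date": "private", "birth_place": "private", "personal_email": "private",
    "mobile_phone": "private", "citizenship": "private", "tin": "private",
    "name": "public", "gender": "public", "position": "public",
    "workplace": "public", "business_phone": "public", "profession": "public",
}
RESTRICTIVENESS = {"public": 0, "private": 1, "closed": 2}


def categorize(found_types):
    """Assign a text to the most restrictive category among its data types."""
    categories = [CATEGORY_OF_TYPE[t] for t in found_types if t in CATEGORY_OF_TYPE]
    if not categories:
        return None  # no known personal data types in the text
    return max(categories, key=RESTRICTIVENESS.__getitem__)
```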

In one embodiment of the invention, the linguistic processor 6 is connected to a genre classification module configured to assign the text to a certain genre group determined by linguistic technologies: news; interviews; analytics; analytical article; TV; talk show; legislation; press release; essays; reviews; ratings; etc.

In one embodiment of the invention, the linguistic processor has a connected quality index calculation module, which provides a comprehensive assessment of the quality of coverage of the object in the text and writes the corresponding attributes to the database 5 of the linguistic processor 6. The calculated system of indices makes it possible to quickly assess the degree of potential harm to the person mentioned in the text from the publication of his or her personal data. The quality index is calculated using the following data: the influence of the open source (calculated on the basis of promptly updated data about its citation); the page number (for texts of print media); the size of the text; the presence of an illustration; the role of the object in the article; quotations of the object in the article; the tone of mentions of the object in the text (positive, negative, neutral).

In one embodiment of the invention, the linguistic processor has a connected search and filter module, which provides searching and filtering of data (texts of the original and processed articles, publications, blog entries; violations concerning personal data; objects, etc.) in the database of the linguistic processor. The module performs the following tasks: context search in the array of source texts, including by a part of a word specified as a criterion; advanced search with selection of one or more parameters; filtering of search results by one or more parameters.
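
The search by a part of a word and the filtering by one or more parameters can be illustrated over in-memory records. The record layout and the function names are hypothetical; the described module works against the database of the linguistic processor.

```python
def context_search(records, fragment):
    """Records whose text contains `fragment` (may be a part of a word)."""
    needle = fragment.lower()
    return [r for r in records if needle in r["text"].lower()]


def filter_records(records, **criteria):
    """Keep records whose fields equal all the given parameter values."""
    return [r for r in records
            if all(r.get(field) == value for field, value in criteria.items())]
```

The two functions compose, so search results can be narrowed further, for example `filter_records(context_search(records, "passp"), genre="news")`.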

In one embodiment of the invention, the application server 14 is connected to an administration module of the System implementing the claimed method. The module provides a user holding the appropriate credentials with flexible tools for configuration and audit and for management of reference books, dictionaries, journals, accounts and user rights, schedules, services, etc.

The administration module performs the following tasks: monitors, on the basis of a schedule set for each open source, the timeliness of receipt of materials in the System; manages the schedule of delivery of materials and notifications about arrivals, errors and other System events; provides the ability to configure the collection server 2 and the linguistic processing server 3. Pre-processing and loading of texts of open sources involves setting the following processing options: maintenance of the list of sources; setting the threshold crawl depth for Internet sources; setting the types of downloadable and excluded files; setting the addresses of web pages to be checked for updates of open source content; setting a disk quota for sources; management of the reference books and dictionaries of the System; viewing the audit log with filtering of records by multiple attributes and the ability to write the System log to a log file.
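
Some of the processing options listed above can be represented as a per-source settings record. This is a sketch only: the field names, default values and the `should_download` helper are hypothetical and are not taken from the System.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass
class SourceSettings:
    """Per-source processing options; field names are illustrative."""
    url: str
    max_crawl_depth: int = 2
    allowed_extensions: set = field(default_factory=lambda: {".html", ".pdf", ".doc"})
    excluded_extensions: set = field(default_factory=lambda: {".exe", ".zip"})
    disk_quota_mb: int = 100

    def should_download(self, link, depth):
        """Apply the depth threshold and the file type lists to one link."""
        if depth > self.max_crawl_depth:
            return False
        suffix = PurePosixPath(urlparse(link).path).suffix.lower()
        if suffix in self.excluded_extensions:
            return False
        # Links without a file extension are treated as ordinary pages.
        return suffix == "" or suffix in self.allowed_extensions
```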

1. A method of identifying personal data, implemented by the following elements: one or more servers for collecting data from open sources of information, one or more linguistic processing servers, and one or more application servers, wherein:
- using the application server, a task is created based on open source crawl parameters received through the administrator's automated workstation (AWS);
- the data collection server includes a collection module, which performs the following steps of the claimed method:
- text is loaded: open sources are crawled and texts are downloaded, or texts are transmitted from an external system;
- links are extracted from the downloaded texts and added to the addresses for further crawling; if no links are found, the crawl stops;
- the data collection server includes a parsing module, which performs the following steps of the claimed method:
- the text is received from the collection module;
- text is extracted and binary files are converted to a text format;
- the data collection server includes a module for processing the parsed texts of open sources and loading them into the database of the linguistic processing servers, which performs the following steps of the claimed method:
- the text prepared by the parsing module is received;
- the text is sent to the personal data identification module for identification of personal data and, once processed, is loaded back from the personal data identification module;
- the data collection server contains a personal data identification module, which performs the following steps of the method:
- the text is parsed and entities are identified, and the entities of personal data in the text are selected;
- personal data is identified: facts (entities identified at the previous step and associated with persons) of personal data in the text are selected.

2. A method according to claim 1, wherein the parsing module additionally determines the language, code page and encoding of the text.

3. A method according to claim 1, wherein the linguistic processing server further comprises a personal data sanitization module, which performs the following additional steps of the claimed method:
- replacement fragments are formed on the basis of information about the personal data identified in the text, received from the personal data identification module;
- the formed fragments are substituted: on the basis of information on the position of personal data in the text, the relevant parts are marked for removal and the replacement fragments are inserted;
- personal data (the text containing it) is removed in accordance with the markup;
- a set of records is formed containing information that allows the detected personal data to be stored separately from the source text and the replaced personal data to be restored in the text.

4. A method according to claim 1, wherein after the personal data is identified and the text is loaded from the personal data identification module, the resulting text is sent for processing to the linguistic processor, which performs the following additional steps of the method:
- the text is indexed for subsequent loading into the database of the linguistic processing servers;
- the parsed and indexed text is saved in the database of the linguistic processing servers;
- information objects are selected and ranked, and the role of the object in the text is assessed;
- direct speech of information objects is extracted from the texts;
- genre classification of the text is performed, assigning the text with personal data to one of the predefined genres;
- the amount of airtime is determined: information about the amount of airtime of a television segment is extracted from the text accompanying the video clip of the segment;
- the quality index is calculated: the quality of open source coverage of the person's activities is assessed on the basis of the characteristics of the text and the influence of the source;
- the text is assigned to a category (categories) on the basis of the personal data found in it;
- clustering of texts is performed: groups of texts with common types of personal data are identified within a category.

5. A method according to claim 1, wherein the application server is connected to an administration module of the System implementing the method.

6. A method according to claim 5, wherein the administration module performs the following additional steps of the method:
- monitors, on the basis of a schedule set for each open source, the timeliness of receipt of materials in the System; manages the schedule of delivery of materials and notifications about arrivals; provides the ability to configure the data collection servers and the linguistic processing servers.

7. A method according to claim 2, wherein the application server contains a personal data monitoring processor, which performs at least one of the following additional steps of the method:
- forms the database of processing results;
- generates analytical reports on the basis of linguistically processed texts;
- performs filtering and selection of the retrospective depth and analytical sections;
- exports the results of personal data sanitization to the database of processing results in the form of reports stored in common data storage formats;
- creates reports of one of the following types: ratings (including ratings by the number of violations concerning personal data), dynamics of indicators, regional indicators, clusters;
- exports the original and processed texts containing personal data in the following formats: doc, docx, rtf, xls, xlsx, txt, xml, pdf.



 

Same patents:

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to means of communication over the Internet for applications and content. The system comprises network infrastructure configured to create a pattern and implement an application supermarket which supports a plurality of users, wherein the application supermarket provides customers with access to online stores which provide digital products, wherein relationships between the plurality of users are flexibly and dynamically specified for at least some of the plurality of users.

EFFECT: high efficiency and reliability when selling and buying applications owing to customisation thereof.

31 cl, 11 dwg

FIELD: physics.

SUBSTANCE: method includes receiving geodetic data for a plurality of locations on a surface, wherein the geodetic data contain information on surface gradient for at least a subset of locations on the surface; generating a set of constraining relations based on the geodetic data, wherein the set of constraining relations correlates undefined values for temporary changes in surface height in the subset of locations on the surface with information on surface gradient included in the geodetic data; the set of constraining relations includes undefined values for temporary changes in surface height at multiple locations on the surface; identifying specific values for temporary changes in surface height at each location on the surface in the subset based on determining the solution of the set of constraining relations.

EFFECT: high accuracy of the model of a geophysical area.

33 cl, 7 dwg

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to information retrieval means. method comprises receiving a request at the input; outputting said request at the output in a plurality of different sources, at least one of which is a public domain search engine and at least one of which is a private domain search engine; receiving, at the input, a list of results from each of said plurality of different sources; determining whether to merge said list of results based on determining relevancy using a merging model; creating, by a processor, a full list of results from the received list of results using a merging model; providing presentation of the full list of results through a user interface; monitoring user behaviour in response to the presented full list of results; using the user behaviour to update the merging model.

EFFECT: improved relevancy of results.

21 cl, 12 dwg

FIELD: physics, video.

SUBSTANCE: invention relates to detection of audio and/or video streams broadcast in real time. The method includes receiving, from a media server, information about a media stream, which includes searching for features indicating that an analysed stream is a source of multimedia broadcast in real time. The features used can be, for example, a parameter which characterises stream Duration and/or a parameter which characterises the Start Time of the stream and/or a Seekable parameter within the transmitted stream.

EFFECT: high reliability of determining streams in real time in a multiple stream environment.

8 cl, 3 dwg, 7 tbl

FIELD: medicine.

SUBSTANCE: invention refers to medical equipment. A method for managing the execution of clinical guidelines involving the stages, whereat: accepting an input comprising a patient's condition; retrieving a set of recommendations corresponding to the above condition; displaying at least a portion of the set of recommendations to the user; accepting the user's selection of recommendations from the set of recommendations, issuing warnings, if the user's selection is rejected from the recommended sequence from the set of recommendation; accepting the input that one of the recommendations has been executed; and changing the display of recommendations on the basis of the above input that one of recommendations has been executed.

EFFECT: automatic management of executing the medical guidelines.

15 cl, 3 dwg

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to database search means. The method includes receiving a request to initiate a search for data for a specific individual; determining, based on the request, a strategy to search a reference database; searching the reference database, in accordance with the strategy, for a match to the request and outputting the match; extracting, from said request, an attribute that is relevant to the search; assigning a weight to the attribute, thus yielding a weighted attribute, wherein said weight is indicative of the usefulness of the attribute in finding a match to the request; establishing a function, based on said weighted attribute; retrieving from the reference database, candidates having attribute values that indicate likely matches to the request, based on said function; determining a best candidate from said candidates and returning said best candidate as the match, wherein the request includes a request value for the attribute; modifying the weight depending on the number of records in the reference database that have the request value for the attribute.

EFFECT: improved match of the result with the request data.

9 cl, 2 dwg, 8 tbl

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to computer engineering and specifically to intelligent automated assistant systems. Disclosed is method of operating an intelligent automated assistant. The method is carried out in an electronic device having a processor and memory which stores instructions for execution by the processor. The processor executes instructions on which a user request is received, wherein the user request includes a speech input received from the user. A prompt is provided to the user, the prompt presenting two or more properties relevant to items of an object selection domain. The user is requested to specify relative importance between the two or more properties.

EFFECT: high accuracy of providing a user with relevant information owing to consideration of relative importance between properties which correspond to items of an object domain.

12 cl, 50 dwg, 5 tbl

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to computer engineering and specifically to intelligent automated assistant systems. Disclosed is method of operating an intelligent automated assistant. The method is carried out in an electronic device having a processor and memory which stores instructions for execution by the processor. The processor executes instructions on which a user request is received, wherein the user request includes a speech input received from the user. Two or more alternative interpretations of user intent are obtained based on the received user request and one or more similarities and one or more differences between said alternatives are identified. Further, the user is presented with a response, said response being at least one of the identified differences.

EFFECT: high accuracy of presenting relevant interpretations of user intent in the correct context.

13 cl, 50 dwg, 5 tbl

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to computer engineering. Proposed method converts all info-important cells of standard down-loads from data bases from data base with indication of their position in every down-load. Definite conditions are set to indicate interrelations between cells in one line of down-load. Converted standard down-loads and named conditions are memorised in definite memory. Revealed are cell of standard down-loads in electronic file of analysed document. Found cells matrix is compiled to apply preset named conditions to matrix of found cells. Compiled the list of conditions whereto corresponds the matrix of found cells. Decision is made on if the portion of standard down-load exists in analysed document which satisfied the preset named conditions.

EFFECT: protection of data stored in protected data base from leaks.

2 cl, 2 dwg

FIELD: information technologies.

SUBSTANCE: in the method of automatic classification of formalised documents in an electronic document circulation system they identify and analyse characteristics of identical text sections (details) in a formalised document, and identified details are analysed. The informative part of the document is converted into text in natural language, document words are transformed into basic wordforms, insignificant words are deleted, word weights are counted in accordance with frequency of their occurrence, forming predicates of text criteria identification. According to the proposed set of manually classified texts they generate a system of predicates of text criteria identification, which is saved in a data base. Values of significant wordform weights are added into the system of predicates. If it is necessary to use a priori information on dependences of information areas between each other, algebra of end predicates is used, which makes it possible to perform operations over logical expressions, with the help of which information areas are described.

EFFECT: reduced time of system operation through making it possible to classify documents by form and identified metadata and to perform analysis only in the informative part of the document.

1 dwg

FIELD: physics, computer engineering.

SUBSTANCE: invention relates to methods of filling electronic glossaries - lists of terms with tags. The method of filling a glossary from a training set of electronic documents using a computer (personal computer, server, etc.) includes forming a training subset, the text of all electronic documents of which contains glossary terms. Characteristic selection criteria are applied to words met in the training subset. Words selected using the criteria are assigned tags and the selected words are optionally assigned a weight. The selected words are added to the glossary with corresponding tags (and weights).

EFFECT: high efficiency of using electronic glossaries in text analysis tasks by enabling assignment of intelligent weights to terms and automatic filling of glossaries with a training set of texts.

16 cl, 13 dwg

FIELD: information technologies.

SUBSTANCE: method to detect text objects consists in the fact that: for each text object to be detected they generate a list of regular expressions, every of which describes this text object; a syntaxic analyser is created, designed for syntaxic analysis of regular expressions; an individual final automaton is generated on the basis of the syntaxic analyser for each regular expression; individual final automatons of all regular expressions are united into at least one search automaton, designed to search for text objects; search automatons are started on the text of the document to be verified to detect lines in it that represent text objects.

EFFECT: expanded arsenal of technical facilities due to creation of a comparatively fast method of detection of text objects.

7 cl

FIELD: information technology.

SUBSTANCE: document-independent context object is created from the current document context object; a copy of the document-independent context object is created. Said copy is modified during analysis of the document to include analysis results of electronic ink of the document-independent context object. A second version of the current context object is received for each ink node in the document-independent context object if there is a corresponding node in the current context object. That node is added to the hash table, in which conformity is given of unique node identifiers in the document-independent context object to node links in the current context object; for each node in the document-independent context object, it is determined whether said corresponding node in the second version of the current context object is different from said each node, and that node is added to the list of nodes to which changes are not extended.

EFFECT: faster processing of electronic ink.

4 cl, 49 dwg
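
The hash-table step described in this abstract can be sketched roughly as below. This is one reading of the abstract under stated assumptions, not the patented implementation: `InkNode`, `reconcile` and the `strokes` field are hypothetical names, and a node is treated as "different" when a plain equality comparison fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InkNode:
    node_id: str    # unique node identifier
    strokes: tuple  # simplified stand-in for the node's ink data


def reconcile(snapshot_nodes, current_nodes):
    """Map snapshot node ids to current nodes and collect nodes that changed.

    snapshot_nodes: nodes of the document-independent context object.
    current_nodes: nodes of the second version of the current context object.
    """
    current_by_id = {node.node_id: node for node in current_nodes}
    id_to_current = {}  # hash table: snapshot node id -> current node
    not_propagated = []  # ids of nodes to which changes are not propagated
    for node in snapshot_nodes:
        current = current_by_id.get(node.node_id)
        if current is None:
            continue  # no corresponding node in the current context object
        id_to_current[node.node_id] = current
        if current != node:
            not_propagated.append(node.node_id)
    return id_to_current, not_propagated


snapshot = [InkNode("a", (1, 2)), InkNode("b", (3,)), InkNode("c", (4,))]
current = [InkNode("a", (1, 2)), InkNode("b", (9,))]
table, skipped = reconcile(snapshot, current)
```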

FIELD: information technology.

SUBSTANCE: in the method of integrating coreference resolution mechanisms, a portion of text is retrieved using the natural language mechanism of a server computer. Coreference within said portion of text is identified using the natural language mechanism of the server computer. A fact is retrieved from said portion of text using the natural language mechanism of the server computer, wherein the fact has a value. Said fact is expanded using the natural language mechanism of the server computer so that it includes a coreference value different from said value and based on the identified coreference.

EFFECT: improved indexing of documents in natural language.

20 cl, 5 dwg
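
Expanding a fact with a coreference value can be illustrated with the small sketch below. It assumes the coreference chains have already been identified by some natural language component; here they are hard-coded, and the names `Fact`, `expand_fact` and `coreference_values` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    relation: str
    value: str
    # Extra values added from identified coreference (assumed field name).
    coreference_values: list = field(default_factory=list)


def expand_fact(fact, coreference_chains):
    """Add to the fact every other mention from a chain containing its value."""
    for chain in coreference_chains:
        if fact.value in chain:
            for mention in chain:
                if mention != fact.value and mention not in fact.coreference_values:
                    fact.coreference_values.append(mention)
    return fact


# Text: "John Smith arrived on Monday. He visited Paris."
chains = [["John Smith", "He"]]  # hard-coded stand-in for coreference resolution
fact = expand_fact(Fact(relation="visited Paris", value="He"), chains)
```

With the expanded fact, an index built over the text could answer a query about "John Smith" even though the sentence that states the fact only says "He".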

FIELD: training.

SUBSTANCE: the invention relates to a method for studying the Chinese character writing system and the writing systems of other languages that are based on Chinese characters. In the method of creating a dictionary for studying multi-symbol characters of writing systems based on Chinese characters, a list of user-recognisable characters is formed first; each of these characters has its own key, which is associated with the appropriate binder and separated from the remainder of the associated key. A complex multi-symbol Chinese character that is to be studied and added to the recognised list is then identified. The user-recognisable character within the multi-symbol character to be written is determined using a mnemonic diagram in the user's language based on the keys and binders associated with the symbol. The multi-symbol character is recorded in terms of the user-recognisable character, and the multi-symbol character is added to the recognised list.

EFFECT: more efficient language learning, simpler memorisation and faster learning of written characters, and an increase in the number of characters that a particular person can write and memorise in the process of creating the dictionary.

26 cl, 23 dwg

FIELD: information technology.

SUBSTANCE: a data model can represent a data storage system such that the data storage system is a database-based file system. A data manipulation component can manipulate data associated with the data model and enforce at least one of the constraints and characteristics to ensure the integrity of the system. In addition, an application programming interface (API) component can be invoked to manipulate data within the data storage system.

EFFECT: data manipulation.

18 cl, 12 dwg, 23 tbl

FIELD: information technology.

SUBSTANCE: the system includes a localisation platform, a matching component and a plurality of localisation content components. The localisation platform comprises a resource manager that includes a data gathering component. The localisation content contains resources localised by at least one input data source that provides said resources, together with metadata and context information identifying each input data source and the associated localised content it has provided. The data gathering component stores the localisation content in the localisation content components and, based on the metadata associated with the localisation content, allows an input data source to modify only the localisation content that it provided through that data gathering component.

EFFECT: easy localisation of content and software.

18 cl, 6 dwg
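
The rule that an input data source may modify only the localisation content it supplied can be sketched as a simple ownership check on stored metadata. This is an illustrative assumption about how such a check might look; `DataGatheringComponent`, `submit` and the source identifiers are hypothetical.

```python
class DataGatheringComponent:
    """Stores localised resources together with the id of the providing source."""

    def __init__(self):
        # resource key -> {"text": localised text, "source": providing source id}
        self._content = {}

    def submit(self, source_id, key, text):
        """Store or update a resource; only the original provider may update it."""
        existing = self._content.get(key)
        if existing is not None and existing["source"] != source_id:
            raise PermissionError(
                f"{source_id} did not provide {key!r} and cannot modify it"
            )
        self._content[key] = {"text": text, "source": source_id}

    def get(self, key):
        return self._content[key]["text"]


store = DataGatheringComponent()
store.submit("vendor_a", "menu.file", "Datei")
store.submit("vendor_a", "menu.file", "Akte")  # same source: update allowed
try:
    store.submit("vendor_b", "menu.file", "Ordner")  # other source: rejected
    rejected = False
except PermissionError:
    rejected = True
```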

FIELD: information technology.

SUBSTANCE: apparatus employs a decoder which consists of external code-controlled transpositional elements which perform cross-cluster rearrangement of an input data vector in one cycle of an external clock pulse generator.

EFFECT: possibility of high-speed cross-cluster rearrangement of data elements using control codes.

5 cl, 3 dwg

FIELD: information technology.

SUBSTANCE: the method provides a preview that automatically shows the intended outcome of applying a given control to data. This is useful when analysing spreadsheet data by formatting certain data according to a condition. The method involves identifying one or more data parameters on the display that are to be formatted according to the condition, selecting a predefined condition and automatically applying it temporarily to the parameter(s), and displaying a temporary preview of the predefined condition applied to the data that satisfy it. The method also allows the conditions and parameters applied to the data to be changed provisionally, and automatically provides a corresponding preview of the effect of applying the altered conditions to the displayed data.

EFFECT: faster formatting of displayed data.

27 cl, 28 dwg
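
A temporary preview of a conditional format can be sketched as a function that computes the formatted view without touching the underlying data. This is a minimal sketch under assumed names (`preview_condition`, the `"bold"` format label), not the method claimed in the patent.

```python
def preview_condition(cells, condition, fmt):
    """Return a temporary view pairing each value with the format it would get.

    The original list of cells is not modified, so the preview can be
    discarded or recomputed with a different condition at any time.
    """
    return [(value, fmt if condition(value) else None) for value in cells]


cells = [5, 12, 8, 20]

# Preview a predefined condition, then preview an altered one.
preview = preview_condition(cells, lambda v: v > 10, "bold")
altered = preview_condition(cells, lambda v: v > 6, "bold")
```

Because the preview is a separate value, changing the condition only requires calling the function again; the displayed data are committed only when the user accepts the result.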

FIELD: information technologies.

SUBSTANCE: the invention automatically detects the category of a document being printed and prevents unauthorised printing. In the method, the document being printed is analysed for the presence of confidential information. The system comprises a user device, a printing device, a print control service server, a converter unit, a database server, a file storage, a recognition unit, a context analysis server and an alarm service.

EFFECT: information security is ensured, and document flows that contain confidential information and require a high degree of control are detected.

2 cl

FIELD: remote training systems.

SUBSTANCE: the system has an arrangement for providing training services through a network; an arrangement for transmitting texts associated with training aids; an arrangement for receiving an answer through the network and evaluating it; an arrangement for transmitting the result of the evaluation to the user; a database of members who support training; an arrangement that receives a support request from the user through the network and selects a supporting member for training in the required field of specialisation; and an intermediary arrangement that acts as mediator in connecting the contact address of the selected supporting member with the user through the network.

EFFECT: training services can be provided in a remote system, with training that changes dynamically depending on the evaluated degree of comprehension and with corresponding support.

6 cl, 9 dwg
