Processing multi-lingual business data


Page 1: Processing multi-lingual business data

SALES RELAUNCH F&Q SESSION

Page 2: Processing multi-lingual business data

Multi-lingual data processing: the CIS and Georgia

Olga Rink, director general

Page 3: Processing multi-lingual business data


Content

Interfax - Dun & Bradstreet, Innovations in Multi-lingual context

• Business environment

• Main stages of processing multi-lingual business data

o Naming convention

o Transliteration

o Matching

• Seeding and verifying objects in media coverage

Page 4: Processing multi-lingual business data


Official languages, population (mn) and Russian as a second language (est.)


Page 5: Processing multi-lingual business data


Multi-lingual environment


Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population, est.

Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | 100%

Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100%

Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90%

Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | 100%

Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100%

Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | 100%

Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | 100%

Moldova | Romanian | 3.6 | Latin | Russian is widely used | 90%

Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90%

Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100%

Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | 100%

Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100%

* The Constitution of Dagestan defines "Russian and the languages of the peoples of Dagestan" as the state languages

• The bulk of newly registered businesses is available in Cyrillic or Latin script

Page 6: Processing multi-lingual business data


• For Slavic languages we use the ISO 9:1995 standard with one exception: a combination of Latin characters is used instead of a Latin character with a diacritic.

Example: Ч is rendered as Ch (without diacritic) instead of Č (with diacritic)

• ISO 9985 is used for Armenian

• ISO 9984 is used for Georgian

• ООО «Ъ» (Trade style: OOO TVERDY ZNAK): the transliterated name is OOO “”, so there is no way to find the company by its original name

• Minor changes in transliteration, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered during updates

• Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» brief company name from the Charter is used as the primary name; the indication of the legal form in the name (required by law) is placed at the end, after a comma.

• The second name is the transliterated full legal name.

• Trade style contains the official name in English/Latin or trademarks

• We use rule-based and machine learning approaches in collecting data, identifying objects, developing credit scores and digesting media coverage
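The transliteration exception and the naming convention described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the Interfax implementation: the base table is ISO 9:1995 for Russian letters, only the Č > Ch substitution is taken from the slide, and the remaining ASCII substitutions, the list of legal forms, the upper-casing and the function names transliterate and normalize_name are all hypothetical.

```python
# ISO 9:1995 Latin equivalents for the 33 lowercase Russian letters.
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ISO9_LATIN = ("a b v g d e ë ž z i j k l m n o p r s t u f h c "
              "č š ŝ ʺ y ʹ è û â").split()
ISO9 = dict(zip(CYRILLIC, ISO9_LATIN))

# Plain-Latin replacements for diacritic characters.
# Only "č" -> "ch" comes from the slide; every other entry is an assumption.
ASCII_OVERRIDES = {
    "č": "ch", "ž": "zh", "š": "sh", "ŝ": "shch", "ë": "yo",
    "è": "e", "û": "yu", "â": "ya", "ʺ": "", "ʹ": "",
}

# Assumed, non-exhaustive list of transliterated Russian legal forms.
LEGAL_FORMS = ("OOO", "PAO", "ZAO", "AO")


def transliterate(text: str) -> str:
    """Transliterate Russian Cyrillic to diacritic-free Latin."""
    out = []
    for ch in text:
        latin = ISO9.get(ch.lower())
        if latin is None:          # not a Russian letter: keep as is
            out.append(ch)
            continue
        latin = ASCII_OVERRIDES.get(latin, latin)
        out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


def normalize_name(brief_name: str) -> str:
    """Move a leading legal form to the end, after a comma (assumed upper case)."""
    cleaned = brief_name.translate(str.maketrans("", "", "«»\"“”")).strip()
    tokens = cleaned.split()
    if tokens and tokens[0].upper() in LEGAL_FORMS:
        rest = " ".join(tokens[1:]).upper()
        return f"{rest}, {tokens[0].upper()}"
    return cleaned.upper()


print(normalize_name(transliterate("ООО «Чайка»")))  # CHAJKA, OOO
print(normalize_name(transliterate("ООО «Ъ»")))      # , OOO
```

The second call reproduces the ООО «Ъ» problem from the slide: once the hard sign is dropped, nothing is left of the name except the legal form, which is why a separate trade style is needed to find such a company.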

Page 7: Processing multi-lingual business data


Natural Language Processing and Machine Learning

The SCAN engine leverages vast amounts of text data to enable the next generation of Interfax data products


Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train, and deploy credit and reputation risk models with minimal effort

• Tagging documents and classifying them by text type (media release, forecast, feature, etc.)

Detecting and Disambiguating Named Entities

Support Vector Machine (SVM) or Bayes is used, depending on configuration

• SVM represents a text as a vector to compare with a pattern (prototype); the closeness defines the type

• The Bayes rule is applicable when you rely on pre-determined assumptions (a range of known “symptoms”) while calculating probabilities
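As a rough illustration of the Bayes option for classifying documents by text type, here is a minimal naive Bayes sketch in plain Python. It assumes that "Bayes" on the slide refers to a naive Bayes classifier over word counts, which the source does not confirm; the class name NaiveBayesTextClassifier and the four training sentences are invented for the example and are not SCAN data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace tokens (hypothetical sketch)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs):
        for text, label in docs:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # log prior: share of training documents with this label
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # log likelihood with add-one (Laplace) smoothing
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set, invented for illustration only.
training = [
    ("company announces new product launch today", "media-release"),
    ("company announces quarterly results today", "media-release"),
    ("analysts expect growth next year", "forecast"),
    ("analysts expect prices to fall next quarter", "forecast"),
]

clf = NaiveBayesTextClassifier()
clf.fit(training)
print(clf.predict("analysts expect revenue growth"))  # forecast
print(clf.predict("company announces new office"))    # media-release
```

The known words play the role of the "symptoms" mentioned above: each text type is scored by how probable the observed words are under that type, and the highest-scoring type is chosen.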

Rule-based fact extraction and sentiment analysis

At the initial phase, for seeding named persons:

• Mostly a rule-based approach

• Context analysis and statistics for entity disambiguation

Clarification of Named Entity Detection by learning on a semi-automatically labelled corpus

• Support Vector Machine (SVM)

• A neural network built on the existing rule-based structure is being considered for the future

Page 8: Processing multi-lingual business data


An intellectual WOW effect, or what only SCAN can do: moving towards “verifying” media coverage


Out of 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn have been identified with Spark

2 mn persons were generated (seeded); 75 thousand of them have been verified

300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia

13 thousand trademarks (“Trade style”)

24 thousand sources in Russian

Page 9: Processing multi-lingual business data

Thank You

Interfax – Dun & Bradstreet

www.dnb.ru