Sicore The Insee Automatic Coding System François Bulot April 22, 2003.

Sicore

The Insee Automatic Coding System

François Bulot

April 22, 2003

Plan

- Introduction
- The knowledge bases
- How does the Sicore system work?
- An adequate management structure
- Some important results
- The software package
- Surveys

The Sicore project

- Launched in 1993 by Pascal Rivière
- Written by Éric Meyer and Bruno Berlemont
- Finished in May 1996
- Followed up successively by Pierrette Schuhl, Frédérique Deschamps and François Bulot

The four main objectives of Sicore

- Construct evolving knowledge bases for the variables
- Create an adequate management structure
- Write a generalized software package
  - User-friendly
  - For any variable
  - For any language
- Provide a documented methodology

The knowledge bases: 4 kinds of information

- The reference file: texts <=> codes
- The normalization rules
  - Maximum number of words
  - Maximum length of each word
  - Empty (and blank) characters
  - Empty words
  - Synonyms

The knowledge bases: 4 kinds of information (continued)

- The logical rules: additional variables
- The parameters of the learning algorithm, which cover:
  - the structure of the reference file
  - how to split the words and build the coding tree
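
One way to picture the four kinds of information is as a single record. The sketch below is only an illustration: the class and field names (`KnowledgeBase`, `NormalizationRules` and so on) are hypothetical and are not taken from the Sicore package, which is written in C. The reference texts are those of the occupation example; the word limits, empty characters and empty words are placeholder values chosen for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class NormalizationRules:
    max_words: int                       # maximum number of words kept
    max_word_lengths: list[int]          # maximum length of each kept word
    empty_chars: str = ""                # characters ignored in the text
    empty_words: set[str] = field(default_factory=set)
    synonyms: dict[str, str] = field(default_factory=dict)


@dataclass
class KnowledgeBase:
    reference: list[tuple[str, str]]     # reference file: (text, code) pairs
    normalization: NormalizationRules    # normalization rules
    additional_variables: list[str] = field(default_factory=list)  # used by logical rules
    learning_params: dict[str, int] = field(default_factory=dict)  # learning algorithm


# Placeholder values, for illustration only.
kb = KnowledgeBase(
    reference=[
        ("TAXI DRIVER", "64"),
        ("DOCTOR OF MEDECINE", "31"),
        ("FACTORY WORKER", "62"),
        ("EMPLOYEE", "54"),
        ("SURGEON", "34"),
        ("SURGEON AND DENTIST", "31"),
    ],
    normalization=NormalizationRules(
        max_words=5,
        max_word_lengths=[12, 12, 12, 12, 12],
        empty_chars="'()-_,/",
        empty_words={"OF", "AND"},
    ),
)
```

Keeping all four parts in one object separate from the code that uses them mirrors the separation between software and knowledge bases that the presentation stresses.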

How does the Sicore system work?

- First, the learning phase
- Second, the coding phase

1 - The learning phase: two steps to build the coding tree

The normalization step of the reference file

- Remove empty characters
- Remove empty words
- Replace words (or groups of words) by their synonyms
- Limit the number of words and the length of each word
- Split each word into pieces of two characters: bigrams
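
The five normalization steps can be sketched as a single function. This is a minimal illustration, not the Sicore implementation: the function name `normalize` and its parameters are hypothetical, empty characters are assumed to be replaced by a space, the text is assumed to be upper-cased, one length limit is applied to every word, and only single-word synonyms are handled (Sicore also replaces groups of words).

```python
def normalize(text, empty_chars, empty_words, synonyms, max_words, max_len):
    """Normalize one text and return its words split into bigrams."""
    # 1. Remove empty characters (assumption: each becomes a space).
    table = str.maketrans({c: " " for c in empty_chars})
    words = text.upper().translate(table).split()
    # 2. Remove empty words.
    words = [w for w in words if w not in empty_words]
    # 3. Replace words by their synonyms (single words only in this sketch).
    words = [synonyms.get(w, w) for w in words]
    # 4. Limit the number of words and the length of each word.
    words = [w[:max_len] for w in words[:max_words]]
    # 5. Split each word into pieces of two characters (bigrams).
    return [[w[i:i + 2] for i in range(0, len(w), 2)] for w in words]


bigrams = normalize("Surgeon and dentist", "'()-_,/", {"OF", "AND"}, {}, 5, 12)
# bigrams == [["SU", "RG", "EO", "N"], ["DE", "NT", "IS", "T"]]
```

A word with an odd number of characters ends with a one-character piece, as with the final "N" of SURGEON.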

Example

"Occupation" reference file

TEXT                   CODE
TAXI DRIVER            64
DOCTOR OF MEDECIN      31
FACTORY WORKER         62
EMPLOYEE               54
SURGEON                34
SURGEON AND DENTIST    31

Normalization and truncation of words into bigrams

BIGRAMS                  CODE
TA XI DR IV ER           64
DO CT OR ME DE CI NE     31
FA CT OR Y WO RK ER      62
EM PL OY EE              54
SU RG EO N               34
SU RG EO N DE NT IS T    31

1 - The learning phase: two steps to build the coding tree (continued)

To build the coding tree, Sicore:

- Takes the normalized reference file as input
- Computes the position of the word piece which gives the biggest amount of information (Shannon information)
- Builds all branches which correspond to this position
- For each branch, computes again the position which gives the biggest amount of information
- Builds the next branches
- Repeats this process until each branch uniquely identifies a code
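
The construction can be sketched as a short recursive function. This is an illustration under stated assumptions, not Sicore's algorithm: the amount of information of a position is taken here to be the Shannon entropy of the bigrams found at that position, ties go to the earliest position, and Sicore's priority and redundancy bigram parameters are ignored, so the tree built here need not match the one Sicore produces. All names (`piece_at`, `entropy`, `build_tree`) are hypothetical.

```python
from collections import Counter
from math import log2


def piece_at(words, position):
    """Return the bigram at (word index, bigram index), or "" if absent."""
    w, b = position
    if w < len(words) and b < len(words[w]):
        return words[w][b]
    return ""


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = sorted(Counter(values).values())
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)


def build_tree(records):
    """records: list of (words, code); words is a list of lists of bigrams."""
    codes = sorted({code for _, code in records})
    if len(codes) == 1:
        return {"codes": codes}              # the branch identifies one code
    positions = sorted({(w, b)
                        for words, _ in records
                        for w, word in enumerate(words)
                        for b in range(len(word))})

    def info(p):
        return entropy(piece_at(words, p) for words, _ in records)

    best = max(positions, key=info, default=None)
    if best is None or info(best) == 0:
        return {"codes": codes}              # texts cannot be told apart
    groups = {}
    for words, code in records:
        groups.setdefault(piece_at(words, best), []).append((words, code))
    return {"position": best,
            "branches": {p: build_tree(g) for p, g in groups.items()}}


# Normalized occupation reference file from the example.
REFERENCE = [
    ([["TA", "XI"], ["DR", "IV", "ER"]], "64"),
    ([["DO", "CT", "OR"], ["ME", "DE", "CI", "NE"]], "31"),
    ([["FA", "CT", "OR", "Y"], ["WO", "RK", "ER"]], "62"),
    ([["EM", "PL", "OY", "EE"]], "54"),
    ([["SU", "RG", "EO", "N"]], "34"),
    ([["SU", "RG", "EO", "N"], ["DE", "NT", "IS", "T"]], "31"),
]
tree = build_tree(REFERENCE)
```

Each split with positive entropy produces at least two strictly smaller groups, so the recursion always terminates. A missing bigram is treated as the blank piece "", which is how a short text such as SURGEON can be separated from SURGEON DENTIST.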

Example

Coding tree (root: 2nd bigram, 1st word)

- "XI" => 64 (Taxi driver)
- "CT" -> 1st bigram, 1st word
  - "DO" => 31 (Doctor medecine)
  - "FA" => 62 (Factory worker)
- "PL" => 54 (Employee)
- "RG" -> 1st bigram, 2nd word
  - "DE" => 52 (Surgeon dentist)
  - " " => 52 (Surgeon)

2 - The coding phase

- Normalization of the file to be coded
- Pattern recognition algorithm: determines a code using the coding tree
  - Failure: the pattern of the text is not recognized => no code
  - Complete success: the pattern is recognized and a code is obtained
  - Partial success: the pattern is recognized but the text is too ambiguous
- The decision step for the complete success: set of logical rules and additional variables => code
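
The pattern recognition step amounts to walking the coding tree with the bigrams of the normalized text. The sketch below is a hypothetical illustration: the tree is transcribed from the example slide, the dictionary layout and the function `code_text` are not Sicore's, and representing an ambiguous text as a leaf that holds several candidate codes is an assumption made here. The decision step (logical rules and additional variables) is not modelled.

```python
# Coding tree transcribed from the example; positions are
# (word index, bigram index), both counted from 0.
TREE = {
    "position": (0, 1),                               # 2nd bigram, 1st word
    "branches": {
        "XI": {"codes": ["64"]},                      # taxi driver
        "CT": {"position": (0, 0),                    # 1st bigram, 1st word
               "branches": {"DO": {"codes": ["31"]},  # doctor medecine
                            "FA": {"codes": ["62"]}}},  # factory worker
        "PL": {"codes": ["54"]},                      # employee
        "RG": {"position": (1, 0),                    # 1st bigram, 2nd word
               "branches": {"DE": {"codes": ["52"]},  # surgeon dentist
                            "": {"codes": ["52"]}}},  # surgeon
    },
}


def code_text(tree, words):
    """Return (status, codes) for a normalized text given as lists of bigrams."""
    node = tree
    while "branches" in node:
        w, b = node["position"]
        # A missing word or bigram is read as the blank piece "".
        piece = words[w][b] if w < len(words) and b < len(words[w]) else ""
        if piece not in node["branches"]:
            return "failure", []                      # pattern not recognized
        node = node["branches"][piece]
    if len(node["codes"]) == 1:
        return "complete success", node["codes"]
    return "partial success", node["codes"]           # several candidates remain


status, codes = code_text(TREE, [["FA", "CT", "OR", "Y"], ["WO", "RK", "ER"]])
# status == "complete success", codes == ["62"]
```

Only the bigrams on the path from the root to a leaf are examined in this sketch, so two different texts that agree on those positions receive the same code.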

Sicore circle

[Diagram. Labels: Knowledge base file, Verification of the coherence, Learning of knowledge base, Compiled knowledge base, File to be coded, General coding tool, Coded records, Non-coded record, Analyse]

An adequate management structure

- To ensure that the knowledge bases are regularly updated
  - The variable expert, the Sicore expert
- To properly incorporate automatic coding in survey data processing
- To ensure that all concerned parties (3) join forces to attain the common goal

The documented methodology

- As of now: 3 documents written
- The user's guide
- A dictionary with the important words and concepts
- The methodology guide: how Sicore works, how to construct the knowledge bases, how to verify the coherence of the knowledge bases
- The programmer's guide
- At the moment, only in French

Surveys coded for Occupation

- All INSEE surveys since the last Census (1999): surveys on living conditions (PCV), Household Consumption Survey, Health Survey, Continuous Employment Survey (LFS)…
- Before: PCV from 1997, the survey on household wealth, test for the national Census (1997)
- Many regional surveys
- Surveys for other national organisations

Other variables

- Communes for the national Census
- Nationalities/countries for the Census
- Diploma and training levels
- Activities for the Time Use Survey
- Consumption products and shops for the Household Consumption Survey
- Geocoding in the Réunion Island
- Activities of the firms (4 sources: agriculture, administrative body responsible for collecting social security payments, Chamber of Commerce, Guild Chamber)

The use for the French National Census in 1999

- Batch process
  - "Light" run: communes of study, of the previous place of residence, of the workplace; country of birth, of the previous place of residence; nationality
  - "Heavy" run: present and previous occupations
- Interactive process
  - Pick-up coding for the present and the previous occupations

News relating to Sicore

- Pick-up activities:
  - Occupation for the Census
  - Occupation for the EEC
  - Diploma/training level for the EEC
  - Occupation for the Health Survey and all the surveys with the common trunk
- Sicore under CAPI/BLAISE

Sicore's main criteria

Three criteria to be examined together:

- The efficiency: percentage of records that are automatically coded
- The accuracy: percentage of coded records that are correctly coded
- The speed: average time to code one record
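
The three criteria can be computed together from one coding run. The function below is a hypothetical sketch, not a Sicore tool: `evaluate`, `code_fn` and `expected` are names introduced here, and `expected` is assumed to hold the code an expert would assign to each text. The criteria are returned as fractions between 0 and 1 rather than percentages.

```python
import time


def evaluate(code_fn, texts, expected):
    """Return (efficiency, accuracy, seconds per record) for a coding function.

    code_fn(text) returns a code, or None when the text is not coded.
    expected[i] is the reference code for texts[i].
    """
    start = time.perf_counter()
    results = [code_fn(t) for t in texts]
    elapsed = time.perf_counter() - start
    # Keep only the records that were automatically coded.
    coded = [(r, e) for r, e in zip(results, expected) if r is not None]
    efficiency = len(coded) / len(texts)
    # Accuracy is measured on coded records only, as in the definition above.
    accuracy = sum(r == e for r, e in coded) / len(coded) if coded else 0.0
    return efficiency, accuracy, elapsed / len(texts)
```

Accuracy is measured only on the records that received a code, which is why the criteria have to be read together: a system can reach high accuracy simply by coding few records.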

Occupations base

- Reference file: 26784 lines; text = occupation + rank
- Normalization rules:
  - 10 empty characters: ' ( ) - _ , / \ + :
  - 299 empty words: "dand", "chevronné", "SMIG" ...
  - Synonyms: 2684 expressions <=> 775 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 12 - 12 - 12 - 12)
  - 8 priority bigrams, 3 redundancy bigrams
- Logical rules:
  - 14 additional variables
  - 2933 tables
  - 524 codes
- Learning phase time: 8 seconds

Communes base

- Reference file: 49006 lines (base: official geographical code)
- Normalization rules:
  - 8 empty characters: ' ( ) - _ , / *
  - 58 empty words: "district", "canton", "cedex", ...
  - Synonyms: 126 expressions <=> 35 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 14 - 12 - 12 - 12)
  - 4 priority bigrams, no redundancy bigram
- Logical rules:
  - 1 additional variable = date
  - 2291 tables
  - 4021 codes
- Learning phase time: 2 seconds

Countries (nationalities) base

- Reference file: 1542 lines
- Normalization rules:
  - 7 empty characters: ' ( ) - _ , /
  - 29 empty words: "democratic", "republic", ...
  - Synonyms: 42 expressions <=> 14 synonyms
- Parameters of the learning phase:
  - 4 words (12 - 12 - 12 - 12)
  - 3 priority bigrams, 2 redundancy bigrams
- Logical rules: none
- Learning phase time: less than 1 second

Several speeds

- Occupation (EEC): about 900 wordings per second
- Occupation (Common Trunk): about 1000 wordings per second
- Activities of Time Use: about 1700 wordings per second
- Commune: about 7000 wordings per second
- Nationality: about 25000 wordings per second

Efficiencies for the Occupation

- For the National Census ("Heavy" run):
  - Present occupation: 56.6% coded
  - Former occupation: 83.7%
- For the EEC (LFS): 80%
- For the household surveys (common trunk): between 75 and 80% of non-empty wordings

Efficiencies for other variables

- For the national Census: communes of place of work, of study or of previous home: 98.5%
- Countries/nationalities: 98.9%
- Time Use activities: 90%
- Household Consumption Survey:
  - Till receipts: 69.5%
  - Consumption board (other purchases): 75.3%
  - Shops: 91.8%
- Diploma (EEC): 90%

The software package

- Independence of the language and the variables used
- Written in C
- Available on PC with Windows or Windows NT
- Works on IBM/MVS mainframes and on Unix workstations, excluding the expert interface
- 3 parts: the expert interface, the application program interface (A.P.I.) package, the object modules and include files package

Conclusion, the important elements

- Separation between software and knowledge bases
- A quick learning phase
- Many parameters
- Specific tools to help experts
- The use of local and global criteria
- Distinction between learning and coding phases
- Independence vis-à-vis variables and languages
- And only one piece of software to maintain