Sicore The Insee Automatic Coding System François Bulot April 22, 2003.
Sicore
The Insee Automatic Coding System
François Bulot
April 22, 2003
Plan
- Introduction
- The knowledge bases
- How does the Sicore system work?
- An adequate management structure
- Some important results
- The software package
- Surveys
The Sicore project
- Launched in 1993 by Pascal Rivière
- Written by Éric Meyer and Bruno Berlemont
- Finished in May 1996
- Followed successively by Pierrette Schuhl, Frédérique Deschamps and François Bulot
The four main objectives of Sicore
- Construct evolving knowledge bases for the variables
- Create an adequate management structure
- Write a generalized software package
  - User-friendly
  - For any variable
  - For any language
- Provide a documented methodology
The knowledge bases: 4 kinds of information
- The reference file: texts <=> codes
- The normalization rules
  - Maximum number of words
  - Maximum length of each word
  - Empty (and blank) characters
  - Empty words
  - Synonyms
The knowledge bases: 4 kinds of information (continued)
- The logical rules: additional variables
- The parameters of the learning algorithm, which cover:
  - the structure of the reference file
  - how to split the words and build the coding tree
How does the Sicore system work?
- First, the learning phase
- Second, the coding phase
1 - The learning phase: two steps to build the coding tree
The normalization step of the reference file:
- Remove empty characters
- Remove empty words
- Replace words (or groups of words) by their synonyms
- Limit the number of words and the length of each word
- Split each word into pieces of two characters: bigrams
Example
"Occupation" reference file, with the result of normalization and truncation of words into bigrams (| separates words)

TEXT                   CODE   BIGRAMS
TAXI DRIVER            64     TA XI | DR IV ER
DOCTOR OF MEDECIN      31     DO CT OR | ME DE CI NE
FACTORY WORKER         62     FA CT OR Y | WO RK ER
EMPLOYEE               54     EM PL OY EE
SURGEON                34     SU RG EO N
SURGEON AND DENTIST    31     SU RG EO N | DE NT IS T
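The normalization steps listed earlier can be sketched in Python. This is only an illustration under stated assumptions: the empty characters, empty words and synonym table below are invented to fit the six-line example, the limits on word count and word length are illustrative, the function names are hypothetical, and synonyms are replaced word by word, whereas Sicore also handles groups of words.

```python
# Hypothetical miniature knowledge base (assumption: chosen to fit the example).
EMPTY_CHARS = "'()-_,/"
EMPTY_WORDS = {"OF", "AND"}
SYNONYMS = {"MEDECIN": "MEDECINE"}
MAX_WORDS = 5          # illustrative limit on the number of words
MAX_WORD_LENGTH = 12   # illustrative limit on the length of each word


def normalize(text):
    """Return the list of normalized words for one wording."""
    text = text.upper()
    # 1. Remove empty characters (replaced by blanks so words stay separated).
    for ch in EMPTY_CHARS:
        text = text.replace(ch, " ")
    # 2. Remove empty words.
    words = [w for w in text.split() if w not in EMPTY_WORDS]
    # 3. Replace words by their synonyms (single words only in this sketch).
    words = [SYNONYMS.get(w, w) for w in words]
    # 4. Limit the number of words and the length of each word.
    return [w[:MAX_WORD_LENGTH] for w in words[:MAX_WORDS]]


def bigrams(word):
    """Split a word into pieces of two characters; the last may be shorter."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def to_bigrams(text):
    """Normalize a wording and split every remaining word into bigrams."""
    return [bigrams(w) for w in normalize(text)]


for text in ["TAXI DRIVER", "DOCTOR OF MEDECIN", "SURGEON AND DENTIST"]:
    print(text, "->", to_bigrams(text))
```

Keeping the rules in plain data (sets and dictionaries) rather than in the code mirrors the separation between software and knowledge bases that the presentation stresses.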
1 - The learning phase: two steps to build the coding tree (continued)
To build the coding tree, Sicore:
- Takes the normalized reference file as input
- Computes the position of the word piece which gives the largest amount of information (Shannon information)
- Builds all branches which correspond to this position
- For each branch, computes again the position which gives the largest amount of information
- Builds the next branches
- Repeats this process until each branch uniquely identifies a code
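The tree-building loop can be illustrated with a short Python sketch. It rests on assumptions that the slides do not confirm: "amount of information" is taken here to be the Shannon entropy of the bigrams observed at a position, ties are broken arbitrarily, and priority and redundancy bigrams are ignored. All names are hypothetical, and the tree it produces on this toy file need not match the one Sicore would build.

```python
import math
from collections import Counter, defaultdict

# Normalized reference file of the example: (words split into bigrams, code).
REFERENCE = [
    ([["TA", "XI"], ["DR", "IV", "ER"]], 64),
    ([["DO", "CT", "OR"], ["ME", "DE", "CI", "NE"]], 31),
    ([["FA", "CT", "OR", "Y"], ["WO", "RK", "ER"]], 62),
    ([["EM", "PL", "OY", "EE"]], 54),
    ([["SU", "RG", "EO", "N"]], 34),
    ([["SU", "RG", "EO", "N"], ["DE", "NT", "IS", "T"]], 31),
]
# Candidate positions: (word index, bigram index), both counted from 0.
POSITIONS = [(w, b) for w in range(2) for b in range(4)]


def piece(grid, pos):
    """Bigram found at a position, or "" when the text is too short."""
    w, b = pos
    return grid[w][b] if w < len(grid) and b < len(grid[w]) else ""


def entropy(values):
    """Shannon entropy, in bits, of a list of observed values."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


def build_tree(records, positions):
    """Return a code (leaf), a list of codes (ambiguous leaf) or (position, branches)."""
    codes = {code for _, code in records}
    if len(codes) == 1:
        return codes.pop()
    if not positions:
        return sorted(codes)
    # Pick the position whose bigrams carry the most information.
    best = max(positions, key=lambda p: entropy([piece(g, p) for g, _ in records]))
    groups = defaultdict(list)
    for grid, code in records:
        groups[piece(grid, best)].append((grid, code))
    remaining = [p for p in positions if p != best]
    return (best, {key: build_tree(group, remaining) for key, group in groups.items()})


def lookup(tree, grid):
    """Follow the branches matching a text; None when a branch is missing."""
    node = tree
    while isinstance(node, tuple):
        position, branches = node
        node = branches.get(piece(grid, position))
    return node


tree = build_tree(REFERENCE, POSITIONS)
for grid, code in REFERENCE:
    print(code, lookup(tree, grid))
```

The final loop checks the stopping condition stated above: every text of the reference file leads to a branch that identifies its own code.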
Example
Coding tree built from the "Occupation" reference file

2nd bigram, 1st word
  "XI" -> 64 (Taxi driver)
  "CT" -> 1st bigram, 1st word
            "DO" -> 31 (Doctor medecine)
            "FA" -> 62 (Factory worker)
  "PL" -> 54 (Employee)
  "RG" -> 1st bigram, 2nd word
            "DE" -> 31 (Surgeon dentist)
            "  " -> 34 (Surgeon)
2 - The coding phase
- Normalization of the file to be coded
- Pattern recognition algorithm: determines a code using the coding tree
  - Failure: the pattern of the text is not recognized => no code
  - Complete success: the pattern is recognized and a code is obtained
  - Partial success: the pattern is recognized but the text is too ambiguous
- The decision step for the complete success: set of logical rules and additional variables => code
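The three outcomes of the pattern recognition step can be sketched as follows. The tree is written by hand from the six-line "Occupation" example, with leaf codes taken from its reference file. Two points are assumptions made for illustration: an ambiguous result is represented as a list of candidate codes, and the decision step based on logical rules and additional variables is left out. The names are hypothetical.

```python
# Node: (word index, bigram index, branches), indices counted from 0.
# Leaf: a single code, or a list of candidate codes (assumed form of ambiguity).
TREE = (0, 1, {
    "XI": 64,
    "CT": (0, 0, {"DO": 31, "FA": 62}),
    "PL": 54,
    "RG": (1, 0, {"DE": 31, "": 34}),
})
# Hypothetical tree whose only leaf cannot decide between two codes.
AMBIGUOUS_TREE = (0, 0, {"SU": [31, 34]})


def piece(grid, word, index):
    """Bigram found at a position, or "" when the text is too short."""
    if word < len(grid) and index < len(grid[word]):
        return grid[word][index]
    return ""


def code_text(grid, tree=TREE):
    """Return (outcome, result) for one normalized text split into bigrams."""
    node = tree
    while isinstance(node, tuple):
        word, index, branches = node
        key = piece(grid, word, index)
        if key not in branches:
            return ("failure", None)      # pattern not recognized => no code
        node = branches[key]
    if isinstance(node, list):
        return ("partial", node)          # recognized, but too ambiguous
    return ("complete", node)             # recognized, one code obtained


print(code_text([["TA", "XI"], ["DR", "IV", "ER"]]))
print(code_text([["PI", "LO", "T"]]))
print(code_text([["SU", "RG"]], tree=AMBIGUOUS_TREE))
```

Because only the bigrams tested along a path are compared, a text that was never in the reference file can still be coded when it shares those bigrams, which is what lets the system code wordings it has not seen.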
Sicore circle
[Diagram. Elements: knowledge base file; verification of the coherence; learning of knowledge base; compiled knowledge base; file to be coded; general coding tool; coded records; non-coded records; analysis]
An adequate management structure
- To ensure that the knowledge bases are regularly updated
  - The variable expert, the Sicore expert
- To properly incorporate automatic coding in survey data processing
- To ensure that all concerned parties (3) join forces to attain the common goal
The documented methodology
- As of now: 3 documents written
  - The user's guide
  - A dictionary with the important words and concepts
  - The methodology guide: how Sicore works, how to construct the knowledge bases, how to verify the coherence of the knowledge bases
- The programmer's guide
- At the moment, only in French
Surveys coded for Occupation
- All INSEE surveys since the last Census (1999): surveys on living conditions (PCV), Household Consumption Survey, Health Survey, Continuous Employment Survey (LFS)…
- Before: PCV from 1997, the survey on household wealth, test for the national Census (1997)
- Many regional surveys
- Surveys for other national organisations
Other variables
- Communes for the national Census
- Nationalities/countries for the Census
- Diploma and training levels
- Activities for the Time Use Survey
- Consumption products and shops for the Household Consumption Survey
- Geocoding in Réunion Island
- Activities of the firms (4 sources: agriculture, administrative body responsible for collecting social security payments, Chamber of Commerce, Guild Chamber)
The use for the French National Census in 1999
- Batch process
  - "Light" run: communes of study, of the previous place of residence, of the workplace; country of birth, of the previous place of residence; nationality
  - "Heavy" run: present and previous occupations
- Interactive process
  - Pick-up codification for the present and the previous occupations
News relating to Sicore
- Pick-up activities:
  - Occupation for the Census
  - Occupation for the EEC
  - Diploma/training level for the EEC
  - Occupation for the Health Survey and all the surveys with the common trunk
- Sicore under CAPI/BLAISE
Sicore’s main criteria
Three criteria to be examined together:
- The efficiency: percentage of records that are automatically coded
- The accuracy: percentage of coded records that are correctly coded
- The speed: average time to code one record
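As a minimal sketch, the three criteria can be computed from four counts taken on a coded batch. The function name, its parameters and the figures in the usage line are hypothetical and are not Sicore results; the number of correctly coded records is assumed to come from a manual check of the coded records.

```python
def criteria(n_records, n_coded, n_correct, elapsed_seconds):
    """Return (efficiency, accuracy, speed) for one coding run.

    efficiency: share of records that were automatically coded
    accuracy:   share of coded records that were correctly coded
    speed:      average time, in seconds, to code one record
    """
    if n_records <= 0 or n_coded <= 0:
        raise ValueError("need at least one record and one coded record")
    efficiency = n_coded / n_records
    accuracy = n_correct / n_coded
    speed = elapsed_seconds / n_records
    return efficiency, accuracy, speed


# Hypothetical batch: 1000 records, 800 coded, 760 of them correct, 2 seconds.
efficiency, accuracy, speed = criteria(1000, 800, 760, 2.0)
print(f"efficiency {efficiency:.1%}, accuracy {accuracy:.1%}, {speed * 1000:.1f} ms per record")
```

Accuracy is measured on the coded records only, which is why the criteria have to be read together: a system can raise its accuracy simply by coding fewer records, at the cost of efficiency.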
Occupations base
- Reference file: 26784 lines; text = occupation + rank
- Normalization rules:
  - 10 empty characters: '()-_,/\+:
  - 299 empty words: "dand", "chevronné", "SMIG" ...
  - Synonyms: 2684 expressions <=> 775 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 12 - 12 - 12 - 12)
  - 8 priority bigrams, 3 redundancy bigrams
- Logical rules:
  - 14 additional variables
  - 2933 tables
  - 524 codes
- Learning phase time: 8 seconds
Communes base
- Reference file: 49006 lines (base: geographical official code)
- Normalization rules:
  - 8 empty characters: '()-_,/*
  - 58 empty words: "district", "canton", "cedex", ...
  - Synonyms: 126 expressions <=> 35 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 14 - 12 - 12 - 12)
  - 4 priority bigrams, no redundancy bigram
- Logical rules:
  - 1 additional variable = date
  - 2291 tables
  - 4021 codes
- Learning phase time: 2 seconds
Countries (nationalities) base
- Reference file: 1542 lines
- Normalization rules:
  - 7 empty characters: '()-_,/
  - 29 empty words: "democratic", "republic", ...
  - Synonyms: 42 expressions <=> 14 synonyms
- Parameters of the learning phase:
  - 4 words (12 - 12 - 12 - 12)
  - 3 priority bigrams, 2 redundancy bigrams
- Logical rules: none
- Learning phase time: less than 1 second
Several speeds
- Occupation (EEC): about 900 wordings per second
- Occupation (common trunk): about 1000 wordings per second
- Activities of Time Use: about 1700 wordings per second
- Commune: about 7000 wordings per second
- Nationality: about 25000 wordings per second
Efficiencies for the Occupation
- For the National Census ("Heavy" run):
  - Present occupation: 56.6% coded
  - Former occupation: 83.7%
- For the EEC (LFS): 80%
- For the household surveys (common trunk): between 75 and 80% of non-empty wordings
Efficiencies for other variables
- For the national Census:
  - Communes of place of work, of study or previous home: 98.5%
  - Countries/nationalities: 98.9%
- Time Use activities: 90%
- Household Consumption Survey:
  - Till receipts: 69.5%
  - Consumption board (other purchases): 75.3%
  - Shops: 91.8%
- Diploma (EEC): 90%
The software package
- Independent of the language and the variables used
- Written in C
- Available on PC with Windows or Windows NT
- Works on IBM/MVS mainframes and on Unix workstations, excluding the expert interface
- 3 parts: the expert interface, the application program interface (A.P.I.) package, the object modules and include files package
Conclusion, the important elements
- Separation between software and knowledge bases
- A quick learning phase
- Many parameters
- Specific tools to help experts
- The use of local and global criteria
- Distinction between learning and coding phases
- Independence vis-à-vis variables and languages
- And only one piece of software to maintain