American Immigrants Classification and Naturalization Time...


Transcript of American Immigrants Classification and Naturalization Time...

Page 1: American Immigrants Classification and Naturalization Time ...cs229.stanford.edu/proj2016/report/ShengLienWang...sample5 sample6 sample7. time for these groups. As this is the clustering

American Immigrants Classification and Naturalization Time Prediction of Different Groups

Yixiao Sheng, Yu-Chung Lien & Ching-Hua Wang

Abstract

In this project, we investigate American immigrants' year of first entry into the US and their year of naturalization to study how long it takes people with different backgrounds to become American citizens. The data come from the 2008-2013 American Community Census and include more than 500,000 samples [1]. We narrow the scope to California and approach the problem with three methods. First, we use K-means to cluster the data into 30 groups based on the selected features and the naturalization time, which reveals the characteristics and naturalization time of the 30 main groups. Second, we apply linear regression, which reveals the features most strongly correlated with naturalization time; these features match the statistical distribution of naturalization time well. Third, we build a regression tree, which gives a more detailed and more accurate prediction of naturalization time for each branch.

Term: Naturalization is the process by which U.S. citizenship is granted to a foreign citizen or national after he or she fulfills the requirements established by Congress in the Immigration and Nationality Act (INA).

1. Introduction

1.1. Background

Immigrants are believed to be an indispensable source of creativity and energy for America. As international students, we are interested in applying machine learning to discover how many years it takes people of different race, education, gender, English-speaking ability, etc., to be naturalized.

Before digging into more detail, we plot a map (Fig. 1) of how many years it takes immigrants to become American citizens in each state, that is, the length of naturalization by state. The differences among states could be due to differences in industry and in population structure. In the following study we focus on California, where we currently live and which is home to the largest immigrant population.

1.2. Methodology

Starting from six years of raw census data, we analyze the dataset in the following steps.

(1) Raw data extraction and filtering: First, since the census asks each person many questions, we select the 11 questions that we expect to be the most influential for naturalization time. The answers to these questions are the features of each person. Second, after extracting these answers from the census dataset, we preprocess the raw data, because many features are categorical and their numeric codes carry no natural order. We therefore add dummy variables for certain questions. For example, participants report their class of worker, coded from 1 (“employee of private company for profit”) to 9 (“working without pay in family business/farm”). It is better to split this question into 9 separate features, each taking the value 0 or 1. In addition to adding dummy variables, we normalize features such as “Age”, “Year of Entry” and “Salary”. Note that “Salary” is normalized in log form, as the distribution of log(salary) is closer to a Gaussian distribution.

(2) Cluster data by K-means: We cluster the data into 30 groups based on the selected 11 features and the naturalization time. This reveals the 30 main groups of immigrants in California, the features shared within each group, and the mean and standard deviation of each group's naturalization time.

(3) Apply linear regression: We run linear regression on the whole dataset to obtain the weight of each feature and interpret the meaning of the result.

Fig. 1 The US map shows the length of time for immigrants to become American citizens based on more than 5,000,000 samples. (Darkest: 13.5 years; lightest: 8.5 years)


(4) Analyze with regression tree: Although linear regression gives the weight of each feature over the whole dataset, the accuracy of the resulting model needs to be improved. The regression tree breaks the samples down into branches based on their features and predicts the naturalization time for each branch.

2. Classification

The purpose of classification is to cluster people into several groups with similar backgrounds. We take the selected features and the naturalization time of each sample and cluster the samples into 30 groups.

2.1. Feature Selection

Among all the questions in the census, we selected the 11 features shown in Fig. 2, which we suspect could influence people’s naturalization time.

Fig. 2 Selected features of each immigrant for studying how many years they need to be naturalized.
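The preprocessing described in step (1) of Section 1.2 can be sketched as follows. This is a minimal illustration in Python, not the authors' pipeline: the toy records, the helper names `one_hot` and `standardize`, and the choice of zero-mean, unit-variance scaling as the meaning of "normalized" are all assumptions.

```python
import math
from statistics import mean, pstdev

def one_hot(code, n_categories):
    """Turn a 1-based category code into a list of 0/1 dummy variables."""
    return [1 if code == k else 0 for k in range(1, n_categories + 1)]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation.
    Assumption: the report only says 'normalized', not which scaling was used."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy records: (class-of-worker code 1..9, yearly salary in dollars)
records = [(1, 30000.0), (9, 12000.0), (3, 85000.0)]

# One categorical question becomes 9 separate 0/1 features.
dummies = [one_hot(code, 9) for code, _ in records]

# Salary is log-transformed first, then rescaled.
log_salary = standardize([math.log(salary) for _, salary in records])
```

Each row of `dummies` contains exactly one 1, so a code such as 9 no longer looks "larger" than a code such as 1 to a distance-based method like k-means.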

We then use the k-means clustering method to classify the samples in the chosen feature space, that is, a space whose dimensions are defined by the features chosen to describe the data. We implemented the algorithm in Matlab and applied it to the whole dataset.

2.2. Kernel matrix analysis

To determine the number of groups, we first randomly choose 10,000 people’s records and compute the kernel matrix, using the Gaussian kernel as the kernel function. We then compute the eigenvalue decomposition of the kernel matrix and sort the eigenvalues from highest to lowest, as drawn in Fig. 3. The elbow point gives a possible number of clusters. We repeat the process seven times, and all seven runs give roughly the same curve.

To determine the elbow of an eigenvalue curve, we first draw a line between the first point and the last point. We then compute the distance from each point on the curve to this line. The distances are shown in Fig. 3 (right). We take the point with the maximum distance as the elbow. Each sample gives a cluster number of around 30.
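The elbow rule just described (farthest point from the line joining the first and last points) can be sketched in a few lines of Python. The function name `elbow_index` and the toy curve are illustrative assumptions; the point index is used as the x coordinate, so the result depends on how the y axis is scaled (the report plots the eigenvalues on a log scale).

```python
import math

def elbow_index(values):
    """Index of the point farthest from the chord joining the first and last points."""
    n = len(values)
    x0, y0 = 0.0, values[0]
    x1, y1 = float(n - 1), values[-1]
    chord = math.hypot(x1 - x0, y1 - y0)

    def dist(i):
        # Perpendicular distance from (i, values[i]) to the chord.
        return abs((y1 - y0) * (i - x0) - (x1 - x0) * (values[i] - y0)) / chord

    return max(range(n), key=dist)

# Made-up, already sorted curve (not the report's eigenvalues).
curve = [100.0, 60.0, 30.0, 12.0, 10.0, 9.0, 8.0, 7.0]
elbow = elbow_index(curve)  # index 3, where the curve flattens out
```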

Fig. 3 (Left) Eigenvalues of the kernel matrix. (Right) Distance used to extract the elbow of the eigenvalue curve.

2.3. Analysis result

As we have 11 features, the result is hard to visualize, so we summarize the features of the groups in the following tables. We highlight the clusters whose average naturalization time is higher than the overall average of 12.1 years.

Table 1 Classified groups for male immigrants.

Table 2 Classified groups for female immigrants.

From the clustering result we can see, first, that California immigrants are mostly from Asia and Latin America. Beyond that, we mark the groups with longer naturalization times and identify several features that could cause the longer time between first entry and naturalization

[Fig. 2 contents: English Speaking Ability, Gender, Education Attainment, World Area of Birth, Have Children, Class of Worker, Race, Disability, Marital Status, Salary, Year of Entry]

[Fig. 3 (Right) plot: distance versus # of points (0 to 250), distance axis from 120 to 300, curves for sample1 to sample7]

Table 1 contents (male groups 1 to 16). The per-group positions of the check marks are not preserved; the number of groups marked is given for each row.

Feature                    Value                  Groups marked
World Area of Birth        Latin America          4
World Area of Birth        Africa                 1
World Area of Birth        Asia                   8
Race                       White                  3
Race                       Black                  0
Race                       Others                 1
Marital Status             Married                11
Marital Status             Never Married          3
Marital Status             Separated              2
English Speaking Ability   Good                   15
English Speaking Ability   Not Good               1
Class of Worker            Private Company        8
Class of Worker            Government             0
Education Attainment       High School or Lower   7
Education Attainment       Bachelor Degree        2
Education Attainment       Master or Higher       2
Age at time of Entry       Young                  4
Income                     Low                    1
Income                     High                   3

Naturalization time (years) by group number (male):

Group   Mean   Standard Deviation
1       14.7   8
2       17.4   8.4
3       17.9   8.7
4       10.5   6.6
5       10.3   6
6       9.7    6.3
7       13.3   5.8
8       18.8   9
9       10.1   6.3
10      9.5    5.6
11      12.1   8.4
12      16     9.2
13      10.1   6.3
14      13.9   8.2
15      9.9    6
16      10.2   6.6

Table 2 contents (female groups 1 to 14). The per-group positions of the check marks are not preserved; the number of groups marked is given for each row.

Feature                         Value                  Groups marked
World Area of Birth             Latin American         2
World Area of Birth             Asian                  9
Race                            White                  1
Race                            Black                  0
Race                            Others                 0
Marital Status                  Married                8
Marital Status                  Never Married          2
Marital Status                  Separated              2
Female have children under 17   With Children          5
Female have children under 17   No Children            0
English Speaking Ability        Good                   10
English Speaking Ability        Not Good               0
Class of Worker                 Private Company        4
Class of Worker                 Government             1
Education Attainment            High School or Lower   6
Education Attainment            Bachelor Degree        2
Education Attainment            Master or Higher       1
Age at time of Entry            Young                  4
Income                          Low                    3

Naturalization time (years) by group number (female):

Group   Mean   Standard Deviation
1       10.5   6.1
2       9.7    5.8
3       9.6    6.2
4       12.9   8.8
5       15.8   7.5
6       9.2    5.6
7       9.3    5.5
8       18.6   9.5
9       9.9    6.5
10      10.5   6.1
11      16.8   6.7
12      12.7   7.5
13      9.5    6.3
14      9.4    5.4

K-means update rules:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2$$

$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$
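The two k-means steps (assign each sample to its nearest centroid, then move each centroid to the mean of its assigned samples) can be sketched in plain Python as below. This is an illustrative sketch only: the report used Matlab, and the function names, the explicit `init_centroids` argument and the toy points are assumptions made so that the example is deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, iters=100):
    """Lloyd's algorithm; returns (cluster index per point, final centroids)."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    assign = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, new_assign) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:  # assignments stopped changing
            break
        assign = new_assign
    return assign, centroids

# Made-up 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assign, centroids = kmeans(points, init_centroids=[points[0], points[3]])
```

In practice the initial centroids are usually drawn at random from the data (for example with `random.sample(points, k)`), and the per-cluster mean and standard deviation of naturalization time can then be computed from `assign`.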

[Fig. 3 (Left) plot: log(eigenvalue) versus # of points (0 to 250), vertical axis from 1.5 to 6, curves for sample1 to sample7]


time for these groups. As this is a clustering result, some groups with longer naturalization times may not be well explained by linear regression.

(1) World area of birth (Latin America): People from Latin America usually take more time to be naturalized. The linear regression result confirms this. (Weight of born in Latin America: 1.4 [higher]; weight of born in Asia: -2.9 [lower])

(2) Age at time of entry (young): People who arrive in the country at a younger age tend to take more years, probably because they tend not to settle down.

(3) Education attainment (Master or higher): People with a Master's degree or higher are more likely to take longer to be naturalized, since they need to finish their degree first. This can also be seen in the linear regression result. (Weight of Master's degree: -2.4 [higher]; weight of Doctorate degree: -1.6 [higher]; weight of Bachelor's degree: -3.0 [lower])

(4) Education attainment (High school or lower): This group of people may be less competitive. (Weight of below college: -2.5/-1.1 [higher]; weight of Bachelor's degree: -3.0 [lower])

3. Linear Regression

The goal of this section is to predict the naturalization time for each group of people. Here we show our preliminary result of using linear regression to measure the weight of each feature, as derived from the following equation.

3.1. Result & Interpretation

We use Weka, a software package, to apply linear regression to our data. The preliminary result gives us a better understanding of how each feature influences the naturalization time.

Although linear regression is convenient for interpreting the result, it is not reliable for features such as “year of entry” or “age”: they are continuous numbers, but their influence on the output need not be linear. In addition, “age” and “year of entry” depend on each other. People who came to America earlier are usually older, so this might influence the outcome. The result and the corresponding interpretation are as follows:

(1) Year of entry: People who came later are naturalized sooner.

(2) Age: Older people are naturalized faster. This may be because they are more likely to settle down once they arrive, whereas younger people are probably here for education.

(3) Wage income: Groups with higher incomes are granted naturalization earlier.

(4) Disability: People with a disability take longer to be naturalized, as there is no legislation that makes an exception for them.

(5) Gender: Females are naturalized faster. (Male: 1, Female: 0)

(6) Class of worker: A federal government job reduces the waiting time.

(7) Educational attainment: In general, people with higher degrees (Bachelor's and above) tend to be naturalized faster. However, this is less pronounced for people with a doctorate degree, because they usually need a long time to finish their degree first.

(8) World area of birth: Being born in Asia or Europe makes naturalization faster; being born in Latin America makes it slower.

Features        Weights
Year of entry   -30.3
Age             -4.2
Wage income     -0.4
Disability      0.4
Sex             0.2

Class of Worker                                                                     Weights
Employee of for profit unit                                                         2.5
Employee of a non-profit unit                                                       2.3
Local government employee                                                           2.0
State government employee                                                           2.2
Federal government employee                                                         0.5
Self-employed in own not incorporated business, professional practice, or farm      1.8
Self-employed in own incorporated business, professional practice or farm           2.2
Working without pay in family business/farm                                         2.3

Educational attainment                          Weights
Below 12th grade - no diploma                   -1.1
Below college                                   -2.5
Associate's degree                              -3.3
Bachelor's degree                               -3.0
Master's degree                                 -2.4
Professional degree beyond a bachelor's degree  -3.4
Doctorate degree                                -1.6

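$$h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x$$

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

$$\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$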
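The linear hypothesis, least-squares cost and batch gradient-descent update used in this section can be sketched as follows. This is a hedged illustration, not a description of what Weka does internally; the function names, the learning rate `alpha`, the step count and the toy data are assumptions.

```python
def predict(theta, x):
    """Linear hypothesis: dot product of the weights with the feature vector."""
    return sum(t * xi for t, xi in zip(theta, x))

def cost(theta, X, y):
    """Half the sum of squared errors over all samples."""
    return 0.5 * sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y))

def gradient_descent(X, y, alpha=0.01, steps=5000):
    """Batch update: theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        errors = [yi - predict(theta, xi) for xi, yi in zip(X, y)]
        theta = [t + alpha * sum(e * xi[j] for e, xi in zip(errors, X))
                 for j, t in enumerate(theta)]
    return theta

# Made-up data generated from y = 1 + 2 * x; the first column is the
# constant intercept feature x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = gradient_descent(X, y)  # converges towards [1.0, 2.0]
```

The fitted entries of `theta` play the role of the feature weights listed in the tables of this section; with dummy-coded features, each weight is the shift in predicted naturalization time when that category applies.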


Salary (1,000 a Year)   Arrive Age   First Year of Entry
< 8.75                  > 35         1921-1978
8.75-9.25               31-35        1979-1983
9.25-9.75               26-30        1984-1987
9.75-10.25              21-25        1988-1991
10.25-10.75             16-20        1992-2011
10.75-11.25             11-15
11.25-11.75             =< 10
> 11.75

(9) Ability to speak English: Speaking better English shortens the naturalization time.

(10) Race: Asian and Black people need a shorter time, whereas white people take longer. This is because most of the white immigrants are from Latin America.

(11) Marital status: People who have never married need a longer time.

(12) Female has children under 17: This factor has little effect. (Yes: -0.2; No: -0.3)

3.2. Residual Plot

Performing linear regression yields a model. Under 10-fold cross-validation over all the training data, the correlation coefficient is 0.56. The figure shows the residual plot of the predicted naturalization year against the actual naturalization year.
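A cross-validated correlation coefficient of this kind can be computed as in the sketch below, assuming the reported figure is the Pearson correlation between held-out predictions and actual values. The fold layout, the helper names and the one-feature toy model are illustrative assumptions, not the report's setup.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold f holds out samples f, f+k, f+2k, ..."""
    for f in range(k):
        test = list(range(f, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

def cross_val_correlation(xs, ys, k, fit, predict):
    """Predict every sample from a model trained without it, then correlate."""
    preds = [0.0] * len(xs)
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = predict(model, xs[i])
    return pearson(preds, ys)

def fit_line(xs, ys):
    """Closed-form least squares for a single feature: (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Made-up, perfectly linear data, so the held-out correlation is 1.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
r = cross_val_correlation(xs, ys, 10, fit_line, predict_line)
```

Because every prediction comes from a model that never saw that sample, the resulting coefficient is a fairer estimate than one computed on the training data itself.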

Fig. 3 Residual plot according to the linear regression model.

3.3. Statistical Result

According to the linear regression result, the weights of “Year of Entry” and “World Area of Birth” are relatively large. We further examine the relationship between naturalization time and these two features. The left figure is the statistical distribution of “Year of Entry” vs. naturalization time. The data concentrate in the left corner, indicating that later entry goes with a roughly shorter naturalization time. For “Year of Entry” around the 1960s, on the other hand, the data points are sparse and spread out, indicating a longer naturalization time. The right figure shows “World Area of Birth” versus naturalization time. Except for people from Latin America, the other three groups have similar distributions, in agreement with our linear regression findings.

Fig. 4 (Left) The statistical distribution of year of entry versus naturalization time. (Right) Naturalization time by world area of birth.

4. Regression Tree

In addition to linear regression, we also apply a regression tree, which can give a better prediction from the given features. Information gain (IG) is based on the concept of entropy from information theory and is used to decide which feature to split on at each step of building the tree. Simplicity is best, so we want to keep the tree small.

To further simplify the tree, we break the numerical features down into ranges. This may affect the accuracy of the model, but it is necessary.

Table 3 Numerical features broken down into ranges
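Breaking a numerical feature into ranges amounts to a lookup against the bin edges. The sketch below uses the "Arrive Age" and "First Year of Entry" ranges from Table 3; the function names and the use of Python's `bisect` module are illustrative choices, not the authors' implementation.

```python
from bisect import bisect_left

# Inclusive upper edges of every bin except the last, open-ended one.
AGE_UPPER = [10, 15, 20, 25, 30, 35]
AGE_LABELS = ["<=10", "11-15", "16-20", "21-25", "26-30", "31-35", ">35"]

YEAR_UPPER = [1978, 1983, 1987, 1991]
YEAR_LABELS = ["1921-1978", "1979-1983", "1984-1987", "1988-1991", "1992-2011"]

def age_bin(age):
    """Map an arrival age in whole years to its range label."""
    return AGE_LABELS[bisect_left(AGE_UPPER, age)]

def year_bin(year):
    """Map a year of first entry to its range label."""
    return YEAR_LABELS[bisect_left(YEAR_UPPER, year)]

example = (age_bin(23), year_bin(1985))  # ("21-25", "1984-1987")
```

`bisect_left` returns the position of the first edge that is greater than or equal to the value, so a value equal to an upper edge (for example age 15) stays in the lower bin.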

World Area of Birth        Weights
Born in Latin America      1.4
Born in Asia               -2.9
Born in Europe             -2.9
Born in Africa             -2.2
Born in Northern America   -0.9
Oceania and at Sea         -1.6

Ability to speak English   Weights
Very well                  0.1
Well                       0.8
Not at all                 2.3

Race                                 Weights
American Indian and Alaska Native    0.5
Asian                                -0.4
Black                                -0.4
Others                               0.4
White                                0.3

Marital status                        Weights
Married                               -0.3
Widowed                               -0.5
Divorced                              -0.4
Never married or under 15 years old   -0.1

$$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x)$$

$$= -\sum_{x} P(X = x) \sum_{y} P(Y = y \mid X = x) \log_2 P(Y = y \mid X = x)$$

$$= -\sum_{x, y} P(X = x, Y = y) \log_2 P(Y = y \mid X = x)$$

$$IG(X) = H(Y) - H(Y \mid X)$$
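The entropy-based information gain used to choose a split can be sketched as below. The sketch assumes that both the candidate feature and the target have been discretized into labels (for instance, naturalization time binned into ranges); the function names and toy labels are assumptions, and a tree on a continuous target would often use variance reduction instead.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) in bits, estimated from label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y | X): entropy of Y inside each value of X, weighted by P(X = x)."""
    n = len(xs)
    total = 0.0
    for x, count in Counter(xs).items():
        subset = [y for xi, y in zip(xs, ys) if xi == x]
        total += (count / n) * entropy(subset)
    return total

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y | X)."""
    return entropy(ys) - conditional_entropy(xs, ys)

# Made-up toy labels in which the feature fully determines the target.
area = ["Asia", "Asia", "Latin America", "Latin America"]
time_bin = ["short", "short", "long", "long"]
gain = information_gain(area, time_bin)  # 1 bit: the split removes all uncertainty
```

At each node, the feature with the largest gain would be chosen for the split; a feature that is independent of the target has a gain of zero.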


Fig. 5 is a schematic diagram of a regression tree with two features: “World Area of Birth” and “Year of Entry”. For instance, among those who entered in 1979-1983, people from Asia need 11.41 years to be naturalized, while people from Latin America need 17.54 years.

Fig. 5 Regression tree schematic considering two features.
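Once the splits are fixed, the prediction for a branch of such a two-feature tree is the mean naturalization time of the samples that fall into it. A minimal sketch, with made-up rows and an illustrative function name (`branch_means`):

```python
from collections import defaultdict

def branch_means(rows):
    """rows: (year_range, area, years_to_naturalize) -> mean per (year_range, area)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, sample count]
    for year_range, area, years in rows:
        cell = sums[(year_range, area)]
        cell[0] += years
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Made-up samples; these are not values from the census data.
rows = [
    ("1979-1983", "Asia", 10.0),
    ("1979-1983", "Asia", 12.0),
    ("1979-1983", "Latin America", 18.0),
]
means = branch_means(rows)
# means[("1979-1983", "Asia")] is 11.0
```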

4.1. Validation Method

Before optimizing the feature selection, we first examine how different validation methods influence the correlation coefficient. The table below shows that using all the data as the training set gives the highest correlation coefficient. However, we need separate training and test sets to build a reliable model. If we split the data into 2 folds, the correlation coefficient is 0.5699. We chose 10 folds to get a high correlation coefficient with a short computation time.

Table 4 Validation method and correlation coefficient

4.2. Features Optimization

To find the best feature combination, we start by running the regression tree with only one feature. The best single feature is “Year of Entry”. We then grow trees with “Year of Entry” plus each of the other features. It turns out that “Year of Entry” and “World Area of Birth” give a relatively high correlation coefficient (0.5749) while the tree is still small (size 36). Achieving higher accuracy increases the tree complexity. Using the same method, we find the best 5 features for the regression tree: “Year of Entry”, “World Area of Birth”, “Arrive Age”, “Education Level” and “English Ability”.

Although “Race” has a fairly high correlation coefficient with naturalization time, it is not among the best five features. This is because “Race” is highly related to “World Area of Birth” and adds little by itself.

Also, although “English Ability” alone has a very low correlation coefficient with naturalization time, it improves the final correlation coefficient. We interpret this as follows: “English Ability” may not be the main factor, but better English still helps.
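The procedure described in this subsection resembles greedy forward selection: keep the best feature set found so far and add whichever remaining feature improves the score most. The sketch below assumes a caller-supplied `score` function (hypothetical here; in the report's setting it would be the cross-validated correlation coefficient of a tree built on the given features) and uses a made-up additive score for illustration.

```python
def forward_select(features, score, max_features):
    """Greedy forward selection; returns [(selected features, score), ...] per step."""
    selected = []
    remaining = list(features)
    history = []
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature and keep the best one.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return history

# Made-up additive score; real scores would come from fitting and validating a tree.
toy_weight = {"a": 3.0, "b": 2.0, "c": 1.0}

def toy_score(subset):
    return sum(toy_weight[f] for f in subset)

history = forward_select(["c", "a", "b"], toy_score, max_features=2)
# history == [(["a"], 3.0), (["a", "b"], 5.0)]
```

A greedy search of this kind evaluates far fewer combinations than an exhaustive search, at the cost of possibly missing a better combination that does not contain the best single feature.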

Table 5 Feature optimization and correlation coefficient

4.3. Branches for 1992~2011

The following table is the regression tree result considering “Year of Entry”, “World Area of Birth” and “Arrive Age” for immigrants who came in the period 1992~2011.

Table 6 1992~2011 tree branches with three features.

5. Conclusions

The two most influential features for naturalization time are “Year of Entry” and “World Area of Birth”. This result can also be observed in the statistical result. The other features such as “Arrive Age”, “Educational Level” and “English Ability” can also improve the correlation coefficient to 0.596 in regression tree algorithm.

6. References

[1] http://www2.census.gov/acs2013_1yr/pums/csv_pus.zip
[2] https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
[3] https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Lectures.DecisionTrees

[Table 6 layout: Year of Entry (1992-2011), split by Arrive Age (<=10, 11~15, 16-20, 21-25, 26~30, 31-35, > 35) and then by World Area of Birth (Asia, Latin America, Europe, Northern America, Africa, Oceania and at Sea)]

Number of Features   Features                                                                          Correlation Coefficient   Size of Tree
1                    Year of Entry                                                                     0.4252                    6
1                    Arrive Age                                                                        0.2652                    8
1                    Education Level                                                                   0.2091                    4
1                    World Area of Birth                                                               0.4141                    7
1                    English Level                                                                     0.0924                    5
1                    Race                                                                              0.3513                    7
2                    Year of Entry, World Area of Birth                                                0.5749                    36
3                    Year of Entry, World Area of Birth, Education Level                               0.5817                    105
4                    Year of Entry, World Area of Birth, Arrive Age, Education Level                   0.591                     475
5                    Year of Entry, World Area of Birth, Arrive Age, Education Level, English Ability  0.596                     958

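[Fig. 5 diagram: the root splits on Year of Entry (1921-1978, 1979-1983, 1984-1987, 1988-1991, 1992-2011); each range then splits on World Area of Birth (Asia, Latin America, Europe, Northern America, Africa, Oceania & at sea). Leaf values shown include 11.41, 17.54, 13.27, 16.48, 12.15 and 13.99 years.]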

Validation Method   Number   Correlation coefficient
Use training sets            0.632
2 folds             2        0.5699
5 folds             5        0.5747
10 folds            10       0.5756
15 folds            15       0.575
20 folds            20       0.5759
30 folds            30       0.5755

[Plot: correlation coefficient (axis from 0.569 to 0.577) versus number of folds (axis from 0 to 35)]