EDA of San Francisco Employee Compensation for Fiscal Year 2014-15

DATA MANAGEMENT FINAL PROJECT

ANALYSIS OF SAN-FRANCISCO EMPLOYEE COMPENSATION FOR

FISCAL YEAR 2014 AND 2015

SUBMITTED BY

SAGAR VINAYKUMAR TUPKAR

MS-BUSINESS ANALYTICS’16

UNIVERSITY OF CINCINNATI, OHIO

CHAPTER 01

DATA INFORMATION

1.01 ABOUT DATA

The data used in this project is the dataset of employee compensation in San Francisco for Fiscal Years 2014 and 2015. The San Francisco Controller's Office maintains a database of the salary and benefits paid to City employees since fiscal year 2013. This data is also summarized and presented in the Employee Compensation report hosted at http://openbook.sfgov.org. New data is added on a bi-annual basis when available for each fiscal and calendar year.

1.02 DATA SOURCE

The data was obtained from the San Francisco open-data website (www.data.sfgov.org). The dataset is available at:

https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd

1.03 MODIFICATIONS DONE TO THE DATA

a. The original data downloaded from the website did not have the correct datatypes. Using Excel, the datatypes of the measures were changed to Number and those of the dimensions to Text.

b. Using Excel, a filter was applied to the dataset and only the data for FISCAL years 2014 and 2015 was kept. All CALENDAR year data and all year 2013 data were excluded from the dataset to be analyzed.

c. Some of the columns in the table were also excluded, e.g. Year Type, Union Code, Union Name, Employee Identifier etc., as this information was not used in the analysis.
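
The filtering and column removal described in steps b and c were done in Excel. As an illustration only, the same two steps can be sketched in Python. The sketch assumes the raw export is a CSV whose headers match the column names used in this report and that fiscal rows are marked with the value "Fiscal" in Year Type; the sample rows and the function name are hypothetical.

```python
import csv
import io

# Columns the report says were excluded from the analysis.
DROPPED_COLUMNS = {"Year Type", "Union Code", "Union Name", "Employee Identifier"}

def filter_fiscal_rows(rows, years=("2014", "2015")):
    """Keep only fiscal-year rows for the given years and drop the unused columns."""
    kept = []
    for row in rows:
        # Assumption: the Year Type column holds "Fiscal" or "Calendar".
        if row["Year Type"] != "Fiscal" or row["Year"] not in years:
            continue
        kept.append({k: v for k, v in row.items() if k not in DROPPED_COLUMNS})
    return kept

# Hypothetical sample standing in for the downloaded CSV export.
SAMPLE_CSV = """Year Type,Year,Union Code,Union Name,Employee Identifier,Department,Salaries
Fiscal,2013,1,A,101,Fire Department,1000.00
Fiscal,2014,1,A,102,Fire Department,2000.00
Calendar,2014,1,A,103,Police,3000.00
Fiscal,2015,2,B,104,Police,4000.00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
filtered = filter_fiscal_rows(rows)
for row in filtered:
    print(row)
```

Only the fiscal 2014 and 2015 rows survive, and each keeps just the Year, Department and Salaries fields.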

CHAPTER 02

TABLE OVERVIEW

2.01 GENERIC OVERVIEW OF THE DATA

The dataset used for this study is the Employee Compensation data for the City of San Francisco for Fiscal Years 2014 and 2015. The dataset, as modified for analysis, contains 83946 rows and 18 columns. The columns follow a hierarchy from Organization Group down to Job, and compensation is broken down into Salaries and Benefits, each of which is further split into categories. The columns in the dataset, with their descriptions, are:

1) Year – the year for which the data exists (2014 or 2015)

2) Organization Group Code – a unique code given to an Organization Group

3) Organization Group – name of the Organization Group

4) Department Code – a unique code given to a Department

5) Department – name of the Department

6) Job Family Code – a unique code given to a Job Family

7) Job Family – name of the Job Family

8) Job Code – a unique code given to a Job

9) Job – name of the Job

10) Salaries – salary for that job in USD

11) Overtime – overtime pay in USD

12) Other Salaries – other salaries besides the main salary in USD

13) Total Salaries – total salary (sum of the 3 columns above) in USD

14) Retirement – benefit due to the retirement plan in USD

15) Health/Dental – benefit due to health/dental privileges in USD

16) Other Benefits – other benefits in USD

17) Total Benefits – total benefits (sum of the 3 benefit columns above) in USD

18) Total Compensation – total compensation (total salary + total benefits) in USD
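
The relationships between the measure columns (items 13, 17 and 18) can be checked row by row. The following is a minimal Python sketch of such a check, not part of the original analysis; the function name, the tolerance and the sample record are hypothetical.

```python
def check_totals(row, tolerance=0.01):
    """Return the names of the total columns that do not match their parts."""
    problems = []
    total_salaries = row["Salaries"] + row["Overtime"] + row["Other Salaries"]
    total_benefits = row["Retirement"] + row["Health/Dental"] + row["Other Benefits"]
    if abs(total_salaries - row["Total Salaries"]) > tolerance:
        problems.append("Total Salaries")
    if abs(total_benefits - row["Total Benefits"]) > tolerance:
        problems.append("Total Benefits")
    stated_sum = row["Total Salaries"] + row["Total Benefits"]
    if abs(stated_sum - row["Total Compensation"]) > tolerance:
        problems.append("Total Compensation")
    return problems

# Hypothetical record, all values in USD.
good = {
    "Salaries": 80000.0, "Overtime": 5000.0, "Other Salaries": 1000.0,
    "Total Salaries": 86000.0,
    "Retirement": 15000.0, "Health/Dental": 12000.0, "Other Benefits": 6000.0,
    "Total Benefits": 33000.0,
    "Total Compensation": 119000.0,
}
# Same record with a wrong Total Benefits value.
bad = dict(good, **{"Total Benefits": 30000.0})

print(check_totals(good))
print(check_totals(bad))
```

A small tolerance is used because the amounts are decimal values stored as floating-point numbers.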

2.02 ANALYSIS TO BE DONE ON THE DATASET

The rest of the report probes the dataset to extract information from it. The dataset is analyzed for the average, minimum, maximum and sum of salary, benefits and compensation across Organization Groups, Departments, Job Families and Jobs, with attention to outliers, as these are likely to be insightful to the reader. The analysis also covers how these statistics changed in Fiscal Year 2015 compared to Fiscal Year 2014. Important information such as the number of employees in a particular department or organization group, or doing a particular type of job, is also presented and analyzed.

CHAPTER 03

NORMALIZATION OF THE DATA

3.01 IS THE DATASET NORMALIZED?

A dataset is usually normalized before analysis to remove redundancy and repetition in the information it contains. A relational database is also generally easier to analyze and maintain than a single flat table. The dataset analyzed here, although uniform and well granulated, is not normalized. Values are repeated across rows: the same organization group, department, job family and job codes and names appear in many records. Also, nothing links the columns that are intuitively related to each other, e.g. a Job is part of a Job Family, Job Families differ between Departments, and Departments are categorized into Organization Groups. All these columns can be related.

3.02 HOW TO NORMALIZE THE DATASET?

As mentioned above, the dataset needs to be normalized to remove the redundancy from the rows:

1. New tables need to be created and linked with each other through the relations they have. E.g. Table 1 – Organization Group Code and Organization Group name, because every code has a unique name associated with it. Similar tables can be created for Department, Job Family and Job.

2. Using the above tables and a fact table, the same dataset can be reconstructed in normalized form using joins in SQL.

3. A Total Salary table can also be created from the columns Salaries, Overtime, Other Salaries and Total Salaries; in this new table, the total salary column is computed by a SQL expression that sums the other 3 columns. Hence, whenever the values of the other 3 columns change, the total salary is automatically updated. The same can be done for Total Benefits and thus Total Compensation, as all these values are linked with each other.
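
The three steps above can be sketched as follows. This is a minimal illustration using SQLite through Python's built-in sqlite3 module as a stand-in for SQL Server; all table names, column names and sample codes ('1', 'FIR') are hypothetical, only two of the four dimensions are shown, and it assumes each department belongs to exactly one organization group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: one lookup table per dimension (code -> name).
CREATE TABLE organization_group (
    org_group_code TEXT PRIMARY KEY,
    org_group_name TEXT NOT NULL
);
CREATE TABLE department (
    department_code TEXT PRIMARY KEY,
    department_name TEXT NOT NULL,
    org_group_code  TEXT NOT NULL REFERENCES organization_group(org_group_code)
);
-- Step 2: fact table holding only codes and the base measures.
CREATE TABLE compensation (
    year            INTEGER NOT NULL,
    department_code TEXT NOT NULL REFERENCES department(department_code),
    salaries REAL, overtime REAL, other_salaries REAL,
    retirement REAL, health_dental REAL, other_benefits REAL
);
-- Step 3: totals are derived in a view, so they always follow the base columns.
CREATE VIEW compensation_totals AS
SELECT c.year, g.org_group_name, d.department_name,
       c.salaries + c.overtime + c.other_salaries AS total_salaries,
       c.retirement + c.health_dental + c.other_benefits AS total_benefits,
       c.salaries + c.overtime + c.other_salaries
         + c.retirement + c.health_dental + c.other_benefits AS total_compensation
FROM compensation AS c
JOIN department AS d ON d.department_code = c.department_code
JOIN organization_group AS g ON g.org_group_code = d.org_group_code;
""")

# Hypothetical sample data.
conn.execute("INSERT INTO organization_group VALUES ('1', 'Public Protection')")
conn.execute("INSERT INTO department VALUES ('FIR', 'Fire Department', '1')")
conn.execute(
    "INSERT INTO compensation VALUES (2015, 'FIR', 100000, 20000, 5000, 20000, 12000, 3000)"
)

totals = conn.execute("SELECT * FROM compensation_totals").fetchall()
print(totals)

# Changing a base column changes the derived totals on the next read.
conn.execute("UPDATE compensation SET overtime = 30000")
updated = conn.execute(
    "SELECT total_salaries, total_compensation FROM compensation_totals"
).fetchone()
print(updated)
```

Using a view rather than stored total columns is one possible design; a computed column in SQL Server would serve the same purpose.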

CHAPTER 04

PROBLEMS IN THE DATASET AND DATA CLEANING

4.01 PROBLEMS IN THE DATASET

Although the dataset is well organized and maintained by the SF government, it has certain problems that should be fixed to make it better.

1. The values in the columns ‘Job Family Code’ and ‘Job Code’ are not consistent in format. While most of the codes are numeric, some entries are alphanumeric. This causes problems in data manipulation.

2. The measure columns such as ‘Salaries’, ‘Benefits’ etc. have many negative entries. Such records should be removed from the dataset and, if they have any significance, saved in another table for separate analysis. Negative values in these columns have no obvious interpretation, and they distort the overall analysis (sum, average etc.) as well.

3. It was observed that some Job Codes and Job names are the same across different departments. This can cause confusion when drawing conclusions about the salaries and compensation for a particular job name unless filters are applied.

4. There were NULLs in the initial dataset, which could have caused serious problems.

5. As mentioned earlier, the datatypes of the columns were not in a standard format, which could have caused problems while importing the data into any other tool for analysis.
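
Problems 1, 2 and 4 above can be detected programmatically. The following is a minimal Python sketch under the assumption that each record is a dictionary keyed by the column names of this report; the function name and the sample records (including the codes) are hypothetical.

```python
MEASURES = ("Salaries", "Overtime", "Other Salaries",
            "Retirement", "Health/Dental", "Other Benefits")
CODE_COLUMNS = ("Job Family Code", "Job Code")

def find_issues(rows):
    """Return (row index, column, issue) tuples for missing values,
    non-numeric codes and negative measures."""
    issues = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            if value is None or value == "":
                issues.append((i, col, "missing"))
        for col in CODE_COLUMNS:
            code = row.get(col)
            if code and not code.isdigit():
                issues.append((i, col, "non-numeric code"))
        for col in MEASURES:
            value = row.get(col)
            if isinstance(value, (int, float)) and value < 0:
                issues.append((i, col, "negative"))
    return issues

# Hypothetical records showing one clean row and one row per kind of problem.
rows = [
    {"Job Family Code": "1000", "Job Code": "1054", "Job Family": "A", "Salaries": 52000.0},
    {"Job Family Code": "H000", "Job Code": "H002", "Job Family": "B", "Salaries": -150.0},
    {"Job Family Code": "2300", "Job Code": "2320", "Job Family": None, "Salaries": 98000.0},
]

issues = find_issues(rows)
for issue in issues:
    print(issue)
```

Flagging the rows rather than deleting them keeps the option, mentioned in item 2, of moving them to a separate table.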

4.02 IMPROVEMENT AND SUGGESTIONS

As discussed above, the dataset has several issues that can interfere with further analysis, so it was cleaned using Excel and SQL. All the datatypes were corrected in Excel before any operation was done on the table. After trimming the data as needed, it was imported into SQL Server Express and all the NULLs (only present in the dimensions) were replaced by ‘0.00’.

Apart from fixing the problems present in the dataset, a few additional changes could increase the utility of the table and allow much more information to be extracted:

1. New columns with the name, age, work experience and work history of the employee could be added to the dataset (for government officials whose names can legally be published).

2. The code columns could be dropped to make the dataset small and tidy. The identifiers could be added back later while normalizing the dataset.

CHAPTER 05

GENERAL STATISTICS OF THE DATA

A) USING EXCEL –

For our dataset, we check the number of records in each organization group for the years 2014 and 2015 combined. Here is the output from the Excel pivot table.

To probe further into the number of records for each department within an organization group, a pivot table was used again. Snapshots of the results are attached below:

a. General City Responsibilities

b. Culture and Recreation

c. General Administration and Finance

d. Community Health

e. Human Welfare and Neighborhood Development

f. Public Protection

g. Public Works, Transportation and Commerce
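
The pivot tables above count records per organization group and per department within a group. As an illustration of what such a pivot computes, here is a minimal Python sketch using collections.Counter; the helper name and the sample rows are hypothetical and do not reproduce the report's figures.

```python
from collections import Counter

def count_records(rows, *columns):
    """Count rows per distinct combination of the given columns
    (the equivalent of a pivot table with a single count measure)."""
    return Counter(tuple(row[c] for c in columns) for row in rows)

# Hypothetical sample rows.
rows = [
    {"Organization Group": "Public Protection", "Department": "Fire Department"},
    {"Organization Group": "Public Protection", "Department": "Fire Department"},
    {"Organization Group": "Public Protection", "Department": "Police"},
    {"Organization Group": "Community Health", "Department": "Public Health"},
]

by_group = count_records(rows, "Organization Group")
by_department = count_records(rows, "Organization Group", "Department")

for key, n in by_department.most_common():
    print(key, n)
```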

B) USING SQL –

The dataset was imported into Microsoft SQL Server Management Studio for initial analysis. A SQL file containing all the queries with descriptions is attached with the submission. Snapshots of the top 15 records were taken in SQL, separately for the dimensions and the measures. They are shown here to give the reader an idea of the data.

1. Dimensions

2. Measures

Some queries were written and run in SQL. Here are some of the observations:

1. An initial summary of the data was obtained, e.g. the total number of records, the number of records in 2014 and in 2015, the number and names of distinct organization groups, and the number of distinct departments, job families and jobs. There are 83946 records in total, of which 43078 are reported for the year 2015 and 40686 for the year 2014. (These two yearly counts add up to 83764 rather than 83946, so one of the three figures appears to be mistyped.) Going by the yearly counts, the number of employee records in San Francisco increased by 2392 from Fiscal Year 2014 to 2015. There are also 7 different Organizational Groups, 53 Departments, 55 Job Families and 1068 different job titles for the year 2015. Here are the snapshots of the output from SQL:
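
The summary described in item 1 comes down to a few COUNT queries. The attached SQL file is not reproduced in this report, so the following is only a sketch of queries of that kind, run against SQLite through Python's sqlite3 module; the table name, column names and sample rows (including the job family and job names) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compensation "
    "(year INTEGER, org_group TEXT, department TEXT, job_family TEXT, job TEXT)"
)
# Hypothetical sample rows.
conn.executemany("INSERT INTO compensation VALUES (?, ?, ?, ?, ?)", [
    (2014, "Public Protection", "Fire Department", "Fire Services", "Firefighter"),
    (2015, "Public Protection", "Fire Department", "Fire Services", "Firefighter"),
    (2015, "Public Protection", "Police", "Police Services", "Police Officer"),
    (2015, "Community Health", "Public Health", "Nursing", "Registered Nurse"),
])

# Number of records per year.
records_per_year = dict(conn.execute(
    "SELECT year, COUNT(*) FROM compensation GROUP BY year ORDER BY year"
))

# Distinct groups, departments, job families and jobs in 2015.
distinct_2015 = conn.execute("""
    SELECT COUNT(DISTINCT org_group), COUNT(DISTINCT department),
           COUNT(DISTINCT job_family), COUNT(DISTINCT job)
    FROM compensation
    WHERE year = 2015
""").fetchone()

increase = records_per_year[2015] - records_per_year[2014]
print(records_per_year, distinct_2015, increase)
```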

2. A query was written and run in SQL to find the top 10 departments with the largest number of employees in 2015. The Public Health Department had the most employees for Fiscal Year 2015, with 9148, followed by the Municipal Transportation Agency with 6427 employees. Here is the output of the query:

3. The top 10 compensation records in the entire dataset for the year 2015 were extracted with a query. The job title ‘Asst Med Examiner’, from the Job Family ‘Med Therapy and Auxiliary’ in the department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’, has the highest recorded compensation, around $497505.

4. The total compensation of an organizational group depends on the number of employees in it, so it is a poor basis for comparing groups. To find the average compensation of each organizational group instead, a query was written. The Public Protection group has the maximum average compensation of $144452; the rest follow the pattern shown in the snapshot of the SQL output:
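
The difference between comparing groups by total and by average can be seen in a small example. The sketch below uses SQLite through Python's sqlite3 module; the table, the group names and the amounts are hypothetical and chosen so that the group with the larger total has the lower average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compensation (year INTEGER, org_group TEXT, total_compensation REAL)"
)
# Hypothetical sample: Group A has more employees, Group B is paid more on average.
conn.executemany("INSERT INTO compensation VALUES (?, ?, ?)", [
    (2015, "Group A", 90000.0), (2015, "Group A", 110000.0), (2015, "Group A", 100000.0),
    (2015, "Group B", 150000.0), (2015, "Group B", 130000.0),
])

by_group = conn.execute("""
    SELECT org_group,
           COUNT(*)                AS n,
           SUM(total_compensation) AS total,
           AVG(total_compensation) AS average
    FROM compensation
    WHERE year = 2015
    GROUP BY org_group
    ORDER BY average DESC
""").fetchall()

for row in by_group:
    print(row)
```

Group A has the larger sum only because it has more records; ordering by the average puts Group B first.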

5. A similar query was written to pull out the top 10 departments by average compensation. The Fire Department had the maximum average compensation of $182231 for fiscal year 2015. Here is the output.

6. For each department, the record with the maximum total salary was retrieved along with the rest of its columns. The output has 53 records, which cannot all be shown here, but the output table looks like this.
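
Picking the full record with the highest total salary in each department can be illustrated as follows. The report did this with a SQL query; the sketch below shows the same idea in plain Python, with a hypothetical function name and hypothetical sample records.

```python
def top_record_per_department(rows):
    """Return, for each department, the full record with the highest Total Salaries."""
    best = {}
    for row in rows:
        dept = row["Department"]
        if dept not in best or row["Total Salaries"] > best[dept]["Total Salaries"]:
            best[dept] = row
    return best

# Hypothetical sample records.
rows = [
    {"Department": "Dept A", "Job": "Job 1", "Total Salaries": 90000.0},
    {"Department": "Dept A", "Job": "Job 2", "Total Salaries": 150000.0},
    {"Department": "Dept B", "Job": "Job 3", "Total Salaries": 70000.0},
]

top = top_record_per_department(rows)
for dept, record in top.items():
    print(dept, record)
```

The result has one entry per department, which matches the 53 output records reported for the 53 departments.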

7. Finally, the departments in which the difference between the maximum and minimum salary exceeded $250,000 in fiscal year 2015 were pulled out. The department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’ has the maximum spread in salaries, with a difference of $413272 between the maximum and the minimum. Here is the output:
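
A spread of this kind can be computed with MIN and MAX per department and a HAVING filter. The sketch below uses SQLite through Python's sqlite3 module; the table and column names and the sample rows are hypothetical, and it assumes the spread is taken over a single salary column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compensation (year INTEGER, department TEXT, salaries REAL)")
# Hypothetical sample rows.
conn.executemany("INSERT INTO compensation VALUES (?, ?, ?)", [
    (2015, "Dept A", 20000.0), (2015, "Dept A", 300000.0),
    (2015, "Dept B", 50000.0), (2015, "Dept B", 120000.0),
    (2014, "Dept B", 400000.0),  # ignored: not fiscal year 2015
])

wide_spread = conn.execute("""
    SELECT department,
           MIN(salaries) AS min_salary,
           MAX(salaries) AS max_salary,
           MAX(salaries) - MIN(salaries) AS spread
    FROM compensation
    WHERE year = 2015
    GROUP BY department
    HAVING MAX(salaries) - MIN(salaries) > 250000
    ORDER BY spread DESC
""").fetchall()

print(wide_spread)
```

Only Dept A is returned: its 2015 spread is 280000, while Dept B's 2015 spread is 70000.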

C) USING TABLEAU

An overview of the statistics of the table was obtained using Excel and SQL. We now use Tableau, a tool mainly used for data visualization, to get a visual picture of the same statistics.

1. As done earlier, we visualize the employee count in the year 2015 by organizational group and by department.

a. For Organizational Groups

b. For Departments

2. Here is a visualization of the Average Total Salary for the top 10 department codes in the year 2015.

3. To get a better idea, we plot a bar graph of the total compensation, total salary and

total benefits for the top 10 departments for the year 2015.

4. The trend of average compensation for the organizational groups was studied and plotted in Tableau. For two organizational groups, General City Responsibilities and Human Welfare and Neighborhood Development, the average compensation decreased significantly from 2014 to 2015. Here is the plot.
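
The numbers behind such a trend plot are the average compensation per organizational group and year, and the change between the two years. Here is a minimal Python sketch of that calculation; the function names, group names and amounts are hypothetical and do not reproduce the report's figures.

```python
from collections import defaultdict

def average_by_group_and_year(rows):
    """Average Total Compensation for each (organization group, year) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, record count]
    for row in rows:
        key = (row["Organization Group"], row["Year"])
        sums[key][0] += row["Total Compensation"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def change_2014_to_2015(averages):
    """Difference in average compensation (2015 minus 2014) per group."""
    groups = {group for group, _ in averages}
    return {
        group: averages[(group, 2015)] - averages[(group, 2014)]
        for group in groups
        if (group, 2014) in averages and (group, 2015) in averages
    }

# Hypothetical sample records.
rows = [
    {"Organization Group": "Group A", "Year": 2014, "Total Compensation": 100000.0},
    {"Organization Group": "Group A", "Year": 2014, "Total Compensation": 120000.0},
    {"Organization Group": "Group A", "Year": 2015, "Total Compensation": 90000.0},
    {"Organization Group": "Group B", "Year": 2014, "Total Compensation": 80000.0},
    {"Organization Group": "Group B", "Year": 2015, "Total Compensation": 80000.0},
    {"Organization Group": "Group B", "Year": 2015, "Total Compensation": 100000.0},
]

averages = average_by_group_and_year(rows)
changes = change_2014_to_2015(averages)
print(changes)
```

A negative change marks a group whose average compensation fell from 2014 to 2015.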

5. To probe further, we plotted the trend of the employee count and the average compensation for just these two organizational groups, broken down by department. The number of records/employees decreased significantly for the ‘Human Services’ department from 2014 to 2015, while the number of employees in the ‘General Fund Unallocated’ department did not change significantly.

6. The plot of average salaries by Job Family Code shows that a single Job Family Code or Job name appears in multiple departments. The visualization stacks the values for the different departments under the same job family code column. Here is a glimpse:

CHAPTER 06

SUMMARY OF THE FINDINGS AND SUGGESTIONS

6.01 SUMMARY OF THE FINDINGS

The dataset of San Francisco Employee Compensation for Fiscal Years 2014 and 2015 was analyzed in this project and the following observations were made:

1. The job title ‘Asst Med Examiner’, from the Job Family ‘Med Therapy and Auxiliary’ in the department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’, has the highest recorded compensation, around $497505.

2. The Public Protection group has the maximum average compensation.

3. The Fire Department had the maximum average compensation for fiscal year 2015.

4. The department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’ has the maximum spread in salaries.

5. For two organizational groups, General City Responsibilities and Human Welfare and Neighborhood Development, the average compensation decreased significantly from 2014 to 2015.

6. The number of records/employees decreased significantly for the ‘Human Services’ department from 2014 to 2015, while the number of employees in the ‘General Fund Unallocated’ department did not change significantly.

6.02 SUGGESTIONS

Although the dataset contains a lot of information about employee compensation and its breakdown, it could be made more useful by adding more columns. Apart from normalizing and cleaning the dataset, here are a few suggestions:

1. A column showing the age of the employee or their work experience could be added, so that more information can be extracted about how salaries are distributed according to a person's experience.

2. A column showing demographic information about the employee could be added to the dataset. This would make it possible to study the distribution of salaries across different demographics.

3. A column showing the qualification of the employee, e.g. PhD or Masters, could be very useful. For a person with a certain qualification who is looking for a job in SF, this data can give an idea of the average salary an employee with that qualification gets in the particular field/department they are planning to apply to.

4. A column with a flag indicating whether the employee has worked in California before could also be put to good use. Generally, some departments prefer people who have worked in the State before, and there is a difference in the CTC (cost to company) for these employees compared to those who have not, so this information can also be useful.

REFERENCES –

1. Data –

https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd

2. Picture –

http://highincomerealestate.com/wp-content/uploads/2014/09/SanFrancisco2.jpg