EDA of San Francisco Employee Compensation for Fiscal Year 2014-15
DATA MANAGEMENT FINAL PROJECT
ANALYSIS OF SAN-FRANCISCO EMPLOYEE COMPENSATION FOR
FISCAL YEAR 2014 AND 2015
SUBMITTED BY
SAGAR VINAYKUMAR TUPKAR
MS-BUSINESS ANALYTICS’16
UNIVERSITY OF CINCINNATI, OHIO
CHAPTER 01
DATA INFORMATION
1.01 ABOUT DATA
This project works with the compensation data for City employees in San Francisco for Fiscal
Years 2014 and 2015. The San Francisco Controller's Office maintains a database of the salary
and benefits paid to City employees since fiscal year 2013. This data is also summarized and
presented in the Employee Compensation report hosted at http://openbook.sfgov.org. New data is
added on a bi-annual basis when available for each fiscal and calendar year.
1.02 DATA SOURCE
The data was obtained from an open-data website (www.data.sfgov.org). The dataset is available
at the following link.
https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd
1.03 MODIFICATIONS DONE TO THE DATA
a. The original data downloaded from the website did not have the correct datatypes.
Using Excel, the datatypes were changed to Number for the measures and Text for the
dimensions.
b. Using Excel, a filter was applied and only the data for FISCAL years 2014 and 2015 was
extracted. All CALENDAR year data and all 2013 data were excluded from the dataset to
be analyzed.
c. Some columns were also excluded, e.g. Year Type, Union Code, Union Name, Employee
Identifier, etc., as this information was not used in the analysis.
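The steps above were carried out in Excel. As a rough illustration only, the same filtering and type conversion could be sketched in Python as follows. The sample rows are made up, and `DROPPED_COLUMNS` lists only an assumed subset of the excluded columns.

```python
import csv
import io

# Made-up sample rows; column names follow the report, values are illustrative only.
RAW_CSV = """Year Type,Year,Organization Group,Union Code,Salaries
Fiscal,2013,Group A,911,1000.00
Fiscal,2014,Group A,911,1200.50
Calendar,2014,Group B,250,900.00
Fiscal,2015,Group B,250,950.25
"""

DROPPED_COLUMNS = {"Year Type", "Union Code"}  # assumed subset of the excluded columns
KEPT_YEARS = {"2014", "2015"}


def prepare_rows(raw_text):
    """Keep fiscal-year 2014/2015 rows, drop unused columns, cast the measure to float."""
    prepared = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Step b: exclude calendar-year rows and anything outside 2014/2015.
        if row["Year Type"] != "Fiscal" or row["Year"] not in KEPT_YEARS:
            continue
        # Step c: drop the columns that the analysis does not use.
        cleaned = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
        # Step a: measures become numbers, dimensions stay as text.
        cleaned["Salaries"] = float(cleaned["Salaries"])
        prepared.append(cleaned)
    return prepared


rows = prepare_rows(RAW_CSV)
```

The filter runs before the columns are dropped because the Year Type column is needed to tell fiscal rows from calendar rows.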
CHAPTER 02
TABLE OVERVIEW
2.01 GENERIC OVERVIEW OF THE DATA
The dataset used for this study is the Employee Compensation data for the city of San Francisco
for Fiscal Years 2014 and 2015. The dataset as modified for analysis contains 83946 rows and
18 columns. The columns run hierarchically from Organization Group down to Job, and the
compensation is broken down into Salaries and Benefits, each of which is further split into
categories. The columns in the dataset, with descriptions, are –
1) Year – the year for which the data exists (2014 or 2015)
2) Organization Group Code – a unique code given to an Organization Group
3) Organization Group – name of the Organization group
4) Department Code – a unique code given to a department
5) Department – name of the department
6) Job Family Code – a unique code given to a Job Family
7) Job Family – name of the Job Family
8) Job Code – a unique code given to a Job
9) Job – name of the Job
10) Salaries – salary for that job in USD
11) Overtime – overtime pay in USD
12) Other Salaries – other salaries besides the main salary in USD
13) Total Salaries – total salary (sum of the 3 columns above) in USD
14) Retirement – benefit due to retirement plan in USD
15) Health/Dental – benefit due to health/dental privileges in USD
16) Other Benefits – other benefits in USD
17) Total Benefits – total benefits (sum of the 3 benefit columns above) in USD
18) Total Compensation – total compensation (total salary + total benefits) in USD
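The relationships among the three total columns can be checked row by row. The following is a minimal sketch, assuming each record is a dictionary keyed by the column names listed above. The `tolerance` parameter and the sample values are hypothetical.

```python
import math


def totals_are_consistent(record, tolerance=0.01):
    """Return True when the three total columns match the sums of their parts."""
    expected_salaries = record["Salaries"] + record["Overtime"] + record["Other Salaries"]
    expected_benefits = (
        record["Retirement"] + record["Health/Dental"] + record["Other Benefits"]
    )
    expected_compensation = record["Total Salaries"] + record["Total Benefits"]
    return (
        math.isclose(record["Total Salaries"], expected_salaries, abs_tol=tolerance)
        and math.isclose(record["Total Benefits"], expected_benefits, abs_tol=tolerance)
        and math.isclose(
            record["Total Compensation"], expected_compensation, abs_tol=tolerance
        )
    )


# Made-up record, for illustration only.
sample = {
    "Salaries": 80000.0, "Overtime": 5000.0, "Other Salaries": 1000.0,
    "Total Salaries": 86000.0,
    "Retirement": 15000.0, "Health/Dental": 10000.0, "Other Benefits": 5000.0,
    "Total Benefits": 30000.0,
    "Total Compensation": 116000.0,
}
broken = dict(sample, **{"Total Benefits": 31000.0})
```

A small absolute tolerance is used because the measures are currency values that may carry rounding differences.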
2.02 ANALYSIS TO BE DONE ON THE DATASET
The rest of the report probes the dataset to extract information from it. The dataset will be
analyzed for the Average/Minimum/Maximum/Sum of Salary, Benefits, and Compensation across
Organization Groups, Departments, Job Families and Jobs, with attention to outliers, as they
would be insightful to the reader. The analysis will also compare the statistics for Fiscal
Year 2015 against Fiscal Year 2014. Important information such as the number of employees in a
particular department or organization group, or doing a particular type of job, will also be
showcased and analyzed.
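As a small illustration of the kind of grouped summary described here, the sketch below computes count, minimum, maximum, mean and sum per group in plain Python. The group names and amounts are made up, and `summarize` is a hypothetical helper, not part of the project's code.

```python
from collections import defaultdict
from statistics import mean

# Made-up (organization group, total compensation) pairs, for illustration only.
records = [("Group A", 100.0), ("Group A", 300.0), ("Group B", 50.0)]


def summarize(rows):
    """Group values by their first field and compute basic statistics per group."""
    grouped = defaultdict(list)
    for group, value in rows:
        grouped[group].append(value)
    return {
        group: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "sum": sum(values),
        }
        for group, values in grouped.items()
    }


summary = summarize(records)
```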
CHAPTER 03
NORMALIZATION OF THE DATA
3.01 IS THE DATASET NORMALIZED?
A dataset is usually normalized before analysis to remove redundancy and repetition in the
information it contains. A relational database is also easier to analyze and maintain than a
non-relational one. The dataset analyzed here, although uniform and granular, is not
normalized. Values are repeated across rows: the same code and name pairs appear again and
again. Also, there is no linkage between columns that are intuitively related to each other,
e.g. a Job is part of a Job Family, Job Families differ between Departments, and Departments
are grouped into Organization Groups. All these columns can be related.
3.02 HOW TO NORMALIZE THE DATASET?
As mentioned above, the dataset needs to be normalized in order to remove the redundancy from
the rows.
1. To normalize the dataset, new tables need to be created and linked with each other through
the relations they have. E.g.
Table 1 – Organization Group Code and Organization Group Name, because every code
has a unique name associated with it.
Similar tables can be created for Department, Job Family and Job.
2. Using the above tables and a fact table, the same dataset can be re-formed in normalized
form using joins in SQL.
3. A Total Salary table can also be created from the columns Salary, Overtime, Other
Salary and Total Salary; in this new table, the Total Salary column would be computed by
an expression in the SQL query rather than stored. Hence, whenever the values of the other
3 columns change, the total salary is automatically updated. The same can be done for Total
Benefits and thus Total Compensation, as all these values are linked with each other.
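The three steps above can be sketched with Python's built-in `sqlite3` module, which stands in here for SQL Server. This is a minimal illustration under stated assumptions: the table and view names are hypothetical, the codes and amounts are made up, and each department is assumed to belong to exactly one organization group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: one lookup table per code/name pair (hypothetical table names).
CREATE TABLE organization_group (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE department (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    organization_group_code TEXT NOT NULL REFERENCES organization_group(code)
);
-- Fact table: only codes and base measures are stored.
CREATE TABLE compensation (
    year INTEGER NOT NULL,
    department_code TEXT NOT NULL REFERENCES department(code),
    salaries REAL, overtime REAL, other_salaries REAL
);
-- Steps 2 and 3: joins rebuild the flat dataset, and the total is derived
-- on read instead of being stored, so it cannot drift from its parts.
CREATE VIEW compensation_flat AS
SELECT c.year, g.name AS organization_group, d.name AS department,
       c.salaries, c.overtime, c.other_salaries,
       c.salaries + c.overtime + c.other_salaries AS total_salaries
FROM compensation AS c
JOIN department AS d ON d.code = c.department_code
JOIN organization_group AS g ON g.code = d.organization_group_code;

-- Made-up rows, for illustration only.
INSERT INTO organization_group VALUES ('G1', 'Group A');
INSERT INTO department VALUES ('D1', 'Department X', 'G1');
INSERT INTO compensation VALUES (2015, 'D1', 100000.0, 20000.0, 5000.0);
""")

flat_rows = conn.execute("SELECT * FROM compensation_flat").fetchall()
```

Because `total_salaries` lives in a view, updating any of the three base columns changes the total the next time the view is queried.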
CHAPTER 04
PROBLEMS IN THE DATASET AND DATA CLEANING
4.01 PROBLEMS IN THE DATASET
Although the dataset is well organized and maintained by the SF government, it has certain
problems that should be fixed to make it better.
1. The values in the ‘Job Family Code’ and ‘Job Code’ columns are not consistent in format.
While most of the codes are numeric, some entries are alphanumeric. This causes problems
in data manipulation.
2. The measure columns such as ‘Salaries’, ‘Benefits’ etc. have many negative entries.
Such records should be removed from the dataset and, if they have any significance,
saved in another table for separate analysis. Negative values in these columns make no
sense, and they distort the overall analysis (Sum, Average etc.) as well.
3. It was observed that some Job Codes and Job names are the same across different
departments. This can cause confusion when drawing conclusions about the salaries and
compensation for a particular job name unless filters are applied.
4. There were NULLs in the initial dataset, which might have caused serious problems.
5. As mentioned earlier, the datatypes of the columns were not in a standard format,
which could have caused problems when importing the data into any other tool for analysis.
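The handling suggested in point 2, moving negative records into a separate table instead of discarding them, might look like the sketch below. It uses Python's `sqlite3` module in place of SQL Server; the table name `compensation_negative` is hypothetical and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compensation (job TEXT, salaries REAL, total_benefits REAL);
-- Made-up rows, for illustration only.
INSERT INTO compensation VALUES
    ('Job A', 50000.0, 12000.0),
    ('Job B',  -300.0,    40.0),
    ('Job C', 61000.0,   -15.5);

-- Copy the suspect rows aside first, then remove them from the analysis table.
CREATE TABLE compensation_negative AS
SELECT * FROM compensation WHERE salaries < 0 OR total_benefits < 0;
DELETE FROM compensation WHERE salaries < 0 OR total_benefits < 0;
""")

kept = conn.execute("SELECT job FROM compensation ORDER BY job").fetchall()
set_aside = conn.execute(
    "SELECT job FROM compensation_negative ORDER BY job"
).fetchall()
```

Copying before deleting keeps the negative records available for a later look at what they represent.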
4.02 IMPROVEMENT AND SUGGESTIONS
As discussed above, there are many issues with the dataset that can interfere with further
analysis, so the dataset was cleaned using Excel and SQL. All the datatypes were corrected in
Excel before any operation was done on the table. After truncating the data as needed, it was
imported into SQL Server Express and all the Nulls (only present in the dimensions) were
replaced by ‘0.00’.
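The null replacement described above could be done with a single `UPDATE` using `COALESCE`. The sketch below runs it through Python's `sqlite3` module rather than SQL Server Express; the table layout and rows are made up, and `job_family` is just one example of a dimension column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compensation (department TEXT, job_family TEXT, salaries REAL);
-- Made-up rows, for illustration only.
INSERT INTO compensation VALUES
    ('Dept X', 'Family 1', 100.0),
    ('Dept Y', NULL,       200.0);
""")

nulls_before = conn.execute(
    "SELECT COUNT(*) FROM compensation WHERE job_family IS NULL"
).fetchone()[0]

# COALESCE returns its first non-NULL argument, so existing values are untouched.
conn.execute("UPDATE compensation SET job_family = COALESCE(job_family, '0.00')")

nulls_after = conn.execute(
    "SELECT COUNT(*) FROM compensation WHERE job_family IS NULL"
).fetchone()[0]
```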
Apart from fixing the problems present in the dataset, a few additional changes could
increase the utility of the table and allow much more information to be extracted.
1. New columns with the name, age, work experience and work history of the employee
could be added to the dataset (for government officials, where disclosing names is
legal).
2. The code columns could have been dropped to make the dataset smaller and tidier.
The identifiers could be added back later while normalizing the dataset.
CHAPTER 05
GENERAL STATISTICS OF THE DATA
A) USING EXCEL –
For our dataset, we check the number of records for each organization group for years
2014 and 2015 combined. Here is the output from an Excel pivot table –
To probe further into the number of records for each department within an organization
group, a pivot table was used again; snapshots of the results are attached below –
a. General City Responsibilities
b. Culture and Recreation
c. General Administration and Finance
d. Community Health
e. Human Welfare and Neighborhood Development
f. Public Protection
g. Public Work, Transportation and Commerce
B) USING SQL –
The dataset was imported into Microsoft SQL Server Management Studio for initial
analysis. A SQL file containing all the code with descriptions is attached along with the
submission. A snapshot of the top 15 records was taken separately in SQL for the
dimensions and for the measures. Here are the snapshots of the sample, to give the
reader an idea of the data.
1. Dimensions
2. Measures
Some queries were written and run in SQL. Here are some of the observations –
1. An initial overview or summary of the data was obtained – e.g. total number of records,
number of records in 2014 and in 2015, number and names of distinct organization
groups, and the numbers of distinct departments, job families and jobs. It was observed
that there are a total of 83946 records, of which 43078 are from the year 2015 and
40686 are from the year 2014. It appears that the number of registered employees in
San Francisco increased by 2392 from Fiscal Year 2014 to 2015. Also, it was observed
that there are 7 different Organizational Groups, 53 Departments, 55 Job Families and
1068 different job titles for the year 2015 in San Francisco. Here are the snapshots of
the output from SQL –
2. A query was written and run in SQL to find the top 10 departments with the largest
number of employees in 2015. It was observed that the Public Health Department had the
maximum number of employees, 9148, for Fiscal Year 2015, followed by the Municipal
Transportation Agency with 6427 employees. Here is the output of the query –
3. The top 10 compensations in the entire database for the year 2015 were extracted by
writing a query. It was observed that the Job title ‘Asst Med Examiner’ from the Job
Family ‘Med Therapy and Auxiliary’ in the department ‘General Services Agency-City
Admin’ under the Organizational Group ‘General Administration & Finance’ has the
highest recorded compensation, around $497505.
4. The sum of the compensation in an organizational group grows with the number of
employees in the group, so it is a biased indicator of typical pay. To compare groups,
the average compensation of each organizational group was computed with a query. It
was observed that the Public Protection group has the maximum average compensation
of $144452, and the rest follow the pattern shown in the snapshot from the SQL output –
5. A similar query was written to pull out the top 10 departments by average
compensation. It was observed that the Fire Department had the maximum average
compensation of $182231 for fiscal year 2015. Here is the output –
6. The record with the maximum total salary was retrieved for each department, along with
the other column information. The output has 53 records, which cannot all be shown
here, but the output table looks like this –
7. Finally, those records were pulled out for which the difference between the maximum
and minimum salaries was greater than 250k for fiscal year 2015. It was observed that
the department ‘General Services Agency-City Admin’ under the Organizational Group
‘General Administration & Finance’ has the maximum spread in salaries, with the
difference between maximum and minimum being $413272. Here is the output –
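The submitted SQL file is not reproduced in this report. As a rough sketch of what queries 2, 4, 6 and 7 might look like, the example below uses Python's `sqlite3` module on made-up rows. SQLite stands in for SQL Server, so `LIMIT 10` appears where T-SQL would use `TOP 10`. The table layout is hypothetical, and reading query 7 as a per-department spread is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compensation (
    year INTEGER, organization_group TEXT, department TEXT,
    total_salaries REAL, total_compensation REAL
);
-- Made-up rows, for illustration only.
INSERT INTO compensation VALUES
    (2015, 'Group A', 'Dept X',  20000.0,  30000.0),
    (2015, 'Group A', 'Dept X', 300000.0, 390000.0),
    (2015, 'Group A', 'Dept Y',  90000.0, 120000.0),
    (2015, 'Group B', 'Dept Z', 100000.0, 140000.0),
    (2015, 'Group B', 'Dept Z', 110000.0, 160000.0),
    (2014, 'Group B', 'Dept Z', 105000.0, 150000.0);
""")

# Query 2: departments ranked by number of records in 2015.
largest_departments = conn.execute("""
    SELECT department, COUNT(*) AS records
    FROM compensation
    WHERE year = 2015
    GROUP BY department
    ORDER BY records DESC, department
    LIMIT 10
""").fetchall()

# Query 4: average compensation per organization group in 2015.
average_by_group = conn.execute("""
    SELECT organization_group, AVG(total_compensation) AS avg_compensation
    FROM compensation
    WHERE year = 2015
    GROUP BY organization_group
    ORDER BY avg_compensation DESC
""").fetchall()

# Query 6: the highest total salary in each department for 2015.
top_salary_per_department = conn.execute("""
    SELECT c.department, c.total_salaries
    FROM compensation AS c
    WHERE c.year = 2015
      AND c.total_salaries = (
          SELECT MAX(total_salaries) FROM compensation
          WHERE department = c.department AND year = 2015
      )
    ORDER BY c.department
""").fetchall()

# Query 7: departments whose salary spread (max - min) exceeds 250k in 2015.
wide_spread = conn.execute("""
    SELECT department, MAX(total_salaries) - MIN(total_salaries) AS spread
    FROM compensation
    WHERE year = 2015
    GROUP BY department
    HAVING MAX(total_salaries) - MIN(total_salaries) > 250000
    ORDER BY spread DESC
""").fetchall()
```

Query 5 follows the same shape as query 4, grouping by department instead of organization group and adding a row limit.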
C) USING TABLEAU
We got an overview of the statistics of the table using Excel and SQL. Now we use Tableau,
a tool mainly used for data visualization, to get a visual idea of the statistics.
1. As done earlier, we form a visualization of the employee count in the year 2015
for organizational groups and departments.
a. For Organizational Groups
b. For Departments
2. Here is a visualization of the Average Total Salary for the top 10 department codes in
the year 2015.
3. To get a better idea, we plot a bar graph of the total compensation, total salary and
total benefits for the top 10 departments for the year 2015.
4. The trend of average compensation for the organizational groups was studied and
plotted in Tableau. It was observed that for two organizational groups, General City
Responsibilities and Human Welfare and Neighborhood Development, the average
compensation decreased significantly from 2014 to 2015. Here is the plot –
5. To probe further, we plotted the trend in the count of employees and the average
compensation for just these two organizational groups, broken down by Department.
It was observed that the number of records/employees decreased significantly for the
‘Human Services’ Department from 2014 to 2015, while there wasn’t a significant
change in the number of employees in the General Fund Unallocated Department.
6. The plot of average salaries by Job Family Code shows that a single Job Family Code
or Job Name appears in multiple departments. The visualization stacks the output for
different departments under the same job family code column. Here is a glimpse –
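The numbers behind the trend plots (items 4 and 5) and the job family overlap (item 6) can also be cross-checked in SQL. The sketch below is one possible way to do so, using Python's `sqlite3` module with made-up rows and a hypothetical table layout. It is not the Tableau workbook itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compensation (
    year INTEGER, organization_group TEXT, department TEXT,
    job_family_code TEXT, total_compensation REAL
);
-- Made-up rows, for illustration only.
INSERT INTO compensation VALUES
    (2014, 'Group A', 'Dept X', 'F1', 100.0),
    (2014, 'Group A', 'Dept X', 'F2', 200.0),
    (2015, 'Group A', 'Dept X', 'F1', 120.0),
    (2015, 'Group B', 'Dept Y', 'F1',  90.0),
    (2014, 'Group B', 'Dept Y', 'F3', 110.0);
""")

# Items 4 and 5: record count and average compensation per group and year.
trend = conn.execute("""
    SELECT organization_group, year, COUNT(*), AVG(total_compensation)
    FROM compensation
    GROUP BY organization_group, year
    ORDER BY organization_group, year
""").fetchall()

# Item 6: job family codes that occur in more than one department.
shared_families = conn.execute("""
    SELECT job_family_code, COUNT(DISTINCT department) AS departments
    FROM compensation
    GROUP BY job_family_code
    HAVING COUNT(DISTINCT department) > 1
""").fetchall()
```

Listing the count next to the average makes it easier to see whether a change in average compensation coincides with a change in the number of records.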
CHAPTER 06
SUMMARY OF THE FINDINGS AND SUGGESTIONS
6.01 SUMMARY OF THE FINDINGS
The dataset of San Francisco Employee Compensation for Fiscal Years 2014 and 2015 was
analyzed in this project and the following observations were made –
1. The Job title ‘Asst Med Examiner’ from the Job Family ‘Med Therapy and Auxiliary’ in
the department ‘General Services Agency-City Admin’ under the Organizational Group
‘General Administration & Finance’ has the highest recorded compensation, around
$497505.
2. The Public Protection group has the maximum average compensation.
3. The Fire Department had the maximum average compensation for fiscal year 2015.
4. The department ‘General Services Agency-City Admin’ under the Organizational Group
‘General Administration & Finance’ has the maximum spread in salaries.
5. For two organizational groups, General City Responsibilities and Human Welfare and
Neighborhood Development, the average compensation decreased significantly from
2014 to 2015.
6. The number of records/employees decreased significantly for the ‘Human Services’
Department from 2014 to 2015, while there wasn’t a significant change in the number of
employees in the General Fund Unallocated Department.
6.02 SUGGESTIONS
Although the dataset has a lot of information on Employee Compensation and its
breakdown, it could be made better by including more columns. Apart from normalizing
and cleaning the dataset, here are a few suggestions –
1. A column showing the age of the employee or their work experience could be added so that
more information can be pulled about the distribution of Salaries according to the
experience a person has.
2. A column showing demographic information about the employee could be added to the
dataset. This would make it possible to get the distribution of salaries across different
demographics.
3. A column showing the qualification of the employee, e.g. PhD or Masters, could be
very useful. For a person with a certain qualification who is looking for a job in SF, this
data can give an idea of the average salary an employee with that qualification gets in
the particular field/department they are planning to apply to.
4. A column with a flag indicating whether the employee has worked in California before
could also be put to good use. Generally, some departments prefer people who have
worked in the State before, and there is a difference in compensation for these
employees as compared to the people who haven’t, so this information can also be useful.
REFERENCES –
1. Data –
https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd
2. Picture –
http://highincomerealestate.com/wp-content/uploads/2014/09/SanFrancisco2.jpg