Preparing a Dataset for Processing
Contents
Introduction
Obtaining data from an original source
ETL process (Extract, Transform and Load)
Using the SAS University Edition Virtual Machine
Using the dataset for production
Author : Manish Chopra
Date : 15th April 2017
Introduction
This tutorial shows how a new dataset can be prepared for processing by joining multiple Excel files
into a single large CSV (Comma-Separated Values) file. The resulting dataset can then be used
with RDBMS systems and Big Data NoSQL systems. When scaling up towards a larger production
system, it is desirable and often essential to have many such datasets with prospective
relationships among them.
In this document, we will see how a dataset can be prepared from data files downloaded from
the Indian government's open data portal. The following two examples are taken from the site:
Primary Census Abstract 2011 - India and States
Company Master Data up to 31st March 2015
The portal offers many datasets and categories, each with its own description. This makes it
easy to design the schema of the database that you will be creating for your project.
Obtaining data from an original source
The Indian government data site (https://data.gov.in) publishes many datasets.
The following figure displays the datasets and categories available on the site, which we can navigate
to select the files to download. An API is also provided for accessing the datasets over
the internet.
Figure 1 : Indian government data portal's primary page
The Indian government data site provides a rich set of data that can be put to production use.
Our approach here is to start with a few categorized datasets, such as the country's population statistics
and company data, and consolidate each of them into a single large dataset.
For each of the examples taken here, that is Primary Census Data and Company Master Data, there are
35 files available on the site, each representing one state of India. These files are downloaded from the
URL given above and placed in one directory. The following is an extract from the directory listing of these
files.
PCA0000_2011_MDDS.xls
PCA0100_2011_MDDS.xls
.........
.........
PCA3400_2011_MDDS.xls
PCA3500_2011_MDDS.xls
ETL Process (Extract, Transform and Load)
We can either open these files manually and build a consolidated file by combining them one by one, or
process them all at once in a batch. We take the latter approach as it automates most of the
required manual work. The XLS files are first converted into the CSV (Comma-Separated Values) format
with a tool known as "XLS to CSV Converter", which batch converts all the files in a single operation.
We now have 35 CSV files as the output of the XLS to CSV conversion. Next, copy the folder containing
the CSV files to a Linux machine and open the terminal. There, use the following command to generate a
single file:
[linux@localhost CSVs]$ cat PCA* >> /tmp/Consolidated-Population-Dataset.csv
The above command reads all 35 files whose names start with PCA and appends their contents to
Consolidated-Population-Dataset.csv. If the header rows of these files were not removed prior to this
operation, each of them will appear as a row inside the new CSV file. Remove the duplicate headers
from the file, and your dataset is ready. The new CSV file can be opened in MS Excel or used by other
applications that support the format.
The same process has been performed for the "Company Master Data" Excel files available on the Indian
government data portal.
The resultant file is approximately 433 MB in size and could not be opened completely in MS Excel, which
can load a maximum of 1,048,576 rows per worksheet, whereas this file contained around 1.46 million rows.
To overcome this limitation of MS Excel, we shall use SAS software, which can handle
this large dataset. Other systems can handle large datasets too, such as Oracle, MySQL, SQL Server and
NoSQL databases.
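Before opening a consolidated file in MS Excel, it can help to check its line count against the worksheet limit. The following is a minimal sketch, not part of the original procedure; the function name fits_in_excel and the variable EXCEL_MAX_ROWS are hypothetical, and the check assumes that no quoted field contains an embedded line break, since wc -l simply counts newline characters.

```shell
# Worksheet row limit quoted in the Excel warning message.
EXCEL_MAX_ROWS=1048576

# fits_in_excel FILE
# Prints the line count of FILE and returns success (0) only if the
# file has no more lines than one Excel worksheet can hold.
# Assumption: one CSV record per line (no embedded newlines in fields).
fits_in_excel() {
    local rows
    # tr strips the padding that some wc implementations add.
    rows=$(wc -l < "$1" | tr -d ' ')
    echo "$1: $rows lines (Excel limit: $EXCEL_MAX_ROWS)"
    [ "$rows" -le "$EXCEL_MAX_ROWS" ]
}
```

For example, fits_in_excel /tmp/Consolidated-Population-Dataset.csv prints the line count and sets the exit status accordingly, so it can be used in an if statement.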
MS Excel displayed the following warning message when the 433 MB file was opened:
Figure 2 : Microsoft Excel restricting maximum number of rows to 1 million
The complete text of the MS Excel warning message boxes shown above is as follows.
This message can appear due to one of the following:
The file contains more than 1,048,576 rows or 16,384 columns. To fix this problem, open the source file in a text editor such as Microsoft Office Word. Save the source file as several smaller files that conform to this row and column limit, and then open the smaller files in Microsoft Office Excel. If the source data cannot be opened in a text editor, try importing the data into Microsoft Office Access, and then exporting subsets of the data from Access to Excel.
The area that you are trying to paste the tab-delineated data into is too small. To fix this problem, select an area in the worksheet large enough to accommodate every delimited item.
Notes
Excel cannot exceed the limit of 1,048,576 rows and 16,384 columns.
By default, Excel places three worksheets in a workbook file. Each worksheet can contain 1,048,576 rows and 16,384 columns of data, and workbooks can contain more than three worksheets if your computer has enough memory to support the additional data.
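The message's advice to save the source file as several smaller files can also be followed on the Linux command line instead of in a text editor. The sketch below is one possible way to do it, not the method used in this tutorial. The function name split_csv and the output naming are hypothetical. It assumes the input has a single header line, that every record occupies exactly one line, and that no files starting with the chosen prefix exist yet.

```shell
# split_csv INPUT PREFIX [MAX_ROWS]
# Splits INPUT into files PREFIXaa.csv, PREFIXab.csv, ... so that each
# piece, including a copy of the header line, has at most MAX_ROWS lines
# (default: the Excel worksheet limit of 1048576).
# Assumption: no existing files match PREFIX* before the call.
split_csv() {
    local src="$1" prefix="$2" max="${3:-1048576}"
    local header part
    header=$(head -n 1 "$src")
    # Split only the data rows, leaving room for one header line per piece.
    tail -n +2 "$src" | split -l "$((max - 1))" - "$prefix"
    # Prepend the header to every piece and give it a .csv extension.
    for part in "$prefix"*; do
        { printf '%s\n' "$header"; cat "$part"; } > "$part.csv"
        rm "$part"
    done
}
```

A call such as split_csv big.csv /tmp/part_ (with big.csv standing in for the consolidated file) would leave pieces named /tmp/part_aa.csv, /tmp/part_ab.csv and so on, each small enough for one worksheet.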
To avoid duplicate header rows in the consolidated file, we can edit the CSV files in Linux and remove
the header from each of the 35 files, either in the vi editor or with a small script that drops the first
row of each file. This is a data cleansing step, as we do not want the header rows to appear among the
data rows of the consolidated dataset.
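The small script mentioned above could look like the following sketch. It is an illustration under stated assumptions rather than the exact script used here: the function name merge_csv is hypothetical, every input file is assumed to start with the same single header line and to end with a newline, and the output file must not be one of the inputs.

```shell
# merge_csv OUTPUT INPUT...
# Writes the header line of the first input once, then appends the
# data rows (line 2 onward) of every input file to OUTPUT.
# Assumptions: identical one-line headers; each file ends with a newline.
merge_csv() {
    local out="$1" f
    shift
    # Keep a single copy of the header, taken from the first input file.
    head -n 1 "$1" > "$out"
    # Append each file without its first (header) line.
    for f in "$@"; do
        tail -n +2 "$f" >> "$out"
    done
}
```

Run from the folder containing the converted files, merge_csv /tmp/Consolidated-Population-Dataset.csv PCA*.csv would produce the consolidated file with exactly one header row, leaving the individual files untouched.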
Using the SAS University Edition Virtual Machine
SAS is a collection of many software tools: a data analysis tool, a programming language, a statistical
package, a business intelligence tool, and more.
SAS University Edition runs in a virtual environment on any computer that can run VMware
Player, VMware Fusion, or Oracle VirtualBox. The requirements for running SAS University Edition, which
is meant for non-commercial use, are displayed when you download it.
The SAS University Edition uses SAS Studio as the interface. SAS Studio provides an environment that
includes a point-and-click facility for performing many common tasks, such as producing reports, graphs,
data summaries, and statistical tests. For those who either enjoy programming or have more
complicated tasks, SAS Studio also allows you to write and run your own programs.
According to the SAS website, the following are some benefits of using SAS University Edition:
Statistics and quantitative methods in a variety of areas : economics, psychology and other
social sciences, computer science, business, medical/health sciences, engineering, etc.
Introductory to advanced-level statistics and quantitative methods
SAS programming and statistical analysis
A consistent user experience across all applications
Figure 3 : Features of SAS University Edition
Features of SAS University Edition
SAS Studio - An intuitive interface lets you interact with the software from Windows, Mac or
Linux workstation.
Base SAS - A powerful programming language that is easy to learn and easy to use.
SAS/STAT - Comprehensive, reliable tools include state-of-the-art statistical methods.
SAS/IML - A robust, yet flexible matrix programming language enables more in-depth,
specialized analysis and exploration.
SAS/ETS - Several time series forecasting procedures – TIMEDATA, TIMESERIES, ARIMA, ESM,
UCM and TIMEID are included.
SAS/ACCESS - Out-of-the-box access to PC file formats provides a simplified approach to
accessing data.
Powerful statistical software
With SAS University Edition, you get SAS Studio, Base SAS, SAS/STAT, SAS/IML, SAS/ACCESS and several
time series forecasting procedures from SAS/ETS. It's the same world-class analytics software used by
more than 80,000 business, government and university sites around the world, including 93 of the top
100 companies on the Fortune Global 500 list. So you'll be using the most up-to-date statistical and
quantitative methods.
Fill the skills gap
By 2018, demand for workers skilled in analytics could outpace supply by 60 percent, or 1.5 million jobs,
according to a McKinsey Global Institute study.
The SAS University Edition Virtual Machine can be downloaded from the SAS website at the link given below:
https://www.sas.com/en_us/software/university-edition.html
Further, one can follow the book titled "An Introduction to SAS University Edition" by Ron Cody to become
well versed in SAS analytics. The book comes with example code and datasets for working through the
exercises given in it. To learn more, ample documentation is available on the SAS website.
Once your virtualization environment, such as VMware or VirtualBox, is set up, import the
downloaded OVA file and start the SAS University Edition Virtual Machine. The VM startup looks as follows:
Figure 4 : Starting SAS University Edition Virtual Machine
After the VM loads completely, it displays a screen like the one in the image below, along with a URL
for connecting to it through a web browser.
Figure 5: Terminal Screen of the SAS Virtual Machine
Now open the above URL in a web browser of your choice, such as Chrome or Firefox.
Figure 6: Web GUI Screen connected to the SAS Virtual Machine
Working with the SAS VM
As previously mentioned, MS Excel has a limitation on the maximum number of rows and columns. Here
is a screenshot displaying the maximum number of rows (1048576) that MS Excel could display when
the 433 MB file was opened.
Figure 7 : Maximum rows in Microsoft Excel - 1048576
The same 433 MB CSV file was successfully imported into the SAS virtual machine, as shown in the figure
below.
Figure 8 : SAS web GUI displaying the starting row of Companies Dataset
These images are screenshots of web browser interfaces connected to the SAS University Edition VM. The
previous image shows the first rows of the dataset, and the image below shows its last row.
Figure 9 : A total of 1459085 rows in Companies Dataset
In the image above, we see that 1459085 records were imported into the SAS virtual machine successfully.
Using the Dataset for Production
So far, we have generated a single file of 433 MB. It can either be put to use as a single
file or be placed in a relational database or a NoSQL database, so that inferences can be drawn from it, for example through
SQL statements.
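As a rough illustration of placing a CSV file in a relational database and querying it with SQL, the sketch below uses the sqlite3 command-line shell. SQLite is chosen here only because it needs no server; it is an assumption of this example and not a database named in this tutorial. The function name load_csv_sqlite and the table name passed to it are hypothetical, the sqlite3 program must be installed, and the CSV file is assumed to have a header row, which sqlite3 uses for the column names when the table does not exist yet.

```shell
# load_csv_sqlite DB_FILE CSV_FILE TABLE
# Imports CSV_FILE into TABLE inside the SQLite database DB_FILE and
# prints the number of imported rows using a SQL statement.
# Assumptions: sqlite3 is installed; TABLE does not exist yet, so the
# first CSV row supplies the column names.
load_csv_sqlite() {
    local db="$1" csv="$2" table="$3"
    sqlite3 "$db" <<EOF
.mode csv
.import "$csv" $table
EOF
    sqlite3 "$db" "SELECT COUNT(*) FROM $table;"
}
```

Once the data is loaded, any further SQL statement can be run in the same way, for instance by passing a different query as the second argument to sqlite3.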
A highly complex database schema can span thousands of tables with numerous relationships among
them, much like how our brain works, or how a regular network switch works as a mesh
in which many-to-many transactions take place continuously.
There are several ways to prepare datasets, and this was one of them. We also
saw that certain applications are not well suited to processing a large dataset. In another tutorial, we shall
see how these two datasets are put to use in applications.