Bussiness Analytics Chep-2
Uploaded by rohit-kumar
8/13/2019 Bussiness Analytics Chep-2
Types of Digital Data
Digital Data
Unstructured (80%) :
Semi-structured :
Structured: organized form, in tables.
According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured. Because most data is in these formats, it is difficult to extract information from it.
Structured data is organized in rows and columns in a rigidly defined format so that applications can retrieve and process it efficiently. It is typically stored using a database management system (DBMS).
Data is unstructured if its elements cannot be stored in rows and columns, and it is therefore difficult for business applications to query and retrieve.
For example, customer contacts may be stored in various forms such as sticky notes, e-mail messages, business cards, or even digital files such as .doc, .txt, and .pdf. Due to its unstructured nature, this data is difficult to retrieve using a customer relationship management application.
Types of data
In the hospital (GoodLife), data is maintained in a structured way, so anyone can locate the desired information easily.
Comes from sources such as Access, OLTP systems, SQL databases, and spreadsheets.
Fully described datasets.
Clearly described categories and subcategories.
Data is neatly placed in rows and columns.
Indexing can be done easily.
Characteristics of Structured data
Unstructured Data
The email was not successfully updated in the medical system database because it was in an unstructured format.
It is difficult to determine the meaning of the data.
It does not follow any rules or semantics.
It can be of any type, so it is unpredictable.
Free-form text without any structure.
Characteristics of Unstructured data
Anything in non-database form.
Bitmap objects: image, video, and audio files.
Textual objects: Word documents, emails.
The body of an email is raw data without any structure.
The email had not been updated into the medical database record.
Noisy text such as chats, emails, and SMS messages. The language used also differs from standard language.
Sources of unstructured data
How to manage unstructured data
An index in SQL is created on an existing table to retrieve rows quickly.
When a table has thousands of records, retrieving information takes a long time. Indexes are therefore created on frequently accessed columns so that the information can be retrieved quickly.
Indexes can be created on a single column or a group of columns. When an index is created, it first sorts the data and then assigns a ROWID to each row.
An index is simply an identifier that represents data in a data set.
Indexing is possible for unstructured data, based on the text or on other attributes such as the filename.
However, indexing unstructured data is difficult because it does not follow any naming conventions.
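The two kinds of indexing described above can be sketched in Python. This is a minimal illustration, not part of the original slides: the table name patients, the column last_name, the index name idx_patients_last_name, and the sample file names are all hypothetical. The first part creates a SQL index with the standard sqlite3 module; the second builds a simple word-to-filename index over free text.

```python
import sqlite3
from collections import defaultdict

# Part 1: a SQL index on a frequently accessed column (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, last_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO patients (last_name, city) VALUES (?, ?)",
    [("Rao", "Pune"), ("Shah", "Mumbai"), ("Iyer", "Chennai")],
)
conn.execute("CREATE INDEX idx_patients_last_name ON patients (last_name)")

# Ask SQLite how it would run a lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM patients WHERE last_name = ?", ("Shah",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

# Part 2: indexing unstructured text by word, keyed back to the filename.
documents = {  # hypothetical files and contents
    "visit_note.txt": "patient reports mild fever",
    "followup_email.txt": "fever resolved, patient discharged",
}
text_index = defaultdict(set)
for filename, text in documents.items():
    for word in text.lower().replace(",", " ").split():
        text_index[word].add(filename)

print(plan_text)
print(sorted(text_index["fever"]))
```

The query plan should mention the index name, which suggests the lookup no longer scans every row. The text index only works here because the sample text is clean; real unstructured data would need far more normalization.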
Tags /metadata
Using metadata, the data in a document can be tagged, but with unstructured data this is difficult because little or no metadata is available.
The structure of the document cannot be determined because it comes from more than one source and does not have a particular format.
Classification/taxonomy
Taxonomy is the classification of data on the basis of the relationships that exist between data.
Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. Classifying unstructured data is difficult because identifying relationships between data is not an easy task.
CAS (content addressable storage): it stores data based on its metadata.
It assigns a unique identifier to every object stored in it. It is used extensively to store emails.
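The idea of giving every stored object a unique identifier can be sketched as follows. The slides do not say how the identifier is derived; deriving it from a SHA-256 hash of the content is an assumption made for this illustration, and the class name ContentAddressableStore is hypothetical.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store: each object is retrieved by an address
    computed from its own content (assumption: SHA-256 hex digest)."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        self._objects[address] = content  # same content maps to the same slot
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


store = ContentAddressableStore()
email_body = b"Subject: Lab results\n\nPlease see the attached report."
first_address = store.put(email_body)
second_address = store.put(email_body)  # storing a duplicate email
other_address = store.put(b"Subject: Appointment\n\nSee you Monday.")
```

Because the address depends only on the content, storing the same email twice yields the same address and keeps a single copy, which is one plausible reason such storage suits email archives.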
Challenges to store
Solution to storage challenges
UIMA: Unstructured Information Management Architecture
A solution for unstructured data.
It is an open-source platform from IBM that integrates different kinds of analysis engines to provide a complete solution for knowledge discovery from unstructured data. UIMA stores information in a structured format.
Various analysis engines analyze unstructured data in different ways, such as:
Breaking up documents.
Grouping and classifying according to taxonomy.
Detecting parts of speech, grammar, and synonyms.
Detecting events and times.
Detecting relationships between various elements.
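The general idea of chaining analysis steps to turn free text into structured records can be sketched in plain Python. This does not use the UIMA API; the function names split_sentences, detect_times, and analyze are hypothetical, and the regular expressions are deliberately simplistic.

```python
import re


def split_sentences(text):
    """Break a document into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def detect_times(text):
    """Find clock times written as H:MM or HH:MM."""
    return re.findall(r"\b\d{1,2}:\d{2}\b", text)


def analyze(text):
    """Run each step in turn and return structured annotations."""
    return [
        {"sentence": sentence, "times": detect_times(sentence)}
        for sentence in split_sentences(text)
    ]


note = "Patient arrived at 09:30. Discharged at 14:15 after review."
annotations = analyze(note)
for annotation in annotations:
    print(annotation)
```

Each step plays the role of one small analysis engine, and the list of dictionaries is the structured result that could then be stored or queried.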
Semi structured data
Only about 10 percent of the data in an organization is semi-structured.
Semi-structured data does not conform to any data model.
The data cannot be stored in rows and columns.
Semi-structured data has tags and markers that help group the data and describe how it is stored, giving some metadata.
However, these are not sufficient for the management and automation of data.
Similar entities are grouped and organized in a hierarchy. The properties or attributes within a group may or may not be the same.
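A small example may make the last point concrete. The JSON document below is invented for illustration: two records sit in the same group ("patients") but carry different attributes, which is what keeps them from fitting a fixed set of columns.

```python
import json

# Hypothetical semi-structured records: same group, different attributes.
raw = """
{"patients": [
  {"name": "A. Rao", "phone": "555-0101"},
  {"name": "B. Shah", "email": "b.shah@example.com", "allergies": ["penicillin"]}
]}
"""

data = json.loads(raw)

# The keys act as tags that describe each value.
attribute_sets = [sorted(patient.keys()) for patient in data["patients"]]

# Only attributes present in every record could become fixed columns.
common = set(attribute_sets[0]).intersection(*attribute_sets[1:])

print(attribute_sets)
print(common)
```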
Characteristics of semi-structured data
How semi structured data is stored
Schemas: used to define the structure of data. The problem with schemas is that requirements are ever changing, and changes required in the data also lead to changes in the schema.
Graph-based data models: these can be used to describe data. They are self-describing, with a tree-like structure to describe relationships and hierarchies. This is a schema-less approach.
XML: used to store and exchange semi-structured data. It allows the user to define tags to store data in hierarchical form.
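As a rough illustration of user-defined tags and hierarchical storage, the sketch below parses a small XML document with Python's standard xml.etree.ElementTree module. The tag names (hospital, patient, name, phone, email) and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags are user-defined and nest hierarchically.
xml_text = """
<hospital>
  <patient id="1">
    <name>A. Rao</name>
    <phone>555-0101</phone>
  </patient>
  <patient id="2">
    <name>B. Shah</name>
    <email>b.shah@example.com</email>
  </patient>
</hospital>
""".strip()

root = ET.fromstring(xml_text)

# Walk the tree: the root contains patients, each patient contains fields.
names = [patient.findtext("name") for patient in root.findall("patient")]
fields = [[child.tag for child in patient] for patient in root.findall("patient")]

print(names)
print(fields)
```

The two patient elements share a parent but not the same child tags, so the document describes itself through its tags rather than through a fixed schema.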
Challenges in storage of semi-structured data
Solution for storing
Challenges to extract information.
Difference between structured and semi-structured data