ISQS 6339, Business Intelligence Supplemental Notes on the Term Project


Zhangxi Lin, Texas Tech University


Projects

Two data warehousing projects (60 points; 100% for A+, 90%+ for A, 80%+ for B)
  - SQL Server 2008 based
  - Hadoop based

Big data collaborative studies (20 points)
  - One presentation, 50 minutes
  - Report & references
  - Videos and demonstrations

Term project

4-6 students form a team to carry out a data mart development project.
  - Stage 1 (10%): SQL Server project proposal. Due March 3
  - Stage 2 (25%): Data mart implementation. Due March 26
  - Stage 3 (10%): Hadoop project proposal. Due April 14
  - Stage 4 (25%): Hadoop project completed. Due April 30
  - Stage 5 (30%): Final report. Due May 12

Detailed instructions: http://zlin.ba.ttu.edu/6339/Projects15.html


Merits of data warehousing projects
  - Carefully developed project proposal demonstrating an understanding of the business requirements, attractive analytics themes, and clearly defined project goals and objectives
  - Comprehensive data mart design, such as multiple fact tables, with supporting analytic themes
  - Application of advanced ETL models or techniques, such as slowly changing dimensions, the use of containers, etc.
  - Advanced OLAP cube design, and/or optional self-taught MDX scripting
  - Rich data analysis outcomes
  - Well-presented final report
  - Demonstration of creative ideas and skillful data warehousing ability
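The slowly changing dimension technique mentioned above can be illustrated with a minimal sketch. It is written in Python with SQLite as a stand-in, not with the SQL Server tools used in the course, and it assumes a Type 2 approach (keep history by expiring the old row and adding a new version). The table dim_customer, its columns, and the function apply_scd2 are hypothetical names chosen for illustration.

```python
import sqlite3

def apply_scd2(conn, customer_id, city, load_date):
    """Type 2 change: expire the current row and insert a new version."""
    row = conn.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if row is not None and row[1] == city:
        return  # tracked attribute unchanged, keep the current version
    if row is not None:
        # Close the validity window of the old version.
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (load_date, row[0]),
        )
    # Insert the new current version with an open-ended validity window.
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, load_date),
    )

# Hypothetical dimension table with a surrogate key and validity columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    " customer_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    " customer_id TEXT, city TEXT,"
    " valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)
apply_scd2(conn, "C001", "Lubbock", "2015-01-01")
apply_scd2(conn, "C001", "Austin", "2015-03-01")   # city changed: new version
apply_scd2(conn, "C001", "Austin", "2015-04-01")   # no change: no new row
history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
```

After the three loads, history holds two versions of the customer: the expired Lubbock row and the current Austin row.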

HADOOP PROJECTS

Technology stack components (from the slide diagram): Load Balancer; Oozie; Solr, SolrCloud, SolrJ, HA; NewSQL; Kafka, Storm, Impala; REST; ZK; MySQL; Nginx/HA-Proxy; Flume; Sqoop; Ganglia; Tomcat, Jetty; Avro

Big Data Presentation Topics

No. 1: Data warehousing
  - Focus: Hadoop data warehouse design
  - Components: HDFS, HBase, Hive, NoSQL/NewSQL, Solr
  - Team: DW1
  - Presentation: 4/7

No. 2: Publicly available big data services
  - Focus: Tools and free resources
  - Components: Hortonworks, Cloudera, HaaS, EC2
  - Team: DW2
  - Presentation: 4/9

No. 3: MapReduce & data mining
  - Focus: Efficiency of distributed data/text mining
  - Components: Mahout, H2O, R, Python
  - Team: DW3
  - Presentation: 4/14

No. 4: Big data ETL
  - Focus: Heterogeneous data processing across platforms
  - Components: Kettle, Flume, Sqoop, Impala
  - Team: DW4
  - Presentation: 4/16

No. 5: System management
  - Focus: Load balancing and system efficiency
  - Components: Oozie, ZooKeeper, Ambari, Loom, Ganglia
  - Team: DW5
  - Presentation: 4/21

No. 6: Application development platform
  - Focus: Algorithms and innovative development environments
  - Components: Tomcat, Neo4J, Pig, Hue
  - Team: DW6
  - Presentation: 4/23

No. 7: Tools & visualizations
  - Focus: Features for big data visualization and data utilization
  - Components: Pentaho, Tableau, Saiku, Mondrian, Gephi
  - Team: DW7
  - Presentation: 4/28

No. 8: Streaming data processing
  - Focus: Efficiency and effectiveness of real-time data processing
  - Components: Spark, Storm, Kafka, Avro
  - Presentation: 5/5

Data Warehousing Methodology

- Implementing a data warehouse systematically


Dimensional Modeling Process

Preparation
  - Identify roles and participants
  - Understanding the data architecture strategy
  - Setting up the modeling environment
  - Establishing naming conventions

Data profiling and research
  - Data profiling and source system exploration
  - Interacting with source system experts
  - Identifying core business users
  - Studying existing reporting systems

Building dimensional models
  - High-level dimensional model design
  - Identifying dimension and fact attributes
  - Developing the detailed dimensional model
  - Testing the model
  - Reviewing and validating the model
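As a rough illustration of identifying dimension and fact attributes and then testing the model, the sketch below builds a tiny star schema in Python with SQLite. The tables dim_date, dim_product, and fact_sales, their columns, and the sample rows are all hypothetical; a real project would derive them from the business requirements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, full_date TEXT, month INTEGER, year INTEGER);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
-- The fact table holds numeric measures plus keys into each dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    sales_amount REAL);

INSERT INTO dim_date VALUES
    (20150301, '2015-03-01', 3, 2015), (20150401, '2015-04-01', 4, 2015);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO fact_sales VALUES
    (20150301, 1, 2, 20.0), (20150301, 2, 1, 5.0), (20150401, 1, 3, 30.0);
""")

# Testing the model: can it answer a typical analytic question?
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

If a planned analytic theme cannot be answered with a simple join and aggregation like this one, that is usually a sign the dimension or fact attributes need another review.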

Business Dimensional Lifecycle

[Diagram: lifecycle stages]
  - Project Planning
  - Business Requirements Definition
  - Technical Architecture Design
  - Product Selection & Installation
  - Dimensional Modeling
  - Physical Design
  - ETL Design & Development
  - BI Application Specification
  - BI Application Development
  - Deployment
  - Maintenance
  - Growth
  - Project Management

Data Profiling

Data profiling is a methodology for learning about the characteristics of the data.

It is a hierarchical process that attempts to build an assessment of the metadata associated with a collection of data sets.

Three levels
  - Bottom: characterizing the values associated with individual attributes
  - Middle: assessing relationships between multiple columns within a single table
  - Highest: describing relationships that exist between data attributes across different tables

A program can be run against the sandbox source system to obtain the needed information.
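The three profiling levels can be sketched as a small program, here in Python against an in-memory SQLite database standing in for the sandbox source system. The tables customers and orders, their columns, and the three helper functions are hypothetical names for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, zip TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES
    (1, '79409', 'Lubbock'), (2, '79409', 'Lubbock'),
    (3, '73301', 'Austin'), (4, NULL, 'Austin');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 9, 15.0);
""")

def profile_column(conn, table, column):
    """Bottom level: row, NULL, and distinct counts plus min/max of one attribute."""
    # Identifiers are interpolated into SQL, so pass only trusted names.
    sql = (f"SELECT COUNT(*), SUM({column} IS NULL), COUNT(DISTINCT {column}), "
           f"MIN({column}), MAX({column}) FROM {table}")
    rows, nulls, distinct, lo, hi = conn.execute(sql).fetchone()
    return {"rows": rows, "nulls": nulls, "distinct": distinct, "min": lo, "max": hi}

def dependency_violations(conn, table, determinant, dependent):
    """Middle level: determinant values that map to more than one dependent value."""
    sql = (f"SELECT {determinant} FROM {table} WHERE {determinant} IS NOT NULL "
           f"GROUP BY {determinant} HAVING COUNT(DISTINCT {dependent}) > 1")
    return [r[0] for r in conn.execute(sql)]

def orphan_keys(conn, child, child_col, parent, parent_col):
    """Highest level: child key values with no matching row in the parent table."""
    sql = (f"SELECT DISTINCT c.{child_col} FROM {child} c "
           f"LEFT JOIN {parent} p ON c.{child_col} = p.{parent_col} "
           f"WHERE p.{parent_col} IS NULL AND c.{child_col} IS NOT NULL")
    return [r[0] for r in conn.execute(sql)]

zip_profile = profile_column(conn, "customers", "zip")
zip_city_violations = dependency_violations(conn, "customers", "zip", "city")
orphans = orphan_keys(conn, "orders", "customer_id", "customers", "customer_id")
```

In this sample data the zip column has one missing value, each zip maps to a single city, and order 12 refers to a customer that does not exist, which is the kind of finding that should feed back into the ETL design.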


ETL Methodology
  - Develop a high-level map
  - Build a sandbox source system (optional)
  - Detailed data profiling
  - Make decisions
      - The source-to-target mapping
      - How often to load tables
      - The strategy for partitioning the relational and Analysis Services fact tables
      - The strategy for extracting data from each source system
  - De-duplicate key data from each source system (optional)
  - Develop a strategy for distributing dimension tables across multiple database servers (optional)
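One way to make the source-to-target mapping concrete is to write it down as data and apply it mechanically. The sketch below is only an illustration in Python: the source columns (CUST_NM, ORD_DT, AMT), the target columns, and the function apply_mapping are hypothetical, and an actual project would normally express the same mapping in its ETL tool.

```python
from datetime import datetime

# Hypothetical mapping: target column -> (source column, transform function).
MAPPING = {
    "customer_name": ("CUST_NM", str.strip),
    "order_date": (
        "ORD_DT",
        lambda s: datetime.strptime(s, "%m/%d/%Y").date().isoformat(),
    ),
    "sales_amount": ("AMT", float),
}

def apply_mapping(source_row, mapping):
    """Build one target row by applying each transform to its source column."""
    return {
        target: transform(source_row[source])
        for target, (source, transform) in mapping.items()
    }

# One sample source row with untrimmed text, a US-style date, and a string amount.
source_rows = [{"CUST_NM": "  Acme Corp ", "ORD_DT": "03/26/2015", "AMT": "199.50"}]
target_rows = [apply_mapping(row, MAPPING) for row in source_rows]
```

Keeping the mapping in one table-like structure makes it easy to review with source system experts before any loading logic is built.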


Sandbox Source System

Sandbox
  - A protected, limited environment where applications are allowed to "play" without risking damage to the rest of the system.
  - A term for the R&D department at many software and computer companies. The term is half-derisive, but reflects the truth that research is a form of creative play.
  - In the DW/BI context, a sandbox source system is a subset of the source database used for analytic exploration tasks.

How to create
  - Set up a static snapshot of the database
  - By sampling
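Both ways of creating a sandbox can be sketched in a few lines. The example below uses Python with in-memory SQLite databases as stand-ins for the real source system; the orders table and the choice of a systematic sample (every tenth row) are assumptions made for illustration, and random sampling would work equally well.

```python
import sqlite3

# Hypothetical source system with 1,000 order rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 1001)]
)
source.commit()

# Option 1: static snapshot, a full copy frozen at one point in time.
snapshot = sqlite3.connect(":memory:")
source.backup(snapshot)

# Option 2: sampling, here a systematic sample that keeps every tenth row.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
sample = source.execute(
    "SELECT order_id, amount FROM orders WHERE order_id % 10 = 0"
).fetchall()
sandbox.executemany("INSERT INTO orders VALUES (?, ?)", sample)
sandbox.commit()

# Later changes in the source do not reach either sandbox copy.
source.execute("DELETE FROM orders")
snapshot_rows = snapshot.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
sandbox_rows = sandbox.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The snapshot keeps all rows for exhaustive profiling, while the sample trades completeness for a smaller and faster environment.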


Decision Issues in ETL System Design
  - Source-to-target mapping
  - Load frequency
  - How much history is needed


Strategies for Extracting Data
  - Extracting data from packaged source systems (self-contained data sources)
      - May not be good to use their APIs
      - May not be good to use their add-on analytic system
  - Extracting directly from the source databases
      - Strategies vary depending on the nature of the source database
  - Extracting data from incremental loads
      - How the source database records the changes of the rows
  - Extracting historical data
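How incremental extraction works depends on how the source records row changes. As one possible sketch, the Python example below assumes the source table carries a last_modified timestamp column and uses it as a watermark; the orders table, the column name, and the function extract_changes are hypothetical. Sources without such a column would need another mechanism, such as change data capture or triggers.

```python
import sqlite3

def extract_changes(conn, last_watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only when something was extracted.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2015-04-01 09:00:00"),
    (2, 20.0, "2015-04-02 09:00:00"),
])

# Historical load: an empty watermark selects every existing row.
first_batch, watermark = extract_changes(conn, "")

# The source then changes: one update and one new row.
conn.execute(
    "UPDATE orders SET amount = 25.0, last_modified = '2015-04-03 08:00:00' "
    "WHERE order_id = 2"
)
conn.execute("INSERT INTO orders VALUES (3, 30.0, '2015-04-03 10:00:00')")

# Incremental load: only rows changed since the last watermark are returned.
second_batch, watermark = extract_changes(conn, watermark)
```

The first call plays the role of the historical extract and the second the incremental one, so the same routine can serve both as long as the watermark is stored between runs.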

