BI Questions (1)

Explain the concepts and capabilities of Business Intelligence.

Business Intelligence (BI) helps to manage data by applying a combination of skills and technologies while accounting for security and data quality risks. It also helps in achieving a better understanding of data. Business intelligence can be considered the collective information of a business. It helps in making predictions about business operations using data gathered in a warehouse. A business intelligence application helps to handle sales, financial, production and other business data. It supports better decision making and can therefore also be considered a decision support system.

Explain the concepts and capabilities of Business Intelligence.

Business Intelligence is all about processes, skills, technologies, practices and applications used for supporting decision making.

Business Intelligence applications:

- Are centrally initiated by business needs
- Include decision support systems, query and reporting, OLAP, data mining and forecasting

Explain the Dashboard in the business intelligence.

A dashboard in business intelligence allows large amounts of data and many reports to be read in a single graphical interface. Dashboards help in making faster decisions by relying on measurable data seen at a glance. They can also be used to drill into the details of this data to analyze the root cause of any business performance issue. A dashboard represents the business data and business state at a high level. Dashboards can also be used for cost control. Example of the need for a dashboard: banks run thousands of ATMs. They need to know how much cash is deposited, how much is left, etc.

Explain the Dashboard in the business intelligence.

A dashboard in business intelligence is used for rapid prototyping, cloning and deployment across all databases, operational applications or spreadsheets throughout an organization.

A dashboard in BI shows an enterprise's current status and where it is heading, using graphs, maps and charts. The drill-down and roll-over capabilities allow information to be organized so that details are shown only when requested. It is fully customizable, including free-form design options. A dashboard consolidates the vital statistics of a business into an easy-to-read page.

SAS Business Intelligence.

SAS business intelligence has analytical capabilities such as statistics, reporting, data mining, prediction, forecasting and optimization. These capabilities help in getting data in the desired format and in improving the quality of data.

SAS Business Intelligence.

SAS BI provides information about an enterprise when it is needed, in a customized format. SAS BI integrates data across the enterprise and delivers self-service reporting and analysis. This reduces the time needed to respond to requests and for business users to view the information. SAS BI also offers an integrated, flexible and robust presentation layer for the full breadth of SAS Analytics. All of these are integrated within the context of the business for better and faster decision making.

What are fact tables and dimension tables?

As mentioned, data in a warehouse comes from transactions. A fact table in a data warehouse consists of facts and/or measures. The data in a fact table is usually numerical.

A dimension table, on the other hand, contains fields used to describe the data in fact tables. A dimension table provides additional, descriptive information (a dimension) about a field of a fact table.

E.g. if I want to know the number of resources used for a task, my fact table will store the actual measure (of resources) while my dimension table will store the task and resource details.

Hence, the relationship between a dimension table and a fact table is one-to-many: one dimension row can be referenced by many fact rows.
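The task and resource example can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (dim_task, fact_resource_usage), the columns and the sample rows are assumptions, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive attributes of each task (hypothetical columns).
    CREATE TABLE dim_task (
        task_id   INTEGER PRIMARY KEY,
        task_name TEXT NOT NULL
    );
    -- Fact table: a numeric measure plus a foreign key to the dimension.
    CREATE TABLE fact_resource_usage (
        task_id        INTEGER NOT NULL REFERENCES dim_task(task_id),
        resources_used INTEGER NOT NULL
    );
    INSERT INTO dim_task VALUES (1, 'Design'), (2, 'Testing');
    INSERT INTO fact_resource_usage VALUES (1, 3), (1, 2), (2, 4);
""")


def resources_per_task(connection):
    """Sum the measure from the fact table, grouped by a dimension attribute."""
    rows = connection.execute("""
        SELECT d.task_name, SUM(f.resources_used)
        FROM fact_resource_usage AS f
        JOIN dim_task AS d ON d.task_id = f.task_id
        GROUP BY d.task_name
        ORDER BY d.task_name
    """).fetchall()
    return dict(rows)


totals = resources_per_task(conn)
print(totals)
```

One row of dim_task is referenced by several rows of fact_resource_usage, which is the one-to-many relationship between a dimension and its facts.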

What are fact tables and dimension tables?

Fact tables persist business facts (measures) together with foreign keys that refer to candidate keys in the dimension tables. Fact tables usually provide additive values, which act as independent variables by which dimensional attributes are analyzed.

Dimension tables persist the attributes that are used to constrain and group data when performing data warehousing queries.

What is ETL process in data warehousing?

ETL stands for Extract, Transform, Load. It is the process of fetching data from different sources, converting the data into a consistent and clean form, and loading it into the data warehouse. Different tools are available in the market to perform ETL jobs.

What is ETL process in data warehousing?

ETL stands for Extraction, Transformation and Loading. That means extracting data from different sources such as flat files, databases or XML data, transforming this data depending on the application's needs, and loading it into the data warehouse.
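A minimal sketch of the three ETL steps, assuming a small in-memory CSV as the flat-file source and an in-memory SQLite database standing in for the warehouse. The column names, the cleaning rules and the sales table are hypothetical choices made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; in practice this would be a file or a database.
RAW_CSV = """name,city,amount
 alice ,pune,100.50
BOB,Bangalore,
carol,PUNE,20
"""


def extract(text):
    """Extract: read rows from the flat-file source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim and normalize text, drop rows with a missing amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record, skipped in this sketch
        cleaned.append((row["name"].strip().title(),
                        row["city"].strip().title(),
                        float(amount)))
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT name, city, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors how ETL tools separate source connectors, transformation rules and target loaders.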

Explain the difference between data mining and data warehousing.

Data warehousing is merely extracting data from different sources, cleaning the data and storing it in the warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns, etc.

E.g. a data warehouse of a company stores all the relevant information of projects and employees. Using Data mining, one can use this data to generate different reports like profits generated etc.

Explain the difference between data mining and data warehousing.

Data mining is a method for comparing large amounts of data for the purpose of finding patterns. It is normally used for models and forecasting. Data mining is the process of finding correlations and patterns by sifting through large data repositories using pattern recognition techniques.

Data warehousing is the central repository for the data of several business systems in an enterprise. Data from various sources is extracted and selectively organized in the data warehouse for analysis and accessibility.

What is an OLTP system and OLAP system?

OLTP: Online Transaction Processing supports and manages transaction-based applications that involve a high volume of data. Typical examples of such transactions are found in banking, airline ticketing, etc. Because OLTP uses a client-server architecture, it supports transactions that run across a network.

OLAP: Online Analytical Processing performs analysis of business data and provides the ability to perform complex calculations, usually with a low volume of transactions compared with OLTP. OLAP helps the user gain insight into data coming from different sources (multidimensional).

What is an OLTP system and OLAP system?

OLTP stands for OnLine Transaction Processing. An OLTP system supports and manages applications whose transactions involve high volumes of data. OLTP is based on a client-server architecture and supports transactions across networks.

OLAP stands for OnLine Analytical Processing. OLAP performs business data analysis and complex calculations, usually with a low volume of transactions compared with OLTP. With the support of OLAP, a user can gain insight into data coming from various sources.

What are cubes?

A data cube stores data in a summarized form, which allows faster analysis. The data is stored in a way that makes reporting easy.

E.g. using a data cube, a user may want to analyze the weekly or monthly performance of an employee. Here, month and week could be considered the dimensions of the cube.

What are cubes?

Multidimensional data is logically represented by cubes in data warehousing. The dimensions and the data are represented by the edges and the body of the cube respectively. OLAP environments view the data in the form of a hierarchical cube. A cube typically includes the aggregations that are needed for business intelligence queries.
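As a rough illustration of a cube holding pre-computed aggregations, the sketch below sums one measure for every combination of dimensions. The fact rows, the dimension names and the build_cube helper are all hypothetical; real OLAP servers use far more compact storage.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (employee, month, week, hours_worked)
FACTS = [
    ("asha", "Jan", "W1", 40),
    ("asha", "Jan", "W2", 35),
    ("ravi", "Jan", "W1", 38),
    ("ravi", "Feb", "W5", 42),
]
DIMENSIONS = ("employee", "month", "week")


def build_cube(facts, dimensions):
    """Pre-aggregate the measure for every combination of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(range(len(dimensions)), size):
            totals = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in group)
                totals[key] += row[-1]  # the measure is the last column
            cube[tuple(dimensions[i] for i in group)] = dict(totals)
    return cube


cube = build_cube(FACTS, DIMENSIONS)
print(cube[("employee",)])          # totals per employee
print(cube[("employee", "month")])  # totals per employee and month
```

A query such as "hours per employee" then becomes a dictionary lookup instead of a scan over the fact rows, which is the reason summarized cubes answer reports quickly.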

What is snowflake schema design in database?

A snowflake schema in its simplest form is an arrangement of fact tables and dimension tables. The fact table is usually at the center, surrounded by the dimension tables. Normally in a snowflake schema the dimension tables are further broken down into more dimension tables.

E.g. dimension tables include employee, projects and status. The status table can be further broken into status_weekly and status_monthly.

What is snowflake schema design in database?

The snowflake schema is one of the designs used in database design. It serves the purpose of dimensional modeling in data warehousing. The snowflake design is used when a dimension table is split into many tables, so that the schema leans slightly towards normalization. Because the tables are split further, queries contain deeper joins.

What is analysis service?

Analysis service provides a combined view of the data used in OLAP or Data mining. Services here refer to OLAP, Data mining.

What is analysis service?

Analysis services provide an integrated view of business data. This view is provided by combining OLAP and data mining functionality. Analysis Services allows the user to use a wide variety of data mining algorithms for creating and designing data mining models.

What is surrogate key? Explain it with an example.

Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not generated by the user but by the system. In some databases, a primary difference between a primary key (PK) and a surrogate key (SK) is that the PK uniquely identifies a record while the SK uniquely identifies an entity.

E.g. an employee may be recruited before the year 2000 while another employee with the same name may be recruited after the year 2000. Here, the primary key will uniquely identify the record while the surrogate key will be generated by the system (say a serial number) since the SK is NOT derived from the data.

What is surrogate key? Explain it with an example.

A surrogate key is a unique identifier in a database, either for an entity in the modeled world or for an object in the database. A surrogate key is not derived from application data. It is generated internally by the system and is invisible to the user. When several objects in the database correspond to one surrogate, the surrogate key cannot be used as the primary key.

For example, a sequential number can be a surrogate key.
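A small sketch of a system-generated sequential surrogate key, using SQLite's AUTOINCREMENT through Python's sqlite3 module. The dim_employee table, its columns and the sample employees are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_employee (
        employee_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        name        TEXT NOT NULL,                      -- application data
        hired_year  INTEGER NOT NULL
    )
""")


def add_employee(connection, name, hired_year):
    """Insert a row and return the system-generated surrogate key."""
    cursor = connection.execute(
        "INSERT INTO dim_employee (name, hired_year) VALUES (?, ?)",
        (name, hired_year))
    return cursor.lastrowid


# Two employees with the same name still get distinct keys, because the
# key is generated by the database and not derived from the data.
first_sk = add_employee(conn, "J. Smith", 1998)
second_sk = add_employee(conn, "J. Smith", 2004)
print(first_sk, second_sk)
```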

What is the purpose of Factless Fact Table?

Factless fact tables are so called because they simply contain keys which refer to the dimension tables. Hence, they don't really hold facts or measures, but are commonly used for tracking information about an event.

E.g. to find the number of leaves taken by an employee in a month.

What is the purpose of Factless Fact Table?

Factless fact tables can be used for tracking a process or collecting status. Such a fact table does not have numeric values that can be aggregated, hence the name. A factless fact table contains only the key values that reference the dimensions from which the status is collected.
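The leave-tracking case can be sketched as a factless fact table that holds only dimension keys, with the answer obtained by counting rows. The table names, columns and sample data below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
    -- Factless fact table: only keys, one row per leave day taken.
    CREATE TABLE fact_leave (
        employee_id INTEGER REFERENCES dim_employee(employee_id),
        date_id     INTEGER REFERENCES dim_date(date_id)
    );
    INSERT INTO dim_employee VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO dim_date VALUES (101, '2024-01'), (102, '2024-01'), (201, '2024-02');
    INSERT INTO fact_leave VALUES (1, 101), (1, 102), (2, 101), (1, 201);
""")


def leaves_in_month(connection, month):
    """Count leave events per employee; the row itself is the 'fact'."""
    rows = connection.execute("""
        SELECT e.name, COUNT(*)
        FROM fact_leave AS f
        JOIN dim_employee AS e ON e.employee_id = f.employee_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        WHERE d.month = ?
        GROUP BY e.name
        ORDER BY e.name
    """, (month,)).fetchall()
    return dict(rows)


january = leaves_in_month(conn, "2024-01")
print(january)
```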

What is a level of Granularity of a fact table?

A fact table is usually designed at a low level of granularity. This means that we need to find the lowest level of information that can be stored in a fact table.

E.g. employee performance is a very high level of granularity. employee_performance_daily and employee_performance_weekly can be considered lower levels of granularity.

What is a level of Granularity of a fact table?

Granularity is the lowest level of information stored in the fact table; it is the depth of the data level. In a date dimension, for example, the level of granularity could be year, quarter, month, period, week or day.

Determining the granularity consists of the following two steps:

- Determining the dimensions that are to be included
- Determining where to place the hierarchy of each dimension of information

The determining factors depend on the requirements.

Explain the difference between star and snowflake schemas.

A snowflake schema design is usually more complex than a star schema. In a star schema a fact table is surrounded by multiple dimension tables. This is also how the snowflake schema is designed. However, in a snowflake schema, the dimension tables can be further broken down into sub-dimensions. Hence, data in a snowflake schema is more stable and standardized than in a star schema.

E.g. Star Schema: Performance report is a fact table. Its dimension tables include performance_report_employee, performance_report_manager

Snowflake Schema: the dimension tables can be broken down further into performance_report_employee_weekly, monthly, etc.

Explain the difference between star and snowflake schemas.

Star schema: A highly denormalized technique. A star schema has one fact table associated with numerous dimension tables, so that the layout resembles a star.

Snowflake schema: A star schema to which normalization principles are applied is known as a snowflake schema. A dimension table can be associated with sub-dimension tables.

Differences:

A dimension table does not have a parent table in a star schema, whereas in a snowflake schema a dimension table has one or more parent tables.

In a star schema the dimension table itself holds the hierarchies of the dimension, whereas in a snowflake schema the hierarchies are split into different tables. Data can be drilled down from the topmost hierarchy to the lowermost hierarchy.

Differences between star and snowflake schema.

A snowflake schema is a more normalized form of a star schema. In a star schema, one fact table is stored with a number of dimension tables. In a snowflake schema, on the other hand, one dimension table can have multiple sub-dimensions. This means that in a star schema, the dimension table is independent, without any sub-dimensions.
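To make the difference concrete, the sketch below models the same product dimension twice with plain Python dictionaries: once denormalized (star) and once with the category split into its own sub-dimension (snowflake). All table names, keys and rows are invented for the illustration.

```python
# Star schema: the product dimension is denormalized (category stored inline).
star_dim_product = {
    1: {"name": "Laptop", "category": "Electronics"},
    2: {"name": "Phone", "category": "Electronics"},
    3: {"name": "Desk", "category": "Furniture"},
}

# Snowflake schema: the category is split into its own sub-dimension table.
snow_dim_category = {
    10: {"category": "Electronics"},
    20: {"category": "Furniture"},
}
snow_dim_product = {
    1: {"name": "Laptop", "category_id": 10},
    2: {"name": "Phone", "category_id": 10},
    3: {"name": "Desk", "category_id": 20},
}

# Fact rows shared by both designs: (product_id, units_sold)
fact_sales = [(1, 2), (2, 5), (3, 1), (1, 1)]


def units_by_category_star(facts):
    """One lookup per fact row: fact -> dimension."""
    totals = {}
    for product_id, units in facts:
        category = star_dim_product[product_id]["category"]
        totals[category] = totals.get(category, 0) + units
    return totals


def units_by_category_snowflake(facts):
    """Two lookups per fact row: fact -> dimension -> sub-dimension."""
    totals = {}
    for product_id, units in facts:
        category_id = snow_dim_product[product_id]["category_id"]
        category = snow_dim_category[category_id]["category"]
        totals[category] = totals.get(category, 0) + units
    return totals


star_result = units_by_category_star(fact_sales)
snowflake_result = units_by_category_snowflake(fact_sales)
print(star_result, snowflake_result)
```

Both designs give the same totals. The snowflake version stores each category name once but needs an extra lookup (an extra join in SQL) to reach it.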

What is a Cube and Linked Cube with reference to data warehouse?

A data cube stores data in a summarized form, which allows faster analysis. Linked cubes, in contrast, use the data cube and are stored on another analysis server. Linking different data cubes reduces the possibility of sparse data.

E.g. a data cube may store employee_performance. However, in order to know the hours on which this performance was calculated, one can create another cube by linking it to the root cube (in this case employee_performance).

What is a Cube and Linked Cube with reference to data warehouse?

Logical data representation of multidimensional data is depicted as a Cube. Dimension members are represented by the edge of cube and data values are represented by the body of cube.

Linked cubes are the cubes that are linked in order to make the data remain constant.

What are fundamental stages of Data Warehousing?

The stages of a data warehouse help to find and understand how the data in the warehouse changes.

At the initial stage of data warehousing, transaction data is merely copied to another server. Here, even if the copied data is processed for reporting, the performance of the source system is not affected.

In the next stage, the data in the warehouse is updated regularly from the source data.

In the real-time data warehouse stage, data in the warehouse is updated for every transaction performed on the source data (e.g. booking a ticket).

When the warehouse is at the integrated stage, it not only updates data as and when a transaction is performed but also generates transactions which are passed back to the online source system.

What are fundamental stages of Data Warehousing?

Offline Operational Databases: This is the initial stage of data warehousing. In this stage the database of an operational system is simply copied to an offline server.

Offline Data Warehouse: In this stage the data warehouse is updated on a regular time cycle from the operational system, and the data is persisted in a reporting-oriented data structure.

Real-time Data Warehouse: In this stage the data warehouse is updated on a transaction or event basis, each time the operational system performs a transaction.

Integrated Data Warehouse: In this stage the warehouse generates activity or transactions which are passed back into the operational system. These generated transactions are used in the daily activity of the organization.

What is Virtual Data Warehousing?

Virtual data warehousing provides an aggregate view of the complete data inventory. It contains the metadata used to form a logical enterprise data model, which is part of the database-of-record infrastructure. The infrastructure consists of published legacy database systems with their metadata extracted. The JEE, JMS and EJB standards are used in the infrastructure for transactional unit requests, and extract-transform-load tools are used for loading real-time bulk data.

What is Virtual Data Warehousing?

A virtual data warehouse provides a compact view of the data inventory. It contains metadata and uses middleware to build connections to different data sources. Virtual data warehouses can be fast, as they allow users to filter the most important pieces of data from different legacy applications.

What is active data warehousing?

Transactional data is captured and stored in the active data warehouse. This repository can be used to find trends and patterns that support future decision making.

What is active data warehousing?

An active data warehouse aims to capture data continuously and deliver real-time data. It provides a single integrated view of a customer across multiple business lines. It is associated with Business Intelligence systems.

Difference between dependent and independent data warehouse

A dependent data warehouse is built on an ODS (operational data store), whereas an independent data warehouse does not depend on an ODS.

Difference between dependent and independent data warehouse

A dependent data warehouse stores the data in a central data warehouse. An independent data warehouse, on the other hand, does not make use of a central data warehouse.

What is data modeling and data mining?

Designing a model for data or a database is called data modeling. Data is stored in fact tables and dimension tables. A fact table consists of data about transactions, and a dimension table consists of master data. A data model is used to design an abstract model of the database.

The process of uncovering hidden trends is called data mining. Data mining is used to transform hidden patterns into information. It is used in a wide range of practices such as marketing, surveillance and fraud detection.

What is data modeling and data mining? What is this used for?

Data modeling aims to identify all entities that have data. It then defines the relationships between these entities. Data models can be conceptual, logical or physical. Conceptual models are typically used to explore high-level business concepts with stakeholders. Logical models are used to explore domain concepts, while physical models are used to explore database design.

Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Data mining helps in reporting, planning strategies, finding meaningful patterns, etc. It can be used to convert a large amount of data into a sensible form.

Difference between ER Modeling and Dimensional Modeling.

Dimensional modeling is very flexible from the user's perspective. A dimensional data model is mapped for creating schemas, whereas an ER model is not mapped for creating schemas and is not used for converting normalized data into denormalized form.

The ER model is used for OLTP databases that use the 1st, 2nd or 3rd normal form, whereas the dimensional data model is used for data warehousing and is typically denormalized (a snowflake schema normalizes its dimensions, up to 3rd normal form).

An ER model contains normalized data, whereas a dimensional model contains denormalized data.

Difference between ER Modeling and Dimensional Modeling.

An ER diagram produced by ER modeling represents the processes of an entire business or application. This diagram can be segregated into multiple dimensional models. That is to say, an ER model will have both a logical and a physical model, while the dimensional model will only have a physical model.

What is snapshot with reference to data warehouse?

A snapshot of a data warehouse is a persisted report from the catalogue. The report is persisted into a file after it has been disconnected from the catalogue.

What is snapshot with reference to data warehouse?

A snapshot in a data warehouse can be used to track activities. For example, every time an employee attempts to change his address, the data warehouse can be alerted to take a snapshot. This means that each snapshot is taken when some event is fired.

A snapshot has three components:

- The time when the event occurred
- A key to identify the snapshot
- Data that relates to the key

What is degenerate dimension table?

Dimensions that are persisted in the fact table are called degenerate dimensions. These dimensions do not have dimension tables of their own. No mapping takes place for these columns in the fact table. The values in such a column are neither dimension keys nor measures.

What is degenerate dimension table?

A degenerate dimension does not have its own dimension table. It is derived from a fact table: it is a column (dimension) which is part of the fact table but does not map to any dimension table. E.g. employee_id

What is Data Mart?

A data mart is a data repository that serves a community of people who work with knowledge (also known as knowledge workers). The data can come from enterprise resources or from a data warehouse.

What is Data Mart?

A data mart stores particular data that is gathered from different sources. This data may belong to a specific community (group of people) or genre. Data marts can be used to focus on specific business needs.

Difference between metadata and data dictionary.

Metadata describes data. It is 'data about data'. It holds information about how, when and by whom certain data was collected, and about the data format. It is essential for understanding information that is stored in data warehouses and XML-based web applications.

Data dictionary is a file which consists of the basic definitions of a database. It contains the list of files that are available in the database, number of records in each file, and the information about the fields.

What is the difference between metadata and data dictionary?

Data dictionary is a repository to store all information. Meta data is data about data. Meta data is data that defines other data. Hence, the data dictionary can be metadata that describes some information about the database.

Describe the various methods of loading Dimension tables.

The following are the methods of loading dimension tables:

Conventional Load:

In this method all the table constraints will be checked against the data, before loading the data.

Direct Load or Faster Load:

As the name suggests, the data will be loaded directly without checking the constraints. The data checking against the table constraints will be performed later and indexing will not be done on bad data.

Describe the various methods of loading Dimension tables.

The methods to load Dimension tables:

Conventional load:- Here the data is checked for any table constraints before loading.

Direct or Faster load:- The data is directly loaded without checking for any constraints

What is the difference between OLAP and data warehouse?

The following are the differences between OLAP and data warehousing:

Data Warehouse

- Data from different data sources is stored in a relational database for end-use analysis.
- Data is organized in summarized, aggregated, non-volatile and subject-oriented patterns.
- Supports the analysis of data but does not itself support online analysis.

Online Analytical Processing

- Data in the data warehouse is analyzed and evaluated using analytical queries.
- Data aggregation and summarization are used to organize data using multidimensional models.
- Supports speed and flexibility of online data analysis for data analysts in a real-time environment.

What is the difference between OLAP and data warehouse?

A data warehouse serves as a repository to store historical data that can be used for analysis. OLAP is Online Analytical processing that can be used to analyze and evaluate data in a warehouse. The warehouse has data coming from varied sources. OLAP tool helps to organize data in the warehouse using multidimensional models.

Describe the foreign key columns in fact table and dimension table.

The primary keys of entity tables are the foreign keys of dimension tables. The primary keys of dimension tables are the foreign keys of fact tables.

Describe the foreign key columns in fact table and dimension table.

A foreign key of a fact table references a dimension table. A dimension table, on the other hand, is itself a referenced table and has foreign key references from one or more tables.

What is cube grouping?

A set of similar cubes built by Transformer is known as a cube group. Each cube group relates to a single level in one dimension of the model. Cube groups are generally used to create smaller cubes that are based on the data in that level of the dimension.

Define the term slowly changing dimensions (SCD)

The slowly changing dimension target operator is one of the SQL warehousing operators that can be used in a mining flow or in a data flow. SCD is applied when an attribute of a record varies over time.

Define the term slowly changing dimensions (SCD).

SCD are dimensions whose data changes very slowly. An example of this can be city of an employee. This dimension will change very slowly. The row of this data in the dimension can be either replaced completely without any track of old record OR a new row can be inserted, OR the change can be tracked.
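The three ways of handling the change can be sketched as follows. They are conventionally called SCD Type 1 (overwrite), Type 2 (new row) and Type 3 (previous-value column); that naming, the row layout and the helper functions are assumptions added for illustration and do not come from the answer above.

```python
from datetime import date


def scd_type1(rows, employee_id, new_city):
    """Type 1: overwrite the attribute; no history is kept."""
    for row in rows:
        if row["employee_id"] == employee_id:
            row["city"] = new_city


def scd_type2(rows, employee_id, new_city, change_date):
    """Type 2: close the current row and insert a new current row."""
    for row in rows:
        if row["employee_id"] == employee_id and row["valid_to"] is None:
            row["valid_to"] = change_date
    rows.append({"employee_id": employee_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None})


def scd_type3(rows, employee_id, new_city):
    """Type 3: track the change by keeping the previous value in a column."""
    for row in rows:
        if row["employee_id"] == employee_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city


type1_rows = [{"employee_id": 7, "city": "Bangalore"}]
scd_type1(type1_rows, 7, "Pune")

type2_rows = [{"employee_id": 7, "city": "Bangalore",
               "valid_from": date(2020, 1, 1), "valid_to": None}]
scd_type2(type2_rows, 7, "Pune", date(2024, 6, 1))

type3_rows = [{"employee_id": 7, "city": "Bangalore", "previous_city": None}]
scd_type3(type3_rows, 7, "Pune")

print(type1_rows, type2_rows, type3_rows, sep="\n")
```

Type 2 is the only variant here that keeps the full history, at the cost of several rows per employee.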

Explain the use of lookup tables and Aggregate tables.

A lookup table is used when updating the data warehouse. When the lookup is placed on the fact table or warehouse, based upon the primary key of the target, the update takes place by allowing only new records or updated records, depending upon the lookup condition.

Aggregate tables are materialized views that contain summarized data. For example, to generate sales reports on a weekly, monthly or yearly basis instead of a daily basis, the date values are aggregated into week values, week values into month values, and month values into year values. To perform this process an @aggregate function is used.

Explain the use of lookup tables and Aggregate tables.

An aggregate table contains summarized view of data. Lookup tables, using the primary key of the target, allow updating of records based on the lookup condition.
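A minimal sketch of building aggregate tables by rolling daily values up to months and then months up to years. The sample sales figures and the roll_up helper are hypothetical, and plain Python stands in for the warehouse's own aggregate function.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales: (day, amount)
DAILY_SALES = [
    (date(2024, 1, 30), 100.0),
    (date(2024, 1, 31), 50.0),
    (date(2024, 2, 1), 25.0),
    (date(2025, 2, 1), 10.0),
]


def roll_up(rows, key_func):
    """Summarize (key, amount) rows into totals per coarser key."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key_func(key)] += amount
    return dict(totals)


# Day values are aggregated into month values ...
monthly = roll_up(DAILY_SALES, lambda day: (day.year, day.month))
# ... and month values are aggregated into year values.
yearly = roll_up(monthly.items(), lambda year_month: year_month[0])

print(monthly)
print(yearly)
```

Reports then read the small monthly or yearly table instead of scanning every daily row.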

What is real time data-warehousing?

The combination of real-time activity and data warehousing is called real-time data warehousing. Real-time activity is activity that is happening at the current time. Data becomes available once the activity is completed.

In real-time data warehousing, business activity data is captured as it occurs. As soon as the business activity is complete and its data is available, the data flows into the data warehouse and is available instantly. Real-time data warehousing can be viewed as a framework for retrieving information from data as soon as the data is available.

What is real time data-warehousing?

In real-time data warehousing, the warehouse is updated every time the system performs a transaction. It reflects the real-time information of the business. This means that when a query is fired on the warehouse, the state of the business at that time is returned.

What is conformed fact? What are conformed dimensions used for?

Conformed facts are facts that are allowed to have the same name in different tables. Such facts can be combined and compared mathematically. A dimension table that can be used with more than one fact table is referred to as a conformed dimension. It is used across multiple data marts in combination with multiple fact tables. As long as the metadata of the conformed dimension tables does not change, the facts in an application can be used without further modification.

What is conformed fact? What are conformed dimensions used for?

A conformed fact in a warehouse is allowed to have the same name in separate tables. Conformed facts can be compared and combined mathematically. Conformed dimensions can be used across multiple data marts. These conformed dimensions have a static structure. Any dimension table that is used by multiple fact tables can be a conformed dimension.

Define non-additive facts.

Facts that cannot be summed up across the dimensions present in the fact table are called non-additive facts. These facts can still be useful when there are changes in dimensions. For example, profit margin is a non-additive fact, because adding margins up at the account level or the day level has no meaning.

Define non-additive facts.

Non-additive facts are facts that cannot be summed up across any of the dimensions present in the fact table. This means that these columns cannot be added up to produce meaningful results.
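The profit margin case can be checked with a few lines of arithmetic. The two accounts and their figures are made up for the example; the point is only that a ratio has to be recomputed from its additive components rather than summed.

```python
# Hypothetical fact rows with two additive measures per account.
ROWS = [
    {"account": "A", "revenue": 100.0, "profit": 10.0},
    {"account": "B", "revenue": 300.0, "profit": 90.0},
]


def margin(profit, revenue):
    """Profit margin is a ratio, which is why it is non-additive."""
    return profit / revenue


# Summing the per-account margins gives a number with no business meaning.
summed_margins = sum(margin(r["profit"], r["revenue"]) for r in ROWS)

# The additive facts (profit, revenue) can be summed, and the margin
# is then recomputed from the totals.
total_profit = sum(r["profit"] for r in ROWS)
total_revenue = sum(r["revenue"] for r in ROWS)
overall_margin = margin(total_profit, total_revenue)

print(summed_margins, overall_margin)
```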

Difference between SAS tool and other tools

The differences between SAS and other tools are:

- SAS is a reporting tool.
- SAS is an ETL tool and also a forecasting tool.

Tools other than SAS

- Consist of a reporting tool (for example, Business Objects or Cognos), an ETL tool (for example, Informatica), or both (for example, Business Objects).

Other tools do not include a forecasting tool. For this reason, SAS is widely used in clinical trials and the health care industry.

List out difference between SAS tool and other tools.

SAS provides more features in comparison to other tools. It supports almost all database interfaces and has its own extensive database engine.

Why is SAS so popular?

Statistical Analysis System (SAS) is an integration of various software products which allows developers to perform:

- Data entry, data retrieval, data management and data mining
- Report writing and graphics support
- Statistical analysis, business planning, business forecasting and business decision support
- Operations research and project management, quality improvement, application development
- Extract, transform and load functions in data warehousing
- Platform-independent and remote computing

Because of these many features, SAS has become more and more popular.

Why is SAS so popular?

SAS is an ETL tool. In addition, it can be used for reporting and for forecasting business needs.

What is data cleaning? How can we do that?

Data cleaning is also known as data scrubbing. It is a process which ensures that a set of data is correct and accurate. Data accuracy, consistency and integration are checked during data cleaning. Data cleaning can be applied to a single set of records or to multiple sets of data which need to be merged.

Data cleaning is performed by reading all records in a set and verifying their accuracy. Typos and spelling errors are corrected. Mislabeled data, if any, is relabeled and filed. Incomplete or missing entries are completed. Unrecoverable records are purged so that they do not take up space or make operations inefficient.

What is data cleaning? How can we do that?

Data cleaning is the process of identifying erroneous data. The data is checked for accuracy, consistency, typos etc.

Methods:-

- Parsing: Used to detect syntax errors.
- Data transformation: Confirms that the input data matches the expected data in format.
- Duplicate elimination: Gets rid of duplicate entries.
- Statistical methods: Values of mean, standard deviation and range, or clustering algorithms, are used to find erroneous data.
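The four methods can be sketched on toy data as follows. The email pattern, the z-score threshold of 2.0 and the sample values are illustrative assumptions; real cleaning rules depend on the data set.

```python
import re
from statistics import mean, stdev

# Simplified pattern for illustration; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def parse_ok(email):
    """Parsing: detect values with a syntax error."""
    return EMAIL_PATTERN.match(email) is not None


def normalize(email):
    """Data transformation: bring input into the expected format."""
    return email.strip().lower()


def deduplicate(values):
    """Duplicate elimination: keep the first occurrence of each value."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


def outliers(values, threshold=2.0):
    """Statistical method: flag values far from the mean (z-score)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


EMAILS = ["a@example.com", " A@Example.com ", "not-an-email", "b@example.com"]
clean_emails = deduplicate(e for e in map(normalize, EMAILS) if parse_ok(e))

AGES = [30, 31, 29, 30, 32, 31, 30, 29, 31, 300]
suspect_ages = outliers(AGES)

print(clean_emails, suspect_ages)
```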

Explain in brief about critical column.

A column (usually granular) whose values change over a period of time is called a critical column.

For example, there is a customer named 'Anirudh' who resided in Bangalore for 4 years and then shifted to Pune. While in Bangalore, he made purchases worth Rs. 30 lakhs. Now the CITY changes in the data warehouse, and the purchases will be shown under the city Pune only. This kind of change makes the data warehouse inconsistent. In this example, CITY is the critical column. A surrogate key can be used as a solution for this.

Explain in brief about critical column.

A critical column in a warehouse is a column whose value changes over a period of time, for example the city of a user. If a user resides in city 'abc' and the warehouse keeps track of his daily expenses, then when the user changes city the data warehouse becomes inconsistent, since the city has changed and the expenses are shown under the new city.
