
Transcript of Authenticating and Reducing False hits in Mining.pptx

Authenticating and Reducing False hits in Mining

By Ujwala Bhoga

INTRODUCTION

• Data mining is defined as the extraction of interesting patterns or knowledge from huge amounts of data.
– Data: data are any facts, numbers, or text that can be processed by a computer.
– Information: the patterns, associations, or relationships among all this data can provide information.
– Knowledge: information can be converted into knowledge about historical patterns and future trends.

• Data mining comes in two flavors:
– Directed: directed data mining attempts to explain or categorize some particular target field, such as income or response.
– Undirected: undirected data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes.

• Data mining is largely concerned with building models.
• A model is simply an algorithm or set of rules that connects a collection of inputs to a particular target or outcome.
• Many problems of intellectual, economic, and business interest can be phrased in terms of the following tasks:
– Classification
– Estimation
– Prediction
– Affinity grouping
– Clustering
– Description and Profiling
• The first three are examples of directed data mining, where the goal is to find the value of a particular target variable.

• Affinity grouping and clustering are undirected tasks where the goal is to uncover structure in data without respect to a particular target variable.

• Profiling is a descriptive task that may be either directed or undirected.

The most commonly used techniques in data mining are:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
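To make the rule-induction idea concrete, here is a minimal C# sketch: it enumerates one-condition "if A then B" rules over a few invented market baskets and keeps those whose support and confidence clear a threshold, a simple stand-in for the statistical-significance test mentioned above. The data, item names, and thresholds are all hypothetical.

```csharp
using System;
using System.Linq;

class RuleInductionSketch
{
    static void Main()
    {
        // Hypothetical market-basket data, invented for illustration.
        string[][] baskets =
        {
            new[] { "bread", "butter", "milk" },
            new[] { "bread", "butter" },
            new[] { "bread", "milk" },
            new[] { "beer", "chips" },
        };

        string[] items = baskets.SelectMany(b => b).Distinct().ToArray();
        int n = baskets.Length;

        // Enumerate every one-condition rule "if a then b" and score it.
        foreach (string a in items)
        {
            foreach (string b in items.Where(x => x != a))
            {
                int countA = baskets.Count(bk => bk.Contains(a));
                int countAB = baskets.Count(bk => bk.Contains(a) && bk.Contains(b));
                double support = (double)countAB / n;         // how often the rule applies
                double confidence = (double)countAB / countA; // how often it holds when it applies

                // Keep only rules that are both frequent and reliable
                // (the thresholds are arbitrary for this sketch).
                if (support >= 0.5 && confidence >= 0.6)
                    Console.WriteLine("if {0} then {1} (support {2:P0}, confidence {3:P0})",
                        a, b, support, confidence);
            }
        }
    }
}
```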

• Decision Trees

• Nearest Neighbor classification

• Neural Networks

• Rule Induction

• K-means Clustering

Architecture

Data Mining Algorithms

• A data mining algorithm is a set of heuristics and calculations that creates a data mining model from data. To create a model, the algorithm first analyzes the data you provide, looking for specific types of patterns or trends.

• The mining model that an algorithm creates from your data can take various forms, including:

– A set of clusters that describe how the cases in a dataset are related.

– A decision tree that predicts an outcome, and describes how different criteria affect that outcome.

– A mathematical model that forecasts sales.
– A set of rules that describe how products are grouped together in a transaction, and the probabilities that products are purchased together.

• Analysis Services includes the following algorithm types:
– Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
– Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
– Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
– Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in a market basket analysis.
– Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow.
• Experienced analysts will sometimes use one algorithm to determine the most effective inputs (that is, variables), and then apply a different algorithm to predict a specific outcome based on that data.

Nearest Neighbor Method

• A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

K-Nearest-Neighbor (kNN) Models

• Use the entire training database as the model.
• Find the nearest data point and do the same thing as you did for that record.
• Very easy to implement, but more difficult to use in production.
• Disadvantage: huge models.

[Figure: kNN example plotted over Doses (x-axis, 0–1000) and Age (y-axis, 0–100).]
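As a concrete illustration of the kNN models described above, here is a minimal C# sketch. The Doses and Age features echo the figure, but the records, labels, and query are invented for illustration.

```csharp
using System;
using System.Linq;

class KnnDemo
{
    class Record
    {
        public double Doses;
        public double Age;
        public string Label;
    }

    // Return the majority class among the k training records closest to
    // the query (squared Euclidean distance). In practice, features on
    // different scales (Doses 0-1000 vs. Age 0-100) should be normalized
    // first, or the larger-scaled feature dominates the distance.
    static string Classify(Record[] training, Record query, int k)
    {
        return training
            .OrderBy(r => Math.Pow(r.Doses - query.Doses, 2)
                        + Math.Pow(r.Age - query.Age, 2))
            .Take(k)
            .GroupBy(r => r.Label)
            .OrderByDescending(g => g.Count())
            .First().Key;
    }

    static void Main()
    {
        // The "model" is simply the training data itself.
        Record[] training =
        {
            new Record { Doses = 120, Age = 25, Label = "low risk"  },
            new Record { Doses = 150, Age = 30, Label = "low risk"  },
            new Record { Doses = 800, Age = 70, Label = "high risk" },
            new Record { Doses = 900, Age = 65, Label = "high risk" },
        };

        Record query = new Record { Doses = 850, Age = 60 };
        Console.WriteLine(Classify(training, query, 3)); // prints "high risk"
    }
}
```

This also makes the stated disadvantage visible: the whole training database must be kept around and scanned at query time.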

Authentication

• Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person or software program, tracing the origins of an artifact, or ensuring that a product is what its packaging and labeling claim it to be.

• In private and public computer networks, authentication is commonly done through the use of logon passwords. Knowledge of the password is assumed to guarantee that the user is authentic. Each user registers initially, using an assigned or self-declared password. On each subsequent use, the user must know and use the previously declared password. The weakness of this system for significant transactions is that passwords can often be stolen, accidentally revealed, or forgotten.
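A minimal sketch of this register-then-verify flow, storing a salted hash via .NET's Rfc2898DeriveBytes rather than the raw password; the class and member names are hypothetical.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical in-memory password store for one user.
class PasswordStore
{
    private byte[] salt;
    private byte[] hash;

    // Initial registration: derive and keep a salted hash of the password.
    public void Register(string password)
    {
        salt = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
            rng.GetBytes(salt);
        using (var kdf = new Rfc2898DeriveBytes(password, salt, 10000))
            hash = kdf.GetBytes(32);
    }

    // Subsequent use: re-derive the hash and compare byte by byte.
    public bool Verify(string password)
    {
        using (var kdf = new Rfc2898DeriveBytes(password, salt, 10000))
        {
            byte[] attempt = kdf.GetBytes(32);
            int diff = 0;
            for (int i = 0; i < attempt.Length; i++)
                diff |= attempt[i] ^ hash[i];
            return diff == 0;
        }
    }
}
```

Hashing protects the stored secret, but it does nothing against passwords that are stolen, revealed, or forgotten.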

• For this reason, Internet business and many other transactions require a more stringent authentication process. The use of digital certificates issued and verified by a Certificate Authority (CA) as part of a public key infrastructure is considered likely to become the standard way to perform authentication on the Internet.

Public Key Cryptography

• Public-key cryptography refers to a cryptographic system requiring two separate keys: one to lock or encrypt the plaintext, and one to unlock or decrypt the ciphertext.
• The two main branches of public key cryptography are:
• Public key encryption: a message encrypted with a recipient's public key cannot be decrypted by anyone except a possessor of the matching private key; it is presumed that this will be the owner of that key and the person associated with the public key used. This is used for confidentiality.

• Digital signatures: a message signed with a sender's private key can be verified by anyone who has access to the sender's public key, thereby proving that the sender had access to the private key (and therefore is likely to be the person associated with the public key used) and that the message has not been tampered with.
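To make the sign/verify mechanics concrete, here is a minimal sketch using the .NET Framework's RSACryptoServiceProvider. The message text is invented, and in a real PKI the public key would be distributed in a CA-issued certificate rather than held in the same object as the private key.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class SignatureDemo
{
    static void Main()
    {
        byte[] message = Encoding.UTF8.GetBytes("record 42: Doses=250, Age=37");

        using (var rsa = new RSACryptoServiceProvider(2048))
        {
            // Sign with the private key (held only by the trusted authority).
            byte[] signature = rsa.SignData(message, "SHA1");

            // Anyone with the public key can check the signature.
            Console.WriteLine("valid: " + rsa.VerifyData(message, "SHA1", signature));

            // Any change to the message invalidates the signature.
            message[0] ^= 0xFF;
            Console.WriteLine("after tampering: " + rsa.VerifyData(message, "SHA1", signature));
        }
    }
}
```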

False Hit

• Generally, the term "hit" means a successful search, i.e., the required information has been found in the search for the given query. If the information required is not available in the database, the result is known as a false hit.

• False hits in data mining increase the cost of the application, so we have to reduce false hits in order to improve the performance of the application.

• In this application, false hits are reduced by storing the queries that caused them in another database: the first time the information is not available for a client's query, that query is saved in the false-hit database. Whenever a client issues a query, the system first searches the false-hit database, as sketched below.
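A minimal C# sketch of this false-hit filter, assuming queries and records reduce to simple string lookups; the class and member names are hypothetical, and the real system would persist the false-hit database (e.g., in SQL Server) rather than keep it in memory.

```csharp
using System;
using System.Collections.Generic;

class FalseHitFilter
{
    private readonly Dictionary<string, string> mainDb;  // the actual data
    private readonly HashSet<string> falseHitDb = new HashSet<string>(); // queries known to miss

    public FalseHitFilter(Dictionary<string, string> mainDb)
    {
        this.mainDb = mainDb;
    }

    public string Search(string query)
    {
        // 1. Check the false-hit database first: a known miss is answered
        //    immediately, without touching the main database.
        if (falseHitDb.Contains(query))
            return null;

        // 2. Otherwise search the main database.
        string result;
        if (mainDb.TryGetValue(query, out result))
            return result;

        // 3. First miss for this query: record it as a false hit so that
        //    subsequent identical queries are rejected cheaply.
        falseHitDb.Add(query);
        return null;
    }
}
```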

Analysis

Existing System:

• Several applications, including image, medical, time-series, and document databases, involve high-dimensional data. Similarity retrieval in these applications based on low-dimensional indexes, such as the R*-tree, is very expensive due to the curse of dimensionality. The existing system processes queries using nearest neighbor search, but the result is not authenticated, because it provides a result set containing only the nearest data.

Disadvantages:
• The record set provided by this system is not fully authenticated.
• It is unable to use a public key cryptosystem.
• It does not search for the nearest result accurately.

Proposed System:

The system provides authentication for query processing by maintaining a dataset DB at the server, signed by a trusted authority (e.g., the data owner or a notarization service). The signature is usually based on a public key cryptosystem. The server receives and processes queries from clients. Each query returns a result set: the records of the database that satisfy certain predicates. Moreover, the client must be able to establish that the result set is correct, i.e., that it contains all records of the database that satisfy the query condition, and that these records have not been modified by the server or any other entity. Since the signature captures the entire database and the server returns verification objects, the client can verify the result set based on the signature and the signer's public key. To ease this problem, we present a novel technique that reduces the size of each false hit.
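The following deliberately naive C# sketch shows the shape of this flow, not the paper's actual technique: the owner signs a single digest over the whole database; the server answers with the matching records plus the digests of all other records as a crude verification object; the client recombines them and checks the signature. All data and the matching predicate are invented.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class ResultSetVerification
{
    static byte[] Sha1(byte[] data)
    {
        using (var sha = SHA1.Create()) { return sha.ComputeHash(data); }
    }

    static byte[] DigestOf(string record)
    {
        return Sha1(Encoding.UTF8.GetBytes(record));
    }

    static void Main()
    {
        string[] database = { "alpha", "beta", "gamma", "delta" };

        // Trusted authority: sign one digest covering the entire database.
        byte[] root = Sha1(database.SelectMany(r => DigestOf(r)).ToArray());
        byte[] signature;
        string publicKey;
        using (var owner = new RSACryptoServiceProvider(2048))
        {
            signature = owner.SignData(root, "SHA1");
            publicKey = owner.ToXmlString(false); // export the public key only
        }

        // Server: return matching records in place, bare digests otherwise.
        Func<string, bool> matches = r => r.StartsWith("b"); // the query predicate
        object[] response = database
            .Select(r => matches(r) ? (object)r : (object)DigestOf(r))
            .ToArray();

        // Client: recompute the root digest from the response and verify it
        // against the signature using only the signer's public key.
        byte[] clientRoot = Sha1(response
            .SelectMany(x => x is string ? DigestOf((string)x) : (byte[])x)
            .ToArray());

        using (var verifier = new RSACryptoServiceProvider())
        {
            verifier.FromXmlString(publicKey);
            Console.WriteLine("result set verified: "
                + verifier.VerifyData(clientRoot, "SHA1", signature));
        }
    }
}
```

This catches any modification to a returned record and any record invented by the server, but a server could still hide a matching record behind its bare digest; proving such completeness efficiently is exactly what real verification objects are designed for.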

Advantages:
• The system provides a result set that is accurate.
• Using a public key cryptosystem, the system provides a result set that is fully authenticated to the user and can be verified against the signer's signature.
• As we are using the AMNN method, the client can view the accurate data.

SOFTWARE REQUIREMENTS

• Operating system : Windows 7 / XP Professional
• Front end : Microsoft Visual Studio .NET 2008
• Coding language : Visual C# .NET
• Backend : SQL Server 2005

HARDWARE REQUIREMENTS

• Processor : Pentium IV, 2.6 GHz
• RAM : 2 GB
• Hard disk : 40 GB

Design

Use case diagram

Class diagram

Object diagram

State diagram

Activity diagram

Sequence diagram

Collaboration diagram

Component diagram