Rd1 r17a19 datawarehousing and mining_cap617t_cap617

14
HOME WORK TITLE -DESIGN PROBLEM COURSE CODE -CAP 617T COURSE INSTRUCTOR - Lect.Neha Malhotra Mam COURSE TUTOR -DO ALLOCATION DATE -01/09/12 SUBMITION DATE -01/11/12 STUDENT ROLL NO . – A19 SECTION- D1R17 DECLARATION, I declare that this design problem is individual work. I have not copied from any student work or from any other source except where due acknowledgement is made explicitly in the text, nor has any part been written from me by another person.

Transcript of Rd1 r17a19 datawarehousing and mining_cap617t_cap617

Page 1: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

HOME WORK TITLE-DESIGN PROBLEM

COURSE CODE-CAP 617T

COURSE INSTRUCTOR-Lect.Neha Malhotra Mam

COURSE TUTOR-DO

ALLOCATION DATE-01/09/12

SUBMITION DATE-01/11/12

STUDENT ROLL NO. – A19

SECTION-D1R17

DECLARATION,

I declare that this design problem is individual work. I have not copied from any student work or from any other source except where due acknowledgement is made explicitly in the text, nor has any part been written from me by another person.

EVALUATOR COMMENT……… STUDENT SIGN

MARKS OBTAINED………. PRIYA RANJAN

Page 2: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

Q.1What kind of data mining algorithms are used for mining the database and justify the importance of the tool.

Ans:Kind of data mining algorithms are used for mining the database and we justified the importance of the tool as described below:-

When student mistakes are recorded, association rules algorithms can be used to find mistakes often associated together. Combined with a genetic algorithm, concepts mastered together can be identified using student scores. The teacher may use these findings to reflect on his/her teaching and re-design the course material. It also comprises data exploration and visualization to present results in a convenient way to users. New variables may be calculated and used in algorithms, such as the average number of mistakes made per attempted exercise.

Tools: We used a range of tools. Initially we worked with Excel and Access toperform simple SQL queries and visualization.

Data exploration and visualization: Raw data and algorithm results can be visualizedThrough tables and graphics such as graphs and histograms as well as through more specificTechniques such as symbolic data analysis. The aim is to display data along certain attributes and make extreme points, trends and clusters obvious to human eye.

Clustering algorithms aim at finding homogeneous groups in data. The task of identifying groups of records that are similar between themselves but different from the reset of the data. Often the variables providing the best clustering should be identified as well. We used k-means clustering and its combination with hierarchic clustering. Both methods rest on a distance concept between individuals. We used Euclidian distance.

Classification is used to predict values for some variable. The task of finding a function that maps records into one of several discrete classes.For example, given all the work done by a student, one may want to predict whether the student will perform well in the final exam. We used C4.5 decision tree from TADA-Ed which relies on the concept of entropy. The tree can be represented by a set of rules such as: if x=v1 and y> v2 then t= v3.Thus, depending on the values an individual takes for, say the variables x and y, one can predict its value for t. The tree is built taking a representative population and is used toPredict values for new individuals.

Page 3: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

Association rules find relations between items. Rules have the following form: X ->Y, support 40%, confidence 66%, which could mean 'if students get X incorrectly, then they get also Y incorrectly', with a support of 40% and a confidence of 66%. Support is the frequency inthe population of individuals that contains both X and Y. Confidence is the percentage of theinstances that contains Y amongst those which contain X. We implemented a variant of thestandard Apriori algorithm in TADA-Ed that takes temporality into account. Takingtemporality into account produces a rule X->Y only if exercise X occurred before Y.

Q.2What preprocessing steps are lcarried out for effective mining and comment on the same.Ans:Tada-ed provides a pre-processing facility steps are lcarried out for effective mining as described below:-

Tada-ed provides a pre-processing facility which allows making the data minable. We need to specify two aspects:(1)What element we want to cluster or classify: students, exercises, mistakes?

(2)Which attributes and distance do we want to retain to compare these elements?

Example: - An example could be to cluster students, using the number of mistakes they made and the number of correct steps they entered.The data was maintained in different tables was joined in a single table. After we integrated the data into one files, to increase interpretation and comprehensibility; we discretized the attributes to categorical ones. For examples, we grouped all grades into five groups’ excellent, very good, good, average, and poor. In this step the fields used in the study were determined and transformed if necessary. By using normal distribution method, we categorized the value of each item in questionnaire with High, Medium and Low

Data Preparation:-

Page 4: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

Q.3In which way the Association rule mining algorithms and clustering algorithms are useful during the analysis and interpretation?

Ans:-Association Rules:Spatial and non-spatial association rules are in the form of X-->Y(c%), where X and Y are sets of spatial or non-spatial predicates and c% is the confidence of the rule. For examples:- Is_a (X, origin of CSU student) --> close to (X,railway stations) (70%), this rule states that 60% of CSU students originally live close to highway.- Is_a (X, CSU student) --> from (X, middle education level areas), (80%), this rule states that 80% of CSU students are from the area which have middle level of education.For a large database where there are a large set of objects and attributes, there may exist a large number of associations between them (Koperski, 1998). Some rules may only apply to a small number of objects, for example, less than 5% of CSU's Park Recreation and Heritage students are associated with a disability code, therefore it may not be of interest for the further study.While 75% of students live within 10 km of railway stations, therefore it attracted further study. A minimum support threshold needs to be specified along with minimum confidence threshold to filter out the uninterested associations.We used association rules to find mistakes often occurring together while solving exercises.The purpose of looking for these associations is for the teacher to ponder and, may be, to

Page 5: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

review the course material or emphasize subtleties while explaining concepts to students.Thus, it makes sense to have a support that is not too low.

Association rules for Year 2004.

M11 ==> M12 [sup: 77%, conf: 89%]M12 ==> M11 [sup: 77%, conf: 87%]M11 ==> M10 [sup: 74%, conf: 86%]M10 ==> M12 [sup: 78%, conf: 93%]M12 ==> M10 [sup: 78%, conf: 89%]M10 ==> M12 [sup: 74%, conf: 88%]M10: Premise set incorrectM11: Rule can be applied, but deduction incorrectM12: Wrong number of line reference givenThe first association rule says that if students make mistake Rule can be applied, but deduction incorrect while solving an exercise, then they also made the mistake Wrong number of line references given while solving the same exercise.

ClusteringClustering provides grouping together of similar data items. This technique provides a high level view of the database. Clustering technique is a technique that merges and combines techniques from different disciplines such as mathematics, physics, math-programming, statistics, computer sciences, artificial intelligence and databases etc. Variety of clustering algorithms exists and belongs to several different categories. This technique helps to create new groups and classes based on the study of patterns and relationships between values of data in a data bank. In our case study we are considering students data which exists in three separate clusters named as student academic record, student residence record, student personal record etc.These clusters are again consisting of several its own attributes. By applying this technique we will easily categorize the record of the student in detail manner. For example:- CGPA Grade shows the total marks of the student academic record cluster. On the basis of marks we can easily identify the actual performance or the level of the student. This can be shown in below figure.

Page 6: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

Different cluster numbers were tried, and successful partitioning was achieved with 5 clusters. In our case study, the cluster graph as a picture of students group according to their performance on figure 3 gives. For graphs, the Rapidminer software was used. The graphs are given in figure 4 is deviation plot five clusters of students. Using these results we can divide students into five groups and guide them according to their behavior.

We performed clustering using this subpopulation, both using (i) k-means in TADA-Ed, and(ii) a combination of k-means and hierarchical clustering of Clementine. Because there is neither a fixed number nor a fixed set of exercises to compare students, determining a distance between individuals was not obvious. We calculated and used a new variable: the total number of mistakes made per student in an exercise.

Page 7: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

As a result, students with similar frequency of mistakes were put in the same group. Histograms showing the different clusters revealed interesting patterns. Consider the histogram shown in Figure 1 obtained with TADA-Ed. There are three clusters: 0 (red, on the left), 1 (green, in the middle) and 4 (purple, on the right). From other windows (not shown) we know that students in cluster 0 made many mistakes per exercise not finished, students in cluster 1 made few mistakes and students in cluster 4 made an intermediate number of mistakes. Students making many mistakes use also many different logic rules while solving exercises, this is shown with the vertical, almost solid lines.

Histogram showing, for each cluster of students, the rules incorrectly used per student

Q.4How classification rule mining algorithms was helpful to the teachers? Justify your actions.Ans:Classification rule mining algorithms was helpful to the teachers we justified our action as described below:-

Classification –We built decision trees to try and predict exam marks (for the question related to formal proofs). Decision trees are basically a type of procedure on the basis of this we can decide either specific value is accept or reject by that procedure. It is basically provides mapping with the current state to the future state. In this way it helps to take decision in efficient manner. This follows theory of dynamic optimization. Decision trees are using If-Then Statements. The major advantage to use this technique is that results can be displayed in graphical manner so that user understands easily.For evaluating the decision trees we can use various factors like Dataset, Data type, Scalability, accuracy and robustness etc.

Description of teacher activity

Page 8: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

 

Diagrammatical representation of classification rule mining algorithm

The information extracted greatly assisted us as teachers to better understand the cohort ofLearners. Whilst SQL queries and various histograms were used during the course of theteaching semester to focus the following lecture on problem areas, the more complexmining was left for reflection between semesters.Symbolic data analysis revealed that if students attempt at least two exercises, they aremore likely to do more (probably overcoming the initial barrier of use) and completetheir exercises. In subsequent years we required students to do at least 2 exercises aspart of their assessment.Mistakes that were associated together indicated to us that the very concept of formalproofs was a problem. In 2003, that portion of the course was redesigned to take thisproblem into account and the role of each part of the proof was emphasized. After theend of the semester, mining for mistakes associations was conducted again.Surprisingly, results did not change much (a slight decrease in support and confidencelevels in 2003 followed by a slight increase in 2004).

Q.5Based on your analysis, list out 5 best DM queries for the given context.

Ans:The 5 best DM queries for the given context are as follows:-1.Symbolic data analysis revealed that if students attempt at least two exercises, they are

Mistakes Rules Learn

Page 9: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

more likely to do more and complete their exercises. In subsequent years we required students to do at least 2 exercises as part of their assessment.

2.Mistakes that were associated together indicated to us that the very concept of formalproofs was a problem. However, marks in the final exam continued increasing. This leads us to think that making mistakes, especially while using a training tool, is simply part of the learning process and was supported by the fact that the number of completed exercises.

3.The level of prediction seems to be much better when the prediction is based on exercises (number, length, variety of rules) rather than on mistakes made. This also supports the idea that mistakes are part of the learning process, especially in a practicetool where mistakes are not penalised.

4.Using data exploration and results from decision tree, one can infer that if students dosuccessfully 2 to 3 exercises for the topic, then they seem to have grasped the conceptof formal proof and are likely to perform well in the exam question related to that topic.This finding is coherent with correlations calculated between marks in the final examand activity with the Logic Tutor and with the general, human perception of tutors inthis course. Therefore, a sensible warning system could look as follows. Report to thelecturer in charge students who have completed successfully less than 3 exercises. Forthose students, display the histogram of rules used. Be proactive towards these students,distinguishing those who use out the pop-up menu for logic rules from the others.

5.feedback functionality is available for association rule. where the teacher can design his/her own proactive feedback for that particular sequence of mistakes. The content of the page is up to the teacher. For instance for the pattern of mistakes A, B -> C, the teacher may want to provide explanations about mistakes A and B (which the current student has made) and review underlying concepts of mistake C. This rule has a support and confidence.

Page 10: Rd1 r17a19 datawarehousing and mining_cap617t_cap617

Q.6From the given context, suggest few more areas where we can apply data mining in an University.

Ans:We can applied data mining in an University as described below:-

Student Academic Record

Attributes Description Selected AttributesStd_Reg_No Student RegistrationNumberEnroll_Yr Year of EnrollmentCompletion_Yr Year of CompletionCGPA_PG Post-Graduation CGPA yesCGPA_G Graduation CGPACGPA_HS High School CGPACGPA_T Tenth CGPAE_ID E-Mail IDMOB_NO Mobile NumberDate_of_Birth Date of BirthPerformanc_Grade Overall Performance yes

Student Residence Record

Attributes Description Selected AttributesPermanent Add Permanent AddressCorrespondance_Add Address For CorrespondenceCity_C CityLocation_L LocationState_S State

Student Personal Record

Attributes Description Selected AttributesGender_G Male-M,Female-FPlace_of_Birth_POB Student Birth PlaceNationality_N Nationality