An Introduction to Big Data

Harry E. Pence, 2013
TLTC Faculty Fellow for Emerging Technologies


Page 2:

Gerd Leonhard

John Wanamaker once said, “I know that half of my advertising doesn’t work. The problem is I don’t know which half.”

Page 3:

What’s new about Big Data?

IBM describes the problem as The Three V’s.

The sheer volume of stored data is exploding. IBM predicts that there will be 35 zettabytes stored by 2020.

This data comes in a bewildering variety of structured and unstructured formats.

The velocity of data depends on not just the speed at which the data is flowing but also the pace at which it must be collected, analyzed, and retrieved.

Although most businesses already collect terabytes of information about customers, employees, and their enterprise, a recent survey found that 62% of business leaders couldn’t access their information fast enough, and 83% believed it didn’t give them what they needed to know.

Page 4:

The annual growth rate of big data is 60%.

Many business schools call it Business Analytics rather than Big Data.

Page 5:

Page 6:

Computing resources are becoming much cheaper and more powerful.

In 1980 a terabyte of disk storage cost $14 million; now it costs about $30.

Amazon or Google will “rent” a cloud-based supercomputer cluster for only a few hundred dollars an hour.

Page 7:

Social networks, like Facebook and Twitter, are spanning the globe.

Twitter generates more than 7 terabytes (TB) a day; Facebook more than 10 TB, and some enterprises already store data in the petabyte range.

Sebastien Pierre’s Facebook Map

Page 8:

1 megabyte = 1E6 bytes

1 gigabyte = 1E9 bytes

1 terabyte = 1E12 bytes

1 petabyte = 1E15 bytes, or 250,000 DVDs

Source: http://tinyurl.com/a8zwman

Facebook currently stores more than 100 petabytes of data.
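The unit arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the DVD capacity is not stated on the slide, so it is derived here from the "250,000 DVDs per petabyte" figure, and the variable names are illustrative.

```python
# Decimal byte units as listed on the slide.
BYTES = {
    "megabyte": 1e6,
    "gigabyte": 1e9,
    "terabyte": 1e12,
    "petabyte": 1e15,
}

# The slide equates one petabyte with 250,000 DVDs, which implies
# a capacity of about 4 GB per DVD (an inference, not a quoted figure).
DVDS_PER_PETABYTE = 250_000
dvd_bytes = BYTES["petabyte"] / DVDS_PER_PETABYTE

# Facebook's "more than 100 petabytes", expressed in bytes and in DVDs.
facebook_bytes = 100 * BYTES["petabyte"]
facebook_dvds = facebook_bytes / dvd_bytes

print(f"Implied DVD size: {dvd_bytes / BYTES['gigabyte']:.1f} GB")
print(f"100 PB is roughly {facebook_dvds:,.0f} DVDs")
```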

Page 9:

The amount of data available is growing much faster than the ability of most companies to process and understand it.

Moore’s Law 2: Data expands to fill available storage.

According to researchers at UC San Diego, Americans consumed about 3.6 zettabytes of information in 2008.

David Weinberger (p. 7) says a digital War and Peace is about 2 megabytes (1,296 pages), so one zettabyte equals 5E14 copies of War and Peace.

It would take light 2.9 days to go from the top to the bottom of this stack.
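The War and Peace comparison can be reproduced as a back-of-the-envelope calculation. Two things below are assumptions rather than figures from the slide: that "this stack" means the full 3.6 zettabytes consumed in 2008, and that a printed copy is about 4.2 cm thick (a value chosen because it is plausible for a 1,296-page book and is consistent with the 2.9-day figure).

```python
ZETTABYTE = 1e21           # bytes
BOOK_BYTES = 2e6           # digital War and Peace, per Weinberger
SPEED_OF_LIGHT = 299_792_458   # metres per second
SECONDS_PER_DAY = 86_400

# Assumption: thickness of one printed copy, in metres.
BOOK_THICKNESS_M = 0.042

copies_per_zettabyte = ZETTABYTE / BOOK_BYTES    # 5E14, as on the slide
copies_2008 = 3.6 * copies_per_zettabyte         # assumed size of the stack

stack_height_m = copies_2008 * BOOK_THICKNESS_M
light_days = stack_height_m / SPEED_OF_LIGHT / SECONDS_PER_DAY

print(f"Copies per zettabyte: {copies_per_zettabyte:.1e}")
print(f"Light travel time along the stack: {light_days:.1f} days")
```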


Page 10:

Some applications of Big Data

Topological data analysis is being used to rethink basketball (five or thirteen positions). http://tinyurl.com/c5ajwm3

In March 2013, the Obama Administration announced $200 million in R&D investments for Big Data. http://tinyurl.com/85oytkj

Google combined search terms with CDC data to identify search terms that correlated with the spread of the 2009 flu season.

Both Amazon and Netflix use correlation-based suggestions to boost sales.

Target assumes that if a 20-something female shopper purchases an unscented lotion, supplements such as zinc and calcium, and a large purse, she is pregnant.
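A correlation-based suggestion can be sketched in a few lines. The purchase histories below are invented for illustration, and the Pearson coefficient is just one common choice of correlation measure; nothing here is claimed to be how Amazon, Netflix, or Target actually do it.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one column per customer, 1 = bought the item.
purchases = {
    "lotion":   [1, 1, 0, 1, 0, 0],
    "vitamins": [1, 1, 0, 1, 0, 1],
    "laptop":   [0, 0, 1, 0, 1, 1],
}

def suggest(item):
    """Return the other item whose purchases correlate best with `item`."""
    others = [name for name in purchases if name != item]
    return max(others, key=lambda name: pearson(purchases[item], purchases[name]))

print(suggest("lotion"))
```

Note that the suggestion says nothing about why the two items go together, only that they tend to be bought by the same customers.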

Page 11:

Did Big Data help to determine the 2012 election?

NY Times, 2/14/13 http://tinyurl.com/acj8hk5

Romney raised slightly more money from his online ads than he spent on them; Obama’s team more than doubled the return on its online-ad investment.

Romney’s get-out-the-vote digital tool, Orca, crashed on Election Day; Obama’s Narwhal gave every member of the campaign instant access to continuously updated voter information.

Obama was the very first candidate to appear on Reddit, and the photo of the Obamas became the most popular image ever seen on Twitter or Facebook.

Romney’s senior strategist, Stuart Stevens, may well be the last guy to run a presidential campaign who never tweeted.

Page 12:

One definition is, “Big Data is when the size of the data itself becomes a problem.”

Often so much data is collected that much of it is discarded, a process known as the Donner Party Effect.

Open, online databases, like http://data.worldbank.org/ and https://explore.data.gov/, are now available.

Google Analytics allows us to query search patterns, and Google’s BigQuery (https://developers.google.com/bigquery/) allows anyone to query all of Wikipedia, Shakespeare, and weather stations for less than $0.035 per GB. http://tinyurl.com/bvd2yve

But Big Data is important not just because of size but also because of how it connects data, people, and information structures (http://tinyurl.com/ato2hbu). It enables us to see patterns that weren’t visible before.

Page 13:

Companies ranging from the NY Times to the UK Ordnance Survey are creating linked data to make the web more interconnected.

Each entity is defined by a Uniform Resource Identifier (URI), which is machine readable.

The hope is to attach metadata to each entity to show how they relate to each other: employees to companies, actors to motion pictures, etc.

We are moving towards the Semantic Web.
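The linked-data idea (entities named by URIs, with metadata describing how they relate) can be illustrated with plain subject-predicate-object triples. Every URI below is a made-up example.org placeholder, and the two helper functions are hypothetical; real linked-data projects publish their own vocabularies and usually rely on a dedicated RDF toolkit.

```python
# Hypothetical namespace; not a real vocabulary.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = [
    (EX + "person/alice", EX + "rel/worksFor", EX + "company/acme"),
    (EX + "person/bob",   EX + "rel/worksFor", EX + "company/acme"),
    (EX + "person/carol", EX + "rel/actedIn",  EX + "film/some-film"),
]

def subjects(predicate, obj):
    """All entities that stand in `predicate` relation to `obj`."""
    return [s for s, p, o in triples if p == predicate and o == obj]

def objects(subject, predicate):
    """All entities that `subject` points to via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Who works for the (hypothetical) company?
employees = subjects(EX + "rel/worksFor", EX + "company/acme")
print(employees)
```

Because every entity and relationship is a URI, a program can follow the links without knowing anything about the page layout they were published in.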


Page 14:

Inexpensive sensors are rapidly moving us towards an Internet of Things.

http://tinyurl.com/aoev3x9

We are here!

Physical World Web

Page 15:

Some predict that the Internet of Things will soon produce a massive volume and variety of data at unprecedented velocity. http://tinyurl.com/ahytzdf

Welcome to the new information age

Page 16:

Morgan Stanley predicts the following data applications will grow fastest in 2013: 1. Healthcare, 2. Entertainment, 3. Com/Media, 4. Manufacturing, 5. Financial

Inexpensive ($100-200) devices, like Fitbit, will already track your daily physical activity and report it to a web page.

American Society of Clinical Oncology is creating a database, CancerLinQ, to centralize cancer records so that Big Data methods can evaluate the effectiveness of treatments and hasten development of new medicines. http://tinyurl.com/cnv6wfw

Page 17:

Big Data and the Future of Health Care

A patient can use a smartphone to do many tests now done in the laboratory, such as an EKG or glucose test.

The cost of a personal genome is dropping rapidly and some are predicting a $100 cost soon. Knowing an individual’s genome should allow treatment to be customized to the individual.

A recent report says that Big Data could save as much as $450 million in health care costs, but the AMA says that current electronic health record systems lack the sophistication to manage the storage and retrieval of big data. http://tinyurl.com/cln8vf9

Page 18:

Ever since the 1970s, the roughly 25,000 families who create the Nielsen ratings have determined what TV shows survive and what ad rates will apply.

As more and more people TiVoed, Nielsen created the C3 rating in 2007, and recently they added the C7 rating to measure how many people viewed the show after it originally aired.

Now advertisers want to know not just if people watched, but if they were “engaged.”

In Nov. 2012, Nielsen purchased SocialGuide, which measures the “social impact” of TV, and announced it was partnering with Twitter.

Now Twitter has purchased Bluefin Labs, a social-TV analytics company. Wired, April 2013, 92-94

Page 19:

Big Data and Big Science

“The Fourth Paradigm”

The University of California, San Diego, is building a "big data freeway system" for science projects in "genomic sequencing, climate science, electron microscopy, oceanography and physics."

The Square Kilometre Array (SKA) under development in Australia and South Africa will collect one exabyte of data per day from 36 small antennas spread over more than 3000 km to simulate a single giant radio telescope.

Bradley Voytek (Big Data, location 1186) argues that Big Data analysis allows researchers to identify patterns that were previously invisible; it is possible to automate critical aspects of the scientific method itself.

Page 20:

The Large Hadron Collider at CERN produces so much data that scientists must discard most of it, hoping they haven’t thrown away anything useful.

Weather prediction combines data from multiple earth satellites with massive computing power.

Most of the satellites belong to the U.S., but the Europeans have more powerful computers.

Our weather satellites are old. http://tinyurl.com/cvpz5qe


17 miles

Page 21:

Search engines, like Yahoo! and Google, were the first companies to work with datasets that were too large for conventional methods. (According to Big Data 2, location 177, Google has over a million servers.)

In order to power its searches, Google developed a strategy called MapReduce. You map a task onto a multitude of processors, then retrieve and combine (reduce) the results.
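The map-then-combine idea can be shown with the classic word-count example. This is a toy, single-process sketch: the three text chunks stand in for data held on separate machines, and the function names are illustrative, not Google's implementation or any real framework's API.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the intermediate pairs by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Each chunk would live on a different processor in a real cluster.
chunks = ["big data is big", "data about data", "big questions"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The point of the design is that each map call needs only its own chunk, so the chunks can be processed in parallel and only the small intermediate results have to be brought together.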

Traditional data warehouses use a relational database (think Excel rows and columns).

Search engines need to handle non-relational databases, sometimes called NoSQL.
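The contrast between the two models can be seen in a small sketch. The relational half uses Python's built-in sqlite3 module; the non-relational half uses a plain list of dictionaries as a stand-in for a document store. All names, cities, and fields are invented for illustration.

```python
import sqlite3

# Relational: every row has the same columns, declared in advance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "Oneonta"), (2, "Grace", "Albany")],
)
rows = conn.execute("SELECT name FROM customers WHERE city = ?", ("Albany",))
relational_names = [row[0] for row in rows]

# Non-relational: each record is a free-form document, so two
# records need not share the same fields or nesting.
documents = [
    {"id": 1, "name": "Ada", "city": "Oneonta"},
    {"id": 2, "name": "Grace", "location": {"city": "Albany"}, "tweets": ["hello"]},
]

def city_of(doc):
    """Look for a city wherever this particular document stores it."""
    return doc.get("city") or doc.get("location", {}).get("city")

document_names = [doc["name"] for doc in documents if city_of(doc) == "Albany"]

print(relational_names, document_names)
```

The flexibility of the document form is what lets it absorb data that doesn't fit nicely into tables, at the cost of pushing the "where is this field?" question into the query code.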


Page 22:

The most popular software to search NoSQL databases is called Hadoop, and several different versions are freeware.

Hadoop is designed to collect data, even if it doesn’t fit nicely into tables, distribute a query across a large number of separate processors, and then combine the results into a single answer set in order to deliver results in almost real time.

This is often paired with machine learning apps, like a recommendation engine, classification, error detection, or facial recognition.

Personal Aside: I suggest that Google Analytics might be the best way to introduce students to crafting a query for a Big Data exercise.

Hadoop is named after its creator’s son’s toy elephant.

Page 23:

We are in the Golden Age of Data Visualization

A streamgraph of the conversation around a brand. http://tinyurl.com/beqxuyl

Page 24:

It can be difficult to classify a tweet as pro or con.

Don’t appoint a #fracking proponent to lead the Dept. of Energy. (often RT)

Broome County Executive took $82,428 in pro-#fracking campaign contributions.

The only politician I know that has backed up his promise to address #fracking is Gov. Cuomo.

Natural gas is neither perfect nor perfectly evil.

Businesses surprised to see their names on #fracking petition.

Mmmm @fracking fluid bit.ly/ZmQVQj
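A deliberately naive keyword classifier shows why tweets like these are hard to label. The two word lists are invented for this sketch and are not taken from any real sentiment lexicon; the tweets are the ones quoted on the slide, with straight apostrophes.

```python
import re

# Hypothetical keyword lists, chosen only to illustrate the problem.
PRO_WORDS = {"proponent", "perfect", "backed"}
CON_WORDS = {"evil", "petition"}

def classify(tweet):
    """Label a tweet by counting pro and con keywords."""
    words = set(re.findall(r"[a-z']+", tweet.lower()))
    score = len(words & PRO_WORDS) - len(words & CON_WORDS)
    if score > 0:
        return "pro"
    if score < 0:
        return "con"
    return "unclear"

tweets = [
    "Don't appoint a #fracking proponent to lead the Dept. of Energy.",
    "Natural gas is neither perfect nor perfectly evil.",
    "Businesses surprised to see their names on #fracking petition.",
]

for tweet in tweets:
    print(classify(tweet), "|", tweet)
```

The first tweet opposes fracking but is labeled "pro" because it contains the word "proponent", and the second is balanced by design, so the classifier gives up. Negation, sarcasm, and context are exactly what keyword counting misses.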


Page 25:

Amazon’s Mechanical Turk is one way to manually create a standard data template.

https://www.mturk.com/mturk/welcome

Page 26:

Scraping, processing, and buying Big Data

One can only copyright a specific arrangement of the data, but the metadata is often extremely important and may not follow the scrape.

Companies, like Infochimps, are scraping and cleaning selected data from Twitter and then selling access to these datasets.

Pete Warden reports that it only cost him $120 to gather, analyze, and visualize 220 million public Facebook profiles (http://tinyurl.com/yb2q3dv), and 80legs allowed him to download a million web pages for about $2.20.

Page 27:

Problems with Big Data – Selection bias (http://tinyurl.com/bdb7wgy)

Selection bias occurs when the individuals or groups taking part in a study don’t represent the general population.

According to a recent Pew survey, Twitter users are younger than the general public and more likely to lean toward the Democratic Party. Twitter reactions are often at odds (six out of eight times) with overall public opinion. http://tinyurl.com/cvqq5hz
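The effect of selection bias can be shown with simple weighted averages. All the numbers below are made up for illustration (they are not the Pew figures): a population with two age groups that hold different opinions, and a Twitter-like sample that over-represents the younger group.

```python
# Hypothetical population: share of each group and its support for some policy.
population = {
    "young": {"share": 0.30, "support": 0.70},
    "older": {"share": 0.70, "support": 0.40},
}

# Hypothetical sample mix: younger users are heavily over-represented.
twitter_mix = {"young": 0.80, "older": 0.20}

def weighted_support(mix):
    """Overall support implied by a given mix of the two groups."""
    return sum(mix[group] * population[group]["support"] for group in mix)

true_support = weighted_support({g: population[g]["share"] for g in population})
twitter_support = weighted_support(twitter_mix)

print(f"General population: {true_support:.0%} support")
print(f"Twitter-like sample: {twitter_support:.0%} support")
```

Nobody in the sample changed their mind; the gap comes entirely from who was counted.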

A recent article in EPJ Data Science says Twitter is actually composed of modern-day tribes, groups of people who use a discrete language and are connected to a character, occupation or interest. http://preview.tinyurl.com/bc8gecu

Page 28:

The partition of English-speaking Twitter users into communities, annotated with words typical of those often used by members of each community. http://tinyurl.com/bmb9r9e

Page 29:

Problems with Big Data – Misclassification bias (http://tinyurl.com/bdb7wgy)

Misclassification occurs when either the cause or the effect is not accurately recognized.

What if a response is not correctly identified as intended by the customer? This is especially likely when subjective interpretation is required to classify an answer. http://tinyurl.com/cvqq5hz

Remember: Correlation does not imply causation.

Page 30:

Comments on Big Data for education

The information in LMSs like Moodle or Angel (e.g., time on task, number of log-ins, number of list posts, etc.) is well structured and is already being analyzed.

In MOOCs, students interact entirely online, leaving behind a record of every page they visited.

The Gates Foundation recently gave $100 million to InBloom to improve ways to transfer information among the many technology information silos where student records are currently stored. Nine states (including New York) are participating in this pilot project and plan to offer third-party vendors access to student data (without student or parental consent). http://tinyurl.com/b8e4whh

Unanswered questions include security, who owns learner-produced data (TurnItIn?), who owns the data analysis, and what will be shared with the students?

Page 31:

E-Textbooks can now track student use patterns.

Instructors who use an e-text from CourseSmart receive information about each student showing how much she is reading the book, what pages she skips, how much she highlights, and whether she is taking notes. See NY Times, April 9, 2013, http://tinyurl.com/d7dfob4

Students who take notes with pen and paper may be penalized, even if they are doing well in the course.

Image from the blog Electric Venom: Motherhood, mid-life crisis, martinis.

If it can be measured, some teachers will grade it!

Page 32:

Some Learning Management Systems already display a "dashboard" of a student’s performance.

Math SAT score: 390 (You need extra work in math-intensive courses)

Avg. grade of students like you who took this course: C

Avg. grade of students like you who had this instructor: D

Your grade in prerequisite courses: Precalculus: D

Gauges for Attendance, Time on LMS, and Homework, each rated Danger, Needs improvement, or Good.

Page 33:

Chico State University (CA) has been using learning analytics to study how student achievement is related to LMS use and student characteristics.

http://tinyurl.com/acz5f7o

The project merges LMS data with student characteristics and course performance from the campus database.

The study reports a direct positive relationship between LMS usage and the student’s final grade.

Voytek’s Third Law: Any sufficiently advanced statistics can trick people into believing the results reflect truth.

Dwell time is how long a student spent on a given activity.
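One plausible way to estimate dwell time from an LMS click log is to measure the gap between a student's consecutive events. The log format, student IDs, and timestamps below are all hypothetical; this is not the method used in the Chico State study.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical click log: (student, activity, timestamp).
events = [
    ("s1", "quiz",   "2013-04-01 10:00"),
    ("s1", "forum",  "2013-04-01 10:12"),
    ("s1", "logout", "2013-04-01 10:15"),
    ("s2", "quiz",   "2013-04-01 11:00"),
    ("s2", "logout", "2013-04-01 15:00"),
]

def dwell_minutes(log):
    """Minutes between each event and the same student's next event."""
    by_student = defaultdict(list)
    for student, activity, stamp in log:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        by_student[student].append((when, activity))

    totals = defaultdict(float)
    for student, visits in by_student.items():
        visits.sort()  # chronological order
        for (start, activity), (end, _next) in zip(visits, visits[1:]):
            totals[(student, activity)] += (end - start).total_seconds() / 60
    return dict(totals)

dwell = dwell_minutes(events)
print(dwell)
```

Student s2 shows four hours on the quiz, but the log cannot tell whether that was four hours of work or a browser tab left open over lunch.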

Page 34:

Problems with Big Data – Confounding bias (http://tinyurl.com/bdb7wgy)

Confounding occurs when there is a failure to control for some other factor that is affecting the outcome.

How do other factors, like attendance, reading the textbook, attending extra help sessions, etc. relate to the time spent on the LMS (and so the course grade)?
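A tiny invented dataset shows how a confounder can manufacture a relationship. Here class attendance drives both LMS hours and grades; within each attendance group, LMS hours have no relationship to the grade at all. The six records are fabricated purely for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical records: (attends_class, lms_hours, final_grade).
students = [
    (False, 1, 60), (False, 2, 70), (False, 3, 60),
    (True,  5, 80), (True,  6, 90), (True,  7, 80),
]

def hours_vs_grade(rows):
    """Correlation between LMS hours and grade for the given rows."""
    return pearson([hours for _, hours, _ in rows], [grade for _, _, grade in rows])

overall = hours_vs_grade(students)
non_attenders = hours_vs_grade([s for s in students if not s[0]])
attenders = hours_vs_grade([s for s in students if s[0]])

print(f"All students:  r = {overall:.2f}")
print(f"Non-attenders: r = {non_attenders:.2f}")
print(f"Attenders:     r = {attenders:.2f}")
```

Pooled together, LMS hours look strongly tied to grades; once attendance is controlled for, the relationship vanishes.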

Do the Dwell Times really make sense?

The article notes that, “No individually identifiable information is included in the data files.”

Really???

Page 35:

Problems with Big Data – Privacy

A sensible back-up strategy may create more than 100 copies. How can organizations protect the privacy of all this data from hackers?

This can make it hard to protect individual privacy, and recent experiences suggest that there are now so many public datasets available for cross-referencing that it is difficult to assure that any Big Data records can be kept private.

In a number of cases, information from “anonymous studies” has been tracked back to identify individuals and even their families.
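Cross-referencing works by joining an "anonymous" dataset to a public one on fields they share. In this sketch both datasets are invented, and the choice of ZIP code, birth year, and sex as the linking fields is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical "anonymized" study records: names removed, details kept.
anonymous_study = [
    {"zip": "13820", "birth_year": 1990, "sex": "F", "diagnosis": "asthma"},
    {"zip": "13820", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "12207", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public dataset (for example, a directory) with names.
public_roster = [
    {"name": "Person A", "zip": "13820", "birth_year": 1990, "sex": "F"},
    {"name": "Person B", "zip": "13820", "birth_year": 1985, "sex": "M"},
    {"name": "Person C", "zip": "13820", "birth_year": 1985, "sex": "M"},
    {"name": "Person D", "zip": "12207", "birth_year": 1990, "sex": "F"},
]

LINK_FIELDS = ("zip", "birth_year", "sex")

def link_key(record):
    """The combination of fields shared by both datasets."""
    return tuple(record[field] for field in LINK_FIELDS)

def reidentify(study, roster):
    """Attach a name to every study record that matches exactly one person."""
    names_by_key = defaultdict(list)
    for person in roster:
        names_by_key[link_key(person)].append(person["name"])

    matches = {}
    for record in study:
        candidates = names_by_key[link_key(record)]
        if len(candidates) == 1:  # a unique match gives the person away
            matches[candidates[0]] = record["diagnosis"]
    return matches

reidentified = reidentify(anonymous_study, public_roster)
print(reidentified)
```

Only the record that matches two people stays anonymous; the more public datasets are available, the fewer such ambiguous matches remain.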


Page 36:

“Just as Ford changed the way we make cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community.” danah boyd (http://tinyurl.com/bdb7wgy)

“And finally, how will the harvesting of Big Data change the meaning of learning and what new possibilities and limitations may come from these systems of knowing?”

Ira "Gus" Hunt, chief technology officer for the Central Intelligence Agency, said, "The value of any piece of information is only known when you can connect it with something else that arrives at a future point in time."


Page 37:

Any questions?

Thank you for listening.