BigData primer


Description

Morten Egan gives a short introduction to Big Data and what it is all about. What are MapReduce, HDFS, Hive, Pig and HCatalog? Also, a short introduction to Hortonworks. This presentation was made for the Danish Oracle user group.

Transcript of BigData primer

  • 1. What BIG data Isn't

  • 2. != GB/TB/PB

  • 3. !=

  • 4. What everyone is saying

  • 5. "Big Data is the amount of data that one single machine cannot store and process" -- OTN, 2014
    "I have travelled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won't last out the year." -- Editor, Prentice Hall, 1957
    "Information is the oil of the 21st century, and analytics is the combustion engine." -- Peter Søndergaard, Gartner Group
    "Data is the new science. Big Data holds the answers." -- Pat Gelsinger, EMC
    "Big Data is not the new oil." -- Jer Thorp, Harvard Business Review
    "Not everything that can be counted counts, and not everything that counts can be counted." -- William Bruce Cameron
    "You can have data without information, but you cannot have information without data." -- Daniel Keys Moran

  • 6. "As for 'Big Data', I think that is also a concept. In living memory, keeping detailed sales by style, color, and size was too much to hold for most retail chains, and at least two that tried screwed themselves into bankruptcy. By now we have mostly advanced to vendor-managed inventory of not only the inventory and sales, but the in-store shelf locations and dollar turn per cubic centimeter of shelf space, to leverage the vendors against each other in negotiating for shelf space. That was an epochal change in how the world of retail works, which as a side effect helps non-brick-and-mortar establishments negotiate with vendors as well. Being able to keep, even transiently, more orders of magnitude of data and analyze it in a way that even *might* give a competitive advantage is the concept of 'Big Data' that makes it something different. I completely dislike the name, but I think the concept is extremely useful. I don't think it has a single thing to do with the physical infrastructure that processes the data. A big part of the concept is that it includes data collection from non-transactional systems and behaviors, where the Internet of Things is included in the search space." -- Mark Farnham, OakTable
    "My takeaway from OpenWorld was that you buy $1m in gear, harvest 27 billion tweets... do the Hadoop equivalent of: select count(*) from shitloads_of_tweets where text like '%you suck%' and work from there? Am I missing something?" -- Connor McDonald, OakTable

  • 7. "Others can call Big Data whatever shit they want these days, but the only viable Big Data stack that is somewhat guaranteed to survive (but likely to evolve a lot) is Hadoop. IMHO. And there are TWO things it lets you do: 1. Due to the commodity software and hardware phenomena of the last years, you can now build scalable data processing systems affordable to pretty much ANY organization. At a few-TB scale you just use a Linux file system and MySQL or Postgres if needed, and maybe flash storage. Beyond that, it's Hadoop. 2. Since running a scalable Hadoop cluster is so cheap, efficiency of processing becomes secondary and value moves towards its flexibility: how quickly you can try things, grow the system, and integrate new kinds of data into it. Agility is king; time to market is critical. What most forget is that in its current state, Hadoop requires a shitload of really good engineering talent. This is why it's only justifiable at a certain scale, because savings on h/w and s/w will trump the cost of additional engineering, getting into an order of magnitude difference or two. I'll take my coat..." -- Alex Gorbachev
    "Software and hardware must be affordable at scale, or you can go home. Oracle, EMC, Teradata, IBM, NetApp can all just forget about it." -- Jeffrey Needham, one of the hadoops

  • 8. "Certainly 1000-node (or 5000-node, if you like) clusters are fully automated... The data science pipelines are not, nor is the surrounding ecosystem engineering, but the Hadoop cluster needs a shopping cart to operate. Nobody 'operates' or admins clusters at this scale; that would be pure insanity. XXXX operates 8 4000-node clusters with 10 people. These people mostly surf YouTube on their NOC screens, as there isn't much for them to do either. My job was in production engineering: making sure all the grids worked across all colos (and for $100, no less). However, search engineering (or 'data science production engineering' is probably what the new group will be called) has their back. Everyone on the OakTable should figure out how to either build or be in a data science production group. Don't bother learning how to operate HDFS and YARN (and the 8 zillion plugins). Hadoop 2.0 (be it HWX or CDH) will be the next OS/database kernel you need to learn. And it's OK if you don't believe me..."

  • 9. So what is BIG data?

  • 10. DATA processing that scales. DATA processing with fault tolerance. DATA accessible from everywhere.

  • 11. "When there is an elephant in the room, introduce him." -- Randy Pausch, The Last Lecture (https://www.youtube.com/watch?v=ji5_MqicxSo&t=0m45s)

  • 12. The Hadoop Distributed File System is not a complex, feature-rich, kitchen-sink file system, but it does two things very well: it is economical and functional at enormous scale. Affordable. At. Scale. Maybe that's all it should be. A big data reservoir should make it possible for traditional database products to directly access HDFS, and still provide a canal for enterprises to channel their old data sources into the new reservoir. Big data reservoirs must allow old and new data to coexist and intermingle. For example, DB2 currently supports table spaces on traditional OS file systems, but when it supports HDFS directly, it could provide customers with a built-in channel from the past to the future. HDFS contains a feature called federation that, over time, could be used to create a reservoir of reservoirs, which will make it possible to create planetary file systems that can act locally but think globally.
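
Slide 12's "two things done very well" are visible from client code, where HDFS looks like any other file system behind a single Java interface. Here is a minimal sketch (not from the deck; the /tmp/hello.txt path and its contents are illustrative assumptions) that writes a file into HDFS and reads it back through Hadoop's FileSystem API, with the NameNode address taken from whatever core-site.xml is on the classpath:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS (the NameNode address) is read from core-site.xml
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // write a small file; the path is an illustrative assumption
            Path path = new Path("/tmp/hello.txt");
            FSDataOutputStream out = fs.create(path, true); // true = overwrite
            out.writeBytes("hello, reservoir\n");
            out.close();

            // read it back through the same interface
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
            fs.close();
        }
    }

The same client code runs against a laptop or a 5000-node cluster; the scale stays behind the FileSystem abstraction, which is the "affordable at scale" argument of the slide.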
  • 13.

    import java.io.*;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount extends Configured implements Tool {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        static enum Counters { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        private boolean caseSensitive = true;
        private Set<String> patternsToSkip = new HashSet<String>();

        private long numRecords = 0;
        private String inputFile;

        public void configure(JobConf job) {
          caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
          inputFile = job.get("map.input.file");

          if (job.getBoolean("wordcount.skip.patterns", false)) {
            Path[] patternsFiles = new Path[0];
            try {
              patternsFiles = DistributedCache.getLocalCacheFiles(job);
            } catch (IOException ioe) {
              System.err.println("Caught exception while getting cached files: "
                  + StringUtils.stringifyException(ioe));
            }
            for (Path patternsFile : patternsFiles) {
              parseSkipFile(patternsFile);
            }
          }
        }

        private void parseSkipFile(Path patternsFile) {
          try {
            BufferedReader fis = new BufferedReader(new FileReader(patternsFile.toString()));
            String pattern = null;
            while ((pattern = fis.readLine()) != null) {
              patternsToSkip.add(pattern);
            }
          } catch (IOException ioe) {
            System.err.println("Caught exception while parsing the cached file '"
                + patternsFile + "' : " + StringUtils.stringifyException(ioe));
          }
        }

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = (caseSensitive) ? value.toString() : value.toString().toLowerCase();

          // blank out any skip patterns before tokenizing
          for (String pattern : patternsToSkip) {
            line = line.replaceAll(pattern, "");
          }

          // emit (word, 1) for every token on the line
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
            reporter.incrCounter(Counters.INPUT_WORDS, 1);
          }

          if ((++numRecords % 100) == 0) {
            reporter.setStatus("Finished processing " + numRecords + " records "
                + "from the input file: " + inputFile);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          // sum the 1s emitted for each word
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        // the slide cuts off at the line above; the rest of the driver is
        // reconstructed from the standard Hadoop WordCount v2.0 tutorial
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
      }
    }

    = select word, count(word) from words_table group by word;

    Most people didn't think that was smart.
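
Slide 13's punch line is the one-line query under the code: once Hive sits on the cluster, the ninety lines of Java collapse into a single statement. As a sketch of the contrast (not from the deck; the connection URL, the empty credentials and the words_table name are illustrative assumptions), the same query can be issued from Java through the HiveServer2 JDBC driver:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveWordCount {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; URL and credentials are illustrative assumptions
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = con.createStatement();

            // the one-liner from the slide; Hive plans and runs it as MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "select word, count(word) from words_table group by word");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }

            rs.close();
            stmt.close();
            con.close();
        }
    }

Hive turns the query into the same kind of MapReduce work as the hand-written class, which is the whole point of the comparison.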
  • 14. Welcome Hadoop platforms

  • 15. Time for a demo!

  • 16. So what can we do with Oracle and Hadoop? Data Loader for Oracle. Oracle Direct Connector.

  • 17. Correlation != Causation

  • 18. DSB vs. P3
    Top artists blamed for delays: Rihanna 3.46%, Medina 1.78%, Lady Gaga 1.26%, others < 1%.
    Danish artists blamed for delays: Medina 9.74%, Fallulah 4.31%, Panamah 2.83%, Pharfar 1.34%, Unknown Artist 1.11%, others < 1%.

  • 19. DSB vs. Pollen
    [bar chart: per cent of delays by pollen type: alder (El), hazel (Hassel), elm (Elm), birch (Birk), mugwort (Bynke), grass (Græs), others (Andre)]

  • 20. DSB vs. Nasdaq
    [chart: monthly series, January through December, values between -5 and 20]

  • 21. Size does not matter. It is all about the data.

  • 22. ?