Jongwook Woo
HiPIC
CSULA
Recent IT Development and Women: Big Data and The Power of Women in Goryeo
KWiSE Annual Meeting, Chapman University, CA
Oct 20th 2012
Jongwook Woo (PhD)
High-Performance Internet Computing Center (HiPIC)
Educational Partner with Cloudera and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
Contents
Part I. Big Data
– Fundamentals of Big Data
– Data-Intensive Computing: Hadoop
– Big Data Supporters and Use Cases
Part II. The Power of Women in Goryeo Dynasty
– North East Asia before the Mongol Empire
– Korea and Mongol
– The Empress Gi
Part I
Big Data
– Fundamentals of Big Data
– NoSQL DB: HBase, MongoDB
– Data-Intensive Computing: Hadoop
– Big Data Supporters and Use Cases
Experience in Big Data
Grants
Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Received Academic Education Partnership with Cloudera since June 2012
Certificate
Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I”, July 8 2012
Cloud Computing Blog
http://dal-cloudcomputing.blogspot.com/
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
[Figure: the Big Data era on cloud computing; labels: Cloudera, Hortonworks, AWS, NoSQL DB]
Big Data
Too much data
– Terabytes (10^12 bytes), petabytes (10^15 bytes)
– Because of the web
– Sensor data, bioinformatics, social computing, smartphones, online games…
Cannot be handled with the legacy approach
– Too big
– Un-/semi-structured data
Two Issues in Big Data
How to store Big Data
– NoSQL DB
How to compute Big Data
– Parallel computing with multiple cheap computers
– No need for supercomputers
Contents
Fundamentals of Big Data
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases
Data nowadays
• Data issues
o Data grows to 10 TB, and then 100 TB.
o Unstructured data comes from sources like Facebook, Twitter, RFID readers, sensors, and so on.
• Need to derive information from both the relational data and the unstructured data as soon as possible.
• Solution to efficiently compute Big Data
o Hadoop Map/Reduce
Solutions in Big Data Computation
Map/Reduce by Google
– (Key, Value) parallel computing
Apache Hadoop
– Big Data ⇒ Data Computation (MapReduce, Pig)
Integrating MapReduce and RDB
– Oracle + Hadoop
– Sybase IQ
– Vertica + Hadoop
– Hadoop DB
– Greenplum
– Aster Data
Integrating MapReduce and NoSQL DB
– MongoDB MapReduce
– HBase
Apache Hadoop
Motivated by Google Map/Reduce and GFS
Open source project of the Apache Foundation
– Framework written in Java
– Originally developed by Doug Cutting
• who named it after his son's toy elephant
Two core components
– Storage: HDFS
• High-bandwidth clustered storage
– Processing: Map/Reduce
• Fault-tolerant distributed processing
Hadoop scales linearly with
– data size
– analysis complexity
Hadoop issues
Map/Reduce is not a DB
– An algorithm in restricted parallel computing
HDFS and HBase
– Cannot compete with the functions of an RDBMS
But useful for semi-structured data models and high-level dataflow query languages on top of MapReduce
– Pig, Hive, Jaql, Cascading, Cloudbase
Useful for huge (peta- or terabytes) but uncomplicated data
– Web crawling
– Log analysis
• Log files for web companies
– New York Times case
MapReduce Pros & Cons Summary
Good when
– Huge data for input, intermediate, output
– Little synchronization required
– Read once; batch-oriented datasets (ETL)
Bad for
– Fast response time
– Large amount of shared data
– Fine-grained synchronization needed
– CPU-intensive, not data-intensive
– Continuous input stream
MapReduce in Detail
Functions borrowed from functional programming languages (e.g. Lisp)
Provides a restricted parallel programming model on Hadoop
– User implements Map() and Reduce()
– Libraries (Hadoop) take care of EVERYTHING else
• Parallelization
• Fault tolerance
• Data distribution
• Load balancing
Map
Convert input data to (key, value) pairs
map() functions run in parallel, creating different intermediate (key, value) pairs from different input data sets
Reduce
reduce() combines those intermediate values into one or more final values for that same key
reduce() functions also run in parallel, each working on a different output key
Bottleneck: the reduce phase can’t start until the map phase is completely finished.
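The two phases and the barrier between them can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, a thread pool stands in for the cluster nodes, and a word count is used as the sample job.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(chunk):
    """Map: turn one input chunk into intermediate (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]


def reduce_fn(key, values):
    """Reduce: combine all intermediate values of one key into a final value."""
    return key, sum(values)


def run_job(chunks, workers=4):
    """Hypothetical driver: parallel map, barrier, grouping, parallel reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per input chunk, run in parallel.
        # list() blocks until every map task is done; this is the barrier
        # that keeps the reduce phase from starting early.
        mapped = list(pool.map(map_fn, chunks))

        # Group the intermediate values by key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: one task per key, again run in parallel.
        reduced = list(pool.map(lambda item: reduce_fn(*item), groups.items()))
    return dict(reduced)


result = run_job(["big data", "big hadoop", "data data"])
print(result)
```

The threads here only mirror the structure of a job; on a real cluster the map and reduce tasks run on different machines, and the framework handles distribution and fault tolerance.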
Example: Sort URLs in the largest hit order
Compute the largest-hit URLs
– Stored in log files
Map()
– Input: <logFilename, file text>
– Output: Parses file and emits <url, hit counts> pairs
• e.g. <http://hello.com, 1>
Reduce()
– Input: <url, list of hit counts> from multiple map nodes
– Output: Sums all values for the same key and emits <url, TotalCount>
• e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
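A minimal single-machine sketch of this example in Python follows. The log format is an assumption (one hit per line, with the URL as the first whitespace-separated field), and the names `map_log`, `reduce_hits`, and `top_urls` are hypothetical; a real Hadoop job would distribute the same two functions across nodes.

```python
from collections import defaultdict


def map_log(filename, text):
    """Map: parse one log file and emit a (url, 1) pair for every hit."""
    for line in text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[0], 1  # assumed format: URL is the first field


def reduce_hits(url, counts):
    """Reduce: sum all hit counts that share the same URL key."""
    return url, sum(counts)


def top_urls(logs):
    """Run map, group by key, reduce, then sort by total hits (largest first)."""
    grouped = defaultdict(list)
    for filename, text in logs.items():
        for url, count in map_log(filename, text):
            grouped[url].append(count)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


# Made-up sample input: two small log files.
logs = {
    "a.log": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "b.log": "http://hello.com\nhttp://halo.com\n",
}
ranking = top_urls(logs)
print(ranking)
```

The final sort is done outside reduce() here for brevity; in a real job it would typically be a second pass over the reduced totals.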
Map/Reduce for URL visits
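[Figure: Map/Reduce data flow for URL visits]
Input Log Data is split across Map1(), Map2(), …, Mapm()
– One mapper emits (http://hi.com, 1), (http://hello.com, 3), …
– Another mapper emits (http://halo.com, 1), (http://hello.com, 5), …
Data Aggregation/Combine groups the values by key
– (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>)
Reduce1(), Reduce2(), …, Reducel() emit the totals
– (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)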
Legacy Example
In late 2007, the New York Times wanted to make its entire archive of articles available over the web
– 11 million in all, dating back to 1851
– a four-terabyte pile of images in TIFF format
Needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files
– not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.
Legacy Example (Cont’d)
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs were all stored neatly in S3 and ready to be served up to visitors to the Times site.
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
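The $240 figure follows directly from the three numbers on the slide. A small check in Python, working in whole cents to avoid floating-point rounding (the variable names are only illustrative):

```python
# Figures taken from the slide; amounts are kept in cents.
RATE_CENTS_PER_COMPUTER_HOUR = 10
COMPUTERS = 100
HOURS = 24

total_cents = RATE_CENTS_PER_COMPUTER_HOUR * COMPUTERS * HOURS
total_dollars = total_cents // 100
print(f"${total_dollars}")
```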
Contents
Fundamentals of Big Data
Data-Intensive Computing: Hadoop
Big Data Supporters and Use Cases
Supporters of Big Data
Apache Hadoop Supporters
Cloudera
– Like Linux and Red Hat
– HiPIC is an Academic Partner
Hortonworks
– Pig
– Consulting and training
Facebook
– Hive
IBM
– Jaql
NoSQL DB supporters
MongoDB
– HiPIC tries to collaborate
HBase, CouchDB, Apache Cassandra (originally by FB), etc.
Similarities in Pig, Hive, and Jaql
• Translate high-level languages into MapReduce jobs
o the programmer can work at a higher level than writing MapReduce jobs in Java or other lower-level languages
• Programs are much smaller than the equivalent Java code.
• Option to extend these languages,
o often by writing user-defined functions in Java.
• Interoperability
o programs written in these high-level languages can be embedded inside other languages as well.
Pig
• developed at Yahoo Research around 2006
o moved into the Apache Software Foundation in 2007.
• PigLatin
o Pig's language
o a data flow language
o well suited to processing unstructured data
• Makes MapReduce code easy to write
Hive
• developed at Facebook
o turns Hadoop into a data warehouse
o complete with a dialect of SQL for querying.
• HiveQL
o a declarative language (SQL dialect)
• Difference from PigLatin
o you do not specify the data flow, but instead describe the result you want
o Hive figures out how to build a data flow to achieve it.
o a schema is required
Jaql
• developed at IBM.
• a data flow language
o its native data structure format is JSON (JavaScript Object Notation).
Use Cases
Amazon AWS
Craigslist
HuffPost | AOL
Amazon AWS
amazon.com
Consumer and seller business
aws.amazon.com
IT infrastructure business
– Focus on your business, not IT management
Pay as you go
– Pay for servers by the hour
– Pay for storage per gigabyte per month
– Pay for data transfer per gigabyte
Services with many APIs
– S3: Simple Storage Service
– EC2: Elastic Compute Cloud
• Provides many virtual Linux servers
• Can run on multiple nodes: Hadoop and HBase, MongoDB
Amazon AWS (Cont’d)
Customers on aws.amazon.com
Samsung
– Smart TV hub sites: TV applications are on AWS
Netflix
– ~25% of US internet traffic
– ~100% on AWS
NASA JPL
– Analyze more than 200,000 images
NASDAQ
– Using AWS S3
Facebook [7]
Using Apache HBase
– For Titan and Puma
HBase for FB
– Provides excellent write performance and good reads
– Nice features
• Scalable
• Fault tolerance
• MapReduce
Titan: Facebook
Message services in FB
– Hundreds of millions of active users
– 15+ billion messages a month
– 50K instant messages a second
Challenges
– High write throughput
• Every message, instant message, SMS, email
– Massive clusters
• Must be easily scalable
Solution
– Clustered HBase
Puma: Facebook
ETL
– Extract, Transform, Load
• Integrating data from many data sources into a data warehouse
– Data analytics
• Domain owners’ web analytics for ads and apps: clicks, likes, shares, comments, etc.
ETL before Puma
– 8 – 24 hours
• Procedures: Scribe, HDFS, Hive, MySQL
ETL after Puma
– Puma
• Real-time MapReduce framework
– 2 – 30 secs
• Procedures: Scribe, HDFS, Puma, HBase
Twitter [8]
Three challenges
Collecting data
– Scribe, as at FB
Large-scale storage and analysis
– Cassandra: ColumnFamily key-value store
– Hadoop
Rapid learning over Big Data
– Pig
• 5% of the Java code
• 5% of the dev time
• Within 20% of the running time
Craigslist in MongoDB [9]
Craigslist
– ~700 cities, worldwide
– ~1 billion hits/day
– ~1.5 million posts/day
– Servers
• ~500 servers
• ~100 MySQL servers
Migrate to MongoDB
– Scalable, Fast, Proven, Friendly
HuffPost | AOL [10]
Two machine learning use cases
Comment moderation
– Evaluate all new HuffPost user comments every day
• Identify abusive / aggressive comments
• Auto delete / publish ~25% of comments every day
Article classification
– Tag articles for advertising
• E.g.: scary, salacious, …
Built a flexible ML platform running on Hadoop
– Pig for the Hadoop implementation
Conclusion
Era of Big Data
Need to store and compute Big Data
Storage: NoSQL DB
Computation: Hadoop MapReduce
Need to analyze Big Data in mobile computing, SNS for Ad, User Behavior, Patterns, Bioinformatics, Medical data …
Part II
The Power of Women in Goryeo Dynasty
– North East Asia before the Mongol Empire
– Korea and Mongol
– The Empress Gi
Three kingdoms (AD 907 - 1125)
Before Mongol
Three kingdoms balanced power
– Goryeo, Yo (Liao, Cathay, Khitan, 契丹), Song
– Goryeo-Yo: 3 wars
• First invasion (AD 993): Seo Hui (서희)
• Second invasion with 400K (AD 1010): Gang Jo (강조)
• Third invasion with 100K (AD 1018): Gang Gam-chan (강감찬)
– Goryeo became famous after this victory
Three kingdoms (AD 1115- 1234)
Before Mongol
Three kingdoms balanced power (AD 1115 - 1234)
– Goryeo, Gum (Jin, Jurchen, Yojin, 金朝), Southern Song
– Yun Gwan (윤관) invaded the Jurchen Wanyan (完顏) clan (AD 1111), followed by many battles
– Jin defeated the Liao dynasty in AD 1121
– Jin wanted to keep peace with Goryeo
• From the emperor as big brother to the king as little brother
Part II. The power of Women in Goryeo Dynasty
Korea and Mongol
– Wars since AD 1231 (Go-Jong year 18, 고종 18)
Goryeo (Korea) dynasty
– Military dictatorship of the Choe family ended in AD 1258 (Go-Jong year 45, 고종 45)
Mongol
– Had been conquering China (the Southern Song dynasty) since AD 1257
– Möngke Khan
• Right battalion – Kublai
• Left battalion
Korea and Mongol (Cont’d)
Mongol Empire in 1227 at Genghis Khan’s death [http://en.wikipedia.org/wiki/Timeline_of_the_Mongol_Empire]
Korea and Mongol (Cont’d)
Mongol Empire after Genghis Khan’s death (1227) under Möngke Khan [http://en.wikipedia.org/wiki/Timeline_of_the_Mongol_Empire]
Map labels:
– 1236: Beginning of the invasion of Europe, by Hulagu
– 1231: Beginning of the invasion of Korea
– 1236: Beginning of the invasion of South Asia, by Möngke Khan and Kublai
– Ariq Böke controlled Mongol at Karakorum
Korea and Mongol (Cont’d)
World in AD 1257 – 1260
1257: The Mongols were attacking Vietnam
1258: The Mongols occupied Baghdad
1259: The Mongols were invading Syria
– The death of Möngke Khan
1260: The succession war began
– Between Möngke’s brothers: Kublai Khan and Ariq Böke.
– Kublai and the youngest brother Hulagu returned to KaraKorum, the capital of the Mongol empire
• Kara: north, Korum: Khori (Space, 골, 고을)
Korea and Mongol (Cont’d)
Goryeo and Mongol again in 1259
Decided to have a peace treaty with Mongol
– Actually to surrender
April 21 1259 (Go-Jong year 46, 고종 46): The Crown Prince left to meet the Khan
May 17 1259: The Crown Prince met the Mongol army at Yoyang (Liaoyang), which was about to invade Goryeo
– Stopped the Mongol army
June 30 1259: King Go-Jong passed away
July 30 1259: The Khan passed away
– The Mongol army stopped the prince to hide the Khan’s death
The prince met Kublai at Gaebong, close to the Yellow River
– Dec 1259: Kublai was returning to KaraKorum
Korea and Mongol (Cont’d)
Mongol Empire after Möngke Khan's death (1259) [http://en.wikipedia.org/wiki/Timeline_of_the_Mongol_Empire]
Hulagu
Goryeo’s Crown Prince
Kublai
Ariq Böke controlled Mongol at Karakorum
Goryeo and Mongol in 1260-1264
The great meeting and the great Khan
Kublai welcomed the prince with great favor
– Kublai was so happy and said
• “God is helping me. The Goryeo kingdom, which was never defeated even by the Chinese emperor Dang Tae-Jong (Tang Taizong), has surrendered to me”
• He knew that Goryeo originated from GoGuRyeo
Kublai appointed the prince king of Goryeo (Won-Jong)
– as Go-Jong had passed away
They came together to Beijing in Jan 1260.
April 1260: Won-Jong’s enthronement ceremony in Goryeo
August 21 1264: Ariq Böke surrendered to Kublai at Xanadu (KaraKorum)
The great meeting and the marriage
Sept 1264: King Won-Jong went to Beijing and met the Khan
– Another great welcome from the Khan
1269: Kublai decided that his daughter would marry the crown prince of Goryeo
1269, Aug 1270: Won-Jong and the crown prince asked Kublai for the marriage
1271, 1272: The prince went to Beijing and returned
– Volunteered to lead the invasion of Japan
April 1273: Defeated the Sambyolcho at Jeju island
May 1274: The crown prince of Goryeo and the princess of the Mongol empire (Holdorogerimisil, Princess Jeguk, 제국공주) married at the palace in the capital of the Mongol empire
Aug 1274: The prince became the king (King Chungnyeol, 충렬왕)
Korea and Mongol (Cont’d)
Mongol Empire in 1300 -1405: this map is not correct as Goryeo was an independent kingdom [http://en.wikipedia.org/wiki/Timeline_of_the_Mongol_Empire]
Korea and Mongol (Cont’d)
Mongol Empire in [http://en.wikipedia.org/wiki/Kublai_Khan]
The Mongol Empire and the Kingdom of Goryeo tied with marriages
The political position
The position of the king was ranked 7th in the Mongol empire
– It is the power of the princess
• A daughter of Kublai
• Should know that Kublai Khan had 12 sons.
Goryeo received many benefits from the empire
– “Only Goryeo in the world kept the king and kingdom”
– When the king went to the palace of the empire, all Mongol officials wanted to give presents.
– The king asked the Khan to suppress Mongol generals in Goryeo
The position of the king was then ranked 4th in the empire
– The next great Khan, Temür: the princess is his aunt
– The Khan asked that the king be ranked 4th in the empire
The Empress Gi ( 기황후 , 奇皇后 )
Born to Gi Ja-o (奇子敖) in Haengju (幸州), Goryeo
Became a concubine of Toghun Temür Khan
– Became the first empress in 1365
Her son Ayurshiridar was designated Crown Prince in 1353.
– Supported by the Korean eunuch Bak Bulhwa (朴不花)
– Became a Khan, called Biligtü Khan, in 1370.
The Empress Gi ( 기황후 , 奇皇后 )
Good for Goryeo
– She prohibited the practice of sending Korean women to the Mongol empire for marriage and slavery
– She eliminated any discussion of making the Goryeo kingdom one of the provinces of the Mongol empire
The Empress Gi ( 기황후 , 奇皇后 )
An elder brother named Gi Cheol (奇轍, Bayan Bukha)
– Came to threaten the position of the king of Goryeo
– King Gongmin exterminated the Gi family in 1356
The Empress Gi ( 기황후 , 奇皇后 )
Ming China occupied the capital of the empire, Dadu (大都, Beijing), in 1368
The empress was disappointed that Goryeo did not send any reinforcements
Fled north to Shangdu (上都, Xanadu)
Conclusion II
A woman has the power to influence her husband: King and Khan (Emperor)
– Can raise their social positions
A woman can make her son a Khan
Women possess political power to positively affect the motherland
We need to know history and educate kids
Questions?
References Part I
1) Introduction to MongoDB, Nosh Petigara, Jan 11, 2011
2) Hadoop Fundamental I, Big Data University
3) “Large Scale Data Analysis with Map/Reduce”, Marin Dimitrov, Feb 2010
4) “BFS & MapReduce”, Edward J Yoon http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html, Feb 26 2009
5) “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”,Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, The Third International Conference on Emerging Databases (EDB 2011), Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011
References
6) “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
7) Building Realtime Big Data Services at Facebook with Hadoop and HBase, Jonathan Gray, Facebook, Nov 11, 2011, Hadoop World NYC
8) Analyzing Big Data at Twitter, Kevin Weil, Web 2.0 Expo, NYC, Sep 2010
9) Lessons Learned from Migrating 2+ Billion Documents at Craigslist, Jeremy Zawodny, 2011
10) Machine Learning on Hadoop at Huffington Post | AOL, Thu Kyaw and Sang Chul Song, Hadoop DC, Oct 4, 2011
References
11) “MapReduce Debates and Schema-Free”, Woohyun Kim, www.coordguru.com, http://blog.naver.com/wisereign, March 3 2010
12) “Large Scale Data Analysis with Map/Reduce”, Marin Dimitrov, Feb 2010
13) “HBase Schema Design Case Studies”, Qingyan Liu, July 13 2009
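References Part II
1) 고려에 시집온 징기스칸의 딸들 [The Daughters of Genghis Khan Who Married into Goryeo], 이한수 (Lee Han-su), Nov 8 2006, 김영사 (Gimm-Young Publishers)
2) 쿠빌라이 칸의 일본원정과 충렬왕 [Kublai Khan's Expedition to Japan and King Chungnyeol], 이승한 (Lee Seung-han), 2009, 푸른역사 (Pureun Yeoksa)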