Big Data Analytics Carlos Ordonez. Big Data Analytics research Input? BIG DATA (large data sets,...

Big Data Analytics

Carlos Ordonez

Big Data Analytics research

• Input? BIG DATA (large data sets, large files, many documents, many tables, fast growing)

• How? Fast external algorithms; memory-efficient data structures at two storage levels; parallel: multi-threaded or multi-node

• Efficiency goal: linear time O(n) and linear speedup

• Hardware? single node or parallel cluster

• Infrastructure? parallel file system; any large files

• Challenging: Theory+programming in action
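The bullets above can be illustrated with a minimal sketch of a one-pass, block-at-a-time algorithm. It is not taken from the slides: it assumes the data set is a list of d-dimensional points and computes the summary (n, L, Q), where L is the sum of the points and Q is the sum of their outer products (the "vector outer products" summarization named later in the deck). The helper names `read_blocks`, `summarize_block`, `merge` and `summarize` are hypothetical, and Python is used only for brevity even though the slides name C++ as the main language. For fixed d the scan is linear in n, and because partial summaries merge by addition, each block could in principle be handled by a separate thread or node.

```python
from typing import Iterator, List, Sequence, Tuple

# Summary of a data set: (n, L, Q) = (count, sum of x, sum of x * x^T).
Summary = Tuple[int, List[float], List[List[float]]]


def read_blocks(rows: Sequence[Sequence[float]],
                block_size: int) -> Iterator[Sequence[Sequence[float]]]:
    """Hypothetical stand-in for reading a large file one block at a time."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def summarize_block(block: Sequence[Sequence[float]], d: int) -> Summary:
    """One pass over a block: O(len(block) * d^2) time, O(d^2) memory."""
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in block:
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]  # outer product x * x^T, accumulated
    return n, L, Q


def merge(s1: Summary, s2: Summary) -> Summary:
    """Partial summaries combine by element-wise addition."""
    n1, L1, Q1 = s1
    n2, L2, Q2 = s2
    L = [u + v for u, v in zip(L1, L2)]
    Q = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(Q1, Q2)]
    return n1 + n2, L, Q


def summarize(rows: Sequence[Sequence[float]], d: int,
              block_size: int) -> Summary:
    """Scan the data once, block by block, keeping only the small summary."""
    total: Summary = (0, [0.0] * d, [[0.0] * d for _ in range(d)])
    for block in read_blocks(rows, block_size):
        total = merge(total, summarize_block(block, d))
    return total
```

Only the d-by-d summary stays in memory, so the block size can be tuned to the available RAM without changing the result.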

Systems research today

• Transaction processing? Main memory, lock-free

• Efficient analysis? Optimal joins, compiled queries, streams, exploit ample RAM, exploit multi-core

• Compiler versus interpreter?

• Massive storage? Posix, HDFS

• Fast external algorithms? Simple tasks.

• Parallel computation? Multi-core with threads, Shared-nothing, message-passing

• Exploiting new hardware? Difficult/customized

• Analyzing: queries, cubes, statistics. Machine learning

• Hot today: Information integration (database+files)

DB Systems involves Core CS research: Theory + Programming

• Theory we use:

– Time complexity (big O()) and I/O cost (disk, solid state memory)

– Data structures (trees, hash tables, linked lists)

– Relational model and information retrieval models

– Multivariate statistics, machine learning, discrete mathematics, linear algebra

– Compilers and programming languages: parsing/compiling/optimizing code; recursion

• Programming:

– Languages: mostly C++, but also R, SQL, Java

– Unix, but we have a lot of past work on MS Windows

– Systems: Threads, binary I/O, parallel file systems, code generation, code optimization, interpreter runtime

Sample of target problems

Business Intelligence: cubes, lattices

Big Data summarization: vector outer products

Bayesian statistics: MCMC, classification, regression, variable/feature selection

Graph transitive closure and linear recursion
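As an illustration of the last item, here is a minimal sketch of transitive closure computed with linear recursion, that is, the rule R(x, z) :- R(x, y), E(y, z), where each step joins the recursive relation with the base edge table E exactly once. It uses semi-naive evaluation (only the pairs found in the previous iteration are joined again), which is a standard technique and not necessarily the method used by the author. The function name `transitive_closure` and the edge-set representation are assumptions made for this example, and Python is used only for brevity.

```python
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Return every pair (u, w) such that w is reachable from u."""
    # Index the base relation E by source vertex to speed up the join.
    out: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        out.setdefault(u, set()).add(v)

    closure: Set[Edge] = set(edges)  # R, initialized with the base case
    delta: Set[Edge] = set(edges)    # pairs discovered in the last iteration

    while delta:
        new: Set[Edge] = set()
        # Linear recursion: join only the new pairs with E, once per step.
        for u, v in delta:
            for w in out.get(v, ()):
                if (u, w) not in closure:
                    new.add((u, w))
        closure |= new
        delta = new  # stops when an iteration finds nothing new
    return closure
```

On a chain such as a → b → c → d, the loop needs one iteration per extra hop; cycles are handled because pairs already in the closure are never added again.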

Why join the DBMS group?

• Just came back from AT&T Labs (formerly the famous AT&T Bell Labs). My head is spinning with C++14 and Unix commands. Currently programming with my PhD students.

• Balance between theory (mathematics) and programming (C++)

• Mature and stable CS research area

• Job prospects upon graduation are excellent. Great opportunity to join industrial labs.

• Visit my web page and DBLP, Google “Ordonez SQL”, or stop by during my office hours.