bigdata csail prospectus 1people.csail.mit.edu/ebruce/CSAIL_BIGDATA/bigdata_csail... · 2012. 5....
DRAFT -- not for distribution
bigdata@csail
Mission
The goal of bigdata@csail is to identify and develop the technologies needed to solve next-generation data challenges, which will require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide. We want to enable people to truly leverage Big Data by developing platforms that are reusable, scalable, and easy to deploy across multiple application domains.
Our approach includes two key aspects. First, we will collaborate closely with industry to provide real-world applications and drive impact. Second, we view the Big Data problem as fundamentally multi-disciplinary. Our team includes faculty and researchers across many related technology areas, including algorithms, architecture, data management, machine learning, privacy and security, user interfaces, and visualization, as well as domain experts in finance, medicine, smart infrastructure, education, and science.
The Big Data Problem
We define big data as data that is too big, too fast, or too hard for existing tools to process. Here, “too big” means that organizations increasingly have to deal with petabyte-scale collections of data, which come from click streams, transaction records, sensors, and many other places. “Too fast” means that the data is not only big but must also be processed quickly, for example, to perform fraud detection at a point of sale or determine what ad to show to a user on a web page. “Too hard” is a catchall for data that doesn’t fit neatly into an existing processing tool, i.e., data that needs more complex analysis than existing tools can readily provide.
Examples of the big data problem abound.
Web Analytics
On the Internet, many websites now register millions of unique visitors per day. Each of these visitors may access and create a range of content. This can easily amount to tens to hundreds of gigabytes per day (up to tens of terabytes per year) of accumulating user and log data, even for medium-sized websites. Increasingly, companies want to be able to mine this data to understand limitations of their site, improve response time, offer more targeted ads, and so on. Doing this requires tools that can perform complicated analytics on data that far exceeds the memory of a single machine or even a cluster of machines.
Finance
As another example, consider the big data problem as it applies to banks and other financial organizations. These organizations have vast quantities of data about consumer spending habits, credit card transactions, financial markets, and so on. This data is massive: for example, Visa processes more than 35 billion transactions per year; if they record 1 KB of data per transaction, this represents roughly 35 terabytes of data per year. Visa, and large banks that issue Visa cards, would like to use this data in a number of ways: to predict customers at risk of default, to detect fraud, to offer promotions, and so on. This requires complex analytics. Additionally, this processing needs to be done quickly and efficiently, and needs to be easy to tune as new models are developed and refined.
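As a back-of-envelope check of this kind of storage estimate, the sketch below multiplies a yearly transaction count by a per-transaction record size. It assumes decimal units (1 KB = 1,000 bytes); the function name `yearly_volume_bytes` is illustrative and not taken from any library.

```python
# Decimal storage units (assumption: 1 KB = 1,000 bytes).
KB = 10**3
TB = 10**12
PB = 10**15


def yearly_volume_bytes(transactions_per_year: int, bytes_per_transaction: int) -> int:
    """Return the raw data volume generated in one year, in bytes."""
    return transactions_per_year * bytes_per_transaction


transactions = 35 * 10**9  # 35 billion transactions per year

# 1 KB per transaction.
small_record = yearly_volume_bytes(transactions, 1 * KB)
print(small_record / TB, "TB per year")   # 35.0 TB per year

# 100 KB per transaction (a hypothetical, richer record).
large_record = yearly_volume_bytes(transactions, 100 * KB)
print(large_record / PB, "PB per year")   # 3.5 PB per year
```

Under these assumptions, the record size is the main lever: 1 KB per transaction yields about 35 TB per year, and a record of roughly 100 KB per transaction would be needed to reach 3.5 PB.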
Medical
As a third example, consider the impact of new sensors on our ability to continuously monitor a patient's health. Recent advances in wireless networking, miniaturization of sensors via MEMS processes, and incredible advances in digital imaging technology have made it possible to cheaply deploy wearable sensors that monitor a number of biological signals on patients, even outside of the doctor's office. These signals measure functioning of the heart, brain, circulatory system, etc. Additionally, accelerometers and touch screens can be used to assess mobility and cognitive function. This creates an unprecedented opportunity for doctors to provide outpatient care, by understanding how patients are progressing outside of the doctor's office, and when they need to be seen urgently. Additionally, by correlating signals from thousands of different patients, it becomes possible to develop a new understanding of what is normal or abnormal, or what kinds of signal features are indicative of potential serious problems.
Similar challenges arise across different industry sectors including healthcare, finance, government, transportation, biotech, drug discovery, insurance, retail as well as across many scientific fields including astronomy, genomics, oceanography, physics and biology.
Our approach
We believe the solution to big data is fundamentally multi-disciplinary. Our approach is to bring together world leaders in parallel architecture, massive-scale data processing, algorithms, machine learning, visualization, and interfaces to collectively identify and address the fundamental technology challenges we face with Big Data.
Our approach focuses on four broad research themes, summarized in the figure to the right:
• Computational Platforms
• Scalable Algorithms
• Machine Learning and Understanding
• Privacy and Security
Below we briefly summarize these areas of research at MIT CSAIL, using italics to reference specific research projects that are described in more detail at the end of this document.
Computational Platforms: We are building parallel data processing platforms, including SciDB, BlinkDB, and several cloud-based deployment platforms, including FOS and Relational Cloud. The goal of these platforms is to make it easy for developers of big data applications to write programs much as they would on a single-node computational environment, and to be able to rapidly deploy those applications on tens or hundreds of nodes. Additionally, as the computation and storage requirements of applications change, these platforms should be able to dynamically and elastically adapt to those changes.
Scalable Algorithms: We are developing a range of algorithms designed to deal with very large volumes of data, and to process that data in parallel. These include parallel implementations of a range of known algorithms, including matrix computations, as well as statistical operations like regression, optimization methods like gradient descent, and machine learning algorithms like clustering and classification.
In addition, we are developing fundamental new types of algorithms designed to handle the challenges of Big Data. For example, we are working on sublinear algorithms that can compute a range of statistics, such as estimates of the number of distinct items in a set, using space that is exponentially smaller than the input. Additionally, we are developing new algorithms for encoding, comparing, and searching massive data sets; specific examples include hash-based similarity search on massive-scale data, algorithms for compressed sensing that provide a new way to encode sparse data that arise in a number of scientific applications, and algorithms for computing the Fourier Transform that are faster than FFT for sparse data.
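To make the idea of estimating distinct counts in small space concrete, here is a minimal sketch of one well-known technique, the k-minimum-values (KMV) sketch. It is an illustrative stand-in, not necessarily the algorithm used in the research described above: it keeps only the k smallest hash values seen, so memory stays fixed at O(k) regardless of input size, whereas the sublinear algorithms referred to above can use far less space. The class name `KMVSketch` and the choice of SHA-1 as the hash are assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Estimate the number of distinct items using the k smallest hash values."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []         # max-heap (negated values) of the k smallest hashes
        self._members = set()   # hash values currently held in the heap

    @staticmethod
    def _hash(item) -> int:
        # Map the item to a pseudo-random 64-bit integer.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item) -> None:
        h = self._hash(item)
        if h in self._members:
            return  # duplicate of a value we already track
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        # If n hashes are spread uniformly over [0, 1), the k-th smallest
        # sits near k / n, which gives the estimate (k - 1) / kth_smallest.
        kth_smallest = -self._heap[0] / 2**64
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(100_000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")  # repeated items do not change the estimate
print(round(sketch.estimate()))
```

With k = 1024 the relative error is typically a few percent; increasing k trades memory for accuracy.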
Machine Learning and Understanding: On top of these algorithms, we are deploying a number of novel machine learning applications focused on machine understanding in specific domains. For example, in work on scene understanding in images we are building tools that automatically label parts of an image, or that classify an image as belonging to a certain category or categories based on the types of images that appear in them. As a second example, we are using natural language processing to convert massive quantities of text tweets and text reviews on the web into structured information about products, restaurants, and services that indicate the type of content in some text (e.g., a food review, a rating), an assessment of the sentiment of the text, etc.
Privacy and Security: Finally, because much of the mining and analysis involved in a big data context involves sensitive, private information, we are working on technologies and policies for protecting and anonymizing data, and for allowing people to retain control over their data. As an example, in the CryptDB project, we are building a database system that stores data in an encrypted format in the cloud, in such a way that a curious database or system administrator cannot decrypt the data. Users retain the encryption keys over their data, but have the ability to execute queries over that encrypted data on the database server, enabling much better performance than simply sending the data back and decrypting it on the client’s machine.
Work in these four areas is coupled with application experts in Finance (Professor Andrew Lo), Medical (Professor John Guttag), Smart Infrastructure (Balakrishnan and Madden), Education (through a relationship with the MITx initiative), and Science (Stonebraker).
Membership Model
The goal of the bigdata@csail membership model is to promote in-depth interactions between industry and academia. Member companies will have the opportunity to be exposed to multiple research projects that span the work of about 20 MIT faculty and researchers, including their postdocs and students. The model has two components: bigdata@csail membership and optional additional engagements.
Membership
bigdata@csail will involve a selected group of member companies (approximately 10-15). There is an annual membership fee of $150K per company, to be provided by each member in the form of an unrestricted gift, with the expectation of an initial three-year commitment.
The membership fees will be used to support the operation of the initiative and provide seed funding for new ideas and projects. Our faculty will continue to raise research support from NSF, DARPA, and other organizations to significantly amplify this industrial funding, leveraging the investment from all our member companies.
Membership provides the company with the following benefits:
1. Each member company can appoint one representative to the advisory board of bigdata@csail, which will advise and provide feedback to the directors on research directions and priorities.
2. Diversified seed funding of about 3-5 early-stage projects.
3. Early exposure to a larger set of sponsored projects in the area of big data.
4. In-depth interactions and shared learning on topics of particular interest to each member company; these topics are chosen in consultation with the company representative on the advisory board.
5. Interactions with the graduate students for recruitment and internships.
6. Annual meetings in which the students and faculty present relevant research and results, and the companies provide feedback and discuss the work in relation to industry.
7. Discussions on key topics of interest to members.
8. Ad-hoc interactions with members on an as-needed basis; bigdata@csail directors will facilitate connections between companies and researchers.
9. Notifications of events, latest news, and publications.
Optional additional engagements
Members may engage in company-specific activities through separate agreements. For example, if a member company wishes to have CSAIL host one of its employees, this may be arranged via an Industry Visitor Agreement. Further, if a member company becomes highly interested in a particular research project and wants to sponsor future development of that project, this may also be arranged via a CSAIL Sponsored Research Agreement providing additional project-specific funding.
Directors
The Director of bigdata@csail is Professor Samuel Madden, CSAIL Principal Investigator.
Intellectual Property
The overall goal of bigdata@csail is to conduct basic research that will have a significant impact over a long time scale. Given the nature of our intended research, MIT anticipates that most of the research results and technology will be placed into the public domain via publication and open-source licensing. However, in certain cases, MIT may decide to obtain intellectual property protection for certain research results and license use of that technology under those intellectual property rights, as the most effective way to transfer technology we develop to industry for economic benefit to society.
MIT Principal Investigators
Madden, Stonebraker, Lo, Barzilay, Fisher, Jaakkola, Karger, Miller, Oliva, Torralba, Rubinfeld, Guttag, Amarasinghe, Indyk, Pentland, Devadas, Glass, Balakrishnan, Zeldovich, Freeman
Example Big Data Projects
The following are examples of sponsored projects conducted by MIT Principal Investigators to illustrate the breadth and depth of the work being conducted at CSAIL.
UNDERSTANDING
[Finance] Detecting Defaults - Andrew Lo et al: The goal of this research is to develop analytical models that can predict when a consumer is at risk of defaulting on a loan, based on their recent financial transactions. In a test on 1.5 TB of data from a major financial institution, the developed models predicted defaults much more accurately than traditional measures like FICO scores.
[Energy] Hydrocarbon Exploration - Indyk, Jaakkola, Poggio, Freeman, et al. In this project, the goal is to identify boundaries between different types of underground rocks using seismic sensors. Such boundaries are of interest in hydrocarbon exploration as they are places where oil is often present. These sensors produce massive streams of data that need to be mined to understand the location of boundaries. Researchers are working on these mining algorithms, as well as on advanced compression and encoding techniques to compactly summarize these data streams.
[Smart Transportation] CarTel – Balakrishnan and Madden – The goal of CarTel (“car telecommunications”) is to investigate how sensor-equipped cars and smartphones can be used to capture information about the transportation network and urban environment in general. Example results include an interactive map of the biggest potholes in Cambridge and Boston, collected using car-mounted accelerometers, and traffic-aware routing, where real-time traffic delays from cars are used to find the fastest driving routes.
[Social] Influence Modeling - Alex Pentland et al. – The goal of this project is to learn how people inside of large organizations influence each other, and to track the flow of influence throughout an organization. Relationships can be modeled as graphs, with edges indicating the degree of influence. Weights are learned from a variety of data sources, including personal communication and data gathered from sensors about face-to-face interaction. In large organizations, there can be billions of pieces of information that need to be incorporated into this influence graph, and the calculations to track influence throughout the graph are not readily expressed in existing query processing or database systems.
[Social] TwitInfo - Karger, Miller, Madden, et al. TwitInfo extracts a series of tweets that match a keyword from Twitter and arranges them on a timeline, providing a quick summary of a collection of tweets on a topic in a simple visualization. The key idea is to identify “peaks” in the frequency of tweets that represent interesting occurrences in time (e.g., points scored in a sporting event, or a major speech by a politician), and then assign labels to peaks using information retrieval techniques. A related system, called TweeQL, is used to implement TwitInfo; TweeQL provides a SQL-like streaming language for running queries over the Twitter stream in real time.
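The peak-identification step can be illustrated with a simple sketch. The text above does not specify the detector TwitInfo uses, so the following is an assumed, generic approach: a time bin is flagged as a peak when its tweet count exceeds the mean of the preceding bins by more than a chosen number of standard deviations. The function name `find_peaks` and the parameters `window` and `threshold` are hypothetical.

```python
import statistics


def find_peaks(counts, window=5, threshold=3.0):
    """Return indices of bins whose count spikes above the recent baseline.

    counts    -- tweets per time bin, in chronological order
    window    -- number of preceding bins used as the baseline
    threshold -- how many standard deviations above the baseline mean
                 a bin must be to count as a peak
    """
    peaks = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mean = statistics.mean(history)
        # Fall back to 1.0 when the history is perfectly flat,
        # so that we never divide by zero.
        spread = statistics.pstdev(history) or 1.0
        if (counts[i] - mean) / spread > threshold:
            peaks.append(i)
    return peaks


# Tweets per minute for a keyword; minute 5 contains a burst.
tweets_per_minute = [10, 12, 9, 11, 10, 60, 12, 10, 11, 9]
print(find_peaks(tweets_per_minute))  # [5]
```

A labeling step would then look at the tweets inside each flagged bin, for example by ranking their most distinctive terms.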
[Social] Condensr - Barzilay et al. Condensr is a review summarization system that processes Yelp restaurant reviews and categorizes them, breaking down reviews into comments about food, ambience, service and value, as well as giving an overall summary of reviewer sentiment. The goal is to go beyond a simple star rating to give the overall consensus of diners about various aspects of a restaurant experience.
[Images] Large Scale Vision - A. Torralba and A. Oliva. The goal of this project is to study computer and human vision when large amounts of visual data become available. We are developing the Scene UNderstanding (SUN) database, a large database of images found on the web, organized by scene type, that are being fully segmented and annotated. With this large database we are developing computer vision algorithms for scene understanding that make use of a large training set combined with non-parametric (memory-based) methods. In parallel, we are also studying how humans memorize large amounts of visual information. As a result, we try to understand which representations might be useful for developing new efficient computer vision algorithms, and also how we can use computer vision models of human memory to predict which images will be remembered.
ALGORITHMS
Machine learning - Jaakkola. Modern use of data relies heavily on predictive modeling. Machine learning methods are needed to distill large, heterogeneous, and fragmented data sources into useful pieces of information such as answers to search queries, purchasing patterns of customers, or likely actions of mobile users. This research focuses on predicting the behavior of mobile users (the actions they are likely to take in any particular context) based on a collection of intermittent sensors such as GPS, wifi, accelerometer, and others. Our goal is to develop methods that will be useful more broadly. Our work addresses the following key problems: 1) scaling to realistic problem sizes, 2) robustness, and 3) maintaining privacy even as data are used collaboratively.
Faster Fourier Transform - Indyk, Katabi et al. Sparse Fast Fourier Transform (sFFT) is a new class of highly efficient algorithms for computing the frequency spectrum of a signal. The algorithms work for signals whose spectrum is sparse, i.e., signals that consist of a small number of dominating frequencies. Such signals often occur in areas such as image/audio/video compression, signal processing and data communication. For such signals, the algorithms are significantly faster than the state-of-the-art algorithms based on the Fast Fourier Transform (FFT). The goal of this project is to develop more efficient variants and implementations of sFFT, and apply them to concrete massive data problems.
Tunable Fast Similarity Search for High-Dimensional Data – Indyk et al. Locality-Sensitive Hashing (LSH) is an efficient algorithm for finding pairs of similar (or highly correlated) objects in a database without enumerating all pairs of such objects. Example applications include searching for near-duplicate documents, similar images, highly correlated stocks, etc. Although the algorithm is very fast, one can envision further improvements in its efficiency by adapting it to specific data sets. The goal of this project is to develop tools and techniques for performing such tuning.
COMPUTATIONAL PLATFORMS
SciDB - Stonebraker and Madden. The vast majority of machine learning, statistical, and scientific operations can be expressed via a small number of linear algebra operations. SciDB is a database system designed to support scalable linear algebra over massive arrays stored on the disks of a large cluster of machines. It is much faster than relational databases on these types of workloads, and scales to much larger datasets than main-memory, matrix-oriented systems like Matlab and R.
BlinkDB - Madden et al. BlinkDB is a database system that runs on top of Hadoop (MapReduce), running SQL queries and translating them into MapReduce jobs. The key idea is that rather than running queries over the entire data set, it runs queries on a random (precomputed) sample of the data, and uses sampling theory to estimate the true query answer.
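The sampling idea can be sketched in a few lines. This is not BlinkDB's implementation; it is a minimal illustration of estimating a SUM aggregate from a uniform random sample, with an approximate 95% error margin from the normal approximation. The function name `approximate_sum` is hypothetical, and the sketch ignores the finite-population correction and any stratification a real system might apply.

```python
import math
import random
import statistics


def approximate_sum(population_size, sample, z=1.96):
    """Estimate SUM(column) over the full table from a uniform random sample.

    Returns (estimate, margin), where the true sum is expected to lie within
    estimate +/- margin about 95% of the time (for z = 1.96), assuming the
    sample is uniform and large enough for the normal approximation.
    """
    n = len(sample)
    sample_mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    estimate = population_size * sample_mean
    margin = population_size * z * std_error
    return estimate, margin


# Synthetic "table" of 200,000 purchase amounts (illustrative data only).
rng = random.Random(42)
amounts = [rng.uniform(0, 100) for _ in range(200_000)]

# Precompute a 1% sample, then answer the query from the sample alone.
sample = rng.sample(amounts, 2_000)
estimate, margin = approximate_sum(len(amounts), sample)

print(f"estimated sum: {estimate:,.0f} +/- {margin:,.0f}")
print(f"exact sum:     {sum(amounts):,.0f}")
```

The query touches 1% of the rows, and the error margin shrinks in proportion to the square root of the sample size, which is the trade-off between speed and accuracy that sampling-based query processing exploits.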
Execution Migration Machine - Devadas et al. The Execution Migration Machine (EM²) is a novel data-centric multicore memory system architecture based on computation migration. Unlike traditional distributed memory multicores, which rely on complex cache coherence protocols to move the data to the core where the computation is taking place, our scheme always moves the computation to the core where the data resides. By doing away with the cache coherence protocol, we can boost the effectiveness of per-core caches while drastically reducing hardware complexity. Experimental results on a range of SPLASH-2 and PARSEC benchmarks indicate that EM² can significantly improve per-core cache performance in comparison to directory-based cache-coherent architectures, decreasing overall miss rates by as much as 84% and reducing average memory latency by up to 58%.
Crowd Computing - Miller et al. The goal of this work is to build and study systems that orchestrate small contributions from a crowd of people. Examples include Soylent, an add-in to Microsoft Word that uses crowd contributions to perform interactive document shortening, proofreading, and human-language macros, and TurKit, a Java/JavaScript API for running iterative tasks on Mechanical Turk.
PRIVACY & SECURITY
CryptDB – Balakrishnan, Kaashoek, Madden, Zeldovich, et al. CryptDB is a system for processing queries over an encrypted database. The key idea is that, in a cloud-based setting, a database may be stored on machines that aren’t completely trusted, and so keeping it encrypted may be necessary. In such a setting, processing queries naively would require transmitting the entire encrypted database back for local processing. Instead, in CryptDB, special types of encryption are used which protect the data while allowing queries to be processed on it; in this way, users can encrypt their queries, send them to the database, and receive encrypted answers back while transferring far less data than the naïve solution requires.
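One way to see how a server can answer a query without seeing plaintext is equality matching over deterministic tokens. The sketch below is a simplified stand-in and not CryptDB's actual scheme: it uses a keyed HMAC as the deterministic transformation (which, unlike real encryption, cannot be decrypted), and the names `equality_token` and `UntrustedServer` are hypothetical. The client holds the key; the server only ever stores and compares tokens.

```python
import hashlib
import hmac


def equality_token(key: bytes, value: str) -> str:
    """Client side: map a value to a deterministic keyed token.

    Equal values always produce equal tokens under the same key, so the
    server can test equality without learning the underlying value.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class UntrustedServer:
    """Server side: stores and compares tokens, never plaintext or keys."""

    def __init__(self):
        self.rows = []  # list of (row_id, name_token)

    def insert(self, row_id: int, name_token: str) -> None:
        self.rows.append((row_id, name_token))

    def select_where_equal(self, name_token: str) -> list:
        # Equivalent in spirit to: SELECT id FROM t WHERE name = ?
        return [row_id for row_id, stored in self.rows
                if hmac.compare_digest(stored, name_token)]


client_key = b"secret key kept on the client"  # illustrative key only
server = UntrustedServer()

for row_id, name in [(1, "alice"), (2, "bob"), (3, "alice")]:
    server.insert(row_id, equality_token(client_key, name))

# The client transforms the query constant and sends only the token.
matches = server.select_where_equal(equality_token(client_key, "alice"))
print(matches)  # [1, 3]
```

Only the matching row identifiers cross the network, rather than the whole table. The trade-off is that deterministic tokens reveal which rows hold equal values, which is why a real system has to choose carefully which operations each column supports.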