Project Deimos
UNIVERSITY OF WATERLOO
Faculty of Mathematics
MARS-LOGGER: DESIGN AND IMPLEMENTATION OF A DISTRIBUTED LOGGING SYSTEM
Bloomberg L.P.
New York, NY
Prepared by
Shun Da Suo
2A Computer Science/Business Administration
ID 20509411
April 24, 2015
Memorandum of Submittal
To: Mr. Andrei Cojocaru
From: Shun Da Suo
Date: April 24, 2015
Re: Work Report: “MARS-Logger: Design and Implementation of a Distributed Logging System”
As we agreed, I have prepared the enclosed report, “MARS-Logger: Design and Implementation of a Distributed Logging System,” for my 2A work report and for the MARS Pricing Team. This report, the second of four work reports that the Co-operative Education Program requires that I successfully complete as part of my Computer Science and Business Double Degree Co-op degree requirements, has not received academic credit.
The MARS (Multi-Asset Risk Systems) Pricing Team is an infrastructure-level research and development team that focuses on request routing and price calculation for multi-asset portfolios. As a software development intern, I aided in the design and implementation of a distributed logging system. The goal was to replace the existing framework, which had long been a bottleneck to innovation on the MARS platform. In this report, I discuss the high-level architecture of the system, analyze significant implementation details, and evaluate its overall effectiveness.
The Faculty of Mathematics requests that you evaluate this report for command of topic and technical content/analysis. Following your assessment, the report, together with your evaluation, will be submitted to the Math Undergrad Office for evaluation on campus by qualified work report markers. The combined marks determine whether the report will receive credit and whether it will be considered for an award.
Thank you for your assistance in preparing this report. Finally, I want to thank Ms. Selina Chen for distracting me at all times. Without her, I couldn’t have not finished earlier.
Shun Da Suo
Table of Contents
List of Figures
Executive Summary
1.0 Introduction
1.1 Importance of Logging in Financial Software
1.2 Service Oriented Architecture and Logging as a Service
1.3 Implementation of Logging Service and Associated Issues
2.0 Analysis
2.1 Short Term Issues and Long Term Goals
2.2 Logging Architecture with Distributed Systems as Solution
2.3 Implications of Leveraging Open Source Projects
2.4 Evaluation of Alternative Technologies
3.0 Conclusions
4.0 Recommendations
References
List of Figures
Figure 1: Comparison between monolithic applications and composite applications
Figure 2: User interface of Mars Logger Viewer (MTT MLOG), an internal log-viewing tool
Figure 3: Application Dashboard (APDX) showing logging service statistics
Figure 4: Relational Database Dashboard (RDBD) showing logging database statistics
Figure 5: High level architecture of a logging system
Figure 6: Detailed implementation of the distributed logging system
Executive Summary
Logging systems are crucial components of any complex software system. As a
fast-rising feature within Bloomberg, the MARS platform is undergoing rapid
growth in terms of hits, code base, and team size. In this changing environment,
the current logging system fails to keep up and is acting as a bottleneck to
innovation. It falls short in all three major areas of concern: latency,
capacity, and extensibility.
A new implementation of the system based on open source distributed
technologies is presented as a potential solution. Specifically, a pipeline
consisting of Apache Kafka, Apache Storm, and Apache HBase is prototyped.
The new system leverages the core features of the technologies and works almost
off the shelf, requiring little customization.
The usage of open source technologies captures additional intellect and resources
from the broader community with little drawback. It also allows the developers to
focus on performing specific customization and optimization, instead of wasting
resources on building a framework from the ground up.
As a result, the open source distributed implementation is overall a better solution
than the existing logging system and should be integrated into the infrastructure
immediately.
1.0 Introduction
1.1 Importance of Logging in Financial Software
For Bloomberg L.P., the leading financial software, data and media company in
the United States, the integrity and security of its information are of utmost
importance. To ensure the correctness of data produced by a software module in a
complex system, two measures must be in place: a thorough testing framework
and a robust logging system. This report focuses on the latter of the two.
A log, by definition, is a physical record of past activity, used for monitoring
and record-keeping purposes. In the software industry, logs are usually
time-indexed events containing state information such as a task identifier, an
event identifier, the event time, the event data, and a user identifier. They
are crucial for monitoring performance and diagnosing errors in running
applications.
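To make the shape of such a log entry concrete, the following is a minimal sketch in Python. The class name, field names, and sample values are assumptions chosen for illustration; the report does not specify the actual schema used by the MARS logging service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class LogEvent:
    """One time-indexed log entry (hypothetical field names)."""
    task_id: str
    event_id: str
    user_id: str
    event_time: datetime
    event_data: Dict[str, Any] = field(default_factory=dict)


# A made-up entry recorded when a request reaches a pricing service.
example = LogEvent(
    task_id="task-001",
    event_id="evt-0001",
    user_id="user-42",
    event_time=datetime(2015, 4, 24, 9, 30, tzinfo=timezone.utc),
    event_data={"kind": "request", "service": "pricing"},
)
```

Keeping the event time and user identifier as first-class fields matters later in the report, since those are the two attributes developers search by.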
Although a logging system does not directly enhance the function or performance
of a software solution, it brings business value by making the infrastructure more
understandable and maintainable. With sufficient data to model the past and
current state of the infrastructure, it can reveal opportunities for further capacity
planning and performance optimization.
1.2 Service Oriented Architecture and Logging as a Service
A Service Oriented Architecture (SOA) is “a design pattern in which application
components provide services to other components via a communications protocol,
typically over a network” (“Service Oriented Architecture”). Under this design
philosophy, complex software systems can be decomposed into independent,
loosely coupled modules. These service modules then interact with one another by
passing request and response messages.
Figure 1: Comparison between monolithic applications and composite
applications.
At Bloomberg, it is especially important for software components to be loosely
coupled, since there are complex dependencies between systems, modules and
libraries. Each individual service module only exposes the interface that describes
the types of requests it accepts and the types of the responses it produces. With
minimal knowledge about other services, the modules are insulated from
implementation changes of their dependencies. A comparison between traditional
monolithic applications and composite applications built with SOA is shown in
Figure 1.
Not surprisingly, as applications grow in size, their logging demands grow in
volume and complexity as well. The naïve approach of using a local logging
library is no longer effective. By extracting the logging logic into a standalone
service, it is possible to iterate through implementation improvements without
breaking compatibility with existing services. The SOA design of the current
logging system is what enables the prototyping and testing of the new
implementations explored in this report.
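The benefit of hiding the logging logic behind a service interface can be sketched as follows. This is an illustrative Python example only: `LogStore`, `InMemoryStore`, `BatchedStore`, and `handle_request` are hypothetical names, and the two backends merely stand in for an existing and a replacement implementation.

```python
from typing import Dict, List, Protocol


class LogStore(Protocol):
    """The only contract callers depend on (hypothetical interface)."""

    def write(self, entry: Dict[str, str]) -> None: ...

    def count(self) -> int: ...


class InMemoryStore:
    """Stand-in for the current implementation: one write per entry."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def write(self, entry: Dict[str, str]) -> None:
        self._entries.append(entry)

    def count(self) -> int:
        return len(self._entries)


class BatchedStore:
    """Stand-in for a new implementation: buffers entries, flushes in batches."""

    def __init__(self, batch_size: int = 2) -> None:
        self._batch_size = batch_size
        self._pending: List[Dict[str, str]] = []
        self._flushed: List[Dict[str, str]] = []

    def write(self, entry: Dict[str, str]) -> None:
        self._pending.append(entry)
        if len(self._pending) >= self._batch_size:
            self._flushed.extend(self._pending)
            self._pending.clear()

    def count(self) -> int:
        return len(self._flushed) + len(self._pending)


def handle_request(store: LogStore, user_id: str) -> None:
    """A calling service logs one request and one response. It never
    needs to know which implementation sits behind the interface."""
    store.write({"user_id": user_id, "kind": "request"})
    store.write({"user_id": user_id, "kind": "response"})


old_backend, new_backend = InMemoryStore(), BatchedStore()
for backend in (old_backend, new_backend):
    handle_request(backend, "user-42")
```

The caller, `handle_request`, is identical for both backends, which is the property that allows a new implementation to be prototyped without touching existing services.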
1.3 Implementation of Logging Service and Associated Issues
Currently, the Multi-Asset Risk Systems (MARS) team maintains its own logging
framework. The existing implementation uses a proprietary relational database
system as the underlying data store. This report explores various shortcomings of
the existing solution, such as high write latency, low scalability, and poor
extensibility. It then examines several open source projects and proposes an
alternative implementation.
2.0 Analysis
2.1 Short Term Issues and Long Term Goals
In the MARS team, the most immediate use case of the logging system is
debugging. Every time a service module receives a request or emits a response, a
log entry is dispatched to the logging system. With the internal log-viewing tool,
the developers can then search for a set of log events by the log time and the
unique user identifier. Once the desired group of log events is found, the tool
displays the requests and responses in a logical order to highlight the path taken
by the requests within the infrastructure. A screenshot of the user interface of the
log-viewing tool is shown in Figure 2.
Figure 2: User interface of Mars Logger Viewer (MTT MLOG), an internal log-viewing tool
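The debugging lookup described above (filter by user identifier and time window, then order the events to trace the request path) can be sketched in a few lines of Python. The function name, the dictionary layout, and the service names are assumptions for illustration and do not reflect the internal tool's actual implementation.

```python
from datetime import datetime
from typing import Dict, List


def find_events(events: List[Dict], user_id: str,
                start: datetime, end: datetime) -> List[Dict]:
    """Return one user's events inside [start, end], oldest first."""
    matches = [e for e in events
               if e["user_id"] == user_id and start <= e["time"] <= end]
    return sorted(matches, key=lambda e: e["time"])


# Hypothetical entries, deliberately stored out of order.
events = [
    {"user_id": "u42", "time": datetime(2015, 4, 24, 9, 30, 2),
     "service": "pricing", "kind": "response"},
    {"user_id": "u42", "time": datetime(2015, 4, 24, 9, 30, 0),
     "service": "router", "kind": "request"},
    {"user_id": "u7", "time": datetime(2015, 4, 24, 9, 30, 1),
     "service": "router", "kind": "request"},
    {"user_id": "u42", "time": datetime(2015, 4, 24, 9, 30, 1),
     "service": "pricing", "kind": "request"},
]

# The ordered (service, kind) pairs show the path the request took.
path = [(e["service"], e["kind"]) for e in find_events(
    events, "u42",
    datetime(2015, 4, 24, 9, 0), datetime(2015, 4, 24, 10, 0))]
```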
This simple use case, however, is becoming increasingly difficult to satisfy
with the current logging system. As usage of the MARS platform grows rapidly,
the logging system cannot keep up. The metric-tracking panels in the Bloomberg
environment confirm this. As shown in Figure 3, the Application Dashboard (APDX)
panel shows a maximum latency of 21 seconds for the logging service. This long
delay reveals a severe bottleneck in the write speed of log entries. The
Relational Database Dashboard (RDBD) panel in Figure 4 supports this assertion:
the write hits are colored red across the shards of the table, indicating that
the database is operating near full capacity.
Figure 3: Application Dashboard (APDX) showing logging service statistics
Figure 4: Relational Database Dashboard (RDBD) showing logging database
statistics
In addition to the bottleneck in writing speed, the logging system also faces a
severe capacity problem. Given the current resource allocation and lack of
efficient compression, the logging system is only capable of preserving five full
days of log entries. This limitation severely reduces the developers’ ability to
track down more complex and elusive bugs. This inefficiency translates to loss of
valuable developer productivity.
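The relationship between storage, daily log volume, compression, and retention can be illustrated with simple arithmetic. The figures below are purely hypothetical, chosen only so that the uncompressed case reproduces a five-day window; the report does not state the actual storage allocation or log volume.

```python
def retention_days(storage_gb: float, raw_gb_per_day: float,
                   compression_ratio: float = 1.0) -> float:
    """Days of logs that fit: storage / (daily volume after compression)."""
    stored_per_day = raw_gb_per_day / compression_ratio
    return storage_gb / stored_per_day


# Hypothetical figures: 5000 GB of storage, 1000 GB of raw logs per day.
uncompressed = retention_days(storage_gb=5000, raw_gb_per_day=1000)
compressed = retention_days(storage_gb=5000, raw_gb_per_day=1000,
                            compression_ratio=4.0)
```

Under these assumed numbers, a 4:1 compression ratio would stretch the same storage from five days to twenty, which is why the lack of efficient compression is listed alongside resource allocation as a cause of the capacity problem.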
In the long term, the current logging system also lacks extensibility. It is
difficult to add analytical functionality without re-implementing the
entire system. This restriction prevents the valuable information contained in the
log entries from being mined and studied effectively.
2.2 Logging Architecture with Distributed Systems as Solution
As the issues in the previous section show, an ideal solution should be scalable,
highly available, and extensible. These requirements are not exclusive to
Bloomberg: the need to handle high-volume, high-capacity data has become
increasingly common in recent years. Many new technologies have been developed
with distributed storage and computation in mind, using a high degree of
parallelism to reduce latency and increase scalability.
This report focuses on three open source technologies, all hosted by the Apache
Software Foundation: Kafka, Storm, and HBase. The official website describes
Kafka as “publish-subscribe messaging rethought as a distributed commit log”
(“Kafka”). Storm is “a distributed real-time computation system” (“Storm”), and
HBase is “a distributed, scalable, big data store” (“HBase”).
At a high level, the responsibilities of the logging system can be
divided into three areas: data ingestion, computation and transformation, and
data storage. This separation of concerns is illustrated in Figure 5. The
existing logging system falls short in all three. Firstly, the lack of buffering
causes congestion and adds latency during data ingestion. Secondly, the lack
of a compression and processing stage leaves no platform for future extension.
Lastly, the data store does not have enough capacity to retain sufficient log data.
Figure 5: High level architecture of a logging system
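The three-part division can be sketched as three small functions chained together. This is a simplified, single-process illustration in Python under stated assumptions: the JSON line format, the key layout, and the function names are hypothetical, and `zlib` is used only as a convenient stand-in for whatever compression scheme a real system would adopt.

```python
import json
import zlib
from typing import Dict, Iterable, Iterator, Tuple


def ingest(raw_lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Data ingestion: parse incoming JSON log lines into events."""
    for line in raw_lines:
        yield json.loads(line)


def transform(events: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, bytes]]:
    """Computation and transformation: derive a key, compress the payload."""
    for event in events:
        key = f"{event['user_id']}:{event['time']}"
        payload = json.dumps(event, sort_keys=True).encode("utf-8")
        yield key, zlib.compress(payload)


def store(records: Iterable[Tuple[str, bytes]], table: Dict[str, bytes]) -> None:
    """Data storage: write each keyed record into the table."""
    for key, payload in records:
        table[key] = payload


raw = [
    '{"user_id": "u42", "time": "2015-04-24T09:30:00", "msg": "request"}',
    '{"user_id": "u7", "time": "2015-04-24T09:30:01", "msg": "response"}',
]
table: Dict[str, bytes] = {}
store(transform(ingest(raw)), table)

# Reading a record back reverses the transformation stage.
restored = json.loads(zlib.decompress(table["u42:2015-04-24T09:30:00"]))
```

Because each stage only consumes the output of the previous one, any stage can be replaced or scaled independently, which is the point of the separation.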
The three open source technologies introduced above were chosen to target the
weakness in each of the three areas. Kafka, the
distributed messaging framework, can absorb bursts of high throughput from
upstream publishers, dampen the load, and release a steady stream to the
subscribers. As a result, the downstream subscribers are insulated from
temporary spikes in throughput and enjoy a more stable environment. As Kafka
ingests data, it persists the data to disk, so messages are not lost even
if the downstream clients are absent.
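The dampening effect of a buffer between publishers and subscribers can be shown with a small simulation. This is not Kafka client code: an in-memory `deque` stands in for Kafka's durable log, and the arrival pattern and drain rate are made-up numbers for illustration.

```python
from collections import deque
from typing import List, Tuple


def simulate(arrivals: List[int], drain_rate: int) -> Tuple[List[int], int]:
    """Feed bursty arrivals into a buffer and drain it at a fixed rate.

    Returns the number of messages delivered downstream per tick and the
    backlog left in the buffer at the end.
    """
    buffer: deque = deque()
    delivered: List[int] = []
    for tick, count in enumerate(arrivals):
        # Publishers push a burst (possibly empty) into the buffer.
        buffer.extend((tick, i) for i in range(count))
        # The subscriber takes at most drain_rate messages per tick.
        batch = [buffer.popleft() for _ in range(min(drain_rate, len(buffer)))]
        delivered.append(len(batch))
    return delivered, len(buffer)


# Two bursts of 10 and 8 messages, separated by idle ticks.
delivered, backlog = simulate([10, 0, 0, 8, 0, 0], drain_rate=3)
```

In this run the subscriber sees a constant three messages per tick even though the publishers alternate between bursts and silence, and the buffer is empty at the end.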
Storm, the stream processing system, provides a platform for arbitrary
transformation of data streams with very low overhead. Parallel instances of
ingestion nodes (“spouts”) and computation nodes (“bolts”) can be defined to
perform near real-time computation and processing. The platform is highly
extensible and is compatible with a wide range of programming languages. This
makes it possible to build additional aggregation nodes and custom
compression schemes to further enhance and tune the logging system.
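The spout and bolt roles can be illustrated with a plain Python sketch. It does not use the Storm API: `log_spout`, `CountBolt`, and `run_topology` are hypothetical names, the input line format is made up, and the parallel bolt instances are simulated sequentially within one process. Routing by a hash of the service name is an assumption meant to mimic grouping tuples by a field so that the same key always reaches the same bolt instance.

```python
import zlib
from collections import Counter
from typing import Dict, Iterator, List


def log_spout(lines: List[str]) -> Iterator[Dict[str, str]]:
    """Ingestion node: emits one tuple per raw line of the form 'service status'."""
    for line in lines:
        service, status = line.split()
        yield {"service": service, "status": status}


class CountBolt:
    """Computation node: counts events per service."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, tup: Dict[str, str]) -> None:
        self.counts[tup["service"]] += 1


def run_topology(lines: List[str], parallelism: int = 2) -> List[CountBolt]:
    """Route each tuple to one of several bolt instances by service name."""
    bolts = [CountBolt() for _ in range(parallelism)]
    for tup in log_spout(lines):
        # crc32 gives a deterministic hash, unlike Python's built-in hash().
        index = zlib.crc32(tup["service"].encode("utf-8")) % parallelism
        bolts[index].process(tup)
    return bolts


bolts = run_topology(["pricing ok", "routing ok", "pricing error"])
total = sum((bolt.counts for bolt in bolts), Counter())
```

An aggregation node of the kind mentioned above would be one more bolt class added downstream, without changing the spout.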
HBase, the distributed key/value store, takes an entirely different approach to
data storage from the existing proprietary relational database. It is
considered a “NoSQL” store, meaning that it does not conform to the usual relational
database querying paradigm. It trades away sophisticated relational modeling for
faster access and linear scalability. This trade-off is not significant here,
since the queries performed on the logging system are few and follow a fixed
pattern. Little additional effort is needed to design the HBase table in a way
that supports all the existing queries.
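One way the existing lookup (by user identifier and time) could map onto a sorted key/value store is through the row key. The sketch below is an assumption, not the report's actual table design: it simulates a table whose rows are kept sorted by key, as HBase does, using plain Python rather than an HBase client. `SortedTable` and `row_key` are hypothetical names.

```python
import bisect
from typing import Dict, List, Tuple


class SortedTable:
    """Toy stand-in for a table whose rows are sorted by row key."""

    def __init__(self) -> None:
        self._keys: List[str] = []
        self._rows: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start: str, stop: str) -> List[Tuple[str, str]]:
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def row_key(user_id: str, epoch_seconds: int) -> str:
    """User id first, then a zero-padded timestamp so keys sort by time."""
    return f"{user_id}#{epoch_seconds:010d}"


table = SortedTable()
table.put(row_key("u42", 300), "response from pricing")
table.put(row_key("u7", 150), "request to router")
table.put(row_key("u42", 100), "request to router")
table.put(row_key("u42", 200), "request to pricing")

# All of user u42's events with 100 <= time < 300, already in time order.
hits = table.scan(row_key("u42", 100), row_key("u42", 300))
```

With this key layout, a search by user and time window becomes a single contiguous range scan, so no relational query features are needed.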
The detailed implementation of the clustered logging system is presented below in
Figure 6.
Figure 6: Detailed implementation of the distributed logging system.
The implementation of the clustered systems discussed above has several
disadvantages as well. The introduction of additional software systems creates
more overhead to manage and monitor. A lack of familiarity with distributed
software can result in more error-prone implementations and improper
configurations. These drawbacks, however, do not fundamentally undermine the
effectiveness of the new implementation. Bloomberg already has several big data
teams researching and developing solutions with the distributed systems outlined
above. Hence, it is possible to shift the ownership and responsibility of the
underlying infrastructure to a dedicated team. The lack of experience in
distributed software should not be a deciding factor, since any innovation starts
from the prototyping stage.
2.3 Implications of Leveraging Open Source Projects
In a company-wide initiative, Bloomberg L.P. recently began advocating the
philosophy and usage of open source projects. It is not difficult to see the benefit
of this approach: open source projects embody the collective intelligence and
effort of the entire community. Instead of developing an in-house solution from
scratch with limited development resources, it is often more advantageous to
direct those resources towards customizing a general solution supported by the
open source community.
Open source technologies such as Apache Kafka, Apache Storm, and Apache
HBase are well supported by mature communities, and not restricted by
bureaucratic procedures and corporate release schedules. As a result, they are
capable of a much more rapid development cycle. This agility benefits
adopters through faster bug fixes and quicker implementation of
requested features.
Leveraging open source technologies, however, is not without its drawbacks. It is
more difficult for the firm to dictate the development direction of the
technologies, since the community has the final say. If the needs of the firm
diverge from those of the community, it then becomes necessary to maintain a fork
of the main implementation.
The previously mentioned agility of open source projects can also become a
drawback if not managed properly. Rapid development also implies more
compatibility-breaking changes and faster deprecation of existing code. It
is at the adopters’ discretion to find a balance between maintaining
compatibility and receiving new upgrades.
Given the architecture proposed in the earlier section of the report, the benefits of
leveraging open source technologies outweigh the drawbacks. In this specific use
case, only the core characteristics of the technologies are required. Hence, little
customization is needed, and it is unlikely for the communities to make drastic,
compatibility-breaking changes to the core features. As long as relatively stable
releases are used, the technologies leveraged should require few upgrades after
integration.
2.4 Evaluation of Alternative Technologies
Another popular distributed, non-relational database is Apache Cassandra, a
competitor to Apache HBase. While HBase has a stronger consistency guarantee,
it is more difficult to set up and maintain. Cassandra has the advantage of
providing a SQL-like querying interface, which many developers would find more
familiar. This, however, does not mean it has the full querying capabilities of
SQL; it has restrictions and caveats that can quickly overwhelm those who
mistake it for a traditional relational database.
As described in previous sections, the queries that are performed on the logging
database are relatively simple, and can be modeled in HBase just as easily as in
Cassandra. This, along with the fact that Bloomberg already has a running HBase
cluster maintained by a dedicated team, makes HBase the clear winner.
3.0 Conclusions
As demonstrated in the earlier analysis, the current logging system is rapidly
becoming insufficient for the growing MARS platform. It imposes severe
restrictions on the efficiency of debugging. As a result, it is beneficial for the
MARS team to devote resources to implementing a new logging system that satisfies
the latency, scalability, and extensibility requirements.
The new implementation of the system based on open source distributed
technologies is shown to be a suitable solution. The improvement achieved
through buffering, parallel computation, and concurrent access outweighs the
overhead of monitoring the infrastructure clusters. Apache Kafka, Storm, and
HBase are selected as suitable components of the system after evaluation against
alternative technologies.
The usage of open source technologies in general is determined to have a positive
impact on efficiency. The communal intellect and resources that it can leverage
far outweigh its drawbacks, such as a volatile code base and lack of control.
Furthermore, many of the disadvantages can be offset by careful planning and do
not undermine its effectiveness.
4.0 Recommendations
The MARS team should devote more resources and developers to replacing and
enhancing the existing logging system. Given the importance of analytics to a
platform’s healthy growth, it might be beneficial to devote an entire team to
this responsibility. This team, MARS Analytics, would develop, maintain, and
innovate on the logging system and additional analytical extensions. A
dedicated team would ensure that sufficient resources go towards monitoring
and evaluation of the existing framework.
Given the benefits of open source projects discussed in this report, the MARS team
should embrace open source technologies not only in its new projects, but also in
its existing ones. Effort should be devoted to reviewing legacy code bases that are
hard to maintain and evaluating whether they are replaceable by newer open source
technologies. This approach may slow development in the short run, but it will
produce a more robust and less error-prone system in the long run.
Bloomberg as a company should encourage continued learning in new software
technologies through incentivized seminars. As applications grow in user base and
complexity, more and more problems deal with large quantities of data, and
require knowledge of distributed systems to be solved efficiently. By investing in
human capital, Bloomberg can potentially increase software performance and
reduce maintenance cost.
References
HBase ™ Reference Guide. (n.d.). Retrieved May 1, 2015, from
http://hbase.apache.org/book.html
How Spotify Scales Apache Storm. (2015, January 5). Retrieved May 1, 2015, from
https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Kafka 0.8.2 Documentation. (n.d.). Retrieved May 1, 2015, from
http://kafka.apache.org/documentation.html
Service Oriented Architecture: What Is SOA? (n.d.). Retrieved May 1, 2015, from
http://www.opengroup.org/soa/source-book/soa/soa.htm#soa_definition
Storm Documentation. (n.d.). Retrieved May 1, 2015, from
https://storm.apache.org/documentation/Home.html