A Quick Survey of Open Source Software for PH Organizations

By Massimo Mirabito, MBA (US CDC) and Taha Kass-Hout, MD, MS, 2007

Unstructured Text

1. Lucene: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. The technology is suitable for nearly any application that requires full-text search, especially cross-platform applications. Lucene itself is just an indexing and search library and does not contain crawling or HTML parsing functionality. The Apache project Nutch is based on Lucene and provides this functionality. Lucene can index text extracted from a variety of document formats.

2. Solr: Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. Solr is a standalone server; applications communicate with it over HTTP, using XML, to index documents or execute searches. Solr supports a rich schema specification that allows a wide range of flexibility in dealing with different document fields, and has an extensive search plugin API for developing custom search behavior.

3. Nutch: Nutch is an effort to build an open source search engine based on Lucene Java for the search and index component. The fetcher ("robot" or "web crawler") was written from scratch solely for this project. Nutch has a highly modular architecture that allows developers to create plugins for media-type parsing, data retrieval, querying, and clustering. In June 2005, Nutch graduated from the Apache Incubator and became a subproject of Lucene. It is coded entirely in Java, but its data is written in language-independent formats. In June 2003, a 100-million-page demo system was run successfully. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities have been spun out into their own subproject, Hadoop.

4. UIMA: UIMA stands for Unstructured Information Management Architecture. It is a component software architecture, developed by IBM, for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. The source code for a reference implementation of this framework was made available on SourceForge, and later on the Apache Software Foundation website. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence boundary detection" -> "entity detection (person/place names etc.)". Each component must implement interfaces defined by the framework and must provide self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages. UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.
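The component decomposition described for UIMA can be illustrated with a minimal sketch. It is written in plain Python and does not use the UIMA SDK or its interfaces; the `Document` class, the two detector functions, and the `KNOWN_PLACES` list are hypothetical names chosen only for illustration. Each component adds annotations to a shared document, and a small runner manages the flow between them.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    # Each annotation is (kind, start, end); offsets point back into `text`.
    annotations: list = field(default_factory=list)


def sentence_detector(doc):
    """Mark sentence spans using a naive end-of-sentence punctuation rule."""
    for match in re.finditer(r"[^.!?]+[.!?]", doc.text):
        raw = match.group()
        start = match.start() + len(raw) - len(raw.lstrip())  # skip leading spaces
        doc.annotations.append(("sentence", start, match.end()))
    return doc


# Hypothetical gazetteer; a real entity detector would be far richer.
KNOWN_PLACES = ("Atlanta", "Geneva")


def place_detector(doc):
    """Mark every occurrence of a known place name."""
    for place in KNOWN_PLACES:
        for match in re.finditer(re.escape(place), doc.text):
            doc.annotations.append(("place", match.start(), match.end()))
    return doc


def run_pipeline(doc, components):
    """Pass the document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline(
    Document("Cases rose in Atlanta. Geneva was notified."),
    [sentence_detector, place_detector],
)
sentences = [doc.text[s:e] for kind, s, e in doc.annotations if kind == "sentence"]
places = [doc.text[s:e] for kind, s, e in doc.annotations if kind == "place"]
```

Because every component takes and returns the same document type, components can be reordered or replaced without changing the runner, which is the property the framework description above relies on.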

Alternative GIS

A Geographic Information System (GIS) is an equally critical component. A GIS provides a way of capturing, storing, analyzing, and managing data and associated attributes that are spatially referenced to the earth. For proper data analysis, time series should also be supported, so that researchers, first responders, and emergency personnel can view data both spatially and over time. The most prominent inexpensive tools on the Internet are Google Maps, Google Earth, Microsoft Virtual Earth, and Yahoo Maps. All of these tools are relatively easy to use, configure, and distribute. Google Earth and Google Maps are the tools most widely used by web developers. The Keyhole Markup Language (KML) is an XML-based format for describing geospatial data; both Google Earth and Google Maps can read KML.
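As a rough illustration of what a KML description of a single point looks like, the sketch below builds one Placemark with Python's standard library. The function name, the placemark name, and the coordinates are made up for the example, and the namespace used is the OGC KML 2.2 one, which is an assumption rather than something the survey specifies.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: OGC KML 2.2.
KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name, lon, lat):
    """Return a KML document containing a single named point."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML coordinates are written longitude first, then latitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical example location.
kml_text = placemark_kml("Reporting clinic", -84.39, 33.75)
```

Saving `kml_text` to a file with a `.kml` extension should give something that can be opened in Google Earth.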

1. OpenLayers (http://www.openlayers.org): OpenLayers provides capabilities to embed dynamic maps in any web page. It can display map tiles and markers loaded from a variety of sources. MetaCarta developed the initial version of OpenLayers and gave it to the public to further the use of geographic information of all kinds. OpenLayers is completely free, open source JavaScript, released under the BSD License.

2. MapServer (http://mapserver.gis.umn.edu): MapServer is an open source development environment for building spatially-enabled internet applications. MapServer supports Open Geospatial Consortium (OGC) standards, including Web Map Service (WMS) and Web Feature Service (WFS). MapServer works with PostgreSQL and its PostGIS extension, and supports proprietary GIS formats including ESRI's Shapefile format. MapServer uses OGR and GDAL libraries to translate files from one file format to another. MapServer supports PHP, Python, Perl, Ruby, Java, and C# for scripting and customization.

3. GeoServer (http://geoserver.org): GeoServer is an open source server that connects information to the Geospatial Web, including publishing and editing data using open standards. It is a fully functional geospatial web service implementing the WMS 1.1.1 and WFS 1.0 implementation specifications from OGC. Information is made available in a large variety of formats as maps/images or actual geospatial data. GeoServer's transactional capabilities offer robust support for shared editing. GeoServer's focus is ease of use and support for standards, in order to serve as 'glue' for the geospatial web, connecting from legacy databases to many diverse clients.

4. GeoTools (http://geotools.codehaus.org): GeoTools is an open source (LGPL) Java code library that provides standards-compliant methods for the manipulation of geospatial data, for example to implement Geographic Information Systems (GIS). The GeoTools library implements Open Geospatial Consortium (OGC) specifications as they are developed, in close collaboration with the GeoAPI and GeoWidgets projects.
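Servers that implement WMS 1.1.1, such as MapServer and GeoServer above, return a map image in response to a GetMap request sent as an HTTP query string. The sketch below only assembles such a URL and does not contact a server. The base URL and layer name are hypothetical, and the projection and image format are assumed defaults for the example.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width, height):
    """Build a WMS 1.1.1 GetMap URL for one layer."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the layer's default style
        "SRS": "EPSG:4326",  # assumed: plain longitude/latitude
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",  # assumed output format
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name.
url = wms_getmap_url(
    "http://localhost:8080/geoserver/wms",
    "ph:clinics",
    (-85.0, 33.0, -84.0, 34.0),
    512,
    512,
)
```

Because the request is an ordinary URL, the same map can be fetched by a browser, by OpenLayers, or by a script.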

Enterprise Service Bus (ESB)

Application integration is one of the most challenging aspects of building a platform. An ESB is a middleware infrastructure that connects multiple systems via standard protocols, exposes services for consumption, provides messaging, transformation, and routing capabilities, and leverages existing IT assets. There are several open source ESB products:

1. ServiceMix: ServiceMix is an open source ESB combining the functionality of a Service Oriented Architecture (SOA) and an Event Driven Architecture (EDA) to create an agile, enterprise ESB. Apache ServiceMix is an open source distributed ESB built from the ground up on the Java Business Integration (JBI) specification JSR 208 and released under the Apache license. The goal of JBI is to allow components and services to be integrated in a vendor-independent way, allowing users and vendors to plug and play. ServiceMix is lightweight and easily embeddable, has integrated Spring support, and can be run at the edge of the network (inside a client or server), as a standalone ESB provider, or as a service within another ESB.

2. Mule: Mule is a lightweight messaging framework. It is a highly distributable object broker that can seamlessly handle interactions with other applications using disparate technologies, transports, and protocols. The Mule framework provides a highly scalable environment in which you can deploy your business components. Mule manages all the interactions between components transparently, whether they exist in the same VM or over the internet, and regardless of the underlying transport used. Common scenarios for using Mule include integration projects where two or more existing systems need to communicate with each other, and applications that need to be totally decoupled from their surrounding environment or that need the ability to scale one or more components in the system.

3. FUSE ESB: FUSE ESB is an open source product based on Apache ServiceMix and offered by IONA Technologies. FUSE ESB provides a standardized methodology, server, and tools to deploy integration components, freeing architects from the dependencies that have traditionally locked enterprises into proprietary middleware stacks. FUSE ESB enables organizations to achieve their service-oriented architecture (SOA) objectives with a proven open source solution for enterprise integration.
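The routing and transformation roles that an ESB plays can be sketched in a few lines. This is a toy in-process bus and does not use ServiceMix, Mule, or FUSE; the `MiniBus` class, the topic name, and the message fields are all hypothetical. A producer publishes to a topic, the bus applies an optional transformation, and every subscriber receives the result without knowing who sent it.

```python
from collections import defaultdict


class MiniBus:
    """Toy message bus: per-topic transformation, then fan-out to subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._transformers = {}

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def set_transformer(self, topic, transform):
        self._transformers[topic] = transform

    def publish(self, topic, message):
        """Transform the message if needed, deliver it, return delivery count."""
        transform = self._transformers.get(topic)
        if transform is not None:
            message = transform(message)
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(message)
        return len(handlers)


def normalize_lab_result(raw):
    """Map a hypothetical lab-system record onto a shared format."""
    return {"zip": raw["patient_zip"], "positive": raw["result"] == "POS"}


bus = MiniBus()
surveillance_inbox, alert_inbox = [], []
bus.set_transformer("lab.result", normalize_lab_result)
bus.subscribe("lab.result", surveillance_inbox.append)
bus.subscribe("lab.result", alert_inbox.append)

delivered = bus.publish("lab.result", {"patient_zip": "30303", "result": "POS"})
```

The producer and the two consumers never reference each other directly, so any of them can be replaced without touching the others.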

Scalability

Scalability is important when deploying solutions that must perform adequately under high volume. Scalability is the ability to maintain availability, reliability, and performance as the number of concurrent connections and the load progressively increase. A system can scale in two ways:

• Scale vertically: To scale vertically (or scale up) implies adding resources to a single server, typically involving the addition of CPUs or memory. This could also mean expanding the number of running processes.

• Scale horizontally: To scale horizontally (or scale out) means to add more servers to a system, such as adding a new computer to a distributed software application. An example might be scaling out from 1 web server to 3.
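The effect of scaling out from 1 web server to 3 can be shown with a small round-robin sketch. The server names and the request count are invented for the example, and real load balancers use more elaborate policies than strict rotation.

```python
from collections import Counter
from itertools import cycle


def distribute(request_count, servers):
    """Assign requests to servers in strict rotation; return load per server."""
    load = Counter()
    rotation = cycle(servers)
    for _ in range(request_count):
        load[next(rotation)] += 1
    return dict(load)


# Hypothetical server names and request volume.
one_server = distribute(900, ["web1"])
three_servers = distribute(900, ["web1", "web2", "web3"])
```

With the same 900 requests, the single server handles all of them, while each of the three servers handles a third.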

The following products can deliver high availability and clustered solutions:

1. Open Terracotta: Open Terracotta is open source JVM-level clustering software for Java, delivering clustering as a runtime infrastructure service and simplifying the task of clustering a Java application. The capability is provided by clustering the JVM underneath the application, instead of clustering the application itself.

2. GridGain: GridGain is a computational grid framework. Its goal is to improve the general performance of processing-intensive applications by splitting and parallelizing the workload. In many cases GridGain is used to achieve better overall throughput, scalability, or availability of services. GridGain supports the following out of the box: JBoss, Spring, Spring AOP, JBoss AOP, AspectJ, JGroups, WebLogic, WebSphere, Oracle Coherence, Mule, JXInsight, and GigaSpaces.
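The idea of splitting and parallelizing a workload can be sketched with Python's standard library. This does not use GridGain or its API: a local thread pool stands in for grid nodes, and the `split` and `count_positive` helpers and the sample data are hypothetical. The job is cut into chunks, each chunk is processed independently, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Cut a list into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_positive(chunk):
    """Work unit: count positive results within one chunk."""
    return sum(1 for result in chunk if result == "POS")


# Hypothetical batch of lab results.
results = ["POS", "NEG", "NEG", "POS", "POS", "NEG", "NEG"]
chunks = split(results, 3)

# A local thread pool stands in for remote grid nodes in this sketch.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_positive, chunks))

total_positive = sum(partial_counts)
```

On a real grid the work units would run on separate machines; the split, process, and combine structure stays the same.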