
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics - System Demonstrations, pages 99–104, Melbourne, Australia, July 15 - 20, 2018. © 2018 Association for Computational Linguistics


The SUMMA Platform: A Scalable Infrastructure for Multi-lingual Multi-media Monitoring

Ulrich Germann
University of Edinburgh

[email protected]

Renārs Liepiņš
LETA

[email protected]

Guntis Barzdins
University of Latvia
[email protected]

Didzis Gosko
LETA

[email protected]

Sebastião Miranda
Priberam Labs

[email protected]

David Nogueira
Priberam Labs

[email protected]

Abstract

The open-source SUMMA Platform is a highly scalable distributed architecture for monitoring a large number of media broadcasts in parallel, with a lag behind actual broadcast time of at most a few minutes.

The Platform offers a fully automated media ingestion pipeline capable of recording live broadcasts, detection and transcription of spoken content, translation of all text (original or transcribed) into English, recognition and linking of Named Entities, topic detection, clustering and cross-lingual multi-document summarization of related media items, and last but not least, extraction and storage of factual claims in these news items. Browser-based graphical user interfaces provide humans with aggregated information as well as structured access to individual news items stored in the Platform's database.

This paper describes the intended use cases and provides an overview of the system's implementation.

1 Introduction

SUMMA ("Scalable Understanding of Multilingual Media") is an EU-funded Research and Innovation Action that aims to assemble state-of-the-art NLP technologies into a functional media processing pipeline to support large news organizations in their daily work. The project consortium consists of the University of Edinburgh, the Latvian Information Agency (LETA), Idiap Research Institute, Priberam Labs, Qatar Computing Research Institute, University College London, and Sheffield University as technical partners, and BBC Monitoring and Deutsche Welle as use case partners.

1.1 Use Cases

Three use cases drive the technology integration efforts.

1.1.1 External Media Monitoring

BBC Monitoring [1] is a business unit within the British Broadcasting Corporation (BBC). In operation since 1939 and with a staff of currently ca. 300 regional expert journalists, it provides media monitoring and analysis services to the BBC's own news rooms, the British government, and other subscribers.

Each staff journalist monitors up to four live broadcasts in parallel, plus several other information sources such as social media feeds. Assuming work distributed around the clock in three shifts [2], BBC Monitoring currently has, on average, the capacity to actively monitor about 400 live broadcasts at any given time. At the same time, it has access to over 1,500 TV stations and ca. 1,350 radio stations world-wide via broadcast reception, as well as a myriad of information sources on the internet. The main limiting factor in maintaining (or even extending) monitoring coverage is the cost and availability of the staff required to do so.

Continuous live monitoring of broadcast channels by humans is not the most effective use of their time. Entertainment programming such as movies and sitcoms, commercial breaks, and repetitions of content (e.g., on 24/7 news channels) are of limited interest to monitoring operations. What is of interest are emerging stories, new developments, and shifts in reporting. By automatically recording, monitoring, and indexing media content from a large number of media streams, and storing it in a database within a very short period of time after publication, the Platform allows analysts to focus on the media content most relevant to their work and relieves them of the tedious task of monitoring broadcast streams in search of such relevant content.

1.1.2 Internal Monitoring

Deutsche Welle [3], headquartered in Germany, is an international public broadcaster covering world-

[1] https://monitoring.bbc.co.uk/
[2] Actual staff allocation, and thus the amount of media monitoring, may vary over the course of the day.
[3] http://www.dw.com


[Figure 1 diagram; node labels: Record* → Transcribe* → Translate* → Recognise and link named entities → Extract facts and relations → Detect topics → Cluster similar documents → Summarise clusters → Database of annotated news items. (* if applicable)]

Figure 1: Overview of the SUMMA architecture.

wide news in 30 different languages. Regional news rooms produce and broadcast content independently; journalistic and programming decisions are not made by a central authority within Deutsche Welle. This means that it is difficult for the organization at large, from regional managers in charge of several news rooms to top management, to maintain an accurate and up-to-date overview of what is being broadcast, what stories have been covered, and where there is potential for synergies, for example by using journalistic background work performed for one story to produce similar content for a different audience in a different broadcast region and language.

The SUMMA Platform's cross-lingual story clustering and summarization module, with the corresponding online visualization tool, addresses this need. It provides an aggregate view of recently captured broadcasts, with easy access to individual broadcast segments in each cluster.

1.1.3 Data Journalism

The third use case is the use of the platform (or more specifically, its database API) by journalists and researchers (e.g., political scientists) who want to perform data analyses based on large amounts of news reports. Potential uses include the plotting of events or time series of events on maps, analysis of bias or political slant in news reporting, etc. This use case is the most open-ended and, unlike the other two, has not been a major driving force in the actual design and implementation of the Platform so far.

2 System Architecture

2.1 Design

Figure 1 shows a schematic overview of the Platform's workflow. The Platform comprises three major parts: a data ingestion pipeline built mostly upon existing state-of-the-art NLP technology; a web-based user front-end specifically developed with the intended use cases in mind; and a database at the center that is continuously updated by the data ingestion pipeline and accessed by end users through the web-based GUI, or through a REST API by downstream applications.

Services within the Platform — technical services such as the database server or the message queue, and services performing natural language processing tasks — run independently in individual Docker [4] application containers. This encapsulation allows for mostly independent development of the components by the consortium partners responsible for them.

Each NLP service augments incoming media items (represented as JSON structures) with additional information: automatic speech recognition (ASR) services add transcriptions, machine translation (MT) engines add translations, and so forth.

The flow of information within the Platform is realized via message queues [5]. Each type of task (e.g., speech recognition in a specific language, machine translation from a specific language into English, etc.) has a queue for pending tasks and another queue for completed tasks; task workers pop a message off the input queue, acknowledge it upon successful completion, and push the output of the completed task onto the output queue. A task scheduler (the only component that needs to "know" about the overall structure of the processing pipeline) orchestrates the communication.

The use of message queues allows for easy scalability of the Platform — if we need more throughput, we add more workers, which all share the same queues.

The central database for orchestration of media item processing is an instance of RethinkDB [6], a document-oriented database that allows clients to subscribe to a continuous stream of notifications about changes in the database. Each document consists of several fields, such as the URL of the original news item, a transcript for audio sources or the original text, a translation into English if applicable, entities such as persons, organisations or locations mentioned in the news item, etc.

For the user-facing front end, we use an instance of PostgreSQL [7] that periodically pulls in new arrivals from the RethinkDB. The PostgreSQL database was not part of the original platform design. It was introduced out of performance concerns, when we noticed at some point that RethinkDB's responsiveness was deteriorating rapidly as the database grew. We ultimately determined that this was due to an error in our set-up of the database — we had failed to set up indexing for certain fields, resulting in linear searches through the database that grew longer and longer as the number of items in the database increased. However, we did not realise this until after we had migrated the user front-ends to interact

[4] www.docker.com
[5] www.rabbitmq.com
[6] www.rethinkdb.com
[7] www.postgresql.org


with a PostgreSQL database. The Platform also provides an export mechanism into ElasticSearch [8] databases.

The web-based graphical user front-end was developed specifically for SUMMA, based on wireframe designs from our use case partners. It is implemented in the Aurelia JavaScript Client Framework [9].

In the following section, we briefly present the individual components of the data ingestion pipeline.
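The pop/acknowledge/push worker pattern described above can be sketched as follows. This is a minimal in-process illustration only: Python's standard queue module stands in for RabbitMQ, and the queue names, task function, and JSON fields are invented for the sketch, not taken from the Platform's actual schema.

```python
import json
import queue

# Stand-ins for the two RabbitMQ queues of one task type (e.g. English ASR):
# workers pop pending tasks, process them, and push results downstream.
pending = queue.Queue()    # tasks waiting for a worker
completed = queue.Queue()  # results for the task scheduler to route onward

def transcribe(media_item):
    """Hypothetical NLP step: augment the item with a new annotation field."""
    media_item["transcript"] = "(transcript of %s)" % media_item["url"]
    return media_item

def worker():
    """Drain the pending queue, augmenting each media item."""
    while True:
        try:
            message = pending.get_nowait()
        except queue.Empty:
            break
        item = json.loads(message)         # media items travel as JSON
        result = transcribe(item)          # augment, never replace, the item
        completed.put(json.dumps(result))  # push onto the output queue
        pending.task_done()                # acknowledge successful completion

pending.put(json.dumps({"url": "http://example.com/clip-001"}))
worker()
print(json.loads(completed.get())["transcript"])
```

Because all workers of one task type share the same pair of queues, scaling up is a matter of starting more worker processes; nothing else in the pipeline changes.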

3 Data Ingestion Pipeline Components

3.1 Live Stream Recorder and Chunker

The recorder and chunker monitors one or more live streams via their respective URLs. Broadcast signals received via satellite are converted into transport streams suitable for streaming via the internet and provided via a web interface. All data received by the Recorder is recorded to disk and chunked into five-minute segments for further processing. Within the Platform infrastructure, the recorder and chunker also serves as the internal video server for recorded transitory material.

Once downloaded and chunked, a document stub with the internal video URL is entered into the database, which then schedules it for downstream processing.

Video and audio files that are not transitory but provided by the original sources in more persistent forms (i.e., served from a permanent location) are currently not recorded but retrieved from the original source when needed.

3.2 Other Data Feeds

Text-based data is retrieved by data feed modules that poll the providing source at regular intervals for new data. The data is downloaded and entered into the database, which then schedules the new arrivals for downstream processing.

In addition to a generic RSS feed monitor, we use custom data monitors that are tailored to specific data sources, e.g. the specific APIs that broadcaster-specific news apps use for updates. The main task of these specialized modules is to map between the data fields of the source API's specific (JSON) response and the data fields used within the Platform.
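At its core, such a source-specific monitor is a field-mapping function. The following toy sketch illustrates the idea; both the source API's field names and the Platform-internal field names shown here are invented for illustration.

```python
# Map one item of a hypothetical broadcaster API response onto the
# (equally hypothetical) field names used inside the Platform.
FIELD_MAP = {
    "headline": "title",
    "bodyText": "text",
    "pubDate": "published",
    "lang": "language",
}

def to_platform_item(api_item):
    """Rename known fields; silently drop fields the Platform does not use."""
    return {ours: api_item[theirs]
            for theirs, ours in FIELD_MAP.items()
            if theirs in api_item}

item = to_platform_item({"headline": "Floods in X", "lang": "en", "extra": 1})
print(item)  # {'title': 'Floods in X', 'language': 'en'}
```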

3.3 Automatic Speech Recognition

The ASR modules within the Platform are built on top of CloudASR (Klejch et al., 2015); the underlying speech recognition models are trained with the Kaldi toolkit (Povey et al., 2011). Punctuation is added using a neural MT engine that was trained

[8] www.elastic.co
[9] aurelia.io

to translate from un-punctuated text to punctuated text. The training data for the punctuation module is created by stripping punctuation from an existing corpus of news texts. The MT engine used for punctuation insertion uses the same software components as the MT engines used for language translation. Currently, the Platform supports ASR of English, German, Arabic, Russian, Spanish, and Latvian; systems for Farsi, Portuguese and Ukrainian are planned.
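Creating punctuation training data by stripping punctuation, as described above, can be sketched like this. It is a simplified illustration; the project's actual preprocessing (tokenization, casing, etc.) is not specified in the paper.

```python
import re

def make_training_pair(sentence):
    """Return (source, target): the un-punctuated, lower-cased text the
    punctuation 'MT' system would see as input, and the original
    punctuated sentence it should learn to produce."""
    source = re.sub(r"[^\w\s]", "", sentence).lower()
    source = re.sub(r"\s+", " ", source).strip()
    return source, sentence

src, tgt = make_training_pair("Hello, world. How are you?")
print(src)  # hello world how are you
```

Run over a large corpus of news text, this yields parallel data on which an ordinary neural MT toolkit can be trained, which is why punctuation insertion can reuse the same software components as the translation engines.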

3.4 Machine Translation

The machine translation engines for language translation use the Marian decoder (Junczys-Dowmunt et al., 2016) for translation, with neural models trained with the Nematus toolkit (Sennrich et al., 2017). In the near future, we plan to switch to Marian entirely for training and translation. Currently supported translation directions are from German, Arabic, Russian, Spanish, and Latvian into English. Systems for translation from Farsi, Portuguese and Ukrainian into English are planned.

3.5 Topic Classification

The topic classifier uses a hierarchical attention model for document classification (Yang et al., 2016), trained on nearly 600K manually annotated documents in 8 languages [10].

3.6 Storyline Clustering and Cluster Summarization

Incoming stories are clustered into storylines with the online clustering algorithm by Aggarwal and Yu (2006). The resulting storylines are summarized with the extractive summarization algorithm by Almeida and Martins (2013).

3.7 Named Entity Recognition and Linking

For Named Entity Recognition, we use TurboEntityRecognizer, a component within TurboParser [11] (Martins et al., 2009). Recognized entities and relations between them (or propositions about them) are linked to a knowledge base of facts using AMR techniques developed by Paikens et al. (2016).

3.8 Trained Systems and Their Performance

Space constraints prevent us from discussing the NLP components in more detail here. Detailed information about the various components can be found in the project deliverables 3.1, 4.1, and 5.1,

[10] The Platform is currently designed to handle 9 languages: English, Arabic, Farsi, German, Latvian, Portuguese, Russian, Spanish, and Ukrainian.
[11] https://github.com/andre-martins/TurboParser


Figure 2: Trending media item views: histogram (top) and list (bottom)

which are available from the SUMMA project's web page [12].

4 User Interfaces

The platform provides several user interfaces ("views") to accommodate different users' needs.

The trending media item views (Fig. 2) rely on recognised Named Entities. For each entity recognized by the Named Entity recognizer, the histogram view (top) shows a bar chart with the number of media items that mention said entity within each hour for the past n hours. The order of the list of histograms corresponds to the recent prominence of the respective entity. In the list view (bottom), recent arrivals for a specific Named Entity are sorted most-recent-first. The screenshot shows the result of a full text search.

Figure 3 shows the view of a recorded video segment. The video player panel is on the left. On the right, we see topic labels (top), a summary of the transcript (middle), and the original transcript. The transcript is synchronized with the video, so that recognized speech is automatically highlighted in the transcript as the video is played, and a click into the transcript takes the user to the corresponding position in the video.

The document cluster view (Fig. 4) gives a bird's-eye view of the clustered media items; the area occupied by each cluster corresponds to the number of items within the cluster. A click on the respective tile shows a multi-document summary of the cluster and a list of related media items.

The trending media item views correspond most closely to our first use case: journalists keeping track of specific stories linked to specific entities and identifying emerging topics. The document cluster view corresponds best to the needs of our second use case: internal monitoring of a media organisation's output.

In addition to these visual user interfaces, the platform can also expose a database API, so that users wishing to access the database directly with their own analysis or visualisation tools can do so. A number of ideas for such tools were explored recently at the BBC's SUMMA-related #NewsHack event in London in November 2017 [13]. They include a slackbot that allows journalists to query the database in natural language via a chat interface, automatic generation of short videos capturing and summarizing the news of the day in a series of captioned images, and contrasting the coverage by a specific media outlet against a larger pool of information sources.

[12] www.summa-project.eu/deliverables
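The per-entity, per-hour counts behind the histogram view can be computed along these lines. This is a toy sketch of the aggregation only; the item fields (`entities`, `published`) are hypothetical, not the Platform's actual schema.

```python
from collections import Counter
from datetime import datetime, timedelta

def entity_histogram(items, entity, now, n_hours=24):
    """For each of the past n_hours, count the media items that mention
    the given entity. bars[0] is the most recent hour."""
    counts = Counter()
    for item in items:
        if entity not in item["entities"]:
            continue
        age = now - item["published"]
        hours_ago = int(age.total_seconds() // 3600)
        if 0 <= hours_ago < n_hours:
            counts[hours_ago] += 1
    return [counts.get(h, 0) for h in range(n_hours)]

now = datetime(2018, 7, 15, 12, 0)
items = [
    {"entities": {"NATO"}, "published": now - timedelta(minutes=30)},
    {"entities": {"NATO"}, "published": now - timedelta(hours=2)},
    {"entities": {"EU"},   "published": now - timedelta(hours=1)},
]
print(entity_histogram(items, "NATO", now, n_hours=4))  # [1, 0, 1, 0]
```

Sorting entities by, e.g., the sum of their most recent bars would then yield the prominence ordering used for the list of histograms.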

5 Hardware Requirements and Performance Analysis

When designing the system, we made a deliberate decision to avoid reliance on GPUs for processing, due to the high cost of the hardware, especially when rented from cloud computing services. This constraint does not apply to the training of the underlying models, which we assume to be done offline. For most components, pre-trained models are available for download.

5.1 Disk Space

A basic installation of the platform with models for 6 languages / 5 translation directions currently requires about 150 GB of disk space. Recording of live video streams requires approximately 10–25 GB per stream per day, depending on the resolution of the underlying video. We do not transcode incoming videos to a lower resolution, due to the high processing cost of transcoding.
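A back-of-envelope sizing calculation follows directly from these figures (using the upper bound of 25 GB per stream per day):

```python
def disk_estimate_gb(n_streams, days, gb_per_stream_day=25, base_gb=150):
    """Disk estimate from the figures in the text: ~150 GB base install,
    10-25 GB per recorded stream per day (upper bound by default)."""
    return base_gb + n_streams * days * gb_per_stream_day

# e.g. recording 10 streams for a week, worst case:
print(disk_estimate_gb(10, 7))  # 1900
```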

5.2 CPU Usage and Memory Requirements

Table 1 shows a snapshot of performance monitoring on a single-host deployment of the Platform that we use for user interface testing and user experience evaluation, measured over a period of about 20 minutes. The deployment's host is a large server

[13] http://bbcnewslabs.co.uk/2017/11/24/summa-roundup. The event attracted 63 registrations; subtracting no-shows, it had ca. 50 attendees.


Figure 3: Single media item view

Figure 4: Document cluster view

with 32 CPU cores and 287 GB of memory, of which ca. 120 GB are currently used by the Platform, supporting transcription in 6 languages and translation from 5 into English. Not shown in the table are the ASR, Punctuation and MT components for Latvian and German, as they were idle at the time.

Notice that except for the Named Entity recognizer, no component has a peak memory use of more than 4 GB, so that individual components can be run on much smaller machines in a distributed set-up.

Speech recognition is clearly the bottleneck and resource hog in the pipeline. The ASR engine currently is a single-threaded process that can transcribe speech at about half the "natural" speaking rate [14]. Each segment is transcribed in one chunk; we use multiple workers (2–3) per channel to keep up with the speed of data arrival.

The multi-threaded MT engines use all available cores to translate in parallel and are able to translate some 500 words per minute per thread [15]. For comparison: the typical speaking rate of an English-speaking news speaker is about 160 words per minute. A single MT engine can thus easily accommodate transcripts from multiple transcribed live streams.

[14] For segments without speech, the processing speed may be about 8 times real time, as the speech recognizer gets confused and explores a much larger search space.
[15] Junczys-Dowmunt et al. (2016) report 141 words per second on a 16-thread machine; this means ca. 528 words per minute per thread.
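The capacity argument above can be spelled out numerically:

```python
# Junczys-Dowmunt et al. (2016) report 141 words/second on a 16-thread
# machine, i.e. per thread:
words_per_sec_total = 141.0
threads = 16
wpm_per_thread = words_per_sec_total * 60 / threads
print(round(wpm_per_thread, 2))  # 528.75, i.e. ca. 528 words/min/thread

# A news speaker produces about 160 words/minute, so a single MT thread
# can keep up with roughly three live streams at once:
speaking_wpm = 160
print(int(wpm_per_thread // speaking_wpm))  # 3
```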


Table 1: Relative use of CPU time and peak memory use (per container) for various tasks in the NLP processing pipeline.

Task                                CPU      RAM
ASR (ar)                           27.26%   1359 MiB
ASR (en)                           26.53%   2594 MiB
ASR (ru)                           22.40%   2044 MiB
ASR (es)                           13.44%   2800 MiB
MT (ar)                             3.34%    926 MiB
MT (ru)                             3.14%   1242 MiB
Summarization                       1.21%   3609 MiB
Semantic parsing                    0.97%   3290 MiB
MT (es)                             0.34%   1801 MiB
Job queue                           0.30%    326 MiB
Topic identification                0.27%   1418 MiB
Named entity recognition            0.18%  12625 MiB
Punctuation (en)                    0.16%    362 MiB
Database                            0.16%   3021 MiB
Recording and chunking              0.15%    337 MiB
Punctuation (ru)                    0.04%    285 MiB
Punctuation (ar)                    0.03%    276 MiB
DB REST interface                   0.02%    153 MiB
Clustering                          0.01%   1262 MiB
Data feed (not streamed)            0.01%    141 MiB
Punctuation (es)                    0.01%    240 MiB
Task creation                       0.01%   1076 MiB
Result writer (msg. queue to DB)    0.01%   1003 MiB

In a successful scalability test with Docker Swarm in January 2018, we were able to ingest 400 TV streams simultaneously on Amazon's AWS EC2 service, allocating 2 t2.2xlarge virtual machines (8 CPUs, 32 GB RAM) per stream, although 1 machine per stream would probably have been sufficient. For this test, we deployed several clones of the entire platform for data ingestion, all feeding into a single user-facing PostgreSQL database instance.

6 Conclusion

The SUMMA Platform successfully integrates a multitude of state-of-the-art NLP components to provide an integrated platform for multi-lingual media monitoring and analysis. The Platform design provides for easy scalability and facilitates the integration of additional NLP analysis modules that augment existing data.

Availability

The Platform infrastructure and most of its components are currently scheduled to be released as open-source software in late August 2018 and will be available through the project home page at

http://www.summa-project.eu

Acknowledgements

This work was conducted within the scope of the Research and Innovation Action SUMMA, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688139.

References

Aggarwal, Charu C. and Philip S. Yu. 2006. "A framework for clustering massive text and categorical data streams." SIAM Int'l. Conf. on Data Mining, 479–483. Boston, MA, USA.

Almeida, Miguel B. and André F. T. Martins. 2013. "Fast and robust compressive summarization with dual decomposition and multi-task learning." ACL, 196–206. Sofia, Bulgaria.

Junczys-Dowmunt, Marcin, Tomasz Dwojak, and Hieu Hoang. 2016. "Is neural machine translation ready for deployment? A case study on 30 translation directions." International Workshop on Spoken Language Translation. Seattle, WA, USA.

Klejch, Ondřej, Ondřej Plátek, Lukáš Žilka, and Filip Jurčíček. 2015. "CloudASR: platform and service." Int'l. Conf. on Text, Speech, and Dialogue, 334–341. Pilsen, Czech Republic.

Martins, André F. T., Noah A. Smith, and Eric P. Xing. 2009. "Concise integer linear programming formulations for dependency parsing." ACL-IJCNLP, 342–350. Singapore.

Paikens, Peteris, Guntis Barzdins, Afonso Mendes, Daniel Ferreira, Samuel Broscheit, Mariana S. C. Almeida, Sebastião Miranda, David Nogueira, Pedro Balage, and André F. T. Martins. 2016. "SUMMA at TAC knowledge base population task 2016." TAC. Gaithersburg, MD, USA.

Povey, Daniel, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. "The Kaldi speech recognition toolkit." ASRU. Waikoloa, HI, USA.

Sennrich, Rico, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. "Nematus: a toolkit for neural machine translation." EACL Demonstration Session. Valencia, Spain.

Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. "Hierarchical attention networks for document classification." NAACL. San Diego, CA, USA.