Here is a simple, high-level whitepaper on the OpenCalais service. Feel free to ping us with questions on Twitter @OpenCalais or @KristaThomas.
Transcript of Simple OpenCalais Whitepaper
Thomson Reuters Calais Web Service & the Linked Content Economy

Executive Summary: The rise of the Internet has brought
dramatic change to the publishing industry. While newspapers in
particular struggle to adapt, advertisers are cutting budgets,
seeking new efficiencies and increasingly using the Web to go
straight to the consumer. Semantic technologies and new open data
resources on the Web give both publishers and advertisers new tools
and services that can help them succeed. The Thomson Reuters Calais
Web service, found at OpenCalais.com, is one such service. Calais
identifies and automatically tags the people, places, companies,
facts and events in text. It then forges connections between those
entities and relevant data sets, media files, Wikipedia entries and
more on the open Web. Finally, it gives publishers a new way to
share that tagged content with next generation search engines, news
aggregators and others in the content ecosystem.

"Calais turns static text into Smart Media that is enriched with
open data and connected to a dynamic Linked Content Economy."
-Thomas (Tom) Tague, Calais initiative lead

Armed with this powerful new tool, forward-looking publishers are
automating time-consuming content operations and increasing
editorial productivity. They are also enhancing the value of their
content, improving their user experience and preparing to reach more
readers in tomorrow's media landscape, increasingly called the
linked content economy.

Background: Calais is a strategic initiative at Thomson Reuters to
advance the interoperability of content and support the company's
mission to provide pervasive intelligent information to its
customers. Calais uses Natural Language Processing to give
publishers free metatagging services, developer tools and an open
standard for the generation of semantic content. The latest update
to Calais, Calais 4.0, is a significant advance on the initiative's
goals. The Calais team originally set out to help developers,
bloggers and publishers automatically tag their content to improve
search and navigation, and enable new reader engagement features.
With Calais 4.0, the Calais Web service goes beyond metatagging to
help publishers enhance their content, using open data from sources
like Wikipedia, DBpedia, GeoNames, the Internet Movie Database
(IMDB), Shopping.com and more. It also makes it easy for publishers
to use their metadata to share their content with next generation
content consumers such as search engines, news aggregators, related
stories services and more, to ultimately reach more readers. With
these added capabilities, Calais helps content creators and content
consumers alike connect to the rapidly emerging Linked Content
Economy and deliver Smart Media.

The Linked Content Economy &
Smart Media: The Linked Content Economy is an evolving ecosystem of
enriched and connected content that helps publishers engage
readers, improve the user experience, and ultimately better convert
readership to revenue. Linked Content goes beyond link journalism
(linking to related stories, etc.). It uses metadata to help
publishers create Smart Media content that automatically connects
the concepts, people, companies, etc. it contains to a rich array
of related data sets and media assets on the Web. It then uses
metadata to help publishers share their Smart Media with the rest
of the content ecosystem, including search engines, news
aggregators, related stories applications and more.

How it Works:
1. Publishers submit content to the Calais Web service using their
   Calais API key.
2. Calais tags each person, place, fact and event in the content,
   making it machine-readable and interoperable on the Web.
3. Each piece of content, and each entity or event in that content,
   is assigned a unique identifier (a document ID and many URIs) that
   ties back to the Linked Data Cloud.
4. Publishers use the metadata Calais returns (tags, document IDs and
   URIs) to enhance their content and create features like topic
   pages that improve the reader experience.
5. Publishers can also use their metadata to share their content with
   next generation search engines, news aggregators, etc.

Calais' participation in this ecosystem is as a platform. Calais
lays the foundation on which, in conjunction with Content Management
Systems, users can create a next generation publishing site, service
or community. Calais adopted the Linked Data standard to build a
back-end infrastructure and repository, enabling linkage between
concepts and documents. Linked Data is a standard promulgated by Sir
Tim Berners-Lee. Here are some of the open data assets in the Linked
Data cloud.
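
Step 4 of "How it Works" (using returned tags, document IDs and URIs
to build features like topic pages) can be illustrated with a minimal
Python sketch. The record layout, field names and example.org URIs
below are assumptions made for illustration; they are not the actual
Calais response format, and no call to the real service is made.

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for the metadata described above.
# Field names ("doc_id", "entities", "uri", "type", "name") and the
# example.org URIs are illustrative assumptions, not the real format.
tagged_docs = [
    {"doc_id": "doc-001", "entities": [
        {"uri": "http://example.org/company/ibm", "type": "Company", "name": "IBM"},
    ]},
    {"doc_id": "doc-002", "entities": [
        {"uri": "http://example.org/city/paris-texas", "type": "City", "name": "Paris"},
    ]},
    {"doc_id": "doc-003", "entities": [
        # A different surface name, but the same URI as in doc-001.
        {"uri": "http://example.org/company/ibm", "type": "Company",
         "name": "International Business Machines"},
    ]},
]


def build_topic_pages(docs):
    """Group document IDs by entity URI so each URI can back a topic page."""
    pages = defaultdict(lambda: {"name": None, "type": None, "doc_ids": []})
    for doc in docs:
        for entity in doc["entities"]:
            page = pages[entity["uri"]]
            if page["name"] is None:
                # Keep the first name seen as the page's display label.
                page["name"] = entity["name"]
                page["type"] = entity["type"]
            if doc["doc_id"] not in page["doc_ids"]:
                page["doc_ids"].append(doc["doc_id"])
    return dict(pages)


topic_pages = build_topic_pages(tagged_docs)
```

Keying the index on the URI rather than on the tag text is the point
of the sketch: two documents that name the same company differently
still land on the same topic page.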

By embracing the Linked Data standard and by creating a
Calais repository of Linked Data assets on publicly traded
companies, Thomson Reuters has built scaffolding that enables Web
sites, social networks and other content-rich applications to
navigate between previously separate silos of data and information.

Here's how it works:
1. When Calais processes an article, it extracts many named
   entities. For some classes of named entities, such as companies,
   Calais now also returns an HTTP hyperlink, called a Uniform
   Resource Identifier (URI).
2. This hyperlink points into the Calais repository, to a
   machine-readable XML page containing related content (company
   description, management team, board of directors, etc.) as well as
   links to related assets in DBpedia, from Thomson Reuters, etc.
3. This linked data infrastructure forms a web-of-links that
   applications can navigate and use to pull information up for
   display or integration into the user experience.

Calais has thus created a lingua franca to drive content
interoperability, and provided a simple standard for the sharing of
rich semantic metadata.

"Calais provides a transportation layer that enables users to share
their semantic metadata with downstream consumers like search
engines, news aggregators, related stories applications and more."
-Thomas (Tom) Tague, Calais initiative lead

Here's an example: A news story breaks on an IBM earnings report.
The user wants to find out if IBM has any affiliation with Warren
Buffett of Berkshire Hathaway. Today such a complex query requires
time-consuming research. Search engines can't hopscotch through
content.
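
To make the "hopscotch" idea concrete, here is a minimal Python
sketch of navigating a web of links to answer the IBM question. It
uses a small in-memory dictionary as a stand-in for the linked
records; in the scenario this whitepaper describes, each hop would
instead follow a URI into a repository. The LINKS structure, the
relation labels and the find_path helper are assumptions made for
illustration, populated with the relationships from the whitepaper's
own IBM example.

```python
from collections import deque

# Hypothetical in-memory stand-in for the web of links. Each key is an
# entity; each value lists (relation label, linked entity) pairs.
LINKS = {
    "IBM": [("board member", "Cathleen Black"),
            ("board member", "Kenneth Chenault")],
    "Cathleen Black": [("board member of", "Coca Cola")],
    "Kenneth Chenault": [("CEO of", "American Express")],
    "Coca Cola": [("key stockholder", "Berkshire Hathaway")],
    "American Express": [("key stockholder", "Berkshire Hathaway")],
    "Berkshire Hathaway": [("management team", "Warren Buffett")],
}


def find_path(start, goal, links):
    """Breadth-first search: return the shortest chain of hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _label, neighbor in links.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = find_path("IBM", "Warren Buffett", LINKS)
print(" -> ".join(path))
# prints: IBM -> Cathleen Black -> Coca Cola -> Berkshire Hathaway -> Warren Buffett
```

Breadth-first search is used here only because it returns the
shortest chain first; any traversal over the links would serve to
show that the answer comes from following connections between
records rather than matching keywords.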

But with Calais:
1. The news application sends the story to Calais.
2. Calais extracts IBM from the news story, ties it to
   International Business Machines Corporation in the Linked Data
   cloud and returns the URI (i.e. hyperlink) for IBM.
3. The app uses the IBM URI to retrieve the list of the Board of
   Director members from the Reuters.com content in the Calais
   repository.
4. The app queries the Board members for their other affiliations and
   finds a member that is also on the Board of Coca Cola plus a
   member that is the CEO of American Express.
5. The app runs a query of shareholders of Coca Cola and finds
   Berkshire Hathaway.
6. The app runs a query on shareholders of American Express and finds
   Berkshire Hathaway.

[Diagram: chain of linked records from IBM to Berkshire Hathaway]
IBM Corporation, Board of Directors: Cathleen Black, William Brody,
   Kenneth Chenault, Michael Eskew
Cathleen Black, Other Affiliations: President, Hearst Magazines;
   Board Member, Coca Cola
Coca Cola, Key Stockholders: Berkshire Hathaway
Kenneth Chenault, Other Affiliations: CEO, American Express
American Express, Key Stockholders: Berkshire Hathaway
Berkshire Hathaway, Management Team: Warren Buffett, Charlie Munger

Semantic extraction is far more
powerful than keyword search, which can confuse Paris (Texas),
Paris (France) and Paris (Hilton). Calais can determine that the
Paris in a particular article is Paris, Texas based on
sophisticated disambiguation that leverages a variety of clues in
the text.

New Applications: Calais 4.0 and beyond will enable many
emergent applications including:
- Publisher sites that dynamically mingle and deliver additional
  relevant content based on user preferences, profiles, history,
  friends' selections and breaking topics that are hot now.
- Media Monitoring tools that deliver slices of relevant
  information, e.g. content from all sites and blogs discussing
  natural disasters occurring near iron mines in Southeast Asia.
- Plug-ins that integrate social networking / community / blogging,
  and bypass search.
- Semantic ad networks and servers that go beyond keywords to inform
  ad placement with context, e.g. preventing airline ads from
  appearing next to news of air accidents.

Conclusion: Armed with this powerful new tool,
publishers are automating content operations, increasing
productivity and cutting costs. They are enhancing the value of
their content, improving their user experience and preparing to
lead in the linked content economy. No one can predict precisely
what kinds of creative and potentially game-changing applications
will emerge. With more than nine thousand users in the
OpenCalais.com community, Thomson Reuters expects to see
hyper-evolution in many arenas.