What is Semantic Publishing? And Why Should I Care?

Jabin White
Director of Strategic Content
Wolters Kluwer Health – P&E
May 13, 2010
PSP Presents – Semantic Publishing: An Introduction

Agenda
• Introductions
• Some definitions
  ▫ Vocabularies, Taxonomies, and Ontologies, Oh My!
• What is metadata, and why should publishers care?
• What is semantic tagging, and why should publishers care?
• Impact of all this on publishers’…
  ▫ Workflows/processes
  ▫ Business cases
• The Semantic Web
• Final Thoughts, Recommendations

Introductions: My Company
• Director of Strategic Content for Wolters Kluwer Health – Professional & Education
• Wolters Kluwer Health includes:
  ▫ Lippincott Williams & Wilkins titles
  ▫ Ovid
  ▫ UpToDate
  ▫ Provation Order Sets
  ▫ Drug Facts & Comparisons
  ▫ Medi-Span
  ▫ Clin-eguide

Introductions: Me
• Started as Editorial Assistant
• Dove into SGML in the mid-90s working on drug reference
• Six years at Elsevier in Electronic Production
• Don’t typecast me!
• Joined WK Health in May 2009
  ▫ Responsible for making sure content flows through the company more efficiently (DTDs, Content Management, Authoring Tools, Semantic Enrichment, Product Information Management, etc.)

The Web - Stop the Insanity!
• A few humble web stats:
  ▫ There are 2 billion (billion!) Google searches daily
  ▫ There are 1 trillion (1,000,000,000,000) unique URLs in Google’s index
  ▫ There are 2,695,205 articles in English on Wikipedia
  ▫ It would take 412.3 years to view all the content on YouTube (3/08), but don’t try, because there are 13 hours of video uploaded every minute

** Source: Adam Singer’s “Social Media, Web 2.0 and Internet Stats” site: http://thefuturebuzz.com/2009/01/12/social-media-web-20-internet-numbers-stats/

So What?
• Clay Shirky’s concept of “Filter Failure”
• When the capacity of people to “keep up with” information is exceeded, curation becomes the value differentiator

Definitions
• Controlled vocabulary: a bunch of words, no relationships
  ▫ But there is an advantage if all users use the same terms to describe things
• Taxonomy: a controlled vocabulary with hierarchy
• Thesaurus: often used interchangeably with controlled vocabulary, and sometimes also referred to as an ontology
• Ontology: all of the above; think neural network with a bunch of relationships
• Metadata: data about data (we’ll get to that)
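
To make the distinctions above concrete, here is a minimal Python sketch. It assumes a handful of toy terms (borrowed loosely from the sports example used later in this deck, not from any real vocabulary): a controlled vocabulary is a flat set of approved terms, a taxonomy adds a hierarchy, and an ontology adds other named relationships. The relation names `plays` and `used_in` are invented for illustration.

```python
# Toy terms only; nothing here comes from a real published vocabulary.

# Controlled vocabulary: an agreed list of terms, no relationships.
controlled_vocabulary = {"sports", "baseball", "pitcher", "bat"}

# Taxonomy: the same kind of terms arranged in a hierarchy (child -> parent).
taxonomy_parent = {
    "baseball": "sports",
}

# Ontology: hierarchy plus other typed relationships (subject, relation, object).
ontology_relations = [
    ("pitcher", "plays", "baseball"),
    ("bat", "used_in", "baseball"),
]


def is_approved(term: str) -> bool:
    """A controlled vocabulary can only say whether a term is allowed."""
    return term.lower() in controlled_vocabulary


def ancestors(term: str) -> list[str]:
    """A taxonomy can also say which broader terms a term falls under."""
    chain = []
    while term in taxonomy_parent:
        term = taxonomy_parent[term]
        chain.append(term)
    return chain


def related(term: str, relation: str) -> list[str]:
    """An ontology can answer questions about named relationships."""
    return [obj for subj, rel, obj in ontology_relations
            if subj == term and rel == relation]
```

Each step up adds a kind of question the structure can answer: "is this term allowed?", then "what is it a kind of?", then "how does it relate to other things?".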

Some Level-Setting
• Unfortunately, these definitions have been diluted to the point of uselessness by their misuse
  ▫ Think “Content Management” around the year 2000
• Metathesaurus – a collection of all of these things
  ▫ EXAMPLE: UMLS

Information Classification
• Pretty Wonky, Pretty Fast
• Hypernym (also “hyperonym”): broader term, more general
  ▫ Car is a hypernym of Pinto
• Hyponym: narrower term
  ▫ Baseball is a hyponym of sports
• Meronym: part term
  ▫ Kansas is a meronym of United States
• Holonym: whole term
  ▫ European Union is a holonym of France

Taxonomies in STM

Some Heavy Hitters
• UMLS
• MeSH
• SNOMED-CT
• ICD-9 and ICD-10
• RxNorm
• LOINC, ICPC-93, and VA/KP Subset of SNOMED

UMLS – Unified Medical Language System
• More than 5 million terms or named entities
• Divided into concepts, and each term has a unique identifier
• Not a vocabulary, but a mapping BETWEEN vocabularies

UMLS
• Vocabularies included in the UMLS:
  ▫ MeSH Headings in 8 languages
  ▫ ICPC-93 in 14 languages
  ▫ WHO Adverse Drug Reaction Terminology in 5 languages
  ▫ SNOMED-2, SNOMED-3, and UK Clinical Terms (former Read Codes)
  ▫ ICD-10 in English and German
  ▫ ICD-10-AM (Australian Modification)
  ▫ ICD-9 (US Modification)

The Semantic Network (UMLS)
• Semantic types are big things like Disease, Syndrome, or Clinical Drug
• Semantic relationships are useful links between semantic types (i.e., Clinical Drug treats Disease or Symptom)

One Concept, Many Names
TERM                              SOURCE VOCABULARY
Atrial fibrillation               ICD-9-CM
AF                                NCI Thesaurus
Afib                              MedDRA
Atrial fibrillation (disorder)    SNOMED Clinical Terms
Atrium; fibrillation              ICPC2-ICD10 Thesaurus
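
One way to picture "one concept, many names" is a lookup that maps every source term to a single shared identifier. The sketch below is illustrative only: the terms come from the table above, but `CONCEPT-AFIB` is a made-up placeholder, not a real UMLS identifier, and the function names are hypothetical.

```python
# Terms are taken from the slide's table.
# "CONCEPT-AFIB" is a hypothetical placeholder, NOT a real UMLS identifier.
TERM_TO_CONCEPT = {
    "atrial fibrillation": "CONCEPT-AFIB",             # ICD-9-CM
    "af": "CONCEPT-AFIB",                              # NCI Thesaurus
    "afib": "CONCEPT-AFIB",                            # MedDRA
    "atrial fibrillation (disorder)": "CONCEPT-AFIB",  # SNOMED Clinical Terms
    "atrium; fibrillation": "CONCEPT-AFIB",            # ICPC2-ICD10 Thesaurus
}


def concept_for(term):
    """Return the shared concept identifier for a term, or None if unknown."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


def same_concept(term_a, term_b):
    """True only when both terms are known and map to the same concept."""
    concept_a = concept_for(term_a)
    concept_b = concept_for(term_b)
    return concept_a is not None and concept_a == concept_b
```

With this in place, `same_concept("AF", "Afib")` is true even though the two strings share almost no characters, which is the point of mapping between vocabularies rather than matching on text.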

MeSH – Medical Subject Headings
• An 11-level hierarchy developed and maintained by the National Library of Medicine, part of the US Department of Health and Human Services
• The indexing method for MEDLINE/PubMed
  ▫ Contains more than 16 million references to journal articles in the life sciences, with concentration in biomedicine
  ▫ 5,200 journals worldwide in 37 languages
  ▫ Since 2005, 2,000-4,000 references are added daily, Tuesday-Saturday, all indexed to MeSH
  ▫ Loading suspended for two weeks every November/December while MeSH is updated

The MeSH Staff

SNOMED-CT
• Systematized Nomenclature of Medicine (Clinical Terms)
• 344,000 concepts, arguably the most complete clinical taxonomy in the world
• Developed and maintained by the College of American Pathologists
• Licensed by NLM, freely available to license as part of UMLS
• US Standard for electronic health information exchange by Health IT standards panel
• Adopted for use by US government through the Consolidated Health Informatics (CHI) initiative

ICD-9 and ICD-10
• International Classification of Diseases
• Version 9 moving to Version 10 (US is slower than rest of the world on this)
• Codes that define diseases:
  ▫ Example: 411.0 = Postmyocardial infarction syndrome (aka, Dressler’s Syndrome)
• Used to drive insurance reimbursements, billing, and other classifications of diseases
• Used by US government to calculate morbidity and mortality figures

RxNorm
• Standardized names for drugs, collections of drugs, and delivery devices
• Like MeSH, developed and maintained by National Library of Medicine
• Also includes standard way of expressing generic and trade names, ingredients, strengths, and dose forms

LOINC Mapping Files
• Logical Observation Identifiers Names and Codes
• A set of universal names and ID codes for identifying laboratory and clinical test results
• Used to better communicate with HIT (Health Information Technology) systems
• Not much of an impact on publishers, but we should know about them

What is Metadata, and Why Should Publishers Care?

What is Metadata?
• Reading most definitions of metadata and related standards is like trying to resolve disputes with my kids
• Metadata is “data about data”
  ▫ But what does that mean?
• Its use may be increasing, but metadata is NOT new

Why Should Publishers Care
• In the move from print publishing to digital, metadata is a powerful tool to help publishers get content in the right place, in the right format, and known to the right systems and people, at the right time
• Print books were easy
  ▫ Everyone knew what they were
  ▫ You could really only use them one way
  ▫ They had a beginning, an end, a physical presence, and a set price (mostly)

Why Should Publishers Care
• Today, computers are often communicating with one another as much as they are with users (people)
• Metadata becomes critical in:
  ▫ B2B relationships
  ▫ Enhancing B2C relationships
  ▫ B2-_________ relationships
• The quality of the metadata gives publishers a more powerful voice in what happens to their content

Why Should Publishers Care?
• For example:
  ▫ A digital asset (an image)
  ▫ What file format is it?
  ▫ How big is the image?
  ▫ Who took the picture?
  ▫ Who owns the picture?
  ▫ Can you use it on your web site? If you do, what credit do you have to give to the owner?
  ▫ What date was it created?
  ▫ Is it part of a collection?
  ▫ Is it related to another piece of content?
  ▫ Does it stand alone or is it part of a group of images?
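
The questions above map naturally onto a structured record. The following is a minimal sketch, assuming a hypothetical `ImageMetadata` record whose field names and sample values are invented for illustration; it is not any publisher's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ImageMetadata:
    """Answers to the questions a partner system might ask about one image."""
    file_format: str                  # What file format is it?
    width_px: int                     # How big is the image?
    height_px: int
    photographer: str                 # Who took the picture?
    rights_holder: str                # Who owns the picture?
    web_use_allowed: bool             # Can you use it on your web site?
    required_credit: Optional[str]    # What credit must be given to the owner?
    created: date                     # What date was it created?
    collection: Optional[str] = None  # Is it part of a collection?
    related_content_ids: list[str] = field(default_factory=list)  # Related content


def web_credit(meta: ImageMetadata) -> Optional[str]:
    """Return the credit line to display, or None if web use is not permitted."""
    if not meta.web_use_allowed:
        return None
    return meta.required_credit or f"Image: {meta.rights_holder}"


# All values below are hypothetical sample data.
example = ImageMetadata(
    file_format="JPEG",
    width_px=1200,
    height_px=800,
    photographer="A. Photographer",
    rights_holder="Example Publisher",
    web_use_allowed=True,
    required_credit="Photo: A. Photographer / Example Publisher",
    created=date(2010, 5, 13),
    collection="Sample image bank",
    related_content_ids=["chapter-12"],
)
```

Once the answers live in fields rather than in someone's head, a partner system can decide on its own whether and how to display the image.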

Publishers Should Care
• If a publisher’s goal is to disseminate content to the widest possible audience, metadata is critical

Publisher Relationships
• Again, in books you had one use model
• Metadata allows publishers to have diverse relationships with content consumers and other information providers
  ▫ Customers (duh)
  ▫ Aggregators
  ▫ The Open Web (not Google, but other search engines)
    ▫ But don’t try to “game” the search engines with adult keywords; that’s just wrong
    ▫ There have been lawsuits over use of meta keywords, including Playboy suing two adult web sites
  ▫ Technology partners/developers
  ▫ Systems wherein content is a “value add”
  ▫ Multiple output formats

Types of Metadata
• HTML Metadata

▫ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

▫ <meta name="verify-v1" content="kBoFGUuwppiWVWGx4Ypzkw1Cs1GgMYEMMbfNr7FY65w=" />

▫ <meta name="description" content="International publisher of professional health information for physicians, nurses, specialized clinicians & students. Medical & nursing charts, journals, and pda software.">

▫ <meta name="keywords" content="springhouse, medical book, nursing journal, medical pda software, lippincott medical reference, lww, lippincott, lww com, medical publisher">

▫ <link rel="stylesheet" href="/css/style.css" type="text/css">

For people

For search engines
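
As a rough illustration of how software reads this kind of HTML metadata, the sketch below uses Python's standard-library `html.parser` to collect named `<meta>` tags. The sample page is an abbreviated stand-in modeled on the tags shown above, and `MetaTagCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Skip http-equiv and other meta tags that carry no name/content pair.
        if name and attributes.get("content") is not None:
            self.meta[name.lower()] = attributes["content"]


# Abbreviated sample page (an assumption for the example, not a live site).
page = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="description" content="International publisher of professional health information.">
<meta name="keywords" content="medical book, nursing journal, medical publisher">
</head><body></body></html>"""

collector = MetaTagCollector()
collector.feed(page)

# Keywords arrive as one comma-separated string; split them into a list.
keywords = [keyword.strip() for keyword in collector.meta["keywords"].split(",")]
```

The description is the text a person is likely to see in a results list, while the keywords list is purely machine-facing.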

Types of Metadata
• Classifying Metadata
  ▫ ISBN (I told you this wasn’t new)
  ▫ Dewey Decimal System
  ▫ Books in Print/CIP/Library of Congress data
  ▫ MARC records
  ▫ DOI (Digital Object Identifier)
• Descriptive Metadata (sorry, my examples are from STM)
  ▫ ICD-9 and ICD-10 Codes
  ▫ MeSH
  ▫ SNOMED-CT
  ▫ NANDA, NIC, NOC for Nursing
  ▫ NDC, HCPCS for drugs

OLD NEW

Semantic Metadata
• Using controlled vocabularies, extra power can be added to content via semantic tagging to drive:
  ▫ More precise searching
  ▫ Contextually-based connections
  ▫ Lowering of “two terms meaning the same thing” syndrome (hypertension vs. high blood pressure; heart attack vs. myocardial infarction)
  ▫ Filling in of content gaps
• Semantic tagging *is* metadata, but it deserves its own section (coming up)

What is Semantic Tagging?

Semantic Basics
• Semantic tagging describes what content *is* and not how it should *look* on the page or screen
• Contrast to structural tagging, which is made of elements such as <para>, <list>, and <title>
• Both are XML, but semantics is like XML on steroids!
• Doing semantic tagging without a controlled vocabulary is madness for scholarly publishing
  ▫ Think “folksonomies”

Manual Tagging
• DESCRIPTION: A subject matter expert (SME) reads chapter/article, indexes or tags based on content, resulting in enriched content
• POSITIVES – If precision is needed, and clinical understanding of concepts (i.e., judgment) is required, probably still the best option
• NEGATIVES – Cost prohibitive on large volumes of information; not scalable; inconsistency if controlled vocabulary is not followed, or different taggers are used

Manual Tagging – Other Factors
• Offshore resources have improved in recent years as “knowledge work” has gone global, resulting in cost reductions
  ▫ Some processes considered “too expensive” to be done manually before could be revisited
• Great dependence on *type* of content, which means use cases should drive workflow decisions

Automated Approaches
• DESCRIPTION: Software crawls content, adds tags/unique identifiers or finds concepts & patterns to drive more intelligent search or entity extraction
• POSITIVES – Very effective in finding “trends” or concepts over a large repository of data; growing industry because of information overload (aka Data Mining, Text Analysis)
• NEGATIVES – Sometimes leads to false positives, lack of precision or judgment by machines processing data
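
A very small dictionary-based tagger shows both the appeal and the weakness described above. This is a sketch under stated assumptions, not how any commercial engine works: the vocabulary, the concept identifiers (`C-HTN`, `C-COMMON-COLD`), and the sample sentence are all invented for illustration.

```python
import re

# Hypothetical vocabulary: term -> made-up concept identifier.
VOCABULARY = {
    "hypertension": "C-HTN",
    "high blood pressure": "C-HTN",
    "cold": "C-COMMON-COLD",
}


def tag_text(text):
    """Return (position, matched term, concept id) for every vocabulary hit."""
    found = []
    lowered = text.lower()
    for term, concept in VOCABULARY.items():
        # Whole-word matching only, so "cold" does not match inside "scold".
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            found.append((match.start(), term, concept))
    return sorted(found)


hits = tag_text(
    "Patients with high blood pressure were seen; "
    "samples were kept in cold storage."
)
concepts = [concept for _, _, concept in hits]
```

The tagger correctly links "high blood pressure" to the hypertension concept, but it also tags "cold storage" as the common cold: a false positive that a subject matter expert would not make.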

Automated Approaches – Other Factors
• If used effectively, quick wins on large repositories
• Can be used to accomplish projects that would never be attempted (or approved) manually

Combination Approaches
• DESCRIPTION: Automated process followed by SME checking (deeper level than straight QA) and addition of specific conceptual information
• POSITIVES – best of both worlds for projects that deserve it; can drive precision but can also cover large repositories
• NEGATIVES – costs; every time software or people act on your content, there are costs – you don’t get a discount from either because you are doing both

FUD Around Semantic Search
• Semantic Search engines
  ▫ TEMIS, Collexis, NetBase, Vivisimo, OpenCalais
  ▫ Finding semantic concepts based on entities and search algorithms
  ▫ Finding a needle in a haystack
• Semantic Tagging
  ▫ People (SMEs) identify concepts and tag accordingly
  ▫ Drives precision in search and other things
  ▫ Finding the right needle in a stack of 10 needles

A Note About “Folksonomies”
• Having users “tag” or classify data is increasing in popularity
• Not much use in clinical areas of health sciences
• If you are sick, do you want to know what 100 people think, or the one expert?

Impact on Publishers

Impact on Publishers
• Impact depends on how deep you want to go
  ▫ i.e., what am I going to get in return for investing in metadata, and is it worth it?
  ▫ More and more, this is not an “if” proposition, it’s “how much”
• Publishers who buy in have two basic choices on approach:

Option 1: Metadata in the Workflow
• Requires deeper commitment, but has bigger potential upside
  ▫ Positive impact on product creation and development
• Requires thinking about tools, workflows, and enterprise-level systems to allow for creation and MAINTENANCE of metadata
• Combination of good metadata in the workflow and creativity in product development team can pay big benefits
• Allows participation of authors (or subject matter experts in lieu of) at the beginning of the workflow

Option 2: Outside the Workflow
• Requires lesser commitment, but potentially fewer rewards
• Can be done with zero impact on current systems
• Has benefit of content being in “final form” (whatever that means anymore) when intelligence is added in metadata
• Can keep SMEs as a separate offshoot of the workflow – easily outsourced
• Can attack this problem with brute force semantic search engines, but this is a different thing

Impact on Publishers
• Active vs. Passive Metadata
  ▫ Active metadata
    ▫ Publisher intentionally associates markup with certain pieces of content
    ▫ Often using controlled vocabulary
    ▫ Includes semantic indexing
    ▫ Can also be machine-based, using scripts, etc.
  ▫ Passive metadata
    ▫ Metadata created based on use of content
    ▫ Image X was used as part of an image bank on pediatric
    ▫ Inheritance of properties from parent objects
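
The "inheritance of properties from parent objects" idea can be sketched in a few lines. This is an illustrative model only: `ContentObject`, its fields, and the sample values are hypothetical, and real content management systems handle inheritance in their own ways.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContentObject:
    """A piece of content with its own metadata and an optional parent."""
    name: str
    metadata: dict = field(default_factory=dict)  # active: assigned on purpose
    parent: Optional["ContentObject"] = None

    def effective_metadata(self) -> dict:
        """Own metadata plus anything passively inherited from parents.

        Values set directly on the object win over inherited ones.
        """
        inherited = self.parent.effective_metadata() if self.parent else {}
        return {**inherited, **self.metadata}


# Hypothetical sample data echoing the "Image X" example on the slide.
image_bank = ContentObject("Image bank", {"subject": "pediatric", "rights": "licensed"})
image_x = ContentObject("Image X", {"format": "JPEG"}, parent=image_bank)
```

Nobody tagged Image X with a subject, yet `image_x.effective_metadata()` reports one, because the image picked it up from the collection it was used in.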

Implications for Search
• Machines don’t know the difference between hypertension and high blood pressure
  ▫ More accurately, machines don’t know they are the SAME
• How this is handled is a matter of User Experience (did you mean? … give them the result … etc.), but the content must be tagged first
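
To illustrate why "the content must be tagged first", the sketch below compares plain keyword matching with a search that goes through concept tags. Everything here is assumed for the example: the document titles, the concept identifiers (`C-HTN`, `C-MI`), and the function names.

```python
# Hypothetical synonym table: search term -> made-up concept identifier.
SYNONYMS = {
    "hypertension": "C-HTN",
    "high blood pressure": "C-HTN",
    "heart attack": "C-MI",
    "myocardial infarction": "C-MI",
}

# Hypothetical documents that have already been tagged with concepts.
DOCUMENTS = {
    "doc-1": {"title": "Managing hypertension in adults", "concepts": {"C-HTN"}},
    "doc-2": {"title": "High blood pressure and diet", "concepts": {"C-HTN"}},
    "doc-3": {"title": "Recovery after myocardial infarction", "concepts": {"C-MI"}},
}


def keyword_search(query):
    """Plain text matching: finds only documents that use the same words."""
    needle = query.lower()
    return sorted(doc_id for doc_id, doc in DOCUMENTS.items()
                  if needle in doc["title"].lower())


def concept_search(query):
    """Map the query to a concept first, then match on the documents' tags."""
    concept = SYNONYMS.get(query.lower())
    if concept is None:
        return keyword_search(query)  # unknown term: fall back to text matching
    return sorted(doc_id for doc_id, doc in DOCUMENTS.items()
                  if concept in doc["concepts"])
```

A keyword search for "hypertension" finds only the first document, while the concept search also returns the one titled "High blood pressure and diet". How the extra result is presented to the user is the User Experience question; finding it at all depends on the tags.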

Linking Content Within the Workflow
• Use models have changed in health sciences
• Customers don’t expect (or don’t have time) to exit a system to check clinical information
  ▫ It needs to be at the Point of Care
• We need to have content linked into customer workflows, and taxonomies drive this

The Semantic Web

Semantic Web
• Current web (mostly HTML) is “undefined” information, and the growth is making this even worse
• Semantic web concept would ensure that content providers classify their information, so the web would become more of a smart database of information

Jabin’s Shopping List

HTML:
<H1>Jabin’s Shopping List</H1>
<ul>
<li>Bread</li>
<li>Milk</li>
<li>Bananas</li>
<li>Beans</li>
</ul>

XML:
<list type="grocery" date="5-13-2010">
<title>Jabin’s Shopping List</title>
<grain>Bread</grain>
<dairy>Milk</dairy>
<fruit>Bananas</fruit>
<veggie>Beans</veggie>
</list>

The semantic web both requires and acts on this kind of tagging
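
As a small demonstration of software "acting on" this tagging, the sketch below parses the XML version of the shopping list with Python's standard-library `xml.etree.ElementTree` and asks a question the HTML version cannot answer directly: which items are dairy? The helper name `items_in_category` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The XML version of the shopping list from the slide.
shopping_list_xml = """<list type="grocery" date="5-13-2010">
  <title>Jabin's Shopping List</title>
  <grain>Bread</grain>
  <dairy>Milk</dairy>
  <fruit>Bananas</fruit>
  <veggie>Beans</veggie>
</list>"""

root = ET.fromstring(shopping_list_xml)


def items_in_category(category):
    """Return every item tagged with the given category element name."""
    return [element.text for element in root.findall(category)]


# Map each semantic tag to its item, skipping the title element.
category_of_item = {child.text: child.tag for child in root if child.tag != "title"}
```

In the HTML version every item is just an `<li>`, so a program would have to know on its own that milk is dairy. Here the markup says so.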

A new idea? … Not so much
• May 2001 issue, “Scientific American”
• The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities
  By Tim Berners-Lee, James Hendler and Ora Lassila

• The entertainment system was belting out the Beatles' "We Can Work It Out" when the phone rang. When Pete answered, his phone turned the sound down by sending a message to all the other local devices that had a volume control. His sister, Lucy, was on the line from the doctor's office: "Mom needs to see a specialist and then has to have a series of physical therapy sessions. Biweekly or something. I'm going to have my agent set up the appointments." Pete immediately agreed to share the chauffeuring.

Semantic Web vs. semantic Web
• Grand vision of Semantic Web is a great goal, but will take time
• Meanwhile, each industry has its own vocabulary(ies), which can drive their own semantic webs
• Resource Description Framework (RDF) can and will “bind” these webs together, but each industry vertical can make progress in the interim
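
RDF expresses statements as subject-predicate-object triples, which is what lets separately built vocabularies be joined. The sketch below imitates that idea with plain Python tuples rather than a real RDF library. All URIs are hypothetical (under example.org), and the predicate names `about` and `alsoCalled` are invented for illustration.

```python
# Hypothetical namespaces for two separately maintained "webs".
PUBLISHER = "http://example.org/publisher/"
CLINICAL = "http://example.org/clinical/"

# Statements a publisher might make about its own content.
publisher_triples = {
    (PUBLISHER + "article-42", PUBLISHER + "about", CLINICAL + "hypertension"),
}

# Statements a clinical vocabulary might make about its own terms.
clinical_triples = {
    (CLINICAL + "hypertension", CLINICAL + "alsoCalled", "high blood pressure"),
}

# Because both sides use the same identifier for the concept,
# the two sets of statements can simply be merged into one graph.
graph = publisher_triples | clinical_triples


def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None is a wildcard)."""
    return [
        (s, p, o) for (s, p, o) in sorted(graph)
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def labels_for_article(article):
    """Follow article -> topic -> alternate names across both vocabularies."""
    labels = []
    for _, _, topic in match(subject=article, predicate=PUBLISHER + "about"):
        for _, _, label in match(subject=topic, predicate=CLINICAL + "alsoCalled"):
            labels.append(label)
    return labels
```

Neither party had to know about the other's data in advance. The shared identifier for the concept is what binds the two webs together.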

Implications
• If every industry has its own language, how is that language *expressed*?
• Answer: Taxonomies
• How are those taxonomies applied?
• Answer: Semantic Tagging

Final Thoughts

Importance of Use Cases
• Use Cases should drive strategy and justifications for all of this!
• One taxonomy size/coverage does not fit all
• One method of tagging/indexing does not fit all
  ▫ There is a fundamental difference, tension, and ultimately tradeoff between large concept coverage over a massive amount of data, and precise conceptual expressiveness
• Approach should be tailored to content set and goals for that content set

THANK YOU

Questions?

Jabin White
Director of Strategic Content
Wolters Kluwer
[email protected]
@jabinwhite
Blog: Technically Speaking at http://www.bookbusinessmag.com/channel/technically-speaking
