What is Semantic Publishing? And Why Should I Care?
Jabin White
Director of Strategic Content
Wolters Kluwer Health – P&E
May 13, 2010
PSP Presents – Semantic Publishing: An Introduction
Agenda
• Introductions
• Some definitions
  ▫ Vocabularies, Taxonomies, and Ontologies, Oh My!
• What is metadata, and why should publishers care?
• What is semantic tagging, and why should publishers care?
• Impact of all this on publishers’…
  ▫ Workflows/processes
  ▫ Business cases
• The Semantic Web
• Final Thoughts, Recommendations
Introductions: My Company
• Director of Strategic Content for Wolters Kluwer Health – Professional & Education
• Wolters Kluwer Health includes:
  ▫ Lippincott Williams & Wilkins titles
  ▫ Ovid
  ▫ UpToDate
  ▫ Provation Order Sets
  ▫ Drug Facts & Comparisons
  ▫ Medi-Span
  ▫ Clin-eguide
Introductions: Me
• Started as Editorial Assistant
• Dove into SGML in the mid-90s working on drug reference
• Six years at Elsevier in Electronic Production
• Don’t typecast me!
• Joined WK Health in May 2009
  ▫ Responsible for making sure content flows through company more efficiently (DTDs, Content Management, Authoring Tools, Semantic Enrichment, Product Information Management, etc.)
The Web - Stop the Insanity!
• A few humble web stats:
  ▫ There are 2 billion (billion!) Google searches daily
  ▫ There are 1 trillion (1,000,000,000,000) unique URLs in Google’s index
  ▫ There are 2,695,205 articles in English on Wikipedia
  ▫ It would take 412.3 years to view all the content on YouTube (3/08), but don’t try, because there are 13 hours of video uploaded every minute
** Source: Adam Singer’s “Social Media, Web 2.0 and Internet Stats” site: http://thefuturebuzz.com/2009/01/12/social-media-web-20-internet-numbers-stats/
So What?
• Clay Shirky’s concept of “Filter Failure”
• When the capacity of people to “keep up with” information is exceeded, curation becomes the value differentiator
Definitions
• Controlled vocabulary: a bunch of words, no relationships
  ▫ But there is an advantage if all users use the same terms to describe things
• Taxonomy: a controlled vocabulary with hierarchy
• Thesaurus: often used interchangeably with controlled vocabulary; also sometimes referred to as an ontology
• Ontology: all of the above; think neural network with a bunch of relationships
• Metadata: data about data (we’ll get to that)
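As a rough illustration of how these definitions differ as data structures, here is a minimal Python sketch. The terms reuse examples from this talk (Pinto, car, baseball, sports, Kansas); "vehicle", the relation names, and the helper `broader_terms` are assumptions made up for the example, not part of any standard.

```python
# Controlled vocabulary: an agreed set of terms, no relationships.
CONTROLLED_VOCABULARY = {"pinto", "car", "baseball", "sports"}

# Taxonomy: controlled terms plus a hierarchy (narrower term -> broader term).
# "vehicle" is an illustrative addition.
TAXONOMY = {
    "pinto": "car",
    "car": "vehicle",
    "baseball": "sports",
}

# Ontology: terms plus arbitrary typed relationships, stored here as
# (subject, relation, object) tuples with made-up relation names.
ONTOLOGY = [
    ("pinto", "is_a", "car"),
    ("kansas", "part_of", "united states"),
]


def broader_terms(term, taxonomy):
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    current = term.lower()
    while current in taxonomy:
        current = taxonomy[current]
        chain.append(current)
    return chain


def related(term, relation, ontology):
    """Return every object linked to `term` by `relation`."""
    return [o for s, r, o in ontology if s == term.lower() and r == relation]


print(broader_terms("Pinto", TAXONOMY))
print(related("Kansas", "part_of", ONTOLOGY))
```

The point of the sketch is only that each step up (vocabulary, taxonomy, ontology) adds more kinds of relationships a machine can follow.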
Some Level-Setting
• Unfortunately, these definitions have been diluted to the point of uselessness by their misuse
  ▫ Think “Content Management” around the year 2000
• MetaThesaurus – a collection of all of these things
  ▫ EXAMPLE: UMLS
Information Classification
• Pretty Wonky, Pretty Fast
• Hyperonym: broader term, more general
  ▫ Car is a hyperonym of Pinto
• Hyponym: narrower term
  ▫ Baseball is a hyponym of sports
• Meronym: part term
  ▫ Kansas is a meronym of United States
• Holonym: whole term
  ▫ European Union is a holonym of France
Taxonomies in STM
Some Heavy Hitters
• UMLS
• MeSH
• SNOMED-CT
• ICD-9 and ICD-10
• RxNorm
• LOINC, ICPC-93, and VA/KP Subset of SNOMED
UMLS – Unified Medical Language System
• More than 5 million terms or named entities
• Divided into concepts, and each term has a unique identifier
• Not a vocabulary, but a mapping BETWEEN vocabularies
UMLS
• Vocabularies included in the UMLS:
  ▫ MeSH Headings in 8 languages
  ▫ ICPC-93 in 14 languages
  ▫ WHO Adverse Drug Reaction Terminology in 5 languages
  ▫ SNOMED-2, SNOMED-3, and UK Clinical Terms (former Read Codes)
  ▫ ICD-10 in English and German
  ▫ ICD-10-AM (Australian Modification)
  ▫ ICD-9 (US Modification)
The Semantic Network (UMLS)
• Semantic types are big things like Disease, Syndrome, or Clinical Drug
• Semantic relationships are useful links between semantic types (e.g., Clinical Drug treats Disease or Symptom)
One Concept, Many Names
TERM                              SOURCE VOCABULARY
Atrial fibrillation               ICD-9-CM
AF                                NCI Thesaurus
Afib                              MedDRA
Atrial fibrillation (disorder)    SNOMED Clinical Terms
Atrium; fibrillation              ICPC2-ICD10 Thesaurus
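The "one concept, many names" idea can be sketched as a lookup from each source term to a single shared concept identifier. This is a minimal Python sketch: the identifier `CONCEPT-AF` is a placeholder invented for the example (a real identifier would come from the UMLS itself), and `concept_for` is a hypothetical helper.

```python
# Placeholder identifier for illustration only; not a real UMLS identifier.
ATRIAL_FIBRILLATION = "CONCEPT-AF"

# Terms from the slide, keyed in lowercase, with their source vocabulary.
TERM_TO_CONCEPT = {
    "atrial fibrillation": (ATRIAL_FIBRILLATION, "ICD-9-CM"),
    "af": (ATRIAL_FIBRILLATION, "NCI Thesaurus"),
    "afib": (ATRIAL_FIBRILLATION, "MedDRA"),
    "atrial fibrillation (disorder)": (ATRIAL_FIBRILLATION, "SNOMED Clinical Terms"),
    "atrium; fibrillation": (ATRIAL_FIBRILLATION, "ICPC2-ICD10 Thesaurus"),
}


def concept_for(term):
    """Return the shared concept identifier for a term, or None if unknown."""
    entry = TERM_TO_CONCEPT.get(term.strip().lower())
    return entry[0] if entry else None


# Different names from different vocabularies resolve to the same concept.
print(concept_for("Afib") == concept_for("Atrial fibrillation (disorder)"))
```

Because every name resolves to one identifier, content tagged with any of the five terms can be treated as being about the same thing.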
MeSH – Medical Subject Headings
• An 11-level hierarchy developed and maintained by the National Library of Medicine, part of the US Department of Health and Human Services
• The indexing method for MEDLINE/PubMed
  ▫ Contains more than 16 million references to journal articles in the life sciences, with concentration in biomedicine
  ▫ 5,200 journals worldwide in 37 languages
  ▫ Since 2005, 2,000-4,000 references are added daily, Tuesday-Saturday, all indexed to MeSH
  ▫ Loading suspended for two weeks every November/December while MeSH is updated
The MeSH Staff
SNOMED-CT
• Systematized Nomenclature of Medicine (Clinical Terms)
• 344,000 concepts, arguably the most complete clinical taxonomy in the world
• Developed and maintained by the College of American Pathologists
• Licensed by NLM, freely available to license as part of UMLS
• US Standard for electronic health information exchange by Health IT standards panel
• Adopted for use by US government through the Consolidated Health Informatics (CHI) initiative
ICD-9 and ICD-10
• International Classification of Diseases
• Version 9 moving to Version 10 (US is slower than rest of the world on this)
• Codes that define diseases:
  ▫ Example: 411.0 = Postmyocardial infarction syndrome (aka Dressler’s Syndrome)
• Used to drive insurance reimbursements, billing, and other classifications of diseases
• Used by US government to calculate morbidity and mortality figures
RxNorm
• Standardized names for drugs, collections of drugs, and delivery devices
• Like MeSH, developed and maintained by National Library of Medicine
• Also includes standard way of expressing generic and trade names, ingredients, strengths, and dose forms
LOINC Mapping Files
• Logical Observation Identifiers Names and Codes
• A set of universal names and ID codes for identifying laboratory and clinical test results
• Used to better communicate with HIT (Health Information Technology) systems
• Not much of an impact on publishers, but we should know about them
What is Metadata, and Why Should Publishers Care?
What is Metadata?
• Reading most definitions of metadata and related standards is like trying to resolve disputes with my kids
• Metadata is “data about data”
  ▫ But what does that mean?
• Its use may be increasing, but metadata is NOT new
Why Should Publishers Care
• In the move from print publishing to digital, metadata is a powerful tool to help publishers get content in the right place, in the right format, and known to the right systems and people, at the right time
• Print books were easy
  ▫ Everyone knew what they were
  ▫ You could really only use them one way
  ▫ They had a beginning, an end, a physical presence, and a set price (mostly)
Why Should Publishers Care
• Today, computers are often communicating with one another as much as they are with users (people)
• Metadata becomes critical in:
  ▫ B2B relationships
  ▫ Enhancing B2C relationships
  ▫ B2-_________ relationships
• The quality of the metadata gives publishers a more powerful voice in what happens to their content
Why Should Publishers Care?
• For example:
  ▫ A digital asset (an image)
  ▫ What file format is it?
  ▫ How big is the image?
  ▫ Who took the picture?
  ▫ Who owns the picture?
  ▫ Can you use it on your web site? If you do, what credit do you have to give to the owner?
  ▫ What date was it created?
  ▫ Is it part of a collection?
  ▫ Is it related to another piece of content?
  ▫ Does it stand alone or is it part of a group of images?
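Each of the questions above maps naturally onto a field in a metadata record. The following is a minimal Python sketch of such a record; the class name `ImageMetadata`, the field names, the helper `web_credit`, and all sample values are assumptions for illustration, not an industry schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ImageMetadata:
    """One hypothetical record answering the questions on the slide."""
    file_format: str                    # What file format is it?
    width_px: int                       # How big is the image?
    height_px: int
    photographer: str                   # Who took the picture?
    rights_holder: str                  # Who owns the picture?
    web_use_allowed: bool               # Can you use it on your web site?
    credit_line: Optional[str]          # What credit must be given?
    created: str                        # What date was it created? (ISO 8601)
    collection: Optional[str] = None    # Is it part of a collection?
    related_content: List[str] = field(default_factory=list)


def web_credit(meta):
    """Return the credit line to display, or None if web use is not allowed."""
    return meta.credit_line if meta.web_use_allowed else None


# Sample values are invented for the example.
image = ImageMetadata(
    file_format="JPEG", width_px=1200, height_px=800,
    photographer="A. Photographer", rights_holder="Example Publisher",
    web_use_allowed=True, credit_line="Photo: A. Photographer",
    created="2010-05-13", collection="Sample Image Bank",
)
print(web_credit(image))
```

Once the answers live in fields like these, another system can decide what to do with the image without a person having to look at it.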
Publishers Should Care
• If a publisher’s goal is to disseminate content to the widest possible audience, metadata is critical
Publisher Relationships
• Again, in books you had one use model
• Metadata allows publishers to have diverse relationships with content consumers and other information providers
  ▫ Customers (duh)
  ▫ Aggregators
  ▫ The Open Web (not Google, but other search engines)
      ▫ But don’t try to “game” the search engines with adult keywords; that’s just wrong
      ▫ There have been lawsuits over use of meta keywords, including Playboy suing two adult web sites
  ▫ Technology partners/developers
  ▫ Systems wherein content is a “value add”
  ▫ Multiple output formats
Types of Metadata
• HTML Metadata
  ▫ <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  ▫ <meta name="verify-v1" content="kBoFGUuwppiWVWGx4Ypzkw1Cs1GgMYEMMbfNr7FY65w=" />
  ▫ <meta name="description" content="International publisher of professional health information for physicians, nurses, specialized clinicians & students. Medical & nursing charts, journals, and pda software.">
  ▫ <meta name="keywords" content="springhouse, medical book, nursing journal, medical pda software, lippincott medical reference, lww, lippincott, lww com, medical publisher">
  ▫ <link rel="stylesheet" href="/css/style.css" type="text/css">
For people
For search engines
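To show how a machine reads tags like the ones above, here is a minimal Python sketch that collects named `<meta>` tags using the standard library's `html.parser`. The class `MetaCollector`, the helper `extract_meta`, and the shortened sample page are assumptions for the example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:  # skip http-equiv and other unnamed meta tags
            self.meta[name] = attributes.get("content") or ""


def extract_meta(html):
    """Return a dict of meta name -> content for the given HTML text."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta


# Shortened sample based on the tags shown on the slide.
SAMPLE = """
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="description" content="International publisher of professional health information">
<meta name="keywords" content="medical book, nursing journal, medical publisher" />
"""

for name, content in extract_meta(SAMPLE).items():
    print(name, "->", content)
```

A search engine or aggregator does something similar at much larger scale, which is why the quality of what goes into these tags matters.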
Types of Metadata
• Classifying Metadata
  ▫ ISBN (I told you this wasn’t new)
  ▫ Dewey Decimal System
  ▫ Books in Print/CIP/Library of Congress data
  ▫ MARC records
  ▫ DOI (Digital Object Identifier)
• Descriptive Metadata (sorry, my examples are from STM)
  ▫ ICD-9 and ICD-10 Codes
  ▫ MeSH
  ▫ SNOMED-CT
  ▫ NANDA, NIC, NOC for Nursing
  ▫ NDC, HCPCS for drugs
OLD NEW
Semantic Metadata
• Using controlled vocabularies, extra power can be added to content via semantic tagging to drive:
  ▫ More precise searching
  ▫ Contextually-based connections
  ▫ Lowering of “two terms meaning the same thing” syndrome (hypertension vs. high blood pressure; heart attack vs. myocardial infarction)
  ▫ Filling in of content gaps
• Semantic tagging *is* metadata, but it deserves its own section (coming up)
What is Semantic Tagging?
Semantic Basics
• Semantics is tagging that describes what content *is* and not how it should *look* on the page or screen
• Contrast to structural tagging, which is made of elements such as <para>, <list>, and <title>
• Both are XML, but semantics is like XML on steroids!
• Doing semantic tagging without a controlled vocabulary is madness for scholarly publishing
  ▫ Think “folksonomies”
Manual Tagging
• DESCRIPTION: A subject matter expert (SME) reads chapter/article, indexes or tags based on content, resulting in enriched content
• POSITIVES – If precision is needed, and clinical understanding of concepts (i.e., judgment) is required, probably still the best option
• NEGATIVES – Cost prohibitive on large volumes of information; not scalable; inconsistency if controlled vocabulary not followed, or different taggers used
Manual Tagging – Other Factors
• Offshore resources have improved in recent years as “knowledge work” has gone global, resulting in cost reductions
  ▫ Some processes considered “too expensive” to be done manually before could be revisited
• Great dependence on *type* of content, which means use cases should drive workflow decisions
Automated Approaches
• DESCRIPTION: Software crawls content, adds tags/unique identifiers or finds concepts & patterns to drive more intelligent search or entity extraction
• POSITIVES – Very effective in finding “trends” or concepts over a large repository of data; growing industry because of information overload (aka Data Mining, Text Analysis)
• NEGATIVES – Sometimes leads to false positives, lack of precision or judgment by machines processing data
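A very small dictionary-based tagger is enough to show both the appeal and the false-positive problem described above. This Python sketch is an assumption-laden toy, not how any commercial engine works: the two-term vocabulary, the concept identifiers, and the function `auto_tag` are all invented for the example.

```python
import re

# Toy vocabulary: surface term -> made-up concept identifier.
VOCABULARY = {
    "cold": "CONCEPT-COMMON-COLD",
    "flu": "CONCEPT-INFLUENZA",
}


def auto_tag(text):
    """Return the concept identifiers whose term appears as a whole word."""
    tags = []
    for term, concept_id in VOCABULARY.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.append(concept_id)
    return tags


# A correct hit: the sentence really is about the two conditions.
print(auto_tag("Patient presents with cold and flu symptoms."))

# A false positive: "cold" here is a storage instruction, not a disease,
# but the software has no judgment and tags it anyway.
print(auto_tag("Keep the vaccine cold during transport."))
```

The second call still returns the common-cold concept, which is exactly the kind of error an SME would catch and software alone would not.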
Automated Approaches – Other Factors
• If used effectively, quick wins on large repositories
• Can be used to accomplish projects that would never be attempted (or approved) manually
Combination Approaches
• DESCRIPTION: Automated process followed by SME checking (deeper level than straight QA) and addition of specific conceptual information
• POSITIVES – best of both worlds for projects that deserve it; can drive precision but can also cover large repositories
• NEGATIVES – costs; every time software or people act on your content, there are costs – you don’t get a discount from either because you are doing both
FUD Around Semantic Search
• Semantic Search engines
  ▫ TEMIS, Collexis, NetBase, Vivisimo, OpenCalais
  ▫ Finding semantic concepts based on entities and search algorithms
  ▫ Finding a needle in a haystack
• Semantic Tagging
  ▫ People (SMEs) identify concepts and tag accordingly
  ▫ Drives precision in search and other things
  ▫ Finding the right needle in a stack of 10 needles
A Note About “Folksonomies”
• Having users “tag” or classify data is increasing in popularity
• Not much use in clinical areas of health sciences
• If you are sick, do you want to know what 100 people think, or the one expert?
Impact on Publishers
Impact on Publishers
• Impact depends on how deep you want to go
  ▫ i.e., what am I going to get in return for investing in metadata, and is it worth it?
  ▫ More and more, this is not an “if” proposition, it’s “how much”
• Publishers who buy in have two basic choices on approach:
Option 1: Metadata in the Workflow
• Requires deeper commitment, but has bigger potential upside
  ▫ Positive impact on product creation and development
• Requires thinking about tools, workflows, and enterprise-level systems to allow for creation and MAINTENANCE of metadata
• Combination of good metadata in the workflow and creativity in product development team can pay big benefits
• Allows participation of authors (or subject matter experts in lieu of) at the beginning of the workflow
Option 2: Outside the Workflow
• Requires lesser commitment, but potentially fewer rewards
• Can be done with zero impact on current systems
• Has benefit of content being in “final form” (whatever that means anymore) when intelligence is added in metadata
• Can keep SMEs as a separate offshoot of the workflow – easily outsourced
• Can attack this problem with brute force semantic search engines, but this is a different thing
Impact on Publishers
• Active vs. Passive Metadata
  ▫ Active metadata
      ▫ Publisher intentionally associates markup with certain pieces of content
      ▫ Often using controlled vocabulary
      ▫ Includes semantic indexing
      ▫ Can also be machine-based, using scripts, etc.
  ▫ Passive metadata
      ▫ Metadata created based on use of content
      ▫ Image X was used as part of an image bank on pediatric
      ▫ Inheritance of properties from parent objects
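"Inheritance of properties from parent objects" can be sketched with Python's `collections.ChainMap`: values set directly on an item win, and anything missing falls through to the parent. The records and the helper `effective_metadata` below are hypothetical, invented only to illustrate the idea.

```python
from collections import ChainMap

# Hypothetical parent object: an image bank with its own metadata.
IMAGE_BANK = {"subject": "pediatrics", "rights_holder": "Example Publisher"}

# Hypothetical child object: only the values someone set on it directly.
IMAGE_X = {"file_format": "JPEG"}


def effective_metadata(own, *parents):
    """Merge metadata so own values win and missing ones are inherited."""
    return dict(ChainMap(own, *parents))


merged = effective_metadata(IMAGE_X, IMAGE_BANK)
print(merged["file_format"])  # set directly on the image
print(merged["subject"])      # inherited from the image bank
```

Nobody had to tag the image with a subject; it picked one up passively from where it was used.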
Implications for Search
• Machines don’t know the difference between hypertension and high blood pressure
  ▫ More accurately, machines don’t know they are the SAME
• How this is handled is a matter of User Experience (did you mean? … give them the result … etc.), but the content must be tagged first
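The difference tagging makes to search can be sketched in a few lines of Python. A plain keyword search misses a document that says "hypertension" when the user types "high blood pressure"; a search over concept tags finds it. The concept identifiers, sample documents, and function names here are assumptions for illustration.

```python
# Synonyms mapped to made-up concept identifiers.
SYNONYMS = {
    "hypertension": "C-HTN",
    "high blood pressure": "C-HTN",
    "heart attack": "C-MI",
    "myocardial infarction": "C-MI",
}

# Hypothetical content repository.
DOCS = {
    "a": "Managing hypertension in adults",
    "b": "Recovery after a heart attack",
}


def tag(text):
    """Return the set of concept identifiers whose terms appear in the text."""
    lowered = text.lower()
    return {concept for term, concept in SYNONYMS.items() if term in lowered}


def keyword_search(query, docs):
    """Untagged search: the literal query string must appear in the text."""
    return [doc_id for doc_id, text in docs.items() if query.lower() in text.lower()]


def concept_search(query, docs):
    """Tagged search: match documents that share a concept with the query."""
    wanted = tag(query)
    return [doc_id for doc_id, text in docs.items() if wanted & tag(text)]


print(keyword_search("high blood pressure", DOCS))
print(concept_search("high blood pressure", DOCS))
```

How the result is then presented ("did you mean?" versus simply returning the hit) is the user-experience decision; the tagging is what makes either option possible.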
Linking Content Within the Workflow
• Use models have changed in health sciences
• Customers don’t expect (or don’t have time) to exit a system to check clinical information
  ▫ It needs to be at the Point of Care
• We need to have content linked into customer workflows, and taxonomies drive this
The Semantic Web
Semantic Web
• Current web (mostly HTML) is “undefined” information, and the growth is making this even worse
• Semantic web concept would ensure that content providers classify their information, so the web would become more of a smart database of information
Jabin’s Shopping List
HTML:
  <H1>Jabin’s Shopping List</H1>
  <ul>
    <li>Bread</li>
    <li>Milk</li>
    <li>Bananas</li>
    <li>Beans</li>
  </ul>
XML:
  <list type="grocery" date="5-13-2010">
    <title>Jabin’s Shopping List</title>
    <grain>Bread</grain>
    <dairy>Milk</dairy>
    <fruit>Bananas</fruit>
    <veggie>Beans</veggie>
  </list>
The semantic web both requires and acts on this kind of tagging
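To make "acts on this kind of tagging" concrete, here is a minimal Python sketch that parses the XML version of the shopping list with the standard library's `xml.etree.ElementTree` and asks a question the HTML version cannot answer: which items are dairy? The helper `items_of_kind` is a name made up for the example.

```python
import xml.etree.ElementTree as ET

# The XML version of the shopping list from the slide.
SHOPPING_LIST = """<list type="grocery" date="5-13-2010">
  <title>Jabin's Shopping List</title>
  <grain>Bread</grain>
  <dairy>Milk</dairy>
  <fruit>Bananas</fruit>
  <veggie>Beans</veggie>
</list>"""


def items_of_kind(xml_text, kind):
    """Return the text of every direct child element named `kind`."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(kind)]


print(items_of_kind(SHOPPING_LIST, "dairy"))
print(ET.fromstring(SHOPPING_LIST).get("type"))
```

With the HTML version, every item is just an `<li>`, so software would have to guess that milk is dairy; with the semantic version, the tag says so.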
A new idea? … Not so much
• May 2001 issue, “Scientific American”
• The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities
  By Tim Berners-Lee, James Hendler and Ora Lassila
• The entertainment system was belting out the Beatles' "We Can Work It Out" when the phone rang. When Pete answered, his phone turned the sound down by sending a message to all the other local devices that had a volume control. His sister, Lucy, was on the line from the doctor's office: "Mom needs to see a specialist and then has to have a series of physical therapy sessions. Biweekly or something. I'm going to have my agent set up the appointments." Pete immediately agreed to share the chauffeuring.
Semantic Web vs. semantic Web
• Grand vision of Semantic Web is a great goal, but will take time
• Meanwhile, each industry has its own vocabulary(ies), which can drive their own semantic webs
• Resource Description Framework (RDF) can and will “bind” these webs together, but each industry vertical can make progress in the interim
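RDF describes things as subject-predicate-object statements ("triples") whose parts are identified by URIs, which is what lets statements from different sources be combined. The Python sketch below models that idea with plain tuples rather than an RDF library; every URI is a hypothetical placeholder under example.org, and `subjects_with` is a made-up helper.

```python
# Statements from two hypothetical sources, as (subject, predicate, object).
PUBLISHER_TRIPLES = {
    ("http://example.org/article/42",
     "http://example.org/terms/about",
     "http://example.org/concept/hypertension"),
}
DRUG_DATABASE_TRIPLES = {
    ("http://example.org/drug/7",
     "http://example.org/terms/treats",
     "http://example.org/concept/hypertension"),
}

# Because both sources use the same URI for the concept, their statements
# can simply be pooled into one graph.
GRAPH = PUBLISHER_TRIPLES | DRUG_DATABASE_TRIPLES


def subjects_with(obj, graph):
    """Return every subject that makes any statement about `obj`, sorted."""
    return sorted(s for s, _predicate, o in graph if o == obj)


print(subjects_with("http://example.org/concept/hypertension", GRAPH))
```

The shared identifier is doing the "binding": an article and a drug record from separate webs become connected without either source knowing about the other.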
Implications
• If every industry has its own language, how is that language *expressed*?
• Answer: Taxonomies
• How are those taxonomies applied?
• Answer: Semantic Tagging
Final Thoughts
Importance of Use Cases
• Use Cases should drive strategy and justifications for all of this!
• One taxonomy size/coverage does not fit all
• One method of tagging/indexing does not fit all
  ▫ There is a fundamental difference, tension, and ultimately tradeoff between large concept coverage over a massive amount of data, and precise conceptual expressiveness
• Approach should be tailored to content set and goals for that content set
THANK YOU
Questions?
Jabin White
Director of Strategic Content
Wolters Kluwer [email protected]: @jabinwhite
Blog: Technically Speaking at http://www.bookbusinessmag.com/channel/technically-speaking