Text Intelligence For Science · 2016. 12. 19. · Mads Rydahl, mads@unsilo.ai, Chief Visionary Officer, UNSILO

UNSILO

Artificial Intelligence Startup with a small agile team focussed on Scientific Publishing

Vision

Making it easy and fast to find relevant knowledge and discover new patterns

Automated. Because scientific language is constantly growing, evolving, and accelerating.
Omniscient. Because important findings may not be apparent. Even to the author.
Unbiased. Because existing solutions rank by popularity and cause filter bubbles.

The Problem

Discovery: finding new stuff that’s relevant to what you’re doing

Information Extraction: beyond named entity recognition

Meaningful Key Phrases: nested and overlapping novel multi-word concepts

Like: Indirect Sodium-Selective Electrode Potentiometry

Our Toolbox

Big Data: Corpus-wide analysis; learning language by example

Machine Learning: Word embeddings; phrases used interchangeably have similar meaning

The Cloud: 100 years using one machine... or 3 days using 10,000 machines

Not: Single-document analysis, TF/IDF, document vectors, bag-of-words

Core Technology

Finds key phrases in any text and uses Machine Learning to identify novel ideas

Open languages, libraries, and frameworks: Apache UIMA, Apache Ruta, Stanford NLP tools, DKPro, Hadoop, Spark, TensorFlow, Mahout, Vowpal Wabbit, GenSim, LevelDB, Elasticsearch, Docker, CloudSigma, AWS

Full Text Search

Pseudohyponatremia: Does It Matter in Current Clinical Practice?
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3894530/
doi: 10.5049/EBP.2006.4.2.77

Serum consists of water (93% of serum volume) and nonaqueous components, mainly lipids and proteins (7% of serum volume). Sodium is restricted to serum water. In states of hyperproteinemia or hyperlipidemia, there is an increased mass of the nonaqueous components of serum and a concomitant decrease in the proportion of serum composed of water. Thus, pseudohyponatremia results because the flame photometry method measures sodium concentration in whole plasma. A sodium-selective electrode gives the true, physiologically pertinent sodium concentration because it measures sodium activity in serum water. Whereas the serum sample is diluted in indirect potentiometry, the sample is not diluted in direct potentiometry. Because only direct reading gives an accurate concentration, we suspect that indirect potentiometry which many hospital laboratories are now using may mislead us to confusion in interpreting the serum sodium data. However, it seems that indirect potentiometry very rarely gives us discernibly low serum sodium levels in cases with hyperproteinemia and hyperlipidemia. As long as small margins of errors are kept in mind of clinicians when serum sodium is measured from the patients with hyperproteinemia or hyperlipidemia, the present methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry could be maintained in the clinical practice.

Using Dictionaries and Ontologies

Key: Chemical, Technique, Anatomy, Disease, Species

UNSILO Concept Extraction

Key: Chemical, Technique, Anatomy, Disease, Species

UNSILO Semantic Mapping

Key: Action/Relation, Chemical, Technique, Anatomy, Disease, Species

Phrase Extraction

■ Natural Language Processing: sentences are annotated with part-of-speech tags (noun, verb, adjective) and a dependency tree

methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry

methods: thing
measuring: action
sodium concentration: thing
serum: thing
indirect sodium-selective electrode potentiometry: thing

■ Extract all “things”
Method
Sodium concentration
Serum
Indirect Sodium-Selective Electrode Potentiometry
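
The extraction of "things" from a tagged sentence can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not UNSILO's implementation: the part-of-speech tags are assigned by hand (a real pipeline would take them from a tagger and a dependency parser), and `extract_things` is a hypothetical helper that simply groups consecutive adjectives and nouns into one candidate phrase.

```python
from typing import List, Tuple

# Hand-assigned tags for the example sentence (an assumption for this sketch;
# a real pipeline would obtain them from a part-of-speech tagger).
TAGGED: List[Tuple[str, str]] = [
    ("methods", "NOUN"), ("for", "ADP"), ("measuring", "VERB"),
    ("sodium", "NOUN"), ("concentration", "NOUN"), ("in", "ADP"),
    ("serum", "NOUN"), ("by", "ADP"), ("indirect", "ADJ"),
    ("sodium-selective", "ADJ"), ("electrode", "NOUN"),
    ("potentiometry", "NOUN"),
]


def extract_things(tagged: List[Tuple[str, str]]) -> List[str]:
    """Group runs of adjectives and nouns into candidate phrases ("things")."""
    phrases: List[str] = []
    current: List[str] = []
    has_noun = False
    for word, tag in tagged:
        if tag in ("ADJ", "NOUN"):
            current.append(word)
            has_noun = has_noun or tag == "NOUN"
            continue
        # Any other tag closes the current run; keep it only if it has a noun.
        if current and has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False
    if current and has_noun:
        phrases.append(" ".join(current))
    return phrases


things = extract_things(TAGGED)
print(things)
```

The sketch keeps surface forms as they appear ("methods" rather than "Method"); reducing them to a common form belongs to the normalization step.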

Boundary detection and Normalization

■ Reduce Morphological and Syntactic variation (Grammar, form)
■ Normalize adjectival modifiers, compound paraphrases, and expand coordinations

Concentration of Sodium >> Sodium Concentration
The Electrode Potentiometry was indirect >> Indirect Electrode Potentiometry
Methodology >> Method

■ Reduce Lexical and Semantic variation (Synonyms, hypernyms, ontologies)
■ Normalize semantic Level-of-Detail using ontologies and vector models

Serum Sample >> Blood Sample
Sodium Concentration >> Natrium Concentration
Indirect Electrode Potentiometry >> Electroanalysis

■ Remove rare super-grams and hyponyms (C-level filtering, distribution metrics)
■ E.g. “Clinically validated indirect sodium-selective potentiometry”

■ Snap to common fragments and forms (actual usage and Ontologies)
■ Indirect Sodium Selective Potentiometry is-a-kind-of Indirect Potentiometry is-a-kind-of Electroanalysis
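
A few of these normalization rules can be illustrated with a small rule-based sketch. Everything here is an assumption made for illustration: the lookup tables `LEMMAS`, `SYNONYMS`, and `IS_A` are hand-written stand-ins for ontologies and corpus statistics, the two regular expressions cover only the "X of Y" and "X was Y" examples from the slide, and the `min_count` threshold is arbitrary.

```python
import re
from typing import Dict

# Hypothetical lookup tables; a real system would derive these from
# ontologies, dictionaries, and corpus-wide statistics.
LEMMAS = {"methodology": "method"}
SYNONYMS = {"serum sample": "blood sample"}
IS_A = {
    "indirect sodium selective potentiometry": "indirect potentiometry",
    "indirect potentiometry": "electroanalysis",
}


def normalize(phrase: str) -> str:
    """Reduce syntactic and lexical variation with a few hand-written rules."""
    p = re.sub(r"^the\s+", "", phrase.lower().strip())
    # Compound paraphrase: "concentration of sodium" -> "sodium concentration"
    m = re.fullmatch(r"(.+) of (.+)", p)
    if m:
        p = f"{m.group(2)} {m.group(1)}"
    # Predicative adjective: "X was indirect" -> "indirect X"
    m = re.fullmatch(r"(.+) was (\w+)", p)
    if m:
        p = f"{m.group(2)} {m.group(1)}"
    # Morphological variants, then synonyms
    p = " ".join(LEMMAS.get(w, w) for w in p.split())
    return SYNONYMS.get(p, p)


def snap_to_common(concept: str, counts: Dict[str, int], min_count: int = 5) -> str:
    """Walk up the is-a chain until the concept is frequent enough."""
    while counts.get(concept, 0) < min_count and concept in IS_A:
        concept = IS_A[concept]
    return concept


# Hypothetical corpus counts, for illustration only.
COUNTS = {
    "indirect sodium selective potentiometry": 1,
    "indirect potentiometry": 12,
    "electroanalysis": 40,
}

print(normalize("Concentration of Sodium"))
print(normalize("The Electrode Potentiometry was indirect"))
print(snap_to_common("indirect sodium selective potentiometry", COUNTS))
```

Hand-written rules like these would not scale to open-ended scientific language; they are only meant to make the individual transformations concrete.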

Concept Ranking

■ Rank and Filter using Frequency and Distribution Metrics

Local features:
■ Occurrence count
■ Position in document graph
■ Textual context

Global features:
■ Occurrence count
■ TF/IDF
■ Domain/topic distribution
■ Aggregated textual context

■ Train Ranking Models using External Metrics

Human training data:
■ Article data: Which concepts are included in the abstract and title
■ Behavioral data: Which concepts are clicked on by users
■ Behavioral data: Which articles are clicked on by users (...those with promising titles ;-)

Synthetic training data:
■ Synthetic sentence data: Measure synonymy and recall/precision against a known outcome
■ Synthetic text collections: Aggregate docs using keyword searches, then prune out keywords
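
As one concrete example of a frequency-and-distribution feature, the sketch below ranks a document's concepts by a smoothed TF/IDF score. It is only an illustration of that single feature, not the trained ranking model the slide describes; the function name `rank_concepts`, the smoothing formula, and the toy corpus are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def rank_concepts(doc: List[str], corpus: List[List[str]]) -> List[Tuple[str, float]]:
    """Rank the concepts of one document by a smoothed TF/IDF score."""
    tf = Counter(doc)  # local feature: occurrence count in this document
    n_docs = len(corpus)
    scored = []
    for concept, count in tf.items():
        # Global feature: number of documents that mention the concept.
        df = sum(1 for other in corpus if concept in other)
        idf = math.log((1 + n_docs) / (1 + df))
        scored.append((concept, count * idf))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy data, invented for illustration.
DOC = ["serum", "indirect potentiometry", "indirect potentiometry", "sodium concentration"]
CORPUS = [
    DOC,
    ["serum", "blood sample"],
    ["serum", "sodium concentration"],
]

ranked = rank_concepts(DOC, CORPUS)
for concept, score in ranked:
    print(f"{concept}: {score:.3f}")
```

A concept that appears in every document ("serum") scores zero here, while a concept that is frequent locally but rare globally rises to the top.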

Word Embeddings and Word2Vec

● We build high-dimensional vector-space representations of all concepts from the textual context
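
The idea that concepts used in the same contexts end up with similar vectors can be shown with a count-based stand-in for Word2Vec. Note the substitution: this sketch builds sparse co-occurrence vectors rather than training a neural embedding, and the sentences, the `window` size, and the helper names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def context_vectors(sentences: List[List[str]], window: int = 2) -> Dict[str, Counter]:
    """Represent each token by the counts of tokens seen within `window` positions."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for tokens in sentences:
        for i, token in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus; multi-word concepts are assumed to be pre-joined into one token.
SENTENCES = [
    ["vasodilation", "lowers", "blood_pressure"],
    ["vasodilatation", "lowers", "blood_pressure"],
    ["nitroglycerin", "causes", "vasodilation"],
    ["nitroglycerin", "causes", "vasodilatation"],
    ["serum", "contains", "water"],
]

vecs = context_vectors(SENTENCES)
print(cosine(vecs["vasodilation"], vecs["vasodilatation"]))
print(cosine(vecs["vasodilation"], vecs["water"]))
```

The two spelling variants never co-occur, yet they share every context, so their vectors are nearly identical; "water" shares none and scores zero.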

Ontology Augmented Vector-space

Vasodilatation (finding)
Peripheral vasodilation (finding)
Vasodilator (substance)
Poisoning by vasodilator (disorder)
Vasodilating agent (product)
Intra-cavernosal vasodilator (product)
Intra-arterial vasodilator (product)
Coronary vasodilator (product)
Alpha blocking vasodilator (product)
Nitrate-based vasodilating agent (product)
Human B-type natriuretic peptide (product)
Endothelin receptor antagonist (product)
Pentaerythritol tetranitrate (product)
Nitroglycerin (product)
Isosorbide mononitrate (product)
Isosorbide dinitrate (product)

Measurement of blood pressure (procedure)
Self-measurement devices (product)
Systolic arterial pressure (observable entity)
Non-invasive arterial pressure (observable entity)
Blood pressure finding (finding)
Blood pressure cuff, device (physical object)
Blood pressure cuff inflator (physical object)
Lying blood pressure (observable entity)
Abnormal blood pressure (finding)
Lower tourniquet cuff inflation (procedure)
Cuff inflated (attribute)

principle.n.01: generalization, basic truth, assumption, law

receptor.n.01: plasma membrane molecule, G protein-coupled receptor, ligand-gated ion channel, P2X receptor, P2Y receptor

● We build high-dimensional vector-space representations of all concepts from the textual context

● We apply ontologies and dictionaries to improve occurrence counts of rare, complex, or novel concepts

● We use these normalized concepts to improve recall and precision for rare, complex, or novel concepts

● We use this high-dimensional vector model to build real-time semantic indexes with unprecedented precision
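
How a dictionary can raise the occurrence count of a rare concept is easy to show in miniature. In this sketch, which rests on assumptions throughout, `SURFACE_TO_CONCEPT` is a hand-written stand-in for an ontology lookup and the mentions are invented; surface forms that map to the same concept pool their counts.

```python
from collections import Counter
from typing import Iterable

# Hypothetical dictionary mapping surface forms to a canonical concept.
# A real system would populate this from an ontology.
SURFACE_TO_CONCEPT = {
    "vasodilatation": "vasodilation",
    "peripheral vasodilation": "vasodilation",
    "vasodilating agent": "vasodilator",
    "coronary vasodilator": "vasodilator",
}


def pooled_counts(mentions: Iterable[str]) -> Counter:
    """Count mentions after mapping each surface form to its canonical concept."""
    return Counter(
        SURFACE_TO_CONCEPT.get(m.lower(), m.lower()) for m in mentions
    )


# Invented mentions: each surface form is rare on its own.
MENTIONS = [
    "Vasodilatation", "vasodilation", "Peripheral vasodilation",
    "Vasodilating agent", "Coronary vasodilator", "Nitroglycerin",
]

raw = Counter(m.lower() for m in MENTIONS)
pooled = pooled_counts(MENTIONS)
print(raw["vasodilation"], pooled["vasodilation"])
```

Mapping a narrower term such as "peripheral vasodilation" to its parent trades detail for statistical weight, which is the level-of-detail choice the normalization step has to make.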

Synsets built from Vector Cosine Similarity
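
One simple way such synsets could be formed is greedy grouping by a cosine-similarity threshold. The slide does not say which clustering method is used, so the following is only a sketch under assumptions: the three-dimensional vectors are invented (real embeddings have hundreds of dimensions), and both `build_synsets` and the `threshold` value are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_synsets(vectors: Dict[str, Sequence[float]], threshold: float = 0.9) -> List[List[str]]:
    """Greedily add each term to the first group whose seed term is similar enough."""
    synsets: List[List[str]] = []
    for term, vec in vectors.items():
        for group in synsets:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(term)
                break
        else:
            synsets.append([term])  # no similar group found: start a new one
    return synsets


# Invented toy vectors, for illustration only.
VECTORS = {
    "vasodilation": [0.90, 0.10, 0.00],
    "vasodilatation": [0.88, 0.12, 0.01],
    "blood pressure": [0.10, 0.90, 0.20],
    "arterial pressure": [0.12, 0.85, 0.25],
}

synsets = build_synsets(VECTORS)
print(synsets)
```

Greedy grouping depends on the order in which terms are visited; it is used here only because it is short enough to read at a glance.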

Human-readable Fingerprints

Content-based Recommender based on a verifiable model of document similarity

Traditional Methods
Document vectors based on TF-IDF and Naïve BOW
Slow moving ontologies with simple concepts (“insulin” and “obesity”)
Limited recognition (only lemmatization/stemming)

UNSILO
Dynamic corpus-driven concept similarity
Captures novel significant phrases (“insulin insensitivity”)
Links concepts across terminology variations (“reduced hormone response”)
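
A content-based recommender with a verifiable similarity model might look roughly like this: each document is reduced to a fingerprint (a set of normalized concepts), and every recommendation comes with the shared concepts that justify it. The Jaccard overlap score, the `recommend` function, and the fingerprints below are assumptions chosen for illustration, not the similarity model the slide refers to.

```python
from typing import Dict, List, Set, Tuple


def recommend(
    query: Set[str], candidates: Dict[str, Set[str]], top_n: int = 3
) -> List[Tuple[str, float, List[str]]]:
    """Rank candidates by Jaccard overlap and report the shared concepts."""
    results = []
    for doc_id, fingerprint in candidates.items():
        shared = query & fingerprint
        union = query | fingerprint
        score = len(shared) / len(union) if union else 0.0
        # The sorted shared concepts act as a human-readable explanation.
        results.append((doc_id, score, sorted(shared)))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:top_n]


# Invented fingerprints, for illustration only.
QUERY = {"insulin insensitivity", "obesity", "glucose uptake"}
CANDIDATES = {
    "A": {"insulin insensitivity", "glucose uptake", "exercise"},
    "B": {"obesity", "sleep apnea"},
    "C": {"gold-coated magnetic nanoparticles"},
}

top = recommend(QUERY, CANDIDATES)
for doc_id, score, shared in top:
    print(doc_id, round(score, 2), shared)
```

Because the score is computed from named concepts rather than opaque vector dimensions, a reader can check why document A was suggested by looking at the shared list.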

Services for Science

UNSILO Discovery Widgets

Springer.com

“Using UNSILO’s fully automated content enrichment technology, we can identify the most descriptive concepts and phrases within any document in our content portfolio, and provide more valuable reading suggestions, even across domains with a highly variable terminology.”

Jan-Erik de Boer, Chief Information Officer, Springer Nature

“Our goal with this new feature is to make it easy for our users to drill down on what they find important in an article, and use that insight as a departure point for their discovery process.”

Stephen Cornelius, Product Owner, IT Platform Development, Springer Nature

UNSILO technology vendor for Springer Nature
9M scientific articles and book chapters
22M monthly users
Significant increase in traffic and user engagement
Displaced leading competitor

Easier Content Exploration

UNSILO value for researchers
▪ Point directly to the most important ideas of an article
▪ Provide more relevant suggestions by applying a deep semantic understanding of key article concepts
▪ Allow users to "drill down" and interactively explore key concepts of the most relevant related articles

UNSILO value for Scientific Publishers
▪ A scalable way of adding value across all content types
▪ Supplements or replaces manual curation of ontologies
▪ Broader discovery, reduced bounce rates, longer session times, more article views

Ongoing Development Efforts

■ Normalize Actions and Relationships
Sample linguistic variations of common relationships from re-statements of known facts, then apply what we learn to less well understood domains:

■ Serum consists of water
■ Serum amounts to 93% water
■ Serum contains water
■ Serum is composed of water
■ Serum is mostly water

■ Providing hooks into Unstructured Text
Improve training and prediction capabilities of general AI systems by improving access to consumer feedback, corporate data lakes, or conversations within large communities of practice.

■ Reasoning at Scale
Question answering, uncover hidden causal chains, invalidate futile research projects

■ Augment Researchers’ cognitive abilities
■ Improve the return on R&D investments
■ Improve the productivity of 10M Researchers across the globe

■ Thin film Coated Gold Nano Particles
■ Coating of Iron nano-particles with thin Gold film
■ Fe Nanoparticles thin-film Gold coat
■ Evaporation-coating of nanoparticles with gold
■ Gold-coated magnetic nanoparticles
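
The relationship normalization described on this slide can be illustrated with the five "Serum ... water" restatements. In this hedged sketch the variations are listed by hand in `RELATION_PATTERNS`, whereas the slide describes learning them from re-statements of known facts; the canonical label `has_component` and the function `normalize_relation` are invented for the example.

```python
import re
from typing import Optional, Tuple

# Hand-listed surface variations of one relationship (an assumption; the
# slide describes sampling such variations from text instead).
RELATION_PATTERNS = [
    r"consists of",
    r"is composed of",
    r"contains",
    r"is mostly",
    r"amounts to(?: \d+%)?",
]

PATTERN = re.compile(
    r"^(?P<subj>\w+) (?:" + "|".join(RELATION_PATTERNS) + r") (?P<obj>\w+)$"
)


def normalize_relation(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Map a simple subject-relation-object sentence to a canonical triple."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return (match.group("subj").lower(), "has_component", match.group("obj").lower())


SENTENCES = [
    "Serum consists of water",
    "Serum amounts to 93% water",
    "Serum contains water",
    "Serum is composed of water",
    "Serum is mostly water",
]

triples = {normalize_relation(s) for s in SENTENCES}
print(triples)
```

All five sentences collapse to a single triple, which is the effect the slide is after; quantities such as "93%" are simply discarded in this simplified version.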
