Carnegie Mellon School of Computer Science
NSF-Relevant Challenges in Computational Intelligence
Jaime Carbonell ([email protected]) & Tom Mitchell, Guy Blelloch, Randy Bryant, et al.
School of Computer Science, Carnegie Mellon University
26-April-2007
I) Major Computational Intelligence Research Areas
II) Next-Generation Infrastructure (DISC)
Computational Intelligence
• Machine Learning
  Inductive learning algorithms, active learning
  Data mining & novel pattern detection
• Language Technologies
  Multilingual & next-generation search engines
  Machine translation (e.g. Arabic to English)
• Perception
  Computer vision, tactile sensing (e.g., in robotics)
• Planning & optimizing
  Reasoning & planning under uncertainty
  Non-linear optimization (beyond O. R.) w/uncertainty
• Key scientific applications
  Proteomics, genomics, computational biology
  Modeling human brain functions
Machine Learning
Object recognition
Data Mining
Speech Recognition
Automated Control learning
• Reinforcement learning
• Predictive modeling
• Pattern discovery
• Hidden Markov models
• Convex optimization
• Explanation-based learning
• ....
Extracting facts from text
Leveraging Existing Data Collecting Systems: 1999 Influenza Outbreak [Moore, 2002]
[Figure: data streams plotted by week (1999-2000)]
• Influenza cultures
• Sentinel physicians
• WebMD queries about ‘cough’ etc.
• School absenteeism
• Sales of cough and cold meds
• Sales of cough syrup
• ER respiratory complaints
• ER ‘viral’ complaints
• Influenza-related deaths
Cluster Evolution and Density Change Detection: d²F(r(t))/dt²
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]
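The slide title names a second time derivative of a density function as the change signal. As a rough sketch only, assuming the density F(r(t)) is sampled at equally spaced time steps (the function name, sample series, and step size below are hypothetical and not from the slides), the derivative can be approximated with a central second difference:

```python
def second_difference(values, dt=1.0):
    """Central finite-difference estimate of the second time derivative.

    `values` is a hypothetical series of density measurements taken at
    equally spaced times; `dt` is the assumed spacing between samples.
    """
    return [
        (values[i + 1] - 2 * values[i] + values[i - 1]) / dt ** 2
        for i in range(1, len(values) - 1)
    ]

# Illustrative series (made up for this sketch, not data from the talk).
steady = [5, 5, 5, 5, 5]          # no change in density
linear = [1, 2, 3, 4, 5]          # constant growth rate
accelerating = [1, 2, 4, 8, 16]   # growth rate itself increases

print(second_difference(steady))
print(second_difference(linear))
print(second_difference(accelerating))
```

A series whose growth rate is constant yields a zero second difference, so only accelerating density changes stand out; how the talk maps such values to its event types is not stated in the slides.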
Classifier = Rocchio, Topic = Civil War (R76 in TREC10), Threshold = MLR
MLR threshold function: locally linear, globally non-linear
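For readers unfamiliar with the Rocchio classifier named above, here is a minimal sketch: a topic prototype is built as a weighted difference between the centroid of on-topic documents and the centroid of off-topic documents, and a new document is scored by cosine similarity to that prototype. The raw term counts, the `alpha`/`beta` values, the sample documents, and the fixed threshold are all assumptions for illustration; the fixed threshold merely stands in for the MLR threshold function, which the slide does not define.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(text.lower().split())

def centroid(vectors):
    """Average of a non-empty list of term-count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}

def rocchio_prototype(positives, negatives, alpha=1.0, beta=0.5):
    """Prototype = alpha * positive centroid - beta * negative centroid."""
    pos, neg = centroid(positives), centroid(negatives)
    terms = set(pos) | set(neg)
    return {t: alpha * pos.get(t, 0.0) - beta * neg.get(t, 0.0) for t in terms}

def cosine(doc, proto):
    """Cosine similarity between a document vector and the prototype."""
    dot = sum(w * proto.get(t, 0.0) for t, w in doc.items())
    norm = math.sqrt(sum(w * w for w in doc.values())) * \
           math.sqrt(sum(w * w for w in proto.values()))
    return dot / norm if norm else 0.0

def is_on_topic(text, proto, threshold=0.0):
    """Fixed threshold as a placeholder for a learned threshold function."""
    return cosine(vectorize(text), proto) > threshold

# Hypothetical training documents for a "Civil War" topic.
on_topic = [vectorize("civil war battle"), vectorize("civil war general")]
off_topic = [vectorize("stock market report"), vectorize("market prices fall")]
proto = rocchio_prototype(on_topic, off_topic)

print(is_on_topic("civil war soldiers", proto))
print(is_on_topic("stock prices fall", proto))
```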
Info-Age Bill of Rights
• Get the right information (Search Engines)
• To the right people (Personalization)
• At the right time (Anticipatory Analysis)
• On the right medium (Speech Recognition)
• In the right language (Machine Translation)
• With the right level of detail (Summarization)
MMR vs Current Search Engines
[Figure: a query and its documents, comparing MMR selection with standard IR ranking; λ controls spiral curl]
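To make the role of λ concrete, here is a minimal sketch of Maximal Marginal Relevance (MMR) re-ranking: each step picks the document that maximizes λ · sim(d, query) − (1 − λ) · max sim(d, already selected). The bag-of-words cosine similarity, the function names, and the sample documents are assumptions for illustration, not the system shown in the talk.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_rank(query, docs, lam=0.7, k=None):
    """Return document indices in MMR order.

    lam = 1.0 reduces to plain relevance ranking; smaller values
    penalize documents similar to ones already selected.
    """
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    k = len(docs) if k is None else min(k, len(docs))
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical documents; document 1 is an exact duplicate of document 0.
docs = [
    "civil war battle history",
    "civil war battle history",
    "civil war economic causes",
]
relevance_only = mmr_rank("civil war battle", docs, lam=1.0)
diversified = mmr_rank("civil war battle", docs, lam=0.5)
print(relevance_only, diversified)
```

With λ = 1.0 the duplicate is ranked second because only relevance counts; with λ = 0.5 the duplicate drops below the less relevant but novel third document.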
Types of Machine Translation
[Figure: MT pyramid from Source (Arabic) to Target (English)]
• Interlingua: Syntactic Parsing and Semantic Analysis on the source side, Sentence Planning and Text Generation on the target side
• Transfer Rules
• Direct: SMT, EBMT (requires massive data resources)
2005 NIST Arabic-English MT
• Interlingual MT
  Grammars, semantics
  Best for focused domains
• Corpus-Based MT
  Pre-translated text (10-200M words)
  Target language text (100M - 1 trillion words)
  Best for general MT
• Context-Based MT
  Improved variant of corpus-based MT
  Perfect client for DISC
[Figure: BLEU score (axis 0.0 to 0.7) by system: Google, ISI, IBM + CMU, UMD, JHU-CU, Edinburgh, Systran, Mitre, FSC. Quality bands marked on the axis: Expert human translator; Usable translation; Human-editable translation; Topic identification; Useless region]
Arabic Statistical-MT Output
جميع / / 17بكين وروس صينيون مسئولون حث شينخوا يناير
الهدوء " التزام علي المعنية بشان " االطراف النفس ضبط وممارسةالشعبية . الديمقراطية كوريا بجمهورية الخاصة النووية القضية
الخارجية وزير ونائب تشانغ ون يانغ الصيني الخارجية وزير نائب التقي وقدالكسندر الروسي
مواصلة الي المعنية االطراف دعيا حيث غداء مادبة علي لوسيوكوفالسلمي الحل اجل من السعي
الحالي . المعقد الوضع ظل في الحوار خالل من
Beijing January 17 / Shinhua / the Chinese and Russian officials urged all parties concerned to " remain calm and exercise restraint " over the nuclear issue of the Democratic People's Republic of Korea.
He met with vice Chinese foreign minister Yang Chang won the deputy of the Russian foreign minister Alexander Losyukov at a lunch with invited interested parties to continue the search for a peaceful solution through dialogue under the current complicated situation.
BLEU = .64
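For context on the BLEU figure, here is a simplified sketch of how a BLEU-style score is computed: clipped n-gram precisions for n = 1..4 are combined by a geometric mean and multiplied by a brevity penalty. Official evaluations compute BLEU over a whole test corpus, typically with several reference translations; this single-sentence, single-reference, unsmoothed version and its sample sentences are assumptions for illustration and will not reproduce the score on the slide.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # unsmoothed: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: 1 if the candidate is longer than the reference.
    if len(cand) > len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

# Hypothetical sentences, not taken from the NIST test set.
reference = "officials urged all parties to remain calm"
exact = bleu("officials urged all parties to remain calm", reference)
near = bleu("officials urged all parties to stay calm", reference)
print(exact, round(near, 3))
```

A single substituted word lowers every n-gram precision that spans it, which is why even a near-perfect sentence scores well below 1.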
What About Minor Languages or Dialects without Massive Data?
PROTEINS: Sequence → Structure → Function
(Borrowed from: Judith Klein-Seetharaman)
Primary Sequence:
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding
3D Structure
Complex function within network of proteins
Normal
PROTEINS: Sequence → Structure → Function
Primary Sequence:
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding
3D Structure
Complex function within network of proteins
Disease
Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins
• The gap between known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
  Therefore we need to predict structures in silico
Linked Segmentation CRF
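• Node: secondary structure elements and/or simple fold
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: conditional probability of y given x is defined as

$$P(\mathbf{y}_1,\ldots,\mathbf{y}_R \mid \mathbf{x}_1,\ldots,\mathbf{x}_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big(\sum_{k} \theta_{1,k}\, f_k(\mathbf{x}_i, y_{i,j})\Big) \prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big(\sum_{l} \theta_{2,l}\, g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b})\Big)$$

  where y_1, ..., y_R are the joint labels, f_k and g_l are node and edge feature functions with weights θ, and Z is the normalizer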
Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation
fMRI to observe human brain activity
Machine learning to discover patterns in complex data
New discoveries about human brain function
Our algorithms have learned to distinguish which kind of word a human subject is reading (e.g. ‘tools’ or ‘buildings’) with 90% accuracy
Data
Requisite Infrastructure
• Data Intensive SuperComputing (DISC) for tera-scale and peta-scale data repositories
• Advanced algorithms research
  Massively-parallel decomposition
  Scalability in analytics & learning
  Extracting compact models for run-time
  Planning, reasoning, learning w/uncertainty
  Active Learning (maximally reducing uncertainty)
• Domain expertise (e.g. proteomics, neural sciences, astronomy, network security, …)
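The active learning item above can be illustrated with uncertainty sampling, one common heuristic: ask for the label of the unlabeled example whose predicted class distribution has the highest entropy. This is a minimal sketch, not necessarily the criterion the authors have in mind; the toy logistic model, its parameters, and the pool of points are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def most_uncertain(pool, predict_proba):
    """Pick the unlabeled example the current model is least sure about."""
    return max(pool, key=lambda x: entropy(predict_proba(x)))

def toy_predict_proba(x, w=2.0, b=-1.0):
    """Hypothetical 1-D logistic model; its decision boundary is at x = 0.5."""
    p1 = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return [1.0 - p1, p1]

# Hypothetical pool of unlabeled 1-D examples.
pool = [-2.0, 0.1, 0.5, 0.9, 3.0]
query = most_uncertain(pool, toy_predict_proba)
print(query)
```

The selected point sits on the model's decision boundary, where a label is expected to be most informative under this heuristic.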
System Comparison: Data
DISC
System collects and maintains data
• Shared, active data set
Computation colocated with storage
• Faster access

Conventional Supercomputers
Data stored in separate repository
• No support for collection or management
Brought into system for computation
• Time consuming
• Limits interactivity
Program Model Comparison
DISC
Application programs written in terms of high-level operations on data
Runtime system controls scheduling, load balancing, …
Stack (top to bottom): Application Programs; Machine-Independent Programming Model; Runtime System; Hardware

Conventional Supercomputers
Programs described at very low level
• Specify detailed control of processing & communications
Rely on small # of software packages
• Written by specialists
• Limits classes of problems & solution methods
Stack (top to bottom): Application Programs; Software Packages; Machine-Dependent Programming Model; Hardware
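The DISC programming model described above can be pictured with a toy example in which the application supplies only high-level operations and a small "runtime" decides how to partition and schedule the work. The slides do not name a specific framework, so the map/reduce style, the `run_job` helper, and the word-count task are all assumptions chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def run_job(records, map_fn, reduce_fn, workers=4):
    """Toy stand-in for a runtime system.

    It partitions the records, applies map_fn to each partition in
    parallel, and merges partial results with reduce_fn. The application
    never specifies partitioning, scheduling, or communication.
    """
    chunks = [records[i::workers] for i in range(workers)]

    def process(chunk):
        return reduce(reduce_fn, (map_fn(r) for r in chunk), Counter())

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process, chunks))
    return reduce(reduce_fn, partials, Counter())

# Application code: two high-level operations on the data.
def count_words(line):
    return Counter(line.lower().split())

def merge_counts(a, b):
    return a + b

# Hypothetical input records.
lines = ["data intensive computing", "data mining", "intensive data analysis"]
totals = run_job(lines, count_words, merge_counts)
print(totals["data"], totals["intensive"])
```

Changing `workers` alters how the work is split but not the application code or its result, which is the machine-independence the comparison emphasizes.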
Final Thoughts
• Opportunities in Computational Intelligence
  Machine learning for tough problems: relevant novelty detection, structural learning, active learning
  Scientific applications: Computational X (X = biology, linguistics, astrophysics, chemistry, …)
• Next generation computational infrastructure
  DISC principle (beyond HPC, beyond grid, …)
  Algorithmic fundamentals
• International programs (on common problems)