Big Data Recommendation Approaches for Healthcare
Samee U. Khan
Department of Electrical and Computer Engineering
North Dakota State University
Fargo, ND 58108-6050, USA
Outline
• Introduction: Personal & Topic
• Recommendation system models
• Big data recommendation system applications
• Case studies
May 31, 2018
Recommendation Systems and Big Data
Big Data dimensions
• Volume: ever-growing data, e.g., 500 million tweets daily [1] and 600 TB daily on Facebook [2]
• Velocity: time-sensitive applications, e.g., scrutinizing fraud from millions of trade events
• Variety: structured (relational data) and unstructured (text, audio, video, log files, etc.) data; 80% unstructured [3]
• Veracity: data authenticity and correctness
Recommendation systems (introduced in the 1990s)
• Information filtering
• Personalization
• Recommend items/services
Customers' perspective
• Finding items of interest
• Narrow down choices
Providers' perspective
• Customizations
• Predict needs
• Understanding customers' behavior
• Increase sales
• Product promotion
• Trend analysis
Big Data
Recent Web trends require tools and methodologies to efficiently manage the data for curation, processing, and storage.
Challenges
• Storage
• Availability
• Reliability
• Computations
• Scalability
[1] “Internet Live Stats,” http://www.internetlivestats.com/twitter-statistics/, accessed on April 10, 2018.
[2] “Fcode,” https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/, accessed on April 10, 2018.
[3] “Unstructured Data: A Growing Problem,” https://www.waterfordtechnologies.com/unstructured-data-growing-problem/#more-10513, accessed on April 10, 2018.
Recommendation System Models
• Collaborative Filtering
• Content Based Filtering
• Hybrid Filtering
Collaborative Filtering
• Information filtering through human behavior/user profiles
• Commonly employed in commercial recommender systems
  • Example: Amazon
Issues with Collaborative Filtering
• Cold start
  • Requires enough users/items in the system
• Sparsity
  • Occurs due to scarce data points
• Long-tail/popularity bias
  • Recommendation of popular items only
• Scalability
  • Occurs due to increase in users and items
[Figure: collaborative filtering pipeline. The active users' preferences and the existing users database feed the recommendation module, which finds users with similar tastes and generates Top-N recommendations.]
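The collaborative filtering pipeline (find users with similar tastes, then generate Top-N recommendations) can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes cosine similarity over co-rated items and a similarity-weighted average of ratings, neither of which the slide specifies, and the names `users`, `active`, and `recommend_top_n` are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_top_n(active, users, n=2):
    """Rank items the active user has not rated by similarity-weighted rating."""
    scores, weights = {}, {}
    for ratings in users.values():
        sim = cosine_similarity(active, ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in tastes
        for item, rating in ratings.items():
            if item not in active:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:n]


# Hypothetical existing users database and active user's preferences.
users = {
    "u1": {"A": 5, "B": 4, "C": 5},
    "u2": {"A": 1, "B": 2, "D": 5},
    "u3": {"A": 5, "B": 5, "C": 4, "D": 1},
}
active = {"A": 5, "B": 4}
print(recommend_top_n(active, users))
```

With an empty `users` database the function returns an empty list, which is the cold start issue listed above in miniature.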
Recommendation System Models
Content Based Filtering
• Recommendations based on the contents of items instead of users' ratings or opinions
• Requires no information about other users
• Recommendations for users with unique tastes
• No cold start and sparsity issues
• Capable of recommending new or unpopular items
Issues with Content Based Filtering
• Requires meaningful encoding of content features
• Inability to utilize judgement quality of other users
Hybrid Filtering
• Combination of collaborative and content based filtering
  • Popularity data
  • Contents
Issues with Hybrid Filtering
• Datasets interoperability
[Figure: content based filtering pipeline. The active users' profile is built by profile learning from the contents used in the past; the recommendation module then produces Top-N recommendations.]
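The content based pipeline (profile learning from contents used in the past, then Top-N recommendations) might look like the sketch below. It assumes items are already encoded as weighted feature dictionaries and that the profile is a simple average of those features; both are illustrative choices, and `item_features`, `used`, and the function names are hypothetical.

```python
def build_profile(used_items, item_features):
    """Profile learning: average the feature vectors of items used in the past."""
    if not used_items:
        return {}
    profile = {}
    for item in used_items:
        for feature, weight in item_features[item].items():
            profile[feature] = profile.get(feature, 0.0) + weight
    return {f: w / len(used_items) for f, w in profile.items()}


def score(profile, features):
    """Dot product between the user profile and an item's features."""
    return sum(profile.get(f, 0.0) * w for f, w in features.items())


def recommend_top_n(used_items, item_features, n=2):
    """Rank unused items by how well their content matches the profile."""
    profile = build_profile(used_items, item_features)
    candidates = [i for i in item_features if i not in used_items]
    candidates.sort(key=lambda i: score(profile, item_features[i]), reverse=True)
    return candidates[:n]


# Hypothetical content encoding; no other users' data is needed.
item_features = {
    "article_1": {"diabetes": 1.0, "diet": 0.5},
    "article_2": {"diabetes": 0.8, "exercise": 0.6},
    "article_3": {"insurance": 1.0},
    "article_4": {"diet": 1.0, "exercise": 0.2},
}
used = ["article_1"]
print(recommend_top_n(used, item_features))
```

The quality of the result depends entirely on the feature encoding, which is the first issue listed for content based filtering.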
Big Data Recommendation Systems Applications
• Healthcare
  • Health expert recommendation from social media
  • Disease risk assessment (prediction)
  • Health insurance recommendation
• Route Recommendation Systems
  • Social venues
  • Large-scale evacuation
Case Study I: Personalized Healthcare Services [1]
• Increasing trends for finding online health information
  • Health related searches by 93 million Americans (Pew Internet & American Life Project) [2]
• Health information from online health communities
  • Exchange and share disease specific experiences
  • Psychological support from peers (example: PatientsLikeMe [3])
• Increased healthcare expenditure
  • U.S. healthcare expenses approx. 18.2% of the GDP as of 2018 [4]
• Key Contributions:
  • Disease risk assessment (NHANES 2009-2010 dataset [5])
  • Health expert recommendation from Twitter
    • 1,500,000,000+ healthcare tweets [6]
    • 30,000 provider profiles (MD, RN, etc.)
    • 15,780 predefined topics
    • 16,283 health communities
[1] A. Abbas, M. Ali, M. U. S. Khan, and S. U. Khan, “Personalized Healthcare Cloud Services for Disease Risk Assessment and Wellness Management using Social Media,” Pervasive and Mobile Computing, vol. 28, pp. 81-96, 2016.
[2] “NBCNews,” http://www.nbcnews.com/id/3077086/t/more-people-search-health-online/#.Ws4ss4hubIU, accessed on April 11, 2018.
[3] “Patientslikeme,” http://www.patientslikeme.com/, accessed on April 11, 2018.
[4] “Statista: The Statistical Portal,” https://www.statista.com/statistics/184968/us-health-expenditure-as-percent-of-gdp-since-1960/, accessed on April 11, 2018.
[5] “National Health and Nutrition Examination Survey,” http://wwwn.cdc.gov/nchs/nhanes/search/nhanes09_10.aspx, accessed on September 29, 2014.
[6] “Healthcare Social Media Analytics,” http://www.symplur.com/healthcare-social-media-analytics/, accessed on April 11, 2018.
Case Study I: Personalized Healthcare Services [1]
[Figure: framework with two modules; preprocessing runs as parallel and periodic jobs.]
Collaborative Filtering Disease Risk Assessment
• Disease risk assessment request
• Disease specific profiles retrieval (existing users' profiles for multiple diseases and the requesting user's profile)
• Retrieval of important profile attributes
• Similarity computation
• Profile matching
Twitter Based Health Expert Recommendation
• Health expert recommendation request
• Tweets retrieval
• Tweets tokenization
• Candidate expert identification
• Disease specific segregation of experts through HITS
• Ranked list of experts
• Recommended list of experts
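$$P(q, d) = \bar{r}_q + \frac{\sum_{e \in E} \mathrm{sim}(q, e)\,(r_{e,d} - \bar{r}_e)}{\sum_{e \in E} \mathrm{sim}(q, e)}$$
where
• $\bar{r}_q$: mean of the attributes of $q$
• $\mathrm{sim}(q, e)$: similarity score
• $r_{e,d}$: value of disease $d$ for existing user $e$
• $\bar{r}_e$: mean for the particular attribute of existing user $e$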
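The prediction formula above has the shape of the mean-centered, similarity-weighted average that is common in collaborative filtering. The sketch below assumes that reading; the function name, the tuple layout of `neighbors`, and the numeric values are hypothetical and are not taken from the cited paper.

```python
def predict_risk(q_mean, neighbors):
    """Mean-centered weighted average over existing users.

    neighbors: list of (similarity, value_for_disease, neighbor_mean) tuples,
    one per existing user e.
    """
    denom = sum(sim for sim, _, _ in neighbors)
    if denom == 0:
        return q_mean  # no similar users: fall back to the requester's mean
    num = sum(sim * (value - mean) for sim, value, mean in neighbors)
    return q_mean + num / denom


# Two hypothetical existing users: one similar with the disease present (1.0),
# one less similar with the disease absent (0.0).
neighbors = [(0.9, 1.0, 0.5), (0.3, 0.0, 0.5)]
print(predict_risk(0.4, neighbors))
```

Subtracting each existing user's mean before weighting keeps users with generally high or generally low attribute values from skewing the prediction.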
Hyperlink Induced Topic Search (HITS): hubs and authorities
Case Study I: Personalized Healthcare Services [1]
User-keyword matrix
        K1   K2   K3   K4   K5   K6
U1      5    1    2    2    5    1
U2      -    3    2    8    2    -
U3      3    1    2    4    6    -
U4      4    -    -    -    -    11
Hub score
Iteration No.   U1      U2      U3      U4
1               0.281   0.218   0.255   0.234
38              0.275   0.249   0.288   0.196
Authority score
Iteration No.   K1      K2      K3      K4      K5      K6
1               0.197   0.060   0.067   0.235   0.246   0.191
39              0.190   0.065   0.068   0.258   0.254   0.163
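Hub scores (users) and authority scores (keywords) on a user-keyword matrix can be computed with the standard HITS mutual-reinforcement iteration, sketched below. The sketch assumes missing entries ("-") are treated as 0 and that scores are normalized to sum to 1 after every step; the slide states neither, so the numbers it produces are not expected to reproduce the hub and authority tables exactly.

```python
def hits(matrix, iterations=50):
    """Standard HITS iteration on a users x keywords weight matrix."""
    n_users, n_keys = len(matrix), len(matrix[0])
    hubs = [1.0 / n_users] * n_users
    auths = [1.0 / n_keys] * n_keys
    for _ in range(iterations):
        # A keyword's authority is the weighted sum of the hubs that use it.
        auths = [sum(matrix[u][k] * hubs[u] for u in range(n_users))
                 for k in range(n_keys)]
        total = sum(auths)
        auths = [a / total for a in auths]
        # A user's hub score is the weighted sum of the authorities they use.
        hubs = [sum(matrix[u][k] * auths[k] for k in range(n_keys))
                for u in range(n_users)]
        total = sum(hubs)
        hubs = [h / total for h in hubs]
    return hubs, auths


# User-keyword counts with "-" replaced by 0 (an assumption).
matrix = [
    [5, 1, 2, 2, 5, 1],
    [0, 3, 2, 8, 2, 0],
    [3, 1, 2, 4, 6, 0],
    [4, 0, 0, 0, 0, 11],
]
hubs, auths = hits(matrix)
print([round(h, 3) for h in hubs])
print([round(a, 3) for a in auths])
```

The sketch assumes the matrix has at least one nonzero entry; an all-zero matrix would make the normalization divide by zero.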
CFDRA Evaluation
Attributes: “age”, “gender”, “ethnicity/race”, “height”, “weight”, “diagnosed high blood sugar or pre-diabetes”, “diabetes family history”, “physical activity”, “ever observed high blood pressure”, “blood cholesterol”, “smoking”, and “ever diagnosed diabetes”.
Compared classifiers:
• CART: data partitioning to classify the presence or absence of disease
• Logistic Regression: relationship between disease data attributes to determine outcomes
• Naïve Bayes: probability for the presence or absence of disease
• BF tree: best data split to determine the outcomes
• BayesNet: DAG to represent the relationship between disease and symptoms
• MLP: attributes provided at input layer to produce output
• Random Forest: creation of multiple trees
• Rotation Forest: splitting of dataset and evaluation through decision tree
• SVM: hyperplane separates patients from non-patients
EUR Evaluation
Dataset: health related tweets extracted from Twitter
Compared methods:
• RowSum: health related keyword count
• Paul et al. [2]: topical authority identification (tweets, retweets, self-similarity)
• Cheng et al. [3]: local experts in an area
[2] A. Paul and S. Counts, “Identifying topical authorities in microblogs,” in Proceedings of the fourth ACM international conference on Web search and data mining, 2011, pp. 45-54.
[3] Z. Cheng, J. Caverlee, H. Barthwal, and V. Bachani, “Who is the Barbecue King of Texas? A Geo-Spatial Approach to Finding Local Experts on Twitter,” in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014, pp. 335-344.
Evaluation
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F\text{-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative
• Precision: when it predicts YES, how often is it correct?
• Recall: when it's actually YES, how often does it predict YES?
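The three evaluation measures follow directly from the confusion-matrix counts. The helper below is a small sketch; the function name and the example counts are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something the slide specifies.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(p, r, f)
```

True negatives do not appear in any of the three measures, which is why they suit tasks where the positive class (disease present, relevant expert) is the one of interest.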
Case Study I: Experimental Results
[Figure: Proposed CFDRA performance comparison. Precision, recall, and F-measure (0 to 1) for CART, RF, LR, Naïve Bayes, BF, MLP, BayesNet, RoF, SVM, and CFDRA.]
[Figure: Precision score comparison of the proposed EUR method. EUR, RowSum, Paul et al., and Cheng et al. versus no. of recommendations (5, 10, 15, 20).]
[Figure: Recall score comparison of the proposed EUR method. EUR, RowSum, Paul et al., and Cheng et al. versus no. of recommendations (5, 10, 15, 20).]
[Figure: F-measure score comparison of the proposed EUR method. EUR, RowSum, Paul et al., and Cheng et al. versus no. of recommendations (5, 10, 15, 20).]
Case Study I: Scalability Analysis
[Figure: CFDRA scalability analysis by varying the no. of profiles and no. of processors. Time (sec.) versus no. of processors (1 to 12) for 5K, 10K, and 15K profiles.]
[Figure: EUR scalability analysis by varying the no. of profiles and no. of processors. Time (sec.) versus no. of processors (1 to 11) for 103 MB, 206 MB, and 309 MB.]
[Figure: CFDRA transactions per second (TPS) per processor versus no. of processors (2 to 12) for 5K, 10K, and 15K profiles. TPS = no. of profiles compared.]
[Figure: EUR transactions per second (TPS) per processor versus no. of processors (2 to 12) for 103 MB, 206 MB, and 309 MB. TPS = amount of data in MB.]
Case Study II: Health Insurance Plan Recommendation [1]
• Patient Protection and Affordable Care Act (PPACA)
  • Marketplaces
    • medical plans (78,000) [2]
    • dental plans (45,000) [3]
    • expected increase in near future
  • Private insurance providers
• Limited capabilities of the contemporary Web based tools
  • Challenges
    • Multi-faceted requirements
      • cost
      • coverage
    • Information filtering
      • difficult to find relevant information
[1] A. Abbas, M. U. S. Khan, A. Yusoff, Y. Sadikaj, J. Ashley, and S. U. Khan, “Personalized Health Insurance Recommendation Services,” IEEE Transactions on Cloud Computing (under review).
[2] QHP landscape individual market, https://data.healthcare.gov/dataset/QHPLandscape-Individual-Market-Medical/b8in-sz6k, 2015 (accessed on April 12, 2018).
[3] Dental plan information for individuals and families, https://www.healthcare.gov/dental-plan-information/, 2015 (accessed on April 12, 2018).
Case Study II: Health Insurance Plan Recommendation [1]
[Figure: cloud based health insurance plan recommendation. Users' requests for health insurance plans arrive through an interface to the cloud; insurance plans are retrieved from the Web and given an ontological representation; parallel jobs perform plan clustering, similarity computation, and plan ranking; the ranked list of plans is returned as implicit and explicit recommendations.]
Key Accomplishments:
• Plans evaluation based on various criteria, such as premium, copay, deductibles, and out-of-pocket limit
• Implicit plan recommendations at the start (solution to the cold start issue)
• Explicit plan recommendations based on user stated requirements
• Plans' clustering to minimize the number of comparisons
• A ranking methodology to rank the plans
• A methodology to avoid the long-tail issue of recommender systems
Implicit Recommendations
• Recommendations offered on first interaction with the system
• Based on plan popularity
• Initial popularity computation to overcome cold start
Explicit Recommendations
• Recommendations based on user stated requirements
• Similarity between the plans and requirements
• Ranking using Multi-attribute Utility Theory
Cluster identification
$$\mathrm{Rank} = \sum \Big( \text{similarity score} \times \big( \text{weight of decision criterion} \times \text{satisfiability} \big) \Big)$$
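The ranking score above combines a similarity score with weighted satisfiability of the decision criteria. A minimal sketch of that weighted sum follows, assuming one similarity score per plan, one weight per criterion, and satisfiability values in [0, 1] summed over the criteria; the indexing is an assumption, and all names and numbers are hypothetical rather than taken from the cited paper.

```python
def rank_score(similarity, weights, satisfiability):
    """Sum over criteria of similarity * (weight * satisfiability)."""
    return sum(similarity * (weights[c] * satisfiability[c]) for c in weights)


def rank_plans(plans, weights):
    """Return plan names ordered from highest to lowest rank score."""
    ordered = sorted(
        plans,
        key=lambda p: rank_score(p["similarity"], weights, p["satisfiability"]),
        reverse=True,
    )
    return [p["name"] for p in ordered]


# Hypothetical user weights over the decision criteria named on the slide.
weights = {"premium": 0.4, "copay": 0.2, "deductible": 0.2, "out_of_pocket_limit": 0.2}
plans = [
    {"name": "A", "similarity": 0.9,
     "satisfiability": {"premium": 0.5, "copay": 1.0, "deductible": 0.5, "out_of_pocket_limit": 0.5}},
    {"name": "B", "similarity": 0.7,
     "satisfiability": {"premium": 1.0, "copay": 1.0, "deductible": 1.0, "out_of_pocket_limit": 1.0}},
    {"name": "C", "similarity": 0.8,
     "satisfiability": {"premium": 1.0, "copay": 0.0, "deductible": 0.5, "out_of_pocket_limit": 0.5}},
]
print(rank_plans(plans, weights))
```

In this example plan B ranks first even though plan A is more similar to the request, because B satisfies every weighted criterion fully.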
Case Study II: Experimental Results
[Figure: Recall score, precision score, and F-measure score (0 to 1) versus cluster size (5, 10, 15, 20, 25) for Voronoi, DBSCAN, FCM, and Bclust.]
[Figure: Scalability analysis. Time (sec) versus no. of processors (1 to 11) for 3K, 6K, and 12K plans.]
Clustering methods compared:
• DBSCAN: clustering based on density connected points
• Fuzzy C-Means (FCM): closeness to center
• Voronoi: partitioning into cells based on ranking distance of plans
• Bayesian Clustering (Bclust.): cluster merging through statistical hypothesis test
Case Study III: A Route Recommendation Service For Large-scale Evacuations [1]
[Figure: route recommendation service. Components shown: TCC, RSUs, congestion data, real-time map processing, and real-time route computation on a computer cluster. Real-time routes are recommended through RSUs and other media.]
Real-time route computation
• Checks space in each shelter
• Location of each member of group
• Density and congestion on each route
• Routes with least time to reach the same shelter for each member of group
Key Accomplishments:
• A scalable service capable of route recommendation during an emergency evacuation:
  • efficient traffic flows
  • leads to minimum congestion of the roads
Challenges:
• Scalability:
  • big data graphs handling and partitioning
• Dynamic factors:
  • road congestions
  • road safety
  • shelter space
[1] M. U. S. Khan, O. Khalid, Y. Huang, F. Zhang, R. Ranjan, S. U. Khan, J. Cao, K. Li, B. Veeravalli, and A. Zomaya, “MacroServ: A Route Recommendation Service for Large-Scale Evacuations,” IEEE Transactions on Services Computing, vol. 10, no. 4, pp. 589-602, 2017.
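The dynamic factors listed above (road congestion and shelter space) can be folded into a least-time route search. The sketch below is not the MacroServ algorithm; it is a plain Dijkstra search for a single evacuee under two stated assumptions: an edge's travel time is its free-flow time scaled by `1 + congestion`, and a shelter is eligible only while it has space left. The graph, node names, and functions are hypothetical.

```python
import heapq


def travel_times(graph, source):
    """Dijkstra over edges weighted by free-flow time * (1 + congestion)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, (free_time, congestion) in graph.get(node, {}).items():
            nd = d + free_time * (1.0 + congestion)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def recommend_shelter(graph, source, shelter_space):
    """Pick the reachable shelter with free space and the least travel time."""
    dist = travel_times(graph, source)
    open_shelters = [s for s, space in shelter_space.items()
                     if space > 0 and s in dist]
    return min(open_shelters, key=lambda s: dist[s]) if open_shelters else None


# Edges map to (free-flow minutes, congestion level), all values hypothetical.
graph = {
    "home": {"a": (5, 0.0), "b": (4, 1.0)},
    "a": {"shelter_1": (5, 0.2), "shelter_2": (3, 0.0)},
    "b": {"shelter_1": (1, 0.0)},
}
print(recommend_shelter(graph, "home", {"shelter_1": 10, "shelter_2": 0}))
```

Routing every member of a group to the same shelter, as the slide describes, would require running the search from each member's location and minimizing over shelters common to all of them; that extension is left out of this sketch.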
Case Study III: Experimental Results
[Figure: Average travel times with varying number of departing vehicles from each intersection. Axes: number of cars per minute, average travel time (minutes), average evacuations per minute.]
[Figure: Effect of road damage by varying departure time (scale parameter α of Weibull distribution). Axes: scale parameter α, average travel time (mins).]
[Figure: Average congestion with respect to time with damaged road network. Axes: time (mins), average congestion.]
[Figure: Effect of population increase in future 3 years on average car travel time on damaged network. Axes: year, average travel time (mins).]
Case Study III: Experimental Results
[Figure: Number of messages versus number of partitions (two panels).]
[Figure: Communication/computation versus number of partitions.]
[Figure: Average recommendation generation time (sec) versus number of partitions.]
• Doubling the size of the region increases the recommendation generation time by an average of 26%.
• Adding a single processor decreases the recommendation generation time by an average of 9%.
• Doubling the size of the map decreases the average number of vehicles crossing from one zone to another by 76%.
Future Work
• Case Study I:
  • Identification of health experts from the same geographical area where enquiring users reside
  • Identification of fake Twitter profiles through tweet analysis
• Case Study II:
  • Insurance plan recommendation through existing users' characteristics
• Case Study III:
  • Considering additional parameters for emergency evacuations:
    • Drivers' behavior
    • Evacuees' compliance with the recommended routes