Ayasdi with IHME data
kyle-serikawa
Data has shape and shape has meaning™
Outline
• Overview of IRIS from Ayasdi
  • A tool for looking at large datasets and trying to find meaning
• Walking through an example of an Ayasdi analysis
What IRIS is for…
• We are gathering more data all the time
…and while data are often collected to address specific questions, the data may also hold additional insights
[Figure labels: CD; +Stim, Ab; Baseline]
“There isn’t a single story happening in your complex data” – Anthony Bak, Ayasdi
That’s where we think IRIS from Ayasdi will help
• IRIS combines topological math with a highly flexible and intuitive interface to analyze large datasets
• Creates different shapes that can be explored
• Ayasdi can be used on different kinds of high complexity datasets
  • Transcriptome profiling
  • Clinical data
  • Flow cytometry data
  • Financial data
  • Text
  • Etc.
How does IRIS work?
• Concept is: data has shape based on how elements in the datasets are mathematically related to each other
  • For example, how are samples alike?
• IRIS takes the data, performs a mathematical transformation, and uses the output to group samples together and draw a picture
• This is done iteratively with different mathematical transformations to give multiple different views of the data’s shapes
• The shapes highlight possibly interesting parts of the dataset
  • In our case, disease or patient subsets
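As a rough illustration of the "transform, group, draw" steps on this slide, here is a minimal sketch in the style of the Mapper construction from topological data analysis. The source does not describe IRIS's internals, so everything here is an assumption: the lens (a row mean), the overlapping-interval cover, the single-linkage grouping with a fixed distance threshold, and all function names are hypothetical stand-ins. The data are made-up toy points.

```python
import math

def lens_mean(row):
    """Hypothetical 'mathematical transformation': collapse a sample to one number."""
    return sum(row) / len(row)

def overlapping_intervals(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals; neighbors overlap by a fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

def single_linkage(indices, data, threshold):
    """Group samples so that any two within `threshold` end up in the same group."""
    groups = []
    unvisited = set(indices)
    while unvisited:
        seed = unvisited.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in unvisited
                    if math.dist(data[current], data[j]) <= threshold}
            unvisited -= near
            group |= near
            frontier.extend(near)
        groups.append(frozenset(group))
    return groups

def build_nodes(data, n_intervals=4, overlap=0.5, threshold=1.5):
    """Return the groups of sample indices that would become dots in the picture."""
    lens = [lens_mean(row) for row in data]
    nodes = []
    for lo, hi in overlapping_intervals(min(lens), max(lens), n_intervals, overlap):
        # Small tolerance so samples on an interval boundary are not lost to rounding.
        members = [i for i, v in enumerate(lens) if lo - 1e-9 <= v <= hi + 1e-9]
        nodes.extend(single_linkage(members, data, threshold))
    return nodes

# Toy data (made up): two well-separated clumps of samples.
data = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.9), (5.0, 5.1), (5.5, 5.4), (6.0, 6.2)]
nodes = build_nodes(data)
print([sorted(node) for node in nodes])
```

Changing the lens, the number of intervals, or the overlap gives a different set of groups from the same data, which is one plausible reading of "multiple different views of the data's shapes".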
From Ayasdi
The problem of having a liberal arts education…
Platonic ideal of chair
What an IRIS analysis looks like
3 different shapes made from the same data
Explaining the parts
Dots represent groups of samples that are similar to each other
Connecting lines represent at least one shared member between groups
Features like this arm on the shape can be examined in further detail
Coloring (red = high to blue = low) can be based on initial math or annotations (e.g., gender, disease), gene expression, etc.
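The two drawing rules described here (a line wherever two groups share a member, and a color taken from an annotation) can be sketched as follows. The node memberships and annotation values are made-up toy data, and averaging the annotation over each node is an assumption; the source does not say how IRIS aggregates values for coloring.

```python
from itertools import combinations

def shared_member_edges(nodes):
    """Connect two nodes whenever their member sets intersect."""
    return [(a, b) for a, b in combinations(range(len(nodes)), 2)
            if nodes[a] & nodes[b]]

def node_color_values(nodes, annotation):
    """Average an annotation over each node's members (assumed aggregation)."""
    return [sum(annotation[m] for m in node) / len(node) for node in nodes]

# Toy groups of sample ids; sample 2 sits in both of the first two groups.
nodes = [{0, 1, 2}, {2, 3}, {4, 5}]
# Toy numeric annotation per sample (for example, an expression level).
annotation = {0: 1.0, 1: 2.0, 2: 3.0, 3: 5.0, 4: 10.0, 5: 12.0}

edges = shared_member_edges(nodes)
colors = node_color_values(nodes, annotation)
print(edges)   # pairs of node indices to join with a line
print(colors)  # one value per node, to be mapped onto a blue-to-red scale
```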
How does an IRIS analysis proceed?
• Groups and shapes are analyzed and interpreted
  • We try to understand what underlies the shapes and forms that arise
• Link back to biology, patients, effect
• Learn new insights
• Create hypotheses, test on the fly
• Iterate
• Next several slides will be an example of an IRIS analysis and insights
Example analysis: Smoking prevalence
• Institute for Health Metrics and Evaluation (IHME)
  • Performed survey of smoking prevalence worldwide, from 1980-2012
  • 187 countries
  • Dataset contains smoking frequency broken down by age, gender, year
• 518 columns, 187 rows
• Some reasons to look at this data:
  • Practice, and the IRIS workflow is pretty much the same for any dataset
  • Using non-gene expression data
  • Smoking is a risk factor for RA, diabetes, etc.
These were derived from the IHME data
Thinking like an analyst: what do different parts of shapes mean?
There’s a lot to potentially explore
Start with this basic shape:
What are these two groups?
Upper arm
Lower arm
Certain mathematical transformations often create this antibody shape in large datasets
First step: define groups and do numerical and categorical comparison to rest of shape
Lower arm categorical table
Column Name | Value | Percent in Group 1 | Percent in Both Group 1 and Group 2 | Count in Group 1 | Count in Both Group 1 and Group 2 | p-value
ISOsubregion | 35 | 0.27 | 0.06 | 6 | 11 | 4.23E-04
Developing | Yes | 1.00 | 0.73 | 22 | 137 | 6.48E-04
ISOsubregion | 14 | 0.27 | 0.09 | 6 | 17 | 0.006991494
Annualized Rate of Change (%) Male and Female 1980 to 2012 | -0.5 | 0.18 | 0.04 | 4 | 8 | 0.007475094
Annualized Rate of Change (%) Male and Female 1980 to 2012 | -0.7 | 0.18 | 0.05 | 4 | 10 | 0.019024382
ISOregion | 2 | 0.45 | 0.27 | 10 | 50 | 0.035708684
Bangladesh
Burkina Faso
Burundi
Cambodia
Djibouti
Federated States of Micronesia
Ghana
Guinea-Bissau
Indonesia
Jamaica
Laos
Malawi
Maldives
Myanmar
Namibia
Paraguay
Philippines
Rwanda
Somalia
Sri Lanka
Thailand
Zimbabwe
Southeastern Asia
Eastern Africa
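The p-values in the lower arm categorical table can be read as enrichment tests: is a value over-represented in the group compared with all countries? The source does not say which test IRIS uses. The sketch below assumes a one-sided hypergeometric test, and it reads the counts as 22 countries in Group 1 (the lower arm) out of 187 overall, which is what the percentages imply. Under those assumptions the "Developing = Yes" row (22 of 22 in the group, 137 of 187 overall) comes out close to the tabulated 6.48E-04. The function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(total, total_with_value, group_size, group_with_value):
    """One-sided hypergeometric tail: chance of seeing at least
    `group_with_value` hits when drawing `group_size` items at random."""
    upper = min(group_size, total_with_value)
    numerator = sum(
        comb(total_with_value, k) * comb(total - total_with_value, group_size - k)
        for k in range(group_with_value, upper + 1)
    )
    return numerator / comb(total, group_size)

# Counts taken from the "Developing = Yes" row of the table.
p_developing = enrichment_p_value(
    total=187, total_with_value=137, group_size=22, group_with_value=22
)
print(f"{p_developing:.2e}")
```

The same call with the counts from another row (for example 11 overall and 6 in the group for ISOsubregion 35) gives the corresponding tail probability under the same assumed test.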
Highlighting lower arm countries on a map
Some geographical clustering
Now looking at numerical annotations
Column Name | KS Statistic | KS p-value | T-test p-value | Group 1 Mean - Group 2 Mean | KS Sign
Smoking Prevalence (%) Age 80+ 1997 | 0.62 | 4.83578E-07 | 3.79979E-05 | 6.960909091 | +
Smoking Prevalence (%) Age 80+ 2000 | 0.62 | 4.83578E-07 | 2.55956E-05 | 7.112424242 | +
Smoking Prevalence (%) Age 80+ 1999 | 0.62 | 6.72238E-07 | 2.9015E-05 | 7.072121212 | +
Smoking Prevalence (%) Age 80+ 2001 | 0.62 | 6.72238E-07 | 2.5208E-05 | 7.133030303 | +
Smoking Prevalence (%) Age 80+ 2002 | 0.62 | 6.72238E-07 | 2.38392E-05 | 7.140909091 | +
Smoking Prevalence (%) Age 80+ 1996 | 0.61 | 9.31143E-07 | 4.89306E-05 | 6.880909091 | +
Smoking Prevalence (%) Age 80+ 1998 | 0.61 | 9.31143E-07 | 3.31192E-05 | 7.008787879 | +
Smoking Prevalence (%) Age 80+ 2003 | 0.61 | 9.31143E-07 | 2.36669E-05 | 7.144242424 | +
Smoking Prevalence (%) Age 80+ 1995 | 0.58 | 3.66511E-06 | 5.92711E-05 | 6.813030303 | +
Smoking Prevalence (%) Age 80+ 2004 | 0.58 | 4.98014E-06 | 2.33953E-05 | 7.080606061 | +
Smoking Prevalence (%) Age 75 2004 | 0.57 | 5.51162E-06 | 1.50199E-05 | 7.676363636 | +
Smoking Prevalence (%) Age 75 2008 | 0.57 | 5.51162E-06 | 2.02097E-05 | 7.436666667 | +
Smoking Prevalence (%) Age 75 2009 | 0.57 | 5.51162E-06 | 2.04579E-05 | 7.365151515 | +
Smoking Prevalence (%) Age 75 2011 | 0.57 | 6.09737E-06 | 2.0317E-05 | 7.224545455 | +
Smoking Prevalence (%) Age 75 2012 | 0.57 | 6.09737E-06 | 1.89945E-05 | 7.184242424 | +
Smoking Prevalence (%) Age 80+ 2005 | 0.57 | 6.09737E-06 | 2.25215E-05 | 7.026363636 | +
Smoking Prevalence (%) Age 75 2003 | 0.57 | 7.45331E-06 | 1.28236E-05 | 7.777878788 | +
Smoking Prevalence (%) Age 75 2005 | 0.57 | 7.45331E-06 | 1.61689E-05 | 7.576666667 | +
Smoking Prevalence (%) Age 75 2006 | 0.57 | 7.45331E-06 | 1.84185E-05 | 7.536969697 | +
Smoking Prevalence (%) Age 75 2007 | 0.57 | 7.45331E-06 | 1.94395E-05 | 7.496666667 | +
Smoking Prevalence (%) Age 75 2010 | 0.57 | 7.45331E-06 | 2.08264E-05 | 7.294848485 | +
Smoking Prevalence (%) Age 80+ 2012 | 0.57 | 7.45331E-06 | 3.11246E-05 | 6.652121212 | +
Smoking Prevalence (%) Age 80+ 1994 | 0.56 | 8.23553E-06 | 6.50367E-05 | 6.795151515 | +
Smoking Prevalence (%) Age 80+ 2007 | 0.56 | 8.23553E-06 | 2.66895E-05 | 6.890909091 | +
Smoking Prevalence (%) Age 75 2002 | 0.56 | 1.00428E-05 | 1.19239E-05 | 7.858484848 | +
Smoking Prevalence (%) Age 80+ 2011 | 0.56 | 1.00428E-05 | 3.17879E-05 | 6.670606061 | +
Smoking Prevalence (%) Age 80+ 2006 | 0.56 | 1.10835E-05 | 2.3874E-05 | 6.958181818 | +
Smoking Prevalence (%) Age 80+ 2010 | 0.55 | 1.22271E-05 | 3.14422E-05 | 6.696666667 | +
Ranking by one of the built-in statistics, we quickly see that the top data columns largely reflect smoking prevalence among the elderly
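For readers unfamiliar with the KS statistic used for this ranking: the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distributions of the two groups, so it runs from 0 (identical distributions) to 1 (no overlap). The sketch below computes only the statistic, on made-up toy values; it does not reproduce the p-values or the "KS Sign" column, whose exact definitions the source does not give.

```python
from bisect import bisect_right

def ks_statistic(group1, group2):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(group1), sorted(group2)
    gaps = []
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of group1 values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of group2 values <= x
        gaps.append(abs(cdf_a - cdf_b))
    return max(gaps)

# Toy values (made up), standing in for one column split into two groups.
separated = ks_statistic([1, 2, 3], [4, 5, 6])
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(separated, partial)
```

If SciPy is available, `scipy.stats.ks_2samp` returns both the statistic and a p-value for the same comparison.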
Pick a few years for the 80+ smoking prevalence to graph boxplots
Okay, so confirming insights: we’re looking at a subset of countries that have a high rate of smoking in the elderly. Note that the upper arm group has a substantially lower rate.
Other countries also have high rates in the elderly, and within the lower arm group, some have relatively low rates.
So we’ve found a subpopulation
But that’s not the whole story
Country | Lower arm group | Smoking Prevalence (%) Age 80+ 2000 | Country | Lower arm group | Smoking Prevalence (%) Age 80+ 2000
Pakistan | no | 34 | Laos | yes | 29.4
Tonga | no | 25.2 | Myanmar | yes | 26.4
Kiribati | no | 24.4 | Namibia | yes | 23.3
Nepal | no | 23.8 | Bangladesh | yes | 21.8
Lebanon | no | 22.2 | Cambodia | yes | 20
Timor-Leste | no | 18.8 | Indonesia | yes | 18.1
Denmark | no | 17.1 | Federated States of Micronesia | yes | 17.6
Tunisia | no | 16.4 | Philippines | yes | 15.8
Jordan | no | 16.2 | Paraguay | yes | 14.5
Lesotho | no | 15.9 | Malawi | yes | 14.4
South Korea | no | 15.9 | Djibouti | yes | 14.3
Malaysia | no | 15.8 | Zimbabwe | yes | 13.7
Dominican Republic | no | 15 | Thailand | yes | 13
Vanuatu | no | 14.5 | Maldives | yes | 12.5
Palestine | no | 14.2 | Sri Lanka | yes | 11.2
Vietnam | no | 13.9 | Burkina Faso | yes | 11
Cyprus | no | 13.7 | Burundi | yes | 9.7
Samoa | no | 13.6 | Rwanda | yes | 8.7
Albania | no | 13.4 | Somalia | yes | 8.5
Mongolia | no | 13.1 | Ghana | yes | 7.9
South Africa | no | 13.1 | Jamaica | yes | 7.6
China | no | 13 | Guinea-Bissau | yes | 7.5
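To make the spread within the lower arm group concrete, the values in this table can be summarized directly. This is a small sketch using only the 22 lower arm rows as transcribed here. The cutoff of 13 is simply the lowest value among the "no" rows shown (China), not a threshold from the source analysis.

```python
from statistics import median

# Smoking Prevalence (%) Age 80+ 2000 for the lower arm group, from the table.
lower_arm = {
    "Laos": 29.4, "Myanmar": 26.4, "Namibia": 23.3, "Bangladesh": 21.8,
    "Cambodia": 20.0, "Indonesia": 18.1,
    "Federated States of Micronesia": 17.6, "Philippines": 15.8,
    "Paraguay": 14.5, "Malawi": 14.4, "Djibouti": 14.3, "Zimbabwe": 13.7,
    "Thailand": 13.0, "Maldives": 12.5, "Sri Lanka": 11.2,
    "Burkina Faso": 11.0, "Burundi": 9.7, "Rwanda": 8.7, "Somalia": 8.5,
    "Ghana": 7.9, "Jamaica": 7.6, "Guinea-Bissau": 7.5,
}

cutoff = 13.0  # lowest "no" row listed in the table (China); illustrative only
values = list(lower_arm.values())
median_value = median(values)
below_cutoff = sorted(name for name, v in lower_arm.items() if v < cutoff)

print(f"range: {min(values)} to {max(values)}, median: {median_value:.1f}")
print(f"{len(below_cutoff)} of {len(values)} lower arm countries fall below {cutoff}")
```

So the group is not defined by a simple threshold on this one column: several lower arm countries sit below every non-lower-arm country listed, which is the sense in which elderly smoking prevalence is "not the whole story".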
What are the characteristics that define that subpopulation?
• Many directions to go here
• In IRIS
  • Persistence of group
  • Co-occurrence with other annotations beyond “developing”
• Outside of IRIS
  • Once you know a subgroup exists, statistical analyses
  • Visualization techniques such as heatmaps
Persistence (or not) of subgroup integrity across shapes and analyses
From this we can go back to the mathematical transformations used to make each set of shapes and find clues to what is driving this group to stay together in some shapes but not others
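One way to put a number on whether a subgroup stays together in a given shape is to find the single node that best matches it. The sketch below uses Jaccard overlap for that. The metric, the function names, and the toy node memberships are all assumptions for illustration, not something IRIS is documented to compute.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

def best_match(subgroup, nodes):
    """Highest Jaccard overlap between the subgroup and any single node."""
    return max(jaccard(subgroup, node) for node in nodes)

# Toy sample ids (made up). The subgroup of interest is samples 1 to 4.
subgroup = {1, 2, 3, 4}
shapes = {
    "shape A": [{1, 2, 3, 4}, {8, 9}],    # subgroup sits in one node
    "shape B": [{1, 9}, {2, 8}, {3, 4}],  # subgroup is split across nodes
}

integrity = {name: best_match(subgroup, nodes) for name, nodes in shapes.items()}
print(integrity)
```

A score near 1 means the subgroup survives intact in that shape, and a low score means it was split up. Because a real group can span several connected nodes, a fuller version would compare against unions of neighboring nodes rather than single nodes.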
Overlay of different kinds of information
Comparison of developing country status suggests two groups we could compare to look for additional insights
Annualized rate of change between 1980 and 1996 is another annotation we could look into more
[Panel labels: Developing = no; Developing = yes; Population; Ann rate of change 1980-96]
Comparing the two developing world enriched groups
• Found differences in older-age smoking prevalence: lower arm group has higher rate
  • We already knew that
• Also found differences in 10-year-old smoking prevalence: lower arm group has lower rate
  • We didn’t know that…
10 year old smoking prevalence
[Panel labels: 1980, 1990, 2000, 2010]
Smoking in kids is consistently low in the lower arm group.
This suggests an opening for public health intervention in these countries: we need to confirm the pattern and, if it holds, look at the transition from non-smoking to smoking and when that happens.
Looking more closely at Annualized rate of change
[Panel labels: Ann rate of change 1980-96; Ann rate of change 1996-2006; Ann rate of change 2006-2012; Ann rate of change 1980-2012]
The lower arm group appears to have had a relatively smaller decrease in overall smoking rates in the 80s and 90s, but the rate of decrease began to pick up in the 2000s, relative to other countries.
From a public health standpoint, now go back and ask what kinds of smoking cessation interventions were put in place in the 2000s.
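For reference, an annualized rate of change between two years is often computed as a log ratio divided by the number of years. The sketch below assumes that definition. The source does not state how the annualized rate columns in the IHME data were derived, and the prevalence numbers here are made up to mimic a slow early decline followed by a faster one.

```python
import math

def annualized_rate_of_change(start_value, end_value, start_year, end_year):
    """Percent change per year, assuming the log-ratio
    (continuously compounded) definition."""
    return 100.0 * math.log(end_value / start_value) / (end_year - start_year)

# Made-up prevalence values (%) for one hypothetical country.
prevalence = {1980: 20.0, 1996: 19.0, 2006: 16.0, 2012: 14.0}

early = annualized_rate_of_change(prevalence[1980], prevalence[1996], 1980, 1996)
late = annualized_rate_of_change(prevalence[2006], prevalence[2012], 2006, 2012)
halving = annualized_rate_of_change(20.0, 10.0, 1980, 1990)

print(f"1980-96: {early:.2f}% per year, 2006-2012: {late:.2f}% per year")
```

Comparing the per-period rates in this way is what lets the slide say that a decline "picked up" in the 2000s: the later period has a more negative annualized rate than the earlier one.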