Bioinformatics FOAM 2014 - Program and Abstracts

Bioinformatics Focus On Analytical Methods (FOAM)

2014

irony, n. pron: /aɪərənɪ/, as in “This page was intentionally blank until we put this footer in”

Welcome

Dear Colleagues,

Welcome to Bioinformatics Focus On Analytical Methods (FOAM) 2014, run as part of CSIRO’s Computational and Simulation Sciences and eResearch Annual Conference and Workshops, and sponsored by the CSIRO Bioinformatics Core and The Australian Bioinformatics Network (ABN).

The first half of FOAM 2014 is aimed at CSIRO bioinformaticians, computational biologists and quantitative bioscientists, recognising that this is a once-a-year opportunity for staff across Australia to get together to discuss CSIRO-specific issues.

The second half of the meeting is aimed at bioinformaticians, computational biologists and quantitative bioscientists in general. CSIRO’s CSS conference gives us a great opportunity to hold a very affordable (i.e., free to members) ABN event at a great location in a city with a high concentration of Australian life-science research.

We have a diverse and engaging agenda of presentations, reflecting the breadth of research that falls under the heading “bioinformatics”. We also want to encourage you to use this opportunity to meet new colleagues and catch up with old friends.

We hope you enjoy Bioinformatics FOAM 2014 and welcome your feedback and ideas about how to make future events even better.

With best wishes from the Bioinformatics FOAM 2014 Organising Committee:

• Annette McGrath (CSIRO Bioinformatics Core Leader)

• David Lovell (Australian Bioinformatics Network Director)

• Lars Jermiin (OCE Science Leader in Genomics)

• Jen Taylor (CSIRO Plant Industry Bioinformatics Leader)

Bioinformatics FOAM 2014: Program and Abstracts – Thursday


Start     Speaker                Title
10:30 AM  Dr Annette McGrath     Welcome to Day 1
10:35 AM  Dr Annette McGrath     An update on the activities of the CSIRO Bioinformatics Core
10:50 AM  Dr Tamsyn Crowley      Happy Mrs Chicken
11:10 AM  Dr Heiko Mueller       Similarity search in large sets of genes using semantic similarity of Gene Ontology annotations
11:30 AM  Dr Neil Saunders       SQL, noSQL or no database at all?
12:00 PM  Lunch
1:00 PM   Dr John Henshall       Parentage in parallel
1:20 PM   Dr Jason Ross          Global signals in genomic data Take 1
1:40 PM   Dr Yalchin Oytam       Global signals in genomic data Take 2
2:00 PM   Dr Emma Huang          Imputation in genotyping by sequencing
2:30 PM   Coffee
2:45 PM   Dr Karen Meusemann     How we currently deal with heterogeneous, non-stationary datasets – an example from the 1KITE project
3:05 PM   Dr Lars Jermiin        Mixture models of nucleotide sequence evolution, and the evolution of yeast genomes
3:25 PM   Mr Sam Moskwa          Publishing Software: Code and Data or it Didn’t Happen
3:45 PM   Mr Brian Davis         Using VM and cloud infrastructure for (CSIRO?) bioinformatics
4:05 PM   Dr David Lovell        How I stopped worrying about figure placement and learned to love Sweave
4:15 PM   Close
4:30 PM   Break
6:00 PM   Dinner



Speaker Abstract

Dr Tamsyn Crowley: Happy Mrs Chicken

With the current level of public and marketing attention being given to eggs from caged vs. free-range vs. barn-raised chickens, it’s a good time to ask “how can we know how happy or stressed a chicken is?” This kind of question becomes even more pertinent in light of concerns about animal welfare in other production systems, including live export. This presentation describes the search for molecular markers that could serve as a basis for practical, low-cost, objective tests for stress in chickens.

Mr Brian Davis: Using VM and cloud infrastructure for (CSIRO?) bioinformatics

A chance to hear about the wealth of CPU and storage that is out there… somewhere…

Dr John Henshall: Parentage in parallel

The most significant impact of high-density SNP panels on livestock may come from an unexpected direction. Genomic selection and exploiting regions discovered through whole genome association studies may produce marginal increases in the rate of genetic improvement, but these won’t be transformational as they occur in industries that already have sophisticated pedigree-based genetic improvement programs. The real opportunities lie where no genetic improvement currently takes place: in un-pedigreed commercial stocks, where the vast majority of animals are raised. High-density SNP assays allow estimation of relationships between pools assembled from the tissue of dozens of animals of like phenotype: parentage in parallel! We give examples from a range of projects currently underway in livestock and aquaculture, and touch on some of the analytical challenges presented by these data.

Dr Emma Huang: Imputation in genotyping by sequencing

Genotyping-by-sequencing (GBS) technology has made dense genotyping cost-effective for many species. However, the high levels of missing data can result in a large loss of information. The popularity of GBS makes the development of efficient imputation approaches a priority. Here we consider imputation under the further difficulty caused by multi-parental experimental crosses. We present an approach to imputing founder genotypes which allows recovery of a large proportion of markers. Once these have been imputed, we compare three approaches to imputing progeny genotypes and apply our strategy to an eight-parent rice population to demonstrate the potential gain from imputation.

Dr Lars Jermiin: Mixture models of nucleotide sequence evolution, and the evolution of yeast genomes

Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that the data can be modeled accurately using one or several site-specific, time-reversible rate matrices. However, a growing body of data suggests that evolution under globally SRH conditions is an exception, rather than a norm. To address this issue, we introduce a family of mixture models that considers heterogeneity in the substitution process across lineages (HAL) and heterogeneity in the substitution process across sites (HAS). We also introduce an algorithm for searching model space and identifying a model of evolution that is less likely to over- or under-parameterize the data. The merits of our algorithms are illustrated with an analysis of 42,337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of eight species of yeast. For this data set, our HAL-HAS model fits the data better than other models do. Parameter estimates for this model indicate not only a complex ancestral sequence but also a complex evolutionary process.

Dr David Lovell: How I stopped worrying about figure placement and learned to love Sweave

As the artilleryman said in War of The Worlds: "It's doing the working and the thinking that wears a fellow out." And so it goes with doing the work of writing a journal paper after all that thinking (and R-coding) these past months. All those bits of code, lovingly crafted to work perfectly... but not in the hands of just anyone, no! Just me...and even then, things can go awry. And now those fancy journals come a-prancing, with their talk of formats and conventions. Well, I tell you, I won't be starting all over again. I'VE GOT A PLAN!!!

Dr Karen Meusemann: How we currently deal with heterogeneous, non-stationary datasets – an example from the 1KITE project

The 1KITE project (www.1kite.org) is currently sampling and analysing more than 1,000 insect transcriptomes for phylogenetic analyses across insects, and we recently compiled a dataset of 144 species and ~1,500 orthologous genes for phylogenetic inference. After briefly introducing the 1KITE project, I will outline how we identified non-stationarity across taxa, that is, regions within our dataset that do not match globally stationary, reversible, and homogeneous (SRH) conditions, and how we reduced heterogeneity to a certain extent without excluding species. Moreover, I will present what we can expect from simulated data that match SRH conditions. Comparing these results, it is clear that there is still a strong need to develop models and methods that can deal with heterogeneity and non-stationarity across taxa in empirical datasets.

Dr Heiko Mueller: Similarity search in large sets of genes using semantic similarity of Gene Ontology annotations

Over the past years, more than 30 different semantic similarity measures for GO annotations have been proposed. In the first part of the presentation I will give an overview of the strengths and weaknesses of these methods. In the second part I will talk about our efforts to develop algorithms that allow time-efficient semantic similarity search in large sets of genes (e.g. UniProt).

Dr Yalchin Oytam: Global signals in genomic data Take 2

See the next abstract for this double act…

Dr Jason Ross: Global signals in genomic data Take 1

The TBCP-funded global signals in genomic data project is developing methods and software to view and characterise large correlated signals in genomic data and, in the case of batch effects, to correct for them. In the last year, we have developed Python-based software to quickly identify genomic signals, with the next phase being the characterisation of these signals. In parallel, we have finished development of a method to identify and remove batch effects which outperforms existing methods. While we have several bodies of work in development, in this talk we will discuss, in particular, the performance and importance of the new batch effect removal algorithm. This new technique maximises the removal of the structured technical noise known as batch effects, with the constraint that the probability of overcorrection is kept to a fraction which is set by the end-user. This tunability allows control of overcorrection, defined as the removal of genuine biological variance as well as batch noise. Overcorrection should be minimised as it can lead to false positive results due to the artificial deflation of within-group variances. Benchmarking across four datasets against ComBat, the leading currently used technique, we show this new method is far superior in balancing removal of batch noise against preservation of biological signal. Additionally, the new method is able to leave largely unchanged one of the datasets which has no significant batch effect, whereas ComBat reduces the variance of that dataset by over 45%. For noise removal, we use “guided-PCA”, a recently published quantifier of batch effects, to show the probability of batch effects remaining in the data post correction. For signal preservation, we calculate, in each case, the proportion of the original variance which remains in the datasets after correction.
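As an illustration of the signal-preservation measure mentioned above, the following is a minimal sketch only. The abstract does not say how the variance is aggregated, so this assumes the total of per-feature sample variances; the function name and the toy numbers are hypothetical and are not taken from the project's software.

```python
from statistics import variance


def proportion_variance_retained(original, corrected):
    """Ratio of total variance after correction to total variance before.

    Assumption: each inner list holds one feature (e.g. one gene) measured
    across samples, and "total variance" is the sum of per-feature sample
    variances. The abstract does not specify this aggregation.
    """
    original_total = sum(variance(row) for row in original)
    corrected_total = sum(variance(row) for row in corrected)
    return corrected_total / original_total


# Toy data (hypothetical): the "correction" halves every value,
# which scales every variance by a factor of one quarter.
original = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
corrected = [[value / 2 for value in row] for row in original]
retained = proportion_variance_retained(original, corrected)
```

A value near 1 would suggest that little of the original variance was removed, while a value well below 1 on a dataset with no real batch effect would be consistent with the overcorrection the abstract warns about.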



Dr Neil Saunders: SQL, noSQL or no database at all?

Creating and querying a database - usually using MySQL - was once on the list of "core skills" for many bioinformaticians. Today, we have many more database implementations at our disposal, including the catch-all "noSQL" set of databases. In addition, many of us find that we can solve our problems without resorting to databases at all. Are databases still a "core skill"? When are they useful? When are they not?

Bioinformatics FOAM 2014: Program and Abstracts – Friday


Start     Speaker                Title
9:00 AM   Dr David Lovell        Welcome to Day 2
9:10 AM   Dr Paul Berkman        Establishing a hybrid approach to the polyploid sugarcane genome assembly
9:30 AM   Dr David Goode         Reconstructing tumour evolution using a combined genomics approach
9:50 AM   Dr Sarah-Jane Schramm  Integration and analysis of high-throughput data types for insights into complex disease
10:10 AM  Dr Sarah Boyd          On Systems Biology
10:30 AM  Coffee
11:00 AM  Dr Sarah-Jane Schramm  The Inside Voice – reflections and ramblings of a newly minted post-doc
11:20 AM  Dr David Lovell        What I think about when I think about your job application?
12:10 PM  Dr Andrew Lonie        The Genomics Virtual Laboratory
12:30 PM  Lunch
1:30 PM   Dr Neil Saunders       Learning from complete strangers: social networking for bioinformaticians
2:00 PM   Prof John Carlin       On the value of p-values: Take 1
2:30 PM   Prof Gordon Smyth      On the value of p-values: Take 2
2:50 PM   Discussion
3:00 PM   Coffee
3:30 PM   Mr Jason Whyte         Before collecting `Big Data', what can we do with NO data? Or Why get into a lather about experimental design?
3:50 PM   Dr James Kijas         Building the genomics toolbox for a livestock genome
4:10 PM   Dr Sean O'Donoghue     New tools for protein structures and phosphorylation pathways - and another animation
4:30 PM   Close



Speaker Abstract

Dr Paul Berkman: Establishing a hybrid approach to the polyploid sugarcane genome assembly

Flowering plant genomes are amongst the largest and most complex, owing to highly proliferative repetitive elements and frequent genome duplications. The sequencing revolution has now delivered over 30 plant genomes ranging from the 82 Mbp genome of floating bladderwort to the nearly 5 Gbp genome of diploid wheat. While a high quality reference genome is now a pivotal research tool in all crop improvement efforts, many projects emphasise delivery timeframes at the expense of genome quality. Our species of interest is sugarcane (Saccharum hybrid) which possesses a highly aneupolyploid genome 10 Gbp in size. In line with international efforts, our group has contributed a range of approaches to elucidate the sugarcane genome sequence. The first of these has been an international BAC-by-BAC sequencing effort to determine a "monoploid" genome sequence for the genotype R570, in which we have assembled Illumina paired-read data for 465 BACs into one or a few contigs each. Secondly, we have applied second-generation whole-genome shotgun sequencing up to 45x to de novo assemble the genome of R570. Our preliminary assembly represents over two thirds of the expected genome size with a contig NG50 of 1200 bp. Finally, we are now progressing a third-generation sequencing approach to supplement the results of the short-read approach and progress towards a final hybrid assembly. Without a robust approach, structural and functional annotation cannot inform meaningful biological interpretation. As our work approaches completion, it is becoming clear that ultimately a hybrid approach combining all of these outputs will be required for a high quality reference genome for sugarcane. There is no single technology or approach that solves this problem. With an "out-of-the-box" approach nowhere in sight, assembling high quality genome sequences will likely remain an important problem for some time yet.
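For readers unfamiliar with the contig NG50 statistic quoted above, here is a minimal sketch of how it is commonly computed: the length of the contig at which the running total of contig lengths, taken longest first, reaches half of the estimated genome size. The function name and the toy contig lengths are hypothetical and have no connection to the sugarcane assembly.

```python
def ng50(contig_lengths, genome_size):
    """Return the contig NG50 for an assembly.

    Contigs are taken longest first; NG50 is the length of the contig at
    which the running total first reaches half of the estimated genome
    size. Returns 0 if the assembly covers less than half the genome.
    """
    half_genome = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half_genome:
            return length
    return 0


# Toy example (hypothetical numbers): five contigs, 2,000 bp genome.
toy_contigs = [500, 400, 300, 200, 100]
toy_ng50 = ng50(toy_contigs, genome_size=2000)
```

Unlike N50, which uses half of the total assembly length as its threshold, NG50 uses half of the estimated genome size, so it stays comparable between assemblies that cover different fractions of the same genome.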

Dr Sarah Boyd: On Systems Biology

"Systems Biology" is a term that means many different things to many different people.

Prof John Carlin and Prof Gordon Smyth: On the value of p-values

P-values and Null Hypothesis Significance Testing abound in journal articles and other literature on bioinformatics and quantitative bioscience. Yet there has been confusion and controversy about what these approaches do and don’t mean—see e.g., Scientific method: Statistical errors (Nature 506, 150–152, 13 February 2014 doi:10.1038/506150a). Bioinformatics FOAM 2014 is fortunate to have two eminent statisticians—Professor John Carlin and Professor Gordon Smyth—to lead us in exploring the subtleties and science of statistical significance so that we can all become wiser to the value of p-values.

Dr David Goode: Reconstructing tumour evolution using a combined genomics approach

Tumours are a collection of genetically distinct but related cellular lineages that must compete against each other and the external environment. Many cell lineages die out, while those with phenotypes that are advantageous expand. A major clinical consequence of this evolutionary process is the emergence of drug resistant tumour cells. We have established an in vitro model system in which we can induce the human well-differentiated liposarcoma (WDLPS) cell line 778 to acquire resistance to the MDM2 inhibitor Nutlin-3a. I will detail how we have applied bioinformatics and evolutionary principles to reconstruct major evolutionary events that occurred as this line acquired drug resistance. Integration of SNP array and exome sequencing data from different time points during the evolution of Nutlin resistance allows us to infer the relative order of genetic changes and how the rate of evolution fluctuated during the course of the experiment.

Dr James Kijas: Building the genomics toolbox for a livestock genome

Researchers motivated to identify genes that control differences between individuals rely on a genomics toolbox. Really useful tools include i) a reference genome assembly, ii) information about where the genes and other features reside inside that genome assembly and iii) a way to collect large scale data describing the differences between genomes from different individuals. A team inside CSIRO has been developing each of these tools for the sheep genome, and putting them to work to understand what contributes to variation in production traits (e.g. wool growth and meat quality). The talk will focus on the approach taken over the last 5 years to build the various components of the toolkit. This has consequences for the development of genomic resources in other species which may currently have fewer available tools.

Dr Andrew Lonie: The Genomics Virtual Laboratory

The NeCTAR funded Genomics Virtual Laboratory (genome.edu.au) supports genome informatics research using the Australian Research Cloud. This presentation is a chance for you to get up to speed with the latest progress and next steps for the GVL with Chief Investigator and Head of the Life Sciences Computation Centre, Andrew Lonie.

Dr David Lovell: Welcome to Day 2

This is a chance to welcome everyone to this year's open section of Bioinformatics FOAM (Focus On Analytical Methods). Ideally this welcome will take about 60 seconds, with the other 540 seconds devoted to getting everyone into the meeting room… perhaps we can do better than that... only you can decide.

Dr David Lovell: What I think about when I think about your job application?

With special guest panellists: Dr Maria Doyle, Dr Andrew Lonie and Dr Alicia Oshlack. This is a chance for us to get inside the minds of people who, from time to time, have been and will be looking at lists of applications and CVs while recruiting for a quantitative bioscientist or bioinformatician. Recruitment processes are, for many reasons, conducted with a high degree of confidentiality and, until you have actually been on a recruitment panel (and sometimes even after that), they are shrouded in mystery. This facilitated discussion session focuses on the point of "first impressions", which is usually when the selection panel gets a list of applications, CVs and cover letters to shortlist for further consideration. We will be asking panellists to expose their own experiences and perspectives both as recruiters and "recruitees" and use that as a launch pad for audience discussion.

Dr Sean O'Donoghue: New tools for protein structures and phosphorylation pathways - and another animation

Round out this year's FOAM with VizBi founder and CSIRO Science Leader Sean O'Donoghue as he takes us through his team's latest work, including an animation about pancreatic cancer.

Dr Neil Saunders: Learning from complete strangers: social networking for bioinformaticians

Social networks, forums, blogs and other online spaces can be a rich source of information for learning from other bioinformaticians. I've been involved with online bioinformatics communities for around 13 years, beginning with a Slashdot-style site called Nodalpoint in 2001, through to Twitter today. In this talk, I hope to convey something of what I've learned in that time about the benefits (and sometimes, the pitfalls) of being an "online bioinformatician".

Dr Sarah-Jane Schramm: Integration and analysis of high-throughput data types for insights into complex disease

Integrative ‘-omics’ – wherein multiple types of high-throughput data are combined and analysed together – continues to grow in popularity for its potential to illuminate the basis of complex diseases. Our work explores different ways of combining such data to reveal insights into cancer biology.

Dr Sarah-Jane Schramm: The Inside Voice – reflections and ramblings of a newly minted post-doc

Join the intrepid Dr Schramm on a journey beyond the PhD… what lies out there?



Mr Jason Whyte: Before collecting `Big Data', what can we do with NO data? Or Why get into a lather about experimental design?

Mathematical models of a physical system (such as a biomolecular interaction network) allow us to simulate or predict the behaviour of the system. In particular, one may gain insight into the behaviour of the system when conditions are changed, or infer the values of quantities that are not directly measurable. The author encountered the problem of a priori unidentifiable biochemical interaction models during some character-building misadventures with flow-cell optical biosensor (BIAcore™) experiments. This talk will inspect the properties of models of some simple chemical reaction networks. Approaches discussed are suited to models arising in various applications.

Life’s complex…

…use bioinformatics

The CSIRO Bioinformatics Core and the Australian Bioinformatics Network are proud to support Bioinformatics FOAM 2014.

The Core aims to complement and augment the efforts of bioinformaticians and bioinformatics teams across CSIRO.

The Australian Bioinformatics Network aims to connect people, resources and opportunities to increase the benefits Australian bioinformatics can deliver.

We wish all delegates a successful meeting.
