Succeeding in academia despite doing good software
Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Gael Varoquaux
Strong is the power of the dark side
with Python
despite good software?
Publish or perish
Want a career?
Broken value system
Only Science matters
Wrong incentives
G Varoquaux 2
Publish or perish
Want a career? Hack academia!
Publishing scientific software matters
[Pradal, Varoquaux, Langtangen, J Computational Science]
[TL;DR]*
Choose your battles: keeping science in the target
Win them: software production
* Too Long, Didn’t Read
Growing up as a geek scientist
I did a PhD in quantum physics
Vacuum (leaks), electronics (shorts), lasers (misalignment)
Best training ever for agile project management
Computers were only one of the many moving parts
Matlab, instrument control
Shaped my vision of computing as a means to an end
Success
2011: tenured researcher in computer science
Today: growing team with data science rock stars
Bye bye physics
How / why did I switch?
Fernando Perez (IPython), Prabhu Ramachandran (Mayavi), Eric Jones (Enthought), Travis Oliphant (Numpy)...
Learning fast is more important than knowing something
And now...
What I do nowadays: machine learning to understand brain function
Cognitive neuroscience: link neural activity to thoughts and cognition
Learn a bilateral link between brain activity and cognitive function
Software is the new medium of the scientific method
[Image: Galileo’s notes]
The scientific method
1. Make conjectures
2. Derive predictions
3. Carry out experiments
4. Confirm or refute conjectures
Software is everywhere: data mining, computational models, computer-controlled experiments, data analysis
Code is often the very language in which predictions are expressed
Models are now more complex than a simple formula or sentence
Enabling falsification: reproducible science
Replicating: a 3rd party redoing the work; code and data made available
Reproducing: new analysis on different data / code coming to the same conclusion
Reusing: applying the approach to a new problem
Let us enable reusable research
Arguments for BSD license: no strings attached, can tinker with it
Reusable science ⇒ evidence accumulation
Accumulation of scientific knowledge and learning formal representations
Akin to a review paper of the field, but a mathematical model is more testable
Machine learning: engineering knowledge from data
“A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.”
Stephen Hawking, A Brief History of Time.
The sweet spots across science and software
The advancement of knowledge
Imagine a circle that contains human knowledge
By the time you finish elementary school, you know a little
High school takes you a little bit further
With a bachelor’s degree, you gain a speciality
A master’s degree deepens this speciality
Research papers take you to the edge of human knowledge
Once you are at the boundary, you focus
You push at the boundary for a few years
And one day it yields
That dent you’ve made is called a PhD
Of course, the world looks different to you now
But don’t forget the big picture
Courtesy of Matt Might, via Stefan van der Walt
This is an optimistic view
[Figure labels: Biology, Maths, Computer science, Physics, Economy, Literature, History]
I want to be there
Translational computational science
Computational science: the use of computers and mathematical models to address scientific research
Translational science: in medicine, bring bench science to medical practice
Translational computational science?
Pick a problem to work on
Take the “easy” route
There needs to be a market screaming for the software (in academia and in industry)
Refine your vision
Pull, not push: design driven by need
Having an impact
Pick the right battles: viable projects
Project idea: a software implementing i) machine learning, and ii) neuroimaging, and iii) a graphical user interface, and iv) 3D plotting
Define project scope and vision
Break down projects by expertise
Don’t solve hard problems
Know the software landscape
Don’t target markets that will not yield contributors
Need a vision = elevator pitch
Your research (PhD) probably does not qualify ⇒ need to cherry-pick contributions
Open source and community development
Code maintenance too expensive to be alone
scikit-learn ~ 300 emails/month, nipy ~ 45 emails/month, joblib ~ 45 emails/month, mayavi ~ 30 emails/month
“Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.”
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code... to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
Community development in scikit-learn
Huge feature set: benefits of a large team
Project growth: more than 200 contributors, ~ 12 core contributors
1 full-time INRIA programmer from the start
Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn)
Communities: many eyes make code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
Having an impact
You need a community
What’s in a scientific-computing environment
The scientific workflow: agile
Interaction... → script... → module... → interaction again...
Consolidation, progressively
Low tech and short turn-around times
Choose your weapons
Python, what else?
- Interactive language
- Easy to read / write
- General purpose
- Old virtual machine / compiler
- Younger languages promising (Julia), but will they get adoption beyond science?
+ Numpy arrays
Shoehorn your data in a numpy array, and you’ve won
Personally disappointed that pandas drifted away
Software architecture for science
“Scriptability” is paramount
In an application: MVC (model, view, controller)
Model: numerical or data-processing core
View: output (graphs, or files); must enable headless use
Controller: input (dialogs, or an API); avoid input as files: not expressive
Dialogs should never be far from the code
Dialog generation: traits, IPython widgets
Reactive programming: dialogs modify object, and the model updates
Don’t own the main
In Mayavi: script generation for free
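The model / view / controller split can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not code from Mayavi or any project named in the talk: the names `compute_signal`, `write_table`, and `run` are hypothetical.

```python
import io
import math


def compute_signal(n_points, frequency=1.0):
    """Model: pure numerical core, no input or output."""
    return [math.sin(2 * math.pi * frequency * i / n_points)
            for i in range(n_points)]


def write_table(values, stream):
    """View: write results to any file-like object, so it runs headless."""
    for index, value in enumerate(values):
        stream.write(f"{index}\t{value:.3f}\n")


def run(n_points=4, frequency=1.0, stream=None):
    """Controller: a plain function API instead of an input file."""
    values = compute_signal(n_points, frequency)
    if stream is not None:
        write_table(values, stream)
    return values


# Scripted, headless use: no dialog and no display needed.
buffer = io.StringIO()
result = run(n_points=4, stream=buffer)
```

Because the controller is an ordinary function, a dialog, a script, or a test can all drive the same model.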
Quality is free*
* This is a book, by Philip Crosby
You need quality
Quality will give you users: bugs give you a bad rap
Quality will give you developers: contribute to learn and improve
Quality will make your developers happy: people need to be proud of their work
Do less, do better: goes against the grant-system incentive
Quality: what & how
Great documentation
- Simplify, but don’t dumb down
- Focus on what the user is trying to solve
Great APIs
- Example-based development
- If something is hard to explain, rethink the concepts
- Limit the number of different concepts and objects
- Consistency, consistency, consistency
Good numerics
- Write tests based on mathematical properties
- When a user finds an instability, write a new test
Quality enables reuse: beyond mere reproducibility
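As a rough sketch of what tests based on mathematical properties can look like, here is a small softmax function with three tests. The function and test names are made up for this illustration and do not come from the talk; the last test stands in for the kind of regression test one might add after a user reports an instability.

```python
import math


def softmax(values):
    """Numerically stable softmax: shift by the maximum before exponentiating."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def test_sums_to_one():
    # Mathematical property: the outputs form a probability distribution.
    assert math.isclose(sum(softmax([0.1, 2.0, -3.0])), 1.0)


def test_shift_invariance():
    # Mathematical property: adding a constant to every input changes nothing.
    base = softmax([1.0, 2.0, 3.0])
    shifted = softmax([101.0, 102.0, 103.0])
    assert all(math.isclose(a, b) for a, b in zip(base, shifted))


def test_large_inputs():
    # Regression test for an instability: a naive math.exp(1000.0)
    # raises OverflowError, the shifted version must not.
    result = softmax([1000.0, 1000.0])
    assert all(math.isclose(p, 0.5) for p in result)


for test in (test_sums_to_one, test_shift_invariance, test_large_inputs):
    test()
```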
Be productive
“If you spend too much time thinking about a thing, you’ll never get it done.”
Bruce Lee
Limited resources
Limited resources are good
Need success in the short term, not the long term
The startup culture: fail fast
Quickly identify non-viable projects
The simplest solution that works is the best
Short cycles, limited ambitions
Keep coming back to your users
Release early, release often
Simplicity
Complexity increases superlinearly
[An Experiment on Unit Increase in Problem Complexity, Woodfield 1979]
25% increase in problem complexity ⇒ 100% increase in code complexity
The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code
Avoid feature creep
Use objects sparingly
Don’t use classes for the sake of it
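One way to read "use objects sparingly" is sketched below with a made-up rescaling task; `Rescaler` and `rescale` are hypothetical names chosen for this illustration. The class holds no state worth keeping, so a plain function does the same job with less to read and explain.

```python
class Rescaler:
    """A class used for the sake of it: one method, no state worth keeping."""

    def __init__(self, values):
        self.values = values

    def run(self):
        low, high = min(self.values), max(self.values)
        return [(v - low) / (high - low) for v in self.values]


def rescale(values):
    """The same computation as a plain function."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


data = [2.0, 4.0, 6.0]
# Both spellings give the same result; the function needs no instantiation.
assert Rescaler(data).run() == rescale(data)
```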
Software engineering
Software engineering good practices
Version control: use git + github
Unit testing: if it’s not tested, it’s broken or soon will be.
Make a package, with controlled dependencies and compilation
...
Research ≠ production
Need to adapt software-engineering principles
Down the chain (↓):
- Good naming is free
- Use functions, not scripts
- Version control is very cheap
- Tests are more expensive... Consider them if goals stabilize
- Build chains are hard
Go down the chain as your research progresses
You can think of shipping software only if it was viable to go completely down the chain
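A minimal sketch of "use functions, not scripts", assuming a made-up analysis that summarizes a list of measurements. The name `summarize` and the data are hypothetical; the point is only that the computation lives in an importable, testable function and the script shrinks to a thin wrapper.

```python
import statistics


def summarize(measurements):
    """Return the mean and sample standard deviation of a list of numbers."""
    return statistics.mean(measurements), statistics.stdev(measurements)


def main():
    # The script part is now a thin wrapper around the function.
    measurements = [1.0, 2.0, 3.0, 4.0]  # made-up data for illustration
    mean, spread = summarize(measurements)
    print(f"mean={mean:.2f} stdev={spread:.2f}")


if __name__ == "__main__":
    main()
```

Once the goals stabilize, `summarize` can be imported from a test file or moved into a package without being rewritten.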
Things we did right (maybe)
Mayavi: 3D visualization in Python
Success factors
- Building upon VTK: great power
- Component model (UI)
- Internals open to the world ⇒ from interaction to scripting
Limiting factors
- Building upon VTK: a lot of complexity
- Codebase too complex and object-oriented (bound to VTK)
- Users of GUIs do not turn into developers
- Composition is an API killer
joblib: computational workflow patterns
Parallel for loop
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
On-demand dispatch to ease memory consumption
Threading and processes backends
Memoize pattern
import joblib
mem = joblib.Memory(cachedir='.')  # newer joblib releases name this argument `location`
g = mem.cache(f)
b = g(a)  # computes f(a)
c = g(a)  # retrieves the result from the store
joblib: computational workflow patterns
Success factors
- Simplicity of use
- Patterns we really, really need (pull not push)
Limiting factors
- Vision of the project unclear
- Positioning with regards to landscape unclear (IPython, where are you headed?)
- Tricky code inside
scikit-learn: machine learning in Python
Success factors
Right project vision
- Machine learning without learning the machinery
- Black box that can be opened
- Right trade-off between “just works” and versatility (think Apple vs Linux)
We’re not going to solve all the problems for you
- I don’t solve hard problems
- Feature-engineering, domain-specific cases...
- Python is a programming language. Use it.
Cover all the 80% use cases in one package
scikit-learn: machine learning in Python
Success factors: right project vision, high-level programming
- Optimize algorithms, not for loops
- Know Numpy and scipy perfectly
- All significant data should be in arrays
- Avoid memory copies, rely on blas/lapack
- Use Cython, not C/C++
scikit-learn: machine learning in Python
Success factors: right project vision, high-level programming, good API design
- Separate data from operations
[Figure: data stored as arrays of numbers]
- Object API exposes a data-processing language: fit, predict, transform, score, partial_fit
- Instantiated without data but with all parameters
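The API conventions above can be illustrated with a toy transformer in plain Python. This is a sketch of the convention only, not scikit-learn code: `MeanCenterer` and its `with_scaling` parameter are hypothetical, and a real estimator would work on numpy arrays rather than lists.

```python
class MeanCenterer:
    """Toy transformer following the fit / transform convention."""

    def __init__(self, with_scaling=False):
        # Parameters only: no data is passed at construction time.
        self.with_scaling = with_scaling

    def fit(self, rows):
        # Learn per-column statistics from the data; return self to allow chaining.
        n_rows = len(rows)
        columns = list(zip(*rows))
        self.means_ = [sum(col) / n_rows for col in columns]
        self.ranges_ = [(max(col) - min(col)) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        # Apply the learned statistics to any data with the same columns.
        out = []
        for row in rows:
            centered = [v - m for v, m in zip(row, self.means_)]
            if self.with_scaling:
                centered = [c / r for c, r in zip(centered, self.ranges_)]
            out.append(centered)
        return out


train = [[1.0, 10.0], [3.0, 30.0]]
centerer = MeanCenterer().fit(train)
centered_train = centerer.transform(train)
centered_new = centerer.transform([[2.0, 20.0]])
```

Keeping the data out of the constructor means the same configured object can be fitted on one dataset and applied to another.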
scikit-learn: machine learning in Python
Success factors: right project vision, high-level programming, good API design, great community, great documentation
- Great community: Github + code review
Limiting factors
- Tricky numerical code
- Our own success ⇒ huge volume
@GaelVaroquaux
Succeeding in academia despite doing good software
1 Game the system
- It’s about convincing a tenure committee
- Code must contribute to a scientific problem
2 Not all battles can be fought
- Make sure that there is a market
- Don’t solve hard problems
- Problems that matter for science and industry
3 Make good software
- That actually answers scientists’ needs
- With quality, software engineering
- Relying on a community
- Usability matters
Now go out, and code!