Cregit Recovering token level authorship from Git



cregit: Who Authored the Kernel?

Recovering Token-Level Authorship Information from Git

Daniel M German, University of Victoria

[email protected]

Kate Stewart, Linux Foundation

Bram Adams, Polytechnique Montréal


Cincinnati Library. Image in the public domain


diff-able


Image in the public domain


“History is written by the winners”

By Y. Karsh. Image in the public domain in Canada. Copyrighted in the US


“Archeology is the search for fact, not truth. If it's truth you're interested in, Dr. Tyree's Philosophy class is right down the hall.”

-- Indiana Jones

Image Copyright Walt Disney Company


The history in git is likely to be incomplete


Yet, what can we do with it?


The Dream, by H. Rousseau. In the public domain.


Image in the public domain


Evolutionary Views of VC Repos


Per Line

Per Token
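The difference between the two granularities can be illustrated with a toy example. The sketch below is not cregit's implementation and does not call git: it assumes a hypothetical regex tokenizer (`tokenize`), a hypothetical `blame` helper built on Python's `difflib`, and two made-up authors. It only shows how the same edit is attributed when the unit of comparison is a line versus a token.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately crude C-like tokenizer: identifiers, integers,
# string literals, and any other single non-space character.
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+|"(?:\\.|[^"\\])*"|\S')


def tokenize(source):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(source)


def blame(history, split):
    """Attribute each unit of the final version to an author.

    history: list of (author, full_text) pairs, oldest first.
    split:   function turning text into units (lines or tokens).
    Units that survive unchanged from the previous version keep their
    earlier author; every other unit is credited to the current author.
    """
    units, authors = [], []
    for author, text in history:
        new_units = split(text)
        new_authors = [author] * len(new_units)
        matcher = SequenceMatcher(a=units, b=new_units, autojunk=False)
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_authors[block.b + k] = authors[block.a + k]
        units, authors = new_units, new_authors
    return list(zip(units, authors))


# Made-up two-commit history: bob changes a single constant.
history = [
    ("alice", "int total = 1;\nreturn total;\n"),
    ("bob", "int total = 2;\nreturn total;\n"),
]

per_line = blame(history, str.splitlines)
per_token = blame(history, tokenize)

print(per_line)
print([unit for unit, author in per_token if author == "bob"])
```

In this example the per-line view credits bob with the whole first line, while the per-token view credits him only with the constant he changed.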


Linux History


Warning

● The author (git parlance) of a commit is not necessarily the author of the code

– Code imported from another source

– Refactorings

– Moving code


Up to Linux 4.7

Persons in blame:

● Line: 12,005

● Token: 12,087


Token vs. Line: Linux


Token vs. Line: kernel/


Many small changes

Non-merge commits that modified C and H files, as a share of all commits:

● 9.5% of commits added 3 or fewer c-tokens and removed 3 or fewer c-tokens

● 7% of commits did not add any c-tokens but removed c-tokens

● 3.8% of commits added one c-token and removed one c-token

● 22.4% of commits added 10 or fewer c-tokens and removed 10 or fewer c-tokens

● 50% of commits added 60 or fewer c-tokens and removed 60 or fewer c-tokens

● 2 commits added at least 1M c-tokens and removed at least 1M c-tokens
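Shares like the ones above can be derived from per-commit counts of added and removed c-tokens. The following is a minimal sketch under stated assumptions: the `CommitStats` record, the helper names, and the four sample commits are all hypothetical, and the denominator is simply the list passed in, which may differ from the denominator used on the slide.

```python
from typing import NamedTuple


class CommitStats(NamedTuple):
    """Hypothetical per-commit record of c-token changes."""
    added: int
    removed: int


def share_at_most(commits, limit):
    """Percentage of commits that added <= limit AND removed <= limit c-tokens."""
    hits = sum(1 for c in commits if c.added <= limit and c.removed <= limit)
    return 100.0 * hits / len(commits)


def share_removal_only(commits):
    """Percentage of commits that removed c-tokens without adding any."""
    hits = sum(1 for c in commits if c.added == 0 and c.removed > 0)
    return 100.0 * hits / len(commits)


# Made-up sample, not data from the kernel.
sample = [
    CommitStats(added=1, removed=1),
    CommitStats(added=0, removed=5),
    CommitStats(added=12, removed=3),
    CommitStats(added=70, removed=2),
]

for limit in (3, 10, 60):
    print(limit, share_at_most(sample, limit))
print("removal only", share_removal_only(sample))
```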


C-Churn

● C-churn = c-tokens added - c-tokens removed, in non-merge commits

Non-merge commits that modified C and H files, as a share of all commits:

● 10% of commits had c-churn == 0

● 48% had c-churn <= 10

● 26% had negative c-churn

● 2 commits had c-churn >= 1M
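The churn definition above can be turned into a small computation. This is a sketch only: `CommitStats`, `c_churn`, `percent`, and the sample commits are hypothetical names and made-up data, and the `<= 10` comparison is taken literally on the signed value, so commits with zero or negative churn also satisfy it.

```python
from typing import NamedTuple


class CommitStats(NamedTuple):
    """Hypothetical per-commit record of c-token changes."""
    added: int
    removed: int


def c_churn(commit):
    """C-churn of one commit: c-tokens added minus c-tokens removed."""
    return commit.added - commit.removed


def percent(commits, predicate):
    """Percentage of commits whose c-churn satisfies the predicate."""
    hits = sum(1 for c in commits if predicate(c_churn(c)))
    return 100.0 * hits / len(commits)


# Made-up sample, not data from the kernel.
sample = [
    CommitStats(added=5, removed=5),
    CommitStats(added=3, removed=10),
    CommitStats(added=20, removed=4),
    CommitStats(added=100, removed=0),
]

zero = percent(sample, lambda churn: churn == 0)
negative = percent(sample, lambda churn: churn < 0)
small = percent(sample, lambda churn: churn <= 10)
print(zero, negative, small)
```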


Conclusion

● In the large

– Token and Line are equivalent

● In the small

– Tokens provide a fine-grained view of the evolution of the code
