cregit: Who Authored the Kernel?
Recovering Token-Level Authorship Information from Git
Daniel M. German, University of Victoria
Kate Stewart, Linux Foundation
Bram Adams, Polytechnique Montréal
Cincinnati Library. Image in the public domain
diff-able
“History is written by the winners”
By Y. Karsh. Image in the public domain in Canada. Copyrighted in the US
“Archeology is the search for fact, not truth. If it's truth you're interested in, Dr. Tyree's Philosophy class is right down the hall.”
-- Indiana Jones
Image Copyright Walt Disney Company
The history in git is likely to be incomplete
Yet, what can we do with it?
The Dream, by H. Rousseau. In the public domain.
Evolutionary Views of VC Repos
Per Line
Per Token
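A line-oriented tool such as git can give a per-token view if each file is stored with one token per line, so that diff and blame operate on tokens. The sketch below illustrates the idea only: `TOKEN_RE`, `to_token_lines`, and `changed_tokens` are hypothetical names, and the regular expression is a deliberately simplified C tokenizer, not the tokenizer cregit uses.

```python
import difflib
import re

# Simplified, hypothetical C tokenizer (assumption: not cregit's own).
TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"        # string literal
  | '(?:\\.|[^'\\])*'        # character literal
  | [A-Za-z_]\w*             # identifier or keyword
  | \d+(?:\.\d+)?            # simple number
  | ->|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%&|^]=
  | [^\s\w]                  # any other single symbol
''', re.VERBOSE)


def to_token_lines(source: str) -> str:
    """Rewrite source code as one token per line (a diff-able token view)."""
    return "\n".join(TOKEN_RE.findall(source)) + "\n"


def changed_tokens(old: str, new: str) -> tuple[int, int]:
    """Return (tokens added, tokens removed) between two versions."""
    old_t, new_t = TOKEN_RE.findall(old), TOKEN_RE.findall(new)
    matcher = difflib.SequenceMatcher(None, old_t, new_t, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed += i2 - i1
            added += j2 - j1
    return added, removed


if __name__ == "__main__":
    old = "int x = foo(a, b);"
    new = "int x = foo(a, c);"
    print(to_token_lines(new), end="")
    print(changed_tokens(old, new))
```

In a per-line view the whole statement counts as changed; in the per-token view only the token `b` is replaced by `c`, and every other token keeps its earlier author.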
Linux History
Warning
● The author (git parlance) of a commit is not necessarily the author of the code
– Code imported from another source
– Refactorings
– Moving code
Up to version 4.7
Persons in blame:
Per line: 12,005. Per token: 12,087
[Figure: Token vs. Line, shown for all of Linux and for kernel/]
Many small changes
Non-merge commits that modified C and H files, as a share of all commits
● 9.5% of commits added 3 or fewer c-tokens and removed 3 or fewer c-tokens
● 7% of commits did not add any c-tokens but removed c-tokens
● 3.8% of commits added one c-token and removed one c-token
● 22.4% of commits added 10 or fewer c-tokens and removed 10 or fewer c-tokens
● 50% of commits added 60 or fewer c-tokens and removed 60 or fewer c-tokens
● 2 commits added at least 1M c-tokens and removed at least 1M c-tokens
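The shares above can be computed from per-commit counts of added and removed c-tokens. A minimal sketch, assuming each commit is already reduced to an `(added, removed)` pair; the function name `share_small` and the sample data are hypothetical and are not the kernel's numbers.

```python
def share_small(commits: list[tuple[int, int]], k: int) -> float:
    """Fraction of commits that added <= k and removed <= k c-tokens."""
    if not commits:
        return 0.0
    small = sum(1 for added, removed in commits if added <= k and removed <= k)
    return small / len(commits)


if __name__ == "__main__":
    # Hypothetical (added, removed) c-token counts for four commits.
    sample = [(1, 1), (0, 5), (12, 3), (60, 60)]
    for k in (3, 10, 60):
        print(k, share_small(sample, k))
```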
C-Churn
● C-churn = c-tokens added - c-tokens removed, in non-merge commits
Non-merge commits that modified C and H files, as a share of all commits
● 10% of commits had c-churn == 0
● 48% had c-churn <= 10
● 26% had negative c-churn
● 2 commits had c-churn >= 1M
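The c-churn definition and the breakdown above can be sketched as follows. The names `c_churn` and `churn_summary` and the sample data are hypothetical; the input is assumed to be one `(added, removed)` c-token pair per non-merge commit. Note that "c-churn <= 10" read literally also includes the zero and negative cases.

```python
def c_churn(added: int, removed: int) -> int:
    """C-churn of one commit: c-tokens added minus c-tokens removed."""
    return added - removed


def churn_summary(commits: list[tuple[int, int]]) -> dict[str, float]:
    """Share of commits with zero, negative, and at-most-10 c-churn."""
    churns = [c_churn(a, r) for a, r in commits]
    n = len(churns)
    if n == 0:
        return {"zero": 0.0, "negative": 0.0, "at_most_10": 0.0}
    return {
        "zero": sum(c == 0 for c in churns) / n,
        "negative": sum(c < 0 for c in churns) / n,
        "at_most_10": sum(c <= 10 for c in churns) / n,
    }


if __name__ == "__main__":
    # Hypothetical (added, removed) c-token counts for four commits.
    sample = [(5, 5), (3, 10), (100, 2), (8, 0)]
    print(churn_summary(sample))
```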
Conclusion
● In the large
– Token and line views are equivalent
● In the small
– Tokens provide a fine-grained view of the evolution of the code