2014 10-14: GitHub plus FOSS == 1 million SPDX
Uploaded by nuno-brito
![Page 1: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/1.jpg)
Large-scale license transparency using open data, open standards and F/OSS
GitHub + F/OSS => 1 million SPDX
http://triplecheck.net http://searchcode.com
![Page 2: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/2.jpg)
Speaker
Slide #2
Nuno Brito
Free/open source contributor since 2005
Wrote 100k lines of F/OSS code in the last 12 months
SPDX contributor, co-founder of TripleCheck
Around the web http://nunobrito.eu
![Page 3: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/3.jpg)
Transparency
Slide #3
Take some source code as an example
Who developed the code?
Which licenses are applicable?
Was the code copied from somewhere else?
![Page 4: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/4.jpg)
Size
Slide #4
A problem of scale
Open licenses? More than 300 types to choose from
More than 5 million F/OSS projects
More than 100 million source code files
![Page 5: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/5.jpg)
Practice
Slide #5
Applying licenses
Burden on the developer (do it correctly, do enough)
Expressed differently (difficult to understand)
Scaling obstacles (scarce automation)
Transparency?
![Page 6: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/6.jpg)
What to do?
Slide #6
Ideally, we'd have tooling that is...
a) Reachable
b) Cooperative
c) Free
Choose two. (sad reality)
![Page 7: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/7.jpg)
Choose three
Slide #7
Choose building blocks based on:
a) Open standards
b) Open data
c) Reachable tools
Learn, write, improve.
Share.
![Page 8: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/8.jpg)
Standards
Slide #8
SPDX: Open standard for software licensing
Standardizes license descriptions
Defines IDs for license terms
http://spdx.org
Pros: Good docs, straightforward, getting better
Cons: Slow adoption, scarce tooling
![Page 9: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/9.jpg)
Open data
Slide #9
GitHub: Targeting open data repositories
API suited for intensive access
Social coding
Largest open source code collection
Pros: Reachable, diverse
Cons: Repositories must be processed one by one
![Page 10: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/10.jpg)
Tooling
Slide #10
Custom-built tools for software licenses
Large-scale repository data mining
Find applicable licenses inside content
Share millions of SPDX documents
Pros: Learn by doing, modularized, single language
Cons: Built from scratch, needs consolidation
![Page 11: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/11.jpg)
Step 1
Slide #11
Desktop tool/engine to discover licenses
SPDX format as storage medium
Identifies copyright and 18 license types
Written in Java, released in Feb. 2014 under the EUPL
http://spdx.org/tools/community/triplecheck-reporter
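As a rough illustration of what a detection engine of this kind does, the sketch below scans a file's text for license phrases and a copyright line, then prints a few SPDX tag/value fields. The phrase table and the `detect` and `to_spdx` helpers are hypothetical and are not taken from triplecheck-reporter; only the tag names and license identifiers follow the SPDX specification.

```python
import re

# Hypothetical phrase table: SPDX license identifier -> text that signals it.
LICENSE_PHRASES = {
    "Apache-2.0": "licensed under the apache license, version 2.0",
    "MIT": "permission is hereby granted, free of charge",
}

# Matches lines such as "Copyright (c) 2014 Some Name".
COPYRIGHT_RE = re.compile(r"copyright\s+(?:\(c\)\s*)?\d{4}.*", re.IGNORECASE)


def detect(text):
    """Return (sorted license ids, first copyright line or None)."""
    lowered = " ".join(text.lower().split())  # normalize whitespace
    licenses = sorted(
        spdx_id for spdx_id, phrase in LICENSE_PHRASES.items() if phrase in lowered
    )
    match = COPYRIGHT_RE.search(text)
    return licenses, (match.group(0).strip() if match else None)


def to_spdx(file_name, text):
    """Render a few SPDX tag/value lines for one file."""
    licenses, copyright_line = detect(text)
    lines = [f"FileName: {file_name}"]
    for spdx_id in licenses or ["NOASSERTION"]:
        lines.append(f"LicenseInfoInFile: {spdx_id}")
    if copyright_line:
        lines.append(f"FileCopyrightText: <text>{copyright_line}</text>")
    else:
        lines.append("FileCopyrightText: NOASSERTION")
    return "\n".join(lines)


sample = """/*
 * Copyright (c) 2014 Example Author
 * Licensed under the Apache License, Version 2.0 (the "License");
 */"""
print(to_spdx("./src/Example.java", sample))
```

A real engine would need far more robust matching than a single phrase per license; the point here is only the flow from raw text to SPDX fields.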
![Page 12: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/12.jpg)
Desktop
Slide #12
![Page 13: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/13.jpg)
File detail
Slide #13
![Page 14: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/14.jpg)
SPDX file
Slide #14
![Page 15: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/15.jpg)
Customize
Slide #15
![Page 16: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/16.jpg)
Details
Slide #16
Underneath the hood
147 file extensions, 18 license types
LOC, hashes (SHA1, MD5, SHA256, SSDEEP)
Command line supported (Jenkins, cron)
Fast: 40k files/minute (Pentium IV)
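The per-file facts listed above (line count and checksums) can be approximated with a few lines of standard-library Python. This is only a sketch: `file_facts` is a hypothetical helper, not the tool's own code, and SSDEEP is left out because it is not part of the standard library.

```python
import hashlib


def file_facts(data: bytes) -> dict:
    """Compute the line count and SHA1/MD5/SHA256 digests of one file's bytes."""
    return {
        "loc": len(data.splitlines()),
        "sha1": hashlib.sha1(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


facts = file_facts(b"int main() {\n    return 0;\n}\n")
print(facts["loc"], facts["sha1"])
```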
![Page 17: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/17.jpg)
Step 2
Slide #17
Discovering repositories with gitFinder
Create a list of online projects to use as components.
Get basic licensing information from each project.
Write a text file listing each GitHub user (~7 million)
For each user, find repositories that are not forks (~10M)
Split the repositories according to language (197)
For each language/repository list, download the code
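The filtering and grouping steps above can be sketched on in-memory data. The sample users and repositories below are made up, and `group_by_language` is a hypothetical helper rather than gitFinder's code; the `fork` and `language` keys mirror fields that GitHub's API reports for each repository.

```python
from collections import defaultdict

# Made-up sample standing in for what a crawler would collect per user.
REPOS_BY_USER = {
    "alice": [
        {"name": "parser", "fork": False, "language": "Java"},
        {"name": "linux", "fork": True, "language": "C"},
    ],
    "bob": [
        {"name": "site", "fork": False, "language": "JavaScript"},
        {"name": "tools", "fork": False, "language": "Java"},
    ],
}


def group_by_language(repos_by_user):
    """Drop forks, then bucket 'user/repo' identifiers by language."""
    buckets = defaultdict(list)
    for user in sorted(repos_by_user):
        for repo in repos_by_user[user]:
            if repo["fork"]:
                continue  # only original repositories are kept
            language = repo["language"] or "Unknown"
            buckets[language].append(f"{user}/{repo['name']}")
    return dict(buckets)


for language, repos in sorted(group_by_language(REPOS_BY_USER).items()):
    print(language, repos)
```

Each resulting list could then be handed to a downloader, one language at a time.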
![Page 18: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/18.jpg)
Performance
Slide #18
~70k repositories/day
Single machine (i7, 8 GB RAM, CentOS)
9 parallel threads
Resume/recover supported
Released in Jun. 2014
https://github.com/triplecheck/gitfinder
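One simple way to combine parallel workers with resume support is an append-only checkpoint file that records each finished repository. The sketch below assumes that design; `process`, `load_done` and `run` are hypothetical names, and gitFinder itself may handle recovery differently.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process(repo):
    """Placeholder for the real work, for example downloading a repository."""
    return repo.upper()


def load_done(path):
    """Read the set of repositories already recorded in the checkpoint."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def run(repos, checkpoint_path, workers=9):
    """Process pending repos in parallel; return the list that was processed."""
    done = load_done(checkpoint_path)
    pending = [repo for repo in repos if repo not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            open(checkpoint_path, "a") as log:
        # Results arrive in input order; each one is logged as soon as it is read.
        for repo, _result in zip(pending, pool.map(process, pending)):
            log.write(repo + "\n")
            log.flush()
    return pending


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "done.txt")
    first = run(["a/x", "b/y"], checkpoint)
    second = run(["a/x", "b/y", "c/z"], checkpoint)
    print(first, second)
```

The second call skips the two repositories recorded by the first, which is the behavior a crawler needs after an interruption.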
![Page 19: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/19.jpg)
Output
Slide #19
![Page 21: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/21.jpg)
Storage
Slide #21
BigZip: 100+ million files in a single download
Flat file, zip compression (per entry)
Fast, simple, portable
Indexed search
https://github.com/triplecheck/big
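To make the idea of a flat file with per-entry compression and an index concrete, here is a minimal sketch. The `FlatArchive` class and its layout are assumptions for illustration only and do not reproduce the actual BigZip format.

```python
import io
import zlib


class FlatArchive:
    """Append-only container in which each entry is compressed on its own."""

    def __init__(self, stream):
        self.stream = stream
        self.index = {}  # entry name -> (offset, compressed length)

    def add(self, name, data: bytes):
        blob = zlib.compress(data)
        offset = self.stream.seek(0, io.SEEK_END)  # always append at the end
        self.stream.write(blob)
        self.index[name] = (offset, len(blob))

    def read(self, name) -> bytes:
        offset, length = self.index[name]
        self.stream.seek(offset)
        return zlib.decompress(self.stream.read(length))


archive = FlatArchive(io.BytesIO())
archive.add("alice/parser/Main.java", b"class Main {}\n")
archive.add("bob/tools/run.sh", b"#!/bin/sh\necho hi\n")
print(archive.read("bob/tools/run.sh"))
```

Because every entry is compressed independently, a single file can be read back by seeking to its offset without unpacking anything else.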
![Page 22: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/22.jpg)
How it looks
Slide #22
![Page 23: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/23.jpg)
Step 3
Slide #23
SPDX search engine
One-click SPDX creation from open data
Visualize license and copyright data
Visit at http://searchcode.com/spdx
![Page 24: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/24.jpg)
Example
Slide #24
Using the original URL...
https://github.com/iuly/europa_kernel/
=>
https://spdxhub.com/iuly/europa_kernel/
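The mapping shown above is a host swap that keeps the user/repository path. A small sketch, where `to_spdx_url` is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def to_spdx_url(github_url, spdx_host="spdxhub.com"):
    """Swap the host of a GitHub repository URL, keeping the user/repo path."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a GitHub URL: {github_url}")
    return urlunsplit((parts.scheme, spdx_host, parts.path, "", ""))


print(to_spdx_url("https://github.com/iuly/europa_kernel/"))
```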
![Page 25: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/25.jpg)
Example
Slide #25
![Page 26: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/26.jpg)
SPDX-1M
Slide #26
“Do It Yourself” kit: generate 1 million SPDX documents
https://github.com/triplecheck/diy
1.2 million open source projects
“Arduino” for software license detection
9 GB worth of SPDX? Grab: http://triplecheck.net/public/storage/spdx.big
![Page 27: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/27.jpg)
Screenshots
Slide #27
![Page 28: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/28.jpg)
Next step?
Slide #28
F2F – pinpointing non-original code
Decompose code into blocks
Tokenize/anonymize data
Find code matches across knowledge base
ETA in Dec. 2014
https://github.com/triplecheck/f2f
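The three steps can be sketched as follows: anonymize tokens so that renamed variables still match, cut the code into overlapping blocks of lines, and compare block hashes. Everything here is an assumption for illustration (the helper names, the three-line block size, the use of Python keywords); it is not the F2F implementation.

```python
import hashlib
import keyword
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def anonymize(line):
    """Replace identifiers with ID and numbers with NUM; keep keywords and symbols."""
    out = []
    for token in TOKEN_RE.findall(line):
        if token[0].isdigit():
            out.append("NUM")
        elif (token[0].isalpha() or token[0] == "_") and not keyword.iskeyword(token):
            out.append("ID")
        else:
            out.append(token)
    return " ".join(out)


def fingerprints(source, block_size=3):
    """Hash every run of block_size consecutive non-empty, anonymized lines."""
    lines = [anonymize(line) for line in source.splitlines() if line.strip()]
    return {
        hashlib.sha1("\n".join(lines[i:i + block_size]).encode()).hexdigest()
        for i in range(len(lines) - block_size + 1)
    }


known = "def add(a, b):\n    total = a + b\n    return total\n"
suspect = (
    "import sys\n\n"
    "def plus(x, y):\n    s = x + y\n    return s\n"
    "print(plus(1, 2))\n"
)
shared = fingerprints(known) & fingerprints(suspect)
print(len(shared))
```

The renamed function in `suspect` still shares a block with `known`, because both reduce to the same anonymized token stream.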
![Page 29: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/29.jpg)
Preview
Slide #29
![Page 30: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/30.jpg)
Conclusion
Slide #30
What is now available to everyone
Desktop tooling / detection engine
Extraction of open data at scale
Search engine for SPDX
![Page 31: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/31.jpg)
Questions?
Slide #31
http://spdx.org
http://searchcode.com/spdx
http://github.com/triplecheck
Interesting stuff? Let us know: @nn81 @boyte #linuxcon
http://xkcd.com/1118/
![Page 32: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/32.jpg)
Backup slides
Slide #32
![Page 33: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/33.jpg)
Engine
Slide #33
![Page 34: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/34.jpg)
License DB
Slide #34
![Page 35: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/35.jpg)
Components
Slide #35
![Page 36: 2014 10-14: GitHub plus FOSS == 1 million SPDX](https://reader034.fdocuments.us/reader034/viewer/2022052621/558a58a5d8b42a73468b46a2/html5/thumbnails/36.jpg)
Exporting
Slide #36