Searching and Sorting Arrays. Searching in ordered and unordered arrays.
Sourcerer: mining and searching internet-scale software...
Transcript of Sourcerer: mining and searching internet-scale software...
![Page 1: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/1.jpg)
Sourcerer: mining and searching internet-scale software repositoriesIntroduction to Information RetrievalCS 221Donald J. Patterson
Content based on the paper located here:http://dx.doi.org/10.1007/s10618-008-0118-xand slides located http://bit.ly/9CEaaT
Friday, March 12, 2010
![Page 2: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/2.jpg)
Sourcerer: mining and searching internet-scale software repositories
Friday, March 12, 2010
![Page 3: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/3.jpg)
Sourcerer: mining and searching internet-scale software repositories
Friday, March 12, 2010
![Page 4: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/4.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Why mine source code?
• to understand engineering and development
• to understand complexity
• to improve code reuse
• to identify relationships between humans and
their code
• Code should not be treated as text
• There is a lot of structure that can be exploited
Friday, March 12, 2010
![Page 5: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/5.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Sourcerer is
• a crawler of software repositories
• a parser and feature extractor for code
• a fingerprinter
• a database
• a web search interface
• for Java Source code
Friday, March 12, 2010
![Page 6: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/6.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Sourcerer architecture
Friday, March 12, 2010
![Page 7: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/7.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Sourcerer search interface
Friday, March 12, 2010
![Page 8: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/8.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Parsing
• Entities • Relations
Friday, March 12, 2010
![Page 9: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/9.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Keyword Extraction
• Comments
• Splits on Case
• QuickSort -> “Quick” “Sort”
• Mapped to entities
Friday, March 12, 2010
![Page 10: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/10.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Fingerprinting Source Code
• Structure-based search requires a compact representation
of code characteristics
• “Fingerprints” are vectors whose elements denote the
occurrence of specific programming constructs
• Easily lends itself to the vector model of standard
information retrieval
• Fingerprints must balance efficiency and expressiveness
• Feature set must be rich enough to be meaningful
• Superfluous features add to computational overhead
Friday, March 12, 2010
![Page 11: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/11.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Fingerprinting types
• Control Structure Prints
• Provides information about concurrency, iteration, and
conditional constructs
• Useful for identifying benchmark datasets
• Java Type Prints
• Captures information about object-oriented constructs
(classes, methods, fields, constructors, etc)
• Provides capability for general entity structure search
Friday, March 12, 2010
![Page 12: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/12.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Fingerprinting types
• Micro Pattern Prints
• Bit vector indicating occurrence of simple design
patterns in code entities
• Allows for structure-based search based on commonly
occurring design practices
Friday, March 12, 2010
![Page 13: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/13.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Fingerprint Search
Friday, March 12, 2010
![Page 14: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/14.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Ranking
• Return code that is
• keyword relevant
• structure relevant
• frequently used
• robust
• Determine importance of entities by applying PageRank
• to source code
• probabilistic framework for ranking
Friday, March 12, 2010
![Page 15: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/15.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Ranking
• CodeRank can be tuned for
• Project local ranking
• Project global ranking
• Relationship-specific Ranking
• Increasing the weight of relevant edges in
dependency graph
Friday, March 12, 2010
![Page 16: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/16.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Current Sourcerer Statistics
• Repository
• Total number of projects (with source): 1555
• Total source files: 185,439
• Total lines of code: 18,804,522
• Size of repository: 1.17 GB
Friday, March 12, 2010
![Page 17: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/17.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Current Sourcerer Statistics
• Database
• Total number of java packages: 47,898
• Total number of java classes: 254,049
• Total number of methods: 1,516,212
• Total number of fields: 732,764
• Total number of relations: 11,345,077
Friday, March 12, 2010
![Page 18: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/18.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Keyword frequency (%)
Friday, March 12, 2010
![Page 19: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/19.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Keyword frequency (%)
Friday, March 12, 2010
![Page 20: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/20.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Code characteristics
Friday, March 12, 2010
![Page 21: Sourcerer: mining and searching internet-scale software ...djp3/classes/2010_01_CS221/Lectures/Lect… · Sourcerer: mining and searching internet-scale software repositories •](https://reader033.fdocuments.us/reader033/viewer/2022050523/5fa69968845e0b29bf6927f9/html5/thumbnails/21.jpg)
Sourcerer: mining and searching internet-scale software repositories
• Effectiveness of Search based on various code features
Friday, March 12, 2010