Chapter 10: Gathering and Using Information: Marketing Research and Market Intelligence
Twarfing: Gathering Intelligence from Twitter Data
-
Upload
sifumoraga -
Category
Documents
-
view
1.043 -
download
1
description
Transcript of Twarfing: Gathering Intelligence from Twitter Data
![Page 1: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/1.jpg)
Twarfing: Gathering Intelligence from Twitter Data
Morton SwimmerSenior Threat Researcher
Trend Micro, Inc.
Thursday, January 21, 2010
![Page 2: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/2.jpg)
Overview
• What is Twitter?
• Malicious activity on Twitter
• URL shortening services
• The Twarfing architecture
• Examples
• Conclusions
Thursday, January 21, 2010
![Page 3: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/3.jpg)
Thanks
• Ben April, Trend, for providing the massive hardware
• Based in part on a Virus Bulletin Presentation with Costin Riau, Chief of R&D, Kaspersky Labs
Thursday, January 21, 2010
![Page 4: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/4.jpg)
What is Twitter
• Founded by Jack Dorsey, Biz Stone and Evan Williams back in 2006
• Publish/Subscribe Communications system
• SMS
• Web site
• WS API and RSS
• Application
Thursday, January 21, 2010
![Page 5: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/5.jpg)
Twitter Internals
• 140 chars max to be SMS compatible
• SMS has a 160 char restriction
• But Twitter needed to add the user name
• Just text and metadata
Thursday, January 21, 2010
![Page 6: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/6.jpg)
Users not necessarily human!
• Devices
• From buoys to power meters
• Search for Twitter on instructables.com
Thursday, January 21, 2010
![Page 7: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/7.jpg)
Users not necessarily human!
• Not surprising that malware would use it, but
• Not good for Command and Control
• Easily blocked after detection
• … and twitter has been trigger happy with blocking
Thursday, January 21, 2010
![Page 8: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/8.jpg)
August 2008: Malware on Twitter
Thursday, January 21, 2010
![Page 9: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/9.jpg)
April 2009 – Twitter gets hit by XSS worm
• Multiple variants of JS.Twettir.a-h identified
• Thousands of spam messages containing the word "Mikeyy“ filled the timeline
• Proof of concept – no malicious intent
• Later, the author (Mikey Mooney) got a job at exqSoft Solutions, a web security company
• September 2009 - Yet another attack
• Phishing for the login details
Thursday, January 21, 2010
![Page 10: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/10.jpg)
2009 Trending topics start being exploited
Thursday, January 21, 2010
![Page 11: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/11.jpg)
June 2009 – Koobface spreading in Twitter
• Originally only targeted Facebook and MySpace users
• Now spreads through more social networks:
• Facebook, MySpace, Hi5, Bebo, Tagged, Netlog and most recently… Twitter
Thursday, January 21, 2010
![Page 12: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/12.jpg)
Twitter architecture, historically
• Multiple Ruby on Rails servers
• Mongrel HTTP servers
• Centrally MySQL backed
Thursday, January 21, 2010
![Page 13: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/13.jpg)
Twitter architecture, now• Details super-secret, but
this is what we think
• Front end
• Ruby-based front end
• Mongrel HTTP servers
• Back end
• Starling for queuing/messaging
• Scala-based code
• MySQL
• denormalized data whenever possible
• Only for backup and persistance
• Lots of caching using memcached
Thursday, January 21, 2010
![Page 14: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/14.jpg)
Stats (June 2009)• 25M users
• I measured 475K different users posting over a one week period in August 2009
• 300 tweets/sec
• MySQL handles 2400 reqs per second
• API traffic == 10x website traffic!
• Indicates that far more people are using applications
• TweetDeck, Twitteriffic, Digsby, Twhirl
• Many are Adobe Air based (!)
• One key to Twitter's success!
Thursday, January 21, 2010
![Page 15: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/15.jpg)
<id><user><text><created_at><source><truncated><in_reply_to_status_id><in_reply_to_user_id><favorited><in_reply_to_screen_name>
Twitter metadata
user information
Tweet text
in each tweet!
Thursday, January 21, 2010
![Page 16: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/16.jpg)
Tweet text• Retweets: RT passes the
note along
• now also in metadata
• Location: L tells friends where I am
• Hashtags: #
• topics
• show group associations
• just for tagging
• User reference: @ for public discussion
• URLs
Thursday, January 21, 2010
![Page 17: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/17.jpg)
Long URLs, short URLs
• URL shortening services have grown up around Twitter
• longurl.org counts 208 different ones
• Malicious URLs are one potential threat
• obscure the true URL
• May become malicious later
• RickRolling, but maliciously
• Benefits:
• ‘bit.ly’ and others blocks malicious URLs
Thursday, January 21, 2010
![Page 18: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/18.jpg)
Google SB API
• August 2009
• Twitter began filtering malicious URLs
• Mikko Hyppönen:
• seemed to indicate Google SB API!
• After more testing, we discovered it used some additional filtering
Thursday, January 21, 2010
![Page 19: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/19.jpg)
<id><name><screen_name><location><description><profile_image_url><url><protected><followers_count><profile_background_color><profile_text_color><profile_link_color><profile_sidebar_fill_color><profile_sidebar_border_color><friends_count><created_at>
<favourites_count><utc_offset><time_zone><profile_background_image_url><profile_background_tile><statuses_count><notifications><following><verified>
User data
Thursday, January 21, 2010
![Page 20: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/20.jpg)
Twitter to RDF
• Goal
• Avoid inventing own vocabulary
• Makes it easier to mix-in data later
• eg. Facebook, Tumblr, ...
• Most Twitter data fit readily into the SIOC ontology
• The rest used DublinCore
• Proprietary ontology for internal data
Thursday, January 21, 2010
![Page 21: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/21.jpg)
Using existing vocabs
• DublinCore, http://dublincore.org/
• SIOC, http://sioc-project.org/
• describe information from online communities
• FOAF, http://www.foaf-project.org/
• machine-readable pages describing people
• GeoOWL
Thursday, January 21, 2010
![Page 22: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/22.jpg)
Example
<http://twitter.com/float3r/status/3492845110> a sioc:Post ; dc:created "Sun, 23 Aug 2009 15:10:31 +0000" ; sioc:has_creator <http://twitter.com/float3r> ; sioc:content "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"@no ;.<http://twitter.com/float3r> a sioc:User ; sioc:id "251294"^^xsd:integer ; rdfs:label "float3r" ; sioc:avatar "http://a1.twimg.com/profile_images/53008680/Picture_022_normal.jpg"^^xsd:anyURI ;.
float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdfSun, 23 Aug 2009 15:10:31 +0000
Thursday, January 21, 2010
![Page 23: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/23.jpg)
Example, continued
float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdfSun, 23 Aug 2009 15:10:31 +0000
Thursday, January 21, 2010
![Page 24: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/24.jpg)
WhiteTwarf/RedTwarf
White Twarf
Thursday, January 21, 2010
![Page 25: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/25.jpg)
Whitetwarf
• Receives a subset of the tweets via twitter search
• Stores metadata from twitter
• Processes text part for internal metadata
• User references, hashtags, Informal tags, URLs
• Creates canonical text representations
• Export to an RDF store for analysis
Thursday, January 21, 2010
![Page 26: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/26.jpg)
WhiteTwarf architecture v.2
Twitter Stream tweet processing
couchDB
tweets users
Domain Reputation
URLs
URL processing
RDF Converter
RDF Converter
4Store
RDF Graph
Thursday, January 21, 2010
![Page 27: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/27.jpg)
Tweet Processing
• Tweet fetch process
• Fetch a limited number of tweets as JSON
• Extract metadata and text
• Separate user metadata from tweet metadata
• Store in the CouchDB database
• Tweet processing
• Extract Tags, URLS, user references
• Text signatures
• Meant to remove small differences in text
• Normalization and whitespace removal
• UTF-8 tricks expansion/removal
• CouchDB views
Thursday, January 21, 2010
![Page 28: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/28.jpg)
CouchDB
• Scalable document and key-value store
• APIs REST and JSON based
• Embedded Javascript and MapReduce for querying
• http://couchdb.apache.org/
Thursday, January 21, 2010
![Page 29: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/29.jpg)
4Store
• Scalable RDF Store
• Developed at Garlik in C and is GPLed
• Has REST and native APIs
• Queries via SPARQL
• http://4store.org/
Thursday, January 21, 2010
![Page 30: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/30.jpg)
RDF Stores in general
• Store RDF triples (often as quads or named graphs)
• Have recently become much more scalable
• Still have limitations, though
• eg. have locking issues on writes
• Allegrograph 4 (Franz, Inc) may resolve some of these for us
Thursday, January 21, 2010
![Page 31: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/31.jpg)
URL redirector processing
• For every URL extracted by the view we follow the link
• In most cases we get a 30x response (redirect)
• These are also followed after storing
• Testing showed: usually faster to use shortener APIs
• So we are testing code that uses API instead for known shorteners
• We also capture other HTTP metadata
• Basically we are looking for malicious links
Thursday, January 21, 2010
![Page 32: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/32.jpg)
Detection malicious activity
• Data exported to an RDF Store
• This is a graph database
• Allows for complex queries
• Does have some performance issues and is not real time
• Simple Attack scenario
• User is observed to post to a malicious domain
• We want to see what else he has posted
Thursday, January 21, 2010
![Page 33: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/33.jpg)
Correlation with domain reputations
mal
http://unk.com/what.exe
tweet/1234
malicious
mal.com
http://mal.com/evil.exe
tweet/5678
tw:posts
tw:posts
tw:hasURL
tw:hasURL
drs:hasFQDN
drs:hasRating
Note that the examples in this
presentation are all radically simplified
for clarity.
Thursday, January 21, 2010
![Page 34: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/34.jpg)
Matching this in SPARQL
?m
?u2
?t1
malicious
?f1
?u1
?t2
tw:posts
tw:posts
tw:hasURL
tw:hasURL
drs:hasFQDN
drs:hasRating
Thursday, January 21, 2010
![Page 35: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/35.jpg)
Matching this in SPARQL
?m
?u2
?t1
malicious
?f1
?u1
?t2
posts
posts
tw:hasURL
tw:hasURL
drs:hasFQDN
drs:hasRating
SELECT ?m ?u2WHERE {?m tw:posts ?t1.?t1 tw:hasURL ?u1.?u1 drs:hasFQDN ?f1.?f1 drs:hasRating drs:Malicious.?m tw:posts ?t2.?t2 tw:hasURL ?u2.}
Thursday, January 21, 2010
![Page 36: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/36.jpg)
Advantages of using RDF
• Avoid multiple explicite joins
• An intuitive visualization available
• Mixing in new data as easy as adding it to the RDF store
Thursday, January 21, 2010
![Page 37: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/37.jpg)
Another attack scenario
• Observed: User modified URL on retweet to be malicious
@iceman: This link is cool http://cool.com/ice.html
@notniceman: RT: @iceman: This link is cool http://c00l.com/ice.exe
iceman
notniceman
tweet/1001
tweet/1005
http://cool.com/ice.html
http://c001.com/ice.exe
thislinkiscool
tw:posts
tw:posts
tw:hasTextSignature
tw:hasURL
tw:hasURL
Thursday, January 21, 2010
![Page 38: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/38.jpg)
Matching it
?m1
?m2
?t1
?t2
?u1
?u2
?ts
tw:posts
tw:posts
tw:hasTextSignature
tw:hasURL
tw:hasURL
≠
Thursday, January 21, 2010
![Page 39: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/39.jpg)
A few stats
• New architecture since October
• Since the new version started, we logged 3.6 million users
• and keep about 12 million tweets around
• Which is about 300 million triples
• Once out of experimental phase we need to increase this number dramatically
Thursday, January 21, 2010
![Page 40: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/40.jpg)
RedTwarf
• Goal: look for new attack patterns
• Based on WhiteTwarf data
• Using Data and Text-mining techniques to detect rules
• Probably will be Lucene based
Thursday, January 21, 2010
![Page 41: Twarfing: Gathering Intelligence from Twitter Data](https://reader034.fdocuments.us/reader034/viewer/2022042613/546b576fb4af9f8e2c8b4c85/html5/thumbnails/41.jpg)
Conclusions
• Twarf is still an experimental side project
• Twitter is becoming a popular attack vector
• Goal: protecting you, our customers
• Identifying the future development directions of Twitter threats
I would like to thank you for your support with 140 characters or less and guess what? I just did it! #swnyc
Thursday, January 21, 2010