
Page 1: Dark Web Collection, Search, and Analysis Dr. Hsinchun Chen Director, Artificial Intelligence Lab University of Arizona hchen@eller.arizona.edu .

Dark Web Collection, Search, and Analysis

Dr. Hsinchun Chen

Director, Artificial Intelligence Lab

University of Arizona

hchen@eller.arizona.edu http://ai.arizona.edu

Acknowledgements: NSF CRI; NSF EXP-LA; DTRA, DOD CTFP, NPS; (ARFL WMD, CIA, FBI)

Page 2:

Leaderless Jihad and the Internet

• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”

• Before 2004, face-to-face interactions, 26-year-old

• After 2004, interactions on the Internet: Madrid, Dutch Hofstad, Cairo, Toronto… Irhabi007 and Muntada, 20-year-old

Page 3:

Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)

Data, text, and web mining

From COPLINK to Dark Web

Intelligence and Security Informatics

Page 4:

Newsweek Magazine, March 3, 2003

A computerized way for police to coordinate crime databases

Washington Post, March 6, 2008

National dragnet is a click away! COPLINK in use in 1,600 police agencies in US!

ABC News, April 15, 2003

Google for Cops: Coplink software helps police search for cyber clues to bust criminals

The New York Times, November 2, 2002

COPLINK assisted in DC sniper investigation

COPLINK project in the press

Page 5:

Dark Web Overview

Dark Web: Terrorists’ and cyber criminals’ use of the Internet

Collection: Web sites, forums, blogs, YouTube, Second Life

Analysis and Visualization: Link and content analysis; Web metrics analysis; Authorship analysis; Sentiment analysis; Multimedia analysis

Our collection is about 2 TB in size, with close to 500M pages, files, and messages from more than 10,000 Dark Web sites.

Page 6:

Project Seeks to Track Terror Web Posts, 11/11/2007

Researchers say tool could trace online posts to terrorists, 11/11/2007

Mathematicians Work to Help Track Terrorist Activity, 9/14/2007

Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007

Dark Web project in the press

Page 7:

Dark Web Forum Crawler System

Page 8:

Middle Eastern Web Collection File Types

Dynamic files (e.g., PHP, ASP, JSP) are widely used in extremist Web sites, indicating a high level of technical sophistication.

Multimedia files (videos, images) are also heavily used in extremist Web sites.

Terrorist Collection      # of Files    Volume (Bytes)
Total                        222,687    12,362,050,865
Indexable Files              179,223     4,854,971,043
  HTML Files                  44,334     1,137,725,685
  Word Files                     278        16,371,586
  PDF Files                    3,145       542,061,545
  Dynamic Files              130,972     3,106,537,495
  Text Files                     390        45,982,886
  PowerPoint Files                 6         6,087,168
  XML Files                       98           204,678
Multimedia Files              35,164     5,915,442,276
  Image Files                 31,691       525,986,847
  Audio Files                  2,554     3,750,390,404
  Video Files                    919     1,230,046,468
Archive Files                  1,281       483,138,149
Non-Standard Files             7,019     1,108,499,397

Figure: Number of Files Distribution (Arabic). Indexable Files 80%, Multimedia Files 16%, Archive Files 0%, Non-Standard Files 4%.

Figure: Volume Distribution (Arabic). Indexable Files 39%, Multimedia Files 48%, Archive Files 4%, Non-Standard Files 9%.
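
The chart percentages can be recomputed from the four top-level rows of the file type table. A minimal sketch, assuming the shares are simply each category's count (or byte volume) divided by the collection total; the variable and function names are illustrative only:

```python
# Top-level categories from the file type table: (number of files, volume in bytes).
categories = {
    "Indexable Files": (179_223, 4_854_971_043),
    "Multimedia Files": (35_164, 5_915_442_276),
    "Archive Files": (1_281, 483_138_149),
    "Non-Standard Files": (7_019, 1_108_499_397),
}


def shares(values):
    """Return each category's percentage of the total."""
    total = sum(values.values())
    return {name: 100.0 * value / total for name, value in values.items()}


file_share = shares({name: v[0] for name, v in categories.items()})
volume_share = shares({name: v[1] for name, v in categories.items()})

for name in categories:
    print(f"{name:20s} files {file_share[name]:5.1f}%  volume {volume_share[name]:5.1f}%")
```

Multimedia files make up roughly a sixth of the files but close to half of the volume, which is the contrast the two charts illustrate.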

Page 9:

CyberGate System: Analysis & Visualization

Page 10:

7. Results: Intensity Relationship

Figure: U.S. Forum Scores. Violence Scores (y-axis, 0 to 400) plotted against Hate Scores (x-axis, 0 to 400).

Figure: Middle Eastern Forum Scores. Violence Scores (y-axis, 0 to 400) plotted against Hate Scores (x-axis, 0 to 400).

Measuring Hate and Violence: US vs. Middle Eastern Groups

                U.S.      Middle Eastern
N               4676      3349
beta (slope)    0.079     0.682
t-Stat          21.354    48.265
P-Value         0.000     0.000
R-Square        0.076     0.486

Hate and violence scores are significantly correlated in both collections, and much more strongly for Middle Eastern groups (R-Square 0.486 vs. 0.076).
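
The slope, t-Stat, and R-Square reported here are the usual outputs of a simple linear regression of violence score on hate score. A minimal sketch of how such values could be computed, assuming ordinary least squares with one predictor; the function name and the five data points are hypothetical and are not the forum scores:

```python
import math


def simple_regression(x, y):
    """Ordinary least squares of y on x; returns (slope, t_stat, r_square)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_square = 1.0 - sse / syy
    # Standard error of the slope, with n - 2 degrees of freedom.
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err, r_square


# Hypothetical hate (x) and violence (y) scores for five messages.
hate = [0, 1, 2, 3, 4]
violence = [1, 3, 4, 8, 9]
slope, t_stat, r_square = simple_regression(hate, violence)
print(f"slope={slope:.3f} t={t_stat:.3f} R^2={r_square:.3f}")
```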

Page 11:

Number of Posts By Month: Al-Firdaws vs. Montada

Al-Firdaws has consistently had between 2,500 and 3,000 posts per month since the second half of 2006.

Montada was very active in 2002 and 2005.

Figure: Al-Firdaws Posts By Month. Number of posts (0 to 3,500) per month, Jan-05 to Jul-07.

Figure: Montada Posts By Month. Number of posts (0 to 25,000) per month, Sep-00 to May-07.

Page 12:

Affect Intensities: Al-Firdaws vs. Montada

Figure panels: Al-Firdaws - Anger; Montada - Anger; Al-Firdaws - Violence; Montada - Violence.

Al-Firdaws has considerably higher violence and also greater anger intensity.

Page 13:

Arabic Writeprint Feature Set

Feature Set (418)
  Lexical (79)
    Char-Based (48): Char-Level (4), Letter Frequency (35), Special Char. (9)
    Word-Based (31): Word-Level (6), Vocab. Richness (8), Word Length Dist. (15), Elongation (2)
  Syntactic (262)
    Punctuation (12), Function Words (200), Word Roots (50)
  Structural (62)
    Word Structure, Technical Structure
    Message Level, Paragraph Level, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks
    Counts as extracted: (48) (14); (5) (6) (3) (29) (8) (4) (7)
  Content Specific (15)
    Race/Nationality (11), Violence (4)

Page 14:

Arabic Feature Extraction Component

Diagram components (steps numbered 1 to 4 on the slide): Incoming Message; Elongation Filter (Count +1, Degree +5); Filtered Message; Root Dictionary; Root Clustering Algorithm; Similarity Scores (SC); max(SC) +1; Generic Feature Extractor; All Remaining Feature Values; Feature Set.
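
The elongation filter step can be illustrated with a short sketch. It assumes that elongation means the Arabic tatweel (kashida) character U+0640, that "Count" is the number of elongated words, and that "Degree" is the number of elongation characters removed; the function name and the sample message are hypothetical:

```python
TATWEEL = "\u0640"  # Arabic elongation (kashida) character


def filter_elongation(message):
    """Strip elongation characters; return (filtered_message, count, degree)."""
    count = 0   # number of words containing elongation
    degree = 0  # total elongation characters removed
    filtered_words = []
    for word in message.split():
        n = word.count(TATWEEL)
        if n:
            count += 1
            degree += n
        filtered_words.append(word.replace(TATWEEL, ""))
    return " ".join(filtered_words), count, degree


# One word stretched with five tatweel characters, followed by a plain word.
incoming = "\u0643" + TATWEEL * 5 + "\u062a\u0627\u0628" + " \u062c\u062f\u064a\u062f"
filtered, count, degree = filter_elongation(incoming)
print(filtered, count, degree)
```

Under these assumptions the sample message yields Count +1 and Degree +5, and the filtered message is what the later steps would process.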

Page 15:

Sliding Window + PCA : Turning Text into Dots

1. Compute eigenvectors for 2 principal components of feature group (Eigenvectors, e.g., 0.533 0.956 -0.541 0.445 0.034 0.089 0.653 0.456 0.975 -0.085 0.143 -0.381)

2. Extract feature usage vectors from the message text (Feature Usage Vector Z, e.g., 1,0,0,2,1,2 and 0,1,3,0,1,0)

3. Transform into 2-dimensional space: x = Zx, y = Zy (the feature usage vector Z multiplied by each of the two eigenvectors)

Repeat steps 2 and 3
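
The sliding window and projection steps can be sketched in a few lines. This is an illustration under stated assumptions, not the Writeprint implementation: the feature list, window size, step, and token sequence are hypothetical, the eigenvectors are computed from the window vectors themselves, and a plain power iteration with deflation stands in for a library PCA routine.

```python
import math


def sliding_windows(tokens, size, step):
    """Overlapping token windows of length `size`, advancing by `step`."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[s:s + size] for s in range(0, last_start + 1, step)]


def usage_vector(window, features):
    """Feature usage vector Z: how often each feature occurs in the window."""
    return [window.count(f) for f in features]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def covariance(rows):
    """Sample covariance matrix of the feature usage vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_two_eigenvectors(matrix, iters=1000):
    """Power iteration with deflation (a stand-in for a library PCA routine)."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    vectors = []
    for _component in range(2):
        v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary deterministic start
        for _step in range(iters):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        norm = math.sqrt(dot(v, v))
        v = [x / norm for x in v]
        eigenvalue = dot(v, [dot(row, v) for row in m])
        vectors.append(v)
        # Deflate: remove the component just found before the next pass.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
             for i in range(d)]
    return vectors


features = ["a", "b", "c"]  # hypothetical feature group
tokens = "a b a c b a a c c b a b c a".split()  # hypothetical message text
windows = sliding_windows(tokens, size=5, step=2)
Z = [usage_vector(w, features) for w in windows]
e1, e2 = top_two_eigenvectors(covariance(Z))
# Each window becomes one dot: x = Z . e1, y = Z . e2.
points = [(dot(z, e1), dot(z, e2)) for z in Z]
print(points)
```

Each window contributes one (x, y) dot, so a longer message produces a denser pattern of dots.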

Page 16:

Anonymous Messages

Author Writeprints

Author B

Author A 10 messages

10 messages

Page 17:

ClearGuidance.com (Toronto Plot): Participant Network Visualization

Page 18:

ClearGuidance Forum “Experts”

The series of overlapping circular patterns for bag-of-word features indicates that the author’s discussion revolves around a related set of topics.

Bag-of-words features are predominantly related to religious topics, e.g., Adam, angels, etc.

Many large red blots indicate the presence of features unique to this author, e.g., Adam, angels, etc.

Page 19:

This author was later arrested as a major culprit in the Toronto terror plot (“Soldier of God”). He uses many violent affect terms.

Radar chart showing violent affect feature usages.

Selected feature is use of the term “jihad”; this author's usage is the highest in the forum.

Selected feature (i.e., “jihad”) is shown in red.

This author constantly attempts to justify acts of violence and terrorism. “…there are so many paid sheikhs stuck in this life….no point going to them for fatwas…personally speaking…cuz they don’t even agree with jihad in the first place”

Page 20:

Dark Web Forum Tools

Information contained within Dark Web forums represents a significant source of knowledge for security and intelligence organizations.

We have developed tools supporting the large-scale collection, search, and analysis of Dark Web forums, specifically addressing the needs of security analysts.

Collection: AZ Forum Spider

Search: AZ Forum Portal, AZ Sentiment Analyzer

Analysis: AZ CyberGate Text Analyzer

Page 21:

AZ Forum Spider

Automated collection of forum communications; weekly update

Proxy servers and parameters

Site map, URL ordering, and forum extraction

Incremental spider

Collection visualization

Collection – AZ Forum Spider

Screenshot panels: Forum List; Spidering Status; Collection Statistics; Spidering Profile
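
The slide does not say how the incremental spider decides what to re-collect on each weekly update. One common approach, sketched here purely as an assumption, compares the latest post time shown in a forum's thread listing with the time stored at the previous crawl, and orders the threads to fetch by recency. The function name, thread identifiers, and timestamps are hypothetical:

```python
from datetime import datetime


def threads_to_fetch(seen, listing):
    """Pick threads that are new or have posts newer than the last crawl.

    seen:    thread id -> time of the newest post already collected
    listing: thread id -> time of the newest post shown in the forum index
    """
    todo = [tid for tid, latest in listing.items()
            if tid not in seen or latest > seen[tid]]
    # URL ordering: visit the most recently active threads first.
    return sorted(todo, key=lambda tid: listing[tid], reverse=True)


seen = {"t1": datetime(2010, 1, 1), "t2": datetime(2010, 1, 1)}
listing = {
    "t1": datetime(2010, 1, 1),  # unchanged since the last crawl
    "t2": datetime(2010, 1, 5),  # new replies
    "t3": datetime(2010, 1, 3),  # thread not collected before
}
queue = threads_to_fetch(seen, listing)
print(queue)
```

Skipping unchanged threads keeps a weekly update proportional to new activity rather than to the size of the whole forum.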

Page 22:

AZ Forum Portal

Current version: 13M messages (340K members) across 29 major Jihadi forums in English, Arabic, French, German and Russian

Forum analysis by forum, thread, member, time period, or topic

Social network analysis and visualization

Google Translation

Dark Web Forum Portal

Page 23:

Forum Portal Data Set

Name                   Language  Time Span                  Members    Threads    Messages
Al-Boraq               Arabic    01/08/2006 - 01/02/2010      3,503     52,322     223,648
Al-Fallujah            Arabic    09/19/2006 - 01/02/2010      5,853     74,899     547,712
Al-Firdaws*            Arabic    01/02/2005 - 12/06/2007      2,187      9,359      39,715
Midad al-Suyuf         Arabic    03/18/2006 - 01/02/2010      1,597     11,232      38,382
Alokab                 Arabic    04/08/2005 - 12/31/2009      1,547      8,096      55,947
Al-Qimmah              Arabic    11/23/2007 - 01/02/2010        287     12,097      23,709
Alsayra                Arabic    04/05/2001 - 12/31/2009     66,705    147,598   1,227,207
Ansar                  Arabic    11/07/2008 - 01/02/2010      1,224     12,041      46,928
At-tahadi              Arabic    04/14/2008 - 01/02/2010        313      2,599       5,406
Hanin Net              Arabic    11/27/2006 - 01/12/2010      2,837     96,239     821,478
Hawaa World            Arabic    01/01/2001 - 01/02/2010    113,579     40,501   2,251,553
Hadramout              Arabic    11/25/2000 - 12/29/2009     29,491    151,694   1,552,227
Ma’arik                Arabic    07/29/2007 - 01/03/2010      1,880     15,288      57,047
Al-Mujahidin           Arabic    11/09/2007 - 01/02/2010      4,259     29,980     140,930
Montada                Arabic    09/25/2000 - 12/29/2009     40,291    120,181   1,412,028

Page 24:

Data Set (Cont’d)

Name                   Language  Time Span                  Members    Threads    Messages
Ana al-Muslim          Arabic    10/08/1985 - 11/26/2009     12,215    179,791   1,343,370
Shumukh                Arabic    03/21/2007 - 01/02/2010      3,938     46,666     289,201
Ansar                  English   12/08/2008 - 01/02/2010        377     11,133      29,056
Gawaher                English   10/24/2004 - 01/01/2010      6,790    210,656     569,709
Islamic Awakening      English   04/28/2004 - 12/31/2009      2,361     25,112     116,009
Islamic Network*       English   06/09/2004 - 05/07/2008      1,573     11,974      87,314
Islamic Web-Community  English   11/14/2000 - 12/31/2009        745      6,262      24,850
Turn To Islam          English   06/02/2006 - 01/01/2010      9,926     38,702     308,970
Ummah                  English   04/01/2002 - 12/31/2009     14,349     71,218   1,192,583
Al Minha Dj            French    06/01/2008 - 01/04/2010        313      2,007       6,421
Forums d’aslama        French    10/06/2004 - 01/03/2010      2,665     20,468     131,559
Al-Mourabitoune        French    05/05/2002 - 03/27/2009      3,198      7,905      72,140
Ansar                  German    02/27/2009 - 01/02/2010         62        726       1,645
KavkazChat             Russian   03/21/2003 - 01/03/2010      5,634      6,144     558,042
Total                                                       339,699  1,422,890  13,174,786

Page 25:

Forum Statistics Summary (Cont’d)

Page 26:

Cross Forum Search

Page 27:

Single Forum Search & Translation

Search: bomb, iraq

Translations of thread titles

Page 28:

SNA Reply Network

1. Bint ul Islam (290 postings)

2. Iloveislam (239 postings)

3. Abuhannah (173 postings)
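
A ranking like this one can be derived from who-replied-to-whom data. A minimal sketch, assuming each message records its author and the message it replies to; the tuple layout, function name, and sample posts are hypothetical and are not the portal's data model:

```python
from collections import Counter


def reply_network(posts):
    """Build posting counts and directed reply edges.

    posts: list of (post_id, author, parent_post_id or None)
    Returns (postings per author, edge weights keyed by (replier, replied_to)).
    """
    author_of = {post_id: author for post_id, author, _ in posts}
    postings = Counter(author for _, author, _ in posts)
    edges = Counter()
    for _, author, parent in posts:
        if parent is None or parent not in author_of:
            continue  # thread starter, or parent outside the collection
        target = author_of[parent]
        if target != author:  # ignore self-replies
            edges[(author, target)] += 1
    return postings, edges


posts = [
    (1, "member_a", None),
    (2, "member_b", 1),
    (3, "member_c", 1),
    (4, "member_a", 2),
    (5, "member_b", 4),
]
postings, edges = reply_network(posts)
for rank, (member, n) in enumerate(postings.most_common(3), start=1):
    print(f"{rank}. {member} ({n} postings)")
```

The edge weights are what a network visualization would draw; the posting counts give the ranked member list.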

Page 29:

AZ Sentiment Analyzer

Portal for the sentiment and affect analysis of forums, measuring member opinions and emotions

Characterizes the affects conveyed in forum text, and the underlying sentiment polarity

By forum, thread, member, or time period

Keyword search

Search – AZ Sentiment Analyzer

Page 30:

AZ CyberGate Text Analyzer

Comprehensive system for the analysis and visualization of forum communications

Shows all text features

Utilizes Writeprint and Ink Blot techniques in text analysis

Incorporates rich visualization based upon multi-dimensional scaling and parallel coordinates

Analysis – AZ CyberGate Text Analyzer

Page 31:

Conclusion

The web offers extremists a rich medium for recruiting, communication, and radicalization.

Information contained within Dark Web sites, forums, blogs, multimedia, etc. represents a significant source of knowledge for security and intelligence organizations.

A computational approach to Dark Web research spans collection, search, and analysis.

Dark Web research could potentially assist in terrorism research and intelligence analysis.

Dark Web Forum Portal available now!!!

Page 32:

Dark Web Collection, Search, and Analysis

For more information:

Dr. Hsinchun Chen, University of Arizona

hchen@eller.arizona.edu

http://ai.arizona.edu