Surfacing the deep web (2 slides per page)
Surfacing the Web Websearch Academy 2013
14 October 2013
© Arthur Weiss, AWARE, 2013 1
© AWARE 2013 Tel: +44 20 8954 9121 • Fax: +44 20 8954 2102 • Web: www.marketing-intelligence.co.uk
Arthur Weiss Email: [email protected] / Twitter: @awareci
www.marketing-intelligence.co.uk 14 October 2013
Surfacing the Deep Web WebSearch Academy
Internet Librarian International
Not everything can be found with Google…
The ‘Invisible Web’ or ‘Deep Web’ consists of web pages and documents which are not indexed by conventional search engines, or are poorly or incompletely indexed.
5 Types of “Invisibility”
• Not search engine optimised, so pages fail to appear in “simple” searches
• Not indexed by search engines
• Subscription or proprietary content
• Excluded from search index
• Encrypted or non-indexable content
Know your tool kit
Standard Google, or multiple approaches & tools
What do I need to find?
What sort of needle? What sort of haystack?
http://www.morguefile.com/archive/display/21091
• Why will the information be available?
• Where will it be held? (Who will know it?)
• Can I obtain it legally and ethically from this source, and if so, how?
• If not, are there other sources or ways of obtaining the information?
• After obtaining the information, are any checks needed to verify it?
• What is the information’s relationship to other information?
Not everything is online or can be found!
• Try to find:¹
  Original TV coverage of the storming of the Bastille
  A newspaper interview with Christopher Columbus, following his return from discovering America
  A recording of Abraham Lincoln delivering the Gettysburg Address
  A photo of Jesus in his crib (Question from a 9-year-old: “Why didn’t anybody take photos with their phones?”)
¹ With thanks to Karen Blakeman of RBA Information (rba.co.uk) for these examples
“Forty-two! Is that all you’ve got to show for seven and a half million years’ work?”
“I checked it very thoroughly and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”
If you start from the wrong question, it doesn’t matter which approach or tool you use, or how you use it: your results will be poor or wrong.
Before starting to search: consider sources for the subject / topic of interest
• Would any of the relevant pages be in another language? e.g. “cheap hotel in Dubai” OR “فندق اقتصادي في دبي”
• Are there societies, organisations, people, or groups that may have information? (Who / where else could have information?)
• What search tool / approach is most likely to access or index the information’s location (container)?
• Are there unique terms or jargon that lead to a specialist tool? e.g. lung cancer (consumer) versus pulmonary carcinoma (medical)
• Why is information likely to be available? Consider also file formats and the location of search terms
Before starting to search: consider search terms for the topic or subject of interest
• How might the information be written? e.g. “I work for Xcompany” to search for employees of Xcompany; “X is better than” for comparisons
• Are any keywords likely to be in irrelevant documents that should be excluded from searches?
• Are any keywords part of a common phrase?
• Are there any other words likely to be in documents on the topic?
• Are there any synonyms or variant spellings? e.g. tyre or tire; aluminium or aluminum; candy or sweet; Basle or Basel
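The questions about synonyms, phrases and exclusions can be turned into a query string mechanically. The following is a minimal sketch in Python, assuming the search engine accepts OR between alternatives, double quotes for exact phrases and a leading minus sign for exclusions (check the engine's own help pages). The function name build_query and the excluded word "bicycle" are illustrative and not taken from the slides.

```python
def build_query(synonym_groups, phrases=(), exclude=()):
    """Combine synonym groups, exact phrases and exclusions into one query string."""
    parts = []
    for group in synonym_groups:
        # Quote multi-word terms so they are searched as a phrase.
        terms = [f'"{t}"' if " " in t else t for t in group]
        if len(terms) > 1:
            # Alternatives within a group are joined with OR.
            parts.append("(" + " OR ".join(terms) + ")")
        else:
            parts.extend(terms)
    # Exact phrases are wrapped in double quotes.
    parts.extend(f'"{p}"' for p in phrases)
    # Excluded keywords get a leading minus sign.
    parts.extend(f"-{w}" for w in exclude)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(
        [["tyre", "tire"], ["Basle", "Basel"]],
        phrases=["is better than"],
        exclude=["bicycle"],  # hypothetical irrelevant keyword
    ))
```

Writing the variants down as groups first also documents which spellings have already been tried.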
Research Planning
Information Requirements
• Break down into individual questions that, when answered, will provide the required knowledge
• Don’t start searching without knowing what you are looking for, and why
An example research plan
Copy & fill in a sheet for each key information question / topic
Columns: Research Topic; Research Questions (break down the topic into answerable questions); Sources; Search Approach / Parameters; Type of information expected; Comments / Possible problems
Example entries (source: search approach / parameters; type of information expected; possible problems):
• LinkedIn: job title, current employer, etc.; people profiles; may not be accurate or up to date
• Google Scholar: author name, topic, date, etc.; citations, academic research papers…; doesn’t cover everything
• National statistics: site search engine; census & demographic data; may be old or incomplete
Advanced Searching
• Use advanced search operators and options, e.g. filetype: / intitle: / inurl: / .. (numeric range) and * (wildcard)
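As an illustration of how these operators combine, here is a minimal sketch in Python that assembles an operator-based query and encodes it into a search URL. The function names, the example terms and the base URL are assumptions made for the example; operator support differs between engines and changes over time, so check the engine's current documentation.

```python
from urllib.parse import urlencode


def advanced_query(terms, filetype=None, intitle=None, inurl=None, number_range=None):
    """Append the operators named on the slide to a plain keyword search."""
    parts = [terms]
    if filetype:
        parts.append(f"filetype:{filetype}")   # restrict to a file format
    if intitle:
        parts.append(f"intitle:{intitle}")     # word must appear in the page title
    if inurl:
        parts.append(f"inurl:{inurl}")         # word must appear in the URL
    if number_range:
        low, high = number_range
        parts.append(f"{low}..{high}")         # numeric range, e.g. years
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Encode the query as the q= parameter of a search URL (base is an assumption)."""
    return base + "?" + urlencode({"q": query})


if __name__ == "__main__":
    query = advanced_query(
        "coffee consumption",
        filetype="pdf",
        intitle="statistics",
        number_range=(2010, 2013),
    )
    print(query)
    print(search_url(query))
```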
Search Engines – not just Google
Specialist Search / Deep Web Search
Search for Information “Containers”
• Knowing a reason for the information to be available can lead to an information source
  Who else would want this information?
  Search for topic + “database”, e.g. coffee database
  [Screenshot: first two results for the coffee database search]
Case Examples – Economics by Country
Case Examples – Trade Statistics
Case Examples – Economic Indicators
Case Examples – Genealogy
Proprietary sites / Blocked from Index
• Register for password-protected sites
• Use site search or site map, if available
• If a robots.txt file exists, it lists the paths the site asks search engines not to crawl, so you may be able to view the hidden pages by visiting them directly, e.g. nytimes.com/robots.txt
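Reading a robots.txt file for excluded paths can be sketched in a few lines of Python. This is a simplified illustration: it collects every Disallow entry regardless of which User-agent group it belongs to, and the sample file content is invented for the example, not taken from any real site.

```python
def disallowed_paths(robots_txt):
    """Return the paths listed on Disallow lines of a robots.txt file."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means nothing is excluded
                paths.append(path)
    return paths


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /archive/private/
Disallow: /search  # internal search results
Allow: /
"""

if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print(path)
```

To inspect a real site, the file text could be fetched first, for example with urllib.request.urlopen. A listed path only tells crawlers to stay away; the page itself may still require a login or no longer exist.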
Content that can’t / won’t be indexed
• Non-textual information, e.g. multimedia / audiovisual
  Bing has search operators that can find RSS feeds (hasfeed:) and pages containing specific types of file (e.g. for mp3 files, contains:mp3)
  Search for related textual information, e.g. descriptions or sources (e.g. artwork or film titles)
• Encrypted information / .onion sites
  The Tor Project (torproject.org) and the Tor browser
  Access encrypted sites via proxy servers
Searching TOR
• On regular Google: fake passport site:onion.to
TOR / .Onion Sites
Arthur Weiss is the managing director of AWARE, a UK-based consultancy specialising in marketing & competitive intelligence analysis.
Contact Details:
Web Site: www.marketing-intelligence.co.uk
E-mail: [email protected]
Twitter: @awareci
Telephone: +44 20 8954 9121
Fax: +44 20 8954 2102
Any Questions?