HCLT Whitepaper: Cyber Scan

CyberScan (Online IP Infringement Detection Service) July 2011





© 2011, HCL Technologies. Reproduction Prohibited. This document is protected under Copyright by the Author, all rights reserved.

TABLE OF CONTENTS

Abstract
Abbreviations
The Problem
Business and Technical Challenges
CyberScan Solution
Key Features
Key Capabilities
How CyberScan Works
Business Impact Examples
Conclusion
Author Info


Abstract

Media and content providers are among the most profitable businesses on the internet today, and the biggest threat to their profits is piracy of their copyrighted products. Companies, especially in the entertainment, software, and publishing industries, continue to lose profits as pirated content proliferates illegally across a vast number of sites.

HCL, a leading global IT services company, has now tapped its proprietary skills and tools to develop a software solution that seeks out and protects against illegal hosting of, or linking to, material offered for sale. CyberScan turns online copyright infringement from a source of revenue loss into automatically collected evidence for direct action, and it uses the same stealth-like methods as the criminals do to identify and protect against illegal postings of, or links to, “hacked” material.

CyberScan provides an innovative and highly effective online copyright infringement detection service to help businesses reduce profit loss. Its state-of-the-art software uses web crawling, tracking and indexing, distributed agent “sniffers”, and IP masking to identify, monitor, and report infringed content in a rapid but undetectable manner. It even bypasses the typical methods piracy sites use to hide from or fight off such detection. CyberScan applies these technical methods automatically, reducing the cost of manually finding and tracking infringement and compliance.

Infringement of any IP or copyrighted content that can be sold and shared digitally (movies, pictures, audio files, eBooks, TV shows, documents, and software) can now be identified and addressed with CyberScan. Your business can finally fight online piracy of your content in an effective manner. CyberScan from HCL is a powerful tool and a major benefit for any business combating copyright infringement.


Abbreviations

Sl. No. | Acronym | Full form
1 | IP | Intellectual Property
2 | URL | Uniform Resource Locator
3 | AWS | Amazon Web Services
4 | SaaS | Software as a Service


The Problem

Online piracy and infringement of copyrighted material continue to grow as a problem, and as a source of profit loss, for businesses, more so than retail theft in stores. The problems businesses face today in trying to combat online piracy include the following:

- Identifying infringing content across a vast and rising number of hosting and linking sites, both known and unknown.
- Searching, finding, and filtering through content in a timely manner even though it propagates quickly across the internet.
- Staffing manual operators to search for and detect infringing material, or hiring developers skilled in the particular logic and coding algorithms it requires.
- Evading techniques used by criminal website administrators, such as blocking IP addresses based on number of hits, to avoid manual or scripted detection systems.
- Getting through authentication techniques used by piracy site administrators to protect and firewall their illegal content.
- Issuing Cease and Desist or Takedown Notices to an ever-growing number of dynamic hosting sites that change their URLs and addresses.
- Fingerprinting infringement as evidence and enforcing compliance with removal after discovery or serving notice.
- Reporting meaningful data out of the huge volumes of content found, in order to derive infringement patterns, assess perpetrators, and make business decisions.

Meeting The Challenge

Businesses currently trying to solve these problems of content piracy and distribution face steep challenges and roadblocks to their efforts, but CyberScan solves them:

Challenge | Short Description | CyberScan Solution
URL Obfuscation | Sites hide pirate links with format or layout tricks | Intelligent search expressions see through formats
Website Authentication | Sites require login or user credentials to access content | Sites are categorized and login credentials are used for automation
Infringement Detection | Rapid changes and posts make finding and monitoring time-sensitive | Special search methods seek, tag, and monitor based on site
Web Crawler Obstruction | Site admins limit, watch, and block detecting programs/users | Jobs spawn across the globe from new addresses to remain in stealth


The next two sections go into further detail on these business and technical challenges and on precisely how CyberScan addresses each of them.


Business and Technical Challenges

Trying to stop online piracy and illegal distribution of content on the internet is nothing new. Like hiring security guards for a store front, combating online theft can be costly and poses unique challenges. Further, the criminal sites respond to business attempts to find and remove illegitimate and illegal content with increasing technical sophistication. Not only must the sites hosting pirated material be identified, but also the sites that link to that hacked content.

Each of the challenges listed is described in more detail below, and the next section discusses how CyberScan solves them:

URL Obfuscation

Websites that contain links to infringing content, from forums and blogs to search engines, commonly use various obfuscation techniques to prevent automated systems from detecting infringement. Tactics of these linking sites include posting plain-text URLs instead of hyperlinked ones, replacing characters inside URLs in a way a human can identify but a computer cannot, using third-party URL shortening services, and requiring registration to view content or posts. These URL obfuscation methods are all a challenge to a company trying to search for and identify the sites that serve as a link or entry point to pirated material.

Website Authentication

Some infringing or linking websites require registration before content can be browsed. This closed-door firewall tactic is particularly difficult to address because there are no standard methods of authentication across the web. Many sites use customized form-based authentication, which any manual or programmed web crawler or sniffer must handle. To further complicate matters, some linking websites allow anonymous access to only small portions of the site, or require their own authentication before users can view links to infringing content and downloads.

Infringing Content Detection

Identifying infringement hidden inside unstructured content is a serious challenge, especially given the dynamic linking nature of the Internet and the frequency of new or updated posts. The ability to detect infringing content within hours of its being posted is a desired capability of any detection system. A specialized approach to content detection and crawling that not only seeks and finds, but also continually monitors any type of website, is a great challenge that must be met to prevent pirated content from spreading.

Webcrawler Obstruction

Some linking sites take proactive and even reactive measures to hinder automated systems and manual sniffing for pirated content. Tactics include blocking IP addresses according to their own criteria, filtering on user agent strings, and enforcing page view limits and quotas. These obstructions may occur programmatically or via manual intervention by the website administrators. A great challenge in crawling the web manually or automatically is to remain in a stealth mode, so that you can continue to detect and monitor IP infringement while remaining undetected yourself.

Developed with these problems in mind, CyberScan provides key logic and proven components, allowing automated solutions to these unique challenges and more.


CyberScan Solution

CyberScan directly addresses the business and technical challenges in the piracy prevention sphere outlined above in the following specific ways:

URL Obfuscation

Infringing sites use cover methods like changing or masking their URLs, clouding the links to their site, or requiring registration to continue.

CyberScan's custom web crawler logic intelligently applies regular expressions to detect and process host-site URLs inside unstructured web content. CyberScan also supports custom website authentication mechanisms, which enable crawling entire domains under the guise of a registered user. This combination effectively deals with most forms of URL obfuscation.
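As a rough illustration of the regular-expression approach, the following Python sketch normalizes a few common obfuscations (an "hxxp" scheme, bracketed or spelled-out dots) and then extracts plain-text URLs from a post. The rules, function name, and sample hosts are assumptions made for this example; CyberScan's actual expressions are not published in this paper.

```python
import re

# Hypothetical normalization rules: (compiled pattern, replacement).
_DEOBFUSCATE = [
    # hxxp:// or hxxps:// written to dodge link detection
    (re.compile(r"\bhxxp(s?)://", re.IGNORECASE), r"http\1://"),
    # "[dot]", "(dot)", "[.]" and "(.)" standing in for a literal dot
    (re.compile(r"\s*[\[\(]\s*(?:dot|\.)\s*[\]\)]\s*", re.IGNORECASE), "."),
]

# Matches http(s) URLs whether or not they were hyperlinked in the page.
_URL = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(text):
    """Return de-obfuscated URLs found in text, in order, without duplicates."""
    for pattern, replacement in _DEOBFUSCATE:
        text = pattern.sub(replacement, text)
    seen, urls = set(), []
    for match in _URL.finditer(text):
        url = match.group(0).rstrip(".,;:!?)")  # drop trailing punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Shortened URLs would still need to be resolved by following redirects, which this sketch does not attempt.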

Website Authentication

Sites hosting pirated content often require user authentication, keeping their illegal wares behind a closed and locked door.

Using sophisticated analysis algorithms, CyberScan classifies linking sites against a wide range of criteria so that the most applicable approach is selected. The authentication modules of the Nutch web crawler have been extended to support form-based authentication. Credentials gathered from manual site registration are supplied through the CyberScan Web Application and are used while crawl jobs are underway. This allows CyberScan web crawlers to access and analyze areas of suspect websites that typical search engines are unable to reach.
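To make the idea of form-based authentication concrete, here is a minimal sketch using only the Python standard library rather than a Nutch plugin. It builds a login POST from stored credentials and creates a cookie-keeping opener so that later fetches would reuse the session. The function names, field names, and URL are hypothetical; real sites differ in their form fields, which is exactly why per-site handling is needed.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, credentials, extra_fields=None):
    """Build (but do not send) a form-based login POST for one site.

    extra_fields holds any site-specific hidden form fields.
    """
    fields = dict(extra_fields or {})
    fields.update(credentials)
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        login_url,
        data=body,  # supplying data makes urllib issue a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def make_session():
    """Return an opener that stores cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

Calling `opener.open(request)` would submit the form; pages fetched afterwards through the same opener would carry the session cookie.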

Infringing Content Detection

Infringing content can be hidden and changed by new posts, updates, propagation, and by the fast dynamic nature of the internet.

CyberScan uses a combination of weighted regular expressions to detect infringing content. Within a website known to serve content suspected of infringing, the program is stricter in determining infringement possibilities. Within lesser-known or new sites, the search logic can also be applied. If, for example, the body of a post matches a regular expression designed to detect a customer's content title, and the URL also contains a particular flagged string, the code can accurately determine whether the post is infringing.
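One way to picture weighted regular expressions is as a small scoring function: each rule that matches adds its weight, and a post is flagged once the total crosses a threshold. The sketch below is an assumption-laden illustration, not CyberScan's logic. The rule structure, weights, and threshold values are invented, and reading "stricter" as a lower flagging threshold on known sites is our interpretation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    pattern: str   # regular expression, matched case-insensitively
    weight: float  # added to the score when the pattern matches
    target: str    # which field to test: "body" or "url"


def infringement_score(body, url, rules):
    """Sum the weights of every rule whose pattern matches its target field."""
    fields = {"body": body, "url": url}
    return sum(
        rule.weight
        for rule in rules
        if re.search(rule.pattern, fields[rule.target], re.IGNORECASE)
    )


def is_suspect(body, url, rules, known_infringing_site, base_threshold=1.0):
    """Flag a post; known infringing sites get a lower (assumed stricter) bar."""
    threshold = base_threshold * (0.5 if known_infringing_site else 1.0)
    return infringement_score(body, url, rules) >= threshold
```

With one rule for the content title in the post body and one for a flagged string in the URL, a post on an unknown site is flagged only when both match, while on a known site the title match alone is enough.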

CyberScan offers a comprehensive solution to online infringement by effectively leveraging best-of-breed open source technologies and the power of cloud computing.


Different crawling methodologies have been implemented depending on the layout of sites. For example, on forum-style websites CyberScan attempts to use the site's search functionality to sniff and crawl links that have a high probability of infringement. On other styles of site, where search is not available, CyberScan crawls the index pages and analyzes each post according to its logic. CyberScan will find infringement when it is there.

Webcrawler Obstruction

Pirate site administrators react and try to block access or views by legitimate enforcers, either manually or with programs that detect who is trying to detect them.

CyberScan conducts its crawls inside Amazon's Elastic MapReduce AWS service. Each crawl job runs in a newly provisioned cluster, each with a different IP address and geographical location. The user agent string is set to match the most common browsers and platforms on the web. To avoid tripping server-side page view limits or quotas, the client crawl jobs are configured to be low impact and “polite” to the web servers. Because the targeted sites are crawled in large but distributed jobs, the load is spread across the World Wide Web while the system actively searches for infringement under varying aliases. This combination helps keep the web crawlers under the radar of website administrators and makes CyberScan very difficult to identify and block.
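The "polite" part of this design can be sketched as a scheduler that never hits the same host more often than a fixed delay allows, while leaving different hosts free to be fetched side by side. The function name and the five-second default are assumptions for illustration; the paper does not state the delays CyberScan uses, and the sketch only plans fetch times rather than performing any requests.

```python
from urllib.parse import urlsplit


def polite_schedule(urls, per_host_delay=5.0):
    """Assign each URL a start offset in seconds.

    Requests to the same host are spaced at least per_host_delay apart;
    requests to different hosts do not wait on each other.
    """
    next_free = {}  # host -> earliest offset at which it may be hit again
    plan = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        start = next_free.get(host, 0.0)
        plan.append((start, url))
        next_free[host] = start + per_host_delay
    # sorted() is stable, so URLs with equal offsets keep their input order
    return sorted(plan, key=lambda item: item[0])
```

Spreading a plan like this over several freshly provisioned clusters would additionally vary the source address seen by each site.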


Key Features

CyberScan finds pirated content and the sites that provide paths to it in a way no other software can: effectively, secretly, and automatically. CyberScan's key features and benefits include the following:

- Automatic identification of suspected infringement of intellectual property
- Dynamic detection through multiple geographies to remain in “stealth mode”
- Savvy “crawl/sniff” logic that remains undetected by pirate administrators
- Full evidence capture and archival for legal establishment of infringement
- Thorough and fully automated domain traversal, parsing, and indexing
- Powerful multi-faceted search for drilling into indexed content
- Adaptive tracking of detected sites to ensure removal and compliance
- Live feeds detailing newly discovered infringement
- Cross-category crawls of sites and specific sniffing while posing as a legitimate user
- Interactive web interface for system monitoring and control
- Prevalence analysis and reporting of pirated content and its service providers
- Highly scalable and reliable cloud architecture using proven open source modules

CyberScan's rich feature set and capabilities are built around four parts: identification, evidence collection, infringement reporting, and re-verification. Together they provide 360° protection for your content.


Key Capabilities

CyberScan was developed by HCL experts to include key capabilities and utilize proven components to specifically address content piracy concerns of provider businesses and their technical staff. The following are some highlights:

- Customized proprietary version of the Nutch open source web crawler
- Proven cloud infrastructure utilizing Amazon Web Services (AWS) to deploy and run
- Advanced AWS services like elastic clusters for a highly scalable and reliable system
- Dynamic resource allocation across multiple domains, locations, and user strings, enabling CyberScan to work in an undetected stealth mode
- Fully indexed suspicious domain lists and multi-faceted search results through a custom search engine UI, useful for research, analysis, and reporting
- Adaptive revisits to suspicious content download and link pages, to detect when they are removed and ensure compliance
- Coding logic that ensures “politeness” to servers being sniffed, so that web crawlers resemble normal users and go unnoticed by reactive pirate site administrators
- Cloud-based architecture that allows for global efficiency and a Software as a Service (SaaS) pay-per-use billing model
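The adaptive revisits mentioned in the list above could, for instance, adjust how often a tracked page is rechecked based on what the last visit found. The following sketch is purely hypothetical: the function name, the halving and doubling policy, and the hour bounds are assumptions chosen to show the idea, not values taken from CyberScan.

```python
def next_revisit_hours(current_hours, still_infringing, changed,
                       min_hours=1.0, max_hours=168.0):
    """Pick the next revisit interval, in hours, for a tracked page."""
    if not still_infringing:
        # Content appears removed: back off, but keep checking for re-posts.
        proposed = current_hours * 2
    elif changed:
        # Page is still infringing and actively changing: look again sooner.
        proposed = current_hours / 2
    else:
        # Still infringing but unchanged: keep the current pace.
        proposed = current_hours
    return max(min_hours, min(max_hours, proposed))
```

Keeping removed pages on a slow schedule rather than dropping them is what lets a tracker confirm continued compliance after a notice has been served.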


The CyberScan Difference

CyberScan's features and capabilities stand out when compared with any mix of other software. The following are some of the key benefits HCL adds for customers who partner with it to use the CyberScan solution and services:

How CyberScan Works

The secret to CyberScan's profit-saving features and benefits lies in HCL's selection and customization of technologies that together can quietly and efficiently detect copyright infringement and propagation. HCL found niche open source computing platforms and customized them, added an intelligent architecture geared for IP detection tasks, and tapped the power of the cloud. The result is the differentiating feature set outlined above. The following are some of the technologies and components used in this unique HCL assembly and coding:

- Java: programming language and computing platform
- Nutch: a multi-threaded web crawler capable of full web-scale indexing, serving as the core sniffer/crawling technology
- SOLR: an enterprise-grade search platform that constructs a full-text searchable index of the content crawled by Nutch
- Lucene: a text search engine library, used by Nutch and SOLR
- Hadoop: a powerful distributed computing framework that breaks large computational jobs into manageable fragments to be run in parallel on many servers
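The split-and-combine pattern that Hadoop provides at cluster scale can be shown in miniature. The sketch below is a stand-in, not Hadoop: it uses threads on one machine, and the function names and the choice of task (counting flagged URLs per host) are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def split(items, fragment_size):
    """Break one large job into fixed-size fragments."""
    return [items[i:i + fragment_size] for i in range(0, len(items), fragment_size)]


def map_fragment(urls):
    """Map step: count flagged URLs per host within one fragment."""
    return Counter(urlsplit(u).netloc for u in urls)


def count_by_host(urls, fragment_size=2, workers=4):
    """Run the map step on each fragment in parallel, then reduce by summing."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_fragment, split(urls, fragment_size)))
    return sum(partials, Counter())
```

In a real Hadoop job the fragments would be distributed across many servers and the partial counts shuffled to reducers, but the shape of the computation is the same.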

The following diagram further depicts some of the back-end components (the Hadoop layer) of CyberScan's architecture.


Business Impact Examples

The innovation behind this powerful new tool has already led to the following business impact for beta-testing and initial customers. These are listed only as examples of the kind of success your business could see:

- A customer has realized more than 95% infringement detection accuracy.
- A customer realized a 30% cost saving compared with its existing mix of service providers, with the added benefit of a single source, HCL, for the new provisions.
- Customers express eagerness about a pay-per-use scheme, which allows them to worry only about their business while HCL takes care of the engineering, technology innovation, maintenance, support, and research.
- A customer division, based on its resounding success with CyberScan, is now introducing the solution to all Business Units and select partners of its company.


Conclusion

HCL CyberScan can help any business protect their key intellectual property. As online piracy continues to grow exponentially, companies must remain vigilant with technology to minimize copyright infringement and its resulting profit loss. Our unique solution is a new and effective way to combat online piracy, IP theft, and illegal distribution by using automation and the latest internet/cloud technologies.

Let CyberScan stop piracy and secure your profits.

For more on how HCL CyberScan can benefit your organization, contact us at [email protected]

Author Info

Michael Grucz
Technical Lead Research, Internet Security Business Unit

Kiran Kumar Reddy . V
Product Manager, Internet Security Business Unit

CyberScan is an effective and cost-efficient solution for combating online piracy and reducing your profit loss.


Hello, I’m from HCL’s Engineering and R&D Services. We enable technology led organizations to go to market with innovative products & solutions. We partner with our customers in building world class products & creating the associated solution delivery ecosystem to help build market leadership. Right now, 14500+ of us are developing engineering products, solutions and platforms across Aerospace and Defense, Automotive, Consumer Electronics, Industrial Manufacturing, Medical Devices, Networking & Telecom, Office Automation, Semiconductor, Servers & Storage for our customers.

For more details contact [email protected]

Follow us on twitter http://twitter.com/hclers and our blog http://ers.hclblogs.com/

Visit our website http://www.hcltech.com/engineering-services/

About HCL

About HCL Technologies

HCL Technologies is a leading global IT services company, working with clients in the areas that impact and redefine the core of their businesses. Since its inception into the global landscape after its IPO in 1999, HCL has focused on 'transformational outsourcing', underlined by innovation and value creation, and offers an integrated portfolio of services including software-led IT solutions, remote infrastructure management, engineering and R&D services and BPO. HCL leverages its extensive global offshore infrastructure and network of offices in 26 countries to provide holistic, multi-service delivery in key industry verticals including Financial Services, Manufacturing, Consumer Services, Public Services and Healthcare. HCL takes pride in its philosophy of 'Employee First', which empowers our 72,267 transformers to create real value for the customers. HCL Technologies, along with its subsidiaries, had consolidated revenues of US$ 3.1 billion (Rs. 14,101 crores), as on 31st December 2010 (on LTM basis). For more information, please visit www.hcltech.com

About HCL Enterprise

HCL is a $5.9 billion leading global technology and IT enterprise comprising two companies listed in India - HCL Technologies and HCL Infosystems. Founded in 1976, HCL is one of India's original IT garage start-ups. A pioneer of modern computing, HCL is a global transformational enterprise today. Its range of offerings includes product engineering, custom & package applications, BPO, IT infrastructure services, IT hardware, systems integration, and distribution of information and communications technology (ICT) products across a wide range of focused industry verticals. The HCL team consists of over 80,000 professionals of diverse nationalities, who operate from 31 countries including over 500 points of presence in India. HCL has partnerships with several leading Global 1000 firms, including leading IT and technology firms. For more information, please visit www.hcl.com