Internet and WWW Services
© Copyright 1997, The University of New Mexico M-1
Internet and WWW Services
• Security
• Types of Services
• Vended versus Internally Provided
• Costs and Benefits
• Servers and Clients
• Potential Problems
• Stats
General Network Security
• Isolated Servers
• Restricted Subnets
• Firewalls
• Proxy Servers
WWW Application Security
• OS Level
• Server Level
• Program Level
Types of WWW Services
• Static Data
• Server Search Engines
• Dynamic Data
• Server Applications
• Java Enabled
Vended
• Which Vendor
• How Much Do They Do
– HTML
– Graphics
– Design & Layout
– Programming
• Bandwidth
– Total
– Dedicated
Internally Provided WWW Server
• For whom?
• How many services, how much traffic?
• For what use (scope the server)?
Cost of a WWW Service
• Server Usage
• Disk Space
• Network Bandwidth
• Router or LAN Load
• Application Development with Limited Capabilities
• Application Development with Limited Standardization
Benefits
• High-touch, High-impact Narrow-casting
• Kiosks
• Fast, Simple Apps From Central Server
• Built-in Protocols
• Potentially Large Installed Client Base
Shopping List
• Server Machine and O/S
• Network Access
• WWW Server
• WWW Client
• Server Programming Tools
• Data and/or Databases
Which Server Platform?
• Unix
• NT
Which Server?
• CERN
• Microsoft
• Netscape - Communications or Commerce
• O’Reilly
• WebForce
• Oracle WebServer
Client Compliance Level
• HTML 2.0
• HTML 3.0
• Netscape Enhancements
• Java
• Lynx (Text Browser)
CGI-BIN Risks
• Dangerous Programs or Scripts
• User-supplied Programs or Scripts
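One common source of the CGI-BIN risks listed above is a script that passes user-supplied text to a program or file name unchecked. The following is a minimal sketch of an allow-list check, not a complete defense. The class name CgiInputCheck, the method isSafeToken, and the specific character rule are assumptions made for illustration; the slides do not prescribe a particular policy.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks one user-supplied value before a server-side program uses it. */
final class CgiInputCheck {
    // Assumed policy: letters, digits, dot, underscore, hyphen; 1 to 64 characters.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9._-]{1,64}");

    /** Returns true only if the value matches the allow-list and contains no ".." sequence. */
    static boolean isSafeToken(String value) {
        return value != null
                && SAFE.matcher(value).matches()
                && !value.contains("..");
    }

    public static void main(String[] args) {
        System.out.println(isSafeToken("report_1997.txt")); // true
        System.out.println(isSafeToken("x; rm -rf /"));     // false: shell metacharacters
        System.out.println(isSafeToken("../etc/passwd"));   // false: path separator
    }
}
```

Rejecting everything that is not explicitly allowed tends to be easier to reason about than trying to list every dangerous character.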
Robots and Other Network Creatures
• Problems with “Automated Agents”
• Deterring Robots
• Reacting to Robots
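The slide does not name a specific deterrent. One widely used convention is a robots.txt file at the server root, which well-behaved robots read before crawling. The sketch below shows how such a file might be interpreted, assuming one User-agent line per record and simple prefix matching. The class and method names are hypothetical, and real robots.txt handling has more rules than this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified reader for robots.txt rules that apply to all robots. */
final class RobotsRules {
    /** Collects the Disallow path prefixes listed under "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;
        for (String raw : robotsTxt.split("\r?\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;               // blank or malformed line
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                applies = value.equals("*");       // assumption: one User-agent line per record
            } else if (field.equalsIgnoreCase("Disallow") && applies && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String txt = "User-agent: *\nDisallow: /cgi-bin/\n";
        List<String> rules = disallowedForAll(txt);
        System.out.println(isAllowed("/cgi-bin/search", rules)); // false
        System.out.println(isAllowed("/index.html", rules));     // true
    }
}
```

This only deters cooperative robots; reacting to robots that ignore the file needs server-side measures such as the access restrictions discussed elsewhere in the slides.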
WWW Server Stats
Figure: bar chart, "WWW Accesses per Week", January through March, UNM versus Outside.
Figure: pie chart, "WWW Accesses per Week": UNM 35%, Outside 65%.
Web Mining
Web-based information extraction
Why the Web (web = web browser)
• Ubiquitous
– Web browsers are on every desktop, every PC, Mac, workstation, and terminal.
• Platform independence
– Use of Java and server-side programs means clicking on a button does the same thing everywhere.
Figure: diagram with the labels Natural Language, News Services, Multidimensional Vectors, Markov Objects, ID3, Word Frequency, Data Warehouse, Data Cleansing, Data Compression, Text Mining, Factorial Analysis, Keyword Search, Decision Trees, Tri-Grams, Tri-Letter Sets, Hidden Information, Hypothesis Verification.
Figure: flow diagram with the labels Data, Cleansed Data, Extracted Data, N, Display Results.
What Kind of Data?
• Usenet News
– Most places have multiple gigabytes of news
• System accounting files
– Can tell who is doing what, when
• Misc. Web pages
– A variety of interesting information
• Listserver or public system email
– We keep email concerning system problems
Cleansing Data
• News article
– NNTP fields
– signatures
• Web Page
– HTML codes
– descriptions of links to other sites
– pattern fields (headers and trailers that appear on every page at the site)
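The cleansing steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the original work: it assumes a news article's header fields end at the first blank line, that a signature starts at a line consisting of "-- ", and that HTML codes can be removed with a simple regular expression. The class and method names are hypothetical.

```java
/** Hypothetical cleansing helpers for news articles and web pages. */
final class ArticleCleaner {
    /** Drops the header block (up to the first blank line) and any signature after a "-- " line. */
    static String stripHeadersAndSignature(String article) {
        StringBuilder body = new StringBuilder();
        boolean inBody = false;
        for (String line : article.split("\r?\n", -1)) {
            if (!inBody) {
                // Header fields run until the first empty line.
                if (line.trim().isEmpty()) inBody = true;
                continue;
            }
            // Conventional signature delimiter; everything after it is discarded.
            if (line.equals("-- ") || line.equals("--")) break;
            body.append(line).append('\n');
        }
        return body.toString().trim();
    }

    /** Replaces HTML tags with spaces, then collapses runs of whitespace. */
    static String stripHtmlTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtmlTags("<B>Hello</B> <I>world</I>")); // Hello world
    }
}
```

A regular expression is enough for a sketch, but it will mishandle comments, scripts, and malformed markup; pattern fields that repeat on every page would need a separate comparison across pages.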
Mining for data
• Test hypothesis
• Look for hidden information
• Find other similar information
Display of Information
• Graphical
• Text Listing
– Directories: human-maintained categories (e.g., recreation, computers, finances, arts)
– Computer-generated list
• Customized
– User-defined defaults
– Cookie-defined defaults
Data and Services
Figure: flow diagram with the labels Learning to use services, Learning to extract data from the answer, Compile and clean data, N, Display Results.
What Services?
• Search Engines
• Internet White Pages
– (information on individuals)
• Internet Yellow Pages
– (information on corporations)
• Usenet News repositories
• Online libraries
• Online periodicals
Learning to use Services
• Sample sets of data
– Can derive a format if taught to
• Machine learning (same as in Data Mining)
– Look at every interpretation, find the one that conveys the most information
Learning to interpret answers
• What format is information given in?
• What do the fields mean?
– Can identify unknown fields by matching the data with known information
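The idea of identifying unknown fields by matching against known information can be sketched as follows. This is a simplified illustration: it assumes an answer record has already been split into fields and that matching is an exact, case-insensitive comparison. The names FieldMatcher and labelColumns and the sample values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: label the columns of a service's answer using values we already know. */
final class FieldMatcher {
    /**
     * fields: the fields of one answer record, in the order the service returned them.
     * known:  label to value for facts we already have about the same subject.
     * Returns label to column index for every known value found in the record.
     */
    static Map<String, Integer> labelColumns(String[] fields, Map<String, String> known) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> fact : known.entrySet()) {
            for (int i = 0; i < fields.length; i++) {
                if (fields[i].trim().equalsIgnoreCase(fact.getValue().trim())) {
                    result.put(fact.getKey(), i); // first matching column wins
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> known = new LinkedHashMap<>();
        known.put("name", "Pat Example");
        known.put("city", "Albuquerque");
        String[] answer = {"Pat Example", "555-0100", "Albuquerque"};
        System.out.println(labelColumns(answer, known)); // {name=0, city=2}
    }
}
```

Once a column has been labeled from a few known subjects, the same column in answers about unknown subjects can be read with that label.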
Compile and Clean Data
• Redundancies
• Duplicates
• Newer information has precedence
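The "newer information has precedence" rule can be sketched as a merge keyed on whatever identifies a record. This is a minimal illustration, assuming each fact carries a collection time; the class RecordMerger, the nested Fact type, and its fields are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: drop duplicates and let the newest value for a key win. */
final class RecordMerger {
    /** One collected fact: a lookup key, a value, and when the value was collected. */
    static final class Fact {
        final String key;
        final String value;
        final long collectedAt;

        Fact(String key, String value, long collectedAt) {
            this.key = key;
            this.value = value;
            this.collectedAt = collectedAt;
        }
    }

    /** Keeps one fact per key, preferring the one with the largest collectedAt. */
    static Map<String, Fact> mergeNewestWins(List<Fact> facts) {
        Map<String, Fact> newest = new LinkedHashMap<>();
        for (Fact f : facts) {
            Fact seen = newest.get(f.key);
            if (seen == null || f.collectedAt > seen.collectedAt) {
                newest.put(f.key, f);
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        List<Fact> facts = Arrays.asList(
                new Fact("pat", "old phone", 100L),
                new Fact("pat", "new phone", 200L),
                new Fact("pat", "old phone", 100L)); // exact duplicate
        System.out.println(mergeNewestWins(facts).get("pat").value); // new phone
    }
}
```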
Security
• Server environment
– Use trusted CGI scripts and server-side includes
• Client environment
– Restrict access by IP number or domain
– Restrict access by password
• Internet
– Encrypt data (PGP)
– Certification authority
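Restricting access by IP number or domain is normally a web server configuration setting, but the underlying test is simple. The sketch below shows one way the comparison might work, assuming dotted IPv4 addresses and host names that have already been resolved. The class AccessRule, its method names, and the sample domain and addresses are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch of the checks behind "restrict access by IP number or domain". */
final class AccessRule {
    /** True if the host is the domain itself or a name inside it ("www.example.edu" in "example.edu"). */
    static boolean domainAllowed(String host, String domain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = domain.toLowerCase(Locale.ROOT);
        return h.equals(d) || h.endsWith("." + d);
    }

    /** True if the dotted IPv4 address begins with the prefix on an octet boundary. */
    static boolean ipAllowed(String ip, String prefix) {
        String p = prefix.endsWith(".") ? prefix : prefix + ".";
        // Appending "." to the address stops "192.0.2" from matching "192.0.20.1".
        return (ip + ".").startsWith(p);
    }

    public static void main(String[] args) {
        System.out.println(domainAllowed("www.example.edu", "example.edu")); // true
        System.out.println(ipAllowed("192.0.20.1", "192.0.2"));              // false
    }
}
```

Both checks trust the address the server sees, so they are a coarse filter and are usually combined with passwords for anything sensitive.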
Figure: flowchart with the decision "Data is in database?" (branches Y and N) and the boxes Checking for hidden information and Machine Learning.
Article: 52151 of comp.lang.perl.misc
Path: lynx.unm.edu!pr1.plk.af.mil!tesuque.cs.sandia.gov!sloth.swcp.com!news.ironhorse.com!op.net!news.mathworks.com!enews.sgi.com!news.sgi.com!mr.net!news.mid.net!sbctri.tri.sbc.com!newspump.wustl.edu!newsfeed.rice.edu!rice!add
From: [email protected] (Arthur Darren Dunham)
Newsgroups: comp.lang.perl.misc,comp.infosystems.www.authoring.html
Subject: Re: WWW: web site "pre-processor" in perl ?
Date: 31 Oct 1996 00:20:06 GMT
Organization: Rice University
Lines: 23
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]> <[email protected]> <[email protected]>
NNTP-Posting-Host: pecos.is.rice.edu
Xref: lynx.unm.edu comp.lang.perl.misc:52151 comp.infosystems.www.authoring.html:111886

In article <[email protected]>, Clay Shirky <[email protected]> wrote:
>
>Au contraire. HTML _is_ broken, relative to, say, SGML, but if you are
>careful with your tags and comment carefully, your data can be derived
>from your HTML files, not v-v.
>
>find . -name '*html' -exec perl -p -i.bak -e
> 's#(<body[^>]*bgcolor="?)oatmeal("?[^>]*>)#$1skyblue$2#i;' {} \;

or if you wanted perl to do all the work, rather than have find(1)
launch N perl executables for each .html files, you could do this....

find . -type f -name '*html' -print | xargs perl -p -i.bak -e 's#(<body[^>]*bgcolor="?)oatmeal("?[^>]*>)#$1skyblue$2#i;'

That way, perl happily iterates through all the lines in all the files
since we don't care which file we're in when we do the substitution.

-- 
Darren Dunham [email protected] Sysadmin Rice University
(This line currently in revision) Houston, TX
Any resemblance between real opinions and my post is coincidental
<HTML>
<HEAD><TITLE>Information gathering</TITLE></HEAD>
<BODY>
<TABLE>
<TR><TH><IMG SRC="info.gif"></TH> <TH><font size="+3">Information Gathering</font><BR>
Just some sample text which might or might not be worthless.
You'd want to sort out which of this was just HTML tags and other worthless junk and which was meaningful.</TH></TR>
</TABLE>
<P>
<CENTER>
<H2>Links to</H2>
<A HREF="/sameplace/otherinfo"> A link to something on this site </A>
<A HREF="/otherplace/otherinfo"> A link to something on this another site </A>
</BODY>
</HTML>
Re: Scots and English Gregory J Dalley, 30 May 1995, Lines: 18.
Re: Dutch and English accents Phil Rose, 15 Jun 1995, Lines: 28.
Re: ANY SIL'rs out there? A.K.A. Summer Institute of Linguistics. yomomma, 16 Jun 1995, Lines: 6.
Re: ANY SIL'rs out there? A.K.A. Summer Institute of Linguistics. yomomma, 16 Jun 1995, Lines: 6.
Conferences, Seminars-info wanted chris bowen, Mon, 03 Jul 1995, Lines: 7.
AIGH? Coby (Jacob) Lubliner, 8 Jul 1995, Lines: 8.
"Shall" and "Will" in Welsh English [email protected], Wed, 19 Jul 95, Lines: 14.
careers in linguistics scharle, 10 Sep 1995, Lines: 8.
job opportunities in computational linguistics? Sonny Xuan Vu, 30 Sep 1995, Lines: 14.
Re: job opportunities in computational linguistics? Miss Sarah Tiller, Wed, 4 Oct 1995, Lines: 27.
Re: What Is Singapore English? Zhong Qiyao, 11 Dec 1995, Lines: 28.
Re: What Is Singapore English? Chew Kim Swee Andrew, 14 Dec 1995, Lines: 41.
Re: What Is Singapore English? Pota alok Ashwin, 16 Dec 1995, Lines: 45.
Re: How to write in English ... Ann Weiner, Tue, 2 Jan 1996, Lines: 13.
Re: What Is Singapore English? Wing Luk, 7 Jan 1996, Lines: 27.
Linguistics Careers lebitz,stacey b, 23 Jan 1996, Lines: 14.
English Teaching Offering in China - offer2.doc [1/1] XIAOJUN ZHANG, 24 Jan 1996, Lines: 240.
TRYING TO PROTECT YOUR WORK? prepaid, Sun, 04 Feb 1996, Lines: 1.
Give me, please, one program for learn to speak english!! Please!! "Eugen I. Ivanov", 20 Feb 1996, Lines: 1.
Re: The English "R" for Germans Joerg Settemeyer, 8 Mar 1996, Lines: 5.
English Tutor Needed. Mua Tran, 23 Mar 1996, Lines: 20.
Re: old form of shorthand Fido, 1 Apr 1996, Lines: 9.
Re: Math as pornography Gordon Fitch, 17 May 1996, Lines: 7.
Re: Chain Shift Charles Lieberman, 26 Jul 1996, Lines: 10.
Re: Tendency of Inflections to Disappear - Why? Terrence Griffin, 28 Jul 96, Lines: 1.
Re: Concerning the number of esperantists Marc Bonnaud, Fri, 09 Aug 1996, Lines: 14.
Re: Concerning the number of esperantists Cheradenine Zakalwe, Fri, 9 Aug 1996, Lines: 16.
Re: Concerning the number of esperantists Alan Gould, Sat, 10 Aug 1996, Lines: 22.
Re: Concerning the number of esperantists Don HARLOW, Sun, 11 Aug 1996, Lines: 21.
Re: Kiom da E-istoj *ne* regas la anglan? Andrew McConnell, Fri, 30 Aug 1996, Lines: 19.
cohesion in CMC Per-Mikael Jansson ENGE, 22 Oct 1996, Lines: 10.
Articles from sci.lang selected through webSOM
Limitations of the Web
• Some functionality/specialization was given up for ubiquity
• Transfer time
– Mass data transfer prohibitive
• External to machine
– Reliance on network
• Not inherently as secure as staying home
Why Data Mining
• There is a lot of data of unknown worth and purity
• Data mining uses the same underlying procedures as other knowledge discovery/data extraction systems
Automatic Customization to user preferences
• Web pages
– HotWired auto-configures based on what you surf to
• News services
– Usenet service custom.roy-corey.1
• Information display paradigm
– industry report style
– collegiate style
– Microsoft style
Methods for gathering data
• Extraction from documents
– data mining
– keyword searches
– similarity searches
• Extraction from services
– ILA: Internet Learning Agents
– Softbots
– MetaCrawler
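A similarity search over documents needs some measure of how alike two texts are. The slides mention tri-grams and tri-letter sets but do not give a formula, so the sketch below makes an assumption: it compares the sets of three-letter sequences in each text using the Jaccard measure (shared trigrams divided by all distinct trigrams). The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: similarity of two texts from their sets of letter trigrams. */
final class TrigramSimilarity {
    /** Returns every distinct three-character substring of the normalized text. */
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(String a, String b) {
        Set<String> ta = trigrams(a);
        Set<String> tb = trigrams(b);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0; // two texts too short to compare
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        Set<String> all = new HashSet<>(ta);
        all.addAll(tb);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("What Is Singapore English?", "Re: What Is Singapore English?"));
        System.out.println(jaccard("web mining", "java applets"));
    }
}
```

Scores near 1 indicate nearly identical wording and scores near 0 indicate little overlap; a keyword search, by contrast, only tests whether specific words are present.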
Data mining on the web?
• Transfer rate too slow to transfer most databases whenever you want
• Computation too intensive to let others mine your database whenever they want
• So: Use pre-collected data or pre-indexed database
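One form a pre-indexed database can take is an inverted index: a table built once, ahead of time, that maps each word to the documents containing it, so a query never has to transfer or rescan the documents themselves. The sketch below is a minimal in-memory version. The class TinyIndex, its methods, and the sample documents are hypothetical and are not taken from the slides.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a pre-built word index (an inverted index). */
final class TinyIndex {
    // word -> sorted ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexing step, done once when a document is collected. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) continue;
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query step: a single map lookup, with no access to the original documents. */
    Set<Integer> lookup(String word) {
        Set<Integer> ids = postings.get(word.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Web mining on the web");
        index.add(2, "Data mining");
        System.out.println(index.lookup("mining")); // [1, 2]
        System.out.println(index.lookup("web"));    // [1]
    }
}
```

The cost of reading every document is paid once at indexing time, which is the trade the slide describes: queries stay cheap for both the network and the server.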
Java -- What is it?
• Programming Language
• Java Compiler
• Java Interpreter (Java Virtual Machine)
• For creating applets which run inside a browser
• For creating applications (stand-alone programs)
Java Application Source Code
//
// Sample HelloWorld application
//
class HelloWorldApp {
    public static void main(String[] args) {
        // Print the greeting to standard output
        System.out.println("Hello World!");
    }
}
Java Applet Source Code
//
// Sample HelloWorld applet
//
import java.awt.Graphics;
import java.applet.Applet;

public class HelloWorld extends Applet {
    // Called whenever the applet's display area needs to be drawn
    public void paint(Graphics g) {
        // Draw the text with its baseline starting at x=25, y=25
        g.drawString("Hello world!", 25, 25);
    }
}
How could you use it?
• Client applets or applications
• Server code
• Portable code
• Create via Developer Tools
Developer Tools
• Visual C++ (Visual Java?)
• Symantec
• Sun
• SGI - Cosmo Code
Developer Tools
• SourceCraft
• Powersoft - Fusion
• Quintessential Objects - Diva for Java (Javaside)
• Rogue Wave - JFactory
Advantages
• Object Oriented and event-driven
• Portable* bytecode
• Multi-threaded
• Integrated Network Abilities
• Built-in Multimedia Capabilities
• “Robust and Secure”
Drawbacks
• Few deployed clients
• Very C++-like
• Not yet stabilized
• Very few Developer Tools
• Not all the class libraries exist (yet)
Class Structure
Class java.applet.Applet
java.lang.Object
   |
   +----java.awt.Component
           |
           +----java.awt.Container
                   |
                   +----java.awt.Panel
                           |
                           +----java.applet.Applet
Security
• OS security in applications
• “No Pointers” and no user memory management
• Compile-time and Run-time checking
• Client Data Security
– No access to disk from Netscape
– Directory-based security in HotJava
Security
• Network Security
– No Applets
– No Access
– Applet Host
– Firewall
– Any Host
Security Problems
• CERT 96.05 - Firewall Security
– ftp://info.cert.org/pub/cert_advisories/CA-96.05.java_applet_security_mgr
• CERT 96.07 - Bytecode Verifier
– ftp://info.cert.org/pub/cert_advisories/CA-96.07.java_bytecode_verifier
Alternative Options
• Visual Basic and browsers
• Visual Basic separate from WWW
• Web Server without Java
Books About Java
• Teach Yourself Java in 21 Days
• Java!
• Hooked On Java
• Presenting Java
• O’Reilly
Java WWW Sites
• Sun
– http://java.sun.com/
• The Internet Programming Page
– http://www.apexsc.com/vb/internet.html
• Rogue Wave Home Page
– http://www.roguewave.com/
• Symantec Café
– http://cafe.symantec.com/cafe/index.html
Java WWW Sites
• JavaSoft
– http://www.javasoft.com/
• The Java Directory (Gamelan)
– http://www.gamelan.com/
• IBM: Centre for Java Technology
– http://www.hursley.ibm.com/javainfo/
• News: comp.lang.java