Security and Replication… and Course Wrap-up
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 19, 2023
PNUTS slide content courtesy of Brian Cooper
Secure Transactions
Authentication using public/private key pairs is essential today
Consider every Web transaction – we want to know whom we’re conversing with!
… versus ending up with a phishing attack!
Secure Sockets Layer (SSL)
Relies on a trusted third party
  Certificate authority (CA) issues certificates to certify a server and its public key
  Verisign is perhaps the best known of these
A server S generates public-private keypair
  Sends the public key, other info (plus $$$) to Verisign (etc.)
  Gets back a certificate with:
    CA name
    S’s name, URL, public key
    Timestamp and expiration info
Example Certificate
Owner: CN=GTE CyberTrust Root, O=GTE Corporation, C=US
Issuer: CN=GTE CyberTrust Root, O=GTE Corporation, C=US
Serial number: 1a3
Valid from: Fri Feb 23 23:01:00 GMT 1996 until: Thu Feb 23 23:59:00 GMT 2006
Certificate fingerprints:
  MD5: C4:D7:F0:B2:A3:C5:7D:61:67:F0:04:CD:43:D3:BA:58
  SHA1: 90:DE:DE:9E:4C:4E:9F:6F:D8:86:17:57:9D:D3:91:BC:65:A6:89:64
The SSL Protocol
Client C connects to server S from enterprise E
S sends E’s certificate (cleartext)
C validates the certificate using the public key of the CA (e.g., Verisign)
C generates a session key and sends it to S, encrypted with E’s public key
Java has built-in support for SSL (Java Secure Socket Extension, integrated in 1.4) and a tool for managing certificates (keytool)
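The certificate check and session-key exchange above can be sketched with textbook RSA. This is only an illustration under loud assumptions: the primes are small, well-known Mersenne primes, there is no padding, and the names (make_keypair, validate, send_session_key, receive_session_key) are hypothetical. It is not how SSL or JSSE is implemented, and it is not secure.

```python
import hashlib

def make_keypair(p, q, e=65537):
    """Textbook RSA keypair from two primes (illustration only, not secure)."""
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)            # modular inverse; needs Python 3.8+
    return (n, e), (n, d)          # (public key, private key)

def digest(data, n):
    """Hash bytes down to an integer smaller than the modulus n."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

# Assumed toy keys for the CA and for enterprise E.
ca_public, ca_private = make_keypair(2**31 - 1, 2**61 - 1)
e_public, e_private = make_keypair(2**61 - 1, 2**89 - 1)

# The CA issues a certificate: E's name and public key, signed with the CA's private key.
cert_body = f"name=E;key={e_public}".encode()
cert_signature = pow(digest(cert_body, ca_private[0]), ca_private[1], ca_private[0])

def validate(body, signature, ca_pub):
    """Client C checks the certificate using only the CA's public key."""
    n, e = ca_pub
    return pow(signature, e, n) == digest(body, n)

def send_session_key(session_key, server_pub):
    """C encrypts a fresh session key with E's public key."""
    n, e = server_pub
    return pow(session_key, e, n)

def receive_session_key(ciphertext, server_priv):
    """S recovers the session key with E's private key."""
    n, d = server_priv
    return pow(ciphertext, d, n)
```

Only the holder of E’s private key can recover the session key, which is why a validated certificate is enough for C to trust the channel.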
So…
The client and server know each other given SSL
How do we go ahead and make a purchase?
  Most commonly: you enter your credit card number
  Sometimes this is stored in the retailer’s system for future purposes!
  Best case:
    The CC info is stored in a special, firewalled server, not part of the web site
    Web server has other account info about you
    When a transaction goes through, web site sends order to this special server, which combines it with CC info and sends it onward
Replication… Core of the Cloud
The vision of the “cloud”: a “computing utility” that is geographically distributed
At its core: geographical replication as well as partitioning
  What to replicate (including granularity)
  Where to replicate
  How to maintain consistency (and how fresh data needs to be)
What to Replicate
Cost of maintaining consistency if data is changing
  Larger objects, slower networks, frequent updates, freshness requirements: each makes replication more expensive
  May be able to send a “diff” instead of the whole object
Thus, difference between LAN and WAN replication:
  Local-area / cluster:
    Single-writer, multiple-reader data is often replicated
    e.g., CNN
  Wide-area:
    Need to limit replication to seldom-updated data, or relax the freshness or consistency constraints
    e.g., Akamai (images, video), Google index
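A minimal sketch of the “diff” idea, assuming the replicated object is a list of lines. The helper names make_delta and apply_delta are hypothetical, and the delta format is invented for illustration.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Return only the changed regions as (start, end, replacement_lines)."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_delta(old, delta):
    """Rebuild the new object at the replica from its old copy plus the delta."""
    result, position = [], 0
    for start, end, replacement in delta:
        result.extend(old[position:start])   # unchanged lines before the edit
        result.extend(replacement)           # lines shipped in the delta
        position = end                       # skip the lines that were replaced
    result.extend(old[position:])
    return result
```

When one line of a large object changes, the delta carries a single small tuple rather than the whole object, which matters most over slow wide-area links.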
Where to Place Replicas in the Internet
Want to place them at points where they can handle many requests and reduce traffic in bottlenecks
Commonly, at least one replica in Europe, Asia, US West Coast, US East Coast
[Figure: Server 1 and Server 2 serving clients C1 through C9; one link is labeled “congested or failure-prone”]
Schemes for Maintaining Consistency
Goal is to trade off performance vs. consistency guarantees
  Lock-based protocols
  Invalidation
  Lease
  Time-to-live
Lock-Based Protocols
Guarantee strong consistency
Similar to distributed version of what’s done in a database
  Client request for an item requires a read lock at its handling server
  Update to an item requires a write lock
  Multiple read locks can be held concurrently; write lock must be exclusive
What are the potential pitfalls of this approach?
  Is it resilient to network partition?
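The locking rules above can be sketched as a lock table kept at one handling server. This is a single-process, non-blocking simplification; the class and method names are hypothetical, and a distributed version would exchange these requests as messages.

```python
class LockTable:
    """Per-item read/write locks held at the item's handling server."""

    def __init__(self):
        self.readers = {}   # item -> set of clients holding a read lock
        self.writer = {}    # item -> client holding the exclusive write lock

    def try_read(self, item, client):
        if item in self.writer:          # a writer excludes all readers
            return False
        self.readers.setdefault(item, set()).add(client)
        return True

    def try_write(self, item, client):
        other_readers = self.readers.get(item, set()) - {client}
        if item in self.writer or other_readers:
            return False                 # write lock must be exclusive
        self.writer[item] = client
        return True

    def release(self, item, client):
        self.readers.get(item, set()).discard(client)
        if self.writer.get(item) == client:
            del self.writer[item]
```

One pitfall is visible even here: if a lock holder crashes or is cut off by a partition before calling release, every other client is refused until something times the lock out.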
Invalidation Protocols
If a server is to update an item, it can multicast this to all replicas
  Requires servers to know who all of the other parties are
  May be somewhat weaker than lock-based models – why?
Common variation: lease-based protocol
  A replicated item is “leased” for a particular period
  If the item is updated during its lease, it is invalidated/refreshed
  After it expires, it is dropped
What are the pros and cons of these protocols?
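A minimal sketch of the lease-based variation, seen from one replica. Time is passed in explicitly so the behavior is easy to follow; the class and method names are hypothetical.

```python
class LeasedReplica:
    """Holds copies of items, each valid only until its lease expires."""

    def __init__(self):
        self.items = {}   # key -> (value, lease_expiry_time)

    def grant(self, key, value, now, lease_seconds):
        """The origin leases an item to this replica for a fixed period."""
        self.items[key] = (value, now + lease_seconds)

    def on_update(self, key, new_value, now):
        """Update message from the origin: refresh the copy if the lease is live."""
        entry = self.items.get(key)
        if entry is not None and now < entry[1]:
            self.items[key] = (new_value, entry[1])

    def get(self, key, now):
        entry = self.items.get(key)
        if entry is None or now >= entry[1]:
            self.items.pop(key, None)   # lease expired: drop the copy
            return None
        return entry[0]
```

The origin only has to remember its replicas for the length of a lease, which bounds the bookkeeping that plain invalidation requires.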
Time-to-Live-Based Replication
Generally used when freshness constraints aren’t severe
Replicas are provided with an expectation of how long they are likely to be current
After the “time-to-live” expires, they need to be revalidated
How does this compare to the previous approaches?
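For comparison, a time-to-live replica might look like the sketch below. It assumes revalidation is a plain refetch from the origin; the names TTLReplica and fetch_from_origin are hypothetical.

```python
class TTLReplica:
    """Serves cached values until their time-to-live runs out, then refetches."""

    def __init__(self, fetch_from_origin, ttl):
        self.fetch_from_origin = fetch_from_origin
        self.ttl = ttl
        self.cache = {}   # key -> (value, time_fetched)

    def get(self, key, now):
        entry = self.cache.get(key)
        if entry is None or now - entry[1] >= self.ttl:
            entry = (self.fetch_from_origin(key), now)   # revalidate
            self.cache[key] = entry
        return entry[0]
```

Unlike the lock-based and invalidation schemes, the origin sends nothing when an item changes, so a replica can serve a stale value until the TTL expires.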
Replication in “Cloud” Services
Yahoo’s PNUTS, Google’s BigTable are based on the notion that there is locality of data access
  Consider consistency within each record but ignore cross-record consistency
  e.g., in a social network, we should coordinate accesses to the same user (but don’t care about consistency with unrelated friends)… but even here, we might be able to tolerate relaxed consistency among the users
Yahoo’s PNUTS Platform
Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
[Figure: a sample table, shown in several copies, with these rows]
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
Query model
Per-record operations
  Get
  Set
  Delete
Multi-record operations
  Multiget
  Scan
  Getrange
Web service (RESTful) API
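To make the operation list concrete, here is a toy in-memory table offering the same six operations. It is not the PNUTS API: the method signatures are assumptions, and a real deployment would expose them through the REST interface.

```python
import bisect

class Table:
    """Toy ordered table keyed by primary key."""

    def __init__(self):
        self.keys = []   # primary keys, kept sorted for range access
        self.rows = {}   # key -> record

    # Per-record operations
    def set(self, key, record):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = record

    def get(self, key):
        return self.rows.get(key)

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self.keys.remove(key)

    # Multi-record operations
    def multiget(self, keys):
        return {k: self.rows[k] for k in keys if k in self.rows}

    def scan(self):
        return [(k, self.rows[k]) for k in self.keys]

    def getrange(self, low, high):
        """Records with low <= key < high, in key order."""
        start = bisect.bisect_left(self.keys, low)
        stop = bisect.bisect_left(self.keys, high)
        return [(k, self.rows[k]) for k in self.keys[start:stop]]
```

Keeping the keys sorted is what makes Getrange cheap; a hash-organized table could still support the per-record operations and Multiget.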
System Architecture
[Figure: architecture diagram. Labeled components: clients, REST API, routers, tablet controller, storage units, and YMB, spanning a local region and remote regions]
Tablet splitting and balancing
Each storage unit has many tablets (horizontal partitions of the table)
Tablets may grow over time
Overfull tablets split
Storage unit may become a hotspot
Shed load by moving tablets to other servers
[Figure: storage units, each holding several tablets]
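A rough sketch of splitting and load shedding, assuming a tablet is a sorted list of keys and load is measured by key count. Both are simplifications, and the function names are hypothetical.

```python
def split_overfull(tablets, max_keys):
    """Split each tablet holding more than max_keys into two halves (one pass)."""
    result = []
    for tablet in tablets:
        if len(tablet) > max_keys:
            middle = len(tablet) // 2
            result.extend([tablet[:middle], tablet[middle:]])
        else:
            result.append(tablet)
    return result

def shed_load(units):
    """Move one tablet from the most loaded storage unit to the least loaded."""
    def load(name):
        return sum(len(tablet) for tablet in units[name])

    hot = max(units, key=load)
    cold = min(units, key=load)
    if hot != cold and units[hot]:
        units[cold].append(units[hot].pop())
    return units
```

Splitting first keeps the unit of movement small, so shedding load moves a fraction of a hot storage unit rather than all of it.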
Range queries
[Figure: a router splits a range query across Storage unit 1, Storage unit 2, and Storage unit 3]
Keys shown, in four groups:
  Apple, Avocado, Banana, Blueberry
  Canteloupe, Grape, Kiwi, Lemon
  Lime, Mango, Orange
  Strawberry, Tomato, Watermelon
Router map (key interval, storage unit):
  MIN-Canteloupe     SU1
  Canteloupe-Lime    SU3
  Lime-Strawberry    SU2
  Strawberry-MAX     SU1
The query Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?
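The router’s interval map can be sketched with a sorted list of split points. The boundaries and storage-unit names come from the slide (spelling as on the slide); the function names and the half-open interval convention are assumptions.

```python
import bisect

BOUNDARIES = ["Canteloupe", "Lime", "Strawberry"]   # split points between intervals
UNITS = ["SU1", "SU3", "SU2", "SU1"]                 # storage unit per interval, MIN to MAX

def route_key(key):
    """Storage unit whose interval contains the key."""
    return UNITS[bisect.bisect_right(BOUNDARIES, key)]

def route_range(start, end):
    """Split the range [start, end) into one sub-range per storage unit."""
    pieces = []
    index = bisect.bisect_right(BOUNDARIES, start)
    low = start
    while True:
        high = BOUNDARIES[index] if index < len(BOUNDARIES) else None
        if high is None or end <= high:
            pieces.append((UNITS[index], low, end))
            return pieces
        pieces.append((UNITS[index], low, high))
        low, index = high, index + 1
```

Calling route_range("Grapefruit", "Pear") yields one piece for SU3 covering Grapefruit to Lime and one for SU2 covering Lime to Pear, matching the split shown on the slide.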
Updates
[Figure: the path of a write through routers, storage units (SU), and message brokers, in numbered steps]
  1 Write key k
  2 Write key k
  3 Write key k
  4 (no label)
  5 SUCCESS
  6 Write key k
  7 Sequence # for key k
  8 Sequence # for key k
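One plausible reading of the numbered figure, sketched as code: the storage unit that owns the record publishes the write to a message broker, treats the broker’s SUCCESS as the commit point, and returns a sequence number, while other storage units receive the write later. The ordering and all names here are assumptions, not something the slide text confirms.

```python
from collections import deque

class StorageUnit:
    def __init__(self):
        self.data = {}   # key -> (sequence number, value)

class MessageBroker:
    """Accepts writes and delivers them to subscribed storage units later."""

    def __init__(self, subscribers):
        self.subscribers = subscribers
        self.pending = deque()

    def publish(self, key, sequence, value):
        self.pending.append((key, sequence, value))
        return "SUCCESS"

    def deliver_all(self):
        while self.pending:
            key, sequence, value = self.pending.popleft()
            for unit in self.subscribers:
                unit.data[key] = (sequence, value)

def write(owner, broker, key, value):
    """Write through the record's owning storage unit; return the sequence number."""
    sequence = owner.data.get(key, (0, None))[0] + 1
    if broker.publish(key, sequence, value) == "SUCCESS":
        owner.data[key] = (sequence, value)
    return sequence
```

Because delivery to other storage units happens after the writer already has its answer, a read at one of them can briefly return an older version.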
Asynchronous replication and consistency
Asynchronous Replication
Consistency Model
Goal: make it easier for applications to reason about updates and cope with asynchrony
Consider a single record for Brian Cooper’s Facebook entry:
[Figure: timeline of the record. It is inserted, updated seven times, and then deleted; versions v. 1 through v. 8 all belong to Generation 1]
Operations shown against this timeline, on which one version is marked as the current version and others as stale versions:
  Read (local)
  Read up-to-date
  Read ≥ v.6
  Write
  Write if = v.7 (shown returning ERROR)
Mechanism: per-record mastership
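The operations on the timeline can be sketched against one record whose master holds every version while a local replica lags behind. The method names are illustrative labels for the slide’s operations, and returning bare version numbers is a simplification.

```python
class Record:
    """One record: the master has all versions; the local replica may be stale."""

    def __init__(self):
        self.versions = []         # versions[i] holds the value of version i + 1
        self.replica_version = 0   # newest version the local replica has applied

    def write(self, value):
        """Write: the master appends a new version and returns its number."""
        self.versions.append(value)
        return len(self.versions)

    def test_and_set_write(self, required_version, value):
        """'Write if = v.N': succeed only if N is still the current version."""
        if len(self.versions) != required_version:
            raise ValueError("ERROR: record is not at the required version")
        return self.write(value)

    def sync_replica(self, up_to):
        """Asynchronous replication catches the replica up to some version."""
        self.replica_version = max(self.replica_version,
                                   min(up_to, len(self.versions)))

    def read_local(self):
        """'Read (local)': fast, but possibly a stale version."""
        return self.replica_version

    def read_up_to_date(self):
        """'Read up-to-date': always the current version, from the master."""
        return len(self.versions)

    def read_at_least(self, required_version):
        """'Read ≥ v.N': use the replica if it is fresh enough, else the master."""
        if self.replica_version >= required_version:
            return self.replica_version
        return self.read_up_to_date()
```

Routing every write for a record through a single master is what lets the conditional write detect that another update got there first.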
PNUTS Recap
An interesting compromise between consistency and performance/availability
Used underneath many of Yahoo’s properties
… And an exemplar of the new generation of cloud services
Experiments – Show It’s So!
The general goal: to help demonstrate and show why a real-world artifact provides a benefit
  Versus some benchmark or naïve strategy
  We also want to understand why there’s a benefit
Some common kinds of experiments:
  Usability: some sort of user tests, versus a benchmark
  Performance: as we increase the workload, what happens?
  Scalability: as we increase the data, devices, nodes, what happens?
  Complexity: especially for things like code, what happens as we make the task harder or bigger?
Experimentation
In general, experiments should follow the scientific method:
  Hypothesis (e.g., our method will do better than XYZ on workloads like QWV, which are representative of domain ABC)
  Experiment (examine this – may need many trials, random workloads, etc.)
  Conclusion (show, with statistically significant measurements, that the hypothesis is true)
Often, the hypothesis almost goes unsaid in computer science – it’s implicit in the choice of the problem – but it is there!
Note that many attributes, e.g., elegance, style, are not very amenable to experiments
  Others, like expressiveness, generally need to be proven rather than run
Experimental Workloads
There are generally three kinds of systems experiments:
  Synthetic microbenchmark: experimental runs are done over inputs that are generated to stress a specific factor, but are not particularly realistic
    Examples: a hard disk random access test; a web server’s maximum throughput
    Really shows the factor of interest; can be tweaked, scaled, etc.
  Synthetic based on real behavior: experimental runs are done over inputs that are modeled after real data, but perhaps generated randomly
    Examples: SPEC benchmarks; TPC-W web transaction benchmark
    Enables us to generate more inputs, testing scalability, etc.
  Real-world: traces are collected of real system behavior over real data
    Disadvantage: hard to quantify or control the different factors
Experimental Methodology
Consider the important factors that you wish to examine (and demonstrate)
  Scalability – can typically be in terms of running time, size of the problem, space consumed, etc.
  Here: performance is what matters
Break it down into individual parameters
  Crawl & index time; time to answer a query; etc.
Consider a workload that helps measure the parameter
  Crawl 1000 documents; run 50 queries 10 times apiece; etc.
Vary one parameter at a time, study effects
  Number of machines; number of threads per machine; etc.
Run experiment multiple times; average and show 95% confidence intervals in line (continuous) or bar (discrete) chart
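A small sketch of the last step: averaging repeated runs and computing a 95% confidence interval. It uses a normal approximation from the standard library; with only a handful of runs, a Student t critical value would give a wider and more appropriate interval. The function name is hypothetical.

```python
import math
import statistics

def mean_with_ci(samples, confidence=0.95):
    """Return (mean, half_width) of a confidence interval for the mean.

    Uses a normal approximation, so it needs at least two samples and is
    optimistic when the number of samples is small.
    """
    mean = statistics.mean(samples)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width
```

The returned pair maps directly onto a chart: plot the mean as the point or bar and the half-width as the error bar on each side.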
Course Recap (Until Next Week’s Midterm 2!)
Distributed, Web-scale systems are here to stay!
They create many issues that are not totally resolved, and for which there is no one answer:
  Heterogeneity
  Timing
  Partitioning and replication
  Consistency and integrity
  Etc.
This course tried to give you a sense of the issues and state-of-the-art – as well as the skills to go out and work in this domain
  I hope the amount of work we all sank into the material (and the homeworks) will pay off for you!
And stay tuned – there’s lots more to come!
  Sensor networks, semantic Web, mobile systems, location-based services, …