Through the Looking-Glass: the technology behind the UK ...

39
Through the Looking-Glass: the technology behind the UK Mirror Service http://www.mirrorservice.org/ Adam Sampson [email protected] University of Kent Through the Looking-Glass – p. 1

Transcript of Through the Looking-Glass: the technology behind the UK ...

Page 1: Through the Looking-Glass: the technology behind the UK ...

Through the Looking-Glass:the technology behind the UK

Mirror Servicehttp://www.mirrorservice.org/

Adam Sampson

[email protected]

University of Kent

Through the Looking-Glass – p. 1

Page 2: Through the Looking-Glass: the technology behind the UK ...

Introduction

Through the Looking-Glass – p. 2

Page 3: Through the Looking-Glass: the technology behind the UK ...

What is UKMS?

Fast local copies (mirrors) of popular Internet resourcesfor the UK academic community

Some approximate numbers for September 2004:

201 mirrored sites4 million files6 terabytes of disk space0.4 terabytes of data shipped per dayAverage bandwidth usage 42Mbit/sec (peaking at100Mbit/sec)120,000 distinct user IP addresses per month28% outgoing traffic is to UK academic users onJANET100% availability from 1999 to 2004

Through the Looking-Glass – p. 3

Page 4: Through the Looking-Glass: the technology behind the UK ...

What we carry

Open Source software (the vast majority)e.g. Debian GNU/Linux, OpenOffice.org

Academic sitese.g. Project Gutenberg, Duke Papyrus Archive

Some commercial softwaree.g. MATLAB, Netscape

Legacy content (HENSA/Micros, etc.)

Official mirror site for many mirrors

Through the Looking-Glass – p. 4

Page 5: Through the Looking-Glass: the technology behind the UK ...

History

1987 netlib mirror at UKC; access by email(18 years later, we still mirror netlib)

1992 UKC gets first Internet link;ISSC-funded HENSA/Unix at UKC, HENSA/Micros atLancaster(hensa.ac.uk)

1999 HENSA merges to become the JISC-funded UK MirrorService with staff at UKC and Lancaster(mirror.ac.uk)

2003 Added third UKMS site at Reading C-POP

2004 JISC contract expires;UK Mirror Service now operates from UKC ComputerScience(mirrorservice.org)

Through the Looking-Glass – p. 5

Page 6: Through the Looking-Glass: the technology behind the UK ...

Architecture

Through the Looking-Glass – p. 6

Page 7: Through the Looking-Glass: the technology behind the UK ...

From the outside

via FTP, HTTP or rsyncuser connections

connections to source sitesvia FTP, HTTP or rsync

connections to source sitesvia FTP, HTTP or rsync

internal network

frontend

frontend

frontend

frontend

backend

backend

disk

disk

disk

disk

disk

disk

SCSI

The UK Mirror Service

Through the Looking-Glass – p. 7

Page 8: Through the Looking-Glass: the technology behind the UK ...

Removing the lid

via FTP, HTTP or rsyncuser connections

connections to source sitesvia FTP, HTTP or rsync

connections to source sitesvia FTP, HTTP or rsync

internal network

frontend

frontend

frontend

frontend

backend

backend

disk

disk

disk

disk

disk

disk

SCSI

Through the Looking-Glass – p. 8

Page 9: Through the Looking-Glass: the technology behind the UK ...

Explaining the roles

Frontend and backend hosts

Users make FTP, HTTP or rsync connections to arandomly-selected frontend host (DNS round-robinentries)

Frontends act as smart caching proxies to reduce loadon backends and disks

Frontends fetch the data from the backends

Backends have disks attached, each with severalmirrors on

Backends periodically fetch data from source sites todisks

Through the Looking-Glass – p. 9

Page 10: Through the Looking-Glass: the technology behind the UK ...

Hardware at UKC

4× Sun V120s as frontends

A V120 as “slow” backend withdisk arrays:

a 400-gigabyte Sun 3300a 4-terabyte Transtec(cheap!)

A Sun E450 as “fast” backendwith Sun A1000 disk arrays:

2× 160-gigabyte2× 280-gigabyte2× 400-gigabyte

Through the Looking-Glass – p. 10

Page 11: Through the Looking-Glass: the technology behind the UK ...

A fortnight in the life

0

2e+06

4e+06

6e+06

8e+06

1e+07

1.2e+07

1.4e+07

11/0100:00

11/0300:00

11/0500:00

11/0700:00

11/0900:00

11/1100:00

11/1300:00

Byt

es/s

econ

d

UKMS backend traffic levels

compton eri0 outpalomar hme0 out

Through the Looking-Glass – p. 11

Page 12: Through the Looking-Glass: the technology behind the UK ...

Serving content

Through the Looking-Glass – p. 12

Page 13: Through the Looking-Glass: the technology behind the UK ...

Serving in general

Making a directory on a backend disk available to users

Must support a standard directory layout across allprotocols

Must transparently select the right backend for differentmirrors

Must minimise the backend load where possible

Must respect the source site’s presentation instructions

Through the Looking-Glass – p. 13

Page 14: Through the Looking-Glass: the technology behind the UK ...

Zooming in. . .

SCSI

ukms−rsync

ftp−server

Apache

remote−fsd

Apache

OS

disk

disk

disk

frontend backend

rsync

FTP

HTTP HTTP

remote−fs

Through the Looking-Glass – p. 14

Page 15: Through the Looking-Glass: the technology behind the UK ...

An aside: CFTP

A C++ library and a set of tools

Used by nearly every part of the UKMS software

The result of a late-90s UKC research project

Provides a virtual Unix-like filesystem tree withmountable filesystems

posix-fs connects to a real local directoryftp-fs connects to another FTP serverremote-fs connects to a directory on another host(using its own protocol)

Through the Looking-Glass – p. 15

Page 16: Through the Looking-Glass: the technology behind the UK ...

Serving content with FTP

ftp-server is a custom FTP server based on CFTP

Limit on the number of concurrent connections to eachfrontend (client should try next FE if full)

Can generate tar archives on the fly

The FTP server exports the CFTP virtual filesystem

Each mirror is mounted using remote-fs from theappropriate backend in the right place under /sites/

remote-fsd server runs on backends

Through the Looking-Glass – p. 16

Page 17: Through the Looking-Glass: the technology behind the UK ...

Serving content with rsync

ukms-rsync is a patched version of the stock rsync server

Uses libqfs, a C interface to the CFTP virtual filesystem(looks like Unix system calls)

Uses nearly the same config for CFTP as the FTPserver

Through the Looking-Glass – p. 17

Page 18: Through the Looking-Glass: the technology behind the UK ...

CFTP frontend caching

Initial approach used existing CFTP cache – not goodenough!

Final-year project implemented a better cachefilesystem for CFTP

Cooperative data and metadata caching between allFTP/rsync processes on a frontend

Tested on real logs – 30 gigabyte data cache reduceddata pulled from backend by 50%

Through the Looking-Glass – p. 18

Page 19: Through the Looking-Glass: the technology behind the UK ...

Serving content with HTTP

We use the Apache web server (twice)

Frontend Apache acts as a caching proxy, forwardingrequests on to the appropriate backends

Apache’s cache behaves poorly for large files, though,so we redirect requests for CD images to the FTPserver

Most virtual hosts handled directly by frontends

“Special” virtual hosts handled on backends using extraports

Doesn’t use CFTP yet, but we’re working on it

Through the Looking-Glass – p. 19

Page 20: Through the Looking-Glass: the technology behind the UK ...

The browser

Generates our web interface under /sites/

A (C!) CGI program that runs on backends

Supports browsing and extracting from archive files

Displays likely README files and mirror descriptions

Through the Looking-Glass – p. 20

Page 21: Through the Looking-Glass: the technology behind the UK ...

The search engine

Uses a separate machine and a huge PostgreSQLdatabase

Indexing scripts examine newly-mirrored data

Web frontend does queries against the database

Much more complex than it sounds – good searching isdifficult!

Through the Looking-Glass – p. 21

Page 22: Through the Looking-Glass: the technology behind the UK ...

Mirroring

Through the Looking-Glass – p. 22

Page 23: Through the Looking-Glass: the technology behind the UK ...

Mirroring in general

Making a local directory look like one on the source site

May need to exclude some content, or add extra content

Need to cope with source site being down

Need to update search engine

Need to copy mirrored content to other UKMS sites

Through the Looking-Glass – p. 23

Page 24: Through the Looking-Glass: the technology behind the UK ...

FTP mirroring with syncfs

syncfs is a general mirroring tool based on CFTP

Uses ls-lR files if available on source site

Can use multiple connections

Can resume partial downloads

Updates mirroring status files (hidden from users)

Lots of special behaviour to deal with broken FTPservers

Through the Looking-Glass – p. 24

Page 25: Through the Looking-Glass: the technology behind the UK ...

Peer mirroring

syncfs mirroring technique using two UKMS sites (UKCand Lancs)

Both UKMS sites connect to source site

One mirrors in alphabetical order, the other in reversealphabetical order

Each checks the other UKMS site to see whether it’sgot the file they want before going to the source site

On average, each pulls half the content from the sourcesite

Works when one UKMS site is down too!

Through the Looking-Glass – p. 25

Page 26: Through the Looking-Glass: the technology behind the UK ...

Web mirroring

Actually a general term for anything that’s not FTP

Wrapper around off-the-shelf toolsrsync for rsync mirroringpavuk for HTTP mirroringtucopy for Tucows mirrors

Has workarounds for broken toolsDetect empty files and remirrorFix up bad links in HTMLDetect “stuck” mirroring processes and restart

Push mirroring via a modified writable rsync server

Through the Looking-Glass – p. 26

Page 27: Through the Looking-Glass: the technology behind the UK ...

Tying it all together

Through the Looking-Glass – p. 27

Page 28: Through the Looking-Glass: the technology behind the UK ...

Metaconf

Configuring all our software by hand would beimpractical

Metaconf takes descriptions of mirrors in a standardformat (MDF)

Provides a uniform interface that scripts can use to getat the descriptions

Generates per-site configuration files for software thatcan’t use the Perl interface

Copes with remapping disks when faults occur

Through the Looking-Glass – p. 28

Page 29: Through the Looking-Glass: the technology behind the UK ...

MDF

An application of RDF (yes, the Semantic Web isuseful!)

An MDF file describes a single mirror:The name, description and classification of the mirrorThe logical disk upon which it’s storedThe host that performs mirroringHow and when to mirror it (including any specialoptions)How it can be accessed (protocols, virtual hosts)

Through the Looking-Glass – p. 29

Page 30: Through the Looking-Glass: the technology behind the UK ...

What we’ve learnt

Through the Looking-Glass – p. 30

Page 31: Through the Looking-Glass: the technology behind the UK ...

. . . about mirroring

Source sites are usually brokenThe most important source sites are alwaysoverloadedWith FTP listings, anything that can go wrong will

. . . but FTP is still the best protocol for mirroringHTTP mirroring is basically guessworkrsync works badly for very large mirrors

Off-the-shelf mirroring software is unreliable

Source site maintainers don’t have the time to listen tomirror maintainers

Through the Looking-Glass – p. 31

Page 32: Through the Looking-Glass: the technology behind the UK ...

. . . about serving

Client software is usually broken“Download accelerators”Incorrect HTTP Redirect handlingLarge file handling (Fedora DVDs)

Malicious users exist

Overzealous indexing bots also exist

Frontend caching is a really good idea

Through the Looking-Glass – p. 32

Page 33: Through the Looking-Glass: the technology behind the UK ...

. . . about protocol design

FTP has some serious problemsPoor takeup of standardised directory listingsThe whole active/passive mess – NAT/firewallunfriendly

HTTP has some serious problemsNo directory listings (WebDAV isn’t there yet)It’s not good for mirroring

rsync has some serious problemsHuge latency when transferring initial treeRandom failures during transfersBackwards compatibility makes code messy

All protocols suck!

Through the Looking-Glass – p. 33

Page 34: Through the Looking-Glass: the technology behind the UK ...

. . . about software design

Don’t! Big design up front doesn’t work for us

Our most reliable systems are those that have evolvedslowly

Clear code is easier to modify than a well-documentedmess

Use the right language for the job

Simple file formats are best

Being self-healing is useful

Version control software is invaluable

Through the Looking-Glass – p. 34

Page 35: Through the Looking-Glass: the technology behind the UK ...

Future plans

Through the Looking-Glass – p. 35

Page 36: Through the Looking-Glass: the technology behind the UK ...

Making our software Open Source

UKC owns copyright on most of the code

Need to tidy up build systems and package for release

Patches to existing software (rsync) are easier

Package as “Mirror Service In A Box” for others to use

Through the Looking-Glass – p. 36

Page 37: Through the Looking-Glass: the technology behind the UK ...

Reducing cost

Sun kit is reliable but (very!) expensive

PCs and SATA disks are cheap (free in some cases)and fast

Our plan is:6 PCs, each with 4 300-gigabyte SATA disksMirrored pairs of machines for redundancyEach machine is both a frontend and a backend

We aren’t the only people who’ve noticed this – Googleand archive.org take the same approach on a muchlarger scale!

Through the Looking-Glass – p. 37

Page 38: Through the Looking-Glass: the technology behind the UK ...

The End

Through the Looking-Glass – p. 38

Page 39: Through the Looking-Glass: the technology behind the UK ...

Any questions?

Find us at:http://www.mirrorservice.org/

Through the Looking-Glass – p. 39