Post on 21-Dec-2015
Anthropological Informatics
Reality Measures
or Reality bytes
Measurement and Perception
“Take away number in all things and all things perish. Take calculation from the world and all is enveloped in dark ignorance, nor can he who does not know the way to reckon be distinguished from the rest of the animals.” St. Isidore of Seville
“And still they come, new from those nations to which the study of that which can be weighed and measured is a consuming love.” W.H. Auden
Causality
“In causal terms the presence of oxygen is a necessary but not a sufficient condition for fire. Oxygen plus combustibles plus the striking of a match would illustrate a sufficient condition for fire” William L. Reese
A Necessary and Sufficient Condition
• Oxygen
• Combustibles
• Matches
Visualization: The Match?
“Science and technology have advanced in more than direct ratio to the ability of men to contrive methods by which phenomena which otherwise could be known only through the senses of touch, hearing, taste, and smell have been brought within the range of visual recognition and measurement and thus become subject to that logical symbolization without which rational thought and analysis are impossible.” William N. Ivins
Mentalité
“One of the fundamental traits of the mind of the declining middle ages is the predominance of the sense of sight, a predominance which is closely connected with the atrophy of thought. Thought takes the form of visual images. Really to impress the mind a concept has first to take the visible shape.” Johan Huizinga
Dissonance
• Modern: we feel that quantities are set and transactions are fair and equivalent
• Present : Past : with inspection, vagaries and unfairness
• In Roger Bacon’s 13th century, quanta differed from region to region and transaction to transaction
• A bushel of oats was neither more nor less than as many oats as a bushel basket contained, but a bushel for the lord would be heaped and a bushel for the peasant was no more than level with the rim (the differential was not cheating but a proper negotiation)
Greek metrological relief
Greek multiplication wax tablet
Conical sundial with hours in Greek letters
Egyptian measuring gold rings against a bull’s head weight
Egyptian alabaster vase with volume marked as 8 1/2 hennu
Roman measuring tools
Facsimile of the Peutinger Table, a copy of a Roman road map; Rome is at the center
Roman milestone
Ptolemy’s “Geography”
Changes in Vision
• A shift to the visual in the Middle Ages was the match that ignited the flame of quantification
• Change was marked in several main fields of human exertion:
- LITERACY
- MUSIC
- PAINTING
- BOOKKEEPING
Literacy
• There was a shift in conduits of authority from the ear to the eye
• In the 14th century, a new cursive script with word separation and punctuation was devised for easier writing and reading
• Reading became swift and silent
• Literacy spread to classes beneath poets and philosophers: composers, painters and bookkeepers
Music
• Renaissance Europeans considered music to be an emanation of the basic structure of reality (harmony guided the heavens)
• Gregorian chants were performed from memory
• By c. 10th century, the accumulation of chants exceeded apprentices’ abilities to memorize
• Monks developed a system of “neumes,” or signs, to indicate highs and lows without a musical staff
• The musical staff was standardized by Guido of Arezzo, an 11th-century Benedictine choirmaster
• Ut … re … mi … fa … sol … la … cut the training of a good singer from 10 years to 1 year
Quadrivium
• 4 of the liberal arts considered essential for a solid education
• Arithmetic
• Geometry
• Astronomy
• Music
Music and science: Galileo, Descartes, Kepler and Huyghens were all accomplished musicians and published on measurement in musical subjects
Painting
• Medieval artists were more concerned with the rank of their subjects than with the faces of individuals (size = importance; space was to be filled by altering perspectives)
• In the 14th century, geometry begins to guide compositions (scenes were to be viewed by an observer at a single point in time; perspective was adhered to)
Bookkeeping
“We shall ever give ground to honor. It will stand to us like a public accountant, just, practical, and prudent in measuring, weighing, considering, evaluating, and assessing, everything we do, achieve, think and desire.” Leon Battista Alberti (1440)
“Inasmuch as all things in the world have been made with a certain order, in like manner they must be managed … of the greatest importance, such as the business of merchants, which … is ordered for the preservation of the human race.” Benedetto de Cotrugli (15th c.)
The merchant struggling to make sense of his books was a theme
• Blizzards of transactions, scrambled by
- Bills of exchange
- Promissory notes
- Credit practices
• Axiom: production preceded delivery
• Reality: payments could precede delivery or production
• Payments were undulatory, with currencies and bills of exchange billowing and plunging in value in relation to one another
RECORDS … My god, we need records, or what will we know?
• By the end of the 14th c. Hindu-Arabic numerals were beginning to appear in merchants’ account books
• Double-entry accounting systems were developed (ingoing and outgoing values; plus and minus); great improvement over narrative accounts
• By the 15th century, an accounting lexicon and guides to practice were being developed
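The double-entry idea above (every transaction recorded twice, as a plus in one account and a minus in another) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Ledger class, the account names and the amounts are hypothetical, and the sketch does not reconstruct any historical bookkeeping practice.

```python
from collections import defaultdict


class Ledger:
    """Hypothetical minimal double-entry ledger (illustration only)."""

    def __init__(self):
        self.balances = defaultdict(int)  # account name -> signed balance
        self.journal = []                 # chronological list of postings

    def post(self, debit_account, credit_account, amount, memo=""):
        """Record one transaction twice: plus on one side, minus on the other."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.journal.append((debit_account, credit_account, amount, memo))
        self.balances[debit_account] += amount
        self.balances[credit_account] -= amount

    def is_balanced(self):
        """In a consistent set of books the signed balances sum to zero."""
        return sum(self.balances.values()) == 0


ledger = Ledger()
ledger.post("inventory", "cash", 120, "wool bought")   # made-up amounts
ledger.post("cash", "sales", 150, "cloth sold")
print(dict(ledger.balances), ledger.is_balanced())
```

The gain over a narrative account is the built-in check: if the signed balances do not sum to zero, an entry is missing or wrong.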
Visions and Models
“I often say that when you can measure what you are speaking about and express it in numbers you know something about it; but when you cannot measure it, and when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.” William Thomson, Lord Kelvin (1891)
Our Information Age:
• All information is incomplete. There is always more to know, always another way to reframe what is already known. Our leaders must make important decisions on the basis of incomplete information
• Information does not narrow the range of choices; it widens it. Further information is likely to make any decision-making process more meaningful and effective. It may not make the decision easier.
• Information is always subject to multiple interpretations and constructions. “Data” is nothing until it is given meaning and assembled in a narrative.
• Information comes in many forms: data, stories, myths, visual images, and meta-theories. Information theorists do not regard data as information at all. It is potential information. Information is data endowed with relevance and purpose.
• Data are undigested facts.
• Information is facts organized for you by someone else but not yet absorbed into your own thinking.
• Knowledge is information that you have internalized.
• Different people speak different information languages even when they are speaking the same language.
• Information leaks. In our information society nobody keeps secrets. There is an erosion of confidentiality that accompanies the inundation in information through media.
• Information once distributed is almost impossible to destroy. Information has its own survival skills.
Information Production
• about 10 exabytes
• 90% digital
• 55% personal
• print .003% of bytes
• email is 4 PB/y
• www is about 50 TB
• growth at 50%/y
Gray and Szalay 2003
The First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50 X 24” disks
• 1200 rpm
• 100 ms access
• $35K/y rent
• Included computer and accounting software
10 Years Later
30 MB
Cost of Storage
Storage Capacity Outstrips Moore’s Law
• Improvements
Capacity 60%/y
Bandwidth 40%/y
Access time 16%/y
• $1000/TB today
• $100/TB in 2007
Moore’s Law: 58.7%/y
TB growth: 112.3%/y
Price decline: 50.7%/y
Moore’s Law
• Performance/price doubles every 18 months
• 100 X per decade
• Progress in next 18 months will outstrip all previous progress (new storage sums all previous storage and new processing will outstrip all old processing)
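As a quick check of the arithmetic above, the sketch below converts an 18-month doubling time into a per-decade factor and an annual rate. The function name growth_factor is an assumption made for illustration; the only input taken from the slides is the 18-month doubling period.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Factor by which performance/price grows over `months`,
    assuming it doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


decade = growth_factor(120)           # 2 ** (120 / 18)
annual_rate = growth_factor(12) - 1   # fractional growth per year

print(f"per decade: about {decade:.0f}x")
print(f"per year:   about {annual_rate:.1%}")
```

Under this assumption a decade gives roughly 100 X and a year gives roughly 58.7%, which agrees with the figures quoted for Moore’s Law on these slides.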
Rules of Thumb for Data Engineering
• Moore’s Law: an address bit per 18 months
• Storage grows 100 X/decade (1000X in last decade!)
• Disk data of 10 years ago now fits in RAM
• Device bandwidth grows 10X/decade (need for parallelism)
• RAM:disk:tape price is 1:10:30 and will go to 1:10:10
• Gilder’s Law: aggregate bandwidth 2X/8 months
• Web Rule: cache everything
Filling A Terabyte In A Year
Item                          Items/TB   Items/day
300 KB JPEG                   3 M        9,800
1 MB Doc                      1 M        2,900
1 hour 256 kb/s MP3 audio     9 K        26
1 hour 1.5 Mb/s MPEG video    290        0.8
Gray and Szalay 2003
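The arithmetic behind a table like this can be sketched as follows. The unit conventions are assumptions (1 TB = 2^40 bytes, 1 KB = 1024 bytes, 1 kb/s = 1000 bits per second, a 365-day year), so the results only approximate the slide’s rounded figures. The video row is left out because the slide does not state enough about the encoding to reproduce it.

```python
TB = 2 ** 40          # assumed binary terabyte, in bytes
DAYS_PER_YEAR = 365


def items_per_tb(item_bytes: float) -> float:
    """How many items of the given size fit in one terabyte."""
    return TB / item_bytes


def items_per_day(item_bytes: float) -> float:
    """Items that must be stored each day to fill one terabyte in a year."""
    return items_per_tb(item_bytes) / DAYS_PER_YEAR


# Item sizes in bytes, under the assumptions stated above.
sizes = {
    "300 KB JPEG": 300 * 1024,
    "1 MB Doc": 2 ** 20,
    "1 hour 256 kb/s MP3 audio": 256_000 / 8 * 3600,
}

for name, size in sizes.items():
    print(f"{name:28s} {items_per_tb(size):12,.0f} {items_per_day(size):10,.1f}")
```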
Schematized Storage
• File metaphor too primitive: just a “blob”
• Table metaphor too primitive: just “records”
• Need metadata describing data context
– Format
– Provenance (author, publisher, citations)
– Rights
– History
– Related documents
• In a standard format
• XML and XML schema
• Data Set is a great example
• World is defining standard schema
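A minimal sketch of what such a metadata record might look like, built with Python’s standard xml.etree.ElementTree module. The element names (metadata, format, provenance, rights, history, related) simply mirror the list above and are hypothetical, not a published schema; all values are placeholders.

```python
import xml.etree.ElementTree as ET


def build_metadata(fmt, author, publisher, rights, history, related):
    """Assemble a hypothetical metadata record describing a data set's context."""
    root = ET.Element("metadata")
    ET.SubElement(root, "format").text = fmt
    provenance = ET.SubElement(root, "provenance")
    ET.SubElement(provenance, "author").text = author
    ET.SubElement(provenance, "publisher").text = publisher
    ET.SubElement(root, "rights").text = rights
    hist = ET.SubElement(root, "history")
    for event in history:
        ET.SubElement(hist, "event").text = event
    rel = ET.SubElement(root, "related")
    for doc in related:
        ET.SubElement(rel, "document").text = doc
    return root


# Placeholder values, for illustration only.
record = build_metadata(
    fmt="text/csv",
    author="A. Example",
    publisher="Example Archive",
    rights="open access",
    history=["collected", "cleaned"],
    related=["field-report-001"],
)
xml_text = ET.tostring(record, encoding="unicode")
parsed = ET.fromstring(xml_text)  # round trip: the record is self-describing
print(xml_text)
```

Unlike a bare “blob” or a row of “records,” the context travels with the data, and any XML-aware tool can read it back.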
Keys for Storage
• Schematized storage can help organization and research
• Schematized XML data sets are a universal way to exchange data
• Data are objects, and so, need standard representation for classes and methods
Access Variable and Increasing
Stages in Science
• Observational Science
- Scientist gathers data by direct observation
- Scientist analyzes data
• Analytical Science
- Scientist builds analytical model
- Makes predictions
• Computational Science
- Simulate analytical model
- Validate model and make predictions
• Data Exploration Science
- Data captured by instruments or generated by simulator
- Processed by software
- Placed in a database as files
- Scientist analyzes database files
Data Avalanche
• Better observational instruments and better simulations are producing an avalanche of data
Discoveries Booming
• Conceptual and theoretical discoveries (relativity, quantum mechanics) may be inspired by observations
• Phenomenological discoveries (dark matter, obscured universe) are made through advances in empirical rigor; they inspire theories and are motivated by them
Discovery Cycle
• New technical capabilities
• Observational discoveries
• Advances in theory
• Application of new theories
Phenomenological discoveries: exploring parameter space; making new connections
Maxim: understanding complex phenomena requires complex, information rich data and simulations
How to Keep Up
• We are looking for “needles in haystacks” (the Higgs particle in dark matter)
• Needles are easier than haystacks
• Global statistics have poor scaling
• As data and computers grow at the same rate, we can only keep up with N log N
• Discard notion of optimal: data are fuzzy and solutions are approximations
• Require combination of statistics and computer science
Analysis of Databases
• Create uniform samples
• Filter data
• Assemble subsets
• Estimate completeness
• Censor bad data
• Count and build histograms
• Generate Monte Carlo subsets
• Perform likelihood calculations
• Test hypotheses
These tasks are best done inside databases (“bring Mohamed to the mountain”)
Go for Smart Data
• Too much data to move around, so take analysis to the data
• Do all data manipulations inside the database (build custom procedures and functions in the database)
• Guaranteed automatic parallelism
• Easy to build key custom functionality (pixel processing, temporal and spatial indexing, unified databases and procedures)
• Easy to reorganize data (multiple views make optimal analyses)
• Scalable to Petabyte data sets
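A small sketch of “taking the analysis to the data,” using Python’s built-in sqlite3 module as a stand-in for a large scientific database. The table, column names and values are invented for illustration. A custom function is registered inside the database, so filtering, censoring and histogram counting all happen in one query and only the small result leaves the database. SQLite does not demonstrate the parallelism or petabyte scaling claimed above; the point is only where the computation runs.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id INTEGER PRIMARY KEY, flux REAL, quality INTEGER)"
)
# Made-up rows: quality = 0 marks bad data to be censored.
rows = [(1, 1.5, 1), (2, 12.0, 1), (3, 95.0, 1),
        (4, 150.0, 0), (5, 8.0, 1), (6, 40.0, 1)]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)


def log_bin(flux):
    """Custom function run inside the database: decade bin of a flux value."""
    return int(math.floor(math.log10(flux)))


conn.create_function("log_bin", 1, log_bin)

# Filter, censor and histogram in a single query executed by the database.
histogram = conn.execute(
    "SELECT log_bin(flux) AS bin, COUNT(*) "
    "FROM objects WHERE quality = 1 "
    "GROUP BY bin ORDER BY bin"
).fetchall()
print(histogram)
```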
Data Mining Images
We can discover new types of phenomena using automated pattern recognition and multiscale analyses
Optimal Statistics
• Statistics algorithms scale poorly
• Even if data and computers grow at same rate, computers can do at most N log N algorithms
• Solutions:
- assume infinite computational resources
- assume only source of error is statistical
- there is a finite sample size
Solutions will require combinations of statistics and CS
New algorithms will not be worse than N log N
Make Clever Data Structures
• Use of tree structures
• Fast, approximate algorithms
• Must account for computation costs
scale level of accuracy
shoot for “best” results given …
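To make the scaling point concrete, the sketch below counts pairs of values lying within a distance r of each other in two ways: a brute-force comparison of all pairs (about N² operations) and a sorted index searched by bisection (about N log N). The sorted list is a simple stand-in for the tree structures mentioned above. The function names and the random test data are assumptions for illustration, and the example is one-dimensional, whereas real catalogs would need multidimensional trees.

```python
import bisect
import random


def pairs_within_brute(xs, r):
    """Count pairs (i < j) with |xs[i] - xs[j]| <= r by checking every pair."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )


def pairs_within_sorted(xs, r):
    """Same count using one sort plus one binary search per element."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last element <= x + r; elements after
        # position i and before hi are the partners of s[i].
        hi = bisect.bisect_right(s, x + r)
        total += hi - i - 1
    return total


rng = random.Random(0)                      # fixed seed, made-up data
xs = [rng.randrange(1000) for _ in range(200)]
print(pairs_within_brute(xs, 5), pairs_within_sorted(xs, 5))
```

Both functions return the same count; only the cost differs, and the gap widens as N grows.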
Hyperdimensionality
• Explore parameter spaces in catalog domains through
– Clustering analysis (different types and outliers)
– Multivariate correlations (find significant, nontrivial correlations in the data)
Visualization becomes the key; include interactive visualization and data mining processes
Publishing Data
• expectations and standards must change
• there will be exponential growth
• projects must become more responsible
Archaeological Informatics
Organizing Piles of Articulated and Disarticulated Information
“Great Chain of Being”
• Stewart (1997) summarized the course of archaeological information moving to information as the GCB: moving from logical stages in data collection, to data management, to data analysis, and to variable modes of dissemination
• Use of Information Technology (IT) was to be seen as a multistranded web rather than as a linear feature on the computing landscape
Archaeological IT
• Quantitative methods
• Statistics and classification
• Archaeometry
• Visualization (imaging, CAD, multimedia and virtual reality)
• Expert systems
• Artificial intelligence
• GIS
All require
• Digital archives
• Databases
Databases
• Term supplanted “databanks” in the 1980s
• Concept linked to increased availability of microcomputers
• Emphasis accompanies shift to industry standard software
• Enhancement is a profound goal of government organizations as they move toward encompassing strategies for digital data management
Access to Data
• Has emerged as the primary hot button of the 21st century
• Digital archives are being built but data languishes, unsorted and unavailable
• The backlog of information is huge and daunting
• Technological fixes are available but implementation is a social problem
Techno Science
• Use of electronic media to enhance scientific communication is a huge shift in the conduct of basic science
• Scientists want pure access to information
• Potential for cross-disciplinary and international collaborations is booming
• Keys are building adequate metadata, migrating data, and controlling access to information
There are Risks
• We cannot allow transformation of scientific communication to occur in a pure laissez-faire environment
• We cannot assume that everyone will catch on to using e-media structures
• We cannot assume that various e-media initiatives represent a period of problem-solving
What’s Out There?
• Run-away agendas and competing proprietary interests that will seek to retard powerful e-venues
• Huge amounts of money and resources are being committed by government agencies, private firms and organizations, by academics, by publishers, by professional societies, and individual researchers for development, maintenance and promotion of all sorts of competing e-media and for proprietary e-markets
Practical Problems
• Scientists and policy-makers do not have accepted theory for shaping IT
• Producers and users work within context-free models
• Work consists of ongoing prototyping and fledgling projects with high promise and withered funding
• The result: wasted funding, and orphaned data left in marginal, decaying, dead systems and formats
Responses: E-com reform
• Extends across all e-media
• Spokesmen include Paul Ginsparg and Stevan Harnad
• Harnad (editor of Psycoloquy; originator of “scholarly skywriting”) urges decentralized scholarly publishing, peer-reviewed or not
• Ginsparg is developer of the Los Alamos National Labs Physics E-Print Server, working papers for high-energy physicists
• Future: move away from hard-copy journals and archives in all forms, centralized and decentralized
Reform Ideology
• E-media is better than traditional media
• E-communication will be less expensive
• Access to e-media will be easier and wider
• Systematic use of e-media will dramatically speed up scientific communication
Subversive Actions
• Editors of Electronic Transactions on Artificial Intelligence (ETAI) have created a completely open article review process
• Phase I: article is open to the public online for 3 months
• Phase II: after author response, the article is reviewed for acceptance using confidential peer review and journal level quality criteria
• The Journal of Artificial Intelligence Research (JAIR) uses online appendices and discussions of published articles
• JAIR is distributed without charge on the Internet
Social Designing
• Electronic access to resources that include primary data
• High speed of work- and results-sharing
• Selection of target audiences for research
• Allocation of proper credit for work performed
• Allocation of professional status based on quality of data design and data sharing
Market Forces
• Industrial and corporate support for research creates authoritative, owner-driven sanctions on information dissemination
• These distribution systems are opaque, hidden behind secure doors
• Data release is carefully controlled, if allowed, and timing is completely geared to corporate advantage and profit-making
• Two poles: open access (transparent) and controlled access (opaque)
“Boom and Bust Cycles”
• “Worm Community System” for molecular biologists proved too complicated and costly for most users
• WCS was recast as A C. elegans DataBase (ACEDB), which has found greater acceptance
• Many biologists invested in the “Genome Database” only to see financial support withdrawn
• The “Archaeological Data Archive Project,” much celebrated, is now dead for lack of clientele
Liberating Archaeological Data
• Perring and Vince (1999) set out a guide for bringing complex archaeological data out to view
• They cite Hodder (1998) on the impact of the Internet in organization of archaeological knowledge, with a shift from hierarchical structures to network flows
• The veil: many archaeologists, working under Federal and State mandates, remain outside any long term concern with data handling
• Data liberation runs afoul of insistence on fossilized traditional research practice, fueled by resource management contracts
Need for Re-thinking
• Archaeological classification practices will need to emphasize optimal structures for organization of archaeological data in an electronic environment
• Interpretive structures must admit variable ways of grouping data
• Higher order groupings (typologies) will have to be supplemented by alternative analytical groupings (material classes, deposition classes)
• Data structures will have to be flexible and analytical
New Structures Must Recover Links
• Traditional databases (TDs) have disparate or unlinked compendiums (fields with specimen measurements but no link to “grey literature” reports)
• TDs typically are arranged to follow a rigid linear structure based on chronological groupings dictated by field recovery records and publishing
• This produces intractable data sets: important data remain unavailable because reclamation costs are so high, specialist data are not integrated with the overall data structure, and there is little potential for future synthesis
New Methods
• Proviso: we cannot enter new data as old structures into new IT (HTML, interrelational databases, and GIS) and expect working databases
• The theory-driven structure of the data must be revised
SAA 2000 position paper
• “Digital Data: Preservation and Re-Use” promoted ideas on improvements
• Robinson’s “Digital Archiving Pilot Project for Excavation Records” (DAPPER) reviewed projects’ data handling
• A central concern was the user interface, whether it should be designed for aesthetics or for clean access to data
• Argued for data preservation in standard formats as proposed by the UK Archaeology Data Service
Cost Measures
• Digital archiving of Eynsham Abbey collections cost 1.2% of excavation and post-excavation budgets
• Digital archiving of the Royal Opera House collections cost .1% of the total project cost
• Upshot: CAD archives, arranged as separate files, are more cost-efficient for non-specialist venues, while GIS is the more powerful research tool but requires specialist training
Levels of Digital Archives
• Index level archive: index record for ADS catalog and summary document; no further work expected
• Assessment level archive: index record, project design, assessment report, specialist level databases, and site matrix
• Research level archive: above, with analytical results and publications
• Integrated archive: above, with records of ongoing scholarship, linking text files with other data records
Concerns
• Must ensure reuse of data: Eiteljorg emphasizes need for user training in CAD, GIS and database software
• Data translations are tricky: any relationships within software must be identified (data segments in CAD layers or DBF relations and links)
• Assessments of:
– Systematic collection methodology
– Record of data corrections
Metadata
• Data about data, providing information essential to data use and reuse
• Can refer to agreed upon sets of fields and associated lexicons
• Can consist of detailed descriptions of measurement systems and rules for their application
• Data users need metadata to make intelligent decisions in selecting, using, adding to, or translating databases
Increasing Number of Standards
• MARC, Machine Readable Catalog, library cataloging
• Text Encoding Initiative (TEI), standard descriptions of machine readable text
• Directory Interchange Format (DIF), metadata for satellite imagery
• U.S. National Spatial Data Infrastructure (NSDI), complex descriptions of spatial data
Dublin Core
• Seeks to supply metadata descriptions between crude metadata of search engines and complex systems developed for MARC and the Federal Geographic Data Committee
• Can describe resources on the Internet and can be embedded in some file types (HTML and various PostScript files)
• DC is extended as separate frameworks as in the Warwick Framework (descriptions can be stored as DIF or FGDC, or as simple extensions of the 13 DC elements)
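As an illustration of embedding a Dublin Core description in an HTML file, the sketch below renders a few elements as meta tags. The element names used (title, creator, date, format) are Dublin Core elements, and the "DC." name prefix follows a common convention for HTML embedding. The helper name dc_meta_tags and the record values are hypothetical placeholders.

```python
from html import escape


def dc_meta_tags(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags."""
    return [
        f'<meta name="DC.{name}" content="{escape(value, quote=True)}">'
        for name, value in record.items()
    ]


# Placeholder description of an imaginary resource.
record = {
    "title": "Example excavation archive",
    "creator": "A. Example",
    "date": "1999",
    "format": "text/html",
}
tags = dc_meta_tags(record)
print("\n".join(tags))
```

A description this small sits between what a search engine guesses from the page and what a full MARC or FGDC record would carry.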
Metadata and Databases
• Metadata should act to improve or restrict access to data
• Facilitate sharing and interoperability
• Characterize and index data
Data Models
• Data are a model of the real world
• The description is arbitrary and biased
• Data models incorporate different data views
• Key issues: verification, validation and certification of data quality
• Measures: objective correctness (accuracy and consistency) and appropriateness defined by intended purpose
• Required: all data must be augmented with metadata to record information needed to assess data quality, record results of assessments, and support process control
Measures for Data Quality
• Adequate description and meaning
• Specification of intended use and range of purposes and constraints
• Requirements for access and use
• Description and rationale for structure and design
• Global relationships to other databases
• Updated cycle information
Data Deterioration
• Limited media life
• Rapid obsolescence of software and hardware
• Use of graphics, hypertext and linked structures only accelerates decay rates
• Data files will become increasingly dependent on specific software for continued interpretation
• Record keeping paradigms are essential (compression is not an option; annotated metadata must remain transparent)
Reality
• Archaeological data and information are growing exponentially
• New data paradigms must be created
• Effects on theory and method will be extreme
• Effects on the culture of the discipline will prompt profound dislocations