Data Management and Linguistic Analysis:
MDS applied to RODA
Sheila M. Embleton, Dorin Uritescu & Eric S. Wheeler
York University, Toronto, Canada
Order of Presentation
Context
Romanian and RODA
RODA as Linguistic Technology
Examples
• Latin word-final /u/
• Non-palatalized dentals before front vowels
MDS
MDS as an analytic tool
MDS and Romanian dialects
Noul Atlas lingvistic român. Crişana
Crişana region in north-west Romania
Hard copy atlas by Stan and Uritescu (1996, 2003)
Digitize to make it more accessible
Objective
Use Information Technology to permit a broad range of scholars to access the data, select the data appropriately, and present the data clearly; and so gain greater understanding of its significance.
State of the Project (Nov 2007)
Have entered all 407 maps from Vol. I and II
Twice proof-read
Consulted source slips, when needed
Have developed search and mapping tools to access the digital data
Initial version now posted at: http://vpacademic.yorku.ca/romanian
The technology allows one to:
View the data
Search for data and count it
Interpret the data or the counts
Analyze the data (e.g. MDS)
See the results as maps
Save the maps as .jpg pictures
Save the results for later use
Hear samples of the data
RODA: function
Custom-defined maps
• You select the data
• You see the result as a map
Programmable access to the whole set of digitized data
• You ask about data spread over many maps
• You can customize what you search for (not just the editor’s choice)
RODA: search of data
Context of search becomes important
• Word-final vs non-final vs either
• Plain character vs accented character
• Character vs (superposed) alternate
Choice of fields to search
• E.g. with nouns: sg. vs pl. entries
• Variations heard by field workers
• Flags to mark special situations (e.g. hesitation)
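The search contexts listed above (position in the word, plain vs accented character) can be illustrated with a small sketch. This is not RODA's own search code, which the slides do not show; the function names `strip_accents` and `matches` and the `position` and `ignore_accents` parameters are hypothetical, chosen only for illustration.

```python
import unicodedata


def strip_accents(text):
    """Reduce accented characters to their plain base character."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(form, char, position="either", ignore_accents=False):
    """Return True if the single character `char` occurs in `form`.

    position: "final", "non-final" or "either" (hypothetical option names).
    ignore_accents: if True, compare plain characters only.
    """
    if ignore_accents:
        form, char = strip_accents(form), strip_accents(char)
    if position == "final":
        return form.endswith(char)
    if position == "non-final":
        return char in form[:-1]
    return char in form


# Illustrative forms taken from the word-final /u/ example in these slides.
final_u = [f for f in ["cânt", "cântu", "ochi", "ochiu"] if matches(f, "u", "final")]
```

Searching for plain "a" finds nothing in "cântu" unless `ignore_accents=True` is set, which is the plain vs accented distinction the slide refers to.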
Word-final /u/ from Latin

Latin             Romanian (standard     Dialectal variation   Dialectal variation
                  and most dialects)     (vowel present)       (non-syllabic)
canto ‘I sing’    cânt                   cântu                 cântu
oculum ‘eye’      ochi                   ochiu                 ochiu
Is word-final /u/ random?
Look for a geographic pattern over all potential occurrences.
The maps for single examples, such as /ochi/ and others, are in the hard-copy dialect atlas, but the total data for all examples is spread widely over many maps.
Word-final /u/
Data from:
• 407 maps
• Field 1
Size of cross shows the number of occurrences
Horizontal = syllabic
Vertical = non-syllabic
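The counts behind such a map can be sketched as follows. The records, the location labels "A" and "B", and the function name `cross_sizes` are invented for illustration and are not RODA data or RODA code; the sketch only shows how a per-location pair of counts (horizontal arm = syllabic, vertical arm = non-syllabic) could be tallied.

```python
from collections import Counter

# Hypothetical toy records, NOT atlas data:
# (location label, transcribed form, whether the final /u/ is syllabic)
records = [
    ("A", "cântu", True),
    ("A", "ochiu", False),
    ("B", "ochiu", True),
    ("B", "cânt", None),   # no final /u/ in this response
    ("A", "ochiu", True),
]


def cross_sizes(records):
    """Return {location: (syllabic count, non-syllabic count)}."""
    syllabic, non_syllabic = Counter(), Counter()
    for location, form, is_syllabic in records:
        if not form.endswith("u"):
            continue  # only word-final /u/ is counted
        if is_syllabic:
            syllabic[location] += 1
        else:
            non_syllabic[location] += 1
    locations = sorted(set(syllabic) | set(non_syllabic))
    return {loc: (syllabic[loc], non_syllabic[loc]) for loc in locations}


sizes = cross_sizes(records)
```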
Syllabic and non-syllabic /u/
Data from:
• Selected maps
• Field 1
• Word-final or non-word-final
Size of cross shows the number of occurrences
Horizontal = syllabic
Vertical = non-syllabic
Word-final, syllabic /u/
Data from:
• 407 maps
• Field 1
• word-final only
• (horizontal = vertical)
Locations 137, 141, 146 show most examples
Word-final, syllabic /u/
Data from:
• selected maps
• Field 1
• word-final only
• removed non-vocalic /u/, def. art., some clusters + /u/
• (horizontal = vertical)
Locations 137, 141, 146 show most examples
/u/ Pattern
There is a pattern:
Word-final /u/ is retained in central and north-eastern areas
It is syllabic mostly in parts of the central area
The locations with most frequent syllabic final /u/ do not form a continuous area
Dialect sub-regions
Some locations have a given feature; others do not.
On the basis of such (sometimes limited) examples, linguists posit the existence of dialect sub-regions.
MDS analysis of “all” data raises questions about the nature of these sub-regions.
Non-palatalized dentals before front vowels
Crişana: dentals before front vowels are palatalized.
Are they restructured as palatals?
If the process is no longer productive, there may be non-palatalized dentals before front vowels.
If so, where, in what forms, and with what frequency?
Non-palatalized dentals before front vowels
• Examples everywhere.
• (As is well known, dentals are not palatalized in Oaş, except for 220.)
• Map shows where and how many examples.
Non-palatalized dentals before front vowels
There are examples everywhere (not only in Oaş)
Here we establish a result with the location and frequency of examples.
Can view the examples that support the conclusion.
MDS as Analytic tool
In addition to select, search, count and map functions, RODA can have special-purpose analytic tools.
A built-in MDS tool allows us to create MDS maps based on any selected set of data.
Other analytic techniques could also be implemented.
MDS Process-1
Multidimensional scaling (MDS) uses the “linguistic distance” between n+1 locations to place them in an n-dimensional space exactly...
MDS Process-2
MDS projects an n-space onto a 2-space (a map) so that the distances among the points are preserved as well as possible.
MDS Process-3
The linguistic map may or may not correspond to geography.
It does give a high-level picture of the total linguistic relationship: all the data used to get the distances is now displayed as a single picture.
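The projection step can be sketched in code. The slides do not say which MDS algorithm RODA implements, so the block below assumes classical (Torgerson) MDS: square the distances, double-centre them, and use the leading eigenvectors as map coordinates. The function name `classical_mds`, its parameters, and the use of power iteration to find the eigenvectors are all choices made for this illustration, and the input is assumed to be a symmetric matrix of roughly Euclidean distances.

```python
import math
import random


def classical_mds(dist, dims=2, iterations=500, seed=0):
    """Place n points in `dims` dimensions from an n x n distance matrix.

    Assumption: classical (Torgerson) MDS, not necessarily RODA's method.
    """
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration: converges to the dominant eigenvector of B.
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        bv = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(vec[i] * bv[i] for i in range(n))
        if eigenvalue <= 1e-12:
            break  # no further positive dimension to extract
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][k] = scale * vec[i]
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * vec[i] * vec[j]
    return coords


# Toy check: four corners of a 3 x 4 rectangle (made-up points, not atlas data).
points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
layout = classical_mds(dist)
```

The recovered layout may be rotated or mirrored relative to the input, which is why an MDS map need not line up with geography even when the distances are reproduced.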
Distance measures
Based on linguistic forms being “same” or “not same”
Does not account for forms that are nearly the same:
• “cat” ~ “caţ” ~ “feline”
Missing forms are “not same”
Summed over many comparisons
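A minimal sketch of this same / not-same distance follows. The dictionary layout (map identifier to recorded form), the map labels, and the function name `linguistic_distance` are assumptions made for illustration; the forms reuse the “cat” ~ “caţ” ~ “feline” example above and are not atlas data.

```python
def linguistic_distance(forms_a, forms_b):
    """Count the maps on which two locations do not share an identical form.

    Each argument maps a (hypothetical) map identifier to the form recorded
    at that location. A missing form counts as "not same".
    """
    distance = 0
    for map_id in set(forms_a) | set(forms_b):
        a = forms_a.get(map_id)
        b = forms_b.get(map_id)
        if a is None or b is None or a != b:
            distance += 1  # any difference is a difference of 1
    return distance


# Toy locations using the forms from the slide.
loc1 = {"map_cat": "cat", "map_eye": "ochi"}
loc2 = {"map_cat": "caţ", "map_eye": "ochi"}
loc3 = {"map_cat": "feline"}

d12 = linguistic_distance(loc1, loc2)
d13 = linguistic_distance(loc1, loc3)
```

Here “caţ” is as far from “cat” as “feline” is, which is the limitation the slide points out: near-identical forms get no credit for being close.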
MDS and dialects
Embleton and Wheeler have used an MDS process on:
English dialects
Finnish dialects
Dialect roughly correlates with geography
Romanian Dialect groupings
Begin with a hypothesis about dialect groupings in Crişana.
Analyzed all data in 403 maps, using the MDS method.
Identity is exact match; any difference is a difference of 1.
Distance is sum of differences.
We see the groupings on a map.
MDS map: All groups
South-east and South-west are distinct.
The rest are less so.
Suggests the dialect unity of the region
--> refine groupings
MDS map: Refined groupings
Still, considerable overlap or closeness
More groups could be identified, e.g.:
Several divisions in West
Two areas in Oaş
Oaş is close to southern areas
Still, its distinctness is clear (cf. also Uritescu 1984a).
Crişana dialect regions
When a lot of data is considered:
There is much overlap of regions.
A few regions are distinct.
It is possible that areas share features in a complex way, based on distance, physical geography and other factors.
There is more apparent unity than traditional analyses (based on a few features) would provide.
Further investigation
We want to look at:
Differences in vocabulary (rare vs common terms)
Phonetics vs morphology vs syntax
Other definitions of distance
RODA and MDS
RODA provides the large amount of data.
MDS makes the large amount of data readily understandable as a single picture.
Implementing MDS in RODA means that researchers can easily try the approach.
Summary
RODA provides:
Accessible data
Flexible searching and custom presentation
Repeatable processing
MDS makes the data easy to visualize
Result: new linguistic insights based on the greater understanding of the data
Contacts
Sheila M. Embleton, Dorin Uritescu, Eric S. Wheeler
Site: vpacademic.yorku.ca/romanian/