1 BIBE’05April 19, 2023
Using Data Mining Techniques to Learn Layouts of Flat-File Biological Datasets
Kaushik Sinha, Xuan Zhang, Ruoming Jin, Gagan Agrawal
Overall Goal
Informatics tools for biological data integration, driven by:
  Data explosion
    Data size & number of data sources
  New analysis tools
  Autonomous resources
    Heterogeneous data representation & various interfaces
    Frequent updates
Common situations:
  Flat-file datasets
  Ad-hoc sharing of data
Current Approaches
Manually written wrappers
  Problems: O(N^2) wrappers needed, O(N) for a single update
Mediator-based integration systems
  Problems: need a common intermediate format; unnecessary data transformation
Integration using web/grid services
  Needs all tools to be web services (all data in XML?)
Our Approach
Automatically generate wrappers
  Transform data in files of arbitrary formats
  No domain- or format-specific heuristics
  Layout information provided by users
Help biologists write layout descriptors using data mining techniques
Our Approach: Challenges
Description language
  Format and logical view of data in flat files
  Easy to interpret and write
Wrapper generation and execution
  Correspondence between data items
  Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
  What data mining techniques to use?
Wrapper Generation System Overview
[Figure: system architecture diagram. Components shown: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer, Source Dataset, Target Dataset]
Key Open Questions
How hard is it to write layout descriptors ?
Given a flat file, how hard is it to learn its layout?
Can we make the process semi-automatic ?
Learning Layout of a Flat-File
In general: intractable
Try to learn the layout, then have a domain expert verify it
Key issue: what delimiters are being used?
Finding Delimiters
Difficult problem
  Some knowledge from a domain expert is required (semi-automatic)
Naïve approaches
  Frequency counting: counts frequently occurring single tokens (words separated by spaces)
  Sequence mining: counts frequently occurring sequences of tokens
Frequency Counting
Problems
  Some tokens that appear very frequently are not delimiters
  A delimiter could be a sequence of tokens rather than a single token
Possible solution
  Use the frequencies of a token sequence and of all its subsequences to decide whether it is a possible delimiter sequence
Sequence Mining Example
For any sequence of tokens s, f(s) represents the frequency of s
Let's say A, B, C are tokens
Case 1: f(ABC)=10, f(AB)=10, f(BC)=10, f(CA)=10
  Information about AB, BC, CA is already embedded in ABC
  ABC is a possible delimiter, but AB, BC, CA are not
Case 2: f(ABC)=10, f(AB)=20, f(BC)=10, f(CA)=10
  BC and CA occur less frequently than AB
  ABC cannot be a delimiter
  AB is a possible delimiter
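The two cases above can be reproduced with a small filter. This is only a sketch of one possible reading of the rule: a sequence is dropped when one of its contiguous parts is more frequent than it, or when a longer sequence containing it has the same frequency. The function names `contains` and `possible_delimiters` are hypothetical, and CA is left out of the example because it is not a contiguous part of ABC.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def possible_delimiters(freq):
    """Filter token sequences (tuples) using their sub- and supersequences.

    Assumed rule, not taken verbatim from the slides.
    """
    result = []
    for seq, f in freq.items():
        subs = [s for s in freq if len(s) < len(seq) and contains(seq, s)]
        supers = [s for s in freq if len(s) > len(seq) and contains(s, seq)]
        if any(freq[s] > f for s in subs):
            continue  # a part of seq occurs more often on its own
        if any(freq[s] == f for s in supers):
            continue  # seq never occurs outside a longer sequence
        result.append(seq)
    return result


ABC, AB, BC = ("A", "B", "C"), ("A", "B"), ("B", "C")
case1 = possible_delimiters({ABC: 10, AB: 10, BC: 10})  # keeps only ABC
case2 = possible_delimiters({ABC: 10, AB: 20, BC: 10})  # keeps only AB
```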
Limitations of Sequence Mining
Does not work well if token frequencies are skewed
An example where it does not work (Pfam dataset):
  \n, #=GF, AC are tokens with
    f(\n,#=GF) >> f(#=GF,AC)
    f(\n,#=GF) >> f(\n,#=GF,AC)
  \n #=GF is concluded to be a possible delimiter
  In reality, \n #=GF AC is a delimiter
Can we do better?
Biological datasets are written for humans to read
  It is very unlikely that delimiters will be scattered across different places in a line
Position of the possible delimiters might provide useful information
Combination of positional and frequency information might be a better choice
Positional Weight
Let P be the set of positions in a line where a token can appear
For each position i ∈ P, tot_seq_i^j is the total number of token sequences of length j starting at position i
For each position i ∈ P, tot_unique_seq_i^j is the total number of unique token sequences of length j starting at position i
For any tuple (i,j), p_ratio(i,j) is defined as
  $$\mathrm{p\_ratio}(i,j) = \frac{\mathrm{tot\_seq}_i^j}{\mathrm{tot\_unique\_seq}_i^j}$$
p_ratio(i,j) can be log normalized to get the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)
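A minimal sketch of how p_ratio and p_wt might be computed from the lines of a file. Tokens are split on whitespace and positions are 0-based. The slides do not give the exact log normalization, so the formula log(1 + r) / log(2 + r_max) used here is an assumption chosen only because it keeps every weight strictly inside (0,1). The function names are hypothetical.

```python
import math
from collections import defaultdict


def p_ratio_table(lines, max_len):
    """Return {(i, j): tot_seq / tot_unique_seq} for start position i, length j."""
    total = defaultdict(int)
    unique = defaultdict(set)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens)):
            for j in range(1, max_len + 1):
                if i + j > len(tokens):
                    break
                total[(i, j)] += 1
                unique[(i, j)].add(tuple(tokens[i:i + j]))
    return {key: total[key] / len(unique[key]) for key in total}


def positional_weights(ratios):
    """Assumed log normalization mapping each ratio into (0, 1)."""
    max_ratio = max(ratios.values())
    return {k: math.log(1 + r) / math.log(2 + max_ratio) for k, r in ratios.items()}


# Hypothetical three-line file in which "ID" always starts the line.
lines = ["ID a1 x", "ID b2 y", "ID c3 z"]
ratios = p_ratio_table(lines, max_len=2)
weights = positional_weights(ratios)
```

In this toy input, position 0 always holds the same token, so its ratio (3 occurrences, 1 unique sequence) is higher than that of position 1, where every token is different.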
Delimiter score (d_score)
The frequency weight f_wt(s_i^j) of a token sequence s_i^j, with length j and starting at position i, is obtained by log normalizing its frequency f(s_i^j)
  f_wt(s_i^j) ∈ (0,1)
Positional and frequency weights can now be combined to get d_score as follows:
  d_score(s_i^j) = α * p_wt(i,j) + (1-α) * f_wt(s_i^j), where α ∈ (0,1)
Thus d_score has the following two properties:
  d_score(s_i^j) ∈ (0,1)
  d_score(s_i^j) > d_score(s_k^j) implies s_i^j is more likely to be a delimiter than s_k^j
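The combination can be illustrated as follows. This is a sketch under stated assumptions: the `log_normalize` formula is not specified in the slides and is chosen only to keep values in (0,1), and the candidate sequences with their ratios and frequencies are invented for illustration.

```python
import math


def log_normalize(value, max_value):
    """Assumed normalization: maps a value in [1, max_value] into (0, 1)."""
    return math.log(1 + value) / math.log(2 + max_value)


def d_score(p_wt, f_wt, alpha=0.5):
    """d_score = alpha * p_wt + (1 - alpha) * f_wt, with alpha in (0, 1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    return alpha * p_wt + (1 - alpha) * f_wt


# Hypothetical candidates: sequence -> (p_ratio, frequency)
candidates = {"ID": (100.0, 100), "the": (1.2, 90), "AC": (50.0, 50)}
max_ratio = max(r for r, _ in candidates.values())
max_freq = max(f for _, f in candidates.values())
scores = {
    seq: d_score(log_normalize(r, max_ratio), log_normalize(f, max_freq))
    for seq, (r, f) in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The frequent but positionally scattered token "the" ends up ranked below "AC", which is the effect the positional weight is meant to have.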
Finding delimiters using d_score
Since the delimiter sequence length is not known in advance, an iterative algorithm is used to get a superset S of potential delimiters
At any iteration i, c_i is the cut-off value, determined by observing a substantial difference in the sorted d_score values
All token sequences with a d_score above c_i form the set N_i
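One way to pick a cut-off from a "substantial difference" is to take the largest gap between consecutive sorted scores. The sketch below does that for a single iteration. The largest-gap rule, the function names, and the example scores are all assumptions; the slides do not state how the gap is chosen or how the sets N_i are merged into S.

```python
def cutoff_by_largest_gap(scores):
    """Return a cut-off placed in the middle of the largest gap in sorted scores."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two scores to find a gap")
    gaps = [ordered[k] - ordered[k + 1] for k in range(len(ordered) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return (ordered[k] + ordered[k + 1]) / 2


def select_above(scores_by_seq):
    """N_i for one iteration: sequences whose d_score exceeds the cut-off c_i."""
    c_i = cutoff_by_largest_gap(scores_by_seq.values())
    return {seq for seq, s in scores_by_seq.items() if s > c_i}


# Hypothetical d_score values for one iteration.
iteration_scores = {"a": 0.91, "b": 0.88, "c": 0.35, "d": 0.30}
n_i = select_above(iteration_scores)
```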
Generating layout descriptor
Once the delimiters are identified, an NFA can be built by scanning the whole database, with the delimiters as the states of the NFA
This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states
Example: an NFA in which A, B, C, D, and E are delimiters, B is an optional delimiter, and C D is a repeating delimiter sequence
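The A to E example can be mimicked with a simplified transition graph. This is a sketch, not the authors' construction: each record is assumed to be already reduced to its ordered list of delimiters, "optional" is taken to mean "absent from at least one record", and "repeating" to mean "occurring more than once within a record". The two records are invented to match the example.

```python
from collections import defaultdict


def build_transitions(records):
    """Map each delimiter to the set of delimiters seen immediately after it."""
    edges = defaultdict(set)
    for rec in records:
        for a, b in zip(rec, rec[1:]):
            edges[a].add(b)
    return edges


def optional_states(records):
    """Delimiters missing from at least one record."""
    all_states = set().union(*map(set, records))
    return {s for s in all_states if any(s not in rec for rec in records)}


def repeating_states(records):
    """Delimiters that occur more than once inside some record."""
    return {s for rec in records for s in rec if rec.count(s) > 1}


# Hypothetical records: B is skipped in the second one, and C D repeats.
records = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "D", "C", "D", "E"],
]
edges = build_transitions(records)
optional = optional_states(records)
repeating = repeating_states(records)
```

The edge from D back to C is what encodes the repetition, and the two outgoing edges of A (to B and to C) encode that B can be skipped.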
Realistic Situation
Identifying the complete list of correct delimiters is difficult
Most likely we will end up with an incomplete list of delimiters
Delimiters that do not appear in every data record (optional delimiters) are the ones most likely to be missed
Identifying Optional Delimiters
Given an incomplete list of delimiters, how can we identify optional delimiters, if any?
  Build an NFA based on the given incomplete information
  Perform clustering to identify possible crucial delimiters
  Perform contrast analysis
Crucial delimiter
A delimiter is considered crucial if missing delimiters appear immediately after it
The goal is to create two clusters:
  One containing the delimiters that are not crucial
  The other containing the crucial delimiters
Identifying crucial delimiters: A few definitions
Succ(X): set of delimiters that can immediately follow X
Dist_App: number of groups of occurrences of X, grouped by the number of text lines between X and the immediately following delimiter
Info_Tuple(n_X^i, f_X^i, t_X^i): information for each Dist_App (number of lines n_X^i, frequency f_X^i, and text t_X^i of group i)
Info_Tuple_List L_X: for any X, the list of all possible Info_Tuples
Metric for clustering
  $$r_X^f = \frac{f_X^{\max}}{f_X^{\mathrm{total}}}$$
r_X^f is likely to be low if an optional delimiter appears immediately after X, and high otherwise
Choose a suitable cut-off value r_c and assign delimiters to groups as follows:
  If r_X^f < r_c, assign X to the group of possible crucial delimiters
  Else assign X to the group of non-crucial delimiters
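A short sketch of the clustering step, reading f_X^max as the largest frequency among the Info_Tuples of X and f_X^total as their sum. The tuple layout (lines, frequency, text file), the delimiter names, and the cut-off value 0.8 are assumptions made for illustration.

```python
def r_metric(info_tuples):
    """r_X^f = largest group frequency divided by the total frequency."""
    freqs = [f for _, f, _ in info_tuples]
    return max(freqs) / sum(freqs)


def cluster(info_lists, r_c):
    """Split delimiters into (possible crucial, non-crucial) by the cut-off r_c."""
    crucial, non_crucial = set(), set()
    for x, tuples in info_lists.items():
        (crucial if r_metric(tuples) < r_c else non_crucial).add(x)
    return crucial, non_crucial


# Hypothetical Info_Tuple lists: X is followed by two different line counts,
# Y is always followed by the same number of lines.
info_lists = {
    "X": [(50, 20, "l1.txt"), (20, 12, "l2.txt")],
    "Y": [(3, 40, "y1.txt")],
}
crucial, non_crucial = cluster(info_lists, r_c=0.8)
```

Here X gets r = 20/32 = 0.625 and falls below the cut-off, while Y gets r = 1.0 and is pruned as non-crucial.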
Observations and Facts
Missing optional delimiters can appear ONLY immediately after crucial delimiters
  Non-crucial delimiters can be pruned away
Consider two Info_Tuples (n_X^1, f_X^1, t_X^1) and (n_X^2, f_X^2, t_X^2) in L_X
If a missing delimiter appears immediately after the occurrences corresponding to the first tuple but not the second:
  n_X^1 > n_X^2
  The missing delimiter will appear in t_X^1 but not in t_X^2
A hypothetical example illustrating Contrast Analysis
Suppose X is a crucial delimiter with 2 Info_Tuples, L1 and L2:
  L1 = (50, 20, l1.txt)
  L2 = (20, 12, l2.txt)
Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2:
  S1 = { f1, f5, f6, f8, f13, f21 }
  S2 = { f1, f4, f6, f7, f8, f10, f13, f21 }
Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
Contrast Analysis
For any i, j, if n_X^i > n_X^j, look for frequently occurring sequences in t_X^i and t_X^j; call them fs_X^i and fs_X^j respectively
If there exists a frequent sequence fs such that fs ∈ fs_X^i but fs ∉ fs_X^j, then fs is quite likely to be a possible delimiter
If fs has a fairly high d_score or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
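The contrast step amounts to a set difference between the frequent sequences of the longer and the shorter group. In the sketch below, the sequence miner is passed in as a function and replaced by a lookup table holding the sets S1 and S2 from the hypothetical example. The name `contrast_analysis` and the stub miner are assumptions, and the d_score or expert check that follows is not modelled.

```python
def contrast_analysis(info_tuples, frequent_sequences):
    """Collect sequences frequent in a longer group but not in a shorter one.

    info_tuples: list of (n, f, text) tuples for one crucial delimiter.
    frequent_sequences: callable mapping a text to its set of frequent sequences.
    """
    candidates = set()
    for n_i, _, t_i in info_tuples:
        for n_j, _, t_j in info_tuples:
            if n_i > n_j:
                candidates |= frequent_sequences(t_i) - frequent_sequences(t_j)
    return candidates


# Stub miner: results of sequence mining on l1.txt and l2.txt from the example.
mined = {
    "l1.txt": {"f1", "f5", "f6", "f8", "f13", "f21"},
    "l2.txt": {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"},
}
tuples_for_x = [(50, 20, "l1.txt"), (20, 12, "l2.txt")]
possible_missing = contrast_analysis(tuples_for_x, mined.__getitem__)
```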
Generalized Contrast Analysis
In case of more than two Info_Tuples, identify the mean of all n_X^i values:
  $$n_X^{\mathrm{mean}} = \frac{\sum_{i=1}^{l} n_X^i \, f_X^i}{f_X^{\mathrm{total}}}$$
Form a group by appending text from all Info_Tuples where n_X^i ≥ n_X^mean
Form another group by appending text from all Info_Tuples where n_X^j < n_X^mean
Perform contrast analysis among all such possible groups
Another example illustrating Generalized Contrast Analysis
Suppose X is a crucial delimiter with 3 Info_Tuples, L1, L2, L3:
  L1 = (50, 20, l1.txt)
  L2 = (20, 12, l2.txt)
  L3 = (15, 10, l3.txt)
Mean number of lines:
  $$n_X^{\mathrm{mean}} = \frac{(50 \times 20) + (20 \times 12) + (15 \times 10)}{20 + 12 + 10} \approx 33.09$$
Append l2.txt and l3.txt; call the result t2.txt
Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2:
  S1 = { f1, f5, f6, f8, f13, f21 }
  S2 = { f1, f4, f6, f7, f8, f10, f13, f21 }
Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
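The grouping step of this example can be checked with a few lines. This is a sketch: the mean is the frequency-weighted mean of the line counts, and tuples are split into those at or above the mean and those below it. Where a tuple exactly equal to the mean should go is an assumption, as are the function names.

```python
def weighted_mean_lines(info_tuples):
    """Frequency-weighted mean of n over all (n, f, text) tuples."""
    total_f = sum(f for _, f, _ in info_tuples)
    return sum(n * f for n, f, _ in info_tuples) / total_f


def split_by_mean(info_tuples):
    """Return (texts with n >= mean, texts with n < mean)."""
    mean = weighted_mean_lines(info_tuples)
    above = [t for n, _, t in info_tuples if n >= mean]
    below = [t for n, _, t in info_tuples if n < mean]
    return above, below


tuples_for_x = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
mean_lines = weighted_mean_lines(tuples_for_x)  # 1390 / 42
above, below = split_by_mean(tuples_for_x)
```

With these values, l1.txt forms the first group on its own, and l2.txt and l3.txt are the files to append into t2.txt before running the two-group contrast analysis.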
Results: Non-optional Missing delimiters
Even though it was designed for finding optional delimiters, our algorithm also works, in some cases, for missing non-optional delimiters
If a missing non-optional delimiter appears in exactly the same location in each record, our algorithm fails
If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA, our algorithm works
Summary
Semi-automatic tool for learning the layout of a flat-file dataset
  Mechanism for identifying missing optional delimiters
Automatic tool for wrapper generation
  Once the layout descriptor is known
  Can ease integration of new/updated sources