An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100...

8
Journal of Information & Computational Science 12:5 (2015) 1695–1702 March 20, 2015 Available at http://www.joics.com An Efficient and Effective Similarity Measure to Enable Data Mining of Fabric Image of Xinjiang Folk Art Rigen Te a , Xiongfei Li a, * , Yueying Zhang b a Key Laboratory of Symbol Computation and Knowledge Engineering, Jilin University Changchun 130012, China b College of Materials Science and Engineering, Jilin University, Changchun 130012, China Abstract Fabric pattern is a folk art of artificial fabrics pattern design which includes apparels (known as clothes and hats), and crafts (known as carpets and tapestries). Although image processing, information retrieval and data mining have made a large impact on many human endeavors, they never had an impact on the study of the fabric image. In this paper we will present the reasons for this, and introduce a novel similarity measure and algorithm (Pattern Similarity Measuring Algorithm (PSMA)) which allows efficient data mining for large collections of fabric image. PSMA, the effectiveness of which is proved be experiment test, is an image similarity measurement algorithm that achieves the machine-learning based theory via the gradient descent algorithm with multidimensional linear regression. Keywords : Folk Art; Image Processing; Similarity Measure; Data Mining; Gradient Descent Algorithm; Multidimensional Linear Regression 1 Introduction The fabric patterns of Chinese minority nationalities have a long history, which are very important in the daily lives among some minority nationalities. The data mining tools for these fabric patterns can provide a lot of potential knowledge about the national minority culture, and can play a significant role in protecting and succeeding the cultural heritage. Fabric pattern is an archaeological term for the artificial fabrics pattern design, including apparels known as clothes and hats, and crafts known as carpets and tapestries. Fabric patterns are one of the earliest ways of peoples’ abstract thinking, and a true hallmark of humanity. They provide a rich amount of information on several different dimensions, which as an aesthetic expression is far beyond their practical value. Other similar cultural heritage such as rock art also has the same effect. Studies of rock art can show some implications beyond anthropology and history. For example, a recent study has postulated the existence of an species of extinct Australian bat based on extraordinarily detailed pictographs that is known to be at * Corresponding author. Email address: [email protected] (Yueying Zhang). 1548–7741 / Copyright © 2015 Binary Information Press DOI: 10.12733/jics20105460

Transcript of An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100...

Page 1: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

Journal of Information & Computational Science 12:5 (2015) 1695–1702 March 20, 2015Available at http://www.joics.com

An Efficient and Effective Similarity Measure to Enable

Data Mining of Fabric Image of Xinjiang Folk Art

Rigen Te a, Xiongfei Li a,∗, Yueying Zhang b

aKey Laboratory of Symbol Computation and Knowledge Engineering, Jilin UniversityChangchun 130012, China

bCollege of Materials Science and Engineering, Jilin University, Changchun 130012, China

Abstract

Fabric pattern is a folk art of artificial fabrics pattern design which includes apparels (known as clothesand hats), and crafts (known as carpets and tapestries). Although image processing, information retrievaland data mining have made a large impact on many human endeavors, they never had an impact onthe study of the fabric image. In this paper we will present the reasons for this, and introduce anovel similarity measure and algorithm (Pattern Similarity Measuring Algorithm (PSMA)) which allowsefficient data mining for large collections of fabric image. PSMA, the effectiveness of which is proved beexperiment test, is an image similarity measurement algorithm that achieves the machine-learning basedtheory via the gradient descent algorithm with multidimensional linear regression.

Keywords: Folk Art; Image Processing; Similarity Measure; Data Mining; Gradient Descent Algorithm;Multidimensional Linear Regression

1 Introduction

The fabric patterns of Chinese minority nationalities have a long history, which are very importantin the daily lives among some minority nationalities. The data mining tools for these fabricpatterns can provide a lot of potential knowledge about the national minority culture, and canplay a significant role in protecting and succeeding the cultural heritage. Fabric pattern is anarchaeological term for the artificial fabrics pattern design, including apparels known as clothesand hats, and crafts known as carpets and tapestries.

Fabric patterns are one of the earliest ways of peoples’ abstract thinking, and a true hallmarkof humanity. They provide a rich amount of information on several different dimensions, whichas an aesthetic expression is far beyond their practical value. Other similar cultural heritagesuch as rock art also has the same effect. Studies of rock art can show some implications beyondanthropology and history. For example, a recent study has postulated the existence of an speciesof extinct Australian bat based on extraordinarily detailed pictographs that is known to be at

∗Corresponding author.Email address: [email protected] (Yueying Zhang).

1548–7741 / Copyright © 2015 Binary Information PressDOI: 10.12733/jics20105460

Page 2: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

1696 R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702

least 17, 500 years old [1]; In a similar way, fabric images have been used in studies of the climatechange, and the changing inventories of species in the Dampier Archipelago from the Pleistocene tothe early Holocene period have been reconstructed partly thanks to fabric evidence [2]. However,the research on fabric images is not relatively much. Li [3] proposed two models named MFGSMand MFPSM to describe the minority fabric pattern genes and minority fabric patterns as forthe semi-structured storage of the Xingjian fabric patterns. Zhao [4] put forward a segmentalgorithm based on the edge morphological transform for colorful fabric images. Firstly, colorspace transformation is implemented.

Xinjiang ethnic fabric patterns are artistic treasures in the Chinese culture. And it would be agreat contribution to protecting, inheriting and carrying the folk culture including fabric patternsto generate folk fabric patterns automatically via computers. An obvious thing to do with suchan image to place in a cultural background is to see if a similar image exists among the manypetroglyphs in the region. Thus, we began this project with careful considerations of the shapesimilarity. And the image properties that the similarity measurement for images depends on arenot all the same, either. The current common-used quantization algorithms are: the ones usingClustering Method such as k-mean [5], LBG Algorithm [6], ISODATA Algorithm [7] and so on,the central idea of which is to pick k samples as the initial clustering center, and then othersamples will be gathering towards the center. As a result, a new classification is created and willbe modified till it meets the requirements if it does not after the creation.

Also, there is another type of quantization algorithm using color spaces such as the unified quan-tization method [8], which divides the color space and select a color table with well-distributedred, blue and green as the palette colors and then replace each pixel with the color in paletteaccording to the closest color principle, and the median splitting method [9].

2 The Principle of Measuring the Similarity

2.1 Pretreatment of PSMA

Definition 1 Given an arbitrary image, after a certain times of operations like enlarging, zoom-ing in & out or rotating it, then fill it in a 100 ×100 pixels canvas with the size no larger than thecanvas. Now, if it cannot be enlarged as well, this series of operation is called Image Maximum.As it can be seen in Fig. 1, (b) is the result after the image maximum of (a).

(a) (b)

Fig. 1: Image maximum

Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix-el distribution in each square by using an n×n matrix, where 1 means the pixel exists and 0

Page 3: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702 1697

means not. And this series of operations are called N-dimensional grid fuzzification. After 10-dimensional grid fuzzification for (b) in Fig. 2, as shown in Fig. 3, we can get (b) from (a) whichwe have already showed in Fig. 1. And (a) in Fig. 3 represents the result after the grid division,(b) represents the information that can be remained after the grid fuzzification.

1

1

0

0

0

0

0

0

0

0

1

1

1

1

1

0

0

0

0

0

0

1

0

0

1

1

1

1

0

0

0

1

0

0

1

0

0

1

0

0

0

1

1

1

1

0

0

1

0

0

0

0

1

0

0

1

1

1

1

0

0

0

1

0

0

1

0

0

1

0

0

0

1

1

1

1

0

0

1

0

0

0

0

0

0

1

1

1

1

1

0

0

0

0

0

0

0

0

1

1

1 2 3 4 5 6 7 8 9 10

(a) (b) (c)

1 2 3 4 5 6 7 8 9 10

1

2

3

4

5

6

7

8

9

10

1

2

3

4

5

6

7

8

9

10

Fig. 2: The process of 10-dimensional grid fuzzification

(a) (b) (c)

Fig. 3: The influencing factors of similarity measurement

We can also have the description matrix in (b) according to the grid fuzzification process in(c).

Definition 3 When we want to measure the similarity between (a) which is used as a standardand (b) which we want to find out its similarity level to (a), we call (a) the Standard drawing and(b) the Measuring drawing.

In this way, not only less information will we record but also we can describe the shape of theimage roughly. When n=100, it reaches the highest accuracy level to describe the shape of theimage, which means we can describe the image precisely. That is to say, the bigger n we set themore accurate matrix of the image we can get.

2.2 PSMA Principle

In order to measure the similarity between two images, we need to maximize and grid fuzzifiereach image so that we can get the most overlap parts between those two images. But due to theuncertainty of the shape and size of each image, we cannot make sure that two images have arelatively high overlap ratio after the maximization and grid fuzzification. In Fig. 3, (a) is usedas the standard drawing, (b) and (c) are the measured drawings. Apparently, comparing to (c),

Page 4: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

1698 R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702

(b) is more similar to (a). But if we just take overlap ratio into consideration, (c) is more similarto (a) than (b) is. So the conclusion is that it is not rational to just take overlap ratio intoconsideration when comparing the similarity between images.

Therefore, we should take three influencing factors into consideration when taking similaritymeasurements.

1. The number of overlap squares (C);

2. The distance variance between non-overlap squares (D);

3. The difference of the number of grids the two images occupy (Called extra grids R).

After image maximization, the number of grid that the image has is going to change, so theinfluencing factors should be calculated after the image deformation. And the influencing factorsare (N is the number of black square in compared image):

1. The proportion of the overlap squares C/N ;

2. The distance variance of the non-overlap squares D;

3. The proportion of the extra grids R/N .

The formula of the similarity measurement can be defined as y = α0 + α1x1 + α2x2 + α3x3, x1

is the proportion of the overlap squares C/N ; x2 is 1/D, the reciprocal of the distance varianceof the non-overlap squares D; and x3 is the proportion of the extra grids R/N . In this way, wehave Eq. (1):

y = α0 + α1C/N + α21/D + α3R/N (1)

The value of y in the similarity measurement formula is unknown before training. Thus, thevalue of y in the training set is recorded by the inequality of each other. And if y > 0, the twoimages are similar to each other, otherwise they do not.

3 Pattern Similarity Measuring Algorithm (PSMA)

3.1 Image Maximization Algorithm

When we are implementing the Image maximization algorithm, we should scale, move and rotatethe image continuously until the maximization accomplished because of the randomness of theposition of the image center. Also, because the original image cannot be bigger than the canvas,the scale option means only enlarging is included.

Assuming that the distance between the top of image and the top of the canvas is Up, thedistance between the bottom of image and the bottom of the canvas is Down, the distancebetween the left side of image and the left side of the canvas is Left, and the distance betweenthe right side of image and the right side of the canvas is Right.

Algorithm implementation:

Page 5: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702 1699

Step 1 If Up = Down and Left = Right in the chosen image, take Step 3; otherwise makeUp = Down and Left = Right by moving the center of the image and then take Step 2.

Step 2 Enlarge the image and make sure that one side of the picture coincide with the corre-sponding side of the canvas and no parts of the image exceed the canvas.

Step 3 Rotate the image and record the angle of the rotation. After one degree of clockwiserotation, execute i = i+ 1 and take Step 4.

Step 4 Algorithm ends when i = 360. At this time the image maximization should be done,otherwise take Step 5.

Step 5 If there is a certain part of the image that exceeds the boundary of the canvas, weshould move the image until Up = Down and Left = Right. If there is still a part of theimage exceeding the boundary of the canvas, take Step 3. If no parts of the image overlap theboundaries of the canvas, take Step 2 and mark i = 0.

3.2 Image Grid fuzzification

First we should define the dimension of the image before the grid fuzzification. If dimension isn, the image can be equally divided into nn squares with the size unchanged. And by analyzinghow many grids of the canvas the image occupies via pixels, we can use an nn matrix to describethe situation, and ai,j to describe the information of row i, column j.

Algorithm implementation:

Step 1 Divide the image into nn squares with the same size.

Step 2 Check the squares of the image from a1,1 to an,n, if there are any pixels in the square, wemark ai,j = 1 otherwise mark ai,j = 0.

Step 3 After checking all the squares, the algorithm ends and generates matrix J .

3.3 The Calculation of the Influencing Factors of PSMA

Before calculating the influencing factors of PSMA, we need to match the two images and adjustthe measured drawing in order to make sure the grid fuzzification matrix of the measured drawingcan match with Standard drawing as much as possible.

Definition 4 In order to make sure the grid fuzzification matrix of the Measured drawing canmatch with Standard drawing as much as possible, we need to rotate and narrow the measureddrawing, and make sure that the measured drawing is inside the canvas at the same time. Thisoperation is called the matchinglization of the measured drawing.

Page 6: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

1700 R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702

3.4 Algorithm for the Similarity Training and Measuring

The basic process of the similarity training algorithm is to calculate −→α that is the parameter ofthe similarity measurement according to a limited training set. The training results of −→α can beregarded as a set because of the unequal relationship in the result set of the training set.

According to Section 2 we can see that the final formula for the n-dimensional gradient descentalgorithm can be defined as −→α = (xTx)−1xTy. But in the image oriented similarity trainingalgorithm, y is a set vector, so the formula needs to be redefined as (xT )−1xTx−→α = y, x−→α = y,and we can obtain the result training set of −→α :

With the result the original training set already calculated, add a new training result. Andregard the new training result as a training set, then the training result set of −→α can be figuredout. After taking an intersection for the result and the original training result set, the final resultis in.

Definition 5 When the training result set is empty, screen the result set. Then make sure thatthe result set is not null at the cost of deleting the limitations as little as possible. This process iscalled the Result Set Correctness.

After the training set reaches a certain size, there may be a chance that the intersection isempty. This is when we should correct the result set in order to obtain the training result.

After obtaining via the similarity training algorithm, we can calculate the result of similaritymeasurement by Eq. (1).

Algorithm implementation:

Step 1 Maximize the standard drawing and measured drawing.

Step 2 Match the measured drawing with the standard drawing and record the final match value.

4 Experimental Results and Analysis

Take a test in a database which has 131 patterns. First select three patterns which have nosimilarity to the other two. Then measure the similarity of image A, B and C using PSMA, wecan obtain three groups of comparing result, as shown in Fig. 4. Group A is sorted manually,group B is sorted after a small amount of trainings (20 times), and group C is sorted after arelatively large amount of trainings (100 times).

According to the comparing result above, we can assume that there is no obvious connectionbetween training times and training results. But we can still guarantee that most of the results inmanual sort can be obtained. Currently the images which are in the front of the manual sort aresure to be shown in the result of the algorithm and the sorting order is basically consistent withthe training results in the training set. Thus, we can come to a conclusion that this algorithmnot only has the function of measuring the similarity but also can generally represent the actualmanual sort, which means the algorithm has the Learnability.

As shown above, the first row addresses the Bass clef followed by six samples; the second rowshows the Treble clef followed by six samples; the third row describes the Alto clef followed bysix samples, in which the last three are very different from the first three.

Page 7: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702 1701

A

B

C

A

B

C

A

B

C

(a) The effect of training times on sorting the results 1

(b) The effect of training times on sorting the results 2

(c) The effect of training times on sorting the results 3

(a) The effect of training times on sorting the results 1

(b) The effect of training times on sorting the results 2

Fig. 4: The effect of training times

After sorting the pattern library given by Li [3] using PSMA, we can compare the result weobtain with the manual training one and calculate the accuracy rate of the grouping and then wewill make a comparison with other algorithms. The results are shown in Table 1.

Table 1: Shows the accuracy rate of different algorithm

Method Zernike moment ART DTWB GHT PSMA

Accuracy (%) 65.07 72.74 95.81 87.12 85.61

Although the accuracy rate of PSMA is not the highest, the main purpose of the algorithm is tosort similar patterns, and the result after the training is always larger than the manual training.Also, the first several of the sort result of PSMA are basically the same as the manual trainingresult. Thus, the PSMA can be more accurate than other algorithms in the perspective of sorting.

Page 8: An Efficient and Effective Similarity Measure to Enable Data ......Definition 2 Given a 100 ×100 image that is divided into n2 even squares, and describe pix- el distribution in

1702 R. Te et al. / Journal of Information & Computational Science 12:5 (2015) 1695–1702

5 Conclusion

This paper represents the Pattern Similarity Measuring Algorithm (PSMA), which is for theintelligentization of sorting and grouping the pattern similarity. PSMA has achieved the intel-ligentizing process of the pattern grouping and sorting by machines. It also has managed tomatch any pattern perfectly via the grid fuzzification and maximization of patterns, then findsthe pattern that meets the matching requirements by calculating the matching result. Throughthe test of the PSMA, we can summarize that the sorting result and the training set are of arelatively strong similarity and high accuracy on grouping.

Nevertheless, it is not of good efficiency, so it would be time consuming when we define thesimilarity level between two patterns with a large pattern database, which means the improvementon the efficiency is significant in the future work. Also, it remains to be a big question if thetraining result set of a pattern library can be the one for other pattern libraries. If this can beachieved as well, PSMA will be an algorithm with much more practical value.

References

[1] J. Pettigrew, M. Nugent, A. McPhee, J. Wallman, An unexpected, stripe-faced flying fox in iceage rock art of Australia’s Kimberley [J], Journal of Antiquity, 82, 2008

[2] I. Aseyev, Horseman image on an ostrich eggshell fragment [J], Archaeology, Ethnology and An-thropology of Eurasia, 34(2), 2008, 96-99

[3] W. Li, R. Te, X. Li, A model of national minority fabric patterns of China to enable data mining[J], International Journal of Digital Content Technology & Its Applications, 5(12), 2011

[4] H. Y. Zhao, Y. F. Yang, G. M. Xu, Design method for Xinjiang folk art pattern [J], ComputerSystems and Applications, 20(7), 2011, 94-99

[5] S. H. Park, I. D. Yun, S. U. Lee, Color image segmentation based on 3-D clustering: Morphologicalapproach [J], Pattern Recognition, 31(8), 1998, 1061-1076

[6] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design [J], IEEE Transactions onCommunications, 28(1), 1980, 84-95

[7] N. Venkateswarlu, P. Raju, Fast ISODATA clustering algorithms [J], Pattern Recognition, 25(3),1992, 335-342

[8] P. Heckbert, Color image quantization for frame buffer display [C], Proceedings of the 9th AnnualConference on Computer Graphics and Interactive Techniques, SIGGRAPH’82, ACM, 1982, 297-307

[9] G. G. Z. Mingquan, An algorithm based on octree structure for color quantization [J], Mini-MicroSystems, 1, 1997