
Design of Parallel Large Data Stream Transmission System Based on Migration Learning

Xiwu Zheng

Jiangsu College of Finance & Accounting, Lianyungang, Jiangsu, China

Keywords: transfer learning; semi-supervised clustering; distributed computing

Abstract: Traditional data transmission carries a large amount of redundant data. To avoid excessive bandwidth occupation, this paper proposes a parallel semi-supervised clustering algorithm based on migration (transfer) learning, which filters the data before transmission and thereby reduces the server's data transmission burden.

1. Introduction

There may be a large amount of jumbled data in the parallel transmission of large data streams. To prevent such data from occupying the limited server bandwidth, a parallel clustering algorithm capable of processing massive data is a general way to optimize clustering for big data.

2. Semi-supervised Fuzzy Possibility Clustering

In practice, a collected data set often contains a small number of known labels. Unsupervised algorithms have difficulty using this small amount of information to improve clustering accuracy, and much redundant information still remains during the transmission of large data. Therefore, some scholars have studied semi-supervised learning mechanisms combined with transfer learning, building on the Fuzzy C-Means (FCM) algorithm. Combining the idea of transfer learning, this study first explores a semi-supervised FCM clustering algorithm and then proposes a semi-supervised FPCM clustering algorithm.

2.1 Semi-supervised FCM clustering algorithm

In previous studies, Pedrycz proposed Semi-Supervised Fuzzy C-Means (SS-FCM). To distinguish labeled data from unlabeled data, a vector $B = \{b_k \mid k = 1, 2, \dots, N\}$ is introduced. If the label of sample $x_k$ is known, then $b_k = 1$; otherwise $b_k = 0$, as shown in formula 1.

$$b_k = \begin{cases} 1, & x_k \text{ is labeled} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

The category-attribute matrix $F = \{f_{ik} \mid i = 1, 2, \dots, C;\; k = 1, 2, \dots, N\}$ is defined analogously: if $x_k$ belongs to category $i$, then $f_{ik} = 1$; otherwise $f_{ik} = 0$, as shown in formula 2.

$$f_{ik} = \begin{cases} 1, & x_k \text{ is labeled and belongs to category } i \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

After introducing B and F, Pedrycz fixes the fuzzy parameter m at 2, and the objective function is shown in formula 3.

$$J_{SS\text{-}FCM} = \sum_{i=1}^{C}\sum_{k=1}^{N} u_{ik}^2 d_{ik}^2 + \alpha \sum_{i=1}^{C}\sum_{k=1}^{N} \left(u_{ik} - f_{ik} b_k\right)^2 d_{ik}^2 \quad (3)$$
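For concreteness, the following is a minimal NumPy sketch of evaluating this objective. The function name, the Euclidean distance choice, and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ss_fcm_objective(X, V, U, F, b, alpha):
    """Evaluate the SS-FCM objective of formula 3 (sketch; m is fixed at 2)."""
    # D2[i, k] = squared Euclidean distance between center v_i and sample x_k
    D2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)      # (C, N)
    fcm_term = (U ** 2 * D2).sum()
    # supervision term: penalizes deviation of u_ik from f_ik on labeled points
    sup_term = ((U - F * b[None, :]) ** 2 * D2).sum()
    return fcm_term + alpha * sup_term

# toy usage: N = 5 samples, C = 2 clusters, first two samples labeled
X = np.random.rand(5, 2)
V = np.random.rand(2, 2)
U = np.full((2, 5), 0.5)
b = np.array([1, 1, 0, 0, 0])                    # formula 1
F = np.zeros((2, 5)); F[0, 0] = F[1, 1] = 1.0    # formula 2
print(ss_fcm_objective(X, V, U, F, b, alpha=1.0))
```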


2.2 Semi-supervised FPCM clustering algorithm

Pedrycz's research found that semi-supervised algorithms based on transfer learning can make better use of known information to improve clustering efficiency. Because the functions of B and F above are similar, F is preserved and FPCM is improved; the resulting objective function is shown in formula 4.

$$J_{SS\text{-}FPCM} = \sum_{i=1}^{C}\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) d_{ik}^2 + \omega \sum_{i=1}^{C}\sum_{k=1}^{N}\left(u_{ik} - f_{ik}\right)^2 d_{ik}^2 \quad (4)$$

Among them, $\alpha \ge 0$, $\beta \ge 0$, $\omega > 0$, and $0 \le u_{ik}, t_{ik} \le 1$. A Lagrangian expression is then constructed to minimize the objective function, giving the expression shown in formula 5.

$$Q = \sum_{i=1}^{C}\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) d_{ik}^2 + \omega \sum_{i=1}^{C}\sum_{k=1}^{N}\left(u_{ik} - f_{ik}\right)^2 d_{ik}^2 + \sum_{k=1}^{N}\lambda_k\left(1 - \sum_{i=1}^{C} u_{ik}\right) + \sum_{i=1}^{C}\theta_i\left(1 - \sum_{k=1}^{N} t_{ik}\right) \quad (5)$$

$\lambda_k$ and $\theta_i$ are Lagrange multipliers. The partial derivatives of Q with respect to $u_{ik}$, $t_{ik}$, $v_i$, and $\lambda_k$ are then taken; setting them equal to 0 yields the membership matrix $u_{ik}$, the clustering centers $v_i$, and the possibility matrix $t_{ik}$, whose expressions are shown in formulas 6, 7, and 8.

$$u_{ik} = \frac{1}{\alpha + \omega}\left[\frac{\alpha + \omega\left(1 - \sum_{j=1}^{C} f_{jk}\right)}{\sum_{j=1}^{C} d_{ik}^2 / d_{jk}^2} + \omega f_{ik}\right], \quad \forall i, k \quad (6)$$

$$v_i = \frac{\sum_{k=1}^{N}\left[\alpha u_{ik}^2 + \beta t_{ik}^2 + \omega\left(u_{ik} - f_{ik}\right)^2\right] x_k}{\sum_{k=1}^{N}\left[\alpha u_{ik}^2 + \beta t_{ik}^2 + \omega\left(u_{ik} - f_{ik}\right)^2\right]}, \quad \forall i \quad (7)$$

$$t_{ik} = \frac{d_{ik}^{-2}}{\sum_{j=1}^{N} d_{ij}^{-2}}, \quad \forall i, k \quad (8)$$

Through continuous iterative optimization of these three expressions, once the results converge, the required clustering partition can be obtained from the membership matrix or the possibility matrix. This improved algorithm is Semi-Supervised Fuzzy Possibilistic C-Means (SS-FPCM). The parameters α and β control the relative weights of the FCM and PCM components in the algorithm: the larger one of them is, the larger the weight of the corresponding component. At the same time, the influence of the known labels on the algorithm can be controlled by changing the parameters.

If ω equals 0, all labels are treated as unknown, and the algorithm degenerates into the unsupervised FPCM algorithm. If ω is infinite, the algorithm amounts to finding clustering centers from the known labels alone and then classifying samples directly by those centers. Therefore, by adjusting the parameter ω, the importance of known labels in SS-FPCM can be controlled so that the clustering effect stays in a reasonable range. The flow chart of the algorithm is shown in Figure 1.


Fig. 1 Flow chart of the SS-FPCM algorithm: randomly initialize the cluster centers, construct the matrix F from the known labels, and initialize the objective function to 0; then repeatedly compute the new membership matrix U (formula 6), the new possibility matrix T (formula 8), and the updated cluster centers V (formula 7), until the number of iterations exceeds L or the absolute difference between two successive objective values is less than the threshold ε.
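To make the loop in Figure 1 concrete, here is a minimal NumPy sketch of SS-FPCM under stated assumptions (squared Euclidean distances, random center initialization); all names and defaults are illustrative, not the paper's code.

```python
import numpy as np

def ss_fpcm(X, F, C, alpha=1.0, beta=1.0, omega=1.0, L=100, eps=1e-6, seed=0):
    """Sketch of the Fig. 1 loop (formulas 4, 6, 7, 8).
    X: (N, d) samples; F: (C, N) label indicators from formula 2."""
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    V = X[rng.choice(N, size=C, replace=False)].copy()   # random initial centers
    J_old = 0.0
    for _ in range(L):
        D2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(-1) + 1e-12   # (C, N)
        inv = 1.0 / D2
        # formula 6: membership matrix U
        num = alpha + omega * (1.0 - F.sum(axis=0))                   # per sample k
        U = (num[None, :] / (D2 * inv.sum(0)[None, :]) + omega * F) / (alpha + omega)
        # formula 8: possibility matrix T (each row sums to 1 over the N samples)
        T = inv / inv.sum(axis=1, keepdims=True)
        # formula 7: cluster centers V
        W = alpha * U**2 + beta * T**2 + omega * (U - F)**2           # (C, N)
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # formula 4: objective value, used for the Fig. 1 stopping test
        J = ((alpha * U**2 + beta * T**2) * D2).sum() + omega * ((U - F)**2 * D2).sum()
        if abs(J - J_old) < eps:
            break
        J_old = J
    return U, T, V
```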

3. Semi-supervised Migration Fuzzy Possibility Clustering

Migration (transfer) learning can use the data or information acquired from source data to guide the target data. When the source data is highly correlated with the target data, specific latent information can be discovered in the source data and used for subsequent data analysis.

3.1 Semi-supervised FPCM clustering algorithm with non-negative migration

When the source data carries some labels, those samples can be filtered out and added to the target data for joint clustering, so that the source data better guides the target data. The semi-supervised FPCM clustering algorithm above can enhance the clustering effect through effective use of labels, and its objective function (formula 4) can be referenced directly. However, transfer learning still faces the problem of negative migration: if the source data is unrelated to the target data, the historical data labels are likely to deviate from the target data and cause negative migration. The formula is therefore modified to avoid negative migration, and the resulting algorithm is named Transfer Semi-Supervised Fuzzy Possibilistic C-Means (TSS-FPCM).

Assuming there are M known labeled samples in the source data, they are extracted and placed after the target data to form a new target data set $X' = \{x_k \mid k = 1, 2, \dots, N, N+1, \dots, N+M,\; x_k \in R^d\}$. The objective function is updated for this data set, as shown in formula 9:


$$J_{TSS\text{-}FPCM} = \sum_{i=1}^{C}\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(u_{ik} - f_{ik}\right)^2 d_{ik}^2 \quad (9)$$

Among them, $\alpha \ge 0$, $\beta \ge 0$, $\omega > 0$, $0 \le u_{ik}, t_{ik} \le 1$, $\sum_{i=1}^{C} u_{ik} = 1\ \forall k$, and $\sum_{k=1}^{N+M} t_{ik} = 1\ \forall i$.

Comparing formulas 4 and 9, it is easy to see that as the parameter ω tends to zero, the former is equivalent to adding the M source samples to the target domain as unlabeled data and performing unsupervised mixed C-means clustering, while the latter is equivalent to discarding these samples as useless, so the latter avoids the occurrence of negative migration.
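As a small illustration of the augmented data set $X'$ described above, the following Python sketch appends the M labeled source samples after the N target samples and extends F accordingly; the variable names and toy data are assumptions for illustration only.

```python
import numpy as np

X_target = np.random.rand(100, 4)          # N target samples, unlabeled
X_source = np.random.rand(20, 4)           # M labeled source samples
src_labels = np.random.randint(0, 3, 20)   # class index of each source sample

C, N, M = 3, len(X_target), len(X_source)
X_prime = np.vstack([X_target, X_source])  # X' has N + M rows
F = np.zeros((C, N + M))
F[src_labels, N + np.arange(M)] = 1.0      # f_ik = 1 only on the source labels
```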

In order to minimize the objective function, the Lagrange expression is constructed and the iterative expression is obtained as shown in formula 10.

$$\begin{aligned} Q = {} & \sum_{i=1}^{C}\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(u_{ik} - f_{ik}\right)^2 d_{ik}^2 \\ & + \sum_{k=1}^{N+M}\lambda_k\left(1 - \sum_{i=1}^{C} u_{ik}\right) + \sum_{i=1}^{C}\theta_i\left(1 - \sum_{k=1}^{N+M} t_{ik}\right) \end{aligned} \quad (10)$$

$\lambda_k$ and $\theta_i$ are Lagrange multipliers. The partial derivatives of Q with respect to $u_{ik}$, $t_{ik}$, $v_i$, and $\lambda_k$ are taken and set equal to 0, which yields the membership matrix $u_{ik}$, the clustering centers $v_i$, and the possibility matrix $t_{ik}$ shown in formulas 11, 12, and 13.

$$v_i = \frac{\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) x_k + \omega\sum_{k=N+1}^{N+M}\left[\alpha u_{ik}^2 + \beta t_{ik}^2 + \left(u_{ik} - f_{ik}\right)^2\right] x_k}{\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) + \omega\sum_{k=N+1}^{N+M}\left[\alpha u_{ik}^2 + \beta t_{ik}^2 + \left(u_{ik} - f_{ik}\right)^2\right]}, \quad \forall i \quad (11)$$

$$t_{ik} = \begin{cases} \dfrac{d_{ik}^{-2}}{\sum_{j=1}^{N} d_{ij}^{-2} + \omega^{-1}\sum_{j=N+1}^{N+M} d_{ij}^{-2}}, & 1 \le k \le N \\[2ex] \dfrac{\omega^{-1} d_{ik}^{-2}}{\sum_{j=1}^{N} d_{ij}^{-2} + \omega^{-1}\sum_{j=N+1}^{N+M} d_{ij}^{-2}}, & N < k \le N+M \end{cases} \quad (12)$$

$$u_{ik} = \begin{cases} \left[\sum_{j=1}^{C} d_{ik}^2 / d_{jk}^2\right]^{-1}, & 1 \le k \le N \\[2ex] \dfrac{1}{1+\alpha}\left[\dfrac{1 + \alpha - \sum_{j=1}^{C} f_{jk}}{\sum_{j=1}^{C} d_{ik}^2 / d_{jk}^2} + f_{ik}\right], & N < k \le N+M \end{cases} \quad (13)$$

From the updated expressions it can be seen that when the parameter ω tends to zero, the objective function degenerates into the traditional FPCM clustering algorithm, which guarantees that the algorithm will not suffer negative migration from the known labels of the newly added source data. Although the expressions of the membership matrix $u_{ik}$, the clustering centers $v_i$, and the possibility matrix $t_{ik}$ change, the algorithm flow of TSS-FPCM has no obvious change.
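As a concrete illustration, here is a minimal NumPy sketch of the piecewise membership update of formula 13, assuming the squared distances are already available; the helper name and array layout are illustrative assumptions.

```python
import numpy as np

def tss_fpcm_membership(D2, F, N, alpha):
    """Piecewise update of formula 13.
    D2: (C, N+M) squared distances; F: (C, N+M) label indicators;
    the first N columns are target data, the remaining M are labeled source data."""
    inv_sum = (1.0 / D2).sum(axis=0)          # sum_j d_jk^{-2} for each sample k
    S = D2 * inv_sum[None, :]                 # sum_j d_ik^2 / d_jk^2
    U = 1.0 / S                               # first case: FCM-style update
    # labeled source columns (k > N) get the supervised form
    num = 1.0 + alpha - F[:, N:].sum(axis=0)
    U[:, N:] = (num[None, :] / S[:, N:] + F[:, N:]) / (1.0 + alpha)
    return U
```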

3.2 An improved semi-supervised migration clustering algorithm

In source data, the amount of unlabeled data is usually much larger than that of labeled data, yet it can still provide information that helps with the target data. Directly clustering a large amount of unlabeled data together with the target data would greatly increase the computational complexity. Therefore, to reduce the complexity, a "representative point" can be used to stand for a class in the source data. This representative point can be either a clustering center or a real sample point in the data, so the huge source data is converted into a limited number of representative points.

Firstly, the known labels in the source data are used as part of the prior information; then the source data is clustered with a classical clustering algorithm to obtain the "representative point" information of the source samples. Finally, both are used as auxiliary information to help cluster the target data, yielding an improved semi-supervised migration clustering algorithm (Improved Transfer Semi-Supervised Fuzzy Possibilistic C-Means, ITSS-FPCM).

To make effective use of the representative points, the set of representative points is defined as $\hat X = \{\hat x_i \mid i = 1, 2, \dots, C\}$, where C is the number of clusters. The redefined distance function is then obtained as shown in formula 14.

$$\hat d_{ik}^2 = \left\|x_k - v_i\right\|^2 + \gamma_1\left\|x_k - \hat x_i\right\|^2 + \gamma_2\left\|v_i - \hat x_i\right\|^2 \quad (14)$$

Among them, $\gamma_1$ and $\gamma_2$ are weighting factors that adjust how strongly the historical centers and the transferred representative points act on the target data as effective information.
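A minimal sketch of this redefined distance, assuming Euclidean norms; the function name is illustrative.

```python
import numpy as np

def rep_distance2(x_k, v_i, xhat_i, gamma1, gamma2):
    """Squared distance of formula 14: blends the distance to the cluster
    center v_i with distances to the source "representative point" xhat_i."""
    return (np.sum((x_k - v_i) ** 2)
            + gamma1 * np.sum((x_k - xhat_i) ** 2)
            + gamma2 * np.sum((v_i - xhat_i) ** 2))
```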

By improving the distance function, the new objective function is shown in Formula 15.

$$J_{ITSS\text{-}FPCM} = \sum_{i=1}^{C}\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) \hat d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) \hat d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(u_{ik} - f_{ik}\right)^2 \hat d_{ik}^2 \quad (15)$$

Among them, $\alpha \ge 0$, $\beta \ge 0$, $\omega > 0$, $0 \le u_{ik}, t_{ik} \le 1$, $\sum_{i=1}^{C} u_{ik} = 1\ \forall k$, and $\sum_{k=1}^{N+M} t_{ik} = 1\ \forall i$. To obtain its iterative expressions, the Lagrangian is constructed using the Lagrange extremum method:

$$\begin{aligned} Q = {} & \sum_{i=1}^{C}\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) \hat d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) \hat d_{ik}^2 + \omega\sum_{i=1}^{C}\sum_{k=N+1}^{N+M}\left(u_{ik} - f_{ik}\right)^2 \hat d_{ik}^2 \\ & + \sum_{k=1}^{N+M}\lambda_k\left(1 - \sum_{i=1}^{C} u_{ik}\right) + \sum_{i=1}^{C}\theta_i\left(1 - \sum_{k=1}^{N+M} t_{ik}\right) \end{aligned} \quad (16)$$

$\lambda_k$ and $\theta_i$ are Lagrange multipliers. Setting $\partial Q / \partial v_i = 0$, the expression of the clustering center is obtained as shown in formula 17.

$$v_i = \frac{\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right)\left(x_k + \gamma_2 \hat x_i\right) + \omega\sum_{k=N+1}^{N+M}\left[\alpha u_{ik}^2 + \beta t_{ik}^2 + \left(u_{ik} - f_{ik}\right)^2\right]\left(x_k + \gamma_2 \hat x_i\right)}{\left(1 + \gamma_2\right)\left\{\sum_{k=1}^{N}\left(\alpha u_{ik}^2 + \beta t_{ik}^2\right) + \omega\sum_{k=N+1}^{N+M}\left[\alpha u_{ik}^2 + \beta t_{ik}^2 + \left(u_{ik} - f_{ik}\right)^2\right]\right\}}, \quad \forall i \quad (17)$$


Let $\partial Q / \partial \lambda_k = 0$; we get formula 18.

$$\sum_{i=1}^{C} u_{ik} = 1 \quad (18)$$

Let $\partial Q / \partial u_{ik} = 0$; for $0 < k \le N$ we get the following formula.

$$u_{ik} = \frac{\lambda_k}{2\alpha \hat d_{ik}^2} \quad (19)$$

Substituting formula 19 into formula 18 gives formula 20:

$$\frac{\lambda_k}{2\alpha} = \left[\sum_{i=1}^{C} \hat d_{ik}^{-2}\right]^{-1} \quad (20)$$

Substituting formula 20 back into formula 19 then gives formula 21:

$$u_{ik} = \left[\sum_{j=1}^{C} \frac{\hat d_{ik}^2}{\hat d_{jk}^2}\right]^{-1} \quad (21)$$

By the same token, for $N < k \le N+M$ we can find that:

$$u_{ik} = \frac{1}{1+\alpha}\left[\frac{1 + \alpha - \sum_{j=1}^{C} f_{jk}}{\sum_{j=1}^{C} \hat d_{ik}^2 / \hat d_{jk}^2} + f_{ik}\right] \quad (22)$$

The final expression is obtained by combining formulas 21 and 22:

$$u_{ik} = \begin{cases} \left[\sum_{j=1}^{C} \hat d_{ik}^2 / \hat d_{jk}^2\right]^{-1}, & 1 \le k \le N \\[2ex] \dfrac{1}{1+\alpha}\left[\dfrac{1 + \alpha - \sum_{j=1}^{C} f_{jk}}{\sum_{j=1}^{C} \hat d_{ik}^2 / \hat d_{jk}^2} + f_{ik}\right], & N < k \le N+M \end{cases} \quad (23)$$

Using the same method, the iterative expression of $t_{ik}$ can be obtained:

$$t_{ik} = \begin{cases} \dfrac{\hat d_{ik}^{-2}}{\sum_{j=1}^{N} \hat d_{ij}^{-2} + \omega^{-1}\sum_{j=N+1}^{N+M} \hat d_{ij}^{-2}}, & 1 \le k \le N \\[2ex] \dfrac{\omega^{-1} \hat d_{ik}^{-2}}{\sum_{j=1}^{N} \hat d_{ij}^{-2} + \omega^{-1}\sum_{j=N+1}^{N+M} \hat d_{ij}^{-2}}, & N < k \le N+M \end{cases} \quad (24)$$

ITSS-FPCM obtains the final membership matrix U by iterative optimization and then obtains the required partition by defuzzifying U. The algorithm flow is otherwise unchanged.

4. Parallel semi-supervised fuzzy clustering algorithm

Because traditional data processing leaves much redundant data that traditional machine learning can hardly process directly, this paper designs a distributed semi-supervised FPCM clustering algorithm based on the Spark platform, which enables the system to process massive source data and assists the transmission of large data.

A Resilient Distributed Dataset (RDD) is a distributed memory abstraction whose efficient fault tolerance lets it support working-set applications. RDD operations fall into two categories: transformations, which produce new RDDs, and actions, which return the final result of some computation on an RDD or write data to external storage without producing a new RDD.
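As a brief illustration of the transformation/action distinction, a small PySpark sketch follows; the application name and toy data are assumptions.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")
points = sc.parallelize(range(10))           # RDD built from a local collection
squares = points.map(lambda x: x * x)        # transformation: defines a new RDD, evaluated lazily
total = squares.reduce(lambda a, b: a + b)   # action: runs the job and returns a value (285)
squares.cache()                              # keep the working set in memory for reuse
```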

In the SS-FPCM algorithm above, each iteration mainly updates the possibility matrix T, the membership matrix U, and the cluster-center matrix V. Therefore, a parallel SS-FPCM algorithm must redesign the computation of these three matrices to meet the needs of parallel operation.

4.1 Parallel Solution of Possibility Matrix T

According to the expression of matrix T in formula 8, it is necessary to define the quantities $p_i$ (the matrix P) as shown in formula 25.

$$p_i = \sum_{j=1}^{N} \frac{1}{d_{ij}^2}, \quad \forall i \quad (25)$$

The expression of matrix T then changes to that shown in formula 26:

$$t_{ik} = d_{ik}^{-2} \cdot p_i^{-1}, \quad \forall i, k \quad (26)$$

In the changed expression, $p_i$ must be computed first, and the distances $d_{ij}$ inside $p_i$ must be solved separately. Therefore, the $d_{ij}$ can be computed in a distributed manner, $p_i$ obtained by a reduction, and matrix T obtained through one map operation.
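A hedged PySpark sketch of this scheme, assuming the data is an RDD of NumPy vectors and the current centers are broadcast; all names and the toy data are illustrative.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="parallel-T")
data = sc.parallelize([np.random.rand(4) for _ in range(1000)]).cache()
V_b = sc.broadcast(np.random.rand(3, 4))     # broadcast the current centers (C = 3)

def inv_sq_dist(x):
    # d_ik^{-2} of one sample x against every center v_i
    d2 = ((V_b.value - x) ** 2).sum(axis=1) + 1e-12
    return 1.0 / d2                          # length-C vector

p = data.map(inv_sq_dist).reduce(lambda a, b: a + b)   # formula 25: one distributed reduce
T = data.map(lambda x: inv_sq_dist(x) / p)             # formula 26: one map operation
```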

4.2 Parallel Solution of Membership Matrix U

The expression of matrix U in formula 6 shows that the category-attribute matrix F is the key to computing the membership matrix U. To avoid broadcasting an overly large F matrix, the class label of each data point is recorded during the early data-conversion stage; if a point has no class label, 0 is recorded, and the data is cached in the form (X, L), where X is the data vector and L is the class label. During the distributed computation, the class label can then be converted into the corresponding value of the F matrix. Since the membership of each data point in formula 6 depends only on the distances from that point to the cluster centers, and not on other data points, the membership matrix U can be computed directly in parallel as the result of an RDD operation.

4.3 Parallel Solution of Cluster Center V

The expression of the cluster-center matrix V in formula 7 requires the matrix T. Therefore, the matrix P can likewise be used to solve for the new cluster centers, and the expression is updated as shown in formula 27.

$$v_i = \frac{\sum_{k=1}^{N}\left[\alpha u_{ik}^2 + \beta\left(d_{ik}^{-2} p_i^{-1}\right)^2 + \omega\left(u_{ik} - f_{ik}\right)^2\right] x_k}{\sum_{k=1}^{N}\left[\alpha u_{ik}^2 + \beta\left(d_{ik}^{-2} p_i^{-1}\right)^2 + \omega\left(u_{ik} - f_{ik}\right)^2\right]}, \quad \forall i \quad (27)$$

Since the numerator and denominator of formula 27 share the same weights, a matrix H can first be obtained as shown in formula 28.

$$h_{ik} = \alpha u_{ik}^2 + \beta\left(d_{ik}^{-2} p_i^{-1}\right)^2 + \omega\left(u_{ik} - f_{ik}\right)^2, \quad \forall i, k \quad (28)$$


Then the expression of the cluster centers V is updated as shown in formula 29.

$$v_i = \frac{\sum_{k=1}^{N} h_{ik} x_k}{\sum_{k=1}^{N} h_{ik}}, \quad \forall i \quad (29)$$

Combining the calculation method of matrix T and matrix U, a new clustering center V can be obtained.
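A hedged PySpark sketch of formulas 28 and 29, reusing the membership helper above and assuming p (formula 25) and the broadcast centers V_b were already computed by the earlier sketches.

```python
import numpy as np

def h_and_moments(record, alpha, beta, omega):
    """Formula 28 weights h_ik for one (x, l) record, plus the weighted point,
    so that a single reduce yields both sums of formula 29."""
    x, l = record
    d2 = ((V_b.value - x) ** 2).sum(axis=1) + 1e-12
    f = np.zeros(len(d2))
    if l > 0:
        f[l - 1] = 1.0
    u = membership(record, alpha, omega)                  # formula 6
    t = (1.0 / d2) / p                                    # formula 26
    h = alpha * u**2 + beta * t**2 + omega * (u - f)**2   # formula 28
    return h[:, None] * x[None, :], h

num, den = labeled_data.map(lambda r: h_and_moments(r, 1.0, 1.0, 1.0)) \
                       .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
V_new = num / den[:, None]                                # formula 29
```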

4.4 Distributed Semi-supervised Fuzzy Possibility Clustering Based on RDD

The parallel solutions of the possibility matrix T, the membership matrix U, and the cluster-center matrix V make the distributed semi-supervised fuzzy possibilistic clustering algorithm simpler to construct. A new distributed semi-supervised fuzzy possibilistic clustering algorithm (D-SS-FPCM) is proposed. Its steps, tied together in the sketch after this list, are described as follows:

Input: the original data set X = {x_k | k = 1, 2, ..., N}, the number of categories K, the maximum number of iterations L, the matrix F composed of semi-supervised information, the threshold ε, and the related parameters α, β, ω.

Output: the center matrix V, the membership matrix U, and the possibility matrix T.

Algorithm description:

(1) Randomly initialize the cluster centers V according to K, set the current iteration number l = 0, and set the related parameters.

(2) Compute a new cluster center V_new according to formulas 27, 28, and 29.

(3) Set l = l + 1. Check whether the distance between successive cluster centers is less than the threshold or the stopping condition l > L holds. If so, jump to step (4); otherwise update the center matrix V = V_new and jump to step (2).

(4) Compute the membership matrix U in distributed parallel fashion according to formula 6.

(5) Compute the possibility matrix T in distributed parallel fashion according to formulas 25 and 26.

(6) Output the center matrix V, the membership matrix U, and the possibility matrix T.
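Putting the steps together, here is a hedged driver sketch of D-SS-FPCM that reuses the illustrative helpers from the previous sketches (membership, h_and_moments); it is one possible arrangement under those assumptions, not the paper's implementation.

```python
import numpy as np

def d_ss_fpcm(sc, records, K, L=50, eps=1e-4, alpha=1.0, beta=1.0, omega=1.0):
    """Driver for steps (1)-(6); records are (x, l) pairs with l = 0 if unlabeled."""
    global V_b, p                                    # shared with the helpers above
    data = sc.parallelize(records).cache()
    V = np.array(data.map(lambda r: r[0]).takeSample(False, K, seed=0))  # step (1)
    for _ in range(L):                               # steps (2)-(3)
        V_b = sc.broadcast(V)
        p = data.map(lambda r: 1.0 / (((V_b.value - r[0]) ** 2).sum(1) + 1e-12)) \
                .reduce(lambda a, b: a + b)          # formula 25
        num, den = data.map(lambda r: h_and_moments(r, alpha, beta, omega)) \
                       .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        V_new = num / den[:, None]                   # formulas 28-29
        done = np.abs(V_new - V).max() < eps
        V = V_new
        if done:
            break
    V_b = sc.broadcast(V)
    U = np.array(data.map(lambda r: membership(r, alpha, omega)).collect()).T  # step (4)
    T = np.array(data.map(
        lambda r: (1.0 / (((V_b.value - r[0]) ** 2).sum(1) + 1e-12)) / p
    ).collect()).T                                   # step (5), formulas 25-26
    return V, U, T                                   # step (6)
```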
