[IEEE 2008 Tenth IEEE International Symposium on Multimedia (ISM) (Formerly MSE) - Berkeley, CA, USA (2008.12.15-2008.12.17)] 2008 Tenth IEEE International Symposium on Multimedia

An Inductive Logic Programming-Based Approach for TV Stream Segment Classification

Gaël Manson, Sid-Ahmed Berrani
Orange Labs – France Telecom R&D

35510 Cesson-Sévigné, France.
{gael.manson, sidahmed.berrani}@orange-ftgroup.com

Abstract

This paper proposes a method for classifying TV stream segments as long programs or inter-programs (IPs). As almost all IPs are broadcast several times in the TV stream, a first segmentation step based on repeated sequence detection is performed. The resulting segments (the occurrences of repeated sequences and the rest of the stream) then have to be classified. The proposed solution for this is based on Inductive Logic Programming. In addition to the intrinsic features of each segment (e.g. duration), our technique makes use of the relational and contextual information of the segments in the stream. The effectiveness of our solution has been shown on a real 6-day TV stream, and a comparative study with an SVM-based classification approach has been performed.

1. Introduction

The availability of a large number of TV streams raises many technical issues regarding their storage and indexing. TV streams provide content (mainly programs) that can be used to build novel services like TV-on-Demand. Basically, to be useful, a TV stream first has to be segmented in order to precisely determine the start and the end of each program. Programs then have to be labeled and indexed. These processing steps, and particularly the segmentation step, are very time-consuming when performed manually.

One might think that TV stream segmentation could rely on broadcast metadata like the Electronic Program Guide (EPG). These are unfortunately imprecise and not always available. The interested reader can refer to [2] for a detailed analysis of this issue and of the difficulties TV channels face in providing accurate metadata.

TV streams therefore have to be analyzed and automatically segmented using only the audio-visual signal. Three hierarchical levels representing the TV stream are used (cf. Figure 1). The first level (L1) consists in segmenting the stream into video shots. In the second level (L2), consecutive shots are grouped together in order to create homogeneous segments. Each segment is then classified as an inter-program (IP) or a long useful program (or part of a program). IPs are short sequences and include commercials, trailers, jingles and very short commercial news or games. In the last level (L3), programs are more precisely annotated: a title and a summary are added to each of them.

In this paper, we focus on level L2. For L1, satisfactory solutions exist for video shot detection [12]. As for L3, it is mainly based on metadata. If these are not available, there is no other solution than manually annotating the programs. Level L2 is also the most time-consuming and is crucial for the whole process. It has to be performed automatically, efficiently and accurately, i.e. the boundaries of program and IP segments have to be precisely determined.

There are different approaches for building the L2 structure. As long useful programs are heterogeneous and do not share any common feature, the existing approaches mainly rely on detecting IPs. Programs are then detected as the rest of the stream. These approaches can be classified into three categories:

1. Detection-based techniques use intrinsic features of the IPs like separating monochrome frames, audio changes, action and the presence of logos [7, 4, 1]. All these approaches are limited to one kind of IP (mainly commercials) and are thus not sufficient to achieve TV stream segmentation.

2. Reference-database-based techniques store previously labeled IPs in a database. IPs in the stream are then detected using a content-based matching technique [9]. Audio or video fingerprinting or perceptual hashing [10] can also be used. However, the database first has to be created manually for each TV channel, as IPs differ from one channel to another. It then has to be periodically updated, as IPs of the same channel change over time.

3. Repetition-based techniques detect IPs as near-identical repeated audio/video sequences in the TV stream. Indeed, IPs are broadcast several times a day in the stream. In [5], a hashing-based solution is proposed to detect repeated shots using video features. In [6], a correlation study of audio features is used to find near-identical sequences of a pre-defined size within a buffer. In [2], a clustering-based approach is proposed. It relies on grouping similar keyframes using visual features. In all these approaches, a post-processing step is required to select, from the detected repeated sequences, those that are actually IPs.

978-0-7695-3454-1/08 $25.00 © 2008 IEEE. DOI 10.1109/ISM.2008.62

Figure 1. TV stream hierarchical representation.

Repetition-based techniques (3) are the most promising approaches for automatic TV stream segmentation into programs and IPs. However, these techniques can only detect sequences that repeat in the stream. Some of these sequences are actually IPs and others belong to programs (e.g. flashbacks, opening credits and news reports). Moreover, some IPs do not repeat. Therefore, in order to build a solution for performing the L2 structure using repeated sequence detection, an additional step that classifies the segments created from repetitions is required. This is the main concern of this paper.

The main contribution of this paper is a novel and efficient TV stream segment classification technique. This technique uses our previously proposed repeated sequence detection to build a fully automatic solution for TV stream segmentation into programs/IPs. Our classification solution makes use of basic features like the number of times a sequence is repeated, its duration, etc. It also uses more elaborate features to correctly classify specific sequences like opening credits. These may be detected as repeated sequences but should be classified as parts of programs. Finally, it relies on inductive logic programming (ILP) in order to build a classifier that can easily take prior knowledge into account and model neighboring relationships between segments.

The rest of the paper is organized as follows. Section 2 presents the proposed solution. The experimental study we conducted to show the effectiveness of our solution is then presented in Section 3. Section 4 concludes the paper.

2. The Proposed Solution

The ultimate objective of our solution is to segment the stream into program and IP segments. As introduced earlier, our solution relies on a first segmentation step based on repeated sequence1 detection. A feature vector is then computed for each segment, and an ILP classification is performed in order to classify segments as programs or IPs. This classifier is trained during an off-line phase. The novelty of our approach is twofold. First, it is fully automatic, as it is based on an automatic repeated sequence detection step. Second, it makes use of the contextual and relational information of each segment. A segment is thus classified using additional information about its position within the TV stream and its neighborhood. The contextual and relational information is represented in the feature vector of the segment as well as in the ILP classification rules.

In the three following sections, we first briefly present the repeated sequence detection technique and the resulting segmentation. We then introduce the segment feature vectors. Finally, we describe the ILP-based classification.

2.1. Repetition-based segmentation

This step relies on the clustering-based repeated sequence detection technique proposed in [2]. This technique first clusters DCT-based 30-dimensional visual descriptors of keyframes. The resulting clusters gather within the same group near-identical keyframes, and thus their corresponding near-identical shots. Each subset of clusters that is likely to generate a repeated sequence is then selected. Amongst other criteria, clusters within the same subset must have the same number of keyframes, and the order of keyframes within occurrences must be coherent. A final refinement step validates and extends potential repeated sequences and delimits them exactly. This approach does not make any hypothesis on the frequency or the size of repeated sequences.

1 Repeated sequences are, in this context, near-identical audiovisual sequences.


In order to segment a TV stream, this technique needs to accumulate a certain amount of audio-visual content. This content is analyzed and a set of repeated sequences is detected. These sequences are used to segment the stream: each occurrence of a repeated sequence is considered as a segment, and each gap between two consecutive segments is also a segment. Each of the resulting segments can hence be part of a program or of an IP.
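The segmentation step described above can be sketched as follows; the interval representation and the `segment_stream` helper are our own illustrative assumptions, not part of the paper:

```python
def segment_stream(stream_end, occurrences):
    """Build segments from repeated-sequence occurrences.

    occurrences: non-overlapping (start, end) intervals, in seconds.
    Each occurrence is a segment; each gap between consecutive
    occurrences (and before the first / after the last one) is also
    a segment. Both kinds may belong to a program or to an IP.
    """
    segments = []
    cursor = 0
    for start, end in sorted(occurrences):
        if start > cursor:                      # gap -> non-repeating segment
            segments.append((cursor, start, "gap"))
        segments.append((start, end, "occurrence"))
        cursor = end
    if cursor < stream_end:                     # trailing gap
        segments.append((cursor, stream_end, "gap"))
    return segments
```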

2.2. Segment description

We propose three kinds of features to characterize a segment: local, contextual and relational features.

Local features

These features describe a segment independently of the other segments. We use only the duration of the segment and the number of times it repeats in the stream (0 if the segment does not repeat).

Contextual features

These features take into account the context of the segment. We define the following features:

• If the segment is an occurrence of a repeated sequence, we compute the mean of the number of repetitions of the segments that immediately follow each occurrence. This is illustrated in Figure 2. This feature applies to occurrences {A1, A2} and is equal to (|B|+|C|)/2, where |X| denotes the number of occurrences of segment X. In the same manner, we also compute the mean of the number of repetitions of the segments that immediately precede each occurrence (segments D and E in Figure 2). We propose this feature in order to help discriminate opening/closing credits within the set of repeated sequences. Indeed, when this feature is null, the segment always lies before/after a long segment that does not repeat and that is most likely a long program. This feature may also help to classify segments that always lie between other very frequent repeated segments.

• As features, we also consider the local and contextual features of adjacent segments, i.e. segments in the neighborhood of the considered segment.
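As an illustration, the first contextual feature above (mean repetition count of the segments following each occurrence) might be computed as below; the list-based segment representation is a hypothetical choice of ours:

```python
def mean_following_repetitions(occ_indices, repetitions):
    """Contextual feature: mean repetition count of the segments
    immediately following the occurrences of a repeated sequence.

    occ_indices: indices (in stream order) of the occurrences A1..An.
    repetitions: repetitions[i] = number of times segment i repeats
                 in the stream (0 if it does not repeat).
    """
    followers = [repetitions[i + 1] for i in occ_indices
                 if i + 1 < len(repetitions)]
    # A null mean indicates occurrences that always precede a
    # non-repeating (likely long program) segment.
    return sum(followers) / len(followers) if followers else 0.0
```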

Relational features

These features are the classes of the neighboring segments. Therefore, they apply only when at least one of the neighboring segments has already been classified. They allow the classifier to take the class of neighboring segments into account.

Figure 2. Illustration of the segments adjacent to the occurrences of a repeated sequence.

2.3. Segment classification

As explained before, we have chosen an ILP classifier in order to model the relational information. ILP can directly manage complex logical relationships between segments and returns explicit rules in the form of first-order logic. Prior knowledge can also easily be taken into account: it just has to be encoded as first-order logical rules and added to the background knowledge. However, as ILP does not handle numerical data, all the local and relational features have to be transformed into symbolic attributes. Categories for symbolic attributes are defined using numerical intervals based on prior knowledge.

An ILP system builds a logical program from the background knowledge and a set of training examples represented as a set of logical facts. This logical program is composed of a set of first-order logical rules that cover all of the positive and none of the negative training examples. ILP infers rules from examples by using computational logic as the representation mechanism for hypotheses and examples. An example of a logical rule that can be learned is: "If a segment A is short and A lies between two IPs, then A is an IP".
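Purely as an illustration of what such a learned rule expresses, it could be encoded imperatively as follows; the 30-second threshold and the segment representation are our own assumptions, not the paper's:

```python
def short_between_ips(durations, labels, i, short_max=30):
    """Toy encoding of the learned rule:
    'if segment A is short and A lies between two IPs, then A is an IP'.

    durations: segment durations in seconds; labels: 'ip', 'program'
    or None (not yet classified). Returns True when the rule fires
    for segment i, i.e. when i should be labeled 'ip'.
    """
    return (durations[i] < short_max
            and i > 0 and labels[i - 1] == "ip"
            and i + 1 < len(labels) and labels[i + 1] == "ip")
```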

The neighboring relationships are hence represented in the training set by a set of facts that gives, for each segment of the stream, its following segment. They are also represented by a recursive rule that defines this relation transitively, which allows a "distance" between segments to be defined.

In our implementation, we have used Aleph (formerly P-Progol), a descendant ILP system that performs training from general to specific hypotheses [8, 11].

In the following sections, we present the rules used by our ILP classifier, the training phase and the classification phase.

2.3.1 Rules

The logical rules computed by ILP define the requirements for a segment A to belong to the class of IPs or to the class of programs. They can be sorted into four categories according to how they model segment features:

(1) Simple non-recursive rules (SNR-rules) rely only on the local and contextual features of A.


(2) Simple recursive rules (SR-rules) rely, in addition to (1), on the fact that some neighboring segments belong to the class of A.

(3) Relational non-recursive rules (RNR-rules) rely, in addition to (1), on the fact that some neighboring segments belong to a class distinct from the class of A.

(4) Relational recursive rules (RR-rules) rely, in addition to (3), on the fact that some neighboring segments belong to the class of A.

2.3.2 Training phase

In order to compute the logical rules that define IPs or programs, we first encode a part of the TV stream as a database of logical facts. This is the training set. The ILP system then infers a set of logical rules. Some of these rules are generic and very relevant. However, some other inferred rules are very specific to special cases of the training set and may confuse the classifier. Thus, in order to select relevant rules, we propose to use an additional validation phase. The learned rules are applied to the validation set and, depending on their precision, a confidence level is associated with each of them. The higher the precision, the higher the confidence level. Details on the number of considered confidence levels are given in Section 3.1.

2.3.3 Classification

The training phase provides a set of rules (SNR, SR, RNR, RR) ordered by their levels of confidence (the highest level of confidence being 0). The classification step takes into account the confidence levels and the types of the rules. Prior knowledge rules are considered the most reliable. They are hence applied first, at the beginning of each iteration. The classification phase consists of the following procedure:

(1) Apply prior knowledge rules,

(2) Select the subset Ei (initially i = 0) of the rules with confidence level i,

(3) Select and apply SNR-rules for IPs from Ei,

(4) Select and apply recursively SR-rules for IPs from Ei,

(5) Select and apply SNR-rules for programs from Ei,

(6) Select and apply recursively SR-rules for programs from Ei,

(7) Select and apply RNR-rules for IPs from Ei,

(8) Select and apply recursively RR-rules for IPs from Ei,

(9) Select and apply RNR-rules for programs from Ei,

(10) Select and apply recursively RR-rules for programs from Ei,

(11) Select the next level of confidence: i = i + 1, and continue with step (1).
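The eleven steps above can be sketched as follows; the rule representation (functions mapping a labeling to a new labeling) and all names are illustrative assumptions of ours:

```python
def classify(segments, rules_by_level, prior_rules, n_levels=2):
    """Sketch of the iterative classification procedure.

    rules_by_level[i] is a list of (kind, rule) pairs for confidence
    level i, where kind is one of 'SNR-ip', 'SR-ip', 'SNR-prog',
    'SR-prog', 'RNR-ip', 'RR-ip', 'RNR-prog', 'RR-prog' and rule is
    a function (segments, labels) -> new labels. Prior-knowledge
    rules are applied first at every iteration; recursive rules
    (SR/RR) are re-applied until the labeling stops changing.
    """
    labels = [None] * len(segments)
    order = ["SNR-ip", "SR-ip", "SNR-prog", "SR-prog",
             "RNR-ip", "RR-ip", "RNR-prog", "RR-prog"]
    for level in range(n_levels):
        for rule in prior_rules:                 # step (1)
            labels = rule(segments, labels)
        for kind in order:                       # steps (3)-(10)
            for k, rule in rules_by_level[level]:
                if k != kind:
                    continue
                if kind.startswith(("SR-", "RR-")):
                    while True:                  # recursive: fixpoint
                        new = rule(segments, labels)
                        if new == labels:
                            break
                        labels = new
                else:
                    labels = rule(segments, labels)
    return labels
```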

3. Experiments

In order to evaluate our solution, we have collected a dataset of 6 days of TV broadcasts from a French channel (denoted TV6d). It has been manually annotated: the start and the end of each program and IP have been precisely determined. Within TV6d, we have particularly focused on the period from 6:00 PM to 11:59 PM of each day. This period contains the most interesting programs (movies and prime-time TV shows). TV6d thus provides 6×6-hour TV streams. We have used 3 of these for training and the other 3 for testing.

The repeated sequence detection technique has first been applied to TV6d. A set of 1218 repeated sequences has been discovered, with a total number of 5880 occurrences. We have then manually evaluated the precision of the results on the 50 most frequent sequences and on a set of 50 other randomly chosen sequences. The observed precision was equal to one. A complete evaluation of the repeated sequence detection is given in [2]. TV6d has then been segmented following the procedure described in Section 2.1. The training set and the testing set have been segmented into 798 and 805 segments, respectively. The experiments presented in this section have been performed on the basis of this segmentation.

Precision (P) and recall (R) have been used to evaluate the classification results. They have been computed on a per-shot basis and also on segments. While the P and R measures on a per-shot basis are easy to compute, computing P and R on segments is more complicated. The following criteria have been applied:

• A segment that is classified as an IP is considered relevant if it overlaps by more than 50% with ground-truth IP segments,

• A ground-truth IP segment is considered detected if it overlaps by more than 50% with segments that have been classified as IP.
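A minimal sketch of these segment-based measures, assuming segments are represented as (start, end) intervals in seconds:

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def segment_precision_recall(detected_ips, truth_ips):
    """Segment-based P/R with the 50%-overlap criteria above.

    A detected IP segment is relevant if more than 50% of it is
    covered by ground-truth IP segments; a ground-truth IP segment
    is detected if more than 50% of it is covered by segments
    classified as IP.
    """
    def covered(seg, others):
        return sum(overlap(seg, o) for o in others) > 0.5 * (seg[1] - seg[0])
    relevant = sum(covered(d, truth_ips) for d in detected_ips)
    detected = sum(covered(t, detected_ips) for t in truth_ips)
    precision = relevant / len(detected_ips) if detected_ips else 0.0
    recall = detected / len(truth_ips) if truth_ips else 0.0
    return precision, recall
```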

In the rest of this section, we first present the evaluation results of our solution. We then present the comparative study with an SVM classifier. This experiment shows the relevance of contextual features in correctly classifying the segments. Finally, we present the results of applying our solution to the extraction of long useful programs.

3.1. Evaluation of our solution

Before evaluating our solution, we have performed two preliminary experiments. The first one consists of applying a simple duration-based filtering rule for classifying IPs: "if the duration of a segment is less than 3 minutes, then the segment is an IP". In the second experiment, we have annotated each segment w.r.t. the corresponding segment of the ground truth, applying rules similar to those proposed above for the precision and recall measures. In addition, we have considered any segment that lasts more than 5 minutes to be a program.

The first experiment has been performed in order to compare our solution to a naive one. The second one aims at evaluating the limitations of a segmentation based on repeated sequence detection. In other words, it provides the performance that could be delivered by a perfect classifier. In this case, shots and segments that are not correctly classified are due to an imperfect and ambiguous segmentation, e.g. a segment that spreads over both an IP and a program segment in the ground truth.

          Duration threshold        GT-based labeling
          Seg.-based  Shot-based    Seg.-based  Shot-based
P (%)     70.83       70.01         97.63       96.43
R (%)     98.44       98.66         98.21       98.63

Table 1. Precision and recall of a duration threshold-based classification and of ground-truth (GT)-based labeling.

The results obtained for these two experiments are presented in Table 1. They show that the duration filter catches almost all the IPs. It is, however, very imprecise: 30% of the segments that last less than 3 minutes are not IPs. As explained earlier, the second experiment gives the maximum precision and recall that can be achieved given the TV stream data and the performed segmentation.

To train our ILP classifier, the 3×6-hour training streams have been used: the first two 6-hour streams for learning the logical rules, and the last 6-hour stream for validation. As explained earlier, the validation aims at evaluating the effectiveness of the inferred rules. We have defined 3 levels of confidence, and we have used only the two highest levels during the classification phase. Rules with the lowest level of confidence have been discarded. We have also defined numeric intervals for the symbolic attributes required by ILP. For example, we have partitioned the duration domain into the following intervals: ]0, 10s[, [10s, 30s[, [30s, 1m[, [1m, 2m[, [2m, 3m[, [3m, 5m[, [5m, ∞[.
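The discretization of the duration into a symbolic attribute can be sketched as follows; the interval names are our own, only the boundaries come from the paper:

```python
# Upper bounds (in seconds) of the duration intervals used to turn
# the numerical duration into a symbolic attribute for ILP:
# ]0,10s[, [10s,30s[, [30s,1m[, [1m,2m[, [2m,3m[, [3m,5m[, [5m,inf[
BOUNDS = [10, 30, 60, 120, 180, 300]
NAMES = ["lt10s", "10s-30s", "30s-1m", "1m-2m", "2m-3m", "3m-5m", "ge5m"]

def duration_symbol(seconds):
    """Map a numerical duration onto its symbolic interval name."""
    for bound, name in zip(BOUNDS, NAMES):
        if seconds < bound:
            return name
    return NAMES[-1]
```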

This training phase has created a set of 35 rules: 22 rules associated with the highest level of confidence, 10 rules with the second level of confidence, and 3 rules with the lowest level of confidence. We have added one prior knowledge rule stating that "if a segment A lasts more than 5 minutes, then A is a program".

The inferred rules have then been applied to the 3×6-hour testing streams. The results, summarized in Table 2, clearly show the effectiveness of our solution. We can also see that these results are close to those obtained using the GT-based labeling (Table 1).

          ILP
          Seg.-based  Shot-based
P (%)     93.39       93.51
R (%)     95.98       97.12

Table 2. Evaluation of the ILP classifier.

3.2. Comparative Study: ILP vs. SVM

In order to assess the performance of our classifier, we have performed a comparative study with a Support Vector Machine (SVM) classifier. The objective of this study is to show the effectiveness of our solution and also to highlight the importance of relational and contextual features.

The SVM is a widely used classifier. It uses a kernel function to map the data points into a higher-dimensional space where a hyperplane is computed to separate and classify them. SVMs are well suited to program/IP classification, as they are non-linear binary classifiers. They are also able to handle high-dimensional and imprecise data. However, they do not explicitly model the contextual features of neighboring segments. Contextual features can only be taken into account if the feature vectors of neighboring segments are added to the features of the considered segment.

In our implementation, we have used an SVM with the Radial Basis Function kernel. In addition, we have performed a grid-based search using cross-validation through the Libsvm library [3] to obtain the optimal SVM parameters.

In order not to penalize the SVM, we have provided it with the same prior knowledge used within the ILP classifier. We have also used the SVM with the same symbolic features used with ILP. We have chosen a binary representation: if a symbolic feature can take m different values, a binary word of m bits is used to represent that feature. We have also applied the SVM directly on numerical features. The obtained F-measure was overall 1% to 2% lower than with symbolic features.
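The binary representation can be sketched as follows; the feature names and category sets are illustrative, not taken from the paper:

```python
def one_hot(value, categories):
    """Encode one symbolic feature as a binary word of
    len(categories) bits, as described above for the SVM input."""
    return [1 if value == c else 0 for c in categories]

def encode(segment, schema):
    """Concatenate the one-hot codes of all symbolic features.

    segment: dict feature -> symbolic value;
    schema: list of (feature, categories) pairs fixing the bit order.
    """
    vec = []
    for feature, categories in schema:
        vec.extend(one_hot(segment[feature], categories))
    return vec
```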

First, we have considered only the intrinsic features of the segments (local and contextual features, without neighboring segments). We denote this experiment as SVM±0. We have then performed a set of experiments in which the feature vector of a segment is composed of its own features concatenated with the features of the neighboring segments. SVM±N denotes the experiment where the features of the N previous and the N following segments are taken into account. The obtained results are presented in Table 3.

This table shows that without contextual features, the classification performance is not very good, in particular regarding precision. It also shows that considering the features of neighboring segments improves the performance. The best result is obtained with SVM±1, where the F-measure on a per-shot basis equals 94.56%. It is, however, still lower than the performance of our ILP classifier (F-measure = 94.67%).

        SVM±0         SVM±1         SVM±2         SVM±3         SVM±4         SVM±5
        Seg.   Shot   Seg.   Shot   Seg.   Shot   Seg.   Shot   Seg.   Shot   Seg.   Shot
P (%)   90.83  89.71  92.77  93.67  92.25  92.10  92.94  93.08  92.63  92.80  92.46  92.60
R (%)   94.64  95.58  96.43  97.42  96.20  97.18  95.54  96.96  96.20  97.15  96.20  97.15

Table 3. Evaluation of the SVM±N classifiers.

This experiment also highlights an important advantage of ILP over SVM. In addition to taking relational features into account, ILP models the contextual features automatically and dynamically, whereas the SVM classifier requires a predefined, fixed neighborhood that has to be set beforehand.

3.3. Application to TV program extraction

As explained in the introduction, the classification of segments is only a step toward the automatic extraction of programs from the TV stream, i.e. automatically determining the start and the end of each program and labeling the programs using metadata. In this experiment, we have used our solution to segment the stream into program/IP segments. We have also used the EPG metadata, and we have focused on the 3×6-hour testing part of the TV6d dataset.

We have fused consecutive segments that have been classified as IPs. We have also fused consecutive segments that have been classified as programs, as well as program segments separated by a very short IP segment. The resulting program segments have then been labeled using a straightforward matching procedure based on the EPG metadata: a segment that overlaps with a program from the EPG is labeled with its label. We have thus retrieved 8 programs. We have then studied the accuracy of the obtained program segments w.r.t. the ground truth. The obtained results are summarized in Table 4. This table presents the mean (μ) and the standard deviation (σ) of the extraction imprecision for the start and end times of the long useful programs. The imprecision here refers to the absolute value of the difference between the obtained start (resp. end) time and the ground-truth start (resp. end) time. This table also provides the accuracy of an EPG-based program extraction.

            Start time                End time
            μ            σ            μ           σ
Our Sol.    5.6 s        12.8 s       11.7 s      17.7 s
EPG         2 m 14.0 s   1 m 4.2 s    4 m 6.5 s   3 m 48.2 s

Table 4. Accuracy of program extraction.

Overall, these results show that our approach is very ac-curate and outperforms the metadata-based approach.
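The fusion step described above can be sketched as follows; the gap threshold and the segment representation are illustrative assumptions of ours:

```python
def fuse_program_segments(segments, max_ip_gap=60):
    """Fuse consecutive program segments, bridging very short IP gaps.

    segments: stream-ordered (start, end, label) triples with label
    'program' or 'ip'. Two program segments are fused when the IP
    material separating them is at most max_ip_gap seconds long.
    Returns the fused (start, end) program segments.
    """
    programs = []
    for start, end, label in segments:
        if label != "program":
            continue
        if programs and start - programs[-1][1] <= max_ip_gap:
            programs[-1] = (programs[-1][0], end)   # bridge short IP gap
        else:
            programs.append((start, end))
    return programs
```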

4. Conclusion

In this paper, we have presented how a repeated sequence detection technique, combined with an ILP classification technique and a suitable set of feature vectors, can be used to effectively segment a TV stream into program/IP segments. The contribution of the paper concerns in particular the classification step. We have shown the importance of the relational and contextual features, and the effectiveness of the ILP classifier in modeling them. The technique has also been successfully applied to TV stream program extraction.

Future extensions will study how the ILP classifier can be used to perform a finer classification of IPs, i.e. classifying IPs into trailers, commercials, etc.

References

[1] A. Albiol, M. J. Ch. Fulla, A. Albiol, and L. Torres. Detection of TV commercials. In Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (vol. 3), Montreal, Canada, May 2004.

[2] S.-A. Berrani, G. Manson, and P. Lechat. A non-supervised approach for repeated sequence detection in TV broadcast streams. Signal Processing: Image Communication, special issue on "Semantic Analysis for Interactive Multimedia Services", 23(7):525–537, 2008.

[3] C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2008.

[4] L.-Y. Duan, J. Wang, Y. Zheng, J. S. Jin, H. Lu, and C. Xu. Segmentation, categorization, and identification of commercial clips from TV streams using multimodal analysis. In Proc. of the 14th ACM Int. Conf. on Multimedia, Santa Barbara, CA, USA, October 2006.

[5] J. M. Gauch and A. Shivadas. Finding and identifying unknown commercials using repeated video sequence detection. Journal of Computer Vision and Image Understanding, 103(1):80–88, 2006.

[6] C. Herley. Argos: automatically extracting repeating objects from multimedia streams. IEEE Transactions on Multimedia, 8(1):115–129, 2006.

[7] R. Lienhart, C. Kuhmünch, and W. Effelsberg. On the detection and recognition of television commercials. In Proc. of the IEEE Int. Conf. on Multimedia Computing and Systems, Ottawa, Canada, June 1997.

[8] S. Muggleton. Inverse entailment and Progol. New Generation Computing, special issue on Inductive Logic Programming, 13(3–4):245–286, 1995.

[9] X. Naturel, G. Gravier, and P. Gros. Fast structuring of large television streams using program guides. In Proc. of the 4th Int. Workshop on Adaptive Multimedia Retrieval, Geneva, Switzerland, July 2006.

[10] J. Oostveen, T. Kalker, and J. Haitsma. Feature extraction and a database strategy for video fingerprinting. In Proc. of the 5th Int. Conf. on Recent Advances in Visual Information Systems, Hsin Chu, Taiwan, March 2002.

[11] A. Srinivasan. Aleph: A learning engine for proposing hypotheses. http://web2.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.pl, 2007.

[12] E. Stringa and C. Regazzoni. Real-time video-shot detection for scene surveillance applications. IEEE Transactions on Image Processing, 9(1):69–79, 2000.
