[IEEE 2009 10th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS) - London,...



Page 1: [IEEE 2009 10th Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS) - London, United Kingdom (2009.05.6-2009.05.8)] 2009 10th Workshop on Image Analysis for Multimedia

REPETITION DENSITY-BASED APPROACH FOR TV PROGRAM EXTRACTION

Gaël Manson and Sid-Ahmed Berrani

Orange Labs - France Telecom R&D
4, rue du Clos Courtel, BP 91226
35510 Cesson-Sévigné, France

ABSTRACT

This paper addresses the problem of automatically extracting programs from broadcast TV streams. The task consists first of precisely determining the start and the end of each broadcast TV program, and then of giving each program a name. The extracted programs can be used to build novel services such as TV-on-Demand. The proposed solution is based on studying the density of repeated audiovisual sequences. This study makes it possible to sort out most of the inter-programs from the repeated sequences. The effectiveness of our solution has been shown on two distinct real TV streams lasting 5 days. A comparative evaluation against traditional approaches (metadata-based and silences-and-monochrome-frames-based) has also been performed.

1. INTRODUCTION

TV-on-Demand is a novel service that aims to make previously broadcast long TV programs available anytime and anywhere. This service therefore needs to extract and store TV programs. Manually extracting TV programs from TV streams is a hard, tedious and very time-consuming task. As a consequence, automatic and efficient techniques are required.

TV channels can in principle know the exact start and end times of the programs they broadcast. Unfortunately, most TV broadcast chains are too complex and not standardized, so the included metadata does not remain coherent and complete until the end of the chain. In addition, TV channels may refuse to provide this information for commercial reasons. As an example, the metadata broadcast with the TV stream and included by the TV channels, namely the EPG (Electronic Program Guide) or the EIT (Event Information Table), provide approximate start and end times and titles for some TV programs. They are, however, not always available, and they are imprecise and incomplete [1].

TV program extraction aims to precisely determine the start and the end times of each broadcast TV program. This paper addresses how to perform this extraction automatically. Its main contribution is an efficient and unsupervised approach that relies on studying the density of repeated audiovisual sequences in the TV stream.

A set of supervised and unsupervised techniques related to TV program extraction has already been proposed. Most of these techniques rely on detecting inter-programs (such as commercials or trailers), which are broadcast between two parts of a TV program or between two TV programs. If all inter-programs are properly detected, TV programs (or parts of them) can be easily deduced.

The supervised techniques require a set of manually annotated data. This can be annotated broadcast video sequences [2] used for perceptual hashing-based recognition. Equally, audio or video fingerprinting can be used [3]. Annotated data can also be more than one year of past manually created TV program guides, which are used to learn and to model the TV program guide [4]. The main drawback is that the annotated database has to be manually created for each TV channel and then periodically updated.

There are two kinds of unsupervised techniques:

1. The detection-based techniques use the intrinsic features of the inter-programs, such as separating monochrome frames, audio changes, action and presence of logos [5, 6]. All these approaches are limited to one kind of inter-program (mainly commercials) and are thus not sufficient to achieve a good TV stream segmentation.

2. The repetition-based techniques detect inter-programs as near-identical audiovisual sequences in the TV stream. Indeed, most inter-programs are broadcast several times. In [7], a hashing-based solution is proposed to detect repeated shots. In [8], a correlation study of audio features is used to find near-identical sequences of a pre-defined size within a buffer. In [9], a clustering-based approach is proposed. It relies on grouping similar keyframes using visual features.

These last unsupervised techniques are the most promising. However, a post-processing step is required to select, from the detected repeated sequences, those that are actually inter-programs and that can lead to an accurate automatic TV stream segmentation.

The rest of the paper is organized as follows. Section 2 presents our repetition density-based TV stream segmentation for TV program extraction. The experimental study we conducted to show the effectiveness of our approach is presented in Section 3. Finally, Section 4 concludes the paper and discusses future extensions.

978-1-4244-3610-1/09/$25.00 ©2009 IEEE. WIAMIS 2009


2. THE PROPOSED SOLUTION

The general working scheme of our solution is the following: the TV stream is first accumulated until a sufficient amount is available, and then it is continuously received and periodically processed. This unsupervised TV program extraction process is composed of three steps: repeated sequence detection, TV stream segmentation using the repetition density, and segment annotation.

The main contributions of this paper concern the TV stream segmentation step and the experiments validating our approach.

2.1. Repeated sequence detection

The repeated sequence detection technique we propose to use is the one presented in [9]. In this context, repeated audiovisual sequences are near-identical audiovisual sequences. They are detected by a micro-clustering approach that first groups near-identical keyframes using DCT-based 30-dimensional visual descriptors. The similarities of the temporal diversity of keyframes within the micro-clusters are then analysed to create the repeated sequences.

A repeated sequence consists of a set of occurrences. We denote by $O$ the set of all the occurrences of all the detected repeated sequences and by $R$ the set of all the repeated sequences:

$$\forall x \in O,\quad RS(x) \in R \text{ is the repeated sequence to which } x \text{ belongs}$$

$$\forall x \in O,\quad IP(x) = \begin{cases} 1 & \text{if } x \text{ is an inter-program} \\ 0 & \text{otherwise} \end{cases}$$

2.2. TV stream segmentation using the repetition density

The goal of our solution is to segment the TV stream into programs. We represent the TV stream as a succession of consecutive TV programs. Two consecutive TV programs may or may not be separated by a break, and each program may or may not contain a break. A break is composed of a succession of one or more inter-programs, which can be trailers, jingle logos, opening/closing commercial break credits or commercials. With this representation of the TV stream, our objective is to determine the start (resp. end) of each TV program, i.e. the start (resp. end) of its first (resp. last) part.

Our solution finds parts of TV programs by detecting the breaks in the TV stream, which are themselves detected through their inter-programs. As explained in the introduction, the most promising approach to detect inter-programs is to use their repetition property. Indeed, almost all inter-programs are broadcast several times in the stream. This hypothesis is validated in [10], where relevant statistical data on the repetition of inter-programs are provided. However, the existing repeated sequence detection technique only identifies sequences that repeat in the stream, whatever their nature. Some of the detected sequences are actually inter-programs and others belong to programs (e.g. flashbacks, opening credits and news reports). Therefore, inter-programs have to be sorted out from the whole set of detected repeated sequences.

We propose a technique to classify most of the repeated sequences. The main idea behind our work is based on prior knowledge of TV streams. It relies on three hypotheses:

• (H1) An occurrence $x$ of a repeated sequence that is surrounded by many other occurrences of repeated sequences is most likely inside a break with other inter-programs. It is then considered as an inter-program. We define $d_w(x)$, the repetition density around $x$, as the number of repeated sequence occurrences within a given time window $w$ centered on $x$. Given a predefined threshold $t_d$, we propose the following classification rule:

$$\forall x \in O,\quad d_w(x) > t_d \Rightarrow IP(x) = 1$$

• (H2) The repeated sequence occurrences in the neighborhood (defined by $t_l$) of an inter-program sequence are also inter-programs:

$$\forall x \in O \text{ such that } IP(x) = 1,\ \forall y \in O,\quad \|x - y\| < t_l \Rightarrow IP(y) = 1$$

• (H3) If an occurrence of a repeated sequence has been classified as an inter-program, then all the other occurrences of the same repeated sequence are also inter-programs:

$$\forall x \in O \text{ such that } IP(x) = 1,\ \forall y \in O,\quad RS(x) = RS(y) \Rightarrow IP(y) = 1$$

From these hypotheses, we have built a repetition density-based inter-program filter. For each occurrence of each repeated sequence, the repetition density is computed on the given time window w, and the occurrences with a density greater than the threshold td are considered as inter-programs (H1). By extension (H3), all occurrences of a repeated sequence that contains an inter-program occurrence are inter-programs. Moreover (H2), the repeated sequence occurrences neighboring an inter-program occurrence are also inter-programs. Neighboring occurrences of x are occurrences y whose distance ‖x − y‖ in the stream is less than tl seconds.
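The filter described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the `Occurrence` type, the function names, the choice to count x itself in its own density, and the single pass in the order H1, H3, H2 (with no further iteration) are all assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    time: float   # position of the occurrence in the stream, in seconds (assumed)
    seq_id: int   # identifier of the repeated sequence RS(x) (assumed)


def density(x, occurrences, w):
    """d_w(x): occurrences inside a window of width w centered on x.

    Assumption: x itself is counted.
    """
    half = w / 2.0
    return sum(1 for y in occurrences if abs(y.time - x.time) <= half)


def classify_inter_programs(occurrences, w, td, tl):
    """Return the set of occurrences flagged as inter-programs."""
    # H1: high repetition density around the occurrence.
    ip = {x for x in occurrences if density(x, occurrences, w) > td}

    # H3: every occurrence of an already flagged repeated sequence.
    flagged_ids = {x.seq_id for x in ip}
    ip |= {y for y in occurrences if y.seq_id in flagged_ids}

    # H2: occurrences closer than tl seconds to a flagged occurrence.
    flagged_times = [x.time for x in ip]
    ip |= {y for y in occurrences
           if any(abs(y.time - t) < tl for t in flagged_times)}
    return ip


occurrences = [
    Occurrence(100.0, 1), Occurrence(110.0, 2),
    Occurrence(120.0, 3), Occurrence(130.0, 4),   # dense region
    Occurrence(5000.0, 1),                        # same sequence as the first
    Occurrence(5010.0, 5),                        # neighbor of the previous one
    Occurrence(9000.0, 6), Occurrence(9500.0, 6), # isolated repeats
]
flagged = classify_inter_programs(occurrences, w=60.0, td=2, tl=20.0)
flagged_times = sorted(x.time for x in flagged)
```

In this toy stream the four clustered occurrences are flagged by H1, the occurrence at 5000 s by H3, and the one at 5010 s by H2, while the isolated repeats of sequence 6 remain unflagged.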

Parameters td and tl have to be set from prior knowledge for each TV channel. They are then empirically adjusted.

Figure 1 shows the repetition density computed on 6 hours of a real TV stream. The grey negative rectangles represent the breaks in the stream. The black positive histograms represent the computed repetition density. The dashed line represents the density threshold td. This figure shows that high repetition density regions match real breaks (H1). The breaks that do not match any high repetition density region can be detected using hypothesis H3.

Most inter-programs are detected by our repetition density-based filter. Neighboring detected inter-programs (H2) are merged to build the breaks in the TV stream. As a result, the gaps between two consecutive breaks form the program segments.
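The merging of neighboring inter-programs into breaks, and the derivation of program segments from the gaps, can be illustrated with the sketch below. It assumes that each detected inter-program is available as a (start, end) interval in seconds and that the neighborhood threshold tl is reused as the merging distance; neither detail is specified in the text, and the function names are hypothetical.

```python
def build_breaks(ip_intervals, tl):
    """Merge inter-program intervals separated by less than tl seconds."""
    breaks = []
    for start, end in sorted(ip_intervals):
        if breaks and start - breaks[-1][1] < tl:
            # Close enough to the previous break: extend it.
            breaks[-1][1] = max(breaks[-1][1], end)
        else:
            breaks.append([start, end])
    return [tuple(b) for b in breaks]


def program_segments(breaks, stream_start, stream_end):
    """Gaps between consecutive breaks become program segments."""
    segments = []
    cursor = stream_start
    for start, end in breaks:
        if start > cursor:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < stream_end:
        segments.append((cursor, stream_end))
    return segments


ip_intervals = [(100.0, 130.0), (135.0, 160.0), (1000.0, 1030.0)]
breaks = build_breaks(ip_intervals, tl=10.0)
segments = program_segments(breaks, stream_start=0.0, stream_end=2000.0)
```

Here the first two inter-programs are 5 s apart and are merged into a single break, which leaves three program segments in the toy stream.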


[Figure 1: plot of repetition density (vertical axis, 0 to 3) against time in hours (horizontal axis, 0 to 6).]

Fig. 1. Repetition density computed on 6 hours of TV stream (black positive histograms). Gray negative rectangles are the real positions of breaks.

2.3. Segment annotation

The previous segmentation step provides an over-segmentation of the TV stream. The resulting segments then have to be merged and annotated in order to extract the full TV programs. To label the segments automatically, the straightforward approach is to use the metadata broadcast with the TV stream, such as the EPG or the EIT. Algorithms such as Dynamic Time Warping [2] can be used to merge and annotate the segments from the metadata. However, this approach heavily relies on the metadata, and its effectiveness mainly depends on their reliability. It requires at least complete and consistent metadata, which is not the case in practice. A deeper analysis of the weaknesses of TV metadata is given in [1].

Therefore, in order to reduce the reliance on metadata, only three simple rules are used to perform segment annotation: (1) three consecutive segments are merged if the middle segment lasts less than 60 seconds, (2) a detected segment is labeled with the name of the metadata segment that has the best overlap with it, (3) consecutive segments with the same label are merged.
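The three rules can be sketched as below. This is an illustrative reading rather than the authors' code: "best overlap" is taken here as the largest overlap duration, segments are (start, end) pairs in seconds, metadata entries are (start, end, name) triples, and all function names are hypothetical.

```python
def merge_short_middles(segments, min_len=60.0):
    """Rule 1: merge three consecutive segments when the middle one is short."""
    segs = list(segments)
    i = 1
    while i < len(segs) - 1:
        start, end = segs[i]
        if end - start < min_len:
            # Replace (previous, middle, next) by one segment spanning all three.
            segs[i - 1:i + 2] = [(segs[i - 1][0], segs[i + 1][1])]
        else:
            i += 1
    return segs


def overlap(a, b):
    """Duration shared by two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def label_segments(segments, metadata):
    """Rule 2: label each segment with the best-overlapping metadata entry."""
    labeled = []
    for seg in segments:
        best = max(metadata, key=lambda m: overlap(seg, (m[0], m[1])),
                   default=None)
        has_overlap = best is not None and overlap(seg, (best[0], best[1])) > 0
        labeled.append((seg[0], seg[1], best[2] if has_overlap else None))
    return labeled


def merge_same_label(labeled):
    """Rule 3: merge consecutive segments that carry the same label."""
    merged = []
    for start, end, name in labeled:
        if merged and name is not None and merged[-1][2] == name:
            merged[-1] = (merged[-1][0], end, name)
        else:
            merged.append((start, end, name))
    return merged


segments = [(0.0, 600.0), (620.0, 650.0), (700.0, 1500.0),
            (1600.0, 2400.0), (2500.0, 3000.0)]
metadata = [(0.0, 1550.0, "News"), (1550.0, 3000.0, "Movie")]
programs = merge_same_label(
    label_segments(merge_short_middles(segments), metadata))
```

In this toy example the 30-second middle segment is absorbed by rule 1, and rule 3 then joins the two segments labeled "Movie".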

Experiments will show that these basic rules are sufficient to achieve a very accurate TV program extraction.

3. EXPERIMENTS

To evaluate our approach, we have performed a set of experiments using real TV broadcast streams from two different channels recorded during 5 days: a French public TV channel (Cpub) and a French private TV channel (Cpv). In order to conduct the following experiments, we have created a ground truth on Cpub and Cpv in which TV programs have been precisely segmented and annotated. A set of 47 TV programs has been labeled on Cpub and 56 on Cpv. Out of the 120 h of recorded TV stream, the total duration of breaks is 12h 02m 15s on Cpub and 17h 06m 09s on Cpv.

The results are evaluated using the following criteria:

1. the number of extracted programs (All),

2. the number of valid extracted programs (Ok) which are correctly labeled,

3. the number of valid extracted programs (2s) with an imprecision of the start and of the end less than 2 sec,

4. the number of valid extracted programs (10s) with an imprecision of the start and of the end less than 10 sec,

5. the mean (μ) and the standard deviation (σ) of the imprecision of the extracted programs.

The imprecision here means the absolute difference between the obtained start (resp. end) time and the accurate start (resp. end) time given by the ground truth. The imprecision is only evaluated on the valid extracted programs. Within both Cpv and Cpub, we have focused on the long TV programs in the period between 11 am and 12 pm, which contains the most interesting programs (series, movies and prime-time TV shows).
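The criteria above can be computed as in the following sketch. It assumes that an extracted program is valid when its label matches a ground-truth program name, that names are unique, and that σ is a population standard deviation; the text does not state these details, and the data below is invented purely for illustration.

```python
from statistics import mean, pstdev


def evaluate(extracted, truth):
    """extracted: list of (name, start, end); truth: dict name -> (start, end)."""
    # Assumption: a program is valid (Ok) when its label exists in the ground truth.
    valid = [(name, s, e) for name, s, e in extracted if name in truth]
    start_err = [abs(s - truth[name][0]) for name, s, _ in valid]
    end_err = [abs(e - truth[name][1]) for name, _, e in valid]

    def within(limit):
        # Both the start and the end must be closer than the limit.
        return sum(1 for a, b in zip(start_err, end_err)
                   if a < limit and b < limit)

    def stats(errors):
        return (mean(errors), pstdev(errors)) if errors else (None, None)

    return {
        "All": len(extracted),
        "Ok": len(valid),
        "2s": within(2.0),
        "10s": within(10.0),
        "start": stats(start_err),   # (mu, sigma) of the start imprecision
        "end": stats(end_err),       # (mu, sigma) of the end imprecision
    }


# Invented toy data, not taken from the experiments.
truth = {"News": (0.0, 1800.0), "Movie": (2000.0, 8000.0)}
extracted = [("News", 1.0, 1801.0), ("Movie", 2005.0, 8020.0),
             ("Unknown", 9000.0, 9500.0)]
report = evaluate(extracted, truth)
```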

In order to perform a comparative study of our solution (Our. Sol.), we have considered two other solutions: (1) a metadata-based solution (Meta.) and (2) a monochrome-frames-based solution (Monoch.).

The metadata-based solution uses the approximate start times, end times and names given in the EPG, when available.

The monochrome-frames-based solution first computes the intersection between silences in the audio TV stream and monochrome frames in the video TV stream, as in [2]. Then, any interval of more than 60 seconds between two consecutive detected intersections is considered as a program segment.
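A minimal sketch of this baseline is given below. It assumes that silences and monochrome frames are already available as sorted lists of disjoint (start, end) intervals in seconds; the detectors themselves are outside the scope of the example, and the function names are hypothetical.

```python
def intersect(a, b):
    """Intersection of two sorted lists of disjoint (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance the list whose current interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def segments_between(separators, min_gap=60.0):
    """Keep the gaps longer than min_gap between consecutive separators."""
    segs = []
    for (_, prev_end), (next_start, _) in zip(separators, separators[1:]):
        if next_start - prev_end > min_gap:
            segs.append((prev_end, next_start))
    return segs


silences = [(99.0, 101.0), (500.0, 502.0), (530.0, 531.0), (2000.0, 2003.0)]
monochrome = [(100.0, 100.5), (501.0, 503.0), (530.2, 530.8),
              (1500.0, 1501.0), (2001.0, 2002.0)]
separators = intersect(silences, monochrome)
baseline_segments = segments_between(separators)
```

The monochrome interval at 1500 s has no matching silence and is therefore ignored, and the 28-second gap around 510 s is too short to count as a program segment.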

We have also built a merged solution (Both.) that combines the monochrome-frames-based detected breaks and the repetition density-based detected breaks.

3.1. Evaluation of our solution on Cpub

The repeated sequence detection technique has first been applied. A set of 477 repeated sequences has been discovered on Cpub, with a total of 2001 occurrences. We have counted 210 repeated commercial sequences with an average of 4.92 occurrences. We have also counted 34 trailers with an average of 7.41 occurrences.

Table 1 shows the results obtained on Cpub. These results show first that the metadata-based solution is outperformed by the other techniques. Then, the ratio of valid extracted programs to extracted programs (Ok/All) is almost the same for our solution, the monochrome-frames-based solution and the merged solution. As for the imprecision, our solution is more accurate than the monochrome-frames-based technique. This table also shows that our solution can be improved by also using the breaks detected from silences and monochrome frames.

The detection of the start is more accurate than the detection of the end because of the reliance on metadata, which are more accurate on the start times. We note that the TV programs that have generated the most imprecision are those of “Le tour de France”, a live sport program for which the metadata is completely wrong.

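| Solution  | Ok/All | 2s | 10s | start μ | start σ | end μ  | end σ  |
|-----------|--------|----|-----|---------|---------|--------|--------|
| Our. Sol. | 43/48  | 11 | 17  | 11.2s   | 11.3s   | 38.5s  | 134.1s |
| Monoch.   | 44/47  | 2  | 11  | 34.8s   | 55.6s   | 56.3s  | 102.8s |
| Both.     | 43/47  | 9  | 24  | 9.1s    | 8.7s    | 19.1s  | 90.9s  |
| Meta.     | 46/47  | 0  | 0   | 186.1s  | 270s    | 439.4s | 641.9s |

Table 1. Evaluation results on Cpub.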

3.2. Evaluation of our solution on Cpv

The repeated sequence detection technique has first been applied. A set of 656 repeated sequences has been discovered on Cpv, with a total of 2679 occurrences. We have counted 316 repeated commercial sequences with an average of 5.72 occurrences. We have also counted 86 trailers with an average of 5.32 occurrences. This illustrates the main differences between private and public channels. Private channels tend to have more commercials, which are repeated more often. However, other non-commercial inter-programs on private channels tend to be repeated less. For TV stream segmentation, non-commercial inter-programs have a greater impact on the imprecision.

Table 2 shows the results obtained on Cpv. As non-commercial inter-programs are repeated less than on Cpub, our solution has been less effective than on Cpub. However, it is still better than the metadata-based solution. Here again, our solution merged with the monochrome-frames-based breaks has been the most effective.

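| Solution  | Ok/All | 2s | 10s | start μ | start σ | end μ  | end σ  |
|-----------|--------|----|-----|---------|---------|--------|--------|
| Our. Sol. | 56/58  | 2  | 15  | 46.7s   | 244.3s  | 80.3s  | 226.5s |
| Monoch.   | 56/58  | 2  | 9   | 30.6s   | 78.9s   | 104.9s | 198.4s |
| Both.     | 56/58  | 11 | 23  | 12.9s   | 32.7s   | 61.8s  | 201.8s |
| Meta.     | 55/58  | 0  | 0   | 180.5s  | 120.1s  | 469.7s | 289.1s |

Table 2. Evaluation results on Cpv.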

The obtained results on Cpub and Cpv show that automatic TV program extraction is a complex problem. Our best results show that about 45.6% of the TV programs have been effectively extracted with an imprecision less than 10 seconds.
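The 45.6% figure is consistent with one simple reading of the two tables, shown below as a small check. The assumption, not stated explicitly in the text, is that the numerator is the "10s" count of the merged solution (Both.) on each channel and the denominator is the number of ground-truth programs on each channel.

```python
# "10s" counts of the merged solution (Both.) in Tables 1 and 2.
within_10s = {"Cpub": 24, "Cpv": 23}
# Number of ground-truth programs labeled on each channel.
ground_truth = {"Cpub": 47, "Cpv": 56}

ratio = sum(within_10s.values()) / sum(ground_truth.values())
percentage = round(100 * ratio, 1)
```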

The successfully extracted TV programs have been mainly prime-time shows, movies and daily programs such as series, news or game shows. The TV programs that have been extracted with a greater imprecision have been series inside a succession of episodes. For two programs on Cpv, the imprecision has been due to the missed detection of a sponsorship sequence that does not repeat in the stream.

4. CONCLUSION

This paper shows the importance of the repetition density of inter-programs and how it can be used in a TV stream segmentation process for TV program extraction. Experiments show that the traditional approaches (metadata-based or monochrome-frames-based) are not sufficiently effective to perform an accurate TV segmentation. These can, however, be greatly improved by our merged solution, which can achieve very accurate TV program extraction.

Future work will study how our approach can be extended to be applied on-line, that is, how to segment the TV stream on-line. This will require performing the repetition detection on-line. We will also address how to remove breaks from programs.

5. ACKNOWLEDGMENT

The authors would like to gratefully acknowledge X. Naturel for his help with the monochrome-frames-based solution.

6. REFERENCES

[1] S.-A. Berrani, P. Lechat, and G. Manson, “TV broadcast macro-segmentation: Metadata-based vs. content-based approaches,” in Proc. of the ACM Int. Conf. on Image and Video Retrieval, Amsterdam, The Netherlands, July 2007.

[2] X. Naturel, G. Gravier, and P. Gros, “Fast structuring of large television streams using program guides,” in Proc. of the 4th Int. Workshop on Adaptive Multimedia Retrieval, Geneva, Switzerland, July 2006.

[3] J. Oostveen, T. Kalker, and J. Haitsma, “Feature extraction and a database strategy for video fingerprinting,” in Proc. of the 5th Int. Conf. on Recent Advances in Visual Information Systems, Hsin Chu, Taiwan, March 2002.

[4] J.-P. Poli and J. Carrive, “Modeling television schedules for television stream structuring,” in Proc. of ACM Int. MultiMedia Modeling Conf., Singapore, January 2007.

[5] R. Lienhart, C. Kuhmünch, and W. Effelsberg, “On the detection and recognition of television commercials,” in Proc. of the IEEE Int. Conf. on Multimedia Computing and Systems, Ottawa, Ontario, Canada, June 1997.

[6] A. Albiol, M.J. Ch, F.A. Albiol, and L. Torres, “Detection of TV commercials,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (vol. 3), Montreal, Quebec, Canada, May 2004.

[7] J. M. Gauch and A. Shivadas, “Finding and identifying unknown commercials using repeated video sequence detection,” Journal of Computer Vision and Image Understanding, vol. 103, pp. 80–88, 2006.

[8] C. Herley, “Argos: automatically extracting repeating objects from multimedia streams,” IEEE Transactions on Multimedia, vol. 8, no. 1, pp. 115–129, 2006.

[9] S.-A. Berrani, G. Manson, and P. Lechat, “A non-supervised approach for repeated sequence detection in TV broadcast streams,” Signal Processing: Image Communication, spec. iss. on “Semantic Analysis for Interactive Multimedia Services”, vol. 23, no. 7, pp. 525–537, 2008.

[10] G. Manson and S.-A. Berrani, “TV broadcast macro-segmentation using the repetition property of inter-programs,” in Proc. of the IASTED Int. Conf. on Signal Processing, Pattern Recognition and Applications, Innsbruck, Austria, February 2009.
