
Pattern Research in the Digital Humanities: How Data Mining Techniques Support the Identification of Costume Patterns

Michael Falkenthal¹, Johanna Barzen¹, Uwe Breitenbücher¹, Sascha Brügmann², Daniel Joos², Frank Leymann¹, Michael Wurster²

¹ Institute of Architecture of Application Systems, University of Stuttgart, Germany
{falkenthal, barzen, breitenbuecher, leymann}@iaas.uni-stuttgart.de

² Herman-Hollerith Zentrum, University of Applied Sciences Reutlingen, Germany
{michael1.wurster, daniel.joos, sascha.bruegmann}@student.reutlingen-university.de

@article{Falkenthal2016,
  author = {Falkenthal, Michael and Barzen, Johanna and Breitenb{\"{u}}cher, Uwe and Br{\"{u}}gmann, Sascha and Joos, Daniel and Leymann, Frank and Wurster, Michael},
  doi = {10.1007/s00450-016-0331-6},
  journal = {Computer Science - Research and Development},
  number = {74},
  title = {{Pattern research in the digital humanities: how data mining techniques support the identification of costume patterns}},
  volume = {22},
  year = {2016}
}

© 2016 Springer-Verlag. The final publication is available at Springer via http://dx.doi.org/10.1007/s00450-016-0331-6


Abstract Costumes are prominent in transporting a character's mood, a certain stereotype, or character trait in a film. The concept of patterns, applied to the domain of costumes in films, can help costume designers to improve their work by capturing knowledge and experience about proven solutions for recurring design problems. However, finding such Costume Patterns is a difficult and time-consuming task, because possibly hundreds of different costumes from a huge number of films have to be analyzed to find commonalities. In this paper, we present a Semi-Automated Costume Pattern Mining Method to discover indicators for Costume Patterns from a large data set of documented costumes using data mining and data warehouse techniques. We validate the presented approach by a prototypical implementation that builds upon the Apriori algorithm for mining association rules and standard data warehouse technologies.

Michael Falkenthal, Institute of Architecture of Application Systems, University of Stuttgart, Germany. Tel.: +49-711-68588482

Johanna Barzen, Institute of Architecture of Application Systems, University of Stuttgart, Germany. Tel.: +49-711-68588487

Uwe Breitenbücher, Institute of Architecture of Application Systems, University of Stuttgart, Germany. Tel.: +49-711-68588261

Sascha Brügmann, Herman-Hollerith Zentrum, University of Applied Sciences Reutlingen, Germany

Daniel Joos, Herman-Hollerith Zentrum, University of Applied Sciences Reutlingen, Germany

Frank Leymann, Institute of Architecture of Application Systems, University of Stuttgart, Germany. Tel.: +49-711-68588470

Michael Wurster, Herman-Hollerith Zentrum, University of Applied Sciences Reutlingen, Germany

Keywords Pattern Languages · Pattern Mining · Pattern Identification · Data Mining · Costume Languages · Costume Patterns · Digital Humanities

1 Introduction

When watching a movie, the first impression of the characters is often not caused by how they move or what they say but by what they wear. Costumes are special types of clothes used by costume designers to support a certain character, the character's moods and transformations, or to give the recipient hints on where the movie is set in geographical or historical terms. This communication, based on clothes, is called vestimentary communication (from the Latin term "vestimentum" meaning clothes) and is used by costume designers to communicate certain stereotypes, character traits, professions, or a specific age of a certain character. Because vestimentary communication is nonverbal, mainly experienced unconsciously by the recipients, and interpreted based on their social and socioeconomic background, it is rather complex to gain insight into how this communication works. However, there are some rules that allow us to distinguish between a villain and a hero in a classic western movie. Whether we interpret characters as villains because they wear black and dirty-looking costume elements, or as heroes (often represented as sheriffs) because they wear rather tidy costumes including a sheriff's star, is one of the questions we want to answer. A costume designer has the challenging job of creating an appropriate costume for a specific character, which relies heavily on the designer's experience.

Systematically capturing insights into which design conventions have become established in films as patterns would strongly support the creative process of finding adequate textile expressions for specific design problems at hand (Schumm et al, 2012). This is because patterns and pattern languages, which were introduced by Alexander et al (1977) in the domain of architecture, aim to capture knowledge gathered from experience in order to provide proven solutions for frequently recurring problems.

According to Barzen and Leymann (2015) as well as Fehling et al (2015), patterns can be detected by analyzing existing, documented concrete solutions and abstracting the essence of detected commonalities into structured pattern documents. For investigating the vestimentary communication in films, concrete solutions correspond to concrete costumes worn in films. To provide machine-readable data about costumes, we built the MUSE-Repository¹ as a database that captures costumes and their relevant attributes (Barzen et al, 2015). This so-called Costume Repository contains (i) general information about the captured movies, e.g., title, year of publication, producer, and costume designer, as well as (ii) specific information in terms of the involved roles, such as gender, profession, age, main personality, and stereotype attributes. Further, each role is linked to a set of concrete costumes worn during the movie. Each costume consists of a set of base elements, e.g., trousers, shirts, and shoes, and primitives, e.g., sleeves, collar, and cuffs. Base elements and primitives are described by means of specific categorical properties, which are organized into taxonomies. At the time of writing this paper, the Costume Repository contained 25 movies, about 900 corresponding roles, 2,100 costumes, 10,360 base elements, and 20,660 primitives.

Although this Costume Repository is a first step to reduce efforts for investigating costumes in films in order to identify common design principles, the process of pattern identification is still based on manual and, thus, time-consuming work because of the lack of automation for discovering similarities in the documented costumes. Therefore, we introduced a first approach to support analyzing the captured costumes using On-Line Analytical Processing (OLAP) technologies (Falkenthal et al, 2015). Nevertheless, this approach is only capable of answering concrete questions and verifying assumptions by specifying them as multidimensional queries that are executed on the database; it does not help to detect as yet unknown coherences in the data set.

¹ As part of the MUSE project (last accessed on 25.02.2016): http://www.iaas.uni-stuttgart.de/forschung/projects/MUSE

To overcome these issues, applying data mining techniques for identifying similarities, relations, and rules in the captured costume data is promising². However, data mining is hardly an easy exercise, and comprehensive conceptual as well as technical knowledge about database systems is indispensable. Detailed knowledge about data mining algorithms and how to apply them to the captured data set is required, while domain experts require knowledge on how the data is structured and how to interpret the mining results. This leads to a complex challenge in terms of how such data mining techniques can be applied to a concrete domain, in this paper, the domain of costumes in films. Furthermore, data mining is rarely a linear process with a defined start and end. It is rather an iterative process in which domain experts work on the results and the mining configuration incrementally, based on gathered insights.

In this work, we contribute to the field of Digital Humanities by introducing a Semi-Automated Costume Pattern Mining Method that (i) leverages the capabilities of data mining to enable detecting indications regarding potential Costume Patterns from concrete costume documentations. The method (ii) can be partially automated to support analyzing the vast amount of captured costumes while (iii) it supports the iterative refinement of the data analysis. Thereby, we show how general data mining techniques for mining association rules can be applied to the domain of costumes in films. To prove the practical feasibility of the presented method, we conducted two case studies and present a prototypical implementation of a Costume Pattern Mining Framework that is based on standard IT technologies. In summary, we show how IT can contribute to the domain of humanities and, thus, how it provides fundamentals for the new research discipline of the Digital Humanities. While the techniques and methods this work is predicated upon are well understood from the perspective of data analytics, their application in the domain of costumes clearly fosters the endeavours to tackle research challenges in the humanities by means of IT.

² As the term pattern is ambiguous and is used not only in the costume domain but also in the domain of data mining (cf. (Bishop, 2006)), we clarify the different meanings at this point. While data mining is utilized to find patterns in large data sets in the form of similarities, relations, and rules, costume patterns follow the principles of the pattern approach by Alexander et al (1977).

The remainder of this paper is structured as follows: We discuss related work and approaches, which this work builds upon, in Section 2. Section 3 introduces the Semi-Automated Costume Pattern Mining Method, which makes it possible to find common design practices in a set of concrete, documented costumes. The challenges of applying data mining techniques to the domain of costumes are discussed in Section 4. Section 5 presents the case studies and prototypical implementation of the Costume Pattern Mining Framework. In Section 6, we conclude and give an outlook on future work.

2 Related Work

Patterns are commonly used to capture knowledge and experience about proven solutions for recurring problems (Reiners, 2013). In the past, patterns and pattern languages have been used in various research domains (Alexander et al, 1977; Hohpe and Woolf, 2003). In the literature, discovering patterns is described as a generative process and is referred to as pattern mining (Dearden and Finlay, 2006; Appleton, 1997), which is a metaphor for discovering patterns from existing designs (Dearden and Finlay, 2006).

Reiners et al. propose different pattern mining methods (Reiners, 2013; Reiners et al, 2015). A pattern mining process in their perspective is a manual assessment of existing solutions with domain experts, e.g., in workshops, and relies heavily on the experts' experience. In addition to workshops, a community-based platform with online discussions, commenting, rating, and voting is used to share knowledge and to assess existing solutions.

Fehling et al (2015) propose a pattern research methodology where pattern candidates shall be identified from concrete solutions, which are then linked to the abstracted patterns (Falkenthal et al, 2014a,b). In other work, Fehling et al (2014) published a general pattern identification, authoring, and application process, which is applicable to several research domains. The iteration-based process consists of three phases: (i) pattern identification, (ii) pattern authoring, and (iii) pattern application. Each phase is broken down into a separate cycle that consists of multiple sub-activities. Our work addresses the pattern identification phase, which is the structuring, collection, and analysis of information in a domain in which patterns shall be identified. Following this method, we build upon a Costume Repository that contains a large number of documented concrete solutions. Moreover, this repository provides a machine-accessible interface that can be used for analyzing the contained data. We provide details about these approaches in Section 3.

Fayyad et al (1996) introduce the process of Knowledge Discovery in Databases (KDD). KDD refers to the overall process of discovering useful knowledge from data. This process incorporates the concepts of data mining and proposes a comprehensive approach to identify potential coherences in data. Our approach is based on the KDD process in order to analyze existing documented solutions for potential pattern indicators in the area of costumes in films. Data mining can be used to "discover hidden, previously unknown and usable information from a large amount of data" (ISO, 2006). Data mining techniques are used to gather knowledge from an underlying data set for a better understanding, usually without any expectation on the outcome (ISO, 2006). The Apriori algorithm, as proposed by Agrawal and Srikant (1994), is one well-known algorithm in the area of data mining. It is used for discovering association rules between items in a database of sales transactions (Agrawal and Srikant, 1994). As a prominent example, consider market basket analysis, which helps retailers to find out which of their offered products are typically sold in combination with other products. The resulting association rules can be used, for example, to optimize the store layout or to adapt the advertising strategy of the retailer.
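To make the market basket example more tangible, the following is a minimal sketch of the level-wise search for frequent itemsets that underlies Apriori. It is not the implementation used in this work; the function name `apriori_frequent_itemsets` and the shopping baskets are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every single item is a candidate itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the transactions that contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-item subsets are all frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical shopping baskets (illustrative data only).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
frequent = apriori_frequent_itemsets(baskets, min_support=3)
```

With these baskets and a minimum support of three, only the pair bread and butter survives beyond the single items, which is the kind of frequent combination from which association rules are subsequently derived.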

3 Semi-Automated Costume Pattern Mining Method

In order to efficiently support the identification of common design principles hidden in the Costume Repository that captures concrete costume descriptions, we present a Costume Pattern Mining Method that extends KDD (cf. Section 2) and builds upon Barzen and Leymann (2016) to analyze existing costumes for indicators about patterns in the area of costumes in films. The process consists of three phases: (i) Data Preparation, (ii) Hypothesis Discovery, and (iii) Hypothesis Validation. An overview of the method is shown in Figure 1 with the respective phases depicted as a sequence of chevrons resulting in pattern indicators that represent frequent design principles contained in the analyzed data set. We also describe automation capabilities to handle the huge amount of data.

To provide an overview, we first summarize the method and then provide details about each phase in the following subsections. In the (i) Data Preparation phase, data is structured, prepared, and transformed into a domain-specific data set in order to be processed by data mining algorithms.

Fig. 1 Overview of the proposed Costume Pattern Mining Method (phases shown as chevrons: Data Preparation, Hypothesis Discovery, Hypothesis Validation, resulting in a Pattern Indicator)

In the (ii) Hypothesis Discovery phase, analysis interests are first manually translated into specific configurations of a mining algorithm. To be precise, the costume data set is reduced to the attributes relevant for the analysis interests. Afterwards, the configured data mining algorithm is executed automatically to find hypotheses about coherences in the investigated data set in the form of association rules. Finally, in the (iii) Hypothesis Validation phase, the discovered hypotheses are validated manually against the data set in the Costume Repository, i.e., domain experts interpret them and evaluate them based on OLAP techniques.

3.1 Data Preparation

In the first phase, the data of the Costume Repository needs to be prepared so that it can be used in the succeeding phases: On the one hand, this includes preparations required for the data mining algorithm to work. On the other hand, OLAP cube techniques used for validating hypotheses require the data of the Costume Repository, which are structured and stored in a relational schema, to be converted to a different data model. Moreover, additional data structures required for cube-based data analysis have to be created in the database. The data preparation phase follows the principles of an Extract-Transform-Load (ETL) process, typically used in the area of data warehouse applications (Fayyad et al, 1996). For the analysis of the data set in the Costume Repository, we built upon an existing ETL process, described by Falkenthal et al (2015).

The data of the Costume Repository is first extracted into a separate database. A reason for this is that the application of data mining algorithms can cause heavy load on the underlying database and might cause the Costume Repository to be slow or unavailable for parallel insertion of new costume data. As an additional advantage, working on a copy of the Costume Repository provides a certain level of data consistency and isolation, as no new costumes are added and no data is changed while applying data mining algorithms and executing the OLAP analysis.

After the data has been imported into a separate database (also called data staging area), automated steps for filtering, cleansing, and transformation of the data are executed (Fayyad et al, 1996). For instance, the Costume Repository contains a lot of screenshots, e.g., showing a costume in several scenes of a movie. This data is not used for analytical processing and is, therefore, filtered out. Furthermore, the Costume Repository contains entries that are considered not valid, e.g., having NULL values or invalid strings. Those are filtered out by the ETL process as well. Various minor transformation steps are applied to the data. For example, the applied implementation of the data mining algorithm could require non-composed primary keys, in which case surrogate keys have to be created, a typical operation in ETL processes. The data are further denormalized into a star schema, while the hierarchical structure of the data describing the relevant parameters of a costume is preserved.
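The filtering and surrogate key steps described above can be sketched as follows. This is an illustrative sketch only, not the ETL process of the prototype: the function `prepare_costumes`, the field names (`movie_id`, `role_id`, `costume_no`, `screenshots`, `base_elements`), and the sample rows are assumptions made for the example.

```python
def prepare_costumes(rows):
    """Filter invalid records, drop screenshot data, and assign surrogate keys."""
    prepared = []
    surrogate_keys = {}  # maps a composed natural key to a single integer key
    for row in rows:
        natural_key = (row.get("movie_id"), row.get("role_id"), row.get("costume_no"))
        # Filtering: skip records whose key parts are NULL (None) or blank strings.
        if any(v is None or (isinstance(v, str) and not v.strip()) for v in natural_key):
            continue
        # Surrogate key: replace the composed key by one running integer.
        key = surrogate_keys.setdefault(natural_key, len(surrogate_keys) + 1)
        # Screenshots are not used for analytical processing and are dropped.
        cleaned = {k: v for k, v in row.items() if k != "screenshots"}
        cleaned["costume_key"] = key
        prepared.append(cleaned)
    return prepared

# Hypothetical extracted rows (illustrative data only).
rows = [
    {"movie_id": "m1", "role_id": "r1", "costume_no": 1,
     "screenshots": ["scene1.png"], "base_elements": ["trousers"]},
    {"movie_id": "m1", "role_id": None, "costume_no": 2,
     "screenshots": [], "base_elements": ["shirt"]},
    {"movie_id": "m2", "role_id": "r7", "costume_no": 1,
     "screenshots": [], "base_elements": ["necklace"]},
]
prepared = prepare_costumes(rows)
```

In this sketch, the second row is discarded because part of its key is missing, and the remaining rows receive the surrogate keys 1 and 2. Denormalization into a star schema is deliberately left out, since it depends on the concrete relational schema.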

3.2 Hypothesis Discovery

Discovering common combinations in costume design that indicate a Costume Pattern is realized using data mining techniques, as these make it possible to find similarities and associations in the data set of costumes. Data mining algorithms are applied to the data set of costumes, resulting in hypotheses about similarities and associations. Thus, the second phase of the Costume Pattern Mining Method is called Hypothesis Discovery. It is refined in Figure 2, where it is broken down into four steps, described in the following.

The Hypothesis Discovery phase starts with defining the specific Interest we have about a certain area of the costume domain. As an example, it could be interesting to find out whether there is a relation between the personality attributes of a character wearing a certain costume and the composition of the costume's base elements. Finding such relations in the costume data would support the identification of costume patterns for special character traits, i.e., which base elements are mainly used to communicate a certain character trait like "conceited" or "cool". Therefore, this step heavily relies on the input given by an expert of the costume domain. As an output, this step clarifies the costume parameters that can be used to answer the question.

Fig. 2 Refinement of Hypothesis Discovery phase (steps: Interest, Modelling, Discovery, Inspection)

After focusing on an Interest, the relevant costume parameters have to be modeled regarding their structure so that they can be processed by the data mining algorithm. Using the above example, this could mean using "character traits" as the input data set and the set of available "base elements" as the output data set. This step most likely requires the support of a technical expert in the data mining toolset used.

The third step, the Discovery, involves the actual execution of the data mining algorithm, using the mining structures created in the previous step. It also includes adaptations of filters and parameters specific to the algorithm. The form of the results of this step depends on the executed data mining algorithm, which could be, e.g., a set of association rules or clusters representing discovered similarities of the input data set. Independently of the data mining algorithm actually run, the results can be stated as hypotheses about coherences in the input data set of costume descriptions.

The last step is called Inspection and focuses on presenting the results of the Discovery step to an expert. Additional filtering can be applied to the result to enable the expert to focus on specific and interesting aspects. If no appropriate results were generated, the overall method can be continued back at the Modelling step. This allows adapting the input data set or specific parameters of the data mining algorithm in order to refine the conducted analysis.

3.3 Hypothesis Validation

In this subsection we describe how identified hypotheses can be validated using the concept of OLAP cubes. A refinement of the Hypothesis Validation phase is shown in Figure 3 and the depicted steps are described in the following.

First, the data we are interested in have to be discovered in the OLAP cube. Analysis with the OLAP cube requires column- and row-filtering to be applied, focusing on a certain slice of the underlying data. Having found a hypothesis, the column- and row-filters of the OLAP cube need to be set to the dimensions (properties) that include the subset of costume parameters that are relevant for the hypothesis. This operation is often referred to as Slice and Dice as the view on the data focuses on a certain slice or sub-cube of the overall OLAP cube (Codd et al, 1993).

As a second step, the data that led to the inspected hypothesis need to be identified in the cube. Therefore, search and filtering capabilities as well as drill-down and roll-up operations (cf. Codd et al (1993); Golfarelli et al (1998)) have to be applied on the dimensions of the OLAP cube.

In order to compare a hypothesis, the cube needs to count the appearances of costumes with given properties. The combination of relevant parameters of the hypothesis at hand can then be compared with the number of appearances of other property combinations. If other property combinations are not considered significant in contrast to the examined combination, the discovered hypothesis can be considered validated. It can then be regarded as an indicator for a pattern (cf. PI in Figure 3) because it represents a design principle that contributes substantially to achieving the intended effects considered by the formerly defined Interests. Otherwise, the Hypothesis Discovery step starts all over again, either to focus on other Interests or to refine the input data set or parameters of the data mining algorithm.
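The counting behind this comparison can be illustrated without an OLAP server. The following sketch mimics slicing by filters and counting distinct costumes per base element; the function `slice_and_count`, the dimension names, and the sample costumes are hypothetical and do not stem from the Costume Repository.

```python
from collections import Counter

def slice_and_count(costumes, filters, row_dimension):
    """Count distinct costumes per value of row_dimension within a slice."""
    counts = Counter()
    for costume in costumes:
        # Slice and dice: keep only costumes that match every filter value.
        if all(value in costume[dim] for dim, value in filters.items()):
            # Using a set counts each costume at most once per row value.
            counts.update(set(costume[row_dimension]))
    return counts

# Hypothetical costumes; each dimension maps to a set of values.
costumes = [
    {"gender": {"female"}, "trait": {"evil"}, "base": {"necklace", "ring"}},
    {"gender": {"female"}, "trait": {"evil"}, "base": {"necklace", "top"}},
    {"gender": {"female"}, "trait": {"good"}, "base": {"necklace"}},
    {"gender": {"male"}, "trait": {"evil"}, "base": {"trousers"}},
    {"gender": {"female"}, "trait": {"evil"}, "base": {"ring"}},
]
counts = slice_and_count(costumes, {"gender": "female", "trait": "evil"}, "base")
ranking = counts.most_common()  # base elements ordered by number of costumes
```

A domain expert would inspect such a ranking to judge whether the base element named in a hypothesis stands out against other base elements in the same slice. Drill-down and roll-up along taxonomy levels are omitted here for brevity.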

4 Mining Association Rules from the Costume Repository

As explained in Section 3.2, data mining algorithms have to be applied to the data set of costumes. To understand the challenges of the Hypothesis Discovery phase and to grasp the required expertise to apply data mining techniques to the domain of costumes in films, we describe the application of the Apriori algorithm developed by Agrawal and Srikant (1994) to the data set of the Costume Repository.

Fig. 3 Refinement of the Hypothesis Validation phase (steps: Slicing, Drill-Down, Validate)

The Apriori algorithm was originally designed to work with transactional sales data. As an example, consider a store selling multiple products. If one of the store's customers buys three of those products together, the corresponding transaction includes those three products. In order to apply this algorithm to find frequent associations between costume elements, we have to transform the costume descriptions into corresponding concepts and data structures.

Looking at the domain of costumes in films, we consider a transaction to be a single costume that occurs in a specific movie. Barzen (2013) describes the relevant parameters of a costume, such as color, design, and material, as well as the base elements, which compose the costume. For each of those parameters, Barzen defines taxonomies, providing a well-defined and hierarchical set of parameter values. Fehling et al (2015) define a costume to consist of (i) "clothes" as a haptic basis, which itself is composed of base elements and (ii) an "intended effect". Through this intended effect, costumes communicate attributes about a character such as character traits, mood or social standing, as well as represented stereotypes to the recipient (Barzen and Leymann, 2015).

Applying the concept of transactions of the Apriori algorithm to the domain of costumes in films, the base elements of a costume correspond to the products bought by a customer. Costume parameters such as character traits, gender, stereotypical information of a costume as well as the information about the related movie correspond to attributes of a sales transaction, like time of day, store location, or product numbers. Let $P := P_1 \cup P_2 \cup \dots \cup P_n$ be the set of all available costume parameters, where $P_1 := \{b_1, b_2, \dots, b_m\}$ is the set of available base elements, $P_2 := \{s_1, s_2, \dots, s_k\}$ is the set of available stereotypical information, and so on. Then, the transaction representing a costume is defined as $T_{costume} := T_1 \cup T_2 \cup \dots \cup T_n$ such that $T_1 \subseteq P_1$, $T_2 \subseteq P_2$, and so on. Let $C := \{T_{costume_1}, T_{costume_2}, \dots, T_{costume_n}\}$ be the set of all available costumes in the database, which corresponds to the set of all transactions $D$ of Apriori (cf. Agrawal and Srikant (1994)).
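As an illustration of this definition, a costume can be turned into a transaction by taking the union of its per-parameter subsets. The sketch below is an assumption-laden example rather than the data model of the Costume Repository: the dictionary `PARAMETERS`, its family names and values, and the function `costume_transaction` are hypothetical.

```python
# Hypothetical parameter sets P_i; real taxonomies are far larger.
PARAMETERS = {
    "base_element": {"trousers", "necklace", "boots"},
    "stereotype": {"villain", "sheriff"},
    "character_trait": {"evil", "good", "active"},
    "gender": {"male", "female"},
}

def costume_transaction(costume):
    """Build T_costume as the union of the per-family subsets T_i of P_i."""
    transaction = set()
    for family, values in costume.items():
        # Enforce T_i being a subset of P_i for every parameter family.
        unknown = set(values) - PARAMETERS[family]
        if unknown:
            raise ValueError(f"{sorted(unknown)} not in parameter set {family!r}")
        # Tagging values with their family keeps the sets P_i disjoint.
        transaction |= {(family, value) for value in values}
    return frozenset(transaction)

sheriff_costume = costume_transaction({
    "base_element": ["trousers", "boots"],
    "character_trait": ["good"],
    "gender": ["male"],
})
```

Representing each item as a pair of family and value is one possible design choice; it avoids clashes if two taxonomies happen to share a value name.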

In general, an association rule $X \Rightarrow Y$ produced by Apriori is an implication from a set $X$ to a set $Y$, both containing elements of a set of elements $I$, i.e., $X, Y \subset I$, and $X \cap Y = \emptyset$. In our case, $I$ corresponds to $P$. An association rule has support $s$, which describes the number of transactions in $D$ that contain $X \cup Y$. For the transactions in $D$ that contain $X$, the confidence $c$ describes the percentage of transactions that also contain $Y$. For representing association rules between a specific set of costume parameters that have to be investigated, let $P_x \subset P$ be the set of permitted left hand parameters and $P_y \subset P$ be the set of permitted right hand parameters, $P_x \cap P_y = \emptyset$; then for every association rule $X \Rightarrow Y$, it holds that $X \subseteq P_x$ and $Y \subseteq P_y$.
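The definitions of support and confidence, together with the restriction to permitted left hand and right hand parameters, can be sketched as follows. For brevity, the sketch enumerates candidate rules by brute force instead of using Apriori's candidate pruning and only considers single-element right hand sides; the function `mine_rules` and the sample costumes are hypothetical.

```python
from itertools import combinations

def mine_rules(costumes, left_params, right_params, min_support, min_confidence):
    """Return rules (X, Y, support, confidence) with X from P_x and |Y| = 1 from P_y."""
    costumes = [frozenset(c) for c in costumes]
    rules = []
    for size in range(1, len(left_params) + 1):
        for x in combinations(sorted(left_params), size):
            x = frozenset(x)
            # Number of transactions that contain the left hand side X.
            n_x = sum(1 for t in costumes if x <= t)
            if n_x == 0:
                continue
            for item in sorted(right_params):
                y = frozenset([item])
                # Support: number of transactions that contain X and Y together.
                support = sum(1 for t in costumes if (x | y) <= t)
                # Confidence: share of transactions with X that also contain Y.
                confidence = support / n_x
                if support >= min_support and confidence >= min_confidence:
                    rules.append((x, y, support, confidence))
    return rules

# Hypothetical costumes as flat transactions (illustrative data only).
costumes = [
    {"evil", "female", "necklace", "ring"},
    {"evil", "female", "necklace"},
    {"evil", "female", "top"},
    {"evil", "male", "trousers"},
    {"good", "female", "necklace"},
    {"evil", "female", "necklace", "top"},
]
rules = mine_rules(
    costumes,
    left_params={"evil", "good", "male", "female"},
    right_params={"necklace", "ring", "top", "trousers"},
    min_support=3,
    min_confidence=0.7,
)
```

In this toy data set, four costumes contain both "evil" and "female", and three of them also contain "necklace", so the rule from evil and female to necklace has support 3 and confidence 0.75.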

The following list contains association rules that we have found during our analysis of the corpus of films. The set of character traits $P_{character} = \{active, evil, good, \dots\}$ and the set of genders $P_{gender} = \{male, female\}$ are used as left hand parameters, i.e., $P_x = P_{character} \cup P_{gender}$, while the set of investigated available base elements $P_{base} = \{trousers, necklace, \dots\}$ is used as right hand parameters, i.e., $P_y = P_{base}$:

– $\{evil, male\} \Rightarrow \{trousers\}$ (confidence $c = 87.5\%$)
– $\{evil, female\} \Rightarrow \{necklace\}$ (confidence $c = 75.6\%$)
– $\{active, male\} \Rightarrow \{boots\}$ (confidence $c = 59.4\%$)

(Algorithm parameters: minimum support $s \geq 10$)

Looking at the second item in the example list, we see that about 75.6% of the costumes worn by evil female characters include a necklace in their composition of base elements. From the configured minimum support threshold, we also know that this combination occurs in at least 10 costumes.

Rule quality can be derived from the confidence $c$ of each association rule. Rules of high quality lead to a hypothesis about certain aspects of a costume. For example, the rule $\{evil, female\} \Rightarrow \{necklace\}$ leads to the hypothesis that evil female characters can be expressed by adding a necklace to the costume.


If no rules with appropriate confidence have been found in the Inspection step, the method can be continued back at the Modelling step, as described in Section 3.2. This allows changing parameters of the association rule discovery to find stronger rules. For example, additional model filters could be applied for focusing on a specific genre. Parameters of the mining algorithm, such as support and confidence, can also be adapted to the requirements of a domain expert in order to properly investigate the data set. Specifically for mining design patterns from existing concrete solutions, support and confidence have to be set to values that ensure that the resulting rules capture the frequently occurring essence of many concrete solutions. In order to classify a detected solution, the so-called rule of three has emerged in the pattern community, which defines a detected solution to be relevant for formulating a pattern if it occurs at least three times in different concrete implementations (Coplien, 1996).

If rules with sufficient confidence have been mined, they can be validated using OLAP techniques. To illustrate the capabilities described conceptually in Section 3.3, the example association rule above, which indicates that evil female characters also wear a necklace, is depicted as a pivot table in Figure 4. For this case, the filtering is configured to include only movies of the high school comedy genre and female characters. Rows present the base elements of a costume, drilled down to the sixth level of the base element taxonomy. Columns show the character traits, drilled down to the second level of the character traits taxonomy.

In this case, a filter is applied to show only the character trait “evil”. The table holds the distinct count of costumes with the respective column/row properties. The base element “necklace”, highlighted by the bold border in Figure 4, is one of the top 5 base elements for evil female characters, so the discovered association rule of the above example can be considered validated. Discovering association rules on the actual Costume Repository would probably also produce rules for the “ring”, “earring” and “bracelet” base elements, as those are used in a similar number of costumes.
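The pivot-table check can be approximated in a few lines of Python. This is only a sketch of the distinct-count aggregation, assuming a flat list of fact rows with hypothetical field names; the prototype itself uses an OLAP cube accessed through Microsoft Excel.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def pivot_distinct_costumes(rows: List[dict], genre: str, gender: str,
                            trait: str) -> List[Tuple[str, int]]:
    """Distinct costume count per base element for one genre, gender and trait."""
    ids: Dict[str, Set[int]] = defaultdict(set)
    for r in rows:
        if r["genre"] == genre and r["gender"] == gender and r["trait"] == trait:
            ids[r["base_element"]].add(r["costume_id"])
    # Highest count first; ties are ordered by element name.
    return sorted(((e, len(s)) for e, s in ids.items()), key=lambda p: (-p[1], p[0]))


def row(cid: int, genre: str, gender: str, trait: str, element: str) -> dict:
    return {"costume_id": cid, "genre": genre, "gender": gender,
            "trait": trait, "base_element": element}


# Toy fact rows, invented for illustration only.
HSC = "high school comedy"
rows = [
    row(1, HSC, "female", "evil", "necklace"),
    row(1, HSC, "female", "evil", "necklace"),   # second necklace on the same costume
    row(1, HSC, "female", "evil", "ring"),
    row(2, HSC, "female", "evil", "ring"),
    row(3, HSC, "female", "evil", "necklace"),
    row(4, HSC, "male", "evil", "necklace"),      # removed by the gender filter
    row(5, "western", "female", "evil", "necklace"),  # removed by the genre filter
]

ranking = pivot_distinct_costumes(rows, HSC, "female", "evil")
```

Because costume IDs are collected in sets, a costume with two necklaces is counted once, which mirrors the distinct count used in the pivot table.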

5 Prototypical Implementation and Case Studies

In this section, we describe our prototypical implementation of the introduced Costume Pattern Mining Method. In addition, we exercise the method based on two example case studies: the overall question is whether we can find a typical costume for a villain or “Bad Guy” in the genres “western” and “high school comedy”. At the time of writing, the database contained approximately 350 costumes from 23 western movies and approximately 2,200 costumes from 21 high school comedy movies.

Genre:  high school comedy
Gender: female

CostumeID Distinct Count    Level02 Type
Level06                     evil
ring                          16
earring                       16
wristband                     14
necklace                      13
open shoes                    12
long trousers                 10
top                           10
hair accessories               9
wristwatch                     9
miniskirt                      8

Fig. 4 Example of Hypothesis Validation using Microsoft Excel Pivot Table

As part of the Digital Humanities research, one of the core goals of the proposed prototype is to hide as much of the complexity as possible, so that domain experts in the humanities with less detailed knowledge of data mining techniques and algorithms can work effectively and efficiently with the proposed toolchain and immediately benefit from the results.

5.1 Prototype Implementation – Data Preparation

Starting with the data preparation step, as described in Section 3.1, we set up an ETL process using Microsoft SQL Server Integration Services. In addition to the tasks already mentioned, the ETL process migrates the data from a MySQL database of the Costume Repository to a Microsoft SQL Server database staging area. The ETL process is scheduled to run fully automated once every night. This makes the analysis repeatable, as the daily database backup could be used to restore a specific state of the Costume Repository. The data of the Costume Repository changes rather slowly, so the interval of one day does not cause the data mining algorithm to work on outdated data.
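The shape of such a nightly full reload can be illustrated with a small, self-contained sketch. It uses two in-memory SQLite databases as stand-ins for the MySQL source and the SQL Server staging area; the table names, column names and the label normalisation are assumptions for illustration and do not reflect the actual Integration Services package.

```python
import sqlite3


def run_etl(source: sqlite3.Connection, staging: sqlite3.Connection) -> int:
    """Extract all rows, normalise labels, and fully reload the staging table."""
    rows = source.execute(
        "SELECT costume_id, genre, base_element FROM costume_element"
    ).fetchall()
    # Transform step (assumed): trim whitespace and lower-case the labels.
    cleaned = [(cid, genre.strip().lower(), elem.strip().lower())
               for cid, genre, elem in rows]
    with staging:  # commits on success, rolls back on error
        staging.execute(
            "CREATE TABLE IF NOT EXISTS stg_costume_element "
            "(costume_id INTEGER, genre TEXT, base_element TEXT)"
        )
        staging.execute("DELETE FROM stg_costume_element")
        staging.executemany(
            "INSERT INTO stg_costume_element VALUES (?, ?, ?)", cleaned
        )
    return len(cleaned)


# Hypothetical source data standing in for the Costume Repository.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE costume_element (costume_id INTEGER, genre TEXT, base_element TEXT)"
)
source.executemany(
    "INSERT INTO costume_element VALUES (?, ?, ?)",
    [(1, " Western", "Boots "), (2, "High School Comedy", "Tie")],
)
staging = sqlite3.connect(":memory:")

loaded = run_etl(source, staging)
loaded_again = run_etl(source, staging)  # a repeated run replaces the previous load
staged = staging.execute(
    "SELECT costume_id, genre, base_element FROM stg_costume_element ORDER BY costume_id"
).fetchall()
```

Reloading the staging table completely on each run keeps the sketch idempotent, which fits a source that changes slowly and is processed once per night.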

5.2 Prototype Implementation – Hypothesis Discovery

In the Costume Repository, attributes to express a “villain” or a “Bad Guy” are represented by stereotype and character traits. Therefore, we can refine our overall interests into the following questions:

– Are there relations between stereotype, character traits and worn costume base elements?

– Are there relations between stereotype, character traits, design, color, material and condition of costume base elements?

Table 1 Mining structures for discovering association rules for “villains”

Input Set                            Output Set
Character traits, Gender          ⇒  Base element (appearance)
                                     Base element design
                                     Base element color
                                     Base element material
                                     Base element condition
Stereotype information, Gender    ⇒  Base element (appearance)
                                     Base element design
                                     Base element color
                                     Base element material
                                     Base element condition

Using Microsoft’s implementation of the Apriori algorithm (Microsoft Association Algorithm), the sets of input and output attributes, P_x and P_y, as described in Section 3.2, have to be configured in a so-called mining model. Multiple mining models are grouped in a mining structure. Having a set of mining models available makes it easy to re-trigger the discovery of association rules with a defined set of parameters as the number of costumes in the costume database grows over time.

We transformed the questions above into ten mining structures. Each mining structure is designed to answer a specific part of the questions: for example, one mining structure represents the question about relations between personality attributes and base elements, a second one represents the question about relations between stereotype information and base elements, and so on. The mining structures that have been defined are depicted in Table 1.

For each mining structure we created two mining models, one to answer the question from the perspective of western movies and another one to answer it from the perspective of high school comedy movies. Because we have approximately 2,200 costumes for the genre high school comedy, we decided to limit the data to costumes worn by supporting roles and extras. Such characters have a rather short screen time and, therefore, their costumes need to communicate the stereotype and personality characteristics more efficiently than costumes with longer screen time.
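The resulting configuration space can be enumerated as a cross product: two input sets times five output sets give the ten mining structures, and two genres per structure give twenty mining models. The following sketch only illustrates this bookkeeping; the dictionary layout and the role labels are hypothetical and do not correspond to the Analysis Services object model.

```python
from itertools import product

INPUT_SETS = [
    ("character traits", "gender"),
    ("stereotype information", "gender"),
]
OUTPUT_SETS = [
    "base element (appearance)",
    "base element design",
    "base element color",
    "base element material",
    "base element condition",
]
GENRES = ["western", "high school comedy"]

# Assumed labels: high school comedy is restricted to roles with short screen time.
ROLE_FILTER = {
    "western": None,
    "high school comedy": ("supporting role", "extra"),
}

# One mining structure per combination of input set and output set.
mining_structures = [
    {"inputs": inputs, "output": output}
    for inputs, output in product(INPUT_SETS, OUTPUT_SETS)
]

# One mining model per mining structure and genre.
mining_models = [
    {"inputs": s["inputs"], "output": s["output"], "genre": g, "roles": ROLE_FILTER[g]}
    for s, g in product(mining_structures, GENRES)
]
```

Keeping the models as plain data makes it straightforward to re-run all of them with the same parameters whenever the costume database has grown.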

The defined mining models were processed by Microsoft SQL Server Analysis Services, which allows applying keyword-based filtering to the set of produced association rules. We filtered the rules by the keywords “evil” and “bad”. This gave a first impression of what a possible pattern indicator could look like. In addition, we focused on rules that include a gender property set to “male”. This resulted in the rules described in Table 2.
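The filtering step can be mimicked outside of Analysis Services with a simple predicate over the mined rules. The sketch below assumes rules stored as dictionaries and a plain substring match on the antecedent; both the representation and the toy rules are invented for illustration and are not the tool's actual filter semantics.

```python
from typing import Iterable, List


def filter_rules(rules: List[dict], keywords: Iterable[str], gender: str) -> List[dict]:
    """Keep rules whose antecedent mentions a keyword and the given gender."""
    keywords = tuple(keywords)
    selected = []
    for rule in rules:
        items = rule["antecedent"]
        has_keyword = any(k in item for item in items for k in keywords)
        if has_keyword and gender in items:  # exact match, so "female" does not match "male"
            selected.append(rule)
    return selected


# Toy rules, invented for illustration only.
rules = [
    {"antecedent": ("evil", "male"), "consequent": "black", "confidence": 0.7},
    {"antecedent": ("bad guy", "male"), "consequent": "boots", "confidence": 0.6},
    {"antecedent": ("evil", "female"), "consequent": "necklace", "confidence": 0.8},
    {"antecedent": ("active", "male"), "consequent": "boots", "confidence": 0.5},
]

villain_rules = filter_rules(rules, keywords=("evil", "bad"), gender="male")
```

Only the first two toy rules pass: the third fails the gender condition and the fourth contains neither keyword.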

5.3 Prototype Implementation – Hypothesis Validation

The rules depicted in Table 2 give a first impression of pattern indicators for a villain in western movies and high school comedy movies. Scanning the rules for western movies, one can easily picture a typical costume of a bandit wearing a black cowboy hat, a black scarf, a worn-out shirt and brown boots.

Nevertheless, the rules have to be considered independent of each other: one can only tell that a villain in a western movie often wears a cowboy hat and boots. In addition, villains often seem to wear costume elements made of worn-out cotton in brown or black. However, at this point one cannot relate such base element attributes, like color and material, to specific worn base elements. Further, a domain expert has to analyze the mined rules in order to decide which rules provide meaningful information and, thus, are actually relevant for the hypothesis at hand.

Once such a set of rules has been found, it has to be validated against the data set. We used the OLAP cube by Falkenthal et al (2015) to validate these pattern indicators. To access the provided OLAP cube we used Microsoft Excel and its pivot functionality. In the domain of Digital Humanities, Microsoft Excel gives users with little IT background easy access to the functions provided. We applied filters to set the genre, gender, stereotype and character traits we are looking for.

We validated the rules by comparing two dimensions. We used the dimension “base element” as column values and the dimensions “base element design”, “base element color”, “base element material”, and “base element condition” as row values, each combination in a separate pivot table. For the column values, we restricted the range to the values found by the association rule algorithm.

In the case of high school comedy movies, we set the range to “Long Trousers”, “Tie”, “Business Shirt”, and “Wristwatch”. With this setting, we were able to relate each base element to the base element attributes and validate whether a discovered rule significantly expresses a state in the OLAP cube.
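Relating attributes to specific base elements amounts to a two-dimensional cross-tabulation that is restricted to the base elements found by the rule discovery. The following sketch assumes flat fact rows with hypothetical field names and invented data; it illustrates the idea and is not the pivot configuration used in the prototype.

```python
from collections import defaultdict
from typing import Dict, List, Set


def cross_tab(rows: List[dict], elements: Set[str], attribute: str) -> Dict[str, Dict[str, int]]:
    """Distinct costume count per (base element, attribute value) pair."""
    table: Dict[str, Dict[str, Set[int]]] = defaultdict(lambda: defaultdict(set))
    for r in rows:
        if r["base_element"] in elements:  # restrict to the mined range of elements
            table[r["base_element"]][r[attribute]].add(r["costume_id"])
    return {e: {v: len(ids) for v, ids in cols.items()} for e, cols in table.items()}


def dominant_value(counts: Dict[str, int]) -> str:
    """Most frequent attribute value; ties resolve to the alphabetically first value."""
    return max(sorted(counts), key=counts.get)


# Toy fact rows, invented for illustration only.
rows = [
    {"costume_id": 1, "base_element": "long trousers", "color": "black"},
    {"costume_id": 2, "base_element": "long trousers", "color": "black"},
    {"costume_id": 3, "base_element": "long trousers", "color": "blue"},
    {"costume_id": 1, "base_element": "tie", "color": "brown"},
    {"costume_id": 2, "base_element": "tie", "color": "brown"},
    {"costume_id": 3, "base_element": "hat", "color": "black"},  # outside the range
]

tab = cross_tab(rows, {"long trousers", "tie"}, "color")
dominant = {element: dominant_value(counts) for element, counts in tab.items()}
```

Unlike the independent association rules, the cross-tabulation ties each color to a concrete base element, here black to the long trousers and brown to the tie.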

Table 2 Discovered rules for villains in genres “high school comedy” and “western”

Genre “high school comedy” (min. s = 5)      c   Genre “western” (min. s = 10)              c
 1. {evil, male} ⇒ {long trousers}       67.6%   16. {villain, male} ⇒ {long trousers}  75.4%
 2. {villain, male} ⇒ {tie}              43.8%   17. {villain, male} ⇒ {boots}          55.4%
 3. {villain, male} ⇒ {business shirt}   50.0%   18. {evil, male} ⇒ {revolver}          39.1%
 4. {villain, male} ⇒ {wristwatch}       56.3%   19. {evil, male} ⇒ {cartridge belt}    26.2%
 5. {villain, male} ⇒ {striped}          50.0%   20. {villain, male} ⇒ {casual shirt}   43.1%
 6. {evil, male} ⇒ {black}               71.8%   21. {villain, male} ⇒ {jacket}         24.1%
 7. {evil, male} ⇒ {blue tones}          63.4%   22. {evil, male} ⇒ {cowboy hat}        35.6%
 8. {evil, male} ⇒ {metallic colors}     73.2%   23. {villain, male} ⇒ {scarf}          21.1%
 9. {villain, male} ⇒ {powerful color}   93.8%   24. {villain, male} ⇒ {striped}        27.7%
10. {villain, male} ⇒ {shiny}            87.5%   25. {evil, male} ⇒ {checkered}         13.8%
11. {evil, male} ⇒ {gold}                56.3%   26. {villain, male} ⇒ {black}          78.5%
12. {villain, male} ⇒ {cotton}          100.0%   27. {evil, male} ⇒ {brown tones}       66.7%
13. {evil, male} ⇒ {clean}               73.2%   28. {evil, male} ⇒ {cotton}            81.6%
14. {evil, male} ⇒ {tidy}                18.3%   29. {villain, male} ⇒ {leather}        80.0%
15. {villain, male} ⇒ {ironed}           56.3%   30. {villain, male} ⇒ {worn-out}       52.3%

By applying this process, we identified that a villain in a high school comedy typically wears black trousers, a blue-striped shirt, a brown tie and a golden watch, as depicted by the costume composition graph in Fig. 5. Looking at the profession of roles wearing such costumes, one can discover that mostly “teachers” and “attorneys” are expressed by such costumes and that these roles often act as the counterpart of the protagonist, namely the “popular boy” or the “prom king” in high school comedy movies. In contrast, a villain in a western movie typically wears a black cowboy hat, a black scarf, a brownish, worn-out and checkered shirt, a black jacket and brown boots and pants, as depicted in Fig. 6. Drilling down to the type of role that is represented by such costumes, we can determine that such roles are often expressed as bandits.

Fig. 5 Villain in a high school comedy movie (costume composition graph; edges are labelled <<worn-on>> and <<worn-above>>)

These resulting costume composition graphs show the feasibility of the presented approach. Applying the Costume Pattern Mining Method to our corpus of concrete costume descriptions, we were able to identify indicators for costume patterns. To reuse these findings, a domain expert can build upon them to author costume patterns by refining the results to include even more relevant parameters and abstracting the essence into patterns.

Fig. 6 Villain in a western movie (costume composition graph; edges are labelled <<worn-on>> and <<worn-above>>)

6 Summary and Outlook

In this paper, we presented an approach to identify indicators for costume patterns in movies by using data mining techniques for finding coherences between costumes. A prototype was presented that builds upon data mining techniques for the basic validation of these indicators. This can support the identification of costume patterns and the creation of a costume language, as described by Barzen and Leymann (2015) as well as Fehling et al (2015). Our work shows how IT can take up issues from the humanities and contribute approaches and ideas typically not utilized in this domain. Therefore, the presented approach is a motivating example of how the emergent field of Digital Humanities can be enabled and influenced by building upon established methods and techniques from IT.

We focused on a specific area of costume parts and attributes for identifying costume patterns in specific genres, as the current film corpus has the best coverage of those genres. To enhance the results of the presented method in the future, the number of data mining structures has to be increased. This also includes the use of additional data mining algorithms, such as clustering. The set of genres can also be expanded as the film corpus grows. Executing the presented method with different movie genres can also provide additional confirmation that the approach works.

References

Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, USA, VLDB ’94, pp 487–499

Alexander C, Ishikawa S, Silverstein M (1977) A pattern language: towns, buildings, construction. Oxford University Press, New York

Appleton B (1997) Patterns and software: essential concepts and terminology. Object Magazine Online 3(5)

Barzen J (2013) Taxonomien kostümrelevanter Parameter: Annäherung an eine Ontologisierung der Domäne des Filmkostüms. Technical Report 2013/04, University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Germany

Barzen J, Leymann F (2015) Costume languages as pattern languages. In: Baumgartner P, Sickinger R (eds) Proceedings of PURPLSOC (Pursuit of Pattern Languages for Societal Change). The Workshop 2014., epubli GmbH, pp 88–117

Barzen J, Leymann F (2016) Patterns as Formulas: Applying the Scientific Method to the Humanities. Technical Report 2016/01, University of Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Germany, University of Stuttgart, Institute of Architecture of Application Systems

Barzen J, Falkenthal M, Hentschel F, Leymann F (2015) Musterforschung in den Geisteswissenschaften: Werkzeugumgebung zur Musterextraktion aus Filmkostümen. In: Extended Abstract Digital Humanities im deutschsprachigen Raum (DHd 2015), DHd 2015, Graz

Bishop C (2006) Pattern Recognition and Machine Learning. Springer, New York

Codd EF, Codd SB, Salley CT (1993) Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. E. F. Codd and Associates

Coplien J (1996) Software Patterns. SIGS

Dearden A, Finlay J (2006) Pattern Languages in HCI: A Critical Review. Human-Comp Interaction 21(1):49–102

Falkenthal M, Barzen J, Breitenbücher U, Fehling C, Leymann F (2014a) Efficient Pattern Application: Validating the Concept of Solution Implementations in Different Domains. International Journal On Advances in Software 7(3&4):710–726

Falkenthal M, Barzen J, Breitenbücher U, Fehling C, Leymann F (2014b) From pattern languages to solution implementations. In: Proceedings of the 6th International Conferences on Pervasive Patterns and Applications (PATTERNS), pp 12–21

Falkenthal M, Barzen J, Dorner S, Elkind V, Fauser J, Leymann F, Strehl T (2015) Datenanalyse in den Digital Humanities – Eine Annäherung an Kostümmuster mittels OLAP Cubes. In: Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs “Datenbanken und Informationssysteme” (DBIS), Lecture Notes in Informatics

Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM 39(11):27–34

Fehling C, Barzen J, Breitenbücher U, Leymann F (2014) A process for pattern identification, authoring, and application. In: Proceedings of the 19th European Conference on Pattern Languages of Programs – EuroPLoP '14, Association for Computing Machinery (ACM)

Fehling C, Barzen J, Falkenthal M, Leymann F (2015) PatternPedia - Collaborative Pattern Identification and Authoring. In: Proceedings of PURPLSOC (Pursuit of Pattern Languages for Societal Change). The Workshop 2014., epubli GmbH, pp 252–284

Golfarelli M, Maio D, Rizzi S (1998) The Dimensional Fact Model: A Conceptual Model For Data Warehouses. International Journal of Cooperative Information Systems 7:215–247

Hohpe G, Woolf B (2003) Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Longman Publishing Co., Inc.

ISO (2006) ISO/IEC 13249-6:2006 Information technology – Database languages – SQL Multimedia and Application Packages – Part 6: Data Mining

Reiners R (2013) An Evolving Pattern Library for Collaborative Project Documentation. PhD thesis, RWTH Aachen University

Reiners R, Falkenthal M, Jugel D, Zimmermann A (2015) Requirements for a collaborative formulation process of evolutionary patterns. In: Proceedings of the 18th European Conference on Pattern Languages of Programs – EuroPLoP '13, Association for Computing Machinery (ACM)

Schumm D, Barzen J, Leymann F, Ellrich L (2012) A pattern language for costumes in films. In: Proceedings of the 17th European Conference on Pattern Languages of Programs – EuroPLoP '12, Association for Computing Machinery (ACM)