Exploiting Knowledge Representation for Pattern
Mariângela Vanzin, Karin Becker
Pontifícia Universidade Católica do Rio Grande do Sul – PUCRS
Av. Ipiranga, 6681, Porto Alegre, Brazil
Abstract. Web Usage Mining (WUM) is the application of data mining techniques over web server logs in order to extract navigation usage patterns. Semantic Web Usage Mining aims at combining the Semantic Web and WUM.
The main goal of Semantic WUM is to improve the process and the results of WUM by exploiting the new semantic structure in the Web. Pattern analysis is a critical phase in WUM, for two main reasons: a) mining algorithms yield a huge number of patterns; b) there is a significant semantic gap between URLs and events performed by users. This paper discusses the use of ontologies available on the Semantic Web to support the interpretation of web usage sequential patterns. Functionality is targeted at supporting the comprehension of patterns, as well as the identification of potentially interesting ones through interactive pattern rummaging.
1 Introduction
Web Mining aims at discovering insights about Web resources and their usage.
Web Usage Mining (WUM) is the application of data mining techniques to extract navigation usage patterns from records of page requests made by visitors of a Web site. Access patterns mined from Web logs can represent useful knowledge in practice: they can help improve the design of Web sites, analyze users' reactions and motivation, build adaptive Web sites, and improve site content, among others. The comprehension of mined patterns is difficult due to the primarily syntactical nature of web data. Thus, the formalization of the semantics of Web resources and navigation behavior is increasingly required.
The Semantic Web is a proposal to enrich the Web with machine-processable information to better support users in their tasks. Semantic Web Mining aims at combining these two research areas [3, 5, 6]. The main goal is, on the one hand, to improve the results of Web Mining by exploiting the new semantic structures available in the Web and, on the other hand, to make use of Web Mining for building up the Semantic Web. Recently, many approaches have started exploiting the semantic structures stored in the ontology layer of the Semantic Web architecture.
The WUM process is divided into three generic phases: preprocessing, pattern discovery and pattern analysis. Pattern analysis remains a key issue in the area of WUM. Typically, mining techniques (e.g. association, sequence) yield a huge number of patterns, and most of them are useless, incomprehensible or uninteresting to users. Due to the large number of patterns, users have difficulty identifying the ones that are interesting with regard to the domain.
This paper discusses the use of ontologies, possibly available on the Semantic Web, to support pattern interpretation. Ontologies are exploited for addressing three interrelated problems: a) to represent patterns in a more intuitive form, b) to identify patterns related to some subject of interest, and c) to identify potentially interesting patterns through concept-oriented, interactive pattern rummaging. Other features complement this approach, such as pattern grouping and visual pattern representation.
The remainder of this paper is structured as follows. Section 2 presents the proposed ontology-based functionality targeted at supporting the analysis phase. It describes the ontology properties, and its use for conceptual pattern representation, pattern rummaging, pattern retrieval and concepts merging. Section 3 describes a scenario of usage. Section 4 compares related work with the proposed approach.
Conclusions and future work are addressed in Section 5.
2 An Ontology-based Approach for Pattern Analysis
Given the output of the pattern discovery phase, the goal of the pattern analysis phase is to eliminate irrelevant patterns and to extract the interesting ones, i.e. those that constitute knowledge. But pattern analysis is not an easy task because: a) the number of patterns yielded by mining algorithms can easily exceed the ability of a human user to identify interesting results; b) the output of Web mining algorithms is not suitable for human interpretation; and c) frequently in a WUM process the user does not know what he is looking for, i.e. in most cases the search for interesting patterns is exploratory and does not include hypothesis verification.
Our approach makes use of ontologies, possibly available in the Ontology Layer of the Semantic Web, to support the interpretation of web usage sequential patterns.
Ontologies are exploited for addressing three interrelated problems: a) to represent patterns in a more intuitive form, thus reducing the gap between URLs and site events, b) to identify patterns that are related to some subject of interest, and c) to identify potentially interesting patterns through concept-oriented interactive pattern rummaging. Other features complement this approach, such as the grouping of patterns by different similarity criteria and visual pattern representation and manipulation. The remainder of this section describes the underlying assumptions for developing the pattern analysis, the ontology structure, as well as the functionality proposed to support the pattern analysis activity. The next section illustrates the use of the functionality using the prototype currently under implementation.
2.1 WUM Process Assumptions
Our approach is targeted at the pattern analysis phase. The preprocessing phase considers a set of URLs as data source, which are processed using typical activities, such as data cleaning, user and session identification and path completion. Preprocessing also does not assume any particular data enrichment. If available, a semantic log composed of records with formal semantics based on an ontology underlying the site could be used as well (e.g. ).
Because we are interested in usage patterns, we assume the application of the sequence technique in the pattern discovery phase using the algorithm of . As in , we assume running the mining algorithm with a minimum support threshold. Higher threshold values can make a mining algorithm run faster, but at the risk of reducing the usefulness of data mining results. The basic idea is to accept the execution time required for mining, as well as the huge number of patterns returned. Then, the pattern analysis functionality described in the remainder of this section is used to focus on a subset of patterns, to interpret their meaning, and to identify the potentially interesting ones.
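To make the role of the support threshold concrete, the following minimal Python sketch counts the support of candidate sequential patterns over a set of sessions. It is not the mining algorithm assumed in this paper: the session data, the page paths and the function names (is_subsequence, support, frequent) are hypothetical, and a session is assumed to support a pattern when the pattern's pages occur in order, not necessarily contiguously.

```python
from typing import Dict, List, Sequence, Tuple


def is_subsequence(pattern: Sequence[str], session: Sequence[str]) -> bool:
    """True if the pages of `pattern` occur in `session` in the same order."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator up to the first match,
    # so later pattern elements can only match later session entries.
    return all(page in remaining for page in pattern)


def support(pattern: Sequence[str], sessions: List[Sequence[str]]) -> float:
    """Fraction of sessions that contain the pattern."""
    if not sessions:
        return 0.0
    hits = sum(1 for session in sessions if is_subsequence(pattern, session))
    return hits / len(sessions)


def frequent(candidates: List[Tuple[str, ...]],
             sessions: List[Sequence[str]],
             min_support: float) -> Dict[Tuple[str, ...], float]:
    """Keep only the candidates whose support reaches the threshold."""
    scored = {p: support(p, sessions) for p in candidates}
    return {p: s for p, s in scored.items() if s >= min_support}


# Hypothetical preprocessed sessions (one list of page requests per visit).
sessions = [
    ["/home", "/glossary", "/upload", "/send"],
    ["/home", "/upload", "/cancel"],
    ["/glossary", "/upload", "/send"],
    ["/home", "/glossary"],
]
candidates = [
    ("/glossary", "/upload"),
    ("/upload", "/send"),
    ("/upload", "/cancel"),
]

strict = frequent(candidates, sessions, min_support=0.5)
relaxed = frequent(candidates, sessions, min_support=0.25)
```

On this toy data a threshold of 0.5 keeps two of the three candidates, while a threshold of 0.25 keeps all three, which illustrates why a low threshold leaves more patterns for the analysis phase to handle.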
2.2 Ontology Representation
Ontologies available on the Semantic Web can be used to represent the events of a web site, which can be roughly categorized as service (e.g. buying, finding) and content (e.g. Hamlet). Thus, they can be used to associate meaning to web pages and user actions over pages. Our approach exploits the semantics of the pages visited along users' paths, where meaningful application events are mapped into domain knowledge. The domain events are represented in two levels: conceptual and physical. The conceptual level is composed of an ontology that specifies concepts and relationships among these concepts. At the physical level, events are represented by URLs. The conceptual level corresponds to the ontology layer in the Semantic Web architecture.
Ontologies represent and support relationships among concepts, thus giving them meaning. Three types of relationship are considered in this work: generalization/specialization, which is a powerful abstraction for sharing similarities among classes while preserving their differences; aggregation (part-whole and part-of relationships), in which classes representing the components are associated with the class representing the entire assembly; and binary relationships, representing any other type of relationship that connects two concepts.
URLs are then mapped into ontology concepts according to two dimensions: service and content. A URL can be mapped into one service, one content or both. If a URL is mapped into both a service and a content, the predominant dimension must be defined. The same ontology concept can be used in the mapping of various URLs. Not all URLs need to be mapped (e.g. auxiliary pages). Figure 1 describes the structure of the ontology using a UML class diagram.
The task of mapping URLs into ontology concepts can be laborious, but it pays off by greatly simplifying the interpretation activity, as described in the remainder of this section. The future semantic web will certainly contribute to reducing this effort, in that the creation of the respective ontology layer will be part of any site design.
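The two-level representation and the mapping rules described in this subsection could be sketched as the following Python data structures. This is only an illustration under stated assumptions, not the structure of Figure 1: the class and field names (Concept, UrlMapping, generalization, part_of, binary, predominant) and the relation label "has-words-about" are hypothetical, and the concept names are taken from the examples used later in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Concept:
    """Conceptual level: one ontology concept and its relationships."""
    name: str
    generalization: Optional[str] = None  # more general concept, if any
    part_of: Optional[str] = None         # whole it belongs to (aggregation)
    binary: Dict[str, str] = field(default_factory=dict)  # label -> concept


@dataclass(frozen=True)
class UrlMapping:
    """Physical level: a URL mapped to a service, a content, or both."""
    url: str
    service: Optional[str] = None
    content: Optional[str] = None
    predominant: Optional[str] = None     # "service" or "content"

    def __post_init__(self) -> None:
        if self.service is None and self.content is None:
            raise ValueError("a mapped URL needs a service, a content, or both")
        both = self.service is not None and self.content is not None
        if both and self.predominant not in ("service", "content"):
            raise ValueError("predominant dimension must be defined")

    def primary_concept(self) -> str:
        """Concept of the predominant dimension, or of the only one mapped."""
        if self.service is not None and self.content is not None:
            return self.service if self.predominant == "service" else self.content
        return self.service if self.service is not None else self.content


# Hypothetical fragment of an ontology for a distance-education site.
ontology = {c.name: c for c in [
    Concept("Task-Submission"),
    Concept("Send-File", generalization="Task-Submission"),
    Concept("Distance-Education"),
    Concept("Virtual-Environment", part_of="Distance-Education"),
    Concept("Learning-Process"),
    Concept("Glossary", binary={"has-words-about": "Learning-Process"}),
    Concept("Visualize-Information"),
]}

# Unmapped URLs (e.g. auxiliary pages) simply have no entry here.
mappings = {m.url: m for m in [
    UrlMapping("URL1", service="Send-File"),
    UrlMapping("URL2", service="Visualize-Information",
               content="Glossary", predominant="content"),
]}
```

Keeping the mapping separate from the concepts reflects the fact that one concept can be the target of many URLs, while the validation in UrlMapping enforces the rule that a predominant dimension is required whenever both dimensions are mapped.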
2.3 Pattern Interpretation Functionalities
Visual Conceptual Pattern Representation.
Patterns yielded by the sequential mining algorithm are sequences of URLs, which are often hard to interpret. In order to reduce the semantic gap between URLs and events performed by users in the Web site, our approach exploits the semantics of the pages visited by users. Thus, the sequential patterns presented to the analyst are not composed of URLs, but rather of the primitive concepts of the ontology into which they were mapped.
Considering the ontology illustrated in Figure 2, a pattern in the form URL1 → URL2 is displayed using the concepts that represent the corresponding primitive events in the site, such as Send-File → Glossary. This representation gives the analyst a more intuitive meaning of the pattern. By exploring the dimensions, the analyst can interpret the patterns according to his interests. For instance, the pattern URL1 → URL2 can be represented as Send-File → Visualize-Information if the analyst is interested in the service dimension, or as Send-File → Glossary if both dimensions are of interest. According to the content dimension, the pattern URL2 → URL3 can be interpreted as Glossary → Virtual-Environment.
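The dimension-based display just described can be sketched in a few lines of Python. The mapping table below only mirrors the example in the text; the names URL_MAP, concept_for and render are hypothetical, and the fallback rules (show the predominant or only available concept when the requested dimension is missing, show an unmapped URL as-is) are assumptions rather than behavior stated in this paper.

```python
from typing import Dict, Sequence

# Hypothetical URL-to-concept mapping consistent with the example above.
URL_MAP: Dict[str, Dict[str, str]] = {
    "URL1": {"service": "Send-File"},
    "URL2": {"service": "Visualize-Information",
             "content": "Glossary",
             "predominant": "content"},
    "URL3": {"content": "Virtual-Environment"},
}


def concept_for(url: str, dimension: str = "both") -> str:
    """Concept shown for a URL under the chosen dimension."""
    mapping = URL_MAP.get(url)
    if mapping is None:
        return url  # assumption: unmapped URLs are displayed unchanged
    if dimension in ("service", "content") and dimension in mapping:
        return mapping[dimension]
    # "both", or the requested dimension is not mapped for this URL:
    # fall back to the predominant concept, or to the only one available.
    if "predominant" in mapping:
        return mapping[mapping["predominant"]]
    return mapping.get("service") or mapping["content"]


def render(pattern: Sequence[str], dimension: str = "both") -> str:
    """Display a URL pattern as a sequence of ontology concepts."""
    return " → ".join(concept_for(url, dimension) for url in pattern)


both_view = render(["URL1", "URL2"])
service_view = render(["URL1", "URL2"], dimension="service")
content_view = render(["URL2", "URL3"], dimension="content")
```

Under these assumptions, both_view is "Send-File → Glossary", service_view is "Send-File → Visualize-Information", and content_view is "Glossary → Virtual-Environment".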
The generalization/specialization and aggregation relationships can be explored to provide various abstraction levels over the same pattern. For instance, the pattern Send-File → Virtual-Environment can also be represented as Task-Submission → Virtual-Environment, Task-Submission → Distance-Education, and so on.
Interactive Pattern Rummaging.
The interactive pattern rummaging functionality allows exploiting the ontology in different ways to identify relevant patterns. The analyst can visualize the patterns at different abstraction levels, exploring the generalization and aggregation relationships through operations similar to “roll-up” and “drill-down” in OLAP (On-line Analytical Processing). The roll-up operation replaces a concept by the concept it is related to through a generalization or aggregation relationship. The drill-down operation explores these relationships in the opposite direction.
Roll-up and drill-down operations can be used for two different purposes: to better understand the events represented by the pattern, and to obtain patterns that actually represent a set of patterns. Figure 3 illustrates the use of roll-up operations over individual elements of a pattern for understanding their meaning through more abstract concepts. This task is called pattern comprehension. In this example, the original pattern reveals that users access some page about the subject “virtual environment”, access the glossary and then load and send a specific file. By rolling up the concept Virtual-Environment, the user understands that it is part of the distance education content, which possibly motivates the users to look for other definitions available in the glossary. He also understands that loading and sending are activities related to the submission of an assignment.
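The roll-up and drill-down operations used for pattern comprehension could be sketched as follows. This is a minimal illustration, not the prototype's implementation: the PARENT table, the relationship kinds attached to each concept and the function names are assumptions, with concept names borrowed from the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical ontology fragment: concept -> (relationship kind, parent).
PARENT: Dict[str, Tuple[str, str]] = {
    "Load-File": ("generalization", "Task-Submission"),
    "Send-File": ("generalization", "Task-Submission"),
    "Cancel": ("generalization", "Task-Submission"),
    "Virtual-Environment": ("aggregation", "Distance-Education"),
}


def roll_up(concept: str) -> str:
    """Replace a concept by its generalization or by the whole it is part of."""
    if concept in PARENT:
        return PARENT[concept][1]
    return concept  # top-level concepts are left unchanged


def drill_down(concept: str) -> List[str]:
    """Follow the same relationships in the opposite direction."""
    return sorted(c for c, (_, parent) in PARENT.items() if parent == concept)


def roll_up_at(pattern: Tuple[str, ...], index: int) -> Tuple[str, ...]:
    """Roll up a single element of a pattern, leaving the others untouched."""
    elements = list(pattern)
    elements[index] = roll_up(elements[index])
    return tuple(elements)


original = ("Virtual-Environment", "Glossary", "Load-File", "Send-File")

# Pattern comprehension: inspect one element at a more abstract level.
comprehension = roll_up_at(original, 0)

# Rolling up the last two elements shows both are task submission activities.
abstracted = roll_up_at(roll_up_at(original, 2), 3)

# Drilling down lists the concrete activities behind the abstract concept.
activities = drill_down("Task-Submission")
```

Here comprehension starts with Distance-Education instead of Virtual-Environment, abstracted ends with two Task-Submission elements, and activities lists Cancel, Load-File and Send-File.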
With the same purpose of pattern comprehension, binary relationships can be used to complement the information about the pattern events, by showing other related concepts on demand. The user selects a concept and asks for the relationships in which it participates. For instance, the Glossary concept has a binary relationship with the concept Learning-Process, as represented in Figure 2. Thus, the analyst can understand that the glossary has words about the learning process.
Another use of the roll-up operation is to obtain an abstract pattern, i.e. a pattern that actually represents a set of patterns. For that purpose, the user substitutes one or more pattern elements with their corresponding abstract concept, as depicted in Figure 4. In this example, the user is interested in patterns where a group of users access a page about virtual environment, then the glossary, followed by two task submission activities (e.g. load, visualize, cancel or send a file, according to the ontology). Notice that in doing so, the support of the abstract pattern must be recalculated. For instance, the abstract pattern may match both Virtual-Environment → Glossary → Load-File → Send-File and Virtual-Environment → Glossary → Load-File → Cancel, which are found in the rule set. Our approach for recalculating the support is inspired by , and it is not discussed here due to space limitations.
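The following Python sketch illustrates why the support has to be recalculated rather than summed. It is a naive recount over sessions and not the approach adopted in this paper, which is not described here. The sessions, the PARENT table and the function names are hypothetical, and sessions are assumed to be already expressed as sequences of ontology concepts.

```python
from typing import Dict, List, Sequence

# Hypothetical generalization links for the task submission activities.
PARENT: Dict[str, str] = {
    "Load-File": "Task-Submission",
    "Send-File": "Task-Submission",
    "Cancel": "Task-Submission",
}


def lineage(concept: str) -> List[str]:
    """The concept itself followed by all of its ancestors."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(abstract: Sequence[str], concrete: Sequence[str]) -> bool:
    """True if `concrete` is an element-wise specialization of `abstract`."""
    return len(abstract) == len(concrete) and all(
        a in lineage(c) for a, c in zip(abstract, concrete))


def covers(pattern: Sequence[str], session: Sequence[str]) -> bool:
    """True if the session contains the (possibly abstract) pattern in order."""
    remaining = iter(session)
    # any(...) consumes the session up to the first matching event.
    return all(any(p in lineage(c) for c in remaining) for p in pattern)


def support(pattern: Sequence[str], sessions: List[Sequence[str]]) -> float:
    """Fraction of sessions covering the pattern; each session counts once."""
    return sum(covers(pattern, s) for s in sessions) / len(sessions)


abstract = ("Virtual-Environment", "Glossary",
            "Task-Submission", "Task-Submission")
send = ("Virtual-Environment", "Glossary", "Load-File", "Send-File")
cancel = ("Virtual-Environment", "Glossary", "Load-File", "Cancel")
rule_set = [send, cancel, ("Glossary", "Load-File", "Send-File")]

sessions = [
    ["Virtual-Environment", "Glossary", "Load-File", "Send-File"],
    ["Virtual-Environment", "Glossary", "Load-File", "Cancel"],
    # This session supports both concrete patterns.
    ["Virtual-Environment", "Glossary", "Load-File", "Cancel",
     "Load-File", "Send-File"],
    ["Glossary", "Load-File", "Send-File"],
]

matched = [p for p in rule_set if matches(abstract, p)]
abstract_support = support(abstract, sessions)
summed = support(send, sessions) + support(cancel, sessions)
```

On this toy data the two matched patterns each have support 0.5, so adding them gives 1.0, whereas the recount gives 0.75 because the third session supports both concrete patterns and must be counted only once.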
The roll-up and drill-down operations allow users to analyze the rule set provided by the mining algorithm in an exploratory manner, based on the events captured by the ontology. For instance, the user starts with the pattern illustrated in Figure 3 and, after having the insight that loading and sending a file are tasks related to the submission of assignments, he rolls up to the abstract pattern of Figure 4, such that all patterns that match this abstraction can be found in the rule set. Then, by drilling down, he becomes aware of the use of other distinct task submission services that support the abstract pattern. By analyzing the support of these rules, he may realize that the number of students who canceled the submission after loading the file is greater than the number of students who actually sent their assignments. He may then conclude that the assignment submission service of the site is not intuitive for the students, and that it should be redesigned or a help/tutorial should be provided.