Web Intelligent Information Systems based on Web Mining Search Results
by
Ricardo Campos
Tomar Polytechnic Institute
Coauthors: Gaël Dias, University of Beira Interior
With so much information published on the web, search engines face a very difficult task in selecting the most relevant documents. Typically, they show low precision in response to a query, retrieving many useless web snippets and missing other important ones. In this paper we study the web page hierarchical clustering problem. We propose the architecture of WISE [1], a meta search engine that automatically groups related web pages (those sharing the same query sense) into a set of clusters, hierarchically organized and each labeled with a phrase representative of its key concept (we refer interchangeably to phrases, key concepts and keyword concepts extracted from the web documents).
The system, which has a web-based interface, introduces some interesting ideas, such as the pre-selection of the retrieved web pages and the ability to statistically detect phrases within the documents. Each document is in turn represented by its most relevant key concepts, which are identified with web content mining techniques based on pre-trained decision trees. The final step of the system relies on a graph-based overlapping clustering algorithm, which groups the selected documents into a hierarchy of overlapping clusters.
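To make the idea of overlapping, phrase-labeled clusters more concrete, here is a minimal sketch in Python. It is not the WISE algorithm, which the abstract does not specify. It assumes each document has already been reduced to a set of key phrases, treats documents and phrases as a bipartite graph, forms one cluster per shared phrase (so a document may belong to several clusters), and nests a cluster under the smallest cluster that strictly contains it. The function names, the `min_size` parameter and the sample data are all hypothetical.

```python
from collections import defaultdict


def overlapping_clusters(doc_phrases, min_size=2):
    """Form one cluster per phrase shared by at least `min_size` documents.

    `doc_phrases` maps a document id to its set of key phrases (assumed to be
    extracted beforehand). A document appears in every cluster whose label it
    carries, so clusters may overlap.
    """
    clusters = defaultdict(set)
    for doc, phrases in doc_phrases.items():
        for phrase in phrases:
            clusters[phrase].add(doc)
    return {p: docs for p, docs in clusters.items() if len(docs) >= min_size}


def build_hierarchy(clusters):
    """Map each cluster label to its parent label, or None for a root.

    A cluster is nested under the smallest cluster whose document set
    strictly contains its own (ties broken alphabetically by label).
    """
    parents = {}
    for child, child_docs in clusters.items():
        enclosing = [p for p, docs in clusters.items() if child_docs < docs]
        if enclosing:
            parents[child] = min(enclosing, key=lambda p: (len(clusters[p]), p))
        else:
            parents[child] = None
    return parents


# Hypothetical key phrases for five pages returned for the query "jaguar".
doc_phrases = {
    "d1": {"jaguar", "jaguar car"},
    "d2": {"jaguar", "jaguar car"},
    "d3": {"jaguar", "jaguar car", "jaguar animal"},
    "d4": {"jaguar", "jaguar animal"},
    "d5": {"jaguar"},
}

clusters = overlapping_clusters(doc_phrases)
parents = build_hierarchy(clusters)
```

In this toy data, page `d3` carries both the "jaguar car" and "jaguar animal" phrases, so it lands in both clusters, and both clusters sit under the broader "jaguar" cluster.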
We believe that our solution is innovative because the architecture as a whole, and not just part of it, is language and topic independent. Moreover, we are the first in the literature to combine hierarchical clustering and phrases with web content mining techniques to represent the documents semantically, overcoming the problems of synonymy and of ambiguous, poor and uninformative user queries.
References
1. Campos, R., Dias, G.: Automatic Hierarchical Clustering of Web Pages. In: Proceedings of the ELECTRA Workshop associated with the 28th Annual International ACM SIGIR Conference, Salvador, Brazil, August 19, pp. 83-85 (2005).
Date received: July 12, 2006
Copyright © 2006 by the author(s). The author(s) of this document and the organizers of the conference have granted their consent to include this abstract in Atlas Conferences Inc. Document # cath-52.