Wlodzislaw Duch (1,2) and Pawel Matykiewicz (2,3)
(1) School of Computer Engineering, Nanyang Technological University, Singapore
(2) Department of Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100 Torun, Poland
(3) Department of Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, Ohio, USA
Similarity of the semantic content of web pages is displayed using interactive graphs that present fragments of minimum spanning trees. Homepages of people are analyzed, parsed into XML documents, and visualized using TouchGraph LinkBrowser, which displays clusters of people who share common interests. The structure of these graphs is strongly affected by the selection of information used to calculate similarity. The influence of simple keyword selection and of Latent Semantic Analysis (LSA) on the structure of such graphs is analyzed. Homepages and lists of publications are converted to word frequency vectors, which are filtered and weighted; the similarity matrix between the normalized vectors is then used to create separate minimum sub-trees showing the clustering of people's interests. Results show that in this application simple selection of important keywords is as good as LSA, at much lower algorithmic complexity.
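The pipeline described in the abstract (word frequency vectors, weighting, normalization, similarity matrix, minimum spanning tree) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, TF-IDF stands in for the unspecified weighting scheme, cosine similarity stands in for the similarity measure, Prim's algorithm builds the tree, and the four toy "homepages" are invented. LSA and the TouchGraph visualization are not covered.

```python
import math
from collections import Counter

def tf_vectors(docs, stopwords=frozenset()):
    """Turn each document into a word-frequency Counter, dropping stopwords."""
    return [Counter(w for w in text.lower().split() if w not in stopwords)
            for text in docs]

def tfidf_normalize(vectors):
    """Weight counts by inverse document frequency (an assumed scheme),
    then scale every vector to unit length."""
    n = len(vectors)
    df = Counter(term for vec in vectors for term in vec)
    result = []
    for vec in vectors:
        w = {t: c * math.log(n / df[t]) for t, c in vec.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        result.append({t: x / norm for t, x in w.items()} if norm > 0 else {})
    return result

def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(x * v.get(t, 0.0) for t, x in u.items())

def minimum_spanning_tree(vectors):
    """Prim's algorithm on the complete graph with distance 1 - cosine.
    Returns a list of (i, j, distance) edges."""
    n = len(vectors)
    if n == 0:
        return []
    dist = lambda i, j: 1.0 - cosine(vectors[i], vectors[j])
    # best[j] = (distance to the tree, nearest node already in the tree)
    best = {j: (dist(0, j), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((i, j, d))
        for k in best:
            dk = dist(j, k)
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges

# Invented toy "homepages": two people in machine learning, two in genomics.
docs = [
    "neural networks learning",
    "neural networks vision",
    "protein folding genome",
    "protein genome sequencing",
]
vectors = tfidf_normalize(tf_vectors(docs))
edges = minimum_spanning_tree(vectors)

# Assumed way to obtain separate sub-trees: drop edges between unrelated pages.
subtree_edges = [(i, j) for i, j, d in edges if d < 0.99]
print(subtree_edges)
```

In this toy case the tree has three edges, and removing the one edge that joins pages with no shared vocabulary leaves two sub-trees, one per interest group. The 0.99 cutoff is an arbitrary illustrative threshold.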
Reference: Duch W., Matykiewicz P. Minimum Spanning Trees Displaying Semantic Similarity. In: Klopotek M.A., Wierzchon S.T., Trojanowski K. (Eds.), Intelligent Information Processing and Web Mining, Advances in Soft Computing, Springer Verlag (2005), pp. 31-40. ISBN 3-540-25056-5.
Preprint for comments in PDF, 621 KB.