Document Clustering Using Side Information for Mining Text Data
In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes that are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process or add noise to it. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages of using this side-information. In this paper, we design an algorithm that combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.
Classification, Text Mining, Side Information, Data Mining, Clustering.