Concept Based Text Document Clustering
A cluster is a collection of data objects that are similar to one another. Because the objects in a cluster can be treated collectively as one group, clustering may be considered a form of data compression. In some applications clustering is also called data segmentation, because it partitions large data sets into groups according to their similarity. Documents are indexed by related or semantically related keywords. A topic-based weighting scheme is proposed to index the text: it identifies topic candidates, determines their importance, and detects similar and synonymous topics. The indexing algorithm uses topic frequency to determine the existence and importance of the topics. A concept-based weighting scheme is also used to index the documents; it likewise identifies topic candidates, determines their importance, and detects similar and synonymous topics. In this system, a number of medical documents are collected and then pre-processed, which includes tokenization and stop-word removal. Finally, the topic-based weighting scheme is compared with other indexing schemes to show that topic-based indexing reduces the dimensionality of the data, remains efficient even for very large databases, and provides an understandable description of the discovered clusters by their frequent term sets.
Clustering algorithms, indexing, topic-based weighting scheme, concept-based weighting scheme, MeSH ontology.
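The pre-processing and frequency-based weighting steps summarized above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the stop-word list, the `CONCEPT_MAP` synonym table (a toy stand-in for an ontology lookup such as MeSH), and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "with", "for", "by"}

# Hypothetical synonym-to-concept table standing in for an ontology lookup.
# A real system would resolve terms against the ontology itself.
CONCEPT_MAP = {
    "tumor": "neoplasm",
    "tumour": "neoplasm",
    "cancer": "neoplasm",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize and drop stop words."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def concept_frequencies(text):
    """Map synonymous terms to one concept and count each concept's occurrences."""
    concepts = [CONCEPT_MAP.get(tok, tok) for tok in preprocess(text)]
    return Counter(concepts)


if __name__ == "__main__":
    doc = "The tumor and the cancer of the heart"
    print(concept_frequencies(doc))
```

Merging synonyms before counting is what lets a frequency-based weight reflect a concept rather than a single surface term, and it also shrinks the vocabulary, which is the dimensionality reduction the abstract refers to.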