Large Document Set Clustering: an Integrated Approach

Krishna Kumar Mohbey, G.S. Thakur

Abstract


Document clustering is an important mining task used by different people for different purposes. It is generally used to find similar documents in a large collection. The document set may be a collection of blogs, website access patterns, or transaction files. With document clustering, one can identify similar habits among different people, which can play a large role in future trend analysis and decision making. Most clustering methods use distance calculations as the similarity measure. They scan the documents multiple times to determine classes and then build the clusters. When the document set is large, these methods take more time to cluster. We propose an advanced environment for document clustering in which the documents are scanned only once and each document is immediately assigned to the appropriate cluster. Experiments are conducted on the 20 Newsgroups dataset using MATLAB. Experimental results show the effectiveness of the proposed environment for large document sets.
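
The abstract does not specify how documents are assigned during the single scan. As an illustration only, the following minimal Python sketch shows one common single-pass scheme (leader-style clustering): each document is compared with the running term vectors of the existing clusters and either joins the most similar cluster or starts a new one. The whitespace tokenizer, the cosine similarity measure, the `threshold` value, and all function names are assumptions made for this sketch; they are not taken from the paper and may differ from the authors' method.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector (assumed tokenizer: whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(documents, threshold=0.3):
    """Assign each document to a cluster during a single scan.

    Each cluster is summarized by the summed term vector of its members.
    Cosine similarity ignores scale, so the sum behaves like a centroid.
    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    cluster_vectors = []  # one summed term vector per cluster
    assignments = []      # cluster index for each document, in input order
    for text in documents:
        vec = term_vector(text)
        best_index, best_sim = -1, 0.0
        for index, cluster_vec in enumerate(cluster_vectors):
            sim = cosine(vec, cluster_vec)
            if sim > best_sim:
                best_index, best_sim = index, sim
        if best_index >= 0 and best_sim >= threshold:
            # Similar enough: join the cluster and update its summary.
            cluster_vectors[best_index].update(vec)
            assignments.append(best_index)
        else:
            # No sufficiently similar cluster: start a new one.
            cluster_vectors.append(Counter(vec))
            assignments.append(len(cluster_vectors) - 1)
    return assignments


docs = [
    "the team won the hockey game",
    "graphics card driver for windows",
    "hockey game tonight the team plays",
    "windows driver update for graphics",
]
labels = single_pass_cluster(docs, threshold=0.3)
print(labels)
```

In this sketch every document is read exactly once, and the cost per document grows with the number of clusters rather than the number of documents seen so far. The result depends on the order of the documents and on the chosen threshold, which is a known limitation of single-pass schemes of this kind.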


Keywords


Document Clustering, Similarity Measurements, Dendrogram, Term Extraction



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.