# Introduction to Concepts and Techniques in Data Mining and Application to Text Mining

## Download this book!

This book is composed of six chapters.

Chapter 1 introduces the fields of data mining and text mining. It covers the common steps in each, along with their types and applications. Seven types of mining tasks are described, and further challenges are discussed.

Chapter 2 treats data preprocessing in detail. It covers how to represent data and how to clean, integrate, transform and reduce data before the main data mining process.

Chapter 3 describes a number of classification and prediction methods, including Fisher’s linear discriminant or centroid-based method, the k-nearest neighbor method, statistical classifiers, decision trees, rule-based classification, artificial neural networks, and support vector machines. For numeric prediction, linear regression, regression trees and model trees are explained. Moreover, two techniques for using regression for classification are presented. At the end of the chapter, four model ensemble techniques, namely bagging, boosting, stacking and co-training, are introduced to combine the results from multiple classifiers and obtain better performance.

Chapter 4 presents techniques for two general unsupervised learning tasks: cluster analysis and association analysis. For clustering, some common approaches, including partition-, hierarchical-, density-, grid-, and model-based clustering, are described in detail. Three common algorithms, Apriori, FP-tree and CHARM, are given for association analysis. An extension of association analysis with hierarchical structures is also discussed.

Evaluation methods for information retrieval, classification and numeric prediction form Chapter 5.

Finally, three applications of data mining to text mining are given as examples in Chapter 6. They are centroid-based text classification, document relation extraction and automatic Thai unknown detection. Their original full descriptions can be found in (Lertnattee and Theeramunkong, 2004a), (Sriphaew and Theeramunkong, 2007a) and (TeCho et al., 2009b).