Text Mining : Concepts, Implementation, and Big Data Challenge.
Springer eBooks EBA - Intelligent Technologies and Robotics Collection 2024 Available online
- Format:
- Book
- Author/Creator:
- Jo, Taeho.
- Series:
- Studies in Big Data Series
- Studies in Big Data Series ; v.45
- Language:
- English
- Physical Description:
- 1 online resource (451 pages)
- Edition:
- 2nd ed.
- Place of Publication:
- Cham : Springer, 2025.
- Summary:
- This popular book, updated as a textbook for classroom use, discusses text mining and the ways this type of data mining can be used to find implicit knowledge in text collections. The author covers concepts and approaches and provides guidelines for implementing text mining systems in Java. The book starts with detailed text preprocessing techniques, then covers the concepts, techniques, implementation, and evaluation of text categorization. It then moves on to more advanced topics, including text summarization, text segmentation, topic mapping, and automatic text management. The book features exercises and code to help readers quickly learn and apply knowledge.
- Contents:
- Intro
- Preface
- Contents
- Part I Foundation
- 1 Introduction
- 1.1 Definition of Text Mining
- 1.2 Texts
- 1.2.1 Text Components
- 1.2.2 Text Formats
- 1.3 Data Mining Tasks
- 1.3.1 Classification
- 1.3.2 Clustering
- 1.3.3 Association
- 1.4 Data Mining Types
- 1.4.1 Relational Data Mining
- 1.4.2 Web Mining
- 1.4.3 Big Data Mining
- 1.5 Summary
- References
- 2 Text Indexing
- 2.1 Overview of Text Indexing
- 2.2 Steps of Text Indexing
- 2.2.1 Tokenization
- 2.2.2 Stemming
- 2.2.3 Stop-Word Removal
- 2.2.4 Term Weighting
- 2.3 Text Indexing: Implementation
- 2.3.1 Class Definition
- 2.3.2 Stemming Rule
- 2.3.3 Method Implementations
- 2.4 Additional Steps
- 2.4.1 Index Filtering
- 2.4.2 Index Expansion
- 2.4.3 Index Optimization
- 2.5 Summary
- 3 Text Encoding
- 3.1 Overview of Text Encoding
- 3.2 Feature Selection
- 3.2.1 Wrapper Approach
- 3.2.2 Principal Component Analysis
- 3.2.3 Independent Component Analysis
- 3.2.4 Singular Value Decomposition
- 3.3 Feature Value Assignment
- 3.3.1 Assignment Schemes
- 3.3.2 Similarity Computation
- 3.4 Issues of Text Encoding
- 3.4.1 Huge Dimensionality
- 3.4.2 Sparse Distribution
- 3.4.3 Poor Transparency
- 3.5 Summary
- 4 Text Association
- 4.1 Overview of Text Association
- 4.2 Data Association
- 4.2.1 Functional View
- 4.2.2 Support and Confidence
- 4.2.3 Apriori Algorithm
- 4.3 Word Association
- 4.3.1 Word Text Matrix
- 4.3.2 Functional View
- 4.3.3 Simple Example
- 4.4 Text Association
- 4.4.1 Functional View
- 4.4.2 Simple Example
- 4.5 Overall Summary
- Part II Text Categorization
- 5 Text Categorization: Conceptual View
- 5.1 Definition of Text Categorization
- 5.2 Data Classification
- 5.2.1 Binary Classification
- 5.2.2 Multiple Classification
- 5.2.3 Classification Decomposition
- 5.2.4 Regression
- 5.3 Classification Types
- 5.3.1 Hard vs. Soft Classification
- 5.3.2 Flat vs. Hierarchical Classification
- 5.3.3 Single vs. Multiple Viewed Classification
- 5.3.4 Independent vs. Dependent Classification
- 5.4 Variants of Text Categorization
- 5.4.1 Spam Mail Filtering
- 5.4.2 Sentimental Analysis
- 5.4.3 Information Filtering
- 5.4.4 Topic Routing
- 5.5 Summary and Further Discussions
- 6 Text Categorization: Approaches
- 6.1 Machine Learning
- 6.2 Lazy Learning
- 6.2.1 K-Nearest Neighbor
- 6.2.2 Radius Nearest Neighbor
- 6.2.3 Distance-Based Nearest Neighbor
- 6.2.4 Attribute Discriminated Nearest Neighbor
- 6.3 Probabilistic Learning
- 6.3.1 Bayes Rule
- 6.3.2 Bayes Classifier
- 6.3.3 Naive Bayes
- 6.3.4 Bayesian Learning
- 6.4 Kernel-Based Classifier
- 6.4.1 Perceptron
- 6.4.2 Kernel Functions
- 6.4.3 Support Vector Machine
- 6.4.4 Optimization Constraints
- 6.5 Summary and Further Discussions
- 7 Text Categorization: Implementation
- 7.1 System Architecture
- 7.2 Class Definitions
- 7.2.1 Classes: Word, Text, and PlainText
- 7.2.2 Interface and Class: Classifier and KNearestNeighbor
- 7.2.3 Class: TextClassificationAPI
- 7.3 SubsectionTitle
- 7.3.1 Class: Word
- 7.3.2 Class: PlainText
- 7.3.3 Class: KNearestNeighbor
- 7.3.4 Class: TextClassificationAPI
- 7.4 Graphic User Interface and Demonstration
- 7.4.1 Class: TextClassificationGUI
- 7.4.2 Preliminary Tasks and Encoding
- 7.4.3 Classification Process
- 7.4.4 System Upgrading
- 7.5 Summary and Further Discussions
- 8 Text Categorization: Evaluation
- 8.1 Evaluation Overview
- 8.2 Text Collections
- 8.2.1 NewsPage.com
- 8.2.2 20NewsGroups
- 8.2.3 Reuter21578
- 8.2.4 OSHUMED
- 8.3 F1 Measure
- 8.3.1 Contingency Table
- 8.3.2 Micro-Averaged F1
- 8.3.3 Macro-Averaged F1
- 8.3.4 Example
- 8.4 Statistical t-Test
- 8.4.1 Student t-Distribution
- 8.4.2 Unpaired Difference Inference
- 8.4.3 Paired Difference Inference
- 8.4.4 Example
- 8.5 Summary and Further Discussions
- Part III Text Clustering
- 9 Text Clustering: Conceptual View
- 9.1 Definition of Text Clustering
- 9.2 Data Clustering
- 9.2.1 SubSubsectionTitle
- 9.2.2 Association vs. Clustering
- 9.2.3 Classification vs. Clustering
- 9.2.4 Constraint Clustering
- 9.3 Clustering Types
- 9.3.1 Static vs. Dynamic Clustering
- 9.3.2 Crisp vs. Fuzzy Clustering
- 9.3.3 SubsectionTitle
- 9.3.4 Single vs. Multiple Viewed Clustering
- 9.4 Derived Tasks from Text Clustering
- 9.4.1 Cluster Naming
- 9.4.2 Subtext Clustering
- 9.4.3 Automatic Sampling for Text Categorization
- 9.4.4 Redundant Project Detection
- 9.5 Summary and Further Discussions
- 10 Text Clustering: Approaches
- 10.1 Unsupervised Learning
- 10.2 Simple Clustering Algorithms
- 10.2.1 AHC Algorithm
- 10.2.2 Divisive Clustering Algorithm
- 10.2.3 Single-Pass Algorithm
- 10.2.4 Growing Algorithm
- 10.3 K-Means Algorithm
- 10.3.1 Crisp K-Means Algorithm
- 10.3.2 Fuzzy K-Means Algorithm
- 10.3.3 Gaussian Mixture
- 10.3.4 K Medoid Algorithm
- 10.4 Competitive Learning
- 10.4.1 Kohonen Networks
- 10.4.2 Learning Vector Quantization
- 10.4.3 Two-Dimensional Self-Organizing Map
- 10.4.4 Neural Gas
- 10.5 Summary and Further Discussions
- 11 Text Clustering: Implementation
- 11.1 System Architecture
- 11.2 Class Definitions
- 11.2.1 Classes in Text Categorization System
- 11.2.2 Class: Cluster
- 11.2.3 Interface: ClusterAnalyzer
- 11.2.4 Class: AHCAlgorithm
- 11.3 Method Implementations
- 11.3.1 Methods in Previous Classes
- 11.3.2 Class: Cluster
- 11.3.3 Class: AHC Algorithm
- 11.4 Class: ClusterAnalysisAPI
- 11.4.1 Class: ClusterAnalysisAPI
- 11.4.2 Class: ClusterAnalyzerGUI
- 11.4.3 Demonstration
- 11.4.4 System Upgrading
- 11.5 Summary and Further Discussions
- Reference
- 12 Text Clustering: Evaluation
- 12.1 Introduction
- 12.2 Cluster Validations
- 12.2.1 Intra-cluster and Inter-cluster Similarities
- 12.2.2 Internal Validation
- 12.2.3 Relative Validation
- 12.2.4 External Validation
- 12.3 Clustering Index
- 12.3.1 Computation Process
- 12.3.2 Evaluation of Crisp Clustering
- 12.3.3 Evaluation of Fuzzy Clustering
- 12.3.4 Evaluation of Hierarchical Clustering
- 12.4 Parameter Tuning
- 12.4.1 Clustering Index for Unlabeled Documents
- 12.4.2 Simple Clustering Algorithm with Parameter Tuning
- 12.4.3 K Means Algorithm with Parameter Tuning
- 12.4.4 Evolutionary Clustering Algorithm
- 12.5 Summary and Further Discussions
- Part IV Advanced Topics
- 13 Text Summarization
- 13.1 Definition of Text Summarization
- 13.2 Text Summarization Types
- 13.2.1 Manual Versus Automatic Text Summarization
- 13.2.2 Single Versus Multiple Text Summarization
- 13.2.3 Flat Versus Hierarchical Text Summarization
- 13.2.4 Abstraction Versus Query-Based Summarization
- 13.3 Approaches to Text Summarization
- 13.3.1 Heuristic Approaches
- 13.3.2 Mapping into Classification Task
- 13.3.3 Sampling Schemes
- 13.3.4 Application of Machine Learning Algorithms
- 13.4 Combination with Other Text Mining Tasks
- 13.4.1 Summary-Based Classification
- 13.4.2 Summary-Based Clustering
- 13.4.3 Topic-Based Summarization
- 13.4.4 Text Expansion
- 13.5 Summary and Further Discussions
- 14 Text Segmentation
- 14.1 Definition of Text Segmentation
- 14.2 Text Segmentation Type
- 14.2.1 Spoken Versus Written Text Segmentation
- 14.2.2 Ordered Versus Unordered Text Segmentation
- 14.2.3 Exclusive Versus Overlapping Segmentation
- 14.2.4 Flat Versus Hierarchical Text Segmentation
- 14.3 Machine Learning-Based Approaches
- 14.3.1 Heuristic Approaches
- 14.3.2 Mapping into Classification
- 14.3.3 Encoding Adjacent Paragraph Pairs
- 14.3.4 Application of Machine Learning
- 14.4 Derived Tasks
- 14.4.1 Temporal Topic Analysis
- 14.4.2 Subtext Retrieval
- 14.4.3 Subtext Synthesization
- 14.4.4 Virtual Text
- 14.5 Summary and Further Discussions
- 15 Taxonomy Generation
- 15.1 Definition of Taxonomy Generation
- 15.2 Relevant Tasks to Taxonomy Generation
- 15.2.1 Keyword Extraction
- 15.2.2 Word Categorization
- 15.2.3 Word Clustering
- 15.2.4 Topic Routing
- 15.3 Taxonomy Generation Schemes
- 15.3.1 Index-Based Scheme
- 15.3.2 Clustering-Based Scheme
- 15.3.3 Association-Based Scheme
- 15.3.4 Link Analysis-Based Scheme
- 15.4 Taxonomy Governance
- 15.4.1 Taxonomy Maintenance
- 15.4.2 Taxonomy Growth
- 15.4.3 Taxonomy Integration
- 15.4.4 Ontology
- 15.5 Summary and Further Discussions
- 16 Dynamic Document Organization
- 16.1 Definition of Dynamic Document Organization
- 16.2 Online Clustering
- 16.2.1 Online Clustering in Functional View
- 16.2.2 Online K Means Algorithm
- 16.2.3 Online Unsupervised KNN Algorithm
- 16.2.4 Online Fuzzy Clustering
- 16.3 Dynamic Organization
- 16.3.1 Execution Process
- 16.3.2 Maintenance Mode
- 16.3.3 Creation Mode
- 16.3.4 Additional Tasks
- 16.4 Issues of Dynamic Document Organization
- 16.4.1 Text Representation
- 16.4.2 Binary Decomposition
- 16.4.3 Transition into Creation Mode
- 16.4.4 Variants of DDO System
- 16.4.5 Summary and Further Discussions
- Part V Word Mining
- 17 Word Encoding
- 17.1 Introduction
- 17.2 Word Encoding
- 17.2.1 Text Indexing
- 17.2.2 Text Index Structure
- 17.2.3 Word Indexing
- 17.2.4 Inverted Index
- 17.3 Word Representation
- 17.3.1 Text Representation
- Notes:
- Description based on publisher supplied metadata and other sources.
- ISBN:
- 9783031759765
- 3031759761
- OCLC:
- 1481791252