# KMeansClusteringService Documentation

## Overview

`KMeansClusteringService` is a Swift implementation of the k-means clustering algorithm optimized for high-dimensional data (256-512 dimensions). It provides functionality for clustering data points, handling metadata, and performing sentiment analysis. The implementation includes numerical stability improvements through compensated summation and supports both Euclidean and cosine distance metrics.
## Class Structure

### KMeansClusteringService

The main class that handles the clustering process.
**Parameters:**

- `k`: Number of clusters to create
- `maxIterations`: Maximum number of iterations for convergence (default: 50)
- `convergenceThreshold`: Minimum change in centroids required to continue iterating (default: 0.001)
- `useCosineDistance`: Whether to use cosine distance instead of Euclidean (default: true)
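A minimal sketch of how these parameters might map onto the initializer; the parameter names and defaults come from the list above, while the property layout is an assumption rather than the actual source:

```swift
// Hypothetical initializer sketch; parameter names and defaults are
// documented above, the rest is assumed.
final class KMeansClusteringService {
    let k: Int
    let maxIterations: Int
    let convergenceThreshold: Float
    let useCosineDistance: Bool

    init(k: Int,
         maxIterations: Int = 50,
         convergenceThreshold: Float = 0.001,
         useCosineDistance: Bool = true) {
        self.k = k
        self.maxIterations = maxIterations
        self.convergenceThreshold = convergenceThreshold
        self.useCosineDistance = useCosineDistance
    }
}
```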
### Point Struct

Represents a data point in the clustering space.
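A plausible shape for the struct, assuming the vector and optional metadata fields implied by the overview and by `analyzeSentiment(for:)`:

```swift
// Illustrative sketch; the exact fields are assumptions.
struct Point {
    let vector: [Float]              // 256-512 dimensional data vector
    let metadata: [String: String]?  // optional metadata, e.g. source text
}
```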
### Cluster Struct

Represents a cluster with its centroid and associated points.
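Again, a plausible shape rather than the actual definition:

```swift
// Illustrative sketch of the cluster container.
struct Cluster {
    var centroid: [Float]  // current centroid position
    var points: [Point]    // points currently assigned to this cluster
}
```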
### Error Handling
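The error cases are documented under `cluster(_:)` below. A plausible error type covering them might look like this; the enum name `ClusteringError` is an assumption:

```swift
// Hypothetical error type; the case names come from the Throws list
// under cluster(_:), the enum name is assumed.
enum ClusteringError: Error {
    case invalidDimensions   // points have inconsistent dimensions
    case noDataPoints        // input array is empty
    case convergenceFailed   // algorithm did not converge
    case invalidK            // k is larger than the dataset size
}
```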
## Main Methods

### `cluster(_:) throws -> [Cluster]`

Performs k-means clustering on the input points.
**Process:**

1. Validates input data
2. Initializes centroids using k-means++
3. Iteratively assigns each point to its nearest centroid
4. Updates centroids using compensated summation
5. Checks for convergence
**Throws:**

- `invalidDimensions`: If points have inconsistent dimensions
- `noDataPoints`: If the input array is empty
- `convergenceFailed`: If the algorithm fails to converge
- `invalidK`: If k is larger than the dataset size
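To make the process above concrete, here is a simplified, self-contained toy version of the loop. It deliberately swaps in random initialization instead of k-means++, plain Euclidean distance, and naive summation for brevity; the service itself adds k-means++ seeding, an optional cosine metric, and compensated summation.

```swift
// Toy k-means illustrating the validate/assign/update/converge loop.
// Not the service's code: random init and naive summation for brevity.
func toyKMeans(_ points: [[Float]], k: Int, maxIterations: Int = 50,
               threshold: Float = 0.001) -> [[Float]]? {
    guard !points.isEmpty, k > 0, k <= points.count else { return nil }
    var centroids = Array(points.shuffled().prefix(k))

    for _ in 0..<maxIterations {
        // Assign each point to its nearest centroid.
        var members = Array(repeating: [[Float]](), count: k)
        for p in points {
            let nearest = (0..<k).min { squaredDistance(p, centroids[$0])
                                      < squaredDistance(p, centroids[$1]) }!
            members[nearest].append(p)
        }
        // Recompute each non-empty cluster's centroid as the member mean.
        var shift: Float = 0
        for i in 0..<k where !members[i].isEmpty {
            var mean = [Float](repeating: 0, count: centroids[i].count)
            for p in members[i] {
                for d in 0..<mean.count { mean[d] += p[d] }
            }
            for d in 0..<mean.count { mean[d] /= Float(members[i].count) }
            shift = max(shift, squaredDistance(mean, centroids[i]).squareRoot())
            centroids[i] = mean
        }
        // Stop once centroids have essentially stopped moving.
        if shift < threshold { break }
    }
    return centroids
}

func squaredDistance(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(0) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) }
}
```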
### `analyzeSentiment(for:) -> String`

Analyzes text sentiment from metadata.

**Returns:** `"positive sentiment"`, `"negative sentiment"`, or `"neutral sentiment"`
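The documentation does not spell out how sentiment is scored; a minimal keyword-counting sketch that produces the three documented return strings might look like this (the word lists are purely illustrative):

```swift
// Illustrative keyword-based sentiment sketch; the actual scoring
// logic and vocabularies used by the service are not documented here.
func analyzeSentimentSketch(for text: String) -> String {
    let positive: Set<String> = ["good", "great", "excellent", "love"]
    let negative: Set<String> = ["bad", "poor", "terrible", "hate"]
    var score = 0
    for word in text.lowercased().split(separator: " ") {
        if positive.contains(String(word)) { score += 1 }
        if negative.contains(String(word)) { score -= 1 }
    }
    if score > 0 { return "positive sentiment" }
    if score < 0 { return "negative sentiment" }
    return "neutral sentiment"
}
```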
## Helper Methods

### `compensatedSum(_:) -> [Float]`

Performs numerically stable vector summation using the Kahan summation algorithm.
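A sketch of Kahan summation over a set of equal-length vectors: each dimension carries a compensation term that recovers the low-order bits lost to floating-point rounding on every addition.

```swift
// Kahan (compensated) summation sketch; assumes all vectors share
// the same dimension.
func compensatedSumSketch(_ vectors: [[Float]]) -> [Float] {
    guard let first = vectors.first else { return [] }
    var sum = [Float](repeating: 0, count: first.count)
    var compensation = [Float](repeating: 0, count: first.count)
    for v in vectors {
        for d in 0..<sum.count {
            let y = v[d] - compensation[d]      // correct for prior rounding error
            let t = sum[d] + y                  // low-order bits of y may be lost here
            compensation[d] = (t - sum[d]) - y  // recover the lost bits
            sum[d] = t
        }
    }
    return sum
}
```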
### `calculateNewCentroid(for:) -> [Float]`

Calculates a cluster's new centroid position using compensated summation.
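Given a compensated sum, the new centroid is just that sum divided by the member count; a sketch reusing `compensatedSumSketch` from above:

```swift
// Sketch: mean of the member vectors via compensated summation.
func newCentroidSketch(for members: [[Float]]) -> [Float] {
    let sum = compensatedSumSketch(members)
    let count = Float(members.count)
    return sum.map { $0 / count }
}
```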
## Distance Metrics

### `cosineDistance(_:_:) -> Float`

Calculates the cosine distance between two vectors. Preferred for high-dimensional data.
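A sketch of the metric, defined as 1 minus cosine similarity. The performance notes below mention pre-computed squared magnitudes; here they are computed inline for clarity:

```swift
// Cosine distance sketch: 1 - (a.b) / (|a| * |b|).
func cosineDistanceSketch(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, magA: Float = 0, magB: Float = 0
    for d in 0..<a.count {
        dot += a[d] * b[d]
        magA += a[d] * a[d]
        magB += b[d] * b[d]
    }
    let denominator = (magA * magB).squareRoot()
    return denominator > 0 ? 1 - dot / denominator : 1
}
```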
### `euclideanDistance(_:_:) -> Float`

Calculates the Euclidean distance between two vectors.
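And the corresponding Euclidean sketch:

```swift
// Euclidean distance sketch between two equal-length vectors.
func euclideanDistanceSketch(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for d in 0..<a.count {
        let diff = a[d] - b[d]
        sum += diff * diff
    }
    return sum.squareRoot()
}
```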
## Usage Example
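A hypothetical end-to-end example based on the API documented above. The `Point` initializer fields and the `loadEmbeddings()` helper are assumptions; adapt them to the actual definitions:

```swift
// Hypothetical usage; loadEmbeddings() is a placeholder for however
// you obtain your [[Float]] data.
let embeddings: [[Float]] = loadEmbeddings()

let service = KMeansClusteringService(k: 5,
                                      maxIterations: 50,
                                      convergenceThreshold: 0.001,
                                      useCosineDistance: true)

let points = embeddings.map { Point(vector: $0, metadata: nil) }

do {
    let clusters = try service.cluster(points)
    for (index, cluster) in clusters.enumerated() {
        print("Cluster \(index): \(cluster.points.count) points")
    }
} catch {
    print("Clustering failed: \(error)")
}
```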
## Performance Considerations

### Memory Usage

- Pre-computed squared magnitudes
- Efficient vector operations
- Reuse of arrays where possible
### Computational Efficiency

- Compensated summation for numerical stability
- Cached calculations
- Optimized distance metrics
### High-Dimensional Optimizations

- Cosine distance option
- K-means++ initialization
- Numerical stability improvements
## Best Practices

### Data Preparation

- Normalize input vectors if using Euclidean distance (see the sketch after this list)
- Ensure consistent vector dimensions
- Consider data scaling for better clustering
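A sketch of L2 normalization, so that Euclidean distance on unit vectors behaves consistently with the cosine metric:

```swift
// L2-normalize a vector; returns the input unchanged if its magnitude is zero.
func normalized(_ v: [Float]) -> [Float] {
    let magnitude = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    guard magnitude > 0 else { return v }
    return v.map { $0 / magnitude }
}
```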
### Parameter Selection

- Choose k based on domain knowledge
- Adjust `maxIterations` based on convergence needs
- Set `convergenceThreshold` based on precision requirements
### Error Handling

- Always wrap calls to `cluster(_:)` in do-catch blocks
- Validate input data before clustering
- Check cluster quality after convergence
## Limitations

- The number of clusters (k) must be specified in advance
- The algorithm may converge to local optima
- Results are sensitive to initial centroid positions
- Non-globular clusters may not be handled well
## Future Improvements

- Parallel processing for large datasets
- Additional distance metrics
- Automatic selection of k
- Enhanced metadata processing
- More sophisticated sentiment analysis