# KMeansClusteringService Documentation

## Overview

`KMeansClusteringService.swift` implements the k-means clustering algorithm to group similar data points (vectors) into clusters. This service is particularly useful for grouping similar text embeddings or feature vectors.
## Core Components

### Data Structures

**Point**

- `vector`: numerical representation of the data point
- `id`: unique identifier for tracking

**Cluster**

- `centroid`: center point of the cluster
- `points`: data points belonging to the cluster
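The two structures described above can be sketched in Swift as follows. The field names match the descriptions, but the concrete types (for example `UUID` for `id` and `[Double]` for vectors) are assumptions; the actual declarations in `KMeansClusteringService.swift` may differ.

```swift
import Foundation

// Hypothetical sketch of the data structures described above.
struct Point {
    let id: UUID          // unique identifier for tracking
    let vector: [Double]  // numerical representation of the data point
}

struct Cluster {
    var centroid: [Double]  // center point of the cluster
    var points: [Point]     // data points belonging to the cluster
}
```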
### Error Types
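The specific error cases are not reproduced in this document. A plausible sketch, covering the validation and convergence failures mentioned elsewhere in these docs, might look like this; the enum and case names are assumptions.

```swift
// Hypothetical error enum; the actual cases in KMeansClusteringService.swift
// are not shown in this document.
enum KMeansError: Error {
    case invalidK              // k <= 0 or k greater than the number of points
    case mismatchedDimensions  // input vectors have differing lengths
    case emptyInput            // no points were provided
    case failedToConverge      // maxIterations reached without convergence
}
```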
### Configuration Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `k` | Number of clusters | Required |
| `maxIterations` | Maximum iterations allowed | 100 |
| `convergenceThreshold` | Minimum change for convergence | 0.001 |
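A minimal configuration sketch matching the table above; the struct and property names are assumptions, but the default values mirror the documented ones.

```swift
// Hypothetical configuration type mirroring the parameter table above.
struct KMeansConfiguration {
    var k: Int                        // required: number of clusters
    var maxIterations = 100           // documented default
    var convergenceThreshold = 0.001  // documented default
}

let config = KMeansConfiguration(k: 3)
```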
## Primary Functions

### Clustering

Steps:

1. Validates input dimensions
2. Initializes random centroids
3. Iteratively assigns points to the nearest centroid
4. Updates centroid positions
5. Checks for convergence
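The steps above can be sketched as a standalone function. This is a simplified illustration under assumed names, not the service's actual implementation; the validation and error-throwing behavior described elsewhere is reduced to a precondition.

```swift
import Foundation

// Illustrative k-means loop following the documented steps.
func runKMeans(points: [[Double]], k: Int,
               maxIterations: Int = 100,
               convergenceThreshold: Double = 0.001) -> [[Double]] {
    precondition(k > 0 && k <= points.count)  // Step 1 (reduced): validate input

    func distance(_ a: [Double], _ b: [Double]) -> Double {
        sqrt(zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +))
    }

    var centroids = Array(points.shuffled().prefix(k))  // Step 2: random centroids
    for _ in 0..<maxIterations {
        // Step 3: assign each point to its nearest centroid.
        var assignments = Array(repeating: [[Double]](), count: k)
        for point in points {
            let nearest = centroids.indices.min {
                distance(point, centroids[$0]) < distance(point, centroids[$1])
            }!
            assignments[nearest].append(point)
        }
        // Step 4: move each centroid to the mean of its assigned points.
        var maxShift = 0.0
        for i in 0..<k where !assignments[i].isEmpty {
            var mean = [Double](repeating: 0, count: centroids[i].count)
            for point in assignments[i] {
                for j in point.indices { mean[j] += point[j] }
            }
            mean = mean.map { $0 / Double(assignments[i].count) }
            maxShift = max(maxShift, distance(mean, centroids[i]))
            centroids[i] = mean
        }
        // Step 5: stop once no centroid moved more than the threshold.
        if maxShift < convergenceThreshold { break }
    }
    return centroids
}
```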
### Helper Functions

**Find Closest Centroid**

- Finds the nearest centroid to a given point
- Uses Euclidean distance
- Returns the index of the closest centroid
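A self-contained sketch of this helper; the real function name and signature in `KMeansClusteringService.swift` may differ.

```swift
import Foundation

// Returns the index of the centroid nearest to the given point.
func closestCentroidIndex(to point: [Double], in centroids: [[Double]]) -> Int {
    var bestIndex = 0
    var bestDistance = Double.infinity
    for (index, centroid) in centroids.enumerated() {
        // Euclidean distance between the point and this centroid.
        let distance = sqrt(zip(point, centroid).map { ($0 - $1) * ($0 - $1) }.reduce(0, +))
        if distance < bestDistance {
            bestDistance = distance
            bestIndex = index
        }
    }
    return bestIndex
}
```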
**Calculate Mean Vector**

- Computes a cluster's centroid position
- Averages all points in the cluster
- Returns the new centroid vector
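A sketch of the centroid-recomputation helper; the name and signature are assumptions.

```swift
// Averages a set of equal-length vectors componentwise.
func meanVector(of vectors: [[Double]]) -> [Double] {
    guard let first = vectors.first else { return [] }
    var sums = [Double](repeating: 0, count: first.count)
    for vector in vectors {
        for (i, value) in vector.enumerated() {
            sums[i] += value  // accumulate componentwise sums
        }
    }
    return sums.map { $0 / Double(vectors.count) }  // divide by the point count
}
```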
**Distance Calculation**

- Calculates the distance between two points
- Uses the Euclidean distance metric
- Returns a scalar distance value
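The Euclidean distance used throughout the service is the square root of the sum of squared componentwise differences; a minimal sketch:

```swift
import Foundation

// Euclidean distance between two equal-length vectors.
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "vectors must share the same dimension")
    return sqrt(zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +))
}
```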
## Usage Examples

### Basic Clustering
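The service's exact public API is not reproduced in this document, so the example below drives a self-contained stand-in function (`cluster` is an assumed name); with the real service you would call its corresponding method instead.

```swift
import Foundation

// Self-contained stand-in for the service's clustering entry point.
func cluster(_ points: [[Double]], k: Int, maxIterations: Int = 100,
             convergenceThreshold: Double = 0.001) -> [[Double]] {
    precondition(k > 0 && k <= points.count)
    func distance(_ a: [Double], _ b: [Double]) -> Double {
        sqrt(zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +))
    }
    var centroids = Array(points.shuffled().prefix(k))
    for _ in 0..<maxIterations {
        var groups = Array(repeating: [[Double]](), count: k)
        for p in points {
            let nearest = (0..<k).min { distance(p, centroids[$0]) < distance(p, centroids[$1]) }!
            groups[nearest].append(p)
        }
        var shift = 0.0
        for i in 0..<k where !groups[i].isEmpty {
            var mean = [Double](repeating: 0, count: centroids[i].count)
            for p in groups[i] { for j in p.indices { mean[j] += p[j] } }
            mean = mean.map { $0 / Double(groups[i].count) }
            shift = max(shift, distance(mean, centroids[i]))
            centroids[i] = mean
        }
        if shift < convergenceThreshold { break }
    }
    return centroids
}

let vectors: [[Double]] = [[1, 1], [1.2, 0.9], [8, 8], [8.1, 7.9]]
let centroids = cluster(vectors, k: 2)
print(centroids)  // two centroids, one near each group
```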
### With Text Embeddings
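A sketch of grouping text embeddings. The toy 3-dimensional vectors below stand in for real embeddings, which would normally come from `WordEmbeddingService`; the dictionary layout, dimensionality, and pre-chosen centroids are assumptions for illustration.

```swift
import Foundation

// Toy embeddings standing in for vectors produced by WordEmbeddingService.
let embeddings: [String: [Double]] = [
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.8],
    "bus": [0.0, 0.8, 0.9],
]

// Pre-chosen centroids for illustration; in practice they would come
// from the clustering run itself.
let centroids: [[Double]] = [[0.85, 0.15, 0.05], [0.05, 0.85, 0.85]]

// Index of the centroid nearest to a vector (squared Euclidean distance).
func nearestIndex(of vector: [Double], among centroids: [[Double]]) -> Int {
    func sqDist(_ a: [Double], _ b: [Double]) -> Double {
        zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +)
    }
    return centroids.indices.min { sqDist(vector, centroids[$0]) < sqDist(vector, centroids[$1]) }!
}

var groups = [[String]](repeating: [], count: centroids.count)
for (word, vector) in embeddings {
    groups[nearestIndex(of: vector, among: centroids)].append(word)
}
print(groups)  // animal words in one group, vehicle words in the other
```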
## Best Practices

### Data Preparation

- Normalize input vectors
- Validate vector dimensions
- Remove outliers if necessary
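For the normalization step, a common choice is L2 (unit-length) normalization; a small sketch:

```swift
import Foundation

// L2-normalize a vector so its Euclidean length is 1.
// Zero vectors are returned unchanged to avoid division by zero.
func l2Normalized(_ vector: [Double]) -> [Double] {
    let norm = sqrt(vector.map { $0 * $0 }.reduce(0, +))
    return norm > 0 ? vector.map { $0 / norm } : vector
}
```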
### Parameter Selection

- Choose an appropriate `k` value
- Adjust the convergence threshold
- Set a reasonable iteration limit

### Performance Optimization

- Batch large datasets
- Monitor convergence
- Cache intermediate results

### Error Handling

- Validate input data
- Handle convergence failures
- Provide fallback options
## Algorithm Details

### Initialization Phase

- Random centroid selection
- Input validation
- Parameter checking

### Iteration Phase

- Point assignment
- Centroid updates
- Convergence checks

### Termination Conditions

- Maximum iterations reached
- Convergence threshold met
- Error condition detected
## Integration Points

- With `WordEmbeddingService`
- With `SimilarityService`
## Performance Considerations

### Time Complexity

O(n · k · i), where:

- n = number of points
- k = number of clusters
- i = number of iterations

Each distance computation is additionally linear in the vector dimension d, so the total cost is O(n · k · i · d).
### Memory Usage

- Stores all points and centroids
- Temporary calculations during centroid updates
- Cluster assignments for each point

### Optimization Tips

- Use appropriate data structures
- Implement early stopping
- Consider parallel processing
## Limitations

### Algorithm Constraints

- Requires a predefined `k`
- Sensitive to centroid initialization
- May converge to local optima

### Data Requirements

- Fixed vector dimensions
- Numerical values only
- Sufficient data points

### Performance Issues

- Scales with data size
- Memory intensive
- Iteration overhead
Last updated