KMeansClusteringService Documentation

Overview

KMeansClusteringService is a Swift implementation of the k-means clustering algorithm optimized for high-dimensional data (256-512 dimensions). It clusters data points, carries optional per-point metadata, and includes a lightweight sentiment-analysis helper. The implementation improves numerical stability through compensated summation and supports both Euclidean and cosine distance metrics.

Class Structure

KMeansClusteringService

Main class that handles the clustering process.

class KMeansClusteringService {
    init(k: Int, maxIterations: Int = 50, convergenceThreshold: Float = 0.001, useCosineDistance: Bool = true)
}

Parameters:

  • k: Number of clusters to create

  • maxIterations: Maximum number of iterations for convergence (default: 50)

  • convergenceThreshold: Maximum centroid movement below which the algorithm is considered converged (default: 0.001)

  • useCosineDistance: Whether to use cosine distance instead of Euclidean (default: true)

Point Struct

Represents a data point in the clustering space.

struct Point {
    let vector: [Float]          // The feature vector
    let id: UUID                 // Unique identifier
    let squaredMagnitude: Float  // Cached for performance
    let metadata: [String: Any]? // Optional metadata
}
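
Note that squaredMagnitude is a stored property, so Swift's memberwise initializer requires the caller to pass it explicitly. A convenience initializer along these lines (a sketch, not part of the documented API) can derive it from the vector:

import Foundation

extension Point {
    // Hypothetical convenience initializer: computes the cached
    // squared magnitude instead of requiring the caller to pass it.
    init(vector: [Float], id: UUID = UUID(), metadata: [String: Any]? = nil) {
        self.init(
            vector: vector,
            id: id,
            squaredMagnitude: vector.reduce(0) { $0 + $1 * $1 },
            metadata: metadata
        )
    }
}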

Cluster Struct

Represents a cluster with its centroid and associated points.

struct Cluster {
    var centroid: [Float]
    var points: [Point]
}

Error Handling

enum ClusteringError: Error {
    case invalidDimensions  // Inconsistent vector dimensions
    case noDataPoints       // Empty dataset
    case convergenceFailed  // Failed to converge within maxIterations
    case invalidK           // k larger than the dataset size
}

Main Methods

cluster(_:) -> [Cluster]

func cluster(_ points: [Point]) throws -> [Cluster]

Performs k-means clustering on the input points.

Process:

  1. Validates input data

  2. Initializes centroids using k-means++

  3. Iteratively assigns points to nearest centroid

  4. Updates centroids using compensated summation

  5. Checks for convergence

Throws:

  • invalidDimensions: If points have inconsistent dimensions

  • noDataPoints: If input array is empty

  • convergenceFailed: If the algorithm doesn't converge within maxIterations

  • invalidK: If k is larger than dataset size
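
Steps 3-5 can be sketched as follows. This is a minimal illustration, assuming a distance(_:_:) helper that dispatches to cosineDistance or euclideanDistance according to useCosineDistance; input validation and k-means++ seeding are omitted, and the real implementation may differ:

// Sketch of the core assignment/update loop (validation and
// k-means++ seeding omitted for brevity).
private func iterate(_ points: [Point], clusters: inout [Cluster]) throws {
    for _ in 0..<maxIterations {
        // Assignment step: attach each point to its nearest centroid.
        // distance(_:_:) is assumed to dispatch on useCosineDistance.
        for i in clusters.indices { clusters[i].points.removeAll() }
        for point in points {
            let nearest = clusters.indices.min { i, j in
                distance(point.vector, clusters[i].centroid) <
                distance(point.vector, clusters[j].centroid)
            }!
            clusters[nearest].points.append(point)
        }
        // Update step: recompute each centroid with compensated summation.
        var maxShift: Float = 0
        for i in clusters.indices where !clusters[i].points.isEmpty {
            let newCentroid = calculateNewCentroid(for: clusters[i])
            maxShift = max(maxShift, euclideanDistance(clusters[i].centroid, newCentroid))
            clusters[i].centroid = newCentroid
        }
        // Convergence check: stop once no centroid moved more than the threshold.
        if maxShift < convergenceThreshold { return }
    }
    throw ClusteringError.convergenceFailed
}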

analyzeSentiment(for:) -> String

func analyzeSentiment(for text: String) -> String

Analyzes the sentiment of a text string, typically taken from a point's metadata.

Returns: "positive sentiment", "negative sentiment", or "neutral sentiment"
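
The exact heuristic is not specified here; a minimal keyword-counting sketch consistent with these return values might look like the following (the word lists are purely illustrative):

func analyzeSentiment(for text: String) -> String {
    // Illustrative word lists; the real implementation may differ.
    let positive: Set<String> = ["happy", "great", "good", "love"]
    let negative: Set<String> = ["sad", "bad", "terrible", "hate"]
    let words = text.lowercased().split(separator: " ").map(String.init)
    let score = words.reduce(0) { score, word in
        score + (positive.contains(word) ? 1 : 0) - (negative.contains(word) ? 1 : 0)
    }
    if score > 0 { return "positive sentiment" }
    if score < 0 { return "negative sentiment" }
    return "neutral sentiment"
}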

Helper Methods

compensatedSum(_:) -> [Float]

private func compensatedSum(_ vectors: [[Float]]) -> [Float]

Performs numerically stable vector summation using Kahan summation algorithm.
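
A sketch of how such a routine can be written, applying Kahan's running-compensation step independently in each dimension (vector dimensions are assumed consistent, as validated by cluster(_:)):

private func compensatedSum(_ vectors: [[Float]]) -> [Float] {
    guard let first = vectors.first else { return [] }
    var sum = [Float](repeating: 0, count: first.count)
    var compensation = [Float](repeating: 0, count: first.count)
    for vector in vectors {
        for d in 0..<sum.count {
            let y = vector[d] - compensation[d]  // apply previous correction
            let t = sum[d] + y
            compensation[d] = (t - sum[d]) - y   // recover lost low-order bits
            sum[d] = t
        }
    }
    return sum
}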

calculateNewCentroid(for:) -> [Float]

private func calculateNewCentroid(for cluster: Cluster) -> [Float]

Calculates new centroid position using compensated summation.
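
In sketch form, this amounts to a compensated sum of the member vectors divided by the cluster size (handling of empty clusters is left open here):

private func calculateNewCentroid(for cluster: Cluster) -> [Float] {
    // Assumes a non-empty cluster; a real implementation might keep
    // the old centroid or reseed when a cluster loses all its points.
    let sum = compensatedSum(cluster.points.map { $0.vector })
    let count = Float(cluster.points.count)
    return sum.map { $0 / count }
}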

Distance Metrics

cosineDistance(_:_:) -> Float

private func cosineDistance(_ a: [Float], _ b: [Float]) -> Float

Calculates cosine distance between two vectors. Preferred for high-dimensional data, where direction tends to be more informative than magnitude and Euclidean distances become less discriminative.
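
Cosine distance is one minus cosine similarity; a straightforward sketch (treating zero vectors as maximally distant) looks like:

private func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, magA: Float = 0, magB: Float = 0
    for d in 0..<a.count {
        dot  += a[d] * b[d]
        magA += a[d] * a[d]
        magB += b[d] * b[d]
    }
    let denominator = magA.squareRoot() * magB.squareRoot()
    guard denominator > 0 else { return 1 }  // zero vector: no defined direction
    return 1 - dot / denominator             // 0 = same direction, 2 = opposite
}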

euclideanDistance(_:_:) -> Float

private func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float

Calculates Euclidean distance between two vectors.
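
For completeness, a matching sketch of the Euclidean metric (the cached squaredMagnitude on Point could be used to skip recomputation, an optimization omitted here):

private func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for d in 0..<a.count {
        let diff = a[d] - b[d]
        sum += diff * diff
    }
    return sum.squareRoot()
}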

Usage Example

// Initialize service
let service = KMeansClusteringService(
    k: 5,
    maxIterations: 50,
    convergenceThreshold: 0.001,
    useCosineDistance: true
)

// Create points
let points = [
    Point(vector: [1.0, 2.0], id: UUID(), squaredMagnitude: 5.0, metadata: ["text": "happy"]),  // 1² + 2² = 5
    // ... more points
]

// Perform clustering
do {
    let clusters = try service.cluster(points)
    // Process clusters
} catch {
    // Handle errors
}

Performance Considerations

  1. Memory Usage

    • Pre-computed squared magnitudes

    • Efficient vector operations

    • Reuse of arrays where possible

  2. Computational Efficiency

    • Compensated summation for numerical stability

    • Cached calculations

    • Optimized distance metrics

  3. High-Dimensional Optimizations

    • Cosine distance option

    • K-means++ initialization

    • Numerical stability improvements

Best Practices

  1. Data Preparation

    • Normalize input vectors if using Euclidean distance (see the normalization sketch after this list)

    • Ensure consistent vector dimensions

    • Consider data scaling for better clustering

  2. Parameter Selection

    • Choose k based on domain knowledge

    • Adjust maxIterations based on convergence needs

    • Set convergenceThreshold based on precision requirements

  3. Error Handling

    • Always use try-catch blocks

    • Validate input data

    • Check cluster quality after convergence
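
For the normalization advice under Data Preparation, a simple L2-normalization pass (an illustrative helper, not part of the API) could look like:

// L2-normalize a vector so Euclidean comparisons reflect direction
// rather than magnitude (illustrative helper, not part of the API).
func l2Normalized(_ vector: [Float]) -> [Float] {
    let magnitude = vector.reduce(0) { $0 + $1 * $1 }.squareRoot()
    guard magnitude > 0 else { return vector }
    return vector.map { $0 / magnitude }
}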

Limitations

  1. Number of clusters (k) must be specified in advance

  2. Algorithm may converge to local optima

  3. Sensitive to initial centroid positions

  4. May not handle non-globular clusters well

Future Improvements

  1. Parallel processing for large datasets

  2. Additional distance metrics

  3. Automatic k selection

  4. Enhanced metadata processing

  5. More sophisticated sentiment analysis