KMeansClusteringService Documentation

Overview

KMeansClusteringService is a Swift implementation of the k-means clustering algorithm optimized for high-dimensional data (256-512 dimensions). It clusters data points, carries optional per-point metadata, and includes a lightweight sentiment-analysis helper. The implementation improves numerical stability through compensated summation and supports both Euclidean and cosine distance metrics.

Class Structure

KMeansClusteringService

Main class that handles the clustering process.

class KMeansClusteringService {
    init(k: Int, maxIterations: Int = 50, convergenceThreshold: Float = 0.001, useCosineDistance: Bool = true)
}

Parameters:

  • k: Number of clusters to create

  • maxIterations: Maximum number of iterations for convergence (default: 50)

  • convergenceThreshold: Maximum centroid movement below which the algorithm is considered converged (default: 0.001)

  • useCosineDistance: Whether to use cosine distance instead of Euclidean (default: true)

Point Struct

Represents a data point in the clustering space.

struct Point {
    let vector: [Float]          // The feature vector
    let id: UUID                 // Unique identifier
    let squaredMagnitude: Float  // Cached for performance
    let metadata: [String: Any]? // Optional metadata
}
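
Note that squaredMagnitude is a stored property, so Swift's memberwise initializer requires the caller to pass it explicitly. A convenience initializer along these lines (a sketch, not part of the documented API) can derive it from the vector:

import Foundation

extension Point {
    // Hypothetical convenience initializer: computes the cached
    // squared magnitude instead of requiring the caller to pass it.
    init(vector: [Float], id: UUID = UUID(), metadata: [String: Any]? = nil) {
        self.init(
            vector: vector,
            id: id,
            squaredMagnitude: vector.reduce(0) { $0 + $1 * $1 },
            metadata: metadata
        )
    }
}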

Cluster Struct

Represents a cluster with its centroid and associated points.

struct Cluster {
    var centroid: [Float]
    var points: [Point]
}

Error Handling

enum ClusteringError: Error {
    case invalidDimensions  // Inconsistent vector dimensions
    case noDataPoints       // Empty dataset
    case convergenceFailed  // Failed to converge within maxIterations
    case invalidK           // k larger than the dataset size
}

Main Methods

cluster(_:) -> [Cluster]

func cluster(_ points: [Point]) throws -> [Cluster]

Performs k-means clustering on the input points.

Process:

  1. Validates input data

  2. Initializes centroids using k-means++

  3. Iteratively assigns points to nearest centroid

  4. Updates centroids using compensated summation

  5. Checks for convergence

Throws:

  • invalidDimensions: If points have inconsistent dimensions

  • noDataPoints: If input array is empty

  • convergenceFailed: If the algorithm doesn't converge within maxIterations

  • invalidK: If k is larger than dataset size
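
Steps 3-5 can be sketched as follows. This is a minimal illustration, assuming a distance(_:_:) helper that dispatches to cosineDistance or euclideanDistance according to useCosineDistance; input validation and k-means++ seeding are omitted, and the real implementation may differ:

// Sketch of the core assignment/update loop (validation and
// k-means++ seeding omitted for brevity).
private func iterate(_ points: [Point], clusters: inout [Cluster]) throws {
    for _ in 0..<maxIterations {
        // Assignment step: attach each point to its nearest centroid.
        // distance(_:_:) is assumed to dispatch on useCosineDistance.
        for i in clusters.indices { clusters[i].points.removeAll() }
        for point in points {
            let nearest = clusters.indices.min { i, j in
                distance(point.vector, clusters[i].centroid) <
                distance(point.vector, clusters[j].centroid)
            }!
            clusters[nearest].points.append(point)
        }
        // Update step: recompute each centroid with compensated summation.
        var maxShift: Float = 0
        for i in clusters.indices where !clusters[i].points.isEmpty {
            let newCentroid = calculateNewCentroid(for: clusters[i])
            maxShift = max(maxShift, euclideanDistance(clusters[i].centroid, newCentroid))
            clusters[i].centroid = newCentroid
        }
        // Convergence check: stop once no centroid moved more than the threshold.
        if maxShift < convergenceThreshold { return }
    }
    throw ClusteringError.convergenceFailed
}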

analyzeSentiment(for:) -> String

func analyzeSentiment(for text: String) -> String

Analyzes the sentiment of a text string, typically taken from a point's metadata.

Returns: "positive sentiment", "negative sentiment", or "neutral sentiment"
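
The exact heuristic is not specified here; a minimal keyword-counting sketch consistent with these return values might look like the following (the word lists are purely illustrative):

func analyzeSentiment(for text: String) -> String {
    // Illustrative word lists; the real implementation may differ.
    let positive: Set<String> = ["happy", "great", "good", "love"]
    let negative: Set<String> = ["sad", "bad", "terrible", "hate"]
    let words = text.lowercased().split(separator: " ").map(String.init)
    let score = words.reduce(0) { score, word in
        score + (positive.contains(word) ? 1 : 0) - (negative.contains(word) ? 1 : 0)
    }
    if score > 0 { return "positive sentiment" }
    if score < 0 { return "negative sentiment" }
    return "neutral sentiment"
}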

Helper Methods

compensatedSum(_:) -> [Float]

private func compensatedSum(_ vectors: [[Float]]) -> [Float]

Performs numerically stable vector summation using Kahan summation algorithm.
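
A sketch of how such a routine can be written, applying Kahan's running-compensation step independently in each dimension (vector dimensions are assumed consistent, as validated by cluster(_:)):

private func compensatedSum(_ vectors: [[Float]]) -> [Float] {
    guard let first = vectors.first else { return [] }
    var sum = [Float](repeating: 0, count: first.count)
    var compensation = [Float](repeating: 0, count: first.count)
    for vector in vectors {
        for d in 0..<sum.count {
            let y = vector[d] - compensation[d]  // apply previous correction
            let t = sum[d] + y
            compensation[d] = (t - sum[d]) - y   // recover lost low-order bits
            sum[d] = t
        }
    }
    return sum
}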

calculateNewCentroid(for:) -> [Float]

private func calculateNewCentroid(for cluster: Cluster) -> [Float]

Calculates new centroid position using compensated summation.
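
In sketch form, this amounts to a compensated sum of the member vectors divided by the cluster size (handling of empty clusters is left open here):

private func calculateNewCentroid(for cluster: Cluster) -> [Float] {
    // Assumes a non-empty cluster; a real implementation might keep
    // the old centroid or reseed when a cluster loses all its points.
    let sum = compensatedSum(cluster.points.map { $0.vector })
    let count = Float(cluster.points.count)
    return sum.map { $0 / count }
}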

Distance Metrics

cosineDistance(_:_:) -> Float

private func cosineDistance(_ a: [Float], _ b: [Float]) -> Float

Calculates cosine distance between two vectors. Preferred for high-dimensional data, where direction tends to be more informative than magnitude and Euclidean distances become less discriminative.
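
Cosine distance is one minus cosine similarity; a straightforward sketch (treating zero vectors as maximally distant) looks like:

private func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, magA: Float = 0, magB: Float = 0
    for d in 0..<a.count {
        dot  += a[d] * b[d]
        magA += a[d] * a[d]
        magB += b[d] * b[d]
    }
    let denominator = magA.squareRoot() * magB.squareRoot()
    guard denominator > 0 else { return 1 }  // zero vector: no defined direction
    return 1 - dot / denominator             // 0 = same direction, 2 = opposite
}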

euclideanDistance(_:_:) -> Float

private func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float

Calculates Euclidean distance between two vectors.
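
For completeness, a matching sketch of the Euclidean metric (the cached squaredMagnitude on Point could be used to skip recomputation, an optimization omitted here):

private func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float {
    var sum: Float = 0
    for d in 0..<a.count {
        let diff = a[d] - b[d]
        sum += diff * diff
    }
    return sum.squareRoot()
}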

Usage Example

// Initialize service
let service = KMeansClusteringService(
    k: 5,
    maxIterations: 50,
    convergenceThreshold: 0.001,
    useCosineDistance: true
)

// Create points
let points = [
    Point(vector: [1.0, 2.0], id: UUID(), squaredMagnitude: 5.0, metadata: ["text": "happy"]),  // 1² + 2² = 5
    // ... more points
]

// Perform clustering
do {
    let clusters = try service.cluster(points)
    // Process clusters
} catch {
    // Handle errors
}

Performance Considerations

  1. Memory Usage

    • Pre-computed squared magnitudes

    • Efficient vector operations

    • Reuse of arrays where possible

  2. Computational Efficiency

    • Compensated summation for numerical stability

    • Cached calculations

    • Optimized distance metrics

  3. High-Dimensional Optimizations

    • Cosine distance option

    • K-means++ initialization

    • Numerical stability improvements

Best Practices

  1. Data Preparation

    • Normalize input vectors if using Euclidean distance (see the normalization sketch after this list)

    • Ensure consistent vector dimensions

    • Consider data scaling for better clustering

  2. Parameter Selection

    • Choose k based on domain knowledge

    • Adjust maxIterations based on convergence needs

    • Set convergenceThreshold based on precision requirements

  3. Error Handling

    • Always use try-catch blocks

    • Validate input data

    • Check cluster quality after convergence
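
For the normalization advice under Data Preparation, a simple L2-normalization pass (an illustrative helper, not part of the API) could look like:

// L2-normalize a vector so Euclidean comparisons reflect direction
// rather than magnitude (illustrative helper, not part of the API).
func l2Normalized(_ vector: [Float]) -> [Float] {
    let magnitude = vector.reduce(0) { $0 + $1 * $1 }.squareRoot()
    guard magnitude > 0 else { return vector }
    return vector.map { $0 / magnitude }
}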

Limitations

  1. Number of clusters (k) must be specified in advance

  2. Algorithm may converge to local optima

  3. Sensitive to initial centroid positions

  4. May not handle non-globular clusters well

Future Improvements

  1. Parallel processing for large datasets

  2. Additional distance metrics

  3. Automatic k selection

  4. Enhanced metadata processing

  5. More sophisticated sentiment analysis