KMeansClusteringService.swift

KMeansClusteringService Documentation

Overview

KMeansClusteringService.swift implements the k-means clustering algorithm to group similar data points (vectors) into clusters. This service is particularly useful for grouping similar text embeddings or feature vectors.

Core Components

Data Structures

Point Structure

struct Point {
    let vector: [Float]
    let id: UUID
}
Property   Description
vector     Numerical representation of the data point
id         Unique identifier for tracking

Cluster Structure

struct Cluster {
    var centroid: [Float]
    var points: [Point]
}
Property   Description
centroid   Center point of the cluster
points     Data points belonging to the cluster

Error Types

enum ClusteringError: Error {
    case invalidDimensions
    case noDataPoints
    case convergenceFailed
}
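
A sketch of how these errors might be raised during input validation; the exact checks inside KMeansClusteringService may differ, and the `validate` helper below is hypothetical:

```swift
enum ClusteringError: Error {
    case invalidDimensions
    case noDataPoints
    case convergenceFailed
}

// Hypothetical validation pass run before clustering starts.
func validate(_ vectors: [[Float]]) throws {
    guard let first = vectors.first else {
        throw ClusteringError.noDataPoints
    }
    // Every vector must share the first vector's dimensionality.
    guard vectors.allSatisfy({ $0.count == first.count }) else {
        throw ClusteringError.invalidDimensions
    }
}
```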

Configuration Parameters

init(k: Int, maxIterations: Int = 100, convergenceThreshold: Float = 0.001)
Parameter              Description                       Default
k                      Number of clusters                Required
maxIterations          Maximum iterations allowed        100
convergenceThreshold   Minimum change for convergence    0.001

Primary Functions

Clustering

func cluster(_ points: [Point]) throws -> [Cluster]

Steps:

  1. Validates input dimensions

  2. Initializes random centroids

  3. Assigns each point to its nearest centroid

  4. Updates each centroid to the mean of its assigned points

  5. Repeats steps 3–4 until centroids move less than the convergence threshold or the iteration limit is reached
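
These steps can be sketched as a standalone function. This is a simplified illustration, not the service's actual implementation: it assumes k does not exceed the number of points, non-empty input, and uses squared distances for comparisons.

```swift
import Foundation

struct Point { let vector: [Float]; let id: UUID }
struct Cluster { var centroid: [Float]; var points: [Point] }

// Squared Euclidean distance; enough for nearest-centroid comparisons.
func squaredDistance(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +)
}

func kMeans(_ points: [Point], k: Int,
            maxIterations: Int = 100,
            convergenceThreshold: Float = 0.001) -> [Cluster] {
    // Initialize centroids from k random points.
    var centroids = points.shuffled().prefix(k).map { $0.vector }
    var clusters: [Cluster] = []
    for _ in 0..<maxIterations {
        // Assign every point to its nearest centroid.
        clusters = centroids.map { Cluster(centroid: $0, points: []) }
        for p in points {
            let nearest = centroids.indices.min {
                squaredDistance(p.vector, centroids[$0]) < squaredDistance(p.vector, centroids[$1])
            }!
            clusters[nearest].points.append(p)
        }
        // Move each centroid to the mean of its assigned points.
        var shift: Float = 0
        for i in clusters.indices where !clusters[i].points.isEmpty {
            let dim = centroids[i].count
            var mean = [Float](repeating: 0, count: dim)
            for p in clusters[i].points {
                for d in 0..<dim { mean[d] += p.vector[d] }
            }
            let count = Float(clusters[i].points.count)
            for d in 0..<dim { mean[d] /= count }
            shift = max(shift, squaredDistance(centroids[i], mean).squareRoot())
            centroids[i] = mean
            clusters[i].centroid = mean
        }
        // Converged once no centroid moved more than the threshold.
        if shift < convergenceThreshold { break }
    }
    return clusters
}
```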

Helper Functions

Find Closest Centroid

private func findClosestCentroid(_ point: [Float], centroids: [[Float]]) -> Int
  • Finds nearest centroid to a point

  • Uses Euclidean distance

  • Returns centroid index
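
A minimal sketch of how this helper might work (the service's actual implementation may differ):

```swift
// Returns the index of the centroid nearest to `point`,
// assuming `centroids` is non-empty and dimensions match.
func findClosestCentroid(_ point: [Float], centroids: [[Float]]) -> Int {
    var bestIndex = 0
    var bestDistance = Float.greatestFiniteMagnitude
    for (index, centroid) in centroids.enumerated() {
        // Euclidean distance between the point and this centroid.
        let d = zip(point, centroid).map { ($0 - $1) * ($0 - $1) }.reduce(0, +).squareRoot()
        if d < bestDistance {
            bestDistance = d
            bestIndex = index
        }
    }
    return bestIndex
}
```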

Calculate Mean Vector

private func calculateMeanVector(for points: [Point]) -> [Float]
  • Computes centroid position

  • Averages all points in cluster

  • Returns new centroid vector
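
A sketch of the mean-vector computation, assuming a non-empty cluster whose points all share one dimensionality:

```swift
import Foundation

struct Point { let vector: [Float]; let id: UUID }

// Averages the vectors of all points component-wise.
func calculateMeanVector(for points: [Point]) -> [Float] {
    let dim = points[0].vector.count
    var sum = [Float](repeating: 0, count: dim)
    for p in points {
        for d in 0..<dim { sum[d] += p.vector[d] }
    }
    return sum.map { $0 / Float(points.count) }
}
```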

Distance Calculation

private func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float
  • Calculates distance between points

  • Uses Euclidean distance metric

  • Returns scalar distance value
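
The metric itself is the standard one: the square root of the sum of squared component differences. A sketch, assuming both vectors have equal length:

```swift
// Euclidean distance between two vectors of equal length.
func euclideanDistance(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +).squareRoot()
}
```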

Usage Examples

Basic Clustering

let service = KMeansClusteringService(k: 3)

let points = [
    Point(vector: [1.0, 2.0], id: UUID()),
    Point(vector: [1.1, 2.1], id: UUID()),
    Point(vector: [5.0, 6.0], id: UUID())
]

do {
    let clusters = try service.cluster(points)
    // Work with clusters
} catch {
    print("Clustering failed: \(error)")
}

With Text Embeddings

// Example with word embeddings
let embeddings = words.map { Point(
    vector: wordEmbeddingService.generateVector(for: $0),
    id: UUID()
) }
let clusters = try service.cluster(embeddings)

Best Practices

  1. Data Preparation

    • Normalize input vectors

    • Validate dimensions

    • Remove outliers if necessary

  2. Parameter Selection

    • Choose appropriate k value

    • Adjust convergence threshold

    • Set reasonable iteration limit

  3. Performance Optimization

    • Batch large datasets

    • Monitor convergence

    • Cache intermediate results

  4. Error Handling

    • Validate input data

    • Handle convergence failures

    • Provide fallback options
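
For the normalization step above, one common approach is to scale each vector to unit length so that distances reflect direction rather than magnitude. The `normalized` helper below is a hypothetical sketch, not part of KMeansClusteringService itself:

```swift
// Scales a vector to unit length; zero vectors are returned unchanged.
func normalized(_ vector: [Float]) -> [Float] {
    let magnitude = vector.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard magnitude > 0 else { return vector }
    return vector.map { $0 / magnitude }
}
```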

Algorithm Details

  1. Initialization Phase

    • Random centroid selection

    • Input validation

    • Parameter checking

  2. Iteration Phase

    • Point assignment

    • Centroid updates

    • Convergence checks

  3. Termination Conditions

    • Maximum iterations reached

    • Convergence threshold met

    • Error condition detected

Integration Points

With WordEmbeddingService

// Convert embeddings to points
let embeddings = texts.map { text in
    Point(
        vector: wordEmbeddingService.generateTextVector(for: text),
        id: UUID()
    )
}

With SimilarityService

// Calculate cluster cohesion
for cluster in clusters {
    let cohesion = similarityService.calculateClusterCohesion(
        points: cluster.points
    )
}

Performance Considerations

  1. Time Complexity

    • O(nki) where:

      • n = number of points

      • k = number of clusters

      • i = iterations

  2. Memory Usage

    • Stores all points and centroids

    • Temporary calculations

    • Cluster assignments

  3. Optimization Tips

    • Use appropriate data structures

    • Implement early stopping

    • Consider parallel processing

Limitations

  1. Algorithm Constraints

    • Requires predefined k

    • Sensitive to initialization

    • May converge to local optima

  2. Data Requirements

    • Fixed dimensions

    • Numerical values only

    • Sufficient data points

  3. Performance Issues

    • Scales with data size

    • Memory intensive

    • Iteration overhead
