SimilarityService.swift

SimilarityService Documentation

Overview

SimilarityService.swift computes similarity measures between vectors. It is used primarily to compare text embeddings and determine how closely related different pieces of text are. The service focuses on cosine similarity and threshold-based similarity detection.

Core Components

Error Types

enum SimilarityError: Error {
    case dimensionMismatch
    case emptyVectors
    case invalidInput
}

Error              Description
dimensionMismatch  Vectors have different dimensions
emptyVectors       One or both vectors are empty
invalidInput       Invalid vector values (e.g., all zeros)
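
To surface these descriptions to callers, one option is to conform the enum to Foundation's LocalizedError. The sketch below is an assumption about how that could look, not part of the documented service; the message strings simply reuse the descriptions listed above.

```swift
import Foundation

// Mirrors the enum shown above; redeclared so the sketch is self-contained.
enum SimilarityError: Error {
    case dimensionMismatch
    case emptyVectors
    case invalidInput
}

// Assumption: the messages reuse the descriptions from the error list above.
extension SimilarityError: LocalizedError {
    var errorDescription: String? {
        switch self {
        case .dimensionMismatch:
            return "Vectors have different dimensions"
        case .emptyVectors:
            return "One or both vectors are empty"
        case .invalidInput:
            return "Invalid vector values (e.g., all zeros)"
        }
    }
}
```

With this in place, a generic catch block can print error.localizedDescription instead of the bare case name.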

Primary Functions

Cosine Similarity

func cosineSimilarity(between v1: [Float], and v2: [Float]) throws -> Float
  • Calculates the cosine similarity between two vectors of equal dimension

  • Returns a value between -1 and 1

  • 1: Same direction (most similar)

  • 0: Orthogonal (unrelated)

  • -1: Opposite direction (most dissimilar)

Threshold Checking

func meetsThreshold(_ similarity: Float, threshold: Float) -> Bool
  • Determines if similarity meets minimum threshold

  • Returns boolean indicating if threshold is met

  • Used for filtering similar items
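
A minimal sketch of the check, written as a free function so it runs on its own. The source does not say whether the boundary is inclusive; the sketch assumes it is (>=), so a score exactly equal to the threshold passes.

```swift
// Hypothetical free-function version of meetsThreshold.
// Assumption: the comparison is inclusive (>=); the source does not specify.
func meetsThreshold(_ similarity: Float, threshold: Float) -> Bool {
    return similarity >= threshold
}

// Filtering a list of precomputed similarity scores.
let scores: [Float] = [0.95, 0.80, 0.42]
let kept = scores.filter { meetsThreshold($0, threshold: 0.8) }
print(kept) // [0.95, 0.8]
```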

Similar Vector Finding

func findSimilarVectors(
    target: [Float], 
    candidates: [[Float]], 
    threshold: Float
) throws -> [(index: Int, similarity: Float)]
  • Finds vectors similar to target

  • Returns sorted array of matches

  • Includes similarity scores and indices
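
The following is one way such a search could be written, not the service's actual implementation. It assumes that matches are sorted by descending similarity, that the threshold is inclusive, and that an error from any candidate aborts the whole search. The cosine helper is a hypothetical stand-in so the block is self-contained.

```swift
enum SimilarityError: Error {
    case dimensionMismatch
    case emptyVectors
    case invalidInput
}

// Compact cosine helper (hypothetical) so the sketch runs on its own.
func cosine(_ a: [Float], _ b: [Float]) throws -> Float {
    guard !a.isEmpty, !b.isEmpty else { throw SimilarityError.emptyVectors }
    guard a.count == b.count else { throw SimilarityError.dimensionMismatch }
    let dot = zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    guard normA > 0, normB > 0 else { throw SimilarityError.invalidInput }
    return dot / (normA * normB)
}

// Assumptions: descending sort, inclusive threshold, first error aborts.
func findSimilarVectors(
    target: [Float],
    candidates: [[Float]],
    threshold: Float
) throws -> [(index: Int, similarity: Float)] {
    var matches: [(index: Int, similarity: Float)] = []
    for (index, candidate) in candidates.enumerated() {
        let similarity = try cosine(target, candidate)
        if similarity >= threshold {
            matches.append((index: index, similarity: similarity))
        }
    }
    // Best match first.
    return matches.sorted { $0.similarity > $1.similarity }
}
```

Each returned index refers to the candidate's position in the input array, so callers can map a match back to its source item.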

Usage Examples

Basic Similarity Check

let service = SimilarityService()

do {
    let similarity = try service.cosineSimilarity(
        between: [1.0, 0.0], 
        and: [1.0, 0.0]
    )
    print("Similarity: \(similarity)") // 1.0 (identical)
} catch {
    print("Error calculating similarity: \(error)")
}

Threshold-Based Filtering

let similarity: Float = 0.85 // annotate as Float; a bare 0.85 would be inferred as Double
if service.meetsThreshold(similarity, threshold: 0.8) {
    print("Vectors are sufficiently similar")
}

Finding Similar Items

// Annotate as Float; unannotated literals would be inferred as Double.
let target: [Float] = [1.0, 0.0, 0.0]
let candidates: [[Float]] = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0]
]

do {
    let similar = try service.findSimilarVectors(
        target: target,
        candidates: candidates,
        threshold: 0.8
    )
    // Process similar vectors
    for match in similar {
        print("Candidate \(match.index): \(match.similarity)")
    }
} catch {
    print("Error finding similar vectors: \(error)")
}

Integration Points

With WordEmbeddingService

// Compare two texts
let embedding1 = wordEmbeddingService.generateTextVector(for: text1)
let embedding2 = wordEmbeddingService.generateTextVector(for: text2)
let similarity = try similarityService.cosineSimilarity(
    between: embedding1,
    and: embedding2
)

With Clustering

// Pre-cluster similarity check
let shouldCluster = similarityService.meetsThreshold(
    similarity,
    threshold: clusteringThreshold
)

Best Practices

  1. Input Validation

    • Check vector dimensions

    • Validate vector values

    • Handle edge cases

  2. Threshold Selection

    • Choose appropriate thresholds

    • Consider use case requirements

    • Validate against test data

  3. Performance

    • Cache frequently compared vectors

    • Batch similarity calculations

    • Optimize for large datasets

  4. Error Handling

    • Handle dimension mismatches

    • Provide meaningful errors

    • Include fallback behavior
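
As one illustration of validating a threshold against test data, the sketch below sweeps a few candidate thresholds over labeled similarity scores and keeps the most accurate one. The LabeledScore type, the bestThreshold function, and the sample values are all hypothetical; a real evaluation would use scores from your own labeled pairs and possibly a metric other than accuracy.

```swift
// Hypothetical record: a similarity score plus a label saying whether the
// pair should count as similar.
struct LabeledScore {
    let similarity: Float
    let isSimilar: Bool
}

// Returns the candidate threshold with the highest accuracy on the samples,
// or nil when either input is empty. Ties keep the earlier candidate.
func bestThreshold(for samples: [LabeledScore], candidates: [Float]) -> Float? {
    guard !samples.isEmpty else { return nil }
    var best: (threshold: Float, correct: Int)? = nil
    for threshold in candidates {
        // A prediction is correct when "meets threshold" matches the label.
        let correct = samples.filter { ($0.similarity >= threshold) == $0.isSimilar }.count
        if let current = best, correct <= current.correct { continue }
        best = (threshold: threshold, correct: correct)
    }
    return best?.threshold
}

// Illustrative data only.
let samples = [
    LabeledScore(similarity: 0.95, isSimilar: true),
    LabeledScore(similarity: 0.85, isSimilar: true),
    LabeledScore(similarity: 0.70, isSimilar: false),
    LabeledScore(similarity: 0.40, isSimilar: false),
]
let chosen = bestThreshold(for: samples, candidates: [0.5, 0.6, 0.7, 0.8, 0.9])
print(chosen ?? -1) // 0.8
```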

Mathematical Background

Cosine Similarity Formula

cos(θ) = (A·B)/(||A||·||B||)
where:
- A·B is the dot product
- ||A|| and ||B|| are vector magnitudes
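
As a worked instance of the formula, take the target [1.0, 0.0, 0.0] and the candidate [0.8, 0.2, 0.0] from the Finding Similar Items example:

```latex
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}
= \frac{(1.0)(0.8) + (0.0)(0.2) + (0.0)(0.0)}
       {\sqrt{1.0^2 + 0.0^2 + 0.0^2}\;\sqrt{0.8^2 + 0.2^2 + 0.0^2}}
= \frac{0.8}{\sqrt{0.68}} \approx 0.970
```

This value exceeds the 0.8 threshold used in that example, so the candidate would count as a match.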

Implementation Details

  1. Dot Product Calculation

    • Element-wise multiplication

    • Sum of products

  2. Vector Magnitude

    • Square each element

    • Sum squares

    • Take square root

  3. Final Calculation

    • Divide dot product by magnitudes

    • Handle zero magnitudes, where the division is undefined (see invalidInput)
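
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual source: it assumes empty input is checked before dimension mismatch, that a zero magnitude is reported as invalidInput, and that the result is clamped to [-1, 1] to absorb floating-point rounding.

```swift
enum SimilarityError: Error {
    case dimensionMismatch
    case emptyVectors
    case invalidInput
}

func cosineSimilarity(between v1: [Float], and v2: [Float]) throws -> Float {
    guard !v1.isEmpty, !v2.isEmpty else { throw SimilarityError.emptyVectors }
    guard v1.count == v2.count else { throw SimilarityError.dimensionMismatch }

    // Steps 1 and 2 in a single pass: dot product and both sums of squares.
    var dot: Float = 0
    var sumSquares1: Float = 0
    var sumSquares2: Float = 0
    for i in v1.indices {
        dot += v1[i] * v2[i]
        sumSquares1 += v1[i] * v1[i]
        sumSquares2 += v2[i] * v2[i]
    }
    let magnitude1 = sumSquares1.squareRoot()
    let magnitude2 = sumSquares2.squareRoot()

    // Step 3: a zero magnitude would divide by zero, so reject it first.
    guard magnitude1 > 0, magnitude2 > 0 else { throw SimilarityError.invalidInput }
    let raw = dot / (magnitude1 * magnitude2)
    // Rounding can push the result slightly outside [-1, 1]; clamp it back.
    return min(1, max(-1, raw))
}
```

Because the dot product is divided by both magnitudes, scaling either vector leaves the result unchanged: [1, 2] and [10, 20] score approximately 1.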

Performance Considerations

  1. Time Complexity

    • O(n) for a single comparison, where n = vector dimension

    • O(mn) for finding similar vectors, where m = number of candidates, plus O(k log k) to sort the k matches

  2. Memory Usage

    • Temporary calculations

    • Result storage

    • Vector storage

  3. Optimization Strategies

    • Parallel processing

    • Early termination

    • Result caching
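
One caching strategy, shown here as a sketch rather than a description of the service, is to normalize each stored vector to unit length once. For unit vectors the cosine similarity reduces to a plain dot product, so repeated queries skip the magnitude computation. The normalized and dot helpers are hypothetical names.

```swift
// Hypothetical helper: scale a vector to unit length, or return nil when its
// magnitude is zero (which includes the empty vector).
func normalized(_ v: [Float]) -> [Float]? {
    let magnitude = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    guard magnitude > 0 else { return nil }
    return v.map { $0 / magnitude }
}

// For unit vectors of equal length, cosine similarity is just the dot product.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must share a dimension")
    return zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

// Normalize each candidate once and reuse the result for every query.
let candidates: [[Float]] = [[1, 0, 0], [0.8, 0.2, 0], [0, 1, 0]]
let unitCandidates = candidates.compactMap(normalized)
if let unitTarget = normalized([2, 0, 0]) {
    let scores = unitCandidates.map { dot(unitTarget, $0) }
    print(scores)
}
```

Note that compactMap drops zero-magnitude candidates, which shifts indices; code that must report original indices should keep the nil entries instead.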

Limitations

  1. Mathematical Constraints

    • Insensitive to magnitude (vectors that differ only in length score 1)

    • Direction-based similarity

    • Dimension requirements

  2. Performance Limits

    • Scales with vector size

    • Large dataset handling

    • Memory constraints

  3. Use Case Constraints

    • Threshold checks yield only a binary result (similar or not)

    • No semantic understanding

    • Context-independent
