SimilarityService.swift
SimilarityService Documentation
Overview
SimilarityService.swift calculates similarity measures between vectors. It is used primarily to compare text embeddings and determine how closely related different pieces of text are. The service focuses on cosine similarity calculations and threshold-based similarity detection.
Core Components
Error Types
dimensionMismatch
Vectors have different dimensions
emptyVectors
One or both vectors are empty
invalidInput
Invalid vector values (e.g., all zeros)
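The three cases above could be modeled as a Swift error enum. This is a minimal sketch: the case names and messages come from this document, while the type name `SimilarityError` and the `LocalizedError` conformance are assumptions, not the service's confirmed declaration.

```swift
import Foundation

// Hypothetical error type. Only the case names are taken from this document.
enum SimilarityError: Error, Equatable {
    case dimensionMismatch   // vectors have different dimensions
    case emptyVectors        // one or both vectors are empty
    case invalidInput        // invalid vector values (e.g., all zeros)
}

// Assumed conformance so callers can show a readable message.
extension SimilarityError: LocalizedError {
    var errorDescription: String? {
        switch self {
        case .dimensionMismatch:
            return "Vectors have different dimensions"
        case .emptyVectors:
            return "One or both vectors are empty"
        case .invalidInput:
            return "Invalid vector values (e.g., all zeros)"
        }
    }
}
```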
Primary Functions
Cosine Similarity
Calculates the cosine of the angle between two vectors
Returns a value between -1 and 1
1: Same direction (most similar)
0: Orthogonal (unrelated)
-1: Opposite direction (most dissimilar)
Threshold Checking
Determines if similarity meets minimum threshold
Returns boolean indicating if threshold is met
Used for filtering similar items
Similar Vector Finding
Finds vectors similar to target
Returns sorted array of matches
Includes similarity scores and indices
Usage Examples
Basic Similarity Check
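The original sample for this section is not available, so the following is a minimal sketch of a basic check. The free function `cosineSimilarity(_:_:)` is a hypothetical stand-in for the service's method. Returning `nil` for invalid input is an assumption made to keep the example short.

```swift
// Hypothetical stand-in for the service's cosine similarity call.
// Returns nil for empty, mismatched, or zero-magnitude vectors.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double? {
    guard !a.isEmpty, a.count == b.count else { return nil }
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let magnitudeA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let magnitudeB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard magnitudeA > 0, magnitudeB > 0 else { return nil }
    return dot / (magnitudeA * magnitudeB)
}

let sameDirection = cosineSimilarity([1, 2, 3], [2, 4, 6])   // approximately 1.0
let orthogonal = cosineSimilarity([1, 0], [0, 1])            // 0.0
let opposite = cosineSimilarity([1, 0], [-1, 0])             // -1.0
```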
Threshold-Based Filtering
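A sketch of threshold-based filtering, assuming a hypothetical `isSimilar(_:_:threshold:)` helper built on a stand-in cosine function. The threshold of 0.8 is illustrative only, not a recommended value.

```swift
// Hypothetical stand-in for the service's cosine similarity call.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double? {
    guard !a.isEmpty, a.count == b.count else { return nil }
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let magnitudeA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let magnitudeB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard magnitudeA > 0, magnitudeB > 0 else { return nil }
    return dot / (magnitudeA * magnitudeB)
}

// Hypothetical helper: true when the score reaches the minimum threshold.
// Invalid input is treated as "not similar" (an assumption).
func isSimilar(_ a: [Double], _ b: [Double], threshold: Double) -> Bool {
    guard let score = cosineSimilarity(a, b) else { return false }
    return score >= threshold
}

let query: [Double] = [1, 0]
let candidates: [[Double]] = [[0.9, 0.1], [0, 1], [1, 1]]

// Keep only the candidates that meet the threshold.
let matches = candidates.filter { isSimilar(query, $0, threshold: 0.8) }
```

Here only `[0.9, 0.1]` passes: `[0, 1]` is orthogonal to the query, and `[1, 1]` scores about 0.71.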
Finding Similar Items
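A sketch of finding similar items that returns indices and scores as described under Primary Functions. The `SimilarityMatch` type, the `findSimilar(to:in:threshold:)` signature, and the descending sort order are all assumptions.

```swift
// Hypothetical stand-in for the service's cosine similarity call.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double? {
    guard !a.isEmpty, a.count == b.count else { return nil }
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let magnitudeA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let magnitudeB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    guard magnitudeA > 0, magnitudeB > 0 else { return nil }
    return dot / (magnitudeA * magnitudeB)
}

// Hypothetical result type: position of the candidate and its score.
struct SimilarityMatch {
    let index: Int
    let score: Double
}

// Scores every candidate, drops those below the threshold (or invalid),
// and sorts the rest from most to least similar.
func findSimilar(to target: [Double],
                 in candidates: [[Double]],
                 threshold: Double) -> [SimilarityMatch] {
    var matches: [SimilarityMatch] = []
    for (index, vector) in candidates.enumerated() {
        if let score = cosineSimilarity(target, vector), score >= threshold {
            matches.append(SimilarityMatch(index: index, score: score))
        }
    }
    return matches.sorted { $0.score > $1.score }
}

let results = findSimilar(to: [1, 0],
                          in: [[0, 1], [1, 1], [2, 0.1], [-1, 0]],
                          threshold: 0.5)
// results holds index 2 first, then index 1
```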
Integration Points
With WordEmbeddingService
With Clustering
Best Practices
Input Validation
Check vector dimensions
Validate vector values
Handle edge cases
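The checks above might be gathered into one validation step that runs before any calculation. In this sketch the function name `validate(_:_:)` and the compact error enum are hypothetical. Treating non-finite values (NaN, infinity) as `invalidInput` is an assumption; the document only names all-zero vectors as an example.

```swift
// Hypothetical error type mirroring the cases listed under Error Types.
enum SimilarityError: Error, Equatable {
    case dimensionMismatch, emptyVectors, invalidInput
}

// Throws the first problem found; returns normally when both vectors are usable.
func validate(_ a: [Double], _ b: [Double]) throws {
    // Edge case: nothing to compare.
    guard !a.isEmpty, !b.isEmpty else { throw SimilarityError.emptyVectors }
    // Dimensions must match for an element-wise dot product.
    guard a.count == b.count else { throw SimilarityError.dimensionMismatch }
    for vector in [a, b] {
        let allFinite = vector.allSatisfy { $0.isFinite }   // assumption: reject NaN/infinity
        let allZero = vector.allSatisfy { $0 == 0 }         // zero magnitude has no direction
        guard allFinite, !allZero else { throw SimilarityError.invalidInput }
    }
}

// try? turns a thrown error into nil, which is enough for a yes/no check.
let isValid = (try? validate([1, 2], [1, 2, 3])) != nil   // false: dimensions differ
```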
Threshold Selection
Choose appropriate thresholds
Consider use case requirements
Validate against test data
Performance
Cache frequently compared vectors
Batch similarity calculations
Optimize for large datasets
Error Handling
Handle dimension mismatches
Provide meaningful errors
Include fallback behavior
Mathematical Background
Cosine Similarity Formula
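The formula itself did not survive in this section. For two n-dimensional vectors A and B, the standard definition of cosine similarity is the dot product divided by the product of the magnitudes:

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\|A\| \, \|B\|}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

The result is undefined when either magnitude is zero, which is why all-zero vectors are rejected as invalid input.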
Implementation Details
Dot Product Calculation
Element-wise multiplication
Sum of products
Vector Magnitude
Square each element
Sum squares
Take square root
Final Calculation
Divide the dot product by the product of the two magnitudes
Guard against zero magnitudes to avoid division by zero
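The three steps above can be sketched as three small functions. The names `dotProduct`, `magnitude`, and `cosine` are illustrative, and returning `nil` on a zero magnitude is one possible way to handle that case, not necessarily what the service does.

```swift
// Step 1: element-wise multiplication, then the sum of the products.
func dotProduct(_ a: [Double], _ b: [Double]) -> Double {
    return zip(a, b).map { $0 * $1 }.reduce(0, +)
}

// Step 2: square each element, sum the squares, take the square root.
func magnitude(_ v: [Double]) -> Double {
    return v.map { $0 * $0 }.reduce(0, +).squareRoot()
}

// Step 3: divide the dot product by the product of the magnitudes.
// Returns nil when dimensions differ or either magnitude is zero.
func cosine(_ a: [Double], _ b: [Double]) -> Double? {
    guard a.count == b.count else { return nil }
    let denominator = magnitude(a) * magnitude(b)
    guard denominator > 0 else { return nil }
    return dotProduct(a, b) / denominator
}

// Worked example: dot = 3*4 + 4*3 = 24, magnitudes = 5 and 5, result = 24/25.
let score = cosine([3, 4], [4, 3])   // 0.96
```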
Performance Considerations
Time Complexity
O(n) for a single comparison
O(mn) to score all candidates when finding similar vectors, plus O(m log m) to sort the matches, where:
n = vector dimension
m = number of candidates
Memory Usage
Temporary calculations
Result storage
Vector storage
Optimization Strategies
Parallel processing
Early termination
Result caching
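One way to apply result caching is to normalize the candidate vectors once and reuse them, so each later comparison is a single dot product. This is a sketch under stated assumptions: the `NormalizedIndex` type is hypothetical, all vectors are assumed to share one dimension, and zero vectors are given a score of 0 as a fallback.

```swift
// Hypothetical cache of unit-length candidate vectors.
struct NormalizedIndex {
    private let unitVectors: [[Double]]

    // Normalizes every candidate once, up front.
    init(vectors: [[Double]]) {
        unitVectors = vectors.map { (vector: [Double]) -> [Double] in
            let norm = vector.map { $0 * $0 }.reduce(0, +).squareRoot()
            if norm > 0 {
                return vector.map { $0 / norm }
            }
            return vector   // zero vector stays as is and will score 0
        }
    }

    // Cosine similarity of the query against every cached candidate.
    // With unit vectors, the dot product alone is the cosine.
    func scores(for query: [Double]) -> [Double] {
        let norm = query.map { $0 * $0 }.reduce(0, +).squareRoot()
        guard norm > 0 else {
            return Array(repeating: 0, count: unitVectors.count)
        }
        let unitQuery = query.map { $0 / norm }
        return unitVectors.map { (candidate: [Double]) -> Double in
            return zip(unitQuery, candidate).map { $0 * $1 }.reduce(0, +)
        }
    }
}

let index = NormalizedIndex(vectors: [[2, 0], [0, 3], [1, 1]])
let scores = index.scores(for: [5, 0])
```

The per-comparison cost is still O(n), but the square roots and divisions for the candidates are paid once instead of on every query.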
Limitations
Mathematical Constraints
Insensitive to magnitude: vectors that differ only in length score as identical
Direction-based similarity
Both vectors must have the same dimension
Performance Limits
Scales with vector size
Large dataset handling
Memory constraints
Use Case Constraints
Pairwise comparison only (two vectors at a time)
No semantic understanding
Context-independent