TextTokenizationService.swift
TextTokenizationService Documentation
Overview
TextTokenizationService.swift breaks text into tokens (words, sentences, or paragraphs) using Apple's Natural Language framework. It serves as the preprocessing step for downstream text analysis.
Core Components
Properties
Uses Apple's NLTokenizer for text processing
Configurable for different tokenization units (word, sentence, etc.)
Error Types
emptyInput
No text provided for tokenization
tokenizationFailed
Tokenization process failed
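The two error cases could be declared roughly as follows. This is a minimal sketch: the type name TokenizationError and the LocalizedError conformance are assumptions; only the case names and their descriptions come from this page.

```swift
import Foundation

// Hypothetical error type. Only the case names and messages are taken
// from the documentation; the type name is an assumption.
enum TokenizationError: Error, LocalizedError {
    case emptyInput
    case tokenizationFailed

    // Human-readable message surfaced through LocalizedError.
    var errorDescription: String? {
        switch self {
        case .emptyInput:
            return "No text provided for tokenization"
        case .tokenizationFailed:
            return "Tokenization process failed"
        }
    }
}
```

Conforming to LocalizedError keeps the message next to the case it describes, so callers can display it without a separate lookup table.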
Primary Functions
Initialization
Creates tokenizer with specified unit type
Defaults to word-level tokenization
Options include: .word, .sentence, .paragraph
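An initializer matching this description might look like the sketch below. The class layout and the unit accessor are assumptions for illustration; NLTokenizer(unit:) and NLTokenUnit are the framework's own API.

```swift
import NaturalLanguage

// Hypothetical shape of the service; the real class may differ.
final class TextTokenizationService {
    private let tokenizer: NLTokenizer

    // Defaults to word-level tokenization, as described above.
    init(unit: NLTokenUnit = .word) {
        tokenizer = NLTokenizer(unit: unit)
    }

    // Exposes the configured unit (illustrative accessor).
    var unit: NLTokenUnit { tokenizer.unit }
}

let wordService = TextTokenizationService()
let sentenceService = TextTokenizationService(unit: .sentence)
```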
Text Tokenization
Breaks text into individual tokens
Returns array of token strings
Throws error if tokenization fails
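The behaviour above can be sketched as a free function on top of NLTokenizer. The function name, the local error enum, and the choice to treat an empty token list as tokenizationFailed are all assumptions, not the service's confirmed implementation.

```swift
import Foundation
import NaturalLanguage

// Local stand-in for the service's error type (assumed name).
enum TokenizationError: Error { case emptyInput, tokenizationFailed }

func tokenize(_ text: String, unit: NLTokenUnit = .word) throws -> [String] {
    // Reject input that is empty or whitespace only.
    guard !text.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
        throw TokenizationError.emptyInput
    }
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text
    let ranges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
    // Assumption: no tokens at all is reported as a failure.
    guard !ranges.isEmpty else { throw TokenizationError.tokenizationFailed }
    return ranges.map { String(text[$0]) }
}

do {
    let tokens = try tokenize("Hello, world!")
    print(tokens)
} catch {
    print("Tokenization error: \(error)")
}
```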
Token Range Detection
Gets ranges for each token in text
Useful for text manipulation
Preserves original text positions
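Range detection could be implemented with NLTokenizer's enumerateTokens(in:using:), which reports each token as a Range<String.Index> into the original string. The function name tokenRanges is hypothetical.

```swift
import NaturalLanguage

// Collects the range of every token without copying the token text.
func tokenRanges(in text: String, unit: NLTokenUnit = .word) -> [Range<String.Index>] {
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text
    var ranges: [Range<String.Index>] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        ranges.append(range)
        return true // keep enumerating
    }
    return ranges
}

let source = "Tokens keep their positions."
for range in tokenRanges(in: source) {
    // Character offset of the token within the original text.
    let offset = source.distance(from: source.startIndex, to: range.lowerBound)
    print("\(source[range]) starts at offset \(offset)")
}
```

Because the ranges index into the original string, they stay valid only as long as that string is not mutated.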
Language Detection
Identifies the primary language of text
Returns an NLLanguage value
Returns nil if the language cannot be determined
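One way to get this behaviour is NLLanguageRecognizer.dominantLanguage(for:), which returns an optional NLLanguage. Whether the service uses this call or another mechanism is an assumption; detectLanguage is a hypothetical name.

```swift
import NaturalLanguage

// Returns the most likely language, or nil when none can be determined.
func detectLanguage(of text: String) -> NLLanguage? {
    NLLanguageRecognizer.dominantLanguage(for: text)
}

let sampleText = "Bonjour tout le monde, comment allez-vous aujourd'hui ?"
if let language = detectLanguage(of: sampleText) {
    print("Detected language code: \(language.rawValue)")
} else {
    print("Language could not be determined")
}
```

Short or mixed-language inputs give less reliable results, so callers should always handle the nil case.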
Usage Examples
Basic Tokenization
Sentence Tokenization
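The original sample for this heading is not available. The sketch below calls NLTokenizer directly with the .sentence unit rather than going through the service, whose exact method names are not shown on this page.

```swift
import Foundation
import NaturalLanguage

let text = "Tokenization comes first. Analysis follows. Results are reported last."

let tokenizer = NLTokenizer(unit: .sentence)
tokenizer.string = text

// Sentence ranges can include trailing whitespace, so trim each one.
let sentences = tokenizer.tokens(for: text.startIndex..<text.endIndex).map {
    text[$0].trimmingCharacters(in: .whitespacesAndNewlines)
}

for sentence in sentences {
    print(sentence)
}
```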
Language Detection
Best Practices
Tokenizer Configuration
Choose appropriate tokenization unit
Consider language requirements
Initialize once and reuse
Error Handling
Check for empty inputs
Handle tokenization failures
Provide meaningful error messages
Performance
Process large texts in chunks
Cache results when appropriate
Monitor memory usage
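Chunked processing with a reused tokenizer could look like the sketch below. Splitting on newlines is one possible chunking strategy chosen for illustration; tokenizeInChunks and its callback are hypothetical names.

```swift
import NaturalLanguage

// Tokenizes one line at a time so that only a single chunk's tokens
// are held in memory before being handed to the caller.
func tokenizeInChunks(_ text: String, handle: ([String]) -> Void) {
    let tokenizer = NLTokenizer(unit: .word) // created once, reused per chunk
    for line in text.split(whereSeparator: \.isNewline) {
        let chunk = String(line)
        tokenizer.string = chunk
        let tokens = tokenizer
            .tokens(for: chunk.startIndex..<chunk.endIndex)
            .map { String(chunk[$0]) }
        handle(tokens)
    }
}

var totalTokens = 0
tokenizeInChunks("First line of text\nSecond line") { tokens in
    totalTokens += tokens.count
}
print("Total tokens: \(totalTokens)")
```

Passing each chunk's tokens to a callback avoids building one large array for the whole document.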
Text Preprocessing
Clean text before tokenization
Handle special characters
Consider case sensitivity
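A cleaning step of this kind could be as small as the sketch below, which collapses whitespace runs and optionally lowercases. The function cleaned and its lowercase parameter are illustrative, not part of the service.

```swift
import Foundation

// Collapses runs of whitespace and newlines into single spaces and
// optionally lowercases the result.
func cleaned(_ text: String, lowercase: Bool = false) -> String {
    let collapsed = text
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }
        .joined(separator: " ")
    return lowercase ? collapsed.lowercased() : collapsed
}

print(cleaned("  Hello \n\t World  "))
print(cleaned("Mixed CASE input", lowercase: true))
```

This version also removes newlines, so it should be skipped or adapted when paragraph-level tokenization is needed.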
Integration Points
With WordEmbeddingService
With Text Classification
Limitations
Language Support
Best performance with Latin scripts
May need adjustments for specific languages
Consider language-specific tokenization rules
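When the language is known in advance, NLTokenizer accepts a hint through setLanguage(_:). The sketch below shows one way to pass that hint; the helper name and its optional parameter are assumptions.

```swift
import NaturalLanguage

// Tokenizes text, applying a language hint when one is supplied.
func tokens(in text: String, language: NLLanguage? = nil) -> [String] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    if let language = language {
        tokenizer.setLanguage(language)
    }
    return tokenizer
        .tokens(for: text.startIndex..<text.endIndex)
        .map { String(text[$0]) }
}

// Scripts without spaces between words depend on language-aware segmentation.
let japaneseTokens = tokens(in: "私は学生です", language: .japanese)
print(japaneseTokens)
```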
Special Cases
Handling of contractions
Treatment of punctuation
Multi-word expressions
Performance Considerations
Processing time for large texts
Memory usage with large documents
Initialization overhead
Related Components
WordEmbeddingService: For processing tokens
UNPARTYTextClassifier: For text classification
SimilarityService: For token comparison