TextTokenizationService Documentation

Overview

TextTokenizationService.swift provides functionality to break down text into tokens (words or sentences) using Apple's Natural Language framework. This service is essential for text preprocessing and analysis.

Core Components

Properties

private let tokenizer: NLTokenizer
  • Uses Apple's NLTokenizer for text processing

  • Configurable for different tokenization units (word, sentence, etc.)

Error Types

enum TokenizationError: Error {
    case emptyInput
    case tokenizationFailed
}

Error                 Description
emptyInput            No text provided for tokenization
tokenizationFailed    The tokenization process failed

Primary Functions

Initialization

init(unit: NLTokenUnit = .word)
  • Creates tokenizer with specified unit type

  • Defaults to word-level tokenization

  • Options include: .word, .sentence, .paragraph, and .document
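
The source is not reproduced here, but as a minimal sketch the property and initializer could look like the following, assuming the service simply wraps one NLTokenizer configured at creation time:

import NaturalLanguage

final class TextTokenizationService {
    private let tokenizer: NLTokenizer

    init(unit: NLTokenUnit = .word) {
        // NLTokenizer is bound to a unit when created; switching units
        // means creating a new service instance.
        self.tokenizer = NLTokenizer(unit: unit)
    }
}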

Text Tokenization

func tokenize(_ text: String) throws -> [String]
  • Breaks text into individual tokens

  • Returns array of token strings

  • Throws error if tokenization fails
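
A plausible sketch of tokenize(_:) built on NLTokenizer, using the error cases defined above (an assumption, not the verbatim source):

func tokenize(_ text: String) throws -> [String] {
    // Reject empty input up front.
    guard !text.isEmpty else { throw TokenizationError.emptyInput }
    tokenizer.string = text
    // tokens(for:) returns the range of every token in the string.
    let ranges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
    // Treat an empty result as a failure (one possible interpretation).
    guard !ranges.isEmpty else { throw TokenizationError.tokenizationFailed }
    return ranges.map { String(text[$0]) }
}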

Token Range Detection

func getTokenRanges(_ text: String) -> [Range<String.Index>]
  • Gets ranges for each token in text

  • Useful for text manipulation

  • Preserves original text positions
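
This most likely delegates straight to NLTokenizer; a minimal sketch:

func getTokenRanges(_ text: String) -> [Range<String.Index>] {
    tokenizer.string = text
    // The returned ranges index into the original string, so callers
    // can slice or replace tokens without losing positions.
    return tokenizer.tokens(for: text.startIndex..<text.endIndex)
}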

Language Detection

func detectLanguage(for text: String) -> NLLanguage?
  • Identifies the primary language of text

  • Returns NLLanguage enum value

  • Returns nil if language cannot be determined
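
Language detection presumably wraps NLLanguageRecognizer; a one-line sketch:

func detectLanguage(for text: String) -> NLLanguage? {
    // dominantLanguage(for:) returns nil when no language can be
    // determined with reasonable confidence.
    return NLLanguageRecognizer.dominantLanguage(for: text)
}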

Usage Examples

Basic Tokenization

let service = TextTokenizationService()

do {
    let tokens = try service.tokenize("Hello world!")
    // ["Hello", "world", "!"]
} catch {
    print("Tokenization failed: \(error)")
}

Sentence Tokenization

let sentenceTokenizer = TextTokenizationService(unit: .sentence)
let sentences = try? sentenceTokenizer.tokenize("First sentence. Second one!")
// ["First sentence.", "Second one!"]

Language Detection

if let language = service.detectLanguage(for: "Hello world") {
    print("Detected language: \(language)")
}

Best Practices

  1. Tokenizer Configuration

    • Choose appropriate tokenization unit

    • Consider language requirements

    • Initialize once and reuse

  2. Error Handling

    • Check for empty inputs

    • Handle tokenization failures

    • Provide meaningful error messages

  3. Performance

    • Process large texts in chunks (see the sketch after this list)

    • Cache results when appropriate

    • Monitor memory usage

  4. Text Preprocessing

    • Clean text before tokenization

    • Handle special characters

    • Consider case sensitivity
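
As an illustration of the chunking advice in point 3, this hypothetical snippet splits a large input on blank lines before word-level tokenization; largeText and the paragraph delimiter are placeholders, not part of the service:

let service = TextTokenizationService()
// Hypothetical chunking: tokenize paragraph by paragraph to bound
// the amount of text NLTokenizer holds at once.
let paragraphs = largeText.components(separatedBy: "\n\n")
var allTokens: [String] = []
do {
    for paragraph in paragraphs where !paragraph.isEmpty {
        allTokens.append(contentsOf: try service.tokenize(paragraph))
    }
} catch {
    print("Chunked tokenization failed: \(error)")
}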

Integration Points

With WordEmbeddingService

// Example pipeline
let tokens = try tokenizationService.tokenize(inputText)
let embeddings = try tokens.map { try wordEmbeddingService.generateWordVector(for: $0) }

With Text Classification

// Example preprocessing
let tokens = try tokenizationService.tokenize(inputText)
let processedText = tokens.joined(separator: " ")
let classification = try classifier.predict(text: processedText)

Limitations

  1. Language Support

    • Best performance with Latin scripts

    • May need adjustments for specific languages

    • Consider language-specific tokenization rules

  2. Special Cases

    • Handling of contractions

    • Treatment of punctuation

    • Multi-word expressions

  3. Performance Considerations

    • Processing time for large texts

    • Memory usage with large documents

    • Initialization overhead

Related Services

  • WordEmbeddingService: For processing tokens

  • UNPARTYTextClassifier: For text classification

  • SimilarityService: For token comparison
