TextTokenizationService Documentation

Overview

TextTokenizationService.swift provides functionality to break down text into tokens (words or sentences) using Apple's Natural Language framework. This service is essential for text preprocessing and analysis.

Core Components

Properties

private let tokenizer: NLTokenizer
  • Uses Apple's NLTokenizer for text processing

  • Configurable for different tokenization units (word, sentence, etc.)

Error Types

enum TokenizationError: Error {
    case emptyInput
    case tokenizationFailed
}

Error                 Description
emptyInput            No text provided for tokenization
tokenizationFailed    The tokenization process failed

Primary Functions

Initialization

init(unit: NLTokenUnit = .word)
  • Creates tokenizer with specified unit type

  • Defaults to word-level tokenization

  • Options include: .word, .sentence, .paragraph, and .document
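
The source is not reproduced here, but as a minimal sketch the property and initializer could look like the following, assuming the service simply wraps one NLTokenizer configured at creation time:

import NaturalLanguage

final class TextTokenizationService {
    private let tokenizer: NLTokenizer

    init(unit: NLTokenUnit = .word) {
        // NLTokenizer is bound to a unit when created; switching units
        // means creating a new service instance.
        self.tokenizer = NLTokenizer(unit: unit)
    }
}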

Text Tokenization

func tokenize(_ text: String) throws -> [String]
  • Breaks text into individual tokens

  • Returns array of token strings

  • Throws error if tokenization fails
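
A plausible sketch of tokenize(_:) built on NLTokenizer, using the error cases defined above (an assumption, not the verbatim source):

func tokenize(_ text: String) throws -> [String] {
    // Reject empty input up front.
    guard !text.isEmpty else { throw TokenizationError.emptyInput }
    tokenizer.string = text
    // tokens(for:) returns the range of every token in the string.
    let ranges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
    // Treat an empty result as a failure (one possible interpretation).
    guard !ranges.isEmpty else { throw TokenizationError.tokenizationFailed }
    return ranges.map { String(text[$0]) }
}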

Token Range Detection

func getTokenRanges(_ text: String) -> [Range<String.Index>]
  • Gets ranges for each token in text

  • Useful for text manipulation

  • Preserves original text positions
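
This most likely delegates straight to NLTokenizer; a minimal sketch:

func getTokenRanges(_ text: String) -> [Range<String.Index>] {
    tokenizer.string = text
    // The returned ranges index into the original string, so callers
    // can slice or replace tokens without losing positions.
    return tokenizer.tokens(for: text.startIndex..<text.endIndex)
}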

Language Detection

func detectLanguage(for text: String) -> NLLanguage?
  • Identifies the primary language of text

  • Returns NLLanguage enum value

  • Returns nil if language cannot be determined
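
Language detection presumably wraps NLLanguageRecognizer; a one-line sketch:

func detectLanguage(for text: String) -> NLLanguage? {
    // dominantLanguage(for:) returns nil when no language can be
    // determined with reasonable confidence.
    return NLLanguageRecognizer.dominantLanguage(for: text)
}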

Usage Examples

Basic Tokenization

let service = TextTokenizationService()

do {
    let tokens = try service.tokenize("Hello world!")
    // ["Hello", "world", "!"]
} catch {
    print("Tokenization failed: \(error)")
}

Sentence Tokenization

let sentenceTokenizer = TextTokenizationService(unit: .sentence)
let sentences = try? sentenceTokenizer.tokenize("First sentence. Second one!")
// ["First sentence.", "Second one!"]

Language Detection

if let language = service.detectLanguage(for: "Hello world") {
    print("Detected language: \(language)")
}

Best Practices

  1. Tokenizer Configuration

    • Choose appropriate tokenization unit

    • Consider language requirements

    • Initialize once and reuse

  2. Error Handling

    • Check for empty inputs

    • Handle tokenization failures

    • Provide meaningful error messages

  3. Performance

    • Process large texts in chunks (see the sketch after this list)

    • Cache results when appropriate

    • Monitor memory usage

  4. Text Preprocessing

    • Clean text before tokenization

    • Handle special characters

    • Consider case sensitivity
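
As an illustration of the chunking advice in point 3, this hypothetical snippet splits a large input on blank lines before word-level tokenization; largeText and the paragraph delimiter are placeholders, not part of the service:

let service = TextTokenizationService()
// Hypothetical chunking: tokenize paragraph by paragraph to bound
// the amount of text NLTokenizer holds at once.
let paragraphs = largeText.components(separatedBy: "\n\n")
var allTokens: [String] = []
do {
    for paragraph in paragraphs where !paragraph.isEmpty {
        allTokens.append(contentsOf: try service.tokenize(paragraph))
    }
} catch {
    print("Chunked tokenization failed: \(error)")
}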

Integration Points

With WordEmbeddingService

// Example pipeline
let tokens = try tokenizationService.tokenize(inputText)
let embeddings = try tokens.map { try wordEmbeddingService.generateWordVector(for: $0) }

With Text Classification

// Example preprocessing
let tokens = try tokenizationService.tokenize(inputText)
let processedText = tokens.joined(separator: " ")
let classification = try classifier.predict(text: processedText)

Limitations

  1. Language Support

    • Best performance with Latin scripts

    • May need adjustments for specific languages

    • Consider language-specific tokenization rules

  2. Special Cases

    • Handling of contractions

    • Treatment of punctuation

    • Multi-word expressions

  3. Performance Considerations

    • Processing time for large texts

    • Memory usage with large documents

    • Initialization overhead

Related Services

  • WordEmbeddingService: For processing tokens

  • UNPARTYTextClassifier: For text classification

  • SimilarityService: For token comparison
