src/preprocessor.py:

LightPreprocessor Documentation

Overview

The LightPreprocessor class is a memory-efficient tool for converting raw text into the numerical vectors that machine learning models, particularly perceptrons, expect as input. The implementation prioritizes performance and minimal resource usage while retaining the essential text-processing steps.

Table of Contents

  • Installation

  • Quick Start

  • Class Reference

  • Implementation Details

  • Performance Considerations

  • Examples

  • Best Practices

Installation

# Copy preprocessor.py to your project
from preprocessor import LightPreprocessor

Required dependencies: none (uses only the Python standard library)

Quick Start

# Initialize preprocessor
preprocessor = LightPreprocessor(max_vocab_size=1000)

# Example texts
texts = [
    "The movie was fantastic!",
    "This product is terrible.",
    "Great service today"
]

# Build vocabulary
preprocessor.build_vocabulary(texts)

# Convert text to vectors
sparse_vector = preprocessor.text_to_sparse("The movie is great")
dense_vector = preprocessor.text_to_dense("The movie is great")

Class Reference

Constructor

LightPreprocessor(max_vocab_size: int = 1000)
  • max_vocab_size: Maximum number of words to keep in vocabulary (default: 1000)

Core Methods

clean_text(text: str) → str

Cleans and normalizes input text.

  • Input: Raw text string

  • Output: Cleaned text with:

    • Lowercase conversion

    • Special character removal

    • Extra whitespace removal

  • Time Complexity: O(n) where n is text length
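
The actual implementation lives in src/preprocessor.py; as an illustration only, here is a minimal sketch of clean_text consistent with the behavior above (the exact regex is an assumption, chosen so the whole cleanup happens in a single pass):

import re

def clean_text(self, text: str) -> str:
    # Lowercase, then replace each run of non-alphanumeric characters
    # (punctuation and extra whitespace alike) with a single space
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()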

tokenize(text: str) → List[str]

Converts text into tokens with stopword removal.

  • Input: Cleaned text string

  • Output: List of tokens excluding stopwords

  • Time Complexity: O(n) where n is number of words
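
A sketch of the tokenization step, assuming stop_words is the internal set described under Implementation Details:

from typing import List

def tokenize(self, text: str) -> List[str]:
    # Simple space-based split; the set membership test makes
    # stopword removal O(1) per word
    return [word for word in text.split() if word not in self.stop_words]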

build_vocabulary(texts: List[str]) → None

Builds vocabulary from a list of input texts.

  • Input: List of raw text strings

  • Output: None (updates internal word_to_index mapping)

  • Time Complexity: O(N*M) where N is number of texts, M is average text length
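
A sketch of vocabulary building; using Counter follows the note under Performance Considerations, while keeping exactly the max_vocab_size most frequent words is an assumption:

from collections import Counter
from typing import List

def build_vocabulary(self, texts: List[str]) -> None:
    # Count token frequencies across all texts
    counts = Counter()
    for text in texts:
        counts.update(self.tokenize(self.clean_text(text)))
    # Keep the most frequent words and assign each a stable index
    self.word_to_index = {
        word: index
        for index, (word, _) in enumerate(counts.most_common(self.max_vocab_size))
    }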

text_to_sparse(text: str) → List[int]

Converts text to sparse vector format.

  • Input: Raw text string

  • Output: List of indices where vector would have 1s

  • Time Complexity: O(n) where n is number of words in text
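
A sketch of the sparse conversion; deduplicating repeated words (binary rather than count-based encoding) is an assumption, consistent with the dense format below:

from typing import List

def text_to_sparse(self, text: str) -> List[int]:
    # Map each in-vocabulary token to its index; out-of-vocabulary
    # words are silently skipped
    tokens = self.tokenize(self.clean_text(text))
    return sorted({self.word_to_index[w] for w in tokens if w in self.word_to_index})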

text_to_dense(text: str) → List[int]

Converts text to dense binary vector format.

  • Input: Raw text string

  • Output: Binary vector of vocabulary size

  • Time Complexity: O(V) where V is vocabulary size
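
The dense form can be derived directly from the sparse one; in this sketch, the O(V) cost comes from allocating the length-V zero vector:

from typing import List

def text_to_dense(self, text: str) -> List[int]:
    # Start from a zero vector of vocabulary size, then set the 1s
    vector = [0] * len(self.word_to_index)
    for index in self.text_to_sparse(text):
        vector[index] = 1
    return vector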

Utility Methods

get_vocab_size() → int

Returns current vocabulary size.

get_vocabulary() → List[str]

Returns list of vocabulary words.

add_custom_stopwords(words: List[str]) → None

Adds custom stopwords to existing set.
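
All three utilities are thin wrappers over the internal structures. Sketches of each (lowercasing custom stopwords to match clean_text is an assumption):

from typing import List

def get_vocab_size(self) -> int:
    return len(self.word_to_index)

def get_vocabulary(self) -> List[str]:
    return list(self.word_to_index)

def add_custom_stopwords(self, words: List[str]) -> None:
    # Extending the set keeps stopword lookup O(1)
    self.stop_words.update(word.lower() for word in words)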

Implementation Details

Data Structures

  • word_to_index: Dictionary mapping words to indices

  • stop_words: Set of stopwords for O(1) lookup

  • Sparse representation: List of indices where 1s occur

  • Dense representation: Binary vector of fixed size
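
A constructor sketch showing how these structures might be initialized; the stopword set shown here is an illustrative subset, not the list shipped in preprocessor.py:

def __init__(self, max_vocab_size: int = 1000):
    self.max_vocab_size = max_vocab_size
    self.word_to_index = {}  # word -> position in the feature vector
    # Illustrative subset only; the real set is defined in preprocessor.py
    self.stop_words = {"the", "a", "an", "and", "is", "was", "to", "of", "this"}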

Text Processing Pipeline

  1. Text Cleaning

    • Lowercase conversion

    • Special character removal

    • Whitespace normalization

  2. Tokenization

    • Simple space-based splitting

    • Stopword removal

  3. Vectorization

    • Option for sparse or dense representation

    • Index-based encoding
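
Taken together, the stages compose as follows; the intermediate values in the comments assume "the" and "was" are stopwords:

raw = "  The MOVIE was fantastic!! "

cleaned = preprocessor.clean_text(raw)     # "the movie was fantastic"
tokens = preprocessor.tokenize(cleaned)    # ["movie", "fantastic"]
sparse = preprocessor.text_to_sparse(raw)  # both steps run internally on raw text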

Performance Considerations

Memory Usage

  • Sparse representation for memory efficiency

  • Set-based stopwords for fast lookup

  • No external NLP libraries

  • Minimal data structure overhead

Processing Speed

  • Single-pass text cleaning

  • Simple tokenization strategy

  • O(1) stopword lookup

  • Efficient vocabulary building using Counter

Trade-offs

  • No stemming/lemmatization (speed vs accuracy)

  • Basic tokenization (simplicity vs precision)

  • Fixed vocabulary size (memory vs coverage)

Examples

Basic Usage

# Initialize
preprocessor = LightPreprocessor(max_vocab_size=1000)

# Process multiple texts
texts = [
    "The user interface is intuitive",
    "Performance could be better",
    "Great documentation and support"
]

# Build vocabulary
preprocessor.build_vocabulary(texts)

# Get sparse representation
sparse = preprocessor.text_to_sparse("The interface is great")
print(f"Sparse vector: {sparse}")  # e.g., [2, 5, 8]

# Get dense representation
dense = preprocessor.text_to_dense("The interface is great")
print(f"Dense vector: {dense}")    # e.g., [0, 0, 1, 0, 0, 1, 0, 0, 1]

Custom Stopwords

# Add domain-specific stopwords
preprocessor.add_custom_stopwords(["very", "quite", "rather"])

Best Practices

Memory Management

  1. Use sparse representation for large vocabularies

  2. Keep max_vocab_size reasonable for your use case

  3. Clean up large text lists after vocabulary building

Preprocessing Pipeline

  1. Clean texts before building vocabulary

  2. Consider domain-specific stopwords

  3. Monitor vocabulary coverage

Integration Tips

  1. Process texts in batches for efficiency

  2. Cache vectors for frequently used texts

  3. Consider vocabulary persistence for production use (tips 2 and 3 are sketched below)
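
A sketch covering tips 2 and 3 with only the standard library; serializing word_to_index as JSON assumes it is a plain dict mapping strings to integers:

import json
from functools import lru_cache

# Tip 2: cache vectors for frequently repeated texts
@lru_cache(maxsize=1024)
def cached_dense(text: str) -> tuple:
    return tuple(preprocessor.text_to_dense(text))

# Tip 3: persist the vocabulary after training, restore it in production
with open("vocab.json", "w") as f:
    json.dump(preprocessor.word_to_index, f)

with open("vocab.json") as f:
    preprocessor.word_to_index = json.load(f)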

Error Handling

  1. Handle empty texts gracefully

  2. Check vocabulary size before processing

  3. Validate input text encoding (all three checks are sketched below)
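
None of these checks are built into LightPreprocessor; a defensive wrapper along these lines is one way to apply them:

def safe_vectorize(preprocessor, text):
    # 3. Validate encoding: decode byte input defensively
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")
    # 1. Handle empty or whitespace-only input gracefully
    if not text or not text.strip():
        return []
    # 2. Check vocabulary size before processing
    if preprocessor.get_vocab_size() == 0:
        raise ValueError("call build_vocabulary() before vectorizing")
    return preprocessor.text_to_sparse(text)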

Contributing

To contribute improvements to this preprocessor:

  1. Maintain the focus on lightweight processing

  2. Add thorough documentation for new features

  3. Consider backward compatibility

  4. Test with various text types and sizes
