src/preprocessor.py:
LightPreprocessor Documentation
Overview
The LightPreprocessor class provides a memory-efficient way to convert raw text into numerical representations suitable for machine learning models, particularly perceptrons. The implementation prioritizes performance and minimal resource usage while retaining essential text processing functionality.
Table of Contents
Installation
Quick Start
Class Reference
Implementation Details
Performance Considerations
Examples
Best Practices
Installation
Required dependencies: None (uses only Python standard library)
Quick Start
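A minimal usage sketch, assuming the import path shown; the methods follow the Class Reference section below, and the printed values are illustrative:

```python
from preprocessor import LightPreprocessor  # import path assumed

texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning models need numerical input",
]

pre = LightPreprocessor(max_vocab_size=1000)
pre.build_vocabulary(texts)

print(pre.text_to_sparse("a quick brown fox"))  # e.g. [0, 3], indices of known words
print(pre.text_to_dense("a quick brown fox"))   # e.g. [1, 0, 0, 1, ...], binary vector
```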
Class Reference
Constructor
max_vocab_size: Maximum number of words to keep in the vocabulary (default: 1000)
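The source is not reproduced here, so the following is an assumed constructor shape, consistent with the parameter above and the data structures described under Implementation Details; the default stopword set is illustrative:

```python
class LightPreprocessor:
    def __init__(self, max_vocab_size: int = 1000):
        self.max_vocab_size = max_vocab_size
        self.word_to_index = {}  # populated by build_vocabulary()
        # Small built-in stopword set; the contents here are an assumption.
        self.stop_words = {"a", "an", "the", "is", "in", "it", "of", "to", "and", "or"}
```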
Core Methods
clean_text(text: str) → str
Cleans and normalizes input text.
Input: Raw text string
Output: Cleaned text with:
Lowercase conversion
Special character removal
Extra whitespace removal
Time Complexity: O(n) where n is text length
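A plausible implementation matching the documented steps (the exact regex is an assumption; the method body is shown without the enclosing class):

```python
import re

def clean_text(self, text: str) -> str:
    # Lowercase, replace special characters with spaces, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())
```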
tokenize(text: str) → List[str]
Converts text into tokens with stopword removal.
Input: Cleaned text string
Output: List of tokens excluding stopwords
Time Complexity: O(n) where n is number of words
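A sketch consistent with the documented space-based splitting and set-based stopword filtering:

```python
from typing import List

def tokenize(self, text: str) -> List[str]:
    # Space-based split, then drop stopwords via O(1) set membership tests.
    return [token for token in text.split() if token not in self.stop_words]
```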
build_vocabulary(texts: List[str]) → None
Builds vocabulary from a list of input texts.
Input: List of raw text strings
Output: None (updates internal word_to_index mapping)
Time Complexity: O(N*M) where N is number of texts, M is average text length
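A sketch using collections.Counter, as mentioned under Performance Considerations; assigning indices in descending frequency order is an assumption:

```python
from collections import Counter
from typing import List

def build_vocabulary(self, texts: List[str]) -> None:
    # Count token frequencies across all texts, then keep the
    # max_vocab_size most frequent words and assign each an index.
    counts = Counter()
    for text in texts:
        counts.update(self.tokenize(self.clean_text(text)))
    top_words = counts.most_common(self.max_vocab_size)
    self.word_to_index = {word: i for i, (word, _) in enumerate(top_words)}
```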
text_to_sparse(text: str) → List[int]
Converts text to sparse vector format.
Input: Raw text string
Output: List of indices where vector would have 1s
Time Complexity: O(n) where n is number of words in text
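A sketch of the documented behavior; deduplicating and sorting the indices are assumptions:

```python
from typing import List

def text_to_sparse(self, text: str) -> List[int]:
    # Map in-vocabulary tokens to their indices; unknown tokens are dropped.
    tokens = self.tokenize(self.clean_text(text))
    return sorted({self.word_to_index[t] for t in tokens if t in self.word_to_index})
```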
text_to_dense(text: str) → List[int]
Converts text to dense binary vector format.
Input: Raw text string
Output: Binary vector of vocabulary size
Time Complexity: O(V) where V is vocabulary size
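A sketch that derives the dense vector from the sparse indices, consistent with the O(V) cost noted above:

```python
from typing import List

def text_to_dense(self, text: str) -> List[int]:
    # Allocate a zero vector of vocabulary size, then set 1s at the sparse indices.
    vector = [0] * len(self.word_to_index)
    for index in self.text_to_sparse(text):
        vector[index] = 1
    return vector
```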
Utility Methods
get_vocab_size() → int
Returns current vocabulary size.
get_vocabulary() → List[str]
Returns list of vocabulary words.
add_custom_stopwords(words: List[str]) → None
Adds custom stopwords to the existing set.
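Plausible one-line implementations, assuming the attribute layout sketched under Constructor:

```python
from typing import List

def get_vocab_size(self) -> int:
    return len(self.word_to_index)

def get_vocabulary(self) -> List[str]:
    return list(self.word_to_index)

def add_custom_stopwords(self, words: List[str]) -> None:
    # Extends the existing set; affects all subsequent tokenize() calls.
    self.stop_words.update(words)
```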
Implementation Details
Data Structures
word_to_index: Dictionary mapping words to indices
stop_words: Set of stopwords for O(1) lookup
Sparse representation: List of indices where 1s occur
Dense representation: Binary vector of fixed size
Text Processing Pipeline
Text Cleaning
Lowercase conversion
Special character removal
Whitespace normalization
Tokenization
Simple space-based splitting
Stopword removal
Vectorization
Option for sparse or dense representation
Index-based encoding
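Putting the three stages together (the expected values assume the illustrative default stopword set sketched earlier):

```python
pre = LightPreprocessor()
pre.build_vocabulary(["the quick brown fox", "a lazy brown dog"])

raw = "The QUICK, brown fox!!"
cleaned = pre.clean_text(raw)     # "the quick brown fox"
tokens = pre.tokenize(cleaned)    # ["quick", "brown", "fox"] if "the" is a stopword
sparse = pre.text_to_sparse(raw)  # indices of in-vocabulary tokens
dense = pre.text_to_dense(raw)    # fixed-length binary vector
```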
Performance Considerations
Memory Usage
Sparse representation for memory efficiency
Set-based stopwords for fast lookup
No external NLP libraries
Minimal data structure overhead
Processing Speed
Single-pass text cleaning
Simple tokenization strategy
O(1) stopword lookup
Efficient vocabulary building using Counter
Trade-offs
No stemming/lemmatization (speed vs accuracy)
Basic tokenization (simplicity vs precision)
Fixed vocabulary size (memory vs coverage)
Examples
Basic Usage
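A sketch of typical end-to-end use; the texts and outputs are illustrative:

```python
from preprocessor import LightPreprocessor  # import path assumed

pre = LightPreprocessor(max_vocab_size=500)
pre.build_vocabulary([
    "win free money now",
    "project meeting moved to friday",
])

features = pre.text_to_dense("free money")  # binary features, e.g. for a perceptron
print(pre.get_vocab_size(), features)
```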
Custom Stopwords
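A sketch showing domain-specific stopwords added before the vocabulary is built:

```python
from preprocessor import LightPreprocessor  # import path assumed

pre = LightPreprocessor()
pre.add_custom_stopwords(["http", "www", "com"])  # domain-specific noise terms
pre.build_vocabulary(["visit www example com", "http links everywhere"])
print(pre.get_vocabulary())  # the custom stopwords no longer appear
```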
Best Practices
Memory Management
Use sparse representation for large vocabularies
Keep max_vocab_size reasonable for your use case
Clean up large text lists after vocabulary building
Preprocessing Pipeline
Clean texts before building vocabulary
Consider domain-specific stopwords
Monitor vocabulary coverage
Integration Tips
Process texts in batches for efficiency
Cache vectors for frequently used texts
Consider vocabulary persistence for production use
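Since word_to_index is a plain dict, a hypothetical persistence layer can round-trip it through JSON; the helper names below are illustrative, not part of the class:

```python
import json

def save_vocabulary(pre, path: str) -> None:
    # word_to_index maps str -> int, so it serializes to JSON directly.
    with open(path, "w") as f:
        json.dump(pre.word_to_index, f)

def load_vocabulary(pre, path: str) -> None:
    with open(path) as f:
        pre.word_to_index = json.load(f)
```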
Error Handling
Handle empty texts gracefully
Check vocabulary size before processing
Validate input text encoding
Contributing
To contribute improvements to this preprocessor:
Maintain the focus on lightweight processing
Add thorough documentation for new features
Consider backward compatibility
Test with various text types and sizes