src/preprocessor.py:
LightPreprocessor Documentation
Overview
The LightPreprocessor class provides a memory-efficient way to convert raw text into numerical representations suitable for machine learning models, particularly perceptrons. The implementation prioritizes performance and minimal resource usage while retaining essential text processing functionality.
Table of Contents
Installation
Quick Start
Class Reference
Implementation Details
Performance Considerations
Examples
Best Practices
Installation
Required dependencies: None (uses only Python standard library)
Quick Start
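A minimal usage sketch, assuming the import path shown; the methods follow the Class Reference section below, and the printed values are illustrative:

```python
from preprocessor import LightPreprocessor  # import path assumed

texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning models need numerical input",
]

pre = LightPreprocessor(max_vocab_size=1000)
pre.build_vocabulary(texts)

print(pre.text_to_sparse("a quick brown fox"))  # e.g. [0, 3], indices of known words
print(pre.text_to_dense("a quick brown fox"))   # e.g. [1, 0, 0, 1, ...], binary vector
```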
Class Reference
Constructor
max_vocab_size: Maximum number of words to keep in the vocabulary (default: 1000)
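The source is not reproduced here, so the following is an assumed constructor shape, consistent with the parameter above and the data structures described under Implementation Details; the default stopword set is illustrative:

```python
class LightPreprocessor:
    def __init__(self, max_vocab_size: int = 1000):
        self.max_vocab_size = max_vocab_size
        self.word_to_index = {}  # populated by build_vocabulary()
        # Small built-in stopword set; the contents here are an assumption.
        self.stop_words = {"a", "an", "the", "is", "in", "it", "of", "to", "and", "or"}
```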
Core Methods
clean_text(text: str) → str
Cleans and normalizes input text.
Input: Raw text string
Output: Cleaned text with:
Lowercase conversion
Special character removal
Extra whitespace removal
Time Complexity: O(n) where n is text length
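A plausible implementation matching the documented steps (the exact regex is an assumption; the method body is shown without the enclosing class):

```python
import re

def clean_text(self, text: str) -> str:
    # Lowercase, replace special characters with spaces, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())
```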
tokenize(text: str) → List[str]
Converts text into tokens with stopword removal.
Input: Cleaned text string
Output: List of tokens excluding stopwords
Time Complexity: O(n) where n is number of words
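A sketch consistent with the documented space-based splitting and set-based stopword filtering:

```python
from typing import List

def tokenize(self, text: str) -> List[str]:
    # Space-based split, then drop stopwords via O(1) set membership tests.
    return [token for token in text.split() if token not in self.stop_words]
```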
build_vocabulary(texts: List[str]) → None
Builds vocabulary from a list of input texts.
Input: List of raw text strings
Output: None (updates internal word_to_index mapping)
Time Complexity: O(N*M) where N is number of texts, M is average text length
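A sketch using collections.Counter, as mentioned under Performance Considerations; assigning indices in descending frequency order is an assumption:

```python
from collections import Counter
from typing import List

def build_vocabulary(self, texts: List[str]) -> None:
    # Count token frequencies across all texts, then keep the
    # max_vocab_size most frequent words and assign each an index.
    counts = Counter()
    for text in texts:
        counts.update(self.tokenize(self.clean_text(text)))
    top_words = counts.most_common(self.max_vocab_size)
    self.word_to_index = {word: i for i, (word, _) in enumerate(top_words)}
```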
text_to_sparse(text: str) → List[int]
Converts text to sparse vector format.
Input: Raw text string
Output: List of indices where vector would have 1s
Time Complexity: O(n) where n is number of words in text
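A sketch of the documented behavior; deduplicating and sorting the indices are assumptions:

```python
from typing import List

def text_to_sparse(self, text: str) -> List[int]:
    # Map in-vocabulary tokens to their indices; unknown tokens are dropped.
    tokens = self.tokenize(self.clean_text(text))
    return sorted({self.word_to_index[t] for t in tokens if t in self.word_to_index})
```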
text_to_dense(text: str) → List[int]
Converts text to dense binary vector format.
Input: Raw text string
Output: Binary vector of vocabulary size
Time Complexity: O(V) where V is vocabulary size
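A sketch that derives the dense vector from the sparse indices, consistent with the O(V) cost noted above:

```python
from typing import List

def text_to_dense(self, text: str) -> List[int]:
    # Allocate a zero vector of vocabulary size, then set 1s at the sparse indices.
    vector = [0] * len(self.word_to_index)
    for index in self.text_to_sparse(text):
        vector[index] = 1
    return vector
```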
Utility Methods
get_vocab_size() → int
Returns current vocabulary size.
get_vocabulary() → List[str]
Returns list of vocabulary words.
add_custom_stopwords(words: List[str]) → None
Adds custom stopwords to the existing set.
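Plausible one-line implementations, assuming the attribute layout sketched under Constructor:

```python
from typing import List

def get_vocab_size(self) -> int:
    return len(self.word_to_index)

def get_vocabulary(self) -> List[str]:
    return list(self.word_to_index)

def add_custom_stopwords(self, words: List[str]) -> None:
    # Extends the existing set; affects all subsequent tokenize() calls.
    self.stop_words.update(words)
```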
Implementation Details
Data Structures
word_to_index: Dictionary mapping words to indices
stop_words: Set of stopwords for O(1) lookup
Sparse representation: List of indices where 1s occur
Dense representation: Binary vector of fixed size
Text Processing Pipeline
Text Cleaning
Lowercase conversion
Special character removal
Whitespace normalization
Tokenization
Simple space-based splitting
Stopword removal
Vectorization
Option for sparse or dense representation
Index-based encoding
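Putting the three stages together (the expected values assume the illustrative default stopword set sketched earlier):

```python
pre = LightPreprocessor()
pre.build_vocabulary(["the quick brown fox", "a lazy brown dog"])

raw = "The QUICK, brown fox!!"
cleaned = pre.clean_text(raw)     # "the quick brown fox"
tokens = pre.tokenize(cleaned)    # ["quick", "brown", "fox"] if "the" is a stopword
sparse = pre.text_to_sparse(raw)  # indices of in-vocabulary tokens
dense = pre.text_to_dense(raw)    # fixed-length binary vector
```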
Performance Considerations
Memory Usage
Sparse representation for memory efficiency
Set-based stopwords for fast lookup
No external NLP libraries
Minimal data structure overhead
Processing Speed
Single-pass text cleaning
Simple tokenization strategy
O(1) stopword lookup
Efficient vocabulary building using Counter
Trade-offs
No stemming/lemmatization (speed vs accuracy)
Basic tokenization (simplicity vs precision)
Fixed vocabulary size (memory vs coverage)
Examples
Basic Usage
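A sketch of typical end-to-end use; the texts and outputs are illustrative:

```python
from preprocessor import LightPreprocessor  # import path assumed

pre = LightPreprocessor(max_vocab_size=500)
pre.build_vocabulary([
    "win free money now",
    "project meeting moved to friday",
])

features = pre.text_to_dense("free money")  # binary features, e.g. for a perceptron
print(pre.get_vocab_size(), features)
```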
Custom Stopwords
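A sketch showing domain-specific stopwords added before the vocabulary is built:

```python
from preprocessor import LightPreprocessor  # import path assumed

pre = LightPreprocessor()
pre.add_custom_stopwords(["http", "www", "com"])  # domain-specific noise terms
pre.build_vocabulary(["visit www example com", "http links everywhere"])
print(pre.get_vocabulary())  # the custom stopwords no longer appear
```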
Best Practices
Memory Management
Use sparse representation for large vocabularies
Keep max_vocab_size reasonable for your use case
Clean up large text lists after vocabulary building
Preprocessing Pipeline
Clean texts before building vocabulary
Consider domain-specific stopwords
Monitor vocabulary coverage
Integration Tips
Process texts in batches for efficiency
Cache vectors for frequently used texts
Consider vocabulary persistence for production use
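Since word_to_index is a plain dict, a hypothetical persistence layer can round-trip it through JSON; the helper names below are illustrative, not part of the class:

```python
import json

def save_vocabulary(pre, path: str) -> None:
    # word_to_index maps str -> int, so it serializes to JSON directly.
    with open(path, "w") as f:
        json.dump(pre.word_to_index, f)

def load_vocabulary(pre, path: str) -> None:
    with open(path) as f:
        pre.word_to_index = json.load(f)
```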
Error Handling
Handle empty texts gracefully
Check vocabulary size before processing
Validate input text encoding
Contributing
To contribute improvements to this preprocessor:
Maintain the focus on lightweight processing
Add thorough documentation for new features
Consider backward compatibility
Test with various text types and sizes