src/main.py:
HTML Crawler and Text Clustering Pipeline Documentation
Overview
This implementation provides a complete pipeline for crawling HTML documentation, extracting text content, and clustering it into action and non-action categories. The system is optimized for processing documentation pages and preparing text data for vector representation.
Table of Contents
Components
Installation
Class Reference
Pipeline Flow
Output Format
Error Handling
Usage Examples
Best Practices
Performance Considerations
Troubleshooting
Contributing
License
Components
Core Classes
UnpartyDocCrawler: HTML content extraction
TextClusteringPipeline: Text processing and clustering
Support functions: JSON handling and main execution
Dependencies
# External dependencies
requests==2.31.0
beautifulsoup4==4.12.2
# Standard library
json, time, datetime, uuid, logging, urllib.parse, re
# Local imports
from preprocessor import LightPreprocessor
from perceptron import LightPerceptron
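Installation
Install the two external dependencies with pip, pinned to the versions listed above (for example, pip install requests==2.31.0 beautifulsoup4==4.12.2). The preprocessor and perceptron modules are local files assumed to sit alongside src/main.py, so they need no separate installation; a standard Python 3 environment is assumed.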
Class Reference
UnpartyDocCrawler
Purpose
Handles HTML document crawling and text extraction with minimal processing overhead.
Methods
def __init__(self, base_url: str = "https://docs.unparty.app")
def fetch_content(self, endpoint: str) -> str
def extract_texts(self, html_content: str) -> List[str]
Key Features
Session management for efficient requests
HTML cleaning and parsing
Text segment extraction and filtering
Error logging and handling
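The exact implementation lives in src/main.py; the sketch below only illustrates the documented behavior (session reuse, script/style removal, text segment extraction, error logging). The element selection and the minimum segment length are assumptions.

import logging
from typing import List
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

class UnpartyDocCrawler:
    def __init__(self, base_url: str = "https://docs.unparty.app"):
        self.base_url = base_url
        self.session = requests.Session()  # reused across requests for efficiency

    def fetch_content(self, endpoint: str) -> str:
        try:
            response = self.session.get(urljoin(self.base_url, endpoint), timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.error("Failed to fetch %s: %s", endpoint, exc)
            return ""

    def extract_texts(self, html_content: str) -> List[str]:
        soup = BeautifulSoup(html_content, "html.parser")
        for tag in soup(["script", "style"]):  # drop non-content elements
            tag.decompose()
        # Assumed selection: headings, paragraphs and list items; the length filter is a guess.
        segments = [el.get_text(" ", strip=True)
                    for el in soup.find_all(["h1", "h2", "h3", "p", "li"])]
        return [s for s in segments if len(s) > 3]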
TextClusteringPipeline
Purpose
Processes extracted text into action/non-action clusters, with a focus on preparing it for vector representation.
Methods
def __init__(self, vocab_size: int = 1000)
def create_cluster_item(self, tokens: List[str]) -> Dict[str, Any]
def _extract_tags(self, tokens: List[str]) -> List[str]
def process_texts(self, texts: List[str], process_attributes: Dict[str, Any] = None) -> Dict[str, Any]
Key Features
Tokenization and preprocessing
Action/non-action classification
Tag extraction and metadata generation
Process attribute tracking
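To illustrate the cluster item contract described under Output Format below, a minimal version of create_cluster_item and _extract_tags might look like the sketch that follows (shown as standalone functions for brevity). The keyword-to-tag mapping and the vector_ready rule are assumptions, not the logic from src/main.py.

import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List

# Hypothetical keyword-to-tag mapping; the real _extract_tags criteria live in src/main.py.
TAG_KEYWORDS = {"task": "task", "todo": "task", "deadline": "deadline"}

def _extract_tags(tokens: List[str]) -> List[str]:
    return sorted({TAG_KEYWORDS[t] for t in tokens if t in TAG_KEYWORDS})

def create_cluster_item(tokens: List[str]) -> Dict[str, Any]:
    return {
        "id": str(uuid.uuid4()),
        "tokens": tokens,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tags": _extract_tags(tokens),
        "meta": {
            "vector_ready": bool(tokens),  # assumed: ready once at least one token survives filtering
            "token_count": len(tokens),
        },
    }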
Pipeline Flow
1. HTML Processing
Fetch HTML → Remove Scripts/Styles → Extract Content Elements → Clean Text
2. Text Processing
Clean Text → Tokenize → Generate Sparse Vector → Classify → Create Cluster Item
3. Clustering
Collect Items → Group by Classification → Add Metadata → Generate Output
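A sketch of the grouping stage is shown below; the cluster key names (action, non_action) and the is_action flags are assumptions based on the action/non-action classification described above.

from typing import Any, Dict, List

def group_items(items: List[Dict[str, Any]], is_action: List[bool],
                process_attributes: Dict[str, Any]) -> Dict[str, Any]:
    # Split cluster items by their classification flag and attach run metadata.
    clusters: Dict[str, Any] = {"action": [], "non_action": []}
    for item, flag in zip(items, is_action):
        clusters["action" if flag else "non_action"].append(item)
    clusters["process_attributes"] = process_attributes
    return clusters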
Output Format
Cluster Item Structure
{
  "id": "uuid",
  "tokens": ["token1", "token2"],
  "timestamp": "iso-timestamp",
  "tags": ["task", "deadline"],
  "meta": {
    "vector_ready": true,
    "token_count": 2
  }
}
Process Attributes
{
  "batch_id": "uuid",
  "source": "docs.unparty.app",
  "processing_type": "initial_clustering",
  "pipeline_stage": "text_to_vector",
  "parameters": {
    "min_token_length": 2,
    "max_token_length": 50,
    "ignore_numeric": true
  }
}
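The save_clusters support function mentioned under Components is essentially a JSON dump of this structure; a minimal, hedged version could look like this:

import json
import logging
from typing import Any, Dict

def save_clusters(clusters: Dict[str, Any], path: str) -> None:
    # Write the clustering result to disk as UTF-8 JSON.
    try:
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(clusters, handle, ensure_ascii=False, indent=2)
        logging.info("Saved cluster output to %s", path)
    except (OSError, TypeError) as exc:  # file-writing or serialization failures
        logging.error("Failed to save clusters to %s: %s", path, exc)
        raise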
Error Handling
HTML Processing
Network error handling
Invalid HTML handling
Content extraction fallbacks
Text Processing
Empty text handling
Invalid token handling
Classification error handling
Output Generation
JSON validation
File writing verification
Structure validation
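On the text side, a defensive tokenizer like the sketch below keeps empty segments and invalid tokens from reaching classification. It mirrors the parameters shown in Process Attributes (min_token_length, max_token_length, ignore_numeric), but the token pattern itself is an assumption.

import logging
import re
from typing import List

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")  # assumed tokenization rule

def safe_tokenize(text: str, min_len: int = 2, max_len: int = 50,
                  ignore_numeric: bool = True) -> List[str]:
    # Empty input and invalid tokens yield an empty list instead of raising.
    if not text or not text.strip():
        logging.debug("Skipping empty text segment")
        return []
    tokens = TOKEN_PATTERN.findall(text.lower())
    if ignore_numeric:
        tokens = [t for t in tokens if not t.isdigit()]
    return [t for t in tokens if min_len <= len(t) <= max_len]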
Usage Examples
Basic Usage
# Initialize
crawler = UnpartyDocCrawler()
pipeline = TextClusteringPipeline()
# Process single endpoint
html_content = crawler.fetch_content("/about-unparty")
texts = crawler.extract_texts(html_content)
clusters = pipeline.process_texts(texts)
save_clusters(clusters, "output.json")
Custom Processing
from uuid import uuid4

# Custom process attributes
process_attributes = {
    "batch_id": str(uuid4()),
    "source": "custom_source",
    "parameters": {
        "min_token_length": 3,
        "max_token_length": 30
    }
}
# Process with custom attributes
clusters = pipeline.process_texts(texts, process_attributes)
Best Practices
Crawling
Respect robots.txt
Implement rate limiting
Use appropriate timeouts
Handle network errors gracefully
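A small helper covering these four points might look like the sketch below; the one-second delay, ten-second timeout, and wildcard user agent are assumptions rather than values from src/main.py.

import time
from typing import Iterable, Iterator, Tuple
from urllib import robotparser

import requests

def crawl_endpoints(base_url: str, endpoints: Iterable[str],
                    delay: float = 1.0) -> Iterator[Tuple[str, str]]:
    # Yield (endpoint, html) pairs, honoring robots.txt and pausing between requests.
    robots = robotparser.RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    robots.read()
    session = requests.Session()
    for endpoint in endpoints:
        url = base_url.rstrip("/") + endpoint
        if not robots.can_fetch("*", url):
            continue                       # respect robots.txt
        time.sleep(delay)                  # simple rate limiting
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            yield endpoint, response.text
        except requests.RequestException:
            continue                       # degrade gracefully on network errors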
Processing
Monitor memory usage
Validate input data
Log processing steps
Handle edge cases
Output
Validate JSON structure
Use appropriate file paths
Implement error recovery
Monitor disk space
Performance Considerations
Memory Optimization
Streaming text processing
Efficient data structures
Garbage collection hints
Processing Speed
Session reuse
Batch processing
Efficient text cleaning
Minimal data copying
Scalability
Configurable batch sizes
Memory-efficient processing
Resource monitoring
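A common way to combine configurable batch sizes with memory-efficient processing is a generator that never materializes the full corpus; the batch size of 100 below is an arbitrary default.

from typing import Iterable, Iterator, List

def batched(texts: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    # Yield lists of at most batch_size texts without holding the whole corpus in memory.
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Usage sketch: feed each batch to the pipeline and merge the resulting clusters.
# for chunk in batched(texts, batch_size=100):
#     clusters = pipeline.process_texts(chunk)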
Troubleshooting
Common Issues
Network connectivity
Invalid HTML structure
Memory constraints
File permissions
Solutions
Implement retries
Add validation steps
Use batch processing
Check permissions early
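For transient network failures, a retry wrapper with exponential backoff is usually enough; the retry count and backoff base below are assumptions.

import time

import requests

def fetch_with_retries(session: requests.Session, url: str,
                       retries: int = 3, backoff: float = 2.0) -> str:
    # Retry a GET with exponential backoff; re-raises after the last attempt fails.
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff ** attempt)  # waits 1s, then 2s, with the defaults
    return ""  # unreachable; keeps static checkers satisfied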
Contributing
Guidelines for contributions:
Follow existing code structure
Add comprehensive tests
Document changes
Follow HTML parsing best practices
Document element selection criteria
Test with varied HTML structures
Handle edge cases gracefully
License
[Your License Information Here]