src/main.py:

HTML Crawler and Text Clustering Pipeline Documentation

Overview

This implementation provides a complete pipeline for crawling HTML documentation, extracting text content, and clustering it into action and non-action categories. The system is optimized for processing documentation pages and preparing text data for vector representation.

Table of Contents

  • Components

  • Class Reference

  • Pipeline Flow

  • Output Format

  • Error Handling

  • Usage Examples

  • Best Practices

  • Performance Considerations

  • Troubleshooting

  • Contributing

  • License

Components

Core Classes

  1. UnpartyDocCrawler: HTML content extraction

  2. TextClusteringPipeline: Text processing and clustering

  3. Support functions: JSON handling and main execution

Dependencies

# External dependencies
requests==2.31.0
beautifulsoup4==4.12.2

# Standard library
json, time, datetime, uuid, logging, urllib.parse, re

# Local imports
from preprocessor import LightPreprocessor
from perceptron import LightPerceptron

Class Reference

UnpartyDocCrawler

Purpose

Handles HTML document crawling and text extraction with minimal processing overhead.

Methods

def __init__(self, base_url: str = "https://docs.unparty.app")
def fetch_content(self, endpoint: str) -> str
def extract_texts(self, html_content: str) -> List[str]

Key Features

  • Session management for efficient requests

  • HTML cleaning and parsing

  • Text segment extraction and filtering

  • Error logging and handling
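
A minimal sketch of how these methods might look, assuming the signatures above; the element selection, timeout value, and parser choice are illustrative assumptions rather than a description of the exact implementation:

import logging
from typing import List

import requests
from bs4 import BeautifulSoup


class UnpartyDocCrawler:
    def __init__(self, base_url: str = "https://docs.unparty.app"):
        self.base_url = base_url
        self.session = requests.Session()  # reuse one connection pool across requests

    def fetch_content(self, endpoint: str) -> str:
        # Fetch a documentation page; return an empty string on network errors
        try:
            response = self.session.get(self.base_url + endpoint, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.error("Failed to fetch %s: %s", endpoint, exc)
            return ""

    def extract_texts(self, html_content: str) -> List[str]:
        # Remove scripts/styles, then collect cleaned text from content elements
        soup = BeautifulSoup(html_content, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        texts = []
        for element in soup.find_all(["h1", "h2", "h3", "p", "li"]):
            text = element.get_text(separator=" ", strip=True)
            if text:
                texts.append(text)
        return texts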

TextClusteringPipeline

Purpose

Processes extracted text into action and non-action clusters, with a focus on preparing tokens for vector representation.

Methods

def __init__(self, vocab_size: int = 1000)
def create_cluster_item(self, tokens: List[str]) -> Dict[str, Any]
def _extract_tags(self, tokens: List[str]) -> List[str]
def process_texts(self, texts: List[str], process_attributes: Optional[Dict[str, Any]] = None) -> Dict[str, Any]

Key Features

  • Tokenization and preprocessing

  • Action/non-action classification

  • Tag extraction and metadata generation

  • Process attribute tracking
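
A hedged sketch of create_cluster_item and _extract_tags, showing how the cluster item structure documented under Output Format could be produced; the keyword-to-tag mapping is an illustrative assumption:

from datetime import datetime, timezone
from typing import Any, Dict, List
from uuid import uuid4


class TextClusteringPipeline:
    # Illustrative tag criteria; the real selection rules may differ
    TAG_KEYWORDS = {"task": "task", "deadline": "deadline", "create": "action"}

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size

    def _extract_tags(self, tokens: List[str]) -> List[str]:
        # Collect the tags whose keyword appears among the tokens
        return sorted({self.TAG_KEYWORDS[t] for t in tokens if t in self.TAG_KEYWORDS})

    def create_cluster_item(self, tokens: List[str]) -> Dict[str, Any]:
        # Build one cluster item matching the documented output structure
        return {
            "id": str(uuid4()),
            "tokens": tokens,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tags": self._extract_tags(tokens),
            "meta": {
                "vector_ready": bool(tokens),
                "token_count": len(tokens),
            },
        }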

Pipeline Flow

1. HTML Processing

Fetch HTML → Remove Scripts/Styles → Extract Content Elements → Clean Text

2. Text Processing

Clean Text → Tokenize → Generate Sparse Vector → Classify → Create Cluster Item

3. Clustering

Collect Items → Group by Classification → Add Metadata → Generate Output
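
Continuing the class sketch above, process_texts could tie the text-processing and clustering stages together roughly as follows. The tokenization rule, the classification heuristic, and the "action"/"non_action" grouping keys are placeholders; the actual pipeline delegates tokenization and classification to LightPreprocessor and LightPerceptron, whose interfaces are not documented here:

import re
from typing import Any, Dict, List, Optional


def process_texts(self, texts: List[str],
                  process_attributes: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    clusters: Dict[str, Any] = {
        "action": [],
        "non_action": [],
        "process_attributes": process_attributes or {},
    }
    for text in texts:
        # Tokenize and apply the documented length filters (2-50 characters)
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if 2 <= len(t) <= 50]
        if not tokens:
            continue  # empty-text handling
        item = self.create_cluster_item(tokens)
        # Placeholder rule standing in for the perceptron-based classifier
        is_action = any(tag in ("task", "action", "deadline") for tag in item["tags"])
        clusters["action" if is_action else "non_action"].append(item)
    return clusters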

Output Format

Cluster Item Structure

{
    "id": "uuid",
    "tokens": ["token1", "token2"],
    "timestamp": "iso-timestamp",
    "tags": ["task", "deadline"],
    "meta": {
        "vector_ready": true,
        "token_count": 2
    }
}

Process Attributes

{
    "batch_id": "uuid",
    "source": "docs.unparty.app",
    "processing_type": "initial_clustering",
    "pipeline_stage": "text_to_vector",
    "parameters": {
        "min_token_length": 2,
        "max_token_length": 50,
        "ignore_numeric": true
    }
}
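
One way the pipeline might merge caller-supplied attributes over these defaults (a sketch; the helper name build_process_attributes is hypothetical, and nested dictionaries such as "parameters" are replaced rather than deep-merged):

from uuid import uuid4

DEFAULT_PROCESS_ATTRIBUTES = {
    "source": "docs.unparty.app",
    "processing_type": "initial_clustering",
    "pipeline_stage": "text_to_vector",
    "parameters": {
        "min_token_length": 2,
        "max_token_length": 50,
        "ignore_numeric": True,
    },
}


def build_process_attributes(overrides=None):
    # Start from the defaults, stamp a fresh batch_id, then apply any overrides
    attributes = {**DEFAULT_PROCESS_ATTRIBUTES, "batch_id": str(uuid4())}
    if overrides:
        attributes.update(overrides)
    return attributes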

Error Handling

HTML Processing

  • Network error handling

  • Invalid HTML handling

  • Content extraction fallbacks

Text Processing

  • Empty text handling

  • Invalid token handling

  • Classification error handling

Output Generation

  • JSON validation

  • File writing verification

  • Structure validation
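
A sketch of how output generation might guard against these failure modes; this reimplements the save_clusters support function used in the usage examples under stated assumptions rather than documenting its exact behavior:

import json
import logging


def save_clusters(clusters, output_path):
    # JSON validation: fail before touching the file if the structure cannot be serialized
    try:
        payload = json.dumps(clusters, indent=2)
    except (TypeError, ValueError) as exc:
        logging.error("Clusters are not JSON-serializable: %s", exc)
        raise

    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(payload)

    # File writing verification: the written file must parse back to the same structure
    with open(output_path, "r", encoding="utf-8") as handle:
        if json.load(handle) != clusters:
            raise IOError(f"Verification failed for {output_path}")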

Usage Examples

Basic Usage

# Initialize
crawler = UnpartyDocCrawler()
pipeline = TextClusteringPipeline()

# Process single endpoint
html_content = crawler.fetch_content("/about-unparty")
texts = crawler.extract_texts(html_content)
clusters = pipeline.process_texts(texts)
save_clusters(clusters, "output.json")

Custom Processing

from uuid import uuid4

# Custom process attributes
process_attributes = {
    "batch_id": str(uuid4()),
    "source": "custom_source",
    "parameters": {
        "min_token_length": 3,
        "max_token_length": 30
    }
}

# Process with custom attributes
clusters = pipeline.process_texts(texts, process_attributes)

Best Practices

Crawling

  1. Respect robots.txt

  2. Implement rate limiting

  3. Use appropriate timeouts

  4. Handle network errors gracefully
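
A minimal sketch of how these guidelines could be applied, assuming a fixed one-second delay and a 10-second timeout; the helper names are hypothetical, and retry handling is covered separately under Troubleshooting:

import logging
import time
from urllib import robotparser

import requests

REQUEST_DELAY_SECONDS = 1.0  # illustrative rate limit: at most one request per second


def allowed_by_robots(base_url: str, path: str) -> bool:
    # Consult robots.txt before crawling a path
    parser = robotparser.RobotFileParser(base_url + "/robots.txt")
    parser.read()
    return parser.can_fetch("*", base_url + path)


def polite_fetch(session: requests.Session, url: str) -> str:
    # Pause between requests, bound each request with a timeout,
    # and degrade gracefully on network errors
    time.sleep(REQUEST_DELAY_SECONDS)
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logging.error("Request to %s failed: %s", url, exc)
        return ""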

Processing

  1. Monitor memory usage

  2. Validate input data

  3. Log processing steps

  4. Handle edge cases

Output

  1. Validate JSON structure

  2. Use appropriate file paths

  3. Implement error recovery

  4. Monitor disk space

Performance Considerations

Memory Optimization

  • Streaming text processing

  • Efficient data structures

  • Garbage collection hints

Processing Speed

  • Session reuse

  • Batch processing

  • Efficient text cleaning

  • Minimal data copying

Scalability

  • Configurable batch sizes

  • Memory-efficient processing

  • Resource monitoring
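
A sketch of configurable batch sizes for memory-efficient processing; the helper name and default batch size are illustrative:

from typing import Iterable, Iterator, List


def iter_batches(texts: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    # Yield fixed-size batches so only one batch of texts is held in memory at a time
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Usage: process each batch independently instead of all texts at once
# for batch in iter_batches(texts, batch_size=200):
#     clusters = pipeline.process_texts(batch)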

Troubleshooting

Common Issues

  1. Network connectivity

  2. Invalid HTML structure

  3. Memory constraints

  4. File permissions

Solutions

  1. Implement retries

  2. Add validation steps

  3. Use batch processing

  4. Check permissions early
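
Retries (point 1) might be implemented with exponential backoff, for example; the helper name and parameters are illustrative:

import time

import requests


def fetch_with_retries(session: requests.Session, url: str,
                       max_attempts: int = 3, backoff_seconds: float = 1.0) -> str:
    # Retry transient network failures with exponentially growing delays
    for attempt in range(1, max_attempts + 1):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
    return ""  # not reached; keeps the return type explicit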

Contributing

Guidelines for contributions:

  1. Follow the existing code structure and HTML parsing best practices

  2. Add comprehensive tests, including varied HTML structures

  3. Document changes and element selection criteria

  4. Handle edge cases gracefully

License

[Your License Information Here]

Last updated