src/main.py:

HTML Crawler and Text Clustering Pipeline Documentation

Overview

This implementation provides a complete pipeline for crawling HTML documentation, extracting text content, and clustering it into action and non-action categories. The system is optimized for processing documentation pages and preparing text data for vector representation.

Table of Contents

  • Components

  • Class Reference

  • Pipeline Flow

  • Output Format

  • Error Handling

  • Usage Examples

  • Best Practices

  • Performance Considerations

  • Troubleshooting

  • Contributing

  • License

Components

Core Classes

  1. UnpartyDocCrawler: HTML content extraction

  2. TextClusteringPipeline: Text processing and clustering

  3. Support functions: JSON handling and main execution

Dependencies

# External dependencies
requests==2.31.0
beautifulsoup4==4.12.2

# Standard library
json, time, datetime, uuid, logging, urllib.parse, re

# Local imports
from preprocessor import LightPreprocessor
from perceptron import LightPerceptron

Class Reference

UnpartyDocCrawler

Purpose

Handles HTML document crawling and text extraction with minimal processing overhead.

Methods

def __init__(self, base_url: str = "https://docs.unparty.app")
def fetch_content(self, endpoint: str) -> str
def extract_texts(self, html_content: str) -> List[str]

Key Features

  • Session management for efficient requests

  • HTML cleaning and parsing

  • Text segment extraction and filtering

  • Error logging and handling
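
A minimal sketch of how these methods might look, assuming the signatures above; the element selection, timeout value, and parser choice are illustrative assumptions rather than a description of the exact implementation:

import logging
from typing import List

import requests
from bs4 import BeautifulSoup


class UnpartyDocCrawler:
    def __init__(self, base_url: str = "https://docs.unparty.app"):
        self.base_url = base_url
        self.session = requests.Session()  # reuse one connection pool across requests

    def fetch_content(self, endpoint: str) -> str:
        # Fetch a documentation page; return an empty string on network errors
        try:
            response = self.session.get(self.base_url + endpoint, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            logging.error("Failed to fetch %s: %s", endpoint, exc)
            return ""

    def extract_texts(self, html_content: str) -> List[str]:
        # Remove scripts/styles, then collect cleaned text from content elements
        soup = BeautifulSoup(html_content, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        texts = []
        for element in soup.find_all(["h1", "h2", "h3", "p", "li"]):
            text = element.get_text(separator=" ", strip=True)
            if text:
                texts.append(text)
        return texts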

TextClusteringPipeline

Purpose

Processes extracted text into action and non-action clusters, with a focus on preparing tokens for vector representation.

Methods

def __init__(self, vocab_size: int = 1000)
def create_cluster_item(self, tokens: List[str]) -> Dict[str, Any]
def _extract_tags(self, tokens: List[str]) -> List[str]
def process_texts(self, texts: List[str], process_attributes: Optional[Dict[str, Any]] = None) -> Dict[str, Any]

Key Features

  • Tokenization and preprocessing

  • Action/non-action classification

  • Tag extraction and metadata generation

  • Process attribute tracking
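
A hedged sketch of create_cluster_item and _extract_tags, showing how the cluster item structure documented under Output Format could be produced; the keyword-to-tag mapping is an illustrative assumption:

from datetime import datetime, timezone
from typing import Any, Dict, List
from uuid import uuid4


class TextClusteringPipeline:
    # Illustrative tag criteria; the real selection rules may differ
    TAG_KEYWORDS = {"task": "task", "deadline": "deadline", "create": "action"}

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size

    def _extract_tags(self, tokens: List[str]) -> List[str]:
        # Collect the tags whose keyword appears among the tokens
        return sorted({self.TAG_KEYWORDS[t] for t in tokens if t in self.TAG_KEYWORDS})

    def create_cluster_item(self, tokens: List[str]) -> Dict[str, Any]:
        # Build one cluster item matching the documented output structure
        return {
            "id": str(uuid4()),
            "tokens": tokens,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tags": self._extract_tags(tokens),
            "meta": {
                "vector_ready": bool(tokens),
                "token_count": len(tokens),
            },
        }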

Pipeline Flow

1. HTML Processing

Fetch HTML → Remove Scripts/Styles → Extract Content Elements → Clean Text

2. Text Processing

Clean Text → Tokenize → Generate Sparse Vector → Classify → Create Cluster Item

3. Clustering

Collect Items → Group by Classification → Add Metadata → Generate Output
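
Continuing the class sketch above, process_texts could tie the text-processing and clustering stages together roughly as follows. The tokenization rule, the classification heuristic, and the "action"/"non_action" grouping keys are placeholders; the actual pipeline delegates tokenization and classification to LightPreprocessor and LightPerceptron, whose interfaces are not documented here:

import re
from typing import Any, Dict, List, Optional


def process_texts(self, texts: List[str],
                  process_attributes: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    clusters: Dict[str, Any] = {
        "action": [],
        "non_action": [],
        "process_attributes": process_attributes or {},
    }
    for text in texts:
        # Tokenize and apply the documented length filters (2-50 characters)
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if 2 <= len(t) <= 50]
        if not tokens:
            continue  # empty-text handling
        item = self.create_cluster_item(tokens)
        # Placeholder rule standing in for the perceptron-based classifier
        is_action = any(tag in ("task", "action", "deadline") for tag in item["tags"])
        clusters["action" if is_action else "non_action"].append(item)
    return clusters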

Output Format

Cluster Item Structure

{
    "id": "uuid",
    "tokens": ["token1", "token2"],
    "timestamp": "iso-timestamp",
    "tags": ["task", "deadline"],
    "meta": {
        "vector_ready": true,
        "token_count": 2
    }
}

Process Attributes

{
    "batch_id": "uuid",
    "source": "docs.unparty.app",
    "processing_type": "initial_clustering",
    "pipeline_stage": "text_to_vector",
    "parameters": {
        "min_token_length": 2,
        "max_token_length": 50,
        "ignore_numeric": true
    }
}
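
One way the pipeline might merge caller-supplied attributes over these defaults (a sketch; the helper name build_process_attributes is hypothetical, and nested dictionaries such as "parameters" are replaced rather than deep-merged):

from uuid import uuid4

DEFAULT_PROCESS_ATTRIBUTES = {
    "source": "docs.unparty.app",
    "processing_type": "initial_clustering",
    "pipeline_stage": "text_to_vector",
    "parameters": {
        "min_token_length": 2,
        "max_token_length": 50,
        "ignore_numeric": True,
    },
}


def build_process_attributes(overrides=None):
    # Start from the defaults, stamp a fresh batch_id, then apply any overrides
    attributes = {**DEFAULT_PROCESS_ATTRIBUTES, "batch_id": str(uuid4())}
    if overrides:
        attributes.update(overrides)
    return attributes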

Error Handling

HTML Processing

  • Network error handling

  • Invalid HTML handling

  • Content extraction fallbacks

Text Processing

  • Empty text handling

  • Invalid token handling

  • Classification error handling

Output Generation

  • JSON validation

  • File writing verification

  • Structure validation
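
A sketch of how output generation might guard against these failure modes; this reimplements the save_clusters support function used in the usage examples under stated assumptions rather than documenting its exact behavior:

import json
import logging


def save_clusters(clusters, output_path):
    # JSON validation: fail before touching the file if the structure cannot be serialized
    try:
        payload = json.dumps(clusters, indent=2)
    except (TypeError, ValueError) as exc:
        logging.error("Clusters are not JSON-serializable: %s", exc)
        raise

    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(payload)

    # File writing verification: the written file must parse back to the same structure
    with open(output_path, "r", encoding="utf-8") as handle:
        if json.load(handle) != clusters:
            raise IOError(f"Verification failed for {output_path}")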

Usage Examples

Basic Usage

# Initialize
crawler = UnpartyDocCrawler()
pipeline = TextClusteringPipeline()

# Process single endpoint
html_content = crawler.fetch_content("/about-unparty")
texts = crawler.extract_texts(html_content)
clusters = pipeline.process_texts(texts)
save_clusters(clusters, "output.json")

Custom Processing

from uuid import uuid4

# Custom process attributes
process_attributes = {
    "batch_id": str(uuid4()),
    "source": "custom_source",
    "parameters": {
        "min_token_length": 3,
        "max_token_length": 30
    }
}

# Process with custom attributes
clusters = pipeline.process_texts(texts, process_attributes)

Best Practices

Crawling

  1. Respect robots.txt

  2. Implement rate limiting

  3. Use appropriate timeouts

  4. Handle network errors gracefully
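
A minimal sketch of how these guidelines could be applied, assuming a fixed one-second delay and a 10-second timeout; the helper names are hypothetical, and retry handling is covered separately under Troubleshooting:

import logging
import time
from urllib import robotparser

import requests

REQUEST_DELAY_SECONDS = 1.0  # illustrative rate limit: at most one request per second


def allowed_by_robots(base_url: str, path: str) -> bool:
    # Consult robots.txt before crawling a path
    parser = robotparser.RobotFileParser(base_url + "/robots.txt")
    parser.read()
    return parser.can_fetch("*", base_url + path)


def polite_fetch(session: requests.Session, url: str) -> str:
    # Pause between requests, bound each request with a timeout,
    # and degrade gracefully on network errors
    time.sleep(REQUEST_DELAY_SECONDS)
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logging.error("Request to %s failed: %s", url, exc)
        return ""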

Processing

  1. Monitor memory usage

  2. Validate input data

  3. Log processing steps

  4. Handle edge cases

Output

  1. Validate JSON structure

  2. Use appropriate file paths

  3. Implement error recovery

  4. Monitor disk space

Performance Considerations

Memory Optimization

  • Streaming text processing

  • Efficient data structures

  • Garbage collection hints

Processing Speed

  • Session reuse

  • Batch processing

  • Efficient text cleaning

  • Minimal data copying

Scalability

  • Configurable batch sizes

  • Memory-efficient processing

  • Resource monitoring
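
A sketch of configurable batch sizes for memory-efficient processing; the helper name and default batch size are illustrative:

from typing import Iterable, Iterator, List


def iter_batches(texts: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    # Yield fixed-size batches so only one batch of texts is held in memory at a time
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Usage: process each batch independently instead of all texts at once
# for batch in iter_batches(texts, batch_size=200):
#     clusters = pipeline.process_texts(batch)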

Troubleshooting

Common Issues

  1. Network connectivity

  2. Invalid HTML structure

  3. Memory constraints

  4. File permissions

Solutions

  1. Implement retries

  2. Add validation steps

  3. Use batch processing

  4. Check permissions early
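
Retries (point 1) might be implemented with exponential backoff, for example; the helper name and parameters are illustrative:

import time

import requests


def fetch_with_retries(session: requests.Session, url: str,
                       max_attempts: int = 3, backoff_seconds: float = 1.0) -> str:
    # Retry transient network failures with exponentially growing delays
    for attempt in range(1, max_attempts + 1):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
    return ""  # not reached; keeps the return type explicit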

Contributing

Guidelines for contributions:

  1. Follow the existing code structure and HTML parsing best practices

  2. Add comprehensive tests, including varied HTML structures

  3. Document changes and element selection criteria

  4. Handle edge cases gracefully

License

[Your License Information Here]

Last updated