action2.py

Action Pattern Processor Documentation

Overview

The ActionPatternProcessor implements pattern discovery in tokenized data through co-occurrence analysis and custom scoring. It applies a ReLU activation and min-max score normalization to rank candidate patterns, and it maintains no state between runs.

Table of Contents

  • Installation

  • Core Components

  • Pattern Discovery

  • Scoring System

  • Output Format

  • Usage Examples

  • Performance Considerations

  • Contributing

Installation

Dependencies

import numpy as np
from typing import List, Dict, Any, Tuple, Set
from collections import Counter, defaultdict
import logging
from datetime import datetime
import math

Core Components

ActionPatternProcessor Class

Constructor

def __init__(self, window_size: int = 3, min_threshold: float = 0.3):
    """
    Initialize pattern processor with configurable parameters.
    
    Args:
        window_size: Size of sliding window for substring patterns
        min_threshold: Minimum score for pattern consideration
    """

Core Methods

process_tokens(tokens: List[str], cluster_type: str) -> Dict[str, Any]

Process a token list for pattern discovery; a minimal implementation sketch follows the bullets below.

  • Input: List of tokens and their cluster type

  • Output: Dictionary containing patterns and metadata

  • Performance: O(n * w), where n is the number of tokens and w is the window size
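
A minimal sketch of how process_tokens could assemble the documented output; the timing and per-pattern frequency logic shown here are assumptions, chosen to match the output format described below:

def process_tokens(self, tokens: List[str], cluster_type: str) -> Dict[str, Any]:
    start = datetime.utcnow()
    scores = self.calculate_pattern_scores(tokens, cluster_type)
    # count how many tokens contain each pattern (one plausible reading
    # of the per-pattern "frequency" field)
    counts = Counter(
        pattern
        for token in tokens
        for pattern in self.find_substring_patterns(token)
    )
    patterns = {
        p: {"score": s, "frequency": counts[p], "cluster_type": cluster_type}
        for p, s in scores.items()
        if s >= self.min_threshold
    }
    return {
        "patterns": patterns,
        "metadata": {
            "total_tokens": len(tokens),
            "unique_patterns": len(patterns),
            "processing_time": (datetime.utcnow() - start).total_seconds(),
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "cluster_type": cluster_type,
        },
    }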

batch_process(token_groups: Dict[str, List[str]]) -> Dict[str, Any]

Process multiple token groups in a single call; a sketch follows the bullets below.

  • Input: Dictionary mapping cluster types to token lists

  • Output: Comprehensive analysis across clusters

  • Performance: O(m * n * w), where m is the number of clusters
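
A sketch of batch_process built on top of process_tokens; the pattern_distribution pivot is an assumption matched to the batch result format shown later:

def batch_process(self, token_groups: Dict[str, List[str]]) -> Dict[str, Any]:
    cluster_results = {
        cluster: self.process_tokens(tokens, cluster)
        for cluster, tokens in token_groups.items()
    }
    # pivot per-cluster scores into a pattern -> {cluster: score} view
    pattern_distribution: Dict[str, Dict[str, float]] = defaultdict(dict)
    for cluster, result in cluster_results.items():
        for pattern, info in result["patterns"].items():
            pattern_distribution[pattern][cluster] = info["score"]
    return {
        "cluster_results": cluster_results,
        "pattern_distribution": dict(pattern_distribution),
        "summary": {
            "total_patterns": len(pattern_distribution),
            "clusters_analyzed": len(token_groups),
            "timestamp": datetime.utcnow().isoformat() + "Z",
        },
    }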

Pattern Discovery

Sliding Window Implementation

def find_substring_patterns(self, token: str) -> Set[str]:
    """Find meaningful substrings using a sliding window."""
    patterns = set()
    if len(token) < self.window_size:
        # tokens shorter than the window are kept whole
        return {token}
        
    for i in range(len(token) - self.window_size + 1):
        substring = token[i:i + self.window_size]
        if substring.isalnum():
            patterns.add(substring)
    return patterns
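
For example, with the default window size of 3, "deploy" yields its four three-character substrings:

processor = ActionPatternProcessor(window_size=3)
processor.find_substring_patterns("deploy")
# {'dep', 'epl', 'plo', 'loy'}  (a set, so ordering is arbitrary)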

Frequency Calculation

def calculate_token_frequency(self, tokens: List[str]) -> Dict[str, float]:
    """Calculate logarithmic frequency scores for tokens."""
    counts = Counter(tokens)
    return {token: math.log1p(count) for token, count in counts.items()}
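
For example, repeated tokens grow sub-linearly under log1p:

processor.calculate_token_frequency(["deploy", "deploy", "update"])
# {'deploy': 1.0986..., 'update': 0.6931...}  (log1p(2) and log1p(1))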

Scoring System

Score Normalization

def normalize_scores(self, scores: np.ndarray) -> np.ndarray:
    """
    Apply min-max normalization with edge case handling.
    
    Edge Cases:
    - Empty array: Returns empty array
    - All same values: Returns array of ones
    """

Pattern Scoring

def calculate_pattern_scores(self, tokens: List[str], cluster_type: str) -> Dict[str, float]:
    """
    Calculate pattern scores based on:
    - Token frequency (logarithmic scale)
    - Pattern co-occurrence
    - Cluster type weighting
    """

Output Format

Pattern Analysis Result

{
    "patterns": {
        "pattern1": {
            "score": 0.85,
            "frequency": 5,
            "cluster_type": "action"
        }
    },
    "metadata": {
        "total_tokens": 100,
        "unique_patterns": 25,
        "processing_time": 0.023,
        "timestamp": "2024-12-05T10:30:00Z",
        "cluster_type": "action"
    }
}

Batch Processing Result

{
    "cluster_results": {
        "action": {...},
        "non_action": {...}
    },
    "pattern_distribution": {
        "pattern1": {
            "action": 0.85,
            "non_action": 0.2
        }
    },
    "summary": {
        "total_patterns": 50,
        "clusters_analyzed": 2,
        "timestamp": "2024-12-05T10:30:00Z"
    }
}

Usage Examples

Basic Usage

processor = ActionPatternProcessor(window_size=3)

# Process single token group
tokens = ["deploy", "update", "configure"]
result = processor.process_tokens(tokens, "action")

# Process multiple groups
token_groups = {
    "action": ["deploy", "update", "configure"],
    "non_action": ["status", "system", "report"]
}
results = processor.batch_process(token_groups)

Custom Configuration

# Adjust window size and threshold
processor = ActionPatternProcessor(
    window_size=4,
    min_threshold=0.4
)

Performance Considerations

Memory Usage

  • Pattern storage: O(n * w), where n is the number of tokens and w the window size

  • Score matrices: O(p), where p is the number of patterns

  • Temporary calculations: O(n)

Processing Speed

  • Sliding window operations: O(n * w)

  • Score normalization: O(p)

  • Batch processing: O(m * n * w), where m is the number of clusters

Optimization Tips

  1. Adjust window_size based on token characteristics

  2. Use appropriate min_threshold to filter noise

  3. Process tokens in batches for efficiency

Contributing

Guidelines for extending functionality:

  1. Maintain stateless processing

  2. Document edge cases

  3. Follow existing naming conventions

  4. Add unit tests
