Error Handling

Stabilize provides a structured error hierarchy with semantic error codes, chain traversal, and automatic classification.

Error Hierarchy

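All Stabilize exceptions belong to a two-tier hierarchy: StabilizeBaseException at the root, with StabilizeError beneath it as the base class for errors that are caught and handled: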

StabilizeBaseException          # Root - bypasses default handlers
└── StabilizeError              # Standard errors - caught and handled
    ├── TransientError          # Retryable (network, timeout, 5xx)
    │   └── ConcurrencyError    # Optimistic lock failures
    ├── PermanentError          # Non-retryable (auth, validation, 4xx)
    ├── RecoveryError           # Crash recovery failed
    ├── ConfigurationError      # Invalid configuration
    ├── TaskError               # Task execution failed
    │   ├── TaskTimeoutError    # Task exceeded timeout
    │   └── TaskNotFoundError   # Task implementation not found
    ├── WorkflowError           # Workflow-level issues
    │   └── WorkflowNotFoundError
    ├── StageError              # Stage-level issues
    ├── QueueError              # Queue operations failed
    │   └── DeadLetterError     # Message moved to DLQ
    └── VerificationError       # Stage verification failed
        └── TransientVerificationError  # Retry verification
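
To see what the tree means for except clauses, the sketch below rebuilds a small slice of it with stand-in classes. Nothing here is imported from Stabilize; the class bodies and the describe helper are hypothetical and exist only to show that a handler for a parent class also receives its children.

```python
# Stand-in classes mirroring part of the tree above (not the real stabilize.errors).
class StabilizeError(Exception):
    pass


class TransientError(StabilizeError):
    pass


class ConcurrencyError(TransientError):
    pass


class TaskError(StabilizeError):
    pass


class TaskTimeoutError(TaskError):
    pass


def describe(exc: Exception) -> str:
    """Return the name of the first handler that matches, most specific first."""
    try:
        raise exc
    except TransientError:
        return "transient"
    except TaskError:
        return "task"
    except StabilizeError:
        return "stabilize"
    except Exception:
        return "other"


print(describe(ConcurrencyError("lock lost")))   # transient
print(describe(TaskTimeoutError("too slow")))    # task
print(describe(ValueError("bad input")))         # other
```

Going by the tree as drawn, TransientVerificationError sits under VerificationError rather than TransientError, so an except TransientError clause alone would not be expected to catch it.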

Transient vs Permanent Errors

The distinction between transient and permanent errors determines retry behavior:

Transient Errors (will be retried):

from stabilize.errors import TransientError

# Network timeout - retry with backoff
raise TransientError(
    "Connection timed out",
    retry_after=30,  # Hint: wait 30 seconds
    context_update={"processed_items": 50}  # Preserve progress
)

Permanent Errors (moved to Dead Letter Queue):

from stabilize.errors import PermanentError

# Validation failure - don't retry
raise PermanentError(
    "Invalid input format",
    code=400
)

Check error type programmatically:

from stabilize.errors import is_transient, is_permanent

try:
    result = task.execute(stage)
except Exception as e:
    if is_transient(e):
        # Schedule retry with backoff
        pass
    elif is_permanent(e):
        # Move to DLQ, alert operator
        pass
    else:
        # Neither transient nor permanent: re-raise instead of swallowing it
        raise
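
The "schedule retry with backoff" branch can be sketched as a small loop. This is a minimal illustration, not Stabilize's scheduler: the TransientError class below is a stand-in that models only the retry_after hint, and run_with_retries, max_attempts, and base_delay are hypothetical names.

```python
import time


class TransientError(Exception):
    """Stand-in for stabilize.errors.TransientError (illustration only)."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def run_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn until it succeeds or max_attempts transient failures occur."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            if exc.retry_after is not None:
                delay = exc.retry_after  # honor the error's own hint
            else:
                delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            sleep(delay)


calls = {"n": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("Connection timed out", retry_after=0.01)
    return "ok"


delays = []
print(run_with_retries(flaky, sleep=delays.append))  # ok
print(delays)                                        # [0.01, 0.01]
```

Passing sleep as a parameter keeps the loop testable without real waiting. Any exception that is not transient propagates immediately, which matches the retry/no-retry split described above.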

Error Codes

Stabilize uses semantic ErrorCode values for programmatic error handling:

from stabilize.error_codes import ErrorCode

# Available error codes
ErrorCode.UNKNOWN                   # Unclassified error
ErrorCode.RESOURCE_EXHAUSTED        # Memory, disk, quota exceeded
ErrorCode.TASK_TIMEOUT              # Task exceeded timeout
ErrorCode.TASK_NOT_FOUND            # Task implementation missing
ErrorCode.UPSTREAM_DEPENDENCY_FAILED  # Upstream stage failed
ErrorCode.CONFIGURATION_INVALID     # Bad configuration
ErrorCode.CONCURRENCY_CONFLICT      # Optimistic lock failure
ErrorCode.AUTHENTICATION_FAILED     # Auth credentials invalid
ErrorCode.VALIDATION_FAILED         # Input validation failed
ErrorCode.NETWORK_ERROR             # Network/connection issues
ErrorCode.CIRCUIT_OPEN              # Circuit breaker tripped
ErrorCode.BULKHEAD_FULL             # Resource pool exhausted
ErrorCode.VERIFICATION_FAILED       # Stage verification failed
ErrorCode.USER_CODE_ERROR           # User task raised exception
ErrorCode.SYSTEM_ERROR              # Internal system error
ErrorCode.RECOVERY_FAILED           # Crash recovery failed

Each exception carries its error code:

from stabilize.errors import TaskTimeoutError

try:
    execute_task(stage)
except TaskTimeoutError as e:
    print(e.error_code)  # ErrorCode.TASK_TIMEOUT
    print(e.error_code.value)  # "TASK_TIMEOUT"

Override the default error code:

from stabilize.errors import TransientError
from stabilize.error_codes import ErrorCode

raise TransientError(
    "Rate limited by API",
    error_code=ErrorCode.RESOURCE_EXHAUSTED
)
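
One common way to get "a default code per class, overridable per instance" is a class attribute plus an optional keyword. The sketch below uses stand-in names (Code, CodedError, NetworkFailure, default_code) and is an assumption about the pattern, not a copy of Stabilize's classes.

```python
from enum import Enum


class Code(Enum):
    """Stand-in for ErrorCode; only the members used below."""
    UNKNOWN = "UNKNOWN"
    NETWORK_ERROR = "NETWORK_ERROR"
    RESOURCE_EXHAUSTED = "RESOURCE_EXHAUSTED"


class CodedError(Exception):
    default_code = Code.UNKNOWN  # subclasses override this attribute

    def __init__(self, message, error_code=None):
        super().__init__(message)
        # An explicit keyword wins over the class-level default.
        self.error_code = error_code if error_code is not None else self.default_code


class NetworkFailure(CodedError):
    default_code = Code.NETWORK_ERROR


default = NetworkFailure("socket closed")
override = NetworkFailure("rate limited", error_code=Code.RESOURCE_EXHAUSTED)
print(default.error_code.value)   # NETWORK_ERROR
print(override.error_code.value)  # RESOURCE_EXHAUSTED
```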

Error Chain Traversal

When errors are wrapped (e.g., by bulkheads or retries), use chain traversal to find the root cause:

from stabilize.error_codes import error_chain, find_in_chain, classify_error
from stabilize.errors import TaskTimeoutError

try:
    execute_with_resilience(...)
except Exception as e:
    # Get full chain from root to leaf
    chain = error_chain(e)
    for err in chain:
        print(f"  - {type(err).__name__}: {err}")

    # Find specific error type in chain
    timeout = find_in_chain(e, TaskTimeoutError)
    if timeout:
        print(f"Task {timeout.task_name} timed out")

    # Auto-classify any exception
    code = classify_error(e)  # Returns ErrorCode

Example chain:

BulkheadError: Execution failed
  - RuntimeError: Task execution failed
    - TaskTimeoutError: Shell command timed out after 60s
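
Python records wrapped exceptions on the standard __cause__ and __context__ attributes, so a chain walk needs only the standard library. The functions walk_chain and find_first below are hypothetical stand-ins written for illustration; whether error_chain and find_in_chain work this way internally is an assumption.

```python
def walk_chain(exc):
    """Return [exc, its cause, that cause's cause, ...], outermost first."""
    chain, seen = [], set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against reference cycles
        chain.append(exc)
        # Prefer the explicit cause (raise ... from ...), else the implicit context.
        exc = exc.__cause__ or exc.__context__
    return chain


def find_first(exc, exc_type):
    """Return the first exception of exc_type in the chain, or None."""
    return next((e for e in walk_chain(exc) if isinstance(e, exc_type)), None)


try:
    try:
        raise TimeoutError("Shell command timed out after 60s")
    except TimeoutError as inner:
        raise RuntimeError("Task execution failed") from inner
except RuntimeError as outer:
    caught = outer

print([type(e).__name__ for e in walk_chain(caught)])  # ['RuntimeError', 'TimeoutError']
print(find_first(caught, TimeoutError))                # Shell command timed out after 60s
```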

Error Truncation

Large error messages are automatically truncated before storage to prevent database bloat:

from stabilize.errors import truncate_error

# Truncate to 100KB with UTF-8 aware boundary handling
safe_message = truncate_error(large_error_message)

# Customize max size
safe_message = truncate_error(message, max_bytes=50_000)

# Result includes marker when truncated
# "Error details here... [TRUNCATED]"

The truncation:

  • Respects UTF-8 character boundaries

  • Appends [TRUNCATED] marker when shortened

  • Default limit is 100KB (configurable)
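
UTF-8 aware truncation can be sketched by slicing the encoded bytes and discarding any character the slice cut in half. The function truncate_utf8 is a hypothetical stand-in; treating 100KB as 100_000 bytes, counting the marker against the limit, and the exact marker spacing are all assumptions.

```python
MARKER = " [TRUNCATED]"


def truncate_utf8(message: str, max_bytes: int = 100_000) -> str:
    """Shorten message so its UTF-8 form, marker included, fits in max_bytes.

    Assumes max_bytes is at least as large as the marker itself.
    """
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message  # already small enough: return unchanged, no marker
    budget = max(max_bytes - len(MARKER.encode("utf-8")), 0)
    # errors="ignore" drops a trailing multi-byte character split by the slice.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + MARKER


# "é" is two bytes in UTF-8, so a 3-byte budget would split the second one.
print(truncate_utf8("é" * 10, max_bytes=15))  # é [TRUNCATED]
```

Slicing the string by character count instead would not bound the stored size, because one character can occupy up to four bytes.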

Custom Task Errors

Create rich error context for debugging:

from stabilize.errors import TaskError
from stabilize.error_codes import ErrorCode

raise TaskError(
    "Failed to process batch",
    task_name="batch_processor",
    stage_id=stage.id,
    execution_id=stage.execution_id,
    error_code=ErrorCode.USER_CODE_ERROR,
    details={
        "batch_id": "batch_123",
        "failed_item": 42,
        "reason": "Invalid format"
    }
)

Error Classification

The classify_error function maps any exception to an ErrorCode:

from stabilize.error_codes import classify_error

# Maps based on exception type and name patterns
classify_error(TimeoutError(...))      # TASK_TIMEOUT
classify_error(ConnectionError(...))   # NETWORK_ERROR
classify_error(PermissionError(...))   # AUTHENTICATION_FAILED
classify_error(ValueError(...))        # VALIDATION_FAILED

Classification rules (in order):

  1. Check if it’s a Stabilize exception with explicit error_code

  2. Check exception type name for patterns (timeout, connection, auth, etc.)

  3. Fall back to ErrorCode.UNKNOWN
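
The three rules above can be sketched as follows. This is an illustration under stated assumptions: Code stands in for ErrorCode, classify stands in for classify_error, and the NAME_PATTERNS table is invented here, so the real pattern list and any type-based mappings (such as the one for ValueError) may differ.

```python
from enum import Enum


class Code(Enum):
    """Stand-in for ErrorCode; only the members used below."""
    UNKNOWN = "UNKNOWN"
    TASK_TIMEOUT = "TASK_TIMEOUT"
    NETWORK_ERROR = "NETWORK_ERROR"
    AUTHENTICATION_FAILED = "AUTHENTICATION_FAILED"


# Hypothetical (substring, code) pairs matched against the lowercased type name.
NAME_PATTERNS = [
    ("timeout", Code.TASK_TIMEOUT),
    ("connection", Code.NETWORK_ERROR),
    ("permission", Code.AUTHENTICATION_FAILED),
    ("auth", Code.AUTHENTICATION_FAILED),
]


def classify(exc: BaseException) -> Code:
    # Rule 1: an explicit code attached to the exception wins.
    explicit = getattr(exc, "error_code", None)
    if isinstance(explicit, Code):
        return explicit
    # Rule 2: look for a known pattern in the exception's type name.
    name = type(exc).__name__.lower()
    for pattern, code in NAME_PATTERNS:
        if pattern in name:
            return code
    # Rule 3: nothing matched.
    return Code.UNKNOWN


print(classify(TimeoutError("slow")))          # Code.TASK_TIMEOUT
print(classify(ConnectionResetError("rst")))   # Code.NETWORK_ERROR
print(classify(KeyError("missing")))           # Code.UNKNOWN
```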

Best Practices

  1. Always use the right error type:

    # Good: Semantic error type
    raise TransientError("API rate limited", retry_after=60)
    
    # Bad: Generic exception
    raise Exception("API rate limited")
    
  2. Include context in errors:

    raise TaskError(
        "Failed to fetch user",
        details={"user_id": user_id, "api_response": response.status_code}
    )
    
  3. Preserve progress across retries:

    raise TransientError(
        "Batch processing interrupted",
        context_update={"last_processed_index": i}
    )
    
  4. Use error codes for routing:

    if e.error_code == ErrorCode.AUTHENTICATION_FAILED:
        alert_security_team(e)
    elif e.error_code == ErrorCode.RESOURCE_EXHAUSTED:
        scale_up_resources()
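
When the if/elif ladder grows, a lookup table keyed by error code keeps the routing in one place. The sketch below is self-contained and hypothetical: Code stands in for ErrorCode, and the handler functions only return strings so the flow is visible.

```python
from enum import Enum


class Code(Enum):
    """Stand-in for ErrorCode; only the members used below."""
    UNKNOWN = "UNKNOWN"
    AUTHENTICATION_FAILED = "AUTHENTICATION_FAILED"
    RESOURCE_EXHAUSTED = "RESOURCE_EXHAUSTED"


def alert_security_team(exc):
    return f"security alert: {exc}"


def scale_up_resources(exc):
    return "scaling up"


def log_and_continue(exc):
    return f"logged: {exc}"


# Map each code to its handler; anything unlisted falls through to the default.
ROUTES = {
    Code.AUTHENTICATION_FAILED: alert_security_team,
    Code.RESOURCE_EXHAUSTED: scale_up_resources,
}


def route(exc):
    code = getattr(exc, "error_code", Code.UNKNOWN)
    return ROUTES.get(code, log_and_continue)(exc)


class AuthProblem(Exception):
    error_code = Code.AUTHENTICATION_FAILED


print(route(AuthProblem("token expired")))  # security alert: token expired
print(route(ValueError("odd input")))       # logged: odd input
```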
    

Key Files

  • src/stabilize/errors.py - Exception hierarchy and utilities

  • src/stabilize/error_codes.py - ErrorCode enum and classification

  • src/stabilize/handlers/run_task/error.py - Error handling in task execution