Error Handling
Stabilize provides a structured error hierarchy with semantic error codes, chain traversal, and automatic classification.
Error Hierarchy
All Stabilize exceptions inherit from a two-tier hierarchy:
StabilizeBaseException                      # Root - bypasses default handlers
└── StabilizeError                          # Standard errors - caught and handled
    ├── TransientError                      # Retryable (network, timeout, 5xx)
    │   └── ConcurrencyError                # Optimistic lock failures
    ├── PermanentError                      # Non-retryable (auth, validation, 4xx)
    ├── RecoveryError                       # Crash recovery failed
    ├── ConfigurationError                  # Invalid configuration
    ├── TaskError                           # Task execution failed
    │   ├── TaskTimeoutError                # Task exceeded timeout
    │   └── TaskNotFoundError               # Task implementation not found
    ├── WorkflowError                       # Workflow-level issues
    │   └── WorkflowNotFoundError
    ├── StageError                          # Stage-level issues
    ├── QueueError                          # Queue operations failed
    │   └── DeadLetterError                 # Message moved to DLQ
    └── VerificationError                   # Stage verification failed
        └── TransientVerificationError      # Retry verification
Transient vs Permanent Errors
The distinction between transient and permanent errors determines retry behavior:
Transient Errors (will be retried):
from stabilize.errors import TransientError
# Network timeout - retry with backoff
raise TransientError(
    "Connection timed out",
    retry_after=30,  # Hint: wait 30 seconds
    context_update={"processed_items": 50},  # Preserve progress
)
Permanent Errors (moved to Dead Letter Queue):
from stabilize.errors import PermanentError
# Validation failure - don't retry
raise PermanentError(
    "Invalid input format",
    code=400,
)
Check error type programmatically:
from stabilize.errors import is_transient, is_permanent
try:
    result = task.execute(stage)
except Exception as e:
    if is_transient(e):
        # Schedule retry with backoff
        pass
    elif is_permanent(e):
        # Move to DLQ, alert operator
        pass
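To make the retry behavior concrete, here is a minimal sketch of a retry loop with exponential backoff. It assumes nothing about Stabilize's internal scheduler: `TransientError` and `PermanentError` are redefined locally as simplified stand-ins so the snippet runs on its own, and `run_with_retry`, `max_attempts`, and `base_delay` are hypothetical names introduced for illustration.

```python
import time


# Local stand-ins so the sketch runs without Stabilize installed;
# the real classes live in stabilize.errors and may differ.
class TransientError(Exception):
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


class PermanentError(Exception):
    pass


def run_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            # Never retry: the caller can route this to a dead letter queue.
            raise
        except TransientError as exc:
            if attempt == max_attempts:
                raise
            # Prefer the error's own hint, otherwise back off exponentially.
            delay = exc.retry_after
            if delay is None:
                delay = base_delay * 2 ** (attempt - 1)
            sleep(delay)


# Usage: fail twice with a retry hint, then succeed.
calls = []
delays = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("Connection timed out", retry_after=0.5)
    return "ok"


# Record the delays instead of actually sleeping.
result = run_with_retry(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in production code the default `time.sleep` would apply.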
Error Codes
Stabilize uses semantic ErrorCode values for programmatic error handling:
from stabilize.error_codes import ErrorCode
# Available error codes
ErrorCode.UNKNOWN # Unclassified error
ErrorCode.RESOURCE_EXHAUSTED # Memory, disk, quota exceeded
ErrorCode.TASK_TIMEOUT # Task exceeded timeout
ErrorCode.TASK_NOT_FOUND # Task implementation missing
ErrorCode.UPSTREAM_DEPENDENCY_FAILED # Upstream stage failed
ErrorCode.CONFIGURATION_INVALID # Bad configuration
ErrorCode.CONCURRENCY_CONFLICT # Optimistic lock failure
ErrorCode.AUTHENTICATION_FAILED # Auth credentials invalid
ErrorCode.VALIDATION_FAILED # Input validation failed
ErrorCode.NETWORK_ERROR # Network/connection issues
ErrorCode.CIRCUIT_OPEN # Circuit breaker tripped
ErrorCode.BULKHEAD_FULL # Resource pool exhausted
ErrorCode.VERIFICATION_FAILED # Stage verification failed
ErrorCode.USER_CODE_ERROR # User task raised exception
ErrorCode.SYSTEM_ERROR # Internal system error
ErrorCode.RECOVERY_FAILED # Crash recovery failed
Each exception carries its error code:
from stabilize.errors import TaskTimeoutError
try:
    execute_task(stage)
except TaskTimeoutError as e:
    print(e.error_code)  # ErrorCode.TASK_TIMEOUT
    print(e.error_code.value)  # "TASK_TIMEOUT"
Override the default error code:
from stabilize.errors import TransientError
from stabilize.error_codes import ErrorCode
raise TransientError(
    "Rate limited by API",
    error_code=ErrorCode.RESOURCE_EXHAUSTED,
)
Error Chain Traversal
When errors are wrapped (e.g., by bulkheads or retries), use chain traversal to find the root cause:
from stabilize.error_codes import error_chain, find_in_chain, classify_error
from stabilize.errors import TaskTimeoutError
try:
    execute_with_resilience(...)
except Exception as e:
    # Get full chain from root to leaf
    chain = error_chain(e)
    for err in chain:
        print(f" - {type(err).__name__}: {err}")
    # Find specific error type in chain
    timeout = find_in_chain(e, TaskTimeoutError)
    if timeout:
        print(f"Task {timeout.task_name} timed out")
    # Auto-classify any exception
    code = classify_error(e)  # Returns ErrorCode
Example chain:
BulkheadError: Execution failed
- RuntimeError: Task execution failed
- TaskTimeoutError: Shell command timed out after 60s
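For intuition, chain traversal of this kind can be sketched with Python's standard `__cause__` and `__context__` attributes. This is an illustration only, not Stabilize's implementation of `error_chain` or `find_in_chain`: the names `walk_chain`, `find_first`, and `WrapperError` are hypothetical, and the ordering (outermost wrapper first) is an assumption chosen to match the example chain above.

```python
class WrapperError(Exception):
    """Hypothetical outer error, standing in for a bulkhead-style wrapper."""


def walk_chain(exc):
    """Return exc followed by each wrapped cause, outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # Guard against cyclic chains.
        chain.append(exc)
        # `raise X from Y` sets __cause__; implicit chaining sets __context__.
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__
    return chain


def find_first(exc, exc_type):
    """Return the first exception of exc_type in the chain, or None."""
    for err in walk_chain(exc):
        if isinstance(err, exc_type):
            return err
    return None


# Usage: wrap a timeout in two outer errors, then recover it.
try:
    try:
        try:
            raise TimeoutError("Shell command timed out after 60s")
        except TimeoutError as inner:
            raise RuntimeError("Task execution failed") from inner
    except RuntimeError as middle:
        raise WrapperError("Execution failed") from middle
except WrapperError as outer:
    names = [type(err).__name__ for err in walk_chain(outer)]
    root = find_first(outer, TimeoutError)
```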
Error Truncation
Large error messages are automatically truncated before storage to prevent database bloat:
from stabilize.errors import truncate_error
# Truncate to 100KB with UTF-8 aware boundary handling
safe_message = truncate_error(large_error_message)
# Customize max size
safe_message = truncate_error(message, max_bytes=50_000)
# Result includes marker when truncated
# "Error details here... [TRUNCATED]"
The truncation:
Respects UTF-8 character boundaries
Appends a [TRUNCATED] marker when shortened
Default limit is 100KB (configurable)
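The properties listed above can be illustrated with a short, self-contained sketch of byte-limited truncation. It is not the code behind `truncate_error`: the function name `truncate_utf8`, the exact marker text, the choice to count the marker against the limit, and reading "100KB" as 100,000 bytes are all assumptions made for the example.

```python
def truncate_utf8(message, max_bytes=100_000, marker="... [TRUNCATED]"):
    """Shorten message so its UTF-8 encoding fits within max_bytes."""
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message
    marker_bytes = marker.encode("utf-8")
    # Reserve room for the marker so the result stays within the limit
    # (as long as max_bytes is at least the marker's own length).
    budget = max(max_bytes - len(marker_bytes), 0)
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + marker


# Usage: each "é" takes two bytes, so a 15-byte budget keeps seven of them.
cut = truncate_utf8("é" * 100, max_bytes=30)
```

Slicing the encoded bytes rather than the string is what makes the limit a byte limit; slicing characters would let multi-byte text exceed it.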
Custom Task Errors
Create rich error context for debugging:
from stabilize.errors import TaskError
from stabilize.error_codes import ErrorCode
raise TaskError(
    "Failed to process batch",
    task_name="batch_processor",
    stage_id=stage.id,
    execution_id=stage.execution_id,
    error_code=ErrorCode.USER_CODE_ERROR,
    details={
        "batch_id": "batch_123",
        "failed_item": 42,
        "reason": "Invalid format",
    },
)
Error Classification
The classify_error function maps any exception to an ErrorCode:
from stabilize.error_codes import classify_error
# Maps based on exception type and name patterns
classify_error(TimeoutError(...)) # TASK_TIMEOUT
classify_error(ConnectionError(...)) # NETWORK_ERROR
classify_error(PermissionError(...)) # AUTHENTICATION_FAILED
classify_error(ValueError(...)) # VALIDATION_FAILED
Classification rules (in order):
1. Check whether it is a Stabilize exception with an explicit error_code.
2. Check the exception type name for patterns (timeout, connection, auth, etc.).
3. Fall back to ErrorCode.UNKNOWN.
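The three rules can be sketched as a small standalone function. This is an illustration under assumptions, not the actual `classify_error`: the `Code` enum is a local subset of `ErrorCode`, and the pattern list in `NAME_PATTERNS` (including "permission" and "value") is guessed from the examples above rather than taken from the library.

```python
from enum import Enum


class Code(Enum):
    """Hypothetical subset of codes, redefined so the sketch runs alone."""
    UNKNOWN = "UNKNOWN"
    TASK_TIMEOUT = "TASK_TIMEOUT"
    NETWORK_ERROR = "NETWORK_ERROR"
    AUTHENTICATION_FAILED = "AUTHENTICATION_FAILED"
    VALIDATION_FAILED = "VALIDATION_FAILED"


# Assumed (substring, code) pairs, checked in order against the
# lowercased exception type name.
NAME_PATTERNS = [
    ("timeout", Code.TASK_TIMEOUT),
    ("connection", Code.NETWORK_ERROR),
    ("permission", Code.AUTHENTICATION_FAILED),
    ("auth", Code.AUTHENTICATION_FAILED),
    ("value", Code.VALIDATION_FAILED),
    ("validation", Code.VALIDATION_FAILED),
]


def classify(exc):
    """Map an exception to a Code using the three rules, in order."""
    # Rule 1: an explicit error_code attribute wins.
    explicit = getattr(exc, "error_code", None)
    if isinstance(explicit, Code):
        return explicit
    # Rule 2: look for known patterns in the type name.
    name = type(exc).__name__.lower()
    for pattern, code in NAME_PATTERNS:
        if pattern in name:
            return code
    # Rule 3: nothing matched.
    return Code.UNKNOWN


class SlowTimeout(Exception):
    """Hypothetical error whose explicit code overrides its name pattern."""
    error_code = Code.NETWORK_ERROR
```

Because rule 1 runs first, `classify(SlowTimeout())` yields `Code.NETWORK_ERROR` even though the type name contains "timeout".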
Best Practices
Always use the right error type:
# Good: Semantic error type
raise TransientError("API rate limited", retry_after=60)

# Bad: Generic exception
raise Exception("API rate limited")
Include context in errors:
raise TaskError(
    "Failed to fetch user",
    details={"user_id": user_id, "api_response": response.status_code},
)
Preserve progress across retries:
raise TransientError(
    "Batch processing interrupted",
    context_update={"last_processed_index": i},
)
Use error codes for routing:
if e.error_code == ErrorCode.AUTHENTICATION_FAILED:
    alert_security_team(e)
elif e.error_code == ErrorCode.RESOURCE_EXHAUSTED:
    scale_up_resources()
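When the number of routed codes grows, a dispatch table can replace the if/elif chain. The sketch below is one possible arrangement, not a Stabilize feature: the `Code` enum is a local stand-in for `ErrorCode`, and `ROUTES`, `route`, `log_and_continue`, and the handler bodies are hypothetical.

```python
from enum import Enum


class Code(Enum):
    """Local stand-in for a few ErrorCode members."""
    UNKNOWN = "UNKNOWN"
    AUTHENTICATION_FAILED = "AUTHENTICATION_FAILED"
    RESOURCE_EXHAUSTED = "RESOURCE_EXHAUSTED"


handled = []  # Records which handler ran, for demonstration only.


def alert_security_team(exc):
    handled.append(("security", str(exc)))


def scale_up_resources(exc):
    handled.append(("scale", str(exc)))


def log_and_continue(exc):
    handled.append(("log", str(exc)))


# One handler per code; anything unlisted falls through to the default.
ROUTES = {
    Code.AUTHENTICATION_FAILED: alert_security_team,
    Code.RESOURCE_EXHAUSTED: scale_up_resources,
}


def route(exc):
    """Dispatch exc to the handler registered for its error_code."""
    code = getattr(exc, "error_code", Code.UNKNOWN)
    handler = ROUTES.get(code, log_and_continue)
    handler(exc)
    return handler.__name__


class QuotaError(Exception):
    """Hypothetical error carrying an explicit code."""
    error_code = Code.RESOURCE_EXHAUSTED


routed = route(QuotaError("Disk quota exceeded"))
fallback = route(RuntimeError("Unexpected failure"))
```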
Key Files
src/stabilize/errors.py - Exception hierarchy and utilities
src/stabilize/error_codes.py - ErrorCode enum and classification
src/stabilize/handlers/run_task/error.py - Error handling in task execution