Observability: Monitoring, Debugging, Testing for Automation Workflows
In the world of automation engineering, observability isn't a luxury; it's a necessity. When your workflows run silently in the background, processing thousands of records or orchestrating critical business processes, you need more than just "it works." You need to know they're working correctly, efficiently, and reliably. Observability gives you the eyes and ears to understand your automation systems from the inside out. This guide covers monitoring strategies to catch issues before they become problems, debugging techniques to resolve failures quickly, testing methodologies to prevent regressions, and error handling patterns to build resilient automation workflows that can withstand real-world chaos.
Why Observability Matters for Automation Engineers
Automation workflows are like black boxes—data goes in, magic happens, results come out. But when something breaks (and it will), you need visibility into what happened, why it happened, and how to fix it. Observability provides this visibility through three pillars:
- Monitoring: Continuous observation of workflow health and performance
- Debugging: Tools and techniques to diagnose and fix issues
- Testing: Proactive validation to prevent problems before they reach production
Consider this reality: an automation workflow processing customer orders fails silently overnight. By morning, you have angry customers, lost revenue, and a frantic debugging session. With proper observability, you'd have received an alert at 2 AM, known exactly which step failed and why, and could have implemented a fix before business hours.
Monitoring: Your Automation Dashboard
1. Workflow Execution Monitoring
Track the heartbeat of your automation workflows with comprehensive execution monitoring:
sql
-- Monitor workflow execution rates and failures
SELECT
DATE(executed_at) as execution_date,
workflow_name,
COUNT(*) as total_executions,
SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successful,
SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as failed,
ROUND(AVG(execution_time_ms), 2) as avg_execution_time,
MAX(execution_time_ms) as max_execution_time
FROM workflow_executions
WHERE executed_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY DATE(executed_at), workflow_name
ORDER BY execution_date DESC, failed DESC;
-- Identify workflows with increasing failure rates
SELECT
workflow_name,
DATE(executed_at) as execution_date,
COUNT(*) as total_executions,
SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as failed,
ROUND(100.0 * SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) / COUNT(*), 2) as failure_rate_percent
FROM workflow_executions
WHERE executed_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY workflow_name, DATE(executed_at)
HAVING COUNT(*) > 10
ORDER BY failure_rate_percent DESC;
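These queries only pay off if something acts on their results. As a minimal sketch of that glue, here is a function that flags workflows breaching a failure-rate threshold, assuming result rows shaped like the column aliases above; the 5% default threshold is an example value, not a recommendation:

```javascript
// Flag workflows whose daily failure rate breaches a threshold.
// Row field names mirror the SQL column aliases in the query above.
function findFailingWorkflows(rows, thresholdPercent = 5) {
  return rows
    .filter(row => row.failure_rate_percent > thresholdPercent)
    .map(row => ({
      workflow: row.workflow_name,
      date: row.execution_date,
      failureRate: row.failure_rate_percent
    }));
}
```

Feed its output into whatever alerting channel you already use (Slack webhook, PagerDuty, email).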
2. Performance Metrics Collection
Monitor key performance indicators (KPIs) to identify bottlenecks:
javascript
// Example: Collect and store workflow performance metrics
const performanceMetrics = {
workflowId: 'order-processing-001',
executionId: 'exec-12345',
startTime: new Date().toISOString(),
steps: [
{
name: 'fetch-orders',
startTime: '2026-02-27T10:00:00Z',
endTime: '2026-02-27T10:00:05Z',
durationMs: 5000,
memoryUsageMB: 120,
recordsProcessed: 250
},
{
name: 'validate-data',
startTime: '2026-02-27T10:00:05Z',
endTime: '2026-02-27T10:00:07Z',
durationMs: 2000,
memoryUsageMB: 150,
recordsProcessed: 250
}
],
totalDurationMs: 7000,
peakMemoryUsageMB: 150,
status: 'success'
};
// Send to monitoring system
await sendToMonitoringSystem('workflow-performance', performanceMetrics);
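`sendToMonitoringSystem` is left undefined above. One way to sketch it, with an injectable transport so that a metrics-backend outage never takes the workflow itself down; the endpoint URL and envelope shape are assumptions to adapt to your backend:

```javascript
// Hypothetical helper: wraps a metric payload in an envelope and delivers it
// through an injectable transport. Delivery failures are logged and swallowed,
// because observability must never break the workflow it observes.
async function sendToMonitoringSystem(metricType, payload, transport = defaultTransport) {
  const envelope = {
    type: metricType,
    recordedAt: new Date().toISOString(),
    payload
  };
  try {
    await transport(envelope);
    return true;
  } catch (err) {
    console.warn(`Metrics delivery failed: ${err.message}`);
    return false;
  }
}

async function defaultTransport(envelope) {
  // Replace with your backend: HTTP ingest, StatsD, Prometheus pushgateway, ...
  const res = await fetch('https://metrics.example.com/ingest', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(envelope)
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
}
```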
3. Resource Utilization Tracking
Monitor the resources your automation workflows consume:
yaml
# Prometheus metrics configuration for automation workflows
workflow_executions_total{workflow="order_processing", status="success"} 1423
workflow_executions_total{workflow="order_processing", status="error"} 12
workflow_execution_duration_seconds{workflow="order_processing", quantile="0.5"} 5.2
workflow_execution_duration_seconds{workflow="order_processing", quantile="0.95"} 12.8
workflow_execution_duration_seconds{workflow="order_processing", quantile="0.99"} 25.4
workflow_memory_usage_bytes{workflow="order_processing"} 157286400
# Alert rules for resource thresholds
groups:
  - name: automation-alerts
    rules:
      - alert: HighWorkflowFailureRate
        expr: rate(workflow_executions_total{status="error"}[5m]) / rate(workflow_executions_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High failure rate detected in {{ $labels.workflow }}"
      - alert: WorkflowExecutionSlowdown
        expr: histogram_quantile(0.95, rate(workflow_execution_duration_seconds_bucket[5m])) > 30
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "95th percentile execution time exceeds 30 seconds for {{ $labels.workflow }}"
Debugging: Finding and Fixing Issues
1. Structured Logging for Effective Debugging
Implement structured logging to make debugging easier:
javascript
// Structured logging implementation
const logger = {
debug: (context, message, data = {}) => {
console.log(JSON.stringify({
timestamp: new Date().toISOString(),
level: 'DEBUG',
workflow: context.workflow,
executionId: context.executionId,
step: context.step,
message: message,
data: data,
correlationId: context.correlationId
}));
},
error: (context, message, error, data = {}) => {
console.error(JSON.stringify({
timestamp: new Date().toISOString(),
level: 'ERROR',
workflow: context.workflow,
executionId: context.executionId,
step: context.step,
message: message,
error: {
message: error.message,
stack: error.stack,
code: error.code
},
data: data,
correlationId: context.correlationId
}));
}
};
// Usage in workflow step
try {
logger.debug(
{ workflow: 'order-processing', executionId: '123', step: 'validate-order' },
'Starting order validation',
{ orderId: 'ORD-789', customerId: 'CUST-456' }
);
// Validation logic
if (!order.amount || order.amount <= 0) {
throw new Error('Invalid order amount');
}
logger.debug(
{ workflow: 'order-processing', executionId: '123', step: 'validate-order' },
'Order validation successful',
{ orderId: 'ORD-789', validationTimeMs: 45 }
);
} catch (error) {
logger.error(
{ workflow: 'order-processing', executionId: '123', step: 'validate-order' },
'Order validation failed',
error,
{ orderId: 'ORD-789', orderData: order }
);
throw error;
}
2. Debugging Failed Workflow Executions
Create a systematic approach to debugging failed workflows:
sql
-- Query to analyze failed workflow executions
SELECT
we.workflow_name,
we.execution_id,
we.executed_at,
we.status,
we.error_message,
we.execution_time_ms,
ws.step_name,
ws.step_status,
ws.step_error,
ws.step_duration_ms,
ws.input_data_sample,
ws.output_data_sample
FROM workflow_executions we
LEFT JOIN workflow_steps ws ON we.execution_id = ws.execution_id
WHERE we.status = 'error'
AND we.executed_at >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY we.executed_at DESC
LIMIT 10;
-- Common error patterns analysis
SELECT
error_message,
COUNT(*) as error_count,
workflow_name,
DATE(executed_at) as error_date,
SUBSTRING_INDEX(GROUP_CONCAT(execution_id ORDER BY executed_at DESC SEPARATOR ','), ',', 3) as recent_executions
FROM workflow_executions
WHERE status = 'error'
AND executed_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
AND error_message IS NOT NULL
GROUP BY error_message, workflow_name, DATE(executed_at)
HAVING COUNT(*) >= 3
ORDER BY error_count DESC;
3. Using Postman for API Debugging
Leverage Postman for debugging API integrations in your automation workflows:
javascript
// Postman collection for automation workflow API debugging
const postmanCollection = {
info: {
name: 'Automation Workflow API Debugging',
description: 'Collection for debugging API integrations in automation workflows'
},
item: [
{
name: 'Test Order API Integration',
request: {
method: 'POST',
header: [
{ key: 'Content-Type', value: 'application/json' },
{ key: 'Authorization', value: 'Bearer {{api_token}}' }
],
body: {
mode: 'raw',
raw: JSON.stringify({
order_id: 'TEST-ORDER-001',
customer_email: 'test@example.com',
items: [{ sku: 'PROD-001', quantity: 2 }],
total_amount: 99.98
}, null, 2)
},
url: '{{base_url}}/api/v1/orders'
},
response: []
},
{
name: 'Validate API Error Responses',
request: {
method: 'POST',
header: [
{ key: 'Content-Type', value: 'application/json' },
{ key: 'Authorization', value: 'Bearer {{api_token}}' }
],
body: {
mode: 'raw',
raw: JSON.stringify({
// Invalid data to trigger error responses
order_id: '',
customer_email: 'invalid-email',
items: [],
total_amount: -10
}, null, 2)
},
url: '{{base_url}}/api/v1/orders'
},
response: []
}
],
variable: [
{ key: 'base_url', value: 'https://api.example.com' },
{ key: 'api_token', value: 'your-test-token-here' }
]
};
// Postman test scripts for validation
const postmanTests = `
// Test 1: Verify successful order creation
pm.test("Order created successfully", function() {
pm.response.to.have.status(201);
pm.response.to.have.jsonBody('order_id');
pm.response.to.have.jsonBody('status', 'created');
});
// Test 2: Verify response structure
pm.test("Response has expected structure", function() {
const jsonData = pm.response.json();
pm.expect(jsonData).to.have.property('order_id');
pm.expect(jsonData).to.have.property('status');
pm.expect(jsonData).to.have.property('created_at');
pm.expect(jsonData).to.have.property('total_amount');
});
// Test 3: Verify error handling
pm.test("Invalid data returns proper error", function() {
if (pm.response.code === 400) {
const jsonData = pm.response.json();
pm.expect(jsonData).to.have.property('error');
pm.expect(jsonData).to.have.property('error_code');
pm.expect(jsonData.error).to.include('validation failed');
}
});
`;
Testing: Preventing Problems Before They Happen
1. Unit Testing Automation Components
Implement unit tests for individual automation components:
javascript
// Unit tests for data validation function
const { validateOrderData } = require('./order-validator');
describe('Order Data Validation', () => {
test('validates complete order data successfully', () => {
const orderData = {
order_id: 'ORD-12345',
customer_email: 'customer@example.com',
items: [{ sku: 'PROD-001', quantity: 2 }],
total_amount: 99.98,
shipping_address: {
street: '123 Main St',
city: 'Toronto',
country: 'CA'
}
};
const result = validateOrderData(orderData);
expect(result.isValid).toBe(true);
expect(result.errors).toHaveLength(0);
});
test('detects missing required fields', () => {
const orderData = {
order_id: 'ORD-12345',
// Missing customer_email
items: [],
total_amount: 0
};
const result = validateOrderData(orderData);
expect(result.isValid).toBe(false);
expect(result.errors).toContain('customer_email is required');
});
test('validates email format', () => {
const orderData = {
order_id: 'ORD-12345',
customer_email: 'invalid-email-format',
items: [{ sku: 'PROD-001', quantity: 1 }],
total_amount: 49.99
};
const result = validateOrderData(orderData);
expect(result.isValid).toBe(false);
expect(result.errors).toContain('customer_email must be a valid email address');
});
test('validates minimum order amount', () => {
const orderData = {
order_id: 'ORD-12345',
customer_email: 'customer@example.com',
items: [{ sku: 'PROD-001', quantity: 1 }],
total_amount: 0.50 // Below minimum
};
const result = validateOrderData(orderData);
expect(result.isValid).toBe(false);
expect(result.errors).toContain('total_amount must be at least 1.00');
});
});
// Mock external API for testing
jest.mock('./order-api', () => ({
createOrder: jest.fn()
.mockResolvedValueOnce({ order_id: 'MOCK-ORD-001', status: 'created' })
.mockRejectedValueOnce(new Error('API timeout'))
}));
2. Integration Testing for Workflow Orchestration
Test complete workflow integrations:
yaml
# Integration test configuration for automation workflow
test_suite: order_processing_integration
workflow: order-processing-v2
environment: test
test_cases:
  - name: successful_order_processing
    description: "Process a complete valid order"
    input:
      order_id: "TEST-ORD-001"
      customer_email: "test@example.com"
      items:
        - sku: "PROD-001"
          quantity: 2
          price: 49.99
      total_amount: 99.98
    expected_output:
      status: "processed"
      payment_status: "completed"
      fulfillment_status: "pending"
    mock_responses:
      payment_gateway:
        status: 200
        body: { "transaction_id": "TXN-123", "status": "success" }
      inventory_system:
        status: 200
        body: { "reserved": true, "reservation_id": "RES-456" }
  - name: failed_payment_processing
    description: "Test workflow behavior when payment fails"
    input:
      order_id: "TEST-ORD-002"
      customer_email: "test@example.com"
      items:
        - sku: "PROD-001"
          quantity: 1
          price: 49.99
      total_amount: 49.99
    expected_output:
      status: "failed"
      error: "Payment processing failed"
    mock_responses:
      payment_gateway:
        status: 402
        body: { "error": "Insufficient funds", "code": "PAYMENT_DECLINED" }
  - name: out_of_stock_scenario
    description: "Test workflow when item is out of stock"
    input:
      order_id: "TEST-ORD-003"
      customer_email: "test@example.com"
      items:
        - sku: "PROD-002"
          quantity: 5
          price: 29.99
      total_amount: 149.95
    expected_output:
      status: "failed"
      error: "Item out of stock"
    mock_responses:
      inventory_system:
        status: 404
        body: { "error": "Product not available", "sku": "PROD-002" }
3. End-to-End Testing with Real Data
Implement end-to-end tests using production-like data:
python
# End-to-end test for automation workflow
import time
import concurrent.futures

import pytest
from automation_workflow import OrderProcessingWorkflow
from test_data_factory import create_test_order


class TestOrderProcessingE2E:
    @pytest.fixture
    def workflow(self):
        """Initialize workflow with test configuration"""
        return OrderProcessingWorkflow(
            environment='staging',
            enable_mocks=True,
            log_level='DEBUG'
        )

    def test_complete_order_flow(self, workflow):
        """Test complete order processing flow"""
        # 1. Create test order data
        test_order = create_test_order(
            order_type='standard',
            customer_tier='premium',
            items_count=3,
            total_amount=199.97
        )
        # 2. Execute workflow
        result = workflow.execute(test_order)
        # 3. Verify results
        assert result['status'] == 'processed'
        assert result['payment_status'] == 'completed'
        assert result['fulfillment_status'] == 'pending'
        assert 'order_id' in result
        assert 'transaction_id' in result
        assert 'reservation_id' in result
        # 4. Verify side effects
        assert workflow.payment_gateway_called is True
        assert workflow.inventory_system_called is True
        assert workflow.notification_sent is True

    def test_order_with_invalid_data(self, workflow):
        """Test workflow handling of invalid order data"""
        invalid_order = {
            'order_id': '',
            'customer_email': 'not-an-email',
            'items': [],
            'total_amount': -10
        }
        result = workflow.execute(invalid_order)
        assert result['status'] == 'failed'
        assert 'validation error' in result['error'].lower()
        assert workflow.payment_gateway_called is False

    def test_retry_mechanism(self, workflow):
        """Test workflow retry on transient failures"""
        # Configure mock to fail first, then succeed
        workflow.configure_mock(
            'payment_gateway',
            responses=[
                {'status': 500, 'body': {'error': 'Internal server error'}},
                {'status': 200, 'body': {'transaction_id': 'TXN-RETRY', 'status': 'success'}}
            ]
        )
        test_order = create_test_order()
        result = workflow.execute(test_order)
        assert result['status'] == 'processed'
        assert workflow.payment_gateway_call_count == 2


# Performance testing
def test_workflow_performance_under_load():
    """Test workflow performance with concurrent executions"""
    workflow = OrderProcessingWorkflow(environment='performance')
    # Execute 100 concurrent orders
    orders = [create_test_order() for _ in range(100)]
    start_time = time.time()
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        future_to_order = {
            executor.submit(workflow.execute, order): order
            for order in orders
        }
        for future in concurrent.futures.as_completed(future_to_order):
            results.append(future.result())
    total_duration = time.time() - start_time
    # Assert performance requirements
    assert total_duration < 30  # Should complete within 30 seconds
    assert len([r for r in results if r['status'] == 'processed']) >= 95  # 95% success rate
    # Calculate and log performance metrics
    avg_duration = total_duration / len(orders)
    print(f"Performance test completed: {len(orders)} orders in {total_duration:.2f}s")
    print(f"Average per order: {avg_duration:.3f}s")
    print(f"Success rate: {100 * len([r for r in results if r['status'] == 'processed']) / len(results):.1f}%")
Error Handling: Building Resilient Automation
1. Graceful Degradation Patterns
Implement error handling that allows workflows to continue functioning partially when components fail:
javascript
// Graceful degradation implementation
class ResilientWorkflowStep {
  constructor(name, executeFn, fallbackFn = null, retryConfig = {}) {
    this.name = name;
    this.executeFn = executeFn;
    this.fallbackFn = fallbackFn;
    this.retryConfig = {
      maxAttempts: retryConfig.maxAttempts || 3,
      backoffMs: retryConfig.backoffMs || 1000,
      retryableErrors: retryConfig.retryableErrors || ['NETWORK_ERROR', 'TIMEOUT']
    };
  }

  async execute(context) {
    let lastError;
    for (let attempt = 1; attempt <= this.retryConfig.maxAttempts; attempt++) {
      try {
        context.logger.debug(`Attempt ${attempt} for step ${this.name}`);
        const result = await this.executeFn(context);
        context.logger.debug(`Step ${this.name} completed successfully`);
        return result;
      } catch (error) {
        lastError = error;
        context.logger.warn(`Step ${this.name} failed on attempt ${attempt}: ${error.message}`);
        // Check if error is retryable
        const isRetryable = this.retryConfig.retryableErrors.some(
          errorType => error.code === errorType || error.message.includes(errorType)
        );
        if (attempt === this.retryConfig.maxAttempts || !isRetryable) {
          break;
        }
        // Exponential backoff
        const backoffTime = this.retryConfig.backoffMs * Math.pow(2, attempt - 1);
        context.logger.debug(`Retrying in ${backoffTime}ms`);
        await new Promise(resolve => setTimeout(resolve, backoffTime));
      }
    }
    // All retries failed, try fallback if available
    if (this.fallbackFn) {
      try {
        context.logger.warn(`Executing fallback for step ${this.name}`);
        return await this.fallbackFn(context);
      } catch (fallbackError) {
        context.logger.error(`Fallback for step ${this.name} also failed: ${fallbackError.message}`);
        throw new Error(`Step ${this.name} failed after ${this.retryConfig.maxAttempts} attempts and fallback: ${lastError.message}`);
      }
    }
    throw lastError;
  }
}
// Usage example
const paymentStep = new ResilientWorkflowStep(
'process-payment',
async (context) => {
// Primary payment processing
return await paymentGateway.charge(context.order);
},
async (context) => {
// Fallback: queue for manual processing
await queueManualPayment(context.order);
return { status: 'queued', message: 'Payment queued for manual processing' };
},
{
maxAttempts: 3,
backoffMs: 2000,
retryableErrors: ['NETWORK_ERROR', 'TIMEOUT', 'RATE_LIMITED']
}
);
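One detail worth pulling out of the retry loop: the backoff schedule. Plain exponential backoff lets many failed executions retry in lockstep and hammer a recovering service; adding jitter spreads them out. A standalone sketch of exponential backoff with full jitter (the jitter variant is an addition here; the class above uses plain exponential backoff):

```javascript
// Delay for the given attempt (1-based): exponential growth capped at capMs,
// then scaled by a uniform random factor ("full jitter").
// The `random` parameter is injectable for deterministic tests.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  const exp = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  return Math.floor(random() * exp); // uniform in [0, exp)
}
```

Swapping this into `ResilientWorkflowStep` only changes the `backoffTime` line inside the retry loop.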
2. Circuit Breaker Pattern
Implement circuit breakers to prevent cascading failures:
javascript
// Circuit breaker implementation
class CircuitBreaker {
  constructor(name, options = {}) {
    this.name = name;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = null;
    this.threshold = options.threshold || 5; // Failures before opening
    this.resetTimeout = options.resetTimeout || 60000; // 60 seconds
    this.halfOpenMaxAttempts = options.halfOpenMaxAttempts || 3;
    this.halfOpenAttempts = 0;
    this.metrics = {
      totalRequests: 0,
      successfulRequests: 0,
      failedRequests: 0,
      circuitOpened: 0,
      circuitClosed: 0
    };
  }

  async execute(fn, context) {
    this.metrics.totalRequests++;
    // Check if circuit is open
    if (this.state === 'OPEN') {
      const timeSinceFailure = Date.now() - this.lastFailureTime;
      if (timeSinceFailure > this.resetTimeout) {
        this.state = 'HALF_OPEN';
        this.halfOpenAttempts = 0;
        context.logger.info(`Circuit ${this.name} moving to HALF_OPEN state`);
      } else {
        this.metrics.failedRequests++;
        throw new Error(`Circuit breaker ${this.name} is OPEN. Request blocked.`);
      }
    }
    try {
      const result = await fn();
      // Request succeeded
      if (this.state === 'HALF_OPEN') {
        this.halfOpenAttempts++;
        if (this.halfOpenAttempts >= this.halfOpenMaxAttempts) {
          this.state = 'CLOSED';
          this.failureCount = 0;
          this.metrics.circuitClosed++;
          context.logger.info(`Circuit ${this.name} moved to CLOSED state`);
        }
      } else {
        this.successCount++;
        this.failureCount = 0;
      }
      this.metrics.successfulRequests++;
      return result;
    } catch (error) {
      this.metrics.failedRequests++;
      if (this.state === 'HALF_OPEN') {
        // Half-open state failed, go back to open
        this.state = 'OPEN';
        this.lastFailureTime = Date.now();
        this.halfOpenAttempts = 0;
        context.logger.warn(`Circuit ${this.name} failed in HALF_OPEN state, moving to OPEN`);
      } else {
        // Closed state - increment failure count
        this.failureCount++;
        if (this.failureCount >= this.threshold) {
          this.state = 'OPEN';
          this.lastFailureTime = Date.now();
          this.metrics.circuitOpened++;
          context.logger.error(`Circuit ${this.name} opened after ${this.failureCount} failures`);
        }
      }
      throw error;
    }
  }

  getMetrics() {
    return {
      ...this.metrics,
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      failureRate: this.metrics.totalRequests > 0
        ? (this.metrics.failedRequests / this.metrics.totalRequests) * 100
        : 0
    };
  }
}
// Usage with external API call
const apiCircuitBreaker = new CircuitBreaker('external-api', {
  threshold: 3,
  resetTimeout: 30000, // 30 seconds
  halfOpenMaxAttempts: 2
});

async function callExternalApiWithCircuitBreaker(data) {
  return await apiCircuitBreaker.execute(
    async () => {
      const response = await fetch('https://api.example.com/process', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(data),
        signal: AbortSignal.timeout(5000)
      });
      if (!response.ok) {
        throw new Error(`API responded with ${response.status}`);
      }
      return await response.json();
    },
    { logger: console }
  );
}
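To chart breaker health in Grafana, the string state needs a numeric encoding; the `circuit_breaker_state` panel in the dashboard section later in this guide maps 0/1/2 to CLOSED/OPEN/HALF_OPEN. A one-line sketch of that encoding (how you export the gauge to Prometheus is left to your client library):

```javascript
// Encode breaker state as a gauge value (0=CLOSED, 1=OPEN, 2=HALF_OPEN),
// matching the value mappings in the Grafana panel; -1 marks unknown states.
function circuitStateValue(state) {
  const mapping = { CLOSED: 0, OPEN: 1, HALF_OPEN: 2 };
  return state in mapping ? mapping[state] : -1;
}
```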
3. Dead Letter Queues for Failed Messages
Implement dead letter queues to capture and analyze failed workflow executions:
sql
-- Dead letter queue table structure
CREATE TABLE workflow_dead_letter_queue (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
workflow_name VARCHAR(255) NOT NULL,
execution_id VARCHAR(255) NOT NULL,
original_data JSON NOT NULL,
error_message TEXT,
error_stack TEXT,
error_context JSON,
failure_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
retry_count INT DEFAULT 0,
last_retry_timestamp TIMESTAMP NULL,
status ENUM('pending', 'retrying', 'resolved', 'abandoned') DEFAULT 'pending',
resolution_notes TEXT,
resolved_at TIMESTAMP NULL,
resolved_by VARCHAR(255),
INDEX idx_workflow_status (workflow_name, status),
INDEX idx_failure_timestamp (failure_timestamp),
INDEX idx_pending_retries (status, retry_count)
);
-- Query to analyze dead letter queue
SELECT
workflow_name,
COUNT(*) as total_failures,
SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
SUM(CASE WHEN status = 'retrying' THEN 1 ELSE 0 END) as retrying,
AVG(retry_count) as avg_retries,
MIN(failure_timestamp) as first_failure,
MAX(failure_timestamp) as last_failure,
SUBSTRING_INDEX(GROUP_CONCAT(LEFT(error_message, 100) ORDER BY failure_timestamp DESC SEPARATOR '||'), '||', 3) as recent_errors
FROM workflow_dead_letter_queue
WHERE failure_timestamp >= DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY workflow_name
HAVING total_failures > 0
ORDER BY total_failures DESC;
-- Automated retry procedure
DELIMITER //
CREATE PROCEDURE retry_dead_letter_items(
IN p_workflow_name VARCHAR(255),
IN p_max_retries INT,
IN p_batch_size INT
)
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE v_id BIGINT;
DECLARE v_execution_id VARCHAR(255);
DECLARE v_original_data JSON;
DECLARE v_retry_count INT;
-- Cursor for pending items with retry count below threshold
DECLARE cur CURSOR FOR
SELECT id, execution_id, original_data, retry_count
FROM workflow_dead_letter_queue
WHERE workflow_name = p_workflow_name
AND status = 'pending'
AND retry_count < p_max_retries
AND (last_retry_timestamp IS NULL OR
last_retry_timestamp < DATE_SUB(NOW(), INTERVAL POWER(2, retry_count) MINUTE))
ORDER BY failure_timestamp
LIMIT p_batch_size;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur;
read_loop: LOOP
FETCH cur INTO v_id, v_execution_id, v_original_data, v_retry_count;
IF done THEN
LEAVE read_loop;
END IF;
-- Update status to retrying
UPDATE workflow_dead_letter_queue
SET status = 'retrying',
last_retry_timestamp = NOW()
WHERE id = v_id;
-- Here you would call your workflow retry logic
-- For example: CALL retry_workflow_execution(v_execution_id, v_original_data);
-- Simulate retry logic
SET @retry_result = 'success'; -- This would come from actual retry
IF @retry_result = 'success' THEN
UPDATE workflow_dead_letter_queue
SET status = 'resolved',
resolved_at = NOW(),
resolved_by = 'auto_retry_system',
resolution_notes = 'Automatically retried successfully'
WHERE id = v_id;
ELSE
UPDATE workflow_dead_letter_queue
SET status = 'pending',
retry_count = retry_count + 1,
error_message = CONCAT('Retry failed: ', @retry_result)
WHERE id = v_id;
END IF;
END LOOP;
CLOSE cur;
END //
DELIMITER ;
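The schema and retry procedure cover reading the queue; the write side appears nowhere above. A sketch of the enqueue call a workflow's error handler might make. `db.execute` stands in for any parameterized-query client (e.g. mysql2's promise API) and is an assumption to adapt:

```javascript
// Record a failed execution in the dead letter queue so it can be retried or
// analyzed later. `db` is a hypothetical client exposing execute(sql, params).
async function enqueueDeadLetter(db, workflowName, executionId, inputData, error) {
  await db.execute(
    `INSERT INTO workflow_dead_letter_queue
       (workflow_name, execution_id, original_data, error_message, error_stack, error_context)
     VALUES (?, ?, ?, ?, ?, ?)`,
    [
      workflowName,
      executionId,
      JSON.stringify(inputData),
      error.message,
      error.stack || null,
      JSON.stringify({ code: error.code || null, enqueuedAt: new Date().toISOString() })
    ]
  );
}
```

Call it from the workflow's top-level catch block, after retries and fallbacks are exhausted.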
Observability Tools for Automation Projects
1. Logging and Monitoring Stack
Build a comprehensive observability stack for your automation workflows:
yaml
# Docker Compose for observability stack
version: '3.8'

services:
  # Log aggregation
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki

  # Metrics collection
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=30d'

  # Dashboards
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana-dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana-datasources:/etc/grafana/provisioning/datasources
      - grafana-data:/var/lib/grafana

  # Distributed tracing
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # Collector
      - "14250:14250" # Collector gRPC

  tempo:
    image: grafana/tempo:latest
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "3200:3200" # Tempo
      - "9095:9095" # Tempo metrics

volumes:
  loki-data:
  prometheus-data:
  grafana-data:
  tempo-data:
2. Dashboard Configuration for Automation Monitoring
Create comprehensive Grafana dashboards for automation workflow monitoring:
json
{
"dashboard": {
"title": "Automation Workflows Monitoring",
"panels": [
{
"title": "Workflow Execution Rate",
"targets": [
{
"expr": "rate(workflow_executions_total[5m])",
"legendFormat": "{{workflow}}"
}
],
"type": "graph",
"yaxes": [
{"label": "Executions/sec", "min": 0}
]
},
{
"title": "Workflow Success Rate",
"targets": [
{
"expr": "sum(rate(workflow_executions_total{status=\"success\"}[5m])) by (workflow) / sum(rate(workflow_executions_total[5m])) by (workflow) * 100",
"legendFormat": "{{workflow}}"
}
],
"type": "graph",
"yaxes": [
{"label": "Success %", "min": 0, "max": 100}
],
"thresholds": [
{"value": 95, "color": "green"},
{"value": 90, "color": "yellow"},
{"value": 85, "color": "red"}
]
},
{
"title": "Workflow Execution Duration",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(workflow_execution_duration_seconds_bucket[5m]))",
"legendFormat": "P95 - {{workflow}}"
},
{
"expr": "histogram_quantile(0.50, rate(workflow_execution_duration_seconds_bucket[5m]))",
"legendFormat": "P50 - {{workflow}}"
}
],
"type": "graph",
"yaxes": [
{"label": "Seconds", "min": 0}
]
},
{
"title": "Top Failing Workflows",
"targets": [
{
"expr": "topk(5, sum(rate(workflow_executions_total{status=\"error\"}[1h])) by (workflow))",
"legendFormat": "{{workflow}}"
}
],
"type": "table",
"columns": [
{"text": "Workflow", "value": "workflow"},
{"text": "Errors/hr", "value": "Value"}
]
},
{
"title": "Circuit Breaker Status",
"targets": [
{
"expr": "circuit_breaker_state",
"legendFormat": "{{circuit}} - {{state}}"
}
],
"type": "stat",
"fieldConfig": {
"mappings": [
{"value": 0, "text": "CLOSED", "color": "green"},
{"value": 1, "text": "OPEN", "color": "red"},
{"value": 2, "text": "HALF_OPEN", "color": "yellow"}
]
}
}
]
}
}