Harness Engineering Part 6: Recovery Paths - Backup Plans for When AI Fails

Key Takeaways

Recovery Paths are backup plans for when AI goes wrong. The goal is not 100% perfection but a system that is 'still usable'.
Graceful Degradation: Full AI Response → Simplified AI Response → Rule-Based Response → Human Handoff
4 core techniques: Fallback Strategies, Retry Logic (Exponential Backoff), Circuit Breaker, Dead Letter Queue

Introduction: Why Do We Need Recovery Paths?

Suppose you are building an AI Agent for a Customer Support system that runs 24/7. One day the LLM API goes down and the system stops immediately. Customers get no answers and the business loses opportunities.

This is what happens when there are no Recovery Paths.

AI Systems have more points of failure than typical software:

Model API failures: rate limits, timeouts, outages
Tool/Function calling errors: an external API that we call breaks
Context window exceeded: the prompt is too long
Invalid output format: the AI produces output that does not match the schema (for example, hallucinated fields)
Dependency failures: the database, cache, or other services the system depends on

If we design a system on the assumption that everything works all the time, one day it will fail somewhere we did not expect.

What Are Recovery Paths?

They are “backup plans” put in place ahead of time. When the main system cannot do its job, the backup takes over. The aim is not a 100% perfect system but one that is “still usable”, even with somewhat reduced capability.

This follows the principle of Graceful Degradation: reduce capability step by step instead of failing completely.

Full AI Response → Simplified AI Response → Rule-Based Response → Human Handoff

This article walks through 4 main topics:

Types of Recovery (Auto vs Manual)
Fallback Strategies
Retry Logic
Circuit Breaker Pattern

Types of Recovery: Auto vs Manual

Auto-Recovery (Self-Healing)

Picture an AI Agent that can “notice” its own mistake and then correct itself. This is the Self-Healing Agent Pattern.

The 4-stage process:

Stage	Action
1. Validation	Check that the output matches what was required
2. Failure Detection	Classify the type of failure
3. Contextual Recovery	Apply the recovery strategy that fits
4. Learning Integration	Record the incident to improve future handling

Example Failure Classification and Recovery:

Failure Type	Recovery Action
Input corruption	Request new data or clean the data
Context starvation	Request additional information
Tool failure	Retry or use an alternative tool
Reasoning collapse	Reset to the last known-good state
Output corruption	Regenerate with different parameters
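
The classification table above can be sketched as a simple lookup. This is a minimal illustration, not a reference implementation: the enum members, the action strings, and the `choose_recovery` helper are all hypothetical names chosen for this example.

```python
from enum import Enum


class FailureType(Enum):
    INPUT_CORRUPTION = "input_corruption"
    CONTEXT_STARVATION = "context_starvation"
    TOOL_FAILURE = "tool_failure"
    REASONING_COLLAPSE = "reasoning_collapse"
    OUTPUT_CORRUPTION = "output_corruption"


# Recovery actions from the table, keyed by failure type (names are illustrative).
RECOVERY_ACTIONS = {
    FailureType.INPUT_CORRUPTION: "request_or_clean_input",
    FailureType.CONTEXT_STARVATION: "request_more_context",
    FailureType.TOOL_FAILURE: "retry_or_use_alternative_tool",
    FailureType.REASONING_COLLAPSE: "reset_to_last_good_state",
    FailureType.OUTPUT_CORRUPTION: "regenerate_with_new_parameters",
}


def choose_recovery(failure) -> str:
    """Return the recovery action for a classified failure.

    Anything that is not a known failure type escalates to a human
    instead of guessing at a fix.
    """
    return RECOVERY_ACTIONS.get(failure, "escalate_to_human")
```

Keeping the mapping in data rather than in branching logic makes it easy to review and extend as new failure types are observed.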

Auto-Recovery is a good fit for:

Transient errors (errors that appear briefly and then go away)
Rate limits
Timeouts caused by high load
Tool errors with a clear fallback

Manual Recovery

Sometimes a person has to step in.

When to use Manual Recovery:

Critical failures: cases that need human judgment, such as seriously wrong data
Security/Compliance issues: security or compliance problems
First-time failures: cases with no established pattern for automatic repair yet
Business-critical decisions: cases where a person must make the call

The Manual Recovery process:

Receive the alert
Analyze the cause (root cause analysis)
Decide on the fix
Carry out the recovery
Record the lessons learned

Hybrid Approach: The Best Option

In practice, most systems should use a Hybrid Approach: automate some things and leave others to human decision.

What to automate:

Transient error retries
Fallback to secondary models
Circuit breaker activation
State recovery from checkpoints

What to leave to people:

Data corruption issues
Security breaches
Ethical concerns
Business-critical decisions

Fallback Strategies

A fallback is an “alternate route” for when the main route is unavailable. Think of a closed main road: we still need a detour that reaches the destination.

Types of Fallback

1. Model-Level Fallback:

Primary: GPT-4 → Fallback: GPT-3.5-turbo → Fallback: Local model

When GPT-4 is unavailable, use GPT-3.5. If that also fails, use a local model (quality may be lower, but the system still works).

2. Provider-Level Fallback:

OpenAI → Anthropic → Google → Local

If a provider goes down, or even several major providers at once, there is still another option.

3. Capability-Level Fallback:

AI-generated response → Cached response → Rule-based response → Human handoff

This is Graceful Degradation: capability is reduced step by step.
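
The capability-level chain can be sketched as an ordered list of handlers that are tried in turn. This is only a sketch under stated assumptions: the handler functions, the `CACHE` contents, and the `respond_with_degradation` helper are hypothetical, and the AI handler simply simulates an outage.

```python
from typing import Callable, Optional

Handler = Callable[[str], Optional[str]]


def respond_with_degradation(question: str, handlers: list[tuple[str, Handler]]) -> dict:
    """Try each capability level in order and return the first usable answer."""
    for level, handler in handlers:
        try:
            answer = handler(question)
        except Exception:
            continue  # this level failed; degrade to the next one
        if answer is not None:
            return {"answer": answer, "level": level}
    # No level produced an answer: hand off to a person.
    return {"answer": None, "level": "human_handoff"}


def ai_response(question: str) -> str:
    raise TimeoutError("LLM API unavailable")  # simulate an outage


CACHE = {"what are your opening hours?": "We are open 9:00-18:00."}


def cached_response(question: str) -> Optional[str]:
    return CACHE.get(question.lower())


def rule_based_response(question: str) -> Optional[str]:
    if "refund" in question.lower():
        return "Refunds are handled at the support desk."
    return None


CHAIN = [("ai", ai_response), ("cache", cached_response), ("rules", rule_based_response)]
```

Calling `respond_with_degradation("What are your opening hours?", CHAIN)` is served from the cache level here, while a question that no level can answer ends in a human handoff. Returning the level alongside the answer makes it possible to monitor how often each tier is used.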

Fallback Hierarchy: Why It Matters

Without Fallback:

Uptime: 97-98% (about 14-22 hours down/month) ¹
Recovery: Manual (1-4 hours)

With Automatic Fallback:

Uptime: 99.9%+ (< 45 minutes down/month) ²
Recovery: Automatic (< 30 seconds)

That is a large difference.
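
To relate an uptime percentage to monthly downtime, a quick calculation helps. The sketch below assumes a 30-day month (720 hours); the function name is illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def monthly_downtime_hours(uptime_percent: float) -> float:
    """Convert an uptime percentage into hours of downtime per month."""
    return HOURS_PER_MONTH * (1 - uptime_percent / 100)


for uptime in (97.0, 98.0, 99.9):
    hours = monthly_downtime_hours(uptime)
    print(f"{uptime}% uptime -> {hours:.1f} h ({hours * 60:.0f} min) down per month")
```

Under that assumption, 97% uptime is about 21.6 hours of downtime per month, 98% is about 14.4 hours, and 99.9% is about 43 minutes.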

Fallback Router Pattern

Here is a code example:

class FallbackRouter:
    """Route a request to the primary provider, falling back to the secondary."""

    def __init__(self, primary, secondary, circuit_breaker):
        self.primary = primary
        self.secondary = secondary
        self.breaker = circuit_breaker

    async def call(self, request: dict) -> dict:
        # Only try the primary while the circuit breaker lets traffic through.
        if self.breaker.allow_request():
            try:
                result = await self.primary.call(request)
                self.breaker.on_success()
                return {"result": result, "provider": "primary"}
            except Exception:
                # Record the failure so the breaker can open if needed.
                self.breaker.on_failure()
        # The breaker is open or the primary failed: use the secondary.
        result = await self.secondary.call(request)
        return {"result": result, "provider": "secondary"}

Key principles:

Try the primary first (unless the circuit breaker is open)
If it fails, tell the circuit breaker
Then use the secondary
Tag which provider served the request (for monitoring)

Retry Logic + Circuit Breaker

Retry Logic

Retrying looks simple, but done badly it becomes a big problem.

Problems that can occur:

Immediate retry → adds load to a service that is already struggling
Unlimited retry → the system hangs indefinitely
Identical retry every time → the same result every time

Tiered Retry Recipe:

Attempt 1: Original prompt, standard timeout
Attempt 2: Prompt + error feedback, reduced temperature
Attempt 3: Simplified prompt, smaller model, tighter constraints
Fallback: Route to human review or degrade gracefully

In other words, each attempt should use a different strategy.
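
The tiered recipe can be sketched as a loop over per-attempt settings. This is a minimal sketch, assuming the caller supplies `call_model` (which receives the tier settings and the previous error) and `validate`; the tier values, model labels, and function names are hypothetical.

```python
from typing import Callable, Optional

# Each tier changes the strategy instead of repeating the same call.
TIERS = [
    {"prompt_style": "original", "model": "large", "temperature": 0.7},
    {"prompt_style": "with_error_feedback", "model": "large", "temperature": 0.2},
    {"prompt_style": "simplified", "model": "small", "temperature": 0.0},
]


def tiered_retry(call_model: Callable[[dict, Optional[str]], str],
                 validate: Callable[[str], bool]) -> dict:
    """Run each tier in order; route to human review if all of them fail."""
    last_error: Optional[str] = None
    for attempt, tier in enumerate(TIERS, start=1):
        try:
            output = call_model(tier, last_error)
            if validate(output):
                return {"output": output, "attempt": attempt}
            last_error = f"invalid output: {output!r}"
        except Exception as exc:
            last_error = str(exc)  # passed to the next tier as feedback
    return {"output": None, "attempt": None, "route": "human_review", "error": last_error}
```

Passing the previous error into the next call is what lets the second attempt include error feedback in its prompt.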

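Exponential Backoff with Jitter:

from tenacity import retry, stop_after_attempt, wait_exponential, wait_random

@retry(
    stop=stop_after_attempt(3),  # give up after 3 attempts
    # exponential wait clamped to 2-10 s, plus up to 1 s of random jitter
    wait=wait_exponential(multiplier=1, min=2, max=10) + wait_random(0, 1),
)
def call_llm(prompt):
    # `llm` is assumed to be an already configured client object
    return llm.generate(prompt)

Exponential backoff: wait longer before each retry (for example 2, 4, 8 seconds)
Jitter: add a small random amount to the wait so that clients do not all retry at the same moment
Max attempts: cap the number of tries so that retries do not go on forever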
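
For readers who want to see the delay calculation without a library, here is a standard-library sketch of exponential backoff with "full jitter" (the delay is drawn uniformly between zero and the exponential ceiling). The function name and default values are assumptions made for this example.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 10.0) -> float:
    """Delay before retry number `attempt` (1-based), with full jitter.

    The exponential ceiling is base * 2**(attempt - 1), capped at `cap`;
    the actual delay is drawn uniformly between 0 and that ceiling.
    """
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


# Ceilings for attempts 1..4 are 2, 4, 8 and 10 (capped) seconds.
delays = [backoff_delay(n) for n in range(1, 5)]
```

Full jitter spreads retries across the whole interval, which reduces the chance that many clients hit a recovering service at the same instant.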

Circuit Breaker Pattern

A Circuit Breaker works like an electrical fuse: if the circuit keeps short-circuiting, the fuse cuts the power to prevent damage.

Three states:

Closed: normal operation, requests pass through
Open: requests are blocked and fail fast, adding no load
Half-Open: a trial state that tests whether the service has recovered before returning to Closed

Configuration to set:

Failure threshold: number or percentage of failures before the circuit opens
Timeout period: how long to wait before entering half-open
Success threshold: number of successes required to close the circuit
Rolling window: the time window used to compute the failure rate
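
The three states can be sketched as a small class. This is a simplified illustration: it counts consecutive failures instead of using a rolling window, and the parameter names are assumptions. The method names `allow_request`, `on_success`, and `on_failure` match what the FallbackRouter example calls.

```python
import time


class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half_open -> closed."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return False  # fail fast while the service recovers
            self.state = "half_open"  # timeout elapsed: let a probe through
            self.successes = 0
        return True

    def on_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # a success resets the consecutive-failure count

    def on_failure(self) -> None:
        self.failures += 1
        # Any failure while probing, or too many while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Injecting the clock keeps the class testable without sleeping; a production breaker would typically also track a rolling failure rate and emit metrics on every state change.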

Why Use a Circuit Breaker?

From Portkey.ai ³:

“Circuit breakers prevent cascading failures by stopping traffic to failing services early. Without them, retries pile up on already degraded services.”

Suppose service A is in trouble and requests start to fail → we retry → the load on service A increases → more requests fail → we retry even more. This loop leads to what is called a “cascading failure”, where the whole system eventually goes down.

A Circuit Breaker breaks this loop.

Rollback Strategy

Besides Retry, Fallback and Circuit Breaker, another essential piece is a Rollback Strategy.

Checkpointing: Saved Points for State

Imagine a 10-step workflow that hits a problem at step 8. Without checkpoints you have to start over from step 1.

Checkpointing solves this:

Save state at every step
Roll back to any saved point
Use it for long-running workflows

class CheckpointManager:
    def __init__(self, storage):
        # storage: any async key-value store exposing save(key, value) and load(key)
        self.storage = storage

    async def save_checkpoint(self, workflow_id: str, checkpoint_id: str, state: dict):
        # Persist the current state under a per-checkpoint key
        await self.storage.save(f"checkpoint/{workflow_id}/{checkpoint_id}", state)

    async def rollback_to(self, workflow_id: str, checkpoint_id: str) -> dict:
        # Load the state that was saved at the given checkpoint
        return await self.storage.load(f"checkpoint/{workflow_id}/{checkpoint_id}")

A real-world usage example:

Suppose you have an order-processing workflow:

Check inventory ✓ (checkpoint)
Deduct stock ✓ (checkpoint)
Calculate price ✓ (checkpoint)
Call Payment API → FAIL
Without checkpoints → you have to return the stock yourself and cancel everything
With checkpoints → roll back to step 3 and resume from there; the system handles it automatically
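
The order workflow can be sketched as a runner that skips steps which already have a checkpoint. This is an in-memory illustration only: the step functions, the `checkpoints` dict, and `run_workflow` are hypothetical, and a real system would persist checkpoints in durable storage.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_workflow(steps: list[tuple[str, Step]], state: dict, checkpoints: dict) -> dict:
    """Run steps in order, skipping any step that already has a checkpoint."""
    for name, step in steps:
        if name in checkpoints:
            state = dict(checkpoints[name])  # resume from the saved state
            continue
        state = step(dict(state))
        checkpoints[name] = dict(state)  # save after each successful step
    return state


calls = {"reserve_stock": 0, "charge_payment": 0}


def reserve_stock(state: dict) -> dict:
    calls["reserve_stock"] += 1
    return {**state, "reserved": True}


def charge_payment(state: dict) -> dict:
    calls["charge_payment"] += 1
    if calls["charge_payment"] == 1:
        raise ConnectionError("payment API unavailable")  # first call fails
    return {**state, "paid": True}


STEPS = [("reserve_stock", reserve_stock), ("charge_payment", charge_payment)]
checkpoints: dict = {}
try:
    run_workflow(STEPS, {"order_id": 1}, checkpoints)
except ConnectionError:
    pass  # the first run fails at the payment step
final_state = run_workflow(STEPS, {"order_id": 1}, checkpoints)  # resume
```

On the second run the stock step is not executed again, because its checkpoint already exists; only the payment step is retried.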

Idempotent Operations: Safe to Retry

An operation is idempotent when calling it several times produces the same result as calling it once.

Compare:

Not idempotent (dangerous):

POST /api/orders (creates a new order)
→ Retry 3 times → 3 orders created!

Idempotent (safe):

POST /api/orders?idempotency_key=abc123
→ Retry 3 times → 1 order created

Key principles for AI Systems:

Use an idempotency key for every API call that changes state
Design prompts so that retries give similar results
Avoid side effects that accumulate (for example, sending the same email repeatedly)
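
On the server side, an idempotency key can be handled by remembering the result of the first request with that key. The sketch below is a toy in-memory version; `OrderService` and its methods are hypothetical names, and a real service would store keys in a database with an expiry.

```python
import uuid


class OrderService:
    """Toy order API that deduplicates requests by idempotency key."""

    def __init__(self):
        self._orders_by_key: dict[str, dict] = {}

    def create_order(self, payload: dict, idempotency_key: str) -> dict:
        # A repeated key returns the original order instead of creating a new one.
        if idempotency_key in self._orders_by_key:
            return self._orders_by_key[idempotency_key]
        order = {"order_id": str(uuid.uuid4()), **payload}
        self._orders_by_key[idempotency_key] = order
        return order

    def order_count(self) -> int:
        return len(self._orders_by_key)


service = OrderService()
key = "abc123"  # generated once by the client and reused on every retry
results = [service.create_order({"item": "book"}, key) for _ in range(3)]
```

All three calls return the same order and only one order exists afterwards, so a retry after a timeout cannot create duplicates.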

A Real Example from OpenClaw

OpenClaw has an interesting 3-layer Fallback system:

Layer 1: Auth Profile Rotation
  - rotate between credentials (OpenAI, Anthropic)

Layer 2: Thinking Level Fallback (per profile)
  - high → medium → low → off
  - reset on profile switch

Layer 3: Model Fallback
  - claude-opus → gpt-4o → gemini-2.0-flash → default
  - triggered after Layers 1 & 2 are exhausted

Error classification:

Error Type	Handling
Billing failures	Do not retry (the problem is with the account)
Rate limits	Retry with backoff (temporary)
Model errors	Fallback to alternative models

A known problem (Issue #6230):

“When primary model fails and agent falls back to lower-priority model, it stays on fallback permanently until manually restarted. No automatic recovery path.”

This is a problem seen in a real system: once the system has fallen back to a lower-priority model, it does not return to the primary model when the primary recovers.
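
One general way to avoid getting stuck on a fallback is to re-probe the primary after a cooldown. The sketch below illustrates that idea only; it is not OpenClaw's implementation or its fix for the issue, and the class name, parameters, and model labels are hypothetical.

```python
import time


class FailbackSelector:
    """Pick the primary model when healthy; re-probe it after a cooldown."""

    def __init__(self, primary: str, fallback: str, cooldown: float = 60.0,
                 clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can control time
        self.primary_failed_at = None  # None means the primary is considered healthy

    def select(self) -> str:
        if self.primary_failed_at is None:
            return self.primary
        if self.clock() - self.primary_failed_at >= self.cooldown:
            return self.primary  # cooldown elapsed: probe the primary again
        return self.fallback

    def report(self, model: str, ok: bool) -> None:
        if model != self.primary:
            return
        # A successful probe restores the primary; a failure restarts the cooldown.
        self.primary_failed_at = None if ok else self.clock()
```

The caller reports the outcome of each primary call with `report`, so a failed probe simply sends traffic back to the fallback for another cooldown period.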

Best Practices from OpenClaw:

Use clawbot doctor --fix for auto-repair
Create a backup before changing configuration
Use a systemd service for auto-restart
Distinguish between error types (billing vs rate limit)

Real Examples from LangChain + AutoGen

LangChain

LangChain ⁴ has built-in retry mechanisms and fallback support:

from langchain_groq import ChatGroq

llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.0,
    max_retries=2,  # built-in retry count for failed requests
    timeout=30,  # seconds
)

RunnableWithFallbacks:

# primary_chain, backup_chain1 and backup_chain2 are assumed to be existing Runnables.
# with_fallbacks() wraps the primary in a RunnableWithFallbacks.
chain_with_fallback = primary_chain.with_fallbacks(
    [backup_chain1, backup_chain2]
)

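Tool Error Handling:

from tenacity import retry, stop_after_attempt, wait_exponential

# SomeTool, TransientError, PermanentError and fallback_response are
# placeholders for your own tool class, exception types and default reply.

# Option 1: enable handle_tool_error on the tool
tool = SomeTool(
    handle_tool_error=True  # or pass a custom function
)

# Option 2: wrap the call in try/except with retries
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1))
def safe_tool_call(tool, input_data):
    try:
        return tool.run(input_data)
    except TransientError:
        raise  # re-raise so tenacity retries
    except PermanentError:
        return fallback_response  # do not retry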

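LangGraph-Style Error Handling (illustrative):

from langchain.memory import ConversationBufferMemory
from langchain.agents import AgentExecutor

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# `agent` and `tools` are assumed to be defined elsewhere
executor = AgentExecutor(agent=agent, tools=tools, memory=memory)

# Error handling with state tracking.
# NOTE: ErrorHandlerNode and the `context` object are illustrative
# placeholders, not classes shipped with LangChain or LangGraph.
error_handler = ErrorHandlerNode(retry_limit=3)

def handle_error(context):
    # After too many errors, stop retrying and take the alternative path
    if context.error_count > error_handler.retry_limit:
        context.switch_to_alternative_path()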

AutoGen (Microsoft)

AutoGen ⁵ v0.4 has resilience features designed for production:

Robustness: if one agent fails, the system can re-route the task to another agent
Fault Tolerance: designed for production environments
Scalability: scale agents with demand

Checkpointing & Recovery:

# Save and restore agent state so a workflow can resume later.
# `agent` is assumed to be an existing AutoGen v0.4 AgentChat agent;
# run these lines inside an async function.
state = await agent.save_state()  # take a checkpoint
await agent.load_state(state)  # later: resume from the checkpoint

Multi-Agent Failure Recovery:

“Traditional circuit breakers assume stateless services. AI agents violate these assumptions due to stateful nature, learning capabilities, and context maintenance.”

Multi-Agent Systems need:

Shared Circuit Breaker State: agents share circuit breaker status
Progressive Traffic Restoration: bring traffic back gradually
State Synchronization: reconcile state after recovery
Conflict Resolution: resolve conflicts between agents' views of the state

Best Practices + Monitoring

Design Principles

Design for Failure: assume everything will fail
Fail Fast: report failures as early as possible
Graceful Degradation: reduce capability step by step
Observability: be able to see everything that happens
Idempotency: safe to retry

Implementation Checklist

Retry Logic:

Exponential backoff with jitter
Max retry limits
Different strategies per error type
Timeout per attempt

Circuit Breaker:

Failure threshold configuration
Half-open state testing
Metrics & monitoring
Integration with fallback

Fallback:

Define fallback hierarchy
Test fallback paths
Monitor fallback usage
Human handoff procedure

Recovery:

Checkpointing strategy
State persistence
Rollback capability
Recovery testing

Metrics to Track

Metric	Description	Alert Threshold
Error Rate	% of requests that fail	> 5%
Retry Rate	% of requests that need a retry	> 20%
Circuit Breaker State	Open/Closed/Half-open	Stuck Open > 5 minutes
Fallback Rate	% of requests served by a fallback	> 10%
Recovery Time	Time taken to recover	> 30 seconds
MTTR	Mean Time To Recovery	> 5 minutes
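
The thresholds in the table can be checked with a small helper. This is a sketch under stated assumptions: the metric key names and the `breached_alerts` function are hypothetical, rates are expressed as fractions, and durations are in seconds.

```python
# Alert thresholds taken from the table above (rates as fractions, times in seconds).
THRESHOLDS = {
    "error_rate": 0.05,
    "retry_rate": 0.20,
    "fallback_rate": 0.10,
    "recovery_time_s": 30,
    "mttr_s": 300,
    "breaker_open_s": 300,
}


def breached_alerts(metrics: dict) -> list[str]:
    """Return the sorted names of metrics that exceed their alert threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )


sample = {"error_rate": 0.07, "retry_rate": 0.12, "fallback_rate": 0.15, "mttr_s": 120}
alerts = breached_alerts(sample)
```

With the sample values, the error rate and fallback rate exceed their limits while the retry rate and MTTR do not. Metrics that are missing from the input are treated as zero, which a real monitor would probably flag separately.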

Summary

Today we covered the design of Recovery Paths end to end:

Part 1:

Why Recovery Paths are needed
Auto vs Manual Recovery
Fallback Strategies (Model → Provider → Capability)
Retry Logic + Circuit Breaker

Part 2:

Rollback Strategy: Checkpointing and Idempotent Operations
Example from OpenClaw: a 3-layer Fallback system
Examples from LangChain + AutoGen: built-in mechanisms
Best Practices: design principles + metrics to track

Key lessons:

Design the system assuming everything will fail
Have multiple levels of Fallback (Model → Provider → Capability)
Use a Circuit Breaker to prevent Cascading Failures
Checkpoint so that Rollback is possible
Make everything that can be retried Idempotent
Track Metrics continuously

AI Systems need more care than typical software because they have many points of failure. But if we design them well, a system can “recover on its own”.

References

1. Uptime Institute Annual Report 2024 - uptime statistics for systems without a fallback mechanism
2. Google SRE Book - Reliability Targets - uptime comparison between systems with and without automatic recovery
3. Portkey.ai Documentation - Circuit Breakers - best practices for the circuit breaker pattern in AI systems
4. LangChain Documentation - Retry and Fallback - built-in retry mechanisms and fallback support
5. Microsoft AutoGen Documentation - resilience features for multi-agent systems