SYXCLOUD - IOT
Incident Summary
• Incident Start: 15:36
• Incident End: 18:25
• Duration: 2 hours 49 minutes
• Impact: Some customers were unable to use the IoT platform.
Incident Description
At 15:36, the IoT application cluster experienced an issue that caused multiple worker nodes to become unresponsive. The application is hosted on a high-availability (HA) cluster with full server-level redundancy. However, a critical component of the cluster, the cluster state database running on three separate nodes, had expired certificates, which made it unresponsive and unreachable. Some worker nodes had the same expired-certificate problem, so IoT workloads could no longer be scheduled on them. Worker nodes added to the cluster later did not experience the issue, but could not fully take over the load from the failing nodes.
The cluster setup is provided by an external supplier, who should have either notified us of the approaching certificate expiry or remediated it automatically. Because the setup is supplier-managed, it is a “closed” system, and we have no direct insight into its internal workings or specific configuration.
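Even though the supplier-managed setup is a closed system, the expiry date of any certificate presented on a reachable TLS endpoint can still be inspected from the outside. The sketch below illustrates such a check; the host names and ports are placeholders rather than our real endpoints, and it assumes the third-party cryptography package so the certificate can be read even after it has expired.

```python
import socket
import ssl
from datetime import datetime, timezone

from cryptography import x509  # third-party: pip install cryptography


def certificate_expiry(host: str, port: int) -> datetime:
    """Fetch the TLS certificate served at host:port and return its expiry (notAfter)."""
    # Verification is deliberately disabled so the check still works when the
    # certificate has already expired or is signed by a private cluster CA.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der_cert = tls.getpeercert(binary_form=True)
    cert = x509.load_der_x509_certificate(der_cert)
    return cert.not_valid_after.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    # Placeholder endpoints: the real cluster state database and worker node
    # addresses are internal to the supplier-managed setup.
    for host, port in [("cluster-state-db-1.example.internal", 2379),
                       ("worker-01.example.internal", 10250)]:
        expires = certificate_expiry(host, port)
        remaining = expires - datetime.now(timezone.utc)
        print(f"{host}:{port} certificate expires {expires:%Y-%m-%d %H:%M} UTC "
              f"({remaining.days} days remaining)")
```

Run periodically against every reachable cluster endpoint, a check of this kind would have flagged the expiry well before the cluster state database became unreachable.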
Detection
Our monitoring system notified us of the problem with the IoT workload. Upon receiving the alert, we immediately began investigating.
Root Cause
The root cause of the incident was expired certificates on a critical component of the cluster, the cluster state database, and on some worker nodes. The expired certificates prevented the database and the affected nodes from functioning properly.
Resolution
- Initial Investigation: Upon identifying the expired certificates as the cause, we attempted to remediate the issue using a tool provided by our supplier. Unfortunately, the tool failed to complete the certificate rotation.
- Escalation: We created a high emergency case with our supplier and contacted our direct representative to expedite the process.
- Supplier Support: During a support call with the supplier, we provided details of our investigation and attempted solutions. The support representative had to consult an internal knowledge base.
- Manual Renewal: The certificates were manually renewed on the critical component, which allowed the certificate renewal tool to function again.
- Recovery: Running the tool made all worker nodes responsive, and the IoT workload resumed normal operation.
Follow-Up Actions
- Permanent Fix: The expired certificates were renewed, resolving the immediate problem.
- Improved Redundancy: We will enhance the cluster setup to increase redundancy.
- Supplier Accountability: We will pressure our supplier to ensure this problem does not recur, including better notification and automatic remediation processes.
- Multi-Cluster Solution: Since the beginning of the year, we have been exploring solutions to run the IoT application on multiple clusters. This would allow for continued operation even if an entire cluster goes down.
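To illustrate the intent behind the multi-cluster approach, the sketch below shows the simplest form of failover: probing two cluster endpoints and routing to the first healthy one. The endpoint URLs and the /healthz path are placeholders, and client-side failover is only one possible design under evaluation, not a committed implementation.

```python
import urllib.error
import urllib.request

# Hypothetical ingress endpoints of two independent clusters running the IoT application.
CLUSTER_ENDPOINTS = [
    "https://iot-cluster-a.example.com",
    "https://iot-cluster-b.example.com",
]


def first_healthy_endpoint(endpoints, timeout=3):
    """Return the first endpoint whose health check answers, or raise if none do."""
    for base_url in endpoints:
        try:
            # "/healthz" is a placeholder health-check path.
            with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
                if resp.status == 200:
                    return base_url
        except (urllib.error.URLError, OSError):
            continue  # endpoint unreachable or unhealthy -> try the next cluster
    raise RuntimeError("no healthy IoT cluster endpoint available")


if __name__ == "__main__":
    active = first_healthy_endpoint(CLUSTER_ENDPOINTS)
    print(f"routing IoT traffic to {active}")
```

In practice the failover decision would more likely live in a load balancer or DNS layer than in each client, but the effect is the same: losing one cluster no longer takes the IoT workload down.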
Timeline
• 15:36: Incident begins; monitoring system alerts us to the issue.
• 15:45: Initial investigation starts.
• 16:10: Expired certificates are identified as the root cause.
• 16:20: Remediation is attempted with the supplier-provided tool; the tool fails.
• 16:30: High emergency case is created with the supplier; our direct contact is notified.
• 17:00: Support call with the supplier; certificates are manually renewed on the cluster state database.
• 18:00: The certificate renewal tool completes successfully; worker nodes become responsive.
• 18:25: IoT workload fully operational; incident resolved.
Lessons Learned
• Monitoring and Alerts: Ensure robust monitoring and alerting for certificate expirations (a minimal check is sketched after this list).
• Supplier Management: Establish clear communication and accountability with external suppliers for critical components.
• Redundancy Plans: Accelerate efforts to implement multi-cluster solutions to enhance system resilience.
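As a concrete starting point for the first lesson, the sketch below shows a scheduled check that alerts well before a certificate lapses. The endpoint list, the 30-day threshold, and the print-based alert are placeholders for whatever our monitoring stack ends up using.

```python
import socket
import ssl
from datetime import datetime, timezone

# Placeholder inventory; the real list would cover the cluster state database
# nodes and every worker node. Host names and ports are illustrative only.
WATCHED_ENDPOINTS = [
    ("cluster-state-db-1.example.internal", 2379),
    ("worker-01.example.internal", 10250),
]
ALERT_THRESHOLD_DAYS = 30  # assumed lead time; tune to the renewal process


def days_until_expiry(host: str, port: int) -> int:
    """Return whole days until the certificate served at host:port expires."""
    # The default context verifies against the system trust store; a cluster
    # using a private CA would need that CA added via cafile=... here.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]  # e.g. 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), timezone.utc)
    return (expires - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    for host, port in WATCHED_ENDPOINTS:
        days = days_until_expiry(host, port)
        if days < ALERT_THRESHOLD_DAYS:
            # Placeholder alert: the real hook would page on-call or open a ticket.
            print(f"ALERT: certificate on {host}:{port} expires in {days} days")
```

A failed handshake, for example because a certificate has already expired, should itself raise an alert, which is why this check belongs in the monitoring system rather than only in the renewal tooling.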
This incident highlighted the need for improved redundancy, better supplier management, and proactive monitoring to prevent similar issues in the future.