Cisco Webex down (28.7.2022)

Blog reader Gerald just informed me by e-mail (thanks for that) that the video conferencing service Cisco Webex is experiencing an outage. I took a quick look: the disruption began at around 19:00 German time and lasted until around 22:00. The Webex Room Systems service is currently (22:41) still listed with a "Major Impact", and Cisco also still lists a disruption for Webex Cloud Registered Device. So the disruption appears to be slowly subsiding. Addendum: Cisco's explanation added.

Gerald sent me the following screenshot of the Webex status a few hours ago. There was a lot going on.

Cisco Webex status

The following outage chart is available on allestoerungen.de; it shows the start and end of the disruption on July 28, 2022.

On Twitter, Webex support confirmed problems with the service two hours ago. The messages read:

We're aware of issues affecting several Webex services and our engineering team is hard at work. We apologize for the inconvenience. Regular updates will be shared here as soon as we have them. We appreciate your patience.

In addition, users may experience difficulties logging into Webex Control Hub and managing their Webex Cloud device. All hands are on deck to restore services and we apologize for the inconvenience this may cause.

Update to service (1/2) Engineering is investigating an issue which is causing connectivity issues with the Webex App. This may include logging in, sending messages and files, presence issues, and starting or joining meetings in the Webex App.

But the engineers seem to have gotten the problem under control quickly, because 44 minutes ago the message was:

Engineering is continuing to take remediation steps to restore services. We appreciate everyone's patience while we are all hands on deck to address the incident.

For users who are still unable to log into the Webex App, we recommend that you close and re-launch the Webex App. We will continue to provide updates as they become available. Once again, we appreciate your patience today.

And the latest message, from a few minutes ago, confirms that the service is apparently largely working again.

The majority of services have been restored and are operational. Some users may still experience intermittent issues accessing their Webex board and devices.

Here is the current overview (22:38) from the Cisco Webex status page, which now shows only residual disruptions.

Interesting in this context is the following tweet, which also reports an AWS problem on the US East Coast.

Cause of the outage

Addendum: Blog reader Michael T. sent me Cisco's explanation of the reasons for the outage by e-mail. There was apparently a problem in a data center that affected the services, and the redirection to other zones then failed. I'll simply append it here:

Webex Services Incident

Incident Number: INC0047261
Incident Duration: July 28, 2022, 16:55 – 19:55 UTC

Incident Details
At 16:55 UTC on July 28th, a service provider hosting some of the Webex services experienced a significant outage. Services for messaging, devices, authentication, and analytics were hosted in the affected service provider data center, which caused multiple microservices to fail. Users were unable to authenticate to Webex, register or connect to meetings consistently from Webex devices, use messaging, or log into the Control Hub administration console. The Webex status page was also hosted in the same service provider data center, which caused delayed access to the status page.

Root Cause
The Webex engineering team attempted to redirect services outside of the affected environment; however, the redirects were not successful due to the outage affecting multiple redundant service zones. Due to a core component of the service architecture becoming unavailable, the redirects to alternative zones were ineffective. Engineering was unable to successfully stop the services in the hosted datacenter due to connectivity failures. This caused services to remain unstable as device and software clients continued to send connection retries. The combination of incomplete service termination, the unhealthy data center, and the multiple client retries caused the connections to overload the available service capacity, and the capacity of the edge and microservices in the redundant environment was unable to process the traffic.

Corrective Actions
Engineering worked with the service provider to restore the core service architecture and scaled up the edge and service capacity within the environment which allowed the clients to successfully connect, and the additional traffic generated by the retries to stop. This caused services to stabilize, which allowed engineering to take additional remediation steps, including restarting unhealthy instances and additional load balancing, leading to full service recovery. Engineering completed remaining service clean-up and load balancing, and services were fully recovered at 19:55 UTC.

The service team has identified areas of improvement for our incident response time, including changes to the service metrics for faster root cause identification, improved runbooks to speed up scale-up and deployment of additional micro-services, and revisiting client retry values which will improve service restoration should a similar incident occur in the future. The messaging architecture team is also engaged to identify service architecture improvements which will allow Webex services to better withstand a multi-zone failure.
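
A side note on the client retries mentioned in the statement: one common way to keep reconnecting clients from overwhelming a recovering service is exponential backoff with random jitter. The following is only a minimal Python sketch of that general technique, not Cisco's actual client logic; the names `backoff_delay` and `connect_with_retries` and the values for `base`, `cap`, and `max_attempts` are assumptions chosen for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=None):
    """Return a randomized wait time in seconds for the given retry attempt (0-based)."""
    if rng is None:
        rng = random.Random()
    # Exponential ceiling: base, 2*base, 4*base, ... but never more than `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    return rng.uniform(0.0, ceiling)


def connect_with_retries(connect, max_attempts=6, base=1.0, cap=60.0,
                         sleep=time.sleep, rng=None):
    """Call `connect()` until it succeeds or `max_attempts` is exhausted.

    `connect` is a hypothetical callable that raises ConnectionError on failure.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(backoff_delay(attempt, base, cap, rng))
```

Because each client waits a different, growing amount of time, the retries spread out instead of arriving as one synchronized wave, which is the kind of load pattern the statement describes as a problem.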

Timeline
16:55: Webex alerts indicate multiple service failures; incident process began
17:15: Engineering began service redeploys
17:33: Services fully stopped in the affected DC
17:45: Service provider reported partial recovery
18:00: Metrics indicate high traffic volume due to client retries
18:30: Multiple scale-up efforts completed
19:10: Capacity increase across both pools completed; 80% increase in traffic observed. Service redeploys began
19:45: Vendor confirms services fully restored. Additional capacity deploys completed in all three zones
19:55: Final redeploys completed; services restored.
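
To relate the UTC times in the statement to the German times mentioned at the top of this post, the incident window can be converted with a few lines of Python. This is a small sketch; the helper `at_utc` is hypothetical, and a fixed UTC+2 offset is assumed for Central European Summer Time, which applies to Germany in July.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+2 offset for Central European Summer Time (assumption: valid for July 28, 2022).
CEST = timezone(timedelta(hours=2), "CEST")


def at_utc(hhmm):
    """Build a timezone-aware UTC datetime on July 28, 2022 from an 'HH:MM' string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2022, 7, 28, hour, minute, tzinfo=timezone.utc)


# Start and end of the incident as given in the statement.
start, end = at_utc("16:55"), at_utc("19:55")

duration = end - start
local_start = start.astimezone(CEST).strftime("%H:%M")
local_end = end.astimezone(CEST).strftime("%H:%M")

print(duration, local_start, local_end)  # prints: 3:00:00 18:55 21:55
```

This gives a duration of three hours and a local window of 18:55 to 21:55, which fits the roughly 19:00 to 22:00 observed above.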

7 comments on Cisco Webex down (28.7.2022)

Adrian W. says:

July 28, 2022 at 23:43

No harm done if this solution, which shines with its permanent security holes, is down :-)
And they don't take data protection all that seriously either.

Ludwig L. says:

July 29, 2022 at 09:09

@Adrian W.
I could perhaps understand the data protection point, but which "permanent security holes" do you mean, please?

- Klaus says:
  
  July 29, 2022 at 12:22
  
  Here is a starting point for researching that:
  https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=webex
  
  - Ludwig L. says:
    
    July 29, 2022 at 12:40
    
    Hmm… fair enough.
    However, in my experience the vulnerabilities were also fixed within a very short time…
    
    And if you look up the competitors' CVEs, it doesn't look any better there either…
    
    Which "secure" yet equivalent solution do you see here, then?
    
    - Klaus says:
      
      July 29, 2022 at 12:59
      
      > Which "secure" yet equivalent solution do you see here, then?
      
      There is none.
      
      Some companies and government agencies run Skype Business on-premises on their own servers.
      
Adrain W. says:

July 29, 2022 at 16:30

@ Ludwig L
The statements are very contradictory: "but which permanent security holes do you mean, please?"
After Klaus's "endless list" (thanks to Klaus :-) comes the reply "Hmm… fair enough. However, in my experience the vulnerabilities…"
Ergo, the attentive reader must assume that the vulnerabilities are known, since one (apparently) has experience with them, which in turn contradicts "but which permanent security holes do you mean, please?".

Michael T. says:

July 29, 2022 at 20:20

I do find the statement about Webex interesting, the one about the "permanent security holes".

The long list posted here goes back to 2006; many other solutions didn't even exist back then. If you compare that with the other programs, those have had more holes per year than Webex ever since they have existed.
