Overview
At approximately 02:00 there was a short break in communication from a significant number of communicators deployed in the field and connected to the WebWay host platform.
System operational investigation
• There was no failure of WebWay hardware prior to or during the incident (servers, hard drives, memory, routers, switches etc.).
• There was no failure of WebWay software applications prior to or during the incident (receiver platform, network device monitoring systems, bandwidth monitoring).
• There are no indications that a software bug or some form of attack on the system was a contributing factor.
Network operational investigation
The MCTs handled the recovery process (a mass recovery is normally processed within 2 minutes); however, data synchronisation between the data centres was severely impaired. There was a noticeable degradation of the data rate over the interlinks between the replicated receivers, and the duration of this interlink fault caused significant synchronisation delays. A controlled recovery process was therefore instigated, which allowed the MCTs to process the recovery reliably and return to stable operation.
Conclusions and remedial action
The combination of a short, mass network outage and a degradation of the inter-MCT link was an unprecedented situation. WebWay is in discussion with the data centre provider regarding the bandwidth and resilience of the interlink service. WebWay has identified elements of the outage that can be recognised in future, and recovery procedures are being amended accordingly.
Additional information
The inter-MCT circuit returned to normal operation at approximately 11:00 on the 13th. Manual Flood was activated at 02:55. PSTN devices were brought online in parallel, as their impact on system performance was minimal due to their slower polling and recovery rates. WebWay has received information from partner ARCs that there was a national BT Internet outage within the timeframe of the incident; this will be investigated.