BART trains sat still last Saturday morning, the familiar screeches absent across the Bay Area. But for days after thousands of riders were stranded, one question lingered:
How did this happen?
At Thursday’s BART Board of Directors regular meeting, staff told directors that the entire system was felled by a single “network switch” failure.
Essentially, a conduit for information between the train control system and equipment failed, a broken link in a chain of communication.
“The type of failure that occurred this weekend is very rare,” said Tamar Allen, assistant general manager of operations at BART.
The last time this kind of failure occurred was 2006, she told directors.
“This weekend’s failure was due to a failure of the network switch itself. It was a component failure,” she explained.
A failure like that may never hit BART as hard again, thanks to new efforts underway to prevent this “type and magnitude” of failure, Allen said.
Among them is the installation of a “remote redundant” disaster recovery center, which will be built out within the month, Allen said. That is being paid for with federal funding and was already underway when the network switch error occurred.
At about 2:45 a.m., this “single network switch,” one of many in a complex network, failed. It recycled data and generated a “data spike,” Allen said, that rippled throughout BART. BART lost communication between its operation control center and all equipment in the field. Cisco Systems helped BART identify the issue.
The data spike meant that in that single network switch “the number of data packages requiring processing increased from one per millisecond to 54,000 per millisecond,” Allen said. That overwhelmed the network switch.
That network failure also prevented BART from enabling trip advisories online and in apps to warn riders that the system was down.
By about 9 a.m., the switch problem was largely resolved, Allen said. Trains were running systemwide except south of Daly City.
“We brought in all of our engineers who work in communication and engineering,” Allen said. “It was a large effort, and a lot of people contributed” to restoring the system.
All field sites had to be manually rebooted, she said.
Despite what was described as a “herculean” push by staff to get trains back up and running, the BART Board of Directors took staff to task.
“I clearly hear it was a gargantuan effort to get the system back up and running,” said Janice Li, a BART board director who represents San Francisco. Still, she asked BART staff for assurances this would not happen again.
“I heard so many stories,” she said, of riders left in the rain, of riders inconvenienced or stranded.
Allen told Li, “When you’re running large data networks, errors do happen,” noting that “as you may have been aware, Facebook was down much of yesterday” due to the same problem. And with the new redundant data center, such problems may be avoided.
“As president of the board I want to recognize that we did let riders down on Saturday,” said BART Board of Directors President Bevan Dufty. He said he was “grateful to all our staff” for resolving the issue, and for the new redundant system that would prevent subsequent failures.
“We have to demonstrate that situations like this will not happen and we will be a reliable system,” Dufty said.