Date: 2023-03-07
Courtney Nash’s LinkedIn profile says she is an internet incident librarian. Her job is to keep track of the ways things fail and go wrong on the web.
Over the past three months, Nash, a senior researcher with VOID, or Verica Open Incident Database, a public database of reports on internet failures and outages, has been busy keeping track of Twitter.
The social network has reeled from outages, like the one that hit on Monday, ever since Elon Musk took over in October.
Nash points to a key reason for the disruptions: huge layoffs that wiped out roughly two-thirds of the Twitter workforce in about 100 days. The cuts included Twitter engineers who, Nash said, played critical roles in maintaining one of the largest and busiest social platforms in the world.
“Those people had years, in some cases, a decade of understanding how that system was built, how it's supposed to work,” Nash told The Examiner. “They have a lot of intimate knowledge of what looks weird and wrong.”
Twitter could not immediately be reached for comment.
In an interview with The Examiner, Nash talked about how she’s been impressed by how well the Twitter platform has held up during a tumultuous transition, and why she and others are worried about the degradation of the site amid the uncertainty under Musk.
This interview was edited for clarity and brevity.
Based on your monitoring, how is the Twitter platform doing?
A lot of us foretold doom and gloom and many of us will admit we were wrong about how bad it would get. But it is showing creaks and cracks, the kinds of failures that those of us that study complex systems were predicting and the people who worked at Twitter were also predicting.
We're seeing more performance degradations, more wholesale outages, things just taking longer to even work. There are a lot of reasons why that could be. But the biggest attributable one that I think everybody has been right about is the loss of expertise for managing that system.
What do you mean by that?
The people are gone. Did you need 7,000 people to run Twitter? Maybe not. [From roughly 7,500 employees when Musk took over, Twitter reportedly now has a workforce of about 1,800.] Do you need more than 1,800? Probably. That's what we're seeing.
Twitter's been architected technically to stay up this long. It's a very resilient distributed system. It was designed to tolerate a lot of turbulence.
But part of that was there were people behind the scenes. When the turbulence started to really increase, they knew how to adapt and how to deal with things and what parts of the systems might be creaking and breaking. A lot of those people are gone.
Can you break that down? Who are these people and what were their responsibilities that are now not being met?
The layoffs have affected Twitter employees across the board. In particular, we've seen a loss of people who are on what we would consider the infrastructure side – people who know how the databases work, how the tooling works, how their deployment tools work, people involved in incident response. Most of those people are gone.
Those people had years, in some cases, a decade of understanding how that system was built, how it's supposed to work. They have a lot of intimate knowledge of what looks weird and wrong. They can go in and look at all kinds of tools. They can look at monitoring and graphs and all this stuff.
A lot of these people in the industry are called site reliability engineers. SREs is a term you may see and they're really the folks who are responsible for site reliability. Those folks are all gone.
If you just take someone who doesn't know how all the plumbing works, and something's wrong, they don't even know where to look.
Can we drill down a bit? How are the problems being manifested in terms of what's been reported, things Twitter has acknowledged externally?
One of the things they did that was reported on and was pretty obvious was they took one of their data centers out of the equation. They used to have three data centers. And now they only have two. And I think it was one day after they got rid of the Sacramento data center that they had one of their biggest outages.
Typically, the entire Twitter was architected or designed to allow what's called failover. So if you have a problem, you can move things from one data center to another. So when they just took out Sacramento, the people who had designed that data center, the people who had built the systems weren't there to even tell you what would go wrong when you got rid of it.
That was one of the things I think that really manifested and a reported fact that I could point to. There's lots of other things I can't tell you specifically. But that's a really obvious one. They did that so fast that there's no way they could have planned for all of the ways their system relied on that one data center.
Let’s go back to something you said. There were predictions that Twitter would just collapse after Musk took over, but it did not.
The thing we didn't account for – and in hindsight, it's fantastic, right? – is how well built Twitter really was. I mean, it was the site we've all loved to hate, or hate to love.
It kind of always existed in a degraded state, because there's so many pressures and it was constantly changing and scaling. But it was really well built. So you could go from 7,000 to 1,800 employees and still, the fundamentals were keeping it running.
So this is credit to (former CEOs) Parag Agrawal and Jack Dorsey?
It's a credit to the engineers. From up high, there were business goals, and all of these things, and those included advertising and content moderation. But the engineers had to decide how to support that. And they also had to make that work over a long time.
Twitter's been around for a long time. I'm not going to get embroiled in the arguments about whether it was a successful business then or not. I think your point is right, it's still succeeding to a degree. We have not seen how the financials will play out. But it is still working. And I do think that's because the systems were built to be very adaptive and resilient.
Do I think you can still maintain what it was with 1,800 people? That still remains to be seen. But we're still skeptical.
Why?
To be financially viable, you need technical systems that can be incredibly responsive to a lot of change and experimentation. If one were to be objectively kind about what Musk is doing, there's a lot of experimentation and a lot of change.
You need expertise to make your systems potentially do things they didn't do before. An example of a thing that didn't go so well was when they tried to change how Twitter Blue works, how the blue check stuff works, and to try to do that rapidly.
If it's going to succeed and try to adapt to new business models and new things, that's going to be hard to do with a skeleton crew. It's going to be hard without long-earned expertise about how the guts of Twitter work.
If you want to build on top of that, and you want to try and do new things and push those systems in really unique ways, the expertise on how the system originally worked still needs to be there. And most of that's gone.
Could you do it with 1,800 people? Maybe – but not 1,800 people who don't know how it really should work. The expertise is so invaluable. It's not the sheer number of people. It’s what was in the brains of the ones who aren't there anymore.
There are also concerns about a spike in hate speech on Twitter. Elon Musk himself has made controversial statements or posts. What is your own call on how this could play out?
I would just say that if fighting hate speech is a business value for Twitter, then they don't have the investment in that currently and they don't have the staffing for it. If they're worried about advertising, and advertising perception, same as infrastructure, you need to support that, which they're not doing.
You need a team and the expertise, essentially, to make sure that the kind of content that could turn off advertisers doesn’t appear next to the ads that the advertisers want users to see.
Yeah. You need non-technical people to make decisions around that and to ensure that. And you need the technical people to help make sure that doesn't show up next to that advertisement. As far as we know, those people aren't there anymore.
What has been the most troubling development for you on how things have played out under Elon Musk in the last 100 days?
I worry that other business leaders will think that gutting your expertise is a business model that will work. We've all seen that you can't just get rid of experts and folks who know how systems run and carry on. But this is a grand experiment being conducted on a public stage. And we're seeing an economic climate right now where reducing workforce and staff is very appealing to business leaders. That would be my concern – a crisis of expertise, if you will, and that we feel that we can get rid of experts and still carry on and everything will work as desired.