I wonder how many people lost their jobs over this?
It's an oddity how many IT folks preach about backups, disaster recovery, and the like yet walk back into their glass house every single day and fail to practice what they preach.
It doesn't just apply to backups either. Password expiration, weak passwords, bad permissions, one-time fixes, and the rest seem to be commonplace. It's absolutely amazing how lazy IT folks can be when it comes to doing their own job.
@UnderLoK: It's simple. Everybody KNOWS you HAVE to have backups and a disaster recovery plan. Everybody KNOWS that their backup routine is solid, and that their disaster recovery plan is the best possible.
And then the disaster hits, and you check your backups... oh, ooops. That 9-track tape you had your data on was failing silently. Ooops, the offsite data got overwritten because your software was buggy. Dang! That disaster recovery plan forgot to actually TEST the backup hardware...
The problem with backups and disaster recovery is that if ANYTHING goes wrong, AT ALL, and there's lots that can go wrong, you look completely incompetent to everyone watching, even if it was a good plan. Because to the end user there's no difference between "it SHOULD have worked but the vendor-provided backup solution had a minor bug which erased your data" and "there was no chance in hell your data was surviving, and those of us over in IT are laughing our asses off at you". #tmobile
@DennyCraneDennyCraneDennyCrane: That's the whole point, they know the terms, but they don't know what they really mean or how to implement them or are just flat out lazy. #tmobile
@UnderLoK: Well, I agree in general, but I think that doing it RIGHT at a large scale is really, really hard. And even if you think you've got it right, you can get it wrong, even if you know what you're doing. And, even then, you can always get some corner case that nobody thought of, that can make it end badly.
Why? Because backing up Petabytes is HARD. It takes a long, long time. Testing it takes even longer. If there's corruption in a backup, sometimes even an otherwise good testing routine can miss it. Sometimes, hot-swap hardware is tested and verified, but fails because the data is bad. Sometimes the hot-swap hardware is tested and verified, but fails because it wasn't used for so long, dust built up and caused a short.
And if ANYTHING goes wrong, at any step, you look like a fool. If you pronounce a firm 'recovery' deadline, and miss it because of an unforeseen problem, you look like a fool. Not to mention, on top of this, you have an expectation of staying within budget, and doing it right is EXPENSIVE: not everybody can afford multiple redundant servers separated geographically, with multiple ISPs, redundant power supplies and power sources, extensive testing of the tape backups, et cetera. When the boss turns down any one of those, and it results in downtime, YOU still look like the fool. The user is still out their data.
It's easy to say "they preach it, but don't follow it", but in reality, they preach it because they understand how very, very hard it is, and most people don't understand that.
Wow, that went long. Anyway, long story short: I don't work with backups, and dear god am I glad I don't... #tmobile
@DennyCraneDennyCraneDennyCrane: I wasn't talking about just this incident, I was speaking in generalities and from personal experience, in that the people who talk about it the most are oftentimes the biggest offenders when it comes to breaking IT policies and processes.
I don't question the difficulty because I've been there (not on that scale obviously), but that is what you're paid for and if you can't do the job correctly or have the balls to tell the bosses what more you need, you have no business being there. #tmobile
This makes me angry. No, I'm not mad at the companies involved. I'm mad at the massive toddler bitchfit that resulted from this outage. Give me a break, it's not that big of a deal, people. You can live without your phone for a little bit, it's not essential to your life. They act as if their mental, emotional, and physical health was drastically affected. Lawsuits? Greedy, opportunistic bastards. Did they consider using their newfound phoneless freedom to go outside and/or enjoy some peace and quiet? Take a walk in the park or something, occasionally it's nice to be tech free. I realize that some people may have sincerely needed their phone for emergencies, work, etc., but I would bet the majority of those affected are just whining like spoiled children. #sidekickdatarecovered
And the backup tapes?
So... assuming that was technically possible, part of the timebomb then looked and functioned like the backup software, but erased the tapes while reporting that it was backing up data, and did this far enough in advance that all of the sets of tapes were wiped?
Could happen... in theory... but it would take a hell of a lot of work and understanding of the systems they were working on - deep enough to send commands to the tape drive from an original program, or maybe run some console/invisible instance of the real backup program with instructions to silently wipe tapes, while the frontend displayed normal status.
...or they sat there themselves some time and wiped every tape manually - ideally from someone else's account or an admin/service account to defeat the audit logs.
I don't buy it though - that would be some unprecedented malice, risk, and orchestration to make it come together.
I Can't Stand It, I Know You Planned It
Ima Set It Straight, This Watergate
I Can't Stand Rockin' When I'm In Here
'Cause Your Crystal Ball Ain't So Crystal Clear
So, While You Sit Back And Wonder Why
I Got This Fucking Thorn In My Side
Oh My God, It's A Mirage
I'm Tellin' Y'all It's Sabotage
As others have touched on, more likely the backups were garbage and they had a failure which they couldn't recover from.
This kind of thing happens all the time. Admins get lazy, tapes don’t get checked, backups are partial, corrupt, or blank, and that is that. I’ve worked at places that had over 12 months of garbage backups when I started, amazing…
If it was sabotage, the person would have made sure the backups were useless (not hard to do) and then nuked the system, not deleted backup tapes he/she had no access to… I have to believe they did off-site storage of tapes.
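For what "checking the tapes" can look like in practice, here is a minimal sketch, not anybody's actual tooling. It assumes you have already done a test restore into a scratch directory and just want to know whether what came back matches the source. The names `sha256_of` and `verify_restore` are made up for illustration, and it only catches the partial, blank, and corrupt cases mentioned above.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare a test restore against the source tree.

    Returns a list of problems; an empty list means every source file
    was restored with identical contents.
    """
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative}")   # partial backup
        elif restored_file.stat().st_size == 0 and source_file.stat().st_size > 0:
            problems.append(f"blank: {relative}")     # blank backup
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"corrupt: {relative}")   # corrupt backup
    return problems
```

The point is that the check runs against a real restore, not against the backup software's own "job completed" status.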
@UnderLoK: I would agree. The normal thing is to rotate through a series of backup tapes, and you usually have several weeks' worth you can go back to in case of an issue that occurred a while back and forces you to restore to a previous week / month. So a nuke of ALL backups is very unlikely. It's far more likely this was caused by an untested and dysfunctional backup process.
@Rowdy Yates Trail Boss: I used to run into stuff like this fairly often. I worked for an HP/Compaq reseller for years and it made me sick to see the look on people's faces when I said "the array is dead, you need to restore it from tape", because I knew from the look on their faces that they were screwed.
I've seen at least 10 admins fired over the years for bad backups, and once everyone from the CIO all the way down to the PC tech who was swapping tapes was fired (8 or 9 people, including the AS400 guys; even though they didn't lose data, they didn't have backups either), which is also when I started there... What a mess that place was, but it was a fun job.
@OldSchoolGadgetLover: It's not as bad now, but back in the day what usually prevented people from having proper backups was cost. If you did things the right way you had at least 2 weeks of dailies and then 3 months of weeklies and a year of monthlies, but a few of the companies I worked for didn't even have enough money to buy tapes for the monthlies...
I went into a school district around here where I told them to buy a 50 pack of DLTs with their new servers. They cheaped out and bought 6... S I X tapes for 6 servers, I couldn't believe it. I must have brought it up 20 times and they kept saying "sorry we don't have the money". They had also bought an external array which had, I think, 8 or 12 drives in RAID 5, which they were using as storage and backup (again, against my advice).
Long story short that array failed due to the incompetence of a Compaq tech and they lost their data.
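To put rough numbers on the rotation described above (2 weeks of dailies, 3 months of weeklies, a year of monthlies), here is a back-of-the-envelope sketch. The function name `tapes_needed` is made up, and the big assumptions are mine: every backup job fits on exactly one tape, dailies run 7 days a week, and a month counts as 4 weeks. Real jobs can span several tapes, so treat the result as a floor.

```python
def tapes_needed(daily_weeks=2, weekly_months=3, monthly_count=12,
                 backup_days_per_week=7, weeks_per_month=4, servers=1):
    """Count tapes for one full rotation, assuming one tape per backup job."""
    dailies = daily_weeks * backup_days_per_week    # e.g. 2 weeks of dailies
    weeklies = weekly_months * weeks_per_month      # e.g. 3 months of weeklies
    monthlies = monthly_count                       # e.g. a year of monthlies
    per_server = dailies + weeklies + monthlies
    return {
        "dailies": dailies,
        "weeklies": weeklies,
        "monthlies": monthlies,
        "per_server": per_server,
        "total": per_server * servers,
    }


if __name__ == "__main__":
    print(tapes_needed())
```

Under those assumptions a single rotation set comes to 14 + 12 + 12 = 38 tapes, which is why a 50 pack is a sensible order and six tapes is not.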
@UnderLoK: Yeah, I was part of a Fortune 100 data company back in the good old days when big iron ruled (mainframes), supplemented by mid-range systems (AS400s and RS6000s), and we had the whole expensive backup strategy using incremental daily backups (system up), weekly backups of data (interactive systems down, batch systems up), and monthly backups of all things (all subsystems down, backed up user profiles, software, etc.).
We found out the hard way that the dailies were worthless: because we failed to trap for file locks, the job skipped backing up the incremental changes whenever users were signed into the system and using files. Same with the weeklies if batch jobs were running against key files.
When we had a data center AC failure, resulting in multi disk array failures, this little problem came to light, and many full grown men cried.
Guess some things change... and some things stay the same.
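The failure mode above (a backup job that quietly skips files that are in use) can be sketched in a few lines. This is only an illustration in Python, not how an AS400 save job works: `backup_files` and `run_backup` are made-up names, and it assumes a locked file shows up as an `OSError` (such as `PermissionError`) when the copy is attempted, which depends on the platform.

```python
import shutil
from pathlib import Path


def backup_files(files, dest_dir, copy=shutil.copy2):
    """Copy files into dest_dir and report what was skipped.

    Returns (copied, skipped), where skipped holds (name, reason) pairs,
    so a locked file is recorded instead of silently dropped.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    for name in files:
        src = Path(name)
        try:
            copy(src, dest / src.name)
            copied.append(src.name)
        except OSError as error:  # assumption: locked/in-use files raise OSError
            skipped.append((src.name, str(error)))
    return copied, skipped


def run_backup(files, dest_dir, copy=shutil.copy2):
    """Fail loudly if anything was skipped, rather than reporting success."""
    copied, skipped = backup_files(files, dest_dir, copy)
    if skipped:
        names = ", ".join(name for name, _ in skipped)
        raise RuntimeError(f"backup incomplete, {len(skipped)} file(s) skipped: {names}")
    return copied
```

The design choice is simply that a skipped file makes the whole job fail, so somebody finds out the same night and not during a restore.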
I've had some close calls, but never a complete loss. Doing Net/Sys Administration (14 years now) is too forking stressful when you're doing it solo. I'm actually an ass hair away from going back to school and switching professions, or maybe just doing PM instead.
I call BS. A service with this many users and this much historical data would likely be spread across MANY backup tapes.
So the timebomb not only wiped all production data, all backup data, all system drives, and also caused the tape jukebox to load each individual tape into the tape drives and wipe all of them (a very time consuming task)?
Not to mention this doesn't even jibe with the original explanation that Hitachi was upgrading the SAN and it inadvertently killed off the RAID groups.
Sounds like somebody is fishing for sensationalism to me. Go figure, they said this to Apple Insider of all people.
@DrunkenOstrich: Plus you don't leave all the tapes in the box to begin with. Most companies need two slots to complete a nightly and don't even have the space to just leave the tapes in there for storage...
I just can't believe T-Mobile didn't have requirements that these guys had to meet.
I just can't believe a company like this didn't use Iron Mountain or a Safety Deposit box or the like to store their tapes. Every day those tapes should have been rotated out with ones from at a minimum 2 weeks prior (daily, obviously weeklies you keep longer).
ahuh? erased backup tapes? and how exactly could that happen?
As far as I know, backup tapes are to be kept off site once the backup is complete. Or at least kept out of contact with any server or machine until they are actually needed.
So how exactly could a backup tape get erased? I've been learning servers and doing my own experimental shit with servers, and I've never heard of backup tapes getting erased until now. And, as far as I know, only a few places use backup tapes these days. They've moved to a more revolutionary method called "server mirroring" and use a more robust system of backups through an off-site server system where the backups are stored (and deletion can only be performed from within the backup facility, and cannot be executed from a remote server).
So I wonder how this saboteur actually performed the trick? I simply can't imagine how stupidly the servers must have been set up for this to happen.
@zaghy2zy: It's bunk, the guy who sent this in is full of crap. Sure it's possible, but more than likely it falls on human error. Some idiots didn't check backups (ever) and they were all trash.
Even with off-site servers you always keep off-site tape. Static copy greater than *
@Gilliam: Hate to be "that guy" (actually i love it) but that same wikipedia article also describes it as "the principle that can be popularly stated as 'when you have two competing theories that make exactly the same predictions, the simpler one is the better.'" which would explain the confusion.
However, you are indeed correct that Hanlon's Razor would be the proper eponymous adage.
@acispades: its ok "that guy", i had only 30 minutes left in my day at the time.
hanlon is what i think dimes was looking for, occam was what i hadn't heard of yet so just quick quoted it.
I know somebody on the inside, and he says that the company can't stand it, and that they know they planned it. But not to worry, they're going to set it straight, this Watergate.
Look, Sidekick, I know this is your time in the spotlight, and I'ma gonna let you talk in a minute but the iPhone lets you back up your contacts to your own machine as well as store them in the cloud.
@Lite: hates Illinois Nazis: you can do that with a sidekick too but most users tend not to do it. consider the vast majority of the target users. Do you think a 14-18yo female or some club kid is thinking about backing her contacts up to a pc/mac?
This is why if someone gives me an important number, I write it down on a piece of paper and swallow it. That way I'll have it forever. Sure, when I do this, people tend to look at me like I'm crazy. But after this whole Sidekick fiasco, I'm thinking crazy like a fox, huh!
Sabotage
Artist: Beastie Boys
10/14/09
It doesn't have to be some conspiracy... easily explained by apathy.
dimes
10/14/09
[en.wikipedia.org]
@van_line: Occam's Razor is "entities must not be multiplied beyond necessity."
[en.wikipedia.org]
10/14/09
...now I'm off to douche up someone else's day :)
10/14/09
I'd never even heard of hanlon's razor until I heard it on the internet.
Even offsite ones [what, didn't have any?]? Even the ones at the backup site [what, didn't have any?]?
That's a neat trick.