CCIE Pursuit Blog

May 24, 2008

Some Of This Crap Really Has Real World Implications :-)

Filed under: BGP,Cisco,Cisco Certification,IOS,Work — cciepursuit @ 3:39 pm

I had an issue at work with a DS3 mysteriously bouncing.  We never saw the circuit actually drop (nor any errors) but the BGP peering would sporadically drop.  After one of the engineers “solved” the problem by having AT&T set their BGP timers to match ours (see this QoD for an explanation of why that did not work) the issue came to me.  I suggested that we disable bgp fast-external-fallover and see if that at least kept the peering nailed up.  That worked!  We later found out that the site had taken a lightning strike a couple of weeks earlier.  We had a vendor meet with Cisco, AT&T, the cabling vendor, and the LEC the next day.  MAGICALLY the issue “cleared while testing” once the LEC looked at the circuit.  🙂
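For context on why matching the timers was never going to fix anything: BGP negotiates the hold time down to the lower of the two configured values when the session comes up, and derives the keepalive from that, so mismatched timers alone don't break a peering. A quick sketch of where those timers live (AS numbers and the neighbor address here are made up):

```
router bgp 65000
 neighbor 192.0.2.1 remote-as 65001
 ! keepalive 10s, hold 30s - the session will actually use
 ! min(our hold, the hold the peer advertised in its OPEN)
 neighbor 192.0.2.1 timers 10 30
```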

Anyhoo…by default bgp fast-external-fallover is enabled.  This is generally a good thing as it will bring down a BGP peering if a directly connected link goes down.  No need to wait out your 3 keepalives.  In our case, there was some sporadic issue that “blinked” out the circuit (I suspect a punch-drunk repeater or some CO equipment) very briefly.  Our router would then bring down the BGP peering and then re-establish it.  By configuring ‘no bgp fast-external-fallover’ under the BGP process, we were able to keep the BGP peering up.

bgp fast-external-fallover

Usage Guidelines
The bgp fast-external-fallover command is used to disable or enable fast external fallover for BGP peering sessions with directly connected external peers. The session is immediately reset if the link goes down. Only directly connected peering sessions are supported.

If BGP fast external fallover is disabled, the BGP routing process will wait until the default hold timer expires (3 keepalives) to reset the peering session. BGP fast external fallover can also be configured on a per-interface basis using the ip bgp fast-external-fallover interface configuration command.
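Putting the two knobs together, the fix looked roughly like this (AS number and interface are hypothetical, and the per-interface keyword is from memory, so double-check it against your IOS release):

```
router bgp 65000
 ! disable fast external fallover globally: ride out link "blinks"
 ! and let the hold timer (3 missed keepalives) decide instead
 no bgp fast-external-fallover
!
interface Serial3/0
 ! or override it per interface instead of process-wide
 ip bgp fast-external-fallover deny
```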

May 6, 2008

Fun With ISPs

Filed under: BGP,Cisco,Work — cciepursuit @ 9:57 am

Michael Morris recently vented a bit about CO techs (‘No Love For Central Office Techs’).  While I can’t agree with him about the value of unions (my father was in the IBEW and my mother is in the UAN, both of which saved our family’s ass when times were tough) I do agree with him about the flippant attitude one often encounters with carriers/LECs.  I had an interesting experience about a month ago.

A DS3 link was up and up but the BGP peering was idle.  I bounced the interface, cleared the BGP adjacency, deleted and re-added the BGP configuration, and I (eventually) reloaded the router.  Nothing changed.  The interface was up and up (I could see traffic flowing in and out of the interface) but the BGP peering would not establish. I brought in our ATM subject matter expert (my ATM skills are weak) to look at the issue and he verified that ATM was working correctly.  I even looped the interface and was able to ping my interface IP address.  For whatever reason I had no layer 3 connectivity to the PER.
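For the curious, the checks above looked roughly like this (interface name and addresses are invented for the example):

```
show interfaces atm1/0       ! up/up, traffic counters incrementing
show ip bgp summary          ! peering stuck in Idle/Active
clear ip bgp 198.51.100.1    ! reset the adjacency - no change
ping 198.51.100.1            ! layer 3 to the PER - no response
```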

So I opened a ticket with AT&T and pushed them to get their BGP team involved.  They successfully tested the circuit to the CO.  Since this was a DS3 they did not have an NIU to loop at our premises.  I told them that they needed to get someone to verify the PER’s configuration (especially the BGP configuration).  They eventually verified the PER configuration.  I opened a TAC case to get the DS3 card replaced.  While the RMA was cooking, I put up a loop on my interface and asked AT&T to test to it.  They successfully tested to the loop.  I dropped my loop.  Luckily, that’s when the break came.

“Thanks for your time, I’ll get Cisco to replace our card.”
“No problem.  I just ran another pattern to your loop and it was good as well.”
“I just ran another pattern to your loop…”
“My loop?”
“I dropped my loop 15 minutes ago.”
“Are you testing the right circuit?”
“Can you send a loopdown code?”
“I just did and I can still see your loop.”
“Is there a loop in the CO?”
“We’ll contact the LEC to look at that.”

Two hours later I get a call from an AT&T tech…well, kind of.  The LEC for this area is SBC.  SBC recently bought AT&T.  So now the (follow me here) SBC LEC technicians refer to themselves as AT&T.  The US telecom industry can make your head explode if you try to follow it too closely. 🙂  So the call was actually from the CO technician who called me by accident and thought that I was AT&T (the carrier).  The other twist to this tale is that my company (large enterprise) does not interface directly with the LEC.  We only interface with the carrier/ISP.

Anyhoo…the LEC tech told me that there was a loop in the CO towards the customer’s (my!) equipment and asked if I wanted it dropped (again, he thought I was the carrier and not the customer).  I asked him why the circuit was looped.

“I have no fucking idea.”
“Really?  You guys looped a DS3 for no good reason.”
“Drop the loop please.”

The loop dropped, the BGP peering established, and our site was back to 100% of their bandwidth capacity.  When I called AT&T (the carrier) to get a reason for outage, they gave me the tired old “cleared while testing”.  Nice.

Actually, there was another twist to this tale.  Our NOC missed the BGP alert.  We have separate routers connected to two different carriers (AT&T and MCI) at each of our sites.  So we still had a DS3 connection to the MCI cloud.  I don’t remember how the BGP issue eventually came to light, but it had been down for nearly a week when I got involved.  It’s a testament to our bandwidth allocation (but not our network monitoring) that the site never noticed the loss of 50% of its available bandwidth.  I have NO idea how this didn’t affect their VoIP.  Anyhoo…once I finally got an AT&T BGP technician to look at this issue, he had the balls to annotate the ticket (we can view their tickets online) with “BGP has been down for a week and they’re just now opening a ticket?”  When I spoke with him I told him that we have dual carriers and that MCI hadn’t fucked up our circuit and that he should probably keep comments like that out of our tickets.  This was before we discovered that the issue was not our equipment.  Now it was my turn to be a douchebag.  When AT&T told me “cleared while testing” I told them to open a post mortem (a ticket review process) on the ticket.  Then I jumped down the throat of our AT&T account manager at our next weekly meeting.

“So you’re telling me that our circuit was looped at the CO and that it took you nearly two days to figure this out AND you lied to us about the RFO?  During this time our MCI circuit handled the load.  Why should we maintain you as a carrier?”

We pay tens of millions of dollars (maybe more) for our bandwidth.  Even hinting that we might axe one of our carriers in favor of the other is kind of dirty, but it makes the account managers shit their pants and jump into action anytime we mention it.  By the end of the meeting my boss got AT&T to refund us for three months’ worth of charges for that DS3.  It’s good to be king.  🙂


January 23, 2008

Network Device Naming Conventions

Filed under: Cisco,Personal,Work — cciepursuit @ 8:09 pm

I stumbled across this posting by Michael Morris concerning naming conventions:

When I started working on global enterprise networks it got much more interesting. Now you had thousands of routers at hundreds of sites in different rooms and closets/IDFs in all parts of the world. Now naming conventions became very important. A very large bank network I worked on was terrible: 5,000 routers with a cryptic naming convention that was (1) hard to understand and (2) not well followed. Adding to the problem was the city name of the router was often not an actual city. It was a name the bank liked to refer to the site as. Good luck trying to remember all those names. The rest of the name had some good points, but also several bad ones. It was not something I enjoyed.

The government network I worked on was minimalist. It was [city]-r1. For example, BUF-R1. Really boring and really useless. Some small company networks like to be cute and name devices after beer brands or rock bands or cartoon characters. That starts to fail quickly when the small company gets just a tad larger.

—Read The Rest Here—

My previous job was supporting an international WAN with over 3000 routers.  Not only did we have a ton of routers, but nearly 30 separate business divisions – each of which had their own naming convention.  To add even more fun, most of the router “hostnames” were not the same as their FQDNs.  We kept a database with circuit IDs mapped to router names.  Of course, there were many deleted circuits and typos in the database.  This always made it fun when a vendor called to report a circuit down and we could not find out which device it terminated on.  Not to mention all of the tiny sites that we supported that shared a city name with a more famous, larger city.  A router named “miami” goes down and you start looking at Florida…wrong!  It’s in Miami, Oklahoma.  We had three Pittsburgh sites – none of which were in Pennsylvania.  And (as Michael Morris mentioned) there were a bunch of places (including our corporate headquarters) that were referred to by any number of local cities.  This led to a lot of lost “support cycles” just trying to narrow down what device was being affected.

My current job supports a uniform naming convention and it is an absolute joy to work with.  There are still the occasional anomalies, but it’s infinitely better than the mess I came from.  Our naming convention is similar to the one mentioned in Michael Morris’ post.  We have unique five-character real estate codes.  Prepended to that is a code for the device type.  At the end is the floor and IDF that the device is located in.  Very little time is lost trying to determine where a device is located.
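Purely as an illustration (these codes are invented, not our actual ones), a hostname under that sort of convention might break down like this:

```
rt-nychq-03-2
rt    = device type (router)
nychq = five-character real estate code (site)
03    = floor
2     = IDF on that floor
```

The win is that anyone on the team can read the location straight out of the name without ever touching a database.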

I’m sure that there are a number of readers who do/did server support and have MUCH worse naming stories.  🙂
