Category Archives: voice

MPLS Network and QOS – Traffic Policing vs. Traffic Shaping

As mentioned in my last post, I had some trouble with resets in my PBX due to “some kind of network event”. It was very enlightening when we discovered the root cause, so I wanted to share my experience here in hopes that it saves you from similar sleepless nights.

The symptom was quite simple – the PBX was losing connectivity with its host processor for some reason. Avaya called it a “network event” but couldn’t be more specific. I was fortunate enough to work on the same team as the LAN/WAN techs. If we had been in different departments, this would have been even more of a nightmare. So the Avaya PBX loses heartbeats – after 15 of them it caused a reset. Heartbeats are about one second apart, so a failure of 15 heartbeats is a very long time. Surely a WAN outage of 15 seconds would be noticed by other systems, right? So when Avaya says there was a network event, my response was “uh, you gotta give me more than that. Nothing else on the network noticed”.

I’ll spare you the details here – they’re in my previous post anyway. The issue is the PBX is marking all IP traffic with “Expedited Forwarding” (EF), or Diffserv 46, or High Priority, or QOS. There are plenty of synonyms but it just means all IP packets are tagged with high priority and should be preferred over other packets in the LAN and WAN. Our MPLS carrier honors this tag through their network. Perfect, right?

So when you purchase QOS in the MPLS network, you are given a certain amount of bandwidth allowed. Obviously you’re not allowed to mark all traffic as high priority. MPLS is a shared cloud and you’ll pay a premium for expedited delivery of QOS packets. This bandwidth allotment is called the Committed Access Rate or CAR. What happens when you exceed this CAR? Well, as a telephone guy, I would assume the packets are delivered as “best effort” after that. But I was very wrong. Packets over the CAR are discarded by the MPLS carrier. Think about it – if you have high-priority packages to deliver overnight and the you’re only allowed to send ten per day. The eleventh package isn’t held for best effort. It’s thrown in the dumpster. Sorry. You exceeded your ten packages today, I’m throwing this one out.

The fix is simple and elegant. And probably crystal clear to you network routing engineers out there. It’s up to your edge router to strip the QOS tag from any packets that exceed the CAR. It sounds dangerous to me – what if my edge router and the carrier’s MPLS router disagree slightly on the current amount of traffic at this particular second? Especially if we’re using different brands of router? However, Mbps seems to be an agreed-upon measurement across all vendors, so having my edge router strip the QOS tag from these packets simply allows them to be delivered at best-effort across MPLS.

What I discovered, is if this setting is not correct, it only affects high-priority traffic. In my experience, most companies use a different network for video and everything else on the network such as Internet, email, chat, file servers, replication, database synchronization, etc. are not marked with QOS. So guess what? It looks like your phone system is hosed! And nothing else on the network is complaining! This gets back to a trend I’ve been noticing: Being a phone guy doesn’t have much to do with telephones anymore. It’s almost always the network. And to be a really good phone tech, you have to know networking well. I hope this helps.

I recently went through an ordeal with a PBX resetting. It’s an Avaya system using an IPSI to connect a port network back to its host, but this situation applies to anyone out there using QOS on their MPLS network. I’ve often said that being a “phone guy” is rarely about phones anymore. Most of my work – certainly troubleshooting – involves IP networking.

So I had a PBX with one IPSI that would occasionally reset. Since there was only one IPSI, the reset would cause all cards in the port network to reset as well, which would drop all calls in progress. Now this is about the worst thing that can happen when you’re responsible for the telephones. Full system outages are easier to understand. This is a reset, calls drop, users get frustrated and re-establish their calls, then it would reset again. It was a really bad situation.

What is causing the resets? Avaya said the heartbeats were failing to the IPSI. For any of you with an IPSI-connected port network, you should occasionally look for these. SSH to your Communication Manager and cd to /var/log/ecs. Then list the log files. Assuming you’re in Feburary 2013, you would look for missed heartbeats in your ecs log with the command:

cat 2013-02*.log|grep checkSlot
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (1)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (2)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (3)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (4)]
:pcd(5561):MED:[[3:0] checkSlot: data received replacing sanity message; socket delay is 14 secs]

I have stripped the date/time; you’ll see those on the left. Port networks and IPSIs are zero indexed, so the messages above apply to port network 4 and IPSI number 1.

I have been told that occasional sanity failures are just a part of life. These heartbeat messages are part of the Avaya protocol, not ICMP. So if you’re missing heartbeats, it’s not because ICMP is being dropped.

However, after a certain number of sanity failures, the IPSI will reset in order to re-esablish communication. How many sanity failures? That depends upon a system parameter setting:

display system-parameters ipserver-interface
IP SERVER INTERFACE (IPSI) SYSTEM PARAMETERS

SERVER INFORMATION
Primary Control Subnet Address:
Secondary Control Subnet Address:

OPTIONS

Switch Identifier: A
IPSI Control of Port Networks: enabled
A-side IPSI Preference: disabled
IPSI Socket Sanity Timeout: 15

QoS PARAMETERS
802.1p: 6
DiffServ: 46

The IPSI Socket Sanity Timeout determines how many sanity failures will cause an IPSI failover (if you have two in your port network), or a reset(!) if you only have one. The reset is the IPSI’s way of trying to re-establish communication. If you get too many sanity failures, you’ll get this message:

:pcd(5561):MED:[[3:0] checkSlot: too many sanity failures (15)]

Unfortunately, this means my CM lost connectivity to the first IPSI on port network 4. If I only have one IPSI, then the IPSI and all cards in the port network will reset. If I have a redundant IPSI, then the port network will failover and everything should be okay. In my particular case a second IPSI would not have helped me. It turns out, my MPLS carrier (who had also set up our edge routers) was policing the committed access rate. I’ll explain with more detail in my next post. The resolution was to shape the traffic rather than police it.

Recommendations to the new phone administrator

1 Reply

I’ve been working on telephone systems for a while. And I love my job. For the past few years, I find myself working with network administrators who have been handed the job of managing the telephone system. It makes sense – the PBX is just a big voice router, and nowadays the telephones are IP network endpoints.

But there’s more to managing a voice network than knowing the data network. I’m often asked by the new telecom admin “where should I start?” There’s a lot to know. And my biggest piece of advice is to Be the Authority. By this I mean you should be the person everyone asks about telephones. And you should usually start with the telephone and your voicemail system. The telephone is a complicated endpoint. Voicemail has a ton of features and an extremely limited user interface. For example, learn how to do the following:

Know how to transfer a call into voicemail without ringing the station.
Know how to conference two parties together. This includes two inbound calls. Also, learn the limits of conferencing. How many parties can conference together?
Can your users transfer calls outside the PBX (i.e. to mobile numbers)? If so, what happens if voicemail picks up at the far end. How do you pull that call back? What about when you attempt to conference rather than transfer?
Learn what all the feature buttons do, like park, call pickup, do-not-disturb, or any one of about 200 possible features.
Know how to program the speed dial buttons.
Keep a list of conference rooms and the speakerphone numbers handy.
Get to know you receptionists and find out what they need in a telephone system. They probably wish they had an accurate company directory, right? In a later post I’ll talk about how to provide this.
Spend time walking the floor and interacting with users. When someone calls for a simple change that can be performed remotely, go visit the user or at least give him or her a call. Try to chat about how they use the phone.
Learn how to create an out-of-office greeting and activate/deactivate it.
Learn how to leave a voicemail for someone without ringing their telephone.

The goal is to know the system well. You want people to think of you when they are trying to do something new. When you’re visiting, discreetly listen to the interaction with callers. I cannot tell you how many times I’ve heard “You’ll have to call back and ask the operator” or “His extension is 8244 but you’ll have to call back. I cannot transfer from here”. Try to help these people understand how to use the phone. Of course, some folks don’t want to hear it but some do. Be helpful. Know your telephone system. Be the Authority.

What types of questions do you get?

Quick fix for Avaya MAS error Access is denied (0x80070005)

Roger the Phone Guy

So… You're in charge of the phones now!

Category Archives: voice

MPLS Network and QOS – Traffic Policing vs. Traffic Shaping

Recommendations to the new phone administrator

Quick fix for Avaya MAS error Access is denied (0x80070005)