MPLS Network and QOS – Traffic Policing vs. Traffic Shaping

As mentioned in my last post, I had some trouble with resets in my PBX due to “some kind of network event”. It was very enlightening when we discovered the root cause, so I wanted to share my experience here in hopes that it saves you from similar sleepless nights.

The symptom was quite simple – the PBX was losing connectivity with its host processor. Avaya called it a “network event” but couldn’t be more specific. I was fortunate enough to work on the same team as the LAN/WAN techs; if we had been in different departments, this would have been even more of a nightmare. So the Avaya PBX loses heartbeats – after 15 consecutive misses, it resets. Heartbeats are about one second apart, so missing 15 of them means roughly 15 seconds without connectivity, which is a very long time. Surely a WAN outage of 15 seconds would be noticed by other systems, right? So when Avaya said there was a network event, my response was “uh, you gotta give me more than that. Nothing else on the network noticed.”

I’ll spare you the details here – they’re in my previous post anyway. The issue is that the PBX marks all of its IP traffic with “Expedited Forwarding” (EF), also known as DSCP 46, or high priority, or QOS. There are plenty of synonyms, but it just means all IP packets are tagged as high priority and should be preferred over other packets on the LAN and WAN. Our MPLS carrier honors this tag through their network. Perfect, right?
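
If you want to verify what is actually on the wire, the DSCP value rides in the top six bits of the IP ToS byte, so EF (46) appears as 46 shifted left two bits. A quick sketch, assuming you can capture from a span port (the interface name is a placeholder):

```shell
# EF is DSCP 46; on the wire it occupies the top six bits of the ToS byte.
printf 'EF ToS byte: 0x%x\n' $(( 46 << 2 ))    # prints 0xb8

# With that value you can capture only EF-marked packets, masking off the
# low two ECN bits with 0xfc:
#   tcpdump -n -i eth0 'ip and (ip[1] & 0xfc) == 0xb8'
```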

So when you purchase QOS in the MPLS network, you are allotted a certain amount of high-priority bandwidth. Obviously you’re not allowed to mark all traffic as high priority – MPLS is a shared cloud, and you pay a premium for expedited delivery of QOS packets. This bandwidth allotment is called the Committed Access Rate, or CAR. What happens when you exceed the CAR? Well, as a telephone guy, I assumed the excess packets would be delivered as “best effort” after that. But I was very wrong. Packets over the CAR are discarded by the MPLS carrier. Think about it – suppose you have high-priority packages to deliver overnight, but you’re only allowed to send ten per day. The eleventh package isn’t held for best effort. It’s thrown in the dumpster. Sorry – you exceeded your ten packages today, so this one gets thrown out.
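
The package analogy maps directly onto what a policer does. Here is a toy sketch using the hypothetical ten-per-day allowance from above (a real policer works on a token bucket of bits per second, not a daily package count):

```shell
# Toy traffic policer: everything within the allowance goes through at
# high priority; everything over it is discarded outright, never queued.
allowance=10
delivered=0
discarded=0
for pkg in $(seq 1 14); do
  if [ "$delivered" -lt "$allowance" ]; then
    delivered=$((delivered + 1))   # within the CAR: delivered expedited
  else
    discarded=$((discarded + 1))   # over the CAR: straight to the dumpster
  fi
done
echo "delivered=$delivered discarded=$discarded"   # delivered=10 discarded=4
```

Shaping, by contrast, would hold the excess packages and send them later; policing throws them away. That distinction is the whole point of the title.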

The fix is simple and elegant, and probably crystal clear to you network routing engineers out there: it’s up to your edge router to police the traffic itself and strip the QOS tag from any packets that exceed the CAR. It sounded dangerous to me – what if my edge router and the carrier’s MPLS router disagree slightly on the current amount of traffic at this particular second, especially if we’re using different brands of router? However, Mbps is an agreed-upon measurement across vendors, so having my edge router strip the QOS tag from the excess packets simply lets them be delivered best-effort across the MPLS cloud.
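
As a sketch of that fix on a Linux box (real MPLS edge routers are typically Cisco or Juniper, where this would be a policer with an exceed-action re-mark; the interface name and rates here are placeholders, and iptables’ limit match counts packets rather than bits, so treat this as illustrative only):

```shell
# First rule: EF-marked packets within the allowance leave untouched
# (RETURN ends traversal of the chain, whose default policy is ACCEPT).
# Second rule: any EF-marked packet beyond the allowance gets its DSCP
# cleared to 0 (best effort), so the carrier delivers it instead of
# discarding it.
iptables -t mangle -A POSTROUTING -o eth0 -m dscp --dscp 46 \
         -m limit --limit 200/second --limit-burst 50 -j RETURN
iptables -t mangle -A POSTROUTING -o eth0 -m dscp --dscp 46 \
         -j DSCP --set-dscp 0
```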

What I discovered is that if this setting is not correct, it only affects high-priority traffic. In my experience, most companies put video on a separate network, and everything else on the network – Internet, email, chat, file servers, replication, database synchronization, etc. – is not marked with QOS. So guess what? It looks like your phone system is hosed, and nothing else on the network is complaining! This gets back to a trend I’ve been noticing: being a phone guy doesn’t have much to do with telephones anymore. It’s almost always the network, and to be a really good phone tech, you have to know networking well. I hope this helps.

26 thoughts on “MPLS Network and QOS – Traffic Policing vs. Traffic Shaping”

  1. Marc

    Thanks for this article, Roger. I have a gateway resetting and voice quality issues over an MPLS link – your last two articles have been very interesting.

    1. roger Post author

      Sorry to hear about your gateway resets. Any type of phone disconnect is bad news, of course, but it’s really frustrating when 100+ year-old technology is looked upon as “unstable” by management and users. I hope all goes well with your PBX!

  2. khurram

    Hi Roger,

    I faced a similar issue yesterday where the heartbeat broke between CM and the IPSI, causing the entire PN to reboot. I have now engaged the network team to investigate. Can you please help? What parameters do I need to check on the network end to make sure this heartbeat does not break again?

    1. roger Post author

      Hi Khurram, Sorry to hear about your port network troubles. If you can SSH directly to CM, you can grep the latest /var/log/ecs for ‘checkSlot’. This will show all the heartbeat errors. Do you have more than one IPSI? Did your port network reset during heavy traffic or a busy time with your PBX? If you’re seeing connectivity problems but nobody else on the network team sees it, then it sounds like something similar. Your MPLS carrier may be discarding EF packets that exceed your committed access rate. Does your MPLS carrier manage your edge routers, or your own network team? Or is it a “shared” responsibility? If shared, it could be that neither party has looked for dropped packets. Let me know how it goes!
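
      For reference, a one-liner along those lines (the exact log file naming under /var/log/ecs varies by CM release, so the wildcard is a guess):

```shell
# Count heartbeat (sanity) failures per day; the first eight characters
# of each ecs log line are the YYYYMMDD date we group on.
grep -h 'checkSlot: sanity failure' /var/log/ecs/* 2>/dev/null \
  | cut -c1-8 | sort | uniq -c
```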

  3. khurram

    Hi Roger,
    My PN and CM are placed at the same location in one data center, both in the same VLAN – no MPLS is involved in this communication – but even then the heartbeat issue occurred, with a sanity check error 🙁

    1. roger Post author

      Do you have a backup ESS server? The ESS servers also maintain a heartbeat with the port network so you can check the ESS’s /var/log/ecs for heartbeat failures as well. In my case, one datacenter was fine, the other datacenter was failing heartbeats. That helped me pinpoint the WAN links that were dropping packets.

      Strange that your CM and PN are sitting as close together as possible and you’re still getting heartbeat failures. I’m told occasional missed heartbeats are normal. When you “display sys ipsi” what is your sanity timeout? I put mine to the max of 15 seconds. Is yours pretty low? Sanity failures don’t necessarily mean control traffic is failing too. Have your users said anything about slow dialtone or slow call-setup?

      1. khurram

        HI Roger,

        I have a CM1 which has a backup CM2, and I also have an ESS at another data center. The sanity timeout is set to 15, which is the max I can set 🙂
        OK, I have read somewhere that these types of issues can be caused by firmware versions as well… do you have any idea about that?

  4. roger Post author

    Yes, unfortunately Avaya and every business partner will recommend that you upgrade to the latest firmware whenever you have trouble that’s hard to track down. Sometimes it feels like a cop-out to me, putting the work back on the customer. Usually firmware is fine for many years, but it does make sense to bring it up to date. There are four rather complex individual procedures for firmware updating:
    1) Update the TN circuit packs in the G650 chassis
    2) Update the MM media modules in the G3xx and G4xx gateways
    3) Update the G3xx and G4xx gateway firmware
    4) Update the IPSI firmware

    Oh, and then there’s patching the CM software as well as System Platform.

    Usually, you have to do this so infrequently that you forget how to do it between cycles. Perhaps I will post these in a separate topic.

  5. khurram

    Hi Roger,

    I had a detailed meeting with my network team in which I asked them to prioritize the traffic between the IPSI and CM so that no HB message is skipped.
    One thing they have asked me to provide is the socket details between the IPSI and CM. I have the IPs of both, but I am unable to find the port on which these HB messages are exchanged, since a single Ethernet interface is used for both data and HB. Do you have this detail?

    1. roger Post author

      I did get that information from a tech once. According to Avaya backbone:

      The following is a list of IPSI ‘listening’ ports and their corresponding ports on the S8700:

      S8700   IPSI   Proto      Description
      any     5010   TCP        Main control socket between PCD and SIM
      any     5011   TCP        IPSI version queries
      any     5012   TCP        Serial number queries
      any     123    UDP        NTP
      123     any    UDP        NTP
      any     23     TCP        Telnet – for configuration after enable
      any     21     FTP        Firmware download
      any     20     FTP-DATA   Firmware download
      3166?   1956   TCP        Command server (download, etc.)
      any     any    ICMP       Echo replies
      any     2312   TCP        Used by development only for telnet shell access (debugging); can be blocked by a firewall

  6. khurram

    I am having a hard time convincing my network team that their equipment has let me down :@ :@ I’m so frustrated with Avaya.

    1. roger Post author

      Are you frustrated with Avaya because they don’t provide the evidence that it’s the network? I feel the same way. Avaya says “it was a network event – check with your WAN team”, but they don’t always give convincing evidence that it was indeed a network event. However, I will say this about Avaya: if you have a maintenance agreement, they’ll replace whatever hardware is necessary to demonstrate their system is fine. I have gone through at least three of these finger-pointing sessions, and in each case the Avaya PBX was okay and the network was the issue.

      In my case, I had to suffer some painful outages and several hardware replacements. All the while I would say to Avaya “you gotta give me more than this. A ‘network event’ isn’t enough to convince my network team there’s a problem on the LAN”. Especially when voice seemed to be the only service affected.

      Is there anything else seemingly unrelated going on with your network? One time I had phones going into “Discover” state and moving to different CLANs. For two days we replaced hardware in the Avaya before I learned that some users were also having trouble with mapped drives. It turned out to be a spanning tree issue on one of the ethernet switches. Almost all other services are more resilient than VoIP. Video is probably the only service more sensitive and most companies put video on a separate network altogether. That makes VoIP the most visible (and sometimes the only) “problem” when the LAN has issues.

      Do you have more than one IPSI? Do you have a hardware replacement agreement with Avaya or a spare IPSI? Is all of your Avaya LAN traffic in a separate VLAN?

  7. khurram

    Roger,

    My network team says “we have given you a 10G uplink on the LAN, how come you can miss HB?” and seriously, if you look at it, that’s a fair argument. I don’t see any packet drops in my NMS; network utilization is not 100%, it’s around 40%, and even then HB messages are skipped.
    I have separate VLANs for voice and data, with my IPSI and CM in the data VLAN and the MedPro in the voice VLAN.
    One thing I would like to share with you is that a few days back, 3 CLANs got stuck, and this is the error printed on the network switch:

    2013 May 23 15:58:47.612 LLR995-CORE-SFSW-02 %ETHPORT-2-IF_DOWN_ERROR_DISABLED: Interface Ethernet108/1/13 is down (Error disabled. Reason:Too many link flaps)

  8. khurram

    And this is the error on my IPSI port:

    2013 May 28 18:29:48.561 LLR995-CORE-SFSW-01 %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet107/1/13 is down (Link failure)

    1. roger Post author

      Aha! Can you ask your network team what they see for the port speed? IPSIs and CLANs should auto-negotiate to 100 Mbps full duplex, but often they come up half duplex. I’ve worked at two sites where the teams disabled auto-negotiation and hard-set the ports on each side to 100/Full. Early config notes for Avaya field services listed this as best practice, so sometimes you’ll run into Avaya techs who tell you to never use auto-negotiation, but later updates have fixed this. Also, you mentioned firmware – I agree that you should update the firmware of your IPSI at the least. Do you have more than one? If you only have one, then the entire port network will flap, and that could explain the CLAN errors.

  9. khurram

    No, I don’t have more than 1 IPSI per PN, and that’s why the entire PN went down and took down the entire contact center. Speed is set to 100 Mbps full duplex and negotiation is set to off, as I have hard-coded these values at both ends.

  10. khurram

    Hi Roger,

    Someone told me that if the Avaya rack is not properly earthed (grounded), it can also cause HB skipping. Is that true?

    1. roger Post author

      Wow – that’s obscure but completely true. Now that you mention it I remember a site in the 90s that wasn’t properly grounded; they had intermittent restarts on their Nortel switch. Their electrician installed a proper ground and that fixed it. Unfortunately, sticking a meter against your rack may not be enough to tell for sure – the ground wire needs to be properly sized. Datacenters take grounding very seriously.

    1. roger Post author

      And they will likely recommend ground and firmware updates before continuing. You mentioned that you have an ESS server in another datacenter. Can you check its /var/log/ecs logs for missed heartbeats to this port network? If it fails too, then maybe it’s simply a bad IPSI?

    1. roger Post author

      Yes, the main CM and the ESS CMs each initiate heartbeats with the IPSIs, and each CM logs missed heartbeats in its local /var/log/ecs log file. In your case, it would be very odd if the local CM is missing heartbeats but the remote ESS is fine. Do you have a second G650 cabinet? If so, you can install a second IPSI.

  11. khurram

    Hi Roger,

    I have faced the same situation again at another data center, and this time my IPSI and CM are communicating over the WAN 🙂
    Is there any way to check the heartbeat messages through some sort of NMS?

    1. roger Post author

      Is this a different port network than in your previous posts? You can configure CM (via the web interface) to send everything that goes to /var/log/ecs to a syslog server as well. That way you get everything in real time, and you can have the syslog server alarm on patterns like “checkSlot: sanity failure”.
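
      Once the ecs entries are landing on the syslog box, a crude real-time watch could be as simple as this (the log path is a placeholder for wherever your syslog server writes CM’s stream; a real setup would page someone rather than just printing):

```shell
# Follow the syslog stream and surface heartbeat failures as they arrive.
tail -F /var/log/cm-syslog.log \
  | grep --line-buffered 'checkSlot: sanity failure'
```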

      You once mentioned you have an ESS as well. Does the ESS also show missing heartbeats in its ecs log?

  12. khurram

    Hi Roger,

    Yes, this is a different port network. I have checked and found that this IPSI has sanity check errors with both CM and ESS.
    These are the logs from CM:
    20130702:173514408:57566820:pcd(22868):MED:[[16:0] checkSlot: sanity failure (1)]
    20130702:173515408:57566867:pcd(22868):MED:[[16:0] checkSlot: sanity failure (2)]
    20130702:173516408:57566937:pcd(22868):MED:[[16:0] checkSlot: sanity failure (3)]
    20130702:173517408:57567015:pcd(22868):MED:[[16:0] checkSlot: sanity failure (4)]
    20130702:173518408:57567079:pcd(22868):MED:[[16:0] checkSlot: sanity failure (5)]
    20130702:173519408:57567127:pcd(22868):MED:[[16:0] checkSlot: sanity failure (6)]
    20130702:173520408:57567196:pcd(22868):MED:[[16:0] checkSlot: sanity failure (7)]
    20130702:173521408:57567238:pcd(22868):MED:[[16:0] checkSlot: sanity failure (8)]
    20130702:173522408:57567313:pcd(22868):MED:[[16:0] checkSlot: sanity failure (9)]
    20130702:173523408:57567379:pcd(22868):MED:[[16:0] checkSlot: sanity failure (10)]
    20130702:173524408:57567428:pcd(22868):MED:[[16:0] checkSlot: sanity failure (11)]
    20130702:173525407:57567476:pcd(22868):MED:[[16:0] checkSlot: sanity failure (12)]
    20130702:173526407:57567525:pcd(22868):MED:[[16:0] checkSlot: sanity failure (13)]
    20130702:173527407:57567587:pcd(22868):MED:[[16:0] checkSlot: sanity failure (14)]
    20130702:173528407:57567648:pcd(22868):MED:[[16:0] checkSlot: too many sanity failures (15)]

    These are the logs from ESS
    20130702:173216961:1491122:pcd(30819):MED:[[16:0] checkSlot: sanity failure (1)]
    20130702:173217961:1491124:pcd(30819):MED:[[16:0] checkSlot: sanity failure (2)]
    20130702:173218961:1491125:pcd(30819):MED:[[16:0] checkSlot: sanity failure (3)]
    20130702:173219961:1491126:pcd(30819):MED:[[16:0] checkSlot: sanity failure (4)]
    20130702:173220961:1491127:pcd(30819):MED:[[16:0] checkSlot: sanity failure (5)]
    20130702:173221961:1491128:pcd(30819):MED:[[16:0] checkSlot: sanity failure (6)]
    20130702:173222960:1491129:pcd(30819):MED:[[16:0] checkSlot: sanity failure (7)]
    20130702:173223960:1491130:pcd(30819):MED:[[16:0] checkSlot: sanity failure (8)]
    20130702:173224960:1491131:pcd(30819):MED:[[16:0] checkSlot: sanity failure (9)]
    20130702:173225960:1491132:pcd(30819):MED:[[16:0] checkSlot: sanity failure (10)]
    20130702:173226960:1491133:pcd(30819):MED:[[16:0] checkSlot: sanity failure (11)]
    20130702:173227960:1491134:pcd(30819):MED:[[16:0] checkSlot: sanity failure (12)]
    20130702:173228960:1491135:pcd(30819):MED:[[16:0] checkSlot: sanity failure (13)]
    20130702:173229960:1491136:pcd(30819):MED:[[16:0] checkSlot: sanity failure (14)]
    20130702:173230960:1491137:pcd(30819):MED:[[16:0] checkSlot: too many sanity failures (15)]

    1. roger Post author

      Are the clocks on your CM and ESS three minutes apart? It shouldn’t matter, but you might want to set up NTP, especially with a call center (for your reports).

      Has anyone mentioned anything happening before the cutoff? Do people hear voice breaking up or poor sound quality during those final 15 seconds? Does it happen at a particularly busy time? Was there a lot of VoIP traffic over the WAN during this moment? If so, then it could be the high-priority packets are getting dropped and low priority packets are getting through just fine. Back to the MPLS Committed Access Rate.

      Or, it could just be another bad IPSI or firmware incompatibility with your network.

