I recently went through an ordeal with a PBX resetting. It’s an Avaya system using an IPSI to connect a port network back to its host, but this situation applies to anyone out there using QOS on their MPLS network. I’ve often said that being a “phone guy” is rarely about phones anymore. Most of my work – certainly troubleshooting – involves IP networking.

So I had a PBX with one IPSI that would occasionally reset. Since there was only one IPSI, the reset would cause all cards in the port network to reset as well, which would drop all calls in progress. Now this is about the worst thing that can happen when you’re responsible for the telephones. Full system outages are easier to understand. This is a reset, calls drop, users get frustrated and re-establish their calls, then it would reset again. It was a really bad situation.

What is causing the resets? Avaya said the heartbeats were failing to the IPSI. For any of you with an IPSI-connected port network, you should occasionally look for these. SSH to your Communication Manager and cd to /var/log/ecs. Then list the log files. Assuming you’re in Feburary 2013, you would look for missed heartbeats in your ecs log with the command:

cat 2013-02*.log|grep checkSlot
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (1)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (2)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (3)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (4)]
:pcd(5561):MED:[[3:0] checkSlot: data received replacing sanity message; socket delay is 14 secs]

I have stripped the date/time; you’ll see those on the left. Port networks and IPSIs are zero indexed, so the messages above apply to port network 4 and IPSI number 1.

I have been told that occasional sanity failures are just a part of life. These heartbeat messages are part of the Avaya protocol, not ICMP. So if you’re missing heartbeats, it’s not because ICMP is being dropped.

However, after a certain number of sanity failures, the IPSI will reset in order to re-esablish communication. How many sanity failures? That depends upon a system parameter setting:

display system-parameters ipserver-interface
IP SERVER INTERFACE (IPSI) SYSTEM PARAMETERS

SERVER INFORMATION
Primary Control Subnet Address:
Secondary Control Subnet Address:

OPTIONS

Switch Identifier: A
IPSI Control of Port Networks: enabled
A-side IPSI Preference: disabled
IPSI Socket Sanity Timeout: 15

QoS PARAMETERS
802.1p: 6
DiffServ: 46

The IPSI Socket Sanity Timeout determines how many sanity failures will cause an IPSI failover (if you have two in your port network), or a reset(!) if you only have one. The reset is the IPSI’s way of trying to re-establish communication. If you get too many sanity failures, you’ll get this message:

:pcd(5561):MED:[[3:0] checkSlot: too many sanity failures (15)]

Unfortunately, this means my CM lost connectivity to the first IPSI on port network 4. If I only have one IPSI, then the IPSI and all cards in the port network will reset. If I have a redundant IPSI, then the port network will failover and everything should be okay. In my particular case a second IPSI would not have helped me. It turns out, my MPLS carrier (who had also set up our edge routers) was policing the committed access rate. I’ll explain with more detail in my next post. The resolution was to shape the traffic rather than police it.

8 thoughts on “Avaya IPSI resets – SOLVED

    1. roger Post author

      Hi Eddy. I was able to solve my problem by confirming the QOS settings with my MPLS carrier. Could your carrier be dropping packets like mine was? The router policy was set to discard packets rather than just strip the EF tag.

      Reply
  1. Eric

    The PN cold reset timer or also was called the Port Network Rest Timer can be changed to allow a longer time before the PN will reset. THIS DOES NOT FIX YOUR ISSUE…simply masks some of it. Your carrier is not providing a quality service and is either dropping packets, possibly policing the committed access rate, or is performing maintenance or allowing another company to burst and creating issues that prevent your packets from reaching the Port Network.

    Reply
  2. ErikInVA

    I know this is an old post, but ran across this posting when searching for the issue we’re seeing – like usual. Recently, I’ve been having similar issues. We’re not on a MPLS network, but on an AVPN, so I guess it’s close enough. Searching the logs, I see plenty of sanity failures, but the count only hits 1 – not higher. We don’t notice a hit to the IPSIs until the exact time that we see both IPSI :0 and :1 go out, and then we have a brief, 1 second “SUSPENDED” message on the PN. I don’t really see it being service affecting, but can’t figure out what else could be causing it.

    Here are the sanity 20161216:013809489:258577453:pcd(8706):MED:[[7:0] checkSlot: sanity failure (1)]
    20161216:074617590:258717943:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:074617590:258717944:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:091617615:258754917:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:091617615:258754918:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:100300641:258775642:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:101730641:258781587:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:101730641:258781588:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:104017615:258791181:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:113318630:258815675:pcd(8706):MED:[[8:0] checkSlot: data received replacing sanity message; socket delay is 1 secs]
    20161216:114518636:258821151:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:114618632:258821581:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:120342651:258829347:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:120342671:258829348:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:121732661:258835254:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:123310675:258842751:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:123325673:258842923:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:123611677:258843871:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:124123676:258845981:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:130358672:258856626:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:130924680:258859036:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:132524655:258866100:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:132524655:258866101:pcd(8706):MED:[[7:0] checkSlot: sanity failure (1)]
    20161216:143827541:258899650:pcd(8706):MED:[[1:0] checkSlot: sanity failure (1)]
    20161216:144623674:258902750:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:144824655:258904043:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:150117650:258909959:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:150117650:258909960:pcd(8706):MED:[[7:0] checkSlot: sanity failure (1)]
    20161216:150958660:258913841:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:150958660:258913842:pcd(8706):MED:[[7:0] checkSlot: sanity failure (1)]
    20161216:153716675:258926977:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:154620674:258930879:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:154935536:258932136:pcd(8706):MED:[[1:0] checkSlot: sanity failure (1)]
    20161216:160407658:258938916:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:160417672:258939059:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:162817670:258950101:pcd(8706):MED:[[8:1] checkSlot: sanity failure (1)]
    20161216:162817670:258950102:pcd(8706):MED:[[8:0] checkSlot: sanity failure (1)]
    20161216:164507679:258958466:pcd(8706):MED:[[7:1] checkSlot: sanity failure (1)]
    20161216:165834559:258964901:pcd(8706):MED:[[1:0] checkSlot: sanity failure (1)]

    And here are the “SUSPENDED” messages.
    20161216:013809489:258577456:hmm(8718):MED:[PORT-NETWORK 8 UNAVAILABLE (SUSPENDED).]
    20161216:013810489:258577463:hmm(8718):MED:[PORT-NETWORK 8 AVAILABLE.]
    20161216:074617590:258717947:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:074618590:258717954:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:091617615:258754921:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:091618615:258754928:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:100300641:258775645:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:100301629:258775656:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:101730641:258781591:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:101731641:258781607:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:114518636:258821154:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:114519636:258821163:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:114618632:258821584:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:114619632:258821592:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:123310675:258842754:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:123311675:258842769:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:130924680:258859039:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:130925680:258859045:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:132524655:258866104:hmm(8718):MED:[PORT-NETWORK 8 UNAVAILABLE (SUSPENDED).]
    20161216:132525655:258866116:hmm(8718):MED:[PORT-NETWORK 8 AVAILABLE.]
    20161216:143827541:258899653:hmm(8718):MED:[PORT-NETWORK 2 UNAVAILABLE (SUSPENDED).]
    20161216:143828541:258899669:hmm(8718):MED:[PORT-NETWORK 2 AVAILABLE.]
    20161216:150117650:258909963:hmm(8718):MED:[PORT-NETWORK 8 UNAVAILABLE (SUSPENDED).]
    20161216:150118650:258909974:hmm(8718):MED:[PORT-NETWORK 8 AVAILABLE.]
    20161216:150958660:258913845:hmm(8718):MED:[PORT-NETWORK 8 UNAVAILABLE (SUSPENDED).]
    20161216:150959660:258913854:hmm(8718):MED:[PORT-NETWORK 8 AVAILABLE.]
    20161216:153625652:258926465:capro(8723):MED:[Primary link for Signaling Group 104 is UP]
    20161216:154935536:258932139:hmm(8718):MED:[PORT-NETWORK 2 UNAVAILABLE (SUSPENDED).]
    20161216:154936536:258932147:hmm(8718):MED:[PORT-NETWORK 2 AVAILABLE.]
    20161216:160417672:258939062:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:160418672:258939076:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:162817670:258950105:hmm(8718):MED:[PORT-NETWORK 9 UNAVAILABLE (SUSPENDED).]
    20161216:162818670:258950116:hmm(8718):MED:[PORT-NETWORK 9 AVAILABLE.]
    20161216:165834559:258964904:hmm(8718):MED:[PORT-NETWORK 2 UNAVAILABLE (SUSPENDED).]
    20161216:165835559:258964914:hmm(8718):MED:[PORT-NETWORK 2 AVAILABLE.]

    I’m guessing it could be the same issue, but not sure. Avaya is telling me to set the ports to 100/Half with auto-negotiate turned off. I can’t find supporting documentation, but I’ve always been told to hard set them to 100/Full – so I’m not really sure I want to follow their advice. Any thoughts or ideas on this one?

    Reply
    1. roger Post author

      Your question deserves more time. But I wanted to tell what I know about this battle with port speeds. Actually I was going to type it up here but it really deserves its own post. Short answer – I recommend setting both sides to auto-negotiate. You should check with your network team to confirm the port speed is 100/full. If it does not negotiate to 100/full, then you and your network team should force the port to 100/full. If your ports do not auto-negotiate to 100/full, can you let me know? This is a long and crazy story.

      I think the half duplex could be the problem.

      Reply
    1. roger Post author

      Well, I don’t pretend to know more than an Avaya engineer (especially with 140 years of knowledge-base behind them), but that doesn’t make sense to me. I’ve been told to set to 100/Full and I’ve been told to let it auto-negotiate (so it can negotiate 100/Full), but I cannot think of any reason you’d ever want half duplex. Maybe get a second opinion? I would ask for some supporting documentation. I found this white paper from 2010 that encourages auto-negotiation to 100/Full. https://downloads.avaya.com/css/P8/documents/100121639

      I have definitely been on engineering calls with Avaya support where we had me and the network team check every single interface (IPSIs, MedPros, CLANs, Media Gateways, Session Managers, and SBCs) and make sure they are all at 100/Full.

      However, this may not be the cause of your IPSI sanity failures.

      Oh- do you have another ESS somewhere? All ESSs perform sanity checks against the IPSIs. It would be interesting to see if your other ESS is getting those too.

      I’m sorry you’re having trouble. I really feel your pain.

      Reply

Leave a Reply

Your email address will not be published.