Avaya 9630 Locks Up / Reboots when registering

I recently had an Avaya phone in a reboot cycle. It would boot up, then when it registered it would lock up and after a couple of minutes reboot again. The display looked like it was up and running fine, but when I pressed the SPEAKER button I would get three beeps. This is typically what happens when a phone cannot get TCP/IP signalling traffic to the call server. And the night before we had some maintenance on that Ethernet switch, so I immediately suspected a network problem.

Just in case, I did a “CLEAR” procedure on the phone. Then I swapped out the phone. Then I swapped out the patch cables (at both ends). Then I moved the phone to a different port in the Ethernet switch. No matter what I did, the phone locked up. Then I tried something I probably should have tried earlier – I logged in a different extension and it worked fine! Then I logged the “bad” extension into a different phone and it locked up!

Turns out the config file on the web server (1234_96xxdata.txt) was incomplete. Apparently it was related to the network after all! When the phone was writing its data file to the web server the previous night, the write operation was interrupted as the network guy shut off the Ethernet switch. The resulting data file was incomplete – it had about half the call log entries and a partial line at the end, but none of the important lines you’d expect in the file, such as:

LOGTDFORMAT=128
Redial=1
Edit Dialing=1
Go to Phone Screen on Calling=0
Go to Phone Screen on Ringing=1
Call Timer=1
Visual Alerting=0
History Active=1
Log Bridged Calls=1
Audio Path=1
Personalized Ring=0
Handset AGC=1
Headset AGC=1
Speaker AGC=1
Error Tone=1
Button Clicks=0
Text Size=1
Contacts Pairing=0
Voice Initiated Dialing=1
Voice Dialing Help Counter=0
Personalized Ring Menu=0
Go to Phone Screen on Answer=0
Voice Initiated Dialing Language=

If the file were missing, the phone would use default values and create the file at the next backup. However, since the file was there, the phone processed it but ended up locking up because it was incomplete. In all my years working with these phones, I’ve never seen that before. I wouldn’t have thought it possible for the phone’s interrupted “HTTP PUT” operation to result in an incomplete file on the web server, but there you go. Hopefully this helps you.
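If you keep these backup files on a web server, a quick sanity check can flag a truncated one before it takes a phone down. Here is a minimal Python sketch, not an Avaya tool. It assumes a healthy file ends with a newline and contains at least a few of the settings listed above; `EXPECTED_KEYS` and `backup_problems` are names made up for this example.

```python
# Assumption: a complete backup file contains at least these settings
# (picked from the list shown in the post) and ends with a newline.
EXPECTED_KEYS = ("LOGTDFORMAT", "Redial", "Call Timer", "Audio Path")


def backup_problems(text: str) -> list[str]:
    """Return a list of reasons the backup text looks truncated (empty if none)."""
    problems = []
    # A write that was cut off mid-line usually leaves no trailing newline.
    if text and not text.endswith("\n"):
        problems.append("last line has no trailing newline (possible partial write)")
    # Collect every "name=value" setting name found in the file.
    present = {
        line.split("=", 1)[0].strip()
        for line in text.splitlines()
        if "=" in line
    }
    for key in EXPECTED_KEYS:
        if key not in present:
            problems.append(f"missing setting: {key}")
    return problems
```

To use it, read the file as text (for example the contents of 1234_96xxdata.txt) and pass it to `backup_problems`. An empty list means nothing obvious is wrong; anything else is worth a look, and deleting the file lets the phone recreate it from defaults at the next backup.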

MPLS Network and QOS – Traffic Policing vs. Traffic Shaping

As mentioned in my last post, I had some trouble with resets in my PBX due to “some kind of network event”. It was very enlightening when we discovered the root cause, so I wanted to share my experience here in hopes that it saves you from similar sleepless nights.

The symptom was quite simple – the PBX was losing connectivity with its host processor for some reason. Avaya called it a “network event” but couldn’t be more specific. I was fortunate enough to work on the same team as the LAN/WAN techs. If we had been in different departments, this would have been even more of a nightmare. So the Avaya PBX loses heartbeats – after 15 of them it caused a reset. Heartbeats are about one second apart, so a failure of 15 heartbeats is a very long time. Surely a WAN outage of 15 seconds would be noticed by other systems, right? So when Avaya says there was a network event, my response was “uh, you gotta give me more than that. Nothing else on the network noticed”.

I’ll spare you the details here – they’re in my previous post anyway. The issue is the PBX is marking all IP traffic with “Expedited Forwarding” (EF), or Diffserv 46, or High Priority, or QOS. There are plenty of synonyms but it just means all IP packets are tagged with high priority and should be preferred over other packets in the LAN and WAN. Our MPLS carrier honors this tag through their network. Perfect, right?

So when you purchase QOS in the MPLS network, you are given a certain amount of bandwidth allowed. Obviously you’re not allowed to mark all traffic as high priority. MPLS is a shared cloud and you’ll pay a premium for expedited delivery of QOS packets. This bandwidth allotment is called the Committed Access Rate or CAR. What happens when you exceed this CAR? Well, as a telephone guy, I would assume the packets are delivered as “best effort” after that. But I was very wrong. Packets over the CAR are discarded by the MPLS carrier. Think about it – say you have high-priority packages to deliver overnight and you’re only allowed to send ten per day. The eleventh package isn’t held for best effort. It’s thrown in the dumpster. Sorry. You exceeded your ten packages today, I’m throwing this one out.

The fix is simple and elegant. And probably crystal clear to you network routing engineers out there. It’s up to your edge router to strip the QOS tag from any packets that exceed the CAR. It sounds dangerous to me – what if my edge router and the carrier’s MPLS router disagree slightly on the current amount of traffic at this particular second? Especially if we’re using different brands of router? However, Mbps seems to be an agreed-upon measurement across all vendors, so having my edge router strip the QOS tag from these packets simply allows them to be delivered at best-effort across MPLS.
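To make the difference concrete, here is a toy Python model of the two behaviors described above: the carrier dropping EF packets over the CAR, versus the edge router stripping the tag so the excess rides as best effort. It is a sketch under simplifying assumptions, not router configuration: the CAR is modeled as a fixed byte budget per time interval with no burst allowance, and `classify_ef_packets` and its parameters are hypothetical names.

```python
from collections import Counter


def classify_ef_packets(intervals, car_bytes_per_interval, exceed_action):
    """Count what happens to EF-marked packets under a simple per-interval budget.

    intervals: list of lists of packet sizes in bytes, one inner list per interval.
    exceed_action: "drop" (policing at the carrier) or "remark" (edge router
    strips the QOS tag so the packet is delivered as best effort).
    """
    if exceed_action not in ("drop", "remark"):
        raise ValueError("exceed_action must be 'drop' or 'remark'")
    outcome = Counter()
    for packets in intervals:
        budget = car_bytes_per_interval  # budget resets every interval (no burst)
        for size in packets:
            if size <= budget:
                budget -= size
                outcome["priority"] += 1
            elif exceed_action == "remark":
                outcome["best_effort"] += 1
            else:
                outcome["dropped"] += 1
    return outcome
```

With a 1000-byte budget and eight 200-byte packets in one interval, five packets fit within the CAR. With "drop" the other three are lost, which is what starves the heartbeats; with "remark" the same three still arrive, just without priority.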

What I discovered is that if this setting is not correct, it only affects high-priority traffic. In my experience, most companies use a different network for video, and everything else on the network (Internet, email, chat, file servers, replication, database synchronization, etc.) is not marked with QOS. So guess what? It looks like your phone system is hosed! And nothing else on the network is complaining! This gets back to a trend I’ve been noticing: Being a phone guy doesn’t have much to do with telephones anymore. It’s almost always the network. And to be a really good phone tech, you have to know networking well. I hope this helps.

I recently went through an ordeal with a PBX resetting. It’s an Avaya system using an IPSI to connect a port network back to its host, but this situation applies to anyone out there using QOS on their MPLS network. I’ve often said that being a “phone guy” is rarely about phones anymore. Most of my work – certainly troubleshooting – involves IP networking.

So I had a PBX with one IPSI that would occasionally reset. Since there was only one IPSI, the reset would cause all cards in the port network to reset as well, which would drop all calls in progress. Now this is about the worst thing that can happen when you’re responsible for the telephones. Full system outages are easier to understand. This is a reset, calls drop, users get frustrated and re-establish their calls, then it would reset again. It was a really bad situation.

What is causing the resets? Avaya said the heartbeats were failing to the IPSI. For any of you with an IPSI-connected port network, you should occasionally look for these. SSH to your Communication Manager and cd to /var/log/ecs. Then list the log files. Assuming you’re in February 2013, you would look for missed heartbeats in your ecs log with the command:

cat 2013-02*.log|grep checkSlot
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (1)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (2)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (3)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (4)]
:pcd(5561):MED:[[3:0] checkSlot: data received replacing sanity message; socket delay is 14 secs]

I have stripped the date/time; you’ll see those on the left. Port networks and IPSIs are zero indexed, so the messages above apply to port network 4 and IPSI number 1.

I have been told that occasional sanity failures are just a part of life. These heartbeat messages are part of the Avaya protocol, not ICMP. So if you’re missing heartbeats, it’s not because ICMP is being dropped.

However, after a certain number of sanity failures, the IPSI will reset in order to re-establish communication. How many sanity failures? That depends upon a system parameter setting:

display system-parameters ipserver-interface
IP SERVER INTERFACE (IPSI) SYSTEM PARAMETERS

SERVER INFORMATION
Primary Control Subnet Address:
Secondary Control Subnet Address:

OPTIONS

Switch Identifier: A
IPSI Control of Port Networks: enabled
A-side IPSI Preference: disabled
IPSI Socket Sanity Timeout: 15

QoS PARAMETERS
802.1p: 6
DiffServ: 46

The IPSI Socket Sanity Timeout determines how many sanity failures will cause an IPSI failover (if you have two in your port network), or a reset(!) if you only have one. The reset is the IPSI’s way of trying to re-establish communication. If you get too many sanity failures, you’ll get this message:

:pcd(5561):MED:[[3:0] checkSlot: too many sanity failures (15)]

Unfortunately, this means my CM lost connectivity to the first IPSI on port network 4. If I only have one IPSI, then the IPSI and all cards in the port network will reset. If I have a redundant IPSI, then the port network will fail over and everything should be okay. In my particular case a second IPSI would not have helped me. It turns out, my MPLS carrier (who had also set up our edge routers) was policing the committed access rate. I’ll explain with more detail in my next post. The resolution was to shape the traffic rather than police it.
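If you want to keep an eye on how close each IPSI gets to that limit, the grep output can be summarized with a few lines of Python. This is a rough sketch that assumes the log lines look exactly like the samples above, that both numbers in the brackets are zero indexed as described, and that the timeout is the 15 shown in the system parameters. The function name `summarize_sanity_failures` is made up for this example.

```python
import re

# Matches both "sanity failure (4)" and "too many sanity failures (15)".
FAILURE = re.compile(
    r"\[\[(\d+):(\d+)\] checkSlot: (?:too many )?sanity failures? \((\d+)\)\]"
)


def summarize_sanity_failures(lines, timeout=15):
    """Map (port network, IPSI), both 1-based, to (worst count, reached timeout)."""
    worst = {}
    for line in lines:
        match = FAILURE.search(line)
        if match is None:
            continue  # e.g. the "data received replacing sanity message" lines
        pn, ipsi, count = (int(group) for group in match.groups())
        slot = (pn + 1, ipsi + 1)  # log values are zero indexed
        worst[slot] = max(worst.get(slot, 0), count)
    return {slot: (count, count >= timeout) for slot, count in worst.items()}
```

Feed it the lines from the ecs logs (with or without the date/time prefix). A worst count that regularly climbs toward the timeout is a hint that something in the network is eating heartbeats, even if nothing has reset yet.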

Recommendations to the new phone administrator

I’ve been working on telephone systems for a while. And I love my job. For the past few years, I find myself working with network administrators who have been handed the job of managing the telephone system. It makes sense – the PBX is just a big voice router, and nowadays the telephones are IP network endpoints.

But there’s more to managing a voice network than knowing the data network. I’m often asked by the new telecom admin “where should I start?” There’s a lot to know. And my biggest piece of advice is to Be the Authority. By this I mean you should be the person everyone asks about telephones. And you should usually start with the telephone and your voicemail system. The telephone is a complicated endpoint. Voicemail has a ton of features and an extremely limited user interface. For example, learn how to do the following:

  • Know how to transfer a call into voicemail without ringing the station.
  • Know how to conference two parties together. This includes two inbound calls. Also, learn the limits of conferencing. How many parties can conference together?
  • Can your users transfer calls outside the PBX (i.e. to mobile numbers)? If so, what happens if voicemail picks up at the far end? How do you pull that call back? What about when you attempt to conference rather than transfer?
  • Learn what all the feature buttons do, like park, call pickup, do-not-disturb, or any one of about 200 possible features.
  • Know how to program the speed dial buttons.
  • Keep a list of conference rooms and the speakerphone numbers handy.
  • Get to know your receptionists and find out what they need in a telephone system. They probably wish they had an accurate company directory, right? In a later post I’ll talk about how to provide this.
  • Spend time walking the floor and interacting with users. When someone calls for a simple change that can be performed remotely, go visit the user or at least give him or her a call. Try to chat about how they use the phone.
  • Learn how to create an out-of-office greeting and activate/deactivate it.
  • Learn how to leave a voicemail for someone without ringing their telephone.

The goal is to know the system well. You want people to think of you when they are trying to do something new. When you’re visiting, discreetly listen to the interaction with callers. I cannot tell you how many times I’ve heard “You’ll have to call back and ask the operator” or “His extension is 8244 but you’ll have to call back. I cannot transfer from here”. Try to help these people understand how to use the phone. Of course, some folks don’t want to hear it but some do. Be helpful. Know your telephone system. Be the Authority.

What types of questions do you get?

481 Call Does Not Exist (no local tag match)

Really? I’m the only person on the entire Internet to get this message from an Avaya Session Manager?

481 Call Does Not Exist (no local tag match)

I’m trying to integrate Avaya Aura Session Manager 6.1 with an Audiocodes MP-118. I have it working in one direction so far. If all goes as planned, I’ll get this figured out and will forget all about the cause for this error 481. Alas, I look forward to it already.

UPDATE: This turned out to be sort of an asymmetric route. In the chaos and confusion of the moment, I had Session Manager A sending calls to the Audiocodes, but the Audiocodes sending responses back to Session Manager B. Hopefully this helps someone out there.

Quick fix for Avaya MAS error Access is denied (0x80070005)

It’s not that I’m an “Avaya guy”, but it just happens to be the system I’ve been working with lately. If any of you have tried to publish a caller app on Modular Messaging and gotten the message Error in application deployment (Access is denied (0x80070005)), there’s an easy fix.

Error in application deployment

In the old version 3.1, you could deploy apps via RDP, but now in version 6.x, you’re only able to do it from a local terminal. Or, you can also RDP with the /admin switch:

mstsc /v:mymas /admin

You can then deploy apps remotely. Simple, but since I couldn’t find any quick info when I googled the error, I thought I’d post it here.

So, how’d you end up managing telecom?

Broadly speaking, those who manage telephone systems come from either a voice background or a data background.

It is quite common for data engineers to be responsible for the company’s voice and data systems. It is very rare that voice engineers get the same privilege. It’s kind of a strange turn of events. I remember saying in the mid 90s “voice guys can learn data easier than data guys can learn voice.” It was true then, but data has gotten more complex. And you know what’s crazy? Voice systems are mostly data now. Most of my troubleshooting and end user support is related to VoIP, routing, dhcp, vlans, PoE, or a hardware problem at the endpoint. The same thing applies to design and installation. It’s all about VoIP assessments, address assignments, and routing. Modern PBXs need to join domains and often have dedicated Ethernet switches, routes, firewall rules. The PBXs are simply servers in the data cabinet. It’s enough to make any old-school voice engineer cringe. I’ve started seeing PBXs installed with patch panels instead of punchblocks. No punchblocks!

The tools of the trade used to be a test phone, punch tool, and toner. Now a laptop is almost all I need. I can learn more from a traceroute or a diagnostic utility within the pbx than I can from visiting the station. And if I do end up visiting the station, the telephone’s diagnostics often tell me the rest. It’s been quite a transition; and in retrospect, it has been all the industry promised us.

So if you come from voice, you are probably seeing some amazing changes to the pbx and peripheral equipment. If you come from data, you’re starting to see a bunch of unfamiliar equipment plugged into your network. At first, those are black boxes the vendors and just a couple people in your company know about. I will help you to break into those black boxes and see what is inside.

How to convert and play a wav file in an Avaya Modular Messaging Caller Application

One of my clients uses Avaya Modular Messaging (aka Messaging Application Server). I recently needed to write a caller application to just hold a channel open for a few minutes. I was working in a quiet environment and didn’t want to sit in a cube and speak “blah blah testing 1 2 3 testing hello nothing to hear keep moving” into the phone for 30 seconds or more, so I figured I would use a wav file I had of some open source music. It’s not obvious, but Avaya Modular Messaging supports drag-and-drop of a wav file onto the “record a prompt” screen.

Drop the wav file onto the media controls

The problem is, the docs are pretty vague on the exact encoding that MAS supports for these files. Surprisingly, I couldn’t get Audacity to export a file in the right format. Nor could I use a variety of other GUI utilities.

I finally decided to try sox. This is a great open source utility that I typically use with Asterisk sound files. In the end, the following command converted the file to a format that MAS liked:

sox source_music_file.wav -eu-law -b8 -c1 -r8000 music4mas.wav
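If you end up with a file from some other tool and want to know whether it matches what that sox command produces (u-law, 8 bits, mono, 8000 Hz), you can read the WAV header directly. The Python sketch below does that with only the standard library. It checks against the settings that worked here, not against any official MAS specification, and the function names are made up for this example.

```python
import struct

WAVE_FORMAT_MULAW = 0x0007  # format tag used for u-law in the WAV "fmt " chunk


def read_wav_format(data: bytes) -> dict:
    """Return the basic fields of the "fmt " chunk of a WAV file."""
    if len(data) < 12 or data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, pos + 8
            )
            return {
                "format_tag": tag,
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        # Chunks are padded to an even number of bytes.
        pos += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no fmt chunk found")


def matches_sox_target(data: bytes) -> bool:
    """True if the header says u-law, mono, 8000 Hz, 8 bits per sample."""
    return read_wav_format(data) == {
        "format_tag": WAVE_FORMAT_MULAW,
        "channels": 1,
        "sample_rate": 8000,
        "bits_per_sample": 8,
    }
```

Read the file in binary mode and pass the bytes to `matches_sox_target`. If it returns False, `read_wav_format` shows which field is off.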

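When I deployed the app, the audio was a little too loud, causing static and distortion on the channel. If you need to lower the volume, you can add a volume adjustment to the sox command line (on the input file!). The -v factor is a linear amplitude multiplier, so 0.1 means one tenth of the original level; a negative value, as in my command, also inverts the signal, which makes no audible difference here: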

sox -v-0.1 music.wav -eu-law -b8 -c1 -r8000 music1v.wav

That did it for me. Happy caller-apping everyone.

“It all started with the smoke signal…”

I just finished a great book titled “The Deal of the Century – The Breakup of AT&T”. During final negotiations between the Justice Department and AT&T, a judge named Vincent Biunno started a two-hour-long diatribe with “It all started with the smoke signal…” Wow, you’re in for a long wait when a judge begins that way.

I owe my career to the breakup of the Bell System. It was a tough time in AT&T’s history, and even thirty years later there are arguments either way. But personally I love being a phone guy and I wouldn’t have the chance if Ma Bell owned and controlled everything.

Phone Guy Blog

I’m a “phone guy” who has made a successful transition to “IT Guy”. Between the 1990s and now, phone guys had to adapt or die. I have adapted. To the PBX administrator who has transitioned to VoIP, or the networking/LAN/WAN administrators who have to support the telecommunications network, this blog is for you.

And of course, being a pretty good telecom guy and computer programmer doesn’t necessarily mean I’m very good at WordPress.