Category Archives: data

How to parse Avaya Communication Manager output into useful csv or database tables

Hello all!

This is a follow-up to my post about extracting data from Avaya Communication Manager. In that post, I showed you how to extract any “list”, “display”, or “status” from Avaya Communication Manager to a text file. First, I would like to show you my bash shell script that extracts this data. And then I’ll show you my perl script that converts the output to a text file that you can open with Excel or import into a database.

What information should you pull from Communication Manager? Well, in my case, I have a text file called ‘commands.txt’ that contains these lines:

display time
list ips
list surv
list med
status cdr
status trunk 1
status trunk 2
status trunk 8
status trunk 11
status trunk 16
status trunk 21
status trunk 79
list station
list reg
status station 2291
status station 2292
status station 2293
status station 2294
list call-forwarding
list off s
display time

So those are the various pieces of information that I care about. Looking at that info, those are my active trunk groups, and I want to know the status of four particular extensions (those happen to be the digital ports of my fax server). Also, I have a port network, so I want the status of IPSIs. Anyway, think about all the stuff you care about when you log into your PBX in the morning. And stick them in that “commands.txt” file. I put a “display time” at the beginning and end so I can look at the file later and tell how long it took to run the file.

So we want to run these commands automatically. So let’s create a file called “sanity.sh” and put this in the file:

file=/home/roger/avaya/sanity/data/sanitycheck_`date +"{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}Y-{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}m-{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}d-{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}H-{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}M"`.txt
perl /home/roger/avaya/sanity/av.pl /home/ARES/randerson/avaya/sanity/la.pbx /home/roger/avaya/sanity/commands.txt >$file
perl /home/roger/avaya/sanity/sanity.pl $file

Once you save the file, make it executable with

chmod +x sanity.sh

Let’s go over this script line-by-line, okay?

  1. The first line creates a variable for the file that will contain the data. There’s a little linux trick to embed a timestamp in the filename.
  2. The second line runs the file av.pl that I shared with you in a previous post about this script. It’s magical. I cannot take all the credit, but I have modified it for our purposes.
  3. The third line runs a file that parses the output of all the commands. This is a very useful script and I can take full credit for this one. It generates a list of “key/value” pairs to a text file.

My thought process with this script is to create a unique key for every piece of useful data. For example, I want the name, IP address, firmware version, gatekeeper, and network region for every station. I also want the EC500 mapping. And I would like the port number for analog stations. Oh! I also want a sense of low or high priority (Is it bad if the data changes?) and I need to include a site identifier in case I have the same extension in multiple sites. That’s a lot of information to display. And, if I have 1000 stations, it’s too much to display all this information on separate lines. That’s like 7000 lines of data for my stations. Fine for machines to read, but it seemed like too much for humans. I want to put them into one (and in the case of ec500, two) lines of data. I do this by assigning a “Key name” for my stations like this:

Key Name Key Value
lo.la.station.8348.offpbx EC500=>2135552978, OPS=>8348
lo.la.station.8348.status Anderson, Roger (ip=10.10.60.138;reg=3;ver=3.260A;gk=10.2.86.180)

See how much data I squeezed into that? And note that very few stations will have off-pbx information. Most stations will just have that station.status key. The script called sanity.pl takes the raw output from the Communication Manager and converts it to these key/value pairs above. Here is a link to sanity.pl. You’ll need to rename it to a ‘.pl’ file. This should compile just fine. When you copy it to your linux server, try

perl -c sanity.pl

That -c just means to check the syntax and don’t run it. It hopefully says “syntax okay”. Actually, I’m sorry but you’ll need to edit the file and change line 45 to the path where you want your ‘keyvalues.txt’ file, which will contain all your data. This should use a command line param, but it doesn’t. Sorry.

You might consider looking in more detail at this file. It’s where the magic happens and it’s a good framework for parsing other data. If you’re interested, here is a summary of how it works:

  • Starting in line 56, we check the raw text file line-by-line looking for the header for each ‘list’ or ‘display’ or ‘status’ section. For example, when we see the text ‘REGISTERED IP STATIONS’, then we know we are expecting to see the result of a ‘list reg’ (by setting our variable of type to 7).
  • Then on line 183 we actually process the registration status of stations. If type=7, then we are in the ‘list reg’ mode and if a line contains registration information for a station, then we capture the data.
  • Each section has some customization, but you’ll notice that (almost) each section performs a ‘set_data’ of the keys and values appropriate for that section.
  • At the end, the trunk group and station summary is generated and the keys are reported alphabetically.

Here are some examples of what the script does:

  • The script parses ‘list station’ and assumes all stations are unregistered.
  • The script parses ‘list reg’ and fills in the registration status for all stations (anything not parsed in this section has a status of ‘unregistered’)
  • The script parses ‘status trunk xxx’ and simply counts the trunks that are in service (ignoring the “idle/in use” values. We just count trunk members in service
  • The script parses ‘status cdr’ and simply stores the percent full of the buffers
  • The script parses ‘list off s’ and stores the off pbx mappings for all stations
  • The script parses various critical fields for IPSIs, media gateways, and survivable servers
  • The script parses ‘list call-forwarding’ and generates a list of all stations that are call forwarded.

At this point, if you ran the bash script, you would have an awesome list of keys and values in a text file called keyvalues.txt. You probably don’t need to know anything about Perl to get this working for you. But if you do know a bit of Perl, you’ll be able to do amazing things with it.

There’s more!

This data is useless unless you’re looking at it. In a future post, I will show you how to schedule this script with cron, push the data into a database, read it with a simple ‘web site in a single file’ php script, filter the data (including historical values!), and get alerts when the values change.

And then I have something REALLY amazing to show you!

Thanks for reading, everyone! Feel free to contact me at roger@rogerthephoneguy.com or post a comment here. If you need help getting this code working, or you want to tweak it a bit, let me know!

Roger

How to parse Avaya CDR into MySQL database with Perl

This is a follow-up to my earlier post about setting up CDR processing with Kiwi Syslog Server. In that post, I discuss how to capture the data to a text file. But of course, if we are capturing CDR, we probably want to analyze it, right?

So this post is dedicated to capturing the data to a MySQL database. It is not for the faint-of-heart though. It requires a little knowledge of MySQL and willingness to read a tiny bit of perl. I am happy to help you set this up, though. Either via this blog as time allows, or I would love to work for you as an hourly contractor – and I love all things telecom!

If you don’t have a Linux server in your network that you can use for your own purposes, I highly recommend one. Any Linux distribution will do, and you probably have a VMWare infrastructure to support it. If you need to justify it, I can help – perhaps even this blog entry will help. Tell your boss you need to perform text analysis of your PBX data. This blog entry assumes you have Linux, but I suppose it will work on Windows with a LAMP stack (but good luck with local firewall rules in Windows).

So first. you must create a MySQL database to store the call records. Let me know if you need help doing this. Since I have to start somewhere with this post, I will assume you have access to a MySQL database. You just need to create a table to store the data. Here is the statement to do that:

drop table if exists cdr_syslog;
create table cdr_syslog (
 id integer auto_increment primary key,
 method varchar(8),
 reported_on datetime,
 deduced_starttime datetime,
 source_ip varchar(32),
 from_number varchar(24),
 to_number varchar(24),
 call_type varchar(12),
 call_time varchar(16),
 direction varchar(3),
 duration_ss integer,
 call_hash varchar(64),
 pbx_time_of_day_hh varchar(2),
 pbx_time_of_day_mm varchar(2),
 pbx_duration_h varchar(1),
 pbx_duration_mm varchar(2),
 pbx_duration_6s varchar(1),
 pbx_condition_code varchar(1),
 pbx_access_code_dialed varchar(4),
 pbx_access_code_used varchar(4),
 pbx_dialed_number varchar(23),
 pbx_calling_number varchar(14),
 pbx_account_code varchar(15),
 pbx_authorization_code varchar(7),
 pbx_frl varchar(1),
 pbx_incoming_ckt varchar(3),
 pbx_outgoing_ckt varchar(3),
 pbx_feature_flag varchar(1),
 pbx_attendant_console varchar(4),
 pbx_incoming_tac varchar(4),
 pbx_node_number varchar(2),
 pbx_ins varchar(5),
 pbx_ixc varchar(3),
 pbx_bcc varchar(1),
 pbx_ma_uui varchar(1),
 pbx_resource_flag varchar(1),
 pbx_packet_count varchar(5),
 source_line varchar(128),
 created_on datetime
);

alter table cdr_syslog add index i_call_hash(call_hash);
alter table cdr_syslog add index i_deduced_starttime(deduced_starttime);
alter table cdr_syslog add index i_from_number(from_number);
alter table cdr_syslog add index i_to_number(to_number);

Those statements create the table and a few indexes to speed things up. The indexes for call_hash and deduced_starttime are used when inserting records. The indexes for from_number and to_number are added since these are probably the most common fields to query. Typically, I will add an index for any field that appears in my ‘where’ clause, but I’m no DBA so there are probably better ways to tune it. However, this works fine for me and should for you as well.

I am using a Kiwi Syslog server to capture the CDR. But the cool thing about Kiwi is that you can “forward” the message along to another server. So I configured mine to send it to my Linux server. This is the screenshot for that forward command within Kiwi:

kiwi-send-logs-out-via-syslog

This simply sends the Syslog message to my server on UDP port 514. This is the typical Syslog port. Then I have a Perl script running on my server. That Perl script has some dependencies

use strict;
use DBI;
use Net::Syslogd;
use Digest::MD5 qw(md5_hex);
use POSIX qw(setsid);

You don’t need to be familiar with Perl to run this script, but you will need to make sure these packages are installed. For example, if I have a script that runs this line:

use Roger::Wilco;

When I try to run the script, I will get this error:

Can't locate Roger/Wilco.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at syslog.pl line 2.
BEGIN failed--compilation aborted at syslog.pl line 2.

So if you try to run my CDR script and you get any kind of “Can’t Locate” error, then you need to install that package. To to this, type this at a command line:

Perl -MCPAN -e shell

And that will get you to the CPAN command line where you can install packages.

# perl -MCPAN -e shell
Terminal does not support AddHistory.

cpan shell -- CPAN exploration and modules installation (v1.9402)
Enter 'h' for help.

cpan[1]>

And from here, you can type

install Net::Syslogd

Or whatever other packages you need to install. Note that case matters, and that’s a double colon between the words. When it asks you a yes/no question, say yes. If all goes well, then it should install fine. CPAN installs all dependencies. If you get errors, the most common is you don’t have a c++ compiler on your machine. I never did figure out how to fix that in Windows. In Linux, there are lots of tutorials. With that in mind, here is the Perl script that will catch the CDR forwarded from Kiwi. Be sure to read my comments at the end.

#!/usr/bin/perl

### Read unformatted CDR data from an Avaya server
### This script uses Net::Syslogd to listen to standard syslog messages
### The Avaya phone systems send to TCP ports. Unfortunately, this script is to listen from 
### A 'proxy' instance of CDR. for example, a Kiwi Syslog Server that forwards syslog messages
### out to another server.
### see my blog http://rogerthephoneguy.com/?p=404 for details


### Usage!
### set the $dir folder to where you store text files from your syslog server
### and run this script with 'perl cdrd.pl file' to tell the script to process files
### or run as 'perl cdrd.pl' to just watch the processing fly by
### or run as 'perl cdrd.pl -d' as a daemon process
### to stop the daemon, just create a file matching $killfile
### Or do a 'ps -ef |grep cdrd.pl to find the PID
### and kill -9 that pid
### not the best way to manage daemons, but we are telephone engineers here, not linux admins, right?

use strict;
use DBI;
use Net::Syslogd;
use Digest::MD5 qw(md5_hex);
use POSIX qw(setsid);

my $db;
my $dir = "/home/roger/avaya/cdr/data/";
my $killfile = "/home/roger/avaya/cdr/cdrd.end";


sub db_open {
        my $dsn = "DBI:mysql:dbname_ie_avaya:localhost";
        my $user = "dbuser";
        my $pass = "dbpassword";
        while (!($db)) {
                $db = DBI->connect($dsn, $user, $pass);
        }
}

sub db_close {
        if($db) {
                $db->disconnect();
        }
        undef($db);
}

sub sql { #check
        my ($sql) = (@_);
        my $sth = $db->prepare($sql);
        if(!($sth->execute())) {
                print "WARNING - unable to execute $sql\n";
        }
        $sth->finish();
}

sub generic_create {
        my ($table, $fields, $values) = (@_);
        my $sql = "insert into $table ($fields) values ($values)";
        my $sth = $db->prepare($sql);
        if(!($sth->execute())) {
                print "WARNING - unable to insert with $sql\n";
        }
        $sth->finish();
}

sub generic_read {
        my ($select, $table, $where) = (@_);
        my @datarow;
        my $sql = "select $select from $table where $where limit 1";
        my $sth = $db->prepare($sql);
        if($sth->execute()) {
                @datarow = $sth->fetchrow_array();
        } else {
                print "unable to execute $sql\n";
        }
        $sth->finish();
        return $datarow[0];
}

sub one {
        my ($sql) = (@_);
        my @datarow;
        my $sth = $db->prepare($sql);
        if($sth->execute()) {
                @datarow = $sth->fetchrow_array();
        } else {
                print "unable to execute $sql\n";
        }
        $sth->finish();
        return $datarow[0];
}

sub process_cdr_line {
    my $cdr = shift;
    chomp($cdr);
    my ($logtime,$facsev,$remoteaddr,$message) = split(/\t/,$cdr);
    my $md5message = md5_hex($message);
    #print "012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789\n$message";
    #here are the start and length of each data point within the message of an "unformatted" cdr line
    #Nobody knows what some of these are. It's a mystery today
    #your condition codes may be different, so you should watch your data closely for several days
    my @parses = (
        0,2,     #time of day hours
        2,2,     #time of day minutes
        4,1,     #duration hours
        5,2,     #duration minutes
        7,1,     #duration tenths
        8,1,     #condition code
        9,4,     #access code dialed
        13,4,    #access code used 
        17,15,     #dialed number
        32,10,     #calling number
        42,15,     #account code
        57,7,     #auth code
        66,1,     #frl
        67,3,     #incoming ckt
        70,3,     #outgoung ckt
        73,1,     #feature flag
        70,4,     #attendant console
        76,4,     #incoming tac
        82,2,     #node number
        84,4,     #INS
        88,3,     #IXC
        92,1,     #BCC
        93,1,     #MAUUI
        94,1,    #Resource flag
        95,4    #packet count
    );
    my $values = '';
    my $insert;
    my $i;
    my $calltime;
    my $duration;
    my $direction = 'unk';
    my $dialed;
    my $calling;
    my $node = '01';
    my $condition;
    for($i=0; $i<25; $i++) { my $start = $parses[$i*2]; my $length = $parses[$i*2+1]; my $part = substr($message,$start,$length); #print "this $i from $start with length $length is '$part'\n"; $part =~ s/^ *//g; $part =~ s/ *$//g; if($i == 0) { #time of day hh $calltime = "$part:"; } elsif($i == 1) { #time of day mm $calltime .= $part; } elsif($i == 2) { #duration h $duration = $part * 60 * 60; } elsif($i == 3) { #duration mm $duration += ($part * 60); } elsif($i == 4) { #duration 1/10 min (6 sec) $duration += ($part * 6); } elsif($i == 5) { #condition code $condition = $part; if($part == 9) { $direction = "in"; } elsif($part == 7) { $direction = "out"; } } elsif($i == 8) { #dialed number $dialed = $part; $dialed =~ s/E$//; if(($dialed ne '6226') && ($dialed =~ /^6[234]\d\d$/)) { $node = '44'; } } elsif($i == 9) { #calling number $calling = $part; $calling =~ s/E$//; if(($dialed ne '6226') && ($dialed =~ /^6[234]\d\d$/)) { $node = '44'; } } elsif($i == 17) { #incoming tac if($part =~ /^#0\d\d/) { $node = '44'; } } elsif($i == 18) { #node number $part = $node; } $values .= "'$part',"; } if($direction eq "in") { my $temp = $dialed; $dialed=$calling; $calling=$temp; } if(($dialed =~ /^\d*$/) && ($calling =~ /^\d*$/)) { #dialed_name and calling_name are filled from a 'display_name' table that is not part of this demo. #for more information, contact me through RogerThePhoneGuy.com and we can chat about how this table is filled my $dialed_name=''; #generic_read("name", "avaya_display_name","station='$dialed'"); my $calling_name=''; #generic_read("name", "avaya_display_name","station='$calling'"); $dialed_name =~ s/'/\\'/g; $calling_name =~ s/'/\\'/g; #print "logtime:$logtime\ncondition $condition\ndirection $direction\ndialed $dialed ($dialed_name)\ncalling $calling ($calling_name)\nduration $duration\n"; $insert = "insert into cdr_syslog values (null,'syslog','$logtime',null,'$remoteaddr','$calling','$calling_name','$dialed','$dialed_name','type','$calltime','$direction',$duration,'$md5message',"; $insert .= "$values '$message',now())"; my $check = one("select count(*) from cdr_syslog where call_hash = '$md5message'"); #print "hash check = $check\n"; if($check == 0) { #We insert the md5 of this line, to prevent double-logging. This allows us to re-process the cdr as much as necessary w/o dups in the database #print "$insert\n"; sql($insert); sql("update cdr_syslog set deduced_starttime = DATE_SUB(reported_on,INTERVAL duration_ss SECOND) where call_hash='$md5message'"); } else { #print "duplicate\n"; } } else { $insert = "insert into pbx_syslog values (null,'$message')"; #sql($insert); } } sub daemonize { #check open STDIN, '/dev/null' or die "Can't read /dev/null: $!"; open STDOUT, '>>/dev/null' or die "Can't write to log: $!";
    open STDERR, '>>/dev/null' or die "Can't write to log: $!";
    defined(my $pid = fork)   or die "Can't fork: $!";
    exit if $pid;
    setsid                    or die "Can't start a new session: $!";
    umask 0;
}

if($ARGV[0] eq "-d") {
    daemonize;
}

db_open();

if($ARGV[0] eq "file") {
    opendir(DH, $dir);
    my @files = readdir(DH);
    closedir(DH);
    foreach my $file (@files) {
        open CDR, "$dir$file";
        my $i = 0;
        while (my $line = ) {
            $i++;
            my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday) = localtime(time);$mon++;$year+=1900;
            $line =~ s/[^!-~\s]//g; #strips unprintable
            process_cdr_line($line);
            #die if($i>2);
        }
        print "$dir$file - $i lines\n";
        close CDR;
        unlink "$dir$file";
    }

} else {
    my $syslogd = Net::Syslogd->new()
      or die "Error creating Syslogd listener: ", Net::Syslogd->error;

    while (!(-e $killfile)) {
        my $message = $syslogd->get_message();
        if (!defined($message)) {
            printf "$0: {0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}s\n", Net::Syslogd->error;
            exit 1;
        } elsif ($message == 0) {
            next
        }

        if (!defined($message->process_message())) {
            printf "$0: {0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}s\n", Net::Syslogd->error
        } else {
            my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday) = localtime(time);$mon++;$year+=1900;
            my $logtime = sprintf("{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}04d-{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}02d-{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}02d {0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}02d:{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}02d:{0ed28e3470e974017c124b0897303dd14e34b5245564abb28916e7d48d9b07c0}02d",$year,$mon,$mday,$hour,$min,$sec);
            my $remoteaddr = $message->remoteaddr;
            my $remoteport = $message->remoteport;
            my $facility = $message->facility;
            my $severity = $message->severity;
            my $time = $message->time;
            my $hostname = $message->hostname;
            my $msg = $message->message;
            my $line = "$logtime\t$facility.$severity\t$remoteaddr\t$msg\n";
            print $line;
            #now $line looks like any line from the CDR files
            #so this next part works via syslog or files
            $line =~ s/[^!-~\s]//g; #strips unprintable
            process_cdr_line($line);
        }
    }
    unlink $killfile;
}

db_close();

There’s a lot in that script. Things you must know:

  1. You’ll need to replace the database name, username, and password in the “db_open” subroutine.
  2. The comments at the top explain how to run it to process files, or to capture syslog live, or even how to run as a daemon.
  3. Each CDR line is hashed and included in the record, so don’t worry about processing the CDR files multiple times. It will NOT insert duplicate records.
  4. When processing files, the script will delete the file when done. So always copy the CDR files into a separate folder before running this script.
  5. I kludged this together from a working Perl script. I removed some identifying information, moved some variables around, and added a bunch of comments. If you get any errors, let me know – it’ll probably be really easy to fix.
  6. For any hobbyist Perl developers out there, this script includes some handy database functions that insert records, return rows, and return one record.
  7. This script is for information purposes only. Naturally, I don’t provide any warranties, guarantees, or assertions that it will work with your systems. Run it at your own risk.
  8. If you’d like me to help set this up, I’d love to work for you. I love telecom and I’m always available for hire as a consultant. I would LOVE to be your telephone guy!

If you’re planning on trying this, then I’m really flattered. Let me know how it goes. Mine has been running for about two years and I cannot tell you how handy it is to have that much CDR data in a MySQL table. I have a PHP script that I use to analyze the data. Do you want to see that?

Thanks all,

Roger

 

Avaya 9630 Locks / Reboots up when registering

I recently had an Avaya phone in a reboot cycle. It would boot up, then when it registered it would lock up and after a couple minutes reboot again. The display looked like it was up and running fine, but when you press the SPEAKER button I would get three beeps. This is typically what happens when a phone cannot get TCP/IP signalling traffic to the call server. And the night before we had some maintenance on that Ethernet switch so I immediately suspected a network problem.

Just in case, I did a “CLEAR” procedure on the phone. Then I swapped out the phone. Then I swapped out the patch cables (at both ends). Then I moved the phone to a different port in the Ethernet switch. No matter what I did, the phone locked up. Then I tried something I probably should have tried earlier – I logged in a different extension and it worked fine! Then I logged the “bad” extension into a different phone and it locked up!

Turns out the config file on the web server (1234_96xxdata.txt) was incomplete. Apparently it was related to the network after all! When the phone was writing its data file to the web server the previous night, the write operation was interrupted as the network guy shut off the Ethernet switch. The resulting data file was incomplete – it had about half the call log entries and a partial line at the end. But none of the important lines you’d expect in the file such as:

LOGTDFORMAT=128
Redial=1
Edit Dialing=1
Go to Phone Screen on Calling=0
Go to Phone Screen on Ringing=1
Call Timer=1
Visual Alerting=0
History Active=1
Log Bridged Calls=1
Audio Path=1
Personalized Ring=0
Handset AGC=1
Headset AGC=1
Speaker AGC=1
Error Tone=1
Button Clicks=0
Text Size=1
Contacts Pairing=0
Voice Initiated Dialing=1
Voice Dialing Help Counter=0
Personalized Ring Menu=0
Go to Phone Screen on Answer=0
Voice Initiated Dialing Language=

If the file were missing, the phone would use default values and create the file at the next backup. However, since the file was there, the phone processed it but ended up locking up because it was incomplete. In all my years working with these phones, I’ve never seen that before. I wouldn’t have thought it possible for the phone’s interrupted “HTTP PUT” operation to result in an incomplete file on the web server, but there you go. Hopefully this helps you.

MPLS Network and QOS – Traffic Policing vs. Traffic Shaping

As mentioned in my last post, I had some trouble with resets in my PBX due to “some kind of network event”. It was very enlightening when we discovered the root cause, so I wanted to share my experience here in hopes that it saves you from similar sleepless nights.

The symptom was quite simple – the PBX was losing connectivity with its host processor for some reason. Avaya called it a “network event” but couldn’t be more specific. I was fortunate enough to work on the same team as the LAN/WAN techs. If we had been in different departments, this would have been even more of a nightmare. So the Avaya PBX loses heartbeats – after 15 of them it caused a reset. Heartbeats are about one second apart, so a failure of 15 heartbeats is a very long time. Surely a WAN outage of 15 seconds would be noticed by other systems, right? So when Avaya says there was a network event, my response was “uh, you gotta give me more than that. Nothing else on the network noticed”.

I’ll spare you the details here – they’re in my previous post anyway. The issue is the PBX is marking all IP traffic with “Expedited Forwarding” (EF), or Diffserv 46, or High Priority, or QOS. There are plenty of synonyms but it just means all IP packets are tagged with high priority and should be preferred over other packets in the LAN and WAN. Our MPLS carrier honors this tag through their network. Perfect, right?

So when you purchase QOS in the MPLS network, you are given a certain amount of bandwidth allowed. Obviously you’re not allowed to mark all traffic as high priority. MPLS is a shared cloud and you’ll pay a premium for expedited delivery of QOS packets. This bandwidth allotment is called the Committed Access Rate or CAR. What happens when you exceed this CAR? Well, as a telephone guy, I would assume the packets are delivered as “best effort” after that. But I was very wrong. Packets over the CAR are discarded by the MPLS carrier. Think about it – if you have high-priority packages to deliver overnight and the you’re only allowed to send ten per day. The eleventh package isn’t held for best effort. It’s thrown in the dumpster. Sorry. You exceeded your ten packages today, I’m throwing this one out.

The fix is simple and elegant. And probably crystal clear to you network routing engineers out there. It’s up to your edge router to strip the QOS tag from any packets that exceed the CAR. It sounds dangerous to me – what if my edge router and the carrier’s MPLS router disagree slightly on the current amount of traffic at this particular second? Especially if we’re using different brands of router? However, Mbps seems to be an agreed-upon measurement across all vendors, so having my edge router strip the QOS tag from these packets simply allows them to be delivered at best-effort across MPLS.

What I discovered, is if this setting is not correct, it only affects high-priority traffic. In my experience, most companies use a different network for video and everything else on the network such as Internet, email, chat, file servers, replication, database synchronization, etc. are not marked with QOS. So guess what? It looks like your phone system is hosed! And nothing else on the network is complaining! This gets back to a trend I’ve been noticing: Being a phone guy doesn’t have much to do with telephones anymore. It’s almost always the network. And to be a really good phone tech, you have to know networking well. I hope this helps.

I recently went through an ordeal with a PBX resetting. It’s an Avaya system using an IPSI to connect a port network back to its host, but this situation applies to anyone out there using QOS on their MPLS network. I’ve often said that being a “phone guy” is rarely about phones anymore. Most of my work – certainly troubleshooting – involves IP networking.

So I had a PBX with one IPSI that would occasionally reset. Since there was only one IPSI, the reset would cause all cards in the port network to reset as well, which would drop all calls in progress. Now this is about the worst thing that can happen when you’re responsible for the telephones. Full system outages are easier to understand. This is a reset, calls drop, users get frustrated and re-establish their calls, then it would reset again. It was a really bad situation.

What is causing the resets? Avaya said the heartbeats were failing to the IPSI. For any of you with an IPSI-connected port network, you should occasionally look for these. SSH to your Communication Manager and cd to /var/log/ecs. Then list the log files. Assuming you’re in Feburary 2013, you would look for missed heartbeats in your ecs log with the command:

cat 2013-02*.log|grep checkSlot
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (1)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (2)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (3)]
:pcd(5561):MED:[[3:0] checkSlot: sanity failure (4)]
:pcd(5561):MED:[[3:0] checkSlot: data received replacing sanity message; socket delay is 14 secs]

I have stripped the date/time; you’ll see those on the left. Port networks and IPSIs are zero indexed, so the messages above apply to port network 4 and IPSI number 1.

I have been told that occasional sanity failures are just a part of life. These heartbeat messages are part of the Avaya protocol, not ICMP. So if you’re missing heartbeats, it’s not because ICMP is being dropped.

However, after a certain number of sanity failures, the IPSI will reset in order to re-esablish communication. How many sanity failures? That depends upon a system parameter setting:

display system-parameters ipserver-interface
IP SERVER INTERFACE (IPSI) SYSTEM PARAMETERS

SERVER INFORMATION
Primary Control Subnet Address:
Secondary Control Subnet Address:

OPTIONS

Switch Identifier: A
IPSI Control of Port Networks: enabled
A-side IPSI Preference: disabled
IPSI Socket Sanity Timeout: 15

QoS PARAMETERS
802.1p: 6
DiffServ: 46

The IPSI Socket Sanity Timeout determines how many sanity failures will cause an IPSI failover (if you have two in your port network), or a reset(!) if you only have one. The reset is the IPSI’s way of trying to re-establish communication. If you get too many sanity failures, you’ll get this message:

:pcd(5561):MED:[[3:0] checkSlot: too many sanity failures (15)]

Unfortunately, this means my CM lost connectivity to the first IPSI on port network 4. If I only have one IPSI, then the IPSI and all cards in the port network will reset. If I have a redundant IPSI, then the port network will failover and everything should be okay. In my particular case a second IPSI would not have helped me. It turns out, my MPLS carrier (who had also set up our edge routers) was policing the committed access rate. I’ll explain with more detail in my next post. The resolution was to shape the traffic rather than police it.