by Ben Logan
Wilkes Communications, Inc.
With voice quality issues, where you get the capture from is critical. Let's say there's a PBX at point A, switches at points B and C, and the problematic phone at point D. Ideally, you want to start with a packet capture at both points A and D simultaneously so you can compare the two. If the person at D hears the issue, the capture from point D is the most important. If the other end of the conversation hears the issue, the capture from point A is most important. For the sake of clarity, I'll refer to this "most important" capture as the "primary" capture and the other capture as the "secondary" capture. I also use "source" and "destination" from the perspective of the endpoint hearing the trouble. So the primary capture is taken at the destination endpoint and the other end is the source.
First, I look to see if I can "see" the issue at all in the packet capture. Ideally, you'll capture the beginning of a problematic call so you can see the call setup and so that Wireshark knows which packets belong to that call without intervention. If you can't catch the call setup in the capture, Wireshark will not know that the UDP traffic is RTP and will just show it as UDP. Right-click on one of the UDP packets that is part of the conversation, click "Decode As", and then select RTP from the list. (I'm assuming you are using RTP for transport here.) You may have to do this for a packet in each direction of the conversation. Once that's done (if necessary), select one of the packets in the conversation and go to Telephony->RTP->Stream Analysis. This dialog gives you a lot of useful information. Here's a breakdown of what I look for:
Max Delta
This should *typically* be slightly higher than 20ms for RTP audio. The way to verify what your value should be close to is to look at the SDP packet from the call setup and see what the ptime media attribute is set to. This number represents the number of milliseconds of audio carried in each audio packet. (See RFC 4566 for details.) The standard packetization rate on every device I've seen is 20ms, and that's what you'll typically see in the SDP. So you know that in a perfect zero-latency, zero-overhead world, your packets would arrive at 20ms intervals. Of course, we don't live in that world, so there's going to be some variation, and there's always going to be some extra delay. But even if you have a higher-latency link between A and D, your *delta* should only be slightly higher than your ptime. High delta translates into high jitter and causes dropped packets on the receiving end. Even if you have a slow link between A and D and there's 150ms of latency, your delta will stay close to 20ms as long as your latency is consistent. Consistency is far more important than low latency. With the county issue, I have a capture that showed a max jitter of 13.06ms but a max delta of over 1000ms. The way jitter is calculated smooths out those max deltas--it is more of an average. So you can have some pretty ugly high deltas periodically mixed in with a majority of really good deltas, and your jitter will still look ok--but you'll hear quality issues. (See https://wiki.wireshark.org/RTP_statistics for more good info on the statistics.) High deltas result in "garbled"-sounding audio (sometimes people report it as sounding like the speaker is under water) if the deltas result in sequence errors, and choppy audio if they don't.
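To see how a single huge delta can hide behind a decent-looking jitter number, here's a minimal sketch of the smoothed interarrival jitter estimator from RFC 3550 (the formula Wireshark's RTP statistics are based on). It assumes the sender paces packets exactly at a 20ms ptime, so the transit-time variation reduces to the arrival delta minus 20ms; the stream itself is made up for illustration.

```python
# A made-up stream: 500 clean 20 ms deltas, one packet detained for a full
# second, then 500 more clean deltas.  The 1/16 gain in the RFC 3550
# estimator smooths the spike away almost immediately.

PTIME_MS = 20.0

def jitter_stats(deltas_ms):
    """Return (max delta, max jitter, mean jitter) for a list of arrival deltas."""
    jitter = 0.0
    jitters = []
    for delta in deltas_ms:
        d = abs(delta - PTIME_MS)      # deviation from the expected 20 ms spacing
        jitter += (d - jitter) / 16.0  # RFC 3550 smoothing: J += (|D| - J) / 16
        jitters.append(jitter)
    return max(deltas_ms), max(jitters), sum(jitters) / len(jitters)

deltas = [20.0] * 500 + [1000.0] + [20.0] * 500
max_delta, max_jitter, mean_jitter = jitter_stats(deltas)
print(f"max delta {max_delta:.0f} ms, max jitter {max_jitter:.2f} ms, "
      f"mean jitter {mean_jitter:.2f} ms")
```

With one packet a full second late mixed into a thousand clean ones, the mean jitter stays under 1ms--the same shape as the county capture described above.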
Max and Mean Jitter
Because of the discussion above, you can see that jitter may look good and yet there still be problems. Nevertheless, jitter is a good measure of how your network is doing overall at delivering packets on time. The question that always comes up is how much jitter is too much. In my experience over the years, the answer is completely dependent on the devices you have on each end, because the phones, PBX, etc. all have a jitter buffer designed to absorb enough packets to re-order any packets that get out of sequence due to jitter. If you have a 100ms jitter buffer and 120ms of jitter, you are going to have problems. On the other hand, if you have a 130ms jitter buffer, it will absorb that 120ms of jitter and you won't experience quality issues. Jitter buffers can be fixed or variable, but the variable ones have a maximum amount of jitter they will compensate for. Increasing jitter buffers increases delay, and increasing delay increases echo issues. So you don't want to increase your jitter buffer more than necessary just to compensate for poor network performance. Across a controlled network with QoS implemented correctly, I don't typically see jitter higher than a few ms. Your mileage may vary based on equipment and network load. At peak times, our network carries 7Gbps of data, more than 1Gbps of multicast video, and around 20Mbps of voice traffic, and our mean jitter is usually <1ms. As a side note, if you have fax machines or other modems in the mix, try to set the ATAs involved to use a fixed jitter buffer. Variable jitter buffers can wreak havoc with fax/data calls because they are designed to be as efficient as possible, and they are ok with a little bit of imperfection that is imperceptible to the human ear but wrecks your data.
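The buffer-size arithmetic above can be sketched in a few lines. This is a simplified fixed playout buffer, assuming a 20ms ptime, where packet i is due for playout at first arrival + buffer + i x 20ms; the arrival times are invented for illustration.

```python
# A fixed playout (jitter) buffer: any packet that arrives after its
# scheduled playout time is discarded.

PTIME_MS = 20

def late_drops(arrivals_ms, buffer_ms):
    """Count packets that miss their playout deadline for a given buffer size."""
    base = arrivals_ms[0] + buffer_ms        # playout time of the first packet
    return sum(1 for i, t in enumerate(arrivals_ms)
               if t > base + i * PTIME_MS)

# Ten packets arriving on their ideal 20 ms grid, except packet 5,
# which is held up an extra 120 ms somewhere in the network.
arrivals = [i * PTIME_MS for i in range(10)]
arrivals[5] += 120

print(late_drops(arrivals, 100))  # 100 ms buffer: the late packet is dropped
print(late_drops(arrivals, 130))  # 130 ms buffer: the spike is absorbed
```

Same stream, two outcomes: the 100ms buffer drops the 120ms-late packet, while the 130ms buffer plays it out cleanly--exactly the trade-off described above.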
Lost RTP Packets and Sequence Errors
The lost RTP packets statistic is pretty self-explanatory. A few lost packets will not cause voice calls too much of an issue, but too many will cause broken-up audio. Sequence errors are usually caused by high jitter. One packet gets delayed so much more than the others that it arrives behind the packets it was in front of. High jitter/high deltas don't always cause sequence errors, though. In the case of the issue we had with the county, there were no sequence errors despite packets getting delayed 1000+ ms. That's because the router that had the issue was detaining the entire stream of packets. So the order of the packets was unaltered, but there was intermittent high delay resulting in broken audio.
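Here's a rough sketch of how loss and sequence errors fall out of the 16-bit RTP sequence numbers. Wireshark's own bookkeeping is more involved, and the example stream is made up, but the distinction between a late packet and a missing one is the important part.

```python
# RTP sequence numbers are 16 bits and wrap at 65536, so differences are
# taken modulo 2**16.  A "gap" bigger than 0x8000 really means the packet
# arrived behind one we have already seen (a reorder, not a loss).

def stream_stats(seqs):
    """Return (lost, reorders) for RTP sequence numbers in arrival order."""
    reorders = 0
    highest = seqs[0]
    for prev, cur in zip(seqs, seqs[1:]):
        if ((cur - prev) & 0xFFFF) > 0x8000:
            reorders += 1                      # arrived behind a higher seq
        if ((cur - highest) & 0xFFFF) < 0x8000:
            highest = cur                      # new highest sequence seen
    expected = ((highest - seqs[0]) & 0xFFFF) + 1
    return expected - len(set(seqs)), reorders

# Packet 3 is delayed behind 4 and 5 (a reorder, not a loss);
# packets 8 and 9 never show up at all (genuine loss).
print(stream_stats([1, 2, 4, 5, 3, 6, 7, 10]))
```

A stream detained as a whole, like the county router was doing, would show zero for both counts here: the order is intact and nothing is missing, yet the deltas are terrible.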
In the previous analysis, be sure you are looking at the right "direction". You'll see a Forward and Reverse Direction tab in the RTP Analysis window (if Wireshark has captured the call setup). Look at the source and destination IPs and make sure you are looking at the right direction for the location where the capture was taken.
Another useful tool is the player built into Wireshark. From the RTP Analysis window, hit the "Player" button. Then hit "Decode". You can listen to one or both sides of the audio, but more importantly, it shows you how much traffic would be dropped or re-ordered by the jitter buffer, and you can specify the jitter buffer size right there. Very handy! (Hit "Decode" again after changing the jitter buffer size.) So if you know that your equipment has a 130ms jitter buffer, you can set it to 130ms and see if you would have lost any appreciable data. You can also listen to the audio from captures taken at different points in the network as a quick way of identifying where the problem is introduced. Which brings me to the next step in the troubleshooting process.
Once you have identified *what* the problem is--high delay, packet loss, excessive jitter, etc--it's time to start narrowing down the where and the why.
The first thing I look at is QoS. QoS needs to be implemented end-to-end at layers 2 and 3. There's a misconception among some that if your network is not congested, you don't need QoS. However, if you mix bursty data traffic in with real-time traffic like voice, you can experience issues long before your links are congested. At layer 2, the voice traffic, including the signaling protocol such as SIP or MGCP, should be prioritized above any bursty data or video traffic. A CoS of 5 is the industry standard for voice traffic. At layer 3, DSCP 46 is standard for voice traffic, although I have seen 24 used for the signaling portion--I think that's somewhat of a throwback to the days before Diffserv (see https://en.wikipedia.org/wiki/Differentiated_services for more details). Next, you may need to consider how your equipment maps the layer 3 DSCP value to the layer 2 CoS value. You lose a lot of granularity there because the CoS field is only 3 bits, so you only have 8 possible values; DSCP, on the other hand, is 6 bits, for 64 possible values. As an example, Brocade MLX switches map DSCP 46 to CoS 5 and DSCP 24 to CoS 3. Our Metaswitch voice switch set DSCP to 24 in the MGCP packets, which got mapped to queue 3 on the MLX. This contributed to some signaling issues we intermittently experienced, so we set the Metaswitch to send the signaling traffic out with DSCP=46 to keep it above our video traffic once it reached the edge of the routed network and traveled the layer 2 access equipment. Some equipment (Brocade included) will let you change the way those values are mapped, but you have to be sure you change it everywhere you need to. For us, that was not as feasible as changing the DSCP value at the source.
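Both of those example mappings (46 to 5, 24 to 3) are consistent with the common default of keeping only the top three bits of the 6-bit DSCP, which is easy to sanity-check--though your gear may map differently:

```python
# Common default DSCP-to-CoS mapping: keep the top three bits of the
# 6-bit DSCP.  Eight CoS values can't distinguish 64 DSCP values, which
# is where the granularity loss described above comes from.

def dscp_to_cos(dscp):
    return dscp >> 3

print(dscp_to_cos(46))  # EF, the voice bearer traffic
print(dscp_to_cos(24))  # CS3, often used for signaling
```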
It's easy to verify these values in Wireshark by looking at the 802.1Q header and the IP header--check it at the source to make sure it's set right from the start, and then check it at the destination to make sure it was maintained throughout. Which brings up another point: if you are serious about having good-quality, low-trouble VoIP, you need to separate your data and voice traffic into different VLANs. It makes life a lot easier.
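If you'd rather script the check than eyeball packet details, the fields sit at fixed bit positions: the DSCP is the top six bits of the IPv4 TOS/traffic-class byte, and the CoS (PCP) is the top three bits of the 802.1Q TCI. A small sketch with made-up field values:

```python
# Pull QoS markings out of raw header fields.
# tos: the second byte of the IPv4 header (DSCP in the top 6 bits,
#      ECN in the bottom 2).
# tci: the 16-bit 802.1Q Tag Control Information field (PCP/CoS in the
#      top 3 bits, then DEI, then the 12-bit VLAN ID).

def dscp_from_tos(tos: int) -> int:
    return tos >> 2

def cos_from_tci(tci: int) -> int:
    return tci >> 13

def vlan_from_tci(tci: int) -> int:
    return tci & 0x0FFF

print(dscp_from_tos(0xB8))   # 0xB8 is how DSCP 46 (EF) appears on the wire
print(cos_from_tci(0xA064))  # PCP/CoS 5 ...
print(vlan_from_tci(0xA064)) # ... on VLAN 100
```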
Even if your QoS stays intact through your network, you need to make sure it is turned on and being properly honored on your equipment.
Also look for bursts of non-VoIP traffic around the troubled VoIP frames that might be the source of the issue. I usually filter on SIP and/or RTP when analyzing a capture just to eliminate the clutter. That's a good starting point, but once you find a problem area, clear your display filter so you can see what's going on around it. Look for bursts of traffic, or ICMP messages that might give a clue as to what is happening.
The above troubleshooting needs to be done as close to the source and destination endpoints as possible. Hopefully you don't see the issue right at the source (the secondary capture we talked about earlier). If you do, you probably have either a problem with the piece of VoIP equipment at that location or an issue on the switch or router that equipment is directly connected to. Assuming the issue isn't at point A or D, move on to point B, C, etc. until the source of the problem is pinpointed. I like to use a divide-and-conquer method in a larger network: if the issue is not introduced at the source, move halfway towards the destination. If the issue shows up there, work back towards the source. If it doesn't show up there, move half of the remaining distance away from the source and test there. And so on. That's just my preference, but it should get you to the root of the problem more quickly on average than moving one hop at a time from source to destination.
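That divide-and-conquer sweep is just a binary search over your ordered capture points. A sketch, assuming that once the problem is introduced it stays visible at every point downstream; `problem_at` is a hypothetical stand-in for you taking a capture at that point and checking its deltas, jitter, and loss:

```python
# Binary search for the first capture point (ordered source -> destination)
# where the problem is visible.  Assumes the issue, once introduced,
# remains visible at every later point on the path.

def first_bad_hop(points, problem_at):
    lo, hi = 0, len(points) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if problem_at(points[mid]):
            hi = mid          # problem already visible here: look earlier
        else:
            lo = mid + 1      # still clean here: look later
    return points[lo]

# Hypothetical 8-hop path where the trouble is introduced at hop "F".
hops = list("ABCDEFGH")
print(first_bad_hop(hops, lambda p: p >= "F"))
```

On this made-up 8-hop path, bisecting pins the bad hop down in three captures, where walking one hop at a time from the source could take six.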