RGS Agents can’t answer calls

I encountered an interesting fault this week: calls were being presented to Lync Response Group Agents, but they couldn’t answer them. Callers would be listening to hold, cursing the delays in the queue while the agents were desperately trying to answer the calls.

At the same time it also transpired that one of their two PSTN Gateways had died (although not fully). They had two ISDN streams, one into each Gateway, and one Gateway had failed with what I suspect to be the loss of one power supply rail. This was not enough to drop the ISDN circuit, so any calls that were presented on that service were rejected with an ISDN supervisory message. Once we realised a reboot wouldn’t bring the Gateway back, unplugging its ISDN cable was sufficient to address that.

This is a Lync 2010 installation, with the PSTN Gateways talking by TCP back to the Mediation Server role in the single Enterprise Edition Front-End server.

Prior to me being engaged, the customer had already restarted the FE, as this fault seemed to have appeared after they’d installed some Windows Updates, which were duly removed. Interestingly, it would come good for *1* call, and then no other calls could be answered.

Response Group problems like this are usually resolved by restarting the RGS Service, but alas it wasn’t going to be that easy.

Tracing on the Front-End didn’t give me any clues (or so I thought at the time). There were plenty of error messages telling me that my Gateway was down, but I knew that, and I’d already removed it from the outbound routes anyway.

LS Mediation Server error 25051: “There was no response from a gateway to an OPTIONS request sent by the Mediation Server.”
LS Mediation Server error 25052: “The Gateway peer cannot be contacted. Mediation server will keep trying; however additional failures will not be logged.”
LS Mediation Server error 25061: “The Mediation Server service has encountered a major connectivity problem with these gateway peer(s).”

It was a client-side trace that finally twigged for me: as the call was answered by the agent, a 503 was logged referencing the dead Gateway:

[Screenshot: Snooper-503-edit2]

SIP/2.0 503 Service Unavailable
Authentication-Info: TLS-DSK qop="auth", opaque="FC46C4F6", srand="CC54D7FF", snum="55", rspauth="d575351449c16171d20572dced573089823c4e3b", targetname="lync2010EEFE1.contoso.com.au", realm="SIP Communications Service", version=4
Via: SIP/2.0/TLS 10.10.10.21:59260;ms-received-port=59260;ms-received-cid=39000
FROM: "User Name"<sip:uname@contoso.com.au>;tag=ab4325bd7d;epid=16c77e98b5
TO: <sip:lync2010EEFE1.contoso.com.au@epa.vic.gov.au;gruu;opaque=srvr:MediationServer:dpyWhsXsx1mN_ORGYjSpFgAA;grid=807cd079236c4ff9a37663ef97cc333d>;epid=6E70BBE33B;tag=71c1eacdd8
CSEQ: 1 INVITE
CALL-ID: b72a8501bb3f44529cfa6ca4f53215db
CONTENT-LENGTH: 0
SERVER: RTCC/4.0.0.0 MediationServer
ms-endpoint-location-data: NetworkScope;ms-media-location-type=intranet
ms-trunking-peer-state: down
ms-trunking-peer: 10.10.10.37
ms-diagnostics: 10001;source="lync2010EEFE1.contoso.com.au";reason="Gateway did not respond in a timely manner (timeout)";component="MediationServer"
ms-diagnostics-public: 10001;reason="Gateway did not respond in a timely manner (timeout)";component="MediationServer"
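
If you have more than a couple of client traces to wade through, a few lines of script can do the squinting for you. What follows is only a rough sketch, not anything that ships with Lync: it assumes you've copied the SIP messages out of Snooper into plain text with a blank line between messages, and the function name find_down_peers is my own invention. It picks out any message carrying "ms-trunking-peer-state: down" and reports the peer address and the reason from ms-diagnostics.

```python
import re


def find_down_peers(sip_text):
    """Return one dict per SIP message flagged 'ms-trunking-peer-state: down'.

    Assumption: messages in the saved text are separated by blank lines.
    """
    results = []
    for block in re.split(r"\n\s*\n", sip_text.strip()):
        lines = block.splitlines()
        headers = {}
        # Skip the status/request line; split each header on its first colon.
        for line in lines[1:]:
            name, sep, value = line.partition(":")
            if sep:
                headers[name.strip().lower()] = value.strip()
        if headers.get("ms-trunking-peer-state", "").lower() == "down":
            reason = re.search(r'reason="([^"]*)"', headers.get("ms-diagnostics", ""))
            results.append({
                "status": lines[0].strip(),
                "peer": headers.get("ms-trunking-peer"),
                "reason": reason.group(1) if reason else None,
            })
    return results


# A cut-down copy of the 503 from the trace above.
sample = """SIP/2.0 503 Service Unavailable
CSEQ: 1 INVITE
ms-trunking-peer-state: down
ms-trunking-peer: 10.10.10.37
ms-diagnostics: 10001;source="lync2010EEFE1.contoso.com.au";reason="Gateway did not respond in a timely manner (timeout)";component="MediationServer"
"""

for hit in find_down_peers(sample):
    print(hit["peer"], "-", hit["reason"])
# prints: 10.10.10.37 - Gateway did not respond in a timely manner (timeout)
```

If the peer it names is a Gateway you thought you'd already taken out of service, that's your cue to go and look at the Topology.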

From here it was just a quick hop to Topo Builder, where I tagged the working Gateway as the default and Published. (I don’t recollect now whether I restarted Mediation on the FE, but if the topo publish doesn’t resolve the matter for you fairly promptly, give it a whirl.)

Lync 2013 Seems Unaffected

I’ve tried to reproduce the scenario in my hybrid Lync 2010 & 2013 Lab environment but all calls come in OK, so I’m guessing this weakness might have been removed from Lync 2013.

According to the Lync 2013 Topology Builder, “a default trunk is required only when your topology contains Office Communications Server 2007 R2”:

[Screenshot: Topo-MarkDefault-edit]

If you remember back to OCS days (and I know – it’s been a while!), there was a 1:1 relationship between Mediation Servers and Gateways, and in OCS’ routing rules you specified the Mediation Server as the target of a route, rather than the down-stream PSTN Gateway as you do with Lync. Thus, in an environment where you still had OCS present, the Default (PSTN) Gateway is the one Lync would pass the call to when OCS sent a call to its Mediation Server.

Turns out I’ve seen this before…

On my page of Lync error messages, I’ve documented a case where users weren’t able to Park a call:

[Screenshot: Lync-CannotParkCallRightNow]

This one’s for the “strange but true” category. In this scenario I called in from the mobile via PSTN Gateway A to my Lync user, answered the call and attempted unsuccessfully to transfer it to Call Park. Tracing on the FE and Client were inconclusive, but I kept noticing 503 errors tattling on “Gateway B”, which I’d removed from all my call routes and powered-off but retained in the Topology. Turns out that for some strange reason, Lync wants to talk to the Gateway I’ve defined in the Topology as the Default Gateway prior to parking a call! So, the moral of the story: Don’t turn off the “Default Gateway” in your Topology without selecting a new Default.

… or upgrade to Lync 2013, remove your OCS, and “Unmake Default” the gateways.

The Moral of this Story?

I guess it’s that when multiple faults happen simultaneously, it’s not just a coincidence – they’re related, regardless of how strangely!
 
