客户在使用tuxedo时,访问应用出现断断续续不稳定现象,客户环境为tuxedo 11。
一、应用报错
通过查看tuxedo ULOG的错误日志发现有如下错误:
194013.xxxxx!WSL.6961.4157044416.0: WSNAT_CAT:1175: WARN: Disconnecting listener connection to workstation, request timed out 194014.xxxxx!WSL.6961.4157044416.0: WSNAT_CAT:1175: WARN: Disconnecting listener connection to workstation, request timed out 194014.xxxxx!WSL.6961.4157044416.0: WSNAT_CAT:1175: WARN: Disconnecting listener connection to workstation, request timed out 194014.xxxxx!WSL.6961.4157044416.0: WSNAT_CAT:1175: WARN: Disconnecting listener connection to workstation, request timed out 194014.xxxxx!WSL.6961.4157044416.0: WSNAT_CAT:1175: WARN: Disconnecting listener connection to workstation, request timed out |
二、分析过程
首先看这个错误的解析:
Description The workstation listener will scan its handler table and time out any connections that are overdue, restart any dead handlers, and allocate any blocked connection requests to a handler that is available. When scanning this table, the workstation listener found a blocked connection request that had not been passed to an available handler. Since the time to assign a handler expired, the workstation listener went ahead and forced the connection closed. Action This probably indicates that the number of workstation clients that need to join the application can not be supported by the existing configuration. The workstation listener command line options should be reset to support a larger number of workstation clients. The workstation listener -m, -M, and -x options should be used to that effect. |
意思上看是WSH处理不过来了,当前的配置可能满足不了当前的请求
然后检查WSL监听配置,tmunloadcf
"WSL" SRVGRP="GROUP1" SRVID=10 CLOPT="-A -- -n //192.168.40.117:9999 -m 5 -M 10 -x 5" |
-m 5 最小5个WSH
-M 10 最大10个WSH
-x 5 此参数的值表示每个工作站处理程序可以同时支持的工作站客户端数量
(An optional parameter used to control the degree of multiplexing desired within each workstation handler. The value for this parameter indicates the number of Workstation clients that can be supported simultaneously by each workstation handler. The workstation listener ensures that new handlers are started as necessary to handle new Workstation clients. This value must be greater than or equal to 1 and less than or equal to 4096. The default for this parameter is 10.
)
三、跟踪日志
输入pclt检查当时的情况
printclient(pclt) print names and other information about active client processes 打印 激活的客户端进程的 名称和其他信息
> pclt LMID User Name Client Name Time Status Bgn/Cmmt/Abrt --------------- --------------- --------------- -------- ------- ------------- simple tuxedo WSH 0:00:19 IDLE 0/0/0 simple tuxedo WSH 0:00:19 IDLE 0/0/0 simple tuxedo WSH 0:00:19 IDLE 0/0/0 simple tuxedo WSH 0:00:19 IDLE 0/0/0 simple tuxedo WSH 0:00:19 IDLE 0/0/0 simple tuxedo WSH 0:00:07 IDLE 0/0/0 simple tuxedo WSH 0:00:07 IDLE 0/0/0 simple tuxedo WSH 0:00:07 IDLE 0/0/0 simple tuxedo WSH 0:00:07 IDLE 0/0/0 simple tuxedo WSH 0:00:07 IDLE 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:07 BUSY/W 0/0/0 simple 192.168.40.117 0:00:06 BUSY/W 0/0/0 simple 192.168.40.117 0:00:06 BUSY/W 0/0/0 simple 192.168.40.117 0:00:06 BUSY/W 0/0/0 simple 192.168.40.117 0:00:06 BUSY/W 0/0/0 simple 192.168.40.117 0:00:06 BUSY/W 0/0/0 simple 192.168.40.117 0:00:05 BUSY/W 0/0/0 simple 192.168.40.117 0:00:05 BUSY/W 0/0/0 simple 192.168.40.117 0:00:04 BUSY/W 0/0/0 simple 192.168.40.117 0:00:03 BUSY/W 0/0/0 simple tuxedo tmadmin 0:00:15 IDLE 0/0/0 > |
可以发现50个客户端进程正处于BUSY状态,此50正好等于WSH能够处理的最大数目
同时检查192.168.40.117的调用服务进程:
ps -ef|grep tuxedo tuxedo 1972 1920 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1973 1922 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1974 1970 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1975 1937 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1976 1941 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1977 1924 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1978 1943 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1979 1927 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1980 1946 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1981 1955 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1982 1961 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1983 1963 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1984 1929 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1985 1939 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1986 1930 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1987 1959 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1988 1921 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1989 1923 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1990 1965 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1991 1911 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1992 1914 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1993 1931 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1994 1916 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1995 1925 0 17:52 pts/0 00:00:00 ./callserver1 aa tuxedo 1996 1926 0 17:52 pts/0 00:00:00 ./callserver1 aa ...... |
引起WSH被消耗尽的services是callserver1
进一步跟踪 1995进程的堆栈
gstack 2808 #0 0x00007f18b7530a8d in recv () from /lib64/libpthread.so.0 #1 0x00007f18b8157813 in _e_net_rcvraw () from /tuxedo/tuxedo11gR1/lib/libgpnet.so #2 0x00007f18b8160f19 in _e_net_recv () from /tuxedo/tuxedo11gR1/lib/libgpnet.so #3 0x00007f18b85d988a in br_e_net_recv () from /tuxedo/tuxedo11gR1/lib/libwsc.so #4 0x00007f18b8591106 in _wsc_nw_msgrcv () from /tuxedo/tuxedo11gR1/lib/libwsc.so #5 0x00007f18b8592d89 in _wsc_getrply () from /tuxedo/tuxedo11gR1/lib/libwsc.so #6 0x00007f18b85976ec in _wsc_conn () from /tuxedo/tuxedo11gR1/lib/libwsc.so #7 0x00007f18b858cdca in tpchkauth () from /tuxedo/tuxedo11gR1/lib/libwsc.so #8 0x00007f18b858e794 in _tpinit () from /tuxedo/tuxedo11gR1/lib/libwsc.so #9 0x00007f18b85e3c96 in tpinit () from /tuxedo/tuxedo11gR1/lib/libwsc.so #10 0x00000000004009b4 in main () |
看堆栈是停留在接收消息,中间处理时间太长
四、故障总结
客户端192.168.40.117调用callserver1服务超出了WSH能够处理的最大数目,当新的请求进来后并不能及时得到处理时,ULOG就会抛出WSNAT_CAT:1175: WARN: Disconnecting listener connection to workstation, request timed out
建议解决方案
1. 修改TUXCONFIG 文件,根据并发访问数增加 WSL的配置,当前配置为CLOPT="-A -- -n //192.168.40.117:9999 -m 5 -M 10 -x 5",例如修改配置为
CLOPT="-A -- -n //192.168.40.117:9999 -m 5 -M 15 -x 15"
2. 检查对应的service, 是否此service有性能问题,正常来说调用是很快返回结果后就会释放WSH,从而不会造成阻塞,当前的service是callserver1。