再论mysql线程出现Waiting for table flush被hang住问题

一.问题描述：

版本：mariaDB 10.0.2
我们在http://blog.csdn.net/zengxuewen2045/article/details/51412663一文中曾经提到过一条sql搞挂mysql实例的案例，当时分析没有找到kill操作的原因。
这次问题再次遇到，进一步分析问题产生原因,解决方法如此相似。
show processlist\G

跑最长时间线程处于killed状态，一直未释放

*************************** 6. row ***************************
Id: 5449173
User: lotsprd
Host: 10.16.26.199:51821
db: lots
Command: Killed
Time: 84026
State: Sending data
Info: SELECT DISTINCT
o.CUSTOMER_NAME,
o.WAREHOUSE_NAME,
o.CUSTOMER_ORDER_NO,
o.ANNTO_ORDER_NO
Progress: 0.000

很多正在跑sql的线程都处于Waiting for table flush状态
*************************** 10. row ***************************
Id: 5557272
User: lotsprd
Host: 172.16.12.155:54883
db: lots
Command: Query
Time: 70
State: Waiting for table flush
Info: SELECT
o.ANNTO_ORDER_NO,
ct.CARRIER_CODE,
ct.EXPRESS_NO,
ct.NODE_ACTION,
ct.ACCEPT_TI
Progress: 0.000
*************************** 11. row ***************************
Id: 5557273
User: lotsprd
Host: 172.16.12.155:54884
db: lots
Command: Query
Time: 9
State: Waiting for table flush
Info: SELECT annto_order_no,express_no,express_code,EXPRESS_NAME,source_express_code FROM (
SELECT

二.问题分析
1.检查binlog中是否有手工执行的flush table、alter table和analyze table命令
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep FLUSH
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep flush
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep analze
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ANALYZE
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep alter
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ALTER

查询都没有手工执行的flush table、alter table和analyze table命令。

2.检查备份，发现在1点时的备份脚本中有在备份前kill掉长时间查询的线程
crontab -l
0 1 * * * /apps/sh/cron/new_backup_mysql.sh --socket=/tmp/mysql3306.sock >> /apps/tmp/dump_3306.log 2>&1

mydumper --user=${db_user} --password=${db_pass} --long-query-guard=${v_long_query_guard} --socket=${socket} --database=${dbname} --outputdir=${this_back_p} --compress

--verbose=3 --logfile=${base_p}/${logfilename} ${kill_long_query_option}
v_backup_type=0 #mydumper 备份
#log_mysql_start_status
backup_mydumper
if [ $try_times -eq 1 ]
then
kill_long_query_option="--kill-long-queries"

3.查看mydumper备份的日志，确认有在01:01:22时kill了运行超过54999s的线程
cd /apps/mysql_backup
tail -300 new_backup_mysql3306.log.archive|more

2016-07-01 01:01:22 [WARNING] - Killed a query that was running for 54999s
2016-07-01 01:11:22 [ERROR] - Couldn't acquire global lock, snapshots will not be consistent: Lock wait timeout exceeded; try restar
ting transaction
2016-07-01 01:21:24 [ERROR] - Error dumping table (lots.t_carrier_tracking) data: Lock wait timeout exceeded; try restarting transac
tion
2016-07-01 01:21:24 [INFO] - Empty table lots.t_carrier_tracking
2016-07-01 01:21:42 [ERROR] - Error dumping table (lots.t_order) data: Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:21:42 [INFO] - Empty table lots.t_order

2016-07-01 01:24:57 [ERROR] - Error dumping table (lots.t_warehouse) data: Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:24:57 [INFO] - Empty table lots.t_warehouse
2016-07-01 01:31:24 [ERROR] - Error dumping schemas (lots.t_carrier_tracking): Lock wait timeout exceeded; try restarting transactio
n
2016-07-01 01:31:24 [INFO] - Thread 2 shutting down
2016-07-01 01:31:42 [ERROR] - Error dumping schemas (lots.t_order): Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:31:42 [INFO] - Thread 4 shutting down
2016-07-01 01:31:57 [ERROR] - Error dumping schemas (lots.t_warehouse): Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:31:57 [INFO] - Thread 1 shutting down

但从上面报错来看，线程被kill，但线程和锁没有真正释放，影响到的表为T_ORDER、 t_warehouse、t_carrier_tracking。
在flush tables with read lock成功获得锁之前，必须等待所有语句执行完成（包括SELECT）。所以如果有个慢查询在执行，或者一个打开的事务，或者其他进程拿着表锁，flush tables

with read lock就会被阻塞，直到所有的锁被释放。
这三个表:T_ORDER、 t_warehouse、t_carrier_tracking由于慢sql没有执行执行完成，导致加锁没有成功。所以下次再有新产生对三个表的select 或dml的sql执行时，线程就会出现waiting

for table flush。我们mysql的lock_wait_timeout时间为600 秒，上面备份加锁时，等待锁超过了10分钟后，就会超时报错。

此次问题产生原因是慢SQL和备份操作杀进程引起的，主要原因还是由于慢sql长时间执行，性能太差。

三.解决方法
关闭mysql，mysql长时间未有关闭的话，使用kill -9来杀掉mysql进程。
最后将引起问题的慢SQL发给开发人员优化，优化人员反馈已优化业务逻辑，观察问题暂未出现。

更多相关文章

随机推荐