Android(安卓)系统中WatchDog 日志分析
来源:
https://blog.csdn.net/shift_wwx/article/details/81030749
前言:
Android 系统中的 WatchDog 详解 一文中通过source code详细的分析了 WatchDog 的软件设计原理和监测过程,这一篇结合实例对WatchDog 做进一步的分析。
源码引路:
final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", ""); File stackFd = finalStack; if (tracesDirProp.isEmpty()) { String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null); String traceFileNameAmendment = "_SystemServer_WDT" + mTraceDateFormat.format(new Date()); if (tracesPath != null && tracesPath.length() != 0) { File traceRenameFile = new File(tracesPath); String newTracesPath; int lpos = tracesPath.lastIndexOf ("."); if (-1 != lpos) { newTracesPath = tracesPath.substring(0, lpos) + traceFileNameAmendment + tracesPath.substring(lpos); } else { newTracesPath = tracesPath + traceFileNameAmendment; } traceRenameFile.renameTo(new File(newTracesPath)); stackFd = new File(newTracesPath); } else { Slog.w(TAG, "dump WDT Traces: no trace path configured"); } } else { // the new trace dumpsing mechanism String newTracesPath = "traces_SystemServer_WDT" + mTraceDateFormat.format(new Date()) + "_pid" + String.valueOf(Process.myPid()); File tracesDir = new File(tracesDirProp); File watchdogTraces = new File(tracesDir, newTracesPath); try { if (watchdogTraces.createNewFile()) { FileUtils.setPermissions(watchdogTraces.getAbsolutePath(), 0600, -1, -1); // -rw------- permissions // Append both traces from the first and second half // to a new file, making it easier to debug Watchdog timeouts //dumpStackTraces() can return a null instance, so check the same if (initialStack != null) { // check the last-modified time of this file. // we are interested in this only it was written to in the // last minute or so final long age = System.currentTimeMillis() - initialStack.lastModified(); final long MINUTE_IN_MILLIS = 1000 * 60; if (age < MINUTE_IN_MILLIS) { Slog.e(TAG, "First set of traces taken from " + initialStack.getAbsolutePath()); appendFile(watchdogTraces, initialStack); } } else { Slog.e(TAG, "First set of traces are empty!"); } if (finalStack != null) { Slog.e(TAG, "Second set of traces taken from " + finalStack.getAbsolutePath()); appendFile(watchdogTraces, finalStack); } else { Slog.e(TAG, "Second set of traces are empty!"); } } else { Slog.w(TAG, "Unable to create Watchdog dump file: createNewFile failed"); } } catch (IOException ioe) { Slog.e(TAG, "Exception creating Watchdog dump file:", ioe); } } final File newFd = stackFd; // Try to add the error to the dropbox, but assuming that the ActivityManager // itself may be deadlocked. (which has happened, causing this statement to // deadlock and the watchdog as a whole to be ineffective) Thread dropboxThread = new Thread("watchdogWriteToDropbox") { public void run() { mActivity.addErrorToDropBox( "watchdog", null, "system_server", null, null, subject, null, newFd, null); } }; dropboxThread.start(); try { dropboxThread.join(2000); // wait up to 2 seconds for it to return. } catch (InterruptedException ignored) {} // At times, when user space watchdog traces don't give an indication on // which component held a lock, because of which other threads are blocked, // (thereby causing Watchdog), crash the device to analyze RAM dumps boolean crashOnWatchdog = SystemProperties .getBoolean("persist.sys.crashOnWatchdog", false); if (crashOnWatchdog) { // Trigger the kernel to dump all blocked threads, and backtraces // on all CPUs to the kernel log Slog.e(TAG, "Triggering SysRq for system_server watchdog"); doSysRq('w'); doSysRq('l'); // wait until the above blocked threads be dumped into kernel log SystemClock.sleep(3000); // now try to crash the target doSysRq('c'); } IActivityController controller; synchronized (this) { controller = mController; } if (controller != null) { //一般monkey 的时候会打印 Slog.i(TAG, "Reporting stuck state to activity controller"); try { Binder.setDumpDisabled("Service dumps disabled due to hung system process."); // 1 = keep waiting, -1 = kill system int res = controller.systemNotResponding(subject); if (res >= 0) { Slog.i(TAG, "Activity controller requested to coninue to wait"); waitedHalf = false; continue; } } catch (RemoteException e) { } } // Only kill the process if the debugger is not attached. if (Debug.isDebuggerConnected()) { debuggerWasConnected = 2; } if (debuggerWasConnected >= 2) { Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process"); } else if (debuggerWasConnected > 0) { Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process"); } else if (!allowRestart) { Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process"); } else { Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject); for (int i=0; i
1、prop dalvik.vm.stack-trace-dir
一帮指的是 /data/anr
2、watch dog 日志名称
String newTracesPath = "traces_SystemServer_WDT" + mTraceDateFormat.format(new Date()) + "_pid" + String.valueOf(Process.myPid());File tracesDir = new File(tracesDirProp);File watchdogTraces = new File(tracesDir, newTracesPath);
最后在平台上的/data/anr/目录下会看到:
-rw------- 1 system system 227332 2007-11-06 00:05 BinderTraces.txt-rw------- 1 system system 179568 2007-11-06 00:04 anr_2007-11-06-00-04-41-133-rw------- 1 system system 230506 2007-11-06 00:05 anr_2007-11-06-00-05-14-301-rw------- 1 system system 180175 2007-11-06 02:57 anr_2007-11-06-02-57-39-825-rw------- 1 system system 229772 2007-11-06 02:58 anr_2007-11-06-02-58-13-023-rw------- 1 system system 198777 2007-11-06 19:26 anr_2007-11-06-19-26-00-611-rw------- 1 system system 249791 2007-11-06 19:26 anr_2007-11-06-19-26-33-888-rw------- 1 system system 410074 2007-11-06 00:05 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197-rw------- 1 system system 229772 2007-11-06 02:58 traces_SystemServer_WDT06_Nov_02_58_55.714_pid2726-rw------- 1 system system 249791 2007-11-06 19:27 traces_SystemServer_WDT06_Nov_19_27_16.140_pid2473
这里文件比较多啊,手机好久没有清理过的缘故,没事的后面会知道到底是哪个文件是我们需要的。
3、将AMS中记录的日志append 到当前watch dog 的traces中
Android 系统中的 WatchDog 详解 的run() 函数分析中我们知道,当monitor 超过30秒的时候,check 返回的状态是WAITED_HALF,WatchDog 会给第二次机会,但是同时会将AMS 的日志记录dump 出来。
appendFile(watchdogTraces, initialStack);
如果第二次还是无法正常返回,会将第二次的log 也添加到watchdogTraces 中:
appendFile(watchdogTraces, finalStack);
最后的traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197 的log 就最终产生了。
4、在 /data/system/dropbox 下创建watchdog 文件
Thread dropboxThread = new Thread("watchdogWriteToDropbox") { public void run() { mActivity.addErrorToDropBox( "watchdog", null, "system_server", null, null, subject, null, newFd, null); } }; dropboxThread.start();
(1)subject 是blockedChecker 描述:
public String describeBlockedStateLocked() { if (mCurrentMonitor == null) { return "Blocked in handler on " + mName + " (" + getThread().getName() + ")"; } else { return "Blocked in monitor " + mCurrentMonitor.getClass().getName() + " on " + mName + " (" + getThread().getName() + ")"; } }
会在/data/system/dropbox/ 下的watchdog 日志中看到:
Subject: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)
就是这里来的。
(2)newFd 其实指的是上面一点中提到的finalStack 的fd,所以最后在dropbox 里面的traces 没有initialStack 的log
具体的addErrorToDropBox() 函数暂时不分析,下面会结合log 进一步解释,在平台上的/data/system/dropbox/目录下可以看到:
-rw------- 1 system system 329 2007-11-06 00:38 SYSTEM_BOOT@1194309534851.txt-rw------- 1 system system 329 2007-11-06 00:43 SYSTEM_BOOT@1194309834605.txt-rw------- 1 system system 329 2007-11-06 00:45 SYSTEM_RESTART@1194309831228.txt-rw------- 1 system system 1414 2007-11-06 00:44 data_app_crash@1194309853600.txt-rw------- 1 system system 1414 2007-11-06 00:44 data_app_crash@1194309854775.txt-rw------- 1 system system 4302 2007-11-06 00:43 system_app_strictmode@1194309835666.txt.gz-rw------- 1 system system 1832 2007-11-06 00:43 system_app_strictmode@1194309837786.txt.gz-rw------- 1 system system 2009 2007-11-06 00:44 system_app_strictmode@1194309840686.txt.gz-rw------- 1 system system 4035 2007-11-06 00:45 system_server_lowmem@1194309831226.txt.gz-rw------- 1 system system 19587 2007-11-06 00:05 system_server_watchdog@1194307528602.txt.gz
一般出问题找最新的traces 文件即可。
5、logcat 中打印堆栈信息
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject); for (int i=0; i
在logcat 中会看到这样的打印:
11-06 00:05:28.603 2197 2441 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)11-06 00:05:28.603 2197 2441 W Watchdog: foreground thread stack trace:11-06 00:05:28.603 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.handleCallback(Handler.java:790)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Looper.loop(Looper.java:164)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)11-06 00:05:28.604 2197 2441 W Watchdog: ActivityManager stack trace:11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Handler.dispatchMessage(Handler.java:106)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.Looper.loop(Looper.java:164)11-06 00:05:28.604 2197 2441 W Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)11-06 00:05:28.604 2197 2441 W Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:46)11-06 00:05:28.604 2197 2441 W Watchdog: *** GOODBYE!11-06 00:05:28.605 2197 2441 I Process : Sending signal. PID: 2197 SIG: 9
从log 中看到:
- monitor是在fg thread 中,也就是WatchDog 中的mMonitorChecker 中的monitor() 触发了
- 除了monitor(),AMS 中的Handler 也触发了WatchDog
- 最后会把出现问题的进程kill 掉,这里是 pid 2197
问题分析:
结合上面的log,我们来分析一下。
1、logcat
上面的第 5 点中已经说明,通过打印可以大概知道大概是哪个线程或者是哪个线程的monitor 出现了问题
2、分析traces 文件
先来看下/data/anr/ watchdog 文件,根据上面logcat 可以知道最后的pid 是2197,那么就要找结尾是2197 字符串的traces文件,也就是 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197
(1)根据上面第 5 点,先找到monitor 出现的地方,也就是fg thread:
"android.fg" prio=5 tid=17 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380dc8 self=0x9e3b3200 | sysTid=2225 nice=0 cgrp=default sched=0/0 handle=0x9159d970 | state=S schedstat=( 0 0 0 ) utm=5 stm=2 core=2 HZ=100 | stack=0x9149b000-0x9149d000 stackSize=1038KB | held mutexes= at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862) - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81 at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211) at android.os.Handler.handleCallback(Handler.java:790) at android.os.Handler.dispatchMessage(Handler.java:99) at android.os.Looper.loop(Looper.java:164) at android.os.HandlerThread.run(HandlerThread.java:65) at com.android.server.ServiceThread.run(ServiceThread.java:46)
在等待锁,这个锁被tid 为81 的线程持有,下面来看下tid 为81 干了什么:
"Binder:2197_4" prio=5 tid=81 Waiting | group="main" sCount=1 dsCount=0 flags=1 obj=0x14041c20 self=0x8d713200 | sysTid=2435 nice=-10 cgrp=default sched=0/0 handle=0x8c749970 | state=S schedstat=( 0 0 0 ) utm=9 stm=6 core=2 HZ=100 | stack=0x8c64f000-0x8c651000 stackSize=1006KB | held mutexes= at java.lang.Object.wait(Native method) - waiting on <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq) at com.android.server.PermissionDialogReqQueue$PermissionDialogReq.get(PermissionDialogReqQueue.java:100) - locked <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq) at com.android.server.AppOpsService.noteOperationUnchecked(AppOpsService.java:1157) at com.android.server.AppOpsService.noteOperation(AppOpsService.java:1086) at com.android.server.am.ActivityManagerService.isStartAllowed(ActivityManagerService.java:24937) at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:369) at com.android.server.am.ActivityManagerService.startService(ActivityManagerService.java:18450) - locked <0x0316e620> (a com.android.server.am.ActivityManagerService) at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:528) at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2971) at android.os.Binder.execTransact(Binder.java:697)
是在startService 的时候占用了AMS 这把锁,但是后面又在等 PermissionDialogReqQueue 的锁,这个锁是自己锁上的并没有放出来,导致了一连串的死锁。
(2)同样的,根据上面第 5 点,来看下AMS 中的Handler 也触发了WatchDog
"ActivityManager" prio=5 tid=12 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380710 self=0x9e0fae00 | sysTid=2216 nice=-2 cgrp=default sched=0/0 handle=0x91bb4970 | state=S schedstat=( 0 0 0 ) utm=60 stm=23 core=3 HZ=100 | stack=0x91ab2000-0x91ab4000 stackSize=1038KB | held mutexes= at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305) - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81 at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428) at android.os.Handler.dispatchMessage(Handler.java:106) at android.os.Looper.loop(Looper.java:164) at android.os.HandlerThread.run(HandlerThread.java:65) at com.android.server.ServiceThread.run(ServiceThread.java:46)
等待同样的一个锁
结合上面的traces 分析,大概能看出来应该是AMS 的锁被持有一直没有放掉,再来看下dropbox 下的traces(一般到第二次的traces时候已经出现了ANR),这个时候主线程应该会出现ANR,来看下main:
"main" prio=5 tid=1 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x723919f0 self=0xa7a28000 | sysTid=2197 nice=-2 cgrp=default sched=0/0 handle=0xabb1e4a4 | state=S schedstat=( 0 0 0 ) utm=320 stm=99 core=2 HZ=100 | stack=0xbe447000-0xbe449000 stackSize=8MB | held mutexes= at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:20018) - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81 at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4175) at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4165) at com.android.server.BatteryService$9.run(BatteryService.java:578) at android.os.Handler.handleCallback(Handler.java:790) at android.os.Handler.dispatchMessage(Handler.java:99) at android.os.Looper.loop(Looper.java:164) at com.android.server.SystemServer.run(SystemServer.java:435) at com.android.server.SystemServer.main(SystemServer.java:278) at java.lang.reflect.Method.invoke(Native method) at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:787)
ANR 也是应该这个锁,根据两个traces 分析,可以确定是因为PermissionDialogReqQueue 自身的锁没有释放,导致了timeout 触发watch dog。
但是实际上,这个第二次的log 意义并不大,因为这个时候系统ANR,需要检查的是系统为什么出现ANR,也就是说第一次traces 的log 可能在后面的debug 中起到关键因素。
3、寻找原因
来看下 PermissionDialogReqQueue,java 100行 get() 函数:
public int get() { synchronized (this) { try { wait(); } catch (InterruptedException e) { //ignore } } Log.d(TAG, "result from permission dialog is " + mResult); return mResult; }
从逻辑上看,这里会通过wait 等待dialog 返回结果,在set 函数的时候notify:
public void set(int res) { synchronized (this) { mResult = res; notifyAll(); } }
逻辑上处理应该是没有问题的,如果dialog 在30秒内点击,之后notify() 就会放掉之前持有的锁。但是从现象上看,是在dialog 弹出之后,AMS 接口调用导致了dialog 卡死,这样无法正常的释放锁,最终导致了WatchDog 重启system。
这个问题弄的我很疑惑,逻辑上都是ok 就是在AMS 中要锁的时候导致了dialog 卡住,后来怀疑可能是dialog 的同时调用了AMS 的接口导致了主线程卡死,几经挫折想到将dialog 放到子线程中show(Android 子线程中更新UI 详解),结果居然成功了,还在继续寻找原因中!
更多相关文章
- 转:Android(安卓)AsyncTask
- Android面试系列文章2018之Android部分Handler机制篇
- 享受Android应用程序的Java技术盛宴
- Android(安卓)corners 圆角属性各个版本之间兼容问题
- Android(安卓)recovery 流程分析
- Android面试系列文章2018之Android部分Handler机制篇
- Android(安卓)SDK 快速安装方法
- Android之 UI主线程
- Android如何注册服务到ServiceManager?