来源:

https://blog.csdn.net/shift_wwx/article/details/81030749

 

前言:

Android 系统中的 WatchDog 详解 一文中通过source code详细的分析了 WatchDog 的软件设计原理和监测过程,这一篇结合实例对WatchDog 做进一步的分析。

 

源码引路:

            final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");            File stackFd = finalStack;            if (tracesDirProp.isEmpty()) {                String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);                String traceFileNameAmendment = "_SystemServer_WDT" +                        mTraceDateFormat.format(new Date());                if (tracesPath != null && tracesPath.length() != 0) {                    File traceRenameFile = new File(tracesPath);                    String newTracesPath;                    int lpos = tracesPath.lastIndexOf (".");                    if (-1 != lpos) {                        newTracesPath = tracesPath.substring(0, lpos) + traceFileNameAmendment                                + tracesPath.substring(lpos);                    } else {                        newTracesPath = tracesPath + traceFileNameAmendment;                    }                    traceRenameFile.renameTo(new File(newTracesPath));                    stackFd = new File(newTracesPath);                } else {                    Slog.w(TAG, "dump WDT Traces: no trace path configured");                }            } else {                // the new trace dumpsing mechanism                String newTracesPath = "traces_SystemServer_WDT"                        + mTraceDateFormat.format(new Date()) + "_pid"                        + String.valueOf(Process.myPid());                File tracesDir = new File(tracesDirProp);                File watchdogTraces = new File(tracesDir, newTracesPath);                try {                    if (watchdogTraces.createNewFile()) {                        FileUtils.setPermissions(watchdogTraces.getAbsolutePath(),                                0600, -1, -1); // -rw------- permissions                        // Append both traces from the first and second half                        // to a new file, making it easier to debug Watchdog timeouts                        //dumpStackTraces() can return a null instance, so check the same                        if (initialStack != null) {                            // check the last-modified time of this file.                            // we are interested in this only it was written to in the                            // last minute or so                            final long age = System.currentTimeMillis()                                    - initialStack.lastModified();                            final long MINUTE_IN_MILLIS = 1000 * 60;                            if (age < MINUTE_IN_MILLIS) {                                Slog.e(TAG, "First set of traces taken from "                                        + initialStack.getAbsolutePath());                                appendFile(watchdogTraces, initialStack);                            }                        } else {                            Slog.e(TAG, "First set of traces are empty!");                        }                        if (finalStack != null) {                            Slog.e(TAG, "Second set of traces taken from "                                    + finalStack.getAbsolutePath());                            appendFile(watchdogTraces, finalStack);                        } else {                             Slog.e(TAG, "Second set of traces are empty!");                        }                    } else {                        Slog.w(TAG, "Unable to create Watchdog dump file: createNewFile failed");                    }                } catch (IOException ioe) {                    Slog.e(TAG, "Exception creating Watchdog dump file:", ioe);                }            }            final File newFd = stackFd;            // Try to add the error to the dropbox, but assuming that the ActivityManager            // itself may be deadlocked.  (which has happened, causing this statement to            // deadlock and the watchdog as a whole to be ineffective)            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {                    public void run() {                        mActivity.addErrorToDropBox(                                "watchdog", null, "system_server", null, null,                                subject, null, newFd, null);                    }                };            dropboxThread.start();            try {                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.            } catch (InterruptedException ignored) {}            // At times, when user space watchdog traces don't give an indication on            // which component held a lock, because of which other threads are blocked,            // (thereby causing Watchdog), crash the device to analyze RAM dumps            boolean crashOnWatchdog = SystemProperties                                        .getBoolean("persist.sys.crashOnWatchdog", false);            if (crashOnWatchdog) {                // Trigger the kernel to dump all blocked threads, and backtraces                // on all CPUs to the kernel log                Slog.e(TAG, "Triggering SysRq for system_server watchdog");                doSysRq('w');                doSysRq('l');                // wait until the above blocked threads be dumped into kernel log                SystemClock.sleep(3000);                // now try to crash the target                doSysRq('c');            }            IActivityController controller;            synchronized (this) {                controller = mController;            }            if (controller != null) { //一般monkey 的时候会打印                Slog.i(TAG, "Reporting stuck state to activity controller");                try {                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");                    // 1 = keep waiting, -1 = kill system                    int res = controller.systemNotResponding(subject);                    if (res >= 0) {                        Slog.i(TAG, "Activity controller requested to coninue to wait");                        waitedHalf = false;                        continue;                    }                } catch (RemoteException e) {                }            }            // Only kill the process if the debugger is not attached.            if (Debug.isDebuggerConnected()) {                debuggerWasConnected = 2;            }            if (debuggerWasConnected >= 2) {                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");            } else if (debuggerWasConnected > 0) {                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");            } else if (!allowRestart) {                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");            } else {                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);                for (int i=0; i

1、prop dalvik.vm.stack-trace-dir

一帮指的是 /data/anr

 

2、watch dog 日志名称

String newTracesPath = "traces_SystemServer_WDT"           + mTraceDateFormat.format(new Date()) + "_pid"           + String.valueOf(Process.myPid());File tracesDir = new File(tracesDirProp);File watchdogTraces = new File(tracesDir, newTracesPath);

最后在平台上的/data/anr/目录下会看到:

-rw-------  1 system system 227332 2007-11-06 00:05 BinderTraces.txt-rw-------  1 system system 179568 2007-11-06 00:04 anr_2007-11-06-00-04-41-133-rw-------  1 system system 230506 2007-11-06 00:05 anr_2007-11-06-00-05-14-301-rw-------  1 system system 180175 2007-11-06 02:57 anr_2007-11-06-02-57-39-825-rw-------  1 system system 229772 2007-11-06 02:58 anr_2007-11-06-02-58-13-023-rw-------  1 system system 198777 2007-11-06 19:26 anr_2007-11-06-19-26-00-611-rw-------  1 system system 249791 2007-11-06 19:26 anr_2007-11-06-19-26-33-888-rw-------  1 system system 410074 2007-11-06 00:05 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197-rw-------  1 system system 229772 2007-11-06 02:58 traces_SystemServer_WDT06_Nov_02_58_55.714_pid2726-rw-------  1 system system 249791 2007-11-06 19:27 traces_SystemServer_WDT06_Nov_19_27_16.140_pid2473

这里文件比较多啊,手机好久没有清理过的缘故,没事的后面会知道到底是哪个文件是我们需要的。

 

 

3、将AMS中记录的日志append 到当前watch dog 的traces中

Android 系统中的 WatchDog 详解  的run() 函数分析中我们知道,当monitor 超过30秒的时候,check 返回的状态是WAITED_HALF,WatchDog 会给第二次机会,但是同时会将AMS 的日志记录dump 出来。

appendFile(watchdogTraces, initialStack);

如果第二次还是无法正常返回,会将第二次的log 也添加到watchdogTraces 中:

appendFile(watchdogTraces, finalStack);

最后的traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197 的log 就最终产生了。

 

4、在 /data/system/dropbox 下创建watchdog 文件

    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {        public void run() {            mActivity.addErrorToDropBox(                      "watchdog", null, "system_server", null, null,                       subject, null, newFd, null);            }        };    dropboxThread.start();

(1)subject 是blockedChecker 描述:

        public String describeBlockedStateLocked() {            if (mCurrentMonitor == null) {                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";            } else {                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()                        + " on " + mName + " (" + getThread().getName() + ")";            }        }

会在/data/system/dropbox/ 下的watchdog 日志中看到:

Subject: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)

就是这里来的。

(2)newFd 其实指的是上面一点中提到的finalStack 的fd,所以最后在dropbox 里面的traces 没有initialStack 的log

 

具体的addErrorToDropBox() 函数暂时不分析,下面会结合log 进一步解释,在平台上的/data/system/dropbox/目录下可以看到:

-rw-------  1 system system   329 2007-11-06 00:38 SYSTEM_BOOT@1194309534851.txt-rw-------  1 system system   329 2007-11-06 00:43 SYSTEM_BOOT@1194309834605.txt-rw-------  1 system system   329 2007-11-06 00:45 SYSTEM_RESTART@1194309831228.txt-rw-------  1 system system  1414 2007-11-06 00:44 data_app_crash@1194309853600.txt-rw-------  1 system system  1414 2007-11-06 00:44 data_app_crash@1194309854775.txt-rw-------  1 system system  4302 2007-11-06 00:43 system_app_strictmode@1194309835666.txt.gz-rw-------  1 system system  1832 2007-11-06 00:43 system_app_strictmode@1194309837786.txt.gz-rw-------  1 system system  2009 2007-11-06 00:44 system_app_strictmode@1194309840686.txt.gz-rw-------  1 system system  4035 2007-11-06 00:45 system_server_lowmem@1194309831226.txt.gz-rw-------  1 system system 19587 2007-11-06 00:05 system_server_watchdog@1194307528602.txt.gz

一般出问题找最新的traces 文件即可。

 

5、logcat 中打印堆栈信息

    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);    for (int i=0; i

在logcat 中会看到这样的打印:

11-06 00:05:28.603  2197  2441 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)11-06 00:05:28.603  2197  2441 W Watchdog: foreground thread stack trace:11-06 00:05:28.603  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:790)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Looper.loop(Looper.java:164)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:46)11-06 00:05:28.604  2197  2441 W Watchdog: ActivityManager stack trace:11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:106)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Looper.loop(Looper.java:164)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:46)11-06 00:05:28.604  2197  2441 W Watchdog: *** GOODBYE!11-06 00:05:28.605  2197  2441 I Process : Sending signal. PID: 2197 SIG: 9

从log 中看到:

  • monitor是在fg thread 中,也就是WatchDog 中的mMonitorChecker 中的monitor() 触发了
  • 除了monitor(),AMS 中的Handler 也触发了WatchDog
  • 最后会把出现问题的进程kill 掉,这里是 pid 2197

 

问题分析:

结合上面的log,我们来分析一下。

1、logcat

上面的第 5 点中已经说明,通过打印可以大概知道大概是哪个线程或者是哪个线程的monitor 出现了问题

 

2、分析traces 文件

先来看下/data/anr/ watchdog 文件,根据上面logcat 可以知道最后的pid 是2197,那么就要找结尾是2197 字符串的traces文件,也就是 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197

(1)根据上面第 5 点,先找到monitor 出现的地方,也就是fg thread:

"android.fg" prio=5 tid=17 Blocked  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380dc8 self=0x9e3b3200  | sysTid=2225 nice=0 cgrp=default sched=0/0 handle=0x9159d970  | state=S schedstat=( 0 0 0 ) utm=5 stm=2 core=2 HZ=100  | stack=0x9149b000-0x9149d000 stackSize=1038KB  | held mutexes=  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)  at android.os.Handler.handleCallback(Handler.java:790)  at android.os.Handler.dispatchMessage(Handler.java:99)  at android.os.Looper.loop(Looper.java:164)  at android.os.HandlerThread.run(HandlerThread.java:65)  at com.android.server.ServiceThread.run(ServiceThread.java:46)

在等待锁,这个锁被tid 为81 的线程持有,下面来看下tid 为81 干了什么:

"Binder:2197_4" prio=5 tid=81 Waiting  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14041c20 self=0x8d713200  | sysTid=2435 nice=-10 cgrp=default sched=0/0 handle=0x8c749970  | state=S schedstat=( 0 0 0 ) utm=9 stm=6 core=2 HZ=100  | stack=0x8c64f000-0x8c651000 stackSize=1006KB  | held mutexes=  at java.lang.Object.wait(Native method)  - waiting on <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)  at com.android.server.PermissionDialogReqQueue$PermissionDialogReq.get(PermissionDialogReqQueue.java:100)  - locked <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)  at com.android.server.AppOpsService.noteOperationUnchecked(AppOpsService.java:1157)  at com.android.server.AppOpsService.noteOperation(AppOpsService.java:1086)  at com.android.server.am.ActivityManagerService.isStartAllowed(ActivityManagerService.java:24937)  at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:369)  at com.android.server.am.ActivityManagerService.startService(ActivityManagerService.java:18450)  - locked <0x0316e620> (a com.android.server.am.ActivityManagerService)  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:528)  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2971)  at android.os.Binder.execTransact(Binder.java:697)

是在startService 的时候占用了AMS 这把锁,但是后面又在等 PermissionDialogReqQueue 的锁,这个锁是自己锁上的并没有放出来,导致了一连串的死锁。

(2)同样的,根据上面第 5 点,来看下AMS 中的Handler 也触发了WatchDog

"ActivityManager" prio=5 tid=12 Blocked  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380710 self=0x9e0fae00  | sysTid=2216 nice=-2 cgrp=default sched=0/0 handle=0x91bb4970  | state=S schedstat=( 0 0 0 ) utm=60 stm=23 core=3 HZ=100  | stack=0x91ab2000-0x91ab4000 stackSize=1038KB  | held mutexes=  at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81  at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)  at android.os.Handler.dispatchMessage(Handler.java:106)  at android.os.Looper.loop(Looper.java:164)  at android.os.HandlerThread.run(HandlerThread.java:65)  at com.android.server.ServiceThread.run(ServiceThread.java:46)

等待同样的一个锁

 

结合上面的traces 分析,大概能看出来应该是AMS 的锁被持有一直没有放掉,再来看下dropbox 下的traces(一般到第二次的traces时候已经出现了ANR),这个时候主线程应该会出现ANR,来看下main:

"main" prio=5 tid=1 Blocked  | group="main" sCount=1 dsCount=0 flags=1 obj=0x723919f0 self=0xa7a28000  | sysTid=2197 nice=-2 cgrp=default sched=0/0 handle=0xabb1e4a4  | state=S schedstat=( 0 0 0 ) utm=320 stm=99 core=2 HZ=100  | stack=0xbe447000-0xbe449000 stackSize=8MB  | held mutexes=  at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:20018)  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81  at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4175)  at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4165)  at com.android.server.BatteryService$9.run(BatteryService.java:578)  at android.os.Handler.handleCallback(Handler.java:790)  at android.os.Handler.dispatchMessage(Handler.java:99)  at android.os.Looper.loop(Looper.java:164)  at com.android.server.SystemServer.run(SystemServer.java:435)  at com.android.server.SystemServer.main(SystemServer.java:278)  at java.lang.reflect.Method.invoke(Native method)  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:787)

ANR 也是应该这个锁,根据两个traces 分析,可以确定是因为PermissionDialogReqQueue  自身的锁没有释放,导致了timeout 触发watch dog。

但是实际上,这个第二次的log 意义并不大,因为这个时候系统ANR,需要检查的是系统为什么出现ANR,也就是说第一次traces 的log 可能在后面的debug 中起到关键因素。

 

3、寻找原因

来看下 PermissionDialogReqQueue,java 100行 get() 函数:

        public int get() {            synchronized (this) {                try {                    wait();                } catch (InterruptedException e) {                    //ignore                }            }            Log.d(TAG, "result from permission dialog is " + mResult);            return mResult;        }

从逻辑上看,这里会通过wait 等待dialog 返回结果,在set 函数的时候notify:

        public void set(int res) {            synchronized (this) {                mResult = res;                notifyAll();            }        }

逻辑上处理应该是没有问题的,如果dialog 在30秒内点击,之后notify() 就会放掉之前持有的锁。但是从现象上看,是在dialog 弹出之后,AMS 接口调用导致了dialog 卡死,这样无法正常的释放锁,最终导致了WatchDog 重启system。 

这个问题弄的我很疑惑,逻辑上都是ok 就是在AMS 中要锁的时候导致了dialog 卡住,后来怀疑可能是dialog 的同时调用了AMS 的接口导致了主线程卡死,几经挫折想到将dialog 放到子线程中show(Android 子线程中更新UI 详解),结果居然成功了,还在继续寻找原因中!

 

 

 

 

 

 

 

更多相关文章

  1. C语言函数以及函数的使用
  2. 【Android 应用开发】Android资源文件 - 使用资源存储字符串 颜
  3. android 布局文件属性说明
  4. Android之 UI主线程
  5. Android 属性文件build.prop,获取属性以及设置属性
  6. Android获取打开各种文件Intent汇总

随机推荐

  1. Android(安卓)技术专题系列之二 -- telepho
  2. Android中获取屏幕相关信息(屏幕大小,状态
  3. android 拍照 onCreate() 调用两次的问题
  4. Android:漫画APP开发笔记之ListView中图
  5. 【Android(安卓)系统开发】 编译 Android
  6. PhoneGap Android环境搭建
  7. android jni 程序框架搭建
  8. Android(安卓)Fastboot
  9. 2010.11.28(2)———android 展示网页 和
  10. 什么是Android-Android中文SDK