来源:

https://blog.csdn.net/shift_wwx/article/details/81030749

 

前言:

Android 系统中的 WatchDog 详解 一文中通过source code详细的分析了 WatchDog 的软件设计原理和监测过程,这一篇结合实例对WatchDog 做进一步的分析。

 

源码引路:

            final String tracesDirProp = SystemProperties.get("dalvik.vm.stack-trace-dir", "");            File stackFd = finalStack;            if (tracesDirProp.isEmpty()) {                String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);                String traceFileNameAmendment = "_SystemServer_WDT" +                        mTraceDateFormat.format(new Date());                if (tracesPath != null && tracesPath.length() != 0) {                    File traceRenameFile = new File(tracesPath);                    String newTracesPath;                    int lpos = tracesPath.lastIndexOf (".");                    if (-1 != lpos) {                        newTracesPath = tracesPath.substring(0, lpos) + traceFileNameAmendment                                + tracesPath.substring(lpos);                    } else {                        newTracesPath = tracesPath + traceFileNameAmendment;                    }                    traceRenameFile.renameTo(new File(newTracesPath));                    stackFd = new File(newTracesPath);                } else {                    Slog.w(TAG, "dump WDT Traces: no trace path configured");                }            } else {                // the new trace dumpsing mechanism                String newTracesPath = "traces_SystemServer_WDT"                        + mTraceDateFormat.format(new Date()) + "_pid"                        + String.valueOf(Process.myPid());                File tracesDir = new File(tracesDirProp);                File watchdogTraces = new File(tracesDir, newTracesPath);                try {                    if (watchdogTraces.createNewFile()) {                        FileUtils.setPermissions(watchdogTraces.getAbsolutePath(),                                0600, -1, -1); // -rw------- permissions                        // Append both traces from the first and second half                        // to a new file, making it easier to debug Watchdog timeouts                        //dumpStackTraces() can return a null instance, so check the same                        if (initialStack != null) {                            // check the last-modified time of this file.                            // we are interested in this only it was written to in the                            // last minute or so                            final long age = System.currentTimeMillis()                                    - initialStack.lastModified();                            final long MINUTE_IN_MILLIS = 1000 * 60;                            if (age < MINUTE_IN_MILLIS) {                                Slog.e(TAG, "First set of traces taken from "                                        + initialStack.getAbsolutePath());                                appendFile(watchdogTraces, initialStack);                            }                        } else {                            Slog.e(TAG, "First set of traces are empty!");                        }                        if (finalStack != null) {                            Slog.e(TAG, "Second set of traces taken from "                                    + finalStack.getAbsolutePath());                            appendFile(watchdogTraces, finalStack);                        } else {                             Slog.e(TAG, "Second set of traces are empty!");                        }                    } else {                        Slog.w(TAG, "Unable to create Watchdog dump file: createNewFile failed");                    }                } catch (IOException ioe) {                    Slog.e(TAG, "Exception creating Watchdog dump file:", ioe);                }            }            final File newFd = stackFd;            // Try to add the error to the dropbox, but assuming that the ActivityManager            // itself may be deadlocked.  (which has happened, causing this statement to            // deadlock and the watchdog as a whole to be ineffective)            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {                    public void run() {                        mActivity.addErrorToDropBox(                                "watchdog", null, "system_server", null, null,                                subject, null, newFd, null);                    }                };            dropboxThread.start();            try {                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.            } catch (InterruptedException ignored) {}            // At times, when user space watchdog traces don't give an indication on            // which component held a lock, because of which other threads are blocked,            // (thereby causing Watchdog), crash the device to analyze RAM dumps            boolean crashOnWatchdog = SystemProperties                                        .getBoolean("persist.sys.crashOnWatchdog", false);            if (crashOnWatchdog) {                // Trigger the kernel to dump all blocked threads, and backtraces                // on all CPUs to the kernel log                Slog.e(TAG, "Triggering SysRq for system_server watchdog");                doSysRq('w');                doSysRq('l');                // wait until the above blocked threads be dumped into kernel log                SystemClock.sleep(3000);                // now try to crash the target                doSysRq('c');            }            IActivityController controller;            synchronized (this) {                controller = mController;            }            if (controller != null) { //一般monkey 的时候会打印                Slog.i(TAG, "Reporting stuck state to activity controller");                try {                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");                    // 1 = keep waiting, -1 = kill system                    int res = controller.systemNotResponding(subject);                    if (res >= 0) {                        Slog.i(TAG, "Activity controller requested to coninue to wait");                        waitedHalf = false;                        continue;                    }                } catch (RemoteException e) {                }            }            // Only kill the process if the debugger is not attached.            if (Debug.isDebuggerConnected()) {                debuggerWasConnected = 2;            }            if (debuggerWasConnected >= 2) {                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");            } else if (debuggerWasConnected > 0) {                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");            } else if (!allowRestart) {                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");            } else {                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);                for (int i=0; i

1、prop dalvik.vm.stack-trace-dir

一帮指的是 /data/anr

 

2、watch dog 日志名称

String newTracesPath = "traces_SystemServer_WDT"           + mTraceDateFormat.format(new Date()) + "_pid"           + String.valueOf(Process.myPid());File tracesDir = new File(tracesDirProp);File watchdogTraces = new File(tracesDir, newTracesPath);

最后在平台上的/data/anr/目录下会看到:

-rw-------  1 system system 227332 2007-11-06 00:05 BinderTraces.txt-rw-------  1 system system 179568 2007-11-06 00:04 anr_2007-11-06-00-04-41-133-rw-------  1 system system 230506 2007-11-06 00:05 anr_2007-11-06-00-05-14-301-rw-------  1 system system 180175 2007-11-06 02:57 anr_2007-11-06-02-57-39-825-rw-------  1 system system 229772 2007-11-06 02:58 anr_2007-11-06-02-58-13-023-rw-------  1 system system 198777 2007-11-06 19:26 anr_2007-11-06-19-26-00-611-rw-------  1 system system 249791 2007-11-06 19:26 anr_2007-11-06-19-26-33-888-rw-------  1 system system 410074 2007-11-06 00:05 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197-rw-------  1 system system 229772 2007-11-06 02:58 traces_SystemServer_WDT06_Nov_02_58_55.714_pid2726-rw-------  1 system system 249791 2007-11-06 19:27 traces_SystemServer_WDT06_Nov_19_27_16.140_pid2473

这里文件比较多啊,手机好久没有清理过的缘故,没事的后面会知道到底是哪个文件是我们需要的。

 

 

3、将AMS中记录的日志append 到当前watch dog 的traces中

Android 系统中的 WatchDog 详解  的run() 函数分析中我们知道,当monitor 超过30秒的时候,check 返回的状态是WAITED_HALF,WatchDog 会给第二次机会,但是同时会将AMS 的日志记录dump 出来。

appendFile(watchdogTraces, initialStack);

如果第二次还是无法正常返回,会将第二次的log 也添加到watchdogTraces 中:

appendFile(watchdogTraces, finalStack);

最后的traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197 的log 就最终产生了。

 

4、在 /data/system/dropbox 下创建watchdog 文件

    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {        public void run() {            mActivity.addErrorToDropBox(                      "watchdog", null, "system_server", null, null,                       subject, null, newFd, null);            }        };    dropboxThread.start();

(1)subject 是blockedChecker 描述:

        public String describeBlockedStateLocked() {            if (mCurrentMonitor == null) {                return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";            } else {                return "Blocked in monitor " + mCurrentMonitor.getClass().getName()                        + " on " + mName + " (" + getThread().getName() + ")";            }        }

会在/data/system/dropbox/ 下的watchdog 日志中看到:

Subject: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)

就是这里来的。

(2)newFd 其实指的是上面一点中提到的finalStack 的fd,所以最后在dropbox 里面的traces 没有initialStack 的log

 

具体的addErrorToDropBox() 函数暂时不分析,下面会结合log 进一步解释,在平台上的/data/system/dropbox/目录下可以看到:

-rw-------  1 system system   329 2007-11-06 00:38 SYSTEM_BOOT@1194309534851.txt-rw-------  1 system system   329 2007-11-06 00:43 SYSTEM_BOOT@1194309834605.txt-rw-------  1 system system   329 2007-11-06 00:45 SYSTEM_RESTART@1194309831228.txt-rw-------  1 system system  1414 2007-11-06 00:44 data_app_crash@1194309853600.txt-rw-------  1 system system  1414 2007-11-06 00:44 data_app_crash@1194309854775.txt-rw-------  1 system system  4302 2007-11-06 00:43 system_app_strictmode@1194309835666.txt.gz-rw-------  1 system system  1832 2007-11-06 00:43 system_app_strictmode@1194309837786.txt.gz-rw-------  1 system system  2009 2007-11-06 00:44 system_app_strictmode@1194309840686.txt.gz-rw-------  1 system system  4035 2007-11-06 00:45 system_server_lowmem@1194309831226.txt.gz-rw-------  1 system system 19587 2007-11-06 00:05 system_server_watchdog@1194307528602.txt.gz

一般出问题找最新的traces 文件即可。

 

5、logcat 中打印堆栈信息

    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);    for (int i=0; i

在logcat 中会看到这样的打印:

11-06 00:05:28.603  2197  2441 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager)11-06 00:05:28.603  2197  2441 W Watchdog: foreground thread stack trace:11-06 00:05:28.603  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:790)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Looper.loop(Looper.java:164)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:46)11-06 00:05:28.604  2197  2441 W Watchdog: ActivityManager stack trace:11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:106)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.Looper.loop(Looper.java:164)11-06 00:05:28.604  2197  2441 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)11-06 00:05:28.604  2197  2441 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:46)11-06 00:05:28.604  2197  2441 W Watchdog: *** GOODBYE!11-06 00:05:28.605  2197  2441 I Process : Sending signal. PID: 2197 SIG: 9

从log 中看到:

  • monitor是在fg thread 中,也就是WatchDog 中的mMonitorChecker 中的monitor() 触发了
  • 除了monitor(),AMS 中的Handler 也触发了WatchDog
  • 最后会把出现问题的进程kill 掉,这里是 pid 2197

 

问题分析:

结合上面的log,我们来分析一下。

1、logcat

上面的第 5 点中已经说明,通过打印可以大概知道大概是哪个线程或者是哪个线程的monitor 出现了问题

 

2、分析traces 文件

先来看下/data/anr/ watchdog 文件,根据上面logcat 可以知道最后的pid 是2197,那么就要找结尾是2197 字符串的traces文件,也就是 traces_SystemServer_WDT06_Nov_00_05_28.143_pid2197

(1)根据上面第 5 点,先找到monitor 出现的地方,也就是fg thread:

"android.fg" prio=5 tid=17 Blocked  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380dc8 self=0x9e3b3200  | sysTid=2225 nice=0 cgrp=default sched=0/0 handle=0x9159d970  | state=S schedstat=( 0 0 0 ) utm=5 stm=2 core=2 HZ=100  | stack=0x9149b000-0x9149d000 stackSize=1038KB  | held mutexes=  at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:23862)  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81  at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:211)  at android.os.Handler.handleCallback(Handler.java:790)  at android.os.Handler.dispatchMessage(Handler.java:99)  at android.os.Looper.loop(Looper.java:164)  at android.os.HandlerThread.run(HandlerThread.java:65)  at com.android.server.ServiceThread.run(ServiceThread.java:46)

在等待锁,这个锁被tid 为81 的线程持有,下面来看下tid 为81 干了什么:

"Binder:2197_4" prio=5 tid=81 Waiting  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14041c20 self=0x8d713200  | sysTid=2435 nice=-10 cgrp=default sched=0/0 handle=0x8c749970  | state=S schedstat=( 0 0 0 ) utm=9 stm=6 core=2 HZ=100  | stack=0x8c64f000-0x8c651000 stackSize=1006KB  | held mutexes=  at java.lang.Object.wait(Native method)  - waiting on <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)  at com.android.server.PermissionDialogReqQueue$PermissionDialogReq.get(PermissionDialogReqQueue.java:100)  - locked <0x0c1b5e76> (a com.android.server.PermissionDialogReqQueue$PermissionDialogReq)  at com.android.server.AppOpsService.noteOperationUnchecked(AppOpsService.java:1157)  at com.android.server.AppOpsService.noteOperation(AppOpsService.java:1086)  at com.android.server.am.ActivityManagerService.isStartAllowed(ActivityManagerService.java:24937)  at com.android.server.am.ActiveServices.startServiceLocked(ActiveServices.java:369)  at com.android.server.am.ActivityManagerService.startService(ActivityManagerService.java:18450)  - locked <0x0316e620> (a com.android.server.am.ActivityManagerService)  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:528)  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:2971)  at android.os.Binder.execTransact(Binder.java:697)

是在startService 的时候占用了AMS 这把锁,但是后面又在等 PermissionDialogReqQueue 的锁,这个锁是自己锁上的并没有放出来,导致了一连串的死锁。

(2)同样的,根据上面第 5 点,来看下AMS 中的Handler 也触发了WatchDog

"ActivityManager" prio=5 tid=12 Blocked  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13380710 self=0x9e0fae00  | sysTid=2216 nice=-2 cgrp=default sched=0/0 handle=0x91bb4970  | state=S schedstat=( 0 0 0 ) utm=60 stm=23 core=3 HZ=100  | stack=0x91ab2000-0x91ab4000 stackSize=1038KB  | held mutexes=  at com.android.server.am.ActivityManagerService.idleUids(ActivityManagerService.java:23305)  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81  at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:2428)  at android.os.Handler.dispatchMessage(Handler.java:106)  at android.os.Looper.loop(Looper.java:164)  at android.os.HandlerThread.run(HandlerThread.java:65)  at com.android.server.ServiceThread.run(ServiceThread.java:46)

等待同样的一个锁

 

结合上面的traces 分析,大概能看出来应该是AMS 的锁被持有一直没有放掉,再来看下dropbox 下的traces(一般到第二次的traces时候已经出现了ANR),这个时候主线程应该会出现ANR,来看下main:

"main" prio=5 tid=1 Blocked  | group="main" sCount=1 dsCount=0 flags=1 obj=0x723919f0 self=0xa7a28000  | sysTid=2197 nice=-2 cgrp=default sched=0/0 handle=0xabb1e4a4  | state=S schedstat=( 0 0 0 ) utm=320 stm=99 core=2 HZ=100  | stack=0xbe447000-0xbe449000 stackSize=8MB  | held mutexes=  at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:20018)  - waiting to lock <0x0316e620> (a com.android.server.am.ActivityManagerService) held by thread 81  at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4175)  at android.app.ActivityManager.broadcastStickyIntent(ActivityManager.java:4165)  at com.android.server.BatteryService$9.run(BatteryService.java:578)  at android.os.Handler.handleCallback(Handler.java:790)  at android.os.Handler.dispatchMessage(Handler.java:99)  at android.os.Looper.loop(Looper.java:164)  at com.android.server.SystemServer.run(SystemServer.java:435)  at com.android.server.SystemServer.main(SystemServer.java:278)  at java.lang.reflect.Method.invoke(Native method)  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:438)  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:787)

ANR 也是应该这个锁,根据两个traces 分析,可以确定是因为PermissionDialogReqQueue  自身的锁没有释放,导致了timeout 触发watch dog。

但是实际上,这个第二次的log 意义并不大,因为这个时候系统ANR,需要检查的是系统为什么出现ANR,也就是说第一次traces 的log 可能在后面的debug 中起到关键因素。

 

3、寻找原因

来看下 PermissionDialogReqQueue,java 100行 get() 函数:

        public int get() {            synchronized (this) {                try {                    wait();                } catch (InterruptedException e) {                    //ignore                }            }            Log.d(TAG, "result from permission dialog is " + mResult);            return mResult;        }

从逻辑上看,这里会通过wait 等待dialog 返回结果,在set 函数的时候notify:

        public void set(int res) {            synchronized (this) {                mResult = res;                notifyAll();            }        }

逻辑上处理应该是没有问题的,如果dialog 在30秒内点击,之后notify() 就会放掉之前持有的锁。但是从现象上看,是在dialog 弹出之后,AMS 接口调用导致了dialog 卡死,这样无法正常的释放锁,最终导致了WatchDog 重启system。 

这个问题弄的我很疑惑,逻辑上都是ok 就是在AMS 中要锁的时候导致了dialog 卡住,后来怀疑可能是dialog 的同时调用了AMS 的接口导致了主线程卡死,几经挫折想到将dialog 放到子线程中show(Android 子线程中更新UI 详解),结果居然成功了,还在继续寻找原因中!

 

 

 

 

 

 

 

更多相关文章

  1. 转:Android(安卓)AsyncTask
  2. Android面试系列文章2018之Android部分Handler机制篇
  3. 享受Android应用程序的Java技术盛宴
  4. Android(安卓)corners 圆角属性各个版本之间兼容问题
  5. Android(安卓)recovery 流程分析
  6. Android面试系列文章2018之Android部分Handler机制篇
  7. Android(安卓)SDK 快速安装方法
  8. Android之 UI主线程
  9. Android如何注册服务到ServiceManager?

随机推荐

  1. android webview自动播放Video
  2. Android(安卓)UI控件的分类
  3. android TIF启动流程
  4. Android(安卓)RadioButton和CheckBox无法
  5. 常用方法(2)------根据图片的url路径获得
  6. 解决Android在MAC OSX上编译缺少10.5SDK
  7. Android加载png图片时出错
  8. 利用Java反射技术调用Android中被隐藏的A
  9. Android下Affinities和Task(FLAG_ACTIVIT
  10. android亮屏、黑屏、解锁事件的系统广播