Message-ID: <93f4ce9486ec4b856ba0f3bfe956fc9b2d3cb4cf.camel@mediatek.com>
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete
From: Jing-Ting Wu <jing-ting.wu@mediatek.com>
To: Hillf Danton
CC: Peter Zijlstra, Waiman Long, Valentin Schneider, Tejun Heo, "chris.redpath@arm.com", Dietmar Eggemann, Vincent Donnefort, Ingo Molnar, Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman, Christian Brauner
Date: Thu, 22 Sep 2022 13:40:47 +0800
In-Reply-To: <20220907000741.2496-1-hdanton@sina.com>
References: <20220907000741.2496-1-hdanton@sina.com>

On Wed, 2022-09-07 at 08:07 +0800, Hillf Danton wrote:
> On 5 Sep 2022 10:47:36 +0800 Jing-Ting Wu wrote
> >
> > We meet the HANG_DETECT happened in T SW version with kernel-5.15.
> > Many tasks have been blocked for a long time.
> >
> > Root cause:
> > migration_cpu_stop() is not complete due to is_migration_disabled(p) is
> > true, complete is false and complete_all() never get executed.
> > It let other task wait the rwsem.
> >
>
> See if handing task over to stopper again in case of migration disabled
> could survive your tests.
>
> Hillf
>
> --- linux-5.15/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2322,9 +2322,7 @@ static int migration_cpu_stop(void *data
>  	 * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
>  	 * we're holding p->pi_lock.
>  	 */
> -	if (task_rq(p) == rq) {
> -		if (is_migration_disabled(p))
> -			goto out;
> +	if (task_rq(p) == rq && !is_migration_disabled(p)) {
>
>  		if (pending) {
>  			p->migration_pending = NULL;

Because Peter has some concerns about Waiman's patch, we added Hillf's
patch to our stability test.
However, the patch has a side effect: the following warning appeared once
in less than two weeks of testing.

The backtrace is as follows:
[name:panic&]WARNING: CPU: 6 PID: 32583 at affine_move_task
pc : affine_move_task
lr : __set_cpus_allowed_ptr_locked
Call trace:
 affine_move_task
 __set_cpus_allowed_ptr_locked
 migrate_enable
 __cgroup_bpf_run_filter_skb
 ip_finish_output
 ip_output

The root cause is that, when is_migration_disabled(p) is true, the patched
migration_cpu_stop() sets p->migration_pending to NULL.
affine_move_task() then hits WARN_ON_ONCE(!pending).

kernel-5.15/kernel/sched/core.c:
static int affine_move_task(struct rq *rq, struct task_struct *p,
			    struct rq_flags *rf, int dest_cpu,
			    unsigned int flags)
{
	...
	if (WARN_ON_ONCE(!pending)) {
		task_rq_unlock(rq, p, rf);
		return -EINVAL;
	}
	...
}

But the task has not yet been migrated to a CPU in the new affinity mask,
so there is still a pending request to process, and p->migration_pending
should not be NULL.

Without the patch:
When is_migration_disabled(p) is true, the code jumps to out and does not
set p->migration_pending to NULL.

static int migration_cpu_stop(void *data)
{
	...
	if (task_rq(p) == rq) {
		if (is_migration_disabled(p))
			goto out;
	}
	...
}

With the patch:
When is_migration_disabled(p) is true and pending is set, execution takes
the else-if branch.
Because p->cpus_ptr is not updated while migration is disabled, the
cpumask_test_cpu() condition in that branch is always true and
p->migration_pending is set to NULL.

static int migration_cpu_stop(void *data)
{
	...
	if (task_rq(p) == rq && !is_migration_disabled(p)) {
		...
	} else if (pending) {
		...
		if (cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
			p->migration_pending = NULL;
			complete = true;
			goto out;
		}
		...
	}
	...
}

Best regards,
Jing-Ting Wu
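
P.S. For reference, the two flows described above can be modelled in user
space with a minimal sketch. This is not kernel code and has not been run
in the kernel: the struct and function names (model_task, model_pending,
model_migration_cpu_stop, model_affine_move_task_warns) are hypothetical,
the three booleans stand in for task_rq(p) == rq, is_migration_disabled(p)
and cpumask_test_cpu(task_cpu(p), p->cpus_ptr), and locking, the stopper
requeue and the actual migration are all left out. It only tries to show
which branch clears migration_pending in each version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins, NOT the kernel's real types. */
struct model_pending {
	bool done;                 /* models complete_all(&pending->done) */
};

struct model_task {
	bool on_this_rq;           /* models task_rq(p) == rq */
	bool migration_disabled;   /* models is_migration_disabled(p) */
	bool cpu_in_cpus_ptr;      /* models cpumask_test_cpu(task_cpu(p), p->cpus_ptr) */
	struct model_pending *migration_pending;
};

/*
 * Simplified model of the branch structure in migration_cpu_stop().
 * patched == false: unpatched 5.15 flow (early "goto out" when migration
 *                   is disabled).
 * patched == true:  the is_migration_disabled() check is folded into the
 *                   first condition, as in the tested patch.
 * Returns the final value of "complete".
 */
static bool model_migration_cpu_stop(struct model_task *p, bool patched)
{
	struct model_pending *pending = p->migration_pending;
	bool complete = false;
	bool first_branch = patched
		? (p->on_this_rq && !p->migration_disabled)
		: p->on_this_rq;

	if (first_branch) {
		if (!patched && p->migration_disabled)
			goto out;  /* unpatched: migration_pending is left alone */
		if (pending) {
			p->migration_pending = NULL;
			complete = true;
		}
	} else if (pending) {
		if (p->cpu_in_cpus_ptr) {
			p->migration_pending = NULL;
			complete = true;
			goto out;
		}
	}
out:
	if (complete)
		pending->done = true;
	return complete;
}

/* Models only the WARN_ON_ONCE(!pending) check in affine_move_task(). */
static bool model_affine_move_task_warns(const struct model_task *p)
{
	return p->migration_pending == NULL;
}
```

Calling model_migration_cpu_stop() on a task with on_this_rq,
migration_disabled and cpu_in_cpus_ptr all true and a pending request
installed leaves migration_pending in place when patched is false, and
clears it when patched is true, which is the state in which the modelled
affine_move_task() check would warn.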