From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 20 Dec 2024 18:44:19 +0800
From: Chen Ridong <chenridong@huaweicloud.com>
Subject: Re: [PATCH v1] memcg: fix soft lockup in the OOM process
To: Michal Hocko
Cc: Tejun Heo, akpm@linux-foundation.org, hannes@cmpxchg.org,
    yosryahmed@google.com, roman.gushchin@linux.dev, shakeel.butt@linux.dev,
    muchun.song@linux.dev, davidf@vimeo.com, vbabka@suse.cz,
    handai.szj@taobao.com, rientjes@google.com, kamezawa.hiroyu@jp.fujitsu.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
    chenridong@huawei.com, wangweiyang2@huawei.com
References: <20241217121828.3219752-1-chenridong@huaweicloud.com>
    <872c5042-01d6-4ff3-94bc-8df94e1e941c@huaweicloud.com>
    <02f7d744-f123-4523-b170-c2062b5746c8@huaweicloud.com>
    <7d7b3c01-4977-41fa-a19c-4e6399117e8e@huaweicloud.com>

On 2024/12/19 15:57, Michal Hocko wrote:
> On Thu 19-12-24 09:27:52, Chen Ridong wrote:
>>
>>
>> On 2024/12/18 18:22, Michal Hocko wrote:
>>> On Wed 18-12-24 17:00:38, Chen Ridong wrote:
>>>>
>>>>
>>>> On 2024/12/18 15:56, Michal Hocko wrote:
>>>>> On Wed 18-12-24 15:44:34, Chen Ridong wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/12/17 20:54, Michal Hocko wrote:
>>>>>>> On Tue 17-12-24 12:18:28, Chen Ridong wrote:
>>>>>>> [...]
>>>>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>>>>> index 1c485beb0b93..14260381cccc 100644
>>>>>>>> --- a/mm/oom_kill.c
>>>>>>>> +++ b/mm/oom_kill.c
>>>>>>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg)
>>>>>>>>  	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc))
>>>>>>>>  		return 0;
>>>>>>>>
>>>>>>>> +	cond_resched();
>>>>>>>>  	task = find_lock_task_mm(p);
>>>>>>>>  	if (!task) {
>>>>>>>>  		/*
>>>>>>>
>>>>>>> This is called under the RCU read lock on the global OOM killer path,
>>>>>>> and I do not think you can schedule there. I do not remember the
>>>>>>> specifics of task traversal for the cgroup path, but I guess you might
>>>>>>> need to silence the soft lockup detector instead, or come up with a
>>>>>>> different iteration scheme.
>>>>>>
>>>>>> Thank you, Michal.
>>>>>>
>>>>>> I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks
>>>>>> function, below the fn call, but after reconsideration it may cause
>>>>>> unnecessary scheduling for other callers of mem_cgroup_scan_tasks.
>>>>>> Therefore, I moved it into the dump_task function. However, I missed
>>>>>> the RCU lock taken on the global OOM path.
>>>>>>
>>>>>> I think we can use touch_nmi_watchdog in place of cond_resched, which
>>>>>> can silence the soft lockup detector. Do you think that is acceptable?
>>>>>
>>>>> It is certainly a way to go, though not the best one. Maybe we need
>>>>> different solutions for the global and for the memcg OOMs. During a
>>>>> global OOM we rarely care about latency, as the whole system is likely
>>>>> to struggle. Memcg OOMs are much more likely. Having that many tasks in
>>>>> a memcg certainly requires further partitioning, so if configured
>>>>> properly the OOM latency shouldn't be very visible. But I am wondering
>>>>> whether the cgroup task iteration could use cond_resched while the
>>>>> global one would call touch_nmi_watchdog every N iterations. I might be
>>>>> missing something, but I do not see any locking required outside of
>>>>> css_task_iter_*.
>>>>
>>>> Do you mean something like this:
>>>
>>> I've had something like this (untested) in mind:
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 7b3503d12aaf..37abc94abd2e 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -1167,10 +1167,14 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>>>  	for_each_mem_cgroup_tree(iter, memcg) {
>>>  		struct css_task_iter it;
>>>  		struct task_struct *task;
>>> +		unsigned int i = 0;
>>>
>>>  		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
>>> -		while (!ret && (task = css_task_iter_next(&it)))
>>> +		while (!ret && (task = css_task_iter_next(&it))) {
>>>  			ret = fn(task, arg);
>>> +			if (!(++i % 1000))
>>> +				cond_resched();
>>> +		}
>>>  		css_task_iter_end(&it);
>>>  		if (ret) {
>>>  			mem_cgroup_iter_break(memcg, iter);
>>
>> Thank you for your patience.
>>
>> I had this idea in mind as well. However, two considerations led me to
>> reconsider it:
>>
>> 1. I wasn't convinced about calling cond_resched every N iterations.
>> Should N be 1000 or 10000?
>
> Sure, there will likely not be any _right_ value. This is mostly to
> mitigate the overhead of cond_resched, which is not completely free.
> Having a system with 1000 tasks is not completely uncommon, and at that
> scale we do not really need cond_resched at all.
>
> More importantly, we can expect cond_resched to eventually go away with
> PREEMPT_LAZY (or whatever its current name is), so I wouldn't overthink
> this.
>
>> 2. I don't think all callers of mem_cgroup_scan_tasks need
>> cond_resched. Only fn is expensive (e.g., dump_tasks), and that is what
>> needs cond_resched. At least, I have not encountered any issue except
>> when fn is dump_tasks.
>
> See above. I wouldn't really overthink this.

Thanks, I tested and sent v2.

Best regards,
Ridong
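
P.S. For reference, a rough sketch of the combined direction discussed
above: a periodic cond_resched() on the memcg iteration path, where it is
safe to sleep between css_task_iter_next() calls, and touch_nmi_watchdog()
on the RCU-protected global dump path. This is untested and illustrative
only; it is not necessarily what v2 does. Hunk offsets are omitted, the
threshold of 1000 is arbitrary, the watchdog is touched on every task
rather than every N for simplicity, and mm/oom_kill.c may additionally
need #include <linux/nmi.h>.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ ... @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		unsigned int i = 0;

 		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
-		while (!ret && (task = css_task_iter_next(&it)))
+		while (!ret && (task = css_task_iter_next(&it))) {
 			ret = fn(task, arg);
+			/*
+			 * Per the discussion above, no locks are held between
+			 * css_task_iter_next() calls, so sleeping is safe
+			 * here; reschedule every 1000 tasks so a huge memcg
+			 * does not trip the soft lockup detector.
+			 */
+			if (!(++i % 1000))
+				cond_resched();
+		}
 		css_task_iter_end(&it);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ ... @@ static int dump_task(struct task_struct *p, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc))
 		return 0;

+	/*
+	 * The global OOM killer walks the task list under rcu_read_lock(),
+	 * where cond_resched() is not allowed; poking the watchdog merely
+	 * keeps the (false positive) soft lockup report quiet.
+	 */
+	touch_nmi_watchdog();
 	task = find_lock_task_mm(p);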