From: Yang Shi
Date: Wed, 26 Oct 2022 10:57:52 -0700
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion
To: Michal Hocko
Cc: Feng Tang, Aneesh Kumar K V, Andrew Morton, Johannes Weiner, Tejun Heo, Zefan Li, Waiman Long, "Huang, Ying", "linux-mm@kvack.org", "cgroups@vger.kernel.org", "linux-kernel@vger.kernel.org", "Hansen, Dave", "Chen, Tim C", "Yin, Fengwei"
References: <20221026074343.6517-1-feng.tang@intel.com>

On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko wrote:
>
> On Wed 26-10-22 20:20:01, Feng Tang wrote:
> > On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> > > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > > > to slower memory tier.
> > > > > > Currently, there is no check about cpuset's
> > > > > > memory policy, so even if the target demotion node is not allowed
> > > > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > > > semantics.
> > > > > >
> > > > > > So add cpuset policy check in the demotion path and skip demotion
> > > > > > if the demotion targets are not allowed by cpuset.
> > > > > >
> > > > >
> > > > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > > > those memory policy restrictions while demoting the page?
> > > >
> > > > Good question! We have some basic patches to consider memory policy
> > > > in demotion path too, which are still under test, and will be posted
> > > > soon. And the basic idea is similar to this patch.
> > >
> > > For that you need to consult each vma and its owning task(s) and that
> > > to me sounds like something to be done in folio_check_references.
> > > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > > 100% correct. Memory controller might be disabled and then you do not
> > > have your association anymore.
> >
> > You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
> > and the bright side is most distributions have it on.
>
> CONFIG_MEMCG=y is not sufficient. You would need to enable memcg
> controller during the runtime as well.
>
> > > This all can get quite expensive so the primary question is, does the
> > > existing behavior generate any real issues or is this more of a
> > > correctness exercise? I mean it certainly is not great to demote to an
> > > incompatible numa node but are there any reasonable configurations when
> > > the demotion target node is explicitly excluded from memory
> > > policy/cpuset?
> >
> > We haven't got a customer report on this, but there are quite a few customers
> > who use cpuset to bind some specific memory nodes to a docker (You've helped
> > us solve an OOM issue in such cases), so I think it's practical to respect
> > the cpuset semantics as much as we can.
>
> Yes, it is definitely better to respect cpusets and all local memory
> policies. There is no dispute there. The thing is whether this is really
> worth it. How often would cpusets (or policies in general) go actively
> against demotion nodes (i.e. exclude those nodes from their allowed node
> mask)?
>
> I can imagine workloads which wouldn't like to get their memory demoted
> for some reason but wouldn't it be more practical to tell that
> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> policies explicitly?
>
> > Your concern about the expensive cost makes sense! Some raw ideas are:
> > * if the shrink_folio_list is called by kswapd, the folios come from
> >   the same per-memcg lruvec, so only one check is enough
> > * if not from kswapd, like called from madvise or DAMON code, we can
> >   save a memcg cache, and if the next folio's memcg is same as the
> >   cache, we reuse its result. And due to the locality, the real
> >   check is rarely performed.
>
> memcg is not the expensive part of the thing. You need to get from page
> -> all vmas::vm_policy -> mm -> task::mempolicy

Yeah, I'm on the same page as Michal. Figuring out the mempolicy from a
page seems quite expensive, and correctness can't be guaranteed: the
mempolicy can be set per-thread, and the mm->task link depends on
CONFIG_MEMCG, so it doesn't work for !CONFIG_MEMCG.

>
> --
> Michal Hocko
> SUSE Labs
>
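
For illustration only, the "memcg cache" idea quoted above (reuse the
previous result when the next folio belongs to the same memcg) might look
roughly like the userspace C sketch below. Everything in it is an
assumption made for the example: memcg_stub, folio_stub, demote_cache,
demotion_allowed_slow() and demotion_allowed() are hypothetical names, not
kernel types or APIs, and the allowed-node bitmask merely stands in for
whatever the real cpuset lookup would return. It does not address the
per-vma/per-task mempolicy walk discussed in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are kernel types or APIs. */
struct memcg_stub {
	uint64_t allowed_nodes;	/* bit N set: node N is allowed by the cpuset */
};

struct folio_stub {
	const struct memcg_stub *memcg;	/* owner of the folio, may be NULL */
};

/* Counts how often the full (assumed expensive) check actually runs. */
static unsigned int slow_checks;

/* Placeholder for the real cpuset lookup; here it is just a bit test. */
static bool demotion_allowed_slow(const struct memcg_stub *memcg, int target_nid)
{
	assert(target_nid >= 0 && target_nid < 64);
	slow_checks++;
	return memcg && (memcg->allowed_nodes & (UINT64_C(1) << target_nid));
}

/*
 * One-entry cache kept for the duration of a single reclaim batch.
 * It assumes target_nid stays the same for the whole batch.
 */
struct demote_cache {
	const struct memcg_stub *memcg;
	bool valid;
	bool allowed;
};

static bool demotion_allowed(struct demote_cache *cache,
			     const struct folio_stub *folio, int target_nid)
{
	/* Redo the full check only when the memcg differs from the cached one. */
	if (!cache->valid || cache->memcg != folio->memcg) {
		cache->memcg = folio->memcg;
		cache->allowed = demotion_allowed_slow(folio->memcg, target_nid);
		cache->valid = true;
	}
	return cache->allowed;
}
```

With folios that arrive grouped by memcg, the slow path runs once per
group rather than once per folio; a batch that alternates between memcgs
would defeat a one-entry cache like this one.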