Date: Wed, 14 Dec 2022 13:40:33 +0100
From: Johannes Weiner
To: Michal Hocko
Cc: Dave Hansen, "Huang, Ying", Yang Shi, Wei Xu, Andrew Morton, linux-mm@kvack.org, LKML
Subject: Re: memcg reclaim demotion wrt. isolation

On Wed, Dec 14, 2022 at 10:42:56AM +0100, Michal Hocko wrote:
> On Tue 13-12-22 17:14:48, Johannes Weiner wrote:
> > On Tue, Dec 13, 2022 at 04:41:10PM +0100, Michal Hocko wrote:
> > > Hi,
> > > I have just noticed that pages allocated for demotion targets
> > > include __GFP_KSWAPD_RECLAIM (through GFP_NOWAIT).
> > > This is the case since the code has been introduced by
> > > 26aa2d199d6f ("mm/migrate: demote pages during reclaim"). I suspect
> > > the intention is to trigger the aging on the fallback node and
> > > either drop or further demote oldest pages.
> > >
> > > This makes sense but I suspect that this wasn't intended also for
> > > memcg triggered reclaim. This would mean that a memory pressure in one
> > > hierarchy could trigger paging out pages of a different hierarchy if the
> > > demotion target is close to full.
> >
> > This is also true if you don't do demotion. If a cgroup tries to
> > allocate memory on a full node (i.e. mbind()), it may wake kswapd or
> > enter global reclaim directly which may push out the memory of other
> > cgroups, regardless of the respective cgroup limits.
>
> You are right on this. But this is describing a slightly different
> situation IMO.
>
> > The demotion allocations don't strike me as any different. They're
> > just allocations on behalf of a cgroup. I would expect them to wake
> > kswapd and reclaim physical memory as needed.
>
> I am not sure this is an expected behavior. Consider the currently
> discussed memory.demote interface when the userspace can trigger
> (almost) arbitrary demotions. This can deplete fallback nodes without
> over-committing the memory overall yet push out demoted memory from
> other workloads. From the user POV it would look like a reclaim while
> the overall memory is far from depleted so it would be considered as
> premature and warrant a bug report.
>
> The reclaim behavior would make more sense to me if it was constrained
> to the allocating memcg hierarchy so unrelated lruvecs wouldn't be
> disrupted.

What if the second tier is full, and the memcg you're trying to demote
doesn't have any pages to vacate on that tier yet? Will it fail to
demote?

Does that mean that a shared second tier node is only usable for the
cgroup that demotes to it first?
And demotion stops for everybody else until that cgroup vacates the
node voluntarily?

As you can see, these would be unprecedented and quite surprising
first-come-first-served memory protection semantics.

The only way to prevent cgroups from disrupting each other on NUMA
nodes is NUMA constraints: cgroup per-node limits. That shields not
only from demotion, but also from DoS-mbinding, or aggressive
promotion. All of these can result in some form of premature
reclaim/demotion; proactive demotion isn't special in that way.

The default behavior for cgroups is that without limits or
protections, resource access is unconstrained and competitive. Without
NUMA constraints, it's very much expected that cgroups compete over
nodes, and that the hottest pages win out. Per aging rules, freshly
demoted pages are hotter than anything else on the target node, so
they should displace accordingly.

Consider the case where you have two lower tier nodes and there is
cpuset isolation for the main workloads, but some maintenance thing
runs and pollutes one of the lower tier nodes.

Or consider the case where a shared lower tier node is divvied up
between two cgroups using protection settings to allow overcommit,
i.e. per-node memory.low.

Demotions, proactive or not, MUST do global reclaim on a full node.
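
To make the first-come-first-served concern concrete, here is a toy
model of a shared lower-tier node. This is not kernel code and does not
mirror any kernel data structure. Everything in it is an assumption
made up for illustration: the Page and Node classes, the demote()
helper, the reclaim_scope switch, and a plain integer "age" standing in
for LRU coldness. It only contrasts two policies for making room on a
full node: reclaiming from the demoting cgroup alone versus reclaiming
the coldest page on the node regardless of owner.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove() drops the exact page
class Page:
    cgroup: str  # hypothetical owner label
    age: int     # larger means colder (toy stand-in for LRU position)


@dataclass
class Node:
    capacity: int
    pages: list = field(default_factory=list)

    def is_full(self):
        return len(self.pages) >= self.capacity


def demote(node, page, reclaim_scope):
    """Try to place `page` on `node`; return True on success.

    reclaim_scope is "memcg" (only evict the demoting cgroup's own pages)
    or "global" (evict the coldest page on the node, whoever owns it).
    """
    if node.is_full():
        if reclaim_scope == "memcg":
            candidates = [p for p in node.pages if p.cgroup == page.cgroup]
        else:
            candidates = list(node.pages)
        if not candidates:
            return False  # nothing this scope may evict: demotion fails
        victim = max(candidates, key=lambda p: p.age)
        node.pages.remove(victim)
    node.pages.append(page)
    return True


def run(reclaim_scope):
    node = Node(capacity=4)
    # Cgroup A demotes first and fills the shared lower-tier node.
    for age in (10, 20, 30, 40):
        demote(node, Page("A", age), reclaim_scope)
    # Cgroup B then tries to demote two freshly demoted (hot) pages.
    results = [demote(node, Page("B", 0), reclaim_scope) for _ in range(2)]
    owners = sorted(p.cgroup for p in node.pages)
    return results, owners


if __name__ == "__main__":
    for scope in ("memcg", "global"):
        print(scope, run(scope))
```

In this sketch, the "memcg" scope leaves cgroup B unable to demote at
all once A has filled the node, while the "global" scope lets B's
fresher pages displace A's coldest ones.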