From: "Huang, Ying" <ying.huang@intel.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Usama Arif, Johannes Weiner, Yosry Ahmed, akpm@linux-foundation.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
	Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li,
	Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
In-Reply-To: (Barry Song's message of "Tue, 5 Nov 2024 14:13:35 +1300")
References: <20241027001444.3233-1-21cnbao@gmail.com>
	<33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
	<852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
	<20241031153830.GA799903@cmpxchg.org>
	<87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com>
	<87ses67b0b.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 05 Nov 2024 09:41:18 +0800
Message-ID: <87ldxy78zl.fsf@yhuang6-desk2.ccr.corp.intel.com>
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying wrote:
>>
>> Usama Arif writes:
>>
>> > On 04/11/2024 06:42, Huang, Ying wrote:
>> >> Johannes Weiner writes:
>> >>
>> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
>> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>> >>>>>>>>> I am not sure that the approach we are trying in this patch is
>> >>>>>>>>> the right way:
>> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg
>> >>>>>>>>>   disabled, and then the mitigation being tried here won't
>> >>>>>>>>>   apply.
>> >>>>>>>>
>> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>> >>>>>>>> entire system is under memory pressure. I guess we would want
>> >>>>>>>> the same "mitigation" either way.
>> >>>>>>>>
>> >>>>>>> What would be a good open source benchmark/workload to test
>> >>>>>>> without limiting memory in memcg?
>> >>>>>>> For the kernel build test, I can only get zswap activity to
>> >>>>>>> happen if I build in a cgroup and limit memory.max.
>> >>>>>>
>> >>>>>> You mean a benchmark that puts the entire system under memory
>> >>>>>> pressure? I am not sure; it ultimately depends on the size of
>> >>>>>> memory you have, among other factors.
>> >>>>>>
>> >>>>>> What if you run the kernel build test in a VM? Then you can limit
>> >>>>>> its size like a memcg, although you'd probably need to leave more
>> >>>>>> room because the entire guest OS will also be subject to the same
>> >>>>>> limit.
>> >>>>>
>> >>>>> I had tried this, but the variance in time/zswap numbers was very
>> >>>>> high. Much higher than the AMD numbers I posted in reply to Barry.
>> >>>>> So I found it very difficult to make comparisons.
>> >>>>
>> >>>> Hmm, yeah, maybe more factors come into play with global memory
>> >>>> pressure. I am honestly not sure how to test this scenario, and I
>> >>>> suspect variance will be high anyway.
>> >>>>
>> >>>> We can just try to use whatever technique we use for the memcg
>> >>>> limit, though, if possible, right?
>> >>>
>> >>> You can boot a physical machine with mem=1G on the command line,
>> >>> which restricts the physical range of memory that will be
>> >>> initialized. Double-check /proc/meminfo after boot, because part of
>> >>> that physical range might not be usable RAM.
>> >>>
>> >>> I do this quite often to test physical memory pressure with
>> >>> workloads that don't scale up easily, like kernel builds.
>> >>>
>> >>>>>>>>> - Instead of this being a large folio swapin issue, is it more
>> >>>>>>>>>   of a readahead issue? If we zswap (without the large folio
>> >>>>>>>>>   swapin series) and change the window to 1 in
>> >>>>>>>>>   swap_vma_readahead, we might see an improvement in Linux
>> >>>>>>>>>   kernel build time when cgroup memory is limited, as readahead
>> >>>>>>>>>   would probably cause swap thrashing as well.
>> >>>
>> >>> +1
>> >>>
>> >>> I also think there is too much focus on cgroup alone. The bigger
>> >>> issue seems to be how much optimistic volume we swap in when we're
>> >>> already under pressure. This applies to large folios and readahead;
>> >>> to global memory availability and cgroup limits.
>> >>
>> >> The current swap readahead logic is something like:
>> >>
>> >> 1. try to readahead some pages for a sequential access pattern, and
>> >>    mark them as readahead
>> >>
>> >> 2. if these readahead pages get accessed before being swapped out
>> >>    again, increase the 'hits' counter
>> >>
>> >> 3. on the next swap-in, try to readahead 'hits' pages and clear
>> >>    'hits'
>> >>
>> >> So, under heavy memory pressure, the readahead pages will not be
>> >> accessed before being swapped out again (step 2 above), and readahead
>> >> will be minimal.
>> >>
>> >> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
>> >> addition to the accessed pages being swapped in, the adjacent pages
>> >> are swapped in (read ahead) too. If these readahead pages are not
>> >> accessed before being swapped out again, the system runs into more
>> >> severe thrashing, because we lack the swap readahead window scaling
>> >> mechanism above. And this is why I previously suggested combining the
>> >> swap readahead mechanism and mTHP swap-in by default.
>> >> That is, when the kernel swaps in a page, it checks the current swap
>> >> readahead window and decides the mTHP order according to the window
>> >> size. So, if memory pressure is heavy enough that the nearby pages
>> >> will not be accessed before being swapped out again, the mTHP swap-in
>> >> order can be adjusted down automatically.
>> >
>> > This is a good idea, but I think the issue is that readahead is a
>> > folio flag and not a page flag, so it only works when the folio size
>> > is 1.
>> >
>> > In the swapin_readahead swapcache path, the current implementation
>> > decides the ra_window based on hits, which is incremented in
>> > swap_cache_get_folio if the folio has not been gotten from the
>> > swapcache before.
>> > The problem would be that we need information on how many distinct
>> > pages in a swapped-in large folio have been accessed in order to
>> > decide the hits/window size, which I don't think is possible: once the
>> > entire large folio has been swapped in, we won't get a fault.
>>
>> To do that, we need to move the readahead flag from per-folio to
>> per-page. And we need to map only the accessed page of the folio in the
>> page fault handler. This may impact performance, so we may do it only
>> for sampled folios, for example, every 100th folio.
>
> I'm not entirely sure there's a chance to gain traction on this, as the
> current trend clearly leans toward moving flags from page to folio, not
> from folio to page :-)

This may be a problem. However, I think we can try to find a solution for
it. Anyway, we need some way to track per-page status within a folio,
regardless of how it is implemented.

>>
>> >>
>> >>> It happens to manifest with THP in cgroups because that's what you
>> >>> guys are testing. But IMO, any solution to this problem should
>> >>> consider the wider scope.
>> >>>
>> >>>>>>>> I think large folio swapin would make the problem worse anyway.
>> >>>>>>>> I am also not sure if the readahead window adjusts under memory
>> >>>>>>>> pressure or not.
>> >>>>>>>
>> >>>>>>> The readahead window doesn't look at memory pressure. So maybe
>> >>>>>>> the same thing is being seen here as there would be in
>> >>>>>>> swapin_readahead?
>> >>>>>>
>> >>>>>> Maybe readahead is not as aggressive in general as large folio
>> >>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum
>> >>>>>> order of the window is the smaller of page_cluster (2 or 3) and
>> >>>>>> SWAP_RA_ORDER_CEILING (5).
>> >>>>>
>> >>>>> Yes, I was seeing 8-page swapins (order 3) when testing. So it
>> >>>>> might be similar to enabling 32K mTHP?
>> >>>>
>> >>>> Not quite.
>> >>>
>> >>> Actually, I would expect it to be...
>> >>
>> >> Me too.
>> >>
>> >>>>>> Also, readahead will swap in 4K folios AFAICT, so we don't need a
>> >>>>>> contiguous allocation like large folio swapin does. So that could
>> >>>>>> be another factor in why readahead may not reproduce the problem.
>> >>>>
>> >>>> Because of this ^.
>> >>>
>> >>> ...this matters for the physical allocation, which might require
>> >>> more reclaim and compaction to produce the 32K. But an earlier
>> >>> version of Barry's patch did the cgroup margin fallback after the
>> >>> THP was already physically allocated, and it still helped.
>> >>>
>> >>> So the issue in this test scenario seems to be mostly about cgroup
>> >>> volume. And then eight 4K charges should be equivalent to a single
>> >>> 32K charge when it comes to cgroup pressure.

--
Best Regards,
Huang, Ying