Message-ID: <3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com>
Date: Mon, 4 Nov 2024 12:13:22 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: "Huang, Ying", Johannes Weiner, Barry Song <21cnbao@gmail.com>
Cc: Yosry Ahmed, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com> <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com> <20241031153830.GA799903@cmpxchg.org> <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Language: en-US
From: Usama Arif <usamaarif642@gmail.com>
In-Reply-To: <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 04/11/2024 06:42, Huang, Ying wrote:
> Johannes Weiner writes:
>
>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
>>>>>>>> then the mitigation being tried here won't apply.
>>>>>>>
>>>>>>> Is the problem reproducible without memcg? I imagine only if the
>>>>>>> entire system is under memory pressure. I guess we would want the same
>>>>>>> "mitigation" either way.
>>>>>>>
>>>>>> What would be a good open source benchmark/workload to test without limiting memory
>>>>>> in memcg?
>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
>>>>>> in cgroup and limit memory.max.
>>>>>
>>>>> You mean a benchmark that puts the entire system under memory
>>>>> pressure? I am not sure, it ultimately depends on the size of memory
>>>>> you have, among other factors.
>>>>>
>>>>> What if you run the kernel build test in a VM? Then you can limit its
>>>>> size like a memcg, although you'd probably need to leave more room
>>>>> because the entire guest OS will also be subject to the same limit.
>>>>>
>>>>
>>>> I had tried this, but the variance in time/zswap numbers was very high.
>>>> Much higher than the AMD numbers I posted in reply to Barry. So found
>>>> it very difficult to make comparison.
>>>
>>> Hmm yeah maybe more factors come into play with global memory
>>> pressure. I am honestly not sure how to test this scenario, and I
>>> suspect variance will be high anyway.
>>>
>>> We can just try to use whatever technique we use for the memcg limit
>>> though, if possible, right?
>>
>> You can boot a physical machine with mem=1G on the commandline, which
>> restricts the physical range of memory that will be initialized.
>> Double check /proc/meminfo after boot, because part of that physical
>> range might not be usable RAM.
>>
>> I do this quite often to test physical memory pressure with workloads
>> that don't scale up easily, like kernel builds.
>>
>>>>>>>> - Instead of this being a large folio swapin issue, is it more of a readahead
>>>>>>>> issue? If we zswap (without the large folio swapin series) and change the window
>>>>>>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
>>>>>>>> when cgroup memory is limited as readahead would probably cause swap thrashing as
>>>>>>>> well.
>>
>> +1
>>
>> I also think there is too much focus on cgroup alone. The bigger issue
>> seems to be how much optimistic volume we swap in when we're under
>> pressure already. This applies to large folios and readahead; global
>> memory availability and cgroup limits.
>
> The current swap readahead logic is something like,
>
> 1. try readahead some pages for sequential access pattern, mark them as
> readahead
>
> 2. if these readahead pages get accessed before swapped out again,
> increase 'hits' counter
>
> 3. for next swap in, try readahead 'hits' pages and clear 'hits'.
>
> So, if there's heavy memory pressure, the readaheaded pages will not be
> accessed before being swapped out again (in 2 above), so the readahead
> window will be minimal.
>
> IMHO, mTHP swap-in is kind of swap readahead in effect. That is, in
> addition to the pages accessed being swapped in, the adjacent pages are
> swapped in (swap readahead) too. If these readahead pages are not
> accessed before being swapped out again, the system runs into more severe
> thrashing. This is because we lack the swap readahead window scaling
> mechanism described above. And this is why I suggested before to combine
> the swap readahead mechanism and mTHP swap-in by default. That is, when
> the kernel swaps in a page, it checks the current swap readahead window
> and decides the mTHP order according to the window size. So, if there is
> heavy memory pressure, such that the nearby pages will not be accessed
> before being swapped out again, the mTHP swap-in order can be adjusted
> automatically.

This is a good idea, but I think the issue is that readahead is a folio
flag and not a page flag, so it only works when the folio size is 1. In
the swapin_readahead swapcache path, the current implementation decides
the ra_window based on hits, which is incremented in swap_cache_get_folio
if the folio has not been gotten from the swapcache before. The problem is
that we would need to know how many distinct pages of a swapped-in large
folio have been accessed in order to decide the hits/window size, which I
don't think is possible: once the entire large folio has been swapped in,
we won't get a fault for its subpages. (A rough toy sketch of the
window-driven order idea is appended at the end of this mail.)

>
>> It happens to manifest with THP in cgroups because that's what you
>> guys are testing. But IMO, any solution to this problem should
>> consider the wider scope.
>>
>>>>>>> I think large folio swapin would make the problem worse anyway. I am
>>>>>>> also not sure if the readahead window adjusts on memory pressure or
>>>>>>> not.
>>>>>>>
>>>>>> readahead window doesn't look at memory pressure. So maybe the same thing is being
>>>>>> seen here as there would be in swapin_readahead?
>>>>> Maybe readahead is not as aggressive in general as large folio
>>>>> swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
>>>>> of the window is the smaller of page_cluster (2 or 3) and
>>>>> SWAP_RA_ORDER_CEILING (5).
>>>> Yes, I was seeing 8 pages swapin (order 3) when testing. So might
>>>> be similar to enabling 32K mTHP?
>>>
>>> Not quite.
>>
>> Actually, I would expect it to be...
>
> Me too.
>
>>>>> Also readahead will swapin 4k folios AFAICT, so we don't need a
>>>>> contiguous allocation like large folio swapin. So that could be
>>>>> another factor why readahead may not reproduce the problem.
>>>
>>> Because of this ^.
>>
>> ...this matters for the physical allocation, which might require more
>> reclaim and compaction to produce the 32k. But an earlier version of
>> Barry's patch did the cgroup margin fallback after the THP was already
>> physically allocated, and it still helped.
>>
>> So the issue in this test scenario seems to be mostly about cgroup
>> volume. And then 8 4k charges should be equivalent to a singular 32k
>> charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying
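
For illustration only, below is a rough userspace toy model of the
window-scaling scheme Ying describes above (steps 1-3) and of deriving an
mTHP swap-in order from the window. This is not kernel code and not a
proposed implementation: swap_ra_note_hit(), swap_ra_next_window() and
swapin_order_from_window() are made-up names, and the real hits
bookkeeping lives in the vma readahead path (swap_cache_get_folio /
swap_vma_ra_win); only the SWAP_RA_ORDER_CEILING cap of 5 is taken from
the existing readahead code. The point it demonstrates is simply that
with zero surviving hits (thrashing) the next fault falls back to a 4K
swap-in, while a run of useful readahead grows the window and hence the
order.

#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5	/* cap mentioned in the thread */

struct swap_ra_state {
	unsigned int hits;	/* readahead pages touched before reclaim */
	unsigned int win;	/* current readahead window, in pages */
};

/* Step 2: a readahead page was accessed before being swapped out again. */
static void swap_ra_note_hit(struct swap_ra_state *ra)
{
	ra->hits++;
}

/* Step 3: on the next swap-in fault, size the window from the hits seen
 * since the last fault, clamp it, and clear the counter. */
static unsigned int swap_ra_next_window(struct swap_ra_state *ra)
{
	unsigned int win = ra->hits + 1;	/* at least the faulting page */

	if (win > (1u << SWAP_RA_ORDER_CEILING))
		win = 1u << SWAP_RA_ORDER_CEILING;
	ra->hits = 0;
	ra->win = win;
	return win;
}

/* The suggestion in the thread: pick the mTHP swap-in order from the
 * window, so heavy thrashing (few hits) degrades to order-0 swap-ins. */
static unsigned int swapin_order_from_window(unsigned int win)
{
	unsigned int order = 0;

	while ((2u << order) <= win && order < SWAP_RA_ORDER_CEILING)
		order++;
	return order;
}

int main(void)
{
	struct swap_ra_state ra = { 0, 0 };
	unsigned int win;
	int i;

	/* pretend 6 readahead pages were touched before being reclaimed */
	for (i = 0; i < 6; i++)
		swap_ra_note_hit(&ra);

	win = swap_ra_next_window(&ra);
	printf("window=%u pages -> swap-in order %u\n", win,
	       swapin_order_from_window(win));

	/* under thrashing no hits survive, so the next fault uses order 0 */
	win = swap_ra_next_window(&ra);
	printf("window=%u pages -> swap-in order %u\n", win,
	       swapin_order_from_window(win));
	return 0;
}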