From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Wed, 25 Sep 2024 09:53:45 +0800
Subject: Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
To: Nhat Pham, Yosry Ahmed
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 shakeel.butt@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
 chrisl@kernel.org, david@redhat.com, kasong@tencent.com,
 willy@infradead.org, viro@zeniv.linux.org.uk, baohua@kernel.org,
 chengming.zhou@linux.dev, linux-mm@kvack.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org
Message-ID: <85a2fd61-93d3-4cd9-95a3-e9eaef87286b@linux.alibaba.com>
References: <20240923231142.4155415-1-nphamcs@gmail.com>
 <4d38c65d-760c-43a5-bb47-8e0235c13a51@linux.alibaba.com>
 <9a110f20-42ad-468b-96c6-683e162452a9@linux.alibaba.com>
On 2024/9/24 23:48, Nhat Pham wrote:
> On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed wrote:
>>
>> On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham wrote:
>>>
>>> On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang wrote:
>>>>
>>>> On 2024/9/24 10:15, Yosry Ahmed wrote:
>>>>> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang wrote:
>>>>>>
>>>>>> On 2024/9/24 07:11, Nhat Pham wrote:
>>>>>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>>>>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine
>>>>>>> if a swap entry belongs to shmem during swapoff.
>>>>>>>
>>>>>>> However, swapoff has since been rewritten drastically in the commit
>>>>>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
>>>>>>> having swap count == SWAP_MAP_SHMEM value is basically the same as
>>>>>>> having swap count == 1, and swap_shmem_alloc() behaves analogously to
>>>>>>> swap_duplicate().
>>>>>>>
>>>>>>> This RFC proposes the removal of this state and the associated helper
>>>>>>> to simplify the state machine (both mentally and code-wise). We will
>>>>>>> also have an extra state/special value that can be repurposed (for
>>>>>>> swap entries that never get re-duplicated).
>>>>>>>
>>>>>>> Another motivation (albeit a bit premature at the moment) is the new
>>>>>>> swap abstraction I am currently working on, which would allow for
>>>>>>> swap/zswap decoupling, swapoff optimization, etc. The fewer states and
>>>>>>> swap API functions there are, the simpler the conversion will be.
>>>>>>>
>>>>>>> I am sending this series first as an RFC, just in case I missed
>>>>>>> something or misunderstood this state, or if someone has a swap
>>>>>>> optimization in mind for shmem that would require this special state.
>>>>>>
>>>>>> The idea makes sense to me.
>>>>>> I did a quick test with shmem mTHP, and
>>>>>> encountered the following warning, which is triggered by
>>>>>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
>>>>>
>>>>> Apparently __swap_duplicate() does not currently handle increasing the
>>>>> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
>>>>> does not handle rolling back count increases when
>>>>> swap_count_continued() fails.
>>>>>
>>>>> I guess this voids my Reviewed-by until we sort this out. Technically
>>>>> swap_count_continued() won't ever be called for shmem because we only
>>>>> ever increment the count by 1, but there is no way to know this in
>>>>> __swap_duplicate() without SWAP_MAP_SHMEM.
>>>
>>> Ah, this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
>>> remove the swapfile check (that's another can of worms, but I need
>>> data before submitting the patch to remove it...)
>>>
>>> One thing we can do, instead of warning here, is handle it in
>>> the for-loop check, where we have access to count - that's the point
>>> of having that for-loop check anyway? :)
>>>
>>> There are a couple of ways to go about it:
>>>
>>> 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0);
>>
>> Hmm, that should work, although it's a bit complicated tbh.
>>
>>> (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)
>>
>> I think this will make the warning very hard to hit if there's a
>> misuse of __swap_duplicate(). It will only be hit when an entry needs
>> count continuation, which I am not sure is very common. If there's a
>> bug, the warning will potentially catch it too late, if ever.
>>
>> The side effect here is failing to decrement the swap count of some
>> swap entries, which will lead to them never being freed, essentially
>> leaking swap capacity slowly over time. I am not sure if there are
>> more detrimental effects.
>>
>>> 2. Alternatively, instead of warning here, we can simply return
>>> -ENOMEM.
>>> Then, at the shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
>>> this MUST succeed.
>>
>> We still fail to roll back incremented counts though when we return
>> -ENOMEM, right? Maybe I didn't get what you mean.
>
> My understanding now is that there are two for loops. One for loop
> that checks the entries' states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop whether it is safe to proceed:
>
> if (!count && !has_cache) {
>         err = -ENOENT;
> } else if (usage == SWAP_HAS_CACHE) {
>         if (has_cache)
>                 err = -EEXIST;
> } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
>         err = -EINVAL;
> } else if (usage == 1 && nr > 1 &&
>            (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
>         /* the batched variants currently do not support rollback */
>         err = -ENOMEM;
> }
>
> At this point, IIUC, we have not done any incrementing, so no rollback
> is needed? :)

Right, looks good (although it needs some of the cleanups pointed out by
Yosry).

>>>
>>> Either solution should come with careful documentation to make
>>> clear the expectations/guarantees of the new API.
>>>
>>> Yosry, Baolin, how do you two feel about this? Would something like
>>> this work? I need to test it first, but let me know if I'm missing
>>> something.
>>>
>>> If this does not work, we can do what Baolin is suggesting, and
>>> perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
>>> but at least we still lose a state...
>>
>> Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
>> just a cleanup with small wins, so if it's too complicated to remove
>> it, it may not be worth it. I am assuming with your ongoing work it
>> becomes much more valuable, so maybe if it's too complicated we can
>> defer it until the benefits are realizable?
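For readers following along, the check-then-commit structure discussed
above can be modeled in userspace C. This is a minimal sketch, not the
actual kernel code: the constants mirror the values in
include/linux/swap.h (assumed here), and dup_batch() is a hypothetical
stand-in for the batched __swap_duplicate() path:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified constants; values assumed from include/linux/swap.h. */
#define SWAP_HAS_CACHE   0x40
#define COUNT_CONTINUED  0x80
#define SWAP_MAP_MAX     0x3e

#define swap_count(ent)  ((unsigned char)((ent) & ~SWAP_HAS_CACHE))

/*
 * Toy model of the two-loop pattern: pass 1 validates every entry,
 * pass 2 commits.  If validation fails, nothing has been written,
 * so no rollback is needed.
 */
static int dup_batch(unsigned char *map, int nr, unsigned char usage)
{
	int i, err = 0;

	for (i = 0; i < nr; i++) {		/* pass 1: check only */
		unsigned char count = swap_count(map[i]);
		bool has_cache = map[i] & SWAP_HAS_CACHE;

		if (!count && !has_cache)
			err = -ENOENT;
		else if (usage == SWAP_HAS_CACHE && has_cache)
			err = -EEXIST;
		else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX)
			err = -EINVAL;
		else if (usage == 1 && nr > 1 &&
			 (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)
			err = -ENOMEM;	/* batch can't take continuations */
		if (err)
			return err;	/* nothing modified yet */
	}
	for (i = 0; i < nr; i++)		/* pass 2: commit */
		map[i] += usage;
	return 0;
}
```

The key property is that the -ENOMEM case is detected before any entry
is touched, which is exactly why the batched variant needs no rollback.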
>
> I agree :)

One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
batch-free shmem swap entries in __swap_entries_free(), similar to what
commit bea67dcc5eea ("mm: attempt to batch free swap entries for
zap_pte_range()") did, which can improve the performance of shmem mTHP
munmap() based on my testing. Without this patch set, I need to make the
following changes to batch-free shmem swap entries:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..94e28cd60c52 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -163,7 +163,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
 	unsigned char *map_end = map + nr_pages;
 	unsigned char count = *map;

-	if (swap_count(count) != 1)
+	if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
 		return false;

 	while (++map < map_end) {
@@ -1503,10 +1503,10 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	unsigned int type = swp_type(entry);
 	struct swap_cluster_info *ci;
 	bool has_cache = false;
-	unsigned char count;
+	unsigned char count = swap_count(data_race(si->swap_map[offset]));
 	int i;

-	if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1)
+	if (nr <= 1 || (count != 1 && count != SWAP_MAP_SHMEM))
 		goto fallback;

 	/* cross into another cluster */
 	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
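The diff hinges on treating a count of SWAP_MAP_SHMEM as equivalent to a
count of 1 when deciding whether a run of entries may be freed as a
batch. A minimal userspace sketch of that eligibility test (the
batch_freeable() helper is hypothetical and the constant values are
assumed from include/linux/swap.h; it is not the kernel's
swap_is_last_map()):

```c
#include <assert.h>
#include <stdbool.h>

/* Values assumed from include/linux/swap.h. */
#define SWAP_HAS_CACHE  0x40
#define SWAP_MAP_SHMEM  0xbf

#define swap_count(ent) ((unsigned char)((ent) & ~SWAP_HAS_CACHE))

/*
 * A run of entries is batch-freeable when every slot holds the same
 * value and that value means "exactly one owner": either a plain
 * count of 1, or SWAP_MAP_SHMEM (shmem's single-owner marker).
 * With SWAP_MAP_SHMEM removed, only the count == 1 test remains.
 */
static bool batch_freeable(const unsigned char *map, int nr)
{
	unsigned char count = swap_count(map[0]);
	int i;

	if (count != 1 && count != SWAP_MAP_SHMEM)
		return false;
	for (i = 1; i < nr; i++)
		if (map[i] != map[0])
			return false;
	return true;
}
```

This illustrates the win Baolin describes: once shmem entries carry a
plain count of 1, the special case disappears and the existing batched
free path covers them with no extra check.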