Date: Mon, 4 Mar 2024 22:02:36 +0000
From: Ryan Roberts
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
To: Barry Song <21cnbao@gmail.com>, David Hildenbrand
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org, ying.huang@intel.com, xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, Barry Song, Hugh Dickins
References: <20240304103757.235352-1-21cnbao@gmail.com> <706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com> <37f1e6da-412b-4bb4-88b7-4c49f21f5fe9@redhat.com>

On 04/03/2024 21:04, Barry Song wrote:
> On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand wrote:
>>
>> On 04.03.24 13:20, Ryan Roberts wrote:
>>> Hi Barry,
>>>
>>> On 04/03/2024 10:37, Barry Song wrote:
>>>> From: Barry Song
>>>>
>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>> a valid(present) PTE.
>>>> break-before-make intermediately sets PTEs
>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>> in try_to_unmap_one().
>>>
>>> I just want to check my understanding here - I think the problem occurs for
>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>> that I've had a look at the code and have a better understanding, I think that
>>> must be the case? And therefore this problem exists independently of my work to
>>> support swap-out of mTHP? (From your previous report I was under the impression
>>> that it only affected mTHP).
>>>
>>> Its just that the problem is becoming more pronounced because with mTHP,
>>> PTE-mapped large folios are much more common?
>>
>> That is my understanding.
>>
>>>
>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>> What's even more worrying is, its PTEs are no longer in a unified
>>>> state. This might lead to accident folio_split() afterwards. And
>>>> since a part of PTEs are now swap entries, accessing them will
>>>> incur page fault - do_swap_page.
>>>> It creates both anxiety and more expense. While we can't avoid
>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>> them due to its code design.
>>>> This patch is holding PTL from PTE0, thus, the folio will either
>>>> be entirely reclaimed or entirely kept. On the other hand, this
>>>> approach doesn't increase PTL contention. Even w/o the patch,
>>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>>> skips one or two PTEs because intermediate break-before-makes
>>>> are short, according to test. Of course, even w/o this patch,
>>>> the vast majority of try_to_unmap_one still can get PTL from
>>>> PTE0. This patch makes the number 100%.
>>>> The other option is that we can give up in try_to_unmap_one
>>>> once we find PTE0 is not the first entry we get PTL, we call
>>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>>> The result is quite similar with small folios with one PTE -
>>>> either entirely reclaimed or entirely kept.
>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>>> option comparing to giving up after detecting PTL begins from
>>>> non-PTE0.
>>>>
>>
>> I'm sure that wall of text can be formatted in a better way :) . Also, I
>> think we can drop some of the details,
>>
>> If you need some inspiration, I can give it a shot.
>>
>>>> Cc: Hugh Dickins
>>>> Signed-off-by: Barry Song
>>>
>>> Do we need a Fixes tag?

It seems my original question has snowballed a bit. I was conflating this change
with other reports Barry has made where the kernel was panicking (I think?).
Given we are not seeing any incorrect functional behaviour that this change
fixes, I agree we don't need a Fixes tag here.
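
For anyone reading this in the archive, the partial-unmap outcome described in
the quoted commit message can be modelled with a toy, single-threaded userspace
sketch. None of this is kernel code: struct toy_pt, toy_lock(), toy_unmap() and
toy_make() are hypothetical names made up for illustration, and the model
assumes that a concurrent break-before-make holds the PTL while its PTE is
transiently none, so that by the time the walker gets the lock the entry is
present again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PTES 16

/* Toy PTE states; illustrative only, not kernel types. */
enum toy_pte { TOY_NONE, TOY_PRESENT, TOY_SWAP };

struct toy_pt {
	enum toy_pte pte[TOY_MAX_PTES];
	size_t nr;      /* number of PTEs mapping the "folio" */
	long bbm_idx;   /* entry cleared by an in-flight break-before-make, or -1 */
	bool locked;    /* whether the walker holds the toy "PTL" */
};

/* Build a table whose entries are all present except the in-flight BBM entry. */
static struct toy_pt toy_make(size_t nr, long bbm_idx)
{
	struct toy_pt pt = { .nr = nr, .bbm_idx = bbm_idx, .locked = false };

	assert(nr <= TOY_MAX_PTES);
	for (size_t i = 0; i < nr; i++)
		pt.pte[i] = ((long)i == bbm_idx) ? TOY_NONE : TOY_PRESENT;
	return pt;
}

/*
 * Assumption: taking the lock means waiting for the BBM owner to finish,
 * so the transiently-none entry is present again once we hold the lock.
 */
static void toy_lock(struct toy_pt *pt)
{
	if (pt->bbm_idx >= 0) {
		pt->pte[pt->bbm_idx] = TOY_PRESENT;
		pt->bbm_idx = -1;
	}
	pt->locked = true;
}

/*
 * Walk all PTEs and turn present ones into swap entries.
 * lock_from_first == false: peek locklessly and only lock at the first
 * present PTE; earlier none entries are skipped for good.
 * lock_from_first == true: lock before looking at PTE0.
 * Returns the number of entries that were unmapped.
 */
static size_t toy_unmap(struct toy_pt *pt, bool lock_from_first)
{
	size_t unmapped = 0;

	if (lock_from_first)
		toy_lock(pt);

	for (size_t i = 0; i < pt->nr; i++) {
		if (!pt->locked) {
			if (pt->pte[i] != TOY_PRESENT)
				continue;       /* lockless peek saw none: skip */
			toy_lock(pt);
		}
		if (pt->pte[i] == TOY_PRESENT) {
			pt->pte[i] = TOY_SWAP;
			unmapped++;
		}
	}
	pt->locked = false;
	return unmapped;
}
```

With a 4-entry table and the break-before-make in flight on entry 0, calling
toy_unmap(&pt, false) leaves entry 0 present and turns entries 1 to 3 into swap
entries, which is the mixed state the commit message describes. Calling
toy_unmap(&pt, true) on a fresh table unmaps all four.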