From: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Sat, 2 Dec 2023 17:25:41 +0800
Subject: Re: [PATCH 1/4] mm: pagewalk: assert write mmap lock only for walking the user page tables
To: Muchun Song
Cc: Muchun Song, Mike Kravetz, Andrew Morton, Linux-MM
References: <20231127084645.27017-1-songmuchun@bytedance.com> <20231127084645.27017-2-songmuchun@bytedance.com> <9e5c199a-9b4d-4d1b-97d4-dd2b776ac85f@huawei.com> <6AB1EE49-496E-46DE-B51E-42B06AA717D8@linux.dev>
In-Reply-To: <6AB1EE49-496E-46DE-B51E-42B06AA717D8@linux.dev>
Content-Type: text/plain; charset="UTF-8"; format=flowed
On 2023/12/2 16:08, Muchun Song wrote:
>
>
>> On Dec 1, 2023, at 19:09, Kefeng Wang wrote:
>>
>>
>>
>> On 2023/11/27 16:46, Muchun Song wrote:
>>> The 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
>>> introduces an assertion to walk_page_range_novma() to make all the users
>>> of page table walker is safe.
>>> However, the race only exists for walking the
>>> user page tables. And it is ridiculous to hold a particular user mmap write
>>> lock against the changes of the kernel page tables. So only assert at least
>>> mmap read lock when walking the kernel page tables. And some users matching
>>> this case could downgrade to a mmap read lock to relief the contention of
>>> mmap lock of init_mm, it will be nicer in hugetlb (only holding mmap read
>>> lock) in the next patch.
>>>
>>> Signed-off-by: Muchun Song
>>> ---
>>>  mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
>>>  1 file changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index b7d7e4fcfad7a..f46c80b18ce4f 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -539,6 +539,11 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>>>   * not backed by VMAs. Because 'unusual' entries may be walked this function
>>>   * will also not lock the PTEs for the pte_entry() callback. This is useful for
>>>   * walking the kernel pages tables or page tables for firmware.
>>> + *
>>> + * Note: Be careful to walk the kernel pages tables, the caller may be need to
>>> + * take other effective approache (mmap lock may be insufficient) to prevent
>>> + * the intermediate kernel page tables belonging to the specified address range
>>> + * from being freed (e.g. memory hot-remove).
>>>   */
>>>  int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>  		unsigned long end, const struct mm_walk_ops *ops,
>>> @@ -556,7 +561,29 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>  	if (start >= end || !walk.mm)
>>>  		return -EINVAL;
>>>
>>> -	mmap_assert_write_locked(walk.mm);
>>> +	/*
>>> +	 * 1) For walking the user virtual address space:
>>> +	 *
>>> +	 * The mmap lock protects the page walker from changes to the page
>>> +	 * tables during the walk.
>>> +	 * However a read lock is insufficient to
>>> +	 * protect those areas which don't have a VMA as munmap() detaches
>>> +	 * the VMAs before downgrading to a read lock and actually tearing
>>> +	 * down PTEs/page tables. In which case, the mmap write lock should
>>> +	 * be hold.
>>> +	 *
>>> +	 * 2) For walking the kernel virtual address space:
>>> +	 *
>>> +	 * The kernel intermediate page tables usually do not be freed, so
>>> +	 * the mmap map read lock is sufficient. But there are some exceptions.
>>> +	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
>>> +	 * to prevent the intermediate kernel pages tables belonging to the
>>> +	 * specified address range from being freed. The caller should take
>>> +	 * other actions to prevent this race.
>>> +	 */
>>> +	if (mm == &init_mm)
>>> +		mmap_assert_locked(walk.mm);
>>> +	else
>>> +		mmap_assert_write_locked(walk.mm);
>>
>> Maybe just use process_mm_walk_lock() and set the correct page_walk_lock
>> in struct mm_walk_ops?
>
> No. You also need to make sure the users do not pass the wrong
> walk_lock, so you also need to add something like following:

But all the other walk_page_XX() variants have already been converted
(see commit 49b0638502da, "mm: enable page walking API to lock vmas
during the walk"); there is nothing special about this one, and the
callers must pass the right page_walk_lock in struct mm_walk_ops.

> if (mm == &init_mm)
> 	VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
> else
> 	VM_BUG_ON(walk_lock == PGWALK_RDLOCK);
>
> I do not think the code will be simple.

Or we could add the above lock check into process_mm_walk_lock() too.