Message-ID: <9e5c199a-9b4d-4d1b-97d4-dd2b776ac85f@huawei.com>
Date: Fri, 1 Dec 2023 19:09:29 +0800
Subject: Re: [PATCH 1/4] mm: pagewalk: assert write mmap lock only for walking the user page tables
From: Kefeng Wang
To: Muchun Song
References: <20231127084645.27017-1-songmuchun@bytedance.com>
 <20231127084645.27017-2-songmuchun@bytedance.com>
In-Reply-To: <20231127084645.27017-2-songmuchun@bytedance.com>

On 2023/11/27 16:46, Muchun Song wrote:
> Commit 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
> introduces an assertion in walk_page_range_novma() to make sure all users
> of the page table walker are safe. However, the race only exists for walking
> the user page tables. And it is ridiculous to hold a particular user's mmap
> write lock against changes to the kernel page tables. So only assert that at
> least the mmap read lock is held when walking the kernel page tables. And
> some users matching this case could downgrade to an mmap read lock to relieve
> the contention on the mmap lock of init_mm; this will be nicer for hugetlb
> (which only holds the mmap read lock) in the next patch.
>
> Signed-off-by: Muchun Song
> ---
>  mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index b7d7e4fcfad7a..f46c80b18ce4f 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -539,6 +539,11 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * not backed by VMAs. Because 'unusual' entries may be walked this function
>   * will also not lock the PTEs for the pte_entry() callback. This is useful for
>   * walking the kernel pages tables or page tables for firmware.
> + *
> + * Note: Be careful when walking the kernel page tables, the caller may need to
> + * take other effective approaches (the mmap lock may be insufficient) to prevent
> + * the intermediate kernel page tables belonging to the specified address range
> + * from being freed (e.g. memory hot-remove).
>   */
>  int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  			  unsigned long end, const struct mm_walk_ops *ops,
> @@ -556,7 +561,29 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
>  
> -	mmap_assert_write_locked(walk.mm);
> +	/*
> +	 * 1) For walking the user virtual address space:
> +	 *
> +	 * The mmap lock protects the page walker from changes to the page
> +	 * tables during the walk. However, a read lock is insufficient to
> +	 * protect those areas which don't have a VMA as munmap() detaches
> +	 * the VMAs before downgrading to a read lock and actually tearing
> +	 * down PTEs/page tables. In that case, the mmap write lock should
> +	 * be held.
> +	 *
> +	 * 2) For walking the kernel virtual address space:
> +	 *
> +	 * The kernel intermediate page tables are usually not freed, so
> +	 * the mmap read lock is sufficient. But there are some exceptions,
> +	 * e.g. memory hot-remove, where the mmap lock is insufficient
> +	 * to prevent the intermediate kernel page tables belonging to the
> +	 * specified address range from being freed. The caller should take
> +	 * other actions to prevent this race.
> +	 */
> +	if (mm == &init_mm)
> +		mmap_assert_locked(walk.mm);
> +	else
> +		mmap_assert_write_locked(walk.mm);

Maybe just use process_mm_walk_lock() and set the correct page_walk_lock in
struct mm_walk_ops?

> 
>  	return walk_pgd_range(start, end, &walk);
>  }
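
As for the process_mm_walk_lock() suggestion above, something along these
lines might work (an untested sketch, assuming the existing
process_mm_walk_lock() helper and the page_walk_lock values already used for
ops->walk_lock in mm/pagewalk.c and include/linux/pagewalk.h):

	/*
	 * Untested sketch: reuse the existing helper instead of open-coding
	 * the init_mm check, and let each caller declare the lock it holds
	 * via ops->walk_lock.
	 */
	int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
				  unsigned long end, const struct mm_walk_ops *ops,
				  pgd_t *pgd, void *private)
	{
		struct mm_walk walk = {
			.ops		= ops,
			.mm		= mm,
			.pgd		= pgd,
			.private	= private,
			.no_vma		= true
		};

		if (start >= end || !walk.mm)
			return -EINVAL;

		/*
		 * PGWALK_RDLOCK asserts mmap_assert_locked(), the write-lock
		 * variants assert mmap_assert_write_locked().
		 */
		process_mm_walk_lock(walk.mm, ops->walk_lock);

		return walk_pgd_range(start, end, &walk);
	}

Kernel-address-space walkers (mm == &init_mm) would then set
.walk_lock = PGWALK_RDLOCK in their mm_walk_ops, while user page table
walkers keep a write-lock variant, so the locking expectation lives with
each caller rather than in the core walk code.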