From: Kefeng Wang
Date: Fri, 10 Nov 2023 11:39:24 +0800
Subject: Re: [Question]: major faults are still triggered after mlockall when numa balancing
To: "Yin, Fengwei", Yang Shi, "zhangpeng (AS)", Aneesh Kumar K.V
CC: Matthew Wilcox, Nanyong Sun
Message-ID: <56e1e123-f593-443a-be5b-754cbfb0e611@huawei.com>
In-Reply-To: <648aa9dc-fc42-4f28-af9a-b24adfdcd43d@intel.com>
References: <9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com> <648aa9dc-fc42-4f28-af9a-b24adfdcd43d@intel.com>

On 2023/11/10 9:57, Yin, Fengwei wrote:
>
>
> On 11/10/2023 6:54 AM, Yang Shi wrote:
>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) wrote:
>>>
>>> Hi everyone,
>>>
>>> There is a performance issue that has been bothering us recently.
>>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>>
>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>>> to avoid performance problems caused by major fault.
>>>
>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>
>>> For the data segment of the user-mode program, the global variable area
>>> is a private mapping. After the pagecache is loaded, the private
>>> anonymous page is generated after the COW is triggered. Mlockall can
>>> lock COW pages (anonymous pages), but the original file pages cannot
>>> be locked and may be reclaimed. If the global variable (private anon page)
>>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>>> a file page fault will be triggered.
>>>
>>> At this time, the original private file page may have been reclaimed.
>>> If the page cache is not available at this time, a major fault will be
>>> triggered and the file will be read, causing additional overhead.
>>>
>>> Our problem scenario is as follows:
>>>
>>> task 1                                 task 2
>>> ------                                 ------
>>> /* scan global variables */
>>> do_numa_page()
>>>   spin_lock(vmf->ptl)
>>>   ptep_modify_prot_start()
>>>   /* set vmf->pte as null */
>>>                                        /* Access global variables */
>>>                                        handle_pte_fault()
>>>                                          /* no pte lock */
>>>                                          do_pte_missing()
>>>                                            do_fault()
>>>                                              do_read_fault()
>>>   ptep_modify_prot_commit()
>>>   /* ptep update done */
>>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>                                                do_fault_around()
>>>                                                __do_fault()
>>>                                                  filemap_fault()
>>>                                                    /* page cache is not available
>>>                                                    and a major fault is triggered */
>>>                                                    do_sync_mmap_readahead()
>>>                                                    /* page_not_uptodate and goto
>>>                                                    out_retry. */
>>>
>>> Is there any way to avoid such a major fault?
>>
>> IMHO I don't think it is a bug. The man page quoted by Willy says "All
>> mapped pages are guaranteed to be resident in RAM when the call
>> returns successfully", but the later COW already made the file page
>> unmapped, right? The PTE pointed to the COW'ed anon page.
>> Hypothetically if we kept the file page mlocked and unmapped,
>> munlock() would have not munlocked the file page at all, it would be
>> mlocked in memory forever.
> But in this case, even the COW page is mlocked. There is small window
> that PTE is set to null in do_numa_page(). data segment access (it's to
> COW page which has nothing to do with original page cache) happens in
> this small window will trigger filemap_fault() to fault in original
> page cache.
>
> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
> But it's not reliable enough.
>
> Matthew's idea to use protnone to block both hardware accessing and
> do_pte_missing() looks more promising to me.

Actually, we could revert the following patch to avoid this issue, but
it is a workaround for ppc...

commit cee216a696b2004017a5ecb583366093d90b1568
Author: Aneesh Kumar K.V
Date:   Fri Feb 24 14:59:13 2017 -0800

    mm/autonuma: don't use set_pte_at when updating protnone ptes

    Architectures like ppc64, use privilege access bit to mark pte non
    accessible.  This implies that kernel can do a copy_to_user to an
    address marked for numa fault.  This also implies that there can be a
    parallel hardware update for the pte.  set_pte_at cannot be used in such
    scenarios.  Hence switch the pte update to use ptep_get_and_clear and
    set_pte_at combination.

>
>
> Regards
> Yin, Fengwei
>
>>
>>>
>>> --
>>> Best Regards,
>>> Peng
>>>
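
For anyone who wants to look for the symptom from user space, the
following is a minimal sketch, not the reporter's actual test case. It
assumes the scenario described in the thread: a process calls
mlockall(MCL_CURRENT | MCL_FUTURE), writes to an initialised global so
that its pages are COW'ed, and then watches the major fault counter from
getrusage(). The names global_buf, major_faults(), lock_all() and
touch_and_count() are made up for illustration.

```c
/* Sketch only: helper names are illustrative and do not come from the thread. */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Initialised global, so it is placed in .data: a private file mapping
 * whose pages become anonymous only after the first write (COW). */
volatile unsigned char global_buf[1 << 20] = { 1 };

/* Major faults charged to this process so far, or -1 on error. */
long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/* Lock current and future mappings, as in the original report.
 * Returns 0 on success, -1 on failure (for example a low RLIMIT_MEMLOCK). */
int lock_all(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Write one byte in every page of global_buf, `rounds` times.
 * Returns the number of major faults taken meanwhile, or -1 on error. */
long touch_and_count(int rounds)
{
    long page = sysconf(_SC_PAGESIZE);
    long before, after;

    if (page <= 0)
        return -1;
    before = major_faults();
    for (int r = 0; r < rounds; r++)
        for (size_t i = 0; i < sizeof(global_buf); i += (size_t)page)
            global_buf[i]++;
    after = major_faults();
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

A caller would run lock_all() once at startup and then call
touch_and_count() in a long-running loop, with NUMA balancing enabled
and the page cache under reclaim pressure. Hitting the window depends on
timing, so a result of zero does not show that the race is absent.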