From: "Huang, Ying"
To: "Yin, Fengwei"
Cc: Matthew Wilcox, "zhangpeng (AS)", Nanyong Sun, Kefeng Wang
Subject: Re: [Question]: major faults are still triggered after mlockall when numa balancing
In-Reply-To: <2c95d0d0-a708-436f-a9d9-4b3d90eafb16@intel.com> (Fengwei Yin's message of "Fri, 10 Nov 2023 17:04:26 +0800")
References: <9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com> <874jhugom8.fsf@yhuang6-desk2.ccr.corp.intel.com> <2c95d0d0-a708-436f-a9d9-4b3d90eafb16@intel.com>
Date: Mon, 13 Nov 2023 10:02:30 +0800
Message-ID: <87r0kufm15.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Yin, Fengwei" writes:

> On 11/10/2023 1:32 PM, Huang, Ying wrote:
>> Matthew Wilcox writes:
>>
>>> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>
>>> [...]
>>>
>>>> Our problem scenario is as follows:
>>>>
>>>> task 1                                task 2
>>>> ------                                ------
>>>> /* scan global variables */
>>>> do_numa_page()
>>>> spin_lock(vmf->ptl)
>>>> ptep_modify_prot_start()
>>>> /* set vmf->pte as null */
>>>>                                       /* Access global variables */
>>>>                                       handle_pte_fault()
>>>>                                       /* no pte lock */
>>>>                                       do_pte_missing()
>>>>                                       do_fault()
>>>>                                       do_read_fault()
>>>> ptep_modify_prot_commit()
>>>> /* ptep update done */
>>>> pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>>                                       do_fault_around()
>>>>                                       __do_fault()
>>>>                                       filemap_fault()
>>>>                                       /* page cache is not available
>>>>                                       and a major fault is triggered */
>>>>                                       do_sync_mmap_readahead()
>>>>                                       /* page_not_uptodate and goto
>>>>                                       out_retry. */
>>>>
>>>> Is there any way to avoid such a major fault?
>>>
>>> Yes, this looks like a bug.
>>>
>>> It seems to me that the easiest way to fix this is not to zero the pte
>>> but to make it protnone?  That would send task 2 into do_numa_page()
>>> where it would take the ptl, then check pte_same(), see that it's
>>> changed and goto out, which will end up retrying the fault.
>>
>> There are other places in the kernel where the PTE is cleared, for
>> example, move_ptes() in mremap.c.  IIUC, we need to audit all them.
>>
>> Another possible solution is to check PTE again with PTL held before
>> reading in file data.  This will increase the overhead of major fault
>> path.  Is it acceptable?
> What if we check the PTE without page table lock acquired?

The PTE is zeroed temporarily only with PTL held.  So, if we acquire the
PTL in filemap_fault() and check the PTE, the PTE which is zeroed in
do_numa_page() will be non-zero now.  So we can avoid the major fault.

But, if we don't acquire the PTL, the PTE may still be zero.

--
Best Regards,
Huang, Ying

> Regards
> Yin, Fengwei
>
>>
>>> I'm not particularly expert at page table manipulation, so I'll let
>>> somebody who is propose an actual patch.  Or you could try to do it?
>>
>> --
>> Best Regards,
>> Huang, Ying
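
The locking argument in the reply above (the PTE is zero only while the
PTL is held, so a check made under the PTL cannot see the transient
zero) can be illustrated with a small userspace sketch.  This is not
kernel code and not a proposed patch: a pthread mutex stands in for the
PTL, an atomic word stands in for the PTE, and every name below
(numa_hinting_task, read_pte_lockless, read_pte_locked, run_demo) is
hypothetical.  The two semaphores exist only to force the interleaving
from the reported scenario.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: "ptl" models the page table lock, "pte" the entry. */
static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static _Atomic unsigned long pte;
static sem_t cleared;  /* posted once the entry has been temporarily zeroed */
static sem_t peeked;   /* posted once the reader has done its lockless peek */

/* Models task 1: zero the entry and restore it, all under the lock. */
static void *numa_hinting_task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ptl);
    /* Analogous to ptep_modify_prot_start(): entry is transiently zero. */
    unsigned long old = atomic_exchange(&pte, 0);
    sem_post(&cleared);
    sem_wait(&peeked);
    /* Analogous to ptep_modify_prot_commit(): entry is valid again. */
    atomic_store(&pte, old);
    pthread_mutex_unlock(&ptl);
    return NULL;
}

/* Check without the lock: may observe the transient zero. */
static unsigned long read_pte_lockless(void)
{
    return atomic_load(&pte);
}

/* Check with the lock held: waits until any in-flight update has committed. */
static unsigned long read_pte_locked(void)
{
    pthread_mutex_lock(&ptl);
    unsigned long v = atomic_load(&pte);
    pthread_mutex_unlock(&ptl);
    return v;
}

struct result {
    unsigned long lockless;
    unsigned long locked;
};

/* Models task 2 racing with task 1 in the order shown in the report. */
static struct result run_demo(void)
{
    struct result r;
    pthread_t t;

    sem_init(&cleared, 0, 0);
    sem_init(&peeked, 0, 0);
    atomic_store(&pte, 0x1234UL);

    int rc = pthread_create(&t, NULL, numa_hinting_task, NULL);
    assert(rc == 0);
    (void)rc;

    sem_wait(&cleared);               /* task 1 holds the lock, entry is zero */
    r.lockless = read_pte_lockless(); /* sees the transient zero */
    sem_post(&peeked);
    r.locked = read_pte_locked();     /* blocks until task 1 has committed */

    pthread_join(t, NULL);
    sem_destroy(&cleared);
    sem_destroy(&peeked);
    return r;
}
```

Calling run_demo() from a small main() (build with -pthread on Linux)
yields lockless == 0 and locked == 0x1234 in this model.  The locked
read pays for the extra lock acquisition, which corresponds to the
overhead concern raised earlier in the thread.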