From: "Huang, Ying"
To: Matthew Wilcox
Cc: "zhangpeng (AS)", linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, lstoakes@gmail.com, hughd@google.com, david@redhat.com, fengwei.yin@intel.com, vbabka@suse.cz, peterz@infradead.org, mgorman@suse.de, mingo@redhat.com, riel@redhat.com, hannes@cmpxchg.org, Nanyong Sun, Kefeng Wang
Subject: Re: [Question]: major faults are still triggered after mlockall when numa balancing
In-Reply-To: (Matthew Wilcox's message of "Thu, 9 Nov 2023 17:27:44 +0000")
References: <9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com>
Date: Fri, 10 Nov 2023 13:32:15 +0800
Message-ID: <874jhugom8.fsf@yhuang6-desk2.ccr.corp.intel.com>

Matthew Wilcox writes:

> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>> ptep_modify_prot_start() will clear the vmf->pte, until
>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
> [...]
>
>> Our problem scenario is as follows:
>>
>> task 1                                 task 2
>> ------                                 ------
>> /* scan global variables */
>> do_numa_page()
>>   spin_lock(vmf->ptl)
>>   ptep_modify_prot_start()
>>   /* set vmf->pte as null */
>>                                        /* Access global variables */
>>                                        handle_pte_fault()
>>                                          /* no pte lock */
>>                                          do_pte_missing()
>>                                            do_fault()
>>                                              do_read_fault()
>>   ptep_modify_prot_commit()
>>   /* ptep update done */
>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>                                                do_fault_around()
>>                                                __do_fault()
>>                                                  filemap_fault()
>>                                                    /* page cache is not available
>>                                                    and a major fault is triggered */
>>                                                    do_sync_mmap_readahead()
>>                                                    /* page_not_uptodate and goto
>>                                                    out_retry. */
>>
>> Is there any way to avoid such a major fault?
>
> Yes, this looks like a bug.
>
> It seems to me that the easiest way to fix this is not to zero the pte
> but to make it protnone? That would send task 2 into do_numa_page()
> where it would take the ptl, then check pte_same(), see that it's
> changed and goto out, which will end up retrying the fault.

There are other places in the kernel where the PTE is cleared, for
example, move_ptes() in mremap.c. IIUC, we need to audit all of them.

Another possible solution is to check the PTE again with the PTL held
before reading in file data. This will increase the overhead of the
major fault path. Is it acceptable?
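
To make the second idea concrete, here is a rough userspace model of
"recheck the PTE with the PTL held". It is only a sketch under stated
assumptions, not a kernel patch: every name in it (struct model_pt,
model_prot_start(), decide_with_recheck(), ...) is hypothetical, the
PTL is modelled as a plain flag, and the two tasks are interleaved by
hand in a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_pt {
    bool ptl_held;   /* stands in for the page table lock (PTL) */
    uint64_t pte;    /* 0 models an empty entry, i.e. pte_none() */
};

enum fault_action { FAULT_READ_FILE, FAULT_SKIP_READ };

/* Task 1: take the PTL and transiently clear the PTE, as the thread
 * describes for ptep_modify_prot_start(). Returns the old value. */
static uint64_t model_prot_start(struct model_pt *pt)
{
    assert(!pt->ptl_held);
    pt->ptl_held = true;
    uint64_t old = pt->pte;
    pt->pte = 0;
    return old;
}

/* Task 1: write the final PTE value and drop the PTL. */
static void model_prot_commit(struct model_pt *pt, uint64_t val)
{
    assert(pt->ptl_held);
    pt->pte = val;
    pt->ptl_held = false;
}

/* Task 2, behaviour reported in the thread: the decision is based only
 * on a snapshot of the PTE that was read without the PTL. */
static enum fault_action decide_unlocked(uint64_t snapshot)
{
    return snapshot == 0 ? FAULT_READ_FILE : FAULT_SKIP_READ;
}

/* Task 2, proposed variant: before reading file data, take the PTL and
 * look at the PTE again. In this sequential model the lock can only be
 * taken once task 1 has dropped it. */
static enum fault_action decide_with_recheck(struct model_pt *pt,
                                             uint64_t snapshot)
{
    if (snapshot != 0)
        return FAULT_SKIP_READ;
    assert(!pt->ptl_held);
    pt->ptl_held = true;
    enum fault_action act =
        (pt->pte == 0) ? FAULT_READ_FILE : FAULT_SKIP_READ;
    pt->ptl_held = false;
    return act;
}
```

In this model, a snapshot taken between model_prot_start() and
model_prot_commit() is 0, so decide_unlocked() chooses the file read,
while decide_with_recheck() called after the commit sees the restored
PTE and skips it. The extra lock/unlock pair in decide_with_recheck()
is the overhead I am asking about.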

> I'm not particularly expert at page table manipulation, so I'll let
> somebody who is propose an actual patch.

Or you could try to do it?

--
Best Regards,
Huang, Ying