From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Peng Zhang <zhangpeng362@huawei.com>
Cc: <linux-mm@kvack.org>,  <linux-kernel@vger.kernel.org>,
  <akpm@linux-foundation.org>,  <willy@infradead.org>,
  <fengwei.yin@intel.com>,  <david@redhat.com>,
  <aneesh.kumar@linux.ibm.com>,  <shy828301@gmail.com>,
  <hughd@google.com>,  <wangkefeng.wang@huawei.com>,
  <sunnanyong@huawei.com>
Subject: Re: [PATCH v3] filemap: avoid unnecessary major faults in
 filemap_fault()
In-Reply-To: <20240229060907.836589-1-zhangpeng362@huawei.com> (Peng Zhang's
	message of "Thu, 29 Feb 2024 14:09:07 +0800")
References: <20240229060907.836589-1-zhangpeng362@huawei.com>
Date: Thu, 29 Feb 2024 14:31:07 +0800
Message-ID: <87il27dbo4.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Peng Zhang <zhangpeng362@huawei.com> writes:

> From: ZhangPeng <zhangpeng362@huawei.com>
>
> A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE)
> in an application, which led to an unexpected issue[1].
>
> This is caused by a temporarily cleared PTE during a read+clear/modify/write
> update of the PTE, e.g., do_numa_page()/change_pte_range().
>
> For the data segment of the user-mode program, the global variable area
> is a private mapping. After the pagecache is loaded, the private anonymous
> page is generated after the COW is triggered. Mlockall can lock COW pages
> (anonymous pages), but the original file pages cannot be locked and may
> be reclaimed. If the global variable (private anon page) is accessed when
> vmf->pte is zeroed in numa fault, a file page fault will be triggered.
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> This issue affects our traffic analysis service. The inbound traffic is
> heavy. If a major fault occurs, the I/O schedule is triggered and the
> original I/O is suspended. Generally, the I/O schedule is 0.7 ms. If
> other applications are operating disks, the system needs to wait for
> more than 10 ms. However, the inbound traffic is heavy and the NIC buffer
> is small. As a result, packet loss occurs. But the traffic analysis service
> can't tolerate packet loss.
>
> Fix this by holding PTL and rechecking the PTE in filemap_fault() before
> triggering a major fault. We do this check only if vma is VM_LOCKED. In
> our service test environment, the baseline is 7 major faults / 12 hours.
> After the patch is applied, no major fault will be triggered.
>
> We tested file anonymous page read and write page fault performance in
> ext4, tmpfs and ramdisk using will-it-scale[2] on an x86 physical machine.
> The data is the average change compared with the mainline after the patch
> is applied. The test results indicate some performance regressions.
> We do this check only if vma is VM_LOCKED; therefore, no performance
> regression is caused for most common cases.
>
> The test results are as follows:
>                           processes processes_idle threads threads_idle
> ext4    private file write: -0.51%    0.08%          -0.03%  -0.04%
> ext4    shared  file write:  0.135%  -0.531%          2.883% -0.772%
> ramdisk private file write: -0.48%    0.23%          -1.08%   0.27%
> ramdisk private file  read:  0.07%   -6.90%          -5.85%  -0.70%

Have you retested with the VM_LOCKED optimization?  Why is there still
a performance regression?

> tmpfs   private file write: -0.344%  -0.110%          0.200%  0.145%
> tmpfs   shared  file write:  0.958%   0.101%          2.781% -0.337%
> tmpfs   private file  read: -0.16%    0.00%          -0.12%   0.41%
>
> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
> [2] https://github.com/antonblanchard/will-it-scale/
>
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> v2->v3:
> - Do this check only if vma is VM_LOCKED per David Hildenbrand
> - Hold PTL and recheck the PTE
> - Place the recheck code in a new function filemap_fault_recheck_pte()
>
> v1->v2:
> - Add more test results per Huang, Ying
> - Add more comments before check PTE per Huang, Ying, David Hildenbrand
>   and Yin Fengwei
> - Change pte_offset_map_nolock to pte_offset_map as the PTL won't
>   be used
>
> RFC->v1:
> - Add error handling when ptep == NULL per Huang, Ying and Matthew
>   Wilcox
> - Check the PTE without acquiring PTL in filemap_fault(), suggested by
>   Huang, Ying and Yin Fengwei
> - Add pmd_none() check before PTE map
> - Update commit message and add performance test information
>
>  mm/filemap.c | 40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b4858d89f1b1..2668bac68df7 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3181,6 +3181,42 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>  	return fpin;
>  }
>  
> +/*
> + * filemap_fault_recheck_pte - hold PTL and recheck whether pte is none.
> + * @vmf - the vm_fault for this fault.
> + *
> + * Recheck PTE as the PTE can be cleared temporarily during a read+clear/modify
> + * /write update of the PTE, eg, do_numa_page()/change_pte_range(). This will
> + * trigger an unexpected major fault, even if we use mlockall(), which may
> + * increase IO and thus cause other unexpected behavior.
> + *
> + * Return VM_FAULT_NOPAGE if the PTE is not none or pte_offset_map_lock()
> + * fails. In other cases, 0 is returned.
> + */
> +static vm_fault_t filemap_fault_recheck_pte(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	vm_fault_t ret = 0;
> +	pte_t *ptep;
> +
> +	if (!(vma->vm_flags & VM_LOCKED))
> +		return ret;
> +
> +	if (pmd_none(*vmf->pmd))
> +		return ret;
> +

How about checking the PTE without the lock first?  I guess that this can
improve performance in the common case (no race).
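
One way to read this suggestion, as a user-space sketch only: peek at
the entry without the lock, and take the lock only when the peek cannot
settle the question.  Everything below is a hypothetical stand-in, not
kernel code: sim_pte_slot, sim_recheck_pte() and SIM_FAULT_NOPAGE are
invented names, a pthread mutex plays the role of the PTL, and a zero
value plays the role of pte_none().

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for VM_FAULT_NOPAGE; not the kernel value. */
#define SIM_FAULT_NOPAGE 1

/* Hypothetical stand-in for one PTE slot plus its page table lock. */
struct sim_pte_slot {
	_Atomic uint64_t pte;     /* 0 plays the role of pte_none() */
	pthread_mutex_t ptl;      /* plays the role of the PTL */
	unsigned long lock_taken; /* counts locked rechecks, for illustration */
};

/*
 * Returns SIM_FAULT_NOPAGE if the entry is populated, 0 if it is still
 * empty after rechecking under the lock.
 */
static int sim_recheck_pte(struct sim_pte_slot *slot)
{
	int ret = 0;

	/* Lockless peek: a populated entry already answers the question. */
	if (atomic_load_explicit(&slot->pte, memory_order_acquire) != 0)
		return SIM_FAULT_NOPAGE;

	/*
	 * The entry looked empty.  An updater holding the lock may have
	 * cleared it only temporarily, so recheck with the lock held.
	 */
	pthread_mutex_lock(&slot->ptl);
	slot->lock_taken++;
	if (atomic_load_explicit(&slot->pte, memory_order_relaxed) != 0)
		ret = SIM_FAULT_NOPAGE;
	pthread_mutex_unlock(&slot->ptl);

	return ret;
}
```

In this sketch the lock is skipped only when the lockless read already
sees a populated entry; an empty read still needs the locked recheck,
because an updater may be in the middle of a clear/modify/write cycle.
Whether that saves anything measurable in filemap_fault() would need
testing.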

> +	ptep = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> +				   &vmf->ptl);
> +	if (unlikely(!ptep))
> +		return VM_FAULT_NOPAGE;
> +
> +	if (unlikely(!pte_none(ptep_get(ptep))))
> +		ret = VM_FAULT_NOPAGE;
> +
> +	pte_unmap_unlock(ptep, vmf->ptl);
> +	return ret;
> +}
> +
>  /**
>   * filemap_fault - read in file data for page fault handling
>   * @vmf:	struct vm_fault containing details of the fault
> @@ -3236,6 +3272,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  			mapping_locked = true;
>  		}
>  	} else {
> +		ret = filemap_fault_recheck_pte(vmf);
> +		if (unlikely(ret))
> +			return ret;
> +
>  		/* No page in the page cache at all */
>  		count_vm_event(PGMAJFAULT);
>  		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);

--
Best Regards,
Huang, Ying