From: "Huang, Ying"
To: Kairui Song
Cc: linux-mm@kvack.org, Kairui Song, Andrew Morton, Chris Li,
	Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko,
	Yosry Ahmed, David Hildenbrand, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip
	readahead conditionally
In-Reply-To: <20240102175338.62012-10-ryncsn@gmail.com> (Kairui Song's
	message of "Wed, 3 Jan 2024 01:53:38 +0800")
References: <20240102175338.62012-1-ryncsn@gmail.com>
	<20240102175338.62012-10-ryncsn@gmail.com>
Date: Tue, 09 Jan 2024 10:03:29 +0800
Message-ID: <871qar9sb2.fsf@yhuang6-desk2.ccr.corp.intel.com>

Kairui Song writes:

> From: Kairui Song
>
> Currently, shmem uses cluster readahead for all swap backends. Cluster
> readahead is not a good solution for ramdisk based device (ZRAM) at all.
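
[Context for readers: "cluster readahead" means one faulting swap slot
pulls in a whole aligned window of neighboring slots. A minimal userspace
model of that behavior -- not the kernel implementation; CLUSTER_WINDOW is
a hypothetical stand-in for the kernel's page-cluster knob:

	#include <stdio.h>

	/* hypothetical stand-in for 1 << page_cluster */
	#define CLUSTER_WINDOW 8UL

	static void swapin_cluster_model(unsigned long fault_slot)
	{
		/* align down to the window, then read every slot in it */
		unsigned long start = fault_slot & ~(CLUSTER_WINDOW - 1);
		unsigned long slot;

		for (slot = start; slot < start + CLUSTER_WINDOW; slot++)
			printf("read slot %lu%s\n", slot,
			       slot == fault_slot ? " <- faulted"
						  : " (speculative)");
	}

	int main(void)
	{
		/* one fault triggers CLUSTER_WINDOW device reads */
		swapin_cluster_model(42);
		return 0;
	}

On a synchronous in-memory backend like zram, every speculative read costs
real decompression CPU even if the page is never used, which is the waste
the commit message describes below.]
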
>
> After switching to the new helper, most benchmarks showed a good result:
>
> - Single file sequence read:
>   perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
>   (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
>   Before: 22.248 +- 0.549
>   After:  22.021 +- 0.684 (-1.1%)
>
> - Random read stress test:
>   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
>     --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
>     --time_based --ramp_time=1m --runtime=5m --group_reporting
>   (using brd as swap, 2G memcg limit)
>
>   Before: 1818MiB/s
>   After:  1888MiB/s (+3.85%)
>
> - Zipf biased random read stress test:
>   fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
>     --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
>     --time_based --ramp_time=1m --runtime=5m --group_reporting
>   (using brd as swap, 2G memcg limit)
>
>   Before: 31.1GiB/s
>   After:  32.3GiB/s (+3.86%)
>
> So cluster readahead doesn't help much even for single sequence read,
> and for random stress test, the performance is better without it.
>
> Considering both memory and swap device will get more fragmented
> slowly, and commonly used ZRAM consumes much more CPU than plain
> ramdisk, false readahead could occur more frequently and waste
> more CPU. Direct SWAP is cheaper, so use the new helper and skip
> read ahead for SWP_SYNCHRONOUS_IO device.

It's good to take advantage of swap_direct (no readahead). I also hope
we can take advantage of VMA based swapin if shmem is accessed via
mmap. That appears possible.

--
Best Regards,
Huang, Ying

> Signed-off-by: Kairui Song
> ---
>  mm/shmem.c      | 67 +++++++++++++++++++++++++------------------------
>  mm/swap.h       |  9 -------
>  mm/swap_state.c | 11 ++++++--
>  3 files changed, 43 insertions(+), 44 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9da9f7a0e620..3c0729fe934d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1564,20 +1564,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
>  static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
>  			pgoff_t index, unsigned int order, pgoff_t *ilx);
>  
> -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> -			struct shmem_inode_info *info, pgoff_t index)
> -{
> -	struct mempolicy *mpol;
> -	pgoff_t ilx;
> -	struct folio *folio;
> -
> -	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> -	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
> -	mpol_cond_put(mpol);
> -
> -	return folio;
> -}
> -
>  /*
>   * Make sure huge_gfp is always more limited than limit_gfp.
>   * Some of the flags set permissions, while others set limitations.
> @@ -1851,9 +1837,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	enum swap_cache_result cache_result;
>  	struct swap_info_struct *si;
>  	struct folio *folio = NULL;
> +	struct mempolicy *mpol;
>  	swp_entry_t swap;
> +	pgoff_t ilx;
>  	int error;
>  
>  	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> @@ -1871,36 +1860,40 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		return -EINVAL;
>  	}
>  
> -	/* Look it up and read it in.. */
> -	folio = swap_cache_get_folio(swap, NULL, 0, NULL);
> +	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> +	folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
> +	mpol_cond_put(mpol);
> +
>  	if (!folio) {
> -		/* Or update major stats only when swapin succeeds?? */
> +		error = -ENOMEM;
> +		goto failed;
> +	}
> +	if (cache_result != SWAP_CACHE_HIT) {
>  		if (fault_type) {
>  			*fault_type |= VM_FAULT_MAJOR;
>  			count_vm_event(PGMAJFAULT);
>  			count_memcg_event_mm(fault_mm, PGMAJFAULT);
>  		}
> -		/* Here we actually start the io */
> -		folio = shmem_swapin_cluster(swap, gfp, info, index);
> -		if (!folio) {
> -			error = -ENOMEM;
> -			goto failed;
> -		}
>  	}
>  
>  	/* We have to do this with folio locked to prevent races */
>  	folio_lock(folio);
> -	if (!folio_test_swapcache(folio) ||
> -	    folio->swap.val != swap.val ||
> -	    !shmem_confirm_swap(mapping, index, swap)) {
> +	if (cache_result != SWAP_CACHE_BYPASS) {
> +		/* With cache bypass, folio is new allocated, sync, and not in cache */
> +		if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
> +			error = -EEXIST;
> +			goto unlock;
> +		}
> +		if (!folio_test_uptodate(folio)) {
> +			error = -EIO;
> +			goto failed;
> +		}
> +		folio_wait_writeback(folio);
> +	}
> +	if (!shmem_confirm_swap(mapping, index, swap)) {
>  		error = -EEXIST;
>  		goto unlock;
>  	}
> -	if (!folio_test_uptodate(folio)) {
> -		error = -EIO;
> -		goto failed;
> -	}
> -	folio_wait_writeback(folio);
>  
>  	/*
>  	 * Some architectures may have to restore extra metadata to the
> @@ -1908,12 +1901,19 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	 */
>  	arch_swap_restore(swap, folio);
>  
> -	if (shmem_should_replace_folio(folio, gfp)) {
> +	/* With cache bypass, folio is new allocated and always respect gfp flags */
> +	if (cache_result != SWAP_CACHE_BYPASS && shmem_should_replace_folio(folio, gfp)) {
>  		error = shmem_replace_folio(&folio, gfp, info, index);
>  		if (error)
>  			goto failed;
>  	}
>  
> +	/*
> +	 * The expected value checking below should be enough to ensure
> +	 * only one up-to-date swapin success. swap_free() is called after
> +	 * this, so the entry can't be reused. As long as the mapping still
> +	 * has the old entry value, it's never swapped in or modified.
> +	 */
>  	error = shmem_add_to_page_cache(folio, mapping, index,
>  					swp_to_radix_entry(swap), gfp);
>  	if (error)
> @@ -1924,7 +1924,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	if (sgp == SGP_WRITE)
>  		folio_mark_accessed(folio);
>  
> -	delete_from_swap_cache(folio);
> +	if (cache_result != SWAP_CACHE_BYPASS)
> +		delete_from_swap_cache(folio);
>  	folio_mark_dirty(folio);
>  	swap_free(swap);
>  	put_swap_device(si);
> diff --git a/mm/swap.h b/mm/swap.h
> index 8f790a67b948..20f4048c971c 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -57,9 +57,6 @@ void __delete_from_swap_cache(struct folio *folio,
>  void delete_from_swap_cache(struct folio *folio);
>  void clear_shadow_from_swap_cache(int type, unsigned long begin,
>  				  unsigned long end);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr,
> -		void **shadowp);
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  		pgoff_t index);
>  
> @@ -123,12 +120,6 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
>  	return 0;
>  }
>  
> -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> -		struct vm_area_struct *vma, unsigned long addr)
> -{
> -	return NULL;
> -}
> -
>  static inline
>  struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  		pgoff_t index)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3edf4b63158d..10eec68475dd 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -318,7 +318,14 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>  
>  static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
>  {
> -	return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> +	int count;
> +
> +	if (!data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +		return false;
> +
> +	count = __swap_count(entry);
> +
> +	return (count == 1 || count == SWAP_MAP_SHMEM);
>  }
>  
>  static inline bool swap_use_vma_readahead(void)
> @@ -334,7 +341,7 @@ static inline bool swap_use_vma_readahead(void)
>   *
>   * Caller must lock the swap device or hold a reference to keep it valid.
>   */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> +static struct folio *swap_cache_get_folio(swp_entry_t entry,
>  		struct vm_area_struct *vma, unsigned long addr, void **shadowp)
>  {
>  	struct folio *folio;
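
[The mm/swap_state.c hunk above is the core of the behavior change. As a
self-contained illustration -- a userspace model, not kernel code; the flag
bit and the SWAP_MAP_SHMEM marker value here are stand-ins -- the new
predicate reduces to "synchronous device, and entry not shared":

	#include <stdbool.h>
	#include <stdio.h>

	#define SWP_SYNCHRONOUS_IO (1 << 0) /* stand-in flag bit */
	#define SWAP_MAP_SHMEM     0xbf     /* stand-in for the shmem count marker */

	struct swap_info { unsigned int flags; };

	static bool no_readahead_model(const struct swap_info *si, int count)
	{
		if (!(si->flags & SWP_SYNCHRONOUS_IO))
			return false; /* async device: readahead may still pay off */
		/* count == 1: not shared; SWAP_MAP_SHMEM: entry owned by shmem */
		return count == 1 || count == SWAP_MAP_SHMEM;
	}

	int main(void)
	{
		struct swap_info zram = { .flags = SWP_SYNCHRONOUS_IO };
		struct swap_info hdd  = { .flags = 0 };

		printf("zram, exclusive: %d\n", no_readahead_model(&zram, 1));
		printf("zram, shmem:     %d\n",
		       no_readahead_model(&zram, SWAP_MAP_SHMEM));
		printf("zram, shared:    %d\n", no_readahead_model(&zram, 2));
		printf("hdd,  exclusive: %d\n", no_readahead_model(&hdd, 1));
		return 0;
	}

Before the patch only count == 1 qualified; accepting SWAP_MAP_SHMEM as well
is what lets shmem-owned entries take the direct, readahead-free path.]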