From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 21 Jan 2025 10:43:59 +0900
From: Byungchul Park <byungchul@sk.com>
To: Vinay Banakar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, willy@infradead.org, mgorman@suse.de,
	Wei Xu, Greg Thelen, kernel_team@skhynix.com
Subject: Re: [PATCH] mm: Optimize TLB flushes during page reclaim
Message-ID: <20250121014359.GA60549@system.software.com>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.9.4 (2018-02-28)

On Mon, Jan 20, 2025 at 04:47:29PM -0600, Vinay Banakar wrote:
> The current implementation in shrink_folio_list() performs full TLB
> flushes and issues IPIs for each individual page being reclaimed. This
> causes unnecessary overhead during memory reclaim, whether triggered
> by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios where
> applications are actively moving cold pages to swap while maintaining
> high performance requirements for hot pages.
>
> The current code:
> 1. Clears the PTE and unmaps each page individually
> 2. Performs a full TLB flush on all cores using the VMA (via a CR3
>    write), or issues individual TLB shootdowns (invlpg+invpcid) for
>    single-core usage
> 3. Submits each page individually to BIO
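
(To make the above concrete: condensed into a sketch, the pre-patch flow
is roughly the following. The helper names are the real ones that appear
in mm/vmscan.c and in the diff below, but the surrounding control flow
is heavily simplified.)

	/*
	 * Condensed sketch of the pre-patch behavior (not the literal
	 * kernel code): each dirty folio pays for its own TLB flush,
	 * which means one round of IPIs per folio, and its own BIO
	 * submission.
	 */
	while (!list_empty(folio_list)) {
		struct folio *folio = lru_to_folio(folio_list);

		list_del(&folio->lru);
		try_to_unmap_flush_dirty();	/* full flush + IPIs, once per folio */
		pageout(folio, folio_mapping(folio), &plug);	/* one BIO at a time */
	}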
>
> This approach results in:
> - Excessive full TLB flushes across all cores
> - Unnecessary IPI storms when processing multiple pages
> - Suboptimal I/O submission patterns
>
> I initially tried using selective TLB shootdowns (invlpg) instead of
> full TLB flushes for each page to avoid interference with other
> threads. However, this approach still required sending IPIs to all
> cores for each page, which did not significantly improve application
> throughput.
>
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writeback (submit to BIO)

The *interesting* IPIs will be reduced by a factor of 512 at most. Can
we see the improvement numbers?

	Byungchul
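
(For reference on where the factor of 512 comes from, assuming x86-64
with 4 KiB base pages and 2 MiB PMDs:

	PTEs per PMD = PMD_SIZE / PAGE_SIZE = 2 MiB / 4 KiB = 512

so PMD-granular batching can merge at most 512 per-page flushes into a
single flush round, and only folios that still had writable or dirty
PTEs generate the "interesting" IPIs in the first place.)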
> Testing shows significant reduction in application throughput impact
> during page-out operations. Applications maintain better performance
> during memory reclaim, when triggered by explicit
> madvise(MADV_PAGEOUT) calls.
>
> I'd appreciate your feedback on this approach, especially on the
> correctness of batched BIO submissions. Looking forward to your
> comments.
>
> Signed-off-by: Vinay Banakar
> ---
>  mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 74 insertions(+), 33 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bd489c1af..1bd510622 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> 	struct folio_batch free_folios;
> 	LIST_HEAD(ret_folios);
> 	LIST_HEAD(demote_folios);
> +	LIST_HEAD(pageout_list);
> 	unsigned int nr_reclaimed = 0;
> 	unsigned int pgactivate = 0;
> 	bool do_demote_pass;
> @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> 			if (!sc->may_writepage)
> 				goto keep_locked;
>
> -			/*
> -			 * Folio is dirty. Flush the TLB if a writable entry
> -			 * potentially exists to avoid CPU writes after I/O
> -			 * starts and then write it out here.
> -			 */
> -			try_to_unmap_flush_dirty();
> -			switch (pageout(folio, mapping, &plug)) {
> -			case PAGE_KEEP:
> -				goto keep_locked;
> -			case PAGE_ACTIVATE:
> -				goto activate_locked;
> -			case PAGE_SUCCESS:
> -				stat->nr_pageout += nr_pages;
> -
> -				if (folio_test_writeback(folio))
> -					goto keep;
> -				if (folio_test_dirty(folio))
> -					goto keep;
> -
> -				/*
> -				 * A synchronous write - probably a ramdisk. Go
> -				 * ahead and try to reclaim the folio.
> -				 */
> -				if (!folio_trylock(folio))
> -					goto keep;
> -				if (folio_test_dirty(folio) ||
> -				    folio_test_writeback(folio))
> -					goto keep_locked;
> -				mapping = folio_mapping(folio);
> -				fallthrough;
> -			case PAGE_CLEAN:
> -				; /* try to free the folio below */
> -			}
> +			/* Add to the pageout list for deferred BIO submission */
> +			list_add(&folio->lru, &pageout_list);
> +			continue;
> 		}
>
> 		/*
> @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> 	}
> 	/* 'folio_list' is always empty here */
>
> +	if (!list_empty(&pageout_list)) {
> +		/*
> +		 * Batch TLB flushes by flushing once before processing all
> +		 * dirty pages. Since we operate on one PMD at a time, this
> +		 * batches TLB flushes at PMD granularity rather than
> +		 * per-page, reducing IPIs.
> +		 */
> +		struct address_space *mapping;
> +		try_to_unmap_flush_dirty();
> +
> +		while (!list_empty(&pageout_list)) {
> +			struct folio *folio = lru_to_folio(&pageout_list);
> +			list_del(&folio->lru);
> +
> +			/* Recheck if the page got reactivated */
> +			if (folio_test_active(folio) ||
> +			    (folio_mapped(folio) && folio_test_young(folio)))
> +				goto skip_pageout_locked;
> +
> +			mapping = folio_mapping(folio);
> +			pageout_t pageout_res = pageout(folio, mapping, &plug);
> +			switch (pageout_res) {
> +			case PAGE_KEEP:
> +				goto skip_pageout_locked;
> +			case PAGE_ACTIVATE:
> +				goto skip_pageout_locked;
> +			case PAGE_SUCCESS:
> +				stat->nr_pageout += folio_nr_pages(folio);
> +
> +				if (folio_test_writeback(folio) ||
> +				    folio_test_dirty(folio))
> +					goto skip_pageout;
> +
> +				/*
> +				 * A synchronous write - probably a ramdisk. Go
> +				 * ahead and try to reclaim the folio.
> +				 */
> +				if (!folio_trylock(folio))
> +					goto skip_pageout;
> +				if (folio_test_dirty(folio) ||
> +				    folio_test_writeback(folio))
> +					goto skip_pageout_locked;
> +
> +				/* Try to free the page */
> +				if (!mapping ||
> +				    !__remove_mapping(mapping, folio, true,
> +						      sc->target_mem_cgroup))
> +					goto skip_pageout_locked;
> +
> +				nr_reclaimed += folio_nr_pages(folio);
> +				folio_unlock(folio);
> +				continue;
> +
> +			case PAGE_CLEAN:
> +				if (!mapping ||
> +				    !__remove_mapping(mapping, folio, true,
> +						      sc->target_mem_cgroup))
> +					goto skip_pageout_locked;
> +
> +				nr_reclaimed += folio_nr_pages(folio);
> +				folio_unlock(folio);
> +				continue;
> +			}
> +
> +skip_pageout_locked:
> +			folio_unlock(folio);
> +skip_pageout:
> +			list_add(&folio->lru, &ret_folios);
> +		}
> +	}
> +
> 	/* Migrate folios selected for demotion */
> 	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> 	/* Folios that could not be demoted are still in @demote_folios */
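
(Condensed into a sketch, the batched flow added by the hunks above is
roughly the following. Helper names are taken from the diff; locking,
the reactivation recheck, and error paths are omitted.)

	LIST_HEAD(pageout_list);

	/* 1) While scanning, defer dirty folios instead of writing each
	 *    one out (and flushing the TLB) immediately. */
	list_add(&folio->lru, &pageout_list);

	/* 2) One flush, and so one round of IPIs, covers the whole batch. */
	try_to_unmap_flush_dirty();

	/* 3) Submit the deferred folios back to back, which also lets the
	 *    block layer merge adjacent requests. */
	while (!list_empty(&pageout_list)) {
		struct folio *folio = lru_to_folio(&pageout_list);

		list_del(&folio->lru);
		if (pageout(folio, folio_mapping(folio), &plug) == PAGE_SUCCESS)
			stat->nr_pageout += folio_nr_pages(folio);
	}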