From mboxrd@z Thu Jan 1 00:00:00 1970
From: SeongJae Park
To: SeongJae Park
Cc: Lorenzo Stoakes, Andrew Morton, "Liam R. Howlett", David Hildenbrand,
	Shakeel Butt, Vlastimil Babka, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, Rik van Riel
Subject: Re: [PATCH 6/9] mm/memory: split non-tlb flushing part from zap_page_range_single()
Date: Mon, 31 Mar 2025 13:24:11 -0700
Message-Id: <20250331202411.1221-1-sj@kernel.org>
X-Mailer: git-send-email 2.39.5
In-Reply-To: <20250311205801.85356-1-sj@kernel.org>
References:
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

On Tue, 11 Mar 2025 13:58:01 -0700 SeongJae Park wrote:

> On Tue, 11 Mar 2025 12:45:44 +0000 Lorenzo Stoakes wrote:
> 
> > On Mon, Mar 10, 2025 at 10:23:15AM -0700, SeongJae Park wrote:
> > > Some of zap_page_range_single() callers such as [process_]madvise() with
> > > MADV_DONTNEED[_LOCKED] cannot batch tlb flushes because
> > > zap_page_range_single() does tlb flushing for each invocation.  Split
> > > out the body of zap_page_range_single() except mmu_gather object
> > > initialization and gathered tlb entries flushing parts for such batched
> > > tlb flushing usage.
> > >
> > > Signed-off-by: SeongJae Park
> > > ---
> > >  mm/memory.c | 36 ++++++++++++++++++++++--------------
> > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 78c7ee62795e..88c478e2ed1a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1995,38 +1995,46 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > >  	mmu_notifier_invalidate_range_end(&range);
> > >  }
> > >
> > > -/**
> > > - * zap_page_range_single - remove user pages in a given range
> > > - * @vma: vm_area_struct holding the applicable pages
> > > - * @address: starting address of pages to zap
> > > - * @size: number of bytes to zap
> > > - * @details: details of shared cache invalidation
> > > - *
> > > - * The range must fit into one VMA.
> > > - */
> > > -void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> > > +static void unmap_vma_single(struct mmu_gather *tlb,
> > > +		struct vm_area_struct *vma, unsigned long address,
> > >  		unsigned long size, struct zap_details *details)
> > >  {
> > >  	const unsigned long end = address + size;
> > >  	struct mmu_notifier_range range;
> > > -	struct mmu_gather tlb;
> > >
> > >  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> > >  				address, end);
> > >  	hugetlb_zap_begin(vma, &range.start, &range.end);
> > > -	tlb_gather_mmu(&tlb, vma->vm_mm);
> > >  	update_hiwater_rss(vma->vm_mm);
> > >  	mmu_notifier_invalidate_range_start(&range);
> > >  	/*
> > >  	 * unmap 'address-end' not 'range.start-range.end' as range
> > >  	 * could have been expanded for hugetlb pmd sharing.
> > >  	 */
> > > -	unmap_single_vma(&tlb, vma, address, end, details, false);
> > > +	unmap_single_vma(tlb, vma, address, end, details, false);
> > >  	mmu_notifier_invalidate_range_end(&range);
> > > -	tlb_finish_mmu(&tlb);
> > >  	hugetlb_zap_end(vma, details);
> > 
> > Previously hugetlb_zap_end() would happen after tlb_finish_mmu(), now it
> > happens before?
> > 
> > This seems like a major problem with this change.
> 
> Oh, you're right.  This could re-introduce the racy hugetlb allocation
> failure problem that was fixed by commit 2820b0f09be9 ("hugetlbfs: close
> race between MADV_DONTNEED and page fault").  That is, this patch can make
> hugetlb allocation failures increase while MADV_DONTNEED is going on.
> 
> Maybe a straightforward fix of the problem is doing hugetlb_zap_end() for
> all vmas in a batched manner, similar to that for the tlb flush.  For
> example, add a list or an array for the vmas in 'struct madvise_behavior',
> let unmap_vma_single() add each vma there, and call hugetlb_zap_end() for
> the gathered vmas at vector_madvise() or do_madvise().  Does that make
> sense?
> 
> Also Cc-ing Rik, who is the author of commit 2820b0f09be9 ("hugetlbfs:
> close race between MADV_DONTNEED and page fault"), in case I'm missing
> something important.
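
Just to make that first idea a bit more concrete, it would be something
roughly like the below.  Note this is only an illustrative sketch: the array
fields, their size (MADV_ZAP_BATCH), and the two helper names are made up
here and do not exist in the posted series, and a real version would of
course need to handle more vmas than a fixed-size array allows.

#define MADV_ZAP_BATCH	16

struct madvise_behavior {
	/* ... existing fields of the series, including the mmu_gather ... */
	struct mmu_gather *tlb;

	/* vmas whose hugetlb_zap_end() is deferred until after the tlb flush */
	struct vm_area_struct *zap_vmas[MADV_ZAP_BATCH];
	struct zap_details zap_details[MADV_ZAP_BATCH];
	int nr_zap_vmas;
};

/* record a vma at the point where hugetlb_zap_end() is called today */
static void madvise_defer_hugetlb_zap_end(struct madvise_behavior *behavior,
		struct vm_area_struct *vma, struct zap_details *details)
{
	behavior->zap_vmas[behavior->nr_zap_vmas] = vma;
	behavior->zap_details[behavior->nr_zap_vmas] = *details;
	behavior->nr_zap_vmas++;
}

/* call from vector_madvise() or do_madvise(), after tlb_finish_mmu() */
static void madvise_do_hugetlb_zap_ends(struct madvise_behavior *behavior)
{
	int i;

	for (i = 0; i < behavior->nr_zap_vmas; i++)
		hugetlb_zap_end(behavior->zap_vmas[i],
				&behavior->zap_details[i]);
	behavior->nr_zap_vmas = 0;
}
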
I now think the straightforward fix I mentioned in the previous message might
be an unnecessarily big change.  Maybe letting the unmap_vma_single() callers
do hugetlb_zap_end() and tlb_finish_mmu() on their own, in the correct
sequence, could be another way?  Then zap_page_range_single() can do the
calls for each invocation as it did before.

process_madvise() could do batched tlb flushes only for the non-hugetlb case.
That is, do the tlb entries gathering as this version of the patch series
proposes, as usual, but check whether the address range is for hugetlb and
therefore actually requires a hugetlb_zap_end() call.  If so, flush the
so-far gathered tlb entries, call hugetlb_zap_end(), and then start the next
batch.  In other words, I'm proposing to split the batched flushes whenever a
hugetlb vma is encountered.

This means that the tlb flush overhead reduction might be smaller than
expected if process_madvise() is intensively invoked for unmapping hugetlb
pages.  But I don't think that's a common use case.  Getting the benefit for
non-hugetlb pages with a simple change first, and revisiting the hugetlb case
later once the problem shows up, might be a reasonable way forward, in my
opinion.

For example, my idea could be implemented like below, on top of this entire
patch series.

diff --git a/mm/madvise.c b/mm/madvise.c
index 4021db51aeda..e6a74e7ef864 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -861,6 +861,20 @@ static long madvise_dontneed_single_vma(struct madvise_behavior *behavior,
 	};
 
 	unmap_vma_single(behavior->tlb, vma, start, end - start, &details);
+	/*
+	 * hugetlb_zap_end() should be called after tlb_finish_mmu() to avoid
+	 * hugetlb faults for the tlb-flushing memory happening before the
+	 * memory is freed.  If not, the faults will fail memory allocation.
+	 *
+	 * If hugetlb_zap_end() really needs to be called, flush the so-far
+	 * gathered tlb entries, invoke hugetlb_zap_end(), and start another
+	 * batch of tlb flushes for the remaining unmap work.
+	 */
+	if (is_vm_hugetlb_page(vma)) {
+		tlb_finish_mmu(behavior->tlb);
+		hugetlb_zap_end(vma, &details);
+		tlb_gather_mmu(behavior->tlb, vma->vm_mm);
+	}
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index add8d540cb63..4431630d3240 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2023,7 +2023,6 @@ void unmap_vma_single(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 */
 	unmap_single_vma(tlb, vma, address, end, details, false);
 	mmu_notifier_invalidate_range_end(&range);
-	hugetlb_zap_end(vma, details);
 }
 
 /**
@@ -2043,6 +2042,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	unmap_vma_single(&tlb, vma, address, size, details);
 	tlb_finish_mmu(&tlb);
+	hugetlb_zap_end(vma, details);
 }
 
 /**

Any concerns, or something I'm missing?


Thanks,
SJ

[...]