Date: Tue, 27 Dec 2022 13:15:14 -0700
From: Nathan Chancellor
To: Yin Fengwei
Cc: riel@surriel.com, akpm@linux-foundation.org, shy828301@gmail.com,
    willy@infradead.org, kirill.shutemov@intel.com, ying.huang@intel.com,
    feng.tang@intel.com, zhengjun.xing@linux.intel.com, linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm/thp: check and bail out if page in deferred queue already
References: <20221223135207.2275317-1-fengwei.yin@intel.com>
In-Reply-To: <20221223135207.2275317-1-fengwei.yin@intel.com>

On Fri, Dec 23, 2022 at 09:52:07PM +0800, Yin Fengwei wrote:
> Kernel build regression with LLVM was reported here:
> https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/
> with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP
> boundaries"). And the commit f35b5d7d676e was reverted.
>
> It turned out the regression is related to madvise(MADV_DONTNEED)
> being used by ld.lld with a len parameter that is not PMD_SIZE aligned.
> trace-bpfcc captured:
> 531607  531732  ld.lld  do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4
> 531607  531793  ld.lld  do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4
>
> If the underlying physical page is a THP, madvise(MADV_DONTNEED) can
> significantly raise split_queue_lock contention. perf showed the
> following data:
>     14.85%     0.00%  ld.lld  [kernel.kallsyms]  [k]
>        entry_SYSCALL_64_after_hwframe
>            11.52%
>                 entry_SYSCALL_64_after_hwframe
>                 do_syscall_64
>                 __x64_sys_madvise
>                 do_madvise.part.0
>                 zap_page_range
>                 unmap_single_vma
>                 unmap_page_range
>                 page_remove_rmap
>                 deferred_split_huge_page
>                 __lock_text_start
>                 native_queued_spin_lock_slowpath
>
> If a THP can't be removed from rmap as a whole, it is removed
> partially, by removing its sub-pages from rmap one at a time. Even if
> the THP head page has already been added to the deferred queue,
> split_queue_lock is still acquired to check whether the head page is
> in the queue. Thus, the contention on split_queue_lock is raised.
>
> Before acquiring split_queue_lock, check and bail out early if the THP
> head page is already in the queue. The check without holding
> split_queue_lock could race with deferred_split_scan, but it doesn't
> impact the correctness here.
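
For readers following the trace above, the arithmetic behind "not PMD_SIZE aligned" can be worked through with a small sketch. It assumes 4 KiB base pages and a 2 MiB PMD_SIZE (the usual x86_64 values, which the thread itself does not state); the SKETCH_* constants and the helpers split_range() and is_pmd_aligned() are hypothetical names used only for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86_64 defaults, 4 KiB base pages and 2 MiB PMD-mapped THPs. */
#define SKETCH_PAGE_SIZE 0x1000ULL
#define SKETCH_PMD_SIZE  0x200000ULL

struct range_split {
	uint64_t full_thps;     /* PMD-sized units fully covered by the range */
	uint64_t tail_subpages; /* base pages covered in the trailing, partly covered unit */
};

/* Returns nonzero if v is a multiple of the assumed PMD size. */
static int is_pmd_aligned(uint64_t v)
{
	return (v & (SKETCH_PMD_SIZE - 1)) == 0;
}

/*
 * Split [start, start + len) into whole PMD-sized units plus a partial
 * tail. Only handles a PMD-aligned start, which holds for both traced
 * madvise() calls.
 */
static struct range_split split_range(uint64_t start, uint64_t len)
{
	struct range_split r;

	assert(is_pmd_aligned(start));
	r.full_thps = len / SKETCH_PMD_SIZE;
	r.tail_subpages = (len % SKETCH_PMD_SIZE) / SKETCH_PAGE_SIZE;
	return r;
}
```

Under those assumptions, split_range(0x7feca9000000, 0x7fb000) gives 3 fully covered units and a tail of 507 base pages out of 512. If that tail is backed by a THP, it cannot be removed from rmap as a whole, which is the sub-page by sub-page case described in the quoted text.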
>
> Test result of building kernel with ld.lld:
> commit 7b5a0b664ebe (parent commit of f35b5d7d676e):
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:07.99 real,   26367.77 user,  5063.35 sys
>
> commit f35b5d7d676e:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         7:22.15 real,   26235.03 user,  12504.55 sys
>
> commit f35b5d7d676e with the fixing patch:
> time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all
>         6:08.49 real,   26520.15 user,  5047.91 sys
>
> Signed-off-by: Yin Fengwei

I cannot say whether this is a good idea, but it does resolve the
regression I reported:

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00016-g7b5a0b664ebe
    Time (mean ± σ):     383.003 s ±  0.680 s    [User: 34737.850 s, System: 7287.079 s]
    Range (min … max):   382.218 s … 383.413 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e
    Time (mean ± σ):     437.886 s ±  1.030 s    [User: 35888.658 s, System: 14048.871 s]
    Range (min … max):   436.865 s … 438.924 s    3 runs

  Benchmark 1: x86_64 allmodconfig (GCC + ld.lld) @ 1b929c02afd3 ("Linux 6.2-rc1") on 6.0.0-rc3-debug-00017-gf35b5d7d676e-dirty
    Time (mean ± σ):     384.371 s ±  1.004 s    [User: 35402.880 s, System: 6401.691 s]
    Range (min … max):   383.547 s … 385.489 s    3 runs

Tested-by: Nathan Chancellor

> ---
> My first thought was to change the per node deferred queue to per cpu.
> It's complicated and may be overkill.
>
> For the race without the lock acquired, I didn't see an obvious issue
> here. But I could miss something. Let me know if I did. Thanks.
>
>
>  mm/huge_memory.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index abe6cfd92ffa..7cde9f702e63 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2837,6 +2837,9 @@ void deferred_split_huge_page(struct page *page)
>  	if (PageSwapCache(page))
>  		return;
>
> +	if (!list_empty(page_deferred_list(page)))
> +		return;
> +
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	if (list_empty(page_deferred_list(page))) {
>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>
> base-commit: 8395ae05cb5a2e31d36106e8c85efa11cda849be
> --
> 2.34.1
>
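
The shape of the change in the diff above, an unlocked "already queued?" check followed by a re-check under the lock, can be illustrated with a small userspace sketch. This is not the kernel code: struct list_node, struct split_queue, struct fake_page, defer_split() and the lock_acquisitions counter are hypothetical stand-ins, and a C11 atomic_flag spin loop is swapped in for spin_lock_irqsave(). It is only meant to show how the early return keeps repeated calls for the same page off the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_node_init(struct list_node *n)
{
	n->next = n->prev = n;
}

/* A node that points at itself is not on any list. */
static bool list_node_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_node_add_tail(struct list_node *n, struct list_node *head)
{
	assert(list_node_empty(n)); /* must not already be queued */
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

struct split_queue {
	atomic_flag lock;                /* stand-in for split_queue_lock */
	struct list_node head;
	unsigned long len;
	unsigned long lock_acquisitions; /* instrumentation for this sketch only */
};

struct fake_page {
	struct list_node deferred;       /* stand-in for page_deferred_list(page) */
};

static void split_queue_init(struct split_queue *q)
{
	atomic_flag_clear(&q->lock);
	list_node_init(&q->head);
	q->len = 0;
	q->lock_acquisitions = 0;
}

static void defer_split(struct split_queue *q, struct fake_page *p)
{
	/*
	 * Unlocked fast path: if the page is already queued, skip the lock.
	 * In truly concurrent code this read would need an atomic load;
	 * the sketch is only exercised from a single thread.
	 */
	if (!list_node_empty(&p->deferred))
		return;

	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the lock is free */
	q->lock_acquisitions++;

	/* Re-check under the lock; another caller may have queued it first. */
	if (list_node_empty(&p->deferred)) {
		list_node_add_tail(&p->deferred, &q->head);
		q->len++;
	}
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}
```

Calling defer_split() 512 times on the same fake_page, roughly what happens when every sub-page of one THP is removed from rmap separately, takes the lock once in this sketch. Without the early return it would take the lock on every call.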