From: Linus Torvalds
Date: Tue, 7 Jun 2022 10:56:01 -0700
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations
To: Ankur Arora
Cc: Linux Kernel Mailing List, Linux-MM, the arch/x86 maintainers,
    Andrew Morton, Mike Kravetz, Ingo Molnar, Andrew Lutomirski,
    Thomas Gleixner, Borislav Petkov, Peter Zijlstra, Andi Kleen,
    Arnd Bergmann, Jason Gunthorpe, jon.grimm@amd.com,
    Boris Ostrovsky, Konrad Rzeszutek Wilk, Joao Martins
In-Reply-To: <87k09s1pgo.fsf@oracle.com>
References: <20220606202109.1306034-1-ankur.a.arora@oracle.com>
    <87k09s1pgo.fsf@oracle.com>

On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora wrote:
>
> For highmem and page-at-a-time archs we would need to keep some
> of the same optimizations (via the common clear/copy_user_highpages().)

Yeah, I guess that we could keep the code for legacy use, just make
the existing code be marked __weak so that it can be ignored for any
further work.

IOW, the first patch might be to just add that __weak to
'clear_huge_page()' and 'copy_user_huge_page()'. At that point, any
architecture can just say "I will implement my own versions of these
two".

In fact, you can start with just one or the other, which is probably
nicer to keep the patch series smaller (ie do the simpler
"clear_huge_page()" first).

I worry a bit about the insanity of the "gigantic" pages, and the
mem_map_next() games it plays, but that code is from 2008 and I really
doubt it makes any sense to keep around, at least for x86.

The source of that abomination is powerpc, and I do not think that
whole issue with MAX_ORDER_NR_PAGES makes any difference on x86, at
least.

It most definitely makes no sense when there are no highmem issues,
and all those 'struct page' games should just be deleted (or at least
relegated entirely to that "legacy __weak function" case so that sane
situations don't need to care).

For that same HIGHMEM reason it's probably a good idea to limit the
new case to just x86-64, and leave 32-bit x86 behind.

> Right. Or doing the whole contiguous area in one or a few chunks,
> and then touching the faulting cachelines towards the end.

Yeah, just add a prefetch for the 'addr_hint' part at the end.
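Just to make that __weak step concrete, here's roughly what I mean
(entirely untested sketch; the signatures are the existing ones in
mm/memory.c, the x86 file name is made up for illustration):

    /* mm/memory.c: today's generic code becomes just a default */
    __weak void clear_huge_page(struct page *page,
                                unsigned long addr_hint,
                                unsigned int pages_per_huge_page)
    {
            /* ... current chunked, highmem-safe implementation ... */
    }

    /* arch/x86/mm/hugepage.c (hypothetical file): an architecture
     * overrides the default simply by providing a non-weak
     * definition with the same signature - the linker picks it over
     * the __weak one, no Kconfig plumbing needed. */
    void clear_huge_page(struct page *page,
                         unsigned long addr_hint,
                         unsigned int pages_per_huge_page)
    {
            /* flat clearing, no 'struct page' games, no highmem */
    }

and 'copy_user_huge_page()' would then get the exact same treatment
in a separate patch.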
> > Maybe an architecture could do even more radical things like "let's
> > just 'rep stos' for the whole area, but set a special thread flag
> > that causes the interrupt return to break it up on return to kernel
> > space". IOW, the "latency fix" might not even be about chunking it
> > up, it might look more like our exception handling thing.
>
> When I was thinking about this earlier, I had a vague inkling of
> setting a thread flag and deferring writes to the last few cachelines
> until just before returning to user-space.
>
> Can you elaborate a little about what you are describing above?

So 'process_huge_page()' (and the gigantic page case) does three very
different things:

 (a) that page chunking for highmem accesses

 (b) the page access _ordering_ for the cache hinting reasons

 (c) the chunking for _latency_ reasons

and I think all of them are basically "bad legacy" reasons, in that

 (a) HIGHMEM doesn't exist on sane architectures that we care about
     these days

 (b) the cache hinting ordering makes no sense if you do non-temporal
     accesses (and might then be replaced by a possible "prefetch" at
     the end)

 (c) the latency reasons still *do* exist, but only with PREEMPT_NONE

So what I was alluding to with those "more radical approaches" was
that PREEMPT_NONE case: we would probably still want to chunk things
up for latency reasons and do that "cond_resched()" in between chunks.

Now, there are alternatives here:

 (a) only override that existing disgusting (but tested) function when
     both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false

 (b) do something like this:

        void clear_huge_page(struct page *page,
                             unsigned long addr_hint,
                             unsigned int pages_per_huge_page)
        {
                void *addr = page_address(page);
        #ifdef CONFIG_PREEMPT_NONE
                /* Chunk it up page-by-page and reschedule in
                 * between, since nothing else will ever preempt
                 * us here. */
                for (int i = 0; i < pages_per_huge_page; i++) {
                        clear_page(addr + i * PAGE_SIZE);
                        cond_resched();
                }
        #else
                nontemporal_clear_big_area(addr,
                        PAGE_SIZE * pages_per_huge_page);
                prefetch((void *)addr_hint);
        #endif
        }

or (c), do that "more radical approach", where you do something like
this:

        void clear_huge_page(struct page *page,
                             unsigned long addr_hint,
                             unsigned int pages_per_huge_page)
        {
                void *addr = page_address(page);

                set_thread_flag(TIF_PREEMPT_ME);
                nontemporal_clear_big_area(addr,
                        PAGE_SIZE * pages_per_huge_page);
                clear_thread_flag(TIF_PREEMPT_ME);
                prefetch((void *)addr_hint);
        }

and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
case and actually force preemption even on a non-preempt kernel.

It's _probably_ the case that CONFIG_PREEMPT_NONE is so rare that it's
not even worth doing. I dunno.

And all of the above pseudo-code may _look_ like real code, but is
entirely untested and entirely handwavy "something like this".

Hmm?

             Linus
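P.S. Since 'nontemporal_clear_big_area()' above is a name invented in
this thread: the dumbest possible x86-64 stand-in would be a single
'rep stosb' over the whole area, something like the below. Same
caveats - untested, and 'rep stosb' goes through the cache, so it
isn't actually non-temporal at all; a real non-temporal version would
presumably be a 'movnti' store loop plus a trailing 'sfence', which is
the kind of thing this series would want to measure.

    /* Hypothetical helper: clear 'size' bytes at 'addr' with one
     * string instruction.  x86-64 only. */
    static void nontemporal_clear_big_area(void *addr, unsigned long size)
    {
            asm volatile("rep stosb"
                         : "+D" (addr), "+c" (size) /* RDI = dest, RCX = count */
                         : "a" (0)                  /* AL = byte to store */
                         : "memory");
    }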
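P.P.S. And the other half of (c), the "return to kernel mode" check,
would conceptually be something like this - even more handwavy than
the rest, since TIF_PREEMPT_ME doesn't exist today and
preempt_schedule_irq() is currently only built for preemptible
kernels, so it would have to be made available on non-preempt
configurations too:

    /* On interrupt return to kernel mode - including on a
     * CONFIG_PREEMPT_NONE kernel - honor a pending reschedule if the
     * interrupted task flagged itself as preemptible-by-request.
     * This is what breaks up a long 'rep stos' at the next timer
     * tick. */
    static __always_inline void irqentry_check_preempt_me(void)
    {
            if (test_thread_flag(TIF_PREEMPT_ME) &&
                !preempt_count() && need_resched())
                    preempt_schedule_irq();
    }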