From: Nadav Amit
Subject: Re: Number of arguments in vmalloc.c
Date: Fri, 7 Dec 2018 15:12:56 -0800
In-Reply-To: <20181207084550.GA2237@hirez.programming.kicks-ass.net>
References: <20181128140136.GG10377@bombadil.infradead.org>
 <3264149f-e01e-faa2-3bc8-8aa1c255e075@suse.cz>
 <20181203161352.GP10377@bombadil.infradead.org>
 <4F09425C-C9AB-452F-899C-3CF3D4B737E1@gmail.com>
 <20181203224920.GQ10377@bombadil.infradead.org>
 <20181206102559.GG13538@hirez.programming.kicks-ass.net>
 <55B665E1-3F64-4D87-B779-D1B4AFE719A9@gmail.com>
 <20181207084550.GA2237@hirez.programming.kicks-ass.net>
To: Peter Zijlstra
Cc: Matthew Wilcox, Vlastimil Babka, Linux-MM

[ We can start a new thread, since I have the tendency to hijack threads. ]

> On Dec 7, 2018, at 12:45 AM, Peter Zijlstra wrote:
>
> On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote:
>>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra wrote:
>>>
>>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
>>>> [ +Peter ]
>>>>
>>>> So I dug some more (I’m still not done), and found various trivial things
>>>> (e.g., storing zero extending u32 immediate is shorter for registers,
>>>> inlining already takes place).
>>>>
>>>> *But* there is one thing that may require some attention - patch
>>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints
>>>> on the VM_ARGS() evaluation. And this patch also imposes, it appears,
>>>> (unnecessary) constraints on other pieces of code.
>>>>
>>>> These constraints are due to the addition of the volatile keyword for
>>>> this_cpu_read() by the patch. This affects at least 68 functions in my
>>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
>>>> smp_x86_platform_ipi() and select_idle_sibling().
>>>>
>>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead
>>>> to create a separate "this_cpu_read_once()” with the volatile keyword? Such
>>>> a function can be used for native_sched_clock() and other seqlocks, etc.
>>>
>>> No. like the commit writes this_cpu_read() _must_ imply READ_ONCE(). If
>>> you want something else, use something else, there's plenty other
>>> options available.
>>>
>>> There's this_cpu_op_stable(), but also __this_cpu_read() and
>>> raw_this_cpu_read() (which currently don't differ from this_cpu_read()
>>> but could).
>>
>> Would setting the inline assembly memory operand both as input and output be
>> better than using the “volatile”?
>
> I don't know.. I'm forever befuddled by the exact semantics of gcc
> inline asm.
>
>> I think that If you do that, the compiler would should the this_cpu_read()
>> as something that changes the per-cpu-variable, which would make it invalid
>> to re-read the value. At the same time, it would not prevent reordering the
>> read with other stuff.
>
> So the thing is; as I wrote, the generic version of this_cpu_*() is:
>
> 	local_irq_save();
> 	__this_cpu_*();
> 	local_irq_restore();
>
> And per local_irq_{save,restore}() including compiler barriers that
> cannot be reordered around either.
>
> And per the principle of least surprise, I think our primitives should
> have similar semantics.

I guess so, but as you’ll see below, the end result is ugly.

> I'm actually having difficulty finding the this_cpu_read() in any of the
> functions you mention, so I cannot make any concrete suggestions other
> than pointing at the alternative functions available.

So I dug deeper into the code to understand a couple of the differences. In
the case of select_idle_sibling(), Peter’s patch increases the function code
size by 123 bytes (over the baseline of 986). The per-cpu variable is
accessed through the following call chain:

	select_idle_sibling()
	=> select_idle_cpu()
	=> local_clock()
	=> raw_smp_processor_id()

This results in 2 more calls to sched_clock_cpu(), as the compiler assumes
the processor id changes in between (which obviously wouldn’t happen). There
may be more changes around, which I didn’t fully analyze. But at the very
least, reading the processor id should not get “volatile”.

As for finish_task_switch(), the impact is only a few bytes, but still
unnecessary. It appears that with your patch preempt_count() causes multiple
reads of __preempt_count in this code:

	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		      "corrupted preempt_count: %s/%d/0x%x\n",
		      current->comm, current->pid, preempt_count()))
		preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count should not be changed in
any interrupt.
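
To illustrate the double read described above, here is a minimal user-space
sketch. It is not kernel code: fake_preempt_count, load_count,
preempt_count_volatile() and the two check_*() helpers are hypothetical
names made up for this example. It counts loads by hand instead of relying
on any particular compiler's optimizer, so it only models the source-level
pattern: with READ_ONCE()-style semantics every call is a separate load,
while caching the value in a local collapses them into one.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; these only model the kernel names in the mail. */
static int fake_preempt_count = 2;   /* models __preempt_count */
static unsigned int load_count;      /* loads performed since last reset */

/* Models a volatile (READ_ONCE-like) accessor: each call is a real load. */
static int preempt_count_volatile(void)
{
	load_count++;
	return *(volatile int *)&fake_preempt_count;
}

/*
 * Same shape as the WARN_ONCE() snippet: the value is tested and then
 * read again to build the message. Returns the number of loads done.
 */
static unsigned int check_with_two_reads(int expected, char *buf, size_t len)
{
	assert(len > 0);
	load_count = 0;
	if (preempt_count_volatile() != expected)
		snprintf(buf, len, "corrupted preempt_count: 0x%x",
			 (unsigned int)preempt_count_volatile());
	return load_count;
}

/*
 * Manual equivalent of what the compiler may do when the read is not
 * volatile: load once into a local and reuse it.
 */
static unsigned int check_with_one_read(int expected, char *buf, size_t len)
{
	int count;

	assert(len > 0);
	load_count = 0;
	count = preempt_count_volatile();
	if (count != expected)
		snprintf(buf, len, "corrupted preempt_count: 0x%x",
			 (unsigned int)count);
	return load_count;
}
```

With fake_preempt_count set to a value that differs from the expected one,
check_with_two_reads() performs two loads and check_with_one_read() performs
one. Whether a real compiler merges the two reads in the non-volatile case
depends on the surrounding code, so treat this as a model of the effect
rather than a measurement.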