From: David Hildenbrand
Organization: Red Hat
To: Jesper Dangaard Brouer, Vlastimil Babka, Andrew Morton, bigeasy@linutronix.de, cl@linux.com, efault@gmx.de, iamjoonsoo.kim@lge.com, jannh@google.com, linux-mm@kvack.org, mgorman@techsingularity.net, mm-commits@vger.kernel.org, penberg@kernel.org, quic_qiancai@quicinc.com, rientjes@google.com, tglx@linutronix.de, torvalds@linux-foundation.org
Cc: brouer@redhat.com
Subject: Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
Date: Wed, 8 Sep 2021 19:24:32 +0200
Message-ID: <2397137a-81cb-f1a6-985e-943ac43a4b80@redhat.com>
In-Reply-To: <3a83989f-aa36-3aee-d92f-5ddc912d7fc5@redhat.com>
References: <20210908025436.dvsgeCXAh%akpm@linux-foundation.org> <1a8ecf24-dca4-54f2-cdbf-9135b856b773@redhat.com> <6524bba5-f737-3ab4-ee90-d6c70bac04f7@redhat.com> <3a83989f-aa36-3aee-d92f-5ddc912d7fc5@redhat.com>

On 08.09.21 19:14, Jesper Dangaard Brouer wrote:
>
>
> On 08/09/2021 16.59, David Hildenbrand wrote:
>> On 08.09.21 16:55, David Hildenbrand wrote:
>>> On 08.09.21 15:58, Vlastimil Babka wrote:
>>>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>>>
>>>>>
>>>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>>>> From: Vlastimil Babka
>>>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs
>>>>>> instead of cmpxchg
>>>>>>
>>>>>> Jann Horn reported [1] the following theoretically possible race:
>>>>>>
>>>>>>     task A: put_cpu_partial() calls preempt_disable()
>>>>>>     task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>>>>     interrupt: kfree() reaches unfreeze_partials() and discards
>>>>>>     the page
>>>>>>     task B (on another CPU): reallocates page as page cache
>>>>>>     task A: reads page->pages and page->pobjects, which are
>>>>>>     actually halves of the pointer page->lru.prev
>>>>>>     task B (on another CPU): frees page
>>>>>>     interrupt: allocates page as SLUB page and places it on the
>>>>>>     percpu partial list
>>>>>>     task A: this_cpu_cmpxchg() succeeds
>>>>>>
>>>>>>     which would cause page->pages and page->pobjects to end up
>>>>>>     containing halves of pointers that would then influence when
>>>>>>     put_cpu_partial() happens and show up in root-only sysfs files.
>>>>>>     Maybe that's acceptable, I don't know. But there should
>>>>>>     probably at least be a comment for now to point out that we're
>>>>>>     reading union fields of a page that might be in a completely
>>>>>>     different state.
>>>>>>
>>>>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial()
>>>>>> is only safe against s->cpu_slab->partial manipulation in
>>>>>> ___slab_alloc() if the latter disables irqs, otherwise a
>>>>>> __slab_free() in an irq handler could call put_cpu_partial() in
>>>>>> the middle of ___slab_alloc() manipulating ->partial and corrupt
>>>>>> it.  This becomes an issue on RT after a local_lock is introduced
>>>>>> in a later patch.  The fix means taking the local_lock also in
>>>>>> put_cpu_partial() on RT.
>>>>>>
>>>>>> After debugging this issue, Mike Galbraith suggested [2] that to
>>>>>> avoid different locking schemes on RT and !RT, we can just protect
>>>>>> put_cpu_partial() with disabled irqs (to be converted to
>>>>>> local_lock_irqsave() later) everywhere.  This should be acceptable
>>>>>> as it's not a fast path, and moving the actual partial unfreezing
>>>>>> outside of the irq-disabled section makes it short, and with the
>>>>>> retry loop gone the code can also be simplified.  In addition, the
>>>>>> race reported by Jann should no longer be possible.
>>>>>
>>>>> Based on my microbench[0] measurements, changing preempt_disable to
>>>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>>>> change, I just want people to keep this in mind.
>>>>
>>>> OK, but this is not a fast path for every allocation/free, so it
>>>> gets amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd
>>>> expect cmpxchg to be expensive too?
>>>>
>>>>> On my E5-1650 v4 @ 3.60GHz:
>>>>>    - preempt_disable(+enable)   cost: 11 cycles(tsc)  3.161 ns
>>>>>    - local_irq_save(+restore)   cost: 22 cycles(tsc)  6.331 ns
>>>>>
>>>>> Notice the non-save/restore variant is superfast:
>>>>>    - local_irq_disable(+enable) cost:  6 cycles(tsc)  1.844 ns
>>>>
>>>> It actually surprises me that it's that cheap; I would have expected
>>>> changing the irq state to be the costly part, not the
>>>> saving/restoring.
>>>> Incidentally, would you know what's the cost of save+restore when
>>>> the irqs are already disabled, so it's effectively a no-op?
>>>
>>> It surprises me as well. That would imply that protecting short RCU
>>> sections using
>>>
>>> local_irq_disable
>>> local_irq_enable
>>>
>>> instead of via
>>>
>>> preempt_disable
>>> preempt_enable
>>>
>>> would actually be very beneficial.
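To make the comparison concrete, the two protection schemes being weighed
above would look roughly like this in kernel code (a minimal, hypothetical
sketch for illustration only, not code from the patch; the wrapper
function names are made up, only the local_irq_*/preempt_* primitives are
the real ones discussed here):

#include <linux/irqflags.h>
#include <linux/preempt.h>

/* Variant 1: really mask interrupts around the short critical section. */
static void protect_with_irq_disable(void)
{
	local_irq_disable();	/* cli on x86, no flags save/restore */
	/* ... short critical section ... */
	local_irq_enable();	/* sti */
}

/* Variant 2: only forbid preemption around the short critical section. */
static void protect_with_preempt_disable(void)
{
	preempt_disable();	/* increments the preempt count */
	/* ... short critical section ... */
	preempt_enable();	/* decrements it, may reschedule if needed */
}

Which of the two is cheaper on a given CPU is exactly what the cycle
numbers above try to answer.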
>
> Please don't draw this as a general conclusion.
> As Linus describes in detail, IRQ disable/enable will be very
> micro-arch specific.

Sure, but especially for modern micro-archs this might be very relevant.

I actually stumbled over this exact question a month ago, which is why
your comment caught my attention. I looked for CLI/STI cycle numbers and
didn't really find a trusted source. I only found [1], which made it
look like incrementing/decrementing some counter would actually be much
faster most of the time.

[1] https://www.agner.org/optimize/instruction_tables.pdf

>
> The preempt_disable/enable will likely be more stable/consistent across
> micro-archs.
> Keep an eye out for kernel config options when judging
> preempt_disable/enable performance [1]
>
> [1]
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367
>
>
>>>
>>> Are the numbers trustworthy? :)
>>>
>>
>> ... and especially, did the benchmark consider side effects of
>> enabling/disabling interrupts (pipeline flushes etc.)?
>>
>
> Of course not, this is a microbenchmark... they are per definition not
> trustworthy :-P

:)

-- 
Thanks,

David / dhildenb