From: Yang Shi
Date: Wed, 3 Feb 2021 16:26:20 -0800
Subject: Re: [RFC][PATCH 05/13] mm/numa: automatically generate node migration order
To: Dave Hansen
Cc: Dave Hansen, Linux Kernel Mailing List, Linux MM, Yang Shi, David Rientjes, Huang Ying, Dan Williams, David Hildenbrand, Oscar Salvador

On Tue, Feb 2, 2021 at 4:43 PM Dave Hansen wrote:
>
> On 2/2/21 9:46 AM, Yang Shi wrote:
> > On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen wrote:
> >> On 1/29/21 12:46 PM, Yang Shi wrote:
> >> ...
> >>>>  int next_demotion_node(int node)
> >>>>  {
> >>>> -	return node_demotion[node];
> >>>> +	/*
> >>>> +	 * node_demotion[] is updated without excluding
> >>>> +	 * this function from running.  READ_ONCE() avoids
> >>>> +	 * reading multiple, inconsistent 'node' values
> >>>> +	 * during an update.
> >>>> +	 */
> >>> Don't we need a smp_rmb() here? The single write barrier might not be
> >>> enough in the migration target set path. Typically a write barrier
> >>> should be used in pairs with a read barrier.
> >> I don't think we need one, practically.
> >>
> >> Since there is no locking against node_demotion[] updates, although a
> >> smp_rmb() would ensure that this read is up-to-date, it could change
> >> freely after the smp_rmb().
> > Yes, but this should be able to guarantee we see the "disable + after"
> > state. Isn't that preferable?
>
> I'm debating how much of this is theoretical versus actually applicable
> to what we have in the kernel.
> But, I'm generally worried about code
> like this that *looks* innocuous:
>
> 	int terminal_node = start_node;
> 	int next_node = next_demotion_node(start_node);
> 	while (next_node != NUMA_NO_NODE) {
> 		terminal_node = next_node;
> 		next_node = next_demotion_node(terminal_node);
> 	}
>
> That could loop forever if it doesn't go out to memory during each trip
> through the loop.
>
> However, if node_demotion[] *is* read on every trip through the loop, it
> will eventually terminate. READ_ONCE() can guarantee that, as could
> compiler barriers like smp_rmb().
>
> But, after staring at it for a while, I think RCU may be the most
> clearly correct way to solve the problem. Or, maybe just throw in the
> towel and do a spinlock like a normal human being. :)
>
> Anyway, here's what I was thinking I'd do with RCU:
>
> 1. node_demotion[] starts off in a "before" state
> 2. Writers to node_demotion[] first set the whole array such that
>    it will not induce cycles, like setting every member to
>    NUMA_NO_NODE. (the "disable" state)
> 3. Writer calls synchronize_rcu(). After it returns, no readers can
>    observe the "before" values.
> 4. Writer sets the actual values it wants. (the "after" state)
> 5. Readers use rcu_read_lock() over any critical section where they
>    read the array. They are guaranteed to only see one of the two
>    adjacent states (before+disabled, or disabled+after), but never
>    before+after within one RCU read-side critical section.
> 6. Readers use READ_ONCE() or some other compiler directive to ensure
>    the compiler does not reorder or combine reads from multiple,
>    adjacent RCU read-side critical sections.

Makes sense to me.

> Although, after writing this, plain old locks are sounding awfully tempting.