From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8C13AC001B0 for ; Thu, 10 Aug 2023 09:59:23 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 09AE16B0074; Thu, 10 Aug 2023 05:59:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 049EE6B0075; Thu, 10 Aug 2023 05:59:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E53E36B0078; Thu, 10 Aug 2023 05:59:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id D48E36B0074 for ; Thu, 10 Aug 2023 05:59:22 -0400 (EDT) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 9C85A1C9790 for ; Thu, 10 Aug 2023 09:59:22 +0000 (UTC) X-FDA: 81107747364.19.0C77485 Received: from www262.sakura.ne.jp (www262.sakura.ne.jp [202.181.97.72]) by imf18.hostedemail.com (Postfix) with ESMTP id 4F4891C001C for ; Thu, 10 Aug 2023 09:59:18 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=none; spf=none (imf18.hostedemail.com: domain of penguin-kernel@I-love.SAKURA.ne.jp has no SPF policy when checking 202.181.97.72) smtp.mailfrom=penguin-kernel@I-love.SAKURA.ne.jp; dmarc=none ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1691661560; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XXCaMP8dpOYBO5YL8uD14dPptQKvGaPsW1X+5vdi8Uo=; b=QlssgRR6Z8T1oe7R7f7Vmg5lsZavlWZNEowf5xMc6NcisL3Vva/qoGXl/ySwJtnsDnByjQ P36qcLYnu142/70XQWMoYtYLWnG5CcN+uLOOkB7i6JcgzPVwkRrh6j7JG3SgEKU1dhjGUS KaCBDztx0tHeTHSNxVZDUuODhmZPyE4= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1691661560; a=rsa-sha256; cv=none; b=VCMxRppZOZxOLXX/07IM+KFHbG3JT6W66ghlNWao246RY+02/paLhcvgJDmYUJKoPzsBSN gk6wO0oNeupdpIErZ7vlxVEGJ9//H2lNiJxPwO+ioOsyqDxWckxJiE0hPhJnaJzFmsJDTH yTgIryvy//HGPApAVC2ytRdZaar+0Hg= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=none; spf=none (imf18.hostedemail.com: domain of penguin-kernel@I-love.SAKURA.ne.jp has no SPF policy when checking 202.181.97.72) smtp.mailfrom=penguin-kernel@I-love.SAKURA.ne.jp; dmarc=none Received: from fsav118.sakura.ne.jp (fsav118.sakura.ne.jp [27.133.134.245]) by www262.sakura.ne.jp (8.15.2/8.15.2) with ESMTP id 37A9wx2X092256; Thu, 10 Aug 2023 18:58:59 +0900 (JST) (envelope-from penguin-kernel@I-love.SAKURA.ne.jp) Received: from www262.sakura.ne.jp (202.181.97.72) by fsav118.sakura.ne.jp (F-Secure/fsigk_smtp/550/fsav118.sakura.ne.jp); Thu, 10 Aug 2023 18:58:59 +0900 (JST) X-Virus-Status: clean(F-Secure/fsigk_smtp/550/fsav118.sakura.ne.jp) Received: from [192.168.1.6] (M106072142033.v4.enabler.ne.jp [106.72.142.33]) (authenticated bits=0) by www262.sakura.ne.jp (8.15.2/8.15.2) with ESMTPSA id 37A9wxvM092250 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NO); Thu, 10 Aug 2023 18:58:59 +0900 (JST) (envelope-from penguin-kernel@I-love.SAKURA.ne.jp) Message-ID: <566173d4-84d1-c76b-6fe4-f5ea5f24f613@I-love.SAKURA.ne.jp> Date: Thu, 10 Aug 2023 18:58:59 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.14.0 Subject: Re: [PATCH v2] mm/page_alloc: don't check zonelist_update_seq from atomic allocations Content-Language: en-US To: Sebastian Andrzej Siewior Cc: Michal Hocko , Andrew Morton , Petr Mladek , linux-mm , LKML , "Luis Claudio R. Goncalves" , Boqun Feng , Ingo Molnar , John Ogness , Mel Gorman , Peter Zijlstra , Thomas Gleixner , Waiman Long , Will Deacon References: <6cc13636-eda6-6a95-6564-db1c9ae76bb6@I-love.SAKURA.ne.jp> <20230810072637.6Sc3UU3R@linutronix.de> From: Tetsuo Handa In-Reply-To: <20230810072637.6Sc3UU3R@linutronix.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Stat-Signature: osh9j3oynfdupnqa9ietu5uwiwgoegg7 X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 4F4891C001C X-Rspam-User: X-HE-Tag: 1691661558-869184 X-HE-Meta: U2FsdGVkX19/cPjbt5VrWwVHx2NuQOdb9Hy84IYvHVKvOjXKkLTT9Yp9VECsFDj3C6QL0o4twJpQWpKMmMUsiOhJ7JzPKuV2l6GUlF7t1e/zvc1RfsYcxHKVEBXadbQPTcTr6eHWSlbmsd2Q6SKaED/ENSprIGfopEoFBlZE+a7s09ja6RM+qg+KngqRuD6YoNzGCcw6AsQA/NFjYs6PWXJHpdsRP7aU6csZ7rkjOfQBJACq5BIcO4aGIQoLGvVx1cxfKoqCqEYirXPSc00MJKLeci0v4KUjX7e7MiLRbdymfBue4y1KcnBYbX/lUx5fmGIXUWItSpMr1nlNVtf7CFVdB2tknVSDKavQsCNT4/YatrIxOSPvEbdrsqVbZ8aWwElcBtuqJADDeYLUhYN3UImOjFgBJ1oWY6XmeEFeuj478v0ImdOLQ5DuwW7Lp9ZskFglw7qOxmONWZlRto8zjetEPyfhmAbas0fJCVxJyH8NdLUS6piH6YKyGmdA0KO9U2s5aq1HlKLsTQfg1iUe7U5PnQ1KnTCrR8IFHk/yuhG15fQNzsoKpYmSCo7tIqPpvMb12DqX632kMyr9Uh4ivhBmJeFAnTemg4yw2zdzRK1oDoUp4fWKuwYV8DpT5oVmIvizEpqjc+xU3aNA9Yf/PkqTcVRdkVdGanH9nWuw6lWNOdh0ntbiiJ+7eGdwbV6+kPELG2QofuQQ45Ug20yu4I9sbfG0JauA4uB07hYFFN06I068L7eVbUWEcbkqTi8K5355S1u/23kh5EsmlSqmOOkUhH0V2y3raXFI9+YueN5I15VLZPoitdPcIWiZ6Rlc8Jr2ob3nNSmgH5pYP3B74csg9iSM1Sus3ZQ2fMxwFNZhbwqIsWxIojxoHtwUObvWSxIcxeFZ27JGVWQIz9vM8ohCycXmMS5ANmAMZHyqz31nAsCf9UN/2gGOemiVciQR42E7HPR5pm4KqZiLUQA z48gQrez /Qx106K/LMlX9enAlNy15MLV0LmOACbRPm6d6ikZSYgO0UvDGwSkA+x9EGlFfEWF7uVGxaGz+2bPaWc0rK8wAwXJ6CWGvPPB/x/pih9MIyOiJj/ljS/k1zVIOqSV22lCCGoIG0fJVvKU7GJyarMxrsnpxvePd9frv+i5zGq97TFRm8QTOX9nRrW5O3occntr0lj/RtTufhHs4QJAhyJl6SUM8HueXdCYgt2VCOMx8l2ktBYPRENdcEPzSAsrS9wVIOUxH7aVEmPp5CZJOFnS1CJ6faaJLfOIG4wPWHREYCc+dklIqDzI4xUDOcuZtQZ0b1q8qu6v3cwmykqDaozYYaM0Z4loiHY87Per+dpakP0OMF4lQqF1ztI+2NyCshOHCV0ZCNjMQbi20HFM= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2023/08/10 16:26, Sebastian Andrzej Siewior wrote: > On 2023-08-09 20:03:00 [+0900], Tetsuo Handa wrote: >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 7d3460c7a480..5557d9a2ff2c 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release); > … >> -static DEFINE_SEQLOCK(zonelist_update_seq); >> +static unsigned int zonelist_update_seq; >> >> static unsigned int zonelist_iter_begin(void) >> { >> if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) >> - return read_seqbegin(&zonelist_update_seq); >> + /* See comment above. */ >> + return data_race(READ_ONCE(zonelist_update_seq)); > > This is open coded raw_read_seqcount() while it should have been > raw_seqcount_begin(). Not an open coded raw_read_seqcount(). You explained us at https://lkml.kernel.org/r/20230623101111.7tuAg5p5@linutronix.de that seqprop_sequence() behaves differently if CONFIG_PREEMPT_RT=y. The point of my proposal is to get rid of spin_lock(s->lock); spin_unlock(s->lock); from zonelist_iter_begin(). Also, my version avoids KCSAN warning by using data_race() and avoids papering over KCSAN warnings between zonelist_iter_begin() and check_retry_zonelist() by not using kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX). /* * The seqlock seqcount_t interface does not prescribe a precise sequence of * read begin/retry/end. For readers, typically there is a call to * read_seqcount_begin() and read_seqcount_retry(), however, there are more * esoteric cases which do not follow this pattern. * * As a consequence, we take the following best-effort approach for raw usage * via seqcount_t under KCSAN: upon beginning a seq-reader critical section, * pessimistically mark the next KCSAN_SEQLOCK_REGION_MAX memory accesses as * atomics; if there is a matching read_seqcount_retry() call, no following * memory operations are considered atomic. Usage of the seqlock_t interface * is not affected. */ The section between zonelist_iter_begin() and check_retry_zonelist() is very complicated where concurrency bug that is unrelated to zonelist counters could be found and fixed. > >> return 0; >> } >> >> -static unsigned int check_retry_zonelist(unsigned int seq) >> +static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq) >> { >> - if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE)) >> - return read_seqretry(&zonelist_update_seq, seq); >> + if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) { >> + /* See comment above. */ >> + unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq)); >> >> - return seq; >> + /* >> + * "seq != seq2" indicates that __build_all_zonelists() has >> + * started or has finished rebuilding zonelists, hence retry. >> + * "seq == seq2 && (seq2 & 1)" indicates that >> + * __build_all_zonelists() is still rebuilding zonelists >> + * with context switching disabled, hence retry. >> + * "seq == seq2 && !(seq2 & 1)" indicates that >> + * __build_all_zonelists() did not rebuild zonelists, hence >> + * no retry. >> + */ >> + return unlikely(seq != seq2 || (seq2 & 1)); > > open coded read_seqcount_retry(). Not an open coded read_seqcount_retry(), for read_seqcount_retry() checks only "seq != seq2" condition. We need to check "seq2 & 1" condition too. > >> + } >> + >> + return 0; >> } >> >> /* Perform direct synchronous page reclaim */ >> @@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data) >> int nid; >> int __maybe_unused cpu; >> pg_data_t *self = data; >> + static DEFINE_SPINLOCK(lock); >> unsigned long flags; >> >> - /* >> - * Explicitly disable this CPU's interrupts before taking seqlock >> - * to prevent any IRQ handler from calling into the page allocator >> - * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock. >> - */ >> - local_irq_save(flags); >> - /* >> - * Explicitly disable this CPU's synchronous printk() before taking >> - * seqlock to prevent any printk() from trying to hold port->lock, for >> - * tty_insert_flip_string_and_push_buffer() on other CPU might be >> - * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held. >> - */ >> - printk_deferred_enter(); >> - write_seqlock(&zonelist_update_seq); >> +#ifdef CONFIG_PREEMPT_RT >> + migrate_disable() >> +#endif > > There is no justification/ explanation why migrate_disable() here is > needed on PREEMPT_RT and I don't see one. This migrate_disable() is a compensation for removing spin_lock(s->lock); spin_unlock(s->lock); from zonelist_iter_begin(). Since neither the proposed zonelist_iter_begin() nor the proposed check_retry_zonelist() holds the spinlock, we need to guarantee that the thread which has performed the opening zonelist_update_seq++ can continue execution till the closing zonelist_update_seq++ without sleeping. Calling interrupt handlers are fine, for interrupt handlers can't do __GFP_DIRECT_RECLAIM allocation, which in turn guarantees that interrupt handlers switched from the thread which has performed the opening zonelist_update_seq++ won't deadlock. > > There are two changes here: > - The replacement of seqlock_t with something open coded Yes. > - Logic change when a retry is needed (the gfp mask is considered). Yes. > > I am not a big fan of open coding things especially when not needed and > then there is this ifdef which is not needed as well. I don't comment on > the logic change. If __build_all_zonelists() can run without being switched to other threads (except interrupt handlers), I consider that this approach works. > > Can we please put an end to this? > >> + /* Serialize increments of zonelist_update_seq. */ >> + spin_lock_irqsave(&lock, flags); >> + zonelist_update_seq++; >> + /* Tell check_retry_zonelist() that we started rebuilding zonelists. */ >> + smp_wmb(); >> >> #ifdef CONFIG_NUMA >> memset(node_load, 0, sizeof(node_load)); > > Sebastian >