From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <11f91089-1958-c7eb-126f-af32130d9f8a@redhat.com>
Date: Tue, 23 Aug 2022 08:36:34 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0
Subject: Re: Race condition in build_all_zonelists() when offlining movable zone
To: Mel Gorman, Michal Hocko
Cc: Patrick Daly, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, Juergen Gross
References: <20220817034250.GB2473@hu-pdaly-lv.qualcomm.com> <20220817104028.uin7cmkb4qlpgfbi@suse.de>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
In-Reply-To: <20220817104028.uin7cmkb4qlpgfbi@suse.de>
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
On 17.08.22 12:40, Mel Gorman wrote:
> On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
>>> In order to address that, we should either have to call first_zones_zonelist
>>> inside get_page_from_freelist if the zoneref doesn't correspond to a
>>> real zone in the zonelist or we should revisit my older approach
>>> referenced above.
>>
>> Would this work? It is not really great to pay an overhead for unlikely
>> event in the hot path but we might use a similar trick to check_retry_cpuset
>> in the slowpath to detect this situation.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b0bcab50f0a3..bce786d7fcb4 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
>>  	 */
>>  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
>> +
>> +	/*
>> +	 * A race with memory offlining could alter zones on the zonelist
>> +	 * e.g. dropping the top (movable) zone if it gets unpoppulated
>> +	 * and so preferred_zoneref is not valid anymore
>> +	 */
>> +	if (unlikely(!ac->preferred_zoneref->zone))
>> +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>> +				ac->highest_zoneidx, ac->nodemask);
>>  	z = ac->preferred_zoneref;
>> +
> 
> ac->preferred_zoneref->zone could still be a valid pointer to a zone,
> but an empty one so that would imply
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..38ce123af543 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
> 
> +	/* Hotplug could have drained the preferred zone. */
> +	if (!populated_zone(ac->preferred_zoneref->zone))
> +		goto retry_cpuset;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> 
> But even that is fragile. If there were multiple zones in the zonelist
> and the preferred zone was further down the list, the zone could still
> be populated but a different zone than expected. It may be better to have
> the same type of seq counter that restarts the allocation attempt if the
> zonelist changes.
> 
> So.... this? It is seqcount only with a basic lock as there already is a
> full lock on the writer side and it would appear to be overkill to protect
> the reader side with read_seqbegin_or_lock as it complicates the writer side.
> 
> (boot tested only)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..158954b10724 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,22 @@ void fs_reclaim_release(gfp_t gfp_mask)
>  EXPORT_SYMBOL_GPL(fs_reclaim_release);
>  #endif
> 
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries.
> + */
> +static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq);
> +
> +static unsigned int zonelist_update_begin(void)
> +{
> +	return read_seqcount_begin(&zonelist_update_seq);
> +}
> +
> +static unsigned int zonelist_update_retry(unsigned int seq)
> +{
> +	return read_seqcount_retry(&zonelist_update_seq, seq);
> +}
> +
>  /* Perform direct synchronous page reclaim */
>  static unsigned long
>  __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5017,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int compaction_retries;
>  	int no_progress_loops;
>  	unsigned int cpuset_mems_cookie;
> +	unsigned int zonelist_update_cookie;
>  	int reserve_flags;
> 
>  	/*
> @@ -5016,6 +5033,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	no_progress_loops = 0;
>  	compact_priority = DEF_COMPACT_PRIORITY;
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> +	zonelist_update_cookie = zonelist_update_begin();
> 
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
> @@ -5191,6 +5209,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
> 
> +	if (zonelist_update_retry(zonelist_update_cookie))
> +		goto retry_cpuset;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
>  	static DEFINE_SPINLOCK(lock);
> 
>  	spin_lock(&lock);
> +	write_seqcount_begin(&zonelist_update_seq);
> 
>  #ifdef CONFIG_NUMA
>  	memset(node_load, 0, sizeof(node_load));
> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
>  #endif
>  	}
> 
> +	write_seqcount_end(&zonelist_update_seq);
>  	spin_unlock(&lock);

Do we want to get rid of the static lock by using a seqlock_t instead of
a seqcount_t?

-- 
Thanks,

David / dhildenb
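For readers following along: the retry scheme in Mel's patch can be sketched in plain userspace C. Everything below is illustrative only (the `sketch_*` names are made up, and the kernel's real `seqcount_t` additionally handles SMP memory ordering, preemption, and lockdep); it shows just the core idea that a reader snapshots a generation counter, and retries if a writer bumped it in the meantime:

```c
#include <stdatomic.h>

/*
 * Userspace sketch of the seqcount retry pattern. The counter is
 * incremented once when a writer begins (making it odd) and once
 * when it ends (making it even again), so readers can detect both
 * in-progress and completed rebuilds.
 */
static atomic_uint zonelist_seq;

/* Reader side: snapshot the sequence before walking the "zonelist". */
static unsigned int sketch_update_begin(void)
{
	unsigned int seq;

	/* Wait out a writer that is mid-update (odd count). */
	while ((seq = atomic_load(&zonelist_seq)) & 1)
		;
	return seq;
}

/* Reader side: nonzero if a writer ran since sketch_update_begin(). */
static int sketch_update_retry(unsigned int seq)
{
	return atomic_load(&zonelist_seq) != seq;
}

/* Writer side: bracket the rebuild; the count is odd while in progress. */
static void sketch_write_begin(void)
{
	atomic_fetch_add(&zonelist_seq, 1);
}

static void sketch_write_end(void)
{
	atomic_fetch_add(&zonelist_seq, 1);
}
```

A reader uses it exactly like `zonelist_update_begin()`/`zonelist_update_retry()` in the patch: take a cookie before the allocation attempt, and jump back to the retry label if the cookie is stale at the end. Note that, as in the patch, the writer still needs its own mutual exclusion (the spinlock); the sequence counter only tells readers that *something* changed, which is what motivates the seqlock_t question above, since seqlock_t bundles the lock and the counter into one object.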