From: Pavel Tatashin
Date: Thu, 3 Sep 2020 14:07:18 -0400
Subject: Re: [PATCH v2] mm/memory_hotplug: drain per-cpu pages again during memory offline
To: David Hildenbrand
Cc: LKML, Andrew Morton, Michal Hocko, linux-mm, Oscar Salvador, Wei Yang, Vlastimil Babka, rientjes@google.com
In-Reply-To: <6ec66eb9-eeba-5076-af97-cef59ed5cbaa@redhat.com>
References: <20200903140032.380431-1-pasha.tatashin@soleen.com> <6ec66eb9-eeba-5076-af97-cef59ed5cbaa@redhat.com>

On Thu, Sep 3, 2020 at 1:36 PM David Hildenbrand wrote:
>
> On 03.09.20 16:00, Pavel Tatashin wrote:
> > There is a race during page offline that can lead to an infinite loop:
> > a page never ends up on a buddy list and __offline_pages() keeps
> > retrying infinitely or until a termination signal is received.
> >
> > Thread#1 - a new process:
> >
> > load_elf_binary
> >  begin_new_exec
> >   exec_mmap
> >    mmput
> >     exit_mmap
> >      tlb_finish_mmu
> >       tlb_flush_mmu
> >        release_pages
> >         free_unref_page_list
> >          free_unref_page_prepare
> >           set_pcppage_migratetype(page, migratetype);
> >           // Set page->index migration type below MIGRATE_PCPTYPES
> >
> > Thread#2 - hot-removes memory
> > __offline_pages
> >  start_isolate_page_range
> >   set_migratetype_isolate
> >    set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> >    // Set migration type to MIGRATE_ISOLATE
> >    drain_all_pages(zone);
> >    // drain per-cpu page lists to buddy allocator.
> >
> > Thread#1 - continue
> >          free_unref_page_commit
> >            migratetype = get_pcppage_migratetype(page);
> >            // get old migration type
> >            list_add(&page->lru, &pcp->lists[migratetype]);
> >            // add new page to already drained pcp list
> >
> > Thread#2
> > Never drains pcp again, and therefore gets stuck in the loop.
> >
> > The fix is to try to drain per-cpu lists again after
> > check_pages_isolated_cb() fails.
> >
> > Fixes: c52e75935f8d ("mm: remove extra drain pages on pcp list")
> >
> > Signed-off-by: Pavel Tatashin
> > Cc: stable@vger.kernel.org
> > Acked-by: David Rientjes
> > Acked-by: Vlastimil Babka
> > ---
> >  mm/memory_hotplug.c | 14 ++++++++++++++
> >  mm/page_isolation.c |  8 ++++++++
> >  2 files changed, 22 insertions(+)
> >
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index e9d5ab5d3ca0..b11a269e2356 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1575,6 +1575,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
> >  		/* check again */
> >  		ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >  					    NULL, check_pages_isolated_cb);
> > +		/*
> > +		 * per-cpu pages are drained in start_isolate_page_range, but if
> > +		 * there are still pages that are not free, make sure that we
> > +		 * drain again, because when we isolated range we might
> > +		 * have raced with another thread that was adding pages to pcp
> > +		 * list.
> > +		 *
> > +		 * Forward progress should be still guaranteed because
> > +		 * pages on the pcp list can only belong to MOVABLE_ZONE
> > +		 * because has_unmovable_pages explicitly checks for
> > +		 * PageBuddy on freed pages on other zones.
> > +		 */
> > +		if (ret)
> > +			drain_all_pages(zone);
> >  	} while (ret);
> >
> >  	/* Ok, all of our target is isolated.
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index 242c03121d73..63a3db10a8c0 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -170,6 +170,14 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
> >   * pageblocks we may have modified and return -EBUSY to caller. This
> >   * prevents two threads from simultaneously working on overlapping ranges.
> >   *
> > + * Please note that there is no strong synchronization with the page allocator
> > + * either. Pages might be freed while their page blocks are marked ISOLATED.
> > + * In some cases pages might still end up on pcp lists and that would allow
> > + * for their allocation even when they are in fact isolated already. Depending
> > + * on how strong of a guarantee the caller needs, drain_all_pages might be needed
> > + * (e.g. __offline_pages will need to call it after check for isolated range for
> > + * a next retry).
> > + *
> >   * Return: the number of isolated pageblocks on success and -EBUSY if any part
> >   * of range cannot be isolated.
> >   */
>
> (still on vacation, back next week on Tuesday)
>
> I didn't look into discussions in v1, but to me this looks like we are
> trying to hide an actual bug by implementing hacks in the caller

Hi David,

Please read the discussion in v1 [1]; some of the things that you
brought up were covered there.

> (repeated calls to drain_all_pages()). What about alloc_contig_range()
> users - you get more allocation errors just because PCP code doesn't
> play along.

Yes, I looked at that, and it sounds normal to me: on an allocation
error the caller can try again if needed; at least we do not get stuck
in an infinite loop like here.

>
> There *is* strong synchronization with the page allocator - however,
> there seems to be one corner case race where we allow to allocate pages
> from isolated pageblocks.
>
> I want that fixed instead if possible, otherwise this is just an ugly
> hack to make the obvious symptoms (offlining looping forever) disappear.

The fix would be to synchronize the isolation code with the release
code, which would slow down the fastpath of the release code instead of
the rare page-offline code.

>
> If that is not possible easily, I'd much rather want to see all
> drain_all_pages() calls being moved to the caller and have the expected
> behavior documented instead of specifying "there is no strong
> synchronization with the page allocator" - which is wrong in all but PCP
> cases (and there only in one possible race?).
>
> I do wonder why we hit this issue now and not before - I suspect
> something in PCP code changed that made this race possible.

I suspect it is because memory hot-remove was never stressed enough. We
have a special use case where we do not hot-remove on every shutdown,
and the repro rate was only one in ~400 reboots. This bug and other
deadlocks [2], [3] were tough to root cause partially because of such a
slow repro rate.

Pasha

[1] https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com
[2] https://lore.kernel.org/lkml/CA+CK2bCQcnTpzq2wGFa3D50PtKwBoWbDBm56S9y8c+j+pD+KSw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+CK2bBEHFuLLg79_h6bv4Vey+B0B2YXyBxTBa=Le12OKbNdwA@mail.gmail.com/