References: <20200127173453.2089565-1-guro@fb.com>
 <20200130020626.GA21973@in.ibm.com>
 <20200130024135.GA14994@xps.DHCP.thefacebook.com>
 <20200813000416.GA1592467@carbon.dhcp.thefacebook.com>
 <6469324e-afa2-18b4-81fb-9e96466c1bf3@suse.cz>
 <20200902112624.GC4617@dhcp22.suse.cz>
In-Reply-To: <20200902112624.GC4617@dhcp22.suse.cz>
From: Pavel Tatashin
Date: Wed, 2 Sep 2020 08:51:06 -0400
Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller
To: Michal Hocko
Cc: Vlastimil Babka, Roman Gushchin, Bharata B Rao,
 "linux-mm@kvack.org", Andrew Morton, Johannes Weiner, Shakeel Butt,
 Vladimir Davydov, "linux-kernel@vger.kernel.org", Kernel Team,
 Yafang Shao, stable, Linus Torvalds, Sasha Levin, Greg Kroah-Hartman,
 David Hildenbrand

> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                       NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
>

Hi Michal,

> This shouldn't really happen. What does prevent from this to proceed?
> Did you manage to catch the specific pfn and what is it used for?

I did.

> start_isolate_page_range and scan_movable_pages should fail if there is
> any memory that cannot be migrated permanently. This is something that
> we should focus on when debugging.

I was hitting this issue:
mm/memory_hotplug: drain per-cpu pages again during memory offline
https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

Once the pcp drain race is fixed, this particular deadlock becomes
irrelevant.

The lock ordering cgroup_mutex -> mem_hotplug_lock, however, is bad,
and the first race condition that I was hitting and described above is
still present. For now I have added a temporary workaround: saving the
core to a file instead of piping it during shutdown. I am glad that
mainline is fixed, but the stable trees should also get some kind of
fix for this problem.

Pasha
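
For illustration only, here is a minimal userspace sketch of the retry
loop quoted at the top of this message. It is not kernel code. The
struct offline_model type and every model_* function are hypothetical
stand-ins, and the -EINTR return on a signal and the max_rounds bound
are assumptions made so that the model halts. It shows why the loop
keeps spinning when the scan finds nothing to migrate while the
isolation check still reports -EBUSY, and how a signal ends it.

```c
#include <errno.h>

/* Hypothetical model of the range being offlined (not a kernel type). */
struct offline_model {
	int busy_rounds;     /* rounds for which the isolation check still fails */
	int signal_at_round; /* round at which a signal arrives; negative = never */
};

/* Stand-in for scan_movable_pages(): 0 means nothing is left to migrate. */
static unsigned long model_scan_movable(const struct offline_model *m)
{
	(void)m;
	return 0;
}

/* Stand-in for the isolation check: -EBUSY while some PFN is still in use. */
static int model_check_isolated(struct offline_model *m)
{
	if (m->busy_rounds > 0) {
		m->busy_rounds--;
		return -EBUSY;
	}
	return 0;
}

/*
 * Model of the retry loop. Returns 0 once the range is isolated, -EINTR
 * if the (assumed) signal arrives first, or -EBUSY if max_rounds is
 * exhausted. max_rounds exists only so the model always terminates.
 */
static int model_offline_pages(struct offline_model *m, int max_rounds,
			       int *rounds_out)
{
	int round = 0;
	int ret;

	do {
		if (m->signal_at_round >= 0 && round >= m->signal_at_round) {
			ret = -EINTR; /* the only exit while pages stay busy */
			break;
		}
		if (round >= max_rounds) {
			ret = -EBUSY;
			break;
		}

		/* Nothing movable is found, so no migration is attempted... */
		unsigned long pfn = model_scan_movable(m);
		(void)pfn;

		/* ...yet the range is still not fully isolated, so retry. */
		ret = model_check_isolated(m);
		round++;
	} while (ret);

	*rounds_out = round;
	return ret;
}
```

In this model, a page that never becomes free (a large busy_rounds)
keeps the loop alive until signal_at_round is reached, which mirrors
the systemd timeout described in the quoted text.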