From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 45C11C48BF6 for ; Mon, 4 Mar 2024 21:58:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CA2F26B0083; Mon, 4 Mar 2024 16:58:07 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C52BB6B0088; Mon, 4 Mar 2024 16:58:07 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B1A8B6B0089; Mon, 4 Mar 2024 16:58:07 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 9BB876B0083 for ; Mon, 4 Mar 2024 16:58:07 -0500 (EST) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 64CD8A0881 for ; Mon, 4 Mar 2024 21:58:07 +0000 (UTC) X-FDA: 81860720214.13.F608607 Received: from mail-vs1-f52.google.com (mail-vs1-f52.google.com [209.85.217.52]) by imf12.hostedemail.com (Postfix) with ESMTP id 9605C40019 for ; Mon, 4 Mar 2024 21:58:05 +0000 (UTC) Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=XgrGIEoo; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf12.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.217.52 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709589485; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CqoBwlFjrEGlkxzScTIHxrhFsM9u6GTewCZPhOqLJRc=; b=nPK4w8Ye8Uxy/Ia4y2SXMAAmySzdVp4XfyEtwXVuVZ8v7Lr+FsgseECWkK7T5cO/aQ6wFN wy0yW3z95W2ttu6E42lPDaZ7lArNHdc/dKTU5enpPlk2kwIFLJE1ckbnf8k7Gw3EJpX9O9 wjxBl9N1Z3/mnwNFLrRZOunUTOw0DsE= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=XgrGIEoo; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf12.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.217.52 as permitted sender) smtp.mailfrom=21cnbao@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709589485; a=rsa-sha256; cv=none; b=lR04ewMetEScLZutaR7ceKsRjy3KFfUEDEvhTNhbMH9vWHmQGHdsdan1Xndv6dMwX92W/T xQvAFPfhcFq5J6R8slsb06XZHj5feERypkXTlMYg/6lKXDypP5So6aolCvmGxC/dJ89Fzz cH+Mg8OGCCwRt2LZYF++/u9xzYzp4E8= Received: by mail-vs1-f52.google.com with SMTP id ada2fe7eead31-4728f30090dso673466137.0 for ; Mon, 04 Mar 2024 13:58:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1709589484; x=1710194284; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=CqoBwlFjrEGlkxzScTIHxrhFsM9u6GTewCZPhOqLJRc=; b=XgrGIEoo31UhOZ9ahMyIolMxhTZz0521a6MR1b1Kbx2JMhbivkWUxeqgmua4DQ4niT xFyUig48HpDJhRfrVA6cn+Ldt7iURTFtnXfotYAHgey/uQ8e/TXCcg7AB+vIhQ6i2SN3 CGnRd60Q4CI7UQUB4t/Mh440fXd95+shuOyqmhg4NbvolUONttpvmteJ+KmyM+6HIFEk N+wV71j39BM2NaqfRPUkw3hwEJyr/GHQAwVrWbXZKr01F1ARLaOtBTSQ3zrfrkvcDanN CItphL0r2WZMZ+eRlPgy8Q6gtII0nJe1Xg6X512TkzjXihfYcqz/5YxC0C5LfU25t+v2 WcSA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709589484; x=1710194284; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=CqoBwlFjrEGlkxzScTIHxrhFsM9u6GTewCZPhOqLJRc=; b=CQzAnOunFJQQNO6nzDGpgyqK3cBKnWmxeLbKzTHwvQjVXi4/vjRqZMYC0T7+7gocuI wq3nlZovX/sF+be3MW+BHM366QpubJgwhh875H5SQ3fnzDgKv9k5SnslsrLRIdU0PYgl ktvx60AnaaU260ziLMMlY0YVURKlnYQuPwG5YYzr2fCL/ydGVAZAibNUaFZqQhwlPoi9 w6a7BVuaBJUs6q4XPpf9RGLCPq8heDjEBW7WqGnoKozFV9feo5vj1qxPEZHQ4SJCo1MU XPE2SrccH026TJpAjFGksPsDk/52myJvlkgVkBVy7sw0jNfMBknpvCIsear46FFZ19lU 7tUg== X-Forwarded-Encrypted: i=1; AJvYcCWMSlCkBnb/ag94H+ma+BQMH8rgODnoIZKHKZ3bzLjyxjcTwUMi+fEzG0pr7g0abpXgv006TwWzaQ7Rr71j/mlCt/k= X-Gm-Message-State: AOJu0YxOH0ELji+3RGnCiGS84QIeNGf1I1MZ9Q+t0KGe8wKhvr4jhZyQ 8s5xsDloMRibpXX3JaumY5jO06k/9tKsByv9Fged6x7iThpZ2LSXwEff7rJD8SzpjK/nTwt7EVa OxUO8BvA7bELvkdZOF+aDM4D5h7I= X-Google-Smtp-Source: AGHT+IFJylADP6V6ydrpu37OvmzYZGm3wMHZcijXhFmQ+Oi+Looal8ewdWIErgxOYA/rF/J9DUIr549t5Eh3HSk459c= X-Received: by 2002:a05:6102:d8c:b0:472:cc6a:5dd8 with SMTP id d12-20020a0561020d8c00b00472cc6a5dd8mr71679vst.3.1709589483176; Mon, 04 Mar 2024 13:58:03 -0800 (PST) MIME-Version: 1.0 References: <20240304103757.235352-1-21cnbao@gmail.com> <706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com> In-Reply-To: <706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com> From: Barry Song <21cnbao@gmail.com> Date: Tue, 5 Mar 2024 10:57:51 +1300 Message-ID: Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio To: Ryan Roberts Cc: akpm@linux-foundation.org, linux-mm@kvack.org, david@redhat.com, chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org, ying.huang@intel.com, xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, Barry Song , Hugh Dickins Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 9605C40019 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: ae486ryudw916hpbtmb9woxb935e4f74 X-HE-Tag: 1709589485-552604 X-HE-Meta: U2FsdGVkX1+OfNxFZDWApnPZK9laNxevgB9mO4OD08wTd9XO6967GQHhp8ngYjjuA6vH6+FJJb+xKcr5KU3iJX3YJj5vXJDdZk32pNs69YMqiHe/jBIkWVjuHM+7a/RwaRsOTnM/kM8Qe3hf0NHg5INWpdhu0/bmhzTixOHGFZiUge5WkM6U/EV74ktVL3gnmWi9tn4rh0IkTgCYcacIM30a4McY0Od8MwpPb15RIysk+Ew3cBUC/6zuSRTtiJSWG4JjJ6k0t8fsIiGU1BlhCqN2JlpHKowpMD41QAAxETNxjFx3vjPDRpalgxYaIW5n//44CtVmr8RRybOSEdeAmOPFOijfA8ugv0s5d26I7QsiU2XzC9hfZJZ78u5KQp7grkH4gY0XWRIU1usHBRg+bk3S9uKJt7hFEMbu6gW6mQK+I3vDxql0/1f1ZGEqqgtxr3s23uKYkNZMrA/NSk/mFH2KpFJUd457Rs+GwlpPLPS+Nc8ZEmP2YLpoPobodQrOLfiT/qk4cGNj7uiSdUsyTnds9aqbJ2SO6q0he4RKl6yrGsWaDCCOEqfuOhligvC5WK7mSGOREAXVgqsJeqGKaZ9Uw8mecuPKaxrVhzcmItssyNYbWjxbTdlEkX9+qTqc5Q4cUGpXYqoM5odEK1Yn77ZBeG3zXBeL5hF0RePrdw8a4OeZqPQp3rZoBE/bd5HgFWDj705P9E+8LWFQ5OBWhnkf1/Y5HcCko5e5ipHhrIkjt/geRXrDkxv46cZPFRkglRKJygLJTlkI6Rdp1xfcijrBeIvXqqaXNQyOb3XUpO4AVbdGui5Rsy2zGExDUbQQFFKqs7xn3x7dVkAV//oQCwBHmd0088D4urgj/QptoDtUIuLDCnzVZ9dgYF8ECtIJCWbTe3cQfJOjHTRuWli1CNtRzMehUWPLc8KunFDLE9ERpOnDyJzEgD2X/44eWH+YMyvQ8eRrg7nPwrAVYM7 Q5wQ64KO AQJSNkcJ4pkJQ0Y1xEV0wgWWCZmQt/BSzDwK59tCGIwjpufdR3NxIAWnIKwkMZIxhsLB3J1+sD8TyOTNGI/SrTctVstWxXHYrCj6xqLu/Knl9fx9qgJDJX4rbXvdBhFRcAJmoDqwX8129kdQdImvTQ37Se91gOspkKHdr9ibeIqx8CDFGIO75+y0T8z+ICyXPUIBNyRA3uG4MN5gKuI4Ox2rqwsHi6TlX7XRAU4mwnXIawDVT8P7FbwxGSkz2sCsbgzVUhd4/7ZAojkvdzWFsqdn2qbGDuf+DWRGtTJo+r5mdHZOJR7bXh5LhjxkhwBKgrJnNh2lcIQkJElzMsQissED/+ewaST06lr/wMurbWvelDZeHnTFr3Y+AYtD8GpvRELIcFTrqWtW8o96Kd8fiLyqHW2FFiJBr5PdWG4SlE+2IXAYDoq0ia9denOPXMGJCHf6zxz7QzexZWOTHh0eIk3djLjps0pGjNqRUZCQhNv0fdaL/3yNUVkXWVQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Mar 5, 2024 at 1:21=E2=80=AFAM Ryan Roberts = wrote: > > Hi Barry, > > On 04/03/2024 10:37, Barry Song wrote: > > From: Barry Song > > > > page_vma_mapped_walk() within try_to_unmap_one() races with other > > PTEs modification such as break-before-make, while iterating PTEs > > of a large folio, it will only begin to acquire PTL after it gets > > a valid(present) PTE. break-before-make intermediately sets PTEs > > to pte_none. Thus, a large folio's PTEs might be partially skipped > > in try_to_unmap_one(). > > I just want to check my understanding here - I think the problem occurs f= or > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folio= s? Now > that I've had a look at the code and have a better understanding, I think= that > must be the case? And therefore this problem exists independently of my w= ork to > support swap-out of mTHP? (From your previous report I was under the impr= ession > that it only affected mTHP). I think this affects all large folios with PTEs entries more than 1. but hu= geTLB is handled as a whole in try_to_unmap_one and its rmap is removed all together, i feel hugeTLB doesn't have this problem. > > Its just that the problem is becoming more pronounced because with mTHP, > PTE-mapped large folios are much more common? right. as now large folios become a more common case, and it is my case running in millions of phones. BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim all PTEs all together rather than iterating PTEs one by one. This will impr= ove performance. for example, a batched set_ptes_to_swap_entries() { } then we only need to loop once for a large folio, right now we are looping nr_pages times. > > > For example, for an anon folio, after try_to_unmap_one(), we may > > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries. > > So folio will be still mapped, the folio fails to be reclaimed. > > What=E2=80=99s even more worrying is, its PTEs are no longer in a unifi= ed > > state. This might lead to accident folio_split() afterwards. And > > since a part of PTEs are now swap entries, accessing them will > > incur page fault - do_swap_page. > > It creates both anxiety and more expense. While we can't avoid > > userspace's unmap to break up unified PTEs such as CONT-PTE for > > a large folio, we can indeed keep away from kernel's breaking up > > them due to its code design. > > This patch is holding PTL from PTE0, thus, the folio will either > > be entirely reclaimed or entirely kept. On the other hand, this > > approach doesn't increase PTL contention. Even w/o the patch, > > page_vma_mapped_walk() will always get PTL after it sometimes > > skips one or two PTEs because intermediate break-before-makes > > are short, according to test. Of course, even w/o this patch, > > the vast majority of try_to_unmap_one still can get PTL from > > PTE0. This patch makes the number 100%. > > The other option is that we can give up in try_to_unmap_one > > once we find PTE0 is not the first entry we get PTL, we call > > page_vma_mapped_walk_done() to end the iteration at this case. > > This will keep the unified PTEs while the folio isn't reclaimed. > > The result is quite similar with small folios with one PTE - > > either entirely reclaimed or entirely kept. > > Reclaiming large folios by holding PTL from PTE0 seems a better > > option comparing to giving up after detecting PTL begins from > > non-PTE0. > > > > Cc: Hugh Dickins > > Signed-off-by: Barry Song > > Do we need a Fixes tag? I don't feel a strong need for this as this doesn't cause a crash, memory leak or whatever serious. > > > --- > > mm/vmscan.c | 11 +++++++++++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 0b888a2afa58..e4722fbbcd0c 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct lis= t_head *folio_list, > > > > if (folio_test_pmd_mappable(folio)) > > flags |=3D TTU_SPLIT_HUGE_PMD; > > + /* > > + * if page table lock is not held from the first = PTE of > > + * a large folio, some PTEs might be skipped beca= use of > > + * races with break-before-make, for example, PTE= s can > > + * be pte_none intermediately, thus one or more P= TEs > > + * might be skipped in try_to_unmap_one, we might= result > > + * in a large folio is partially mapped and parti= ally > > + * unmapped after try_to_unmap > > + */ > > + if (folio_test_large(folio)) > > + flags |=3D TTU_SYNC; > > This looks sensible to me after thinking about it for a while. But I also= have a > gut feeling that there might be some more subtleties that are going over = my > head, since I'm not expert in this area. So will leave others to provide = R-b :) > ok, thanks :-) > Thanks, > Ryan > > > > > try_to_unmap(folio, flags); > > if (folio_mapped(folio)) { > Thanks Barry