From: Frank van der Linden <fvdl@google.com>
Date: Fri, 7 Jun 2024 09:55:55 -0700
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
To: David Hildenbrand
Cc: Yu Zhao, Muchun Song, Matthew Wilcox, Jane Chu, Will Deacon,
 Nanyong Sun, Catalin Marinas, akpm@linux-foundation.org,
 anshuman.khandual@arm.com, wangkefeng.wang@huawei.com,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
 <20240207111252.GA22167@willie-the-truck>
 <20240207121125.GA22234@willie-the-truck>
 <908066c7-b749-4f95-b006-ce9b5bd1a909@oracle.com>
 <917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev>
I had an offline discussion with Yu on this, and he pointed out something
I hadn't realized: the x86 cmpxchg instruction always produces a write
cycle, even if it doesn't modify the data - it just writes back the
original data in that case. So, get_page_unless_zero() will always
produce a fault on RO-mapped page structures on x86.

Maybe this was obvious to other people, but I didn't see it explicitly
mentioned, so I figured I'd add the data point.

- Frank

On Thu, Jun 6, 2024 at 1:30 AM David Hildenbrand wrote:
>
> >> Additionally, we also should alter RO permission of those 7 tail pages
> >> to RW to avoid panic().
> >
> > We can use RCU, which IMO is a better choice, as the following:
> >
> > get_page_unless_zero()
> > {
> >         int rc = false;
> >
> >         rcu_read_lock();
> >
> >         if (page_is_fake_head(page) || !page_ref_count(page)) {
> >                 smp_mb(); // implied by atomic_add_unless()
> >                 goto unlock;
> >         }
> >
> >         rc = page_ref_add_unless();
> >
> > unlock:
> >         rcu_read_unlock();
> >
> >         return rc;
> > }
> >
> > And on the HVO/de-HVO sides:
> >
> >         folio_ref_unfreeze();
> >         synchronize_rcu();
> >         HVO/de-HVO;
> >
> > I think this is a lot better than making tail page metadata RW because:
> > 1. it helps debug, IMO, a lot;
> > 2. I don't think HVO is the only one that needs this.
> >
> > David (we missed you in today's THP meeting),
>
> Sorry, I had a private meeting conflict :)
>
> > Please correct me if I'm wrong -- I think virtio-mem also suffers from
> > the same problem when freeing offlined struct pages, since I wasn't
> > able to find anything that would prevent a **speculative** struct page
> > walker from trying to access struct pages belonging to pages being
> > concurrently offlined.
>
> virtio-mem does not currently optimize fake-offlined memory the way HVO
> would. So the only way we really remove "struct page" metadata is by
> actually offlining+removing a complete Linux memory block, like ordinary
> memory hotunplug would.
>
> It might be an interesting project to optimize "struct page" metadata
> consumption for fake-offlined memory chunks within an online Linux
> memory block.
>
> The biggest challenge might be the interaction with memory hotplug, which
> requires all "struct page" metadata to be allocated. So that would make
> cases where virtio-mem hot-plugs a Linux memory block but keeps parts of
> it fake-offline a bit more problematic to handle.
>
> In a world with memdescs, this might all be nicer to handle, I think :)
>
> There is one possible interaction between virtio-mem and speculative
> page references: all fake-offline chunks in a Linux memory block have,
> on each page, a refcount of 1 and PageOffline() set. When actually
> offlining the Linux memory block to remove it, virtio-mem will drop that
> reference during MEM_GOING_OFFLINE, such that memory offlining can
> proceed (seeing refcount==0 and PageOffline()).
>
> In virtio_mem_fake_offline_going_offline() we have:
>
>         if (WARN_ON(!page_ref_dec_and_test(page)))
>                 dump_page(page, "fake-offline page referenced");
>
> which would trigger on a speculative reference.
>
> We never saw that trigger so far because quite a long time must have
> passed since a page was last part of the page cache / page tables,
> before virtio-mem fake-offlined it (using alloc_contig_range()) and
> the Linux memory block actually gets offlined.
>
> But yes, RCU (e.g., on the memory offlining path) would likely be the
> right approach to make sure GUP-fast and the pagecache will no longer
> grab this page by accident.
>
> > If this is true, we might want to map a "zero struct page" rather than
> > leave a hole in the vmemmap when offlining pages. And the logic on the
> > hot removal side would be similar to that of HVO.
>
> Once virtio-mem does something like HVO, yes. Right now virtio-mem
> only removes struct-page metadata by removing/unplugging its owned Linux
> memory blocks once they are fully "logically offline".
>
> --
> Cheers,
>
> David / dhildenb
>
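[Editorial note: the cmpxchg behavior Frank describes is easy to observe
from user space. The following stand-alone sketch is not from the thread;
it assumes x86-64, a 4 KiB page size, and the GCC/Clang __sync builtins.
It issues a cmpxchg whose comparison is guaranteed to fail against a page
that has been made read-only, mimicking HVO's RO tail-page mapping. A
"pure read" model of a failed cmpxchg would finish silently; on x86,
LOCK CMPXCHG writes back the original value even on failure, so the
access faults.]

```c
/*
 * Hypothetical demo (not from the thread): a failing cmpxchg on an
 * RO page still faults on x86, because LOCK CMPXCHG always performs
 * a write cycle. LL/SC architectures may behave differently.
 */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

/* One page-aligned page of zero-initialized data. */
static unsigned long page[PAGE_SIZE / sizeof(unsigned long)]
        __attribute__((aligned(PAGE_SIZE)));

static void segv_handler(int sig)
{
        (void)sig;
        /* Async-signal-safe output: the failed cmpxchg still wrote. */
        static const char msg[] =
                "SIGSEGV: failing cmpxchg still drove a write cycle\n";
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);
        _exit(0);
}

int main(void)
{
        signal(SIGSEGV, segv_handler);

        /* Map the page read-only, like HVO's shared tail pages. */
        if (mprotect(page, PAGE_SIZE, PROT_READ))
                return 1;

        /*
         * page[0] is 0; compare against 1 so the exchange must fail.
         * No data would change, yet on x86 this faults anyway.
         */
        __sync_val_compare_and_swap(&page[0], 1UL, 2UL);

        printf("no fault: cmpxchg behaved like a read on failure\n");
        return 0;
}
```

This is the same access pattern as get_page_unless_zero(), whose
page_ref_add_unless() boils down to an atomic compare-and-exchange loop
on the refcount: even when the "unless" condition makes it a no-op, the
RO vmemmap mapping faults on x86.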