Date: Thu, 14 Oct 2021 13:10:31 +0800
From: Peter Xu <peterx@redhat.com>
To: Nadav Amit
Cc: Andrea Arcangeli, Linux-MM, LKML
Subject: Re: mm: unnecessary COW phenomenon

On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
> Andrea, Peter, others,

Hi, Nadav,

> I encountered many unnecessary COW operations on my development kernel
> (based on Linux 5.13), which I have not seen reported and am not sure
> how to solve. Advice would be appreciated.
>
> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the
> reuse of a page on a write-protect fault if page_count(page) != 1. In
> that case, wp_page_reuse() is not used and instead the page is COW'd
> by wp_page_copy(). wp_page_copy() is obviously much more expensive,
> not only because of the copying, but also because it requires a TLB
> flush and potentially a TLB shootdown.
>
> The scenario I encountered happens when I use userfaultfd, but
> presumably it might happen regardless of userfaultfd (perhaps with a
> swap device using SWP_SYNCHRONOUS_IO).
> It involves two page faults: one that maps a new anonymous page as
> read-only, and a second write-protect fault that happens shortly after
> on the same page. In this case the page count is almost always
> elevated, and therefore a COW is needed.
>
> [ The specific scenario that I have is as follows: I map a page into
> the monitored process using UFFDIO_COPY (actually a variant I am
> working on) as write-protected. Then, shortly after, a write access to
> the page triggers a page fault. The uffd monitor quickly resolves the
> page fault using UFFDIO_WRITEPROTECT. The kernel keeps the page
> write-protected in the page tables but marks it logically as
> uffd-unprotected, and the fault is retried. The retry triggers a
> COW. ]
>
> It turns out that the elevated page count is due to the caching of the
> page in the local LRU cache (by lru_cache_add(), which is called by
> lru_cache_add_inactive_or_unevictable() in the userfaultfd case).
> Since the first fault happened shortly before the second write-protect
> fault, the LRU cache had not yet been drained, so the page count was
> not decreased and a COW is needed.
>
> Calling lru_add_drain() during this flow resolves the issue most of
> the time. Obviously, it needs to be called on the core that allocated
> (i.e., faulted in) the page initially in order to work. It is possible
> to do it conditionally, only if the page count is greater than 1.
>
> My questions to you (if I may) are:
>
> 1. Am I missing something?

I agree with your analysis. I hadn't even noticed that lru_cache_add()
can make it very likely to trigger the COW in your uffd use case (and
also for swap), but that is indeed something that can happen with the
current page reuse logic in do_wp_page(), afaiu.

> 2. Should it happen in other cases, specifically SWP_SYNCHRONOUS_IO?
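The conditional drain described above could be sketched in kernel-style
C roughly as follows (a sketch only: the helper name is hypothetical,
and the real write-protect fault path in mm/memory.c is considerably
more involved):

```c
/*
 * Hypothetical helper for do_wp_page(): when the refcount is elevated,
 * drain this CPU's LRU pagevecs, so a page whose only extra reference
 * was the local LRU cache can be reused instead of COW'd.
 */
static bool try_reuse_after_lru_drain(struct page *page)
{
	if (page_count(page) > 1)
		lru_add_drain();	/* drains the local CPU's pagevecs only */

	/* Reuse is safe only if we now hold the sole reference. */
	return page_count(page) == 1;
}
```

As noted above, this only helps when the fault is handled on the same
CPU that faulted the page in, since lru_add_drain() drains only the
local per-CPU pagevecs.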
Frankly, I don't know why SWP_SYNCHRONOUS_IO matters here, as that seems
to me a flag to tell whether the swap device is fast on IO, so swapping
can be done synchronously and skip the swap cache. E.g., I think normal
swapping could have a similar issue too, as long as in do_swap_page()
the reuse_swap_page() call is either not triggered (which means it's a
read fault) or returned false (which means there's more than one
map+swap count).

> 3. Do you have a better solution?

What you suggested as "conditionally draining the LRU in the fault path"
seems okay, but that does look like yet another band-aid on the page
reuse logic..

Meanwhile, sorry, I don't have anything better in mind. Andrea proposed
the mapcount unshare solution [1] (I believe you should be aware of it
by now; it definitely needs some time to read if you didn't follow it
previously...) and that can definitely resolve this issue too. It's just
that upstream hasn't reached a consensus on it, so page reuse currently
stays dependent on the refcount rather than the mapcount.

[1] https://github.com/aagit/aa/tree/mapcount_unshare

Thanks,

-- 
Peter Xu