Date: Thu, 14 Oct 2021 13:10:31 +0800
From: Peter Xu <peterx@redhat.com>
To: Nadav Amit
Cc: Andrea Arcangeli, Linux-MM, LKML
Subject: Re: mm: unnecessary COW phenomenon

On Wed, Oct 13, 2021 at 03:42:08PM -0700, Nadav Amit wrote:
> Andrea, Peter, others,

Hi, Nadav,

> I encountered many unnecessary COW operations on my development kernel
> (based on Linux 5.13), which I have not seen reported and am not sure
> how to solve. Advice would be appreciated.
>
> Commit 09854ba94c6aa ("mm: do_wp_page() simplification") prevents the
> reuse of a page on a write-protect fault if page_count(page) != 1. In
> that case, wp_page_reuse() is not used and instead the page is COW'd
> by wp_page_copy(). wp_page_copy() is obviously much more expensive,
> not only because of the copying, but also because it requires a TLB
> flush and potentially a TLB shootdown.
>
> The scenario I encountered happens when I use userfaultfd, but
> presumably it might happen regardless of userfaultfd (perhaps with a
> swap device using SWP_SYNCHRONOUS_IO).
> It involves two page faults: one that maps a new anonymous page as
> read-only, and a second write-protect fault that happens shortly after
> on the same page. In this case the page count is almost always
> elevated, and therefore a COW is needed.
>
> [ The specific scenario that I have is as follows: I map a page into
> the monitored process using UFFDIO_COPY (actually a variant I am
> working on) as write-protected. Then, shortly after, a write access to
> the page triggers a page fault. The uffd monitor quickly resolves the
> page fault using UFFDIO_WRITEPROTECT. The kernel keeps the page
> write-protected in the page tables but marks it logically as
> uffd-unprotected, and the fault is retried. The retry triggers a
> COW. ]
>
> It turns out that the elevated page count is due to the caching of the
> page in the local LRU cache (by lru_cache_add(), which is called by
> lru_cache_add_inactive_or_unevictable() in the userfaultfd case).
> Since the first fault happened shortly before the second write-protect
> fault, the LRU cache had not yet been drained, so the page count was
> not decreased and a COW is needed.
>
> Calling lru_add_drain() during this flow resolves the issue most of
> the time. Obviously, it needs to be called on the core that allocated
> (i.e., faulted in) the page initially in order to work. It is possible
> to do it conditionally, only if the page count is greater than 1.
>
> My questions to you (if I may) are:
>
> 1. Am I missing something?

I agree with your analysis. I hadn't even noticed that lru_cache_add()
can make it very likely to trigger the COW in your uffd use case (and
also for swap), but that is indeed something that can happen with the
current page reuse logic in do_wp_page(), afaiu.

> 2. Should it happen in other cases, specifically SWP_SYNCHRONOUS_IO?
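The conditional drain described above could be sketched in kernel-style
C roughly as follows (a sketch only: the helper name is hypothetical,
and the real write-protect fault path in mm/memory.c is considerably
more involved):

```c
/*
 * Hypothetical helper for do_wp_page(): when the refcount is elevated,
 * drain this CPU's LRU pagevecs, so a page whose only extra reference
 * was the local LRU cache can be reused instead of COW'd.
 */
static bool try_reuse_after_lru_drain(struct page *page)
{
	if (page_count(page) > 1)
		lru_add_drain();	/* drains the local CPU's pagevecs only */

	/* Reuse is safe only if we now hold the sole reference. */
	return page_count(page) == 1;
}
```

As noted above, this only helps when the fault is handled on the same
CPU that faulted the page in, since lru_add_drain() drains only the
local per-CPU pagevecs.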
Frankly, I don't know why SWP_SYNCHRONOUS_IO matters here, as that seems
to me a flag to tell whether the swap device is fast on IO, so swapping
can be done synchronously and skip the swap cache. E.g., I think normal
swapping could have a similar issue too, as long as in do_swap_page()
the reuse_swap_page() call is either not triggered (which means it's a
read fault) or returned false (which means there's more than one
map+swap count).

> 3. Do you have a better solution?

What you suggested as "conditionally draining the LRU in the fault path"
seems okay, but that does look like yet another band-aid on the page
reuse logic..

Meanwhile, sorry, I don't have anything better in mind. Andrea proposed
the mapcount unshare solution [1] (I believe you should be aware of it
by now; it definitely needs some time to read if you didn't follow it
previously...) and that can definitely resolve this issue too. It's just
that upstream hasn't reached a consensus on it, so page reuse currently
stays dependent on the refcount rather than the mapcount.

[1] https://github.com/aagit/aa/tree/mapcount_unshare

Thanks,

-- 
Peter Xu