From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=yEt/=C3=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-9.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id F2841C43463
	for <linux-mm@archiver.kernel.org>; Fri, 18 Sep 2020 20:40:58 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 3BCD7208DB
	for <linux-mm@archiver.kernel.org>; Fri, 18 Sep 2020 20:40:58 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="E25TJAGg"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3BCD7208DB
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id AFC248E0001; Fri, 18 Sep 2020 16:40:57 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id AAC2C6B0095; Fri, 18 Sep 2020 16:40:57 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 94D338E0001; Fri, 18 Sep 2020 16:40:57 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0044.hostedemail.com [216.40.44.44])
	by kanga.kvack.org (Postfix) with ESMTP id 7A7E46B0093
	for <linux-mm@kvack.org>; Fri, 18 Sep 2020 16:40:57 -0400 (EDT)
Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id 3AD71180AD804
	for <linux-mm@kvack.org>; Fri, 18 Sep 2020 20:40:57 +0000 (UTC)
X-FDA: 77277351354.10.ear78_3412a1c2712e
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin10.hostedemail.com (Postfix) with ESMTP id 1983516A0DD
	for <linux-mm@kvack.org>; Fri, 18 Sep 2020 20:40:57 +0000 (UTC)
X-HE-Tag: ear78_3412a1c2712e
X-Filterd-Recvd-Size: 10515
Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120])
	by imf24.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Fri, 18 Sep 2020 20:40:56 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1600461656;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=Y3KTlzeMxbkN1J9iZRU481kw+vClX1N3yaY5uZmjOiU=;
	b=E25TJAGgC3YeMJoF53ZTnmodDxGMXARKN/SYJLyaOru7rO2w8lKwhgGtkEX8NhXciIyXIk
	7nvS5OcVM5ewH1t21bMe4J7jJZjiTsntk3lF/dmEdnbZneV1s54NOmJZdE8QRuPnQ4igZu
	EswquS9l5W7qTMb1qqGDaxjqzaqi81I=
Received: from mail-qk1-f200.google.com (mail-qk1-f200.google.com
 [209.85.222.200]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-236-5Ms4soPXMdW5-v1LDxUZqA-1; Fri, 18 Sep 2020 16:40:51 -0400
X-MC-Unique: 5Ms4soPXMdW5-v1LDxUZqA-1
Received: by mail-qk1-f200.google.com with SMTP id o28so5596246qkm.23
        for <linux-mm@kvack.org>; Fri, 18 Sep 2020 13:40:51 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=Y3KTlzeMxbkN1J9iZRU481kw+vClX1N3yaY5uZmjOiU=;
        b=ZsrI1QY08XtinTPkfPhnPfiyYz06K4VJ0gJPuZecF5isxWBrfwa9ZlWCBOW52TtlME
         4d7ltjX4ELeaBr/kXtwbk8oAD57V/ZXuQOmF1CMFpVvDQhI6lLGzLMSeoX5uH0d4QtIz
         UwFNv3X/tR+5SEw/kr841lxQos0FqhFwT+mh+PFmQ9FxY1hgpF4C8iY+UcQsDFEXmA7U
         FMuKEcsj2eNveZUxE+V3XKwnV1rtwFvlHiR6t/dewAFARGswPCQk/tlbqEcFX0oFusOE
         3teIrM+oNTtiQspKAZIcucxQaLEHdSVJf4gr0lESCcadwOzp/MUPO/Q5Fc0zLnO1xivp
         gRDA==
X-Gm-Message-State: AOAM530lz2Fs/gy26zU6SNAOTvX44QHzM3cwL283O4zwPAnmGPisvP76
	LECtWluLtmNbd9cDW6srXAcVx0xy/gu4WOB2DfSLfnBxaY3X+rcM2/GgrBLSnWvetUCm/iw2VXN
	n9AfHkrBpNdg=
X-Received: by 2002:a05:6214:1873:: with SMTP id eh19mr675817qvb.16.1600461651232;
        Fri, 18 Sep 2020 13:40:51 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJy3iutpyLjf91kqStVUiq6iFQBBnNS8rPmqK2B51dRwo0/lAH8cLr6yiWJHX0b46FDiDeG1iw==
X-Received: by 2002:a05:6214:1873:: with SMTP id eh19mr675782qvb.16.1600461650816;
        Fri, 18 Sep 2020 13:40:50 -0700 (PDT)
Received: from xz-x1 (bras-vprn-toroon474qw-lp130-11-70-53-122-15.dsl.bell.ca. [70.53.122.15])
        by smtp.gmail.com with ESMTPSA id 202sm2832821qkg.56.2020.09.18.13.40.49
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Fri, 18 Sep 2020 13:40:50 -0700 (PDT)
Date: Fri, 18 Sep 2020 16:40:48 -0400
From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>, Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"Maya B . Gokhale" <gokhale2@llnl.gov>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Marty Mcfadden <mcfadden8@llnl.gov>,
	Kirill Shutemov <kirill@shutemov.name>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Jan Kara <jack@suse.cz>, Kirill Tkhai <ktkhai@virtuozzo.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Christoph Hellwig <hch@lst.de>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
Message-ID: <20200918204048.GC5962@xz-x1>
References: <20200915232238.GO1221970@ziepe.ca>
 <e6c352f8-7ee9-0702-10a4-122d2c4422fc@nvidia.com>
 <20200916174804.GC8409@ziepe.ca>
 <20200916184619.GB40154@xz-x1>
 <20200917112538.GD8409@ziepe.ca>
 <CAHk-=wjtfjB3TqTFRzVmOrB9Mii6Yzc-=wKq0fu4ruDE6AsJgg@mail.gmail.com>
 <20200917193824.GL8409@ziepe.ca>
 <CAHk-=wiY_g+SSjncZi8sO=LrxXmMox0NO7K34-Fs653XVXheGg@mail.gmail.com>
 <20200918164032.GA5962@xz-x1>
 <20200918173240.GY8409@ziepe.ca>
MIME-Version: 1.0
In-Reply-To: <20200918173240.GY8409@ziepe.ca>
Authentication-Results: relay.mimecast.com;
	auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=peterx@redhat.com
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Sep 18, 2020 at 02:32:40PM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 18, 2020 at 12:40:32PM -0400, Peter Xu wrote:
> 
> > Firstly in the draft patch mm->has_pinned is introduced and it's written to 1
> > as long as FOLL_GUP is called once.  It's never reset after set.
> 
> Worth thinking about also adding FOLL_LONGTERM here, at least as long
> as it is not a counter. That further limits the impact.

But theoretically we should also trigger COW here for pages even with PIN &&
!LONGTERM, am I right?  Assuming that FOLL_PIN is already a corner case.

> > One issue is when we charge for cgroup we probably can't do that onto the new
> > mm/task, since copy_namespaces() is called after copy_mm().  I don't know
> > enough about cgroup, I thought the child will inherit the parent's, but I'm not
> > sure.  Or, can we change that order of copy_namespaces() && copy_mm()?  I don't
> > see a problem so far but I'd like to ask first..
> 
> Know nothing about cgroups, but I would have guessed that the page
> table allocations would want to be in the cgroup too, is the struct
> page a different bucket?

Good question...  I feel like this kind of accounting was always done to
"current" via alloc_page().  But frankly speaking I don't know whether I
understand it right, because afaict "current" is the parent during fork(),
while I feel like it would make more sense if it were accounted to the child
process.  I think I must have missed something important but I can't tell..

> 
> > The other thing is on how to fail.  E.g., when COW failed due to either
> > charging of cgroup or ENOMEM, ideally we should fail fork() too.  Though that
> > might need more changes - current patch silently kept the shared page for
> > simplicity.
> 
> I didn't notice anything tricky here.. Something a bit gross but
> simple seemed workable:
> 
> @@ -852,7 +852,7 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  			continue;
>  		}
>  		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> -							vma, addr, rss);
> +							vma, addr, rss, &err);
>  		if (entry.val)
>  			break;
>  		progress += 8;
> @@ -865,6 +865,9 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	pte_unmap_unlock(orig_dst_pte, dst_ptl);
>  	cond_resched();
>  
> +	if (err)
> +		return err;
> +
>  	if (entry.val) {
>  		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
>  			return -ENOMEM;
> 
> It is not really any different from add_swap_count_continuation()
> failure, which already works..

Yes it's not pretty, but I do plan to use something like this to avoid touching
all the return paths in copy_one_pte(), and I think the answer to the last
question matters too, below.
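
To make the shape of that error path concrete, here is a minimal userspace
sketch of the "out-parameter" pattern from the diff above.  It is only a model:
copy_one() and copy_range() are hypothetical stand-ins for copy_one_pte() and
copy_pte_range(), negative input simulates a failed early copy, and breaking
out of the loop on err is an assumption that the quoted diff does not show.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for copy_one_pte().  The return value keeps its
 * existing meaning (nonzero == "swap entry that needs a continuation"),
 * while hard failures are reported through *err.
 */
static unsigned long copy_one(int src, int *dst, int *err)
{
	if (src < 0) {
		/* Simulate a failed early COW (allocation or charge). */
		*err = -ENOMEM;
		return 0;
	}
	*dst = src;
	return 0;
}

/* Hypothetical stand-in for copy_pte_range(). */
static int copy_range(const int *src, int *dst, size_t n)
{
	unsigned long entry = 0;
	int err = 0;
	size_t i;

	/* The real code holds the page table lock around this loop. */
	for (i = 0; i < n; i++) {
		entry = copy_one(src[i], &dst[i], &err);
		if (entry || err)
			break;
	}

	/* Lock dropped here; only now is the error acted upon. */
	if (err)
		return err;
	if (entry)
		return -EAGAIN;	/* placeholder for the swap-continuation path */
	return 0;
}
```

The only point is that none of the existing return statements in the per-entry
helper change meaning; the caller checks one extra variable after unlocking.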

> > diff --git a/mm/gup.c b/mm/gup.c
> > index e5739a1974d5..cab10cefefe4 100644
> > +++ b/mm/gup.c
> > @@ -1255,6 +1255,17 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
> >  		BUG_ON(*locked != 1);
> >  	}
> >  
> > +	/*
> > +	 * Mark the mm struct if there's any page pinning attempt.  We're
> > +	 * aggressive on this bit since even if the pinned pages were unpinned
> > +	 * later on, we'll still keep this bit set for this address space just
> > +	 * to make everything easy.
> > +	 *
> > +	 * TODO: Ideally we can use mm->pinned_vm but only until it's stable.
> > +	 */
> > +	if (flags & FOLL_PIN)
> > +		WRITE_ONCE(mm->has_pinned, 1);
> 
> This should probably be its own commit, here is a stab at a commit
> message:
> 
> Reduce the chance of false positive from page_maybe_dma_pinned() by
> keeping track if the mm_struct has ever been used with
> pin_user_pages(). mm_structs that have never been passed to
> pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> definition. This allows cases that might drive up the page ref_count
> to avoid any penalty from handling dma_pinned pages.
> 
> Due to complexities with unpinning this trivial version is a permanent
> sticky bit, future work will be needed to make this a counter.

Thanks for writing this.  I'll keep this commit message for the split-out patch
when I post it formally.  Until then I hope it's fine that I still use a single
patch for simplicity, because I want to keep the discussion within this
thread.
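
For reference, the sticky-bit idea can be modelled in a few lines of userspace
C.  This is a simplified sketch, not kernel code: the model_* names and both
structs are hypothetical, PIN_BIAS only imitates the pin counting bias used by
pin_user_pages(), and a plain int stands in for the page refcount.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PIN_BIAS 1024	/* assumption: models the per-pin refcount bias */

struct mm_model { atomic_int has_pinned; };
struct page_model { int refcount; };

/* Pinning marks the address space once and bumps the refcount by the bias. */
static void model_pin_page(struct mm_model *mm, struct page_model *page)
{
	atomic_store_explicit(&mm->has_pinned, 1, memory_order_relaxed);
	page->refcount += PIN_BIAS;
}

/* Unpinning drops the bias but deliberately leaves has_pinned set. */
static void model_unpin_page(struct page_model *page)
{
	page->refcount -= PIN_BIAS;
}

/* Refcount heuristic: may be a false positive for heavily shared pages. */
static bool model_page_maybe_pinned(const struct page_model *page)
{
	return page->refcount >= PIN_BIAS;
}

/* fork()-time check: the heuristic only counts if the mm ever pinned. */
static bool model_needs_early_cow(struct mm_model *mm,
				  const struct page_model *page)
{
	if (!atomic_load_explicit(&mm->has_pinned, memory_order_relaxed))
		return false;
	return model_page_maybe_pinned(page);
}
```

In this model a page with a high but pin-free refcount is ignored until the mm
pins anything at all; after that the false positive comes back and stays, which
is the cost of using a sticky bit rather than a counter.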

> 
> > +/*
> > + * Do early cow for the page and the pte. Return true if page duplicate
> > + * succeeded, false otherwise.
> > + */
> > +static bool duplicate_page(struct mm_struct *mm, struct vm_area_struct *vma,
> 
> Suggest calling 'vma' 'new' here for consistency

OK.

> 
> > +			   unsigned long address, struct page *page,
> > +			   pte_t *newpte)
> > +{
> > +       struct page *new_page;
> > +       pte_t entry;
> > +
> > +       new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> > +       if (!new_page)
> > +               return false;
> > +
> > +       copy_user_highpage(new_page, page, address, vma);
> > +       if (mem_cgroup_charge(new_page, mm, GFP_KERNEL)) {
> > +	       put_page(new_page);
> > +	       return false;
> > +       }
> > +       cgroup_throttle_swaprate(new_page, GFP_KERNEL);
> > +       __SetPageUptodate(new_page);
> 
> It looks like these GFP flags can't be GFP_KERNEL, this is called
> inside the pte_alloc_map_lock() which is a spinlock
> 
> One thought is to lift this logic out to around
> add_swap_count_continuation()? Would need some serious rework to be
> able to store the dst_pte though.

What would be the result if we simply use GFP_ATOMIC?  Would there be too many
pages to allocate in bulk for ATOMIC?  IMHO slowness would be fine, but I don't
know the internals of page allocation, and I'm not sure whether
__GFP_KSWAPD_RECLAIM means we might kick kswapd, and whether we could deadlock
if kswapd tries to take the spinlock again somewhere while we are waiting for
it?

It would be good to go this (easy) way considering this is a very rarely
triggered path, so we can still keep copy_one_pte() simple.  Otherwise I seem
to have no choice but to move the page copy logic out of copy_one_pte(), as you
suggested.
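
If the copy does have to move out, one possible shape is "fail under the lock,
allocate outside it, retry".  Below is a minimal userspace sketch under stated
assumptions: range_model, copy_one_locked() and copy_range_model() are all
hypothetical names, malloc() stands in for a sleeping page allocation, and the
locked flag only marks where the spinlock would be held.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 64

struct range_model {
	bool locked;		/* marks the region where the spinlock is held */
	const char *src;	/* the "pinned" source page */
	char *dst;		/* NULL until the private copy is installed */
};

/* Runs with the lock held, so it must not allocate by itself. */
static int copy_one_locked(struct range_model *r, char **prealloc)
{
	if (!*prealloc)
		return -EAGAIN;	/* ask the caller for a page and a retry */

	memcpy(*prealloc, r->src, MODEL_PAGE_SIZE);
	r->dst = *prealloc;
	*prealloc = NULL;	/* ownership moved to the destination */
	return 0;
}

static int copy_range_model(struct range_model *r)
{
	char *prealloc = NULL;
	int ret;

	for (;;) {
		r->locked = true;
		ret = copy_one_locked(r, &prealloc);
		r->locked = false;
		if (ret != -EAGAIN)
			break;

		/* Lock dropped: a sleeping allocation is allowed here. */
		prealloc = malloc(MODEL_PAGE_SIZE);
		if (!prealloc)
			return -ENOMEM;
	}

	free(prealloc);		/* NULL when the page was consumed */
	return ret;
}
```

The cost is the extra state carried across the retry, which is exactly the
rework I would like to avoid if GFP_ATOMIC turns out to be acceptable.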

-- 
Peter Xu