Date: Fri, 24 Dec 2021 04:53:38 +0000
From: Matthew Wilcox
To: Jason Gunthorpe
Cc: David Hildenbrand, Jan Kara, Linus Torvalds, Nadav Amit,
 Linux Kernel Mailing List, Andrew Morton, Hugh Dickins, David Rientjes,
 Shakeel Butt, John Hubbard, Mike Kravetz, Mike Rapoport, Yang Shi,
 "Kirill A. Shutemov", Vlastimil Babka, Jann Horn, Michal Hocko,
 Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile,
 Christoph Hellwig, Oleg Nesterov, Linux-MM,
 "open list:KERNEL SELFTEST FRAMEWORK", "open list:DOCUMENTATION"
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via
 FAULT_FLAG_UNSHARE (!hugetlb)
References: <900b7d4a-a5dc-5c7b-a374-c4a8cc149232@redhat.com>
 <20211221190706.GG1432915@nvidia.com>
 <3e0868e6-c714-1bf8-163f-389989bf5189@redhat.com>
 <20211222124141.GA685@quack2.suse.cz>
 <4a28e8a0-2efa-8b5e-10b5-38f1fc143a98@redhat.com>
 <20211224025309.GF1779224@nvidia.com>
In-Reply-To: <20211224025309.GF1779224@nvidia.com>

On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > > Right, from an API perspective we really want people to use FOLL_PIN.
> > >
> > > To optimize this case in particular it would help if we would have the
> > > FOLL flags on the unpin path. Then we could just decide internally
> > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > > this like a FOLL_GET instead". And we would need that as well if we were
> > > to keep different counters for R/O vs. R/W pinned.
> >
> > FYI, in my current tree, there's a gup_put_folio() which replaces
> > put_compound_head:
> >
> > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > {
> >         if (flags & FOLL_PIN) {
> >                 node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> >                 if (hpage_pincount_available(&folio->page))
> >                         hpage_pincount_sub(&folio->page, refs);
> >                 else
> >                         refs *= GUP_PIN_COUNTING_BIAS;
> >         }
> >
> >         folio_put_refs(folio, refs);
> > }
> >
> > That can become non-static if it's needed. I'm still working on that
> > series, because I'd like to get it to a point where we return one
> > folio pointer instead of N page pointers. Not quite there yet.
>
> I'm keen to see what that looks like, every driver I'm working on that
> calls PUP goes through gyrations to recover contiguous pages, so this
> is most welcomed!

I'm about to take some time off, so alas, you won't see it any time
soon. It'd be good to talk with some of the interested users because
it's actually a pretty tricky problem. We can't just return an array
of the struct folios because the actual memory you want to access
might be anywhere in that folio, and you don't want to have to redo
the lookup just to find out which subpages of the folio are meant.
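To make the "gyrations" concrete (purely an illustration, not lifted
from any real driver, and coalesce_pinned_pages() is a made-up name):
today a caller gets N page pointers back from pin_user_pages() and has
to walk the array, gluing physically contiguous pages back into
(page, offset, len) ranges, something like:

static unsigned int coalesce_pinned_pages(struct page **pages,
                unsigned long npages, unsigned int offset, size_t size,
                struct bio_vec *bvecs)
{
        unsigned int nr = 0;
        unsigned long i;

        /*
         * 'offset' is the offset into the first page, 'size' the total
         * number of bytes; 'bvecs' needs room for npages entries in
         * the worst case.
         */
        for (i = 0; i < npages && size; i++) {
                unsigned int len = min_t(size_t, PAGE_SIZE - offset, size);

                if (nr && page_to_pfn(pages[i]) ==
                          page_to_pfn(pages[i - 1]) + 1) {
                        /* physically contiguous with the previous range */
                        bvecs[nr - 1].bv_len += len;
                } else {
                        bvecs[nr].bv_page = pages[i];
                        bvecs[nr].bv_offset = offset;
                        bvecs[nr].bv_len = len;
                        nr++;
                }
                size -= len;
                offset = 0;
        }
        return nr;
}

What the caller actually wants out of that loop is the set of
(page, offset, len) ranges, not the page array itself.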
So I'm currently thinking about returning a bio_vec:

struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};

In the iomap patchset which should go upstream in the next merge
window, you can iterate over a bio like this:

        struct folio_iter fi;
        bio_for_each_folio_all(fi, bio)
                iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);

There aren't any equivalent helpers for a bvec yet, but obviously we
can add them so that you can iterate over each folio in a contiguous
range. But now that each component in it is variable length, the
caller can't know how large an array of bio_vecs to allocate.

1. The callee can allocate the array and let the caller free it when
   it's finished.
2. The caller passes in a (small, fixed-size, on-stack) array of
   bio_vecs over (potentially) multiple calls.
3. The caller can overallocate and ignore that most of the array
   isn't used.

Any preferences? I don't like #3.
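FWIW, from the caller's side #2 might look something like the sketch
below. Every name in it is hypothetical (pin_user_range_bvec() and
consume_range() don't exist); it's only meant to show the shape of the
refill loop:

static int pin_and_consume(unsigned long start, size_t len)
{
        struct bio_vec bv[16];  /* small, fixed-size, on-stack */
        int i, nr;

        while (len) {
                /*
                 * Hypothetical API: pin as much of [start, start + len)
                 * as fits in the array, one bio_vec per physically
                 * contiguous range; returns the number of entries
                 * filled, or a negative errno.
                 */
                nr = pin_user_range_bvec(start, len, FOLL_WRITE,
                                         bv, ARRAY_SIZE(bv));
                if (nr <= 0)
                        return nr;

                for (i = 0; i < nr; i++) {
                        /*
                         * A bvec_for_each_folio()-style helper could
                         * replace this once it exists; unpinning is
                         * elided from the sketch.
                         */
                        consume_range(bv[i].bv_page, bv[i].bv_offset,
                                      bv[i].bv_len);
                        start += bv[i].bv_len;
                        len -= bv[i].bv_len;
                }
        }

        return 0;
}

The trade-off versus #1 is the obvious one: #1 costs an allocation per
call (and a free in the caller), but hands the whole range back at
once, while #2 stays on the stack at the price of going back into the
pin path until the range is exhausted.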