Date: Mon, 18 Nov 2019 12:58:29 +0100
From: Jan Kara
To: John Hubbard
Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams, Daniel Vetter,
	Dave Chinner, David Airlie, "David S. Miller", Ira Weiny, Jan Kara,
	Jason Gunthorpe, Jens Axboe, Jonathan Corbet, Jérôme Glisse,
	Magnus Karlsson, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Hocko, Mike Kravetz, Paul Mackerras, Shuah Khan,
	Vlastimil Babka, bpf@vger.kernel.org, dri-devel@lists.freedesktop.org,
	kvm@vger.kernel.org, linux-block@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-mm@kvack.org, LKML
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
Message-ID: <20191118115829.GJ17319@quack2.suse.cz>
References: <20191115055340.1825745-1-jhubbard@nvidia.com> <20191115055340.1825745-18-jhubbard@nvidia.com>
In-Reply-To: <20191115055340.1825745-18-jhubbard@nvidia.com>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Thu 14-11-19 21:53:33, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>     bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].

  ^^ missing this reference in the changelog...

> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> Suggested-by: Jan Kara
> Suggested-by: Jérôme Glisse
> Signed-off-by: John Hubbard
> ---
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6588d2e02628..db872766480f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +
>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>  	__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page: pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were

  ^^ pin_user_pages()

> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked

  ^^^ pin_user_pages()

> + * as distinct from normal pages. As such, the put_user_page() call (and its
> + * variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>  *
> - * Pages that were pinned via pin_user_pages*() must be released via either
> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
> - * eventually such pages can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,

  ^^^ pin_user_pages()

> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>  *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with pin*() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
> + */

...
> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>  	refs = __record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> -		return 0;
> +	if (flags & FOLL_PIN) {
> +		head = page;
> +		if (unlikely(!user_page_ref_inc(head)))
> +			return 0;
> +		head = page;

Why do you assign 'head' twice? Also the refcounting logic is repeated
several times so perhaps you can factor it out into a helper function, or
even move it to __record_subpages()?

> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return 0;
> +	}
>  
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  		put_compound_head(head, refs);

So this will do the wrong thing for FOLL_PIN. We took just one "pin"
reference there but here we'll release 'refs' normal references AFAICT.

Also the fact that you take just one pin reference for each huge page
substantially changes how GUP refcounting works in the huge page case.
Currently, FOLL_GET users can be completely agnostic of huge pages. So you
can e.g. GUP a whole 2 MB page, submit it as 2 different bios, and then
drop the page references from each bio completion function. With your new
FOLL_PIN behavior you cannot do that, and I believe it will be a problem
for some users. So I think you have to maintain the behavior that you
increase head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.

								Honza
-- 
Jan Kara
SUSE Labs, CR