Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
To: Jan Kara
CC: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
    Björn Töpel, Christoph Hellwig, Dan Williams, Daniel Vetter,
    Dave Chinner, David Airlie, David S. Miller, Ira Weiny,
    Jason Gunthorpe, Jens Axboe, Jonathan Corbet, Jérôme Glisse,
    Magnus Karlsson, Mauro Carvalho Chehab, Michael Ellerman,
    Michal Hocko, Mike Kravetz, Paul Mackerras, Shuah Khan,
    Vlastimil Babka, LKML
From: John Hubbard
Date: Mon, 18 Nov 2019 16:22:43 -0800
Message-ID: <8424f891-271d-5c34-8f7c-ebf3e3aa6664@nvidia.com>
In-Reply-To: <20191118115829.GJ17319@quack2.suse.cz>
References: <20191115055340.1825745-1-jhubbard@nvidia.com>
 <20191115055340.1825745-18-jhubbard@nvidia.com>
 <20191118115829.GJ17319@quack2.suse.cz>
Miller" , Ira Weiny , Jason Gunthorpe , Jens Axboe , Jonathan Corbet , =?UTF-8?B?SsOpcsO0bWUgR2xpc3Nl?= , Magnus Karlsson , Mauro Carvalho Chehab , Michael Ellerman , Michal Hocko , Mike Kravetz , Paul Mackerras , Shuah Khan , Vlastimil Babka , , , , , , , , , , , , , LKML References: <20191115055340.1825745-1-jhubbard@nvidia.com> <20191115055340.1825745-18-jhubbard@nvidia.com> <20191118115829.GJ17319@quack2.suse.cz> From: John Hubbard X-Nvconfidentiality: public Message-ID: <8424f891-271d-5c34-8f7c-ebf3e3aa6664@nvidia.com> Date: Mon, 18 Nov 2019 16:22:43 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: <20191118115829.GJ17319@quack2.suse.cz> X-Originating-IP: [10.124.1.5] X-ClientProxiedBy: HQMAIL105.nvidia.com (172.20.187.12) To HQMAIL107.nvidia.com (172.20.187.13) Content-Type: text/plain; charset="utf-8" Content-Language: en-US Content-Transfer-Encoding: quoted-printable DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nvidia.com; s=n1; t=1574122967; bh=ip2xVJC+jxbL/7d2eWhZjqdIi4kB0uoxgzyWnYDs5q4=; h=X-PGP-Universal:Subject:To:CC:References:From:X-Nvconfidentiality: Message-ID:Date:User-Agent:MIME-Version:In-Reply-To: X-Originating-IP:X-ClientProxiedBy:Content-Type:Content-Language: Content-Transfer-Encoding; b=kZWTJaB6BUYpdW+z0HXxRztjdsBPgYMtIagp6++GXQ1xMYEqmDvkuTu5SJexcfc3Z aqvcG/+YbvfnrntTmKpTzXcCcJ1vacUs8Vt+kkZFmecCKGt5ESA8XOWs62Cbve3dr3 XsCtQVKM4yFI8r3EtkYIToV94nBwkk0bzSnRfhI6IY4EfMEr9iYp9o7iElBN+PNDoe GYMLDs4YJjT+MkCB6avfM8/DST4lwWHJbTQrh01mIBzcPCF4cBGDQd1OqtdqCqvly5 URQ6TTy1wBAC/IlVtHZs8b+E7d7tANvL2AVJcuHSx7ushKjPgli8Izd92XqBpLulsV NLxmWdrhMpwGw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 11/18/19 3:58 AM, Jan Kara wrote: > On Thu 14-11-19 21:53:33, John Hubbard wrote: >> Add tracking of pages that were pinned via FOLL_PIN. >> >> As mentioned in the FOLL_PIN documentation, callers who effectively set >> FOLL_PIN are required to ultimately free such pages via put_user_page(). >> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET >> for DIO and/or RDMA use". >> >> Pages that have been pinned via FOLL_PIN are identifiable via a >> new function call: >> >> bool page_dma_pinned(struct page *page); >> >> What to do in response to encountering such a page, is left to later >> patchsets. There is discussion about this in [1]. > ^^ missing this reference > in the changelog... I'll add that.=20 >=20 >> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). 
>> >> Suggested-by: Jan Kara >> Suggested-by: J=C3=A9r=C3=B4me Glisse >> Signed-off-by: John Hubbard >> --- >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 6588d2e02628..db872766480f 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struc= t page *page) >> return true; >> } >> =20 >> +__must_check bool user_page_ref_inc(struct page *page); >> + >> static inline void put_page(struct page *page) >> { >> page =3D compound_head(page); >> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page) >> __put_page(page); >> } >> =20 >> -/** >> - * put_user_page() - release a gup-pinned page >> - * @page: pointer to page to be released >> +/* >> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, ove= rload >> + * the page's refcount so that two separate items are tracked: the orig= inal page >> + * reference count, and also a new count of how many get_user_pages() c= alls were > ^^ pin_user_pages() >=20 >> + * made against the page. ("gup-pinned" is another term for the latter)= . >> + * >> + * With this scheme, get_user_pages() becomes special: such pages are m= arked > ^^^ pin_user_pages() >=20 >> + * as distinct from normal pages. As such, the put_user_page() call (an= d its >> + * variants) must be used in order to release gup-pinned pages. >> + * >> + * Choice of value: >> * >> - * Pages that were pinned via pin_user_pages*() must be released via ei= ther >> - * put_user_page(), or one of the put_user_pages*() routines. This is s= o that >> - * eventually such pages can be separately tracked and uniquely handled= . In >> - * particular, interactions with RDMA and filesystems need special hand= ling. >> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page re= ference >> + * counts with respect to get_user_pages() and put_user_page() becomes = simpler, > ^^^ pin_user_pages() >=20 Yes. >> + * due to the fact that adding an even power of two to the page refcoun= t has >> + * the effect of using only the upper N bits, for the code that counts = up using >> + * the bias value. This means that the lower bits are left for the excl= usive >> + * use of the original code that increments and decrements by one (or a= t least, >> + * by much smaller values than the bias value). >> * >> - * put_user_page() and put_page() are not interchangeable, despite this= early >> - * implementation that makes them look the same. put_user_page() calls = must >> - * be perfectly matched up with pin*() calls. >> + * Of course, once the lower bits overflow into the upper bits (and thi= s is >> + * OK, because subtraction recovers the original values), then visual i= nspection >> + * no longer suffices to directly view the separate counts. However, fo= r normal >> + * applications that don't have huge page reference counts, this won't = be an >> + * issue. >> + * >> + * Locking: the lockless algorithm described in page_cache_get_speculat= ive() >> + * and page_cache_gup_pin_speculative() provides safe operation for >> + * get_user_pages and page_mkclean and other calls that race to set up = page >> + * table entries. >> */ > ... 
>> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long= sz, unsigned long addr, >> page =3D head + ((addr & (sz-1)) >> PAGE_SHIFT); >> refs =3D __record_subpages(page, addr, end, pages + *nr); >> =20 >> - head =3D try_get_compound_head(head, refs); >> - if (!head) >> - return 0; >> + if (flags & FOLL_PIN) { >> + head =3D page; >> + if (unlikely(!user_page_ref_inc(head))) >> + return 0; >> + head =3D page; >=20 > Why do you assign 'head' twice? Also the refcounting logic is repeated > several times so perhaps you can factor it out in to a helper function or > even move it to __record_subpages()? OK. >=20 >> + } else { >> + head =3D try_get_compound_head(head, refs); >> + if (!head) >> + return 0; >> + } >> =20 >> if (unlikely(pte_val(pte) !=3D pte_val(*ptep))) { >> put_compound_head(head, refs); >=20 > So this will do the wrong thing for FOLL_PIN. We took just one "pin" > reference there but here we'll release 'refs' normal references AFAICT. > Also the fact that you take just one pin reference for each huge page > substantially changes how GUP refcounting works in the huge page case. > Currently, FOLL_GET users can be completely agnostic of huge pages. So yo= u > can e.g. GUP whole 2 MB page, submit it as 2 different bios and then > drop page references from each bio completion function. With your new > FOLL_PIN behavior you cannot do that and I believe it will be a problem f= or > some users. So I think you have to maintain the behavior that you increas= e > the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here. >=20 Yes, completely agreed, this was a (big) oversight. I went through the same reasoning and reached your conclusions, in __gup_device_huge(), but then did it wrong in these functions. Will fix. thanks, --=20 John Hubbard NVIDIA