From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 11 Jun 2021 15:36:24 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: Jann Horn <jannh@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Matthew Wilcox, "Kirill A . Shutemov", John Hubbard, Jan Kara,
	stable@vger.kernel.org
Subject: Re: [PATCH resend] mm/gup: fix try_grab_compound_head() race with split_huge_page()
Message-Id: <20210611153624.65badf761078f86f76365ab9@linux-foundation.org>
In-Reply-To: <20210611161545.998858-1-jannh@google.com>
References: <20210611161545.998858-1-jannh@google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII

On Fri, 11 Jun 2021 18:15:45 +0200 Jann Horn wrote:

> try_grab_compound_head() is used to grab a reference to a page from
> get_user_pages_fast(), which is only protected against concurrent
> freeing of page tables (via local_irq_save()), but not against
> concurrent TLB flushes, freeing of
> data pages, or splitting of compound pages.
>
> Because no reference is held to the page when try_grab_compound_head()
> is called, the page may have been freed and reallocated by the time its
> refcount has been elevated; therefore, once we're holding a stable
> reference to the page, the caller re-checks whether the PTE still points
> to the same page (with the same access rights).
>
> The problem is that try_grab_compound_head() has to grab a reference on
> the head page; but between the time we look up what the head page is and
> the time we actually grab a reference on the head page, the compound
> page may have been split up (either explicitly through split_huge_page()
> or by freeing the compound page to the buddy allocator and then
> allocating its individual order-0 pages).
> If that happens, get_user_pages_fast() may end up returning the right
> page but lifting the refcount on a now-unrelated page, leading to
> use-after-free of pages.
>
> To fix it:
> Re-check whether the pages still belong together after lifting the
> refcount on the head page.
> Move anything else that checks compound_head(page) below the refcount
> increment.
>
> This can't actually happen on bare-metal x86 (because there, disabling
> IRQs locks out remote TLB flushes), but it can happen on virtualized x86
> (e.g. under KVM) and probably also on arm64. The race window is pretty
> narrow, and constantly allocating and shattering hugepages isn't exactly
> fast; for now I've only managed to reproduce this in an x86 KVM guest with
> an artificially widened timing window (by adding a loop that repeatedly
> calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits,
> so that PV TLB flushes are used instead of IPIs).
>
> ...
>
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -43,8 +43,21 @@ static void hpage_pincount_sub(struct page *page, int refs)
>  
>  	atomic_sub(refs, compound_pincount_ptr(page));
>  }
>  
> +/* Equivalent to calling put_page() @refs times. */
> +static void put_page_refs(struct page *page, int refs)
> +{
> +	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);

I don't think there's a need to nuke the whole kernel in this case.
Can we warn then simply leak the page?  That way we have a much better
chance of getting a good bug report.

> +	/*
> +	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> +	 * ref needs a put_page().
> +	 */
> +	if (refs > 1)
> +		page_ref_sub(page, refs - 1);
> +	put_page(page);
> +}