Subject: Re: [PATCH] hugetlbfs: Disable IRQ when taking hugetlb_lock in set_max_huge_pages()
From: Waiman Long
To: Matthew Wilcox
Cc: Mike Kravetz, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Mon, 9 Dec 2019 21:37:00 -0500
Organization: Red Hat
Message-ID: <1209d9ba-9d82-4dfc-5cdf-a2641814af75@redhat.com>
References: <20191209160150.18064-1-longman@redhat.com> <20191209164907.GD32169@bombadil.infradead.org>

On 12/9/19 7:46 PM, Waiman Long wrote:
> On 12/9/19 11:49 AM, Matthew Wilcox wrote:
>> On Mon, Dec 09, 2019 at 11:01:50AM -0500, Waiman Long wrote:
>>> [  613.245273] Call Trace:
>>> [  613.256273]  <IRQ>
>>> [  613.265273]  dump_stack+0x9a/0xf0
>>> [  613.281273]  mark_lock+0xd0c/0x12f0
>>> [  613.341273]  __lock_acquire+0x146b/0x48c0
>>> [  613.401273]  lock_acquire+0x14f/0x3b0
>>> [  613.440273]  _raw_spin_lock+0x30/0x70
>>> [  613.477273]  free_huge_page+0x36f/0xaa0
>>> [  613.495273]  bio_check_pages_dirty+0x2fc/0x5c0
>> Oh, this is fun.  So we kicked off IO to a hugepage, then truncated or
>> otherwise caused the page to come free.  Then the IO finished and did the
>> final put on the page ... from interrupt context.  Splat.  Not something
>> that's going to happen often, but can happen if a process dies during
>> IO or due to a userspace bug.
>>
>> Maybe we should schedule work to do the actual freeing of the page
>> instead of this rather large patch.  It doesn't seem like a case we need
>> to optimise for.
> I think that may be a good idea to try it out.

It turns out that using a workqueue is more complex than I originally
thought. I have come up with the following untested changes to do that:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..629ac000318b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *pag
         page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
         /*
          * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,82 @@ void free_huge_page(struct page *page)
         spin_unlock(&hugetlb_lock);
 }
 
+/*
+ * As free_huge_page() can be called from softIRQ context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ */
+struct hpage_to_free {
+        struct page             *page;
+        struct hpage_to_free    *next;
+};
+static struct hpage_to_free *hpage_freelist;
+#define NEXT_PENDING    ((struct hpage_to_free *)-1)
+
+/*
+ * This work function locklessly retrieves the pages to be freed and
+ * frees them one-by-one.
+ */
+static void free_huge_page_workfn(struct work_struct *work)
+{
+        struct hpage_to_free *curr, *next;
+
+recheck:
+        curr = xchg(&hpage_freelist, NULL);
+        if (!curr)
+                return;
+
+        while (curr) {
+                __free_huge_page(curr->page);
+                next = READ_ONCE(curr->next);
+                while (next == NEXT_PENDING) {
+                        cpu_relax();
+                        next = READ_ONCE(curr->next);
+                }
+                kfree(curr);
+                curr = next;
+        }
+        if (work && READ_ONCE(hpage_freelist))
+                goto recheck;
+}
+static DECLARE_WORK(free_huge_page_work, free_huge_page_workfn);
+
+void free_huge_page(struct page *page)
+{
+        /*
+         * Defer freeing in softIRQ context to avoid hugetlb_lock deadlock.
+         */
+        if (in_serving_softirq()) {
+                struct hpage_to_free *item, *next;
+
+                /*
+                 * We are in serious trouble if kmalloc fails. In this
+                 * case, we hope for the best and do the freeing now.
+                 */
+                item = kmalloc(sizeof(*item), GFP_KERNEL);
+                if (WARN_ON_ONCE(!item))
+                        goto free_page_now;
+
+                item->page = page;
+                item->next = NEXT_PENDING;
+                next = xchg(&hpage_freelist, item);
+                WRITE_ONCE(item->next, next);
+                schedule_work(&free_huge_page_work);
+                return;
+        }
+
+        /*
+         * Racing may prevent some of deferred huge pages in hpage_freelist
+         * from being freed. Check here and call schedule_work() if that
+         * is the case.
+         */
+        if (hpage_freelist && !work_pending(&free_huge_page_work))
+                schedule_work(&free_huge_page_work);
+
+free_page_now:
+        __free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
         INIT_LIST_HEAD(&page->lru);

------------------------------------------------------------------------------

I think it may be simpler and less risky to use spin_lock_bh() as you
have suggested. Also, the above code will not be good enough if more
lock-taking functions are called from softIRQ context in the future.

So what do you think?

Cheers,
Longman
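
P.S. To make that alternative concrete, here is a minimal, untested
sketch of the spin_lock_bh() idea. The helper name below is made up
purely for illustration; the real change would have to touch every
process-context taker of hugetlb_lock, including set_max_huge_pages().

/*
 * Untested sketch, not a real patch: process-context paths that take
 * hugetlb_lock would switch to the _bh lock variants, so a
 * free_huge_page() running from softIRQ context can never interrupt a
 * lock holder on the same CPU and self-deadlock on hugetlb_lock.
 */
static void some_process_context_path(void)    /* hypothetical name */
{
        spin_lock_bh(&hugetlb_lock);    /* was spin_lock(); softIRQs now blocked */
        /* ... existing freelist/counter manipulation stays unchanged ... */
        spin_unlock_bh(&hugetlb_lock);  /* re-enables softIRQs on this CPU */
}

Compared with the workqueue scheme above, this adds no allocation to the
free path, but it does require auditing every hugetlb_lock call site,
which is what the original patch was doing with the IRQ-disabling
variants.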