Date: Sat, 9 Mar 2024 06:09:19 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Ryan Roberts
Cc: Zi Yan, Andrew Morton, linux-mm@kvack.org, Yang Shi, Huang Ying
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch processed

On Fri, Mar 08, 2024 at 11:44:35AM +0000, Ryan Roberts wrote:
> > The thought occurs that we don't need to take the folios off the list.
> > I don't know that will fix anything, but this will fix your "running out
> > of memory" problem -- I forgot to drop the reference if folio_trylock()
> > failed.  Of course, I can't call folio_put() inside the lock, so may
> > as well move the trylock back to the second loop.

I think this was a bad thought ...

> Dumping all the CPU back traces with gdb, all the cores (except one) are
> contending on the deferred split lock.

I'm pretty sure that we can call the shrinker on multiple CPUs at the
same time (can you confirm from the backtrace?)

	struct pglist_data *pgdata = NODE_DATA(sc->nid);
	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;

so if two CPUs try to shrink the same node, they're going to try to
process the same set of folios.  Which means the split will keep failing
because each of them will have a refcount on the folio, and ... yeah.

If so, we need to take the folios off the list (or otherwise mark them)
so that they can't be processed by more than one CPU at a time.  And
that leads me to this patch (yes, folio_prep_large_rmappable() is now
vestigial, but removing it increases the churn a bit much for this
stage of debugging).

This time I've boot-tested it.  I'm running my usual test suite against
it now, with little expectation that it will trigger the problem.  If I
have time I'll try to recreate your setup.
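To spell out the "each of them will have a refcount" part: split_folio()
only succeeds when the folio's reference count matches the references it
expects to find.  Roughly (a simplified sketch of the check behind
can_split_folio(); the real pin accounting is more involved, so treat
this as illustrative only):

	/*
	 * Illustrative sketch, not the actual mm/huge_memory.c helper.
	 * A second CPU that has already done folio_try_get() on the same
	 * folio adds one unexpected reference, so this stays false for
	 * both CPUs and the split fails on every attempt.
	 */
	static bool sketch_can_split(struct folio *folio, int expected_extra_pins)
	{
		return folio_ref_count(folio) ==
		       folio_mapcount(folio) + expected_extra_pins + 1;
	}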
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd745bcc97ff..2ca033a6c3d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -792,8 +792,6 @@ void folio_prep_large_rmappable(struct folio *folio)
 {
 	if (!folio || !folio_test_large(folio))
 		return;
-	if (folio_order(folio) > 1)
-		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
 }
 
@@ -3312,7 +3310,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
+	struct folio_batch batch;
 	struct folio *folio, *next;
 	int split = 0;
 
@@ -3321,36 +3319,40 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&batch);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/* Take pin on all head pages to avoid freeing them under us */
+	/* Take ref on all folios to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 						_deferred_list) {
-		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
+		list_del_init(&folio->_deferred_list);
+		sc->nr_to_scan--;
+		if (!folio_try_get(folio)) {
 			/* We lost race with folio_put() */
-			list_del_init(&folio->_deferred_list);
 			ds_queue->split_queue_len--;
+		} else if (folio_batch_add(&batch, folio) == 0) {
+			break;
 		}
-		if (!--sc->nr_to_scan)
+		if (!sc->nr_to_scan)
 			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	while ((folio = folio_batch_next(&batch)) != NULL) {
 		if (!folio_trylock(folio))
-			goto next;
-		/* split_huge_page() removes page from list on success */
+			continue;
 		if (!split_folio(folio))
 			split++;
 		folio_unlock(folio);
-next:
-		folio_put(folio);
 	}
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
+	while ((folio = folio_batch_next(&batch)) != NULL) {
+		if (!folio_test_large(folio))
+			continue;
+		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
+	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	folios_put(&batch);
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
diff --git a/mm/internal.h b/mm/internal.h
index 1dfdc3bde1b0..14c21d06f233 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -432,6 +432,8 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
+	if (order > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 }
 
 static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 025ad1a7df7b..fc9c7ca24c4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,9 +1007,12 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
 		break;
 	case 2:
 		/*
-		 * the second tail page: ->mapping is
-		 * deferred_list.next -- ignore value.
+		 * the second tail page: ->mapping is deferred_list.next
 		 */
+		if (unlikely(!list_empty(&folio->_deferred_list))) {
+			bad_page(page, "still on deferred list");
+			goto out;
+		}
 		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {
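For reference, the folio_batch usage the patch switches to follows this
pattern (a minimal sketch, assuming the <linux/pagevec.h> helpers used
in the patch above; sketch_drain_queue() is a made-up name for
illustration, not an existing kernel function):

	#include <linux/mm.h>
	#include <linux/pagevec.h>

	static void sketch_drain_queue(struct list_head *queue)
	{
		struct folio_batch batch;
		struct folio *folio, *next;

		folio_batch_init(&batch);

		/* Producer: unlink folios and stash them until the batch fills up. */
		list_for_each_entry_safe(folio, next, queue, _deferred_list) {
			list_del_init(&folio->_deferred_list);
			if (!folio_try_get(folio))
				continue;	/* lost the race with the final folio_put() */
			if (folio_batch_add(&batch, folio) == 0)
				break;		/* no space left in the batch */
		}

		/* Consumer: folio_batch_next() walks the stashed folios. */
		while ((folio = folio_batch_next(&batch)) != NULL) {
			/* operate on the folio while we still hold a reference */
		}

		/* Drop every reference the batch holds in one call. */
		folios_put(&batch);
	}

Note that folio_batch_next() advances an internal cursor and doesn't
rewind, so a batch is meant to be consumed in a single pass; folios_put()
drops all the stashed references regardless of where that cursor stopped.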