Date: Tue, 22 Oct 2019 18:57:45 +0200
From: Michal Hocko
To: Waiman Long
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Johannes Weiner, Roman Gushchin, Vlastimil Babka,
	Konstantin Khlebnikov, Jann Horn, Song Liu,
	Greg Kroah-Hartman, Rafael Aquini, Mel Gorman
Subject: Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo
Message-ID: <20191022165745.GT9379@dhcp22.suse.cz>
In-Reply-To: <20191022162156.17316-1-longman@redhat.com>

[Cc Mel]

On Tue 22-10-19 12:21:56, Waiman Long wrote:
> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates each of the free lists to get the counts. There are
> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> file, simply because it takes too long to iterate all the free lists
> within a zone while holding the zone lock with IRQs disabled.
>
> Given that /proc/pagetypeinfo is readable by all, the possibility of
> crashing a system by the simple act of reading /proc/pagetypeinfo
> by any user is a security problem that needs to be addressed.

Should we make the file 0400? It is a useful thing when debugging but
not something regular users would really need.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined.
> Tracking the number of free list entries
> for each migration type will probably add some overhead to the fast
> paths, like moving pages from one migration type to another, which may
> not be desirable.

Have you tried to measure that overhead?

> We can actually skip iterating the list of one of the migration types
> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order, we
> will have to store the counts in an internal 2D array before printing
> them out.
>
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> zone lock for too long, blocking other zone lock waiters from running.
> This can be problematic for systems with a large amount of memory.
> So a check is added to temporarily release the lock and reschedule if
> more than 64k of list entries have been iterated for each order. With
> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> entries before releasing the lock.

But you are still iterating through the whole free_list at once, so if
it gets really large then this is still possible. I think it would be
preferable to use a per-migratetype nr_free if it doesn't cause any
regressions.
> Signed-off-by: Waiman Long
> ---
>  mm/vmstat.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 41 insertions(+), 10 deletions(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..40c9a1494709 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
>  	int order, mtype;
> +	unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
>
> -	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> -		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> -					pgdat->node_id,
> -					zone->name,
> -					migratetype_names[mtype]);
> -		for (order = 0; order < MAX_ORDER; ++order) {
> +	lockdep_assert_held(&zone->lock);
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * MIGRATE_MOVABLE is usually the largest one in large memory
> +	 * systems. We skip iterating that list. Instead, we compute it by
> +	 * subtracting the total of the rest from free_area->nr_free.
> +	 */
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		unsigned long nr_total = 0;
> +		struct free_area *area = &(zone->free_area[order]);
> +
> +		for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>  			unsigned long freecount = 0;
> -			struct free_area *area;
>  			struct list_head *curr;
>
> -			area = &(zone->free_area[order]);
> -
> +			if (mtype == MIGRATE_MOVABLE)
> +				continue;
>  			list_for_each(curr, &area->free_list[mtype])
>  				freecount++;
> -			seq_printf(m, "%6lu ", freecount);
> +			nfree[order][mtype] = freecount;
> +			nr_total += freecount;
>  		}
> +		nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> +		/*
> +		 * If we have already iterated more than 64k of list
> +		 * entries, we might have held the zone lock for too long.
> +		 * Temporarily release the lock and reschedule before
> +		 * continuing so that other lock waiters have a chance
> +		 * to run.
> +		 */
> +		if (nr_total > (1 << 16)) {
> +			spin_unlock_irq(&zone->lock);
> +			cond_resched();
> +			spin_lock_irq(&zone->lock);
> +		}
> +	}
> +
> +	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> +		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> +					pgdat->node_id,
> +					zone->name,
> +					migratetype_names[mtype]);
> +		for (order = 0; order < MAX_ORDER; ++order)
> +			seq_printf(m, "%6lu ", nfree[order][mtype]);
>  		seq_putc(m, '\n');
>  	}
>  }
> --
> 2.18.1

-- 
Michal Hocko
SUSE Labs