Date: Mon, 20 Jan 2025 17:00:44 +0100
From: Jan Kara
To: Mateusz Guzik
Cc: brauner@kernel.org, viro@zeniv.linux.org.uk, jack@suse.cz,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tavianator@tavianator.com, linux-mm@kvack.org,
	akpm@linux-foundation.org
Subject: Re: [RESEND PATCH] fs: avoid mmap sem relocks when coredumping with many missing pages
Message-ID: <55qxyg2diynlelvdzorhvtk4omfcobarious3fkxh4n33oezod@sju7s6sebec3>
References: <20250119103205.2172432-1-mjguzik@gmail.com>
In-Reply-To: <20250119103205.2172432-1-mjguzik@gmail.com>

On Sun 19-01-25 11:32:05, Mateusz Guzik wrote:
> Dumping processes with large allocated and mostly not-faulted areas is
> very slow.
>
> Borrowing a test case from Tavian Barnes:
>
> int main(void) {
> 	char *mem = mmap(NULL, 1ULL << 40, PROT_READ | PROT_WRITE,
> 			MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
> 	printf("%p %m\n", mem);
> 	if (mem != MAP_FAILED) {
> 		mem[0] = 1;
> 	}
> 	abort();
> }
>
> That's 1TB of almost completely not-populated area.
>
> On my test box it takes 13-14 seconds to dump.
>
> The profile shows:
> -   99.89%     0.00%  a.out
>      entry_SYSCALL_64_after_hwframe
>      do_syscall_64
>      syscall_exit_to_user_mode
>      arch_do_signal_or_restart
>    - get_signal
>       - 99.89% do_coredump
>          - 99.88% elf_core_dump
>             - dump_user_range
>                - 98.12% get_dump_page
>                   - 64.19% __get_user_pages
>                      - 40.92% gup_vma_lookup
>                         - find_vma
>                            - mt_find
>                                 4.21% __rcu_read_lock
>                                 1.33% __rcu_read_unlock
>                      - 3.14% check_vma_flags
>                           0.68% vma_is_secretmem
>                        0.61% __cond_resched
>                        0.60% vma_pgtable_walk_end
>                        0.59% vma_pgtable_walk_begin
>                        0.58% no_page_table
>                   - 15.13% down_read_killable
>                        0.69% __cond_resched
>                     13.84% up_read
>                  0.58% __cond_resched
>
> Almost 29% of the time is spent relocking the mmap semaphore between
> calls to get_dump_page() which find nothing.
>
> Whacking that results in times of 10 seconds (down from 13-14).
>
> While here make the thing killable.
>
> The real problem is the page-sized iteration and the real fix would
> patch it up instead. It is left as an exercise for the mm-familiar
> reader.
>
> Signed-off-by: Mateusz Guzik

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara

BTW: I don't see how we could fundamentally move away from page-sized
iteration because core dumping is "by definition" walking page tables and
gathering pages there. But it could certainly be much more efficient if
implemented properly (e.g. in the example above we'd see that most of PGD
level tables are not even allocated so we could be skipping 1GB ranges of
address space in one step).

								Honza

> ---
>
> Minimally tested, very plausible I missed something.
>
> sent again because the previous thing has myself in To -- i failed to
> fix up the oneliner suggested by lore.kernel.org. it seem the original
> got lost.
>
>  arch/arm64/kernel/elfcore.c |  3 ++-
>  fs/coredump.c               | 38 +++++++++++++++++++++++++++++++------
>  include/linux/mm.h          |  2 +-
>  mm/gup.c                    |  5 ++---
>  4 files changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
> index 2e94d20c4ac7..b735f4c2fe5e 100644
> --- a/arch/arm64/kernel/elfcore.c
> +++ b/arch/arm64/kernel/elfcore.c
> @@ -27,9 +27,10 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
>  	int ret = 1;
>  	unsigned long addr;
>  	void *tags = NULL;
> +	int locked = 0;
>
>  	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
> -		struct page *page = get_dump_page(addr);
> +		struct page *page = get_dump_page(addr, &locked);
>
>  		/*
>  		 * get_dump_page() returns NULL when encountering an empty
> diff --git a/fs/coredump.c b/fs/coredump.c
> index d48edb37bc35..84cf76f0d5b6 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -925,14 +925,23 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  {
>  	unsigned long addr;
>  	struct page *dump_page;
> +	int locked, ret;
>
>  	dump_page = dump_page_alloc();
>  	if (!dump_page)
>  		return 0;
>
> +	ret = 0;
> +	locked = 0;
>  	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
>  		struct page *page;
>
> +		if (!locked) {
> +			if (mmap_read_lock_killable(current->mm))
> +				goto out;
> +			locked = 1;
> +		}
> +
>  		/*
>  		 * To avoid having to allocate page tables for virtual address
>  		 * ranges that have never been used yet, and also to make it
> @@ -940,21 +949,38 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
>  		 * NULL when encountering an empty page table entry that would
>  		 * otherwise have been filled with the zero page.
>  		 */
> -		page = get_dump_page(addr);
> +		page = get_dump_page(addr, &locked);
>  		if (page) {
> +			if (locked) {
> +				mmap_read_unlock(current->mm);
> +				locked = 0;
> +			}
>  			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
>  			put_page(page);
> -			if (stop) {
> -				dump_page_free(dump_page);
> -				return 0;
> -			}
> +			if (stop)
> +				goto out;
>  		} else {
>  			dump_skip(cprm, PAGE_SIZE);
>  		}
> +
> +		if (dump_interrupted())
> +			goto out;
> +
> +		if (!need_resched())
> +			continue;
> +		if (locked) {
> +			mmap_read_unlock(current->mm);
> +			locked = 0;
> +		}
>  		cond_resched();
>  	}
> +	ret = 1;
> +out:
> +	if (locked)
> +		mmap_read_unlock(current->mm);
> +
>  	dump_page_free(dump_page);
> -	return 1;
> +	return ret;
>  }
>  #endif
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 75c9b4f46897..7df0d9200d8c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2633,7 +2633,7 @@ int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
>  			struct task_struct *task, bool bypass_rlim);
>
>  struct kvec;
> -struct page *get_dump_page(unsigned long addr);
> +struct page *get_dump_page(unsigned long addr, int *locked);
>
>  bool folio_mark_dirty(struct folio *folio);
>  bool folio_mark_dirty_lock(struct folio *folio);
> diff --git a/mm/gup.c b/mm/gup.c
> index 2304175636df..f3be2aa43543 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2266,13 +2266,12 @@ EXPORT_SYMBOL(fault_in_readable);
>   * Called without mmap_lock (takes and releases the mmap_lock by itself).
>   */
>  #ifdef CONFIG_ELF_CORE
> -struct page *get_dump_page(unsigned long addr)
> +struct page *get_dump_page(unsigned long addr, int *locked)
>  {
>  	struct page *page;
> -	int locked = 0;
>  	int ret;
>
> -	ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
> +	ret = __get_user_pages_locked(current->mm, addr, 1, &page, locked,
>  				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
>  	return (ret == 1) ? page : NULL;
>  }
> --
> 2.43.0
>
-- 
Jan Kara
SUSE Labs, CR
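
A footnote on the BTW remark above about skipping unallocated ranges, as a
user-space toy rather than kernel code: the sketch below models a two-level
table and steps over a whole missing leaf table at once instead of probing
every page in it. All names (toy_top, toy_leaf, toy_walk, walk_stats) and
the 64 x 512 geometry are made up for illustration. None of this is the
kernel's page-table API, and a real implementation would also have to deal
with locking, huge pages and more table levels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy geometry, NOT the real x86-64 layout; every name here is hypothetical. */
#define LEAF_ENTRIES 512u	/* pages covered by one leaf table */
#define TOP_ENTRIES  64u	/* leaf-table slots in the toy top level */

struct toy_leaf { bool present[LEAF_ENTRIES]; };
struct toy_top  { struct toy_leaf *leaf[TOP_ENTRIES]; };

struct walk_stats {
	size_t steps;		/* loop iterations performed */
	size_t present_pages;	/* pages that would be emitted */
	size_t skipped_pages;	/* pages that would be skipped in the dump */
};

/* Walk page indices [start, end), skipping missing leaf tables in one step. */
static struct walk_stats toy_walk(const struct toy_top *top,
				  size_t start, size_t end)
{
	struct walk_stats st = {0, 0, 0};
	size_t idx = start;

	assert(start <= end && end <= (size_t)TOP_ENTRIES * LEAF_ENTRIES);

	while (idx < end) {
		const struct toy_leaf *leaf = top->leaf[idx / LEAF_ENTRIES];

		st.steps++;
		if (!leaf) {
			/* No leaf table: jump to the next leaf boundary, clamped to end. */
			size_t next = (idx / LEAF_ENTRIES + 1) * LEAF_ENTRIES;

			if (next > end)
				next = end;
			st.skipped_pages += next - idx;
			idx = next;
			continue;
		}
		if (leaf->present[idx % LEAF_ENTRIES])
			st.present_pages++;
		else
			st.skipped_pages++;
		idx++;
	}
	return st;
}
```

With a single populated leaf table out of 64, the walk takes 512 + 63 steps
instead of 64 * 512, which is the kind of saving the remark is after.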