From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 10 Feb 2021 17:57:13 +0100
From: Michal Hocko
To: Vlastimil Babka
Cc: Milan Broz, linux-mm@kvack.org, Linux Kernel Mailing List, Mikulas Patocka
Subject: Re: Very slow unlockall()
References: <70885d37-62b7-748b-29df-9e94f3291736@gmail.com>
 <20210108134140.GA9883@dhcp22.suse.cz>
 <9474cd07-676a-56ed-1942-5090e0b9a82f@suse.cz>
 <6eebb858-d517-b70d-9202-f4e84221ed89@suse.cz>
 <273db3a6-28b1-6605-1743-ef86e7eb2b72@suse.cz>
In-Reply-To: <273db3a6-28b1-6605-1743-ef86e7eb2b72@suse.cz>

On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> On 2/1/21 8:19 PM, Milan Broz wrote:
> > On 01/02/2021 19:55, Vlastimil Babka wrote:
> >> On 2/1/21 7:00 PM, Milan Broz wrote:
> >>> On 01/02/2021 14:08, Vlastimil Babka wrote:
> >>>> On 1/8/21 3:39 PM, Milan Broz wrote:
> >>>>> On 08/01/2021 14:41, Michal Hocko wrote:
> >>>>>> On Wed 06-01-21 16:20:15,
Milan Broz wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code,
> >>>>>>> and someone tried to use it with a hardened memory allocator library.
> >>>>>>>
> >>>>>>> Execution time increased to extremes (minutes) and, as we found, the
> >>>>>>> problem is in munlockall().
> >>>>>>>
> >>>>>>> Here is a plain reproducer for the core problem, without any external code
> >>>>>>> - unlocking takes more than 30 seconds on a Fedora rawhide kernel!
> >>>>>>> I can reproduce it on 5.10 kernels and Linus' git.
> >>>>>>>
> >>>>>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE
> >>>>>>> (later never used). The real code of course does something more useful,
> >>>>>>> but the problem is the same.
> >>>>>>>
> >>>>>>> #include <stdio.h>
> >>>>>>> #include <stdlib.h>
> >>>>>>> #include <unistd.h>
> >>>>>>> #include <sys/mman.h>
> >>>>>>>
> >>>>>>> int main (int argc, char *argv[])
> >>>>>>> {
> >>>>>>>     void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >>
> >> So, this is a 2TB memory area, and PROT_NONE means it's never actually populated,
> >> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
> >> PROT_WRITE there, the mlockall() starts taking ages.
> >>
> >> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
> >> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
> >> expect such a scenario to be uncommon, so better clarify first.
> >
> > It is just a simple reproducer of the underlying problem, as suggested here:
> > https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
> >
> > We use mlockall() in cryptsetup, and with hardened malloc it slows down unlocking significantly.
> > (For the real-case problem please read the whole issue report above.)
> OK, finally read through the bug report, and learned two things:
>
> 1) the PROT_NONE is indeed an intentional part of the reproducer
> 2) Linux mailing lists still have a bad reputation and people avoid them. That's
>    sad :( Well, thanks for overcoming that :)
>
> Daniel there says "I think the Linux kernel implementation of mlockall is quite
> broken and tries to lock all the reserved PROT_NONE regions in advance which
> doesn't make any sense."
>
> From my testing this doesn't seem to be the case, as the mlockall() part is very
> fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
> to be slow when the area is changed to PROT_READ|PROT_WRITE. But the munlockall()
> part is slow even with PROT_NONE, as we don't skip the PROT_NONE areas there. We
> probably can't just skip them, as they might actually contain mlocked pages if
> those were faulted first with PROT_READ/PROT_WRITE and only then changed to
> PROT_NONE.

The mlock code is quite easy to misunderstand, but IIRC the mlock part should be
rather straightforward. It marks the VMAs as locked, does some merging/splitting
where appropriate, and finally populates the range by gup. That population should
fail here because the VMA allows neither read nor write, right? And mlock should
report that. mlockall will not care, because it ignores errors during population.
So there is no page table walk happening.

> And the munlock (munlock_vma_pages_range()) is slow, because it uses
> follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
> always traversing all levels of page tables from scratch. Funnily enough,
> speeding this up was my first linux-mm series years ago. But the speedup only
> works if the pte's are present, which is not the case for unpopulated PROT_NONE
> areas. That use case was unexpected back then. We should probably convert this
> code to a proper page table walk. If there are large areas with unpopulated pmd
> entries (or even higher levels) we would traverse them very quickly.
Yes, this is a good idea. I suspect it will be a little bit tricky to do without
duplicating a large part of the gup page table walker.

-- 
Michal Hocko
SUSE Labs