Message-ID: <4409acf9-4927-861e-997a-6e3db42d6851@redhat.com>
Date: Tue, 23 Nov 2021 13:02:03 +0100
From: David Hildenbrand
To: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
 io_uring Mailing List, Pavel Begunkov, linux-mm@kvack.org
References: <20211028080813.15966-1-sir@cmpwn.com>
 <593aea3b-e4a4-65ce-0eda-cb3885ff81cd@gnuweeb.org>
 <20211115203530.62ff33fdae14927b48ef6e5f@linux-foundation.org>
 <20211116114727.601021d0763be1f1efe2a6f9@linux-foundation.org>
 <20211116133750.0f625f73a1e4843daf13b8f7@linux-foundation.org>
 <8f219a64-a39f-45f0-a7ad-708a33888a3b@www.fastmail.com>
 <333cb52b-5b02-648e-af7a-090e23261801@redhat.com>
 <5f998bb7-7b5d-9253-2337-b1d9ea59c796@redhat.com>
 <3adc55d3-f383-efa9-7319-740fc6ab5d7a@kernel.dk>
Organization: Red Hat
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

>>>>
>>>> We should just make
>>>> this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>>> like what was suggested, if that will help move things forward. IMHO the
>>>> 32MB machine is mostly a theoretical case, but whatever .
>>>
>>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>>
>>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>>> system completely and deeply mess up the MM. Oh my.
>>
>> We're talking per-user limits here. But if you want to talk hyperbole,
>> then 64K multiplied by some other random number will also allow
>> everything to be pinned, potentially.
>>
>
> Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
> the worst case.
>

To make it clear why I keep complaining about FOLL_LONGTERM for
unprivileged users even if we're talking about "only" 0.1% of RAM ...

On x86-64 a 2 MiB THP (IOW pageblock) has 512 sub-pages. If we manage to
FOLL_LONGTERM a single sub-page, we can make the THP unavailable to the
system, meaning we cannot form a THP by
compaction/swapping/migration/whatever at that physical memory area until
we unpin that single page. We essentially "block" a THP from forming at
that physical memory area.

So with a single 4k page we can block one 2 MiB THP. With 0.1% of RAM
pinned we can, therefore, block 0.1% * 512 = 51.2% of all THP.
Theoretically, of course, if the stars align.

... or if we're malicious or unlucky. I wrote a reproducer this morning
that tries blocking as many THP as it can:

https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c

------------------------------------------------------------------------

Example on my 16 GiB (8192 THP "in theory") notebook with some
applications running in the background.
$ uname -a
Linux t480s 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ ./io_uring_thp
PAGE size: 4096 bytes (sensed)
THP size: 2097152 bytes (sensed)
RLIMIT_MEMLOCK: 16777216 bytes (sensed)
IORING_MAX_REG_BUFFERS: 16384 (guess)
Pages per THP: 512
User can block 4096 THP (8589934592 bytes)
Process can block 4096 THP (8589934592 bytes)
Blocking 1 THP
Blocking 2 THP
...
Blocking 3438 THP
Blocking 3439 THP
Blocking 3440 THP
Blocking 3441 THP
Blocking 3442 THP
... and after a while
Blocking 4093 THP
Blocking 4094 THP
Blocking 4095 THP
Blocking 4096 THP

$ cat /proc/`pgrep io_uring_thp`/status
Name:           io_uring_thp
Umask:          0002
State:          S (sleeping)
[...]
VmPeak:         6496 kB
VmSize:         6496 kB
VmLck:          0 kB
VmPin:          16384 kB
VmHWM:          3628 kB
VmRSS:          1580 kB
RssAnon:        160 kB
RssFile:        1420 kB
RssShmem:       0 kB
VmData:         4304 kB
VmStk:          136 kB
VmExe:          8 kB
VmLib:          1488 kB
VmPTE:          48 kB
VmSwap:         0 kB
HugetlbPages:   0 kB
CoreDumping:    0
THP_enabled:    1

$ cat /proc/meminfo
MemTotal:       16250920 kB
MemFree:        11648016 kB
MemAvailable:   11972196 kB
Buffers:        50480 kB
Cached:         1156768 kB
SwapCached:     54680 kB
Active:         704788 kB
Inactive:       3477576 kB
Active(anon):   427716 kB
Inactive(anon): 3207604 kB
Active(file):   277072 kB
Inactive(file): 269972 kB
...
Mlocked:        5692 kB
SwapTotal:      8200188 kB
SwapFree:       7742716 kB
...
AnonHugePages:  0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages:  0 kB
FilePmdMapped:  0 kB

Let's see how many contiguous 2M pages we can still get as root:

$ echo 1 > /proc/sys/vm/compact_memory
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
537
... keep retrying a couple of times
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
583

Let's kill the io_uring process and try again:

$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4766
... keep retrying a couple of times
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4823

------------------------------------------------------------------------

I'm going to leave the judgment of how bad this is or isn't to the
educated reader, and I'll stop spending time on this as I have more
important things to work on.

To summarize my humble opinion:

1) I am not against raising the default memlock limit if it's for a sane
use case. While mlock itself can be somewhat bad for swap, FOLL_LONGTERM,
which is also checked against the memlock limit here, is the real issue.
This patch explicitly states the "IOURING_REGISTER_BUFFERS" use case,
though, and that makes me nervous.

2) Exposing FOLL_LONGTERM to unprivileged users should be avoided as best
we can; in an ideal world, we wouldn't have it at all; in a sub-optimal
world we'd have it only for use cases that really require it due to HW
limitations. Ideally we'd even have yet another limit for this, because
mlock != FOLL_LONGTERM.

3) IORING_REGISTER_BUFFERS shouldn't use FOLL_LONGTERM for use by
unprivileged users. We should provide a variant that doesn't rely on
FOLL_LONGTERM or even rely on the memlock limit.

Sorry to the patch author for bringing it up in response to the patch.
After all, this patch just does what some distros already do (many
distros even provide higher limits than 8 MiB!). I would be curious why
some distros already have such high values ... and if it's already
because of IORING_REGISTER_BUFFERS after all.

--
Thanks,

David / dhildenb
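
The worst-case blocking arithmetic above can be sketched in a few lines.
This is only a back-of-the-envelope illustration, not the reproducer: the
function names are hypothetical, the sizes are the x86-64 values from the
example run, and it assumes every pinned 4k page lands in a different
2 MiB area.

```python
# Worst-case THP blocking arithmetic; a sketch, not the io_uring_thp.c logic.
# All names here are hypothetical and exist only for illustration.
PAGE_SIZE = 4096                       # bytes, base page size from the run
THP_SIZE = 2 * 1024 * 1024             # bytes, 2 MiB THP / pageblock
PAGES_PER_THP = THP_SIZE // PAGE_SIZE  # sub-pages per THP


def blocked_thp_fraction(pinned_fraction):
    """Fraction of THP-sized areas that can be blocked when
    `pinned_fraction` of RAM is pinned, assuming (worst case) that each
    pinned base page sits in a different THP-sized area."""
    return min(1.0, pinned_fraction * PAGES_PER_THP)


def blocked_thp_count(pin_limit_bytes):
    """Number of THP-sized areas a pin budget can block: one page each."""
    return pin_limit_bytes // PAGE_SIZE


if __name__ == "__main__":
    # 0.1% of RAM pinned, as discussed in the thread.
    print(f"{blocked_thp_fraction(0.001):.1%} of all THP")
    # RLIMIT_MEMLOCK of 16 MiB, as sensed in the example run.
    n = blocked_thp_count(16 * 1024 * 1024)
    print(n, "THP,", n * THP_SIZE, "bytes")
```

With the 16 MiB limit this gives 4096 THP covering 8589934592 bytes, the
same figures the reproducer prints for "User can block".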