From: David Hildenbrand
Organization: Red Hat GmbH
To: "linux-mm@kvack.org"
Subject: Dynamically reserving swap space for MAP_NORESERVE mappings
Message-ID: <989ec2d2-efe9-6608-b132-3167878aacb3@redhat.com>
Date: Fri, 12 Feb 2021 14:00:00 +0100

Hi,

I'm planning on making use of MAP_NORESERVE for sparse memory regions,
but I still want to have some way to reduce the chance of running into
random OOMs, similar to what we get with !MAP_NORESERVE on private
mappings. I want dynamic reservations of swap space.

The rough idea is having a large mmap(MAP_NORESERVE) area in which I
dynamically populate/discard memory to control the memory consumption,
similar to a memory allocator - but rather in the context of dynamically
resizing VMs. In case the user requests a dangerous configuration ("add
50GB" instead of "add 5GB"), I would rather fail gracefully early and
disallow growing a VM instead of crashing the VM later on.

For anything file-backed (MAP_SHARED) this is fairly easy: fallocate()
can preallocate memory. If it fails, there is not sufficient backing
storage. (It might be nice to also only reserve and not preallocate for
hugetlbfs, but that's another story.)

For anonymous memory / MAP_PRIVATE it's complicated. I want to avoid any
kind of remapping (mmap(MAP_FIXED | !MAP_NORESERVE)) within the sparse
region, as it is expensive, I can easily run into mapping limits, and it
creates quite some problems with other parallel features that are
enabled (e.g., userfaultfd).

So I actually want to decide myself how much memory is reserved, and
have a way to increase it (and fail if impossible) or decrease it. Doing
this per VMA is not possible, as it's unclear what to do on VMA
splits/unmappings.
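
To make the populate/discard scheme described above concrete, here is a
rough user-space sketch. It is only an illustration under assumptions:
the helper names (sparse_region_create() and friends) and the sizes are
hypothetical, and it deliberately shows the gap discussed here, namely
that creating the region reserves nothing and populating it has no way
to fail gracefully.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical sizes, chosen only for illustration. */
#define REGION_SIZE (64UL << 20)  /* 64 MiB of sparse address space */
#define CHUNK_SIZE  (2UL << 20)   /* 2 MiB populated/discarded at a time */

/*
 * Create the sparse region. With MAP_NORESERVE the kernel is asked not
 * to reserve swap space for the mapping (in strict overcommit mode the
 * flag is ignored).
 */
static void *sparse_region_create(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Populate a chunk by writing to it. There is no error return: if
 * memory runs out, the process may get killed instead of seeing a
 * failure it could handle.
 */
static void sparse_region_populate(void *region, size_t off, size_t len)
{
	memset((char *)region + off, 0xAA, len);
}

/*
 * Discard a chunk again. For private anonymous memory, later reads of
 * the range see zero-filled pages.
 */
static int sparse_region_discard(void *region, size_t off, size_t len)
{
	return madvise((char *)region + off, len, MADV_DONTNEED);
}
```

A caller would create the region once, then call
sparse_region_populate()/sparse_region_discard() on chunk-aligned
offsets as the VM grows and shrinks. A dynamic reservation would have to
be taken before the populate step to turn the potential OOM into an
error that can be reported to the user.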

One idea is concurrently resizing a parallel, pre-reserved
mmap(MAP_PRIVATE|MAP_ANON) area, which would fail when trying to grow it
via mmap(MAP_FIXED) and there is not sufficient swap. This feels like
the wrong way to achieve the goal, and it might fail due to per-process
limits.

My naive approach would be having a syscall that allows for
increasing/decreasing an additional per-process reservation like:

	if (!delta)
		return 0;

	if (mmap_write_lock_killable(mm))
		return -EINTR;

	if (delta > 0) {
		/* Growing: fail early if the memory cannot be accounted. */
		if (security_vm_enough_memory_mm(mm, delta)) {
			mmap_write_unlock(mm);
			return -ENOMEM;
		}
	} else {
		/* Shrinking: refuse to release more than was reserved. */
		if (-delta >= mm->extra_nr_accounted) {
			mmap_write_unlock(mm);
			return -EINVAL;
		}
		vm_unacct_memory(-delta);
	}
	mm->extra_nr_accounted += delta;

	mmap_write_unlock(mm);
	return 0;

Or setting an explicit reservation instead / being able to observe the
current reservation.

We could limit it to the actual size of all VMAs that are not accounted
due to MAP_NORESERVE, so we would implicitly check for may_expand_vm(),
as that has been checked when the mmap(MAP_NORESERVE) was created. Of
course, we would have to update it when unmapping applicable
MAP_NORESERVE areas (I will have to think about temporary remappings in
user space). I'm not sure if that is required, but it feels like there
should be an upper limit besides the one in
security_vm_enough_memory_mm().

Which other limits do we have that we would have to consider?

Alternatives? Thoughts? Am I missing something important?

Thanks!

-- 
Thanks,

David / dhildenb