Date: Tue, 12 Jul 2022 18:05:17 -0700
From: Zach O'Keefe
To: Andrew Morton
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
 Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park,
 Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm@kvack.org,
 Andrea Arcangeli, Arnd Bergmann, Axel Rasmussen, Chris Kennelly,
 Chris Zankel, Helge Deller, Hugh Dickins, Ivan Kokshaysky,
 "James E.J. Bottomley", Jens Axboe, "Kirill A. Shutemov", Matt Turner,
 Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov,
 Thomas Bogendoerfer
Subject: Re: [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()
References: <20220706235936.2197195-1-zokeefe@google.com>
 <20220706235936.2197195-13-zokeefe@google.com>
 <20220708134732.fd9cc80739a3b9781a1ecf9e@linux-foundation.org>
In-Reply-To: <20220708134732.fd9cc80739a3b9781a1ecf9e@linux-foundation.org>

On Jul 08 13:47, Andrew Morton wrote:
> On Wed, 6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" wrote:
>
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of its own memory.
>
> This is maximally restrictive.  I didn't see any discussion of why this
> was chosen either here or in the [0/N].  I expect that people will be
> coming after us to relax this.
>
> So please do add (a lot of) words explaining this decision, and
> describing what might be done in the future to relax it.

Hey Andrew,

Thanks for taking the time to look at this series.

After taking a look through capabilities(7), I think you're absolutely
right to call this out - thanks for that.  I think move_pages(2) is the
best comparison here.  There, we use CAP_SYS_NICE +
PTRACE_MODE_READ_REALCREDS to ensure the caller is allowed to copy and
move memory of an external process between nodes.  This is also the
current default for process_madvise(2).

However, MADV_COLLAPSE is additionally able to:

1) Influence the RSS of a process / memory charged to a cgroup (by
   collapsing a hugepage-sized/aligned region with nonresident pages).
   Note that for file/shmem, this might cause an increase in file/shmem
   RSS for non-target mm's.
2) Bypass sysfs THP settings.

For (1), process_madvise(MADV_WILLNEED) could presumably be used to
increase RSS / memcg usage, and we don't require any additional
capabilities there.

For (2), I don't think there is an easy precedent.  I think it makes
sense to require that the caller has write permission to
/sys/kernel/mm/transparent_hugepage/*.  AFAICT, this means an effective
user ID of 0 ... which is about as restrictive as CAP_SYS_ADMIN.  One
idea would be to use CAP_SETUID, since these threads could always assume
a real/effective user ID of 0.

That said, I'm not sure CAP_SETUID is needed, and perhaps the existing
process_madvise(2) restrictions are enough, given that CAP_SYS_NICE
already confers the ability to copy around all the same memory.  We'd
just be doing some additional page table manipulations after some of
that copying - which should (mostly) be transparent to the users.  I.e.
I don't think it expands CAP_SYS_NICE's "security silo" that much.
Could be wrong, though.

Again, thanks for your time,
Zach
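
For comparison, the three gating options discussed above can be written
out as plain predicates.  This is only a userspace sketch in C, not
kernel code: struct caller and the collapse_ok_*() names are
hypothetical, each boolean merely stands in for the corresponding kernel
check, and the CAP_SETUID variant is one possible reading of the idea
(CAP_SETUID required on top of the existing process_madvise(2) checks).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a caller; not a kernel structure. */
struct caller {
	bool cap_sys_admin;   /* holds CAP_SYS_ADMIN */
	bool cap_sys_nice;    /* holds CAP_SYS_NICE */
	bool cap_setuid;      /* holds CAP_SETUID */
	bool ptrace_read_ok;  /* passes the ptrace read-access check on the target */
	bool targets_self;    /* the target memory is the caller's own */
};

/* Option A: the rule as posted in this patch (own memory or CAP_SYS_ADMIN). */
bool collapse_ok_as_posted(const struct caller *c)
{
	assert(c);
	return c->targets_self || c->cap_sys_admin;
}

/* Option B: reuse the existing process_madvise(2) default unchanged. */
bool collapse_ok_default(const struct caller *c)
{
	assert(c);
	return c->cap_sys_nice && c->ptrace_read_ok;
}

/* Option C (assumed reading): existing default plus CAP_SETUID. */
bool collapse_ok_setuid(const struct caller *c)
{
	assert(c);
	return collapse_ok_default(c) && c->cap_setuid;
}
```

Under this model, a caller holding only CAP_SYS_NICE with ptrace read
access to the target is rejected by options A and C but accepted by
option B, which is the relaxation being weighed.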