Date: Thu, 30 Apr 2020 20:14:31 -0700 (PDT)
From: Hugh Dickins
To: Topi Miettinen
cc: Hugh Dickins, Helge Deller, Pete Zaitcev, linux-mm@kvack.org
Subject: Re: Tmpfs size accounting misses memory allocations
In-Reply-To: <38759fbe-77a1-f105-ccde-9529170bd9a0@gmail.com>
References: <63ab63bd-d6c6-a96a-9667-bbe8edb7cedb@gmail.com> <38759fbe-77a1-f105-ccde-9529170bd9a0@gmail.com>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Tue, 28 Apr 2020, Topi Miettinen wrote:
> On 28.4.2020 4.34, Hugh Dickins wrote:
> > On Sat, 25 Apr 2020, Topi Miettinen wrote:
> > > Hi,
> > >
> > > It seems that tmpfs does not count memory which is allocated for short
> > > symlinks or xattrs towards size= limit.
> >
> > Yes, you are right. And that is why tmpfs does not (so far) support
> > user xattrs, because the unprivileged user could take up too much
> > memory that way.
> >
> > > I guess the fix would be to change
> > > shmem_sb_info->{used_blocks,max_blocks} to use bytes as units (instead
> > > of blocks) and then add accounting and checks to shmem_symlink() and
> > > shmem_initxattrs(). Would a patch for that be acceptable?
> >
> > Thank you for offering, but I don't think a patch for exactly that would
> > be acceptable. Because "size=" has just been another way of expressing
> > "nr_blocks=" ever since it was added in 2.4.4, and tmpfs has never
> > counted the kernel metadata towards its data blocks limit.
> >
> > You could certainly argue that it should have done from the start; but
> > in order to keep the accounting suitably simple (pages rather than bytes)
> > it never did. And I believe there are many users who expect a tmpfs of a
> > certain size to be able to accommodate data of that size, who would not
> > care to have to change their scripts or apps to meet a lower limitation.
> >
> > > Another issue is that inodes aren't counted towards size= limit either,
> > > but perhaps that's intentional because there's nr_inodes= mount option
> > > for exactly that.
> >
> > Yes, tmpfs lets the nr_inodes limit be used to constrain the kernel
> > metadata (and tmpfs has a peculiarity, that it actually counts hard
> > links out of nr_inodes, in order to limit the memory spent on dentries).
> >
> > I doubt the nr_inodes limit is depended upon so critically as the
> > nr_blocks, and I think we might extend it (say, consider each 1 of
> > nr_inodes to grant approximately 1kB of unswappable lowmem metadata)
> > to enable limited user xattrs: a patch along those lines might well
> > be acceptable.
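Purely as an illustration of that direction (an untested sketch, not
something anyone has posted in this thread): if the user-xattr path were
to charge its kmalloc'd bytes against the same pool, the reservation side
might look roughly like the following, sitting beside shmem_reserve_inode()
in mm/shmem.c. The helper name shmem_reserve_metadata() and the 1kB
granularity are placeholders, nothing more:

/*
 * Untested sketch only: charge "bytes" of kernel metadata against the
 * nr_inodes budget, treating each nr_inodes unit as roughly 1kB of
 * unswappable lowmem, per the estimate above.
 */
#define SHMEM_METADATA_UNIT	1024	/* assumed: ~1kB per nr_inodes unit */

static int shmem_reserve_metadata(struct super_block *sb, size_t bytes)
{
	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
	unsigned long units = DIV_ROUND_UP(bytes, SHMEM_METADATA_UNIT);

	if (!sbinfo->max_inodes)
		return 0;	/* unlimited: nothing to account */

	spin_lock(&sbinfo->stat_lock);
	if (sbinfo->free_inodes < units) {
		spin_unlock(&sbinfo->stat_lock);
		return -ENOSPC;
	}
	sbinfo->free_inodes -= units;
	spin_unlock(&sbinfo->stat_lock);
	return 0;
}

(The corresponding release on xattr or symlink teardown is omitted here,
and whether to round up per allocation or per inode is exactly the sort
of detail such a patch would have to settle.)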
>
> I'm interested in restricting the amount of memory allocated to tmpfs
> mounts in the system rather than granting more. I've seen a system lock
> up because tmpfs mounts consumed the entire memory. Possible contributing
> factors could be use of LVM and encryption for the swap.

Yes, it is too easy to get into a terrible state that way. With the OOM
killer doing no good at all, because it's busy killing processes, which
does nothing to free the memory held by tmpfs files. I've never found
a good answer to that in general, though marking files as suitable for
truncation on OOM has been useful in special cases.

>
> Perhaps there should be a single system limit (sysctl) for total memory
> consumed by all {dev,}tmpfs mounts? Setting this to for example 75% could
> be life saving. Then also the compatibility issues would not matter and
> all memory allocations could be counted.

It's a good suggestion, though I don't much like it. Why not? Hmm: I'm
having difficulty expressing why not; let me sit on it for a bit, I may
come around to your idea there.

My resistance is partly because we already have several other schemes
for resource limiting: the nr_blocks+nr_inodes, the __vm_enough_memory()
checks (/proc/sys/vm/overcommit*), and nowadays memory cgroups. Adding
yet another is likely to make some fast paths slower.

I expect you to say that this is a fundamental problem with tmpfs, which
should not have to rely on use of MEMCG to save it: I'll agree with you.
And I'd hoped to sell you /proc/sys/vm/overcommit_memory 2 (never
overcommit), before realizing that it offers no protection at all from an
explosion of tmpfs inodes, only from an explosion of tmpfs data. Not sure
why we never realized that back in the day: perhaps that is what should
be fixed.

I see now that whether metadata (nr_inodes space) is included in size
or not is just a detail: you had quite reasonably expected it to be
included, but it happens to be additional. The problem is not that it's
additional to size, but that few of us are conscious of how high its
default is: how much memory could be gobbled up in that way (I roughly
estimate 1kB per inode+dentry). Though I think you need a malicious
explosion of inodes in several tmpfs instances to get into trouble that
way, if data sizes are already properly limited.

> > > If these are not bugs but intentional features, I think the manual
> > > page tmpfs(5) should be clearer in this respect.
> >
> > Nobody has asked for that before, it seems to have met expectations as
> > is. But a sentence could be added to the manpage: do you have wording
> > in mind?
>
> For example addition to the size option:
>
> NB: Only the contents (blocks) of regular files are counted towards the
> size limit. To limit RAM consumption also by the inodes of the
> directories, symbolic links or device special files, option nr_inodes
> must be used.

Better not get into listing directories, symbolic links or device
special files there: it's equally a weakness for regular files - if so
minded, I think you could exhaust memory with enough 0-length regular
files across enough tmpfs instances.

I'm holding back from sending that on to Michael Kerrisk for the
man-page for now: if we do decide to add a protective sysctl, we shall
want to mention that there too.
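As an aside, to make that 0-length files point concrete, here is a
trivial illustration (my own, not part of the original report). It
assumes a scratch mount such as "mount -t tmpfs -o size=1m tmpfs /mnt/t",
where /mnt/t and the 1m limit are arbitrary choices. Empty files never
touch the data-block limit, so size= is irrelevant and only nr_inodes
(or total memory) eventually stops the loop:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	long i;

	for (i = 0; i < 10000000; i++) {
		int fd;

		snprintf(path, sizeof(path), "/mnt/t/f%ld", i);
		fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
		if (fd < 0) {
			/* expect ENOSPC from nr_inodes, never from size= */
			perror("open");
			break;
		}
		close(fd);
	}
	printf("created %ld empty files under a size=1m mount\n", i);
	return 0;
}

Each file created costs an inode plus dentry (roughly 1kB by the estimate
above) of unswappable memory, none of it visible to the size= accounting.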
But one little change that occurred to me this morning: how about this
patch, to stop hiding the default size and nr_inodes from /proc/mounts
(and "mount" command), to help make people more conscious of these
limits, and encourage them to be scaled down:

--- 5.7-rc3/mm/shmem.c	2020-04-26 16:05:25.061228461 -0700
+++ linux/mm/shmem.c	2020-04-30 18:59:59.253865989 -0700
@@ -3583,11 +3583,8 @@ static int shmem_show_options(struct seq
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
 
-	if (sbinfo->max_blocks != shmem_default_max_blocks())
-		seq_printf(seq, ",size=%luk",
-			sbinfo->max_blocks << (PAGE_SHIFT - 10));
-	if (sbinfo->max_inodes != shmem_default_max_inodes())
-		seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
+	seq_printf(seq, ",size=%luk", sbinfo->max_blocks << (PAGE_SHIFT - 10));
+	seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
 	if (sbinfo->mode != (0777 | S_ISVTX))
 		seq_printf(seq, ",mode=%03ho", sbinfo->mode);
 	if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID))
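(For what it's worth, the visible effect of the above, if applied, would
just be that every tmpfs entry in /proc/mounts and in mount(8) output
carries explicit size= and nr_inodes= fields even when the mount was left
at the defaults, rather than omitting them - the actual numbers shown
being machine-dependent, of course.)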