From: Suren Baghdasaryan
Date: Mon, 15 Nov 2021 10:59:20 -0800
Subject: Re: [PATCH v11 2/3] mm: add a field to store names for private anonymous memory
References: <20211019215511.3771969-1-surenb@google.com> <20211019215511.3771969-2-surenb@google.com> <89664270-4B9F-45E0-AC0B-8A185ED1F531@google.com>
To: akpm@linux-foundation.org
Cc: Alexey Alexandrov, ccross@google.com, sumit.semwal@linaro.org,
 mhocko@suse.com, dave.hansen@intel.com, keescook@chromium.org,
 willy@infradead.org, kirill.shutemov@linux.intel.com, vbabka@suse.cz,
 hannes@cmpxchg.org,
 corbet@lwn.net, viro@zeniv.linux.org.uk, rdunlap@infradead.org,
 kaleshsingh@google.com, peterx@redhat.com, rppt@kernel.org,
 peterz@infradead.org, catalin.marinas@arm.com, vincenzo.frascino@arm.com,
 chinwen.chang@mediatek.com, axelrasmussen@google.com, aarcange@redhat.com,
 jannh@google.com, apopple@nvidia.com, jhubbard@nvidia.com,
 yuzhao@google.com, will@kernel.org, fenghua.yu@intel.com,
 thunder.leizhen@huawei.com, hughd@google.com, feng.tang@intel.com,
 jgg@ziepe.ca, guro@fb.com, tglx@linutronix.de, krisman@collabora.com,
 chris.hyser@oracle.com, pcc@google.com, ebiederm@xmission.com,
 axboe@kernel.dk, legion@kernel.org, eb@emlix.com, gorcunov@gmail.com,
 pavel@ucw.cz, songmuchun@bytedance.com, viresh.kumar@linaro.org,
 thomascedeno@google.com, sashal@kernel.org, cxfcosmos@gmail.com,
 linux@rasmusvillemoes.dk, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-mm@kvack.org, kernel-team@android.com

On Thu, Oct 28, 2021 at 3:08 PM Suren Baghdasaryan wrote:
>
> On Wed, Oct 27, 2021 at 1:01 PM Suren Baghdasaryan wrote:
> >
> > On Wed, Oct 27, 2021 at 11:35 AM Alexey Alexandrov wrote:
> > >
> > > > On Oct 19, 2021, at 2:55 PM, Suren Baghdasaryan wrote:
> > > >
> > > > From: Colin Cross
> > > >
> > > > In many userspace applications, and especially in VM
> > > > based applications like those Android uses heavily, there are
> > > > multiple different allocators in use. At a minimum there is libc
> > > > malloc and the stack, and in many cases there are libc malloc, the
> > > > stack, direct syscalls to mmap anonymous memory, and multiple VM
> > > > heaps (one for small objects, one for big objects, etc.). Each of
> > > > these layers usually has its own tools to inspect its usage;
> > > > malloc by compiling a debug version, the VM through heap
> > > > inspection tools, and for direct syscalls there is usually no way
> > > > to track them.
> > > >
> > > > On Android we heavily use a set of tools that use an extended
> > > > version of the logic covered in Documentation/vm/pagemap.txt to
> > > > walk all pages mapped in userspace and slice their usage by
> > > > process, shared (COW) vs. unique mappings, backing, etc. This can
> > > > account for real physical memory usage even in cases like fork
> > > > without exec (which Android uses heavily to share as many private
> > > > COW pages as possible between processes), Kernel SamePage Merging,
> > > > and clean zero pages. It produces a measurement of the pages that
> > > > only exist in that process (USS, for unique), and a measurement of
> > > > the physical memory usage of that process with the cost of shared
> > > > pages being evenly split between processes that share them (PSS).
> > > >
> > > > If all anonymous memory is indistinguishable then figuring out the
> > > > real physical memory usage (PSS) of each heap requires either a
> > > > pagemap walking tool that can understand the heap debugging of
> > > > every layer, or for every layer's heap debugging tools to
> > > > implement the pagemap walking logic, in which case it is hard to
> > > > get a consistent view of memory across the whole system.
> > > >
> > > > Tracking the information in userspace leads to all sorts of problems.
> > > > It either needs to be stored inside the process, which means every
> > > > process has to have an API to export its current heap information
> > > > upon request, or it has to be stored externally in a filesystem
> > > > that somebody needs to clean up on crashes. It needs to be readable
> > > > while the process is still running, so it has to have some sort of
> > > > synchronization with every layer of userspace. Efficiently tracking
> > > > the ranges requires reimplementing something like the kernel vma
> > > > trees, and linking to it from every layer of userspace. It requires
> > > > more memory, more syscalls, more runtime cost, and more complexity
> > > > to separately track regions that the kernel is already tracking.
> > > >
> > > > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to
> > > > show a userspace-provided name for anonymous vmas. The names of
> > > > named anonymous vmas are shown in /proc/pid/maps and
> > > > /proc/pid/smaps as [anon:<name>].
> > > >
> > > > Userspace can set the name for a region of memory by calling
> > > > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> > > > Setting the name to NULL clears it. The name is limited to 80 bytes
> > > > including the NUL terminator and is checked to contain only
> > > > printable ASCII characters (including space), except
> > > > '[',']','\','$' and '`'. ASCII strings are used so that vmas have
> > > > descriptive identifiers which can be understood by users reading
> > > > /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a
> > > > given system and they can include some variable parts such as the
> > > > name of the allocator or a library, tid of the thread using it, etc.
> > > >
> > > > The name is stored in a pointer in the shared union in vm_area_struct
> > > > that points to a null terminated string.
> > > > Anonymous vmas that have the same name (equivalent strings) and
> > > > are otherwise mergeable will be merged. The name pointers are not
> > > > shared between vmas even if they contain the same name. The name
> > > > pointer is stored in a union with fields that are only used on
> > > > file-backed mappings, so it does not increase memory usage.
> > > >
> > > > The CONFIG_ANON_VMA_NAME kernel configuration option is introduced
> > > > to enable this feature. It keeps the feature disabled by default to
> > > > prevent any additional memory overhead and to avoid confusing
> > > > procfs parsers on systems which are not ready to support named
> > > > anonymous vmas.
> > > >
> > > > The patch is based on the original patch developed by Colin Cross,
> > > > more specifically on its latest version [1] posted upstream by
> > > > Sumit Semwal. It used a userspace pointer to store vma names. In
> > > > that design, name pointers could be shared between vmas. However,
> > > > during the last upstreaming attempt, Kees Cook raised concerns [2]
> > > > about this approach and suggested copying the name into kernel
> > > > memory space, performing validity checks [3] and storing it as a
> > > > string referenced from vm_area_struct.
> > > > One big concern is about fork() performance, since fork() would
> > > > need to strdup anonymous vma names. Dave Hansen suggested
> > > > experimenting with the worst-case scenario of forking a process
> > > > with 64k vmas having the longest possible names [4]. I ran this
> > > > experiment on an ARM64 Android device and recorded a worst-case
> > > > regression of almost 40% when forking such a process. This
> > > > regression is addressed in the followup patch which replaces the
> > > > pointer to a name with a refcounted structure that allows sharing
> > > > the name pointer between vmas of the same name. Instead of
> > > > duplicating the string during fork() or when splitting a vma it
> > > > increments the refcount.
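
For readers who want to try the interface described in the quoted patch, here is a minimal userspace sketch of the naming rules and the prctl() call. It is not taken from the patch. The helper names anon_name_is_valid() and set_anon_vma_name() and the ANON_NAME_MAX_LEN macro are hypothetical, and the fallback #defines are only there in case the installed headers predate the feature. The validity check simply mirrors the rules stated in the description (80 bytes including the NUL, printable ASCII, none of '[', ']', '\', '$', '`'); the kernel does its own checking, and the description does not say how an empty name is treated, so the sketch does not special-case it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values as in linux/prctl.h. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Hypothetical local name for the limit quoted in the description:
 * 80 bytes including the NUL terminator. */
#define ANON_NAME_MAX_LEN 80

/* Hypothetical helper: userspace mirror of the documented name rules. */
static bool anon_name_is_valid(const char *name)
{
    static const char banned[] = "[]\\$`";
    size_t len;

    if (name == NULL)
        return false;

    len = strlen(name);
    if (len + 1 > ANON_NAME_MAX_LEN)
        return false;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];

        if (c < 0x20 || c > 0x7e)        /* outside printable ASCII */
            return false;
        if (strchr(banned, c) != NULL)   /* one of [ ] \ $ ` */
            return false;
    }
    return true;
}

/* Hypothetical wrapper: name the region [start, start + len), or clear
 * its name when name is NULL. Returns 0 on success, -1 with errno set. */
static int set_anon_vma_name(void *start, size_t len, const char *name)
{
    if (name != NULL && !anon_name_is_valid(name)) {
        errno = EINVAL;   /* rejected locally, before reaching the kernel */
        return -1;
    }
    return prctl(PR_SET_VMA, (unsigned long)PR_SET_VMA_ANON_NAME,
                 (unsigned long)start, (unsigned long)len,
                 (unsigned long)name);
}
```

A caller would typically mmap() an anonymous private region, pass its address and length to set_anon_vma_name() with a name such as "libc malloc", and then look for the [anon:<name>] entry in /proc/self/maps. On a kernel built without CONFIG_ANON_VMA_NAME the prctl() call is expected to fail, so the return value should be checked rather than assumed.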
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
> > > > [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> > > > [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> > > > [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
> > > >
> > > > Changes for prctl(2) manual page (in the options section):
> > > >
> > > > PR_SET_VMA
> > > >     Sets an attribute specified in arg2 for virtual memory areas
> > > >     starting from the address specified in arg3 and spanning the
> > > >     size specified in arg4. arg5 specifies the value of the attribute
> > > >     to be set. Note that assigning an attribute to a virtual memory
> > > >     area might prevent it from being merged with adjacent virtual
> > > >     memory areas due to the difference in that attribute's value.
> > > >
> > > >     Currently, arg2 must be one of:
> > > >
> > > >     PR_SET_VMA_ANON_NAME
> > > >         Set a name for anonymous virtual memory areas. arg5 should
> > > >         be a pointer to a null-terminated string containing the
> > > >         name. The name length including null byte cannot exceed
> > > >         80 bytes. If arg5 is NULL, the name of the appropriate
> > > >         anonymous virtual memory areas will be reset. The name
> > > >         can contain only printable ascii characters (including
> > > >         space), except '[',']','\','$' and '`'.
> > > >
> > > >         This feature is available only if the kernel is built with
> > > >         the CONFIG_ANON_VMA_NAME option enabled.
> > >
> > > For what it’s worth, it’s definitely interesting to see this going upstream.
> > > In particular, we would use it for high-level grouping of the data in
> > > production profiling when proper symbolization is not available:
> > >
> > > * JVM could associate a name with the memory regions it uses for the JIT
> > >   code so that Linux perf data are associated with a high level name like
> > >   "Java JIT" even if the proper Java JIT profiling is not enabled.
> > > * Similar for other JIT engines like v8 - they could annotate the memory
> > >   regions they manage and use as well.
> > > * Traditional memory allocators like tcmalloc can use this as well so
> > >   that the associated name is used in data access profiling via Linux perf.
> >
> > Hi Alexey,
> > Thanks for providing your feedback! Nice to hear that this can be
> > useful outside of Android.
>
> Folks, it has been almost two weeks since I posted this v11 patchset.
> Is there anything else I can do to advance it towards merging?

Hi Andrew,
I haven't seen any feedback on my patchset for some time now. I think
I addressed all the questions and comments (please correct me if I
missed anything). Can it be accepted as is or is there something I
should address further?
From the feedback, I see that there are several interested parties in
this patchset (albeit all from different teams at Google) but maybe
when it's merged more users will start using it. I believe I've done
everything I could to ensure no/minimal impact on the users who don't
use this feature.
Please advise.
Thanks,
Suren.