To: Qi Zheng, akpm@linux-foundation.org, tglx@linutronix.de,
 hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
 kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, songmuchun@bytedance.com
References: <20210819031858.98043-1-zhengqi.arch@bytedance.com>
 <20210819031858.98043-7-zhengqi.arch@bytedance.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v2 6/9] mm: free user PTE page table pages
Message-ID: <5aa3020c-fcf2-87bd-31fe-e2b5c2aafcf2@redhat.com>
Date: Thu, 19 Aug 2021 09:01:40 +0200
In-Reply-To: <20210819031858.98043-7-zhengqi.arch@bytedance.com>

On 19.08.21 05:18, Qi Zheng wrote:
> Some malloc libraries(e.g.
> jemalloc or tcmalloc) usually
> allocate the amount of VAs by mmap() and do not unmap
> those VAs. They will use madvise(MADV_DONTNEED) to free
> physical memory if they want. But the page tables do not
> be freed by madvise(), so it can produce many page tables
> when the process touches an enormous virtual address space.
>
> The following figures are a memory usage snapshot of one
> process which actually happened on our server:
>
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
>
> As we can see, the PTE page tables size is 110g, while the
> RES is 590g. In theory, the process only need 1.2g PTE page
> tables to map those physical memory. The reason why PTE page
> tables occupy a lot of memory is that madvise(MADV_DONTNEED)
> only empty the PTE and free physical memory but doesn't free
> the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory
> about 108g(best case). And the larger the difference between
> the size of VIRT and RES, the more memory we save.
>
> In this patch series, we add a pte_refcount field to the
> struct page of page table to track how many users of PTE page
> table. Similar to the mechanism of page refcount, the user of
> PTE page table should hold a refcount to it before accessing.
> The PTE page table page will be freed when the last refcount
> is dropped.
>
> While we access ->pte_refcount of a PTE page table, any of the
> following ensures the pmd entry corresponding to the PTE page
> table stability:
>
> - mmap_lock
> - anon_lock
> - i_mmap_lock
> - parallel threads are excluded by other means which
>   can make ->pmd stable(e.g. gup case)
>
> This patch does not support THP temporarily, it will be
> supported in the next patch.

Can you clarify (and document here) who exactly takes a reference on the
page table? Do I understand correctly that

a) each !pte_none() entry inside a page table takes a reference to the
   page it's contained in.
b) each page table walker temporarily grabs a page table reference
c) The PMD table the PTE table is referenced in (currently only ever a
   single one) does *not* take a reference.

So if there are no PTE entries left and nobody walks the page tables,
you can remove it? You should really extend the
description/documentation to make it clearer how exactly it's supposed
to work.

It feels kind of strange to not introduce the CONFIG_FREE_USER_PTE
Kconfig option in this patch. At least it took me a while to identify it
in the previous patch.

Maybe you should introduce the empty stubs and use them in a separate
patch, and then have this patch just introduce CONFIG_FREE_USER_PTE
along with the actual refcounting magic inside the !stub implementation.

-- 
Thanks,

David / dhildenb
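
For illustration, rules a) to c) above can be sketched as a small
userspace model. This is only a sketch under the assumption that the
three rules describe the series correctly; it is not code from the
patch. The names struct pte_table, pte_table_get(), pte_table_put(),
pte_table_set() and pte_table_clear() are hypothetical, and
PTRS_PER_PTE is set to 512 on the assumption of x86-64 with 4 KiB
pages. The model ignores locking entirely, so the PMD stability
requirement is represented only by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTRS_PER_PTE 512 /* assumption: x86-64 with 4 KiB pages */

/* Hypothetical userspace model of one PTE page table page. */
struct pte_table {
	unsigned long entries[PTRS_PER_PTE]; /* 0 models pte_none() */
	unsigned int refcount;               /* models ->pte_refcount */
	bool freed;                          /* set when the last ref is dropped */
};

/* Take one reference. A real caller must keep the PMD entry stable. */
static void pte_table_get(struct pte_table *t)
{
	assert(!t->freed);
	t->refcount++;
}

/* Drop one reference; the table counts as freed once it reaches zero. */
static void pte_table_put(struct pte_table *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0)
		t->freed = true; /* the kernel would clear the PMD entry here */
}

/* Rule a): installing a !pte_none() entry takes a reference. */
static void pte_table_set(struct pte_table *t, size_t idx, unsigned long val)
{
	assert(idx < PTRS_PER_PTE && val != 0);
	if (t->entries[idx] == 0)
		pte_table_get(t);
	t->entries[idx] = val;
}

/* Rule a) in reverse: clearing an entry (e.g. MADV_DONTNEED) drops it. */
static void pte_table_clear(struct pte_table *t, size_t idx)
{
	assert(idx < PTRS_PER_PTE);
	if (t->entries[idx] != 0) {
		t->entries[idx] = 0;
		pte_table_put(t);
	}
}
```

In this model a walker (rule b) brackets its access with
pte_table_get() and pte_table_put(). With two entries installed and
one walker active the count is 3; clearing both entries leaves the
walker's reference, and the table is marked freed only when the walker
drops it. Rule c) shows up as the absence of any reference held on
behalf of the PMD.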