From: Yang Shi
Date: Mon, 29 Nov 2021 13:32:41 -0800
Subject: Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED
To: David Hildenbrand
Cc: Peter Xu, Shakeel Butt, "Kirill A . Shutemov", Zi Yan, Matthew Wilcox, Andrew Morton, Linux MM, Linux Kernel Mailing List, David Rientjes

On Fri, Nov 26, 2021 at 1:17 AM David Hildenbrand wrote:
>
> >>
> >> Thanks for making me rerun this and yes indeed I had a very silly bug in the
> >> benchmark code (i.e. madvise the same page for the whole loop) and this is
> >> indeed several times slower than without the patch (sorry David for misleading
> >> you).
>
> No worries, BUGs happen :)
>
> >>
> >> To better understand what is happening, I profiled the benchmark:
> >>
> >> -   31.27%     0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
> >>    - 31.27% zap_page_range_sync
> >>       - 30.25% split_local_deferred_list
> >>          - 30.16% split_huge_page_to_list
> >>             - 21.05% try_to_migrate
> >>                + rmap_walk_anon
> >>             - 7.47% remove_migration_ptes
> >>                + 7.34% rmap_walk_locked
> >>       + 1.02% zap_page_range_details
> > Makes sense, thanks for verifying it, Shakeel. I forgot it'll also walk
> > itself.
> >
> > I believe this effect will be exaggerated when the mapping is shared,
> > e.g. shmem file thp between processes. What's worse is that when one process
> > DONTNEED one 4k page, all the rest mms will need to split the huge pmd without
> > even being noticed, so that's a direct suffer from perf degrade.
>
> Would this really apply to MADV_DONTNEED on shmem, and would deferred
> splitting apply on shmem? I'm constantly confused about shmem vs. anon,
> but I would have assumed that shmem is fd-based and we wouldn't end up
> in rmap_walk_anon. For shmem, the pagecache would contain the THP which
> would stick around and deferred splits don't even apply.

The deferred split is for anon THP only; it doesn't apply to shmem THP.

As for the rmap walks, anon THP needs two of them but shmem THP needs
just one. Both need one rmap walk to unmap the THP before doing the
actual split, but anon THP needs a second rmap walk to reinstall PTEs
for the still-mapped pages. That is not needed for shmem pages, since
they can still be reached via the page cache.

>
> But again, I'm constantly confused so I'd love to be enlighted.
>
> >
> >>
> >> The overhead is not due to copying page flags but rather due to two rmap walks.
> >> I don't think this much overhead is justified for current users of MADV_DONTNEED
> >> and munmap. I have to rethink the approach.
>
> Most probably not.
>
> >
> > Some side notes: I digged out the old MADV_COLLAPSE proposal right after I
> > thought about MADV_SPLIT (or any of its variance):
> >
> > https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> >
> > My memory was that there's some issue to be solved so that was blocked, however
> > when I read the thread it sounds like the list was mostly reaching a consensus
> > on considering MADV_COLLAPSE being beneficial. Still copying DavidR in case I
> > missed something important.
> >
> > If we think MADV_COLLAPSE can help to implement an userspace (and more
> > importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> > kcompactd, perhaps.
> >
> > That's probably a bit off topic of this specific discussion on the specific use
> > case, but so far it seems all reasonable and discussable.
>
> User space can trigger a split manually using some MADV hackery. But it
> can only be used for the use case here, where we actually want to zap a
> page.
>
> 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE
> and the compound page.
> 2. MADV_DONTNEED either the complete range or the single 4k page.
>
> --
> Thanks,
>
> David / dhildenb
>
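The benchmark bug Shakeel mentions (madvise on the same page for the whole loop) hides the split cost. Only the first MADV_DONTNEED then hits an intact THP, and every later call zaps a page that is already gone. Shakeel's benchmark is not shown in the thread, so the loop below is a hypothetical reconstruction of the idea. `map_thp_region`, `dontneed_bench`, the `same_page` switch and the 2 MiB THP size are illustrative assumptions. With `same_page=False`, each iteration targets a different 2 MiB-aligned chunk, so every call has a fresh THP to split.

```python
import ctypes
import mmap
import time

PAGE = mmap.PAGESIZE
HPAGE = 2 * 1024 * 1024  # assumption: 2 MiB PMD-sized THP


def map_thp_region(nr_thps: int):
    """Map and populate nr_thps 2 MiB-aligned chunks of private anonymous memory.

    Returns (buf, base), where base is the offset of the first aligned chunk.
    """
    # One extra HPAGE of slack guarantees nr_thps aligned chunks fit.
    buf = mmap.mmap(-1, (nr_thps + 1) * HPAGE, flags=mmap.MAP_PRIVATE)
    hint = getattr(mmap, "MADV_HUGEPAGE", None)
    if hint is not None:
        try:
            buf.madvise(hint)
        except OSError:
            pass  # no THP support: the loop still runs on 4k pages
    buf[:] = b"\xab" * len(buf)  # fault every page in
    base = (-ctypes.addressof(ctypes.c_char.from_buffer(buf))) % HPAGE
    return buf, base


def dontneed_bench(buf, base, nr_thps, same_page=False):
    """Time one single-page MADV_DONTNEED per THP, in seconds.

    same_page=True mimics the bug: every iteration zaps the same page, so
    only the first call finds an intact THP.
    """
    start = time.perf_counter()
    for i in range(nr_thps):
        off = base if same_page else base + i * HPAGE
        buf.madvise(mmap.MADV_DONTNEED, off, PAGE)
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 8
    for same in (False, True):
        region, first = map_thp_region(n)  # fresh mapping for each variant
        print(f"same_page={same}: {dontneed_bench(region, first, n, same):.6f}s")
        region.close()
```

Each variant gets a freshly mapped region, because a run that has already split and zapped the target pages would make the next measurement meaningless in the same way as the original bug.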