Date: Tue, 17 Aug 2021 16:24:11 -0400
From: Peter Xu
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Alistair Popple,
    Tiberiu Georgescu, ivan.teterevkov@nutanix.com, Mike Rapoport,
    Hugh Dickins, Matthew Wilcox, Andrea Arcangeli, "Kirill A . Shutemov",
    Andrew Morton, Mike Kravetz
Subject: Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
References: <20210807032521.7591-1-peterx@redhat.com>
    <16a765e7-c2a3-982a-e585-c04067766e3f@redhat.com>

On Tue, Aug 17, 2021 at 08:46:45PM +0200, David Hildenbrand wrote:
> > Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not
> > accurate from the 1st day, imho. E.g., when a page is being migrated from numa
> > node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can
> > make it more accurate, but I think it's fine, because it's a hint.
>
> That inconsistency doesn't really matter as you can determine if something
> is present and worth dumping if it's either swapped or present. As long as
> it's one of both but not simply nothing.
>
> I will shamelessly reference
> tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
> checks exactly for that (the test case uses only private anonymous memory).

Then I think the MADV_POPULATE_READ|WRITE test cases shouldn't depend on PM_SWAP
for that once they go beyond private anonymous memory: when shmem is swapped
out the pte can be none, so the test case can fail even if it shouldn't, imho.
The mincore() syscall seems to be the ideal thing to use if you want this to be
accurate, but again it's not a problem for the current private anonymous
memory.

> > >
> > > Take CRIU as an example, it has to be correct even if a process would remap a
> > > memory region, fork() and unmap in the parent as far as I understand, ...
> >
> > Are you talking about dirty bit or swap bit? I'm a bit confused on why swap
> > bit needs to be accurate. Maybe you mean the dirty bit?
>
> https://criu.org/Shared_memory
>
> "Dumping present pages"
>
> "... CRIU does not dump all of the data. Instead, it determines which pages
> contain it, and only dumps those pages. This is done similarly to how
> regular memory dumping and restoring works, i.e. by looking for PRESENT or
> SWAPPED bits in owners' pagemap entries."
>
> -> Neither PRESENT nor SWAPPED results in memory not getting dumped, which
> makes perfect sense.
>
> 1) Process A sets up shared memory and writes data to it.
> 2) System swaps out memory, hints are setup.
> 3) Process A forks Process B, hints are not copied.
> 4) Process A unmaps shared memory, hints are dropped.
> 5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
> pagemap.
> 6) Process B uses memory in shared memory region. Pages were not migrated.
>
> Just one example; feel free to correct me.
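
For readers following along, the "PRESENT or SWAPPED" check described in the
quoted text can be sketched roughly as below. This is only an illustration,
not CRIU's code and not the selftest's: the bit positions (63 for present, 62
for swapped) are the ones given in the kernel's pagemap documentation, while
the helper names is_populated() and read_pagemap_entry() are made up for this
example.

```python
import os
import struct

PM_PRESENT = 1 << 63  # bit 63: page is present in RAM
PM_SWAP = 1 << 62     # bit 62: page is swapped


def is_populated(entry: int) -> bool:
    """True if the pagemap entry is PRESENT or SWAPPED (either is enough)."""
    return bool(entry & (PM_PRESENT | PM_SWAP))


def read_pagemap_entry(vaddr: int, pid: str = "self") -> int:
    """Read the 64-bit pagemap entry covering virtual address vaddr (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    # Unbuffered, so that exactly one 8-byte entry is requested from the kernel.
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per virtual page
        data = f.read(8)
    if len(data) != 8:
        raise OSError("short read from pagemap")
    return struct.unpack("=Q", data)[0]  # entries are in native byte order
```

An entry with neither bit set is treated as "nothing to dump" by such a
check, which is exactly what a swapped-out shmem page with a none pte looks
like in the scenario discussed here.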

I think the pte marker won't crash criu; what will happen is that it'll see
more ptes that used to be none become pte markers. This reminded me that
maybe I should also teach the mincore() syscall to be aware of the pte marker
at least, along with all non_swap_entry() callers.

>
>
> There is notion of the mincore() systemcall:
>
> "There is one particular feature of shared memory dumps worth mentioning.
> Sometimes, a shared memory page can exist in the kernel, but it is not
> mapped to any process. CRIU detects such pages by calling mincore() on the
> shmem segment, which reports back the page in-memory status. The mincore
> bitmap is when ANDed with the per-process ones. "
>
> Not sure if they actually mean ORed, because otherwise they'd be losing
> pages that have been swapped out. "mincore() returns a vector that indicates
> whether pages of the calling process's virtual memory are resident in core
> (RAM)"

I am wildly guessing they ORed the two just because PM_SWAP is not working
properly for shmem, so the OR happens only for shmem. Criu may not rely on
mincore() alone because they also want the dirty bits.

Btw, I noticed that in 2016 criu switched from mincore() to lseek():

https://github.com/checkpoint-restore/criu/commit/1821acedd04b602b37b587eac5a481094b6274ae

Criu should want to know "whether this page has valid data", not "whether this
page has been swapped out", so lseek() seems to be more suitable, which I
wasn't aware of before.

I'm now wondering whether mincore() can also be used for Tiberiu's case. It
would still be a bit slow because it'll look up the cache too, but it should
work similarly to the original proposal.

Thanks,

--
Peter Xu
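
A rough sketch of the "does this range have valid data" query mentioned
above, assuming the lseek() in question is the SEEK_DATA/SEEK_HOLE interface.
This is not CRIU's code; data_extents() is a hypothetical helper name, and the
temporary file only stands in for a shmem-backed file. The granularity of the
reported ranges depends on the filesystem, and a filesystem without hole
support may report the whole file as data.

```python
import errno
import os
import tempfile


def data_extents(fd: int) -> list:
    """Return (start, end) byte ranges that lseek() reports as data."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # no data at or after offset
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # end of this data run
        extents.append((start, end))
        offset = end
    return extents


# Sparse 1 MiB file with a single 4 KiB chunk written in the middle.
with tempfile.TemporaryFile() as f:
    f.truncate(1 << 20)
    f.seek(512 * 1024)
    f.write(b"x" * 4096)
    f.flush()
    extents = data_extents(f.fileno())
print(extents)
```

Only the ranges returned here would need to be dumped; holes carry no data
regardless of whether the pages behind the data ranges are resident or not.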