Date: Sat, 4 Feb 2023 18:44:09 +0000
From: Alok Tiagi
To: "Eric W. Biederman"
Cc: Eric Dumazet, Hillf Danton, netdev@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, tycho@tycho.pizza
Subject: Re: [RFC] net: add new socket option SO_SETNETNS
References: <20230202014810.744-1-hdanton@sina.com> <87tu0278kt.fsf@email.froward.int.ebiederm.org>
In-Reply-To: <87tu0278kt.fsf@email.froward.int.ebiederm.org>

On Fri, Feb 03, 2023 at 03:17:06PM -0600, Eric W.
Biederman wrote:
> Alok Tiagi writes:
>
> > On Fri, Feb 03, 2023 at 04:09:12PM +0100, Eric Dumazet wrote:
> >> On Fri, Feb 3, 2023 at 12:59 AM Alok Tiagi wrote:
> >> >
> >> > On Thu, Feb 02, 2023 at 09:10:23PM +0100, Eric Dumazet wrote:
> >> > > On Thu, Feb 2, 2023 at 8:55 PM Alok Tiagi wrote:
> >> > > >
> >> > > > On Thu, Feb 02, 2023 at 09:48:10AM +0800, Hillf Danton wrote:
> >> > > > > On Wed, 1 Feb 2023 19:22:57 +0000 aloktiagi
> >> > > > > > @@ -1535,6 +1535,52 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
> >> > > > > >  		WRITE_ONCE(sk->sk_txrehash, (u8)val);
> >> > > > > >  		break;
> >> > > > > >
> >> > > > > > +	case SO_SETNETNS:
> >> > > > > > +	{
> >> > > > > > +		struct net *other_ns, *my_ns;
> >> > > > > > +
> >> > > > > > +		if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6) {
> >> > > > > > +			ret = -EOPNOTSUPP;
> >> > > > > > +			break;
> >> > > > > > +		}
> >> > > > > > +
> >> > > > > > +		if (sk->sk_type != SOCK_STREAM && sk->sk_type != SOCK_DGRAM) {
> >> > > > > > +			ret = -EOPNOTSUPP;
> >> > > > > > +			break;
> >> > > > > > +		}
> >> > > > > > +
> >> > > > > > +		other_ns = get_net_ns_by_fd(val);
> >> > > > > > +		if (IS_ERR(other_ns)) {
> >> > > > > > +			ret = PTR_ERR(other_ns);
> >> > > > > > +			break;
> >> > > > > > +		}
> >> > > > > > +
> >> > > > > > +		if (!ns_capable(other_ns->user_ns, CAP_NET_ADMIN)) {
> >> > > > > > +			ret = -EPERM;
> >> > > > > > +			goto out_err;
> >> > > > > > +		}
> >> > > > > > +
> >> > > > > > +		/* check that the socket has never been connected or recently disconnected */
> >> > > > > > +		if (sk->sk_state != TCP_CLOSE || sk->sk_shutdown & SHUTDOWN_MASK) {
> >> > > > > > +			ret = -EOPNOTSUPP;
> >> > > > > > +			goto out_err;
> >> > > > > > +		}
> >> > > > > > +
> >> > > > > > +		/* check that the socket is not bound to an interface*/
> >> > > > > > +		if (sk->sk_bound_dev_if != 0) {
> >> > > > > > +			ret = -EOPNOTSUPP;
> >> > > > > > +			goto out_err;
> >> > > > > > +		}
> >> > > > > > +
> >> > > > > > +		my_ns = sock_net(sk);
> >> > > > > > +		sock_net_set(sk, other_ns);
> >> > > > > > +		put_net(my_ns);
> >> > > > > > +		break;
> >> > > > >
> >> > > > >     cpu 0                           cpu 2
> >> > > > >     ---                             ---
> >> > > > >                                     ns = sock_net(sk);
> >> > > > >     my_ns = sock_net(sk);
> >> > > > >     sock_net_set(sk, other_ns);
> >> > > > >     put_net(my_ns);
> >> > > > >                                     ns is invalid ?
> >> > > >
> >> > > > That is the reason we want the socket to be in an un-connected state. That
> >> > > > should help us avoid this situation.
> >> > >
> >> > > This is not enough....
> >> > >
> >> > > Another thread might look at sock_net(sk), for example from inet_diag
> >> > > or tcp timers
> >> > > (which can be fired even in un-connected state)
> >> > >
> >> > > Even UDP sockets can receive packets while being un-connected,
> >> > > and they need to deref the net pointer.
> >> > >
> >> > > Currently there is no protection about sock_net(sk) being changed on the fly,
> >> > > and the struct net could disappear and be freed.
> >> > >
> >> > > There are ~1500 uses of sock_net(sk) in the kernel, I do not think
> >> > > you/we want to audit all
> >> > > of them to check what could go wrong...
> >> >
> >> > I agree, auditing all the uses of sock_net(sk) is not a feasible option. From my
> >> > exploration of the usage of sock_net(sk) it appeared that it might be safe to
> >> > swap a socket's net ns if it had never been connected but I looked at only a
> >> > subset of such uses.
> >> >
> >> > Introducing a ref counting logic to every access of sock_net(sk) may help get
> >> > around this but involves a bigger change to increment and decrement the count at
> >> > every use of sock_net().
> >> >
> >> > Any suggestions if this could be achieved in another way much closer to the
> >> > socket creation time or any comments on our workaround for injecting sockets using
> >> > seccomp addfd?
> >>
> >> Maybe the existing BPF hook in inet_create() could be used ?
> >>
> >> err = BPF_CGROUP_RUN_PROG_INET_SOCK(sk);
> >>
> >> The BPF program might be able to switch the netns, because at this
> >> time the new socket is not
> >> yet visible from external threads.
> >>
> >> Although it is not going to catch dual stack uses (open a V6 socket,
> >> then use a v4mapped address at bind()/connect()/...
> >
> > We thought of a similar approach by intercepting the socket() call in seccomp
> > and injecting a new file descriptor much earlier but as you said we run into the
> > issue of handling dual stack sockets since we do not know in advance if it's
> > going to be used for a v4mapped address.
>
> I would suggest adding a default ipv4 route from your ipv6 network
> namespaces to your ipv4 network namespace, but that only works for
> outbound traffic.  The inbound traffic problem is classically solved
> via nat.
>
> That you are not suggesting using nat has me thinking there is something
> subtle in what you are trying to do that I am missing.
>
> Perhaps your userspace can do:
>
> 	previous_netns = open("/proc/self/ns/net");
> 	setns(ipv4_netns);
> 	socket();
> 	setns(previous_netns);
>
>
> As the network namespace is per thread this is atomic if you add
> the logic to block signals around it.
>
> Eric

That is correct, we are not using NAT. We are providing a mechanism for the
users of our container platform to move to IPv6-only while keeping egress
connectivity to their IPv4 destinations. We do this transparently, without any
change to user code, by intercepting networking syscalls in a container manager
running in a dedicated IPv4-only network namespace. Our current solution, as
described in my original commit message, has limitations, and we are looking for
a way to switch a socket's namespace from the IPv6-only container network
namespace to the dedicated IPv4 network namespace, which would really simplify
our design.

Since our userspace is the container workload, we have no control over how it
instantiates its sockets.
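
For reference, the userspace sequence Eric outlines above (remember the current
netns, setns() into the IPv4 one, call socket(), then setns() back, with signals
blocked around the switch) could be sketched roughly as below. This is only an
illustration under assumptions: the helper name socket_in_netns() and its
ns_path argument are hypothetical and appear nowhere in this thread, and it uses
/proc/thread-self/ns/net rather than /proc/self/ns/net so that it refers to the
calling thread's namespace. setns() into a network namespace needs
CAP_SYS_ADMIN, so unprivileged callers should expect EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): create a socket inside the
 * network namespace named by ns_path, then move the calling thread back
 * to the namespace it started in.
 * Returns the new socket fd, or -1 with errno set.
 */
int socket_in_netns(const char *ns_path, int domain, int type, int protocol)
{
	sigset_t all, old;
	int target_fd, prev_fd, sock = -1, err = 0;

	/* Open both namespace handles before touching the signal mask. */
	target_fd = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (target_fd < 0)
		return -1;

	prev_fd = open("/proc/thread-self/ns/net", O_RDONLY | O_CLOEXEC);
	if (prev_fd < 0) {
		err = errno;
		close(target_fd);
		errno = err;
		return -1;
	}

	/*
	 * Block signals so no handler runs while this thread sits in the
	 * other namespace. On Linux sigprocmask() changes only the calling
	 * thread's mask; pthread_sigmask() is the portable spelling.
	 */
	sigfillset(&all);
	sigprocmask(SIG_SETMASK, &all, &old);

	if (setns(target_fd, CLONE_NEWNET) == 0) {
		sock = socket(domain, type, protocol);
		if (sock < 0)
			err = errno;
		/* Always try to switch back, even if socket() failed. */
		if (setns(prev_fd, CLONE_NEWNET) != 0) {
			err = errno;
			if (sock >= 0)
				close(sock);
			sock = -1;
		}
	} else {
		err = errno;
	}

	sigprocmask(SIG_SETMASK, &old, NULL);
	close(prev_fd);
	close(target_fd);
	if (sock < 0)
		errno = err;
	return sock;
}
```

A caller would use it as socket_in_netns("/run/netns/ipv4", AF_INET,
SOCK_STREAM, 0), where the path is an assumed bind mount of the IPv4 namespace.
The socket stays in the namespace it was created in after the thread switches
back, which is the property this sequence relies on.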