Slack Nebula is a very satisfying system. It’s an overlay network written in Go at Slack, and has been used heavily across their servers and infrastructure. Once set up, all devices can have an IP in the same range and securely speak to each other - often directly - regardless of their network location. There are clients for most platforms, including Windows, Linux, macOS and Android. It’s novel and great when it works, but can be unreliable in more overengineered setups.

The Good

At a high level, discovery nodes (“lighthouses”) are basically just dumb publicly accessible introducers: when a node contacts a lighthouse, the lighthouse notes the node’s cert and return path. When another node wants to reach it, it asks the lighthouse for that return path and starts sending encrypted traffic.
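For a concrete sketch of the node side of this (the addresses and port here are made up; the reference config in the Nebula repo documents the full set of options):

```yaml
# Hypothetical node config: 192.168.100.1 is the lighthouse's overlay IP,
# 198.51.100.10:4242 the public underlay address where it can always be found.
static_host_map:
  "192.168.100.1": ["198.51.100.10:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60          # seconds between reporting our own return path
  hosts:
    - "192.168.100.1"   # query this lighthouse for other nodes' return paths
```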

The really cool and satisfying thing about this system is that each node’s information, including its equivalent of security groups, is held in its certificate, signed by the CA, which becomes the authority for joining a network.

So each node is basically a certificate binding its overlay network IP, a hostname and a list of security groups for that host, which can be used in the firewall built into all clients.
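For example (a hedged sketch with made-up names): a cert gets minted with something like nebula-cert sign -name web1 -ip 192.168.100.5/24 -groups servers, and every node’s built-in firewall can then match on the groups signed into the peer’s certificate:

```yaml
# Sketch of a node's firewall section; "laptops" is a group baked into
# peer certificates at signing time, not something configured locally.
firewall:
  outbound:
    - port: any
      proto: any
      host: any        # allow all outbound
  inbound:
    - port: 22
      proto: tcp
      group: laptops   # only peers whose signed cert carries this group
```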

This gives a few really satisfying features:

  • getting onto the network requires a cert: the CA doesn’t ever have to go near the internet, so the super secret core nugget of control for the entire network can be locked up offline when nodes aren’t being added.
  • barely any trust is needed at all: in comms between two nodes, each has a signed record stating all the important properties of the node it’s communicating with.
  • it’s super simple. At a high level, there are no complex processes, third-party involvement, obscure algorithms or big powerful central servers doing a lot of legwork anywhere. Just simple, decentralised stuff.
  • the architecture allows for other problems to be solved easily: e.g. DNS can be handled by lighthouses using cert info, and relaying can be done by also advertising accessible ranges in node certificates (see the sketch after this list).
  • no central point of failure: lighthouses can be made redundant, hosted anywhere, and are only used for discovery/introduction. They have no special power or access.
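To make that advertised-ranges point concrete, here’s a hedged sketch (hypothetical addresses): if a node’s cert is signed with -subnets "192.168.1.0/24", other nodes can opt in to routing that LAN through it:

```yaml
# On the other nodes: reach the 192.168.1.0/24 LAN via the overlay node
# 192.168.100.9, whose certificate advertises that subnet.
tun:
  unsafe_routes:
    - route: 192.168.1.0/24
      via: 192.168.100.9
```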

When nodes are connected, everything is incredibly quick: I experience no noticeable overhead in daily use, and ping and transfer rates over Nebula are pretty much identical to the underlying network (although the first ping result can sometimes be high). Many other similar solutions rely on relays, which bounce traffic around various overlay network nodes, but Nebula instead relies purely on UDP holepunching on the underlay networks to create paths directly between nodes.
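The holepunching behaviour has a few knobs; a minimal sketch (the option names are real, the values are just what I’d reach for first):

```yaml
punchy:
  punch: true    # send periodic punches to keep NAT mappings open
  respond: true  # punch back towards peers that can't reach us first -
                 # the usual mitigation for one-way communication
  delay: 1s      # brief delay before punching back, for misbehaving NATs
```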

An overlay network experience is really nice: just one consistent IP space to remember, and in most cases, once a node has joined Nebula and reached a lighthouse, you can forget about any underlying networking eccentricities for it. The little benefits are also nice, like not having to think about whether my phone is on mobile data or wifi to know if I’m going to be able to reach something. If I can reach the lighthouse, then I can reach all the other nodes on the network, in theory…

The Bad

However, as simple and great as this sounds, my experience with it over the past 3 years has been a rollercoaster. I really want this to work, and when it does, it’s just what I want. But every time I go to set it up, I normally get 5-10 devices in before I start seeing unreliable behaviour. Two servers next to each other in the same subnet will be fine. A desktop reaching out to a server will probably also be fine. But when you start getting into the realms of Android devices, Docker containers, dual NAT, virtualised firewalls and other over-engineered setups, stuff starts failing. Unfortunately, it’s rarely a loud, in-your-face error message; it’s the subtle, long, slow timeout, or some other symptom indicating something has silently broken recently, maybe?

The solution becomes hours and hours of trawling through configs and tweaking obscure settings in firewalls and other network devices to get a problem node playing nicely on Nebula. But constantly reconfiguring networks and blasting holes through them was exactly the activity I dreamed a good overlay network would stop me having to do.

Some specific symptoms I would encounter:

  • one way communication: node A can’t ping node B until node B has pinged node A, even though both have handshaken with a lighthouse.
  • connection timeouts: related to the above, once nodes have spoken, if they go quiet for a while, one of them needs to send a keepalive before traffic flows again, regardless of the timeout settings in various places.
  • DNS weirdness on Android: not sure if this is Nebula’s fault or the Android VPN/Private DNS stack, but I would consistently see all sorts of weird behaviour and could never make reliable use of a Nebula-based host for DNS: https://github.com/slackhq/nebula/pull/351 (an attempt to fix my own problems a while ago)

In fairness, a lot of the issues I face could be self-inflicted, stemming from other decisions in my network.

But I keep hitting similar issues over many years, across many completely different devices and networks. When troubleshooting, I see a lot of other people having similar problems. It makes me sad how many tickets, conversations and similar support discussions for Nebula end with “I tried messing with firewall/router/NAT/nebula config for ages, it didn’t work, so I switched to tailscale/netmaker/zerotier and it just worked.”

The high-level architecture and certificate stuff is all exceptionally well thought out, and the great features work well when nodes can connect and communicate, but the underlay network management and UDP holepunching hasn’t been reliable for me. Other solutions get round this by relaying traffic via their equivalents of lighthouse nodes, something Nebula has only recently added support for (sketched below). This approach tends to be a lot slower and less efficient than direct communication. It inevitably creates bottlenecks and concentrates traffic in certain parts of the network, rather than decentralising it.
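For reference, the recently added relay support looks roughly like this (a sketch with a hypothetical relay at overlay IP 192.168.100.1; check the current docs for the exact semantics):

```yaml
# On the node willing to forward traffic for others:
relay:
  am_relay: true
---
# On nodes that want to be reachable through that relay:
relay:
  relays:
    - 192.168.100.1   # overlay IP of the am_relay node above
  use_relays: true
```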

The Ugly

The overarching problem I have with Nebula is that, from how the project is developing, I can’t tell what it’s trying to do or who it’s trying to please; I just wish it were me. From the issues on GitHub, there’s a clear collection of people who come to Nebula to try and use it in their home setups or for VPN-like usage, but leave feeling like it’s just unreliable or bad software that’s not for them. This pains me so much because Nebula’s architecture and core idea gives me a similar excitement as the original Bitcoin architecture did, but I do often feel like it’s missing a lot of opportunities, or that I’m misunderstanding the use case.

I don’t think I’m unique in thinking this: some people who open tickets and PRs against the Nebula project are clearly very smart and enthusiastic, but reading between the lines, they get fed up with dealing with the Nebula project. Have a look through the PRs sorted by comments, and you’ll find a few examples of zero, partial or lacklustre engagement from the core Nebula team in response to people trying to help. I’m sure there’s no bad intent on either side; it’s just sad to see people put in effort and the value not ultimately be realised.

Some other things also give me a mildly uncomfortable feeling regarding its future development priorities, e.g.: https://www.defined.net/blog/open-for-business/:

If you’ve used open source Nebula, you’ll know that it is left to the user to decide how they’ll manage certificates and keys for a network. The tools are simple and powerful, but almost certainly require someone deploying Nebula to use configuration management if they are using it at scale. At Slack, we created a management system that handled keys, certificates, and renewal, but those tools were very specific to Slack, and not suitable to release as open source.

With our dashboard, you can manage any size fleet of Nebula nodes, across every platform supported by open source Nebula. The tools are built with resilience and security in mind, and we’ve prioritized features that enterprises need, such as audit logs, and SSO for administrator accounts. We also host an API that allows customers to automate the process of adding hosts to a network managed by Defined.

One thing I loved about Nebula is that it’s lightweight and out of the way; it solves one problem very well: nodes communicating. It doesn’t try to do complex PKI, it doesn’t try to do complex identification, it doesn’t depend on some random web API and reverse proxy. It’s transparent, simple and light. Not doing those things is a feature, because they can be hard, or messy, or opinionated, or high maintenance.

Even if I put my enterprise architecture hat back on, this isn’t what I’d want. The features being monetised are the exact sort of features that most enterprises already have more than enough solutions for, and the solutions they have are tools that - just like Slack’s - are very specific to the business and domain. Big companies have solved dashboarding, secrets management and logging a million times already; it doesn’t seem to be making the world a better place to charge them to solve it for the 1,000,001st time. Especially when Nebula as it is makes it relatively easy to integrate with my existing solutions for those things.

I get that bills need to be paid. I’m sure these two options have been considered, but here’s what I would do:

  1. Monetise the apps: I would personally have paid a few ££ for a well-loved Android app, and statistically, Apple people would pay even more.
  2. Create a VPN service: think consumer VPN. I’d personally be interested in a VPN provider that uses Nebula: they could host some nodes that advertise 0.0.0.0 routes for a traditional VPN-like experience (like the ones in the YouTube ads!) but with the extra benefits of a private overlay network (like communicating with the user’s other devices). Basically, a paid consumer version of Defined Networking dressed up as a consumer VPN provider with the cool benefits of it being Nebula. I can think of some potentially fun ways to experiment with different CA setups and lighthouse arrangements in such an org. Maybe even facilitating community networks or something, where users share and subscribe to each other’s homelab services.

That might not make those big enterprise bucks, but I think going down the enterprise route means competing in an already relatively busy enterprise mesh area (think Consul, Istio, etcd, linkerd etc.). This area is well established and has a lot of resources, and I think a small startup like Defined will struggle in it. The shady consumer VPN market, however, looks from the outside to be ripe for disruption by some plucky, morally and technologically superior tech startup. I suspect focusing on consumer, mobile and homelab features would also lead to a less transient userbase in the meantime.

I get a mild “startup desperate for that one big enterprise client to give us a problem to work on with our solution” vibe from the Nebula site too, which is sad, because I genuinely think it already is a super valuable piece of software for individuals, with a GitHub full of suggestions, solutions and fixes that they could be building out right now.

Conclusion

I’m done with Nebula for a while again. It’s so easy to see how I could use multiple lighthouses to create the perfect lightweight, resilient overlay network that serves all my devices, including mobiles and containers and servers, and survives internet connection outages and remote server outages without having to open any ports at home. But the amount of time I’ve spent troubleshooting Nebula has been long enough for me to figure out how to overengineer and hack together a similar system with headscale/tailscale and some DNS and reverse tunnelling fun.

It would be really good to see a roadmap and a statement around what they want and expect regarding contributions to the project, but I’m writing this mainly to remind myself that every time I come to use Nebula, I think it’ll be different and better, but I always end up feeling frustrated and disappointed by the stuff above. I think most of my tech frustrations probably are fixable, but from the outside, I don’t think the project priorities will get round to my sort of issues anytime soon.

Having said all this, I’ll still come back to it at some point, just because it’s genuinely fun tech, and I have a huge appreciation for the choice to open source it. I’d also highly recommend that anyone who’s interested in networking or crypto or decentralisation check it out.

A few more things that caught me out around AdGuard, Nebula and DNS

My quest was to have AdGuard listening on the Nebula and LAN interfaces, with a custom upstream rewrite so that any requests through the Nebula interface used a Nebula lighthouse DNS as an upstream. In my case, the lighthouse was on Oracle, and AdGuard was at home in my lab, so I also set up a caddy-l4 reverse proxy on the Oracle instance so I could use DoT from my phone outside of my home or Nebula network.

  • to get Nebula lighthouse DNS working (specifically, getting an external lighthouse to serve DNS on 53535), remember that I had to add a firewall rule in the Nebula config AND a firewall rule on the system via firewall-cmd (see the sketch after this list).
  • when a custom upstream resolver is used on a per-client basis, queries don’t show up in the AdGuard DNS query log at all.
  • static_host_map can have multiple entries per host: don’t use just a single split-DNS hostname for home devices, or it causes problems if/when it resolves the wrong way. The fix is to also include a LAN IP as a fallback if the device is immobile (also sketched below).
  • multiple clients behind certain types of NAT configuration can end up reusing the same port, so on clients, specify different listen ports, or 0 to randomly select a port (also sketched below).
  • systemd-resolved feels overengineered compared to its predecessors. Remember /etc/systemd/resolved.conf and journalctl -u systemd-resolved -f to debug, and remember that it doesn’t support all inline comments, regardless of what vi’s syntax highlighting suggests.
  • it was a conscious decision to combine the lighthouse and AdGuard DNS in the same service and container, as they’re both small services that are widely used and depended on for bootstrapping, but it was probably unnecessary and caused pain later.
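To make the lighthouse DNS, static_host_map and listen-port bullets concrete, a hedged sketch (addresses and hostnames are hypothetical; the firewall-cmd half is shown as a comment):

```yaml
# Lighthouse config: serve DNS on 53535, and allow it through Nebula's own
# firewall. The system firewall needs opening separately, e.g.:
#   firewall-cmd --permanent --add-port=53535/udp && firewall-cmd --reload
lighthouse:
  am_lighthouse: true
  serve_dns: true
  dns:
    host: 0.0.0.0
    port: 53535
firewall:
  inbound:
    - port: 53535
      proto: udp
      host: any
---
# Client config: multiple underlay addresses per host (split-DNS name plus a
# LAN IP fallback), and a random listen port to dodge NAT port reuse.
static_host_map:
  "192.168.100.1": ["lighthouse.example.com:4242", "192.168.1.10:4242"]
listen:
  host: 0.0.0.0
  port: 0   # pick a random source port
```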

Other notes to share later on this topic

  • nebula-github: I created some scripts to do PKI for Nebula via a git repository and GitHub Actions a few years ago. You trigger an action with the details of the new node, and a cert and a config (populating lighthouses and static hosts) are created and stored back in the repo, giving a nice central source of information on a Nebula network. I recently revived it, but since migrating to Gitea, Gitea Actions doesn’t have the ability to manually trigger actions with inputs. I would build it very differently now, but it works and I still use it regularly when I’m playing with Nebula.
  • nebuladguard: a docker-compose and Dockerfile to launch AdGuard and run Nebula in the same container to listen on the Nebula network, plus the caddy-l4 config to proxy DoT requests back over Nebula to that AdGuard container.
  • my process for converting any generic container to run on Nebula. At a high level: create a new Dockerfile with FROM originalimage; in the Dockerfile, download and install the Nebula binaries and supervisord, clear the entrypoint and set it to launch supervisord. Then take the launch command from the original image and create a supervisord config from that command plus the nebula start command. In the docker-compose file, mount the Nebula cert, supervisord config and Nebula config into the right locations, and expose the Nebula ports. This gives a folder where I can just run docker-compose up (see the sketch below). There are definitely more efficient ways where I could use gateways or similar networking magic so that one Nebula node serves as a gateway for a bunch of the containers, but this approach allows for better isolation and makes it easier to move deployments around.
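As a sketch of that last point (paths and names are hypothetical, and the Dockerfile/supervisord halves are summarised in the comments):

```yaml
# docker-compose.yml for a wrapped service. The accompanying Dockerfile is
# roughly: FROM originalimage; install the nebula binary and supervisord;
# ENTRYPOINT ["supervisord", "-n"]. supervisord.conf then runs both the
# image's original launch command and "nebula -config /etc/nebula/config.yml".
services:
  myservice:
    build: .
    cap_add:
      - NET_ADMIN            # nebula needs to create and manage a tun device
    devices:
      - /dev/net/tun
    ports:
      - "4242:4242/udp"      # nebula's underlay port
    volumes:
      - ./nebula/config.yml:/etc/nebula/config.yml:ro
      - ./nebula/ca.crt:/etc/nebula/ca.crt:ro
      - ./nebula/host.crt:/etc/nebula/host.crt:ro
      - ./nebula/host.key:/etc/nebula/host.key:ro
      - ./supervisord.conf:/etc/supervisord.conf:ro
```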