Skip to main content

2 posts tagged with "firehose"

View All Tags

Relay Operational Updates

· 6 min read

Update January 2025: Many of the operational changes described here were not implemented. We will give an update on Relay infrastructure early this year.

Summary: We are making a couple of changes to the Bluesky relay servers this week. The wire protocol and semantics are not impacted, but the changes may impact firehose consumers:

  • One way or another, the event stream sequence will change for all subscribers. Most consumers can probably just reset to the new cursor; see details below.
  • The bsky.network hostname will be swapped to a new relay instance (with new sequence)

The atproto network is growing rapidly! This is exciting and positive, but it means we need to move up some of our network scaling plans. One particular resource bottleneck is the relay firehose. We are seeing sustained traffic of over 2,000 events per second, and we have hundreds of active consumers. Multiplied, the overall throughput is hard for any one server to keep up with.

We saw this problem coming from the beginning and we have several plans and options to mitigate it. Details for each of these are linked or in sections below.

First, we are simply upgrading the size of server that we run our primary relay instances on. We expect to do this in coming days. The main impact here will be a reset of the firehose sequence (including cursor values), which will impact all downstream consumers.

Second, we are introducing the concept of "firehose fan-out" services, and releasing our implementation Rainbow. These are servers which re-broadcast the event stream firehose to many clients, reducing the bandwidth load on relays themselves. Rainbow is now in production, as an implementation detail.

Third, we recently released Jetstream as a more informal and light-weight option for event stream consumption. We encourage developers to check out Jetstream, and switch over if it works for their use case.

We encourage folks to take advantage of Rainbow and Jetstream. You can run your own, or make use of our hosted instances. Some additional longer-term options are discussed at the bottom of this article.

We don't have a hard time or date estimate for the cursor sequence change, but we expect it this week, possibly tomorrow (Tuesday). We generally try to give more time and make changes less disruptive but we're not able to do so in this situation. Please follow the @atproto.com account, this blog, and in Github Discussions for announcements.

Relay Upgrade

The current Bluesky relay instance is available at wss://bsky.network. A more specific hostname for this relay instance will soon be relay1.us-west.bsky.network (not yet configured).

Instead of upgrading the relay in-place, we are going to start a new relay from scratch. This means that event stream ordering will be different and sequence numbers will not align with the current relay. This new instance will soon run at relay2.us-west.bsky.network (not yet configured).

There will be at least a short window when both relays are running, so downstream consumers can update their services to "cut over". At some point soon after the new relay is deployed, we will switch the generic bsky.network hostname to point at it.

What firehose consumers need to do about the cut over depends on how careful they need to be about processing every single event on the firehose.

Services which always connect to the firehose from the current offset don’t need to do anything. This includes developer tools like goat, and real-time sampling like Firesky.

Best-effort consumers can likely be configured to consume from the current firehose offset, and simply restart, missing at most a few seconds of events. This probably makes sense for feed generators, analytics, automated labeling, etc.

Services which want to really minimize the number of missed events can take a different approach. They can be programmed to connect to the new relay with no cursor, detect the current sequence number, then reconnect with a cursor a couple of minutes back (at current rates this means a few hundred thousand events back). This results in a time overlap window between the two relays (with some events processed twice), which helps minimize the chance of missing any events.

Firehose Fan-out and Rainbow

Firehose fan-out servers have a single upstream connection to a relay firehose, and re-broadcast that stream to multiple subscribing clients. They maintain a local backfill window (allowing re-connection with a cursor), but do not implement endpoints like getRecord or getRepo, only com.atproto.sync.subscribeRepos. They do not re-validate the event stream (eg, they don't verify signatures). They maintain the exact sequence numbers of the upstream relay, meaning that they are interchangeable with the relay and with sibling fan-out servers, and can be placed behind a load-balancer.

Today we are releasing Rainbow, a fan-out service implemented in Go. The source code is available in the indigo git repository, with the same MIT/Apache licensing as our other open source projects.

Rainbow is already running as part of our production relay deployment (wss://bsky.network), with our load balancer (haproxy) distributing WebSocket subscriptions to a separate server running Rainbow.

In the future, we may run additional Rainbow instances outside our primary data centers, similar to how we offer Jetstream instances today. This distributes network traffic over more routers, and could improve connection reliability and throughput for subscribers in regions outside North America.

Going forward, we recommend that software which consumes from firehoses (including atproto SDKs) support HTTP redirects for WebSocket connections. This enables "pooling" behavior where a single hostname could route clients to multiple distinct servers, without use of a load-balancer.

What Else?

As a reminder, the relay synchronization API is the same as the PDS synchronization API, including the firehose. Relays are essential to the functionality of the atproto network: they are an operational and developer convenience, allowing consuming services to skip keeping track of which PDS instances are active.

A classic way to scale services like the firehose is sharding, where the stream is split into multiple parallel streams. atproto has a natural sharding key (the account DID), which means related events can be consistently routed to the correct shard. This will allow near-indefinite scaling of the firehose event rate, assuming compute and network resources are available.

Part of what makes relays difficult to operate at scale is that they function as both a "relay" (rebroadcasting events from PDS instances) and a full-network mirror (storing all repo contents). In the current protocol, it is necessary to combine both functions to fully verify repository operations, especially deletion of events. We think that clever implementations could make this work less resource intensive (for example, storing just record CIDs instead of full data). It is also possible to validate most aspects of the event stream without a full copy of the repo tree, and there could be a role in the network for "non-mirroring" relays.

Lastly, there is a particular need for efficient consumption and backfill of full-network content for only content and accounts which make use of specific record types. For example, “only whtwnd blog post records”, or “only accounts with labeling service declarations”. We are planning new features to make this type of network subscription and backfill much more efficient.

Introducing Jetstream

· 6 min read

One of most popular aspects of atproto for developers is the firehose: an aggregated stream of all the public data updates in the network. Independent developers have used the firehose to build real-time monitoring tools (like Firesky), feed generators, labeling services, bots, entire applications, and more.

But the firehose wire format is also one of the more complex parts of atproto, involving decoding binary CBOR data and CAR files, which can be off-putting to new developers. Additionally, the volume of data has increased rapidly as the network has grown, consistently producing hundreds of events per second.

The full synchronization firehose is core network infrastructure and not going anywhere, but to address these concerns we developed an alternative streaming solution, Jetstream, which has a few key advantages:

  • simple JSON encoding
  • reduced bandwidth, and compression
  • ability to filter by collection (NSID) or repo (DID)

A Jetstream server consumes from the firehose and fans out to many subscribers. It is open source, implemented in Go, simple to self-host. There is an official client library included (in Go), and community client libraries have been developed.

Jetstream was originally written as a side project by one of our engineers, Jaz. You can read more about their design goals and efficiency gains on their blog. It has been successful enough that we are promoting it to a team-maintained project, and are running several public instances:

  • jetstream1.us-east.bsky.network
  • jetstream2.us-east.bsky.network
  • jetstream1.us-west.bsky.network
  • jetstream2.us-west.bsky.network

You can read more technical details about Jetstream in the Github repo.

Why Now?

Why are we promoting Jetstream at this time?

Two factors came to a head in early September: we released an example project for building new applications on atproto (Statusphere), and we had an unexpectedly large surge in traffic in Brazil. Suddenly we had a situation where new developers would be subscribing to a torrential full-network firehose (over a thousand events per second), just to pluck out a handful of individual events from a handful of accounts. Everything about this continued to function, even on a laptop on a WiFi connection, but it feels a bit wild as an introduction to the protocol.

We knew from early on that while the current firehose is extremely powerful, it was not well-suited to some use cases. Until recently, it hadn’t been a priority to develop alternatives. The firehose is a bit overpowered, but it does Just Work.

Has the Relay encountered scaling problems or become unaffordable to operate?

Nope! The current Relay implementation ('bigsky', written in Go, in the indigo git repo) absorbed a 10x surge in daily event rate, with over 200 active subscribers, and continues to chug along reliably. We have demonstrated how even a full-network Relay can be operated affordably.

We do expect to refactor our Relay implementation and make changes to the firehose wire format to support sharding. But the overall network architecture was designed to support global scale and millions of events per second, and we don't see any serious barriers to reaching that size. Bandwidth costs are manageable today. At larger network size (events times subscribers), bandwidth will grow in cost. We expect that the economic value of the network will provide funding and aligned incentives to cover the operation of core network infrastructure, including Relays. In practical terms, we expect funded projects and organizations depending on the firehose to pay infrastructure providers to ensure reliable operation (eg, an SLA), or to operate their own Relay instances.

Tradeoffs and Use Cases

Jetstream has efficiency and simplicity advantages, but they come with some tradeoffs. We think it is a pragmatic option for many projects, but that developers need to understand what they are getting into.

Events do not include cryptographic signatures or Merkle tree nodes, meaning the data is not self-authenticating. "Authenticated Transfer" is right in the AT Protocol acronym, so this is a pretty big deal! The trust relationship between a Jetstream operator and a consuming client is pretty different from that of a Relay. Not all deployment scenarios and use-cases require verification, and we suspect many projects are already skipping that aspect when consuming from the firehose. If you are running Jetstream locally, or have a tight trust relationship with a service provider, these may be acceptable tradeoffs.

Unlike the firehose (aka, Repository Event Stream), Jetstream is not formally part of the protocol. We are not as committed to maintaining it as a stable API or critical piece of infrastructure long-term, and we anticipate adopting some of the advantages it provides into the protocol firehose over time.

On the plus side, Jetstream is easier and cheaper to operate than a Relay instance. Folks relying on Jetstream can always run their own copy on their own servers.

Some of the use cases we think Jetstream is a good fit for:

  • casual, low-stakes projects and social toys: interactive bots, and "fun" badging labelers (eg, Kiki/Bouba)
  • experimentation and prototyping: student projects, proofs of concept, demos
  • informal metrics and visualizations
  • developing new applications: filtering by collection is particularly helpful when working with new Lexicons and debugging
  • internal systems: if you have multiple services consuming from the firehose, a single local Jetstream instance can be used to fan out to multiple subscribers

Some projects it is probably not the right tool for:

  • mirroring, backups, and archives
  • any time it is important to know "who said what"
  • moderation or anti-abuse actions
  • research studies

What Else?

The ergonomics of working with the firehose and "backfilling" bulk data from the network are something we would like to improve in the protocol itself. This might include mechanisms for doing "selective sync" of specific collections within a repo, while still getting full verification of authenticity.

It would be helpful to have a mechanism to identify which repos in the network have any records of a specific type, without inspecting every account individually. For example, enumerating all of the labelers or feed generators in the network. This is particularly important for new applications with a small initial user base.

We are working to complete the atproto specifications for the firehose and for account hosting status.