
One of the most popular aspects of atproto for developers is the firehose: an aggregated stream of all the public data updates in the network. Independent developers have used the firehose to build real-time monitoring tools (like Firesky), feed generators, labeling services, bots, entire applications, and more.

But the firehose wire format is also one of the more complex parts of atproto, involving decoding binary CBOR data and CAR files, which can be off-putting to new developers. Additionally, the volume of data has increased rapidly as the network has grown, consistently producing hundreds of events per second.

The full synchronization firehose is core network infrastructure and not going anywhere, but to address these concerns we developed an alternative streaming solution, Jetstream, which has a few key advantages:

  • simple JSON encoding
  • reduced bandwidth, and compression
  • ability to filter by collection (NSID) or repo (DID)

A Jetstream server consumes from the firehose and fans out to many subscribers. It is open source, implemented in Go, and simple to self-host. An official client library (in Go) is included, and community client libraries have been developed.
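To make the JSON encoding and filtering concrete, here is a minimal consumer sketch in Go. It uses a general-purpose websocket library (gorilla/websocket) rather than the official client, and the /subscribe path, the wantedCollections query parameter, and the event field names are taken from the Jetstream README at the time of writing; treat them as illustrative rather than a stable API.

```go
// Minimal sketch of a Jetstream consumer. The endpoint path and query
// parameter names are assumptions based on the Jetstream README; check the
// repo for the current interface.
package main

import (
	"encoding/json"
	"log"

	"github.com/gorilla/websocket"
)

// jetstreamEvent captures only a few fields of the JSON event envelope;
// the full schema is documented in the Jetstream repo.
type jetstreamEvent struct {
	Did    string `json:"did"`
	TimeUS int64  `json:"time_us"`
	Kind   string `json:"kind"`
	Commit *struct {
		Operation  string          `json:"operation"`
		Collection string          `json:"collection"`
		RKey       string          `json:"rkey"`
		Record     json.RawMessage `json:"record"`
	} `json:"commit,omitempty"`
}

func main() {
	// Subscribe to one public instance, filtered to post records only.
	url := "wss://jetstream1.us-east.bsky.network/subscribe" +
		"?wantedCollections=app.bsky.feed.post"

	conn, _, err := websocket.DefaultDialer.Dial(url, nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	for {
		// Each websocket message is a single JSON-encoded event.
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatalf("read: %v", err)
		}
		var ev jetstreamEvent
		if err := json.Unmarshal(msg, &ev); err != nil {
			log.Printf("skipping malformed event: %v", err)
			continue
		}
		if ev.Commit != nil {
			log.Printf("%s %s/%s by %s",
				ev.Commit.Operation, ev.Commit.Collection, ev.Commit.RKey, ev.Did)
		}
	}
}
```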

Jetstream was originally written as a side project by one of our engineers, Jaz. You can read more about their design goals and efficiency gains on their blog. It has been successful enough that we are promoting it to a team-maintained project, and are running several public instances:

  • jetstream1.us-east.bsky.network
  • jetstream2.us-east.bsky.network
  • jetstream1.us-west.bsky.network
  • jetstream2.us-west.bsky.network

You can read more technical details about Jetstream in the GitHub repo.

Why Now?

Why are we promoting Jetstream at this time?

Two factors came to a head in early September: we released an example project for building new applications on atproto (Statusphere), and we had an unexpectedly large surge in traffic in Brazil. Suddenly we had a situation where new developers would be subscribing to a torrential full-network firehose (over a thousand events per second), just to pluck out a handful of individual events from a handful of accounts. Everything about this continued to function, even on a laptop on a WiFi connection, but it feels a bit wild as an introduction to the protocol.

We knew from early on that while the current firehose is extremely powerful, it was not well-suited to some use cases. Until recently, it hadn't been a priority to develop alternatives. The firehose is a bit overpowered, but it does Just Work.

Has the Relay encountered scaling problems or become unaffordable to operate?

Nope! The current Relay implementation ('bigsky', written in Go, in the indigo git repo) absorbed a 10x surge in daily event rate, with over 200 active subscribers, and continues to chug along reliably. We have demonstrated how even a full-network Relay can be operated affordably.

We do expect to refactor our Relay implementation and make changes to the firehose wire format to support sharding. But the overall network architecture was designed to support global scale and millions of events per second, and we don't see any serious barriers to reaching that size. Bandwidth costs are manageable today, though as the network grows (total events times the number of subscribers), those costs will grow as well. We expect that the economic value of the network will provide funding and aligned incentives to cover the operation of core network infrastructure, including Relays. In practical terms, we expect funded projects and organizations depending on the firehose to pay infrastructure providers to ensure reliable operation (eg, an SLA), or to operate their own Relay instances.

Tradeoffs and Use Cases

Jetstream has efficiency and simplicity advantages, but they come with some tradeoffs. We think it is a pragmatic option for many projects, but that developers need to understand what they are getting into.

Events do not include cryptographic signatures or Merkle tree nodes, meaning the data is not self-authenticating. "Authenticated Transfer" is right in the AT Protocol acronym, so this is a pretty big deal! The trust relationship between a Jetstream operator and a consuming client is pretty different from that of a Relay. Not all deployment scenarios and use-cases require verification, and we suspect many projects are already skipping that aspect when consuming from the firehose. If you are running Jetstream locally, or have a tight trust relationship with a service provider, these may be acceptable tradeoffs.

Unlike the firehose (aka, Repository Event Stream), Jetstream is not formally part of the protocol. We are not as committed to maintaining it as a stable API or critical piece of infrastructure long-term, and we anticipate adopting some of the advantages it provides into the protocol firehose over time.

On the plus side, Jetstream is easier and cheaper to operate than a Relay instance. Folks relying on Jetstream can always run their own copy on their own servers.

Some of the use cases we think Jetstream is a good fit for:

  • casual, low-stakes projects and social toys: interactive bots, and "fun" badging labelers (eg, Kiki/Bouba)
  • experimentation and prototyping: student projects, proofs of concept, demos
  • informal metrics and visualizations
  • developing new applications: filtering by collection is particularly helpful when working with new Lexicons and debugging
  • internal systems: if you have multiple services consuming from the firehose, a single local Jetstream instance can be used to fan out to multiple subscribers
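For the internal fan-out case in the last bullet, a minimal sketch of the pattern: a single goroutine reads from a local Jetstream instance and broadcasts raw events to in-process subscribers over buffered channels. The local address and port are assumptions for illustration; use whatever your own deployment exposes.

```go
// Hedged sketch of in-process fan-out: one reader, many subscribers.
// The localhost address below is an assumption, not a Jetstream default.
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:6008/subscribe", nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Each internal consumer gets its own buffered channel of raw events.
	subscribers := []chan []byte{
		make(chan []byte, 1024),
		make(chan []byte, 1024),
	}

	// Example consumers; in a real system these would be your services.
	for i, sub := range subscribers {
		go func(id int, ch <-chan []byte) {
			for msg := range ch {
				log.Printf("subscriber %d got %d bytes", id, len(msg))
			}
		}(i, sub)
	}

	// Single reader fans events out; a slow subscriber is skipped rather
	// than allowed to block the read loop.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatalf("read: %v", err)
		}
		for _, ch := range subscribers {
			select {
			case ch <- msg:
			default: // drop for this subscriber if its buffer is full
			}
		}
	}
}
```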

Some projects it is probably not the right tool for:

  • mirroring, backups, and archives
  • any time it is important to know "who said what"
  • moderation or anti-abuse actions
  • research studies

What Else?

The ergonomics of working with the firehose and "backfilling" bulk data from the network are something we would like to improve in the protocol itself. This might include mechanisms for doing "selective sync" of specific collections within a repo, while still getting full verification of authenticity.

It would be helpful to have a mechanism to identify which repos in the network have any records of a specific type, without inspecting every account individually. For example, enumerating all of the labelers or feed generators in the network. This is particularly important for new applications with a small initial user base.

We are working to complete the atproto specifications for the firehose and for account hosting status.