Pre-RFC: crate creator rate limiting in docs.rs

Continuing the discussion from build prioritization:


A version of the leaky bucket algorithm for docs.rs build prioritization.


Someone has been monopolizing the build queue. The purpose of this pre-RFC is to come up with an approach to "evening out" a crate owner's use of the infrastructure. They can still build as many crates as they want, but their builds are queued up in a "bucket" and dispensed into the main build queue at a fixed rate.

Guide-level explanation

In the docs.rs interface, you'll see a number associated with every crate publisher (which may be a user or an organization) counting their "build minutes." This number may be negative, but it will never go below -15 minutes (unless the user has been granted a special exception). You are gifted one build minute per hour.

You need at least one build minute for any crate to start building; if you have zero or less, your crate will sit in the build queue until you have at least one minute, allowing other crates to go first. If you have one build minute in your account and you run a crate that takes 15 minutes to build, docs.rs will finish the build, but you'll have -14 minutes once it's done, and you'll have to wait 15 hours before your next crate starts building.

Requesting more build minutes would be done roughly the same way requesting a longer build time limit is done, but honestly, I don't think it would often make sense to give someone this. Who's going to be able to make a good case for being given permission to monopolize the build queue?

Reference-level explanation

A new column is added to the crate_owner table; let's call it low_water_mark, of type DATETIME NULL DEFAULT NULL. This is the "stored low-water mark".

The "true low-water mark" is calculated roughly like this:

// pseudo-Rust
fn true_low_water_mark(stored: Option<DateTime>) -> DateTime {
    // The budget caps out at 30 build minutes; each build minute
    // accrues over 60 wall-clock minutes, hence 30 * 60.
    let minimum = now_utc() - Duration::minutes(30 * 60);
    let stored = stored.unwrap_or(minimum);
    if stored < minimum {
        minimum
    } else {
        stored
    }
}
When it's time for docs.rs to dequeue an item, it checks whether the owner has at least one build minute, i.e. whether now_utc() - true_low_water_mark() >= Duration::minutes(60). If they don't, it starts something else (or just waits, if nothing else is in the queue; because docs.rs doesn't cancel builds, starting someone's build when they don't have enough minutes "because there's nothing else to do" risks priority inversion if someone else comes in while it's running).

When a build finishes, the new stored low-water mark is set to true_low_water_mark + (build.duration * 60), where build.duration is measured in minutes: each minute of build time costs 60 wall-clock minutes of accrual.

The "build minutes" number is used only for display, to give something more intuitive than the timestamps stored in the database. It is calculated as (now_utc() - true_low_water_mark()).minutes() / 60. If the true low-water mark is ahead of or equal to the current time, you have negative (or zero) build minutes in your account and cannot start a build. Otherwise, you have positive build minutes.
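To make the accounting above concrete, here is a runnable sketch. Timestamps are modeled as whole minutes since an arbitrary epoch instead of real DATETIME values, and every name here is illustrative, not an actual docs.rs internal.

```rust
/// The budget caps out at 30 build minutes; each build minute accrues
/// over 60 wall-clock minutes, hence 30 * 60.
const MAX_BUDGET_WALL_MINUTES: i64 = 30 * 60;

/// Clamp the stored low-water mark so at most 30 build minutes accumulate.
fn true_low_water_mark(stored: Option<i64>, now: i64) -> i64 {
    let minimum = now - MAX_BUDGET_WALL_MINUTES;
    stored.unwrap_or(minimum).max(minimum)
}

/// The number shown in the UI: one build minute accrues per wall-clock hour.
fn build_minutes(stored: Option<i64>, now: i64) -> i64 {
    (now - true_low_water_mark(stored, now)) / 60
}

/// A build may start only when the owner holds at least one build minute.
fn may_start(stored: Option<i64>, now: i64) -> bool {
    build_minutes(stored, now) >= 1
}

/// After a build finishes, each minute of build time costs 60 wall minutes.
fn charge_build(stored: Option<i64>, now: i64, duration_minutes: i64) -> i64 {
    true_low_water_mark(stored, now) + duration_minutes * 60
}

fn main() {
    let now = 100_000; // arbitrary "current" minute
    // A fresh owner starts with the full 30-minute budget...
    assert_eq!(build_minutes(None, now), 30);
    // ...a 15-minute build drains half of it...
    let stored = charge_build(None, now, 15);
    assert_eq!(build_minutes(Some(stored), now), 15);
    // ...and a second 15-minute build leaves the owner at zero.
    let stored = charge_build(Some(stored), now, 15);
    assert_eq!(build_minutes(Some(stored), now), 0);
    assert!(!may_start(Some(stored), now));
    // One build minute returns after an hour of waiting.
    assert!(may_start(Some(stored), now + 60));
}
```

The clamp in true_low_water_mark is what caps accumulation at 30 build minutes, while letting the mark move arbitrarily far into the future for heavy users.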


This is not a simple rate-limiting scheme. It's not exactly the same as how crates.io does it, since crates.io is really rate limiting based on different criteria (we're using time, they're counting crates), and since crates.io rejects traffic instead of queueing it.

Also, if you really want to, you can work around the rate limit by sockpuppeting. We might be able to catch people doing it, but do we really want to be stuck policing this sort of thing?

Rationale and alternatives

The choice of a factor of 60 is arbitrary, reasonable-sounding, and will probably be adjusted based on experience. So is the default build budget of 30 minutes (enough for two builds at the default 15-minute time limit).

The specific design was chosen because it's an approach that is not new, is not exciting, and is probably "good enough". It has good theoretical properties; it's based on time, the unit we're actually trying to conserve, instead of using something like crate count as a loose proxy for time. Even if someone gets a higher time limit (which docs.rs already grants sometimes), their average resource consumption is controlled; crates that take longer to build push the owner's budget further into the red, preventing them from monopolizing docs.rs's infrastructure.

Prior art

This design was mostly thought up based on the nginx "burst" rate limiter, with how it delays requests and has them actually get executed on a schedule.

This isn't really the same rate-limiting algorithm that crates.io uses. crates.io uses a token bucket, which is designed to allow bursts, while preventing bursts is the entire purpose of this rate limiter.

Unresolved questions / Future possibilities

This technically allows a single crate owner to create a personal queue that grows forever, since crates.io allows people to publish new crates faster than docs.rs can process them. This is why nginx rate limiters tend to have "nodelay" turned on; delaying requests is not really all that good if latency is important to you, and while docs.rs's build queue is not designed for low latency, there does come a point when it gets ridiculous and we need to start shedding load instead of just allowing the queue to grow forever.

Something should really be done about that. Since there are so many different things that consume the registry, it is not possible to make crates.io's rate limits suitable for all of them, but it probably would be helpful to at least have docs.rs warn when someone's personal queue gets too big, potentially sending them an email (or banning them).


In my opinion, it would be weird to potentially have people waiting because of an arbitrary rate-limiting factor while keeping the builder idle.


Can I suggest this word change instead?

and you'll have to wait until you accumulate 15 minutes of build time in your account before your next crate starts building.

The current wording makes it sound like the crate would be built in the next 15 minutes of real time.

Other than that I like the proposal. And I agree about the problem of sock puppeting, but that rapidly goes down the rabbit hole of denial of service attacks, etc. I don't think we're quite to the point where we need to worry about that yet.

If the builder is idle, it could just go ahead and grab work to do. This would be for when resources are scarce.


I guess it's just a typo and was meant to say "you'll have to wait 15 hours".

In that case, people could build up an arbitrarily large amount of time debt though? Or maybe that's not supposed to happen? It would need to be specified properly.

Good point. Maybe each owner can build up a maximum of 15 build minutes. That'll help reduce games of creating a crate solely to accumulate time.


You're right, yeah. If you have -14 minutes on your account, then it'll take 14 hours to get back to zero (and 15 hours before you'll be able to build again).

Assuming that docs.rs has fixed compute resources, how about just building the most popular crate that needs to be built rather than building in FIFO order?

Where "popular" could be defined as the number of unique IPv4 addresses that ever downloaded it or if that's not available the number of downloads (but the latter is much easier to game).
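As a sketch of what that selection could look like: the struct and function names below are invented, and `unique_downloaders` stands in for the unique-IPv4 metric, which would have to be collected elsewhere.

```rust
// Hypothetical popularity-first dequeuing instead of FIFO.
struct QueuedCrate {
    name: &'static str,
    unique_downloaders: u64,
}

/// Pick the most popular queued crate instead of the oldest one.
fn next_to_build(queue: &[QueuedCrate]) -> Option<&QueuedCrate> {
    queue.iter().max_by_key(|c| c.unique_downloaders)
}

fn main() {
    let queue = vec![
        QueuedCrate { name: "leftpad-rs", unique_downloaders: 12 },
        QueuedCrate { name: "serde", unique_downloaders: 900_000 },
    ];
    // The later-queued but far more popular crate jumps the queue.
    assert_eq!(next_to_build(&queue).unwrap().name, "serde");
}
```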

Since the original thread was about a single user that was responsible for too much queue delay, I personally still favor an approach that just limits the amount of delay that a single user can create.

E.g. we could say that a single user must only be able to add 15 minutes of delay to the queue (on average; worst case below 30 minutes). The example in the other thread featured a single user allegedly responsible for multiple hours of increased queue delay/latency; this would be solved without being overly restrictive, while effectively taking the current workload of the queue into consideration.

We could proceed as follows (I just came up with the algorithm, feel free to point out flaws):

Once a crate of a (currently untracked) user U starts building, insert a marker M (for user U) after the current end of the queue and start tracking the time spent building crates of U. Then:

- As long as the time spent on crates of U stays below 15 minutes, proceed as usual until the marker M is reached.
- Once the time spent on crates of U exceeds 15 minutes, then after the build that brought the tracked time over 15 minutes finishes, move all remaining crates of U that sit before M back in the queue, to right after M.
- When M is reached and U has exceeded their 15 minutes, reduce the tracked time for user U by 15 minutes (so it drops from somewhere between 15 and 30 minutes to somewhere between 0 and 15 minutes), insert a new marker M' (for user U) at the current end of the queue, and proceed the same way with M' in place of M, etc.
- When M is reached without U having used up more than 15 minutes, set their tracked time back to zero (it was between 0 and 15 minutes). If U still has crates anywhere in the queue, create a new marker M' now and continue tracking the user; if not, just discard the marker and stop tracking the user.
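For concreteness, here is a single-worker simulation of my reading of the marker scheme. It is a sketch, not a reference implementation: build durations are assumed known up front, "moving crates back" happens lazily (a crate of an over-budget owner is reinserted right after that owner's marker when dequeued), and all names are invented for illustration.

```rust
use std::collections::{HashMap, VecDeque};

const BUDGET: u32 = 15; // minutes an owner may consume per "round"

#[derive(Clone, Debug, PartialEq)]
enum Item {
    Crate { owner: &'static str, minutes: u32 },
    Marker { owner: &'static str },
}

/// Drain the queue, returning the crates in the order they were built.
fn run(mut queue: VecDeque<Item>) -> Vec<(&'static str, u32)> {
    let mut tracked: HashMap<&'static str, u32> = HashMap::new();
    let mut built = Vec::new();
    while let Some(item) = queue.pop_front() {
        match item {
            Item::Crate { owner, minutes } => {
                if !tracked.contains_key(owner) {
                    // First build of an untracked owner: start tracking and
                    // drop a marker at the current end of the queue.
                    tracked.insert(owner, 0);
                    queue.push_back(Item::Marker { owner });
                }
                let spent = tracked[owner];
                if spent >= BUDGET {
                    // Over budget: defer this crate to just after the marker.
                    match queue.iter().position(|i| *i == Item::Marker { owner }) {
                        Some(pos) => queue.insert(pos + 1, Item::Crate { owner, minutes }),
                        None => queue.push_back(Item::Crate { owner, minutes }),
                    }
                } else {
                    built.push((owner, minutes));
                    tracked.insert(owner, spent + minutes);
                }
            }
            Item::Marker { owner } => {
                // Reaching the marker refunds up to one BUDGET worth of time.
                let spent = tracked[owner].saturating_sub(BUDGET);
                let has_more = queue
                    .iter()
                    .any(|i| matches!(i, Item::Crate { owner: o, .. } if *o == owner));
                if has_more {
                    tracked.insert(owner, spent);
                    queue.push_back(Item::Marker { owner });
                } else {
                    tracked.remove(owner);
                }
            }
        }
    }
    built
}

fn main() {
    // alice queues three 10-minute crates; bob queues one 2-minute crate last.
    let queue: VecDeque<Item> = vec![
        Item::Crate { owner: "alice", minutes: 10 },
        Item::Crate { owner: "alice", minutes: 10 },
        Item::Crate { owner: "alice", minutes: 10 },
        Item::Crate { owner: "bob", minutes: 2 },
    ]
    .into();
    let order: Vec<_> = run(queue).into_iter().map(|(o, _)| o).collect();
    // alice's third crate is deferred past her marker, letting bob go first.
    assert_eq!(order, vec!["alice", "alice", "bob", "alice"]);
}
```

In the example, alice's first two builds consume 20 minutes of her 15-minute round, so her third crate is pushed behind her marker and bob's small crate gets through first, which is exactly the bounded-delay behavior described above.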

As mentioned above, we don't need to be overly restrictive, also I don't think we need to anticipate people "gaming" the system too much. Especially creation of multiple user accounts seems unlikely as long as the mechanism is not overly annoying. (IMO ever leaving the queue idle would immediately qualify as being overly annoying.) This is mostly against unintentionally overloading the queue, and in particular against unintentional creation of large delays in the queue.

I'd find anything unacceptable that would result in the publisher of a single (fast-compiling) crate (or a small number of very fast-compiling crates) having to wait significantly longer than for all the crates currently in the queue to finish before their crate(s) is/are built. Using popularity instead of FIFO has huge starvation potential AFAICT.

And IMO a publisher of multiple crates should at most wait as long as if they were to publish each crate individually, always waiting for one crate to make it through the entire queue and get built before publishing the next crate, etc. (This latter condition would be violated by the proposal at the top of this thread as well.)


One thing we need to plan out for in any scheduling algorithm is to avoid starving a crate. Every crate needs to eventually build. Popularity might be fed into some other algorithm (you could use popularity to decide how many build-minutes to give someone in the above leaky bucket algo), but on its own, it won't guarantee this.

I don't really want to use popularity anyway. It creates feedback loops: Crates with good documentation become more popular, and popular crates get better documentation? Please no!

How is that? I thought the OP would fulfill this condition, so either I'm failing to do the math, or failing to explain myself properly.

Wouldn't that run afoul of GitHub's ToS on the other side of a crates.io account anyways?


That's a very good point: to handle that, instead of popularity I think it would be better to use popularity divided by estimated build time (although this estimation is itself somewhat expensive - maybe count the number of tokens in the source code using the RA tokenizer?).
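A sketch of that ordering, with `popularity` and `estimated_build_secs` as hypothetical inputs (the token-count-based estimator itself is out of scope here):

```rust
// Rank queued crates by popularity per unit of estimated build time.
struct QueuedCrate {
    name: &'static str,
    popularity: f64,
    estimated_build_secs: f64,
}

/// Expected "benefit" delivered per second of builder time spent.
fn benefit_per_second(c: &QueuedCrate) -> f64 {
    c.popularity / c.estimated_build_secs
}

fn main() {
    let small = QueuedCrate { name: "small-popular", popularity: 1_000.0, estimated_build_secs: 30.0 };
    let huge = QueuedCrate { name: "huge-popular", popularity: 2_000.0, estimated_build_secs: 3_600.0 };
    // The fast build wins even though it is less popular in absolute terms.
    assert!(benefit_per_second(&small) > benefit_per_second(&huge));
    assert_eq!(small.name, "small-popular");
}
```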

In theory, this would result in spending build time in the way that benefits the most people.

As soon as we're talking about malicious actors, all rules regarding acceptable behavior go out the window, and if you're willing to sock puppet, then you're (IMHO) a malicious actor. I don't think that we're at the point of thinking about real threat models yet; we're just trying to make it so that if someone makes a mistake, they don't bring everything down with them. Once we get to the point where someone is deliberately trying to break docs.rs, we'll have to figure out other mechanisms to force better behavior.


Sure. I'm saying that reporting such behavior to GitHub is likely to get their abuse team on the case and shut down the backing GitHub accounts too (that is, docs.rs is not necessarily on its own when dealing with account abuse).


I understand what you're trying to do here, but I have to say, there are a LOT of advantages to accumulating and spending time as described (within some [min..max] range).

  1. It's very easy to explain to crate authors, which may reduce the overall amount of whining if/when it's implemented.
  2. It's easy to implement.
  3. It runs very, very quickly (less expensive to compute on the already strained resources of docs.rs).
  4. It doesn't starve small crates, while still allowing large crates to continue.

Is it ideal? No. But it's really, really good.

I shared my thoughts in the other thread, but to summarize: we have a way to manually deprioritize huge projects. This time we missed the alerts, but in the past the approach has been successful. Unless this starts happening consistently, I'm not convinced we need to spend a lot of time on designing rate limiting.

The team is also working on removing the blockers that prevent us from scaling the build system to handle the increased load. We don't expect this to happen instantly due to some architectural limitations, but we have a clear path on how to solve them!


Yup, and as long as they have GitHub accounts, that'll be a good way to handle it. But as @kornel pointed out in the other topic, you don't need a GitHub account to publish to crates.io (and thence to docs.rs). So relying on GitHub's policies won't solve it.

Can you explain the architecture, @pietroalbini? I'm curious as to how it works under the hood.