#[const_instr_count(N)] label on functions?

If this is more appropriate for URLO, let me know. No hard feelings.

I am wondering if it is possible, perhaps via procedural macros, to statically assert: this function always runs + terminates within N (some arbitrary constant, say 10, 100, 1000) instructions.

pre-emptive questions:

  • What about halting problem?

    Does not apply here since (1) we only care about constant N and (2) our procedure is not complete.

  • What about loops ?

    We do not handle situations where loops do not obviously run a constant # of times.

    • What about std::push ?

    This function might be O(n) during resize, so it's not O(1)

why is this useful ?

I'm writing grapics code, with WebGL, and to get a good intuition on CPU lag, I'd really like to be able to estimate which functions are taking how much time

given different archs, how do you count instructions ?

I don't have a good answer to this question, but I think most popular archs are multiplicative O(1) of each other

why is this feasible at all ?

Imagine we are the procedural macro, and we are looking at the AST of a function. We can walk it as follows:

  • for primitive instr, count as 1 instr
  • for call to known-const-time function, use the upper bound there
  • for if: count(test)? + max(count(then clause)?, count(else clause)?) here count : code -> Option returns Some(N) is guaranteed within N steps, and None otherwise
  • for/while loops: give up

If it's done, it would have to be a compiler built-in; procedural macros run before type resolution, etc. and as such cannot have access to type information.

The likely way to do this would be to count MIR operations, after MIR is generated.

4 Likes

cargo-bloat can give you size of functions, which is roughly correlated with number of instructions. There's also llvm-lines that gives a number of llvm IR "instructions".

Static approximation of function cost may be doable for code running on the GPU.

However, don't hope to get anything useful for CPUs. Instruction counting has not been representative of code performance since mid '90s. Branches can cost more than execution of a dozen of instructions, depending on prediction pattern. Memory accesses can be more costly than hundreds of instructions, depending on cache state. So there are 100x differences in cost per instruction that depend on run-time details that may vary even between runs of the same code.

6 Likes

I agree this fails to take into account cache misses.

I don't think branch mis prediction is that big of a deal since, in the absence of loops/recursions, the # of branches is limited by the # of times if occurs in the source code.

Although I agree this is in many ways flawed, the alternative right now is ... nothing ... and I think having a count of MIR is better than having nothing.

This sounds like you want a profiler? Especially since the cost of one "call a WebGL function" is completely different from the cost of one "call u32 count_ones" or something.

And we're never going to add something as much of a stability hazard as instruction counting, since different versions of the compiler need to be able to produce different numbers of instructions for the same code.

1 Like

No. Profilers tend to measure at runtime, statistically. What I propose above measures at compile time, analytically. Functions that call WebGL / FFI can't be marked const_instr_count since we don't know how many instrs they take.

This is a strawman. I'm not asking for this to be added to the rustc. I even go so far as outline an attempt at doing this via procedural macros. I'm trying to figure out the primitives required; I'm not asking for it as a compiler extension.

But you don't need the static information, you need hard bounds on the run time. As kornel said above, the correlation between instruction count and execution time is very wobbly, so much that it will likely be useless. Either you overestimate the run time by orders of magnitude on average (or even in the worst case), or your estimate will be orders of magnitude off under adversarial conditions (i.e. when it matters most).

And since compiler optimizations can drastically change the code, your estimate will be unstable under program changes, and likely always very wrong. You are using the optimizations, right? I don't think the thing you want can be even somewhat reliably provided outside of debug mode.

You distinguish between &T and &mut T to know which arguments are const and which can be modified right?

I want the static information to know which functions are const # instrs and which are variable # instrs.

By this logic, the entire field of big-Oh analysis is useless.

Big-O analysis is asymptotic. It doesn't care whether your estimate is a billion times worse than reality. As long as you get the asymptotic right, any constant factor is insignificant for large enough inputs (but sometimes those inputs are so large the analysis is worthless in practice, c.f. optimal n*log(n) multiplication of big integers). Big-O is useful precisely because such details as instruction timings, branch prediction, caching, and most compiler optimizations change the run time at most by a multiplicative constant.

There is no such information, and there are no such guarantees given by the compiler. Any function, including the most simple ones such as addition or branching, can have any instruction count in the optimized program. A function call can be entirely eliminated, or it can be duplicated many times, or transformed in some weird ways. The execution time and instruction counts are not considered observable effects, and the compiler is free to modify them as it sees fit.

If you want specific instructions, use inline assembly.

1 Like

Great. Reread the question as : let's statically separate obviously provable O(1) functions from not obviously provable O(1) functions.

That's doable, but that result may not be as interesting as you think. For example, the following program obviously runs in O(1) time:

fn fn0() {
    println!("hello world");
}
fn fn1() { fn0(); fn0(); }
fn fn2() { fn1(); fn1(); }
fn fn3() { fn2(); fn2(); }
...
fn fn64() { fn63(); fn63(); }

fn64() runs "in O(1) time" but more than 264 steps — for most practical purposes, it is non-terminating. (This pattern is notable for being executable on things that are not even Turing-complete, as long as they have some sort of “use a previous definition” that can be repeated; it is sometimes known as the “billion laughs attack”.)

So, if you want to statically characterize performance, big-O is not sufficient, because something can be simultaneously O(1) and too slow for any practical use. The instruction-counting strategy will catch this case, but I agree with other commenters that it is unlikely to give any information useful for guiding manual optimization. Profiling is really the best option.

4 Likes

My general gist here is:

FOR EVERY definition of the static tool $T$ I define:
  THERE EXISTS some member of IRLO smart enough
    to construct a worst case scenario where $T$ is useless

and that's okay; because most of the time I write code, I'm trying to get work done, not "how do I write code that breaks $T$" :slight_smile:

===

I feel many of these styled arguments could be made vs linters, yet most of us accept that linters are atleast useful to some people.

1 Like

The difference here is that you're explicitly asking about providing a guarantee. Tools like linters are useful because they can incrementally improve local quality. A proof assistant like you're asking for is only useful if it provides an actual sound proof.

And btw, I think you'll find that putting a constant bound on the amount of work done requires first proving that no panics can happen, which is itself a hard unsolved[1] problem beyond extremely trivial (basically const propagated) cases. As panicking involves calling the panic hook, which is dynamically dispatched to a runtime modifiable hook. Not to mention allocating, which calls into opaque routines that can take arbitrarily long. Or that accessing memory for the first time could do an arbitrary amount of work.

The point is that baving a static approximation of the worst case number of MIR operations under a whole bucketload of assumptions about the environment, while possible with a sufficient set of ignored complexity, is less useful than just giving a wall-clock measurement with statistical significance for a branch-covering benchmark suite, in addition to being significantly more difficult.


  1. at a MIR level before doing expensive monomorphized inlining and optimization, anyway, since you can "just" make calling the panicking symbols a linker error ↩︎

2 Likes

MIR is one option. In OP, I left the notion of 'instruction' empty. Another option is just counting # of AST nodes of the function. This is something independent of compiler transformtions.

The fundamental belief that many here holds (which I don't hold) is that of: "useful approximations/ guarantees most be close to actual running time". This then argues: counting MIR / ast nodes is not useful; you are better off running a statistical profiler.

The absolute insanity of this mindset is the following:

  1. in my mind, for a tool to be useful, it only needs to be useful in one situation (because the developer is capable of judgment of whether or not to use the tool)

  2. the counter arguments I'm getting are all of the form: there exists some situation the tool is not useful, therefore it should not exist at all; this is like saying: hammers can't be used to cut steaks, therefore it's useless

I think you might be taking this too harshly. It's about tradeoffs. Hammers are generally useful in a wide enough set of cases to be worth their upkeep. As are flathead and phillips screwdrivers. Torx screwdrivers, however, are of limited utility and expensive (for what they are). Socket sets would be somewhere in the middle. I think this is far closer to "Torx" than it is to "hammer".

The issue seems to be about how useful the information is in general versus the overhead of maintaining the functionality. I would recommend defining what you mean by "instruction" here because it is quite fluid and has ramifications on what this attribute means for stability guarantees (can I increase it by 1 to squeeze in a feature or do I have to worry about blowing the limit on some dependent crate?).

This is not what you asked for. You asked

This is only useful if you have a guarantee that it holds.

My argument was not that it's impossible to do this in some cases. My argument was that it's impossible to ever get an answer outside of the most trivial cases, which is effectively also equivalent to const evaluating the code.

The secondary argument is that even in the cases where the question can be answered, the answer is fundamentally useless because of the number of simplicifications required to get any answer.

Rereading the OP, what's missing is making a few things explicit:

  • The described analysis does no analysis of called functions; it merely requires that they are also annotated and trusts that they are annotated accurately
  • To this point, this analysis does absolutely no flow analysis and includes trivially untaken branches in the worst time cost calculation
    • additionally meaning that any code with a panicking codepath cannot be used
    • perhaps then encouraging the use of unsafe to statically make dynamically unreachable cases unreachable_unchecked rather than relying on the compiler trivially proving some code safe at zero overhead, opening the door to UB under changes
    • presuming of course that unreachable_unchecked is given a cost of -∞
  • There exists some root level functionality (roughly: intrinsics) where we lie and just pretend they're some fixed abstract anount of work despite not being comparable in the abstract in this way nor even being a bounded amount of work over all input values
    • because we can't say it's a fixed amount of work only for some input values because the analysis deliberately does no kind of analysis to make it decidable
  • To this point, measuring this actively encourages golfing these abstract work units since they're easily measurable, despite being loosely correlated to performance at best
    • and it very often being the case that writing code which pre-optinization does more abstract work results in optimized code doing a strict subset of the work it would have done otherwise
    • a common example being the pre-slice pattern to aggregate bounds checking

Note though that the first reply, mine, actually was a "here's how something like this could be done". The rest explain why this doesn't provide the answer to the question you're asking, and that the subset of the language where it is possible to answer is to a fairly accurate first order of approximation just things which get const evaluated for an answer of 0.

Let T be a tool that can determine whether some functions "always runs + terminates within N [abstract units of work]."

point 1

Many arguments above are of the form: "T's is either too low or too high compared to wall clock time; therefore it is useless."

However, notice the [abstract units of work] above. I'm happy with a world where N is measured in MIR; or a world where N is measured in # of AST nodes of the source code.

point 2: example of a build

For [abstract units of work] defined as "AST nodes", it absolutely is possible to build a simple formulation of T:

  1. functions that (1) do not call other functions, (2) do not loop, (3) do not panic <-- "clean straight line programs" // we can infer # ast nodes

  2. functions that use 'if': we take the worst case of the two branches

  3. functions that use functions transitively defined in #1 and #2: we can also analyze

point 3: is this ever useful

I have been writing lots of graphics code lately, including my own vec2, vec3, vec4, mat2, mat3, mat4: there are all functions that I believe can be trivially proved to only take const # of instrs

This is a case where such a tool is useful to me: sanity checking low level gfx primitives -- even though it does not make any promises about wall clock time.

addendum:

Let S be the statement:

  1. there exists a situation X and "abstract unit of work" U where S is useful
  2. for all situation X and "abstract unit work" U, S is useless

from a quantifier perspective, (1) looks much easier to prove than (2), but for whatever reason, there seems to be an obsession at proving (2)

I sympathize with you, but I think you go too far here. I think the “obsession” is rather “for most situations X and "abstract work units" U, S is useless” and “S is useless for too many situations X and "abstract work units" U to be worth implementing”.

Let's put a hard definition on what Rust your analysis support: I posit it is very close to (if not exactly) the same subset as const compatible work. This will look very similar to symbolic execution done in a compiler, because it's effectively that. But, especially given only analyzing const Rust, it should be possible to do.

Again, I question the utility of such a tool, because for the cases where it works, the code is likely simple enough to validate by hand. It might also be that you would gain the same amount of utility from a simpler analysis, say, that the code contains no looping constructs.

The analysis would roughly look like

// improving this is basically what compilers do for a living
enum Symbolic<T> {
    Known(T),
    Unknown, 
}

struct Work(usize);

macro eval($($x:expr),+) {{
    $(match $x {
        (symbol, work) => {
            __work += work;
            symbol
        }
    }),+
}}

macro track_work($x:expr) {
    let mut __work = Work(0);
    ($x, __work)
}

// victim
fn example(x: i32, y: i32, z: i32) -> i32 {
    add(mul(x, y), z)
}

// expanded
const fn eval_example(_0: (i32, Work), _1: (i32, Work), _2: (i32, Work) -> (i32, Work) {
    track_work! {
        let (x, y, z) = eval!(_0, _1, _2);
        eval!(add(eval!(mul(x, y),) z))
    }
}