[Pre-RFC] Rust in(ex?)trospection

Summary

Create a summary of the public API of a crate in JSON (or similar) form, either during compilation or from the built rlib. This machine-readable API index shall be available to build scripts and/or procedural macros of dependent crates (possibly upon request).

All symbols will be fully resolved and referenced using their canonical qualified names.

The serialization of stable features should be stabilized while leaving opportunity to add attributes at any level for new features later.

Motivation

The main use would be generating bindings and wrappers. cbindgen currently has to resolve the types itself, so it does not work with aliases, and it is the simple case that only needs to enumerate extern "C" functions and #[repr(C)] types. A more advanced generator (I am thinking e.g. GObject Introspection here) would benefit from being able to:

  • See which types implement certain traits. Doing this with procedural macros requires annotating all such types, which is unwieldy when such annotations don’t actually bring any new information, and problematic for types from dependencies.
  • Generate a list of selected items. Normal procedural macros are not suitable for that, as they run for each annotated item separately (and may only run for some of them in an incremental compiler run). There was a proposal for such a collecting procedural macro for use in test frameworks, but it was not completely general.
  • Work with resolved symbols, so as not to be thrown off by unusual imports, aliases, reexports and such.

Exporting the item definitions including resolved types in some extensible serialization format would allow the binding generator to easily generate both whatever wrappers are needed for marshaling data across the interface and corresponding declarations for the consuming language (C header, GI XML, VAPI etc.).
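To illustrate the resolution problem, here is a minimal sketch of a crate where the exported function's parameter type is only reachable through a reexport and an alias. All names (`inner`, `Point`, `Vec2`, `Position`, `norm_squared`) are made up for the example. A tool that only reads the syntax of the function signature sees `Position` and has to redo name resolution to find the `#[repr(C)]` struct behind it, whereas the compiler already knows both names denote the same type.

```rust
use std::any::TypeId;

mod inner {
    // The actual definition lives in a private module.
    #[repr(C)]
    pub struct Point {
        pub x: f64,
        pub y: f64,
    }
}

// Reexported under a different name...
pub use inner::Point as Vec2;
// ...and then aliased once more.
pub type Position = Vec2;

// The signature mentions only `Position`; nothing here says `#[repr(C)]`.
pub extern "C" fn norm_squared(p: Position) -> f64 {
    p.x * p.x + p.y * p.y
}

fn main() {
    // After resolution, the alias and the original definition are one type.
    assert_eq!(TypeId::of::<Position>(), TypeId::of::<inner::Point>());

    // `type_name` output is not guaranteed to be stable; it typically shows
    // the defining path rather than the alias used in the signature.
    println!("{}", std::any::type_name::<Position>());
    println!("{}", norm_squared(Position { x: 3.0, y: 4.0 }));
}
```

An exported index that records the resolved path for the parameter would let a binding generator skip this resolution step entirely.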

Explanations

TBD: the format will have to be defined.
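Purely as a starting point for discussion, one entry of such an index could look roughly like what the sketch below prints. Nothing here is a proposed schema: the `ItemSummary` type, the field names `path` and `kind`, and the open-ended `extra` map are all assumptions made up for illustration. The only ideas taken from the summary above are that each item carries its canonical qualified name and that further attributes can be added later without changing the existing ones.

```rust
use std::collections::BTreeMap;

/// Hypothetical record for one public item. Every field name is an assumption.
struct ItemSummary {
    /// Canonical qualified name, e.g. "mycrate::geometry::Point".
    path: String,
    /// Item kind such as "struct", "fn" or "trait".
    kind: String,
    /// Open-ended attributes, leaving room for features added later.
    extra: BTreeMap<String, String>,
}

/// Minimal JSON string quoting: handles quotes and backslashes only,
/// not control characters. Enough for this sketch, not for real use.
fn quote(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}

impl ItemSummary {
    /// Serializes the fixed fields first, then the extra attributes in key order.
    /// Assumes no extra key collides with "path" or "kind".
    fn to_json(&self) -> String {
        let mut fields = vec![
            format!("{}:{}", quote("path"), quote(&self.path)),
            format!("{}:{}", quote("kind"), quote(&self.kind)),
        ];
        for (key, value) in &self.extra {
            fields.push(format!("{}:{}", quote(key), quote(value)));
        }
        format!("{{{}}}", fields.join(","))
    }
}

fn main() {
    let mut extra = BTreeMap::new();
    extra.insert("repr".to_string(), "C".to_string());
    let item = ItemSummary {
        path: "mycrate::geometry::Point".to_string(),
        kind: "struct".to_string(),
        extra,
    };
    println!("{}", item.to_json());
}
```

A consumer written against such a format would read the fields it knows and ignore the rest, which is what would allow new attributes to appear without breaking existing tools.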

Drawbacks

  • It is another piece of code in the compiler or compiler-related tool that has to have its backward compatibility maintained.

Rationale and alternatives

  • A simple format is needed in which description of existing item kinds can be stabilized, independent of syntax changes with editions, while new item kinds and new attributes can be added later. JSON seems to fit that requirement well.

    • Advantages of stabilizing format are that the code for processing it can be evolved as a separate crate on crates.io, and that it can be processed by tools written in other languages than Rust.
  • Alternatively the interface could be specified with types to which the data deserialize, similarly to how it is done with token trees for procedural macros.

    • Advantage of specifying types is that their use is checked by the compiler.
    • However, there still needs to be a split between the extraction of the data, which is necessarily compiler-version-dependent, and the interface, which has to stay backward compatible. Otherwise it puts an extra burden on the processing tool to always be updated quickly for each compiler release.
  • For the motivating use-case, an alternative could be to use procedural macro for defining the interface, and have a way of writing some data outside the built library from them.

    • I see some additional use-cases for such mechanism, but in this case big disadvantage of procedural macros is that they run before symbols are resolved (which they have to, since they can generate more symbols that will affect that resolution), so it’s difficult to write the information so that the wrapping tool will know the correct symbol names to use in all complex cases.

Prior art

  • The information is basically what rustdoc writes into documentation, but rustdoc does not generate an index in any format that is easy to process further.

  • cbindgen parsed the code itself to find the right symbols, and I believe now utilizes some procedural macros, but it is a relatively simple use case in that it does not allow most complex types.

  • There is also wasm-bindgen, which exports functions using procedural macros; I am not sure how complex the types it can process are. As far as I know it does not currently generate any form of IDL, it just registers the functions. If it were to generate Web IDL, it might benefit from this proposal.


So, does it make sense to work on this?

Note that I’ve tried to ask users whether something like that exists and got no response.


This is a really interesting and useful sounding idea.

There are a few pre-existing formats with varying levels of community and tooling already around them that may be worth investigating using instead of coming up with a new schema. I know of:

  • Kythe - currently indexes C++, Go, Java, Protocol Buffers, Lisp, and maybe some others. Stores its output as Protocol Buffers. Has some tooling for exploring references, visualising call-graphs, etc. Notably, supports cross-language references well (e.g. "this Go calls this C++" or "This Go symbol was generated from this Protocol Buffer symbol").
  • SemanticDB - currently indexes Scala, has a lot of interesting tooling built on top of it (including an LSP server). Also stores its output as Protocol Buffers.

I can't speak as to whether either is suitable or convenient, but I thought it was worth bringing up the prior art in case we can re-use tooling :slight_smile:


Are you familiar with -Z save-analysis?

This feature would be a huge benefit to rust_swig. It is a tool that generates wrappers around Rust code for use from C++ and Java, and from Python in the near future. Currently the user has to describe the full signature of each fn/struct/trait/enum that they want to "wrap", because I am still not ready to emulate how cargo works and parse all the crates that the current crate depends on.

Plus, it would be great to get the alignment and size of types that are not repr(C). Right now I have to use the heap (Box/Rc/Arc) to cooperate with foreign languages, which is overkill for small types. It would be nice to generate C++ classes with

class Foo {
private:
  alignas(X) uint8_t rust_data[Y];
};

instead of

class Foo {
private:
   void *opaque;
};
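As a rough sketch of where X and Y could come from: `std::mem::align_of` and `std::mem::size_of` already report these values inside Rust code, so a generator with access to them could fill in the inline-storage declaration. The `cpp_storage_decl` helper and the `Foo` struct below are hypothetical names made up for the example. Note that the layout of a type without `#[repr(C)]` is not guaranteed to be the same across compiler versions or settings, so the numbers would have to come from the same build that produces the library.

```rust
use std::mem::{align_of, size_of};

// A small type with the default (unspecified) Rust layout.
#[allow(dead_code)]
struct Foo {
    a: u8,
    b: u32,
    c: u16,
}

/// Hypothetical helper: renders a C++ class that stores a `T` inline as raw bytes.
/// Zero-sized types would yield a zero-length array, which is not valid C++,
/// so a real generator would need to special-case them.
fn cpp_storage_decl<T>(class_name: &str) -> String {
    format!(
        "class {} {{\nprivate:\n  alignas({}) uint8_t rust_data[{}];\n}};",
        class_name,
        align_of::<T>(),
        size_of::<T>()
    )
}

fn main() {
    // The printed numbers depend on the compiler and target.
    println!("{}", cpp_storage_decl::<Foo>("Foo"));
}
```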