Statically-linked C/C++ libraries

The goal is to explore the current situation of crates including statically linked C/C++ libraries and to start a discussion about ways to make it easier to import external code in crates in a secure and reliable manner.

Overview

To get an idea of the extent of this pattern, let's explore the content of crates.io by analyzing the crates with more than 100k downloads on 2022-08-07 (the top 4.7k crates, see the methodology for more details).

There are currently 70 C/C++ native libraries included through git submodules in 58 crates from the top 4.7k crates. Some of them are widely used, like libz-sys with 20M downloads and 46 reverse dependencies, or libgit2-sys with 11M downloads. Among these crates:

  • 6 have the -src suffix in their name

    • boringssl-src, openblas-src, sqlite3-src, openssl-src, zeromq-src and luajit-src
    • A total of 47 crates in crates.io have the -src suffix.
  • 38 are -sys crates that include a library directly

    • libevent-sys, pcre2-sys, lmdb-sys, lzma-sys, lmdb-rkv-sys, croaring-sys, openvino-sys, zstd-sys, cloudflare-zlib-sys, mozjpeg-sys, boring-sys, libsodium-sys, librocksdb-sys, libz-sys, libnghttp2-sys, libgit2-sys, sdl2-sys, curl-sys, rpmalloc-sys, sass-sys, rdkafka-sys, snmalloc-sys, wabt-sys, z3-sys, ckb-librocksdb-sys, libssh2-sys, libbpf-sys, oboe-sys, lz4-sys, tikv-jemalloc-sys, fasthash-sys, libusb1-sys, shaderc-sys, minimp3-sys, jemalloc-sys, liblmdb-sys, aom-sys, brotli-sys
    • A total of 2288 crates in crates.io have the -sys suffix.
  • 2 have a _sys suffix, a -sys variant

    • audiopus_sys and onig_sys
  • 1 has an -ffi suffix, a -sys variant

    • wepoll-ffi
  • 12 have no specific name pattern

    • afl, hidapi, khronos_api, mimalloc, parity-secp256k1, rust-htslib, rusty_v8, souper-ir, spirv-reflect, sprs, tflite, twox-hash

Two main patterns appear:

  • standard -sys crates which are also able to compile the library they are providing an interface for, either by default or only when enabled by a feature flag (see this blog post for details on how it's done).
  • dedicated crates containing -src in the name, depended on by -sys crates

Note: This only covers the crates containing submodules, but sometimes the code is vendored directly into the repository, like freetype-sys which has a copy of freetype2 sources. In any case, the source becomes part of the crate uploaded to the registry.

Case studies

Let's have a closer look at a few representative crates.

mozjpeg-sys

  • The source is included through a git submodule.
  • The version number of the crate, 1.0.2, is not related to the upstream version, 4.0.3.
  • The license of the crate is IJG, which broadly matches the license of the upstream source (but seems incomplete).
  • It always builds mozjpeg as a static dependency.

curl-sys

  • The source is included through a git submodule.
  • The crate versions are built as the following SemVer string: 0.4.56+curl-7.83.1, defined as MAJOR.MINOR.PATCH+BUILD, with BUILD being curl- followed by the upstream curl version.
  • By default, it tries to dynamically link to the system curl and openssl, and falls back to static linking. It also has static-curl/static-ssl features to enforce static linking.
  • There is no way to enforce dynamic linking (i.e. make the build fail if the library is missing on the system).
  • The crate documents an MIT license, while curl is licensed under a custom license (but close to MIT).

openssl-src

  • The source is included through a git submodule.
  • The crate only contains the logic to build openssl. The API is in openssl-sys which depends on openssl-src when the vendored feature is enabled (disabled by default). Some crates depending on openssl-sys (like openssl and native-tls) expose a similar flag too.
  • The crate documents an MIT OR Apache-2.0 license, while openssl is licensed under:
    • Apache-2.0 starting from 3.0
    • Dual OpenSSL and SSLeay licenses before, which are in particular not compatible with the GPL. The release/111 branch providing versions under this license is still maintained.
  • The crate versions are built as the following SemVer string: 111.16.0+1.1.1l, defined as MAJOR.MINOR.PATCH+BUILD, with BUILD being the upstream openssl version.
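The two build-metadata conventions seen in these case studies (curl-sys prefixes the upstream name, openssl-src does not) can be read back with a few lines of Rust. This is only a sketch using the standard library: the function names split_build_metadata and upstream_version are made up for illustration, and it assumes the BUILD part contains nothing but the (optionally prefixed) upstream version.

```rust
/// Splits a crate version into its SemVer core and optional build metadata.
/// Hypothetical helper, not part of cargo or of any existing crate.
fn split_build_metadata(version: &str) -> (&str, Option<&str>) {
    match version.split_once('+') {
        Some((core, build)) if !build.is_empty() => (core, Some(build)),
        Some((core, _)) => (core, None),
        None => (version, None),
    }
}

/// Returns the upstream version stored in the build metadata, if any.
/// An optional `<upstream_name>-` prefix (curl-sys style) is stripped;
/// metadata without the prefix (openssl-src style) is returned unchanged.
fn upstream_version<'a>(version: &'a str, upstream_name: &str) -> Option<&'a str> {
    let (_, build) = split_build_metadata(version);
    let build = build?;
    let prefix = format!("{}-", upstream_name);
    Some(build.strip_prefix(prefix.as_str()).unwrap_or(build))
}
```

For example, upstream_version("0.4.56+curl-7.83.1", "curl") yields Some("7.83.1"), and a crate like mozjpeg-sys 1.0.2, which carries no build metadata, yields None.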

Issues

A lot of widely-used crates include third-party libraries, with little consistency. It causes problems in terms of:

  • Visibility: It is not always easy to know if a library was statically linked (and which version) as it does not appear in the crates tree, cargo-auditable data, or any automated SBOM (like cargo-spdx).
  • Usability: The way to select static vs. dynamic linking varies, and is sometimes not even actionable. Some -sys crates fall back to statically linking a dependency when it is not detected on the build system, without a way to force dynamic linking.
  • Licenses: The core problem here is that the license documented in the Cargo.toml (which is presumably meant to cover only the build code) is sometimes different from the licenses applicable to the library itself, meaning the crate metadata does not match reality. In this case the library licenses are not easily discoverable, and tools like cargo deny check licenses cannot check them. A good example is the OpenSSL license for versions before 3.0, which is incompatible with the GPL.
  • Vulnerabilities: Except for dedicated source crates, there is no accurate visibility over vulnerabilities affecting the included library in the usual Rust tooling (cargo-audit and cargo-deny).
  • Trust: The code is included from external sources, written by unidentified people, and is not visible in tooling like cargo-supply-chain, rust-audit or cargo-crev.

Possible improvements

Just like -sys crates have an official definition in the cargo docs, with a set of recommended practices, a first step could be to write an RFC with similar guidelines for external source crates. This could build upon existing implementations, and allow an easy convergence for libraries using different patterns. It could then be improved by additional tooling or metadata.

Dedicated -src crates

Having dedicated crates (with the -src suffix for discoverability) seems to have quite a few advantages:

  • Allow independent versioning, releases, licenses, security advisories
  • Give visibility over included code in all cargo-based tooling

One obvious big drawback is the maintenance overhead.

Consistent feature-based configuration

Ideally there should be a recommended way (through features of -sys crates) to:

  • Allow enforcing either static or dynamic linking
  • Keep the convenient default used in most existing crates (dynamic linking with static fallback)
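The two requirements above boil down to a small decision table, sketched here in Rust. Everything in it is an assumption made for illustration: the LinkPolicy and Linkage types, the function names, and the idea of two opt-in features (say "static" and "dynamic"). In a real build.rs the feature state would typically be read from the CARGO_FEATURE_<NAME> environment variables that cargo sets for build scripts. Because cargo features are additive, both features can end up enabled at once, so the sketch treats that case as an explicit error.

```rust
/// What the user asked for through (hypothetical) crate features.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkPolicy {
    ForceStatic,
    ForceDynamic,
    /// The common default today: dynamic linking with static fallback.
    PreferDynamic,
}

/// What the build script ends up doing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Linkage {
    Static,
    Dynamic,
}

/// Maps the two hypothetical features to a policy.
fn policy_from_features(force_static: bool, force_dynamic: bool) -> Result<LinkPolicy, String> {
    match (force_static, force_dynamic) {
        (true, true) => Err("both static and dynamic linking were requested".to_string()),
        (true, false) => Ok(LinkPolicy::ForceStatic),
        (false, true) => Ok(LinkPolicy::ForceDynamic),
        (false, false) => Ok(LinkPolicy::PreferDynamic),
    }
}

/// Picks the linkage, given whether the system library was detected.
fn resolve_linkage(policy: LinkPolicy, system_lib_found: bool) -> Result<Linkage, String> {
    match (policy, system_lib_found) {
        (LinkPolicy::ForceStatic, _) => Ok(Linkage::Static),
        (LinkPolicy::ForceDynamic, true) | (LinkPolicy::PreferDynamic, true) => Ok(Linkage::Dynamic),
        // The case that cannot be expressed in some crates today: fail the build.
        (LinkPolicy::ForceDynamic, false) => {
            Err("dynamic linking was enforced but the system library is missing".to_string())
        }
        (LinkPolicy::PreferDynamic, false) => Ok(Linkage::Static),
    }
}
```

The only behavior added compared to the existing default is the error branch for ForceDynamic, which makes the build fail instead of silently vendoring the library.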

There is already a pre-RFC by @kornel to discuss this.

Accurate metadata

License

The license of a crate should cover all files included in the crate archive, including external embedded files.

Using a dedicated crate makes this easier, since the external code and the -sys crate can each document their own license.

Source identification

The other missing information is a way to identify the included software, if possible in a machine-readable manner (CPE, SWID tags, PURL, etc.). It would make it possible to integrate properly with SBOM, automate CVE detection, automate upstream version update, etc.

Note that it would be possible to identify statically linked libraries at compile time already, but this does not work on sources only and does not provide a proper software identifier, just a library name.
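To make the idea of a machine-readable identifier more concrete, here is a rough sketch that renders a Package URL (PURL) of the form pkg:type/namespace/name@version. The UpstreamSource struct and its field names are invented for this example, where such metadata would actually live is left open, and the sketch skips the percent-encoding, qualifiers and subpath defined by the PURL specification.

```rust
/// Hypothetical description of the upstream software embedded in a crate.
struct UpstreamSource<'a> {
    /// PURL type, e.g. "generic" when no ecosystem-specific type applies.
    purl_type: &'a str,
    /// Optional namespace (for instance an organization name).
    namespace: Option<&'a str>,
    name: &'a str,
    version: &'a str,
}

impl<'a> UpstreamSource<'a> {
    /// Renders `pkg:type/namespace/name@version`.
    /// No percent-encoding is applied, so the inputs are assumed to be
    /// plain ASCII without reserved characters.
    fn to_purl(&self) -> String {
        let mut purl = format!("pkg:{}/", self.purl_type);
        if let Some(namespace) = self.namespace {
            purl.push_str(namespace);
            purl.push('/');
        }
        purl.push_str(self.name);
        purl.push('@');
        purl.push_str(self.version);
        purl
    }
}
```

With purl_type "generic", no namespace, name "openssl" and version "1.1.1l", this renders pkg:generic/openssl@1.1.1l, which an SBOM or CVE-matching tool could consume directly.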

Versioning

Most existing -src crates use the SemVer build metadata to provide upstream version. Build metadata is defined as a series of dot separated identifiers using only ASCII alphanumerics and hyphens, which are ignored when determining version precedence. Hence, the format is quite flexible, but cannot be used for actual version comparisons (which need to rely on the base SemVer version).
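The consequence for version comparison can be illustrated with a short standard-library sketch. The function names are made up for this example, pre-release tags are deliberately not handled, and a real tool would more likely rely on an existing SemVer parser.

```rust
use std::cmp::Ordering;

/// Parses the MAJOR.MINOR.PATCH core of a version, dropping build metadata.
/// Pre-release tags are not supported in this sketch.
fn core_version(version: &str) -> Option<(u64, u64, u64)> {
    let core = version.split('+').next()?;
    let mut parts = core.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Compares two versions by precedence; build metadata plays no role.
fn compare_precedence(a: &str, b: &str) -> Option<Ordering> {
    Some(core_version(a)?.cmp(&core_version(b)?))
}

/// Checks the build metadata syntax: dot-separated, non-empty identifiers
/// made of ASCII alphanumerics and hyphens.
fn is_valid_build_metadata(build: &str) -> bool {
    !build.is_empty()
        && build.split('.').all(|id| {
            !id.is_empty() && id.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}
```

Here 111.16.0+1.1.1l and 111.16.0+1.1.1k compare as equal, so a new upstream version is only seen as an update if the base version is bumped as well.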

Using the upstream version directly as the crate version would cause some trouble:

  • Not all software uses SemVer-compatible versioning
  • We need to keep a way to publish updated build code without bumping the embedded code

It could also be a separate metadata (maybe part of the source software id), but it would make upstream version invisible in most use cases.

Source embedding

There are two ways:

  • git submodule

    • Some libraries have different contents in git compared to release tarballs, and may have different build procedures (git vs. tarball).
  • Source import directly in tree

    • Makes it harder to know and check where the source comes from

Improved tooling

Some cargo-based tooling could learn to detect -src crates and implement special handling (extract upstream version, etc.), maybe using additional metadata.

It could also provide automation to alleviate the maintenance burden (automate PRs for upstream version update, security advisories based on CVEs, etc.).

And now?

crates.io is a widely used repository of C/C++ libraries, providing a great experience for Rust developers who rely on them. But the current usage patterns have shortcomings, and are not a great fit for current software supply-chain security and traceability needs.

I'm particularly interested in feedback from maintainers of -sys and -src crates about how they handle the upstream library, how it could be improved, and their opinion on the issues discussed here.

Potential next steps:

  • Work on documentation for -sys crate developers, including the -src crate pattern
  • Work with maintainers of existing -sys and -src crates to improve the situation and discuss recommended practices
  • Work on an RFC to add the -src pattern to the official cargo documentation

Creating a project group could help coordinate future work on this topic.

Thanks to @Shnatsel for feedback on the initial draft of this post, and to @tofay for feedback on software identifiers for SBOM.


I might be in the minority here but if Rust/Crates/Cargo/etc. ever officially endorse crate suffixes (like "sys" and "src") then I hope they use snake_case ("_sys" and "_src") instead of kebab-case ("-sys" and "-src"). It baffles me that crate names don't follow the same identifier rules as the Rust language (I know "-" gets mangled to "_").

The cat's already out of the bag for kebab vs snake crate names. But if Rust ever goes about officially endorsing one then I hope it's the one that doesn't need mangling... Or at the very least entirely avoid it by just using the suffix ("sys" and "src") without - or _.

I think it's maybe more an acknowledgement than an endorsement but cargo docs already use *-sys to designate these crates. And there's already >2k of -sys crates vs. 73 _sys, so it seems to me that the cat's already out of the bag for the suffix too.


One possible idea would be to introduce a specialised crate type, known to cargo, that could be used for -src-like packages. This type would be selected by a flag in the Cargo.toml. (Naming is left to the user's discretion, although adding a src suffix to the crate name is suggested.) Such a crate might not contain a lib.rs file at all, just different build.rs files (or configs) for the different linkage options. One benefit would be speeding up cargo check and IDE tools, which currently need to build the entire static lib in the build script even in check mode, but in the future might skip these "src" packages entirely. Also, when called from another build system, e.g. meson, that build system could indicate that it wants to provide the C dependencies directly, avoiding custom (and possibly flag-incompatible) builds made by cargo.


Also, at the moment, the compilation and linking error output for these crates is very poor. Better error output is needed, especially for beginners.