Kickstarting a database-wg

I recently posted the idea of a database WG on Twitter and it was met with a lot of excitement. There was also a recent post on Reddit proposing the same idea, drawing on examples of where using Rust with databases is currently a painful experience.

Diverging interests

From the feedback I've seen so far, there seem to be people interested in vastly different things when they hear "rust-db-wg":

  1. Writing databases
  2. Binding against databases
  3. Making all of this async

Because of these diverging interests I feel it's important to "sit down" beforehand and talk about what people want from a database WG. It's also important to consider that databases don't exist in a vacuum: there will be overlap with other working groups (specifically the async ecosystem WG), which offers the opportunity for further collaboration via semi-regular check-ins and meetings to discuss shared roadmap goals and implementation plans.

Organising a WG

As part of this, I would also want to work out a base charter to start the WG, as well as settle when and how to hold regular meetings to discuss roadmaps and the projects currently being worked on.

My ideas

The following are some things that I personally feel should be improved, and where a WG could pool development effort, guide newcomers to Rust and databases towards contributing to existing libraries, and maintain some libraries itself.

  • An SQL abstraction that wraps multiple backends, allowing you to execute SQL regardless of the database backend (similar to Slick from Scala)
  • Expand the range of available async database adapters to prepare the ecosystem for async/await
    • Having talked to people from the net-web-wg and net-async-wg, this is definitely an interest!
    • This includes building an async connection pool ecosystem - bb8?
  • Pushing the r2d2 ecosystem to 1.0 as a stable, blocking I/O connection pool
  • Stabilising, improving or developing ready-to-go database middleware for existing web-frameworks
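
To make the first bullet concrete, here is a rough sketch of what a backend-agnostic SQL surface could look like. This is purely illustrative: the `Value`, `Row`, `SqlBackend` and `FakeBackend` names are invented here and do not come from any existing crate.

```rust
use std::collections::HashMap;

/// A dynamically typed SQL value (a real design would also cover
/// dates, decimals, blobs, etc.).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Text(String),
}

/// One result row, keyed by column name.
type Row = HashMap<String, Value>;

/// The backend-agnostic surface: every driver (Postgres, MySQL,
/// SQLite, ...) would implement this trait.
trait SqlBackend {
    /// Run a statement, returning the number of affected rows.
    fn execute(&mut self, sql: &str, params: &[Value]) -> Result<u64, String>;
    /// Run a query, returning dynamically typed rows.
    fn query(&mut self, sql: &str, params: &[Value]) -> Result<Vec<Row>, String>;
}

/// An in-memory fake backend, just to show the trait being used.
struct FakeBackend;

impl SqlBackend for FakeBackend {
    fn execute(&mut self, _sql: &str, params: &[Value]) -> Result<u64, String> {
        Ok(params.len() as u64)
    }
    fn query(&mut self, _sql: &str, _params: &[Value]) -> Result<Vec<Row>, String> {
        let mut row = Row::new();
        row.insert("id".into(), Value::Int(1));
        Ok(vec![row])
    }
}

fn main() {
    // Application code only sees the trait object, not the driver.
    let mut db: Box<dyn SqlBackend> = Box::new(FakeBackend);
    let rows = db.query("SELECT id FROM users", &[]).unwrap();
    assert_eq!(rows[0]["id"], Value::Int(1));
    println!("got {} row(s)", rows.len());
}
```

The point of the sketch is only that application code programs against the trait while drivers supply the implementation; the hard part in practice is the surface area of real backends.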

As part of the diesel 2.0 release there will be some changes to the migration API to be more flexible and modular. This is also something that could be tackled by the WG.

Wrapping up

So, to summarise: I want to hear what your ideas are, what you think is important and what needs to be worked on. Let’s set out some high-level goals and organise our efforts around these goals.

Edit 1: Small discussion summary: Kickstarting a database-wg

25 Likes

If there are enough people interested in each topic, you could consider setting up a single WG with multiple teams, similar to how the Embedded WG is managed. I imagine the topics will overlap a little, which lets you share knowledge even if most people are mainly interested in just one of them.

2 Likes

Is this going to be an official WG or an unofficial one?

Because I remember some earlier discussions stating that the team doesn't want to start new working groups until they have a policy in place that makes creating a working group more standardised. We can still go ahead and create an unofficial WG, which is fine as well.

Anyhow, I'm interested in this and can help set up the working group if required.

@Dylan-DPC this should become an official WG. I talked to a few people from the core team about how to proceed with this and “setting up a WG charter” was a recommended first step (edit: although the way that WGs are “made” is currently changing)

Not sure when this would become an official WG but putting in the work now seems like a good idea.

2 Likes

My wishlist:

  1. Documentation. I’d love to see some “database book” that covers connections, drivers and pools, has examples for every feature and gives an overview of the ecosystem. Individual projects would benefit greatly from better docs. The goal: google “rust foo” and land on a page with an example of foo, followed by a more detailed description.

  2. A solution for database migrations that is easy to integrate with other projects. Modeled after Django (NOT based on writing your own SQL).

  3. Unified integration with other projects like web frameworks.

  4. Some sort of ODBC would be nice, though it works differently in Rust, since I expect type safety you can’t get from a “send SQL and keep your fingers crossed it’s valid” API. We need that kind of API too, though.

  5. Obviously async. I believe this is a good chance to come up with an API that all drivers implement.

  6. A Django-style, model-based solution that generates “create table”, manages migrations without writing SQL and provides a high-level unified interface.
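
As a sketch of what point 6 could look like, here is a toy model description that renders itself as a CREATE TABLE statement. All names (`Column`, `Model`, `create_table_sql`) are hypothetical; a real tool would also diff model versions to produce ALTER TABLE migrations.

```rust
/// A single column in the model description.
struct Column {
    name: &'static str,
    sql_type: &'static str,
    nullable: bool,
}

/// A table model, declared in code rather than in SQL.
struct Model {
    table: &'static str,
    columns: Vec<Column>,
}

impl Model {
    /// Render the model as a CREATE TABLE statement.
    fn create_table_sql(&self) -> String {
        let cols: Vec<String> = self
            .columns
            .iter()
            .map(|c| {
                let null = if c.nullable { "" } else { " NOT NULL" };
                format!("{} {}{}", c.name, c.sql_type, null)
            })
            .collect();
        format!("CREATE TABLE {} ({});", self.table, cols.join(", "))
    }
}

fn main() {
    let user = Model {
        table: "users",
        columns: vec![
            Column { name: "id", sql_type: "BIGINT", nullable: false },
            Column { name: "email", sql_type: "TEXT", nullable: true },
        ],
    };
    println!("{}", user.create_table_sql());
    // CREATE TABLE users (id BIGINT NOT NULL, email TEXT);
}
```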

4 Likes

@andygrove’s sqlparser and DataFusion are really interesting projects – I’m sure he would have some interest in this working group. :slight_smile:

2 Likes

That’s a good wishlist. As for migrations, there is https://github.com/rust-db/barrel (shameless plug) which would definitely become part of the db-wg responsibility to develop and make cooler.

I’m trying to build the modular components that will let me continue to build custom databases in rust in a fraction of the time that it normally takes in c and c++. Testing libraries, pagecaches, performance tooling, replication and network simulation tooling etc… I’m interested in collaborating with other people building the next generation of stateful systems around all of those things. People in this space are still working in the c/c++ mentality of “collaboration is too expensive for our high performance systems with high reliability requirements” but we have significant advantages in our tooling that can greatly reduce the costs of collaboration on these types of projects.

I’ve reached out to a cross-section of folks involved in industry and academia trying to build new stateful systems in Rust, and there seems to be a good amount of interest in starting with some knowledge sharing around things like testing techniques. This may turn into a few new shared tools extracted from our various shell scripts, histogram libraries, and bespoke testing harnesses that all largely solve the same problems.

I’m much more in the building-new-things camp than integration / async-all-the-things but there may be some overlap in lock-free histogram / performance analysis / fault injection tooling etc…

7 Likes

I’m currently experimenting with a flexible data mapper (not object relational mapper) in Rust and already went through a couple of iterations.

If anyone would be interested in such a thing and would like to join in, I’d be happy to collaborate.

I don’t think such a project is easy for a single person (because of the amount of stuff in databases that you might miss), and also, I would like to avoid creating a straight port of other libraries.

1 Like

I’m going to post this since it’s related.

Here is a link to rdbc.

I’d also love for such a system to exist for Rust.

I’m very interested in this because RDBMS are my main world. As I have said here, I’ll repeat some points, but with better elaboration, I hope:

  • Async. Rust is NOT async-first, and async will forever be a second-class citizen there; if a language is not async from the start, async-dependent libraries are a pain. It is good to have async as an option, but not to force it. I don’t know whether wrapping async to look sync works well without making debugging even more complicated, so if there is a clean way to use async under the hood, that’s fine by me. BTW, I have used dozens of languages, NONE with an async-first DB layer, and I have never hit a real performance problem. Truly needing async for performance is much more niche than most people think, IMHO.
  • We need something like the Python database API. If it’s extended for async, fine by me.
  • We need a good way to map database results to structs. But while type safety is nice, a HashMap-like container is also required. SQL is dynamic, and no amount of structs and types will be enough.
  • Full ORMs or heavy interfaces like diesel are nice, but they are not what we must build first.
  • Query builders and the like are nice, but don’t depend on them. Sending SQL strings with parameters comes first.
  • Make it pleasant to do parametrized queries.
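
On parametrized queries (and on dumping SQL scripts, which comes up again below), here is a hedged sketch of expanding `?` placeholders into escaped SQL literals. Real drivers should bind parameters server-side; client-side expansion like this is mainly useful for generating SQL dumps. The `Param` type and both function names are invented for illustration.

```rust
/// A parameter value to bind into a statement.
enum Param {
    Int(i64),
    Text(String),
}

/// Render a parameter as an SQL literal. Strings get their single
/// quotes doubled, the standard SQL string escape.
fn sql_literal(p: &Param) -> String {
    match p {
        Param::Int(i) => i.to_string(),
        Param::Text(s) => format!("'{}'", s.replace('\'', "''")),
    }
}

/// Substitute each `?` placeholder with the next escaped literal.
fn bind_params(sql: &str, params: &[Param]) -> String {
    let mut out = String::new();
    let mut iter = params.iter();
    for ch in sql.chars() {
        if ch == '?' {
            if let Some(p) = iter.next() {
                out.push_str(&sql_literal(p));
                continue;
            }
        }
        out.push(ch);
    }
    out
}

fn main() {
    let sql = bind_params(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        &[Param::Int(1), Param::Text("O'Brien".into())],
    );
    println!("{}", sql);
    // INSERT INTO users (id, name) VALUES (1, 'O''Brien')
}
```

Note this naive scanner does not skip `?` inside string literals or comments; a real implementation needs at least a small SQL tokenizer.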

I want to make clear why this is very important. A “type safe” database layer is good, so much so that I’m building a relational language to map more cleanly to RDBMS, BUT a developer needs to set up a lot before they can start reading/writing to the DB. This assumes you are doing a project for the long run and can truly type out the whole schema before running the app.

But when building scripts/ETLs, database REPLs (like psql), UIs, something like Django (which needs to introspect the DB), and much, much more, you can’t know everything before you start. Sending SQL, getting results, introspecting and moving on is much more common than having a fixed idea of what the schema is.

It’s ironic, but SQL is schema-less. Tables are fixed, yes, but all the rest is not!

  • It’s important to support dates, decimals, enumerations, and embedded data like arrays/JSON

  • Look at how some micro-ORMs are made (like Dapper); I think they hit the sweet spot

  • The library must be integrated with logging

  • Converting from/to JSON, CSV, etc. doesn’t need to be in the library, BUT converting to SQL does! (i.e. dumping SQL scripts)

  • It’s necessary to provide ways to know the types, sizes, constraints, etc. of the SQL I have sent, of the DB, and the rest

The #1 sin most people commit when talking about how to interface with RDBMS/SQL is to hate SQL and try to go full ORM on it. That has been proven, conclusively, to be a mistake, a pipe dream akin to cross-platform UIs. I don’t deny it’s possible to get close to the ideal, but there are a lot of trade-offs that IMHO should not be made in the fundamental layers of these core APIs. Let others build on TOP of this.

This cascades into a lot of details. For example, @Fiedzia says “migrations without writing your sql”.

How do you migrate the addition of an extension on PostgreSQL? Or the removal of 10 tables, merged into one, which needs a massive SQL rewrite in the middle? “Not writing SQL” is nice for BASIC stuff. But if we are talking about SQL, SQL must be a first-class citizen of the APIs.

Put anything you want on top.

A 2-layer or 3-layer design can work well (layer 1: dynamic, SQL strings, parameters; layer 2: an SqlAlchemy-like DSL for automating the SQL stuff; layer 3: a full ORM/Django-style interface).
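
As a sketch of that layering, here is a toy layer-2 query builder that lowers to the layer-1 representation: an SQL string plus a parameter list. Every name here is invented for illustration; a real DSL would of course be far richer.

```rust
/// A tiny layer-2 builder for SELECT statements.
struct Select {
    table: String,
    columns: Vec<String>,
    filters: Vec<(String, String)>, // (column, bound value)
}

impl Select {
    fn from(table: &str) -> Self {
        Select { table: table.into(), columns: vec![], filters: vec![] }
    }
    fn column(mut self, c: &str) -> Self {
        self.columns.push(c.into());
        self
    }
    fn filter(mut self, col: &str, value: &str) -> Self {
        self.filters.push((col.into(), value.into()));
        self
    }
    /// Lower to layer 1: SQL text plus positional parameters.
    fn to_sql(&self) -> (String, Vec<String>) {
        let mut sql = format!(
            "SELECT {} FROM {}",
            self.columns.join(", "),
            self.table
        );
        let mut params = Vec::new();
        for (i, (col, val)) in self.filters.iter().enumerate() {
            sql.push_str(if i == 0 { " WHERE " } else { " AND " });
            sql.push_str(&format!("{} = ${}", col, i + 1));
            params.push(val.clone());
        }
        (sql, params)
    }
}

fn main() {
    let (sql, params) = Select::from("users")
        .column("id")
        .column("name")
        .filter("name", "alice")
        .to_sql();
    println!("{} {:?}", sql, params);
    // SELECT id, name FROM users WHERE name = $1 ["alice"]
}
```

The design point is that layer 2 only ever produces layer-1 input, so anyone who wants raw SQL can drop down a level without fighting the DSL.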

6 Likes

This sounds nice. I'm toying with building a relational language and, along the way, with how to build an RDBMS like SQLite. I'm convinced there are a lot of things that could be improved to make DBs nicer. Storage, transactions and reliability are the hardest parts, IMHO.

Having worked professionally with RDBMS as my main focus for 2 years, and having spent a year of that dealing with the nightmare that is Slick, I want to wholeheartedly warn you against using it as inspiration for a good library. The problem with trying to support multiple backends is that eventually some feature I want will be missing, and I'll have to resort to raw SQL anyway. After 2 years of day-in, day-out SQL (even on the weekends, since this was a startup), I can say that ORMs are not a good idea as a first-class target. ORMs are always driven more by personal preference with regard to ergonomics or style, and they are definitely not a solved problem. Trying to support multiple backends like Slick also tends to leave you with:

  • indecipherable error messages
  • difficulty debugging or contributing to the code base in case of bugs
  • a lot of surface area to support, which requires deep knowledge of the edge cases of each database
  • a lot of surface area w.r.t. features (see how long it took for JSON to be implemented in Diesel)

The highest priority is a high-performance database connector, which has surprisingly few responsibilities but a lot of room for improvement. I would look to HikariCP as an example. Mainly it's about how quickly you can serialize and deserialize messages back and forth, since the network layer is generally the limiting factor. That involves improvements like automatic pipelining/batching of queries. I eventually led the push at my job to dump Slick, but it was deeply ingrained in our code and took a looong time to remove.

As far as validating queries goes, the best way in my opinion is to have an active database running locally to verify queries against during compilation; there are examples of this in Scala, and even Diesel has a compile-time connection component.

5 Likes

Having worked professionally with RDBMS as my main focus for 2 years, and having spent a year of that dealing with the nightmare that is Slick, I want to wholeheartedly warn you against using it as inspiration for a good library.

I didn't mention Slick to insinuate that creating an ORM was a first-class target for a working group, or even that it should be. I don't really think creating a new ORM is, or should be, a priority for the WG.

I did mention it because it provides a pretty convenient way of doing async DB connections, which might be an interesting thing to learn from.

I’m definitely interested in participating. I’m currently working on DataFusion, which is a Rust-native SQL query engine optimized for analytics queries against columnar data, and is now part of Apache Arrow. I am also the author of the sqlparser crate, which has been a bit neglected honestly, but I do have plans to make this more valuable to the Rust community.

Other than this, I have experience building commercial database drivers (ODBC + JDBC) and middleware, as well as ORMs, DAOs etc., so I definitely feel that I could contribute to this WG.

2 Likes

This is a really interesting idea which I’ve not seen before. To clarify: instead of directly mapping a result set -> struct (or vice versa), you’re building an abstraction layer that goes DB -> Gateway -> Repository -> Relation -> Changeset -> Struct?

Will each flavor of RDBMS have its own gateway implementation? Just want to make sure I’ve got this straight in my head.

Regardless, I’m really interested in participating in this WG! I’m relatively new to Rust (coming from a DBA/SysAdmin/ETL Dev background).

Along those lines, yes. It's inspired by https://rom-rb.org/ and Elixir's Ecto. The nice thing about that approach is that it doesn't force DBs to be RDBMS, while still allowing for specificity: e.g. RDBMS are queried with SQL (or their specific SQL dialect), while other stores are queried by other methods.

Rom.RB is an especially interesting library, as it is at least the 4th attempt by those folks at a proper flexible mapper, and even the first ones weren't bad :).

The whole current idea is that mapping out of the database is an important application concern, and that attempting to map uniformly between domain and database leads to bad design. Also, don't abstract too hard over the database; there are reasons to choose one over another.

I have two comments here: Rust is not evented-async first-class, but it has extremely good support for encapsulating async components. It's completely feasible (and IMHO even useful) to run multiple schedulers in one program, and handover is extremely well structured. I agree with you that evented async DB drivers are not necessarily the first thing I'd build, but having the interface async helps. Many database interfaces are mapped to thread pools, though, as in the following:

async fn read_from_db(id: u64, db: Db) {
    let row = db.get_row(id).await;
}

Waiting for a thread to finish is much better abstracted in an async fashion than in a sync fashion, and very much possible in Rust.

Async maps very well onto sync: you can just run one async task in its own lightweight executor.
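
As an illustration of that last point, here is a minimal, standard-library-only sketch: a hand-rolled `block_on` drives one async task to completion, and the task offloads a blocking driver call (the hypothetical `get_row_blocking`) to a worker thread. Production code would use an executor crate rather than this busy-polling toy.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::thread;

// A no-op waker is enough here because we busy-poll the future.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Drive a single future to completion on the current thread.
fn block_on<F: Future>(mut fut: F) -> F::Output {
    // Safety: `fut` is never moved after being pinned here.
    let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        thread::yield_now();
    }
}

/// Stand-in for a blocking driver call.
fn get_row_blocking(id: u64) -> String {
    format!("row {}", id)
}

/// The async wrapper: hand the blocking call to a thread and wait.
/// A real pool would reuse threads and await the result properly.
async fn read_from_db(id: u64) -> String {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || tx.send(get_row_blocking(id)).unwrap());
    rx.recv().unwrap()
}

fn main() {
    let row = block_on(read_from_db(42));
    println!("{}", row); // row 42
}
```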

SQL has a lot of issues that make working with it painful for library authors. Django could not be so useful without an ORM. Not to say SQL should be avoided at all costs, but a layer above it is often required and should be easily available when desired.

How do you migrate the addition of an extension on PostgreSQL?

By declaring that I want to use some library/application module/crate that requires it, and having that thing generate whatever is required for it to work. It might use SQL, it might not (MySQL may use Percona instead of plain SQL); I don't need to know that. It might do different things depending on the database I am using, or on its version and configuration.

Or the removal of 10 tables, merged into one, which needs a massive SQL rewrite in the middle?

It's trivial with Django: add the new table (make migrations), add a data migration, update the app to use it, remove the 10 tables (make migrations), done. Two of the three steps will have automatically generated migrations, forward and back. With "write your SQL", it's a lot more work that computers do better than people.

“Not writing SQL” is nice for BASIC stuff.

Yes. But there are many applications that do a lot of basic stuff, and having it done in an automated way makes things a lot easier to change. Writing a "basic" migration in a database-agnostic way is actually anything but basic.

A 2-layer or 3-layer design can work well

Conceptually maybe, but in practice you will have the database driver integrated at all levels, because layers 2 and 3 require features that layer 1 doesn't need or know about.

1 Like

That is my point, actually. I'm not against building nice stuff on top. But it's important that this "stuff" doesn't dominate the design of the basic interface, because it would constrain the possibilities and also make it hard to actually implement all the possible ideas.

For the other points, it's a matter of degree. I consider using SQL much simpler and easier than trying to understand a layer on top, mainly for complex stuff. I have used the Django ORM and SqlAlchemy a lot, so I understand the benefits of having idiomatic sugar. But I have also needed to work around every single ORM library in my entire career. So that is just my concern: keeping it simple.

2 Likes