More powerful data manipulation

#1

I’ve been building a lot of micro-services in Rust, and one of the biggest remaining pain points is data manipulation.

Often I will have a big hierarchy of related data types which are used as part of an API, and maybe stored in a database. The problem is that I’d like to evolve those data types over time: fields will be added, removed, and so on, and I need to maintain backwards compatibility.

At the moment, the only way to do that is to duplicate the entire set of data types into a new module and then painstakingly implement conversions between the two, even though most of the types are completely unchanged.
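For concreteness, here’s a minimal sketch (type names made up) of the duplication I mean: even a type that didn’t change at all still needs a hand-written conversion, just because one of its children did:

```rust
mod v1 {
    pub struct Account {
        pub name: String,
    }
    pub struct Workspace {
        pub owner: Account,
        pub title: String,
    }
}

mod v2 {
    pub struct Account {
        pub name: String,
        pub email: Option<String>, // the only new field in the whole hierarchy
    }
    pub struct Workspace {
        pub owner: Account,
        pub title: String,
    }

    impl From<super::v1::Account> for Account {
        fn from(old: super::v1::Account) -> Self {
            Account { name: old.name, email: None }
        }
    }

    // Workspace itself is unchanged, yet it still needs this boilerplate.
    impl From<super::v1::Workspace> for Workspace {
        fn from(old: super::v1::Workspace) -> Self {
            Workspace { owner: old.owner.into(), title: old.title }
        }
    }
}

fn main() {
    let old = v1::Workspace {
        owner: v1::Account { name: "alice".into() },
        title: "docs".into(),
    };
    let new: v2::Workspace = old.into();
    assert_eq!(new.owner.name, "alice");
    assert!(new.owner.email.is_none());
}
```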

I wanted to see if people have any ideas for solving this problem. I’ve thought about various kinds of code-gen or procedural-macro solutions, but the nature of the problem means they end up being very restrictive, incredibly complicated to implement, and they also kill compile times by generating huge amounts of code.

One thing that could help a little would be expanding the “spread” operator (struct update syntax) to work across different types, as long as the fields were the same. That alone is insufficient, though: you’d also need to be able to apply “into” conversions to all of the fields being spread.
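To make that concrete, struct update syntax today only works between two values of the *same* type; a small sketch (type names made up) of what does and doesn’t compile:

```rust
struct ConfigV1 {
    name: String,
    retries: u32,
}

struct ConfigV2 {
    name: String,
    retries: u32,
    timeout_ms: Option<u64>, // added in v2
}

fn main() {
    let old = ConfigV1 { name: "svc".into(), retries: 3 };

    // Functional-update syntax requires both sides to be the same type,
    // so this does not compile today:
    // let new = ConfigV2 { timeout_ms: None, ..old }; // error: mismatched types

    // Instead, every shared field has to be moved across by hand:
    let new = ConfigV2 {
        name: old.name,
        retries: old.retries,
        timeout_ms: None,
    };
    assert_eq!(new.retries, 3);
}
```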

0 Likes

#2

I don’t really know your requirements and such, but it seems like duplicating the whole hierarchy is the wrong approach. If I wanted to do something like what you describe, I’d first start by asking if I can actually have one “flexible” structure: putting optional fields in with Option, tweaking how it is serialized/deserialized with serde, and so on, or adding serde(default = …) attributes for versions that don’t have them.
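A std-only sketch of that idea (the serde attributes appear only in comments, and the names are made up):

```rust
// One "flexible" struct covering both old and new payloads. With serde you'd
// mark the newer fields `#[serde(default)]` or make them `Option<...>`; this
// sketch just shows the shape plus the normalisation step.
struct User {
    name: String,
    // Added in a later version; older payloads simply won't have it.
    nickname: Option<String>,
}

impl User {
    // Roughly what `serde(default = "...")` would give you at
    // deserialization time.
    fn nickname_or_default(&self) -> &str {
        self.nickname.as_deref().unwrap_or("anonymous")
    }
}

fn main() {
    // An "old" payload, missing the newer field entirely.
    let old = User { name: "alice".into(), nickname: None };
    assert_eq!(old.nickname_or_default(), "anonymous");

    // A "new" payload carries it.
    let new = User { name: "bob".into(), nickname: Some("bobby".into()) };
    assert_eq!(new.nickname_or_default(), "bobby");
}
```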

If that fails, it should still be enough to have multiple versions of only the things that did change, instead of everything. To prevent “poisoning” of the upper levels of the hierarchy, I’d probably make the upper levels generic.

After all, you seem to want to convert everything into one representation eventually, so it would make sense to just deserialize directly into that.

1 Like

#3

Have you considered things like Protocol Buffers and Rust libraries for working with them like Prost and tower-grpc? They are specifically designed to support things like adding, removing, and renaming fields while retaining backwards compatibility.

1 Like

#4

++ to this. If you have a complex hierarchy of types being sent over RPC, you probably want to encode them as protos, where all fields are going to be optional (don’t use required, it’s a trap).

1 Like

#5

That’s fine for some changes, like adding fields, but I think it just defers the problem until you have a bigger change, like renaming or removing a field.

I could use generics, but I’d end up with a ton of generic type parameters everywhere which don’t really mean anything, so I’m not sure it would be much of an improvement.

I don’t think protocol buffers actually solve these problems: they solve the additional backwards-compatibility problems that binary or non-self-describing formats would normally suffer from, but they don’t give you anything beyond what a format like JSON already does, aside from being more efficient.

To take an example: let’s say I have a struct with a boolean field is_required, and later on I decide I want to encode that information differently, as an enum instead, say required: NEVER | ALWAYS | SOMETIMES.

In both JSON and protobuf, I can make the serialisation continue to work easily enough, but in both cases you are expected to do the actual migration of data in client code, and that is what is hard to do in Rust right now, particularly when the data is deeply nested.
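The migration itself is easy to state once in Rust; the pain is repeating this at every level of nesting. A minimal sketch of the single-level case (names made up):

```rust
mod v1 {
    pub struct Field {
        pub name: String,
        pub is_required: bool,
    }
}

mod v2 {
    #[derive(Debug, PartialEq)]
    pub enum Required {
        Never,
        Always,
        Sometimes,
    }

    pub struct Field {
        pub name: String,
        pub required: Required,
    }

    // The actual data migration: the old boolean maps onto two of the
    // three new states; `Sometimes` can only appear in new data.
    impl From<super::v1::Field> for Field {
        fn from(old: super::v1::Field) -> Self {
            Field {
                name: old.name,
                required: if old.is_required {
                    Required::Always
                } else {
                    Required::Never
                },
            }
        }
    }
}

fn main() {
    let old = v1::Field { name: "email".into(), is_required: true };
    let new: v2::Field = old.into();
    assert_eq!(new.required, v2::Required::Always);
}
```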

This is a good point, and it leads me to believe that my best option may be some kind of DSL or transformation layer, like XSLT/JSLT, because writing these transformations in Rust is painful. It does lose a couple of benefits, though:

  • The transformation layer itself is no longer strongly typed (at least things like JSLT are not) and so could be prone to errors.
  • When a change is made and the “back-compat” transformation is implemented, I can’t verify that the old format is actually unchanged.

The far more verbose approach of duplicating the types gives me both of those guarantees.

0 Likes

#6

frunk has powerful data transformation derives. Is that the kind of thing you’re looking for?

0 Likes

#7

At work, we’ve solved this problem for our microservices via code generation (à la protobuf, tarpc, etc.): the proto format effectively supports a DDL which generates request upgrade/downgrade code akin to https://stripe.com/en-US/blog/api-versioning. Super happy with how it’s working (though, as you said, it was complex to implement).


All that said, an effective alternative for you might be making heavier use of generics.

Consider a v1.rs

struct Parent {
    name: String,
    children: Vec<Child>,
}

struct Child {
    name: String,
    age: i32,
}

Normally, if you wanted to update the Child structure in v2.rs, you’d need to copy every single intermediate struct and its impls.

Instead however, you can redefine the top level structs as generic over some T and only implement From/Into once for each child struct that has changed:

// v1.rs
struct ParentBase<T> {
    name: String,
    children: Vec<T>,
}

// Note: as a literal blanket impl this would overlap with the reflexive
// `impl<T> From<T> for T` in core, so in practice this becomes an inherent
// conversion method or a concrete impl per version pair.
impl<A, B> From<ParentBase<A>> for ParentBase<B>
    where B: From<A>
{
    // ...
}

struct Child {
    name: String,
    age: i32,
}
type Parent = ParentBase<Child>;

// v2.rs
struct Child {
    firstname: String,
    lastname: String,
    age: i32,
}
type Parent = crate::v1::ParentBase<Child>;

impl From<crate::v1::Child> for Child {
    // ...
}

(This can happily be combined with other solutions like frunk or your own purpose-built macros to further reduce the amount of boilerplate).
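Filling in the bodies, a complete, compilable version might look like the following. I’ve used an inherent method rather than a blanket From impl, since a blanket impl would overlap with the reflexive `impl<T> From<T> for T` in core; the name-splitting logic is purely illustrative:

```rust
struct ParentBase<T> {
    name: String,
    children: Vec<T>,
}

impl<A> ParentBase<A> {
    // Convert the whole container by converting each child; written once,
    // reused for every version pair that implements `From` on the children.
    fn convert<B: From<A>>(self) -> ParentBase<B> {
        ParentBase {
            name: self.name,
            children: self.children.into_iter().map(B::from).collect(),
        }
    }
}

struct ChildV1 {
    name: String,
    age: i32,
}

struct ChildV2 {
    firstname: String,
    lastname: String,
    age: i32,
}

impl From<ChildV1> for ChildV2 {
    fn from(old: ChildV1) -> Self {
        // Assumed split on the first space; purely illustrative.
        let mut parts = old.name.splitn(2, ' ');
        ChildV2 {
            firstname: parts.next().unwrap_or_default().to_string(),
            lastname: parts.next().unwrap_or_default().to_string(),
            age: old.age,
        }
    }
}

fn main() {
    let v1 = ParentBase {
        name: "family".to_string(),
        children: vec![ChildV1 { name: "Ada Lovelace".into(), age: 36 }],
    };
    let v2: ParentBase<ChildV2> = v1.convert();
    assert_eq!(v2.children[0].firstname, "Ada");
    assert_eq!(v2.children[0].lastname, "Lovelace");
}
```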

5 Likes

#8

Protobufs actually work well for this use case: the serialized form for bool is the same as for enums and ints (it’s a varint), so no special migration of the wire format should be needed.

1 Like

#9

That sounds very useful! I don’t suppose there’s anything more about that approach that you could share?

0 Likes

#10

There are probably different ways we could have done it, but effectively:

  • We wrote a nom parser for a very Rust-like DDL with additional syntax for declaring versioning and how the API surface changes (any way of writing a parser will do; the team was already familiar with nom).

  • We generate a Rust proto crate for each microservice which contains a Service trait and a Client implementation; also included are Upgrade traits which, if implemented (some RPC upgrades are auto-implemented, depending on the complexity of the API-surface changes), allow a v2::Service to handle v1::Service calls.

  • The trait is implemented in a separate binary crate to actually perform the microservice’s api calls.

  • The majority of the generic request-handling code (including tracing, metrics, etc.) lives in a library that is a shared dependency of all proto crates, to reduce the amount of generated code that has to be recompiled when the services change. It’s effectively a wrapper around hyper with a call method that dispatches requests to the correct trait method via an auto-generated implementation.
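None of the names below are our actual generated code, but a minimal std-only sketch of the Upgrade shape (all types and methods made up) might look like:

```rust
mod v1 {
    pub struct GetUserRequest {
        pub id: u64,
    }
}

mod v2 {
    pub struct GetUserRequest {
        pub id: u64,
        pub include_profile: bool, // added in v2
    }

    pub trait Service {
        fn get_user(&self, req: GetUserRequest) -> String;
    }

    // Generated when the change is mechanical: lift a v1 request into v2
    // so that a v2::Service can keep answering v1 callers.
    pub trait Upgrade: Service {
        fn upgrade(req: super::v1::GetUserRequest) -> GetUserRequest {
            GetUserRequest { id: req.id, include_profile: false }
        }
        fn get_user_v1(&self, req: super::v1::GetUserRequest) -> String {
            self.get_user(Self::upgrade(req))
        }
    }
}

struct MyService;

impl v2::Service for MyService {
    fn get_user(&self, req: v2::GetUserRequest) -> String {
        format!("user {} (profile: {})", req.id, req.include_profile)
    }
}

// Accept the auto-generated upgrade path as-is.
impl v2::Upgrade for MyService {}

fn main() {
    use crate::v2::Upgrade;
    let svc = MyService;
    let reply = svc.get_user_v1(v1::GetUserRequest { id: 7 });
    assert_eq!(reply, "user 7 (profile: false)");
}
```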

Fair warning: code generation is addictive! (But it was also a lot of work to write.)

2 Likes