February 19, 2021 | Engineering

Oxidizing Kraken: Improving Kraken Infrastructure Using Rust

Simon Chemouil – Director of Engineering, Core Backend

For more than two years now, Kraken’s Core Backend team has been using Rust to modernize services originally written in PHP, while building new products, expanding the feature set and supporting the ever-growing volume of cryptocurrency trading activity.

Hi 👋! I’m Simon, the Director of Engineering at Kraken leading the Core Backend team. I’d like to provide a retrospective of the Core Backend team’s usage of Rust over the last two years and share our perspective on building with it at scale. There are many great online resources explaining how Rust is different and why it is a great language. This is no such blog post, however. It is our hope that this article will be a helpful resource to companies considering building with Rust and to developers who want to invest the time to learn the language. Finally, this is also a massive thank you to all who helped make Rust possible, and a way for Kraken to contribute positively to this movement.

Rewriting Core Services

Often, building a solution from scratch to fix one problem leaves us with another. This is particularly common when the original developers are not part of the design and implementation of the new solution. Other times, the new solution is theoretically better, but takes too much time to be ready, slowing down progress on the system actively serving requests. While we can make sure to avoid these common pitfalls, it is important to challenge the need for a rewrite in the first place.

When Kraken was founded in 2011, PHP offered a mix of execution safety, speed and productivity. It is impressive to see how much functionality was built in the early days. However, Kraken has grown dramatically since, and it has become difficult to expand the PHP code base, share the know-how and make larger changes safely. These core services deal with distributed data storage, cryptography and information security considerations which are less likely to be common skill sets among PHP developers, who are generally more focused on building on existing Web and e-commerce frameworks.

More generally, Kraken has entered a stage of hypergrowth with which the code base and tools need to keep up. In this regard, dynamically typed programming languages are great for getting started, but the code becomes harder to maintain as the code base and the number of engineers grow. Strong types provide guarantees (and formal documentation) that enable fast development and allow more developers to work on a single code base.

Our key goals with a rewrite of core services were to:

keep the system as secure as possible,
make the system more maintainable and robust even as it grows larger,
allow better performance.

It was acknowledged early in 2018 that remaining with PHP was not the best long-term solution to achieve these goals.

Why Rust?

In early 2018, Kraken already had production services written in Go and C++. While Rust promised performance, security and modern language constructs, picking it as the language to rewrite core services was a bet.

Kraken is very security-minded. As such, we prefer not to let C++ code run against user input. Even the best C++ teams in the world, like those building Windows™ or Chrome™, produce code in which about 70% of CVEs are due to memory safety issues (such as use after free, buffer overflows and double frees, which can lead to privilege escalation or unauthorized memory access). These issues are completely prevented by languages such as Java, Go or Rust.

While Go protects against this specific class of vulnerabilities, it does not provide modern programming features like generics or sum types, and their absence ultimately leads to data modelling issues or repetition. Kotlin provides a more elaborate type system and, like Go, makes asynchronous programming relatively easy, but it comes with a Java ecosystem carrying a lot of legacy.

Enter Rust. Its reliability and performance have made it successful in cryptocurrency and blockchain projects. A few Kraken engineers started experimenting and saw it as an opportunity to build a lasting system that would match Kraken’s backend needs: performance similar to C++, modern language constructs helping to accurately model business logic and error cases, planned first-class support of asynchronous programming, compile-time thread safety, and a vibrant ecosystem. The value proposition of Rust and the demonstrated success of the community led Kraken to begin rewriting core services in Rust in mid-2018.
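As a small illustration of how sum types help "accurately model business logic and error cases", here is a minimal sketch. The `OrderState` enum and the `describe` function are hypothetical names invented for this example and are not taken from Kraken's code base.

```rust
/// Hypothetical order lifecycle modelled as a sum type: a value is exactly
/// one of these variants, and each variant carries only the data it needs.
#[derive(Debug, Clone, PartialEq)]
pub enum OrderState {
    Pending,
    PartiallyFilled { filled: u64, total: u64 },
    Filled,
    Rejected { reason: String },
}

/// The `match` must be exhaustive: adding a variant to `OrderState` makes
/// this function fail to compile until the new case is handled.
pub fn describe(state: &OrderState) -> String {
    match state {
        OrderState::Pending => "pending".to_string(),
        OrderState::PartiallyFilled { filled, total } => {
            format!("{}/{} filled", filled, total)
        }
        OrderState::Filled => "filled".to_string(),
        OrderState::Rejected { reason } => format!("rejected: {}", reason),
    }
}
```

Calling `describe(&OrderState::PartiallyFilled { filled: 3, total: 10 })` yields `"3/10 filled"`. In a language without sum types, the same model usually needs a status field plus several optional fields whose valid combinations are only documented, not enforced.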

Two years later

The Core Backend team has grown and is now in charge of both the modern Rust core services and the legacy PHP services that are still being rewritten. In the meantime, Rust has been successfully adopted by other teams: we’ve been joined by the Kraken Derivatives team, which had independently started migrating its entire backend stack to Rust; Cryptowatch has picked Rust for its desktop application; Kraken moved its cold storage system to Rust; and the Kraken Digital Asset Bank is being built in Rust. The language itself has improved significantly, making writing asynchronous network services easier than ever.

Generally, we have been pretty busy: the Core Backend team’s Rust git repositories hold about 500,000 lines of code – more than PHP, even though many features are still implemented in PHP. This is partly due to writing more foundational code, tests and brand new features, but also to the fact that PHP, like other dynamically typed programming languages, does not require explicit type definitions for structures, including errors, which make up a sizable portion of the Rust code. Not having those explicit structures in PHP has made the rewrites almost an exercise in reverse engineering.

Tactically, we decided to rewrite the exact same functionality in Rust: since all PHP services are stateless, it was easy to port the logic to Rust endpoint by endpoint. This has allowed a freshly hired team to gain more knowledge about the underlying system and has made incremental roll-outs as well as easy roll-backs possible. We have built a comprehensive integration test suite that needs to pass against both PHP and Rust services to ensure the behavior is similar. Once a piece of functionality is ported to Rust, it is easier and safer to extend.
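The idea of one test suite that must pass against both implementations can be sketched as follows. This is only an illustration under simplifying assumptions: the `Endpoint` type, `parity_failures` and the toy `*_normalize_pair` functions are hypothetical, and plain functions stand in for what would be HTTP or RPC calls to the PHP and Rust services.

```rust
/// Hypothetical stand-in for "call this endpoint on a given backend".
/// A real suite would issue a network request; here it is any function
/// from a request string to a response string.
pub type Endpoint = fn(&str) -> String;

/// Runs every request against both implementations and collects the
/// requests whose responses differ. An empty result means parity.
pub fn parity_failures(legacy: Endpoint, ported: Endpoint, requests: &[&str]) -> Vec<String> {
    let mut failures = Vec::new();
    for &request in requests {
        if legacy(request) != ported(request) {
            failures.push(request.to_string());
        }
    }
    failures
}

/// Toy "legacy" behavior: trim whitespace, then upper-case.
pub fn legacy_normalize_pair(request: &str) -> String {
    request.trim().to_uppercase()
}

/// Toy faithful port: same observable behavior, different implementation.
pub fn ported_normalize_pair(request: &str) -> String {
    request.trim().chars().map(|c| c.to_ascii_uppercase()).collect()
}

/// Toy broken port: forgets to trim, so padded inputs diverge.
pub fn buggy_normalize_pair(request: &str) -> String {
    request.to_uppercase()
}
```

Comparing `legacy_normalize_pair` with `buggy_normalize_pair` on the inputs `"btc/usd"` and `" eth/eur "` reports only the padded request as a failure, which is the kind of behavioral drift such a suite is meant to catch before an endpoint is switched over.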

While the primary goal of the rewrite was not performance, it is great to see Rust provide fantastic speed out of the box. Our Tokio-powered RPC servers, which were not particularly optimized (though we are generally careful with memory usage patterns), support a throughput of 150k requests/second per instance while keeping p99.9 latencies below 3ms. A system is only as fast as its slowest part, and while our PHP core services are not the only bottlenecks at Kraken, their IO performance is lower than Rust’s and they are more sensitive to load. After entire end-to-end paths have been migrated to Rust and bottlenecks eliminated, our clients should see dramatic improvements. In the meantime, we do everything we can to improve performance and reliability by moving endpoints to Rust, redesigning databases and scaling up services.

Figure: what happens to response times when an endpoint is ported to Rust.

Rust for application services

Rust is often touted as a great systems programming language ideally suited to lower-level tasks, command-line utilities and network services such as load balancers. Many people consider Rust’s complexity a deal breaker for general business logic, and the talent pool too small to use the language for common tasks such as building a user management system or a REST API.

While Rust is a great fit for systems programming, we have also been using it for application services that are commonly implemented in languages considered higher-level such as Java, Ruby or TypeScript. Correctness is absolutely critical at Kraken, and Rust’s modern language constructs make it easier to write correct, robust code. The lack of garbage collection, often brought up as a downside for writing general logic that does not need to “care” about memory management, has not been a problem in practice, as we are building stateless services and storing cyclic data is never a concern.

However, Rust requires precision, and I’d say this has been the most beneficial aspect of the language: its explicitness, supported by its strong type system, leads to expressive code that is easy to review and reliable at runtime. In that regard, I’d consider Rust both lower-level and higher-level than Java and other languages of the same category. The Core Backend team also develops some more technical services, such as load balancers or services monitoring streams, that require good performance. It has been extremely practical not to have to switch between systems and application logic languages, and to be able to reuse libraries and patterns.

As the team and code base grow, the ability to review code effectively is critical. Rust is remarkable for making behavior very explicit in isolation, by which I mean it is not necessary to think as much about other parts of the system ‒ the current function is often enough. When reviewing code, we are presented with a diff (the lines that have changed and the surrounding context), and while one can take more time to dig into the changes, a faster review is a great motivator for developers, who get feedback rapidly. In Rust, I can be certain that a change that compiles will be both free of data races (a prominent root cause of concurrency bugs) and memory safe (we remain for the most part on safe Rust). I can easily spot functions that may lead to a panic (Rust’s way of aborting execution when there is no alternative), spot useless memory copies and in general gather the developer’s intentions. Clippy, Rust’s linter, helps unify code style and leads to a more idiomatic, consistent code base. I have reviewed thousands of merge requests these last two years with much higher confidence than I would have had with any other mainstream programming language.
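To make the data-race guarantee concrete, here is a minimal sketch using only the standard library. The function name `count_in_parallel` is hypothetical and chosen for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts events from several worker threads into one shared total.
pub fn count_in_parallel(workers: usize, events_per_worker: u64) -> u64 {
    // Shared mutable state must be wrapped in a thread-safe type. Swapping
    // `Arc<Mutex<u64>>` for `Rc<RefCell<u64>>` makes the `thread::spawn`
    // call below fail to compile instead of racing at runtime.
    let total = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..events_per_worker {
                    // `lock` fails only if another thread panicked while
                    // holding the lock; `unwrap` turns that into a panic.
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        // `join` returns `Err` only if the worker thread panicked.
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}
```

Every place that can panic is marked by a visible `unwrap()`, which is what makes such spots easy to find in a diff; the locking discipline itself is enforced by the types rather than by reviewer vigilance.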

Rust is a large, complex language and it is easy to get lost in details. Fortunately, it is not necessary to know all the minute details to be productive. It is our experience that Rust is an extremely productive language: it has great tooling, forces us to thoroughly model problems, saving precious debug time and potential production issues, and is great for code reuse ‒ a productivity multiplier. Finally, I feel it is necessary to debunk the “fighting the borrow checker” legend, a story depicting the Rust compiler as a boogeyman: in my experience, it happens mostly to beginners and the 1% trying to micro-optimize code or push the boundaries. Most experienced Rust developers know exactly how to model their code so that no time is wasted fighting the compiler on design issues, and can spot anti-patterns at a glance, just like most people know how to drive their car on the correct side of the road to avoid accidents, and notice those who don’t!

Building a Rust team

During these two years, we have built the foundations of the modern Kraken backend stack, rewritten existing functionality in Rust, built new Rust services and features, and also built the Core Backend team of 30+ engineers. A few developers were originally hired as PHP developers and learned Rust as they joined the team. It is worth mentioning that Kraken is a globally distributed remote-first company, and the Core Backend team has engineers of 15 nationalities working from 12 countries.

Rust attracts passionate developers, often interested in systems programming, distributed systems or cryptography. A good portion of our current Core Backend engineers are Rust enthusiasts who have discovered the opportunity through various Rust online resources, from Reddit to This Week In Rust where our job offers have been featured many times (thanks!), making Kraken somewhat known for hiring Rust developers for a long time.

Our Core Backend roles combine challenging technical and business problems in a very competitive market, remote working from all over the world with a high, location-independent base compensation and a generous option grant, and writing Rust almost full-time. This was not lost on the many candidates who have applied over these two years, and it has made it possible to build a world-class engineering team.

While we were originally open to hiring developers with interest but no hands-on experience in Rust, we quickly realized it did not always pan out and that the learning curve depends on the individual. Interestingly, Rust attracts developers coming from very different languages, both statically and dynamically typed. It’s hard to say whether those coming from a specific background face a steeper learning curve, but some become effective in weeks while others still struggle after months. Those who are used to relying on documentation and have a formal understanding of semantics are most likely to catch up fast. Like many fast-growing companies, we need new hires to be able to help immediately on real issues. We thus require demonstrable Rust experience and specifically test for a thorough understanding of Rust’s type system and practical knowledge of the standard library and common crates.

I believe that using Rust helps one become a better developer as it pushes for clean design and precision. However, knowledge of Rust alone does not make one a great engineer. A lot of candidates we have seen are ecstatic about Rust, but have limited experience building backend systems. We have hired many junior developers showing great potential, because when growing a team, balance is key to success. Experienced developers are often great mentors: they usually carry the wisdom of keeping things simple, have learned not to trust themselves too much, and know how to maximize their impact on the business line.

Considering how the language resolves pain points with C++, Java or Go, I’d expect more seasoned developers to make the jump. Coming myself from a decade of Java development, I appreciate healthy skepticism for overhyped new technologies, yet I would now dread returning to a language that does not match Rust’s qualities ‒ in particular how it lets me focus on the module at hand, instead of needing to constantly consider a number of implicit invariants, like whether that piece of code is called from another thread and whether I’d need to make it thread-safe. We hope that with more companies now heavily investing in Rust ‒ from Discord and Deliveroo to Amazon and Microsoft ‒ we can help send a signal that there are many Rust jobs and that time invested in learning the language will not go to waste. Many seasoned developers will prefer to remain on the stack they are experts in, but some may still enjoy stepping out of their comfort zone and challenging themselves.

Rust is great, not perfect!

Rust has allowed us to build a lot of well-functioning, high-performance production code. We have a large team where many people are working on very different parts of the backend. Most of that code has been incredibly robust: we have not experienced a single crash or panic (the rough equivalent of a runtime exception) of a Rust core service. All in all, I can say we have only had business logic issues, misconfiguration problems, and one general performance issue linked to running Tokio on the musl libc with a specific kernel configuration, which was easily fixed once identified using the perf tool.

While the language is great, there are a couple of things widely acknowledged as limitations that have bitten us as well.

Ideally, each fallible function would have its own error enum to precisely capture its errors and handle them, but in practice it is too verbose and leads to using the less precise Error trait or one enum per module. The language could support this better: there are several initiatives and macros exploring this.
When designing a library crate, the lack of specialization and generic associated types (GATs) can be quite limiting.
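The first limitation is easier to appreciate with an example. Below is a minimal sketch of a dedicated error enum for a single fallible function; `parse_quantity` and `ParseQuantityError` are hypothetical names used only for illustration. Note how much of the code is error plumbing rather than logic.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Errors specific to the hypothetical `parse_quantity` function.
#[derive(Debug, PartialEq)]
pub enum ParseQuantityError {
    Empty,
    NotANumber(ParseIntError),
    Zero,
}

impl fmt::Display for ParseQuantityError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseQuantityError::Empty => write!(f, "quantity is empty"),
            ParseQuantityError::NotANumber(err) => write!(f, "quantity is not a number: {}", err),
            ParseQuantityError::Zero => write!(f, "quantity must be greater than zero"),
        }
    }
}

impl std::error::Error for ParseQuantityError {}

/// Lets the `?` operator convert a `ParseIntError` into our enum.
impl From<ParseIntError> for ParseQuantityError {
    fn from(err: ParseIntError) -> Self {
        ParseQuantityError::NotANumber(err)
    }
}

/// Parses a strictly positive quantity, reporting precisely why it failed.
pub fn parse_quantity(input: &str) -> Result<u64, ParseQuantityError> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return Err(ParseQuantityError::Empty);
    }
    // `?` uses the `From` impl above to wrap the parse error.
    let value: u64 = trimmed.parse()?;
    if value == 0 {
        return Err(ParseQuantityError::Zero);
    }
    Ok(value)
}
```

Callers can match on each variant exhaustively, which is the precision the paragraph above asks for. Repeating this boilerplate for every fallible function is what pushes teams toward one enum per module or a boxed `Error` trait object.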

We first used async Rust with Future combinators, and adopted async/await support as soon as it landed on nightly. It’s been an extraordinary feature that let us build massively concurrent applications using Tokio. We never had to spend much time making our servers handle more than 10K concurrent connections or implementing back-pressure. However, it can still get better:

Unlike most parts of Rust, async functions are a bit of a footgun as they look innocuously similar to regular functions, but may not be executed entirely (more precisely, the Future they return may not be polled to completion). This requires extra care when handling clean-up logic, which cannot currently be asynchronous itself. The ongoing work to support asynchronous drop will hopefully provide a piece of the solution. How to make these problems more visible is still an open question: could an attribute make it clear the Future is cancellation-safe and a lint warning tell us otherwise?
Though it is getting better, the ecosystem has been badly hurt by the split of asynchronous frameworks. The language would greatly benefit from providing the constructs that would allow task scheduling subsystems to be abstracted without overhead, so one may choose to use their favourite executor, and pass task executors down to libraries, or drive the Future themselves.
The current design of statically initialized task executors makes it easy to run several executors by mistake by simply pulling a dependency. The generalized usage of thread locals makes debugging more difficult.
Being able to design asynchronous functions in traits without boxing, and to refer to the result type, would definitely be a great improvement in performance-sensitive code.
We are also hopeful that work around io_uring will result in great performance improvements without creating further splits in the ecosystem.
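The first point, that a Future may never be polled to completion, can be shown with the standard library alone. The sketch below drives a future by hand with a no-op waker instead of Tokio so that it stays self-contained; `transfer`, `YieldOnce`, `CleanupGuard` and `run_with_polls` are hypothetical names for this illustration. It assumes that synchronous clean-up in `Drop` is acceptable, which is exactly the restriction described above.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Pending on the first poll, ready on the second: a stand-in for any
/// `.await` point that has to wait (I/O, timers, locks).
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Synchronous clean-up that runs whenever the guard is dropped, including
/// when the enclosing future is dropped half-way through.
struct CleanupGuard {
    log: Log,
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        self.log.borrow_mut().push("cleanup");
    }
}

/// Hypothetical async operation with an await point in the middle.
async fn transfer(log: Log) {
    log.borrow_mut().push("started");
    let _guard = CleanupGuard { log: Rc::clone(&log) };
    YieldOnce { yielded: false }.await;
    // Not reached if the caller drops the future after the first poll.
    log.borrow_mut().push("finished");
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls `transfer` at most `polls` times, drops it, and returns the log.
pub fn run_with_polls(polls: usize) -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);

    let mut future = Box::pin(transfer(Rc::clone(&log)));
    for _ in 0..polls {
        if future.as_mut().poll(&mut cx).is_ready() {
            break;
        }
    }
    // Cancellation is simply dropping the future.
    drop(future);

    let entries = log.borrow().to_vec();
    entries
}
```

With zero polls nothing runs at all, with one poll the log is `["started", "cleanup"]` and "finished" never appears, and with two polls the function completes. The guard ensures clean-up still happens on cancellation, but only because that clean-up is synchronous.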

In terms of tooling, Cargo and Rustup have made setting up and compiling projects a non-concern. RustAnalyzer has improved spectacularly and provides a great IDE experience that will get even better. Compile times have generally come down: they could be shorter, but with incremental builds and sccache they aren’t too much of a time sink. Optimized builds are indeed slow, but overall that is a small price to pay for the performance and safety. A private Cargo registry with permissioning support would definitely help Rust in the enterprise. We have been using git dependencies, but the lack of semantic versioning support makes updates painful. There are a few open source Cargo registries out there, but Cargo itself does not support access tokens or credentials. We’d love to sponsor that work.

Kraken ❤️ Rust

All things considered, Rust is very mature and most of its pain points would exist in one shape or another in other mainstream languages. Rust makes reuse trivial and lets us deal safely with large code bases under active development without sacrificing performance.

Using Rust is for us no longer an experiment or a bet. It is the proven technology we are building on, and the Core Backend team is looking for engineers to help. This picture would be incomplete without mentioning our team values beyond Rust: the Core Backend team extends Kraken’s culture ‒and the commitment to our mission‒ with the engineering values of Rust and a high-performance team culture influenced by Netflix. We believe in engineering beyond code, in ownership. We reject hubris. We constantly learn from each other. Our challenge is made greater by not sharing an office, and the individuals who thrive are engineers able to move forward autonomously by being at the same time self-driven, deeply technical and able to bridge between requirements and technical solutions. We combine smart and hard work, but preserve a good work-life balance and stay healthy. We care about what we are building and help our teammates succeed. We realize perfect is the enemy of good (Rust developers are notorious perfectionists! 😉). Finally, we believe in constant self-improvement as a group: if something is broken, either technically or organizationally, we fix it.

We have openings in the Core Backend team for Mid-level and Senior Backend Engineers, as well as Site Reliability Engineers who help support and improve our operations, tooling and CI. We’re also hiring Software Engineers in Test to help us test our APIs with Rust and Cucumber.

Other teams at Kraken are also always looking for great engineers who have adopted Rust as their tool of choice to build robust and responsive systems:

The Kraken Digital Asset Bank is a Special Purpose Depository Institution that builds modern banking and payment systems in Rust, and is looking for senior engineers;
The Kraken Derivatives team has been using Rust as their primary language for building derivatives trading for two years now, coming from Java and Kotlin, and is looking for backend engineers;
Our Trading Technology team, building our spot trading, also runs a number of services both in C++ and Rust and is hiring backend engineers;
Cryptowatch is also hiring Rust GUI developers as they build a lightweight desktop trading application.
Be sure to check our other openings!

Finally, we’d like to help Rust grow. We have already sponsored some open source work ‒ such as the iced GUI framework ‒ through our Kraken Grants program. We’d love to sponsor individual contributors to the Rust project or key related projects. If you are making important contributions to the Rust ecosystem and need funding, please reach out! In the meantime, we have been impressed by the outstanding work carried out by the RustAnalyzer team that directly benefits the entire community, and will be donating 50K EUR to the project!