ingka-group/rust-for-data-scientists-workshop

Rust for Data Scientists Workshop

Authors: Nicolas Chagnet & Ivan Oštrić. Organization: Ingka Group, IKEA.

Important

This repository is for educational purposes only. The material contained within was created for an internal workshop at Ingka Group and will not be maintained or updated. It may contain outdated information and should be used as reference material only. For any questions or issues, please contact the code owners/maintainers listed in the CODEOWNERS file.

You can find the source for the slides for this workshop here. The labs are available in the lab folder, which contains the Rust crate to work on as well as the solution. The README file in that folder will guide you through the exercises.

In this README, you can also find a small "Rust in 5 minutes" section which will show you the absolute basics of the language. This is an important section as we will not have time during the workshop to go through every detail and we will instead focus on the important conceptual differences rather than syntax differences.

Setup

  1. Install Rust via Rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. Install the uv package manager for Python (optional)
  • Instructions can be found here

Alternatively you can use your favourite virtual environment manager. The lab was written assuming uv is available.

  3. Verify the installation:
rustc --version
cargo --version
  4. Recommended VS Code extensions (optional):
  • rust-analyzer (language server)
  • CodeLLDB (debugger)
  5. Clone this repository somewhere on your computer

  6. To make sure everything is running smoothly, you can test your setup in the hello_world/ folder. This is a simple crate with one binary in it:

  • Run cargo run inside and you should see Hello, world!.
  • Run cargo test and you should see
running 1 test
test tests::test ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

With this, you have now run your first program and its unit tests! You can move on to the next section, which covers the Rust basics needed for the workshop.

Rust in 5 minutes

This is a primer on Rust’s syntax so it doesn’t feel alien during the workshop. You’ll see explicit types, semicolons, and blocks but also familiar concepts like iterators and functional style. Some aspects (like ownership) might seem strange now but will be made clearer in the workshop.

Hello, basics

// Function definition using `fn`
fn main() -> () {                        // explicit empty return type `()`; normally omitted
    println!("Hello, world!");           // macro call (note the `!`): unlike a function, it is expanded at compile time
    let x: i32 = 5;                      // variables are immutable by default
    let mut y = 10;                      // mutable variable with `mut`
    y += 1;
    println!("x = {x}, y = {y}");
}
  • Semicolons end statements.
  • Variables are declared with let.
  • Types are often inferred, but you can also explicitly annotate them (sometimes needed when ambiguous): let x: f64 = 3.14;
  • Single quotes for char 'a', double quotes for &str "a".
  • {} denote a local scope: variables defined within are dropped at the end.
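The scoping and annotation rules above can be sketched as follows. The function name scoped_sum and the values are made up for illustration.

```rust
// Hypothetical helper showing that a `{}` block is a scope and an expression
fn scoped_sum() -> i32 {
    let total = {
        // `inner` exists only inside this block
        let inner = 40;
        inner + 2 // no semicolon: the block evaluates to this value
    };
    // `inner` has been dropped here; using it would be a compile error
    total
}

fn main() {
    let x: f64 = 2.5; // explicit type annotation
    let c = 'a';      // char (single quotes)
    let s = "a";      // &str (double quotes)
    println!("{x} {c} {s} {}", scoped_sum());
}
```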

Cargo, crates and modules

Rust uses cargo for package and project management. It handles:

  • creating new projects: cargo new my_app
  • building/running projects: cargo build and cargo run (for production code, use cargo build --release)
  • testing projects: cargo test
  • adding dependencies: cargo add polars --features lazy
  • formatting code: cargo fmt

The project definition, including dependencies added by cargo, is stored in Cargo.toml at the project root. Rust projects can be binaries and/or libraries:

myapp/
  Cargo.toml
  src/            # Rust source code lives in a src/ folder
    main.rs       # binary crate root
    bin/
      other.rs    # second binary target (extra binaries are discovered in src/bin/)
    lib.rs        # library crate root (optional)

Libraries are meant to be imported and reused, binaries are meant to be run directly. A fully packaged Rust project is called a crate. Community crates can generally be found on crates.io.

Rust source code is stored in modules which can either be single-file or folder-like (with a mod.rs file in the folder):

myapp/
  Cargo.toml
  src/            
    lib.rs           # library crate root
    other_module.rs  # single-file module
    some_module/
        mod.rs       # declares nested module
        module.rs    # module with code
  • To declare a module, write mod other_module; or mod some_module; in lib.rs or in your binary's root file. Nested modules are declared the same way inside mod.rs.
  • To import an item from an external crate, write use some_crate::some_function;.
  • By default, everything in a module is private (it can only be used within that module). Put the keyword pub in front of mod, use, fn, or any other item to make it public. Inside mod.rs, you can also re-export specific components.
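As a minimal sketch of the visibility rules, here is a module written inline so that the example fits in one file. The module name stats and its functions are hypothetical; in a real project the body would usually live in src/stats.rs, with mod stats; in the crate root.

```rust
// Inline module: same effect as a separate file declared with `mod stats;`
mod stats {
    // `pub` makes this function visible outside the module
    pub fn mean(values: &[f64]) -> f64 {
        total(values) / values.len() as f64
    }

    // Private helper: only callable from inside `stats`
    fn total(values: &[f64]) -> f64 {
        values.iter().sum()
    }
}

use stats::mean; // bring the public item into scope

fn main() {
    println!("{}", mean(&[1.0, 2.0, 3.0]));
    // stats::total(&[1.0]); // would not compile: `total` is private
}
```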

Types you’ll meet often

  • Numerics: types distinguished by their size
  • Integers: signed (i8, i16, i32, i64), unsigned (u8, u16, u32, u64)
  • Indexed collections are indexed by usize (pointer-sized: 64 bits on 64-bit targets, 32 bits on 32-bit targets)
  • Floats: f32, f64
  • Bool/char: bool (true, false), char (e.g. 'a', '\u{D7FF}')
  • Two types of strings:
    • String = owned, growable string (for dynamic content)
    • &str = string slice (borrowed, like a view of an array)
  • Collections:
    • Vector: Vec<T> (vec![1, 2, 3]) (equivalent of Python's list)
    • Hash map: std::collections::HashMap<K, V> (analogous to Python's dict)
  • Tuples: (i32, f64, bool); Arrays (fixed size): [T; N]; Slices (arrays, Vec): &[T]
let s: &str = "hi"; // a 'static lifetime string
let mut owned = String::from("hi"); // A dynamic string (most common)
owned.push('!');
let v = vec![1, 2, 3];
let a = [0; 4];            // [0, 0, 0, 0]
let slice: &[i32] = &v[1..]; // borrow a view
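The snippet above does not show hash maps or tuples, so here is a small sketch of both. The helper word_counts is a made-up example, not part of the workshop code.

```rust
use std::collections::HashMap;

// Hypothetical helper: count how often each word occurs
fn word_counts(words: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for w in words {
        // `entry` inserts 0 if the key is missing, then we increment the count
        *counts.entry(w.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts(&["a", "b", "a"]);
    println!("{:?}", counts.get("a")); // Some(2)

    let pair: (i32, f64) = (1, 2.5); // tuple
    let (n, x) = pair;               // destructuring into two variables
    println!("{n} {x} {}", pair.0);  // fields are also accessible by position
}
```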

String vs &str (common gotcha): Rust has two common string types, String and &str. The difference between them is akin to that between a vector Vec<T> and a slice &[T]. The former is the owned type: it can be declared mutable, grown dynamically, moved, and so on. The latter is a borrowed slice (or view) into a string. It is cheap to create and to pass around since it is just a reference. It cannot grow the underlying data, but it is more general: a function that takes &str accepts both string literals and borrowed Strings.

  • Convert: let s: String = "hi".to_string(); or String::from("hi")
  • Borrow: let view: &str = &s;
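A minimal sketch of why &str is the more general parameter type. The function shout is hypothetical.

```rust
// Taking `&str` lets callers pass literals and borrowed `String`s alike
fn shout(text: &str) -> String {
    let mut owned = text.to_uppercase(); // allocates a new owned String
    owned.push('!');                     // owned strings can grow
    owned
}

fn main() {
    let literal: &str = "hi";
    let owned: String = String::from("hello");
    println!("{}", shout(literal));      // HI!
    println!("{}", shout(&owned));       // HELLO! (`&String` coerces to `&str`)
    println!("{}", shout(&owned[1..3])); // EL! (a sub-slice is also a `&str`)
}
```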

Functions, expressions, returns

fn add(a: i32, b: i32) -> i32 {
    a + b // last expression (no semicolon) is the return value
}
// This macro accepts a variable number of arguments, unlike Rust functions
println!("It is {}:{}!", "12", "50");
  • Functions are declared with fn and are different from macros. Macros are expanded at compile time and can generate more complex code.
  • A block evaluates to its last expression when that expression has no trailing semicolon; otherwise it evaluates to ().
  • Use -> Type for return types.
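To illustrate the difference between the implicit final expression and an explicit return, here is a small sketch with made-up function names.

```rust
// Explicit `return` is only needed to exit early
fn clamp_to_unit(x: f64) -> f64 {
    if x < 0.0 {
        return 0.0; // early return
    }
    if x > 1.0 { 1.0 } else { x } // final expression is the return value
}

// No `-> Type`: the function returns `()`, and the body ends with a statement
fn print_sum(a: i32, b: i32) {
    println!("{}", a + b);
}

fn main() {
    println!("{}", clamp_to_unit(1.5)); // 1
    print_sum(2, 3);                    // 5
}
```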

Control flow

  • Conditionals use if:
let n = 5;

// Example of condition
if n >= 0 {
    println!("non-negative")
} else {
    println!("negative")
}

// `if` is also an expression so you can assign the branch result
let sign = if n >= 0 { "non-negative" } else { "negative" };
  • Loops use for, while, or loop (an infinite loop that must be exited with break)
for x in 0..3 { println!("{x}"); } // 0,1,2

let mut i = 0; // Must be mutable to increment
while i < 3 { i += 1; }

let mut j = 0;
loop {
    j += 1;
    if j > 2 { break; }
}
  • Pattern matching with match destructures values and branches on their shape
let val = 2;
match val {
    0 => println!("zero"),
    1 | 2 => println!("one or two"),
    _ => println!("something else"),
}

let y = Some(42); // an Option wrapping a value
if let Some(x) = y { println!("{x}"); } // pattern-match deconstructs the data and assigns to a variable
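match becomes more useful on enums that carry data. The sketch below uses a hypothetical Reading enum, defined only for this illustration.

```rust
// Hypothetical enum: each variant can carry different data
enum Reading {
    Missing,
    Value(f64),
    Range { low: f64, high: f64 },
}

fn describe(r: &Reading) -> String {
    // `match` must cover every variant, otherwise the code does not compile
    match r {
        Reading::Missing => String::from("missing"),
        Reading::Value(v) if *v < 0.0 => format!("negative {v}"), // match guard
        Reading::Value(v) => format!("value {v}"),
        Reading::Range { low, high } => format!("between {low} and {high}"),
    }
}

fn main() {
    println!("{}", describe(&Reading::Value(2.5)));
    println!("{}", describe(&Reading::Range { low: 0.0, high: 1.0 }));
    println!("{}", describe(&Reading::Missing));
}
```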

Structs and data shapes

  • Structs hold data (like a Python dataclass)
  • Methods are defined in impl blocks
  • Associated functions (static methods) do not take self; methods take self, &self, &mut self or mut self (more in the workshop)
  • Convention: define a constructor named new
  • Paths use :: as the namespace separator. Fields are private by default; use pub to make them public.
struct Point { x: f64, y: f64 }

impl Point {
    fn new(x: f64, y: f64) -> Self { Self { x, y }  }
    fn norm(&self) -> f64 { (self.x * self.x + self.y * self.y).sqrt() }
}

let p = Point::new(3.0, 4.0); // float literals are required: integers are not converted to f64 implicitly
println!("{}", p.norm()); // 5
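The different self receivers can be sketched with a small hypothetical Counter type (not part of the workshop code).

```rust
struct Counter {
    pub label: String, // public field
    count: u32,        // private field (the default)
}

impl Counter {
    // Associated function (no `self`): called as `Counter::new(...)`
    fn new(label: &str) -> Self {
        Self { label: label.to_string(), count: 0 }
    }

    // `&mut self`: mutates the value in place
    fn increment(&mut self) {
        self.count += 1;
    }

    // `&self`: read-only access
    fn count(&self) -> u32 {
        self.count
    }

    // `self`: consumes the value
    fn into_label(self) -> String {
        self.label
    }
}

fn main() {
    let mut c = Counter::new("clicks"); // must be `mut` to call `increment`
    c.increment();
    c.increment();
    println!("{}", c.count());  // 2
    let label = c.into_label(); // `c` is moved here and unusable afterwards
    println!("{label}");
}
```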

Iterators and closures (Pythonic feel)

  • Iterators are lazy, similar to Python's generators
  • Nothing runs until a consuming method is called (.sum(), .collect(), etc.)
let nums = vec![1, 2, 3, 4, 5];
let squares: Vec<i32> = nums.iter().map(|x| x * x).collect();
let evens_sum: i32 = nums.into_iter().filter(|x| x % 2 == 0).sum();
  • iter() borrows; into_iter() takes ownership (more on that in the workshop)
  • Closures: |x| x * x (equivalent to lambda in Python)
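A slightly longer chain, as a sketch of how adapters compose before a consumer runs them. The function name and the computation are invented for illustration.

```rust
// Hypothetical helper: mean of the squares of the positive entries
fn mean_square_of_positives(values: &[f64]) -> Option<f64> {
    let squares: Vec<f64> = values
        .iter()                // borrows each element as `&f64`
        .filter(|v| **v > 0.0) // keep positive values only
        .map(|v| v * v)        // square each one
        .collect();            // consumer: the chain only runs here
    if squares.is_empty() {
        None // no positive values, so no mean
    } else {
        Some(squares.iter().sum::<f64>() / squares.len() as f64)
    }
}

fn main() {
    println!("{:?}", mean_square_of_positives(&[1.0, -2.0, 3.0])); // Some(5.0)
    println!("{:?}", mean_square_of_positives(&[-1.0]));           // None
}
```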

Errors, Options and the ? operator

Will be explained in the workshop!

use std::fs::File;
use std::io::{self, Read};

fn read_file(path: Option<&str>) -> io::Result<String> {
    let mut s = String::new();
    let path = path.unwrap_or("/some/default/path"); // `unwrap_or` extracts the value, or uses the default if the `Option` is `None`
    File::open(path)?.read_to_string(&mut s)?; // `?` propagates errors
    Ok(s)
}
  • Result<T, E> and Option<T> are ubiquitous.
  • Option<T> is either Some(T) or None: useful for missing values.
  • ? returns early from the function, propagating the error (or None) to the caller.
  • unwrap() extracts the value from the Ok or Some variant and panics (fatal error) otherwise. Similar methods (unwrap_or, unwrap_or_else, unwrap_or_default) handle the unsuccessful variant gracefully by supplying a fallback value.
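The same ideas without touching the file system, as a minimal sketch. The functions parse_row and first_or_zero are hypothetical.

```rust
use std::num::ParseFloatError;

// Parse comma-separated numbers; the first bad field aborts with an error
fn parse_row(row: &str) -> Result<Vec<f64>, ParseFloatError> {
    let mut out = Vec::new();
    for field in row.split(',') {
        let value: f64 = field.trim().parse()?; // `?` returns the `Err` early
        out.push(value);
    }
    Ok(out)
}

// Fall back gracefully instead of panicking
fn first_or_zero(row: &str) -> f64 {
    parse_row(row)
        .unwrap_or_default() // empty Vec if parsing failed
        .first()             // Option<&f64>: None if the Vec is empty
        .copied()            // Option<f64>
        .unwrap_or(0.0)      // default when there is no first element
}

fn main() {
    println!("{:?}", parse_row("1, 2.5,3"));
    println!("{}", parse_row("1,x").is_err()); // true
    println!("{}", first_or_zero("oops"));     // 0
}
```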

Derives, Debug, Display

#[derive(Debug, Clone, PartialEq)]
struct User { name: String, id: u64 }

println!("{:?}", User { name: "A".into(), id: 1 }); // Debug
  • Traits are interfaces that are checked at compile time: they will be explained in the workshop.
  • The compiler can 'derive' useful traits for you.
  • Debug to print with {:?}; Display to print with {} and often needs a manual impl std::fmt::Display.
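A manual Display implementation for the User struct could look like the sketch below. The output format "name (#id)" is an arbitrary choice made for this example.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq)]
struct User { name: String, id: u64 }

// Display is not derivable: we decide what the user-facing text looks like
impl fmt::Display for User {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} (#{})", self.name, self.id)
    }
}

fn main() {
    let u = User { name: "A".into(), id: 1 };
    let copy = u.clone();      // from derive(Clone)
    println!("{u}");           // A (#1)
    println!("{u:?}");         // User { name: "A", id: 1 }
    println!("{}", u == copy); // true, from derive(PartialEq)
}
```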

Ownership (you’ll see this soon; two quick rules)

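  • Most types move by default: assigning or passing a value transfers ownership, and the old binding can no longer be used. A move only copies the small handle (for a String: pointer, length, capacity), never the heap data. Simple types such as integers implement Copy and are copied instead.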
  • Borrow with &T (read) or &mut T (write), one mutable borrow at a time or many immutable.
fn takes(s: String) { /* consumes */ }
fn borrows(s: &String) { /* reads only */ }

let s = String::from("data");
borrows(&s);      // ok, `s` still usable
takes(s);         // moved here
// s no longer valid here
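The mutable-borrow side of these rules can be sketched as follows, with hypothetical function names.

```rust
fn append_mark(s: &mut String) { s.push('!'); } // mutable borrow: may modify
fn length(s: &str) -> usize { s.len() }         // shared borrow: read only

fn main() {
    let mut s = String::from("data");
    append_mark(&mut s); // the mutable borrow ends when the call returns
    let n = length(&s);  // so a shared borrow is fine afterwards
    let t = s.clone();   // explicit deep copy: `s` stays valid
    let moved = s;       // ownership moves to `moved`
    // println!("{s}");  // would not compile: `s` was moved
    println!("{n} {t} {moved}"); // 5 data! data!
}
```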

Generics
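As a minimal, hypothetical sketch of what generic code looks like (the function largest is invented for this illustration): a type parameter T stands for any type that satisfies the listed trait bounds.

```rust
// `T: PartialOrd + Copy` is a trait bound: T must be comparable and cheap to copy
fn largest<T: PartialOrd + Copy>(items: &[T]) -> Option<T> {
    let mut best = *items.first()?; // `?` on an Option: return None if the slice is empty
    for &item in items {
        if item > best {
            best = item;
        }
    }
    Some(best)
}

fn main() {
    println!("{:?}", largest(&[1, 5, 3]));  // Some(5)
    println!("{:?}", largest(&[0.5, 0.1])); // Some(0.5)
    println!("{:?}", largest::<i32>(&[]));  // None
}
```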

Small Rust-to-Python mental map

  • let ~ assignment with optional type annotation; immutable by default.
  • mut ~ “this variable’s binding can change”.
  • match ~ structural pattern matching (richer than Python’s match in practice).
  • Option<T> ~ like the type T | None, but the None case must be handled explicitly.
  • Result with ? propagation ~ similar to exception "bubbling up".
  • return ~ optional in Rust; the last expression of a function body is returned implicitly.

What will look odd (but normal in Rust)

  • Semicolons everywhere; no implicit returns unless it’s the last expression without a semicolon.
  • Types sometimes explicit; very strong inference otherwise.
  • Borrowing syntax & and &mut.
  • Macros with !.
  • use and crate::... paths for imports. Generally, use :: for namespace (String::new(), etc...)

That’s it. With this, you should be comfortable reading the workshop code.

Ecosystem

Here are some alternatives to common data science and general Python libraries in Rust:

| Task/Domain | Python library | Rust crate(s) | Notes |
|---|---|---|---|
| DataFrames | pandas | polars | Fast, lazy API, native Rust; great CSV/Parquet/Arrow support. |
| N-dim arrays | NumPy | ndarray | Dense n-d arrays, BLAS/LAPACK interop via ndarray-linalg. |
| Statistics | SciPy (stats) | statrs | Distributions, PDFs/CDFs, sampling. |
| Machine learning | scikit-learn | linfa | Classic ML algorithms. |
| Deep learning | PyTorch, TensorFlow | candle, burn, tch | candle/burn are native; tch provides PyTorch bindings. |
| Data visualization | matplotlib, seaborn | plotters, plotly | plotters for static/backends; plotly for interactive HTML. |
| Serialization | Pydantic, json | serde | Universal serialization/deserialization. |
| Time/date | datetime | chrono, time | Date/time parsing, formatting, arithmetic. |
| Parallelism | multiprocessing, joblib | rayon | Data-parallel iterators with minimal code changes. |
| Async I/O | asyncio, aiohttp | tokio, async-std, reqwest | Async runtime + HTTP client; for pipelines/services. |
| HTTP APIs | requests | reqwest | Batteries-included HTTP client (JSON, TLS, retries). |
| CLI tools | argparse, click | clap | Ergonomic CLI arguments for data/ETL tools. |
| Logging | logging | tracing, log + env_logger | Structured, async-aware logging (tracing). |

References

Here are some comprehensive Rust references to continue your learning:

Built with

This workshop was built using:

  • Rust version 1.91.1
  • Python version 3.14.0
  • The slides were created using Slidev (MIT) and the Neversink slidev theme (MIT).
  • The tutorial section makes use of Jupyter notebooks. All crates and libraries used in this project use permissive open source licenses (MIT or Apache 2.0).

Contact

If you have any other issues or questions regarding this project, feel free to contact one of the code owners/maintainers for a more in-depth discussion.

License

This open source project is licensed under the "MIT License", read the LICENCE terms for more details.

Attributions

All Ferris assets used in this project were sourced from rustacean.net and are free to use under the author's waiver:

To the extent possible under law, Karen Rustad Tölva has waived all copyright and related or neighboring rights to Ferris the Rustacean.

About

Training material for the "Rust for data scientists" internal workshop.
