The 2D framework derives relaxed concurrent data structures, with k-out-of-order relaxation. Relaxation means that the data structures are allowed to return items in slightly the wrong order. In k-out-of-order, a FIFO queue is allowed to dequeue any of the top k items during a dequeue, instead of just the head, which enables superior scalability over strict designs.
The core idea of the 2D framework is to split a data structure over several disjoint sub-structures. Then a window is superimposed over the sub-structures, which defines an area where operations are allowed at each sub-structure. The idea is that a thread can for the most part work alone on a sub-structure as long as it stays within the window. Then the window can be shifted when full or empty.
This repository is based on the ASCYLIB framework and memory is managed using SSMEM which is a simple object-based memory allocator with epoch-based garbage collection.
- How to Relax Instantly: Elastic Relaxation of Concurrent Data Structures
- Kåre von Geijer, Philippas Tsigas.
- To appear in proceedings of the 30th International European Conference on Parallel and Distributed Computing, Euro-Par 2024.
- Monotonically Relaxing Concurrent Data-Structure Semantics for Increasing Performance: An Efficient 2D Design Framework
- Adones Rukundo, Aras Atalar, Philippas Tsigas.
- In proceedings of the 33rd International Symposium on Distributed Computing, DISC 2019.
- Brief Announcement: 2D-Stack - A Scalable Lock-Free Stack Design that Continuously Relaxes Semantics for Better Performance
- Adones Rukundo, Aras Atalar, Philippas Tsigas.
- In proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC 2018.
The src folder includes the data structures encompassed by the framework, as well as designs compared to in the publications. Each of these implementations has a README.md file with additional information.
These designs are on a high level described in the DISC paper, and form the foundation of the 2D framework. They have had some optimizations done in conjunction with later publications.
- 2D queue: ./src/2Dd-queue
- Optimized 2D queue: ./src/2Dd-queue_optimized
- 2Dc stack: ./src/2Dc-stack
- Optimized 2Dc stack: ./src/2Dc-stack_optimized
- 2Dd stack: ./src/2Dd-stack
- 2Dd deque: ./src/2Dd-deque
- 2Dc counter: ./src/2Dc-counter
- 2Dd counter: ./src/2Dd-counter
These designs extend the 2D stack and queue to encompass elastic relaxation. This means that their degree of relaxation can be changed (either manually or with a dynamic controller) during runtime. They are described in the coming Euro-Par paper.
- 2D Lateral-as-Window (LaW) queue: ./src/2Dd-queue_elastic-law
- 2D Lateral-plus-Window (LpW) queue: ./src/2Dd-queue_elastic-lpw
- 2D Lateral-plus-Window (LpW) stack: ./src/2Dc-stack_elastic
These are implementations, or copies, of data structure designs used to compare the 2D designs against. These ar the main ones, used in the Euro-Par paper, but there are some additional ones as well in ./src/.
- Michael-Scott lock-free queue: ./src/queue-ms
- Wait-free queue as fast as FAA: ./src/queue-wf
- k-Segment queue: ./src/queue-k-segment
- Treiber stack: ./src/stack-treiber
- Elimination stack: ./src/stack-elimination
- k-Segment stack: ./src/stack-k-segment
Simply clone the repository and run make to compile all implementations, using their default tests and switches. You can then find and run the respective data structure binary benchmark in bin/.
The default benchmark is a synthetic test where each thread repeatedly flips a coin to either insert or remove an item. The binary takes several cli flags, which are all described by running it with -h, such as make 2Dc-queue && ./bin/2Dc-queue -h. However, the most important arguments to get started might be:
-d: The duration in ms to run the experiment,-i: The number of items to insert before starting the benchmark,-n: The number of threads to use,-l: The depth of the 2D window, only for the 2D data structures,-w: The width of the 2D window, mainly for the 2D and k-segment data structures.
The 2D code is designed to be run only on x86-64 machines, such as Intel or AMD. This is in part due to what memory ordering is assumed from the processor, and also due to the use of 128 bit compare and swaps. Even if runnable on other architectures, the relaxation bounds will likely not hold, due to additional possible reorderings.
Furthermore, you need gcc and make to compile the tests. To run the helper scripts in scripts/, you need bc and python 3. Run the following command to install the required python packages pip3 install numpy==1.26.3 matplotlib==3.8.2 scipy==1.12.0.
To recreate the plots from the Euro-Par paper, run scripts/recreate-europar.sh, which will output its plots in the results folder. Parameters such as number of threads can be adjusted at the top of the script.
Either navigate a the data structure directory and run make, or run make <data structure name> from top level, which compiles the data structure tests with the default settings. You can further set different environment variables, such as make VERSION=O3 GC=1 INIT=one to modify the compilation. For all possible compilation switches, see ./common/Makefile.common as well as the individual Makefile for each test. Here are the most common ones:
VERSIONdefines the optimisation level e.g.VERSION=O3. It takes one of the following five values:DEBUGcompile with optimisation zero and debug flagsSYMBOLcompile with optimisation level three and -gO0compile with optimisation level zeroO1compile with optimisation level oneO2compile with optimisation level twoO3compile with optimisation level three (Default)O4compile with optimisation level three, and without any asserts
GCdefines if deleted nodes should be recycledGC=1(Default) or notGC=0.INITdefines if data structure initialization should be performed by all active threadsINIT=all(Default) or one of the threadsINIT=oneRELAXATION_ANALYSIScan be set to1to measure relaxation levels for relaxed data structures. Keep in mind that this imposes a sequential order of linearizations, totally destroying the throughput at high thread counts.TESTcan be used to change the benchmark used. This has mainly been used in the elastic data structures for testing dynamic scenarios, and possible switches can for example be seen in src/2Dd-queue_elastic-law/Makefile.
- src/: Contains the data structures' source code.
- scripts/: Contains supporting scripts, such as ones aggregating several benchmark runs into plots.
- results/: Default folder for test output of the scripts in scripts.
- include/: Contains support files e.g. the basic window framework (2Dc-window.c).
- common/: Contains make definitions file.
- external/: Contains external libraries such as the ssmem library.
- bin/: Contains the binary files for the compiled data structure benchmarks.
All tests will pin each software pthread to a hardware thread, using the function pthread_setaffinity_np. By default, the threads will be pinned sequentially to hardware threads 0, 1, 2, 3...`. However, the numbering of the hardware threads depends on the specific CPU used, and you might often want to use a different order than the default one. For example, on a dual-socket Intel Xeon E5-2695 v4 system, the even hardware threads are on the first socket while the odd numbers are on the second, and you might not want to benchmark intra-socket behavior.
Here is a short step-by-step instruction for how to add a machine-specific pinning order:
- First, see e.g. the output from
lscpuandlstopo(here we care about theP#<...>numbers) to understand the hardware topology. - Then add an entry for your machine in common/Makefile.common. For example, copy the one for the
athenamachine (start withifeq ($(PC_NAME), athena)...), but changeathenain the aforementioned line to the name of your computer (see output ofuname -n), and changeATHENAin-DATHENAto a similar identifier for your computer (to be used in include/utils.h). - Finally, add a matching entry to the aforementioned identifier in include/utils.h. This entry primarily defines the order in which to pin the software threads to hardware threads. Here again, you can look at the entry for
ATHENAfor inspiration. There are three memory layouts, but the default one (the bottom-most one) is the most important to add, which should pin a thread to each core in a socket, before continuing with SMT, and finally proceeding to the next socket. - For a simpler example, see e.g.ITHACA. Now all tests will use this pinning order. You can validate pinning orders by not allocating all hardware threads, and inspecting the output fromhtopduring a test run.