MegaMmap: Blurring the Boundary Between Memory and Storage for Data-Intensive HPC Workloads. A software distributed shared memory system that enables infinite memory capacity through intelligent tiered DRAM and storage management.

MegaMmap: Blurring the Boundary Between Memory and Storage

MegaMmap is a software distributed shared memory (DSM) system that eliminates the traditional boundary between memory and storage for data-intensive HPC workloads. By providing a unified, byte-addressable interface that spans DRAM, NVMe, SSD, and HDD tiers, MegaMmap enables applications to work with datasets larger than memory capacity while maintaining competitive performance.

Key Features

  • Infinite Memory Abstraction: Present massive datasets as if they were in main memory
  • Intelligent Tiering: Automatically manage data across DRAM, NVMe, SSD, and HDD tiers
  • Transactional Memory API: Declare access patterns to optimize prefetching and coherence
  • Intent-Aware Coherence: Reduce communication overhead with workload-specific optimizations
  • Persistent Integration: Transparently stage data to/from HDF5, Parquet, and other formats
  • HPC-Optimized: Designed for scientific simulations, machine learning, and data analytics
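To illustrate why declaring an access pattern helps prefetching, here is a minimal sketch of how a declared sequential range could be turned into a list of pages to fetch ahead of the reader. It is not MegaMmap's implementation: `SeqIntent`, `PagesToPrefetch`, and the fixed look-ahead `window` are hypothetical names and assumptions chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a declared sequential access: elements
// [off, off + size) will be visited in increasing order.
struct SeqIntent {
  std::size_t off;
  std::size_t size;
};

// Pages to fetch ahead of the element at `cursor`, looking at most
// `window` pages past the page currently being read and never past
// the last page covered by the declared range.
inline std::vector<std::size_t> PagesToPrefetch(const SeqIntent &tx,
                                                std::size_t cursor,
                                                std::size_t elmts_per_page,
                                                std::size_t window) {
  assert(elmts_per_page > 0);
  std::vector<std::size_t> pages;
  if (tx.size == 0 || cursor < tx.off || cursor >= tx.off + tx.size) {
    return pages;  // cursor outside the declared range: nothing to prefetch
  }
  std::size_t cur_page = cursor / elmts_per_page;
  std::size_t last_page = (tx.off + tx.size - 1) / elmts_per_page;
  std::size_t end_page = std::min(last_page, cur_page + window);
  for (std::size_t p = cur_page + 1; p <= end_page; ++p) {
    pages.push_back(p);
  }
  return pages;
}
```

Because the range is known up front, the look-ahead stops at the end of the declared region instead of speculatively reading past it, which is the kind of saving a declared intent makes possible.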

Performance Highlights

  • 2.6x DRAM Reduction: Maintain performance with roughly 60% less DRAM (1/2.6 ≈ 38% of the original footprint)
  • 2x Faster than Spark: Outperform cloud-based solutions for memory-intensive workloads
  • Larger-than-Memory Datasets: Process datasets 2x larger than available memory
  • 45% Code Reduction: Simpler development compared to traditional out-of-core approaches

Architecture

MegaMmap consists of several key components:

  • Private Cache (pcache): Per-process DRAM cache for low-latency access
  • Shared Cache (scache): Distributed, tiered cache across all processes
  • Data Organizer: Intelligent placement based on access patterns and scores
  • Prefetcher: Overlaps computation with data movement
  • Transaction System: Declares intent for optimized coherence
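The interplay between a bounded fast tier and a larger slow tier can be sketched with a toy model. The class below is not MegaMmap code: `TwoTierCache` and all of its members are hypothetical, both tiers are plain in-memory maps, and eviction uses simple LRU rather than the score-based placement performed by the Data Organizer.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy model: a bounded fast tier (standing in for DRAM)
// backed by an unbounded slow tier (standing in for storage).
class TwoTierCache {
 public:
  using Page = std::vector<char>;

  // A capacity of zero is clamped to one so that a page returned by Get()
  // is never evicted by the same call.
  explicit TwoTierCache(std::size_t max_fast_pages)
      : max_fast_pages_(max_fast_pages == 0 ? 1 : max_fast_pages) {}

  // Store a page in the fast tier, demoting the least recently used
  // page to the slow tier if the fast tier is over capacity.
  void Put(std::size_t page_id, Page data) {
    Touch(page_id);
    fast_[page_id] = std::move(data);
    slow_.erase(page_id);
    EvictIfNeeded();
  }

  // Return a page, promoting it from the slow tier on a fast-tier miss.
  // Returns nullptr if the page was never stored. The pointer is only
  // valid until the next call to Put() or Get().
  const Page *Get(std::size_t page_id) {
    if (fast_.find(page_id) == fast_.end()) {
      auto sit = slow_.find(page_id);
      if (sit == slow_.end()) return nullptr;
      fast_[page_id] = std::move(sit->second);
      slow_.erase(sit);
    }
    Touch(page_id);
    EvictIfNeeded();
    return &fast_.at(page_id);
  }

  bool InFast(std::size_t page_id) const { return fast_.count(page_id) > 0; }
  std::size_t FastCount() const { return fast_.size(); }
  std::size_t SlowCount() const { return slow_.size(); }

 private:
  // Mark a page as the most recently used one.
  void Touch(std::size_t page_id) {
    auto it = pos_.find(page_id);
    if (it != pos_.end()) lru_.erase(it->second);
    lru_.push_front(page_id);
    pos_[page_id] = lru_.begin();
  }

  // Demote least recently used pages until the fast tier fits its bound.
  void EvictIfNeeded() {
    while (fast_.size() > max_fast_pages_) {
      assert(!lru_.empty());
      std::size_t victim = lru_.back();
      lru_.pop_back();
      pos_.erase(victim);
      slow_[victim] = std::move(fast_.at(victim));
      fast_.erase(victim);
    }
  }

  std::size_t max_fast_pages_;
  std::unordered_map<std::size_t, Page> fast_;
  std::unordered_map<std::size_t, Page> slow_;
  std::list<std::size_t> lru_;  // front = most recently used
  std::unordered_map<std::size_t, std::list<std::size_t>::iterator> pos_;
};
```

With a capacity of two pages, storing pages 0, 1, and 2 demotes page 0 to the slow tier; a later `Get(0)` promotes it again and demotes page 1. The application only ever sees `Get` and `Put`, which is the sense in which tiering stays transparent.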

Quick Start

Prerequisites

  • C++17-compliant compiler (GCC 9.4.1+)
  • MPI implementation (MPICH 3.4.3+ recommended)
  • Hermes buffering system
  • CMake 3.12+

Installation

# Clone the repository
git clone https://github.com/grc-iit/mega_mmap.git
cd mega_mmap

# Install dependencies (requires Spack)
./deps.sh

# Build MegaMmap
mkdir build && cd build
cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(scspkg pkg root mega_mmap)
make -j8

# Load environment
module load mega_mmap

Basic Usage

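#include <cmath>
#include <vector>
#include <mega_mmap/vector.h>

// Point3D and NearestCentroid() are application-defined and not shown here.
// Returns this rank's partial sum of squared distances to the nearest centroid.
float KMeansInertia(std::vector<Point3D> &ks) {
    int rank = mpi::get_rank();
    int nprocs = mpi::get_comm_size();

    // Create a shared vector from a Parquet file
    mm::Vector<Point3D> pts("/points.parquet");
    pts.BoundMemory(MEGABYTES(1));  // Limit to 1MB DRAM
    pts.Pgas(rank, nprocs);         // Partition across processes

    // Begin a read-only transaction over this rank's partition
    auto tx = pts.SeqTxBegin(pts.local_off(), pts.local_size(), MM_READ_ONLY);

    float distance = 0;
    for (Point3D p : tx) {
        distance += std::pow(NearestCentroid(p, ks), 2);
    }

    pts.TxEnd();
    return distance;  // Local partial sum; reduce across ranks for a global value
}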
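The example above relies on each process receiving its own slice of the shared vector. As a rough illustration of what such a partition could look like, the sketch below computes an even block partition. `LocalRange` and `BlockPartition` are hypothetical helpers written for this example; the scheme actually used by `Pgas()`, `local_off()`, and `local_size()` may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: the slice of a shared vector owned by one rank.
struct LocalRange {
  std::size_t off;   // index of the first element owned by this rank
  std::size_t size;  // number of elements owned by this rank
};

// Even block partition of `total` elements across `nprocs` ranks.
// The first (total % nprocs) ranks each receive one extra element.
inline LocalRange BlockPartition(std::size_t total, int rank, int nprocs) {
  assert(nprocs > 0 && rank >= 0 && rank < nprocs);
  std::size_t n = static_cast<std::size_t>(nprocs);
  std::size_t r = static_cast<std::size_t>(rank);
  std::size_t base = total / n;
  std::size_t extra = total % n;
  std::size_t off = r * base + (r < extra ? r : extra);
  std::size_t size = base + (r < extra ? 1 : 0);
  return {off, size};
}
```

For 10 elements on 3 ranks, this yields the ranges [0, 4), [4, 7), and [7, 10), so the slices are contiguous, cover every element once, and differ in size by at most one.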

MegaMmap AI Guidelines (for coding agents)

Build & Test Commands

  • Build: mkdir build && cd build && cmake ../ -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$(scspkg pkg root mega_mmap) && make -j8
  • Debug Build: Use -DCMAKE_BUILD_TYPE=Debug for debug symbols
  • Single Test: jarvis pipeline run yaml test/unit/pipelines/{test_name}.yaml (e.g., mm_kmeans_mega.yaml)
  • All Tests: Tests are defined in test/unit/CMakeLists.txt using the jarvis_test() function

Code Style Guidelines

  • Language: C++17 standard
  • Namespaces: Use mm namespace for core library code
  • Naming:
    • Classes: PascalCase (e.g., Vector, Bounds)
    • Variables: snake_case with trailing underscore (e.g., window_size_, elmts_per_page_)
    • Functions: PascalCase for public methods
    • Constants: UPPER_CASE with prefix (e.g., MM_READ_ONLY, MM_PAGE_SIZE)
  • Headers: Include guards format: MEGAMMAP_INCLUDE_{PATH}_{FILE}_H_
  • File Organization:
    • Headers in include/mega_mmap/
    • Implementations in benchmark/ for executables
    • Tests in test/unit/
  • Dependencies: Uses Hermes, MPI, Arrow, Parquet, YAML-CPP, Catch2, OpenMP
  • Macros: Use BIT_OPT(u32, n) for bit flags, KILOBYTES(), MEGABYTES() for sizes
  • Error Handling: Hermes logging via hermes_shm/util/logging.h
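The naming and header conventions listed above can be seen together in a small header skeleton. Everything in it is hypothetical and exists only to show the style: `ExampleWindow`, `MM_EXAMPLE_PAGE_SIZE`, and the file path implied by the include guard are not part of the library.

```cpp
// Include guard in the MEGAMMAP_INCLUDE_{PATH}_{FILE}_H_ format.
#ifndef MEGAMMAP_INCLUDE_MEGA_MMAP_EXAMPLE_WINDOW_H_
#define MEGAMMAP_INCLUDE_MEGA_MMAP_EXAMPLE_WINDOW_H_

#include <cassert>
#include <cstddef>

// Hypothetical constant in the UPPER_CASE-with-prefix style.
#define MM_EXAMPLE_PAGE_SIZE (static_cast<std::size_t>(4096))

namespace mm {

// Hypothetical class: PascalCase type name and public methods,
// snake_case member variables with a trailing underscore.
class ExampleWindow {
 public:
  explicit ExampleWindow(std::size_t elmt_size)
      : window_size_(MM_EXAMPLE_PAGE_SIZE), elmts_per_page_(0) {
    assert(elmt_size > 0);
    elmts_per_page_ = MM_EXAMPLE_PAGE_SIZE / elmt_size;
  }

  std::size_t GetWindowSize() const { return window_size_; }
  std::size_t GetElmtsPerPage() const { return elmts_per_page_; }

 private:
  std::size_t window_size_;     // window size in bytes
  std::size_t elmts_per_page_;  // whole elements that fit in one page
};

}  // namespace mm

#endif  // MEGAMMAP_INCLUDE_MEGA_MMAP_EXAMPLE_WINDOW_H_
```

Real library code would use the project's own `KILOBYTES()` and `MEGABYTES()` macros for sizes; a local constant is used here only so the sketch compiles on its own.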

📈 Supported Workloads

MegaMmap has been validated on production HPC applications:

  • Machine Learning: KMeans clustering, Random Forest classification
  • Scientific Simulation: Gray-Scott reaction-diffusion models
  • Data Analytics: DBSCAN clustering on cosmological datasets
  • Signal Processing: Gadget2 cosmological simulation conversion

🧪 Running Benchmarks

# Run single benchmark
jarvis pipeline run yaml test/unit/pipelines/mm_kmeans_mega.yaml

# Run all benchmarks
cd test/unit && make -j8

📚 Documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • National Science Foundation (NSF) Grants CSSI-2104013 and Core-2313154
  • U.S. Department of Energy (DOE) Contract DE-SC0024593
  • Chameleon Cloud Testbed for development environment

📞 Contact

For questions, support, or collaborations:

Gnosis Research Center - Illinois Institute of Technology


Citation: If you use MegaMmap in your research, please cite our SC24 paper.

@inproceedings{logan2024megammap,
  title={MegaMmap: Blurring the Boundary Between Memory and Storage for Data-Intensive Workloads},
  author={Logan, Luke and Kougkas, Anthony and Sun, Xian-He},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC)},
  year={2024}
}
