This project studies how Big Data platforms reshape data exploration from a computational perspective.
The main idea is that, as data becomes inherently distributed, the dominant cost of computation shifts from arithmetic operations to communication and data movement. This change affects how algorithms are designed, especially for large-scale data exploration tasks.
Rather than focusing on system architecture, this work analyzes Big Data through the lens of computational models and connects theoretical frameworks with practical systems.
Traditional algorithms are usually designed under the assumption of centralized data and uniform memory access. However, in distributed environments:
- data is stored across multiple nodes
- communication between nodes is expensive
- memory hierarchy introduces significant I/O cost
As a result, arithmetic is no longer the main bottleneck; moving data between nodes and between levels of the memory hierarchy is.
This project highlights the following shift:
computation cost → communication & I/O cost
and shows how this shift influences both system design and algorithm design.
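The shift can be made concrete with a back-of-the-envelope estimate. The sketch below compares the time to add up a set of values with the time to ship the same values over a network link. All hardware figures and function names are illustrative assumptions, not measurements from the essay.

```python
def compute_seconds(n_values: int, ops_per_second: float) -> float:
    """Time to perform one arithmetic operation per value."""
    return n_values / ops_per_second


def transfer_seconds(n_values: int, bytes_per_value: int,
                     bytes_per_second: float) -> float:
    """Time to move the values across a link of the given bandwidth."""
    return n_values * bytes_per_value / bytes_per_second


# Assumed, hypothetical figures chosen only to illustrate the ratio.
N_VALUES = 1_000_000_000            # one billion 8-byte values
OPS_PER_SECOND = 1e9                # one billion additions per second
NETWORK_BYTES_PER_SECOND = 1.25e8   # a 1 Gbit/s link, i.e. 125 MB/s

compute = compute_seconds(N_VALUES, OPS_PER_SECOND)
transfer = transfer_seconds(N_VALUES, 8, NETWORK_BYTES_PER_SECOND)

print(f"compute: {compute:.1f} s, transfer: {transfer:.1f} s, "
      f"ratio: {transfer / compute:.0f}x")
# prints: compute: 1.0 s, transfer: 64.0 s, ratio: 64x
```

Under these assumptions, moving the data takes 64 times longer than processing it, which is why algorithms that reduce data movement can win even when they perform more arithmetic.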
Topics covered:

- Big Data characteristics: data distribution, scalability, and fault tolerance
- Parallel computational models: PRAM, BSP, and External Memory models, with a focus on how they model communication and I/O cost
- MapReduce: a practical framework where the shuffle phase reflects the cost of data movement
- Data exploration: how distributed constraints lead to locality-aware algorithms and parallel decomposition

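As a rough illustration of how such models charge for communication and I/O, the sketch below encodes two standard textbook formulas: the BSP cost of one superstep, w + h·g + l, and the number of block transfers needed to scan N items in the External Memory model, ⌈N/B⌉. The function names and the example parameter values are assumptions made for this sketch.

```python
def bsp_superstep_cost(w: float, h: float, g: float, l: float) -> float:
    """Cost of one BSP superstep.

    w: maximum local computation on any processor
    h: maximum number of messages sent or received by any processor
    g: communication cost per message (the "gap")
    l: cost of the barrier synchronization
    """
    return w + h * g + l


def em_scan_ios(n_items: int, block_size: int) -> int:
    """Block transfers needed to scan n_items, i.e. ceil(n_items / block_size)."""
    return (n_items + block_size - 1) // block_size


# Hypothetical parameters: communication (h * g = 2000) outweighs
# local computation (w = 1000) in this superstep.
print(bsp_superstep_cost(w=1000, h=200, g=10, l=500))  # prints 3500

# Scanning one million items with 4096 items per block.
print(em_scan_ios(1_000_000, 4096))  # prints 245
```

In both formulas the arithmetic term is only one component: the BSP cost grows with the communication volume h, and the External Memory cost counts block transfers rather than operations.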
Key takeaways:

- Big Data platforms implicitly define a new computational model
- Communication and data movement dominate performance at scale
- Algorithm design must adapt to system constraints
- Data exploration becomes system-aware rather than purely computation-driven
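The effect of data movement on algorithm design can be sketched with a toy, single-process simulation of a MapReduce word count. It counts how many records would cross the shuffle with and without local pre-aggregation (a combiner). The partitions and helper names are hypothetical and stand in for data held on separate nodes; this is not the API of any real framework.

```python
from collections import Counter, defaultdict


def map_phase(partition):
    """Emit one (word, 1) record per word in a node's local lines."""
    return [(word, 1) for line in partition for word in line.split()]


def combine(pairs):
    """Pre-aggregate counts on the mapper node before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_and_reduce(mapper_outputs):
    """Sum counts per word; also report how many records were shuffled."""
    shuffled = sum(len(out) for out in mapper_outputs)
    totals = defaultdict(int)
    for out in mapper_outputs:
        for word, count in out:
            totals[word] += count
    return dict(totals), shuffled


# Two hypothetical partitions, one per node.
partitions = [
    ["a b a", "a c"],
    ["b b a", "c c c"],
]

mapped = [map_phase(p) for p in partitions]
plain_totals, plain_shuffled = shuffle_and_reduce(mapped)
combined_totals, combined_shuffled = shuffle_and_reduce(
    [combine(m) for m in mapped]
)

print(plain_totals == combined_totals)    # prints True
print(plain_shuffled, combined_shuffled)  # prints 11 6
```

Both variants produce the same counts, but the combiner does extra local work to send fewer records across the network, which is the kind of trade-off a locality-aware algorithm makes.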
Repository layout:

docs/
  paper.pdf   Final essay
  paper.tex   LaTeX source
  refs.bib    References

This project is written as a technical essay, but structured to emphasize conceptual understanding rather than system description. It can be seen as a compact summary of how theoretical models and real-world systems connect in large-scale data processing.