spark
Tags: scala
Overview
- Similar to MapReduce, processes large volumes of data efficiently
- However, MapReduce had issues:
- Iterative data development was difficult
- Interactive, ad hoc queries on the same dataset were hard
- How does Spark solve the issues with iterative data processing?
- There was a desire for a framework that abstracted over distributed memory, could perform multiple operations with low latency and fault tolerance, and explicitly let users choose a specific working set to load into memory (see the sketch below)
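A minimal Scala sketch of the kind of workload Spark targets, assuming a hypothetical input file and a made-up iterative update: the working set is loaded once, cached in memory, and reused across iterations instead of being re-read from disk each pass.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Load the working set once and pin it in cluster memory (hypothetical file).
    val points = sc.textFile("points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()

    // Each pass reuses the in-memory RDD instead of re-reading it from disk,
    // which is exactly what chained MapReduce jobs could not do cheaply.
    var w = 0.0
    for (_ <- 1 to 10) {
      val gradient = points.map(p => p.head * w - p.last).reduce(_ + _)
      w -= 0.01 * gradient
    }

    println(s"final w = $w")
    sc.stop()
  }
}
```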
Design
Relies on resilient distributed datasets (RDDs)
- Abstraction over the objects within the dataset
- RDDs can be created in two ways: by transforming an existing RDD or by reading in data
- A lineage graph is kept to record the source of every RDD
- Implemented with
- A list of partition objects, each holding its own subset of the data
- An iterator that traverses the data within a partition
- A list of worker nodes where each partition can be accessed quickly, so the scheduler can place tasks near the data
- The RDD abstraction resembles cluster computing where each node holds only part of the data: nodes compute over their local subsets, and the results are combined later
- Creation
- RDDs can be created from files in a distributed file system, by parallelizing an in-memory collection, or from another RDD
- They can also be built by altering the persistence of an existing RDD, since RDDs are lazy and ephemeral by default
- Partitioning
- Spark automatically partitions the data inside RDDs and distributes it across the RAM of the available nodes. It also gives the user control over how the data is partitioned and whether it needs to be shuffled (creation and partitioning are sketched after this list)
- Vs distributed memory system
- In distributed memory systems, reads and writes go to specific locations within a global address space. RDDs, by contrast, are only written through coarse-grained transformations applied to the whole dataset, though reads can be fine-grained
- DSMs rely on checkpointing, which is costly when restoring data. With RDDs only the lost partitions need to be recomputed from lineage, which gives recovery with much lower space overhead
- RDDs are immutable; operations create new RDDs (recorded in the lineage graph), which makes it safe to run backup tasks to mitigate stragglers (speculative execution)
- The runtime can schedule tasks for data locality, since it knows where each partition lives
- You can also page/flush to disk for RDDs
- Spark only offers bulk writes, no random writes like distributed memory systems
- An RDD partition is similar to a chunk of a dataframe
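A short sketch of the creation paths and user-controlled partitioning described above; the file name, partition counts, and keys are illustrative, not from the original notes.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CreationAndPartitioning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // 1) From a file in a (distributed) file system -- hypothetical path.
    val lines = sc.textFile("events.log")

    // 2) By parallelizing an in-memory collection.
    val nums = sc.parallelize(1 to 1000, numSlices = 8)

    // 3) By transforming an existing RDD (adds a node to the lineage graph).
    val squares = nums.map(n => n * n)

    // Altering persistence: keep this RDD in memory after it is first computed.
    squares.persist()

    // User-controlled partitioning for key-value data: hash-partition so that
    // later joins/aggregations on the same key can avoid a shuffle.
    val byKey = lines.map(l => (l.take(1), l)).partitionBy(new HashPartitioner(8))

    println(byKey.partitions.length) // 8
    sc.stop()
  }
}
```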
Dependencies
- When an RDD is created, its relationship with its parent data forms a DAG, and each dependency in that DAG is either narrow or wide
- Narrow dependencies allow pipelined execution: parent and child partitions can be computed together on one node. Operations on the parent RDD can be applied on an element-by-element basis. Partition recovery is also easy, since a lost partition is recomputed from just its parent partition
- Wide dependencies are more complicated, since they shuffle the data; recovering a lost partition may require recomputing partitions from all of its parents (see the lineage sketch after this list)
- A side benefit is that lineage graphs that grow too large can be compacted via snapshots (checkpoints), which can then be replicated across the cluster
- Cluster management can be handled by Mesos, YARN, or other resource managers
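One way to see the narrow/wide split is `toDebugString`, which prints the lineage an RDD was built from and where the shuffle boundaries fall; the data below is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage").setMaster("local[*]"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs  = words.map(w => (w, 1))   // narrow: one parent partition per child partition
    val counts = pairs.reduceByKey(_ + _) // wide: requires a shuffle

    // Prints the lineage graph; the shuffle boundary marks the wide dependency,
    // and the narrow steps on either side of it can be pipelined within a stage.
    println(counts.toDebugString)

    // If one partition of `counts` is lost, only its ancestors in the lineage
    // need to be recomputed, not the whole dataset.
    sc.stop()
  }
}
```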
Scheduler
- The scheduler is responsible for turning actions on RDDs into a DAG of stages and tasks
- Fault-tolerance retries are handled by the scheduler, which reschedules failed tasks
Actions
- Spark lazily builds a logical plan of execution; actions trigger the execution of that plan
- Actions are anything that returns a non-RDD value, such as `count()`, `reduce()`, `collect()`, and `lookup()`
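A small illustration of laziness, assuming a hypothetical log file: the transformations only describe the plan, and each action below forces a job to run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("actions").setMaster("local[*]"))

    // Transformations only build the logical plan; no job runs yet.
    val errors = sc.textFile("app.log") // hypothetical log file
      .filter(_.contains("ERROR"))
      .map(_.split("\t")(0))

    // Each action below triggers execution of that plan and returns a non-RDD value.
    val n     = errors.count()   // number of error lines
    val first = errors.take(5)   // a few results back to the driver
    val all   = errors.collect() // the full result, materialized on the driver

    println(s"$n errors; first: ${first.mkString(", ")}; collected ${all.length} lines")
    sc.stop()
  }
}
```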
Driver - the manager process that orchestrates execution and data sharding
- Launches clusters of workers
- Defines RDDs
- Keeps lineage graph
- Creates execution pipelines (stages)
- Creates further tasks in each stage and sends them to workers running in parallel
- Schedules tasks
- Shares some variables, called shared variables, with all the worker nodes
Worker nodes
- Receive tasks from the driver
- Performs assigned tasks
- Stores partitions of the memory abstraction
- Note that usually, within datacenters, network topology is taken into account by scheduling algorithms.
Transformations
- Lazy operations by default, split into two forms: narrow and wide
- Narrow transformations have each input partition map to a single output partition: `map`, `filter`, `union`, and `join` when both inputs are already partitioned by the same key (no shuffle needed)
- Wide transformations have multiple child partitions depending on each parent partition: `reduceByKey`, `groupByKey`, and `join` on arbitrarily partitioned inputs
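A sketch contrasting the two kinds of transformations on key-value RDDs; the keys, values, and partition count are invented for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object TransformationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transforms").setMaster("local[*]"))

    val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 5), ("apples", 2)))
    val prices = sc.parallelize(Seq(("apples", 1.0), ("pears", 2.0)))

    // Narrow: each output partition depends on a single input partition.
    val big  = sales.filter { case (_, qty) => qty > 2 }
    val both = sales.union(sc.parallelize(Seq(("plums", 7))))

    // Wide: an output partition depends on many input partitions (shuffle).
    val totals = sales.reduceByKey(_ + _)
    val joined = totals.join(prices) // inputs not co-partitioned => shuffle

    // Co-partitioning both sides first lets the subsequent join stay narrow.
    val part       = new HashPartitioner(4)
    val narrowJoin = totals.partitionBy(part).join(prices.partitionBy(part))

    println(joined.collect().toSeq)
    println(big.count() + both.count() + narrowJoin.count())
    sc.stop()
  }
}
```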
Shared Variables
- Two types of shared variables:
- broadcast - a read-only variable created on the driver and then cached on each worker
- accumulators - workers add to their own local copies, which are merged back at the driver; only the driver reads the final value
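A sketch of both kinds of shared variables, using an invented currency-conversion example: the lookup table is broadcast to workers, and a counter of bad records is accumulated back to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

    // Broadcast: a read-only lookup table shipped once and cached on each worker.
    val rates = sc.broadcast(Map("USD" -> 1.0, "EUR" -> 1.1))

    // Accumulator: workers only add to it; the driver reads the merged total.
    // (Updates made inside transformations can be re-applied if tasks are retried.)
    val badRecords = sc.longAccumulator("badRecords")

    val orders = sc.parallelize(Seq(("USD", 10.0), ("EUR", 20.0), ("???", 5.0)))
    val inUsd = orders.flatMap { case (currency, amount) =>
      rates.value.get(currency) match {
        case Some(rate) => Some(amount * rate)
        case None       => badRecords.add(1); None
      }
    }

    // The sum() action runs the job; afterwards the accumulator holds its total.
    println(s"total = ${inUsd.sum()}, bad records = ${badRecords.value}")
    sc.stop()
  }
}
```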
Refinements
- When memory is limited, an LRU policy evicts partitions of old RDDs from memory
- RDDs can also be flushed to disk; users specify a persistence priority
- Checkpointing RDDs to stable storage compacts long lineage graphs, so recovery does not have to replay the whole chain
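A sketch of the persistence and checkpointing knobs, assuming a hypothetical checkpoint directory: `persist` picks a storage level, and `checkpoint` truncates a long lineage chain.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persistence").setMaster("local[*]"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory

    var rdd = sc.parallelize(1 to 1000000)
    for (_ <- 1 to 100) rdd = rdd.map(_ + 1) // long transformation chain => long lineage

    // Users pick how a hot RDD is kept: memory only (default), or spill to
    // disk when the LRU cache runs out of room.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // Checkpointing writes the RDD to stable storage and truncates its lineage,
    // bounding recovery cost after failures.
    rdd.checkpoint()

    println(rdd.count()) // first action materializes both persist and checkpoint
    sc.stop()
  }
}
```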
Design wisdom
- RDD abstraction: immutable, shardable, lazy data structures that can be recomputed