عجفت الغور

dask

Tags: pandas, numpy

Flow order

  • Graph, scheduling, execution
    • constructs a flow order as a graph inside a python dict
    • assumptions: data is not modified in place, tasks do not hold the gil
      • python bindings for native interfaces may lock, these must be made sure to be released

Dask vs Spark

  • similar to spark
  • spark has a higher level data representation
  • dask uses numpy/pandas/sklear, spark is more mature and has sql support
  • dask allows to compute arbitrary graphs
  • spark allows computations from compositions of high level primitives or extending to rdd