The Blaze Ecosystem

Recent blog posts

Wed 17 February 2016

Introducing Dask Distributed

by Matthew Rocklin

We analyze GitHub data on a cluster using Dask.

Dask distributed computing

Fri 13 November 2015

Distributed Arrays

We use dask.array, a small cluster on EC2, and distributed

dask distributed

Wed 28 October 2015

PyData on HDFS without Java

by Matthew Rocklin

We use snakebite and distributed to run Pandas on CSV data in HDFS.

hdfs snakebite distributed pandas

Tue 27 October 2015

Ad-hoc Distributed Computation

by Matthew Rocklin

Ad-hoc distributed computations with a concurrent.futures interface.

distributed
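
The concurrent.futures interface mentioned in this entry is the submit-and-collect pattern from the Python standard library. The sketch below is only an illustration of that pattern: it uses the standard library's ThreadPoolExecutor on one machine as a stand-in, not the distributed library itself, and the helper names (inc, add, run_example) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    """Toy task: add one."""
    return x + 1


def add(x, y):
    """Toy task: add two numbers."""
    return x + y


def run_example():
    # Assumption: a local thread pool stands in for a cluster executor.
    with ThreadPoolExecutor(max_workers=4) as executor:
        # submit() returns a Future immediately; the work runs in the pool.
        a = executor.submit(inc, 1)
        b = executor.submit(inc, 10)
        # result() blocks until the corresponding task has finished.
        total = executor.submit(add, a.result(), b.result())
        # map() applies a function to many inputs and preserves input order.
        squares = list(executor.map(lambda v: v * v, range(5)))
        return total.result(), squares


if __name__ == "__main__":
    print(run_example())
```

The appeal of this style is that computations are built up one call at a time, so the shape of the work does not have to be known in advance.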

Mon 19 October 2015

Pipelines and Reuse with dask

by Matthew Rocklin

tl;dr: We use dask to accelerate parameter searches over machine learning pipelines by naming intermediate results consistently, so that work shared between parameter settings can be reused.

dask sklearn dasklearn
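
The idea of reuse through consistent naming can be sketched in plain Python. This is a minimal illustration under stated assumptions, not dask's or dasklearn's actual implementation: each step gets a deterministic name derived from its function and arguments, and a step whose name has already been seen is not run again. All names here (task_name, ReusingRunner, scale, fit, grid_search) are hypothetical.

```python
import hashlib


def task_name(func, *args):
    """Derive a deterministic name from the function name and its arguments."""
    # Assumption: the arguments have a stable repr (numbers, tuples, strings).
    payload = repr((func.__name__, args)).encode("utf-8")
    return func.__name__ + "-" + hashlib.sha256(payload).hexdigest()[:12]


class ReusingRunner:
    """Runs each uniquely named step once and reuses its stored result."""

    def __init__(self):
        self.results = {}
        self.executed = 0  # counts the steps that were actually computed

    def run(self, func, *args):
        name = task_name(func, *args)
        if name not in self.results:
            self.results[name] = func(*args)
            self.executed += 1
        return self.results[name]


def scale(data, factor):
    """Toy preprocessing step of a pipeline."""
    return tuple(x * factor for x in data)


def fit(data, penalty):
    """Toy scoring step that depends on the preprocessed data."""
    return sum(data) / (len(data) + penalty)


def grid_search(runner, data, factors, penalties):
    """Evaluate every (factor, penalty) pair, sharing the scale step."""
    scores = {}
    for factor in factors:
        for penalty in penalties:
            scaled = runner.run(scale, data, factor)
            scores[(factor, penalty)] = runner.run(fit, scaled, penalty)
    return scores


if __name__ == "__main__":
    runner = ReusingRunner()
    grid_search(runner, (1, 2, 3), (1, 2), (0, 1, 2))
    print(runner.executed)
```

In this toy grid of two factors and three penalties, a naive loop would run twelve steps, while the runner computes each distinct scale step once and shares it across the penalties.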

Wed 16 September 2015

Analyzing 1.7 Billion Reddit Comments with Blaze and Impala

by Daniel Rodriguez and Kristopher Overholt

Blaze is a Python library and interface to query data on different storage systems. Blaze works by translating a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze gives Python users a familiar interface to query data living in other data storage systems such as SQL databases, NoSQL data stores, Spark, Hive, Impala, and raw data files such as CSV, JSON, and HDF5.

blaze impala hive reddit

Tue 08 September 2015

Analyzing Reddit Comments with Dask and Castra

by Jim Crist

The scientific Python ecosystem is great for doing data analysis. Packages like NumPy and Pandas provide an excellent interface for doing complicated computations on datasets. With only a few lines of code one can load some data into a Pandas DataFrame, run some analysis, and generate a plot of the results. However, this workflow starts to falter when working with data that's larger than the RAM on your computer. At this point people often move their workflow from a Python-based one into some other larger system like Spark or Hadoop. These are great at what they do, but they are a bit overkill for small problems.

dask castra reddit

Talks and Tutorials

Scale your data, not your process. Welcome to the Blaze ecosystem.
EuroPython 2015, Christine Doig

Going Parallel and Larger-than-memory with Graphs
PyGotham 2015, Blake Griffith

Dask: Out-of-core NumPy and Pandas through Task Scheduling
SciPy 2015, James Crist