Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. It gives Python users a familiar interface for querying data that lives in other data storage systems. Blaze is sponsored primarily by Continuum Analytics and a DARPA XDATA grant.

Ecosystem

Several projects have come out of Blaze development in addition to the Blaze project itself.

  • Blaze: Translates NumPy/Pandas-like syntax to data computing systems (e.g. databases, in-memory, distributed computing). It includes a data gateway server that can sit on computing machines near the data, moving the expression to the data.

    Blaze presents a pleasant and familiar interface to us regardless of what computational solution or database we use (e.g. Spark, Impala, SQL databases, NoSQL data stores, raw files). It mediates our interaction with files, data structures, and databases, optimizing and translating our query as appropriate to provide a smooth and interactive session. It lets data scientists and analysts write their queries in a unified way that does not have to change when the data is stored in another format or a different data store. It also provides a server component that allows URIs to be used to serve views on data and to refer to data remotely in local scripts, queries, and programs.

  • Odo: Migrates data between formats.

    Odo moves data between formats (CSV, JSON, databases) and locations (local, remote, HDFS) efficiently and robustly with a dead-simple interface by leveraging a sophisticated and extensible network of conversions.

  • Dask.array: Multi-core / on-disk NumPy arrays

  • Dask.dataframe: Multi-core / on-disk Pandas data-frames

    Dask.arrays provide blocked algorithms on top of NumPy to handle larger-than-memory arrays and to leverage multiple cores. They are a drop-in replacement for a commonly used subset of NumPy algorithms.

    Dask.dataframes provide blocked algorithms on top of Pandas to handle larger-than-memory data-frames and to leverage multiple cores. They are a drop-in replacement for a subset of Pandas use-cases.

    Dask also has a general “Bag” type and a way to build “task graphs” using simple decorators. In addition to the multi-core and multi-threaded schedulers, it has nascent distributed schedulers.

  • DyND: In-memory dynamic arrays

    DyND is a dynamic ND-array library like NumPy. It supports variable length strings, ragged arrays, and GPUs. It is a standalone C++ codebase with Python bindings. Generally it is more extensible than NumPy but also less mature.
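To make the phrase "blocked algorithms" from the Dask entries above concrete, here is a minimal sketch in plain Python. It is not Dask code and does not use the Dask API; the helper names `iter_blocks` and `blocked_mean` are hypothetical and chosen only for illustration. The idea it shows is that a reduction over a large dataset can be split into small per-block partial results that are combined at the end, so only one block needs to be in memory at a time.

```python
from itertools import islice


def iter_blocks(values, block_size):
    """Yield successive lists of at most block_size items from any iterable."""
    iterator = iter(values)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


def blocked_mean(values, block_size=1000):
    """Compute a mean while holding only one block in memory at a time."""
    total = 0.0
    count = 0
    for block in iter_blocks(values, block_size):
        # Each block contributes a small partial result (a sum and a count).
        total += sum(block)
        count += len(block)
    if count == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    # The partial results are combined into the final answer only here.
    return total / count


# A generator stands in for data that is too large to materialize at once.
result = blocked_mean((x for x in range(1_000_000)), block_size=10_000)
```

Because each block is processed independently before the final combine step, the per-block work could in principle be handed to separate cores, which is the other benefit the Dask entries mention.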

These projects are mutually independent. The rest of this documentation covers only the Blaze project itself. See the pages linked above for Odo or Dask.array. Other projects that have spun out of or are linked to Blaze efforts include:

Blaze

Blaze is a high-level user interface for databases and array computing systems. It consists of the following components:

  • A symbolic expression system to describe and reason about analytic queries
  • A set of interpreters from that query system to various databases / computational engines

This architecture allows a single piece of Blaze code to run against several computational backends. Blaze interacts rapidly with the user and communicates with the database only when necessary. Blaze is also able to analyze and optimize queries to improve the interactive experience.
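The two components listed above can be illustrated with a small, self-contained sketch. This is not Blaze's implementation or API: the classes `Table`, `Filter`, and `Sum` and the functions `to_sql` and `run_in_memory` are hypothetical names invented for this example, and the `accounts` data is made up. The sketch only shows the general shape of the architecture, assuming a tiny expression language: one symbolic query description, and two interpreters that run it against different backends (SQLite through the standard library, and plain Python lists).

```python
import sqlite3
from dataclasses import dataclass


# Symbolic expressions: they describe a query but hold no data themselves.
@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    minimum: float  # keep rows where column > minimum


@dataclass(frozen=True)
class Sum:
    child: object
    column: str


def to_sql(expr):
    """Interpreter 1: translate an expression into (sql_text, parameters)."""
    if isinstance(expr, Table):
        return f'SELECT * FROM "{expr.name}"', []
    if isinstance(expr, Filter):
        sql, params = to_sql(expr.child)
        return (f'SELECT * FROM ({sql}) WHERE "{expr.column}" > ?',
                params + [expr.minimum])
    if isinstance(expr, Sum):
        sql, params = to_sql(expr.child)
        return f'SELECT SUM("{expr.column}") FROM ({sql})', params
    raise TypeError(f"unsupported expression: {expr!r}")


def run_in_memory(expr, tables):
    """Interpreter 2: evaluate against a dict of table name -> list of row dicts."""
    if isinstance(expr, Table):
        return tables[expr.name]
    if isinstance(expr, Filter):
        rows = run_in_memory(expr.child, tables)
        return [row for row in rows if row[expr.column] > expr.minimum]
    if isinstance(expr, Sum):
        rows = run_in_memory(expr.child, tables)
        return sum(row[expr.column] for row in rows)
    raise TypeError(f"unsupported expression: {expr!r}")


# One query description, written once.
query = Sum(Filter(Table("accounts"), "amount", 0), "amount")

# Made-up sample data, loaded into both backends.
rows = [("Alice", 100), ("Bob", -50), ("Carol", 250)]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE accounts (name TEXT, amount INTEGER)')
conn.executemany('INSERT INTO accounts VALUES (?, ?)', rows)
sql_text, sql_params = to_sql(query)
sql_total = conn.execute(sql_text, sql_params).fetchone()[0]
conn.close()

tables = {"accounts": [{"name": n, "amount": a} for n, a in rows]}
memory_total = run_in_memory(query, tables)
```

Nothing touches the database until `conn.execute` is called: building `query` and translating it with `to_sql` are purely local steps, which mirrors the point above about communicating with the database only when necessary.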