Apache Spark and Databricks

What is Apache Spark?

  • An open-source big data platform for large-scale data processing and data science
  • Big data here means massive data volumes, streaming data, and unstructured or semi-structured data such as images, video, and sound
  • There is no IDE; you need to bring your own tools
  • It is a query and data analytics engine; it is meant to run queries
  • It is NOT a storage engine. Data lives in a separate storage layer such as Amazon S3, a data lake, or HDFS
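
To make the engine-versus-storage split concrete, here is a minimal PySpark sketch. It assumes the pyspark package is installed and uses a local temporary directory as a stand-in for S3 or HDFS. The names QUERY, run_demo, the events view, and the customer/clicks columns are made up for illustration.

```python
import os
import tempfile

# Hypothetical query: total clicks per customer from an "events" view.
QUERY = (
    "SELECT customer, SUM(clicks) AS total_clicks "
    "FROM events GROUP BY customer"
)


def run_demo():
    """Write a small dataset to storage, then query it with Spark."""
    # Imported here so the module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")  # single local worker, enough for a demo
        .appName("spark-as-query-engine")
        .getOrCreate()
    )
    try:
        # Storage layer: a temp directory standing in for S3 / HDFS / a data lake.
        path = os.path.join(tempfile.mkdtemp(), "events.parquet")
        rows = [("alice", 3), ("bob", 5), ("alice", 4)]
        spark.createDataFrame(rows, ["customer", "clicks"]).write.parquet(path)

        # Query engine: Spark reads the files back and runs the SQL over them.
        spark.read.parquet(path).createOrReplaceTempView("events")
        result = spark.sql(QUERY).collect()
        return {row["customer"]: row["total_clicks"] for row in result}
    finally:
        spark.stop()
```

Calling run_demo() returns a dictionary of per-customer totals. The data itself stays in the Parquet files on disk; Spark only reads it in to answer the query.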

What is Databricks?

  • Commercial product from the creators of Apache Spark
  • Complete development environment for Apache Spark
  • Numerous proprietary Spark enhancements
  • Ideal for Data Science team collaboration
  • Optimized for the cloud; I don't believe you can spin it up in your own data center
