State of the Art Natural Language Processing at Scale

The two-part presentation below from the Spark+AI Summit 2018 is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines that leverage all of Spark’s built-in optimizations.

Databricks Partners with RStudio To Increase Productivity of Data Science Teams

Databricks, a leader in unified analytics and founded by the original creators of Apache Spark™, announced a partnership with RStudio, providers of a free and open-source integrated development environment for R, to increase the productivity of data science teams. The partnership will allow the two companies to seamlessly integrate Databricks’ Unified Analytics Platform with the RStudio Server, simplifying R programming on big data.

Apache Spark 2.0: A Deep Dive Into Structured Streaming

In this talk, Tathagata Das takes a deep dive into the concepts and API of Structured Streaming and shows how they simplify building complex “Continuous Applications.” Tathagata is an Apache Spark committer and a member of the PMC. He’s the lead developer behind Spark Streaming and is currently employed at Databricks.

Top 5 Mistakes When Writing Spark Applications

In the presentation below from Spark Summit 2016, Mark Grover goes over the top 5 things that he’s seen in the field that prevent people from getting the most out of their Spark clusters. When some of these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters, the same data, just a different approach.

The Data Scientist’s Guide to Apache Spark

Looking to dive deeper into cutting-edge machine learning use cases in Apache Spark? To successfully use Spark’s advanced analytics capabilities, including large-scale machine learning and graph analysis, check out The Data Scientist’s Guide to Apache Spark, from our friends over at Databricks.

The Data Scientist’s Guide to Apache Spark™

For data scientists looking to apply Apache Spark’s advanced analytics techniques and deep learning models at scale, Databricks is happy to provide The Data Scientist’s Guide to Apache Spark. Download this eBook to: Learn the fundamentals of advanced analytics and receive a crash course in machine learning. Get a deep dive on MLlib, the primary […]

Databricks Launches Delta To Combine the Best of Data Lakes, Data Warehouses and Streaming Systems

Databricks, provider of the leading Unified Analytics Platform and founded by the team who created Apache Spark™, announced Databricks Delta, the first unified data management system that provides the scale and cost-efficiency of a data lake, the query performance of a data warehouse, and the low latency of a streaming ingest system. Databricks Delta, a […]

Apache Spark Expands With Cypher, Neo4j’s ‘SQL For Graphs,’ Adds Declarative Graph Querying

Neo4j, a leader in connected data, announced that it has released a preview version of the Cypher for Apache Spark (CAPS) language toolkit. This combination allows big data analysts to incorporate graphs and graph algorithms in their work, which will dramatically broaden how they reveal connections in their data.

Impetus Technologies Delivers Visual Spark Studio – A New, Free Development Tool to Accelerate Spark Adoption in Enterprises

Impetus Technologies, a big data software products and services company, announced the immediate availability of Visual Spark Studio™, a new standalone tool aimed at addressing the increasing demand for Spark-based analytic and data processing solutions in enterprises.

Interview: Ash Munshi, CEO at Pepperdata

I recently caught up with Ash Munshi, CEO at Pepperdata, to get a rundown on his company, a sense for how big data and DevOps are related, some highlights on new product offerings, and his sense for where Pepperdata is headed in the future.