The inside Spark channel is a resource for professionals looking to learn about the benefits of Apache Spark.

State of the Art Natural Language Processing at Scale

The two-part presentation below from the Spark+AI Summit 2018 is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines that leverage all of Spark’s built-in optimizations.

Databricks Partners with RStudio To Increase Productivity of Data Science Teams

Databricks, a leader in unified analytics and founded by the original creators of Apache Spark™, announced a partnership with RStudio, providers of a free and open-source integrated development environment for R, to increase the productivity of data science teams. The partnership will allow the two companies to seamlessly integrate Databricks’ Unified Analytics Platform with the RStudio Server, simplifying R programming on big data.

Databricks Conquers AI Dilemma with Unified Analytics

Databricks, a leader in unified analytics and founded by the original creators of Apache Spark™, addresses this AI dilemma with the Unified Analytics Platform. The company launched new capabilities to lower the barrier for enterprises to innovate with AI. These new capabilities unify data and AI teams and technologies: MLflow for developing an end-to-end machine learning workflow, Databricks Runtime for ML for simplifying distributed machine learning, and Databricks Delta for data reliability and performance at scale.

Apache Spark 2.0: A Deep Dive Into Structured Streaming

In this talk, Tathagata Das takes a deep dive into the concepts and the API and shows how they simplify building complex “Continuous Applications”. Tathagata is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks.

Big Data Analytics Receive a “Spark” In the Arm

In this special guest feature, Anand Venugopal, head of StreamAnalytix at Impetus Technologies, discusses real-time streaming analytics applications and how companies can use Apache Spark for data processing and analytics functionality. Real-time data and analytics processes are the central nervous system of today’s enterprise, which makes it no surprise that the global revenue in the business intelligence (BI) and analytics software market is forecast to reach $22.8 billion by the end of 2020.

Big Data, Hadoop & Cloud: Tackling a Chain of Emerging Challenges

In this special guest feature, Chandra Ambadipudi, CEO of Clairvoyant, provides a compelling tour de force through the recent history of the big data industry and how Hadoop and the cloud have made steady acceleration possible. Also offered are recommendations for how to address several challenges faced by enterprises with respect to big data cloud implementations.

Top 5 Mistakes When Writing Spark Applications

In the presentation below from Spark Summit 2016, Mark Grover goes over the top 5 things that he’s seen in the field that prevent people from getting the most out of their Spark clusters. When some of these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters, the same data, just a different approach.

The Data Scientist’s Guide to Apache Spark

Looking to dive deeper into the more cutting-edge machine learning use cases in Apache Spark? To successfully use Spark’s advanced analytics capabilities, including large-scale machine learning and graph analysis, check out The Data Scientist’s Guide to Apache Spark, from our friends over at Databricks.

Databricks Launches Delta To Combine the Best of Data Lakes, Data Warehouses and Streaming Systems

Databricks, provider of the leading Unified Analytics Platform and founded by the team who created Apache Spark™, announced Databricks Delta, the first unified data management system that provides the scale and cost-efficiency of a data lake, the query performance of a data warehouse, and the low latency of a streaming ingest system. Databricks Delta, a […]

Apache Spark Expands With Cypher, Neo4j’s ‘SQL For Graphs,’ Adds Declarative Graph Querying

Neo4j, a leader in connected data, announced that it has released a preview version of the Cypher for Apache Spark (CAPS) language toolkit. This combination allows big data analysts to incorporate graphs and graph algorithms in their work, which will dramatically broaden how they reveal connections in their data.