Here’s a peek under the purple rug at what’s happening over at Yahoo Engineering: the new SAMOA (Scalable Advanced Massive Online Analysis) open source platform for mining big data streams. SAMOA is written in Java and has a pluggable architecture that lets it run on several distributed stream processing engines, such as Storm and S4. It includes distributed algorithms for the most common machine learning tasks, such as classification and clustering. For a simple analogy, think of SAMOA as Mahout for streaming.
SAMOA is both a platform and a library. As a platform, it lets algorithm developers abstract away the underlying execution engine, so they can reuse the same code on different engines. It also makes it easy to write plug-in modules that port SAMOA to other execution engines.
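To make the platform idea concrete, here is a minimal sketch of what "write the algorithm once, swap the engine underneath" can look like in Java. Every name in it (StreamProcessor, ExecutionEngine, LocalEngine, BatchingEngine, LabelCounter) is hypothetical and chosen for illustration. This is not SAMOA's actual API, and the two toy engines are in-memory stand-ins rather than Storm or S4 adapters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All names below are hypothetical; this is NOT SAMOA's real API.

/** Algorithm code implements this and never touches an engine directly. */
interface StreamProcessor<T> {
    void process(T event);
}

/** An engine adapter: the only engine-specific piece of the puzzle. */
interface ExecutionEngine {
    <T> void run(Iterable<T> stream, StreamProcessor<T> processor);
}

/** Toy engine that delivers events one at a time on the calling thread. */
class LocalEngine implements ExecutionEngine {
    @Override
    public <T> void run(Iterable<T> stream, StreamProcessor<T> processor) {
        for (T event : stream) {
            processor.process(event);
        }
    }
}

/** Toy engine that buffers events into small batches before delivery. */
class BatchingEngine implements ExecutionEngine {
    private final int batchSize;

    BatchingEngine(int batchSize) {
        this.batchSize = batchSize;
    }

    @Override
    public <T> void run(Iterable<T> stream, StreamProcessor<T> processor) {
        List<T> batch = new ArrayList<>();
        for (T event : stream) {
            batch.add(event);
            if (batch.size() == batchSize) {
                flush(batch, processor);
            }
        }
        flush(batch, processor); // deliver any leftover events
    }

    private <T> void flush(List<T> batch, StreamProcessor<T> processor) {
        for (T event : batch) {
            processor.process(event);
        }
        batch.clear();
    }
}

/** Engine-agnostic "algorithm": counts the class labels seen on the stream. */
class LabelCounter implements StreamProcessor<String> {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void process(String label) {
        counts.merge(label, 1, Integer::sum);
    }

    int count(String label) {
        return counts.getOrDefault(label, 0);
    }
}

class PluggableSketch {
    public static void main(String[] args) {
        List<String> stream = List.of("spam", "ham", "spam", "spam", "ham");
        List<ExecutionEngine> engines = List.of(new LocalEngine(), new BatchingEngine(2));
        for (ExecutionEngine engine : engines) {
            LabelCounter counter = new LabelCounter(); // same algorithm code each time
            engine.run(stream, counter);
            System.out.println(engine.getClass().getSimpleName()
                    + " spam=" + counter.count("spam"));
        }
    }
}
```

The point of the design is that LabelCounter contains no engine-specific code, so supporting a new engine in this sketch means writing one more ExecutionEngine implementation and nothing else.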
As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release supports classification and clustering.
To learn more about SAMOA, visit Yahoo Engineering’s announcement page, which includes a link to the product’s GitHub page.