Picking Winners: Applying Hadoop and Spark to Solve Real-World Performance Challenges

Although the Hadoop ecosystem (or "zoo," depending on one's point of view) has matured over the past ten years, it can still seem esoteric compared to its older relational counterparts. Hadoop is still associated with narrow classes of problems (analytics, social media, mobile marketing) and is often pigeonholed as a specialty platform. As data architects, though, we need to keep all options on the table.

In this case study, a team asked to solve a fairly typical performance challenge found that the HDFS architecture and the Spark distributed processing framework were a perfect fit for a classic parallel processing problem.

Session attendees will learn:
• How the team determined the nature of the performance challenge (and why parallel processing wasn't a perfect solution)
• How HDFS enabled an architecture that database partitioning alone could not provide
• How Spark (like MapReduce) enables applications to bring the processing to the data while (unlike MapReduce) keeping performance acceptable for small data sets
• How the application architecture changed to take advantage of Hadoop and Spark
• How the team is learning to think outside the relational database (RDBMS), stay off the Hadoop hype cycle, and apply Big(ish) Data thinking to legacy applications and longstanding challenges

Bill Brooks has been modeling, managing, and integrating data since 1995, beginning at CID Associates developing application databases, then at Children's Hospital Boston as manager of the Decision Support Systems Group. He managed data integration before becoming Enterprise Data Architect for MFS Investment Management. Bill is now Global Chief Data Architect at Mercer, where he is developing a firm-wide data architecture practice.

Bill's background includes traditional relational database design, data warehouse design and implementation, and enterprise application integration using a variety of ETL, message broker, and service bus approaches. He has recently focused on building data architecture capabilities and driving big data and advanced analytics programs.

Jifeng Shao is a seasoned software development manager and lead architect in the areas of big data and data science, including data management, data warehousing, business intelligence, machine learning, predictive modeling, MPP with HPC, and Hadoop/Spark. Jifeng has more than fifteen years of experience in data science practice, including statistical inference, predictive modeling, and machine learning. He is a passionate advocate for better architecture to solve complex data management and data science challenges.