ZEMOSO ENGINEERING STUDIO
June 5, 2022
2 min read

Product developer POV: Caveats when building with Spark

Building big data platforms and creating analytics frameworks is our superpower! We’ve migrated Gartner-ranked data-management platforms and used artificial intelligence (AI) for natural language processing (NLP) and computer vision applications. As a result, we use Apache Spark frequently to deploy machine learning, process big data, and manipulate large datasets. This means you can avoid the rabbit holes that we’ve already been down.

For one product development project, we started with PySpark. The product sent calls to a Python-based server, and this server would delegate the work to the Spark cluster; PySpark was our interface to the cluster. Up until this point, everything worked well. However, things got more complicated when we needed to cache the Spark data frame in memory instead of loading it from the source every single time. Caching would speed up processing because it would eliminate the I/O overhead.
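In PySpark terms, the pattern looked roughly like this (a minimal sketch; the app name, master URL, and source path are illustrative, not from the actual product):

```python
from pyspark.sql import SparkSession

# Client-mode driver: the JVM backing this session runs on our server.
spark = (SparkSession.builder
         .appName("product-analytics")         # hypothetical app name
         .master("spark://spark-master:7077")  # hypothetical cluster URL
         .getOrCreate())

# Load once, cache in memory, and reuse across requests to skip the
# source I/O on every call.
df = spark.read.parquet("s3a://bucket/events/")  # hypothetical source
df.cache()   # mark the DataFrame for in-memory storage
df.count()   # an action, which forces the cache to materialize
```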

What was the problem?

When Apache Spark runs in client mode, a Java Virtual Machine (JVM) runs locally on the client server, and that client holds a session object pointing to the cluster. This means the client’s JVM has a session context tied to that specific client, and it cannot be used outside this particular JVM, i.e., the client.

[Figure: Features of Spark]

When we wanted to scale our servers out to meet incoming demand from all the different clients, the caching concept no longer made sense across different servers. The Spark documentation could have been clearer here: different servers mean different JVMs, and therefore different session contexts. So even though all the PySpark clients point to the same physical Spark cluster, the session objects are different, the local references are different, and it is impossible to cache or share data frames between different sessions/clients, even under the same application name. Since requests weren't guaranteed to be routed to a particular server, we could no longer continue down this path.
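A quick way to see why: temp views and cached tables are scoped to the SparkSession, not to the cluster. In a sketch like the following (names again illustrative), each server runs the same code but gets its own, isolated cache:

```python
from pyspark.sql import SparkSession

# Server A and Server B each execute this in their own process/JVM.
spark = SparkSession.builder.appName("product-analytics").getOrCreate()

df = spark.read.parquet("s3a://bucket/events/")  # hypothetical source
df.createOrReplaceTempView("events")  # temp view is session-scoped
spark.catalog.cacheTable("events")    # cache lives in *this* session only

# On the other server, "events" is undefined and nothing is cached:
# sharing an application name does not make the sessions share state.
```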

Therefore, we used a Spark proxy. 

For this product, we set it up so that the client applications communicated with the servers, and the servers communicated with a single, centralized Spark client running outside those servers. The Spark tasks of every server were delegated to this one client, so all of them shared one session and one cache.
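To illustrate the shape of such a proxy (a minimal sketch only; we use Flask here for brevity, and the endpoint and paths are made up, not the original implementation):

```python
from flask import Flask, jsonify, request
from pyspark.sql import SparkSession

app = Flask(__name__)

# One SparkSession for the whole proxy process: every incoming request
# shares the same session context, so cached DataFrames are reused.
spark = SparkSession.builder.appName("spark-proxy").getOrCreate()
cached = {}

@app.route("/count")
def count():
    path = request.args["path"]  # e.g. s3a://bucket/events/
    if path not in cached:
        df = spark.read.parquet(path)
        df.cache()               # cached once, reused by all callers
        cached[path] = df
    return jsonify(rows=cached[path].count())

if __name__ == "__main__":
    app.run(port=8080)
```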

Do you have to implement your own proxy service? No! You can use these instead:

•   With Hydrosphere’s Mist, a job-based proxy service for Apache Spark, you can define Python, Scala, or Java-based applications, and Mist will execute the code for you as batch jobs: https://github.com/Hydrospheredata/mist

•   Apache Livy. Unlike Mist, Livy executes in client mode, which is what we wanted for our use case (see the sketch after this list): https://livy.incubator.apache.org
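For a feel of the Livy route, here is a rough sketch of its REST flow (the host, port, and submitted code string are assumptions; the endpoints are from Livy's documented API):

```python
import time

import requests

LIVY = "http://livy-server:8998"  # hypothetical Livy host

# Create an interactive PySpark session on the Livy server.
sess = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()

# Poll until the session is ready to accept statements.
while requests.get(f"{LIVY}/sessions/{sess['id']}").json()["state"] != "idle":
    time.sleep(1)

# Submit code; Livy runs it inside the shared session on its side.
stmt = requests.post(
    f"{LIVY}/sessions/{sess['id']}/statements",
    json={"code": "spark.read.parquet('s3a://bucket/events/').count()"},
).json()
print(stmt["id"], stmt["state"])
```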

Happy big data-ing!!

An earlier version of this blog was published on Medium by the author.
