ZEMOSO ENGINEERING STUDIO
November 5, 2018
2 min read

Product developer POV: Caveats when building with Spark

Building big data platforms and creating analytics frameworks is our superpower! We’ve migrated Gartner-ranked data-management platforms and used artificial intelligence (AI) for natural language processing (NLP) and computer vision applications. As a result, we use Apache Spark frequently to deploy machine learning, process big data, and manipulate large datasets. That experience means you can avoid the rabbit holes we’ve already been down.

For one product development project, we started with PySpark. The product sent a call to a Python-based server, and this server would dispatch the work to the Spark cluster; PySpark was our interface with the cluster. Up until this point, everything worked well. However, things got more complicated when we needed to cache the Spark DataFrame in memory instead of loading it from the source every single time. Doing this would speed up processing because we would no longer have the I/O overhead.
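
This is the kind of caching we had in mind. A minimal PySpark sketch, assuming a hypothetical Parquet source path and column name, not our production code:

```python
# Cache the DataFrame in memory so repeated queries skip the source I/O.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("product-analytics").getOrCreate()

df = spark.read.parquet("s3://bucket/events/")  # hypothetical source path
df.cache()   # mark the DataFrame for in-memory caching
df.count()   # run an action so the cache actually materializes

# Subsequent queries read the cached blocks instead of hitting the source.
df.groupBy("event_type").count().show()  # "event_type" is illustrative
```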

What was the problem?

In client mode, Apache Spark runs a Java Virtual Machine (JVM) locally on the client server, and that client holds a session object pointing to the cluster. The session context is therefore tied to one specific JVM, i.e., one client, and cannot be used outside it.

[Image: Features of Spark]

When we wanted to scale our servers up to meet the incoming demand from all the different clients, the caching approach no longer worked across servers. The Spark documentation could be clearer on this point: different servers mean different JVMs and, therefore, different session contexts. So even though all the PySpark clients point to the same physical Spark cluster, the session objects are different, the local references are different, and it is impossible to cache or share DataFrames between different sessions/clients, even under the same application name. Since requests weren't guaranteed to be routed to a particular server, we could no longer continue down this path.
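
To make the failure mode concrete, here is a hedged sketch of the code each API server would effectively run; the application name, master URL, and source path are all assumptions:

```python
# The same code is deployed on every API server. Each server process starts
# its own driver JVM through getOrCreate(), so the cache below lives only
# inside that server's session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("product-analytics")     # same app name on every server
         .master("spark://cluster:7077")   # same physical Spark cluster
         .getOrCreate())

df = spark.read.parquet("s3://bucket/events/")  # hypothetical source
df.cache()
df.count()  # cached -- but only within this server's session context

# Server A and server B each run this independently. Despite the shared
# app name and cluster, their sessions and cached blocks are separate, so
# a request routed to server B cannot reuse server A's cache.
```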

Therefore, we used a Spark proxy. 

For this product, we set it up so that the client applications communicated with the servers, and the servers communicated with a single, centralized Spark client running outside them. That central client held the one Spark session and executed the work each server delegated to it.
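
A minimal sketch of that idea, not our production code; the service name, route, and source path are hypothetical. One long-lived process owns the single SparkSession, and the API servers call it over HTTP instead of creating Spark sessions of their own:

```python
from flask import Flask, jsonify, request
from pyspark.sql import SparkSession

app = Flask(__name__)

# One session, one JVM, one cache -- shared by every caller of this service.
spark = SparkSession.builder.appName("spark-proxy").getOrCreate()
events = spark.read.parquet("s3://bucket/events/")  # hypothetical source
events.cache()
events.count()  # warm the cache once, at startup

@app.route("/count-by")
def count_by():
    column = request.args.get("column", "event_type")
    rows = events.groupBy(column).count().collect()
    return jsonify({row[column]: row["count"] for row in rows})

if __name__ == "__main__":
    app.run(port=8099)  # the API servers hit this endpoint, not Spark
```

Because every request funnels through the one process that owns the session, the cached DataFrame is reused no matter which API server the original request landed on.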

Do you have to implement your own proxy service? No! You can use these instead:

•   Hydrosphere’s Mist, a job-based proxy service for Apache Spark, lets you define Python, Scala, or Java applications, and Mist executes the code for you in batches: https://github.com/Hydrospheredata/mist

•   Apache Livy. Unlike Mist, Livy executes in client mode, which is what we wanted for our use case (see the sketch after this list): https://livy.incubator.apache.org
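
As a hedged illustration of the Livy route, the snippet below drives a shared PySpark session through Livy's documented REST endpoints (/sessions and /sessions/{id}/statements); the host, port, and submitted code are illustrative assumptions:

```python
import time
import requests

LIVY = "http://livy-host:8998"  # hypothetical Livy server

# Create an interactive PySpark session. Livy keeps it alive server-side,
# so many stateless API servers can share one driver and its cached data.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
sid = session["id"]

# Wait for the session to come up, then submit a statement.
while requests.get(f"{LIVY}/sessions/{sid}").json()["state"] != "idle":
    time.sleep(1)

stmt = requests.post(
    f"{LIVY}/sessions/{sid}/statements",
    json={"code": "spark.range(100).count()"},
).json()
print(stmt["id"])  # poll /sessions/{sid}/statements/{id} for the result
```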

Happy big data-ing!!


An earlier version of this blog was published on Medium by the author.



