Building big data platforms and creating analytics frameworks is our superpower! We’ve migrated Gartner-ranked data-management platforms and used artificial intelligence (AI) for natural language processing (NLP) and computer vision applications. As a result, we use Apache Spark frequently to deploy machine learning, process big data, and manipulate large datasets. This means you can avoid the rabbit holes that we’ve already been down.
For one product development project, we started with PySpark. The product sent a call to a Python-based server, and this server would hand the work off to the Spark cluster, with PySpark as our interface to the cluster. Up until this point, everything worked well. However, things started to get more complicated when we needed to cache the Spark data frame in memory instead of loading it every single time from the source. Doing this would speed up processing because we no longer had the I/O overhead.
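As a rough illustration of that caching pattern (the path and column names here are placeholders, not our actual data), a minimal PySpark sketch looks like this:

```python
from pyspark.sql import SparkSession

# Session on the Python server that fronts the Spark cluster.
spark = SparkSession.builder.appName("product-server").getOrCreate()

# Load the source data once and keep it in memory, so later requests
# skip the I/O overhead of re-reading it from the source.
events_df = spark.read.parquet("s3://example-bucket/events/")  # hypothetical path
events_df.cache()
events_df.count()  # materialize the cache eagerly

def handle_request(user_id: str) -> int:
    # Subsequent calls reuse the cached DataFrame instead of reloading it.
    return events_df.filter(events_df.user_id == user_id).count()
```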
In client mode, Apache Spark runs a Java Virtual Machine (JVM) locally on the client server, and that client holds a session object pointing to the cluster. This means the session context lives inside that specific client’s JVM and cannot be used outside that particular JVM, i.e., that client.
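To make the client-mode picture concrete, here is a minimal sketch (the master URL is a placeholder) of how each server’s PySpark process builds its own session, and why any cache lives with that server’s driver JVM:

```python
from pyspark.sql import SparkSession

# In client mode the driver JVM runs on this server; only the executors run
# on the cluster. The session object below is a handle inside that local JVM.
spark = (
    SparkSession.builder
    .master("spark://spark-master.example.com:7077")  # hypothetical cluster URL
    .appName("product-server")
    .getOrCreate()
)

# Cached tables are tracked by this driver JVM. A second server that builds a
# session with the same appName still gets its own driver, and therefore its
# own, separate cache.
spark.read.parquet("s3://example-bucket/events/").createOrReplaceTempView("events")
spark.catalog.cacheTable("events")
print(spark.catalog.isCached("events"))  # True here, but only in this JVM
```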
When we wanted to scale out our servers to meet the incoming demand from all the different clients, the caching concept no longer made sense across different servers. The Spark documentation could have been clearer here: different servers mean different JVMs and different session contexts. So even though all the PySpark clients point to the same physical Spark cluster, the session objects are different, the local references are different, and it is impossible to cache or share data frames between sessions/clients, even under the same application name. Since requests weren’t guaranteed to be routed to a particular server, we could no longer continue down this path.
Therefore, we used a Spark proxy.
For this product, we set it up so that the client applications communicated with our servers, and the servers in turn communicated with a single, centralized Spark client running outside these servers. That centralized client handled the Spark work delegated by each server.
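A minimal sketch of the idea (not our actual implementation; Flask, the endpoint name, and the data path are all illustrative): one process owns the single SparkSession and its cached DataFrames, and the application servers call it over HTTP instead of creating their own sessions.

```python
from flask import Flask, jsonify, request
from pyspark.sql import SparkSession

app = Flask(__name__)

# The only SparkSession in the deployment lives in this proxy process.
spark = SparkSession.builder.appName("central-spark-proxy").getOrCreate()
events_df = spark.read.parquet("s3://example-bucket/events/")  # hypothetical source
events_df.cache()

@app.route("/count", methods=["POST"])
def count():
    # Every application server hits this endpoint, so all requests share
    # the same driver JVM, the same session, and the same cached DataFrame.
    user_id = request.json["user_id"]
    n = events_df.filter(events_df.user_id == user_id).count()
    return jsonify({"count": n})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```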
Do you have to implement your own proxy service? No! You can use these instead:
• Hydrosphere’s Mist is a job-based proxy service for Apache Spark: you define Python, Scala, or Java applications, and Mist executes the code for you in batches. https://github.com/Hydrospheredata/mist
• Apache Livy. Unlike Mist, Livy can execute in client mode, which is what we wanted for our use case (see the sketch after this list). https://livy.incubator.apache.org
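For example, here is a rough sketch of driving Livy over its REST API from an application server, so that one long-lived PySpark session (and anything cached in it) is shared across callers. The host, path, and code string are placeholders, and polling for session readiness and error handling are omitted for brevity:

```python
import requests

LIVY = "http://livy-host.example.com:8998"  # hypothetical Livy endpoint

# Create an interactive PySpark session; Livy keeps it alive between calls,
# so DataFrames cached in it can be reused by later statements.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_id = session["id"]

# Submit a statement to the existing session (wait until the session is idle first).
code = "df = spark.read.parquet('s3://example-bucket/events/'); df.cache(); print(df.count())"
stmt = requests.post(f"{LIVY}/sessions/{session_id}/statements", json={"code": code}).json()

# Poll the statement until it finishes, then read its output.
result = requests.get(f"{LIVY}/sessions/{session_id}/statements/{stmt['id']}").json()
print(result.get("output"))
```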
Happy big data-ing!!
An earlier version of this blog was published on Medium by the author.