inspired by Java Champion Fabiane Bizinella Nardon
=> Creating a sustainable process for Data Science projects
The following provides an overview of common problems in data science, especially in data engineering and machine learning engineering, from a Java developer's point of view, and of what you can do with Java to solve them.
Data science has gotten a lot of hype in recent years, and a lot of people wish to enter the field.
Since 2013, voice recognition has skyrocketed thanks to that enthusiasm for data science; voice AIs are now about as good as humans at recognizing speech.
Google AI, Amazon AI and Apache Spark MLlib are the main contenders in AI algorithms; nowadays it is hard to avoid them, so you end up using and tuning them.
Despite all the tremendous excitement around data science in recent years, many were disappointed not to be able to take advantage of it for some use cases.
This is because many companies/projects don't have enough data, or because the data they have is not suitable enough.
Let's explore the data science pipeline!
If you look at a typical data science pipeline:
- First, you have data sources: your transactional systems, a third-party database that you bought, or even open datasets available on the web.
- This data is in raw format: log files, database dumps, CSV files.
- You then need to clean the data to remove invalid records, or transform it, e.g. turning a log file into a smaller file containing only the relevant data.
- Next comes feature engineering, where you extract features from your data. Features in data science are variables obtained from the raw data through different methods.
- Then comes data augmentation, where you fetch external data to complete your own, either from open data or from paid private sources.
- From there, you can build your model, which is a piece of software produced by confronting the data with AI algorithms.
- Insights are then inferred and displayed through data visualizations, for instance with Google Data Studio.
So what's the catch with data science?
The catch is that data scientists are fond of the Model part of the pipeline, but most of the work occurs before it, in data engineering tasks, i.e. prepping the data.
- 90% of the work in a Data Science project is spent on Data Engineering tasks.
If you look at GitHub repository counts by pipeline step, the numbers confirm that trend:
- 183,740 repositories for machine learning (data science)
- 4,248 repositories for data engineering
- 1,309 repositories for feature engineering (data engineering)
- 949 repositories for data lake (data engineering)
👉 It is quite obvious that all the appeal is focused on the data science part.
What can you do as a Java developer?
As a Java developer, you can easily envision implementing that pipeline.
Actually, it is not hard to build one of these pipelines; what is hard is doing it at scale.
Usually you don't have only one of these pipelines but tens or hundreds of them.
They have to run all the time, over very large datasets.
A company can ingest 3.5 billion new records per day, with 4,680 pipeline executions.
You get the picture: doing this by hand or with a simple Java program is very hard.
That's why dedicated tools are necessary.
So, let's look at how to solve that problem.
If you start searching for information on data science online, you will probably run into these misconceptions:
- There is no Big Data nor legacy code
(in reality you can, e.g., reuse your company's legacy validation code to validate your data)
- After you finish your experiments, the job is done
(in reality, once you have defined your model, you still have to execute it regularly: once a week, once a day...)
- There is always a data lake with all the data you need
(in reality, creating a data lake is one of the hardest things in data science)
The "real" world is like this:
- Lots of data and lots of legacy (Java?) code
- We need tools to experiment, test, deploy and catalog machine learning models
- We need good tools to catalog data and build data lakes
- We need good architecture practices to build scalable solutions to process data and train models
Let's see the Java tools:
The most used tool for distributed processing and data science in Java is Apache Spark.
Apache Spark is a well-designed and stable open source tool.
Cloud offerings exist, so you don't have to install it yourself.
Several Docker images exist too.
Apache Spark has two very interesting sub-projects:
- Spark SQL
(a SQL-like language to explore data; the difference is that rather than running a database transaction, you run over a distributed cluster)
- Spark MLlib
(a set of machine learning algorithms ready to use)
Let's take a look at Spark SQL:
This snippet reads a CSV file, filters it, and writes the result back to a new file.
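A minimal sketch of such a Spark SQL job, assuming a hypothetical input.csv with a header and a status column:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FilterCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("FilterCsv")
                .master("local[*]")   // local run; in production, a cluster
                .getOrCreate();

        // Read the raw CSV file
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .csv("input.csv");

        // Keep only relevant rows (the status column is an assumption)
        Dataset<Row> filtered = input.filter("status = 'VALID'");

        // Write the result back to a new location
        filtered.write()
                .option("header", "true")
                .csv("output");

        spark.stop();
    }
}
```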
This shows how simple it is to deal with data in Spark SQL.
If the data is huge, say a terabyte, Spark distributes the processing across a cluster, making it very fast.
Let's look now at Spark SQL using custom code:
This snippet shows that it is totally possible to extend Spark SQL with user-defined functions.
Here we see the creation of a geoHash function.
We then call it from Spark SQL code.
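A minimal sketch of registering such a function and calling it from Spark SQL (the geohash body here is a simplified stand-in for a real implementation, and points.csv with lat/lon columns is an assumption):

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

public class CustomFunctionExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CustomFunctionExample").master("local[*]").getOrCreate();

        // Register a user-defined function; the body is a simplified
        // stand-in that maps a lat/lon pair to a coarse grid cell id
        spark.udf().register("geoHash",
                (UDF2<Double, Double, String>) (lat, lon) ->
                        Math.round((lat + 90) * 10) + "_" + Math.round((lon + 180) * 10),
                DataTypes.StringType);

        // Call it from Spark SQL like any built-in function
        spark.read().option("header", "true").option("inferSchema", "true")
                .csv("points.csv")
                .createOrReplaceTempView("points");
        spark.sql("SELECT geoHash(lat, lon) AS cell FROM points").show();

        spark.stop();
    }
}
```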
You can now see the power of custom functions that you can make available to all data engineers.
That will speed up the creation of data pipelines and let you reuse your legacy code by calling it from Java.
One way to do this is to create a series of plugins that are visible from Spark SQL.
There are three types of plugins that are very useful in this kind of project:
- Semantic data types
First: semantic data types => they help validate the data.
This snippet reads the CSV and infers the data types.
The schema is thus defined; let's now confront it against the data while reading it!
In this snippet, we notice that the second field is not a String but a Long.
So, by passing the previous schema for type validation, this record is going to be flagged as invalid.
This is how, for instance, data can be cleaned.
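A minimal sketch of this inference-then-validation flow, assuming hypothetical people.csv (reference data) and incoming.csv (data to validate) files:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaValidation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SchemaValidation").master("local[*]").getOrCreate();

        // 1. Read a reference file and let Spark infer the data types
        StructType schema = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("people.csv")
                .schema();

        // 2. Re-read incoming data, validating it against that schema:
        // with DROPMALFORMED, rows whose fields do not match the
        // expected types are dropped
        Dataset<Row> valid = spark.read()
                .option("header", "true")
                .option("mode", "DROPMALFORMED")
                .schema(schema)
                .csv("incoming.csv");

        valid.show();
        spark.stop();
    }
}
```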
It is very easy to create a new data type.
Then you use it in Spark SQL:
In this snippet, the schema has a field called ssn whose type is SSNSparkType (the class created previously).
This will allow Spark SQL to validate the file according to that custom type.
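As a hypothetical sketch, the heart of such a semantic type is its validation rule; in plain Java it could look like this (the class name and the AAA-GG-SSSS format are assumptions):

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the validation rule behind a semantic SSN type.
public class SSNValidator {
    // US Social Security Number layout: AAA-GG-SSSS
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    public static boolean isValid(String value) {
        return value != null && SSN.matcher(value).matches();
    }
}
```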
The point is that you can create any custom type according to your business rules, to validate anything you want.
- You can use them as type detectors and identify data by its type rather than by scrutinizing values, making data lake creation easier.
- You can also use them to anonymize content, to comply with privacy regulations such as the GDPR.
- You can also use them for validation, as explained previously.
- You could even have automatic feature engineering; this latter use will be expanded in the future by researchers.
Semantic data types help a lot with data cleaning, but you will still probably have to do some kind of transformation.
This snippet shows a transformation class:
It implements a particular interface; you then implement the call method, which here reuses legacy 'GeoHash' code.
And then you call these functions from Spark SQL commands.
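A sketch of such a transformation class, assuming Spark's UDF2 interface and a stand-in LegacyGeoHash class in place of real legacy code:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// Stand-in for existing legacy Java code (simplified, not a real geohash)
class LegacyGeoHash {
    static String encode(double lat, double lon) {
        return Math.round((lat + 90) * 10) + "_" + Math.round((lon + 180) * 10);
    }
}

// A transformation wrapped as a Spark SQL function: implement the
// interface, then implement call() by delegating to the legacy code.
public class GeoHashTransform implements UDF2<Double, Double, String> {
    @Override
    public String call(Double lat, Double lon) {
        return LegacyGeoHash.encode(lat, lon);
    }

    public static void register(SparkSession spark) {
        spark.udf().register("geoHash", new GeoHashTransform(), DataTypes.StringType);
        // Now callable from Spark SQL, e.g.:
        // spark.sql("SELECT geoHash(lat, lon) AS cell FROM points")
    }
}
```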
So this way, you can use all your Java skills to make your data pipelines easier to build.
Now that you realize your Java skills are relevant to data engineering, let's talk about the gap between experimentation and production.
In experimentation we deal with:
- notebooks (definition below)
- interactive execution
In production we deal with:
- big data
- batch execution
Let's define notebooks:
They are tools that allow you to write and execute code in an interactive way.
The most famous is Jupyter (there is also Spark Notebook).
Usually in a notebook you have a code cell that you can edit; you just press play and you see the result of that code.
One of the problems with using notebooks is that you need your data lake (the samples) and your code lake (your functions) to be integrated with them.
This does not come out of the box: there is no search over the datasets available in your data lake, so this is something you have to implement, but once it is integrated, a notebook is a very powerful tool.
Behind a notebook page, in the source code, there is a JSON file containing the data being handled and the code snippets ruling it.
So it is possible for a developer to take this file and execute the pipeline on a schedule.
But that way of doing things is not very scalable or efficient.
One approach is to take that notebook file, pair it with some form of parameters and some form of scheduler, and execute the notebook in production.
This is very powerful and a way to bridge that gap between experimentation and production.
In the architecture above:
- You have a scheduler.
- You have a GUI of pipelines; usually you need this to acquire some sort of lock on the data lake.
- Then you execute the pipeline.
- It runs using parameters sent to the Spark cluster.
- In the end, you get several different copies of the executed pipeline.
- That means the notebook acts as a template for your data pipeline.
- The output is actually a good thing, because you can save the resulting notebook file as your execution log.
Great! Up to now we forgot to talk about a prevalent point: data lineage. Let's present it:
Data lineage means making/keeping the history of the data, so that you can track back where the data came from.
Imagine you have some datasets, and these datasets go through transformations and become other datasets, such that you can no longer tell where a dataset came from, whether you have the permission to use it, or which algorithm was used to produce it.
Data lineage tooling keeps track of all the transformations and of how the data was produced over time.
Unfortunately, only a few tools do this well, so you might have to develop your own data lineage tool.
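As a hypothetical sketch of what a home-grown lineage record could track per derived dataset (all field names are assumptions):

```java
import java.time.Instant;
import java.util.List;

// Hypothetical sketch: one lineage entry per derived dataset,
// recording its inputs and the transformation that produced it.
public record LineageEntry(
        String outputDataset,
        List<String> inputDatasets,
        String transformation,   // e.g. the pipeline or notebook that ran
        Instant producedAt) {
}
```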
So this is data engineering, but there is another field that has gotten more attention in the last few years: machine learning engineering.
Machine learning engineering is what you do once the model is trained.
For that, you need:
- A model catalog
- How the model was created (which parameters? which data?)
- Model versioning
- Model execution
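As a hedged sketch, a minimal model catalog entry covering the points above could look like this (all field names are assumptions):

```java
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch of a model catalog entry: identification and
// versioning, how the model was created, and where the data came from.
public record ModelCatalogEntry(
        String name,
        int version,
        Map<String, String> trainingParameters,
        String trainingDataUri,   // provenance: which dataset produced it
        Instant trainedAt) {
}
```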
The question is: how do you create these models?
Usually, to apply a model, you need to do some feature engineering: transformations you apply to the data so that your model can handle it well.
You train the model using a set of feature engineering techniques, and then you have to apply the same transformations to run the model over real data.
It turns out to be tedious to track all the applied feature engineering techniques.
But Java developers have a fix: Apache Spark MLlib and its concept of pipeline.
With MLlib, you can create a pipeline:
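A minimal sketch, close to the official Spark ML pipeline example (the column names and the chosen stages are assumptions):

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;

public class PipelineSketch {
    public static Pipeline build() {
        // Feature engineering stages are declared once...
        Tokenizer tokenizer = new Tokenizer()
                .setInputCol("text")
                .setOutputCol("words");
        HashingTF hashingTF = new HashingTF()
                .setInputCol("words")
                .setOutputCol("features");
        // ...together with the model itself
        LogisticRegression lr = new LogisticRegression().setMaxIter(10);

        // The pipeline bundles the transformations with the model, so
        // the exact same feature engineering is replayed at execution time
        return new Pipeline()
                .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
    }
}
```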
This is a very powerful tool, at hand for Java developers, to solve the model execution problem.
Another concept that has been trendy recently is the feature store.
It allows features to be reused across models.
Finally, let's talk about model lineage.
Right now, there are no really good tools to solve the model lineage problem.
That means that for each version you train, you should store which data, parameters and data provenance were used, so you can track back and explain later how you created that model.