Setting up data science activities in your company – managerial guide.

1 January 2022 (updated 11 February 2022) · Blog, Whitepapers


Recently I took a deep dive into the world of data science, data infrastructure, and machine learning. It gave me a dazzling view of the huge number of different tools that are out there, fragmenting the landscape more and more. See for example this article by FirstMark, which maps the immense landscape clearly. That data science is inseparably linked with the business is probably not news to you, and for that reason many companies have started hiring data scientists. However, how to make data science actually work for you is another challenge, one that companies are still learning to deal with today.

This article is written for managers and clarifies the phases one goes through from data collection to deployment in production, because models only start adding value once they are deployed. The article will also touch upon the different tools that can be used in each phase, drawn from the fragmented landscape mapped by FirstMark. The steps are illustrated using a fictional case from the insurance sector; the example methods, however, are by no means limited to that sector.

A simplistic representation of the stages one could follow to create a prediction model from scratch, by UbiOps.

Collect data and store it centrally

Say you’re building a classifier to determine the risk profile of a potential customer. First, you want to ensure you have data. You can use existing data sources like Amazon Polarity or, in the case of insurance companies, OECD insurance statistics. Alternatively, you can build your own dataset (e.g. from your ERP system), or combine self-built and publicly available data. The data should then be centrally organized in either a data warehouse (DWH) or in (no)SQL databases. However, many organizations are still dealing with a data lake, or simply unorganized data. The difference: a data lake holds raw, unstructured data for which there is no specific purpose yet, whereas a data warehouse is a repository of structured data stored for a specific purpose.
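To make the lake-versus-warehouse distinction concrete, here is a minimal sketch of loading raw CSV records into a structured, queryable table. SQLite stands in for a real data warehouse here, and the customer columns are made up for the insurance example.

```python
# Minimal sketch: centralize raw customer records into a structured table.
# SQLite stands in for a real DWH; the columns are hypothetical.
import csv
import io
import sqlite3

raw_csv = io.StringIO(
    "customer_id,age,vehicle,claims\n"
    "1,23,motorcycle,2\n"
    "2,41,car,0\n"
)

conn = sqlite3.connect(":memory:")  # a real DWH would live on a server
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, age INTEGER, "
    "vehicle TEXT, claims INTEGER)"
)
rows = list(csv.DictReader(raw_csv))
conn.executemany(
    "INSERT INTO customers VALUES (:customer_id, :age, :vehicle, :claims)",
    rows,
)
conn.commit()

# Once centralized, the data can be queried with a clear purpose in mind.
count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)
```

The point is not the database technology but the shape: structured tables with known columns, stored in one place, ready for analysis.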

When selecting the right DWH, one must consider not only costs, but also questions like: do I want my data on US servers, where the laws and regulations are different? Or: if we acquire a company with a different DWH, can I analyse my data across different places? Also important is data governance: how do you ensure proper handling of your data across the firm?

At this point, you should have a place (on-prem, in the cloud, or hybrid) to store structured data that you intend to use for specific purposes.

When it comes to scalability and complex tooling, hybrid or cloud is preferred. However, as mentioned earlier, many companies do not have the luxury of neatly centralized data. They work with fragmented data stored in different legacy systems (data silos), and centralizing that data is often a major undertaking.

Examples of data warehouses

Start developing the model

Now that we have the data stored centrally, we can start with the model development: data exploration, feature engineering, data cleaning, and training of the model. Cleaning and training are often offered in one tool because the process is highly iterative; the steps cannot be separated black and white. Cleaning data is also the most time-consuming and least enjoyable task, as you can see in the figure below.

What do data scientists spend the most time doing?
Data scientists spend 60% of their time on cleaning and organizing data.

Process of model development

  1. First, you start with data exploration. This should give you, as an analytics manager, a rough idea of what patterns exist, what outliers there may be, and what biases hide in your data. You could pull up a random subset of the data and plot a histogram or a distribution curve to find this. Using this info you can then form hypotheses, e.g. all motorcyclists aged 25 and younger have a 50% higher chance of claiming insurance costs.
  2. Secondly, you conduct feature engineering. Simply put, you are interested in which variables help you make the best prediction and test your hypotheses. Here the famous saying “rubbish in, rubbish out” applies: if you include all possible variables, your performance may decrease as a result.
    In our case we could include variables like “# historical speeding fines” or “Common driving area” (busy/quiet?). This process is crucial to get a well-performing model later on, but it will need updating once the model is in production. Feature engineering can be done using domain knowledge and logical reasoning, but also using data mining techniques. There are a variety of manual techniques you can read more about. In case you prefer a low-code tool, RapidMiner is amongst the best free tools available. Read more about other open-source tools.
  3. Then, for cleaning, but also for training, one can work with many different tools, depending on preference. In my environment I often hear of Python, used through Jupyter Notebook, together with libraries like pandas and NumPy. These libraries make it easier to perform certain tasks in the preferred programming language. However, as with feature engineering, you can also choose to go with low-code tools.
Python language and RapidMiner tool
Python: besides its usage for cleaning and training, you can also do feature engineering with it. RapidMiner: an analytics tool, which can also be used for data mining.
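The three steps above could look roughly like this in Python with pandas. This is a hedged sketch: the toy dataset, column names, and imputation choices are invented for the insurance example, not a prescribed method.

```python
# Sketch of exploration, cleaning, and feature engineering on a toy
# insurance dataset; the data and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 41, None, 19, 55],
    "vehicle": ["motorcycle", "car", "car", "motorcycle", "car"],
    "speeding_fines": [3, 0, 1, 2, 0],
    "claimed": [1, 0, 0, 1, 0],  # target: did the customer file a claim?
})

# 1. Exploration: summary statistics and simple distributions.
print(df.describe())
print(df["vehicle"].value_counts())

# 2. Cleaning: impute the missing age with the median.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Feature engineering: encode a domain hypothesis ("young motorcyclists
#    claim more often") as an explicit feature the model can use.
df["young_motorcyclist"] = (
    (df["age"] <= 25) & (df["vehicle"] == "motorcycle")
).astype(int)

print(df[["age", "young_motorcyclist", "claimed"]])
```

In practice these three steps loop: exploration suggests features, cleaning changes the distributions, and the feature set is revisited after every evaluation round.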

The major cloud providers are expanding their offerings to reach a broader audience, including the not-so-technical. Take for example Amazon SageMaker, Google’s AutoML, or Azure ML services.

Amazon SageMaker, Google’s AutoML, Azure ML Services
These tools can be very useful once you have a lot of knowledge on how to set up pipelines in these services.

However, it can be challenging to use their services. Because of the broad range of functionalities they offer, things can get complex very quickly. As we will see later, you need to be skilled in a large variety of modules from the same provider to create and deploy your model. Nevertheless, such services can be useful when you know exactly what you want to do with your data and how.

A more general note on model development: consider where, and on what data, you want to run the model in the future. Referring back to the choice of your DWH: what if you want to analyse the newly acquired company, which stores its data in a different DWH? Can you pick up your model and drop it there? When using a tool for model development, you are limited in which DWHs you can integrate with.

How is the model doing? Let’s evaluate!

Evaluation of your model is crucial, both before your first model run and during its lifecycle. Is the model correctly classifying the risk profile of a potential customer? And how does it perform on real data? In case you’re not satisfied, you can adjust the model parameters and/or do more feature engineering, and try again. This usually still happens on your local machine or, in case of larger datasets, in the cloud to handle peak loads (so your machine is not running for hours). Examples of evaluation metrics are the ROC curve, the F1 score, and, simply put, accuracy. More recently, fairness and transparency have become increasingly important for “responsible AI”.

You have trained your model on your training set, which is usually around 80% of your total dataset, and then tested it on your test set, the remaining 20%. If you’re satisfied with the performance, you can move on to deployment in production. Here it is important to keep your IT team happy with the introduction of new integrations and data flowing through their carefully built infrastructure, but also to ensure the model keeps performing well in the future. For more info on that, see “From PoC to production”.
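As a sketch of the evaluate step, the snippet below makes an 80/20 split with scikit-learn, trains a simple classifier, and reports the metrics mentioned above. The synthetic data and the choice of logistic regression are purely illustrative.

```python
# Illustrative evaluate step: 80/20 split, then accuracy, F1, and ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                   # four made-up features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic risk label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0         # 80% train, 20% test
)

model = LogisticRegression().fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]       # scores for the ROC curve

print(f"accuracy: {accuracy_score(y_test, pred):.2f}")
print(f"F1 score: {f1_score(y_test, pred):.2f}")
print(f"ROC AUC:  {roc_auc_score(y_test, proba):.2f}")
```

If these numbers disappoint, you loop back: adjust parameters or features and evaluate again, which is exactly the iterative nature discussed throughout this article.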

Many models are built, evaluated, and approved for use in production. The data scientist is happy, and so is the manager. Seems like a great marriage. In academia, where many algorithms originate, it ends here, before “industrial” deployment. In our case, however, you want to give an estimate of the risk profile (and thus insurance costs) within seconds after the potential client has provided their info on your website. Yet according to VentureBeat, 87% of ML models never make it to production.

Model deployment

Bringing your models to production can be quite challenging. It is different from conventional software development, mainly because of the iterative nature of data science (see the figure below). Ideally, you seamlessly integrate your model with the production environment, so that it can take an input and produce an output, while you remain able to update and change the model whenever needed.

Data scientists want the best model, while the IT team cares about stability, continuity and wants to sleep well at night.
Data science life cycle by UbiOps.
  • Scalability
    As your model’s data volume grows and your pipelines expand, you need to scale up. Is your infrastructure ready to scale? Do you have the tools to monitor the performance and scalability issues that may arise over time?
  • Managing data science languages and production languages
    While Python and R are common languages to write your model in, C++ and Java are production languages. Porting Python to C++, for example, is difficult. Also, error checking and testing cannot cross the language barrier.
  • Monitoring and transparency
    Many things can go wrong with a production ML model. Unlike traditional software, an ML system’s behaviour depends not only on the code, but also on the data and the model (parameters) itself. Problems can occur such as data skews (training data does not represent real-life data sufficiently), changes in customer behaviour, or features that are unavailable in production.
  • Robustify for production
    As mentioned, your data scientist writes the code to make the best model. That does not mean it is robust enough to work in an (IT) production environment. Often, this step requires a lot of rewriting, integration into the existing IT architecture, and dealing with the difficulty of accessing production data.
  • Models are subject to constant change
    Hardly ever will your data scientist write a model, test it on their test set, and be finished. Very often the model (or its parameters) needs adjusting while it is deployed in a staging or production environment. It is an iterative process, so you want to make it as easy as possible to review the performance, adjust the code, and try again, and if it works well, push it to production immediately.
  • Portability
    With the emergence of multi-cloud strategies, and the wish not to be locked into one single vendor, you want to easily bring your software from one location to another. Vendor lock-in can create barriers for data scientists when creating models and deploying them.
  • ML works in spikes
    Your model may run once every so many minutes, hours, or days. You don’t want to pay for your servers when you don’t use them. It is a challenge to both scale up and down, and only pay-as-you-go.
Roughly speaking, there are three ways to deploy:

1. You can choose to go for the major cloud providers: SageMaker (AWS), ML services (Azure), or AutoML (Google). This usually works well when you have many data scientists, DevOps, and knowledge of that cloud in-house. In case you have just started building your data science team, or are well on the way to building multiple models and soon want to deploy them, you may want to choose differently.

2. You can choose an easy, more low-key platform or tool that enables you to bring your model to production quickly, e.g. Xenia, Dataiku, or KNIME. While all three companies offer different functionalities, they can help you get business value from your data, even when you have just started.

3. You can build the entire data science CI/CD (continuous integration, continuous deployment) pipeline yourself. This generally takes months and requires you to continuously perform updates, fix bugs, and integrate new functionalities.
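To illustrate what sits at the very core of such a deployment, here is a minimal sketch of serving a model behind a REST endpoint with Flask. The `predict_risk` rule is a placeholder for a real trained model, and a production setup would add authentication, scaling, monitoring, and the CI/CD machinery around it.

```python
# Minimal sketch: expose a (placeholder) risk model as a REST endpoint.
# Flask is used for illustration; nothing here is production-hardened.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_risk(features):
    # Placeholder for the real trained model: a trivial rule that raises
    # the risk score with each (hypothetical) historical speeding fine.
    score = 0.5 + 0.1 * features.get("speeding_fines", 0)
    return min(score, 1.0)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()          # e.g. {"speeding_fines": 2}
    return jsonify({"risk_score": predict_risk(features)})

# To serve locally during development:
# app.run(port=8080)
```

With an endpoint like this, the website can POST the client’s details and receive a risk score within seconds, which is exactly the “industrial” deployment step that academia stops short of.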

Integrate your results

When you have managed to deploy your risk classification model to a production environment, you’re almost there. Sometimes the business requires the results to be visualized in a dashboard, or pushed to an app. The output of your model can be stored in a database, from which you can connect it to your preferred BI tool, like PowerBI or Tableau. Alternatively, you can use APIs to connect it to your website, so users can fill in their details and see their risk premium. Tableau is aimed at more experienced users, is a bit harder to learn, and is often used by medium to large companies with lots of data. PowerBI is easy to use, is used by any type of company, and can handle a limited amount of data. You can also choose to use, for example, Plotly, which is more technical than PowerBI/Tableau, but lighter in use and more user-friendly for data scientists themselves.
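The database hand-off can be sketched as follows: predictions are written to a table that a BI tool or website API can then read. SQLite again stands in for the production database, and the table and column names are hypothetical.

```python
# Sketch: store model output in a database for BI tools to consume.
# SQLite stands in for the production database; the schema is made up.
import sqlite3
from datetime import datetime, timezone

predictions = [
    {"customer_id": 1, "risk_score": 0.72},
    {"customer_id": 2, "risk_score": 0.31},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE risk_scores ("
    "customer_id INTEGER, risk_score REAL, scored_at TEXT)"
)
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO risk_scores VALUES (:customer_id, :risk_score, :scored_at)",
    [{**p, "scored_at": now} for p in predictions],
)
conn.commit()

# A dashboard query might aggregate the latest scores, for example:
avg = conn.execute("SELECT AVG(risk_score) FROM risk_scores").fetchone()[0]
print(round(avg, 3))
```

PowerBI and Tableau both connect directly to tables like this, so once the scores land in the database, the visualization side needs no extra plumbing.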

Conclusion

Diving into the data science lifecycle and selecting which tools to use where can be very confusing. Some tools solve only a specific problem; others solve several. Going from data collection to model development, to deployment, and finally to integration is often easier said than done, and that holds for our case of building a classifier to determine the client’s risk profile too. As a manager, you need to understand the difficulties along the way and know, to a good extent, what your team is doing. The key takeaways are:

1. Understand that collecting data and building a model are only the first steps. ML systems are iterative, unlike conventional software development, and require ongoing maintenance while they are in use. This requires you to review and adjust your earlier work, time after time, if you want to keep your model as accurate as possible.

2. Think early about where and how you centralize your data. For many organisations that moment has passed, but the point still stands: many firms now spend large sums of money and lots of time centralizing data and stitching databases together in order to extract value using data science. Also consider under which laws and regulations you want to store your data: US or EU? And what is your data governance?

3. Spend time on determining which features you want to use, but know that this is an iterative process. Talk to field experts with domain knowledge, use logical reasoning, and apply data mining techniques. No stress: you can always test your model and adjust the features (variables).

4. Choose between low-code and programming the model yourself. When you choose a low-code platform, keep in mind that it may limit you in functionality, but also in the portability of your ML model in the future. Programming it yourself requires data science skills, which are scarce and expensive.

5. Deployment is where most ML models fail, mainly because it is different from software development and organisations are not structured to handle it. Ensure you find a reliable way to deploy, and keep both your data scientists and your IT team happy.

6. Finally, consider whether the output needs to be integrated. If so, integrate with e.g. a visualization tool, or push it to an application, so that the business can reap the benefits of your data science efforts. When it comes to dashboards: use PowerBI if you’re relatively small and new to visualizations; otherwise I recommend Tableau. In case your data scientists want to stay in control, consider Plotly.

Do you want to stay up to date with similar topics? Follow our blog on Medium.