Thursday, September 14, 2017

Driving Machine Learning Capability in an Organization

Driving change in an organization is tough. Any change has to percolate through every level and across departments. Machine learning (ML) has changed the way we look at traditional businesses. While many organizations have already adopted machine learning for cost saving and revenue generation, many others have only recently started. From my experience in the data science field, here are some thoughts on driving a machine learning capability in organizations.

Accepting that ML and Artificial Intelligence are the next big thing

Here are some predictions made about the future. We all know what followed.
1876: "The Americans have need of the telephone, but we do not. We have plenty of messenger boys." — William Preece, British Post Office
1946: "Television won't be able to hold on to any market it captures after the first six months. People will soon get tired of staring at a plywood box every night." — Darryl Zanuck, 20th Century Fox.
1995: "I predict the Internet will soon go spectacularly supernova and in 1996 catastrophically collapse." — Robert Metcalfe, founder of 3Com.
The first point I would suggest is this: the sooner we accept that ML is the next big thing, the better we can prepare ourselves for the changes. Many large organizations now understand the importance of ML, but the drive for this change has to be initiated at the top management level. Nowadays we hear in the news about ML taking our jobs. I feel that the fear and negativity associated with this need to be addressed by arranging awareness sessions, helping employees to re-skill, and making use of their immense domain knowledge to drive the ML change.

Whoever has the data is king

ML algorithms are mainly driven by data, so it is of utmost importance to have more data, and more accurate data, to run them effectively. The challenge in many industries such as manufacturing, unlike finance, is that data is scattered and often not stored in a proper format. Collecting more data, improving data storage, storing the data in a proper format and making it easy to access will be key in the future. Installing sensors wherever possible and working on the Internet of Things will eventually help companies save costs and can create opportunities for revenue generation. Better computational power is also required to implement data science projects successfully. Nowadays, with advances in cloud technology, storing data and building computational power is far less costly than it used to be.

Setting up a team of data scientists, domain experts and data engineers

From my experience, a team of data scientists alone would never be able to successfully research, prototype and implement any large project. Any project team should consist of domain experts, business stakeholders, data engineers and data scientists. The key to successful project implementation is close coordination among, and accountability of, all the stakeholders, including the business stakeholders. The data science group should have enough support from top management as well. Though an initial investment is needed to do all these things, the payoff in the future would be much higher. Before starting a new project, and while working on it, the data science group should keep asking itself what value it is bringing, or will bring, to the organization. On the other side, top management should be patient enough to realize the benefits and value of data science projects, given that not all the benefits can be quantified immediately.

Educating the senior leaders driving analytics

One of the key issues is managers treating data scientists more like software engineers and developers than like scientists. While developers are evaluated on the lines and quality of code written, a data scientist needs to be evaluated on converting a business problem into a data science problem, demonstrating the solution through a proof of concept or prototype, and presenting the mathematical results in business language to business leaders and clients. Managers also need to understand that the work is closer to R&D than to software engineering. Data scientists should be given enough time and room to come up with a solution, rather than unreasonable deadlines.
With fast-paced developments in the machine learning field, such as deep learning, and the increasing availability of data and computational power, the applications of machine learning will grow at a rapid pace. While ML is already proving to be a powerful tool, its importance will increase further in the future. We have to make sure that organizations adapt to the ML change not just rapidly, but systematically and prudently as well.

Tuesday, April 18, 2017

Avoiding look-ahead bias in time series machine learning forecasting

Any time series classification or regression forecast involves predicting Y at 't+n' given the X and Y information available till time 't'. Obviously, no data scientist or statistician can deploy such a system without back testing and validating the performance of the model on history. Using future actual information in the training data, which can be termed "look-ahead bias", is probably the gravest mistake a data scientist can make. Even though the sentence "we cannot make use of future data in training" sounds obvious and simple in theory, anyone can unknowingly introduce look-ahead bias in complex forecasting problems.
The discussion becomes important when you put a lot of effort into researching and building the model, only to realize later that the back testing framework was using future data. It will also cost the data scientist a lot if the model is approved by top management and only at deployment time does it become clear that the future data is not available.
In this article, I suggest some simple checkpoints which might help in avoiding look-ahead bias. Not all of the points will be relevant or directly applicable to the problem at hand. In the end, it is better to have bad results using a correct framework than good results using a wrong framework.

Construct the last window first

Suppose Mr. ABC wants to predict the number of hourly transactions 2 hours from now. The client has given him data till 11 AM. The first thing ABC can do is build a simple model using the data till 11 AM and see if he can get a forecast for 1 PM. This ensures that only data available till time 't' is used to build the model that forecasts for 't+n'.
The last window is important because it mimics the real-time implementation. The same last-window method can then be applied in a rolling-window fashion for back testing, where 't' is the index of a window.
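The rolling-window idea can be sketched roughly as follows. This is only an illustration under assumptions: the function name `rolling_backtest`, the fixed window length and the use of a plain average as the "model" are all hypothetical stand-ins, not a recommended forecasting method.

```python
from statistics import mean


def rolling_backtest(series, horizon, window):
    """Walk forward through `series`, forecasting `horizon` steps ahead.

    At each origin t, only series[t - window + 1 : t + 1] is visible to the
    model, which mirrors what would be available in real time.
    """
    results = []
    # The largest t in this loop is the "last window": the latest origin
    # whose actual value at t + horizon is already known.
    for t in range(window - 1, len(series) - horizon):
        history = series[t - window + 1 : t + 1]  # data available till time t
        forecast = mean(history)  # placeholder model (assumption)
        actual = series[t + horizon]  # used only for scoring, never for fitting
        results.append((t, forecast, actual))
    return results


# Hypothetical hourly transaction counts
hourly = [10, 12, 14, 16, 18, 20]
for origin, forecast, actual in rolling_backtest(hourly, horizon=2, window=3):
    print(origin, forecast, actual)
```

The point of the structure is that the value at 't+n' is read only when scoring the forecast, never when building it.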

Start with small sample data

Wherever possible, build the first-cut model on a small sample data set before adding complexity. On the small data set, make sure that no look-ahead bias is being introduced. Once that confidence is built, start building the actual, more complex framework.

Ignoring data points from 't-n' till 't' in features

Consider an example where Mr. ABC is building a model to predict stock movement 5 minutes ahead. That means he would like to know what happens at 10:05 AM based on the information available till 10:00 AM. When he trains the model at 10:00 AM, the latest training sample can use features computed only from data up to 9:55 AM, and the period from 9:55 AM till 10:00 AM is kept for its label. If by any chance Mr. ABC uses data from 9:55 AM till 10:00 AM in the feature computation for that sample, it will lead to look-ahead bias.
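A minimal sketch of this alignment, assuming the prices arrive as a simple list of 5-minute bars, might look like the following. The function name `build_training_pairs`, the `lookback` parameter and the up/down label are hypothetical choices made for the illustration.

```python
def build_training_pairs(prices, t, n, lookback):
    """Return (features, label) pairs that are usable for training at index t.

    prices   : list of prices, one per bar
    t        : index of the current bar (the latest data available)
    n        : forecast horizon in bars
    lookback : number of bars in each feature window (assumption)
    """
    pairs = []
    # The last feature window must end at t - n, so that its label
    # (the move from t - n to t) is already known at time t.
    for end in range(lookback - 1, t - n + 1):
        features = prices[end - lookback + 1 : end + 1]
        label = 1 if prices[end + n] > prices[end] else 0  # 1 = price went up
        pairs.append((features, label))
    return pairs


# Hypothetical bars; index 4 plays the role of 10:00 AM, index 5 is the future
bars = [100, 101, 99, 102, 103, 101]
print(build_training_pairs(bars, t=4, n=1, lookback=2))
```

Because the loop stops at `t - n`, no feature window overlaps the period reserved for the most recent label, and nothing after index `t` is ever read.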

Learn when the data is published

Let us take an example of monthly sales prediction using monthly macroeconomic indicators. Mr. ABC is assigned to predict the monthly sales 2 months ahead at the end of each month. Sales numbers for the current month are updated on the last day of each month. Let's assume today is 31st March and ABC wants to predict for June. In this case, if the March value of one of the macroeconomic indicators is published only in the 3rd week of April, then ABC cannot use the same-month value in back testing and will have to apply an extra lag to that variable.
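One way to enforce this in a back test is to store the publication date next to every indicator value and filter on it. The sketch below is only illustrative: the function name `latest_published`, the year, the release dates and the indicator values are all made up.

```python
from datetime import date


def latest_published(releases, as_of):
    """Return the most recent (period, value) already published on `as_of`.

    releases : list of (reference_period, publication_date, value) tuples
    as_of    : the date on which the forecast is being made
    """
    usable = [
        (period, value)
        for period, published, value in releases
        if published <= as_of  # value was publicly known on the forecast date
    ]
    if not usable:
        return None
    return max(usable, key=lambda item: item[0])


# Hypothetical indicator, published in the 3rd week of the following month
releases = [
    (date(2017, 2, 1), date(2017, 3, 21), 1.8),
    (date(2017, 3, 1), date(2017, 4, 21), 2.1),
]
print(latest_published(releases, as_of=date(2017, 3, 31)))
```

On 31st March only the February value is returned, which is the extra lag described above.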

Avoid using randomized cross validation

The cross validation (CV) framework for time series is different from the CV used in a normal classification framework, where the time stamp of the prediction is not important. Regular CV, which randomizes the data points, can put future observations into the training set and induce serious bias.
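A time-ordered alternative is to always train on the past and test on the block that follows it. The sketch below shows one possible expanding-window scheme; the function name `forward_chaining_splits` and its parameters are hypothetical, and other schemes (for example a fixed-length training window) are equally valid.

```python
def forward_chaining_splits(n_samples, n_folds, test_size):
    """Return (train_indices, test_indices) pairs in time order.

    Every training set contains only observations that come before
    the corresponding test block, so no future data leaks into training.
    """
    splits = []
    for k in range(n_folds):
        test_end = n_samples - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested folds")
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_end))
        splits.append((train_idx, test_idx))
    return splits


for train_idx, test_idx in forward_chaining_splits(10, n_folds=3, test_size=2):
    print(train_idx, test_idx)
```

Unlike a shuffled split, the largest training index is always smaller than the smallest test index.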

Model Bias

Putting too much effort into improving validation results when deciding the feature set is again a type of look-ahead bias. Of course, this problem is very hard to tackle, but some care can be taken when deciding the features. In the ideal case, the features should be decided first, based on a strong fundamental understanding of the domain and the problem, and only then incorporated in the model.
This will also take care of the over-fitting problem to some extent. The idea is to guess which features might work and then validate them, rather than to justify after the fact why a feature worked based on the results.

Guesstimate the accuracy before you even start predicting

The data scientist should have a fair guesstimate of the MAPE or accuracy before even starting on the prediction problem, and should then compare the results with that guesstimate. Too much deviation could also signify look-ahead bias. Many times, even getting a guesstimate can be a tough task. In that case, one should still ask the question: "Are the results too good to trust?"
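As a rough illustration, the comparison with a guesstimate can be automated as a simple sanity check. MAPE here is the mean of |actual - forecast| / |actual|, expressed in percent. The helper `looks_too_good` and its `ratio` threshold of 0.5 are arbitrary assumptions for the sketch, not an established rule.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent (actuals must be non-zero)."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(errors) / len(errors)


def looks_too_good(observed_mape, expected_mape, ratio=0.5):
    """Flag results whose error is far below the prior guesstimate.

    `ratio` is an arbitrary threshold (assumption): with 0.5, an observed
    MAPE of less than half the expected MAPE is treated as suspicious.
    """
    return observed_mape < ratio * expected_mape


observed = mape([100, 200], [110, 180])
print(observed, looks_too_good(observed, expected_mape=25.0))
```

A flag from such a check does not prove look-ahead bias; it is only a prompt to go back and inspect the features and the back testing framework.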

Sudden jumps in accuracies

If the addition of one or more new features causes a sudden jump in accuracy, one should pause rather than rejoice, and check whether the new feature is using some future data.
Beyond all these points, taking precautions at each step, adding multiple random checks on the processed data, and understanding the meaning of the results at each step would also help in avoiding look-ahead bias.
- Rohit Walimbe, Data Scientist