In our last post on this subject, we discussed why the business plays an important role in unstructured data analytics projects, how to manage its expectations about the text being mined, and some common pitfalls to avoid. We will now talk about the automated learning aspect of unstructured data analytics algorithms, or machine learning. In simple terms, machine learning is the ability of a computer algorithm to perform a task better over time by absorbing new patterns and 'learning' new information about the task.
For a lot of people, machine learning is the list of open source algorithms available on the Apache Mahout website. While there is no doubt that these algorithms work great for the problems they are supposed to solve, more often than not they are force-fitted into a solution to provide 'machine learning' capabilities. Most of these algorithms are based on numerical calculations and extrapolations that may not be that relevant to unstructured data. Now I am not saying that you cannot run the Naïve Bayes algorithm from this list on a large set of emails to identify patterns for spam. Since it works on the statistical occurrence of given keywords, it will work out pretty nicely. But when you deal with unstructured data analysis, chances are it is not going to be easy to convert all your machine learning problems into numerical models.
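To make the spam example above concrete, here is a minimal sketch of a keyword-based Naïve Bayes classifier in plain Python. It is not Mahout's implementation; the function names, the add-one smoothing, and the four training emails are assumptions made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real email text would need more cleanup.
    return text.lower().split()

def train(labeled_emails):
    # Count how often each word occurs per label, and how many emails per label.
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in labeled_emails:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        # Log prior for the label, then log likelihood of each word.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus, invented for this sketch.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the invoice draft", "ham"),
]
word_counts, doc_counts = train(TRAINING)
```

Calling classify("claim your free money", word_counts, doc_counts) labels the message as spam purely from keyword frequencies, which is exactly why this kind of problem maps well onto a numerical model while many other text problems do not.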
On the other hand, you do not have to be a genius with a double doctorate to build machine learning algorithms that work for your learning problems. With sound fundamental engineering knowledge and a good volume of corpus data, you can build relevant and elegant machine learning capabilities into your system. Don't believe me? Look at this advice posted on Mahout's website:
We help a lot of businesses build smart algorithms with machine learning capabilities, and not all of them are based on canned Mahout implementations. Machine learning capabilities are very contextual and are best designed by looking at the data and keeping the objectives of learning in mind. Building relevant machine learning capabilities for an unstructured data analytics system is a two-part process:
1. Identify Objectives
Before you can figure out the right machine learning algorithm to use, you need to list the purpose or objective of machine learning in your problem context. It sounds like common sense, but you will be surprised how many times the solution starts from the wrong end of the rope: 'Alright, let us put some machine learning into this thing now…' Machine learning cannot, and I repeat, cannot be your end goal. It is a means of achieving some improvement in the solution; it is not the improvement itself.
Your machine learning objectives can be something like:
I need the system to record what the user does with the data that we present and, based on that behavior, decide whether it is the right data to display.
I need the system to absorb implicit user feedback and change its behavior without having to make any code changes.
I need the system to identify new patterns as it processes large volumes of data and make recommendations for refinement based on these patterns.
Remember, 'I need the system to …' not 'I need machine learning …'.
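As one illustration of the second objective, here is a minimal sketch of a system that absorbs implicit feedback and changes what it shows without any code changes. The FeedbackRanker class, its method names, and the smoothed click-rate scoring are all assumptions for this example, not a prescribed design.

```python
from collections import defaultdict

class FeedbackRanker:
    """Reorders items based on how often users clicked them when shown."""

    def __init__(self):
        self.shown = defaultdict(int)    # times each item was displayed
        self.clicked = defaultdict(int)  # times each item was clicked

    def record(self, item, clicked):
        # Implicit feedback: we only observe whether the user acted on the item.
        self.shown[item] += 1
        if clicked:
            self.clicked[item] += 1

    def score(self, item):
        # Smoothed click rate, so unseen items start at 0.5 rather than 0.
        return (self.clicked[item] + 1) / (self.shown[item] + 2)

    def rank(self, items):
        # Highest score first; ties keep their original order.
        return sorted(items, key=self.score, reverse=True)

# Hypothetical usage with made-up item names.
ranker = FeedbackRanker()
for _ in range(3):
    ranker.record("report_a", clicked=False)
for _ in range(2):
    ranker.record("report_b", clicked=True)
ordering = ranker.rank(["report_a", "report_b", "report_c"])
```

Here the ordering changes only because the recorded behavior changed. The objective was stated as something the system must do, and the learning mechanism turned out to be a few counters rather than an off-the-shelf algorithm.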
2. Understand & Complement Data
Data is the other important half of the machine learning equation. The learning objectives should be supportable by the data corpus. For instance, you cannot run an algorithm on all the data collected from email exchanges between companies and figure out how to make invoices better. Well, maybe you can… but you get the idea. Machine learning is no magic wand; it can only identify patterns that exist in your data. Again, although it sounds like common sense, looking for the wrong patterns in the data is a fairly common mistake.
Sometimes, when the data corpus does not have the information that you need, you can complement it with other publicly available information to surface the patterns that your algorithm needs to recognize. For instance, you can get geographical information from an external source and marry it with the data from your master data management system to identify demography-based patterns that influence your system. You can then implement the right algorithm to build learning around this context.
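The idea of marrying external geographical data with master data can be sketched in a few lines. Everything here is hypothetical: the customer records, the postal-code-to-region lookup, and the field names are invented for illustration and do not reflect any particular master data management system.

```python
from collections import Counter

# Hypothetical master data records.
customers = [
    {"customer_id": 1, "postal_code": "10001"},
    {"customer_id": 2, "postal_code": "94105"},
    {"customer_id": 3, "postal_code": "10001"},
    {"customer_id": 4, "postal_code": "00000"},
]

# Hypothetical external lookup, e.g. loaded from a public geographic dataset.
postal_to_region = {"10001": "Northeast", "94105": "West"}

def enrich(records, lookup):
    # Attach a region to each record; unmatched postal codes are kept as "unknown".
    return [
        {**record, "region": lookup.get(record["postal_code"], "unknown")}
        for record in records
    ]

def count_by_region(records):
    # A simple demography-based pattern: how many customers fall in each region.
    return Counter(record["region"] for record in records)

enriched = enrich(customers, postal_to_region)
region_counts = count_by_region(enriched)
```

Once the region is attached to each record, whatever learning you build can look for patterns by geography, something the original corpus alone could not support.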
Combining machine learning objectives with the patterns in your data can help you create simple and effective machine learning capabilities for unstructured data analysis. Since unstructured data analysis is not entirely mathematical, it is extremely important for the system to have the right machine learning capabilities if it is to deliver good ROI over time. You should approach the machine learning aspect of your solution with an open mind and not be influenced by what machine learning code is available off the shelf.