There were a bunch of tech events yesterday and among them was the awarding for the WebGeek/LobangClub Developer Challenge.
This post will not be solely about my solution to the challenge, but rather a run-through of the different approaches we participants used and presented during the meetup. The scores were close anyway, and it just happened that the combination of techniques I chose from the list below yielded the highest score.
A word of warning, though: if you aren’t into computer science, linguistics, and math, it may be in your best interest not to continue on to the analysis below. :D
Let’s begin the analysis by stating the problem:
Operation of the Competition:
5.0 The challenge is to produce software to correctly associate products into appropriate categories based on the name of the product.
For example “Ice lemon tea” is associated with ‘Drinks’, “Smith’s Thinly Cut Potato Chips Thai Sweet Chilli” might be associated with “snacks” and “potato chips”
5.1. At the start of the challenge, participants will be provided with 2 files:
File 1 ( Categories File ) will contain the entire list of possible categories
File 2 (Training Set ) will contain a fairly large set of training data containing product title to category mapping, for example:
“Meiji 0% Fat Milk”,”Giant”
“Meiji 0% Fat Milk”,”Dairy”
“Listerine Total Care”,”Dental”
“Colgate Sensitive Multi Protection 120g”,”Toothpaste”
“Korea Popcorn Butter And Cheese”,”Korean Snacks”
…
5.2 The participants will submit a program to read the Training Set and Categories File files. The program then takes File 3 ( the Competition Set ) containing only a list of Product Titles. The program then must output a file in the same format as the Training Set that maps the Product Titles in the competition set to all applicable categories in the Categories File. The contents of the Competition Set will not be known to the participants and will only be revealed at the conclusion of the competition.
This is clearly a Document classification problem. However, given the nature of the input data, our solutions were not as straightforward as the suggested techniques in that Wikipedia page.
That said, our techniques had components that could be classified into four categories:
1. Clean up the data while tokenizing
When you’ve got training set data like:
L'oréal,True Match
Loreal Elvive,shampoo
Loreal Eyeshadow ,Make Up
L'oreal Fixing Mousse,Hair Gel
simply tokenizing the text into individual words is not enough. The ideal approach would be to convert all variations of “L’Oreal” to a single common token (e.g. “loreal”) in order to help the classifier produce more accurate results. One way of doing this is to use a stemming algorithm like Porter2.
You may also want to customize your chosen stemming/tokenizing algorithm to take into account other quirks in the training data, say, by adding a whitelist or blacklist of words and special characters. Phonetic algorithms like Soundex may also be used to replace or simply complement the stemming algorithm.
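As a minimal sketch of this clean-up step, here is a hand-rolled normalizer in Python (standing in for a full Porter2 stemmer; the function name is mine) that collapses the “L’Oreal” variants above into one token:

```python
import re
import unicodedata

def tokenize(title):
    """Normalize a product title into clean, comparable tokens."""
    # Decompose accented characters, then drop the combining marks
    # so "L'oréal" and "L'oreal" end up identical.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop apostrophes entirely so "l'oreal" becomes "loreal".
    text = text.replace("'", "").replace("\u2019", "")
    # Anything that isn't alphanumeric acts as a separator.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("L'oréal True Match"))    # ['loreal', 'true', 'match']
print(tokenize("L'Oreal Fixing Mousse")) # ['loreal', 'fixing', 'mousse']
```

A real stemmer would additionally collapse inflections like “chips”/“chip”; this sketch only handles accents, case, and punctuation.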
2. Classify based on the tokens
Now that you have clean, tokenized data, you can choose among the many classification algorithms that will determine which category an input item should be matched to.
There are a few properties of the data set that will limit your choice of algorithm, though. Unlike in classification problems with only a few possible outputs (e.g. spam detection only has “spam” and “not spam”), this problem has a very large number of categories. Not only that, the distribution of links between items and categories forms a long tail. These two problems will prevent you from using non-probabilistic algorithms like artificial neural networks (ANNs) and SVMs – a fact I learned by training a neural network for 12 hours only to get one that couldn’t properly classify “San Miguel”.
That leaves us with probabilistic classifiers, i.e. classifiers that estimate the probability of a category being correct from token occurrence counts. The go-to algorithm here is the Naive Bayes classifier.
Naive Bayes isn’t without problems, though. Using a “pure” implementation will still yield weird classifications like “stationery” for inputs that have nothing to do with stationery. Again, this is caused by the long-tail distribution and can be avoided by customizing the algorithm. The simplest solution would be to automatically ignore categories that have no tokens in common with the current input according to the training set.
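A bare-bones sketch of a Naive Bayes classifier with that “skip unrelated categories” tweak might look like the following (all names are illustrative, and a real solution would layer on the other customizations discussed in this post):

```python
import math
from collections import defaultdict

class NaiveBayes:
    def __init__(self):
        # category -> token -> count, plus per-category entry counts
        self.token_counts = defaultdict(lambda: defaultdict(int))
        self.category_counts = defaultdict(int)
        self.vocab = set()

    def train(self, tokens, category):
        self.category_counts[category] += 1
        for tok in tokens:
            self.token_counts[category][tok] += 1
            self.vocab.add(tok)

    def classify(self, tokens):
        total_docs = sum(self.category_counts.values())
        best, best_score = None, float("-inf")
        for category, doc_count in self.category_counts.items():
            counts = self.token_counts[category]
            # The tweak: skip categories that share no tokens
            # with the input, taming the long tail.
            if not any(tok in counts for tok in tokens):
                continue
            # log P(category) + sum of log P(token|category),
            # with Laplace smoothing for unseen tokens.
            score = math.log(doc_count / total_docs)
            total_tokens = sum(counts.values())
            for tok in tokens:
                score += math.log(
                    (counts[tok] + 1) / (total_tokens + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best

nb = NaiveBayes()
nb.train(["san", "miguel"], "beer")
nb.train(["tiger", "beer"], "beer")
nb.train(["pilot", "pen"], "stationery")
print(nb.classify(["san", "miguel", "pale", "pilsen"]))  # beer
```

Without the `continue` guard, an input sharing no tokens with any beer entry could still be scored against “beer” purely on priors, which is exactly the weird-classification failure mode described above.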
Naive Bayes can also be improved by considering the category list as an extra “training set”. For instance, the category “Japan” is not linked to the token “japan” anywhere in the training set. By adding the category list into the training set (i.e. Japan,Japan), there is now a link and the classifier may be able to classify the items with “Japan” into the “Japan” category.
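In code, this trick is just a matter of appending one self-referential row per category before training (a sketch; the helper name is mine):

```python
def categories_as_training_rows(categories):
    """Turn each entry of the Categories File into an extra
    training row of the same "title,category" shape, so every
    category name is linked to its own tokens."""
    return [(category, category) for category in categories]

extra_rows = categories_as_training_rows(["Japan", "Korean Snacks"])
print(extra_rows)  # [('Japan', 'Japan'), ('Korean Snacks', 'Korean Snacks')]
```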
An alternative to Naive Bayes is to simply count the matching tokens between the input item and each training set entry. For example, if the tokens in the input item match a certain number of tokens in a training set entry, say 3 matches, the category of that entry would be considered a classification of that input. In this case, an input Johnson's Body Care would match Johnson's Body Care 24 Hour Lasting Moisture Body Lotion,Body Lotion and would be classified as Body Lotion.
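That overlap-counting matcher can be sketched in a few lines (the threshold of 3 follows the example above; the names are mine):

```python
def match_by_overlap(item_tokens, training_rows, min_matches=3):
    """Return the categories of every training entry that shares
    at least `min_matches` tokens with the input item.

    `training_rows` is a list of (entry_tokens, category) pairs.
    """
    item = set(item_tokens)
    categories = set()
    for entry_tokens, category in training_rows:
        if len(item & set(entry_tokens)) >= min_matches:
            categories.add(category)
    return categories

rows = [(["johnsons", "body", "care", "24", "hour", "lasting",
          "moisture", "body", "lotion"], "Body Lotion")]
print(match_by_overlap(["johnsons", "body", "care"], rows))  # {'Body Lotion'}
```

Unlike Naive Bayes this returns every category past the threshold rather than a single best guess, which fits the competition's requirement of mapping each title to all applicable categories.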
3. Group similar classifications
Note that some categories are often bundled with other categories in the training set:
Heineken 9S Festive Pack 330Ml Bundled With A Heineken Can Speaker. Alc. 5.0% Vol. Single Can - $2.70 @ Giant Tampines.,Alcohol
Heineken 9S Festive Pack 330Ml Bundled With A Heineken Can Speaker. Alc. 5.0% Vol. Single Can - $2.70 @ Giant Tampines.,beer
...
San Miguel,beer
...
Tiger Beer 30S 323Ml Alc. 5.0% Vol. Single Can - $2.52 @ Giant Tampines.,Alcohol
Tiger Beer 30S 323Ml Alc. 5.0% Vol. Single Can - $2.52 @ Giant Tampines.,beer
As you can see, “Alcohol” is grouped with “beer”. Therefore, there’s a possibility that everything classified as “beer” can also be classified as “Alcohol”. We may increase the number of correct classifications just by grouping together categories that belong together.
The most obvious way of doing this would be to manually create a thesaurus (to know which categories are synonymous or otherwise) or a hierarchical model (“beer” is an “Alcohol”, but not all “Alcohol” can be grouped with “beer”). The problem with this is that it’s too tedious.
A possible alternative heuristic would be to perform Cluster analysis on the training set and try to predict which categories can be lumped together. However, your approach must also take into account that the data has incorrect entries that may throw your algorithm off track.
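One crude co-occurrence heuristic along these lines (a stand-in for proper cluster analysis; the threshold and names are my own choices) is to emit a rule “items in A can also go in B” whenever most of A’s items already carry B:

```python
from collections import defaultdict

def cooccurrence_rules(item_to_categories, threshold=0.6):
    """Infer rules of the form "items in a can also be put in b".

    item_to_categories maps a product title to the set of categories
    it carries in the training set. A rule (a, b) is emitted when at
    least `threshold` of a's items are also labelled b.
    """
    count = defaultdict(int)     # category -> number of items
    together = defaultdict(int)  # (a, b) -> items labelled with both
    for cats in item_to_categories.values():
        for a in cats:
            count[a] += 1
            for b in cats:
                if a != b:
                    together[(a, b)] += 1
    return {(a, b) for (a, b), n in together.items()
            if n / count[a] >= threshold}

data = {
    "Heineken 9S Festive Pack": {"Alcohol", "beer"},
    "Tiger Beer 30S": {"Alcohol", "beer"},
    "San Miguel": {"beer"},
}
print(sorted(cooccurrence_rules(data)))
# [('Alcohol', 'beer'), ('beer', 'Alcohol')]
```

Keeping the threshold below 1.0 is what lets the heuristic tolerate the incorrect entries mentioned above: a few mislabeled rows won't veto an otherwise consistent grouping, though setting it too low will merge unrelated categories.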
As a side note, this only increases the quantity of the output classifications at the price of accuracy. Thus, this technique is more about raising your score in the competition than improving the quality of the category recommendation component of your system.
—
To wrap this post up, here are some recommendations for anyone who would want to implement a tag recommendation system in their social app:
- Clean up the tags to avoid duplicates
- Classifiers like Naive Bayes work well if the tags are correct and weighted. An upvote/downvote system for tags is a crowdsourced approach to this.
- Separate categories from tags. In this challenge, the categories (hierarchical, wide coverage) and tags (crosses between categories, long-tail distribution) are lumped together as “categories”. Having them separate would make analysis much easier.
Thanks again to LobangClub (Ronald, Jeremy, and Nica) and WebGeek for holding this competition! I’ll be looking forward to the next one, even though I suspect the rules will prevent repeat winners in order to give others a chance. :D
Oh, and by the way, the language I used for the challenge wasn’t Ruby. Instead, I chose something a bit more exotic.