Posts

Hey, I’m Conor. I’m a data scientist and writer living and working in New York.


The Role of Combinatorics in Text Classification

Source:  Danny Krashen

As one might imagine, there is plenty of overlap between machine learning and the field of combinatorics. Within machine learning, natural language processing and text classification contain plenty of interesting and challenging problems.

Over the course of this post, I’ll dive into general and specific insights for implementing combinatorial approaches in the context of text classification.

What is Text Classification?

Text classification is the task of assigning predefined categories to free-text documents

On the surface, text classification is just like any other classification problem. Classification problems in machine learning deal with assigning the correct class label to the given input data. In the case of text classification, the input data is a corpus of text and the label can vary based on the scenario.

This practice can not only provide intuitive, conceptual views of document collections, but it also has important applications in the real world.

Examples of Text Classification

There’s tons of examples of text classification being put to use all around you on a day-to-day basis. A couple notable use cases include the following:

  • Topic identification: Is a news article about politics, sports, or technology?

  • Spam detection: Is an email spam or not?

  • Sentiment analysis: Is this movie review positive or negative?

  • Spelling correction: Weather or whether? Colour or color?

Note that the input doesn’t necessarily have to be an article or a paragraph of text. As seen in the last example regarding context-based spelling correction, the input could be as small as some select words or letters.

What is Combinatorics?

Combinatorics is, in short, the study of finite or countable structures.

While fairly broad, I hope this definition helps shine some light on how combinatorics can help improve efficiency in text classification problems. If you haven’t heard of Combinatorics, don’t worry — you’ve undoubtedly used it unknowingly before if you’ve analyzed sample space size or even simply counted things in an effective way.

Role of Combinatorics

Text classification typically involves dealing with large, high-dimensional datasets. This becomes a problem, especially when attempting to scale a model over time.

This doesn’t necessarily have to be the case though. Much of the time, you can achieve similar performance with significantly less features. There is almost always going to be some features that are much more important than others. This is fundamental to the thinking behind the common practice of removing stop words like ‘the’, ‘a’, and ‘he’, for instance. This is also in line with the practice of using Principal Components Analysis to minimize your feature space.

By utilizing a combinatorial approach, it’s possible to obtain meaningful representations of text data with a feature space that is on average 50 times smaller than the original space.

Application

My motivation for this post primarily stems from a paper titled, Combinatorial PCA and SVM Methods for Feature Selection in Learning Classification from by Andrei V. Anghelescu and Ilya B. Muchnik.

By using combinatorial principal component analysis (cPCA) and combinatorial support vector machines (cSVM), the authors show that thanks to combinatorics, it was possible in their trials to achieve performance within 2% with a much smaller feature space

After implementation and a bit of investigation, there are a couple key insights from the paper worth stating. Both methods follow a similar model involving the use of a rectangular matrix representation of the text with a particular score function in order to optimize the result. While each method accomplishes the same goal, they cater to different types of problems. The authors suspect that cPCA would be especially useful for image vision while cSVM would excel with any data that’s representable in a Euclidean space.

Wrapping up

In short, these methods ‘offer a simple means of determining a scale on which data points and features can be ordered, thus making possible a linear search for the optimal set of features or documents’ — this allows for more efficient feature selection.

In doing this, a combinatorial approach can help tackle problems regarding high-dimensionality in text classification, where accuracy tends to degrade quickly with more added dimensions.

This is not only effective in showing how you can improve efficiency in text classification, but also in displaying the wide applications for the field of combinatorics in data science. It’s clear that there’s often great benefit to applying a combinatorial scope to a data-driven problem.

Feel free to try this at home and use what you learned in that one forgotten math class to improve your success in machine learning.

Conor Dewey