How can we apply text categorization and machine learning techniques to the domain of short texts with low semantic value, and what is the benefit of additional training on domain-specific data sources for word embedding methods such as word2vec, GloVe, and doc2vec?
Text categorization techniques are currently applied to a wide array of use cases. Many of these applications involve the processing of natural language, such as email classification.
The aim of this research project is to determine whether these text categorization techniques can be applied to a domain where “telegram-style” text must be assigned to a large number of different classes.
We apply these techniques to the import and export of goods between countries.
When importing and exporting goods, traders provide descriptions of these goods and specify a tariff code that matches their import product. Specifically, we attempt to classify a large number of these descriptions into more than 5000 unique import product classes. We start at the origin of the problem and work through the technical challenges as well as the practical use case. Along the way we discuss the techniques applied, the initial goals that were set, and how the research case developed as it progressed.