The dangers of data mining for text
There is an interesting new article out, which looks at some of the commonly used algorithms in data mining, and finds that they are generally not very accurate, or even reproducible. Specifically, the study by Lancichinetti et al.
There is an interesting new article out, which looks at some of the commonly used algorithms in data mining, and finds that they are generally not very accurate, or even reproducible.
Specifically, the study by Lancichinetti et al. (2015) looks at automated topic classification using the commonly used latent Dirichlet allocation algorithm (LDA), a machine learning process which uses a probabilistic approach to categorise and filter large groups of text. Essentially this is a common approach used in data mining.
But the Lancichinetti et al. (2015) article finds that, even using a well structured source of data, such as Wikipedia, the results are – to put it mildly, disappointing. Around 20% of the time, the results did not come back the same, and when looking at a more complex group of scientific articles, reliability was as low as 55%.
As the authors point out, there has been little attempt to test the accuracy and validity of these data mining approaches, but they caution that users should be cautious about relying on inferences using these methods. They then go-on to describe a method that produces much better levels of reliability, yet until now, most analysis would have had this unknown level of inaccuracy: even if the test had been re-run with the same data, there is a good chance the results would have been different!
This underlines one of the perils with statistical attempts to mine large amounts of text data automatically: it's too easy to do without really knowing what you are doing. There is still no reliable alternative to having a trained researcher and their brain (or even an average person off the street) reading through text and telling you what it is about. The forums I engage with are full of people asking how they can do qualitative analysis automatically, and if there is some software that will do all their transcription for them – but the realistic answer is nothing like this currently exists.
Data mining can be a powerful tool, but it is essentially all based on statistical probabilities, churned out by a computer that doesn't know what it is supposed to be looking at. Data mining is usually a process akin to giving your text to a large number of fairly dumb monkeys on typewriters. Sure, they'll get through the data quickly, but odds are most of it won't be much use! Like monkeys, computers don't have that much intuition, and can't guess what you might be interested in, or what parts are more emotionally important than others.
The closest we have come so far is probably a system like IBM's Watson computer, a natural language processing machine which requires a supercomputer with 2,880 CPU cores, 16 terabytes of ram (16,384GB), and is essentially doing the same thing – a really really large number of dumb monkeys, and a process that picks the best looking stats from a lot of numbers. If loads of really smart researchers programme it for months, it can then win a TV show like Jeopardy. But if you wanted to win Family Feud, you'd have to programme it again.
Now, a statisical overview can be a good place to start, but researchers need to understand what is going on, look at the results intelligently, and work out what parts of the output don't make sense. And to do this well, you still need to be familiar with some of the source material, and have a good grip on the topics, themes and likely outcomes. Since a human can't read and remember thousands of documents, I still think that for most cases, in-depth reading of a few dozen good sources probably gives better outcomes than statistically scan-reading thousands.
Algorithms will improve, as outlined above, and as computers get more powerful and data gets more plentiful, statistical inferences will improve. But until then, most users are better off with a computer as a tool to aid their thought process, not to provide a single statistic answer to a complicated question.