Applied Text Analysis with Python – Benjamin Bengfort & Rebecca Bilbro & Tony Ojeda

We live in a world increasingly filled with digital assistants that allow us to connect with other people as well as vast information resources. Part of the appeal of these smart devices is that they do not simply convey information; to a limited extent, they also understand it, facilitating human interaction at a high level by aggregating, filtering, and summarizing troves of data into an easily digestible form. Applications such as machine translation, question-and-answer systems, voice transcription, text summarization, and chatbots are becoming an integral part of our computing lives. If you have picked up this book, it is likely that you are as excited as we are by the possibilities of incorporating natural language understanding components into a wider array of applications and software. Language understanding components are built on a modern framework of text analysis: a toolkit of techniques and methods that combine string manipulation, lexical resources, computational linguistics, and machine learning algorithms to convert language data to a machine-understandable form and back again. Before we get started discussing these methods and techniques, however, it is important to identify the challenges and opportunities of this framework and address the question of why this is happening now.

The typical American high school graduate has memorized around 60,000 words and thousands of grammatical concepts, enough to communicate in a professional context. While this may seem like a lot, consider how trivial it would be to write a short Python script to rapidly access the definition, etymology, and usage of any term from an online dictionary. In fact, the variety of linguistic concepts an average American uses in daily practice represents merely one-tenth the number captured in the Oxford dictionary, and only 5% of those currently recognized by Google. And yet, instantaneous access to rules and definitions is clearly not sufficient for text analysis. If it were, Siri and Alexa would understand us perfectly, Google would return only a handful of search results, and we could instantly chat with anyone in the world in any language. Why is there such a disparity between computational versions of these tasks and the fluency with which humans perform them from a very early age, long before they’ve accumulated a fraction of the vocabulary they will possess as adults? Clearly, natural language requires more than mere rote memorization; as a result, deterministic computing techniques are not sufficient.
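
To make the point about rote lookup concrete, here is a minimal sketch of the kind of short script described above. It is an illustration under stated assumptions, not a real dictionary client: `TOY_LEXICON`, `lookup`, and `define_words` are hypothetical names, and the small in-memory lexicon stands in for an online dictionary so the example runs without a network connection.

```python
# Hypothetical stand-in for an online dictionary: a tiny in-memory lexicon.
# The entries are illustrative only and are not taken from a real dictionary.
TOY_LEXICON = {
    "bank": [
        "a financial institution that accepts deposits",
        "the sloping land alongside a river",
    ],
    "river": ["a large natural stream of water"],
}


def lookup(term):
    """Return every recorded definition of a term (empty list if unknown)."""
    return TOY_LEXICON.get(term.lower(), [])


def define_words(sentence):
    """Look up each word of a sentence independently, ignoring context."""
    words = sentence.lower().replace(".", "").split()
    return {word: lookup(word) for word in words}


if __name__ == "__main__":
    # Each word is resolved in isolation, so every sense of "bank" comes back.
    for word, senses in define_words("She sat on the river bank.").items():
        print(word, "->", senses)
```

The lookup is instantaneous, but it returns both senses of "bank" and has no way to choose between them: the deterministic rule retrieves definitions without using the surrounding words to decide which one applies.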