Summary: These are notes, combined with my own experience and commentary, from Matthew Honnibal’s PyData Berlin 2018 talk, Building new NLP solutions with spaCy and Prodigy. I intend to use them as a reference when starting new NLP projects.
In NLP and ML we talk a lot about models and optimization. But this isn't where the battle is really won! I've been trying to explain my thoughts on this lately. Big thanks to @PyDataBerlin for a great event.
— Matthew Honnibal (@honnibal) August 2, 2018
📺 Slides: https://t.co/OqpBeY0uwP https://t.co/TzhQrBsefP
Essential Questions from the Machine Learning Hierarchy of Needs
- How will the model fit into the larger application or business process?
  - What decision will be impacted by the machine learning solution?
  - Is there an existing software application we can plug into? Can we develop an API?
  - Do we need to build something entirely custom?
- What is the annotation scheme? What is being labeled? How will the corpus be built? (See the sketch after this list for the shape of each scheme.)
  - Are we classifying documents?
  - Are we tagging spans of text?
  - Are we trying to automatically generate structured data from text?
  - What models do we need to build and compose to generate this data?
- What quality control processes will we use to ensure consistent, clean data?
  - Which experts can we get to label data accurately?
  - How will we construct a gold-standard data set for evaluation?
  - How can we break down the NLP problem to make the annotation task easier and more consistent?
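To make those three task shapes concrete, here is a minimal sketch of what an annotation record might look like for each scheme. The texts, labels, and field names are hypothetical, loosely following the JSONL task format Prodigy uses; the point is only that each scheme asks the annotator (and the model) a differently shaped question.

```python
# Document classification: one label for the whole text.
doc_classification = {
    "text": "Police arrested two suspects in Berlin on Tuesday.",
    "label": "CRIME",
}

# Span tagging: labeled character offsets within the text.
span_tagging = {
    "text": "Police arrested two suspects in Berlin on Tuesday.",
    "spans": [
        {"start": 32, "end": 38, "label": "GPE"},   # "Berlin"
        {"start": 42, "end": 49, "label": "DATE"},  # "Tuesday"
    ],
}

# Structured extraction: a record composed from several model outputs.
structured_record = {
    "event_type": "ARREST",
    "location": "Berlin",
    "date": "Tuesday",
    "suspect_count": 2,
}
```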
Things to Avoid
- Overambitious Imagineering — Thinking too big, rather than specifically about what business problem(s) can be solved.
- Forecasting without experimentation — Most problems need to be evaluated through experimentation with data; don’t commit up front to arbitrary thresholds for accuracy or other performance metrics.
- Outsourcing — You want a dataset and annotation process that is reliable and consistent, not noisy. Outsourcing solves the wrong problem (marginal annotation cost).
- Overengineering & AI FOMO — Can you compose existing, generic models to solve your task rather than tweaking parameters on the latest algorithms?
- Premature Shipping — If it fails in development, it certainly won’t work in production.
General Principles (with timestamps)
- Iterate on your data (labeling, annotation scheme, problem framing), as well as your code. (10:52)
- Rather than make assumptions, get to iterative experimentation as fast as possible. (11:14)
- Start with the output you need, then work backwards, breaking the problem down into a series of modeling decisions. (12:17)
- Break down the problem so that your component models can perform accurately and you can cheaply generate annotations. (14:00)
- Compose generic models into novel solutions. (17:00) See the first sketch after this list.
  - Example: first ask, is this document about a crime? If so, what do the entities represent? By comparison, jumping straight to tagging the victim, perpetrator, and location is both hard for a model to learn and cognitively burdensome for a human to annotate.
- Fine-tuning existing models for your domain, using generic categories, is much cheaper than starting from scratch. (17:20) See the second sketch after this list.
  - Example: fine-tuning the dependency parser on conversational or colloquial text might improve its accuracy. (33:50)
- Annotate events and topics at the sentence, paragraph, or document level to avoid span-boundary issues; annotation gets easier and classification more accurate. (17:50)
- Bootstrap your project estimates based on contact with the evidence, rather than making assumptions or guesses up front. (23:17)
- Reframe the problem so the annotator has to consider less information at once: pre-fill tasks with suggested outputs and have the annotator make corrections. Correcting the model when it’s wrong produces fewer errors than annotating from scratch, where annotators tire and get sloppy. (28:57) See the last sketch after this list.
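A minimal sketch of the compose-generic-models principle, using the crime example above: make a cheap document-level decision first, then reuse spaCy’s generic entity types as candidate role fillers, rather than asking a model (or an annotator) for victim/perpetrator labels directly. The keyword-based `is_about_crime` check is a hypothetical stand-in for a trained text classifier.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # generic pretrained pipeline

# Stand-in for a trained document classifier (e.g. a spaCy textcat model).
CRIME_CUES = {"arrested", "robbery", "assault", "charged", "stolen"}

def is_about_crime(doc):
    return any(tok.lower_ in CRIME_CUES for tok in doc)

def extract_crime_record(text):
    doc = nlp(text)
    # Step 1: cheap document-level decision.
    if not is_about_crime(doc):
        return None
    # Step 2: reuse generic entity types as *candidate* role fillers,
    # instead of asking for victim/perpetrator/location directly.
    return {
        "people": [ent.text for ent in doc.ents if ent.label_ == "PERSON"],
        "locations": [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")],
        "dates": [ent.text for ent in doc.ents if ent.label_ == "DATE"],
    }

print(extract_crime_record("John Smith was arrested in Chicago on Friday."))
```

The candidate lists would then feed a downstream rule or model that assigns roles, which is a much easier learning problem than tagging roles in raw text.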
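The fine-tuning principle, sketched with spaCy v3’s training API (the talk predates v3, so treat this as an updated illustration rather than the speaker’s code): resume training from the pretrained pipeline and update the parser on corrected colloquial parses. The single training sentence and its head/dep annotations are hypothetical; a real project would use hundreds of corrected examples.

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

# Hypothetical corrected parse of a colloquial sentence.
# "heads" holds each token's head index; "deps" the dependency labels.
TRAIN_DATA = [
    ("u want some coffee", {
        "heads": [1, 1, 3, 1],
        "deps": ["nsubj", "ROOT", "det", "dobj"],
    }),
]

# Resume from the pretrained weights instead of re-initializing them.
optimizer = nlp.resume_training()

# Update only the parser; keep tok2vec enabled because the parser
# listens to it, and freeze everything else.
frozen = [p for p in nlp.pipe_names if p not in ("parser", "tok2vec")]
with nlp.select_pipes(disable=frozen):
    for epoch in range(20):
        losses = {}
        for text, annots in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annots)
            nlp.update([example], sgd=optimizer, losses=losses)
```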
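Finally, a sketch combining two of the principles above: split text into sentence-sized units, and pre-fill each task with the model’s current entities so the annotator corrects rather than creates. The output loosely follows Prodigy’s JSONL span format; the file name and texts are arbitrary. (Prodigy has a built-in recipe for this loop, `ner.correct`, formerly `ner.make-gold`.)

```python
import json
import spacy

nlp = spacy.load("en_core_web_sm")

texts = [
    "Two men were charged with robbery in Hamburg. "
    "The court heard the case on Monday.",
]

# One task per sentence (easier units to judge), pre-filled with the
# model's entities so annotators correct instead of labeling from scratch.
with open("ner_correct_tasks.jsonl", "w", encoding="utf8") as f:
    for doc in nlp.pipe(texts):
        for sent in doc.sents:
            task = {
                "text": sent.text,
                "spans": [
                    {
                        "start": ent.start_char - sent.start_char,
                        "end": ent.end_char - sent.start_char,
                        "label": ent.label_,
                    }
                    for ent in sent.ents
                ],
            }
            f.write(json.dumps(task) + "\n")
```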
References:
Honnibal, M. (2018, July). Building new NLP solutions with spaCy and Prodigy. Presented at PyData Berlin 2018, Berlin, Germany.