Human in the Loop Machine Learning PDF download
Main content:
1.2 Introducing annotation
Annotation is the process of labeling raw data so that it becomes training data for
machine learning. Most data scientists will tell you that they spend much more time
curating and annotating datasets than they spend building the machine learning
models. Quality control for human annotation relies on more complicated statistics
than most machine learning models do, so it is important to take the necessary time to
learn how to create quality training data.
1.2.1 Simple and more complicated annotation strategies
An annotation process can be simple. If you want to label social media posts about a
product as positive, negative, or neutral to analyze broad trends in sentiment about
that product, for example, you could build and deploy an HTML form in a few hours.
A simple HTML form could allow someone to rate each social media post according
to the sentiment option, and each rating would become the label on the social media
post for your training data.

[Figure 1.1 A mental model of the human-in-the-loop process for predicting labels on data.
Steps shown: annotation (humans label items); training data (labeled items become the
training data); train model (use training data to create machine learning model); deploy
model (deploy trained model over unlabeled data); deployed model (predict labels); active
learning (sample unlabeled items that are interesting for humans to review); transfer
learning (use an existing model to start or augment training).]
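
The simple case described above, in which each form rating becomes the label on a post, can be sketched roughly as follows. This is a minimal illustration rather than the book's implementation: the function name `ratings_to_training_data`, the `LABELS` tuple, and the sample posts are all hypothetical, and the ratings are assumed to arrive from the form as (post id, label) pairs.

```python
# Hypothetical label set matching the three sentiment options in the text.
LABELS = ("positive", "negative", "neutral")


def ratings_to_training_data(posts, ratings):
    """Turn form ratings into (text, label) training items.

    posts:   dict mapping a post id to the post text (assumed layout)
    ratings: iterable of (post_id, label) pairs submitted by annotators
    """
    training_data = []
    for post_id, label in ratings:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        # The rating becomes the label on the post.
        training_data.append((posts[post_id], label))
    return training_data


# Made-up example posts; post 3 has not been rated yet.
posts = {
    1: "Love this phone",
    2: "Battery died in a day",
    3: "It arrived on Tuesday",
}
ratings = [(1, "positive"), (2, "negative")]

training_data = ratings_to_training_data(posts, ratings)
rated_ids = {post_id for post_id, _ in ratings}
unlabeled = [post_id for post_id in posts if post_id not in rated_ids]
```

Posts without a rating stay in the unlabeled pool, which is why the sketch keeps `unlabeled` separate from `training_data`.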
An annotation process can also be complicated. If you want to label every object in
a video with a bounding box, for example, a simple HTML form is not enough; you
need a graphical interface that allows annotators to draw those boxes, and a good user
experience might take months of engineering hours to build.
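
To see why a form with a few radio buttons is not enough here, it may help to look at what a single bounding-box annotation has to record. The sketch below is an assumption for illustration only: the `BoundingBox` class, its field names, and the pixel-coordinate convention are hypothetical and do not come from any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One annotated object in one video frame (hypothetical record layout)."""
    frame: int      # index of the video frame
    label: str      # object class chosen by the annotator
    x: float        # left edge, in pixels
    y: float        # top edge, in pixels
    width: float    # box width, in pixels
    height: float   # box height, in pixels

    def __post_init__(self):
        # Reject boxes that could not have been drawn in an interface.
        if self.width <= 0 or self.height <= 0:
            raise ValueError("width and height must be positive")

    def area(self):
        return self.width * self.height


box = BoundingBox(frame=0, label="car", x=10, y=20, width=100, height=50)
```

Every object in every frame needs a record like this, and the four coordinates have to come from the annotator drawing on the image, which is what the graphical interface is for.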
1.2.2 Plugging the gap in data science knowledge
Your machine learning algorithm strategy and your data annotation strategy can be
optimized at the same time. The two strategies are closely intertwined, and you often
get better accuracy from your models faster if you have a combined approach. Algorithms
and annotation are equally important components of good machine learning.
All computer science departments offer machine learning courses, but few offer
courses on creating training data. At most, you might find one or two lectures about
creating training data among hundreds of machine learning lectures across half a
dozen courses. This situation is changing, but slowly. For historical reasons, academic
machine learning researchers have tended to keep datasets constant and to evaluate
their research only in terms of different algorithms.
In contrast to academic machine learning, it is more common in industry to
improve model performance by annotating more training data. Especially when the
nature of the data is changing over time (which is also common), using a handful of
new annotations can be far more effective than trying to adapt an existing model to a
new domain of data. But far more academic papers focus on how to adapt algorithms
to new domains without new training data than on how to annotate the right new
training data efficiently.
Because of this imbalance in academia, I’ve often seen people in industry make
the same mistake. They hire a dozen smart PhDs who know how to build state-of-the-
art algorithms but don’t have experience creating training data or thinking about the
right interfaces for annotation. I saw exactly this situation recently at one of the
world’s largest auto manufacturers. The company had hired a large number of recent
machine learning graduates, but it couldn’t operationalize its autonomous vehicle
technology because the new employees couldn’t scale their data annotation strategy.
The company ended up letting that entire team go. During the aftermath, I advised
the company on how to rebuild its strategy by using algorithms and annotation as equally
important, intertwined components of good machine learning.