This report on evaluating machine learning models arose out of a
sense of need. The content was first published as a series of six technical
posts on the Dato Machine Learning Blog. I was the editor of
the blog, and I needed something to publish for the next day. Dato
builds machine learning tools that help users build intelligent data
products. In our conversations with the community, we sometimes
ran into a confusion in terminology. For example, people would ask
for cross-validation as a feature, when what they really meant was
hyperparameter tuning, a feature we already had. So I thought, “Aha!
I’ll just quickly explain what these concepts mean and point folks to
the relevant sections in the user guide.”
So I sat down to write a blog post to explain cross-validation, hold-out
datasets, and hyperparameter tuning. After the first two paragraphs,
however, I realized that it would take a lot more than a single
blog post. The three terms sit at different depths in the concept
hierarchy of machine learning model evaluation. Cross-validation
and hold-out validation are ways of chopping up a dataset in order
to measure the model’s performance on “unseen” data. Hyperparameter
tuning, on the other hand, is a more “meta” process of model
selection. But why does the model need “unseen” data, and what’s
meta about hyperparameters? In order to explain all of that, I
needed to start from the basics. First, I needed to explain the high-level
concepts and how they fit together. Only then could I dive into
each one in detail.
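To make the distinction concrete, here is a minimal sketch in Python, assuming scikit-learn is available (the report’s own examples use GraphLab Create; the toy dataset, logistic regression model, and parameter grid below are purely illustrative):

# Hold-out validation, cross-validation, and hyperparameter tuning,
# shown side by side on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold-out validation: set aside a chunk of "unseen" data for measurement.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: another way of chopping up the data, where each fold
# takes a turn as the "unseen" portion.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("5-fold CV accuracy:", scores.mean())

# Hyperparameter tuning: the "meta" step that selects among models.
# Each candidate value of C is itself scored by cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print("Best C:", search.best_params_, "final hold-out accuracy:", search.score(X_test, y_test))

The point is only the division of labor: the hold-out split and cross_val_score answer “how well does this model do on data it has not seen?”, while GridSearchCV repeats that question for every candidate setting and keeps the winner.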
Machine learning is a child of statistics, computer science, and
mathematical optimization. Along the way, it took inspiration from
information theory, neural science, theoretical physics, and many
other fields. Machine learning papers are often full of impenetrable
mathematics and technical jargon. To make matters worse, sometimes
the same methods were invented multiple times in different
fields, under different names. The result is a new language that is
unfamiliar to even experts in any one of the originating fields.
As a field, machine learning is relatively young. Large-scale applications
of machine learning only started to appear in the last two decades.
This spurred the development of data science as a profession.
Data science today is like the Wild West: there is endless opportunity
and excitement, but also a lot of chaos and confusion. Certain
helpful tips are known to only a few.
More clarity is needed. But a single report cannot possibly
cover all of the worthy topics in machine learning. I am not covering
problem formulation or feature engineering, which many people
consider to be the most difficult and crucial tasks in applied
machine learning. Problem formulation is the process of matching a
dataset and a desired output to a well-understood machine learning
task. This is often trickier than it sounds. Feature engineering is also
extremely important. Having good features can make a big difference
in the quality of the machine learning models, even more so
than the choice of the model itself. Feature engineering takes knowledge,
experience, and ingenuity. We will save that topic for another
time.
This report focuses on model evaluation. It is for folks who are starting
out with data science and applied machine learning. Some seasoned
practitioners may also benefit from the latter half of the report,
which focuses on hyperparameter tuning and A/B testing. I certainly
learned a lot from writing it, especially about how difficult it is to do
A/B testing right. I hope it will help many others build measurably
better machine learning models!
This report includes new text and illustrations not found in the original
blog posts. In Chapter 1, Orientation, there is a clearer explanation
of the landscape of offline versus online evaluations, with new
diagrams to illustrate the concepts. In Chapter 2, Evaluation Metrics,
there’s a revised and clarified discussion of the statistical bootstrap.
I added cautionary notes about the difference between training
objectives and validation metrics, interpreting metrics when the
data is skewed (which always happens in the real world), and nested
hyperparameter tuning. Lastly, I added pointers to various software
packages that implement some of these procedures. (Soft plugs for
GraphLab Create, the library built by Dato, my employer.)
I’m grateful to be given the opportunity to put it all together into a
single report. Blogs do not go through the rigorous process of academic
peer review. But my coworkers and the community of
readers have made many helpful comments along the way. A big
thank you to Antoine Atallah for illuminating discussions on A/B
testing. Chris DuBois, Brian Kent, and Andrew Bruce provided
careful reviews of some of the drafts. Ping Wang and Toby Roseman
found bugs in the examples for classification metrics. Joe McCarthy
provided many thoughtful comments, and Peter Rudenko shared a
number of new papers on hyperparameter tuning. All the awesome
infographics are done by Eric Wolfe and Mark Enomoto; all the
average-looking ones are done by me.
If you notice any errors or glaring omissions, please let me know:
alicez@dato.com. Better an errata than never!
Last but not least, without the cheerful support of Ben Lorica and
Shannon Cutt at O’Reilly, this report would not have materialized.
Thank you!