Python Text Processing Guide [Classic] PDF Download
Main content:
As computer professionals, we deal with text data every day. Developers and
programmers interact with XML and source code. System administrators
have to process and understand logfiles. Managers need to understand and
format financial data and reports. Web designers put in time, hand tuning and
polishing up HTML content. Managing this broad range of formats can seem
like a daunting task, but it's really not that difficult.
This book aims to introduce you, the programmer, to a variety of methods used
to process these data formats. We'll look at approaches ranging from standard
language functions through more complex third-party modules. Somewhere in
there, we'll cover a utility that's just the right tool for your specific job. In the
process, we hope to also cover some Python development best practices.
Where appropriate, we'll look into implementation details enough to help you
understand the techniques used. Most of the time, though, we'll work as hard
as we can to get you up on your feet and crunching those text files.
You'll find that Python makes tasks like this quite painless through its clean and
easy-to-understand syntax, vast community, and the available collection of
additional utilities and modules.
In this chapter, we shall:
Briefly introduce the data formats handled in this book
Implement a simple ROT13 translator
Introduce you to basic processing via filter programs
Learn state machine basics
Learn how to install supporting libraries and components safely and without
administrative access
Look at where to find more information on introductory topics
Categorizing types of text data
Textual data comes in a variety of formats. For our purposes, we'll categorize text into three
very broad groups. Breaking the data down this way helps us understand the problem a bit
better and then choose a parsing approach. Each of these sweeping groups can
be further broken down into more detailed chunks.
One thing to remember when working your way through the book is that text content isn't
limited to the Latin alphabet. This is especially true when dealing with data acquired via the
Internet. We'll cover some of the techniques and tricks for handling internationalized data in
Chapter 8, Understanding Encoding and i18n.
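As a small illustration of why non-Latin text needs care, the sketch below encodes a short string to UTF-8 and back. It assumes a Python 3 interpreter rather than the Python 2.6 used in the book, and the sample string and variable names are made up for this example.

```python
# -*- coding: utf-8 -*-
# Illustrative sample text (an assumption, not taken from the book).
greeting = u"Grüße, 世界"

encoded = greeting.encode("utf-8")   # text -> bytes
decoded = encoded.decode("utf-8")    # bytes -> text

# The character count and the byte count differ once we leave ASCII.
print(len(greeting))
print(len(encoded))
print(decoded == greeting)
```

The point is only that one character is not always one byte, so length, slicing, and file I/O all depend on knowing which encoding is in play.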
Providing information through markup
Structured text includes formats such as XML and HTML. These formats generally consist of
text content surrounded by special symbols or markers that give extra meaning to a file's
contents. These additional tags are usually meant to convey information to the processing
application and to arrange information in a tree-like structure. Markup allows a developer to
define his or her own data structure, yet rely on standardized parsers to extract elements.
For example, consider the following contrived HTML document.
<html>
  <head>
    <title>Hello, World!</title>
  </head>
  <body>
    <p>
      Hi there, all of you earthlings.
    </p>
    <p>
      Take us to your leader.
    </p>
  </body>
</html>
In this example, our document's title is clearly identified because it is surrounded by the
opening <title> and closing </title> tags.
Note that although the document's tags give each element
a meaning, it's still up to the application developer to
understand what to do with a title object or a p element.
Notice that while the document still has meaning to us humans, it is also laid out in such a way as to make
it computer friendly. We'll take a deeper look into these formats in Chapter 6, Structured
Markup. Python provides some rich libraries for dealing with these popular formats.
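As a rough sketch of relying on a standard parser, the following pulls the title out of the contrived HTML document using the standard library's HTMLParser class. The TitleParser class and its get_title method are hypothetical names introduced for this illustration, and this is only one of several possible approaches.

```python
try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name

DOCUMENT = """
<html>
<head>
<title>Hello, World!</title>
</head>
<body>
<p>Hi there, all of you earthlings.</p>
<p>Take us to your leader.</p>
</body>
</html>
"""


class TitleParser(HTMLParser):
    """Hypothetical helper that collects text inside <title>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep character data seen between <title> and </title>.
        if self.in_title:
            self.title_parts.append(data)

    def get_title(self):
        return "".join(self.title_parts).strip()


parser = TitleParser()
parser.feed(DOCUMENT)
parser.close()
print(parser.get_title())  # Hello, World!
```

The parser identifies where the title starts and ends; deciding what to do with that title remains the application's job.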
One interesting aspect of these formats is that it's possible to embed references to validation
rules as well as the actual document structure. This is a nice benefit in that we're able to rely
on the parser to perform markup validation for us. This makes our job much easier as it's
possible to trust that the input structure is valid.
Meaning through structured formats
Text data that falls into this category includes things such as configuration files,
marker-delimited data, e-mail message text, and JavaScript Object Notation (JSON) web data.
Content within this second category does not contain explicit markup the way XML and HTML do,
but the structure and formatting are required, as they convey meaning and information about
the text to the parsing application. For example, consider the format of a Windows INI file
or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly
means something other than the column on the right.
Python provides a collection of modules and libraries intended to help us handle popular
formats from this category. We'll look at Python's built-in text services in detail when we get
to Chapter 4, The Standard Library to the Rescue.
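To make the INI and hosts file examples concrete, here is a minimal sketch that reads both kinds of data. It assumes Python 3, where the module is named configparser (in Python 2 it is ConfigParser). The sample contents, the parse_hosts helper, and the host names are invented for illustration.

```python
import configparser

# Made-up sample data standing in for real files.
INI_TEXT = """\
[server]
host = example.com
port = 8080
"""

HOSTS_TEXT = """\
# address      names
127.0.0.1      localhost
192.168.1.10   build01 build01.example.com
"""

# INI: the standard library already understands sections and key/value pairs.
config = configparser.ConfigParser()
config.read_string(INI_TEXT)
port = config.getint("server", "port")


def parse_hosts(text):
    """Map each address (left column) to the names that follow it."""
    hosts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # skip blank or comment-only lines
        fields = line.split()
        hosts.setdefault(fields[0], []).extend(fields[1:])
    return hosts


hosts = parse_hosts(HOSTS_TEXT)
print(port)
print(hosts["192.168.1.10"])
```

Neither input carries tags; the meaning comes entirely from position and punctuation, which is why each format needs a parser that knows its layout.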
Understanding freeform content
This category contains data that does not fall into the previous two groupings. It includes
e-mail message content, letters, book copy, and other unstructured character-based content.
This is where we'll largely have to build our own processing components, although
external packages are available to us if we wish to perform common functions. Some
examples include full text searching and more advanced natural language processing.
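A hand-built component for freeform text can be very small. The sketch below counts word frequencies with a regular expression, assuming Python 3. The word_counts function and the sample sentence are illustrative only, and treating any run of word characters as a word is a deliberate simplification.

```python
import re
from collections import Counter


def word_counts(text):
    """Count case-insensitive words, where a word is a run of \\w characters."""
    return Counter(re.findall(r"\w+", text.lower()))


sample = "Take us to your leader. Your leader will take it from there."
counts = word_counts(sample)
print(counts["leader"])
print(counts.most_common(3))
```

Real natural language processing needs far more than this, such as stemming and proper tokenization, which is where the external packages mentioned above come in.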
Ensuring you have Python installed
Our first order of business is to ensure that you have Python installed. You'll need it in order
to complete most of the examples in this book. We'll be working with Python 2.6 and we
assume that you're using that same version. If there are any drastic differences in earlier
releases, we'll make a note of them as we go along. All of the examples should still function
properly with Python 2.4 and later versions.
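One quick way to see which interpreter you have is to ask Python itself. The sketch below reports the running version and compares it against the 2.4 minimum mentioned above. The describe_python and meets_minimum helpers are hypothetical names introduced for this example.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short 'Python X.Y' label for the given version tuple."""
    return "Python %d.%d" % (version_info[0], version_info[1])


def meets_minimum(version_info=sys.version_info, minimum=(2, 4)):
    """Check the major and minor version against a minimum (2.4 by default)."""
    return tuple(version_info[:2]) >= minimum


print(describe_python())
print(meets_minimum())
```

Running `python -V` from a shell gives the same version information without writing any code.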