Python Text Processing Guide [Classic] PDF Download - Java知识分享网 (Free Java Resource Downloads)

Download compiled by this site:

Link: https://pan.baidu.com/s/1waiPSrNsJFHVLtuP6VoKAg

Extraction code: yy9s

Main content:

As computer professionals, we deal with text data every day. Developers and programmers interact with XML and source code. System administrators have to process and understand logfiles. Managers need to understand and format financial data and reports. Web designers spend time hand-tuning and polishing HTML content. Managing this broad range of formats can seem like a daunting task, but it's really not that difficult.

This book aims to introduce you, the programmer, to a variety of methods used to process these data formats. We'll look at approaches ranging from standard language functions through more complex third-party modules. Somewhere in there, we'll cover a utility that's just the right tool for your specific job. In the process, we hope to also cover some Python development best practices.

Where appropriate, we'll look into implementation details in enough depth to help you understand the techniques used. Most of the time, though, we'll work as hard as we can to get you up on your feet and crunching those text files.

You'll find that Python makes tasks like this quite painless through its clean and easy-to-understand syntax, vast community, and the available collection of additional utilities and modules.

In this chapter, we shall:

Briefly introduce the data formats handled in this book

Implement a simple ROT13 translator

Introduce you to basic processing via filter programs

Learn state machine basics

Learn how to install supporting libraries and components safely and without administrative access

Look at where to find more information on introductory topics

Categorizing types of text data

Textual data comes in a variety of formats. For our purposes, we'll categorize text into three very broad groups. Breaking the problem down into segments helps us understand it a bit better and subsequently choose a parsing approach. Each of these sweeping groups can be further broken down into more detailed chunks.

One thing to remember when working your way through the book is that text content isn't limited to the Latin alphabet. This is especially true when dealing with data acquired via the Internet. We'll cover some of the techniques and tricks for handling internationalized data in Chapter 8, Understanding Encoding and i18n.
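
To make the point about non-Latin text concrete, here is a minimal sketch written for Python 3 (the book itself targets Python 2.6). The sample string and variable names are assumptions chosen for illustration. It shows that the number of characters in a string and the number of bytes needed to store it can differ.

```python
# Illustrative sample: three Japanese characters (an assumption for this sketch).
text = "日本語"

# Encode the string to bytes using UTF-8.
encoded = text.encode("utf-8")

# For non-Latin text, the character count and the byte count differ.
char_count = len(text)
byte_count = len(encoded)

# Decoding with the same codec restores the original string.
decoded = encoded.decode("utf-8")

print(char_count, byte_count, decoded == text)
```

Code that measures or slices raw bytes will therefore behave differently from code that works on decoded text, which is one reason to decode input as early as possible.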

Providing information through markup

Structured text includes formats such as XML and HTML. These formats generally consist of text content surrounded by special symbols or markers that give extra meaning to a file's contents. These additional tags are usually meant to convey information to the processing application and to arrange information in a tree-like structure. Markup allows a developer to define his or her own data structure, yet rely on standardized parsers to extract elements.

For example, consider the following contrived HTML document.

<html>
  <head>
    <title>Hello, World!</title>
  </head>
  <body>
    <p>
      Hi there, all of you earthlings.
    </p>
    <p>
      Take us to your leader.
    </p>
  </body>
</html>

In this example, our document's title is clearly identified because it is surrounded by the opening <title> and closing </title> tags.

Note that although the document's tags give each element a meaning, it's still up to the application developer to understand what to do with a title object or a p element.

Notice that while the document still has meaning to us humans, it is also laid out in a way that makes it computer friendly. We'll take a deeper look into these formats in Chapter 6, Structured Markup. Python provides some rich libraries for dealing with these popular formats.
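
As a rough illustration of relying on a standard parser, the following sketch uses Python 3's `html.parser` module to pull the title and paragraph text out of the contrived document. The class name `TitleAndParagraphs` and its attributes are hypothetical names invented for this example, and this is not necessarily the approach the book takes in Chapter 6.

```python
from html.parser import HTMLParser

# The contrived document from the text, with each <p> on one line.
DOCUMENT = """<html>
  <head>
    <title>Hello, World!</title>
  </head>
  <body>
    <p>Hi there, all of you earthlings.</p>
    <p>Take us to your leader.</p>
  </body>
</html>"""


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self._stack = []       # names of the currently open tags
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag == "p":
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        current = self._stack[-1]
        if current == "title":
            self.title += data
        elif current == "p":
            self.paragraphs[-1] += data


parser = TitleAndParagraphs()
parser.feed(DOCUMENT)
parser.close()
paragraphs = [p.strip() for p in parser.paragraphs]

print(parser.title)
print(paragraphs)
```

The parser only reports that a title or p element was seen. Deciding what to do with each one remains the application's job, as the note above points out.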

One interesting aspect of these formats is that it's possible to embed references to validation rules as well as the actual document structure. This is a nice benefit in that we're able to rely on the parser to perform markup validation for us. This makes our job much easier, as we can trust that the input structure is valid.
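
The sketch below shows a weaker but related check using only the Python 3 standard library: `xml.etree.ElementTree` rejects input that is not well-formed. This is not validation against embedded rules such as a DTD or schema, which ElementTree does not perform and which typically needs a validating parser from a third-party package. The helper name `is_well_formed` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: the second one never closes its <p> element.
good = "<greeting><p>Take us to your leader.</p></greeting>"
bad = "<greeting><p>Take us to your leader.</greeting>"

print(is_well_formed(good), is_well_formed(bad))
```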

Meaning through structured formats

Text data that falls into this category includes things such as configuration files, marker-delimited data, e-mail message text, and JavaScript Object Notation (JSON) web data. Content within this second category does not contain explicit markup the way XML and HTML do, but the structure and formatting are required because they convey meaning and information about the text to the parsing application. For example, consider the format of a Windows INI file or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly means something other than the column on the right.
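
To illustrate how column position alone carries meaning, here is a minimal Python 3 sketch that reads hosts-style text. The sample entries and the `parse_hosts` helper are invented for this example, and it assumes the common layout of an address followed by one or more names, with `#` starting a comment.

```python
# Sample entries invented for illustration; not read from a real system file.
HOSTS_SAMPLE = """\
# address      names
127.0.0.1      localhost
192.168.1.10   fileserver fileserver.example.com  # trailing comment
"""


def parse_hosts(text):
    """Map each address (left column) to the host names that follow it."""
    entries = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # The first field is the address; the remaining fields are names.
        address, *names = line.split()
        entries.setdefault(address, []).extend(names)
    return entries


hosts = parse_hosts(HOSTS_SAMPLE)
print(hosts)
```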

Python provides a collection of modules and libraries intended to help us handle popular formats from this category. We'll look at Python's built-in text services in detail when we get to Chapter 4, The Standard Library to the Rescue.
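
As a small taste of those modules, the following Python 3 sketch reads an INI-style fragment with `configparser` (named `ConfigParser` in Python 2) and a JSON string with `json`. The section name, keys, and values are made up for illustration.

```python
import configparser
import json

# Hypothetical INI content; the section and keys are assumptions.
INI_SAMPLE = """\
[server]
host = example.com
port = 8080
"""

config = configparser.ConfigParser()
config.read_string(INI_SAMPLE)

host = config["server"]["host"]         # values are strings by default
port = config.getint("server", "port")  # convert explicitly when needed

# Hypothetical JSON document.
JSON_SAMPLE = '{"name": "earthling", "tags": ["text", "data"]}'
record = json.loads(JSON_SAMPLE)

print(host, port, record["tags"])
```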

Understanding freeform content

This category contains data that does not fall into the previous two groupings. It covers e-mail message content, letters, book copy, and other unstructured character-based content. This is where we'll largely have to build our own processing components, although external packages are available to us if we wish to perform common functions. Some examples include full text searching and more advanced natural language processing.
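
A home-grown processing component for freeform text can be quite small. The sketch below, written for Python 3, counts word frequencies with a regular expression and `collections.Counter`. The sample sentence, the `word_frequencies` helper, and the simple definition of a word (runs of ASCII letters and apostrophes) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Sample text invented for this sketch.
SAMPLE = "Take us to your leader. Your leader will take us home."


def word_frequencies(text):
    """Count how often each lowercase word appears in text."""
    # Naive tokenizer: runs of ASCII letters and apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


freq = word_frequencies(SAMPLE)
print(freq["leader"], freq["home"])
```

A tokenizer this naive ignores non-Latin scripts and punctuation subtleties, which is where the external packages mentioned above become useful.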

Ensuring you have Python installed

Our first order of business is to ensure that you have Python installed. You'll need it in order to complete most of the examples in this book. We'll be working with Python 2.6 and we assume that you're using that same version. If there are any drastic differences in earlier releases, we'll make a note of them as we go along. All of the examples should still function properly with Python 2.4 and later versions.
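
One quick way to see which interpreter you have is to inspect `sys.version_info`. The sketch below compares it against the 2.4 minimum stated above. The `MINIMUM` constant and the `version_ok` helper are hypothetical names for this example. A Python 3 interpreter passes this numeric check, but the book's examples were written for Python 2 and may need adapting.

```python
import sys

# Minimum version stated in the text; the constant name is an assumption.
MINIMUM = (2, 4)


def version_ok(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
print("meets minimum: %s" % version_ok())
```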
