Java Knowledge Sharing Network (Java知识分享网) - Easy learning starts here!

Python Text Processing Guide [Classic] PDF Download


Date: 2020-11-03 09:20  Source: http://www.java1234.com  Author: Reposted


 
Download compiled by this site:
Extraction code: yy9s

Main content:

As computer professionals, we deal with text data every day. Developers and 
programmers interact with XML and source code. System administrators 
have to process and understand logfiles. Managers need to understand and 
format financial data and reports. Web designers spend time hand-tuning and 
polishing HTML content. Managing this broad range of formats can seem 
like a daunting task, but it's really not that difficult.
This book aims to introduce you, the programmer, to a variety of methods used 
to process these data formats. We'll look at approaches ranging from standard 
language functions to more complex third-party modules. Somewhere along 
the way, we'll cover a utility that's just the right tool for your specific job. We 
also hope to cover some Python development best practices in the process.
Where appropriate, we'll look into implementation details in enough depth to 
help you understand the techniques used. Most of the time, though, we'll work 
as hard as we can to get you up on your feet and crunching those text files.
You'll find that Python makes tasks like this quite painless through its clean and 
easy-to-understand syntax, vast community, and the available collection of 
additional utilities and modules.
In this chapter, we shall:
• Briefly introduce the data formats handled in this book
• Implement a simple ROT13 translator
• Introduce you to basic processing via filter programs
• Learn state machine basics
• Learn how to install supporting libraries and components safely and without 
administrative access
• Look at where to find more information on introductory topics
Categorizing types of text data
Textual data comes in a variety of formats. For our purposes, we'll categorize text into three 
very broad groups. Sorting text into these groups helps us understand the problem a bit 
better and then choose a parsing approach. Each of these broad groups can be broken 
down further into more specific types.
One thing to remember when working your way through the book is that text content isn't 
limited to the Latin alphabet. This is especially true when dealing with data acquired via the 
Internet. We'll cover some of the techniques and tricks for handling internationalized data in 
Chapter 8, Understanding Encoding and i18n.
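As a small illustration of why non-Latin text needs care, the sketch below encodes a short Japanese greeting and decodes it again. The sample string and the choice of UTF-8 and Latin-1 are assumptions made for this example, and the code is written for Python 3 rather than the Python 2.6 used in the book.

```python
# Illustrative sample only: a five-character Japanese greeting.
text = "こんにちは"

# Encoding turns characters into bytes. In UTF-8 each of these
# characters occupies three bytes.
raw = text.encode("utf-8")

# Decoding with the matching codec restores the original text.
round_trip = raw.decode("utf-8")

# Decoding with the wrong codec does not fail for Latin-1, but it
# yields one meaningless character per byte.
garbled = raw.decode("latin-1")

print(len(text), len(raw))
print(round_trip == text)
print(garbled == text)
```

The point is only that character count and byte count differ, and that the codec used for reading must match the one used for writing.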
Providing information through markup
Structured text includes formats such as XML and HTML. These formats generally consist of 
text content surrounded by special symbols or markers that give extra meaning to a file's 
contents. These additional tags are usually meant to convey information to the processing 
application and to arrange information in a tree-like structure. Markup allows a developer to 
define his or her own data structure, yet rely on standardized parsers to extract elements.
For example, consider the following contrived HTML document:
<html>
 <head>
  <title>Hello, World!</title>
 </head>
 <body>
  <p>
   Hi there, all of you earthlings.
  </p>
  <p>
   Take us to your leader.
  </p>
 </body>
</html>
In this example, our document's title is clearly identified because it is surrounded by opening 
and closing <title> and </title> tags.
Note that although the document's tags give each element 
a meaning, it's still up to the application developer to 
decide what to do with a title element or a p element.
Notice that while the document still has meaning to us humans, it is also laid out in a way 
that makes it computer friendly. We'll take a deeper look into these formats in Chapter 6, 
Structured Markup. Python provides some rich libraries for dealing with these popular formats.
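To make the idea of relying on a standardized parser concrete, here is a minimal sketch that pulls the title and paragraph text out of the sample document above. It assumes Python 3, where the parser lives in html.parser (under Python 2.6 the module is named HTMLParser). The class name TitleAndParagraphs is hypothetical, and this is one possible approach, not necessarily the one the book takes.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # name of the tag we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")  # start a new paragraph

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


DOCUMENT = """
<html>
 <head>
  <title>Hello, World!</title>
 </head>
 <body>
  <p>
   Hi there, all of you earthlings.
  </p>
  <p>
   Take us to your leader.
  </p>
 </body>
</html>
"""

parser = TitleAndParagraphs()
parser.feed(DOCUMENT)
parser.close()

title = parser.title.strip()
paragraphs = [p.strip() for p in parser.paragraphs]
print(title)
print(paragraphs)
```

The parser only reports where elements start and end. Deciding that a title is worth keeping is still the application's job, as the note above points out.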
One interesting aspect to these formats is that it's possible to embed references to validation 
rules as well as the actual document structure. This is a nice benefit in that we're able to rely 
on the parser to perform markup validation for us. This makes our job much easier as it's 
possible to trust that the input structure is valid.
Meaning through structured formats
Text data that falls into this category includes things such as configuration files, marker 
delimited data, e-mail message text, and JavaScript Object Notation web data. Content 
within this second category does not contain explicit markup the way XML and HTML do, 
but its structure and formatting are required because they convey meaning and information 
about the text to the parsing application. For example, consider the format of a Windows INI 
file or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly 
means something other than the column on the right.
Python provides a collection of modules and libraries intended to help us handle popular 
formats from this category. We'll look at Python's built-in text services in detail when we get 
to Chapter 4, The Standard Library to the Rescue.
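The two formats mentioned above can be sketched as follows. The INI and hosts contents are invented sample data, parse_hosts is a hypothetical helper written for this example, and the code assumes Python 3, where the INI module is named configparser (it is ConfigParser in Python 2.6).

```python
import configparser  # Python 3 name; the Python 2 module is ConfigParser

# Invented sample data, for illustration only.
INI_TEXT = """
[server]
host = example.org
port = 8080
"""

HOSTS_TEXT = """
# address      names
127.0.0.1      localhost
192.168.0.10   fileserver fs
"""

# INI files: the standard library already understands sections and keys.
config = configparser.ConfigParser()
config.read_string(INI_TEXT)
port = config.getint("server", "port")


def parse_hosts(text):
    """Map each host name to the address in the left-hand column."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()  # first column, then every alias
        for name in names:
            table[name] = address
    return table


hosts = parse_hosts(HOSTS_TEXT)
print(port)
print(hosts["fs"])
```

In both cases nothing in the file is tagged. The meaning comes entirely from position: a key to the left of the equals sign, an address in the first column.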
Understanding freeform content
This category contains data that does not fall into the previous two groupings. It covers 
e-mail message content, letters, book copy, and other unstructured character-based content. 
This is where we'll largely have to build our own processing components, although 
external packages are available if we wish to perform common functions. Some 
examples include full text searching and more advanced natural language processing.
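As a rough idea of what a hand-built processing component for freeform text might look like, the sketch below counts word frequencies. The sample sentence and the word_counts helper are assumptions for this example, the pattern only recognizes unaccented ASCII letters and apostrophes, and the code targets Python 3.

```python
import re
from collections import Counter

# Invented sample text, for illustration only.
TEXT = "Take us to your leader. Your leader will know what to do."


def word_counts(text):
    """Count lowercase words made of ASCII letters and apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = word_counts(TEXT)
print(counts["leader"])
print(counts["take"])
```

Real freeform text needs more than this (punctuation rules, other alphabets, stemming), which is where the external packages mentioned above come in.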
Ensuring you have Python installed
Our first order of business is to ensure that you have Python installed. You'll need it in order 
to complete most of the examples in this book. We'll be working with Python 2.6 and we 
assume that you're using that same version. If there are any drastic differences in earlier 
releases, we'll make a note of them as we go along. All of the examples should still function 
properly with Python 2.4 and later versions.
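One way to confirm which interpreter you have is to ask Python itself. The sketch below uses sys.version_info, which is part of the standard library. The helper names describe_python and is_at_least are hypothetical, and the code is written so that it should run on both Python 2 and Python 3.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a short label such as 'Python 2.6'."""
    return "Python %d.%d" % (version_info[0], version_info[1])


def is_at_least(required, version_info=sys.version_info):
    """Report whether the (major, minor) version meets the requirement."""
    return tuple(version_info[:2]) >= tuple(required)


print(describe_python())
print(is_at_least((2, 4)))
```

Running python -V at a command prompt reports the same information without writing any code.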

 
 