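1. sub: deleting whatever matches

    import re

    # Strip the tags
    s = "<div>\
    <p>Job responsibilities:</p>\
    <p>Deliver server-side work such as recommendation algorithms, data statistics, APIs, and back-end services</p>\
    <p><br></p>\
    <p>Required qualities:</p>\
    <p>Strong self-motivation and professionalism; proactive and results-oriented</p>\
    <p> <br></p>\
    <p>Technical requirements:</p>\
    <p>1. At least one year of Python development experience; solid object-oriented analysis and design; familiar with design patterns</p>\
    <p>2. Solid grasp of HTTP; familiar with MVC, MVVM, and related web frameworks</p>\
    <p>3. Relational database design and SQL; proficient in either MySQL or PostgreSQL<br></p>\
    <p>4. NoSQL and MQ; proficient with the corresponding solutions</p>\
    <p>5. Familiar with Javascript/CSS/HTML5, JQuery, React, Vue.js</p>\
    <p> <br></p>\
    <p>Bonus points:</p>\
    <p>Big data, mathematical statistics, machine learning, sklearn, high performance, high concurrency.</p>\
    </div> "
    p = r"</?\w+>"
    print(re.sub(p, "", s))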
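In </?, the /? matches zero or one /, so the pattern matches both < and </.
\w+ matches one or more word characters (letters, digits, or underscore).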
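To see the two parts of the pattern at work, here is a minimal sketch on a short, made-up snippet (the `sample` string is an assumption for illustration, not part of the job posting). It also shows a limit of this pattern: because \w+ must be followed directly by >, a tag that carries attributes is left untouched.

```python
import re

tag = re.compile(r"</?\w+>")

# Made-up sample for illustration.
sample = '<p>hello</p><br><a href="x">link</a>'

# Opening and closing tags without attributes are both found.
found = tag.findall(sample)
print(found)

# The <a href="x"> tag does not match, so it survives the substitution.
stripped = tag.sub("", sample)
print(stripped)
```

If the input may contain attributes, the pattern would need to be widened, or an HTML parser used instead.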
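    import re

    test_str = "xx* xx( xx.h xxt xx.z"
    # (?!\.h|[a-zA-Z]) is a negative lookahead: what follows xx
    # must not be ".h" and must not be a letter.
    re_str = re.compile(r"(xx)((?!\.h|[a-zA-Z]).)")
    re_result = re.sub(re_str, r'aa\2', test_str)
    print(re_result)  # aa* aa( xx.h xxt aa.z

This replaces every xx with aa, provided the xx is not followed by .h or by a letter.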
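The pattern above consumes the character after xx and puts it back with \2. As a variation (a sketch of my own, not the original author's code), the lookahead can be used alone so that nothing after xx is consumed. One difference to be aware of: the lookahead-only version also replaces an xx at the very end of the string, which the original pattern cannot match because it needs one more character.

```python
import re

# Same sample as above, with a trailing "xx" added to show the difference.
test_str = "xx* xx( xx.h xxt xx.z xx"

# Original approach: capture the following character and re-insert it.
with_group = re.sub(r"(xx)((?!\.h|[a-zA-Z]).)", r"aa\2", test_str)

# Lookahead-only variant: nothing after xx is consumed.
lookahead_only = re.sub(r"xx(?!\.h|[a-zA-Z])", "aa", test_str)

print(with_group)
print(lookahead_only)
```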
2. split
Splitting a string with a regular expression.

    # -*- coding: utf-8 -*-
    import re

    # Extract the words
    s3 = "hello world ha ha"
    print(re.split(r" +", s3))

    line = "abc aa;bb,cc | dd(xx).xxx 12.12' xxxx"
    # [] builds a custom character class; \s matches any whitespace character
    re.split(r'[;,\s]', line)
    # Result:
    # ['abc', 'aa', 'bb', 'cc', '|', 'dd(xx).xxx', "12.12'", 'xxxx']
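A detail worth noting: [;,\s] splits once per separator character, so two separators in a row leave an empty string in the result. Adding + treats a run of separators as a single delimiter. A small sketch with a made-up input:

```python
import re

# Made-up input with adjacent separators.
line = "aa; bb,  cc"

# One split per separator character: adjacent separators leave empty strings.
single = re.split(r"[;,\s]", line)

# A run of separators counts as one delimiter.
runs = re.split(r"[;,\s]+", line)

print(single)
print(runs)
```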
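3. findall

    # -*- coding: utf-8 -*-
    import re

    # Extract the words
    s3 = "hello world ha ha"
    # \b is a word boundary, so \b\w+\b matches one whole word
    print(re.findall(r"\b\w+\b", s3))

If the pattern contains capturing groups (), findall returns what the groups matched instead of the whole match. With more than one group, each result is a tuple.

    # \s* matches any whitespace, which covers the line breaks between tags
    ip_address = re.compile(r'<tr.*?>\s*<td>(.*?)</td>\s*<td>(.*?)</td>')
    re_ip_address = ip_address.findall(html)  # html holds the page source, obtained elsewhere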
http://www.waitingfy.com/archives/3687
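Since the snippet above does not show what `html` contains, here is a self-contained sketch of the same pattern run against a made-up table fragment. The rows and their values are assumptions for illustration only.

```python
import re

# Made-up table rows; the real page source is not shown in the article.
html = """
<tr class="odd">
  <td>10.0.0.1</td>
  <td>8080</td>
</tr>
<tr>
  <td>10.0.0.2</td>
  <td>3128</td>
</tr>
"""

row = re.compile(r"<tr.*?>\s*<td>(.*?)</td>\s*<td>(.*?)</td>")

# Two groups in the pattern, so each result is a (group 1, group 2) tuple.
pairs = row.findall(html)
print(pairs)
```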
4. search group
    import re

    phoneRegex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
    strPhone = 'My number is 415-555-4242'
    result = phoneRegex.search(strPhone)
    print(result.group())
    print(result.group(0))
    print(result.group(1))
    print(result.group(2))
    print(result.groups())
    otherResult = phoneRegex.findall(strPhone)
    print(otherResult)

Output:

    415-555-4242
    415-555-4242
    415
    555-4242
    ('415', '555-4242')
    [('415', '555-4242')]

The pattern uses () to define groups. group() is the same as group(0) and returns the whole match, group(1) is the first group, and group(2) is the second. groups() returns all the groups as a tuple.
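One thing the example does not show: search returns None when nothing matches, so calling group() on the result directly would raise an AttributeError. A minimal sketch of a guard, where `area_code` is a hypothetical helper name introduced only for this illustration:

```python
import re

phone_regex = re.compile(r"(\d{3})-(\d{3}-\d{4})")

def area_code(text):
    """Return group 1 of the first match, or None if nothing matches."""
    result = phone_regex.search(text)
    if result is None:
        return None
    return result.group(1)

print(area_code("My number is 415-555-4242"))
print(area_code("no number here"))
```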
5. match
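    import re

    # match returns a match object on success and None on failure,
    # so the success branch must test "is not None".
    if re.match("Director|Secretary", "Director") is not None:
        print("find Director or Secretary")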
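match only tries the pattern at the beginning of the string, whereas search scans the whole string. The sketch below contrasts the two on made-up inputs:

```python
import re

pattern = "Director|Secretary"

# The string starts with "Director", so match succeeds.
at_start = re.match(pattern, "Director of Sales")

# "Director" is present but not at position 0, so match returns None.
not_at_start = re.match(pattern, "The Director")

# search looks anywhere in the string.
anywhere = re.search(pattern, "The Director")

print(at_start is not None, not_at_start is None, anywhere.group())
```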
\1
    re.sub(r'(\b[a-z]+) \1', r'\1', 'cat in the the hat')

Output:

    cat in the hat
\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
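The cases in the quoted passage can be checked directly. This is a small sketch; the variable names are mine, and the octal case uses \101, which is the character "A" (octal 101 is decimal 65).

```python
import re

doubled = re.compile(r"(.+) \1")

# A group, a space, then the same text again.
the_the = doubled.search("the the")
digits = doubled.search("55 55")

# No space between the two copies, so there is no match.
no_space = doubled.search("thethe")

# Three octal digits are read as a character code, not as a group reference.
octal = re.fullmatch(r"\101", "A")

print(the_the.group(1), digits.group(1), no_space, octal is not None)
```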