Learning Regular Expressions: Extracting and Replacing XML Tags

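Contents

I. Requirements
II. Implementation
  1. Insert the test data
  2. Extract and replace the tags with an SQL query
III. Analysis
  1. Extract all XML tags from the text
    (1) Write a regular expression that matches tags
    (2) Extract every tag with a recursive query
    (3) Merge, deduplicate, and sort the tags
  2. Strip the tag attributes
  3. Wrap each tag in constant strings
  4. Add the header and footer strings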

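I. Requirements

The sample XML document is lorem.dita. The goal is to use regular expressions to extract every XML tag from that document and turn the tag list into a simple XSLT stylesheet. The lorem.dita file is available on GitHub at https://github.com/michaeljamesfitzgerald/Introducing-Regular-Expressions. To save space, only an excerpt of the file is used as test data.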

II. Implementation

1. Insert the test data

-- Recreate the test table; column a holds the whole XML excerpt as one string.
drop table if exists t1;
create table t1 (a text);
insert into t1 values
('<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="lorem">
<title>Lorem Ipsum</title>
<body>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras non commodo mi.</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit:</p>
<ul>
<li>Lorem ipsum dolor sit amet</li>
</ul>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</body>
</topic>'
);

2. Extract and replace the tags with an SQL query

with
t1 as -- extract, deduplicate, and sort all tags
(
with recursive num as
(select n, regexp_substr(a,'<[_a-zA-Z][^>]*>',1,t.n) b from t1, (select 1 n) t
union all
select n+1, regexp_substr(a,'<[_a-zA-Z][^>]*>',1,n+1) from t1, num
where b is not null)
select replace(convert(group_concat(distinct b order by b) using utf8mb4),',',char(10)) a from num),
t2 as -- strip the tag attributes
(select regexp_replace(a,' id=".*"','') a from t1),
t3 as -- wrap each tag in constant strings
(select regexp_replace(a,'^<(.*)>$','<xsl:template match="$1">
<xsl:apply-templates/>
</xsl:template>
',1,0,'m') a from t2),
t4 as -- add the header and footer strings
(select regexp_replace(a,'^(.*)$','<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n\n$1\n</xsl:stylesheet>',1,0,'n') a from t3)
select * from t4;

The query returns:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="body">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="li">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="title">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="topic">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="ul">
<xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>

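III. Analysis

This implementation uses named subqueries (common table expressions) and a recursive query, and calls the regexp_substr and regexp_replace functions to extract and replace the tags.

1. Extract all XML tags from the text

(1) Write a regular expression that matches tags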

<[_a-zA-Z][^>]*>

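The first character is a left angle bracket (<).
In XML, an element name can start with an underscore (_) or with an uppercase or lowercase letter in the ASCII range.
After that initial character, the rest of the tag can be zero or more characters of any kind except the right angle bracket (>).
The expression ends with a right angle bracket.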
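As a quick check of the pattern, the following sketch runs it against a few literal strings instead of the t1 table. It assumes MySQL 8.0 or later, where regexp_substr is available; the sample strings and column aliases are made up for illustration.

```sql
-- Assumption: MySQL 8.0+; the input literals are illustrative only.
select
  -- start tag: matched in full, attributes included
  regexp_substr('<topic id="lorem">', '<[_a-zA-Z][^>]*>') as start_tag,
  -- end tag: NULL, because "/" is not in the class [_a-zA-Z]
  regexp_substr('</title>', '<[_a-zA-Z][^>]*>') as end_tag,
  -- XML declaration: NULL, because "?" is not in the class [_a-zA-Z]
  regexp_substr('<?xml version="1.0"?>', '<[_a-zA-Z][^>]*>') as xml_decl;
```

Because the character after < must be an underscore or a letter, end tags, the XML declaration, and the DOCTYPE declaration are all skipped, which is why only start tags reach the later steps.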

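(2) Extract every tag with a recursive query

with recursive num as
(select n, regexp_substr(a,'<[_a-zA-Z][^>]*>',1,t.n) b from t1, (select 1 n) t
union all
select n+1, regexp_substr(a,'<[_a-zA-Z][^>]*>',1,n+1) from t1, num
where b is not null)

MySQL's regexp_substr function returns the substring that matches a regular expression, but only one match per call; its fourth argument, occurrence, selects which match to return. To collect every tag, the recursive query passes its counter n as the occurrence argument, and the recursion stops once regexp_substr returns null. This part of the query returns one row per tag, followed by a final row whose b is null, which group_concat ignores in the next step.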
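To see the recursion in isolation, the sketch below applies the same technique to a short literal string. It assumes MySQL 8.0 or later; the doc expression and its sample text are hypothetical stand-ins for the t1 table.

```sql
-- Assumption: MySQL 8.0+; "doc" is a made-up one-row stand-in for table t1.
with recursive
doc as (select '<ul><li>x</li></ul>' as a),
num as (
  -- anchor row: the first match
  select 1 as n, regexp_substr(a, '<[_a-zA-Z][^>]*>', 1, 1) as b from doc
  union all
  -- recursive step: request the next occurrence while the previous one was found
  select n + 1, regexp_substr(a, '<[_a-zA-Z][^>]*>', 1, n + 1) from doc, num
  where b is not null
)
select n, b from num;
```

This should return three rows: <ul>, <li>, and a last row whose b is null. That last row is produced by the final recursive step and is what ends the recursion.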

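(3) Merge, deduplicate, and sort the tags

select replace(convert(group_concat(distinct b order by b) using utf8mb4),',',char(10)) a from num

group_concat(distinct b order by b) deduplicates and sorts the rows returned by the recursive query, then merges them into a single comma-separated string.
The convert function converts the string returned by group_concat to the utf8mb4 character set.
The replace function changes the separator in the merged string from a comma to a newline.

The result of the t1 subquery is therefore the full set of tags, deduplicated and sorted, one per line.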
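The merge step can be tried on its own with a few hand-written rows. This is a minimal sketch assuming MySQL 8.0 or later; the derived table tags is a made-up stand-in for num.

```sql
-- Assumption: MySQL 8.0+; "tags" is an illustrative stand-in for the num rows.
select replace(group_concat(distinct b order by b), ',', char(10)) as a
from (
  select '<ul>' as b
  union all select '<li>'
  union all select '<ul>'   -- duplicate, removed by distinct
  union all select null     -- skipped by group_concat
) as tags;
```

The result should be <li> and <ul> on separate lines. Two caveats apply to this approach: group_concat truncates its result at group_concat_max_len (1024 by default), and replacing every comma would also alter a tag whose attribute value contains a comma. Writing group_concat(distinct b order by b separator '\n') is a possible alternative that avoids the second issue.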

2. Strip the tag attributes

select regexp_replace(a,' id=".*"','') a from t1

The t2 subquery returns the tag names with their attributes removed. In this example the only attribute is id.
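The sketch below shows how the attribute pattern behaves on single tags, with a leading space in the pattern so that the blank before the attribute is removed too. It assumes MySQL 8.0 or later; the input tags and the [^"]* variant are illustrative additions, not part of the original query.

```sql
-- Assumption: MySQL 8.0+; the input tags are illustrative only.
select
  -- the case from this article: returns <topic>
  regexp_replace('<topic id="lorem">', ' id=".*"', '') as simple_case,
  -- .* is greedy and runs to the last quote on the line: returns <a>
  regexp_replace('<a id="x" href="y">', ' id=".*"', '') as greedy,
  -- [^"]* stops at the closing quote of id: returns <a href="y">
  regexp_replace('<a id="x" href="y">', ' id="[^"]*"', '') as id_only;
```

The greedy form is enough here because every tag sits on its own line and id is the only attribute. If other attributes had to survive, the [^"]* form would be the safer choice.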

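3. Wrap each tag in constant strings

select regexp_replace(a,'^<(.*)>$','<xsl:template match="$1">
<xsl:apply-templates/>
</xsl:template>
',1,0,'m') a from t2

The t3 subquery wraps each tag name in the prefix and suffix of an XSLT template. With multiline mode (match type 'm') enabled, ^ and $ match at the start and end of every line, so the regular expression ^<(.*)>$ matches each tag on its own line and places the tag name in a capture group, which $1 references in the replacement.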
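The effect of the 'm' match type can be seen on a two-line string. This is a small sketch assuming MySQL 8.0 or later; the derived table d and the bracket replacement are made up for illustration.

```sql
-- Assumption: MySQL 8.0+; "d" holds an illustrative two-line input.
select
  -- multiline mode: each line is matched separately, giving [li] and [ul]
  regexp_replace(a, '^<(.*)>$', '[$1]', 1, 0, 'm') as with_m,
  -- default mode: ^ and $ anchor to the whole string and . stops at the
  -- line break, so nothing matches and the input comes back unchanged
  regexp_replace(a, '^<(.*)>$', '[$1]') as without_m
from (select concat('<li>', char(10), '<ul>') as a) as d;
```

The third and fourth arguments (1 and 0) mean "start at position 1" and "replace every occurrence"; they have to be supplied in order to reach the fifth argument, the match type.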

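4. Add the header and footer strings

select regexp_replace(a,'^(.*)$','<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n\n$1\n</xsl:stylesheet>',1,0,'n') a from t3

The t4 subquery adds the opening and closing xsl:stylesheet tags around the result of t3. With dotall mode (match type 'n') enabled, the dot also matches line terminators, so the regular expression ^(.*)$ matches the entire multi-line text and places it in a capture group, which $1 references in the replacement.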
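The same wrapping idea can be tried on a tiny input. This sketch assumes MySQL 8.0 or later with the default SQL mode, so that \n inside a string literal is read as a newline; the root element and the derived table d are hypothetical.

```sql
-- Assumption: MySQL 8.0+, NO_BACKSLASH_ESCAPES not set; "root" is a made-up wrapper.
select
  -- dotall mode: (.*) spans both lines, so the whole text is wrapped once
  regexp_replace(a, '^(.*)$', '<root>\n$1\n</root>', 1, 0, 'n') as with_n,
  -- default mode: . stops at the line break and $ only matches at the end
  -- of the string, so there is no match and the input is returned as is
  regexp_replace(a, '^(.*)$', '<root>\n$1\n</root>') as without_n
from (select concat('line 1', char(10), 'line 2') as a) as d;
```

With 'n', the result should be <root>, the two original lines, and </root>, each on its own line. Plain string concatenation with concat would give the same output here; the regular expression form simply keeps all four steps in the same style.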