📜  Regex Character Cases in Java

📅  Last modified: 2022-05-13 01:55:00.766000    🧑  Author: Mango

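A regular expression (also called “regex”) is a special kind of string that describes a pattern in text. A regular expression can be matched against another string to see whether that string fits the pattern. In general, a regular expression is composed of normal characters, character classes, wildcard characters, and quantifiers. This article deals specifically with character classes.

Sometimes you need to match any sequence of one or more characters, in any order, where every character belongs to a given set. For example, to match a whole word, you want to match any sequence of letters. Character classes come in handy for such use cases. A character class is a set of characters placed between the square brackets “[” and “]”. For example, the class [abc] matches any one of the characters a, b, or c. A range of characters can also be specified with a hyphen. For example, to match whole words made of lowercase letters, you can use the class [a-z].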
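As a minimal sketch of the two classes mentioned above, the following Java snippet tries [abc] and [a-z] with the standard java.util.regex API. The class name CharClassBasics and the helper methods found and firstMatch are illustrative names, not part of the original article. The “+” after [a-z] is a quantifier meaning “one or more”, so that a whole run of lowercase letters is matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassBasics {
    // Returns true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(found("[abc]", "xyz-b"));             // true: "b" belongs to the class
        System.out.println(found("[abc]", "xyz"));               // false: no a, b, or c present
        System.out.println(firstMatch("[a-z]+", "Hello World")); // ello
    }
}
```

The last call returns “ello” rather than “Hello” because the uppercase “H” is outside the range a-z.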

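Note that character classes have nothing to do with the class construct or class files in Java. Also, the word “match” means that the pattern occurs somewhere in the string; it does not mean that the entire string matches the pattern. Regular expression patterns allow two types of character classes:

  • Predefined character classes
  • Custom character classes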
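The note above on what “match” means corresponds to two different methods of java.util.regex.Matcher: find() looks for the pattern anywhere in the input, while matches() requires the whole input to fit the pattern. The sketch below contrasts the two; the class and method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // "Match" in the sense used in this article: the pattern occurs somewhere in the input.
    static boolean containsPattern(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Stricter check: the entire input must fit the pattern.
    static boolean wholeStringMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsPattern("[0-9]", "room 7"));    // true
        System.out.println(wholeStringMatches("[0-9]", "room 7")); // false
        System.out.println(wholeStringMatches("[0-9]", "7"));      // true
    }
}
```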

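Predefined character classes

Java predefines some commonly used character classes, listed below. Most of them begin with a backslash “\” and do not need to be placed inside the brackets “[” and “]”.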

Predefined character class | Meaning of the predefined character class
--- | ---
. (dot) | Matches any single character. One dot matches one character, two dots match two characters, and so on. By default the dot does not match line terminators; it matches them only when the pattern is compiled with the DOTALL flag.
\d | Matches any digit character. Equivalent to the character class [0-9].
\D | Matches any character except a digit. Equivalent to the character class [^0-9].
\s | Matches any whitespace character: a space ‘ ’, a tab ‘\t’, a new line ‘\n’, a vertical tab ‘\x0B’, a form feed ‘\f’, or a carriage return ‘\r’.
\S | Matches any character except the whitespace characters listed above.
\w | Matches any word character: an uppercase or lowercase letter, a digit, or the underscore character ‘_’. Equivalent to the class [a-zA-Z_0-9].
\W | Matches any character except a word character. Equivalent to the class [^a-zA-Z_0-9].
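The predefined classes can be tried out with a short Java sketch. One practical detail: inside a Java string literal the backslash itself must be escaped, so the regex \d is written as "\\d" in source code. The class and helper names below are illustrative assumptions; Pattern.DOTALL is the standard flag that lets the dot match line terminators.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Returns true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Same check, but with DOTALL so that the dot also matches line terminators.
    static boolean foundDotAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\d", "abc1"));       // true: "1" is a digit
        System.out.println(found("\\D", "123"));        // false: only digits present
        System.out.println(found("\\s", "a\tb"));       // true: a tab is whitespace
        System.out.println(found("\\w", "_"));          // true: underscore is a word character
        System.out.println(found("\\W", "a_1"));        // false: only word characters present
        System.out.println(found("a.b", "a\nb"));       // false: dot skips line terminators by default
        System.out.println(foundDotAll("a.b", "a\nb")); // true: DOTALL lets the dot match "\n"
    }
}
```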

Some example regex patterns that use predefined character classes:

Regex pattern using predefined character classes | Input String – Result | Input String – Result | Input String – Result | Explanation
--- | --- | --- | --- | ---
b.r | bar – Match | ab1r – Match | ba1r – Does not match | “b.r” means there must be exactly one character, of any kind, between “b” and “r”. The pattern is found in “bar” and “ab1r”, but not in “ba1r”, because one dot matches only one character and here there are two characters between “b” and “r”.
\d\d-\d\d-\d\d\d\d | 01-01-2022 – Match | 12-31-2050 – Match | 2022-02-02 – Does not match | “\d\d-\d\d-\d\d\d\d” is a naive regex for a date in the format “DD-MM-YYYY” in which every character other than the hyphens is a digit. It is naive because it also matches dates in the format “MM-DD-YYYY”, and it does not reject days greater than 31 or months greater than 12.
\d\d-\D\D\D-\d\d\d\d | 01-JAN-2022 – Match | 31-12-2050 – Does not match | 22-a1B-1234 – Does not match | “\d\d-\D\D\D-\d\d\d\d” is another naive regex, for a date in the format “DD-MMM-YYYY”, where the day and year characters are digits and the month characters are anything other than digits.
...\s... | abc xyz – Match | abc_xyz – Does not match | abc xyz – Match | “...\s...” means two groups of any 3 characters separated by one whitespace character. As “_” is not a whitespace character, “abc_xyz” does not match.
...\S... | 123 456 – Does not match | 123+456 – Match | abc_xyz – Match | “...\S...” means two groups of any 3 characters separated by one character that is not whitespace. As “ ” (space) is a whitespace character, “123 456” does not match.
\w\w\w\W\w\w\w | abc xyz – Match | LMN_opq – Does not match | 123+456 – Match | “\w\w\w\W\w\w\w” means two groups of 3 word characters separated by one non-word character. As “_” is a word character, “LMN_opq” does not match.
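The example rows above can be checked mechanically. The sketch below runs each pattern against its sample inputs using Matcher.find(), which corresponds to the article's sense of “match” (the pattern occurs somewhere in the input). The class name, the check helper, and the layout of the rows array are illustrative assumptions; the dot patterns are written with three literal dots.

```java
import java.util.regex.Pattern;

class PredefinedExamplesCheck {
    // Reports whether the regex is found anywhere in the input.
    static String check(String regex, String input) {
        boolean found = Pattern.compile(regex).matcher(input).find();
        return found ? "Match" : "Does not match";
    }

    public static void main(String[] args) {
        // Each row holds a regex followed by the inputs to test against it.
        String[][] rows = {
            {"b.r", "bar", "ab1r", "ba1r"},
            {"\\d\\d-\\d\\d-\\d\\d\\d\\d", "01-01-2022", "12-31-2050", "2022-02-02"},
            {"\\d\\d-\\D\\D\\D-\\d\\d\\d\\d", "01-JAN-2022", "31-12-2050", "22-a1B-1234"},
            {"...\\s...", "abc xyz", "abc_xyz"},
            {"...\\S...", "123 456", "123+456", "abc_xyz"},
            {"\\w\\w\\w\\W\\w\\w\\w", "abc xyz", "LMN_opq", "123+456"},
        };
        for (String[] row : rows) {
            for (int i = 1; i < row.length; i++) {
                System.out.println(row[0] + "  " + row[i] + " : " + check(row[0], row[i]));
            }
        }
    }
}
```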

Custom character classes

Java allows us to define our own character classes using […]. A few examples of custom character classes follow:

Example of custom character class | Meaning of the custom character class
--- | ---
b[aeiou]t | The pattern is “b” followed by any one of the vowels “a”, “e”, “i”, “o”, “u”, followed by “t”. The strings “bat”, “bet”, “bit”, “bot”, “but” match this regex, but “bct”, “bkt”, etc. do not.
[bB][aAeEiIoOuU][tT] | Extends the previous regex to allow uppercase letters too, so the strings “bAT”, “BAT”, etc. match the pattern.
b[^aeiou]t | A “^” at the beginning of a character class works as negation (complement), so this regex allows any character other than a vowel between “b” and “t”. The strings “bct”, “bkt”, “b+t”, etc. match the pattern. The ‘^’ has this special meaning only at the beginning of the class; anywhere else in the class it acts like any other normal character.
[a-z][0-3] | A range of letters or digits can be specified in a character class using the hyphen “-”. The strings “a1”, “z3”, etc. match the pattern. The strings “k7”, “n9”, etc. do not match.
[a-zA-Z][0-9] | More than one range can be used in a class. The strings “A1”, “b2”, etc. match the pattern.
[A-F[G-Z]] | Nesting character classes forms their union, so this class is the same as the class [A-Z].
[a-p&&[l-z]] | Intersection of ranges also works in character classes. This class matches only the characters “l”, “m”, “n”, “o”, “p”.
[a-z&&[^aeiou]] | Subtraction of ranges also works in character classes. Here the vowels are subtracted from the range “a-z”.
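The custom classes listed above, including negation, union, intersection, and subtraction, can be exercised with a short sketch. The class name and the found helper are illustrative assumptions; the regex syntax itself is standard java.util.regex.

```java
import java.util.regex.Pattern;

class CustomClassDemo {
    // Returns true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("b[aeiou]t", "bet"));            // true
        System.out.println(found("b[aeiou]t", "bkt"));            // false
        System.out.println(found("[bB][aAeEiIoOuU][tT]", "BAT")); // true
        System.out.println(found("b[^aeiou]t", "b+t"));           // true: "+" is not a vowel
        System.out.println(found("[a^]", "^"));                   // true: "^" is literal when not first
        System.out.println(found("[a-z][0-3]", "k7"));            // false: 7 is outside 0-3
        System.out.println(found("[A-F[G-Z]]", "Q"));             // true: union behaves like [A-Z]
        System.out.println(found("[a-p&&[l-z]]", "m"));           // true: m is in both ranges
        System.out.println(found("[a-p&&[l-z]]", "c"));           // false: c is only in a-p
        System.out.println(found("[a-z&&[^aeiou]]", "e"));        // false: vowels are subtracted
        System.out.println(found("[a-z&&[^aeiou]]", "k"));        // true
    }
}
```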

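The regex patterns discussed so far require each position in the input string to match a specific character class. For example, the pattern “[a-z]\s\d” requires a lowercase letter in the first position, a whitespace character in the second, and a digit in the third. Such patterns are inflexible and restrictive, and they take more effort to maintain. To solve this problem, quantifiers can be used with character classes. With a quantifier, we can specify how many times a character or character class in the regex may occur in the matched sequence.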
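To make the rigidity of such fixed-position patterns concrete, the sketch below tests “[a-z]\s\d” against whole strings with Pattern.matches. The class and method names are illustrative assumptions. Any input with an extra letter, space, or digit is rejected, because each class accounts for exactly one character.

```java
import java.util.regex.Pattern;

class FixedPositions {
    // Returns true only if the entire input fits the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String regex = "[a-z]\\s\\d"; // one lowercase letter, one whitespace character, one digit
        System.out.println(wholeMatch(regex, "a 1"));  // true
        System.out.println(wholeMatch(regex, "ab 1")); // false: two letters
        System.out.println(wholeMatch(regex, "a 12")); // false: two digits
        System.out.println(wholeMatch(regex, "a  1")); // false: two spaces
    }
}
```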

Quantifier | Meaning of the quantifier
--- | ---
* | Zero or more times. Placing an asterisk “*” after a character class means “allow any number of occurrences of that character class”. For example, the regex “0*\d” matches any number of leading zeroes followed by a digit.
+ | One or more times. The plus sign “+” has the same effect as XX*, that is, a pattern followed by the same pattern with an asterisk. For example, the regex “0+\d” matches at least one leading zero followed by a digit.
? | Zero or one time. The question mark “?” allows either zero or one occurrence. For example, the regex “\w\w-?\d\d” matches two word characters, followed by an optional hyphen, followed by two digit characters.
{m} | Exactly “m” times
{m,} | At least “m” times
{m,n} | At least “m” times and at most “n” times
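The quantifiers above can be compared side by side with a small sketch. It uses Pattern.matches so that the whole input has to fit the pattern, which makes the effect of each count visible. The class name, the helper name, and the sample inputs are illustrative assumptions.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns true only if the entire input fits the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("0*\\d", "0007"));            // true: three zeroes, then a digit
        System.out.println(wholeMatch("0*\\d", "7"));               // true: zero leading zeroes is allowed
        System.out.println(wholeMatch("0+\\d", "7"));               // false: at least one zero is required
        System.out.println(wholeMatch("0+\\d", "07"));              // true
        System.out.println(wholeMatch("\\w\\w-?\\d\\d", "AB-12"));  // true: hyphen present
        System.out.println(wholeMatch("\\w\\w-?\\d\\d", "AB12"));   // true: hyphen omitted
        System.out.println(wholeMatch("\\w\\w-?\\d\\d", "AB--12")); // false: at most one hyphen
        System.out.println(wholeMatch("\\d{4}", "2022"));           // true: exactly 4 digits
        System.out.println(wholeMatch("\\d{4}", "202"));            // false
        System.out.println(wholeMatch("\\d{2,}", "1"));             // false: fewer than 2 digits
        System.out.println(wholeMatch("\\d{2,}", "12345"));         // true
        System.out.println(wholeMatch("\\d{2,4}", "123"));          // true: between 2 and 4 digits
        System.out.println(wholeMatch("\\d{2,4}", "12345"));        // false: more than 4 digits
    }
}
```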