How do I split a string on commas that are outside parentheses?

I have a string in the following format:

"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"

So basically it's a list of actors' names, each optionally followed by the actor's role in parentheses. The role itself can contain commas (the actor's name cannot, or at least I strongly hope so).

My goal is to split this string into a list of (actor name, actor role) pairs.

One obvious solution would be to go through each character, check for occurrences of '(', ')' and ',', and split whenever a comma occurs outside parentheses. But this seems a bit heavy...

I was thinking about splitting it using a regexp, first splitting the string on parentheses:

import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x) 
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']

Counting from zero, the even-indexed elements here hold the actor names and the odd-indexed ones hold the roles. I could then split the name elements on commas and somehow extract the name-role pairs. But this seems even worse than my first approach.

Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?

10 solutions

#1

One way to do it is to use findall with a regex that greedily matches things that can go between separators, e.g.:

>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

The regex above matches one or more of the following:

non-comma, non-open-paren characters
strings that start with an open paren, contain zero or more non-close-paren characters, and end with a close paren
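
To get from those fields to the (actor name, actor role) pairs the question asks for, something like the following might work. It is only a sketch: `parse_cast` is a made-up helper name, and it assumes the role is whatever sits between the first '(' and the last ')' of a field, with `None` standing in for a missing role.

```python
import re

# Same pattern as above: runs of non-comma, non-open-paren characters
# or complete (non-nested) parenthesized groups.
FIELD = re.compile(r'(?:[^,(]|\([^)]*\))+')

def parse_cast(text):
    """Return a list of (name, role) tuples; role is None when absent."""
    pairs = []
    for field in FIELD.findall(text):
        # Everything before the first '(' is the name.
        name, paren, rest = field.partition('(')
        # Everything up to the last ')' is the role (assumption, see above).
        role = rest.rpartition(')')[0].strip() if paren else None
        pairs.append((name.strip(), role))
    return pairs

s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
pairs = parse_cast(s)
print(pairs)
# [('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'),
#  ('Elvis Presley', None), ('Jane Doe', 'Jane Doe')]
```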

One quirk of this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use case.
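
A small illustration of that quirk, using an input with two commas in a row. If you do want the empty field, one possible alternative (not the approach above, just a commonly used lookahead trick) is to split on every comma that is not followed by a ')' before any other paren. Like the findall pattern, it assumes balanced, non-nested parentheses. The names `FIELD` and `TOP_LEVEL_COMMA` are mine.

```python
import re

FIELD = re.compile(r'(?:[^,(]|\([^)]*\))+')
# Assumed alternative: a comma counts as a separator unless a ')' comes
# next with no '(' or ')' in between, i.e. unless we are inside parens.
TOP_LEVEL_COMMA = re.compile(r',(?![^()]*\))')

s = "Elvis Presley,,Jane Doe (Jane, Doe)"

fields = FIELD.findall(s)
print(fields)  # ['Elvis Presley', 'Jane Doe (Jane, Doe)']

parts = TOP_LEVEL_COMMA.split(s)
print(parts)   # ['Elvis Presley', '', 'Jane Doe (Jane, Doe)']
```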

Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:

"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
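
Running the same regex on that string shows the failure: the first close paren ends the group, so the comma after it is treated as a separator even though it is still inside the outer parentheses.

```python
import re

r = re.compile(r'(?:[^,(]|\([^)]*\))+')
nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

fields = r.findall(nested)
for field in fields:
    print(repr(field))
# 'Wilbur Smith (son of John (Johnny, son of James)'
# ' aka Billy)'
# ' Eddie Murphy (John)'
```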

If you need to deal with nesting, your best bet would be to partition the string into parens, commas, and everything else (essentially tokenizing it; this part could still be done with regexes) and then walk through those tokens, reassembling the fields and keeping track of your nesting level as you go. Tracking the nesting level is the part that regexes are incapable of doing on their own.
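
A minimal sketch of that tokenize-and-walk idea might look like this. The names `TOKEN` and `split_top_level` are made up for illustration, and the handling of a stray ')' (clamping the depth at zero) is my own assumption rather than anything the input format guarantees.

```python
import re

# Tokens: a single paren, a single comma, or a run of anything else.
TOKEN = re.compile(r'[(),]|[^(),]+')

def split_top_level(text):
    """Split text on commas that are not inside (possibly nested) parens."""
    fields, current, depth = [], [], 0
    for token in TOKEN.findall(text):
        if token == '(':
            depth += 1
        elif token == ')':
            depth = max(depth - 1, 0)  # assumption: ignore a stray ')'
        if token == ',' and depth == 0:
            # A top-level comma closes the current field.
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(token)
    fields.append(''.join(current).strip())
    return fields

nested = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
print(split_top_level(nested))
# ['Wilbur Smith (son of John (Johnny, son of James), aka Billy)', 'Eddie Murphy (John)']
```

Unlike the findall version, this keeps empty fields, so "a,,b" comes back as ['a', '', 'b'].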
