I have a string in the following format:


"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"

So basically it's a list of actors' names, each optionally followed by their role in parentheses. The role itself can contain a comma (the actor's name cannot, I strongly hope).


My goal is to split this string into a list of pairs - (actor name, actor role).


One obvious solution would be to go through each character, check for occurrences of '(', ')' and ',', and split whenever a comma occurs outside parentheses. But this seems a bit heavy...


I was thinking about splitting it using a regexp: first split the string by parentheses:


import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x) 
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']

Counting from one, the odd elements here contain the actor names and the even ones are the roles. Then I could split the names by commas and somehow extract the name-role pairs. But this seems even worse than my first approach.


Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?


10 solutions



One way to do it is to use findall with a regex that greedily matches everything that can appear between separators, e.g.:


>>> import re
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

The regex above matches one or more:


  • non-comma, non-open-paren characters
  • strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren

One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
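To get from those fields to the (actor name, actor role) pairs the question asks for, each field still has to be taken apart. Here is a minimal sketch of one way to do that. The helper name split_actors and the second pattern PAIR are my own additions for illustration, and the sketch assumes at most one parenthesized role at the end of each field; a field that doesn't fit that shape is returned whole with no role.

```python
import re

# Same pattern as above: a field is a run of non-comma, non-"(" characters
# and/or complete "(...)" groups.
FIELD = re.compile(r'(?:[^,(]|\([^)]*\))+')

# Hypothetical helper pattern: a name, optionally followed by "(role)".
PAIR = re.compile(r'^\s*([^(]*?)\s*(?:\((.*)\))?\s*$')

def split_actors(s):
    """Return a list of (name, role) tuples; role is None when absent."""
    pairs = []
    for field in FIELD.findall(s):
        m = PAIR.match(field)
        if m is None:
            # Field does not look like "name (role)"; keep it as a bare name.
            pairs.append((field.strip(), None))
        else:
            pairs.append((m.group(1), m.group(2)))
    return pairs

s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
print(split_actors(s))
# [('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'),
#  ('Elvis Presley', None), ('Jane Doe', 'Jane Doe')]
```

The same quirk applies here: an input such as "a,,b" yields only two pairs, because the empty field between the commas never matches.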


Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:


"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

If you need to deal with nesting, your best bet would be to partition the string into parens, commas, and everything else (essentially tokenizing it; this part could still be done with regexes). Then walk through those tokens, reassembling the fields and keeping track of your nesting level as you go. Keeping track of the nesting level is the part that regexes are incapable of doing on their own.
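The tokenize-and-walk idea can be sketched roughly as follows. This is only an illustration under my own assumptions: the names TOKEN and split_top_level are made up, a stray ")" is tolerated by clamping the depth at zero, and an unclosed "(" simply swallows the rest of the string into the last field.

```python
import re

# A token is either a single paren or comma, or a run of anything else.
TOKEN = re.compile(r'[(),]|[^(),]+')

def split_top_level(s):
    """Split s on commas that are not inside (possibly nested) parentheses."""
    fields, current, depth = [], [], 0
    for tok in TOKEN.findall(s):
        if tok == '(':
            depth += 1
        elif tok == ')':
            depth = max(depth - 1, 0)  # tolerate a stray ")"
        if tok == ',' and depth == 0:
            # Top-level comma: close the current field and start a new one.
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(tok)
    fields.append(''.join(current).strip())
    return fields

s = "Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
print(split_top_level(s))
# ['Wilbur Smith (son of John (Johnny, son of James), aka Billy)', 'Eddie Murphy (John)']
```

Unlike the findall version, this one keeps empty fields, so "a,,b" comes back as three items with an empty string in the middle.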


