Using regex to extract information from a string

This is a follow-up to this question: Extracting contents of a string within parentheses.

In that question, I had the following string -

"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"

And I wanted to get a list of tuples in the form (actor, character) -

[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]

To generalize the problem, I have a slightly more complicated string and I need to extract the same information. The string I have is -

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary), 
with Stephen Root and Laura Dern (Delilah)"

I need to format this as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), 
('Stephen Root', ''), ('Laura Dern', 'Delilah')]

I know I can replace the filler words (with, and, &, etc.), but I can't quite figure out how to add a blank entry - '' - if there is no character name for the actor (in this case Stephen Root). What would be the best way to do this?

Finally, I need to account for an actor having multiple roles, and build a tuple for each role the actor has. The final string I have is:

"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)"

And I need to build a list of tuples as follows:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), 
('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Thank you.

Source

2011-08-10 David542

@Michael: thanks for fixing the spelling. – David542

Is using regex really necessary? – utdemir

No, it can be anything. Whatever works and is best. – David542

import re 
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

characters = splitre.split(credits) 
pairs = [] 
for character in characters:
    if character:
        match = matchre.match(character)
        if match:
            actor = match.group(1).strip()
            if match.group(2):
                parts = splitparts.split(match.group(2))
                for part in parts:
                    pairs.append((actor, part))
            else:
                pairs.append((actor, ""))

print(pairs)

Output:

[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), 
('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), 
('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
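
To see what the negative lookahead in splitre is doing, here is a small sketch on a shortened sample string. The string and the names sample and pieces are made up for illustration and are not part of the answer above. A comma only counts as a separator when no closing parenthesis can be reached from it without crossing another parenthesis, so the comma inside (Gary, Brad) is left alone:

```python
import re

# Same splitting pattern as in the answer: split on commas that are not
# inside parentheses, and on the words "with" and "and".
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")

# Shortened, illustrative input (an assumption, not the asker's full string).
sample = "Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern"

# ", " followed directly by "with " produces an empty piece, so drop empties.
pieces = [p for p in splitre.split(sample) if p]
print(pieces)
# ['Glenn Howerton (Gary, Brad)', 'Stephen Root', 'Laura Dern']
```

The empty piece between ", " and "with " is the reason the loop above checks `if character:` before matching.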

Source

2011-08-10 13:14:14

What you want is to identify sequences of words starting with a capital letter, plus some complications (IMHO you can't assume every name is made of First Last; there is also First Last Jr., or First M. Last, or other local variants such as Jean-Claude van Damme, Louis da Silva, etc.).

Now, this may be overkill for the sample input you posted, but as I wrote above, I assume things will get messy soon, so I would tackle this using nltk.

Here is a very rough and not very well tested snippet, but it should do the job:

import nltk 
from nltk.chunk.regexp import RegexpParser 

_patterns = [ 
    (r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'), # proper nouns 
    (r'^[(]$', 'O'), 
    (r'[,]', 'COMMA'), 
    (r'^[)]$', 'C'), 
    (r'.+', 'NN')         # nouns (default) 
] 

_grammar = """ 
     NAME: {<NNP> <COMMA> <NNP>} 
     NAME: {<NNP>+} 
     ROLE: {<O> <NAME>+ <C>} 
     """  
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)" 
tagger = nltk.RegexpTagger(_patterns)  
chunker = RegexpParser(_grammar) 
# pad parentheses and commas with spaces so that split() yields them as separate tokens
text = text.replace('(', ' ( ').replace(')', ' ) ').replace(',', ' , ') 
tokens = text.split() 
tagged_text = tagger.tag(tokens) 
tree = chunker.parse(tagged_text) 

for n in tree: 
    # NLTK 3 exposes the chunk label through label(); older releases used n.node
    if isinstance(n, nltk.tree.Tree) and n.label() in ['ROLE', 'NAME']: 
        print(n) 

# output is: 
# (NAME Will/NNP Ferrell/NNP) 
# (ROLE (/O (NAME Nick/NNP Halsey/NNP))/C) 
# (NAME Rebecca/NNP Hall/NNP) 
# (ROLE (/O (NAME Samantha/NNP))/C) 
# (NAME Glenn/NNP Howerton/NNP) 
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP))/C) 
# (NAME Stephen/NNP Root/NNP) 
# (NAME Laura/NNP Dern/NNP) 
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP))/C)

Then you have to process the tagged output and put the names and roles into a list instead of printing them, but you get the picture.

What we do here is a first pass in which we tag each token according to the regexes in _patterns, and then a second pass that builds more complex chunks according to your simple grammar. You can complicate the grammar and the patterns as you like, e.g. to catch name variants, messy input, abbreviations, etc.
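
For readers without nltk installed, the same two-pass idea can be sketched in plain Python. This is only an illustration under crude assumptions, not the nltk approach itself: the names TOKEN_RE, tag and extract_pairs are hypothetical, and the tagging rule (a capitalized word is part of a name, any other word is filler) would split names with lowercase particles such as "van" or "da".

```python
import re

# Tokenizer: parentheses and commas are single tokens, everything else is a word.
TOKEN_RE = re.compile(r"[(),]|[^\s(),]+")


def tag(token):
    """Pass 1: tag one token (illustrative rules, much cruder than _patterns)."""
    if token in {"(", ")", ","}:
        return token
    return "NAME" if token[0].isupper() else "OTHER"


def extract_pairs(text):
    """Pass 2: fold the tagged tokens into (actor, role) tuples."""
    pairs, actor_words, role_words, roles = [], [], [], []
    in_parens = False

    def close_role():
        # Finish the role being collected, if any.
        if role_words:
            roles.append(" ".join(role_words))
            role_words.clear()

    def flush_actor():
        # Emit the current actor once per role, or once with '' if no role.
        if actor_words:
            actor = " ".join(actor_words)
            for role in roles or [""]:
                pairs.append((actor, role))
        actor_words.clear()
        roles.clear()

    for token in TOKEN_RE.findall(text):
        kind = tag(token)
        if kind == "(":
            in_parens = True
        elif kind == ")":
            close_role()
            in_parens = False
            flush_actor()
        elif kind == ",":
            if in_parens:
                close_role()      # comma between roles of one actor
            else:
                flush_actor()     # comma between actors
        elif in_parens:
            role_words.append(token)
        elif kind == "NAME":
            actor_words.append(token)
        else:
            flush_actor()         # filler word such as "with" or "and"
    flush_actor()
    return pairs


text = ("Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), "
        "Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)")
pairs = extract_pairs(text)
print(pairs)
```

Keeping the tagging rules separate from the grouping logic is what makes this style easier to extend than a single regex: handling a new kind of token means changing tag(), not the loop.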

I think doing this with a single regex pass would be a pain for non-trivial inputs.

Otherwise, Tim's solution solves the problem nicely for the input you posted, and it has no dependency on nltk.

Source

2011-08-10 13:49:00

In case you want a non-regex solution ... (Assuming no nested parentheses.)

in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"  

in_list = [] 
is_in_paren = False 
item = {} 
next_string = '' 

index = 0 
while index < len(in_string):
    char = in_string[index]

    if in_string[index:].startswith(' and') and not is_in_paren:
        # " and" outside parentheses ends an actor who has no listed part
        actor = next_string
        if actor.startswith(' with '):
            actor = actor[6:]
        item['actor'] = actor
        in_list.append(item)
        item = {}
        next_string = ''
        index += 4
    elif char == '(':
        is_in_paren = True
        item['actor'] = next_string
        next_string = ''
    elif char == ')':
        is_in_paren = False
        item['part'] = next_string
        in_list.append(item)
        item = {}
        next_string = ''
    elif char == ',':
        if is_in_paren:
            # a comma inside parentheses separates parts played by the same actor
            item['part'] = next_string
            next_string = ''
            in_list.append(item)
            item = item.copy()
            item.pop('part')
    else:
        next_string += char

    index += 1


out_list = []
for entry in in_list:
    actor = entry.get('actor')
    part = entry.get('part')

    if part is None:
        part = ''

    out_list.append((actor.strip(), part.strip()))

print(out_list)

Output: [('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Source

2011-08-10 17:10:36 jcfollower

Tim Pietzcker's solution can be simplified to the following (note that the patterns are changed too):

import re 
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
# the lookbehind (?<! ) keeps trailing spaces out of group 1
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

pairs = [] 
for character in splitre.split(credits):
    gr = matchre.match(character).groups('')
    for part in splitparts.split(gr[1]):
        pairs.append((gr[0], part))

print(pairs)


Then:

import re 
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with 
Stephen Root and Laura Dern (Delilah, Stacy)""" 

# split on commas (only if outside of parentheses), "with" or "and" 
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*") 

# match the part before the parentheses (1) and what's inside the parens (2) 
# (only if parentheses are present) 
# the lookbehind (?<! ) keeps trailing spaces out of group 1
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?") 

# split the parts inside the parentheses on commas 
splitparts = re.compile(r"\s*,\s*") 

gen = (matchre.match(character).groups('') for character in splitre.split(credits)) 

pp = [ (gr[0], part) for gr in gen for part in splitparts.split(gr[1])] 

print(pp)

The trick is to call groups('') with '' as the argument, so that a group that did not take part in the match comes back as an empty string instead of None.
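
A quick way to see the effect of the default passed to groups() is the sketch below. The pattern here is a simplified stand-in chosen for illustration (an assumption, not the matchre used above), as are the names pattern, no_role and with_role:

```python
import re

# Simplified illustrative pattern: a name, optionally followed by "(role)".
pattern = re.compile(r"(\w+(?: \w+)*)(?: \(([^)]*)\))?")

no_role = pattern.match("Stephen Root")
with_role = pattern.match("Laura Dern (Delilah)")

print(no_role.groups())      # ('Stephen Root', None)
print(no_role.groups(''))    # ('Stephen Root', '')
print(with_role.groups(''))  # ('Laura Dern', 'Delilah')
```

Because the missing role becomes '', splitting it on commas gives [''], so the actor still produces exactly one tuple with an empty role and no special case is needed.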

Source

2011-08-10 22:25:51 eyquem
