this post was submitted on 06 Dec 2023

I was trying to parse a string with pyparsing so that all the words are separated from the punctuation signs. I was using this expression to do it:

OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))

But when I parse the following string with this expression:

return abc(1, ULLONG_MAX);

All the words inside the parentheses get split:

['return', 'abc', '(', '1', ',', 'U', 'L', 'L', 'O', 'N', 'G', '_', 'M', 'A', 'X', ')', ';']

But if I use this expression:

OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))

Only a part of the string gets parsed:

['return', 'abc', '(']

What is wrong with those expressions?

UlrikHD 4 points 10 months ago (last edited 10 months ago)

Personally, I would recommend using regex for this instead, which also makes it easier to test your expressions. You could then get the list with:

import re
result = re.findall(r'\w+|\S', yourstring)  # \w already includes the underscore, so ULLONG_MAX is preserved as a single word, if that's what you want
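To make the regex suggestion concrete, here is a minimal sketch run against the string from the question. The variable names `source` and `tokens` are made up for illustration, and the trailing `;` is assumed from the output shown in the question.

```python
import re

# Sample input based on the string in the question (trailing ';' assumed).
source = "return abc(1, ULLONG_MAX);"

# \w+ matches runs of letters, digits and underscores;
# \S then picks up any other single non-whitespace character.
tokens = re.findall(r"\w+|\S", source)

print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']
```

Because `\w+` comes first in the alternation, words are consumed whole before `\S` gets a chance to match them character by character.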

As for what's wrong with your expressions:

First expression: + means "followed by", so once you hit (, OneOrMore(Char(printables)) takes over and matches every remaining printable character one at a time; the parser never goes back to Word(alphanums). Instead, you should use OR (|) with the alphanumeric alternative first for priority: OneOrMore(Word(alphanums) | Char(printables))
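As a rough sketch of that fix, assuming pyparsing is installed and using the string from the question (with the trailing `;` assumed from its output), the alternation could look like this. The names `source`, `tokenizer` and `tokens` are only illustrative.

```python
from pyparsing import Char, OneOrMore, Word, alphanums, printables

# Sample input based on the string in the question (trailing ';' assumed).
source = "return abc(1, ULLONG_MAX);"

# At every position, try a whole alphanumeric word first and only fall
# back to a single printable character if that fails.
tokenizer = OneOrMore(Word(alphanums) | Char(printables))

tokens = list(tokenizer.parseString(source))
print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ')', ';']
```

The underscore is not part of alphanums, so ULLONG_MAX still comes out as three tokens here.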

Second expression: You're running into the same issue with your use of +. Once Char(string.punctuation) takes over, it keeps matching until it encounters a character that is not punctuation (here the 1 right after the opening parenthesis), and then the matching stops. Instead you can write:

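import string
from pyparsing import OneOrMore, Word, alphanums
parser = OneOrMore(Word(alphanums) | Word(string.punctuation))
result = parser.parseString(yourstring)  # yourstring is the text you want to tokenize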

Do note that the underscore is considered punctuation, so ULLONG_MAX will be split; I'm not sure whether that's what you want.
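If you do want ULLONG_MAX kept whole, one possible variation (a sketch, not the only way) is to add the underscore to the word characters. It also swaps Word(string.punctuation) for Char(string.punctuation), since Word groups adjacent punctuation such as ); into one token. The names `identifier`, `punct`, `grouped` and `separate` are made up for this example, and the trailing `;` in the input is assumed from the question's output.

```python
import string
from pyparsing import Char, OneOrMore, Word, alphanums

# Sample input based on the string in the question (trailing ';' assumed).
source = "return abc(1, ULLONG_MAX);"

# Word(string.punctuation) merges neighbouring punctuation into one token.
grouped = list(
    OneOrMore(Word(alphanums) | Word(string.punctuation)).parseString(source)
)
print(grouped)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ');']

# Treat '_' as a word character and emit punctuation one sign at a time.
identifier = Word(alphanums + "_")
punct = Char(string.punctuation)
separate = list(OneOrMore(identifier | punct).parseString(source))
print(separate)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']
```

Putting identifier first matters: with MatchFirst (|), the underscore is consumed as part of the word before punct ever sees it.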