parse

This module gathers parsers that handle the whole input text.

find_first_pattern

class textops.find_first_pattern(patterns)

Fast multiple pattern search, returns on first match

It works like textops.find_patterns except that it stops searching on first match.

Parameters:patterns (list) – a list of patterns.
Returns:matched value if only one capture group otherwise the full groupdict
Return type:str or dict

Examples

>>> s = '''creation: 2015-10-14
... update: 2015-11-16
... access: 2015-11-17'''
>>> s | find_first_pattern([r'^update:\s*(.*)', r'^access:\s*(.*)', r'^creation:\s*(.*)'])
'2015-11-16'
>>> s | find_first_pattern([r'^UPDATE:\s*(.*)'])
NoAttr
>>> s | find_first_pattern([r'^update:\s*(?P<year>.*)-(?P<month>.*)-(?P<day>.*)'])
{'year': '2015', 'month': '11', 'day': '16'}

find_first_patterni

class textops.find_first_patterni(patterns)

Fast multiple pattern search, returns on first match

It works like textops.find_first_pattern except that patterns are case insensitive.

Parameters:patterns (list) – a list of patterns.
Returns:matched value if only one capture group otherwise the full groupdict
Return type:str or dict

Examples

>>> s = '''creation: 2015-10-14
... update: 2015-11-16
... access: 2015-11-17'''
>>> s | find_first_patterni([r'^UPDATE:\s*(.*)'])
'2015-11-16'

find_pattern

class textops.find_pattern(pattern)

Fast pattern search

This operation can be used to find a pattern very fast: it runs re.search() on the whole input text at once. The input text is not read line by line, so it must fit into memory. It returns the first captured group (named or unnamed).

Parameters:pattern (str) – a regular expression string (case sensitive).
Returns:the first captured group or NoAttr if not found
Return type:str

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> s | find_pattern(r'^Version:\s*(.*)')
'1.2.3'
>>> s | find_pattern(r'^Format:\s*(?P<format>.*)')
'json'
>>> s | find_pattern(r'^version:\s*(.*)') # 'version' : no match because case sensitive
NoAttr

find_patterni

class textops.find_patterni(pattern)

Fast pattern search (case insensitive)

It works like textops.find_pattern except that the pattern is case insensitive.

Parameters:pattern (str) – a regular expression string (case insensitive).
Returns:the first captured group or NoAttr if not found
Return type:str

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> s | find_patterni(r'^version:\s*(.*)')
'1.2.3'

find_patterns

class textops.find_patterns(patterns)

Fast multiple pattern search

It works like textops.find_pattern except that one can specify a list or a dictionary of patterns. Patterns must contain capture groups. It returns a list or a dictionary of results, depending on the type of the patterns argument. Each result will be the re.MatchObject groupdict if the pattern has more than one named capture group, otherwise directly the value of the unique capture group. It is recommended to use named capture groups; if you do not, the groups are automatically named ‘groupN’, where N is the position of the capture group in the pattern.

Parameters:patterns (list or dict) – a list or a dictionary of patterns.
Returns:patterns search result
Return type:list or dict

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> r = s | find_patterns({
... 'version':r'^Version:\s*(?P<major>\d+)\.(?P<minor>\d+)\.(?P<build>\d+)',
... 'format':r'^Format:\s*(?P<format>.*)',
... })
>>> r
{'version': {'major': '1', 'minor': '2', 'build': '3'}, 'format': 'json'}
>>> r.version.major
'1'
>>> s | find_patterns({
... 'version':r'^Version:\s*(\d+)\.(\d+)\.(\d+)',
... 'format':r'^Format:\s*(.*)',
... })
{'version': {'group0': '1', 'group1': '2', 'group2': '3'}, 'format': 'json'}
>>> s | find_patterns({'version':r'^version:\s*(.*)'}) # lowercase 'version' : no match
{}
>>> s = '''creation: 2015-10-14
... update: 2015-11-16
... access: 2015-11-17'''
>>> s | find_patterns([r'^update:\s*(.*)', r'^access:\s*(.*)', r'^creation:\s*(.*)'])
['2015-11-16', '2015-11-17', '2015-10-14']
>>> s | find_patterns([r'^update:\s*(?P<year>.*)-(?P<month>.*)-(?P<day>.*)',
... r'^access:\s*(.*)', r'^creation:\s*(.*)'])
[{'year': '2015', 'month': '11', 'day': '16'}, '2015-11-17', '2015-10-14']

find_patternsi

class textops.find_patternsi(patterns)

Fast multiple pattern search (case insensitive)

It works like textops.find_patterns except that patterns are case insensitive.

Parameters:patterns (dict) – a dictionary of patterns.
Returns:patterns search result
Return type:dict

Examples

>>> s = '''This is data text
... Version: 1.2.3
... Format: json'''
>>> s | find_patternsi({'version':r'^version:\s*(.*)'})     # case insensitive
{'version': '1.2.3'}

keyval

class textops.keyval(pattern, key_name='key', key_update=None, val_name=None)

Return a dictionary where keys and values are taken from the specified pattern

It is a shortcut for textops.parsekv with val_name=’val’. The input can be a string or a list of strings.

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – the name of the capture group whose value will become the key of the result dict (‘key’ by default)
  • key_update (callable) – function to convert/normalize the calculated key. If None, the key is normalized. If not None but not callable, the key is left unchanged.
  • val_name (str) – instead of storing the whole groupdict, one can choose to select the value at the key ``val_name``. (by default, None means ‘val’)
Returns:

A dict of key:val from the matched pattern groupdict or a list of dicts if the input is a list of strings

Return type:

dict

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | keyval(r'(?P<key>.*):\s*(?P<val>.*)')         #doctest: +NORMALIZE_WHITESPACE
{'name': 'Lapouyade', 'first_name': 'Eric', 'country': 'France'}
>>> s = [ '''name: Lapouyade
... first name: Eric ''',
... '''name: Python
... first name: Guido''' ]
>>> s | keyval(r'(?P<key>.*):\s*(?P<val>.*)')         #doctest: +NORMALIZE_WHITESPACE
[{'name': 'Lapouyade', 'first_name': 'Eric '}, {'name': 'Python', 'first_name': 'Guido'}]

keyvali

class textops.keyvali(pattern, key_name='key', key_update=None, val_name=None)

Return a dictionary where keys and values are taken from the specified pattern

It works a little like textops.keyval except that the pattern is case insensitive.

Parameters:
  • pattern (str) – a regular expression string (case insensitive).
  • key_name (str) – the name of the capture group whose value will become the key of the result dict (‘key’ by default)
  • key_update (callable) – function to convert/normalize the calculated key. If None, the key is normalized. If not None but not callable, the key is left unchanged.
  • val_name (str) – instead of storing the whole groupdict, one can choose to select the value at the key ``val_name``. (by default, None means ‘val’)
Returns:

A dict of key:val from the matched pattern groupdict

Return type:

dict

Examples

>>> s = '''name IS Lapouyade
... first name IS Eric
... country IS France'''
>>> s | keyvali(r'(?P<key>.*) is (?P<val>.*)')         #doctest: +NORMALIZE_WHITESPACE
{'name': 'Lapouyade', 'first_name': 'Eric', 'country': 'France'}

mgrep

class textops.mgrep(patterns_dict, key=None)

Multiple grep

This works like textops.grep except that it can do several greps in a single command. This way, you can select many patterns in a big file in a single pass.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to search are in values.
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as in patterns_dict; the values contain the textops.grep result for each corresponding pattern.

Return type:

dict

Examples

>>> logs = '''
... error 1
... warning 1
... warning 2
... info 1
... error 2
... info 2
... '''
>>> t = logs | mgrep({
... 'errors' : r'^err',
... 'warnings' : r'^warn',
... 'infos' : r'^info',
... })
>>> print(t)  #doctest: +NORMALIZE_WHITESPACE
{'errors': ['error 1', 'error 2'],
'warnings': ['warning 1', 'warning 2'],
'infos': ['info 1', 'info 2']}
>>> s = '''
... Disk states
... -----------
... name: c1t0d0s0
... state: good
... fs: /
... name: c1t0d0s4
... state: failed
... fs: /home
...
... '''
>>> t = s | mgrep({
... 'disks' : r'^name:',
... 'states' : r'^state:',
... 'fss' : r'^fs:',
... })
>>> print(t)  #doctest: +NORMALIZE_WHITESPACE
{'disks': ['name: c1t0d0s0', 'name: c1t0d0s4'],
'states': ['state: good', 'state: failed'],
'fss': ['fs: /', 'fs: /home']}
>>> dict(zip(t.disks.cutre(': *',1),zip(t.states.cutre(': *',1),t.fss.cutre(': *',1))))
{'c1t0d0s0': ('good', '/'), 'c1t0d0s4': ('failed', '/home')}

mgrepi

class textops.mgrepi(patterns_dict, key=None)

Same as mgrep but case insensitive

This works like textops.mgrep, except it is case insensitive.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to search are in values.
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as in patterns_dict; the values contain the textops.grepi result for each corresponding pattern.

Return type:

dict

Examples

>>> 'error 1' | mgrep({'errors':'ERROR'})
{}
>>> 'error 1' | mgrepi({'errors':'ERROR'})
{'errors': ['error 1']}

mgrepv

class textops.mgrepv(patterns_dict, key=None)

Same as mgrep but exclusive

This works like textops.mgrep, except it searches for lines that DO NOT match the patterns.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to exclude are in values().
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as in patterns_dict; the values contain the textops.grepv result for each corresponding pattern.

Return type:

dict

Examples

>>> logs = '''error 1
... warning 1
... warning 2
... error 2
... '''
>>> t = logs | mgrepv({
... 'not_errors' : r'^err',
... 'not_warnings' : r'^warn',
... })
>>> print(t)  #doctest: +NORMALIZE_WHITESPACE
{'not_warnings': ['error 1', 'error 2'], 'not_errors': ['warning 1', 'warning 2']}

mgrepvi

class textops.mgrepvi(patterns_dict, key=None)

Same as mgrepv but case insensitive

This works like textops.mgrepv, except it is case insensitive.

Parameters:
  • patterns_dict (dict) – a dictionary where all patterns to exclude are in values().
  • key (int or str) – test only one column or one key (optional)
Returns:

A dictionary where the keys are the same as in patterns_dict; the values contain the textops.grepvi result for each corresponding pattern.

Return type:

dict

Examples

>>> logs = '''error 1
... WARNING 1
... warning 2
... ERROR 2
... '''
>>> t = logs | mgrepv({
... 'not_errors' : r'^err',
... 'not_warnings' : r'^warn',
... })
>>> print(t)  #doctest: +NORMALIZE_WHITESPACE
{'not_warnings': ['error 1', 'WARNING 1', 'ERROR 2'],
'not_errors': ['WARNING 1', 'warning 2', 'ERROR 2']}
>>> t = logs | mgrepvi({
... 'not_errors' : r'^err',
... 'not_warnings' : r'^warn',
... })
>>> print(t)  #doctest: +NORMALIZE_WHITESPACE
{'not_warnings': ['error 1', 'ERROR 2'], 'not_errors': ['WARNING 1', 'warning 2']}

parse_indented

class textops.parse_indented(sep=':')

Parse key:value indented text

It looks for key:value patterns and stores the found values in a dictionary. Each time a new indent is found, a sub-dictionary is created. The keys are normalized (only A-Za-z0-9_ characters are kept), the values are stripped.

Parameters:sep (str) – key:value separator (Default: ‘:’)
Returns:structured keys:values
Return type:dict

Examples

>>> s = '''
... a:val1
... b:
...     c:val3
...     d:
...         e ... : val5
...         f ... :val6
...     g:val7
... f: val8'''
>>> s | parse_indented()
{'a': 'val1', 'b': {'c': 'val3', 'd': {'e': 'val5', 'f': 'val6'}, 'g': 'val7'}, 'f': 'val8'}
>>> s = '''
... a --> val1
... b --> val2'''
>>> s | parse_indented(r'-->')
{'a': 'val1', 'b': 'val2'}
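
The indent-tracking idea can be sketched in plain Python (a simplified, hypothetical re-implementation for illustration, not the actual textops code): a stack of (indent, dict) pairs tracks the nesting, and a deeper indent turns the previous key into a sub-dictionary.

```python
import re

def parse_indented_sketch(text, sep=':'):
    # Simplified sketch of key:value indented parsing (not the textops implementation)
    result = {}
    stack = []                      # saved (indent, dict) levels
    cur_indent, cur_dict, last_key = 0, result, None
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        key, _, val = line.strip().partition(sep)
        key = re.sub(r'\W+', '_', key.strip()).strip('_')   # normalize the key
        if indent > cur_indent and last_key is not None:
            # deeper indent: the previous key becomes a sub-dictionary
            stack.append((cur_indent, cur_dict))
            cur_dict[last_key] = sub = {}
            cur_dict, cur_indent = sub, indent
        else:
            while stack and indent < cur_indent:            # back to an enclosing level
                cur_indent, cur_dict = stack.pop()
        cur_dict[key] = val.strip()
        last_key = key
    return result
```

Running this sketch on the first example above yields the same nested dictionary.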

parse_smart

class textops.parse_smart

Try to automatically parse a text

It looks for key/value patterns and stores the found values in a dictionary. It tries to respect indents by creating sub-dictionaries. The keys are normalized (only A-Za-z0-9_ characters are kept; the original key is stored in the inner dict under the ‘_original_key’ key), the values are stripped.

Parameters:
  • key_filter (func) – a function that receives a key before normalization and returns a new key string. This can be useful when a chapter title is too long. (Default: no filtering)
Returns:

structured keys:values

Return type:

dict

Examples

>>> s = '''
... Date/Time:       Wed Dec  2 09:51:17 NFT 2015
... Sequence Number: 156637
... Machine Id:      00F7B0114C00
...    Node Id:         xvio6
... Class:           H
... Type:            PERM
...    WPAR:            Global
...    Resource Name:   hdisk21
...       Resource Class:  disk
... Resource Type:   mpioapdisk
... Location:        U78AA.001.WZSHM0M-P1-C6-T1-W201400A0B8292A18-L13000000000000
...
... VPD:
...         Manufacturer................IBM
...         Machine Type and Model......1815      FAStT
...         ROS Level and ID............30393134
...         Serial Number...............
...         Device Specific.(Z0)........0000053245004032
...         Device Specific.(Z1)........
...
... Description
... DISK OPERATION ERROR
...
... Probable Causes
... DASD DEVICE
... '''
>>> parsed = s >> parse_smart()
>>> print(parsed.pretty())
{   'class': 'H',
    'date_time': 'Wed Dec  2 09:51:17 NFT 2015',
    'description': ['DISK OPERATION ERROR'],
    'location': 'U78AA.001.WZSHM0M-P1-C6-T1-W201400A0B8292A18-L13000000000000',
    'machine_id': {   '_original_key': 'Machine Id',
                      'machine_id': '00F7B0114C00',
                      'node_id': 'xvio6'},
    'probable_causes': ['DASD DEVICE'],
    'resource_type': 'mpioapdisk',
    'sequence_number': '156637',
    'type': {   '_original_key': 'Type',
                'resource_name': {   '_original_key': 'Resource Name',
                                     'resource_class': 'disk',
                                     'resource_name': 'hdisk21'},
                'type': 'PERM',
                'wpar': 'Global'},
    'vpd': {   '_original_key': 'VPD',
               'device_specific_z0': '0000053245004032',
               'device_specific_z1': '',
               'machine_type_and_model': '1815      FAStT',
               'manufacturer': 'IBM',
               'ros_level_and_id': '30393134',
               'serial_number': ''}}
>>> print(parsed.vpd.device_specific_z0)
0000053245004032

parseg

class textops.parseg(pattern)

Find all occurrences of one pattern, return MatchObject groupdict

Parameters:pattern (str) – a regular expression string (case sensitive)
Returns:A list of dictionaries (MatchObject groupdict)
Return type:list

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | parseg(r'(?P<key>.*):\s*(?P<val>.*)')         #doctest: +NORMALIZE_WHITESPACE
[{'key': 'name', 'val': 'Lapouyade'},
{'key': 'first name', 'val': 'Eric'},
{'key': 'country', 'val': 'France'}]

parsegi

class textops.parsegi(pattern)

Same as parseg but case insensitive

Parameters:pattern (str) – a regular expression string (case insensitive)
Returns:A list of dictionaries (MatchObject groupdict)
Return type:list

Examples

>>> s = '''Error: System will reboot
... Notice: textops rocks
... Warning: Python must be used without moderation'''
>>> s | parsegi(r'(?P<level>error|warning):\s*(?P<msg>.*)')         #doctest: +NORMALIZE_WHITESPACE
[{'level': 'Error', 'msg': 'System will reboot'},
{'level': 'Warning', 'msg': 'Python must be used without moderation'}]

parsek

class textops.parsek(pattern, key_name='key', key_update=None)

Find all occurrences of one pattern, return one Key

One has to give a pattern with named capture groups; the function will return a list of values corresponding to the specified key. It works a little like textops.parseg except that it returns, from each groupdict, the value for a specified key (‘key’ by default).

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – The key to get (‘key’ by default)
  • key_update (callable) – function to convert the found value
Returns:

A list of values corresponding to MatchObject groupdict[key]

Return type:

list

Examples

>>> s = '''Error: System will reboot
... Notice: textops rocks
... Warning: Python must be used without moderation'''
>>> s | parsek(r'(?P<level>Error|Warning):\s*(?P<msg>.*)','msg')
['System will reboot', 'Python must be used without moderation']

parseki

class textops.parseki(pattern, key_name='key', key_update=None)

Same as parsek but case insensitive

It works like textops.parsek except the pattern is case insensitive.

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – The key to get (‘key’ by default)
  • key_update (callable) – function to convert the found value
Returns:

A list of values corresponding to MatchObject groupdict[key]

Return type:

list

Examples

>>> s = '''Error: System will reboot
... Notice: textops rocks
... Warning: Python must be used without moderation'''
>>> s | parsek(r'(?P<level>error|warning):\s*(?P<msg>.*)','msg')
[]
>>> s | parseki(r'(?P<level>error|warning):\s*(?P<msg>.*)','msg')
['System will reboot', 'Python must be used without moderation']

parsekv

class textops.parsekv(pattern, key_name='key', key_update=None, val_name=None)

Find all occurrences of one pattern, returns a dict of groupdicts

It works a little like textops.parseg except that it returns a dict of dicts: the values are MatchObject groupdicts, and the keys are taken from each groupdict at a specified key (by default: ‘key’). Note that calculated keys are normalized (spaces are replaced by underscores).

Parameters:
  • pattern (str) – a regular expression string.
  • key_name (str) – the name of the capture group whose value will become the key of the result dict (‘key’ by default)
  • key_update (callable) – function to convert/normalize the calculated key. If None, the key is normalized. If not None but not callable, the key is left unchanged.
  • val_name (str) – instead of storing the whole groupdict, one can choose to select the value at the key ``val_name``. (by default, None means the whole groupdict)
Returns:

A dict of MatchObject groupdicts

Return type:

dict

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | parsekv(r'(?P<key>.*):\s*(?P<val>.*)')         #doctest: +NORMALIZE_WHITESPACE
{'name': {'key': 'name', 'val': 'Lapouyade'},
'first_name': {'key': 'first name', 'val': 'Eric'},
'country': {'key': 'country', 'val': 'France'}}
>>> s | parsekv(r'(?P<item>.*):\s*(?P<val>.*)','item',str.upper)         #doctest: +NORMALIZE_WHITESPACE
{'NAME': {'item': 'name', 'val': 'Lapouyade'},
'FIRST NAME': {'item': 'first name', 'val': 'Eric'},
'COUNTRY': {'item': 'country', 'val': 'France'}}
>>> s | parsekv(r'(?P<key>.*):\s*(?P<val>.*)',key_update=0)         #doctest: +NORMALIZE_WHITESPACE
{'name': {'key': 'name', 'val': 'Lapouyade'},
'first name': {'key': 'first name', 'val': 'Eric'},
'country': {'key': 'country', 'val': 'France'}}
>>> s | parsekv(r'(?P<key>.*):\s*(?P<val>.*)',val_name='val')         #doctest: +NORMALIZE_WHITESPACE
{'name': 'Lapouyade', 'first_name': 'Eric', 'country': 'France'}

parsekvi

class textops.parsekvi(pattern, key_name='key', key_update=None, val_name=None)

Find all occurrences of one pattern (case insensitive), returns a dict of groupdicts

It works a little like textops.parsekv except that the pattern is case insensitive.

Parameters:
  • pattern (str) – a regular expression string (case insensitive).
  • key_name (str) – the name of the capture group whose value will become the key of the result dict (‘key’ by default)
  • key_update (callable) – function to convert/normalize the calculated key. If None, the key is normalized. If not None but not callable, the key is left unchanged.
  • val_name (str) – instead of storing the whole groupdict, one can choose to select the value at the key ``val_name``. (by default, None means the whole groupdict)
Returns:

A dict of MatchObject groupdicts

Return type:

dict

Examples

>>> s = '''name: Lapouyade
... first name: Eric
... country: France'''
>>> s | parsekvi(r'(?P<key>NAME):\s*(?P<val>.*)')
{'name': {'key': 'name', 'val': 'Lapouyade'}}

sgrep

class textops.sgrep(patterns, key=None)

Switch grep

This works like textops.mgrep except that it returns a list of lists. sgrep dispatches lines matching a pattern to the list corresponding to the pattern’s position: if a line matches the third pattern, it is dispatched to the third returned list. If N patterns are given, it returns N+1 lists; the last list is filled with the lines that do not match any pattern. The order of the patterns list is important: only the first matching pattern is taken into account. One can consider that sgrep works like a switch(): for each line, it does the equivalent of

if pattern1 matches:
    put line in list1
elif pattern2 matches:
    put line in list2
...
elif patternN matches:
    put line in listN
else:
    put line in listN+1
Parameters:
  • patterns (list) – a list of patterns to search.
  • key (int or str) – test only one column or one key (optional)
Returns:

a list of lists (nb patterns + 1)

Return type:

list

Examples

>>> logs = '''
... error 1
... warning 1
... warning 2
... info 1
... error 2
... info 2
... '''
>>> t = logs | sgrep(('^err','^warn'))
>>> print(t)  #doctest: +NORMALIZE_WHITESPACE
[['error 1', 'error 2'], ['warning 1', 'warning 2'], ['', 'info 1', 'info 2']]

sgrepi

class textops.sgrepi(patterns, key=None)

Switch grep case insensitive

This works like textops.sgrep but is case insensitive

sgrepv

class textops.sgrepv(patterns, key=None)

Switch grep reversed

This works like textops.sgrep except that it tests that patterns DO NOT match the line.

sgrepvi

class textops.sgrepvi(patterns, key=None)

Switch grep reversed case insensitive

This works like textops.sgrepv but is case insensitive
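
The dispatch logic shared by the four sgrep variants can be sketched with plain re (a hypothetical illustration of the behavior, not the textops implementation); the ignore_case and invert flags stand in for the i and v variants.

```python
import re

def sgrep_sketch(lines, patterns, ignore_case=False, invert=False):
    # Dispatch each line to the list of the FIRST pattern that matches
    # (or, with invert=True, the first that does NOT match);
    # lines taken by no pattern go to the extra last list.
    flags = re.I if ignore_case else 0
    compiled = [re.compile(p, flags) for p in patterns]
    out = [[] for _ in range(len(patterns) + 1)]
    for line in lines:
        for i, rx in enumerate(compiled):
            if bool(rx.search(line)) != invert:     # invert flips the test
                out[i].append(line)
                break
        else:                                       # no pattern took the line
            out[-1].append(line)
    return out
```

For instance, sgrep_sketch(['ERROR 1'], ['^err'], ignore_case=True) dispatches the line to the first list, as sgrepi would.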

state_pattern

class textops.state_pattern(states_patterns_desc, reflags=0, autostrip=True)

States and patterns parser

This is a state-machine parser. The main advantage is that it reads the whole input text line by line only once, to collect all the data you want into a multi-level dictionary. It uses patterns to select the rules to be applied, and states to ensure only a subset of rules is used against specific document sections.

Parameters:
  • states_patterns_desc (tuple) – description of states and patterns: see below for an explanation
  • reflags – re flags, e.g. re.I or re.M or re.I | re.M (Default: no flag)
  • autostrip – before being stored, groupdict keys and values are stripped (Default: True)
Returns:

parsed data from text

Return type:

dict


The states_patterns_desc :

It looks like this:

((<if state1>,<goto state1>,<pattern1>,<out data path1>,<out filter1>),
...
(<if stateN>,<goto stateN>,<patternN>,<out data pathN>,<out filterN>))
<if state>
is a string telling in what state(s) the pattern must be searched; one can specify several states with a comma-separated string or a tuple. If <if state> is empty, the pattern will be searched for all lines. Note: at the beginning, the state is ‘top’.
<goto state>
is a string giving the new state to set if the pattern matches. Use an empty string to keep the current state. One can use any string; usually it corresponds to the name of a specific section of the document to parse, where specific rules have to be used. If the pattern matches, no more rules are used for the current line, except when you specify __continue__ as the goto state: this is useful when you want to apply several rules to the same line.
<pattern>
is a string or a re.regex to match a line of text. One should use named groups for selecting data, e.g. (?P<key1>pattern)
<out data path>

is a string with a dot separator or a tuple telling where to place the groupdict from the pattern matching process. The syntax is:

'{contextkey1}.{contextkey2}. ... .{contextkeyN}'
or
('{contextkey1}','{contextkey2}', ... ,'{contextkeyN}')
or
'key1.key2.keyN'
or
'key1.key2.keyN[]'
or
'{contextkey1}.{contextkey2}. ... .keyN[]'
or
'>context_dict_key'
or
'>>context_dict_key'
or
'>context_dict_key.{contextkey1}. ... .keyN'
or
'>>context_dict_key.{contextkey1}. ... .keyN'
or
None

The context dict (see after the definitions) is used to format strings with the {contextkeyN} syntax. Instead of {contextkeyN}, one can use a simple string to put data in a static path.

Once the path is fully formatted, let’s say to key1.key2.keyN, the parser will store the value into the result dictionary at: {'key1':{'key2':{'keyN' : thevalue }}}

Example: let’s take the following data path

data path : 'disks.{name}.{var}'

if contextdict = {'name':'disk1','var':'size'}

then the formatted data path is 'disks.disk1.size'.
This means that the parsed data will be stored at:
``{'disks':{'disk1':{'size' : theparsedvalue depending on <out filter> }}}``

One can use the string [] at the end of the path: the groupdict will then be appended to a list, i.e. {'key1':{'key2':{'keyN' : [thevalue,...] }}}

If '>context_dict_key' is used, the data is not stored in the parsed data but in the context dict at the key context_dict_key. This way, you can defer the storage of the parsed data. To finally store it into the parsed data, use '<context_dict_key' as the <out filter> of some other rule. '>>context_dict_key' works like '>context_dict_key' except that it updates the data instead of replacing it (in other words: use > to start with an empty set of data, then use >> to update that data set). One can add dotted notation to complete the data path: >>context_dict_key.{contextkey1}. ... .keyN

If None is used, nothing is stored.
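
The path mechanics above can be sketched as follows (a hypothetical helper for illustration, not the actual textops code): the path is first formatted with the context dict, then walked through nested dictionaries, a trailing [] meaning "append to a list".

```python
def store_at_path(data, path, value, context):
    # Format the dotted path with the context dict, then store value
    # into nested dicts; a trailing '[]' appends to a list instead.
    path = path.format(**context)         # e.g. 'disks.{name}.{var}' -> 'disks.disk1.size'
    append = path.endswith('[]')
    keys = path[:-2].split('.') if append else path.split('.')
    d = data
    for k in keys[:-1]:
        d = d.setdefault(k, {})           # create intermediate dicts on demand
    if append:
        d.setdefault(keys[-1], []).append(value)
    else:
        d[keys[-1]] = value
    return data
```

For instance, store_at_path({}, 'disks.{name}.{var}', 'thevalue', {'name': 'disk1', 'var': 'size'}) builds {'disks': {'disk1': {'size': 'thevalue'}}}.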

<out filter>

is used to build the value to store,

it could be :

  • None : no filter is applied, the re.MatchObject.groupdict() is stored
  • a dict : mainly to initialize the deferred data set when using '>context_dict_key' in <out data path>
  • '<context_dict_key' to store data from context dict at key context_dict_key
  • a string : used as a format string with context dict, the formatted string is stored
  • a callable : to calculate the value to be stored and modify the context dict if needed. The re.MatchObject and the context dict are given as arguments; it must return a tuple: the value to store AND the new context dict (or None if unchanged)

How the parser works :

You have a document where the syntax may change from one section to another: just give a name to each of these kinds of sections; they will be your state names. The parser reads the input text line by line: for each line, it looks for the first matching rule in the states_patterns_desc table, then applies that rule. A rule has two parts: the matching parameters and the action parameters.

Matching parameters:
To match, a rule requires the parser to be in the specified state <if state> AND the line being parsed must match the pattern <pattern>. When the parser is at the first line, it has the default state top. The pattern follows the standard python re module syntax. It is important to note that you must capture the text you want to collect with the named-group capture syntax, that is (?P<mydata>mypattern). This way, the parser will store the text corresponding to mypattern into a context dict at the key mydata.
Action parameters:
Once the rule matches, the action is to store <out filter> into the final dictionary at the specified <out data path>.

Context dict :

The context dict is used within <out filter> and <out data path>; it is a dictionary that is PERSISTENT during the whole parsing process: it is empty at the beginning of the parsing and accumulates all captured patterns. For example, if a first rule pattern contains (?P<key1>.*),(?P<key2>.*) and matches the document line val1,val2, the context dict will be { 'key1' : 'val1', 'key2' : 'val2' }. Then, if a second rule pattern contains (?P<key2>.*):(?P<key3>.*) and matches the document line val4:val5, the context dict will be UPDATED to { 'key1' : 'val1', 'key2' : 'val4', 'key3' : 'val5' }. As you can see, the choice of key names is VERY IMPORTANT in order to avoid collisions across rules.
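
The matching loop described above can be sketched in a much-simplified, hypothetical form (flat storage of the ‘val’ group only; single-state <if state> strings; no context-dict persistence across lines, no out filters):

```python
import re

def state_pattern_sketch(text, rules):
    # Hypothetical mini version of the state/pattern loop: each rule is
    # (if_state, goto_state, pattern, out_key_format).
    data, state = {}, 'top'
    for line in text.splitlines():
        for if_state, goto, pattern, out in rules:
            if if_state and state != if_state:
                continue                      # rule not active in this state
            m = re.search(pattern, line)
            if not m:
                continue
            ctx = {k: (v or '').strip() for k, v in m.groupdict().items()}
            if out:
                # format the out path with captured groups, then normalize it
                key = re.sub(r'\W+', '_', out.format(**ctx)).strip('_')
                data[key] = ctx.get('val')
            if goto:
                state = goto                  # change state on match
            break                             # first matching rule wins
    return data
```

This reproduces, in flat form, the two-section example shown below: the first two rules only switch the state, while the last two only fire inside their section.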

Examples

>>> s = '''
... first name: Eric
... last name: Lapouyade'''
>>> s | state_pattern( (('',None,'(?P<key>.*):(?P<val>.*)','{key}','{val}'),) )
{'first_name': 'Eric', 'last_name': 'Lapouyade'}
>>> s | state_pattern( (('',None,'(?P<key>.*):(?P<val>.*)','{key}',None),) ) #doctest: +NORMALIZE_WHITESPACE
{'first_name': {'key': 'first name', 'val': 'Eric'},
'last_name': {'key': 'last name', 'val': 'Lapouyade'}}
>>> s | state_pattern((('',None,'(?P<key>.*):(?P<val>.*)','my.path.{key}','{val}'),))
{'my': {'path': {'first_name': 'Eric', 'last_name': 'Lapouyade'}}}
>>> s = '''Eric
... Guido'''
>>> s | state_pattern( (('',None,'(?P<val>.*)','my.path.info[]','{val}'),) )
{'my': {'path': {'info': ['Eric', 'Guido']}}}
>>> s = '''
... Section 1
... ---------
...   email = ericdupo@gmail.com
...
... Section 2
... ---------
...   first name: Eric
...   last name: Dupont'''
>>> s | state_pattern( (                                    #doctest: +NORMALIZE_WHITESPACE
... ('','section1','^Section 1',None,None),
... ('','section2','^Section 2',None,None),
... ('section1', '', '(?P<key>.*)=(?P<val>.*)', 'section1.{key}', '{val}'),
... ('section2', '', '(?P<key>.*):(?P<val>.*)', 'section2.{key}', '{val}')) )
{'section1': {'email': 'ericdupo@gmail.com'},
'section2': {'first_name': 'Eric', 'last_name': 'Dupont'}}
>>> s = '''
... Disk states
... -----------
... name: c1t0d0s0
... state: good
... fs: /
... name: c1t0d0s4
... state: failed
... fs: /home
...
... '''
>>> s | state_pattern( (                                    #doctest: +NORMALIZE_WHITESPACE
... ('top','disk',r'^Disk states',None,None),
... ('disk','top', r'^\s*$',None,None),
... ('disk', '', r'^name:(?P<diskname>.*)',None, None),
... ('disk', '', r'(?P<key>.*):(?P<val>.*)', 'disks.{diskname}.{key}', '{val}')) )
{'disks': {'c1t0d0s0': {'state': 'good', 'fs': '/'},
'c1t0d0s4': {'state': 'failed', 'fs': '/home'}}}
>>> s = '''
... {
... name: c1t0d0s0
... state: good
... fs: /
... },
... {
... fs: /home
... name: c1t0d0s4
... }
... '''
>>> s | state_pattern( (                                     #doctest: +NORMALIZE_WHITESPACE
... ('top','disk',r'{','>disk_info',{}),
... ('disk', '', r'(?P<key>.*):(?P<val>.*)', '>>disk_info.{key}', '{val}'),
... ('disk', 'top', r'}', 'disks.{disk_info[name]}', '<disk_info'),
... ) )
{'disks': {'c1t0d0s0': {'name': 'c1t0d0s0', 'state': 'good', 'fs': '/'},
'c1t0d0s4': {'fs': '/home', 'name': 'c1t0d0s4'}}}
>>> s='firstname:Eric lastname=Lapouyade'
>>> s | state_pattern((
... ('top','',r'firstname:(?P<val>\S+)','firstname','{val}'),
... ('top','',r'.*lastname=(?P<val>\S+)','lastname','{val}'),
... ))
{'firstname': 'Eric'}
>>> s='firstname:Eric lastname=Lapouyade'
>>> s | state_pattern((
... ('top','__continue__',r'firstname:(?P<val>\S+)','firstname','{val}'),
... ('top','',r'.*lastname=(?P<val>\S+)','lastname','{val}'),
... ))
{'firstname': 'Eric', 'lastname': 'Lapouyade'}