Filters
The following examples show how to use the filters in the configuration files.
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
STATUS: true
To follow along, save the initial configuration above as test.yaml and run the tests with:
watchpage --config test.yaml --result . --dump
This returns every link found on the page https://github.com/muflone/watchpage. The filters described in the following sections select or transform those results until you get what you need.
STARTS
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- STARTS: 'https://github.com/muflone/watchpage/tree/master/'
STATUS: true
This returns only the links starting with https://github.com/muflone/watchpage/tree/master/, with the following results:
https://github.com/muflone/watchpage/tree/master/.circleci
https://github.com/muflone/watchpage/tree/master/icons
https://github.com/muflone/watchpage/tree/master/watchpage
NOT STARTS
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- STARTS: 'https://pypi.org/'
- NOT STARTS: 'https://pypi.org/project/PyYAML/'
- NOT STARTS: 'https://pypi.org/project/beautifulsoup4/'
- NOT STARTS: 'https://pypi.org/project/html5lib/'
- NOT STARTS: 'https://pypi.org/project/lxml/'
STATUS: true
This returns only the links that start with https://pypi.org/ and do not start with any of the excluded prefixes, with the following results:
https://pypi.org/project/WatchPage/
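The chaining used above can be sketched in Python. This is only an illustration, assuming each filter receives the output of the previous one; the `links` sample is hand-picked from URLs mentioned in this document and the helper names `starts` and `not_starts` are hypothetical, not watchpage internals.

```python
# Hand-picked sample, not the real output of the page
links = [
    'https://github.com/muflone/watchpage/tree/master/icons',
    'https://pypi.org/project/WatchPage/',
    'https://pypi.org/project/PyYAML/',
    'https://pypi.org/project/lxml/',
    'http://www.muflone.com/watchpage/',
]


def starts(items, prefix):
    """Keep only the items starting with prefix (hypothetical helper)."""
    return [item for item in items if item.startswith(prefix)]


def not_starts(items, prefix):
    """Drop the items starting with prefix (hypothetical helper)."""
    return [item for item in items if not item.startswith(prefix)]


# Apply the filters in the same order as the FILTERS list
results = starts(links, 'https://pypi.org/')
for excluded in ('https://pypi.org/project/PyYAML/',
                 'https://pypi.org/project/lxml/'):
    results = not_starts(results, excluded)

print(results)  # ['https://pypi.org/project/WatchPage/']
```

Each NOT STARTS entry removes one more prefix, so the order of the exclusions does not matter here.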
ENDS
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- ENDS: '.zip'
STATUS: true
This returns only the links ending with .zip, with the following results:
https://github.com/muflone/watchpage/archive/refs/heads/master.zip
NOT ENDS
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- STARTS: 'https://pypi.org/'
- NOT ENDS: '/PyYAML/'
- NOT ENDS: '/beautifulsoup4/'
- NOT ENDS: '/html5lib/'
- NOT ENDS: '/lxml/'
STATUS: true
This returns only the links that start with https://pypi.org/ and do not end with any of the excluded suffixes, with the following results:
https://pypi.org/project/WatchPage/
CONTAINS
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: '/watchpage/tree/master/'
STATUS: true
This returns only the links containing the string /watchpage/tree/master/, with the following results:
https://github.com/muflone/watchpage/tree/master/.circleci
https://github.com/muflone/watchpage/tree/master/icons
https://github.com/muflone/watchpage/tree/master/watchpage
NOT CONTAINS
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: '/watchpage/'
- NOT CONTAINS: 'github.com'
STATUS: true
This returns only the links that contain the string /watchpage/ and do not contain the string github.com, with the following results:
http://www.muflone.com/watchpage/
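A rough Python equivalent of this pair of filters uses the `in` operator. The sketch assumes the match is case-sensitive, which is consistent with https://pypi.org/project/WatchPage/ not appearing in the results above; the `links` sample is hand-picked for illustration.

```python
# Hand-picked sample, not the real output of the page
links = [
    'https://github.com/muflone/watchpage/tree/master/icons',
    'https://pypi.org/project/WatchPage/',
    'http://www.muflone.com/watchpage/',
]

# CONTAINS: '/watchpage/' (the PyPI link is dropped because of its capital letters)
contains = [link for link in links if '/watchpage/' in link]

# NOT CONTAINS: 'github.com'
results = [link for link in contains if 'github.com' not in link]

print(results)  # ['http://www.muflone.com/watchpage/']
```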
REGEX
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- REGEX: '^(http|https):\/\/github\.com\/.*/watchpage\/tree\/master\/'
STATUS: true
This returns only the links matching the specified regular expression, with the following results:
https://github.com/muflone/watchpage/tree/master/.circleci
https://github.com/muflone/watchpage/tree/master/icons
https://github.com/muflone/watchpage/tree/master/watchpage
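To try the pattern outside watchpage, it can be tested with Python's `re` module. This sketch assumes the expression follows Python regular expression syntax; whether watchpage searches anywhere in the text or only matches from the start is not stated here, but the leading `^` makes both behave the same for this pattern. The `links` sample is hand-picked.

```python
import re

# Same expression as in the configuration above
pattern = re.compile(
    r'^(http|https):\/\/github\.com\/.*/watchpage\/tree\/master\/')

# Hand-picked sample, not the real output of the page
links = [
    'https://github.com/muflone/watchpage/tree/master/icons',
    'https://github.com/muflone/watchpage/archive/refs/heads/master.zip',
    'http://www.muflone.com/watchpage/',
]

# Keep only the links where the pattern is found
results = [link for link in links if pattern.search(link)]

print(results)  # ['https://github.com/muflone/watchpage/tree/master/icons']
```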
NOT REGEX
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: '/watchpage'
- NOT REGEX: '(github|circleci)'
STATUS: true
This returns only the links that contain the string /watchpage and do not match the specified regular expression, with the following results:
http://www.muflone.com/watchpage/
TRIM
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- TRIM: '/:.watchpge'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes any of the specified characters from both the start and the end of the text, with the following results:
muflone.com
LTRIM
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- LTRIM: ':/htp'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes any of the specified characters from the start of the text, with the following results:
www.muflone.com/watchpage/
RTRIM
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- RTRIM: '/watchpge'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes any of the specified characters from the end of the text, with the following results:
http://www.muflone.com
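The results of TRIM, LTRIM and RTRIM match what Python's `str.strip`, `str.lstrip` and `str.rstrip` produce, where the argument is a set of characters rather than a substring. The sketch below assumes the filters behave the same way; it reproduces the three results shown above.

```python
url = 'http://www.muflone.com/watchpage/'

# TRIM: every leading and trailing character found in the set is removed,
# stopping at the first character that is not in the set ('m' on both sides)
trimmed = url.strip('/:.watchpge')

# LTRIM: only the start of the text is affected
left_trimmed = url.lstrip(':/htp')

# RTRIM: only the end of the text is affected
right_trimmed = url.rstrip('/watchpge')

print(trimmed)        # muflone.com
print(left_trimmed)   # www.muflone.com/watchpage/
print(right_trimmed)  # http://www.muflone.com
```

The order of the characters in the argument is irrelevant, which is why `'/watchpge'` lists each letter of `watchpage` only once.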
PREPEND
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- LTRIM: ':/htp'
- PREPEND: 'https://'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes any of the specified characters from the start of the text, then adds the string https:// at the start, with the following results:
https://www.muflone.com/watchpage/
APPEND
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- RTRIM: '/watchpge'
- APPEND: '/watchpage/'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes any of the specified characters from the end of the text, then adds the string /watchpage/ at the end, with the following results:
http://www.muflone.com/watchpage/
REMOVE
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REMOVE: 'www.'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes the specified string, with the following results:
http://muflone.com/watchpage/
REMOVE LEFT
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REMOVE LEFT: 'http://'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes the specified string from the start of the text, with the following results:
www.muflone.com/watchpage/
REMOVE RIGHT
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REMOVE RIGHT: 'watchpage/'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, removes the specified string from the end of the text, with the following results:
http://www.muflone.com/
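REMOVE LEFT and REMOVE RIGHT work on a whole string, while LTRIM and RTRIM work on a set of characters. The difference can be sketched with Python's `str.removeprefix` and `str.removesuffix` (Python 3.9 or later); treating these as equivalents of the filters is an assumption based on the results shown in this document.

```python
url = 'http://www.muflone.com/watchpage/'

# REMOVE LEFT: 'http://' removes the exact prefix, if present
no_scheme = url.removeprefix('http://')

# REMOVE RIGHT: 'watchpage/' removes the exact suffix, if present
no_path = url.removesuffix('watchpage/')

# A character-set strip with the same argument goes further: it also
# removes the slash before 'watchpage/' because '/' is in the set
stripped = url.rstrip('watchpage/')

print(no_scheme)  # www.muflone.com/watchpage/
print(no_path)    # http://www.muflone.com/
print(stripped)   # http://www.muflone.com
```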
REPLACE
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REPLACE: '/watchpage/'
  WITH: '/gwakeonlan/'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, replaces the string /watchpage/ with /gwakeonlan/, with the following results:
http://www.muflone.com/gwakeonlan/
REVERSE
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REVERSE:
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, reverses the text, with the following results:
/egaphctaw/moc.enolfum.www//:ptth
UPPER
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'pypi.org/project/WatchPage'
- UPPER:
STATUS: true
This returns only the links containing the string pypi.org/project/WatchPage and, for each of them, converts the text to upper case, with the following results:
HTTPS://PYPI.ORG/PROJECT/WATCHPAGE/
LOWER
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'pypi.org/project/WatchPage'
- LOWER:
STATUS: true
This returns only the links containing the string pypi.org/project/WatchPage and, for each of them, converts the text to lower case, with the following results:
https://pypi.org/project/watchpage/
LEFT
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- LEFT: 23
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, returns the first 23 characters, with the following results:
http://www.muflone.com/
RIGHT
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- RIGHT: 26
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, returns the last 26 characters, with the following results:
www.muflone.com/watchpage/
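The character counts used by LEFT and RIGHT can be checked quickly with Python slicing. This is only a way to verify the numbers, assuming the filters count characters the same way a Python string does.

```python
url = 'http://www.muflone.com/watchpage/'

# LEFT: 23 keeps the first 23 characters
left = url[:23]

# RIGHT: 26 keeps the last 26 characters
right = url[-26:]

print(len(url))  # 33
print(left)      # http://www.muflone.com/
print(right)     # www.muflone.com/watchpage/
```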
REGEX REPLACE
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REGEX REPLACE: '[w]{3}\.'
  WITH: ''
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, replaces the text matching the specified regular expression with the new value (here an empty string), with the following results:
http://muflone.com/watchpage/
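The same substitution can be reproduced with Python's `re.sub`. Note that `re.sub` replaces every match in the text; whether REGEX REPLACE does the same when a pattern matches more than once is an assumption, since this example contains a single match.

```python
import re

url = 'http://www.muflone.com/watchpage/'

# Replace three consecutive 'w' followed by a literal dot with an empty string
result = re.sub(r'[w]{3}\.', '', url)

print(result)  # http://muflone.com/watchpage/
```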
REGEX SEARCH
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- REGEX SEARCH: '(http|https):\/\/[w]{3}\.muflone\.com\/'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, extracts the text matching the specified regular expression, with the following results:
http://www.muflone.com/
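In Python terms the result above corresponds to the whole match, not to the `(http|https)` capture group. A minimal sketch with `re.search`, assuming REGEX SEARCH returns the full matched text as the result suggests:

```python
import re

url = 'http://www.muflone.com/watchpage/'

match = re.search(r'(http|https):\/\/[w]{3}\.muflone\.com\/', url)

# group(0) is the full match; group(1) would be only the scheme
result = match.group(0) if match else None

print(result)  # http://www.muflone.com/
```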
JSON DICT
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- PREPEND: '{ "name": "Test", "value": "This is a sample dictionary" }'
- LEFT: 58
- JSON DICT: 'value'
- TRIM: '"'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, prepends the specified JSON text, keeps only the first 58 characters (the JSON text itself), extracts the item named value and removes the surrounding quotes, with the following results:
This is a sample dictionary
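The steps of this example can be followed one by one in Python. The sketch assumes that JSON DICT returns the selected item still encoded as JSON, which would explain why a TRIM of the quotes follows; that detail is inferred from the example, not confirmed.

```python
import json

link = 'http://www.muflone.com/watchpage/'
payload = '{ "name": "Test", "value": "This is a sample dictionary" }'

# PREPEND then LEFT: 58 leaves only the JSON text (58 characters long)
text = (payload + link)[:58]

# JSON DICT: 'value' (assumed to yield the item re-encoded as JSON)
encoded = json.dumps(json.loads(text)['value'])

# TRIM: '"' removes the surrounding quotes
result = encoded.strip('"')

print(len(payload))  # 58
print(encoded)       # "This is a sample dictionary"
print(result)        # This is a sample dictionary
```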
JSON LIST
NAME: watchpage
URL: github:muflone/watchpage
PARSER: html5lib
TYPE: links
ABSOLUTE_URLS: true
FILTERS:
- CONTAINS: 'www.muflone.com'
- PREPEND: '[{ "name": "Data 1" }, { "name": "Data 2" }]'
- LEFT: 44
- JSON LIST: 1
- JSON DICT: 'name'
- TRIM: '"'
STATUS: true
This returns only the links containing the string www.muflone.com and, for each of them, prepends the specified JSON text, keeps only the first 44 characters (the JSON text itself), extracts the second item from the list (index 1), then extracts the item named name and removes the surrounding quotes, with the following results:
Data 2
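The list index is zero-based, so `JSON LIST: 1` selects the second item. A Python sketch of the same chain follows; it assumes each JSON filter passes its result to the next filter as JSON text, which is inferred from the example.

```python
import json

link = 'http://www.muflone.com/watchpage/'
payload = '[{ "name": "Data 1" }, { "name": "Data 2" }]'

# PREPEND then LEFT: 44 leaves only the JSON text (44 characters long)
text = (payload + link)[:44]

# JSON LIST: 1 picks the second item (assumed to be re-encoded as JSON)
item_text = json.dumps(json.loads(text)[1])

# JSON DICT: 'name', then TRIM: '"' to drop the surrounding quotes
result = json.dumps(json.loads(item_text)['name']).strip('"')

print(len(payload))  # 44
print(result)        # Data 2
```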