I have written a crawler that fetches university resources as they get uploaded, and I was wondering what the Pythonic way to create the list of dictionaries is in this case. From what I have read, the Hitchhiker's Guide to Python code style guide says that unit = {} is not a recommended convention.
Also, would it be advisable to change my regex string to a compiled regex? Performance-wise I know that re caches compiled patterns anyway, so I gather the benefit would mainly be readability.
# Imports are assumed; the original post does not show them.
import re
from html import unescape
from bs4 import BeautifulSoup as bs

def parse_config():
    soup = bs(open('config.xml').read(), 'lxml')
    units = []
    for u in soup.find_all('unit'):
        unit = {}
        # Strip the surrounding tags from each child element's markup.
        unit['unitname'] = re.sub('<[^<]+?>', '', str(u.unitname))
        unit['unitcode'] = re.sub('<[^<]+?>', '', str(u.unitcode))
        unit['directory'] = re.sub('<[^<]+?>', '', str(u.directory))
        unit['url'] = unescape(re.sub('<[^<]+?>', '', str(u.url)))
        unit['semester'] = re.sub('<[^<]+?>', '', str(u.semester))
        unit['year'] = re.sub('<[^<]+?>', '', str(u.year))
        units.append(unit)
    return units
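On the compiled-regex question, here is a minimal sketch of what that variant could look like. To keep it runnable without config.xml or BeautifulSoup it works on plain strings of markup; the names TAG_RE, FIELDS, strip_tags and build_unit, and the sample data, are made up for illustration and are not from the post.

```python
import re
from html import unescape

# Compiled once at import time and given a name, which is mostly a
# readability win since re caches compiled patterns anyway.
TAG_RE = re.compile(r'<[^<]+?>')

# Hypothetical tuple of the fields each unit carries.
FIELDS = ('unitname', 'unitcode', 'directory', 'url', 'semester', 'year')


def strip_tags(markup):
    """Remove anything that looks like a tag from a string of markup."""
    return TAG_RE.sub('', markup)


def build_unit(raw):
    """Build one unit dict from a mapping of field name -> markup string."""
    unit = {field: strip_tags(raw[field]) for field in FIELDS}
    # Only the URL needs entity unescaping, as in the original code.
    unit['url'] = unescape(unit['url'])
    return unit


# Made-up sample data standing in for one <unit> element.
raw_unit = {
    'unitname': '<unitname>Algorithms</unitname>',
    'unitcode': '<unitcode>ABC123</unitcode>',
    'directory': '<directory>algorithms</directory>',
    'url': '<url>http://example.edu/a?x=1&amp;y=2</url>',
    'semester': '<semester>1</semester>',
    'year': '<year>2016</year>',
}

units = [build_unit(raw) for raw in [raw_unit]]
```

The dict comprehension builds each unit in one expression instead of starting from an empty dict and assigning key by key, and the list comprehension does the same for the list of units.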
edit: Best solution was provided by /u/gschizas here and here.
As said in a comment, I had actually already worked on a refactor which was:
def parse_config():
    soup = bs(open('config.xml').read(), 'lxml')
    for u in soup.find_all('unit'):
        yield {'name': clean_tags(u.unitname),
               'code': clean_tags(u.unitcode),
               'directory': clean_tags(u.directory),
               'url': unescape(clean_tags(u.url)),
               'semester': clean_tags(u.semester),
               'year': clean_tags(u.year)}
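The refactor relies on a clean_tags helper that the post does not show. A rough sketch of how it might look, assuming it simply stringifies the node and strips tags with the same pattern as the first version, is below. SimpleNamespace objects stand in for the parsed unit elements so the example runs without BeautifulSoup, and parse_units, TAG_RE and the sample data are hypothetical names and values.

```python
import re
from html import unescape
from types import SimpleNamespace

TAG_RE = re.compile(r'<[^<]+?>')


def clean_tags(node):
    """Guess at the unshown helper: stringify the node, strip tag markup."""
    return TAG_RE.sub('', str(node))


def parse_units(nodes):
    """Yield one dict per unit, mirroring the generator in the refactor."""
    for u in nodes:
        yield {'name': clean_tags(u.unitname),
               'code': clean_tags(u.unitcode),
               'directory': clean_tags(u.directory),
               'url': unescape(clean_tags(u.url)),
               'semester': clean_tags(u.semester),
               'year': clean_tags(u.year)}


# Stand-in for one parsed <unit> element; the values are invented.
fake_unit = SimpleNamespace(
    unitname='<unitname>Databases</unitname>',
    unitcode='<unitcode>XYZ456</unitcode>',
    directory='<directory>databases</directory>',
    url='<url>http://example.edu/db?a=1&amp;b=2</url>',
    semester='<semester>2</semester>',
    year='<year>2016</year>',
)

# A generator is lazy, so wrap it in list() when an actual list is needed.
units = list(parse_units([fake_unit]))
```

Because the function yields, callers that only loop over the units never build the full list, while callers that need indexing or len() can still materialise it with list().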
Just wanted some other insights into how to solve this problem.
edit 2: Should have checked which code style I was remembering :( Changed pep8 to Hitchhiker's Guide to Python.
Posted 8 years ago in r/learnpython.