Hi everyone. I'm the creator of Mturk Engine. I'm putting the finishing touches on the next major version which includes full compatibility with the new site as well as some new features. I aim to release within the next 24 hours barring unforeseen circumstances.
I faced a lot of challenges migrating to the new site, but I believe it is much more secure and ultimately a lot easier to work with, and it's easier than ever to write your own scripts to scrape data. I thought I'd share what I've learned about the new site in the hope that it helps other developers update their scripts as soon as possible.
The JSON API.
Most pages (search, dashboard, queue, hits, status detail pages) will return data as JSON when requested with a format=json parameter and a responseType set to 'json'. I'm maintaining some documentation describing the response returned by each of these pages, which is available here: https://github.com/Anveio/mturk-engine/blob/master/src/worker-mturk-api.d.ts
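As a rough illustration of the format=json request described above, here is a minimal sketch. The origin in WORKER_ORIGIN, the /tasks path in the usage note, and the helper names are assumptions for illustration and are not taken from the post; the returned object simply mirrors the url and responseType options that many request libraries accept.

```typescript
// Assumption: the origin of the new Worker site; the post does not state it.
const WORKER_ORIGIN = "https://worker.mturk.com";

// Appends format=json to a path, keeping any query string already present.
function toJsonUrl(path: string): string {
  const separator = path.indexOf("?") === -1 ? "?" : "&";
  return `${WORKER_ORIGIN}${path}${separator}format=json`;
}

// Hypothetical request description: pass it to whichever HTTP client you use.
interface JsonRequest {
  url: string;
  responseType: "json";
}

function jsonRequest(path: string): JsonRequest {
  return { url: toJsonUrl(path), responseType: "json" };
}
```

For example, jsonRequest("/tasks") describes a request for a hypothetical /tasks page with format=json appended, ready to hand to an XHR wrapper that understands responseType.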
An alternative to the JSON API is to query the DOM for the node containing a data-react-props attribute, which can be converted into an object via JSON.parse(node.dataset.reactProps) or JSON.parse(node.getAttribute('data-react-props')). Here's the code that does that in Mturk Engine.
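The data-react-props approach could be sketched roughly as follows. This is not the Mturk Engine code; parseReactProps and the AttributeSource interface are hypothetical names, and the interface exists only so the function accepts a real DOM element (for example the result of document.querySelector('[data-react-props]')) as well as a plain stand-in object.

```typescript
// Anything exposing getAttribute, which includes DOM Element objects.
interface AttributeSource {
  getAttribute(name: string): string | null;
}

// Reads data-react-props from a node and parses it as JSON.
// Returns null when the attribute is missing or is not valid JSON.
function parseReactProps<T>(node: AttributeSource): T | null {
  const raw = node.getAttribute("data-react-props");
  if (raw === null) {
    return null;
  }
  try {
    return JSON.parse(raw) as T;
  } catch {
    return null;
  }
}
```

The type parameter T is whatever shape you expect for that page; the cast is unchecked, so validate the fields you rely on before using them.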
Search
Fetching search results is easily done via the JSON API. The data you get back is in this format: https://github.com/Anveio/mturk-engine/blob/master/src/worker-mturk-api.d.ts#L3-L8 . A potential stumbling block is formatting your search parameters according to the new site's format. I use the qs library and pass { arrayFormat: 'brackets' } as the second argument to .stringify. The search parameters that the new site accepts (that I know of) are listed here.
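To show what bracket-style parameters look like on the wire, here is a small hand-rolled serializer. It is a stand-in written for illustration, not the qs library itself, and it only approximates what qs produces with arrayFormat set to 'brackets'. The parameter names in the usage note (filters, qualified, page_size) are hypothetical and not confirmed by the post.

```typescript
type ParamValue =
  | string
  | number
  | boolean
  | ParamValue[]
  | { [key: string]: ParamValue };

// Serializes nested params as key[child]=value and arrays as key[]=value,
// percent-encoding both keys and values.
function toBracketQuery(params: { [key: string]: ParamValue }): string {
  const pairs: string[] = [];

  const visit = (key: string, value: ParamValue): void => {
    if (Array.isArray(value)) {
      for (const item of value) {
        visit(`${key}[]`, item);
      }
    } else if (typeof value === "object") {
      for (const child of Object.keys(value)) {
        visit(`${key}[${child}]`, value[child]);
      }
    } else {
      pairs.push(`${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
    }
  };

  for (const key of Object.keys(params)) {
    visit(key, params[key]);
  }
  return pairs.join("&");
}
```

Calling toBracketQuery({ filters: { qualified: true }, page_size: 25 }) yields filters%5Bqualified%5D=true&page_size=25, where %5B and %5D are the encoded square brackets.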
Queue, Dashboard, Status Detail pages.
These are all available through the JSON API. An important difference with Status Detail pages is that the date in their URL is in YYYY-MM-DD format rather than MMDDYYYY as it was before. I use moment.js to massage the dates into the format I need, but plan on switching to date-fns when I get the chance in order to reduce bundle size. Something I noticed is that hits in status detail pages now have an additional identifier called assignment_id, but hit_id is how I uniquely indexed submitted hits on the old site. I'm not 100% sure what the purpose of assignment_id is other than being a parameter of an accepted hit page.
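If all you need is to turn an old-style MMDDYYYY string into the new YYYY-MM-DD form, a date library is not strictly required. The sketch below uses plain string handling rather than moment.js or date-fns; legacyDateToIso is a hypothetical helper name, and it checks only the shape of the input, not whether the date actually exists on the calendar.

```typescript
// Converts "MMDDYYYY" (old site) to "YYYY-MM-DD" (new site).
// Throws if the input is not exactly eight digits.
function legacyDateToIso(legacy: string): string {
  const match = /^(\d{2})(\d{2})(\d{4})$/.exec(legacy);
  if (match === null) {
    throw new Error(`Expected MMDDYYYY, got "${legacy}"`);
  }
  const [, month, day, year] = match;
  return `${year}-${month}-${day}`;
}
```

For example, legacyDateToIso("09152017") returns "2017-09-15".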
Returning A HIT
Returning a HIT on the Worker site is a lot different. Instead of sending a GET request, you must send a POST request to the URL of an accepted HIT. Your form data will need to include a _method field with a value of 'delete' as well as an authenticity token. The authenticity token is retrieved from a hidden form located on the return button itself. I wasn't able to retrieve the authenticity token from HITs in queue as the return button is not initially rendered. I may be doing something wrong, but returning a HIT also caused a redirect to the sign-in page, so my current implementation assumes that if that happened, the return went through successfully.
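The form data for a return request might be assembled along these lines. This is a sketch under assumptions: the post names the _method field but not the token's field name, so authenticity_token is a guess, and signInMarker is a hypothetical string you would supply yourself to recognise the sign-in page.

```typescript
// Builds an application/x-www-form-urlencoded body for returning a HIT.
// Assumption: the token field is called "authenticity_token".
function buildReturnHitBody(authenticityToken: string): string {
  const fields: [string, string][] = [
    ["_method", "delete"],
    ["authenticity_token", authenticityToken],
  ];
  return fields
    .map(([name, value]) => `${encodeURIComponent(name)}=${encodeURIComponent(value)}`)
    .join("&");
}

// Mirrors the heuristic from the post: landing on the sign-in page after
// the POST is treated as a successful return.
function returnLooksSuccessful(finalUrl: string, signInMarker: string): boolean {
  return finalUrl.indexOf(signInMarker) !== -1;
}
```

You would POST the resulting string to the URL of the accepted HIT with a Content-Type of application/x-www-form-urlencoded, then pass the final response URL to returnLooksSuccessful.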
Accepting a HIT
Accepting a HIT via XHR (like what Turkmaster does) on the Worker site was by far my biggest hurdle. A successful accept from an accept_random link will redirect to a non-HTTPS page, triggering users' browsers to cancel the request for security reasons. I was able to figure out a workaround, which I describe here. I've submitted feedback requesting that the redirect from accept_random go directly to an HTTPS page, which would be a lifesaver.
With the migration out of the way, I can hopefully focus once again on building new features.
Post Details
- Posted
- 7 years ago