14 Scraping
Benjamin Jonard edited this page 2023-12-08 14:19:05 +01:00

Page under construction

A scraper builder is being implemented. For now it provides only basic features, but it should still cover most use cases. The scraper supports HTML only (XML should work too, but requires more testing). JSON scraping for APIs will come later, once the HTML scraper is stable.

⚠️ Keep in mind that this feature is experimental; some bugs are expected.

How to use the scraper builder

The syntax is based on xPath. You can find a tutorial here, for example: https://www.w3schools.com/xml/xpath_intro.asp.

I would recommend using a browser extension like Try xPath to test your xPath expressions.

Here is an example for Discogs

Firefox_Screenshot_2023-08-05T12-51-42 033Z

  1. xPath for the item name.
  2. URL pattern this scraper will be used for. Optional, it only speeds up the process by automatically selecting the correct scraper based on the provided URL.
  3. xPath for the image src. It must resolve to a URL (the image will be downloaded from that URL when the item is submitted).
  4. xPaths for additional data.
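
To illustrate point 2, here is a minimal Python sketch of how a scraper could be picked automatically from a URL. It is not Koillection's actual code: the `SCRAPERS` registry and the `pick_scraper` function are hypothetical, and it assumes the pattern is treated as a plain substring of the URL, which this page does not specify.

```python
from typing import Optional

# Hypothetical registry: scraper name -> URL pattern
# (values taken from the examples further down this page).
SCRAPERS = {
    "Discogs - release": "https://www.discogs.com/release/",
    "MyFigureCollection": "https://myfigurecollection.net/item/",
}


def pick_scraper(url: str, scrapers: dict[str, str] = SCRAPERS) -> Optional[str]:
    """Return the first scraper whose pattern occurs in the URL, else None.

    Assumption: the pattern is matched as a plain substring.
    """
    for name, pattern in scrapers.items():
        # An empty pattern means "no automatic selection" for that scraper.
        if pattern and pattern in url:
            return name
    return None


print(pick_scraper("https://www.discogs.com/release/12345-Some-Album"))
# Discogs - release
```

When no pattern matches, the function returns None, which corresponds to choosing the scraper by hand.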

Syntax

Here is the full xPath from the previous example:

Firefox_Screenshot_2023-07-27T09-36-57 916Z

As mentioned above, the scraper uses xPath syntax.

On top of that, each xPath MUST be wrapped in # characters (one before, one after). This makes it possible to use multiple xPaths for the same field.

In the example, #//h1/span/a/text()# - #//h1/text()[2]# will result in something like The band name - The album name.
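
The substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Koillection implements it: `render_template` is a hypothetical helper, and the xPath engine is replaced by a dictionary of canned results so that the example runs on its own. It also assumes that an xPath never contains a # character itself.

```python
import re
from typing import Callable

# Everything between two '#' characters is treated as one xPath expression.
SEGMENT = re.compile(r"#([^#]+)#")


def render_template(template: str, evaluate: Callable[[str], str]) -> str:
    """Replace each #xpath# segment with the value returned by evaluate().

    Text outside the '#' pairs (here " - ") is kept as-is.
    """
    return SEGMENT.sub(lambda match: evaluate(match.group(1)), template)


# Stand-in for a real xPath engine: canned results for the two expressions.
fake_results = {
    "//h1/span/a/text()": "The band name",
    "//h1/text()[2]": "The album name",
}

template = "#//h1/span/a/text()# - #//h1/text()[2]#"
print(render_template(template, fake_results.__getitem__))
# The band name - The album name
```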

Additional data

Three types of data fields are supported for now:

  • Text -> if your xPath matches multiple strings, they will be concatenated using commas
  • List -> each xPath match will create a new list element
  • Country -> will try to match either the full country name or the alpha-2 or alpha-3 code defined by ISO 3166
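
As a rough illustration of the three types, the Python sketch below turns a list of xPath matches into a field value. It is not the actual implementation: `build_field`, `to_country` and the three-entry `COUNTRIES` table are hypothetical, and it assumes a ", " separator for Text, case-insensitive country matching, and the alpha-2 code as the stored value.

```python
from typing import Optional, Union

# Tiny hypothetical excerpt of ISO 3166: (name, alpha-2, alpha-3).
COUNTRIES = [
    ("France", "FR", "FRA"),
    ("Japan", "JP", "JPN"),
    ("United States of America", "US", "USA"),
]


def to_country(value: str) -> Optional[str]:
    """Return the alpha-2 code for a full name, alpha-2 or alpha-3 code."""
    needle = value.strip().casefold()
    for name, alpha2, alpha3 in COUNTRIES:
        if needle in (name.casefold(), alpha2.casefold(), alpha3.casefold()):
            return alpha2
    return None  # no match: the field stays empty


def build_field(field_type: str, matches: list[str]) -> Union[str, list[str], None]:
    """Turn the strings matched by an xPath into a field value."""
    if field_type == "text":
        # All matches are concatenated with commas.
        return ", ".join(matches)
    if field_type == "list":
        # Each match becomes its own list element.
        return list(matches)
    if field_type == "country":
        # Assumption: only the first match is considered.
        return to_country(matches[0]) if matches else None
    raise ValueError(f"unknown field type: {field_type}")


print(build_field("text", ["Rock", "Pop"]))     # Rock, Pop
print(build_field("list", ["Intro", "Outro"]))  # ['Intro', 'Outro']
print(build_field("country", ["JPN"]))          # JP
```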

Usage example

On the Item or Collection creation form, click on the scrape button. Choose the scraper you want to use and enter the URL to scrape.

Alternatively, you can upload an HTML file if the URL isn't publicly accessible.

Firefox_Screenshot_2023-08-05T12-55-29 489Z

Using the Discogs example from above, here is the result:

Screenshot 2023-07-27 at 11-33-19 Koillection

Import/export

To make it easier for people to share their scrapers, an import/export function is available. Here are two scrapers I've been using, as examples (save each one in its own JSON file):

Discogs (release page)

{"name":"Discogs - release","namePath":"#\/\/h1\/span\/a\/text()# - #\/\/h1\/text()[2]#","imagePath":"#(\/\/img)[2]\/@src#","urlPattern":"https:\/\/www.discogs.com\/release\/","dataPaths":[{"name":"Style","path":"#\/\/th[contains(text(),'Style')]\/ancestor::tr\/td\/a\/text()#","type":"text","position":1},{"name":"Country","path":"#\/\/th[contains(text(),'Country')]\/ancestor::tr\/td\/a\/text()#","type":"country","position":2},{"name":"Tracks","path":"#\/\/td[contains(@class,'trackTitle')]\/span\/text()# - #\/\/td[contains(@class,'duration')]\/span\/text()#","type":"list","position":3}]}

MyFigureCollection

{"name":"MyFigureCollection","namePath":"#\/\/span[@class='headline']\/text()#","imagePath":"#\/\/a[@class='main']\/img\/@src#","urlPattern":"https:\/\/myfigurecollection.net\/item\/","dataPaths":[{"name":"Origin","path":"#\/\/div[contains(text(),'Origin')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":1},{"name":"Character","path":"#\/\/div[contains(text(),'Character')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":2},{"name":"Version","path":"#\/\/div[contains(text(),'Version')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/text()#","type":"text","position":3},{"name":"Company","path":"#\/\/div[contains(text(),'Company')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":4},{"name":"Classification","path":"#\/\/div[contains(text(),'Classification')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":5},{"name":"Sculpted by","path":"#\/\/small[contains(text(),'As Sculptor')]\/ancestor::a\/span\/text()#","type":"text","position":6},{"name":"Illustrated by","path":"#\/\/small[contains(text(),'As Illustrator')]\/ancestor::a\/span\/text()#","type":"text","position":7},{"name":"Designed by","path":"#\/\/small[contains(text(),'As Designer')]\/ancestor::a\/span\/text()#","type":"text","position":8},{"name":"Color production by","path":"#\/\/small[contains(text(),'As Color producer')]\/ancestor::a\/span\/text()#","type":"text","position":9},{"name":"Material","path":"#\/\/div[contains(text(),'Material')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":10},{"name":"Scale","path":"#\/\/a[contains(@class,'item-scale')]\/small\/text()##\/\/a[contains(@class,'item-scale')]\/text()#","type":"text","position":11},{"name":"Country","path":"#substring-before(substring-after(\/\/div[contains(text(),'Releases')]\/ancestor::div\/div[contains(@class, 'form-input')][1]\/small\/em\/text(), '('),')')#","type":"country","position":12}]}
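
If you write or edit such a file by hand, a quick check before importing can save some trial and error. The Python sketch below is a hypothetical helper, not part of Koillection: it relies only on the field names visible in the exports above (namePath, imagePath, dataPaths, path) and flags paths that contain no #...# expression. It also shows that the \/ sequences in the exports are ordinary JSON escapes for /, so plain slashes work just as well in a hand-written file.

```python
import json
import re

# At least one xPath wrapped in '#' characters.
WRAPPED = re.compile(r"#[^#]+#")


def check_scraper(raw: str) -> list[str]:
    """Return a list of problems found in an exported scraper definition."""
    scraper = json.loads(raw)  # raises ValueError if the JSON is malformed
    paths = {
        "namePath": scraper.get("namePath", ""),
        "imagePath": scraper.get("imagePath", ""),
    }
    for entry in scraper.get("dataPaths", []):
        paths[f"dataPaths[{entry.get('name')}]"] = entry.get("path", "")

    problems = []
    for label, path in paths.items():
        # Empty paths are skipped; only filled-in ones are checked.
        if path and not WRAPPED.search(path):
            problems.append(f"{label}: no #...# expression found")
    return problems


# Made-up definition; imagePath is deliberately missing its '#' wrappers.
raw = r'''{"name":"Demo","namePath":"#\/\/h1\/text()#","imagePath":"(\/\/img)[2]\/@src","dataPaths":[{"name":"Style","path":"#\/\/td\/a\/text()#","type":"text","position":1}]}'''

print(json.loads(raw)["namePath"])  # #//h1/text()#
print(check_scraper(raw))           # ['imagePath: no #...# expression found']
```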

Limitations

Some websites have protections against scraping, especially e-commerce websites.