Scraping

mirror of https://github.com/benjaminjonard/koillection.git synced 2025-12-27 22:43:12 +00:00

Table of Contents

How to use the scraper builder
Syntax
Additional data
Usage example
Import/export

Discogs (release page)
MyFigureCollection

Limitations

Page under construction

A scraper builder is being implemented. For now it only provides some basic features, but it should still cover most use cases. The scraper supports HTML only (XML should work too, but requires more testing). JSON scraping for APIs will come later, once the HTML scraper is stable.

⚠️ Keep in mind that this feature is experimental, so some bugs are expected.

How to use the scraper builder

The syntax is based on xPath. You can find a tutorial here, for example: https://www.w3schools.com/xml/xpath_intro.asp.

I would recommend using a browser extension like Try xPath to test your xPath expressions.

Here is an example for Discogs

xPath for the item name.
URL pattern this scraper will be used for. Optional, it only speeds up the process by automatically selecting the correct scraper based on the provided URL.
xPath for the image src, must be a URL (the image will be downloaded from that URL on item submit).
xPaths for additional data
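To illustrate what the optional URL pattern is for, here is a minimal Python sketch of automatic scraper selection. It assumes the pattern is compared as a simple prefix of the submitted URL, which this page does not confirm; `pick_scraper` and `SCRAPERS` are hypothetical names, not part of Koillection.

```python
from typing import List, Optional

# Hypothetical in-memory list of scrapers, reduced to the fields needed here.
SCRAPERS = [
    {"name": "Discogs - release", "urlPattern": "https://www.discogs.com/release/"},
    {"name": "MyFigureCollection", "urlPattern": "https://myfigurecollection.net/item/"},
    {"name": "No pattern", "urlPattern": None},
]


def pick_scraper(url: str, scrapers: List[dict]) -> Optional[dict]:
    """Return the first scraper whose urlPattern is a prefix of url, else None."""
    for scraper in scrapers:
        pattern = scraper.get("urlPattern")
        # Assumption: prefix comparison. A scraper without a pattern is never
        # auto-selected and has to be chosen by hand.
        if pattern and url.startswith(pattern):
            return scraper
    return None


match = pick_scraper("https://www.discogs.com/release/12345-Some-Album", SCRAPERS)
print(match["name"] if match else "no scraper matched, choose one manually")
```

The release id in the URL above is made up for the example.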

Syntax

Here is the full xPath from the previous example:

As mentioned above, the scraper uses xPath syntax.

On top of that, each xPath MUST be wrapped in #. This makes it possible to use multiple xPaths for the same field, with any text outside the # markers kept as-is.

In the example #//h1/span/a/text()# - #//h1/text()[2]# will result in something like The band name - The album name
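The # wrapping can be pictured as a small template language. The Python sketch below is only an illustration of that idea, not the scraper's actual code: `render` is a hypothetical helper, the xPath engine is replaced by a canned dictionary, and it assumes an xPath never contains a # character itself.

```python
import re
from typing import Callable

# One placeholder is one xPath wrapped in '#'.
# Assumption: the xPath itself contains no '#'.
PLACEHOLDER = re.compile(r"#(.+?)#")


def render(template: str, evaluate: Callable[[str], str]) -> str:
    """Replace every #xpath# placeholder with evaluate(xpath); keep other text."""
    return PLACEHOLDER.sub(lambda m: evaluate(m.group(1)), template)


# Stand-in for a real xPath engine: canned results for the two Discogs xPaths.
fake_page = {
    "//h1/span/a/text()": "The band name",
    "//h1/text()[2]": "The album name",
}

title = render(
    "#//h1/span/a/text()# - #//h1/text()[2]#",
    lambda xpath: fake_page.get(xpath, ""),
)
print(title)  # The band name - The album name
```

Two placeholders can also be written back to back (`#...##...#`), in which case their results are simply joined with nothing in between.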

Additional data

Three types of data fields are supported for now:

Text -> if your xPath matches multiple strings, they will be concatenated using commas
List -> each xPath match will create a new list element
Country -> will try to match either the full country name or the ISO 3166 alpha-2 or alpha-3 code
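As a rough illustration of the three types, the Python sketch below post-processes a list of xPath matches in each way. Everything in it is an assumption made for the example: the function names, the ", " separator, the three-country sample table, the case-insensitive comparison and the choice to return the alpha-2 code are not taken from the scraper itself.

```python
from typing import List, Optional

# Tiny illustrative sample of ISO 3166-1 entries: (name, alpha-2, alpha-3).
COUNTRIES = [
    ("France", "FR", "FRA"),
    ("Japan", "JP", "JPN"),
    ("Germany", "DE", "DEU"),
]


def as_text(matches: List[str]) -> str:
    """Text: concatenate all non-empty matches with commas."""
    return ", ".join(m.strip() for m in matches if m.strip())


def as_list(matches: List[str]) -> List[str]:
    """List: every non-empty match becomes its own element."""
    return [m.strip() for m in matches if m.strip()]


def as_country(matches: List[str]) -> Optional[str]:
    """Country: alpha-2 code of the first match naming a known country."""
    for match in matches:
        needle = match.strip().casefold()
        for name, alpha2, alpha3 in COUNTRIES:
            if needle in (name.casefold(), alpha2.casefold(), alpha3.casefold()):
                return alpha2
    return None


print(as_text(["Rock", " Pop "]))
print(as_list(["Track one - 3:12", "Track two - 4:05"]))
print(as_country(["Japan"]))
```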

Usage example

On the Item or Collection creation form, click on the scrape button. Choose the scraper you want to use and the URL to scrape.

Alternatively, you can upload an HTML file if the URL isn't publicly accessible.

Using the Discogs example from above, here is the result:

Import/export

To make it easier for people to share their scrapers, an import/export function is available. Here are two scrapers I've been using, as examples (save each one in its own JSON file):

Discogs (release page)

{"name":"Discogs - release","namePath":"#\/\/h1\/span\/a\/text()# - #\/\/h1\/text()[2]#","imagePath":"#(\/\/img)[2]\/@src#","urlPattern":"https:\/\/www.discogs.com\/release\/","dataPaths":[{"name":"Style","path":"#\/\/th[contains(text(),'Style')]\/ancestor::tr\/td\/a\/text()#","type":"text","position":1},{"name":"Country","path":"#\/\/th[contains(text(),'Country')]\/ancestor::tr\/td\/a\/text()#","type":"country","position":2},{"name":"Tracks","path":"#\/\/td[contains(@class,'trackTitle')]\/span\/text()# - #\/\/td[contains(@class,'duration')]\/span\/text()#","type":"list","position":3}]}

MyFigureCollection

{"name":"MyFigureCollection","namePath":"#\/\/span[@class='headline']\/text()#","imagePath":"#\/\/a[@class='main']\/img\/@src#","urlPattern":"https:\/\/myfigurecollection.net\/item\/","dataPaths":[{"name":"Origin","path":"#\/\/div[contains(text(),'Origin')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":1},{"name":"Character","path":"#\/\/div[contains(text(),'Character')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":2},{"name":"Version","path":"#\/\/div[contains(text(),'Version')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/text()#","type":"text","position":3},{"name":"Company","path":"#\/\/div[contains(text(),'Company')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":4},{"name":"Classification","path":"#\/\/div[contains(text(),'Classification')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":5},{"name":"Sculpted by","path":"#\/\/small[contains(text(),'As Sculptor')]\/ancestor::a\/span\/text()#","type":"text","position":6},{"name":"Illustrated by","path":"#\/\/small[contains(text(),'As Illustrator')]\/ancestor::a\/span\/text()#","type":"text","position":7},{"name":"Designed by","path":"#\/\/small[contains(text(),'As Designer')]\/ancestor::a\/span\/text()#","type":"text","position":8},{"name":"Color production by","path":"#\/\/small[contains(text(),'As Color producer')]\/ancestor::a\/span\/text()#","type":"text","position":9},{"name":"Material","path":"#\/\/div[contains(text(),'Material')]\/ancestor::div\/div[contains(@class, 'form-input')]\/a\/span\/text()#","type":"text","position":10},{"name":"Scale","path":"#\/\/a[contains(@class,'item-scale')]\/small\/text()##\/\/a[contains(@class,'item-scale')]\/text()#","type":"text","position":11},{"name":"Country","path":"#substring-before(substring-after(\/\/div[contains(text(),'Releases')]\/ancestor::div\/div[contains(@class, 'form-input')][1]\/small\/em\/text(), '('),')')#","type":"country","position":12}]}
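The exported files are plain JSON, and the `\/` sequences are only the JSON escape for `/`. The short Python sketch below, which uses a shortened copy of the Discogs export purely for illustration, shows that any JSON parser gives back the readable xPaths, which can be handy when reviewing a scraper someone shared before importing it.

```python
import json

# Shortened copy of the Discogs export above, for illustration only.
exported = r'''{"name":"Discogs - release","namePath":"#\/\/h1\/span\/a\/text()# - #\/\/h1\/text()[2]#","urlPattern":"https:\/\/www.discogs.com\/release\/","dataPaths":[{"name":"Style","path":"#\/\/th[contains(text(),'Style')]\/ancestor::tr\/td\/a\/text()#","type":"text","position":1}]}'''

scraper = json.loads(exported)

# The parser turns every "\/" back into "/".
print(scraper["namePath"])  # #//h1/span/a/text()# - #//h1/text()[2]#

# List the additional data fields in display order.
for field in sorted(scraper["dataPaths"], key=lambda f: f["position"]):
    print(field["position"], field["type"], field["name"], field["path"])
```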

Limitations

Some websites have protections against scraping, especially e-commerce websites.
