Scrutarising a site

“Scrutarising a site” is, in Coredem speak, to instigate the automatic process of extracting metadata from a site, formatted in ScrutariData, to allow the Scrutari server to include the site in its search results.

The “scrutarisation” of one’s site is an important step for a Coredem member. It is not always a simple step, as only information resources should be extracted from the site, not other pages (showcase pages, news, etc.), which involves enabling the necessary filters.

Coredem has made several changes in order to simplify this step.

We are currently developing add-ons for the most popular open source content management systems (Spip, Wordpress, Joomla, Drupal). The idea of these add-ons is to provide interfaces for configuring the extraction in ScrutariData format. For example, with Spip, one indicates which sections contain information resources of interest to Coredem.

The various add-ons have not yet been published in the add-on repositories of these systems. This will be done once they are stable enough. In the meantime, please contact Coredem if you are interested in a particular add-on.

However, not all sites are based on the content management systems mentioned above. For those sites, extracting data in ScrutariData format requires ad hoc development.

As this can be expensive, Coredem is also working on scripts that convert CSV files (simple text files with data in tabular form) into ScrutariData format. This means that the content of the site to be “scrutarised” simply needs to be exported in CSV format, which is much easier to do.
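To give a rough idea of what such a conversion involves, here is a minimal sketch in Python using only the standard library. It is not Coredem's actual script: the CSV column names (id, title, url, lang) are hypothetical, and the XML element and attribute names are placeholders that should be checked against the ScrutariData specification before use.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical CSV export; the column names are assumptions for this sketch.
SAMPLE_CSV = """id,title,url,lang
1,Water governance in cities,https://example.org/doc/1,en
2,Gouvernance de l'eau,https://example.org/doc/2,fr
"""


def csv_to_xml(csv_text):
    """Convert CSV rows into a simple XML tree, one <fiche> element per row.

    Element and attribute names are placeholders, not a verified copy of
    the ScrutariData schema.
    """
    root = ET.Element("base")
    corpus = ET.SubElement(root, "corpus", {"corpus-name": "documents"})
    for row in csv.DictReader(io.StringIO(csv_text)):
        fiche = ET.SubElement(corpus, "fiche", {"fiche-id": row["id"]})
        ET.SubElement(fiche, "titre").text = row["title"]
        ET.SubElement(fiche, "href").text = row["url"]
        ET.SubElement(fiche, "lang").text = row["lang"]
    return root


xml_root = csv_to_xml(SAMPLE_CSV)
xml_text = ET.tostring(xml_root, encoding="unicode")
print(xml_text)
```

The point of the sketch is that each CSV row maps to one record, so a site only has to produce one line per information resource.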

The last option is to write scripts that create a ScrutariData file directly from the web pages.

These scripts are written in Python.
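As an illustration only, the first step of such a script might look like the sketch below, which reads the title and language of a page with Python's standard html.parser module. The sample page, the URL and the names PageMetadataParser and extract_metadata are all hypothetical. A real script would also have to fetch the pages and write the result in ScrutariData format.

```python
from html.parser import HTMLParser

# Hypothetical page content, standing in for a page downloaded from the site.
SAMPLE_PAGE = """<html lang="en">
<head><title>Water governance in cities</title></head>
<body><p>Full text of the resource.</p></body>
</html>"""


class PageMetadataParser(HTMLParser):
    """Collect the <title> text and the lang attribute of <html>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.lang = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.lang = dict(attrs).get("lang")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text found between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def extract_metadata(html_text, url):
    """Return a small dict of metadata for one page."""
    parser = PageMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "lang": parser.lang}


page_metadata = extract_metadata(SAMPLE_PAGE, "https://example.org/doc/1")
print(page_metadata)
```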

On Scrutari’s technical documentation site an API is available in PHP, Java and Python, which provides help on writing a ScrutariData file: