juicer

juicer is a small API for extracting text data from web pages: you provide a URL,
and juicer squeezes out the page metadata, body text and named entities.

juicer is a simple API for extracting metadata from the text in web pages. It works best on "article" type pages such as those found on blogs and news websites. It may not work at all on some types of page.

Two endpoints are provided: /api/article extracts metadata from a URL, and /api/entities extracts named entities from an uploaded blob of text.

juicer is a toy project, hacked together over the holidays. Please use and abuse it, but it comes with no guarantees!

Props

juicer is other people's excellent work glued together in a web API. It's built with …

URL

GET /api/article

Parameters

url - The URL of the article page you want metadata for

Response

A JSON response containing the article metadata, body text and named entities. Each entity lists its text, frequency and type (one of Location, Person, Organization).

          {
            "article" : {
              "url"          : http://www.bbc.co.uk/news/world-africa-16377824,
              "domain"       : "www.bbc.co.uk",
              "hash"         : "ac2f2e739421184f01c942b057f8449d",
              "title"        : "South Sudan 'sends more troops' to strife-torn town Pibor",
              "description"  : "Article meta description ...",
              "body"         : "Article body text ... ",
              "image"        : {
                "src"    : "http://news.bbcimg.co.uk/media/images/57644000/jpg/_57644369_armed-lou-nuer-youth-in-lik.jpg",
                "width"  : 304,
                "height" : 171
              },
              "entities"     : [
                {
                  "text"      : "South Sudan",
                  "type"      : "Location",
                  "frequency" : 1
                },
                ...
              ]
            }
          }
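
Once the response is decoded, the entities list is easy to work with. The following is a minimal Python sketch, not part of juicer itself, that groups entity text by type. The sample dict is trimmed from the response above, and the helper name entities_by_type is hypothetical.

```python
from collections import defaultdict

# Sample response, trimmed to the fields used here and modelled on the example above.
response = {
    "article": {
        "title": "South Sudan 'sends more troops' to strife-torn town Pibor",
        "entities": [
            {"text": "South Sudan", "type": "Location", "frequency": 1},
        ],
    }
}


def entities_by_type(entities):
    """Group entity text by type, most frequent first within each type."""
    grouped = defaultdict(list)
    for entity in sorted(entities, key=lambda e: e["frequency"], reverse=True):
        grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


print(entities_by_type(response["article"]["entities"]))
# prints {'Location': ['South Sudan']}
```

The same helper should work on the "entities" list returned by /api/entities, since both endpoints describe entities with the same three fields.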
        

Example

Here's an example API call: /api/article?url=http://www.bbc.co.uk/news/world-africa-16377824
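
When calling the endpoint from code, it is safer to percent-encode the url parameter so that any query string in the article URL is not mistaken for part of the API call. A rough Python sketch using only the standard library follows. BASE_URL is a hypothetical placeholder for wherever your juicer instance runs, and both function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; replace with the host of your juicer instance.
BASE_URL = "http://localhost:8080"


def article_request_url(page_url, base_url=BASE_URL):
    """Build the GET /api/article URL, percent-encoding the `url` parameter."""
    return f"{base_url}/api/article?{urlencode({'url': page_url})}"


def fetch_article(page_url, base_url=BASE_URL):
    """Call the endpoint and return the "article" object from the JSON response."""
    with urlopen(article_request_url(page_url, base_url), timeout=30) as response:
        return json.load(response)["article"]
```

With a running instance, fetch_article("http://www.bbc.co.uk/news/world-africa-16377824")["title"] would be expected to return the article title shown in the response above.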

URL

POST /api/entities

Parameters

text - A string of text you want to analyse

Response

A JSON response listing each entity's text, frequency and type (one of Location, Person, Organization).

          {
            "entities" : [
              {
                "text"      : "Met Office",
                "type"      : "Organization",
                "frequency" : 2
              },
              {
                "text"      : "John Prior",
                "type"      : "Person",
                "frequency" : 1
              },
              {
                "text"      : "UK",
                "type"      : "Location",
                "frequency" : 2
              }
            ]
          }
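
A POST to this endpoint can be sketched in Python as below. This assumes the text parameter is sent as an ordinary form-encoded field, which the description above does not confirm. BASE_URL and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical base URL; replace with the host of your juicer instance.
BASE_URL = "http://localhost:8080"


def entities_request(text, base_url=BASE_URL):
    """Build a POST /api/entities request carrying `text` as a form field (assumed encoding)."""
    body = urlencode({"text": text}).encode("utf-8")
    return Request(
        f"{base_url}/api/entities",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_entities(text, base_url=BASE_URL):
    """Send the request and return the "entities" list from the JSON response."""
    with urlopen(entities_request(text, base_url), timeout=30) as response:
        return json.load(response)["entities"]
```

If the server expects a different body format, only entities_request needs to change.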
        


juicer is written in Scala. Feel free to fork the project and play about. It's still at the "toy" project stage, so contributions are very much welcome.

GitHub URL: https://github.com/matth/juicer