Friday, 28 November 2014

Scraping XML Tables with R

A couple of my good friends also recently started a sports analytics blog. We’ve decided to collaborate on a couple of studies revolving around NBA data found at www.basketball-reference.com. This will be the first part of that project!

Data scientists need data. The internet has lots of data. How can I get that data into R? Scrape it!

People have been scraping websites for as long as there have been websites. It’s gotten pretty easy using R/Python/whatever other tool you want to use. This post shows how to use R to scrape the demographic information for all NBA and ABA players listed at www.basketball-reference.com.

Here’s the code:

###### Settings

library(XML)

 ###### URLs

url<-paste0("http://www.basketball-reference.com/players/",letters,"/")

len<-length(url)

 ###### Reading data

tbl<-readHTMLTable(url[1])[[1]]

 for (i in 2:len)

    {tbl<-rbind(tbl,readHTMLTable(url[i])[[1]])}

 ###### Formatting data

colnames(tbl)<-c("Name","StartYear","EndYear","Position","Height","Weight","BirthDate","College")

tbl$BirthDate<-as.Date(tbl$BirthDate[1],format="%B %d, %Y")

Created by Pretty R at inside-R.org

And here’s the result:Result

Source: http://www.r-bloggers.com/scraping-xml-tables-with-r/

Wednesday, 26 November 2014

Web Data Extraction: driving data your way

Most businesses rely on the web to gather data such as product specifications, pricing information, market trends, competitor information, and regulatory details. More often than not, companies collect this data manually—a process that not only takes a significant amount of time, but also has the potential to introduce costly errors.

By automating data extraction, you're able to free yourself (and your pointer finger) from hours of copy/pasting, eliminate human errors, and focus on the parts of your job that make you feel great.

Web data extraction: What it is, why it's used, and how to get it right on an ongoing basis

Web data extraction, screen scraping, web harvesting—while these terms may have different connotations they all essentially point to the same thing: plucking data from the web and populating it in an organized way in another place for further analysis or more focused use. In an era where “big data” has become a commonplace concept, the appeal of web data extraction has grown; it’s an extremely efficient alternative to web browsing, and culls very specific data for a focused purpose.

How it's used

While each company’s needs vary, data extraction is often used for:

    Competitive intelligence, including web popularity, social perception, other sites linking to them, and placement of competitor advertisements

    Gathering financial data including stock market movement, product pricing, and more

    Creating continuity between price sheets and online websites, catalogs, or inventory databases

    Capturing product specifications like dimensions, color, and materials

    Pulling tabular data from multiple sources for in-depth analysis

Interestingly, some people even find that web data extraction can aid them in their leisure time as well, pulling data from blogs and websites that pertain to their hobbies or interests, and creating their own library of organized information on a topic. Perhaps, for instance, you want a list of all the designers that George Clooney wears (hey- we won’t question what you do in your free time). By using web scraping tools, you could automatically extract this type of data from, say, a fashion blogger who follows celebrity style, and create your own up-to-date shopping list of items.

How it's done

When you think of gathering data from the web, you should mentally juxtapose two different images: one of gathering a bucket of sand one grain at a time, and one of filling a bucket with a shovel that has the perfect scoop size to fill it completely in one sitting. While clearly the second method makes the most sense, the majority of web data extraction happens much like the first process--manually, and slowly.

Let’s take a look at a few different ways organizations extract data today.

The least productive way: manually

While this method is the least efficient, it’s also the most widespread. On the plus side, you need to learn absolutely nothing except “Ctrl+C/V” to use this method, which explains why it is the generally preferred method, despite the hours of time it can take. Imagine, for instance, managing a sales spreadsheet that keeps inventory up to date so that the information can be properly disseminated to a global sales team. Not only does it take a significant amount of time to update the spreadsheet with information from, say, your internal database and manufacturer’s website, but information may change rapidly, leaving sales reps with inaccurate information regardless.

Finding someone in the organization with a talent for programming languages like Python

Generally, automating a task without dedicated automation software requires programming, and therefore an internal resource with a solid familiarity with programming languages to create the task and corresponding script. While most organizations do, in fact, have a resource in IT or engineering with this type of ability, it often doesn’t seem like a worthy time investment for that person to derail the initiatives he or she is working on to automate web data extraction. Additionally, if companies do choose to automate using in-house resources, that person will find himself beholden to a continuing obligation, since he or she will need to adjust scripting if web objects and attributes change, disabling the task.

Outsourcing via Elance or oDesk

Unless there is a dedicated resource ready to automate and maintain data extraction processes (and most organizations wouldn’t necessarily choose to use their in-house employee time this way), companies might turn to outsourcing companies such as Elance or oDesk to hire contract help. While this is an effective way to automate a task using a resource that has a level of acumen in automation, it represents an additional cost--be it one time or on a regular basis as data extraction requirements change or increase.

Using Excel web queries

Since more often than not, data extracted from the web is often populated into an Excel spreadsheet, it’s no wonder that Excel includes web query tools expressly for that purpose. These tools are particularly useful in pulling tabular data from a website (such as product specifications, legal codes, stock prices, and a host of other information) and automatically pushing the data into a spreadsheet. Excel queries do have limitations and a learning curve, however, particularly when creating dynamic web queries. And clearly, if you’re hoping to populate the information in other sources, such as external databases, there is yet another level of difficulty to navigate.

How automation simplifies web data extraction

Culling web data quickly

Using automation is the simplest way to extract web data. As you execute the steps necessary to perform the task one time, a macro recorder captures each action, automatically generates an easily-editable script, and lets you specify how often you would like to repeat the task, and at what speed.

Maintaining the highest level of accuracy

With humans copy/pasting data, or comparing between multiple screens and entering data manually into a spreadsheet, you’re likely to run into accuracy issues (sometimes directly proportionate to the amount of time spent on the task and amount of coffee in the office!) Automation software ensures that “what you see is what you get,” and that data is picked up from the web and put back down where you want it without a hitch.

Storing web data in your preferred format

Not only can you accurately transfer data with automation software, you can also ensure that it’s populated into spreadsheets or databases in the format you prefer. Rather than simply dumping the data into a spreadsheet, you can ensure that the right information is put into the proper column, row, field, and style (think, for instance, of the difference between writing a birth date as “03/13/1912” and “12/3/13”).

Simplifying data analysis

Automation software allows you to aggregate data from disparate sources or enormous stockpiles of structured or unstructured data in a way that makes sense for your business analysis needs. This way, the majority of employees in an organization can perform some level of analysis on their own, making it easier to surface information that informs business decisions.

Reacting to changes without a hitch

Because automation software is built to recognize icons, images, symbols, and other objects regardless of their position on a screen, it can automate processes in a self-perpetuating manner. For example, let’s say you automate data retrieval from a certain chart on a retailer’s website without automation software. If the retailer decides to move that object to another area of the screen, your task would no longer produce accurate results (or work at all), leaving you to make changes to the script (or find someone who can), or re-record the task altogether. With image recognition capabilities, however, the system “memorizes” the object itself, not merely its coordinates, so that the task can continue to run irrespective of changes.

The wide sweeping appeal of automation software

Companies often pick a comprehensive automation solution not only because of its ability to effectively automate any web data extraction task, but also because it goes beyond data extraction. Automation software can permeate into other areas of the business as well, making tasks such as application integration, data migration, IT processes, Excel automation, testing, and routine tasks such as launching applications or formatting files faster and more accurate. Because it requires no programming experience to use, adoption rates are higher and businesses get more “bang for their buck.”

Almost any organization can benefit from using automation software, particularly as they grow and scale. If you are looking to quit “moving grains of sand” and start claiming back time in your day, there are a few steps you can take:

 Watch a short video that shows how web data extraction is done with automation software

 Download a free trial and start reaping the benefits of downloading even just a couple of tasks today.

 See how tasks are automated with our short, step-by-step how-to-sheets (and then give it a try yourself!)

Source: https://www.automationanywhere.com/web-data-extraction

Sunday, 23 November 2014

A Content Marketer's Guide to Data Scraping

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

    "Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

    "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.
Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:

//*[@id="leftCol"]/div[2]/p/span/a

This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:

=XPathOnUrl(A2,"//*[@id='leftCol']/div[2]/p/span/a")

Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to ‘leftCol' — if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:

//*[@id='leftCol']/div[2]/p/span/a

We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:

//a[@rel='author']

As a full formula within Excel it would look something like this:

=XPathOnUrl(A2,"//a[@rel='author']")

Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details off of them. To do this, you can use the same approach to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", <strong>"href"</strong>)

The addition of the "href" part of the formula will extract the output of the href attribute of the atag. In Lehman terms, it will find the hyperlink attached to the author name and return that URL as a result.

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight here is that I've selected cell C2 here, which contains the URL of the author profile page and not just the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the words ‘Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:

Google+:

=XPathOnUrl(<strong>INSERTGOOGLEPROFILEURL</strong>,"//span[@class='BOfSxb']")

LinkedIn:

=XPathOnUrl(<strong>INSERTLINKEDINURL</strong>,"//dd[@class='overview-connections']/p/strong")

4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:

=XPathOnUrl(A2,"//title")

To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:

=CountWords(A2)

From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.

Taking this a step further, you can gather the social shares for each URL using the following functions:

Twitter:

=TwitterCount(<strong>INSERTURLHERE</strong>)

Facebook:

=FacebookLikes(<strong>INSERTURLHERE</strong>)

Google+:

=GooglePlusCount(<strong>INSERTURLHERE</strong>)

Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks, link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):

=CONCATENATE("http://api.sharedcount.com/?url=",A2)

You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contaiins the formula above):

StumbleUpon:

=JsonPathOnUrl(B2,"StumbleUpon")

Reddit:

=JsonPathOnUrl(B2,"Reddit")

Delicious:

=JsonPathOnUrl(B2,"Delicious")

Digg:

=JsonPathOnUrl(B2,"Diggs")

Pinterest:

=JsonPathOnUrl(B2,"Pinterest")

LinkedIn:

=JsonPathOnUrl(B2,"Linkedin")

Facebook Shares:

=JsonPathOnUrl(B2,"Facebook.share_count")

Facebook Comments:

=JsonPathOnUrl(B2,"Facebook.comment_count")

Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed within Upworthy.com.

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:

=XPathOnUrl(<strong>INSERTARTICLEURL</strong>,"//*[@class='dateline']/text()")

Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:

    http://findmyblogway.com/scraping-communities-with-xpath/

    http://builtvisible.com/data-entry-is-a-waste-of-time/

    http://www.seotakeaways.com/data-scraping-guide-for-seo/

    http://okdork.com/2014/04/30/the-step-by-step-guide-to-10x-growth-for-any-blog/

TL;DR

    Start using actual data to inform your content campaigns instead of going on your gut feeling.

    Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.

    Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.

    Spend more time analysing what content will get you results as opposed to what sites will give you links!

    Check the website's ToS before scraping.

Source:http://moz.com/blog/a-content-marketers-guide-to-data-scraping

Wednesday, 19 November 2014

NHL ending dry scraping of ice before overtime

TORONTO (AP) — The NHL will no longer dry scrape the ice before overtime.
Instituted this season in an effort to reduce the number of shootouts, the dry scraping will stop after Friday's games.

The general managers decided at their meeting Tuesday to make the change after the league talked to the players' union the past few days.

Beginning Saturday, ice crews around the league will again shovel the ice after regulation as they did in previous years. The GMs said the dry scrape was causing too much of a delay. Director of hockey operations Colin Campbell said the delays were lasting from more than four minutes to almost seven.

The dry scrape initially had been approved in hopes of reducing shootouts by improving scoring chances without unduly slowing play by recoating the ice.

The GMs also discussed expanded video review, including goaltender interference, and the possibility of three-on-three overtime. The American Hockey League is experimenting with the three-on-three format this season.

This annual meeting the day after the Hockey Hall of Fame induction usually doesn't produce actual changes, with the dry scrape providing an exception.

The main purpose is to set up the March meeting in Boca Raton, Florida, where these items will be further addressed.

Source:http://missoulian.com/sports/hockey/nhl-ending-dry-scraping-of-ice-before-overtime/article_3dd5473c-6102-5800-99f7-2c98be0f99ad.html

Monday, 17 November 2014

Scraping websites using the Scraper extension for Chrome

If you are using Google Chrome there is a browser extension for scraping web pages. It’s called “Scraper” and it is easy to use. It will help you scrape a website’s content and upload the results to google docs.

Walkthrough: Scraping a website with the Scraper extension
  •     Open Google Chrome and click on Chrome Web Store
  •     Search for “Scraper” in extensions
  •     The first search result is the “Scraper” extension
  •     Click the add to chrome button.
  •     Now let’s go back to the listing of UK MPs
  •     Open http://www.parliament.uk/mps-lords-and-offices/mps/
  •     Now mark the entry for one MP
  •     http://farm9.staticflickr.com/8490/8264509932_6cc8802992_o_d.png
  •     Right click and select “scrape similar…”
  •     http://farm9.staticflickr.com/8200/8264509972_f3a9e5d8e8_o_d.png
  •     A new window will appear – the scraper console
  •     http://farm9.staticflickr.com/8073/8263440961_9b94e63d56_b_d.jpg
  •     In the scraper console you will see the scraped content
  •     Click on “Save to Google Docs…” to save the scraped content as a Google Spreadsheet.
Walkthrough: extended scraping with the Scraper extension

Note: Before beginning this recipe – you may find it useful to understand a bit about HTML. Read our HTML primer.

Easy wasn’t it? Now let’s do something a little more complicated. Let’s say we’re interested in the roles a specific actress played. The source for all kinds of data on this is the IMDB (You can also search on sites like DBpedia or Freebase for this kinds of information; however, we’ll stick to IMDB to show the principle)

    Let’s say we’re interested in creating a timeline with all the movies the Italian actress Asia Argento ever starred; where do we start?

    The IMDB has a quite comprehensive archive of actors. Asia Argento’s site is: http://www.imdb.com/name/nm0000782/

    If you open the page you’ll see all the roles she ever played, together with a title and the year – let’s scrape this information

    Try to scrape it like we did above

    You’ll see the list comes out garbled – this is because the list here is structured quite differently.

    Go to the scraper console. Notice the small box on the upper left, saying XPath?

    XPath is a query language for HTML and XML.

    XPath can help you find the elements in the page you’re interested in – all you need to do is find the right element and then write the xpath for it.

    Now let’s assemble our table.

    You’ll see that our current Xpath – the one including the whole information is “//div[3]/div[3]/div[2]/div”

    http://farm9.staticflickr.com/8344/8264510130_ae31697fde_o_d.png

    Xpath is very simple it tells the computer to look at the HTML document and select <div> element number 3, then in this the third one, the second one and then all <div> elements (which if you count down our list, results in exactly where you are right now.
  •     However, we’d like to have the data separated out.
  •     To do this use the columns part of the scraper console…
  •     Let’s find our title first – look at the title using Inspect Element
  •     http://farm9.staticflickr.com/8355/8263441157_b4672d01b2_o_d.png
  •     See how the title is within a <b> tag? Let’s add the tag to our xpath.
  •     The expression seems to work well: let’s make this our first column
  •     In the “Columns” section, change the name of the first column to “title”
  •     Now let’s add the XPATH for the title to it
  •     The xpaths in the columns section are relative, that means “./b” will select the <b> element
  •     add “./b” to the xpath for the title column and click “scrape”
  •     http://farm9.staticflickr.com/8357/8263441315_42d6a8745d_o_d.png
  •     See how you only get titles?
  •     Now let’s continue for year? Years are within one <span>
  •     Create a new column by clicking on the small plus next to your “title” column
  •     Now create the “year” column with xpath “./span”
  •     http://farm9.staticflickr.com/8347/8263441355_89f4315a78_o_d.png
  •     Click on scrape and see how the year is added
  •     See how easily we got information out of a less structured webpage?
Source: http://schoolofdata.org/handbook/recipes/scraper-extension-for-chrome/

Sunday, 16 November 2014

Building Java Object Graph with Tour de France results – using screen scraping, java.util.Parser and assorted facilities

Last Saturday, the Tour de France 2011 departed. For people like myself, enjoying sports and working on Data Visualizations on the one hand and far fetched uses of SQL on the other, the Tour de France offers a wealth of data to work with: rankings for each stage in various categories, nationalities and teams to group by, distances and velocity, years to compare with one another and the like. So it has been my intention for some time to get hold of that data in a format I could work with.

Today I finally found some time to get it done. To locate the statistics for the Tour de France editions for the last few years and get them onto my laptop and into my database. This article describes the first part of that journey: how to get the stage results from some source on the internet into my locally running Java program in an appropriate object structure.

My starting point is the official Tour de France website:

Image

This website goes back to 2007 and also has the latest (2011) results. It presents the result in a format pleasing to the human eye – based on an HTML structure that is fairly pleasing to my groping Java code as well.

Analyzing the source of the Tour de France data

I start my explorations in Firefox, using the Firebug plugin. When I select the tab with the results for a particular stage, I inspect the (AJAX) call that is made to retrieve the stage results into the browser:

Image

The URL that was accessed is www.letour.fr/2010/TDF/LIVE/us/700/classement/ITE.html . When I access that URL directly, I see an HTML fragment with the individual ranking for the 7th stage in 2010. It turns out that with ITG instead of ITE in this URL, I get the overall ranking after the 7th Stage. Using IME in stead of ITE, I get the 7th stage’s climbers’ standing. And so on.

The HTML associated with the stage standing looks like this:

Image

Which is not as user friendly as the corresponding display in the browser:

Image

but still fairly well structured and programmatically interpretable.

Retrieving HTML fragments and parsing in Java

Consuming these HTML fragments with stage standings into my own Java code is very easy. Parsing the data and turning it into sensible Java Objects is slightly more work, but still quite feasible. From the Java Objects I next need to create a persistent storage for the data – that is the subject for another article.

Using the Java URL class and its openStream method to open an InputStream on whatever content can be found at the URL, it is dead easy to start reading the HTML from the Tour de France website into my Java program. I make use of the java.util.Scanner class to work my way through the HTML by Table Row (TR element). When you inspect the HTML fragments, it is clear early on that every individual rider’s entry corresponds with a TR element, so it seems only logical to have the Scanner break up the data by TR.

private static Stage processStage(int year, int stageSequence, Map<Integer, Rider> riders) throws java.io.IOException, java.net.MalformedURLException {

    String typeOfStanding = "ITE";
     URL stageStanding = new URL("http://www.letour.fr/"+year+"/TDF/LIVE/us/"
                                +(stageSequence==0?"0":stageSequence+"00") +
                                "/classement/"+typeOfStanding+".html");
    InputStream stream = stageStanding.openStream();
    Scanner scanner = new Scanner(stream);
    scanner.useDelimiter("</tr>");
    Stage stage = new Stage();
    stage.setSequence(stageSequence);
    boolean first = true;
    boolean firstStanding = true;
    while (scanner.hasNext()) {
        String entry = scanner.next();
        if (first) {
            first = false;
            Matcher regexMatcher = regexDistance.matcher(entry);
            if (regexMatcher.find()) {
                String distanceString = regexMatcher.group();
                stage.setTotalDistance(Float.parseFloat(distanceString.substring(0, distanceString.length() - 3)));
            }
        }
        if (!first) {
            String[] els = entry.split("/td>");
            if (els.length > 1) { // only the standing-entries have more than one td element
                Integer riderNumber = Integer.parseInt(extractValue(els[2]));

                Rider rider=null;
                if (riders.containsKey(riderNumber)) {
                    rider = riders.get(riderNumber);
                }
                else {
                    rider = new Rider(extractValue(els[1]),riderNumber, extractValue(els[3]));
                    riders.put(riderNumber,rider);
                }
                Standing standing =
                    new Standing(firstStanding ? 1 : (Integer.parseInt(extractValue(els[0]).replace(".", ""))),
                                  rider,extractValue(els[4]),
                                  extractValue(els[5]));
                firstStanding = false;
                stage.getStandings().add(standing);                }
        }
    } //while
    scanner.close();
    return stage;
}

Subsequently, the TR elements need to be broken up in the TD cell elements that contain the rank, rider’s name, their number, the team they ride for and the time for the stage as well as their lag with regard to the winner. I have used a simple split (on /td>) to extract the cells. The final logic for pulling the correct value from the cell is in the method extractValue. Note: this code is not very pretty, and I am not necessarily overly proud of it. On the other hand: it is one-time-use-only code and it is still fairly compact and easy to write and read.

private static String extractValue(String el) {
    String r = el.split("</")[0];
    if (r.lastIndexOf(">") > 0) {
        r = r.substring(r.lastIndexOf(">") + 1);
    }
    return r.split("<")[0];
}

I have created a few domain classes: Rider, Stage, Standing (as well as Tour) that are a business domain like representation of the Tour de France result data. Objects based on these classes are instantiated in the processStage method that is being invoked from the processTour method.

public static void processTour(Tour tour) throws IOException, MalformedURLException {
    if (tour.isPrologue())
      tour.getStages().add(processStage(tour.getYear(),0, tour.getRiders()));

    for (int i=1;i<= tour.getNumberOfStages();i++)  {
        tour.getStages().add(processStage(tour.getYear(),i, tour.getRiders()));
    }
}

When I run the TourManager class – a class that create a single Tour object for the Tour de France in 2010 –

public class TourManager {
     List<Tour> tours = new ArrayList<Tour>();
     public TourManager() {
        tours.add(new Tour(2010, 20, true));
        try {
            ProcessTourStandings.processTour(tours.get(0));
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
     public static void main(String[] args) {
        TourManager tm = new TourManager();
        for (Tour tour : tm.getTours()) {
            for (Stage stage : tour.getStages()) {
                System.out.println("================ Stage " + stage.getSequence() + "(" + stage.getTotalDistance() +
                                   " km)");
                for (Standing standing : stage.getStandings()) {
                    if (standing.getRank() < 4) {
                        System.out.println(standing.getRank() + "." + standing.getRider().getName());
                    }
                }
            }
        }
    }

it will print the top 3 in every stage:

Image

Source:http://technology.amis.nl/2011/07/04/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities/

Wednesday, 12 November 2014

'Scrapers' Dig Deep for Data on Web

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its "Mood" discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was "scraping," or copying, every single message off PatientsLikeMe's private online forums.

Enlarge Image

Bilal Ahmed wrote about his health on a site that was scraped. Andrew Quilty for The Wall Street Journal.

PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online "buzz" for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

"I felt totally violated," says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. "It was very disturbing to know that your information is being sold," he says. Nielsen says it no longer scrapes sites requiring an individual account for access, unless it has permission.

Related Reading

    Digits: Escaping the 'Scrapers'
    Complete Coverage: What They Know

Journal Community

The market for personal data about Internet users is booming, and in the vanguard is the practice of "scraping." Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping—a trade that involves personal information as well as many other types of data—is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell numbers, photographs and posts on social-network sites.

Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.

One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web—which may include personal information contained in news articles and blog postings—that help corporate clients monitor how they are portrayed. It says it doesn't gather information from password-protected parts of sites.

It's rarely a coincidence when you see Web ads for products that match your interests. WSJ's Christina Tsuei explains how advertisers use cookies to track your online habits.

The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.

Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, "so if someone decides to share personally identifiable information, it could be included."

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.

California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.

Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.

Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.

"Social networks are becoming the new public records," says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and "Date Check," which promises details about a prospective date for $14.95.

"This data is out there," Mr. Adler says. "If we don't bring it to the consumer's attention, someone else will."

Scraping for Your Real Name

PeekYou.com has applied for a patent for a way to, among other things, match people's real names to pseudonyms they use on blogs, Twitter and online forums.

Read PeekYou.com's patent application.

Enlarge Image

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names or addresses.

Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.

One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data—some of it scraped—to employers about a year ago. "It's slowly starting to grow," says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are "talking about how they just ripped off their last employer."

Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. "Scraping is ubiquitous, but questionable," says Eric Goldman, a law professor at Santa Clara University. "Everyone does it, but it's not totally clear that anyone is allowed to do it without permission."

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online—they just do it on a much larger scale.

"We take an incomprehensible amount of information and make it intelligent," says Chase McMichael, chief executive of InfiniGraph, a Palo Alto, Calif., "listening service" that helps companies understand the likes and dislikes of online customers.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.

Some scrapers-for-hire don't ask clients many questions.

"If we don't think they're going to use it for illegal purposes—they often don't tell us what they're going to use it for—generally, we'll err on the side of doing it," says Todd Wilson, owner of screen-scraper.com, a 10-person firm in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as "Happy Valley" that specialize in scraping.

Enlarge Image

Some of the computer code behind screen-scraper.com's software. Chris Detrick for The Wall Street Journal

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct "business intelligence," working for companies who want to scrape competitors' websites.

One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? "We don't know," says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who "like" the firm's page—as well as their friends—so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

One defense familiar to most Internet users involves "captchas," the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.

More From the Series

    Web's New Goldmine: Your Secrets

    Personal Details Exposed Via Biggest Websites

    Microsoft Quashed Bid to Boost Web Privacy

    On Web's Cutting Edge, Anonymity in Name Only

    Stalking by Cellphone

    Google Agonizes Over Privacy

    The Tracking Ecosystem

    On the Web, Children Face Intensive Tracking

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.

Raids like these are on the rise. "Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping," says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

At Monster.com, the jobs website that stores résumés for tens of millions of individuals, fighting scrapers is a full-time job, "every minute of every day of every week," says Patrick Manzo, global chief privacy officer of Monster Worldwide Inc. Facebook, with its trove of personal data on some 500 million users, says it takes legal and technical steps to deter scraping.

At PatientsLikeMe, there are forums where people discuss experiences with AIDS, supranuclear palsy, depression, organ transplants, post-traumatic stress disorder and self-mutilation. These are supposed to be viewable only by members who have agreed not to scrape, and not by intruders such as Nielsen.

"It was a bad legacy practice that we don't do anymore," says Dave Hudson, who in June took over as chief executive of the Nielsen unit that scraped PatientsLikeMe in May. "It's something that we decided is not acceptable, and we stopped."

Mr. Hudson wouldn't say how often the practice occurred, and wouldn't identify its client.

The Nielsen unit that did the scraping is now part of a joint venture with McKinsey & Co. called NM Incite. It traces its roots to a Cincinnati company called Intelliseek that was founded in 1997. One of its most successful early businesses was scraping message boards to find mentions of brand names for corporate clients.

In 2001, the venture-capital arm of the Central Intelligence Agency, In-Q-Tel Inc., was among a group of investors that put $8 million into the business.

Intelliseek struggled to set boundaries in the new business of monitoring individual conversations online, says Sundar Kadayam, Intelliseek's co-founder. The firm decided it wouldn't be ethical to use automated software to log into private message boards to scrape them.

But, he says, Intelliseek occasionally would ask employees to do that kind of scraping if clients requested it. "The human being can just sign in as who they are," he says. "They don't have to be deceitful."

In 2006, Nielsen bought Intelliseek, which had revenue of more than $10 million and had just become profitable, Mr. Kadayam says. He left one year after the acquisition.

At the time, Nielsen, which provides television ratings and other media services, was looking to diversify into digital businesses. Nielsen combined Intelliseek with a New York startup it had bought called BuzzMetrics.

The new unit, Nielsen BuzzMetrics, quickly became a leader in the field of social-media monitoring. It collects data from 130 million blogs, 8,000 message boards, Twitter and social networks. It sells services such as "ThreatTracker," which alerts a company if its brand is being discussed in a negative light. Clients include more than a dozen of the biggest pharmaceutical companies, according to the company's marketing material.

Like many websites, PatientsLikeMe has software that detects unusual activity. On May 7, that software sounded an alarm about the "Mood" forum.

David Williams, the chief marketing officer, quickly determined that the "member" who had triggered the alert actually was an automated program scraping the forum. He shut down the account.

The next morning, the holder of that account e-mailed customer support to ask why the login and password weren't working. By the afternoon, PatientsLikeMe had located three other suspect accounts and shut them down. The site's investigators traced all of the accounts to Nielsen BuzzMetrics.

On May 18, PatientsLikeMe sent a cease-and-desist letter to Nielsen. Ten days later, Nielsen sent a letter agreeing to stop scraping. Nielsen says it was unable to remove the scraped data from its database, but a company spokesman later said Nielsen had found a way to quarantine the PatientsLikeMe data to prevent it from being included in its reports for clients.

PatientsLikeMe's president, Ben Heywood, disclosed the break-in to the site's 70,000 members in a blog post. He also reminded users that PatientsLikeMe also sells its data in an anonymous form, without attaching user's names to it. That sparked a lively debate on the site about the propriety of selling sensitive information. The company says most of the 350 responses to the blog post were supportive. But it says a total of 218 members quit.

In total, PatientsLikeMe estimates that the scraper obtained about 5% of the messages in the site's forums, primarily in "Mood" and "Multiple Sclerosis."

Source: http://online.wsj.com/articles/SB10001424052748703358504575544381288117888

Monday, 10 November 2014

My Experience in Choosing a Web Scraping Service

Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services.

If you are interested in price comparisons only and not ready to read the whole story just scroll down.

A list of web scraping services I sent my project to:

    www.datahen.com - Canadian web scraping service with nice web design
    webdata-scraping.com - Indian service by Keval Kothari
    www.iwebscraping.com - India based web scraping company (same as www.3idatascraping.com)
    scrapinghub.com - A scraping service founded by creators of Scrapy
    web-scraper.com - Yet another web scraping service
    grepsr.com - A scraping service that we already reviewed two years ago

Sending the request


All the services except scrapinghub.com have quite simple forms for the description of the project requirements. Basically, you just need to give your contact details and a project description in any form. Some of them are pretty (like datahen.com), some of them are more ascetic (like web-scraper.com), but all of them allow you to send your requirements to developers.

Scrapinghub.com has a quite long form, but most of the fields are optional and all the questions are quite natural. If you really know what you need, then it won’t be hard to answer all of them; moreover they rather help you to describe your need in detail.

Note, that in the context of the project I didn’t make a request for a scraper itself. I asked to receive data on a weekly basis only.

Getting responses

Since I sent my request on Sunday it would have been ok not to receive responses the same day, but I got the first response in 3 hrs! It was from web-scraper.com and stated that this project will cost me $250 monthly. Simple and clear. Thank you, Thang!

Right after that, I received the second response. This time it was Keval from webdata-scraping.com. He had some questions regarding the project. Then after two days he wrote me that it would be hard to scrape some of my data with the software he uses, and that he will try to use a custom scraper. After that he disappeared… ((

Then on Monday I received Cost & ETAT details from datahen.com. It looked quite professional and contained not only price, but also time estimation. They were ready to create such a scraper in 3-4 days for $249 and then maintain it for just $65/month.

On the same day I received a quote from iwebscraping.com. It was $60 per week. Everything is fine, but I’d like to mention that it wasn’t the last letter from them. After I replied to them (right after receiving the quote), I received a reminder letter from them every other day for about a week. So be ready for aggressive marketing if you ask them for a quote )).

Finally in two days after requesting a quote I got a response from scrapinghub.com. Paul Tremberth wrote me that they were ready to build a scraper for $1200 and then maintain it for $300/month.

It is interesting that I have never received an answer from grepsr.com! Two years ago it was the first web scraping service we faced on the web, but now they simply ignored my request! Or perhaps they didn’t receive it somehow? Anyway I had no time for investigation.

So what?

Let us put everything together. Out of six web scraping  services I received four quotes with the following prices:

Service     Setup fee     Monthly fee

web-scraper.com     -     $250
datahen.com     $249     $65
iwebscraping.com     -     $240
scrapinghub.com     $1200     $300


From this table you can see that  scrapinghub.com appears to be the most expensive service among those compared.

EDIT: These $300/month gives you as much support and development needed to fix a 5M multi-site web crawler, for example. If you need a cheaper solution you can use their Autoscraping tool, which is free, and would have costed around $2/month to crawl at my requested rates.

The average cost of monthly scraping is about $250, but from a long term perspective datahen.com may save you money due to their low monthly fee.

That’s it! If I had enough money available it would be interesting to compare all these services in operation and provide you a more complete report, but this is all I have for now.

If you have anything to share about your experience in using similar services, please contribute to this post by commenting on it below. Cheers!

Source: http://scraping.pro/choosing-web-scraping-service/

Saturday, 8 November 2014

Web Scraping the Solution to Data Harvesting

The internet is the number one information provider in the world and it is of course the largest in the same course. Web scraping is meant to extract and harvest useful information from the internet. It can be regarded as a multidisciplinary process that involves statistics, databases, data harvesting and data retrieval.

There has been noted a rapid expansion of the web and therefore causing an enormous growth of information. This has led to increased difficulty in the extraction of useful and potential information. Web scraping therefore confronts this problem by harvesting explicit information from a number of websites for knowledge discovery and easy access. It is important to realize that query interfaces of web databases are prone to sharing of same building blocks. It is therefore important to realize that the web offers unprecedented challenge and opportunity to data harvesting.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-solution-data-harvesting/

Wednesday, 5 November 2014

Why Web Scraping is Indispensable

The 21st century has opened the gates to hidden treasures and unlimited access to information globally without the constraints of time and space, through Internet technology. Along with this development comes the necessity for each business or company to get as much information as possible in order in order to thrive in the ever increasing demand for new innovations, comparisons, and trends.

Web scraping has consequently become an indispensable option to achieve all the needed data as quickly and efficiently as possible. In this view, data mining then appears to be the best and the only way to answer the present demand for updates, data, coping, foreknowledge, analysis, and evaluation. Indeed, information has inevitably become a valuable commodity and the most sought after product among online and offline entrepreneurs.

Need for Data

The increasing need for new data makes it possible for the experts to become increasingly creative in accessing information worldwide. The more knowledge one has, the better are his or her chances of growing and surviving. There seems to be no other time in the human existence where data has become so much a major source of revenue as the contemporary times.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-indispensable/