IR has been creating data feeds for over eight years now; our Founder and MD was at Shopzilla right at the very start of the data feed revolution. “Back in 2004, the mere mention of ‘data feeds’ drew either blank expressions or deep sighs. We even had agencies asking how pages on shopping sites were designed, when those pages were powered by data feeds. People just didn’t get it, let alone have the ability to create eight feeds, each to a different specification. Today’s retailers understand the process better, but it’s still a challenge,” says Steve Rivers, Founder & MD at Intelligent Reach.
This is where we come in. “Back in 2004, scraping technology was either pretty unreliable or very expensive. Exporting data from your database was, and still is, the best and most cost-effective way to produce your base data,” adds Steve. However, here at Intelligent Reach we understand that many merchants face limitations which make feed creation difficult. Legacy databases; disparate data sources; a final price that doesn’t exist until the landing page resolves (as with many of our travel clients); long IT queues; expensive third parties: the reasons are many, but each makes it difficult to create a data feed with all the required data.
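To make the database route concrete, here is a minimal sketch, assuming a hypothetical products table and column names rather than any particular merchant’s schema: query your own system and write the base feed out as CSV.

```python
import csv
import sqlite3  # stand-in for whichever database the merchant actually runs

# Hypothetical table and column names; substitute your own schema.
conn = sqlite3.connect("shop.db")
rows = conn.execute(
    "SELECT sku, title, description, price, image_url, product_url FROM products"
)

# Write the base feed out as a simple CSV file.
with open("base_feed.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sku", "title", "description", "price", "image_url", "product_url"])
    writer.writerows(rows)

conn.close()
```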
At this point, many other data feed providers will tell you that the feed you have been trying to make work, with countless man-hours invested in its creation, is worthless. Let us be the first to tell you: it’s NOT! We can recycle your existing data, ensuring those hours don’t go to waste. IR will always offer a free and fair assessment of your feed, and then deliver a full report on where improvements can be made. Our Data Quality experts will work with you to implement these changes; or, if required, we will extract the data from your website.
Data extraction seems to be getting a lot of attention these days. The process itself is beset by some pretty horrendous terminology, reminiscent of the dentist’s surgery: extraction and scraping. More recently, a softer term has entered common parlance: scanning. It’s crucial that people understand what is actually happening in all of these processes.
1) They all simulate a user visiting your website.
2) They all extract information from your website.
3) They all have to read the HTML code of the page; without it they could not extract information that exists in the markup but is never actually shown to users (see the sketch after this list).
4) Scraping gets a bad press, but it is really only web ripping, the bulk-downloading of entire sites, that places heavy loads on your website.
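To illustrate point 3, here is a minimal sketch using Python with the requests and BeautifulSoup libraries; the URL and the meta tag are entirely hypothetical. It pulls a GTIN that sits in the page’s HTML but is never displayed to a human visitor.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product page; any page with structured data in its markup works.
html = requests.get("https://www.example-shop.com/product/12345", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# A GTIN carried in a meta tag lives in the HTML but is never rendered for users.
gtin = soup.find("meta", attrs={"itemprop": "gtin13"})
print(gtin["content"] if gtin else "no GTIN found in the markup")
```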
The real difference cited by those who champion scanning as the superior method is that there are no manual templates (also known as agents) to build; the scanner is therefore able to view a page in much the same way a human does. However, in many trials we have seen poor results, especially for merchants with lots of data. The truth is that trained experts need to assess every website to identify all its nuances, some of which might only surface within a particular category with a more complex structure.
Of course, scanning can be a less expensive option; but at what opportunity cost, when the quality is not what it should be?
Intelligent Reach’s Director of Integration, Matt Sullivan, explains that our scraping technology extracts only the requested items (defined by the HTML tags on a merchant’s site), such as image URL, SKU and description; it doesn’t render the whole page, which is what happens when you view a page through a web browser. “This has two immediate advantages; namely, that it drastically reduces both the load and the bandwidth on the target web server,” says Matt.
“As a consequence of the saving made by this reduced load, we can potentially increase the rate [the number of pages scraped simultaneously], which helps us work through large sites in a sensible length of time, whilst not overloading, or causing undue strain on, their web system.”
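As a rough illustration of those two ideas, targeted extraction and a controlled request rate, here is a hedged sketch rather than IR’s actual implementation; the URLs and CSS selectors are assumptions. It parses only the requested fields from each product page and caps how many pages are in flight at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

# Hypothetical selectors; in practice these are tuned per merchant site.
FIELDS = {
    "sku": "span.product-sku",
    "image_url": "img.product-image",
    "description": "div.product-description",
}

def extract(url: str) -> dict:
    """Fetch one product page and pull only the requested fields."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    record = {"url": url}
    for field, selector in FIELDS.items():
        node = soup.select_one(selector)
        if node is None:
            record[field] = None
        elif field == "image_url":
            record[field] = node.get("src")
        else:
            record[field] = node.get_text(strip=True)
    time.sleep(0.5)  # polite pause so the target server is never hammered
    return record

# Hypothetical list of product URLs to work through.
urls = ["https://www.example-shop.com/product/%d" % i for i in range(1, 6)]

# Bounded concurrency: at most 4 pages in flight at any moment.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(extract, urls))
```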
Moreover, Matt points out, Intelligent Reach only scrapes at agreed times each day: ideally, after site replication has taken place in the small hours of the morning.
“If we are running a complementary scrape to supplement the data already provided in a separate product feed, we will scrape only for the missing information and nothing else.” This also serves to improve speed and reduce load.
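A sketch of that gap-filling idea, again under assumed field names and URLs: walk the merchant’s existing feed, and only fetch a page when a row is actually missing something.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical selector for the one field we happen to be missing.
IMAGE_SELECTOR = "img.product-image"

# Hypothetical feed rows; in reality these come from the merchant's own export.
feed = [
    {"sku": "A1", "url": "https://www.example-shop.com/product/a1",
     "image_url": None},
    {"sku": "B2", "url": "https://www.example-shop.com/product/b2",
     "image_url": "https://cdn.example-shop.com/b2.jpg"},
]

for row in feed:
    if row["image_url"] is not None:
        continue  # the feed already has this field; no request is made
    html = requests.get(row["url"], timeout=10).text
    node = BeautifulSoup(html, "html.parser").select_one(IMAGE_SELECTOR)
    row["image_url"] = node.get("src") if node else None
```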
Matt goes on to highlight that each and every scrape template we use to lift data from a merchant’s website is bespoke to that website, and is optimised both for performance and for the specific nuances of the site in question.
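One way to picture such a template, purely as an assumed data structure rather than IR’s real format, is a per-merchant mapping from feed field to the selector that locates it in that site’s markup; the bespoke work lies in choosing and tuning those selectors per site.

```python
# Hypothetical per-site templates: the field names are shared across merchants,
# but the selectors that locate them are bespoke to each site's markup.
TEMPLATES = {
    "example-shop.com": {
        "sku": "span.product-sku",
        "description": "div.product-description",
        "image_url": "img.product-image",
    },
    "another-merchant.co.uk": {
        "sku": "p#item-code",
        "description": "section.details > p",
        "image_url": "div.gallery img:first-of-type",
    },
}
```

Extraction for a given merchant then just loads that merchant’s entry and applies each selector, so tuning one site never touches the shared pipeline.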
For those who are still hesitant, Intelligent Reach is committed to total web scraping transparency: we use fixed IP addresses for the process, which clients can monitor to reassure themselves of the safety of the data mining activity and of its minimal impact on their site.
So, to summarise: whilst data extracted directly from your system is the ideal way to create a feed, scraping, particularly in the safe and efficient way that Intelligent Reach runs it, is the best possible alternative for merchants whose sites have a complex and nuanced structure. Don’t be fooled by the terminology: scraping is no less secure than scanning, and overall it’s much more thorough and flexible. If it weren’t, Intelligent Reach wouldn’t do it.