#did not know notino is czech also?
slavicafire · 3 months
I don't know about lip tints, but you could check out Notino. They're Czech-based and have all kinds of offers on makeup and toiletries.
no idea why notino didn't cross my mind - they actually do have some lip tints! I am ordering a perfume restock from them next week so I will add that to my order, thank you so much!
engnews24h · 4 years
Jan Čurn (Apify): We are downloading a billion pages a month from the web
Apify is a young Czech company that has built its business on providing a platform for downloading and structuring web data. Its turnover last year exceeded $1.5 million, it has 30 employees, 25 of them programmers, and it is used by around 300 customers worldwide.
Although the company was doing well last year, it also plans to raise millions of dollars from venture capitalists this year to further scale the business.
How did you come up with the idea of building your business purely on collecting data from the web?
The idea originated years ago at Matfyz, when we discovered, in a completely different (AI/ML) project, how difficult it is to read web data by machine. It was a project for automated downloading of used car data. There we realized that the quality of fully automated extraction is often so low that it would be worth building a project that takes exactly the opposite approach, one where the need for manual setup is not a bug but a major advantage.
Today, we can confidently say that we are working to open up web data to the world. On the one hand, there is the indisputable fact that most information on the web is very hard to read by machine. On the other hand, we are convinced that it is often a shame not to work with this data. If you have to, or want to, work with a lot of web data, collecting and structuring it is basically just drudgery that robots can do better for you, and you can spend the saved time and creative energy on something more useful.
In practice, we have our own platform for web scraping and web automation, where you can automate virtually any process you could perform manually in a web browser (downloading and structuring data, browsing product catalogs, filling in forms, uploading files or invoices). The platform itself is very general and can be used for virtually anything, from data extraction and sending emails to data transformations. Data is downloaded in a structured form (typically as a table), based on the categorization the customer defines at the start.
The big advantage is that once you have page data split into specific attributes, you can work with it easily. You can run further analysis on it, monitor the competition, track prices and the composition of e-shop product ranges, and so on. The scenarios are almost unlimited; it depends only on your imagination what you use them for.
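As a rough illustration of what "downloading data in a structured form" means in practice, here is a minimal sketch that turns a single, hypothetical e-shop category page into a table of products. The URL and CSS selectors are illustrative assumptions, not a real shop or Apify's actual API.

```python
# Minimal scraping sketch: turn an e-shop category page into structured rows.
# The URL and the CSS selectors are hypothetical placeholders.
import csv

import requests
from bs4 import BeautifulSoup

CATEGORY_URL = "https://example-shop.test/category/perfumes"

response = requests.get(CATEGORY_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for product in soup.select(".product"):
    name = product.select_one(".name")
    price = product.select_one(".price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True),
                     "url": CATEGORY_URL})

# Structured output "in the form of a table", as described above.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(rows)
```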
Who actually uses such services on your platform?
It is a hugely heterogeneous range of subjects from around the world. We are used by garage start-ups, public administration and huge corporations. As for the use cases themselves, they range from checking court files in Argentina, through monitoring and comparing used car or real estate offers in Europe, to monitoring e-shops in the Czech Republic.
And that is not an exhaustive list; sometimes we are surprised when a customer comes to us with an idea that had not occurred to us in the context of data scraping. For example, the US nonprofit Thorn uses us to aggregate data from various classified-ad sites and erotic services in order to help find missing children and detect sexual abuse of minors.
What are the most common recurring scenarios?
The most common use of the Apify platform is downloading products and offers from e-shops, i.e. classic e-commerce; aggregating real estate offers and monitoring the competition are also popular. But banks and credit card issuers also use us for compliance. They often need to be able to prove to regulators that their product pages looked a certain way at a given time and met legal requirements. Similarly, we are used by customers who need to check that online ads are displayed correctly (at a given time and place).
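As an illustration of what such a timestamped piece of compliance evidence might consist of, here is a small sketch that stores a page's raw HTML together with the retrieval time and a content hash. The URL is a placeholder, and this is a generic sketch rather than Apify's actual implementation.

```python
# Sketch of a compliance snapshot: store the page exactly as retrieved,
# together with when it was retrieved and a hash of the content.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

PAGE_URL = "https://example-bank.test/credit-card-offer"  # hypothetical page

response = requests.get(PAGE_URL, timeout=30)
response.raise_for_status()

html = response.text
snapshot = {
    "url": PAGE_URL,
    "retrieved_at": datetime.now(timezone.utc).isoformat(),
    "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
    "http_status": response.status_code,
}

# Keep the raw HTML and the metadata side by side as the evidence record.
Path("snapshot.html").write_text(html, encoding="utf-8")
Path("snapshot.json").write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
```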
Do ticket comparison sites use you, for example?
In general, that is one of the popular data scraping use cases I had somewhat forgotten about. As far as I know, we do have some customers who use us this way, but in general we don't see everything that customers do on our platform. We have two types of customers: some come wanting a turnkey service, and there we have a pretty good overview of what they use our service for. The others use us purely as a platform and set everything up themselves, so we don't see into their business.
Is there any other interesting use you haven't mentioned?
An interesting use case I haven't mentioned yet is machine learning. For machine learning, you need a huge amount of data to start training an algorithm. If you are not Google or Facebook, you may have trouble getting hold of such datasets at the beginning, yet they are at your fingertips on the web. We have a number of customers who use us just for building datasets; for example, Google Images can be aggregated pretty well this way, as can various catalogs.
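A minimal sketch of the dataset-building step described here, assuming the image URLs have already been collected by a scraping run; the URLs and folder layout are placeholders.

```python
# Sketch: turn a list of scraped image URLs into a local training dataset.
# The URL list would come from a scraping run; these are placeholders.
from pathlib import Path

import requests

image_urls = [
    "https://example.test/images/cat_001.jpg",
    "https://example.test/images/cat_002.jpg",
]

dataset_dir = Path("dataset/cats")
dataset_dir.mkdir(parents=True, exist_ok=True)

for i, url in enumerate(image_urls):
    resp = requests.get(url, timeout=30)
    if resp.ok:
        # Store each image under a stable, zero-padded file name.
        (dataset_dir / f"{i:06d}.jpg").write_bytes(resp.content)
```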
Another frequent use case is aggregating Google search results. This is useful, for example, for analyzing how different search phrases perform. Google itself (intentionally) does not provide an API for search results, so we successfully substitute for it.
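As a sketch of what such an analysis can look like once the result pages have been collected, the snippet below works from an assumed table of (phrase, position, URL) rows rather than from Google's live HTML; the phrases, URLs, and monitored domain are all illustrative.

```python
# Sketch: given already-scraped search results as (phrase, position, url) rows,
# report the best position at which a monitored domain appears for each phrase.
scraped_results = [
    {"phrase": "buy perfume online", "position": 1, "url": "https://competitor.test/p/1"},
    {"phrase": "buy perfume online", "position": 4, "url": "https://my-shop.test/perfumes"},
    {"phrase": "cheap perfume", "position": 2, "url": "https://my-shop.test/sale"},
]

MONITORED_DOMAIN = "my-shop.test"

best_positions = {}
for row in scraped_results:
    if MONITORED_DOMAIN in row["url"]:
        phrase = row["phrase"]
        best_positions[phrase] = min(row["position"],
                                     best_positions.get(phrase, row["position"]))

for phrase, position in best_positions.items():
    print(f"{phrase!r}: best position {position}")
```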
How does Google respond to this activity? I suppose they probably don't like automated queries.
It is true that Google tries to limit it to some extent, but some e-shops, for example, block data scraping attempts much more strictly. Basically, all Google does is show you a captcha if you exceed the number of queries allowed from a single IP address (the limit is about 300 queries per day), which is an easily bypassed obstacle. Maybe it is because they themselves scrape data from across the web, and a harsher approach would be quite hypocritical.
So you have purchased tens of thousands of IPv4 addresses that you rotate?
Exactly, we buy addresses that we rotate, and we sell a large part of them on to our clients. We use two kinds of IP addresses: datacenter addresses, of which we have thousands, are relatively cheap and fast, but they are suitable only for specific purposes. For downloading, we mainly use residential addresses.
The problem with datacenter addresses is that they are quite easy to detect and block. The alternative is to rent users' IP addresses. It is effectively a legal botnet. The way it works is that there are services that lease users' IP addresses, and you then download directly through those users' computers via a proxy server. You can obtain IP addresses from a specific city or country if needed. The addresses are obtained quite legally: people most often exchange access for a "free" VPN connection or a free game, or are paid directly for renting out their IP address (on the order of a few dollars per month).
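As a rough sketch of the mechanics described here, the snippet below rotates requests across a pool of proxy addresses. The proxy URLs, credentials, and target pages are placeholder assumptions; in a real setup the addresses would come from a proxy provider rather than being hard-coded.

```python
# Sketch: rotate outgoing requests across a pool of proxy addresses.
# The proxy URLs below are placeholders; real residential proxies are
# rented from a provider and usually require credentials.
import itertools

import requests

proxy_pool = itertools.cycle([
    "http://user:pass@proxy-1.example.test:8000",
    "http://user:pass@proxy-2.example.test:8000",
    "http://user:pass@proxy-3.example.test:8000",
])

urls_to_fetch = [
    "https://example-shop.test/product/1",
    "https://example-shop.test/product/2",
]

for url in urls_to_fetch:
    proxy = next(proxy_pool)  # each request goes out through a different address
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    print(url, resp.status_code, "via", proxy)
```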
Does the rapidly growing cost of IPv4 addresses interfere with your business model?
I think they are still affordable enough for our needs, so not yet. Officially, the addresses have run out, but there are still a large number of companies that historically own huge blocks of addresses that they are happy to sell off or lease. This has created a relatively self-sufficient free market.
Returning to artificial intelligence, how do you see the real possibility of using AI and ML in data scraping?
At present, machine learning can be used effectively in data scraping only for a very small percentage of cases with relatively well structured data. ML may also be able to help you set up your robot today, but the result still has to be checked by a human.
But we are not giving up on the idea; we even have a project at Matfyz on using AI for automatic data extraction, so we will see.
To pause for a moment on your non-profit projects, how did the idea for Shopkeeper come about? Did it come from you or from TopMonks?
The idea originally appeared two years ago at the Datafestťák hackathon in Hradec Králové. There my colleague Jakub Balada met the guys from TopMonks and Keboola, and they thought a project like this could be built as a browser extension. Ideally, each of us could bring our own added value to the project: we can collect data, TopMonks have a wealth of experience in building extensions, and Keboola knows how to work with data well.
However, we did not invent the project entirely from scratch; we had previously produced a report with data analysis from Black Friday, which kicked off the whole idea in the form of a browser extension. As the project gained popularity, we started devoting a bit more time to it: a website was created, the number of covered e-shops gradually expanded, and so on. Today it is a project close to our hearts.
But it's not your only nonprofit project. Can you describe the others?
Another, relatively young one is our cooperation with Michal Bláha on the State Watcher (Hlídač státu), for which we scrape, for example, politicians' Facebook comments for later sentiment analysis. For the State Watcher we have also been downloading, for some time now, documents from municipal council websites that do not provide their data via an API.
As Shopkeeper gained popularity, the biggest e-commerce players gave up resisting and started sending you data themselves instead of putting obstacles in your way. But I know the approach wasn't that accommodating everywhere, and there was even a pre-litigation notice. Can you say more about that?
I'd rather not dwell on it, but one e-shop took it very badly, and we resolved it by stopping downloading its data. It was, however, the only one among the big players with a turnover of over a billion; the others take it in much better spirit and send us the data themselves, precisely to avoid inaccuracies in the downloads.
Can you be a little more specific, was it Notino?
I don't want to comment, but you can look at which of the big e-shops is missing there, and it will be clear. In any case, it is a non-profit project, and it is hard to accuse us of ill will or of pointing out others' cheating in order to enrich ourselves.
No, but one could object that the downloaded data might be inaccurate.
That is something that can happen (for example, if a product name is mapped incorrectly), but whenever it did, we were always very sorry and dealt with it very quickly.
How much do companies resist having their data scraped from the web today?
It does happen, but a pre-litigation notice is already a bit of an extreme; mostly they use technical means such as blocking IP addresses that send too many queries, captchas, or browser fingerprinting (to tell a human browser user from a robot). In general, though, in the US for example the trend is rather to accept web scraping as a normal part of life on the web. A US court even ordered LinkedIn last year to allow hiQ Labs to scrape its data, because the network's users had provided their data in the belief that it would be public.
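To make the first of those technical measures concrete, here is a toy sketch (not drawn from any particular e-shop) of a server-side endpoint that starts refusing requests once an IP address exceeds a fixed daily quota; the Flask route, the limit, and the in-memory counter are all illustrative assumptions.

```python
# Toy sketch of the simplest defense mentioned above: refuse requests from
# an IP address once it exceeds a fixed number of queries per day.
# In-memory counters only; a real site would use a shared store and
# combine this with captchas and fingerprinting.
from collections import defaultdict
from datetime import date

from flask import Flask, abort, request

app = Flask(__name__)
DAILY_LIMIT = 300
counters = defaultdict(int)  # (ip, day) -> number of requests seen so far

@app.route("/products")
def products():
    key = (request.remote_addr, date.today().isoformat())
    counters[key] += 1
    if counters[key] > DAILY_LIMIT:
        abort(429)  # Too Many Requests
    return {"products": []}  # placeholder response
```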
Is there any part of web scraping that you already consider over the line, such as collecting email addresses?
Our terms of service state that the platform must not be used for any illegal activity, but beyond that it is up to each person who uses the service. One thing we consider over the line is creating fake accounts for various services just to gain access to their data. But what is publicly available on the web is, in our opinion, free to download.
Finally, some numbers: how many pages and how much data do you download from the web per month?
In 2019 our volume of downloaded and transferred data was about 3.3 petabytes (3,315 TB) and we downloaded an estimated 11 billion pages, but the numbers are growing fast; we are currently downloading about 1 billion pages per month (hundreds of terabytes).
Are you working to optimize this volume?
We work on it on many levels, but that would be an interview of its own. Our whole system runs on Amazon Web Services, which has no technical problem with large data volumes but also knows how to charge for them, so we are constantly optimizing; paying for data transfer is one of our biggest infrastructure costs. We use AWS because a large part of our customers are in the US and we generally need excellent connectivity from all parts of the world.
Have you ever thought about switching to Google Cloud?
Yes, but that huge amount of data locks us in a bit with our current vendor. If we ever did it, we would have to migrate all the data at once, otherwise it would not pay off.
Source: lupa.cz