The problem is the sheer size of the Internet. There are so many sites out there reporting on the same things. When Apple posts a press release, a hundred other sites will write about it, and a hundred more will copy those articles. Information is rapidly propagated through networks of sites until it reaches the target audience. Everyone benefits.
The Threat of Stolen Content
Google has a tough job to do: making sure that a content source is the original publisher of its content. It's easy for a malicious site to scrape thousands of pages of content and publish them all under its own name. Punishment is light -- a page may be removed from search results, but the malicious site simply publishes more. This leaves webmasters in a tricky situation. How do you determine if your content is being stolen, and how do you protect it?
Stealing content breaks copyright law. It is plagiarism. The problem is that the people who steal content are incredibly hard to find and punish. Their sites can be removed from rankings or taken down altogether, but that doesn't stop more from popping up. Webmasters, web hosts and search engines need to remain constantly vigilant to make sure they aren't being used to host or promote stolen content.
Why Content is Stolen
If stealing content breaks the law, why do people do it? The answer is usually a combination of laziness and profit. Malicious webmasters can steal large amounts of content for an ever-changing rotation of websites because they know they can get away with it. They set up these sites as networks and get them indexed by a search engine. The search engine has no way to tell immediately whether the content is stolen, and a site with a large amount of content wields significant power in terms of PageRank and links. The network of stolen content then links to a legitimate site, giving it a huge boost in PageRank and allowing it to bring in more profits.
Another common reason is ignorance. Many people simply do not understand copyright law, and this is most common with images. They think that once something has been posted online, it is free for the taking. This is simply not true, and people who steal content out of ignorance cannot use that as an excuse if the matter comes to court.
Copyright on the Web
By default, most of what you post on a website is protected by copyright law. This includes text, images, audio and video, the layout, design and code of a website and any unique content not covered by those categories. As a webmaster, you can link to other websites but you cannot copy their content without permission. The same goes for images -- unless an image is marked as free, you cannot use it. Some websites even set requirements for linking to them, and other webmasters must follow those requirements.
Dealing with content on the Internet is much like dealing with print sources. You cannot copy large amounts of content verbatim. You can quote small amounts or paraphrase with proper attribution to the source, but the exact amounts vary. For the most part, the limitations are common sense. Beyond those limited uses, you cannot use content that is not your own without permission, especially not in ways that change the meaning or portray the original source in a negative light.
Copyright on Graphics Online
Graphics are an interesting case. Image search engines make it easy to find images that relate to virtually any topic. It's easy to think that these images, since they were easy to find, are free to use. Quite the opposite is true. In order to use an image on your site you will need to make it yourself, find it on a site that offers free images for use, or pay royalties or otherwise meet the requirements of the copyright holder. These requirements can range from a monetary fee to a credit link to simply asking permission first.
Issues with Internet Copyright
The biggest issue with Internet copyright law is the fact that the Internet spans the globe. Each country has different copyright laws. Some are much stricter than others. If an American publishes a website hosted on a server in Sweden, and it's stolen by an Australian hosting their page in Germany, which country's laws apply?
Another problem with enforcing copyright laws online is the volume of content. It's virtually impossible to monitor every single website looking for stolen content. There is no one governing body capable of such oversight. Because of this, reporting stolen content largely falls to the webmaster who originally owns the content. This becomes even trickier when it comes to proof.
It's easy to obfuscate the original source of a piece of content, or make it harder to identify that the content is copied. This is called spinning the content, where a certain percentage of the content is changed but the meaning remains the same. Combine this with the ability to backdate content and it becomes difficult to prove that content is stolen.
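One way to see why spinning defeats exact-match searches, yet still leaves traces, is to compare texts by overlapping word sequences instead of whole sentences. The sketch below is only an illustration under stated assumptions: it uses three-word "shingles" and Jaccard similarity, a generic near-duplicate technique that is not claimed to be what any particular search engine or plagiarism checker uses. The function names and sample sentences are made up for the example.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample texts: an original, a lightly spun copy, and an
# unrelated sentence for comparison.
original = "It is easy for a malicious site to scrape thousands of pages of content."
spun = "It is simple for a malicious site to scrape thousands of pages of text."
unrelated = "Reverse image search can help photographers find copies of their work."

print("spun copy:", round(similarity(original, spun), 2))
print("unrelated:", round(similarity(original, unrelated), 2))
```

A lightly spun copy still shares many shingles with the original, so it scores well above unrelated text. The more heavily an article is rewritten, the lower the score falls, which is why heavy spinning makes theft harder to prove.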
Identifying Stolen Content
Again, the task of identifying stolen content largely falls on the shoulders of the webmasters who own the content to begin with. There are several tools and techniques a webmaster can use to check if their content has been copied.
1. Searching manually. In the case of text content, simply copying a section of text and pasting it into a search engine will show whether another site has been indexed with the same content. It's helpful for locating scraper sites operating automatically, but the more an article is spun, the less this technique works.
2. Reverse Image Search. Using a reverse image search service such as TinEye or Google Search by Image lets the owner of an image find out whether it has been stolen. Unfortunately, image indexes are much smaller than text search indexes because the technology has not been around as long. This makes it harder to identify stolen images.
3. Copyscape. Copyscape is one of the best tools available for scanning for spun content. It searches a vast index of Internet content for phrases and sentences that match your content to a certain extent. It will then report the sites that copy the content to you. It's up to you what to do with the information, but the most common route is to file a report.
4. Analytics tools. If your content includes links to your site, you can often catch robotic scrapers by monitoring where your incoming traffic comes from. WordPress sites use trackbacks for similar purposes -- to notify you when your content is being linked to.
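The analytics approach in item 4 can be sketched in a few lines. This is a minimal illustration, assuming you can export a plain list of referrer URLs from your analytics tool. The function name, the sample URLs and the list of trusted domains are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def unfamiliar_referrers(referrer_urls, known_domains):
    """Count visits per referring domain, skipping domains already trusted."""
    counts = Counter()
    for url in referrer_urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # not a full URL, so there is no domain to report
        if host.startswith("www."):
            host = host[4:]
        if host not in known_domains:
            counts[host] += 1
    # Most frequent unfamiliar domains first.
    return counts.most_common()


# Hypothetical export of referrer URLs from an analytics tool.
referrers = [
    "https://www.google.com/search?q=copyright",
    "https://scraper-one.example/stolen-article",
    "https://scraper-one.example/another-copy",
    "https://myblog.example/about",
    "https://scraper-two.example/post/123",
]
known = {"google.com", "myblog.example"}

for domain, visits in unfamiliar_referrers(referrers, known):
    print(domain, visits)
```

An unfamiliar domain in the output is not proof of theft. It is only a prompt to visit the referring page and check whether it carries a copy of your content.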
Steps to Take When Your Content is Stolen
The first step to take is to note down the URL of the content that has been stolen, as well as the URL of the page that is stealing the content. These two URLs are the starting point for showing that you own the content.
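Keeping these notes in a consistent form makes later reports easier to file. The sketch below is one possible way to do it, not a required or legally recognized format. The timestamp and the hash of the copied text are additions assumed here for convenience, and the function name and URLs are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def evidence_record(original_url, copy_url, copied_text):
    """Bundle the two URLs with the time and a fingerprint of the copied text."""
    return {
        "original_url": original_url,
        "copy_url": copy_url,
        # When the copy was observed, in UTC.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        # SHA-256 of the copied text as it appeared at that time.
        "copied_text_sha256": hashlib.sha256(copied_text.encode("utf-8")).hexdigest(),
    }


# Hypothetical URLs and text for illustration.
record = evidence_record(
    "https://myblog.example/my-article",
    "https://scraper-one.example/stolen-article",
    "The paragraph as it appears on the copying page.",
)
print(json.dumps(record, indent=2))
```

Saving each record to a file as soon as you find a copy gives you a dated log to draw on when you contact the other webmaster or file a report.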
Next, you should get into contact with the webmaster who owns the scraper page. Sometimes they are legitimate sites that simply don't know better. If they have contact information posted, you can send them an e-mail informing them of their theft and asking to have the content removed.
If the site does not provide contact information, or ignores your polite requests, you can perform a WHOIS lookup on their domain. Unless the domain was registered privately, you can find the contact information for the person who owns the site. You can then send them a more strongly worded letter. If that fails, contact their web host or domain registrar and file a takedown notice. Many companies provide a DMCA (Digital Millennium Copyright Act) form to fill out. DMCA notices are often complied with because ignoring them can lead to legal issues.
If none of the above works, you can also contact Google. Many scraper pages use stolen content specifically to give their sites artificially high PageRank. If you report the page as a scraper, it may be ranked lower, and the site loses the benefit of the content. In many cases you can also file a DMCA notice with Google, which can remove the infringing page from its search results.
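A WHOIS lookup is usually done through a website or a command-line tool, but the protocol itself is simple: connect to a WHOIS server on TCP port 43, send the domain name followed by a line break, and read the reply. The sketch below assumes `whois.iana.org` as a starting server, which typically refers you on to the registry responsible for the domain. Response formats differ between registries, so the keyword filter and the sample response are assumptions for illustration only.

```python
import socket


def whois_query(domain, server="whois.iana.org", timeout=10.0):
    """Send a WHOIS query over TCP port 43 and return the raw text reply."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def contact_lines(whois_text):
    """Pick out "Key: value" lines that look like registrar or contact details."""
    keywords = ("registrar", "abuse", "email", "e-mail")
    found = []
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip() and any(k in key.lower() for k in keywords):
            found.append((key.strip(), value.strip()))
    return found


# Hypothetical response text, so the parsing step runs without a network call.
sample = """Domain Name: SCRAPER-ONE.EXAMPLE
Registrar: Example Registrar, Inc.
Registrar Abuse Contact Email: abuse@registrar.example
Name Server: NS1.HOST.EXAMPLE
"""

for key, value in contact_lines(sample):
    print(f"{key}: {value}")
```

To query a real domain, call `contact_lines(whois_query("some-domain.com"))`, passing the registry's own WHOIS server as `server` if the first reply only contains a referral. When the domain is registered privately, the registrar's abuse contact is often the most useful line in the reply.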
Unfortunately, there is no good way to prevent your content from being stolen. Adding code to auto-attribute your content just makes the copy link back to you. Code to stop copy and paste can be worked around. In many cases copying is almost entirely automated, and the best webmasters can do is keep up with reporting the scraper pages and minimizing the damage they deal.