This book describes the main practical tasks in the construction of web corpora up to giga-token size. Among these tasks are the sampling process and the usual cleanup steps, such as boilerplate removal and the removal of duplicated content. Linguistic processing, and the problems that the different kinds of noise in web corpora cause for it, are also covered. The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language.