Common Crawl creates and maintains an open repository of web crawl data that can be viewed and analyzed by anyone.
https://commoncrawl.org/
Here are 3 easy steps to download the data in any language, e.g. Marathi or Gujarati.
1) Clone the repo: git clone https://github.com/qburst/common-crawl-malayalam.git
2) Change the language code from mal to mar on the line
AND content_languages LIKE 'mal%'
(a sketch of the full Athena query appears after this step).
The other codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati).
Run this shell script after replacing XXX XXX with your AWS access key and secret key.
./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam
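For reference, here is a minimal boto3 sketch of the kind of Athena query that step 2 edits. The database/table name (ccindex.ccindex), region, and results bucket below are assumptions for illustration, not the repo's exact script:

import boto3

# Assumed names for illustration; the repo's script defines its own.
DATABASE = "ccindex"                          # Athena database holding the Common Crawl index
RESULTS = "s3://nlp-marathi/athena-results/"  # bucket for Athena query results (assumed)

# The language filter is the part step 2 changes: 'mal%' -> 'mar%'.
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2021-10'
  AND subset = 'warc'
  AND content_languages LIKE 'mar%'
"""

athena = boto3.client("athena", region_name="us-east-1")
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS},
)
print("Query execution id:", response["QueryExecutionId"])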
3) Change the Unicode range from Malayalam: range(3328, 3456) to your desired language's range, for example Devanagari: range(2304, 2432) or Gujarati: range(2688, 2816) (a sketch of this code-point filtering follows these steps).
Run this shell script after replacing XXX XXX with your AWS access key and secret key.
./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2
The filtered and unfiltered data will be available in the output2 subfolder of the s3 bucket.
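To make the Unicode-range idea in step 3 concrete, here is a minimal Python sketch of code-point filtering; it is an illustration with an assumed threshold, not the repo's filter script:

# Unicode blocks as Python ranges; the numbers match those mentioned in step 3.
DEVANAGARI = range(2304, 2432)   # U+0900 - U+097F (Marathi, Hindi)
GUJARATI = range(2688, 2816)     # U+0A80 - U+0AFF
MALAYALAM = range(3328, 3456)    # U+0D00 - U+0D7F

def fraction_in_range(text, codepoints):
    # Fraction of non-space characters whose code points fall inside the block.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in codepoints for c in chars) / len(chars)

def keep_line(line, codepoints=DEVANAGARI, threshold=0.5):
    # Keep lines that are mostly in the target script (threshold is assumed).
    return fraction_in_range(line, codepoints) >= threshold

print(keep_line("ही मराठी ओळ आहे"))        # True: Devanagari text
print(keep_line("This is English text"))   # False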
_____
Update:
The good folks at Sorbonne University and Inria extracted sentences from Common Crawl for all languages and made them available at
https://oscar-corpus.com/
I have no words to thank them. The paper is available here:
http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf
Without this data, it would have been very difficult to extract reliable data for languages like Marathi and Gujarati.
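If you just need the sentences, one convenient way to pull the OSCAR Marathi subset is through the Hugging Face datasets library; the dataset and config names below are assumptions that may differ between OSCAR releases, so check https://oscar-corpus.com/ for the current access instructions:

from datasets import load_dataset

# "oscar" / "unshuffled_deduplicated_mr" are assumed names; they may vary by release.
oscar_mr = load_dataset("oscar", "unshuffled_deduplicated_mr", split="train")
print(oscar_mr[0]["text"][:200])   # first 200 characters of the first Marathi document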
Labels: athena, aws, python, shell script