Common Crawl creates and maintains an open repository of web crawl data that can be viewed and analyzed by anyone.
https://commoncrawl.org/
Here are 3 easy steps to download the data in any language, e.g. Marathi or Gujarati.
1) Clone the repo: git clone https://github.com/qburst/common-crawl-malayalam.git
2) Change the language code from mal to mar on the line
AND content_languages LIKE 'mal%'
(a sketch of the full Athena query appears after this step).
The other codes are: hin (Hindi), mal (Malayalam), mar (Marathi), guj (Gujarati).
Run this shell script after replacing XXX XXX with your AWS access key and secret key.
./extract_malayalam_warcs.sh XXX XXX 2021-10 s3://nlp-malayalam
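For reference, here is a minimal boto3 sketch of the kind of Athena query that step 2 edits. The database/table name (ccindex.ccindex), region, and results bucket below are assumptions for illustration, not the repo's exact script:

import boto3

# Assumed names for illustration; the repo's script defines its own.
DATABASE = "ccindex"                          # Athena database holding the Common Crawl index
RESULTS = "s3://nlp-marathi/athena-results/"  # bucket for Athena query results (assumed)

# The language filter is the part step 2 changes: 'mal%' -> 'mar%'.
query = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex
WHERE crawl = 'CC-MAIN-2021-10'
  AND subset = 'warc'
  AND content_languages LIKE 'mar%'
"""

athena = boto3.client("athena", region_name="us-east-1")
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS},
)
print("Query execution id:", response["QueryExecutionId"])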
3) Change the Unicode range from Malayalam: range(3328, 3456) to your desired language's range, for example Devanagari: range(2304, 2432) or Gujarati: range(2688, 2816) (a sketch of this code-point filtering follows these steps).
Run this shell script after replacing XXX XXX with your AWS access key and secret key.
./filter_malayalam.sh XXX XXX s3://nlp-malayalam/2021-10/warcs s3://nlp-malayalam/2021-10/output2
The filtered and unfiltered data will be available in the output2 subfolder of the s3 bucket.
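To make the Unicode-range idea in step 3 concrete, here is a minimal Python sketch of code-point filtering; it is an illustration with an assumed threshold, not the repo's filter script:

# Unicode blocks as Python ranges; the numbers match those mentioned in step 3.
DEVANAGARI = range(2304, 2432)   # U+0900 - U+097F (Marathi, Hindi)
GUJARATI = range(2688, 2816)     # U+0A80 - U+0AFF
MALAYALAM = range(3328, 3456)    # U+0D00 - U+0D7F

def fraction_in_range(text, codepoints):
    # Fraction of non-space characters whose code points fall inside the block.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in codepoints for c in chars) / len(chars)

def keep_line(line, codepoints=DEVANAGARI, threshold=0.5):
    # Keep lines that are mostly in the target script (threshold is assumed).
    return fraction_in_range(line, codepoints) >= threshold

print(keep_line("ही मराठी ओळ आहे"))        # True: Devanagari text
print(keep_line("This is English text"))   # False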
_____
Update:
The good folks at Sorbonne University and Inria extracted sentences from Common Crawl for all languages and made them available at
https://oscar-corpus.com/
I have no words to thank them. The paper is available here:
http://corpora.ids-mannheim.de/CMLC7-final/CMLC-7_2019-Oritz_et_al.pdf
Without this data, it would have been very difficult to extract reliable data for languages like Marathi and Gujarati.
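If you just need the sentences, one convenient way to pull the OSCAR Marathi subset is through the Hugging Face datasets library; the dataset and config names below are assumptions that may differ between OSCAR releases, so check https://oscar-corpus.com/ for the current access instructions:

from datasets import load_dataset

# "oscar" / "unshuffled_deduplicated_mr" are assumed names; they may vary by release.
oscar_mr = load_dataset("oscar", "unshuffled_deduplicated_mr", split="train")
print(oscar_mr[0]["text"][:200])   # first 200 characters of the first Marathi document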
Labels: athena, aws, python, shell script