{"id":91,"date":"2010-11-07T07:15:01","date_gmt":"2010-11-07T07:15:01","guid":{"rendered":"http:\/\/wcarss.ca\/log\/?p=91"},"modified":"2010-11-07T04:27:55","modified_gmt":"2010-11-07T09:27:55","slug":"writing-a-4chan-scraper","status":"publish","type":"post","link":"https:\/\/wcarss.ca\/log\/2010\/11\/writing-a-4chan-scraper\/","title":{"rendered":"Writing a 4chan scraper"},"content":{"rendered":"<p>I&#8217;m writing a paper on memes for a class, and while I&#8217;m not really ready to hand the paper in, I have been doing some preliminary work for it. Today, I had &#8220;free moments&#8221; for the first time in about a month, so I set to work on it.<\/p>\n<p>The key tasks this code must do are:<\/p>\n<p>1) Grab the front page of a forum, like boards.4chan.org\/b\/<br \/>\n2) Parse out the links to each thread on it<br \/>\n3) Go to each thread and grab the post number, date\/time, and content of each post.<br \/>\n4) Save this into a database.<br \/>\n5) Run, say, once every 10 minutes or something.<\/p>\n<p>Then I&#8217;ll see about training it on \/v\/ or something, and tweak it to run on a few other chans. At that point, I&#8217;ll at least be collecting data.<\/p>\n<p>Out of curiosity, how much storage will that take up? Back of the envelope time!<\/p>\n<p>5.1 bytes \/ word (average word length in English)<br \/>\nEstimating 15 words per post (some long copypastas would drag this up)<br \/>\nEstimating 40 posts per thread (some will fall fast, others will be huge)<br \/>\nand .. huh. I don&#8217;t even know how many threads there are on the front page. 
Just looked: 15.<br \/>\nEvery 10 minutes.<\/p>\n<p>So, 40*15*(15*5.1+timestorage+some ancillary data+post number (an int, therefore 4)): 40 posts per thread, 15 threads per page, and 15 words at 5.1 bytes each per post.<\/p>\n<p>timestorage would be equivalent to a MySQL datetime (8 bytes), ancillary data would probably be about 40 characters for &#8220;&gt;&gt;012345678&#8221; (29 chars) and then some random spaces or HTML tags that get thrown in.<\/p>\n<p>Post number would, as mentioned, be a 4-byte int. So..<\/p>\n<p>40*15*(15*5.1+8+40+4) = 600*(76.5+52) = 600*128.5 = 77,100, let&#8217;s round that to 77,000 bytes per capture.<\/p>\n<p>Over the course of 2 weeks, if I capture 10 pages (3 from 4chan, 3 from 99chan, 2 from 420chan, 1 from 711chan, 1 from somewhere else) every 10 minutes, that&#8217;s<\/p>\n<p>2*7*24*6*10*77000 bytes (the 6 is 60 minutes \/ 10 minutes per grab cycle, the 10 is the pages per cycle)<br \/>\n336*60*77000<br \/>\n20160*77000<br \/>\n(cheated a bit with irb) 20160 * 77000 = 1,552,320,000 bytes<\/p>\n<p>that&#8217;s about 1.5 gigabytes of data o_o I&#8217;m pretty sure my server can handle it. That&#8217;s just 2 weeks across a handful of forums at a mediocre granularity, yeesh.<\/p>\n<p>Coding<\/p>\n<p>First step is grabbing the page. For whatever reason, the net\/http module wasn&#8217;t working the way I wanted it to, so I&#8217;m just using open-uri. With this, I can just call open(&quot;http:\/\/boards.4chan.org\/b\/&quot;) and get a temporary file back containing the page. Read it into a string, close the file, and we&#8217;re good to go.<\/p>\n<p>Not much later, we get deep into munging HTML. On boards.4chan.org\/b\/ (I remember it being img.4chan.org\/b\/ &#8212; when did that change? Shows that I don&#8217;t go there much any more) we can see that every post is separated by a &lt;hr&gt; tag. That&#8217;s pretty nice if we need posts delineated into packets, but can we go lower level than that? Aha! 
The &#8216;Reply&#8217; button only occurs on links to threads, so we can probably capture the list of threads easily:<\/p>\n<pre class=\"brush: ruby\">\r\nthreads = []\r\npage.each do |line|\r\n  if [] != (n = line.grep(\/(\\d*)&quot;&gt;Reply\/){$1}) then\r\n    threads &lt;&lt; n\r\n  end\r\nend\r\n\r\nputs &quot;The thread list is..&quot;\r\nthreads.each do |thread|\r\n  puts &quot;http:\/\/boards.4chan.org\/b\/res\/#{thread}&quot;\r\nend\r\nputs &quot;\\nWoot.\\n&quot;<\/pre>\n<p>That&#8217;s fantastic. Now we need to actually pull the information from each thread&#8230; For the purpose of testing, let&#8217;s just go with threads[0] and deal with the first thread we find. It looks like, content-wise, the meat of each post is in between &lt;blockquote&gt; tags. This means we&#8217;ll have to pull that stuff out! Regexes will be helpful.<\/p>\n<pre class=\"brush: ruby\">thread_posts = []\r\npage_f = open(&quot;http:\/\/boards.4chan.org\/b\/res\/#{threads[0]}&quot;)\r\npage = page_f.read.split(&quot;\\n&quot;)\r\npage_f.close\r\n\r\nposts = []\r\npage.each do |line|\r\n  if line =~ \/&lt;blockquote&gt;\/ then\r\n    n = line.grep(\/&lt;blockquote&gt;(.*)&lt;\\\/blockquote&gt;\/) {$1}[0]\r\n    puts &quot;Post is --#{n}--&quot;\r\n    posts &lt;&lt; n\r\n  end\r\nend\r\nthread_posts &lt;&lt; posts<\/pre>\n<p>One problem though &#8212; this has a lot of &lt;font&gt;&lt;\/font&gt; stuff in it from when people quote earlier posts. I&#8217;d almost leave it in there, but it has references to JavaScript functions and classes that I don&#8217;t really want &#8212; so let&#8217;s strip it and just have the post number show up. How can we do this? String#gsub!, in all its glory:<\/p>\n<p>n.gsub!(\/&lt;font.*&gt;(&gt;&gt;\\d*)&lt;\\\/a&gt;&lt;\\\/font&gt;\/, &#039;\\1&#039;) inserted right after the grep.<\/p>\n<p>This leaves us with thread_posts as an array containing 1 array of posts, which are each a string. Next up, we need to get the post number and the date, and the thread numbers. 
I should end up with an array that looks like this:<\/p>\n<p>[[thread_number, [[post_number, date, content], [post_number, date, content],&#8230;]], [thread_number, [[post_number, date, content], [post_number, date, content]]],&#8230;]<\/p>\n<p>At this point, I&#8217;ve added in regex-checking for the number and date, but something isn&#8217;t working &#8212; I changed to splitting on &lt;hr&gt; (since each post is delimited by a horizontal rule) but I&#8217;m not finding what I expect to, and it&#8217;s causing some array indexing to land on nil. So I&#8217;ll paste in my code, before I start breaking this monster into some finely tuned functions:<\/p>\n<pre class=\"brush: ruby\"># Written for Ruby 1.8, where String is Enumerable and so responds to grep\r\nrequire &#039;open-uri&#039;\r\n\r\n# Grab the board front page and split it into lines\r\npage_f = open(&#039;http:\/\/boards.4chan.org\/b\/&#039;)\r\npage = page_f.read.split(&quot;\\n&quot;)\r\npage_f.close\r\n\r\n# Collect the thread numbers that sit in front of each Reply link\r\nthreads = []\r\npage.each do |line|\r\n  if [] != (n = line.grep(\/(\\d*)&quot;&gt;Reply\/){$1}) then\r\n    threads &lt;&lt; n\r\n  end\r\nend\r\n\r\nputs &quot;The thread list is..&quot;\r\nthreads.each do |thread|\r\n  puts &quot;http:\/\/boards.4chan.org\/b\/res\/#{thread}&quot;\r\nend\r\nputs &quot;\\nWoot.\\n&quot;\r\n\r\n# Fetch the first thread and split it into per-post chunks on hr tags\r\nthread_posts = []\r\npage_f = open(&quot;http:\/\/boards.4chan.org\/b\/res\/#{threads[0]}&quot;)\r\npage = page_f.read.split(&quot;&lt;hr&gt;&quot;)\r\npage_f.close\r\n\r\n# Pull the content, post number and date out of each chunk\r\nposts = []\r\npage.each do |line|\r\n  if line =~ \/&lt;blockquote&gt;\/ then\r\n    content = line.grep(\/&lt;blockquote&gt;(.*)&lt;\\\/blockquote&gt;\/) {$1}[0]\r\n    content.gsub!(\/&lt;font.*&gt;(&gt;&gt;\\d*)&lt;\\\/a&gt;&lt;\\\/font&gt;\/, &#039;\\1&#039;)\r\n    #puts &quot;Post is --#{content}--&quot;\r\n  end\r\n  if line =~ \/&quot;norep\\d*&quot;\/ then\r\n    number = line.grep(\/&lt;span id=&quot;norep(\\d*)&quot;&gt;\/){$1}\r\n  end\r\n  if line =~ \/\\d\\d\\\/\\d\\d\\\/\\d\\d\\(.*\\)\\d\\d:\\d\\d:\\d\\d\/ then\r\n    date = line.grep(\/\\d\\d\\\/\\d\\d\\\/\\d\\d\\(.*\\)\\d\\d:\\d\\d:\\d\\d\/)\r\n  end\r\n  # number, date and content stay nil for any chunk that lacks them\r\n  posts &lt;&lt; 
[number[0], date[0], content]\r\nend\r\nthread_posts &lt;&lt; posts\r\n\r\nthread_posts[0].each do |post|\r\n  puts &quot;Post #{post[0]} at #{post[1]} said \\&quot;#{post[2]}\\&quot;&quot;\r\nend<\/pre>\n<p>Okay, so I&#8217;ll press on, tinker with irb, and get back to you in a bit. I&#8217;m also going to find a WordPress plugin that makes my code not look like rotten bananas.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m writing a paper on memes for a class, and while I&#8217;m not really ready to hand the paper in, I have been doing some preliminary work for it. Today, I had &#8220;free moments&#8221; for the first time in about a month, so I set to work on it. The key tasks this code must [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/posts\/91"}],"collection":[{"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/comments?post=91"}],"version-history":[{"count":14,"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/posts\/91\/revisions"}],"predecessor-version":[{"id":100,"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/posts\/91\/revisions\/100"}],"wp:attachment":[{"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/media?parent=91"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/categories?post=91"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wcarss.ca\/log\/wp-json\/wp\/v2\/tags?post=91"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}