Writing a 4chan scraper

I’m writing a paper on memes for a class, and while I’m not really ready to hand the paper in, I have been doing some preliminary work for it. Today, I had “free moments” for the first time in about a month, so I set to work on it.

The key tasks this code must do are:

1) Grab the front page of a forum, like boards.4chan.org/b/
2) Parse out the links to each thread on it
3) Go to each thread and grab the post number, date/time, and content of each post.
4) Save this into a database.
5) Run, say, once every 10 minutes.

Then I’ll see about training it on /v/ or something, and tweak it to run on a few other chans. At that point, I’ll at least be collecting data.

Out of curiosity, how much storage will that take up? Back of the envelope time!

5.1 bytes / word (average word length in English)
Estimating 15 words per post (some long copypastas would drag this up)
Estimating 40 posts per thread (some will fall fast, others will be huge)
and .. huh. I don’t even know how many there are on the front page. Just looked: 15.
Every 10 minutes.

So, 40*15*(15*5.1 + timestorage + some ancillary data + post number (an int, therefore 4)). That's posts per thread, times threads per page, times bytes per post (15 words at 5.1 bytes each, plus the rest).

timestorage would be equivalent to a MySQL DATETIME (8 bytes), and ancillary data would probably be about 40 characters: a quote link like “>>012345678” with its markup (29 chars), plus some random spaces or HTML tags that get thrown in.

Post number would, as mentioned, be a 4-byte int. So..

40*15*(76.5+8+40+4) = 600*128.5 = 77,100, let's round that to 77,000 bytes per capture.

Over the course of 2 weeks, if I capture 10 pages (3 from 4chan, 3 from 99chan, 2 from 420chan, 1 from 711chan, 1 from somewhere else) every 10 minutes, that's

2*7*24*6*10*77000 bytes (the 6 is 60 minutes / 10 minutes per grab cycle, the 10 is pages per cycle)
= 2,016 grab cycles * 10 pages * 77,000 bytes = 1,552,320,000 bytes

that's about 1.5 gigabytes of data o_o I'm pretty sure my server can handle it. That's just 2 weeks across a handful of chans at a mediocre granularity, yeesh.
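Checking that arithmetic in Ruby, recomputed straight from the raw guesses (5.1 bytes per word, 15 words per post, 40 posts per thread, 15 threads per page, 10 pages, one grab cycle every 10 minutes for two weeks):

```ruby
# Back-of-the-envelope storage estimate, from the guesses above.
bytes_per_post = 15 * 5.1 + 8 + 40 + 4        # content + datetime + ancillary + post number
bytes_per_page = 40 * 15 * bytes_per_post     # 40 posts/thread, 15 threads/page
grab_cycles    = 2 * 7 * 24 * 6               # two weeks of cycles, 6 per hour
total_bytes    = grab_cycles * 10 * bytes_per_page   # 10 pages per cycle

puts "per capture: ~#{bytes_per_page.round} bytes"
puts "two weeks:   ~#{(total_bytes / 1_000_000_000.0).round(2)} GB"
```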

Coding

First step is grabbing the page. For whatever reason, the net/http module wasn't working the way I wanted it to, so I'm just using open-uri. With this, I can just call open("http://boards.4chan.org/b/") and get a temporary file back containing the page. Read it into a string, close the file, and we're good to go.
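For illustration, here's the same open/read/close dance against a throwaway local file (swap the path for the board URL and it's the same three lines; this way the sketch doesn't hammer 4chan):

```ruby
require 'open-uri'   # on the Ruby of this era, open() then accepts URLs too (URI.open on modern Ruby)
require 'tempfile'

# A local file standing in for the board page.
tmp = Tempfile.new('board')
tmp.write("<html>\n<body>\n</body>\n</html>\n")
tmp.close

page_f = open(tmp.path)   # with a URL: open("http://boards.4chan.org/b/")
page = page_f.read.split("\n")
page_f.close

puts page.length   # 4 lines
```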

Not much later, we get deep into munging html. On boards.4chan.org/b/ (I remember it being img.4chan.org/b/ — when did that change? shows that I don’t go there much any more) we can see that every post is separated by a <hr> tag. That’s pretty nice if we need posts delineated into packets, but can we go lower level than that? Aha! The ‘Reply’ button only occurs on links to threads, so we can probably capture the list of threads easily:

threads = []
page.each do |line|
  if [] != (n = line.grep(/(\d*)">Reply/){$1}) then
    threads << n
  end
end

puts "The thread list is.."
threads.each do |thread|
  puts "http://boards.4chan.org/b/res/#{thread}"
end

puts "\nWoot.\n"

That's fantastic. Now we need to actually pull the information from each thread… For the purpose of testing, let's just go with threads[0] and deal with the first thread we find. It looks like, content-wise, the meat of each post is between <blockquote> tags. This means we'll have to pull that stuff out! Regexes will be helpful.

page_f = open("http://boards.4chan.org/b/res/#{threads[0]}")
page = page_f.read.split("\n")
page_f.close

thread_posts = []
posts = []
page.each do |line|
  if line =~ /<blockquote>/ then
    n = line.grep(/<blockquote>(.*)<\/blockquote>/) {$1}[0]
    puts "Post is --#{n}--"
    posts << n
  end
end
thread_posts << posts

One problem though — this has a lot of <font></font> stuff in it from when people quote earlier posts. I’d almost leave it in there, but it has references to javascript functions and classes that I don’t really want — so let’s strip it and just have the post number show up. How can we do this? String#gsub!, in all its glory:

n.gsub!(/(>>\d*)<\/a>;<\/font>/, '\1') inserted right after the grep.
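To see what that's doing, here's the substitution run against a made-up quote line (the font/a attributes are guesses at the markup, not copied from the live page; the regex here is the refined form that ends up in the full listing below, which also eats the opening tags):

```ruby
# A post line with a quote link wrapped in font/a markup (hypothetical).
n = 'nice one <font class="unkfunc"><a href="#p12345678">>>12345678</a></font> indeed'

# Strip the wrapping tags, keeping just the >>number reference.
n.gsub!(/<font.*>(>>\d*)<\/a><\/font>/, '\1')
puts n   # nice one >>12345678 indeed
```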

This leaves us with thread_posts as an array containing 1 array of posts, which are each a string. Next up, we need to get the postnumber and the date, and the thread numbers. I should end up with an array that looks like this:

[[thread_number, [[post_number, date, content], [postnumber, date, content],…]], [thread_number, [[post_number, date, content], [postnumber, date, content]]],…]
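With made-up numbers and dates, that target structure and how it reads back out looks like:

```ruby
# Target shape: [[thread_number, [[post_number, date, content], ...]], ...]
# All values here are invented for illustration.
thread_posts = [
  ["12345678", [
    ["12345678", "10/02/14(Sun)12:00:00", "original post"],
    ["12345679", "10/02/14(Sun)12:01:30", ">>12345678 nice thread"]
  ]]
]

thread_posts.each do |thread_number, posts|
  posts.each do |post_number, date, content|
    puts "#{thread_number}/#{post_number} @ #{date}: #{content}"
  end
end
```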

At this point, I've added in regex-checking for the number and date, but something isn't working — I changed to splitting on <hr> (since each post is horizontal-rule delimited) but I'm not finding what I expect to, and some array indexes are landing on nil (NilClass errors). So I'll paste in my code, before I start breaking this monster into some finely tuned functions:

require 'open-uri'

page_f = open('http://boards.4chan.org/b/')
page = page_f.read.split("\n")
page_f.close

threads = []
page.each do |line|
  if [] != (n = line.grep(/(\d*)">Reply/){$1}) then
    threads << n
  end
end

puts "The thread list is.."
threads.each do |thread|
  puts "http://boards.4chan.org/b/res/#{thread}"
end
puts "\nWoot.\n"

thread_posts = []
page_f = open("http://boards.4chan.org/b/res/#{threads[0]}")
page = page_f.read.split("<hr>")
page_f.close

posts = []
page.each do |line|
  if line =~ /<blockquote>/ then
    content = line.grep(/<blockquote>(.*)<\/blockquote>/) {$1}[0]
    content.gsub!(/<font.*>(>>\d*)<\/a><\/font>/, '\1')
    #puts "Post is --#{content}--"
  end
  if line =~ /"norep\d*"/ then
    number = line.grep(/<span id="norep(\d*)">/){$1}
  end
  if line =~ /\d\d\/\d\d\/\d\d\(.*\)\d\d:\d\d:\d\d/ then
    date = line.grep(/\d\d\/\d\d\/\d\d\(.*\)\d\d:\d\d:\d\d/)
  end
  posts << [number[0], date[0], content[0]]
end
thread_posts << posts

thread_posts[0].each do |post|
  puts "Post #{post[0]} at #{post[1]} said \"#{post[2]}\""
end
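For what it's worth, one likely culprit: every chunk pushes [number[0], date[0], content[0]] even when a regex didn't match that chunk, so nils sneak in (and content is already a string at that point, so content[0] is just its first character). A sketch of a guarded per-chunk parser, tested on a made-up chunk (the span/blockquote shapes are taken from the regexes above, not from live 4chan HTML):

```ruby
# Parse one <hr>-delimited chunk into [post_number, date, content],
# or nil if the chunk isn't actually a post (page header, form, etc.).
def parse_post(chunk)
  number  = chunk[/<span id="norep(\d+)">/, 1]
  date    = chunk[/\d\d\/\d\d\/\d\d\(.*?\)\d\d:\d\d:\d\d/]
  content = chunk[/<blockquote>(.*)<\/blockquote>/m, 1]
  return nil unless number && date && content
  [number, date, content.gsub(/<font.*>(>>\d*)<\/a><\/font>/, '\1')]
end

# Made-up chunk shaped to match the regexes above.
chunk = '<span id="norep12345679">12345679</span> 10/02/14(Sun)12:01:30' \
        '<blockquote><font class="unkfunc"><a href="#p12345678">>>12345678</a></font> nice thread</blockquote>'

p parse_post(chunk)         # ["12345679", "10/02/14(Sun)12:01:30", ">>12345678 nice thread"]
p parse_post('<hr> junk')   # nil
```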

Okay, so I’ll press on, tinker with irb, and get back to you in a bit. I’m also going to find a wordpress plugin that makes my code not look like rotten bananas.
