Mashup maker

I think it’d be pretty cool to have an online video editing tool that could take YouTube URLs (with the #t parameter) and durations as sources, plus a couple of audio channels.

I’m pretty sure you could build it by capturing from the RTMP stream the way YouTube-downloader-ish tools do. Then you could just run some command-line video encoding software on the pieces, stitch them together as an FLV, and give it back.

Sounds hard, but fun. Maybe a Christmas project.

Preliminary thoughts on clothing

I’ve never been well dressed.

There’s been this mix of apathy, misunderstanding, and a lack of resources in my life that led to an endless stream of stretchy khakis or cords and baggy t-shirts. Giant fuzzy sweaters, loose-hanging, wrinkly dress shirts. Big black tube-socks. I had heard the words of people who cared about fashion, “this is the way you present yourself; it’s the first impression you give, the message you tell the world”, and thought internally, “ha, the message you’re giving off is that you can’t stand on your attributes, so you’ve got to hide behind some ideal of flashy popularity”. What mattered to me was that my clothing was comfortable, and that I could go crawl through a sewer or climb up a wall and my shirt could get a bit messed up. If I don’t care what I look like, why should you?

My opinions have been there a long time, and for some reason – likely just increasing maturity – they have been changing. I’m beginning to see the value in having a well-made garment. I won’t be suggesting that I am hiding behind a flashy ideal; I’ll just be showing off the fact that I’ve thought about this. If I care how I look, it shows that I’ll also care about other things. The difference in thought has been subtle, but this feels like a sweeping change.

It’s led me to recognize that there is a difference between having a piece of clothing that’s comfortable because it is loose, and one that fits properly. As a child, I likely understood only “too small” or “not too small”, and I certainly didn’t want “too small”.

So, I’ve decided to invest some spare cycles into this. I don’t think I’m going to really take to fashion, per se, but to a pursuit of being well dressed. This means fewer clothes, which fit and look better, which I take better care of. Better care here means “more regular washing/drying and ironing”, as opposed to “no longer crawling through sewer grates or climbing up walls.”

Some guidelines occurred to me as a place to start. A good piece of clothing must:

  1. Be comfortable without being too large.
  2. Provide some variability or option to the way it is worn, and
  3. Strike a balance between reality and imagination, because the best artist is in the observer’s mind.

I think that if I can manage to find things that fit those guidelines, I’ll probably look pretty good. Next up is figuring out where and when to buy things.

As my opinion progresses, I’ll keep you posted.

Zeno wrote a letter…

I’m beginning to come to some conclusions about the failures of communication between people. They draw heavily upon the example Zeno presented in the dichotomy paradox.

Two persons know of concepts, which they wish to discuss. One could analogize to suggest that person A’s understanding of a concept sits at one end of a track, while person B’s understanding of that same concept sits at the other. Together, they will use language to attempt to move together toward the ideal of the concept, placed at the center of the track.

The first person introduces their understanding in terms of the language they understand, and hence, they reduce the distance between their listener and the ideal of the concept between them.

The second person responds, utilizing their own understanding of the concept, and through language, moves the other person closer to the ideal of the concept which stands between them.

This process continues, and each person progressively moves closer and closer to the ideal, but by a similar logical flow to Zeno’s paradox, neither will ever arrive at the ideal of the concept. Neither will ever reach the other’s understanding of the concept.

In fact, as each moves closer, the minute differences between the understandings of each person’s language will become magnified – almost as distortions in a fractal pattern like the Mandelbrot set grow and come to dominate the landscape. The most insignificant and trivial of separations become the entire known world – and the distance that from afar seemed so small becomes again intuitively uncrossable.

Zeno, your insight never ceases to amaze.

Failures and Errata

Failures and errata could be thought of as a fixture in the Computer Science landscape. There is no standards body that is capable of eliminating errors altogether from computer code, no set of tools or mathematical proofs that can guarantee that all of our code is perfect. Software is buggy; that’s just the nature of the beast. We’ll find problems and fix them, but there will always be more. Right?

[Image: “not a success”, chickens running about a messed-up shipment]

Yes.

Writing a 4chan scraper

I’m writing a paper on memes for a class, and while I’m not really ready to hand the paper in, I have been doing some preliminary work for it. Today, I had “free moments” for the first time in about a month, so I set to work on it.

The key tasks this code must do are:

1) Grab the front page of a forum, like boards.4chan.org/b/
2) Parse out the links to each thread on it
3) Go to each thread and grab the postnumber, date/time, and content of each post.
4) Save this into a database.
5) Run, say, once every 10 minutes or so.

Then I’ll see about training it on /v/ or something, and tweak it to run on a few other chans. At that point, I’ll at least be collecting data.

Out of curiosity, how much storage will that take up? Back of the envelope time!

5.1 bytes / word (average word length in English)
Estimating 15 words per post (some long copypastas would drag this up)
Estimating 40 posts per thread (some will fall fast, others will be huge)
and .. huh. I don’t even know how many threads there are on the front page. Just looked: 15.
Every 10 minutes.

So, 40*15*(15*5.1+timestorage+some ancillary data+post number (an int, therefore 4))

timestorage would be equivalent to a MySQL DATETIME (8 bytes), ancillary data would probably be about 40 characters: a quote reference like “>>012345678” with its surrounding tags (29 chars), and then some random spaces or HTML tags that get thrown in.

Post number would, as mentioned, be a 4-byte int. So..

40*15*(76.5+8+40+4) = 600*128.5 = 77,100, let’s round that to 80,000 bytes per capture.

Over the course of 2 weeks, if I capture 10 pages (3 from 4chan, 3 from 99chan, 2 from 420chan, 1 from 711chan, 1 from somewhere else) every 10 minutes, that’s

2*7*24*6*10*80000 bytes (the 6 is 60 minutes / 10 minutes per grab cycle, the 10 is the number of pages)
336*60*80000
20160*80000
(cheated a bit with irb) 20160 * 80000 = 1,612,800,000 bytes

that’s about 1.6 gigabytes of data o_o I’m pretty sure my server can handle it. That’s just 2 weeks across 3-4 forums at a mediocre granularity, yeesh.
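For anyone who would rather let Ruby do the envelope math, here is a rough sketch of the same estimate. The constant and method names are my own, and every figure is just one of the guesses listed above. Nothing is rounded along the way, so the total will not exactly match a hand-rounded figure.

```ruby
# Back-of-the-envelope storage estimate. All figures are guesses from the text.
BYTES_PER_WORD    = 5.1  # average English word length
WORDS_PER_POST    = 15
POSTS_PER_THREAD  = 40
THREADS_PER_PAGE  = 15
DATETIME_BYTES    = 8    # MySQL DATETIME
ANCILLARY_BYTES   = 40   # quote links, stray whitespace, tags
POST_NUMBER_BYTES = 4    # 4-byte int

# Text of the post plus its fixed-size fields.
def bytes_per_post
  BYTES_PER_WORD * WORDS_PER_POST + DATETIME_BYTES + ANCILLARY_BYTES + POST_NUMBER_BYTES
end

# One capture = every post of every thread on one front page.
def bytes_per_capture
  POSTS_PER_THREAD * THREADS_PER_PAGE * bytes_per_post
end

# pages: front pages grabbed per cycle; minutes_between: gap between cycles.
def total_bytes(days:, pages:, minutes_between:)
  captures_per_page = days * 24 * 60 / minutes_between
  captures_per_page * pages * bytes_per_capture
end

estimate = total_bytes(days: 14, pages: 10, minutes_between: 10)
puts "per capture: #{bytes_per_capture.round} bytes"
puts "two weeks:   #{(estimate / 1e9).round(2)} GB"
```

Changing `minutes_between` or `pages` shows quickly how the total scales with granularity and the number of boards.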

Coding

First step is grabbing the page. For whatever reason, the net/http module wasn’t working the way I wanted it to, so I’m just using open-uri. With this, I can just call open("http://boards.4chan.org/b/") and get a temporary file back containing the page. Read it into a string, close the file, and we’re good to go.
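A note for newer Rubies: since Ruby 3.0, open-uri no longer lets a bare open call fetch URLs, and URI.open is the call to use instead. Here is a minimal sketch of the fetch step under that assumption. The helper name fetch_page is made up for illustration.

```ruby
require 'open-uri'

# fetch_page is a hypothetical helper name, not part of open-uri.
# Returns the body of the page as a single string.
def fetch_page(url)
  # The block form closes the temporary file (or StringIO) for us.
  URI.open(url) { |f| f.read }
end

# Usage (hits the network, so left commented out):
# html = fetch_page('http://boards.4chan.org/b/')
# puts html.length
```

The block form saves the explicit close, which is one less thing to forget once this runs every 10 minutes.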

Not much later, we get deep into munging html. On boards.4chan.org/b/ (I remember it being img.4chan.org/b/ — when did that change? shows that I don’t go there much any more) we can see that every post is separated by a <hr> tag. That’s pretty nice if we need posts delineated into packets, but can we go lower level than that? Aha! The ‘Reply’ button only occurs on links to threads, so we can probably capture the list of threads easily:

threads = []
page.each do |line|
  if [] != (n = line.grep(/(\d*)">Reply/){$1}) then
    threads << n
  end
end

puts "The thread list is.."
threads.each do |thread|
  puts "http://boards.4chan.org/b/res/#{thread}"
end

puts "\nWoot.\n"
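The same idea can be sketched without the line-by-line grep, which matters on Ruby 1.9 and later, where String no longer has each or grep. This version uses String#scan on the whole page. The sample markup is made up as a stand-in for the real front page, and I use \d+ rather than \d* so an empty match can't sneak in.

```ruby
# Made-up stand-in for the front page. The real markup may differ.
SAMPLE_FRONT_PAGE = <<~HTML
  <span id="nothread101">...</span> [<a href="res/101">Reply</a>]
  <blockquote>first thread</blockquote><hr>
  <span id="nothread202">...</span> [<a href="res/202">Reply</a>]
  <blockquote>second thread</blockquote><hr>
HTML

# Returns the thread numbers found just before each Reply link, as strings.
def thread_ids(html)
  # scan with one capture group returns [["101"], ["202"]], hence flatten.
  html.scan(/(\d+)">Reply/).flatten.uniq
end

# The base URL default mirrors the one used in the post.
def thread_urls(html, base = 'http://boards.4chan.org/b/res/')
  thread_ids(html).map { |id| "#{base}#{id}" }
end

puts thread_urls(SAMPLE_FRONT_PAGE)
```

Because scan hands back plain strings, the thread list is a flat array rather than an array of one-element arrays.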

That’s fantastic. Now we need to actually pull the information from each thread… For the purpose of testing, let’s just go with thread[0] and deal with the first thread we find. It looks like, content-wise, the meat of each post is in between <blockquote> tags. This means we’ll have to pull that stuff out! Regexes will be helpful.

page_f = open("http://boards.4chan.org/b/res/#{threads[0]}")
page = page_f.read.split("\n")
page_f.close

thread_posts = []
posts = []
page.each do |line|
  if line =~ /<blockquote>/ then
    n = line.grep(/<blockquote>(.*)<\/blockquote>/) {$1}[0]
    puts "Post is --#{n}--"
    posts << n
  end
end
thread_posts << posts

One problem though — this has a lot of <font></font> stuff in it from when people quote earlier posts. I’d almost leave it in there, but it has references to javascript functions and classes that I don’t really want — so let’s strip it and just have the post number show up. How can we do this? String#gsub!, in all its glory:

n.gsub!(/<font.*>(>>\d*)<\/a><\/font>/, '\1'), inserted right after the grep.
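Here is a small, self-contained sketch of the extract-then-strip step, run against made-up markup. The tag attributes are my assumption, and the real page may well encode >> differently. It uses the non-destructive gsub, since gsub! returns nil when nothing was replaced, plus a lazy match so two quotes in one post are handled separately.

```ruby
# Made-up post markup. The real page's tags and attributes are an assumption.
SAMPLE_POST = '<blockquote><font class="q"><a href="#" onclick="hl(123)">>>123</a></font> lol no</blockquote>'

# Pulls the text between <blockquote> tags, or nil when there is none.
def post_content(chunk)
  match = chunk.match(%r{<blockquote>(.*?)</blockquote>}m)
  match && match[1]
end

# Replaces each quote link with the bare ">>number" it wraps.
def strip_quote_markup(content)
  # [^>]* stays inside the <font> tag; .*? stops at the first ">>digits".
  content.gsub(%r{<font[^>]*>.*?(>>\d+)</a></font>}, '\1')
end

content = post_content(SAMPLE_POST)
puts strip_quote_markup(content) if content
```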

This leaves us with thread_posts as an array containing 1 array of posts, which are each a string. Next up, we need to get the postnumber and the date, and the thread numbers. I should end up with an array that looks like this:

[[thread_number, [[post_number, date, content], [post_number, date, content],…]], [thread_number, [[post_number, date, content], [post_number, date, content]]],…]

At this point, I’ve added in regex-checking for the number and date, but something isn’t working — I changed to breakups by <hr> (since each post is horizontal rule delineated) but I’m not finding what I expect to, and it’s causing some array indexes to fall on nilClasses. So I’ll paste in my code, before I start breaking this monster into some finely tuned functions:

require 'open-uri'

page_f = open('http://boards.4chan.org/b/')
page = page_f.read.split("\n")
page_f.close

threads = []
page.each do |line|
  if [] != (n = line.grep(/(\d*)">Reply/){$1}) then
    threads << n
  end
end

puts "The thread list is.."
threads.each do |thread|
  puts "http://boards.4chan.org/b/res/#{thread}"
end
puts "\nWoot.\n"

thread_posts = []
page_f = open("http://boards.4chan.org/b/res/#{threads[0]}")
page = page_f.read.split("<hr>")
page_f.close

posts = []
page.each do |line|
  if line =~ /<blockquote>/ then
    content = line.grep(/<blockquote>(.*)<\/blockquote>/) {$1}[0]
    content.gsub!(/<font.*>(>>\d*)<\/a><\/font>/, '\1')
    #puts "Post is --#{content}--"
  end
  if line =~ /"norep\d*"/ then
    number = line.grep(/<span id="norep(\d*)">/){$1}
  end
  if line =~ /\d\d\/\d\d\/\d\d\(.*\)\d\d:\d\d:\d\d/ then
    date = line.grep(/\d\d\/\d\d\/\d\d\(.*\)\d\d:\d\d:\d\d/)
  end
  posts << [number[0], date[0], content[0]]
end
thread_posts << posts

thread_posts[0].each do |post|
  puts "Post #{post[0]} at #{post[1]} said \"#{post[2]}\""
end
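One guess at the nil trouble: number, date, and content are only assigned inside their if branches, so any <hr> chunk that lacks one of them (a page header, say) leaves that variable nil, and the [0] on it blows up. Separately, content is already a String, so content[0] would only give its first character. The sketch below skips incomplete chunks instead. The sample markup and the parse_chunk helper are made up for illustration, not taken from the real page.

```ruby
# Made-up thread markup, joined with <hr> as in the listing above.
SAMPLE_THREAD = [
  '<html><head><title>/b/</title></head>', # header chunk: no post in it
  '<span id="norep101">101</span> 11/29/09(Sun)14:02:11 <blockquote>first</blockquote>',
  '<span id="norep102">102</span> 11/29/09(Sun)14:03:40 <blockquote>second</blockquote>'
].join('<hr>')

DATE_RE = %r{\d\d/\d\d/\d\d\(.*?\)\d\d:\d\d:\d\d}

# Returns [number, date, content] for a chunk, or nil if any piece is missing.
def parse_chunk(chunk)
  number  = chunk[/<span id="norep(\d+)">/, 1]
  date    = chunk[DATE_RE]
  content = chunk[%r{<blockquote>(.*?)</blockquote>}m, 1]
  return nil unless number && date && content
  [number, date, content]
end

# compact drops the nils left by chunks that held no complete post.
posts = SAMPLE_THREAD.split('<hr>').map { |chunk| parse_chunk(chunk) }.compact

posts.each do |number, date, content|
  puts "Post #{number} at #{date} said \"#{content}\""
end
```

Whether the real thread pages actually put an <hr> between every reply is something I'd still have to check in irb.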

Okay, so I’ll press on, tinker with irb, and get back to you in a bit. I’m also going to find a WordPress plugin that makes my code not look like rotten bananas.