Preliminary thoughts on clothing

I’ve never been well dressed.

There’s been this mix of apathy, misunderstanding, and a lack of resources in my life that led to an endless stream of stretchy khakis or cords and baggy t-shirts. Giant fuzzy sweaters, loose-hanging, wrinkly dress shirts. Big black tube-socks. I had heard the words of people who cared about fashion, “this is the way you present yourself; it’s the first impression you give, the message you tell the world”, and thought internally, “ha, the message you’re giving off is that you can’t stand on your attributes, so you’ve got to hide behind some ideal of flashy popularity”. What mattered to me was that my clothing was comfortable. That I could go crawl through a sewer or climb up a wall and my shirt could get a bit messed up. If I don’t care what I look like, why should you?

My opinions have been there a long time, and for some reason – likely just increasing maturity – they have been changing. I’m beginning to see the value in having a well-made garment. I won’t be suggesting that I am hiding behind a flashy ideal; I’ll just be showing off the fact that I’ve thought about this. If I care how I look, it shows that I’ll also care about other things. The difference in thought has been subtle, but this feels like a sweeping change.

It’s led me to recognize that there is a difference between having a piece of clothing that’s comfortable because it is loose, and one that fits properly. As a child, I likely understood only “too small” or “not too small”, and I certainly didn’t want “too small”.

So, I’ve decided to invest some spare cycles into this. I don’t think I’m going to really take to fashion, per se, but to a pursuit of being well dressed. This means fewer clothes, which fit and look better, and which I take better care of. Better care here means “more regular washing/drying and ironing”, as opposed to “no longer crawling through sewer grates or climbing up walls.”

Some guidelines occurred to me as a place to start. A good piece of clothing must:

  1. Be comfortable without being too large.
  2. Provide some variability or option to the way it is worn, and
  3. Strike a balance between reality and imagination; because the best artist is in the observer’s mind.

I think that if I can manage to find things that fit those guidelines, I’ll probably look pretty good. Next up is figuring out where and when to buy things.

As my opinion progresses, I’ll keep you posted.

Snooping

We all do it

As a computer professional, I’ve been in situations where I had access to information that others did not. Sometimes that information was mundane; sometimes it had commercial value, was personally sensitive, or just wasn’t intended for public consumption. In particular, I was responsible for maintaining some machines when I was younger. I took issues of privacy less seriously then than I do now, and looking back, it resulted in some questionable behaviour.

A person would bring a computer to me to be fixed. Often this would be a computer which they used for work purposes, and I would be left alone to make it right again. I regularly had to copy all their personal files off of the machine, then back. I quickly discovered: pretty much everyone looks at pornography. Most people either do not care to hide it, do not understand that it leaves very noticeable traces, or perhaps just forget that it’s there.

Unwise

It became a habit to do a porn-check upon receiving a new computer. This was inevitably the most interesting part of fixing a box. What had they looked at? Had they bothered hiding it? Where? It was entirely a white-hat matter of curiosity. I had access to the system, and this was a secret I could examine. I am (and always have been) strongly motivated to know things that I am not really supposed to know. Case in point: The amount I know about tunnels.

At the time I justified my behaviour by telling myself, “I’m going to see the majority of these files anyway. I need to do a sweep to make sure I’m getting all of their important documents.” In a broad sense, this is a true statement – the best lies are composed mostly of truths. My first sweep would pretty much only be looking for that sort of sensational find, and then I’d go through and grab all of the regular vacation photos and limewire songsets and bookmarks-folders that the person publicly wanted brought to their new install.

Judgement

Was I wrong? To this day, I’m not certain if snooping in that fashion was objectively “wrong”. I don’t feel at all that it was right, but I benefited tremendously from the activity. I found a music folder a friend had hidden with pornography for some reason. It contained Radiohead’s entire discography, and was my introduction to the band. My musical tastes have been hugely influenced by my love of Radiohead. Some of the better moments in my life were spent falling asleep listening to OK Computer with a pretty cool girl. If I hadn’t snooped back then, some of my most valued experiences would not have come to pass the same way.

I do know that, placed in that situation again, I would now act differently. In the several intervening years, I have come to possess a combination of two relevant traits. The first is an appreciation that someone’s personal information is simply not my business. The second is the apathy to not care what some vague acquaintance has squirreled away in a hidden folder inside “C:\Windows\Old_Documents”.

Universal application

What about everyone else? If everyone snooped “a bit” at some point, it’s pretty much as bad as if everyone just snooped regularly. Some people might find incriminating things, and then they’d be left with the difficult question: Do they contact authorities and take the hit on their own reputation, or do they let someone get away with heinous crimes? Other people might find things that they know they could sell. How much is it worth to a person to sell a secret, when it could cause real damage? What about when there is no perceivable damage to be caused? There’s a lot to this snooping thing.

On the whole, if I were to compress the stream of grays into black and white, I’d have to come down completely against improper access of information. It happens all over the place and it shouldn’t! Facebook employees seeing too much information, government employees stealing records, tech-dudes stealing data from customers? None of this is ‘right’. Taking the simplest definition of good and bad, that means that all of those things are ‘wrong’. Simple.

Can anything be done…?

Better training, more oversight, and some morality-based discussions would probably go a long way to remedy these problems. Training to help people cope with the situation in a mentally safe environment, to build up some initial resistance to the idea of snooping for fun. Oversight to catch snoopers – screen recorders or video recorders could be used to “watch the watchmen”, so to speak. People who are being watched (or even who think they are) tend to behave better. And simple moral discussions to establish firmly what is right and what is wrong, and what’s expected of people, would allow issues to be brought into the open and standards to be clear. The moral discussions overlap heavily with better training.

But aside from those, you can’t really force people not to snoop. Leaks will occur, information will get out, and that’s kind of just the nature of it. Entropy guides the increase in disorder in the universe, affecting not just pure energy but information as well. There’s no perfect transfer; something always escapes.

Zeno wrote a letter…

I’m beginning to come to some conclusions about the failures of communication between people. It draws heavily upon the example Zeno presented in the dichotomy paradox.

Two persons know of concepts, which they wish to discuss. One could analogize to suggest that person A’s understanding of a concept sits at one end of a track, while person B’s understanding of that same concept sits at the other. Together, they will use language to attempt to move together toward the ideal of the concept, placed at the center of the track.

The first person introduces their understanding in terms of the language they understand, and hence, they reduce the distance between their listener and the ideal of the concept between them.

The second person responds, utilizing their own understanding of the concept, and through language, moves the other person closer to the ideal of the concept which stands between them.

This process continues, and each person progressively moves closer and closer to the ideal, but by a similar logical flow to Zeno’s paradox, neither will ever arrive at the ideal of the concept. Neither will ever reach the other’s understanding of the concept.

In fact, as each moves closer, the minute differences between the understandings of each person’s language will become magnified – almost as distortions in a fractal pattern like the Mandelbrot set grow and come to dominate the landscape. The most insignificant and trivial of separations become the entire known world – and the distance that from afar seemed so small becomes again intuitively uncrossable.

Zeno, your insight never ceases to amaze.

Failures and Errata

Failures and errata could be thought of as a fixture in the Computer Science landscape. There is no standards body that is capable of eliminating errors altogether from computer code, no set of tools or mathematical proofs that can guarantee that all of our code is perfect. Software is buggy; that’s just the nature of the beast. We’ll find problems and fix them, but there will always be more. Right?

[Image: chickens running about a messed-up shipment]

Yes.

Writing a 4chan scraper

I’m writing a paper on memes for a class, and while I’m not really ready to hand the paper in, I have been doing some preliminary work for it. Today, I had “free moments” for the first time in about a month, so I set to work on it.

The key tasks this code must do are:

1) Grab the front page of a forum, like boards.4chan.org/b/
2) Parse out the links to each thread on it
3) Go to each thread and grab the postnumber, date/time, and content of each post.
4) Save this into a database.
5) Run say, once every 10 minutes or something.

Then I’ll see about training it on /v/ or something, and tweak it to run on a few other chans. At that point, I’ll at least be collecting data.

Out of curiosity, how much storage will that take up? Back of the envelope time!

5.1 bytes / word (average word length in English)
Estimating 15 words per post (some long copypastas would drag this up)
Estimating 40 posts per thread (some will fall fast, others will be huge)
and .. huh. I don’t even know how many threads there are on the front page. Just looked: 15.
Every 10 minutes.

So, 40*15*(15*5.1+timestorage+some ancillary data+post number (an int, therefore 4))

timestorage would be equivalent to mysql datetime (8 bytes), ancillary data would probably be about 40 characters for a quote link like “>>012345678” plus its markup (29 chars) and then some random spaces or html tags that get thrown in.

Post number would, as mentioned, be a 4-byte int. So..

40*15*(15*5.1+8+40+4) = 600*(76.5+52) = 600*128.5 = 77,100, let’s round that to 80,000 bytes per capture.

Over the course of 2 weeks, if I capture 10 pages (3 from 4chan, 3 from 99chan, 2 from 420chan, 1 from 711chan, 1 from somewhere else) every 10 minutes, that’s

2*7*24*6*10*80000 bytes (the 6 is 60 minutes / 10 minutes per grab cycle, the 10 is the number of pages)
14*24*6*10 = 20,160 captures
(cheated a bit with irb) 20160 * 80000 = 1,612,800,000 bytes

that’s about 1.6 gigabytes of data o_o I’m pretty sure my server can handle it. That’s just 2 weeks across a handful of forums at a mediocre granularity, yeesh.
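For anyone who wants to fiddle with the guesses, here is a minimal sketch of the same estimate as a Ruby script. It reads the per-post size as 15 words at 5.1 bytes each plus the fixed fields, and counts six captures per hour for each of ten pages; those readings, and all the constant and method names, are assumptions for illustration rather than measured values.

```ruby
# Back-of-envelope storage estimate; every constant is a guess, not a measurement.
BYTES_PER_WORD    = 5.1 # average English word length
WORDS_PER_POST    = 15
POSTS_PER_THREAD  = 40
THREADS_PER_PAGE  = 15
TIMESTAMP_BYTES   = 8   # assumed size of a MySQL DATETIME
ANCILLARY_BYTES   = 40  # quote links, stray spaces and tags
POST_NUMBER_BYTES = 4   # 4-byte int

# Bytes stored for one capture of one front page and its threads.
def bytes_per_capture
  per_post = BYTES_PER_WORD * WORDS_PER_POST +
             TIMESTAMP_BYTES + ANCILLARY_BYTES + POST_NUMBER_BYTES
  per_post * POSTS_PER_THREAD * THREADS_PER_PAGE
end

# Total bytes for `pages` pages, each captured every `interval_minutes`, over `days` days.
def total_bytes(days:, pages:, interval_minutes:)
  captures_per_page = days * 24 * 60 / interval_minutes
  bytes_per_capture * captures_per_page * pages
end

total = total_bytes(days: 14, pages: 10, interval_minutes: 10)
puts "per capture: #{bytes_per_capture.round} bytes"
puts format("two weeks:   %.2f GB", total / 1e9)
```

Changing any single constant (say, 60 posts per thread) and re-running is quicker than redoing the multiplication by hand.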

Coding

First step is grabbing the page. For whatever reason, the net/http module wasn’t working the way I wanted it to, so I’m just using open-uri. With this, I can just call open("http://boards.4chan.org/b/") (URI.open on newer Rubies) and get a temporary file back containing the page. Read it into a string, close the file, and we’re good to go.

Not much later, we get deep into munging html. On boards.4chan.org/b/ (I remember it being img.4chan.org/b/ — when did that change? shows that I don’t go there much any more) we can see that every post is separated by a <hr> tag. That’s pretty nice if we need posts delineated into packets, but can we go lower level than that? Aha! The ‘Reply’ button only occurs on links to threads, so we can probably capture the list of threads easily:

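threads = []
page.each do |line|
  # The thread number is the run of digits right before `">Reply`.
  # String#[] with a capture index returns nil when the line has no match.
  if (n = line[/(\d+)">Reply/, 1])
    threads << n
  end
end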

puts "The thread list is.."
threads.each do |thread|
  puts "http://boards.4chan.org/b/res/#{thread}"
end

puts "\nWoot.\n"
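The same idea can be tried offline before pointing it at a live board. This is a rough sketch using String#scan on a made-up fragment; the sample markup and the helper names thread_numbers and thread_urls are hypothetical, and the real page may well be laid out differently.

```ruby
# Hypothetical front-page fragment; the real markup may differ.
SAMPLE_FRONT_PAGE = <<~HTML
  <blockquote>first thread</blockquote> [<a href="res/101">Reply</a>]
  <hr>
  <blockquote>second thread</blockquote> [<a href="res/202">Reply</a>]
  <hr>
HTML

# Returns the digits sitting right before each `">Reply` link, without duplicates.
def thread_numbers(html)
  html.scan(/(\d+)">Reply/).flatten.uniq
end

# Builds the full thread URLs for a board (board_url is an assumed default).
def thread_urls(html, board_url = "http://boards.4chan.org/b/")
  thread_numbers(html).map { |n| "#{board_url}res/#{n}" }
end

puts thread_urls(SAMPLE_FRONT_PAGE)
```

Scanning the whole page at once avoids splitting it into lines first, which is handy if a link ever wraps across a line break.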

That’s fantastic. Now we need to actually pull the information from each thread… For the purpose of testing, let’s just go with threads[0] and deal with the first thread we find. It looks like, content-wise, the meat of each post is in between <blockquote> tags. This means we’ll have to pull that stuff out! Regexes will be helpful.

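# Assumes require 'open-uri' has already run; URI.open replaces bare open(url) on newer Rubies.
page_f = URI.open("http://boards.4chan.org/b/res/#{threads[0]}")
page = page_f.read.split("\n")
page_f.close

thread_posts = []
posts = []
page.each do |line|
  # Capture whatever sits between the blockquote tags on this line, if anything.
  if (n = line[/<blockquote>(.*)<\/blockquote>/, 1])
    puts "Post is --#{n}--"
    posts << n
  end
end
thread_posts << posts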

One problem though — this has a lot of <font></font> stuff in it from when people quote earlier posts. I’d almost leave it in there, but it has references to javascript functions and classes that I don’t really want — so let’s strip it and just have the post number show up. How can we do this? String#gsub!, in all its glory:

n.gsub!(/<font.*?>(>>\d*)<\/a><\/font>/, '\1') inserted right after n is assigned.
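Here is a small offline sketch of that substitution. The sample post and its attribute names are invented for illustration, and the pattern is a slightly stricter variant (it refuses to match across a closing angle bracket) rather than the exact one above. If the raw page source encodes the arrows as &gt;&gt; instead of literal characters, the pattern would need adjusting to match.

```ruby
# Hypothetical post body; the class and onclick attributes are made up.
RAW_POST = '<font class="quote"><a href="#111" onclick="hl(111)">>>111</a></font> agreed, ' \
           'see <font class="quote"><a href="#222" onclick="hl(222)">>>222</a></font>'

# Replaces each <font><a>...</a></font> quote wrapper with the bare >>number inside it.
def strip_quote_markup(text)
  text.gsub(/<font[^>]*><a[^>]*>(>>\d+)<\/a><\/font>/, '\1')
end

puts strip_quote_markup(RAW_POST)
```

Using the non-bang gsub returns a new string, which sidesteps the fact that gsub! returns nil when nothing was replaced.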

This leaves us with thread_posts as an array containing 1 array of posts, which are each a string. Next up, we need to get the postnumber and the date, and the thread numbers. I should end up with an array that looks like this:

[[thread_number, [[post_number, date, content], [post_number, date, content],…]], [thread_number, [[post_number, date, content], [post_number, date, content]]],…]
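To make that shape concrete, here is a sketch with made-up data in the same nesting, plus a hypothetical helper (to_rows) that flattens it into one row per post, which is roughly what a database insert would want. None of the values come from a real capture.

```ruby
# Made-up data in the [thread_number, [[post_number, date, content], ...]] shape.
SAMPLE_THREADS = [
  ["1001", [["1001", "01/02/09(Fri)10:00:00", "first"],
            ["1002", "01/02/09(Fri)10:01:30", ">>1001 second"]]],
  ["2001", [["2001", "01/02/09(Fri)10:05:00", "other thread"]]]
]

# Flattens the nested arrays into [thread_number, post_number, date, content] rows.
def to_rows(threads)
  threads.flat_map do |thread_number, posts|
    posts.map { |post_number, date, content| [thread_number, post_number, date, content] }
  end
end

to_rows(SAMPLE_THREADS).each { |row| puts row.join(" | ") }
```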

At this point, I’ve added in regex-checking for the number and date, but something isn’t working. I changed to breakups by <hr> (since each post is horizontal rule delineated) but I’m not finding what I expect to, and it’s causing some array lookups to be called on nil. So I’ll paste in my code, before I start breaking this monster into some finely tuned functions:

require 'open-uri'

# URI.open is open-uri's entry point on current Rubies; older ones used bare open(url).
page_f = URI.open('http://boards.4chan.org/b/')
page = page_f.read.split("\n")
page_f.close

# Collect the thread number that precedes each `">Reply` link.
threads = []
page.each do |line|
  if (n = line[/(\d+)">Reply/, 1])
    threads << n
  end
end

puts "The thread list is.."
threads.each do |thread|
  puts "http://boards.4chan.org/b/res/#{thread}"
end
puts "\nWoot.\n"

thread_posts = []
page_f = URI.open("http://boards.4chan.org/b/res/#{threads[0]}")
page = page_f.read.split("<hr>")
page_f.close

posts = []
page.each do |chunk|
  # Each lookup returns nil when the chunk has no match.
  content = chunk[/<blockquote>(.*?)<\/blockquote>/, 1]
  # Collapse quote-link markup down to the bare >>number.
  content = content.gsub(/<font.*?>(>>\d*)<\/a><\/font>/, '\1') if content
  number = chunk[/<span id="norep(\d*)">/, 1]
  date = chunk[/\d\d\/\d\d\/\d\d\(.*?\)\d\d:\d\d:\d\d/]
  # Skip chunks missing any field; indexing into nil is what blew up before.
  next unless number && date && content
  posts << [number, date, content]
end
thread_posts << posts

thread_posts[0].each do |post|
  puts "Post #{post[0]} at #{post[1]} said \"#{post[2]}\""
end
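One possible first step toward breaking this into functions is a helper that does the three regex lookups for a single chunk and returns nil when any field is missing, so nothing ever indexes into nil. This is only a sketch: the sample chunk is invented, parse_chunk and POST_DATE are hypothetical names, and real thread markup may not split this cleanly.

```ruby
# Hypothetical <hr>-delimited chunk; real thread markup may not look like this.
SAMPLE_CHUNK = <<~HTML
  <span id="norep555">555</span> 01/02/09(Fri)10:00:00
  <blockquote>hello world</blockquote>
HTML

# Matches timestamps shaped like 01/02/09(Fri)10:00:00.
POST_DATE = /\d\d\/\d\d\/\d\d\(\w+\)\d\d:\d\d:\d\d/

# Returns [number, date, content] for a chunk, or nil when any field is missing.
def parse_chunk(chunk)
  number  = chunk[/<span id="norep(\d+)">/, 1]
  date    = chunk[POST_DATE]
  content = chunk[/<blockquote>(.*?)<\/blockquote>/m, 1]
  return nil unless number && date && content

  [number, date, content]
end

p parse_chunk(SAMPLE_CHUNK)
p parse_chunk("<div>page footer</div>")
```

With a helper like this, the main loop could shrink to something like page.map { |chunk| parse_chunk(chunk) }.compact.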

Okay, so I’ll press on, tinker with irb, and get back to you in a bit. I’m also going to find a WordPress plugin that makes my code not look like rotten bananas.