
How much text does the average person generate per day?

@Cugini213

Posted in: #Security #UserGeneratedContent

I'm figuring out how to do comment rate limiting. My plan is to track the number of bytes each user (defined as IP address + port) writes to my drive, where each comment costs its own length plus a constant metadata overhead of, say, 50 bytes, and to cap the daily total at some human maximum.

Of course, I'll also limit the IP address to some sane multiple of that (so, not 65,536 * user_max, but more like 30 * user_max).
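A minimal sketch of that byte-budget scheme, assuming in-memory counters and a placeholder `USER_MAX` (the 16 KiB figure from EDIT 2 below). The class and method names are hypothetical, and a real deployment would presumably keep the counters in the database:

```python
from collections import defaultdict

METADATA_OVERHEAD = 50   # bytes charged per comment, as in the post
USER_MAX = 16 * 1024     # placeholder daily budget per (IP, port) user
IP_MULTIPLE = 30         # per-IP budget is 30 * USER_MAX, as in the post


class ByteBudget:
    """Tracks bytes written per user and per IP for a single day."""

    def __init__(self, user_max=USER_MAX, ip_multiple=IP_MULTIPLE):
        self.user_max = user_max
        self.ip_max = user_max * ip_multiple
        self.user_bytes = defaultdict(int)   # keyed by (ip, port)
        self.ip_bytes = defaultdict(int)     # keyed by ip

    def try_charge(self, ip, port, comment):
        """Record the comment's cost and return True if it fits both budgets."""
        cost = len(comment.encode("utf-8")) + METADATA_OVERHEAD
        user = (ip, port)
        if self.user_bytes[user] + cost > self.user_max:
            return False
        if self.ip_bytes[ip] + cost > self.ip_max:
            return False
        self.user_bytes[user] += cost
        self.ip_bytes[ip] += cost
        return True

    def reset(self):
        """Call once per day to start fresh budgets."""
        self.user_bytes.clear()
        self.ip_bytes.clear()
```

A rejected comment is not charged, so a user who hits the limit with one long comment can still post a shorter one that fits.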

The difficulty I'm having is in determining what user_max should be. If someone spends all day browsing my site and commenting on stuff, I don't want them to know there's a rate limit unless they're actually being intentionally abusive.

So, to determine user_max, I need some research on how much typed text the average internet user generates daily. Or, better yet, I'd want the bell curve. I can limit at 3 sigmas from the mean, and safely cover the overwhelming majority.
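If per-user daily byte totals were available, the 3-sigma cutoff could be computed roughly like this. The function name and the sample numbers are made up for illustration, and the approach assumes the data really is close to bell-shaped:

```python
import statistics


def three_sigma_limit(daily_bytes):
    """Return mean + 3 standard deviations of the observed daily totals."""
    mean = statistics.mean(daily_bytes)
    sigma = statistics.pstdev(daily_bytes)  # population standard deviation
    return mean + 3 * sigma


# Hypothetical daily totals in bytes, one entry per user.
sample = [400, 900, 1200, 1500, 2500, 3100, 800, 600]
print(three_sigma_limit(sample))
```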

EDIT:

I should clarify that the site hasn't gone live yet, so I'm just looking for a sane baseline.

Since I'll be moderating the comments manually (at least to begin with), what I want to know is primarily "could this be a human, given the level of output?". Apart from that, I'm treating the users as gently as I can, like fawns venturing from the underbrush. They're majestic, beautiful, shy creatures, so I really don't want to spook them, but I also don't want to get kicked in the face.

I'm working on the assumption that the first users matter disproportionately, but that's more a gut feeling than anything and I'm willing to be corrected about that.

EDIT 2:

A back-of-the-envelope calculation says 16 KiB is a good number. That's about 8 typewritten pages. In practice users will be able to write a bit less, because I count metadata and indexing overhead against the budget, but unless they write a bunch of tiny comments, not much less. If someone's typing an 8-page manifesto into my site, chances are they need to take some time off anyway.

At 16 KiB per user, and assuming I limit each IP to 32 (2 ^ 5) times that, the most data I can get per day from one IP is 2 ^ 14 * 2 ^ 5 = 2 ^ 19 bytes (512 KiB, half of 1 MiB).

If a botnet 2 million strong (about 2 ^ 21 bots) tries to DoS my site via the comments, assuming each bot has its own IP address and is smart enough to use multiple ports, I have a worst-case influx of 2 ^ 21 * 2 ^ 19 bytes, which comes out to 2 ^ 40 bytes, or 1 TiB per day.
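The arithmetic in EDIT 2 can be checked with a few lines. The variable names are illustrative only:

```python
KIB = 2 ** 10
MIB = 2 ** 20
TIB = 2 ** 40

user_max = 16 * KIB          # 2^14 bytes per user per day
ip_max = user_max * 2 ** 5   # 32 users' worth per IP: 2^19 bytes
bots = 2 ** 21               # roughly 2 million bots, one IP each
worst_case = bots * ip_max   # total bytes per day across the botnet

print(ip_max / MIB)          # fraction of a MiB allowed per IP per day
print(worst_case / TIB)      # TiB per day in the worst case
```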

Now, that certainly would crash my site. However, that worst case assumes I have no contingency for such a situation (e.g. disabling comments if the disk starts to get too full, or gradually lowering the per-IP maximum as the disk fills). It also assumes that I don't filter on content at all (not happening; I will do some sanity checks) and that my database can't compress any of the data (it does try compression). And it assumes my network bandwidth would hold up well enough under that load for the site being technically up to matter (unlikely).

Of course, the other caveat is that I'd have to have pissed off someone willing to rent a 2-million-strong botnet. Although I can't say I'm friends with everyone, I'm still filing that under "unlikely". Botnets are expensive, and realistically this site won't be that important.
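One way the "gradually lower the per-IP limit as the disk fills" contingency might look, as a rough sketch. The 50% and 90% thresholds and both function names are assumptions, not something the post specifies. The limit stays at its base value until the disk is half full, then falls linearly to zero (comments disabled) at 90%:

```python
import shutil


def scaled_ip_max(base_ip_max, used_fraction, start=0.5, stop=0.9):
    """Scale the per-IP budget down linearly between two disk-usage thresholds."""
    if used_fraction <= start:
        return base_ip_max
    if used_fraction >= stop:
        return 0  # effectively disables comments
    remaining = (stop - used_fraction) / (stop - start)
    return int(base_ip_max * remaining)


def current_ip_max(base_ip_max, path="/"):
    """Look up disk usage for `path` and return the scaled per-IP budget."""
    usage = shutil.disk_usage(path)
    return scaled_ip_max(base_ip_max, usage.used / usage.total)
```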


1 Comment


@Kevin317

You don't need to know how much content a user generates per day. You need to know how much content a user generates per day on your site.

There probably isn't any real data on how much content a typical user generates, and it wouldn't matter anyway: what you need to handle is how much data one of your typical users generates per day. The values on other sites vary with the type of site, the type of user, and so on. There are far too many variables for them to correlate with what your users are doing.

But you can always analyze what your users are doing. Find your most prolific users who aren't violating your rules/abusing your site and use them as your baseline. Naturally you can adjust as you get more data and/or your users change their behavior. But this is the most accurate and useful information you can get.
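As a rough sketch of that approach, assuming you already have per-user daily byte totals and a set of users you've flagged as abusive (the function name, the data layout, and the 2x headroom factor are all hypothetical):

```python
def baseline_user_max(daily_bytes_by_user, flagged, headroom=2.0):
    """Base the limit on the most prolific user who isn't flagged as abusive."""
    legit = [
        total
        for user, total in daily_bytes_by_user.items()
        if user not in flagged
    ]
    if not legit:
        return None  # no data yet, so fall back to a guessed baseline
    return int(max(legit) * headroom)


# Hypothetical observations: bytes written per user in one day.
observed = {"alice": 1000, "bob": 4000, "spammer": 900000}
print(baseline_user_max(observed, flagged={"spammer"}))  # 8000
```

Re-running this as more data comes in lets the limit follow what your users actually do.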
