Saturday, February 23, 2013

T-SQL: Recompiling an SP Causes Bad Plan

Stored Procedure generates bad plan after re-compile?

Friday at work I encountered an interesting situation.  I had a stored procedure which did a bunch of things, and I noticed that when executing it with the same parameters, same data-set, etc. I would get varied performance.  Digging a little deeper I noticed that there were two different plans that it was generating.  The big difference between the two plans was essentially the order of execution.  In the good plan, it was scanning a small table first, filter the result-set to about a dozen records, before continuing on and working with these records.  In the bad plan, it was joining a very large table first, doing much processing with a few million records, before finally filtering it at the end to the dozen or so records I needed.

Now, at first you think this has something to do with parameter-sniffing, and that sort of problem, where the engine caches a plan based on the first execution which is a bad plan for subsequent executions.  However, while the stored procedure does have some IF/ELSE logic embedded, this still doesn't make any sense.  When I was executing the stored procedure, I was using the same parameters, and the same underlying data.  With every execution it was going down the same path, and adding a WITH RECOMPILE didn't solve the problem.

In fact, it's the exact opposite.  What I found out was that working with the indexes used by the plans caused the good plan, but recompiling the stored procedure led to a bad plan on the next execution.  What am I talking about?  Here is the essential order of events, using pseudo-code:

1) EXECUTE usp_TheStoredProcedure -> It is executed with the good plan, runs in 2 seconds or less.
2) RECOMPILE usp_TheStoredProcedure
3) EXECUTE usp_TheStoredProcedure -> It is executed with the bad plan, takes 45+ seconds to run.
4) CREATE INDEX AnyIndexUsedBySP.
5) EXECUTE usp_TheStoredProcedure -> It is executed with the good plan, runs in 2 seconds or less.
6) EXECUTE usp_TheStoredProcedure -> It is executed with the good plan, runs in 2 seconds or less.
7) RECOMPILE usp_TheStoredProcedure
8) EXECUTE usp_TheStoredProcedure -> It is executed with the bad plan, takes 45+ seconds to run.
7) DROP INDEX AnyIndexUsedBySP.
8) EXECUTE usp_TheStoredProcedure -> It is executed with the good plan, runs in 2 seconds or less.

In other words, simply affecting any index attached to any table queried by the SP would allow subsequent executions of the SP to use the good plan, UNTIL I recompiled the SP, then it would go back to using the terrible plan.  What I did with the Index made no difference, in fact the work around I have found is that I can create and drop some random index on a small table WITHIN the stored procedure itself, and it then generates a good plan for every execution! In other words:

ALTER usp_TheStoredProcedure AS BEGIN

CREATE INDEX idx_SomeIndex ON SomeSmallTable(SomeColumn);
DROP INDEX idx_SomeIndex;

/* Continue the normal SP logic */

END

This solved the problem! What is going on here? Are the statistics going bad every single time I recompile the SP, and affecting the Indexes refreshes them?

If anyone out there can explain this phenomenon, please comment below or Tweet me at @RyanJLind

Tuesday, August 14, 2012

Most pitchers in a season

The 2012 Toronto Blue Jays have used 32 different pitchers as of August 14, 2012.  Which teams have used the most pitchers in a season?

Since 1910, 17 teams have used as many as 30 pitchers in a season:

Team
Year
lgId
num_pitchers
Pitchers
Games
SDN
2002
NL
37
Jason Kershner,Mike Holtz,Clay Condrey,Jose Antonio Nunez,David Lundquist,Rodney Myers,Bobby Jones,Doug Nickle,Kevin Jarvis,Alan Embree,Adam Eaton,Jason Middlebrook,Kevin Pickford,D'Angelo Jimenez,Matt DeWitt,Tom Davey,Jeremy Fikac,Jake Peavy,Mike Bynum,Eric Cyr,Kevin Walker,Brandon Villafuerte,Brian Tollberg,Trevor Hoffman,J.J. Trujillo,Jason Pearson,Brett Tomko,Steve Reed,Jason Shiell,Brian Lawrence,Dennis Tankersley,Ben Howard,Jonathan Johnson,Jason Boyd,Oliver Perez,Juan Moreno
621
SDN
2009
NL
33
Cesar Carrillo,Edwin Moreno,Joe Thatcher,Ernesto Frieri,Heath Bell,Luke Gregerson,Cesar Ramos,Jake Peavy,Josh Wilson,Clayton Richard,Greg Burke,Duaner Sanchez,Eulogio De La Cruz,Walter Silva,Tim Stauffer,Josh Geer,Mike Adams,Aaron Poreda,Kevin Correia,Edward Mujica,Arturo Lopez,Cla Meredith,Josh Banks,Sean Gallagher,Ryan Webb,Chris Young,Luis Perdomo,Chad Gaudin,Shawn Hill,Mat Latos,Adam Russell,Michael Ekstrom,Wade LeBlanc
689
CLE
2000
AL
32
Jaime Navarro,Bartolo Colon,Willie Martinez,Justin Speier,Paul Rigdon,Chris Nichting,Cam Cairncross,Charles Nagy,Dave Burba,Andrew Lorraine,Tom Martin,Kane Davis,Tim Drew,Steve Reed,Steve Karsay,Jim Brower,Chuck Finley,Scott Kamieniecki,Sean Depaula,Alan Newman,Jaret Wright,Bobby Witt,Mike Mohler,Chris Haney,Jason Bere,Paul Shuey,Jamie Brewington,Bob Wickman,Steve Woodard,Ricardo Rincon,Brian Williams,Mark Watson
624
SDN
2008
NL
32
Jared Wells,Justin Germano,Brett Tomko,Scott Patterson,Justin Hampson,Charlie Haeger,Josh Geer,Trevor Hoffman,Mike Adams,Cla Meredith,Josh Banks,Sean Henn,Chris Young,Carlos Guevara,Shawn Estes,Chad Reineke,Bryan Corey,Michael Ekstrom,Wade LeBlanc,Enrique Gonzalez,Randy Wolf,Brian Falkenborg,Joe Thatcher,Cha Seung Baek,Kevin Cameron,Heath Bell,Clay Hensley,Greg Maddux,Glendon Rusch,Jake Peavy,Dirk Hayhurst,Wilfredo Ledezma
653
CLE
2002
AL
31
Ricardo Rodriguez,C.C. Sabathia,Sean Depaula,Jason Davis,Brian Tallet,Charles Nagy,Nerio Rodriguez,Jerrod Riggan,Terry Mulholland,Jason Beverlin,Jake Westbrook,Chuck Finley,David Riske,Jaret Wright,Carl Sadler,Chad Paronto,Jason Phillips,Heath Murray,Mark Wohlers,Ryan Drese,Paul Shuey,Dave Burba,Danys Baez,Bob Wickman,Ricardo Rincon,Alex Herrera,Dave Elder,Bartolo Colon,Dave Maurer,Cliff Lee,Roy Smith
583
FLO
2010
NL
31
Hunter Jones,Jose Ceda,Burke Badenhop,Jorge Sosa,Steve Cishek,Rick VandenHurk,Anibal Sanchez,Jose Veras,Ricky Nolasco,Sandy Rosario,Clay Hensley,Sean West,Renyel Pinto,Juan Oveida,Taylor Tankersley,Josh Johnson,Andrew Miller,Jhan Marinez,Brian Sanches,Alejandro Sanabia,Scott Strickland,Tim Wood,James Houser,Christopher Leroux,Jay Buente,Nate Robertson,Adalberto Mendez,Brett Sinkbeil,Chris Volstad,Will Ohman,Dan Meyer
643
KCA
2006
AL
31
Jorge de la Rosa,Elmer Dessens,Joe Mays,Runelvys Hernandez,Mark Redman,Scott Dohmann,Adam Bernero,Jose Diaz,Scott Elarton,Steve Stemle,Seth Etherton,Zack Greinke,Andrew Sisco,Bobby Keppel,Denny Bautista,Ryan Braun,Mike MacDougal,Mike Wood,Luke Hudson,Odalis Perez,Juan Oveida,Jeremy Affeldt,Todd Wellemeyer,Steve Andrade,Kyle Snyder,Jimmy Gobble,Ambiorix Burgos,Joe Nelson,Joel Peralta,Brandon Duckworth,Chris Booker
635
CIN
2003
NL
30
Jose Acevedo,Ryan Wagner,Felix Heredia,Joey Hamilton,Jimmy Anderson,Chris Reitsma,Todd Van Poppel,Paul Wilson,Aaron Harang,Danny Graves,Dan Serafini,Josh Hall,Seth Etherton,Scott Williamson,Jeff Austin,Brian Reith,Jimmy Haynes,Gabe White,Scott Randall,John Riedling,Phil Norton,Kent Mercker,Ryan Dempster,Mark Watson,Juan Cerros,Scott Sullivan,Joe Valentine,Matt Belisle,John Bale,Josias Manzanillo
637
CLE
2004
AL
30
Chad Durbin,Rick White,Bob Wickman,Lou Pote,Kyle Denney,Matt Miller,Jeriome Robertson,Fernando Cabrera,Joey Dawley,Jake Westbrook,Jake Robbins,Jeff D'Amico,David Riske,Francisco Cruceta,Kazuhito Tadano,Scott Elarton,Jack Cressend,Cliff Bartosh,Bobby Howry,Jose Jimenez,Rafael Betancourt,Jeremy Guthrie,Tim Laker,David Lee,Cliff Lee,Jason Stanford,Jason Anderson,C.C. Sabathia,Jason Davis,Scott Stewart
641
COL
2007
NL
30
Manuel Corpas,Ryan Speier,Rodrigo Lopez,Jason Hirsh,Alberto Arias,Ramon Ortiz,Elmer Dessens,Taylor Buchholz,Juan Morillo,Denny Bautista,Franklin Morales,Matt Herges,Brian Fuentes,Jeremy Affeldt,Jeff Francis,Tim Harikkala,Ubaldo Jimenez,Jorge Julio,Zach McClellan,Josh Newman,Tom Martin,Ramon Ramirez,LaTroy Hawkins,Josh Fogg,Dan Serafini,Darren Clarke,Bobby Keppel,Byung-Hyun Kim,Mark Redman,Aaron Cook
692
DET
2002
AL
30
Eric Eckenstahler,Shane Loux,Terry Pearson,Juan Acevedo,Adam Bernero,Julio Santana,Kris Keller,Mike Maroth,Danny Patterson,Jamie Walker,Jeff Farnsworth,Steve Sparks,Erik Sabel,Fernando Rodney,Jose Paniagua,Matt Perisho,Mark Redman,Matt Anderson,Franklyn German,Jose Lima,Brian Powell,Andy Van Hekken,Jason Beverlin,Nate Cornejo,Matt Miller,Oscar Henriquez,Seth Greisinger,Jason Jimenez,Brian Moehler,Jeff Weaver
533
FLO
2007
NL
30
Erasmo Ramirez,Henry Owens,Byung-Hyun Kim,Renyel Pinto,Josh Johnson,Scott Olsen,Jorge Julio,Chris Seddon,Kevin Gregg,Harvey Garcia,Sergio Mitre,Dontrelle Willis,Taylor Tankersley,Wes Obermueller,Daniel Barone,Logan Kensing,Marcos Carvajal,Randy Messenger,Lee Gardner,Justin Miller,Matt Lindstrom,Ross Wolf,Armando Benitez,Rick VandenHurk,Anibal Sanchez,Mauro Zarate,Jason Wood,Carlos Martinez,Ricky Nolasco,Nate Field
722
TEX
2004
AL
30
Nick Bierbrodt,Jay Powell,Joaquin Benoit,Michael Tejera,Jeff Nelson,Nick Regilio,Frank Francisco,R.A. Dickey,Colby Lewis,Travis Hughes,Brian Shouse,Chan Ho Park,Ron Mahay,Mickey Callaway,John Wasdin,Sam Narron,Ryan Drese,Francisco Cordero,Doug Brocail,Carlos Almanzar,Kenny Rogers,Ryan Snare,Rosman Garcia,Mike Bacsik,Juan Dominguez,Erasmo Ramirez,Chris Young,Kameron Loe,Ricardo Rodriguez,Scott Erickson
630
TEX
2005
AL
30
Steve Karsay,Justin Thompson,Josh Rupe,Matt Riley,Juan Dominguez,Pedro Astacio,Erasmo Ramirez,Chris Young,Kameron Loe,Ricardo Rodriguez,Edinson Volquez,Joaquin Benoit,Michael Tejera,Nick Regilio,Kevin Gryboski,R.A. Dickey,James Baldwin,Brian Shouse,Chan Ho Park,Ron Mahay,C.J. Wilson,John Wasdin,Ryan Drese,Francisco Cordero,Doug Brocail,Carlos Almanzar,Kenny Rogers,Scott Feldman,Ryan Bukvich,Jason Standridge
616
TEX
2008
AL
30
Bill White,Dustin Nippert,Luis Mendoza,Matt Harrison,Scott Feldman,Brandon McCarthy,Doug Mathis,Kevin Millwood,Vicente Padilla,Eddie Guardado,Frank Francisco,Brian Gordon,Wes Littleton,Kameron Loe,Warner Madrigal,Jamey Wright,Kazuo Fukumori,Franklyn German,Sidney Ponson,Jose Diaz,Joaquin Benoit,Elizardo Ramirez,Kason Gabbard,A.J. Murray,Tommy Hunter,C.J. Wilson,Eric Hurley,Josh Rupe,Jason Jennings,Robinson Tejeda
619
TOR
2011
AL
30
Jason Frasor,Mike McCoy,Brandon Morrow,Kyle Drabek,Chad Beck,Trever Miller,Rommie Lewis,Danny Farquhar,Luis Perez,Ricky Romero,Henderson Alvarez,Scott Richmond,Carlos Villanueva,Brad Mills,Jesse Litsch,P.J. Walters,Joel Carreno,Wilfredo Ledezma,Frank Francisco,Jo-Jo Reyes,Octavio Dotel,Casey Janssen,Zach Stewart,Shawn Camp,Brian Tallet,David Purcey,Brett Cecil,Jon Rauch,Dustin McGowan,Marc Rzepczynski
636
WAS
2009
NL
30
Joel Hanrahan,Victor Garate,Jay Bergmann,Steven Shell,Kip Wells,Mike Hinckley,Joe Beimel,Julian Tavarez,Garrett Mock,Saul Rivera,Jordan Zimmermann,Daniel Cabrera,Shairon Martis,Zack Segovia,Marco Estrada,Collin Balester,Mike MacDougal,Ron Villone,Jesus Colome,Sean Burnett,Jorge Sosa,John Lannan,Livan Hernandez,Wilfredo Ledezma,Logan Kensing,Craig Stammen,J.D. Martin,Ross Detwiler,Scott Olsen,Tyler Clippard
694


The Padres apparently love it.

Interesting that even though I went back to 1910, all of the teams that hit 30 are in the past 12 years.


Sunday, March 25, 2012

Random stuff I am working on.

Here are a few projects I have on the go.

A tool for querying baseball seasons (who has the most games at SS for a player who made zero errors and hit 40 homeruns?) -- This is similar to BB-ref's PI tool except it's more flexible (and less robust.) I built it specifically to make it very easy to add new stats, and more functionality. IT doesn't look like much but the back-end is coded to be highly scale-able and extensible. Even doing something drastic like changing it from Baseball to Hockey should be a fairly straight forward change, and I might do exactly that when I'm done.

A tool for testing baseball lineups: This is a tool for simulating seasons with various lineups to see, eg, would a team score more runs with Ichiro batting lead-off or batting third? This is the only lineup tool I know of that uses an actual play-by-play simulator in the backend (rather than applying complicated algorithms,) so I think it's more auditable. All players can have their stats adjusted so if you want to plug in hypothetical numbers or projections rather than using the 2011 stats you can do that. This project is currently shelved until I can solve some of my javascript performance problems.

Baseball Hall of Fame Ballot Counter: Part 3 of 3 in my baseball site toolbox from Curly Bacon :) This project tracks voting history for the hall of fame. The plan is to track votes so you can easily look back on the past and see who voted for whom. It is missing a lot of data right now as I am lazy and data entry is not really my favorite thing (Data entry people wanted!!)

These are the baseball-related projects I have ongoing. I have a few more personal sites that I won't post until I complete.

Monday, July 11, 2011

Lack of understanding statistics.

There's an article on yahoo that illustrates well the sort of flawed conclusions people draw when they look at statistics and don't know how to interpret them. As I like to say, numbers don't lie, but they can mislead. In this article, the author writes about the "home run derby curse" which insinuates that players who participate in the home run derby "mess up their swings" and have poor second halfs. This is a legend that has been very popular ever since Bobby Abreu had a legendary performance at the HRD and had a poor second half following. He compares slugging percentages of HRD participants before the HRD and after the HRD and sees a noticable drop (-.130) which he concludes is evidence that the HRD messes players up.

But wait.

The problem is regression towards the mean. Just because a player hit a lot of homers in his first half doesn't necessarily mean we should expect him to do the same in the second half. It is far more likely that in the second half his performance will fall more in line with his career rate of performance. In fact, the fact that the player has been invited to participate in the HRD already indicates he has had a great first half which is more than likely above his regular rate of performance. This is what we call sampling bias.

To be fair, the author indicates this somewhat in his piece, and to control for it compares it to sluggers who did not participate in the HRD. He finds that sluggers that did not participate had an increase in their slugging in the second half, which he believes lends credence to the hypothesis that the HRD messes up swings. I disagree. The fact that these sluggers were not invited to the HRD indicates that they were having a poorer first half than the players who WERE invited, and are much more likely to have a "better" second half by the same laws of regression. By better I just mean compared to their first half, not compared to the other pool of players.

If I am a very bad player who fluked his way into hitting 25 bombs in the first half of the season, and then I stub my toe during the all-star break and only hit 4 homers the rest of the season, it isn't very logical to conclude that my poorer second half was caused by my stubbed toe. It is much more likely caused by the fact that I am a very bad player who had a fluke run.

In a way, this is exactly what happened with Bobby Abreu. Abreu is, of course, not a very bad player, but he has never been considered a homerun hitter. He didn't hit a lot of homeruns in his career before that HRD and the HRD was a huge surprise. So the fact that he hasn't hit many homers since the HRD shouldn't come as a huge surprise.

Does all of this mean that the HRD has no effect? Of course not, but we have to analyze it far more carefully. What we need to do is compare the expected rate of hitting for players participating in the HRD and compare them to reality. How many homeruns should we expect out of Curtis Granderson in the second half, based on his career to date? And how does that projection compare to what actually happens? If player consistently under-perform these projections after the HRD then I can start to buy into it as a "curse."

Friday, August 21, 2009

40-HR Kings

Chris Jaffe over at the hardball times has been posting some cool lists lately. The theme: "when was the last time a franchise had a player who...."

He has tackled "won 20 games" and "had 200 hits." I, of course, wondered about home runs. So I am going to completely rip him off and post the list for 40 home run guys. So, for all 30 MLB franchises, when was the last time each one had a player hit 40 ding-dongs in a season?

I will cluster them into 3 sections...

Section 1: The Recent Years:


PHI Howard 2008
MIL Fielder 2007
CIN Dunn 2007
TBD Pena 2007
NYY A-Rod 2007
NYM Beltran 2007
HOU Berkman 2006
CHW Dye/Thome 2006
CLE Hafner 2006
ATL Andruw 2006
BOS Ortiz 2006
WSN Soriano 2006
STL Pujols 2006
CHC Lee 2005
TEX Teixeira 2005
LAD Beltre 2005


Some notes:

* Pujols is sitting on 39 right now. So St. Louis will soon be on top at 2009.

* Pena's season in 2007 was the first 40 HR season in Tampa. He could possibly make it again this year as he is at 31.

* Adam Dunn has hit exactly 40 homers four years in a row. With 31, he is basically on track to do it again this year. Rooting for him.

* Chicago, Texas, and LAD are going to drop off the 5-year window as no one is close.



Section 2: A Little While Ago:

SFG Bonds 2004
TOR Delgado 2003
LAA Glaus 2001
ARI LuGo 2001
COL Helton 2001
SDP Nevin 2001
OAK Giambi 2000
SEA A-Rod 2000


Some notes:

* Troy Glaus hit 47 dongs in 2000 and 41 in 2001. Nobody else has done it for the Angels. Their leader this year is Morales with 27. But, they're leading the AL in runs, so I doubt they care.

* Luis Gonzalez's season is the only one in Arizona history, but Mark Reynolds has 38 this year, so should make two.

* San Diego may also see their spell end, A-Gone is at 33.

* Coors, where art thou? Crazy that nobody has hit 40 homers for the Rockies in the past seven seasons, and this year will be 8 as their leader is Tulo with 23. Damn that humidor.



Section 3: Official Droughts:

BAL Raffy 1998
FLA Sheff 1996
DET Fielder 1991
PIT Stargell 1973
MIN Killer 1970
KCR No-one! ----


Some notes:

* Matt Wieters hasn't quite challenged this yet.

* Yep, Barry Bonds never hit 40 for Pittsburgh. He hit 46 the year after he left though, so there's that. Brian Giles hit 39 in 99, and 38 in 2002. Bay hit 35 in 2006.

* Kansas City and Pittsburgh you expect to see at the bottom of every list, but what's Minnesota doing there? But yep, known to prefer the speedy slap hitters, they live up to that reputation on this list. Roy Sievers hit 42 dingers for the team in 1957 (when they were in Washington, of course,) and Harmon Killebrew did it 8 times for the Twins, from 1959 to 1970. Since Killer, though? Nothing even close. Justin Morneau hit 34 in 2006 and has 28 this year. In fact, the Twins only have nine 30-HR seasons since 1970.

* And finally we have the Royals. In existence since 1969 and zero 40-HR seasons. Zilch. Nada. Their leader is Steve Balboni with 36 in 1985. I knew this, of course, from reading a lot of Joe Posnanski, but I still find it amazing. You'd think you could get one guy that could fluke his way to a 40 dinger year. Or hell, even a 39 or 38 homer year. That nobody has even been close in 40 years is astonishing to me. Mike Jacobs is their team leader this season. He has 16. Sixteen! In their history, they have ten 30 HR seasons, and only Danny Tartabull has managed that feat twice for the Royals. Boy, I hope Billy Butler works out for them . . .

Monday, December 22, 2008

Passwords

I want to try and bridge a gap here, so to speak. I feel as someone interested in education, I should attempt to educate in some ways by communicating the things I know and learn in a simple language, and try to make uninteresting things sound interesting by highlighting their importance and pertinence to the layman's life.

So let's talk about passwords.

I have two goals with this post:

1) To explain the technologies used by password-cracking software.
2) To demonstrate the insanely-high speed that modern computers can operate.

Part I) The Explanation

Everybody uses them, and many people have some understanding of how the work. A lot of work has been done trying to educate the public about how to have secure passwords, and yet a lot of people still choose very poor ones. I think that there is still a lack of understanding among the general public about how passwords get broken. It is my goal in this post to explicate password-cracking in an effort to enlighten the public and exterminate the risk of my friends getting hacked.

IT professionals and just general geeks like me know about these things and want to hear about hashes, salts and rainbow tables. If you do not have a clue what I am talking about, then this post is for you. I am going to ignore those for this post and focus on generalities.

Basically, there are two methods to cracking passwords:

1) Wordlist
2) Brute Force

While these techniques have been around for as long as I have, the big change over time has been computer speed. Moore's law dictates that computing power doubles every other year, which means that a password that might have taken 24 hours to break in 1998 can probably be broken now in about 45 minutes, without even changing the techniques used.

Method #1: Wordlists

Ever sat down at a friends computer and tried to guess their password? Maybe it's the name of their dog! No wait, the name of their cat! This is, in essence, the methodology used here. Of course, a computer can guess passwords a hell of a lot faster than you can. Essentially, the software is fed a "list" of words to try, and it tries them all, one by one, until it gets it right. This list might be a concise list of popular passwords, or it might be a list of every word in the dictionary.

Method #2: Brute Force

Ever thought about how long it would take to guess your PIN in case you forgot it? First you would try 0000, then 0001, then 0002, etc. and eventually you would get to your PIN. This is brute force. It literally means trying every combination possible until you get one right. Of course, if you were actually trying to guess a PIN, you wouldn't start at 0000 would you? Who has a PIN of 00XX? A smarter way would be to start in the middle, and work your way out. Modern day cracking software works in about the same fashion. It will try every possible combination, but it does so in an intelligent order to reduce the time it will likely take.

Part II) The Demonstration

The fact is, it might shock you to learn how fast these programs can work. It shocked me. One of the most popular, renowned, and oldest software-cracking tools available is called "John the Ripper." I downloaded John and put it to work on my own machine so that I could test it out a bit. I have read most of the documentation and obtained knowledge on how it functions, and I intend to explain that here, as best as I can.

I made a bunch of fake windows accounts to test some passwords. I made 3 to start, with what I figured were weak, medium, and strong passwords.

My weak password was "apple." I guessed this would take a couple seconds to break.

My medium password is a password we use at work that involves a word and one number.

My strong password is a password that I personally have used for about 8 years, that involves letters and numbers in a way that does not resemble any sort of word.

I sic'd John on these computers and barely even left the enter key when it responded with answers to the first two. John only gives granularity down to seconds, so I can't even say the exact speed it took other than it was not even one second. Literally, my "medium" password that we use at work took less than a second to crack. I was pretty amazed.

I was even more amazed when it cracked my "strong" password just 2 hours later.

It was at this point I had to learn more and started looking more deeply into how JtR operates. Luckily, it is a very open and well-documented software with a lot of options and configuration files. I was able to learn what I needed very quickly.

John has three different "modes." By default it tries all three in order, but it contains options to only try each one individually.

Mode 1: Single mode

Basically, this is where it uses a wordlist of just one word, but tries many variations of that word. It calls these variations "mangles." There are a LOT of mangles that it tries on that word, everything from "replace every o with a 0" to "try the word backwards" to "reverse the word and capitalize every other character." What is the word that it is trying? The username.

For example, suppose your username is "admin." this mode will try "admin," "nimda," "admin1," "adm1n," etc. and a whole lot more. A LOT more. So if you think you're being cute by using a backwards-username as a password, you're not. Any variation that you can think of, JtR probably also tries. So don't use any mangle of your username as your password.

Mode 2: Wordlist mode

This is probably the most efficient and effective. JtR comes with a list of 3,706 common passwords and tries them all one-by-one. It also has the option to try "mangles" as above. However, because it is trying thousands of words instead of just one, the default amount of mangles it tries is far less. With that said, JtR is extensible so any user can add more mangles to the list with ease.

Here is a link to the wordlist that JtR comes with:

WordList

The first thing it does is try every word on that list, a process which takes less than a second. After that, it goes through the list again, trying every word in lowercase. Then, every word in uppercase. All told, there are 26 different mangles that are tried by default. I will not list them all here for fear of tedium, but I will post a link.

Let me put it bluntly: If your "password" is a word in the dictionary, you are vulnerable. Adding a number (e.g. "apple1" instead of "apple") will not help. A computer can break it in seconds (or less.)

I wanted to really underscore the performance aspect involved here, so I did some testing. First of all, I built my own wordlist. Well, not really. I found a wordlist that is supplied for spell-checking programs like the one in MS word. This list is essentially every word in the dictionary. I then added another list which contains a bunch of popular abbreviations. Then another list that contains a bunch of "slang" words and such and popular misspellings. All told, my list had 86,542 words in it to try.

I ran JtR using the custom wordlist (no mangling), and it completed in less than a second. Slightly annoyed, I tried to slow it down by enabling the mangling options. Remember, it is trying every word in a list that is 86,542 words long, then going back to the top and trying every word backwards, and so on. There are 26 mangle options which means there are 86,542 X 26 = 2,250,092 different passwords to try.

How long did JtR take to try over 2 million different passwords? One second. One measly second. Barely detectable, really. And this is while I was running another JtR session in the background, along with about 40 other things (bitTorrent, MSN, Avast!, Skype, etc.) And this is on my computer, which is hardly a supercomputer.

However, none of the 2M+ passwords matched the new ones I had put into my password file, so I moved on to the next technique...

Mode 3: Incremental mode

This essentially means Brute Force. This mode absolutely guarantees that it will crack any password ... eventually. Depending on the length of the password and size of the pool it could take seconds or years.

It's all about probability. If you have a password that must be exactly 8 characters, and you are limited to only using letters, and case doesn't matter (m = M,) then there are 268 = 208,827,064,576 different combinations.

Like all numbers, this one requires context. Two-hunded-and-eight billion is sure a large number. Far bigger than the measly 2 million I tried in my previous demonstration. There are, however, some things to keep in mind:

1) It will not be necessary to try all combinations, only the amount it takes until it gets the right one.

2) The password cracker will not, likely, do "brute force" as we would intrinsically think of it. That is, it has an intelligent order with which it tries passwords. This will decrease the time it takes to get your password.

3) Computers are fast. To test, I configured my computer to try every alpha password up to 7 chars in length, case insensitive. The amount of passwords to try here are:

267 + 266 + ... + 261 = 8,353,082,582. It ran through this exercise in about 50 minutes, or 3,000 seconds. This means it was able to try around 2.7 million passwords per second.

Do the math. With 208 billion possible combinations and 2.7 million combinations tried every second, my computer can test every possible combination of an 8-digit letters-only password in about 21 hours. Chances are, it will get to yours overnight.

4) Computers are always getting faster. Now, my computer can get every combination in 21 hours. In two years, it will be able to do it in 10 hours. By 2014, It will be able to do it in 2 hours.

There was a time where 8-digit passwords were sufficient. When JtR first came out, it was maxed at about one thousand combos per second. But that ship has sailed, and it seems like too many consumers are left on the beach.

Part III) The Advice

So how do you improve your password?

1) Length. The beauty of exponents is that adding one digit to your password can make a world of difference. By extending your password to 9 characters instead of 8, you would add 5,429,503,678,976 (5.4 trillion) more combinations! I mentioned before that my computer can try all combinations of an 8-character letters-only password in 21 hours. Guess what? To try all combinations of a 9-character letters-only password would take 23 days. Ten characters? Six-hundred-and-five days. Hooray for exponents!

The problem, of course, is that people find longer passwords harder to remember.

Well, a big problem is that people still use the word "password." "Password," really, is a bit of a misnomer. There is no reason your password needs to be a word. It can be a phrase. A good one to use would be a song lyric or even a title. Is it really hard to remember/type "StairwayToHeaven" ? Such a password is not all that secure, but it's 10 times better than "heaven" or any other word. Since this password is 16 chars long, a brute force attack would mean trying 2616 different combinations to get all of them, and 5216 combinations if it's case sensitive.

5216 = 2,857,942,574,656,970,690,381,479,936.

Of course, it will be able to crack it far faster, because of other factors that I won't get into to keep this post simple. Which means you need...

2) Numbers. But don't put "1" at the end. Suppose your favorite number is 9 and song is Stairway. Why not: "Stairway9To9Heaven"? Is that hard to remember? This would leave your password impervius to the wordlist attack and now a Brute Force attack would have to be programmed to run 6218 different combinations.

3) Symbols. Much like numbers, you can usually put symbols into words just as easily. What I usually do is think of a number and then use whatever symbols correspond to that number. In the above example we used the number 9. If you SHIFT+9, you get "(." So our password becomes: "Stairway(To(Heaven." Now we're talking about 8516 different combinations.

So that's my primer, so to speak, on passwords. Please choose your passwords carefully, and keep the things you learned today in mind!

Thursday, May 29, 2008

Leadoff Walks

During last nights Jays/A's game, BJ Ryan came into the ninth inning to close out the one-run lead, and immediately issued a lead-off walk to Mark Ellis. I hate leadoff walks, and so does Rance Muliniks. In fact, Rance remarked (paraphrasing):

"I would estimate that the leadoff walk comes around to score 60-65% of the time."

I generally like Muliniks, but whenever a broadcaster makes a claim like this, something triggers in my brain. Rance went on to give reasons why he thinks walks come around to score so often, but I wasn't listening.

In The Book, the authors present a table of run values for all the batting events by base/out state. The unintentional walk, with the bases empty and no outs, has a run value of 0.41, so that doesn't help Rance's case. Not ever someone to take someone on their word, I ran the numbers myself, using data from 1998-2007. Here's what I found:


BBs R %
-------------------------
27929 10797 .3865


So in 27,929 leadoff walks, the runner came around to score about 39% of the time. Pretty close to the value from the book, and not nearly as high as 65%.

But that's not really the issue. The assertion, see, is that leadoff walks are "special." That is, the runner is going to score more often after being walked than he would had he singled (or reached first some other way.) That's easy to check.

Let's see what the numbers are for leadoff singles:

1Bs R %
-------------------------
62645 24031 .3836


Oh. 38%. Basically the same. Drat.

Sorry, Rancey baby.