I searched Google for "website baker media" and found a lot of results linking to "WebsiteBaker Administration - Media-", with URLs that point to "www.domainname.com/wb/admin/media/basic_header.html".
I feel a bit uncomfortable about it somehow. Can it be dangerous?
You can always use a robots.txt to keep crawlers out of those folders.
cheers
Klaus
There are lots of WB sites out there.
Every site has the /admin/media/basic_header.html file (and all the other non-.php files in the admin area).
Anybody can download it and look at the source code to find out what it does.
It should not be dangerous.
Typically an HTML page will not do much more than display data.
PHP pages in the admin area (or the modules area) should all have a bit of code that prevents them from running outside the WB framework (and its security).
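For illustration, such a guard is usually just a few lines at the top of each file. A minimal sketch, assuming the framework defines a constant such as WB_PATH when it loads the page (the exact name may differ between WB versions):

<?php
// Refuse to run unless the WB framework (and its security layer) loaded us first.
// WB_PATH is an assumption here; check your WB version for the exact constant.
if (!defined('WB_PATH')) {
    header('HTTP/1.1 403 Forbidden');
    exit('Direct access is not allowed.');
}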
Although I can imagine that using HTML templates in the admin area, combined with .htaccess modifications that let HTML files run PHP code, could be dangerous.
Ruud
Although I understand your explanation, it still feels weird to have loose admin header files showing up in Google. I'll set up a robots.txt, but maybe an updated WB version should simply prevent this possibility.
I agree on the last part.
It should not be that difficult to prevent (using a robots.txt, or .php templates instead of .html).
BTW: I wonder how Google found those pages.
There are no links pointing to those files, and Google doesn't look for pages without links pointing to them.
I think the only way that could happen is when a WB tree is installed on a server with PHP disabled and directory browsing enabled.
While writing this message, I found one that had a google_sitemap.xml document (generated with some external generator) that included ALL admin .html, .js, .gif, .png files, etc.
This is the opposite of a robots.txt: it's asking Google to "please index my hidden stuff".
(Do a Google search like this: "/media/basic_header.html google_sitemap.xml" and you will see what I mean.)
Conclusion: don't worry. Your pages will not be indexed by Google unless you ask it to.
Ruud
Hi,
Correct me if I am wrong, but I am sure that a robots.txt is a very insecure solution. Search engines can respect a robots.txt, but they don't have to. And reading robots.txt files could be a help and an inspiration for the bad guys. :-D
And what if a bad guy knows the structure of WB (or any other CMS)? He only has to find a website built with this CMS and he is able to see the admin files. Whether the files are in Google's index or not doesn't matter.
The very simple and only safe solution is password protection of the admin area via .htaccess. Then nobody can see any file in any subfolder, and the admin functions are locked away safely.
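For illustration, a minimal sketch of such a setup, assuming Apache and a password file that already exists (the paths and names here are placeholders):

# .htaccess inside the admin folder
AuthType Basic
AuthName "WB Admin Area"
AuthUserFile /path/to/.htpasswd
Require valid-user

With this in place, Apache asks for a username and password before it serves anything from that folder, including plain .html and image files.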
So why don't you use this? Am I completely wrong, or thinking much too simply?
Regards
Uwe
The serious search engines will respect robots.txt. But they are not your worry.
The bad guys will use your robots.txt for finding places where you don't want the search engines to go.
Scanning for /admin will (on most WB sites) tell you directly that you are dealing with a WB site.
You can be sure the admin structure is already known to anyone serious about hacking.
Protecting the admin area with a .htaccess is something you could do, but then you will need to log on twice every time you want to do something there. The admin area is already protected at the PHP level.
There is nothing wrong with the HTML templates or images being accessible; they will not do anything anyway.
Personally I am a pretty paranoid guy, but as long as no vulnerabilities are popping up on sites like http://secunia.com/advisories/23828/ I don't worry too much.
As I explained in my previous post, pages in the admin area are not indexed by Google at all, unless you take the effort of asking Google to do so.
Ruud
In some cases the Apache "Index of" pages might cause the problem.
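For illustration, on Apache those listings can usually be switched off with a one-line .htaccess entry, assuming the server allows overriding Options:

# .htaccess: turn off "Index of" directory listings
Options -Indexes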
I always disallow "modules" and "admin" in robots.txt. Of course I know a "bad guy" can also read it, but it's more dangerous when a bad guy can simply use a Google search to find security holes.
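For example, such a robots.txt could look like this (assuming WB sits in the document root; prefix the paths with the subfolder otherwise):

User-agent: *
Disallow: /admin/
Disallow: /modules/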
Guys,
I would actually disallow everything to everybody and then, in a second step, allow the pages directory and, if wanted, selected other directories. This way you do not give away your whole folder structure.
User-agent: *
Disallow: /
Then you open the pages directory, since you want to have the pages crawled
Allow: /pages/
Secure it the way host systems used to be secured ... close everything, then open only the folders required for indexing.
cheers
Klaus
Quote:
User-agent: *
Disallow: /
Allow: /pages/
Whoa - and what about the homepage?
If you do this, your whole site isn't crawled at all if you have no deep links.
That is what we want to achieve ... the crawling allow list is in part 2 ...
Allow: /pages/
cheers
Klaus
www.domain.de/ (index.php) is NOT in /pages/
it is in /
Yes, you're right ... but then why don't you simply add the extra line:
Allow: /index.php
That might have been helpful ...
Regards
Klaus
Because the start page is not called index.php (that's why it was in parentheses),
but: /
If you block the root directory via robots.txt, the crawler never even gets onto the domain, no matter what is open below it. Where is it supposed to start?
Well Chio, I'm not quite sure where your knowledge comes from, or how old it is ...
The rule block I set up does the following:
User-agent: * # applies to ALL user agents
Disallow: / # to start with, nothing at all may be crawled, except:
Allow: /index.php # the index.php page
Allow: /pages/ # the pages folder
Maybe have a read here too: http://janeandrobot.com/post/Managing-Robots-Access-To-Your-Website.aspx
The good crawlers have one trait ... they read these rules and stick to them, all of them.
Regards
Klaus
OK,
What is the URL of the start page?
1) www.domain.de
or
2) www.domain.de/index.php
?
The correct answer is: 1)
A normal WB installation never links to www.domain.de/index.php, but always to www.domain.de.
www.domain.de == /
That is also where the bot starts (necessarily; without deep links it knows nothing else).
But now it is not allowed to, because / is blocked.
So it never even gets interested in /index.php, let alone /pages/; how would it know those pages exist?
Sure, you could say: even though the start page is blocked via robots.txt, the spider is allowed to crawl it anyway, and then it would find the links - and then it would also find the pages.
But I wonder whether Google actually says so.
And whether the source you quoted does, too:
http://janeandrobot.com/robots.txt
The easiest way is to add .htaccess password protection to /admin.
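For illustration, assuming Apache's htpasswd tool is available, the password file referenced by the AuthUserFile directive sketched earlier in the thread could be created like this (path and username are placeholders):

htpasswd -c /path/to/.htpasswd adminuser

The -c flag creates the file; omit it when adding further users to an existing file.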